Outages may be a fact of life on the internet, but the last few days have been particularly tough for some of us. Last Tuesday, Google Calendar was down for hours. Later in the week, Netflix, Hulu and Xbox Live had problems. (Hulu on the night of The Handmaid's Tale, no less!) And on Monday morning, users in the northeastern US were hit by a widespread web crash that took down websites for Verizon users and thousands of sites served by Cloudflare, a little-known backbone of the internet that provides security and performance services to 16 million websites. Even Downdetector, a website that tracks whether other sites are up, temporarily went down because of the problem.
The outage was a sudden reminder that the internet is a fragile place where one small mistake, in this case from a small Pennsylvania-based company, can cause swaths of the web to break with little warning. In Monday's case, it was because the internet's map broke.
Around 7 a.m., the outage began to hit Verizon and then spread to parts of Amazon Web Services (another important piece of behind-the-scenes internet infrastructure), Reddit, the podcast app Overcast, the popular chat service Discord, the e-commerce provider Sonassi, the live streaming platform Twitch and the web hosting provider WP Engine.
Many of the affected sites were served by Cloudflare, so the company took the blame on Monday morning. People were not sure whether the Verizon outage and the Cloudflare outage were connected. While it's true that about 10 percent of the 16 million sites served by Cloudflare were affected, a huge chunk of the internet, Cloudflare had little to do with the problem, according to its chief technology officer. That's because traffic bound for Cloudflare never reached it.
The internet relies on the Border Gateway Protocol, or BGP, which is basically a routing map or, as some call it, the USPS of the web. It takes internet traffic and data and chooses the most efficient route to get that traffic to another location on the internet (like you). This works very well in most cases, but something went wrong on Monday. According to Cloudflare Chief Technology Officer John Graham-Cumming, the cause was a mistaken signal from DQE Communications, a small commercial internet service provider serving around 2,000 buildings in Pittsburgh, Pennsylvania. "This small company said, 'These 2,400 networks, including some parts of Cloudflare, some parts of Amazon, some parts of Google and Facebook, whole parts of the internet.' They said, 'These networks are ours. You can send us your traffic,'" Graham-Cumming said. DQE confirmed that the problem occurred within its network and said that it worked quickly to resolve it. "We immediately investigated the problem and adjusted our routing policies," a company spokesman said in a statement.
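To make the "routing map" idea concrete, here is a minimal sketch of one ingredient of route selection: preferring the announcement that crosses the fewest networks. It is a deliberate simplification, since real BGP routers weigh several other attributes and local policies first. The `Announcement` type, the `best_route` helper, and the prefix and network numbers (taken from ranges reserved for documentation) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Announcement:
    prefix: str      # destination network being advertised
    as_path: tuple   # networks (AS numbers) the traffic would cross


def best_route(announcements, prefix):
    """Toy rule: among announcements for `prefix`, prefer the shortest AS path.

    Assumption: real BGP applies local preference and other tie-breakers
    before path length; this sketch ignores them.
    """
    candidates = [a for a in announcements if a.prefix == prefix]
    if not candidates:
        return None
    return min(candidates, key=lambda a: len(a.as_path))


# Two hypothetical announcements for the same documentation prefix.
routes = [
    Announcement("203.0.113.0/24", (64500, 64501, 64502)),
    Announcement("203.0.113.0/24", (64500, 64510)),
]

chosen = best_route(routes, "203.0.113.0/24")
print(chosen.as_path)  # the two-hop path wins under this toy rule
```

Note that nothing in this rule checks whether the network at the end of the path actually owns the destination, which is the weakness the rest of the story turns on.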
This misconfiguration was probably caused by automatic route-optimization software rather than by someone intentionally messing up the routes, according to Andree Toonk of BGPMon, a company that monitors network routes and security. The effect was the same, however: as soon as the new routes were announced and the company falsely claimed that it could handle all that traffic, they spread, through a so-called "route leak," to Verizon, which apparently accepted the faulty routes and then passed them along. As a result, a huge chunk of internet traffic, including traffic to key destinations like Facebook and Cloudflare, fell off a cliff. Basically, it was as if the internet's navigation system had told thousands of drivers to drive straight into a ravine.
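Why would a faulty route beat a correct one? The article does not spell out the mechanism, so the following is only an illustration of one common way it can happen: routers forward traffic along the most specific matching route, so an unfiltered announcement for a narrower slice of someone else's network takes over that slice. The table contents, the `next_hop` helper and the network labels are hypothetical, and the addresses come from a range reserved for documentation.

```python
import ipaddress


def next_hop(table, destination):
    """Return the hop for the most specific prefix containing `destination`."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in table.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # Longest-prefix match: the largest prefix length is the most specific.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Before the leak: one legitimate route covers the whole /24.
table = {"198.51.100.0/24": "legitimate-origin"}
before = next_hop(table, "198.51.100.7")

# A faulty, more specific route is accepted without any filtering.
table["198.51.100.0/25"] = "leaking-network"
after = next_hop(table, "198.51.100.7")

print(before, "->", after)
```

Addresses outside the leaked /25, such as 198.51.100.200, would still follow the legitimate route in this sketch, which is one reason such an outage can look patchy.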
The problem lasted only a few hours. By 10 a.m., most of the downed services were back online. However, Graham-Cumming said the episode points to a bigger problem: large ISPs do not have the necessary safeguards in place to block and filter false routes before they spread across the internet. "It starts with a small company doing something wrong," said Graham-Cumming. "The really big problem is that Verizon, as a big company, could have actually said, 'This doesn't look right, and we won't pass it on.' But they didn't do that. They let it go out into the wide world, which affected a large number of people."
It's pretty much a miracle that this doesn't happen more often. Cloudflare's CEO blasted Verizon on Twitter over the problem. Verizon has been tight-lipped about what exactly happened. The company said in a statement to Slate: "For some customers, there was a temporary interruption of internet service this morning. Our engineers fixed the problem by 9:00 ET. We are currently investigating the problem." Cloudflare's CEO and CTO also criticized the company for not responding on Monday morning when they asked about the problem.
The real problem, however, could be the entire BGP internet routing system, which basically runs on the honor system. "There are ways we can get away from this very trust-based system by using cryptography," Graham-Cumming said. "That way, you have to prove that you're the owner of a network." Graham-Cumming said that adopting such a technology, called RPKI, is the way to avoid similar hiccups in the future, hiccups in which a tiny company claims to own parts of Cloudflare, Amazon and Facebook, and Verizon's systems don't question it at all. With RPKI, networks can better filter bad BGP routes, because it requires that a route be announced only by a network that has the capacity and the right to announce it. As Cloudflare pointed out in a blog post this afternoon, had Verizon been using RPKI, it would have found that the routes announced by DQE were not valid, and they would have been dropped automatically. AT&T and several other providers have already enabled RPKI frameworks. The technology would not only stop errors like this one, or faulty automated software that announces erroneous routes. It would also prevent pranksters and troublemakers from maliciously announcing erroneous routes.
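The check that RPKI enables can be sketched in a few lines. In this simplified model, a signed record (a route origin authorization, or ROA) states which network may announce a prefix and how specific the announcement may be, and each incoming route is classified as valid, invalid or not found. The cryptographic signature checking is omitted entirely, and the `ROA` type, the `validate` helper and all prefixes and network numbers are hypothetical values from documentation ranges.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class ROA:
    prefix: str      # address block the record covers
    max_length: int  # most specific prefix length allowed
    origin_as: int   # network authorized to announce it


def validate(roas, prefix, origin_as):
    """Classify an announcement as 'valid', 'invalid' or 'not-found'."""
    announced = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if announced.version != roa_net.version:
            continue  # never compare IPv4 against IPv6
        if not announced.subnet_of(roa_net):
            continue  # this record says nothing about the prefix
        covered = True
        if announced.prefixlen <= roa.max_length and origin_as == roa.origin_as:
            return "valid"
    # Covered by a record but matching none of them: reject.
    return "invalid" if covered else "not-found"


# Hypothetical record: only AS 64500 may announce this /24, and nothing narrower.
roas = [ROA("198.51.100.0/24", 24, 64500)]

print(validate(roas, "198.51.100.0/24", 64500))  # rightful owner
print(validate(roas, "198.51.100.0/25", 64510))  # someone else's narrower claim
print(validate(roas, "203.0.113.0/24", 64510))   # no record covers this block
```

A network that drops every route classified as invalid would discard the second announcement instead of passing it along. Routes with no covering record are typically still accepted, which is why the benefit depends on how widely networks publish these records.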
That would make us all feel better. Nobody appreciates a Reddit outage on a Monday morning.