A network configuration issue has triggered a massive Cloudflare outage.
According to Cloudflare, a change that was meant to improve network resilience instead caused a large outage today, affecting more than a dozen of its data centres and hundreds of major online platforms and services.
After analysing the event, Cloudflare stated, “Today, June 21, 2022, Cloudflare had an outage that disrupted traffic in 19 of our data centres.”
“Regrettably, these 19 sites are responsible for a considerable amount of our global traffic. This disruption was triggered by a modification that was implemented as part of a long-term strategy to improve resilience in our busiest sites.”
According to user reports, the affected websites and services include Amazon, Twitch, Amazon Web Services, Steam, Coinbase, Telegram, Discord, DoorDash, GitLab, and more.
Cloudflare’s busiest locations were affected by the outage.
Following reports of disruption to Cloudflare’s network from customers and users around the world, the company began investigating at roughly 06:34 UTC.
“Customers trying to access Cloudflare sites in the afflicted areas will receive 500 errors. All data plane services in our network are affected by the event,” according to Cloudflare.
While the incident report on Cloudflare’s system status page gave no details on what caused the outage, the company provided further information about the June 21 outage on its official blog.
The Cloudflare team stated, “This interruption was triggered by a change that was part of a long-running endeavour to boost resilience in our busiest areas.”
“An outage began at 06:27 UTC due to a change in network configuration in specific sites. The first data centre was brought back up at 06:58 UTC, and by 07:42 UTC, all data centres were up and running.
“Depending on where you are in the world, you may have been unable to access Cloudflare-powered websites and services. Cloudflare continues to function correctly in other regions.”
Although the affected locations account for only about 4% of Cloudflare’s total network, the outage affected nearly 50% of all HTTP requests served by Cloudflare globally.
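The gap between those two percentages shows how concentrated Cloudflare's traffic is. The short Python sketch below works through the arithmetic. It uses only the two figures quoted above, and it assumes that "4% of the network" can be read as a share of locations, which the source does not spell out. The function and variable names are illustrative.

```python
# Figures quoted in the article. Reading "4% of the network" as a share of
# locations is an assumption made for this illustration.
affected_network_share = 0.04
affected_request_share = 0.50


def concentration(network_share: float, request_share: float) -> float:
    """Request share per unit of network share (1.0 means a uniform spread)."""
    return request_share / network_share


affected = concentration(affected_network_share, affected_request_share)
rest = concentration(1 - affected_network_share, 1 - affected_request_share)

print(f"Affected sites: {affected:.1f}x their proportional share of requests")
print(f"Other sites:    {rest:.2f}x their proportional share of requests")
print(f"Ratio:          {affected / rest:.0f}x")
```

Under that assumption, the affected locations carry about 12.5 times their proportional share of requests, or roughly 24 times the average load of the remaining locations.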
The update that caused today’s outage was part of a bigger initiative to convert data centres in Cloudflare’s busiest locations to a more resilient and flexible architecture, dubbed Multi-Colo PoP (MCP) by the company.
The 19 data centres hit by today’s event are Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, São Paulo, San Jose, Singapore, Sydney, and Tokyo.
Timeline for the outage:
03:56: The modification is delivered to our first site. None of our sites are affected by the change, because these sites are still using our earlier architecture.
06:17: The modification has been implemented in our busiest sites, but not in the MCP architecture locations.
06:27: The modification has been delivered to our spines, and the rollout has reached MCP-enabled areas. This is when the problem began, as these 19 locations were quickly taken offline.
06:32: An internal Cloudflare incident is declared.
06:51: The first change is made on a router to verify the underlying problem.
06:58: The root cause has been identified and understood. Work on reversing the faulty alteration begins.
07:42: The last of the reverts is finished. Along the way, the problem resurfaced intermittently as network engineers overwrote each other’s changes, undoing earlier reverts.
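The timeline above can be turned into elapsed times with a few lines of Python. This is a minimal sketch that uses only the timestamps listed in the article. The dictionary keys and the `minutes_between` helper are names chosen for illustration, not anything published by Cloudflare.

```python
from datetime import datetime

# Times in UTC on 21 June 2022, as listed in the timeline.
# The key names are illustrative labels.
timeline = {
    "first_site_change": "03:56",
    "busiest_sites_change": "06:17",
    "outage_start": "06:27",
    "incident_declared": "06:32",
    "root_cause_found": "06:58",
    "last_revert_done": "07:42",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


outage_start = timeline["outage_start"]
time_to_declare = minutes_between(outage_start, timeline["incident_declared"])
time_to_root_cause = minutes_between(outage_start, timeline["root_cause_found"])
total_outage = minutes_between(outage_start, timeline["last_revert_done"])

print(f"Incident declared after {time_to_declare} min")       # 5 min
print(f"Root cause identified after {time_to_root_cause} min")  # 31 min
print(f"All reverts finished after {total_outage} min")        # 75 min
```

On these figures, the incident was declared 5 minutes after the outage began, the root cause was identified after 31 minutes, and the final revert landed 75 minutes in.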