Cloudflare Experiences Major Outage Due to Oversized Configuration File

Extended summary

Published: 21.11.2025

Introduction

A significant recent outage at Cloudflare, a leading web infrastructure and security company, has raised concerns about the reliability of its services. The incident, which the company described as its worst outage since 2019, was triggered by a configuration file that unexpectedly grew beyond its permitted size, causing widespread disruption across the network. This summary covers the details of the incident, its underlying causes, and the company's plans to prevent a recurrence.

Understanding the Outage

The outage was caused by a configuration file that exceeded an operational limit in Cloudflare's core proxy service. Cloudflare's bot management system caps the number of machine learning features that can be in use at any given time at 200, but the problematic file listed more features than the cap allows, causing the proxy to fail. As a result, the network experienced a surge in HTTP 5xx status codes, server errors that are typically rare in Cloudflare's operations.
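To make the failure mode concrete, the following is a minimal Rust sketch of a loader that enforces a fixed feature limit and fails hard when a file exceeds it. The names, file format, and error handling are illustrative assumptions rather than Cloudflare's actual code; only the 200-feature limit comes from the incident report.

```rust
// Illustrative sketch only: names and file format are assumptions.
// The 200-feature cap is the one detail taken from the incident report.
const MAX_FEATURES: usize = 200;

/// Parse a newline-delimited feature list, refusing anything over the cap.
fn load_features(raw: &str) -> Result<Vec<&str>, String> {
    let features: Vec<&str> = raw.lines().filter(|l| !l.is_empty()).collect();
    if features.len() > MAX_FEATURES {
        return Err(format!(
            "{} features exceeds the preallocated limit of {}",
            features.len(),
            MAX_FEATURES
        ));
    }
    Ok(features)
}

fn main() {
    // A config file that has grown well past the cap.
    let oversized = "some_feature\n".repeat(300);
    // Unwrapping the error aborts the process; in a proxy serving live
    // traffic, a crash like this surfaces to clients as 5xx responses.
    let features = load_features(&oversized).unwrap();
    println!("loaded {} features", features.len());
}
```

Running this panics at the `unwrap`, which illustrates how an oversized internal file can translate directly into failed requests.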

Technical Breakdown of the Issue

The faulty configuration file was produced by a query that runs every five minutes against a ClickHouse database cluster. That query's output changed as part of an ongoing effort to improve permissions management within the system. Because the change was still rolling out across the cluster, each five-minute cycle had roughly a 50% chance of producing either a correct or an oversized configuration file, depending on which nodes handled the query. The resulting fluctuation in error rates initially misled the team into suspecting a cyberattack. It soon became clear, however, that the problem stemmed from the change propagating to all nodes in the ClickHouse cluster, at which point the faulty file was produced on every cycle.
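The fluctuating pattern can be illustrated with a small, self-contained simulation. The node behavior, file sizes, and names below are assumptions made for demonstration and are not drawn from Cloudflare's systems.

```rust
// Toy simulation of the alternating good/bad configuration output.
// All behavior and numbers below are illustrative assumptions.
const MAX_FEATURES: usize = 200;

/// Each generation cycle, the query lands on one node of the cluster.
/// Nodes that had already picked up the permissions change emitted an
/// oversized file; nodes that had not still emitted a valid one.
fn generate_config(node_has_new_permissions: bool) -> Vec<String> {
    let rows = if node_has_new_permissions {
        2 * MAX_FEATURES // oversized output, past the proxy's cap
    } else {
        MAX_FEATURES / 2 // normal-sized feature file
    };
    (0..rows).map(|i| format!("feature_{i}")).collect()
}

fn main() {
    // Alternate which kind of node serves the query, as a stand-in for
    // the roughly 50/50 mix of updated and not-yet-updated nodes.
    for cycle in 0..6 {
        let updated_node = cycle % 2 == 0;
        let config = generate_config(updated_node);
        let status = if config.len() > MAX_FEATURES {
            "proxy fails -> 5xx errors"
        } else {
            "proxy healthy"
        };
        println!("cycle {cycle}: {} features, {status}", config.len());
    }
}
```

The output flips between healthy and failing on each cycle, the kind of oscillation that can look more like an intermittent attack than a bad configuration push.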

Resolution and Recovery

To mitigate the situation, Cloudflare's team first halted the generation and distribution of the erroneous configuration file. They then manually inserted a known-good file into the distribution queue and restarted the core proxy to restore normal operation. With the proxy recovered, the team rebooted the other affected services, and the volume of 5xx errors gradually returned to normal levels by the end of the day.

Future Preventive Measures

In the aftermath of the outage, Cloudflare has committed to several measures intended to improve system resilience and prevent similar failures. These include hardening the ingestion of configuration files generated by Cloudflare itself so that they are validated as strictly as user-generated input. The company also plans to introduce more global fail-safes, to ensure that core dumps and error reports cannot overwhelm system resources, and to review failure modes across all core proxy modules.
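As a rough illustration of what validating internal configuration like user input might look like, here is a Rust sketch that size-checks, parses, and count-checks a file before accepting it, and falls back to the last known-good configuration when validation fails. The specific checks, limits (apart from the 200-feature cap), and names are assumptions, not Cloudflare's design.

```rust
// Sketch of treating an internally generated config file as untrusted
// input. Checks, names, and the fallback policy are assumptions; only
// the 200-feature cap comes from the incident report.
const MAX_FEATURES: usize = 200;
const MAX_FILE_BYTES: usize = 1 << 20; // assumed 1 MiB size budget

struct BotConfig {
    features: Vec<String>,
}

fn validate(raw: &[u8]) -> Result<BotConfig, String> {
    // Reject oversized files before doing any parsing work.
    if raw.len() > MAX_FILE_BYTES {
        return Err(format!("file is {} bytes, over the size budget", raw.len()));
    }
    let text = std::str::from_utf8(raw).map_err(|e| e.to_string())?;
    let features: Vec<String> = text
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(str::to_owned)
        .collect();
    if features.is_empty() {
        return Err("config contains no features".into());
    }
    if features.len() > MAX_FEATURES {
        return Err(format!(
            "{} features exceeds the cap of {}",
            features.len(),
            MAX_FEATURES
        ));
    }
    Ok(BotConfig { features })
}

/// Global fail-safe: if the new file is bad, keep serving the last good one.
fn ingest(raw: &[u8], current: &mut Option<BotConfig>) {
    match validate(raw) {
        Ok(cfg) => *current = Some(cfg),
        Err(why) => eprintln!("rejected new config ({why}); keeping last known good"),
    }
}

fn main() {
    let mut active: Option<BotConfig> = None;
    ingest(b"f1\nf2\n", &mut active); // valid file is installed
    ingest("x\n".repeat(2 * MAX_FEATURES).as_bytes(), &mut active); // oversized file is rejected
    println!(
        "active features: {}",
        active.map(|c| c.features.len()).unwrap_or(0)
    );
}
```

The key design choice is that a rejected file never replaces a working one, so a bad generation cycle degrades to stale-but-valid configuration instead of an outage.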

Conclusion

This outage serves as a reminder of the complexity involved in maintaining robust digital infrastructure. While Cloudflare's response was swift and effective in restoring service, the incident highlights the need for ongoing vigilance and improvement in system design. The company's work to strengthen its operational resilience also reflects a broader trend in the tech industry: systems must continuously evolve to mitigate risk and remain reliable in the face of unexpected failures.

Source: Ars Technica
