Biggest Cloudflare Outages, Causes, and How to Survive

Because Cloudflare sits in the critical path between users and services, outages can have sweeping consequences. But what causes them and how can you stay up?

Understanding Cloudflare and Its Central Role in the Web Ecosystem 

Cloudflare operates one of the largest content delivery networks (CDNs) and edge computing platforms in the world. It acts as a reverse proxy for millions of websites, optimizing performance, enhancing security, and protecting against distributed denial-of-service (DDoS) attacks. Its infrastructure spans over 300 cities globally, with data centers designed to route, filter, and cache traffic before it reaches origin servers. This setup reduces latency, absorbs malicious traffic, and ensures content is delivered reliably at scale.

Because Cloudflare sits in the critical path between users and services, outages can have sweeping consequences. Its services go beyond simple CDN functionality: Cloudflare also handles DNS resolution, bot detection, zero trust networking, and edge compute functions. 

Many organizations embed these services directly into their infrastructure stack, meaning that when Cloudflare fails, it can take down not just web content, but entire business operations. Its central role makes understanding its outages essential for any team building resilient internet-facing systems.

This is part of a series of articles about disaster recovery in the cloud.


Major Cloudflare Outages: Timeline and Lessons Learned 

Cloudflare Outages in 2025

November 18, 2025: Bot management system causes worst outage in years

On November 18, Cloudflare suffered its most severe outage since 2019. The problem originated in the bot management system, where a change to ClickHouse database permissions altered the behavior of a query used to build a feature file for a machine learning model. The new query began returning duplicate rows, dramatically increasing the size of the generated file. Once deployed, the oversized file triggered a hard limit in the bot detection module, crashing Cloudflare’s core proxy software across its global network.

The outage began with global traffic fluctuating as different versions of the file propagated. Sites like ChatGPT, X (formerly Twitter), Spotify, Canva, and even Downdetector went offline, serving Cloudflare error pages. Engineers eventually halted the faulty file generation and distributed a known-good version. Services were largely restored after three hours, with full stabilization after six hours. Cloudflare later committed to improving internal validation mechanisms and adding stronger rollback tools to prevent similar large-scale failures.
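
The failure mode here, an automatically generated artifact growing past a hard limit and then being distributed everywhere at once, is exactly the kind of thing a pre-publish gate can catch. The sketch below is illustrative only: the JSON-lines format, the thresholds, and the publish step are assumptions, not Cloudflare's actual tooling. It simply refuses to publish a generated feature file whose size, row count, or duplicate ratio looks anomalous.

```python
import json
import sys

# Illustrative guardrails; real thresholds would come from historical baselines.
MAX_BYTES = 5 * 1024 * 1024      # refuse files larger than ~5 MB
MAX_ROWS = 200                   # refuse files with an implausible feature count
MAX_DUPLICATE_RATIO = 0.01       # refuse files that are mostly duplicate rows


def validate_feature_file(path: str) -> None:
    """Reject a generated feature file that looks anomalous before it is published."""
    with open(path, "rb") as f:
        raw = f.read()
    if len(raw) > MAX_BYTES:
        raise ValueError(f"file is {len(raw)} bytes, exceeds limit of {MAX_BYTES}")

    rows = [json.loads(line) for line in raw.splitlines() if line.strip()]
    if not rows:
        raise ValueError("generated file contains no rows")
    if len(rows) > MAX_ROWS:
        raise ValueError(f"{len(rows)} rows exceeds limit of {MAX_ROWS}")

    unique = {json.dumps(r, sort_keys=True) for r in rows}
    dup_ratio = 1 - len(unique) / len(rows)
    if dup_ratio > MAX_DUPLICATE_RATIO:
        raise ValueError(f"duplicate ratio {dup_ratio:.2%} suggests a bad upstream query")


if __name__ == "__main__":
    try:
        validate_feature_file(sys.argv[1])
    except ValueError as err:
        print(f"refusing to publish: {err}", file=sys.stderr)
        sys.exit(1)        # the pipeline blocks the rollout
    print("feature file passed validation; safe to publish")
```

A gate like this would not have fixed the bad query, but it would have stopped the oversized file at the door instead of letting it crash the proxy fleet.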

Lessons learned: 

  • Machine learning systems tied to infrastructure must include strict input validation to catch anomalies before deployment
  • Query logic changes should be independently tested with synthetic data and volume thresholds enforced before rollout
  • Failure containment mechanisms are needed to prevent a single misgenerated file from propagating across global infrastructure

  ✅ Tip: Having an independent disaster recovery environment, with cross-account or cross-cloud restore, ensures that even provider-level issues don’t take your services fully offline.

March 21, 2025: Use of deleted credentials results in read/write failures

A credential rotation error in Cloudflare’s R2 Gateway led to a global outage lasting 1 hour and 7 minutes. Cloudflare was rotating credentials for security purposes but mistakenly deployed the new credentials only to the default (development) environment. 

The production environment, still using the old credentials, began experiencing total write failures and partial read failures. This crippled services that relied on R2 for storage, including mission-critical infrastructure. The incident highlighted the importance of strict environment separation and validation when managing credential changes in production systems.
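
One way to make credential rotation safer is to treat new credentials as unproven until they pass a canary read and write in the environment you actually intended to update. Below is a minimal sketch of that idea using boto3 against an S3-compatible endpoint; the endpoint URL, canary bucket, and DEPLOY_ENV variable are placeholders, not part of any real Cloudflare or R2 workflow.

```python
import os

import boto3
from botocore.exceptions import ClientError

# Hypothetical names for illustration; substitute your own endpoint and bucket.
PRODUCTION_ENDPOINT = "https://storage.example.com"
CANARY_BUCKET = "credential-rotation-canary"


def verify_new_credentials(access_key: str, secret_key: str, expected_env: str) -> None:
    """Fail loudly unless the new credentials read AND write in the intended environment."""
    actual_env = os.environ.get("DEPLOY_ENV")
    if actual_env != expected_env:
        raise RuntimeError(f"deploy target is {actual_env!r}, expected {expected_env!r}")

    s3 = boto3.client(
        "s3",
        endpoint_url=PRODUCTION_ENDPOINT,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    try:
        # Canary write followed by a read-back proves the credentials are live here.
        s3.put_object(Bucket=CANARY_BUCKET, Key="rotation-canary", Body=b"ok")
        s3.get_object(Bucket=CANARY_BUCKET, Key="rotation-canary")
    except ClientError as err:
        raise RuntimeError(f"new credentials failed canary check: {err}") from err
```

Only after the canary check passes in production should the old credentials be revoked; rotating in the reverse order is what turns a routine change into an outage.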

Lessons learned:

  • Credentials should be rotated using automated, auditable pipelines that verify environment-specific deployment
  • Development and production environments must be strictly segregated, with enforced configuration checks
  • Credential rollouts require monitoring hooks to catch partial application and prevent invisible failures

  ✅ Tip: Organizations should test failover readiness with regular DR drills. Tools like N2W offer one-click DR tests without affecting production workloads.

Cloudflare Outage in 2024

September 16, 2024: Globally distributed servers experience network outage 

Cloudflare experienced a network-level outage affecting access to services such as Zoom and HubSpot. The incident lasted about 90 minutes. The outage caused reachability issues for applications that rely on Cloudflare’s CDN and networking infrastructure. 

Servers located in the United States, Europe, the UAE, and the Philippines were particularly impacted. Remote workers, marketers, and businesses relying on cloud-based tools saw interruptions in communication, sales funnels, and support systems. Although the incident was relatively short, it disrupted workflows for companies that depend on real-time online services.

Lessons learned: 

  • Global monitoring must include region-specific network health metrics to catch emerging latency or routing issues
  • CDN architectures need adaptive routing logic to bypass failing points of presence or backbone segments
  • Incident response plans should include customer communication workflows to manage service-level disruptions

Cloudflare Outages in 2023

November 2, 2023: Widespread network outage disrupts financial data 

Cloudflare suffered a brief but high-impact outage due to a network-wide issue. The outage lasted just 20 minutes but took down thousands of websites, including high-traffic platforms like Coindesk. 

The incident caused financial data disruptions: Coindesk’s site, for example, showed Bitcoin prices as $26 instead of $10,300. Though short in duration, the outage demonstrated how even small disruptions in availability or data accuracy can have outsized consequences for businesses and users alike.

Lessons learned: 

  • Even short outages in data delivery systems can result in misleading or corrupted business-critical information
  • Financial platforms need to validate cached data and include alerts for implausible metrics during degraded states
  • Time-sensitive services should architect for graceful degradation rather than complete shutdown under network stress

October 4, 2023: New resource type causes DNS failures

Cloudflare’s DNS resolver (1.1.1.1) and associated products like WARP and Zero Trust experienced intermittent DNS failures for about three hours. The issue stemmed from a new resource record type, ZONEMD, being added to the root zone on September 21. 

When the DNSSEC signatures on the root zone expired on October 4, Cloudflare’s resolvers were unable to validate them, causing them to return SERVFAIL errors. The failure revealed a lack of fallback or update handling in Cloudflare’s resolver logic, which made the incident more disruptive than necessary.
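
Teams that depend on a single public resolver can detect this class of failure early by probing more than one. Below is a minimal sketch, assuming the dnspython package; the resolver IPs and probe hostname are placeholders.

```python
# A minimal resolver health probe (pip install dnspython).
import dns.exception
import dns.resolver

RESOLVERS = {
    "cloudflare": "1.1.1.1",
    "fallback": "8.8.8.8",   # an independent resolver from a different operator
}
PROBE_NAME = "example.com"


def healthy(resolver_ip: str) -> bool:
    """Return True if the resolver answers a simple A query within two seconds."""
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [resolver_ip]
    r.lifetime = 2.0
    try:
        r.resolve(PROBE_NAME, "A")
        return True
    except dns.exception.DNSException:   # SERVFAIL, timeout, etc.
        return False


if __name__ == "__main__":
    status = {name: healthy(ip) for name, ip in RESOLVERS.items()}
    print(status)
    if not status["cloudflare"] and status["fallback"]:
        print("1.1.1.1 appears degraded; consider switching clients to the fallback resolver")
```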

Lessons learned: 

  • DNS resolvers must account for changes in root zone configurations, including new record types and signature formats
  • Lack of update pathways or fallback logic in resolver software can cause outsized disruption from minor upstream changes
  • Continuous compatibility testing against root zone snapshots is essential for DNS infrastructure resilience

Cloudflare Outage in 2022

June 21, 2022: Critical failure affects global services

Cloudflare declared a P0 incident, a designation for critical failures, after a network-wide issue disrupted its data plane services globally. The outage lasted roughly 90 minutes. Users trying to access websites backed by Cloudflare in affected regions were met with HTTP 500 errors, meaning that servers could not complete their requests. 

The disruption impacted core functionality across multiple industries, with companies like Fitbit and Peloton unable to process transactions or serve content. All network services, including routing and CDN functionality, were impacted. 

Lessons learned: 

  • Routing and data plane software must be version-controlled with automated canary rollouts and rollback triggers
  • HTTP 500-level errors should be centrally tracked with correlation tools to pinpoint origin of failure faster
  • Internal alerting systems should escalate quickly when multiple edge locations report identical failure patterns

Cloudflare Outages in 2020

August 30, 2020: Transit provider’s failure causes network-wide errors

This incident was due to an IP routing failure at CenturyLink, a Cloudflare transit provider. CenturyLink’s issue caused HTTP 522 and 503 errors across Cloudflare’s network, particularly in data centers that relied on CenturyLink. 

Although Cloudflare itself was not directly at fault, its edge services were disrupted in the US and Europe, affecting sites such as Hulu, Xbox Live, and Feedly. The outage underscored how external provider issues can propagate to service platforms like Cloudflare, even when their own systems are operational.

Lessons learned: 

  • Relying on a small number of transit providers creates systemic risk; diverse upstreams reduce blast radius
  • Real-time route validation and filtering can shield infrastructure from flawed BGP announcements
  • SLAs with transit providers must include routing hygiene guarantees and incident transparency

July 17, 2020: Misconfiguration causes routing errors

This incident was triggered by a routing misconfiguration on Cloudflare’s own backbone network. A router in Atlanta began advertising incorrect routes, which propagated across the internet and led to brief but significant disruptions for major sites including Shopify, Discord, Medium, League of Legends, and Patreon. Status pages and outage monitors were also affected, compounding user confusion.

Lessons learned: 

  • BGP route acceptance policies should reject anomalous announcements from peer or customer routers
  • Geographically distributed routing validation helps detect and neutralize misconfigurations before global impact
  • Public status pages and outage monitors need independent hosting to ensure availability during major events

Cloudflare Outage in 2019

June 24, 2019: Routing error causes major network disruption 

Cloudflare experienced a major service disruption caused by a mismanaged BGP route leak from Verizon. The problem began when DQE Communications, a small ISP, announced incorrect routing information. Verizon, lacking proper route filtering, propagated these routes across the internet, unintentionally redirecting massive amounts of internet traffic through DQE’s network, far beyond its capacity.

This caused widespread congestion and failure across many networks, including Cloudflare’s. The incident lasted for about three hours and impacted a significant portion of the internet, taking down major websites such as Amazon, Google, Facebook, and Discord. Cloudflare’s CEO criticized Verizon and routing optimization firm Noction for their roles in the failure, calling out the lack of basic route validation in a system as critical as BGP.
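
Route filtering of the kind Cloudflare called for is enforced in router configuration (prefix lists, maximum-prefix limits, RPKI origin validation), not in application code, but the core check is simple to illustrate. In the sketch below the peer name and prefixes are invented; the point is that an announcement is accepted only when it falls inside what that peer is authorized to originate.

```python
# Illustration only: real filtering lives in router configs and RPKI validators.
from ipaddress import ip_network

# Prefixes each peer is authorized to announce (e.g., built from an IRR or RPKI feed).
AUTHORIZED = {
    "peer-as64500": [ip_network("198.51.100.0/24")],
}


def accept_announcement(peer: str, prefix: str) -> bool:
    """Accept a BGP announcement only if it is covered by the peer's authorized prefixes."""
    announced = ip_network(prefix)
    allowed = AUTHORIZED.get(peer, [])
    return any(announced == net or announced.subnet_of(net) for net in allowed)


print(accept_announcement("peer-as64500", "198.51.100.0/25"))  # True: within authorization
print(accept_announcement("peer-as64500", "8.8.8.0/24"))       # False: leaked route, reject
```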

Lessons learned: 

  • BGP lacks built-in authentication or validation, so manual route filtering remains critical at ISP and provider levels
  • Routing optimization firms must adhere to strict best practices and be monitored continuously for anomalies
  • Provider ecosystems must coordinate response and accountability when routing misbehavior affects critical services

Common Root Causes of Cloudflare Outages 

Generalizing from the outages above, let’s look at the common causes behind many of these high-profile failures.

Configuration Propagation Failures Across Distributed Systems

Propagation failures occur when configuration changes, whether in software, routing, or firewall rules, are introduced without effective validation or rollback mechanisms. Distributed networks magnify the risk, as configuration errors can quickly cascade across hundreds of points of presence. In Cloudflare’s history, relatively isolated misconfigurations have led to outsized incidents because they propagated unchecked across edge locations.

Cloudflare and organizations that rely on similar architectures must develop and enforce strict deployment pipelines. Staged rollouts, automated validation, and real-time monitoring are critical for safely updating distributed configurations. Investing in robust tooling and rollback procedures is essential to minimize the impact of propagation failures on customer traffic and service reliability.
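
As a concrete illustration of the staged-rollout idea, the sketch below pushes a change to progressively larger fractions of the fleet and rolls back automatically if the error rate rises above a budget. The deploy, error_rate, and rollback callables are placeholders for whatever deployment and observability tooling you actually run.

```python
import time

# Only the staging logic is the point here; the callables are supplied by your tooling.
STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of points of presence per wave
ERROR_BUDGET = 0.001               # abort if error rate rises more than 0.1% over baseline
SOAK_SECONDS = 300                 # let metrics accumulate between waves


def staged_rollout(new_version: str, deploy, error_rate, rollback) -> bool:
    """Roll a change out in waves, rolling back automatically on regression."""
    baseline = error_rate()
    for fraction in STAGES:
        deploy(new_version, fraction)        # push the change to a slice of the fleet
        time.sleep(SOAK_SECONDS)
        if error_rate() - baseline > ERROR_BUDGET:
            rollback(new_version)            # automatic, not a human decision
            return False
    return True
```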

Network Congestion and Capacity Exhaustion Scenarios

Network congestion or capacity exhaustion often arises from traffic spikes, misconfigured routing, or inadequate capacity planning. When Cloudflare’s infrastructure becomes saturated, packet loss, latency, and service-level disruptions follow rapidly. Outages in 2020 and other years show that a localized or regional overload can have cascading global consequences for services reliant on seamless connectivity.

Effective mitigation depends on proactive network management, automated detection of anomalies, and dynamic scaling of resources to handle load variations. Organizations must also monitor for hot spots and traffic imbalances, making real-time adjustments as needed. Accurate forecasting and agile response resources are key strategies to prevent capacity-driven incidents from disrupting critical service delivery.

Hardware or Power Incidents Within Data Centers

Despite robust design, hardware or power failures at data centers remain a regular contributor to outages. Issues such as failing switch components, firmware bugs, or power loss can disrupt entire racks or facilities. Cloudflare’s reliance on third-party hardware makes the network susceptible to vendor-specific flaws that may not surface until production-scale loads reveal them.

Hardware redundancy, routine maintenance, and supplier risk management reduce exposure to unplanned incidents. Rapid failover capabilities, coupled with near-real-time detection and incident response, help mitigate outages when hardware issues do arise. Investing in energy resilience and multi-site deployments further insulates cloud infrastructure from isolated physical failures.

✅ Tip: When data centers fail, air-gapped, immutable backups can be your last line of defense, ensuring no data loss even in catastrophic infrastructure failure. And N2W makes this easy.

External Dependencies and Third-Party Service Failures

Cloudflare’s outages are sometimes attributed to issues outside its immediate control, such as failures at data transit providers, public DNS authorities, or cloud hardware vendors. The complexity of internet infrastructure means each provider depends on a sprawling mesh of external entities, from network carriers to power grids. Disruptions in any link can have disproportionate impacts on Cloudflare’s ability to deliver reliable service.

Mitigating these risks requires broad supply-chain visibility, redundant upstream providers, and cross-vendor monitoring. Contracts, service-level agreements, and crisis communication plans should anticipate upstream failures. Building agility into vendor relationships and monitoring integrations are essential for identifying and mitigating third-party risks on a continuous basis.

Why Cloudflare Outages Matter: Lessons for Organizations Relying on the Platform 

Single-Point-of-Failure Risk

A recurring lesson from Cloudflare outages is the inherent risk of centralized systems, which can create single points of failure affecting a large portion of the web. When a widely used service like Cloudflare experiences downtime, it can take down thousands of dependent services, revealing hidden concentration risks in internet infrastructure. Recent events have shown how configuration mistakes or software bugs in a provider’s network can instantaneously propagate to customers at global scale.

To mitigate these risks, organizations must carefully assess their dependencies and evaluate how their architecture would respond if a key third-party provider fails. Diversifying service delivery, implementing fallback paths, and maintaining an up-to-date disaster recovery strategy are crucial to reducing exposure. Businesses relying solely on Cloudflare without alternatives face serious risks when outages occur.

✅ Tip: Cloud-native DR tools like N2W allow for fast recovery into alternate regions, accounts, or even clouds—giving teams the flexibility to sidestep provider-level issues and maintain SLAs.

Operational Visibility and Failure Testing

Cloudflare outages highlight how many organizations lack clear visibility into how third-party services behave during partial or total failure. When a CDN, DNS, or security layer fails, internal teams often struggle to determine whether the issue originates in their own infrastructure or upstream. This slows response times and can lead to ineffective mitigation efforts.

Organizations should maintain detailed dependency maps, including which application components rely on Cloudflare services such as DNS, WAF, or Workers. Regular failure testing, including simulated provider outages and DNS failover drills, helps teams validate assumptions and uncover hidden coupling. Clear runbooks for third-party failure scenarios reduce confusion during real incidents and shorten recovery time.
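
A useful first step is a probe that answers the "is it us or is it upstream?" question directly, by checking the CDN-fronted hostname and the origin side by side. Below is a minimal sketch, assuming the requests package; both URLs and the /healthz path are placeholders.

```python
import requests

PROXIED = "https://www.example.com/healthz"       # goes through the CDN
ORIGIN = "https://origin.example.com/healthz"     # reaches your server directly


def probe(url: str) -> bool:
    """Return True if the endpoint responds with HTTP 200 within five seconds."""
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False


if __name__ == "__main__":
    proxied_ok, origin_ok = probe(PROXIED), probe(ORIGIN)
    if origin_ok and not proxied_ok:
        print("origin healthy but CDN path failing: likely a provider-side incident")
    elif not origin_ok:
        print("origin itself is unhealthy: investigate internal infrastructure first")
    else:
        print("both paths healthy")
```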

Need for Resiliency

Cloudflare outages illustrate the broad need for resiliency engineering in internet-scale architecture. High availability and uptime require systems to anticipate disruption from all layers, including third-party provider failures. Resilient designs incorporate redundancy, automated failover, and diverse pathways for traffic and application logic, lowering the likelihood of user-facing downtime from provider outages.

Building resilience often involves adopting multi-CDN strategies, geographically distributing infrastructure, and implementing comprehensive monitoring and alerting practices. Organizations that invest in these measures will not only minimize the direct impact of outages but also improve their recovery posture. Planning for failure as inevitable, rather than exceptional, leads to stronger, more adaptable service delivery.

Outsmarting Cloudflare Outages: Best Practices for Reducing Risk 

1. Data Integrity, Backup and Disaster-Recovery Strategy

Maintaining data integrity and disaster recovery readiness is essential for any business relying on cloud-based services. Regular backups, verified restoration procedures, and clear data custody frameworks ensure that organizations can recover quickly from outages or data loss. Structured recovery drills help teams identify gaps in planning, test failover processes, and confirm that crucial data assets can be restored under pressure.

Establishing robust data protection policies goes hand-in-hand with leveraging Cloudflare’s security features. Organizations should ensure that backups are stored independently of production environments and accessible even if Cloudflare or another core vendor experiences downtime. This approach safeguards business continuity and helps organizations comply with regulatory or contractual data protection requirements.
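
Backup policies are only useful if the backups actually exist and are recent, so it is worth automating that check independently of the backup tool itself. The sketch below, assuming boto3 and backups stored in an isolated S3 bucket, alerts when the newest object is older than a 24-hour recovery point objective; the bucket, prefix, and RPO are placeholders.

```python
from datetime import datetime, timedelta, timezone

import boto3

BUCKET = "dr-backups-isolated-account"   # ideally in a separate, locked-down account
PREFIX = "nightly/"
MAX_AGE = timedelta(hours=24)            # example recovery point objective


def newest_backup_age(bucket: str, prefix: str) -> timedelta:
    """Return how old the most recent backup object is."""
    s3 = boto3.client("s3")
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
    if not objects:
        raise RuntimeError("no backups found at all")
    newest = max(obj["LastModified"] for obj in objects)
    return datetime.now(timezone.utc) - newest


if __name__ == "__main__":
    age = newest_backup_age(BUCKET, PREFIX)
    if age > MAX_AGE:
        raise SystemExit(f"newest backup is {age} old, beyond the recovery point objective")
    print(f"newest backup is {age} old: within policy")
```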

✅ Tip: N2W helps you implement data resilience strategies that go beyond simple backups. With automated backup policies, cross-cloud DR, immutable storage, and air-gapped accounts, you can recover everything from individual files to full application stacks (including networking resources) in just a few clicks. It’s not just backup; it’s full-stack recovery you control.

2. Deploying Multi-CDN or Hybrid Delivery Architectures

Relying on a single CDN provider exposes applications to broader risk in the event of a provider-specific outage. Multi-CDN and hybrid delivery strategies distribute traffic across multiple vendors, improving redundancy and geographic performance. By architecting solutions that can flexibly reroute users between providers, organizations can insulate themselves from single-provider failures and benefit from load optimization.

Automation is key to managing multi-CDN setups, ensuring that traffic shifts seamlessly in response to degrading provider performance. Real-time monitoring and dynamic DNS allow instant failover, reducing manual intervention and downtime. Organizations should also review contractual relationships and technical interoperability between CDN partners to address failover triggers, billing implications, and compliance considerations.

3. Designing Resilient DNS Configurations and Fallback Paths

DNS plays a critical role in directing traffic to online resources, and DNS outages can render services inaccessible regardless of backend health. Designing resilient DNS configurations, such as using secondary DNS providers or Anycast routing, ensures that queries can resolve even when Cloudflare faces issues. Including fallback records, TTL strategies, and geographic redundancy further enhances continuity.

Automated DNS failover, combined with active health checks, allows organizations to reroute users away from failing endpoints or regions without user involvement. Documenting and regularly testing fallback scenarios is crucial for identifying weaknesses in DNS infrastructure. Comprehensive DNS resilience planning is a core element of minimizing the scope and impact of provider-specific incidents.
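
The failover logic itself can be very small; the hard parts are keeping TTLs low and having a standby path that does not share the failing provider. In the sketch below, update_record is a placeholder for your DNS provider's API, and the hostnames and IP addresses are examples only.

```python
import time

import requests

PRIMARY = "203.0.113.10"     # origin fronted by the primary provider
SECONDARY = "198.51.100.20"  # standby origin reachable without that provider
CHECK_URL = "https://www.example.com/healthz"


def primary_healthy() -> bool:
    try:
        return requests.get(CHECK_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False


def update_record(name: str, value: str, ttl: int = 60) -> None:
    """Placeholder: call your DNS provider's API here."""
    print(f"would set {name} -> {value} (TTL {ttl}s)")


if __name__ == "__main__":
    # Failback and alerting are omitted for brevity; keep TTLs short so the
    # switch propagates quickly once the record changes.
    while True:
        if not primary_healthy():
            update_record("www.example.com", SECONDARY)
        time.sleep(30)
```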

✅ Tip: Even when DNS fails, having an independent DR environment lets you restore critical services from a clean slate, including custom DNS zones and network settings.

4. Implementing Origin Redundancy and Health-Check Automation

Origin redundancy enhances service availability by distributing backend resources across multiple locations or cloud providers. Health-check automation monitors these origins, automatically removing unhealthy endpoints from rotation and redirecting users to healthy resources. This proactive approach drastically reduces the risk of service-wide outages when origin servers or cloud regions experience disruptions.

Configuring health checks at both the application and network layers provides faster detection of failures. Integrating these checks with traffic management solutions such as Cloudflare Load Balancer allows organizations to respond instantly to outages, reducing time-to-recovery and supporting seamless user experiences. Ongoing validation and adjustment of health check logic are needed to match evolving operational and security requirements.
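
Here is a minimal sketch of the two-layer idea, with a TCP connect test at the network layer and an HTTP probe at the application layer; the origin hostnames and the /healthz path are assumptions.

```python
import socket

import requests

ORIGINS = ["origin-a.example.com", "origin-b.example.com"]


def tcp_ok(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Network-layer check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_ok(host: str) -> bool:
    """Application-layer check: does the service answer its health endpoint?"""
    try:
        return requests.get(f"https://{host}/healthz", timeout=5).status_code == 200
    except requests.RequestException:
        return False


def healthy_origins() -> list[str]:
    # Only origins passing both layers stay in the load-balancing pool.
    return [h for h in ORIGINS if tcp_ok(h) and http_ok(h)]


if __name__ == "__main__":
    print("origins eligible for traffic:", healthy_origins())
```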

✅ Tip: N2W supports orchestrated failovers and health-check-driven restores, so that dependencies like VPCs, VPNs, and load balancers come back online in the correct order for business continuity.

5. Using Caching Strategies to Minimize Dynamic Dependency

Effective use of caching reduces reliance on origin infrastructure during outages. By maximizing static and dynamic asset caching at the edge, organizations limit the need for repeated origin requests and can continue serving content even if the origin is temporarily unreachable. Cloudflare provides fine-grained cache control mechanisms that can be tuned according to content type, user characteristics, and outage scenarios.

Configuring cache failover policies, such as “serve stale” or “always online” modes, extends continuity during connectivity interruptions. Auditing and optimizing cache hit rates ensure that the majority of user requests are fulfilled directly from the edge whenever possible. This practice not only improves performance and reduces exposure during outages but can also lower bandwidth costs and enhance overall site stability.
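
As a concrete example, stale-while-revalidate and stale-if-error (standardized in RFC 5861) let caches keep serving previously fetched content while the origin is slow or unreachable. The sketch below sets such a header from the origin side, assuming a Flask application; the specific max-age and staleness windows are placeholders to tune for your content.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def index():
    return "hello"


@app.after_request
def add_cache_headers(response):
    # Serve fresh for 5 minutes, revalidate in the background for 1 minute,
    # and allow stale content for up to a day if the origin is erroring.
    response.headers["Cache-Control"] = (
        "public, max-age=300, stale-while-revalidate=60, stale-if-error=86400"
    )
    return response
```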

Web-Scale Disaster Recovery with N2W

When Cloudflare—or any critical third-party service—goes down, the real question is: how quickly can you recover?

N2W gives you full-stack, policy-based protection that spans AWS, Azure, and Wasabi—all from a single console. Whether you’re safeguarding virtual machines, databases, S3 buckets, or even Kubernetes clusters, N2W lets you back up, recover, and fail over in just a few clicks.

  • Cross-Account and Cross-Cloud DR: Recover into a clean AWS or Azure environment (even in a different region or cloud) for max resilience against provider-level outages.
  • Immutable, Air-Gapped Backups: Protect your backups from ransomware, rogue deletions, and infrastructure failures with immutable backups and locked-down DR accounts.
  • Automated DR Drills & Orchestrated Recovery: Run non-disruptive DR tests on-demand and recover entire applications—including VPCs, VPNs, and load balancers—in the right order, every time.
  • Instant Cleanup, Flexible Retention: Control backup sprawl and optimize storage costs with run-now cleanup and multiple retentions in a single policy.

Ready to bulletproof your cloud infrastructure?

Download the Cloud Outage Survival Checklist to learn how to prepare, recover, and thrive—no matter what breaks next.
