Cloudflare was down

pm90 · 2025-12-05T08:53:31 1764924811

This is not good. One major outage? Something exceptional. Several outages in a short time? As someone that's worked in operations, I have empathy; there are so many “temp hacks” that are put in place for incidents. But the rest of the world won’t… they’re gonna suffer a massive reputation loss if this goes on as long as the last one.

berkes · 2025-12-05T09:07:48 1764925668

At least this warrants a good review of anyone's dependency on cloudflare.

If it turns out that this was really just random bad luck, it shouldn't affect their reputation (if humans were rational, that is...)

But if it is what many people seem to imply, that this is the outcome of internal problems/cuttings/restructuring/profit-increase etc, then I truly very much hope it affects their reputation.

But I'm afraid it won't. Just like Microsoft continues to push out software, that, compared to competitors, is unstable, insecure, frustrating to use, lacks features, etc, without it harming their reputation or even bottomlines too much. I'm afraid Cloudflare has a de-facto monopoly (technically: big moat) and can get away with offering poorer quality, for increasing pricing by now.

zelphirkalt · 2025-12-05T09:58:35 1764928715

Microsoft's reputation couldn't be much lower at this point, that's their trick.

The issue is the uninformed masses being led to use Windows when they buy a computer. They don't even know how much better a system could work, and so they accept whatever is shoved down their throats.

coffeebeqn · 2025-12-05T09:11:57 1764925917

Vibe infrastructure

rvz · 2025-12-05T09:16:18 1764926178

So that is the best-case definition of what "Vibe Engineering" is.

rsynnott · 2025-12-05T10:16:09 1764929769

> Just like Microsoft continues to push out software, that, compared to competitors, is unstable, insecure, frustrating to use, lacks features, etc, without it harming their reputation or even bottomlines too much.

Eh.... This is _kind_ of a counterfactual, tho. Like, we are not living in the world where MS did not do that. You could argue that MS was in a good place to be the dominant server and mobile OS vendor, and simply screwed both up through poor planning, poor execution, and (particularly in the case of server stuff) a complete disregard for quality as a concept.

I think someone who'd been in a coma since 1999 waking up today would be baffled at how diminished MS is, tbh. In the late 90s, Microsoft practically _was_ computers, with only a bunch of mostly-dying UNIX vendors for competition. And one reasonable lens through which to interpret its current position is that it's basically due to incompetence on Microsoft's part.

MrAureliusR · 2025-12-05T09:20:35 1764926435

well that's the thing, such a huge number of companies route all their traffic through Cloudflare. This is at least partially because for a long time, there was no other company that could really do what Cloudflare does, especially not at the scales they do. As much as I despise Cloudflare as a company, their blog posts about stopping attacks and such are extremely interesting. The amount of bandwidth their network can absorb is jaw-dropping.

I've said to many people/friends that use Cloudflare to look elsewhere. When such a huge percentage of the internet flows through a single provider, and when that provider offers a service that allows them to decrypt all your traffic (if you let them install HTTPS certs for you), not only is that a hugely juicy target for nation-states but the company itself has too much power.

But again, what other companies can offer the insane amount of protection they can?

gbrindisi · 2025-12-05T11:43:34 1764935014

The crowdstrike incident taught us that no one is going to review any dependency whatsoever.

ezst · 2025-12-05T12:11:47 1764936707

Yep, that's what late stage capitalism leaves you with: consolidation, abuse, helplessness and complacency/widespread incompetence as a result

bluerooibos · 2025-12-05T11:16:52 1764933412

I'm quite sure the reputational damage has already been done.

How do they not have better isolation of these issues, or redundancy of some sort?

brandensilva · 2025-12-05T13:25:13 1764941113

The seed has been planted. It will take a while for others to fill the void. Still, the big players see this as an opportunity to steal market share if Cloudflare cannot live up to their reputation.

rvz · 2025-12-05T09:02:50 1764925370

We are now seeing which companies do not consider the third-party risk of single points of failure in systems they do not control as part of their infrastructure, and what their contingency plan is.

It turns out so far, there isn't one. Other than contacting the CEO of Cloudflare rather than switching on a temporary mitigation measure to ensure minimal downtime.

Therefore, many engineers at affected companies would have failed their own systems design interviews.

throwaway42346 · 2025-12-05T09:23:18 1764926598

Alternative infrastructure costs money, and it's hard to get approval from leadership in many cases. I think many know what the ideal solution looks like, but anything linked to budgets is often out of the engineer's hands.

In some cases it is also a valid business decision. If you have 2 hour down time every 5 years, it may not have a significant revenue impact. Most customers think it's too much bother to switch to a competitor anyway, and even if it were simple the competition might not be better. Nobody gets fired for buying IBM

The decision was probably made by someone else who moved on to a different company, so they can blame that person. It's only when down time significantly impacts your future ARR (and bonus) that leadership cares (assuming that someone can even prove that they actually lose customers).

cryptonym · 2025-12-05T09:16:17 1764926177

Sometimes it's not worth it. Your plan is just to accept you'll be off for a day or two, while you switch to a competitor.

creamyhorror · 2025-12-05T13:51:31 1764942691

If there's a fitting competitor worth switching to.

Plus most people don't get blamed when AWS (or to a lesser extent Cloudflare) goes down, since everyone knows more than half the world is down, so there's not an urgent motivation to develop multi-vendor capability.

rvz · 2025-12-05T12:04:55 1764936295

Can't say that when it is a time critical service such as hospitals, banks, financial institutions or air-traffic control services.

cryptonym · 2025-12-05T14:49:38 1764946178

Only a fool would build an architecture for critical air-traffic with Cloudflare as a SPoF.

formerly_proven · 2025-12-05T10:00:21 1764928821

On the other thread there were comments claiming it’s unknowable what IaaS some SaaS is using, but SaaS vendors need to disclose these things one way or another, e.g. DPAs. Here, for example, is Render’s list of subprocessors: https://render.com/security

It’s actually fairly easy to know which 3rd party services a SaaS depends on and map these risks. It’s normal due diligence for most companies to do so before contracting a SaaS.

jcmfernandes · 2025-12-05T10:16:51 1764929811

Absolutely. I wouldn’t be surprised if they turned the heat up a little after the last incident. The result? Even more incidents.

belter · 2025-12-05T09:49:14 1764928154

This will be another post-mortem of...config file messed...did not catch...promise to be doing better next....We are sorry.

The problem is architectural.

lucyjojo · 2025-12-05T23:54:29 1764978869

cloudflare is a huge system in active development.

it will randomly fail. there is no way it cannot.

there is a point where the cost to not fail simply becomes too high.

pyuser583 · 2025-12-05T09:09:07 1764925747

Lots of big sites are down

wooque · 2025-12-05T11:31:16 1764934276

2 days ago they had an outage that affected Europe. Cloudflare seems to be going down the drain. I removed it from my personal sites.

karmakurtisaani · 2025-12-05T08:58:57 1764925137

Probably fired a lot of their best people in the past few years and replaced them with AI. They have a de-facto monopoly, so we'll just accept it and wait patiently until they fix the problem. You know, business as usual in the grift economy.

5d41402abc4b · 2025-12-05T09:16:56 1764926216

>They have a de-facto monopoly

On what? There are lots of CDN providers out there.

esseph · 2025-12-05T09:31:38 1764927098

They do far more than just CDN. It's the combination of service, features, reach, price, and the integration of it all.

immibis · 2025-12-05T09:28:31 1764926911

There's only one that lets everyone sign up for free.

rvz · 2025-12-05T09:18:43 1764926323

The "AI agents" are on holiday when an outage like this happens.

mvdtnz · 2025-12-05T17:41:29 1764956489

This didn't happen at all. You're just completely making shit up.

PlotCitizen · 2025-12-05T08:59:49 1764925189

This is a good reminder for everyone to reconsider making all of their websites depend on a single centralized point of failure. There are many alternatives to the different services which Cloudflare offers.

berkes · 2025-12-05T09:13:19 1764925999

But a CDN, like most other products CF offers, is central by nature.

If you switch from CF to the next CF competitor, you've not improved this dependency.

The alternative here is complex or even non-existent. Complex would be some system that allows you to hotswap a CDN, or to have fallback DDOS protection services, or to build your own in-house. Which, IMO, is the worst thing to do if your business is elsewhere. If you sell, say, pet food online, the dependency risk that comes with a vendor like CF is quite certainly less than the investment needed for, and the risk associated with, building DDOS protection or a CDN on your own; all investment that's not directed at selling more pet food or getting higher margins doing so.

Zambyte · 2025-12-06T17:02:49 1765040569

IPFS is a decentralized CDN.

agnivade · 2025-12-05T09:22:03 1764926523

You can load-balance between CDN vendors as well

otikik · 2025-12-05T09:49:19 1764928159

Then your load balancer becomes the single point of failure.

roryirvine · 2025-12-05T11:04:20 1764932660

BGP Anycast will let you dynamically route traffic into multiple front-end load balancers - this is how GSLB is usually done.

Needs an ASN and a decent chunk of PI address space, though, so not exactly something a random startup will ever be likely to play with.

DaanDL · 2025-12-05T11:20:43 1764933643

Then add a load balancer in front of your load balancer, duh. /s

sofixa · 2025-12-05T10:03:02 1764928982

With what? The only (sensible) way is DNS, but then your DNS provider is your SPOF. Amazon used to run 2 DNS providers (separate NS from 2 vendors for all of AWS), but when one failed, there was still a massive outage.

altmanaltman · 2025-12-05T10:02:02 1764928922

yeah there is no incentive to do a CDN in house, esp for businesses that are not tech-oriented. And the cost of the occasional outage has not really been higher than the cost of doing it in-house. And I'm sure other CDNs get outages as well, just CF is so huge everyone gets to know about it and it makes the news

coffeebeqn · 2025-12-05T09:12:46 1764925966

We just love to merge the internet into single points of failure

phatfish · 2025-12-05T09:54:26 1764928466

This is just how free markets work, on the internet with no "physical" limitations it is simply accelerated.

Left alone, corporations that rival governments emerge, and they are completely unaccountable. At least there is some accountability of governments to the people, depending on your flavour of government.

mschuster91 · 2025-12-05T09:56:34 1764928594

no one loves the need for CDNs other than maybe video streaming services.

the problem is, below a certain scale you can't operate anything on the internet these days without hiding behind a WAF/CDN combo... with the cut-off mark being "we can afford a 24/7 ops team". even if you run a small niche forum no one cares about, all it takes is one disgruntled donghead that you ban to ruin the fun - ddos attacks are cheap and easy to get these days.

and on top of that comes the shodan skiddie crowd. some 0day pops up, chances are high someone WILL try it out in less than 60 minutes. hell, look into any web server log, the amount of blind guessing attacks (e.g. /wp-admin/..., /system/login, /user/login) or path traversal attempts is insane.

CDN/WAFs are a natural and inevitable outcome of our governments and regulatory agencies not giving a shit about internet security and punishing bad actors.

koakuma-chan · 2025-12-05T09:08:09 1764925689

My Cloudflare Pages website works fine.

inferiorhuman · 2025-12-05T10:22:44 1764930164

> There are many alternatives

Of varying quality depending on the service. Most of the anti-bot/captcha crap seems to be equivalently obnoxious, but the handful of sites that use PerimeterX… I've basically sworn off DigiKey as a vendor since I keep getting their bullshit "press and hold" nonsense even while logged in.

I don't like that we're trending towards a centralized internet, but that's where we are.

luastoned · 2025-12-05T10:10:25 1764929425

From the incident page:

A change made to how Cloudflare's Web Application Firewall parses requests caused Cloudflare's network to be unavailable for several minutes this morning. This was not an attack; the change was deployed by our team to help mitigate the industry-wide vulnerability disclosed this week in React Server Components. We will share more information as we have it today.

https://www.cloudflarestatus.com/incidents/lfrm31y6sw9q

reassess_blind · 2025-12-05T10:41:11 1764931271

I’m really curious what their rollout procedure is, because it seems like many of their past outages should have been uncovered if they released these configuration changes to 1% of global traffic first.

lima · 2025-12-05T12:20:22 1764937222

They don't appear to have a rollout procedure for some of their globally replicated application state. They had a number of major outages over the past years which all had the same root cause of "a global config change exposed a bug in our code and everything blew up".

I guess it's an organizational consequence of mitigating attacks in real time, where rollout delays can be risky as well. But if you're going to do that, it would appear that the code has to be written much more defensively than what they're doing right now.

JB_Dev · 2025-12-05T13:44:18 1764942258

Yea agree.. This is the same discussion point that came up last time they had an incident.

I really don’t buy this requirement to always deploy state changes 100% globally immediately. Why can’t they just roll out to 1%, scaling to 100% over 5 minutes (configurable), with automated health checks and pauses? That will go a long way towards reducing the impact of these regressions.

Then if they really think something is so critical that it goes everywhere immediately, then sure set the rollout to start at 100%.

Point is, design the rollout system to give you that flexibility. Routine/non-critical state changes should go through slower ramping rollouts.
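
A minimal sketch of the kind of ramping rollout described here, in Python. Every name in it (`staged_rollout`, `apply_to_fraction`, `is_healthy`, `rollback`) is hypothetical, and nothing here reflects how Cloudflare's deployment tooling actually works. The default stages and soak time are assumptions chosen so that a routine change ramps from 1% to 100% in roughly 5 minutes:

```python
import time
from typing import Callable, Sequence


def staged_rollout(
    apply_to_fraction: Callable[[float], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
    stages: Sequence[float] = (0.01, 0.05, 0.25, 1.0),
    soak_seconds: float = 75.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Ramp a change through `stages`, checking health after each one.

    Returns True if the change reached the final stage, False if a
    health check failed and the change was rolled back.
    """
    for fraction in stages:
        apply_to_fraction(fraction)  # expose the change to this share of traffic
        sleep(soak_seconds)          # let metrics accumulate before judging
        if not is_healthy():
            rollback()               # stop ramping and revert everywhere
            return False
    return True
```

A change judged too urgent to ramp could pass `stages=(1.0,)`, which keeps the "go everywhere immediately" option available while making the slow ramp the default.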

franktankbank · 2025-12-05T15:05:13 1764947113

Can't get hacked when you are down.

ethbr1 · 2025-12-05T13:42:03 1764942123

For hypothetical conflicting changes (read worst case: unupgraded nodes/services can't interop with upgraded nodes/services), what's best practice for a partial rollout?

Blue/green and temporarily ossify capacity? Regional?

cryptonym · 2025-12-05T15:38:49 1764949129

- Push a version with the new logic but not yet enabled, still using legacy logic, able to implement both

- Push a version that enables new logic for 1% of traffic

- Continue rollout until 100%
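
The steps above could be sketched roughly like this in Python. The function names (`bucket`, `use_new_logic`, `handle`) and the idea of keying on a request id are assumptions for illustration only. Hashing the id gives each request a stable bucket, so raising the percentage only ever adds traffic to the new path:

```python
import hashlib
from typing import Callable


def bucket(request_id: str) -> int:
    """Map a request id to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_logic(request_id: str, rollout_percent: int) -> bool:
    """True if this request falls inside the currently enabled share."""
    return bucket(request_id) < rollout_percent


def handle(
    request_id: str,
    rollout_percent: int,
    legacy: Callable[[str], str],
    new: Callable[[str], str],
) -> str:
    """Both code paths ship in the same version; the percentage picks one."""
    if use_new_logic(request_id, rollout_percent):
        return new(request_id)
    return legacy(request_id)
```

With `rollout_percent=0` the deployed version behaves exactly like the old one, which corresponds to the first step; the later steps only change the number.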

nrhrjrjrjtntbt · 2025-12-05T23:36:10 1764977770

Can also do canary rollout before that. Canary means rollout to endpoints only used by CF to test. Monitor metrics and automated test results.

cryptonym · 2025-12-06T08:45:06 1765010706

That's ok but doesn't solve issues you notice only on actual prod traffic. While it can be a nice addition to catch issues earlier with minimal user impact, best practice on large scale systems still requires a staged/progressive prod rollout.

nrhrjrjrjtntbt · 2025-12-06T08:49:00 1765010940

Yep. This is definitely an "as well as"

Unit test, Integration Test, Staging Test, Staging Rollout, Production Test, Canary, Progressive Rollout

This can all be automated, and you can smash through all of it quickly with no human intervention.

tehlike · 2025-12-05T12:36:49 1764938209

You can selectively bypass many roll out procedures in a properly designed system.

lima · 2025-12-05T12:54:34 1764939274

If there is a proper rollout procedure that would've caught this, and they bypass it for routine WAF configuration changes, they might as well not have one.

nrhrjrjrjtntbt · 2025-12-05T23:34:52 1764977692

Not sure I buy it. Do 1% for 10 minutes. I mean it must have taken over half a day to code and test a patch. Why not wait another 10 minutes.

gpi · 2025-12-05T13:36:37 1764941797

I believe they use Argo according to a previous post mortem.

https://blog.cloudflare.com/deep-dive-into-cloudflares-sept-...

stogot · 2025-12-05T12:35:27 1764938127

The update they describe should never bring down all services. I agree with other posters that they must lack a rollout strategy, yet they sent spam emails mocking the reliability of other clouds.

brandensilva · 2025-12-05T13:18:12 1764940692

The irony is they support rolling out incrementally with some of their products for deployment.

They need that same mindset for themselves in config/updates/infra changes but probably easier said than done.

Traubenfuchs · 2025-12-05T10:52:20 1764931940

"Please don‘t block the rollout pipleline with a simple react security patch update."

philipwhiuk · 2025-12-05T10:25:51 1764930351

So their parser broke again I guess.

And no staged rollout I assume?

tialaramex · 2025-12-05T10:43:13 1764931393

Apparently somehow this had never been how Cloudflare did this. I expressed incredulity about this to one of their employees, but yeah, seems like their attitude was "We never make mistakes so it's fastest to just deploy every change across the entire system immediately" and as we've seen repeatedly in the past short while that means it sometimes blows up.

They have blameless post mortems, but maybe "We actually do make mistakes so this practice is not good" wasn't a lesson anybody wanted to hear.

rhdunn · 2025-12-05T11:11:11 1764933071

Blameless post mortems should be similar to air accident investigations. I.e. don't blame the people involved (unless they are acting maliciously), but identify and fix the issues to ensure this particular incident is unlikely to recur.

The intent of the postmortems is to learn what the issues are and prevent or mitigate similar issues happening in the future. If you don't make changes as a result of a postmortem then there's no point in conducting them.

meindnoch · 2025-12-05T12:44:06 1764938646

>don't blame the people involved (unless they are acting maliciously)

Or negligently.

jq-r · 2025-12-05T13:23:08 1764940988

That still shouldn't be a part of post mortem, more of a performance review item.

tempaccount420 · 2025-12-05T13:38:27 1764941907

They should be performantly removed.

__turbobrew__ · 2025-12-05T16:04:21 1764950661

The aviation industry regularly requires certifications, check rides, and re-qualifications when humans mess up. I have never seen anything like that in tech.

Sometimes the solution is to not let certain people do certain things which are risky.

Xunjin · 2025-12-05T11:24:16 1764933856

Agree 100%. However, using your example, there is no regulatory agency that investigates the issue and demands changes to avoid related future problems. Should the industry move in this direction?

tialaramex · 2025-12-05T12:31:42 1764937902

However, one of the things you see (if you read enough of them) in accident investigation reports for regulated industries is a recurring pattern

1. Accident happens

2. Investigators conclude Accident would not happen if people did X. Recommend regulator requires that people do X, citing previous such recommendations each iteration

3. Regulator declines this recommendation, arguing it's too expensive to do X, or people already do X, or even (hilariously) both

4. Go to 1.

Too often, what happens is that eventually

5. Extremely Famous Accident Happens, e.g. killing loved celebrity Space Cowboy

6. Investigators conclude Accident would not happen if people did X, remind regulator that they have previously recommended requiring X

7. Press finally reads dozens of previous reports and so News Story says: Regulator killed Space Cowboy!

8. Regulator decides actually they always meant to require X after all

ethbr1 · 2025-12-05T13:47:00 1764942420

As bad as (3) sounds, I'll strongman the argument: it's important to keep the economic cost of any regulation in mind.*

On the one hand, you'd like to prevent the thing the regulation is seeking to prevent.

On the other hand, you'd have costs for the regulation to be implemented (one-time and/or ongoing).

"Is the good worth the costs?" is a question worth asking every time. (Not least because sometimes it lets you downscope/target regulations to get better good ROI)

*Yes, the easy pessimistic take is 'industry fights all regulation on cost grounds', but the fact that the argument is abused doesn't mean it doesn't have some underlying merit

tialaramex · 2025-12-05T14:30:02 1764945002

I think conventionally the verb is "to steelman" with the intended contrast being to a strawman, an intentionally weak argument by analogy to how straw isn't strong but steel is. I understood what you meant by "strongman" but I think that "steelman" is better here.

There is indeed a good reason regulators aren't just obliged to institute all recommendations - that would be a lot of new rules. The only accident report I remember reading with zero recommendations was a MAIB (Maritime accidents) report here which concluded that a crew member of a fishing boat had died at sea after their vessel capsized because both they and the skipper (who survived) were on heroin. The rationale for not recommending anything was that heroin is already illegal, operating a fishing boat while on heroin is already illegal, and it's also obviously a bad idea, so there's nothing to recommend. "Don't do that".

Cost is rarely very persuasive to me, because it's very difficult to correctly estimate what it will actually cost to change something once you decided it's required - based on current reality where it is not. Mass production and clever cost reductions resulting from the normal commercial pressures tend to drive down costs when we require something but not before (and often not after we cease to require it either)

It's also difficult to anticipate all benefits from a good change without trying it. Lobbyists against a regulation will often try hard not to imagine benefits; after all, they're fighting not to be regulated. But once it's in action, it may be obvious to everyone that this was just a better idea and absurd that it wasn't always the case.

Remember when you were allowed to smoke cigarettes on aeroplanes? That seems crazy, but at the time it was normal and I'm sure carriers insisted that not being allowed to do this would cost them money - and perhaps for a short while it did.

ethbr1 · 2025-12-06T14:14:22 1765030462

> it's very difficult to correctly estimate what it will actually cost to change something once you decided it's required - based on current reality where it is not. Mass production and clever cost reductions resulting from the normal commercial pressures tend to drive down costs

Difficult, but not impossible.

What is calculable and does NOT scale down is the cost of compliance documentation and processes. Changing from 1 form of documentation to 4 forms of documentation has a measurable cost that will be imposed forever.

> It's also difficult to anticipate all benefits from a good change without trying it.

That's not a great argument, because it can be counterbalanced by the equally true opposite: it's difficult to anticipate all downsides to a change without trying it.

> Remember when you were allowed to smoke cigarettes on aeroplanes?

Remember when you could walk up to a gate 5 minutes before a flight, buy a ticket, and fly?

The current TSA security theater has had some benefits, but it's also made using airports far worse as a traveler.

kypro · 2025-12-05T12:35:55 1764938155

> They have blameless post mortems, but maybe "We actually do make mistakes so this practice is not good" wasn't a lesson anybody wanted to hear.

Or they could say, "we want to continue to prioritise speed of security rollouts over stability, and despite our best efforts, we do make mistakes, so sometimes we expect things will blow up".

I guess it depends what you're optimising for... If the rollout speed of security patches is the priority then maybe increased downtime is a price worth paying (in their eyes anyway)... I don't agree with that, but at least it's an honest position to take.

That said, if this was to address the React CVE then it was hardly a speedy patch anyway... You'd think they could have afforded to stagger the rollout over a few hours at least.

lima · 2025-12-05T13:03:44 1764939824

It's just poor risk management at this point. Making sure that a configuration change doesn't crash the production service shouldn't take more than a few seconds in a well-engineered system even if you're not doing staged rollout.
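
As a rough illustration of that kind of check, here is a Python sketch in which a candidate config is built and exercised against a few sample requests before it replaces the active one. The names (`build_rule`, `ConfigHolder`, `try_update`) and the `max_len` setting are invented for the example; a real WAF ruleset is far more involved, and this is not a description of Cloudflare's code:

```python
import json
from typing import Callable, Iterable

Rule = Callable[[str], bool]


def build_rule(raw_config: str) -> Rule:
    """Hypothetical rule builder: allow requests of up to `max_len` characters."""
    max_len = json.loads(raw_config)["max_len"]
    return lambda request: len(request) <= max_len


class ConfigHolder:
    """Keeps the last known-good rule active until a candidate passes a smoke test."""

    def __init__(self, raw_config: str) -> None:
        self._rule = build_rule(raw_config)

    def allow(self, request: str) -> bool:
        return self._rule(request)

    def try_update(self, raw_config: str, smoke_requests: Iterable[str]) -> bool:
        try:
            candidate = build_rule(raw_config)
            for request in smoke_requests:
                candidate(request)  # only checks that evaluation does not crash
        except Exception:
            return False            # reject the change, keep serving the old rule
        self._rule = candidate
        return True
```

The smoke test here only catches configs that make the rule crash on the sample inputs; it says nothing about whether the new rule is correct, so it would complement a staged rollout rather than replace one.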

meindnoch · 2025-12-05T10:39:08 1764931148

React (a frontend JS framework) can now bring down critical Internet infrastructure.

I will repeat it because it's so surreal: React (a frontend JS framework) can now bring down critical Internet infrastructure.

cryptonym · 2025-12-05T10:50:09 1764931809

That's Next.js, not React.

Mentioning React Server Components in the status page can be seen as a bad way to shift the blame. Would have been better to not specify which CVE they were trying to patch. The issue is their rollout management, not the Vendor and CVE.

JimDabell · 2025-12-05T12:06:40 1764936400

> That's Next.js, not React.

React seems to think that it was React:

https://react.dev/blog/2025/12/03/critical-security-vulnerab...

cryptonym · 2025-12-05T13:03:25 1764939805

True, thanks for sharing. Worth mentioning that's on the "full-stack" part of the framework. It doesn't impact most React websites while it impacts most next.js websites.

tempaccount420 · 2025-12-05T13:41:12 1764942072

It was React. Code in React's repository had to be patched to fix this.

Next.JS just happens to be the biggest user of this part of React, but blaming Next.JS is weird...

cryptonym · 2025-12-05T14:30:46 1764945046

Thanks, that's what I acknowledged in the message you just replied to.

I'm not blaming anyone. Mostly outlining who was impacted as it's not really related to the front-end parts of the framework that the initial comment was referring to.

philipwhiuk · 2025-12-05T11:03:47 1764932627

I think the "argument" is that it's a critical vuln so they can't "go slow".

So now a vuln check for a component deployed on, being generous, 1% of servers causes an outage for 30% of the internet.

The argument is dumb.

spiffytech · 2025-12-05T12:59:29 1764939569

To be accurate: React developed server-side capabilities, and that's where the vulnerability exists.

It feels noteworthy because React started out frontend-only, but pedantically it's just another backend with a vulnerability.

phplovesong · 2025-12-05T10:40:06 1764931206

[flagged]

mvandermeulen · 2025-12-05T10:56:31 1764932191

What was the AI slop part?

GaryBluto · 2025-12-05T12:36:23 1764938183

When something goes wrong, people are starting to immediately assume it's because of the thing they don't like.

o_m · 2025-12-05T14:35:28 1764945328

I wonder if this is the new normal? Weekly Cloudflare outages that break huge parts of the internet.

uyzstvqs · 2025-12-05T10:27:29 1764930449

Ah yes, Cloudflare's worst enemy: The configuration change.

hinkley · 2025-12-05T18:40:14 1764960014

On fridays, yes.

aatd86 · 2025-12-05T11:59:24 1764935964

so it's react again in the end .. zzzzzzz

pepoluan · 2025-12-05T15:03:23 1764947003

So. Another regex problem?

xyproto · 2025-12-05T08:51:58 1764924718

Yes.

Weird that https://www.cloudflarestatus.com/ isn't reporting this properly. It should be full of red blinking lights.

javier2 · 2025-12-05T09:00:57 1764925257

Yeah. I only work for a small company, but you can be certain we will not update the status page if only a small portion of customers are affected, and if we are fully down, rest assured there will be no available hands to keep the status page updated

s_dev · 2025-12-05T09:10:08 1764925808

>rest assured there will be no available hands to keep the status page updated

That's not how status pages work, if implemented correctly. The real reason status pages aren't updated is SLAs. If you agree on a contract to have 99.99% uptime, your status page had better reflect that or it invalidates many contracts. This is why AWS also lies about its uptime and status page.

These services rarely experience outages according to their own figures, but rather 'degraded performance' or some other language that talks around the issue rather than acknowledging it.

It's like when buying a house you need an independent surveyor not the one offered by the developer/seller to check for problems with foundations or rotting timber.

redm · 2025-12-05T09:23:38 1764926618

SLAs usually just give you a small credit for the exact period of the incident, which is asymmetric to the impact. We always have to negotiate for termination rights for failing to meet SLA standards but, in reality, we never exercise them.

Reality is that in an incident, everyone is focused on fixing the issue, not updating status pages; automated checks fail or have false positives often too. :/

korm · 2025-12-05T10:50:50 1764931850

Yep, every SLA I've ever seen only offers credit. The idea that providers are incentivized to fudge uptime % due to SLAs makes no sense to me. Reputation and marketing maybe, but not SLAs.

The compensation is peanuts. $137 off a $10,000 bill for 10 hours of downtime, or 98.68% uptime in a month, is well within the profit margins.

laurent123456 · 2025-12-05T09:16:05 1764926165

This is weird - at this level contracts are supposed to be rock solid so why wouldn't they require accurate status reporting? That's trivial to implement, and you can even require to have it on a neutral third-party like UptimeRobot and be done with it.

I'm sure there are gray areas in such contracts but something being down or not is pretty black and white.

franga2000 · 2025-12-05T09:26:45 1764926805

> something being down or not is pretty black and white

This is so obviously not true that I'm not sure if you're even being serious.

Is the control panel being inaccessible for one region "down"? Is their DNS "down" if the edit API doesn't work, but existing records still get resolved? Is their reverse proxy service "down" if it's still proxying fine, just not caching assets?

laurent123456 · 2025-12-05T10:48:34 1764931714

I understand there are nuances here, and I may be oversimplifying, but if part of the contract effectively says "You must act as a proxy for npmjs.com" yet the site has been returning 500 Cloudflare errors across all regions several times within a few weeks while still reporting a shining 99.99% uptime, something doesn't quite add up. Still, I'm aware I don't know much about these agreements, and I'm assuming the people involved aren't idiots and have already considered all of this.

remus · 2025-12-05T09:26:05 1764926765

> I'm sure there are gray areas in such contracts but something being down or not is pretty black and white.

Is it? Say you've got some big geographically distributed service doing some billions of requests per day with a background error rate of 0.0001%, what's your threshold for saying whether the service is up or down? Your error rate might go to 0.0002% because a particular customer has an issue so that customer would say it's down for them, but for all your other customers it would be working as normal.
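
To make that concrete, a small Python sketch (the function name `error_rates` and the event shape are assumptions for illustration) that computes the overall error rate next to the per-customer one. A failure confined to one small customer barely moves the global number while that customer sees nothing but errors:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def error_rates(events: Iterable[Tuple[str, bool]]) -> Tuple[float, Dict[str, float]]:
    """Given (customer, succeeded) pairs, return (overall rate, per-customer rates)."""
    total: Counter = Counter()
    errors: Counter = Counter()
    for customer, succeeded in events:
        total[customer] += 1
        if not succeeded:
            errors[customer] += 1
    requests = sum(total.values())
    if requests == 0:
        return 0.0, {}
    overall = sum(errors.values()) / requests
    per_customer = {c: errors[c] / total[c] for c in total}
    return overall, per_customer
```

Which of the two numbers the status page or the SLA is keyed on is exactly the ambiguity being described.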

javier2 · 2025-12-05T16:14:39 1764951279

> something being down or not is pretty black and white

it really isn't. We often have degraded performance for a portion of customers, or just down for customers of a small part of the service. It has basically never happened that our service is 100% down.

lucianbr · 2025-12-05T09:15:07 1764926107

Are the contracts so easy to bypass? Who signs a contract with an SLA knowing the service provider will just lie about the availability? Is the client supposed to sue the provider any time there is an SLA breach?

netdevphoenix · 2025-12-05T09:20:06 1764926406

Anyone who doesn't have any choice financially or gnostically. Same reason why people pay Netflix despite the low quality of most of their shows and the constant termination of tv series after 1 season. Same reason why people put up with Meta not caring about moderation or harmful content. The power dynamics resemble a monopoly.

lucianbr · 2025-12-05T12:03:53 1764936233

Why bother to put the SLA in the contract at all, if people have no choice but to sign it?

Netflix doesn't put in the contract that they will have high-quality shows. (I guess, don't have a contract to read right now.)

ozim · 2025-12-05T09:32:32 1764927152

Most services are not really critical, but customers want to have 99.999% on paper.

Most of the time people will just get by and ignore even a full day of downtime as a minor inconvenience. Loss of revenue for the day - well, you most likely will have to eat that, because going to court and having lawyers fight over it will most likely cost you as much as just forgetting about it.

If your company goes bankrupt because AWS/Cloudflare/GCP/Azure is down for a day or two - guess what - you won't have money to sue them ¯\_(ツ)_/¯ and most likely will have a bunch of more pressing problems on your hands.

heipei · 2025-12-05T09:32:12 1764927132

The client is supposed to monitor availability themselves, that is how these contracts work.

immibis · 2025-12-05T09:26:39 1764926799

The company that is trying to cancel its contract early needs to prove the SLA was violated, which is very easy if the company providing the service also provides a page that says their SLA was violated. Otherwise it's much harder to prove.

8cvor6j844qw_d6 · 2025-12-05T09:11:30 1764925890

I imagine there will be many levels of "approvals" to get the status page actually showing down, since SLA uptime contracts are involved.

javier2 · 2025-12-05T09:18:32 1764926312

I work for a small company. We have no written SLA agreements.

lawnchair · 2025-12-05T09:12:22 1764925942

I have to say that if an incident becomes so overwhelming that nobody can spare even a moment to communicate with customers, that points to a deeper operational problem. A status page is not something you update only when things are calm. It is part of the response itself. It is how you keep users informed and maintain trust when everything else is going wrong.

If communication disappears entirely during an outage, the whole operation suffers. And if that is truly how a company handles incidents, then it is not a practice I would want to rely on. Good operations teams build processes that protect both the system and the people using it. Communication is one of those processes.

onion2k · 2025-12-05T09:24:13 1764926653

> if we are fully down, rest assured there will be no available hands to keep the status page updated

There is no quicker way for customers to lose trust in your service than for it to be down and for them to not know that you're aware and trying to fix it as quickly as possible. One of the things Cloudflare gets right is the frequent public updates when there's a problem.

You should give someone the responsibility for keeping everyone up to date during an incident. It's a good idea to give that task to someone quite junior - they're not much help during the crisis, and they learn a lot about both the tech and communication by managing it.

GoblinSlayer · 2025-12-05T09:15:16 1764926116

You won't be able to update the status page due to failures anyway.

PhilippGille · 2025-12-05T10:17:19 1764929839

Why not? A good status page runs on a different cloud provider in a different region, specifically to not be affected at the same time.

63stack · 2025-12-05T08:56:22 1764924982

This is just business as usual, status pages are 95% for show now. The data center would have to be under water for the status page to say "some users might be experiencing disruptions".

csomar · 2025-12-05T08:58:34 1764925114

They just did an update, and it is bad (in the sense that they are not realizing their clients are down?)

> Investigating - Cloudflare is investigating issues with Cloudflare Dashboard and related APIs.

> These issues do not affect the serving of cached files via the Cloudflare CDN or other security features at the Cloudflare Edge.

> Customers using the Dashboard / Cloudflare APIs are impacted as requests might fail and/or errors may be displayed.

Eikon · 2025-12-05T09:02:04 1764925324

> (in the sense that they are not realizing their clients are down?)

Their own website seems down too https://www.cloudflare.com/

--

500 Internal Server Error

cloudflare

mikkom · 2025-12-05T09:08:20 1764925700

>Customers using the Dashboard / Cloudflare APIs are impacted as requests might fail and/or errors may be displayed.

"Might fail"

yapyap · 2025-12-05T09:11:21 1764925881

well it does say that now, so…

which datacenter got flooded?

rvnx · 2025-12-05T09:28:25 1764926905

> In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary. Dec 05, 2025 - 09:00 UTC

It's a scheduled maintenance, so the SLA should not apply, right?

darccio · 2025-12-05T08:58:26 1764925106

https://updog.ai/status/cloudflare reported the incident 13 minutes ago (at the moment of writing this).

chironjit · 2025-12-05T08:55:51 1764924951

Yeah, their status site reports nothing, but then clicking on some of the links on that site brings you to the 500 error.

mikkom · 2025-12-05T08:55:36 1764924936

Company internal status pages are always like this. When you don't report problems they don't exist!

Havoc · 2025-12-05T09:09:52 1764925792

It’s wild how none of the big corporations can make a functional status page

javier2 · 2025-12-05T09:17:50 1764926270

They could, but accurate reporting is not good for their SLAs

dncornholio · 2025-12-05T09:11:45 1764925905

They can. They don't want to though.

hinkley · 2025-12-05T09:06:24 1764925584

They were intending to start a maintenance window starting 6 minutes ago, but they were already down by then.

dinoqqq · 2025-12-05T09:07:12 1764925632

There is an update:

"Cloudflare Dashboard and Cloudflare API service issues"

Investigating - Cloudflare is investigating issues with Cloudflare Dashboard and related APIs.

Customers using the Dashboard / Cloudflare APIs are impacted as requests might fail and/or errors may be displayed. Dec 05, 2025 - 08:56 UTC

rollulus · 2025-12-05T09:08:50 1764925730

Not weird, that’s tradition by now.

jbuild · 2025-12-05T09:11:22 1764925882

Interesting, I get a 500 if I try to visit coinbase.com, but my WebSocket connections to advanced-trade-ws.coinbase.com are still live with no issues.

emakarov · 2025-12-05T09:14:46 1764926086

probably these websockets are not going through cloudflare

fxd123 · 2025-12-05T08:59:03 1764925143

> Investigating - Cloudflare is investigating issues with Cloudflare Dashboard and related APIs.

They seem to now, a few min after your comment

redm · 2025-12-05T09:06:51 1764925611

I'm much more concerned with customer sites being down, which the update indicates are not impacted. They are.. :/

jonathanlydall · 2025-12-05T08:58:48 1764925128

Now showing a message, posted at 08:56 UTC.

jachee · 2025-12-05T09:08:59 1764925739

Management is always going to take too long (in an engineer’s opinion) to manually throw the alerts on. They’re pressing people for quick fixes so they can claim their SLAs are intact.

devmor · 2025-12-05T09:31:42 1764927102

Yes, the incident report claims this was limited to their client dashboard. It most certainly was not. I have the PagerDuty alerts to prove it...

tjpnz · 2025-12-05T09:15:48 1764926148

They have enough data to at least automate yellow.
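
A minimal sketch of what "automating yellow" could look like, in Python. The baseline, the multipliers, and the function name are hypothetical; nothing here reflects how Cloudflare's status page actually works.

```python
def status_color(error_rate, baseline, yellow_factor=5.0, red_factor=50.0):
    """Map an observed error rate to a status color relative to a baseline.

    yellow_factor and red_factor are made-up multipliers for illustration.
    """
    if error_rate >= baseline * red_factor:
        return "red"
    if error_rate >= baseline * yellow_factor:
        return "yellow"
    return "green"


BASELINE = 0.000001  # hypothetical background error rate (0.0001%)

print(status_color(0.000001, BASELINE))  # green
print(status_color(0.00002, BASELINE))   # yellow
print(status_color(0.5, BASELINE))       # red
```

The point is only that yellow can be driven by data the provider already has, with a human still deciding when to go red.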

rvz · 2025-12-05T09:09:49 1764925789

The AI agents can't help out this time.

rifycombine1 · 2025-12-05T09:17:05 1764926225

maybe we can go back to stackoverflow :)

csomar · 2025-12-05T08:54:55 1764924895

> In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary. Dec 05, 2025 - 07:00 UTC

Something must have gone really wrong.

headmelted · 2025-12-05T08:59:55 1764925195

It's 1AM in San Francisco right now. I don't envy the person having to call Matthew Prince and wake him up for this one. And I feel really bad for the person that forgot a closing brace in whatever config file did this.

artlovecode · 2025-12-05T09:07:38 1764925658

Agreed, I feel bad for them. But mostly because cloudflare's workflows are so bad that you're seemingly repeatedly set up for really public failures. Like how does this keep happening without leadership's heads rolling? The culture clearly is not fit for their level of criticality.

esseph · 2025-12-05T09:27:55 1764926875

> The culture clearly is not fit for their level of criticality

I don't think anyone's is.

everfrustrated · 2025-12-05T09:59:18 1764928758

How often do you hear of Akamai going down? And they host a LOT more enterprise/high value sites than Cloudflare.

There's a reason Cloudflare has been really struggling to get into the traditional enterprise space and it isn't price.

inferiorhuman · 2025-12-05T10:15:34 1764929734

A quick google turned up an Akamai outage in July that took Linode down and two in 2021. At that scale nobody's going to come up smelling like roses. I mostly dealt with Amazon crap at megacorp, but nobody that had to deal with our Akamai stuff had anything kind to say about them as a vendor.

At first blush it's getting harder to "defend" use of Cloudflare, but I'll wait until we get some idea of what actually broke. For the time being I'll save my outrage for the AI scrapers that drove everyone into Cloudflare's arms.

esseph · 2025-12-05T17:10:49 1764954649

The last place I heard of someone deploying anything to Akamai was 15 years ago in FedGov.

Akamai was historically only serving enterprise customers. Cloudflare opened up tons of free plans, new services, and basically swallowed much of that market during that time period.

viraptor · 2025-12-05T09:16:12 1764926172

> I don't envy the person having to call Matthew Prince

They shouldn't need to do that unless they're really disorganised. CEOs are not there for day to day operations.

csomar · 2025-12-05T09:05:30 1764925530

> And I feel really bad for the person that forgot a closing brace in whatever config file did this.

If a closing brace takes your whole infra down, my guess is that we'll see more of this.

shafyy · 2025-12-05T08:57:11 1764925031

Life hack: Announce bug that brings your entire network down as scheduled maintenance.

tommek4077 · 2025-12-05T08:56:45 1764925005

Yes, it’s really ‘weird’ that they refuse to share any details. Completely unlike AWS, for example. As if being open about issues with their own product wouldn’t be in their best interest. /s

timvdalen · 2025-12-05T08:53:52 1764924832

Wow, just plain 500s on customer sites. That's a level of down you don't see that often.

ablation · 2025-12-05T08:58:29 1764925109

Yeah that's a hard 500 right? Not even Cloudflare's 500 branded page like last time. What could have caused this, I wonder.

mckirk · 2025-12-05T09:06:31 1764925591

"A cable!"

"How do you know?"

"I'm holding it!"

Hamuko · 2025-12-05T09:08:53 1764925733

I hope it’s not another Result.unwrap().

singularity2001 · 2025-12-05T09:18:53 1764926333

maybe this would cause rust to adopt exception handling, and by exception I mean panic

maxekman · 2025-12-05T09:06:21 1764925581

A precious glimpse of the less seen page renders.

gwd · 2025-12-05T09:08:48 1764925728

Unlike the previous outage, my server seems fine, and I can use Cloudflare's tunnel to ssh to the host as well.

willtemperley · 2025-12-05T09:02:51 1764925371

Yes Claude is down with a 500 (cloudflare).

disillusioned · 2025-12-05T08:55:35 1764924935

At least they branded it!

Eikon · 2025-12-05T09:09:15 1764925755

Mine [0] seems to be very high latency but no 500s. But yes, most cloudflare-proxied websites I tried seem to just return 500s.

[0] https://www.merklemap.com/

ransom1538 · 2025-12-05T09:04:42 1764925482

So. I don't understand the 5 nines they promote. One bad day and those nines are gone. So then next year you are pushing 2 nines.

kingstnap · 2025-12-05T09:22:56 1764926576

It's just fabricated bullshit. It's how all the companies do it. 99.999% over a year is literally 5 minutes. Or under an hour in a decade, that's wildly unrealistic.

Reddit was once down for a full day and that month they reported 99.5% uptime instead of 99.99% as they normally claimed for most months.

There is this amazing combination of nonsense going on to achieve these kinds of numbers:

1. Straight up fraudulent information on the status page. Reporting incidents as more minor than any internal monitors would claim.

2. If it's working for at least a few percent of customers it's not down. Degraded is not counted.

3. If any part of anything is working then it's not down. For example with the reddit example even if the site was dead as long as the image server is still at 1% functional with some internal ping the status is good.
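
The nines arithmetic above is easy to check. A minimal Python sketch, assuming a 365.25-day year and a 30-day month, with hypothetical helper names:

```python
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365.25 * MINUTES_PER_DAY  # 525,960


def downtime_budget_minutes(availability_percent, period_minutes):
    """Minutes of downtime allowed by an availability target over a period."""
    return (1 - availability_percent / 100) * period_minutes


def availability_after_outage(outage_minutes, period_minutes):
    """Best possible availability (%) for a period containing one outage."""
    return 100 * (1 - outage_minutes / period_minutes)


# Five nines: about 5.26 minutes per year, about 52.6 minutes per decade.
print(round(downtime_budget_minutes(99.999, MINUTES_PER_YEAR), 2))       # 5.26
print(round(downtime_budget_minutes(99.999, 10 * MINUTES_PER_YEAR), 1))  # 52.6

# One full day of downtime caps the year at about 99.73% ("two nines")
# and a 30-day month at about 96.67%.
print(round(availability_after_outage(MINUTES_PER_DAY, MINUTES_PER_YEAR), 2))      # 99.73
print(round(availability_after_outage(MINUTES_PER_DAY, 30 * MINUTES_PER_DAY), 2))  # 96.67
```

By this arithmetic, a month containing a full day of downtime cannot honestly be reported above roughly 96.7%.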

zelphirkalt · 2025-12-05T16:38:13 1764952693

Funnily enough an hour in a decade on a good hoster, with a stable service running on it, occasionally updated by version number ... it might even be possible. Maybe not quite, but close, if one tries. While it seems completely impossible with cloudflare, AWS, and whatnot, who are having outages every other week these days.

jondot · 2025-12-05T09:06:51 1764925611

it's like someone-shut-down-the-power 500s

madjam002 · 2025-12-05T09:20:59 1764926459

Looking forward to the post mortem on this one. We weren't affected (just using the CDN), and people are saying they weren't affected who are using Cloudflare Workers (a previous culprit which we've since moved off), so I wonder what service / API was actually affected that brought down multiple websites with a 500 but not all of them.

Wise was just down which is a pretty big one.

Also odd how some websites were down this time that previously weren't down with the global outage in November

archon810 · 2025-12-05T09:31:21 1764927081

Our locations excluded from Cloudflare WAF were up, but the rest was down. I think WAF took a dump.

reassess_blind · 2025-12-05T09:31:13 1764927073

Yeah it's strange. My sites that are proxied through Cloudflare remained up, but Supabase was taken offline so some backends were down. Either a regional PoP style issue, or a specific API or service had to be used to be affected.

gritzko · 2025-12-05T10:08:25 1764929305

The entire Cloud/SaaS story had a lot of happy-path cost optimization. The particular glitch that triggered the domino effect may be irrelevant relative to the fact that the effect reproduces.

thinkindie · 2025-12-05T09:25:16 1764926716

we were not affected either, and we realised it was Cloudflare because Linear was down and they were mentioning an upstream service. Also Ecosia was affected, and I then realised they might be relying on Cloudflare too.

themly · 2025-12-05T09:32:30 1764927150

CDN was definitely down also. We were widely impacted by it with 500's.

gowthamgts12 · 2025-12-05T09:26:27 1764926787

CDN was also affected for some customers. we were down with 500.

m_mueller · 2025-12-05T09:28:58 1764926938

Maven Repository was down for me for a while, now it recovered.

cryptonym · 2025-12-05T09:34:38 1764927278

> Looking forward to the post mortem

This is becoming a meme.

meandmycode · 2025-12-05T09:38:42 1764927522

This has to be setting off some alarm bells internally, a well written postmortem on an occasional issue, great, but when your postmortem talks about learnings and improvements yet major outages keep happening, it becomes meaningless..

kryptn · 2025-12-05T09:31:41 1764927101

was interesting, some of our stuff failed, but some other stuff that used cloudflare indirectly didn't.

da_grift_shift · 2025-12-05T09:53:46 1764928426

The excuse:

>A change made to how Cloudflare's Web Application Firewall parses requests caused Cloudflare's network to be unavailable for several minutes this morning.

>The change was deployed by our team to help mitigate the industry-wide vulnerability disclosed this week in React Server Components.

>We will share more information as we have it today.

https://www.cloudflarestatus.com/incidents/lfrm31y6sw9q

madjam002 · 2025-12-05T10:02:34 1764928954

It's quite an unfortunate coincidence that React has indirectly been the reason for two recent issues at Cloudflare haha

brobdingnagians · 2025-12-05T10:06:53 1764929213

Two's a coincidence, three's a pattern; I guess we will have to wait until next month to see if it becomes a pattern. Was there a particular aspect of the React Server Components that made it easy to have this problem appear? Would it have been caught or avoided in another framework or language?

GoblinSlayer · 2025-12-05T10:12:40 1764929560

Who sent an xml request?

Palmik · 2025-12-05T08:57:31 1764925051

This is the second time this week: https://news.ycombinator.com/item?id=46140145

The previous one affected European users for >1h and made many Cloudflare websites nearly unusable for them.

AmateurAlert · 2025-12-05T08:52:58 1764924778

https://downdetector.com/ classic

26d0 · 2025-12-05T08:53:29 1764924809

hmm... https://downdetectorsdowndetector.com/

(edit: it's working now (detecting downdetector's down))

vanyauhalin · 2025-12-05T08:58:59 1764925139

So,

This one is green: https://downdetectorsdowndetector.com

This one is not opening: https://downdetectorsdowndetectorsdowndetector.com

This one is red: https://downdetectorsdowndetectorsdowndetectorsdowndetector....

Recursing · 2025-12-05T09:07:13 1764925633

https://en.wikipedia.org/wiki/Fundamental_theorem_of_softwar...

superdisk · 2025-12-05T09:21:36 1764926496

Lol. The fact that the 4x one actually works and is correctly reporting that the 3x one is down actually makes this a lot funnier to me.

altmanaltman · 2025-12-05T09:23:15 1764926595

it's like they didn't fully think it through/expect people to actually use it so soon

mrducksy · 2025-12-05T09:20:36 1764926436

It’s down detectors all the way down!

ssolarsystem1 · 2025-12-05T09:26:00 1764926760

downdetectorsdowndetectors didn't detect breakdown of downdetectors with 500 Error

xyproto · 2025-12-05T08:55:16 1764924916

A wrong downdetectordowndetector is worse than a 500 one. :D

deveesh_shetty · 2025-12-05T08:56:23 1764924983

You had one job.

manyaoman · 2025-12-05T08:57:11 1764925031

So down²detector was fake all along?

andy_ppp · 2025-12-05T08:59:51 1764925191

https://www.youtube.com/watch?v=OC06Z6lCB_Q

Andugal · 2025-12-05T08:56:38 1764924998

So DownDetector is down, but DownDetectorDownDetector does not detect it... We probably need one more DownDetector. (no)

namjh · 2025-12-05T09:00:28 1764925228

Yes, we do have one [1], but unfortunately it looks like it is not checking integrity, just reachability.

[1]: https://downdetectorsdowndetectorsdowndetector.com/

halgir · 2025-12-05T08:59:47 1764925187

We have one. But according to Down Detector's Down Detector's Down Detector's Down Detector, that's also down.

Dilettante_ · 2025-12-05T09:04:20 1764925460

Well Down Detector's Down Detector isn't down...What we might need is a Down Detector's Down Detector Validator

O4epegb · 2025-12-05T09:02:22 1764925342

This is a fake detector that just has frontend logic for mocking realistic data, you can easily see it in the source code.

maxlin · 2025-12-05T08:59:37 1764925177

>half the internet is down >downdetector is down >downdetector down detector reports everything is fine

software was a mistake

aurareturn · 2025-12-05T08:54:39 1764924879

Ehh, so down detector for down detector is up but it is inaccurate.

aroman · 2025-12-05T08:59:25 1764925165

great news, schrodingersdetector.com is available!

xx_ns · 2025-12-05T08:56:03 1764924963

At least it's still right in spite of being down.

asmor · 2025-12-05T08:58:10 1764925090

That's the 30% vibe code they promised us.

Cynicism aside, something seems to be going wrong in our industry.