Facebook: Sorry, Something Went Wrong

Earlier today Facebook was down or unreachable for many of you for about 2.5 hours. This is the worst outage we have had in over four years, and we want to start by apologizing for it. We also want to give more technical detail on what happened and share one big lesson learned.

What Went Wrong

The key flaw that made this outage so severe was the unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it does not work when the value in the persistent store is itself invalid.
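As an illustration only, the read path described above can be sketched in Python. Everything here is hypothetical (the `ConfigClient` class, the dictionaries standing in for the cache and the persistent store, and the `is_valid` predicate); it is a minimal model of the idea, not the production system.

```python
class ConfigClient:
    """Hypothetical client that self-heals invalid cached config values."""

    def __init__(self, cache, store, is_valid):
        self.cache = cache          # dict standing in for the cache tier
        self.store = store          # dict standing in for the persistent store
        self.is_valid = is_valid    # validation predicate (assumed)
        self.store_queries = 0      # counts trips to the persistent store

    def get(self, key):
        value = self.cache.get(key)
        if value is not None and self.is_valid(value):
            return value            # fast path: the cache holds a valid value
        # Cached value is missing or invalid: refresh from the persistent store.
        self.store_queries += 1
        value = self.store[key]
        self.cache[key] = value
        return value


def count_store_queries(store_value, reads=1000):
    """Start with a bad cached value and count how many reads hit the store."""
    client = ConfigClient(cache={"limit": -1}, store={"limit": store_value},
                          is_valid=lambda v: v > 0)
    for _ in range(reads):
        client.get("limit")
    return client.store_queries
```

In this model, `count_store_queries(10)` returns 1, because a single refresh repairs the cache. `count_store_queries(-5)` returns 1000, because a persistent value that fails validation turns every read into a store query.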

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error while querying one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that did not allow the databases to recover.
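The loop can be seen in a toy, deterministic model. The assumptions are loud ones: a single shared cache key, a database that serves at most `db_capacity` queries per tick and fails the rest, and failures that land after successes so that any failure leaves the key deleted. None of these numbers or names come from the real system.

```python
def simulate(clients, db_capacity, ticks):
    """Return the number of database queries issued on each tick."""
    key_cached = False              # the shared cache key starts out deleted
    history = []
    for _ in range(ticks):
        # Every client that misses the shared key queries the database.
        queries = 0 if key_cached else clients
        failed = max(0, queries - db_capacity)
        if queries:
            # Successes repopulate the key, but any failure is read as an
            # "invalid value" and deletes it again (assumed to land last).
            key_cached = failed == 0
        history.append(queries)
    return history
```

Under these assumptions, `simulate(1000, 100, 5)` returns `[1000, 1000, 1000, 1000, 1000]`: the load never falls, even though the stored value is fine. With traffic below capacity, `simulate(50, 100, 5)` returns `[50, 0, 0, 0, 0]`, and the cache recovers after one tick.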

The way to stop the feedback loop was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.
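The post does not say how people were let back in. One common way to ramp traffic gradually, shown here purely as an assumption, is to hash each user into a stable bucket and admit only the buckets below a percentage that operators raise over time. The function name and the choice of CRC32 are illustrative.

```python
import zlib


def is_admitted(user_id: str, percent_admitted: int) -> bool:
    """Admit a user if their stable bucket (0-99) is below the current percentage."""
    # CRC32 is deterministic across processes, so a user keeps the same bucket.
    bucket = zlib.crc32(user_id.encode("utf-8")) % 100
    return bucket < percent_admitted
```

Because buckets are stable, anyone admitted at 10 percent is still admitted at 50 percent, so raising the percentage only ever adds users and never ejects someone mid-session.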

This got the site back up and running today, and for now we have turned off the system that attempts to correct configuration values. We are exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
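The post does not describe the new design, so the following is only a sketch of two well-known patterns that dampen this kind of loop: keep serving the last known good value when a refresh fails, and back off exponentially with jitter before trying again. `SafeConfigClient`, `StoreUnavailable`, and the delay parameters are all hypothetical.

```python
import random


class StoreUnavailable(Exception):
    """Raised by the (hypothetical) fetch function when the store cannot answer."""


class SafeConfigClient:
    def __init__(self, fetch, is_valid, base_delay=1.0, max_delay=60.0):
        self.fetch = fetch              # callable(key) -> value, may raise
        self.is_valid = is_valid        # validation predicate (assumed)
        self.base_delay = base_delay    # first backoff window, in seconds
        self.max_delay = max_delay      # cap on the backoff window
        self.last_good = {}             # last value that passed validation
        self.failures = {}              # consecutive failed refreshes per key
        self.next_attempt = {}          # earliest time of the next refresh

    def refresh(self, key, now):
        """Try to refresh a key; on any failure, serve the last good value."""
        if now < self.next_attempt.get(key, 0.0):
            return self.last_good.get(key)      # backing off: no store query
        try:
            value = self.fetch(key)
        except StoreUnavailable:
            value = None                        # an error is not an invalid value
        if value is not None and self.is_valid(value):
            self.last_good[key] = value
            self.failures[key] = 0
            self.next_attempt[key] = 0.0
            return value
        # Error or invalid value: wait longer after each consecutive failure,
        # with jitter so that clients do not retry in lockstep.
        n = self.failures.get(key, 0) + 1
        self.failures[key] = n
        delay = min(self.max_delay, self.base_delay * 2 ** (n - 1))
        self.next_attempt[key] = now + random.uniform(delay / 2, delay)
        return self.last_good.get(key)
```

The design choice that matters is that a failed or invalid refresh never deletes anything. The client keeps the old value and queries the store less often while it is struggling, which removes the amplification described above.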

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.