Facebook, You're Doing It Wrong

Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we've had in over four years, and we wanted to first of all apologize for it. We also wanted to provide much more technical detail on what happened and share one big lesson learned.

The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn't work when the persistent store itself is invalid.
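The check-and-replace behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual system: `read_config`, `is_valid`, and `load_from_store` are hypothetical names, and the validity rule (a non-empty string) is an assumption made only for the example.

```python
from typing import Callable, Dict, Optional


def is_valid(value: Optional[str]) -> bool:
    """Hypothetical validity rule: any non-empty string counts as valid."""
    return bool(value)


def read_config(
    key: str,
    cache: Dict[str, str],
    load_from_store: Callable[[str], str],
) -> Optional[str]:
    """Return a config value, refreshing the cache from the persistent store
    whenever the cached copy is missing or invalid."""
    value = cache.get(key)
    if is_valid(value):
        return value
    # Cached copy is invalid: query the persistent store and overwrite it.
    fresh = load_from_store(key)
    cache[key] = fresh
    return fresh
```

When only the cache is bad, the first read repairs it and later reads never touch the store. When the stored value is itself invalid, every read falls through to the store, which is the failure mode the post goes on to describe.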

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.
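A toy simulation can make the feedback loop concrete. Everything here is an assumption for illustration: a single shared cache key, a fixed number of clients that all read once per tick, a database that serves at most `db_capacity` queries per tick, and errors being processed after successes within a tick. The numbers are invented and are not from the incident.

```python
from typing import Dict, List


def simulate(clients: int, db_capacity: int, ticks: int,
             delete_on_error: bool) -> List[int]:
    """Return the number of database queries issued in each tick."""
    cache: Dict[str, str] = {}  # shared cache; the key starts out deleted
    load: List[int] = []
    for _ in range(ticks):
        # Every client reads the cache; a miss sends that client to the DB.
        queries = 0 if "config" in cache else clients
        load.append(queries)
        served = min(queries, db_capacity)
        failed = queries - served
        if served:
            cache["config"] = "valid"   # a successful query repopulates the key
        if failed and delete_on_error:
            cache.pop("config", None)   # an error is treated as "invalid value"
    return load


print(simulate(100, 10, 4, delete_on_error=True))   # [100, 100, 100, 100]
print(simulate(100, 10, 4, delete_on_error=False))  # [100, 0, 0, 0]
```

With deletion on error, the clients that fail keep wiping out the value that the successful clients just restored, so the load never drops. If errors leave the cache alone, one successful query is enough to end the stampede in this simplified model.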

The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we've turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
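The post does not say what the new design will look like. As one generic example of a pattern that dampens retry storms, and not a description of Facebook's systems, clients can wait an exponentially growing, randomized interval before retrying a failed query. The function name and default parameters below are assumptions chosen for the sketch.

```python
import random


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0,
                  rng: random.Random = random.Random()) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`; picking a random
    point below the ceiling spreads retries out so that clients which
    failed together do not all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)
```

A client would sleep for `backoff_delay(n)` seconds after its n-th consecutive failure, which lets an overloaded database see falling rather than constant load while it recovers.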

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.