What is Wrong with Facebook today

Early today Facebook was down or unreachable for most of you for about 2.5 hours. This is the worst outage we've had in over four years, and we wanted first of all to apologize for it. We also wanted to provide more technical detail on what happened and share one big lesson learned.

The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn't work when the value in the persistent store is itself invalid.

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
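To make the failure mode concrete, here is a minimal Python sketch of a cache-repair routine of this kind. It is an illustration only, not the production code: `PersistentStore`, `is_valid`, `get_config`, the key `max_items`, and the "positive integer" validity rule are all hypothetical names and assumptions.

```python
class PersistentStore:
    """Stand-in for the database cluster; counts the queries it receives."""

    def __init__(self, values):
        self.values = dict(values)
        self.queries = 0

    def read(self, key):
        self.queries += 1
        return self.values[key]


def is_valid(value):
    # Hypothetical validity rule: a configuration value must be a positive int.
    return isinstance(value, int) and value > 0


def get_config(key, cache, store):
    """Return the cached value, repairing it from the store if it looks invalid."""
    value = cache.get(key)
    if not is_valid(value):
        value = store.read(key)  # one query to the persistent store
        cache[key] = value       # overwrite the bad cache entry
    return value


# Transient cache problem: one repair query, then the cache serves every read.
store = PersistentStore({"max_items": 50})
cache = {"max_items": -1}
good_results = [get_config("max_items", cache, store) for _ in range(1000)]
good_queries = store.queries

# Invalid persistent value: the repair never sticks, so every read hits the store.
bad_store = PersistentStore({"max_items": -1})
bad_cache = {}
bad_results = [get_config("max_items", bad_cache, bad_store) for _ in range(1000)]
bad_queries = bad_store.queries
```

In the first case the store sees a single query for a thousand reads. In the second, every one of the thousand reads becomes a store query, which is the amplification described above.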

To make matters worse, every time a client got an error attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.

The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.
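The feedback loop, and why cutting traffic breaks it, can be shown with a deliberately simplified simulation. The model is an assumption made for illustration: all clients in a tick see the same cache state, the database serves a fixed number of queries per tick, and any failed query in a tick leaves the cache key deleted. The function name and the numbers are hypothetical.

```python
def simulate(ticks, clients_per_tick, db_capacity, cache_has_key=False):
    """Return the number of database queries issued in each tick.

    Simplified model: every client in a tick sees the same cache state, and
    any failed query in that tick ends with the cache key deleted, mirroring
    "an error is treated as an invalid value".
    """
    history = []
    for _ in range(ticks):
        if cache_has_key:
            history.append(0)       # cache hit for everyone, no database load
            continue
        queries = clients_per_tick  # every client tries to repair the key
        errors = max(0, queries - db_capacity)
        # Successful clients write the key; clients that got an error delete it.
        cache_has_key = errors == 0
        history.append(queries)
    return history


# The persistent value is already fixed, but demand exceeds capacity,
# so the key is deleted again on every tick and the load never drops.
overloaded = simulate(ticks=5, clients_per_tick=1000, db_capacity=100)

# Traffic throttled below capacity: one round of queries, then the cache holds.
throttled = simulate(ticks=5, clients_per_tick=80, db_capacity=100)
```

Under these assumptions the overloaded run issues 1000 queries on every tick indefinitely, while the throttled run issues 80 queries once and then none, which is why traffic had to be reduced before the cache could stay populated.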

This got the site back up and running today, and for now we've turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
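As one illustration of what dealing gracefully with feedback loops can look like, the sketch below shows a common general pattern. It is not a description of any design Facebook adopted. It treats a store error as "unknown" rather than "invalid", never deletes the cache entry on error, serves the last known good value, and spaces out repair attempts with exponential backoff plus jitter. All names (`ConfigReader`, `StoreError`, `backoff_delay`, `is_valid`) and the validity rule are hypothetical.

```python
import random
import time


class StoreError(Exception):
    """Raised when the persistent store cannot answer (hypothetical name)."""


def is_valid(value):
    # Hypothetical validity rule, used only for this illustration.
    return isinstance(value, int) and value > 0


def backoff_delay(attempt, base=0.1, cap=30.0):
    """Exponential backoff with jitter: half fixed delay, half random."""
    ceiling = min(cap, base * 2 ** attempt)
    return ceiling / 2 + random.uniform(0, ceiling / 2)


class ConfigReader:
    def __init__(self, cache, read_store, clock=time.monotonic):
        self.cache = cache
        self.read_store = read_store
        self.clock = clock
        self.last_good = {}  # key -> last value that passed validation
        self.failures = {}   # key -> consecutive failed repair attempts
        self.retry_at = {}   # key -> earliest time of the next repair attempt

    def get(self, key):
        value = self.cache.get(key)
        if is_valid(value):
            self.last_good[key] = value
            return value
        if self.clock() >= self.retry_at.get(key, float("-inf")):
            try:
                fresh = self.read_store(key)
            except StoreError:
                fresh = None  # an error is not proof of an invalid value
            if is_valid(fresh):
                self.cache[key] = fresh
                self.last_good[key] = fresh
                self.failures.pop(key, None)
                self.retry_at.pop(key, None)
                return fresh
            # Failed repair: wait longer before asking the store again.
            attempt = self.failures.get(key, 0)
            self.failures[key] = attempt + 1
            self.retry_at[key] = self.clock() + backoff_delay(attempt)
        # Serve the last known good value (None if there never was one).
        return self.last_good.get(key)
```

With this shape, a failing or misconfigured store sees a trickle of spaced-out repair attempts per client instead of one query per read, and clients keep working from the last value that passed validation.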

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.