Why does Twitter crash all the time? Various reasons, apparently:
- too many connections
- errant API project eating too many Jabber resources
- past, present, and future architecture challenges
- something occurring in between various databases, caches, web servers, daemons
- routine database update
- caching services required unscheduled restart
Details from Twitter’s blog below. Apparently, Twitter is also tweeting them (not that anyone can listen).
The good news: Getting swamped by demand is actually a high-quality problem with encouraging precedents. We dinosaurs still remember the days when 1996-era Netscape, 1997-era AOL, and 1998-era eBay occasionally collapsed under the weight of user demand, infuriating millions. If Twitter can get its act together, it should be in great shape.
In the meantime, the company wants you to know that it doesn’t really know what the hell the problem is. At least that’s how we read these recent entries…
Around 11 am in San Francisco, our main database, db006, crashed because of too many connections. We had to put the service into an unscheduled maintenance mode to recover. Folks will see degraded service for the next few hours.
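Twitter hasn’t published any code, but “too many connections” is a classic failure mode: databases like MySQL refuse new clients once their connection limit is hit, and a traffic spike can blow past it. A minimal sketch of the usual defense — capping connections on the client side so callers fail fast instead of piling onto an overloaded server (all names here are hypothetical, not Twitter’s):

```python
import threading

class BoundedPool:
    """Client-side cap on concurrent DB connections.

    Hypothetical sketch: bounding connections in the application
    keeps the database's own limit (e.g. MySQL's max_connections)
    from being exhausted during a traffic spike.
    """
    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.in_use = 0

    def acquire(self, timeout=0.1):
        # Wait briefly for a free slot, then fail fast rather than
        # queueing yet another connection against a saturated server.
        if not self._slots.acquire(timeout=timeout):
            raise RuntimeError("too many connections")
        with self._lock:
            self.in_use += 1

    def release(self):
        with self._lock:
            self.in_use -= 1
        self._slots.release()

pool = BoundedPool(limit=2)
pool.acquire()
pool.acquire()
try:
    pool.acquire()  # a third caller is refused, not queued forever
    overflowed = False
except RuntimeError:
    overflowed = True
```

The point of failing fast is that a rejected request degrades gracefully, while an unbounded pileup takes the whole database down — which is roughly what the entry above describes.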
Friday, May 23
Too much Jabber!
We found an errant API project eating way too much of our Jabber (a flavour of instant messaging) resources. This activity (which we’ve corrected) had the effect of overloading our main database, resulting in the error pages and slowness most people are now encountering.
We’re bringing services back online now. Some will be slower than others for a while, and we’ll be watching IM and IM-based API clients very closely. We’ll also be taking steps to avoid this behaviour in the future. Thanks for your patience!
Update: We’re turning off IM services for the evening (Friday) to allow for the system to recover. We hope to turn things back on Saturday.
Thursday, May 22
Wednesday, May 21
I have this graph up on my screen all the time. It should be flat. This week has been rough.
We’ve gone through our various databases, caches, web servers, daemons, and despite some increased traffic activity across the board, all systems are running nominally. The truth is we’re not sure what’s happening. It seems to be occurring in between these parts.
We’re busy working on instrumenting and adding meters to provide visibility into what’s slowing Twitter down. We’ll use this data both to alleviate the current woes and to help inform our long-term architecture work to make Twitter a utility service people can count on. We’ve definitely failed that aim this week.
Thanks for your patience during these current frustrations (and those to come) as we figure out how to work the kinks out. Thanks also for speaking up: we’re listening. In addition to providing visibility into our systems, we’re working to give everyone greater visibility into our roadmap to solve these ongoing problems. More to come.
Tuesday, May 20
Downtime is not good. We caused a database to fail during a routine update early this afternoon. We switched to a replica and expect this recovery to take place quickly. We’re all working on it and watching right now as Twitter gets back up to speed. We have a thread open on our support forum which we’ll update when we have more details to share. Getting our act together is something we continue to work on as we grow our company and our service.
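“We switched to a replica” is the standard recovery move when a primary database fails: reads get redirected to a copy while the primary is repaired. Twitter hasn’t described its setup, so here’s only a minimal, hypothetical sketch of the idea — a wrapper that promotes the replica and retries once when the primary stops answering:

```python
class Failover:
    """Hypothetical sketch of primary/replica failover: if the
    primary raises a connection error, promote the replica and
    retry the query once."""
    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica
        self.active = primary

    def query(self, q):
        try:
            return self.active(q)
        except ConnectionError:
            # Primary is down (e.g. mid-maintenance): promote the
            # replica and retry the same query once.
            self.active = self.replica
            return self.active(q)

def dead_primary(q):
    raise ConnectionError("primary down during maintenance")

def replica(q):
    return f"replica answered: {q}"

db = Failover(dead_primary, replica)
result = db.query("select 1")
```

Real failover is messier — replicas lag behind the primary, so some recent writes can be briefly invisible — which may be part of why Twitter says only that it “expects this recovery to take place quickly.”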
Wednesday, May 14
What Happened Today?
It is not entirely to do with the Democrats, Space Aliens, Mysterious Men in Black, or Arugula. In fact, this afternoon’s service interruption does not have a very exciting explanation.
Part of our caching service required an unscheduled restart. That means a slow rebuilding of data. You may notice some of the normal browsing related features (such as pagination) are missing while we repopulate the caching service. This is so we can get it done quicker.
Update: It’s not Groundhog Day but it sure feels like it. We are recovering today from the same problem as yesterday. This service interruption is our own fault. If you have questions that are only satisfied by technical answers please join our technical discussion forum on Google groups.
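The “slow rebuilding of data” after a cache restart is the cold-cache problem: an empty cache means every request falls through to the slow backing store until the data is repopulated, which is exactly when a site feels sluggish. A tiny read-through sketch (the loader and key names are made up for illustration):

```python
class ReadThroughCache:
    """Cold-cache sketch: after an unscheduled restart the cache is
    empty, so every request takes the slow path to the backing store
    until the working set is repopulated."""
    def __init__(self, loader):
        self._store = {}
        self._loader = loader
        self.misses = 0

    def get(self, key):
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._loader(key)  # slow path: hit the DB
        return self._store[key]

def slow_db_lookup(key):
    # Stand-in for an expensive database query.
    return f"row-{key}"

cache = ReadThroughCache(slow_db_lookup)

# Cold cache: repeated keys still only miss once each.
first = [cache.get(k) for k in ("a", "b", "a")]
cold_misses = cache.misses

# Warm cache: the same keys are now served from memory.
cache.get("a")
cache.get("b")
warm_misses = cache.misses - cold_misses
```

Disabling extras like pagination during the rebuild, as the entry describes, shrinks the set of keys that must be repopulated — fewer cold misses, faster recovery.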
Wednesday, May 7
Doctor, doctor, give me the news
The database is back up. We’re conducting a series of health and sanity checks to make sure everything is working the way it should. Our first quick assessment is that, yes, this problem is over.