Amazon Web Services crashed yesterday, taking thousands of Web sites, mostly smaller startups, with it. The outage is now in its second day, and while the company says service has been partially restored and some sites like Reddit are back up, others like Quora and Box Office Mojo are still offline.
Amazon has not been very forthcoming about what went wrong, saying only that something caused the system to go crazy and start writing tons of unnecessary backups, which ate up storage space.
But customers are starting to figure out that Amazon didn’t live up to one of its promises. Amazon is supposed to isolate parts of its data centres in different “availability zones,” so an outage in one zone doesn’t affect others. But in this case, the isolation didn’t work.
Justin Santa Barbara of FathomDB has a good explanation of what happened on Posterous. As he explains, Amazon has multiple physical data centres, including one in Northern Virginia, which is where the failure happened. The company calls these regions.
In theory, companies could distribute their services in multiple AWS regions — for example, have a master SQL database in one region and its backup in another — but it’s slow and unreliable, as it requires sending data over the public Internet. So almost nobody does this.
So Amazon also divides each of its physical data centres into availability zones. These are supposed to be redundant and separate, yet located in physical proximity so you don’t have to send data over the public Internet if you want to use them for backup.
In this case, however, a physical failure in one availability zone took down others. So even companies that had planned redundant systems found themselves out of luck.
Santa Barbara blames Amazon for failing to follow its own specifications: “Whether that happened through incompetence or dishonesty or something a lot more forgivable entirely, we simply don’t know at this point.”
He also notes that this shouldn’t be taken as a failure of “the cloud” in general — but that it’s important to choose your cloud provider carefully. Amazon has been pretty reliable until now, but this outage could cause some customers to reconsider.