This morning, Skype CIO Lars Rabbe explained why the service went down last week, blaming a combination of server overloads and a bug in the Skype client software. What he doesn’t say: the way Skype is built makes these kinds of cascading failures almost inevitable.
As Rabbe explains, the problem started when some servers for Skype’s offline instant messaging feature went down. That normally wouldn’t have been a problem: the company would simply take them out of service and send users to backup servers. But a bug in one version of the Skype software for Windows caused it to choke on the delayed responses from the failed servers. The resulting crashes affected about 20% of all Skype users.
So how did a 20% failure become a 100% failure? This is the tricky part. Most other real-time communications services, including Google Voice, public IM services, and corporate IM products like Microsoft’s Lync, route connections through centralized servers. Once the call is connected, the clients might begin communicating directly. But call setup, routing, and teardown are handled by the company’s own servers.
But Skype is a pure peer-to-peer service. As this 2003 paper from Columbia University (PDF) explains, it was developed by the team behind Kazaa, which is better known for its peer-to-peer file-trading software. That means that calls are routed through directories living on the computers of other Skype users.
Skype calls these super-important directory PCs “supernodes.” The initial Skype client failure, followed by millions of users restarting Skype at the same time, caused about 25% of them to go down. That shifted a huge load onto the remaining supernodes, which crashed in turn, and so on until the entire system went down.
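The snowball effect described above can be illustrated with a toy model. This is only a sketch, not Skype's actual protocol: the node count, the capacities, the assumption that load is spread evenly across surviving supernodes, and the function names `simulate_cascade` and `knock_out` are all invented for illustration. Only the 25% figure comes from Rabbe's account.

```python
def simulate_cascade(capacities, total_load):
    """Return the number of surviving nodes after each round.

    Toy assumption: total_load is shared equally by the surviving nodes,
    and a node crashes when its share exceeds its capacity.
    """
    alive = list(capacities)
    history = [len(alive)]
    while alive:
        n = len(alive)
        # A node survives if capacity >= total_load / n.
        # Written as a multiplication to stay in integer arithmetic.
        survivors = [c for c in alive if c * n >= total_load]
        if len(survivors) == n:
            break  # stable: nobody else crashed this round
        alive = survivors
        history.append(len(alive))
    return history


def knock_out(capacities, every):
    """Remove every `every`-th node to model the initial client failure."""
    return [c for i, c in enumerate(capacities) if i % every != 0]


# Hypothetical network: 100 supernodes with capacities 120..219 units,
# carrying 10,000 units of load in total (100 units each when healthy).
CAPACITIES = [120 + i for i in range(100)]
TOTAL_LOAD = 10_000

if __name__ == "__main__":
    # Losing 10% of the supernodes is absorbed by the spare capacity.
    print(simulate_cascade(knock_out(CAPACITIES, 10), TOTAL_LOAD))
    # Losing 25% pushes the weakest survivors over their limit, and each
    # round of crashes raises the load on whoever is left.
    print(simulate_cascade(knock_out(CAPACITIES, 4), TOTAL_LOAD))
```

In this made-up network, losing one supernode in ten leaves the system stable at 90 nodes, while losing one in four drags the count from 75 to 65, 50, 15 and finally zero. The specific numbers mean nothing. The point is that once the initial loss eats through the spare capacity, each round of failures makes the next one worse.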
The important point here: Skype is using its users’ PCs and bandwidth to run its service. This helps Skype route calls efficiently, reducing latency (the annoying delay in some voice services), and improving reliability. Not incidentally, it also helps Skype keep costs down.
But this design is also vulnerable to bugs and unexpected user behavior. If anything takes a bunch of Skype clients offline at once (a software bug, malware that crashes many PCs at the same time, as Code Red and Nimda did back in 2001, or a widespread power outage), a cascading failure is far more likely than it would be if Skype had direct control over routing calls.
The lesson, once again: be careful about relying on consumer-grade services for vital business needs.
Thanks to reader Ivaylo Lenkov of SiteKreator.com for the pointer to the Columbia paper.