August 17, 2007

Perhaps They Should Have Tested More - Skype

Skype, the computer-based internet phone service, has been down for much of the past two days, affecting about 220 million customers worldwide.

The company blames a fault in an algorithm that has been part of every copy of Skype downloaded since the service started in 2003.

"This problem occurred because of a deficiency in an algorithm within Skype networking software. This controls the interaction between the user's own Skype client and the rest of the Skype network."

"Telecoms engineering is no different to any other product development – there is always a commercial penalty to pay by compromising reliability or quality. You still broadly get what you pay for in telecoms."

"The Skype network uses a so-called peer-to-peer infrastructure, meaning that calls are routed through other users’ computers instead of a central hub. But it does have servers around the world, known as supernodes, that manage access to the network. A flaw in a crucial piece of software that connects users to these servers appears to have been the source of the problem.

"Skype engineers said the flaw existed in every copy of the Skype software that had been downloaded since the service’s start in 2003.

"Skype executives said they still did not know why the error, sitting dormant for four years, suddenly crashed the network. They cited problems with the Internet backbone in some parts of the world as a possible contributing factor."

Update: August 20, 2007

According to the Skype Heartbeat site, service is now back to normal.

And while Skype appears to acknowledge that it had a bug, it points to a Windows Update (and the subsequent reboots by many users) as the triggering event:

"The high number of restarts affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact.

"Normally Skype’s peer-to-peer network has an inbuilt ability to self-heal, however, this event revealed a previously unseen software bug within the network resource allocation algorithm which prevented the self-healing function from working quickly. Regrettably, as a result of this disruption, Skype was unavailable to the majority of its users for approximately two days."
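
To see why a wave of simultaneous restarts might matter, here is a toy sketch of a log-in flood. It is not Skype's algorithm, which has not been published; the function name, the tick-based model, and the capacity figure are all assumptions made up for illustration. It only shows that the same number of clients can be harmless when their log-ins are spread out and can build a large backlog when they all arrive at once.

```python
def simulate_logins(clients_per_tick, capacity_per_tick):
    """Return the backlog of pending log-in requests after each tick.

    Hypothetical toy model, not Skype's real behavior:
    clients_per_tick  -- how many clients start logging in at each tick
    capacity_per_tick -- how many log-ins the network can complete per tick
    """
    backlog = 0
    history = []
    for arriving in clients_per_tick:
        backlog += arriving                          # new requests join the queue
        backlog -= min(backlog, capacity_per_tick)   # serve as many as capacity allows
        history.append(backlog)
    return history


# The same 1,000 clients, either spread over ten ticks or all restarting at once.
spread_out = simulate_logins([100] * 10, capacity_per_tick=120)
all_at_once = simulate_logins([1000] + [0] * 9, capacity_per_tick=120)

print("peak backlog, spread out: ", max(spread_out))
print("peak backlog, all at once:", max(all_at_once))
```

In this simplified model the capacity stays fixed. Skype's statement suggests the real situation was worse, since the restarting machines were also part of the peer-to-peer resources that would normally absorb the load.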