So we know it's Leap Day, but what time is it?
On Wednesday, February 29th, Windows Azure experienced about 8 hours of downtime for some of its critical services. The cause? Leap Day.
- a component of Windows Azure experienced a worldwide outage for eight hours
- a series of outages that affected multiple aspects of the system
- prevented customers from carrying out management operations for technology that uses the cloud management service
- issue appears to be due to a time calculation that was incorrect for the leap year
- outage apparently was triggered by a key server in Ireland housing a certificate that expired at midnight on Feb. 28
- Azure users posted a stream of critical comments about the outages to the service's official forums
- a customer described the problem as an "admin nightmare" and said they couldn't understand how such an important system could go down.
- Microsoft blamed the Azure management problems on a "cert issue triggered on 2/29/12"
- the service has not been around for four years yet, and on its first leap year day, it collapsed
- initial problems propagated to different territories, and live customer-facing sites became unavailable
- in some markets, Microsoft had promoted its Azure cloud service using the slogan “I laugh in the face of unpredictability”.
- "Microsoft will have to start its cloud marketing from scratch, to rebuild a level of trust that has now crumbled"
Perhaps Leap Day wasn't predictable for Microsoft (although experts tell me that it has been known to occur almost every 4 years), and those time calculations can indeed be tricky.
But perhaps Microsoft should have tested more.
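To show how easily this kind of date arithmetic goes wrong, here is a small sketch in Python. It is purely my own illustration and not Microsoft's code: the reports above only say that a time calculation was incorrect for the leap year. The helper names `naive_one_year_later` and `safe_one_year_later` are hypothetical, and falling back to February 28 is just one possible policy.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Hypothetical buggy helper: bump the year and keep month and day.
    # This works on every day of the calendar except February 29.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    # Hypothetical fix: if the same month/day does not exist next year
    # (only possible for February 29), fall back to February 28.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


leap_day = date(2012, 2, 29)

# The naive version fails only on Leap Day, which is why a test suite
# that never sets the clock to February 29 will not catch it.
try:
    naive_one_year_later(leap_day)
    naive_failed = False
except ValueError:
    naive_failed = True

print("naive version failed on Leap Day:", naive_failed)
print("safe version returns:", safe_one_year_later(leap_day))
```

A test that feeds February 29 into every "add a year" style calculation is cheap to write, and it is exactly the sort of boundary case that is worth checking before the calendar checks it for you.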
This article originally appeared in my blog: All Things Quality
My name is Joe Strazzere and I'm currently a Director of Quality Assurance.
I like to lead, to test, and occasionally to write about leading and testing.
Find me at http://AllThingsQuality.com/.