Last week, Amazon’s side business of selling online computing resources suffered a major failure, and businesses of all sizes were significantly affected.
In some cases, a company’s internal systems, as well as its Web sites, went down for several days, and even as this is written, some sites are still affected.
Amazon, which entered the cloud computing business five years ago, has been a leader in a field that has become popular as more and more organizations look to move computing from their own data centers onto the Internet.
With cloud computing, companies are essentially purchasing raw computing power and storage. They do not need to invest in computers or operating systems; that part is handled by Amazon and its competitors.
Two major trends have evolved in this effort. One is a utility computing model à la Amazon; the other looks to sell companies cloud technology that the customer itself owns and manages (the so-called private cloud).
In recent years, more and more data storage and processing have been off-loaded into the cloud without the appropriate precautions being taken to ensure data accessibility and redundancy.
Amazon has data centers in North America, Europe, and Asia, and the problems seem to have centered on a major data center in Northern Virginia. As of last Thursday, dozens of companies, ranging from Quora, a question-and-answer site, to Foursquare, a social networking site, reported downed Web sites, service interruptions, and an inability to access data stored in Amazon’s cloud.
Some companies using Amazon’s servers were unaffected because they designed their systems to leverage Amazon’s redundant cloud architecture, which means that a malfunction in one data center would not render a system or Web site inaccessible. Less sophisticated companies typically have neither the budget nor the know-how to do this and therefore paid the price in the form of downed systems and sites.
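To make the idea of designing around a single data center concrete, here is a minimal sketch in Python of a client that falls back to another site when the first one is unreachable. The site names, the `fetch_with_failover` function, and the simulated outage are all hypothetical and purely illustrative; this is not Amazon's API, nor a description of how any of the companies named above built their systems.

```python
from typing import Callable, Dict, List


class AllSitesDown(Exception):
    """Raised when no data center could serve the request."""


def fetch_with_failover(sites: List[str], fetch: Callable[[str], str]) -> str:
    """Try each data center in order and return the first successful response."""
    errors: Dict[str, Exception] = {}
    for site in sites:
        try:
            return fetch(site)
        except ConnectionError as exc:
            errors[site] = exc  # remember the failure and move on to the next site
    raise AllSitesDown(f"no site reachable: {sorted(errors)}")


# Hypothetical site names and a simulated outage, for illustration only.
DOWN = {"virginia"}


def simulated_fetch(site: str) -> str:
    if site in DOWN:
        raise ConnectionError(f"{site} is unreachable")
    return f"served from {site}"


result = fetch_with_failover(["virginia", "california", "ireland"], simulated_fetch)
print(result)  # served from california
```

The sketch only helps if the application and its data are actually present at the other sites, which is where the budget and know-how come in.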
As of Saturday, April 23 (effectively day 3 of the outage), Amazon’s status page continued to show problems in the Northern Virginia data center, including “Instance connectivity, latency and error rates” in the Amazon Elastic Compute Cloud and “Database instance connectivity and latency issues” in the Amazon Relational Database Service.
Visitors to the Web site of BigDoor, a software company, saw the following message on Saturday: “We’re still experiencing issues due to the current AWS outage. Our publisher account site and API are recovering now, but apparently AWS thinks our corporate site is too awesome for you to see right now.”
At 4:09 EDT on the 25th, Amazon posted the following: “We have completed our remaining recovery efforts and though we’ve recovered nearly all of the stuck volumes, we’ve determined that a small number of volumes (0.07% of the volumes in our US-East Region) will not be fully recoverable. We’re in the process of contacting these customers.”
Even without such outages, moving your organization to a cloud computing-based architecture still entails risks. Unlike typical scenarios where in-house IT staff control access to your sensitive information, a cloud computing provider may, by design or otherwise, allow privileged users access to this information. Data is typically not segregated but stored alongside data from other companies. In addition, a cloud computing provider could conceivably go out of business, suffer an outage, or be taken over, and all of these could impact data accessibility.
Amazon’s outage reinforces that cloud computing is not immune to risk and is not the magic bullet that its vendors would like their customers to believe. Rather, it is in many respects no different from any other distributed system, and such systems are nothing new. Indeed, distributed systems have been around for decades, and IT professionals should have learned enough about fault tolerance and security by now.
Backing up and replicating data and applications across multiple sites should be old hat by now, but the Amazon incident proves this not to be the case. Until we begin to think about cloud computing as being no different from any other type of computing, we will continue to experience major systems failures such as Amazon’s.
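As a rough illustration of what replicating data across multiple sites means in practice, the following Python sketch keeps a copy of every write at several sites, so that a read still succeeds after one site is lost. The `ReplicatedStore` class and the site names are hypothetical, and the sketch deliberately leaves out hard problems such as re-synchronizing a site that was offline during a write.

```python
from typing import Dict, List, Optional, Set


class ReplicatedStore:
    """Toy key-value store that copies every write to several sites."""

    def __init__(self, site_names: List[str]) -> None:
        # One in-memory dictionary stands in for each site's storage.
        self.sites: Dict[str, Dict[str, str]] = {name: {} for name in site_names}
        self.offline: Set[str] = set()

    def put(self, key: str, value: str) -> int:
        """Write to every reachable site and return how many copies were made."""
        copies = 0
        for name, data in self.sites.items():
            if name not in self.offline:
                data[key] = value
                copies += 1
        if copies == 0:
            raise RuntimeError("no site available for writing")
        return copies

    def get(self, key: str) -> Optional[str]:
        """Read from the first reachable site that holds the key."""
        for name, data in self.sites.items():
            if name not in self.offline and key in data:
                return data[key]
        return None


# Hypothetical site names, for illustration only.
store = ReplicatedStore(["site_a", "site_b", "site_c"])
store.put("invoice-1", "paid")
store.offline.add("site_a")      # simulate losing one data center
print(store.get("invoice-1"))    # paid (still readable from site_b)
```

With a single site, the same simulated outage would leave the record unreachable until that site recovered.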
Jonathan B. Spira is CEO and Chief Analyst at Basex.