
Opinion

Simple Storage Stutters

posted on 23 July 2008 06:32


Amazon's S3 service slips up

There was a six- to eight-hour outage of Amazon's S3 - Simple Storage Service - at the weekend. It's not the first, and it suggests that Amazon just hasn't got its service's basic infrastructure sorted out. Reliability is everything for a cloud service, and here is Amazon demonstrating - again - that S3 cannot be relied upon.

S3 is a storage-in-the-cloud service: customers store data on Amazon's storage arrays instead of on their own hard drives.

On Sunday July 20th the service crashed. It was the second outage this year; the previous one was in February. Users were unable to retrieve files, files disappeared, and the service simply went down, with no status report from Amazon for a while.

Amazon boss Jeff Bezos said: "When we have an outage like yesterday, we see that as a crucial driver. We won’t be satisfied until we have uptimes and availabilities that are statistically indistinguishable from perfection. When we have a problem, we know the proximate cause, we analyze from there and find the root cause, we will find the root fix and move forward."

An Amazon statement said: "As a distributed system, the different components of S3 need to be aware of the state of each other. For example, this awareness makes it possible for the system to decide which redundant physical storage server to route a request to.

"We experienced a problem with those internal system communications, leaving the components unable to interact properly, and customers unable to successfully process requests.  After exploring several alternatives, the team determined it had to take the service offline to restore proper communication and then bring service online again.

"These are sophisticated systems and it generally takes a while to get to root cause in such a situation — we will be providing our customers with more information when we’ve fully investigated the incident.  We’re proud of our operational performance in operating S3 for almost 2.5 years, and our customers have generally been pleased with the reliability and performance of the service. But any downtime is unacceptable and we won’t be satisfied until it is perfect.

"Amazon S3 is used heavily by a number of services behind Amazon’s retail websites.  Those services were impacted, but the retail website did not show noticeable problems because it mostly uses cached data."

A further Amazon statement announced a service credit to customers for the outage time: "We'll be waiving our standard SLA (service-level agreement) process and applying the appropriate service credit to all affected customers for the July billing period. Customers will not need to send us an e-mail to request their credits, as these will be automatically applied. This transaction will be reflected in our customers' August billing statements."

That's the good side of dealing with Amazon. But there's that old adage, 'you get what you pay for', and S3 is cheap, cheap as chips. However good the retrospective band-aid, customers would prefer that it not be necessary. It will, though, probably be needed again as the service scales; these are, simply, growing pains.

[B&F staff.]

 


tags:  cloud