This article about managing an AWS S3 outage was originally published on ImageKit's Medium blog.
Amazon's Simple Storage Service, or S3 as it is popularly called, is a cheap, easily accessible, and resilient cloud storage option. The downside, as with any other cloud service, is that when problems do occur, things can go wrong very quickly.
Take the recent massive S3 outage in the US region, which broke websites and applications including Slack's file uploads, Imgur, GitHub, Giphy, and many more.
Many websites today use S3 for storing and serving images, and in such a setup S3 becomes a single point of failure. Though very rare, an S3 outage is one of those events that can cost your company not only revenue but also brand reputation and consumer trust.
How rare are these failures?
Amazon S3's Standard storage class is backed by a 99.9% availability SLA. Over a year, 99.9% availability allows the following windows of potential unavailability:
Daily: 1m 26.4s
Weekly: 10m 4.8s
Monthly: 43m 49.7s
Yearly: 8h 45m 57.0s
Now, 8 hours of downtime is a huge loss for any company serious about doing business online, especially when the downtime comes in a single large window, like the recent outage that lasted around four hours.
For higher availability SLAs:
99.99%: the service can be unavailable for at most about 52 minutes a year
99.999%: the service can be unavailable for at most 5m 15.6s a year
There is obviously a huge difference between 5 minutes and 8 hours.
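The downtime figures above can be reproduced with a few lines of arithmetic. Here is a small sketch that uses a 365.25-day year, which matches the numbers quoted above to within a second:

```python
# Back-of-the-envelope calculator for the downtime an availability SLA allows.

def allowed_downtime_seconds(availability: float, period_seconds: float) -> float:
    """Seconds of downtime permitted per period at the given availability."""
    return (1.0 - availability) * period_seconds

YEAR = 365.25 * 24 * 3600  # seconds in a 365.25-day year

for sla in (0.999, 0.9999, 0.99999):
    secs = allowed_downtime_seconds(sla, YEAR)
    hours, rem = divmod(secs, 3600)
    minutes, seconds = divmod(rem, 60)
    print(f"{sla:.3%} availability: {int(hours)}h {int(minutes)}m {seconds:.1f}s of downtime per year")
```

The same function with `period_seconds` set to a day, a week, or a month reproduces the shorter windows listed above.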
Getting this extra 0.099% of availability adds significant cost and complexity to the infrastructure.
It's like an insurance premium: you hope never to use it, but when you do, you're glad it was there.
What are your options in this case?
Ideally, you should not rely on a single cloud service provider, or at least not on a single geographical region of that provider, for your cloud infrastructure. During this outage, Amazon.com, Zappos, and several other tech companies like Apple, Walmart, and Best Buy stayed up and running. Apple reportedly uses both AWS and Google Cloud, and sites like Amazon and Zappos have spread their infrastructure across multiple geographical regions.
With regard to S3, it is fairly easy to build this redundancy into your system. You can use S3's cross-region replication feature, which can be configured to automatically replicate new objects (or a subset of them) to a bucket in another region. If you are using CloudFront + S3 to serve images on your website, you can use Route 53 to handle these failovers intelligently. You could also handle this logic in your application code, but in most cases that won't be easy.
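As a concrete illustration, here is roughly what a minimal cross-region replication configuration looks like in boto3 style. All bucket names and the IAM role ARN below are hypothetical placeholders, and both buckets must already have versioning enabled; the API call itself is shown commented out:

```python
# Sketch of an S3 cross-region replication configuration.
# Every name and ARN here is a hypothetical placeholder.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",  # placeholder IAM role
    "Rules": [
        {
            "ID": "replicate-images",
            "Status": "Enabled",
            "Prefix": "",  # empty prefix: replicate every new object
            "Destination": {"Bucket": "arn:aws:s3:::my-images-replica"},  # placeholder replica bucket
        }
    ],
}

# To apply it (requires AWS credentials and versioning enabled on both buckets):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_replication(
#     Bucket="my-images-primary",  # placeholder source bucket
#     ReplicationConfiguration=replication_config,
# )
```

Restricting the `Prefix` (for example, to `images/`) replicates only a subset of new objects, as mentioned above.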
What do we do at ImageKit?
At ImageKit, our customers rely on us to optimize and deliver their images to their users with very low latency. Apart from using S3 for storage internally, we use an array of other AWS services.
Like several other service providers, we cannot afford to go down even for a few minutes, let alone hours. Hence, it is very important for us to prepare for an event like this, no matter how rare it might be.
We use cross-region replication to handle S3 failures. Based on Route 53 health checks, we automatically start serving images from the replica buckets in case of a failure.
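A DNS-level failover of this kind can be expressed as a pair of Route 53 failover record sets: traffic goes to the primary endpoint while its health check passes, and to the secondary otherwise. A sketch, in which every domain name, endpoint, and ID is a hypothetical placeholder:

```python
# Sketch of a Route 53 failover pair for a boto3 change_resource_record_sets call.
# All names, endpoints, and IDs are hypothetical placeholders.
change_batch = {
    "Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "images.example.com.",
                "Type": "CNAME",
                "SetIdentifier": "primary",
                "Failover": "PRIMARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "primary-bucket.s3.amazonaws.com"}],
                "HealthCheckId": "00000000-0000-0000-0000-000000000000",  # placeholder
            },
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "images.example.com.",
                "Type": "CNAME",
                "SetIdentifier": "secondary",
                "Failover": "SECONDARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "replica-bucket.s3.amazonaws.com"}],
            },
        },
    ]
}

# To apply it (requires AWS credentials and an existing hosted zone):
# import boto3
# route53 = boto3.client("route53")
# route53.change_resource_record_sets(
#     HostedZoneId="Z0000000000000",  # placeholder hosted zone ID
#     ChangeBatch=change_batch,
# )
```

A low TTL on both records keeps the failover window short, at the cost of more DNS lookups.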
In addition to failure-proof storage, our image processing servers run in three different regions across the globe. This serves two main purposes:
1. Better performance for end users.
2. Redundancy in the system in case a particular region (or our server within that region) goes down.
At enterprise scale, re-architecting a whole application to handle failures like this would be a huge challenge. But if you have started recently and don't have much technical debt, you should think about adopting these strategies today and find clever ways of making your application failure-proof.
Do share in the comments how you build redundancy into your applications. It would be interesting to learn what works for other companies and what doesn't.