Weathering the AWS Storm: Zero Downtime During the AWS Outage
The fact is that the downtime these companies, and their customers, experienced didn’t need to occur. The blame shouldn’t be placed on the infrastructure that Amazon provides, nor is the outage an indicator that the public cloud is any less reliable than any other IT infrastructure.
The reaction to the AWS outage is a reminder that thousands of businesses and millions of users are consuming public cloud services based on AWS. As service providers that are running on that AWS infrastructure, we all have the responsibility to make the appropriate level of investment in our software and operations to ensure an acceptable level of reliability for our customers.
For vendors that offer non-mission-critical services, the cost to engineer and support higher levels of reliability in their offerings may be prohibitive.
At Okta, that is not the case, so we have made a different choice.
We run an on-demand identity and access management service for our customers, which controls access to their critical business applications. Downtime for our customers is unacceptable and, as a result, we have made the software and operational investments necessary to provide a reliable service on top of AWS.
AWS is a platform we trust. It is secure and reliable and provides us with an incredible amount of flexibility as we scale our business. But the AWS infrastructure, just like any infrastructure, will experience failures — that is a certainty. In order to account for those inevitable failures, service providers need to make software and operational investments that allow their services to continue to run. Those investments are the responsibility of the vendor.
Okta remained up and running throughout last week’s AWS outage, and the key was the underlying software and operational investments we have made as part of our zero downtime architecture.
Weathering the AWS Storm
The recent AWS outage hit one availability zone within Amazon’s US-East-1 (Northern Virginia) region. Because of the software and operational investments Okta has made across our five-availability-zone footprint in that region (and in two availability zones in another region), our customers weren’t affected.
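To make the multi-zone idea concrete, here is a minimal sketch of zone failover. It is not Okta's implementation. The `Zone` type, the `pick_zone` helper, the region and zone labels, and the `is_healthy` callback are all hypothetical names chosen for illustration, and the sketch assumes some external health check already exists. Only the footprint (five zones in one region, two in another) mirrors the text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Zone:
    region: str  # hypothetical region label, not a real AWS identifier
    name: str    # hypothetical zone label


def pick_zone(
    zones: Iterable[Zone],
    is_healthy: Callable[[Zone], bool],
    preferred_region: str,
) -> Optional[Zone]:
    """Return the first healthy zone, trying the preferred region first.

    Returns None when no zone passes the health check.
    """
    # sorted() is stable, so zones keep their original order within each group;
    # the key is False (sorts first) for zones in the preferred region.
    ordered = sorted(zones, key=lambda z: z.region != preferred_region)
    for zone in ordered:
        if is_healthy(zone):
            return zone
    return None


# Assumed footprint matching the text: five zones in one region, two in another.
ZONES = [Zone("region-a", f"zone-{i}") for i in range(1, 6)] + [
    Zone("region-b", f"zone-{i}") for i in range(1, 3)
]
```

With this shape, losing a single zone only moves traffic to a sibling zone in the same region, and losing the whole preferred region falls back to the second region. A real deployment would also need data replication and capacity headroom in the surviving zones, which this sketch does not model.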
Amazon has published a post-mortem of Friday’s outage, and many analysts suggest that the companies that suffered extended downtime had not made the necessary investments to ensure availability on AWS. ZDNet’s Michael Lee wrote an article citing several industry analysts who lay the blame not on Amazon but on the companies that went down, for not creating, and paying for, geographically redundant systems. Citing Intelligent Business Services advisor Jorn Bettin, Lee writes:
“Bettin said that the cost of creating redundant connections, to ensure that a natural disaster in one area of the world won’t affect services in another, could double cloud costs, however. Despite the call for Amazon to pick up the bill and shield customers from this risk, Bettin said that this isn’t the way that cloud services should be treated.”
As I noted before, it may be that for some companies this investment is not warranted. For Okta, that is certainly not the case. We are an example that it’s absolutely possible to run a highly reliable service on AWS.
Our Customers Rely on Okta
Our customers rely on Okta to give employees seamless, constant access to the business applications they need to do their jobs. We don’t have planned downtimes at Okta. It’s critical that our service is up and running all the time, and we’re transparent with our customers about how well we live up to that promise.
AWS is the market-leading infrastructure-as-a-service offering, with an innovative set of secure, reliable products that are very affordable. But if reliability is important to your business, as it is to ours, you cannot depend on the infrastructure layer alone, whether it’s run by AWS or by your own operations staff in your own datacenter.
Infrastructure will fail, and vendors of mission-critical services must make the necessary investments to ensure availability.
At Okta we are clear: Zero Downtime is critical and the buck stops with us.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)