What Really Caused the Amazon Web Services Outage?
Amazon released a summary of the Amazon EC2 and Amazon RDS service disruption in the US East Region today.
Those so-called “stuck” volumes and stuck EC2 instances? Well, those were mine.
Right after midnight on April 21st I was working on one of my harvest tools, which pulls feeds from all major blogs, breaks them down and stores them for analysis.
I was reviewing the analysis of what it had pulled in the last hour when my EC2 instance froze. I closed everything down and logged back in as root, but it took me about 45 minutes to actually get things to load back up.
Once back in, I saw some sort of scan going on. Whatever was scanning this instance was going through large volumes of information, and it was moving so fast I really couldn’t make sense of it.
I tried to shut things down, but couldn’t. I’d lost control over my instance. I run a quadruple extra large instance for harvesting data. For nearly 4 days I couldn’t get my machine to respond.
I assumed that once Amazon resolved their problems, I could get back in.
It was late Sunday night before I was able to get back into my instance, and at first everything looked fine. Upon closer inspection I noticed that my hard drive was all but maxed out. I started looking further and noticed my normally 500 GB EBS volume was now 1 TB. Then I noticed there were 25 separate 1 TB EBS volumes attached. Holy shit, where did this come from?
I started going through these drives. They appeared to be full of logs of the last 4 days’ activity.
I was curious to know what happened, so I started looking through them. I started with the files timestamped 4/21/2011.
The logs contained complete copies of everything I’ve harvested over the last year, but that was just the first 4 hours. After that, it looks like some other data that I do not recognize.
After looking for patterns I noticed several things. In the first 2 hours I saw the dictionary database I use for indexing harvested data being processed, with the following coded tags around each word:
112 114 111 99 101 115 115>>aardvaark<<112 114 111 99 101 115 115
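Each entry looks like a run of space-separated decimal numbers, then `>>`, the payload, `<<`, and the same numbers again. Here is a minimal Python sketch for pulling such a line apart. It assumes, and this is only a guess on my part, that each number is an ASCII character code. The helper names `decode_tag` and `parse_line` are made up for illustration and are not anything found in the logs.

```python
import re

# Assumed line shape, based on the sample above: "<codes>>>payload<<<codes>".
# Whitespace around the number runs is tolerated.
TAGGED_LINE = re.compile(r"^\s*([\d ]+?)\s*>>(.*)<<\s*([\d ]+?)\s*$", re.DOTALL)


def decode_tag(codes: str) -> str:
    """Turn '83 108 ...' into text, assuming decimal ASCII code points."""
    return "".join(chr(int(n)) for n in codes.split())


def parse_line(line: str):
    """Return (opening_tag, payload, closing_tag), or None if the line
    does not match the assumed shape."""
    match = TAGGED_LINE.match(line)
    if match is None:
        return None
    opening, payload, closing = match.groups()
    return decode_tag(opening), payload.strip(), decode_tag(closing)


if __name__ == "__main__":
    sample = "112 114 111 99 101 115 115>>aardvaark<<112 114 111 99 101 115 115"
    print(parse_line(sample))
```

Running the same function over each log line would show whether the opening and closing tags always match and how the tag changes from one phase of the logs to the next.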
The log files for the rest of the day seemed to be processing all the blog posts I’ve pulled from around the web for the last year:
76 101 97 114 110>> “The first serious infowar is now engaged,” EFF co-founder John Perry Barlow tweeted on Friday. “The field of battle is WikiLeaks. You are the troops.” Cablegate and Tech Companies. And here we are. In the week since the whistleblower site released its latest round of documents to major global newspapers, the site has been besieged by DDOS attacks (upwards of 10 Gbps at one point), forcing the site offline and hampering its ability to deliver data…<<76 101 97 114 110
Then for the next 36 hours the log files contained random posts, bits of information, which I could not make sense of:
85 110 100 101 114 115 116 97 110 100>>Care must be taken to recognize when the effects of one constraint may counteract the benefits of some other constraint. Nevertheless, it is possible for an experienced software architect to build such a derivation tree of architectural constraints for a given application domain, and then use the tree to evaluate many different architectural designs for applications within that domain. Thus, building a derivation tree provides a mechanism for architectural design guidance.<< 85 110 100 101 114 115 116 97 110 100
This is pretty much what seemed to fill up my hard drive(s). After that, all it seemed to pull was Amazon EC2 instance IDs and EBS volume IDs:
69 120 112 108 111 114 101>>i-77f3849,i-399fF3a49<<69 120 112 108 111 114 101
I’ve looked for other reports of this type of scanning across the Internet and can’t find anything.
Whatever or whoever it was, it seemed to get smarter and more efficient the more it scanned. It found my information, and after it worked through that, it found a whole bunch of other content from there.
What worries me is it looks like it explored beyond my machine and found about 67,000 other instances while it was exploring. I don’t have this information on my server or storage.
The last file I found modified was a log file with timestamp Sunday at 8:01:00 PM with a single line in it:
83 108 101 101 112
I’m not sure what all this is, but it seemed to exist only on my server, and it started at the same time everyone started complaining about an outage. Amazon’s report on the outage gives a similar time: 12:47 AM PDT on April 21st.
They claim it was due to a “network change”. Interesting.
It may have coincided with a network change, but what I saw seemed like something else. What I saw seemed to be some sort of machine learning or evolving intelligence.
Maybe it’s all part of my imagination… oh wait… it is… none of this happened.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)