
Istvan Szegedi is an IT Technical Architect at Vodafone UK. He has worked at Hewlett-Packard, Nokia Networks, Google, Morgan Stanley and Vodafone. He holds certifications including Sun Certified System Administrator, Sun Certified Java Programmer, Sun Certified Web Component Developer, Salesforce.com Certified Force.com Developer and TOGAF Certified Enterprise Architect. As a big fan of mobile and cloud computing, he likes to believe that these technologies will eventually push aside the desktop/client-server architecture. Istvan is a DZone MVB, is not an employee of DZone, and has posted 37 posts at DZone.

Getting to Know Amazon Elastic MapReduce

07.16.2012

Amazon Elastic MapReduce (EMR) is a service in the AWS portfolio for data processing and analytics on large amounts of data. It is based on Hadoop (Hadoop 0.20.205 at the time of writing) and relies on other AWS services such as EC2 and S3.

Data processing applications can be implemented using various technologies such as Hive, Pig, Java (Custom Jar) and Streaming (e.g. Python or Ruby). This post demonstrates how to use Hive on Amazon Elastic MapReduce: the sample application calculates the average price of Apple stock for every year from 1984 to 2012. At the time of writing, the Hive version is 0.7.1. (Side note: as will be shown, AAPL started at an average price of around 25 USD in 1984, dropped to 18 USD in 1997 and is now around 500, or 496.32138 to be more precise; quite some numbers for a company that has been in Infinite Loop for decades…)

How to create Elastic MapReduce Jobs?

There are three steps to manage an EMR jobflow:

1./ Upload the Hive script (a .q file) and the data to be processed to S3. If you are unfamiliar with AWS, it is worth getting to know its structure and how to use it first.

The test data used in this post was downloaded from the Yahoo! Finance website (historical data for AAPL stock). Go to http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices and scroll down to the Download to Spreadsheet link. This produces a CSV file (~6,950 lines) with the following columns: Date,Open,High,Low,Close,Volume,Adj Close. Remove the header (the first line) so that only data rows remain in the CSV file; otherwise Hive would read the header as a data row.
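Removing the header can also be scripted. The following is a minimal sketch in Python, assuming the downloaded file starts with the header line listed above; the function name `strip_header` and any file names passed to it are hypothetical and not part of the original workflow.

```python
from pathlib import Path


def strip_header(src: str, dst: str) -> int:
    """Copy src to dst without the header line; return the number of data lines written."""
    lines = Path(src).read_text(encoding="utf-8").splitlines(keepends=True)
    # Drop the first line only if it looks like the Yahoo! Finance header,
    # so running the function twice does not remove a data row.
    if lines and lines[0].startswith("Date,"):
        lines = lines[1:]
    Path(dst).write_text("".join(lines), encoding="utf-8")
    return len(lines)
```

For example, `strip_header("table.csv", "aapl.csv")` would write the data rows of a downloaded `table.csv` to `aapl.csv`, which can then be uploaded as the input file.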

Steps to upload the input files:

a./ go to the AWS S3 console and create the “stockprice” bucket

b./ create folders under the stockprice bucket: apple/input, apple/output and hive-scripts

c./ upload the apple.q Hive script into the s3://stockprice/hive-scripts folder

d./ upload the CSV input file containing AAPL stock prices into the s3://stockprice/apple/input folder
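The upload steps above can also be done from a script instead of the console. The sketch below is one possible way, not the method used in this post: it assumes an S3 client object that offers `upload_file(Filename, Bucket, Key)`, as the boto3 S3 client does. The helper names `upload_plan` and `upload_inputs` are hypothetical.

```python
import os
from typing import List, Tuple


def upload_plan(script_path: str, csv_path: str) -> List[Tuple[str, str]]:
    """Pair each local file with the S3 key it gets inside the bucket."""
    return [
        (script_path, "hive-scripts/" + os.path.basename(script_path)),
        (csv_path, "apple/input/" + os.path.basename(csv_path)),
    ]


def upload_inputs(s3_client, bucket: str, script_path: str, csv_path: str) -> List[str]:
    """Upload the Hive script and the CSV file; return their s3:// URLs.

    s3_client is assumed to provide upload_file(Filename, Bucket, Key),
    for example the object returned by boto3.client("s3").
    """
    uploaded = []
    for local_path, key in upload_plan(script_path, csv_path):
        s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
        uploaded.append(f"s3://{bucket}/{key}")
    return uploaded
```

With boto3 installed and credentials configured, a call might look like `upload_inputs(boto3.client("s3"), "stockprice", "apple.q", "aapl.csv")`. Note that S3 bucket names are globally unique, so you will most likely need a bucket name other than stockprice, and the paths in apple.q have to be adjusted to match.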

2./ Create an Elastic MapReduce jobflow:

a./ navigate to https://console.aws.amazon.com/elasticmapreduce/home

b./ select “Create New Job Flow”

c./ configure the job parameters

d./ configure the EC2 instances

e./ define the EC2 key pair

f./ optionally, configure debugging by defining an S3 log path and selecting “Enable Debugging”. I highly recommend doing so while you are in the development phase

g./ set no bootstrap actions

h./ review the configuration before you hit the run button

i./ create the job flow

j./ follow the job flow status as it moves from STARTING to RUNNING to SHUTDOWN.

Should any issues occur, you can check stderr, stdout and syslog from the “Debug” menu.
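Watching the status change in the console can be replaced by a small polling loop. The sketch below is a generic helper under stated assumptions: `wait_for_job_flow` is a hypothetical name, and `get_state` is any callable that returns the current state as a string, so the helper itself does not depend on a particular AWS library.

```python
import time
from typing import Callable, Iterable, List


def wait_for_job_flow(
    get_state: Callable[[], str],
    terminal_states: Iterable[str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> List[str]:
    """Poll get_state() until a terminal state appears.

    Returns the sequence of state transitions observed, in order.
    Raises TimeoutError if no terminal state is seen within max_polls polls.
    """
    terminal = set(terminal_states)
    seen: List[str] = []
    for _ in range(max_polls):
        state = get_state()
        if not seen or seen[-1] != state:
            seen.append(state)  # record each transition once
        if state in terminal:
            return seen
        time.sleep(poll_seconds)
    raise TimeoutError(f"no terminal state after {max_polls} polls; states seen: {seen}")
```

With boto3 (an assumption; this post uses the console), `get_state` could be `lambda: emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]` with `emr = boto3.client("emr")`. The state names returned by that API (for example TERMINATED or TERMINATED_WITH_ERRORS) differ from the console labels quoted above, so pass whichever terminal states apply to your setup.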

3./ Check the result:

After a few minutes of number crunching, the output is written to the s3://stockprice/apple/output folder (e.g. as a file named 000000). The file is plain text with year and average price columns, separated by SOH (start of heading, ASCII 001):

1984 25.578625
1985 20.193676
1986 32.46103
1987 53.889683
1988 41.54008
1989 41.659763
1990 37.562687
1991 52.495533
1992 54.803387
1993 41.02672
...
2010 259.84247
2011 364.00433
2012 496.32138
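Because the separator is the non-printable SOH character, the output file looks space-separated in many viewers but cannot be split on spaces. A minimal parsing sketch in Python follows; the function name `parse_hive_output` is hypothetical, and it assumes each non-empty line holds exactly one year and one average price.

```python
from typing import Dict

SOH = "\x01"  # ASCII 001, the field separator in the output file


def parse_hive_output(text: str) -> Dict[int, float]:
    """Parse lines of the form year<SOH>avg_price into {year: avg_price}."""
    result: Dict[int, float] = {}
    for line in text.split("\n"):
        if not line.strip():
            continue  # skip empty lines such as the one after the final newline
        year, avg_price = line.split(SOH)
        result[int(year)] = float(avg_price)
    return result


# The first two rows of the output shown above:
sample = "1984\x0125.578625\n1985\x0120.193676\n"
print(parse_hive_output(sample))  # {1984: 25.578625, 1985: 20.193676}
```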

Appendix

The Hive code to process the data (apple.q) looks like this:

-- Raw daily prices: external table over the CSV file uploaded to S3.
CREATE EXTERNAL TABLE stockprice (
yyyymmdd STRING, open_price FLOAT, high_price FLOAT, low_price FLOAT, close_price FLOAT, stock_volume INT, adjclose_price FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://stockprice/apple/input/';

-- Intermediate table holding only the year and the closing price.
CREATE TABLE tmp_stockprice (
year INT, close_price FLOAT
)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE tmp_stockprice
SELECT YEAR(sp.yyyymmdd), sp.close_price
FROM stockprice sp;

-- Result table: average closing price per year.
CREATE TABLE avg_yearly_stockprice (
year INT, avg_price FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

INSERT OVERWRITE TABLE avg_yearly_stockprice
SELECT tmp_sp.year, avg(tmp_sp.close_price)
FROM tmp_stockprice tmp_sp
GROUP BY tmp_sp.year;

-- Write the result to S3 (fields are separated by SOH, the Hive default).
INSERT OVERWRITE DIRECTORY 's3://stockprice/apple/output/'
SELECT * from avg_yearly_stockprice;

Alternatively, instead of using INSERT OVERWRITE DIRECTORY, you can define avg_yearly_stockprice as an external table with a LOCATION clause, in the same way as is done for the stockprice table.
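To sanity-check the Hive results on a small sample before starting a job flow, the aggregation in apple.q can be reproduced locally. The sketch below is a hypothetical Python equivalent, not part of the EMR job; it assumes the header has been removed and that the Date column is in yyyy-mm-dd form, so the first four characters are the year.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable


def yearly_average_close(rows: Iterable[str]) -> Dict[int, float]:
    """Average the Close column per year, like the GROUP BY in apple.q."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in csv.reader(rows):
        if not record:
            continue  # skip blank lines
        # Columns: Date,Open,High,Low,Close,Volume,Adj Close
        year = int(record[0][:4])  # assumes a yyyy-mm-dd date
        totals[year] += float(record[4])
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}
```

An open file can be passed directly, e.g. `yearly_average_close(open("aapl.csv", newline=""))`. Expect small differences in the last digits compared with the Hive output, since the Hive tables use single-precision FLOAT columns while Python uses double precision.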

 

 

Published at DZone with permission of Istvan Szegedi, author and DZone MVB. (source)
