
I am a software architect working in the service hosting area. I am interested and specialized in SaaS, cloud computing and parallel processing. Ricky is a DZone MVB and has posted 88 posts at DZone.

What Hadoop is Good At

11.06.2009

Hadoop is getting more popular these days. Let's look at what it is good at and what it is not.


The Map/Reduce Programming Model

Map/Reduce offers a different programming model for handling concurrency than the traditional multi-threaded model.

The multi-threaded programming model allows multiple processing units (with different execution logic) to access a shared set of data. To maintain data integrity, the processing units coordinate their access to the shared data using locks and semaphores. Problems such as race conditions and deadlocks can easily arise and are hard to debug, which makes multi-threaded programs difficult to write and maintain. (Java provides the java.util.concurrent library to ease the development of multi-threaded programs.)
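
As a minimal illustration (my own sketch, not from the original article), here is what this coordination looks like in Java: two threads share a counter, and an explicit lock is what keeps the increments from being lost. Remove the lock and you get exactly the kind of race condition described above.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class SharedCounter {
    private long count = 0;
    private final Lock lock = new ReentrantLock();

    // Every thread that mutates the shared state must acquire the lock,
    // otherwise increments can be lost (a classic race condition).
    public void increment() {
        lock.lock();
        try {
            count++;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SharedCounter counter = new SharedCounter();
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) counter.increment();
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter.count); // 200000, but only because of the lock
    }
}
```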


The data-driven programming model instead feeds data into different processing units (with the same or different execution logic). Execution is triggered by the arrival of data. Since a processing unit can only access data piped to it, data sharing between processing units is ruled out up front. Because of this, there is no need to coordinate access to the data.

This doesn't mean there is no coordination of data access at all; rather, the coordination is done explicitly by the graph, i.e. by defining how the nodes (processing units) are connected to each other via data pipes.
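
To make the contrast concrete, here is a hedged sketch (again my own illustration) of a tiny data-driven pipeline in Java: two processing units connected only by a queue, where execution is triggered by data arriving on the pipe rather than by access to shared state.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DataDrivenPipeline {
    public static void main(String[] args) {
        // The "data pipe" connecting two processing units; no other state
        // is shared, so the units themselves need no locks.
        BlockingQueue<String> pipe = new ArrayBlockingQueue<>(16);

        // Producer unit: emits data into the pipe.
        Thread producer = new Thread(() -> {
            try {
                for (String word : new String[]{"map", "reduce", "done"}) {
                    pipe.put(word);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumer unit: execution is triggered by the arrival of data.
        Thread consumer = new Thread(() -> {
            try {
                String word;
                while (!(word = pipe.take()).equals("done")) {
                    System.out.println("processing: " + word);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
    }
}
```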

The Map/Reduce programming model is a specialized form of the data-driven programming model in which the graph is defined as a "sequential" list of Map/Reduce jobs. Within each Map/Reduce job, execution is broken down into a "map" phase and a "reduce" phase. In the map phase, each data split is processed and one or more outputs are produced, each with a key attached. The key is used to route the outputs of the map phase to the "reduce" phase, where data with the same key is collected and processed in an aggregated way.
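
The canonical example is word count. Below is a minimal sketch against the Hadoop MapReduce API (org.apache.hadoop.mapreduce); the class names TokenMapper and SumReducer are my own, not from the article. The mapper emits each word with a key, and the key routes the word to the reducer that aggregates its counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input split is processed record by record; every word
// is emitted with a key so it can be routed to the right reducer.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE); // the key routes this output to the reduce phase
        }
    }
}

// Reduce phase: all values with the same key arrive together and are
// processed in an aggregated way.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}
```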

Note that in the Map/Reduce model, parallelism happens only within a job, and execution between jobs is done sequentially. Since different jobs may access the same set of data, knowing that jobs execute serially eliminates the need to coordinate data access between jobs.

Designing an application to run on Hadoop is therefore a matter of breaking the algorithm down into a number of sequential jobs and then exploiting data parallelism within each job. Not all algorithms fit the Map/Reduce model. For a more general approach to breaking an algorithm down into parallel parts, please visit here.
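
Chaining the sequential jobs is typically done in a driver that runs them one after another. Here is a hedged sketch (my own, not from the article) using the org.apache.hadoop.mapreduce Job API; the mapper/reducer classes are the hypothetical ones from the word-count sketch above, and the paths are made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoPassDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: data parallelism happens inside this job ...
        Job first = Job.getInstance(conf, "pass-1");
        first.setJarByClass(TwoPassDriver.class);
        first.setMapperClass(TokenMapper.class);   // hypothetical classes from
        first.setReducerClass(SumReducer.class);   // the sketch above
        first.setOutputKeyClass(Text.class);
        first.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(first, new Path(args[0]));
        FileOutputFormat.setOutputPath(first, new Path("/tmp/pass-1-out"));
        if (!first.waitForCompletion(true)) System.exit(1);

        // ... Job 2 only starts after Job 1 has fully completed, so the two
        // jobs never touch the same data concurrently.
        Job second = Job.getInstance(conf, "pass-2");
        second.setJarByClass(TwoPassDriver.class);
        FileInputFormat.addInputPath(second, new Path("/tmp/pass-1-out"));
        FileOutputFormat.setOutputPath(second, new Path(args[1]));
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```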

Characteristics of Hadoop Processing

A detailed explanation of the Hadoop implementation can be found here. Basically, Hadoop has the following characteristics ...

  • Hadoop is "data-parallel", but "process-sequential". Within a job, parallelism happens within the map phase and within the reduce phase, but the two phases cannot run in parallel: the reduce phase cannot start until the map phase has fully completed.
  • All data accessed by the map process needs to be frozen (no updates can happen) until the whole job completes. This means Hadoop processes data in chunks in a batch-oriented fashion, which makes it poorly suited to stream-based processing where data flows in continuously and immediate processing is needed.
  • Data communication happens via a distributed file system (HDFS). Latency is introduced because extensive network I/O is involved in moving data around (e.g. three copies of the data are written synchronously). This latency is not an issue for batch-oriented processing, where throughput is the primary concern, but it means Hadoop is not suitable for online access where low latency is critical.

Given the above characteristics, Hadoop is NOT good at the following ...

  • Performing online data access where low latency is critical
  • Performing random, ad-hoc processing of a small subset of data within a large data set
  • Performing real-time, stream-based processing where data arrives continuously and immediate processing is needed

From http://horicky.blogspot.com

Published at DZone with permission of Ricky Ho, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Geertjan Wielenga replied on Fri, 2009/11/06 - 7:04am

FYI, there's also Karmasphere Studio for Hadoop: http://www.hadoopstudio.org/

Dhaval Nagar replied on Sat, 2009/11/07 - 5:02pm

Very simple explanation of a very complex subject. Good article.



Vijay Bhaskar replied on Sun, 2013/10/13 - 4:28am

Sometimes I observed that the reduce jobs start after 96% completion of the map jobs. However, the reduce jobs should start only after the completion of the map jobs. Can you please let me know in which cases the reduce jobs start before the completion of the map jobs?

