
I am a software architect working in the service hosting area. I am interested and specialized in SaaS, cloud computing, and parallel processing. Ricky is a DZone MVB and is not an employee of DZone and has posted 84 posts at DZone.

Learning How to Cope with BI at a Large Scale


As more and more data is collected from nearly everything a user does (transactions, social interactions, information searches, and so on), enterprises have been actively looking for ways to turn these vast amounts of raw data into useful information.

BI process flow


The BI process includes the following stages of processing:

  1. ETL: Extract, transform, and load operational data (from inside the enterprise or from external sources) into a data warehouse (typically organized in a star or snowflake schema with fact and dimension tables).
  2. Data exploration: Gain insight into the data using simple visualization tools (e.g. histograms, summary statistics) or more sophisticated OLAP tools (slice, dice, roll-up, drill-down).
  3. Report generation: Produce executive reports.
  4. Data mining: Extract patterns from the underlying data to form models (e.g. Bayesian networks, linear regression, neural networks, decision trees, support vector machines, nearest neighbors, association rules, principal component analysis).
  5. Feedback: Use the model to assist business decision making (predicting the future).
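
To make the exploration stage more concrete, here is a minimal Python sketch of slice, roll-up, and drill-down over a tiny in-memory fact table. The rows, the column names, and the helpers `slice_rows` and `rollup` are all hypothetical, and a real OLAP tool would run these operations against the warehouse rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows, already joined with their dimension attributes.
facts = [
    {"region": "EU", "product": "book", "month": "2012-01", "sales": 120.0},
    {"region": "EU", "product": "game", "month": "2012-02", "sales": 80.0},
    {"region": "US", "product": "book", "month": "2012-01", "sales": 100.0},
    {"region": "US", "product": "game", "month": "2012-01", "sales": 50.0},
]

def slice_rows(rows, **fixed):
    """Slice: keep only the rows where the given dimensions have fixed values."""
    return [r for r in rows if all(r[k] == v for k, v in fixed.items())]

def rollup(rows, dims, measure="sales"):
    """Aggregate the measure over the listed dimensions (fewer dims = coarser)."""
    totals = defaultdict(float)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r[measure]
    return dict(totals)

by_region = rollup(facts, ["region"])                      # roll up to region
by_region_product = rollup(facts, ["region", "product"])   # drill down
january = rollup(slice_rows(facts, month="2012-01"), ["region"])  # slice, then roll up
```
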
The gap in processing big data
Many data mining and machine learning algorithms are available in both commercial packages (e.g. SAS, SPSS) and open source libraries (e.g. Weka, R). Nevertheless, most of these ML implementations assume that all the data fits in memory and are not designed to process big data (e.g. terabyte-scale volumes).

On the other hand, massively parallel processing platforms such as Hadoop Map/Reduce have, over the last few years, proven themselves at processing terabytes or even petabytes of data. Although many sequential algorithms, including a large portion of machine learning algorithms, can be restructured to run in Map/Reduce, corresponding massively parallel implementations of these ML algorithms are not readily available.
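
As an illustration of what "restructuring" a sequential algorithm can look like, here is a minimal single-process Python sketch, assuming one-variable least-squares regression as the algorithm. The function names (`map_partition`, `combine`, `fit_line`) are hypothetical, and `functools.reduce` merely stands in for a real Hadoop job. The point is that each partition can be summarized independently and the summaries can be added together.

```python
from functools import reduce

def map_partition(rows):
    """Map step: summarize one partition of (x, y) rows as sufficient statistics."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Reduce step: the statistics are plain sums, so partitions simply add up."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(partitions):
    """Fit y = slope * x + intercept from the combined statistics."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_partition, partitions))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Two partitions of points that lie on y = 2x + 1.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
slope, intercept = fit_line(partitions)
```

Only the small tuple of sums crosses partition boundaries, so no single machine needs to hold the full data set.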

Approach 1: Apache Mahout
One approach is to re-implement the ML algorithms in Map/Reduce, which is the path taken by the Apache Mahout project. Mahout appears to have implemented an impressive list of algorithms, although I haven't used them in my projects yet.

Approach 2: Ensemble of parallel independent learners
This is an alternative path that doesn't require re-implementing existing algorithms. It works as follows:
  1. Draw samples from the big data into many sample data sets, each of which can fit into the memory of a single, individual learner.
  2. Assign each sample data set to an individual learner, which can use existing algorithms to learn a model. After learning, each individual learner keeps its own learned model.
  3. When a decision / prediction request is received, each individual learner comes up with its own prediction, and the results are then combined in some way (e.g. for a classification task, the learners vote for the predicted class and the majority wins; for regression, the average of the estimated values is used as the predicted output value).
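
The three steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the nearest-centroid learner is a stand-in for whatever existing in-memory algorithm you would actually use, the data is synthetic, and all function names are hypothetical.

```python
import random
from collections import Counter

def train_centroid_learner(sample):
    """Stand-in for an existing in-memory algorithm: one centroid per class."""
    sums, counts = {}, Counter()
    for features, label in sample:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_one(model, features):
    """Predict the class whose centroid is closest to the features."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

def train_ensemble(big_data, n_learners, sample_size, seed=0):
    """Steps 1 and 2: draw one memory-sized sample per learner and train on it."""
    rng = random.Random(seed)
    return [train_centroid_learner(rng.sample(big_data, sample_size))
            for _ in range(n_learners)]

def predict_majority(models, features):
    """Step 3 (classification): every learner votes and the majority wins."""
    votes = Counter(predict_one(m, features) for m in models)
    return votes.most_common(1)[0][0]

# Synthetic "big" data: two well-separated clusters.
data_rng = random.Random(42)
big_data = (
    [((data_rng.gauss(0, 1), data_rng.gauss(0, 1)), "low") for _ in range(500)]
    + [((data_rng.gauss(10, 1), data_rng.gauss(10, 1)), "high") for _ in range(500)]
)
models = train_ensemble(big_data, n_learners=5, sample_size=50)
prediction = predict_majority(models, (9.5, 10.2))
```

For a regression task, the combining step would average the learners' estimates instead of counting votes.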

I also found that this approach can smoothly fade out outdated models. User behavior may change over time, and so does the validity of a learned model. With the ensemble approach, I can have multiple learners that each learn their model periodically. Every time a prediction is needed, I pick the latest k models and combine their predictions using time-decayed weighted voting. Outdated models automatically slide out of the k-size window.
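
Here is a minimal Python sketch of that sliding window. The exact decay function is not specified above, so the exponential half-life weighting is an assumption, as are the function name `decayed_vote`, the `(trained_at, predict)` pair layout, and the use of days as the time unit.

```python
from collections import defaultdict

def decayed_vote(models, features, now, k=3, half_life=7.0):
    """Combine the latest k models, weighting each vote by its age.

    models: list of (trained_at, predict) pairs, where predict(features)
    returns a class label. The weight halves every `half_life` time units
    (an assumed decay shape; any decreasing function of age would do).
    """
    latest = sorted(models, key=lambda m: m[0], reverse=True)[:k]
    scores = defaultdict(float)
    for trained_at, predict in latest:
        age = now - trained_at
        scores[predict(features)] += 0.5 ** (age / half_life)
    return max(scores, key=scores.get)

# Hypothetical models trained on days 0, 1 and 28; the two old ones agree on "A".
models = [
    (0, lambda features: "A"),
    (1, lambda features: "A"),
    (28, lambda features: "B"),
]
winner = decayed_vote(models, features=None, now=28)
```

In this example the single fresh model outweighs the two stale ones, because their combined weight after four half-lives is far below 1. Any model older than the k most recent never votes at all.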

One gotcha of the sampling approach is the handling of rare events, since you may lose them in sampling. In this case, stratified sampling (instead of simple random sampling) should be used.
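
A minimal sketch of stratified sampling in Python follows. The helper name `stratified_sample`, the `min_per_stratum` floor (which deliberately over-represents rare classes), and the synthetic "fraud" data are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, min_per_stratum=1, seed=0):
    """Sample each stratum separately so that rare strata are never lost."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[label_of(row)].append(row)
    sample = []
    for members in strata.values():
        # Take the requested fraction, but never fewer than the floor
        # and never more than the stratum actually contains.
        wanted = max(min_per_stratum, round(len(members) * fraction))
        sample.extend(rng.sample(members, min(wanted, len(members))))
    return sample

# Synthetic data: 1000 normal events and 10 rare "fraud" events.
rows = [("normal", i) for i in range(1000)] + [("fraud", i) for i in range(10)]
sample = stratified_sample(rows, lambda row: row[0], fraction=0.05, min_per_stratum=5)
fraud_count = sum(1 for row in sample if row[0] == "fraud")
normal_count = sum(1 for row in sample if row[0] == "normal")
```

A simple random 5% sample of the same data would contain no fraud rows at all roughly 60% of the time, whereas here every sample is guaranteed at least five.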


Published at DZone with permission of Ricky Ho, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)


Goel Yatendra replied on Thu, 2012/03/15 - 2:56pm

What do you think: are the infrastructure requirements, and the solutions to them, fundamentally different between:
a) the business intelligence (BI) tasks [computation-/storage- intensive, math-rich] and
b) the business logic (BL) tasks [user IO responsive, data-rich]?
