Learning How to Cope with BI at a Large Scale
As more and more data is collected from nearly everything a user does (transactions, social interactions, information searches, and so on), enterprises have been actively looking for ways to turn this vast amount of raw data into useful information.
Business intelligence (BI) typically includes the following stages of processing:
- ETL: Extract, transform, and load operational data (from inside the enterprise or from external sources) into a data warehouse (typically organized in a Star/Snowflake schema with Fact and Dimension tables).
- Data exploration: Gain insight into the data using simple visualization tools (e.g. histograms, summary statistics) or sophisticated OLAP tools (slice, dice, rollup, drilldown)
- Report generation: Produce executive reports
- Data mining: Extract patterns from the underlying data to form models (e.g. Bayesian networks, linear regression, neural networks, decision trees, support vector machines, nearest neighbors, association rules, principal component analysis)
- Feedback: Use the model to assist business decision making (e.g. by predicting future outcomes)
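To make the Fact/Dimension layout and the OLAP operations a little more concrete, here is a minimal sketch in plain Python. The tables, column names, and the helpers `rollup` and `slice_` are all hypothetical and only illustrate the idea; a real warehouse would run these as SQL or through an OLAP engine.

```python
from collections import defaultdict

# Hypothetical dimension table: product_id -> descriptive attributes.
products = {
    1: {"name": "laptop", "category": "electronics"},
    2: {"name": "phone", "category": "electronics"},
    3: {"name": "desk", "category": "furniture"},
}

# Hypothetical fact table: one row per sale, keyed into the dimension table.
sales = [
    {"product_id": 1, "region": "east", "amount": 1200.0},
    {"product_id": 2, "region": "east", "amount": 800.0},
    {"product_id": 2, "region": "west", "amount": 750.0},
    {"product_id": 3, "region": "west", "amount": 300.0},
]


def rollup(facts, key_fn):
    """Aggregate the 'amount' measure up to the level chosen by key_fn."""
    totals = defaultdict(float)
    for row in facts:
        totals[key_fn(row)] += row["amount"]
    return dict(totals)


def slice_(facts, column, value):
    """Keep only the facts where one dimension is fixed to a single value."""
    return [row for row in facts if row[column] == value]


# Rollup: from individual products up to product categories.
by_category = rollup(sales, lambda r: products[r["product_id"]]["category"])

# Slice on region = east, then aggregate per product name.
east_by_product = rollup(
    slice_(sales, "region", "east"),
    lambda r: products[r["product_id"]]["name"],
)
```

Drilldown is the reverse of the rollup shown here: the same aggregation is run with a finer-grained key, such as the product name instead of the category.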
Many data mining and machine learning algorithms are available in both commercial packages (e.g. SAS, SPSS) and open source libraries (e.g. Weka, R). Nevertheless, most of these ML implementations assume that all the data fits in memory and are not designed to process big data (e.g. terabyte-scale volumes).
On the other hand, massively parallel processing platforms such as Hadoop Map/Reduce have, over the last few years, proven themselves on terabyte and even petabyte ranges of data. Although many sequential algorithms, including a large portion of machine learning algorithms, can be restructured to run in Map/Reduce, a corresponding massively parallel implementation is often not readily available.
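As an illustration of what "restructuring for Map/Reduce" can look like, here is a minimal sketch of simple (one-variable) linear regression written as a map step and a reduce step. It runs in a single process and does not use Hadoop; the function names and the toy data are assumptions made for the example. The point is that each partition can be summarized independently and the summaries simply add up.

```python
from functools import reduce


def map_partition(rows):
    """Map step: summarize one partition of (x, y) pairs as partial sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: partial sums add element-wise, in any order."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(partitions):
    """Least-squares slope and intercept from the combined sums."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_partition, partitions))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Toy data following y = 2x + 1, split across two hypothetical partitions.
partitions = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
slope, intercept = fit_line(partitions)
```

Algorithms whose training reduces to sums over the data tend to fit this pattern well; algorithms that need many passes or random access to the whole data set are harder to restructure.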
Approach 1: Apache Mahout
One approach is to "re-implement" the ML algorithms in Map/Reduce, and this is the path taken by the Apache Mahout project. Mahout seems to have implemented an impressive list of algorithms, although I haven't used them in my projects yet.
Approach 2: Ensemble of parallel independent learners
This is an alternative path that doesn't require re-implementation of existing algorithms. It works in the following way.
- Draw samples from the big data into many sample data sets, each of which can fit into the memory of a single learner.
- Assign each sample data set to an individual learner, which can use existing algorithms to learn a model. After learning, each individual learner keeps its own learned model.
- When a decision / prediction request is received, each individual learner comes up with its own prediction, and the results are then combined in some way (e.g. for a classification task, the learners vote for the predicted class and the majority wins; for regression, the average of the estimated values is used as the predicted output value).
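The three steps above can be sketched as follows. This is only an illustration under stated assumptions: `train_centroids` is a deliberately trivial stand-in for "any existing in-memory algorithm" (in practice this would be a call into a library such as Weka or R), the learners run one after another rather than on separate machines, and all names and data are made up.

```python
import random
from collections import Counter


def train_centroids(sample):
    """Stand-in for any existing in-memory learner: per-class mean of x."""
    sums, counts = {}, Counter()
    for x, label in sample:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_one(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


def train_ensemble(data, n_learners, sample_size, seed=0):
    """Steps 1 and 2: draw one sample per learner and train on it."""
    rng = random.Random(seed)
    return [train_centroids(rng.sample(data, sample_size))
            for _ in range(n_learners)]


def vote(models, x):
    """Step 3, classification: majority vote across the learners."""
    ballots = Counter(predict_one(m, x) for m in models)
    return ballots.most_common(1)[0][0]


def average(predictions):
    """Step 3, regression: average the learners' numeric estimates."""
    return sum(predictions) / len(predictions)


# Toy data set: small x values are "low", large x values are "high".
data = [(float(i), "low") for i in range(50)] + \
       [(float(i), "high") for i in range(100, 150)]
models = train_ensemble(data, n_learners=5, sample_size=20)
```

Calling `vote(models, 10.0)` asks all five learners and returns the majority class. Because each learner only ever sees its own sample, the training calls are independent and could be farmed out to separate machines.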
I also found that this approach can smoothly fade out outdated models. As user behavior may change over time, so does the validity of a learned model. With this ensemble approach, I can have multiple learners each learn their model periodically. Every time a prediction is needed, I pick the latest k models and combine them into the final prediction using a time-decayed weighted voting model. Outdated models automatically slide out of the k-size window.
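A minimal sketch of such a time-decayed weighted vote is shown below. The exponential half-life weighting is my assumption for the example (any weighting that shrinks with model age would do), and the function name, parameters, and toy models are hypothetical.

```python
from collections import defaultdict


def decayed_vote(models, x, now, k=3, half_life=7.0):
    """Weighted vote over the k most recently trained models.

    models: list of (trained_at, predict_fn) pairs, timestamps in days.
    """
    # Only the newest k models get a say; older ones slide out of the window.
    latest = sorted(models, key=lambda m: m[0], reverse=True)[:k]
    scores = defaultdict(float)
    for trained_at, predict in latest:
        age = now - trained_at
        # Assumed decay: a model's weight halves every half_life days.
        weight = 0.5 ** (age / half_life)
        scores[predict(x)] += weight
    return max(scores, key=scores.get)


# Four toy models trained on days 0, 7, 14 and 21.
models = [
    (0.0, lambda x: "B"),
    (7.0, lambda x: "A"),
    (14.0, lambda x: "A"),
    (21.0, lambda x: "B"),
]
result = decayed_vote(models, x=None, now=21.0, k=3)
```

In this toy run the day-0 model is outside the window. Of the remaining three, two older models say "A" (weights 0.25 and 0.5) and the newest says "B" (weight 1.0), so the newest model wins even though a plain majority vote would have chosen "A".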
One gotcha of the sampling approach is the handling of rare events (since you may lose those rare events in sampling). In this case, stratified sampling (instead of simple random sampling) should be used.
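Here is a minimal sketch of stratified sampling, assuming each row carries a label that identifies its stratum. The function name, the `min_per_stratum` floor, and the toy "fraud" data are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, fraction, min_per_stratum=1, seed=0):
    """Sample each stratum separately so that rare strata are never lost."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[label_of(row)].append(row)

    sample = []
    for members in strata.values():
        # Proportional allocation, with a floor so rare events survive.
        n = max(min_per_stratum, round(len(members) * fraction))
        sample.extend(rng.sample(members, min(n, len(members))))
    return sample


# Toy data: 990 normal rows and 10 rare "fraud" rows.
rows = [("normal", i) for i in range(990)] + [("fraud", i) for i in range(10)]
sample = stratified_sample(rows, lambda r: r[0], fraction=0.1)
```

A simple random sample of the same size could easily contain no "fraud" rows at all, whereas here every stratum is guaranteed at least `min_per_stratum` rows. Raising that floor oversamples the rare class, which then needs to be accounted for when interpreting the learned model.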
Source: http://horicky.blogspot.com/2010/12/bi-at-large-scale.html
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)







Comments
Goel Yatendra replied on Thu, 2012/03/15 - 2:56pm
a) the business intelligence (BI) tasks [computation-/storage- intensive, math-rich] and
b) the business logic (BL) tasks [user IO responsive, data-rich]?