The HaLoop approach to large-scale iterative data analysis

Download: PDF, VLDB 2010 slides (PDF), VLDB 2010 slides (PowerPoint), HaLoop implementation.

“The HaLoop approach to large-scale iterative data analysis” by Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. The VLDB Journal, vol. 21, no. 2, 2012, pp. 169-190.
A previous version appeared as “HaLoop: Efficient Iterative Data Processing on Large Clusters” by Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. In VLDB 2010: 36th International Conference on Very Large Data Bases, (Singapore), Sep. 2010, pp. 285-296.

Abstract

The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce has enjoyed particular success. However, MapReduce lacks built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph analysis, and model fitting. This paper presents HaLoop, a modified version of the Hadoop MapReduce framework that is designed to serve these applications. HaLoop allows iterative applications to be assembled from existing Hadoop programs without modification, and significantly improves their efficiency by providing inter-iteration caching mechanisms and a loop-aware scheduler to exploit these caches. HaLoop retains the fault-tolerance properties of MapReduce through automatic cache recovery and task re-execution. We evaluated HaLoop on a variety of real applications and real datasets. Compared with Hadoop, on average, HaLoop improved runtimes by a factor of 1.85 and shuffled only 4% as much data between mappers and reducers in the applications that we tested. (This paper is an extended version of the VLDB 2010 paper “HaLoop: Efficient Iterative Data Processing on Large Clusters”, PVLDB 3(1):285-296, 2010.)

BibTeX entry:

@article{BuHBErnst2012,
   author = {Yingyi Bu and Bill Howe and Magdalena Balazinska and Michael D. Ernst},
   title = {The {HaLoop} approach to large-scale iterative data analysis},
   journal = {The VLDB Journal},
   volume = {21},
   number = {2},
   pages = {169--190},
   year = {2012}
}
