1 Introduction
As the volume of collected data grows everywhere, data mining techniques have
been introduced to discover hidden knowledge that can contribute to
decision-making. Classification is one category of data mining that extracts
models to describe data classes or to predict future data trends. Classification
algorithms predict categorical labels, and many have been proposed by
researchers in the fields of machine learning, expert systems, and statistics,
such as decision trees [1,2], the naive Bayesian model [3,4,5,6], the nearest
neighbors algorithm [7,8], and so on.
Most of these algorithms, though, were originally memory-resident, assuming a
small data size. Recently, research has focused on developing scalable
techniques that can process large, disk-resident datasets. These techniques
often rely on parallel and distributed processing.
However, because of the communication and synchronization required between
different distributed components, it is difficult to implement parallel
classification algorithms in a distributed environment. Existing models for
parallel classification are mainly implemented through the Message Passing
Interface (MPI) [9], which is complex and hard to master. With the emergence of
cloud computing [10,11], parallel techniques are able to solve more challenging
problems, such as
2 MapReduce Overview
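As a minimal, single-process sketch of the programming model (not the distributed framework itself), the following Python applies a map function to every input record, groups the intermediate values by key, and passes each group to a reduce function. The names map_reduce, count_labels, and sum_counts, as well as the toy records, are hypothetical and chosen only for illustration; counting class labels is used as the example because such counts are a typical building block for classifiers like naive Bayes.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a toy, in-memory map/group/reduce pass over `records`."""
    groups = defaultdict(list)
    # Map phase: the mapper is applied to every input record and may
    # emit any number of intermediate (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Grouping step: collect intermediate values under their key.
            groups[key].append(value)
    # Reduce phase: each key's values are aggregated into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_labels(record):
    # Hypothetical mapper: emit (class label, 1) for one training record.
    _features, label = record
    yield label, 1


def sum_counts(_label, counts):
    # Hypothetical reducer: total the 1s emitted for a label.
    return sum(counts)


# Toy training records of the form (features, class label); assumed data.
records = [
    ((5.1, 3.5), "setosa"),
    ((6.2, 2.9), "versicolor"),
    ((4.9, 3.0), "setosa"),
]
label_counts = map_reduce(records, count_labels, sum_counts)
```

In a real deployment the map calls and the reduce calls would run on different machines, with the framework handling the grouping between them; the sketch keeps everything in one process to show only the data flow.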
Fig. 1. Illustration of the MapReduce framework: the map is applied to all input
records, which generates intermediate results that are aggregated by the reduce