
Parallel Implementation of Classification Algorithms Based on MapReduce

Qing He1, Fuzhen Zhuang1,2, Jincheng Li1,2, and Zhongzhi Shi1

1 The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
2 Graduate University of Chinese Academy of Sciences, Beijing, 100190, China
{heq,zhuangfz,lijincheng,shizz}@ics.ict.ac.cn

Abstract. Data mining has attracted extensive research for several decades. As an important task of data mining, classification plays an important role in information retrieval, web searching, CRM, etc. Most existing classification techniques are serial and become impractical for large datasets: computing resources are under-utilized and the execution time becomes unacceptably long. Building on the MapReduce programming model, we propose parallel implementation methods for several classification algorithms, such as k-nearest neighbors, the naive Bayesian model, and decision trees. Preliminary experiments show that the proposed parallel methods can not only process large datasets but also be extended to execute on a cluster, which significantly improves efficiency.

Keywords: Data Mining, Classification, Parallel Implementation, Large Dataset, MapReduce.

1 Introduction
With the amount of collected data increasing everywhere, data mining techniques have been introduced to discover hidden knowledge that can contribute to decision-making. Classification is a category of data mining that extracts models to describe data classes or to predict future data trends. Classification algorithms predict categorical labels; many have been proposed by researchers in the fields of machine learning, expert systems, and statistics, such as decision trees [1,2], the naive Bayesian model [3,4,5,6], and nearest neighbors algorithms [7,8]. However, most of these algorithms were originally memory-resident, assuming a small data size. Recent research has focused on developing scalable techniques that can process large, disk-resident datasets, often by means of parallel and distributed processing.
However, due to the communication and synchronization between different distributed components, it is difficult to implement parallel classification algorithms in a distributed environment. Existing models for parallel classification are mainly implemented through the Message Passing Interface (MPI) [9], which is complex and hard to master. With the emergence of cloud computing [10,11], parallel techniques are able to solve more challenging problems, such as


heterogeneity and frequent failures. The MapReduce model [12,13,14] provides a new parallel implementation mode for the distributed environment. It allows users to benefit from the advanced features of distributed computing without having to program the coordination of tasks in the distributed environment. A large task is partitioned into small pieces that can be executed simultaneously by the workers in the cluster.
In this paper, we introduce the parallel implementation of several classification algorithms based on MapReduce, which makes them applicable to mining large datasets. The key is to design the proper key/value pairs.
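
As a hedged illustration of what such key/value pairs might look like, the sketch below shows one possible formulation of the counting step of naive Bayesian training: each map call emits a pair keyed by (class, attribute index, attribute value) with count 1, and the reduce call sums the counts for each key. The record format, function names, and the tiny sequential driver are assumptions made only for this sketch; the paper's own designs are given in Section 3.

    from collections import defaultdict

    def nb_train_map(record):
        """Map: emit count pairs for one labelled training record.
        The (features, label) record format is an assumption of this
        sketch, not necessarily the paper's exact representation."""
        features, label = record
        yield ("prior", label), 1                 # class prior count
        for i, value in enumerate(features):
            yield (label, i, value), 1            # class-conditional count

    def nb_train_reduce(counts):
        """Reduce: sum the partial counts collected for one key."""
        return sum(counts)

    # Tiny sequential driver standing in for the runtime's shuffle phase.
    data = [(("sunny", "hot"), "no"), (("rain", "mild"), "yes")]
    grouped = defaultdict(list)
    for rec in data:
        for key, one in nb_train_map(rec):
            grouped[key].append(one)
    model_counts = {k: nb_train_reduce(v) for k, v in grouped.items()}
    # model_counts now holds the frequencies needed for the Bayesian estimates.

Because the reduce step is a simple sum, the counting can be spread over many mappers and combined afterwards, which is what makes the formulation suitable for large datasets.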
The rest of the paper is organized as follows. Section 2 introduces MapReduce. Section 3 presents the details of the parallel implementation of the classification algorithms based on MapReduce. Preliminary experimental results and evaluations with respect to scalability and sizeup are presented in Section 4. Finally, the conclusions and future work are presented in Section 5.

2 MapReduce Overview

MapReduce [15], whose framework is shown in Figure 1, specifies the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Essentially, the MapReduce model allows users to write Map/Reduce components with functional-style code. These components are then composed into a dataflow graph with fixed dependency relationships to explicitly specify the parallelism. Finally, the MapReduce runtime system can transparently exploit this parallelism and schedule the components onto distributed resources for execution. All problems formulated in this way can be parallelized automatically.
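
The following minimal, single-process sketch imitates this two-phase key/value dataflow. The helper names (run_mapreduce, wc_map, wc_reduce) are invented for illustration and are not part of any MapReduce API; a real runtime such as Hadoop would execute the map and reduce calls in parallel across a cluster and perform the shuffle over the network.

    from collections import defaultdict

    def run_mapreduce(records, map_fn, reduce_fn):
        """Toy, in-memory illustration of the two-phase model: apply the
        map function to every input record, group the intermediate pairs
        by key (the shuffle), then apply the reduce function per key."""
        intermediate = defaultdict(list)
        for key, value in records:
            for out_key, out_value in map_fn(key, value):
                intermediate[out_key].append(out_value)
        return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

    # Classic word count expressed as map/reduce components.
    def wc_map(_, line):
        for word in line.split():
            yield word, 1

    def wc_reduce(word, counts):
        return sum(counts)

    docs = [(0, "map reduce map"), (1, "reduce")]
    print(run_mapreduce(docs, wc_map, wc_reduce))   # {'map': 2, 'reduce': 2}

The user only supplies the two functions; partitioning the input, grouping by key, and scheduling are left to the runtime, which is what allows the formulation to be parallelized automatically.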
All data processed by MapReduce are in the form of key/value pairs. The execution happens in two phases. In the first phase, a map function is invoked once for each input record, generating intermediate key/value pairs; in the second phase, a reduce function aggregates the intermediate values associated with each key.

Fig. 1. Illustration of the MapReduce framework: the map is applied to all input records, which generates intermediate results that are aggregated by the reduce.
