Hadoop Architecture
[Figure: Hadoop architecture. A client submits a job, and the job is assigned to the cluster.]
Hadoop is a software framework for distributed processing of large data sets on a computer cluster. Map and Reduce share a general interface: each receives a sequence of records and produces records in response. A record consists of a key and a value.
[Figure: several Map tasks run in parallel and feed their output to Reduce tasks; the final result is produced by the reducers.]
Map operation: a map task keys its output so that the system places in the same bin the records that should come together in the reduce phase.
Reduce operation: a reduce task receives all the records that share a key and merges their values into the final output.
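The keyed binning described here is typically done by hashing. A minimal sketch in Python (the function names are illustrative, not Hadoop's API):

```python
# Illustrative hash partitioner: records with the same key always land
# in the same reducer's bin, so they meet again in the reduce phase.
def partition(key, num_reducers):
    return hash(key) % num_reducers

def bin_records(records, num_reducers):
    bins = [[] for _ in range(num_reducers)]
    for key, value in records:
        bins[partition(key, num_reducers)].append((key, value))
    return bins
```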
[Figure: data flow through the framework. Mappers (M1, M2, M3) feed a Combiner, then a Partitioner; the Sorter groups the partitioned records before they reach the Reducers (R1, R2, R3).]
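The combiner's role, local aggregation of one mapper's output before the shuffle, can be sketched as follows (illustrative Python, not the Hadoop API):

```python
from collections import defaultdict

# Illustrative combiner: sums counts per key locally on a single mapper,
# shrinking the data that must travel across the network to the reducers.
def combine(mapper_output):
    totals = defaultdict(int)
    for word, count in mapper_output:
        totals[word] += count
    return sorted(totals.items())
```

For example, `combine([("car", 1), ("river", 1), ("car", 1)])` returns `[("car", 2), ("river", 1)]`, so only two records are shuffled instead of three.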
Problem statement for word count: given a huge file, determine the count of each word in the file. Approach: MapReduce takes advantage of the large number of nodes present in the cluster; the map and reduce phases run in parallel on each node.
map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
map(key=url, val=contents):
    for each word w in contents:
        emit(w, 1)

reduce(key=word, values=uniq_counts):
    sum all 1s in values list
    emit(word, sum)
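The pseudocode can be simulated end to end in plain Python. This is a sketch of the semantics only, not of Hadoop's distributed execution:

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # key: document name; value: text of document
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

def word_count(documents):
    # Shuffle: gather every count emitted for the same word.
    groups = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_fn(name, text):
            groups[word].append(count)
    # Reduce: collapse each group of counts into a total.
    result = {}
    for word, counts in sorted(groups.items()):
        for w, total in reduce_fn(word, counts):
            result[w] = total
    return result
```

Running `word_count({"d1": "deer bear river", "d2": "car car river"})` yields `{"bear": 1, "car": 2, "deer": 1, "river": 2}`.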
Worked example with the input splits "Deer Bear River", "Car Car River", "Deer Car Bear":
Map:     (Deer,1) (Bear,1) (River,1) | (Car,1) (Car,1) (River,1) | (Deer,1) (Car,1) (Bear,1)
Shuffle: Bear -> [1,1]; Car -> [1,1,1]; Deer -> [1,1]; River -> [1,1]
Reduce:  Bear 2, Car 3, Deer 2, River 2
[Figure: dataflow comparison. In batch MapReduce, a map task reads from HDFS, stores its output in a local store, and the reduce task pulls that output before writing its result back to HDFS. In pipelined MapReduce, map output is pushed to the reduce task instead of being pulled from local storage.]
[Figure: shuffle example. Mappers M1, M2, M3, and M4 emit keyed records such as (See,1), (Bob,1), and (Run,1); the Job Tracker routes each record by key to one of the reducers R1 through R4.]
The reduce task accepts the pipelined data and stores it in an in-memory buffer. The buffered spills are then merged, and the final output is written to HDFS.
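The merge step can be sketched with sorted spills combined lazily into one stream (illustrative Python using heapq.merge, not Hadoop's implementation):

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Illustrative merge: each spill is a sorted list of (key, value) records
# flushed from the in-memory buffer; merging them yields a single sorted
# stream the reducer can consume one key group at a time.
def merge_spills(spills):
    merged = heapq.merge(*spills, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [v for _, v in group]
```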
Pipelining allows data to be sent and received between tasks and between jobs without intermediate disk I/O. This reduces completion time and enables the user to take snapshots of approximate output.
In this seminar, we studied Pipelined MapReduce in the Hadoop environment. It extends the MapReduce programming model, is superior to the batch model, and reduces the completion time of tasks. Pipelined MapReduce can process large datasets effectively. In future work, we will study the applicability of the MapReduce technique in cloud computing environments.
References:
1. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. of Operating Systems Design and Implementation (OSDI), San Francisco, CA, pp. 137-150, 2004.
2. T. Hey, S. Tansley, and K. Tolle, "The Fourth Paradigm: Data-Intensive Scientific Discovery," Microsoft Research, Redmond, Washington, 2009.
3. C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for Multi-core and Multiprocessor Systems," Proc. of the 13th Symposium on High-Performance Computer Architecture (HPCA), Phoenix, AZ, 2007.
4. Hadoop, http://hadoop.apache.org/core/