Copyright 2010-2011 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
Cloudera's Introduction to Apache Hadoop: Hands-On Exercises

General Notes
Hands-On Exercise: Using HDFS
Hands-On Exercise: Run a MapReduce Job

General Notes

Cloudera's training courses use a Virtual Machine running the CentOS 5.6 Linux distribution. This VM has Cloudera's Distribution including Apache Hadoop version 3 (CDH3) installed in Pseudo-Distributed mode. Pseudo-Distributed mode is a method of running Hadoop whereby all five Hadoop daemons run on the same machine. It is, essentially, a cluster consisting of a single machine. It works just like a larger Hadoop cluster, the only key difference (apart from speed, of course!) being that the block replication factor is set to 1, since there is only a single DataNode available.

Points to note while working in the VM

1. The VM is set to automatically log in as the user training. Should you log out at any time, you can log back in as the user training with the password training.

2. Should you need it, the root password is training. You may be prompted for this if, for example, you want to change the keyboard layout. In general, you should not need this password since the training user has unlimited sudo privileges.

3. In some command-line steps in the exercises, you will see lines like this:

    $ hadoop fs -put shakespeare \
    /user/training/shakespeare

   The backslash at the end of the first line signifies that the command is not completed, and continues on the next line. You can enter the code exactly as shown (on two lines), or you can enter it on a single line. If you do the latter, you should not type in the backslash.

Hands-On Exercise: Using HDFS

In this exercise you will begin to get acquainted with the Hadoop tools. You will manipulate files in HDFS, the Hadoop Distributed File System.

Hadoop

Hadoop is already installed, configured, and running on your virtual machine. Hadoop is installed in the /usr/lib/hadoop directory. You can refer to this using the environment variable $HADOOP_HOME, which is automatically set in any terminal you open on your desktop.

Most of your interaction with the system will be through a command-line wrapper called hadoop. If you start a terminal and run this program with no arguments, it prints a help message. To try this, run the following command:

    $ hadoop

(Note: although your command prompt is more verbose, we use '$' to indicate the command prompt for brevity's sake.)

The hadoop command is subdivided into several subsystems. For example, there is a subsystem for working with files in HDFS and another for launching and managing MapReduce processing jobs.

Step 1: Exploring HDFS

The subsystem associated with HDFS in the Hadoop wrapper program is called FsShell. This subsystem can be invoked with the command hadoop fs.

1. Open a terminal window (if one is not already open) by double-clicking the Terminal icon on the desktop.

2. In the terminal window, enter:

    $ hadoop fs

   You see a help message describing all the commands associated with this subsystem.

3. Enter:

    $ hadoop fs -ls /

   This shows you the contents of the root directory in HDFS. There will be multiple entries, one of which is /user. Individual users have a "home" directory under this directory, named after their username; your home directory is /user/training.

4. Try viewing the contents of the /user directory by running:

    $ hadoop fs -ls /user

   You will see your home directory in the directory listing.
5. Try running:

    $ hadoop fs -ls /user/training

   There are no files, so the command silently exits. This is different from running hadoop fs -ls /foo, which refers to a directory that doesn't exist and would display an error message.

   Note that the directory structure in HDFS has nothing to do with the directory structure of the local filesystem; they are completely separate namespaces.

Step 2: Uploading Files

Besides browsing the existing filesystem, another important thing you can do with FsShell is to upload new data into HDFS.

1. Change directories to the directory containing the sample data we will be using in the course:

    $ cd ~/training_materials/developer/data

   If you perform a 'regular' ls command in this directory, you will see a few files, including two named shakespeare.tar.gz and shakespeare-stream.tar.gz. Both of these contain the complete works of Shakespeare in text format, but with different formats and organizations. For now we will work with shakespeare.tar.gz.

2. Unpack shakespeare.tar.gz by running:

    $ tar zxvf shakespeare.tar.gz

   This creates a directory named shakespeare/ containing several files on your local filesystem.

3. Insert this directory into HDFS:

    $ hadoop fs -put shakespeare /user/training/shakespeare

   This copies the local shakespeare directory and its contents into a remote HDFS directory named /user/training/shakespeare.

4. List the contents of your HDFS home directory now:

    $ hadoop fs -ls /user/training

   You should see an entry for the shakespeare directory.

5. Now try the same fs -ls command but without a path argument:

    $ hadoop fs -ls

   You should see the same results. If you don't pass a directory name to the -ls command, it assumes you mean your home directory, i.e. /user/training.
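The defaulting rule in the last step, together with the relative-path rule described in the "Relative paths" note, can be pictured with a small shell function. This is only a sketch of the rule as stated in this exercise, not Hadoop's own code: resolve_hdfs_path is a hypothetical helper, and the user name training is assumed from the course VM.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): mimics how this exercise says
# FsShell interprets a path argument.
#   $1 = path as typed (may be empty)
#   $2 = HDFS user name (assumed to be "training", as in the course VM)
resolve_hdfs_path() {
    path="$1"
    user="${2:-training}"
    case "$path" in
        "") printf '%s\n' "/user/$user" ;;        # no argument: home directory
        /*) printf '%s\n' "$path" ;;              # absolute path: used as-is
        *)  printf '%s\n' "/user/$user/$path" ;;  # relative: resolved against home
    esac
}

# The three cases that appear in the exercise.
resolve_hdfs_path ""          training
resolve_hdfs_path /user       training
resolve_hdfs_path shakespeare training
```

Run as written, the three calls print /user/training, /user, and /user/training/shakespeare, which matches what hadoop fs -ls lists for the corresponding arguments in the steps above.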

Relative paths

If you pass any relative (non-absolute) paths to FsShell commands (or use relative paths in MapReduce programs), they are considered relative to your home directory. For example, you can see the contents of the uploaded shakespeare directory by running:

    $ hadoop fs -ls shakespeare

You also could have uploaded the Shakespeare files into HDFS by running the following, although you should not do this now, as the directory has already been uploaded:

    $ hadoop fs -put shakespeare shakespeare

Step 3: Viewing and Manipulating Files

Now let's view some of the data copied into HDFS.

1. Enter:

    $ hadoop fs -ls shakespeare

   This lists the contents of the /user/training/shakespeare directory, which consists of the files comedies, glossary, histories, poems, and tragedies.

2. The glossary file included in the tarball you began with is not strictly a work of Shakespeare, so let's remove it:

    $ hadoop fs -rm shakespeare/glossary

   Note that you could leave this file in place if you so wished. If you did, then it would be included in subsequent computations across the works of Shakespeare, and would skew your results slightly. As with many real-world big data problems, you make trade-offs between the labor to purify your input data and the precision of your results.

3. Enter:

    $ hadoop fs -cat shakespeare/histories | tail -n 50

   This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command is handy for viewing the output of MapReduce programs. Very often, an individual output file of a MapReduce program is very large, making it inconvenient to view the entire file in the terminal. For this reason, it's often a good idea to pipe the output of the fs -cat command into head, tail, more, or less.
   Note that when you pipe the output of the fs -cat command to a local UNIX command, the full contents of the file are still extracted from HDFS and sent to your local machine. Only once the contents are on your local machine does the local command filter them for display.

4. If you want to download a file and manipulate it in the local filesystem, you can use the fs -get command. This command takes two arguments: an HDFS path and a local path. It copies the HDFS contents into the local filesystem:

    $ hadoop fs -get shakespeare/poems ~/shakepoems.txt
    $ less ~/shakepoems.txt
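The note in step 3 about piping can be checked without a cluster. The sketch below is an illustration under assumptions, not a measurement of HDFS traffic: fake_fs_cat is a hypothetical stand-in for hadoop fs -cat that simply emits numbered lines, and tee records everything that crosses the pipe before tail trims it for display.

```shell
#!/bin/sh
# Hypothetical stand-in for `hadoop fs -cat <file>`: writes $1 numbered
# lines to stdout so the sketch runs without a Hadoop installation.
fake_fs_cat() {
    seq 1 "$1"
}

# Temporary file that records everything passing through the pipe.
transferred=$(mktemp)

# tee copies the full stream into the file; tail keeps only the last 3 lines.
shown=$(fake_fs_cat 1000 | tee "$transferred" | tail -n 3)

# Count what was displayed versus what actually crossed the pipe.
displayed_count=$(printf '%s\n' "$shown" | wc -l | tr -d ' ')
transferred_count=$(wc -l < "$transferred" | tr -d ' ')

echo "lines displayed:   $displayed_count"
echo "lines transferred: $transferred_count"

rm -f "$transferred"
```

The displayed count is 3 while the transferred count is 1000: tail has to read the whole stream to find its end. The same reasoning applies when the producer is hadoop fs -cat, which is why fs -get followed by local tools can be the better choice when you plan to inspect a file repeatedly.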

Other Commands

There are several other commands associated with the FsShell subsystem, to perform most common filesystem manipulations: rmr (recursive rm), mv, cp, mkdir, etc.

1. Enter:

    $ hadoop fs

   This displays a brief usage report of the commands within FsShell. Try playing around with a few of these commands if you like.

This is the end of the Exercise

Hands-On Exercise: Run a MapReduce Job

In this exercise you will compile Java files, create a JAR, and run MapReduce jobs.

In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.

One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.

Compiling and Submitting a MapReduce Job

1. In a terminal window, change to the working directory, and take a directory listing:

    $ cd ~/training_materials/developer/exercises/wordcount
    $ ls

   This directory contains a README file and the following Java files:

    WordCount.java: A simple MapReduce driver class.
    WordCountWTool.java: A driver class that accepts generic options.
    WordMapper.java: A mapper class for the job.
    SumReducer.java: A reducer class for the job.

   Examine these files if you wish, but do not change them. Remain in this directory while you execute the following commands.
2. Compile the four Java classes:

    $ javac -classpath $HADOOP_HOME/hadoop-core.jar *.java

   Your command includes the classpath for the Hadoop core API classes. The compiled (.class) files are placed in your local directory. These Java files use the 'old' mapred API package, which is still valid and in common use: ignore any notes about deprecation of the API which you may see.

3. Collect your compiled Java files into a JAR file:

    $ jar cvf wc.jar *.class

4. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:

    $ hadoop jar wc.jar WordCount shakespeare wordcounts

   This hadoop jar command names the JAR file to use (wc.jar), the class whose main method should be invoked (WordCount), and the HDFS input and output directories to use for the MapReduce job. Your job reads all the files in your HDFS shakespeare directory, and places its output in a new HDFS directory called wordcounts.

5. Try running this same command again without any change:

    $ hadoop jar wc.jar WordCount shakespeare wordcounts

   Your job halts right away with an exception, because Hadoop automatically fails any job that tries to write its output into an existing directory. This is by design: since the result of a MapReduce job may be expensive to reproduce, Hadoop tries to prevent you from accidentally overwriting previously existing files.

6. Review the result of your MapReduce job:

    $ hadoop fs -ls wordcounts

   This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-00000, along with a _SUCCESS file and a _logs directory.)

7. View the contents of the output for your job:

    $ hadoop fs -cat wordcounts/part-00000 | less

   You can page through a few screens to see words and their frequencies in the works of Shakespeare.
   Note that you could have specified wordcounts/* just as well in this command.

8. Try running the WordCount job against a single file:

    $ hadoop jar wc.jar WordCount shakespeare/poems pwords

   When the job completes, inspect the contents of the pwords directory.

9. Clean up the output files produced by your job runs:

    $ hadoop fs -rmr wordcounts pwords

Stopping MapReduce Jobs

It is important to be able to stop jobs that are already running. This is useful if, for example, you accidentally introduced an infinite loop into your Mapper. An important point to remember is that pressing ^C to kill the current process (which is displaying the MapReduce job's progress) does not actually stop the job itself. The MapReduce job, once submitted to the Hadoop daemons, runs independently of any initiating process. Losing the connection to the initiating process does not kill a MapReduce job. Instead, you need to tell the Hadoop JobTracker to stop the job.

1. Start another word count job like you did in the previous section:

    $ hadoop jar wc.jar WordCount shakespeare count2

2. While this job is running, open another terminal window and enter:

    $ hadoop job -list

   This lists the job ids of all running jobs. A job id looks something like:

    job_200902131742_0002

3. Copy the job id, and then kill the running job by entering the following, replacing jobid with the id you copied:

    $ hadoop job -kill jobid

   The JobTracker kills the job, and the program running in the original terminal, reporting its progress, informs you that the job has failed.

This is the end of the Exercise
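Copying the job id by hand in "Stopping MapReduce Jobs" can also be done with a text filter. The following is a minimal sketch under stated assumptions: extract_job_ids is a hypothetical helper that relies only on the id shape shown above (job_, digits, underscore, digits), and the sample listing is invented for illustration, so the real column layout of hadoop job -list may differ.

```shell
#!/bin/sh
# Hypothetical helper: reads text on stdin and prints every token that
# looks like a job id (job_<digits>_<digits>), one per line.
extract_job_ids() {
    grep -Eo 'job_[0-9]+_[0-9]+'
}

# Sample text standing in for the output of `hadoop job -list`.
# The header and columns here are assumptions made for illustration only.
sample_listing='1 jobs currently running
JobId                  State  UserName
job_200902131742_0002  1      training'

printf '%s\n' "$sample_listing" | extract_job_ids
```

On the training VM, the same filter could plausibly be fed from hadoop job -list in place of the sample text, and the printed id passed to hadoop job -kill. Check the extracted id by eye before killing anything, since the filter matches any text of that shape.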