
Best machine learning:
http://patriciahoffmanphd.com/
http://www.sussex.ac.uk/Users/christ/crs/ml/handbook.html

Python:
http://webloria.loria.fr/~rougier/teaching/
Best!
https://github.com/jrjohansson/scientific-python-lectures
http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/tree/master/
www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
https://sites.google.com/site/aslugsguidetopython/basics/basic-syntax
http://datacommunitydc.org/blog/2013/07/python-for-data-analysis-the-landscape-of-tutorials/
http://manishamde.github.io/blog/2013/03/07/pandas-and-python-top-10/
https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
https://github.com/fonnesbeck/statistical-analysis-python-tutorial
http://nbviewer.ipython.org/github/twiecki/financial-analysis-python-tutorial/blob/master/1.%20Pandas%20Basics.ipynb
http://nbviewer.ipython.org/urls/gist.github.com/fonnesbeck/5850375/raw/c18cfcd9580d382cb6d14e4708aab33a0916ff3e/1.+Introduction+to+Pandas.ipynb
http://conference.scipy.org/scipy2013/tutorial_detail.php?id=109

Statistical data analysis in Python!
http://sentdex.com/sentiment-analysisbig-data-and-python-tutorials-algorithmic-trading/
http://nealcaren.web.unc.edu/big-data/
http://www.bearrelroll.com/2013/05/python-pandas-tutorial/

http://udinra.com/blog/picalo-python-data-analysis-tutorial-pdf-free-download

Python data science, best:
http://jattenberg.github.io/PDS-Fall-2013/

Best python data science!
http://aimotion.blogspot.in/2011/11/machine-learning-with-python-logistic.html

Practical data/machine learning examples!
http://people.duke.edu/~ccc14/pcfb/analysis.html
http://people.stern.nyu.edu/ja1517/pdsfall2012/
http://nbviewer.ipython.org/github/carljv/Will_it_Python/blob/master/MLFH/CH2/ch2.ipynb
http://newdatascientist.blogspot.in/p/useful-links.html#Self-teaching%20Resources
https://github.com/datasciencemasters/go/?utm_source=hackernewsletter&utm_medium=email

http://datascienc.es/schedule/
http://diggdata.in/post/50410269207/a-practical-intro-to-data-science
http://datacommunitydc.org/blog/2013/07/python-for-data-analysis-the-landscape-of-tutorials/
http://datacommunitydc.org/blog/2013/03/getting-started-with-python-for-data-scientists/
http://www.hselab.org/machinery/content/learning-python-suggestions-and-resources-business-analytics-students-and-professionals
http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/ (very good!)
http://newdatascientist.blogspot.in/p/useful-links.html#Self-teaching%20Resources
http://blog.yhathq.com/posts/data-science-in-python-tutorial.html
http://diggdata.in/tagged/datascience
http://datascienceacademy.com/free-data-science-courses/
http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/
http://pages.stern.nyu.edu/~dbackus/data_science.htm (good help to start data science!)
http://www.datasciencecentral.com/profiles/blogs/an-indispensable-python-data-sourcing-to-data-science
http://icanhazdatascience.blogspot.in/
http://scipy-lectures.github.io/intro/
http://people.stern.nyu.edu/ja1517/pdsfall2012/

Practical data science in Python!
http://www.analyticbridge.com/group/codesnippets/forum/topics/a-couple-good-python-resources-for-data-science

http://codingventures.com/articles/Getting-Started-with-Python/
http://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
http://www.reddit.com/r/datascience
http://datasciencerules.blogspot.in/2012/10/how-to-learn-data-science-part-1.html#more

https://github.com/adambard/learnxinyminutes-docs/blob/master/python.html.markdown
https://developers.google.com/edu/python/set-up
https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
http://inclass.kaggle.com/c/predict-10-key-crime-incidences-for-10-districts-in-karnataka/details/data-science-guidelines
http://cm.dce.harvard.edu/2014/01/14328/publicationListing.shtml
http://www.cs109.org/resources.php
http://datathinking.wordpress.com/2013/02/10/event-recommendation-engine-challenge-kaggle/
http://gerardnico.com/wiki/data_mining/getting_started
http://traims.com/
http://completebusinessanalytics.com/category/Tutorial.aspx
https://twitter.com/ds_ldn
http://gregemmerich.wordpress.com/2013/04/17/demystifying-big-data/

Data Mining: best links


http://www.math.umass.edu/~lavine/Book/book.html
http://www.youtube.com/results?search_type=&search_query=stats+202
http://prdeepakbabu.wordpress.com/2010/03/03/datamining-video-lectures-best-way-to-learn/
http://dataminingworld.wordpress.com/category/training/
http://www.stat.cmu.edu/~cshalizi/350/
http://patriciahoffmanphd.com/machinelearning.php
http://www.stats202.com/original_index.html
http://health.adelaide.edu.au/psychology/ccs/teaching/lsr/
http://www.youtube.com/user/CreativeHeuristics

http://libguides.princeton.edu/content.php?pid=27916&sid=456792
http://crystal.uta.edu/~cli/cse5334/
http://szamitogepesnyelveszet.blogspot.in/2011/04/on-nltk-and-python-interviwe-with-jacob.html

http://krishnasblog.com/2012/09/16/gate-nltk-basic-components-of-machine-learning-ml-system/

Machine learning and Data Science


http://www.kaggle.com/wiki/Tutorials
http://www.kaggle.com/wiki/Home
http://fastml.com/machine-learning-courses-online/ (best machine learning!)
http://ragle.sanukcode.net/articles/machine-learning-self-study-resources/
http://www.reddit.com/r/MachineLearning/comments/1ic408/self_study_machine_learning/
http://stackoverflow.com/questions/598726/overwhelmed-by-machine-learning-is-there-an-ml101-book
https://github.com/datasciencemasters/go (best data science!)
http://thebigdatainstitute.wordpress.com/2013/04/29/introduction-to-big-data-and-hadoop-ecosystem-for-beginners/
http://www.dwbiconcepts.com/database/21-database-basic-concepts/7-what-is-a-database-a-question-for-both-pro-and-newbie.html
http://www.togaware.com/datamining/survivor/Contents.html
http://shivamkrpandey.blogspot.in/search/label/Data%20Warehousing
http://completebusinessanalytics.com/category/Tutorial.aspx
http://machinelearning123.pbworks.com/w/page/26270704/FrontPage
http://www.p-value.info/2012/11/free-datascience-books.html

Python learning:
http://ourchiefweapons.wordpress.com/2013/06/12/getting-started-with-python-mooc/
http://python.berkeley.edu/
http://lurnq.com/lesson/Getting-started-with-Python-Tips-Tools-and-Resources/
http://interactivepython.org/runestone/default/user/register
http://freepythontips.wordpress.com/2013/09/01/best-python-resources/

R / Machine learning best links:
http://www.cs.utoronto.ca/~radford/csc411.F06/
http://blenditbayes.blogspot.in/2013/03/r-where-should-i-start.html (best links!)
http://jeromyanglim.blogspot.in/2010/05/videos-on-data-analysis-with-r.html
http://onepager.togaware.com/
http://www.alsharif.info/#!iom530/c21o7
digitheadslabnotebook.blogspot.in/p/usingr.html

Approachable Data Mining Tutorials for the Non Miner


May 31, 2013 by Don Krapohl. A list of several sources to learn data science in a hands-on format.

Data

https://www.kaggle.com/wiki/Tutorials - Provides data sources, forums, scenarios, and real-world competitions to teach data mining
http://deeplearning.net/tutorial/ - Tutorial on Deep Learning: an introduction to machine learning image analysis algorithms
http://tryr.codeschool.com/ - Interactive introduction to the R language

Choosing a First Machine Learning Project: Start by Reading or by Doing?


Posted by Danny Tarlow

Sarath writes about doing a project during his final year of university related to machine learning: I am writing this email to ask for some advice. Well, the thing is I haven't decided on my project yet, as I decided it will be better if I took some time to just strengthen my fundamentals and maybe work on something small. Well, I came across this great blog called Measuring Measures where they had put up a reading list for machine learning, and it was, may I say, a bit overwhelming. http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html

My present goal is doing a graduate course in some good university with some good machine learning research, and one of the reasons I wanted to do a great project is that I have heard that would be a great way of getting into a good university. So my question is: should my first priority be getting a really good and deep understanding of the subject, or should I be more concerned with doing some good project with respect to admissions? There are others who are likely more qualified than I am to answer this one, but here are my two cents: That post certainly has things that would be nice to learn, but you don't need to know all of that in order to be a successful researcher. Depending on what area you go into, you might need different subsets of those references, or you might need something different altogether. (For example, a reference I go back to time and time again is Schrijver's Combinatorial Optimization, but it's not on that list!) I think you should pick a project in an area that you find interesting, then just dive in. At first, I'd be less concerned with doing something new. First, focus on understanding a couple of different existing approaches to the specific problem you've chosen, and pick up the necessary background as you go by trying to implement the algorithms and replicate published results, following references when you get confused, looking up terms, etc. Perhaps most importantly, work on your research skills. Important things: Clearly write up exactly what you are doing and why you are doing it. Keep it as short as possible while still having all the important information. Set up a framework so you are organized when running experiments. Even if the results are not state of the art or terribly surprising, keep track of all the outputs of all your different executions with different data sets as inputs, different parameter settings, etc. Visualize everything interesting about the data you are using, the execution of your algorithms, and your results.
Look for patterns, and try to understand why you are getting the results that you are. All the while, be on the lookout for specific cases where an algorithm doesn't work very well, assumptions that seem strange, or connections between the approach you're working on and other algorithms or problems that you've run across before. Any of these can be the seed of a good research project. In my estimation, I'd think graduate schools would be more impressed by a relevant, carefully done project, even if it's not terribly novel, than they would be with you saying on your application that you have read a lot of books. If you're looking for project ideas, check out recent projects that have been done by students of Andrew Ng's machine learning course at Stanford: http://www.stanford.edu/class/cs229/projects2008.html http://www.stanford.edu/class/cs229/projects2009.html

Perhaps some readers who have experience on graduate committees can correct or add to anything that I said that was wrong or incomplete.

Getting started with Data Science and Machine Learning


Jul 22, 2013. As you might have noticed, Data Science, Big Data, Statistics, Machine Learning, Text Mining and Natural Language Processing are popular buzzwords at this point in time. This buzz is well justified and far more than hype. Search engines, word processors, recommendation engines, social media analytics, news aggregators, Kinect and self-driving cars: all of them directly or indirectly depend on Data Science and/or Machine Learning.

-.portant "opics in

ata *cience:

Data Acquisition, Mining, Scraping and Parsing


This is the stage where one actually acquires data. This might involve crawling web pages, or simulating GET and POST requests to collect target data. Mechanize in Ruby or Python might be a reasonably simple starting point in this direction. These libraries help you navigate to different web pages, enter data in forms, simulate clicks of Submit buttons, and collect the data. You might also need to understand how to parse webpages and HTML, for which you might need to understand something like XPath, or something simple, friendly and intuitive like Beautiful Soup.
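The parsing half of this step can be sketched with nothing but the standard library. This is a minimal sketch, not the Mechanize or Beautiful Soup API: the page string below is made up, and a real crawler would first fetch it with a GET request.

```python
from html.parser import HTMLParser

# Sketch: extract link targets from an HTML page, the kind of step a
# crawler performs after fetching a page. (The page string is invented.)
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = '<html><body><a href="/data.csv">data</a> <a href="/about">about</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # -> ['/data.csv', '/about']
```

Libraries like Beautiful Soup wrap this same event-driven parsing in a friendlier tree-based interface.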

Text Processing and Regular Expressions


A lot of the time, in fact most of the time, the data you collect from the web will be in a format different from what you'd like. You might need to extract something specific, like phone numbers. Or you might need to identify address fragments on a business listing page. You will almost certainly need to understand parsing libraries for XML, JSON, CSV and HTML. One of the most important and efficient tools while working with data is the regular expression. We have a track to get you started with regular expressions. The power of regular expressions is not to be underestimated. From cleaning and normalizing text data which you have collected, to categorizing text data based on popular words or patterns: this is often the foundation stage of several important data science experiments, especially those which involve data scraped from the web.
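As a concrete sketch of the "extract something specific, like phone numbers" case: the pattern below is illustrative only, covering common US-style formats rather than a complete phone-number grammar.

```python
import re

# Illustrative pattern: optional parentheses around a 3-digit area code,
# then 3 + 4 digits, with -, . or whitespace as separators.
PHONE = re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b")

text = "Call (555) 867-5309 or 212-555-0147; fax: none."
numbers = PHONE.findall(text)
print(numbers)  # -> ['(555) 867-5309', '212-555-0147']

# Normalization, the "cleaning" step mentioned above: keep digits only.
digits_only = [re.sub(r"\D", "", n) for n in numbers]
print(digits_only)  # -> ['5558675309', '2125550147']
```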

Machine Learning and Natural Language Processing


At this point, we move to the more math-intensive part of this exciting space, much of which is heavily dependent on Statistics, Probability and Linear Algebra. Machine Learning tries to learn from historical data, or to effectively cluster or segment data. Natural Language Processing is the science, or perhaps art, of trying to analyze, understand and generate language. There's a fair bit of overlap in the areas, and Natural Language Processing uses quite a few Machine Learning based techniques. Learning from data is powerful, and large volumes of data reveal a lot of information, which could effectively be used in solving or predicting unseen cases. If you'd like to appreciate the power of data, take a look at this simple 21-line spelling corrector written by Peter Norvig. And while you're at it, try out some of the problems on our Machine Learning track. In the process, you'll learn how something as simple as a histogram can be used to build powerful real world features such as a T9 text prediction engine, or a spell checker. There are also some challenges, such as the Quora Answer Classifier, which might give you the experience of real world data. There's a variety of problems on this track: some which help you brush up text processing, statistics and regression; and others where you learn how to apply the fundamentals to real problems.
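The histogram idea can be made concrete in a few lines. This is a compressed sketch in the spirit of Norvig's corrector, not his actual code: build a word-frequency histogram from a made-up toy corpus, then correct a word by choosing the most frequent known word within one edit.

```python
from collections import Counter

# Toy corpus (invented for illustration); freq is the word histogram.
corpus = "the quick brown fox jumps over the lazy dog the fox".split()
freq = Counter(corpus)

def edits1(word):
    # All strings one deletion, transposition, substitution, or insertion away.
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return deletes | transposes | substitutions | inserts

def correct(word):
    if word in freq:
        return word
    # Among known words one edit away, pick the histogram's most frequent.
    candidates = [w for w in edits1(word) if w in freq] or [word]
    return max(candidates, key=freq.get)

print(correct("teh"))  # -> "the"
```

The entire "model" here is the frequency histogram; everything else is candidate generation.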

Online Resources for Getting Started with Data Science and Machine Learning
For someone trying to get started with ML, here is a resource where the complexity is just right. It introduces you to a lot of the essential mathematics, but doesn't go too deep into it. It is an equivalent of the Applied ML course at Stanford. Very briefly, here are the ML algorithms which are very useful and basic, and will help you solve a lot of problems.

Regression: single and multiple variables, logistic regression
Overfitting and underfitting issues: 'bias' and 'variance'
Simple clustering algorithms: k-means
Applying basic linear algebra: Principal Component Analysis
Recommendation systems and large scale systems
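To make the clustering entry concrete, here is a deliberately tiny k-means sketch on 1-D points (pure Python, invented toy data): it alternates the assignment step and the mean-update step that define the algorithm.

```python
# Minimal 1-D k-means sketch. Real uses would rely on a library
# implementation and handle empty clusters and convergence properly.
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:  # assignment step: nearest center wins
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centers = [sum(ps) / len(ps) if ps else c  # update step: move to mean
                   for c, ps in clusters.items()]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
print(kmeans_1d(data, centers=[0.0, 5.0]))  # roughly [1.0, 10.0]
```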

Many people have gone on to become top Kaggle contestants (a popular data science contest portal) after doing this course. These introductory algorithms can be extremely useful. Apart from this, I'd also recommend learning a bit about text processing, such as regular expressions, string functions and language models. You might find them in the first few lectures and tutorials of this Natural Language Processing course. I'd like to emphasize that

a lot of the mathematics involved doesn't require much more than an introductory statistics course. If you go through our problem statements, many of them, like Spell Check, are structured like a tutorial, and actually guide you through the steps required to get an interesting and powerful feature working. So it is possible to build/solve as well as learn at the same time. For instance, all that Spell Check requires you to know is what a histogram is! For a quick and general introduction to Data Science, the course material from this Coursera course is great, and introduces R, Python, Map-Reduce and Data Visualization techniques. At a later stage, and for those looking for more academically challenging and abstract courses, you might be interested in the Learning from Data course from Caltech and the Probabilistic Graphical Models course from Stanford. The first course gives more theoretical insights into the foundations of Machine Learning and Statistical Learning Theory; the second is about mixing data structures with statistics to evolve Bayesian Networks and Hidden Markov Models: powerful tools which are used in medical diagnostics, speech recognition engines, Kinect, and have been found to be significant improvements on traditional Machine Learning techniques.

niket December 17, 2013 at 4:35 am # Thanks for sharing your experience, Mason. I have been learning/studying ML by trial and error for the last 1+ year. After a lot of courses/videos/books pickups and drops, I can distill my experience as below: Courses I zeroed in on are the following: 1. ML course by Andrew Ng @ Coursera 2. Learning from Data by Yaser. Taking these two courses in close succession really helped me a lot. Of course I keep coming back to them again and again, aided by materials of Stanford CS229. Books: I felt that the first two books mentioned by you were not really helpful in a constructive way. But they really helped me to realise that ML can't be learnt by blindly following these algorithmic approaches. It was great to see the outputs, but the ever ringing question "why" really kept frustrating me. And then an optimum dose of Linear Algebra, Probability and Statistics @ Khan Academy and the stats trilogy by Ani Adhikari @ edX really helped me a lot in building up the foundation. It has been an iterative effort and most probably it will continue that way.

Thanks again.

Machine learning:
http://ianma.wordpress.com/2009/07/19/machine-learning-for-beginners/
http://ufal.mff.cuni.cz/mlnlpr13/
https://sites.google.com/site/mldmda/guide-2
https://sites.google.com/site/mldmda/guide-1
http://berkeley-mltea.pbworks.com/w/page/7904759/Hadoop%20for%20Machine%20Learning%20Guide
http://www.opensourceconnections.com/2013/04/04/complete-n00bs-guide-to-enhancing-solrlucene-search-with-mahouts-machine-learning/
http://www.cs.iastate.edu/~cs573x/studyguide.html
http://www.chioka.in/getting-started-in-machine-learning-for-those-who-suck-at-mathematics/

Ian Ma

development ruby on rails community R database weka java

Life Labs

Machine Learning for Beginners

by vyolian
Lately, I got huge cravings to learn about machine learning seriously. I don't get these curious attacks often, but when they do, they're quite extreme. I touched on basic machine learning algorithms at the end of my A.I. course at UW. I ended up auditing a graduate machine learning course while I started Eggsprout. Those somehow left me wanting more… a lot more. For some reason though, there aren't as many learning resources on it as I'd like. Maybe it's because I'm only a wanna-be right now and the topic is geared more towards advanced graduate-level scholars. Is there a beginner machine learning community out there? There must be others like me…

EDIT: I've made a home for machine learning folks powered by Eggsprout at http://machine-learning.eggsprout.com. Please join me and bring others like us together. Don't be alarmed if the site is a little empty now, all communities have to start somewhere :).

If you're that kind of person and you somehow came across this blog, here's what I've found useful so far:

Stanford CS229 Machine Learning Course on Youtube: Probably the most useful and relevant resource for a beginner. I somewhat enjoy Andrew Ng's teaching style. I bet he'd be a great mentor to have for research. The way the course is organized is very different from how professor Pedro Domingos taught us, though. Not far enough in the series to say whether I like it more or not.
CS229 Lecture Notes: Lecture notes that accompany the Youtube videos.
UW Part-time Masters Lectures: Taught by professor Pedro Domingos, awesome teacher and incredibly genius.
VideoLectures.net: These look like they'd be more advanced, but interesting nonetheless. So much to watch… so little time.
ResearchChannel.org: I love this resource for just about any topic. I've spent hours on this studying neurobiology back when I first entered college. Hopefully the machine learning topics here are just as interesting.
Machine Learning Data Repository: Nice repository of data for when you're ready to practice using an algorithm. One of my homework assignments from college was from here.
Ruby A.I. Plugins: A few Ruby machine learning libraries. I'm a rubyist and it'd be nice to see more of these. Maybe I'll contribute to these some day :).
Tegu: A machine learning system in Ruby developed by David Richards. He looks like an interesting character, and I'm keeping an eye on him and his projects. I like his dedication to Ruby and machine learning.

Aside: Why do Google searches on "Machine Learning" always turn up results about sewing machines?

Published: July 19, 2009. Filed Under: development. Tags: beginners, learning, lecture, machine, video

10 Responses to "Machine Learning for Beginners"


1. Mike July 19, 2009 at 6:26 am
Thanks for the links. I'm gearing up to get serious myself. It's nice to see I'm not alone. I swear I could have written this blog myself. Cheers. Reply

2. vyolian July 19, 2009 at 6:54 am
@Mike sweet, I was a little worried myself that I was the only one. Feel free to suggest your own resources and point me to your blog for insight when you get started! Reply

3. Sidney August 23, 2009 at 8:21 pm
Hi, I just came across your post. Are you still interested in Machine Learning? I'm a beginner as well, but have been dabbling in some of the machine learning contests lately (Netflix, github). If you'd like to collaborate on some small projects at some point, shoot me an email. -Sidney Reply

4. vyolian August 28, 2009 at 10:32 pm
Hey Sidney, thanks for dropping by to reach out. I have a feeling your knowledge is already a bit more advanced than mine if you've done some Netflix contests already :). Nevertheless, I think it's a great idea that we collaborate at some point! By the way, since writing my blog, I just opened up a home for us machine learning folks. I'd love for you to join me at http://machine-learning.eggsprout.com. Let me know what you think about it, and possibly let me know if you're interested in helping me build this community. Please keep in touch. Ian. Reply

5. Dryer Vent Cleaning September 19, 2009 at 5:38 am
Cool site, love the info. Reply

6. we(e5eeweew September 22, 2009 at 3:23 am
I don't know if I said it already, but… hey, good stuff… keep up the good work! I read a lot of blogs on a daily basis, and for the most part people lack substance, but I just wanted to make a quick comment to say I'm glad I found your blog. Thanks :) A definite great read.. -Bill-Bartmann Reply

7. chathurika May 2, 2010 at 1:23 pm
chaththa, I am very new to machine learning.. started the first project about 6 months ago… your blog helped a lot.. thankz. Reply

8. NewtonStudio says: August 3, 2010 at 12:40 pm
[…] http://ianma.wordpress.com/2009/07/19/machine-learning-for-beginners/ […] Reply

9. human mathematics May 22, 2011 at 11:15 pm
I'm coming to it from the other direction. I'm very familiar with machine learning, math and stats, but I can't even install a rubygem to write a damn script. Computers are frustrating. Reply

10. Rajan Manickavasagam (@rajanmanick.) February 29, 2012 at 10:13 am
Great set of links.. thanks.. Reply

Hadoop for Machine Learning Guide


Page history last edited by Kurt 4 years ago.

Hadoop (hadoop.apache.org/core/) is a tool that makes it easy to run programs on clusters. It uses the MapReduce framework: it distributes the computation over individual records (such as data points) over a cluster, and then allows the results of that computation to be combined in a reduce step. There is a very good tutorial at hadoop.apache.org/core/docs/current/mapred_tutorial.html that goes over the basics of Hadoop operation.

In order to use Hadoop, you need to either connect to a machine that has it installed or install it on your machine. Once installed, the main executable can be run by changing to the installation directory and running

bin/hadoop

This will list all the different options for running Hadoop. See the README in the code linked below for example usages.

Writing Hadoop Programs for ML


A large number of programs in ML look like:

1. Initialize parameters
2. For each data point:
   2a. Do something (compute the gradient, sufficient statistics, etc.)
   2b. Combine it with the results on previous data points (add to the gradient, etc.)
3. Update parameters based on the computation of 2.
4. Goto 2.

Step 2 often takes the longest amount of compute time and can be easily parallelized. This is where Hadoop comes in. It distributes your data and data-based computation across the cluster, allowing you to compute and sum gradients in parallel.
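The mapper/reducer split can be sketched in a few lines of Python. This is an illustrative toy, not the ML Tea code: `point_gradient` plays the role of the mapper body (step 2a, the logistic-loss gradient for one made-up example) and `sum_gradients` plays the reducer (step 2b); a real Hadoop Streaming job would wrap these in scripts that read records from stdin and write partial results to stdout.

```python
import math

# Mapper logic (step 2a): gradient of the logistic loss for one example,
# (sigmoid(w.x) - y) * x. A streaming mapper would emit this per record.
def point_gradient(w, x, y):
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]

# Reducer logic (step 2b): sum the partial gradients coordinate-wise.
def sum_gradients(grads):
    return [sum(col) for col in zip(*grads)]

w = [0.0, 0.0]                      # current parameters (step 1)
data = [([1.0, 2.0], 1), ([1.0, -1.0], 0)]   # toy (features, label) records
total = sum_gradients([point_gradient(w, x, y) for x, y in data])
print(total)  # summed gradient, ready for the parameter update (step 3)
```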

The code presented at the ML Tea can be downloaded from hadoop_example.tar.gz. This is a simple demonstration of how to parallelize logistic regression, and we hope you will adapt it to write your own programs. NOTE: you need to have Java 1.6 in order to run this demo. Find out about getting Hadoop to run on a 32-bit Mac.

You can also use Hadoop with other languages. David Rosenberg wrote a HadoopStreaming R library. You can also just use the text-based streaming interface, which reads/writes from stdin/stdout.

Where to Run Hadoop?


Now that you have a working Hadoop program that you have tested out on your own machine, where can you run it? If you have access to a cluster, it is fairly easy to set up Hadoop. Check out the cluster setup guide from Hadoop for more information.

If you do not have access to a cluster, Amazon's EC2 may be the way to go for you. EC2 allows you to rent many machines relatively cheaply. You only pay for the time you use, so it costs as much to run a single job for 1000 hours as it does to run 1000 jobs for 1 hour. This is a great way to get started with Hadoop.

More Questions?
In addition to the tutorial available from Hadoop, there are also several great tutorial videos available from Cloudera. If you have checked all that and still have more questions, feel free to get in touch with us.

Percy <pliang AT cs> or Kurt <tadayuki AT cs>

Machine Learning

Guide 1

Learning About Statistical Learning


I get a lot of inbound requests for book recommendations from people who want to either get into research, or better understand it so they can build high quality supporting systems for research and analytics. The typical desire is to participate in some form of what statisticians call inference, or what machine learning people call learning (the two are more or less the same).

To make it easier to reply to these requests, I decided to create a "quasi-quick start guide." It is a quasi-quick start guide because the learning curve is steep, similar to what Peter Norvig says about programming. If you want to be able to lead research projects (at a startup or a team within a larger organization), or do your own research, you'll need to approach mastery of many topics from CS, engineering, maths and statistics. If you want to play a specialized role within an existing research team, you can probably get by without the full breadth of these skills. So you can pick and choose based on your goals. This guide addresses folks coming from a CS and engineering background. If the guide proves useful, I may follow up with another one for those coming from a maths/physics background. Before we get to the book recommendations, let's talk about a couple of dirty little secrets of modern applied research work. 1) You will have to write a lot of code. The biggest single problem I have encountered in the field with applied researchers that have a strong theoretical background is that they can't code their way out of a wet paper bag. Learning how to do good engineering work is more important than people realize. At a minimum, I would recommend learning Python (numpy/scipy), R, and at least one nice functional language (probably Haskell, Clojure, or OCaml). Learn how to make use of all the abstractions available to you, how to compose big functionality from many small pieces of code, and learn how to write tests for your code. I'm not going to get into the topics of programming and engineering further here; that is something I will address in greater depth if I do a guide for folks with very strong theory (maths/physics/CS) that want to move into applied work. 2) There is no substitute for experience. There is much to learn from building something real and seeing it through to production.
Beware of doing yourself a disservice by overestimating your capabilities; it is a mistake that I have fallen prey to on several occasions. Now on to the books. Please note that most of these books are explicitly written for people coming from CS and engineering and needing a crash course, so they should fit right in with your needs. Most of them also have rigor, but be aware that some are on the lighter side. For example, if you want rigorous probability theory, you should select one

Guide 2
In my opinion, these are some of the necessary skills:

1. Python/C++/R/Java: you will probably want to learn all of these languages at some point if you want a job in machine learning. Python's NumPy and SciPy libraries [2] are awesome because they have similar functionality to MATLAB, but can be easily integrated into a web service and also used in Hadoop (see below). C++ will be needed to speed code up. R [3] is great for statistics and plots, and Hadoop [4] is written in Java, so you may need to implement mappers and reducers in Java (although you could use a scripting language via Hadoop streaming [5]).

2. Probability and Statistics: a good portion of learning algorithms are based on this theory. Naive Bayes [6], Gaussian Mixture Models [7], Hidden Markov Models [8], to name a few. You need to have a firm understanding of probability and stats to understand these models. Go nuts and study measure theory [9]. Use statistics as a model evaluation metric: confusion matrices, receiver-operator curves, p-values, etc.

3. Applied Math + Algorithms: for discriminative models like SVMs [10], you need to have a firm understanding of algorithm theory. Even though you will probably never need to implement an SVM from scratch, it helps to understand how the algorithm works. You will need to understand subjects like convex optimization [11], gradient descent [12], quadratic programming [13], Lagrange [14], partial differential equations [15], etc. Get used to looking at summations [16].

4. Distributed Computing: most machine learning jobs require working with large data sets these days (see Data Science) [17]. You cannot process this data on a single machine; you will have to distribute it across an entire cluster. Projects like Apache Hadoop [4] and cloud services like Amazon's EC2 [18] make this very easy and cost-effective.
Although Hadoop abstracts away a lot of the hard-core distributed computing problems, you still need to have a firm understanding of map-reduce [22], distributed file systems [19], etc. You will most likely want to check out Apache Mahout [20] and Apache Whirr [21].

5. Expertise in Unix Tools: unless you are very fortunate, you are going to need to modify the format of your data sets so they can be loaded into R, Hadoop, HBase [23], etc. You can use a scripting language like Python (using re) to do this, but the best approach is probably just to master all of the awesome Unix tools that were designed for this: cat [24], grep [25], find [26], awk [27], sed [28], sort [29], cut [30], tr [31], and many more. Since all of the processing will most likely be on Linux-based machines (Hadoop doesn't run on Windows, I believe), you will have access to these tools. You should learn to love them and use them as much as possible. They certainly have made my life a lot easier. A great example can be found here [1].

6. Become familiar with the Hadoop sub-projects: HBase, Zookeeper [32], Hive [33], Mahout, etc. These projects can help you store/access your data, and they scale.

7. Learn about advanced signal processing techniques: feature extraction is one of the

most important parts of machine learning. If your features suck, no matter which algorithm you choose, you're going to see horrible performance. Depending on the type of problem you are trying to solve, you may be able to utilize really cool advanced signal processing algorithms like wavelets [42], shearlets [43], curvelets [44], contourlets [45], and bandlets [46]. Learn about time-frequency analysis [47], and try to apply it to your problems. If you have not read about Fourier analysis [48] and convolution [49], you will need to learn about this stuff too. The latter is signal processing 101 stuff, though.

Finally, practice and read as much as you can. In your free time, read papers like Google Map-Reduce [34], Google File System [35], Google Big Table [36], The Unreasonable Effectiveness of Data [37], etc. There are great free machine learning books online, and you should read those also [38][39][40]. Here is an awesome course I found and re-posted on GitHub [41]. Instead of using open source packages, code up your own, and compare the results. If you can code an SVM from scratch, you will understand the concepts of support vectors, gamma, cost, hyperplanes, etc. It's easy to just load some data up and start training; the hard part is making sense of it all. Good luck.

[1] http://radar.oreilly.com/2011/04...
[2] http://numpy.scipy.org/
[3] http://www.r-project.org/
[4] http://hadoop.apache.org/
[5] http://hadoop.apache.org/common/...
[6] http://en.wikipedia.org/wiki/Nai...
[7] http://en.wikipedia.org/wiki/Mix...
[8] http://en.wikipedia.org/wiki/Hid...
[9] http://en.wikipedia.org/wiki/Mea...
[10] http://en.wikipedia.org/wiki/Sup...
[11] http://en.wikipedia.org/wiki/Con...
[12] http://en.wikipedia.org/wiki/Gra...
[13] http://en.wikipedia.org/wiki/Qua...
[14] http://en.wikipedia.org/wiki/Lag...

[15] http://en.wikipedia.org/wiki/Par...
[16] http://en.wikipedia.org/wiki/Sum...
[17] http://radar.oreilly.com/2010/06...
[18] http://aws.amazon.com/ec2/
[19] http://en.wikipedia.org/wiki/Goo...
[20] http://mahout.apache.org/
[21] http://incubator.apache.org/whirr/
[22] http://en.wikipedia.org/wiki/Map...
[23] http://hbase.apache.org/
[24] http://en.wikipedia.org/wiki/Cat...
[25] http://en.wikipedia.org/wiki/Grep
[26] http://en.wikipedia.org/wiki/Find
[27] http://en.wikipedia.org/wiki/AWK
[28] http://en.wikipedia.org/wiki/Sed
[29] http://en.wikipedia.org/wiki/Sor...
[30] http://en.wikipedia.org/wiki/Cut...
[31] http://en.wikipedia.org/wiki/Tr_...
[32] http://zookeeper.apache.org/
[33] http://hive.apache.org/
[34] http://static.googleusercontent....
[35] http://static.googleusercontent....
[36] http://static.googleusercontent....
[37] http://static.googleusercontent....

[38] http://www.ics.uci.edu/~welling/...
[39] http://www.stanford.edu/~hastie/...
[40] http://infolab.stanford.edu/~ull...
[41] https://github.com/josephmisiti/...
[42] http://en.wikipedia.org/wiki/Wav...
[43] http://www.shearlet.uni-osnabrue...
[44] http://math.mit.edu/icg/papers/....
[45] http://www.ifp.illinois.edu/~min...
[46] http://www.cmap.polytechnique.fr...
[47] http://en.wikipedia.org/wiki/Tim...
[48] http://en.wikipedia.org/wiki/Fou...
[49] http://en.wikipedia.org/wiki/Con...

Another answer:

Here are some resources I've collected about working with data; I hope you find them useful (note: I'm an undergrad student, so this is not an expert opinion in any way).

1) Learn about matrix factorizations

Take the Computational Linear Algebra course (it is sometimes called Applied Linear Algebra, Matrix Computations, Numerical Analysis, or Matrix Analysis, and it can be either a CS or an Applied Math course). Matrix decomposition algorithms are fundamental to many data mining applications and are usually underrepresented in a standard "machine learning" curriculum. With TBs of data, traditional tools such as Matlab become unsuitable for the job; you cannot just run eig() on Big Data. Distributed matrix computation packages such as those included in Apache Mahout [1] are trying to fill this void, but you need to understand how the numeric algorithms/LAPACK/BLAS routines [2][3][4][5]

work in order to use them properly, adjust for special cases, build your own, and scale them up to terabytes of data on a cluster of commodity machines [6]. Usually numerics courses are built upon undergraduate algebra and calculus, so you should be good with the prerequisites. I'd recommend these resources for self-study/reference material. See: What are some good resources for learning about numerical analysis?
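To make the idea concrete, here is a minimal sketch (pure Python, no LAPACK) of power iteration, the simplest of the iterative eigenvalue methods that distributed packages scale up to cluster size; the matrix and iteration count are made-up toy values, not from any of the packages above.

```python
# Power iteration: the basic building block behind large-scale dominant-
# eigenvector computations (e.g. PageRank-style problems), sketched in
# pure Python on a tiny symmetric matrix.
def power_iteration(A, iters=200):
    n = len(A)
    v = [1.0 / n] * n
    for _ in range(iters):
        # Multiply A @ v, then normalize to unit length.
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v estimates the dominant eigenvalue.
    lam = sum(v[i] * sum(A[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v

A = [[2.0, 1.0], [1.0, 2.0]]  # symmetric toy matrix; eigenvalues are 3 and 1
lam, v = power_iteration(A)
print(round(lam, 6))  # -> 3.0
```

On terabyte-scale data, the matrix-vector product inside the loop is exactly the step that gets distributed across a cluster.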

2) Learn about distributed computing

It is important to learn how to work with a Linux cluster and how to design scalable distributed algorithms if you want to work with big data (Why the current obsession with "big" data?). If you want to squeeze the most out of your (rented) hardware, it is also becoming increasingly important to be able to utilize the full power of multicore (see http://en.wikipedia.org/wiki/Moo...). Note: this topic is not part of a standard Machine Learning track, but you can probably find courses such as Distributed Systems or Parallel Programming in your CS/EE catalog. See: What are some good resources for learning about distributed computing? Why? and Computer Science Research: Which CS areas have the most low-hanging fruit for research?
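As a minimal illustration of the programming model behind such systems, here is a single-machine sketch of the map-reduce pattern (the classic word count); the function names and documents are invented for the example.

```python
from collections import defaultdict
from itertools import chain

# The classic MapReduce word count, simulated on one machine:
# map emits (key, 1) pairs, shuffle groups values by key, reduce sums.

def map_phase(document):
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data big cluster", "data data cluster"]
pairs = list(chain.from_iterable(map_phase(d) for d in documents))
counts = reduce_phase(shuffle(pairs))
print(counts["data"], counts["big"])  # -> 3 2
```

In a real framework the map calls run on different machines, the shuffle happens over the network, and the reducers write to a distributed file system; the logic per record stays this simple.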

3) Learn about statistical analysis


See: What are some good resources for learning about statistical analysis? Why? Cosma Shalizi at CMU compiled some great materials on computational statistics and data analysis; check out his courses: http://www.stat.cmu.edu/~cshalizi/ Most importantly, pick up some R manuals (see: What are essential references for R?) and experiment with real-world data sets: Data: What are some free, public data sets? Check out job descriptions on KDnuggets and see what you need to know in order to get a job as a statistician: http://www.kdnuggets.com/jobs/ See what interests you more; do your market research. Would you prefer working with vendor tools and doing mostly modeling and reporting, or building data mining systems yourself and writing a lot of code? Do you see yourself as a corporate employee, a researcher in academia, or a startup founder in the future? What data interests you? Structure your studies based on that.
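As a small taste of the computational-statistics flavor of this material, here is a sketch of a nonparametric bootstrap confidence interval for a mean; the data values, resample count, and seed are arbitrary illustrative choices.

```python
import random
import statistics

# Nonparametric bootstrap: resample the data with replacement many times
# and use the spread of the resampled means as a confidence interval.
def bootstrap_ci(data, n_boot=2000, alpha=0.05, seed=42):
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(data, k=len(data)))
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

data = [2.1, 2.4, 1.9, 2.8, 2.2, 2.6, 2.0, 2.3]
lo, hi = bootstrap_ci(data)
print(round(lo, 3), round(hi, 3))
```

The same idea carries over directly to R (`sample(..., replace = TRUE)`), which is why R manuals plus real data sets are such an effective way to learn this.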

4) Learn about optimization

This subject is essentially a prerequisite to understanding many Machine Learning

and Signal Processing algorithms, besides being important in its own right. Start with S.P. Boyd's video lectures: http://www.stanford.edu/~boyd/ Also see: What are some good resources to learn about optimization?
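A minimal sketch of the kind of method such a course builds toward: gradient descent on a simple convex function. The function, learning rate, and iteration count here are illustrative choices, not taken from any particular lecture.

```python
# Gradient descent on the convex function f(x, y) = (x - 3)^2 + 2*(y + 1)^2.
# The gradient is (2*(x - 3), 4*(y + 1)); the unique minimum is at (3, -1).

def grad(x, y):
    return 2.0 * (x - 3.0), 4.0 * (y + 1.0)

x, y, lr = 0.0, 0.0, 0.1
for _ in range(500):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy  # step opposite the gradient

print(round(x, 4), round(y, 4))  # -> 3.0 -1.0
```

For convex objectives like this one, a small enough fixed step size is guaranteed to converge to the global minimum, which is the core reason convexity matters so much in machine learning.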

5) Learn about machine learning


See: Machine Learning: What are some good resources for learning about machine learning? Why? Large Scale Learning: What are some introductory resources for learning about large scale machine learning? Why? Statistics vs. machine learning, fight!: http://brenocon.com/blog/2008/12... You can structure your study program according to the online course catalogs and curricula of MIT, Stanford, or other top schools. Experiment with data a lot, hack some code, ask questions, talk to good people, set up a web crawler in your garage: http://www.columbia.edu/~ak2834/... You can join one of these startups and learn by doing: What startups are hiring engineers with strengths in machine learning/NLP? The alternative (and rather expensive) option is to enroll in a CS program/Machine Learning track if you prefer studying in a formal setting. See: Was your Master's in Computer Science (MS CS) degree worth it, and why? Try to avoid overspecialization. The breadth-first approach often works best when learning a new field and dealing with hard problems. See http://en.wikipedia.org/wiki/Sec...

6) Learn about information retrieval


Machine learning is not as cool as it sounds: http://teddziuba.com/2008/05/mac... See: What are the best resources to learn about web crawling and scraping? and Information Retrieval: What are some good resources to get started with Information Retrieval? Why?

7) Learn about signal detection and estimation

This is a classic topic and "data science" par excellence, in my opinion. Some of these methods were used to guide the Apollo missions or detect enemy submarines, and they are still in active use in many fields. This is often part of the EE curriculum. See: What are some good resources for learning about signal estimation and detection?

8) Master algorithms and data structures

See: What are the most learner-friendly resources for learning about algorithms?

9) Practice

Software Carpentry: http://software-carpentry.org/ Programming Languages: http://www.jsoftware.com/papers/... Programming Challenges: What are some good "toy problems" in data science? Data: Where can I get large datasets open to the public?

If you do decide to go for a Master's degree:

10) Study Engineering

I'd go for CS with a focus on either IR or Machine Learning, or a combination of both, and take some systems courses along the way. As a "data scientist" you will have to write a ton of code and probably develop distributed algorithms/systems to process massive amounts of data. An MS in Statistics will teach you how to do modeling and regression analysis, etc., not how to build systems; I think the latter is more urgently needed these days, as the old tools become obsolete with the avalanche of data. There is a shortage of engineers who can build a data mining system from the ground up. You can pick up statistics from books and experiments with R (see item 3 above) or take some statistics classes as part of your CS studies. Good luck.

[1] http://mahout.apache.org/
[2] http://www.netlib.org/lapack/
[3] http://www.netlib.org/eispack/
[4] http://math.nist.gov/javanumeric...
[5] http://www.netlib.org/scalapack/
[6] http://labs.google.com/papers/ma...

-----

1) Try to take some of the undergrad math courses you missed. Linear Algebra, Advanced Calculus, Diff. Eq., Probability, and Statistics are the most important. After that, take some Machine Learning courses. Read a few of the leading ML textbooks and keep up with journals to get a good sense of the field.

2) Read up on what the top data companies are doing. After 1 or 2 machine learning courses you should have enough background to follow most of the academic papers. Implement some of these algorithms on real data.

3) If you are working with large datasets, get familiar with the latest techniques and tools (Hadoop, NoSQL, R, etc.) by putting them into practice at work (or outside of work).

-----

I am currently working as a data engineer with a team of others, and I can tell you what we all have in common:

1) MS or PhDs in Applied Mathematics or Electrical Engineering
2) Fluency in C++/Matlab/Python
3) Experience building distributed systems and algorithms

Weekly Study Guide - Spring 2010

Please note: Lecture notes will be updated after the lecture.

Week 1 (January 11, 2010)

Overview of the course. Overview of machine learning. Why should machines learn? Operational definition of learning. Taxonomy of machine learning. Review of probability theory and random variables. Probability spaces. Ontological and epistemological commitments of probabilistic representations of knowledge. Bayesian (subjective view of probability) -- probabilities as measures of belief conditioned on the agent's knowledge. Possible-world interpretation of probability. Axioms of probability. Conditional probability. Bayes theorem. Random variables. Discrete random variables as functions from event spaces to value sets. Possible-world interpretation of random variables. Review of probability theory, random variables, and related topics (continued). Joint probability distributions. Conditional probability distributions. Conditional independence of random variables. Pair-wise independence and independence.
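The Bayes theorem material above can be illustrated with a standard numeric example; the probabilities below are made-up illustrative values, not taken from the lecture notes.

```python
# Bayes theorem on a classic diagnostic-test example:
# P(D|+) = P(+|D) P(D) / P(+), with P(+) from the law of total probability.

p_disease = 0.01            # prior probability of disease
p_pos_given_disease = 0.95  # sensitivity of the test
p_pos_given_healthy = 0.05  # false positive rate

p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 4))  # -> 0.161
```

Even with a 95% accurate test, a positive result only raises the probability of disease to about 16%, because the prior is so low; this is exactly the "probabilities as conditioned degrees of belief" view described above.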

Bayesian Decision Theory. Optimal Bayes Classifier. Minimum Risk Bayes Classifier.

Required readings

Overview of the Course. Course Policies.
Lecture slides, Vasant Honavar.
Chapters 1 and 4 from Introduction to Probability -- Charles Grinstead and Laurie Snell.
Chapter 1 from Mitchell's Machine Learning textbook.
Overview of Machine Learning, Nils Nilsson.
Does Machine Learning Really Work?, Tom Mitchell.
Chapter 13, Artificial Intelligence - A Modern Approach, S. Russell and P. Norvig.
Chapter 6 from: Mitchell, T. 1997. Machine Learning. New York: McGraw Hill.
Chapter 1, Section 1.5 of C. Bishop (2006). Pattern Recognition and Machine Learning. OR Chapter 3, sections 3.1-3.6 from: Duda, R., Hart, P., and Stork, D. (2001). Pattern Classification.
WEKA Machine Learning Algorithms in Java. WEKA -- A Starter's Guide.

Recommended Readings

Overview of Artificial Intelligence (PDF), Vasant Honavar.
Chapters 1 and 2 from Probability -- The Logic of Science by E.T. Jaynes.
An article on Probability Theory by Tom Loredo.
Automated Learning and Discovery: State-Of-The-Art and Research Topics in a Rapidly Growing Field. Sebastian Thrun, Christos Faloutsos, Tom Mitchell and Larry Wasserman. Technical Report CMU-CS-CALD-98-100. 1998.

Strongly Recommended Java Readings (for those unfamiliar with Java)


Getting Started with Java.
Java for C++ Programmers by Marv Solomon, University of Wisconsin-Madison. (Skim through.)
Online Java Resources.

Additional Information

AAAI Machine Learning Topics Page.
Jaynes, E.T. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
Cox, R.T. The Algebra of Probable Inference. The Johns Hopkins Press, 1961.
Boole, G. The Laws of Thought (first published 1854). Prometheus Books, 2003.
Feller, W. An Introduction to Probability Theory and its Applications. Vols 1, 2.

New York: Wiley. 1968.
Russell, S. and Norvig, P. 2003. Artificial Intelligence: A Modern Approach. Prentice Hall.

Week 2 (Beginning January 18, 2010)

Introduction to generative models. Naive Bayes classifier revisited. Applications of Naive Bayes classifiers - sequence and text classification. Maximum likelihood probability estimation. Properties of maximum likelihood estimators. Limitations of maximum likelihood estimators. Bayesian estimation. Conjugate priors. Detailed treatment of Bayesian estimation in the multinomial case using Dirichlet priors. Maximum a posteriori estimation. Representative applications of Naive Bayes classifiers. Evaluation of classifiers. Accuracy, precision, recall, correlation coefficient, ROC curves. Evaluation of classifiers -- estimation of performance measures; confidence interval calculation for estimates; cross-validation based estimates of hypothesis performance; leave-one-out and bootstrap estimates of performance; comparing two hypotheses; hypothesis testing; comparing two learning algorithms.

Required readings

Chapter 2 from C. Bishop (2006). Pattern Recognition and Machine Learning.
Chapters 3 and 6 from: Mitchell, T. 1997. Machine Learning. New York: McGraw Hill.
Lecture slides, Vasant Honavar.
Baldi, P., Brunak, S., Chauvin, Y. and Nielsen, H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics Vol. 16, pp. 412-424.
Confidence Interval and Hypothesis Testing. Chapter 5 from Mitchell's Machine Learning textbook.
Kochanski, G. (1995). Confidence Intervals and Hypothesis Testing.
Stark, Philip B. Statistics Tools for Internet and Classroom Instruction.
Salzberg, S. (1999). On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach. Data Mining and Knowledge Discovery, Vol. 1, pp. 317-328.
F. Provost, T. Fawcett, R. Kohavi (1998). A case against accuracy estimation of machine learning algorithms. In: Proceedings of the Fifteenth International Conference on Machine Learning.
Fawcett, T. (2003). ROC Graphs: Notes and Practical Considerations for Researchers. HP Labs Tech. Report HPL-2003-4.
Hand, D. Measuring classifier performance: a coherent alternative to the area

under the ROC curve. Machine Learning, Vol. 77, pp. 103-123.

Recommended Readings

P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103-130, 1997.
Rish, I. An Empirical Study of the Naive Bayes Classifier. In: Proc. ICML 2001.
D. D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In ECML-98: Proceedings of the Tenth European Conference on Machine Learning, pages 4-15, Chemnitz, Germany, April 1998. Springer.
Zhang, J., Kang, D-K., Silvescu, A. and Honavar, V. (2006). Extended version of Learning Compact and Accurate Naive Bayes Classifiers from Attribute Value Taxonomies and Data. Journal of Knowledge and Information Systems.
McCallum, A. and Nigam, K. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48. Technical Report WS-98-05. AAAI Press. 1998.
Jason D. M. Rennie, Lawrence Shih, Jaime Teevan and David R. Karger. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Proceedings of the Twentieth International Conference on Machine Learning. 2003.
Kang, D-K., Silvescu, A. and Honavar, V. (2006). RNBL-MN: A Recursive Naive Bayes Learner for Sequence Classification. In: Proceedings of the Tenth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2006). Lecture Notes in Computer Science. Berlin: Springer-Verlag.

Additional Information

Chapter 5 (Statistical Concepts) from: Jordan, M. (2003). An Introduction to Probabilistic Graphical Models. Draft.
Langley, P., Iba, W., and Thompson, K. (1992). An Analysis of Bayesian Classifiers. In: Proceedings of AAAI. 1992.
Langley, P. and Sage, S. (1999). Tractable average-case analysis of naive Bayesian classifiers. Proceedings of the Sixteenth International Conference on Machine Learning (pp. 220-228). Bled, Slovenia: Morgan Kaufmann.
Thorsten Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical Report CMU-CS-96-118, School of Computer Science, Carnegie Mellon University, March 1996.
Yang, Y. and G. I. Webb (2003). On Why Discretization Works for Naive-Bayes Classifiers. In Proceedings of the 16th Australian Conference on AI (AI 03), Lecture Notes AI 2903, pages 440-452. Berlin: Springer-Verlag.
George H. John, Pat Langley. Estimating Continuous Distributions in Bayesian Classifiers. Proceedings of the 1995 Conference on Machine Learning.
Susana Eyheramendy, David Lewis, David Madigan. On the Naive Bayes Model for Text Categorization. In: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. Bishop, C.M. and Frey, B. (Eds). 2003.
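The evaluation measures covered this week (accuracy, precision, recall) can be computed directly from a confusion matrix; the labels below are a made-up toy example.

```python
# Accuracy, precision, and recall from a binary confusion matrix.

def confusion(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of predicted positives, how many are real
recall = tp / (tp + fn)      # of real positives, how many were found
print(accuracy, precision, recall)  # -> 0.8 0.75 0.75
```

Sweeping the classifier's decision threshold and plotting the resulting (false positive rate, true positive rate) pairs from this same matrix is exactly how the ROC curves in the readings are constructed.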

Week 3 (Beginning Jan 25, 2010)

Modeling dependence between attributes. The decision tree classifier. Introduction to information theory. Information, entropy, mutual information, and related concepts (Kullback-Leibler divergence). Learning hypotheses from data revisited. Learning Maximum a posteriori (MAP) and Maximum Likelihood (ML) hypotheses from data. The relationship between MAP hypothesis learning, the minimum description length principle (Occam's razor), and the role of priors. Equivalence of ML hypothesis learner and consistent learner for classification tasks. Algorithm for learning decision tree classifiers from data. Overfitting and methods to avoid overfitting -- dealing with small sample sizes; pre-pruning and post-pruning. Pitfalls of entropy as a splitting criterion for multi-valued splits. Alternative splitting strategies -- two-way versus multi-way splits; alternative split criteria: Gini impurity, entropy, etc. Cost-sensitive decision tree induction -- incorporating attribute measurement costs and misclassification costs into decision tree induction. Dealing with categorical, numeric, and ordinal attributes. Dealing with missing attribute values during tree induction and instance classification.

Required Readings

Lecture slides, Vasant Honavar.
On Shannon and Information Theory, UCSD.
Chapters 2, 4 from Information Theory, Inference, and Learning Algorithms. David MacKay.
Chapter 2 from Entropy and Information Theory. R.M. Gray.
An article on Probability Theory by Tom Loredo.
Decision Trees, Nils Nilsson.
Quinlan, R. Induction of Decision Trees. Machine Learning, Vol. 1, pp. 81-106, 1986.
Friedman, J. H. (1977). A recursive partitioning decision rule for nonparametric classification. IEEE Transactions on Computers, C-26, 404-408.
Decision Tree Tutorial, by H. Hamilton, E. Gurak, L. Findlater, and W. Olive.
Caragea, D., Silvescu, A., and Honavar, V. (2004). A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems. Invited Paper. Vol. 1, pp. 80-89.
Zhang, J. and Honavar, V. (2003). Learning Decision Tree Classifiers from Attribute Value Taxonomies and Partially Specified Data. In: Proceedings of the International Conference on Machine Learning (ICML-03). Washington, DC. pp. 880-887.
Fayyad, U. and Irani, K.B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, vol. 8, pp. 87-102.
Wang, H. and Zaniolo, C. (2000). CMP: A Fast Decision Tree Classifier Using Multi-variate Predictions. In: Proceedings of the International Conference on Data Engineering.

Domingos, P. (1999). The Role of Occam's Razor in Knowledge Discovery. Vol. 3, no. 4, pp. 409-425.
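The entropy-based split criterion discussed this week can be sketched in a few lines; the toy dataset is invented for illustration.

```python
import math
from collections import Counter

# Entropy and information gain, the split criterion used by ID3-style
# decision tree learners.

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    n = len(labels)
    split = Counter(feature_values)
    # Weighted entropy of the label distribution within each branch.
    remainder = sum(
        (count / n) * entropy([l for l, f in zip(labels, feature_values) if f == v])
        for v, count in split.items()
    )
    return entropy(labels) - remainder

# Toy data: the feature perfectly predicts the label, so the gain
# equals the full entropy of the labels (1 bit here).
labels  = ["yes", "yes", "no", "no"]
feature = ["a", "a", "b", "b"]
print(entropy(labels), information_gain(labels, feature))  # -> 1.0 1.0
```

The pitfall mentioned above is visible in this formula: a feature with a distinct value per example also drives the remainder to zero, which is why gain-ratio and two-way splits are discussed as alternatives.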

Recommended Readings

Chapter 3 from Mitchell's Machine Learning textbook.
A Mathematical Theory of Communication, C. Shannon.
Schmid, H. Probabilistic Part-of-Speech Tagging Using Decision Trees. In: Proceedings of the Conference on New Methods in Language Processing, 1994.
Silvescu, A., and Honavar, V. (2001). Temporal Boolean Network Models of Genetic Networks and Their Inference from Gene Expression Time Series. Complex Systems, Vol. 13, No. 1, pp. 54-.
Wang, X., Schroeder, D., Dobbs, D., and Honavar, V. (2003). Automated Data-Driven Discovery of Motif-Based Protein Function Classifiers. Information Sciences, Vol. 155, pp. 1-18. 2003.
Codrington, C. W. and Brodley, C. E. On the Qualitative Behavior of Impurity-Based Splitting Rules: The Minima-Free Property. Tech. Rep. 97-05. Dept. of Computer Science, Cornell University.
Brodley, C. and Utgoff, P. (1995). Multi-variate Decision Trees. Machine Learning 19: 45-77.
Atramentov, A., Leiva, H., and Honavar, V. (2003). A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments. In: Proceedings of the Thirteenth International Conference on Inductive Logic Programming. Berlin: Springer-Verlag. Lecture Notes in Computer Science, Vol. 2835, pp. 38-56.
Martin, K.J. (1997). An exact Probability Metric for Decision Tree Splitting and Stopping. Machine Learning 28: 257-291.
J. E. Gehrke, V. Ganti, R. Ramakrishnan, and W-Y Loh. BOAT -- Optimistic Decision Tree Construction. In: Proceedings of the 1999 SIGMOD Conference, Philadelphia, Pennsylvania, 1999.
Johannes E. Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. RAINFOREST - A Framework for Fast Decision Tree Construction of Large Datasets. Data Mining and Knowledge Discovery, Volume 4, Issue 2/3, July 2000, pp. 127-162.
Jordan, M., Ghahramani, Z., and Saul, L. (1996). Hidden Markov Decision Trees. MIT Computational Cognitive Science Technical Report 9605.

Additional Information

Chapters 2, 4 from Information Theory, Inference, and Learning Algorithms. David MacKay.
Chapter 2 from Entropy and Information Theory. R.M. Gray.
C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993.
Algorithmic Information Theory. G. Chaitin.

Week 4 (Beginning February 1, 2010)

Introduction to Artificial Neural Networks and Linear Discriminant Functions. Threshold

logic unit (perceptron) and the associated hypothesis space. Connection with logic and geometry. Weight space and pattern space representations of perceptrons. Linear separability and related concepts. Perceptron learning algorithm and its variants. Convergence properties of the perceptron algorithm. Winner-Take-All networks.

Required Readings

Vasant Honavar, Lecture Slides.
Section 4.1 from Bishop, C. (2006). Pattern Recognition and Machine Learning.
Nils Nilsson. Neural Networks.
Honavar, V. Threshold Logic Units.
Honavar, V. Perceptron Learning Algorithm.
Honavar, V. Multi-category Generalizations of the Perceptron Algorithm.
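The perceptron learning rule from the readings above can be sketched on a linearly separable toy problem (logical AND); the epoch count is an arbitrary illustrative choice.

```python
# Perceptron learning rule: on each mistake, add or subtract the input
# vector (and adjust the bias). Guaranteed to converge on linearly
# separable data by the perceptron convergence theorem.

def train_perceptron(data, epochs=20):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, target in data:
            predicted = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            update = target - predicted  # 0 if correct, else +1 or -1
            w[0] += update * x[0]
            w[1] += update * x[1]
            b += update
    return w, b

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # logical AND
w, b = train_perceptron(data)
preds = [1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0 for x, _ in data]
print(preds)  # -> [0, 0, 0, 1]
```

In weight-space terms, each mistake moves the weight vector toward (or away from) the misclassified pattern, which is the geometric picture of convergence developed in the lectures.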

Recommended Readings

Dietterich, T. (1998). Approximate Statistical Tests for Comparing Supervised Classification Algorithms. Neural Computation 10(7):1895-1923.
Adam J. Grove, Nick Littlestone, and Dale Schuurmans. General convergence results for linear discriminant updates. In COLT-97, pages 171-183, 1997.
Welling, M. (2005). Fisher Linear Discriminant Analysis.

Additional Information

Nilsson, N. J. Mathematical Foundations of Learning Machines. Palo Alto, CA: Morgan Kaufmann (1992).
Minsky, M. and Papert, S. Perceptrons: Introduction to Computational Geometry. Cambridge, MA: MIT Press (1988).
McCulloch, W. Embodiments of Mind. Cambridge, MA: MIT Press.

Week 5 (Beginning February 8, 2010)

Introduction to Artificial Neural Networks and Linear Discriminant Functions (continued). Multi-class classifiers and Winner-Take-All networks. Dual representation of perceptrons. A learning algorithm using the dual representation of perceptrons. Linear discriminants - classification via regression, Fisher linear discriminant functions. Nonlinear feature space mappings for learning nonlinear decision boundaries. Challenges of learning nonlinear decision boundaries using feature space mappings - the computational problem of handling high-dimensional feature spaces and the curse of dimensionality (with implications for generalization) (to be revisited).

Generative versus discriminative models for classification. Bayesian framework for classification revisited. Naive Bayes classifier as a generative model. Relationship between generative models and linear classifiers. Additional examples of generative models. Generative models from the exponential family of distributions. Generative models versus discriminative models for classification.

Required Readings

Lecture slides, Vasant Honavar.
Nils Nilsson. Neural Networks.
Sections 4.2 and 4.3 from Bishop, C. (2006). Pattern Recognition and Machine Learning.
Rubinstein, D. and Hastie, T. Discriminative vs Informative Learning. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1997.
Mitchell, T. (2005). Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression. Draft book chapter.

Recommended Readings

The Weighted Majority Algorithm. Information and Computation, Vol. 108: 212-261. 1994.
Blum, A. (1995). Empirical Support for Winnow and Weighted Majority Algorithms. In: Proceedings of the Twelfth International Conference on Machine Learning, pages 64-72. Morgan Kaufmann, 1995.
Adam J. Grove, Nick Littlestone, and Dale Schuurmans. General convergence results for linear discriminant updates. In COLT-97, pages 171-183, 1997.
Golding, A. and Roth, D. (1999). Applying Winnow to Context-Sensitive Spelling Correction. Machine Learning, 34(1-3):107-130, 1999.
Rubinstein, D. and Hastie, T. Discriminative vs Informative Learning. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1997.
Bouchard, G. and Triggs, B. (2004). The tradeoff between Generative and Discriminative Classifiers. Proceedings of Computational Statistics (Compstat 04).
Raina, R., Shen, Y., Ng, A., and McCallum, A. (2003). Classification with Hybrid Generative/Discriminative Models. In Proceedings of the IEEE Conference on Neural Information Systems (NIPS 2003).

Additional Information

Yakhnenko, O., Silvescu, A. and Honavar, V. (2005). Discriminatively Trained Markov Models for Sequence Classification. In: IEEE Conference on Data Mining (ICDM 2005). Houston, Texas. IEEE Press. pp. 498-505.
Lasserre, J., C. M. Bishop, and T. Minka (2006). Principled hybrids of generative

and discriminative models. In: Proceedings 2006 IEEE Conference on Computer Vision and Pattern Recognition, New York.
Ulusoy, I. and C. M. Bishop (2005b). Generative versus discriminative models for object recognition. In Proceedings IEEE International Conference on Computer Vision and Pattern Recognition, CVPR, San Diego.
Wallach, H. M. (2004). Conditional Random Fields: An Introduction.
Lafferty, J., McCallum, A., Pereira, F. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. (McCallum says: "Don't bother reading the section on parameter estimation --- use BFGS instead of Iterative Scaling; e.g. see [McCallum UAI 2003].")
Ng, A. and Jordan, M. (2002). On Discriminative vs. Generative Classifiers: A comparison of logistic regression and Naive Bayes. Proceedings of the IEEE Conference on Neural Information Systems (NIPS 2002).
McCallum, A. (2003). Efficiently Inducing Features of Conditional Random Fields. Conference on Uncertainty in Artificial Intelligence (UAI), 2003.
Conditional Random Fields Software. Sarawagi, S. and Cohen.
Berger, A. A brief maxent tutorial.

Week 6 (Beginning February 15, 2010)

Summary of the comparison of generative versus discriminative models. Derivation of learning algorithms for discriminative models based on the exponential family for classification. Review of basics of optimization - maximization and minimization of functions. Review of relevant mathematics (limits, continuity and differentiability of functions, local minima and maxima, derivatives, partial derivatives, Taylor series approximation, multi-variate Taylor series approximation). Derivation of gradient-based learning algorithms for discriminative models for classification, e.g., a gradient-based algorithm for logistic regression; avoiding overfitting - regularized logistic regression. Maximum margin classifiers. Perceptron classifier revisited. Challenges of learning nonlinear decision boundaries using feature space mappings - the computational problem of handling high-dimensional feature spaces and ensuring generalization (avoiding the curse of dimensionality) in high-dimensional feature spaces.

Required readings

Lecture slides, Vasant Honavar.
A Tutorial on Optimization by Martin Osborne (read chapters 1-4).
Mitchell, T. (2005). Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression. Draft book chapter.
Sections 4.2 and 4.3 from Bishop, C. (2006). Pattern Recognition and Machine

Learning.
Minka, T.P. (2004). A comparison of Numerical Optimizers for Logistic Regression.
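A gradient-based update for logistic regression, of the kind derived this week, can be sketched on a 1-D toy problem; the data, learning rate, and iteration count are illustrative choices, not from the lecture notes.

```python
import math

# Gradient descent for (unregularized) logistic regression in 1-D.
# The gradient of the negative log-likelihood in each parameter is a
# sum of (prediction - label) terms, as in the standard derivation.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    gw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys))
    gb = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys))
    w -= lr * gw
    b -= lr * gb

preds = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in xs]
print(preds, w > 0)  # -> [0, 0, 0, 1, 1, 1] True
```

On separable data like this the unregularized weights grow without bound, which is exactly the overfitting problem that the regularized logistic regression covered this week is designed to fix.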

Recommended readings

Yakhnenko, O., Silvescu, A. and Honavar, V. (2005). Discriminatively Trained Markov Models for Sequence Classification. In: IEEE Conference on Data Mining (ICDM 2005). Houston, Texas. IEEE Press. pp. 498-505.
Lasserre, J., C. M. Bishop, and T. Minka (2006). Principled hybrids of generative and discriminative models. In: Proceedings 2006 IEEE Conference on Computer Vision and Pattern Recognition, New York.
Banerjee, A. (2007). An Analysis of Logistic Models: Exponential Family Connections and Online Performance. SIAM International Conference on Data Mining (SDM).
Wallach, H. M. (2004). Conditional Random Fields: An Introduction.
Lafferty, J., McCallum, A., Pereira, F. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. (McCallum says: "Don't bother reading the section on parameter estimation --- use BFGS instead of Iterative Scaling; e.g. see [McCallum UAI 2003].")
McCallum, A. (2003). Efficiently Inducing Features of Conditional Random Fields. Conference on Uncertainty in Artificial Intelligence (UAI), 2003.
Berger, A. A brief maxent tutorial.

Additional Information

Conditional Random Fields Software. Sarawagi, S. and Cohen.

Week 7 (Feb 22, 2010)

Maximum margin classifiers. The Support Vector Machine (SVM) solution - kernel functions for dealing with the computational problem. Kernel matrices. Kernel functions. Properties of kernel matrices and kernel functions. How to tell a good kernel from a bad one. How to construct kernels. From kernel machines to Support Vector Machines. Maximal margin separating hyperplanes -- why? Digression: Vapnik-Chervonenkis (VC) dimension and its properties. VC dimension of the hypothesis space of hyperplanes. Vapnik's bounds on the misclassification rate (error rate). Minimizing misclassification risk by maximizing margin. Formulation of the problem of finding the margin-maximizing separating hyperplane as an optimization problem.

Introduction to Lagrange/Karush-Kuhn-Tucker optimization theory. Optimization problems. Linear, quadratic, and convex optimization problems. Primal and dual representations of optimization problems. Convex quadratic programming formulation of the maximal margin separating hyperplane finding problem. Characteristics of the maximal margin separating hyperplane. Implementation of Support Vector Machines. Required Readings

Lecture Slides, Vasant Honavar. Convex Optimization, Max Welling. Kernel Support Vector Machines, Max Welling. A Tutorial on Optimization by Martin Osborne (chapters 1-7). Sections 6.1, 6.2 and 7.1 from Bishop, C. (2006). Pattern Recognition and Machine Learning. Schölkopf, B. (2000). Statistical Learning and Kernel Methods. Technical Report MSR-TR-2000-23. Microsoft Research. 2000. A Tutorial on Support Vector Machines. Nello Cristianini, International Conference on Machine Learning (ICML 2001). Graepel, T., Herbrich, R., & Williamson, R. C. (2001). From margin to sparsity. In Advances in Neural Information Processing Systems 13. M. A. Hearst, B. Schölkopf, S. Dumais, E. Osuna, and J. Platt. Trends and controversies - support vector machines. IEEE Intelligent Systems, 13(4):18-28, 1998. Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares, M. Jr., and Haussler, D. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA 97: 262-267, 2000. T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning (ECML-98), 1998. Yan, C., Dobbs, D., and Honavar, V. A Two-Stage Classifier for Identification of Protein-Protein Interface Residues. Bioinformatics, Vol. 20, pp. i371-i378, 2004. Platt, J. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods --- Support Vector Learning, pages 185-208, Cambridge, MA, 1999. MIT Press.

Recommended Readings

C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121--167, 1998. Scheinberg, K. (2006). An Efficient Implementation of an Active Set Method for SVMs. Journal of Machine Learning Research, Vol. 7, pp. 2237-2257. Mangasarian, O. (2006). Exact 1-Norm Support Vector Machines via Unconstrained Convex Differentiable Minimization. Journal of Machine Learning

Research, Vol. 7, pp. 1517-1530. Laskov, P., Gehl, C., Kruger, S., Muller, K-R. (2006). Incremental Support Vector Learning: Analysis, Implementation and Applications. Journal of Machine Learning Research, Vol. 7, pp. 1909-1936. Hsu, C-W., Lin, C-J. (2002). A Simple Decomposition Method for Support Vector Machines. Machine Learning, Vol. 46, pp. 291-314. Sollich, P. (2002). Bayesian Methods for Support Vector Machines: Evidence and Predictive Class Probabilities. Machine Learning, Vol. 46, pp. 21-52. Fung, G. and Mangasarian, O. (2005). Multicategory Proximal Support Vector Machine Classifiers. Machine Learning, Vol. 59, pp. 77-97. Leslie, C., Eskin, E., Weston, J., and Noble, W.S. (2002). Mismatch String Kernels for Protein Classification. Neural Information Processing Systems 2002. Hsu, C-W., and Lin, C-J. (2002). A Comparison of Methods for Multi-class Support Vector Machines. IEEE Transactions on Neural Networks, Vol. 13, pp. 415-425. Duan, K-B., and Keerthi, S. (2005). Which is the Best Multiclass SVM Method? An Empirical Study. Springer-Verlag Lecture Notes in Computer Science, Vol. 3541, pp. 278-285.

#dditional 7nformation

V. Vapnik, A. Lerner (1963). Pattern recognition using generalized portrait method. Automation and Remote Control 24: 774--780. V. Vapnik, A. Chervonenkis (1964). A note on one class of perceptrons. Automation and Remote Control, Vol. 25. Mangasarian, O. (1965). Linear and Nonlinear Separation of Patterns by Linear Programming. Operations Research, Vol. 13, pp. 444-452. Cristianini, N. and Shawe-Taylor, J. (2000). Support Vector Machines. London: Cambridge University Press. Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Classification. London: Cambridge University Press. Müller, K.R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An Introduction to Kernel Methods for Pattern Classification. IEEE Transactions on Neural Networks, Vol. 12, pp. 181-201. Edgar Osuna, Robert Freund, and Federico Girosi. Support vector machines: Training and applications. Technical Report AIM-1602, 1997. S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, and K.R.K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Technical report, Dept. of CSA, IISc, Bangalore, India, 1999. S.S. Keerthi and E.G. Gilbert. Convergence of a generalized SMO algorithm for SVM classifier design. Technical Report CD-00-01, Dept. of Mechanical and Production Eng., National University of Singapore, 2000. Yi Li and Philip M. Long. The Relaxed Online Maximum Margin Algorithm. Machine Learning, Vol. 46, pp. 361-, 2002. Manevitz, L. and Yousef, M. (2001). One-class Support Vector Machines for Document Classification. Journal of Machine Learning Research, Vol. 2, pp. 139-

154. J. Platt, N. Cristianini, J. Shawe-Taylor. Large Margin DAGs for Multiclass Classification. In: Advances in Neural Information Processing Systems 12, pp. 547-553, MIT Press (2000). Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y. (2005). Large Margin Methods for Structured and Interdependent Variables. Journal of Machine Learning Research, Vol. 6, pp. 1453-1484. Zelenko, D., Aone, C., and Richardella, A. (2003). Kernel Methods for Relation Extraction. Journal of Machine Learning Research, Vol. 3, pp. 1083-1106. Jebara, T., Kondor, R., and Howard, A. (2004). Probability Product Kernels. Journal of Machine Learning Research, Vol. 5, pp. 819-844.
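A concrete way to see the computational point made in the kernel readings above: a kernel function evaluates an inner product in a feature space without ever constructing the feature vectors. The following is a minimal sketch in plain Python (illustrative, not taken from any of the cited papers); it checks that the degree-2 polynomial kernel K(x, z) = (x . z)^2 agrees with an explicit dot product over all pairwise-product features:

```python
# Sketch: the degree-2 polynomial kernel equals an ordinary dot product
# in an explicit quadratic feature space -- but is cheaper to evaluate,
# since the feature space is never built (O(d) vs O(d^2) work).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed directly in input space."""
    return dot(x, z) ** 2

def quad_features(x):
    """Explicit feature map phi(x): all pairwise products x_i * x_j."""
    return [xi * xj for xi in x for xj in x]

x, z = [1.0, 2.0], [3.0, 0.5]
implicit = poly_kernel(x, z)                        # kernel trick
explicit = dot(quad_features(x), quad_features(z))  # explicit features
assert abs(implicit - explicit) < 1e-9              # both equal 16.0
```

The same identity is what lets an SVM work in very high-dimensional (even infinite-dimensional) feature spaces at input-space cost.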

Weeks 8-9 (Beginning March 1, 2010): Probabilistic Graphical Models. Bayesian networks. Independence and conditional independence. Exploiting independence relations for compact representation of probability distributions. Introduction to Bayesian networks. Semantics of Bayesian networks. D-separation. D-separation examples. Answering independence queries using D-separation tests. Probabilistic inference using Bayesian networks. Bayesian network inference. Approximate inference using stochastic simulation (sampling, rejection sampling, and likelihood-weighted sampling). Learning Bayesian networks from data. Learning of parameters (conditional probability tables) from fully specified instances (when no attribute values are missing) in a network of known structure (review). Learning Bayesian networks with unknown structure -- scoring functions for structure discovery, searching the space of network topologies using scoring functions to guide the search, structure learning in practice, Bayesian approach to structure discovery, examples. Learning Bayesian network parameters in the presence of missing attribute values (using Expectation Maximization) when the structure is known; learning networks of unknown structure in the presence of missing attribute values. Some special classes of probabilistic graphical models: Markov models, mixture models. Required readings

Lecture Slides, Vasant Honavar. Chapter 14 from the Artificial Intelligence: A Modern Approach textbook by Russell and Norvig. Chapter 8 (Sections 8.2, 8.3, 8.4 (8.4.1-8.4.6)) and Chapter 11 (Section 11.1) from Pattern Recognition and Machine Learning, by C. Bishop. Lecture Slides, Vasant Honavar

Lecture Slides, Vasant Honavar. A Tutorial on Learning with Bayesian Networks, David Heckerman. Tech. Rep. MSR-TR-95-06, Microsoft Research. Approximating Discrete Probability Distributions with Dependence Trees. Chow, C.K. and Liu, C.N. IEEE Transactions on Information Theory, 14(3), 1968, pp. 462-467. Learning Bayesian belief networks: An approach based on the MDL principle. W. Lam and F. Bacchus, Computational Intelligence, 10(4), 1994. Bayesian Network Classifiers. Friedman, N., Geiger, D., and Goldszmidt, M. Machine Learning 29, pp. 131-163, 1997. An Introduction to MCMC for Machine Learning, Andrieu et al., Machine Learning, 2001.

Recommended Readings: Dependency Modeling Course, University of Helsinki. Bayesian Networks and Decision-Theoretic Reasoning for Artificial Intelligence, Daphne Koller and Jack Breese. Tutorial given at AAAI-97. A Logical Notion of Conditional Independence: Properties and Applications, by A. Darwiche.

Dependency Modeling Course, University of Helsinki. Inference in Bayesian Networks -- A Procedural Guide. Huang, C. and A. Darwiche. Journal of Approximate Reasoning, Vol. 15, pp. 225-263. http://www.cs.ubc.ca/~nando/papers/mlintro.pdf (Introduction to MCMC for Machine Learning). Being Bayesian about Network Structure - A Bayesian Approach to Structure Discovery in Bayesian Networks. N. Friedman and D. Koller. Machine Learning, Vol. 50, pp. 95-125, 2003. Tractable Learning of Large Bayes Net Structures from Sparse Data. Goldenberg, A. and Moore, A. (2004). In Proceedings of the International Conference on Machine Learning, 2004. Learning Bayesian Network Structure from Massive Data Sets - The Sparse Candidate Algorithm. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 1999. Inferring Cellular Networks Using Probabilistic Graphical Models. Science, Vol. 303, pp. 799-805. Learning Bayesian Belief Networks Based on the Minimum Description Length Principle: Basic Properties. J. Suzuki, IEICE Transactions on Fundamentals, Vol. E82, No. 10, pp. 2237-2245. Comparing Model Selection Criteria for Belief Networks. Tim Van Allen, Russ Greiner, 2000. Module Networks: Identifying Regulatory Modules and their Condition-Specific Regulators from Gene Expression Data. Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., and Friedman, N. Nature Genetics, Vol. 34, pp. 166-176, 2003. Learning Bayesian Network Classifiers for Credit Scoring Using Markov Chain

Monte Carlo Search. B. Baesens et al., 2001. Operations for Learning with Graphical Models. Wray Buntine. Journal of Artificial Intelligence Research, Vol. 2, pp. 159-225, 1994.
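To make the "approximate inference by stochastic simulation" topic from Weeks 8-9 concrete, here is a hedged sketch in plain Python: a hypothetical two-cause network (Rain and Sprinkler both influence WetGrass, with made-up probabilities), where rejection sampling is checked against exact inference by enumeration:

```python
import random

# Toy Bayesian network: Rain -> WetGrass <- Sprinkler.
# All probabilities below are invented for illustration.
P_rain = 0.2
P_sprinkler = 0.1
def p_wet(rain, sprinkler):
    if rain and sprinkler: return 0.99
    if rain: return 0.9
    if sprinkler: return 0.8
    return 0.05

# Exact P(Rain=T | Wet=T) by enumerating the four joint settings.
num = den = 0.0
for rain in (True, False):
    for spr in (True, False):
        joint = (P_rain if rain else 1 - P_rain) * \
                (P_sprinkler if spr else 1 - P_sprinkler) * p_wet(rain, spr)
        den += joint
        if rain:
            num += joint
exact = num / den

# Rejection sampling: draw from the prior, keep only samples
# consistent with the evidence Wet=True, then count Rain among them.
random.seed(0)
kept = rainy = 0
for _ in range(100_000):
    rain = random.random() < P_rain
    spr = random.random() < P_sprinkler
    wet = random.random() < p_wet(rain, spr)
    if wet:
        kept += 1
        rainy += rain
approx = rainy / kept
assert abs(approx - exact) < 0.02   # sampling estimate tracks the exact answer
```

Likelihood weighting, also listed in the topics, improves on this by never wasting the rejected samples; each sample is instead weighted by the likelihood of the evidence.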

SPRING BREAK

Week 10 (Beginning March 22, 2010): Midterm solutions. Introduction to relational and multi-relational learning. (Guest lecture by Harris Lin.) Introduction to semi-supervised learning. (Guest lecture by Fadi Towfic.) Required Readings

D. Heckerman, C. Meek, and D. Koller. Probabilistic Models for Relational Data. Technical Report MSR-TR-2004-30, Microsoft Research, March 2004. Dzeroski, S. (2003). Multi-relational Learning: An Introduction. SIGKDD Explorations. Atramentov, A., Leiva, H., and Honavar, V. (2003). A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments. In: Proceedings of the Thirteenth International Conference on Inductive Logic Programming. Berlin: Springer-Verlag. Popescul, A., Ungar, L., Lawrence, S., and Pennock, D. (2003). Statistical Relational Learning for Document Mining. In: Proceedings of ICDM (2003). Davis, J. et al. (2005). View Learning for Statistical Relational Learning with Applications to Mammography. In: Proceedings of ICML (2005). Ceci, M., Appice, A., and Malerba, D. (2003). Mr-SBC: A Multi-relational Naive Bayes Classifier. In: Proceedings of PKDD (2003). Domingos, P. and Richardson, M. (2008). Markov Logic Networks: A Unifying Framework for Multi-relational Learning. Knobbe, A., Haas, M., and Siebes, A. (2001). Propositionalization and Aggregates. In: Proceedings of PKDD (2001). Zhu, X. and Goldberg, A. (2009). Introduction to Semi-Supervised Learning.

Week 11 (Beginning March 29, 2010): Topics in Computational Learning Theory. Probably Approximately Correct (PAC) Learning Model. Efficient PAC learnability. Sample complexity of PAC learning in terms of cardinality of the hypothesis space (for finite hypothesis classes). Some concept classes that are easy to learn within the PAC

setting. Efficiently PAC learnable concept classes. Sufficient conditions for efficient PAC learnability. Some concept classes that are not efficiently learnable in the PAC setting. Making hard-to-learn concept classes efficiently learnable -- transforming instance representation and hypothesis representation. Occam learning algorithms. PAC learnability of infinite concept classes. Vapnik-Chervonenkis (VC) dimension. Properties of VC dimension, VC dimension and learnability, learning from noisy examples, transforming weak learners into PAC learners through accuracy and confidence boosting, learning under helpful distributions - Kolmogorov complexity, conditional Kolmogorov complexity, universal distributions, learning simple concepts, learning from simple examples. Required readings

Lecture Slides, V. Honavar. Overview of the Probably Approximately Correct (PAC) Learning Framework. D. Haussler, 1995. Kearns, M. 1998. Efficient Noise-Tolerant Learning from Statistical Queries. Journal of the ACM, Vol. 45, pp. 983-1006. Parekh, R. and Honavar, V. (2000). On the Relationships between Models of Learning in Helpful Environments. In: Proceedings of the Fifth International Conference on Grammatical Inference. Lisbon, Portugal. Parekh, R. and Honavar, V. (2001). DFA Learning from Simple Examples. Machine Learning, Vol. 44, pp. 9-35.

Recommended Readings: Cesa-Bianchi, N., Dichterman, E., Fischer, P., Shamir, E., Simon, H. 1999. Sample-Efficient Strategies for Learning in the Presence of Noise. Journal of the ACM, Vol. 46, pp. 684-719. Goldreich, O. and Goldwasser, S. 1998. Property Testing and its Connection to Learning and Approximation. Journal of the ACM, Vol. 45, pp. 653-750. Khardon, R. and Roth, D. 1997. Learning to Reason. Journal of the ACM, Vol. 44, pp. 697-725. Valiant, L. 2000. A Neuroidal Architecture for Cognitive Computation. Journal of the ACM, Vol. 47, pp. 854-882. Maass, W. 1994. Efficient Agnostic PAC Learning with Simple Hypotheses. Proceedings of the Seventh Annual Conference on Computational Learning Theory, 1994, pp. 67-75. Benedek, G. and Itai, A. Dominating Distributions and Learnability. In: Annual Workshop on Computational Learning Theory, 1992. Additional Information

Kearns, M. and Vazirani, U. (1994). An Introduction to Computational Learning Theory, MIT Press. Computational Learning Theory, Sally Goldman.
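The sample-complexity result listed in the Week 11 topics (for finite hypothesis classes) can be turned into a one-line calculation. Below is the standard bound for a consistent learner, m >= (1/eps)(ln|H| + ln(1/delta)); the conjunctions example is illustrative:

```python
import math

def pac_sample_size(h_size, eps, delta):
    """Number of examples sufficient for a consistent learner over a
    finite hypothesis class H to be (eps, delta)-PAC:
        m >= (1/eps) * (ln|H| + ln(1/delta))
    """
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# Example: conjunctions over n = 10 boolean attributes, |H| = 3^10
# (each attribute appears positively, negated, or not at all).
m = pac_sample_size(3 ** 10, eps=0.1, delta=0.05)
print(m)  # 140 -- a few hundred examples suffice
```

Note how m grows only logarithmically in |H|, which is why even exponentially large hypothesis classes (3^n conjunctions) need only polynomially many examples.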

Week 12 (beginning April 5, 2010): Mistake bound analysis of learning algorithms. Mistake bound analysis of online algorithms for learning conjunctive concepts. Optimal mistake bounds. Version Space Halving Algorithm. Randomized Halving Algorithm. Learning monotone disjunctions in the presence of irrelevant attributes -- the Winnow and Balanced Winnow algorithms. Multiplicative update algorithms for concept learning and function approximation. Weighted Majority algorithm. Applications. Required readings

Lecture Slides, V. Honavar. The Weighted Majority Algorithm. Littlestone, N., and Warmuth, M. Information and Computation, Vol. 108: 212-261, 1994. Empirical Support for Winnow and Weighted-Majority Algorithms. Blum, A. In: Proceedings of the Twelfth International Conference on Machine Learning, pages 64--72. Morgan Kaufmann, 1995. Applying Winnow to Context-Sensitive Spelling Correction. Golding, A., and Roth, D. Machine Learning, 34(1-3):107--130, 1999. M.H. Yang, D. Roth, and N. Ahuja. A SNoW-based face detector. NIPS (12), pages 855-861, 2000.

Recommended Readings: A.J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. In Proc. 10th Annu. Conf. on Comput. Learning Theory, pages 171--183, 1997. Tong Zhang. Regularized winnow methods. In Advances in Neural Information Processing Systems 13, pages 703-709, 2001. Helmbold, D.P., Schapire, R.E., Singer, Y., Warmuth, M.K. On-line portfolio selection using multiplicative updates. Mathematical Finance, Vol. 8(4), pp. 325-347, 1998. Additional Information

Kearns, M. and Vazirani, U. (1994). An Introduction to Computational Learning Theory, MIT Press.
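As a companion to the Winnow material above, here is a minimal sketch (illustrative, not Littlestone's exact pseudocode) of a Winnow-style learner for a monotone disjunction, with multiplicative promotion on positive mistakes and elimination on negative mistakes; the point of the algorithm is that its mistake count scales with O(k log n) for k relevant attributes, not with n:

```python
import itertools

# Winnow-style learner for a monotone disjunction over n boolean
# attributes. Weights change only on mistakes: doubled for active
# attributes on a false negative, zeroed on a false positive.

def predict(w, x, theta):
    return sum(wi * xi for wi, xi in zip(w, x)) >= theta

def winnow(examples, n, max_passes=50):
    w, theta = [1.0] * n, float(n)
    for _ in range(max_passes):
        clean = True
        for x, y in examples:
            if predict(w, x, theta) != y:
                clean = False
                for i in range(n):
                    if x[i]:
                        w[i] = 2 * w[i] if y else 0.0  # promote / eliminate
        if clean:          # a full mistake-free pass: hypothesis is consistent
            break
    return w, theta

# Target concept: x0 OR x3, over all 2^8 boolean vectors.
n = 8
data = [(x, bool(x[0] or x[3])) for x in itertools.product([0, 1], repeat=n)]
w, theta = winnow(data, n)
assert all(predict(w, x, theta) == y for x, y in data)
```

Because demotion never touches the relevant attributes (any example with a relevant attribute set is positive), the learned weights end up large exactly on x0 and x3.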

Week 13 (Beginning April 12, 2010): Ensemble classifiers. Techniques for generating base classifiers; techniques for combining classifiers. Committee machines and bagging. Boosting. The AdaBoost algorithm. Theoretical performance of AdaBoost. Boosting in practice. When does boosting help? Why does boosting work? Boosting and additive models. Loss function analysis. Boosting of multi-class classifiers. Boosting using classifiers that produce confidence estimates for class labels. Boosting and margin. Variants of boosting - generating classifiers by changing the instance distribution; generating classifiers by using subsets of features; generating classifiers by changing the output code. Further insights into boosting. Required readings

Lecture Slides, V. Honavar. Chapter 14 (Sections 14.1-14.3) from Pattern Recognition and Machine Learning, C. Bishop. Freund, Y. (1999). A Short Introduction to Boosting. Journal of the Japanese Society for Artificial Intelligence, Vol. 14, pp. 771-780. (English translation by Naoki Abe). Friedman, J., Hastie, T., and Tibshirani, R. (2000). A Statistical View of Boosting. Annals of Statistics, Vol. 28, pp. 337-407. Meir, R. and Rätsch, G. (2002). An Introduction to Boosting and Leveraging. Advanced Lectures on Machine Learning. Lecture Notes in Computer Science, pp. 118-183. Berlin: Springer-Verlag. Bauer, E., and Kohavi, R. (1999). An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, Vol. 36, pp. 105-142. Breiman, L. (1994). Bagging Predictors. Tech. Rep. 421, Department of Statistics, University of California, Berkeley, CA.

Recommended Readings

Efficient Margin Maximization Using Boosting. G. Rätsch and M.K. Warmuth, JMLR 2005. Freund, Y. (1996). Game Theory, On-line Prediction, and Boosting. In: Proceedings of the Conference on Computational Learning Theory (COLT 1996). Schapire, R. and Singer, Y. (1999). Improved Boosting Algorithms Using Confidence-Rated Predictions. Machine Learning, Vol. 37.
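The AdaBoost algorithm discussed above fits in a few dozen lines. This sketch (plain Python, with a made-up 1-D data set that no single stump can classify correctly) uses one-feature threshold "decision stumps" as the weak learner; the exponential re-weighting step is the heart of the algorithm:

```python
import math

# AdaBoost with threshold decision stumps on 1-D data; labels are +1/-1.

def stump_predict(s, x):
    thr, sign = s
    return sign if x > thr else -sign

def best_stump(X, y, w):
    """Stump with lowest weighted error on the current distribution w."""
    best, best_err = None, float("inf")
    for thr in sorted(set(X)):
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_predict((thr, sign), xi) != yi)
            if err < best_err:
                best, best_err = (thr, sign), err
    return best, best_err

def adaboost(X, y, rounds):
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        s, err = best_stump(X, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, s))
        # Reweight: misclassified points gain weight, then normalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(s, xi))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def ensemble_predict(ensemble, x):
    return 1 if sum(a * stump_predict(s, x) for a, s in ensemble) > 0 else -1

# Toy data: positives at both ends, so no single stump fits it,
# but a weighted vote of a few stumps does.
X = [1, 2, 3, 4, 5, 6]
y = [1, 1, -1, -1, 1, 1]
H = adaboost(X, y, rounds=5)
assert all(ensemble_predict(H, xi) == yi for xi, yi in zip(X, y))
```

This illustrates the "generating classifiers by changing the instance distribution" variant listed in the topics: every round, the weak learner sees a reweighted version of the same data.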

Week 14 (Beginning April 19, 2010): Multiple instance learning. Generalized multiple instance learning. Multiple label learning. Multiple instance multiple label learning.

Week 15 (Beginning April 26, 2010): Bayesian recipe for function approximation and the Least Mean Squared (LMS) Error Criterion. Introduction to neural networks as trainable function approximators. Function approximation from examples. Minimization of error functions. Derivation of a learning rule for minimizing the mean squared error function for a simple linear neuron. Momentum modification for speeding up learning. Introduction to neural networks for nonlinear function approximation. Nonlinear function approximation using multi-layer neural networks. Universal function approximation theorem. Derivation of the generalized delta rule (GDR) (the backpropagation learning algorithm). Generalized delta rule (backpropagation algorithm) in practice - avoiding overfitting, choosing neuron activation functions, choosing the learning rate, choosing initial weights, speeding up learning, improving generalization, circumventing local minima, using domain-specific constraints (e.g., translation invariance in visual pattern recognition), exploiting hints, using neural networks for function approximation and pattern classification. Relationship between neural networks and Bayesian pattern classification. Variations -- radial basis function networks. Learning nonlinear functions by searching the space of network topologies as well as weights. Lazy learning algorithms. Instance-based learning, k-nearest neighbor classifiers, distance functions, locally weighted regression. Relative advantages and disadvantages of lazy learning and eager learning. Required readings

Chapter 5, Pattern Recognition and Machine Learning, C. Bishop. Lecture Slides, Vasant Honavar. Honavar, V. Function Approximation from Examples. Honavar, V. Multi-layer Networks. Honavar, V. Radial Basis Function Networks. C. G. Atkeson, S. A. Schaal and Andrew W. Moore. Locally Weighted Learning. AI Review, Volume 11, pages 11-73. Kluwer Publishers, 1997.

Recommended Readings: T. G. Dietterich, H. Hild, and G. Bakiri. A comparative study of ID3 and backpropagation for English text-to-speech mapping. In Proceedings of the 7th IMLW, Austin, 1990. Morgan Kaufmann. Y. Le Cun, B. Boser, J.S. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Handwritten digit recognition with a backpropagation neural network. In D. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 396--404. Morgan Kaufmann, San Mateo, CA, 1990. S. Thrun and T.M. Mitchell. Learning one more thing. Technical Report CMU-CS-94-184, Carnegie Mellon University, Pittsburgh, PA 15213, 1994. R. Williams and D. Zipser. Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity. In Backpropagation: Theory, Architectures, and Applications, Chauvin and Rumelhart, Eds., LEA, 1995, pp. 433-485.

Solomon, R. and J. L. van Hemmen (1996). Accelerating backpropagation through dynamic self-adaptation. Neural Networks 9(4), 589--601. Craven, M. and Shavlik, J. Using Neural Networks for Data Mining. Future Generation Computer Systems 13:211-229. R. Setiono, W.K. Leow and J.M. Zurada. Extraction of rules from artificial neural networks for nonlinear regression. IEEE Transactions on Neural Networks, 2002. Poggio et al. 1995. Regularization Theory and Neural Network Architectures. Neural Computation, 7:219--269, 1995. Fahlman, S. and Lebiere, C. 1991. Cascade Correlation Architecture. Technical Report CMU-CS-90-100, Carnegie Mellon University, August 1991. Poggio et al. 1989. A Theory of Networks for Approximation and Learning. Technical Report 1140, MIT Artificial Intelligence Laboratory, 1989.
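The "learning rule for minimizing mean squared error for a simple linear neuron" in the Week 15 topics is the delta (LMS) rule: nudge each weight along the negative error gradient, which for a linear neuron is just (error x input). A minimal sketch with an invented linear target (all names and constants here are illustrative):

```python
import random

# Delta rule / LMS for a single linear neuron: learn the (made-up)
# target y = 2*x1 - 3*x2 + 1 from sampled examples.

random.seed(42)
def target(x1, x2):
    return 2 * x1 - 3 * x2 + 1

data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(50)]
w1 = w2 = b = 0.0
eta = 0.1                      # learning rate
for _ in range(200):           # epochs of per-example (stochastic) updates
    for x1, x2 in data:
        err = target(x1, x2) - (w1 * x1 + w2 * x2 + b)
        # Delta rule: w <- w + eta * err * input (gradient of squared error).
        w1 += eta * err * x1
        w2 += eta * err * x2
        b += eta * err

assert abs(w1 - 2) < 1e-3 and abs(w2 + 3) < 1e-3 and abs(b - 1) < 1e-3
```

Backpropagation, derived in the same week, is this same gradient step applied layer by layer through the chain rule, with a nonlinearity's derivative multiplied into each neuron's error term.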

Machine Learning Guidance For Beginners


Posted on January 8, 2013 by whiteswami

With a deluge of machine learning resources both online and offline, a newbie in this field can easily get awestruck and stranded by indecisiveness. Some people are good at spotting what to read and follow and what to skip. This post is for ML enthusiasts who have always wanted to get their hands wet but have not found a good way to understand and use ML. [Hilary Mason's video on ML for Hackers] gives a great introductory feel for the ML area in 30 minutes.

People who think a rigorous background in stochastics, optimization and linear algebra is absolutely necessary before starting are not always correct. The most important thing is to get started; the mathematical fundamentals can be learnt on the fly. That said, some prior knowledge helps. A person cannot learn to swim without diving into the water, no matter how much they have read about swimming. The same analogy applies here. But one should be cautious in one's approach. I have seen many people run away from ML for reasons like "it's just statistics" or "too much maths". Some even learn things but do not know where to use them. These factors essentially kill their enthusiasm. Therefore, a good balance between theory and practice is necessary. One should try to apply the various ML techniques learnt, and once people start applying ML there are never-ending "wows".

So, where does one start? I would recommend going through an advanced track of Andrew Ng's online ML course on Coursera to begin with. It is fairly broad and thorough, with a good balance between learning and application: it not only strengthens the basics but also makes you program and apply them. The Stanford CS229 course by Andrew Ng offers more depth and is much better for understanding the internals of ML. Along with Andrew Ng's course, one also needs to work a bit on algebra and probability to take a bigger leap. Another great set of video lectures, titled Learning from Data, is by Prof. Yaser S. Abu-Mostafa of Caltech. I personally consider this course superior to Andrew Ng's, both for its content and for the professor's approach to ML. Mathematicalmonk's channel on YouTube is another comprehensive resource on machine learning; together with its probability primer lectures, it covers a broad range of topics with good mathematical fundamentals.

Other than these video resources, there are quite a few good introductory books on ML:

1. One of my favorites is the [PRML book by Bishop]
2. [Tom Mitchell's book] is another widely accepted text.
3. More mathematical but a nice read is [Pattern Classification by Duda and Hart]

Now that one has gathered good fundamentals in ML and is aware of the various terminologies and jargon, one can explore areas based on one's own interest. At this point one needs to decide whether to merely use existing ML algorithms and tools or to code the algorithms oneself. Neither choice is inferior to the other, but people deciding to write new or existing ML algorithms need to be aware of the internals behind the curtain. This is where Andrew Ng's course lacks immensely: it takes more of a tool-gatherer's approach, which is in many ways good for ML enthusiasts but not desirable for all. The tool-gatherer category also needs to evolve toward large-scale machine learning because of its relevance in the current era.
For this, Programming Collective Intelligence by Toby Segaran is a great resource. A good tool to start experimenting with large-scale data is Mahout (mahout.apache.org). Machine Learning for Hackers is another great practical book.

To carry out experiments oneself, one can find a plethora of data repositories listed here: http://www.quora.com/Data/Where-can-I-find-large-datasets-for-modeling-confidence-during-the-financial-crisis-which-is-open-to-the-public. To assess oneself and have fun, one could also start competing on Kaggle (www.kaggle.com).

Machine learning is a kind of decision making, and hence more thorough knowledge of related fields like optimization and game theory needs to be developed. A strong mathematical background in algebra and stochastics also needs to be acquired, along with exploring statistical learning theory to its limits. A background in information theory is also helpful. A list of literature surveys, reviews, and tutorials on machine learning and related topics like computational biology, NLP, etc. has been compiled, along with links to papers, at www.mlsurveys.com. Deep learning, SVMs, SGD, Bayesian statistics, recommender engines, and MapReduce for ML are some of the key hot topics in ML on the current scene.

About whiteswami
A machine learning enthusiast currently wetting my hands in Big Data analytics. Rahul Kumar Mishra is currently a Researcher at Flytxt Pvt Ltd (www.flytxt.com), Trivandrum, India. He completed his Masters at the Indian Institute of Technology, Bombay (www.iitb.ac.in) in July 2012. Prior to joining Flytxt, he was a student at IIT Bombay and a prominent member of the prestigious IU-ATC (http://www.iu-atc.com/theme2.html) on next-generation infrastructure. In this project, he worked under the guidance of Prof. U. B. Desai (http://www.iith.ac.in/~ubdesai), Director, Indian Institute of Technology, Hyderabad, and Prof. S.N. Merchant (http://www.ee.iitb.ac.in/wiki/faculty/merchant), IIT Bombay. He is also privileged to have worked at India's first software R&D center, TRDDC, Pune (www.tcs-trddc.com), in pursuits including Static Code Analysis, Data Mining and Information Extraction. He is an aspiring data scientist and his current research areas include Big Data, Unsupervised Learning and Supervised Learning. Personal Website: https://sites.google.com/site/reachrahulkmishra/ Blogs: whiteswami.wordpress.com, mlthirst.wordpress.com

2 Responses to Machine Learning Guidance For Beginners


1. Pankaj says: September 14, 2013 at 6:58 pm

Hi, I am an average computer science engineer and worked on mainframe technologies for 2 years. I am currently pursuing a one-year full-time course in Business Analytics at Praxis Business School. I am very much interested in Machine Learning, and this blog is really helpful for a beginner like me. Please keep writing and share your knowledge. Also suggest a good book which I can refer to in order to begin exploring Data Mining & Machine Learning. We have not yet finalised the book we are going to use in our batch for the course next semester. Please suggest a good book which covers data mining and machine learning from a beginner's point of view. Thanks, Pankaj. Reply

2. joskid says: October 30, 2013 at 6:28 pm
Reblogged this on josephdung. Reply

Yesterday, John and I gave a talk to the DC Hadoop Users Group about using Mahout with Solr to perform Latent Semantic Indexing -- calculating and exploiting the semantic relationships between keywords. While we were there, I realized a lot of people could benefit from a bigger-picture, less in-depth point of view outside of our specific story. In general, where do Mahout and Solr fit together? What does that relationship look like, and how does one exploit Mahout to make search even more awesome? So I thought I'd blog about how you too can start to put these pieces together to simultaneously exploit Solr's search and Mahout's machine learning capabilities. The root of how this all works is a slightly obscure feature of Lucene-based search -- term vectors. Lucene-based search applications give you the ability to generate term vectors from documents in the search index. It's a feature often turned on for specific search features, but other than that it can appear to be a weird, opaque feature to beginners. What is a term vector, you might ask? And why would you want to get one?

What is a term vector?


To answer the first question: a document's term vector is simply a listing, for each term in the entire corpus, of how frequently that term occurs in this document. So if our corpus is these three documents:
doc 1: brown dog
doc 2: brown unicorn
doc 3: unicorn eats unicorn

We could choose to represent a document by assigning a column to each term in the corpus and a row to each document. The number at each row/column corresponds to the term's frequency in that document. We refer to each row as a term vector for a document, and to the entire set of term vectors as a term-document matrix. Here's the term-document matrix for the above corpus, containing one term vector for each of our three documents:

doc    brown  dog  eats  unicorn
doc1   1      1    0     0
doc2   1      0    0     1
doc3   0      0    1     2

As stated above, each document's term vector has a spot for every possible term that occurs in the entire corpus of text. As you can imagine, with real documents in real corpora with lots of terms, this turns into term vectors that are mostly 0. The mostly-0 nature of this matrix leads us to call these matrices (and their component vectors) sparse.
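The term-document matrix above is easy to build directly. A quick sketch in Python (illustrative only; a real Lucene/Solr index stores the same information far more compactly, precisely because of the sparsity just described):

```python
# Build the term-document matrix for the three tiny documents above.
docs = ["brown dog", "brown unicorn", "unicorn eats unicorn"]

# Vocabulary: every term in the corpus, in sorted order.
vocab = sorted({term for doc in docs for term in doc.split()})

# One term vector per document: counts aligned to the vocabulary.
matrix = [[doc.split().count(term) for term in vocab] for doc in docs]

print(vocab)   # ['brown', 'dog', 'eats', 'unicorn']
print(matrix)  # [[1, 1, 0, 0], [1, 0, 0, 1], [0, 0, 1, 2]]
```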

What can you do with this information?


Using the mathematical concept of a vector has pretty specific implications. If you recall your high school geometry, you probably remember a vector as a kind of arrow coming from the origin of a graph (0,0) with a pointy thingy at the vector's coordinates ((1,2) for example). So you'd get something like this. If you learned anything past high school math, you likely know that there are such things as 3D vectors. They have three numbers because in 3D space we have numbers for x, y, and z.

Remember this guy? Ok, well, stretch your mind and think about the term vectors above. These vectors have frequencies for all four terms in the corpus -- they are four-dimensional vectors. And this is a very trivial example. In our testing with the SciFi StackExchange data set, there are roughly 36000 terms, meaning vectors 36000 wide, with each individual vector mostly 0s. Luckily the same math that applies to two and three dimensions can be applied to vectors of any dimensionality. Now that we've brought it back to our high school math, let's remember what kinds of things we learned about vectors back then:

We can take the dot product of two vectors: x1·x2 + y1·y2 + z1·z2 (+ ... for more dimensions).

We can calculate a vector's magnitude: sqrt(x1^2 + y1^2 + ...). We can add two vectors: [x1+x2, y1+y2, ...]. Etc., etc. Hazy memories of some trig stuff, and a guy named Oscar who always wanted to copy my homework.

When you realize that the x, y, z, ... above are term frequencies in a term vector, vector math turns out to have some very practical implications. Let's think through an example that might awaken your dormant math neurons! When term vectors point in the same direction in highly dimensional space, it's because a lot of terms overlap between those documents. One way to measure that is the dot product.

For example, consider the dot product of document 1 ("brown dog") and document 3 ("unicorn eats unicorn"). We can detect that there's no overlap by taking the dot product of the two four-dimensional vectors:

                      brown  dog  eats  unicorn
dotprod(doc1, doc3) = 1·0  + 1·0 + 0·1 + 0·2  = 0

Compare this to the dot product of two documents that do overlap, document 2 ("brown unicorn") and document 3 ("unicorn eats unicorn"). The "unicorn" parts of the dot product amplify each other, letting us know of some similarity:

                      brown  dog  eats  unicorn
dotprod(doc2, doc3) = 1·0  + 0·0 + 0·1 + 1·2  = 2

Pretty cool, huh? This is a pretty minor example. If you like finding practical ways to exploit cool math, here's an area just for you. It gets even cooler when you treat the set of term vectors as a matrix; there's really neat stuff you can do with that too. So where do I go to get gobs of good math to apply to giant vectors/matrices?
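The dot products above are easy to verify in a few lines of Python. This is just the toy example worked by hand, nothing Lucene- or Mahout-specific:

```python
# Dense term vectors over the vocabulary (brown, dog, eats, unicorn)
doc1 = [1, 1, 0, 0]  # "brown dog"
doc2 = [1, 0, 0, 1]  # "brown unicorn"
doc3 = [0, 0, 1, 2]  # "unicorn eats unicorn"

def dot(u, v):
    """Dot product: sum of the pairwise products of coordinates."""
    return sum(a * b for a, b in zip(u, v))

def magnitude(v):
    """Vector length: sqrt(x1^2 + x2^2 + ...)."""
    return dot(v, v) ** 0.5

print(dot(doc1, doc3))  # 0 -- no terms in common
print(dot(doc2, doc3))  # 2 -- the shared "unicorn" amplifies the score
```

Dividing the dot product by the product of the two magnitudes gives cosine similarity, the normalized version of this overlap score.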

This is where Mahout steps in


Mahout is a machine learning library built on top of Hadoop. Machine learning sounds hard, but it's just marketing speak for math you don't know yet. Hadoop is a map-reduce framework for managing jobs, plus a storage system, for working on gigantic data sets. So a more realistic term-document matrix, maybe one with a billion documents and a million terms, might actually be usable. Mahout has all the tidbits, advanced and beginner, that can help us do very interesting machine learning (read: math) processing on our giant term-document matrices. For example, we can cluster the rows, or use similarity metrics to calculate distances between our term vectors. We can fill in some of the empty spots in our sparse vectors by detecting relationships between terms ("banana" and "peel" commonly co-occur, so let's score this document for "banana" too!), and much, much more.

Each of these processes has its own mathematics behind it. For many of them, like singular-value decomposition and k-means clustering, you can at least get a basic appreciation for what the math is doing. With a little work you can get your head around it by brushing off some old textbooks. We've also got a great presentation on Singular Value Decomposition, which might be a great place to start equipping your search application with some machine learning.
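To get a feel for what singular-value decomposition does to a term-document matrix, here is a small numpy sketch on the toy matrix from above. This is plain numpy, not Mahout's API; Mahout's value is running the same kind of factorization at cluster scale:

```python
import numpy as np

# Rows are documents, columns are terms (brown, dog, eats, unicorn)
A = np.array([[1, 1, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 2]], dtype=float)

# SVD factors A into U * diag(s) * Vt. The singular values in s come
# out largest first and measure how much structure each latent
# dimension captures.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keeping only the top k singular values gives a low-rank
# approximation of A, the core idea behind latent semantic analysis.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))
```

The rank-k approximation smears related terms together, which is exactly the "fill in the empty spots" behavior described above.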

Get started!
So great, where do you start playing around using Mahout with your term vectors? You'll need to get the term vectors into a format that Hadoop/Mahout can understand. Mahout gives us a nice utility called lucene.vector for dumping term vectors from a Lucene index and saving the information you need to feed data back into the search index. From this point on, the sky is the limit. Learn some cool math and machine learning and find interesting, fun ways you can play with your term vectors! Have fun!

http://www.b-eye-network.com/blogs/eckerson/archives/2011/11/part_iii_-_the.php
http://insideanalysis.com/2013/08/understanding-analytical-databases/
http://www.kindleconsulting.com/resources/big-data-blog/big-data-anaylytics.html
http://searchbusinessanalytics.techtarget.com/essentialguide/Guide-to-big-data-analytics-tools-trends-and-best-practices

Secrets of Analytical Leaders: Insights from Information Insiders

Vignesh Prajapati

Vignesh Prajapati, from India, is a Big Data enthusiast, a Pingax (www.pingax.com) consultant, and a software professional at Enjay. He is an experienced ML/Data engineer, familiar with machine learning and Big Data technologies such as R, Hadoop, Mahout, Pig, Hive, and related Hadoop components, which he uses to analyze datasets and extract informative insights through data analytics cycles. He received his B.E. from Gujarat Technological University in 2012 and started his career as a Data Engineer at Tatvic. His professional experience includes developing various data analytics algorithms for Google Analytics data sources, to provide economic value to products. To put ML into action, he implemented several analytical apps in collaboration with Google Analytics and the Google Prediction API services. He also contributes to the R community: he developed the RGoogleAnalytics R library as an open source Google Code project and writes articles on data-driven technologies. Vignesh is not limited to a single domain; he has also built various interactive apps using Google APIs such as the Google Analytics API, Real Time Reporting API, Google Prediction API, Google Chart API, and Translate API, on the Java and PHP platforms. He is highly interested in the development of open source technologies. Vignesh has also reviewed the Apache Mahout Cookbook for Packt Publishing, a book that provides a fresh, scope-oriented introduction to the Mahout world for beginners as well as advanced users, designed to make readers aware of the different possible machine learning applications, strategies, and algorithms for building intelligent Big Data applications.

Entrepreneurship and business (best)
http://patsyiskul.blogspot.in/2012/11/free-ebook-chronic-marketer-confessions.html

Data mining with Python (best)
http://www2.compute.dtu.dk/courses/02819/

Data science and Hadoop
http://www.mpi-inf.mpg.de/~rgemulla/
http://www.datasciencecentral.com/profiles/blogs/weekly-digest-december-30
http://bigdatastudio.com/hadoop/
wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond
http://www.revolutionanalytics.com/free-webinars/r-hadoop-big-data-analytics-how-revolution-analytics-rhadoop-project-allows-all

Data mining (best)
http://www.dwbiconcepts.com/data-warehousing/11-data-mining/97-data-mining-for-beginners.html
http://users.dsic.upv.es/~jorallo/dm/
http://www.youtube.com/watch?v=m7kpITGEdkI (very good video on machine learning)
http://www.stats202.com/

Statistics 202: Statistical Aspects of Data Mining
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599

R tutorials http://www.r-statistics.com/2009/10/free-statistics-e-books-for-download/

http://freedownloadebookonline.blogspot.in/2013/08/the-analysis-of-biological-data-by.html
http://skeetersays.wordpress.com/2008/08/

R videos http://jeromyanglim.blogspot.in/2010/05/videos-on-data-analysis-with-r.html

http://axon.cs.byu.edu/Dan/478/schedule.php?id=class

https://www.open2study.com/enrol/460
http://cilvr.nyu.edu/doku.php?id=courses:bigdata:start
http://cs229.stanford.edu/
http://work.caltech.edu/telecourse
http://cs.brown.edu/courses/cs195-5/

Big data for the beginner

During the last presidential election, the role of analytics in decision making and prediction moved to the forefront in the minds of many as a way to gain an advantage in competitive situations. For a data nerd like myself, this was refreshing, as the average citizen started to gain a new-found respect for a set of skills that had always largely been viewed as the sausage making that decision makers weren't necessarily interested in understanding. Now everybody is talking about "The Big Data Revolution", and news stories pop up consistently about how organizations and corporations are using data to drive sales, predict revenue, plan services, or even predict the life stage their constituents are in (if you doubt this, Google "Target and pregnancy"). So where does this lead? There is an ever-expanding market for statisticians, business analysts, and data analysts that is screaming for more qualified applicants. More importantly, most managers are now expected to be skilled at understanding analytics and how it should affect their business decisions.

In this post, I want to provide some tools for the average manager to better learn the areas of their fundraising efforts that could benefit from introducing analytics (whether major gifts, planned giving, direct marketing, among others) and how to manage from a position of "data strength".

"elf+(aught
There are a few important skills that can be easily learned through self-teaching. The most important are the ability to identify the right problem, ask the right questions, and understand where data will help inform a decision. There is a significant amount of literature available to start the pursuit of becoming "more data driven". The area of analytics and data consumption is still evolving. I recommend finding some strong blogs (including this one!) that will help you understand it. I use a combination of non-profit and for-profit specific blogs. There is a great deal to learn, and I suggest you find a few and subscribe for daily updates. There are also a number of texts available about how to leverage data to make better decisions. Most of these texts are targeted at non-statisticians, the consumers of what statisticians produce. The challenge with texts is that the industry is moving so quickly, you risk some information being out of date. Some of my favorite authors are:

- Nate Silver (The Signal and the Noise: Why So Many Predictions Fail, but Some Don't)
- John Miglautsch (Spinning Straw into Gold: An Executive Guide to the Magic of Turning Data into Money)
- Eric Siegel (Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die)

For fundraising-specific analytics, I recommend:

- Joshua M. Birkholz (Fundraising Analytics: Using Data to Guide Strategy)

Facilitator Led
There are many classes around business intelligence offered at various institutes of higher education or career learning. I would highly recommend at least taking one class offering an analytics overview, to understand which questions can be answered with which analytical tools. If you are interested in understanding the details of the "how", statistics courses are typically available for all levels, from beginner to advanced. It is well worth the investment to take one of these as a night or weekend class at your local community college or university.

Conferences

There are a number of conferences that offer sessions on statistical analysis, business intelligence, marketing insights, and other pseudonyms. I could start listing them out here, but I don't want to give the impression that my favorites are the right ones. Most of them are high quality in terms of content and offer a varying degree of insights. At BBCON each year, there are a number of tracks which lend themselves to understanding more about how you can use analytics at your organization, to help use data to drive results and to understand your constituents and their motivations. Last year, I presented with a few co-presenters on various methods of using business intelligence, including reporting and data visualizations, modeling, scoring, clustering, and data mining.

Conclusion
The three main areas I outlined are not the only methods for learning. The best method I have found is to get hands-on. Extract some data into Excel and start playing. Gather your team, outline areas of your business where you see opportunities, and start building a portfolio of questions that will help you understand the "What happened?" and the "Why did it happen?" of past performance, then move on to the "What will happen?" and "What should I do?". The road of analytics is long, but each step along the way will continue to add value to your organization.

Beginners Course on Analytics with R


- Understand the role of R analytics in the current technology sector
- Instructed by an experienced professional and trainer
- Includes real-life case studies from various domains
- Ideal for those who are looking to pursue data science as a career

About the Course
Language of Instruction: English
Course Description


Are you aware of the trend in the way big corporations like Accenture, McKinsey, TCS, Mu Sigma, Genpact, Novartis, and Citibank manage data? They all do it with the R software. It is the biggest competitor to SAS. Add to this the fact that R is free, open source software, and its rising appeal is understandable. Enroll in this course to learn about analytics with R.

What is Analytics in R?

R is a free software programming language and environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and for data analysis, as R has stronger object-oriented programming facilities than most statistical computing languages.

Another strength of R is static graphics, which can produce publication-quality graphs, including mathematical symbols. Dynamic and interactive graphics are available through additional packages.

What makes R powerful?

- Some of the most advanced statistical analysis capabilities
- Facilitates predictive analytics (its core strength)
- Mind-blowing data visualizations to help you report better
- The ability to process and make sense of unstructured data such as text, video, voice, and log data

- Create data analyses with the flexibility to mix and match models for the best results
- Established in both academia and corporations for robustness, reliability, and accuracy

Applications of R Analytics

- Industrial forecasting
- Drug testing and development
- Real-time trading
- Risk assessment
- Software analysis, testing, and development
- Almost every business and industry that needs data to make business-critical decisions

Why choose R Analytics?

R is an open source software programming language and environment for statistical computing and graphics. It is widely used among statisticians and data miners around the world, in both academia and industry, for developing statistical software and for data analysis. Its popularity has increased substantially in recent years.

What will you learn by the end of this course?

- Acquire knowledge of an analytics tool (R)
- Gain knowledge of statistical concepts
- Acquire analytical skills

What's included in the Beginners Course on Analytics with R online course?

24 LIVE interactive classes of 1 hour each (or 12 live classes of 2 hours each) over 4 weeks, using the WizIQ Virtual Classroom: you will be able to view the instructor LIVE and participate in real-time discussions with the instructor.

Class timings: 6:00 AM to 8:00 AM and 6:00 PM to 8:00 PM IST. This course includes 12 tests.

- Learning aids: PDF and Word/PPT files will be shared, to be downloaded or viewed online
- Learning through hands-on work with R
- Instructor-led training, so you can clarify your doubts
- Statistics concepts explained
- Case-based learning

Prerequisites:

Working knowledge of MS Windows and MS Office; school-level maths and statistics knowledge.

"he course outline!

Introduction to Analytics

- About Analytics
- Introduction to popular tools
- Introduction to job roles in analytics
- Methodologies in Analytics

Introduction to statistical concepts used in business analytics

About Statistics

- Probability theory basics: generating random numbers and samples, calculating probabilities for distributions, plotting
- Statistics basics: data summarization, hypothesis testing, testing correlation
- Tests of significance
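The course teaches these concepts in R; just to make the topics concrete, here is a plain-Python sketch of generating random samples, summarizing them, and measuring correlation. The data and variable names are made up; the Pearson formula is the standard one:

```python
import random
import statistics

random.seed(42)

# Generate two related samples: y is x plus some noise
x = [random.gauss(50, 10) for _ in range(1000)]
y = [xi + random.gauss(0, 5) for xi in x]

# Data summarization
print(round(statistics.mean(x), 1), round(statistics.stdev(x), 1))

# Pearson correlation from its definition:
# r = cov(x, y) / (sd(x) * sd(y))
def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)

print(round(pearson(x, y), 2))  # close to 1: strong positive correlation
```

In R the same steps would be rnorm(), mean(), sd(), and cor.test(), which also reports the significance of the correlation.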

Basic Analytic Techniques using R

Installation and navigation in R

- Data exploration: input and output of data in R, working with variables and vectors, performing vector arithmetic
- Data structures: working with vectors, lists, matrix operations, data frames
- Data transformation: applying functions to data elements
- Data visualisation: working with graphics, creating plots
- Creating functions

Statistical Models

- Data manipulation
- Modeling techniques
- Linear regression: performing linear regression and understanding the regression summary, comparing models using ANOVA
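As a taste of what "performing linear regression" computes under the hood (the course does this in R with lm() and anova()), here is the closed-form ordinary least squares fit for a single predictor, in plain Python with made-up numbers:

```python
# Ordinary least squares for y = a + b*x, using the closed-form
# solution: b = cov(x, y) / var(x), a = mean(y) - b * mean(x)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
den = sum((x - mean_x) ** 2 for x in xs)
b = num / den
a = mean_y - b * mean_x

# R^2: the fraction of the variance in y explained by the model
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(round(b, 2), round(a, 2), round(r_squared, 3))
```

In R, lm(y ~ x) reports the same slope, intercept, and R-squared in its summary, and anova() compares nested models by how much residual variance each one removes.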

- Cluster analysis: cluster analysis with R
- Decision trees
- Time series analysis: representing and plotting time series data, performing calculations on time series
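To make "performing calculations on time series" concrete, here is a tiny plain-Python moving-average example with toy data (in R you would typically reach for ts() and filter(), or a package like zoo):

```python
# A toy monthly series and a 3-point moving average, one of the
# simplest smoothing calculations on a time series
series = [12, 15, 14, 18, 21, 19, 24, 26]

def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

smoothed = moving_average(series, 3)
print([round(v, 2) for v in smoothed])  # 6 values, smoother than the input
```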

Case Studies

- How to draw insights from data

FAQs

Ques 1: Do I need to purchase a license to install R on my machine?
Ans: No. R is available as free software, in source code form, under the terms of the Free Software Foundation's GNU General Public License.

Ques 2: Do I need expert-level knowledge of statistics to pursue this course?
Ans: No, you don't need expert-level knowledge of statistics to pursue this course.

Ques 3: During the course, will I be able to work hands-on with R?
Ans: Yes, we will be doing hands-on practice with R analytics.

Ques 4: What if I miss an R-Analytics class?
Ans: All classes are recorded automatically. You can access the class recordings in your WizIQ account as many times as you want.

Ques 5: Does the online R-Analytics course include hands-on training?
Ans: The tutor will provide regular practical assignments and feedback on areas to improve. There will also be suggested exercises which can be taken up between the classes.

Ques 6: What is the minimum net speed required to attend the R-Analytics LIVE classes?
Ans: An internet speed of 256 Kbps is recommended for attending the R-Analytics LIVE classes. Students do attend the classes on slower connections too, though performance can't be guaranteed.

Ques 7: For how long can I access the R-Analytics class recordings?
Ans: You can access the online class recordings any number of times for 6 months.

Ques 8: What are the payment options?
Ans: You can pay by credit card or PayPal account.

Ques 9: What should I do if I get stuck on a software problem during the online course?
Ans: You will in no way lose your R-Analytics lecture; you can always access the class recording. We also have 24x7 support, so if you need any clarification of concepts, or help with debugging or installation, the support team can help you one-on-one.

Job description
Do you want to change the dynamics of local commerce? Do you love to take data-driven decisions and build novel products from scratch? Would you like to work with a top-level engineering team, mixing the state of the art in computer science with operations research to make great things happen? That is what we do every day, and we need you! Groupon is looking for creative and innovative minds to join the team and focus on strategic, data-driven ideas that will propel the future growth of Groupon. As part of this team you will work on various problems spanning the spectrum from revenue optimization, dynamic pricing, and merchant satisfaction to self-service. You will be responsible for making sense of heterogeneous data sources and using this knowledge wisely to build products used by millions of users. You will also be responsible for modeling and solving key optimization problems at Groupon, whose solutions will be the core components of engines providing information to be consumed by several clients. Requirements:

MS in Operations Research, Computer Science, Applied Mathematics, Statistics, or related areas; extremely strong candidates with a BS and relevant work experience will be considered. Strong background in optimization, probability, and statistics. Deep knowledge of topics such as data mining, machine learning, statistical analysis, operations research, and mathematical modeling in general. Coding skills in Python, R, Perl, Ruby, or similar scripting languages, plus other programming languages like Java, C++, Scala, etc. Experience in SQL is fundamental. Understanding of how to extract information from data, counterfactual thinking, and how to deal with data (e.g. how to apply statistical methods, how and when to use machine learning instead of simpler approaches, what to do when there's too little data or too much data, how to clean transactional data or dirty and unstructured data, etc.).

Basic Unix usage, data and file handling (sort, cut, cat, uniq, awk, etc.), and version control systems (e.g. Git). Good at explaining results and how you got them, and, most importantly, teamwork.

Preferred Skills:

PhD in Operations Research, Data Mining, Machine Learning, Statistics, Computer Science, Applied Mathematics, or equivalent. 2+ years of hands-on practical experience with large-scale data analysis. Fluent in R (not just basic regressions and basic plotting; e.g. ggplot2, ROCR, e1071, rpart, etc.) and exceptional in Python (e.g. scikit-learn, pandas, numpy, scipy); knowledge of analytical and mathematical tools such as Octave/Matlab, AMPL, GAMS, etc. Experience with software engineering best practices such as test-driven development, pair programming, code reviews, agile software methods, etc.

Carving Out a Career in Big Data


By Sharon Florentine

You have probably seen it starting to crop up in job titles on job boards. It has become a very hot topic in both the business and the IT press. Big data is hot. Want proof? Consider this: IBM has created an entire new product division around this phenomenon, called Big Data Products.

What is Big Data?
At its most basic, "big data" is extremely large amounts of structured and/or unstructured data, too big for analysis in traditional databases and database management tools. This data can come from sensors, click streams, posts to social media sites, multimedia data (images, video), transactions, log files, real-time GPS data, and more. This means enormous amounts of data; terms such as exabytes and quintillions (18 zeros) are used. At this point we create 2.5 quintillion bytes of data a day, which means that 90 percent of the data in the world today has been created in the last two years (source: IBM).

What Types of Jobs Are There in Big Data?
The need to "do something" with all of this data is creating a significant number of big data related jobs. What should you look for when looking for a big data job? In a quick

scan of job boards, you will find job titles such as Big Data Engineer, Data Mining Engineer, or Data Mining (Big Data) Engineer. The title that has really come into vogue in the last 18 months is Data Scientist, but big data jobs can come with more mundane titles such as Business Analyst, Business Intelligence Analyst, Data Analytics Engineer, or Data Architect. What these jobs all have in common is the need to "make sense" of all of this information, to turn it into something that can drive insight and action.

What Skills Are Needed?
So we know that big data is big, no pun intended. We know that it is driving job growth. How do you jump on this bandwagon? What do you need to know, and what skills should you have, if you want to pursue a big data career? Big data jobs typically require a broad range of skills. The good news for tech-savvy power users and business users is that many of the jobs do not require hard-core programming skills, but rather business or other job-specific knowledge, strong analytical skills, and knowledge of analytical tools. What are some specific skills that can help? Knowledge of:

- Data mining and machine learning techniques
- Data visualization tools
- Data warehousing
- ETL (extract, transform, load)
- Hadoop (an Apache project providing an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage)
- Predictive modeling
- Statistical modeling with tools such as R, SAS, or SPSS
- Structured and unstructured databases

Where to Acquire These Skills
In addition to what you already have or can learn on the job, Big Data University is a great place to learn more about big data and to start acquiring some of the necessary skills. The good news is that many of the courses are free. Also, many vendors provide big data training; for example, EMC offers data science and big data analytics training, and IBM offers big data courses. Don't forget higher education. Colleges and universities offer degree programs in analytics, predictive analytics, business analytics, business intelligence, and data mining, all of which provide a great foundation for launching a big data career. If you do not want to dive into a full-fledged degree program, certificate programs are also available in these topic areas.

Mold an Existing Analytics Job into a Big Data Analytics Job
If you are already involved with analyzing data in your current job, try to take it to the next level. Take some of the free courses, and see if your company will pay for other courses that give you the skills to analyze larger and more complex data sets. It is always best to learn new and very marketable skills in an existing job.

Don't Forget the Marketing and Sales Side of This!
In addition to the jobs that require technical and analytic skills, don't forget that there is also a need for people with a strong "conversational" knowledge of big data to market and sell big data products and services. So don't forget those Big Data and Hadoop Product Manager and Big Data Sales Representative jobs.

Sharon Florentine is a Rackspace blogger. Rackspace Hosting is the service leader in cloud computing and a founder of OpenStack, an open source cloud operating system. The San Antonio-based company provides Fanatical Support to its customers and partners across a portfolio of IT services, including Managed Hosting and Cloud Computing.

Business intelligence (best)
http://john.marsland.org/blog/businessintelligence/learn-business-intelligence-skills/
http://www.dbms2.com/2010/07/29/how-should-somebody-teach-themselves-programming-skills/

Discrete Structures
http://courses.csail.mit.edu/6.042/fall13/class-material.shtml

Big data and all
http://practicalanalytics.wordpress.com/predictive-analytics-101/
http://wmbriggs.com/blog/?p=6465

Machine learning (best)

http://practicalanalytics.wordpress.com/predictive-analytics-101/ (Best!)
http://axon.cs.byu.edu/Dan/478/schedule.php?id=class
http://slackprop.wordpress.com/category/machine-learning/
http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/index.html
http://webdocs.cs.ualberta.ca/~zaiane/htmldocs/teaching.html

http://www.dataminingarticles.com/info/data-mining-introduction/

Data mining and big data links
http://dmml.asu.edu/resources
http://www.predictiveanalyticsworld.com/predictive_analytics.php
http://vserver1.cscs.lsa.umich.edu/~crshalizi/notabene/data-mining.html
http://wla.berkeley.edu/~cs61a/fa11/61a-python/content/www/index.html
http://vserver1.cscs.lsa.umich.edu/~crshalizi/teaching/
http://www.dataminingblog.com/list-of-blogs/
http://www.stat.cmu.edu/~rnugent/
http://www.stat.wisc.edu/~st571-1/
https://computing.llnl.gov/?set=training&page=index
http://whatsthebigdata.com/2012/12/14/past-courses-in-big-data-analytics-and-data-science-content-online/
http://www1bpt.bridgeport.edu/~jelee/courses/CS651_F13/CS651_F13.htm
http://www.cs.duke.edu/~shivnath/
http://www.cs.kent.edu/~vlee/classes/cs43005_F2007/cs43005_calendar.html
http://www.cs.kent.edu/~dragan/CS4-56101-D&AofAlgS10.html
http://www.cs.sfu.ca/CourseCentral/843/jpei/
http://www.cse.sc.edu/~rose/590B/

Hadoop installation
http://jeffreybreen.wordpress.com/2012/03/10/big-data-step-by-step-slides/
http://www.applams.com/2013/07/install-hadoop-in-windows-7-write-and.html

YouTube video
http://www.youtube.com/playlist?list=PLF82F6499E89E1BAE&feature=mh_lolz

http://www.youtube.com/playlist?list=PLF82F6499E89E1BAE

LinkedIn guidance sources

http://www.linkedin.com/groups/Best-way-learn-Hadoop-988957.S.146211320
http://www.coderanch.com/t/589669/hadoop/databases/Prerequisites-learning-hadoop
http://www.linkedin.com/groups/What-is-best-way-learn-988957.S.112700730

Besides the Cloudera resources, I'd highly recommend the reference books from O'Reilly:

- Hadoop: The Definitive Guide
- Programming Pig
- Programming Hive
- HBase: The Definitive Guide

http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial.html#HadoopTutorial-Purpose
http://courses.cs.washington.edu/courses/cse490h/08au/video.htm
http://www.thecloudavenue.com/2012/01/is-java-prerequisite-for-getting.html
http://bigdatacircus.com/2012/08/17/hadoop-getting-started-with-hadoop-and-mapreduce/
http://cscarioni.blogspot.in/2010/11/hadoop-basics.html [Basic]
http://cs.smith.edu/classwiki/index.php/CSC352_Hadoop_Cluster_Howto
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://jayant7k.blogspot.in/2008/10/setting-up-hadoop.html
http://suchi-techno-world.blogspot.in/2013/05/hadoop-installation.html
http://nosql.mypopescu.com/post/1016366403/nosql-guide-for-beginners
http://university.cloudera.com/certification/prep/datascience [Big Data Hadoop DataScience]

http://betterexplained.com/articles/a-brief-introduction-to-probability-statistics/
http://www.thecloudavenue.com/2013/01/virtual-machine-for-learning-hadoop.html [best]
http://allthingshadoop.com/podcast/ [best]
http://atbrox.com/2011/11/09/mapreduce-hadoop-algorithms-in-academic-papers-5th-update-%E2%80%93-nov-2011/ [best]
http://www.thecloudavenue.com/p/usecases.html

http://developer.yahoo.com/hadoop/tutorial/index.html
http://bigdataprojects.org/?page_id=73 [Hadoop best]
http://www.bigdataplanet.info/2013/10/hadoop-tutorials-part-1-what-is-hadoop.html
http://www.coreservlets.com/hadoop-tutorial/#Overview
http://vbktech.quora.com/
http://www.stanford.edu/class/cs341/
http://www.thecloudavenue.com/p/hadoopresources.html
http://thebigdatainstitute.wordpress.com/2013/04/29/introduction-to-big-data-and-hadoop-ecosystem-for-beginners/
http://www.tostring.co/learning-with-big-data/
http://www.hadoopwizard.com/top-10-presentations-for-learning-hadoop-on-slideshare/ [Slides]
http://www.slideshare.net/rax123/savedfiles?s_title=big-data-stepbystep-using-r-hadoop-with-rhadoops-rmr-package&user_login=jeffreybreen [slides best]
http://www.kdnuggets.com/2013/10/7-steps-learning-data-mining-data-science.html

"calable "ystems@ $esign2 7mplementation and 'se of Large "cale Clusters2 #utumn 3448 *earch< istributed *+ste.s and %loud %o.puting=
Ir

*earch<(rallelel processing and

istributed *+ste.s and=

http://www.eurecom.fr/~michiard/teaching/clouds.html ..........Best!
http://courses.cs.washington.edu/courses/cse490h/08au/lectures.htm [Best! For Hadoop]
http://courses.cs.washington.edu/courses/cse490h/08au/video.htm
http://www.qatar.cmu.edu/~msakr/15319-s10/

Data mining (best)
http://guidetodatamining.com/

Databases (best)
http://confluence.cci.emory.edu:8090/dashboard.action

Big data machine learning
http://machinelearningbigdata.pbworks.com/w/page/37651454/FrontPage

http://courses.cs.washington.edu/courses/cse599c1/13wi/readings.html

Hadoop projects
http://www.quora.com/Programming-Challenges-1/What-are-some-good-toy-problems-in-data-science
http://www.stanford.edu/class/cs341/

http://vbktech.quora.com/Hadoop-isnt-a-threat-to-RDBMS
http://www.hadooptrainingindia.in/what-is-mongodb/ [basic tutorials]

Cassandra
http://www.sinbadsoft.com/blog/cassandra-up-and-running-on-windows-in-10-min-or-so/ [Cassandra installation on Windows in 10 minutes]
http://www.sinbadsoft.com/blog/cassandra-data-model-cheat-sheet/ [cheat sheet]
http://www.rackspace.com/blog/cassandra-by-example/

Projects
http://exascale.info/StudentProjects#tracking

CouchDB
http://guide.couchdb.org/editions/1/en/index.html

Machine Learning and Hadoop Projects


http://atbrox.com/2011/11/09/mapreduce-hadoop-algorithms-in-academic-papers-5th-update-%E2%80%93-nov-2011/ [best]

http://www.quora.com/Machine-Learning/How-does-one-start-writing-practical-machine-learning-programs

If you prefer R, then "Machine Learning for Hackers" is highly recommended for beginners. If you prefer Python, I recommend "Machine Learning in Action". Both of these books are fairly easy to follow; you would take no more than two weeks to go through each of them.

http://cs229.stanford.edu/projectIdeas_2012.html
http://uhaweb.hartford.edu/compsci/ccli/samplep.htm
http://cs229.stanford.edu/projects2012.html
http://cs229.stanford.edu/projects2011.html
http://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-public
http://wildanm.wordpress.com/2009/10/15/project-ideas-for-hadoop/

Big Data with Hadoop and Cloud

Posted on May 15, 2013 by admin. Posted in Cloud Computing, Java.

What is Big Data? "Big Data" is a catch phrase that has been bubbling up from the high-performance computing niche of the IT market. Increasingly, suppliers of processing virtualization and storage virtualization software have begun to flog "Big Data" in their presentations. What, exactly, does this phrase mean?

"Big data" is data that becomes large enough that it cannot be processed using conventional methods. Web search engines, social networks, mobile phones, sensors and science contribute to the petabytes of data created on a daily basis. Scientists, intelligence analysts, governments, meteorologists, air traffic controllers, architects, civil engineers - nearly every industry or profession experiences the era of big data. Add to that the fact that the democratization of IT has made everyone a (sort of) data expert, familiar with searches and queries, and we are seeing a huge burst of awareness in big data. An example often cited is how much weather data is collected on a daily basis by the U.S. National Oceanic and Atmospheric Administration (NOAA) to aid in climate, ecosystem, weather and commercial research. Add that to the masses of data collected by the U.S. National Aeronautics and Space Administration (NASA) for its research and the numbers get pretty big. The greater part of this data has multifaceted and undiscovered relationships; it does not fit simply into relational models. Practical examples [1] of big data processing are:

A. LinkedIn:
For discovering People You May Know and other fun facts. Item-item recommendations; member and company derived data; user network statistics; Who Viewed My Profile?; abuse detection; user history service relevance data; crawler detection.

B. MobileAnalytic.TV:
Natural language processing; mobile social network hacking; web crawlers / page scraping; text to speech; machine-generated audio and video with remixing; automatic PDF creation and IR.

C. Datagraph:
Batch-processing large RDF datasets, for indexing RDF data. RDF extends the linking structure of the Web to use URIs to name the relationship between things as well as the two ends of the link. Executing long-running offline SPARQL queries.

D. GumGum (in-image ad network):
GumGum is an analytics and monetization platform for online content. Image and advertising analytics.

E. Lineberger Comprehensive Cancer Center - Bioinformatics Group:
For accumulating and analyzing Next Generation sequencing data produced for the Cancer Genome Atlas project and other groups.

F. Pharm2Phork Project - Agricultural Traceability:
Processing of observation messages generated by RFID/barcode readers as items move through the supply chain. Analysis of BPEL-generated log files for monitoring and tuning of workflow processes.


Why is it important for enterprises to look into this? Human-generated data fits well into relational tables or arrays; examples are conventional transactions - purchase/sale, inventory/manufacturing, employment status change, etc. Another type of data is machine-generated data. Machines produce unstoppable streams of big data [2]:

1. Computer logs
2. Satellite telemetry (espionage or science)
3. GPS outputs
4. Temperature and environmental sensors
5. Industrial sensors
6. Video from security cameras
7. Outputs from medical devices
8. Seismic and geo-physical sensors

Big data that does not conform to known models is discarded or sent to archive un-analyzed. As a result, enterprises miss information, insight, and opportunities to extract new value.

Various Solutions

Big Data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. Technologies being applied to Big Data include massively parallel processing (MPP) databases, data mining infrastructures such as the Apache Hadoop framework, distributed file systems, distributed databases, MapReduce algorithms, cloud computing platforms, the Internet, and archival storage systems.

MapReduce is a programming model and an associated implementation for processing and generating big data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Computational processing can take place on data stored either in a file system (unstructured) or within a database (structured). Programs written in this functional style are automatically parallelized and executed on a big cluster of commodity machines. This allows programmers without any experience with parallel and distributed systems to effortlessly utilize the resources of a large distributed system.
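The map/reduce contract described above can be sketched in plain Python - a single-process simulation of the framework's map, shuffle and reduce steps, not Hadoop itself (function names are illustrative):

```python
from collections import defaultdict

def map_fn(_, line):
    # map: one input record -> intermediate (key, value) pairs
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    # reduce: all values for one intermediate key -> final result
    yield key, sum(values)

def map_reduce(records):
    # shuffle step: group intermediate pairs by key, as the framework would
    groups = defaultdict(list)
    for key, record in records:
        for k, v in map_fn(key, record):
            groups[k].append(v)
    return dict(kv for k in sorted(groups) for kv in reduce_fn(k, groups[k]))

counts = map_reduce(enumerate(["big data big cluster", "data moves to compute"]))
print(counts["big"], counts["data"])  # 2 2
```

In the real framework the map calls run on many machines and the shuffle happens over the network, but the user-visible contract is exactly these two functions.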

Map Reduce (Source: Google)


There are two ways to process "Big Data" with the use of MapReduce: 1. HPC 2. Cloud Computing.

HPC includes advanced computing, communications, and information technologies: scientific workstations, supercomputer systems, high-speed networks, special-purpose and experimental systems. A new generation of large-scale parallel systems, with applications and systems software whose components are well integrated and linked over a high-speed network, is used for big data processing.

The second way is to process "Big Data" with cloud computing. It will be a key breakthrough in data processing due to the benefits of cloud computing, which are: easy and inexpensive set-up, because hardware, application and bandwidth costs are covered by the provider; scalability to meet needs; and no wasted resources, because you pay for what you use. There are different ways to implement big data processing in the cloud, such as (1) Hive, (2) Pig and (3) Hadoop. Hive provides a rich set of tools in multiple languages to perform SQL-like data analysis on data stored in HDFS. Pig is used for writing SQL-like operations that apply to datasets; the Pig project provides a compiler that produces MapReduce jobs from a Pig Latin script. Our major attention is on Hadoop. We can add flavor by introducing Hadoop for big data processing in the cloud. Apache Hadoop is a software framework inspired by Google's MapReduce and Google File System (GFS) papers.

Hadoop and its Use Cases


Hadoop MapReduce is a programming model for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. [3] Hadoop processes and analyzes a variety of new and older data to extract meaningful business operations intelligence. Traditionally, data moves to the computation node; in Hadoop, data is processed where it resides. The types of questions Hadoop helps answer are [2]: event analytics (what series of steps led to a purchase or registration?); large-scale web click stream analytics; revenue assurance and price optimization; financial risk management and affinity engines; etc.

How does Cloud Computing come into the picture?

In cloud computing, we have a few options available for Hadoop implementation: 1. Amazon IaaS 2. Amazon MapReduce 3. Cloudera.

Amazon Elastic Compute Cloud (Amazon EC2 / IaaS) is a web service that provides resizable compute capacity in the cloud [4]. It is designed to make web-scale computing easier for developers. If you run Hadoop on Amazon EC2 you might consider using Amazon S3 for accessing job data (data transfer to and from S3 from EC2 instances is free). Initial input can be read from S3 when a cluster is launched, and the final output can be written back to S3 before the cluster is decommissioned [6]. Intermediate, temporary data, needed only between MapReduce passes, is more efficiently stored in Hadoop's DFS. This became a popular way of doing big data processing and led to the emergence of another service, Amazon Elastic MapReduce. Amazon Elastic MapReduce, a web service, enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data [5].
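The S3-in, S3-out pattern described above shows up directly in the shape of an Elastic MapReduce request. Below is a sketch of a Hadoop streaming step as it might be submitted to the EMR API; the bucket name and script paths are invented, and no AWS call is made here - the snippet only builds and prints the request structure:

```python
import json

# Hypothetical S3 locations -- replace with your own bucket and paths.
step = {
    "Name": "wordcount",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",  # EMR's wrapper for running streaming jobs
        "Args": [
            "hadoop-streaming",
            "-input", "s3://my-bucket/input/",    # initial input read from S3
            "-output", "s3://my-bucket/output/",  # final output written back to S3
            "-mapper", "mapper.py",
            "-reducer", "reducer.py",
        ],
    },
}
print(json.dumps(step, indent=2))
```

Intermediate map output never touches S3; as the article notes, it lives in the cluster's own distributed filesystem between the map and reduce passes.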

Amazon Elastic MapReduce


It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon S3. In a nutshell, the Elastic MapReduce service runs a hosted Hadoop instance on an EC2 instance (the master), which is able to instantly provision other pre-configured EC2 instances (slave nodes) to distribute the MapReduce process [5]. All nodes are terminated once the MapReduce tasks complete.

Cloudera has two products: Cloudera's Distribution for Hadoop (CDH) and Cloudera Enterprise. CDH is a data management platform that incorporates HDFS, Hadoop MapReduce, Hive, Pig, HBase, Sqoop, Flume, Oozie, ZooKeeper and Hue. It is available free under an Apache license. [7] Cloudera Enterprise is a package which includes Cloudera's Distribution for Hadoop, production support, and tools designed to make it easier to run Hadoop in a production environment. Cloudera offers services including support, consulting and training (both public and private).

Cloudera
The Cloudera's Distribution for Hadoop (CDH) cloud scripts enable you to run Hadoop on cloud providers' clusters. There is no need to install the RPMs for CDH or do any configuration; a working cluster will start immediately with one command. Cloudera supports Amazon EC2 only, and provides Amazon Machine Images and associated launch scripts that make it easy to run CDH on EC2. CDH, being open source, is free; management services have to be paid for.

References

[1] Hadoop Wiki, http://wiki.apache.org/hadoop/PoweredBy
[2] Miha Ahronovitz and Kuldip Pabla, Why Hadoop as part of the IT?, http://thecloudtutorial.com/hadoop-tutorial.html
[3] Apache Hadoop, http://hadoop.apache.org/
[4] Amazon EC2, http://aws.amazon.com/ec2/
[5] Amazon Elastic MapReduce, http://aws.amazon.com/elasticmapreduce/
[6] Ubin Malla, Using Hadoop and Amazon Elastic MapReduce to Process Your Data More Efficiently, http://blog.controlgroup.com/2010/10/13/hadoop-and-amazon-elastic-mapreduce-analyzing-log-files/
[7] Cloudera, Apache Hadoop for Enterprise, http://www.cloudera.com/
[8] Amazon EC2 Cost Comparison Calculator, http://media.amazonwebservices.com/Amazon_EC2_Cost_Comparison_Calculator.xls

NoSQL and Cloud Computing


Posted on October 14, 2011 by admin. Posted in Cloud Computing.

Introduction

Cloud computing is moving from being an "IT buzzword" to a reasonable and reliable way of deploying applications in the Internet. IT managers within companies are considering deploying some applications within the cloud. A cloud-related trend that developers have been paying attention to is the idea of "NoSQL", a set of operational-data technologies based on non-relational concepts. "NoSQL" is a "sea change" idea: consider data storage options beyond the traditional SQL-based relational database. Accordingly, a new set of open source distributed databases is actively propping up to leverage the facilities and services provided through the cloud architecture. Thus, web applications and databases in the cloud are undergoing major architectural changes to take advantage of the scalability provided by the cloud. This article is intended to provide insight on NoSQL in the context of cloud computing.

Face off: SQL, NoSQL and Cloud Computing

A key disadvantage of SQL databases is that they operate at a high abstraction level. To execute a single statement, SQL often requires the data to be processed multiple times, which costs time and performance; for instance, multiple passes over the data occur when there is a join operation. Cloud computing environments need high-performing and highly scalable databases. NoSQL databases are built without relations. But is it really that "good" to go for NoSQL databases? A world without relations, no joins and pure scalability! NoSQL databases typically emphasize horizontal scalability via partitioning, putting them in a good position to leverage the elastic provisioning capabilities of the cloud. The general definition of a NoSQL data store is that it manages data that is not strictly tabular and relational, so it does not make sense to use SQL for the creation and retrieval of the data.
NoSQL data stores are usually non-relational, distributed, open-source, and horizontally scalable. If we look at the big platforms on the Web, like Facebook or Twitter, there are some datasets that do not need any relations. The challenge for NoSQL databases is to keep the data consistent. Imagine that a user deletes his or her account. If this is hosted on a NoSQL database, all the tables have to be checked for any data the user has produced in the past. With NoSQL, this has to be done by code.
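The account-deletion scenario above - no foreign keys, so cleanup is done in application code - can be sketched with plain dictionaries standing in for a document store's collections (collection and field names are invented):

```python
# Two denormalized "collections" with no relational integrity between them.
users = {"u1": {"name": "alice"}, "u2": {"name": "bob"}}
posts = {"p1": {"author": "u1", "text": "hi"},
         "p2": {"author": "u2", "text": "yo"},
         "p3": {"author": "u1", "text": "bye"}}

def delete_user(user_id):
    # No ON DELETE CASCADE: the application must sweep every collection
    # that may reference the user and remove orphaned records itself.
    users.pop(user_id, None)
    for post_id in [pid for pid, p in posts.items() if p["author"] == user_id]:
        del posts[post_id]

delete_user("u1")
print(sorted(posts))  # ['p2']
```

In a relational database the same guarantee would come from a foreign-key constraint; here it only holds if every code path remembers to call the sweep.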

A major advantage of NoSQL databases is that data replication can be done more easily than with SQL databases. As there are no relations, tables do not necessarily have to be on the same servers. Again, this allows better "scaling" than SQL databases - and do not forget that scaling is one of the key aspects of cloud computing environments. Another disadvantage of SQL databases is that there is always a schema involved. Over time, requirements will change, and the database somehow has to support the new requirements. This can lead to serious problems: just imagine that applications need two extra fields to store data. Solving this issue with SQL databases might get very hard. NoSQL databases support a changing environment for data and are a better solution in this case as well. SQL databases do have the advantage over NoSQL databases of better support for "business intelligence". Cloud computing platforms are made for a great number of people and potential customers; this means there will be millions of queries over various tables, and millions or even billions of read and write operations within seconds. SQL databases are built to serve another market - the business intelligence one, where fewer queries are executed. This implies that the way forward for many developers is a hybrid approach, with large sets of data stored in, ideally, cloud-scale NoSQL storage, and smaller specialized data remaining in relational databases. While this would seem to amplify management overhead, reducing the size and complexity of the relational side can drastically simplify things. However, it is up to the use case to identify whether you want a NoSQL approach or whether you are better off staying with SQL.

"NoSQL" Databases for Cloud

The NoSQL (or "not only SQL") movement is defined by a simple premise: use the solution that best suits the problem and objectives. If the data structure is more appropriately accessed through key-value pairs, then the best solution is likely a dedicated key-value pair database. If the objective is to quickly find connections within data containing objects and relationships, then the best solution is a graph database that can get results without any need for translation (O/R mapping). Today's availability of numerous technologies that finally support this simple premise is helping to simplify the application environment and enable solutions that actually exceed the requirements, while also supporting performance and scalability objectives far into the future. Many cloud web applications have expanded beyond the sweet spot for relational database technologies; many applications demand availability, speed, and fault tolerance over consistency. Although the original emergence of NoSQL data stores was motivated by web-scale data, the movement has grown to encompass a wide variety of data stores that just happen not to use SQL as their processing language. There is no general agreement on the taxonomy of NoSQL data stores, but the categories below capture much of the landscape.

Tabular / Columnar Data Stores

Storing sparse tabular data, these stores look most like traditional tabular databases. Their primary data retrieval paradigm utilizes column filters, generally leveraging hand-coded map-reduce algorithms.

BigTable is a compressed, high-performance, proprietary database system built on Google File System (GFS), Chubby Lock Service, and a few other Google programs. HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It runs on top of HDFS, providing a fault-tolerant way of storing large quantities of sparse data. Hypertable is an open source database inspired by publications on the design of Google's BigTable. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++ for performance. VoltDB is an in-memory database. It is an ACID-compliant RDBMS which uses a shared-nothing architecture; it is based on the academic H-Store project and supports SQL access from within pre-compiled Java stored procedures. Google Fusion Tables is a free service for sharing and visualizing data online. It allows you to upload and share data, merge data from multiple tables into interesting derived tables, and see the most up-to-date data from all sources.
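A toy illustration of the sparse, column-oriented model these stores share - a row key mapping to family:qualifier cells, where absent columns simply take no space (the table, families and the `column_filter` helper are all invented for illustration):

```python
# row_key -> {"family:qualifier": value}; rows may have disjoint columns.
table = {
    "row1": {"info:name": "alice", "info:city": "Pune"},
    "row2": {"info:name": "bob", "stats:logins": 17},  # no 'city' cell at all
}

def column_filter(table, column):
    # The primary retrieval paradigm: scan rows, keep cells of one column.
    return {row: cells[column] for row, cells in table.items() if column in cells}

print(column_filter(table, "info:city"))  # {'row1': 'Pune'}
```

A relational table would need a NULL in every missing cell; the sparse model stores only the cells that exist, which is what makes these systems cheap for wide, mostly-empty tables.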
Document Stores

These NoSQL data sources store unstructured (i.e., text) or semi-structured (i.e., XML) documents. Their data retrieval paradigm varies highly, but documents can always be retrieved by a unique handle. XML data sources leverage XQuery; text documents are indexed, facilitating keyword-search-like retrieval.

Apache CouchDB, commonly referred to as CouchDB, is an open source document-oriented database written in the Erlang programming language. It is designed for local replication and to scale vertically across a wide range of devices. MongoDB is an open source, scalable, high-performance, schema-free, document-oriented database written in the C++ programming language. Terrastore is a distributed, scalable and consistent document store supporting single-cluster and multi-cluster deployments. It provides advanced scalability support and elasticity features without loosening consistency at the data level.

Graph Databases

These NoSQL sources store graph-oriented data with nodes, edges, and properties, and are commonly used to store associations in social networks. Neo4j is an open-source graph database implemented in Java. It is an "embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs". AllegroGraph is a graph database. It considers each stored item to have any number of relationships; these relationships can be viewed as links, which together form a network, or graph. FlockDB is an open source distributed, fault-tolerant graph database for managing data at webscale. It was initially used by Twitter to build its database of users and manage their relationships to one another. It scales horizontally and is designed for on-line, low-latency, high-throughput environments such as websites. VertexDB is a high-performance graph database server that supports automatic garbage collection. It uses the HTTP protocol for requests and JSON for its response data format, and the API is inspired by the FUSE file system API plus a few extra methods for queries and queues.

Key-Value Stores

These sources store simple key/value pairs like a traditional hash table. Their data retrieval paradigm is simple: given a key, return the value. Dynamo is a highly available, proprietary key-value structured storage system. It has properties of both databases and distributed hash tables (DHTs). It is not directly exposed as a web service, but is used to power parts of other Amazon Web Services. Memcached is a general-purpose distributed memory caching system. It is often used to speed up dynamic database-driven websites by caching data and objects in RAM to reduce the number of times an external data source must be read.

Cassandra is an open source distributed database management system. It is designed to handle very large amounts of data spread across many commodity servers while providing a highly available service with no single point of failure. It is a NoSQL solution that was initially developed by Facebook and powers their Inbox Search feature. Amazon SimpleDB is a distributed database written in Erlang by Amazon.com. It is used as a web service in concert with EC2 and S3 and is part of Amazon Web Services. Voldemort is a distributed key-value storage system. It is used at LinkedIn for certain high-scalability storage problems where simple functional partitioning is not sufficient. Kyoto Cabinet is a library of routines for managing a database. The database is a simple data file containing records; each is a pair of a key and a value. There is neither a concept of data tables nor data types; records are organized in a hash table or B+ tree. Scalaris is a scalable, transactional, distributed key-value store. It can be used for building scalable Web 2.0 services. Riak is a Dynamo-inspired database that is being used in production by companies like Mozilla.

Object and Multi-value Databases

These types of stores preceded the NoSQL movement, but they have found new life as part of it. Object databases store objects (as in object-oriented programming). Multi-value databases store tabular data, but individual cells can store multiple values. Examples include Objectivity, GemStone and Unidata. Proprietary query languages are used.

Miscellaneous NoSQL Sources

Several other data stores can be classified as NoSQL stores, but they do not fit into any of the categories above. Examples include GT.M, IBM Lotus/Domino, and the ISIS family.
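Several of the stores above (Dynamo, Cassandra, Voldemort, Riak) spread keys across commodity servers by hashing. A minimal sketch of that idea - node names are invented, and real systems use consistent hashing with virtual nodes rather than this simple modulo scheme:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def owner(key, nodes=NODES):
    # Hash the key and map it onto one node. Deterministic: every
    # client routes the same key to the same server, with no directory.
    digest = hashlib.md5(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def replicas(key, n=2, nodes=NODES):
    # Simple replication: the owner plus the next n-1 nodes in the list,
    # so the value survives a single node failure.
    i = nodes.index(owner(key, nodes))
    return [nodes[(i + j) % len(nodes)] for j in range(n)]

print(len(replicas("user:42")))  # 2
```

The modulo scheme reshuffles almost every key when a node is added; consistent hashing exists precisely to limit that movement, which is why the production systems use it.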
Sources for further Reading
http://news.cnet.com/8301-13846_3-10412528-62.html#ixzz1DGORTRTP
http://cloudcomputing.blogspot.com/2010/03/nosql-is-not-sql-and-thats-problem.html
http://news.cnet.com/8301-13846_3-10412528-62.html
http://www.readwriteweb.com/cloud/2010/07/cassandra-predicting-the-futur.php
http://cloudvane.wordpress.com/tag/nosql/
http://www.rackspacecloud.com/blog/2010/02/25/should-you-switch-to-nosql-too/
http://pro.gigaom.com/2010/03/what-cloud-computing-can-learn-from-nosql/
http://www.drdobbs.com/database/224900500
http://cloudcomputing.blogspot.com/2010/04/disruptive-cloud-computing-startups-at.html
http://www.informationweek.com/cloud-computing/blog/archives/2010/04/nosql_needed_fo.html
http://www.elance.com/s/cloudcomputing/
http://www.thesavvyguideto.com/gridblog/2009/11/a-look-at-nosql-and-nosql-patterns/
http://blogs.forrester.com/application_development/2010/02/nosql.html
http://www.yafla.com/dforbes/Getting_Real_about_NoSQL_and_the_SQL_Isnt_Scalable_Lie/
http://arstechnica.com/business/data-centers/2010/02/-since-the-rise-of.ars/2

Will Big Data Clog Networks with Big Traffic?

December 13th, 2011. By: Colleen Miller

A new study from Infineta projects a surge in unstructured data stored in Hadoop and other large-scale storage systems.

Big Data, housed in new and disruptive technologies, is expected to account for more than 50 percent of the world's data in the next five years, according to a new study. While it offers huge and untapped value, the inevitable result is stress and strain on the world's Internet infrastructure as companies seek to manage this explosion of information. The new study, released jointly by Internet Research Group and Infineta Systems, a provider of WAN optimization systems, examines how big data is affecting enterprise WANs (Wide Area Networks) throughout the country. Big Data - defined as datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze - is most often found at petabyte to exabyte size, and is unstructured, distributed and in flat schemas. As big data continues to grow, the industry anticipates both enormous change and untapped value for enterprises. According to Infineta's report, most companies will adopt key Big Data technologies in the next 12 to 18 months.

Challenging Network Capacity


All this data in need of capture, storage, processing and distribution has the potential to clog networks. About 0.5 Gbps of bandwidth is needed per petabyte of Big Data under management by Hadoop, an open source platform for large-scale computing. The bandwidth demand can result in compromises in the latency, speed and reliability of the enterprise WAN. Infineta is interested in this topic, as the privately-held company, based in San Jose, California, supplies products that support critical machine-scale workflows across the data center interconnect. However, the study findings highlight developing trends that are impacting the entire data center industry. Key trends identified by Infineta include:

Cheaper storage pricing. While traditional data storage runs $5 per gigabyte, the same amount of storage using Hadoop costs $0.25 per gigabyte.
Increased scalability. Hadoop enables companies to add additional storage for a fraction of the cost that was previously charged. The scalability of Hadoop could lead to more than 50 percent of the world's data being stored in Hadoop environments within five years.
Lack of analysis. Only one to five percent of data collected outside Big Data deployments is actually analyzed, so value is being missed. McKinsey recently reported that if the healthcare industry analyzed 95 percent of its uncaptured data, it would have an estimated annual value of $300 billion. Another example of lack of analysis is the oil industry, where oil rigs generate 25k data points per second and the company uses five percent of that information.
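The storage and bandwidth figures above imply large absolute numbers at petabyte scale; a quick back-of-the-envelope check, taking 1 PB = 1,000,000 GB and the per-GB prices quoted in the report:

```python
PB_IN_GB = 1_000_000           # decimal petabyte, in gigabytes

traditional = 5.00 * PB_IN_GB  # $5/GB traditional storage, per PB
hadoop = 0.25 * PB_IN_GB       # $0.25/GB on Hadoop, per PB
print(f"${traditional:,.0f} vs ${hadoop:,.0f}")  # $5,000,000 vs $250,000

# ~0.5 Gbps of WAN bandwidth per PB under Hadoop management:
pb_managed = 4
print(f"{0.5 * pb_managed} Gbps")  # 2.0 Gbps
```

So the quoted prices amount to a 20x storage cost difference per petabyte, and WAN demand grows linearly with the data under management.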

The report finds that organizations are deploying Hadoop clusters as a centralized service offering, so that individual divisions don't have to build and run their own, and that "bigger is better" when it comes to processing batch workloads. This setup leads to Big Traffic - data movement between clusters, within a data center and between data centers. Data movement includes, but is not limited to, replication and synchronization, which will become especially important as Hadoop becomes a significant factor in enterprise storage. Big Traffic data movement services support Big Data analytics, regulatory compliance requirements, high availability services and security services.

Oracle NoSQL Database


Posted on October 25, 2011 by admin. Posted in Cloud Computing, Cloud NEWS.

Oracle NoSQL Database

The Oracle NoSQL Database is a distributed key-value database. It is designed to provide highly reliable, scalable and available data storage across a configurable set of systems that function as storage nodes.

Simple Data Model
o Key-value pair data structure; keys are composed of Major and Minor keys
o Easy-to-use Java API with simple Put, Delete and Get operations
Scalability
o Automatic, hash-function-based data partitioning and distribution
o Intelligent NoSQL Database driver is topology- and latency-aware, providing optimal data access
Predictable behavior
o ACID transactions, configurable globally and per operation
o Bounded latency via B-tree caching and efficient query dispatching
High Availability
o No single point of failure
o Built-in, configurable replication
o Resilient to single and multi-storage-node failure
o Disaster recovery via data center replication
Easy Administration
o Web console or command line interface
o System and node management
o Shows system topology, status, current load, trailing and average latency, events and alerts
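The major/minor key idea above can be sketched in a few lines - Put/Get/Delete over keys split into a major path (which decides placement) and a minor path. All names here are illustrative, not Oracle's actual Java API:

```python
class TinyKVStore:
    """Toy key-value store: records sharing a major key stay together."""

    def __init__(self, partitions=3):
        self.partitions = [{} for _ in range(partitions)]

    def _shard(self, major):
        # Only the major key is hashed, so all minor keys for one major
        # key land on the same partition (enabling efficient multi-get).
        return self.partitions[hash(major) % len(self.partitions)]

    def put(self, major, minor, value):
        self._shard(major)[(major, minor)] = value

    def get(self, major, minor):
        return self._shard(major).get((major, minor))

    def delete(self, major, minor):
        self._shard(major).pop((major, minor), None)

kv = TinyKVStore()
kv.put("user/42", "email", "a@example.com")
kv.put("user/42", "name", "alice")
print(kv.get("user/42", "name"))  # alice
```

Partitioning on the major key alone is the design choice that makes "fetch everything about user/42" a single-partition operation.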

References:
http://www.oracle.com/technetwork/database/nosqldb/overview/index.html
http://www.oracle.com/technetwork/database/nosqldb/downloads/index.html

"utorial on 3adoop 8ith ?!8are (la+er


Posted on March 9, 2012 by admin. Posted in BIG Data, Cloud Computing, How To..., Private Cloud, VMware, Windows.

Map Reduce (Source: Google)

Functional Programming

According to Wikipedia, in computer science functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast to the imperative programming style, which emphasizes changes in state. Since there is no hidden dependency (via shared state), functions in the DAG can run anywhere in parallel as long as one is not an ancestor of the other. In other words, analyzing the parallelism is much easier when there is no hidden dependency via shared state. Map/reduce is a special form of such a directed acyclic graph which is applicable in a wide range of use cases. It is organized as a "map" function which transforms a piece of data into some number of key/value pairs. Each of these elements is then sorted by its key and reaches the same node, where a "reduce" function is used to merge the values (of the same key) into a single result.

Map Reduce

A way to take a big task and divide it into discrete tasks that can be done in parallel. Map/Reduce is just a pair of functions operating over a list of data. MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers. The framework is inspired by the map and reduce functions commonly used in functional programming, [3] although their purpose in the MapReduce framework is not the same as in their original forms.

Hadoop

A large-scale batch data processing system. It uses MapReduce for computation and HDFS for storage. Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers. It is a framework written in Java for running applications on large clusters of commodity hardware, and it incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
Hadoop is an open source Java implementation of Google's MapReduce algorithm, along with an infrastructure to support distributing it over multiple machines. This includes its own filesystem, HDFS (Hadoop Distributed File System, based on the Google File System), which is specifically tailored for dealing with large files. When thinking about Hadoop, it is important to keep in mind that this infrastructure is a huge part of it. Implementing MapReduce is simple; implementing a system that can intelligently manage the distribution of processing and of your files, breaking those files down into more manageable chunks for efficient processing, is not. HDFS breaks files down into blocks which can be replicated across its network (how many times a block is replicated is determined by your application and can be specified on a per-file basis). This is one of the most important performance features and, according to the docs, "...is a feature that needs a lot of tuning and experience." You really don't want 50 machines all trying to pull from a 1 TB file on a single data node at the same time, but you also don't want to replicate a 1 TB file out to 50 machines. So, it's a balancing act. Hadoop installations are broken into three types:
- The NameNode acts as the HDFS master, managing all decisions regarding data replication.
- The JobTracker manages the MapReduce work. It "...is the central location for submitting and tracking MR jobs in a network environment."
- The TaskTrackers and DataNodes, which do the grunt work.
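The block-and-replication scheme just described is easy to quantify. A quick sketch, assuming a 64 MB block size and replication factor 3 (common defaults of that era, not values stated in this article):

```python
import math

BLOCK_MB = 64        # assumed HDFS block size
REPLICATION = 3      # assumed per-file replication factor

file_mb = 1_048_576  # a 1 TB file, in MB (1024 * 1024)
blocks = math.ceil(file_mb / BLOCK_MB)
raw_mb = file_mb * REPLICATION  # physical disk consumed across the cluster

print(blocks)               # 16384 blocks
print(raw_mb // 1_048_576)  # 3 -- TB of raw disk for 1 TB of data
```

This is the balancing act in numbers: more replicas mean more machines can read a block locally, but every extra replica multiplies the raw storage bill.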

Hadoop = NameNode, DataNode, JobTracker, TaskTracker. The JobTracker will first determine the number of splits (each split is configurable, ~16-64 MB) from the input path, and select some TaskTrackers based on their network proximity to the data sources; then the JobTracker sends the task requests to those selected TaskTrackers. Each TaskTracker starts the map phase processing by extracting the input data from its splits. For each record parsed by the "InputFormat", it invokes the user-provided "map" function, which emits a number of key/value pairs into an in-memory buffer. A periodic wakeup process will sort the memory buffer toward the different reducer nodes by

invoking the "combine" function. The key/value pairs are sorted into one of the R local files (supposing there are R reducer nodes). When the map task completes (all splits are done), the TaskTracker notifies the JobTracker. When all the TaskTrackers are done, the JobTracker notifies the selected TaskTrackers for the reduce phase. Each such TaskTracker reads the region files remotely. It sorts the key/value pairs and, for each key, invokes the "reduce" function, which collects the key/aggregatedValue into the output file (one per reducer node).

The Map/Reduce framework is resilient to the crash of any component. The JobTracker keeps track of the progress of each phase and periodically pings the TaskTrackers for their health status. When any map-phase TaskTracker crashes, the JobTracker reassigns the map task to a different TaskTracker node, which reruns all the assigned splits. If a reduce-phase TaskTracker crashes, the JobTracker reruns the reduce on a different TaskTracker.

Let's try Hands on Hadoop. The objective of the tutorial is to set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux with the use of VMware Player.
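Before moving on to the hands-on setup: the routing of a key to one of the R reducer files described above is, conceptually, just a deterministic hash of the key modulo R. A sketch of that idea (the names are illustrative, and CRC32 stands in for Hadoop's actual HashPartitioner):

```python
import zlib

def partition(key, num_reducers):
    """Pick which reducer's local file a key/value pair is sorted into.

    A stable hash guarantees every occurrence of the same key lands on
    the same reducer, so the reduce function sees all of its values together.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers

R = 4
# Every occurrence of the same key maps to one and the same partition...
assert partition("hadoop", R) == partition("hadoop", R)
# ...and every partition index is a valid reducer number.
assert all(0 <= partition(w, R) < R for w in ["map", "reduce", "hdfs"])
```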

Hadoop and VMware Player

Installations / Configurations Needed:

Laptop
Physical Machine: a laptop with 60 GB HDD, 2 GB RAM, 32-bit support, OS - Ubuntu 10.04 LTS (the Lucid Lynx), IP address 192.168.1.3 [used in configuration files]. Virtual Machine: see the VMware Player sub-section.

Download Ubuntu ISO File


The Ubuntu 10.04 LTS (the Lucid Lynx) ISO file is needed to install Ubuntu on the virtual machine created by VMware Player, to set up the multi-node Hadoop cluster.

Download Ubuntu Desktop Edition: http://www.ubuntu.com/desktop/get-ubuntu/download
Note: log in as user "root" to avoid any kind of permission issues (on your machine and on the virtual machine). Update the Ubuntu packages: sudo apt-get update

VMware Player (Freeware)


Download it from http://downloads.vmware.com/d/info/desktop_downloads/vmware_player/3_0

Download VMware Player

Select VMware Player to download.

VMware Player - Free Product Download. Install VMware Player on your physical machine with the use of the downloaded bundle.

VMware Player - ready to install

VMware Player - installing. Now create a virtual machine with it, install Ubuntu 10.04 LTS on it from the ISO file, and make the appropriate configurations for the virtual machine.

Browse to the Ubuntu ISO. Proceed with the instructions and let the setup finish.

Virtual machine in VMware Player. Once you are done with it successfully, select Play Virtual Machine.

Start the virtual machine in VMware Player. Open a Terminal (the command prompt in Ubuntu) and check the IP address of the virtual machine. NOTE: the IP address may change, so if the virtual machine cannot be reached by SSH from the physical machine, have a look at its IP address first.

Ubuntu virtual machine - ifconfig

Apply the following configuration on the physical and virtual machine, for the Java 6 and Hadoop installation only.

Installing Java 6
sudo apt-get install sun-java6-jdk
sudo update-java-alternatives -s java-6-sun [verify the Java version]

Setting up Hadoop 0.20.2


Download Hadoop from http://www.apache.org/dyn/closer.cgi/hadoop/core and place it under /usr/local/hadoop

SSH Configurations
Hadoop requires SSH access to manage its nodes, i.e. remote machines [in our case the virtual machine] plus your local machine if you want to use Hadoop on it. On the physical machine, generate an SSH key.

Generate an SSH key. Enable SSH access to your local machine with this newly created key.

Enable SSH access to your local machine. Or you can copy it from $HOME/.ssh/id_rsa.pub to $HOME/.ssh/authorized_keys manually. Test the SSH setup by connecting to your local machine as the root user.

Test the SSH setup. Use ssh 192.168.1.3 from the physical machine as well; it will give the same result. On the virtual machine: the root user account on the slave (virtual machine) should be able to access the physical machine via a password-less SSH login. Add the physical machine's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of the virtual machine (in this user's $HOME/.ssh). You can do this manually: (Physical Machine) $HOME/.ssh/id_rsa.pub → (VM) $HOME/.ssh/authorized_keys. The SSH key may look like this (it can't be the same, though!):

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA... root@mitesh-laptop

ssh 192.168.1.3 from the virtual machine to verify the SSH access and get a feel for how SSH works.

For more understanding, ping 192.168.1.3 and 192.168.28.136 from each other. For detailed information on network settings in VMware Player, visit http://www.vmware.com/support/ws55/doc/ws_net_configurations_common.html; VMware Player has similar concepts. Using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of the Ubuntu box. To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Ubuntu - disable IPv6

<HADOOP_INSTALL>/conf/hadoop-env.sh - set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.20

<HADOOP_INSTALL>/conf/core-site.xml -

Configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop's Distributed File System,

Hadoop - core-site.xml

HDFS, even though our little "cluster" only contains our single local machine.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop/tmp/dir/hadoop-${user.name}</value>
</property>

<HADOOP_INSTALL>/conf/mapred-site.xml -

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.1.3:54311</value>
</property>

Hadoop - mapred-site.xml

<HADOOP_INSTALL>/conf/hdfs-site.xml -

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

Physical Machine vs Virtual Machine (Master/Slave) - settings on the physical machine only


<HADOOP_INSTALL>/conf/masters - the conf/masters file defines the namenodes of our multi-node cluster. In our case, this is just the master machine.

192.168.1.3

<HADOOP_INSTALL>/conf/slaves - the conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will be run. We want both the master box and the slave box to act as Hadoop slaves, because we want both of them to store and process data.

192.168.1.3
192.168.28.136

NOTE: here 192.168.1.3 and 192.168.28.136 are the IP addresses of the physical machine and the virtual machine respectively, which may vary in your case. Just enter the IP addresses in the files and you are done!

Let's enjoy the ride with Hadoop:


All set for having "HANDS ON HADOOP".

Formatting the name node


On the physical machine and the virtual machine: the first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your "cluster" (which includes only your local machine if you followed this tutorial).

You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem; this will cause all your data to be erased.

hadoop namenode -format

Starting the multi-node cluster


1. Start the HDFS daemons. Run the command /bin/start-dfs.sh on the machine you want the (primary) namenode to run on. This will bring up HDFS with the namenode running on the machine you ran the command on, and datanodes on the machines listed in the conf/slaves file. Physical machine:

Hadoop - start-dfs.sh (VM)

Hadoop - DataNode on the slave machine. 2. Start the MapReduce daemons. Run the command /bin/start-mapred.sh on the machine you want the jobtracker to run on. This will bring up the MapReduce cluster with the

jobtracker running on the machine you ran the command on, and tasktrackers on the machines listed in the conf/slaves file. Physical machine:

Hadoop - start the MapReduce daemons (VM)

TaskTracker in Hadoop

Running a MapReduce Job


Here's the example input data I have used for the multi-node cluster setup described in this tutorial. All ebooks should be in plain-text us-ascii encoding.

http://www.gutenberg.org/etext/20417
http://www.gutenberg.org/etext/5000
http://www.gutenberg.org/etext/4300
http://www.gutenberg.org/etext/132
http://www.gutenberg.org/etext/1661
http://www.gutenberg.org/etext/972
http://www.gutenberg.org/etext/19699

Download the above ebooks and store them in the local file system. Copy the local example data to HDFS.

Hadoop - copy local example data to HDFS. Run the MapReduce job:

hadoop-0.20.2/bin/hadoop jar hadoop-0.20.2-examples.jar wordcount examples examples-output
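The bundled wordcount example is a Java job, but the same logic can be written as a small Python script and run through Hadoop Streaming (the approach Michael Noll's Hadoop-MapReduce-in-Python writeup, listed in this tutorial's references, walks through). A sketch of the mapper and reducer logic; the function names are illustrative:

```python
import sys
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Streaming mapper: emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(records):
    """Streaming reducer: input arrives sorted by key, so equal words are adjacent."""
    parsed = (rec.split("\t", 1) for rec in records)
    for word, group in groupby(parsed, key=itemgetter(0)):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Local dry run of the streaming pipeline: input -> map -> sort -> reduce.
    sample = ["to be or not to be"]
    for out in reducer(sorted(mapper(sample))):
        print(out)
```

On the cluster this would be submitted through the streaming jar with `-mapper`/`-reducer` options pointing at the scripts (the jar's path varies by Hadoop version), with Hadoop itself doing the sort between the two phases.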

Failed Hadoop job

Retrieve the job result from HDFS


You can read the file directly from HDFS without copying it to the local file system. In this tutorial, we will copy the results to the local file system though.

mkdir /tmp/example-output-final
bin/hadoop dfs -getmerge example-output-final /tmp/example-output-final

Hadoop - word count example

Hadoop - MapReduce administration

Hadoop - running and completed jobs. Task Tracker web interface:

Hadoop - Task Tracker web interface

Hadoop - NameNode cluster summary

References
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)

http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
http://java.dzone.com/articles/how-hadoop-mapreduce-works
http://ayende.com/Blog/archive/2010/03/14/map-reduce-ndash-a-visual-explanation.aspx
http://www.youtube.com/watch?v=…
http://www.gridgainsystems.com/wiki/display/…/MapReduce/Overview
http://map-reduce.wikispaces.asu.edu/
http://blogs.sun.com/fifors/entry/map_reduce
http://www.vmware.com/support/ws55/doc/ws_net_configurations_common.html
http://www.ibm.com/developerworks/aix/library/au-cloud_apache/

Related articles: Big Data with Hadoop & Cloud (clean-clouds.com). Tuesday, 13 July 2010

Getting Started with NoSQL


About: NoSQL theory, interview
A couple of weeks ago, I had the pleasure to sit down with Mathias Meyer, Chief Visionary at Scalarium, a Berlin startup, and discuss NoSQL adoption. Like myself, Mathias is really excited about NoSQL and he uses every opportunity to introduce more people to the NoSQL space. Recently he gave quite a few presentations around Europe about NoSQL databases. The discussion focused on how someone would start learning and using NoSQL databases and the path to follow in this new ecosystem. Below is a transcript of our conversation.

Alex: How does one get started with NoSQL?

Mathias: Well, that's a question I get quite a lot, but it is not that easy to answer. For me, I just pick one tool and start playing with it. If I see a use case for it, I add it to my tool box. If not, I have broadened my personal horizon. Just that is always a win in my book. From a business perspective, you are probably going to find some use cases where storing your data in a relational database doesn't make too much sense and you'll start looking for ways to get it out of the database. For example, think about storing log data, collecting historical data, or page impressions.

Alex: So, as a developer you should just give yourself a chance to play with the new shiny toys. As a business, a NoSQL database can be a viable solution for scenarios where you discover that your data doesn't really fit the relational model.

Mathias: Indeed. You have stuff in your database and it is too much for your database, or it puts too much load on your database, and you're looking for ways to get that out of your database. Load is a relative term, but consider data like logging or statistical data that grows somewhat exponentially. Relational databases are not a great fit to keep track of that kind of data, as it gets harder and harder to maintain or clean up as it grows.
As a developer, playing with new tools and different ways of solving problems makes sense all by itself, simply because it adds to your toolbox, and it broadens your personal and professional horizon. That's basically how I got into NoSQL. I stumbled upon tools, which in turn use databases that are more optimized to store data for their use case. It's

just fun playing with them, and new tools with different approaches to storing data always managed to make me curious. And who can resist a database that allows you to connect through telnet? I think that appeals to any geek I know.

Alex: There are quite a few NoSQL databases out there. Do you have any favorites or recommendations?

Mathias: If there's any bunch of tools I'd recommend for anyone to start playing with, it'd probably be MongoDB, CouchDB or Redis. They are excellent candidates to take data off your main database, and happily live alongside of it. If you just want to play with a NoSQL database, and you're coming from a relational background, your easiest bet would probably be MongoDB, as it's a good mix of what you're used to from relational databases with the best of schemaless storage. Redis makes sense to look at because it's a good candidate to take certain types of data out of your main database; statistics, message queues and historical data are just some examples. When you work with something like MongoDB and CouchDB you'll get a good idea of what NoSQL is about, as MongoDB is halfway between a relational and a NoSQL database while CouchDB is basically totally different thinking all the way. If all you're looking for is scale, have a look at Riak or Cassandra. They follow pretty interesting models of scaling up.

Alex: These NoSQL databases are proposing some new non-relational data models. Do you like one model more than the others?

Mathias: I'd say my favorite is the document database, as it is pretty much the most versatile of all of them. You can put any data in a document database and it leaves you all the freedom to model that data and to model some of the relationships between documents. It leaves all that up to you. And it is very flexible in how you can do that. Personally I like looking into different solutions and maybe even combining them.
That's exactly what I do in practice: I usually have something like CouchDB as my main database and something like Redis as a really nice and handy small store on the side, where I put data that's not suited for putting into CouchDB.

Alex: Is there something that you should be aware of before trying any of these NoSQL projects?

Mathias: It depends if you are doing it for your business or for yourself, or if you are using it on green-field projects, because that's usually a lot easier. The thing I always like to tell people is that they need to look at what they think their data is going to be shaped like. Obviously you won't know that right from the start, but you'll still have an idea of how loose your data will be, and whether you need something like typed relationships, transactions, and so on.

You can't really give a universal answer here. In the end you'll have to get an idea of what your data will look like and how you're going to read or write it. If a NoSQL database seemingly is a good fit for it, go for it. It's just important to be aware of both the benefits and the potential downsides, but that should be common sense for any tool you pick for a particular use case.

Alex: Well, I'd say that based on my experience with relational databases there are at least 3 things I've really gotten used to: the relational model, the query model and transactions. So someone looking at NoSQL databases should be aware that all these 3 concepts will have a different form.

Mathias: Yes, absolutely. You need to be aware that you'll meet a different data model, which brings great power and flexibility. You'll find that most of the tools in the NoSQL landscape removed any kind of transactional means, for the benefit of simplicity, making it a lot easier to scale up. We might not realize that transactions are not always needed; which is not to say they're totally unnecessary, it's merely that oftentimes their lack is not really a problem. As for querying, for the most part you're saying goodbye to ad-hoc queries. Most NoSQL databases removed the means to run any kind of dynamic query on your data, MongoDB being the noteworthy exception here. Data is usually pre-aggregated, e.g. using Map/Reduce, or access is simply done by keys. Is it a problem? Only you can make that decision, simply based on requirements and features. Either way, it does take a while to get used to these things, no doubt.

Alex: Once you start using NoSQL databases, will you have to get rid of RDBMS?

Mathias: No. If someone comes to me asking if they should switch to a NoSQL database without having a specific problem, my answer is always no.
You should look for alternative solutions only when you need to solve a real problem, which is usually that your current database is not able to keep up with all the types of data you throw at it, or you're storing lots of data and it's kind of a pain to get it out again (both in terms of querying and of simply removing stale data in large tables), or your data simply has reached a limit where it's too high a cost to migrate your schema. As Jan Lehnardt said, NoSQL is more about choice: you pick the tool that is right for the job, and if that tool is an RDBMS then you don't need to look for a NoSQL database until you have a specific problem. While the new tools are shiny and tempting to throw at every problem, there's always a learning curve involved, both in development and operations. It makes more sense to start off slow, and see how you go by just moving small parts at a time to a secondary database.

Alex: Thanks Mathias!

myNoSQL. Thursday, 26 August 2010

NoSQL Guide for Beginners


About:

NoSQL theory
Alexandre Porcelli has a great post for NoSQL beginners:

one of the most frequent questions that people ask me about nosql is: what is the best nosql tool that enables me to start with using my programming language (java, .net, php, python, etc.)? it's almost impossible to have a quick answer 'cos it involves many things like: data model, durability and usage scenario (single node, qty of nodes or cloud setup), language binding and ease of setup.

It's a great addition to the guide on getting started with NoSQL, expanding on some of the principles I've mentioned in the role of data modeling with NoSQL.

Six Steps to Extract Value from Big Data


"ponsored Content by Rogue Wave "oft*are

A top challenge facing businesses trying to adapt in the new age of big data is determining how to extract value from their data. Businesses that leverage information obtained from their data have a competitive advantage over those that don't. However, companies have begun to consolidate and organize vast collections of disparate data sources and now may be wondering what to do next. In order to understand the hidden value in a company's biggest asset, its data, begin by identifying what sets the company apart. Examine the workings of the company and then explore the areas where the company needs improvement. Once you understand the goals of the business, you can begin to make decisions based not only on your expertise but also combined with objective analytical results based on your data. This process could provide insights into the company's competitive advantages.

As an example, consider the healthcare industry. High-level goals of the industry are to provide exceptional service and healthcare for patients and to have innovative and inventive treatments of disease. The objectives suited to analytics are much more specific. For example, hospitals and their networks can use analytics to reduce patient readmission rates. Physicians and nurses may not be aware of the factors that contribute to high readmission rates, such as early identification of high-risk patients, staffing problems, lack of consistency among procedures, as well as certain characteristics of a procedure. Other, less obvious factors affecting readmission rates may emerge from clustering or other data mining techniques.

The process used to decipher information from data is called the data analysis process. The six steps outlined here will help your company build competitive advantages through data analysis. The data mining process is meant to be cyclical and repeat continuously.

Step One: Process and Clean Data
It is important to verify your data matches your business goals. If it does not, there are several questions to address: What are the viable proxies? Are there outliers that need to be taken into account? Does the data contain bias? Are there missing values? Look for functionality that will correctly address the various needs to clean and process the data. There are a number of methods that can be used to impute, or fill in, missing values, such as mean interpolation, Kalman filter, and ARMA. This step is one of the most important, but may take 70-90 percent of your data analysis project time. The quality of your data will greatly affect your analysis results.

Step Two: Explore and Visualize Data
Explore the processed data and visually inspect it for patterns, trends, and clusters. This is the time to examine relationships and build hypotheses according to your findings. The easiest way to complete this process is with the aid of visualization tools. There are a number of simple yet powerful visual aids, such as scatter plots, line graphs, stacked bar charts, box plots, and heat maps.

Step Three: Data Mine
You can use various methods to facilitate pattern recognition, including clustering (K-Means, hierarchical clustering), market basket analysis, Kohonen self-organizing maps for visualization, principal component analysis, factor analysis, and multi-dimensional scaling. Organizations that leverage and mine their data predictively have a significant competitive advantage over their rivals, as they can gain important insights and react quickly to expand their business in a way that was not possible without predictive analytics.

Step Four: Build Model
Be sure to have a wide range of models that provide different perspectives of the data. Some possible models to consider are decision trees, Naïve Bayes classifiers, neural networks, ARIMA, regressions, SVM, and discriminant analysis.
Every algorithm has its suitability, and it is important to understand that all models have limitations. There could be more than one model that would work for a problem. Avoid overfitting. Understand not only the probable errors, but also the most serious ones, and set parameters to control against making the most serious of false inferences. Be sure to document and communicate the assumptions and results clearly.

Step Five: Generate Results and Optimize
Predictive results are used to establish objective functions in order to generate actionable results. There are many applicable methods, such as linear and quadratic programming, least squares solvers, and differential equation solvers (PDE, ODE). One specific method may be more appropriate than another depending on the nature of the objective function

(linear, quadratic, or discontinuous) and the constraints on the variables (linear or not). The goal is to produce results that lead to valuable business decisions. If the hospital staff knows a certain surgical procedure has high readmissions, they may change the process to help reduce readmissions, such as allowing for an extra day of post-operative care.

Step Six: Validate Results
After you implement your business decisions, allow time to produce results. It is important to carefully validate the results against the initial business objective. Returning to our healthcare example, the hospital's business objective is reducing readmissions. Analysts should review data to see if current rates have declined in an appreciable way.

Selecting the Right Tools
You may find your toolkit stocked with several complementary software products to support the data analysis process, among them analytic software that supplies mathematical and statistical algorithms. There are several important criteria to consider, such as scalability, reliability, performance, data source consumability, and ease of deployment. When selecting a data analysis tool, it is important to consider these questions:

- Is the tool memory-bounded? Recognize that reliable software should inform users of data errors. What if user input data is not viable? Examine the size of the problem; does it have an informative message to let the user know what is happening, or would it hang the application?
- Consider the supported data types, formats, and environments. This includes relational databases, structured and unstructured data, data connection support, and language support. Does the tool support streaming data? Can the analytic be used inside the database?
- In terms of performance and technology, what is the development as well as the target deployment environment? Will the analytics be thread safe? Does it support MapReduce (which will be needed for Hadoop)? Is the analytic software optimized for a deployment platform? Does it take advantage of multicore servers, and can your computation be parallelized?
- What does the deployed solution look like? Does it use an industry-standard native language to simplify embedding in your web, Linux or Windows application and deployment? Has it been tested across platforms? If not, the computational results can be slightly different and cause differences in analytical results. Does it require any framework to support the deployment? If so, what are the additional hardware, software, and maintenance costs?
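Step One's mean-interpolation idea is, for instance, only a few lines of code once the data is in memory. A minimal sketch without any particular analytics library; the column name and values are made up for illustration:

```python
def impute_mean(values):
    """Fill missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# A column of patient ages with two missing entries:
ages = [40, None, 50, 60, None]
filled = impute_mean(ages)  # the gaps become the observed mean, 50.0
```

Real imputation tooling adds exactly the robustness the checklist above asks about: informative errors on non-viable input, support for large or streaming columns, and alternatives (Kalman filter, ARMA) when a plain mean would distort the series.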

Predictability is a characteristic of the data process, not a characteristic of the model. You can use predictive analytics to go beyond merely improving the efficiency of your current processes; you can create new opportunities or products based on the insight you gathered from the data. While this process seems complicated, there are sophisticated,

commercially available tools that have been tested, tried, and in production, such as Rogue Wave Software's IMSL Numerical Libraries, to help companies implement all six steps in this process. The IMSL Libraries provide sophisticated analytics in high-performance, mission-critical applications. With IMSL, companies and organizations reduce development time, realize a lower total cost of ownership, and improve quality and maintainability. Download the whitepaper "Driving Competitive Advantage by Predicting the Future" to delve into a deeper discussion of data mining and learn what predictive analytics can do for your company.

What is Hadoop? Apache Hadoop


What is Hadoop? "Hadoop" - the name itself is weird, isn't it? The term Hadoop came from the name of a toy elephant. Hadoop is all about processing huge data, irrespective of whether it's structured or unstructured; huge data means hundreds of GiBs and more. A traditional RDBMS system may not be apt when you have to deal with huge data sets. Even though "database sharding" tries to address this issue, the chance of node failure makes it less approachable. Hadoop was originally derived from the Google File System (GFS) papers and Google's MapReduce. Hadoop is a framework which enables applications to work with multiple nodes which can store enormous amounts of data. It comprises 2 components:

- the Apache Hadoop Distributed File System (HDFS)
- Google's MapReduce framework

Apache Hadoop was created by Doug Cutting; he named it after his son's toy elephant. Hadoop's original purpose was to support the Nutch search engine project. But Hadoop's significance has grown far beyond that; now it is a top-level Apache project and is being used by a large community of users. To name a few, Facebook, the New York Times and Yahoo are some examples of Apache Hadoop implementations. Hadoop is written in the Java programming language!

Significance of Hadoop
The data on the World Wide Web is growing at an enormous rate. As the number of active internet users increases, the amount of data getting uploaded is increasing. Some of the estimates related to the growth of data are as follows:

- In 2006, the total estimated size of the digital data stood at 0.18 zettabytes.
- By 2011, a forecast estimated it to stand at 1.8 zettabytes.

One zettabyte = 1000 exabytes = 1 million petabytes = 1 billion terabytes
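The conversion chain above is easy to sanity-check in a few lines (decimal SI units, i.e. factors of 1000 rather than 1024):

```python
TB = 10**12   # terabyte, in bytes (SI, powers of 1000)
PB = 10**15   # petabyte
EB = 10**18   # exabyte
ZB = 10**21   # zettabyte

assert ZB == 1000 * EB     # 1 ZB = 1000 exabytes
assert ZB == 10**6 * PB    # 1 ZB = 1 million petabytes
assert ZB == 10**9 * TB    # 1 ZB = 1 billion terabytes
```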

Social networking sites hosting photos, video streaming sites, and stock exchange transactions are some of the major sources of this huge amount of data. The growth of data also brings some challenges with it. Even though the amount of data storage has increased over time, data access speed has not increased at the same rate. If all the data resides on one node, it deteriorates the overall data access time; reading becomes slower, and writing becomes even slower. As a solution to this, if the same data is accessed from multiple nodes in parallel, then the overall data access time can be reduced. In order to implement this, we need the data to be distributed among multiple nodes, and there should be a framework to control reads and writes across these multiple nodes. Here comes the role of a Hadoop kind of system. Let's see the problems that can happen with shared storage and how the Apache Hadoop framework overcomes them.

Hardware Failure
Hadoop does not expect all nodes to be up and running all the time. Hadoop has a mechanism to handle node failures: it replicates the data.

Combining the data retrieved from multiple nodes
Combining the output of each worker node is a challenge; Google's MapReduce framework helps to solve this problem. A map is more like a key-value pair. The MapReduce framework has a mechanism for mapping the data retrieved from the multiple disks and then combining it to generate one output.

Components of Apache Hadoop
The Hadoop framework consists of 2 parts: the Apache Hadoop Distributed File System (HDFS) and MapReduce.

Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System is a distributed file system which is designed to run on commodity hardware. Since Hadoop treats node failures as a norm rather than an exception, HDFS has been designed to be highly fault tolerant. Moreover, it is designed to run on low-cost shared hardware.

- HDFS is designed to reliably store very large files across machines in a large cluster.
- HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.

- The blocks of a file are replicated for fault tolerance, and this replication is configurable. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later.
- The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
- Replica placement is crucial for faster retrieval of data by the clients; for this, HDFS uses a technique known as Rack Awareness.
- HDFS tries to satisfy a read request from the replica that is closest to the client.
- All HDFS communication protocols are layered on top of the TCP/IP protocol.
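The block and replication bookkeeping described above can be pictured with a small sketch: split a file into fixed-size blocks, then assign each block to N distinct nodes. This is an illustration of the idea only; it is not HDFS's actual placement policy, which is rack-aware:

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; only the last may be smaller."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication):
    """Assign each block index to `replication` distinct nodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 250, block_size=100)  # block sizes: 100, 100, 50
plan = place_replicas(len(blocks), ["node1", "node2", "node3"], replication=2)
# block 0 on node1+node2, block 1 on node2+node3, block 2 on node3+node1:
# losing any single node still leaves one replica of every block.
```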

MapReduce
MapReduce is the framework that handles the data analysis part of the Apache Hadoop implementation. Following are the notable points of MapReduce.

MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers. The MapReduce framework is inspired by the map and reduce functions commonly used in functional programming. MapReduce consists of a Map step and a Reduce step to solve a given problem.

Map step:
o The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes.
o A worker node may do this again in turn, leading to a multi-level tree structure.
o The worker node processes the smaller problem and passes the answer back to its master node.

Reduce step:
o The master node then takes the answers to all the sub-problems and combines them in a way to get the output.

All Map steps execute in a parallel fashion. The Reduce step takes its input from the Map step. All the Maps with the same key fall under one reducer. However, there are multiple reducers, and they work in parallel. This parallel execution offers the possibility of recovery from partial failure: if one node (Mapper/Reducer) fails, its work can be re-scheduled to another node.
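The Map and Reduce steps above can be demonstrated with the classic word-count example. This is a minimal, single-process sketch of the programming model only; a real Hadoop job distributes the map tasks and reducers across the cluster.

```python
# Word count expressed as Map, Shuffle, and Reduce steps.
from collections import defaultdict

def map_step(document: str):
    # Map: emit a (key, value) pair for every word seen.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, so that all pairs with the same key
    # end up at the same reducer (as described in the text above).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # Reduce: combine all values for one key into the final answer.
    return (key, sum(values))

docs = ["big data big hadoop", "hadoop big"]
mapped = [pair for doc in docs for pair in map_step(doc)]
counts = dict(reduce_step(k, v) for k, v in shuffle(mapped).items())
# counts == {'big': 3, 'data': 1, 'hadoop': 2}
```

Because each document can be mapped independently and each key can be reduced independently, both steps parallelize naturally, which is the whole point of the model.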

Introduction to Big Data and Hadoop Ecosystem for Beginners!

April 29, 2013 by sushilpramanick

We live in an age where data grows faster with each passing second. Very soon, growth in data volume will be outpacing even Moore's law. Big Data refers to this exponential explosion of data that can't be handled by traditional architectural and structural data solutions. The four key attributes of Big Data are Velocity, Volume, Variety and Value. If you look around your environment, every gadget you use generates data that can possibly shape how you will use it next time. For example, the cable channel can store your preferences on the genres and attributes of movies or shows you like, and the advertisements you skip or watch, and build a custom channel suited to your needs with your shows and advertisements. Another example: the car you drive will be able to transmit your driving patterns, violations and speed in real time to the DMV and insurance companies, which may affect your insurance rates. Logistics and manufacturing companies capture millions of unstructured data points daily from server logs, machine sensors, RFID and supply chain processes that can be mined to be more cost effective and productive. Imagine living in a sci-fi world where companies get real-time data from your cellphones, iPads and tablets, laptops, game controllers, social media and electronic channels to know about you and present you with products right in time, before you even ask for one! There are endless possibilities in the future to harness the power of data into information and provide custom solutions and products to the human race. It is said that structured data makes up only about 20% of all data; the remaining 80% is unstructured, coming in complex formats: everything from web sites, social media and email to videos, presentations, etc.
In the past we were overwhelmed with structured data, and we built big Sun servers and IBM servers; but given the petabytes of data and logs to process, the industry demands a more scalable, robust and performance-optimized solution to process this information. Over a decade back, Google designed scalable frameworks like MapReduce and the Google File System. Inspired by these designs, an Apache open source initiative was started under the name Hadoop. Apache Hadoop is a framework that allows for the distributed processing of such large data sets across clusters of machines. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Apache Hadoop consists of two sub-projects: Hadoop MapReduce and the Hadoop Distributed File System. Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. HDFS is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. Other Hadoop-related projects at Apache include Cassandra, Chukwa, Hive, HBase, Mahout, Sqoop, ZooKeeper, Jaql, Avro and Pig.

HDFS
The Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

MapReduce
Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.

Cassandra
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages. Cassandra's Column Family data model offers the convenience of column indexes with the performance of log-structured updates, strong support for materialized views, and powerful built-in caching. Cassandra is in use at Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick, Ooyala, and more companies that have large, active data sets. The largest known Cassandra cluster has over 300 TB of data in over 400 machines.
Chukwa
Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

Flume from Cloudera is similar to Chukwa both in architecture and features. Architecturally, Chukwa is a batch system; in contrast, Flume is designed more as a continuous stream processing system.

Hive
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. The main building blocks of Hive are:
1. Metastore, which stores the system catalog and metadata about tables, columns, partitions, etc.
2. Driver, which manages the lifecycle of a HiveQL statement as it moves through Hive
3. Query Compiler, which compiles HiveQL into a directed acyclic graph of MapReduce tasks
4. Execution Engine, which executes the tasks produced by the compiler in proper dependency order
5. HiveServer, which provides a Thrift interface and a JDBC/ODBC server

HBase
HBase is the Hadoop database. Think of it as a distributed, scalable, big data store. Use HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Features

Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables.
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with HBase tables.
Easy to use Java API for client access.
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server side Filters.

Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options.
Extensible jruby-based (JIRB) shell.
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.

Mahout
The success of companies and individuals in the data age depends on how quickly and efficiently they turn vast amounts of data into actionable information. Whether it's for processing hundreds or thousands of personal e-mail messages a day or divining user intent from petabytes of weblogs, the need for tools that can organize and enhance data has never been greater. Therein lies the premise and the promise of the field of machine learning. How do we easily move all these concepts to big data? Welcome Mahout! Mahout is an open source machine learning library from Apache. It's highly scalable. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. At the moment, it primarily implements recommender engines (collaborative filtering), clustering, and classification. Recommender engines try to infer tastes and preferences and identify unknown items that are of interest. Clustering attempts to group a large number of things together into clusters that share some similarity; it's a way to discover hierarchy and order in a large or hard-to-understand data set. Classification decides how much a thing is or isn't part of some type or category, or how much it does or doesn't have some attribute.

Sqoop
Loading bulk data into Hadoop from production systems, or accessing it from map-reduce applications running on large clusters, can be a challenging task. Transferring data using scripts is inefficient and time-consuming. How do we efficiently move data from an external storage into HDFS or Hive or HBase? Meet Apache Sqoop. Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. The dataset being transferred is sliced up into different partitions, and a map-only job is launched with individual mappers responsible for transferring a slice of this dataset.
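The slicing idea behind Sqoop's map-only import can be sketched as follows. This is an illustration of splitting a table by primary-key range so that each mapper transfers one contiguous slice; it is not Sqoop's own code, and the key bounds and mapper count are invented for the example.

```python
# Divide the inclusive key range [min_key, max_key] into num_mappers
# contiguous slices of near-equal size, one per map task.

def split_key_range(min_key: int, max_key: int, num_mappers: int):
    total = max_key - min_key + 1
    base, extra = divmod(total, num_mappers)
    slices, start = [], min_key
    for i in range(num_mappers):
        # The first `extra` slices absorb one leftover row each.
        size = base + (1 if i < extra else 0)
        slices.append((start, start + size - 1))
        start += size
    return slices

# A table with ids 1..100 transferred by 4 mappers:
slices = split_key_range(1, 100, 4)
# slices == [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each mapper would then issue a query restricted to its own slice (conceptually, `WHERE id BETWEEN lo AND hi`), so the transfer runs in parallel with no reduce step at all.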
ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

Eclipse
Eclipse is a popular IDE donated by IBM to the open source community.

Lucene
Lucene is a text search engine library written in Java. Lucene provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

Solr
Solr is a high performance search server built using Lucene Core, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.

Pig
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

Ease of programming: It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities: The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility: Users can create their own functions to do special-purpose processing.

Ambari
A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.

Jaql
JAQL, or "jackal", is a query language for JavaScript Object Notation (JSON).

Avro
Avro is a data serialization system. Avro provides:

Rich data structures.
A compact, fast, binary data format.
A container file, to store persistent data.
Remote procedure call (RPC).
Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

UIMA
UIMA is an architecture for the development, discovery, composition and deployment of components for the analysis of unstructured data.

Hadoop: What is it, and is it Rocket Science to learn?


Jun 25. Posted by mohittare.

The turbulence that Big Data is causing today in the IT sea is enormous. With this new era of Big Data, the IT giants have really understood that the power of Big Data is really BIG, if we can harness it.

The nature of Big Data
With the complexity that surrounds Big Data comes the saviour, Hadoop. What is Hadoop? Apache Hadoop is an open source Java framework for processing and querying vast amounts of data. I do agree that Hadoop is a little bit complex for a beginner to learn, but it is not really rocket science to learn and to practice some beginner tasks.

OK, this looks interesting, but where do I learn about Hadoop? You can entirely learn it by yourself, especially with the world wide web to help.

Source 1: EMC Corporation Big Ideas Video Playlist. Yes, this may seem a bit naive and simple, but they have made an amazing video series for understanding and learning what Big Data is and what Hadoop is. In fact it's one of the fastest ways to gain an overview of Hadoop.
Link: http://www.youtube.com/playlist?list=PLD298CTF8D0908E4C

Source 2: Join Big Data University. Yes, you heard it right. Big Data University has a bunch of amazing courses on Hadoop technology. The beginner's entry point is the Hadoop Fundamentals course, which will teach you all the basics of Hadoop. Based on my personal experience, it is a nice way to dig a bit deeper into Hadoop.
Link: http://bigdatauniversity.com/courses/

Source 3: Hadoop in Practice. I really have not gone much deeper into this book, but the reviews say it is a good book to study and learn about Hadoop.
Link: http://www.amazon.com/Hadoop-Practice-Alex-Holmes/dp/1617290238

Source 4: Hadoop Tutorial from YDN and IBM developerWorks. The Yahoo developer network and IBM developerWorks also have a bunch of amazing tutorials on Hadoop. IBM developerWorks has somewhat more advanced tutorials on other technologies related to Hadoop, like Hive, Pig, Flume etc.
Link: Yahoo YDN Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/module1.html
Link: IBM developerWorks: http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/

Source 5: Apache Hadoop wiki and official web page. If you want to dig more, you can always refer to the official Apache Hadoop wiki.
Link: http://hadoop.apache.org/docs/current/

OK, now I have learnt the basics; now what? Now it is time to see it in action. The biggest hurdle that I faced was setting up a single node Hadoop cluster and configuring it to learn on. I will post about it soon. Meanwhile, here is my presentation on Big Data and Hadoop, which will also serve as an insight.

Examples of Big Data Projects

Here's another way to capture what a Big Data project could mean for your company or project: study how others have applied the idea. Here are some real-world examples of Big Data in action:

Consumer product companies and retail organizations are monitoring social media like Facebook and Twitter to get an unprecedented view into customer behavior, preferences, and product perception. Manufacturers are monitoring minute vibration data from their equipment, which changes slightly as it wears down, to predict the optimal time to replace or maintain it. Replacing it too soon wastes money; replacing it too late triggers an expensive work stoppage. Manufacturers are also monitoring social networks, but with a different goal than marketers: they are using it to detect aftermarket support issues before a warranty failure becomes publicly detrimental. Financial services organizations are using data mined from customer interactions to slice and dice their users into finely tuned segments. This enables these financial institutions to create increasingly relevant and sophisticated offers. Advertising and marketing agencies are tracking social media to understand responsiveness to campaigns, promotions, and other advertising mediums.

Insurance companies are using Big Data analysis to see which home insurance applications can be immediately processed, and which ones need a validating in-person visit from an agent. By embracing social media, retail organizations are engaging brand advocates, changing the perception of brand antagonists, and even enabling enthusiastic customers to sell their products. Hospitals are analyzing medical data and patient records to predict which patients are likely to seek readmission within a few months of discharge; the hospital can then intervene in hopes of preventing another costly hospital stay. Web-based businesses are developing information products that combine data gathered from customers to offer more appealing recommendations and more successful coupon programs. The government is making data public at the national, state, and city levels for users to develop new applications that can generate public good. Sports teams are using data for tracking ticket sales and even for tracking team strategies.

Harnessing Big Data

Posted January 16th, 2013 in Outsourcing IT by K. Piskorski

Five terabytes: that's how much data will be created, directly or indirectly, by every man, woman and child on earth by 2020. This number, coming from an International Data Corp report, is only an average. A US citizen who works in IT, uploads YouTube videos, owns a DSLR, and is also present on numerous online platforms might have a digital footprint ten times this size. On the other

hand, some people in the Third World might be represented only by a small string of kilobytes: a single record in the tax system of their country. But just for the sake of simplicity, imagine that everyone you meet in the street has a stack of 1 TB hard drives floating over his or her head. That's their digital footprint. Companies and businesses also have footprints. Let's imagine them as huge piles of drives sitting next to the office buildings. This data exists physically somewhere in the world. In a Malaysian data center. On your pendrive. On a server of your insurance company. In the account in the online store you often use. Now imagine that all the stacks are doubling in size every two years. It's not a fantasy. According to the Aberdeen Group report, the average dataset of a US company grew by 60% last year, and many businesses over the pond double their storage every two years. It is this growth that fuels many new sectors of IT. One of them has become an important buzz word, already known to almost every IT professional on the planet: Big Data. Contrary to what you might think after this introduction, Big Data is not only about storage and archiving. It's more about finding innovative (and often very profitable) ways to use, analyze and interconnect the immense multi-format data collections, from database records and statistics to videos, pictures and social streams. In 2010, spending in this sector was close to $3.2 billion. We already know that in 2015 it should reach $16.9 billion, an astonishing annual growth rate of 40%. That's 7 times the growth rate of the general IT market. This sector is rising so fast that IDC predicts an upcoming shortage of specialists with experience in Big Data projects. When local job markets dry out, many businesses will turn to outsourcing and offshore development, a fact that companies like PGS Software keenly take note of. If 2012 was the year of Big Data, then 2013 is going to be the year of Big Data outsourcing.
What can we do with it?
Why is it happening? Is there really so much value in this market, or is Big Data just a buzz propagated by technology journalists? Let me give you a good example. Last year, a small team of Big Data enthusiasts was able to create an application called TwitterHealth. Their software has been analyzing an enormous sea of Twitter feeds, looking for social updates that could indicate someone is suffering from the flu. As you would expect, Twitter users very often write if they feel sick,

or if they intend to stay at home; the application takes advantage of that. Wherever the flu strikes, such tweets are much more prevalent. By cross-referencing the semantic analysis with geographical information, TwitterHealth was able to create a surprisingly good, real-time map of flu epidemics. Now the best thing: the map proved just as good as the one prepared by the Center for Disease Control, which used information from hundreds of medical practitioners. But it was much faster, much cheaper, and worked on data available to everyone. Another example comes from Japan. A company there introduced a custom application for decommissioning post-lease cars. Monitoring auction houses and prices at local used vehicle dealerships, the system automatically finds the best place in the country to sell the car. At the same time it's also using a big technical database to virtually disassemble it, seek information about current prices of used parts, set a value on specific components, see if there's a market for them, and modify the vehicle price accordingly. Overall, this automated system allowed the company to earn $150 more on each of 250,000 cars sold per year. Who would say no to an extra $37.5 million out of nowhere? That's what Big Data is all about: using what's already available in a smarter way. The system I have just mentioned, created by Opera Solutions, is just one of many economic and e-commerce projects based on Big Data that appear all around us. Many of them will prove important to our economy or even politics. JPMorgan Chase & Co, a powerful financial company that came into the spotlight during the investigation into the American financial crash of 2008, currently uses Big Data software to control the processing of derivatives: a financial IT system based on many inputs from the economy, market trends and even world news. My guess is, come the next crash, we'll be talking about algorithms and software engineers instead of bankers with fat paychecks.
But there are also many good non-commercial examples of Big Data application development. One of them is the 1000 Genomes Project. Genome mapping and research has one important trait: it generates A LOT of information. Now just imagine that dozens of teams across the world work simultaneously, mapping the genomes of many individuals and creating heaps of raw data. That's why the project creators started an Amazon Web Service that gives every genome researcher on the globe easy access to the data of all the other scientists. A real global hub for genome research, and an example of why Big Data is important not only for commercial ventures. The last thing I'd like to highlight comes from Google. The company employs what's called Statistical Machine Translation to fuel its popular Google Translate service. Google doesn't really try to "understand" the grammar of the world's languages or the context of the phrases it translates. It just takes an immense database of digitized texts in both languages, looks for established patterns, and then tries to guess which string in the foreign language most likely represents the string in the input language.

Yes, Google Translate often gets it wrong. But the thing is, it becomes more accurate the more data you pump into it. That's one of the reasons why Google is running its book digitization effort. Every new volume added to the database makes Google's translation services a tiny bit better. Potentially, when it gobbles up and digests all known written sources in every language on earth, it could make translators and translating software obsolete: a groundbreaking perspective that might become true in just several years. There are more examples. Greenhouses that connect to publicly available weather data to determine when they should open and when they should close. Stores that manage stock based on the social media buzz. And yes, sometimes it gets a bit scary. Just like the algorithm created by scientists from the University of Birmingham. It overlaid data from the cell phones of 200 people on the map, and then started to learn about them, taking note of all their movements during the day, meetings, social interactions, work flow. It proved way too effective. Once the analysis was finished, the algorithm could predict with 93% certainty where any person would be at any given time and date in the foreseeable future, with an accuracy of just 20 meters. It's hard to decide what's more disturbing: the fact that our daily routines are so repeatable, or that it takes so little to track us.

The impact on IT outsourcing
Now you know why the Big Data sector is so important, and why it grows fast. But the true depth of this market is not created by end-user solutions. A big chunk of the soon-to-be $16.9 billion pie is occupied by middleware aiming to provide connections between different data assets. For example: MarkLogic software, which allows you to analyze unstructured data in formats hard to process, such as random documents or videos. Many leading companies also invest in creating their own Big Data analysis platforms, and every one of them is a massive undertaking.
Today, too many people still think of Big Data as something dull, evoking images of endless server rooms, tape streamers and cloud data centers. But the truth is, this rapidly growing sector is all about creativity. We need to realize that it's Big Data, not handheld gadgets, new phones, phablets or tablets, that's going to really change our lives ten years down the line. According to IDC, 25% of the information we currently have in the world is potentially useful (and valuable). It only needs to be tagged, analyzed and interconnected. But so far, we have only managed to process 0.5%.

That's why teams like my PGS team are eagerly waiting for entrepreneurs with a vision of how to utilize the remaining 24.5%. We know that Big Data is the oil of the 21st century: a new resource that lies waiting for people who will find a way to use it. We know that the push we currently see, with the Big Data sector growing by 40% each year, is a new gold rush. We have also worked on some interesting projects from this sector, for example a piece of software that tracked the ebbs and flows of the phone market by analyzing raw data from cellular towers. That's why we're all going to watch this field closely in 2013. Maybe in one of the many upcoming Big Data projects, we're going to see a glimpse of the future of the entire IT industry.
This entry was posted on Wednesday, January 16th, 2013 at 11:52 and is filed under Outsourcing IT.

Big data: What's your plan?


The payoff from joining the big-data and advanced-analytics management revolution is no longer in doubt. The tally of successful case studies continues to build, reinforcing broader research suggesting that when companies inject data and analytics deep into their operations, they can deliver productivity and profit gains that are 5 to 6 percent higher than those of the competition.1 The promised land of new data-driven businesses, greater transparency into how operations actually work, better predictions, and faster testing is alluring indeed. But that doesn't make it any easier to get from here to there. The required investment, measured both in money and management commitment, can be large. CIOs stress the need to remake data architectures and applications totally. Outside vendors hawk the power of black-box models to crunch through unstructured data in search of cause-and-effect relationships. Business managers scratch their heads, while insisting that they must know, upfront, the payoff from the spending and from the potentially disruptive organizational changes. The answer, simply put, is to develop a plan. Literally. It may sound obvious, but in our experience, the missing step for most companies is spending the time required to create a simple plan for how data, analytics, frontline tools, and people come together to create business value. The power of a plan is that it provides a common language allowing senior executives, technology professionals, data scientists, and managers to discuss where the greatest returns will come from and, more important, to select the two or three places to get started.

There's a compelling parallel here with the management history around strategic planning. Forty years ago, only a few companies developed well-thought-out strategic plans. Some of those pioneers achieved impressive results, and before long a wide range of organizations had harnessed the new planning tools and frameworks emerging at that time. Today, hardly any company sets off without some kind of strategic plan. We believe that most executives will soon see developing a data-and-analytics plan as the essential first step on their journey to harnessing big data. The essence of a good strategic plan is that it highlights the critical decisions, or trade-offs, a company must make and defines the initiatives it must prioritize: for example, which businesses will get the most capital, whether to emphasize higher margins or faster growth, and which capabilities are needed to ensure strong performance. In these early days of big-data and analytics planning, companies should address analogous issues: choosing the internal and external data they will integrate; selecting, from a long list of potential analytic models and tools, the ones that will best support their business goals; and building the organizational capabilities needed to exploit this potential. Successfully grappling with these planning trade-offs requires a cross-cutting strategic dialogue at the top of a company to establish investment priorities; to balance speed, cost, and acceptance; and to create the conditions for frontline engagement. A plan that addresses these critical issues is more likely to deliver tangible business results and can be a source of confidence for senior executives.

What's in a plan?
Any successful plan will focus on three core elements.

Data
A game plan for assembling and integrating data is essential. Companies are buried in information that's frequently siloed horizontally across business units or vertically by function. Critical data may reside in legacy IT systems that have taken hold in areas such as customer service, pricing, and supply chains. Complicating matters is a new twist: critical information often resides outside companies, in unstructured forms such as social-network conversations. Making this information a useful and long-lived asset will often require a large investment in new data capabilities. Plans may highlight a need for the massive reorganization of data architectures over time: sifting through tangled repositories (separating transactions from analytical reports), creating unambiguous golden-source data,2 and implementing data-governance standards that systematically maintain accuracy. In the short term, a lighter solution may be possible for some companies: outsourcing the problem to data specialists who use cloud-based software to unify enough data to attack initial analytics opportunities.

Analytic models
Integrating data alone does not generate value. Advanced analytic models are needed to enable data-driven optimization (for example, of employee schedules or shipping networks) or predictions (for instance, about flight delays or what customers will want or do given their buying histories or Web-site behavior). A plan must identify where models will create additional business value, who will need to use them, and how to avoid inconsistencies and unnecessary proliferation as models are scaled up across the enterprise. As with fresh data sources, companies eventually will want to link these models together to solve broader optimization problems across functions and business units. Indeed, the plan may require analytics "factories" to assemble a range of models from the growing list of variables and then to implement systems that keep track of both. And even though models can be dazzlingly robust, it's important to resist the temptation of analytic perfection: too many variables will create complexity while making the models harder to apply and maintain.

"ools
The output of modeling may be strikingly rich, but it's valuable only if managers and, in many cases, frontline employees understand and use it. Output that's too complex can be overwhelming or even mistrusted. What's needed are intuitive tools that integrate data into day-to-day processes and translate modeling outputs into tangible business actions: for instance, a clear interface for scheduling employees, fine-grained cross-selling suggestions for call-center agents, or a way for marketing managers to make real-time decisions on discounts. Many companies fail to complete this step in their thinking and planning, only to find that managers and operational employees do not use the new models, whose effectiveness predictably falls. There's also a critical enabler needed to animate the push toward data, models, and tools: organizational capabilities. Much as some strategic plans fail to deliver because organizations lack the skills to implement them, so too big-data plans can disappoint when organizations lack the right people and capabilities. Companies need a road map for assembling a talent pool of the right size and mix. And the best plans will go further, outlining how the organization can nurture data scientists, analytic modelers, and frontline staff who will thrive (and strive for better business outcomes) in the new data- and tool-rich environment. By assembling these building blocks, companies can formulate an integrated big-data plan similar to what's summarized in the exhibit. Of course, the details of plans (analytic approaches, decision-support tools, and sources of business value) will vary by industry. However, it's worth noting a structural similarity across industries: most companies will need to plan for major data-integration campaigns. The reason is that many of the highest-value models and tools (such as those shown on the right of the exhibit) increasingly will be built using an extraordinary range of data sources (such as all or most of those shown on the left). Typically, these sources will include internal data (from customers or patients), transactions, and operations, as well as external information from partners along the value chain and Web sites, plus, going forward, from sensors embedded in physical objects.

Exhibit
A successful data plan will focus on three core elements.


To build a model that optimizes treatment and hospitalization regimes, a company in the health-care industry might need to integrate a wide range of patient and demographic information, data on drug efficacy, input from medical devices, and cost data from hospitals. A transportation company might combine real-time pricing information, GPS and weather data, and measures of employee labor productivity to predict which shipping routes, vessels, and cargo mixes will yield the greatest returns.

"hree 6e+ planning challenges


Every plan will need to address some common challenges. In our experience, they require attention from the senior corporate leadership and are likely to sound familiar: establishing investment priorities, balancing speed and cost, and ensuring acceptance by the front line. All of these are part and parcel of many strategic plans, too. But there are important differences in plans for big data and advanced analytics.

1. Matching investment priorities with business strategy


As companies develop their big-data plans, a common dilemma is how to integrate their "stovepipes" of data across, say, transactions, operations, and customer interactions. Integrating all of this information can provide powerful insights, but the cost of a new data architecture and of developing the many possible models and tools can be immense, and that calls for choices. Planners at one low-cost, high-volume retailer opted for models using store-sales data to predict inventory and labor costs to keep prices low. By contrast, a high-end, high-service retailer selected models requiring bigger investments and aggregated customer data to expand loyalty programs, nudge customers to higher-margin products, and tailor services to them. That, in a microcosm, is the investment-prioritization challenge: both approaches sound smart and were, in fact, well suited to the business needs of the companies in question. It's easy to imagine these alternatives catching the eye of other retailers. In a world of scarce resources, how to choose between these (or other) possibilities? There's no substitute for serious engagement by the senior team in establishing such priorities. At one consumer-goods company, the CIO has created heat maps of potential sources of value creation across a range of investments throughout the company's full business system: in big data, modeling, training, and more. The map gives senior leaders a solid fact base that informs debate and supports smart trade-offs. The result of these discussions isn't a full plan but is certainly a promising start on one. Or consider how a large bank formed a team consisting of the CIO, the CMO, and business-unit heads to solve a marketing problem. Bankers were dissatisfied with the results of direct-marketing campaigns: costs were running high, and the uptake of the new offerings was disappointing. The heart of the problem, the bankers discovered, was a siloed marketing approach.
Individual business units were sending multiple offers across the bank's entire base of customers, regardless of their financial profile or preferences. Those more likely to need investment services were getting offers on a range of deposit products, and vice versa. The senior team decided that solving the problem would require pooling data in a cross-enterprise warehouse with data on income levels, product histories, risk profiles, and more. This central database allows the bank to optimize its marketing campaigns by targeting individuals with products and services they are more likely to want, thus raising the hit rate and profitability of the campaigns. A robust planning process often is needed to highlight investment opportunities like these and to stimulate the top-management engagement they deserve given their magnitude.

2. Balancing speed, cost, and acceptance


A natural impulse for executives who "own" a company's data and analytics strategy is to shift rapidly into action mode. Once some investment priorities are established, it's not hard to find software and analytics vendors who have developed applications and algorithmic models to address them. These packages (covering pricing, inventory management, labor scheduling, and more) can be cost-effective and easier and faster to install than internally built, tailored models. But they often lack the qualities of a killer app: one that's built on real business cases and can energize managers. Sector- and company-specific business factors are powerful enablers (or enemies) of successful data efforts. That's why it's crucial to give planning a second dimension, which seeks to balance the need for affordability and speed with business realities (including easy-to-miss risks and organizational sensitivities). To understand the costs of omitting this step, consider the experience of one bank trying to improve the performance of its small-business underwriting. Hoping to move quickly, the analytics group built a model on the fly, without a planning process involving the key stakeholders who fully understood the business forces at play. This model tested well on paper but didn't work well in practice, and the company ran up losses using it. The leadership decided to start over, enlisting business-unit heads to help with the second effort. A revamped model, built on a more complete data set and with an architecture reflecting differences among various customer segments, had better predictive abilities and ultimately reduced the losses. The lesson: big-data planning is at least as much a management challenge as a technical one, and there's no shortcut in the hard work of getting business players and data scientists together to figure things out. At a shipping company, the critical question was how to balance potential gains from new data and analytic models against business risks.
Senior managers were comfortable with existing operations-oriented models, but there was pushback when data strategists proposed a range of new models related to customer behavior, pricing, and scheduling. A particular concern was whether costly new data approaches would interrupt well-oiled scheduling operations. Data managers met these concerns by pursuing a prototype (which used a smaller data set and rudimentary spreadsheet analysis) in one region. Sometimes, "walk before you can run" tactics like these are necessary to achieve the right balance, and they can be an explicit part of the plan. At a health insurer, a key challenge was assuaging concerns among internal stakeholders. A black-box model designed to identify chronic-disease patients with an above-average risk of hospitalization was highly accurate when tested on historical data. However, the company's clinical directors questioned the ability of an opaque analytic model to select which patients should receive costly preventative-treatment regimes. In the end, the insurer opted for a simpler, more transparent data and analytic approach that improved on current practices but sacrificed some accuracy, with the likely result that a wider array of patients could qualify for treatment. Airing such tensions and trade-offs early in data planning can save time and avoid costly dead ends. Finally, some planning efforts require balancing the desire to keep costs down (through uniformity) with the need for a mix of data and modeling approaches that reflect business realities. Consider retailing, where players have unique customer bases, ways of setting prices to optimize sales and margins, and daily sales patterns and inventory requirements. One retailer, for instance, has quickly and inexpensively put in place a standard next-product-to-buy model3 for its Web site. But to develop a more sophisticated model to predict regional and seasonal buying patterns and optimize supply-chain operations, the retailer has had to gather unstructured consumer data from social media, to choose among internal-operations data, and to customize prediction algorithms by product and store concept. A balanced big-data plan embraces the need for such mixed approaches.

3. Ensuring a focus on frontline engagement and capabilities


Even after making a considerable investment in a new pricing tool, one airline found that the productivity of its revenue-management analysts was still below expectations. The problem? The tool was too complex to be useful. A different problem arose at a health insurer: doctors rejected a Web application designed to nudge them toward more cost-effective treatments. The doctors said they would use it only if it offered, for certain illnesses, treatment options they considered important for maintaining the trust of patients. Problems like these arise when companies neglect a third element of big-data planning: engaging the organization. As we said when describing the basic elements of a big-data plan, the process starts with the creation of analytic models that frontline managers can understand. The models should be linked to easy-to-use decision-support tools (call them killer tools) and to processes that let managers apply their own experience and judgment to the outputs of models. While a few analytic approaches (such as basic sales forecasting) are automatic and require limited frontline engagement, the lion's share will fail without strong managerial support. The aforementioned airline redesigned the software interface of its pricing tool to include only 10 to 15 rule-driven archetypes covering the competitive and capacity-utilization situations on major routes. Similarly, at a retailer, a red flag alerts merchandise buyers when a competitor's Internet site prices goods below the retailer's levels and allows the buyers to decide on a response. At another retailer, managers now have tablet displays predicting the number of store clerks needed each hour of the day given historical sales data, the weather outlook, and planned special promotions. But planning for the creation of such worker-friendly tools is just the beginning. It's also important to focus on the new organizational skills needed for effective implementation.
Far too many companies believe that 95 percent of their data and analytics investments should be in data and modeling. But unless they develop the skills and training of frontline managers, many of whom don't have strong analytics backgrounds, those investments won't deliver. A good rule of thumb for planning purposes is a 50:50 ratio of data and modeling to training. Part of that investment may go toward installing "bimodal" managers who both understand the business well and have a sufficient knowledge of how to use data and tools to make better, more analytics-infused decisions. Where this skill set exists, managers will of course want to draw on it. Companies may also have to create incentives that pull key business players with analytic strengths into data-leadership roles and then encourage the cross-pollination of ideas among departments. One parcel-freight company found pockets of analytical talent trapped in siloed units and united these employees in a centralized hub that contracts out its services across the organization. When a plan is in place, execution becomes easier: integrating data, initiating pilot projects, and creating new tools and training efforts occur in the context of a clear vision for driving business value, a vision that's unlikely to run into funding problems or organizational opposition. Over time, of course, the initial plan will get adjusted. Indeed, one key benefit of big data and analytics is that you can learn things about your business that you simply could not see before. Here, too, there may be a parallel with strategic planning, which over time has morphed in many organizations from a formal, annual, "by the book" process into a more dynamic one that takes place continually and involves a broader set of constituents.4 Data and analytics plans are also too important to be left on a shelf. But that's tomorrow's problem; right now, such plans aren't even being created. The sooner executives change that, the more likely they are to make data a real source of competitive advantage for their organizations.

First Steps on the Road to a Big Data Project

When does it make sense to start up a Big Data program? If your email marketing system isn't talking to your sales force automation system, and neither is synched up with your online purchase system, are you really ready to tackle a Big Data project? The answer may surprise you as we examine Big Data and its impact on the next-generation digital experience in this sixth, and final, installment of our ongoing series "Are You Ready for Big Data?" "Start small with Big Data," is the advice from author Bill Franks.

Bill Franks, Teradata

Identify a few relatively simple analytics that won't take much time or data to run. For example, an online retailer might start by identifying what products each customer viewed within just a few key categories so that the company can send a follow-up offer if they don't purchase. An organization that is entering the Big Data waters needs simple, intuitive examples to see what the data can do, Franks says, adding that this approach also yields results that are easy to test to see what type of lift the analytics provide. Next, design a one-off test on some company data: a single month of data from one division for one set of products, for example. Franks cautions against attempting to analyze "all of the data all of the time" when first starting. That can muddy the water with too much data, and lead to high initial costs, a problem that plagues many Big Data initiatives. Instead, utilize only the data you need to perform the initial tests. At this point, Franks recommends, turn analytic professionals loose on the data. They can create test and control groups to whom they can send the follow-up offers, and then they can help analyze the results. During this process, they'll also learn an awful lot about the data and how to make use of it. Successful prototypes also make it far easier to get the support required for a larger, more comprehensive effort. Best of all, the full effort will now be less risky because the data is better understood and the value is already partially proven. It's also worthwhile to learn early when the initial analytics aren't as valuable as hoped. It tells you to focus your effort elsewhere before you've wasted many months and a lot of money. "Pursuing Big Data with small, targeted steps can actually be the fastest, least expensive, and most effective way to go," Franks says.
\6t enables an organiAation to pro$e thereHs $alue in a ma"or in$estment before making itD and to understand better how to make a Big /ata program pay off for the long term.\ -hate$er the siAe of your initial forayD experts ad$ise to remember that itHs a processD a loop. /onHt expect fantastic insights the $ery first time you route two data streams into

the same ri$er. Iften the benefits donHt start to accrue until after youH$e run your tests through a few iterations. J$en thenD because of the newness of the fieldD Big /ata pro"ectsXe$en successful ones Xcan be frustrating.
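Franks's test-and-control advice can be made concrete in a few lines of code: split the customers who viewed a product into a group that receives the follow-up offer and a control group that does not, then compare purchase rates. This is only an illustrative sketch; the group sizes, field names, and numbers below are invented, not Teradata's methodology.

```python
# Toy sketch of measuring the "lift" of a follow-up offer, comparing a
# test group (received the offer) against a control group (did not).

def purchase_rate(group):
    """Fraction of customers in the group who went on to purchase."""
    return sum(1 for c in group if c["purchased"]) / len(group)

def lift(test_group, control_group):
    """Relative improvement of the test group over the control group."""
    control = purchase_rate(control_group)
    return (purchase_rate(test_group) - control) / control

# Illustrative one-month slice of data for a single product category.
test = [{"purchased": True}] * 30 + [{"purchased": False}] * 70
control = [{"purchased": True}] * 20 + [{"purchased": False}] * 80

print(f"control rate: {purchase_rate(control):.0%}")  # 20%
print(f"test rate:    {purchase_rate(test):.0%}")     # 30%
print(f"lift:         {lift(test, control):.0%}")     # 50%
```

A one-off test like this is easy to rerun on the next month of data, which is exactly the iterative loop the article describes.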

Shawndra Hill, The Wharton School, University of Pennsylvania

"We still have a ways to go to be able to combine evidence from different types of data sources, for example from text, social networks, and time series data," says Shawndra Hill of the Operations and Information Management Department at The Wharton School of the University of Pennsylvania. "The methods have not caught up yet with the scale and complexities of today's Big Data." She adds, "This is both exciting and scary. Exciting because there are a lot of new solutions to be generated, and scary because we are probably leaving a lot of value in databases, and that value may be harder to find as Big Data becomes even bigger data with even more complexity and noise."

Get Started
Fnalyst Mike +ualtieriD a principal analyst with .orrester >esearch in <ambridgeD Mass.D likes to cite a .orrester study that predicts that by %&')D ' billion people will ha$e smartphones and tabletsD \and that number will keep increasingD\ he says.

Mike Gualtieri, Forrester Research

"The more technology people use, the more data they generate, and the more opportunity there is to provide personal experiences," Gualtieri says. "The firms that make things personal will drive things in the future. The others will drop off."

To those who are on the fence, considering a Big Data project, Gualtieri has a simple piece of advice. "Don't sit this out," he urges. "This is real." Editor's Note: This is the sixth, and final, post in the ongoing series "Are You Ready for Big Data?" by DJ Denison. Download the complete "Are You Ready for Big Data" ebook to learn more about Big Data, its applications in creating the next-generation digital experience, and what it takes to get into the game.

"o 'earn /ig

ataH "a6e a 'esson &ro. *ports

One of the buzzwords of 2013 has got to be "big data." I was recently in a boardroom meeting where talk of leveraging "big data" was a popular favourite around the table. This made me smile. Over the last 20 years I've been involved in understanding how sport and entertainment can be used as a powerful brand communication and marketing platform. Many organisations and businesses have a lot to learn from sports rights owners, who've been exploiting data to gain competitive advantage for many years.

Whether it's soccer, F1, tennis, basketball or baseball, "big data" has been at the heart of developing a winning strategy in this industry. It starts from understanding the factors behind success and failure, assessing competitors' strengths and weaknesses and measuring the impact of tactical changes. In sports such as athletics, even the smallest of margins, measured in fractions of a second, can make a difference. Data analysis that goes to the heart of what drives performance has become a critical exercise. As Lord Sebastian Coe, who oversaw the biggest Olympic gold medal haul ever by Team GB at London 2012, remarked, "If you don't know why you failed, how can you improve? And if you don't know why you succeeded, it must be an accident."

This insight is very valuable for marketers seeking to improve their efforts and understand their setbacks, as sport can teach us how to use data to derive a competitive advantage. And I'm not talking about some after-dinner reminiscence of a sporting legend who's made a tenuous link between what happens inside and outside of the rugby scrum. Data analysis and data-driven thinking are fields in which athletes, players and teams have genuinely led the way. It's no coincidence that those who've enjoyed success on the field of play also happen to be at the forefront of data exploitation. Using insight and intelligence to make an informed judgment makes perfect sense. It just so happens that, culturally and technically, this has been quite challenging for many organisations. One of the issues is that many people don't use data to form an opinion, but rather to justify an opinion. This narrow view can actually hinder an effective decision-making process. The renowned economist Kenneth Galbraith once said: "Faced with the choice between changing one's mind and proving there's no need to do so, almost everyone gets busy on the proof." Many people I've come across in my career often see evidence as a threat to their own point of view or methods. Technical barriers shouldn't be underestimated either. Many organisations have lost faith in their data due to years of reports where the numbers just don't work, due to poorly integrated systems or not being around long enough to deliver real insight. In the absence of this robust approach, managers are forced to make 'seat of the pants' judgment calls. A recent article in the highly respected Harvard Business Review calls for a move away from developing strategies based on HiPPO (Highest Paid Person's Opinion) and towards data-driven decision-making. The evidence is that companies that do this are 5% more productive and 6% more profitable than their competitors. Meanwhile, on 28 July, Mercedes AMG F1 driver Lewis Hamilton proved the point when he clinched the Hungarian Grand Prix. "This was one of the most important wins of my career," he tweeted. Lewis Hamilton and his team ingest huge volumes of data during a race, taken via telemetry from his car, external weather feeds and historical patterns of performance, and process this to develop a real-time race strategy. In fact, this approach is so powerful that many F1 teams claim that by lap three they're able to predict the outcome of a race to a 90% degree of accuracy. With this big data at their disposal, there's no room for hunches.

In summary, marketers should learn to become higher-performing by following these seven simple steps:
Step #1: Have a vision and a plan for how to get there
Step #2: Build a model for what you're trying to measure and improve
Step #3: Integrate data so that you can see the whole picture
Step #4: Ensure good data quality
Step #5: Put data into the hands of the marketing team
Step #6: Have a broad funnel of ideas
Step #7: Use data-driven thinking to challenge received wisdom
Follow these steps and you're on your way to being a champion!

Small steps key to big data success


September 19, 2013

Big data analytics is widely seen as critical to the future of countless businesses. The technology has already been adopted by numerous firms, and there is little doubt that the solutions will become increasingly universal, as organizations realize that they cannot hope to remain competitive if they cannot take full advantage of their available unstructured and semistructured data resources. The only real question, then, is how firms can and should go about pursuing these goals. There are no clear-cut answers here, and many organizations have developed different strategies for utilizing big data analytics. Recently, Doug Henschen, writing for InformationWeek, spoke to a number of industry professionals concerning the nature of big data. These individuals offered a wide range of advice, and one of the most important lessons they highlighted was the need to approach big data analytics both carefully and creatively.

Stepping to success
The occasion for Henschen's interviews was the upcoming CIO Summit, to be held in New York in early October. Two of the experts Henschen spoke to, Gary Hoberman and Mark Lieberman, will speak at this event, and each offered compelling advice for maximizing big data value.

Lieberman, CEO at TiVo Research Analytics, emphasized the need for businesses to strive toward real-time performance with their big data efforts. He asserted that in the digital media space, automated action based on real-time insight will be the standard, and any organizations that have failed to keep up will struggle to remain relevant. To reach this goal, he told the news source, businesses should adopt a "crawl-walk-run" strategy. Rather than attempting to achieve real-time automation immediately, firms should slowly work to develop the processes necessary to achieve this goal. An effort to develop real-time functionality too quickly will almost certainly lead to inefficiencies, oversights and other missteps that compromise the utility of the big data resources. However, Hoberman asserted that businesses must also avoid locking themselves and their employees into overly rigid modes when developing big data efforts. This is particularly true for larger companies that likely have comprehensive policies in place, he told Henschen. "You have to think like a startup and act like a startup and just get it done," Hoberman said, according to the news source. "The questions about what methodology you use aren't as important as just believing that you can get it done." Instead of limiting employees' big data analytics development strategies, firms should encourage creativity and experimentation, Hoberman argued. Only by providing such free rein to workers can businesses hope to maximize their gains from their big data efforts, Henschen explained.

Tools necessary
As Hoberman emphasized, business decision-makers must support their employees' efforts to utilize big data analytics as much as possible. This includes not only encouraging development and allowing experimentation, but also providing the tools necessary to implement these efforts. A key factor in these regards is data integration.
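The data-integration idea behind change-data-capture tooling can be sketched with a toy pipeline: the source records every change in an ordered log, and a replica consumes that log from wherever it left off. This is a hedged, in-memory illustration only; real CDC products read the database's transaction log, and every class and field name here is invented.

```python
# Minimal sketch of change-data-capture-style propagation: a source
# table appends each change to a log, and a replica applies any log
# entries it has not yet consumed.

class SourceTable:
    def __init__(self):
        self.rows = {}
        self.change_log = []   # ordered list of (op, key, value) events

    def upsert(self, key, value):
        op = "update" if key in self.rows else "insert"
        self.rows[key] = value
        self.change_log.append((op, key, value))

    def delete(self, key):
        del self.rows[key]
        self.change_log.append(("delete", key, None))

class Replica:
    def __init__(self):
        self.rows = {}
        self.position = 0      # how far into the log we have consumed

    def sync(self, source):
        """Apply all changes recorded since the last sync."""
        for op, key, value in source.change_log[self.position:]:
            if op == "delete":
                self.rows.pop(key, None)
            else:
                self.rows[key] = value
        self.position = len(source.change_log)

source = SourceTable()
source.upsert("cust-1", {"segment": "retail"})
source.upsert("cust-2", {"segment": "wholesale"})

replica = Replica()
replica.sync(source)   # replica now matches the source
source.delete("cust-2")
replica.sync(source)   # the deletion propagates on the next sync
```

Because the replica tracks its own log position, syncing can run as often as needed, which is what makes near-real-time propagation possible.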
Without high-quality data integration solutions in place, personnel within an organization will not have sufficient information availability to optimize their use of big data analytics. Even more so, achieving real-time functionality is impossible if data is not instantly being integrated into a centralized, coherent data center. That is why businesses eager to make the most of their big data resources should consider investing in change data capture (CDC) solutions. CDC tools ensure that any changes made to one database are immediately reflected throughout the organization. This ensures that personnel have access to the most relevant, up-to-date information possible, thereby optimizing the potential of the firm's big data analytics efforts, whether pursuing real-time functionality or simply aiming to improve the quality of corporate decision-making.

Big Data's Arrival
February 1, 2012

By Paul Fain

New students are more likely to drop out of online colleges if they take full courseloads than if they enroll part time, according to findings from a research project that is challenging conventional wisdom about student success. But perhaps more important than that potentially game-changing nugget, researchers said, is how the project has chipped away at skepticism in higher education about the power of "big data." Researchers have created a database that measures 33 variables for the online coursework of 640,000 students: a whopping 3 million course-level records. While the work is far from complete, the variables help track student performance and retention across a broad range of demographic factors. The data can show what works at a specific type of institution, and what doesn't. That sort of predictive analytics has long been embraced by corporations, but not so much by the academy. The ongoing data-mining effort, which was kicked off last year with a $1 million grant from the Bill and Melinda Gates Foundation, is being led by WCET, the WICHE Cooperative for Educational Technologies.

Project Participants
American Public University System
Community College System of Colorado
Rio Salado College
University of Hawaii System
University of Illinois-Springfield
University of Phoenix

A broad range of institutions (see factbox) are participating. Six major for-profits, research universities and community colleges (the sort of group that doesn't always play nice) are sharing the vault of information and tips on how to put the data to work. "Having the University of Phoenix and American Public University, it's huge," said Dan Huston, coordinator of strategic systems at Rio Salado College, a participant.

According to early findings from the research, at-risk students do better if they ease into online education with a small number of courses, which flies in the face of the widely held belief in the benefits of full student immersion. "Each of the different institutions has a very different organizational structure for how they deliver courses," said Sebastián Díaz, the project's senior statistician and an associate professor of technology at West Virginia University. "What the data seem to suggest, however, is that for students who seem to have a high propensity of dropping out of an online course-based program, the fewer courses they take initially, the better off they are." That discovery warrants a rethinking of how to introduce students to college-level work, the researchers said. And the problem of too many concurrent courses may be worse for students who depend on financial aid. Students can only receive the maximum Pell Grant award when they take 12 credit hours, which "forces people into concurrency," said Phil Ice, vice president of research and development for the American Public University System and the project's lead investigator. "So the question becomes, is the current federal financial aid structure actually setting these individuals up for failure?" (This paragraph has been modified because of a factual error.)

Early Warning System
Most of the project's participants were already collecting sophisticated data about their students. But researchers said this research is on a different scale, as are its practical applications. The downside, however, is that the data sets are so detailed that analyzing them is far from over. "It's going to be a taxing process," Ice said. While results from the project, which is dubbed the Predictive Analytics Reporting Framework, are preliminary, participants are already putting them to use. Rio Salado, for example, has used the database to create a student performance tracking system.
The two-year college, which is based in Arizona, has a particularly strong online presence for a community college: 43,000 of its students are enrolled in online programs. The new tracking system allows instructors to see a red, yellow or green light for each student's performance. And students can see their own tracking lights. In January, Rio Salado turned on the switch for the system across 80 percent of its online courses, according to college officials. It measures student engagement through their Web interactions, how often they look at textbooks and whether they respond to feedback from instructors, all in addition to their performance on coursework.
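A traffic-light rule like the one described can be sketched as a weighted engagement score. The metrics, weights, and thresholds below are invented for illustration; Rio Salado's actual system combines Web interactions, textbook views, responses to instructor feedback, and coursework performance in ways this article does not detail.

```python
# Minimal sketch of a red/yellow/green early-warning rule. All
# thresholds and weights here are assumptions, not Rio Salado's.

def tracking_light(logins_per_week, pages_viewed, assignment_score):
    """Combine engagement metrics into a 0-1 score, then map the score
    to a traffic-light status for instructors and students."""
    # Normalize each metric against an assumed "healthy" level.
    engagement = min(logins_per_week / 5, 1.0)
    reading = min(pages_viewed / 50, 1.0)
    score = 0.3 * engagement + 0.2 * reading + 0.5 * assignment_score
    if score >= 0.7:
        return "green"
    if score >= 0.4:
        return "yellow"
    return "red"
```

For example, under these assumed weights a student logging in five times a week with strong coursework shows green, while one with a single weekly login and failing scores shows red.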

Michael Cottam, Rio Salado's associate dean of instruction, said the college would not have been able to track students with the same detail and real-time speed without the research from the WCET project. "Previous to now, we haven't had the data," Cottam said. "This gives you something tangible, something real."

Match.com for Higher Ed?

The data set has the potential to give institutions sophisticated information about small subsets of students, such as which academic programs are best suited for a 25-year-old male Latino with strength in mathematics, for example. The tool could even become a sort of Match.com for students and online universities, Ice said. That application is nowhere near to being a reality, in part because institutions are loath to share competitive information with each other, or the general public. But researchers said the project will almost certainly help other colleges follow Rio Salado's lead in using predictive analytics to help design better academic programs. "If institutions of higher education did more of this type of analytics," Díaz said, they could tell their prospective students: "Look, these are the kinds of students who tend to have more success at our institution."

The project appears to have built support in higher education for the broader use of Wall Street-style slicing and dicing of data. Colleges have resisted those practices in the past, perhaps because some educators have viewed "data snooping" warily. That may be changing, observers said, as the project is showing that big data isn't just good for hedge funds. And the researchers have already achieved one of their primary goals, which was to prove they could create such a large, workable database. In addition to studying the data, the project's leaders hope to begin a second round soon, maybe adding up to 18 new institutions. What comes next depends on how colleges use what they learn.
Going public with hard facts gleaned from such a database, Díaz said, could help students and their parents make better decisions. "Rather than just going on rankings done by a particular news agency," he said, they could "really look at tailoring which institution provides the best fit for a particular individual student."

Steve Kolowich contributed reporting.

30 big data project takeaways






Summary: Here's a look at the big data lessons learned in the field from a bevy of technology execs.

By Larry Dignan for Between the Lines | October 10, 2012 -- 10:00 GMT (15:30 IST)

Technology executives are hopping on the big data bandwagon at a rapid clip as they run Hadoop pilots, eye internal information streams and struggle to find talent. Here's a look at 30 big data takeaways over the last two weeks via a conference at Temple University as well as ZDNet's TechLines roundtable discussion last week.

1. Where do you start a big data project? Skunk works projects were a popular route, and then those groups evolved to become dozens of employees and petabytes of data. Other options included the underserved business unit. Some companies had business leaders as sponsors.

2. Leaders will have to take a few chances on big data projects. Translation: trust your people, spend some money and take the leap.

3. Use cases for big data abound. Among the possibilities:
o Network optimization.
o Fraud detection.
o Seeing what the customer experiences.
o Healthcare simulations.
o Consumer-focused marketing efforts, which require more social networking analysis and predictive capabilities. Consumer data is inherently unstructured.
o Travel and expense management to make intelligent decisions about costs. For instance, with aggregated data across 200,000 employees, a company could notice it is sending too many people to one conference.

o Marketing support and tracking of attrition rates in a subscriber-based business.
o Closer ties between partners and suppliers via collaborative data and insight sharing.
o Christine Twiford, Manager, Network Technology Solutions at T-Mobile, said analytics gave the wireless provider confidence that it could offer an unlimited data plan without crushing the network.

4. Analytics and business intelligence are bridging into big data applications. Historical data from years back has been usable, said Michael Cavaretta, Technical Leader, Predictive Analytics & Data Mining at Ford. In the future, Cavaretta said, Ford will focus on data from the vehicle, but the real win may be the stream of information through the manufacturing process.

5. The big data Petri dish will be the healthcare industry. "There's a lot of incentive out there to use big data to improve healthcare," said Katrina Montinola, Vice President of Engineering at Archimedes.

6. Facebook is another big data Petri dish. Facebook could use big data techniques to make more money, while treading carefully on privacy. Conversely, Facebook is a huge data set by definition. After all, one billion users are sharing gobs of data. Facebook data could "provide an X-ray view" of what's going on in a customer's head. Companies could optimize that data to improve experience. Montinola said that Facebook would provide an ideal population for clinical trials. Skytland said Facebook could be "an amazing platform for collective action."

7. "Big data is the oil of the information age," said Nicholas Skytland, Program Manager, Open Government Initiative.

8. Shared analytics services are commonly used as a way to harness big data and blend in predictive techniques.

9. Storage will be an ongoing big data issue because data scientists are pack rats, even hoarders, but there's a budget limit.
T-Mobile can only keep 10 days of its clickstream data, said Twiford, who noted the company is trying to process more information in flight. Storage limitations will result in sampling.

10. As for data sampling, data scientists will ultimately make the call on what information is hoarded and what's sampled.

11. Data scientists will be in high demand and serve as investigators who test hypotheses. Data scientists will be paired with business domain experts. What's unclear is how many of these data wonks you need. In many respects, we'll all be data scientists to some degree, or at least data literate. Twiford said there's a talent challenge. There's also a challenge in recruiting big data talent, and companies should look beyond Silicon Valley.

12. Big data talent is tough to find. One company appointed internal people with business knowledge and supplemented them with a partner who had statistics and analytics wonks available (consultants). The long-term talent strategy for this company is to recruit heavily from universities to build an analytic employee pool. Talent has to be able to use data.

13. Visualization tools and crowdsourcing may alleviate the big data talent crunch, said Skytland. Perhaps "citizen scientists" will bridge the gap. Visualization tools can bring big data to the masses.

14. Universities and retraining will also bridge the big data talent gap.

15. Too much time is being spent preparing big data and not enough actually analyzing it. Discovery and decision-making are being short-changed for preparation. Data preparation should be automated.

16. When pitching big data to business leaders, you need to start with this question: what business questions need to be answered?

17. Most corporate big data projects are in their infancy. As a result, many are looking to combine data warehouse information with other data to be prescriptive. One company was looking to build a data warehouse on steroids.

18. Partner with companies that can provide visualization tools via APIs. Of course, you have to liberate your data and open it up first, said Skytland.

19. NASA is planning missions that will collect 24 terabytes of data a day. "We want to make sense of that data and actually navigate it," said Skytland.

20. There are thousands of silos in corporate America, and sharing data is the biggest challenge. Big data could be a way to bridge those corporate silos.

21. Big data applications are rolling out first for business-to-consumer questions because they tie together experience, sales and analytics. Social media and multiple channels also mean that companies need to look for patterns in streaming data, said James Kobielus, IBM's big data evangelist.

22. Hadoop clusters are surfacing everywhere in corporate America. If 2012 was the year of enterprise Hadoop pilots, 2013 will be a ramp-up of usage.

23. NASA initially created its own big data systems, but is using more commercial applications, ranging from Amazon Web Services to cloud infrastructure.

24. Big data isn't new, but it has now reached critical mass as people digitize their lives. "People are walking sensors," said Skytland.

25. Social media is hyped in big data applications, but the diary of consumers' lives is great market intelligence. Chief marketing officers are pushing social media and big data projects.
Cavaretta said Ford is using social data because it goes beyond what consumers provide in surveys and "represents what they are thinking."

26. IT practitioners said that they wanted the largest data sets possible. The idea is that companies wouldn't have to rely on samples. However, there's a business challenge in determining what information is worth keeping and what should head to the archive or be tossed.

27. Making archived data usable for big data projects is going to be a running challenge.

28. Governments and their ability to provide datasets can create entire industries. Under this theory, governments will essentially be data providers as one of their primary functions.

29. Twiford said that T-Mobile is using big data techniques to learn more about the preferences of no-contract customers, who don't offer as much profile information as contract ones.

30. Data analytics as a service and data visualization as a service will become commonplace. Third-party vendors will move toward big data as a service to make it consumable for the masses. The tech vendors to go this route are likely the big market share leaders of today (IBM, SAP, Oracle, Salesforce.com).
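The storage-and-sampling point above (T-Mobile keeping only 10 days of clickstream data, with storage limits forcing sampling) has a classic algorithmic answer. Reservoir sampling keeps a fixed-size uniform random sample of a stream of unknown length using memory proportional only to the sample size. This is a generic textbook sketch, not any vendor's actual pipeline:

```python
# Reservoir sampling (Algorithm R): maintain a uniform random sample
# of k items from a stream without storing the whole stream.
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items drawn uniformly at random from an iterable,
    using O(k) memory regardless of stream length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# E.g., retain a 100-event sample of a simulated clickstream.
sample = reservoir_sample(range(100_000), k=100)
```

The appeal for clickstream pipelines is that the sample is always ready: the stream can be truncated at any point and the reservoir is still a uniform sample of everything seen so far.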
