Você está na página 1de 6

11/6/2015

16analyticdisciplinescomparedtodatascienceDataScienceCentral

16analyticdisciplinescomparedtodatascience
PostedbyVincentGranvilleonJuly24,2014at7:00pm

ViewBlog

Whatarethedifferencesbetweendatascience,datamining,machinelearning,statistics,operations
research,andsoon?
HereIcompareseveralanalyticdisciplinesthatoverlap,toexplainthedifferencesandcommon
denominators.Sometimesdifferencesexistfornothingelseotherthanhistoricalreasons.Sometimesthe
differencesarerealandsubtle.Ialsoprovidetypicaljobtitles,typesofanalyses,andindustries
traditionallyattachedtoeachdiscipline.Underlineddomainsaremainsubdomains.Itwouldbegreatif
someonecanaddanhistoricalperspectivetomyarticle.

Sourceforthepicture
DataScience
First,let'sstartbydescribingdatascience,thenewdiscipline.
Jobtitlesincludedatascientist,chiefscientist,senioranalyst,directorofanalyticsandmanymore.It
coversallindustriesandfields,butespeciallydigitalanalytics,searchtechnology,marketing,fraud
detection,astronomy,energy,healhcare,socialnetworks,finance,forensics,security(NSA),mobile,
telecommunications,weatherforecasts,andfrauddetection.
Projectsincludetaxonomycreation(textmining,bigdata),clusteringappliedtobigdatasets,
recommendationengines,simulations,rulesystemsforstatisticalscoringengines,rootcause
analysis,automatedbidding,forensics,exoplanetsdetection,andearlydetectionofterroristactivityor
pandemics,Animportantcomponentofdatascienceisautomation,machinetomachine
communications,aswellasalgorithmsrunningnonstopinproductionmode(sometimesinrealtime),for
instancetodetectfraud,predictweatherorpredicthomepricesforeachhome(Zillow).
AnexampleofdatascienceprojectisthecreationofthefastestgrowingdatascienceTwitterprofile,for
computationalmarketing.Itleveragesbigdata,andispartofaviralmarketing/growthhackingstrategy
thatalsoincludesautomatedhighquality,relevant,syndicatedcontentgeneration(inshort,digital
publishingversion3.0).
Unlikemostotheranalyticprofessions,datascientistsareassumedtohavegreatbusinessacumenand
domainexpertizeoneofthereasonswhytheytendtosucceedasentrepreneurs.Therearemany
typesofdatascientists,asdatascienceisabroaddiscipline.Manyseniordatascientistsmastertheir
art/craftsmanshipandpossessthewholespectrumofskillsandknowledgetheyreallyaretheunicorns
thatrecruiterscan'tfind.Hiringmanagersanduninformedexecutivesfavornarrowtechnicalskillsover
combineddeep,broadandspecializedbusinessdomainexpertizeabyproductofthecurrenteducation 1/6
data:text/htmlcharset=utf8,%3Cdiv%20class%3D%22xg_headline%20xg_headlineimg%20xg_headline2l%22%20style%3D%22margin%3A%200px%3B%2

11/6/2015

16analyticdisciplinescomparedtodatascienceDataScienceCentral

combineddeep,broadandspecializedbusinessdomainexpertizeabyproductofthecurrenteducation
systemthatfavorsdisciplinesilos,whiletruedatascienceisasilodestructor.Unicorndatascientists(a
misnomer,becausetheyarenotraresomearefamousVC's)usuallyworkasconsultants,oras
executives.Juniordatascientiststendtobemorespecializedinoneaspectofdatascience,possess
morehottechnicalskills(Hadoop,Pig,Cassandra)andwillhavenoproblemsfindingajobifthey
receivedappropriatetrainingand/orhaveworkexperiencewithcompaniessuchasFacebook,Google,
eBay,Apple,Intel,Twitter,Amazon,Zillowetc.Datascienceprojectsforpotentialcandidatescanbe
foundhere.
Datascienceoverlapswith
Computerscience:computationalcomplexity,Internettopologyandgraphtheory,distributed
architecturessuchasHadoop,dataplumbing(optimizationofdataflowsandinmemoryanalytics),
datacompression,computerprogramming(Python,Perl,R)andprocessingsensorandstreaming
data(todesigncarsthatdriveautomatically)
Statistics:designofexperimentsincludingmultivariatetesting,crossvalidation,stochastic
processes,sampling,modelfreeconfidenceintervals,butnotpvaluenorobscuretestsof
thypothesesthataresubjectstothecurseofbigdata
Machinelearninganddatamining:datascienceindeedfullyencompassesthesetwodomains.
Operationsresearch:datascienceencompassesmostofoperationsresearchaswellasany
techniquesaimedatoptimizingdecisionsbasedonanalysingdata.
Businessintelligence:everyBIaspectofdesigning/creating/identifyinggreatmetricsandKPI's,
creatingdatabaseschemas(beitNoSQLornot),dashboarddesignandvisuals,anddatadriven
strategiestooptimizedecisionsandROI,isdatascience.
Comparisonwithotheranalyticdiscplines
Machinelearning:Verypopularcomputersciencediscipline,dataintensive,partofdatascience
andcloselyrelatedtodatamining.Machinelearningisaboutdesigningalgorithms(likedata
mining),butemphasisisonprototypingalgorithmsforproductionmode,anddesigningautomated
systems(biddingalgorithms,adtargetingalgorithms)thatautomaticallyupdate
themselves,constantlytrain/retrain/updatetrainingsets/crossvalidate,andrefineordiscovernew
rules(frauddetection)onadailybasis.PythonisnowapopularlanguageforMLdevelopment.
Corealgorithmsincludeclusteringandsupervisedclassification,rulesystems,andscoring
techniques.Asubdomain,closetoArtificialIntelligence(seeentrybelow)isdeeplearning.
Datamining:Thisdisciplineisaboutdesigningalgorithmstoextractinsightsfromratherlargeand
potentiallyunstructureddata(textmining),sometimescallednuggetdiscovery,forinstance
unearthingamassiveBotnetsafterlookingat50millionrowsofdata.Techniquesincludepattern
recognition,featureselection,clustering,supervisedclassificationandencompassesafew
statisticaltechniques(thoughwithoutthepvaluesorconfidenceintervalsattachedtomost
statisticalmethodsbeingused).Instead,emphasisisonrobust,datadriven,scalabletechniques,
withoutmuchinterestindiscoveringcausesorinterpretability.Dataminingthushavesome
intersectionwithstatistics,anditisasubsetofdatascience.Dataminingisappliedcomputer
engineering,ratherthanamathematicalscience.Dataminersuseopensourceandsoftwaresuch
asRapidMiner.
Predictivemodeling:Notadisciplineperse.Predictivemodelingprojectsoccurinallindustries
acrossalldisciplines.Predictivemodelingapplicationsaimatpredictingfuturebasedonpastdata,
usuallybutnotalwaysbasedonstatisticalmodeling.Predictionsoftencomewithconfidence
intervals.Rootsofpredictivemodelingareinstatisticalscience.
Statistics.Currently,statisticsismostlyaboutsurveys(typicallyperformedwithSPSSsoftware),
theoreticalacademicresearch,bankandinsuranceanalytics(marketingmixoptimization,cross
selling,frauddetection,usuallywithSASandR),statisticalprogramming,socialsciences,global
warmingresearch(andspaceweathermodeling),economicresearch,clinicaltrials
(pharmaceuticalindustry),medicalstatistics,epidemiology,biostatistics.andgovernmentstatistics.
AgencieshiringstatisticiansincludetheCensusBureau,IRS,CDC,EPA,BLS,SEC,andEPA
(environmental/spatialstatistics).Jobsrequiringasecurityclearancearewellpaidandrelatively
secure,butthewellpaidjobsinthepharmaceuticalindustry(thegoldengooseforstatisticians)are
threatenedbyanumberoffactorsoutsourcing,companymergings,andpressurestomake
data:text/htmlcharset=utf8,%3Cdiv%20class%3D%22xg_headline%20xg_headlineimg%20xg_headline2l%22%20style%3D%22margin%3A%200px%3B%2

2/6

11/6/2015

16analyticdisciplinescomparedtodatascienceDataScienceCentral

threatenedbyanumberoffactorsoutsourcing,companymergings,andpressurestomake
healthcareaffordable.Becauseofthebiginfluenceoftheconservative,riskadverse
pharmaceuticalindustry,statisticshasbecomeanarrowfieldnotadaptingtonewdata,andnot
innovating,loosinggroundtodatascience,industrialstatistics,operationsresearch,datamining,
machinelearningwherethesameclustering,crossvalidationandstatisticaltrainingtechniques
areused,albeitinamoreautomatedwayandonbiggerdata.Manyprofessionalswhowerecalled
statisticians10yearsago,haveseentheirjobtitlechangedtodatascientistoranalystinthelast
fewyears.Modernsubdomainsincludestatisticalcomputing,statisticallearning(closertomachine
learning),computationalstatistics(closertodatascience),datadriven(modelfree)inference,sport
statistics,andBayesianstatistics(MCMC,BayesiannetworksandhierarchicalBayesian
modelsbeingpopular,moderntechniques).OthernewtechniquesincludeSVM,structuralequation
modeling,predictingelectionresults,andensemblemodels.
Industrialstatistics.Statisticsfrequentlyperformedbynonstatisticians(engineerswithgood
statisticaltraining),workingonengineeringprojectssuchasyieldoptimizationorloadbalancing
(systemanalysts).Theyuseveryappliedstatistics,andtheirframeworkisclosertosix
sigma,qualitycontrolandoperationsresearch,thantotraditionalstatistics.Alsofoundinoiland
manufacturingindustries.Techniquesusedincludetimeseries,ANOVA,experimentaldesign,
survivalanalysis,signalprocessing(filtering,noiseremoval,deconvolution),spatialmodels,
simulation,Markovchains,riskandreliabilitymodels.
Mathematicaloptimization.Solvesbusinessoptimizationproblemswithtechniquessuchasthe
simplexalgorithm,Fouriertransforms(signalprocessing),differentialequations,andsoftwaresuch
asMatlab.TheseappliedmathematiciansarefoundinbigcompaniessuchasIBM,researchlabs,
NSA(cryptography)andinthefinanceindustry(sometimesrecruitingphysicsorengineer
graduates).Theseprofessionalssometimessolvetheexactsameproblemsasstatisticiansdo,
usingtheexactsametechniques,thoughtheyusedifferentnames.Mathematiciansuseleast
squareoptimizationforinterpolationorextrapolationstatisticiansuselinearregressionfor
predictionsandmodelfitting,butbothconceptsareidentical,andrelyontheexactsame
mathematicalmachinery:it'sjusttwonamesdescribingthesamething.Mathematicaloptimization
ishoweverclosertooperationsresearchthanstatistics,thechoiceofhiringamathematicianrather
thananotherpractitioner(datascientist)isoftendictatedbyhistoricalreasons,especiallyfor
organizationssuchasNSAorIBM.
Actuarialsciences.Justasubsetofstatisticsfocusingoninsurance(car,health,etc.)using
survivalmodels:predictingwhenyouwilldie,whatyourhealthexpenditureswillbebasedonyour
healthstatus(smoker,gender,previousdiseases)todetermineyourinsurancepremiums.Also
predictsextremefloodsandweathereventstodeterminepremiums.Theselattermodelsare
notoriouslyerroneous(recently)andhaveresultedinfarbiggerpayoutsthanexpected.Forsome
reasons,thisisaveryvibrant,secretivecommunityofstatisticians,thatdonotcallthemselves
statisticiansanymore(jobtitleisactuary).Theyhaveseentheiraveragesalaryincreasenicelyover
time:accesstoprofessionisrestrictedandregulatedjustlikeforlawyers,fornootherreasonsthan
protectionismtoboostsalariesandreducethenumberofqualifiedapplicantstojobopenings.
Actuarialsciencesisindeeddatascience(asubdomain).
HPC.Highperformancecomputing,notadisciplineperse,butshouldbeofconcerntodata
scientists,bigdatapractitioners,computerscientistsandmathematicians,asitcanredefinethe
computingparadigmsinthesefields.Ifquantumcomputingeverbecomessuccessful,itwilltotally
changethewayalgorithmsaredesignedandimplemented.HPCshouldnotbeconfusedwith
HadoopandMapReduce:HPCishardwarerelated,Hadoopissoftwarerelated(thoughheavily
relyingonInternetbandwidthandserversconfigurationandproximity).
Operationsresearch.AbbreviatedasOR.Theyseparatedfromstatisticsawhileback(like20
yearsago),buttheyareliketwinbrothers,andtheirrespectiveorganizations(INFORMSandASA)
partnertogether.ORisaboutdecisionscienceandoptimizingtraditionalbusinessprojects:
inventorymanagement,supplychain,pricing.TheyheavilyuseMarkovChainmodels,Monter
Carlosimulations,queuingandgraphtheory,andsoftwaresuchasAIMS,MatlaborInformatica.
Big,traditionaloldcompaniesuseOR,newandsmallones(startups)usedatasciencetohandle
data:text/htmlcharset=utf8,%3Cdiv%20class%3D%22xg_headline%20xg_headlineimg%20xg_headline2l%22%20style%3D%22margin%3A%200px%3B%2

3/6

11/6/2015

16analyticdisciplinescomparedtodatascienceDataScienceCentral

Big,traditionaloldcompaniesuseOR,newandsmallones(startups)usedatasciencetohandle
pricing,inventorymanagementorsupplychainproblems.Manyoperationsresearchanalystsare
becomingdatascientists,asthereisfarmoreinnovationandthusgrowthprospectindatascience,
comparedtoOR.Also,ORproblemscanbesolvedbydatascience.ORhasasiginficantoverlap
withsixsigma(seebelow),alsosolveseconometricproblems,andhasmany
practitioners/applicationsinthearmyanddefensesectors.cartrafficoptimizationisamodern
exampleofORproblem,solvedwithsimulations,commutersurveys,sensordataandstatistical
modeling.
Sixsigma.It'smoreawayofthinking(abusinessphilosophy,ifnotacult)ratherthanadiscipline,
andwasheavilypromotedbyMotorolaandGEafewdecadesago.Usedforqualitycontrolandto
optimizeengineeringprocesses(seeentryonindustrialstatisticsinthisarticle),bylarge,traditional
companies.TheyhaveaLinkedIngroupwith270,000members,twiceaslargeasanyother
analyticLinkedIngroupsincludingourdatasciencegroup.Theirmottoissimple:focusyourefforts
onthe20%ofyourtimethatyields80%ofthevalue.Applied,simplestatisticsareused(simple
stuffworksmustofthetime,Iagree),andtheideaistoeliminatesourcesofvariancesinbusiness
processes,tomakethemmorepredictableandimprovequality.Manypeopleconsidersixsigmato
beoldstuffthatwilldisappear.Perhaps,butthefondamentalconceptsaresolidandwillremain:
thesearealsofundamentalconceptsforalldatascientists.Youcouldsaythatsixsigmaisamuch
moresimpleifnotsimplisticversionofoperationsresearch(seeaboveentry),wherestatistical
modelingiskepttoaminimum.Risks:nonqualifiedpeopleusenonrobustblackboxstatistical
toolstosolveproblems,itcanresultindisasters.Insomeways,sixsigmaisadisciplinemore
suitedforbusinessanalysts(seebusinessintelligenceentrybelow)thanforseriousstatisticians.
Quant.QuantpeoplearejustdatascientistsworkingforWallStreetonproblemssuchashigh
frequencytradingorstockmarketarbitraging.TheyuseC++,Matlab,andcomefromprestigious
universities,earnbigbucksbutlosetheirjobrightawaywhenROIgoestooSouthtooquickly.They
canalsobeemployedinenergytrading.Manywhowerefiredduringthegreatrecessionnowwork
onproblemssuchasclickarbitraging,adoptimizationandkeywordbidding.Quantshave
backgroundsinstatistics(fewofthem),mathematicaloptimization,andindustrialstatistics.
Artificialintelligence.It'scomingback.Theintersectionwithdatascienceispatternrecognition
(imageanalysis)andthedesignofautomated(somewouldsayintelligent)systemstoperform
varioustasks,inmachinetomachinecommunicationmode,suchasidentifyingtherightkeywords
(andrightbid)onGoogleAdWords(payperclickcampaignsinvolvingmillionsofkeywordsper
day).Ialsoconsidersmartsearch(creatingasearchenginereturningtheresultsthatyouexpect
andbeingmuchbroaderthanGoogle)oneofthegreatestproblemsindatascience,arguablyalso
anAIandmachinelearningproblem.AnoldAItechniqueisneuralnetworks,butitisnowloosing
popularity.Tothecontrary,neuroscienceisgainingpopularity.
Computerscience.Datasciencehassomeoverlapwithcomputerscience:HadoopandMap
Reduceimplementations,algorithmicandcomputationalcomplexitytodesignfast,scalable
algorithms,dataplumbing,andproblemssuchasInternettopologymapping,randomnumber
generation,encryption,datacompression,andsteganography(thoughtheseproblemsoverlap
withstatisticalscienceandmathematicaloptimizationaswell).
Econometrics.Whyitbecameseparatedfromstatisticsisunclear.Somanybranches
disconnectedthemselvesfromstatistics,astheybecamelessgenericandstartdevelopingtheir
ownadhoctools.Butinshort,econometricsisheavilystatisticalinnature,usingtimeseriesmodels
suchasautoregressiveprocesses.Alsooverlappingwithoperationsresearch(itselfoverlaping
withstatistics!)andmathematicaloptimization(simplexalgorithm).EconometricianslikeROCand
efficiencycurves(sodosixsigmapractitioners,seecorrespondingentryinthisarticle).Manydo
nothaveastrongstatisticalbackground,andExcelistheirmainoronlytool.
Dataengineering.Performedbysoftwareengineers(developers)orarchitects(designers)inlarge
organizations(sometimesbydatascientistsintinycompanies),thisistheappliedpartofcomputer
science(seeentryinthisarticle),topowersystemsthatallowallsortsofdatatobeeasily
data:text/htmlcharset=utf8,%3Cdiv%20class%3D%22xg_headline%20xg_headlineimg%20xg_headline2l%22%20style%3D%22margin%3A%200px%3B%2

4/6

11/6/2015

16analyticdisciplinescomparedtodatascienceDataScienceCentral

science(seeentryinthisarticle),topowersystemsthatallowallsortsofdatatobeeasily
processedinmemoryornearmemory,andtoflownicelyto(andbetween)endusers,including
heavydataconsumerssuchasdatascientists.Asubdomaincurrentlyunderattackisdata
warehousing,asthistermisassociatedwithstatic,siloedconventationaldatabases,data
architectures,anddataflows,threatenedbytheriseofNoSQL,NewSQLandgraphdatabases.
Transformingtheseoldarchitecturesintonewones(onlywhenneeded)ormakethemcompatible
withnewones,isalucrativebusiness.
Businessintelligence.AbbreviatedasBI.Focusesondashboardcreation,metricselection,
producingandschedulingdatareports(statisticalsummaries)sentbyemailordelivered/presented
toexecutives,competitiveintelligence(analyzingthirdpartydata),aswellasinvolvementin
databaseschemadesign(workingwithdataarchitects)tocollectuseful,actionablebusinessdata
efficiently.Typicaljobtitleisbusinessanalyst,butsomearemoreinvolvedwithmarketing,product
orfinance(forecastingsalesandrevenue).TheytypicallyhaveanMBAdegree.Somehave
learnedadvancedstatisticssuchastimeseries,butmostonlyuse(andneed)basicstats,andlight
analytics,relyingonITtomaintaindatabasesandharvestdata.TheyusetoolssuchasExcel
(includingcubesandpivottables,butnotadvancedanalytics),Brio(Oraclebrowserclient),Birt,
MicroSreategyorBusinessObjects(asenduserstorunqueries),thoughsomeofthesetoolsare
increasinglyequippedwithbetteranalyticcapabilities.Unlesstheylearnhowtocode,theyare
competingwithsomepolyvalentdatascientiststhatexcelindecisionscience,insightsextraction
andpresentation(visualization),KPIdesign,businessconsulting,andROI/yield/business/process
optimization.BIandmarketresearch(butnotcompetitiveintelligence)arecurrentlyexperiencinga
decline,whileAIisexperiencingacomeback.Thiscouldbecyclical.Partofthedeclineisdueto
notadaptingtonewtypesofdata(e.g.unstructuredtext)thatrequireengineeringordatascience
techniquestoprocessandextractvalue.
Dataanalysis.Thisisthenewtermforbusinessstatisticssinceatleast1995,anditcoversalarge
spectrumofapplicationsincludingfrauddetection,advertisingmixmodeling,attributionmodeling,
salesforecasts,crosssellingoptimization(retails),usersegmentation,churnanalysis,computing
longtimevalueofacustomerandcostofacquisition,andsoon.Exceptinbigcompanies,data
analystisajuniorrolethesepractitionershaveamuchmorenarrowknwoledgeandexperience
thandatascientists,andtheylack(anddon'tneed)businessvision.Theyaredetailorirentedand
reporttomanagerssuchasdatascientistsordirectorofanalytics,Inbigcompanies,someonewith
ajobtitlesuchasdataanalystIIImightbeverysenior,yettheyusuallyarespecializedandlackthe
broadknowledgegainedbydatascientistsworkinginavarietyofcompanieslargeandsmall.
Businessanalytics.Sameasdataanalysis,butrestrictedtobusinessproblemsonly.Tendsto
haveabitmoreofafinacial,marketingorROIflavor.Popularjobtitlesincludedataanalystand
datascientist,butnotbusinessanalyst(seebusinessintelligenceentryforbusinessintelligence,a
differentdomain).

Finally,therearemorespecializedanalyticdisciplinesthatrecentlyemerged:healthanalytics,
data:text/htmlcharset=utf8,%3Cdiv%20class%3D%22xg_headline%20xg_headlineimg%20xg_headline2l%22%20style%3D%22margin%3A%200px%3B%2

5/6

11/6/2015

16analyticdisciplinescomparedtodatascienceDataScienceCentral

Finally,therearemorespecializedanalyticdisciplinesthatrecentlyemerged:healthanalytics,
computationalchemistryandbioinformatics(genomeresearch),forinstance.

data:text/htmlcharset=utf8,%3Cdiv%20class%3D%22xg_headline%20xg_headlineimg%20xg_headline2l%22%20style%3D%22margin%3A%200px%3B%2

6/6

Você também pode gostar