CZ4032 CPE489 CSC489
Data Analytics and Mining
[Data Mining]
Outline
Motivation of Data Mining
Evolution of Data Mining
Definitions of Data Mining and KDD
Data Sources
Data Mining Tasks
Summary
Motivation: The Age of Big Data
Motivation: Why Mine Data?
Explosive Growth of Data: from Terabytes to Petabytes
Data collection and data storage technology
Automated data collection tools, database systems, Web, computerized society, mobile devices, social networks, etc.
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, ...
Science: Remote sensing, bioinformatics, scientific simulation, ...
Society and everyone: news, digital cameras, social media, ...
Motivation: Why Mine Data?
Example:
Facebook
800 million active users
60 billion photos in total, 250 million photos uploaded per day
80 groups/events per user (till Feb 2011)
Flickr
60 million users
Five billion photos
10 million groups (till Feb 2011)
Twitter
175 million users (registered)
140 million tweets per day
Weibo
200 million users
(till June 2011)
We are drowning in data, but starving for knowledge!
Necessity is the mother of invention
Data Mining: Automated analysis of massive data sets
Motivation: Why Mine Data?
From Commercial Viewpoint
Lots of data are being collected and warehoused
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Social networks
Computers have become cheaper and more powerful
Competitive pressure is strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Motivation: Why Mine Data?
From Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
Remote sensors on a satellite
Telescopes scanning the skies
Microarrays generating gene expression data in biology
Scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
in classifying and segmenting data
in hypothesis formation
Evolution: from DB to DM to DS
Data collection
Database creation
1960s: From primitive data collection systems to sophisticated and powerful database systems (navigational DBMS)
Data management
Advanced data analysis
1970s: From early hierarchical and network models to relational models (OLTP), SQL DBMS
1980s: Further research into object-oriented database systems, Internet, application-oriented, etc.
1990s: Cheaper hardware, advanced DBMS, Data Warehouse, OLAP (On-Line Analytical Processing)
Late 1990s/present: Data-rich but information-poor situation required powerful analytical tools, which motivates Data Mining technology
Data Science
Knowledge Discovery in Databases
Definitions: KDD & Data Mining
KDD (Knowledge Discovery in Databases)
The overall process of non-trivial extraction of implicit, previously unknown and potentially useful knowledge from large amounts of data
KDD also stands for Knowledge Discovery and Data Mining
Data Mining: A KDD Process
Data Mining: The core step of KDD
Application of specific algorithms for extracting patterns from data
What is (not) Data Mining?
What is NOT Data Mining?
Look up a phone number in a phone directory
Query a Web search engine for information about Amazon
What is Data Mining?
Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in Boston area)
Group together similar documents returned by a search engine according to their content/context (e.g. Amazon rainforest, Amazon.com, ...)
Origins of Data Mining
Draws ideas from statistics/AI, machine learning/pattern recognition, and database systems, etc.
Traditional techniques may be unsuitable due to
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
Why not use classical data analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle massive data, such as terabytes of data
High dimensionality of data
E.g., microarray data may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Major Steps of Data Mining (KDD)
Input data -> Data Preprocessing -> Data Mining -> Postprocessing -> Knowledge
1. Data Preprocessing
A. Data Integration
Combine multiple data sources
B. Data Cleaning
Remove noise and inconsistent data
C. Data Selection
Select task-relevant data
D. Data Transformation
Transform/consolidate selected data for further analysis
Major Steps of Data Mining (KDD)
2. Data Mining
Apply data mining & machine learning methods (e.g., classification/clustering) to extract patterns from data
3. Pattern Evaluation (Post-Processing)
Evaluate and identify truly interesting patterns
4. Visualization (Post-Processing)
Present the mined patterns to users
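The four major steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two record lists, the field names and the helper function names are all hypothetical, and the "mining" step is deliberately trivial (counting how many customers bought each item).

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative data only).
store_sales = [
    {"customer": "a", "item": "milk", "amount": 2.5},
    {"customer": "b", "item": "beer", "amount": None},  # missing value (noise)
    {"customer": "a", "item": "milk", "amount": 2.5},   # exact duplicate
]
web_sales = [
    {"customer": "c", "item": "Milk ", "amount": 3.0},  # inconsistent spelling
    {"customer": "b", "item": "chips", "amount": 1.2},
]

def integrate(*sources):
    """1A. Data integration: combine multiple data sources."""
    return [dict(record) for source in sources for record in source]

def clean(records):
    """1B. Data cleaning: drop incomplete rows, normalise names, drop duplicates."""
    seen, cleaned = set(), []
    for r in records:
        if r["amount"] is None:
            continue
        r = {**r, "item": r["item"].strip().lower()}
        key = (r["customer"], r["item"], r["amount"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(r)
    return cleaned

def select(records, fields):
    """1C. Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """1D. Data transformation: consolidate into one item list per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], []).append(r["item"])
    return baskets

def mine(baskets):
    """2. Data mining: count in how many baskets each item appears."""
    return Counter(item for items in baskets.values() for item in set(items))

def evaluate(counts, min_count):
    """3. Pattern evaluation: keep only sufficiently frequent items."""
    return {item: n for item, n in counts.items() if n >= min_count}

records = integrate(store_sales, web_sales)
baskets = transform(select(clean(records), ["customer", "item"]))
patterns = evaluate(mine(baskets), min_count=2)
print(patterns)  # 4. Visualization: present the mined patterns to the user
```

In a real system each stage would be far richer; the point is only that every step consumes the output of the previous one.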
The Architecture of a Typical Data Mining System
User
Visualization
Post-Processing: Interesting Patterns
Data Mining Engine
Data Preprocessing (Data Warehouse, DB Server)
Cleaning / Integration / Selection / Reduction / Transformation
Databases
Data Mining & Business Intelligence
Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or future values of other variables.
Description Methods
Find human-interpretable patterns that describe the data.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Taxonomy
Data Mining Tasks
Descriptive
Association Rule Mining
Clustering
Predictive
Classification
Sequence Pattern Mining
Regression
Outlier Detection
This taxonomy is based on the kinds of patterns output by data mining tasks.
Association Rule Mining: Definition
Given a set of records, each of which contains some number of items from a given collection,
produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
TID
Items
1
2
3
4
5
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Market transactional DB
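How strongly a rule such as {Milk} --> {Coke} holds is commonly quantified with support and confidence. The sketch below computes both; the transaction list is hypothetical (it is not the slide's table), and the function names are chosen here for illustration.

```python
# Hypothetical market-basket transactions (illustrative, not the slide's table).
transactions = [
    {"bread", "coke", "milk"},
    {"beer", "bread"},
    {"beer", "coke", "diaper", "milk"},
    {"beer", "bread", "diaper", "milk"},
    {"coke", "diaper", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(set(antecedent) | set(consequent), transactions) / base

for lhs, rhs in [({"milk"}, {"coke"}), ({"diaper", "milk"}, {"beer"})]:
    s = support(lhs | rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(sorted(lhs), "-->", sorted(rhs), f"support={s:.2f} confidence={c:.2f}")
```

A rule miner would enumerate candidate itemsets and keep only rules whose support and confidence exceed user-chosen thresholds.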
Association Rule Mining: Application 1
Marketing and Sales Promotion
Let the rule discovered be
{Coke, ...} --> {Potato Chips}
Potato Chips as consequent
Can be used to determine what should be done to boost its sales.
Coke in the antecedent
Can be used to see which products would be affected if the store discontinues selling Coke.
Coke in antecedent and Potato Chips in consequent
Can be used to see what products should be sold with Coke to promote sale of Potato Chips!
Association Rule Mining: Application 2
Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
Here is a classical rule: {diaper, milk} --> {beer}
If a customer buys diapers and milk, then he is very likely to buy beer.
So, don't be surprised if you find six-packs stacked next to diapers!
Another possible rule may be: {flowers, beer} --> {condoms}
If a customer buys flowers and beer, then he could very likely buy condoms.
Classification/Regression/Prediction
Classification
From the training data with class labels, a classification model is learnt to distinguish data instances between different classes
The model can be represented as classification rules, decision trees, mathematical formulae, neural networks, etc.
When new (test) data without class labels comes, the model is used to predict its class labels
Regression
Used to map a data item to a real-valued prediction variable
Involves the learning of the function that does the mapping
Regression learns a model to predict a continuous target, whereas classification learns a model to predict categorical labels.
Classification: Definition
Given a collection of records (training set), each record contains a set of attributes, one of which is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: Previously unseen records should be assigned a class as accurately as possible.
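To make the definition concrete, here is a minimal sketch that uses a 1-nearest-neighbour classifier. That choice is an assumption made for brevity (the slides do not prescribe an algorithm), and the records, attribute meanings (age, income in thousands) and labels are hypothetical.

```python
import math

# Hypothetical training set: ((age, income in thousands), class label).
training_set = [
    ((22, 25), "dont_buy"),
    ((25, 30), "dont_buy"),
    ((30, 40), "dont_buy"),
    ((45, 90), "buy"),
    ((50, 100), "buy"),
    ((55, 110), "buy"),
]

# Previously unseen records, with known labels kept aside to measure accuracy.
test_set = [
    ((28, 35), "dont_buy"),
    ((48, 95), "buy"),
]

def predict(training_set, attributes):
    """Return the class label of the closest training record (1-NN)."""
    _, label = min(training_set, key=lambda rec: math.dist(rec[0], attributes))
    return label

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    hits = sum(predict(training_set, x) == y for x, y in test_set)
    return hits / len(test_set)

print(predict(training_set, (28, 35)))
print(accuracy(training_set, test_set))
```

Here the "model" is simply the stored training set plus a distance function; rule sets, decision trees or neural networks play the same role with a more compact representation.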
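Example of a Classification Task
[Figure: training data with the Class Label column]
Classification Rules
[Figure: rules annotated with Class Labels]
Decision Tree
[Figure: tree annotated with Attributes, Attribute Values and Class Labels]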
Classification: Application 1
Direct Marketing
Goal:
Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier model.
Classification: Application 2
Fraud Detection
Goal:
Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes.
When does a customer buy, what does he buy, how often does he pay on time, etc.
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.
Regression
Goal:
Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Extensively studied in statistics, neural network fields.
Examples:
Predicting sales amounts of a new product based on advertising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
Prediction & Regression: Example 1
Relationship between systolic blood pressure (y), birthweight (x1), and age in days (x2)

i    Birthweight in oz (x1)   Age in days (x2)   Systolic BP in mm Hg (y)
1    135                      --                 89
2    120                      --                 90
3    100                      --                 83
4    105                      --                 77
5    130                      --                 92
6    125                      --                 98
7    125                      --                 82
8    105                      --                 85
9    120                      --                 96
10   90                       --                 95
11   120                      --                 80
12   95                       --                 79
13   120                      --                 86
14   150                      --                 97
15   160                      --                 92
16   125                      --                 88

Example of multiple regression:
Use the least squares method to determine the regression equation.
Prediction:
To predict the systolic BP of a baby with birthweight 8 lb (128 oz) measured at 3 days of life
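A least squares fit with two predictors can be sketched in plain Python by solving the normal equations (X^T X) b = X^T y. The data below are synthetic and noise-free, generated from y = 10 + 0.5*x1 + 2*x2 purely so the result is easy to check; they are not the blood-pressure measurements, and the function names are illustrative.

```python
def fit_least_squares(rows, targets):
    """Return [b0, b1, ..., bk] minimising the sum of squared residuals.

    Assumes the predictors are not collinear, so X^T X is invertible.
    """
    X = [[1.0, *row] for row in rows]  # prepend the intercept column
    k = len(X[0])
    # Build the normal equations A b = c with A = X^T X and c = X^T y.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    c = [sum(x[i] * y for x, y in zip(X, targets)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
                c[r] -= factor * c[col]
    return [c[i] / A[i][i] for i in range(k)]

def predict(coeffs, row):
    """Evaluate b0 + b1*x1 + ... + bk*xk for one record."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))

# Synthetic (x1, x2) records and targets generated from y = 10 + 0.5*x1 + 2*x2.
rows = [(100, 2), (120, 3), (135, 4), (110, 5), (150, 3), (90, 4)]
targets = [64.0, 76.0, 85.5, 75.0, 91.0, 63.0]

coeffs = fit_least_squares(rows, targets)
print(coeffs)
print(predict(coeffs, (128, 3)))
```

With real, noisy measurements the fitted coefficients would only approximate the underlying relationship, and the prediction for a new record (for example 128 oz at 3 days) is obtained the same way with `predict`.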
Clustering: Definition
Given a set of data points, each having a set of attributes, and a proximity measure among them, the goal of clustering is to find clusters such that
Data points in the same cluster are more similar to one another.
Data points in different clusters are less similar to one another.
Proximity Measures:
Euclidean distance if attributes are continuous.
Cosine similarity for document data.
Other problem-specific measures.
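The two named proximity measures can be written down directly. This is a minimal sketch for equal-length numeric vectors; the function names are chosen here for illustration.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_p * norm_q)

print(euclidean_distance((0, 0), (3, 4)))
print(cosine_similarity((1, 2), (2, 4)))
```

Note the difference in direction: a smaller Euclidean distance means more similar, whereas a larger cosine similarity means more similar. Cosine similarity ignores vector length, which is why it suits documents of different sizes.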
Clustering: Principle
Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data without class labels
Could help to determine class labels
Objects are clustered based on the principle: minimize the intra-cluster distance and maximize the inter-cluster distance
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
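One way to see the principle at work is to measure both quantities for two candidate groupings of the same points. The sketch below uses mean pairwise Euclidean distance as the measure, which is an assumption made here; the points and the two labelings are hypothetical.

```python
import math
from itertools import combinations

def mean_intra_and_inter(points, labels):
    """Mean pairwise distance within clusters and between clusters.

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        (intra if lp == lq else inter).append(math.dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical 2-D points forming two visually obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
good_labels = [0, 0, 0, 1, 1, 1]  # follows the two groups
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the groups

good_intra, good_inter = mean_intra_and_inter(points, good_labels)
bad_intra, bad_inter = mean_intra_and_inter(points, bad_labels)
print(f"good: intra={good_intra:.2f} inter={good_inter:.2f}")
print(f"bad:  intra={bad_intra:.2f} inter={bad_inter:.2f}")
```

The labeling that follows the natural groups has the smaller intra-cluster distance and the larger inter-cluster distance; a clustering algorithm searches for such a labeling automatically.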
Clustering: Application 1
Market Segmentation:
Goal:
To subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle-related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
Clustering: Application 2
Document Clustering:
Goal:
To find groups of documents that are similar to each other based on the important terms appearing in them.
Approach:
To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
Gain/Consequence:
Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
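The approach can be sketched in three steps: count term frequencies, compare documents by the cosine of their frequency vectors, and group documents that are similar enough. The grouping rule used here (a single greedy pass against each cluster's first document, with a 0.5 threshold) is a simplification assumed for illustration, and the three documents are hypothetical.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine(tf_a, tf_b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_documents(docs, threshold=0.5):
    """Greedy single pass: join the first cluster whose first document is
    at least `threshold` similar, otherwise start a new cluster."""
    vectors = {name: term_frequencies(text) for name, text in docs.items()}
    clusters = []
    for name in docs:
        for cluster in clusters:
            if cosine(vectors[name], vectors[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hypothetical search results for the query "amazon".
docs = {
    "d1": "amazon rainforest trees river rainforest",
    "d2": "river trees amazon rainforest wildlife",
    "d3": "amazon online shopping books shopping",
}
print(cluster_documents(docs))
```

A new document or search term can then be related to the existing clusters by computing its similarity to each cluster in the same way.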
Clustering: Application 3
Clustering of S&P 500 Stock Data
Observe stock movements every day.
Clustering points: Stock-{UP/DOWN}
Similarity measure: Two points are more similar if the events described by them frequently happen together on the same day.
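The similarity measure described above can be made concrete in several ways. One possible choice, assumed here and not stated on the slide, is the Jaccard similarity between the sets of days on which each event occurred. The tickers and movements below are hypothetical.

```python
# Hypothetical observations: one set of Stock-{UP/DOWN} events per trading day.
daily_events = [
    {"AAA-UP", "BBB-UP", "CCC-DOWN"},
    {"AAA-UP", "BBB-UP", "CCC-UP"},
    {"AAA-DOWN", "BBB-DOWN", "CCC-UP"},
    {"AAA-UP", "BBB-UP", "CCC-DOWN"},
]

def co_occurrence_similarity(event_a, event_b, days):
    """Jaccard similarity of the sets of days on which each event happened."""
    days_a = {i for i, events in enumerate(days) if event_a in events}
    days_b = {i for i, events in enumerate(days) if event_b in events}
    union = days_a | days_b
    return len(days_a & days_b) / len(union) if union else 0.0

print(co_occurrence_similarity("AAA-UP", "BBB-UP", daily_events))
print(co_occurrence_similarity("AAA-UP", "CCC-UP", daily_events))
```

Events that always occur on the same days score 1.0 and would end up in the same cluster; events that rarely coincide score close to 0.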
Outlier/Anomaly Detection
Outlier: a data point that does not comply with the general behavior of the data
Goal: To detect significant deviations (outliers) from the normal behavior
Although in many applications outliers are of no interest, in some applications they are very useful
Fraud detection in credit card purchases, authentication (password), network intrusion detection
Outlier/Anomaly Detection
Applications:
Credit Card Fraud Detection
Network Intrusion Detection
Approach:
Outliers can be detected by assuming a distribution for the general behavior; any point that falls outside it is considered an outlier
Example
Credit card purchase: check the amount, place of purchase, purchase frequency
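A minimal sketch of the distribution-based approach, assuming that normal purchase amounts are roughly Gaussian and that anything more than three standard deviations from the mean counts as an outlier. Both the distribution and the threshold are assumptions made here, and the purchase history is hypothetical.

```python
import statistics

def fit_normal(history):
    """Estimate the mean and sample standard deviation of normal behaviour."""
    return statistics.mean(history), statistics.stdev(history)

def is_outlier(value, mean, stdev, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations away."""
    return abs(value - mean) > threshold * stdev

# Hypothetical past purchase amounts on one credit card account.
history = [20, 25, 22, 19, 24, 21, 23, 20, 22]
mean, stdev = fit_normal(history)

for amount in (23, 950):
    print(amount, "outlier" if is_outlier(amount, mean, stdev) else "normal")
```

The same check could be applied to other attributes from the example, such as purchase frequency; categorical attributes such as place of purchase would need a different model of normal behaviour.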
Summary
Data mining: Discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
Career in Data Mining
http://www.kdnuggets.com/2015/03/salary-analytics-data-science-poll-well-compensated.html
2015 Salary survey (US$)
A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM Data Mining (2001), (IEEE) ICDM (2001), etc.
ACM Transactions on KDD starting in 2007
Conferences and Journals on Data Mining
KDD Conferences
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining (ICDM)
Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Other related conferences
ACM SIGMOD
VLDB
(IEEE) ICDE
WWW, SIGIR
ICML, CVPR, NIPS
Journals
Data Mining and Knowledge Discovery (DAMI or DMKD)
IEEE Trans. on Knowledge and Data Eng. (TKDE)
KDD Explorations
ACM Trans. on KDD