
UNIT 1 Introduction

Lecture 1 Motivation: Why data mining?
Lecture 2 What is data mining?
Lecture 3 Data Mining: On what kind of data?
Lecture 4 Data mining functionality
Lecture 5 Classification of data mining systems
Lecture 6 Major issues in data mining

Unit 1 Data warehouse and OLAP

Lecture 7 What is a data warehouse?
Lecture 8 A multidimensional data model
Lecture 9 Data warehouse architecture
Lecture 10 & 11 Data warehouse implementation
Lecture 12 From data warehousing to data mining

Lecture 1
Motivation: Why data mining?

Evolution of Database Technology

1960s and earlier:
Data collection and database creation
Primitive file processing

1970s to early 1980s:
Database management systems (DBMS)
Hierarchical and network database systems
Relational database systems
Query languages: SQL
Transactions, concurrency control, and recovery
Online transaction processing (OLTP)

Mid-1980s to present:
Advanced data models
Extended-relational, object-relational
Advanced application-oriented DBMS
Spatial, scientific, engineering, temporal, multimedia, active, stream and sensor, knowledge-based

Late 1980s to present:
Advanced data analysis
Data warehouse and OLAP
Data mining and knowledge discovery
Advanced data mining applications
Data mining and society

1990s to present:
XML-based database systems
Integration with information retrieval
Data and information integration

Present to future:
New generation of integrated data and information systems

Lecture 2
What Is Data Mining?

What Is Data Mining?

Data mining refers to extracting or "mining" knowledge from large amounts of data.
Analogy: mining of gold from rocks or sand
Alternative names: knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging
Knowledge Discovery from Data, or KDD

Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow, bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]

Steps of a KDD Process
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation

Steps of a KDD Process
Learning the application domain: relevant prior knowledge and goals of the application
Creating a target data set: data selection
Data cleaning and preprocessing
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining: summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge

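The KDD steps can be read as a chain of small functions. The sketch below is only an illustration under stated assumptions: the records, the field names (id, age, spend), the age bands, and the spend threshold are all made up, and the "mining" step is a toy per-group average rather than a real mining algorithm.

```python
from statistics import mean

# Hypothetical raw records from two sources (all values are invented).
source_a = [{"id": 1, "age": 25, "spend": 120.0}, {"id": 2, "age": None, "spend": 80.0}]
source_b = [{"id": 3, "age": 41, "spend": 300.0}, {"id": 3, "age": 41, "spend": 300.0}]

def clean(rows):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: merge sources, keeping one record per id."""
    merged = {}
    for rows in sources:
        for r in rows:
            merged[r["id"]] = r
    return list(merged.values())

def select(rows, fields):
    """Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in rows]

def transform(rows):
    """Data transformation: discretize age into two coarse bands."""
    return [{"age_band": "young" if r["age"] < 30 else "older", "spend": r["spend"]}
            for r in rows]

def mine(rows):
    """Data mining (toy stand-in): average spend per age band."""
    bands = {}
    for r in rows:
        bands.setdefault(r["age_band"], []).append(r["spend"])
    return {band: mean(values) for band, values in bands.items()}

def evaluate(patterns, min_spend):
    """Pattern evaluation: keep patterns above an interestingness threshold."""
    return {k: v for k, v in patterns.items() if v >= min_spend}

integrated = integrate(clean(source_a), clean(source_b))
patterns = evaluate(mine(transform(select(integrated, ["age", "spend"]))), min_spend=100.0)
print(patterns)  # knowledge presentation, here just a printout
```

Each function corresponds to one numbered step, so a step can be swapped out (for example, a real clustering routine in place of mine) without touching the rest of the chain.
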
Architecture of a Typical Data Mining System

[Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration, Filtering; Databases and Data Warehouse. A Knowledge base sits beside the pattern evaluation and data mining engine layers.]

Data Mining and Business Intelligence

[Figure: pyramid; potential to support business decisions increases toward the top. Layers from top to bottom, with the typical role beside each:
Making Decisions (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA (DBA)
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP]

Lecture 3
Data Mining: On What Kind of Data?

Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW

Lecture 4
Data Mining Functionalities

Data Mining Functionalities

Concept description: characterization and discrimination
Data can be associated with classes or concepts
Ex.: in the AllElectronics store, classes of items for sale include computers and printers.
A description of a class or concept is called a class/concept description.
Data characterization
Data discrimination

Data Mining Functionalities

Mining frequent patterns, associations, and correlations
Frequent patterns: patterns that occur frequently in data
Itemsets, subsequences, and substructures
Frequent itemsets
Sequential patterns
Structured patterns

Data Mining Functionalities

Association analysis
Multidimensional vs. single-dimensional association
age(X, 20..29) ^ income(X, 20..29K) => buys(X, PC) [support = 2%, confidence = 60%]
contains(T, computer) => contains(T, software) [support = 1%, confidence = 75%]

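Support and confidence in rules such as contains(T, computer) => contains(T, software) can be computed directly from transaction data. The following is a minimal sketch assuming a made-up list of four transactions; the function names support and confidence are illustrative, not taken from any library.

```python
# Hypothetical transactions: each is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: contains(T, computer) => contains(T, software)
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support={rule_support:.0%}, confidence={rule_confidence:.0%}")
```

With this toy data, two of four transactions contain both items, and two of the three transactions containing a computer also contain software.
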
Data Mining Functionalities

Classification and prediction
Finding models (functions) that describe and distinguish data classes or concepts, in order to predict the class of objects whose class label is unknown
E.g., classify countries based on climate, or classify cars based on gas mileage
Models: decision tree, classification rules (if-then), neural network
Prediction: predict some unknown or missing numerical values

Data Mining Functionalities

Cluster analysis
Classification and prediction analyze class-labeled data objects; clustering analyzes data objects without consulting a known class label.
Clustering is based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity

Data Mining Functionalities

Outlier analysis
Outlier: a data object that does not comply with the general behavior or model of the data
Outliers can be treated as noise or exceptions, but they are quite useful in fraud detection and rare-event analysis

Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis

Lecture 5
Data Mining: Classification Schemes

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]

Data Mining: Classification Schemes

General functionality
Descriptive data mining
Predictive data mining

Data mining systems can be classified by various criteria:
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted

Data Mining: Classification Schemes
Databases to be mined
Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
analysis, Web mining, Web log analysis, etc.

Data Mining: Classification Schemes

Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, DNA mining, stock market

Lecture 6
Major Issues in Data Mining

Major Issues in Data Mining

Mining methodology and user interaction issues
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad hoc data mining
Expression and visualization of data mining results
Handling noise and incomplete data
Pattern evaluation: the interestingness problem

Performance issues
Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining methods

Issues relating to the diversity of data types
Handling relational and complex types of data
Mining information from heterogeneous databases and global information systems (WWW)

Lecture 7
What is a Data Warehouse?

What is a Data Warehouse?
Defined in many different ways
A decision support database that is maintained separately from the organization's operational database
Supports information processing by providing a solid platform of consolidated, historical data for analysis

"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)

Data warehousing:
The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources
Relational databases, flat files, online transaction records
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
E.g., hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.

Data Warehouse: Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current value data.
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain a time element.

Data Warehouse: Nonvolatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, and concurrency control mechanisms
Requires only two operations in data accessing: initial loading of data and access of data.

Data Warehouse vs. Operational DBMS
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries

OLTP (online transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (online analytical processing)
Major task of data warehouse system
Data analysis and decision making

OLTP vs. OLAP

                    OLTP                         OLAP
users               clerk, IT professional       knowledge worker
function            day-to-day operations        decision support
DB design           application-oriented         subject-oriented
data                current, up-to-date,         historical, summarized,
                    detailed, flat relational,   multidimensional,
                    isolated                     integrated, consolidated
usage               repetitive                   ad hoc
access              read/write,                  lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction    complex query
# records accessed  tens                         millions
# users             thousands                    hundreds
DB size             100 MB-GB                    100 GB-TB
metric              transaction throughput       query throughput, response

Why Separate Data Warehouse?
High performance for both systems
DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.

Different functions and different data:
Missing data: decision support requires historical data, which operational DBs do not typically maintain
Data consolidation: decision support (DS) requires consolidation (aggregation, summarization) of data from heterogeneous sources
Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled

Lecture 8
A multidimensional data model

Cube: A Lattice of Cuboids

[Figure: lattice of cuboids over the dimensions time, item, location, supplier]
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)

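The lattice is simply the set of all subsets of the dimension list, grouped by size. As a small illustration (the variable names are ours; only the four dimension names come from the slide), it can be enumerated with the standard library:

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# lattice[k] holds every k-D cuboid, i.e. every k-element subset of the dimensions.
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}): {cuboids}")

total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
```

The empty tuple at level 0 is the apex cuboid ("all"), and the single 4-tuple at level 4 is the base cuboid.
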
Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures
Star schema: a fact table in the middle connected to a set of dimension tables
Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country

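To make the fact-table/dimension-table split concrete, here is a minimal in-memory sketch. The column names follow the star schema; the rows, key values, and the helper dollars_by are invented for illustration, and only two of the four dimensions are shown.

```python
# Dimension tables keyed by their surrogate keys (hypothetical rows).
item_dim = {
    1: {"item_name": "TV", "brand": "A", "type": "electronics"},
    2: {"item_name": "PC", "brand": "B", "type": "electronics"},
}
location_dim = {
    10: {"city": "Vancouver", "country": "Canada"},
    11: {"city": "Toronto", "country": "Canada"},
}

# Fact table: foreign keys into the dimensions plus numeric measures.
sales_fact = [
    {"item_key": 1, "location_key": 10, "units_sold": 2, "dollars_sold": 800.0},
    {"item_key": 2, "location_key": 10, "units_sold": 1, "dollars_sold": 1200.0},
    {"item_key": 1, "location_key": 11, "units_sold": 3, "dollars_sold": 1200.0},
]

def dollars_by(attribute, dimension, key):
    """Join the fact table to one dimension and sum dollars_sold per attribute value."""
    totals = {}
    for row in sales_fact:
        value = dimension[row[key]][attribute]
        totals[value] = totals.get(value, 0.0) + row["dollars_sold"]
    return totals

by_item = dollars_by("item_name", item_dim, "item_key")
by_city = dollars_by("city", location_dim, "location_key")
print(by_item, by_city)
```

Every analysis query takes the same shape: follow a foreign key from the fact table into one dimension, then aggregate a measure.
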
Example of Snowflake Schema

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_key
supplier dimension: supplier_key, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city_key
city dimension: city_key, city, province_or_state, country

Example of Fact Constellation

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country
shipper dimension: shipper_key, shipper_name, location_key, shipper_type

A Data Mining Query Language, DMQL: Language Primitives

Cube definition (fact table)
define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension definition (dimension table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special case (shared dimension tables)
First time as "cube definition"
define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

Defining a Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

Defining a Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining a Fact Constellation in DMQL

define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Measures: Three Categories
distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
E.g., count(), sum(), min(), max().
algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
E.g., avg(), min_N(), standard_deviation().
holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
E.g., median(), mode(), rank().

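The three categories differ in whether partial results over partitions can be combined. A short sketch with made-up numbers (the partitioning and values are assumptions chosen to make the contrast visible):

```python
from statistics import median

# The same data, split into two hypothetical partitions.
partitions = [[1, 2, 3], [4, 5, 6, 100, 200, 300]]
all_values = [v for part in partitions for v in part]

# Distributive: summing the partial sums gives the sum over all the data.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

# Algebraic: avg is computed from two distributive aggregates, sum and count.
partial_counts = [len(p) for p in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: combining partial medians does not, in general, give the true median.
median_of_medians = median([median(p) for p in partitions])
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```

Here the sum and the average are recovered exactly from per-partition summaries of fixed size, while the median of the partial medians differs from the median of the full data.
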
A Concept Hierarchy: Dimension (location)

[Figure: concept hierarchy tree for location, one line per level]
all: all
region: Europe, ..., North_America
country: Germany, ..., Spain; Canada, ..., Mexico
city: Frankfurt, ...; Vancouver, ..., Toronto
office: L. Chan, ..., M. Wind

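A concept hierarchy can be represented as child-to-parent mappings, and climbing it is then a grouped sum. The sketch below uses city, country, and region names from the hierarchy; the sales figures and the helper climb are hypothetical.

```python
# Child -> parent mappings for two levels of the location hierarchy.
city_to_country = {"Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada"}
country_to_region = {"Germany": "Europe", "Canada": "North_America"}

# Hypothetical sales totals at the city level.
sales_by_city = {"Frankfurt": 100, "Vancouver": 250, "Toronto": 150}

def climb(values, parent_of):
    """Aggregate values one level up the hierarchy."""
    totals = {}
    for child, amount in values.items():
        parent = parent_of[child]
        totals[parent] = totals.get(parent, 0) + amount
    return totals

sales_by_country = climb(sales_by_city, city_to_country)
sales_by_region = climb(sales_by_country, country_to_region)
print(sales_by_country, sales_by_region)
```
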
Multidimensional Data
Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths:
Product: Industry, Category, Product
Location: Region, Country, City, Office
Time: Year, Quarter, Month or Week, Day
[Figure: sales cube; axis labels include Product and Month]

A Sample Data Cube

[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum). Highlighted cell: total annual sales of TV in U.S.A.]

Cuboids Corresponding to the Cube

0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: (product, date); (product, country); (date, country)
3-D (base) cuboid: (product, date, country)

OLAP Operations

Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up; from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes
Other operations
Drill across: involving (across) more than one fact table
Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

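These operations can be sketched on a tiny cube stored as a dictionary from coordinate tuples to a measure. Everything here is an assumption for illustration: the cell values are invented, the function names are ours, and roll-up is shown only in its dimension-reduction form.

```python
# A tiny cube: {(product, quarter, country): sales}. All figures are made up.
DIMS = ("product", "quarter", "country")
cube = {
    ("TV", "1Qtr", "U.S.A."): 10,
    ("TV", "2Qtr", "U.S.A."): 20,
    ("PC", "1Qtr", "U.S.A."): 30,
    ("PC", "1Qtr", "Canada"): 40,
    ("TV", "1Qtr", "Canada"): 50,
}

def roll_up(cube, drop):
    """Roll up by dimension reduction: aggregate one dimension away."""
    i = DIMS.index(drop)
    out = {}
    for key, value in cube.items():
        reduced = key[:i] + key[i + 1:]
        out[reduced] = out.get(reduced, 0) + value
    return out

def slice_(cube, dim, value):
    """Slice: select on a single dimension (trailing underscore avoids the builtin)."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

def dice(cube, **allowed):
    """Dice: select on two or more dimensions at once."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS.index(d)] in values for d, values in allowed.items())}

def pivot(cube, order):
    """Pivot: reorder the axes without changing any cell value."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[i] for i in idx): v for k, v in cube.items()}

by_product_country = roll_up(cube, "quarter")
first_quarter = slice_(cube, "quarter", "1Qtr")
tv_usa = dice(cube, product={"TV"}, country={"U.S.A."})
rotated = pivot(cube, ("country", "product", "quarter"))
```

Drill-down is the inverse of roll_up and needs the more detailed cells to still be available, which is why the base cuboid is kept.
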
Lecture 9
Data warehouse architecture

Steps for the Design and Construction of Data Warehouse

The design of a data warehouse: a business analysis framework
The process of data warehouse design
A three-tier data warehouse architecture

Design of a Data Warehouse: A Business Analysis Framework

Four views regarding the design of a data warehouse
Top-down view
allows selection of the relevant information necessary for the data warehouse
Data warehouse view
consists of fact tables and dimension tables
Data source view
exposes the information being captured, stored, and managed by operational systems
Business query view
sees the perspectives of data in the warehouse from the viewpoint of the end user

Data Warehouse Design Process

Top-down, bottom-up approaches or a combination of both
Top-down: starts with overall design and planning (mature)
Bottom-up: starts with experiments and prototypes (rapid)
From software engineering point of view
Waterfall: structured and systematic analysis at each step before proceeding to the next
Spiral: rapid generation of increasingly functional systems, short turnaround time

Typical data warehouse design process
Choose a business process to model, e.g., orders, invoices, etc.
Choose the grain (atomic level of data) of the business process
Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record

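Multi-Tiered Architecture

[Figure: four tiers, left to right.
Data Sources: operational DBs and other sources
Data Storage: data are extracted, transformed, loaded, and refreshed into the data warehouse and data marts; a monitor & integrator and a metadata repository sit alongside
OLAP Engine: OLAP server, serving data from the storage tier
Front-End Tools: analysis, query, reports, data mining]
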
Metadata Repository
Metadata is the data defining warehouse objects. It has the following kinds:
Description of the structure of the warehouse
schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
Operational metadata
data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
The algorithms used for summarization
The mapping from operational environment to the data warehouse
Data related to system performance
warehouse schema, view and derived data definitions
Business data
business terms and definitions, ownership of data, charging policies

Data Warehouse Back-End Tools and Utilities

Data extraction: get data from multiple, heterogeneous, and external sources
Data cleaning: detect errors in the data and rectify them when possible
Data transformation: convert data from legacy or host format to warehouse format
Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh: propagate the updates from the data sources to the warehouse

Three Data Warehouse Models
Enterprise warehouse
collects all of the information about subjects spanning the entire organization
Data mart
a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
A set of views over operational databases
Only some of the possible summary views may be materialized

Data Warehouse Development: A Recommended Approach

[Figure: development path, bottom to top: Define a high-level corporate data model; model refinement toward Data Marts and toward an Enterprise Data Warehouse; Distributed Data Marts; Multi-Tier Data Warehouse]

Types of OLAP Servers
Relational OLAP (ROLAP)
Use relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
Include optimization of DBMS back end, implementation of aggregation navigation logic, and additional tools and services
Greater scalability
Multidimensional OLAP (MOLAP)
Array-based multidimensional storage engine (sparse matrix techniques)
Fast indexing to precomputed summarized data
Hybrid OLAP (HOLAP)
User flexibility, e.g., low level: relational, high level: array
Specialized SQL servers
Specialized support for SQL queries over star/snowflake schemas

Lecture 10 & 11
Data warehouse implementation

Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids in an n-dimensional cube with L levels?

T = \prod_{i=1}^{n} (L_i + 1)

where L_i is the number of levels associated with dimension i

Materialization of data cube
Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)
Selection of which cuboids to materialize
Based on size, sharing, access frequency, etc.

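The count of cuboids is the product, over all dimensions, of that dimension's number of hierarchy levels plus one (the extra one stands for the "all" level). A small sketch, where the function name and the example level counts are assumptions:

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """T = product over dimensions of (L_i + 1), with L_i levels excluding 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Four dimensions with a single level each (no hierarchies): 2**4 cuboids.
flat_cube = total_cuboids([1, 1, 1, 1])

# Three dimensions with four hierarchy levels each, e.g. day < month < quarter < year.
hierarchical_cube = total_cuboids([4, 4, 4])

print(flat_cube, hierarchical_cube)
```

The rapid growth of this number with the number of dimensions and levels is what motivates materializing only some of the cuboids.
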
Cube Operation
Cube definition and computation in DMQL
define cube sales [item, city, year]: sum(sales_in_dollars)
compute cube sales
Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al. '96)
SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
Need to compute the following group-bys:
(date, product, customer),
(date, product), (date, customer), (product, customer),
(date), (product), (customer)
()
[Figure: lattice of the corresponding cuboids: (); (city), (item), (year); (city, item), (city, year), (item, year); (city, item, year)]

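What CUBE BY asks for is one GROUP BY per subset of the listed dimensions. The sketch below computes all of them naively from the base rows; it is an illustration of the semantics, not one of the efficient cubing algorithms, and the rows and the function name cube_by are hypothetical.

```python
from itertools import combinations

# Hypothetical SALES rows: (item, city, year, amount).
DIMS = ("item", "city", "year")
sales = [
    ("TV", "Toronto", 2004, 100),
    ("TV", "Vancouver", 2004, 150),
    ("PC", "Toronto", 2005, 200),
]

def cube_by(rows, dims):
    """Return {group_by_dims: {group_key: SUM(amount)}} for every subset of dims."""
    result = {}
    for k in range(len(dims), -1, -1):
        for group in combinations(range(len(dims)), k):
            totals = {}
            for row in rows:
                key = tuple(row[i] for i in group)
                totals[key] = totals.get(key, 0) + row[-1]
            result[tuple(dims[i] for i in group)] = totals
    return result

cube = cube_by(sales, DIMS)
for group, totals in cube.items():
    print(group, totals)
```

With three dimensions this yields 2**3 = 8 group-bys, from the base cuboid (item, city, year) down to the apex (), which holds the grand total.
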
Cube Computation: ROLAP-Based Method

Efficient cube computation methods
ROLAP-based cubing algorithms (Agarwal et al. '96)
Array-based cubing algorithm (Zhao et al. '97)
Bottom-up computation method (Beyer & Ramakrishnan '99)

ROLAP-based cubing algorithms
Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
Grouping is performed on some subaggregates as a "partial grouping step"
Aggregates may be computed from previously computed aggregates, rather than from the base fact table

Multiway Array Aggregation for Cube Computation
Partition arrays into chunks (a small subcube which fits in memory).
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in "multiway" by visiting cube cells in the order which minimizes the number of times each cell is visited, and reduces memory access and storage cost.

Multiway Array Aggregation for Cube Computation

[Figure: 3-D array with dimensions A (a0..a3), B (b0..b3), and C (c0..c3), partitioned into 64 chunks numbered 1 to 64; numbering runs along A first, then B, then C]

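The (chunk_id, offset) addressing can be illustrated for a layout like the one in the figure: 4 chunks per dimension, numbered 1 to 64 along A first, then B, then C. The chunk side length of 10 cells and the function name are assumptions made only for this sketch.

```python
CHUNKS_PER_DIM = 4         # a0..a3, b0..b3, c0..c3, as in the figure
CELLS_PER_CHUNK_SIDE = 10  # hypothetical: each chunk is 10 x 10 x 10 cells

def chunk_address(a, b, c, side=CELLS_PER_CHUNK_SIDE, chunks=CHUNKS_PER_DIM):
    """Map a cell coordinate (a, b, c) to (chunk_id, offset within the chunk)."""
    # Which chunk along each dimension the cell falls into.
    ca, cb, cc = a // side, b // side, c // side
    # Chunks are numbered from 1, varying fastest along A, then B, then C.
    chunk_id = 1 + ca + chunks * cb + chunks * chunks * cc
    # Position of the cell inside its chunk, linearized in the same order.
    oa, ob, oc = a % side, b % side, c % side
    offset = oa + side * ob + side * side * oc
    return chunk_id, offset

first_cell = chunk_address(0, 0, 0)
last_cell = chunk_address(39, 39, 39)
some_cell = chunk_address(5, 12, 0)
print(first_cell, last_cell, some_cell)
```

Storing only the non-empty cells of a chunk as (chunk_id, offset) pairs is what keeps sparse arrays compact.
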
Multiway Array Aggregation for Cube Computation
Method: the planes should be sorted and computed according to their size in ascending order.
Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane
Limitation of the method: it computes well only for a small number of dimensions
If there are a large number of dimensions, "bottom-up computation" and iceberg cube computation methods can be explored

Indexing OLAP Data: Bitmap Index
Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed column
Not suitable for high-cardinality domains

Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1

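The bitmap index from the table can be reproduced with Python integers used as bit vectors. This is a minimal sketch: the base table rows come from the slide, while the function name and the choice to store bit i for the row at list position i are our own conventions.

```python
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(rows, column):
    """One bit vector per distinct value; bit i is set if row i has that value."""
    index = {}
    for position, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << position)
    return index

region_index = build_bitmap_index(base_table, 1)
type_index = build_bitmap_index(base_table, 2)

# Region = Europe AND Type = Dealer becomes a single bitwise AND.
matches = region_index["Europe"] & type_index["Dealer"]
customers = [row[0] for i, row in enumerate(base_table) if (matches >> i) & 1]
print(customers)
```

The index needs one bit vector per distinct value, which is why it suits low-cardinality columns such as Region and Type.
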
Indexing OLAP Data: Join Indices
Join index: JI(R-id, S-id) where R(R-id, ...) joins S(S-id, ...)
Traditional indices map the values to a list of record ids
It materializes relational join in JI file and speeds up relational join, a rather costly operation
In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table.
E.g., fact table: Sales and two dimensions city and product
A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city
Join indices can span multiple dimensions

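A join index on a dimension is, in effect, a precomputed map from each dimension value to the row ids of the matching fact tuples. The sketch below assumes a small made-up Sales table where list position serves as the row id; build_join_index is an illustrative name.

```python
# Hypothetical Sales fact rows; the list position plays the role of the row id (RID).
sales = [
    {"city": "Toronto", "product": "TV", "dollars_sold": 100},
    {"city": "Vancouver", "product": "PC", "dollars_sold": 200},
    {"city": "Toronto", "product": "PC", "dollars_sold": 300},
]

def build_join_index(fact_rows, *dims):
    """Map each combination of dimension values to the RIDs of matching fact rows."""
    index = {}
    for rid, row in enumerate(fact_rows):
        key = tuple(row[d] for d in dims)
        index.setdefault(key, []).append(rid)
    return index

city_index = build_join_index(sales, "city")
city_product_index = build_join_index(sales, "city", "product")  # spans two dimensions

# Answer "total sales in Toronto" by touching only the indexed rows.
toronto_total = sum(sales[rid]["dollars_sold"] for rid in city_index[("Toronto",)])
print(city_index, toronto_total)
```
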
Efficient Processing of OLAP Queries
Determine which operations should be performed on the available cuboids:
transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine to which materialized cuboid(s) the relevant operations should be applied.
Exploring indexing structures and compressed vs. dense array structures in MOLAP

Lecture 12
From data warehousing to data mining

Data Warehouse Usage
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
Differences among the three tasks

From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)

Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data warehouses
ODBC, OLE DB, Web accessing, service facilities, reporting and OLAP tools
OLAP-based exploratory data analysis
mining with drilling, dicing, pivoting, etc.
Online selection of data mining functions
integration and swapping of multiple mining functions, algorithms, and tasks.
Architecture of OLAM

An OLAM Architecture

[Figure: four-layer architecture, top to bottom.
Layer 4, User Interface: User GUI API; mining query in, mining result out
Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine
Data Cube API
Layer 2, MDDB: MDDB and Meta Data
Database API, with Filtering & Integration and Filtering
Layer 1, Data Repository: Databases and Data Warehouse, with data cleaning and data integration]
