Background
1980s to early 1990s
  Focus on computerizing business processes
  To gain competitive advantage
By early 1990s
  All companies had operational systems
  Computerization no longer offered any advantage
8/18/11
OLTP Systems
Has relational normalized design
Redundant data is undesirable
Consists of many tables
High volume retrieval is inefficient
Optimized for repetitive narrow queries
Common data in many applications
Data Warehouse
A decision support database that is maintained separately from the organization's operational databases.
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.
8/18/11 SS ZG515, Data Warehousing
Subject Oriented
Data Warehouse is designed around subjects rather than processes
A company may have
  Retail Sales System
  Outlet Sales System
  Catalog Sales System
Integrated
Heterogeneous Source Systems
Little or no control
Need to integrate source data
For example: product codes could be different in different systems
Arrive at a common code in the DW
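The product-code example can be sketched in a few lines. This is only an illustration: the source system names (`retail`, `catalog`), the local codes, and the mapping table are all invented here, not taken from the slides.

```python
# Hypothetical mapping from (source system, local product code) to one
# common warehouse code. Every name and code below is invented.
COMMON_CODE = {
    ("retail", "R-1001"): "P1",
    ("catalog", "CAT-77"): "P1",
    ("retail", "R-2002"): "P2",
}


def to_warehouse_code(source_system, local_code):
    """Return the common DW product code, or raise if the code is unmapped."""
    try:
        return COMMON_CODE[(source_system, local_code)]
    except KeyError:
        raise ValueError(
            f"no common code for {local_code!r} from {source_system!r}"
        ) from None


# Rows as they might arrive from the source systems: (system, code, units).
source_rows = [
    ("retail", "R-1001", 3),
    ("catalog", "CAT-77", 2),
    ("retail", "R-2002", 5),
]

# After integration, sales of the same product add up under one code.
units_by_product = {}
for system, code, units in source_rows:
    key = to_warehouse_code(system, code)
    units_by_product[key] = units_by_product.get(key, 0) + units
```

Raising on an unmapped code, rather than passing it through, is one possible design choice: it surfaces integration gaps at load time instead of during analysis.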
Non-Volatile (Read Mostly)
OLTP: users write and read
DW: users read
Time Variant
Most business analysis has a time component
Trend Analysis (historical data is required)
[Figure: Sales by year, 2001 to 2004]
[Figure labels: OLAP servers, Data Marts]
Convert from legacy/host format to warehouse format
Load
  Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
Refresh
  Bring new data from source systems
time zones
Different currencies
Different measurement units
Data not captured by OLTP systems
Ensuring data quality
Presentation Servers
A target physical machine on which DW data is organized for direct querying by end users using
  OLAP
  Report writers
  Data Visualization tools
  Data mining tools
Data stored in Dimensional framework
Analogy: sitting area of a restaurant
Data Cleaning
Why?
  Data warehouse contains data that is analyzed for business decisions
  More data and multiple sources could mean more errors in the data, and such errors are harder to trace
  Results in incorrect analysis
Detecting data anomalies and rectifying them early has huge payoffs
Long Term Solution
  Change business practices and data entry tools
Soundex Algorithms
Misspelled terms
For example: names
Phonetic algorithms can find similar sounding names
Based on the six phonetic classifications of human speech sounds
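A minimal sketch of one phonetic algorithm follows. It implements the commonly described American Soundex rules; the slides do not say which variant is meant, so this choice is an assumption. The six digit codes correspond to six groups of similar sounding consonants.

```python
# Consonant groups and their digit codes in American Soundex.
# Vowels, y, h and w carry no code.
CODES = {}
for digit, letters in (
    ("1", "bfpv"),
    ("2", "cgjkqsxz"),
    ("3", "dt"),
    ("4", "l"),
    ("5", "mn"),
    ("6", "r"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]  # pad with zeros, truncate to 4 characters


# Misspellings of the same name map to the same code.
same_code = soundex("Smith") == soundex("Smyth")
```

For data cleaning, names with equal codes are candidates for being the same entity; a match still needs review, because unrelated names can also share a code.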
[Figure labels: OLTP, DW]
OLAP Queries
How much of product P1 was sold in 1999, state wise?
Top 5 selling products in 2002
Total sales in Q1 of FY 2002-03?
Color wise sales figures of cars from 2000 to 2003
Model wise sales of cars for the month of Jan from 2000 to 2003
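Two of these questions can be sketched as SQL aggregations, run here through Python's built-in `sqlite3` module. The flat `sales` table, its columns, and the sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flat sales table; column names are invented for this sketch.
conn.execute(
    "CREATE TABLE sales (product TEXT, state TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("P1", "Goa", 1999, 100.0),
        ("P1", "Goa", 1999, 50.0),
        ("P1", "Kerala", 1999, 70.0),
        ("P2", "Goa", 1999, 10.0),
        ("P1", "Goa", 2002, 30.0),
        ("P2", "Kerala", 2002, 90.0),
    ],
)

# "How much of product P1 was sold in 1999, state wise?"
p1_by_state = conn.execute(
    "SELECT state, SUM(amount) FROM sales "
    "WHERE product = 'P1' AND year = 1999 "
    "GROUP BY state ORDER BY state"
).fetchall()

# "Top 5 selling products in 2002"
top_2002 = conn.execute(
    "SELECT product, SUM(amount) AS total FROM sales "
    "WHERE year = 2002 GROUP BY product ORDER BY total DESC LIMIT 5"
).fetchall()
```

Both queries scan and aggregate many rows, in contrast to the narrow, repetitive lookups that OLTP systems are optimized for.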
Continuum of Analysis
OLTP: Primitive & Canned Analysis
OLAP: Complex Ad-hoc Analysis
Data Mining: Automated Analysis
Techniques range from SQL to Specialized Algorithms
Design Requirements
Design of the DW must directly reflect the way managers look at the business
Should capture the measurements of importance along with the parameters by which these measurements are viewed
It must facilitate data analysis, i.e., answering business questions
ER Modeling
A logical design technique that seeks to eliminate data redundancy
Illuminates the microscopic relationships among data elements
Perfect for OLTP systems
Responsible for success of transaction processing in Relational Databases
ER vs Dimensional Modeling
ER models are constituted to
  Remove redundant data (normalization)
  Facilitate retrieval of individual records having certain critical identifiers
  Thereby optimizing OLTP performance
Dimensional model supports the reporting and analytical needs of a data warehouse system.
[Figure labels: Fact, Dimensions]
Star Schema
[Figure: a central fact table with four foreign keys (FK), one each to the Product, Location, Time, and Promotion dimensions]
Dimensional Modeling
Facts are stored in FACT tables
Dimensions are stored in DIMENSION tables
Dimension tables contain textual descriptors of the business
Fact and dimension tables form a Star Schema
BIG fact table in the center surrounded by SMALL dimension tables
Time Dimension
Columns: PERIOD KEY, Period Desc, Year, Quarter, Month, Day
Fact Tables
Contains numerical measurements of the business
Each measurement is taken at the intersection of all dimensions
Intersection is the composite key
Represents many-to-many relationships between dimensions
Examples of facts: Sale_amt, Units_sold, Cost, Customer_count
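A star schema with a composite-key fact table can be sketched with Python's built-in `sqlite3` module. The table names, the non-key columns of the product, location, and promotion dimensions, and the sample rows are assumptions; only the time dimension columns and the fact names `units_sold` and `sale_amt` follow the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Small dimension tables (non-key columns are illustrative only).
    CREATE TABLE time_dim (
        period_key INTEGER PRIMARY KEY, period_desc TEXT,
        year INTEGER, quarter TEXT, month TEXT, day INTEGER);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, store_name TEXT);
    CREATE TABLE promotion_dim (promotion_key INTEGER PRIMARY KEY, promo_name TEXT);

    -- Fact table: one foreign key per dimension; together they form
    -- the composite primary key.
    CREATE TABLE sales_fact (
        period_key INTEGER REFERENCES time_dim(period_key),
        product_key INTEGER REFERENCES product_dim(product_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        promotion_key INTEGER REFERENCES promotion_dim(promotion_key),
        units_sold INTEGER,
        sale_amt REAL,
        PRIMARY KEY (period_key, product_key, location_key, promotion_key));

    INSERT INTO time_dim VALUES (1, '15 Jan 2003', 2003, 'Q1', 'Jan', 15);
    INSERT INTO time_dim VALUES (2, '20 Jan 2004', 2004, 'Q1', 'Jan', 20);
    INSERT INTO product_dim VALUES (1, 'Tissue paper');
    INSERT INTO product_dim VALUES (2, 'Paper towel');
    INSERT INTO location_dim VALUES (1, 'Store 1');
    INSERT INTO promotion_dim VALUES (1, 'No promotion');
    INSERT INTO sales_fact VALUES (1, 1, 1, 1, 25, 250.0);
    INSERT INTO sales_fact VALUES (1, 2, 1, 1, 35, 350.0);
    INSERT INTO sales_fact VALUES (2, 1, 1, 1, 10, 100.0);
    """
)

# Dimensions are the entry points: constrain or group by a dimension
# attribute, then aggregate the facts.
sales_by_year = conn.execute(
    "SELECT t.year, SUM(f.sale_amt) FROM sales_fact f "
    "JOIN time_dim t ON f.period_key = t.period_key "
    "GROUP BY t.year ORDER BY t.year"
).fetchall()
```

Each query joins the fact table directly to the dimensions it needs, one join per dimension, which is what keeps star schema queries simple to write.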
Dimension Tables
Contains attributes for dimensions
50 to 100 attributes common
Best attributes are textual and descriptive
DW is only as good as the dimension attributes
Contains hierarchical information, albeit redundantly
Entry points into the fact table
Types of Facts
Fully additive: additive across all dimensions
  Units_sold, Sales_amt
Semi-additive: additive across some dimensions only
  Account_balance, Customer_count
  28/3, tissue paper, store1, 25, 250, 20
  28/3, paper towel, store1, 35, 350, 30
  Is the number of customers who bought either tissue paper or paper towel 50? No, because a customer who bought both products is counted in both rows.
Non-additive: additive across no dimension
  Gross margin = Gross profit / Amount
  Note that GP and Amount are fully additive
  Use the ratio of the sums, not the sum of the ratios
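The two pitfalls above can be checked with a few lines of arithmetic. The gross profit figures and the customer identifiers are invented for this sketch; only the amounts 250 and 350 echo the sample rows.

```python
# Hypothetical rows: (product, gross_profit, amount).
rows = [("tissue paper", 50.0, 250.0), ("paper towel", 35.0, 350.0)]

total_gp = sum(gp for _, gp, _ in rows)
total_amount = sum(amount for _, _, amount in rows)

# Correct overall gross margin: the ratio of the sums.
ratio_of_sums = total_gp / total_amount

# Incorrect: adding the per-row margins (0.2 + 0.1) gives a meaningless figure.
sum_of_ratios = sum(gp / amount for _, gp, amount in rows)

# Customer count is semi-additive: summing per-product counts
# double-counts anyone who bought both products.
tissue_buyers = {"c1", "c2", "c3"}
towel_buyers = {"c2", "c3", "c4", "c5"}
summed_count = len(tissue_buyers) + len(towel_buyers)
distinct_count = len(tissue_buyers | towel_buyers)
```

Here the summed count is 7 while only 5 distinct customers exist, which is why a distinct count has to be computed from finer-grained data rather than by adding stored counts along the product dimension.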
Some Terms
SKU (Stock Keeping Unit)
UPC (Universal Product Code)
EPOS (Electronic Point of Sale)
Grocery Store DW
Step 1: Sales Business Process
Step 2: Daily Grain
A word about GRANULARITY
  Temp sensor data: per ms, sec, min, hr?
  Size of the DW is governed by granularity
  Daily grain (club products sold on a day for each store): aggregated data
  Receipt line grain (each line in the receipt is recorded): finest grain data
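The difference between the two grains can be sketched by rolling receipt lines up to the daily grain. The record layout and the sample values are assumptions made for illustration.

```python
from collections import defaultdict

# Receipt line grain: (date, store, receipt_no, product, units).
# All values are invented.
receipt_lines = [
    ("28/3", "store1", 101, "tissue paper", 2),
    ("28/3", "store1", 101, "paper towel", 1),
    ("28/3", "store1", 102, "tissue paper", 3),
    ("29/3", "store1", 103, "tissue paper", 1),
]

# Daily grain: one row per (date, store, product); the receipt number is lost.
daily = defaultdict(int)
for date, store, _receipt_no, product, units in receipt_lines:
    daily[(date, store, product)] += units

daily_rows = dict(daily)
```

Four receipt lines collapse into three daily rows. The coarser grain needs less storage, but questions about individual receipts can no longer be answered from it.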
Quantity sold (additive)
Dollar revenue (additive)
Dollar cost (additive)
Customer count (semi-additive, not additive along the product dimension)
Market-Basket Analysis
What products are customers buying together?
  Beer & Diapers
  Polo Shirts & Barbie Dolls
How do we find this out? Market-Basket Analysis!
Transaction No. at receipt line grain is a degenerate dimension
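A minimal sketch of the idea: group receipt lines by transaction number, then count how often each pair of products shares a basket. The sample lines are invented, and simple pair counting is only one basic form of market-basket analysis.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Receipt line grain: (transaction_no, product). Values are invented.
lines = [
    (1, "beer"), (1, "diapers"), (1, "chips"),
    (2, "beer"), (2, "diapers"),
    (3, "chips"),
]

# The transaction number ties together the lines of one basket.
baskets = defaultdict(set)
for txn_no, product in lines:
    baskets[txn_no].add(product)

# Count every unordered product pair once per basket.
pair_counts = Counter()
for products in baskets.values():
    for pair in combinations(sorted(products), 2):
        pair_counts[pair] += 1

most_common_pair = pair_counts.most_common(1)[0]
```

This grouping is possible only because the transaction number is kept in the fact table at receipt line grain; at the daily grain the baskets cannot be reconstructed.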
Promotion Dimension
Causal Dimension
Causal: that which causes or is the cause
Promotion conditions include
  TPRs (temporary price reductions)
  End-aisle displays
  Newspaper ads
  Coupons
Combinations are common
Q&A
Thank You