
Data Warehousing

Vikas Singh
Computer Science Department
BITS, Pilani

Background
1980s to early 1990s
  Focus on computerizing business processes
  To gain competitive advantage

By early 1990s
  All companies had operational systems
  Operational systems no longer offered any advantage

How to gain a competitive advantage?

8/18/11

SS ZG515, Data Warehousing

OLTP Systems: Primary Purpose


Run the operations of the business
  For example: banks, railway reservation, etc.
Based on ER data modeling
Transaction-based system
Data is always current valued
Little history is available
Data is highly volatile
Has intelligent keys


OLTP Systems
Has relational normalized design
Redundant data is undesirable
Consists of many tables
High-volume retrieval is inefficient
Optimized for repetitive narrow queries
Common data in many applications


Need for Data Warehousing


Companies, over the years, gathered huge volumes of data
  Hidden treasure
Can this data be used in any way?
Can we analyze this data to get any competitive advantage?
If yes, what kind of advantage?


Benefits of Data Warehousing


Allows efficient analysis of data
Competitive advantage
Analysis aids strategic decision making
Increased productivity of decision makers
Potential high ROI
Classic example: diapers and beer


Decision Support Systems, DW, & OLAP


DSS: information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
Data Warehouse is a DSS
A data warehouse is an architectural construct of an information system that provides users with current and historical decision support information that is hard to access or present in traditional operational systems.
Data Warehouse is not an intelligent system
On-Line Analytical Processing (OLAP) is an element of DSS


DW: Interesting Statistics


[Figure: bar chart comparing Investment ($ BN), Users, and Size (GB) for 2000 and 2003; vertical axis from 0 to 2500. Data labels shown on the chart: 37.4, 148.5, 393, 626, 1100, 2178.]

Data Warehouse: Characteristics


Analysis driven
Ad-hoc queries
Complex queries
Used by top managers
Based on dimensional modeling
Denormalized structures


Data Warehouse: Major Players


SAS Institute
IBM
Oracle
Sybase
Microsoft
HP
Cognos
Business Objects


Data Warehouse
A decision support database that is maintained separately from the organization's operational databases.
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

Subject Oriented
Data Warehouse is designed around subjects rather than processes
A company may have
  Retail Sales System
  Outlet Sales System
  Catalog Sales System
Problems galore!
DW will have a Sales Subject Area


Subject Oriented
[Diagram: the Retail Sales System, Outlet Sales System, and Catalog Sales System feed a single Sales Subject Area]



Integrated
Heterogeneous source systems
Little or no control
Need to integrate source data
  For example: product codes could be different in different systems
  Arrive at a common code in the DW
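
The idea of arriving at a common product code can be sketched as a cross-reference lookup applied while loading. This is only an illustrative sketch: the source system names, the product codes, and the `CODE_MAP` table are all hypothetical and do not come from the slides.

```python
# Hypothetical cross-reference: (source system, source product code) -> common DW code.
CODE_MAP = {
    ("retail", "R-1001"): "P0001",
    ("outlet", "OUT_77"): "P0001",   # same product, different code in another system
    ("catalog", "C9-A"): "P0002",
}

def to_dw_code(source_system, source_code):
    """Return the common DW product code, or None if the code is unmapped."""
    return CODE_MAP.get((source_system, source_code))

# Hypothetical extracted rows: (source system, source product code, units sold).
source_rows = [
    ("retail", "R-1001", 3),
    ("outlet", "OUT_77", 5),
    ("catalog", "C9-A", 2),
    ("catalog", "ZZZ", 1),           # unknown code, set aside for review
]

integrated = {}   # common DW code -> total units
unmapped = []     # rows that could not be integrated
for system, code, units in source_rows:
    dw_code = to_dw_code(system, code)
    if dw_code is None:
        unmapped.append((system, code))
        continue
    integrated[dw_code] = integrated.get(dw_code, 0) + units

print(integrated)  # {'P0001': 8, 'P0002': 2}
print(unmapped)    # [('catalog', 'ZZZ')]
```

Keeping unmapped codes in a separate list, rather than dropping them silently, is one possible design choice; it lets the data quality problem be traced back to the source system.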


Non-Volatile (Read Mostly)
[Diagram: with OLTP, the user both writes and reads; with the DW, the user only reads]


Time Variant
Most business analysis has a time component
Trend analysis (historical data is required)
[Figure: Sales by year, 2001 to 2004]


Data Warehousing Architecture


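[Architecture diagram: data from external sources and operational databases is extracted, transformed, loaded, and refreshed into the data warehouse and data marts. The warehouse serves OLAP servers, which in turn support analysis, query/reporting, and data mining tools. A metadata repository and a monitoring & administration component sit alongside the warehouse.]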

Refreshing the Warehouse


Data Extraction
Data Cleaning
Data Transformation
  Convert from legacy/host format to warehouse format
Load
  Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
Refresh
  Bring new data from source systems

ETL Process Issues & Challenges


Consumes 70-80% of project time
Heterogeneous source systems
Little or no control over source systems
Source systems scattered
Source systems operating in different time zones
Different currencies
Different measurement units
Data not captured by OLTP systems
Ensuring data quality
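
Several of these challenges (time zones, currencies, measurement units) come down to normalizing each source record to one convention during the transform step. The sketch below illustrates that under stated assumptions: the record layout, the `normalize` helper, and the exchange rate are hypothetical and purely illustrative; only the pound-to-kilogram factor is a fixed definition.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical reference data; a real ETL job would load these from maintained tables.
RATE_TO_USD = {"USD": 1.0, "INR": 0.02}        # illustrative rate, not a real quote
KG_PER_UNIT = {"kg": 1.0, "lb": 0.45359237}    # 1 lb is defined as 0.45359237 kg

def normalize(record):
    """Convert one source record to UTC time, USD amounts, and kilograms."""
    offset = timezone(timedelta(hours=record["utc_offset_hours"]))
    local = datetime.fromisoformat(record["local_time"]).replace(tzinfo=offset)
    return {
        "sold_at_utc": local.astimezone(timezone.utc),
        "amount_usd": record["amount"] * RATE_TO_USD[record["currency"]],
        "weight_kg": record["weight"] * KG_PER_UNIT[record["weight_unit"]],
    }

row = normalize({
    "local_time": "2011-08-18T09:30:00",
    "utc_offset_hours": 5.5,      # source system runs on UTC+05:30
    "amount": 500.0,
    "currency": "INR",
    "weight": 2.0,
    "weight_unit": "lb",
})
print(row["sold_at_utc"].isoformat())  # 2011-08-18T04:00:00+00:00
```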

Data Staging Area


A storage area where extracted data is
  Cleaned
  Transformed
  Deduplicated
Initial storage for data
Need not be based on relational model
Spread over a number of machines
Mainly sorting and sequential processing
COBOL or C code running against flat files
Does not provide data access to users
Analogy: kitchen of a restaurant

Presentation Servers
A target physical machine on which DW data is organized for direct querying by end users using
  OLAP
  Report writers
  Data visualization tools
  Data mining tools
Data stored in dimensional framework
Analogy: sitting area of a restaurant

Data Cleaning
Why?
  Data warehouse contains data that is analyzed for business decisions
  More data and multiple sources could mean more errors in the data, and such errors are harder to trace
  Results in incorrect analysis
Detecting data anomalies and rectifying them early has huge payoffs
Long-term solution
  Change business practices and data entry tools

Soundex Algorithms
Misspelled terms
  For example: NAMES
Phonetic algorithms can find similar-sounding names
Based on the six phonetic classifications of human speech sounds
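
As a rough illustration of how a phonetic code groups misspelled names, here is a minimal sketch of the commonly described American Soundex variant. The slide does not say which variant is meant, so the exact rules below (keep the first letter, map the six consonant groups to digits 1 to 6, skip repeated codes, ignore h and w, pad to four characters) are an assumption.

```python
# Six consonant groups, one digit each; vowels, y, h, and w carry no digit.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(name):
    """Return a 4-character Soundex code, or '' if the name has no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    prev = CODES.get(first, "")      # code of the previous significant letter
    for ch in letters[1:]:
        if ch in "hw":
            continue                 # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code                  # a vowel resets prev, so a repeat is coded again
    return (first.upper() + "".join(encoded) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Smith"), soundex("Smyth"))    # S530 S530
```

Names that share a code are only candidates for being the same person or term; a cleaning step would typically confirm the match with other attributes before merging records.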

Data Warehouse Design


OLTP systems are data capture systems
  DATA IN systems
DW are DATA OUT systems
[Diagram: OLTP (data in) versus DW (data out)]

Analyzing the DATA


Active Analysis: User Queries
  User-guided data analysis
  Show me how X varies with Y
  OLAP

Automated Analysis: Data Mining
  What's in there?
  Set the computer FREE on your data
  Supervised learning (classification)
  Unsupervised learning (clustering)


OLAP Queries
How much of product P1 was sold in 1999, state-wise?
Top 5 selling products in 2002
Total sales in Q1 of FY 2002-03?
Color-wise sales figures of cars from 2000 to 2003
Model-wise sales of cars for the month of Jan from 2000 to 2003
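
Two of these queries can be sketched in a few lines to show their common shape: filter on some dimensions, group by another, and sum a measure. The sales records, state names, and helper functions below are hypothetical, made up only for illustration.

```python
from collections import Counter

# Hypothetical sales records: (year, state, product, units_sold).
sales = [
    (2002, "Rajasthan", "P1", 40),
    (2002, "Gujarat", "P2", 75),
    (2002, "Rajasthan", "P3", 20),
    (2002, "Gujarat", "P1", 25),
    (1999, "Rajasthan", "P1", 10),
    (1999, "Gujarat", "P1", 30),
]

def top_products(records, year, n=5):
    """Top n products by units sold in the given year."""
    totals = Counter()
    for rec_year, _state, product, units in records:
        if rec_year == year:
            totals[product] += units
    return totals.most_common(n)

def sales_by_state(records, product, year):
    """Units of one product sold in one year, broken down by state."""
    totals = Counter()
    for rec_year, state, rec_product, units in records:
        if rec_year == year and rec_product == product:
            totals[state] += units
    return dict(totals)

print(top_products(sales, 2002))          # [('P2', 75), ('P1', 65), ('P3', 20)]
print(sales_by_state(sales, "P1", 1999))  # {'Rajasthan': 10, 'Gujarat': 30}
```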


Data Mining Investigations


Which types of customers are most likely to spend the most with us in the coming year?
What additional products are most likely to be sold to customers who buy sportswear?
In which area should we open a new store in the next year?
What are the characteristics of customers most likely to default on their loans before the year is out?


Continuum of Analysis
[Diagram: a continuum of techniques running from SQL to specialized algorithms]
OLTP: primitive & canned analysis
OLAP: complex ad-hoc analysis
Data Mining: automated analysis


Design Requirements
Design of the DW must directly reflect the way the managers look at the business
Should capture the measurements of importance along with the parameters by which these measurements are viewed
It must facilitate data analysis, i.e., answering business questions

ER Modeling
A logical design technique that seeks to eliminate data redundancy
Illuminates the microscopic relationships among data elements
Perfect for OLTP systems
Responsible for success of transaction processing in relational databases


Problems with ER Model


ER models are NOT suitable for DW?
End user cannot understand or remember an ER model
Many DWs have failed because of overly complex ER designs
Not optimized for complex, ad-hoc queries
Data retrieval becomes difficult due to normalization
Browsing becomes difficult

ER vs Dimensional Modeling
ER models are constituted to
  Remove redundant data (normalization)
  Facilitate retrieval of individual records having certain critical identifiers
  Thereby optimizing OLTP performance

Dimensional model supports the reporting and analytical needs of a data warehouse system.


Dimensional Modeling: Salient Features


Represents data in a standard framework
Framework is easily understandable by end users
Contains same information as ER model
Packages data in symmetric format
Resilient to change
Facilitates data retrieval/analysis


Dimensional Modeling: Vocabulary


Measures or facts
  Facts are numeric & additive
  For example: Sale Amount, Sale Units
Factors or dimensions
Star Schemas
Snowflake & Starflake Schemas

Sales Amt = f(Product, Location, Time)
  Sales Amt is the fact; Product, Location, and Time are the dimensions

Star Schema
[Diagram: the Sales Fact Table at the center, linked by foreign keys (FK) to four dimension tables: Product, Location, Time, and Promotion]

Dimensional Modeling
Facts are stored in FACT tables
Dimensions are stored in DIMENSION tables
Dimension tables contain textual descriptors of the business
Fact and dimension tables form a Star Schema
BIG fact table in the center surrounded by SMALL dimension tables

The Classic Star Schema


Fact Table
  STORE KEY
  PRODUCT KEY
  PERIOD KEY
  Dollars_sold
  Units
  Dollars_cost

Store Dimension
  STORE KEY
  Store Description
  City
  State
  District ID
  District Desc.
  Region_ID
  Region Desc.
  Regional Mgr.

Time Dimension
  PERIOD KEY
  Period Desc
  Year
  Quarter
  Month
  Day

Product Dimension
  PRODUCT KEY
  Product Desc.
  Brand
  Color
  Size
  Manufacturer
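
To make the star join concrete, the sketch below builds a cut-down version of this schema in an in-memory SQLite database and answers a state-wise sales question. The table names, the reduced column lists, and every data row are assumptions chosen for illustration; only the overall fact-plus-dimensions layout follows the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_desc TEXT, city TEXT, state TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT, brand TEXT);
CREATE TABLE time_dim    (period_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER,
                          month INTEGER, day INTEGER);
-- The fact table's key is the combination of its dimension keys.
CREATE TABLE sales_fact (
    store_key    INTEGER REFERENCES store_dim(store_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    period_key   INTEGER REFERENCES time_dim(period_key),
    dollars_sold REAL, units INTEGER, dollars_cost REAL,
    PRIMARY KEY (store_key, product_key, period_key)
);
-- Hypothetical sample rows.
INSERT INTO store_dim VALUES (1, 'Store A', 'Jaipur', 'Rajasthan'), (2, 'Store B', 'Surat', 'Gujarat');
INSERT INTO product_dim VALUES (1, 'P1', 'BrandX'), (2, 'P2', 'BrandY');
INSERT INTO time_dim VALUES (1, 1999, 1, 1, 15), (2, 1999, 2, 4, 10), (3, 2000, 1, 2, 1);
INSERT INTO sales_fact VALUES
    (1, 1, 1, 100.0, 10, 80.0),
    (1, 1, 2,  50.0,  5, 40.0),
    (2, 1, 1,  70.0,  7, 56.0),
    (2, 2, 1,  30.0,  3, 20.0),
    (1, 1, 3, 999.0, 99, 800.0);
""")

# Units of product P1 sold in 1999, state-wise: constrain on the dimensions,
# then aggregate the additive fact.
rows = conn.execute("""
    SELECT s.state, SUM(f.units) AS units_sold
    FROM sales_fact AS f
    JOIN store_dim   AS s ON s.store_key   = f.store_key
    JOIN product_dim AS p ON p.product_key = f.product_key
    JOIN time_dim    AS t ON t.period_key  = f.period_key
    WHERE p.product_desc = 'P1' AND t.year = 1999
    GROUP BY s.state
    ORDER BY s.state
""").fetchall()
print(rows)  # [('Gujarat', 7), ('Rajasthan', 15)]
```

Every dimension joins directly to the fact table, so the query shape stays the same whichever attributes are used for filtering or grouping.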


Fact Tables
Contains numerical measurements of the business
Each measurement is taken at the intersection of all dimensions
Intersection is the composite key
Represents many-to-many relationships between dimensions
Examples of facts
  Sale_amt, Units_sold, Cost, Customer_count

Dimension Tables
Contains attributes for dimensions
50 to 100 attributes common
Best attributes are textual and descriptive
DW is only as good as the dimension attributes
Contains hierarchical information, albeit redundantly
Entry points into the fact table


Types of Facts
Fully additive: additive across all dimensions
  Units_sold, Sales_amt

Semi-additive: additive across some dimensions only
  Account_balance, Customer_count
  28/3, tissue paper, store1, 25, 250, 20
  28/3, paper towel, store1, 35, 350, 30
  Is the number of customers who bought either tissue paper or paper towel 50? NO.

Non-additive: additive across no dimension
  Gross margin = Gross profit / Amount
  Note that GP and Amount are fully additive
  Use the ratio of the sums, not the sum of the ratios
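
The three cases can be checked with a small calculation on the two rows from the slide. Reading the last three numbers of each row as units, sales amount, and customer count is an assumption, and the customer IDs and gross-profit figures below are hypothetical values added only to make the arithmetic concrete.

```python
# Rows from the slide; the column meanings are an assumed reading:
# (date, product, store, units, amount, customer_count)
rows = [
    ("28/3", "tissue paper", "store1", 25, 250, 20),
    ("28/3", "paper towel", "store1", 35, 350, 30),
]

# Fully additive: units and amount can be summed across products.
total_units = sum(r[3] for r in rows)       # 60
total_amount = sum(r[4] for r in rows)      # 600

# Semi-additive: customer counts cannot be summed across products,
# because one customer may buy both. Hypothetical customer IDs:
tissue_buyers = set(range(1, 21))           # 20 customers
towel_buyers = set(range(11, 41))           # 30 customers, 10 of whom bought both
naive_count = sum(r[5] for r in rows)       # 50, which overcounts
distinct_count = len(tissue_buyers | towel_buyers)   # 40

# Non-additive: gross margin. Hypothetical gross profit per row:
gross_profit = [50, 35]
sum_of_ratios = sum(gp / r[4] for gp, r in zip(gross_profit, rows))   # 0.2 + 0.1
ratio_of_sums = sum(gross_profit) / total_amount                      # 85 / 600

print(naive_count, distinct_count)          # 50 40
print(round(sum_of_ratios, 4), round(ratio_of_sums, 4))  # 0.3 0.1417
```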


Data Warehouse: Design Steps


Step 1: Identify the Business Process
Step 2: Declare the Grain
Step 3: Identify the Dimensions
Step 4: Identify the Facts


Grocery Store: The Universal Example


The Scenario:
  Chain of 100 grocery stores
  60000 individual products in each store
  10000 of these products sold on any given day (average)
  3 years of data

Some Terms

SKU (Stock Keeping Unit)
UPC (Universal Product Code)
EPOS (Electronic Point of Sale)


What Management is Interested In?


Ordering logistics
Stocking shelves
Selling products
Maximize profits


Grocery Store DW
Step 1: Sales Business Process
Step 2: Daily Grain
A word about GRANULARITY
  Temp sensor data: per ms, sec, min, hr?
  Size of the DW is governed by granularity
  Daily grain (club products sold on a day for each store): aggregated data
  Receipt line grain (each line in the receipt is recorded): finest grain data
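
The relation between the two grains can be sketched as a roll-up: receipt lines are grouped by day, store, and product, and the transaction number drops out. The receipt-line layout, values, and the `to_daily_grain` helper below are hypothetical.

```python
from collections import defaultdict

# Hypothetical receipt-line records:
# (date, store, transaction_no, product, units, amount)
receipt_lines = [
    ("2011-08-18", "store1", 100, "tissue paper", 2, 20.0),
    ("2011-08-18", "store1", 101, "tissue paper", 1, 10.0),
    ("2011-08-18", "store1", 101, "paper towel", 3, 30.0),
    ("2011-08-18", "store2", 200, "tissue paper", 4, 40.0),
]

def to_daily_grain(lines):
    """Club receipt lines into one record per (date, store, product)."""
    daily = defaultdict(lambda: [0, 0.0])    # key -> [units, amount]
    for date, store, _txn, product, units, amount in lines:
        totals = daily[(date, store, product)]
        totals[0] += units
        totals[1] += amount
    return {key: tuple(value) for key, value in daily.items()}

daily = to_daily_grain(receipt_lines)
print(len(receipt_lines), len(daily))                        # 4 3
print(daily[("2011-08-18", "store1", "tissue paper")])       # (3, 30.0)
```

The roll-up cannot be undone: once the transaction number is gone, questions about which products were bought together can no longer be answered from the daily-grain table.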

Grocery Store: DW Size Estimate


Daily Grain
Size of Fact Table = 100 * 10000 * 3 * 365 = 1095 million records
3 facts & 4 dimensions (49 bytes)
1095 m * 49 bytes = 53655 m bytes, i.e., ~50 GB
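
The estimate can be reproduced step by step. All inputs come from the slides; the variable names are illustrative, and converting with 2**30 bytes per GB is an assumption about how the ~50 GB figure was rounded.

```python
stores = 100
products_sold_per_day = 10_000     # average per store, from the scenario
years = 3
days_per_year = 365
bytes_per_record = 49              # from the slide: 3 facts & 4 dimensions

records = stores * products_sold_per_day * years * days_per_year
total_bytes = records * bytes_per_record
size_gib = total_bytes / 2**30     # assumed conversion: 1 GB = 2**30 bytes

print(records)                     # 1095000000
print(total_bytes)                 # 53655000000
print(round(size_gib, 1))          # 50.0
```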


Facts for Grocery Store


1. Quantity sold (additive)
2. Dollar revenue (additive)
3. Dollar cost (additive)
4. Customer count (semi-additive, not additive along the product dimension)


Fact Table for Grocery Store


Field name            Example value  Description/Remarks
Date key (FK)         1              Surrogate key
Product key (FK)      1              Surrogate key
Store key (FK)        1              Surrogate key
EPOS transaction no.  100            Transaction number generated by the operational system to record sales
Sales quantity        2              No. of units bought by a customer
Sales amount          72             Amount received by selling 2 units
Cost amount           65             Cost price of 2 units


Market-Basket Analysis
What products are customers buying together?
  Beer & Diapers
  Polo Shirts & Barbie Dolls
How do we find this out?
  Market-Basket Analysis!
  Transaction No.
  Receipt Line Grain
  Degenerate Dimension
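
A very small sketch of the idea: with receipt-line grain, the transaction number groups lines into baskets, and pairs of products can then be counted across baskets. The receipt lines and the `pair_counts` helper are hypothetical, and this simple pair count is only a starting point, not a full association-rule algorithm.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical receipt lines: (transaction_no, product).
receipt_lines = [
    (100, "beer"), (100, "diapers"), (100, "chips"),
    (101, "beer"), (101, "diapers"),
    (102, "polo shirt"), (102, "barbie doll"),
    (103, "beer"),
]

def pair_counts(lines):
    """Count how many transactions contain each unordered pair of products."""
    baskets = defaultdict(set)
    for txn, product in lines:
        baskets[txn].add(product)        # the transaction no. defines the basket
    counts = Counter()
    for items in baskets.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(receipt_lines)
print(counts.most_common(1))  # [(('beer', 'diapers'), 2)]
```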

Promotion Dimension
Causal Dimension
  Causal: that which causes, or being the cause
Promotion conditions include
  TPRs (temporary price reductions)
  End-aisle displays
  Newspaper ads
  Coupons
Combinations are common


Q&A

Thank You
