
An Introduction to Data Warehousing

Course Outcomes:
Understand the need for data warehousing and data mining.
Design a data warehouse to support a business problem.
Apply different algorithms to large databases to solve problems and to support strategic business decisions.

Evaluation Scheme for DWM

Internal
Test 1: 15
Test 2: 15
Class + Lab Work (Term Work): 20
Internal marks: 50
Weightage (%): 40

Semester End Exam
Marks: 100
Weightage (%): 60

Total: 100

Term Work (20 marks)

Term work marks will be awarded on the basis of:
Student performance throughout the course (DWM) in lectures and laboratory work
Student performance in assignments, mock tests, internal and external viva, presentations, etc.

Why Data Warehousing?

A producer wants to know:

Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
What is the most effective distribution channel?
What product promotions have the biggest impact on revenue?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?

Data, Data everywhere, yet ...

I can't find the data I need
data is scattered over the network
many versions, subtle differences

I can't get the data I need
need an expert to get the data

I can't understand the data I found
available data poorly documented

I can't use the data I found
results are unexpected
data needs to be transformed from one form to another

Data Warehouse

A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.

[Barry Devlin]

What are the users saying...

Data should be integrated across the enterprise
Summary data has a real value to the organization
Historical data holds the key to understanding data over time
What-if capabilities are required

What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference.

[Forrester Research, April 1996]

Data Warehouse: Formal Definition

A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.

- W. H. Inmon

W. H. Inmon is regarded as the father of data warehousing.

Subject-Oriented - Characteristics of a Data Warehouse

Focus is on Subject Areas rather than Applications

Operational: Leads, Prospects, Quotes, Orders
Data Warehouse: Customers, Products, Regions, Time

Integrated - Characteristics of a Data Warehouse

Integrated View Is The Essence Of A Data Warehouse

Appl A - m,f
Appl B - 1,0
Appl C - male,female
Data Warehouse - m,f

Appl A - balance dec fixed (13,2)
Appl B - balance pic 9(9)V99
Appl C - balance pic S9(7)V99 comp-3
Data Warehouse - balance dec fixed (13,2)

Appl A - bal-on-hand
Appl B - current-balance
Appl C - cash-on-hand
Data Warehouse - Current balance

Appl A - date (julian)
Appl B - date (yymmdd)
Appl C - date (absolute)
Data Warehouse - date (julian)
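
The integration idea on this slide can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the record layout, the `integrate` function, and the reading of Appl B's `1,0` as male, female (taken from the order in which the slide lists the codes) are all hypothetical and are not part of the original slides.

```python
# Hypothetical per-application gender encodings, based on the slide's examples.
# Assumption: Appl B's "1,0" means male,female, following the slide's ordering.
GENDER_MAPS = {
    "A": {"m": "m", "f": "f"},
    "B": {"1": "m", "0": "f"},
    "C": {"male": "m", "female": "f"},
}

# Each application names the balance field differently (names from the slide).
BALANCE_FIELDS = {
    "A": "bal-on-hand",
    "B": "current-balance",
    "C": "cash-on-hand",
}


def integrate(app: str, record: dict) -> dict:
    """Convert one source record to the warehouse's single agreed convention."""
    gender_raw = str(record["gender"]).strip().lower()
    return {
        "gender": GENDER_MAPS[app][gender_raw],
        # Two decimal places, loosely mirroring "dec fixed (13,2)".
        "current_balance": round(float(record[BALANCE_FIELDS[app]]), 2),
    }


rows = [
    integrate("A", {"gender": "m", "bal-on-hand": "120.50"}),
    integrate("B", {"gender": 0, "current-balance": 75}),
    integrate("C", {"gender": "Female", "cash-on-hand": 10.5}),
]
```

Whatever the source application, every row ends up with the same field names and the same codes, which is the property the slide calls the essence of a data warehouse.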

Non-volatile - Characteristics of a Data Warehouse

Data Warehouse Is Relatively Static In Nature

Operational: insert, change, delete, replace
Data Warehouse: load, read only access

Time Variant - Characteristics of a Data Warehouse

Data Warehouse Typically Spans Across Time

Operational
Current value data
Time horizon: 60-90 days

Data Warehouse
Snapshot data
Time horizon: 5-10 years
Data warehouse stores historical data
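
A minimal Python sketch of the difference between current-value data and snapshot data follows. The account identifier, the `record_balance` and `balance_as_of` helpers, and the in-memory structures are hypothetical and serve only to illustrate the idea; a real warehouse would load snapshots in batches rather than on every update.

```python
from datetime import date

operational = {}  # operational system: keeps only the current value
warehouse = []    # warehouse: append-only list of dated snapshots


def record_balance(account: str, balance: float, as_of: date) -> None:
    """Overwrite the operational value and append a warehouse snapshot."""
    operational[account] = balance
    warehouse.append({"account": account, "balance": balance, "as_of": as_of})


def balance_as_of(account: str, when: date):
    """Return the latest snapshot balance on or before the given date."""
    history = [
        s for s in warehouse
        if s["account"] == account and s["as_of"] <= when
    ]
    if not history:
        return None
    return max(history, key=lambda s: s["as_of"])["balance"]


record_balance("ACC-1", 100.0, date(2023, 1, 31))
record_balance("ACC-1", 150.0, date(2023, 2, 28))
```

After the two updates the operational store knows only the latest balance, while the warehouse can still answer what the balance was at an earlier point in time.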

Alternate Definitions

A collection of integrated, subject oriented databases designed to support the DSS function, where each unit of data is relevant to some moment of time.

- Imhoff

Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user oriented data access and reporting tools let users get at the data for decision support.

- Babcock

Evolution of Data Warehousing

1960 - 1985 : MIS Era
Unfriendly
Slow
Dependent on IS programmers
Inflexible
Analysis limited to defined reports
Focus on Reporting

1985 - 1990 : Querying Era
Queries that are formulated by the user on the spur of the moment
Ad hoc, unstructured access to corporate data
SQL as interface not scalable
Cannot handle complex analysis
Focus on Online Querying

1990 - 20xx : Analysis Era
Trend Analysis
What If ?
Cross Dimensional Comparisons
Statistical profiles
Automated pattern and rule discovery
Focus on Online Analysis

Need for Data Warehousing

Better business intelligence for end-users

Reduction in time to locate, access, and analyze information

Consolidation of disparate information sources

Strategic advantage over competitors

Faster time-to-market for products and services

Replacement of older, less-responsive decision support systems

Reduction in demand on IS to generate reports

Typical Business Queries

Which product generated maximum revenue over the last two quarters in a chosen geographical region, city-wise, relative to the previous version of the product, and compared with the plan?

What percent of customers procure product A with B in a chosen region, broken down by city, season, and income group?
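
To make the second query concrete, here is a small Python sketch. The purchase records and the `percent_buying_a_with_b` function are hypothetical, only the city breakdown is shown, and the query is interpreted as "of the customers who bought A, what percent also bought B", which is an assumption about the slide's wording.

```python
from collections import defaultdict

# Hypothetical purchase records: (customer_id, city, product)
purchases = [
    (1, "Pune", "A"), (1, "Pune", "B"),
    (2, "Pune", "A"),
    (3, "Mumbai", "A"), (3, "Mumbai", "B"),
    (4, "Mumbai", "B"),
]


def percent_buying_a_with_b(rows):
    """Per city: percent of product-A customers who also bought product B."""
    baskets = defaultdict(set)  # (city, customer) -> set of products bought
    for customer, city, product in rows:
        baskets[(city, customer)].add(product)

    bought_a = defaultdict(int)
    bought_both = defaultdict(int)
    for (city, _customer), products in baskets.items():
        if "A" in products:
            bought_a[city] += 1
            if "B" in products:
                bought_both[city] += 1

    return {city: 100.0 * bought_both[city] / n for city, n in bought_a.items()}


result = percent_buying_a_with_b(purchases)
```

Queries of this kind scan and group many records at once, which is why they are a poor fit for systems tuned for single-record transactions.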

OLTP Systems Vs Data Warehouse

Remember: between OLTP and Data Warehouse systems
users are different
data content is different
data structures are different
hardware is different

Understanding The Differences Is The Key

OLTP Vs Warehouse

Operational System | Data Warehouse
Transaction Processing | Query Processing
Predictable CPU Usage | Random CPU Usage
Time Sensitive | History Oriented
Operator View | Managerial View
Normalized Efficient Design for TP | Denormalized Design for Query Processing

OLTP Vs Warehouse

Operational System | Data Warehouse
Designed for Atomicity, Consistency, Isolation and Durability | Designed for quiet or static database
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile Data | Non Volatile Data

OLTP Vs Warehouse

Operational System | Data Warehouse
Stores all data | Stores relevant data
Performance Sensitive | Less Sensitive to performance
Not Flexible | Flexible
Efficiency | Effectiveness
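
The contrast between transaction processing and query processing can be sketched with Python's built-in `sqlite3` module. The `orders` table, its columns, and the sample rows are hypothetical, and a single table is used for both workloads purely to keep the sketch short; in practice the two would live in separately configured systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT, product TEXT, "
    "year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Asha", "Widget", 2022, 100.0),
        (2, "Ravi", "Widget", 2023, 150.0),
        (3, "Asha", "Gadget", 2023, 200.0),
        (4, "Ravi", "Gadget", 2023, 50.0),
    ],
)

# OLTP-style access: fetch one individual record by its key.
oltp_row = conn.execute(
    "SELECT customer, amount FROM orders WHERE order_id = ?", (3,)
).fetchone()

# Warehouse-style access: scan many records and summarize by subject and time.
dw_rows = conn.execute(
    "SELECT product, year, SUM(amount) FROM orders "
    "GROUP BY product, year ORDER BY product, year"
).fetchall()

conn.close()
```

The first query touches a single row and has a predictable cost; the second touches every row, and its cost grows with the amount of history kept.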

Examples Of Some Applications


Manufacturers

Retailers

Target Marketing

Market Segmentation

Budgeting

Credit Rating Agencies

Financial Reporting and Consolidation

Market Basket Analysis

Profitability Management

Event tracking

Customers

Do we need a separate database?

OLTP and data warehousing require two very differently configured systems
Isolation of Production System from Business Intelligence System
Significant and highly variable resource demands of the data warehouse
Cost of disk space no longer a concern

Data Marts

Enterprise wide data warehousing projects have a very large cycle time
Getting consensus between multiple parties may also be difficult
Departments may not be satisfied with priority accorded to them
Sometimes individual departmental needs may be strong enough to warrant a local implementation
Application/database distribution is also an important factor

Data Marts

A Logical Subset of The Complete Data Warehouse
Subject or Application Oriented Business View of Warehouse
Finance, Manufacturing, Sales etc.
Smaller amount of data used for Analytic Processing
Address a single business process

Data Marts

Data Warehouse and Data Mart

Aspect | Data Warehouse | Data Marts
Scope | Application neutral; centralized, shared | Specific application requirement; business process oriented
Data perspective | Historical detailed data; some summary | Detailed (some history); summarized
Subjects | Multiple subject areas | Single partial subject; multiple partial subjects
Data sources | Many; operational/external data | Few; operational, external data; OLTP snapshots
Implementation time frame | 9-18 months for first stage; multiple stage implementation | 4-12 months
Characteristics | Flexible, extensible; durable/strategic; data orientation | Restrictive, non-extensible; short life/tactical; project orientation

Warehouse or Mart First?

Data Warehouse First | Data Mart First
Expensive | Relatively cheap
Large development cycle | Delivered in < 6 months
Change management is difficult | Easy to manage change
Difficult to obtain continuous corporate support | Can lead to independent and incompatible marts
Technical challenges in building large databases | Cleansing, transformation, modeling techniques may be incompatible

Different kinds of Information Needs

Current: Is this medicine available in stock?
Recent: What are the tests this patient has completed so far?
Historical: Has the incidence of Tuberculosis increased in the last 5 years in the Southern region?

Operational Data Store - Definition

A subject oriented, integrated, volatile, current valued data store containing only corporate detailed data.

Data from multiple sources is integrated for a subject. Example: "Can I see credit report from Accounts, sales from Marketing and open order report from Order Entry for this customer?"
Identical queries may give different results at different times. Supports analysis requiring current data.
Data is stored only for the current period. Old data is either archived or moved to the Data Warehouse.

Operational Data Store

Increasingly becoming integrated with the data warehouse
Are nothing but more responsive real time data warehouses
Data Mining has anyway forced Data Warehouses to store transactional level data

OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Audience | Operating personnel | Analysts | Managers and analysts
Data access | Individual records, transaction driven | Individual records, transaction or analysis driven | Set of records, analysis driven
Data content | Current, real-time | Current and near-current | Historical
Data granularity | Detailed | Detailed and lightly summarized | Summarized and derived
Data organization | Functional | Subject-oriented | Subject-oriented
Data quality | All application specific detailed data needed to support a business activity | All integrated data needed to support a business activity | Data relevant to management information needs

OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data stability | Dynamic | Somewhat dynamic | Static
Data update | Field by field | Field by field | Controlled batch
Data usage | Highly structured, repetitive | Somewhat structured, some analytical | Highly unstructured, heuristic or analytical
Database size | Moderate | Moderate | Large to very large
Database structure stability | Stable | Somewhat stable | Dynamic

OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Development methodology | Requirements driven, structured | Data driven, somewhat evolutionary | Data driven, evolutionary
Operational priorities | Performance and availability | Availability | Access flexibility and end user autonomy
Philosophy | Support day-to-day operation | Support day-to-day decisions & operational activities | Support managing the enterprise
Predictability | Stable | Mostly stable, some unpredictability | Unpredictable
Response time | Sub-second | Seconds to minutes | Seconds to minutes
Return set | Small amount of data | Small to medium amount of data | Small to large amount of data

TOP DOWN APPROACH (The Dependent Data Mart Structure)

[Figure: Multi-tiered Data Warehouse without ODS. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Data Warehouse; Data Marts; Metadata; Middleware/API; EIS/DSS; Query Tools; OLAP/ROLAP; Web Browsers; Data Mining.]

Typical Data Warehouse Architecture

[Figure: Multi-tiered Data Warehouse with ODS. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Load, Maintain); ODS; Data Warehouse; Data Marts; Metadata.]

TOP DOWN APPROACH (The Dependent Data Mart Structure)

Data is extracted from the operational data sources.
Data is loaded into the staging area, where it is validated and consolidated, and then transferred to the Operational Data Store (ODS).
Data is also loaded into the data warehouse in a parallel process to avoid extracting it from the ODS.
Detailed data is regularly extracted from the ODS and temporarily hosted in the staging area for aggregation and summarization, and is then extracted and loaded into the data warehouse.
Once the data warehouse aggregation and summarization processes are complete, the data mart refresh cycles extract the data from the data warehouse into the staging area and perform a new set of transformations on it.
The data marts can then be loaded with the data, and the OLAP environment becomes available to the users.
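
The top-down flow described on these slides can be sketched in Python as a chain of small functions. This is a simplified sketch, not a definitive implementation: the function names, the record fields (`region`, `amount`), and the sample sources are hypothetical, and the parallel load into the warehouse is omitted for brevity.

```python
from collections import defaultdict


def extract(sources):
    """Pull raw rows from every operational source into one staging list."""
    return [row for rows in sources.values() for row in rows]


def validate_and_consolidate(staged):
    """Drop rows with missing amounts and normalize region names (ODS load)."""
    return [
        {**row, "region": row["region"].strip().title()}
        for row in staged
        if row.get("amount") is not None
    ]


def aggregate(detail_rows):
    """Summarize detailed ODS rows by region for the warehouse."""
    totals = defaultdict(float)
    for row in detail_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


def refresh_mart(warehouse, regions):
    """Build a dependent data mart as a subset of the warehouse."""
    return {r: warehouse[r] for r in regions if r in warehouse}


sources = {
    "billing": [{"region": " south ", "amount": 10.0},
                {"region": "north", "amount": None}],
    "orders": [{"region": "South", "amount": 5.0},
               {"region": "north", "amount": 7.5}],
}

staging = extract(sources)
ods = validate_and_consolidate(staging)
warehouse = aggregate(ods)
sales_mart = refresh_mart(warehouse, ["South"])
```

Because the mart is derived from the warehouse rather than from the sources directly, every mart sees the same cleaned and consolidated figures, which is what makes it a dependent data mart.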

BOTTOM UP APPROACH

[Figure: Multi-tiered Data Warehouse without ODS. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Data Marts; Data Warehouse; Metadata; Middleware/API; EIS/DSS; Query Tools; OLAP/ROLAP; Web Browsers; Data Mining.]

A Practical Approach

The steps in the practical approach are:
1. Plan and define requirements at the overall corporate level.
2. Create an architecture for a complete warehouse.
3. Conform and standardize the data content.
4. Implement the data warehouse as a series of supermarts, one at a time.

Benefits of DWH
These capabilities empower the corporate...
To formulate effective business, marketing
and sales strategies.
To precisely target promotional activity.
To discover and penetrate new markets.
To successfully compete in the marketplace
from a position of informed strength.
To build predictive models.

Warehouse Architecture - 1

[Figure: Enterprise Data Warehouse. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Data Warehouse; Metadata; Middleware/API; EIS/DSS; Query Tools; OLAP/ROLAP; Web Browsers; Data Mining.]

Warehouse Architecture - 2

[Figure: Single Department Data Mart. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); three Data Marts, each with its own Metadata; Middleware/API; EIS/DSS; Query Tools; OLAP/ROLAP; Web Browsers; Data Mining.]

Warehouse Architecture - 3

[Figure: Multi-tiered Data Warehouse. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Operational Data Store; Data Warehouse; Data Marts; Metadata; Middleware/API; EIS/DSS; Query Tools; OLAP/ROLAP; Web Browsers; Data Mining.]

Data Warehouse Architectures


There are three schools of thought about DW
architectures
One supports Dimensional Modeling all through
(Ralph Kimball)
Second supports ER for Data Warehouse and Star
Schemas for Data Marts
Third supports ER model for DW

Kimball's View

[Figure: Multiple Data Marts With Conformed Dimensions. Components shown: Operational Systems; Staging Area; Presentation Server; LAN; Data Warehouse Server.]

Each star is a data mart and has both summary and detail data.
The DW is the sum total of all data marts.
DW bus using conformed dimensions.

Data Warehouse Server processes: Extract, Scrubbing, Transformation, Load Jobs, Aggregation Jobs, Replication, Monitoring, Management, Meta Data Repository, Meta Data Population, Meta Data Maintenance

Inmon's View

[Figure: Data Warehouse (ER) Feeding Multiple Data Marts (Star Schema). Components shown: Operational Systems; Staging Area; Data Warehouse; Data Marts; LAN; Data Warehouse Server.]

Detail data in ER format.
Summarized data in star formats.

Data Warehouse Server processes: Extract, Scrubbing, Transformation, Load Jobs, Aggregation Jobs, Replication, Monitoring, Management, Meta Data Repository, Meta Data Population, Meta Data Maintenance

Components of a Data Warehouse Architecture

[Figure: Components shown: Source Databases; Data Cleansing Tools; Data Modeling Tool; ETL Tool; Central Metadata; Central Warehouse (RDBMS); Warehouse Admin Tool; Architected Datamarts (RDBMS, MDDB) with Local metadata; ROLAP Engine; Data Access and Analysis Tools (Managed Query, Desktop OLAP, ROLAP, MOLAP, Data Mining).]

Data Warehouse Is Not Just About Data... But Tools Too

Components of a Data Warehouse Architecture

Source Data Component

Data Staging Component

Data Storage Component

Information Delivery Component

Metadata Component

Management and Control Component

Source Data Component

The source data component provides the necessary data for the data warehouse.
Typically, the source data for the warehouse comes from the operational applications.

Data Staging Component

As the data enters the warehouse, it is cleaned up and transformed into an integrated structure and format.
The transformation process may involve conversion, summarization, filtering and condensation of data.

The functionality includes:
Removing unwanted data from operational databases.
Converting to common data names and definitions.
Establishing defaults for missing data.
Accommodating source data definition changes.
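
Three of the staging functions listed above (removing unwanted data, converting to common names, and establishing defaults) can be sketched in a few lines of Python. All field names, the mapping tables, and the `stage` function are hypothetical and chosen only for illustration.

```python
# Hypothetical mapping from source column names to common warehouse names.
COMMON_NAMES = {
    "cust_nm": "customer_name",
    "custName": "customer_name",
    "bal": "balance",
}

# Hypothetical defaults applied when a value is missing.
DEFAULTS = {"customer_name": "UNKNOWN", "balance": 0.0}

# Hypothetical operational-only fields the warehouse does not need.
UNWANTED = {"screen_id", "operator_note"}


def stage(record: dict) -> dict:
    """Clean one source record on its way into the warehouse."""
    cleaned = {}
    for key, value in record.items():
        if key in UNWANTED:
            continue  # remove unwanted operational data
        cleaned[COMMON_NAMES.get(key, key)] = value  # convert to common names
    for field, default in DEFAULTS.items():
        if cleaned.get(field) in (None, ""):
            cleaned[field] = default  # establish defaults for missing data
    return cleaned


staged = stage({"cust_nm": "Asha", "bal": None, "screen_id": 7})
```

Keeping the name mapping in a table rather than in code is one way to accommodate source data definition changes: when a source renames a column, only the mapping needs a new entry.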

Data Storage Component

The central data warehouse database is the cornerstone of the data warehousing environment.
This database is almost always implemented on relational database management system (RDBMS) technology.

Information Delivery Component

The principal purpose of data warehousing is to provide information to business users for strategic decision-making. These users interact with the data warehouse using front-end tools.
Delivery of information may be based on time of day or on the completion of an external event.

Metadata Component

Meta data is data about data that describes the data warehouse. It is used for building, maintaining, managing and using the data warehouse. Meta data can be classified into:

Technical meta data, which contains information about warehouse data for use by warehouse designers and administrators.
Business meta data, which contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse.

Management and Control Component

Data warehouses tend to be as much as 4 times as large as related operational databases, reaching terabytes in size depending on how much history needs to be saved.

Web Enabled Data Warehouse

Web-enabling the data warehouse means using the web for information delivery and integrating the click stream data from the corporate web site for analysis.

In order to transform a data warehouse into a Web-enabled data warehouse, first bring the data warehouse to the Web, and secondly bring the Web to the data warehouse.

When you bring your data warehouse to the Web, from the point of view of the users, the key requirements are: self-service data access, interactive analysis, high availability and performance.

Bringing the Web to the warehouse essentially involves capturing the click stream of all the visitors to your company's Web site and performing all the traditional data warehousing functions.

A Web-enabled data warehouse uses the Web for information delivery and collaboration among users.
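
As a rough illustration of bringing the Web to the warehouse, the following Python sketch summarizes a handful of click events. The event layout (visitor id and page) and the sample data are hypothetical; a real click stream would be extracted from web server logs and would carry far more detail.

```python
from collections import Counter

# Hypothetical click events captured from the web site: (visitor_id, page)
clicks = [
    ("v1", "/home"),
    ("v1", "/products"),
    ("v2", "/home"),
    ("v2", "/home"),
]

# Total page views per page.
page_views = Counter(page for _visitor, page in clicks)

# Number of distinct visitors per page.
unique_visitors = {
    page: len({visitor for visitor, p in clicks if p == page})
    for page in page_views
}
```

Once summarized in this way, click stream data can be loaded and analyzed like any other warehouse subject area.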

Security issues in a web enabled data warehouse

The security issues at different levels:
1) At the network level, solutions are needed that support data encryption and restricted transfer mechanisms.
2) At the application level, the security system must manage authorizations on who is allowed to get into the application and what each user is allowed to access.
3) A malicious user who has access to secure information is a great threat to the data warehouse; this aspect also needs to be addressed from a security point of view.

PLANNING YOUR DATA WAREHOUSE

Key Issues

Value and Expectations

Risk Assessment

Top-down or Bottom-up

Build or Buy

Single Vendor or Best-of-Breed

Choosing a single vendor solution has a few advantages:

High level of integration among the tools

Constant look and feel

Seamless cooperation among components

Centrally managed information exchange

Overall price negotiable

Major advantages of the best-of-breed solution that combines products from multiple vendors:

Could build an environment to fit your organization
No need to compromise between database and support tools
Select products best suited for the function

Let business requirements drive your data warehouse, not technology.

Preliminary survey

Mission and functions of each user group

Computer systems used by the group

Key performance indicators

Factors affecting success of the user group

Who the customers are and how they are classified

Types of data tracked for the customers, individually and groups

Products manufactured or sold

Categorization of products and services

Locations where business is conducted

Levels at which profits are measured: per customer, per product, per district

Levels of cost details and revenue

Current queries and reports for strategic information

Other important issues.

Top Management Support

Justifying Your Data Warehouse


Calculate the current technology costs
Calculate the business value of the proposed data warehouse
Identify all the components that will be affected by the proposed data warehouse and those that will affect the data warehouse

REQUIREMENTS GATHERING METHODS

Interviews

Group Sessions

Current Information Sources

Subject Areas

Key Performance Metrics

REQUIREMENTS DEFINITION: SCOPE AND CONTENT (1)

Data Sources
Available data sources

Data structures within the data sources

Location of the data sources

Operating systems, networks, protocols, and client architectures

Data extraction procedures

Availability of historical data

REQUIREMENTS DEFINITION: SCOPE AND CONTENT (2)

Data Transformation

Data Storage

Information Delivery
Drill-down analysis

Roll-up analysis

Drill-through analysis

Slicing and dicing analysis

Ad hoc reports
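
The analysis styles listed under Information Delivery can be illustrated with a tiny Python sketch. The fact rows, the dimension names in `DIMS`, and the `summarize` helper are hypothetical; the sketch assumes a single sales measure and shows roll-up, drill-down, and slicing only.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, sales)
facts = [
    (2023, "Q1", "South", "Widget", 100),
    (2023, "Q1", "North", "Widget", 80),
    (2023, "Q2", "South", "Gadget", 50),
    (2024, "Q1", "South", "Widget", 120),
]
DIMS = ("year", "quarter", "region", "product")


def summarize(rows, by, where=None):
    """Total sales grouped by the dimensions in `by`, filtered by `where`."""
    where = where or {}
    idx = {name: i for i, name in enumerate(DIMS)}
    totals = defaultdict(int)
    for row in rows:
        if all(row[idx[dim]] == value for dim, value in where.items()):
            totals[tuple(row[idx[dim]] for dim in by)] += row[-1]
    return dict(totals)


# Roll-up: view sales at the coarser year level.
roll_up = summarize(facts, by=("year",))

# Drill-down: break each year into quarters.
drill_down = summarize(facts, by=("year", "quarter"))

# Slice: fix one dimension value (region = South) and look at products.
slice_south = summarize(facts, by=("product",), where={"region": "South"})
```

Roll-up and drill-down differ only in how many dimensions are kept in the grouping, while a slice fixes one dimension to a single value before summarizing.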

Requirements Definition Document Outline


Introduction.
General requirements descriptions.
Specific requirements.
Information packages.

User expectations.
User participation and sign-off
General implementation plan.

Trends In Data Warehousing

Multiple Data Types

Data Visualization

Parallel Processing

Query Tools

Browser Tools

Data Fusion

Trends..

Multidimensional Analysis

Data Warehousing and ERP

Data Warehousing and KM

Data Warehousing and CRM
