
Recent Developments in Data Warehousing

Hugh J. Watson
Terry College of Business
University of Georgia
hwatson@terry.uga.edu
http://www.terry.uga.edu/~hwatson/dw_tutorial.ppt

Tutorial Objectives
Provide an overview of data warehousing
Provide materials to support the teaching of data warehousing
Discuss recent developments in data warehousing

The Importance of Data Warehousing


Provide a single version of the truth
Improve decision making
Support key corporate initiatives such as performance management, B2C and B2B e-commerce, and customer relationship management
Estimated to be a $113.5 billion market in 2002 for systems, software, services, and in-house expenditures (Palo Alto Management Group)

A Simple Definition
A data warehouse is a collection of data created to support decision-making applications.

Data Warehouse Characteristics


Subject oriented -- data are organized around sales, products, etc.
Integrated -- data are integrated to provide a comprehensive view
Time variant -- historical data are maintained
Nonvolatile -- data are not updated by users

Another Definition
Data warehousing is the entire process of data extraction, transformation, and loading of data to the warehouse and the access of the data by end users and applications.

Data Mart
A data mart stores data for a limited number of subject areas, such as marketing and sales data. It is used to support specific applications. An independent data mart is created directly from source systems. A dependent data mart is populated from a data warehouse.

Operational Data Store


An operational data store consolidates data from multiple source systems and provides a near real-time, integrated view of volatile, current data.
Its purpose is to provide integrated data for operational purposes.
It has add, change, and delete functionality.
It may be created to avoid a full-blown ERP implementation.

[Figure: data warehousing architecture. Data sources (transaction data, other internal data such as ERP and Web clickstream data, and external data such as demographic data) feed ETL software (extract, clean/scrub, transform, load) through a staging area or operational data store into the data stores (data warehouse, data marts, meta data). Data analysis tools and applications (queries, reporting, DSS/EIS, data mining, Web browser) serve the users: analysts, managers, executives, operational personnel, and customers/suppliers. Vendor and product names shown in the diagram include IBM, IMS, VSAM, Oracle, Sybase, Informix, SAP, Ascential, Informatica, Sagent, SAS, Firstlogic, Harte-Hanks, Teradata, Essbase, Microsoft, Cognos, MicroStrategy, Siebel, and Business Objects.]

Topics Covered

Definitions and concepts
Two case studies: Harrah's Entertainment (first) and Owens & Minor (last)
The data mart and enterprise-wide data warehouse strategies
Data extraction, cleansing, transformation and loading
Meta data
Data stores
Online analytical processing (OLAP)
Warehouse users, tools, and applications

Harrah's Entertainment

Harrah's Entertainment -- data warehousing supported a successful shift to a CRM-oriented corporate strategy.
Winner of the 2000 TDWI Leadership Award
Operates 21 casinos across the country
In 1993, the gaming laws changed, which allowed Harrah's to expand
Harrah's decided to compete using a brand strategy supported by information technology
Needed to know their customers exceptionally well

Harrah's Data Warehousing Architecture

WINet sources data from the casino, hotel, and event systems
The patron database (PDB) serves as an operational data store
The marketing workbench (MWB) serves as the data warehouse

Sample Applications
Operational personnel use PDB to check the preferences, history, and value of customers
Analysts use PDB and MWB to create offers to visit a Harrah's casino
Analysts use MWB to support predictive modeling efforts

Predict the value of a customer
Market based on that expected value
Track transactions that are linked to marketing initiatives
Evaluate the effectiveness
Track profitability
Refine marketing approaches

[Figure: marketing cycle with the stages Define (objectives, tests, control cells), Customer Treatment (right offer, right message, right time), Execute and Track (customer action/non-action), Measure (profit & loss, behavior change, new test report), and Learn.]

Customer Relationship Lifecycle

[Figure: annual revenue plotted against length of relationship, with three phases labeled Establish, Strengthen, and Reinvigorate.]

Two Data Warehousing Strategies


Enterprise-wide warehouse, top down, the Inmon methodology
Data mart, bottom up, the Kimball methodology
When properly executed, both result in an enterprise-wide data warehouse

The Data Mart Strategy


The most common approach
Begins with a single mart, and architected marts are added over time for more subject areas
Relatively inexpensive and easy to implement
Can be used as a proof of concept for data warehousing
Can perpetuate the silos of information problem
Can postpone difficult decisions and activities
Requires an overall integration plan

The Enterprise-wide Strategy


A comprehensive warehouse is built initially
An initial dependent data mart is built using a subset of the data in the warehouse
Additional data marts are built using subsets of the data in the warehouse
Like all complex projects, it is expensive, time consuming, and prone to failure
When successful, it results in an integrated, scalable warehouse

Data Sources and Types


Primarily from legacy, operational systems
Almost exclusively numerical data at the present time
External data may be included, often purchased from third-party sources
Technology exists for storing unstructured data, which is expected to become more important over time

Extraction, Transformation, and Loading (ETL) Processes


The plumbing work of data warehousing
Data are moved from source to target databases
A very costly, time-consuming part of data warehousing

Recent Development: More Frequent Updates


Updates can be done in bulk and trickle modes
Business requirements, such as trading partner access to a Web site, require current data
For international firms, there is no good time to load the warehouse

Recent Development: Clickstream Data


Results from clicks at web sites
A dialog manager handles user interactions. An ODS helps to custom tailor the dialog
The clickstream data is filtered and parsed and sent to a data warehouse where it is analyzed
Software is available to analyze the clickstream data
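The filter-and-parse step can be sketched as follows. This is a minimal illustration, not the method of any particular clickstream product: it assumes the web server writes Common Log Format lines (the slides do not name a format), and the ignored file suffixes and sample log lines are made up.

```python
import re

# Assumption: Common Log Format, e.g.
# host ident user [time] "METHOD path protocol" status size
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical filter: requests for page furniture are not treated as clicks.
IGNORED_SUFFIXES = (".gif", ".jpg", ".css", ".js")


def parse_clicks(lines):
    """Parse raw log lines and keep only successful page requests."""
    clicks = []
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # unparseable line
        if match.group("status") != "200":
            continue  # failed or redirected request
        if match.group("path").lower().endswith(IGNORED_SUFFIXES):
            continue  # image, stylesheet, or script
        clicks.append({
            "host": match.group("host"),
            "time": match.group("time"),
            "path": match.group("path"),
        })
    return clicks


sample_log = [
    '192.0.2.1 - - [15/Jan/2001:10:00:00 -0500] "GET /products/tv.html HTTP/1.0" 200 2326',
    '192.0.2.1 - - [15/Jan/2001:10:00:01 -0500] "GET /images/logo.gif HTTP/1.0" 200 512',
    '192.0.2.1 - - [15/Jan/2001:10:00:05 -0500] "GET /missing.html HTTP/1.0" 404 0',
    'not a log line',
]
clicks = parse_clicks(sample_log)
```

Only the first sample line survives the filter; the resulting rows are what would be sent on to the warehouse for analysis.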

Recent Development: Further Automation of ETL Processes


MetaRecon from Metagenix reverse engineers data into information
Analyzes and profiles source systems
Uncovers problems in source systems
Recommends primary and secondary keys, dimensions and measures, etc.
Generates ETL scripts

Data Extraction

Often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated meta data)
Sometimes source data is copied to the target database using the replication capabilities of a standard RDBMS (not recommended because of dirty data in the source systems)
Increasingly performed by specialized ETL software

Sample ETL Tools


DataStage from Ascential Software
SAS System from SAS Institute
Power Mart/Power Center from Informatica
Sagent Solution from Sagent Software
Hummingbird Genio Suite from Hummingbird Communications

Reasons for Dirty Data


Dummy Values
Absence of Data
Multipurpose Fields
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Violation of Business Rules
Reused Primary Keys, Non-Unique Identifiers
Data Integration Problems
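Several of these problems can be detected by simple profiling of a source column. The sketch below counts absent values, dummy values, and non-unique identifiers; the list of dummy values and the sample column are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical placeholder values that users type into required fields.
DUMMY_VALUES = {"999-99-9999", "99999", "N/A", "UNKNOWN", "XXXXX"}


def profile_column(values):
    """Count missing, dummy, and duplicated values in one source column."""
    present = [v.strip() for v in values if v is not None and v.strip()]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "dummy": sum(1 for v in present if v.upper() in DUMMY_VALUES),
        "duplicated": sorted(v for v, n in counts.items() if n > 1),
    }


# Made-up column that is supposed to hold a unique identifier.
id_column = ["123-45-6789", "999-99-9999", None, "", "123-45-6789", "999-99-9999"]
report = profile_column(id_column)
```

A report like this only flags candidates; deciding whether a repeated identifier is a reused key or a genuine duplicate still requires business rules.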

Data Cleansing

Source systems contain dirty data that must be cleansed
ETL software contains rudimentary data cleansing capabilities
Specialized data cleansing software is often used. Important for performing name and address correction and householding functions
Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)

Steps in Data Cleansing

Parsing
Correcting
Standardizing
Matching
Consolidating

Parsing
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. Examples include parsing the first, middle, and last name; street number and street name; and city and state.
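A minimal sketch of parsing, assuming names arrive as "First [Middle] Last", street lines as a number followed by a name, and city and state separated by a comma. Real cleansing tools handle far more variation; the function names and the sample address here are hypothetical.

```python
import re


def parse_name(full_name):
    """Split 'First [Middle ...] Last' into separate fields."""
    parts = full_name.split()
    if len(parts) < 2:
        raise ValueError(f"cannot parse name: {full_name!r}")
    return {"first": parts[0], "middle": " ".join(parts[1:-1]), "last": parts[-1]}


# A street number followed by a street name, e.g. "123 Main Street".
STREET_RE = re.compile(r"^\s*(?P<number>\d+)\s+(?P<name>.+?)\s*$")


def parse_street(line):
    """Isolate the street number and the street name."""
    match = STREET_RE.match(line)
    if match is None:
        raise ValueError(f"cannot parse street line: {line!r}")
    return {"street_number": match.group("number"), "street_name": match.group("name")}


def parse_city_state(text):
    """Split 'City, ST' on the first comma."""
    city, state = (part.strip() for part in text.split(",", 1))
    return {"city": city, "state": state}


name = parse_name("Hugh J. Watson")
street = parse_street("123 Main Street")
place = parse_city_state("Athens, GA")
```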

Correcting
Correcting fixes parsed individual data components using sophisticated data algorithms and secondary data sources. Examples include replacing a vanity address and adding a zip code.
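The two examples can be sketched with lookup tables standing in for the secondary data sources. Both tables and their contents are illustrative assumptions; a real tool would consult a postal reference file.

```python
# Hypothetical secondary data source: zip code keyed by (city, state).
ZIP_LOOKUP = {("Athens", "GA"): "30602"}

# Hypothetical map from a vanity place name to the postal city.
VANITY_CITIES = {"Five Points": "Athens"}


def correct_address(record):
    """Replace a vanity city name and fill in a missing zip code."""
    corrected = dict(record)  # leave the source record untouched
    corrected["city"] = VANITY_CITIES.get(record["city"], record["city"])
    if not corrected.get("zip"):
        zip_code = ZIP_LOOKUP.get((corrected["city"], corrected["state"]))
        if zip_code is not None:
            corrected["zip"] = zip_code
    return corrected


fixed = correct_address({"city": "Five Points", "state": "GA", "zip": ""})
```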

Standardizing
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules. Examples include adding a pre name, replacing a nickname, and using a preferred street name.
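A minimal sketch of the three examples, using small conversion tables as the business rules. The tables, the `gender` field used to choose a pre name, and the sample record are all assumptions made for illustration.

```python
# Hypothetical business rules expressed as conversion tables.
NICKNAMES = {"Bill": "William", "Bob": "Robert"}
STREET_SUFFIXES = {"St": "Street", "St.": "Street", "Ave": "Avenue", "Dr": "Drive"}
PREFIX_BY_GENDER = {"M": "Mr.", "F": "Ms."}


def standardize(record):
    """Apply the conversion rules to one parsed and corrected record."""
    out = dict(record)
    # Replace a nickname with the preferred given name.
    out["first"] = NICKNAMES.get(record["first"], record["first"])
    # Add a pre name when none is present.
    if not out.get("prefix"):
        out["prefix"] = PREFIX_BY_GENDER.get(record.get("gender"), "")
    # Use the preferred (spelled-out) street suffix.
    words = record["street_name"].split()
    if words:
        words[-1] = STREET_SUFFIXES.get(words[-1], words[-1])
    out["street_name"] = " ".join(words)
    return out


standard = standardize(
    {"first": "Bob", "last": "Smith", "gender": "M", "street_name": "Main St"}
)
```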

Matching
Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications. Examples include identifying similar names and addresses.
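One simple way to identify similar names and addresses is approximate string comparison. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the 0.85 threshold and the sample records are assumptions, and commercial tools use more elaborate matching rules.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0..1 similarity score for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_key(record):
    """Build the text that is compared: name plus address."""
    return f"{record['name']} {record['address']}"


def find_matches(records, threshold=0.85):
    """Return index pairs of records that look like the same party."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if similarity(match_key(a), match_key(b)) >= threshold:
            pairs.append((i, j))
    return pairs


# Made-up records, already parsed, corrected, and standardized.
customers = [
    {"name": "William Smith", "address": "123 Main Street"},
    {"name": "William Smyth", "address": "123 Main Street"},
    {"name": "Mary Jones", "address": "9 Oak Avenue"},
]
matched_pairs = find_matches(customers)
```

Comparing every pair is quadratic in the number of records, so this only suits small files; it is meant to show the idea rather than a production approach.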

Consolidating

Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
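Consolidation can be sketched as grouping the matched pairs and merging each group into one record. The survivorship rule used here (keep the first non-empty value for each field) is an assumption chosen for simplicity; the sample records are made up.

```python
def consolidate(records, matched_pairs):
    """Merge records linked by matched_pairs into one record per group."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group's representative index.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in matched_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    merged = {}
    for index, record in enumerate(records):
        target = merged.setdefault(find(index), {})
        for field, value in record.items():
            # Assumed survivorship rule: first non-empty value wins.
            if value and not target.get(field):
                target[field] = value
    return list(merged.values())


people = [
    {"name": "William Smith", "phone": "", "zip": "30602"},
    {"name": "William Smith", "phone": "555-0100", "zip": ""},
    {"name": "Mary Jones", "phone": "555-0101", "zip": "30605"},
]
golden = consolidate(people, [(0, 1)])
```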

Data Staging

Often used as an interim step between data extraction and later steps
Accumulates data from asynchronous sources using native interfaces, flat files, FTP sessions, or other processes
At a predefined cutoff time, data in the staging file is transformed and loaded to the warehouse
There is usually no end user access to the staging file
An operational data store may be used for data staging
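The accumulate-then-load-at-cutoff behavior can be sketched as below. The `StagingArea` class, its method names, and the sample rows are hypothetical; a real staging area would be a file or database table rather than an in-memory list.

```python
from datetime import datetime


class StagingArea:
    """Hold rows from asynchronous sources until a cutoff time."""

    def __init__(self):
        self._rows = []

    def accumulate(self, source, row, arrived_at):
        """Record a row together with its source and arrival time."""
        self._rows.append((arrived_at, source, row))

    def flush(self, cutoff, transform, warehouse):
        """Transform and load every row that arrived by the cutoff."""
        ready = [r for r in self._rows if r[0] <= cutoff]
        self._rows = [r for r in self._rows if r[0] > cutoff]
        for _, source, row in sorted(ready, key=lambda r: r[0]):
            warehouse.append(transform(source, row))
        return len(ready)


def to_warehouse_row(source, row):
    # Hypothetical transformation: tag the source and convert the amount.
    return {"source": source, "amount": float(row["amount"])}


staging = StagingArea()
staging.accumulate("pos", {"amount": "10.50"}, datetime(2001, 1, 15, 22, 0))
staging.accumulate("web", {"amount": "3.00"}, datetime(2001, 1, 16, 1, 30))

warehouse = []
loaded = staging.flush(datetime(2001, 1, 15, 23, 59), to_warehouse_row, warehouse)
```

The row that arrived after the cutoff stays in staging and is picked up by the next load.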

Data Transformation
Transforms the data in accordance with the business rules and standards that have been established
Examples include format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates
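Each of the listed examples appears once in the sketch below. The field names, the gender code table, and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical source codes and their warehouse descriptions.
GENDER_CODES = {"1": "Male", "2": "Female"}


def transform(rows):
    """Apply the transformation rules to extracted source rows."""
    seen = set()
    out = []
    for row in rows:
        if row["order_id"] in seen:  # deduplication
            continue
        seen.add(row["order_id"])
        first, last = row["customer"].split(" ", 1)  # splitting up a field
        units, price = int(row["units"]), float(row["unit_price"])
        out.append({
            "order_id": row["order_id"],
            "first_name": first,
            "last_name": last,
            # format change: MM/DD/YYYY to ISO 8601
            "order_date": datetime.strptime(row["order_date"], "%m/%d/%Y").date().isoformat(),
            # replacement of codes
            "gender": GENDER_CODES.get(row["gender"], "Unknown"),
            # derived value
            "revenue": round(units * price, 2),
        })
    return out


def aggregate_by_date(rows):
    """Aggregate: total revenue per order date."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["order_date"]] += row["revenue"]
    return dict(totals)


source_rows = [
    {"order_id": "A1", "customer": "William Smith", "order_date": "01/15/2001",
     "gender": "1", "units": "2", "unit_price": "200.00"},
    {"order_id": "A1", "customer": "William Smith", "order_date": "01/15/2001",
     "gender": "1", "units": "2", "unit_price": "200.00"},
    {"order_id": "A2", "customer": "Mary Jones", "order_date": "01/15/2001",
     "gender": "2", "units": "1", "unit_price": "100.00"},
]
clean_rows = transform(source_rows)
daily_totals = aggregate_by_date(clean_rows)
```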

Data Loading
Data are physically moved to the data warehouse
The loading takes place within a load window
The trend is to near real-time updates of the data warehouse as the warehouse is increasingly used for operational applications

Meta Data

Data about data
Needed by both information technology personnel and users
IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc.
Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.

Recent Development: Meta Data Integration


A growing realization that meta data is critical to data warehousing success
Progress is being made on getting vendors to agree on standards and to incorporate the sharing of meta data among their tools
Vendors like Microsoft, Computer Associates, and Oracle have entered the meta data marketplace with significant product offerings

Database Vendors
High end (i.e., terabyte plus) vendors include IBM (DB2) and NCR-Teradata (Teradata)
Oracle (8i) and Microsoft (SQL Server 7) are major players for smaller databases

On-line Analytical Processing (OLAP)


A set of functionality that facilitates multidimensional analysis
Allows users to analyze data in ways that are natural to them
Comes in many varieties -- ROLAP, MOLAP, DOLAP, etc.
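Two typical multidimensional operations, roll-up and slice, can be sketched over a tiny set of facts. The dimensions, the data, and the function names are made up for illustration and do not reflect any vendor's OLAP engine.

```python
from collections import defaultdict

# Made-up facts: (store, product, month, units sold).
DIMENSIONS = ("store", "product", "month")
FACTS = [
    ("Athens", "TV", "Jan", 10),
    ("Athens", "TV", "Feb", 12),
    ("Athens", "Radio", "Jan", 5),
    ("Atlanta", "TV", "Jan", 20),
]


def roll_up(facts, by):
    """Total the measure over every dimension not named in `by`."""
    positions = [DIMENSIONS.index(d) for d in by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[p] for p in positions)] += row[-1]
    return dict(totals)


def slice_facts(facts, dimension, value):
    """Keep only the facts where one dimension has a fixed value."""
    position = DIMENSIONS.index(dimension)
    return [row for row in facts if row[position] == value]


units_by_product = roll_up(FACTS, ["product"])
january_by_store = roll_up(slice_facts(FACTS, "month", "Jan"), ["store"])
```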

ROLAP

Relational OLAP
Uses a RDBMS to implement an OLAP environment
Typically involves a star schema to provide the multidimensional capabilities
OLAP tool manipulates RDBMS star schema data
Called slowlap by MOLAP vendors

MOLAP
Multidimensional OLAP
Uses a MDDBS (e.g., Essbase) to store and access data
Usually requires proprietary (non SQL) data access tools
Provides exceptionally fast response times

Star Schema
Creates non-normalized data structures
Easier for users to understand
Optimized for OLAP
Uses fact (facts or measures in the business) and dimension (establishes the context of the facts) tables

OLAP Tools

Products come from vendors such as Brio, Cognos, Hyperion, and BusinessObjects
Typically available as a fat or thin (i.e., browser) client
In a web environment, the browser communicates with a web server, which talks to an application server, which connects to backend databases
The application server provides query, reporting, and OLAP analysis functionality over the web
Java applets or downloaded components augment the thin client
A broadcast server may be used to schedule, run, publish, and broadcast reports, alerts, and responses over the LAN, email, or personal digital assistant

Dimension Table Examples


Retail -- store name, zip code, product name, product category, day of week
Telecommunications -- call origin, call destination
Banking -- customer name, account number, branch, account officer
Insurance -- policy type, insured party

Fact Table Examples


Retail -- number of units sold, sales amount
Telecommunications -- length of call in minutes, average number of calls
Banking -- average monthly balance
Insurance -- claims amount

The Fact Table Key Concatenates the Dimension Keys


Assume that you want to know the number of television sets sold to Best Buys on January 15, 2001. The query might be:
SELECT CLIENT.CUSNAME, SALES.NOSOLD
FROM CLIENT, PRODUCT, TIME, SALES
WHERE CLIENT.CUSNAME = SALES.CUSNAME
  AND PRODUCT.PRODNAME = SALES.PRODNAME
  AND TIME.DATE = SALES.DATE
  AND CLIENT.CUSNAME = 'BEST BUYS'
  AND PRODUCT.PRODNAME = 'TELEVISION'
  AND TIME.DATE = #01/15/2001#
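A runnable sketch of this query, using an in-memory SQLite database from the Python standard library. The table and column names follow the query above; the column types, the second client, and all data values are assumptions. Dates are stored as ISO text because SQLite has no `#...#` date literal, and `TIME` and `DATE` are double-quoted to keep them unambiguous as identifiers. The fact table's primary key is the concatenation of the three dimension keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CLIENT  (CUSNAME  TEXT PRIMARY KEY);
CREATE TABLE PRODUCT (PRODNAME TEXT PRIMARY KEY);
CREATE TABLE "TIME"  ("DATE"   TEXT PRIMARY KEY);

-- Fact table: its key concatenates the dimension keys.
CREATE TABLE SALES (
    CUSNAME  TEXT REFERENCES CLIENT(CUSNAME),
    PRODNAME TEXT REFERENCES PRODUCT(PRODNAME),
    "DATE"   TEXT REFERENCES "TIME"("DATE"),
    NOSOLD   INTEGER,
    PRIMARY KEY (CUSNAME, PRODNAME, "DATE")
);

-- Made-up sample data.
INSERT INTO CLIENT  VALUES ('BEST BUYS'), ('OTHER STORE');
INSERT INTO PRODUCT VALUES ('TELEVISION'), ('RADIO');
INSERT INTO "TIME"  VALUES ('2001-01-15'), ('2001-01-16');
INSERT INTO SALES   VALUES
    ('BEST BUYS',   'TELEVISION', '2001-01-15', 40),
    ('BEST BUYS',   'RADIO',      '2001-01-15', 15),
    ('BEST BUYS',   'TELEVISION', '2001-01-16', 25),
    ('OTHER STORE', 'TELEVISION', '2001-01-15', 30);
""")

QUERY = """
SELECT CLIENT.CUSNAME, SALES.NOSOLD
FROM CLIENT, PRODUCT, "TIME", SALES
WHERE CLIENT.CUSNAME = SALES.CUSNAME
  AND PRODUCT.PRODNAME = SALES.PRODNAME
  AND "TIME"."DATE" = SALES."DATE"
  AND CLIENT.CUSNAME = ?
  AND PRODUCT.PRODNAME = ?
  AND "TIME"."DATE" = ?
"""

rows = conn.execute(QUERY, ("BEST BUYS", "TELEVISION", "2001-01-15")).fetchall()
```

Because the three dimension keys together identify one fact row, the query returns at most one row for a given client, product, and date.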

Warehouse Users
Analysts
Managers
Executives
Operational personnel
Customers and suppliers

Warehouse Tools and Applications


SQL queries
Managed query environments
Structured and ad hoc reports
DSS/EIS
Portals
Data mining
Packaged applications
Custom-built applications

Recent Development: Growing Dominance of MS SQL Server 7.0 with OLAP Services
Low cost, integration of bundled DSS components from one vendor, and extended SQL for OLAP
Competitors are either leaving the market or are repositioning their products to be complementary

Recent Development: Enterprise Intelligence Portals

Offers users an effective way to access information scattered across networked enterprise systems through a simple and personalized Web interface
Provides access to structured and unstructured data
Potentially integrates data warehousing and knowledge management

Owens & Minor

Owens & Minor -- data warehousing has supported integration along the supply chain.
Winner of the 1999 TDWI Leadership Award
The nation's leading distributor of name-brand medical and surgical supplies
Has transformed its business model by integrating supply chain management, e-business, data warehousing, and Internet technologies
As part of this initiative, WISDOM (WebIntelligence Supporting Decisions from Owens & Minor) has been especially valuable

WISDOM

A Web-based decision support system that provides information to O&M's employees, suppliers, and customers
Accesses data from a data warehouse that maintains supplier and customer transaction data
Sold to trading partners as a value-added product
WISDOM II provides data about the transactions that suppliers and customers have with all of their trading partners

Sample Applications

Supports reporting and queries for internal personnel
Supports an EIS for senior management
Suppliers can determine their market share in specific hospitals
Hospitals can identify which products are being bought off contract
WISDOM II extends data warehousing to trading partners through an outsourcing arrangement

Articles

Cooper, B.L., H.J. Watson, B.H. Wixom, and D.L. Goodhue, "Data Warehousing Supports Corporate Strategy at First American Corporation," MIS Quarterly, (December 2000), pp. 547-567. Provides a case study of how the First American Corporation turned their strategy and fortunes around through the use of data warehousing.
Stoller, Wixom, and Watson, "WISDOM Provides Competitive Advantage at Owens & Minor," (http://terry.uga.edu/~watson/owens&minor.doc). Provides a case study of how data warehousing can support supply chain integration.
Watson, Wixom, Buonamica, and Revak, "Sherwin-Williams' Data Mart Strategy: Creating Intelligence Across the Supply Chain," Communications of ACIS, April 2001. Provides a textbook example of how to implement a data mart strategy.
Watson, H.J., D.A. Annino, B.H. Wixom, K.L. Avery, and M. Rutherford, "Current Practices in Data Warehousing," Information Systems Management, (Winter, 2001), pp. 47-55. Provides data on companies' data warehousing experiences, with an emphasis on the benefits being realized.
Watson, H.J. and L. Volonino, "Harrah's High Payoff from Customer Information," (http://www.terry.uga.edu/~hwatson/harrahs.doc). Provides a case study of how Harrah's Entertainment has implemented a CRM strategy facilitated by data warehousing.

Books

Devlin, Data Warehouse -- Architecture to Implementation, Addison-Wesley, 1997.
Gray and Watson, Decision Support in the Data Warehouse, Prentice-Hall, 1998.
Kimball, The Data Warehouse Toolkit, Wiley, 1996.
Kimball and Merz, The Data Webhouse Toolkit, Wiley, 2000.
Inmon, Building the Operational Data Store, second edition, Wiley, 1999.
Inmon, Imhoff, and Sousa, Corporate Information Factory, Wiley, 1999.

Websites

http://www.olapreport.com (provides detailed information about the OLAP market, products, and applications)
http://www.firstlogic.com (includes an interactive demo of their data cleansing tool)
http://www.billinmon.com (a wealth of current information from the father of data warehousing)
http://www.metagenix.com (illustrates recent advances in ETL tools)
http://www.microstrategy.com (excellent materials from one of the leading DSS vendors)

Questions
