
Data Warehouse and ETL Concepts: Informatica and Teradata

Data warehouse:
A data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store, and then supports querying and analysis for the purpose of decision making.

Data warehousing is a process, not a product. It is a technique for properly assembling and managing data from various sources to answer business questions that were not previously known or possible. A data warehouse is:

Subject Oriented: Data is arranged by subject area rather than by application, which is more intuitive for users to navigate.
Integrated: Data is collected and consistently stored from multiple, diverse sources.
Time Variant: Allows for access to and analysis of data over time, rather than typical systems which generally provide just detailed current information.
Non-volatile: The data is static; there is one version of the truth regardless of when the question is asked.

Example of a Data warehouse:


Source Systems:
- Execution systems: CRM, ERP, legacy, e-commerce
- External data: purchased market data, spreadsheets

Extract, Transformation, and Load (ETL) Layer:
- Cleanse data, filter records, standardize values, decode values, apply business rules, householding, dedupe records, merge records

Data and Metadata Repository Layer:
- Operational Data Store (ODS)
- Enterprise Data Warehouse
- Data Marts
- Metadata Repository

Presentation Layer:
- Reporting tools, OLAP tools, ad hoc query tools, data mining tools

Sample Technologies:
- Source systems: PeopleSoft, SAP, Siebel, Oracle Applications, Manugistics, custom systems
- ETL tools: Informatica PowerMart, ETI, Oracle Warehouse Builder, custom programs, SQL scripts
- Databases: Oracle, SQL Server, Teradata, DB2
- Presentation: custom tools, HTML reports, Cognos, Business Objects, MicroStrategy, Oracle Discoverer, Brio, data mining tools, portals
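The ETL-layer steps listed above (filter records, standardize values, decode values, dedupe records) can be sketched in a few lines of Python. The field names, code maps, and sample records here are illustrative assumptions, not part of any real source system:

```python
# Hypothetical raw customer extract; field names are assumptions.
RAW = [
    {"cust_id": "101", "state": "calif.", "gender": "M"},
    {"cust_id": "101", "state": "calif.", "gender": "M"},   # duplicate record
    {"cust_id": "102", "state": "TX",     "gender": "F"},
    {"cust_id": "",    "state": "NY",     "gender": "U"},   # missing business key
]

STATE_MAP = {"calif.": "CA", "TX": "TX", "NY": "NY"}        # standardize values
GENDER_DECODE = {"M": "Male", "F": "Female", "U": "Unknown"}  # decode values

def etl_clean(rows):
    seen, out = set(), []
    for r in rows:
        if not r["cust_id"]:            # filter: drop records missing the key
            continue
        if r["cust_id"] in seen:        # dedupe on the business key
            continue
        seen.add(r["cust_id"])
        out.append({
            "cust_id": r["cust_id"],
            "state": STATE_MAP.get(r["state"], r["state"]),
            "gender": GENDER_DECODE[r["gender"]],
        })
    return out

clean = etl_clean(RAW)
```

A real ETL tool such as Informatica expresses the same steps as filter, expression, and aggregator transformations rather than hand-written code.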

The Data flow thread:

Data Warehouse Design Process:

A data warehouse can be designed top-down, bottom-up, or with a combination of both:
- Top-down: starts with overall design and planning (mature).
- Bottom-up: starts with experiments and prototypes (rapid).

From a software engineering point of view:
- Waterfall: structured and systematic analysis at each step before proceeding to the next.
- Spiral: rapid generation of increasingly functional systems, with short turnaround time.

The typical data warehouse design process:
1. Choose a business process to model, e.g., orders, invoices, etc.
2. Choose the grain (atomic level of data) of the business process.
3. Choose the dimensions that will apply to each fact table record.
4. Choose the measures that will populate each fact table record.
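The four design decisions can be captured as a simple design specification before any tables are built. The "orders" process, its grain, and the listed dimensions and measures below are assumed examples, not prescribed by the text:

```python
# A minimal design-spec sketch for an assumed "orders" business process.
design = {
    "business_process": "orders",
    "grain": "one row per order line item",              # atomic level of data
    "dimensions": ["date", "product", "customer", "store"],
    "measures": ["quantity", "extended_amount"],         # must be true to the grain
}

summary = (f"{design['business_process']}: {design['grain']}, "
           f"{len(design['dimensions'])} dimensions, "
           f"{len(design['measures'])} measures")
```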


Staging Area:
The data staging area is an information hub that facilitates the enriching stages data goes through in order to populate an ODS and/or data warehouse. It is an essential ingredient in developing a comprehensive, data-centric solution for any data warehousing project. In the majority of cases, staging holds a copy of the source system data without any transformations.

Why do we need staging?
1) The connection between the source and the ETL process can break mid-stream.
2) The ETL process can take a long time. If we process in-stream, we hold a connection open to the source system; a long-running process can create problems with database locks and stress the transaction system.
3) You should always keep a copy of the extracted, untransformed data for auditing purposes.
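The three reasons above all come down to landing the extract as-is before any transformation. A minimal sketch, assuming a CSV extract and a file-based staging area (the file naming is an illustrative convention, not a standard):

```python
import datetime
import pathlib
import shutil

def stage_extract(source_file: str, staging_dir: str) -> str:
    """Copy a raw source extract into the staging area, timestamped and
    byte-for-byte identical: no transformations, so the copy can serve as a
    restart point and an audit record."""
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    dest = pathlib.Path(staging_dir) / f"extract_{stamp}{pathlib.Path(source_file).suffix}"
    shutil.copy2(source_file, dest)   # pure copy; transformation happens downstream
    return str(dest)
```

If the downstream load fails, the ETL process restarts from the staged copy instead of re-querying (and re-locking) the transaction system.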

Dimensions:
- Single join to the fact table (single primary key)
- Store business attributes
- Attributes are textual in nature
- Organized into hierarchies
- More or less constant data
- E.g. Time, Product, Customer, Store, etc.

Facts:

Fact Table: Fact tables are the foundation of the data warehouse. They contain the fundamental measurements of the enterprise, and they are the ultimate target of most data warehouse queries.

Grain: The level of detail of a fact table. The grain is the description of the measurement event in the physical world that gives rise to a measurement.

Transaction grain: The transaction grain corresponds to a measurement taken at a single instant.

Purpose of the fact table: The real purpose of the fact table is to be the repository of the numeric facts observed during the measurement event. It is critically important for these facts to be true to the grain.
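A transaction-grain fact row can be sketched as a record with one foreign key per dimension plus the numeric facts. The retail point-of-sale scan and the field names here are assumed for illustration:

```python
from dataclasses import dataclass

@dataclass
class SalesFact:
    """Grain: one row per product scan at the register (transaction grain)."""
    date_key: int      # foreign key -> date dimension
    product_key: int   # foreign key -> product dimension
    store_key: int     # foreign key -> store dimension
    quantity: int      # additive numeric fact, true to the single-scan grain
    amount: float      # additive numeric fact

row = SalesFact(date_key=20240115, product_key=42, store_key=7,
                quantity=2, amount=9.98)
```

Because both facts are true to the grain, they can be summed safely across any combination of the three dimensions.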

Star schema vs. Snowflake schema:

Understandability:
- Star: Easier for business users and analysts to query data.
- Snowflake: May be more difficult for business users and analysts due to the number of tables they have to deal with.

Dimension tables:
- Star: Only one dimension table for each dimension, grouping related attributes; dimension tables are not in third normal form.
- Snowflake: May have more than one table per dimension due to further normalization of each dimension table; dimension tables are in third normal form (3NF).

Query complexity:
- Star: The query is very simple and easy to understand.
- Snowflake: More complex queries due to multiple foreign key joins between dimension tables.

Query performance:
- Star: High performance; the database engine can optimize and boost query performance based on a predictable framework.
- Snowflake: More foreign key joins, therefore longer query execution time compared with a star schema.

When to use:
- Star: When dimension tables store a relatively small number of rows and space is not a big issue.
- Snowflake: When dimension tables store a large number of rows with redundant data and space is an issue; snowflaking saves space.

Foreign key joins:
- Star: Fewer joins.
- Snowflake: Higher number of joins.

Data warehouse system:
- Star: Works best in any data warehouse / data mart.
- Snowflake: Better for a small data warehouse / data mart.
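The "single join per dimension" shape of the star schema can be shown with an in-memory SQLite database. The table and column names are illustrative; note the category attribute lives directly in the denormalized product dimension, whereas a snowflake would move it to a separate category table behind a second join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT   -- denormalized: kept in the same table (star style)
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL
);
INSERT INTO dim_product VALUES (1, 'Widget', 'Tools'), (2, 'Gadget', 'Toys');
INSERT INTO fact_sales  VALUES (1, 2, 9.98), (2, 1, 4.99), (1, 1, 4.99);
""")

# One foreign key join per dimension: the simple, predictable query shape
# described in the comparison above.
rows = con.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
```

A snowflaked version of the same query would need an extra join from `dim_product` to a `dim_category` table to reach the category name.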

Informatica Architecture and Components:


Informatica PowerCenter uses a client-server architecture containing several components.

Components:
