Data warehouse:
A system that extracts, cleans, conforms, and delivers source data into a dimensional data store, and then supports and implements querying and analysis for the purpose of decision making.
Data warehousing is a process, not a product. It is a technique for assembling and managing data from various sources in order to answer business questions that were previously unknown or impossible to answer.
Subject oriented: Data is arranged by subject area rather than by application, which is more intuitive for users to navigate.
Integrated: Data is collected from multiple, diverse sources and stored consistently.
Time variant: Allows access to and analysis of data over time, unlike typical operational systems, which generally provide only detailed current information.
Non-volatile: The data is static, giving one version of the truth regardless of when the question is asked.
Architecture:
ETL Layer: Extract, Transformation, and Load (ETL). Tasks: cleanse data, filter records, standardize values, decode values, apply business rules, householding, dedupe records, merge records.
Data stores: data marts, metadata repository.
Presentation Layer: reporting tools, OLAP tools, ad hoc query tools, data mining tools.

Sample Technologies:
Source systems: PeopleSoft, SAP, Siebel, Oracle Applications, Manugistics, custom systems.
ETL tools: Informatica PowerMart, ETI, Oracle Warehouse Builder, custom programs, SQL scripts.
Databases: Oracle, SQL Server, Teradata, DB2.
Presentation tools: custom tools, HTML reports, Cognos, Business Objects, MicroStrategy, Oracle Discoverer, Brio, data mining tools, portals.
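Several of the ETL-layer tasks named above (filter records, standardize values, decode values, dedupe records) can be illustrated with a few lines of plain Python. This is only a minimal sketch: the record layout, the `GENDER_CODES` lookup, and the rule "one record per email address" are assumptions made for the example, not something the notes prescribe.

```python
from typing import Iterable

# Hypothetical lookup table for decoding source-system codes.
GENDER_CODES = {"M": "Male", "F": "Female"}


def cleanse(records: Iterable[dict]) -> list:
    """Filter, standardize, decode, and dedupe a batch of customer records."""
    seen_emails = set()
    cleaned = []
    for rec in records:
        # Filter records: drop rows that have no customer id.
        if not rec.get("customer_id"):
            continue
        # Standardize values: collapse whitespace and normalize case.
        name = " ".join(rec["name"].split()).title()
        email = rec["email"].strip().lower()
        # Decode values: replace a source code with its description.
        gender = GENDER_CODES.get(rec["gender"].strip().upper(), "Unknown")
        # Dedupe records: keep only the first row seen for each email.
        if email in seen_emails:
            continue
        seen_emails.add(email)
        cleaned.append(
            {"customer_id": rec["customer_id"], "name": name,
             "email": email, "gender": gender}
        )
    return cleaned


raw = [
    {"customer_id": 1, "name": "  ada   lovelace ", "email": "Ada@Example.com ", "gender": "f"},
    {"customer_id": 2, "name": "Ada Lovelace", "email": "ada@example.com", "gender": "F"},
    {"customer_id": None, "name": "no id", "email": "x@example.com", "gender": "M"},
    {"customer_id": 3, "name": "alan turing", "email": "alan@example.com", "gender": "X"},
]
result = cleanse(raw)
```

In a real project these rules would normally live in the ETL tool or in SQL scripts; the point here is only the order of the steps: filter first, standardize before comparing, and dedupe on the standardized value.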
Design approaches:
Top-down: Starts with overall design and planning (mature).
Bottom-up: Starts with experiments and prototypes (rapid).
Waterfall: Structured and systematic analysis at each step before proceeding to the next.
Spiral: Rapid generation of increasingly functional systems, with short turnaround times.

Design steps:
1) Choose a business process to model, e.g., orders, invoices, etc.
2) Choose the grain (atomic level of data) of the business process.
3) Choose the dimensions that will apply to each fact table record.
4) Choose the measures that will populate each fact table record.
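The four modeling choices (business process, grain, dimensions, measures) map directly onto table definitions. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table and column names (`fact_order_line`, `dim_product`, and so on) and the sample rows are illustrative assumptions, with "orders" taken as the business process.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Choose the dimensions: the context that applies to each fact row.
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, city TEXT);

-- Business process: orders. Grain: one row per order line.
-- Measures: quantity and amount.
CREATE TABLE fact_order_line (
    order_id INTEGER,
    line_no INTEGER,
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    quantity INTEGER,
    amount REAL,
    PRIMARY KEY (order_id, line_no)
);

INSERT INTO dim_date VALUES (20240105, '2024-01-05', 2024), (20240106, '2024-01-06', 2024);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO dim_customer VALUES (1, 'Acme', 'Lisbon'), (2, 'Globex', 'Porto');
INSERT INTO fact_order_line VALUES
    (100, 1, 20240105, 1, 1, 2, 20.0),
    (100, 2, 20240105, 2, 1, 1, 7.5),
    (101, 1, 20240106, 1, 2, 1, 10.0);
""")

# A typical analytical query: total order amount by product category.
totals = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_order_line AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

The primary key `(order_id, line_no)` is one way to make the chosen grain explicit in the schema: the database then rejects a second row for the same order line.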
Staging Area:
The data staging area is an information hub that supports the enrichment stages data passes through in order to populate an ODS and/or data warehouse. It is an essential ingredient in any approach or methodology for building a comprehensive data-centric solution for a data warehousing project. In most cases the staging area holds a copy of the source system data without any transformations.
Why do we need staging?
1) The connection between the source and the ETL process can break mid-stream.
2) The ETL process can take a long time. If we process in-stream, we'll have a connection open to the source system the whole time. A long-running process can create problems with database locks and put stress on the transaction system.
3) You should always keep a copy of the extracted, untransformed data for auditing purposes.
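The three reasons for staging suggest a simple pattern: copy the source rows unchanged, release the source connection as early as possible, and run every transformation against the staged copy. The sketch below imitates that with two in-memory SQLite databases; the table names (`orders`, `stg_orders`, `dw_orders`) and the sample data are assumptions for illustration only.

```python
import sqlite3

# Hypothetical source (transaction) system with a few rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " alice ", "10.50"), (2, "BOB", "7.25")],
)

# Warehouse side: a staging table with the same shape as the source,
# plus the cleaned target table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE stg_orders (order_id INTEGER, customer TEXT, amount TEXT)")
warehouse.execute("CREATE TABLE dw_orders (order_id INTEGER, customer TEXT, amount REAL)")

# Extract: copy rows as-is into staging, then release the source right away.
rows = source.execute("SELECT order_id, customer, amount FROM orders").fetchall()
source.close()
warehouse.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)

# Transform and load from staging only; the source is no longer involved,
# so a slow or failed transformation cannot hold locks on it.
staged_rows = warehouse.execute("SELECT order_id, customer, amount FROM stg_orders").fetchall()
for order_id, customer, amount in staged_rows:
    warehouse.execute(
        "INSERT INTO dw_orders VALUES (?, ?, ?)",
        (order_id, customer.strip().title(), float(amount)),
    )
warehouse.commit()

staged = warehouse.execute("SELECT * FROM stg_orders ORDER BY order_id").fetchall()
loaded = warehouse.execute("SELECT * FROM dw_orders ORDER BY order_id").fetchall()
```

Because `stg_orders` still holds the untransformed values, a failed load can be rerun from staging, and the staged rows remain available for auditing.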
Dimensions:
Single join to the fact table (single primary key)
Store business attributes
Attributes are textual in nature
Organized into hierarchies
More or less constant data
E.g. Time, Product, Customer, Store, etc.
Facts:
Fact table: Fact tables are the foundation of the data warehouse. They contain the fundamental measurements of the enterprise, and they are the ultimate target of most data warehouse queries.
Grain: The level of detail.
Grain of a fact table: The description of the measurement event in the physical world that gives rise to a measurement.
Transaction grain: The transaction grain corresponds to a measurement taken at a single instant.
Purpose of the fact table: The real purpose of the fact table is to be the repository of the numeric facts observed during the measurement event. It is critically important for these facts to be true to the grain.
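One practical reading of "true to the grain" is that each fact row records exactly one measurement event, so no two rows may share the same grain key. The check below is a rough sketch of that idea; the declared grain `(order_id, line_no)` and the sample rows are hypothetical.

```python
from collections import Counter

# Hypothetical declared grain: one row per (order_id, line_no).
GRAIN = ("order_id", "line_no")

fact_rows = [
    {"order_id": 100, "line_no": 1, "quantity": 2, "amount": 20.0},
    {"order_id": 100, "line_no": 2, "quantity": 1, "amount": 7.5},
    # Same measurement event loaded twice: this row violates the grain.
    {"order_id": 100, "line_no": 2, "quantity": 1, "amount": 7.5},
]


def grain_violations(rows, grain):
    """Return the grain keys that appear on more than one fact row."""
    counts = Counter(tuple(row[col] for col in grain) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


violations = grain_violations(fact_rows, GRAIN)
```

A duplicate key like this would double-count `quantity` and `amount` in every query that sums the fact table, which is why grain checks are commonly run before loading.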
Star schema vs. snowflake schema:

Dimension table
  Star schema: Each dimension is held in a single, denormalized table.
  Snowflake schema: A dimension may be normalized into several related tables.
Query complexity
  Star schema: Queries are simple and easy to understand.
  Snowflake schema: Queries are more complex because of multiple foreign key joins between dimension tables.
Query performance
  Star schema: High performance. The database engine can optimize and boost query performance based on a predictable framework.
  Snowflake schema: More foreign key joins, and therefore longer query execution time than with a star schema.
When to use
  Star schema: When dimension tables store a relatively small number of rows and space is not a big issue.
  Snowflake schema: When dimension tables store a large number of rows with redundant data and space is an issue, snowflaking saves space.
Number of joins
  Star schema: Fewer joins.
  Snowflake schema: Higher number of joins.
Data warehouse system
  Star schema: Works best in any data warehouse / data mart.
  Snowflake schema: Better for a small data warehouse / data mart.
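The difference in join count between the star and snowflake layouts is easiest to see by running the same question against both. In this sketch (in-memory SQLite via Python's `sqlite3`; all table names and rows are made up for the example) the product category is stored inline in the star dimension and in a separate `dim_category` table in the snowflake variant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (product_key INTEGER, amount REAL);
INSERT INTO fact_sales VALUES (1, 20.0), (2, 7.5), (1, 10.0);

-- Star: one denormalized product dimension (category stored on each product row).
CREATE TABLE dim_product_star (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT
);
INSERT INTO dim_product_star VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');

-- Snowflake: category normalized out into its own table.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_snow (
    product_key INTEGER PRIMARY KEY, product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
INSERT INTO dim_category VALUES (10, 'Hardware'), (20, 'Books');
INSERT INTO dim_product_snow VALUES (1, 'Widget', 10), (2, 'Manual', 20);
""")

# Sales by category on the star schema: fact table plus one dimension.
star_sql = """
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product_star AS p ON p.product_key = f.product_key
    GROUP BY p.category_name
    ORDER BY p.category_name
"""

# The same question on the snowflake schema needs an extra hop.
snow_sql = """
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product_snow AS p ON p.product_key = f.product_key
    JOIN dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
"""

star = conn.execute(star_sql).fetchall()
snow = conn.execute(snow_sql).fetchall()
```

Both queries return the same totals; the snowflake version pays for its smaller, non-redundant dimension with one additional join per normalized level.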
Components: