
DW BASICS

1. What is an operational system?
   - Helps to run day-to-day activities.
   - Operational systems are the backbone systems of an enterprise.
   - They are the first systems of the enterprise to be computerized.
   - E.g.: ATM withdrawals, airline reservations.

2. Characteristics of an operational system (CSHUTPLS):
   - Continuous availability
   - Supports a large number of users
   - High volume of transactions
   - Used by operational staff
   - Transaction integrity
   - Predefined access paths
   - Low data volume per query
   - Supports day-to-day control operations

3. What is data warehousing?
   A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data for the management decision-making process.

4. Features of DW:
   - Multi-dimensional view
   - High degree of scalability
   - High analytical capability
   - Historical data only

5. Characteristics of DW:
   - Subject-oriented
   - Integrated
   - Time-variant
   - Non-volatile: no need for processing, recovery and concurrency control methods.

6. Operational vs. data warehouse

   OPERATIONAL              DATA WAREHOUSE
   Primitive                Derived
   Current                  Historical
   Constantly updated       Less frequently updated
   Minimal redundancy       Managed redundancy
   Highly detailed data     Summarized data
   Referential integrity    Historical integrity
   Normalized design        De-normalized design


7. Types of data warehouses:
   - Operational data store (ODS): volatile, integrated, subject-oriented, current-valued, detailed data
   - Data marts
   - Enterprise data warehouse

   Operational data store vs. data warehouse:

                             OPERATIONAL DATA STORE                               DATA WAREHOUSE
   Audience                  Analysts                                             Managers and analysts
   Data access               Individual records, transaction or analysis driven   Set of records, analysis driven
   Data content              Current and near-current                             Historical
   Data structure            Detailed and lightly summarized                      Detailed and highly summarized
   Data organization         Subject-oriented                                     Subject-oriented
   Type of data              Homogeneous                                          Heterogeneous
   Data redundancy           Some redundancy                                      Managed redundancy
   Data update               Field by field                                       Controlled batch
   Database size             Moderate                                             Large to very large
   Development methodology   Data driven                                          Data driven

   The ODS supports day-to-day decisions and operational activities; the data warehouse supports managing the enterprise.

8. What is a data mart?
   - A subset of the data warehouse for a single department or function.
   - It is subject-area specific, so its scope is limited to a particular subject area.
   - It is also a decentralized subset of data, found either in the data warehouse or as a stand-alone store.

9. Use of a data mart:
   - Better query performance
   - Easier navigation through data

   Advantages:
   - Single subject area and fewer dimensions
   - Limited feeds
   - Quick time to market (30-120 days)
   - Quick impact on bottom-line problems
   - Focused user needs
   - Limited scope
   - Optimum model for DW construction

   Disadvantages:
   - Does not provide an integrated view of business information
   - Uncontrolled proliferation of data marts leads to redundant data
   - A large number of data marts is complex to maintain
   - Scalability issues


10. Features of data marts:
   - Low cost
   - Controlled locally
   - Contains less information than the data warehouse
   - Rapid response
   - Easy navigation and better understanding
   - Subject-area specific

11. Data warehouse vs. data mart

                         DATA WAREHOUSE                           DATA MART
   Scope                 Corporate, application neutral,          Specific application requirement,
                         centralized and shared,                  line of business (LoB),
                         cross-LoB/enterprise                     business-process oriented
   Subjects              Multiple                                 Single subject and multiple partitions
   Data sources          Many                                     Few
   Size (typical)        100 GB to TB+                            < 100 GB
   Implementation time   9-18 months (months to years)            4-12 months (months)
   Data perspective      Historical detailed data, some summary   Detailed data (some history), summarized
   Characteristics       Flexible, extensible                     Restrictive, non-extensible
                         Durable, strategic                       Short life, tactical
                         Data orientation                         Project orientation


12. Types of data mart
   - Dependent data mart: operational sources -> enterprise data warehouse -> data marts
   - Independent data mart: created without the use of a central DW. Operational sources -> data marts
   - Hybrid data mart

   Dependent data marts
   A dependent data mart allows you to unite your organization's data in one data warehouse. This gives you the usual advantages of centralization.

   Independent data marts
   An independent data mart is created without the use of a central data warehouse. This can be desirable for smaller groups within an organization.

   Hybrid data marts
   A hybrid data mart allows you to combine input from sources other than a data warehouse. This is useful in many situations, especially when you need ad hoc integration, such as after a new group or product is added to the organization.


   Advantages and disadvantages by type of data mart:

   Independent data mart
   Advantages:
   1. Easy to implement
   2. Can eventually become a DW
   Disadvantages:
   1. Not an enterprise-wide solution
   2. Costly, as more data marts are needed
   3. More than one version of the truth
   4. Limits the amount of historical data
   5. Data transformations needed

   Logical data mart
   Advantages:
   1. Single version of the truth
   2. No historical data limits
   3. Allows drill-downs and trend analysis
   4. No transformations needed
   Disadvantages:
   1. Less physical control over data

   Dependent data mart
   Advantages:
   1. All advantages of the logical data mart
   2. Allows for physical control over data
   Disadvantages:
   1. Additional movement of data is necessary
   2. Some transformation may be needed

13. Enterprise data warehouse (EDW)
   - Contains detailed data as well as summarized data
   - Separate subject-oriented database
   - Supports detailed analysis of business trends over a period of time
   - Used for short-term and long-term business planning

14. Approaches to DW:
   - Top-down approach
   - Bottom-up approach

   Top-down approach (Bill Inmon approach):
   Here the entire EDW is architected and then a small slice of a subject area is chosen for construction. Subsequent slices are constructed until the entire EDW is complete.

   Advantages:
   - Coordinated environment
   - Single point of control and development

   Disadvantages:
   - "Cross everything" nature
   - Analysis paralysis
   - Scope control
   - Time to market
   - Risk and exposure

   Bottom-up approach (Ralph Kimball approach):
   Initially an enterprise data mart architecture (EDMA) is created. Once the EDMA is complete, an initial subject area is selected for the first incremental architected data mart (ADM). The EDMA is expanded in this area to include the full range of detail required for the design and development of the incremental ADM.

   Advantages:
   - Quick ROI (return on investment)
   - Low risk, low political exposure
   - Fast delivery
   - Focused problem
   - Inherently incremental

   Disadvantages:
   - Multiple team coordination
   - Must have an EDMA to integrate incremental data marts

   Reference: The Data Warehouse Lifecycle Toolkit by Ralph Kimball.

15. DW architecture
   It consists of:
   1. Operational and external data: consists of system-specific data and detailed data. It changes continually due to updates and stores data up to the last transaction. It is the source for the data warehouse.
   2. Data staging layer and ODS: extracts data from operational and external data sources, transforms the data and loads it into the DW. This includes decoding production data and merging records from multiple DBMS formats.
   3. Data warehouse layer: stores data used for informational analysis and presents summarized data to the end user for analysis.
   4. Metadata layer: metadata is data about data. It is stored in a repository that contains all corporate metadata resources: databases, catalogs and data dictionaries.
   5. Process management layer: the scheduler, or high-level job control. It is used to build and maintain the DW and data dictionary information, and it keeps the DW up to date.
   6. Information access layer: used by the end users, who generate ad hoc reports and perform multidimensional analysis using OLAP tools. It interfaces with the DW through an OLAP server.

16. Advantages of DW:
   - Lowers the cost of information
   - Improves customer responsiveness
   - Identifies hidden business opportunities
   - Helps to make strategic decisions

17. What is OLAP?
   - Provides a multidimensional view of data.
   - OLAP systems are also known as decision support systems (DSS) and business intelligence systems (BIS).
   - It is fast: queries are expected to return within about 1-3 seconds.
   - It provides consistent information.

18. What are roll up, drill down, slice and dice, pivot, drill across and drill through?
   - Roll up: summarizing the data, moving from low-level (detailed) data to high-level (summarized) data. It is like dimension reduction.
   - Drill down: moving from higher-level summary data to low-level detailed data. It is like introducing new dimensions.
   - Slice and dice: project and select.
   - Pivot: reorient the cube for visualization, e.g. turning a 3D cube into a series of 2D planes.
   - Drill across: a query involving (going across) more than one fact table.
   - Drill through: going through the bottom level of the cube to its back-end relational tables.

19. OLAP technologies
   - MOLAP: multidimensional OLAP
   - ROLAP: relational OLAP
   - HOLAP: hybrid OLAP

   MOLAP:
   - Uses a multidimensional array structure to store and manage DW data.
   - The data cube for OLAP is computed by partitioning the cube array and visiting the cells of the array: the base cuboid is stored in the form of a multidimensional array. All non-base cuboids are precomputed by partitioning and visiting the cells of the base cuboid, and are stored beforehand for analysis.
   - Provides fast indexing to precomputed summarized data.
   - Uses direct addressing (by indexing the cells of the array) to access dimensions.
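The MOLAP storage idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the dimension values, the sales figures and the helper name `cell` are hypothetical, and a real MOLAP server uses far more elaborate array partitioning and compression.

```python
# Hypothetical dimensions and sales figures, for illustration only.
products = ["pen", "book"]
regions = ["north", "south"]
months = ["apr", "may"]

# Base cuboid: cube[p][r][m] holds the sales measure for one cell.
cube = [
    [[10, 20], [5, 15]],   # pen:  north (apr, may), south (apr, may)
    [[7, 3], [8, 12]],     # book: north (apr, may), south (apr, may)
]


def cell(product, region, month):
    """Direct addressing: translate dimension values into array indexes."""
    p = products.index(product)
    r = regions.index(region)
    m = months.index(month)
    return cube[p][r][m]


# Precompute a non-base cuboid (product x region) by visiting the base
# cells and summing out the month dimension (a roll-up over time).
by_product_region = [
    [sum(cube[p][r]) for r in range(len(regions))]
    for p in range(len(products))
]

# Precompute the (product) cuboid from the (product x region) cuboid.
by_product = [sum(row) for row in by_product_region]

print(cell("pen", "south", "may"))   # 15
print(by_product_region)             # [[30, 20], [10, 20]]
print(by_product)                    # [50, 30]
```

Reading a summary is then a plain index lookup into a stored array, which is why precomputed cuboids answer quickly but take extra storage.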

   MOLAP advantages:
   - Time for OLAP analysis is less than with ROLAP, because summaries are precomputed
   - Design complexity is lower

   MOLAP disadvantages:
   - Involves higher development and maintenance costs
   - Less scalability
   - Sparse data sets may reduce storage utilization
   - The aggregate nature of the data reduces the scope for drill-down analysis
   - Not suitable for large databases

   E.g.: Hyperion Essbase, Gentia, Oracle Express

   ROLAP:
   - Uses RDBMS tables for storing and managing DW data (the cube data used for OLAP).
   - Data for OLAP is computed through extraction and cubing: the base cuboid is stored in the form of a fact table. All non-base cuboids are computed on the fly by extracting data from the fact table and performing a series of group-by operations on dimension values.
   - Uses value-based (key-based) addressing to access dimension values.
   - Includes optimization of the DBMS back end through sorting, hashing and so on.

   ROLAP advantages:
   - Involves lower development and maintenance costs
   - High scalability
   - Highly suitable for large databases
   - Supports drill-down analysis

   ROLAP disadvantages:
   - Time for ROLAP analysis is more than with MOLAP
   - Design complexity is high

   E.g.: Informix MetaCube, WhiteLight, Oracle Discoverer, MicroStrategy

   HOLAP:
   - Employs a combination of MOLAP and ROLAP storage.
   - Large volumes of detailed data are stored in a relational DB, while frequently accessed summarized information is kept in a MOLAP store.
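As a rough sketch of the ROLAP description above, the example below stores a base cuboid as a fact table in an in-memory SQLite database and computes non-base cuboids on the fly with GROUP BY. The table name `sales_fact`, its columns, the data and the helper `cuboid` are all assumptions made for illustration; SQLite merely stands in for whatever RDBMS a ROLAP product would use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Base cuboid stored as a fact table (hypothetical schema and data).
conn.execute(
    "CREATE TABLE sales_fact (product TEXT, region TEXT, month TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [
        ("pen", "north", "apr", 10),
        ("pen", "north", "may", 20),
        ("pen", "south", "apr", 5),
        ("book", "north", "apr", 7),
        ("book", "south", "may", 12),
    ],
)

ALLOWED = {"product", "region", "month"}


def cuboid(dimensions):
    """Compute one non-base cuboid on the fly with a GROUP BY query."""
    if not dimensions or not set(dimensions) <= ALLOWED:
        raise ValueError("unknown or empty dimension list")
    cols = ", ".join(dimensions)
    sql = (
        f"SELECT {cols}, SUM(amount) FROM sales_fact "
        f"GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()


by_product = cuboid(["product"])
by_product_region = cuboid(["product", "region"])
print(by_product)          # [('book', 19), ('pen', 35)]
print(by_product_region)
```

Nothing is precomputed here, so every new combination of dimensions costs a fresh scan and aggregation of the fact table. That is the trade-off against MOLAP noted in the lists above.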


20. Selection of OLAP

   REQUIREMENT                                                                CHOICE
   Analysis required and queries to be posed are known at design time         MOLAP
   Data retrieval paths follow the predefined structure of the data cube      MOLAP
   A wide range of different views on the same data is required               ROLAP
   Raw data may be requested                                                  ROLAP
   Ad hoc analysis is required                                                ROLAP
   DB size is greater than 1000 GB                                            ROLAP
   The application requires combined features of MOLAP and ROLAP              HOLAP

21. Data modeling: Process that produces abstract data models for one or more database components of the DW.

22. Data modeling techniques:
   ER model
   o Traditional modeling technique
   o Technique of choice for OLTP
   o Suited for a corporate DW
   o Removes data redundancy
   o The resulting databases cannot be queried easily by end users.
   Dimensional modeling
   o Analyzes business measures in the specific business context.
   o Helps visualize very abstract business questions.
   o End users can easily understand and navigate the data structure.

23. ER modeling:
   - It is the logical design.
   - It represents the business requirements.
   - It arranges data into a series of logical relationships called entities and attributes.
   - Entity: represents a chunk of information that exists in the business world. It is a business definition with a clear boundary, and is described by a noun.
   - Attribute: a component of an entity. Attributes are the characteristics and properties of entities. Each should be unique and self-explanatory. Primary keys, foreign keys and constraints are defined on attributes.
   - Identifier: one or more attributes that uniquely identify an instance of an entity.
   - Relationship: a structural interaction and association between entities. It is described by a verb.


24. Converting logical design to physical design
   - Entities -> tables
   - Relationships -> foreign keys
   - Attributes -> columns
   - Primary unique identifiers -> primary key
   - Unique identifiers -> unique key

   LOGICAL                                   PHYSICAL
   Represents business information and       Represents the physical implementation
   defines business rules                    of the model in a database
   Entity                                    Table
   Attribute                                 Column
   Relationship                              Foreign keys
   Primary key                               Primary key constraint
   Rule                                      Check constraint, default value
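The mapping above can be illustrated with a small physical design in SQLite. The entities `customer` and `customer_order`, their attributes and the business rule "quantity must be positive" are hypothetical; the sketch only shows one plausible way each logical element becomes a physical one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    -- Entity -> table, attributes -> columns.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,      -- primary unique identifier
        email       TEXT NOT NULL UNIQUE,     -- unique identifier -> unique key
        name        TEXT NOT NULL
    );
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        -- Relationship "customer places order" -> foreign key.
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        -- Business rule -> check constraint and default value.
        quantity    INTEGER NOT NULL DEFAULT 1 CHECK (quantity > 0)
    );
    """
)
conn.execute(
    "INSERT INTO customer (customer_id, email, name) VALUES (1, 'a@example.com', 'Ann')"
)
conn.execute("INSERT INTO customer_order (order_id, customer_id) VALUES (100, 1)")


def rejected(sql):
    """Return True if the database refuses the statement on a constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# An order for a customer that does not exist violates the foreign key.
print(rejected("INSERT INTO customer_order (order_id, customer_id) VALUES (101, 99)"))
# A non-positive quantity violates the check constraint.
print(rejected(
    "INSERT INTO customer_order (order_id, customer_id, quantity) VALUES (102, 1, 0)"
))
```

Both calls print `True`: the rules stated in the logical model are now enforced by the database itself.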

25. Disadvantages of the ER model
   - End users cannot understand it.
   - It cannot be queried usefully by software.
   - It defeats the basic allure of a DW, namely intuitive and high-performance retrieval of data.
   - No GUI.

26. Dimensional modeling:
   - Represents data in a framework that allows high-performance access.
   - Processes large, complex, ad hoc and data-intensive queries.
   - Every dimensional model is composed of one table with a multipart key, called the fact table, and a set of smaller tables called dimension tables.

27. What is a schema, and what are the types of schema?
   A schema is a collection of database objects such as tables, views, indexes and synonyms.
   - Star schema
   - Snowflake schema: normalizes dimensions to eliminate redundancy.
   - Fact constellation

   Star schema:
   - Arranged logically around a huge central table that contains all the accumulated facts and figures of the business.
   - It is called a star schema because the ER diagram of the dimension and fact tables resembles a star, where one fact table is connected to multiple dimensions.
   - It does not capture hierarchies directly.
   - It is easy to understand, makes it easy to define hierarchies, and reduces the number of physical joins.
   - A level indicator is needed whenever aggregates are stored with detail facts.
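A minimal star schema along the lines described above might look as follows. The table names (`fact_sales`, `dim_product`, `dim_time`), the columns and the rows are invented for illustration, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: small and descriptive.
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY, name TEXT, category TEXT
    );
    CREATE TABLE dim_time (
        time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER
    );
    -- Fact table: a multipart key made of foreign keys to the dimensions.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        time_key    INTEGER REFERENCES dim_time(time_key),
        amount      INTEGER,
        PRIMARY KEY (product_key, time_key)
    );
    INSERT INTO dim_product VALUES (1, 'pen', 'stationery'), (2, 'novel', 'books');
    INSERT INTO dim_time VALUES (1, '1999-04-04', 'April', 1999),
                                (2, '1999-04-28', 'April', 1999),
                                (3, '1999-05-07', 'May', 1999);
    INSERT INTO fact_sales VALUES (1, 1, 10), (1, 2, 20), (2, 2, 7), (2, 3, 12);
    """
)

# One join per dimension, then aggregate the additive measure.
rows = conn.execute(
    """
    SELECT p.category, t.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_time t ON t.time_key = f.time_key
    GROUP BY p.category, t.month
    ORDER BY p.category, t.month
    """
).fetchall()
print(rows)  # [('books', 'April', 7), ('books', 'May', 12), ('stationery', 'April', 30)]
```

Because each dimension is a single denormalized table, the query needs exactly one join per dimension, which is the "reduced number of physical joins" the notes refer to.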


28. What is a dimension?
   A dimension is a structure that categorizes data in order to enable end users to answer business questions. Commonly used dimensions are customer, product and time.

   Types of dimension:
   - Conformed dimension: one that is consistent across data marts.
   - Degenerate dimension: data that is dimensional in nature but stored in the fact table.
   - Demographic dimension: stores demographic attributes.
   - Junk dimension: groups miscellaneous low-cardinality flags and indicators that do not belong in any other dimension.
   - Causal dimension: used for explaining why a record exists in a fact table. Causal dimensions do not change the grain of the fact table.
   - Slowly changing dimension: a dimension that changes over time. These changes are small in magnitude.

   Example time dimension hierarchy (year > month > day):
   YEAR 1999
     APRIL: 4/4/99, 28/4/99
     MAY:   7/5/99

29. Dimension table:
   - Wide
   - Short
   - Uses surrogate keys
   - Contains links to the corresponding records in source tables
   - Contains additional date and active-flag fields

30. Surrogate keys:
   - Joins on a surrogate key are faster.
   - They can handle slowly changing dimensions as well.
   - They should be integers.

31. Types of slowly changing dimensions
   There are 3 types:
   - Type 1: if a record needs to have its value changed, update the existing record with the new value if it exists; otherwise insert it as a new record. No history is kept.
   - Type 2: insert a new record for each change so that history is preserved. Variants of Type 2:
     - Versioning
     - Flag
     - Effective date
   - Type 3: keeps the old and new values in the same record and has an effective-date column that shows the date since which the new value has been effective.

32. Fact table
   Steps to design a fact table:
   - Determine the granularity.
   - Identify the measures the user needs.

   A record in the fact table contains a primary key made up of the concatenation of the foreign keys to the dimension tables. Facts (measures) are uniquely identified by this primary key.

   Types of facts:
   - Additive: measures that can be added across all dimensions. E.g.: sales amount.
   - Non-additive: measures that cannot be added across any dimension. E.g.: profit margin, temperature.
   - Semi-additive: measures that can be added across some dimensions but not others. E.g.: current balance, inventory.

   Types of fact tables:
   - Cumulative: describes what has happened over a period of time. The facts for this type are mostly additive.
   - Snapshot: describes the state of things at a particular instant of time and usually includes semi-additive and non-additive facts.
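The difference between Type 1 and Type 2 handling can be sketched with a small in-memory dimension. Everything here is an assumption for illustration: the customer dimension, the column names (`sk`, `effective_from`, `effective_to`, `current`) and the helper functions. The Type 2 variant shown combines the flag and effective-date styles listed above.

```python
from datetime import date

# Hypothetical customer dimension. "sk" is the integer surrogate key;
# "customer_id" is the natural key coming from the source system.
dimension = [
    {"sk": 1, "customer_id": "C1", "city": "Pune",
     "effective_from": date(1999, 4, 4), "effective_to": None, "current": True},
]


def new_row(rows, customer_id, city, today):
    """Build a current row with the next surrogate key."""
    next_sk = max((r["sk"] for r in rows), default=0) + 1
    return {"sk": next_sk, "customer_id": customer_id, "city": city,
            "effective_from": today, "effective_to": None, "current": True}


def apply_type1(rows, customer_id, city, today):
    """Type 1: overwrite the existing record, or insert if it is new."""
    for row in rows:
        if row["customer_id"] == customer_id and row["current"]:
            row["city"] = city          # the old value is lost
            return
    rows.append(new_row(rows, customer_id, city, today))


def apply_type2(rows, customer_id, city, today):
    """Type 2: close the current version and add a new row."""
    for row in rows:
        if row["customer_id"] == customer_id and row["current"]:
            if row["city"] == city:
                return                  # nothing changed
            row["effective_to"] = today
            row["current"] = False
    rows.append(new_row(rows, customer_id, city, today))


apply_type2(dimension, "C1", "Mumbai", date(1999, 5, 7))   # history kept: 2 rows
apply_type1(dimension, "C1", "Delhi", date(1999, 6, 1))    # overwrite: still 2 rows

for row in dimension:
    print(row["sk"], row["city"], row["current"])
# 1 Pune False
# 2 Delhi True
```

Fact rows loaded before the move keep pointing at surrogate key 1, so old facts still report against the old city. A natural key alone could not express that, which is why surrogate keys and Type 2 changes go together.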


33. Factless fact tables
   Fact tables that contain no measures or facts are called factless fact tables. There are 2 types:
   - Coverage tables: required when a primary fact table is sparse.
   - Event tracking tables: used for tracking an event.

34. Fact constellation (3rd type of schema)
   - Multiple fact tables share dimension tables.
   - It is viewed as a collection of stars, hence it is called a galaxy schema or fact constellation.
   - Sophisticated applications require such a schema.
   - Advantage: no need for the level indicator in the dimension tables.
   - Disadvantage: can slow performance.

   A fact constellation is a good alternative to the star schema, but when the dimensions have high cardinality, the sub-selects in the dimension tables can be a source of delay. An alternative is to normalize the dimension tables by attribute level, with each smaller dimension table pointing to an appropriate aggregated fact table: the snowflake schema.

35. Snowflake schema:
   - Dimension tables are normalized by decomposing them at the attribute level.
   - Each dimension has one key for each level of the dimension's hierarchy.
   - Good performance when queries involve aggregation.
   - Complicated maintenance and metadata; explosion in the number of tables.
   - Makes the user representation more complex and intricate.

   Disadvantages:
   - Complicates end-user query construction
   - Adds an additional level of join complexity
   - Database optimizers do not handle it very well
   - Saves some space at the cost of longer queries

   STAR                    SNOWFLAKE
   Denormalized            Normalized
   No complex joins        Uses complex joins
   High performance        Lower performance
   Occupies more space     Occupies less space

   ER                                             DIMENSIONAL
   Data remains normalized                        Uses denormalized data
   User access more complex                       Simplified data model for user access
   Useful in enterprise-wide DW implementations   Most often used in data marts
   Timestamp usually in key structure             Can be integrated through dimension sharing

   ER modeling is used to illuminate the microscopic relationships among data elements.

36. ETL
   The process by which data is integrated and transformed from the operational systems into the DW environment. It handles data redundancy.

   ETL is used when there are:
   - Different source data formats
   - Incremental updates
   - Inconsistent file names
   - Missing column headers

37. Steps in the ETL process:
   - Capture
   - Scrub (data cleansing)
   - Transform
   - Load and index

   Capture: extracting the data. There are 2 types of extract:
   - Static extract: capturing a snapshot of the source data at a point in time.
   - Incremental extract: capturing the changes that have occurred since the last static extract.

   Scrub: cleanses the data, using pattern recognition and AI techniques to upgrade the data quality. It does 2 things:
   - Fixes errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data and inconsistencies.
   - Also performs: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging and locating missing data.

   Transform: converts data from the format of the operational systems to the format of the DW.
   - Record level:
     - Selection: data partitioning
     - Joining: data combining
     - Aggregation: data summarization
   - Field level:
     - Single field: from one field to one field
     - Multi-field: from many fields to one, or from one field to many

   Load and index: places the transformed data into the DW and creates indexes.
   - Refresh mode: bulk rewriting of the target data at periodic intervals.
   - Update mode: only changes in the source data are written to the DW.
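The four ETL steps above can be strung together in a toy pipeline. This is a sketch under assumptions, not how any particular ETL tool works: the source records, the integer `updated` change marker and the function names are hypothetical, and the scrub step only trims, normalizes case and drops exact duplicates.

```python
# Hypothetical source records; "updated" is a simple change counter.
source = [
    {"id": 1, "first": " ann ", "last": "LEE", "updated": 1},
    {"id": 2, "first": "bob", "last": "ray", "updated": 2},
    {"id": 2, "first": "bob", "last": "ray", "updated": 2},   # duplicate
    {"id": 3, "first": "cy", "last": "ito", "updated": 5},
]


def capture(rows, since=None):
    """Static extract when since is None, otherwise incremental extract."""
    return [r for r in rows if since is None or r["updated"] > since]


def scrub(rows):
    """Trim and normalize names, then drop exact duplicate records."""
    seen, clean = set(), []
    for r in rows:
        fixed = {**r,
                 "first": r["first"].strip().title(),
                 "last": r["last"].strip().title()}
        key = tuple(sorted(fixed.items()))
        if key not in seen:
            seen.add(key)
            clean.append(fixed)
    return clean


def transform(rows):
    """Multi-field transformation: two source fields become one target field."""
    return [{"id": r["id"], "name": f'{r["first"]} {r["last"]}'} for r in rows]


def load(warehouse, rows, mode):
    """Refresh mode rewrites the target; update mode writes only the changes."""
    if mode == "refresh":
        warehouse.clear()
    for r in rows:
        warehouse[r["id"]] = r["name"]


warehouse = {}
load(warehouse, transform(scrub(capture(source))), mode="refresh")

# Later, one source record changes; only that change is extracted and loaded.
source.append({"id": 2, "first": "bob", "last": "stone", "updated": 7})
load(warehouse, transform(scrub(capture(source, since=5))), mode="update")

print(warehouse)  # {1: 'Ann Lee', 2: 'Bob Stone', 3: 'Cy Ito'}
```

The first run is a static extract loaded in refresh mode; the second is an incremental extract loaded in update mode, so records 1 and 3 are never touched again.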


38. Extract and transformation types
   Extract types:
   - Full extract: replaces the existing data in the DW with the new data.
   - Periodic/incremental extract: appends the new data and the changed data to the existing data.

   Transformation types:
   - Structural transformation: e.g. sums and averages.
   - Format transformation: data type conversions and splitting.
   - Single-field transformation: transforming old data to new data.
   - Multi-field transformation: from many source fields to one target field.

39. What is a repository?
   A repository is a database containing the enterprise's metadata (data about data), together with an access and reporting mechanism for that database.

   Ideal repository characteristics:
   - Openness
   - Flexibility
   - Usability
   - Extensibility
   - Versioning
   - Performance: optimized data store

40. What is metadata?
   Metadata describes the data being captured and loaded into the warehouse.

41. Types of metadata:
   - ETL metadata. E.g.: source/target table name, DB type, fields, length, type, comment, nullable, mappings, sessions, transformation objects.
   - Database metadata. E.g.: table/view names, length, comment, nullable, stored procedures, indexes, users.
   - Reporting metadata: reports and tables in tools such as Business Objects (BO) and MicroStrategy.

42. What is dormant data?
   Data that is hardly used in the DW is dormant data. The faster the DW grows, the more data becomes dormant. Over a period of time the amount of dormant data in a DW increases.

43. Origin/sources of dormant data:
   - Storing history data that is not required
   - Storing columns that are never used
   - Storing detail-level data when only summary-level data is used
   - Creating summary data that is never used

44. Techniques for tuning a DW:
   - Handling dormant data
   - Storing pre-summarized data based on data usage patterns
   - Creating indexes for data that is frequently used
   - Merging tables that have common and regular access

45. Data warehousing tools:
   Design tools:
   - ERwin
   - Powersoft WarehouseArchitect
   - Oracle Designer

   ETL tools:
   - Oracle Warehouse Builder
   - PowerCenter/PowerMart from Informatica
   - DataStage from Ascential
   - Ab Initio

   Reporting tools:
   - Discoverer
   - Business Objects
   - Cognos
   - Crystal Reports

ODS (operational data store): a database designed for queries on transactional data. It is often an interim or staging area for a data warehouse. It differs from a DW in that its contents are updated in the course of business, whereas a DW contains static data. It is designed for performance and for numerous queries on small amounts of data, such as an account balance, while a DW is built for elaborate queries on large amounts of data. An ODS holds detailed data and gets its data from heterogeneous sources. It does not store summary data, and it contains only current data.
