
Data Warehousing

Contents

1.0 Overview
    1.1 Rationale for the Data Warehouse
    1.2 Brief Overview of Data Warehousing
2.0 Creating the Data Warehouse
    2.1 The Developmental Phases
    2.2 Definition Phase
    2.3 Generation Phase
    2.4 The Manage and Load Phase
3.0 Enabling Technologies
    3.1 Technologies for Data Warehouse Generation
    3.2 Technologies for Data Management
    3.3 Technologies for Information Access
    3.4 Data Mining
4.0 Standards and Trends
    4.1 Metadata
    4.2 OLAP Information Access
5.0 Glossary

1.0 Overview

1.1 Rationale for the Data Warehouse

There are several ways that companies can use data warehousing successfully:
- Provide business users with a customer-centric view of heterogeneous data
- Provide value to customers with better information when coupled with Internet access technology
- Provide a repository of customer information for business planning and analysis
- Integrate islands of information to provide meaningful insights to entities such as clients and suppliers
- Provide macro-level views of business value delivery and performance when combined with external feeds
- Enhance traditional reporting with more insight
- Provide reports on the global trends of a multinational corporation

1.2 Brief Overview of Data Warehousing

In a data warehouse, a company transforms its data into a more useful resource by doing the following:
- Summarizing the data (grouping it more conveniently for end users)
- Transforming the data (putting it into more usable formats)
- Categorizing the data (so it can be sliced and diced for information)
- Distributing the data (increasing availability and accessibility)

Data in a data warehouse always has an explicit time dimension, which allows historical information to be built up. The foundation of a successful data warehouse project is an enterprise-wide data model that provides the framework for analyzing the performance measures underlying the enterprise's business decision-making processes. An effort therefore needs to be made to identify the performance measures, the data elements and the algorithms, and then map these onto the enterprise-wide data model. The data is then organized into categories (data marts), and the information in these individual categories is tied together through metadata. Data cleansing, the improvement of data quality, is a vital prerequisite of a successful data warehouse and is an ongoing process.

The slicing and dicing of this data is what is called OLAP, or online analytical processing. OLTP systems have not been able to serve these needs, so OLAP is required. OLAP systems allow ad hoc processing and support access to data over time periods. OLAP needs data that is lightly to highly summarized, unchanging, and accessed as read only; OLAP systems hold the aggregated, transformed, integrated and historical collection of OLTP data from one or more systems. More recently, OLAP functionality has been built into OLTP systems, producing what is called an ODS (operational data store).
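As a minimal illustration of slicing and dicing summarized data along the time dimension, a query against a simple warehouse schema might look like the following. The table and column names (sales_fact, time_dim and so on) are invented for this sketch and do not come from the document.

    -- Summarize sales by quarter, sliced on the time dimension and diced on a
    -- business attribute (region). Table and column names are hypothetical.
    SELECT t.fiscal_year,
           t.fiscal_qtr,
           f.region,
           SUM(f.sales_amount) AS total_sales,
           COUNT(*)            AS transactions
    FROM   sales_fact f,
           time_dim   t
    WHERE  f.time_key = t.time_key
    AND    t.fiscal_year = 1998          -- slice: one year of history
    AND    f.region = 'ASIA PACIFIC'     -- dice: one region
    GROUP  BY t.fiscal_year, t.fiscal_qtr, f.region
    ORDER  BY t.fiscal_qtr;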

2.0 Creating the Data Warehouse

[Figure: Typical data warehouse architecture. Source data (operational data store, external data, sales data) flows through warehouse generation and data management layers to information access via a web server and browser, graphical presentation, and query and reporting tools.]

The above represents the stages of creating a data warehouse. The next chapter discusses some of the technologies involved at each stage/level and some products.

2.1 The Developmental Phases

Although data warehousing projects develop in a non-linear and highly iterative fashion, three developmental phases are commonly recognized. The following is an outline of how such projects are done in an Oracle environment.

1. Definition Phase. Numerous definitions are created that describe a logical warehouse. These definitions formally describe:
   - The warehouse schema
   - Data sources
   - Map, transform, and load operations
   For the most part, these are pure definitions.

2. Generation Phase. A physical instance of the logical warehouse is described by configuring a set of definitions. The configured definitions are validated and then used to generate a set of scripts to create and manage the physical instance:
   - DDL scripts to create the warehouse and intermediate schemas
   - PL/SQL, SQL*Loader, and TCL scripts to extract data, map and transform the data, and then load it into the physical instance
   After the scripts are generated, they are used to deploy objects to the physical instance. The scripts are also deployed to a file system where they can be used by Oracle Enterprise Manager and Oracle Workflow to schedule and manage load and refresh jobs.

3. Load and Manage Phase. After the generated scripts have been deployed to a file system, the infrastructure for load and refresh jobs is implemented:
   - Job dependencies are described as an Oracle Workflow process
   - Oracle Enterprise Manager is used to schedule individual jobs or an Oracle Workflow process

2.2 Definition Phase

During this phase, numerous definitions are created to completely describe a logical warehouse. These include definitions for the target warehouse schema, intermediate schemas that serve as work areas, and the source data, whether from database sources or operating system flat files.

Warehouse Schema
A data warehouse project begins with the characterization of user requirements. As these requirements grow more definite, a dimensional model usually suggests itself; dimensions and facts can literally be read off a set of user questions. Builder's initial role is to capture this conceptual model as it evolves and store it as a set of formal definitions that define a logical warehouse. Builder stores these definitions in a repository structure called a warehouse module. As the requirements converge on a solution, you can continue to modify the formal definition using Builder's Warehouse Module Editor and display the schema in a tabular or graphical format. Builder is particularly powerful because you can define a logical warehouse and then configure various physical instances. Builder can then validate the definitions for a physical instance, generate the scripts to create objects for the instance, and generate another set of scripts to manage the instance. For example, you could design a comprehensive Sales and Marketing model but then deploy various versions as physical instances, from simple to complex.
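As a minimal sketch of the kind of dimensional model such definitions describe (the tables and columns are invented here for illustration and are not taken from the document), a simple sales star schema might be defined as:

    -- Dimension tables hold descriptive attributes; the fact table holds the
    -- measures and references each dimension by its surrogate key.
    CREATE TABLE time_dim (
        time_key     NUMBER        PRIMARY KEY,
        calendar_dt  DATE          NOT NULL,
        fiscal_year  NUMBER(4)     NOT NULL,
        fiscal_qtr   NUMBER(1)     NOT NULL
    );

    CREATE TABLE product_dim (
        product_key  NUMBER        PRIMARY KEY,
        product_name VARCHAR2(100) NOT NULL,
        category     VARCHAR2(50)
    );

    CREATE TABLE sales_fact (
        time_key     NUMBER        NOT NULL REFERENCES time_dim (time_key),
        product_key  NUMBER        NOT NULL REFERENCES product_dim (product_key),
        region       VARCHAR2(30),
        sales_amount NUMBER(12,2),
        units_sold   NUMBER
    );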

The Source Data
As a data warehouse project converges on a dimensional model, developers start assessing sources of data for the warehouse. You can use the Import Metadata Wizard to import definitions of data sources from:
- Database sources, including Oracle Designer repositories
- Flat file sources
- SAP applications
This Builder wizard imports definitions and stores them in a repository structure called a "Source Module." During a later stage, these source definitions are used in the design, development, and implementation of the routines that extract, transform, and load data into target warehouse schemas.

Source to Target Mappings
After the target schema for the logical warehouse and its data sources have been logically defined, you can create definitions that describe how to extract, transform, and load data. These definitions are called "Mappings" and are stored in the warehouse module that defines the target warehouse schema. They extract and transform data from database and flat file sources and then load it into target tables. This could be a simple one-step move or a complex multi-stage process involving joins, filters, aggregations and so on, as in the sketch below.
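A mapping of this kind would normally be generated by Builder as PL/SQL; purely as an illustration of the extract-transform-load shape (the staging and target names are hypothetical, continuing the earlier sketch), a hand-written one-step version might read:

    -- Extract from a staging table, apply simple transformations and a filter,
    -- aggregate, and load into the target fact table.
    INSERT INTO sales_fact (time_key, product_key, region, sales_amount, units_sold)
    SELECT t.time_key,
           p.product_key,
           UPPER(TRIM(s.region)),                 -- transformation
           SUM(s.amount),                         -- aggregation
           SUM(s.quantity)
    FROM   stg_sales   s,
           time_dim    t,
           product_dim p
    WHERE  t.calendar_dt  = TRUNC(s.sale_date)    -- join to the time dimension
    AND    p.product_name = s.product_name        -- join to the product dimension
    AND    s.amount IS NOT NULL                   -- filter out incomplete rows
    GROUP  BY t.time_key, p.product_key, UPPER(TRIM(s.region));

    COMMIT;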

2.3 Generation Phase

At this point, Builder's repository contains formal definitions for the logical warehouse schema, all of its data sources, and the mapping and transformation operations. The definitions can now be configured to create and load a physical instance of the warehouse. This phase takes four steps:
1. Configure a set of definitions for a physical instance
2. Validate the set of definitions
3. Generate scripts from the set of definitions
4. Deploy objects to the physical instance and the scripts to a file system
The working assumption is that you know the properties required by the physical instance of the warehouse. Builder generates the following kinds of scripts from a configured set of definitions:
- DDL scripts that create objects for the physical instance: database links, staging tables, dimensions, facts, and materialized views
- PL/SQL and SQL*Loader routines that extract, transform, and load data into the physical instance
- TCL scripts that are registered with Oracle Enterprise Manager, are used to define the Oracle Workflow processes that manage job dependencies, and run the jobs that load and refresh the physical instance
The generated scripts are deployed to a file system and used to deploy objects to the physical instance.

Configure
You describe a physical instance of the logical warehouse by configuring a set of definitions. The configuration parameters determine whether an object is deployed, the processing characteristics of selected objects, the location of deployed scripts, the physical properties of warehouse objects, and a host of other properties. When you configure a set of definitions, you can also define and configure a specific set of indexes and partitions for the objects deployed to the physical instance.

Validate
After you configure a physical instance, you need to validate the set of definitions before you generate the scripts that implement the instance. The validation procedures check foreign key references, object references in mappings, data type mismatches, and other properties.

Generate
After you have a validated set of configured definitions that describe a physical instance, you generate one set of scripts that is used to deploy objects to the physical instance and another that is used for load and refresh jobs (a hand-written sketch of the kind of summary DDL this step can produce is shown below).

Deploy Objects
After Builder generates the scripts for a physical instance, you can immediately deploy the database objects (database links, tables, dimensions, facts, materialized views, synonyms, and PL/SQL packages) to one or more instances. You can also deploy the scripts that create the objects to a file system, and then deploy the objects by manually executing the deployed scripts.
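The generated DDL typically includes summary structures. As a hand-written sketch of such a script (the object names are invented here and were not generated by Builder), a materialized view and its refresh call could look like:

    -- Pre-computed quarterly summary; ENABLE QUERY REWRITE lets the optimizer
    -- answer matching queries from the summary rather than the detail table.
    CREATE MATERIALIZED VIEW sales_by_qtr_mv
        BUILD IMMEDIATE
        REFRESH COMPLETE ON DEMAND
        ENABLE QUERY REWRITE
    AS
    SELECT t.fiscal_year, t.fiscal_qtr, f.region,
           SUM(f.sales_amount) AS total_sales
    FROM   sales_fact f, time_dim t
    WHERE  f.time_key = t.time_key
    GROUP  BY t.fiscal_year, t.fiscal_qtr, f.region;

    -- A refresh job deployed with the load scripts would then call:
    EXECUTE DBMS_MVIEW.REFRESH('SALES_BY_QTR_MV', 'C');   -- 'C' = complete refresh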

Deploy Scripts
After Builder generates the scripts for a physical instance, you can immediately deploy the scripts to a set of directories. These directories include DDL, PLS, TCL, and SQL*Loader control file directories, among others. The TCL scripts define jobs that can be run manually or scheduled to load and refresh the physical instance.

Register Scripts
After Builder generates the scripts for a physical instance, you can register the TCL scripts that define load and refresh jobs with an Oracle Enterprise Manager repository. You can also register scripts with an Oracle Workflow server to manage job dependencies, such as the dependence of fact tables on their dimensions.

2.4 The Manage and Load Phase

The warehouse objects have been deployed to a physical instance, and the jobs that initially load or periodically refresh the instance have been deployed to a directory and registered with Oracle Enterprise Manager. The next step is to set up and schedule the jobs that initially load and periodically refresh the data warehouse. Jobs that load new data or refresh existing data in a data warehouse must be run in a strict sequence to ensure that all foreign key references can be satisfied. Generally, this means that referenced tables must be loaded before the tables making the references can be loaded, as in the sketch below. You can manage these dependencies by manually scheduling the jobs, or you can define an Oracle Workflow process to manage the dependencies and then schedule the Workflow process.
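Purely to illustrate the ordering constraint (the load procedures named here are hypothetical; in practice the generated jobs and an Oracle Workflow process would sequence them), a dependency-respecting load can be pictured as:

    -- Referenced (dimension) tables are loaded before the referencing fact
    -- table so that every foreign key can be satisfied.
    BEGIN
        load_time_dim;       -- hypothetical procedure: refresh the time dimension
        load_product_dim;    -- hypothetical procedure: refresh the product dimension
        load_sales_fact;     -- loaded last, once its referenced keys exist
        COMMIT;
    END;
    /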

3.0 Enabling Technologies

Many of the technologies associated with data warehousing are not new. Some of the key enabling technologies are:
- Desktop systems with powerful GUIs
- RDBMS
- Public networks such as the Internet
- Private networks built on public infrastructure, such as intranets
- Extraction and transformation products
- Metadata repository tools
- Scalability technology in storage, processing units, networking and so on

3.1 Technologies for Data Warehouse Generation

This step performs the following functions:
- Prepare and extract the data from operational systems
- Transform and cleanse the data (see the sketch after the table below)
- Move and load the data to the warehouse server
- Model the data in the data warehouse (technical and business metadata)

Some of the vendors who provide tools and products for this step are as follows:

Company                 Product
Apertus Carleton        Metacenter
D2K                     Tapestry
ETI                     ETI Extract
IBM                     Visual Warehouse
Informatica             Powermart
IB                      SmartMart
Platinum                InfoRefiner
Praxis Int.             OmniEnterprise
PWC                     Geneva V/T
Prism Solutions         Prism Warehouse Executive
Sagent Technology       Sagent Datamart
Vality Technology       Integrity
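As a small illustration of the transform-and-cleanse step these products automate (the staging table and its columns are hypothetical, continuing the earlier sketches), the same work can be written by hand as:

    -- Standardize codes and dates and supply defaults in a staging table
    -- before the data is moved to the warehouse server.
    UPDATE stg_sales
    SET    region    = UPPER(TRIM(region)),
           sale_date = TRUNC(sale_date),
           quantity  = NVL(quantity, 0);

    -- Rows that cannot be repaired are filtered out rather than loaded.
    DELETE FROM stg_sales
    WHERE  sale_date IS NULL
    OR     amount IS NULL;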

3.2 Technologies for Data Management

One of the newer technologies enhancing data management for warehousing is the emergence of multidimensional databases (MDBMS). A traditional RDBMS has two dimensions, relating an entity to its attributes; an MDBMS stores that data along with other dimensions such as time and value. Joins are also performed differently: an MDBMS can do a star join, while a traditional RDBMS can only perform joins in pairs (a SQL sketch of such a star query follows the vendor table below). An MDBMS is optimized for data retrieval, while an RDBMS is optimized for updates. Multidimensional databases store data as a sparse hypercube (such as the Oracle MDBMS) or as micro hypercubes (such as the product from Brio Technology). Some of the vendors who provide technologies for data management for data warehousing are as follows:

Company                  Product
Arbor                    Essbase Server
Gentia                   Gentia
IBM                      DB2 Universal Database
Information Advantage    Decision Suite
Informix                 Informix Metacube
Microsoft                SQL Server
MicroStrategy            DSS Server
NCR                      Teradata
Oracle                   Express Server
Pilot                    DSS
Platinum Technology      InfoBeacon
Red Brick Systems        Red Brick Warehouse
SAS                      Warehouse Administrator
Seagate                  Crystal Info
Sybase                   IQ database
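For illustration (using the hypothetical schema from the earlier sketches), the star query shape being discussed looks like this in SQL; a traditional RDBMS resolves it through pairwise joins or a star transformation, whereas an MDBMS navigates the stored cube directly:

    -- One fact table constrained and joined against several dimensions at once.
    SELECT p.category,
           t.fiscal_year,
           SUM(f.sales_amount) AS total_sales
    FROM   sales_fact f, time_dim t, product_dim p
    WHERE  f.time_key    = t.time_key
    AND    f.product_key = p.product_key
    AND    t.fiscal_year IN (1997, 1998)
    AND    p.category    = 'BEVERAGES'
    GROUP  BY p.category, t.fiscal_year;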

3.3 Technologies for Information Access

This is the third technology group needed to build a working data warehouse. It covers information access, graphical representation (a graphical view of the information) and data mining (uncovering previously unknown information). Information access can be of various types:
- Predefined access: 4GLs and other reporting tools provide this facility. Many of them are now preconfigured into web front ends to deliver commonly used information.
- Ad hoc access: Desktop applications such as MS Office or browsers are used for this. There are also more sophisticated OLAP tools that provide aggregation, analysis and viewing facilities on the data (a drill-down style query is sketched after the table below).
- Tools for frequent access: Such tools, also called MQE, enable high-frequency users to use the warehouse.
- Advanced access and analysis tools: These enable users to perform statistical and other advanced operations at the front end of the data warehouse.
The graphical representation of these queries provides the user access service, converting the results into meaningful information with a snapshot view of the relevant data. Some of the products that fall in the information access category are as follows:

Company             Product
Andyne              GQL
Brio                BrioQuery
Business Objects    Business Objects
Cognos              Impromptu, Powerplay, Scenario
IBM                 Intelligent Miner toolkit
Integral Solutions  Clementine data mining toolkit
IQ Software         IQ
Microsoft           MS Query, Access, Excel
Platinum            Forest & Trees
SAS                 Enterprise Miner, Reporter, System for Information Delivery
Seagate             Crystal Info
Silicon Graphics    MineSet
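As a rough SQL approximation of the drill-down and aggregation facilities these OLAP tools expose (schema names hypothetical, as in the earlier sketches):

    -- ROLLUP produces subtotals at each level of the year > quarter hierarchy
    -- plus a grand total, approximating a drill-up/drill-down view.
    SELECT t.fiscal_year,
           t.fiscal_qtr,
           SUM(f.sales_amount) AS total_sales
    FROM   sales_fact f, time_dim t
    WHERE  f.time_key = t.time_key
    GROUP  BY ROLLUP (t.fiscal_year, t.fiscal_qtr);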

3.4 Data Mining

Data mining provides a means of extracting previously unknown, actionable information from the growing base of data in the warehouse. This technology uses sophisticated, automated algorithms to discover hidden patterns, correlations and relationships among millions of data items. Typically, data mining performs the following functions:
- Classification
- Clustering
- Association (a crude SQL approximation is sketched below)
- Sequencing
- Forecasting
Some representative companies and products are given below:

Company             Product
Angoss              KnowledgeSeeker
Business Objects    BusinessMiner
Datamind            Datacruncher
IBM                 Intelligent Miner
Integral Solutions  Clementine
Magnify             Pattern
Pilot Software      DSS
SAS                 Enterprise Miner
Thinking Machines   Darwin
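The dedicated algorithms these products use go well beyond plain SQL, but as a crude flavour of the association function (which items occur together), using a hypothetical order_items table:

    -- Count how often two products appear on the same order; a rough
    -- co-occurrence measure, not a substitute for a real mining algorithm.
    SELECT a.product_key AS product_a,
           b.product_key AS product_b,
           COUNT(*)      AS times_together
    FROM   order_items a, order_items b
    WHERE  a.order_id    = b.order_id
    AND    a.product_key < b.product_key        -- avoid self-pairs and duplicates
    GROUP  BY a.product_key, b.product_key
    ORDER  BY times_together DESC;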

4.0 Standards and Trends

The need for standardization is felt in some of the key aspects of warehouse generation and information access. In the generation stage, metadata standardization is critical for client flexibility and vendor independence.

4.1 Metadata

Today, there is a great deal of inconsistency in defining, storing and managing metadata; each product has its own approach. As seen in the previous chapter, there are several vendors of data warehouse generation tools, and each has its own approach. Because no standard exists, the choice of product for data generation will guide the choice of the data management and tools components in the total solution. There is a move to standardize on the Metadata Interchange Specification (MDIS). The relational database vendors are also trying to create their own standards: Microsoft supports OIM and COM; the Informatica-initiated MX architecture could also provide a standard for metadata, and companies such as Brio, Business Objects and Cognos support it; Oracle has established WTI, a method of integrating with Oracle's database, and given the database's popularity many vendors have adopted this initiative.

4.2 OLAP Information Access

The goal is transparent access between any client and any server running an OLAP application. Microsoft has established standards called OLE DB for OLAP which facilitate this. Microsoft also offers the Active Framework for data warehousing, which provides a common architecture and interfaces for third parties. The other standard doing the rounds is the MD API.

One trend in this marketplace, from the customer perspective, is to buy the entire solution rather than build it. On the vendor side, there is a consolidation of tools to cover the entire spectrum of creating a solution, and a move to leverage Internet tools to provide the access. The confluence of warehouse technology at the back end and Internet technology at the front end has profound implications for delivering the value of data warehousing.
- Web accessible architecture: This allows users to use a browser to view the data in a predefined way via the intranet or extranet. It has minimal flexibility for ad hoc reporting and query.
- Web enabled tools: Here, the access tools reviewed in the earlier chapter are being web enabled. These tools can now run in a browser and display information dynamically using HTML. Architecturally, they deploy a four-tier architecture consisting of the browser, web server, application server and database. This has some flexibility but cannot yet handle sophisticated and advanced queries, though this will come with time.
- Web exploited tools: These are the most advanced forms of information access using web technology, with Java as the platform. Client applets interact directly with server-side Java applications via RPC or MOMs, which in turn interact directly with the database using JDBC. Such tools will greatly enable the successful use of data warehouses.

The other trend is to create packaged data warehouses for vertical industries; suppliers such as Comshare, Platinum and SAS offer such packages. Many cross-industry applications are embedding data warehousing capabilities in their packaged solutions. ERP vendors are also adding an OLAP layer to sit on top of their OLTP systems, though this has had greater impact on data marts than on data warehouses.

5.0 Glossary

Data warehousing
Data warehousing is the process of extracting, integrating, aggregating, filtering, summarizing, standardizing, transforming, cleansing and quality-checking the organization's application data and storing it in a consolidated database. This database will end up being the single source from which management will access and retrieve information for decision making.

Data mart
A data mart can be viewed as a more specialized data warehouse. The size of a data mart is much smaller than that of a data warehouse, so the time required to implement a data mart is less than the time required to implement a data warehouse. The cost of implementing a data mart is also less than that of a data warehouse.

Data extraction
Data extraction is the process of extracting data from the OLTP databases or from external sources and storing it in a consolidated area. The extracted data will be used for populating the data warehouse. Some of the data extraction techniques that can be used are snapshot extraction, time stamp extraction, etc. (a time-stamp extraction sketch appears after this group of entries).

Data transformation
The data that is extracted from the OLTP databases and external data sources is raw data. Data transformation has to be carried out on the extracted data before it can be stored in the data warehouse. Some of the data transformation tasks that have to be carried out are data substitution, data standardization, data filtering and data summarization.

Data standardization
As the data which is going to be stored in the data warehouse will be retrieved from multiple OLTP databases, it will have to be standardized. Examples of data standardization are the standardization of the data stored in date fields or the standardization of data that is stored in flags.

Data filtering
Data filtering is the process of extracting only the required data from the OLTP databases and external data sources. The filtered data that has been extracted will be used for populating the data warehouse files. For example, the end user may want to store only the last five years of sales data in the data warehouse.

Data cleansing
Data cleansing is the process of cleansing and enhancing the data that is stored in the data warehouse. Some of the data cleansing activities are the standardization of data, ensuring the accuracy and quality of the data that is stored in the data warehouse, initializing fields in the data warehouse, ensuring the integrity of the data that is stored in the data warehouse, etc.
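As a small sketch tying the extraction and filtering entries above together (the source table, the etl_control bookkeeping table and all column names are hypothetical):

    -- Time stamp extraction: pull only rows changed since the last run, and
    -- filter to the five-year window the warehouse is meant to keep.
    INSERT INTO stg_sales (order_id, product_name, region, sale_date, amount, quantity)
    SELECT o.order_id, o.product_name, o.region, o.sale_date, o.amount, o.quantity
    FROM   oltp_orders o
    WHERE  o.last_updated > (SELECT last_extract_ts
                             FROM   etl_control
                             WHERE  job_name = 'SALES_EXTRACT')
    AND    o.sale_date >= ADD_MONTHS(SYSDATE, -60);   -- only the last five years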

Meta data
Meta data can be thought of as data about data. It forms an important part of any data warehouse and can determine the success of a data warehouse. There are basically two kinds of meta data that can be found in a data warehouse: technical meta data and business meta data. Meta data is useful for those who administer and use the data warehouse.

Operational data
Operational data is the data that is stored in the organization's OLTP applications. Operational data is usually stored in relational databases or in flat files and is also called real time data. One of the characteristics of operational data is that it is volatile. Operational data is not ideal for analysis; informational data can be derived from it.

Operational database
Operational databases contain an organization's operational data, generated by its OLTP systems. The data that is present in these databases is not ideal for carrying out data analysis. The data in these operational databases is used for populating the files in a data warehouse.

OLAP
OLAP, otherwise known as On Line Analytical Processing, is a multidimensional analysis technique that end users can use for carrying out analysis on the data that is stored in a data warehouse. Some of the basic concepts associated with OLAP are drill down and drill up along a dimension, slice and dice through the data, drill across, etc.

Non volatile data
Non volatile data is data that will not undergo much change. The data that is stored in the data warehouse is non volatile data; this is one of the characteristics of the data stored in a data warehouse. It is advisable to carry out analysis on non volatile data.

Data scrubbing
The physical transformation and purifying of the operational data that is going to be stored in the data warehouse is known as data scrubbing. Data scrubbing can be defined as the process of filtering, merging, standardizing, initializing and translating the operational data in order to create informational data that can be stored in the data warehouse.

Data dumping
It is important to store the right data in a data warehouse. Data should not be stored in a data warehouse just because it is available. Storing unnecessary data in a data warehouse is called data dumping. It is up to the data warehouse team to ensure that the data warehouse does not contain any unnecessary data.

Meta data users
Meta data users are those who access the meta data that is provided along with a data warehouse. Meta data users can be classified into two categories: technical users and business users.

Aggregate data
Aggregate data is data that has undergone data summarization. By storing aggregate data in the data warehouse it is possible to improve the response times of end users' queries. The drawback of storing aggregate data is that the time required to populate the files containing the aggregate data can be high.

Data dictionary
A database about data and database structures. A data dictionary contains details about all the data elements that are present in the databases: the name of each data element, its structure (such as length and type) and information about its usage.

Data quality
The quality of the data that is present in the data warehouse is very important. In order to ensure data quality, the data warehouse team should ensure that no junk or corrupted data is stored in the data warehouse. Fields in the data warehouse that do not contain any data should also be initialized. The quality of the data that is stored in the data warehouse could determine the success of the data warehouse.

Data accuracy
The accuracy of the data that is stored in the data warehouse plays an important role in determining how successful the data warehouse is going to be. It is up to the data warehouse team to ensure that the data stored in the data warehouse is accurate.
