What is a data warehouse?

The standard answer to this question is that it is a database designed to support the
management decision-making process. A more accurate answer might be that it is where
business managers can find the information they need to run the business correctly.
Notice that the first answer is product-oriented, while the second emphasises functionality.
The guru of data warehousing, Bill Inmon, characterises a data warehouse as being:

• linked to, but different from, the many production databases that run in an
organisation;
• subject-oriented, rather than application-oriented, to provide a consistent view of
the business;
• integrated, because data is consolidated from different application systems;
• time variant, because information has a time dimension, whereas operational data
is valid only at a particular moment;
• non-volatile, since data is added to the data warehouse, rather than replaced.

Where did the concept come from?

While many people say that the concept of an ‘information’ warehouse came from IBM, a
number of companies were building data warehouses during the 1980s but giving them
different names. In South Africa, pioneering organisations like Eskom and the former
United Building Society set up very large databases for their management and executive
information systems well before the warehouse concept was established. IBM can be
credited with realising the true potential of a database for management-oriented rather
than operational needs, and of course for promoting the concept world-wide.

Why do I need a data warehouse?

Any of the following reasons can apply.

1. Organisations have spent many years improving ways to put data into their
operational systems; now they are beginning to appreciate the value of getting that
data out again. However, in many cases the databases that support those
operational systems are not suitable for quick and easy access to information.
2. Application databases are designed to handle data for specific, quite narrow
purposes, not integrated organisational views.
3. The IS department often does not want to allow users on the operational system
for fear of degrading the system with resource-intensive queries, or breaching data
security and integrity.
4. The data management differences between databases for application systems and
for decision-making cannot be reconciled on one system.
5. The software tools that users need should be on platforms that make it easy to
access and use data - the operational platform may not be suitable.
6. The growth in recent years of decision support and data mining applications
necessitates the move to a different database architecture.
7. ‘Get those xxxx users off our back’!

How is data warehousing developing?

If you were building a data warehouse five years ago, chances are it would be on your
company’s mainframe computer. But with increased use of the data warehouse came a
heavier load on mainframe computer resources and degradation of overall performance.
Nowadays, a common solution is to position the warehouse on a mid-range platform that
can scale up easily and at a suitably low cost - conditions not yet available on
mainframes. The lower implementation costs of a client-server warehouse make it easier
to motivate the development of a warehouse. And because data warehouses are not
mission-critical systems, the common client-server criticism about the lack of system
management tools is less significant.

What are the implications of data warehousing?

Because a data warehouse crosses organisational boundaries, there are serious non-
technical issues that you must bear in mind at an early stage; not to do so can be a career-
limiting move. Consider how organisations start data warehousing projects, from two
different approaches.

One approach begins with corporate data modelling, and builds the warehouse by taking
a broad view of business requirements; you can call this a top-down methodology. The
other approach is to focus on a specific information application and use that as the core
around which the warehouse will grow; this is the bottom-up methodology. Both
approaches are valid but have their own problems. The top-down method requires long
hard work to motivate and justify, political savvy, and approval by someone to pay up-
front before seeing the final product. And that’s before you even start on development.
The bottom-up approach avoids many of those problems, but runs the risk that the
information requirement you start with becomes quickly out-dated, or is not accepted by
business users.

Before you embark on a data warehouse project, you need to look at your current
applications and databases. If the warehouse is being developed because no one apart
from programmers can access corporate data in its current state, then have a clear
understanding about the data standards you will apply, and how the data warehouse will
be designed. Too many organisations discover that they have multiple standards or
definitions for their data. For data warehouses built using relational databases, it is a
common mistake to design the warehouse database around the same normalised form as
the application system. While the ‘normal form’ concept is fine for application efficiency,
it creates havoc for users of a data warehouse.

Next, examine the processes that will be implemented to extract and convert data from
your existing application systems to the warehouse. Among other things, these processes
must ensure cleanliness and integrity - for example, identifying invalid data - as well as
extracting the correct data every time - getting daily data for the wrong day is not an
uncommon problem. While these housekeeping functions seem obvious, they can become
a major headache as the warehouse grows, unless they are properly thought through at the
outset.
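
The two checks mentioned above (identifying invalid data and confirming that an extract covers the intended day) can be sketched in Python as follows. This is only an illustration: the record layout (`business_date`, `amount`) and the function name `validate_extract` are assumptions made for the example, not part of any particular extraction tool.

```python
from datetime import date

def validate_extract(rows, expected_day):
    """Split extracted rows into (valid, rejected) lists.

    A row is rejected if it belongs to the wrong business day or if its
    amount is missing or not numeric. The field names are hypothetical.
    """
    valid, rejected = [], []
    for row in rows:
        problems = []
        if row.get("business_date") != expected_day:
            problems.append("wrong day")
        if not isinstance(row.get("amount"), (int, float)):
            problems.append("invalid amount")
        if problems:
            rejected.append((row, problems))
        else:
            valid.append(row)
    return valid, rejected

rows = [
    {"business_date": date(2024, 3, 1), "amount": 120.0},
    {"business_date": date(2024, 2, 29), "amount": 75.5},  # wrong day
    {"business_date": date(2024, 3, 1), "amount": None},   # invalid data
]
valid, rejected = validate_extract(rows, date(2024, 3, 1))
```

Keeping the rejected rows together with the reason for rejection, rather than silently dropping them, makes it easier to trace problems back to the source system.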

As the data warehouse grows, two other warehouse admin. functions will become more
important. One function is packaging the data for commonly requested queries so that the
queries run quicker. Doing this is easier said than done. Not many database systems
analyse the contents of queries. More significantly, common queries at one time become
rare at another time due to business changes. This is an area that requires vigilance. The
other admin. function is what to do as the data ages. As the data gets older, the
importance of detail decreases. So to conserve disk space, this data should be summarised
to higher and higher levels of summary, and eventually archived. Deciding how and what
to summarise requires knowledge of past query patterns, and some insight into future
business information requirements.
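
As a rough illustration of summarising data as it ages, the sketch below keeps recent daily rows as detail and rolls older rows up to monthly totals. The cutoff date, the `(day, amount)` row layout and the function name `summarise_old_detail` are assumptions made for the example; a real warehouse would choose its summary levels from observed query patterns.

```python
from collections import defaultdict
from datetime import date

def summarise_old_detail(daily_rows, cutoff):
    """Keep rows on or after `cutoff` as detail; roll older rows up by month."""
    recent = []
    monthly = defaultdict(float)
    for day, amount in daily_rows:
        if day >= cutoff:
            recent.append((day, amount))
        else:
            monthly[(day.year, day.month)] += amount
    return recent, dict(monthly)

daily_rows = [
    (date(2023, 1, 5), 100.0),
    (date(2023, 1, 20), 50.0),
    (date(2023, 2, 3), 30.0),
    (date(2024, 3, 1), 10.0),
]
recent, monthly = summarise_old_detail(daily_rows, cutoff=date(2024, 1, 1))
```

Once the old rows have been replaced by their monthly totals, the daily detail for that period can no longer be queried, which is why the choice of what to summarise needs care.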

To make the contents of a data warehouse easily understood, users have begun to demand
decent directories of corporate data, otherwise known as data dictionaries. This is a bad
news and good news story. The bad news is that many organisations have spent a lot of
time and money trying to implement corporate dictionaries for a data warehouse with
little obvious success. While products do exist, the average user has great difficulty
understanding them since they are still aimed mainly at programmers. The good news is
that some of the end-user query tools that are beginning to emerge have quite adequate
data dictionary functions. Watch this space for more developments.

When starting a data warehouse, the best advice is to start with a small and focused need,
and work on getting it right first time. This makes it easier to project benefits, and to
get data issues sorted out on a small system before you tackle a large one.

What benefits do data warehouses provide?

As with many strategic systems, determining bottom-line benefits is not straightforward.
The major reason for a data warehouse is to improve access to corporate information. If
this is accomplished successfully, users should find increased productivity because they
do not waste time looking or waiting for information. With better access to this
information, management can make better decisions. By separating the operational and
warehouse data, you allow both databases to be optimised for their own specific
purposes, and reduce possible disruptions to the operational systems. A truly useful
warehouse will also create spin-off by suggesting further applications.

Goals of the data warehouse

The data warehouse must make an organization’s information easily accessible:-

The contents of the data warehouse must be easily understandable. The data must
be intuitive and obvious to the business user, not merely the developer. The contents of
the data warehouse must be labeled meaningfully. Business users want to separate and
combine the data in the data warehouse in endless combinations, a process commonly
referred to as slicing and dicing. The tools that access the data warehouse must be simple
and easy to use. They must also return query results to the users with minimal wait times.

The data warehouse must present the organization’s information consistently:-

The data in the data warehouse must be credible. Data must be carefully
assembled from a variety of sources around the organization, cleansed, quality assured, and
released only when it is fit for user consumption. Information from one business process
should match information from other business processes. If two performance measures
have the same name, then they must mean the same thing. Conversely, if two measures
don’t mean the same thing, then they should be labeled differently. Consistent information
means high quality information. It means that all the data is accounted for and complete.
Consistency also implies that common definitions for the contents of the data warehouse
are available for the users.

The data warehouse must be adaptive and resilient to change:-

We simply can’t avoid change. User needs, business conditions, data and
technology are all subject to the shifting sands of time. The data warehouse must be
designed to handle this inevitable change. Changes to the data warehouse must be graceful,
meaning they don’t invalidate existing data or applications. The existing data and
applications should not be changed or disrupted when the business community asks new
questions or new data is added to the warehouse. If descriptive data in the warehouse is
modified, we must account for the changes appropriately.

The data warehouse must be a secure bastion that protects our information:-

An organization’s informational crown jewels are stored in the data warehouse. At
a minimum, the warehouse likely contains information about what we are selling to
whom at what price, which are potentially harmful details in the hands of the wrong
people. The data warehouse must effectively control access to the organization’s
confidential information.

The data warehouse must serve as the foundation for improved decision making:-

The data warehouse must have the right data in it to support decision making.
There is only one true output from the data warehouse: the decisions that are made after
the data warehouse has presented its evidence. These decisions deliver the business
impact and the value attributable to the data warehouse. The original label that predates
the data warehouse is still the best description of what we are designing: a decision
support system.

The business community must accept the data warehouse if it is to be deemed successful:-

It doesn’t matter that we have built an elegant solution using best-of-breed products and
platforms. If the business community has not embraced the data warehouse and continued
to use it actively six months after training, then we have failed the acceptance test. Unlike
an operational system rewrite, where business users have no choice but to use the new
system, data warehouse usage is sometimes optional. Business user acceptance has more
to do with simplicity than anything else.

Components of a data warehouse

There are four distinct components of the data warehouse:-

• Operational Source System
• Data Staging Area
• Data Presentation Area
• Data Access Tools

[Figure: the four components of the data warehouse, with data flowing from left to right:
Operational Source Systems -> (extract) -> Data Staging Area -> (load) -> Data Presentation
Area -> (access) -> Data Access Tools.
Data Staging Area. Services: clean, combine, and standardize; conform dimensions; no user
query services. Data store: flat files and relational tables. Processing: sorting and
sequential processing.
Data Presentation Area. Data Mart #1: dimensional; atomic and summary data; based on a
single business process. DW bus: conformed facts and dimensions. Data Mart #2: similarly
designed.
Data Access Tools. Ad hoc query tools; report writers; analytic applications; modeling
(forecasting, scoring, data mining).]

Operational Source System:-

These are the operational systems of record that capture the transactions of the
business. The source systems should be thought of as outside the data warehouse because
presumably we have little or no control over the content and format of the data in these
operational legacy systems. The main priorities of the source systems are processing
performance and availability. Queries against the source systems are narrow,
one-record-at-a-time queries that are part of the normal transaction flow and are severely
restricted in the demands they place on the operational system. We make the strong
assumption that source systems are not queried in the broad and unexpected ways that data
warehouses typically are. The source systems maintain little historical data, and if you
have a good data warehouse, they can be relieved of much of the responsibility for
representing the past. Each source system is typically a self-contained application in
which little investment has been made in sharing common data such as product, customer,
geography or calendar.

The differences between the DBMS used by the data warehouse and the DBMS used by the
operational system are as follows:-

Data Warehouse DBMS                         Operational System

Does not allow updating of the data         Allows updating of the data inside it.
inside it.

Allows many indexes, because the data       Allows only a limited number of indexes,
in it is not updated.                       because data can be updated and inserted
                                            in the DBMS.

Does not provide free space. Free space     Provides free space. For a DBMS used for
is the space reserved at the block level    transaction processing, 50% of the space
for further expansion of the DBMS. It is    is free space.
not needed here, as the data is not
updated.

Data Staging Area:-

The data staging area of the data warehouse is both a storage area and a set of
processes commonly referred to as extract-transformation-load (ETL). The data staging
area is everything between the operational source systems and the data presentation area.
It is somewhat analogous to the kitchen of a restaurant, where raw food products are
transformed into a fine meal. In the data warehouse, raw operational data is transformed
into a warehouse deliverable fit for user query and consumption. Similar to the
restaurant’s kitchen, the backroom data staging area is accessible only to skilled
professionals. The data warehouse kitchen staff is busy preparing meals and
cannot simultaneously respond to customer inquiries. Customers aren’t invited to
eat in the kitchen. It isn’t safe for the customers to wander into the kitchen. We wouldn’t
want our data warehouse customers to be injured by the dangerous equipment, hot
surfaces and sharp knives they may encounter in the kitchen, so we prohibit them from
accessing the staging area. Besides, things happen in the kitchen that customers just
shouldn’t be privy to.

The key architectural requirement for the data staging area is that it is off-limits to
business users and does not provide query and presentation services.

Extraction is the first step in the process of getting data into the data warehouse
environment. Extraction means reading and understanding the source data and copying
the data needed for the data warehouse into the staging area for further manipulation.

Once data is extracted to the staging area, there are numerous potential
transformations, such as cleansing the data (correcting misspellings, resolving domain
conflicts, dealing with missing elements, or parsing into standard formats), combining data
from multiple sources and assigning warehouse keys. These transformations are all
precursors to loading the data into the data warehouse presentation area.
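
A minimal sketch of three of these transformations (cleansing into a standard format, combining rows from several sources, and assigning warehouse keys) might look like this in Python. The source names, the cleansing rule and the helper names `clean_name` and `assign_warehouse_keys` are assumptions made for illustration; real cleansing rules are usually far more involved.

```python
def clean_name(raw):
    """Standardise a customer name: trim, collapse spaces, title-case."""
    return " ".join(raw.split()).title()

def assign_warehouse_keys(source_rows):
    """Combine rows from several sources and assign one surrogate key per
    distinct cleansed customer name (keys start at 1, in first-seen order)."""
    keys = {}
    staged = []
    for source, raw_name in source_rows:
        name = clean_name(raw_name)
        if name not in keys:
            keys[name] = len(keys) + 1
        staged.append({"customer_key": keys[name], "name": name, "source": source})
    return staged

source_rows = [
    ("billing", "  acme   ltd "),  # same customer, formatted differently
    ("crm", "ACME LTD"),
    ("crm", "beta corp"),
]
staged = assign_warehouse_keys(source_rows)
```

After cleansing, the two spellings of the same customer receive the same warehouse key, which is what allows data from different source systems to be combined.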

The data staging area is dominated by the simple activities of sorting and
sequential processing. In many cases, the data staging area is not based on relational
technology but instead may consist of a system of flat files. After you validate your data
for conformance with the defined one-to-one and many-to-one business rules, it may be
pointless to take the final step of building a full-blown third-normal-form physical
database.

However, there are cases where the data arrives at the doorstep of the data staging
area in third-normal-form relational format. In these situations, the managers of the data
staging area may simply be more comfortable performing the cleansing and
transformation tasks using a set of normalized structures. A normalized database for the data
staging area is acceptable. However, we continue to have some reservations about this
approach. The creation of both normalized structures for staging and dimensional
structures for presentation means that the data is extracted, transformed and loaded twice:
once into the normalized database and then again when we load the dimensional model.
Obviously, this two-step process requires more time and resources for the development
effort, more time for the periodic loading or updating of data, and more capacity to store
the multiple copies of the data. At the bottom line, this typically translates into the need
for larger development, ongoing support, and hardware platform budgets.
Unfortunately, some data warehouse project teams have failed miserably because they
focused all their energy and resources on constructing the normalized structures rather
than allocating time to the development of a presentation area that supports improved
business decision making. While we believe that enterprise-wide data consistency is a
fundamental goal of the data warehouse environment, there are equally effective and less
costly approaches than physically creating a normalized set of tables in your staging area,
if these structures don’t already exist.

It is acceptable to create a normalized database to support the staging
processes; however, this is not the end goal. The normalized structures must be off-limits
to user queries because they defeat understandability and performance. As
soon as a database supports query and presentation services, it must be considered
part of the data warehouse presentation area. By default, normalized databases are
excluded from the presentation area, which should be strictly dimensionally
structured.

Regardless of whether we’re working with a series of flat files or a normalized
data structure in the staging area, the final step of the ETL process is the loading of data.
Loading in the data warehouse environment usually takes the form of presenting the
quality-assured dimensional tables to the bulk loading facilities of each data mart. The
target data mart must then index the newly arrived data for query performance. When
each data mart has been freshly loaded, indexed, supplied with appropriate aggregates,
and further quality assured, the user community is notified that the new data has been
published. Publishing includes communicating the nature of any changes that have
occurred in the underlying dimensions and new assumptions that have been introduced
into the measured or calculated facts.

Data Presentation Area:-

The data presentation area is where data is organized, stored, and made available
for direct querying by users, report writers, and other analytical applications. Since the
backroom staging area is off-limits, the presentation area is the data warehouse as far as
the business community is concerned. It is all the business community sees and touches
via data access tools.

You can create a data warehouse system for an organization using one of two approaches. In
the first approach, you create and implement a central data warehouse first, with data
marts created later. In the second approach, the data marts are implemented first, in such
a way that they work together as a data warehouse when their information is combined.
In both approaches, the design needs to be centralized to ensure consistent use of the
organization’s data warehouse information. Data marts that are
designed to central specifications can produce consistent reports even though the data
is stored in different places.

FACT AND DIMENSION TABLES:-

FACT Tables:-

A multidimensional model contains two types of tables: fact tables and
dimension tables. The fact tables are the tables that store the business data, such as profit,
loss, cost, sales and money transactions. The data in a fact table is known as facts. The
facts represent the transactions and have attributes, which represent the data that
describes the facts. The set of values assigned to the attributes of one fact is known as
a tuple. For example, the attributes associated with the fact account transactions are:

• Customer Name
• Account Number
• Type of Transaction
• Amount of Transaction

The diagram below shows a schema with a fact table and dimension tables:-

[Figure: a fact table with the columns PRODUCT, AREA and DURATION, each linked to a
dimension table. Product dimension table: Name of product, Product Number, Description.
Area dimension table: Area 1, Area 2, Area 3. Duration dimension table: Year, Beginning
Date, Date of Completion.]

A fact constellation is formed when two or more fact tables share common
dimension tables.

The diagram below shows a fact constellation:-

[Figure: two fact tables, Fact 1 and Fact 2, each with the columns Product, Area and
Duration, both linked to one dimension table with the columns Name of Product, Product
Number and Description.]

In the above fact constellation, fact1 and fact2 share a common dimension table.
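
A fact constellation of this kind can be sketched with Python's built-in `sqlite3` module. The table names `fact1`, `fact2` and `product_dim` follow the description above; the `amount` measure and the sample rows are assumptions added only so that the example has something to total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (
        product_number INTEGER PRIMARY KEY,
        name TEXT,
        description TEXT
    );
    -- Both fact tables reference the same product dimension.
    CREATE TABLE fact1 (
        product_number INTEGER REFERENCES product_dim(product_number),
        area TEXT,
        amount REAL          -- hypothetical measure
    );
    CREATE TABLE fact2 (
        product_number INTEGER REFERENCES product_dim(product_number),
        area TEXT,
        amount REAL          -- hypothetical measure
    );
    INSERT INTO product_dim VALUES (1, 'Widget', 'Small widget'),
                                   (2, 'Gadget', 'Large gadget');
    INSERT INTO fact1 VALUES (1, 'North', 10.0), (1, 'South', 5.0), (2, 'North', 7.0);
    INSERT INTO fact2 VALUES (1, 'North', 3.0);
""")

# Because the dimension is shared, both facts can be reported per product.
totals = conn.execute("""
    SELECT p.name,
           (SELECT SUM(f.amount) FROM fact1 f WHERE f.product_number = p.product_number),
           (SELECT SUM(f.amount) FROM fact2 f WHERE f.product_number = p.product_number)
    FROM product_dim p
    ORDER BY p.product_number
""").fetchall()
```

The shared dimension is what makes the two fact tables comparable: each row of `totals` lines up the measures from both facts against the same product.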

Determining Fact Table:-

You need to identify the various facts in order to determine the fact tables and their
dimension tables in a database. The process of determining a fact table in the database
involves:

• Identifying the transactions of interest
• Identifying the dimensions for each fact
• Verifying that a fact is not a dimension table
• Verifying that a dimension table is not a fact

Identifying the transactions and dimensions for the fact:-

You need to identify the transactions of interest, which are the essential events
in the business. The transactions of interest are also known as elemental transactions.
Examples are the records of the phone calls made by a customer and the records of the
account transactions made by a customer of a bank. The information about the transactions
should contain an appropriate level of detail. After identifying the elemental
transactions, you need to identify the dimensions related to each fact in the fact table.

Fact Table Designing Process:-

The fact table can be of any size, provided that sufficient budget, hardware, and
database capacity are available. The database designer needs to make sure that enough
funds are available for storing the data at the required level of detail. Factors that
need to be considered when designing the fact table are:-

• Determining the historical period for which the fact table is to be
made. The fact table then stores the data related to the
required historical period. The time for which the data is stored is also
known as the data retention period.
• Determining whether the collected data consists of the required details.
• Minimizing the size of the columns in the fact tables.
• Including the time factor in the fact table.
• Subdividing a fact table that contains large amounts of data into smaller
fact tables. The data in the smaller fact tables is easier to manage.

Determining the Retention Period:-

The data retention period is usually long, typically 5 to 10 years.
The data is stored in the data warehouse in a way that improves query performance, which
refers to how quickly data is retrieved in response to various queries. You need to
determine the level of detail of the data to be stored in order to improve query performance.

Removing Inappropriate Columns from the Fact Table:-

The fact table needs to include only those columns that are needed to answer
queries. The data that can be removed from the fact table includes:

• Replicated Data:- Refers to duplicate copies of data stored in the data warehouse.
• Derived Data:- Refers to data values that are derived from values already stored.
• Aggregated Data:- Consists of aggregations, such as the sum of the data values,
the total number of rows in the table and the average of the data values.

You need to remove the columns that do not provide any useful information from the fact
table. For example, in a data warehouse that stores data regarding telephone calls, the
useful columns of data are:

• Source phone number
• Destination phone number
• Date of the call
• Time of the call
• Tariff
• Duration of the call
Every data warehouse and data mart has one or more fact tables.
Fact tables contain data that represent the business measurements of an organization. You
can include financial events like cash flow transactions and expenditure details in a fact
table. Fact tables contain numerous rows, sometimes in the thousands when they include one
or more years of financial details.

You can recognise fact tables by the numeric data in the
table, which can be summarized to provide information about the organization’s history.
Fact tables also contain many foreign keys, which are the primary keys of the related
dimension tables that contain the attributes of the fact records. A fact table should not
hold descriptive information; it should contain only numeric fields and key fields that
relate to the corresponding dimension tables.

Here is a sample of the fact table orderdetail_fact used in creating the
Northwind warehouse.

COLUMNS         DESCRIPTIONS
Ordered_wk      Surrogate key for orders
Order_id        Foreign key of orders
Amount          Amount of the order
Discount        Discount on the order
Order_date      Date of the order
Ship_date       Date of shipping
Custid_wk       Surrogate key for customer
Empid_wk        Surrogate key for employee
Shipid_wk       Surrogate key for ship
Shipperid_wk    Surrogate key for shipper
Regionid_wk     Surrogate key for region
Dw_auditid      Foreign key for audit table

The above columns represent the details of an order placed on a specific date by a specific
customer and processed by a particular employee. They also identify the shipper
who shipped the order.

The most useful measures in a fact table are additive numbers. They allow you
to produce summary information by adding up the values of the measure. For example, the
sales of a specific item for a group of stores in a particular time period are
additive. Non-additive measures, such as inventory or quantity left, can also be
part of a fact table, but you need to use different summarization techniques.
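
The difference between the two kinds of measure can be shown with a small Python sketch. The sample rows are invented for illustration. The example assumes that stock levels may be added across stores but not across days, which is one common way of treating inventory; other summarization techniques, such as averaging over time, are equally possible.

```python
# Each row: (store, day, units_sold, units_in_stock). Sample data is made up.
rows = [
    ("S1", 1, 10, 100),
    ("S1", 2, 15, 85),
    ("S2", 1, 5, 40),
    ("S2", 2, 8, 32),
]

# Additive: sales can simply be summed across stores and across days.
total_sold = sum(sold for _, _, sold, _ in rows)

# Not additive over time: summing stock across days would count the same
# goods twice, so take each store's latest stock level and add those.
latest = {}
for store, day, _, stock in rows:
    if store not in latest or day > latest[store][0]:
        latest[store] = (day, stock)
closing_stock = sum(stock for _, stock in latest.values())
```

Here `total_sold` is a plain sum, while `closing_stock` needs the extra step of selecting the most recent row per store before adding.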

Aggregation of Fact Tables:-

The process of deriving summarized data from detail
records is known as aggregation. You can reduce the size of the table by aggregating the
data. However, when data is summarized in the fact table, the analyst has no detailed
information available. If you need detailed information, you need to identify and
locate the detailed rows that were summarized in the source system that provided the data.
You should therefore maintain the fact tables at the finest granularity possible.

Mixing aggregated and detailed data in the fact
table causes problems when you use the data warehouse. For example, a purchase order
often contains several line items and may contain a discount, tax, or cartage cost that
applies to the order total instead of individual items, while the quantities and item
identification are recorded at the item level. Summarization queries become more
complex in this situation and tools such as Analysis Services often require the creation of
special filters to deal with them.

You can use two different approaches to deal with this. In the first approach,
you allocate the order-level values to the line items based on value,
quantity, or shipping weight. In the second approach, you create two fact tables, one that
contains data at the line item level and another that contains the order-level information.
You can relate the two tables by carrying the order identification key in the detail fact
table. You can then use the order table as a dimension table for the detail table by
treating the order-level values as attributes of the order level in the dimension hierarchy.
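
The first approach, allocating an order-level value to the line items, can be sketched in Python as a proportional split. The function name `allocate` and the figures are hypothetical; the same function could be called with quantities or shipping weights instead of line values.

```python
def allocate(order_level_amount, line_weights):
    """Spread an order-level amount over line items in proportion to a weight
    (line value, quantity, or shipping weight). Assumes the weights do not
    sum to zero."""
    total = sum(line_weights)
    return [order_level_amount * weight / total for weight in line_weights]

line_values = [60.0, 30.0, 10.0]         # value of each line item
discounts = allocate(20.0, line_values)  # order-level discount of 20.0
```

With real currency amounts the shares usually need rounding, and any rounding remainder has to be assigned to one of the lines so that the allocated values still add up to the order total.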

We typically refer to the presentation area as a series of integrated data marts. A
data mart is a wedge of the overall presentation area pie. In its simplest form, a
data mart presents the data from a single business process. These business processes cross
the boundaries of organizational functions.

We have several strong opinions about the presentation area. First of all we insist
that the data be presented, stored and accessed in dimensional schemas.

Data in the queryable presentation area of the data warehouse must be
dimensional, must be atomic, and must adhere to the data warehouse bus architecture.

If the presentation area is based on a relational database, then these
dimensionally modeled tables are referred to as star schemas.

Star Schema:- The star schema consists of a fact table in the center, with all the
dimension tables attached to it. This arrangement of data resembles a
star, hence the name star schema.

[Figure: star schema with the fact table Sales at the center, joined to the dimension
tables Product, Area, Time and Customer.]

The star schema is used in both the data warehouse and the data marts. The data marts store
subsets of the data held in the data warehouse.
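
A star schema like the one described above can be sketched with Python's built-in `sqlite3` module. The fact table `sales` and the four dimensions come from the description; the `_dim` suffixes, the column names and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim  (product_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE area_dim     (area_id     INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE time_dim     (time_id     INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, name TEXT);
    -- The fact table sits in the center and holds one key per dimension.
    CREATE TABLE sales (
        product_id  INTEGER REFERENCES product_dim(product_id),
        area_id     INTEGER REFERENCES area_dim(area_id),
        time_id     INTEGER REFERENCES time_dim(time_id),
        customer_id INTEGER REFERENCES customer_dim(customer_id),
        amount      REAL
    );
    INSERT INTO product_dim  VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO area_dim     VALUES (1, 'North'), (2, 'South');
    INSERT INTO time_dim     VALUES (1, 2023), (2, 2024);
    INSERT INTO customer_dim VALUES (1, 'Acme');
    INSERT INTO sales VALUES (1, 1, 1, 1, 100.0), (2, 1, 2, 1, 50.0), (1, 2, 2, 1, 25.0);
""")

# Each dimension is exactly one join away from the fact table.
by_area_and_year = conn.execute("""
    SELECT a.name, t.year, SUM(s.amount)
    FROM sales s
    JOIN area_dim a ON a.area_id = s.area_id
    JOIN time_dim t ON t.time_id = s.time_id
    GROUP BY a.name, t.year
    ORDER BY a.name, t.year
""").fetchall()
```

Any combination of dimensions can be used in the `GROUP BY` clause in the same way, which is the slicing and dicing that business users ask for.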

Dynamic Dimensions:-

Dynamic Dimensions are the dimensions that vary with time. The
queries for retrieving data also change with the changes in dimensions. The queries are
designed to determine whether the new business policies are successful or not. The
execution of the queries compares the results of new business policies with the old
business policies.

Snowflake Schema:- In a snowflake schema, the dimension tables are normalized.
Normalization means that the large dimension tables are partitioned into multiple tables
to remove redundant data, such as duplicate values. For example, you can partition the
product dimension into two dimension tables, Product_category and Product_manufacturer.

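The snowflake schema consists of a fact table and dimension tables, to which
further dimension tables are connected.

[Figure: snowflake schema with the fact table Sales at the center, joined to the dimension
tables Product, Area, Time and Customer; the Product dimension is further linked to
Product_category and Product_manufacturer.]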

The above figure shows a snowflake schema in which sales is the fact
table. The four dimension tables are PRODUCT, AREA, TIME and CUSTOMER. The
dimension table product is partitioned into product_category and product_manufacturer.

The snowflake schema also has some disadvantages. The large number of
tables makes the data harder to manage, as it becomes more difficult to retrieve data from
multiple tables. The metadata also becomes more complex in a snowflake schema, as it
needs to store data about multiple dimension tables.
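
The extra work involved in retrieving data from a snowflake schema can be seen in a small sketch using Python's built-in `sqlite3` module. The tables `product_category` and `product_manufacturer` follow the example above; the column names and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_category (
        category_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE product_manufacturer (
        manufacturer_id INTEGER PRIMARY KEY, manufacturer TEXT);
    -- The product dimension is normalized: it points to two further tables.
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        category_id INTEGER REFERENCES product_category(category_id),
        manufacturer_id INTEGER REFERENCES product_manufacturer(manufacturer_id)
    );
    CREATE TABLE sales (
        product_id INTEGER REFERENCES product(product_id),
        amount REAL
    );
    INSERT INTO product_category VALUES (1, 'Tools'), (2, 'Toys');
    INSERT INTO product_manufacturer VALUES (1, 'Acme');
    INSERT INTO product VALUES (1, 'Hammer', 1, 1), (2, 'Robot', 2, 1);
    INSERT INTO sales VALUES (1, 40.0), (1, 10.0), (2, 30.0);
""")

# Reporting by category needs two joins instead of the single join that a
# star schema with a flat product dimension would require.
by_category = conn.execute("""
    SELECT c.category, SUM(s.amount)
    FROM sales s
    JOIN product p ON p.product_id = s.product_id
    JOIN product_category c ON c.category_id = p.category_id
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()
```

Every additional level of normalization adds another join of this kind to the queries that use it.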

Starflake Schema:-

The starflake schema is a combination of the star schema and the
snowflake schema. It consists of a fact table, star dimensions and
snowflake dimensions.

Following is the structure of the starflake schema:-

[Figure: starflake schema with the fact table Sales at the center, joined to the dimensions
Product and Location. The Product dimension has the snowflake dimensions Price and Weight;
the Location dimension has the snowflake dimensions Location1 and Location2.]

Refreshing the Data Warehouse:-

You need to regularly refresh the data in a data warehouse, as the data
sources are updated at regular intervals. Data sources such as an operational data store
(ODS) and legacy systems are used to load the data into the data warehouse.

One way of loading data into the data warehouse is to trap the data in the
legacy system as it is being updated. The two ways to trap the updated data in a legacy
system are:-

Data Replication:- Uses triggers to trap the updated data in the legacy
system. A trigger is a set of SQL statements that executes automatically when
data in the legacy system changes. The triggered action stores the updated data.

Change Data Capture (CDC):- Uses the log tape to trap the updated data in
the legacy system. The log tape records all the changes that occurred throughout the day
in the legacy system.
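
The trigger-based approach can be sketched with Python's built-in `sqlite3` module, with SQLite standing in for the legacy system. The tables `account` and `account_changes` and the trigger name `trap_account_update` are hypothetical; a real legacy DBMS will have its own trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (account_number INTEGER PRIMARY KEY, balance REAL);
    -- Changes are trapped here for later loading into the warehouse.
    CREATE TABLE account_changes (
        account_number INTEGER, old_balance REAL, new_balance REAL);

    CREATE TRIGGER trap_account_update
    AFTER UPDATE OF balance ON account
    BEGIN
        INSERT INTO account_changes
        VALUES (OLD.account_number, OLD.balance, NEW.balance);
    END;

    INSERT INTO account VALUES (1, 100.0), (2, 250.0);
    UPDATE account SET balance = 80.0 WHERE account_number = 1;
""")

# Only the updated row was trapped; the untouched account produced nothing.
changes = conn.execute("SELECT * FROM account_changes").fetchall()
```

The refresh process then only has to read `account_changes` rather than rescanning the whole `account` table, at the cost of extra work on every update in the source system.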
