• linked to, but different from, the many production databases that run in an
organisation;
• subject-oriented, rather than application-oriented, to provide a consistent view of
the business;
• integrated, because data is consolidated from different application systems;
• time variant, because information has a time dimension, whereas operational data
is valid only at a particular moment;
• non-volatile, since data is added to the data warehouse, rather than replaced.
While many people say that the concept of an ‘information’ warehouse came from IBM, a
number of companies were building data warehouses during the 1980s but giving them
different names. In South Africa, pioneering organisations like Eskom and the former
United Building Society set up very large databases for their management and executive
information systems well before the warehouse concept was established. IBM can be
credited with realising the true potential of a database for management-oriented rather
than operational needs, and of course for promoting the concept world-wide.
1. Organisations have spent many years improving ways to put data into their
operational systems; now they are beginning to appreciate the value of getting that
data out again. However, in many cases the databases that support those
operational systems are not suitable for quick and easy access to information.
2. Application databases are designed to handle data for specific, quite narrow
purposes, not integrated organisational views.
3. The IS department often does not want to allow users on the operational system
for fear of degrading the system with resource-intensive queries, or breaching data
security and integrity.
4. The data management differences between databases for application systems and
for decision-making cannot be reconciled on one system.
5. The software tools that users need should be on platforms that make it easy to
access and use data - the operational platform may not be suitable.
6. The growth in recent years of decision support and data mining applications
necessitates the move to a different database architecture.
7. ‘Get those xxxx users off our back’!
If you were building a data warehouse five years ago, chances are it would be on your
company’s mainframe computer. But with increased use of the data warehouse came a
heavier load on mainframe computer resources and degradation of overall performance.
Nowadays, a common solution is to position the warehouse on a mid-range platform that
can scale up easily and at a suitably low cost - conditions not yet available on
mainframes. The lower implementation costs of a client-server warehouse make it easier
to motivate the development of a warehouse. And data warehouses are not
mission-critical systems, so the common client-server criticism about the lack of system
management tools is less significant.
Because a data warehouse crosses organisational boundaries, there are serious non-
technical issues that you must bear in mind at an early stage; not to do so can be a career-
limiting move. Consider how organisations start data warehousing projects, from two
different approaches.
One approach begins with corporate data modelling, and builds the warehouse by taking
a broad view of business requirements; you can call this a top-down methodology. The
other approach is to focus on a specific information application and use that as the core
around which the warehouse will grow; this is the bottom-up methodology. Both
approaches are valid but have their own problems. The top-down method requires long
hard work to motivate and justify, political savvy, and approval by someone to pay up-
front before seeing the final product. And that’s before you even start on development.
The bottom-up approach avoids many of those problems, but runs the risk that the
information requirement you start with becomes quickly out-dated, or is not accepted by
business users.
Before you embark on a data warehouse project, you need to look at your current
applications and databases. If the warehouse is being developed because no one apart
from programmers can access corporate data in its current state, then have a clear
understanding about the data standards you will apply, and how the data warehouse will
be designed. Too many organisations discover that they have multiple standards or
definitions for their data. For data warehouses built using relational databases, it is a
common mistake to design the warehouse database around the same normalised form as
the application system. While the ‘normal form’ concept is fine for application efficiency,
it creates havoc for users of a data warehouse.
Next, examine the processes that will be implemented to extract and convert data from
your existing application systems to the warehouse. Among other things, these processes
must ensure cleanliness and integrity - for example, identifying invalid data - as well as
extracting the correct data every time - getting daily data for the wrong day is not an
uncommon problem. While these housekeeping functions seem obvious, they can become
a major headache as the warehouse grows, unless they are properly thought through at the
outset.
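The two checks mentioned above, identifying invalid data and confirming that the extract covers the intended day, can be sketched in Python as follows. This is only an illustration under assumed names: the record fields (`account`, `amount`, `txn_date`) and the function `validate_extract` are hypothetical and do not come from any particular product.

```python
from datetime import date


def validate_extract(rows, expected_day):
    """Split extracted rows into clean rows and rejected rows.

    Each row is a dict with the hypothetical fields 'account',
    'amount' and 'txn_date' (a datetime.date).
    """
    clean, rejected = [], []
    for row in rows:
        problems = []
        if not row.get("account"):
            problems.append("missing account")
        if not isinstance(row.get("amount"), (int, float)):
            problems.append("non-numeric amount")
        if row.get("txn_date") != expected_day:
            problems.append("wrong extract day")
        if problems:
            # Keep the reasons so the rejects can be reported back to the source team.
            rejected.append((row, problems))
        else:
            clean.append(row)
    return clean, rejected
```

Keeping the rejected rows together with the reasons, rather than silently dropping them, makes it possible to report data quality problems back to the owners of the source system.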
As the data warehouse grows, two other warehouse administration functions will become
more important. One function is packaging the data for commonly requested queries so that
the queries run faster. Doing this is easier said than done. Not many database systems
analyse the contents of queries. More significantly, queries that are common at one time
become rare at another because of business changes. This is an area that requires vigilance.
The other administration function is deciding what to do as the data ages. As the data gets
older, the importance of detail decreases. So to conserve disk space, this data should be
summarised to higher and higher levels of summary, and eventually archived. Deciding how
and what to summarise requires knowledge of past query patterns, and some insight into
future business information requirements.
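The ageing step described above can be sketched with Python's built-in sqlite3 module. The table names (`sales_detail`, `sales_monthly`), their columns and the cutoff date are assumptions made for this illustration; a real warehouse would also archive the detail rather than simply delete it.

```python
import sqlite3


def summarise_old_detail(conn, cutoff):
    """Roll detail rows dated before `cutoff` ('YYYY-MM-DD') up to monthly totals."""
    with conn:  # one transaction: the insert and the delete succeed or fail together
        conn.execute(
            """
            INSERT INTO sales_monthly (month, product, total_amount)
            SELECT substr(sale_date, 1, 7), product, SUM(amount)
            FROM sales_detail
            WHERE sale_date < ?
            GROUP BY substr(sale_date, 1, 7), product
            """,
            (cutoff,),
        )
        conn.execute("DELETE FROM sales_detail WHERE sale_date < ?", (cutoff,))


def build_sales_demo():
    """Create an in-memory database with a few made-up detail rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales_detail (sale_date TEXT, product TEXT, amount REAL)")
    conn.execute("CREATE TABLE sales_monthly (month TEXT, product TEXT, total_amount REAL)")
    conn.executemany(
        "INSERT INTO sales_detail VALUES (?, ?, ?)",
        [
            ("2023-01-05", "widget", 10.0),
            ("2023-01-20", "widget", 15.0),
            ("2023-02-03", "widget", 7.0),
            ("2024-06-01", "widget", 3.0),
        ],
    )
    return conn
```

Calling `summarise_old_detail(build_sales_demo(), "2024-01-01")` leaves one detail row and two monthly summary rows. The sketch assumes that no late-arriving detail turns up for a month that has already been summarised; handling that case would need an update of the existing summary row.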
To make the contents of a data warehouse easily understood, users have begun to demand
decent directories of corporate data, otherwise known as data dictionaries. This is a bad
news and good news story. The bad news is that many organisations have spent a lot of
time and money trying to implement corporate dictionaries for a data warehouse with
little obvious success. While products do exist, the average user has great difficulty
understanding them since they are still aimed mainly at programmers. The good news is
that some of the end-user query tools that are beginning to emerge have quite adequate
data dictionary functions. Watch this space for more developments.
When starting a data warehouse, the best advice is to start with a small and focused need,
and work on getting it right the first time. This makes it easier to project benefits, and to
get data issues sorted out on a small system before you tackle a large one.
The contents of the data warehouse must be easily understandable. The data must
be intuitive and obvious to the business user, not merely the developer. The contents of
the data warehouse must be labeled meaningfully. Business users want to separate and
combine the data in the data warehouse in endless combinations, a process commonly
referred to as slicing and dicing. The tools that access the data warehouse must be simple
and easy to use. They must also return query results to the users with minimal wait times.
The data in the data warehouse must be credible. Data must be carefully
assembled from a variety of sources around the organization, cleansed, quality assured, and
released only when it is fit for user consumption. Information from one business process
should match information from another business process. If two performance measures
have the same name, then they must mean the same thing. Conversely, if two measures
don’t mean the same thing, then they should be labeled differently. Consistent information
means high-quality information. It means that all the data is accounted for and complete.
Consistency also implies that common definitions for the contents of the data warehouse
are available to the users.
We simply can’t avoid change. User needs, business conditions, data and
technology are all subject to the shifting sands of time. The data warehouse must be
designed to handle this inevitable change. Changes to the data warehouse must be graceful,
meaning they don’t invalidate existing data or applications. The existing data and
applications should not be changed or disrupted when the business community asks new
questions or new data is added to the warehouse. If descriptive data in the warehouse is
modified, we must account for the changes appropriately.
The data warehouse must be a secure bastion that protects our information.
The data warehouse must have the right data in it to support decision making.
There is only one true output from the data warehouse: the decisions that are made after
the data warehouse has presented its evidence. These decisions deliver the business
impact and the value attributable to the data warehouse. The original label that predates
the data warehouse is still the best description of what we are designing:
a decision support system.
It doesn’t matter that we have built an elegant solution using best-of-breed products and
platforms. If the business community has not embraced the data warehouse and continued
to use it actively six months after training, then we have failed the acceptance test. Unlike
an operational system rewrite, where business users have no choice but to use the new
system, data warehouse usage is sometimes optional. Business user acceptance has more
to do with simplicity than anything else.
Components of a data warehouse
[Figure: components of a data warehouse. Recoverable labels: Extract; Services: clean,
combine, and standardize, conform dimensions; Load; Data Mart #1: dimensional, atomic
and summary data, based on a single business process; Access; ad hoc query tools.]
These are the operational systems of record that capture the transactions of the
business. The source systems should be thought of as outside the data warehouse because
presumably we have little or no control over the content and format of the data in these
operational legacy systems. The main priorities of the source systems are processing
performance and availability. Queries against the source systems are narrow,
one-record-at-a-time queries that are part of the normal transaction flow and are severely
restricted in the demands they place on the operational system. We make the strong
assumption that source systems are not queried in the broad and unexpected ways that
data warehouses typically are queried. The source systems maintain little historical data,
and if you have a good data warehouse, the source systems can be relieved of much of the
responsibility for representing the past. Each source system is a natural application where
little investment has been made in sharing common data such as product, customer,
geography or calendar.
The difference between the data warehouse and the operational system is as
follows:-
The key architectural requirement for the data staging area is that it is off-limits to
business users and does not provide query and presentation services.
Extraction is the first step in the process of getting data into the data warehouse
environment. Extraction means reading and understanding the source data and copying
the data needed for the data warehouse into the staging area for further manipulation.
Once data is extracted to the staging area, there are numerous potential
transformations, such as cleansing the data (correcting misspellings, resolving domain
conflicts, dealing with missing elements, or parsing into standard formats), combining data
from multiple sources, and assigning warehouse keys. These transformations are all
precursors to loading the data into the data warehouse presentation area.
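The transformations listed above can be illustrated with a short Python sketch. Everything named here is an assumption for the example: the lookup tables, the record fields (`source_id`, `city`, `gender`), and the convention of marking unknown values with "U" or "Unknown".

```python
from itertools import count

# Hypothetical lookup tables; real ones would be agreed with the business.
SPELLING_FIXES = {"johanesburg": "Johannesburg"}
GENDER_DOMAIN = {"m": "M", "male": "M", "f": "F", "female": "F"}


def cleanse(record):
    """Return a cleansed copy of one source record (a dict)."""
    city = (record.get("city") or "").strip()
    # Correct known misspellings, otherwise parse into a standard format.
    city = SPELLING_FIXES.get(city.lower(), city.title())
    # Resolve domain conflicts: different sources code gender differently.
    gender = GENDER_DOMAIN.get((record.get("gender") or "").strip().lower(), "U")
    # Deal with missing elements by using an explicit placeholder.
    return {"source_id": record["source_id"], "city": city or "Unknown", "gender": gender}


def assign_warehouse_keys(records, key_map=None):
    """Give each distinct source_id a surrogate warehouse key."""
    key_map = {} if key_map is None else key_map
    next_key = count(len(key_map) + 1)
    keyed = []
    for rec in records:
        if rec["source_id"] not in key_map:
            key_map[rec["source_id"]] = next(next_key)
        keyed.append({**rec, "warehouse_key": key_map[rec["source_id"]]})
    return keyed, key_map
```

Combining data from multiple sources is then a matter of cleansing the records from each source with the same rules before assigning keys, for example `assign_warehouse_keys([cleanse(r) for r in source_a + source_b])`, where `source_a` and `source_b` are lists of records from two hypothetical source systems.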
The data staging area is dominated by the simple activities of sorting and
sequential processing. In many cases, the data staging area is not based on relational
technology but instead may consist of a system of flat files. After you validate your data
for conformance with the defined one-to-one and many-to-one business rules, it may be
pointless to take the final step of building a full-blown third-normal-form physical
database.
However, there are cases where the data arrives at the doorstep of the data staging
area in third-normal-form relational format. In these situations, the managers of the data
staging area may simply be more comfortable performing the cleansing and
transformation tasks using a set of normalized structures. A normalized database for the
data staging area is acceptable. However, we continue to have some reservations about this
approach. The creation of both normalized structures for staging and dimensional
structures for presentation means that the data is extracted, transformed and loaded twice:
once into the normalized database and then again when we load the dimensional model.
Obviously, this two-step process requires more time and resources for the development
effort, more time for the periodic loading or updating of data, and more capacity to store
the multiple copies of the data. At the bottom line, this typically translates into the need
for larger development, ongoing support, and hardware platform budgets.
Unfortunately, some data warehouse project teams have failed miserably because they
focused all their energy and resources on constructing the normalized structures rather
than allocating time to development of a presentation area that supports improved
business decision making. While we believe that enterprise-wide data consistency is a
fundamental goal of the data warehouse environment, there are equally effective and less
costly approaches than physically creating a normalized set of tables in your staging area,
if these structures don’t already exist.
Data Presentation
The data presentation area is where data is organized, stored, and made available
for direct querying by users, report writers, and other analytical applications. Since the
backroom staging area is off-limits, the presentation area is the data warehouse as far as
the business community is concerned. It is all the business community sees and touches
via data access tools.
You can create a data warehouse system for an organization using either of two
approaches. In the first approach, you create and implement a central data warehouse
first, with data marts created later. In the second approach, the data marts are implemented
first, in such a way that the data warehouse works properly when their information is
joined in the warehouse system. In both approaches, the design must be centralized so that
the organization’s data warehouse information can be used consistently. Data marts that
are designed to central specifications can produce consistent reports even though the data
is stored in different places.
FACT Tables:-
• Customer Name
• Account Number
• Type of Transaction
• Amount of Transaction
The diagram below shows a schema with a fact table and dimensional tables:-
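[Figure: a fact table linked to three dimension tables. Recoverable labels: fact table
with Product, Area and Duration; dimension table PRODUCT (Product Number,
Description); dimension table AREA (Area 1, Area2, Area3); dimension table DURATION
(Year, Beginning Date, Date of Completion).]
[Figure: fact constellation. Recoverable labels: Fact 1 (Product, Area, Duration); Fact 2
(Product, Area, Duration); dimension table Product (Product Number, Description).]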
In the above fact constellation, fact1 and fact2 share a common dimension table.
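A fact constellation of this shape can be sketched with Python's sqlite3 module. The table names `fact1` and `fact2` follow the text; the dimension table name `product_dim`, the `quantity` measure and all column types are assumptions for the illustration, and the Area and Duration dimensions are left out for brevity.

```python
import sqlite3

CONSTELLATION_DDL = """
CREATE TABLE product_dim (
    product_key    INTEGER PRIMARY KEY,
    product_number TEXT,
    description    TEXT
);
CREATE TABLE fact1 (
    product_key INTEGER REFERENCES product_dim(product_key),
    quantity    INTEGER
);
CREATE TABLE fact2 (
    product_key INTEGER REFERENCES product_dim(product_key),
    quantity    INTEGER
);
"""


def build_constellation_demo():
    """Create the schema in memory and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(CONSTELLATION_DDL)
    conn.executemany(
        "INSERT INTO product_dim VALUES (?, ?, ?)",
        [(1, "P-001", "Widget"), (2, "P-002", "Gadget")],
    )
    conn.executemany("INSERT INTO fact1 VALUES (?, ?)", [(1, 5), (1, 3), (2, 4)])
    conn.executemany("INSERT INTO fact2 VALUES (?, ?)", [(1, 2)])
    return conn


def totals_by_product(conn):
    """Aggregate each fact table separately, then line the totals up per product."""
    return conn.execute(
        """
        SELECT p.description,
               (SELECT COALESCE(SUM(quantity), 0) FROM fact1 f
                 WHERE f.product_key = p.product_key),
               (SELECT COALESCE(SUM(quantity), 0) FROM fact2 f
                 WHERE f.product_key = p.product_key)
        FROM product_dim p
        ORDER BY p.product_key
        """
    ).fetchall()
```

Each fact table is aggregated on its own before the results are combined through the shared dimension. Joining the two fact tables directly to each other would multiply rows and inflate the totals.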
You need to identify the transactions of interest, which are the essential events
in the business. The transactions of interest are also known as elemental transactions.
Examples are the records of phone calls made by a customer, or the records of account
transactions made by a customer of a bank. The information about the transactions
should include the appropriate details. You also need to identify the dimensions related to
each fact in the fact table that records the elemental transactions.
The fact table can be of any size, provided that sufficient budget, hardware, and
database capacity are available. The database designer needs to make sure that enough
funds are available when storing the data in detail. Factors that need to be considered when
designing the fact table are:-
• Determining the historical period for which the fact table is to be
made. This enables you to store the data related to the
required historical period. The time for which the data is stored is also
known as the data retention period.
• Determining whether the collected data consists of the required details
• Minimizing the size of the columns in the fact tables.
• Including the time factor in the fact table.
• Subdividing the fact table consisting of larger amounts of data into smaller
fact tables. The data in the smaller fact tables is easier to manage.
The fact table needs to include only those entries that can respond to various
queries for retrieving data. The data that can be removed from fact table includes.
You need to remove the columns that do not provide any useful information from the fact
table. For example, in a data warehouse that stores data regarding the telephone calls, the
useful columns of data are
Every data warehouse and data mart has one or more fact tables.
Fact tables contain data that represent the business measurements of an organization. You
can include financial events like cash flow transactions and expenditure details in a fact
table. Fact tables contain numerous rows, sometimes thousands, when they include one
or more years of financial details.
You can recognize fact tables by the numeric data in the
table, which can be summarized to provide information about the organization’s history.
Fact tables also contain many foreign keys, which are the primary keys of the related
dimension tables that contain the attributes of the fact records. A fact table should not hold
descriptive information; it holds only numeric fields and index fields that relate
to the dimension tables.
Here is a sample of the fact table orderdetail_fact used in creating the
Northwind warehouse.
COLUMN          DESCRIPTION
Ordered_wk      Surrogate key for orders
Order_id        Foreign key of orders
Amount          Amount of the order
Discount        Discount on the order
Order_date      Date of the order
Ship_date       Date of shipping
Custid_wk       Surrogate key for customer
Empid_wk        Surrogate key for employee
Shipid_wk       Surrogate key for ship
Shipperid_wk    Surrogate key for shipper
Regionid_wk     Surrogate key for region
Dw_auditid      Foreign key for audit table
Together, these columns describe an order placed on a specific date by a specific
customer and processed by a particular employee. They also identify the shipper
who shipped the order.
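As a rough sketch, the orderdetail_fact table could be declared and queried as follows using Python's sqlite3 module. The column names are taken from the table above; the SQLite column types, the helper function names and the sample rows are assumptions made for this illustration and are not taken from the Northwind database itself.

```python
import sqlite3

# Column names follow the table above; the SQLite types are assumptions.
ORDERDETAIL_FACT_DDL = """
CREATE TABLE orderdetail_fact (
    Ordered_wk   INTEGER PRIMARY KEY,  -- surrogate key for orders
    Order_id     INTEGER,              -- foreign key of orders
    Amount       REAL,
    Discount     REAL,
    Order_date   TEXT,                 -- ISO date string
    Ship_date    TEXT,
    Custid_wk    INTEGER,
    Empid_wk     INTEGER,
    Shipid_wk    INTEGER,
    Shipperid_wk INTEGER,
    Regionid_wk  INTEGER,
    Dw_auditid   INTEGER
);
"""

COLUMNS = (
    "Ordered_wk", "Order_id", "Amount", "Discount", "Order_date", "Ship_date",
    "Custid_wk", "Empid_wk", "Shipid_wk", "Shipperid_wk", "Regionid_wk", "Dw_auditid",
)

# Made-up sample rows, one per order.
SAMPLE_ROWS = [
    (1, 1001, 100.0, 0.0, "2024-01-05", "2024-01-08", 1, 5, 1, 3, 1, 1),
    (2, 1002, 50.0, 0.05, "2024-01-06", "2024-01-09", 1, 6, 2, 3, 1, 1),
    (3, 1003, 40.0, 0.0, "2024-01-06", "2024-01-10", 2, 5, 3, 2, 2, 1),
]


def create_fact_table():
    conn = sqlite3.connect(":memory:")
    conn.executescript(ORDERDETAIL_FACT_DDL)
    return conn


def load_rows(conn, rows):
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.executemany(
        f"INSERT INTO orderdetail_fact ({', '.join(COLUMNS)}) VALUES ({placeholders})",
        rows,
    )


def amount_by_customer(conn):
    """Sum the additive Amount measure per customer surrogate key."""
    return conn.execute(
        "SELECT Custid_wk, SUM(Amount) FROM orderdetail_fact "
        "GROUP BY Custid_wk ORDER BY Custid_wk"
    ).fetchall()
```

In a full warehouse each `_wk` column would reference a dimension table; those tables are omitted here to keep the sketch short.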
The most useful measures in a fact table are additive numbers. They allow you
to produce summary information by adding up the measure across many rows. For
example, the sales of a specific item for a group of stores in a particular time period can
be added together. Non-additive measures, such as inventory or quantity left, can also be
part of a fact table, but they need different summarization techniques.
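The difference can be shown with a small Python sketch. The sample rows and function names are made up, and averaging the stock level over time is only one possible summarization technique, chosen here as an assumption for the example.

```python
from collections import defaultdict

# (day, store, units_sold, units_on_hand): made-up sample rows
ROWS = [
    ("2024-01-01", "S1", 5, 100),
    ("2024-01-01", "S2", 3, 50),
    ("2024-01-02", "S1", 4, 96),
    ("2024-01-02", "S2", 6, 44),
]


def total_units_sold(rows):
    """Additive measure: a plain sum is valid across stores and days."""
    return sum(row[2] for row in rows)


def average_daily_on_hand(rows):
    """Stock level: sum across stores within a day, then average across days."""
    per_day = defaultdict(int)
    for day, _store, _sold, on_hand in rows:
        per_day[day] += on_hand
    return sum(per_day.values()) / len(per_day)
```

Adding the on-hand quantities of both days together would give 290 units, a figure that never existed in any store. Averaging the daily totals gives 145, which is a meaningful stock level for the period.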
Aggregation of Fact Tables:-
We have several strong opinions about the presentation area. First of all we insist
that the data be presented, stored and accessed in dimensional schemas.
[Figure: star schema. Surviving dimension table labels: Product, Customer.]
The star schema is used in both the data warehouse and the data marts. A data mart
stores a subset of the data held in the data warehouse.
Dynamic Dimensions:-
Dynamic Dimensions are the dimensions that vary with time. The
queries for retrieving data also change with the changes in dimensions. The queries are
designed to determine whether the new business policies are successful or not. The
execution of the queries compares the results of new business policies with the old
business policies.
The snowflake schema consists of a fact table and dimension tables to which
further dimension tables are connected.
[Figure: snowflake schema. Surviving labels: Product_category, Product_manufacturer,
Product, Customer.]
The above figure shows a snowflake schema in which sales is the fact
table. The four dimension tables are PRODUCT, AREA, TIME and CUSTOMER. The
dimension table product is partitioned into product_category and product_manufacturer.
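The product branch of this snowflake schema can be sketched with Python's sqlite3 module. The table names follow the text; the column names, types and sample rows are assumptions, and the AREA, TIME and CUSTOMER dimensions are omitted to keep the example short.

```python
import sqlite3

SNOWFLAKE_DDL = """
CREATE TABLE product_category (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE product_manufacturer (
    manufacturer_key  INTEGER PRIMARY KEY,
    manufacturer_name TEXT
);
CREATE TABLE product (
    product_key      INTEGER PRIMARY KEY,
    product_name     TEXT,
    category_key     INTEGER REFERENCES product_category(category_key),
    manufacturer_key INTEGER REFERENCES product_manufacturer(manufacturer_key)
);
CREATE TABLE sales (
    product_key INTEGER REFERENCES product(product_key),
    amount      REAL
);
"""


def build_snowflake_demo():
    """Create the schema in memory and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SNOWFLAKE_DDL)
    conn.executemany(
        "INSERT INTO product_category VALUES (?, ?)", [(1, "Beverages"), (2, "Snacks")]
    )
    conn.execute("INSERT INTO product_manufacturer VALUES (1, 'Acme')")
    conn.executemany(
        "INSERT INTO product VALUES (?, ?, ?, ?)", [(1, "Tea", 1, 1), (2, "Chips", 2, 1)]
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 2.5)])
    return conn


def sales_by_category(conn):
    """Reaching the category name takes two joins: sales -> product -> product_category."""
    return conn.execute(
        """
        SELECT c.category_name, SUM(s.amount)
        FROM sales s
        JOIN product p ON p.product_key = s.product_key
        JOIN product_category c ON c.category_key = p.category_key
        GROUP BY c.category_name
        ORDER BY c.category_name
        """
    ).fetchall()
```

In a plain star schema the category name would be a column of the product table, so the same question would need only one join.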
The snowflake schema also has some disadvantages. The large number of
tables makes the data harder to manage, because it becomes difficult to retrieve data from
multiple tables. The metadata also becomes more complex in a snowflake schema, as it
needs to store data about multiple dimension tables.
Starflake Schema:-
[Figure: starflake schema. Surviving labels: snowflake dimensions Price and Weight;
Product; Location; snowflake dimensions Location1 and Location2.]
One way to load data into the data warehouse is to trap data in the
legacy system as it is being updated. The two ways to trap the updated data in a legacy
system are:-
Data Replication:- Uses triggers to trap the updated data in the legacy
system. A trigger is a set of SQL statements that executes automatically when
data in the legacy system changes. The triggered action stores the data updated in
the legacy system.
Change Data Capture (CDC):- Uses the log tape to trap the updated data in
the legacy system. A log tape stores all the changes that occurred throughout the day in a
legacy system. You can use the log tape to trap the updated data in the legacy system.
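The trigger-based approach can be illustrated with Python's sqlite3 module standing in for the legacy database. The table names (`account`, `account_changes`), the trigger name and the sample data are hypothetical; a real legacy system would use its own trigger syntax.

```python
import sqlite3

TRIGGER_SETUP = """
CREATE TABLE account (
    account_no TEXT PRIMARY KEY,
    balance    REAL
);
CREATE TABLE account_changes (
    change_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    account_no  TEXT,
    old_balance REAL,
    new_balance REAL
);
-- The trigger traps every balance update and stores it for the warehouse load.
CREATE TRIGGER trap_account_update
AFTER UPDATE OF balance ON account
BEGIN
    INSERT INTO account_changes (account_no, old_balance, new_balance)
    VALUES (OLD.account_no, OLD.balance, NEW.balance);
END;
"""


def build_legacy_demo():
    """Create the tables and trigger, then apply two made-up updates."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(TRIGGER_SETUP)
    conn.execute("INSERT INTO account VALUES ('A1', 100.0)")
    conn.execute("UPDATE account SET balance = 80.0 WHERE account_no = 'A1'")
    conn.execute("UPDATE account SET balance = 120.0 WHERE account_no = 'A1'")
    return conn


def trapped_changes(conn):
    """Return the trapped updates in the order in which they occurred."""
    return conn.execute(
        "SELECT account_no, old_balance, new_balance "
        "FROM account_changes ORDER BY change_id"
    ).fetchall()
```

The warehouse load job would read `account_changes` periodically and clear the rows it has processed. The trigger adds work to every update on the operational system, which is one reason a log-based approach may be preferred.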