Você está na página 1de 4

6/17/11 7:06 AM Four Ways to Build a Data Warehouse

Page 1 of 4 http://www.tdan.com/view-articles/4770
Home Newsletter TDAN.com Picks Events Resources About Contact Search TDAN.com
Printer-friendly
E-mail to friend
THE DATA ADMINISTRATION NEWSLETTER TDAN.com
ROBERT S. SEINER PUBLISHER
Current Issue
Archive
Special Features
Featured Columns
Perspectives
Contribute
Subscribe
> home > newsletter > article
Four Ways to Build a Data Warehouse
by Wayne Eckerson
Published: May 29, 2007
It has been said there are as many ways to build data warehouses as
there are companies to build them.
It has been said there are as many ways to build data warehouses as there are
companies to build them. Each data warehouse is unique because it must adapt to the
needs of business users in different functional areas, whose companies face different
business conditions and competitive pressures.
Nonetheless, four major approaches to building a data warehousing environment exist.
These architectures are generally referred to as 1) top-down 2) bottom-up 3) hybrid,
and 4) federated. Most organizationswittingly or notfollow one or another of these
approaches as a blueprint for development.
Although we have been building data warehouses since the early 1990s, there is still a
great deal of confusion about the similarities and differences among these
architectures. This is especially true of the top-down and bottom-up approaches,
which have existed the longest and occupy the polar ends of the development
spectrum.
As a result, some companies fail to adopt a clear vision for the way the data
warehousing environment can and should evolve. Others, paralyzed by confusion or
fear of deviating from prescribed tenets for success, cling too rigidly to one approach or
another, undermining their ability to respond flexibly to new or unexpected situations.
Ideally, organizations need to borrow concepts and tactics from each approach to
create environments that uniquely meets their needs.
Semantic and Substantive Differences The two most influential approaches are
championed by industry heavyweights Bill Inmon and Ralph Kimball, both prolific
authors and consultants in the data warehousing field. Inmon, who is credited with
coining the term data warehousing in the early 1990s, advocates a top-down
approach, in which companies first build a data warehouse followed by data marts.
Kimballs approach, on the other hand, is often called bottom-up because it starts and
ends with data marts, negating the need for a physical data warehouse altogether.
On the surface, there is considerable friction between top-down and bottom-up
approaches. But in reality, the differences are not as stark as they may appear. Both
approaches advocate building a robust enterprise architecture that adapts easily to
changing business needs and delivers a single version of the truth. In some cases, the
differences are more semantic than substantive in nature. For example, both
approaches collect data from source systems into a single data store, from which data
marts are populated. But while top-down subscribers call this a data warehouse,
bottom-up adherents often call this a staging area.
Nonetheless, significant differences exist between the two approaches (see chart.) Data
warehousing professionals need to understand the substantial, subtle, and semantic
differences among the approaches and which industry gurus or consultants advocate
each approach. This will provide a clearer understanding of the different routes to
achieve data warehousing success and how to translate between the advice and
rhetoric of the different approaches.
Top-Down Approach
The top-down approach views the data warehouse as the linchpin of the entire analytic
environment. The data warehouse holds atomic or transaction data that is extracted
from one or more source systems and integrated within a normalized, enterprise data model. From there, the data is
summarized, dimensionalized, and distributed to one or more dependent data marts. These data marts are dependent
because they derive all their data from a centralized data warehouse.
Sometimes, organizations supplement the data warehouse with a staging area to collect and store source system data
before it can be moved and integrated within the data warehouse. A separate staging area is particularly useful if there
are numerous source systems, large volumes of data, or small batch windows with which to extract data from source
systems.
The major benefit of a top-down approach is that it provides an integrated, flexible architecture to support downstream
analytic data structures. First, this means the data warehouse provides a departure point for all data marts, enforcing
consistency and standardization so that organizations can achieve a single version of the truth. Second, the atomic data in
the warehouse lets organizations re-purpose that data in any number of ways to meet new and unexpected business
6/17/11 7:06 AM Four Ways to Build a Data Warehouse
Page 2 of 4 http://www.tdan.com/view-articles/4770
needs. For example, a data warehouse can be used to create rich data sets for statisticians, deliver operational reports, or
support operational data stores (ODS) and analytic applications. Moreover, users can query the data warehouse if they
need cross-functional or enterprise views of the data.
On the downside, a top-down approach may take longer and cost more to deploy than other approaches, especially in the
initial increments. This is because organizations must create a reasonably detailed enterprise data model as well as the
physical infrastructure to house the staging area, data warehouse, and the marts before deploying their applications or
reports. (Of course, depending on the size of an implementation, organizations can deploy all three tiers within a single
database.) This initial delay may cause some groups with their own IT budgets to build their own analytic applications.
Also, it may not be intuitive or seamless for end users to drill through from a data mart to a data warehouse to find the
details behind the summary data in their reports.
Bottom-Up Approach
In a bottom-up approach, the goal is to deliver business value by deploying dimensional data marts as quickly as possible.
Unlike the top-down approach, these data marts contain all the databoth atomic and summarythat users may want or
need, now or in the future. Data is modeled in a star schema design to optimize usability and query performance. Each
data mart builds on the next, reusing dimensions and facts so users can query across data marts, if desired, to obtain a
single version of the truth as well as both summary and atomic data.
The bottom-up approach consciously tries to minimize back-office operations, preferring to focus an organizations effort
on developing dimensional designs that meet end-user requirements. The bottom-up staging area is non-persistent, and
may simply stream flat files from source systems to data marts using the file transfer protocol. In most cases, dimensional
data marts are logically stored within a single database. This approach minimizes data redundancy and makes it easier to
extend existing dimensional models to accommodate new subject areas.
Pros/Cons. The major benefit of a bottom-up approach is that it focuses on creating user-friendly, flexible data
structures using dimensional, star schema models. It also delivers value rapidly because it doesnt lay down a heavy
infrastructure up front.
Without an integration infrastructure, the bottom-up approach relies on a dimensional bus to ensure that data marts are
logically integrated and stovepipe applications are avoided. To integrate data marts logically, organizations use
conformed dimensions and facts when building new data marts. Thus, each new data mart is integrated with others
within a logical enterprise dimensional model.
Another advantage of the bottom-up approach is that since the data marts contain both summary and atomic data, users
do not have to drill through from a data mart to another structure to obtain detailed or transaction data. The use of a
staging area also eliminates redundant extracts and overhead required to move source data into the dimensional data
marts.
One problem with a bottom-up approach is that it requires organizations to enforce the use of standard dimensions and
facts to ensure integration and deliver a single version of the truth. When data marts are logically arrayed within a single
physical database, this integration is easily done. But in a distributed, decentralized organization, it may be too much to
ask departments and business units to adhere and reuse references and rules for calculating facts. There can be a
tendency for organizations to create independent or non-integrated data marts.
In addition, dimensional marts are designed to optimize queries, not support batch or transaction processing. Thus,
organizations that use a bottom-up approach need to create additional data structures outside of the bottom-up
architecture to accommodate data mining, ODSs, and operational reporting requirements. However, this may be achieved
simply by pulling a subset of data from a data mart at night when users are not active on the system.
Hybrid Approach
The hybrid approach tries to blend the best of both top-down and bottom-up approaches. It attempts to capitalize on
the speed and user-orientation of the bottom-up approach without sacrificing the integration enforced by a data
warehouse in a top down approach. Pieter Mimno, an independent consultant who teaches at TDWI conferences, is
currently the most vocal proponent of this approach.
The hybrid approach recommends spending about two weeks developing an enterprise model in third normal form before
developing the first data mart. The first several data marts are also designed in third normal form but deployed using star
schema physical models. This dual modeling approach fleshes out the enterprise model without sacrificing the usability and
query performance of a star schema.
The hybrid approach relies on an extraction, transformation, and load (ETL) tool to store and manage the enterprise and
local models in the data marts as well as synchronize the differences between them. This lets local groups, for example,
develop their own definitions or rules for data elements that are derived from the enterprise model without sacrificing
long-term integration. Organizations also use the ETL tool to extract and load data from source systems into the
dimensional data marts at both the atomic and summary levels. Most ETL tools today can create summary tables on the
fly.
After deploying the first few dependent data marts, an organization then backfills a data warehouse behind the data
marts, instantiating the fleshed out version of the enterprise data model. The organization then transfers atomic data
from the data marts to the data warehouse and consolidates redundant data feeds, saving the organization time, money,
and processing resources. Organizations typically backfill a data warehouse once business users request views of atomic
data across multiple data marts.
The major benefit of a hybrid approach is that it combines rapid development techniques within an enterprise architecture
framework. It develops an enterprise data model iteratively and only develops a heavyweight infrastructure once its really
needed (e.g. when executives start asking for reports that cross data mart boundaries.)
However, backfilling a data warehouse can be a highly disruptive process that delivers no ostensible value and therefore
may never be funded. In addition, few query tools can dynamically and intelligently query atomic data in one database
(i.e. the data warehouse) and summary data in another database (i.e. the data marts.) Users may be confused when to
query which database.
This approach also relies heavily on an ETL tool to synchronize meta data between enterprise and local versions, develop
aggregates, load detail data, and orchestrate the transition to a data warehousing infrastructure. Although ETL tools have
matured considerably, they can never enforce adherence to architecture. The hybrid approach may make it too easy for
local groups to stray irrevocably from the enterprise data model.
Federated Approach
The federated approach is sometimes confused with the hybrid approach above or hub-and-spoke data warehousing
architectures that are a reflection of a top-down approach.
However, the federated approachas defined by its most vocal proponent, Doug Hackneyis not a methodology or
6/17/11 7:06 AM Four Ways to Build a Data Warehouse
Page 3 of 4 http://www.tdan.com/view-articles/4770
architecture per se, but a concession to the natural forces that undermine the best laid plans for deploying a perfect
system. A federated approach rationalizes the use of whatever means possible to integrate analytical resources to meet
changing needs or business conditions. In short, its a salve for the soul of the stressed out data warehousing project
manager who must sacrifice architectural purity to meet the immediate (and ever-changing) needs of his business users.
Hackney says the federated approach is an architecture of architectures. It recommends how to integrate a multiplicity of
heterogeneous data warehouses, data marts, and packaged applications that companies have already deployed and will
continue to implement in spite of the IT groups best effort to enforce standards and adhere to a specific architecture.
Hackney concedes that a federated architecture will never win awards for elegance or be drawn up on clean white boards
as an optimal solution. He says it provides the maximum amount of architecture possible in a given political and
implementation reality. The approach merely encourages organizations to share the highest value metrics, dimensions,
and measures wherever possible, however possible. This may mean, for example, creating a common staging area to
eliminate redundant data feeds or building a data warehouse that sources data from multiple data marts, data
warehouses, or analytic applications.
The major problem with the federated approach is that it is not well documented. There are only a few columns written on
the subject. But perhaps this is enough, as it doesnt prescribe a specific end-state or approach. Another potential problem
is that without a specific architecture in mind, a federated approach can perpetuate the continued decentralization and
fragmentation of analytical resources, making it harder to deliver an enterprise view in the end. Also, integrating meta
data is a pernicious problem in a heterogeneous, ever-changing environment.
Summary
The four approaches described here represent the dominant strains of data warehousing methodologies. Data warehousing
managers need to be aware of these methodologies but not wedded to them. These methodologies have shaped the
debate about data warehousing best practices, and comprise the building blocks for methodologies developed by practicing
consultants.
Ultimately, organizations need to understand the strengths and limitations of each methodology and then pursue their own
way through the data warehousing thicket. Since each organization must respond to unique needs and business
conditions, having a foundation of best practice models to start with augurs a successful outcome.
Top-Down Bottom-Up Hybrid Federated
Major Characteristics
! Emphasizes the DW.
! Starts by designing an
enterprise model for a DW.
! Deploys multi-tier architecture
comprised of a staging area, a
DW, and dependent data marts.
! The staging area is persistent.
! The DW is enterprise-oriented;
data marts are function-specific.
! The DW has atomic-level data;
data marts have summary data.
! The DW uses an enterprise-
based normalized model; data
marts use a subject-specific
dimensional model.
! Users can query the data
warehouse and data marts.

! Emphasizes data marts.
! Starts by designing a
dimensional model for a data
mart.
! Uses a flat architecture
consisting of a staging area and
data marts.
! The staging area is largely
non-persistent.
! Data marts contain both atomic
and summary data.
! Data marts can provide both
enterprise and function-specific
views.
! A data mart consists of a single
star schema, logically or
physically deployed.
! Data marts are deployed
incrementally and integrated
using conformed dimensions.

! Emphasizes DW and data
marts; blends top-down and
bottom-up methods.
! Starts by designing enterprise
and local models synchronously.
! Spends 23 weeks creating a
high-level, normalized, enterprise
model; fleshes out model with
initial marts. ! Populates marts
with atomic and summary data
via a non-persistent staging area.
! Models marts as one or more
star schemas.
! Uses ETL tool to populate data
marts and exchange meta data
between ETL tool and data marts.
! Backfills a DW behind the
marts when users want views at
atomic level across marts;
instantiates the fleshed out
enterprise model, and moves
atomic data to the DW.

! Emphasizes the need to
integrate new and existing
heterogeneous BI environments.
! An architecture of
architectures.
! Acknowledges the reality of
change in organizations and
systems that make it difficult to
implement a formalized
architecture.
! Rationalizes the use of
whatever means possible to
implement or integrate analytical
resources to meet changing needs
or business conditions.
! Encourages organizations to
share dimensions, facts, rules,
definitions, and data wherever
possible, however possible.

Pros
! Enforces a flexible, enterprise
architecture.
! Once built, minimizes the
possibility of renegade
independent data marts.
! Supports other analytical
structures in an architected
environment, including data
mining sets, ODSs, and
operational reports.
! Keeps detailed data in
normalized form so it can be
flexibly re-purposed to meet new
and unexpected needs.
! Data warehouse eliminates
redundant extracts.
! Focuses on creating user-
friendly, flexible data structures.
! Minimizes back office
operations and redundant data
structures to accelerate
deployment and reduce cost.
! No drill-through required since
atomic data is always stored in
the data marts.
! Creates new views by
extending existing stars or
building new ones within the
same logical model.
! Staging area eliminates
redundant extracts.

! Provides rapid development
within an enterprise architecture
framework.
! Avoids creation of renegade
independent data marts.
! Instantiates enterprise model
and architecture only when
needed and once data marts
deliver real value.
! Synchronizes meta data and
database models between
enterprise and local definitions.
! Backfilled DW eliminates
redundant extracts.

! Provides a rationale for band
aid approaches that solve real
business problems.
! Alleviates the guilt and stress
data warehousing managers
might experience by not adhering
to formalized architectures.
! Provides pragmatic way to
share data and resources.

Cons
! Upfront modeling and platform
deployment mean the first
increments take longer to deploy
and cost more.
! Few query tools can easily join
data across multiple, physically
distinct marts.
! Requires groups throughout an
! Requires organizations to
enforce standard use of entities
and rules.
! Backfilling a DW is disruptive,
! The approach is not fully
articulated.
! With no predefined end-state
or architecture in mind, it may
6/17/11 7:06 AM Four Ways to Build a Data Warehouse
Page 4 of 4 http://www.tdan.com/view-articles/4770
! Requires building and managing
multiple data stores and
platforms.
! Difficult to drill through from
summary data in marts to detail
data in DW.
! Might need to store detail data
in data marts anyway.

organization to consistently use
dimensions and facts to ensure a
consolidated view.
! Not designed to support
operational data stores or
operational reporting data
structures or processes.

requiring corporate commitment,
funding, and application rewrites.
! Few query tools can dynamically
query atomic and summary data
in different databases.

give way to unfettered chaos.
! It might encourage rather than
reign in independent development
and perpetuate the disintegration
of standards and controls.

Major Proponents
Bill Inmon and co-authors

Ralph Kimball and co-authors

Many practitioners

Doug Hackney

Go to Current Issue | Go to Issue Archive
Recent articles by Wayne Eckerson
Are You Stuck In BI Adolescence?
The Business Intelligence Evangelist
Wayne Eckerson - Wayne Eckerson has been a thought leader and consultant in the business intelligence (BI) field since
1995. He has conducted numerous in-depth research studies and is a noted speaker and blogger. He is the author of the
best-selling book Performance Dashboards: Measuring, Monitoring, and Managing Your Business. For many years, he
served as director of education and research at The Data Warehousing Institute (TDWI) where he chaired its BI Executive
Summit and created a popular BI Maturity Model and Assessment. Wayne is currently director of research at TechTarget
and president of BI Leader Consulting, which provides advisory services to user and vendor organizations. He can be
reached at weckerson@techtarget.com.
Quality Content for Data Management Professionals Since 1997
Copyright 1997-2011, The Data Administration Newsletter, LLC -- www.TDAN.com
Contact Publisher | Comments and Contributions Welcome | Advertising | Disclaimer
TDAN.com is an affiliate of the BeyeNETWORK

Você também pode gostar