Jeff Lawyer and Shamsul Chowdhury
…warehouse, thus being "dependent" data marts. "Independent" data marts should be discouraged, as the data used to build them would not be of the same assured quality as the data warehouse "single source of truth".

…data warehouse is built [1, 2, 15]. It is extremely important for the business champion to engage data warehouse business partners in an "enterprise" manner, not as individual vertical business units ("stove pipes"). In conjunction with the business…
…ciencies [14]. Be sure to interview, hire, and contract with individuals and firms according to the data warehousing style, and perhaps even the technologies, you choose. Also, there are a number of industry trade shows and conferences from which beginner to experienced practitioners can benefit greatly. Again, select and use these opportunities based upon the style of data warehousing you choose.

3.4 Data Warehouse Scope

For the initial release of your data warehouse, limit the number of data subjects implemented and the extent of their content, perhaps employing an evolutionary prototype or proof-of-concept development methodology [15]. This will minimize the initial investment, help gain expertise with a smaller set of data (and, thus, a smaller set of technical challenges), and deliver business value sooner. This is an excellent way of demonstrating the informational and monetary benefits of data warehousing to the company's top-level management, increasing their overall commitment to and support of the concept.

3.5 Data Warehouse Data Modeling

It is important, once a data warehousing architecture is chosen, to adhere to it from beginning to end. This may seem rhetorical, but there can be many opportunities and much pressure to short-cut the process necessary to create a quality data warehouse. Using a robust data modeling tool, follow a typical conceptual-to-logical-to-physical data model progression, maintaining all data models in as close to third-normal form (3NF) as possible [14]. However, although the goal is a 3NF data model for the atomic data warehouse, employ practical denormalization without compromising the basic entity-relationship structure. For example, one allowable type of denormalization technique is where a parent entity instance includes a total attribute computed by adding together the attribute values from child entity instances. Another allowable type is where child entity instances include a redundant attribute, such as "transaction date", which has been replicated from a parent entity instance. Denormalization actions which combine or split entities should generally be avoided unless necessary in the physical database environment to address a demonstrated performance issue.

3.6 Data Warehouse Attribution and Keys

When defining attributes for the entities of the data warehouse data model, do not define intelligent, compound fields, especially for attributes making up the key of the entity [3]. Each component of a compound legacy field should be broken out into detail attributes in the data warehouse, as each component is a unit of business data which could have significance on its own. As a best practice, use native keys for primary keys; do not use token keys, which are made-up "serial"-type numbers with no meaning that stand for a unique set of multiple native key values. With multidimensional data warehouse structures, however, it is often recommended to use token keys because, with the multiple dimension entities surrounding a central facts entity, the primary key list of the central facts entity would be the unwieldy list of all primary keys of its dimension entities [8]. This used to be primarily an RDBMS performance issue, but most RDBMS vendors have provided technical enhancements or indexing capabilities that alleviate the concern for non-numeric, multiple keys. In practice, getting rid of the multiplicity of keys has more to do with minimizing SQL keying by power users than with maximizing database performance. A potential compromise would be to carry both the native keys and the token keys, trading ease of use for more database space consumed.

3.7 Data Warehouse Loading

When populating the data warehouse from the legacy, external, and ODS files and databases, you should employ purchased Extract / Transformation / Load (ETL) utility software. Similarly, build necessary dependent data marts from the data warehouse using an ETL tool [14]. These tools are somewhat costly, but provide necessary structure and efficiency in ensuring data quality, in transforming and standardizing data values, and in building and delivering the data stream necessary to load the data warehouse.

3.8 Data Warehouse Data Marts

Independent ("end-run") data marts built directly from legacy, external, and/or ODS data files and databases should be avoided. It is best to first source the data into the data warehouse, where it becomes part of the "single source of truth", and then move it into a data mart, if necessary [14]. In addition, the data warehouse should not feed any ODS or legacy systems directly, as that makes the 24 hours per day x 7 days per week operational systems dependent upon the data warehouse, which is rarely set up in a 24 hours per day x 7 days per week operational format. For example, you may wish to compute a "relative customer score" ("poor", "good", "better", or "best") to
be used in an operational system such as customer service for customer treatment workflow. If you need extensive history from the data warehouse to compute this score, it is acceptable to "reverse-source" this data from the data warehouse. However, you should not make the operational system dependent upon that action -- the operational system should be able to function with the most recent score available. Other business and system requirements which are pushing the architecture towards what is referred to as "real-time data warehousing" are probably best implemented as some form of legacy / ODS combination.

3.9 Data Warehouse Load Frequency

Another consideration for loading the data warehouse is load frequency. It is likely that no single frequency (weekly, biweekly, monthly, etc.) will be used for loading the data warehouse. The frequency and volume of data associated with outside, legacy, and ODS systems will likely determine when to invoke loading cycles. At the U. S. retailing company, daily transaction data are collected, staged daily, and loaded weekly due to the tremendous volume of transactions (500 million per year). Customer data for about 190 million individuals are loaded every two weeks, in conjunction with a customer management ODS which operates on a similar schedule. The credit account data for the retailer's credit card portfolio are loaded on a monthly basis. (Note the anomaly the business user must be aware of -- for a customer who opens a credit account during a transaction, the transaction data will arrive on the data warehouse weeks before the customer's credit account detail. Such is the double-edged sword of data warehousing.)

3.10 Data Warehouse Metadata

Data worth warehousing are data worth documenting. This brings up the importance of ensuring a good metadata thread exists throughout all environments [14]. There is little one can do regarding legacy metadata, other than dedicating resources to retrofitting any discrepancies uncovered. For the data warehouse, however, the importance of good metadata cannot be emphasized enough. ETL teams are going to rely heavily on the accuracy and completeness of metadata. Data warehouse power users and casual users are going to access metadata frequently in order to formulate their data warehouse queries. The data warehouse metadata must consist of good business names and definitions, as well as standardized technical names and database formats. Data warehouse columns defined as "code"-type columns should have all potential values and their meanings either encoded in metadata extensions or, if many values exist, in special-built data warehouse code / decode tables. If you have a metadata repository that supports import and export of metadata, use it. If not, strongly consider purchasing a metadata repository package that supports not only import and export of metadata, but also has Internet or Intranet deployment capabilities. Since the thirst for metadata will be great, it is important to have robust direct access and reporting capability for metadata. At this U. S. retailing company, there was no formal, centralized metadata repository that could be used for data warehousing. A cursory review of metadata repositories available for purchase did not produce any candidate repositories that matched the requirements for use in their data warehousing environment. Therefore, the development team initially collected the all-important metadata and entered it into an MS-Access database, from which rudimentary reports were used to communicate the metadata. Later, the MS-Access database was converted to Informix, and Java-based Intranet applications were written to maintain, retrieve, and report on the metadata. Also available was bulk-loading of metadata from MS-Excel spreadsheets.

3.11 Data Warehouse Education / Support

In addition to good metadata describing the data warehousing environment, you need to provide for regular and targeted education regarding the data warehouse and data mart structure and content, SQL coding techniques, access tools, data privacy, and any other requests you need to field from the business users. Consider setting up an official data warehouse Intranet web site as a clearinghouse for detailed information, education requests, questions, forms, and links to related web sites. Some companies have set up a "decision support center" and allocated personnel to it to handle or route questions and assist in getting data warehouse information to business groups who are not users of the data warehouse, but know the data warehouse may contain the detail answer to their business question.

3.12 Data Warehouse Modification

Finally, you should maintain a detailed and controlled data warehouse change management process that involves the business sponsor, data / information architecture, data / metadata administration, DBA, analyst, programmer, and any other group involved in the data warehousing community [14]. In the change
management process, allow for error corrections, relatively small change requests, larger work requests or enhancements, and major projects.

4. Challenges

A number of challenges were encountered during creation of the U. S. retailing company's data warehouse. Some of these challenges were anticipated and others were not. Interestingly, throughout the project to build the data warehouse, organizational and project process challenges overshadowed technical challenges.

4.1 Organizational Challenges

Although a business sponsor was selected to represent the data warehouse, this sponsor was a member of one of four vertical business units participating in the data warehouse project. Interdepartmental cooperative promises were made, but annual incentives for members of these departments were tied to their business performance alone. As well, the annual incentives for the information technology associates selected to participate were also tied to their assigned business unit's performance. No incentive dollars were tied to the data warehouse project directly. The difficulty here is that maximum attention was paid to departmental systems, with secondary consideration given to the data warehouse project. This affected work priorities, and project time-line adjustments had to be made on a regular basis.

4.2 Data Sourcing Challenges

The data sourcing challenges have their roots in the legacy system environment. Many companies have no real system of record for critical enterprise data, their legacy systems and data being aligned more on critical process boundaries instead. Since ODS systems and ODS databases are a relatively recent spin-off of data warehousing, there is typically little or no integrated ODS data from which to source the data warehouse, as well. In an environment with weak or informal data management practices, data subjects are either not clearly established or not used at all. Saddle this situation with weak or nonexistent metadata, and the data sourcing effort will seem mammoth, and it usually is. This is why the phrase "data warehouse sourcing" has been equated with the phrase "data archaeology" -- you may know where to dig, but you simply do not have any idea what the next shovel-full will give you. Even with this realization, most data warehousing projects, even with the use of an ETL tool, woefully underestimate the data warehouse sourcing requirement itself, sometimes by as much as 100%.

A special data warehouse sourcing challenge is what data to select to put in it. Legacy systems contain many terabytes of data. How can anyone select what subset of data to copy to the data warehouse? The technique that represents perhaps the best way to do this involves selecting key master files and databases in the legacy environment and having the intended business users rank each and every element and column as to its significance from a business intelligence standpoint. This could range from a simple "yes" / "no" designation to a scoring system whereby the resultant list could be sorted by score and evaluated at many points of cost versus benefit. In addition, some business data that would score low now may score higher in the future due to business, marketing, societal, compliance, legal, or other factors. Regardless of the method of selection, there is a risk that not all needed detailed data will be captured. However, as long as the data warehousing project is never "closed", there will be future opportunity to include that data, but only to the extent that this data are kept from a historical standpoint.

4.3 Technological Pressures

As the data warehouse and its environment mature, certain technological pressures surface that seem natural and creative to middle- and upper-level management, but violate standard data warehousing architecture precepts. The first stems from the fact that a data warehouse built with excellent cleansing and integration techniques yields data that can be of higher quality than the legacy data from which it came. Top-level executives start referring to their "Customer Data Warehouse" as their "Customer Database". In turn, they will exert pressure to update legacy customer data from the data warehouse, not realizing the difference in data currency between the warehouse (up to two-week-old customer data) and the legacy system (near-real-time).

Secondly, there is pressure to push the data warehouse into a real-time or near-real-time environment. This idea often comes from on-site vendors that would dearly love to see the data warehouse pushed into a 24 hours per day x 7 days per week environment -- they would be the most likely recipients of the spending on the equipment and expertise to support a multi-terabyte operational database in a real-time environment and connect it to the operational legacy systems, as well. After seven years of existence, the data warehouse at the U. S. retailing company satisfies business user requirements by operating on a service level agreement of 10 hours per
day and 5 days per week. The cost to operate this database on a 24 hours per day x 7 days per week basis would be several times higher.

Thirdly, there is pressure to build independent data marts from non-data warehouse data directly from legacy or ODS files and databases. Some argue that this is nothing more than implementing a hybrid Inmon / Kimball data warehousing environment; however, there is no such thing. Each style has advantages over the other, but only one can be selected as the architecture for any particular data warehousing implementation. In the case where the Inmon style has been selected, all data marts should be created from the data warehouse, where the data has already been scrubbed, integrated, and stored with a high degree of quality [14]. Creating data marts from different sources for the same data allows for error and variable tolerances such that different answers may be given to the exact same business intelligence question. Always create your data marts from the "single source of truth" -- the data warehouse.

A fourth technological pressure is to proliferate application-specific data marts in an uncontrolled fashion. It takes time, education, and experience to become a skilled query writer using native SQL. Many power users have no problem accessing the native data warehouse directly using SQL. However, new users of the data warehouse could easily become discouraged if not enough time, education, and opportunity to practice are allowed to attain SQL self-sufficiency. A typical solution would be to create a data mart containing only the subset of data of interest, perhaps pre-summarized and pared down to meet a specific area of business interest. This, in itself, is not a bad thing to do if there are a multitude of business users that could benefit from this data. However, if only this one user will benefit, then this user will request additional data marts. Multiply this by several business users of the same skill, and you will soon have uncontrolled proliferation of data marts. The solution is not to prevent all creation of data marts, but to analyze requests for data marts (views) against other requests for data marts (views) and install actual data marts that satisfy more than one view. For example, if one user needs view ABDE, a second needs view ACE, a third needs view ABE, and a fourth needs view ABD, you could build a data mart supporting view ABCDE to satisfy all four users (and other users requiring the same views or even new views of interest BCD, BCE, BCDE, etc.).

4.4 Customer Data Challenges

Customer name and address data is perhaps the most difficult to establish and maintain, particularly if you have multiple stove-pipe systems each maintaining their own customer master files or databases. Do not underestimate the effort needed to overcome this challenge. You may need to purchase software to assist you in standardizing name and address formats. You will have to write some form of a customer management system to collect the customer data you have and determine if one vertical business unit's "John Smith" is the same as another vertical business unit's "John Smith". You may even need to rely on an external vendor to assist you in keeping up with name, address, and telephone number changes. At the U. S. retailing company, no less than sixteen customer master files had been identified across the multiple businesses. Long term, their goal is to install a customer management system as a robust ODS containing multiple detailed source customer data by vertical business unit. A ranking process "survives" the best individual name, address, and telephone number data and prepares it for transmission to an external vendor. The vendor matches the name and address data to its files and appends an individual key to the individual, as well as an address key to the address. The data is then transmitted back from the vendor and loaded into the data warehouse, after which sophisticated SQL "households" individuals according to those having the same address or other shared detail data. History is kept on individuals migrating from household to household, as well as individuals and households migrating from address to address. Currently, eight of the company's sixteen customer master files are included, with the remaining eight targeted for future incorporation, at which time the customer management system ODS will be upgraded and integrated with the sixteen legacy systems for two-way customer data communication.
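The householding step described above -- assigning individuals that share an address key to the same household -- is performed with SQL at the retailer, but its core grouping logic can be outlined as follows. This is only an illustrative sketch in Python, and the record and field names (`individual_key`, `address_key`) are hypothetical, not taken from the paper:

```python
from collections import defaultdict

def build_households(individuals):
    """Group individual records into households by shared address key.

    Each record is a dict carrying at least 'individual_key' and
    'address_key'; all records with the same address_key are placed
    in the same household, mirroring the grouping the paper's
    SQL performs after vendor-assigned keys are loaded.
    """
    households = defaultdict(list)
    for person in individuals:
        households[person["address_key"]].append(person["individual_key"])
    return dict(households)

# Two individuals at the same vendor-assigned address fall into one household.
people = [
    {"individual_key": "I1", "address_key": "A100"},
    {"individual_key": "I2", "address_key": "A100"},
    {"individual_key": "I3", "address_key": "A200"},
]
print(build_households(people))  # {'A100': ['I1', 'I2'], 'A200': ['I3']}
```

A production version would also match on the "other shared detail data" the paper mentions and would retain migration history rather than recomputing groups from scratch.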
[Figure: Customer Interaction Sample List]
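The data mart view-consolidation analysis described in section 4.3 (users needing views ABDE, ACE, ABE, and ABD can all be served by a single ABCDE mart) amounts to taking the union of the requested column sets and checking coverage. A minimal sketch, with illustrative function names not taken from the paper:

```python
def covering_view(requested_views):
    """Return the single view (set of columns) covering every request:
    simply the union of all requested columns, sorted for display."""
    cols = set()
    for view in requested_views:
        cols |= set(view)
    return "".join(sorted(cols))

def is_covered(view, mart):
    """A user's view is satisfied if all of its columns exist in the mart."""
    return set(view) <= set(mart)

# The paper's example: four requests collapse into one installable mart.
requests = ["ABDE", "ACE", "ABE", "ABD"]
mart = covering_view(requests)
print(mart)                                       # ABCDE
print(all(is_covered(v, mart) for v in requests))  # True
```

The same coverage check shows why the ABCDE mart also serves future views such as BCD, BCE, or BCDE without any new data mart being built.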