
Data Warehousing & INFORMATICA

DATA WAREHOUSE
A data warehouse is the main repository of an organization's historical data, its corporate memory. For
example, an organization would use the information stored in its data warehouse to find out what day of the
week it sold the most widgets in May 1992, or how employee sick leave the week before the winter break
differed between California and New York from 2001 to 2005. In other words, the data warehouse contains the raw
material for management's decision support system. The critical factor leading to the use of a data warehouse is
that a data analyst can perform complex queries and analysis on the information without slowing down the
operational systems.
While operational systems are optimized for simplicity and speed of modification (online transaction processing,
or OLTP) through heavy use of database normalization and an entity-relationship model, the data warehouse is
optimized for reporting and analysis (online analytical processing, or OLAP). Frequently, data in data warehouses is
heavily denormalized, summarized and/or stored in a dimension-based model, but this is not always required to
achieve acceptable query response times.
More formally, Bill Inmon (one of the earliest and most influential practitioners) defined a data warehouse as
follows:
Subject-oriented, meaning that the data in the database is organized so that all the data elements relating to the
same real-world event or object are linked together;

Time-variant, meaning that the changes to the data in the database are tracked and recorded so that reports can
be produced showing changes over time;
Non-volatile, meaning that data in the database is never over-written or deleted; once committed, the data is
static, read-only, and retained for future reporting;
Integrated, meaning that the database contains data from most or all of an organization's operational applications,
and that this data is made consistent.
History of data warehousing
Data warehouses became a distinct type of computer database during the late 1980s and early 1990s. They were
developed to meet a growing demand for management information and analysis that could not be met by
operational systems. Operational systems were unable to meet this need for a range of reasons:
The processing load of reporting reduced the response time of the operational systems,
The database designs of operational systems were not optimized for information analysis and reporting,
Most organizations had more than one operational system, so company-wide reporting could not be
supported from a single system, and
Development of reports in operational systems often required writing specific computer programs, which
was slow and expensive.
As a result, separate computer databases began to be built that were specifically designed to support management
information and analysis purposes. These data warehouses were able to bring in data from a range of different data
sources, such as mainframe computers and minicomputers, as well as personal computers and office automation
software such as spreadsheets, and integrate this information in a single place. This capability, coupled with user-
friendly reporting tools and freedom from operational impacts, has led to a growth of this type of computer system.
As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle
times and more features), data warehouses have evolved through several fundamental stages:
Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database
of an operational system to an off-line server, where the processing load of reporting does not impact the
operational system's performance.
Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually
daily, weekly or monthly) from the operational systems, and the data is stored in an integrated, reporting-oriented
data structure.
Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time
an operational system performs a transaction (e.g. an order, a delivery or a booking).

Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are
passed back into the operational systems for use in the daily activity of the organization.
DATA WAREHOUSE ARCHITECTURE
The term data warehouse architecture is primarily used today to describe the overall structure of a Business
Intelligence system. Other historical terms include decision support systems (DSS), management information
systems (MIS), and others.
The data warehouse architecture describes the overall system from various perspectives, such as data, process, and
infrastructure, needed to communicate the structure, function and interrelationships of each component. The
infrastructure or technology perspective details the various hardware and software products used to implement the
distinct components of the overall system. The data perspective typically diagrams the source and target data
structures and aids the user in understanding what data assets are available and how they are related. The process
perspective is primarily concerned with communicating the process and flow of data from the originating source
system through the process of loading the data warehouse, and often the process that client products use to access
and extract data from the warehouse.
DATA STORAGE METHODS
In OLTP (online transaction processing) systems, relational database design uses the discipline of data
modeling and generally follows the Codd rules of data normalization in order to ensure absolute data integrity.
Complex information is broken down into its simplest structures (tables) where all of the individual atomic-level
elements relate to each other and satisfy the normalization rules. Codd defined five increasingly stringent rules of
normalization, and OLTP systems typically achieve third normal form. Fully normalized OLTP database designs
often result in having information from a business transaction stored in dozens to hundreds of tables. Relational
database managers are efficient at managing the relationships between tables, and deliver very fast insert/update
performance because only a small amount of data is affected in each relational transaction.
OLTP databases are efficient because they typically deal only with the information around a single
transaction. In reporting and analysis, thousands to billions of transactions may need to be reassembled, imposing a
huge workload on the relational database. Given enough time the software can usually return the requested results,
but because of the negative performance impact on the machine and all of its hosted applications, data
warehousing professionals recommend that reporting databases be physically separated from the OLTP database.
In addition, data warehousing suggests that data be restructured and reformatted to facilitate query and analysis by
novice users. OLTP databases are designed to provide good performance for rigidly defined applications built by
programmers fluent in the constraints and conventions of the technology. Add in frequent enhancements, and to
many users a database is just a collection of cryptic names and seemingly unrelated, obscure structures that store data
using incomprehensible coding schemes: all factors that, while improving performance, complicate use by untrained
people. Lastly, the data warehouse needs to support high volumes of data gathered over extended periods of time,
is subject to complex queries, and needs to accommodate formats and definitions inherited from
independently designed packages and legacy systems.
Designing the data warehouse architecture is the realm of Data Warehouse Architects. The goal of a
data warehouse is to bring data together from a variety of existing databases to support management and reporting
needs. The generally accepted principle is that data should be stored at its most elemental level, because this
provides the most useful and flexible basis for use in reporting and information analysis. However, because of
different focuses on specific requirements, there can be alternative methods for designing and implementing data
warehouses. There are two leading approaches to organizing the data in a data warehouse: the dimensional
approach advocated by Ralph Kimball and the normalized approach advocated by Bill Inmon. Whilst the dimensional
approach is very useful in data mart design, it can result in a rat's nest of long-term data integration and abstraction
complications when used in a data warehouse.
In the "dimensional" approach, transaction data is partitioned into either "facts", which are generally
numeric data that capture specific values, or "dimensions", which contain the reference information that gives each
transaction its context. As an example, a sales transaction would be broken up into facts such as the number of
products ordered and the price paid, and dimensions such as date, customer, product, geographical location and
salesperson. The main advantage of a dimensional approach is that the data warehouse is easy for business staff
with limited information technology experience to understand and use. Also, because the data is pre-joined into the
dimensional form, the data warehouse tends to operate very quickly. The main disadvantage of the dimensional
approach is that it is quite difficult to add to or change later if the company changes the way in which it does
business.
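The fact/dimension split described above can be sketched with a toy star schema. The table and column names below (sales_fact, date_dim, product_dim) are invented for illustration, not taken from any particular product:

```python
import sqlite3

# In-memory database holding a minimal, illustrative star schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT);
-- The fact table holds numeric measures plus foreign keys into the dimensions.
CREATE TABLE sales_fact  (date_key INTEGER, product_key INTEGER,
                          units_sold INTEGER, amount REAL);
""")
con.executemany("INSERT INTO date_dim VALUES (?,?,?)",
                [(1, 1992, 5), (2, 1992, 6)])
con.executemany("INSERT INTO product_dim VALUES (?,?)",
                [(10, "widget"), (11, "gadget")])
con.executemany("INSERT INTO sales_fact VALUES (?,?,?,?)",
                [(1, 10, 100, 500.0), (1, 11, 20, 300.0), (2, 10, 40, 200.0)])

# A typical analytical query: aggregate the facts, constrained and
# described by the dimensions that give each row its context.
rows = con.execute("""
    SELECT d.year, d.month, p.name, SUM(f.units_sold)
    FROM sales_fact f
    JOIN date_dim d    ON f.date_key = d.date_key
    JOIN product_dim p ON f.product_key = p.product_key
    GROUP BY d.year, d.month, p.name
""").fetchall()
print(rows)
```

Because every fact row already points at its dimensions, a business question such as "widgets sold per month" is a single aggregate over one join path, which is why dimensional warehouses tend to query quickly.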
The "normalized" approach uses database normalization. In this method, the data in the data warehouse is stored in
third normal form. Tables are then grouped together by subject areas that reflect the general definition of the data
(customer, product, finance, etc.). The main advantage of this approach is that it is quite straightforward to add
new information to the database; the primary disadvantage is that, because of the number of
tables involved, it can be rather slow to produce information and reports. Furthermore, since the segregation of
facts and dimensions is not explicit in this type of data model, it is difficult for users to join the required data
elements into meaningful information without a precise understanding of the data structure.
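The join burden of the normalized approach can be seen in a small sketch. The third-normal-form layout below (customer, orders, order_line, product) is a hypothetical example; even a basic question has to chain several joins to reassemble the answer:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Illustrative third-normal-form layout: each entity lives in its own
# table and is related to the others only through keys.
con.executescript("""
CREATE TABLE customer   (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product    (product_id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE orders     (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE order_line (order_id INTEGER, product_id INTEGER, qty INTEGER);
""")
con.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
con.execute("INSERT INTO product VALUES (7, 'widget', 5.0)")
con.execute("INSERT INTO orders VALUES (100, 1)")
con.execute("INSERT INTO order_line VALUES (100, 7, 3)")

# "What did each customer spend?" requires walking the whole key chain:
# customer -> orders -> order_line -> product.
total = con.execute("""
    SELECT c.name, SUM(ol.qty * p.price)
    FROM customer c
    JOIN orders o      ON o.customer_id = c.customer_id
    JOIN order_line ol ON ol.order_id = o.order_id
    JOIN product p     ON p.product_id = ol.product_id
    GROUP BY c.name
""").fetchall()
print(total)
```

Adding a new entity is easy here (just another table and key), but every report must know and traverse this structure, which is the trade-off the text describes.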
Subject areas are just a method of organizing information and can be defined along any lines. The traditional
approach has subjects defined as the subjects or nouns within a problem space. For example, in a financial services
business, you might have customers, products and contracts. An alternative approach is to organize around the
business transactions, such as customer enrollment, sales and trades.
Advantages of using a data warehouse
There are many advantages to using a data warehouse; some of them are:
• Enhances end-user access to a wide variety of data.
• Business decision makers can obtain various kinds of trend reports, e.g. the item with the most sales in
a particular area or country for the last two years.
• A data warehouse can be a significant enabler of commercial business applications, most notably
customer relationship management (CRM).
"oncerns in using data warehouses
@xtracting, cleaning and loading data is time consuming.
8ata warehousing pro5ect scope must !e actiely managed to delier a release of defined content
and alue.
"ompati!ility pro!lems with systems already in place.
:ecurity could deelop into a serious issue, especially if the data warehouse is we! accessi!le.
8ata :torage design controersy warrants careful consideration and perhaps prototyping of the
data warehouse solution for each pro5ect's enironments.
HISTORY OF DATA WAREHOUSING
Data warehousing emerged for many different reasons as a result of advances in the field of information systems.
A vital discovery that propelled the development of data warehousing was the fundamental difference between
operational (transaction processing) systems and informational (decision support) systems. Operational systems
run in real time, whereas informational systems support decisions based on historical points in time. Below is a
comparison of the two.
"haracteristic -perational :ystems ,-.)/0 (nformational :ystems
,-.1/0
/rimary /urpose <un the !usiness on a current
!asis
:upport managerial decision
making
)ype of 8ata <eal time !ased on current
data
:napshots and predictions
/rimary Asers "lerks, salespersons,
administrators
Managers, analysts,
customers
:cope #arrow, planned, and simple
updates and *ueries
3road, complex *ueries and
analysis
4
Data Warehousing & INFORMATICA
8esign =oal /erformance throughput,
aaila!ility
@ase of flexi!le access and
use
8ata!ase
concept
"omplex simple
#ormalization 7igh .ow
)ime&focus /oint in time /eriod of time
Bolume Many & constant updates and
*ueries on one or a few ta!le
rows
/eriodic !atch updates and
*ueries re*uiring many or
all rows
Other aspects that also contributed to the need for data warehousing are:
• Improvements in database technology
o The beginning of relational data models and relational database management systems (RDBMS)
• Advances in computer hardware
o The abundant use of affordable storage and other architectures
• The importance of end-users in information systems
o The development of interfaces allowing easier use of systems for end users
• Advances in middleware products
o Enabled enterprise database connectivity across heterogeneous platforms
Data warehousing has evolved rapidly since its inception. Here is the timeline of data warehousing:
1960's: Operational systems (such as data processing) were not able to handle large and frequent requests for data
analyses. Data was stored in mainframe files and static databases. A request was processed from recorded tapes for
specific queries and data gathering. This proved to be time consuming and an inconvenience.
1980's: Real-time computer applications became decentralized. Relational models and database management
systems started emerging and becoming the wave. Retrieving data from operational databases was still a problem
because of "islands of data."
1990's: Data warehousing emerged as a feasible solution to optimize and manipulate data both internally and
externally to allow businesses to make accurate decisions.
What is data warehousing?
After information technology took the world by storm, many revolutionary concepts were created
to make it more effective and helpful. During the nineties, as new technology was being born and becoming
obsolete in no time, there was a need for a concrete, foolproof idea that could make database administration more
secure and reliable. The concept of data warehousing was thus invented to help the business decision-making
process. The working of data warehousing and its applications has been a boon to information technology
professionals all over the world. It is very important for managers to understand the architecture of how it
works and how it can be used as a tool to improve performance. The concept has revolutionized business
planning techniques.
Concept
Information processing and managing a database are the two important components for any business to have a
smooth operation. Data warehousing is a concept where the information systems are computerized. Since
many applications run simultaneously, there is a possibility that each individual process creates
exclusive "secondary data" which originates from the source. Data warehouses are useful in tracking all this
information down, analyzing it and improving performance. They offer a wide variety
of options and are highly compatible with virtually all working environments. They help the managers of companies to
gauge the progress made by the company over a period of time and also to explore new ways to improve the
growth of the company. There are many "what ifs" in business, and data warehouses are read-only integrated
databases that help to answer such questions. They are useful to form a structure of operations and analyze the
subject matter over a given time period.
T"e !tructure
As is the case with all computer applications, there are various steps involved in planning a data warehouse.
The need is analyzed; most of the time the end users are taken into consideration, and their input forms an
invaluable asset in building a customized database. The business requirements are analyzed and the "need" is
discovered. That then becomes the focus area: for example, if a company wants to analyze all its records and use the
research to improve performance, a data warehouse allows the manager to focus on this area. After the need is
zeroed in on, a conceptual data model is designed. This model is then used as the basic structure that companies
follow to build a physical database design. A number of iterations, technical decisions and prototypes are
formulated. Then the systems development life cycle of design, development, implementation and support begins.

Collection of data
The project team analyzes the various kinds of data that need to go into the database and also where they can find
the information needed to build the database. There are two different kinds of data: data that can be
found internally in the company, and data that comes from other sources. Another team of professionals works
on the creation of extraction programs that are used to collect all the information that is needed from a number of
databases, files or legacy systems. They identify these sources and then copy the data onto a staging area outside
the database. They clean all the data, a step described as cleansing, and make sure that it does not contain any
errors. They then copy all the data into the data warehouse. This concept of data extraction from the source, and
the selection and transformation processes, have been unique benchmarks of this concept. This is very important
for the project to become successful. A lot of meticulous planning is involved in arriving at a step-by-step
configuration of all the data from the source to the data warehouse.
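The extract, cleanse and load steps described above can be sketched as a minimal pipeline. The source rows, field names and cleansing rules below are invented for illustration only:

```python
# A minimal extract-cleanse-load sketch with invented data and rules.
raw_rows = [                      # "extract": rows copied from a source system
    {"cust": " Alice ", "amount": "120.50"},
    {"cust": "BOB",     "amount": "80"},
    {"cust": "",        "amount": "oops"},   # a bad record to be rejected
]

def cleanse(row):
    """Return a cleaned row, or None if the record is unusable."""
    name = row["cust"].strip().title()       # normalize the name
    try:
        amount = float(row["amount"])        # coerce the amount to a number
    except ValueError:
        return None                          # reject unparseable amounts
    if not name:
        return None                          # reject empty names
    return {"cust": name, "amount": amount}

# "staging": cleansed rows held outside the warehouse before loading.
staging = [r for r in (cleanse(row) for row in raw_rows) if r is not None]

warehouse = []                    # "load": append into the read-only target
warehouse.extend(staging)
print(warehouse)
```

Real extraction programs read from databases, files or legacy systems rather than a literal list, but the shape (extract into staging, cleanse, then load) is the same.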
Use of metadata
The whole process of extracting data and collecting it to make it an effective component of the operation requires
"metadata". The transformation of an operational system into an analytical system is achieved only with
metadata maps. The transformational metadata includes the changes in names, data changes and the physical
characteristics that exist. It also includes a description of the data and its origin and updates. Algorithms are used in
summarizing the data. Metadata provides a graphical user interface that helps non-technical end users, offering
richness in navigating and accessing the database. There is another form of metadata called operational metadata.
This forms the fundamental structure for accessing the procedures and monitoring the growth of the data warehouse
in relation to the available storage space. It also records who is responsible for accessing the data in the
warehouse and in the operational systems.
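One way to picture transformational metadata is as an explicit map from source fields to warehouse fields, recording renames, descriptions and conversions. The map below is a hypothetical sketch, not any real tool's format:

```python
# Hypothetical transformation-metadata map: for each warehouse column it
# records the source field name, a human description, and the conversion.
metadata = {
    "customer_name": {"source": "CUSTNM",  "description": "Customer legal name",
                      "transform": str.strip},
    "order_total":   {"source": "ORD_AMT", "description": "Order amount in dollars",
                      "transform": float},
}

def apply_map(source_row, metadata):
    """Build a warehouse row by following the metadata map."""
    return {target: spec["transform"](source_row[spec["source"]])
            for target, spec in metadata.items()}

# A row as it might arrive from a legacy system, with cryptic field names.
legacy_row = {"CUSTNM": "  Acme Corp ", "ORD_AMT": "99.95"}
print(apply_map(legacy_row, metadata))
```

Because the map is data rather than code, the same structure can also drive documentation for end users (which source a column came from and what was done to it), which is the dual role the text describes.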
Data marts - specific data
In every database system there is a need for updating; some do it by the day and some by the minute.
However, if a specific department needs to monitor its own data in sync with the overall business process, it
stores it as data marts. These are not as big as the data warehouse and are useful for storing the data and the
information of a specific business module. The latest trend in data warehousing is to develop smaller data marts,
manage each of them individually, and later integrate them into the overall business structure.
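A data mart can be pictured as a department-specific, often pre-aggregated subset carved out of the warehouse. The region and table names below are invented for this sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A warehouse-wide sales table (illustrative schema and data).
con.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?,?,?)", [
    ("West", "widget", 100.0), ("West", "gadget", 50.0),
    ("East", "widget", 70.0),
])

# Build a small, subject-specific "mart" for the West-region sales team:
# a pre-aggregated subset of the warehouse-wide table.
con.execute("""
    CREATE TABLE west_sales_mart AS
    SELECT product, SUM(amount) AS total
    FROM sales
    WHERE region = 'West'
    GROUP BY product
""")
mart = con.execute(
    "SELECT product, total FROM west_sales_mart ORDER BY product").fetchall()
print(mart)
```

The mart is smaller and shaped for one department's questions; integrating several such marts back into the overall structure is the trend the text mentions.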
Security and reliability
As with any information system, the trustworthiness of data is determined by the trustworthiness
of the hardware, software, and the procedures that created it. The reliability and authenticity of the data and
information extracted from the warehouse will be a function of the reliability and authenticity of the warehouse
and the various source systems that it encompasses.
In data warehouse environments specifically, there needs to be a means to ensure the integrity of data, first by
having procedures to control the movement of data to the warehouse from operational systems, and second by
having controls to protect warehouse data from unauthorized changes. Data warehouse trustworthiness and security
are contingent upon acquisition, transformation and access metadata and systems documentation.
Han and Kamber (2001) define a data warehouse as "a repository of information collected from multiple sources,
stored under a unified schema, and which usually resides at a single site."
In educational terms, all past information available in electronic format about a school or district, such as budget,
payroll, student achievement and demographics, is stored in one location where it can be accessed using a single set
of inquiry tools.
These are some of the drivers that led to data warehousing.
• CRM (customer relationship management): There is a threat of losing customers due to poor quality and sometimes
unknown reasons that nobody ever explored. Retaining existing customers has become the most important feature
of present-day business, and as a result of direct competition the concept of customer relationship management
came to the forefront. To facilitate good customer relationship management, companies are investing a lot of money
to find out the exact needs of the consumer. Data warehousing techniques have helped this cause enormously.
• Diminishing profit margins: Global competition has forced many companies that enjoyed generous profit margins
on their products to reduce their prices to remain competitive. Since the cost of goods sold remains constant,
companies need to manage their operations better to improve their operating margins. Data warehouses enable
management decision support for managing business operations.
• Deregulation: Ever-growing competition and diminishing profit margins have made companies explore
new possibilities to play the game better. A company develops in one direction and establishes a particular
core competency in the market. Once it has its own specialty, it looks for new avenues into a new
market with a completely new set of possibilities. For a company to venture into developing a new core
competency, the concept of deregulation is very important. Data warehouses are used to provide this information.
Data warehousing is useful in generating a cross-reference database that helps companies get into cross-selling;
this is the single most effective way this can happen.
C T"e comlete life c/cle. )he industry is ery olatile where we come across a wide range of new products eery
day and then !ecoming o!solete in no time. )he waiting time for the complete lifecycle often results in a heay loss
of resources of the company. )here was a need to !uild a concept which would help in tracking all the olatile
changes and update them !y the minute. )his allowed companies to !e extra safe (n regard to all their products.
)he system is useful in tracking all the changes and helps the !usiness decision process to a great deal. )hese are
also descri!ed as !usiness intelligence systems in that aspect.
• Merging of businesses: As described above, as a direct result of growing competition, companies join forces to
carve a niche in a particular market. This helps the companies work towards a common goal with twice the
number of resources. In such an event, there is a huge amount of data that has to be integrated. This data
might be on different platforms and different operating systems. To have centralized authority over the data, it is
important that a business tool be generated which is not only effective but also reliable. Data warehousing
fits the need.
Relevance of Data Warehousing for organizations
Enterprises today, both nationally and globally, are in
perpetual search of competitive advantage. An incontrovertible axiom of business management is that information
is the key to gaining this advantage. Within the explosion of data are the clues management needs to define its
market strategy. Data warehousing technology is a means of discovering and unearthing these clues, enabling
organizations to position themselves competitively within market sectors. It is an increasingly popular and powerful
concept of applying information technology to solving business problems. Companies use data warehouses to store
information for marketing, sales and manufacturing, to help managers get a feel for the data and run the business
more effectively. Managers use sales data to improve forecasting and planning for brands, product lines and business
areas. Retail purchasing managers use warehouses to track fast-moving lines and ensure an adequate supply of high-
demand products. Financial analysts use warehouses to manage currency and exchange exposures, oversee cash
flow and monitor capital expenditures.
Data warehousing has become very popular among organizations seeking competitive advantage by getting strategic
information fast and easily (Adhikari, 1996). The reasons for organizations to have a data warehouse can be
grouped into four sections:
C Ware"ou!ing data out!ide t"e oerational !/!tem!4
)he primary concept of data warehousing is that the data stored for !usiness analysis can most effectiely !e

Data Warehousing & INFORMATICA


accessed !y separating it from the data in the operational systems. Many of the reasons for this separation has
eoled oer the years. 1 few years !efore legacy systems archied data onto tapes as it !ecame inactie and many
analysis reports ran from these tapes or data sources to minimize the performance on the operational systems.

. Integrating data from more t"an one oerational !/!tem 4
8ata warehousing are more successful when data can !e com!ined from more than one operational system. +hen
data needs to !e !rought together from more than one application, it is natural that this integration !e done at a
place independent of the source application. 3efore the eolution of structured data warehouses, analysts in many
instances would com!ine data extracted from more than one operational system into a single spreadsheet or a
data!ase. )he data warehouse may ery effectiely com!ine data from multiple source applications such as sales,
marketing, finance, and production.
• Data is mostly non-volatile:
Another key attribute of the data in a data warehouse system is that the data is brought to the warehouse after it
has become mostly non-volatile. This means that after the data is in the data warehouse, there are no modifications
to be made to this information.
• Data saved for longer periods than in transaction systems:
Data from most operational systems is archived after the data becomes inactive. For example, an order may
become inactive after a set period from the fulfillment of the order, or a bank account may become inactive after
it has been closed for a period of time. The primary reason for archiving inactive data has been the
performance of the operational system. Large amounts of inactive data mixed with operational live data can
significantly degrade the performance of a transaction that is only processing the active data. Since data
warehouses are designed to be the archives for the operational data, the data here is saved for a very long period.
Advantages of a data warehouse:
There are several advantages to data warehousing. When companies have a problem that requires changes in
their transactions, they need both the information and the transaction processing to make a decision.
• Time reduction
"The warehouse has enabled employees to shift their time from collecting information to analyzing it, and that helps
the company make better business decisions." A data warehouse turns raw information into a useful analytical tool
for business decision-making. Most companies want to get information or process transactions quickly in order
to make decisions. If companies are still using traditional online transaction processing systems, it will take
longer to get the information needed. As a result, decision-making takes longer, and the
companies lose time and money. A data warehouse also makes transaction processing easier.
• Efficiency
In order to minimize inconsistent reports and provide the capability for data sharing, companies should provide
the database technology that is required to write and maintain queries and reports. A data warehouse provides, in one
central repository, all the metrics necessary to support decision-making throughout the queries and reports. Queries
and reports make management processing efficient.
• Complete Documentation
A typical data warehouse objective is to store all information, including history. This objective comes with its
own challenges: historical data is seldom kept on the operational systems, and even if it is kept, three or five years
of history are rarely found in one file. These are some of the reasons why companies need a data warehouse to store
historical data.
• Data Integration
Another primary goal for all data warehouses is to integrate data, because it is a primary deficiency in current
decision support. Another reason to integrate data is that the data content in one file may be at a different level of
granularity than that in another file, or the same data in one file may be updated at a different time period than
in another file.
Limitations:
Although a data warehouse brings a lot of advantages to a corporation, there are some disadvantages that apply to
data warehouses.
C Hig" #o!t
8ata warehouse system is too expensie. 1ccording to /hil 3lackwood, Gwith the aerage cost of data warehouse
systems alued atL1.9 millionH. )his limits small companies to !uy data warehouse system. 1s a result, only !ig
companies can afford to !uy it. (t means that not all companies hae proper system to store data and transaction
system data!ases.
Furthermore, !ecause small companies do not hae data warehouse, then it causes difficulty for small companies to
store data and information in the system that may causes small companies to organize the data as one of the
re*uirement for the company will grow.
• Complexity
Moreover, a data warehouse is a very complex system. The primary function of a data warehouse is to integrate all
the data and the transaction system databases. Because integrating the systems is complicated, data warehousing can
complicate business processes significantly. For example, a small change in a transaction processing system may have
major impacts on all transaction processing systems. Sometimes, adding, deleting, or changing data and
transactions can be time consuming. The administrator needs to control and check the correctness of a changed
transaction to limit its impact on other transactions. Therefore, the complexity of a data warehouse can prevent
companies from making necessary changes to data or transactions.
Opportunities and Challenges for Data Warehousing
Data warehousing faces tremendous opportunities and challenges, which to a large extent determine its most immediate developments and future trends. Behind all of these probable developments is the impact that the Internet has on ways of doing business and, consequently, on data warehousing, an increasingly important tool for today's and tomorrow's organizations and enterprises. The opportunities and challenges for data warehousing are mainly reflected in four aspects.
Data Quality
Data warehousing has unearthed many previously hidden data-quality problems. Most companies that have attempted data warehousing have discovered problems as they integrate information from different business units. Data that was apparently adequate for operational systems has often proved to be inadequate for data warehouses (Faden, 2000). On the other hand, the emergence of e-commerce has also opened up an entirely new source of data-quality problems. Data may now be entered at a Web site directly by a customer, a business partner, or, in some cases, by anyone who visits the site. These users are more likely to make mistakes and, in most cases, less likely to care when they do. All of this is "elevating data cleansing from an obscure, specialized technology to a core requirement for data warehousing, customer-relationship management, and Web-based commerce".
Business Intelligence
The second challenge comes from the necessity of integrating data warehousing with business intelligence to maximize profits and competency. We have been witnessing an ever-increasing demand to deploy data warehousing structures and business intelligence together. The primary purpose of the data warehouse is shifting from a focus on transforming data into information to, most recently, transforming information into intelligence.
Along this new development path, people will expect more and more analytical function from the data warehouse. The customer profile will be extended with psychographic, behavioural and competitive ownership information as companies attempt to go beyond understanding a customer's preferences. In the end, data warehouses will be used to automate actions based on business intelligence; one example is determining the supplier with which an order should be placed in order to achieve delivery as promised to the customer.
E-business and the Internet
Besides the data quality problem mentioned above, a more profound impact of this new trend on data warehousing lies in the nature of data warehousing itself.
On the surface, the rapidly expanding e-business has posed a threat to data warehouse practitioners. They may be concerned that the Internet has surpassed data warehousing in strategic importance to their company, or that Internet development skills are more highly valued than data warehousing skills. They may feel that the Internet and e-business have captured the hearts and minds of business executives, relegating data warehousing to 'second class citizen' status. However, the opposite is true.
Other Trends
While data warehousing faces many challenges and opportunities, it also creates opportunities for other fields. Some trends that have just started are as follows:
- More and more small-tier and middle-tier corporations are looking to build their own decision support systems.
- The reengineering of decision support systems more often than not ends up with an architecture that helps fuel the growth of those systems.
- Advanced decision support architectures proliferate in response to companies' increasing demands to integrate their customer relationship management and e-business initiatives with their decision support systems.
- More organizations are starting to use data warehousing metadata standards, which allow the various decision support tools to share their data with one another.
Architectural Overview
In concept the architecture required is relatively simple, as can be seen from the diagram below:
[Figure 1 - Simple Architecture: source system(s) feed a Transaction Repository via ETL; further ETL processes populate several data marts, which are accessed by reporting tools.]
However, this is a very simple design concept and does not reflect what it takes to implement a data warehousing solution. In the next section we look not only at these core components but also at the additional elements required to make it all work.
White Paper - Overview Architecture for Enterprise Data Warehouses
Components of the Enterprise Data Warehouse
The simple architecture diagram shown at the start of the document shows four core components of an enterprise data warehouse. Real implementations, however, often have many more, depending on the circumstances. In this section we look first at the core components and then at what other additional components might be needed.
The core components
The core components are those shown in Figure 1 - Simple Architecture. They are the ones that are most easily identified and described.
Source Systems
The first component of a data warehouse is the source systems, without which there would be no data. These provide the input into the solution and will require detailed analysis early in any project. Important considerations in looking at these systems include:
- Is this the master of the data you are looking for?
- Who owns/manages/maintains this system?
- Where is the source system in its lifecycle?
- What is the quality of the data in the system?
- What are the batch/backup/upgrade cycles on the system?
- Can we get access to it?
Source systems can broadly be categorised into five types:
On-line Transaction Processing (OLTP) Systems: These are the main operational systems of the business and will normally include financial systems, manufacturing systems, and customer relationship management (CRM) systems. These systems will provide the core of any data warehouse but, whilst a large part of the effort will be expended on loading them, it is the integration of the other sources that provides the value.
Legacy Systems: Organisations will often have systems that are at the end of their life, or archives of de-commissioned systems. One of the business case justifications for building a data warehouse may have been to retire these systems once the critical data has been moved into the data warehouse. This sort of data often adds to the historical richness of a solution.
Missing or Source-less Data
During analysis it is often the case that data is identified as required but no viable source exists (e.g. exchange rates used on a given date, or corporate calendar events), or a source is unusable for loading, such as a document, or the answer exists only in someone's head. There is also data required for basic operation, such as descriptions of codes.
This is therefore an important category, which is frequently forgotten during the initial design stages and then requires a last-minute fix into the system, often achieved by direct manual changes to the data warehouse. The downside of this approach is that it loses the tracking, control and auditability of the information added to the warehouse. Our advice is therefore to create a system or systems that we call the Warehouse Support Application (WSA). This is normally a number of simple data-entry forms that can capture the required data. It is then treated as another OLTP source and managed in the same way. Organisations are often concerned about how much of this they will have to build. In reality it reflects the level of good data capture in the existing business processes and current systems. If these are good then there will be few or no WSA components to build, but if they are poor then significant development will be required, and this should raise a red flag about the readiness of the organisation to undertake this type of build.
Transactional Repository (TR)
The Transactional Repository is the store of the lowest level of data and thus defines the scope and size of the database. The scope is defined by the tables available in the data model, and the size by the amount of data put into the model. Data loaded here will be clean, consistent, and time-variant. The design of the data model in this area is critical to the long-term success of the data warehouse, as it determines the scope and the cost of changes; a poor model makes mistakes expensive and inevitably causes delays.
As can be seen from the architecture diagram, the transaction repository sits at the heart of the system: it is the point where all data is integrated and where history is held. If the model, once in production, is missing key business information and cannot easily be extended when the requirements or the sources change, significant rework will result. Avoiding this cost is a factor in the choice of design for this data model.
In order to design the Transaction Repository, three data modelling approaches can be identified. Each lends itself to different organisation types, and each has its own advantages and disadvantages, although a detailed discussion of these is outside the scope of this document.
The three approaches are:
Enterprise Data Modelling (Bill Inmon)
This is a data model that starts by using conventional relational modelling techniques and will often describe the business in a conventional normalised database. There may then be a series of de-normalisations for performance and to assist extraction into the data marts.
This approach is typically used by organisations that have a corporate-wide data model and strong central control by a group such as a strategy team. These organisations also tend to have more internally developed systems rather than third-party products.
Data Bus (Ralph Kimball)
The data model for this type of solution is normally made up of a series of star schemas that have evolved over time, with dimensions becoming 'conformed' as they are re-used. The transaction repository is made up of these base star schemas and their associated dimensions. The data marts in the architecture will often just be views, either directly onto these schemas or onto aggregates of them. This approach is particularly suitable for companies that have evolved from a number of independent data marts and are growing into a more mature data warehouse environment.
Process Neutral Model
A Process Neutral Data Model is a data model from which all embedded business rules have been removed. If this is done correctly then, as business processes change, little or no change should be required to the data model, and Business Intelligence solutions designed around such a model should not be subject to limitations as the business changes.
This is achieved both by making many relationships optional and of multiple cardinality, and by carefully ensuring the model is generic rather than reflecting only the views and needs of one or more specific business areas. Although this sounds simple (and it is, once you get used to it), in reality it takes a while to fully understand and to be able to achieve. This type of data model has been used by a number of very large organisations, where it combines some of the best features of both the data bus approach and enterprise data modelling. As with enterprise data modelling it sets out to describe the entire business, but rather than normalise data it uses an approach that embeds the metadata (data about data) in the data model, and it often contains natural star schemas. This approach is generally used by large corporations that have one or more of the following attributes: many legacy systems, a number of systems resulting from business acquisitions, no central data model, or a rapidly changing corporate environment.
Data Marts
The data marts are areas of a database where the data is organised for user queries, reporting and analysis. Just as with the design of the Transaction Repository, there are a number of design types for data marts. The choice depends on factors such as the design of the transaction repository and which tools are to be used to query the data marts.
The most commonly used models are star schemas and snowflake schemas where direct database access is made, whilst data cubes are favoured by some tool vendors. It is also possible to have single-table solution sets if this meets the business requirement. There is no need for all data marts to have the same design type; as they are user-facing, it is important that they are fit for purpose for the user rather than what suits a purist architecture.
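A star schema of the kind mentioned above can be sketched in a few lines of SQL, here run through Python's built-in sqlite3 module. The table names, columns and rows are illustrative only: a single fact table holds the measures, and each surrounding dimension table holds the attributes used to constrain or group them.

```python
import sqlite3

# A minimal star schema: one fact table surrounded by dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_sales  (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        quantity   INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'Cola', 'Soft Drink'), (2, 'Crisps', 'Snack');
    INSERT INTO dim_store   VALUES (10, 'London'), (11, 'York');
    INSERT INTO fact_sales  VALUES (1, 10, 5), (1, 11, 3), (2, 10, 7);
""")

# A typical mart query: aggregate the fact, grouped by a dimension attribute.
rows = conn.execute("""
    SELECT p.category, SUM(f.quantity)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

The same shape scales up: users only ever join the central fact table to the dimensions they want to filter or group by, which is what makes star schemas easy for query tools to exploit.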
Extract, Transform, Load (ETL) Tools
ETL tools are the backbone of the data warehouse, moving data from source to transaction repository and on to data marts. They must deal with load performance for large volumes and with complex transformation of data, in a repeatable, scheduled environment. These tools build the interfaces between components in the architecture and will also often work with data cleansing elements to ensure that the most accurate data is available. The need for a standard approach to ETL design within a project is paramount. Developers will often create an intricate and complicated solution where a simple one exists, often requiring little compromise. Any compromise in the deliverable is usually accepted by the business once they understand that these simple approaches will save a great deal of money in the time taken to design, develop, test and ultimately support the solution.
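The source-to-repository movement described above can be sketched as three small, standard steps. This is a deliberately simple illustration of the pattern, not any particular tool's API; the source rows and the cleaning rules are invented.

```python
def extract(source_rows):
    """Extract: read raw records from a (hypothetical) source system."""
    return list(source_rows)

def transform(rows):
    """Transform: apply simple, standard rules rather than intricate logic."""
    out = []
    for raw_id, raw_name, raw_amount in rows:
        out.append({
            "customer_id": int(raw_id),
            "name": raw_name.strip().title(),     # standardise name formatting
            "amount": round(float(raw_amount), 2),  # standardise precision
        })
    return out

def load(rows, target):
    """Load: append the cleaned rows into the target table (a list here)."""
    target.extend(rows)
    return len(rows)

# Raw source rows, as a real extract might deliver them.
source = [("1", "  alice SMITH ", "10.5"), ("2", "bob jones", "3.333")]
target_table = []
loaded = load(transform(extract(source)), target_table)
```

Keeping each step this plain is the point: a pipeline built from small, repeatable rules is easy to schedule, test and support, which is exactly the argument made above.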
Analysis and Reporting Tools
Collecting all of the data into a single place and making it available is useless without the ability for users to access the information. This is done with a set of analysis and reporting tools. Any given data warehouse is likely to have more than one tool. The types of tool fall into four broad categories:
- Simple reporting tools that produce fixed or simple parameterised reports.
- Complex ad hoc query tools that allow users to build and specify their own queries.
- Statistical and data mining packages that allow users to delve into the information contained within the data.
- What-if tools that allow users to extract data and then modify it to role-play or simulate scenarios.
Additional Components
In addition to the core components, a real data warehouse may require any or all of the following components to deliver the solution. The requirement to use a component should be considered by each programme on its own merits.
Literal Staging Area (LSA)
Occasionally the implementation of the data warehouse encounters environmental problems, particularly with legacy systems (e.g. a mainframe system that is not easily accessible by applications and tools). In this case it might be necessary to implement a Literal Staging Area, which holds a literal copy of the source system's content but in a more convenient environment (e.g. moving mainframe data into an ODBC-accessible relational database). This literal staging area then acts as a surrogate for the source system for use by the downstream ETL interfaces.
There are some important benefits associated with implementing an LSA:
- It makes the system more accessible to downstream ETL products.
- It creates a quick win for projects that have been trying to get data off, for example, a mainframe in a more laborious fashion.
- It is a good place to perform data quality profiling.
- It can be used as a point close to the source at which to perform data quality cleaning.
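The "literal copy" idea above can be sketched as loading a flat-file extract into a relational table with no cleaning or transformation at all. The file layout, delimiter and table name below are invented for illustration; note how even the fixed-width padding in the source is preserved, because the copy stays literal.

```python
import sqlite3

# Hypothetical mainframe extract delivered as a pipe-delimited flat file.
flat_file_lines = [
    "0001|SMITH   |NY",
    "0002|JONES   |CA",
]

# The LSA is just a conveniently accessible database holding a literal copy.
lsa = sqlite3.connect(":memory:")
lsa.execute("CREATE TABLE lsa_customer (cust_no TEXT, surname TEXT, state TEXT)")
for line in flat_file_lines:
    # No cleaning, no integration: the copy stays literal, padding and all.
    lsa.execute("INSERT INTO lsa_customer VALUES (?, ?, ?)", line.split("|"))
lsa.commit()

copied = lsa.execute("SELECT COUNT(*) FROM lsa_customer").fetchone()[0]
```

Downstream ETL interfaces can then treat `lsa_customer` as if it were the source system itself, while profiling and cleansing happen close to the source.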
Transaction Repository Staging Area (TRS)
ETL loading will often need an area for intermediate data sets, or working tables, which for clarity and ease of management should not be in the same area as the main model. This area is used when bringing data from a source system, or its surrogate, into the transaction repository.
Data Mart Staging Area (DMS)
As with the transaction repository staging area, there is a need for space between the transaction repository and the data marts for intermediate data sets. This area provides that space.
Operational Data Store (ODS)
An operational data store is an area used to take data from a source and, if required, lightly aggregate it to make it quickly available. This is required for certain types of reporting that need to be available in real time (updated within 15 minutes) or near time (for example, 15 to 60 minutes old). The ODS will not normally clean, integrate, or fully aggregate data (as the data warehouse does), but it will provide rapid answers; the data will then become available via the data warehouse once the cleaning, integration and aggregation have taken place in the next batch cycle.
Tools & Technology
The component diagrams above show all the areas and elements needed. This translates into a significant list of tools and technologies required to build and operationally run a data warehouse solution. These include:
- Operating system
- Database
- Backup and recovery
- Extract, Transform, Load (ETL)
- Data quality profiling
- Data quality cleansing
- Scheduling
- Analysis & reporting
- Data modelling
- Metadata repository
- Source code control
- Issue tracking
- Web-based solution integration
The tools selected should operate together to cover all of these areas. The technology choices will also be influenced by whether the organisation needs to operate a homogeneous environment (all systems of the same type) or a heterogeneous one (systems may be of differing types), and also by whether the solution is to be centralised or distributed.
Operating System
The server-side operating system is usually an easy decision, normally following the recommendation in the organisation's Information Systems strategy. The operating system choice for enterprise data warehouses tends to be a Unix/Linux variant, although some organisations do use Microsoft operating systems. It is not the purpose of this paper to make any recommendation here, and the choice should be the result of the organisation's normal procurement procedures.
Database
The database falls into a very similar category to the operating system in that for most organisations it is a given, chosen from a select few including Oracle, Sybase, IBM DB2 or Microsoft SQL Server.
Backup and Recovery
This may seem like an obvious requirement, but it is often overlooked or slipped in at the end. From 'day 1' of development there will be a need to back up and recover the databases from time to time. The backup poses a number of issues:
- Ideally backups should be done whilst allowing the database to stay up.
- It is not uncommon for elements to be backed up during the day, as this is the point of least load on the system and it is often read-only at that point.
- It must handle large volumes of data.
- It must cope with both databases and source data in flat files.
The recovery has to deal with the related consequence of the above:
- Recovery of large databases quickly to a point in time.
Extract, Transform, Load (ETL)
The purpose of the extract, transform and load (ETL) software, to create interfaces, has been described above and is at the core of the data warehouse. The market for such tools is constantly moving, with a trend for database vendors to include this sort of technology in their core product. Some of the considerations for selection of an ETL tool include:
- Ability to access source systems
- Ability to write to target systems
- Cost of development (it is noticeable that some of the easiest tools to deploy and operate are not easy to develop with)
- Cost of deployment (it is also noticeable that some of the easiest tools to develop with are not easy to deploy or operate)
- Integration with scheduling tools
Typically only one ETL tool is needed; however, it is common for specialist tools to be used from a source system to a literal staging area as a way of overcoming a limitation in the main ETL tool.
Data Quality Profiling
Data profiling tools look at the data and identify issues with it, using some of the following techniques:
- Looking at individual values in a column to check that they are valid
- Validating data types within a column
- Looking for rules about uniqueness or frequencies of certain values
- Validating primary and foreign key constraints
- Validating that data within a row is consistent
- Validating that data is consistent within a table
- Validating that data is consistent across tables
This is important both for analysts when examining the system and for developers when building it. It will also identify data quality cleansing rules that can be applied to the data before loading. It is worth noting that good analysts will often do this without tools, especially if good analysis templates are available.
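A column profile of the kind listed above can be sketched as a small summary function. The column below (a hypothetical 'gender' field from a candidate source table) is invented; a real profiling tool would produce this per column, at scale, but the statistics are the same in spirit.

```python
from collections import Counter

def profile_column(values):
    """Summarise one column: row count, nulls, distinct values, top frequencies."""
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(freq),
        "most_common": freq.most_common(2),
    }

# Hypothetical 'gender' column pulled from a candidate source system.
stats = profile_column(["M", "F", "M", None, "m", "F", "F"])
```

Even this tiny profile surfaces two of the classic findings: a null that an operational system tolerated, and an inconsistent code ("m" alongside "M") that a cleansing rule will need to fix before loading.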
Data Quality Cleansing
This tool updates data to improve the overall data quality, often based on the output of the data quality profiling tool. There are essentially two types of cleansing tools:
- Rule-based cleansing performs updates on the data based on rules (e.g. make everything uppercase; replace two spaces with a single space, etc.). These rules can be very simple or quite complex, depending on the tool used and the business requirement.
- Heuristic cleansing performs cleansing when given only an approximate method of solving the problem within the context of some goal, and then uses feedback from the effects of the solution to improve its own performance. This is commonly used for address-matching problems.
An important consideration when implementing a cleansing tool is that the process should be performed as closely as possible to the source system; if it is performed further downstream, the same data will be repeatedly presented for cleansing.
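Rule-based cleansing, as described above, can be sketched as an ordered list of small update rules applied in sequence. The rules below are the two examples given in the text (uppercase, collapse repeated spaces) plus a trim rule; a real tool would manage many more, but the shape is the same.

```python
import re

# Each rule is a simple, repeatable update; order can matter.
RULES = [
    lambda s: s.strip(),                  # trim surrounding whitespace
    lambda s: re.sub(r"\s{2,}", " ", s),  # replace runs of spaces with one
    lambda s: s.upper(),                  # make everything uppercase
]

def cleanse(value, rules=RULES):
    """Apply every rule in order to a single value."""
    for rule in rules:
        value = rule(value)
    return value

cleaned = cleanse("  10  downing   street ")
```

Because each rule is deterministic, the same input always cleans the same way, which is what makes rule-based cleansing safe to run repeatedly close to the source.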
Scheduling
With backups, ETL and batch reporting runs, the data warehouse environment has a large number of jobs to be scheduled (typically hundreds per day) with many dependencies, for example:
- The backup can only start at the end of the business day and provided that the source system has generated a flat file; if the file does not exist then the job must poll for thirty minutes to see if it arrives, otherwise notify an operator.
- The data mart load cannot start until the transaction repository load is complete, but may then run six different data mart loads in parallel.
This should be done via a scheduling tool that integrates into the environment.
Analysis & Reporting
The analysis and reporting tools are the users' main interface into the system. As has already been discussed, there are four main types:
- Simple reporting tools
- Complex ad hoc query tools
- Statistical and data mining packages
- What-if tools
Whilst the market for such tools changes constantly, the recognised source of information is The OLAP Report [2].
Data Modelling
With all the data models that have been discussed, it is obvious that a tool in which to build data models is required. This will allow designers to graphically manage data models and generate the code to create the database objects. The tool should be capable of both logical and physical data modelling.
Metadata Repository
Metadata is data about data. In the case of the data warehouse this will include information about the sources, targets, loading procedures, when those procedures were run, and information about what certain terms mean and how they relate to the data in the database. The metadata required is defined in a subsequent section on documentation; however, the information itself will need to be held somewhere. Most tools have some elements of a metadata repository, but there is a need to identify what constitutes the entire repository by identifying which parts are held in which tools.
[2] The OLAP Report by Nigel Pendse and Richard Creeth is an independent research resource for organizations buying and implementing OLAP applications.
Source Code Control
Up to this point you will have noticed that we have remained steadfastly vendor-independent, and we remain so here. However, the issue of source control is one of the biggest impacts on a data warehouse. If the tools that you use do not have version control, or your tools do not integrate to allow version control across them, and your organisation does not have a source code control tool, then download and use CVS: it is free and multi-platform, and we have found it can be made to work with most of the tools in the other categories. There are also Microsoft Windows clients and web-based tools available for CVS.
Issue Tracking
In a similar vein to source code control, most projects do not deal with issue tracking well, the worst nightmare being a spreadsheet that is mailed around once a week for updates. We again recommend that, if a suitable tool is not already available, you consider an open source tool called Bugzilla.
Web Based Solution Integration
Running a programme such as the one described will bring much information together, and it is important to bring everything together in an accessible fashion. Fortunately, web technologies provide an easy way to do this.
An ideal environment would allow communities to see some or all of the following via a secure web-based interface:
- Static reports
- Parameterised reports
- Web-based reporting tools
- Balanced scorecards
- Analysis
- Documentation
- Requirements library
- Business terms definitions
- Schedules
- Metadata reports
- Data quality profiles
- Data quality rules
- Data quality reports
- Issue tracking
- Source code
There are two similar but different technologies available to do this, depending on the corporate approach or philosophy:
- Portals: these provide personalised websites and make use of distributed applications to provide a collaborative workspace.
- Wikis [3]: these provide a website that allows users to easily add and edit content and link to other web applications.
Both can be very effective in developing a common understanding of what the data warehouse does and how it operates, which in turn leads to a more engaged user community and a greater return on investment.
[3] A wiki is a type of website that allows users to easily add and edit content and is especially suited to collaborative writing. In essence, a wiki simplifies the process of creating HTML web pages and combines it with a system that records each individual change over time, so that at any time a page can be reverted to any of its previous states. A wiki system may also provide various tools that allow the user community to easily monitor the constantly changing state of the wiki and discuss the issues that emerge in trying to achieve a general consensus about wiki content.
Documentation Requirements
Given the size and complexity of the enterprise data warehouse, a core set of documentation is required, which is described in the following section. If a structured project approach is adopted, these documents would be produced as a natural by-product; however, we would recommend the following set of documents as a minimum. To facilitate this, at Data Management & Warehousing we have developed our own set of templates for this purpose.
Requirements Gathering
This is a document managed using a word processor.
Timescales: at the start of the project, 40 days' effort plus ongoing updates.
There are four sections to our requirements template:
- Facts: the key figures that a business requires. Often these will be associated with Key Performance Indicators (KPIs) and the information required to calculate them, i.e. the metrics required for running the company. An example of a fact might be the number of products sold in a store.
- Dimensions: the information used to constrain or qualify the facts. Examples include the list of products, the date of a transaction, or some attribute of the customer who purchased the product.
- Queries: the typical questions that a user might want to ask, for example 'How many cans of soft drink were sold to male customers on the 2nd of February?'. These draw on the requirements sections covering the available facts and dimensions.
- Non-functional: the requirements that do not directly relate to the data, such as when the system must be available to users, how often it needs to be refreshed, what quality metrics should be recorded about the data, who should be able to access it, etc.
Note that whilst an initial requirements document will come early in the project, it will undergo a number of versions as the user community matures in its use and understanding of the system and the data available to it.
Key Design Decisions
This is a document managed using a word processor.
Timescales: 0.5 days' effort as and when required.
This is a simple one- or two-page template used to record the design decisions made during the project. It contains the issue, the proposed outcome, any counter-arguments and why they were rejected, and the impact on the various teams within the project. It is important because, given the long-term nature of such projects, there is often a revisionist element that queries why decisions were made and spends time revisiting them.
Data Model
This is held in the data modelling tool's internal format.
Timescales: at the start of the project, 20 days' effort plus ongoing updates.
Both logical and physical data models will be required. The logical data model is an abstract representation of a set of data entities and their relationships, usually including their key attributes. It is intended to facilitate analysis of the function of the data design and is not intended to be a full representation of the physical database. It is typically produced early in system design, and it is frequently a precursor to the physical data model that documents the actual implementation of the database.
In parallel with the gathering of requirements, the data models for the transaction repository and the initial data marts will be developed. These will be maintained throughout the life of the solution.
Analysis
These are documents managed using a word processor. The analysis phase of the project is broken down into three main templates, each serving as a step in the progression of understanding required to build the system. During the system analysis part of the project, the following three areas must be covered and documented:
Source System Analysis (SSA)
Timescales: 2-3 days' effort per source system.
This is a simple high-level overview of each source system to understand its value as a potential source of business information, and to clarify its ownership and longevity. This is normally done for all systems that are potential sources. As the name implies, this looks at the 'system' level and identifies 'candidate' systems.
These documents are only updated at the start of each phase, when candidate systems are being identified.
Source Entity Analysis (SEA)
Timescales: 7-10 days' effort per system.
This is a detailed look at the 'candidate' systems, examining the data, the data quality issues, frequency of update, access rights, etc. The output is a list of tables and fields that are required to populate the data warehouse. These documents are updated at the start of each phase, when candidate systems are being examined, and as part of the impact analysis of any upgrade to a system that was used in a previous phase.

Target Oriented Analysis (TOA)
Timescales: 15-20 days effort for the Transaction Repository, 3-5 days effort for each data mart. This is a document
that describes the mappings and transformations that are required to populate a target object. It is important that
this is target focused, as a common failing is to look at the source and ask the question "Where do I put all these bits
of information?" rather than the correct question, which is "I need to populate this object: where do I get the
information from?"

Operations Guide
This is a document managed using a word-processor. Timescales: 20 days towards the end of the development
phase. This document describes how to operate the system; it will include the schedule for running all the ETL jobs,
including dependencies on other jobs and external factors such as the backups of a source system. It will also
include instructions on how to recover from failure and what the escalation procedures for technical problem
resolution are. Other sections will include information on current sizing, predicted growth and key data inflection
points (e.g. year end, where there is a particularly large number of journal entries). It will also include the backup
and recovery plan, identifying what should be backed up and how to perform system recoveries from backup.
Security Model
This is a document managed using a word-processor.
Timescales: 10 days effort after the data model is complete, 5 days effort toward the development phase.
This document should identify who can access what data, when and where. This can be a complex issue, but the
above architecture can simplify it, as most access control needs to be around the data marts, and nearly everything
else will only be visible to the ETL tools extracting and loading data into them.
Issue Log
This is held in the issue logging system's internal format.
Timescales: Daily, as required.
As has already been identified, the project will require an issue log that tracks issues during the development and
operation of the system.
Metadata
There are two key categories of metadata, as discussed below:
Business Metadata
This is a document managed using a word-processor, or a Portal or Wiki if available.
Business Definitions Catalogue
Timescales: 20 days effort after the requirements are complete, and ongoing maintenance.
This is a catalogue of business terms and their definitions. It is all about adding context to data and making meaning
explicit, providing definitions for business terms, data elements, acronyms and abbreviations. It will often
include information about who owns the definition and who maintains it, and, where appropriate, what formula is
required to calculate it. Other useful elements will include synonyms, related terms and preferred terms. Typical
examples can include definitions of business terms such as "Net Sales Value" or "Average revenue per customer", as
well as definitions of hierarchies and common terms such as customer.
Technical Metadata
This is the information created by the system as it is running. It will either be held in server log
files or in databases.
Server / Database availability
This includes all information about which servers and databases were available, and when. It serves two purposes:
firstly, the monitoring and management of service level agreements (SLAs); and secondly, performance optimisation, to fit
the ETL into the available batch window and to ensure that users have good reporting performance.
ETL Information
This is all the information generated by the ETL process, and will include items such as:
• When was a mapping created or changed?
• When was it last run?
• How long did it run for?
• Did it succeed or fail?
• How many records were inserted, updated or deleted?
This information is again used to monitor the effective running and operation of the system, not only in failure but
also by identifying trends, such as mappings or transformations whose performance characteristics are changing.
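The run items listed above are typically captured in a small log table that every ETL job appends to. The sketch below is illustrative only: the table and column names (such as `etl_run_log`) are my own assumptions, not Informatica's actual repository schema. It shows one way such metadata can be recorded and then queried for trends, using SQLite.

```python
import sqlite3

# Hypothetical ETL run log -- names are illustrative, not a real repository schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_run_log (
        mapping_name  TEXT,
        started_at    TEXT,
        duration_secs INTEGER,
        status        TEXT,      -- 'SUCCEEDED' or 'FAILED'
        rows_inserted INTEGER,
        rows_updated  INTEGER,
        rows_deleted  INTEGER
    )""")

# Each ETL job appends one row per run.
runs = [
    ("load_customer_dim", "2005-01-10 01:00", 420, "SUCCEEDED", 1500, 30, 0),
    ("load_customer_dim", "2005-01-11 01:00", 455, "SUCCEEDED", 1480, 25, 0),
    ("load_sales_fact",   "2005-01-11 01:30",   0, "FAILED",       0,  0, 0),
]
conn.executemany("INSERT INTO etl_run_log VALUES (?,?,?,?,?,?,?)", runs)

# Trend query: average run time per mapping, with a failure count --
# the kind of report used to spot mappings whose performance is drifting.
summary = list(conn.execute("""
    SELECT mapping_name,
           AVG(duration_secs) AS avg_secs,
           SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) AS failures
    FROM etl_run_log
    GROUP BY mapping_name
    ORDER BY mapping_name"""))
print(summary)  # [('load_customer_dim', 437.5, 0), ('load_sales_fact', 0.0, 1)]
```

A report like this, run daily, answers the bullet-point questions above directly and makes gradual slow-downs visible long before they break the batch window.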
Query Information
This gathers information about which queries the users are making. The information will include:
• What are the queries that are being run?
• Which tables do they access?
• Which fields are being used?
• How long do queries take to execute?
This information is used to optimise the user experience, but also to remove redundant information that is no
longer being queried by users.
Some additional high-level guidelines
The following items are just some of the common issues that arise in delivering data warehouse solutions. Whilst
not exhaustive, they are some of the most important factors to consider:
Programme or project?
For data warehouse solutions to be successful (and financially viable), it is important for organisations to view the
development as a long term programme of work, and to examine how the work can be broken up into smaller
component projects for delivery. This enables many smaller quick wins at different stages of the programme whilst
retaining focus on the overall objective.
Examples of this approach may include the development of tactical independent data marts, a literal staging area
to facilitate reporting from a legacy system, or prioritization of the development of particular reports which can
significantly help a particular business function. Most successful data warehouse programmes will have an
operational life in excess of ten years, with peaks and troughs in development.
The technology trap
At the outset of any data warehouse project, organisations frequently fall into the trap of wanting to design the
largest, most complex and functionally all-inclusive solution. This will often tempt the technical teams to use the
latest, greatest technology promised by a vendor.
However, building a data warehouse is not about creating the biggest database or using the cleverest technology; it
is about putting lots of different, often well established, components together so that they can function successfully
to meet the organisation's data management requirements. It also requires sufficient design such that when the
next enhancement or extension of the requirement comes along, there is a known and well understood business
process and technology path to meet that requirement.

Vendor Selection
This document presents a vendor-neutral view. However, it is important (and perhaps obvious) to note that the
products which an organisation chooses to buy will dramatically affect the design and development of the system.
In particular, most vendors are looking to spread their coverage in the market space. This means that two selected
products may have overlapping functionality, and therefore which product to use for a given piece of functionality
must be identified. It is also important to differentiate between strategic and tactical tools.
The other major consideration is that this technology market space changes rapidly. The process whereby
vendors constantly add features similar to those of another competing product means that few vendors will have a
significant long term advantage on features alone. Most features that you will require (rather than those that are
sometimes desired) will become available during the lifetime of the programme in market leading products, if they
are not already there.
The rule of thumb when assessing products is therefore to follow the basic Gartner-type magic quadrant of
"ability to execute" and "completeness of vision", and to combine that with your organisation's view of the long term
relationship it has with the vendor, and the fact that a series of rolling upgrades to the technology will be required
over the life of the programme.

Develoment artner!
)his is one of the thorniest issues for large organisations as they often hae policies that outsource
deelopment work to third parties and do not want to create internal teams.
(n practice the issue can !e !roken down with programme management and !usiness
re*uirements !eing sourced internally. )echnical design authority is either an external domain expert who
transitions to an internal person or an internal person if suita!le skills exist.
(t is then possi!le for indiidual deelopment pro5ects to !e outsourced to deelopment partners. (n general
the market place has more contractors with this type of experience than permanent staff with specialist
domain2technology knowledge and so some contractor !ase either internally or at the deelopment partner is
almost ineita!le. Altimately it comes down to the indiiduals and how they come together as a team, regardless of
the supplier and the !est teams will !e a !lend of the !est people.

The development and implementation sequence
Data warehousing on this scale requires a top down approach to requirements and a bottom up approach to the
build. In order to deliver a solution, it is important to understand what is required of the reports, where that is
sourced from in the transaction repository, and how in turn the transaction repository is populated from the source
systems. Conversely, the build must start at the bottom and build up through the transaction repository and on to the
data marts.
Each build phase will look to either build up (i.e. add another level) or build out (i.e. add another source).
This approach means that the project manager can firstly be assured that the final destination will meet the users'
requirements, and that the build can be optimized by using different teams to build up in some areas whilst other
teams are building out the underlying levels. Using this model it is also possible to change direction after each
completed phase.

Homogeneous / Heterogeneous Environments
This architecture can be deployed using homogeneous or heterogeneous technologies. In a homogeneous
environment all the operating systems, databases and other components are built using the same technology, whilst
a heterogeneous solution would allow multiple technologies to be used, although it is usually advisable to limit this
to one technology per component.
For example, using Oracle on UNIX everywhere would be a homogeneous environment, whilst using Sybase for the
transaction repository and all staging areas on a UNIX environment, and Microsoft SQL Server on Microsoft Windows
for the data marts, would be an example of a heterogeneous environment.
The trade off between the two deployments is the cost of integration and of managing additional skills in a
heterogeneous environment, compared with the suitability of a single product to fulfil all roles in a homogeneous
environment. There is obviously a spectrum of solutions between the two end points, such as the same operating
system but different databases.
Centralised vs. Distributed solutions
This architecture also supports deployment in either a centralised or a distributed mode. In a centralised solution all
the systems are held at a central data centre; this has the advantage of easy management, but may result in a
performance impact where users that are remote from the central solution suffer problems over the network.
Conversely, a distributed solution provides local solutions, which may have a better performance profile for local
users, but might be more difficult to administer and will suffer from capacity issues when loading the data. Once
again there is a spectrum of solutions, and therefore there are degrees to which this can be applied. It is normal that
centralised solutions are associated with homogeneous environments whilst distributed environments are usually
heterogeneous; however this need not always be the case.
Converting Data from Application Centric to User Centric
Systems such as ERP systems are effectively systems designed to pump data through a particular business process
(application-centric). A data warehouse is designed to look across systems (user-centric) to allow the user to view
the data they need to perform their job.
As an example: raising a purchase order in an ERP system is optimised to get the purchase order from being raised,
through approval, to being sent out; whilst the data warehouse user may want to look at who is raising orders, the
average value, who approves them and how long they take to do the approval. Requirements should therefore
reflect the view of the data warehouse user, and not what a single application can provide.
Analysis and Reporting Tool Usage
When buying licences etc. for the analysis and reporting tools, a common mistake is to require many thousands of
seats for a given reporting tool. Once delivered, the number of users never rises to the original estimates. The
diagram below illustrates why this occurs:
[Figure 5 - Analysis and Reporting Tool Usage: a pyramid relating flexibility of data access and complexity of tool
(vertical axis) to the size of the user community (horizontal axis). Data mining and ad hoc reporting tools sit at the
narrow top, used by senior analysts and researchers; parameterised reporting sits in the middle, used by business
analysts; and fixed, web based desktop reporting sits at the broad base, used by business users, customers and
suppliers.]
What the diagram shows is that there is an inverse relationship between the degree of reporting flexibility
required by a user and the number of users requiring this access.
There will be very few people, typically business analysts and planners, at the top, but these individuals will need to
have tools that really allow them to manipulate and mine the data. At the next level down, there will be a
somewhat larger group of users who require ad hoc reporting access; these people will normally be developing or
improving reports that get presented to management. The remaining, but largest, community of the user base will
only have a requirement to be presented with data in the form of pre-defined reports with varying degrees of
in-built flexibility: for instance, managers, sales staff, or even suppliers and customers coming into the solution over
the internet. This broad community will also influence the choice of tool, to reflect the skills of the users. Therefore
no individual tool will be perfect, and it is a case of fitting the users and a selection of tools together to give the
best results.
Data Warehouse: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision making process." (Bill Inmon)
Design Pattern: A design pattern provides a generic approach, rather than a specific solution, for building a
particular system or systems.
Dimension Table: Dimension tables contain attributes that describe fact records in the fact table.
Distributed Solution: A system architecture where the system components are distributed over a number of sites to
provide local solutions.
DMS: Data Mart Staging, a component in the data warehouse architecture for staging data.
ERP: Enterprise Resource Planning, a business management system that integrates all facets of the business,
including planning, manufacturing, sales, and marketing.
ETL: Extract, Transform and Load. The activities required to populate data warehouses and OLAP applications with
clean, consistent, integrated and properly summarized data. Also a component in the data warehouse architecture.
Fact Table: In an organisation, the "facts" are the key figures that a business requires. Within that organisation's
data mart, the fact table is the foundation from which everything else arises.
Heterogeneous System: An environment in which all or any of the operating systems, databases and other
components are built using different technologies, and are then integrated by means of customized interfaces.
Heuristic Cleansing: Cleansing by means of an approximate method for solving a problem within the context of a
goal. Heuristic cleansing then uses feedback from the effects of its solution to improve its own performance.
Homogeneous System: An environment in which the operating systems, databases and other components are built
using the same technology.
KDD: Key Design Decision, a project template.
KPI: Key Performance Indicators. KPIs help an organization define and measure progress toward organizational
goals.
LSA: Literal Staging Area. Data from a legacy system is taken and stored in a database in order to make this data
more readily accessible to the downstream systems. A component in the data warehouse architecture.
Middleware: Software that connects, or serves as the "glue" between, two otherwise separate applications.
Near-time: Refers to data being updated by means of batch processing at intervals of between 15 minutes and 1
hour (in contrast to "real-time" data, which needs to be updated within 15 minute intervals).
Normalisation: Database normalization is a process of eliminating duplicated data in a relational database. The key
idea is to store data in one location, and provide links to it wherever needed.
ODS: Operational Data Store, also a component in the data warehouse architecture, which allows near-time
reporting.
OLAP: On-Line Analytical Processing. A category of applications and technologies for collecting, managing,
processing and presenting multidimensional data for analysis and management purposes.
OLTP: Online Transaction Processing, a form of transaction processing conducted via a computer network.
Portal: A web site or service that offers a broad array of resources and services, such as e-mail, forums and search
engines.
Process Neutral Model: A Process Neutral Data Model is a data model in which all embedded business rules have
been removed. If this is done correctly, then as business processes change there should be little or no change
required to the data model. Business Intelligence solutions designed around such a model should therefore not be
subject to limitations as the business changes.
Rule Based Cleansing: A data cleansing method which performs updates on the data based on rules.
SEA: Source Entity Analysis, an analysis template.
Snowflake Schema: A variant of the star schema with normalized dimension tables.
SSA: Source System Analysis, an analysis template.
Star Schema: A relational database schema for representing multidimensional data. The data is stored in a central
fact table, with one or more tables holding information on each dimension. Dimensions have levels, and all levels
are usually shown as columns in each dimension table.
TOA: Target Oriented Analysis, an analysis template.
TR: Transaction Repository. The collated, clean repository for the lowest level of data held by the organisation, and
a component in the data warehouse architecture.
TRS: Transaction Repository Staging, a component in the data warehouse architecture used to stage data.
Wiki: A wiki is a type of website, or the software needed to operate it, that allows users to easily add and edit
content, and that is particularly suited to collaborative content creation.
WSA: Warehouse Support Application, a component in the data warehouse architecture that supports missing data.
Designing the Star Schema Database
Creating a star schema database is one of the most important, and sometimes the final, step in creating a
data warehouse. Given how important this process is to our data warehouse, it is important to understand how we
move from a standard, on-line transaction processing (OLTP) system to a final star schema (which here we will call
an OLAP system).
This paper attempts to address some of the issues that have no doubt kept you awake at night. As you stared at the
ceiling, wondering how to build a data warehouse, questions began swirling in your mind:
• What is a Data Warehouse? What is a Data Mart?
• What is a Star Schema Database?
• Why do I want/need a Star Schema Database?
• The Star Schema looks very denormalized. Won't I get in trouble for that?
• What do all these terms mean?
• Should I repaint the ceiling?
These are certainly burning questions. This paper will attempt to answer these questions, and show you how to
build a star schema database to support decision support within your organization.

Usually, you are bored with terminology at the end of a chapter, or it is buried in an appendix at the back of the
book. Here, however, I have the thrill of presenting some terms up front. The intent is not to bore you earlier than
usual, but to present a baseline off of which we can operate. The problem in data warehousing is that the terms are
often used loosely by different parties. The Data Warehousing Institute (http://www.dw-institute.com) has
attempted to standardize some terms and concepts. I will present my best understanding of the terms I will use
throughout this lecture. Please note, however, that I do not speak for the Data Warehousing Institute.
OLTP
OLTP stands for Online Transaction Processing. This is a standard, normalized database structure. OLTP is designed
for transactions, which means that inserts, updates, and deletes must be fast. Imagine a call center that takes
orders. Call takers are continually taking calls and entering orders that may contain numerous items. Each order and
each item must be inserted into a database. Since the performance of the database is critical, we want to maximize
the speed of inserts (and updates and deletes). To maximize performance, we typically try to hold as few records in
the database as possible.
OLAP and Star Schema
OLAP stands for Online Analytical Processing. OLAP is a term that means many things to many people.
Here, we will use the terms OLAP and Star Schema pretty much interchangeably. We will assume that a star schema
database is an OLAP system. This is not the same thing that Microsoft calls OLAP; they extend OLAP to mean the
cube structures built using their product, OLAP Services. Here, we will assume that any system of read-only,
historical, aggregated data is an OLAP system.
In addition, we will assume an OLAP/star schema can be the same thing as a data warehouse. It can be, although
often data warehouses have cube structures built on top of them to speed queries.
Data Warehouse and Data Mart
Before you begin grumbling that I have taken two very different things and lumped them together, let me explain
that data warehouses and data marts are conceptually different in scope. However, they are built using the exact
same methods and procedures, so I will define them together here, and then discuss the differences.
A data warehouse (or mart) is a way of storing data for later retrieval. This retrieval is almost always used to support
decision-making in the organization. That is why many data warehouses are considered to be DSS (Decision-Support
Systems). You will hear some people argue that not all data warehouses are DSS, and that's fine. Some data
warehouses are merely archive copies of data. Still, the full benefit of taking the time to create a star schema, and
then possibly cube structures, is to speed the retrieval of data. In other words, it supports queries. These queries
are often across time. And why would anyone look at data across time? Perhaps they are looking for trends. And if
they are looking for trends, you can bet they are making decisions, such as how much raw material to order. Guess
what: that's decision support!
Enough of the soap box. Both a data warehouse and a data mart are storage mechanisms for read-only, historical,
aggregated data. By read-only, we mean that the person looking at the data won't be changing it. If a user wants to
look at the sales yesterday for a certain product, they should not have the ability to change that number. Of course,
if we know that number is wrong, we need to correct it, but more on that later.
The "historical" part may just be a few minutes old, but usually it is at least a day old. A data warehouse usually
holds data that goes back a certain period in time, such as five years. In contrast, standard OLTP systems usually
only hold data as long as it is "current" or active. An order table, for example, may move orders to an archive table
once they have been completed, shipped, and received by the customer.
When we say that data warehouses and data marts hold aggregated data, we need to stress that there are many
levels of aggregation in a typical data warehouse. In this section, on the star schema, we will just assume the
"base" level of aggregation: all the data in our data warehouse is aggregated to a certain point in time.
Let's look at an example: we sell two products, dog food and cat food. Each day, we record sales of each product. At
the end of a couple of days, we might have data that looks like this:

                            Quantity Sold
Date      Order Number   Dog Food   Cat Food
4/24/99   1              5          2
          2              3          0
          3              2          6
          4              2          2
          5              3          3
4/25/99   1              3          8
          2              2          1
          3              4          0
Table 1
Now, as you can see, there are several transactions. This is the data we would find in a standard OLTP system.
However, our data warehouse would usually not record this level of detail. Instead, we summarize, or aggregate,
the data to daily totals. Our records in the data warehouse might look something like this:

              Quantity Sold
Date      Dog Food   Cat Food
4/24/99   15         13
4/25/99   9          9
Table 2
You can see that we have reduced the number of records by aggregating the individual transaction records into daily
records that show the number of each product purchased each day.
We can certainly get from the OLTP system to what we see in the OLAP system just by running a query. However,
there are many reasons not to do this, as we will see later.
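That query is a straightforward GROUP BY over the transaction records: summing each product by date turns Table 1 into Table 2. A minimal sketch using SQLite (the table and column names are my own; only the data comes from the tables above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# OLTP-style transaction records: the data of Table 1.
conn.execute("""CREATE TABLE sales (
                    sale_date TEXT, order_no INTEGER,
                    dog_food INTEGER, cat_food INTEGER)""")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    ("4/24/99", 1, 5, 2), ("4/24/99", 2, 3, 0), ("4/24/99", 3, 2, 6),
    ("4/24/99", 4, 2, 2), ("4/24/99", 5, 3, 3),
    ("4/25/99", 1, 3, 8), ("4/25/99", 2, 2, 1), ("4/25/99", 3, 4, 0),
])

# Aggregating to the daily grain reproduces Table 2.
daily = list(conn.execute("""
    SELECT sale_date, SUM(dog_food), SUM(cat_food)
    FROM sales
    GROUP BY sale_date
    ORDER BY sale_date"""))
print(daily)  # [('4/24/99', 15, 13), ('4/25/99', 9, 9)]
```

The point of the star schema is precisely to avoid re-running this aggregation against the live OLTP tables every time someone wants the daily totals.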
Aggregations
There is no magic to the term "aggregations." It simply means a summarized, additive value. The level of
aggregation in our star schema is open for debate. We will talk about this later. Just realize that almost every star
schema is aggregated to some base level, called the grain.
OLTP Systems
OLTP, or Online Transaction Processing, systems are standard, normalized databases. OLTP systems are optimized for
inserts, updates, and deletes; in other words, for transactions. Transactions in this context can be thought of as the
entry, update, or deletion of a record or set of records.
OLTP systems achieve greater speed of transactions through a couple of means: they minimize repeated data, and
they limit the number of indexes. First, let's examine the minimization of repeated data.
If we take the concept of an order, we usually think of an order header and then a series of detail records. The
header contains information such as an order number, a bill-to address, a ship-to address, a PO number, and other
fields. An order detail record is usually a product number, a product description, the quantity ordered, the unit
price, the total price, and other fields. Here is what an order might look like:
Figure 1
Now, the data behind this looks very different. If we had a flat structure, we would see the detail records looking
like this:

Order Number: 12345; Order Date: 4/24/99; Customer ID: 451; Customer Name: ACME Products;
Customer Address: 123 Main Street; Customer City: Louisville; Customer State: KY; Customer Zip: 40202;
Contact Name: Jane Doe; Contact Number: 502-555-1212; Product ID: A13J2; Product Name: Widget;
Product Description: ½" Brass Widget; Category: Brass Goods; SubCategory: Widgets; Product Price: $1.00;
Quantity Ordered: 200; Etc...
Table 3
Notice, however, that for each detail we are repeating a lot of information: the entire customer address, the
contact information, the product information, etc. We need all of this information for each detail record, but we
don't want to have to enter the customer and product information for each record. Therefore, we use relational
technology to tie each detail to the header record, without having to repeat the header information in each detail
record. The new detail records might look like this:

Order Number   Product Number   Quantity Ordered
12483          14R12J           200
Table 4
A simplified logical view of the tables might look something like this:
Figure 2
Notice that we do not have the extended cost for each record in the OrderDetail table. This is because we store as
little data as possible, to speed inserts, updates, and deletes. Therefore, any number that can be calculated is
calculated and not stored.
We also minimize the number of indexes in an OLTP system. Indexes are important, of course, but they slow down
inserts, updates, and deletes. Therefore, we use just enough indexes to get by. Over-indexing can significantly
decrease performance.
Normalization
Database normalization is basically the process of removing repeated information. As we saw above, we do not want
to repeat the order header information in each order detail record. There are a number of rules in database
normalization, but we will not go through the entire process.
First and foremost, we want to remove repeated records in a table. For example, we don't want an order table that
looks like this:
Figure 3
In this example, we will have to have some limit on the number of order detail records in the Order table. If we add 20 repeated
sets of fields for detail records, we won't be able to handle an order for 21 products. In addition, if an order just
has one product ordered, we still have all those fields wasting space.
So, the first thing we want to do is break those repeated fields into a separate table, and end up with this:

Figure 4
Now, our order can have any number of detail records.
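The split shown in Figure 4 can be sketched as two tables linked by the order number, so that detail rows repeat nothing from the header. The DDL below is a minimal illustration (the OrderHeader/OrderDetail names follow the text; the specific columns are my own assumptions), built with SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Header: entered once per order.
    CREATE TABLE OrderHeader (
        OrderNumber INTEGER PRIMARY KEY,
        OrderDate   TEXT,
        CustomerID  INTEGER
    );
    -- Detail: one row per product, linked back by OrderNumber.
    CREATE TABLE OrderDetail (
        OrderNumber     INTEGER REFERENCES OrderHeader(OrderNumber),
        ProductID       TEXT,
        QuantityOrdered INTEGER
    );
""")
conn.execute("INSERT INTO OrderHeader VALUES (12345, '4/24/99', 451)")
# Any number of detail rows can now hang off a single header row.
conn.executemany("INSERT INTO OrderDetail VALUES (?,?,?)", [
    (12345, "A13J2", 200),
    (12345, "B77K1", 50),   # hypothetical second line item
])
n = conn.execute(
    "SELECT COUNT(*) FROM OrderDetail WHERE OrderNumber = 12345").fetchone()[0]
print(n)  # 2
```

Because the header is stored once, adding a twenty-first product is just one more detail row; no fixed limit and no wasted columns.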
OLTP Advantages
As stated before, OLTP allows us to minimize data entry. For each detail record, we only have to enter the primary
key value from the OrderHeader table, and the primary key of the Product table, and then add the order quantity.
This greatly reduces the amount of data entry we have to perform to add a product to an order.
Not only does this approach reduce the data entry required, it greatly reduces the size of an OrderDetail record.
Compare the size of the records in Table 3 to that in Table 4. You can see that the OrderDetail records take up
much less space when we have a normalized table structure. This means that the table is smaller, which helps speed
inserts, updates, and deletes.
In addition to keeping the table smaller, most of the fields that link to other tables are numeric. Queries generally
perform much better against numeric fields than they do against text fields. Therefore, replacing a series of text
fields with a numeric field can help speed queries. Numeric fields also index faster and more efficiently.
With normalization, we may also have fewer indexes per table. This means that inserts, updates, and deletes run
faster, because each insert, update, and delete may affect one or more indexes. Therefore, with each transaction,
these indexes must be updated along with the table. This overhead can significantly decrease our performance.
OLTP Disadvantages
There are some disadvantages to an OLTP structure, especially when we go to retrieve the data for analysis. For
one, we now must utilize joins and query multiple tables to get all the data we want. Joins tend to be slower than
reading from a single table, so we want to minimize the number of tables in any single query. With a normalized
structure, we have no choice but to query from multiple tables to get the detail we want on the report.
One of the advantages of OLTP is also a disadvantage: fewer indexes per table. Fewer indexes per table are great
for speeding up inserts, updates, and deletes. In general terms, the fewer indexes we have, the faster inserts,
updates, and deletes will be. However, again in general terms, the fewer indexes we have, the slower select queries
will run. For the purposes of data retrieval, we want a number of indexes available to help speed that retrieval.
Since one of our design goals to speed transactions is to minimize the number of indexes, we are limiting ourselves
when it comes to doing data retrieval. That is why we look at creating two separate database structures: an OLTP
system for transactions, and an OLAP system for data retrieval.
Last but not least, the data in an OLTP system is not user friendly. Most IT professionals would rather not have to
create custom reports all day long. Instead, we like to give our customers some query tools and have them create
reports without involving us. Most customers, however, don't know how to make sense of the relational nature of
the database. Joins are something mysterious, and complex table structures (such as associative tables on a bill-of-
material system) are hard for the average customer to use. The structures seem obvious to us, and we sometimes
wonder why our customers can't get the hang of it. Remember, however, that our customers know how to do a FIFO-
to-LIFO revaluation and other such tasks that we don't want to deal with; therefore, understanding relational
concepts just isn't something our customers should have to worry about.
If our customers want to spend the majority of their time performing analysis by looking at the data, we need to
support their desire for fast, easy queries. On the other hand, we need to meet the speed requirements of our
transaction-processing activities. These two requirements are, at least partially, in conflict. Many companies have
solved this by keeping a second copy of the data in a structure reserved for analysis. This copy is more heavily
indexed, and it allows customers to perform large queries against the data without impacting the inserts, updates,
and deletes on the main data. This copy of the data is often not just more heavily indexed, but also denormalized
to make it easier for customers to understand.
Reasons to Denormalize
Whenever I ask someone why you would ever want to denormalize, the first (and often only) answer is: speed.
We've already discussed some disadvantages of the OLTP structure; it is built for data inserts, updates, and deletes,
but not for data retrieval. Therefore, we can often squeeze some speed out of it by denormalizing some of the
tables and having queries go against fewer tables. These queries are faster because they perform fewer joins to
retrieve the same recordset.
Joins are slow, as we have already mentioned. Joins are also confusing to many end users. By denormalizing, we can
present the user with a view of the data that is far easier for them to understand. Which view of the data is easier
for a typical end user to understand?
Figure 5
Figure 6
The second view is much easier for the end user to understand. We had to use joins to create this view, but if we
put all of this in one table, the user would be able to perform this query without using joins. We could create a
view that looks like this, but we would still be using joins in the background and therefore not achieving the best
performance on the query.
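A minimal sketch of that idea, assuming invented table names: a view gives the end user one flat "table" to query, though the join still runs underneath, which is why it eases usability more than performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Category (CategoryID INTEGER PRIMARY KEY, CategoryName TEXT);
CREATE TABLE Product  (ProductID INTEGER PRIMARY KEY, ProductName TEXT,
                       CategoryID INTEGER REFERENCES Category);
INSERT INTO Category VALUES (1, 'Dairy Products');
INSERT INTO Product VALUES (10, 'Cheddar', 1), (11, 'Gouda', 1);

-- The user queries one flat "table"; the join is hidden, not eliminated.
CREATE VIEW ProductFlat AS
  SELECT p.ProductID, p.ProductName, c.CategoryName
  FROM Product p JOIN Category c ON c.CategoryID = p.CategoryID;
""")

rows = conn.execute(
    "SELECT ProductName, CategoryName FROM ProductFlat ORDER BY ProductID"
).fetchall()
print(rows)  # [('Cheddar', 'Dairy Products'), ('Gouda', 'Dairy Products')]
```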
How We View Information
All of this leads us to the real question: how do we view the data we have stored in our database? This is not a
question of how we view it with queries, but of how we logically view it. For example, are these intelligent
questions to ask?
• How many bottles of Aniseed Syrup did we sell last week?
• Are overall sales of Condiments up or down this year compared to previous years?
• On a quarterly and then monthly basis, are Dairy Product sales cyclical?
• In what regions are sales down this year compared to the same period last year? What products in those
regions account for the greatest percentage of the decrease?
All of these questions would be considered reasonable, perhaps even common. They all have a few things in
common. First, there is a time element to each one. Second, they are all looking for aggregated data; they are
asking for sums or counts, not individual transactions. Finally, they are looking at data in terms of "by" conditions.
When I talk about "by" conditions, I am referring to looking at data by certain conditions. For example, if we take
the question "On a quarterly and then monthly basis, are Dairy Product sales cyclical?" we can break it down into
this: "We want to see total sales by category (just Dairy Products in this case), by quarter or by month."
Here we are looking at an aggregated value, the sum of sales, by specific criteria. We could add further "by"
conditions by saying we wanted to see those sales by brand and then by individual product.
Figuring out the aggregated values we want to see, like the sum of sales dollars or the count of users buying a
product, and then figuring out these "by" conditions is what drives the design of our star schema.
Making the Database Match Our Expectations
If we want to view our data as aggregated numbers broken down along a series of "by" criteria, why don't we just
store data in this format?
That's exactly what we do with the star schema. It is important to realize that OLTP is not meant to be the basis of
a decision support system. The "T" in OLTP stands for transactions, and a transaction is all about taking orders and
depleting inventory, not about performing complex analysis to spot trends. Therefore, rather than tie up our
OLTP system by performing huge, expensive queries, we build a database structure that maps to the way we see the
world.
We see the world much like a cube. We won't talk about cube structures for data storage just yet. Instead, we will
talk about building a database structure to support our queries, and we will speed it up further by creating cube
structures later.
Facts and Dimensions
When we talk about the way we want to look at data, we usually want to see some sort of aggregated data.
These data are called measures: numeric values that are measurable and additive. For example, sales dollars are
a perfect measure. Every order that comes in generates a certain sales volume measured in some currency. If we
sell twenty products in one day, each for five dollars, we generate 100 dollars in total sales. Therefore, sales
dollars is one measure we may want to track. We may also want to know how many customers we had that day.
Did we have five customers buying an average of four products each, or did we have just one customer buying
twenty products? Sales dollars and customer counts are two measures we will want to track.
Just tracking measures isn't enough, however. We need to look at our measures using those "by" conditions.
These "by" conditions are called dimensions. When we say we want to know our sales dollars, we almost always
mean by day, or by quarter, or by year. There is almost always a time dimension on anything we ask for. We may
also want to know sales by category or by product. These "by" conditions map into dimensions: there is almost
always a time dimension, and product and geographic dimensions are very common as well.
Therefore, in designing a star schema, our first order of business is usually to determine what we want to see (our
measures) and how we want to see it (our dimensions).
Mapping Dimensions into Tables
Dimension tables answer the "by" portion of our question: how do we want to slice the data? For example, we
almost always want to view data by time. We often don't care what the grand total for all data happens to be. If
our data happen to start on June 14, 1999, do we really care how much our sales have been since that date, or do
we really care how one year compares to other years? Comparing one year to a previous year is a form of trend
analysis and one of the most common things we do with data in a star schema.
We may also have a location dimension. This allows us to compare the sales in one region to those in another. We
may see that sales are weaker in one region than in any other. This may indicate the presence of a new
competitor in that area, or a lack of advertising, or some other factor that bears investigation.
When we start building dimension tables, there are a few rules to keep in mind. First, all dimension tables should
have a single-field primary key. This key is often just an identity column, consisting of an automatically
incrementing number. The value of the primary key is meaningless; our information is stored in the other fields.
These other fields contain the full descriptions of what we are after. For example, if we have a Product dimension
(which is common) we have fields in it that contain the description, the category name, the sub-category name,
etc. These fields do not contain codes that link us to other tables. Because the fields are the full descriptions, the
dimension tables are often fat; they contain many large fields.
Dimension tables are often short, however. We may have many products, but even so, the dimension table cannot
compare in size to a normal fact table. For example, even if we have 30,000 products in our product table, we may
track sales for these products each day for several years. Assuming we actually sell only 3,000 products in any given
day, if we track these sales each day for ten years, we end up with this equation: 3,000 products sold × 365
days/year × 10 years equals almost 11,000,000 records! Therefore, in relative terms, a dimension table with 30,000
records will be short compared to the fact table.
Given that a dimension table is fat, it may be tempting to denormalize it. Resist the urge to do so; we will see why
in a little while when we talk about the snowflake schema.
Dimensional Hierarchies
We have been building hierarchical structures in OLTP systems for years. However, hierarchical structures in an
OLAP system are different, because the entire hierarchy for a dimension is stored in the dimension table itself.
The product dimension, for example, contains individual products. Products are normally grouped into categories,
and these categories may well contain sub-categories. For instance, a product with a product number of X12JC may
actually be a refrigerator. Therefore, it falls into the category of major appliance, and the sub-category of
refrigerator. We may have more levels of sub-categories, where we would further classify this product. The key here
is that all of this information is stored in the dimension table.
Our dimension table might look something like this:
Figure 7
Notice that both Category and Subcategory are stored in the table and not linked in through joined tables that store
the hierarchy information. This hierarchy allows us to perform "drill-down" functions on the data. We can perform a
query that computes sums by category. We can then drill down into that category by calculating sums for the
subcategories within it, and then calculate the sums for the individual products in a particular subcategory.
The actual sums we are calculating are based on numbers stored in the fact table. We will examine the fact table in
more detail later.
Consolidated Dimensional Hierarchies (Star Schemas)
The above example (Figure 7) shows a hierarchy in a dimension table. This is how the dimension tables are built in a
star schema; the hierarchies are contained in the individual dimension tables. No additional tables are needed to
hold hierarchical information.
Storing the hierarchy in a dimension table allows for the easiest browsing of our dimensional data. In the above
example, we could easily choose a category and then list all of that category's subcategories. We would drill down
into the data by choosing an individual subcategory from within the same table. There is no need to join to an
external table for any of the hierarchical information.
In this overly simplified example, we have two dimension tables joined to the fact table. We will examine the fact
table later. For now, we will assume the fact table has only one number: SalesDollars.
Figure 8
In order to see the total sales for a particular month for a particular category, our SQL query has to join the fact table to both the time and product dimension tables.
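As a concrete illustration, here is a minimal, runnable sketch of that month-by-category query using SQLite from Python; all table, column, and product names are invented for the example:

```python
import sqlite3

# Invented star schema: one fact table, plus time and product dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TimeDimension (TimeID INTEGER PRIMARY KEY, Day INTEGER,
                            Month INTEGER, Year INTEGER);
CREATE TABLE ProductDimension (ProductID INTEGER PRIMARY KEY,
                               ProductName TEXT, Category TEXT, Subcategory TEXT);
CREATE TABLE SalesFact (TimeID INTEGER REFERENCES TimeDimension,
                        ProductID INTEGER REFERENCES ProductDimension,
                        SalesDollars REAL);
INSERT INTO TimeDimension VALUES (1, 14, 6, 1999), (2, 15, 6, 1999), (3, 1, 7, 1999);
INSERT INTO ProductDimension VALUES
  (10, 'FrostKing 900', 'Major Appliance', 'Refrigerator'),
  (20, 'Aniseed Syrup', 'Condiments', 'Syrup');
INSERT INTO SalesFact VALUES (1, 10, 500.0), (2, 10, 250.0), (3, 10, 99.0), (1, 20, 40.0);
""")

# Total sales for June 1999 in the Major Appliance category.
total = conn.execute("""
    SELECT SUM(f.SalesDollars)
    FROM SalesFact f
    JOIN TimeDimension    t ON t.TimeID    = f.TimeID
    JOIN ProductDimension p ON p.ProductID = f.ProductID
    WHERE t.Month = 6 AND t.Year = 1999
      AND p.Category = 'Major Appliance'
""").fetchone()[0]
print(total)  # 750.0
```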
Snowflake Schemas
Sometimes, the dimension tables have their hierarchies broken out into separate tables. This is a more normalized
structure, but it leads to more difficult queries and slower response times.
Figure 9 represents the beginning of the snowflake process. The category hierarchy is being broken out of the
ProductDimension table. You can see that this structure increases the number of joins and can slow queries. Since
the purpose of our OLAP system is to speed queries, snowflaking is usually not something we want to do. Some
people try to normalize the dimension tables to save space. However, in the overall scheme of the data warehouse,
the dimension tables usually hold only about 1% of the records. Therefore, any space savings from normalizing, or
snowflaking, are negligible.
Figure 9
Building the Fact Table
The fact table holds our measures, or facts. The measures are numeric and additive across some or all of the
dimensions. For example, sales are numeric: we can look at total sales for a product, or a category, and we can
look at total sales for any time period. The sales figures are valid no matter how we slice the data.
While the dimension tables are short and fat, the fact tables are generally long and skinny. They are long because
they can hold the number of records represented by the product of the counts in all the dimension tables.
For example, take the following simplified star schema:
Figure 10
In this schema, we have product, time and store dimensions. If we assume we have ten years of daily data, 200
stores, and we sell 500 products, we have a potential of 365,000,000 records (3,650 days × 200 stores × 500
products). As you can see, this makes the fact table long.
The fact table is skinny because of the fields it holds. The primary key is made up of foreign keys that have
migrated from the dimension tables. These fields are just some sort of numeric value. In addition, our measures are
also numeric. Therefore, the size of each record is generally much smaller than that of the records in our dimension
tables. However, we have many, many more records in our fact table.
Fact Granularity
One of the most important decisions in building a star schema is the granularity of the fact table. The granularity, or
frequency, of the data is usually determined by the time dimension. For example, you may want to store only
weekly or monthly totals. The lower the granularity, the more records you will have in the fact table. The
granularity also determines how far you can drill down without returning to the base, transaction-level data.
Many OLAP systems have a daily grain to them. The lower the grain, the more records we have in the fact
table. However, we must also make sure that the grain is low enough to support our decision support needs.
One of the major benefits of the star schema is that the low-level transactions are summarized to the fact table
grain. This greatly speeds the queries we perform as part of our decision support. This aggregation is the heart of
our OLAP system.
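The summarization described above can be sketched in a few lines; the transaction rows and product codes below are invented:

```python
from collections import defaultdict

# Transaction-level rows: (date, product code, sale amount).
transactions = [
    ("1999-06-14", "P10", 5.0),
    ("1999-06-14", "P10", 7.5),
    ("1999-06-14", "P20", 3.0),
    ("1999-06-15", "P10", 4.0),
]

# Roll the transactions up to the fact table's daily grain:
# one summarized row per (day, product).
daily_grain = defaultdict(float)
for day, product, amount in transactions:
    daily_grain[(day, product)] += amount

print(sorted(daily_grain.items()))
# [(('1999-06-14', 'P10'), 12.5), (('1999-06-14', 'P20'), 3.0),
#  (('1999-06-15', 'P10'), 4.0)]
```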
Fact Table Size
We have already seen how 500 products sold in 200 stores and tracked for 10 years could produce 365,000,000
records in a fact table with a daily grain. This, however, is the maximum size for the table. Most of the time, we do
not have this many records in the table. One of the things we do not want to do is store zero values. So, if a product
did not sell at a particular store on a particular day, we would not store a zero value; we store only the records
that have a value. Therefore, our fact table is often sparsely populated.
Even though the fact table is sparsely populated, it still holds the vast majority of the records in our database and is
responsible for almost all of the disk space used. The lower our granularity, the larger the fact table. You can see
from the previous example that moving from a daily to a weekly grain would reduce our potential number of records
to only slightly more than 52,000,000.
The data types for the fields in the fact table do help keep it as small as possible. In most fact tables, all of the
fields are numeric, which can require less storage space than the long descriptions we find in the dimension tables.
Finally, be aware that each added dimension can greatly increase the size of our fact table. If we added one
dimension to the previous example that included 20 possible values, our potential number of records would reach
7.3 billion.
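The sizing arithmetic above is easy to verify:

```python
# Checking the fact-table sizing arithmetic used in the text:
# ten years of daily data, 200 stores, 500 products.
days, stores, products = 3650, 200, 500

daily_grain = days * stores * products
print(daily_grain)        # 365000000 -> the 365,000,000 potential records

weekly_grain = (days // 7) * stores * products
print(weekly_grain)       # 52100000 -> slightly more than 52,000,000

with_extra_dim = daily_grain * 20     # one added dimension with 20 values
print(with_extra_dim)     # 7300000000 -> the 7.3 billion potential records
```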
Changing Attributes
One of the greatest challenges in a star schema is the problem of changing attributes. As an example, we will use
the simplified star schema in Figure 10. In the StoreDimension table, we have each store being in a particular
region, territory, and zone. Some companies realign their sales regions, territories, and zones occasionally to reflect
changing business conditions. However, if we simply go in and update the table, and then try to look at historical
sales for a region, the numbers will not be accurate. By simply updating the region for a store, our total sales for
that region will not be historically accurate.
In some cases, we do not care. In fact, we want to see what the sales would have been had this store been in that
other region in prior years. More often, however, we do not want to change the historical data. In this case, we may
need to create a new record for the store. This new record contains the new region, but leaves the old store
record, and therefore the old regional sales data, intact. This approach, however, prevents us from comparing this
store's current sales to its historical sales unless we keep track of its previous StoreID. This can require an extra
field called PreviousStoreID or something similar.
There are no right or wrong answers; each case will require a different solution to handle changing attributes.
Aggregations
Finally, we need to discuss how to handle aggregations. The data in the fact table is already aggregated to the fact
table's grain. However, we often want to aggregate to a higher level. For example, we may want to sum sales to a
monthly or quarterly number. In addition, we may be looking for totals just for a product or a category.
These numbers must be calculated on the fly using a standard SQL statement. This calculation takes time, and
therefore some people will want to decrease the time required to retrieve higher-level aggregations.
Some people store higher-level aggregations in the database by pre-calculating them and storing them in the
database. This requires that the lowest-level records have special values put in them. For example, a
TimeDimension record that actually holds weekly totals might have a 9 in the DayOfWeek field to indicate that this
particular record holds the total for the week.
This approach has been used in the past, but better alternatives exist. These alternatives usually consist of building
a cube structure to hold pre-calculated values. We will examine Microsoft's OLAP Services, a tool designed to build
cube structures to speed our access to warehouse data.
Slowly changing dimensions describe the date effectivity of the data: they are dimensions whose attribute values
vary over time.
This term is commonly used in the data warehousing world. However, the problem exists in OLTP, relational data
modeling as well.
Example:
The sales representative assigned to a customer may change over time. Linda was the sales rep for ABC, Inc. before
March last year. Kathy later became the representative for this account.
You may want to track the data "as is," "as was," or both. If you show the year's total sales, you can either report
the sales as all generated by Kathy, or actually break the number down between Linda and Kathy.
Slowly Changing Dimensions
Slowly changing dimensions (1)
• The dimensional attribute record is overwritten with the new value
• No changes are needed elsewhere in the dimension record
• No keys are affected anywhere in the database
• Very easy to implement, but the historical data is now inconsistent
Slowly changing dimensions (2)
• Introduce a new record for the same dimensional entity in order to reflect its changed state
• A new instance of the dimensional key is created which references the new record
• Creating these new keys is best dealt with by using version digits at the end of the key
• All these keys need to be created, maintained and managed by someone and tracked in the metadata
• The database maintains its consistency, and the versions can be said to partition history
Slowly changing dimensions (3)
• Use a slightly different design of dimension table which has fields for:
  o the original status of the dimensional attribute
  o the current status of the dimensional attribute
  o an effective date of change
• This allows the analyst to compare the as-is and as-was states against each other
• Only two states can be traced, the current and the original
• Some inconsistencies are created in the data, as time is not properly partitioned
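As a rough sketch of the first two approaches, applied to the Linda-to-Kathy change from the example above (the data structures and version-digit key scheme are invented for illustration):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CustomerDim:
    key: str          # surrogate key; type 2 appends a version digit
    customer: str
    sales_rep: str
    current: bool = True

dim: List[CustomerDim] = [CustomerDim("C1.0", "ABC, Inc.", "Linda")]

def scd_type1(rows: List[CustomerDim], customer: str, new_rep: str) -> None:
    """Type 1: overwrite in place -- easy, but history is lost."""
    for r in rows:
        if r.customer == customer:
            r.sales_rep = new_rep

def scd_type2(rows: List[CustomerDim], customer: str, new_rep: str) -> None:
    """Type 2: expire the current row and add a new versioned row."""
    old = next(r for r in rows if r.customer == customer and r.current)
    old.current = False
    base, version = old.key.rsplit(".", 1)
    rows.append(CustomerDim(f"{base}.{int(version) + 1}", customer, new_rep))

scd_type2(dim, "ABC, Inc.", "Kathy")
for r in dim:
    print(r.key, r.sales_rep, r.current)
# C1.0 Linda False
# C1.1 Kathy True
```

With type 2, sales facts keyed to C1.0 stay attributed to Linda, while new facts reference C1.1; type 1 would simply rewrite Linda's row and credit everything to Kathy.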
Introduction to the Series
Oracle9i provides a new set of ETL options that can be effectively integrated into the ETL architecture. In order to
develop the correct approach to implementing new technology in the ETL architecture, it is important to
understand the components, architectural options and best practices when designing and developing a data
warehouse. With this background, each option will be explored, along with how it is best suited to the ETL
architecture.
Through this series of articles, an overview of the ETL architecture will be discussed as well as a detailed look at
each option. Each ETL option's syntax, behavior and performance (where appropriate) will be examined. Based on
the results of examples, combined with a solid understanding of the ETL architecture, strategies and approaches to
leverage the new options in the ETL architecture will be outlined. The final article in the series will provide a look
at all of the ETL options working together, stemming from examples throughout the series.
Individual articles in the series include:
• Part 1 - Overview of the Extract, Transform and Load (ETL) Architecture
• Part 2 - External Tables
• Part 3 - Multiple Table Insert
• Part 4 - Upsert / MERGE INTO (Add and Update Combined Statement)
• Part 5 - Table Functions
• Part 6 - Bringing it All Together: A Look at the Combined Use of the ETL Options
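The Upsert / MERGE INTO option listed above combines the add and the update into a single statement. Oracle's MERGE syntax will not run outside Oracle, so the sketch below uses SQLite's analogous INSERT ... ON CONFLICT upsert; the table and rows are invented:

```python
import sqlite3

# Warehouse table with one existing row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wh_product (product_id INTEGER PRIMARY KEY, "
             "name TEXT, price REAL)")
conn.execute("INSERT INTO wh_product VALUES (1, 'Widget', 9.99)")

staged = [(1, "Widget", 8.49),    # existing key -> update
          (2, "Gadget", 19.99)]   # new key      -> insert

# One statement handles both the add and the update, as MERGE INTO does.
conn.executemany("""
    INSERT INTO wh_product (product_id, name, price) VALUES (?, ?, ?)
    ON CONFLICT(product_id) DO UPDATE SET name = excluded.name,
                                          price = excluded.price
""", staged)

result = conn.execute("SELECT * FROM wh_product ORDER BY product_id").fetchall()
print(result)  # [(1, 'Widget', 8.49), (2, 'Gadget', 19.99)]
```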
The information in the series is targeted at data warehouse developers, data warehouse architects and information
technology managers.
Overview of the Extract, Transform and Load (ETL) Architecture
The warehouse architect can assemble ETL architectures in many different forms using an endless variety of
technologies. Because of this, the warehouse can take advantage of the software, skill sets, hardware and
standards already in place within an organization. The potential weakness of the warehouse arises when a loosely
managed project, which does not adhere to a standard approach, results in an increase in scope, budget and
maintenance. This weakness may result in vulnerability to unforeseen data integrity limitations in the source
systems as well. The key to eliminating this weakness is to develop a technical design that employs solid warehouse
expertise and data warehouse best practices. Professional experience and the data warehouse fundamentals are key
elements to eliminating failure on a warehouse project.
Potential problems are exposed in this article not to deliver fear or confirm the popular cliché that "warehouse
projects fail." It is simply important to understand that new technologies, such as database options, are not a
replacement for the principles of data warehousing and ETL processing. New technologies should, and many times
will, advance or complement the warehouse. They should make its architecture more efficient, scalable and stable.
That is where the new Oracle9i features play nicely. These features will be explored while looking at their
appropriate uses in the ETL architecture. In order to determine where the new Oracle9i features may fit into the
ETL architecture, it is important to look at ETL approaches and components.
Approaches to ETL Architecture
Within the ETL architecture, two distinct, but not mutually exclusive, approaches are traditionally used in the ETL
design. The custom approach is the oldest and was once the only approach for data warehousing. In effect, this
approach takes the technologies and hardware that an organization has on hand and develops a data warehouse
using those technologies. The second approach includes the use of packaged ETL software. This approach focuses on
performing the majority of connectivity, extraction, transformation and data loading within the ETL tool itself.
However, this software comes at an additional cost. The potential benefits of an ETL package include a reduction
in development time as well as a reduction in maintenance overhead.
ETL Components
The ETL architecture is traditionally designed in two components:
• The source to stage component is intended to focus the efforts of reading the source data (sourcing) and
replicating the data to the staging area. The staging area is typically comprised of several schemas that house
individual source systems or sets of related source systems. Within each schema, all of the source system tables are
usually "mirrored." The structure of the stage table is identical to that of the source table, with the addition of data
elements to support referential integrity and future ETL processing.
• The stage to warehouse component focuses the effort of standardizing and centralizing the data from the
source systems into a single view of the organization's information. This centralized target can be a data
warehouse, data mart, operational data store, customer list store, reporting database or any other reporting/data
environment. (The examples in this article assume the final target is a data warehouse.) This portion of the
architecture should not be concerned with translation, data formats, or data type conversion. It can now focus on
the complex task of cleansing, standardizing and transforming the source data according to the business rules.
It is important to note that an ETL tool strictly "extracts, transforms and loads." Separate tools or external service
organizations, which may require additional cost, accomplish the work of name and address cleansing and
standardization. These data cleansing tools can work in conjunction with the packaged ETL software in a variety of
ways. Many organizations are able to perform the same work offsite on a contractual basis. The task of
data cleansing can occur in the staging environment prior to or during stage to warehouse processing. In any case, it
is a good practice to house a copy of the data cleansing output in the staging area for auditing purposes.
The following sections include diagrams and overviews of:
• Custom source to stage,
• Packaged ETL tool source to stage,
• Custom stage to warehouse, and
• Packaged ETL tool stage to warehouse architectures.
This article assumes that the staging and warehouse databases are Oracle9i instances hosted on separate systems.
Custom ETL - Source to Stage
Figure 1: Source to Stage Portion of Custom ETL Architecture
Figure 1 outlines the source to stage portion of a custom ETL architecture and exposes several methods of data
"connections." These methods include:
• Replicating data through the use of data replication software (mirroring software) that detects or "sniffs"
changes from a database or file system logs.
• Generating flat files by pulling or pushing data from a client program connected to the source system.
• "FTPing" internal data from the source system in a native or altered format.
• Connecting natively to source system data and/or files (i.e., a DB2 connection to an AS/400 file system).
• Reading data from a native database connection.
• Reading data over a database link from an Oracle instance to the target Oracle staging instance, and FTPing
data from an external site to the staging host system.
Other data connection options may include a tape delivered on site and copied, reading data from a queue (i.e.,
MQSeries), reading data from an enterprise application integration (EAI) message, reading data via a database
bridge or other third-party broker for data access (i.e., DB2 Connect, DBAnywhere), etc.
After a connection is established to the source systems, many methods are used to read and load the data into the
staging area as described in the diagram. These methods include the use of:
• Replication software (combines read and write replication into a single software package),
• A shell or other scripting tool such as KSH, CSH, PERL or SQL reading data from a flat file,
• A shell or other scripting tool reading data from a database connection (i.e., over PERL DBI),
• A packaged or custom executable such as C, C++, AWK, SED or Java reading data from a flat file,
• A packaged or custom executable reading data from a database connection, and SQL*Loader reading from a
flat file.
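A hypothetical miniature of the scripted flat-file pattern above (all names and the batch id are invented): read a delimited extract and mirror it into a stage table, adding one audit element to support future ETL processing.

```python
import csv
import io
import sqlite3

# Stand-in for a flat-file extract pulled or FTPed from a source system.
flat_file = io.StringIO("cust_id,name,region\n1,ABC Inc,East\n2,XYZ Ltd,West\n")

# Stage table mirrors the source layout, plus a load_batch audit column.
stage = sqlite3.connect(":memory:")
stage.execute("CREATE TABLE stg_customer (cust_id TEXT, name TEXT, "
              "region TEXT, load_batch INTEGER)")

rows = [(r["cust_id"], r["name"], r["region"], 42)  # 42 = this run's batch id
        for r in csv.DictReader(flat_file)]
stage.executemany("INSERT INTO stg_customer VALUES (?, ?, ?, ?)", rows)

count = stage.execute("SELECT COUNT(*) FROM stg_customer").fetchone()[0]
print(count)  # 2
```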
Packaged ETL Tool - Source to Stage
Figure 2: Source to Stage Portion Using a Packaged ETL Tool
Figure 2 outlines the source to stage portion of an ETL architecture using a packaged ETL tool and exposes several
methods of data "connections" which are similar to those used in the custom source to stage processing model. The
connection method with a packaged ETL tool typically allows for all of the connections one would expect from a
custom development effort. In most cases, each type of source connection requires a license. For example, if a
connection is required to Sybase, DB2 and Oracle databases, three separate licenses are needed. If licensing is an
issue, the ETL architecture typically embraces a hybrid solution using other custom methods to replicate source
data in addition to the packaged ETL tool.
Connection methods include:
• Replicating data using data replication software (mirroring software) that detects or sniffs changes from
the database or file system logs.
• FTPing internal data from the source system in native or altered format.
• Connecting natively to the system data and/or files (i.e., a DB2 connection to an AS/400 file system).
• Reading data from a native database connection.
• Reading data over a database link from an Oracle instance into the Oracle staging database.
• FTPing data from an external site to the staging host system.
Other options may include a tape delivered on site and copied, reading data from a queue (i.e., MQSeries), reading
data from an enterprise application integration (EAI) message/queue, reading data via a database bridge or other
third-party broker for data access (i.e., DB2 Connect, DBAnywhere), etc.
After a connection is established to the source systems, the ETL tool is used to read, perform simple
transformations such as rudimentary cleansing (i.e., trimming spaces), perform data type conversion, convert data
formats and load the data into the staging area. Advanced transformations are recommended to take place in the
stage to warehouse component and not in the source to stage processing (explained in the next section). Because
the packaged ETL tool is designed to handle all of the transformations and conversions, all the work is done within
the ETL server itself. Within the ETL tool's server repository, separate mappings exist to perform the individual ETL
tasks.
Custom ETL - Stage to Warehouse
Figure 3: Stage to Warehouse Portion of Custom ETL Architecture
Figure 3 outlines the stage to warehouse portion of a custom ETL architecture. The ETL stage to warehouse
component is where the data standardization and centralization occur. The work of gathering, formatting and
converting data types has been completed by the source to stage component. Now the ETL work can focus on the
task of creating a single view of the organization's data in the warehouse.
This diagram exposes several typical methods of standardizing and/or centralizing data to the data warehouse.
These methods include the use of a:
• PL/SQL procedure reading and writing directly to the data warehouse from the staging database (this could
be done just as easily if the procedure were located in the warehouse database).
• PL/SQL procedure reading from the staging database and writing to flat files (i.e., via a SQL script).
• SQL*Plus client writing data to a flat file from stage, with SQL*Loader importing the files into the warehouse for
loading or additional processing by a PL/SQL procedure.
• Oracle table export-import process from staging to the warehouse for loading or additional processing by
a PL/SQL procedure.
• Shell or other scripting tool such as KSH, CSH, PERL or SQL reading data natively or from a flat file and
writing data into the warehouse.
• Packaged or custom executable such as C, C++, AWK, SED or Java reading data natively or from a flat file
and writing data into the warehouse.
Packaged ETL Tool – Stage to Warehouse
Figure 4: Stage to Warehouse Portion of a Packaged ETL Tool Architecture
Figure 4 outlines the stage-to-warehouse portion of a packaged ETL tool architecture. It diagrams the packaged ETL application performing the standardization and centralization of data to the warehouse all within one application. This is the strength of a packaged ETL tool. In addition, this is the component of the ETL architecture where the ETL tool is best suited to apply the organization's business rules. The packaged ETL tool will source the data through a native connection to the staging database. It will perform transformations on each record after pulling the data from the stage database through a pipe. From there it will load each record into the warehouse through a native connection to the database. Again, not all packaged ETL architectures look like this, due to many factors. Typically a deviation in the architecture is due to requirements that the ETL software cannot fulfill, or is not licensed to fulfill. In these instances one of the custom stage-to-warehouse methods is most commonly used.
Business Logic and the ETL Architecture
In any warehouse development effort, the business logic is the core of the warehouse. The business logic is applied to the proprietary data from the organization's internal and external data sources. The application process combines the heterogeneous data into a single view of the organization's information. The logic to create a central view of the information is often a complex task. In order to properly manage this task, it is important to consolidate the business rules into the stage-to-warehouse ETL component, regardless of the ETL architecture. If this best practice is ignored, much of the business logic may be spread throughout the source-to-stage and stage-to-warehouse components. This will ultimately hamper the organization's ability to maintain the warehouse solution long term and may lead to an error-prone system.
Within the packaged ETL tool architecture, the centralization of the business logic becomes a less complex task. Because the mapping and transformation logic is managed by the ETL software package, the centralization of rules is offered as a feature of the software. However, using packaged ETL tools does not guarantee a proper ETL implementation. Good warehouse development practices are still necessary when developing any type of ETL architecture.
In the custom ETL architecture, it becomes critical to place the application of business logic in the stage-to-warehouse component due to the large number of individual modules. The custom solution will typically store business logic in a custom repository or in the code of the ETL transformations. This is the greatest disadvantage of the custom warehouse. Developing a custom repository requires additional development effort, solid warehousing design experience and strict attention to detail. Due to this difficulty, the choice may be made to develop the rules into the ETL transformation code to speed the time of delivery. Whether or not the decision is made to store the rules in a custom repository, it is important to have a well-thought-out design. The business rules are the heart of the warehouse. Any problems with the rules will create errors in the system.
It is important to understand some of the best practices and risks in developing ETL architectures to better appreciate how new technology will fit into the architecture. With this background it is apparent that new technology or database options will not be a silver bullet for ETL processing. New technology will not increase a solution's effectiveness nor replace the need for management of the business rules. However, the new Oracle 9i ETL options provide a great complement to custom and packaged ETL tool architectures.
Extract, transform, and load (ETL) is a process in data warehousing that involves extracting data from outside sources, transforming it to fit business needs, and ultimately loading it into the data warehouse.
ETL is important, as it is the way data actually gets loaded into the warehouse. This article assumes that data is always loaded into a data warehouse, whereas the term ETL can in fact refer to a process that loads any database.
"ontents
]hide^
1 @xtract
2 )ransform
; .oad
R "hallenges
' )ools
Extract
The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization/format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as IMS or other data structures such as VSAM or ISAM. Extraction converts the data into a format for transformation processing.
Transform
The transform stage applies a series of rules or functions to the extracted data to derive the data to be loaded. Some data sources will require very little manipulation of data. In other cases, one or more of the following transformation types may be required:
Selecting only certain columns to load (or selecting null columns not to load)
Translating coded values (e.g., if the source system stores M for male and F for female, but the warehouse stores 1 for male and 2 for female)
Encoding free-form values (e.g., mapping "Male" and "M" and "Mr" onto 1)
Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
Joining together data from multiple sources (e.g., lookup, merge, etc.)
Summarizing multiple rows of data (e.g., total sales for each region)
Generating surrogate key values
Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)
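A few of these transformation types can be sketched in code; the column names and code tables below are invented purely for illustration:

```python
# Illustrative sketches of several transform types; all column names invented.
CODED = {"M": 1, "F": 2}                       # source code -> warehouse code
FREE_FORM = {"Male": 1, "M": 1, "Mr": 1}       # free-form values encoded onto 1

def transform_row(row):
    """Translate a coded value, derive a calculated value, split a column."""
    out = {"gender": CODED[row["gender"]],               # translating coded values
           "sale_amount": row["qty"] * row["unit_price"]}  # derived value
    city, state = [p.strip() for p in row["city_state"].split(",")]
    out["city"], out["state"] = city, state              # splitting one column
    return out

def total_sales_by_region(rows):
    """Summarizing multiple rows of data: total sales for each region."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + r["sale_amount"]
    return totals
```

In a real ETL tool, each of these would be one transformation object in a mapping rather than hand-written code.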
Load
The load phase loads the data into the data warehouse. Depending on the requirements of the organization, this process varies widely. Some data warehouses merely overwrite old information with new data. More complex systems can maintain a history and audit trail of all changes to the data.
#"allenge!
@). processes can !e *uite complex, and significant operational pro!lems can occur with improperly designed @).
systems.
)he range of data alues or data *uality in an operational system may !e outside the expectations of designers at
the time alidation and transformation rules are specified. 8ata profiling of a source during data analysis is
recommended to identify the data conditions that will need to !e managed !y transform rules specifications.
)he scala!ility of an @). system across the lifetime of its usage needs to !e esta!lished during analysis. )his
includes understanding the olumes of data that will hae to !e processed within :erice .eel 1greements, ,:.1s0.
)he time aaila!le to extract from source systems may change, which may mean the same amount of data may hae
to !e processed in less time. :ome @). systems hae to scale to process tera!ytes of data to update data
warehouses with tens of tera!ytes of data. (ncreasing olumes of data may re*uire designs that can scale from daily
!atch to intra&day micro&!atch to integration with message *ueues for continuous transformation and update.
A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve overall performance of ETL processes when dealing with large volumes of data.
There are three main types of parallelism as implemented in ETL applications:
Data: Splitting a single sequential file into smaller data files to provide parallel access.
Pipeline: Allowing the simultaneous running of several components on the same data stream. An example would be looking up a value on record 1 at the same time as adding together two fields on record 2.
Component: The simultaneous running of multiple processes on different data streams in the same job. Sorting one input file while performing a deduplication on another file would be an example of component parallelism.
All three types of parallelism are usually combined in a single job. An additional difficulty is making sure the data being uploaded is relatively consistent. Since multiple source databases all have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled to the contents in a source system or with the general ledger, establishing synchronization and reconciliation points is necessary.
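Pipeline parallelism, for instance, can be sketched as two stages connected by queues, each running in its own thread on the same record stream; the lookup table and field names here are invented for illustration:

```python
# A minimal sketch of pipeline parallelism: a lookup stage and a compute stage
# run concurrently on the same record stream. Lookup table and fields invented.
import queue
import threading

RATES = {"USD": 1, "EUR": 2}                 # invented lookup table

def lookup_stage(inbox, outbox):
    while (rec := inbox.get()) is not None:  # None marks end-of-stream
        rec["rate"] = RATES[rec["currency"]]
        outbox.put(rec)
    outbox.put(None)                         # propagate end-of-stream marker

def compute_stage(inbox, results):
    while (rec := inbox.get()) is not None:
        results.append(rec["amount"] * rec["rate"])

def run_pipeline(records):
    q1, q2, results = queue.Queue(), queue.Queue(), []
    stages = [threading.Thread(target=lookup_stage, args=(q1, q2)),
              threading.Thread(target=compute_stage, args=(q2, results))]
    for t in stages:
        t.start()
    for rec in records:
        q1.put(dict(rec))                    # copy so callers' dicts are untouched
    q1.put(None)
    for t in stages:
        t.join()
    return results
```

While record 2 is still being looked up, record 1 can already be in the compute stage, which is exactly the overlap the text describes.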
Tools
While an ETL process can be created using almost any programming language, creating one from scratch is quite complex. Increasingly, companies are buying ETL tools to help in the creation of ETL processes.
A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation and loading of data. Many ETL vendors now have data profiling, data quality and metadata capabilities.
WHAT IS AN ETL PROCESS?
The ETL process, an acronym for the extraction, transformation and loading operations, is fundamental to a data warehouse. Whenever DML (data manipulation language) operations such as INSERT, UPDATE or DELETE are issued on the source database, data extraction occurs. After data extraction and transformation have taken place, data are loaded into the data warehouse.
Incremental loading is beneficial in the sense that only the data that have changed after the last data extraction and transformation are loaded.
-<1".@ "71#=@ 81)1 "1/)A<@ F<1M@+-<>
)he change data framework is designed for capturing only insert, delete and update operations on
the oracle data!ase, that is to say they are '8M. sensitie'. 3elow is architecture of change data
capture framework. 3elow is architecture illustrating the flow of information in an oracle data
capture framework.
Figure 1."hange data capture framework architecture.
(mplementing oracle change data capture is ery simple. Following the following steps, guides you
through the whole implementation process.
Source table identification: Firstly, the source tables must be identified.
Choose a publisher: The publisher is responsible for creating and managing the change tables. Note that the publisher must be granted SELECT_CATALOG_ROLE, which enables the publisher to select data from any SYS-owned dictionary tables or views, and EXECUTE_CATALOG_ROLE, which enables the publisher to receive execute privileges on any SYS-owned packages. He also needs select privilege on the source tables.
Change tables creation: When data extraction occurs, change data are stored in the change tables. Also stored in the change tables are system metadata, imperative for the smooth functioning of the change tables. In order to create the change tables, the procedure DBMS_LOGMNR_CDC_PUBLISH.CREATE_CHANGE_TABLE is executed. It is important to note that each source table must have its own change table.
Choose the subscriber: The publisher must grant select privilege on the change tables and source tables to the subscriber. You might have more than one subscriber as the case may be.
Subscription handle creation: Creating the subscription handle is very pertinent because it is used to specifically identify a particular subscription. Irrespective of the number of tables subscribed to, one and only one subscription handle must be created. To create a subscription handle, first define a variable, and then execute the DBMS_LOGMNR_CDC_SUBSCRIBE.GET_SUBSCRIPTION_HANDLE procedure.
Subscribe to the change tables: The data in the change tables are usually enormous, thus only data of interest should be subscribed to. To subscribe, the DBMS_LOGMNR_CDC_SUBSCRIBE.SUBSCRIBE procedure is executed.
Subscription activation: Subscription is activated only once, and after activation the subscription cannot be modified. Activate your subscription using the DBMS_LOGMNR_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION procedure.
Subscription window creation: Since subscription to the change tables does not stop data extraction from the source table, a window is set up using the DBMS_LOGMNR_CDC_SUBSCRIBE.EXTEND_WINDOW procedure. However, it is to be noted that changes effected on the source system after this procedure is executed will not be available until the window is flushed and re-extended.
Subscription views creation: In order to view and query the change data, a subscriber view is prepared for the individual source tables that the subscriber subscribes to, using the DBMS_LOGMNR_CDC_SUBSCRIBE.PREPARE_SUBSCRIBER_VIEW procedure. However, you need to define the variable in which the subscriber view name will be returned. Also, you will be prompted for the subscription handle, source schema name and source table name.
Query the change tables: Resident in the subscriber view are not only the change data needed but also metadata fundamental to the efficient use of the change data, such as OPERATION$, CSCN$, USERNAME$ etc. Since you already know the view name, you can describe the view and then query it using a conventional select statement.
Drop the subscriber view: The dropping of the subscriber view is carried out only when you are sure you are done with the data in the view and they are no longer needed (i.e. they've been viewed and extracted). It is imperative to note that each subscriber view must be dropped individually using the DBMS_LOGMNR_CDC_SUBSCRIBE.DROP_SUBSCRIBER_VIEW procedure.
Purge the subscription window: To facilitate the extraction of change data again, the subscription window must be purged using the DBMS_LOGMNR_CDC_SUBSCRIBE.PURGE_WINDOW procedure.
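The idea behind the framework is that every DML operation on a source table also lands in a change table stamped with a change number, and a subscriber reads a bounded window of those changes. A toy in-memory model (an invented sketch, not Oracle's actual implementation) illustrates this:

```python
# Toy in-memory illustration of change data capture concepts; all names are
# invented and this is not how Oracle implements the framework internally.
class ChangeCapturedTable:
    def __init__(self):
        self.rows = {}           # source table: key -> row
        self.change_table = []   # captured DML: {"cscn", "operation", "key"}
        self.scn = 0             # stand-in for a system change number

    def _record(self, op, key):
        self.scn += 1
        self.change_table.append({"cscn": self.scn, "operation": op, "key": key})

    def insert(self, key, row):
        self.rows[key] = row
        self._record("I", key)

    def update(self, key, row):
        self.rows[key] = row
        self._record("U", key)

    def delete(self, key):
        del self.rows[key]
        self._record("D", key)

    def subscriber_window(self, low_scn, high_scn):
        """Changes visible between two extend-window boundaries."""
        return [c for c in self.change_table if low_scn < c["cscn"] <= high_scn]
```

Extending the window corresponds to raising `high_scn`; purging it corresponds to discarding everything at or below `low_scn`.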
ETL Process
Here is the typical ETL process:
Specify metadata for sources, such as tables in an operational system
Specify metadata for targets: the tables and other data stores in a data warehouse
Specify how data is extracted, transformed, and loaded from sources to targets
Schedule and execute the processes
Monitor the execution
An ETL tool thus involves the following components:
A design tool for building the mappings and the process flows
A monitor tool for executing and monitoring the process
The process flows are sequences of steps for the extraction, transformation, and loading of data. The data is extracted from sources (inputs to an operation) and loaded into a set of targets (outputs of an operation) that make up a data warehouse or a data mart.
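The steps above can be sketched as metadata plus a driver: the mapping describes source-to-target columns and transformations, and a small engine applies it and reports what a monitor would show. Every name here is invented for illustration:

```python
# A minimal sketch of a metadata-driven ETL flow; all names are invented.
def run_mapping(source_rows, mapping, target):
    """Extract rows, apply per-column transform functions, load into target."""
    loaded = 0
    for row in source_rows:
        target.append({tgt: fn(row[src]) for tgt, (src, fn) in mapping.items()})
        loaded += 1
    return {"rows_loaded": loaded}       # what a monitor tool would report

# Mapping metadata: target column -> (source column, transformation)
mapping = {
    "customer_name": ("cust_nm", str.title),
    "amount_usd":    ("amt", float),
}
```

Real ETL tools store this mapping metadata in a repository and generate or execute the equivalent of `run_mapping` for you.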
A good ETL design tool should provide change management features that satisfy the following criteria:
A metadata repository that stores the metadata about sources, targets, and the transformations that connect them.
Enforce metadata source control for team-based development: Multiple designers should be able to work with the same metadata repository at the same time without overwriting each other's changes. Each developer should be able to check out metadata from the repository into their project or workspace, modify them, and check the changes back into the repository.
After a metadata object has been checked out by one person, it is locked so that it cannot be updated by another person until the object has been checked back in.
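The check-out/check-in locking rule can be sketched in a few lines; the class and method names are invented, and real ETL repositories enforce this inside their metadata database:

```python
# Toy sketch of check-out/check-in locking on metadata objects; names invented.
class MetadataRepository:
    def __init__(self):
        self.locks = {}           # object name -> developer holding the lock

    def check_out(self, obj, developer):
        holder = self.locks.get(obj)
        if holder is not None and holder != developer:
            raise PermissionError(f"{obj} is locked by {holder}")
        self.locks[obj] = developer          # object now locked for this developer

    def check_in(self, obj, developer):
        if self.locks.get(obj) != developer:
            raise PermissionError(f"{developer} does not hold the lock on {obj}")
        del self.locks[obj]                  # lock released for others
```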
Overview of ETL in Data Warehouses
You need to load your data warehouse regularly so that it can serve its purpose of facilitating business analysis. To do this, data from one or more operational systems needs to be extracted and copied into the data warehouse.
The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. The acronym ETL is perhaps too simplistic, because it omits the transportation phase and implies that each of the other phases of the process is distinct. We refer to the entire process, including data loading, as ETL. You should understand that ETL refers to a broad process, and not three well-defined steps.
The methodology and tasks of ETL have been well known for many years, and are not necessarily unique to data warehouse environments: a wide variety of proprietary applications and database systems are the IT backbone of any enterprise. Data has to be shared between applications or systems, trying to integrate them, giving at least two applications the same picture of the world. This data sharing was mostly addressed by mechanisms similar to what we now call ETL.
Data warehouse environments face the same challenge with the additional burden that they not only have to exchange but to integrate, rearrange and consolidate data over many systems, thereby providing a new unified information base for business intelligence. Additionally, the data volume in data warehouse environments tends to be very large.
What happens during the ETL process? During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Very often, it is not possible to identify the specific subset of interest, therefore more data than necessary has to be extracted, so the identification of the relevant data will be done at a later point in time. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two (logically) identical extractions: the time span may vary between days/hours and minutes to near real-time. Web server log files for example can easily become hundreds of megabytes in a very short period of time.
After extracting data, it has to be physically transported to the target system or an intermediate system for further processing. Depending on the chosen way of transportation, some transformations can be done during this process, too. For example, a SQL statement which directly accesses a remote target through a gateway can concatenate two columns as part of the SELECT statement.
The emphasis in many of the examples in this section is scalability. Many long-time users of Oracle Database are experts in programming complex data transformation logic using PL/SQL. These chapters suggest alternatives for many such data manipulation operations, with a particular emphasis on implementations that take advantage of Oracle's new SQL functionality, especially for ETL and the parallel query infrastructure.
ETL Tools for Data Warehouses
Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project. Many data warehousing projects use ETL tools to manage this process. Oracle Warehouse Builder (OWB), for example, provides ETL capabilities and takes advantage of inherent database abilities. Other data warehouse builders create their own ETL tools and processes, either inside or outside the database.
Besides the support of extraction, transformation, and loading, there are some other tasks that are important for a successful ETL implementation as part of the daily operations of the data warehouse and its support for further enhancements. Besides the support for designing a data warehouse and the data flow, these tasks are typically addressed by ETL tools such as OWB.
Oracle is not an ETL tool and does not provide a complete solution for ETL. However, Oracle does provide a rich set of capabilities that can be used by both ETL tools and customized ETL solutions. Oracle offers techniques for transporting data between Oracle databases, for transforming large volumes of data, and for quickly loading new data into a data warehouse.
Daily Operations in Data Warehouses
The successive loads and transformations must be scheduled and processed in a specific order. Depending on the success or failure of the operation or parts of it, the result must be tracked and subsequent, alternative processes might be started. The control of the progress as well as the definition of a business workflow of the operations are typically addressed by ETL tools such as Oracle Warehouse Builder.
Evolution of the Data Warehouse
As the data warehouse is a living IT system, sources and targets might change. Those changes must be maintained and tracked through the lifespan of the system without overwriting or deleting the old ETL process flow information. To build and keep a level of trust in the information in the warehouse, in the ideal case the process flow of each individual record in the warehouse can be reconstructed at any point in time in the future.
Overview of Extraction in Data Warehouses
Extraction is the operation of extracting data from a source system for further use in a data warehouse environment. This is the first step of the ETL process. After the extraction, this data can be transformed and loaded into the data warehouse.
The source systems for a data warehouse are typically transaction processing applications. For example, one of the source systems for a sales analysis data warehouse might be an order entry system that records all of the current order activities.
Designing and creating the extraction process is often one of the most time-consuming tasks in the ETL process and, indeed, in the entire data warehousing process. The source systems might be very complex and poorly documented, and thus determining which data needs to be extracted can be difficult. Normally, the data has to be extracted not only once, but several times in a periodic manner to supply all changed data to the data warehouse and keep it up-to-date. Moreover, the source system typically cannot be modified, nor can its performance or availability be adjusted, to accommodate the needs of the data warehouse extraction process.
These are important considerations for extraction and ETL in general. This chapter, however, focuses on the technical considerations of having different kinds of sources and extraction methods. It assumes that the data warehouse team has already identified the data that will be extracted, and discusses common techniques used for extracting data from source databases.
Designing this process means making decisions about the following two main aspects:
Which extraction method do I choose?
This influences the source system, the transportation process, and the time needed for refreshing the warehouse.
How do I provide the extracted data for further processing?
This influences the transportation method, and the need for cleaning and transforming the data.
Introduction to Extraction Methods in Data Warehouses
The extraction method you should choose depends highly on the source system and on the business needs in the target data warehouse environment. Very often, there is no possibility of adding additional logic to the source systems to enhance an incremental extraction of data, due to the performance impact or the increased workload of these systems. Sometimes even the customer is not allowed to add anything to an out-of-the-box application system.
The estimated amount of the data to be extracted and the stage in the ETL process (initial load or maintenance of data) may also impact the decision of how to extract, from a logical and a physical perspective. Basically, you have to decide how to extract data logically and physically.
Logical Extraction Methods
There are two types of logical extraction:
Full Extraction
Incremental Extraction
Full Extraction
The data is extracted completely from the source system. Because this extraction reflects all the data currently available on the source system, there's no need to keep track of changes to the data source since the last successful extraction. The source data will be provided as-is and no additional logical information (for example, timestamps) is necessary on the source site. An example of a full extraction may be an export file of a distinct table or a remote SQL statement scanning the complete source table.
Incremental Extraction
At a specific point in time, only the data that has changed since a well-defined event back in history will be extracted. This event may be the last time of extraction or a more complex business event like the last booking day of a fiscal period. To identify this delta change there must be a possibility to identify all the changed information since this specific time event. This information can be either provided by the source data itself, such as an application column reflecting the last-changed timestamp, or by a change table where an appropriate additional mechanism keeps track of the changes besides the originating transactions. In most cases, using the latter method means adding extraction logic to the source system.
Many data warehouses do not use any change-capture techniques as part of the extraction process. Instead, entire tables from the source systems are extracted to the data warehouse or staging area, and these tables are compared with a previous extract from the source system to identify the changed data. This approach may not have significant impact on the source systems, but it clearly can place a considerable burden on the data warehouse processes, particularly if the data volumes are large.
Oracle's Change Data Capture mechanism can extract and maintain such delta information.
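The compare-with-a-previous-extract approach amounts to a set difference on primary keys plus a row-by-row comparison, as in this sketch (the key/row layout is invented):

```python
# Sketch of diffing a prior full extract against the current one to find
# inserted, updated and deleted keys; both inputs map primary key -> row.
def diff_extracts(previous, current):
    inserted = [k for k in current if k not in previous]
    deleted  = [k for k in previous if k not in current]
    updated  = [k for k in current
                if k in previous and current[k] != previous[k]]
    return {"insert": inserted, "update": updated, "delete": deleted}
```

This is cheap for the source system but, as the text notes, the full-table compare happens on the warehouse side, which is where the burden lands.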
Physical Extraction Methods
Depending on the chosen logical extraction method and the capabilities and restrictions on the source side, the extracted data can be physically extracted by two mechanisms. The data can either be extracted online from the source system or from an offline structure. Such an offline structure might already exist or it might be generated by an extraction routine.
There are the following methods of physical extraction:
Online Extraction
Offline Extraction
Online Extraction
The data is extracted directly from the source system itself. The extraction process can connect directly to the source system to access the source tables themselves or to an intermediate system that stores the data in a preconfigured manner (for example, snapshot logs or change tables). Note that the intermediate system is not necessarily physically different from the source system.
With online extractions, you need to consider whether the distributed transactions are using original source objects or prepared source objects.
Offline Extraction
The data is not extracted directly from the source system but is staged explicitly outside the original source system. The data already has an existing structure (for example, redo logs, archive logs or transportable tablespaces) or was created by an extraction routine.
You should consider the following structures:
Flat files: Data in a defined, generic format. Additional information about the source object is necessary for further processing.
Dump files: Oracle-specific format. Information about the containing objects may or may not be included, depending on the chosen utility.
Redo and archive logs: Information is in a special, additional dump file.
Transportable tablespaces: A powerful way to extract and move large volumes of data between Oracle databases. Oracle Corporation recommends that you use transportable tablespaces whenever possible, because they can provide considerable advantages in performance and manageability over other extraction techniques.
#"ange Data #ature
1n important consideration for extraction is incremental extraction, also called "hange 8ata "apture. (f a data
warehouse extracts data from an operational system on a nightly !asis, then the data warehouse re*uires only the
data that has changed since the last extraction ,that is, the data that has !een modified in the past 2R hours0.
"hange 8ata "apture is also the key&ena!ling technology for proiding near real&time, or on&time, data
warehousing.
+hen it is possi!le to efficiently identify and extract only the most recently changed data, the extraction process
,as well as all downstream operations in the @). process0 can !e much more efficient, !ecause it must extract a
much smaller olume of data. Anfortunately, for many source systems, identifying the recently modified data may
!e difficult or intrusie to the operation of the system. "hange 8ata "apture is typically the most challenging
technical issue in data extraction.
Because change data capture is often desirable as part of the extraction process and it might not be possible to use the Change Data Capture mechanism, this section describes several techniques for implementing a self-developed change capture on Oracle Database source systems:
Timestamps
Partitioning
Triggers
These techniques are based upon the characteristics of the source systems, or may require modifications to the source systems. Thus, each of these techniques must be carefully evaluated by the owners of the source system prior to implementation.
Each of these techniques can work in conjunction with the data extraction technique discussed previously. For example, timestamps can be used whether the data is being unloaded to a file or accessed through a distributed query.
Timestamps
The tables in some operational systems have timestamp columns. The timestamp specifies the time and date that a given row was last modified. If the tables in an operational system have columns containing timestamps, then the latest data can easily be identified using the timestamp columns. For example, the following query might be useful for extracting today's data from an orders table:
SELECT * FROM orders
WHERE TRUNC(CAST(order_date AS date),'dd') =
TO_DATE(SYSDATE,'dd-mon-yyyy');
If the timestamp information is not available in an operational source system, you will not always be able to modify the system to include timestamps. Such modification would require, first, modifying the operational system's tables to include a new timestamp column and then creating a trigger to update the timestamp column following every operation that modifies a given row.
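On the extraction side, the timestamp column is typically compared against a high-water mark saved from the previous run; a sketch of that pattern, with column and variable names invented:

```python
# Sketch of timestamp-based incremental extraction: keep the high-water mark
# from the last run and pull only rows modified after it. Names invented.
from datetime import datetime

def extract_incremental(rows, last_extract_ts):
    """Return rows changed since the previous extraction, plus the new mark."""
    changed = [r for r in rows if r["last_modified"] > last_extract_ts]
    new_mark = max((r["last_modified"] for r in changed),
                   default=last_extract_ts)
    return changed, new_mark
```

The returned mark is persisted and becomes `last_extract_ts` for the next run, so each row is extracted at most once per change.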
Partitioning
Some source systems might use range partitioning, such that the source tables are partitioned along a date key, which allows for easy identification of new data. For example, if you are extracting from an orders table, and the orders table is partitioned by week, then it is easy to identify the current week's data.
Data Ware"ou!ing E1traction 0a/!
$ou can extract data in two ways4
@xtraction Asing 8ata Files
@xtraction )hrough 8istri!uted -perations
Extraction Using Data Files
Most database systems provide mechanisms for exporting or unloading data from the internal database format into flat files. Extracts from mainframe systems often use COBOL programs, but many databases, as well as third-party software vendors, provide export or unload utilities.
Data extraction does not necessarily mean that entire database structures are unloaded in flat files. In many cases, it may be appropriate to unload entire database tables or objects. In other cases, it may be more appropriate to unload only a subset of a given table, such as the changes on the source system since the last extraction or the results of joining multiple tables together. Different extraction techniques vary in their capabilities to support these two scenarios.
When the source system is an Oracle database, several alternatives are available for extracting data into files:
Extracting into Flat Files Using SQL*Plus
Extracting into Flat Files Using OCI or Pro*C Programs
Exporting into Export Files Using the Export Utility
Extracting into Export Files Using External Tables
Extracting into Flat Files Using SQL*Plus
The most basic technique for extracting data is to execute a SQL query in SQL*Plus and direct the output of the
query to a file. For example, to extract a flat file, country_city.log, with the pipe sign as delimiter between column
values, containing a list of the cities in the US in the tables countries and customers, the following SQL script could
be run:
SET echo off SET pagesize 0 SPOOL country_city.log
SELECT distinct t1.country_name ||'|'|| t2.cust_city
FROM countries t1, customers t2 WHERE t1.country_id = t2.country_id
AND t1.country_name= 'United States of America';
SPOOL off
The exact format of the output file can be specified using SQL*Plus system variables.
This extraction technique offers the advantage of storing the result in a customized format. Note that, using the
external table data pump unload facility, you can also extract the result of an arbitrary SQL operation. The
previous example extracts the results of a join.
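The spooling idea, run a query and write each row to a delimited flat file, can be sketched in any language. A hypothetical Python equivalent using sqlite3 and the csv module (the table contents are made up; the pipe delimiter mirrors the SQL*Plus example above):

```python
import csv
import os
import sqlite3
import tempfile

# Toy source tables standing in for the countries/customers example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (country_id INT, country_name TEXT)")
conn.execute("CREATE TABLE customers (cust_id INT, country_id INT, cust_city TEXT)")
conn.execute("INSERT INTO countries VALUES (1, 'United States of America')")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(10, 1, 'Boston'), (11, 1, 'Denver')])

def spool(conn, sql, path, delimiter="|"):
    """Run a query and spool its result set to a delimited flat file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=delimiter)
        for row in conn.execute(sql):
            writer.writerow(row)

outfile = os.path.join(tempfile.mkdtemp(), "country_city.log")
spool(conn,
      "SELECT DISTINCT t1.country_name, t2.cust_city "
      "FROM countries t1 JOIN customers t2 ON t1.country_id = t2.country_id "
      "WHERE t1.country_name = 'United States of America' "
      "ORDER BY t2.cust_city",
      outfile)
lines = open(outfile).read().splitlines()
```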
This extraction technique can be parallelized by initiating multiple, concurrent SQL*Plus sessions, each session
running a separate query representing a different portion of the data to be extracted. For example, suppose that
you wish to extract data from an orders table, and that the orders table has been range partitioned by month, with
partitions orders_jan1999, orders_feb1999, and so on. To extract a single year of data from the orders table, you
could initiate 12 concurrent SQL*Plus sessions, each extracting a single partition. The SQL script for one such session
could be:
SPOOL order_jan.dat
SELECT * FROM orders PARTITION (orders_jan1999);
SPOOL OFF
These 12 SQL*Plus processes would concurrently spool data to 12 separate files. You can then concatenate them if
necessary (using operating system utilities) following the extraction. If you are planning to use SQL*Loader for
loading into the target, these 12 files can be used as is for a parallel load with 12 SQL*Loader sessions.
Even if the orders table is not partitioned, it is still possible to parallelize the extraction, based on either logical or
physical criteria. The logical method is based on logical ranges of column values, for example:
SELECT ... WHERE order_date
BETWEEN TO_DATE('01-JAN-99') AND TO_DATE('31-JAN-99');
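The logical-range method above can be sketched as concurrent sessions, each extracting a disjoint date range. A toy illustration with Python threads and one sqlite3 connection per "session" (the data and the three ranges are invented; a real source would be the orders table and twelve monthly ranges):

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# A small file-based source database that several sessions can open.
db = os.path.join(tempfile.mkdtemp(), "source.db")
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE orders (order_id INT, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "1999-01-10"), (2, "1999-01-20"),
                  (3, "1999-02-15"), (4, "1999-03-15")])
conn.commit()

def extract_range(start, end):
    """Each worker runs its own session over a disjoint logical range."""
    c = sqlite3.connect(db)   # one connection per 'session'
    rows = c.execute(
        "SELECT order_id FROM orders WHERE order_date BETWEEN ? AND ?",
        (start, end)).fetchall()
    c.close()
    return [r[0] for r in rows]

ranges = [("1999-01-01", "1999-01-31"),
          ("1999-02-01", "1999-02-28"),
          ("1999-03-01", "1999-03-31")]
with ThreadPoolExecutor(max_workers=3) as pool:
    parts = list(pool.map(lambda r: extract_range(*r), ranges))
```

Each element of parts corresponds to one session's spool file; together they cover the full extract without overlap.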
The physical method is based on a range of values. By viewing the data dictionary, it is possible to identify the
Oracle Database data blocks that make up the orders table. Using this information, you could then derive a set of
rowid-range queries for extracting data from the orders table:
SELECT * FROM orders WHERE rowid BETWEEN value1 and value2;
Parallelizing the extraction of complex SQL queries is sometimes possible, although the process of breaking a single
complex query into multiple components can be challenging. In particular, the coordination of independent
processes to guarantee a globally consistent view can be difficult. Unlike the SQL*Plus approach, using the new
external table data pump unload functionality provides transparent parallel capabilities.
Note that all parallel techniques can use considerably more CPU and I/O resources on the source system, and the
impact on the source system should be evaluated before parallelizing any extraction technique.
Extracting into Flat Files Using OCI or Pro*C Programs
OCI programs (or other programs using Oracle call interfaces, such as Pro*C programs) can also be used to extract
data. These techniques typically provide improved performance over the SQL*Plus approach, although they also
require additional programming. Like the SQL*Plus approach, an OCI program can extract the results of any SQL
query. Furthermore, the parallelization techniques described for the SQL*Plus approach can be readily applied to
OCI programs as well.
When using OCI or SQL*Plus for extraction, you need additional information besides the data itself. At minimum, you
need information about the extracted columns. It is also helpful to know the extraction format, which might be the
separator between distinct columns.
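The point about needing column metadata alongside the data can be illustrated at the driver level: most database APIs expose the column names of a result set, which can be written as a header line so the loader knows the file layout. A sketch with Python's sqlite3 cursor (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (cust_id INTEGER, cust_city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Boston')")

cur = conn.execute("SELECT cust_id, cust_city FROM customers")
# cursor.description carries the column names of the extracted result set;
# writing them as a header line documents the file layout for the loader.
columns = [d[0] for d in cur.description]
delimiter = "|"
header = delimiter.join(columns)
body = [delimiter.join(str(v) for v in row) for row in cur]
```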
Exporting into Export Files Using the Export Utility
The Export utility allows tables (including data) to be exported into Oracle Database export files. Unlike the
SQL*Plus and OCI approaches, which describe the extraction of the results of a SQL statement, Export provides a
mechanism for extracting database objects. Thus, Export differs from the previous approaches in several important
ways:
The export files contain metadata as well as data. An export file contains not only the raw data of a table, but also
information on how to re-create the table, potentially including any indexes, constraints, grants, and other
attributes associated with that table.
A single export file may contain a subset of a single object, many database objects, or even an entire schema.
Export cannot be directly used to export the results of a complex SQL query. Export can be used only to extract
subsets of distinct database objects.
The output of the Export utility must be processed using the Import utility.
Oracle provides the original Export and Import utilities for backward compatibility and the data pump
export/import infrastructure for high-performant, scalable and parallel extraction. See Oracle Database Utilities for
further details.
Extracting into Export Files Using External Tables
In addition to the Export utility, you can use external tables to extract the results from any SELECT operation. The
data is stored in the platform-independent, Oracle-internal data pump format and can be processed as a regular
external table on the target system. The following example extracts the result of a join operation in parallel into
the four specified files. The only allowed external table type for extracting data is the Oracle-internal format
ORACLE_DATAPUMP.
CREATE DIRECTORY def_dir AS '/net/dlsun49/private/hbaer/WORK/FEATURES/et';
DROP TABLE extract_cust;
CREATE TABLE extract_cust
ORGANIZATION EXTERNAL
(TYPE ORACLE_DATAPUMP DEFAULT DIRECTORY def_dir ACCESS PARAMETERS
(NOBADFILE NOLOGFILE)
LOCATION ('extract_cust1.exp', 'extract_cust2.exp', 'extract_cust3.exp',
'extract_cust4.exp'))
PARALLEL 4 REJECT LIMIT UNLIMITED AS
SELECT c.*, co.country_name, co.country_subregion, co.country_region
FROM customers c, countries co WHERE co.country_id = c.country_id;
The total number of extraction files specified limits the maximum degree of parallelism for the write operation.
Note that the parallelizing of the extraction does not automatically parallelize the SELECT portion of the statement.
Unlike using any kind of export/import, the metadata for the external table is not part of the created files when
using the external table data pump unload. To extract the appropriate metadata for the external table, use the
DBMS_METADATA package, as illustrated in the following statement:
SET LONG 2000
SELECT DBMS_METADATA.GET_DDL('TABLE','EXTRACT_CUST') FROM DUAL;
Extraction Through Distributed Operations
Using distributed-query technology, one Oracle database can directly query tables located in various different
source systems, such as another Oracle database or a legacy system connected with the Oracle gateway technology.
Specifically, a data warehouse or staging database can directly access tables and data located in a connected source
system. Gateways are another form of distributed-query technology. Gateways allow an Oracle database (such as a
data warehouse) to access database tables stored in remote, non-Oracle databases. This is the simplest method for
moving data between two Oracle databases because it combines the extraction and transformation into a single
step, and requires minimal programming. However, this is not always feasible.
Suppose that you wanted to extract a list of country names with customer cities from a source database and
store this data into the data warehouse. Using an Oracle Net connection and distributed-query technology, this can
be achieved using a single SQL statement:
CREATE TABLE country_city AS SELECT distinct t1.country_name, t2.cust_city
FROM countries@source_db t1, customers@source_db t2
WHERE t1.country_id = t2.country_id
AND t1.country_name='United States of America';
This statement creates a local table in a data mart, country_city, and populates it with data from the countries and
customers tables on the source system.
This technique is ideal for moving small volumes of data. However, the data is transported from the source system
to the data warehouse through a single Oracle Net connection. Thus, the scalability of this technique is limited. For
larger data volumes, file-based data extraction and transportation techniques are often more scalable and thus
more appropriate.
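The one-step extract-and-transport pattern is not unique to Oracle Net database links. As a loose analogue, SQLite's ATTACH DATABASE lets one database run a CREATE TABLE ... AS SELECT directly against tables in another database file, which is the same idea in miniature (all file, table, and alias names below are invented):

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
source = os.path.join(tmp, "source.db")
mart = os.path.join(tmp, "mart.db")

# Populate the 'remote' source database.
s = sqlite3.connect(source)
s.execute("CREATE TABLE countries (country_id INT, country_name TEXT)")
s.execute("CREATE TABLE customers (cust_id INT, country_id INT, cust_city TEXT)")
s.execute("INSERT INTO countries VALUES (1, 'United States of America')")
s.execute("INSERT INTO customers VALUES (10, 1, 'Boston')")
s.commit()
s.close()

# From the data mart, attach the source and create-table-as-select in one step.
m = sqlite3.connect(mart)
m.execute("ATTACH DATABASE ? AS source_db", (source,))
m.execute("""
    CREATE TABLE country_city AS
    SELECT DISTINCT t1.country_name, t2.cust_city
    FROM source_db.countries t1
    JOIN source_db.customers t2 ON t1.country_id = t2.country_id
    WHERE t1.country_name = 'United States of America'""")
rows = m.execute("SELECT * FROM country_city").fetchall()
```

As in the Oracle case, extraction and movement happen in a single statement, but everything flows through one connection, so the approach suits small volumes.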
Transportation in Data Warehouses
The following topics provide information about transporting data into a data warehouse:
Overview of Transportation in Data Warehouses
Introduction to Transportation Mechanisms in Data Warehouses
Overview of Transportation in Data Warehouses
Transportation is the operation of moving data from one system to another system. In a data warehouse
environment, the most common requirements for transportation are in moving data from:
A source system to a staging database or a data warehouse database
A staging database to a data warehouse
A data warehouse to a data mart
Transportation is often one of the simpler portions of the ETL process, and can be integrated with other portions of
the process.
Introduction to Transportation Mechanisms in Data Warehouses
You have three basic choices for transporting data in warehouses:
Transportation Using Flat Files
Transportation Through Distributed Operations
Transportation Using Transportable Tablespaces
Transportation Using Flat Files
The most common method for transporting data is by the transfer of flat files, using mechanisms such as FTP or
other remote file system access protocols. Data is unloaded or exported from the source system into flat files using
the techniques discussed earlier, and is then transported to the target platform using FTP or similar mechanisms.
Because source systems and data warehouses often use different operating systems and database systems, using flat
files is often the simplest way to exchange data between heterogeneous systems with minimal transformations.
However, even when transporting data between homogeneous systems, flat files are often the most efficient and
most easy-to-manage mechanism for data transfer.
Transportation Through Distributed Operations
Distributed queries, either with or without gateways, can be an effective mechanism for extracting data. These
mechanisms also transport the data directly to the target systems, thus providing both extraction and
transformation in a single step. Depending on the tolerable impact on time and system resources, these mechanisms
can be well suited for both extraction and transformation.
As opposed to flat file transportation, the success or failure of the transportation is recognized immediately with
the result of the distributed query or transaction.
Transportation Using Transportable Tablespaces
Oracle transportable tablespaces are the fastest way for moving large volumes of data between two Oracle
databases. Previous to the introduction of transportable tablespaces, the most scalable data transportation
mechanisms relied on moving flat files containing raw data. These mechanisms required that data be unloaded or
exported into files from the source database; then, after transportation, these files were loaded or imported into
the target database. Transportable tablespaces entirely bypass the unload and reload steps.
Using transportable tablespaces, Oracle data files (containing table data, indexes, and almost every other Oracle
database object) can be directly transported from one database to another. Furthermore, like import and export,
transportable tablespaces provide a mechanism for transporting metadata in addition to transporting data.
Transportable tablespaces have some limitations: source and target systems must be running Oracle8i (or higher),
must use the same character set, and, prior to Oracle Database 10g, must run on the same operating system. For
details on how to transport tablespaces between operating systems, see the Oracle Database documentation.
The most common applications of transportable tablespaces in data warehouses are in moving data from a staging
database to a data warehouse, or in moving data from a data warehouse to a data mart.
Transportable Tablespaces Example
Suppose that you have a data warehouse containing sales data, and several data marts that are refreshed monthly.
Also suppose that you are going to move one month of sales data from the data warehouse to the data mart.
Step 1: Place the Data to be Transported into its own Tablespace
The current month's data must be placed into a separate tablespace in order to be transported. In this example, you
have a tablespace ts_temp_sales, which will hold a copy of the current month's data. Using the CREATE TABLE ... AS
SELECT statement, the current month's data can be efficiently copied to this tablespace:
CREATE TABLE temp_jan_sales NOLOGGING TABLESPACE ts_temp_sales
AS SELECT * FROM sales
WHERE time_id BETWEEN '31-DEC-1999' AND '01-FEB-2000';
Following this operation, the tablespace ts_temp_sales is set to read-only:
ALTER TABLESPACE ts_temp_sales READ ONLY;
A tablespace cannot be transported unless there are no active transactions modifying the tablespace. Setting the
tablespace to read-only enforces this.
The tablespace ts_temp_sales may be a tablespace that has been especially created to temporarily store data for
use by the transportable tablespace features. After the transport is complete, this tablespace can be set to
read/write, and, if desired, the table temp_jan_sales can be dropped, or the tablespace can be re-used for other
transportations or for other purposes.
In a given transportable tablespace operation, all of the objects in a given tablespace are transported. Although
only one table is being transported in this example, the tablespace ts_temp_sales could contain multiple tables. For
example, perhaps the data mart is refreshed not only with the new month's worth of sales transactions, but also
with a new copy of the customer table. Both of these tables could be transported in the same tablespace. Moreover,
this tablespace could also contain other database objects such as indexes, which would also be transported.
Additionally, in a given transportable-tablespace operation, multiple tablespaces can be transported at the same
time. This makes it easier to move very large volumes of data between databases. Note, however, that the
transportable tablespace feature can only transport a set of tablespaces which contain a complete set of database
objects without dependencies on other tablespaces. For example, an index cannot be transported without its table,
nor can a partition be transported without the rest of the table. You can use the DBMS_TTS package to check that a
tablespace is transportable.
In this step, we have copied the January sales data into a separate tablespace; however, in some cases, it may be
possible to leverage the transportable tablespace feature without even moving data to a separate tablespace. If the
sales table has been partitioned by month in the data warehouse and if each partition is in its own tablespace, then
it may be possible to directly transport the tablespace containing the January data. Suppose the January partition,
sales_jan2000, is located in the tablespace ts_sales_jan2000. Then the tablespace ts_sales_jan2000 could
potentially be transported, rather than creating a temporary copy of the January sales data in the ts_temp_sales.
However, the same conditions must be satisfied in order to transport the tablespace ts_sales_jan2000 as are
required for the specially created tablespace. First, this tablespace must be set to READ ONLY. Second, because a
single partition of a partitioned table cannot be transported without the remainder of the partitioned table also
being transported, it is necessary to exchange the January partition into a separate table (using the ALTER TABLE
statement) to transport the January data. The EXCHANGE operation is very quick, but the January data will no
longer be a part of the underlying sales table, and thus may be unavailable to users until this data is exchanged
back into the sales table after the export of the metadata. The January data can be exchanged back into the sales
table after you complete step 3.
Step 2: Export the Metadata
The Export utility is used to export the metadata describing the objects contained in the transported tablespace.
For our example scenario, the Export command could be:
EXP TRANSPORT_TABLESPACE=y TABLESPACES=ts_temp_sales FILE=jan_sales.dmp
This operation will generate an export file, jan_sales.dmp. The export file will be small, because it contains only
metadata. In this case, the export file will contain information describing the table temp_jan_sales, such as the
column names, column datatypes, and all other information that the target Oracle database will need in order to
access the objects in ts_temp_sales.
Step 3: Copy the Datafiles and Export File to the Target System
Copy the data files that make up ts_temp_sales, as well as the export file jan_sales.dmp, to the data mart platform,
using any transportation mechanism for flat files. Once the datafiles have been copied, the tablespace
ts_temp_sales can be set to READ WRITE mode if desired.
Step 4: Import the Metadata
Once the files have been copied to the data mart, the metadata should be imported into the data mart:
IMP TRANSPORT_TABLESPACE=y DATAFILES='/db/tempjan.f'
TABLESPACES=ts_temp_sales FILE=jan_sales.dmp
At this point, the tablespace ts_temp_sales and the table temp_sales_jan are accessible in the data mart. You can
incorporate this new data into the data mart's tables.
You can insert the data from the temp_sales_jan table into the data mart's sales table in one of two ways:
INSERT /*+ APPEND */ INTO sales SELECT * FROM temp_sales_jan;
Following this operation, you can delete the temp_sales_jan table (and even the entire ts_temp_sales tablespace).
Alternatively, if the data mart's sales table is partitioned by month, then the new transported tablespace and the
temp_sales_jan table can become a permanent part of the data mart. The temp_sales_jan table can become a
partition of the data mart's sales table:
ALTER TABLE sales ADD PARTITION sales_00jan VALUES
LESS THAN (TO_DATE('01-feb-2000','dd-mon-yyyy'));
ALTER TABLE sales EXCHANGE PARTITION sales_00jan
WITH TABLE temp_sales_jan INCLUDING INDEXES WITH VALIDATION;
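The core idea, moving the datafile itself instead of unloading and reloading rows, can be mimicked with any file-per-database engine. A rough sketch with SQLite (this only illustrates the bypassed unload/reload steps; real transportable tablespaces additionally carry metadata via Export/Import, and all names below are invented):

```python
import os
import shutil
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
warehouse = os.path.join(tmp, "ts_temp_sales.db")

# Build the 'tablespace' on the warehouse side.
conn = sqlite3.connect(warehouse)
conn.execute("CREATE TABLE temp_jan_sales (time_id TEXT, amount REAL)")
conn.execute("INSERT INTO temp_jan_sales VALUES ('2000-01-15', 99.0)")
conn.commit()
conn.close()

# 'Transport' the data by copying the datafile itself: no row-by-row
# unload on the source and no reload on the target.
mart_copy = os.path.join(tmp, "mart", "ts_temp_sales.db")
os.makedirs(os.path.dirname(mart_copy))
shutil.copyfile(warehouse, mart_copy)

# The data mart can query the transported data immediately.
mart = sqlite3.connect(mart_copy)
count = mart.execute("SELECT COUNT(*) FROM temp_jan_sales").fetchone()[0]
```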
Other Uses of Transportable Tablespaces
The previous example illustrates a typical scenario for transporting data in a data warehouse. However,
transportable tablespaces can be used for many other purposes. In a data warehousing environment, transportable
tablespaces should be viewed as a utility (much like Import/Export or SQL*Loader), whose purpose is to move large
volumes of data between Oracle databases. When used in conjunction with parallel data movement operations such
as the CREATE TABLE ... AS SELECT and INSERT ... AS SELECT statements, transportable tablespaces provide an
important mechanism for quickly transporting data for many purposes.
Overview of Loading and Transformation in Data Warehouses
Data transformations are often the most complex and, in terms of processing time, the most costly part of
the extraction, transformation, and loading (ETL) process. They can range from simple data conversions to
extremely complex data scrubbing techniques. Many, if not all, data transformations can occur within an Oracle
database, although transformations are often implemented outside of the database (for example, on flat files) as
well.
This chapter introduces techniques for implementing scalable and efficient data transformations within the
Oracle Database. The examples in this chapter are relatively simple. Real-world data transformations are often
considerably more complex. However, the transformation techniques introduced in this chapter meet the majority
of real-world data transformation requirements, often with more scalability and less programming than alternative
approaches.
This chapter does not seek to illustrate all of the typical transformations that would be encountered in a data
warehouse, but to demonstrate the types of fundamental technology that can be applied to implement these
transformations and to provide guidance in how to choose the best techniques.
Transformation Flow
From an architectural perspective, you can transform your data in two ways:
1) Multistage Data Transformation
2) Pipelined Data Transformation
Multistage Data Transformation
The data transformation logic for most data warehouses consists of multiple steps. For example, in
transforming new records to be inserted into a sales table, there may be separate logical transformation steps to
validate each dimension key.
Figure 14-1 offers a graphical way of looking at the transformation logic.
Figure 14-1 Multistage Data Transformation
When using Oracle Database as a transformation engine, a common strategy is to implement each transformation as
a separate SQL operation and to create a separate, temporary staging table (such as the tables new_sales_step1 and
new_sales_step2 in Figure 14-1) to store the incremental results for each step. This load-then-transform strategy
also provides a natural checkpointing scheme to the entire transformation process, which enables the process to
be more easily monitored and restarted. However, a disadvantage to multistaging is that the space and time
requirements increase.
It may also be possible to combine many simple logical transformations into a single SQL statement or single PL/SQL
procedure. Doing so may provide better performance than performing each step independently, but it may also
introduce difficulties in modifying, adding, or dropping individual transformations, as well as recovering from failed
transformations.
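The load-then-transform strategy above can be sketched as a chain of staging tables, each one a checkpoint for the next step. A small illustration in Python with sqlite3 (the tables, the key-validation step, and the 10% adjustment are all invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE new_sales (prod_code TEXT, amount REAL)")
conn.execute("CREATE TABLE products (prod_code TEXT, prod_id INT)")
conn.executemany("INSERT INTO new_sales VALUES (?, ?)",
                 [("A", 10.0), ("B", 5.0), ("X", 1.0)])   # 'X' has no product
conn.executemany("INSERT INTO products VALUES (?, ?)", [("A", 1), ("B", 2)])

# Step 1: validate the product key into a staging table; invalid rows drop out.
conn.execute("""
    CREATE TABLE new_sales_step1 AS
    SELECT p.prod_id, s.amount
    FROM new_sales s JOIN products p ON s.prod_code = p.prod_code""")

# Step 2: a further transformation reads only the checkpointed step-1 result,
# so a failure here can be retried without redoing step 1.
conn.execute("""
    CREATE TABLE new_sales_step2 AS
    SELECT prod_id, ROUND(amount * 1.1, 2) AS amount   -- e.g. add a 10% uplift
    FROM new_sales_step1""")
rows = conn.execute(
    "SELECT prod_id, amount FROM new_sales_step2 ORDER BY prod_id").fetchall()
```

The cost of the approach is visible too: every staging table consumes extra space and an extra pass over the data.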
Pipelined Data Transformation
With pipelined data transformation, the ETL process flow can be changed dramatically and the database becomes an
integral part of the ETL solution. The new functionality renders some of the formerly necessary process steps
obsolete, while others can be remodeled to enhance the data flow and the data transformation to become more
scalable and non-interruptive. The task shifts from a serial transform-then-load process (with most of the tasks done
outside the database) or load-then-transform process to an enhanced transform-while-loading.
Oracle offers a wide variety of new capabilities to address all the issues and tasks relevant in an ETL scenario. It is
important to understand that the database offers toolkit functionality rather than trying to address a one-size-fits-
all solution. The underlying database has to enable the most appropriate ETL process flow for a specific customer
need, and not dictate or constrain it from a technical perspective. Figure 14-2 illustrates the new functionality,
which is discussed throughout later sections.
Figure 14-2 Pipelined Data Transformation
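The transform-while-loading idea can be sketched as a generator pipeline: rows stream from the extraction stage through the transformation into the load, with no intermediate staging tables materialized. A toy Python illustration (the source records and the validation rule are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (prod_id INT, amount REAL)")

def read_source():
    """Extraction stage: yields raw records one at a time."""
    yield from [("1", "10.0"), ("2", "-3.0"), ("3", "5.0")]

def transform(records):
    """Transformation stage: cleans rows while they stream through."""
    for prod_id, amount in records:
        amt = float(amount)
        if amt >= 0:                     # discard invalid rows in flight
            yield (int(prod_id), amt)

# The load stage consumes the pipeline: rows are transformed while they
# are being loaded, with no staging tables in between.
conn.executemany("INSERT INTO sales VALUES (?, ?)", transform(read_source()))
loaded = conn.execute(
    "SELECT prod_id, amount FROM sales ORDER BY prod_id").fetchall()
```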
Loading Mechanisms
You can use the following mechanisms for loading a data warehouse:
Loading a Data Warehouse with SQL*Loader
Loading a Data Warehouse with External Tables
Loading a Data Warehouse with OCI and Direct-Path APIs
Loading a Data Warehouse with Export/Import
Schemas in Data Warehouses
A schema is a collection of database objects, including tables, views, indexes, and synonyms.
There is a variety of ways of arranging schema objects in the schema models designed for data warehousing. One
data warehouse schema model is a star schema.
Star Schemas
The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-
relationship diagram of this schema resembles a star, with points radiating from a central table. The center of the
star consists of a large fact table and the points of the star are the dimension tables.
A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the
fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. The
optimizer recognizes star queries and generates efficient execution plans for them.
A typical fact table contains keys and measures. For example, in the sh sample schema, the fact table, sales,
contains the measures quantity_sold, amount, and cost, and the keys cust_id, time_id, prod_id, channel_id, and
promo_id. The dimension tables are customers, times, products, channels, and promotions. The products dimension
table, for example, contains information about each product number that appears in the fact table.
A star join is a primary key to foreign key join of the dimension tables to a fact table.
The main advantages of star schemas are that they:
Provide a direct and intuitive mapping between the business entities being analyzed by end users and the
schema design.
Provide highly optimized performance for typical star queries.
Are widely supported by a large number of business intelligence tools, which may anticipate or even require
that the data warehouse schema contain dimension tables.
Star schemas are used for both simple data marts and very large data warehouses.
Figure 19-2 presents a graphical representation of a star schema.
Figure 19-2 Star Schema
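A miniature star schema and star query can be built in a few lines. Note that the fact table joins to each dimension on a foreign key, and the dimensions never join to each other. A sketch with sqlite3, loosely modeled on the sh sample schema (the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A tiny star: one fact table and two of its dimension tables.
conn.execute("CREATE TABLE products (prod_id INT PRIMARY KEY, prod_name TEXT)")
conn.execute("CREATE TABLE channels (channel_id INT PRIMARY KEY, channel TEXT)")
conn.execute("""CREATE TABLE sales (
    prod_id INT REFERENCES products(prod_id),
    channel_id INT REFERENCES channels(channel_id),
    quantity_sold INT, amount REAL)""")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO channels VALUES (?, ?)",
                 [(1, "Direct"), (2, "Internet")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 3, 30.0), (1, 2, 1, 10.0), (2, 1, 2, 50.0)])

# A star query: the fact table joined to each dimension on a foreign key,
# with the dimensions not joined to each other, aggregating the measures.
rows = conn.execute("""
    SELECT p.prod_name, SUM(s.amount)
    FROM sales s
    JOIN products p ON s.prod_id = p.prod_id
    JOIN channels c ON s.channel_id = c.channel_id
    WHERE c.channel = 'Direct'
    GROUP BY p.prod_name ORDER BY p.prod_name""").fetchall()
```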
Snowflake Schemas
The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It
is called a snowflake schema because the diagram of the schema resembles a snowflake.
Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped
into multiple tables instead of one large table. For example, a product dimension table in a star schema might be
normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake
schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins.
The result is more complex queries and reduced query performance. Figure 19-3 presents a graphical representation
of a snowflake schema.
Figure 19-3 Snowflake Schema
Optimizing Star Queries
You should consider the following when using star queries:
Tuning Star Queries
Using Star Transformation
Tuning Star Queries
To get the best possible performance for star queries, it is important to follow some basic guidelines:
A bitmap index should be built on each of the foreign key columns of the fact table or tables.
The initialization parameter STAR_TRANSFORMATION_ENABLED should be set to TRUE. This enables an important
optimizer feature for star queries. It is set to FALSE by default for backward compatibility.
When a data warehouse satisfies these conditions, the majority of the star queries running in the data warehouse
will use a query execution strategy known as the star transformation. The star transformation provides very
efficient query performance for star queries.
Using Star Transformation
The star transformation is a powerful optimization technique that relies upon implicitly rewriting (or transforming)
the SQL of the original star query. The end user never needs to know any of the details about the star
transformation. Oracle's query optimizer automatically chooses the star transformation where appropriate.
The star transformation is a query transformation aimed at executing star queries efficiently. Oracle processes
a star query using two basic phases. The first phase retrieves exactly the necessary rows from the fact table (the
result set). Because this retrieval utilizes bitmap indexes, it is very efficient. The second phase joins this result set
to the dimension tables. An example of an end user query is: "What were the sales and profits for the grocery
department of stores in the west and southwest sales districts over the last three quarters?" This is a simple star
query.
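The two phases can be mimicked directly in SQL: first reduce the fact table using subqueries against the constrained dimensions, then join the small result set back to those dimensions. A conceptual sketch with sqlite3 (which has no bitmap indexes; the IN-subqueries below only stand in for the bitmap-index probes Oracle would use, and the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE times (time_id INT PRIMARY KEY, quarter TEXT)")
conn.execute("CREATE TABLE customers (cust_id INT PRIMARY KEY, region TEXT)")
conn.execute("CREATE TABLE sales (time_id INT, cust_id INT, amount REAL)")
conn.executemany("INSERT INTO times VALUES (?, ?)", [(1, "Q1"), (2, "Q2")])
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "West"), (2, "East")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (1, 2, 20.0), (2, 1, 5.0)])

# Phase 1: retrieve exactly the qualifying fact rows using dimension
# subqueries on the foreign key columns.
conn.execute("""
    CREATE TEMP TABLE fact_subset AS
    SELECT * FROM sales
    WHERE time_id IN (SELECT time_id FROM times WHERE quarter = 'Q1')
      AND cust_id IN (SELECT cust_id FROM customers WHERE region = 'West')""")

# Phase 2: join the (already reduced) result set back to the dimensions.
rows = conn.execute("""
    SELECT t.quarter, c.region, f.amount
    FROM fact_subset f
    JOIN times t ON f.time_id = t.time_id
    JOIN customers c ON f.cust_id = c.cust_id""").fetchall()
```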
How Oracle Chooses to Use Star Transformation
The optimizer generates and saves the best plan it can produce without the transformation. If the
transformation is enabled, the optimizer then tries to apply it to the query and, if applicable, generates
the best plan using the transformed query. Based on a comparison of the cost estimates between the best
plans for the two versions of the query, the optimizer will then decide whether to use the best plan for
the transformed or untransformed version.
If the query requires accessing a large percentage of the rows in the fact table, it might be better to use a full
table scan and not use the transformations. However, if the constraining predicates on the dimension tables are
sufficiently selective that only a small portion of the fact table needs to be retrieved, the plan based on the
transformation will probably be superior.
Note that the optimizer generates a subquery for a dimension table only if it decides that it is reasonable to
do so based on a number of criteria. There is no guarantee that subqueries will be generated for all dimension
tables. The optimizer may also decide, based on the properties of the tables and the query, that the transformation
does not merit being applied to a particular query. In this case the best regular plan will be used.
Star Transformation Restrictions
Star transformation is not supported for tables with any of the following characteristics:
Queries with a table hint that is incompatible with a bitmap access path
Queries that contain bind variables
Tables with too few bitmap indexes. There must be a bitmap index on a fact table column for the optimizer to
generate a subquery for it.
Remote fact tables. However, remote dimension tables are allowed in the subqueries that are generated.
Anti-joined tables
Tables that are already used as a dimension table in a subquery
Tables that are really unmerged views, which are not view partitions
The star transformation may not be chosen by the optimizer for the following cases:
Tables that have a good single-table access path
Tables that are too small for the transformation to be worthwhile
In addition, temporary tables will not be used by star transformation under the following conditions:
The database is in read-only mode
The star query is part of a transaction that is in serializable mode
Informatica
Informatica is a tool supporting all the steps of the Extraction, Transformation and Load process. Nowadays
Informatica is also being used as an integration tool.
Informatica is an easy-to-use tool. It has a simple visual interface, like forms in Visual Basic. You just need to
drag and drop different objects (known as transformations) and design the process flow for data extraction,
transformation and load. These process flow diagrams are known as mappings. Once a mapping is made, it can be
scheduled to run as and when required. In the background, the Informatica server takes care of fetching data from
the source, transforming it, and loading it to the target systems/databases.
Informatica can communicate with all major data sources (mainframe/RDBMS/flat files/XML/VSM/SAP and so on)
and can move/transform data between them. It can move huge volumes of data in a very effective way, often
better than even bespoke programs written for specific data movement only. It can throttle the transactions (do big
updates in small chunks to avoid long locking and filling the transaction log). It can effectively join data from two
distinct data sources (even an XML file can be joined with a relational table). In all, Informatica has the ability to
effectively integrate heterogeneous data sources and convert raw data into useful information.
Before we start actually working in Informatica, let's have an idea about the company owning this product.
Some facts and figures about Informatica Corporation:
Founded in 1993, based in Redwood City, California
1400+ employees; 3450+ customers; 79 of the Fortune 100 companies
NASDAQ stock symbol: INFA; stock price: $19.74 (09/04/2009)
Revenues in fiscal year 2009: $455.7M
Informatica Developer Network: 20000 members
Informatica Software Architecture illustrated
Informatica's ETL product, known as Informatica PowerCenter, consists of 3 main components.
1. Informatica PowerCenter Client Tools:
These are the development tools installed at the developer end. These tools enable a developer to:
Define the transformation process, known as a mapping (Designer)
Define run-time properties for a mapping, known as sessions (Workflow Manager)
Monitor execution of sessions (Workflow Monitor)
Manage repositories, useful for administrators (Repository Manager)
Report metadata (Metadata Reporter)
2. Informatica PowerCenter Repository:
The repository is the heart of the Informatica tools. The repository is a kind of data inventory where all the data related to mappings, sources, targets etc. is kept. This is the place where all the metadata for your application is stored. All
the client tools and the Informatica Server fetch data from the repository. The Informatica client and server without the repository are the same as a PC without memory/hard disk, which has the ability to process data but has no data to process. The repository can be treated as the backend of Informatica.
3. Informatica PowerCenter Server:
The server is the place where all the executions take place. The server makes physical connections to sources/targets, fetches data, applies the transformations mentioned in the mapping and loads the data into the target system.
This architecture is visually explained in the diagram below:
Sources
Standard: RDBMS, Flat Files, XML, ODBC
Applications: SAP R/3, SAP BW, PeopleSoft, Siebel, JD Edwards, i2
EAI: MQ Series, Tibco, JMS, Web Services
Legacy: Mainframes (DB2, VSAM, IMS, IDMS, Adabas), AS400 (DB2, Flat File)
Remote Sources
Targets
Standard: RDBMS, Flat Files, XML, ODBC
Applications: SAP R/3, SAP BW, PeopleSoft, Siebel, JD Edwards, i2
EAI: MQ Series, Tibco, JMS, Web Services
Legacy: Mainframes (DB2), AS400 (DB2)
Remote Targets
Informatica Product Line
Informatica is a powerful ETL tool from Informatica Corporation, a leading provider of enterprise data integration software and ETL software.
The important products provided by Informatica Corporation are listed below:
Power Center
Power Mart
Power Exchange
Power Center Connect
Power Channel
Metadata Exchange
Power Analyzer
Super Glue
Power Center & Power Mart: Power Mart is a departmental version of Informatica for building, deploying, and managing data warehouses and data marts. Power Center is used for corporate enterprise data warehouses, and Power Mart is used for departmental data warehouses like data marts. Power Center supports global repositories and networked repositories and can be connected to several sources. Power Mart supports a single repository and can be connected to fewer sources than Power Center. Power Mart can extensibly grow to an enterprise implementation, and its codeless environment eases developer productivity.
Power Exchange: Informatica Power Exchange, as a stand-alone service or along with Power Center, helps organizations leverage data by avoiding manual coding of data extraction programs. Power Exchange supports batch, real-time and changed data capture options on mainframe (DB2, VSAM, IMS etc.), midrange (AS400 DB2 etc.) and relational databases (Oracle, SQL Server, DB2 etc.), and flat files on Unix, Linux and Windows systems.
Power Center Connect: This is an add-on to Informatica Power Center. It helps to extract data and metadata from ERP systems like IBM's MQSeries, PeopleSoft, SAP, Siebel etc. and other third-party applications.
Power Channel: This helps to transfer large amounts of encrypted and compressed data over LAN and WAN, through firewalls, transfer files over FTP, etc.
Metadata Exchange: Metadata Exchange enables organizations to take advantage of the time and effort already invested in defining data structures within their IT environment when used with Power Center. For example, an organization may be using data modeling tools such as Erwin, Embarcadero, Oracle Designer, Sybase PowerDesigner etc. for developing data models. Functional and technical teams may have spent much time and effort in creating the data model's data structures (tables, columns, data types, procedures, functions, triggers etc.). By using Metadata Exchange, these data structures can be imported into Power Center to identify source and target mappings, which leverages that time and effort. There is no need for an Informatica developer to create these data structures once again.
Power Analyzer: Power Analyzer provides organizations with reporting facilities. PowerAnalyzer makes accessing, analyzing, and sharing enterprise data simple and easily available to decision makers. PowerAnalyzer enables users to gain insight into business processes and develop business intelligence.
With PowerAnalyzer, an organization can extract, filter, format, and analyze corporate information from data stored in a data warehouse, data mart, operational data store, or other data storage models. PowerAnalyzer works best with a dimensional data warehouse in a relational database. It can also run reports on data in any table in a relational database that does not conform to the dimensional model.
Super Glue: Super Glue is used for loading metadata into a centralized place from several sources. Reports can be run against Super Glue to analyze metadata.
TRANSFORMATIONS
Informatica Transformations
A transformation is a repository object that generates, modifies, or passes data. The Designer provides a set of transformations that perform specific functions. For example, an Aggregator transformation performs calculations on groups of data.
Transformations can be of two types:
Active Transformation
An active transformation can change the number of rows that pass through the transformation, change the transaction boundary, and change the row type. For example, Filter, Transaction Control and Update Strategy are active transformations.
Note: The key point to note is that the Designer does not allow you to connect multiple active transformations, or an active and a passive transformation, to the same downstream transformation or transformation input group, because the Integration Service may not be able to concatenate the rows passed by active transformations. However, the Sequence Generator transformation (SGT) is an exception to this rule. An SGT does not receive data; it generates unique numeric values. As a result, the Integration Service does not encounter problems concatenating rows passed by an SGT and an active transformation.
Passive Transformation
A passive transformation does not change the number of rows that pass through it, maintains the transaction boundary, and maintains the row type.
The key point to note is that the Designer allows you to connect multiple transformations to the same downstream transformation or transformation input group only if all transformations in the upstream branches are passive. The transformation that originates the branch can be active or passive.
Transformations can be Connected or Unconnected to the data flow.
Connected Transformation
A connected transformation is connected to other transformations or directly to the target table in the mapping.
Unconnected Transformation
An unconnected transformation is not connected to other transformations in the mapping. It is called within another transformation, and returns a value to that transformation.
1. Expression Transformation (Connected/Passive)
You can use the Expression transformation to calculate values in a single row before you write to the target. For example, you might need to adjust employee salaries, concatenate first and last names, or convert strings to numbers. You can use the Expression transformation to perform any non-aggregate calculations. You can also use the Expression transformation to test conditional statements before you output the results to target tables or other transformations.
"alculating Balues
)o use the @xpression transformation to calculate alues for a single row, you must include the following ports4
Inut or inut@outut ort! for eac" value u!ed in t"e calculation* For example, when calculating the
total price for an order, determined !y multiplying the unit price !y the *uantity ordered, the input or
input2output ports. -ne port proides the unit price and the other proides the *uantity ordered.
Outut ort for t"e e1re!!ion* $ou enter the expression as a configuration option for the output port. )he
return alue for the output port needs to match the return alue of the expression. For information on
entering expressions, see G)ransformationsH in the Designer Guide. @xpressions use the transformation
language, which includes :K.&like functions, to perform calculations
$ou can enter multiple expressions in a single @xpression transformation. 1s long as you enter only one expression
for each output port, you can create any num!er of output ports in the transformation. (n this way, you can use one
@xpression transformation rather than creating separate transformations for each calculation that re*uires the same
set of data.
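The row-by-row behavior described above can be sketched in plain Python. This is a minimal illustration of the idea, not Informatica's engine or API; the port names (UNIT_PRICE, QUANTITY, TOTAL_PRICE) are hypothetical.

```python
def expression_transform(rows):
    # Sketch of an Expression transformation: each input row passes through
    # unchanged, and one output port (TOTAL_PRICE) is computed per row from
    # two input ports, like the expression UNIT_PRICE * QUANTITY.
    out = []
    for row in rows:
        new_row = dict(row)  # input/output ports pass through untouched
        new_row["TOTAL_PRICE"] = row["UNIT_PRICE"] * row["QUANTITY"]
        out.append(new_row)
    return out

orders = [{"UNIT_PRICE": 10.0, "QUANTITY": 3},
          {"UNIT_PRICE": 2.5, "QUANTITY": 4}]
print(expression_transform(orders))
```

Because the row count never changes, this mirrors why the Expression transformation is classified as passive.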
2. Filter Transformation (Connected/Active)
The Filter transformation allows you to filter rows in a mapping. You pass all the rows from a source transformation through the Filter transformation, and then enter a filter condition for the transformation. All ports in a Filter transformation are input/output, and only rows that meet the condition pass through the Filter transformation.
In some cases, you need to filter data based on one or more conditions before writing it to targets. For example, if you have a human resources target containing information about current employees, you might want to filter out employees who are part-time and hourly.
With the filter of SALARY > 30000, only rows of data where employees make salaries greater than $30,000 pass through to the target.
As an active transformation, the Filter transformation may change the number of rows passed through it. A filter condition returns TRUE or FALSE for each row that passes through the transformation, depending on whether a row meets the specified condition. Only rows that return TRUE pass through this transformation. Discarded rows do not appear in the session log or reject files. You use the transformation language to enter the filter condition. The condition is an expression that returns TRUE or FALSE.
To maximize session performance, include the Filter transformation as close to the sources in the mapping as possible. Rather than passing rows you plan to discard through the mapping, you then filter out unwanted data early in the flow of data from sources to targets.
To filter out rows containing null values or spaces, use the ISNULL and IS_SPACES functions to test the value of the port. For example, if you want to filter out rows that contain NULLs in the FIRST_NAME port, use the following condition: IIF(ISNULL(FIRST_NAME), FALSE, TRUE)
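The filter semantics above, including the silent discard of non-matching rows, can be sketched in Python. This is an illustrative model under assumed column names (SALARY, FIRST_NAME), not the PowerCenter implementation.

```python
def filter_transform(rows, condition):
    # Rows where the condition evaluates TRUE pass through; all other rows
    # are silently discarded (no session-log or reject-file entry).
    return [row for row in rows if condition(row)]

employees = [
    {"FIRST_NAME": "Ann", "SALARY": 45000},
    {"FIRST_NAME": None,  "SALARY": 52000},
    {"FIRST_NAME": "Bob", "SALARY": 28000},
]

# Combines SALARY > 30000 with a null check analogous to
# IIF(ISNULL(FIRST_NAME), FALSE, TRUE) from the text above.
passed = filter_transform(
    employees,
    lambda r: r["SALARY"] > 30000 and r["FIRST_NAME"] is not None,
)
print(passed)
```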
3. Joiner Transformation (Connected/Active)
You can use the Joiner transformation to join source data from two related heterogeneous sources residing in different locations or file systems. Or, you can join data from the same source.
The Joiner transformation joins two sources with at least one matching port. The Joiner transformation uses a condition that matches one or more pairs of ports between the two sources. If you need to join more than two sources, you can add more Joiner transformations to the mapping. The Joiner transformation requires input from two separate pipelines or two branches from one pipeline.
The Joiner transformation accepts input from most transformations. However, there are some limitations on the pipelines you connect to the Joiner transformation. You cannot use a Joiner transformation in the following situations:
Either input pipeline contains an Update Strategy transformation.
You connect a Sequence Generator transformation directly before the Joiner transformation.
The join condition contains ports from both input sources that must match for the PowerCenter Server to join two rows. Depending on the type of join selected, the Joiner transformation either adds the row to the result set or discards the row. The Joiner produces result sets based on the join type, condition, and input data sources.
Before you define a join condition, verify that the master and detail sources are set for optimal performance. During a session, the PowerCenter Server compares each row of the master source against the detail source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which speeds the join process. To improve performance, designate the source with the smallest count of distinct values as the master.
You define the join type on the Properties tab in the transformation. The Joiner transformation supports the following types of joins:
Normal
Master Outer
Detail Outer
Full Outer
You can improve session performance by configuring the Joiner transformation to use sorted input. When you configure the Joiner transformation to use sorted data, the PowerCenter Server improves performance by minimizing disk input and output. You see the greatest performance improvement when you work with large data sets.
When you use a Joiner transformation in a mapping, you must configure the mapping according to the number of pipelines and sources you intend to use. You can configure a mapping to join the following types of data:
Data from multiple sources. When you want to join more than two pipelines, you must configure the mapping using multiple Joiner transformations.
Data from the same source. When you want to join data from the same source, you must configure the mapping to use the same source.
Unsorted Joiner Transformation
When the PowerCenter Server processes an unsorted Joiner transformation, it reads all master rows before it reads the detail rows. To ensure it reads all master rows before the detail rows, the PowerCenter Server blocks the detail source while it caches rows from the master source. Once the PowerCenter Server reads and caches all master rows, it unblocks the detail source and reads the detail rows.
Sorted Joiner Transformation
When the PowerCenter Server processes a sorted Joiner transformation, it blocks data based on the mapping configuration.
When the PowerCenter Server can block and unblock the source pipelines connected to the Joiner transformation without blocking all sources in the target load order group simultaneously, it uses blocking logic to process the Joiner transformation. Otherwise, it does not use blocking logic and instead stores more rows in the cache.
Perform joins in a database when possible.
Performing a join in a database is faster than performing a join in the session. In some cases this is not possible, such as joining tables from two different databases or flat file systems. If you want to perform a join in a database, you can use the following options:
Create a pre-session stored procedure to join the tables in a database.
Use the Source Qualifier transformation to perform the join.
Join sorted data when possible.
You can improve session performance by configuring the Joiner transformation to use sorted input. When you configure the Joiner transformation to use sorted data, the PowerCenter Server improves performance by minimizing disk input and output. You see the greatest performance improvement when you work with large data sets.
For an unsorted Joiner transformation, designate the source with fewer rows as the master source.
For optimal performance and disk storage, designate the master source as the source with the fewer rows. During a session, the Joiner transformation compares each row of the master source against the detail source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which speeds the join process.
For a sorted Joiner transformation, designate the source with fewer duplicate key values as the master source.
For optimal performance and disk storage, designate the master source as the source with fewer duplicate key values. When the PowerCenter Server processes a sorted Joiner transformation, it caches rows for one hundred keys at a time. If the master source contains many rows with the same key value, the PowerCenter Server must cache more rows, and performance can be slowed.
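The cache-the-master, stream-the-detail strategy described for the unsorted Joiner can be sketched as a normal (inner) join in Python. This is a conceptual model only; the table and column names are made up for illustration.

```python
def joiner_normal(master, detail, key):
    # Cache the (smaller) master source keyed on the join column, then
    # stream detail rows against the cache. Fewer unique master keys means
    # fewer comparisons, which is why the smaller source should be master.
    cache = {}
    for row in master:
        cache.setdefault(row[key], []).append(row)
    result = []
    for d in detail:
        for m in cache.get(d[key], []):  # non-matching detail rows drop out
            merged = dict(m)
            merged.update(d)
            result.append(merged)
    return result

stores = [{"STORE_ID": 1, "CITY": "Austin"},
          {"STORE_ID": 2, "CITY": "Boston"}]
sales = [{"STORE_ID": 1, "AMOUNT": 100},
         {"STORE_ID": 1, "AMOUNT": 50},
         {"STORE_ID": 3, "AMOUNT": 75}]
joined = joiner_normal(stores, sales, "STORE_ID")
print(joined)
```

A Master Outer or Detail Outer join would keep the unmatched rows from one side instead of dropping them, which is the only difference among the four join types listed above.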
4. Rank Transformation (Active/Connected)
The Rank transformation allows you to select only the top or bottom rank of data. You can use a Rank transformation to return the largest or smallest numeric value in a port or group. You can also use a Rank transformation to return the strings at the top or the bottom of a session sort order. During the session, the PowerCenter Server caches input data until it can perform the rank calculations.
You connect all ports representing the same row set to the transformation. Only the rows that fall within that rank, based on some measure you set when you configure the transformation, pass through the Rank transformation. You can also write expressions to transform data or perform calculations.
As an active transformation, the Rank transformation might change the number of rows passed through it. You might pass 100 rows to the Rank transformation, but select to rank only the top 10 rows, which pass from the Rank transformation to another transformation.
You can connect ports from only one transformation to the Rank transformation. The Rank transformation allows you to create local variables and write non-aggregate expressions.
Ran6 #ac"e!
8uring a session, the /ower"enter :erer compares an input row with rows in the data cache. (f the input row out&
ranks a cached row, the /ower"enter :erer replaces the cached row with the input row. (f you configure the <ank
transformation to rank across multiple groups, the /ower"enter :erer ranks incrementally for each group it finds.
)he /ower"enter :erer stores group information in an index cache and row data in a data cache. (f you create
multiple partitions in a pipeline, the /ower"enter :erer creates separate caches for each partition
Ran6 Tran!formation 'roertie!
+hen you create a <ank transformation, you can configure the following properties4
@nter a cache directory.
:elect the top or !ottom rank.
:elect the input2output port that contains alues used to determine the rank. $ou can select only one port
to define a rank.
:elect the num!er of rows falling within a rank.
8efine groups for ranks, such as the 1% least expensie products for each manufacturer.
the <ank transformation changes the num!er of rows in two different ways. 3y filtering all !ut the rows falling
within a top or !ottom rank, you reduce the num!er of rows that pass through the transformation. 3y defining
groups, you create one set of ranked rows for each group.
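Top/bottom ranking with optional groups, as configured through the properties above, can be sketched like this. The parameter names are illustrative, not the actual Rank transformation property names.

```python
def rank_transform(rows, rank_port, top=True, n=1, group_by=None):
    # Group rows by the optional group port, then keep only the top
    # (or bottom) n rows of each group, ordered by the rank port.
    groups = {}
    for row in rows:
        key = row[group_by] if group_by else None
        groups.setdefault(key, []).append(row)
    out = []
    for members in groups.values():
        members.sort(key=lambda r: r[rank_port], reverse=top)
        out.extend(members[:n])  # rows outside the rank are filtered out
    return out

products = [
    {"MANUFACTURER": "A", "PRICE": 10},
    {"MANUFACTURER": "A", "PRICE": 5},
    {"MANUFACTURER": "B", "PRICE": 7},
    {"MANUFACTURER": "B", "PRICE": 9},
]
# The "least expensive product for each manufacturer" case from the text.
cheapest = rank_transform(products, "PRICE", top=False, n=1,
                          group_by="MANUFACTURER")
print(cheapest)
```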
5. Router Transformation (Connected/Active)
A Router transformation is similar to a Filter transformation because both transformations allow you to use a condition to test data. A Filter transformation tests data for one condition and drops the rows of data that do not meet the condition. However, a Router transformation tests data for one or more conditions and gives you the option to route rows of data that do not meet any of the conditions to a default output group.
If you need to test the same input data based on multiple conditions, use a Router transformation in a mapping instead of creating multiple Filter transformations to perform the same task. The Router transformation is more efficient. For example, to test data based on three conditions, you only need one Router transformation instead of three Filter transformations to perform this task. Likewise, when you use a Router transformation in a mapping, the PowerCenter Server processes the incoming data only once. When you use multiple Filter transformations in a mapping, the PowerCenter Server processes the incoming data for each transformation.
Using Group Filter Conditions
You can test data based on one or more group filter conditions. You create group filter conditions on the Groups tab using the Expression Editor. You can enter any expression that returns a single value. You can also specify a constant for the condition. A group filter condition returns TRUE or FALSE for each row that passes through the
transformation, depending on whether a row satisfies the specified condition. Zero (0) is the equivalent of FALSE, and any non-zero value is the equivalent of TRUE. The PowerCenter Server passes the rows of data that evaluate to TRUE to each transformation or target that is associated with each user-defined group.
A Router transformation has input ports and output ports. Input ports are in the input group, and output ports are in the output groups. You can create input ports by copying them from another transformation or by manually creating them on the Ports tab.
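The routing behavior, where each row is tested against every group filter condition and unmatched rows go to the default group, can be sketched as follows. Group names and conditions here are hypothetical.

```python
def router_transform(rows, group_conditions):
    # Each row is tested against every user-defined group condition. A row
    # that satisfies several conditions is routed to several groups; a row
    # that satisfies none goes to the DEFAULT group.
    output = {name: [] for name in group_conditions}
    output["DEFAULT"] = []
    for row in rows:
        matched = False
        for name, cond in group_conditions.items():
            if cond(row):
                output[name].append(row)
                matched = True
        if not matched:
            output["DEFAULT"].append(row)
    return output

groups = router_transform(
    [{"DEPT": "HR", "SALARY": 40000},
     {"DEPT": "IT", "SALARY": 90000},
     {"DEPT": "OPS", "SALARY": 20000}],
    {"HIGH_PAY": lambda r: r["SALARY"] > 50000,
     "HR_ROWS": lambda r: r["DEPT"] == "HR"},
)
print(groups)
```

Note the contrast with the Filter sketch earlier: the input is read once, however many conditions there are, which is the efficiency argument made in the text.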
6. Sequence Generator Transformation (Passive/Connected)
The Sequence Generator transformation generates numeric values. You can use the Sequence Generator to create unique primary key values, replace missing primary keys, or cycle through a sequential range of numbers.
The Sequence Generator transformation is a connected transformation. It contains two output ports that you can connect to one or more transformations. The PowerCenter Server generates a block of sequence numbers each time a block of rows enters a connected transformation. If you connect CURRVAL, the PowerCenter Server processes one row in each block. When NEXTVAL is connected to the input port of another transformation, the PowerCenter Server generates a sequence of numbers. When CURRVAL is connected to the input port of another transformation, the PowerCenter Server generates the NEXTVAL value plus the Increment By value.
When creating primary or foreign keys, use the Cycle option only to prevent the PowerCenter Server from creating duplicate primary keys. You might do this by selecting the Truncate Target Table option in the session properties (if appropriate) or by creating composite keys.
To create a composite key, you can configure the PowerCenter Server to cycle through a smaller set of values. For example, if you have three stores generating order numbers, you might have a Sequence Generator cycling through values from 1 to 3, incrementing by 1. When you pass the corresponding set of foreign keys, the generated values then create unique composite keys.
The Sequence Generator transformation provides two output ports: NEXTVAL and CURRVAL. You cannot edit or delete these ports. Likewise, you cannot add ports to the transformation.
NEXTVAL
Connect NEXTVAL to multiple transformations to generate unique values for each row in each transformation. Use the NEXTVAL port to generate sequence numbers by connecting it to a transformation or target. You connect the NEXTVAL port to a downstream transformation to generate the sequence based on the Current Value and Increment By properties.
For example, you might connect NEXTVAL to two target tables in a mapping to generate unique primary key values. The PowerCenter Server creates a column of unique primary key values for each target table. The column of unique primary key values is sent to one target table as a block of sequence numbers. The second target receives a block of sequence numbers from the Sequence Generator transformation only after the first target table receives its block of sequence numbers.
If you want the same values to go to more than one target that receives data from a single transformation, you can connect a Sequence Generator transformation to that preceding transformation. The Sequence Generator transformation processes the values into a block of sequence numbers. This allows the PowerCenter Server to pass unique values to the transformation, and then route rows from the transformation to targets.
Start Value and Cycle
You can use Cycle to generate a repeating sequence, such as numbers 1 through 12 to correspond to the months in a year.
To cycle the PowerCenter Server through a sequence:
1. Enter the lowest value in the sequence that you want the PowerCenter Server to use for the Start Value.
2. Then enter the highest value to be used for the End Value.
3. Select Cycle.
As it cycles, when the PowerCenter Server reaches the configured end value for the sequence, it wraps around and starts the cycle again, beginning with the configured Start Value.
Number of Cached Values
Number of Cached Values determines the number of values the PowerCenter Server caches at one time. When Number of Cached Values is greater than zero, the PowerCenter Server caches the configured number of values and updates the current value each time it caches values.
When multiple sessions use the same reusable Sequence Generator transformation at the same time, there might be multiple instances of the Sequence Generator transformation. To avoid generating the same values in each session, reserve a range of sequence values for each session by configuring Number of Cached Values.
Reset: If you select Reset for a non-reusable Sequence Generator transformation, the PowerCenter Server generates values based on the original current value each time it starts the session. Otherwise, the PowerCenter Server updates the current value to reflect the last-generated value plus one, and then uses the updated value the next time it uses the Sequence Generator transformation.
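The Start Value / End Value / Cycle behavior described above can be sketched with a small class. This is a conceptual model of NEXTVAL only; block-based caching, CURRVAL and session reset are omitted.

```python
class SequenceGenerator:
    """Minimal sketch: nextval() returns the current value, then advances
    by Increment By; with Cycle on, values wrap from End back to Start."""

    def __init__(self, start=1, increment=1, end=None, cycle=False):
        self.start = start
        self.increment = increment
        self.end = end
        self.cycle = cycle
        self.current = start  # the Current Value property

    def nextval(self):
        value = self.current
        nxt = value + self.increment
        if self.end is not None and nxt > self.end:
            if not self.cycle:
                raise OverflowError("sequence exhausted")
            nxt = self.start  # wrap around, as when Cycle is selected
        self.current = nxt
        return value

# The months-of-the-year example from the text: cycle through 1..12.
months = SequenceGenerator(start=1, increment=1, end=12, cycle=True)
vals = [months.nextval() for _ in range(14)]
print(vals)  # 1 through 12, then wraps back to 1, 2
```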
7. Sorter Transformation (Connected/Active)
The Sorter transformation allows you to sort data. You can sort data in ascending or descending order according to a specified sort key. You can also configure the Sorter transformation for case-sensitive sorting, and specify whether the output rows should be distinct. The Sorter transformation is an active transformation. It must be connected to the data flow.
You can sort data from relational or flat file sources. You can also use the Sorter transformation to sort data passing through an Aggregator transformation configured to use sorted input.
When you create a Sorter transformation in a mapping, you specify one or more ports as a sort key and configure each sort key port to sort in ascending or descending order. You also configure sort criteria the PowerCenter Server applies to all sort key ports and the system resources it allocates to perform the sort operation.
The Sorter transformation contains only input/output ports. All data passing through the Sorter transformation is sorted according to a sort key. The sort key is one or more ports that you want to use as the sort criteria.
Sorter Cache Size
The PowerCenter Server uses the Sorter Cache Size property to determine the maximum amount of memory it can allocate to perform the sort operation. The PowerCenter Server passes all incoming data into the Sorter transformation before it performs the sort operation.
You can specify any amount between 1 MB and 4 GB for the Sorter cache size. If the total configured session cache size is 2 GB (2,147,483,648 bytes) or greater, you must run the session on a 64-bit PowerCenter Server.
Distinct Output Rows
You can configure the Sorter transformation to treat output rows as distinct. If you configure the Sorter transformation for distinct output rows, the Mapping Designer configures all ports as part of the sort key. When the PowerCenter Server runs the session, it discards duplicate rows compared during the sort operation.
Transformation Scope
The transformation scope specifies how the PowerCenter Server applies the transformation logic to incoming data:
Transaction. Applies the transformation logic to all rows in a transaction. Choose Transaction when a row of data depends on all rows in the same transaction, but does not depend on rows in other transactions.
All Input. Applies the transformation logic on all incoming data. When you choose All Input, the PowerCenter Server drops incoming transaction boundaries. Choose All Input when a row of data depends on all rows in the source.
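Multi-key ascending/descending sorting with the Distinct Output Rows option can be sketched in Python. The parameter shapes are invented for illustration; they are not the Sorter's actual property names.

```python
def sorter_transform(rows, sort_keys, distinct=False):
    # sort_keys: list of (port, ascending) pairs. Later keys are applied
    # first, so Python's stable sort produces the combined ordering.
    out = list(rows)
    for port, ascending in reversed(sort_keys):
        out.sort(key=lambda r: r[port], reverse=not ascending)
    if distinct:
        # Distinct output rows: every port acts as part of the sort key,
        # and rows that compare equal on all ports are discarded.
        seen, unique = set(), []
        for row in out:
            fingerprint = tuple(sorted(row.items()))
            if fingerprint not in seen:
                seen.add(fingerprint)
                unique.append(row)
        out = unique
    return out

rows = [{"DEPT": "IT", "NAME": "Bob"},
        {"DEPT": "HR", "NAME": "Ann"},
        {"DEPT": "IT", "NAME": "Bob"}]  # exact duplicate
result = sorter_transform(rows, [("DEPT", True), ("NAME", True)],
                          distinct=True)
print(result)
```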
8. Transaction Control Transformation (Active/Connected)
PowerCenter allows you to control commit and rollback transactions based on a set of rows that pass through a Transaction Control transformation. A transaction is the set of rows bound by commit or rollback rows. You can define a transaction based on a varying number of input rows. You might want to define transactions based on a group of rows ordered on a common key, such as employee ID or order entry date.
In PowerCenter, you define transaction control at two levels:
Within a mapping: Within a mapping, you use the Transaction Control transformation to define a transaction. You define transactions using an expression in a Transaction Control transformation. Based on the return value of the expression, you can choose to commit, roll back, or continue without any transaction changes.
Within a session: When you configure a session, you configure it for user-defined commit. You can choose to commit or roll back a transaction if the PowerCenter Server fails to transform or write any row to the target.
The expression contains values that represent actions the PowerCenter Server performs based on the return value of the condition. The PowerCenter Server evaluates the condition on a row-by-row basis. The return value determines whether the PowerCenter Server commits, rolls back, or makes no transaction changes to the row. When the PowerCenter Server issues a commit or rollback based on the return value of the expression, it begins a new transaction. Use the following built-in variables in the Expression Editor when you create a transaction control expression:
TC_CONTINUE_TRANSACTION: The PowerCenter Server does not perform any transaction change for this row. This is the default value of the expression.
TC_COMMIT_BEFORE: The PowerCenter Server commits the transaction, begins a new transaction, and writes the current row to the target. The current row is in the new transaction.
TC_COMMIT_AFTER: The PowerCenter Server writes the current row to the target, commits the transaction, and begins a new transaction. The current row is in the committed transaction.
TC_ROLLBACK_BEFORE: The PowerCenter Server rolls back the current transaction, begins a new transaction, and writes the current row to the target. The current row is in the new transaction.
TC_ROLLBACK_AFTER: The PowerCenter Server writes the current row to the target, rolls back the transaction, and begins a new transaction. The current row is in the rolled back transaction.
If the transaction control expression evaluates to a value other than commit, rollback, or continue, the PowerCenter Server fails the session.
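As a rough illustration of these semantics, the following Python sketch models how a row-by-row transaction control expression could drive commits and rollbacks. It is an assumption-laden model, not PowerCenter's implementation; `process`, its row buffer, and the string constants are illustrative names only.

```python
# Hedged sketch: a row-by-row transaction control expression driving
# commits and rollbacks over an in-memory "target".
TC_CONTINUE_TRANSACTION = "continue"
TC_COMMIT_BEFORE = "commit_before"
TC_COMMIT_AFTER = "commit_after"
TC_ROLLBACK_BEFORE = "rollback_before"
TC_ROLLBACK_AFTER = "rollback_after"

def process(rows, tc_expression):
    committed, pending = [], []          # pending = current open transaction
    for row in rows:
        action = tc_expression(row)      # evaluated on a row-by-row basis
        if action == TC_CONTINUE_TRANSACTION:
            pending.append(row)
        elif action == TC_COMMIT_BEFORE:
            committed.extend(pending)    # commit, then start new txn with row
            pending = [row]
        elif action == TC_COMMIT_AFTER:
            pending.append(row)          # row belongs to the committed txn
            committed.extend(pending)
            pending = []
        elif action == TC_ROLLBACK_BEFORE:
            pending = [row]              # discard open txn; row opens new txn
        elif action == TC_ROLLBACK_AFTER:
            pending = []                 # row is rolled back with its txn
        else:
            raise RuntimeError("session fails: invalid return value")
    committed.extend(pending)            # end of source modeled as final commit
    return committed
```

For example, an expression returning TC_COMMIT_AFTER on every even row commits rows in pairs, while TC_ROLLBACK_AFTER discards the row together with its open transaction.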
Mapping Guidelines and Validation
Consider the following rules and guidelines when you create a mapping with a Transaction Control transformation:
If the mapping includes an XML target, and you choose to append or create a new document on commit, the input groups must receive data from the same transaction control point.
Transaction Control transformations connected to any target other than relational, XML, or dynamic IBM MQSeries targets are ineffective for those targets.
You must connect each target instance to a Transaction Control transformation.
You can connect multiple targets to a single Transaction Control transformation.
You can connect only one effective Transaction Control transformation to a target.
You cannot place a Transaction Control transformation in a pipeline branch that starts with a Sequence Generator transformation.
If you use a dynamic Lookup transformation and a Transaction Control transformation in the same mapping, a rolled-back transaction might result in unsynchronized target data.
A Transaction Control transformation may be effective for one target and ineffective for another target. If each target is connected to an effective Transaction Control transformation, the mapping is valid. Either all targets or none of the targets in the mapping should be connected to an effective Transaction Control transformation.
Aggregator Transformation (connected/active)
The Aggregator transformation allows you to perform aggregate calculations, such as averages and sums. The Aggregator transformation is unlike the Expression transformation, in that you can use the Aggregator transformation to perform calculations on groups. The Expression transformation permits you to perform calculations on a row-by-row basis only.
Components of the Aggregator Transformation
The Aggregator is an active transformation, changing the number of rows in the pipeline. The Aggregator transformation has the following components and options:
Aggregate expression: Entered in an output port. Can include non-aggregate expressions and conditional clauses.
Group by port: Indicates how to create groups. The port can be any input, input/output, output, or variable port. When grouping data, the Aggregator transformation outputs the last row of each group unless otherwise specified.
Sorted input: Use to improve session performance. To use sorted input, you must pass data to the Aggregator transformation sorted by group by port, in ascending or descending order.
Aggregate cache: The PowerCenter Server stores data in the aggregate cache until it completes aggregate calculations. It stores group values in an index cache and row data in the data cache.
The Designer allows aggregate expressions only in the Aggregator transformation. An aggregate expression can include conditional clauses and non-aggregate functions. It can also include one aggregate function nested within another aggregate function, such as:
MAX( COUNT( ITEM ))
The result of an aggregate expression varies depending on the group by ports used in the transformation. For example, when the PowerCenter Server calculates the following aggregate expression with no group by ports defined, it finds the total quantity of items sold:
SUM( QUANTITY )
Aggregate Functions
You can use the following aggregate functions within an Aggregator transformation. You can nest one aggregate function within another aggregate function.
The transformation language includes the following aggregate functions:
AVG
COUNT
FIRST
LAST
MAX
MEDIAN
MIN
PERCENTILE
STDDEV
SUM
VARIANCE
Non-Aggregate Functions
You can also use non-aggregate functions in the aggregate expression.
The following expression returns the highest number of items sold for each item (grouped by item). If no items were sold, the expression returns 0.
IIF( MAX( QUANTITY ) > 0, MAX( QUANTITY ), 0 )
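The IIF expression above can be modeled in Python to show how the aggregate expression behaves per group. This is an illustrative sketch (the function name and row layout are invented), not Informatica code:

```python
# Hedged sketch of the Aggregator semantics: MAX(QUANTITY) per ITEM group,
# with the IIF(...) guard returning 0 when no positive quantity exists.
from collections import defaultdict

def aggregate_max_quantity(rows):
    groups = defaultdict(list)           # ITEM acts as the group by port
    for row in rows:
        groups[row["ITEM"]].append(row["QUANTITY"])
    # IIF( MAX(QUANTITY) > 0, MAX(QUANTITY), 0 ) evaluated once per group
    return {item: (max(q) if max(q) > 0 else 0) for item, q in groups.items()}
```

An item whose quantities are all non-positive comes back as 0 rather than its (negative or zero) maximum, mirroring the conditional clause.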
Null Values in Aggregate Functions
When you configure the PowerCenter Server, you can choose how you want the PowerCenter Server to handle null values in aggregate functions. You can choose to treat null values in aggregate functions as NULL or zero. By default, the PowerCenter Server treats null values as NULL in aggregate functions.
Group By Ports
The Aggregator transformation allows you to define groups for aggregations, rather than performing the aggregation across all input data. For example, rather than finding the total company sales, you can find the total sales grouped by region.
To define a group for the aggregate expression, select the appropriate input, input/output, output, and variable ports in the Aggregator transformation. You can select multiple group by ports, creating a new group for each unique combination of groups. The PowerCenter Server then performs the defined aggregation for each group.
Using Sorted Input
You can improve Aggregator transformation performance by using the sorted input option. When you use sorted input, the PowerCenter Server assumes all data is sorted by group. As the PowerCenter Server reads rows for a
group, it performs aggregate calculations. When necessary, it stores group information in memory. To use the Sorted Input option, you must pass sorted data to the Aggregator transformation. You can gain performance with sorted ports when you configure the session with multiple partitions.
When you do not use sorted input, the PowerCenter Server performs aggregate calculations as it reads. However, since data is not sorted, the PowerCenter Server stores data for each group until it reads the entire source to ensure all aggregate calculations are accurate.
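A minimal Python sketch of why sorted input saves memory, assuming rows arrive pre-sorted by group key: each group can be flushed as soon as the key changes, so only one group's running state is held at a time instead of one per group. The names here are illustrative:

```python
# Hedged sketch: streaming aggregation (SUM per group) over pre-sorted input.
# Only the current group's running total is kept in memory.
def sum_sorted(rows):
    results = []
    key, total = None, 0
    for group, value in rows:
        if key is not None and group != key:
            results.append((key, total))  # group finished: flush its result
            total = 0
        key = group
        total += value
    if key is not None:
        results.append((key, total))      # flush the final group
    return results
```

With unsorted input, the same aggregation would have to keep a total per distinct group until the whole source is read.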
Aggregator Transformation Tips
You can use the following guidelines to optimize the performance of an Aggregator transformation.
Use sorted input to decrease the use of aggregate caches.
Sorted input reduces the amount of data cached during the session and improves session performance. Use this option with the Sorter transformation to pass sorted data to the Aggregator transformation.
Limit connected input/output or output ports.
Limit the number of connected input/output or output ports to reduce the amount of data the Aggregator transformation stores in the data cache.
Filter before aggregating.
If you use a Filter transformation in the mapping, place the transformation before the Aggregator transformation to reduce unnecessary aggregation.
Update Strategy Transformation (active/connected)
When you design your data warehouse, you need to decide what type of information to store in targets. As part of your target table design, you need to determine whether to maintain all the historic data or just the most recent changes.
For example, you might have a target table, T_CUSTOMERS, that contains customer data. When a customer address changes, you may want to save the original address in the table instead of updating that portion of the customer row. In this case, you would create a new row containing the updated address, and preserve the original row with the old customer address. This illustrates how you might store historical information in a target table. However, if
you want the T_CUSTOMERS table to be a snapshot of current customer data, you would update the existing customer row and lose the original address.
The model you choose determines how you handle changes to existing rows. In PowerCenter, you set your update strategy at two different levels:
Within a session: When you configure a session, you can instruct the PowerCenter Server to either treat all rows in the same way (for example, treat all rows as inserts), or use instructions coded into the session mapping to flag rows for different database operations.
Within a mapping: Within a mapping, you use the Update Strategy transformation to flag rows for insert, delete, update, or reject.
Setting the Update Strategy
Use the following steps to define an update strategy:
1. To control how rows are flagged for insert, update, delete, or reject within a mapping, add an Update Strategy transformation to the mapping. Update Strategy transformations are essential if you want to flag rows destined for the same target for different database operations, or if you want to reject rows.
2. Define how to flag rows when you configure a session. You can flag all rows for insert, delete, or update, or you can select the data driven option, where the PowerCenter Server follows instructions coded into Update Strategy transformations within the session mapping.
3. Define insert, update, and delete options for each target when you configure a session. On a target-by-target basis, you can allow or disallow inserts and deletes, and you can choose three different ways to handle updates.
For the greatest degree of control over your update strategy, you add Update Strategy transformations to a mapping. The most important feature of this transformation is its update strategy expression, used to flag individual rows for insert, delete, update, or reject.
Forwarding Rejected Rows
You can configure the Update Strategy transformation to either pass rejected rows to the next transformation or drop them. By default, the PowerCenter Server forwards rejected rows to the next transformation. The PowerCenter Server flags the rows for reject and writes them to the session reject file. If you do not select Forward Rejected Rows, the PowerCenter Server drops rejected rows and writes them to the session log file.
Update Strategy Expressions
Frequently, the update strategy expression uses the IIF or DECODE function from the transformation language to test each row to see if it meets a particular condition. If it does, you can then assign each row a numeric code to flag it for a particular database operation. For example, the following IIF statement flags a row for reject if the entry date is after the apply date. Otherwise, it flags the row for update:
IIF( ( ENTRY_DATE > APPLY_DATE ), DD_REJECT, DD_UPDATE )
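The expression above can be paraphrased in Python. The DD_* numeric codes mirror the row-operation codes the text refers to (insert=0, update=1, delete=2, reject=3); the function name and integer date encoding are illustrative:

```python
# Hedged sketch of the update strategy expression: flag each row with a
# numeric code for the database operation it should receive.
DD_INSERT, DD_UPDATE, DD_DELETE, DD_REJECT = 0, 1, 2, 3

def flag_row(entry_date, apply_date):
    # IIF( ( ENTRY_DATE > APPLY_DATE ), DD_REJECT, DD_UPDATE )
    return DD_REJECT if entry_date > apply_date else DD_UPDATE
```

Downstream, a data-driven session would route each row to the operation named by its flag.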
Specifying Operations for Individual Target Tables
Once you determine how to treat all rows in the session, you also need to set update strategy options for individual targets. Define the update strategy options in the Transformations view on the Mapping tab of the session properties. You can set the following update strategy options:
Insert: Select this option to insert a row into a target table.
Delete: Select this option to delete a row from a table.
Update: You have the following options in this situation:
o Update as update: Update each row flagged for update if it exists in the target table.
o Update as insert: Insert each row flagged for update.
o Update else insert: Update the row if it exists. Otherwise, insert it.
Truncate table: Select this option to truncate the target table before loading data.
Update Strategy Checklist
Choosing an update strategy requires setting the right options within a session and possibly adding Update Strategy transformations to a mapping. This section summarizes what you need to implement different versions of an update strategy.
Only perform inserts into a target table:
When you configure the session, select Insert for the Treat Source Rows As session property. Also, make sure that you select the Insert option for all target instances in the session.
Delete all rows in a target table:
When you configure the session, select Delete for the Treat Source Rows As session property. Also, make sure that you select the Delete option for all target instances in the session.
Only perform updates on the contents of a target table:
When you configure the session, select Update for the Treat Source Rows As session property. When you configure the update options for each target table instance, make sure you select the Update option for each target instance.
Perform different database operations with different rows destined for the same target table:
Add an Update Strategy transformation to the mapping. When you write the transformation update strategy expression, use either the DECODE or IIF function to flag rows for different operations (insert, delete, update, or reject). When you configure a session that uses this mapping, select Data Driven for the Treat Source Rows As session property. Make sure that you select the Insert, Delete, or one of the Update options for each target table instance.
Reject data:
Add an Update Strategy transformation to the mapping. When you write the transformation update strategy expression, use DECODE or IIF to specify the criteria for rejecting the row. When you configure a session that uses this mapping, select Data Driven for the Treat Source Rows As session property.
Lookup Transformation (passive, connected/unconnected)
Use a Lookup transformation in a mapping to look up data in a flat file or a relational table, view, or synonym. You can import a lookup definition from any flat file or relational database to which both the PowerCenter Client and Server can connect. You can use multiple Lookup transformations in a mapping.
The PowerCenter Server queries the lookup source based on the lookup ports in the transformation. It compares Lookup transformation port values to lookup source column values based on the lookup condition. Pass the result of the lookup to other transformations and a target.
You can use the Lookup transformation to perform many tasks, including:
Get a related value: For example, your source includes employee ID, but you want to include the employee name in your target table to make your summary data easier to read.
Perform a calculation: Many normalized tables include values used in a calculation, such as gross sales per invoice or sales tax, but not the calculated value (such as net sales).
Update slowly changing dimension tables: You can use a Lookup transformation to determine whether rows already exist in the target.
You can configure the Lookup transformation to perform the following types of lookups:
Connected or unconnected: Connected and unconnected transformations receive input and send output in different ways.
Relational or flat file lookup: When you create a Lookup transformation, you can choose to perform a lookup on a flat file or a relational table.
When you create a Lookup transformation using a relational table as the lookup source, you can connect to the lookup source using ODBC and import the table definition as the structure for the Lookup transformation.
When you create a Lookup transformation using a flat file as a lookup source, the Designer invokes the Flat File Wizard. For more information about using the Flat File Wizard, see "Working with Flat Files" in the Designer Guide.
Cached or uncached: Sometimes you can improve session performance by caching the lookup table. If you cache the lookup, you can choose to use a dynamic or static cache. By default, the lookup cache remains static and does not change during the session. With a dynamic cache, the PowerCenter Server inserts or updates rows in the cache during the session. When you cache the target table as the lookup, you can look up values in the target and insert them if they do not exist, or update them if they do.
Note: If you use a flat file lookup, you must use a static cache.
Connected and Unconnected Lookups
You can configure a connected Lookup transformation to receive input directly from the mapping pipeline, or you can configure an unconnected Lookup transformation to receive input from the result of an expression in another transformation.
Differences Between Connected and Unconnected Lookups
Connected Lookup:
- Receives input values directly from the pipeline.
- You can use a dynamic or static cache.
- Cache includes all lookup columns used in the mapping (that is, lookup source columns included in the lookup condition and lookup source columns linked as output ports to other transformations).
- Can return multiple columns from the same row or insert into the dynamic lookup cache.
- If there is no match for the lookup condition, the PowerCenter Server returns the default value for all output ports. If you configure dynamic caching, the PowerCenter Server inserts rows into the cache or leaves it unchanged.
- If there is a match for the lookup condition, the PowerCenter Server returns the result of the lookup condition for all lookup/output ports. If you configure dynamic caching, the PowerCenter Server either updates the row in the cache or leaves the row unchanged.
- Passes multiple output values to another transformation. Link lookup/output ports to another transformation.
- Supports user-defined default values.
Unconnected Lookup:
- Receives input values from the result of a :LKP expression in another transformation.
- You can use a static cache only.
- Cache includes all lookup/output ports in the lookup condition and the lookup/return port.
- Designate one return port (R). Returns one column from each row.
- If there is no match for the lookup condition, the PowerCenter Server returns NULL.
- If there is a match for the lookup condition, the PowerCenter Server returns the result of the lookup condition into the return port.
- Passes one output value to another transformation. The lookup/output/return port passes the value to the transformation calling the :LKP expression.
- Does not support user-defined default values.
Relational and Flat File Lookups
When you create a Lookup transformation, you can choose to use a relational table or a flat file for the lookup source.
Relational Lookups
When you create a Lookup transformation using a relational table as a lookup source, you can connect to the lookup source using ODBC and import the table definition as the structure for the Lookup transformation.
You can use the following options with relational lookups only:
You can override the default SQL statement if you want to add a WHERE clause or query multiple tables.
You can use a dynamic lookup cache with relational lookups.
Flat File Lookups
When you use a flat file for a lookup source, you can use any flat file definition in the repository, or you can import it. When you import a flat file lookup source, the Designer invokes the Flat File Wizard.
You can use the following options with flat file lookups only:
You can use indirect files as lookup sources by specifying a file list as the lookup file name.
You can use sorted input for the lookup.
You can sort null data high or low. With relational lookups, this is based on the database support.
You can use case-sensitive string comparison with flat file lookups. With relational lookups, the case-sensitive comparison is based on the database support.
Using Sorted Input
When you configure a flat file Lookup transformation for sorted input, the condition columns must be grouped. If the condition columns are not grouped, the PowerCenter Server cannot cache the lookup and fails the session. For best caching performance, sort the condition columns.
The Lookup transformation also enables an associated ports property that you configure when you use a dynamic cache.
Session Properties for Flat File Lookups
Lookup Source File Directory: Enter the directory name. By default, the PowerCenter Server looks in the server variable directory, $PMLookupFileDir, for lookup files. You can enter the full path and file name. If you specify both the directory and file name in the Lookup Source Filename field, clear this field. The PowerCenter Server concatenates this field with the Lookup Source Filename field when it runs the session. You can also use the $InputFileName session parameter to specify the file name.
Lookup Source Filename: The name of the lookup file. If you use an indirect file, specify the name of the indirect file you want the PowerCenter Server to read. You can also use the lookup file parameter, $LookupFileName, to change the name of the lookup file a session uses. If you specify both the directory and file name in the Source File Directory field, clear this field. The PowerCenter Server concatenates this field with the Lookup Source File Directory field when it runs the session. For example, if you have "C:\lookup_data\" in the Lookup Source File Directory field, then enter "filename.txt" in the Lookup Source Filename field. When the PowerCenter Server begins the session, it looks for "C:\lookup_data\filename.txt".
Lookup Source Filetype: Indicates whether the lookup source file contains the source data or a list of files with the same file properties. Choose Direct if the lookup source file contains the source data. Choose Indirect if the lookup source file contains a list of files. When you select Indirect, the PowerCenter Server creates one cache for all files. If you use sorted input with indirect files, verify that the ranges of data in the files do not overlap. If the ranges overlap, the PowerCenter Server processes the lookup as if you did not configure sorted input.
Configuring Relational Lookups in a Session
When you configure a session, you specify the connection for the lookup database in the Connection node on the Mapping tab (Transformation view). You have the following options to specify a connection:
Choose any relational connection.
Use the connection variable, $DBConnection.
Specify a database connection for $Source or $Target information.
If you use $Source or $Target for the lookup connection, configure the $Source Connection Value and $Target Connection Value in the session properties. This ensures that the PowerCenter Server uses the correct database connection for the variable when it runs the session.
If you use $Source or $Target and you do not specify a Connection Value in the session properties, the PowerCenter Server determines the database connection to use when it runs the session. It uses a source or target database connection for the source or target in the pipeline that contains the Lookup transformation. If it cannot determine which database connection to use, it fails the session.
Lookup Condition
The PowerCenter Server uses the lookup condition to test incoming values. It is similar to the WHERE clause in an SQL query. When you configure a lookup condition for the transformation, you compare transformation input values with values in the lookup source or cache, represented by lookup ports. When you run a workflow, the PowerCenter Server queries the lookup source or cache for all incoming values based on the condition.
You must enter a lookup condition in all Lookup transformations. Some guidelines for the lookup condition apply for all Lookup transformations, and some guidelines vary depending on how you configure the transformation.
Use the following guidelines when you enter a condition for a Lookup transformation:
The datatypes in a condition must match.
Use one input port for each lookup port used in the condition. You can use the same input port in more than one condition in a transformation.
When you enter multiple conditions, the PowerCenter Server evaluates each condition as an AND, not an OR. The PowerCenter Server returns only rows that match all the conditions you specify.
The PowerCenter Server matches null values. For example, if an input lookup condition column is NULL, the PowerCenter Server evaluates the NULL equal to a NULL in the lookup.
If you configure a flat file lookup for sorted input, the PowerCenter Server fails the session if the condition columns are not grouped. If the columns are grouped, but not sorted, the PowerCenter Server processes the lookup as if you did not configure sorted input. For more information, see "Using Sorted Input."
The lookup condition guidelines and the way the PowerCenter Server processes matches can vary, depending on whether you configure the transformation for a dynamic cache or an uncached or static cache. For more information, see "Lookup Caches."
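A small Python sketch of two of the condition rules above, with invented names for ports and rows: multiple conditions combine as AND, and a NULL input matches a NULL in the lookup source.

```python
# Hedged sketch of lookup condition evaluation. condition_ports pairs an
# input port name with the lookup port it is compared against.
def matches(input_row, lookup_row, condition_ports):
    for in_port, lkp_port in condition_ports:
        a, b = input_row.get(in_port), lookup_row.get(lkp_port)
        if a is None and b is None:      # NULL is evaluated equal to NULL
            continue
        if a != b:                       # AND semantics: one failure rejects
            return False
    return True
```

A lookup row is returned only when every condition pair passes this test.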
Lookup Caches
You can configure a Lookup transformation to cache the lookup file or table. The PowerCenter Server builds a cache in memory when it processes the first row of data in a cached Lookup transformation. It allocates memory for the cache based on the amount you configure in the transformation or session properties. The PowerCenter Server stores condition values in the index cache and output values in the data cache. The PowerCenter Server queries the cache for each row that enters the transformation.
The PowerCenter Server also creates cache files by default in the $PMCacheDir. If the data does not fit in the memory cache, the PowerCenter Server stores the overflow values in the cache files. When the session completes, the PowerCenter Server releases cache memory and deletes the cache files unless you configure the Lookup transformation to use a persistent cache.
When configuring a lookup cache, you can specify any of the following options:
Persistent cache
Recache from lookup source
Static cache
Dynamic cache
Shared cache
Note: You can use a dynamic cache for relational lookups only.
Configuring Unconnected Lookup Transformations
An unconnected Lookup transformation is separate from the pipeline in the mapping. You write an expression using the :LKP reference qualifier to call the lookup within another transformation. Some common uses for unconnected lookups include:
Testing the results of a lookup in an expression
Filtering rows based on the lookup results
Marking rows for update based on the result of a lookup, such as updating slowly changing dimension tables
Calling the same lookup multiple times in one mapping
Complete the following steps when you configure an unconnected Lookup transformation:
1. Add input ports.
2. Add the lookup condition.
3. Designate a return value.
4. Call the lookup from another transformation.
Call the Lookup Through an Expression
You supply input values for an unconnected Lookup transformation from a :LKP expression in another transformation. The arguments are local input ports that match the Lookup transformation input ports used in the lookup condition. Use the following syntax for a :LKP expression:
:LKP.lookup_transformation_name(argument, argument, ...)
To continue the example about the retail store, when you write the update strategy expression, the order of ports in the expression must match the order in the lookup condition. In this case, the ITEM_ID condition is the first lookup condition, and therefore, it is the first argument in the update strategy expression.
IIF( ISNULL( :LKP.lkpITEMS_DIM( ITEM_ID, PRICE )), DD_UPDATE, DD_REJECT )
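The :LKP call above behaves like a function invocation from an expression. A hedged Python analogy, modeling the lookup cache as a dict keyed on the condition ports and a miss as None (NULL); the dict contents and function names are invented for illustration:

```python
# Hedged sketch: an unconnected lookup as a function call with one return
# port. A cache miss returns None, mirroring the NULL return on no match.
ITEMS_DIM = {(101, 9.99): "existing"}    # illustrative cache contents

def lkp_items_dim(item_id, price):
    return ITEMS_DIM.get((item_id, price))   # one return value per call

def update_strategy(item_id, price, DD_UPDATE=1, DD_REJECT=3):
    # IIF( ISNULL( :LKP.lkpITEMS_DIM(ITEM_ID, PRICE) ), DD_UPDATE, DD_REJECT )
    return DD_UPDATE if lkp_items_dim(item_id, price) is None else DD_REJECT
```

A row whose (ITEM_ID, PRICE) pair is missing from the dimension is flagged for update; a matching row is flagged for reject, as in the expression above.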
Lookup Caches Overview
If you use a flat file lookup, the PowerCenter Server always caches the lookup source. If you configure a flat file lookup for sorted input, the PowerCenter Server cannot cache the lookup if the condition columns are not grouped. If the columns are grouped, but not sorted, the PowerCenter Server processes the lookup as if you did not configure sorted input.
When configuring a lookup cache, you can specify any of the following options:
Persistent cache: You can save the lookup cache files and reuse them the next time the PowerCenter Server processes a Lookup transformation configured to use the cache.
Recache from source: If the persistent cache is not synchronized with the lookup table, you can configure the Lookup transformation to rebuild the lookup cache.
Static cache: You can configure a static, or read-only, cache for any lookup source. By default, the PowerCenter Server creates a static cache. It caches the lookup file or table and looks up values in the cache for each row that comes into the transformation. When the lookup condition is true, the PowerCenter Server returns a value from the lookup cache. The PowerCenter Server does not update the cache while it processes the Lookup transformation.
Dynamic cache: If you want to cache the target table and insert new rows or update existing rows in the cache and the target, you can create a Lookup transformation to use a dynamic cache. The PowerCenter Server dynamically inserts or updates data in the lookup cache and passes data to the target table. You cannot use a dynamic cache with a flat file lookup.
Shared cache: You can share the lookup cache between multiple transformations. You can share an unnamed cache between transformations in the same mapping. You can share a named cache between transformations in the same or different mappings.
When you do not configure the Lookup transformation for caching, the PowerCenter Server queries the lookup table for each input row. The result of the lookup query and processing is the same, whether or not you cache the lookup table. However, using a lookup cache can increase session performance. Optimize performance by caching the lookup table when the source table is large.
Cache Comparison
The differences between an uncached lookup, a static cache, and a dynamic cache are summarized in Table 9-1. Lookup Caching Comparison:
Uncached:
- You cannot insert or update the cache.
- You cannot use a flat file lookup.
- When the condition is true, the PowerCenter Server returns a value from the lookup table or cache. When the condition is not true, the PowerCenter Server returns the default value for connected transformations and NULL for unconnected transformations.
Static Cache:
- You cannot insert or update the cache.
- You can use a relational or a flat file lookup.
- When the condition is true, the PowerCenter Server returns a value from the lookup table or cache. When the condition is not true, the PowerCenter Server returns the default value for connected transformations and NULL for unconnected transformations.
Dynamic Cache:
- You can insert or update rows in the cache as you pass rows to the target.
- You can use a relational lookup only.
- When the condition is true, the PowerCenter Server either updates rows in the cache or leaves the cache unchanged, depending on the row type. This indicates that the row is in the cache and target table. You can pass updated rows to the target table. When the condition is not true, the PowerCenter Server either inserts rows into the cache or leaves the cache unchanged, depending on the row type. This indicates that the row is not in the cache or target table. You can pass inserted rows to the target table.
Note: The PowerCenter Server uses the same transformation logic to process a Lookup transformation whether you configure it to use a static cache or no cache. However, when you configure the transformation to use no cache, the PowerCenter Server queries the lookup table instead of the lookup cache.
Working with an Uncached Lookup or Static Cache
By default, the PowerCenter Server creates a static lookup cache when you configure a Lookup transformation for caching. The PowerCenter Server builds the cache when it processes the first lookup request. It queries the cache based on the lookup condition for each row that passes into the transformation. The PowerCenter Server does not update the cache while it processes the transformation. The PowerCenter Server processes an uncached lookup the same way it processes a cached lookup, except that it queries the lookup source instead of building and querying the cache.
Working with a Dynamic Lookup Cache
For relational lookups, you might want to configure the transformation to use a dynamic cache when the target table is also the lookup table. The PowerCenter Server builds the cache when it processes the first lookup request. It queries the cache based on the lookup condition for each row that passes into the transformation. When you use a dynamic cache, the PowerCenter Server updates the lookup cache as it passes rows to the target.
When the PowerCenter Server reads a row from the source, it updates the lookup cache by performing one of the following actions:
Inserts the row into the cache. The row is not in the cache and you specified to insert rows into the cache. You can configure the transformation to insert rows into the cache based on input ports or generated sequence IDs. The PowerCenter Server flags the row as insert.
Updates the row in the cache. The row exists in the cache and you specified to update rows in the cache. The PowerCenter Server flags the row as update. The PowerCenter Server updates the row in the cache based on the input ports.
Makes no change to the cache. The row exists in the cache and you specified to insert new rows only. Or, the row is not in the cache and you specified to update existing rows only. Or, the row is in the cache, but based on the lookup condition, nothing changes. The PowerCenter Server flags the row as unchanged.
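The insert/update/no-change behavior above is conceptually similar to a SQL MERGE against the target table. The sketch below is an analogy only: the PowerCenter Server applies this logic to its in-memory cache row by row, and the table and column names are hypothetical.

```sql
-- Conceptual analogy of dynamic lookup cache behavior; PowerCenter does not
-- actually issue a MERGE statement.
MERGE INTO customer_dim tgt
USING (SELECT 42 AS customer_id, 'Ann' AS new_name FROM dual) src
   ON (tgt.customer_id = src.customer_id)      -- the lookup condition
WHEN MATCHED THEN
  UPDATE SET tgt.cust_name = src.new_name      -- "updates the row in the cache"
WHEN NOT MATCHED THEN
  INSERT (customer_id, cust_name)
  VALUES (src.customer_id, src.new_name);      -- "inserts the row into the cache"
```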
Lookup Cache Tips
Use the following tips when you configure the Lookup transformation to cache the lookup table:
Cache small lookup tables. Improve session performance by caching small lookup tables. The result of the lookup query and processing is the same, whether or not you cache the lookup table.
Use a persistent lookup cache for static lookup tables. If the lookup table does not change between sessions, configure the Lookup transformation to use a persistent lookup cache. The PowerCenter Server then saves and reuses cache files from session to session, eliminating the time required to read the lookup table.
Source Qualifier Transformation (active/connected)
When you add a relational or a flat file source definition to a mapping, you need to connect it to a Source Qualifier transformation. The Source Qualifier transformation represents the rows that the PowerCenter Server reads when it runs a session.
You can use the Source Qualifier transformation to perform the following tasks:
Join data originating from the same source database. You can join two or more tables with primary key-foreign key relationships by linking the sources to one Source Qualifier transformation.
Filter rows when the PowerCenter Server reads source data. If you include a filter condition, the PowerCenter Server adds a WHERE clause to the default query.
Specify an outer join rather than the default inner join. If you include a user-defined join, the PowerCenter Server replaces the join information specified by the metadata in the SQL query.
Specify sorted ports. If you specify a number for sorted ports, the PowerCenter Server adds an ORDER BY clause to the default SQL query.
Select only distinct values from the source. If you choose Select Distinct, the PowerCenter Server adds a SELECT DISTINCT statement to the default SQL query.
Create a custom query to issue a special SELECT statement for the PowerCenter Server to read source data. For example, you might use a custom query to perform aggregate calculations.
Parameters and Variables
You can use mapping parameters and variables in the SQL query, user-defined join, and source filter of a Source Qualifier transformation. You can also use the system variable $$$SessStartTime.
The PowerCenter Server first generates an SQL query and replaces each mapping parameter or variable with its start value. Then it runs the query on the source database.
When you use a string mapping parameter or variable in the Source Qualifier transformation, use a string identifier appropriate to the source system. Most databases use a single quotation mark as a string identifier. For example, to use the string parameter $$IPAddress in a source filter for a Microsoft SQL Server database table, enclose the parameter in single quotes as follows: '$$IPAddress'.
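As a sketch of what parameter substitution produces (the table and column names, and the start value, are hypothetical):

```sql
-- Query the PowerCenter Server might generate after replacing the mapping
-- parameter $$IPAddress with its start value in the source filter.
SELECT SESSIONS.SESSION_ID, SESSIONS.IP_ADDRESS
FROM SESSIONS
WHERE SESSIONS.IP_ADDRESS = '10.0.0.1'   -- was: '$$IPAddress'
```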
For relational sources, the PowerCenter Server generates a query for each Source Qualifier transformation when it runs a session. The default query is a SELECT statement for each source column used in the mapping. In other words, the PowerCenter Server reads only the columns that are connected to another transformation.
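For example, if a mapping connects only two columns of a hypothetical ITEMS source to downstream transformations, the generated default query selects just those columns:

```sql
-- Default query sketch: only the connected columns appear in the SELECT list.
SELECT ITEMS.ITEM_ID, ITEMS.PRICE
FROM ITEMS
```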
Overriding the Default Query
You can alter or override the default query in the Source Qualifier transformation by changing the default settings of the transformation properties. Do not change the list of selected ports or the order in which they appear in the query. This list must match the connected transformation output ports.
When you edit transformation properties, the Source Qualifier transformation includes these settings in the default query. However, if you enter an SQL query, the PowerCenter Server uses only the defined SQL statement. The SQL Query overrides the User-Defined Join, Source Filter, Number of Sorted Ports, and Select Distinct settings in the Source Qualifier transformation.
Note: When you override the default SQL query, you must enclose all database reserved words in quotes.
Joining Source Data
You can use one Source Qualifier transformation to join data from multiple relational tables. These tables must be accessible from the same instance or database server.
When a mapping uses related relational sources, you can join both sources in one Source Qualifier transformation. During the session, the source database performs the join before passing data to the PowerCenter Server. This can increase performance when source tables are indexed.
Tip: Use the Joiner transformation for heterogeneous sources and to join flat files.
Custom Joins
If you need to override the default join, you can enter the contents of the WHERE clause that specifies the join in the custom query.
You might need to override the default join under the following circumstances:
Columns do not have a primary key-foreign key relationship.
The datatypes of columns used for the join do not match.
You want to specify a different type of join, such as an outer join.
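Because the override supplies WHERE-clause contents, a user-defined join between two hypothetical tables that lack a declared key relationship might be entered as:

```sql
-- User-Defined Join entry: WHERE-clause contents only, no WHERE keyword.
ORDERS.CUST_CODE = CUSTOMERS.CODE
-- The server appends it, producing a default query roughly like:
--   SELECT ... FROM ORDERS, CUSTOMERS
--   WHERE ORDERS.CUST_CODE = CUSTOMERS.CODE
```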
Heterogeneous Joins
To perform a heterogeneous join, use the Joiner transformation. Use the Joiner transformation when you need to join the following types of sources:
Join data from different source databases
Join data from different flat file systems
Join relational sources and flat files
Adding an SQL Query
The Source Qualifier transformation provides the SQL Query option to override the default query. You can enter an SQL statement supported by your source database. Before entering the query, connect all the input and output ports you want to use in the mapping.
When you edit the SQL Query, you can generate and edit the default query. When the Designer generates the default query, it incorporates all other configured options, such as a filter or number of sorted ports. The resulting query overrides all other options you might subsequently configure in the transformation.
You can include mapping parameters and variables in the SQL Query. When including a string mapping parameter or variable, use a string identifier appropriate to the source system. For most databases, you should enclose the name of a string parameter or variable in single quotes.
You can use the Source Qualifier and the Application Source Qualifier transformations to perform an outer join of two sources in the same database. When the PowerCenter Server performs an outer join, it returns all rows from one source table and rows from the second source table that match the join condition.
Locations for Entering Outer Join Syntax

Source Qualifier transformation:
- User-Defined Join setting: Create a join override. During the session, the PowerCenter Server appends the join override to the WHERE clause of the default query.
- SQL Query setting: Enter join syntax immediately after the WHERE in the default query.

Application Source Qualifier transformation:
- Join Override setting: Create a join override. During the session, the PowerCenter Server appends the join override to the WHERE clause of the default query.
- Extract Override setting: Enter join syntax immediately after the WHERE in the default query.
You can combine left outer and right outer joins with normal joins in a single source qualifier. You can use multiple normal joins and multiple left outer joins.
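As a sketch, a join override for a left outer join over two hypothetical tables could look like the following; this assumes PowerCenter's brace-delimited outer-join syntax, so check the exact form against your PowerCenter version's documentation:

```sql
-- Join override sketch (outer joins are wrapped in braces in PowerCenter's
-- join syntax; table and column names are hypothetical).
{ CUSTOMERS LEFT OUTER JOIN ORDERS on CUSTOMERS.CUST_ID = ORDERS.CUST_ID }
```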
You can enter a source filter to reduce the number of rows the PowerCenter Server queries. If you include the string 'WHERE' or large objects in the source filter, the PowerCenter Server fails the session.
When you use sorted ports, the PowerCenter Server adds the ports to the ORDER BY clause in the default query. The PowerCenter Server adds the configured number of ports, starting at the top of the Source Qualifier transformation. You might use sorted ports to improve performance when a downstream transformation in the mapping, such as an Aggregator or Joiner configured for sorted input, can take advantage of the sort order.
If you want the PowerCenter Server to select unique values from a source, you can use the Select Distinct option. You might use this feature to extract unique customer IDs from a table listing total sales. Using Select Distinct filters out unnecessary data earlier in the data flow, which might improve performance.
By default, the Designer generates a SELECT statement. If you choose Select Distinct, the Source Qualifier transformation includes the setting in the default SQL query.
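For example, with Select Distinct enabled and only a customer ID column connected, the default query for a hypothetical SALES table would become:

```sql
-- Select Distinct sketch: duplicates are filtered at the source database.
SELECT DISTINCT SALES.CUSTOMER_ID
FROM SALES
```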
You can add pre- and post-session SQL commands on the Properties tab in the Source Qualifier transformation. You might want to use pre-session SQL to write a timestamp row to the source table when a session begins.
The PowerCenter Server runs pre-session SQL commands against the source database before it reads the source. It runs post-session SQL commands against the source database after it writes to the target.
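A pre-session SQL command of this kind might be the following sketch; the audit table and session name are hypothetical:

```sql
-- Pre-session SQL sketch: record when the session began reading the source.
INSERT INTO SRC_AUDIT (SESSION_NAME, STARTED_AT)
VALUES ('s_m_load_items', CURRENT_TIMESTAMP)
```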
Stored Procedure Transformation (passive, connected/unconnected)
A Stored Procedure transformation is an important tool for populating and maintaining databases. Database administrators create stored procedures to automate tasks that are too complicated for standard SQL statements.
A stored procedure is a precompiled collection of Transact-SQL, PL/SQL, or other database procedural statements and optional flow control statements, similar to an executable script. Stored procedures are stored and run within the database. You can run a stored procedure with the EXECUTE SQL statement in a database client tool, just as you can run SQL statements. Unlike standard SQL, however, stored procedures allow user-defined variables, conditional statements, and other powerful programming features.
One of the most useful features of stored procedures is the ability to send data to the stored procedure, and receive data from the stored procedure.
Input/Output Parameters
For many stored procedures, you provide a value and receive a value in return. These values are known as input and output parameters. For example, a sales tax calculation stored procedure can take a single input parameter, such as the price of an item. After performing the calculation, the stored procedure returns two output parameters: the amount of tax, and the total cost of the item including the tax.
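The sales tax example could be sketched as a PL/SQL procedure with one input and two output parameters; the procedure name, parameter names, and the 8% rate are all illustrative:

```sql
-- Illustrative sketch of the sales tax procedure described above.
CREATE OR REPLACE PROCEDURE calc_sales_tax (
    price      IN  NUMBER,   -- input parameter: item price
    tax        OUT NUMBER,   -- output parameter: tax amount
    total_cost OUT NUMBER    -- output parameter: price including tax
) AS
BEGIN
    tax        := price * 0.08;   -- hypothetical 8% tax rate
    total_cost := price + tax;
END;
```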
The Stored Procedure transformation sends and receives input and output parameters using ports, variables, or by entering a value in an expression.
The Stored Procedure transformation captures return values in a manner similar to input/output parameters, depending on how the input/output parameters are captured. In some instances, only a parameter or a return value can be captured.
Connected and Unconnected
Stored procedures run in either connected or unconnected mode. The mode you use depends on what the stored procedure does and how you plan to use it in your session. You can configure connected and unconnected Stored Procedure transformations in a mapping.
Connected. The flow of data through a mapping in connected mode also passes through the Stored Procedure transformation. All data entering the transformation through the input ports affects the stored procedure. You should use a connected Stored Procedure transformation when you need data from an input port sent as an input parameter to the stored procedure, or the results of a stored procedure sent as an output parameter to another transformation.
Unconnected. The unconnected Stored Procedure transformation is not connected directly to the flow of the mapping. It either runs before or after the session, or is called by an expression in another transformation in the mapping.
Comparison of Connected and Unconnected Stored Procedure Transformations
Run a stored procedure before or after your session: Unconnected.
Run a stored procedure once during your mapping, such as pre- or post-session: Unconnected.
Run a stored procedure every time a row passes through the Stored Procedure transformation: Connected or Unconnected.
Run a stored procedure based on data that passes through the mapping, such as when a specific port does not contain a null value: Unconnected.
Pass parameters to the stored procedure and receive a single output parameter: Connected or Unconnected.
Pass parameters to the stored procedure and receive multiple output parameters: Connected or Unconnected. (Note: To get multiple output parameters from an unconnected Stored Procedure transformation, you must create variables for each output parameter.)
Run nested stored procedures: Unconnected.
Call multiple times within a mapping: Unconnected.
To use a Stored Procedure transformation:
1. Create the stored procedure in the database. Before using the Designer to create the transformation, you must create the stored procedure in the database. You should also test the stored procedure through the provided database client tools.
2. Import or create the Stored Procedure transformation. Use the Designer to import or create the Stored Procedure transformation, providing ports for any necessary input/output and return values.
3. Determine whether to use the transformation as connected or unconnected. You must determine how the stored procedure relates to the mapping before configuring the transformation.
4. If connected, map the appropriate input and output ports. You use connected Stored Procedure transformations just as you would most other transformations. Click and drag the appropriate input flow ports to the transformation, and create mappings from output ports to other transformations.
5. If unconnected, either configure the stored procedure to run pre- or post-session, or configure it to run from an expression in another transformation. Since stored procedures can run before or after the session, you may need to specify when the unconnected transformation should run. On the other hand, if the stored procedure is called from another transformation, you write the expression in that other transformation. The expression can contain variables, and may or may not include a return value.
6. Configure the session. The session properties in the Workflow Manager include options for error handling when running stored procedures and several SQL override options.
Oracle
In Oracle, any stored procedure that returns a value is called a stored function. Rather than using the CREATE PROCEDURE statement to make a new stored procedure based on the example, you use the CREATE FUNCTION statement. In this sample, the variables are declared as IN and OUT, but Oracle also supports an IN OUT parameter type, which allows you to pass in a parameter, modify it, and return the modified value:
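The original sample code is not reproduced in this copy; a hedged reconstruction of such a stored function, with one IN and one OUT parameter plus a return value, might look like this (names and the tax rate are illustrative):

```sql
-- Reconstruction sketch: an Oracle stored function returning a value,
-- with IN and OUT parameters as described above.
CREATE OR REPLACE FUNCTION calc_tax_fn (
    price IN  NUMBER,
    tax   OUT NUMBER
) RETURN NUMBER AS
BEGIN
    tax := price * 0.08;     -- hypothetical 8% rate
    RETURN price + tax;      -- total cost returned as the function result
END;
```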
Configuring an Unconnected Transformation
An unconnected Stored Procedure transformation is not directly connected to the flow of data through the mapping. Instead, the stored procedure runs either:
From an expression. Called from an expression written in the Expression Editor within another transformation in the mapping.
Pre- or post-session. Runs before or after a session.
When using an unconnected Stored Procedure transformation in an expression, you need a method of returning the value of output parameters to a port. Use one of the following methods to capture the output values:
Assign the output value to a local variable.
Assign the output value to the system variable PROC_RESULT.
By using PROC_RESULT, you assign the value of the return parameter directly to an output port, which can apply directly to a target. You can also combine the two options by assigning one output parameter as PROC_RESULT, and the other parameter as a variable.
Use PROC_RESULT only within an expression. If you do not use PROC_RESULT or a variable, the port containing the expression captures a NULL. You cannot use PROC_RESULT in a connected Lookup transformation or within the Call Text for a Stored Procedure transformation.
If you require nested stored procedures, where the output parameter of one stored procedure passes to another stored procedure, use PROC_RESULT to pass the value.
The PowerCenter Server calls the unconnected Stored Procedure transformation from the Expression transformation. In the example mapping, the Stored Procedure transformation has two input ports and one output port. All three ports are string datatypes.
Unconnected Stored Procedure transformations can be called from an expression in another transformation. Use the following rules and guidelines when configuring the expression:
A single output parameter is returned using the variable PROC_RESULT.
When you use a stored procedure in an expression, use the :SP reference qualifier. To avoid typing errors, select the Stored Procedure node in the Expression Editor, and double-click the name of the stored procedure.
The same instance of a Stored Procedure transformation cannot run in both connected and unconnected mode in a mapping. You must create different instances of the transformation.
The input/output parameters in the expression must match the input/output ports in the Stored Procedure transformation. If the stored procedure has an input parameter, there must also be an input port in the Stored Procedure transformation.
When you write an expression that includes a stored procedure, list the parameters in the same order that they appear in the stored procedure and the Stored Procedure transformation.
The parameters in the expression must include all of the parameters in the Stored Procedure transformation. You cannot leave out an input parameter. If necessary, pass a dummy variable to the stored procedure.
The arguments in the expression must be the same datatype and precision as those in the Stored Procedure transformation.
Use PROC_RESULT to apply the output parameter of a stored procedure expression directly to a target. You cannot use a variable for the output parameter to pass the results directly to a target. Use a local variable to pass the results to an output port within the same transformation.
Nested stored procedures allow passing the return value of one stored procedure as the input parameter of another stored procedure.
Union Transformation (active, connected)
The Union transformation is a multiple input group transformation that you can use to merge data from multiple pipelines or pipeline branches into one pipeline branch. It merges data from multiple sources, similar to the UNION ALL SQL statement, which combines the results from two or more SQL statements. Like the UNION ALL statement, the Union transformation does not remove duplicate rows.
You can connect heterogeneous sources to a Union transformation. The Union transformation merges sources with matching ports and outputs the data from one output group with the same ports as the input groups.
A Union transformation has multiple input groups and one output group. Create input groups on the Groups tab, and create ports on the Group Ports tab.
You can create one or more input groups on the Groups tab. The Designer creates one output group by default. You cannot edit or delete the output group.
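The SQL statement the Union transformation resembles is shown below as a sketch; the tables are hypothetical. Note that, just as with the transformation, duplicate rows are kept, and both SELECT lists must match, like matching input groups.

```sql
-- UNION ALL keeps duplicates; column lists of the two branches must align.
SELECT CUST_ID, CUST_NAME FROM CUSTOMERS_EAST
UNION ALL
SELECT CUST_ID, CUST_NAME FROM CUSTOMERS_WEST
```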
Using a Union Transformation in Mappings
The Union transformation is a non-blocking multiple input group transformation. You can connect the input groups to different branches in a single pipeline or to different source pipelines.
When you add a Union transformation to a mapping, you must verify that you connect the same ports in all input groups. If you connect all ports in one input group, but do not connect a port in another input group, the PowerCenter Server passes NULLs to the unconnected port.
Custom Transformation (active/passive, connected)
Custom transformations operate in conjunction with procedures you create outside of the Designer interface to extend PowerCenter functionality. You can create a Custom transformation and bind it to a procedure that you develop using the Custom transformation functions. You can use the Custom transformation to create transformation applications, such as sorting and aggregation, which require all input rows to be processed before outputting any output rows. To support this process, the input and output functions occur separately in Custom transformations, in contrast to External Procedure transformations.
The PowerCenter Server passes the input data to the procedure using an input function. The output function is a separate function that you must enter in the procedure code to pass output data to the PowerCenter Server. In contrast, in the External Procedure transformation, an external procedure function does both input and output, and its parameters consist of all the ports of the transformation.
Code Page Compatibility
The Custom transformation procedure code page is the code page of the data the Custom transformation procedure processes. The following factors determine the Custom transformation procedure code page:
PowerCenter Server data movement mode
)he (#F1_")"hange:tringMode,0 function
)he (#F1_"):et8ata"ode/age(8,0 function
)he "ustom transformation procedure code page must !e two&way compati!le with the /ower"enter :erer code
page. )he /ower"enter :erer passes data to the procedure in the "ustom transformation procedure code page.
1lso, the data the procedure passes to the /ower"enter :erer must !e alid characters in the "ustom
transformation procedure code page.
1 "ustom transformation has !oth input and output groups. (t also can hae input ports, output ports, and
input2output ports. $ou create and edit groups and ports on the /orts ta! of the "ustom transformation. $ou can
also define the relationship !etween input and output ports on the /orts ta!.
Creating Groups and Ports
You can create multiple input groups and multiple output groups in a Custom transformation. You must create at least one input group and one output group. To create an input group, click the Create Input Group icon. To create an output group, click the Create Output Group icon. When you create a group, the Designer adds it as the last group. When you create a passive Custom transformation, you can only create one input group and one output group.
To create a port, click the Add button. When you create a port, the Designer adds it below the currently selected row or group. Each port contains attributes defined on the Port Attribute Definitions tab. You can edit the attributes for each port.
By default, an output port in a Custom transformation depends on all input ports. However, you can define the relationship between input and output ports in a Custom transformation. When you do this, you can view link paths in a mapping containing a Custom transformation, and you can see which input ports an output port depends on. You can also view source column dependencies for target ports in a mapping containing a Custom transformation.
To define the relationship between ports in a Custom transformation, create a port dependency. A port dependency is the relationship between an output or input/output port and one or more input or input/output ports. When you create a port dependency, base it on the procedure logic in the code.
Custom Transformation Properties
Module Identifier. The module name. Enter only ASCII characters in this field; you cannot enter multibyte characters. This property is the base name of the DLL or the shared library that contains the procedure. The Designer uses this name to create the C file when you generate the external procedure code.
Function Identifier. The name of the procedure in the module. Enter only ASCII characters in this field; you cannot enter multibyte characters. The Designer uses this name to create the C file where you enter the procedure code.
Runtime Location. The location that contains the DLL or shared library. The default is $PMExtProcDir. Enter a path relative to the PowerCenter Server machine that runs the session using the Custom transformation. If you leave this property blank, the PowerCenter Server uses the environment variable defined on the PowerCenter Server machine to locate the DLL or shared library. You must copy all DLLs or shared libraries to the runtime location or to the environment variable defined on the PowerCenter Server machine. The PowerCenter Server fails to load the procedure when it cannot locate the DLL, shared library, or a referenced file.
Tracing Level. Amount of detail displayed in the session log for this transformation. The default is Normal.
Is Partitionable. Specifies whether or not you can create multiple partitions in a pipeline that uses this transformation. This property is disabled by default.
Inputs Must Block. Specifies whether or not the procedure associated with the transformation must be able to block incoming data. This property is enabled by default.
Is Active. Specifies whether this transformation is an active or passive transformation. You cannot change this property after you create the Custom transformation. If you need to change this property, create a new Custom transformation and select the correct property value.
Update Strategy Transformation. Specifies whether or not this transformation defines the update strategy for output rows. This property is disabled by default. You can enable this for active Custom transformations.
Transformation Scope. Specifies how the PowerCenter Server applies the transformation logic to incoming data: Row, Transaction, or All Input. When the transformation is passive, this property is always Row. When the transformation is active, this property is All Input by default.
Generate Transaction. Specifies whether or not this transformation can generate transactions. When a Custom transformation generates transactions, it does so for all output groups. This property is disabled by default. You can only enable this for active Custom transformations.
Output is Repeatable. Specifies whether the order of the output data is consistent between session runs. Never: the order of the output data is inconsistent between session runs; this is the default for active transformations. Based On Input Order: the output order is consistent between session runs when the input data order is consistent between session runs; this is the default for passive transformations. Always: the order of the output data is consistent between session runs even if the order of the input data is inconsistent between session runs.
By default, the PowerCenter Server concurrently reads sources in a target load order group. However, you can write the external procedure code to block input data on some input groups. Blocking is the suspension of the data flow into an input group of a multiple input group transformation. To use a Custom transformation to block input data, you must write the procedure code to block and unblock data. You must also enable blocking on the Properties tab for the Custom transformation.
XML Source Qualifier Transformation (active/connected)
You can add an XML Source Qualifier transformation to a mapping by dragging an XML source definition to the Mapping Designer workspace or by manually creating one. When you add an XML source definition to a mapping, you need to connect it to an XML Source Qualifier transformation. The XML Source Qualifier transformation defines the data elements that the PowerCenter Server reads when it executes a session. It determines how the PowerCenter Server reads the source data.
An XML Source Qualifier transformation always has one input or output port for every column in the XML source. When you create an XML Source Qualifier transformation for a source definition, the Designer links each port in the XML source definition to a port in the XML Source Qualifier transformation. You cannot remove or edit any of the links. If you remove an XML source definition from a mapping, the Designer also removes the corresponding XML Source Qualifier transformation. You can link one XML source definition to one XML Source Qualifier transformation.
Normalizer Transformation (active/connected)
Normalization is the process of organizing data. In database terms, this includes creating normalized tables and establishing relationships between those tables according to rules designed to both protect the data and make the database more flexible by eliminating redundancy and inconsistent dependencies.
The Normalizer transformation normalizes records from COBOL and relational sources, allowing you to organize the data according to your own needs. A Normalizer transformation can appear anywhere in a pipeline when you
normalize a relational source. Use a Normalizer transformation instead of the Source Qualifier transformation when you normalize a COBOL source. When you drag a COBOL source into the Mapping Designer workspace, the Mapping Designer creates a Normalizer transformation with input and output ports for every column in the source.
You primarily use the Normalizer transformation with COBOL sources, which are often stored in a denormalized format. The OCCURS statement in a COBOL file nests multiple records of information in a single record. Using the Normalizer transformation, you break out repeated data within a record into separate records. For each new record it creates, the Normalizer transformation generates a unique identifier. You can use this key value to join the normalized records.
You can also use the Normalizer transformation with relational sources to create multiple rows from a single row of data.
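Creating multiple rows from a single row, as the Normalizer does for relational sources, is analogous to the following SQL sketch over a hypothetical denormalized table with four quarterly sales columns:

```sql
-- Analogy: pivot four repeating columns into four rows, one per occurrence.
SELECT STORE_ID, 1 AS QUARTER, Q1_SALES AS SALES FROM STORE_SALES
UNION ALL
SELECT STORE_ID, 2, Q2_SALES FROM STORE_SALES
UNION ALL
SELECT STORE_ID, 3, Q3_SALES FROM STORE_SALES
UNION ALL
SELECT STORE_ID, 4, Q4_SALES FROM STORE_SALES
```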
Performance Tuning
The goal of performance tuning is to optimize session performance by eliminating performance bottlenecks. To tune the performance of a session, first identify a performance bottleneck, eliminate it, and then identify the next performance bottleneck until you are satisfied with the session performance. You can use the test load option to run sessions when you tune session performance.
The most common performance bottleneck occurs when the PowerCenter Server writes to a target database. You can identify performance bottlenecks by the following methods:
Running test sessions. You can configure a test session to read from a flat file source or to write to a flat file target to identify source and target bottlenecks.
Studying performance details. You can create a set of information called performance details to identify session bottlenecks. Performance details provide information such as buffer input and output efficiency.
Monitoring system performance. You can use system monitoring tools to view percent CPU usage, I/O waits, and paging to identify system bottlenecks.
Once you determine the location of a performance bottleneck, you can eliminate the bottleneck by following these guidelines:
Eliminate source and target database bottlenecks. Have the database administrator optimize database performance by optimizing the query, increasing the database network packet size, or configuring index and key constraints.
Eliminate mapping bottlenecks. Fine-tune the pipeline logic and transformation settings and options in mappings to eliminate mapping bottlenecks.
Eliminate session bottlenecks. You can optimize the session strategy and use performance details to help tune session configuration.
Eliminate system bottlenecks. Have the system administrator analyze information from system monitoring tools and improve CPU and network performance.
If you tune all the bottlenecks above, you can further optimize session performance by increasing the number of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the system hardware while processing the session.
Because determining the best way to improve performance can be complex, change only one variable at a time, and time the session both before and after the change. If session performance does not improve, you might want to return to your original configurations.
Identifying the Performance Bottleneck
The first step in performance tuning is to identify the performance bottleneck. Performance bottlenecks can occur in the source and target databases, the mapping, the session, and the system. Generally, you should look for performance bottlenecks in the following order:
1. Target
2. Source
3. Mapping
4. Session
5. System
You can identify performance bottlenecks by running test sessions, viewing performance details, and using system monitoring tools.
Identifying Target Bottlenecks
The most common performance bottleneck occurs when the PowerCenter Server writes to a target database. You can identify target bottlenecks by configuring the session to write to a flat file target. If the session performance increases significantly when you write to a flat file, you have a target bottleneck.
If your session already writes to a flat file target, you probably do not have a target bottleneck. You can optimize session performance by writing to a flat file target local to the PowerCenter Server.
Causes for a target bottleneck may include small checkpoint intervals, small database network packet size, or problems during heavy loading operations. For details, see Optimizing the Target Database below.
Identifying Source Bottlenecks
Performance bottlenecks can occur when the PowerCenter Server reads from a source database. If your session reads from a flat file source, you probably do not have a source bottleneck. You can improve session performance by setting the number of bytes the PowerCenter Server reads per line if you read from a flat file source.
If the session reads from a relational source, you can use a Filter transformation, a read test mapping, or a database query to identify source bottlenecks.
Using a Filter Transformation
You can use a Filter transformation in the mapping to measure the time it takes to read source data.
Add a Filter transformation in the mapping after each source qualifier. Set the filter condition to false so that no data is processed past the Filter transformation. If the time it takes to run the new session remains about the same, then you have a source bottleneck.
Using a Read Test Session
You can create a read test mapping to identify source bottlenecks. A read test mapping isolates the read query by removing the transformations in the mapping. Use the following steps to create a read test mapping:
1. Make a copy of the original mapping.
2. In the copied mapping, keep only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to a file target.
Use the read test mapping in a test session. If the test session performance is similar to the original session, you have a source bottleneck.
Using a Database Query
You can identify source bottlenecks by executing the read query directly against the source database.
Copy the read query directly from the session log. Execute the query against the source database with a query tool such as isql. On Windows, you can load the result of the query in a file. On UNIX systems, you can send the result of the query to /dev/null.
Measure the query execution time and the time it takes for the query to return the first row. If there is a long delay between the two time measurements, you can use an optimizer hint to eliminate the source bottleneck.
Causes for a source bottleneck may include an inefficient query or small database network packet sizes. For details, see Optimizing the Source Database below.
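The two measurements described above can be sketched with Python's DB-API, using the stdlib sqlite3 module as a stand-in for the source database (in practice you would run the session-log query through isql or a similar tool):

```python
import sqlite3
import time

# A throwaway in-memory table standing in for the source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)", [(i, "x") for i in range(1000)])

cur = conn.execute("SELECT * FROM src ORDER BY id")  # the "read query" from the session log
start = time.perf_counter()
first_row = cur.fetchone()
t_first = time.perf_counter() - start   # time to first row
rest = cur.fetchall()
t_total = time.perf_counter() - start   # total query time
```

A large gap between `t_first` and `t_total` is normal (rows keep streaming), but a large `t_first` alone suggests the database materializes the whole result before returning anything, which is the case an optimizer hint can address.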
Identifying Mapping Bottlenecks
If you determine that you do not have a source or target bottleneck, you might have a mapping bottleneck. You can identify mapping bottlenecks by using a Filter transformation in the mapping.
If you determine that you do not have a source bottleneck, you can add a Filter transformation in the mapping before each target definition. Set the filter condition to false so that no data is loaded into the target tables. If the time it takes to run the new session is the same as the original session, you have a mapping bottleneck.
You can also identify mapping bottlenecks by using performance details. High Errorrows and Rowsinlookupcache counters indicate a mapping bottleneck.
High Rowsinlookupcache Counters
Multiple lookups can slow down the session. You might improve session performance by locating the largest lookup tables and tuning those lookup expressions.
High Errorrows Counters
Transformation errors impact session performance. If a session has large numbers in any of the Transformation_errorrows counters, you might improve performance by eliminating the errors.
Identifying a Session Bottleneck
If you do not have a source, target, or mapping bottleneck, you may have a session bottleneck. You can identify a session bottleneck by using the performance details. The PowerCenter Server creates performance details when you enable Collect Performance Data in the Performance settings on the Properties tab of the session properties.
Performance details display information about each Source Qualifier, target definition, and individual transformation. All transformations have some basic counters that indicate the number of input rows, output rows, and error rows.
Any value other than zero in the readfromdisk and writetodisk counters for Aggregator, Joiner, or Rank transformations indicates a session bottleneck.
Small cache size, low buffer memory, and small commit intervals can cause session bottlenecks.
Aggregator, Rank, and Joiner Readfromdisk and Writetodisk Counters
If a session contains Aggregator, Rank, or Joiner transformations, examine each Transformation_readfromdisk and Transformation_writetodisk counter.
If these counters display any number other than zero, you can improve session performance by increasing the index and data cache sizes. The PowerCenter Server uses the index cache to store group information and the data cache to store transformed data, which is typically larger. Therefore, although both the index cache and data cache sizes affect performance, you will most likely need to increase the data cache size more than the index cache size. If the session performs incremental aggregation, the PowerCenter Server reads historical aggregate data from the local disk during the session and writes to disk when saving historical data. As a result, the Aggregator_readfromdisk and writetodisk counters display a number besides zero. However, since the PowerCenter Server writes the historical data to a file at the end of the session, you can still evaluate the counters during the session. If the counters show any number other than zero during the session run, you can increase performance by tuning the index and data cache sizes.
To view the session performance details while the session runs, right-click the session in the Workflow Monitor and choose Properties. Click the Properties tab in the details dialog box.
Source and Target BufferInput_efficiency and BufferOutput_efficiency Counters
If the BufferInput_efficiency and the BufferOutput_efficiency counters are low for all sources and targets, increasing the session DTM buffer size may improve performance. For information on when and how to tune this parameter, see your session configuration documentation.
Under certain circumstances, tuning the buffer block size may also improve session performance.
Identifying a System Bottleneck
After you tune the source, target, mapping, and session, you may consider tuning the system. You can identify system bottlenecks by using system tools to monitor CPU usage, memory usage, and paging.
The PowerCenter Server uses system resources to process transformations, session execution, and reading and writing data. The PowerCenter Server also uses system memory for other data such as aggregate, joiner, rank, and cached lookup tables. You can use system performance monitoring tools to monitor the amount of system resources the PowerCenter Server uses and identify system bottlenecks.
On Windows, you can use system tools in the Task Manager or Administrative Tools.
On UNIX systems, you can use system tools such as vmstat and iostat to monitor system performance.
On Windows, you can view the Performance and Processes tabs in the Task Manager (press Ctrl+Alt+Del and choose Task Manager). The Performance tab in the Task Manager provides a quick look at CPU usage and total memory used. You can view more detailed performance information by using the Performance Monitor on Windows (choose Start > Programs > Administrative Tools, then Performance Monitor).
Use the Windows Performance Monitor to create a chart that provides the following information:
Percent processor time. If you have several CPUs, monitor each CPU for percent processor time. If the processors are utilized at more than 80%, you may consider adding more processors.
Pages/second. If pages/second is greater than five, you may have excessive memory pressure (thrashing). You may consider adding more physical memory.
Physical disks percent time. This is the percent of time that the physical disk is busy performing read or write requests. You may consider adding another disk device or upgrading the disk device.
Physical disks queue length. This is the number of users waiting for access to the same disk device. If physical disk queue length is greater than two, you may consider adding another disk device or upgrading the disk device.
Server total bytes per second. This is the number of bytes the server has sent to and received from the network. You can use this information to improve network bandwidth.
Identifying System Bottlenecks on UNIX
You can use UNIX tools to monitor user background processes, system swapping actions, CPU loading, and I/O load operations. When you tune UNIX systems, tune the server for a major database system. Use the following UNIX tools to identify system bottlenecks on the UNIX system:
lsattr -E -l sys0. Use this tool to view current system settings. This tool shows maxuproc, the maximum level of user background processes. You may consider reducing the amount of background processes on your system.
iostat. Use this tool to monitor loading operations for every disk attached to the database server. iostat displays the percentage of time that the disk was physically active. High disk utilization suggests that you may need to add more disks.
If you use disk arrays, use the utilities provided with the disk arrays instead of iostat.
vmstat or sar -w. Use this tool to monitor disk swapping actions. Swapping should not occur during the session. If swapping does occur, you may consider increasing your physical memory or reducing the number of memory-intensive applications on the disk.
sar -u. Use this tool to monitor CPU loading. This tool provides percent usage on user, system, idle time, and waiting time. If the percent time spent waiting on I/O (%wio) is high, you may consider using other under-utilized disks. For example, if your source data, target data, lookup, rank, and aggregate cache files are all on the same disk, consider putting them on different disks.
Optimizing the Target Database
If your session writes to a flat file target, you can optimize session performance by writing to a flat file target that is local to the PowerCenter Server. If your session writes to a relational target, consider performing the following tasks to increase performance:
Drop indexes and key constraints.
Increase checkpoint intervals.
Use bulk loading.
Use external loading.
Increase database network packet size.
Optimize Oracle target databases.
Dropping Indexes and Key Constraints
When you define key constraints or indexes in target tables, you slow the loading of data to those tables. To improve performance, drop indexes and key constraints before running your session. You can rebuild those indexes and key constraints after the session completes.
If you decide to drop and rebuild indexes and key constraints on a regular basis, you can create pre- and post-load stored procedures to perform these operations each time you run the session.
Note: To optimize performance, use constraint-based loading only if necessary.
Increasing Checkpoint Intervals
The PowerCenter Server performance slows each time it waits for the database to perform a checkpoint. To increase performance, consider increasing the database checkpoint interval. When you increase the database checkpoint interval, you increase the likelihood that the database performs checkpoints only as necessary, when the size of the database log file reaches its limit.
For details on specific database checkpoints, checkpoint intervals, and log files, consult your database documentation.
Bulk Loading
You can use bulk loading to improve the performance of a session that inserts a large amount of data to a DB2, Sybase, Oracle, or Microsoft SQL Server database. Configure bulk loading on the Mapping tab.
When bulk loading, the PowerCenter Server bypasses the database log, which speeds performance. Without writing to the database log, however, the target database cannot perform rollback. As a result, you may not be able to perform recovery. Therefore, you must weigh the importance of improved session performance against the ability to recover an incomplete session.
External Loading
You can use the External Loader session option to integrate external loading with a session.
If you have a DB2 EE or DB2 EEE target database, you can use the DB2 EE or DB2 EEE external loaders to bulk load target files. The DB2 EE external loader uses the PowerCenter Server db2load utility to load data. The DB2 EEE external loader uses the DB2 Autoloader utility.
If you have a Teradata target database, you can use the Teradata external loader utility to bulk load target files.
If your target database runs on Oracle, you can use the Oracle SQL*Loader utility to bulk load target files. When you load data to an Oracle database using a pipeline with multiple partitions, you can increase performance if you create the Oracle target table with the same number of partitions you use for the pipeline.
If your target database runs on Sybase IQ, you can use the Sybase IQ external loader utility to bulk load target files. If your Sybase IQ database is local to the PowerCenter Server on your UNIX system, you can increase performance by loading data to target tables directly from named pipes.
Increasing Database Network Packet Size
You can increase the network packet size in the Informatica Workflow Manager to reduce a target bottleneck. For Sybase and Microsoft SQL Server, increase the network packet size to 8K-16K. For Oracle, increase the network packet size in tnsnames.ora and listener.ora. If you increase the network packet size in the PowerCenter Server configuration, you also need to configure the database server network memory to accept larger packet sizes.
See your database documentation about optimizing database network packet size.
Optimizing Oracle Target Databases
If your target database is Oracle, you can optimize the target database by checking the storage clause, space allocation, and rollback segments.
When you write to an Oracle database, check the storage clause for database objects. Make sure that tables are using large initial and next values. The database should also store table and index data in separate tablespaces, preferably on different disks.
When you write to Oracle target databases, the database uses rollback segments during loads. Make sure that the database stores rollback segments in appropriate tablespaces, preferably on different disks. The rollback segments should also have appropriate storage clauses.
You can optimize the Oracle target database by tuning the Oracle redo log. The Oracle database uses the redo log to log loading operations. Make sure that redo log size and buffer size are optimal. You can view redo log properties in the init.ora file.
If your Oracle instance is local to the PowerCenter Server, you can optimize performance by using the IPC protocol to connect to the Oracle database. You can set up the Oracle database connection in listener.ora and tnsnames.ora.
Optimizing the Source Database
If your session reads from a flat file source, you can improve session performance by setting the number of bytes the PowerCenter Server reads per line. By default, the PowerCenter Server reads 1024 bytes per line. If each line in the source file is less than the default setting, you can decrease the Line Sequential Buffer Length setting in the session properties.
If your session reads from a relational source, review the following suggestions for improving performance:
Optimize the query.
Create tempdb as an in-memory database.
Use conditional filters.
Increase database network packet size.
Connect to Oracle databases using the IPC protocol.
Optimizing the Query
If a session joins multiple source tables in one Source Qualifier, you might be able to improve performance by optimizing the query with optimizer hints. Also, single table select statements with an ORDER BY or GROUP BY clause may benefit from optimization such as adding indexes.
Usually, the database optimizer determines the most efficient way to process the source data. However, you might know properties about your source tables that the database optimizer does not. The database administrator can create optimizer hints to tell the database how to execute the query for a particular set of source tables.
The query the PowerCenter Server uses to read data appears in the session log. You can also find the query in the Source Qualifier transformation. Have your database administrator analyze the query, and then create optimizer hints and/or indexes for the source tables.
Use optimizer hints if there is a long delay between when the query begins executing and when PowerCenter receives the first row of data. Configure optimizer hints to begin returning rows as quickly as possible, rather than returning all rows at once. This allows the PowerCenter Server to process rows in parallel with the query execution.
Queries that contain ORDER BY or GROUP BY clauses may benefit from creating an index on the ORDER BY or GROUP BY columns. Once you optimize the query, use the SQL override option to take full advantage of these modifications. For details on using SQL override, see "Source Qualifier Transformation" in the Transformation Guide.
You can also configure the source database to run parallel queries to improve performance. See your database documentation for configuring parallel queries.
Using tempdb to Join Sybase and Microsoft SQL Server Tables
When joining large tables on a Sybase or Microsoft SQL Server database, you might improve performance by creating the tempdb as an in-memory database to allocate sufficient memory. Check your Sybase or Microsoft SQL Server manual for details.
Using Conditional Filters
A simple source filter on the source database can sometimes impact performance negatively because of a lack of indexes. You can use the PowerCenter conditional filter in the Source Qualifier to improve performance.
Whether you should use the PowerCenter conditional filter to improve performance depends on your session. For example, if multiple sessions read from the same source simultaneously, the PowerCenter conditional filter may improve performance.
However, some sessions may perform faster if you filter the source data on the source database. You can test your session with both the database filter and the PowerCenter filter to determine which method improves performance.
Increasing Database Network Packet Sizes
You can improve the performance of a source database by increasing the network packet size, allowing larger packets of data to cross the network at one time. To do this, you must complete the following tasks:
Increase the database server network packet size.
Change the packet size in the Workflow Manager database connection to reflect the database server packet size.
For Oracle, increase the packet size in listener.ora and tnsnames.ora. For other databases, check your database documentation for details on optimizing network packet size.
Connecting to Oracle Source Databases
If your Oracle instance is local to the PowerCenter Server, you can optimize performance by using the IPC protocol to connect to the Oracle database. You can set up the Oracle database connection in listener.ora and tnsnames.ora.
Optimizing the Mapping
Mapping-level optimization may take time to implement but can significantly boost session performance. Focus on mapping-level optimization only after optimizing the target and source databases.
Generally, you reduce the number of transformations in the mapping and delete unnecessary links between transformations to optimize the mapping. You should configure the mapping with the least number of transformations and expressions to do the most amount of work possible. You should minimize the amount of data moved by deleting unnecessary links between transformations.
For transformations that use data cache (such as Aggregator, Joiner, Rank, and Lookup transformations), limit connected input/output or output ports. Limiting the number of connected input/output or output ports reduces the amount of data the transformations store in the data cache.
You can also perform the following tasks to optimize the mapping:
Configure single-pass reading.
Optimize datatype conversions.
Eliminate transformation errors.
Optimize transformations.
Optimize expressions.
Configuring Single-Pass Reading
Single-pass reading allows you to populate multiple targets with one source qualifier. Consider using single-pass reading if you have several sessions that use the same sources. If you join the separate mappings and use only one source qualifier for each source, the PowerCenter Server then reads each source only once, then sends the data into separate data flows. A particular row can be used by all the data flows, by any combination, or by none, as the situation demands.
For example, you have the PURCHASING source table, and you use that source daily to perform an aggregation and a ranking. If you place the Aggregator and Rank transformations in separate mappings and sessions, you force the PowerCenter Server to read the same source table twice. However, if you join the two mappings, using one source qualifier, the PowerCenter Server reads PURCHASING only once, then sends the appropriate data to the two separate data flows.
When changing mappings to take advantage of single-pass reading, you can optimize this feature by factoring out any functions you do on both mappings. For example, if you need to subtract a percentage from the PRICE ports for both the Aggregator and Rank transformations, you can minimize work by subtracting the percentage before splitting the pipeline, as shown in Figure 25-1.
Figure 25-1. Single-Pass Reading
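The single-pass idea can be sketched in plain Python (invented table and column names; this only models the data flow, not Informatica itself): read PURCHASING once, apply the shared PRICE adjustment before the split, then feed the same rows to both an aggregation flow and a ranking flow.

```python
# Source read once (one "source qualifier").
purchasing = [
    {"item": "A", "price": 100},
    {"item": "B", "price": 250},
    {"item": "A", "price": 50},
]

# Common logic factored out: subtract 50% from PRICE once, before the split.
discounted = [{**row, "price": row["price"] * 0.5} for row in purchasing]

# Flow 1: aggregation (total price per item).
total_by_item = {}
for row in discounted:
    total_by_item[row["item"]] = total_by_item.get(row["item"], 0) + row["price"]

# Flow 2: ranking (rows ordered by price).
ranked = sorted(discounted, key=lambda row: row["price"], reverse=True)
```

Had the discount been applied separately inside each flow, the same multiplication would run twice per row; factoring it out before the split does the work once.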
Optimizing Datatype Conversions
Forcing the PowerCenter Server to make unnecessary datatype conversions slows performance. For example, if your mapping moves data from an Integer column to a Decimal column, then back to an Integer column, the unnecessary datatype conversion slows performance. Where possible, eliminate unnecessary datatype conversions from mappings.
Some datatype conversions can improve system performance. Use integer values in place of other datatypes when performing comparisons using Lookup and Filter transformations.
For example, many databases store U.S. zip code information as a Char or Varchar datatype. If you convert your zip code data to an Integer datatype, the lookup database stores the zip code 94303-1234 as 943031234. This helps increase the speed of the lookup comparisons based on zip code.
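The zip-code example can be shown in a couple of lines of Python; the point is that the conversion happens once per row, after which every comparison is cheap integer equality instead of string handling:

```python
zip_as_char = "94303-1234"                      # Char/Varchar form in the database
zip_as_int = int(zip_as_char.replace("-", ""))  # stored as 943031234

# A lookup keyed on the integer form compares integers, not strings.
lookup = {943031234: "Palo Alto"}
city = lookup.get(zip_as_int)
```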
Eliminating Transformation Errors
In large numbers, transformation errors slow the performance of the PowerCenter Server. With each transformation error, the PowerCenter Server pauses to determine the cause of the error and to remove the row causing the error from the data flow. Then the PowerCenter Server typically writes the row into the session log file.
Transformation errors occur when the PowerCenter Server encounters conversion errors, conflicting mapping logic, and any condition set up as an error, such as null input. Check the session log to see where the transformation errors occur. If the errors center around particular transformations, evaluate those transformation constraints.
If you need to run a session that generates a large number of transformation errors, you might improve performance by setting a lower tracing level. However, this is not a recommended long-term response to transformation errors.
Optimizing Lookup Transformations
If a mapping contains a Lookup transformation, you can optimize the lookup. Some of the things you can do to increase performance include caching the lookup table, optimizing the lookup condition, or indexing the lookup table.
Caching Lookups
If a mapping contains Lookup transformations, you might want to enable lookup caching. In general, you want to cache lookup tables that need less than 300MB.
When you enable caching, the PowerCenter Server caches the lookup table and queries the lookup cache during the session. When this option is not enabled, the PowerCenter Server queries the lookup table on a row-by-row basis.
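The difference between the two modes can be sketched with sqlite3 standing in for the lookup table (a conceptual model, not Informatica code): an uncached lookup issues one query per input row, while a cached lookup loads the table once and probes an in-memory structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lkp (code INTEGER PRIMARY KEY, label TEXT)")
conn.executemany("INSERT INTO lkp VALUES (?, ?)", [(i, f"L{i}") for i in range(100)])

rows = [5, 42, 5, 99]  # input rows needing a lookup

# Uncached: one SELECT per input row.
uncached = [conn.execute("SELECT label FROM lkp WHERE code = ?", (c,)).fetchone()[0]
            for c in rows]

# Cached: one SELECT builds the cache, then each row is an in-memory probe.
cache = dict(conn.execute("SELECT code, label FROM lkp"))
cached = [cache[c] for c in rows]
```

Both produce identical results; the cached form trades one up-front scan (and the memory to hold it) for the elimination of per-row round trips, which is why the 300MB guideline above matters.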
You can increase performance using a shared or persistent cache:
Shared cache. You can share the lookup cache between multiple transformations. You can share an unnamed cache between transformations in the same mapping. You can share a named cache between transformations in the same or different mappings.
Persistent cache. If you want to save and reuse the cache files, you can configure the transformation to use a persistent cache. Use this feature when you know the lookup table does not change between session runs. Using a persistent cache can improve performance because the PowerCenter Server builds the memory cache from the cache files instead of from the database.
Reducing the Number of Cached Rows
Use the Lookup SQL Override option to add a WHERE clause to the default SQL statement. This allows you to reduce the number of rows included in the cache.
Optimizing the Lookup Condition
If you include more than one lookup condition, place the conditions with an equal sign first to optimize lookup performance.
Indexing the Lookup Table
The PowerCenter Server needs to query, sort, and compare values in the lookup condition columns. The index needs to include every column used in a lookup condition. You can improve performance for both cached and uncached lookups:
Cached lookups. You can improve performance by indexing the columns in the lookup ORDER BY. The session log contains the ORDER BY statement.
Uncached lookups. Because the PowerCenter Server issues a SELECT statement for each row passing into the Lookup transformation, you can improve performance by indexing the columns in the lookup condition.
Optimizing Multiple Lookups
If a mapping contains multiple lookups, even with caching enabled and enough heap memory, the lookups can slow performance. By locating the Lookup transformations that query the largest amounts of data, you can tune those lookups to improve overall performance.
To see which Lookup transformations process the most data, examine the Lookup_rowsinlookupcache counters for each Lookup transformation. The Lookup transformations that have a large number in this counter might benefit from tuning their lookup expressions. If those expressions can be optimized, session performance improves.
Optimizing Filter Transformations
If you filter rows from the mapping, you can improve efficiency by filtering early in the data flow. Instead of using a Filter transformation halfway through the mapping to remove a sizable amount of data, use a source qualifier filter to remove those same rows at the source.
If you cannot move the filter into the source qualifier, move the Filter transformation as close to the source qualifier as possible to remove unnecessary data early in the data flow.
In your filter condition, avoid using complex expressions. You can optimize Filter transformations by using simple integer or true/false expressions in the filter condition.
Use a Filter or Router transformation to drop rejected rows from an Update Strategy transformation if you do not need to keep rejected rows.
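Why filtering early pays off can be sketched with a toy pipeline (plain Python, not Informatica): the same predicate drops the same rows either way, but pushing it ahead of an expensive stage means that stage runs on far fewer rows.

```python
rows = range(10_000)
expensive_calls = 0

def expensive(r):
    """Stand-in for a costly mid-pipeline transformation."""
    global expensive_calls
    expensive_calls += 1
    return r * 2

# Filter late: transform everything, then discard most of it.
late = [x for x in (expensive(r) for r in rows) if x < 20]
calls_late = expensive_calls  # 10,000 calls

# Filter early: discard first, transform only what survives.
expensive_calls = 0
early = [expensive(r) for r in rows if r * 2 < 20]
calls_early = expensive_calls  # 10 calls
```

Both variants yield the same output; the early filter simply does three orders of magnitude less transformation work here, which is the effect of moving the Filter next to the source qualifier.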
Optimizing Aggregator Transformations
Aggregator transformations often slow performance because they must group data before processing it. Aggregator transformations need additional memory to hold intermediate group results. You can optimize Aggregator transformations by performing the following tasks:
Group by simple columns.
Use sorted input.
Use incremental aggregation.
Group By Simple Columns
You can optimize Aggregator transformations when you group by simple columns. When possible, use numbers instead of strings and dates in the columns used for the GROUP BY. You should also avoid complex expressions in the Aggregator expressions.
Use Sorted Input
You can increase session performance by sorting data and using the Aggregator Sorted Input option.
The Sorted Input option decreases the use of aggregate caches. When you use the Sorted Input option, the PowerCenter Server assumes all data is sorted by group. As the PowerCenter Server reads rows for a group, it performs aggregate calculations. When necessary, it stores group information in memory.
The Sorted Input option reduces the amount of data cached during the session and improves performance. Use this option with the Source Qualifier Number of Sorted Ports option to pass sorted data to the Aggregator transformation.
You can benefit from better performance when you use the Sorted Input option in sessions with multiple partitions.
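Why sorted input shrinks the cache can be seen with `itertools.groupby`, which relies on exactly the property the Sorted Input option assumes: when rows arrive sorted by the group key, each group can be finalized as soon as the key changes, so only one group is ever held in memory.

```python
from itertools import groupby

# Rows already sorted by the group key, as Sorted Input requires.
rows = [("A", 10), ("A", 5), ("B", 7), ("B", 3), ("C", 1)]

# One pass, one group in memory at a time.
totals = {key: sum(v for _, v in grp)
          for key, grp in groupby(rows, key=lambda r: r[0])}
```

With unsorted input, the same one-pass approach would emit split groups ("A" twice, for example), which is why the option requires genuinely sorted data and why an unsorted aggregator must instead cache every group until the end.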
Use Incremental Aggregation
If you can capture changes from the source that change less than half the target, you can use incremental aggregation to optimize the performance of Aggregator transformations.
When using incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. The PowerCenter Server updates your target incrementally, rather than processing the entire source and recalculating the same calculations every time you run the session.
Optimizing Joiner Transformations
Joiner transformations can slow performance because they need additional space at run time to hold intermediate results. You can view Joiner performance counter information to determine whether you need to optimize the Joiner transformations.
Joiner transformations need a data cache to hold the master table rows and an index cache to hold the join columns from the master table. You need to make sure that you have enough memory to hold the data and the index cache so the system does not page to disk. To minimize memory requirements, you can also use the smaller table as the master table or join on as few columns as possible.
The type of join you use can affect performance. Normal joins are faster than outer joins and result in fewer rows. When possible, use database joins for homogeneous sources.
Optimizing Sequence Generator Transformations
You can optimize Sequence Generator transformations by creating a reusable Sequence Generator and using it in multiple mappings simultaneously. You can also optimize Sequence Generator transformations by configuring the Number of Cached Values property.
The Number of Cached Values property determines the number of values the PowerCenter Server caches at one time. Make sure that the Number of Cached Values is not too small. Consider configuring the Number of Cached Values to a value greater than 1,000.
Optimizing Expressions
As a final step in tuning the mapping, you can focus on the expressions used in transformations. When examining expressions, focus on complex expressions for possible simplification. Remove expressions one by one to isolate the slow expressions.
Once you locate the slowest expressions, take a closer look at how you can optimize them.
Factoring Out Common Logic
If the mapping performs the same task in several places, reduce the number of times the mapping performs the task by moving the task earlier in the mapping. For example, you have a mapping with five target tables. Each target requires a Social Security number lookup. Instead of performing the lookup five times, place the Lookup transformation in the mapping before the data flow splits. Then pass lookup results to all five targets.
Minimizing Aggregate Function Calls
When writing expressions, factor out as many aggregate function calls as possible. Each time you use an aggregate function call, the PowerCenter Server must search and group the data. For example, in the following expression, the PowerCenter Server reads COLUMN_A, finds the sum, then reads COLUMN_B, finds the sum, and finally finds the sum of the two sums:
SUM(COLUMN_A) + SUM(COLUMN_B)
If you factor out the aggregate function call, as below, the PowerCenter Server adds COLUMN_A to COLUMN_B, then finds the sum of both:
SUM(COLUMN_A + COLUMN_B)
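The two forms are equivalent because summation distributes over addition, but the factored form needs only one aggregate pass. A minimal Python sketch (illustrative only; PowerCenter evaluates these expressions in its own engine, and the sample data is invented) demonstrates the equivalence:

```python
# Illustrative rows standing in for COLUMN_A and COLUMN_B.
col_a = [10, 20, 30]
col_b = [1, 2, 3]

# SUM(COLUMN_A) + SUM(COLUMN_B): two aggregate passes over the data.
two_passes = sum(col_a) + sum(col_b)

# SUM(COLUMN_A + COLUMN_B): add the columns row by row, one aggregate pass.
one_pass = sum(a + b for a, b in zip(col_a, col_b))

assert two_passes == one_pass == 66
```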
Replacing Common Sub-Expressions with Local Variables
If you use the same sub-expression several times in one transformation, you can make that sub-expression a local variable. You can use a local variable only within the transformation, but by calculating the variable only once, you can speed performance. For details, see "Transformations" in the Designer Guide.
Choosing Numeric versus String Operations
The PowerCenter Server processes numeric operations faster than string operations. For example, if you look up large amounts of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance.
Optimizing Char-Char and Char-Varchar Comparisons
When the PowerCenter Server performs comparisons between CHAR and VARCHAR columns, it slows each time it finds trailing blank spaces in the row. You can use the Treat CHAR as CHAR On Read option in the PowerCenter Server setup so that the PowerCenter Server does not trim trailing spaces from the end of Char source fields.
Choosing DECODE versus LOOKUP
When you use a LOOKUP function, the PowerCenter Server must look up a table in a database. When you use a DECODE function, you incorporate the lookup values into the expression itself, so the PowerCenter Server does not have to look up a separate table. Therefore, when you want to look up a small set of unchanging values, using DECODE may improve performance.
Using Operators Instead of Functions
The PowerCenter Server reads expressions written with operators faster than expressions with functions. Where possible, use operators to write your expressions. For example, if you have an expression that involves nested CONCAT calls such as:
CONCAT( CONCAT( CUSTOMERS.FIRST_NAME, ' ' ), CUSTOMERS.LAST_NAME )
you can rewrite that expression with the || operator as follows:
CUSTOMERS.FIRST_NAME || ' ' || CUSTOMERS.LAST_NAME
Optimizing IIF Expressions
IIF expressions can return a value as well as an action, which allows for more compact expressions. For example, say you have a source with three Y/N flags: FLG_A, FLG_B, and FLG_C, and you want to return values such that: if FLG_A = 'Y', then return VAL_A; if FLG_A = 'Y' AND FLG_B = 'Y', then return VAL_A + VAL_B; and so on for all the permutations.
One way to write the expression is as follows:
IIF( FLG_A = 'Y' AND FLG_B = 'Y' AND FLG_C = 'Y',
  VAL_A + VAL_B + VAL_C,
IIF( FLG_A = 'Y' AND FLG_B = 'Y' AND FLG_C = 'N',
  VAL_A + VAL_B,
IIF( FLG_A = 'Y' AND FLG_B = 'N' AND FLG_C = 'Y',
  VAL_A + VAL_C,
IIF( FLG_A = 'Y' AND FLG_B = 'N' AND FLG_C = 'N',
  VAL_A,
IIF( FLG_A = 'N' AND FLG_B = 'Y' AND FLG_C = 'Y',
  VAL_B + VAL_C,
IIF( FLG_A = 'N' AND FLG_B = 'Y' AND FLG_C = 'N',
  VAL_B,
IIF( FLG_A = 'N' AND FLG_B = 'N' AND FLG_C = 'Y',
  VAL_C,
IIF( FLG_A = 'N' AND FLG_B = 'N' AND FLG_C = 'N',
  0.0 ))))))))
This first expression requires 8 IIFs, 16 ANDs, and at least 24 comparisons.
But if you take advantage of the IIF function's ability to return a value, you can rewrite that expression as:
IIF( FLG_A = 'Y', VAL_A, 0.0 ) + IIF( FLG_B = 'Y', VAL_B, 0.0 ) + IIF( FLG_C = 'Y', VAL_C, 0.0 )
This results in three IIFs, three comparisons, two additions, and a faster session.
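The rewrite works because each flag contributes its value independently, so one comparison per flag replaces one branch per permutation. A quick Python sketch (the flag and value names come from the example above; the functions and the sample values are only illustrative) checks that both forms agree on all eight permutations:

```python
from itertools import product

# Hypothetical values contributed by each flag.
VAL_A, VAL_B, VAL_C = 10.0, 20.0, 40.0

def nested_iif(a, b, c):
    """Transcription of the 8-branch nested IIF expression."""
    if a == 'Y' and b == 'Y' and c == 'Y': return VAL_A + VAL_B + VAL_C
    if a == 'Y' and b == 'Y' and c == 'N': return VAL_A + VAL_B
    if a == 'Y' and b == 'N' and c == 'Y': return VAL_A + VAL_C
    if a == 'Y' and b == 'N' and c == 'N': return VAL_A
    if a == 'N' and b == 'Y' and c == 'Y': return VAL_B + VAL_C
    if a == 'N' and b == 'Y' and c == 'N': return VAL_B
    if a == 'N' and b == 'N' and c == 'Y': return VAL_C
    return 0.0

def factored_iif(a, b, c):
    """IIF used as a value: one comparison per flag, summed."""
    return ((VAL_A if a == 'Y' else 0.0)
            + (VAL_B if b == 'Y' else 0.0)
            + (VAL_C if c == 'Y' else 0.0))

# Both forms agree on every Y/N permutation of the three flags.
for flags in product('YN', repeat=3):
    assert nested_iif(*flags) == factored_iif(*flags)
```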
Evaluating Expressions
If you are not sure which expressions slow performance, the following steps can help isolate the problem.
To evaluate expression performance:
1. Time the session with the original expressions.
2. Copy the mapping and replace half of the complex expressions with a constant.
3. Run and time the edited session.
4. Make another copy of the mapping and replace the other half of the complex expressions with a constant.
5. Run and time the edited session.
Optimizing the Session
Once you optimize your source database, target database, and mapping, you can focus on optimizing the session. You can perform the following tasks to improve overall performance:
Increase the number of partitions.
Reduce error tracing.
Remove staging areas.
Tune session parameters.
Table 25-1 lists the settings and values you can use to improve session performance:
Table 25-1. Session Tuning Parameters
Setting             Default Value       Suggested Minimum Value   Suggested Maximum Value
DTM Buffer Size     12,000,000 bytes    6,000,000 bytes           128,000,000 bytes
Buffer block size   64,000 bytes        4,000 bytes               128,000 bytes
Index cache size    1,000,000 bytes     1,000,000 bytes           12,000,000 bytes
Data cache size     2,000,000 bytes     2,000,000 bytes           24,000,000 bytes
Commit interval     10,000 rows         N/A                       N/A
High Precision      Disabled            N/A                       N/A
Tracing Level       Normal              Terse                     N/A
Pipeline Partitioning
If you purchased the partitioning option, you can increase the number of partitions in a pipeline to improve session performance. Increasing the number of partitions allows the PowerCenter Server to create multiple connections to sources and process partitions of source data concurrently.
When you create a session, the Workflow Manager validates each pipeline in the mapping for partitioning. You can specify multiple partitions in a pipeline if the PowerCenter Server can maintain data consistency when it processes the partitioned data.
Allocating Buffer Memory
When the PowerCenter Server initializes a session, it allocates blocks of memory to hold source and target data. The PowerCenter Server allocates at least two blocks for each source and target partition. Sessions that use a large number of sources and targets might require additional memory blocks. If the PowerCenter Server cannot allocate enough memory blocks to hold the data, it fails the session.
By default, a session has enough buffer blocks for 83 sources and targets. If you run a session that has more than 83 sources and targets, you can increase the number of available memory blocks by adjusting the following session parameters:
DTM Buffer Size. Increase the DTM buffer size found in the Performance settings of the Properties tab. The default setting is 12,000,000 bytes.
Default Buffer Block Size. Decrease the buffer block size found in the Advanced settings of the Config Object tab. The default setting is 64,000 bytes.
To configure these settings, first determine the number of memory blocks the PowerCenter Server requires to initialize the session. Then, based on default settings, you can calculate the buffer size and/or the buffer block size to create the required number of session blocks.
If you have XML sources or targets in your mapping, use the number of groups in the XML source or target in your calculation for the total number of sources and targets.
For example, you create a session that contains a single partition using a mapping that contains 50 sources and 50 targets.
1. You determine that the session requires 200 memory blocks:
[(total number of sources + total number of targets) * 2] = (session buffer blocks)
100 * 2 = 200
2. Next, based on default settings, you determine that you can change the DTM Buffer Size to 15,000,000, or you can change the Default Buffer Block Size to 54,000:
(session buffer blocks) = (.9) * (DTM Buffer Size) / (Default Buffer Block Size) * (number of partitions)
200 = .9 * 14222222 / 64000 * 1
or
200 = .9 * 12000000 / 54000 * 1
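The arithmetic above can be sketched in a few lines of Python (illustrative helper functions, not part of PowerCenter), rearranging the formula to solve for either setting:

```python
def required_blocks(num_sources, num_targets):
    # The PowerCenter Server allocates at least two blocks
    # for each source and target partition.
    return (num_sources + num_targets) * 2

def dtm_buffer_size(blocks, block_size=64_000, partitions=1):
    # Rearranged from: blocks = 0.9 * DTM size / block size * partitions
    return blocks * block_size / (0.9 * partitions)

def buffer_block_size(blocks, dtm_size=12_000_000, partitions=1):
    # Same formula, solved for the block size instead.
    return 0.9 * dtm_size * partitions / blocks

blocks = required_blocks(50, 50)       # 200 blocks for 50 sources, 50 targets
print(round(dtm_buffer_size(blocks)))  # prints 14222222 (round up to 15,000,000)
print(round(buffer_block_size(blocks)))  # prints 54000
```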
Increasing DTM Buffer Size
The DTM Buffer Size setting specifies the amount of memory the PowerCenter Server uses as DTM buffer memory. The PowerCenter Server uses DTM buffer memory to create the internal data structures and buffer blocks used to bring data into and out of the PowerCenter Server. When you increase the DTM buffer memory, the PowerCenter Server creates more buffer blocks, which improves performance during momentary slowdowns.
Increasing DTM buffer memory allocation generally causes performance to improve initially and then level off. When you increase the DTM buffer memory allocation, consider the total memory available on the PowerCenter Server system.
If you do not see a significant increase in performance, DTM buffer memory allocation is not a factor in session performance.
Note: Reducing the DTM buffer allocation can cause the session to fail early in the process because the PowerCenter Server is unable to allocate memory to the required processes.
To increase DTM buffer size:
1. Go to the Performance settings of the Properties tab.
2. Increase the setting for DTM Buffer Size, and click OK.
The default for DTM Buffer Size is 12,000,000 bytes. Increase the setting in multiples of the buffer block size, then run and time the session after each increase.
Optimizing the Buffer Block Size
Depending on the session source data, you might need to increase or decrease the buffer block size.
If the session mapping contains a large number of sources or targets, you might need to decrease the buffer block size.
If you are manipulating unusually large rows of data, you can increase the buffer block size to improve performance. If you do not know the approximate size of your rows, you can determine the configured row size by following the steps below.
To evaluate needed buffer block size:
1. In the Mapping Designer, open the mapping for the session.
2. Open the target instance.
3. Click the Ports tab.
4. Add the precisions for all the columns in the target.
5. If you have more than one target in the mapping, repeat steps 2-4 for each additional target to calculate the precision for each target.
6. Repeat steps 2-5 for each source definition in your mapping.
7. Choose the largest precision of all the source and target precisions for the total precision in your buffer block size calculation.
The total precision represents the total bytes needed to move the largest row of data. For example, if the total precision equals 33,000, then the PowerCenter Server requires 33,000 bytes in the buffers to move that row. If the buffer block size is 64,000 bytes, the PowerCenter Server can move only one row at a time.
Ideally, a buffer should accommodate at least 20 rows at a time. So if the total precision is greater than 32,000, increase the size of the buffers to improve performance.
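For back-of-the-envelope sizing, the row-per-block arithmetic can be expressed as a small Python helper (hypothetical functions, for illustration only):

```python
def rows_per_block(total_precision, block_size=64_000):
    # How many of the largest rows fit in one buffer block.
    return block_size // total_precision

def min_block_size(total_precision, target_rows=20):
    # Block size needed to hold the suggested ~20 rows at a time.
    return total_precision * target_rows

# With a total precision of 33,000 bytes, a default 64,000-byte
# block moves only one row at a time:
assert rows_per_block(33_000) == 1
# Holding 20 such rows would need a 660,000-byte block:
assert min_block_size(33_000) == 660_000
```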
To increase buffer block size:
1. Go to the Advanced settings on the Config Object tab.
2. Increase the setting for Default Buffer Block Size, and click OK.
The default for this setting is 64,000 bytes. Increase this setting in relation to the size of the rows. As with DTM buffer memory allocation, increasing buffer block size should improve performance. If you do not see an increase, buffer block size is not a factor in session performance.
Increasing the Cache Sizes
The PowerCenter Server uses the index and data caches for Aggregator, Rank, Lookup, and Joiner transformations. The PowerCenter Server stores transformed data from Aggregator, Rank, Lookup, and Joiner transformations in the data cache before returning it to the data flow. It stores group information for those transformations in the index cache. If the allocated data or index cache is not large enough to store the data, the PowerCenter Server stores the data in a temporary disk file as it processes the session data. Each time the PowerCenter Server pages to the temporary file, performance slows.
You can see when the PowerCenter Server pages to the temporary file by examining the performance details. The Transformation_readfromdisk or Transformation_writetodisk counters for any Aggregator, Rank, Lookup, or Joiner transformation indicate the number of times the PowerCenter Server must page to disk to process the transformation. Since the data cache is typically larger than the index cache, you should increase the data cache more than the index cache.
Increasing the Commit Interval
The Commit Interval setting determines the point at which the PowerCenter Server commits data to the target tables. Each time the PowerCenter Server commits, performance slows. Therefore, the smaller the commit interval, the more often the PowerCenter Server writes to the target database, and the slower the overall performance.
If you increase the commit interval, the number of times the PowerCenter Server commits decreases and performance improves.
When you increase the commit interval, consider the log file limits in the target database. If the commit interval is too high, the PowerCenter Server may fill the database log file and cause the session to fail.
Therefore, weigh the benefit of increasing the commit interval against the additional time you would spend recovering a failed session.
Click the General Options settings of the Properties tab to review and adjust the commit interval.
Disabling High Precision
If a session runs with high precision enabled, disabling high precision might improve session performance.
The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a high precision Decimal datatype in a session, configure the PowerCenter Server to recognize this datatype by selecting Enable High Precision in the session properties. However, since reading and manipulating the high precision datatype slows the PowerCenter Server, you can improve session performance by disabling high precision.
When you disable high precision, the PowerCenter Server converts data to a double. For example, the PowerCenter Server reads the Decimal row 3900058411382035317455530282 as 390005841138203 x 10^13.
Click the Performance settings on the Properties tab to enable high precision.
Reducing Error Tracing
If a session contains a large number of transformation errors that you have no time to correct, you can improve performance by reducing the amount of data the PowerCenter Server writes to the session log.
To reduce the amount of time spent writing to the session log file, set the tracing level to Terse. Specify Terse tracing if your sessions run without problems and you do not need session details. At this tracing level, the PowerCenter Server does not write error messages or row-level information for reject data.
To debug your mapping, set the tracing level to Verbose. However, Verbose tracing can significantly impact session performance. Do not use Verbose tracing when you tune performance.
The session tracing level overrides any transformation-specific tracing levels within the mapping. Reducing error tracing is not recommended as a long-term response to high levels of transformation errors.
Removing Staging Areas
When you use a staging area, the PowerCenter Server performs multiple passes on your data. Where possible, remove staging areas to improve performance. The PowerCenter Server can read multiple sources with a single pass, which may alleviate your need for staging areas.
Optimizing the System
Often performance slows because your session relies on inefficient connections or an overloaded PowerCenter Server system. System delays can also be caused by routers, switches, network protocols, and usage by many users.
After you determine from the system monitoring tools that you have a system bottleneck, you can make the following global changes to improve the performance of all your sessions:
Improve network speed. Slow network connections can slow session performance. Have your system administrator determine if your network runs at an optimal speed. Decrease the number of network hops between the PowerCenter Server and databases.
Use multiple PowerCenter Servers. Using multiple PowerCenter Servers on separate systems might double or triple session performance.
Use a server grid. Use a collection of PowerCenter Servers to distribute and process the workload of a workflow.
Improve CPU performance. Run the PowerCenter Server and related machines on high performance CPUs, or configure your system to use additional CPUs.
Configure the PowerCenter Server for ASCII data movement mode. When all character data processed by the PowerCenter Server is 7-bit ASCII or EBCDIC, configure the PowerCenter Server for ASCII data movement mode.
Check hard disks on related machines. Slow disk access on source and target databases, source and target file systems, as well as the PowerCenter Server and repository machines, can slow session performance. Have your system administrator evaluate the hard disks on your machines.
Reduce paging. When an operating system runs out of physical memory, it starts paging to disk to free physical memory. Configure the physical memory for the PowerCenter Server machine to minimize paging to disk.
Use processor binding. In a multi-processor UNIX environment, the PowerCenter Server may use a large amount of system resources. Use processor binding to control processor usage by the PowerCenter Server.
Improving Network Speed
The performance of the PowerCenter Server is related to network connections. A local disk can move data five to twenty times faster than a network. Consider the following options to minimize network activity and to improve PowerCenter Server performance.
If you use flat files as a source or target in your session, you can move the files onto the PowerCenter Server system to improve performance. When you store flat files on a machine other than the PowerCenter Server, session performance becomes dependent on the performance of your network connections. Moving the files onto the PowerCenter Server system and adding disk space might improve performance.
If you use relational source or target databases, try to minimize the number of network hops between the source and target databases and the PowerCenter Server. Moving the target database onto a server system might improve PowerCenter Server performance.
When you run sessions that contain multiple partitions, have your network administrator analyze the network and make sure it has enough bandwidth to handle the data moving across the network from all partitions.
Using Multiple PowerCenter Servers
You can run multiple PowerCenter Servers on separate systems against the same repository. Distributing the session load to separate PowerCenter Server systems increases performance.
Using Server Grids
A server grid allows you to use the combined processing power of multiple PowerCenter Servers to balance the workload of workflows.
In a server grid, a PowerCenter Server distributes sessions across the network of available PowerCenter Servers. You can further improve performance by assigning a more powerful server to run a complicated mapping.
Running the PowerCenter Server in ASCII Data Movement Mode
When all character data processed by the PowerCenter Server is 7-bit ASCII or EBCDIC, configure the PowerCenter Server to run in the ASCII data movement mode. In ASCII mode, the PowerCenter Server uses one byte to store each character. When you run the PowerCenter Server in Unicode mode, it uses two bytes for each character, which can slow session performance.
Using Additional CPUs
Configure your system to use additional CPUs to improve performance. Additional CPUs allow the system to run multiple sessions in parallel as well as multiple pipeline partitions in parallel.
However, additional CPUs might cause disk bottlenecks. To prevent disk bottlenecks, minimize the number of processes accessing the disk. Processes that access the disk include database functions and operating system functions. Parallel sessions or pipeline partitions also require disk access.
Reducing Paging
Paging occurs when the PowerCenter Server operating system runs out of memory for a particular operation and uses the local disk for memory. You can free up more memory or increase physical memory to reduce paging and the slow performance that results from paging. Monitor paging activity using system tools.
You might want to increase system memory in the following circumstances:
You run a session that uses large cached lookups.
You run a session with many partitions.
If you cannot free up memory, you might want to add memory to the system.
Using Processor Binding
In a multi-processor UNIX environment, the PowerCenter Server may use a large amount of system resources if you run a large number of sessions. As a result, other applications on the machine may not have enough system resources available. You can use processor binding to control processor usage by the PowerCenter Server.
In a Sun Solaris environment, the system administrator can create and manage a processor set using the psrset command. The system administrator can then use the pbind command to bind the PowerCenter Server to a processor set so the processor set only runs the PowerCenter Server. The Sun Solaris environment also provides the psrinfo command to display details about each configured processor, and the psradm command to change the operational status of processors.
In an HP-UX environment, the system administrator can use the Process Resource Manager utility to control CPU usage in the system. The Process Resource Manager allocates minimum system resources and uses a maximum cap of resources.
Pipeline Partitioning
Once you have tuned the application, databases, and system for maximum single-partition performance, you may find that your system is under-utilized. At this point, you can reconfigure your session to have two or more partitions. Adding partitions may improve performance by utilizing more of the hardware while processing the session.
Use the following tips when you add partitions to a session:
Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before you add each partition.
Set DTM Buffer Memory. For a session with n partitions, this value should be at least n times the value for the session with one partition.
Set cached values for Sequence Generator. For a session with n partitions, there should be no need to use the "Number of Cached Values" property of the Sequence Generator transformation. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the session with one partition.
Partition the source data evenly. Configure each partition to extract the same number of rows.
Monitor the system while running the session. If there are CPU cycles available (twenty percent or more idle time), then this session might see a performance improvement by adding a partition.
Monitor the system after adding a partition. If the CPU utilization does not go up, the wait for I/O time goes up, or the total data transformation rate goes down, then there is probably a hardware or software bottleneck. If the wait for I/O time goes up a significant amount, then check the system for hardware bottlenecks. Otherwise, check the database configuration.
Tune databases and system. Make sure that your databases are tuned properly for parallel ETL and that your system has no bottlenecks.
Optimizing the Source Database for Partitioning
Usually, each partition on the reader side represents a subset of the data to be processed. But if the database is not tuned properly, the results may not make your session any quicker. This is fairly easy to test. Create a pipeline with one partition. Measure the reader throughput in the Workflow Monitor. After you do this, add partitions. Is the throughput scaling linearly? In other words, if you have two partitions, is your reader throughput twice as fast? If not, you probably need to tune your database.
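That linear-scaling check can be sketched as a small Python helper (the function and the sample throughput figures are hypothetical, not a PowerCenter API):

```python
def scaling_efficiency(single_rate, multi_rate, partitions):
    # Ratio of measured multi-partition reader throughput to the
    # ideal linear extrapolation of the single-partition rate.
    return multi_rate / (single_rate * partitions)

# Example readings: one partition reads 10,000 rows/sec, but two
# partitions together read only 12,000 rows/sec.
eff = scaling_efficiency(single_rate=10_000, multi_rate=12_000, partitions=2)
assert eff == 0.6  # well below 1.0: the source database likely needs tuning
```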
Some databases may have specific options that must be set to enable parallel queries. You should check your individual database manual for these options. If these options are off, the PowerCenter Server runs multiple partition SELECT statements serially.
You can also consider adding partitions to increase the speed of your query. Each database provides an option to separate the data into different tablespaces. If your database allows it, you can use the SQL override feature to provide a query that extracts data from a single partition.
To maximize a single-sorted query on your database, you need to look at options that enable parallelization. There are many options in each database that may increase the speed of your query.
Here are some configuration options to look for in your source database:
Check for configuration parameters that perform automatic tuning. For example, Oracle has a parameter called parallel_automatic_tuning.
Make sure intra-parallelism (the ability to run multiple threads on a single query) is enabled. For example, on Oracle you should look at parallel_adaptive_multi_user. On DB2, you should look at intra_parallel.
Maximum number of parallel processes that are available for parallel executions. For example, on Oracle, you should look at parallel_max_servers. On DB2, you should look at max_agents.
Size for various resources used in parallelization. For example, Oracle has parameters such as large_pool_size, shared_pool_size, hash_area_size, parallel_execution_message_size, and optimizer_percent_parallel. DB2 has configuration parameters such as dft_fetch_size, fcm_num_buffers, and sort_heap.
Degrees of parallelism (may occur as either a database configuration parameter or an option on the table or query). For example, Oracle has parameters parallel_threads_per_cpu and optimizer_percent_parallel. DB2 has configuration parameters such as dft_prefetch_size, dft_degree, and max_query_degree.
Turn off options that may affect your database scalability. For example, disable archive logging and timed statistics on Oracle.
Note: The above examples are not a comprehensive list of all the tuning options available to you on the databases. Check your individual database documentation for all available performance tuning configuration parameters.
Optimizing the Target Database for Partitioning
If you have a mapping with multiple partitions, you want the throughput for each partition to be the same as the throughput for a single partition session. If you do not see this correlation, then your database is probably inserting rows into the database serially.
To make sure that your database inserts rows in parallel, check the following configuration options in your target database:
Look for a configuration option that needs to be set explicitly to enable parallel inserts. For example, Oracle has db_writer_processes, and DB2 has max_agents (some databases may have this enabled by default).
Consider partitioning your target table. If it is possible, try to have each partition write to a single database partition. You can use the Router transformation to do this. Also, look into having the database partitions on separate disks to prevent I/O contention among the pipeline partitions.
Turn off options that may affect your database scalability. For example, disable archive logging and timed statistics on Oracle.
Session Recovery
If you stop a session or if an error causes a session to stop unexpectedly, refer to the session logs to determine the cause of the failure. Correct the errors, and then complete the session. The method you use to complete the session depends on the configuration of the mapping and the session, the specific failure, and how much progress the session made before it failed. If the PowerCenter Server did not commit any data, run the session again. If the session issued at least one commit and is recoverable, consider running the session in recovery mode.
Recovery allows you to restart a failed session and complete it as if the session had run without pause. When the PowerCenter Server runs in recovery mode, it continues to commit data from the point of the last successful commit.
All recovery sessions run as part of a workflow. When you recover a session, you also have the option to run part of the workflow. Consider the configuration and design of the workflow and the status of other tasks in the workflow before you choose a method of recovery. Depending on the configuration and status of the workflow and session, you can choose one or more of the following recovery methods:
Recover a suspended workflow. If the workflow suspends due to session failure, you can recover the failed session and resume the workflow.
Recover a failed workflow. If the workflow fails as a result of session failure, you can recover the session and run the rest of the workflow.
Recover a session task. If the workflow completes, but a session fails, you can recover the session alone without running the rest of the workflow. You can also use this method to recover multiple failed sessions in a branched workflow.
Preparing for Recovery
Before you perform recovery, you must configure the mapping, session, workflow, and target database to ensure that the recovery session will consistently read, transform, and write data as though the session had not failed. Under certain circumstances, you cannot recover the session and must run it again.
Configuring the Mapping
When you design a mapping, consider requirements for session recovery. Configure the mapping so that the PowerCenter Server can extract, transform, and load data with the same results each time it runs the session.
Use the following guidelines when you configure the mapping:
Sort the data from the source. This guarantees that the PowerCenter Server always receives source rows in the same order. You can do this by configuring the Sorted Ports option in the Source Qualifier or Application Source Qualifier transformation, or by adding a Sorter transformation configured for distinct output rows to the mapping after the source qualifier.
Verify all targets receive data from transformations that produce repeatable data. Some transformations produce repeatable data. You can enable a session for recovery in the Workflow Manager when all targets in the mapping receive data from transformations that produce repeatable data.
Also, to perform consistent data recovery, the source, target, and transformation properties for the recovery session must be the same as those for the failed session. Do not change the properties of objects in the mapping before you run the recovery session.
Configuring the Session
To perform recovery on a failed session, the session must meet the following criteria:
The session is enabled for recovery.
The previous session run failed and the recovery information is accessible.
To enable recovery, select the Enable Recovery option in the Error Handling settings of the Configuration tab in the session properties.
If you enable recovery and also choose to truncate the target for a relational normal load session, the PowerCenter Server does not truncate the target when you run the session in recovery mode.
Use the following guidelines when you enable recovery for a partitioned session:
The Workflow Manager configures all partition points to use the default partitioning scheme for each transformation when you enable recovery.
The Workflow Manager sets the partition type to pass-through unless the transformation receiving the data is an Aggregator transformation, a Rank transformation, or a sorted Joiner transformation.
You can only enable recovery for unsorted Joiner transformations with one partition.
For Custom transformations, you can enable recovery only for transformations with one input group.
The PowerCenter Server disables test load when you enable the session for recovery.
To perform consistent data recovery, the session properties for the recovery session must be the same as the session properties for the failed session. This includes the partitioning configuration and the session sort order.
Configuring the Workflow
The recovery method you choose for the workflow depends on the design and configuration of the workflow. As with sessions, you can configure a workflow so that you can correct errors and complete the workflow as though it ran without error.
If other tasks or workflows in your environment depend on the successful completion of a session, configure the workflow containing the session to suspend on error. This is useful for sequential and concurrent sessions because it prevents the PowerCenter Server from continuing the workflow after the session fails. This is also useful if multiple concurrent sessions fail or if other workflows depend on the successful completion of the workflow.
If you do not want to configure the workflow to suspend on error, you can configure recoverable sessions to fail the workflow if the session fails. This prevents the PowerCenter Server from continuing to run the workflow after the session fails. In this case, you may want to perform recovery by running the part of the workflow that did not yet run.
You can also allow the workflow to complete even if sessions or other tasks fail. You can then choose to recover only the failed session tasks. This allows you to recover the sessions without running previously successful tasks.
Configuring the Target Database
When the PowerCenter Server runs a session in recovery mode, it uses information in recovery tables that it creates on the target database system. The PowerCenter Server creates the recovery tables when it runs a session enabled for recovery. If the tables already exist, the PowerCenter Server writes information to them.
The PowerCenter Server creates the following recovery tables in the target database:
PM_RECOVERY. This table records target load information during the session run. The PowerCenter Server removes the information from this table after each successful session and initializes the information at the beginning of subsequent sessions.
PM_TGT_RUN_ID. This table records information the PowerCenter Server uses to identify each target on the database. The information remains in the table between session runs.
If you want the PowerCenter Server to create the recovery tables, you must grant table creation privileges to the database user name for the target database connection. If you do not want the PowerCenter Server to create the recovery tables, you must create the recovery tables manually.
Do not edit or drop the recovery tables while recovery is enabled. If you disable recovery, the PowerCenter Server does not remove the recovery tables from the target database. You must manually remove the recovery tables.
Table 11-1 describes the format of PM_RECOVERY:
Table 11-1. PM_RECOVERY Table Definition
Column Name Datatype
REP_GID VARCHAR(240)
WFLOW_ID NUMBER
SUBJ_ID NUMBER
TASK_INST_ID NUMBER
TGT_INST_ID NUMBER
PARTITION_ID NUMBER
TGT_RUN_ID NUMBER
RECOVERY_VER NUMBER
CHECK_POINT NUMBER
ROW_COUNT NUMBER
Table 11-2 describes the format of PM_TGT_RUN_ID:
Table 11-2. PM_TGT_RUN_ID Table Definition
Column Name Datatype
LAST_TGT_RUN_ID NUMBER
Note: If you manually create the PM_TGT_RUN_ID table, you must specify a value other than zero in the LAST_TGT_RUN_ID column to ensure that the session runs successfully in recovery mode.
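The two table definitions above can be sketched as DDL. The following is a minimal illustration, not the product's own scripts: it uses an in-memory SQLite database as a stand-in for the relational target, and simplifies the column types (a real target would use its native NUMBER/VARCHAR types). It also seeds LAST_TGT_RUN_ID with a nonzero value, as the Note requires for manually created tables.

```python
import sqlite3

# In-memory database stands in for the relational target
# (assumption: column types simplified for SQLite).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# PM_RECOVERY: target load progress, one row per target/partition,
# following the column list in Table 11-1.
cur.execute("""
    CREATE TABLE PM_RECOVERY (
        REP_GID      VARCHAR(240),
        WFLOW_ID     NUMERIC,
        SUBJ_ID      NUMERIC,
        TASK_INST_ID NUMERIC,
        TGT_INST_ID  NUMERIC,
        PARTITION_ID NUMERIC,
        TGT_RUN_ID   NUMERIC,
        RECOVERY_VER NUMERIC,
        CHECK_POINT  NUMERIC,
        ROW_COUNT    NUMERIC
    )
""")

# PM_TGT_RUN_ID: single counter used to identify targets (Table 11-2).
cur.execute("CREATE TABLE PM_TGT_RUN_ID (LAST_TGT_RUN_ID NUMERIC)")

# Per the Note above: when creating the table manually, seed the
# counter with a nonzero value or recovery sessions fail.
cur.execute("INSERT INTO PM_TGT_RUN_ID VALUES (1)")
conn.commit()

row = cur.execute("SELECT LAST_TGT_RUN_ID FROM PM_TGT_RUN_ID").fetchone()
print(row[0])
```

If the PowerCenter Server creates these tables itself, do not reapply DDL like this over them; the sketch is only for the manual-creation case described above.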
"reating pmcmd :cripts
$ou can use &mcmd to perform recoery from the command line or in a script. +hen you use &mcmd commands in a
script, &mcmd indicates the success or failure of the command with a return code. )he following return codes apply
to recoery sessions.
)a!le 11&; descri!es the return codes for &mcmd that relate to recoery4
)a!le 11&;. pmcmd <eturn "odes for <ecoery
#ode De!crition
12
)he /ower"enter :erer cannot start recoery !ecause the session or workflow is scheduled, suspending,
waiting for an eent, waiting, initializing, a!orting, stopping, disa!led, or running.
19
)he /ower"enter :erer cannot start the session in recoery mode !ecause the workflow is configured to run
continuously.
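A recovery script can branch on these return codes. The sketch below is a hypothetical wrapper, not part of the product: the pmcmd invocation itself is replaced by a stub child process (the real command line and connection arguments depend on your environment), so only the return-code handling from Table 11-3 is shown.

```python
import subprocess
import sys

# Return codes from Table 11-3 that prevent starting recovery.
RECOVERY_BLOCKED = {
    12: "session/workflow state cannot start recovery",
    19: "workflow is configured to run continuously",
}

def explain(return_code: int) -> str:
    """Map a pmcmd return code to a short description (0 = success)."""
    if return_code == 0:
        return "recovery started"
    return RECOVERY_BLOCKED.get(return_code, "other pmcmd failure")

# Stand-in for a real pmcmd call; this child process simply exits
# with code 19 to exercise the handling logic.
result = subprocess.run([sys.executable, "-c", "raise SystemExit(19)"])
print(explain(result.returncode))  # workflow is configured to run continuously
```

In a real script you would substitute the stub with your pmcmd command and, on code 12, stop or unschedule the workflow before retrying, as the steps later in this chapter describe.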
+orking with <epeata!le 8ata
You can enable a session for recovery in the Workflow Manager when all targets in the mapping receive data from transformations that produce repeatable data. All transformations have a property that determines when the transformation produces repeatable data. For most transformations, this property is hidden. However, you can write the Custom transformation procedure to output repeatable data, and then configure the Custom transformation Output Is Repeatable property to match the procedure behavior.
Transformations can produce repeatable data under the following circumstances:
Never. The order of the output data is inconsistent between session runs. This is the default for active Custom transformations.
Based on input order. The output order is consistent between session runs when the input data order for all input groups is consistent between session runs. This is the default for passive Custom transformations.
Always. The order of the output data is consistent between session runs even if the order of the input data is inconsistent between session runs.
Based on transformation configuration. The transformation produces repeatable data depending on how you configure the transformation. You can always enable the session for recovery, but you may get inconsistent results depending on how you configure the transformation.
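Why order matters here: recovery resumes from a row-count checkpoint, so every run must deliver the same rows in the same positions. A small Python analogy, not PowerCenter code (the row data is invented for illustration):

```python
# Two runs deliver the same rows in different orders, as an unsorted
# relational source might.
run1 = [("A", 10), ("B", 20), ("C", 30)]
run2 = [("B", 20), ("C", 30), ("A", 10)]

def order_dependent(rows):
    # Analogous to "Based on input order": the output sequence mirrors
    # the input sequence, so the row at any checkpoint position can
    # differ between runs.
    return list(rows)

def always_repeatable(rows):
    # Analogous to a Sorter configured for distinct output rows
    # ("Always"): sorting imposes a deterministic order regardless of
    # input order.
    return sorted(set(rows))

print(order_dependent(run1) == order_dependent(run2))      # False
print(always_repeatable(run1) == always_repeatable(run2))  # True
```

This is why the guidelines above recommend sorted ports or a distinct-row Sorter: they convert order-dependent output into always-repeatable output.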
Table 11-4 lists which transformations produce repeatable data:
Table 11-4. Transformations that Output Repeatable Data
Transformation Output is Repeatable
Source Qualifier (relational). Based on transformation configuration. Use sorted ports to produce repeatable data. Or, add a transformation that produces repeatable data immediately after the Source Qualifier transformation. If you do not do either of these options, you might get inconsistent results.
Source Qualifier (flat file). Always.
Application Source Qualifier. Based on transformation configuration. Use sorted ports for relational sources, such as Siebel sources, to produce repeatable data. Or, add a transformation that produces repeatable data immediately after the Application Source Qualifier transformation. If you do not do either of these options, you might get inconsistent results.
MQ Source Qualifier. Always.
XML Source Qualifier. Always.
Aggregator. Always.
Custom. Based on transformation configuration. Configure the Output is Repeatable property according to the Custom transformation procedure behavior.
Expression. Based on input order.
External Procedure. Based on input order.
Filter. Based on input order.
Joiner. Based on input order.
Lookup. Based on input order.
Normalizer (VSAM). Always. You can enable the session for recovery; however, you might get inconsistent results if you run the session in recovery mode. The Normalizer transformation generates source data in the form of primary keys. Recovering a session might generate different values than if the session completed successfully. However, the PowerCenter Server continues to produce unique key values.
Normalizer (pipeline). Based on input order.
Rank. Always.
Router. Based on input order.
Sequence Generator. Based on transformation configuration. You must reset the sequence value to the value set in the failed session run. If you do not, you might get inconsistent results.
Sorter, configured for distinct output rows. Always.
Sorter, not configured for distinct output rows. Based on input order.
Stored Procedure. Based on input order.
Transaction Control. Based on input order.
Union. Never.
Update Strategy. Based on input order.
XML Generator. Always.
XML Parser. Always.
To run a session in recovery mode, you must first enable the failed session for recovery. To enable a session for recovery, the Workflow Manager verifies all targets in the mapping receive data from transformations that produce repeatable data. The Workflow Manager uses the values in Table 11-4 to determine whether or not you can enable a session for recovery.
However, the Workflow Manager cannot verify whether you configured some transformations, such as the Sequence Generator transformation, correctly, and it always allows you to enable these sessions for recovery. You may get inconsistent results if you do not configure these transformations correctly.
You cannot enable a session for recovery in the Workflow Manager under the following circumstances:
You connect a transformation that never produces repeatable data directly to a target. To enable this session for recovery, you can add a transformation that always produces repeatable data between the transformation that never produces repeatable data and the target.
You connect a transformation that never produces repeatable data directly to a transformation that produces repeatable data based on input order. To enable this session for recovery, you can add a transformation that always produces repeatable data immediately after the transformation that never produces repeatable data.
When a mapping contains a transformation that never produces repeatable data, you can add a transformation that always produces repeatable data immediately after it.
Note: In some cases, you might get inconsistent data if you run some sessions in recovery mode. For a description of circumstances that might lead to inconsistent data, see Completing Unrecoverable Sessions.
Figure 11-1 illustrates a mapping you can enable for recovery:
Figure 11-1. Mapping You Can Enable for Recovery
The mapping contains an Aggregator transformation that always produces repeatable data. The Aggregator transformation provides data for the Lookup and Expression transformations. Lookup and Expression transformations produce repeatable data if they receive repeatable data. Therefore, the target receives repeatable data, and you can enable this session for recovery.
Figure 11-2 illustrates a mapping you cannot enable for recovery:
Figure 11-2. Mapping You Cannot Enable for Recovery
The mapping contains two Source Qualifier transformations that produce repeatable data. However, the mapping also contains a Union transformation and a Custom transformation downstream that never produce repeatable data. The Lookup transformation only produces repeatable data if it receives repeatable data. Therefore, the target does not receive repeatable data, and you cannot enable this session for recovery.
You can modify this mapping to enable the session for recovery by adding a Sorter transformation configured for distinct output rows immediately after the transformations that never output repeatable data. Since the Union transformation is connected directly to another transformation that never produces repeatable data, you only need to add a Sorter transformation after the Custom transformation, as shown in the mapping in Figure 11-3:
Figure 11-3. Modified Mapping You Can Enable for Recovery
Recovering a Suspended Workflow
You can configure the workflow to suspend if a task fails. If a session that is enabled for recovery fails, you can correct the error that caused the session to fail and resume the suspended workflow in recovery mode. When the PowerCenter Server resumes the workflow, it runs the failed session in recovery mode. If the recovery session succeeds, the PowerCenter Server runs the rest of the workflow.
You can recover a suspended workflow with sequential or concurrent sessions. For workflows with either sequential or concurrent sessions, suspending the workflow on error is useful if successive tasks in the workflow depend on the success of the previous sessions. For a workflow with concurrent sessions, resuming a suspended workflow in recovery mode also allows you to simultaneously recover concurrent failed sessions.
You can only resume a suspended workflow in recovery mode if a session that is enabled for recovery fails. If a session that is not enabled for recovery fails, you can resume the workflow normally. When you resume the workflow, the PowerCenter Server restarts the session. If the session succeeds, the PowerCenter Server runs the rest of the workflow.
To configure the workflow to suspend on error, enable the Suspend On Error option on the General tab of the workflow properties.
Recovering a Suspended Workflow with Sequential Sessions
When a sequential session enabled for recovery fails, the PowerCenter Server places the workflow in a suspended state. While the workflow is suspended, you can correct the error that caused the session to fail.
After you correct the error, you can resume the workflow in recovery mode. When it resumes the workflow, the PowerCenter Server starts the failed session in recovery mode.
If the recovery session succeeds, the PowerCenter Server runs the rest of the workflow. If the recovery session fails, the PowerCenter Server suspends the workflow again.
Example
Suppose the workflow w_ItemOrders contains two sequential sessions. In this workflow, s_ItemSales is enabled for recovery, and the workflow is configured to suspend on error.
Figure 11-4 illustrates w_ItemOrders:
Figure 11-4. Resuming a Suspended Workflow with Sequential Sessions
Suppose s_ItemSales fails, and the PowerCenter Server suspends the workflow. You correct the error and resume the workflow in recovery mode. The PowerCenter Server recovers the session successfully, and then runs s_UpdateOrders.
If s_UpdateOrders also fails, the PowerCenter Server suspends the workflow again. You correct the error, but you cannot resume the workflow in recovery mode because you did not enable the session for recovery. Instead, you resume the workflow. The PowerCenter Server starts s_UpdateOrders from the beginning, completes the session successfully, and then runs the StopWorkflow control task.
Recovering a Suspended Workflow with Concurrent Sessions
When a concurrent session enabled for recovery fails, the PowerCenter Server places the workflow in a suspending state while it completes any other concurrently running tasks. After concurrent tasks succeed or fail, the PowerCenter Server places the workflow in a suspended state. While the workflow is suspended, you can correct the error that caused the session to fail. If concurrent tasks failed, you can also correct those errors.
After you correct the errors, you can resume the workflow in recovery mode. The PowerCenter Server runs the failed session in recovery mode. If multiple concurrent sessions failed, the PowerCenter Server starts all failed sessions enabled for recovery in recovery mode, and restarts other concurrent tasks or sessions not enabled for recovery. After successful recovery or completion of all failed sessions and tasks, the PowerCenter Server completes the rest of the workflow. If a recovery session or task fails again, the PowerCenter Server suspends the workflow.
Example
Suppose you have the workflow w_ItemsDaily, containing three concurrent sessions: s_SupplierInfo, s_PromoItems, and s_ItemSales. In this workflow, s_SupplierInfo and s_PromoItems are enabled for recovery, and the workflow is configured to suspend on error.
Figure 11-5 illustrates w_ItemsDaily:
Figure 11-5. Resuming a Suspended Workflow with Concurrent Sessions
Suppose s_SupplierInfo fails while the PowerCenter Server is running the three sessions. The PowerCenter Server places the workflow in a suspending state and continues running the other two sessions. s_PromoItems and s_ItemSales also fail, and the PowerCenter Server then places the workflow in a suspended state.
You correct the errors that caused each session to fail and then resume the workflow in recovery mode. The PowerCenter Server starts s_SupplierInfo and s_PromoItems in recovery mode. Since s_ItemSales is not enabled for recovery, the PowerCenter Server restarts that session from the beginning. The PowerCenter Server runs the three sessions concurrently. After all sessions succeed, the PowerCenter Server runs the Command task.
Steps for Recovering a Suspended Workflow
You can use the Workflow Monitor to resume a workflow in recovery mode. If the workflow or session is currently scheduled, waiting, or disabled, the PowerCenter Server cannot run the session in recovery mode. You must stop or unschedule the workflow, or stop the session.
To resume a workflow or worklet in recovery mode:
1. In the Navigator, select the suspended workflow you want to resume.
2. Choose Task-Resume/Recover.
The PowerCenter Server resumes the workflow.
Recovering a Failed Workflow
You can configure a session to fail the workflow if the session fails. If the session is also enabled for recovery, you can correct the error that caused the session to fail and recover the workflow from the failed session. When the PowerCenter Server recovers the workflow from the failed session, it runs the failed session in recovery mode. If the recovery session succeeds, the PowerCenter Server runs the rest of the workflow.
You can recover a workflow from a failed sequential or concurrent session. You might want to fail a workflow as a result of session failure if successive tasks in the workflow depend on the success of the previous sessions.
To configure a session to fail the workflow if the session fails, enable the Fail Parent If This Task Fails option on the General tab of the session properties.
Recovering a Failed Workflow with Sequential Sessions
When a sequential session that is enabled for recovery and configured to fail the workflow fails, the PowerCenter Server fails the workflow. You can correct the error that caused the session to fail and recover the workflow from the failed session. When the PowerCenter Server recovers the workflow from the session, it runs the session in recovery mode.
If the recovery session succeeds, the PowerCenter Server runs the rest of the workflow. If the recovery session fails, the PowerCenter Server fails the workflow again.
Example
Suppose the workflow w_ItemOrders contains two sequential sessions. s_ItemSales is enabled for recovery and also configured to fail the parent workflow if it fails.
Figure 11-6 illustrates w_ItemOrders:
Figure 11-6. Recovering Part of a Workflow with Sequential Sessions
Suppose s_ItemSales fails, and the PowerCenter Server fails the workflow. You correct the error and recover the workflow from s_ItemSales. The PowerCenter Server successfully recovers the session, and then runs the next task in the workflow, s_UpdateOrders.
Suppose s_UpdateOrders also fails, and the PowerCenter Server fails the workflow again. You correct the error, but you cannot recover the workflow from the session. Instead, you start the workflow from the session. The PowerCenter Server starts s_UpdateOrders from the beginning, completes the session successfully, and then runs the StopWorkflow control task.
Recovering a Failed Workflow with Concurrent Sessions
When a concurrent session that is enabled for recovery and configured to fail the workflow fails, the PowerCenter Server fails the workflow. You can then correct the error that caused the session to fail and recover the workflow from the failed session. When the PowerCenter Server recovers the workflow, it runs the session in recovery mode.
If the recovery session succeeds, the PowerCenter Server runs successive tasks in the workflow in the same path as the session. The PowerCenter Server does not recover or restart concurrent tasks when you recover a workflow from a failed session.
If multiple concurrent sessions that are enabled for recovery and configured to fail the workflow fail, the PowerCenter Server fails the workflow when the first session fails. Concurrent sessions continue to run until they succeed or fail. After all concurrent sessions complete, you can correct the errors that caused the failures.
After you correct the errors, you can recover the workflow. If multiple sessions enabled for recovery fail, individually recover all but one failed session. You can then recover the workflow from the remaining failed session. This ensures that the PowerCenter Server recovers all concurrent failed sessions before it runs the rest of the workflow.
Example
Suppose the workflow w_ItemsDaily contains three concurrent sessions: s_SupplierInfo, s_PromoItems, and s_ItemSales. In this workflow, s_SupplierInfo and s_PromoItems are enabled for recovery, and each session is configured to fail the parent workflow if the session fails.
Figure 11-7 illustrates w_ItemsDaily:
Figure 11-7. Recovering Part of a Workflow with Concurrent Sessions
Suppose s_SupplierInfo fails while the three concurrent sessions are running, and the PowerCenter Server fails the workflow. s_PromoItems and s_ItemSales also fail. You correct the errors that caused each session to fail.
In this case, you must combine two recovery methods to run all sessions before completing the workflow. You recover s_PromoItems individually. You cannot recover s_ItemSales because it is not enabled for recovery, but you can start the session from the beginning. After the PowerCenter Server successfully completes s_PromoItems and s_ItemSales, you recover the workflow from s_SupplierInfo. The PowerCenter Server runs the session in recovery mode, and then runs the Command task.
Steps for Recovering a Failed Workflow
You can use the Workflow Manager or Workflow Monitor to recover a failed workflow. If the workflow or session is currently scheduled, waiting, or disabled, the PowerCenter Server cannot run the session in recovery mode. You must stop or unschedule the workflow, or stop the session.
To recover a failed workflow using the Workflow Manager:
1. Select the failed session in the Navigator or in the Workflow Designer workspace.
2. Right-click the failed session and choose Recover Workflow from Task.
The PowerCenter Server runs the failed session in recovery mode, and then runs the rest of the workflow.
To recover a failed workflow using the Workflow Monitor:
1. Select the failed session in the Navigator.
2. Right-click the session and choose Recover Workflow From Task, or choose Task-Recover Workflow From Task.
The PowerCenter Server runs the session in recovery mode.
Recovering a Session Task
If you do not configure the workflow to suspend on error, and you do not configure the workflow to fail if sessions or tasks fail, the PowerCenter Server completes the workflow even if it encounters errors. If a session fails, but other tasks in the workflow complete successfully, you may want to recover only the failed session. When the PowerCenter Server recovers a session, it runs the session in recovery mode.
You can recover sequential or concurrent sessions. For workflows with sequential sessions, individually recovering a session is useful if the rest of the workflow succeeded and you need to recover the failed session. This allows you to recover the session without restarting successful tasks.
For workflows with concurrent sessions, this method is useful if multiple concurrent sessions fail and also cause the workflow to fail. You can individually recover concurrent sessions and individually start subsequent tasks in the workflow paths until the paths converge at a single task.
In other complex, branched workflows, individually recovering multiple failed sessions allows you to specify the order in which the sessions run.
Recovering Sequential Sessions
When a sequential session enabled for recovery fails, and the workflow is not configured to suspend or fail on error, the PowerCenter Server continues to run the workflow. You can correct the error that caused the session to fail. After you correct the error, you can individually recover the failed session. When the PowerCenter Server individually recovers a session, it runs the session in recovery mode. It does not run other tasks in the workflow.
Recovering Concurrent Sessions
When a concurrent session enabled for recovery fails, the PowerCenter Server continues to run the workflow. Other tasks and the workflow may succeed. You can correct the error that caused the session to fail. If concurrent tasks failed, you can also correct those errors. After you correct the errors, you can individually recover each session without running the rest of the workflow.
If multiple concurrent sessions that are enabled for recovery and configured to fail the workflow on session failure fail, the PowerCenter Server fails the workflow. You can correct the errors that caused the sessions to fail. After you correct the errors, you can individually recover each session. Once all concurrent tasks are recovered or complete, you can start the workflow from a task where the concurrent paths converge.
Example
Suppose the workflow w_ItemsDaily contains three concurrently running sessions. Each session is enabled for recovery and configured to fail the workflow if the session fails.
Figure 11-8 illustrates w_ItemsDaily:
Figure 11-8. Recovering Concurrent Sessions Individually
Suppose s_ItemSales fails and the PowerCenter Server fails the workflow. s_PromoItems and s_SupplierInfo also fail. You correct the errors that caused the sessions to fail.
After you correct the errors, you individually recover each failed session. The PowerCenter Server successfully recovers the sessions. The workflow paths after the sessions converge at the Command task, allowing you to start the workflow from the Command task and complete the workflow.
Alternatively, after you correct the errors, you could individually recover two of the three failed sessions. After the PowerCenter Server successfully recovers those sessions, you can recover the workflow from the third session. The PowerCenter Server then recovers the third session and, on successful recovery, runs the rest of the workflow.
Steps for Recovering a Session Task
You can use the Workflow Manager or Workflow Monitor to recover a failed session in a workflow. If the workflow or session is currently scheduled, waiting, or disabled, the PowerCenter Server cannot run the session in recovery mode. You must stop or unschedule the workflow, or stop the session.
To recover a failed session using the Workflow Manager:
1. Select the failed session in the Navigator or in the Workflow Designer workspace.
2. Right-click the failed session and choose Recover Task.
The PowerCenter Server runs the session in recovery mode.
To recover a failed session using the Workflow Monitor:
1. Select the failed session in the Navigator.
2. Right-click the session and choose Recover Task, or choose Task-Recover Task.
The PowerCenter Server runs the session in recovery mode.
Server Handling for Recovery
The PowerCenter Server writes recovery data to relational target databases when you run a session enabled for recovery. If the session fails, the PowerCenter Server uses the recovery data to determine the point at which it continues to commit data during the recovery session.
Verifying Recovery Tables
The PowerCenter Server creates recovery information in cache files for all sessions enabled for recovery. It also creates recovery tables on the target database for relational targets during the initial session run.
If the session is enabled for recovery, the PowerCenter Server creates recovery information in cache files during the normal session run. The PowerCenter Server stores the cache files in the directory specified for $PMCacheDir. The PowerCenter Server generates file names in the format PMGMD_METADATA_*.dat. Do not alter these files or remove them from the PowerCenter Server cache directory. The PowerCenter Server cannot run the recovery session if you delete the recovery cache files.
If the session writes to a relational database and is enabled for recovery, the PowerCenter Server also verifies the recovery tables on the target database for all relational targets at the beginning of a normal session run. If the tables do not exist, the PowerCenter Server creates them. If the database user name the PowerCenter Server uses to connect to the target database does not have permission to create the recovery tables, you must manually create them.
During the session run, the PowerCenter Server writes target load information for normal load targets into the recovery tables. If the session fails, the PowerCenter Server uses this information to complete the session in recovery mode. If the session is configured to write to relational targets in bulk mode, the PowerCenter Server does not write recovery information to the recovery tables.
If the session completes successfully, the PowerCenter Server deletes all recovery cache files and removes recovery table entries that are related to the session. The PowerCenter Server initializes the information in the recovery tables at the beginning of the next session run.
The PowerCenter Server also uses the recovery cache files to store messages from real-time sources.
Running Recover/
(f a session ena!led for recoery fails, you can run the session in recoery mode. )he /ower"enter :erer moes a
recoery session through the states of a normal session4 scheduled, waiting, running, succeeded, and failed. +hen
the /ower"enter :erer starts the recoery session, it runs all pre&session tasks.
146
Data Warehousing & INFORMATICA
For relational normal load targets, the PowerCenter Server performs incremental load recovery. It uses the recovery
information created during the normal session run to determine the point at which the session stopped committing
data to the target. It then continues writing data to the target. On successful recovery, the PowerCenter Server
removes the recovery information from the tables.
For example, if the PowerCenter Server commits 10,000 rows before the session fails, when you run the session in
recovery mode, the PowerCenter Server bypasses the rows up to 10,000 and starts loading with row 10,001.
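The resume-from-commit-point behavior in this example can be sketched in a few lines. This is an illustration of the idea, not the server's implementation: `rows_committed` stands in for the commit point read back from the recovery table.

```python
def recover_load(rows, rows_committed):
    """Skip rows already committed by the failed run, then keep loading.

    `rows` is any iterable of source rows in the original processing order;
    `rows_committed` is the commit point recovered from the recovery table
    (10,000 in the example above).
    """
    loaded = []
    for i, row in enumerate(rows, start=1):
        if i <= rows_committed:
            continue          # already in the target; bypass it
        loaded.append(row)    # stand-in for the actual target write
    return loaded

# With a commit point of 3, only rows 4 and 5 are written during recovery.
print(recover_load(["r1", "r2", "r3", "r4", "r5"], 3))  # → ['r4', 'r5']
```

This also makes clear why recovery depends on the source rows arriving in the same order as in the initial run: the bypass is positional.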
If the session writes to a relational target in bulk mode, the PowerCenter Server performs the entire writer run. If
the Truncate Target Table option is enabled in the session properties, the PowerCenter Server truncates the target
before loading data.
If the session writes to a flat file or XML file, the PowerCenter Server performs full load recovery. It overwrites the
existing output file and performs the entire writer run. If the session writes to heterogeneous targets, the
PowerCenter Server performs incremental load recovery for all relational normal load targets and full load recovery
for all other target types.
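The per-target decision rule for heterogeneous targets reduces to a single check, sketched below. The dictionary shape is an assumption made for illustration; only the rule itself (normal-mode relational targets recover incrementally, everything else gets a full writer run) comes from the text.

```python
def recovery_strategy(target):
    """Pick the recovery strategy the text describes for each target type.

    Relational targets loaded in normal mode resume from the commit point
    (incremental); flat-file, XML, and bulk-mode relational targets are
    rewritten from the start (full).
    """
    if target["type"] == "relational" and target["load_mode"] == "normal":
        return "incremental"
    return "full"

targets = [
    {"name": "T_DIM", "type": "relational", "load_mode": "normal"},
    {"name": "out.csv", "type": "flat_file", "load_mode": "normal"},
    {"name": "T_FACT", "type": "relational", "load_mode": "bulk"},
]
print({t["name"]: recovery_strategy(t) for t in targets})
# → {'T_DIM': 'incremental', 'out.csv': 'full', 'T_FACT': 'full'}
```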
On successful recovery, the PowerCenter Server deletes recovery cache files associated with the session. It also
performs all post-session tasks.
Completing Unrecoverable Sessions
In some cases, you cannot perform recovery for a session. There may also be circumstances that cause a recovery
session to fail or produce inconsistent data. If you cannot recover a session, you can run the session again.
You cannot run sessions in recovery mode under the following circumstances:
You change the number of partitions. If you change the number of partitions after the session fails, the
recovery session fails.
Recovery table is empty or missing from the target database. The PowerCenter Server fails the recovery
session under the following circumstances:
o You deleted the table after the PowerCenter Server created it.
o The session enabled for recovery succeeded, and the PowerCenter Server removed the recovery
information from the table.
Recovery cache file is missing. The PowerCenter Server fails the recovery session if the recovery cache file
is missing from the PowerCenter Server cache directory.
The PowerCenter Server performing recovery is on a different operating system. The operating system
of the PowerCenter Server that runs the recovery session must be the same as the operating system of the
PowerCenter Server that ran the failed session.
You might get inconsistent data if you perform recovery under the following circumstances:
You change the partitioning configuration. If you change any partitioning options after the session fails,
you may get inconsistent data.
Source data is not sorted. To perform a successful recovery, the PowerCenter Server must process source
rows during recovery in the same order it processed them during the initial session. Use the Sorted Ports
option in the Source Qualifier transformation or add a Sorter transformation directly after the Source
Qualifier transformation.
The sources or targets change after the initial session failure. If you drop or create indexes, or edit data
in the source or target tables before recovering a session, the PowerCenter Server may return missing or
repeated rows.
The session writes to a relational target in bulk mode, but the session is not configured to truncate the
target table. The PowerCenter Server may load duplicate rows to the target during the recovery session.
The mapping uses a Normalizer transformation. The Normalizer transformation generates source data in
the form of primary keys. Recovering a session might generate different values than if the session
completed successfully. However, the PowerCenter Server will continue to produce unique key values.
The mapping uses a Sequence Generator transformation. The Sequence Generator transformation
generates source data in the form of sequence values. Recovering a session might generate different values
than if the session completed successfully.
If you want to ensure the same sequence data is generated during the recovery session, you can reset the
value specified as the Current Value in the Sequence Generator transformation properties to the same value
used when you ran the failed session. If you do not reset the Current Value, the PowerCenter Server will
continue to generate unique sequence values.
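The Current Value behavior above can be modeled with a toy counter. This is only an illustrative model of the documented behavior, not Informatica code: without a reset, recovery continues from where the failed run left off (unique but different values); resetting Current Value reproduces the original sequence.

```python
# Illustrative model of sequence recovery: a simple monotonically
# increasing counter standing in for the Sequence Generator transformation.
class SequenceGenerator:
    def __init__(self, current_value=1):
        self.current_value = current_value

    def next_values(self, n):
        vals = list(range(self.current_value, self.current_value + n))
        self.current_value += n
        return vals

seq = SequenceGenerator(current_value=1)
seq.next_values(3)        # the failed run consumed values 1, 2, 3

# Without a reset, recovery continues from 4: unique, but different data.
# Resetting Current Value regenerates the same sequence data instead:
seq.current_value = 1
print(seq.next_values(3))  # → [1, 2, 3]
```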
The session performs incremental aggregation and the PowerCenter Server stops unexpectedly. If the
PowerCenter Server stops unexpectedly while running an incremental aggregation session, the recovery
session cannot use the incremental aggregation cache files. Rename the backup cache files for the session
from PMAGG*.idx.bak and PMAGG*.dat.bak to PMAGG*.idx and PMAGG*.dat before you perform recovery.
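The rename step above can be scripted rather than done by hand. This is a sketch under the assumption that the backup files sit in a single cache directory and follow the PMAGG*.idx.bak / PMAGG*.dat.bak naming shown in the text; adjust the directory path to your server's cache directory.

```python
import glob
import os

def restore_aggregation_cache(cache_dir):
    """Rename PMAGG*.idx.bak and PMAGG*.dat.bak back to .idx / .dat.

    Run this before starting the recovery session so it can reuse the
    backed-up incremental aggregation cache files.
    """
    restored = []
    for bak in glob.glob(os.path.join(cache_dir, "PMAGG*.bak")):
        original = bak[: -len(".bak")]  # strip the trailing .bak suffix
        os.replace(bak, original)       # atomic rename where supported
        restored.append(os.path.basename(original))
    return sorted(restored)
```

For example, `restore_aggregation_cache("/opt/infa/cache")` would turn `PMAGG001.idx.bak` into `PMAGG001.idx` and return the list of restored file names.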
The PowerCenter Server data movement mode changes after the initial session failure. If you change the
data movement mode before recovering the session, the PowerCenter Server might return incorrect data.
The PowerCenter Server code page or source and target code pages change after the initial session
failure. If you change the source, target, or PowerCenter Server code pages, the PowerCenter Server might
return incorrect data. You can perform recovery if the new code pages are two-way compatible with the
original code pages.
The PowerCenter Server runs in Unicode mode and you change the session sort order. When the
PowerCenter Server runs in Unicode mode, it sorts character data based on the sort order selected for the
session. Do not perform recovery if you change the session sort order after the session fails.