
ETL Design Challenges and Solutions through Informatica

Akshayananda Maiti

Contents
1 Incremental Load Mechanism
1.1 Solution
2 Incremental Load When Source Record Has No Timestamp
2.1 Solution
3 Duplicate Records from Source
3.1 Solution
4 Two Database Technologies in a Single Mapping
4.1 Solution
5 How to Optimize Source Read
5.1 Solution
6 Source Fact Record Does Not Have a Dimension
6.1 Solution

TCS Confidential

Copyright 2007 by Tata Consultancy Services. No part of this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted
in any form or by any means electronic, mechanical, photocopying, recording, or otherwise without the permission of Tata Consultancy Services.


1 Incremental Load Mechanism

If the mappings run daily, a simple way to build an incremental load would be:

SELECT <source records>
FROM   <source table>
WHERE  timestamp > SYSDATE - 1

However, this solution has limitations. It works only when:
1) The source system table is not updated during a specific time window, and the mapping is scheduled inside this window.
2) The mappings run every day without fail.
To make your system more robust, so that you do not have to adhere to the above two constraints, you need a more sophisticated solution.

1.1 Solution
Usually any ETL system has a control table that stores daily run information such as run start time, run end time, and a run success flag.
Utilize that table to store the timestamp of the last record extracted from the source, e.g. 30th March 2008 9:00 am. During the next run, extract only the records whose timestamp is greater than that stored value from the previous run.
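As a minimal sketch of that filter (the source table, column, and control table names here are illustrative, not from the actual system):

-- Illustrative only: pull rows changed since the last successful run.
-- source_table, last_upd_tmstmp, and etl_control are assumed names.
SELECT src.*
FROM   source_table src
WHERE  src.last_upd_tmstmp > (SELECT MAX(c.ld_end_tmstmp)
                              FROM   etl_control c
                              WHERE  c.ld_sts_cd = 'S')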
In our case, we have created two tables (having the same structure, as shown below):
• DSR_STG_LD_STS
• DSR_LD_STS
Table Name: DSR_LD_STS

Field Name                  Data Type     Description
DSR_LD_STS_ID               Number(10)    Number generated by an Oracle sequence object
LD_START_TMSTMP             Date          Timestamp of data load start
LD_END_TMSTMP               Date          Timestamp of data load end
LD_CMPRSN_START_TMSTMP      Date          Starting timestamp of the range for delta load
LD_CMPRSN_END_TMSTMP        Date          Ending timestamp of the range for delta load
LD_STS_CD                   Char(1)       'S' if load completed successfully; 'F' if load failed
DSR_TRNSCTN_TMSTMP          Date          SYSDATE
DSR_TRNSCTN_TYP_CD          Varchar2(1)   'I' if insert; 'U' if update
DSR_TRNSCTN_USER_ID         Varchar2(30)  DSR_INFORMATICA_USER
DSR_EFCTV_END_DT            Date          12-31-9999


Before the data load into the datamart starts, two columns (LD_CMPRSN_START_TMSTMP and LD_CMPRSN_END_TMSTMP, as highlighted above) in the load status tables are set to identify the time window for source records. Any source record created or updated inside this time window will be extracted by the datamart mappings. The following steps detail the mechanism:
• DSR_STG_LD_STS stores the latest workflow run record (current).
• DSR_LD_STS stores one record for each run of the workflow (history + current).
• Mapping m_USD_Load_Status_ins_at_start runs before the datamart load.
• Mapping m_USD_Load_Status_upd_at_end runs after the datamart load.

• Mapping m_USD_Load_Status_ins_at_start inserts a new record in DSR_LD_STS and, after truncating DSR_STG_LD_STS, inserts the same record in DSR_STG_LD_STS. After this mapping runs, the single record in DSR_STG_LD_STS looks as follows:
Column Name                 Value
DSR_LD_STS_ID               previous id + 1
LD_START_TMSTMP             Sysdate
LD_END_TMSTMP               NULL (run end time not known at this point)
LD_CMPRSN_START_TMSTMP      1-1-1900 if this is the first load, else the previous run's end timestamp
LD_CMPRSN_END_TMSTMP        Sysdate
LD_STS_CD                   F (as the run is not yet complete)
DSR_TRNSCTN_TMSTMP          Sysdate
DSR_TRNSCTN_TYP_CD          I
DSR_TRNSCTN_USER_ID         DSR_INFORMATICA_USER
DSR_EFCTV_END_DT            12-31-9999

• Mapping m_USD_Load_Status_upd_at_end applies the same update (highlighted columns in the table below) to the single record of table DSR_STG_LD_STS and the latest record of table DSR_LD_STS. After this mapping runs, the single record in DSR_STG_LD_STS looks as follows:



Column Name                 Value
DSR_LD_STS_ID               previous id + 1
LD_START_TMSTMP             Sysdate
LD_END_TMSTMP               Sysdate
LD_CMPRSN_START_TMSTMP      1-1-1900 if this is the first load, else the previous run's end timestamp
LD_CMPRSN_END_TMSTMP        Sysdate
LD_STS_CD                   S (as the run is now complete)
DSR_TRNSCTN_TMSTMP          Sysdate
DSR_TRNSCTN_TYP_CD          U
DSR_TRNSCTN_USER_ID         DSR_INFORMATICA_USER
DSR_EFCTV_END_DT            12-31-9999
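With the load status tables in place, each datamart source qualifier can filter on the comparison window. A minimal sketch of such a filter (the source table and its timestamp column are assumed names for illustration):

-- Illustrative source qualifier filter; src_orders and last_upd_tmstmp are assumed names.
-- DSR_STG_LD_STS holds exactly one row, so the join cannot multiply records.
SELECT src.*
FROM   src_orders src,
       dsr_stg_ld_sts stg
WHERE  src.last_upd_tmstmp >  stg.ld_cmprsn_start_tmstmp
AND    src.last_upd_tmstmp <= stg.ld_cmprsn_end_tmstmp

Because LD_CMPRSN_START_TMSTMP carries the end timestamp of the previous run, a failed or skipped run simply widens the next run's window rather than losing records.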

2 Incremental Load When Source Record Has No Timestamp
To extract and load data incrementally, one needs timestamp information in the source records. However, you may often come across a source system that does not have timestamps. We had two such source systems, and we had to handle the extra complexity described below.
The source system table did not have any record_status_code information to indicate whether a record is active or not; all inactive records were physically deleted from the source table. In the datamart, however, we followed a better way of handling deletion, i.e. soft delete. That means we kept a record_status_code column (in our tables), which gets populated with the value 'D' once the record is deleted from the business.
In summary, we had a source that has no timestamps and that physically deletes records. From this we had to populate a datamart incrementally while also adhering to the soft delete policy.

2.1 Solution
In most data warehousing scenarios there is a staging area between the source and the datamart. Utilize that staging area to build the timestamp and record status information (the staging area table will have the same structure as the source table plus two extra columns, i.e. TIME_STMP and RCRD_STS_CD). The only way to achieve this is to compare today's snapshot of the source table with yesterday's snapshot of the same table.


One simple mechanism for the comparison would be to maintain a temporary table that stores yesterday's snapshot of the source table. However, this solution demands a lot of overhead in terms of replicating the whole source system database.
There are better solutions; the one we built in our case is described below:
2.1.1 For each staging area table, create two stored procedures to execute from the Informatica mapping (target pre-load and target post-load). The functionality of each stored procedure is as follows:
2.1.2 The target pre-load procedure sets the RCRD_STS_CD column value to 'T' for every row whose value is not 'D'. For example: procedure name = sp_upd_pre (see the sketch at the end of this section).
2.1.3 The target post-load procedure sets the RCRD_STS_CD column value to 'D' for every row that still has the value 'T', and puts sysdate in the timestamp column. For example: procedure name = sp_upd_post.
2.1.4 Let's take a case where more than one mapping loads data into a single table.
2.1.5 The first mapping calls procedure sp_upd_pre through a Stored Procedure transformation, which does the target pre-load.
2.1.6 All the mappings either insert new records or update existing ones and set RCRD_STS_CD to 'A'. The timestamp value is set to sysdate for a new record, but while updating an existing record the timestamp value remains unchanged.
2.1.7 The last mapping calls procedure sp_upd_post through a Stored Procedure transformation, which does the target post-load.
2.1.8 Once all mappings run successfully, the two columns of the table will be updated as per the logic below.

Type of source record                                     Rcrd_sts_cd   Rcrd_sts_cd   Timestamp     Timestamp
                                                          before load   after load    before load   after load
New record                                                (none)        A             (none)        Sysdate
Old record                                                A             A             Old date      Old date
Old record deleted from source after previous load run    A             D             Old date      Sysdate
Old record deleted from source before previous load run   D             D             Old date      Old date

2.1.9 In the next part of the workflow, the mappings that load data from the staging area to the datamart will extract only the 1st and 3rd types of records described in the table above. That means the mappings will extract only the new records and those records that were deleted recently (after the previous load run).
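A minimal Oracle sketch of the two procedures from steps 2.1.2 and 2.1.3; the staging table name stg_src_tbl is assumed, while the procedure names and column behavior come from the steps above:

-- Target pre-load: flag every non-deleted row as a deletion candidate.
CREATE OR REPLACE PROCEDURE sp_upd_pre AS
BEGIN
  UPDATE stg_src_tbl
  SET    rcrd_sts_cd = 'T'
  WHERE  rcrd_sts_cd <> 'D';
END;
/

-- Target post-load: rows still flagged 'T' were not re-sent by the source,
-- so mark them soft-deleted and stamp the deletion time.
CREATE OR REPLACE PROCEDURE sp_upd_post AS
BEGIN
  UPDATE stg_src_tbl
  SET    rcrd_sts_cd = 'D',
         time_stmp   = SYSDATE
  WHERE  rcrd_sts_cd = 'T';
END;
/

Any row the source still sends is reset to 'A' by the mappings in between, so only truly deleted rows remain 'T' when sp_upd_post runs.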


3 Duplicate Records from Source

A number of times you will face data quality issues such as duplicate data arriving from the source. Various solutions are available, depending on the treatment of duplicate records agreed with the client. In our case the agreed treatment was to pick up any one record from each set of duplicates.
A simple solution in this scenario would be to use DISTINCT in the source qualifier query. However, this solution has two limitations:
1) Using DISTINCT in a query can degrade query performance heavily.
2) If the source is not an RDBMS, you will not be able to use a source qualifier query.
In our case we designed a better solution.

3.1 Solution
We can use a Sorter transformation (with the Distinct option checked) to remove duplicate records. Please note in the screenshot below that the Distinct option must be checked to get distinct records.

[Screenshot: Sorter transformation properties; verify that the Distinct option is checked]


There could be a more complex scenario where "duplicate" means there is more than one record for a given set of columns (the business key, which is supposed to be unique), while those records need not have the same values in all the other columns. In this case, use an Aggregator transformation with a group-by on the business key columns, as in the screenshot and the SQL analogue below.


[Screenshot: Aggregator transformation; check the Group By box so that Informatica selects the last record from each set of records having the same delivery number (business key). The rest of the records are not processed further.]
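For reference, a database-side analogue of this deduplication (the actual work is done in the Aggregator, not in SQL, and the table and column names here are assumed):

-- Keep one row per business key; the latest row wins, mirroring the
-- Aggregator's "last record per group" behavior. Names are illustrative.
SELECT *
FROM  (SELECT d.*,
              ROW_NUMBER() OVER (PARTITION BY delivery_number
                                 ORDER BY     last_upd_tmstmp DESC) AS rn
       FROM   src_delivery d) t
WHERE  t.rn = 1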

4 Two Database Technologies in a Single Mapping

In our case one database was Oracle and the other was SQL Server. The unique challenge was to extract incremental records from SQL Server while the control information for the incremental mechanism was stored in an Oracle table.

4.1 Solution
We can build this mechanism using Informatica mapping variables, by creating two pipelines (threads) inside one mapping.
The first pipeline extracts the comparison_start_timestamp of the last load from the Oracle table and sets a variable (say $$STARTTIME) with that timestamp.
The second pipeline extracts all records from the worldlink SQL Server tables that have a timestamp greater than $$STARTTIME. A sample source qualifier query using the variable follows.



SELECT
    Packages.PackageId,
    PackageDetails.ItemNumber,
    PackageDetails.Quantity,
    PackageDetails.UOM,
    PackageDetails.POLineNumber,
    PackageDetails.DeleteFlag,
    Sites.CrossReference
FROM
    Packages,
    PackageDetails,
    Sites,
    Shipments
WHERE
    PackageDetails.PackageId = Packages.Id
    AND Shipments.SiteId = Sites.Id
    AND Packages.ShipmentId = Shipments.Id
    AND PackageDetails.UpdateDateTime >= CONVERT(datetime, '$$STARTTIME')

Note: Ensure that you put single quotes (' ') before and after the variable ($$STARTTIME) when you use it in the SQL query, even though it is a datetime-type variable.
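As a hedged sketch of the first pipeline (the Oracle side), assuming the control table from section 1 and an assumed date format mask:

-- Oracle side: fetch the last load's comparison start timestamp, formatted
-- as a string that SQL Server's CONVERT can parse. The mask is an assumption.
SELECT TO_CHAR(ld_cmprsn_start_tmstmp, 'MM/DD/YYYY HH24:MI:SS') AS start_time
FROM   dsr_stg_ld_sts

In an Expression transformation, a port such as SETVARIABLE($$STARTTIME, start_time) then carries the value into the second pipeline's source qualifier.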

5 How to Optimize Source Read

When two mappings take different information from the same very large source table, we can optimize the total load time by merging the two mappings, so that the source table is read only once while both target tables are loaded.
Here we discuss an example we handled, which had an added complexity: the two target tables were a fact table and its related dimension table, where the fact table has a foreign key to the dimension table.

5.1 Solution
Loading the dimension and fact tables simultaneously is technically not feasible if the foreign key constraint on the fact table is enabled. To overcome this challenge we used two threads inside the mapping.
1st Thread
After the source qualifier, the record flow splits into two branches. The 1st branch loads the target dimension table. The 2nd branch handles fact records.


2nd branch of 1st Thread (for fact records)

The 2nd branch again splits into a 3rd and a 4th branch. Here we introduced a temporary fact table (say fact_tmp) that does not have any referential integrity constraint. In the 3rd branch, all new fact records (instead of being inserted into the fact table) populate fact_tmp. In the 4th branch, all old fact records are updated directly in the fact table.
Note: All the branches described above are inside the 1st thread and they share a single source qualifier.

2nd Thread
Once the 1st thread is completed (more specifically, once loading of the dimension table is completed), all the records from the fact_tmp table are inserted into the fact table; fact_tmp is then truncated, as in the sketch below.
The performance optimization comes from the use of a single source qualifier (meaning the very large source table is read only once instead of twice) to load both target tables (the dimension and the fact table). The following attached document shows one such example pictorially.

[Attachment: Diagram for loading dim & fact in same mapping.pdf]
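A minimal sketch of the 2nd thread's work, assuming simplified table structures (fact_tbl, fact_tmp, and the column names are illustrative):

-- Move the staged fact rows now that the dimension keys they reference exist.
INSERT INTO fact_tbl (fact_id, dim_key, measure_amt)
SELECT fact_id, dim_key, measure_amt
FROM   fact_tmp;

COMMIT;

-- Empty the staging fact table for the next run (TRUNCATE commits implicitly in Oracle).
TRUNCATE TABLE fact_tmp;

Because fact_tmp has no foreign key constraint, the 1st thread can write fact rows before their dimension rows are committed; the constraint is only checked here, after the dimension load has finished.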

6 Source Fact Record Does Not Have a Dimension

Normally a source system provides all the information needed to build all the dimension keys for a fact record. However, data quality issues sometimes arise where a few fact records have dimension information missing (a NULL value in the related source columns). Due to the referential integrity of the fact table, a direct insert of such records into the fact table is not feasible.

6.1 Solution
To achieve this we have to create dummy dimension keys, as described below.
The relevant dimension, for which a few source fact records have a NULL value, has to be populated with one extra entry called "dummy" with the id -9999 (that is what we did in our case).
If a fact record coming from the source does not have that specific dimension information, the record is still inserted into the fact table with dimension_key = -9999.
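A minimal sketch of the two pieces involved, under assumed names (product_dim, product_key): the dummy member is seeded once, and the mapping defaults the key when the dimension lookup returns NULL:

-- One-time seed of the dummy dimension member; table and columns are illustrative.
INSERT INTO product_dim (product_key, product_name)
VALUES (-9999, 'UNKNOWN');

In the mapping, an Expression transformation port such as IIF(ISNULL(lkp_product_key), -9999, lkp_product_key) then substitutes the dummy key for fact records whose dimension lookup found no match.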

