
ETL Design Challenges and Solutions through Informatica

Akshayananda Maiti

Contents
1 Incremental Load Mechanism
1.1 Solution
2 Incremental Load When Source Record Has No Timestamp
2.1 Solution
3 Duplicate Records from Source
3.1 Solution
4 Two Database Technologies in a Single Mapping
4.1 Solution
5 How to Optimize Source Read
5.1 Solution
6 Source Fact Record Does Not Have a Dimension
6.1 Solution

TCS Confidential

Copyright 2007 by Tata Consultancy Services. No part of this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted
in any form or by any means electronic, mechanical, photocopying, recording, or otherwise without the permission of Tata Consultancy Services.


1 Incremental Load Mechanism

If the mappings run daily, a simple way to build an incremental load would be:

SELECT <source records>
FROM   <source table>
WHERE  timestamp > SYSDATE - 1

However, this solution has limitations. It works only when:
1) The source system table is not updated during a specific time window, and the mapping is scheduled inside this window.
2) The mappings run every day without fail.
To make your system more robust, so that you do not have to adhere to the above two constraints, you need a more sophisticated solution.

1.1 Solution
Usually any ETL system has a control table that stores daily run information such as run start time, run end time, and a run success flag.
Utilize that table to store the timestamp of the last record extracted from the source, e.g. 30th March 2008 9:00 am. During the next run, extract only the records whose timestamp is greater than that stored value from the previous run.
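As a minimal sketch of that filter (the source table, column, and control table names here are illustrative, not from the actual system):

-- Illustrative only: pull rows changed since the last successful run.
-- source_table, last_upd_tmstmp, and etl_control are assumed names.
SELECT src.*
FROM   source_table src
WHERE  src.last_upd_tmstmp > (SELECT MAX(c.ld_end_tmstmp)
                              FROM   etl_control c
                              WHERE  c.ld_sts_cd = 'S')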
In our case, we have created two tables (having the same structure, as shown below):
• DSR_STG_LD_STS
• DSR_LD_STS
Table Name: DSR_LD_STS

Field Name                  Data Type     Description
DSR_LD_STS_ID               Number(10)    Number generated by an Oracle sequence object
LD_START_TMSTMP             Date          Timestamp of data load start
LD_END_TMSTMP               Date          Timestamp of data load end
LD_CMPRSN_START_TMSTMP      Date          Starting timestamp of the range for delta load
LD_CMPRSN_END_TMSTMP        Date          Ending timestamp of the range for delta load
LD_STS_CD                   Char(1)       'S' if load completed successfully; 'F' if load failed
DSR_TRNSCTN_TMSTMP          Date          SYSDATE
DSR_TRNSCTN_TYP_CD          Varchar2(1)   'I' if insert; 'U' if update
DSR_TRNSCTN_USER_ID         Varchar2(30)  DSR_INFORMATICA_USER
DSR_EFCTV_END_DT            Date          12-31-9999


Before the data load into the datamart starts, two columns (LD_CMPRSN_START_TMSTMP and LD_CMPRSN_END_TMSTMP, as highlighted above) in the load status tables are set to identify the time window for source records. Any source record created or updated inside this time window will be extracted by the datamart mappings. The following steps detail the mechanism:
• DSR_STG_LD_STS stores the latest workflow run record (current).
• DSR_LD_STS stores one record for each run of the workflow (history + current).
• Mapping m_USD_Load_Status_ins_at_start runs before the datamart load.
• Mapping m_USD_Load_Status_upd_at_end runs after the datamart load.

• Mapping m_USD_Load_Status_ins_at_start inserts a new record in DSR_LD_STS and, after truncating DSR_STG_LD_STS, inserts the same record in DSR_STG_LD_STS. After this mapping runs, the single record in DSR_STG_LD_STS looks as follows:
Column Name                 Value
DSR_LD_STS_ID               previous id + 1
LD_START_TMSTMP             Sysdate
LD_END_TMSTMP               NULL (run end time not known at this point)
LD_CMPRSN_START_TMSTMP      1-1-1900 if this is the first load, else the previous run's end timestamp
LD_CMPRSN_END_TMSTMP        Sysdate
LD_STS_CD                   F (as the run is not yet complete)
DSR_TRNSCTN_TMSTMP          Sysdate
DSR_TRNSCTN_TYP_CD          I
DSR_TRNSCTN_USER_ID         DSR_INFORMATICA_USER
DSR_EFCTV_END_DT            12-31-9999

• Mapping m_USD_Load_Status_upd_at_end applies the same update (highlighted columns in the table below) to the single record of table DSR_STG_LD_STS and the latest record of table DSR_LD_STS. After this mapping runs, the single record in DSR_STG_LD_STS looks as follows:



Column Name                 Value
DSR_LD_STS_ID               previous id + 1
LD_START_TMSTMP             Sysdate
LD_END_TMSTMP               Sysdate
LD_CMPRSN_START_TMSTMP      1-1-1900 if this is the first load, else the previous run's end timestamp
LD_CMPRSN_END_TMSTMP        Sysdate
LD_STS_CD                   S (as the run is now complete)
DSR_TRNSCTN_TMSTMP          Sysdate
DSR_TRNSCTN_TYP_CD          U
DSR_TRNSCTN_USER_ID         DSR_INFORMATICA_USER
DSR_EFCTV_END_DT            12-31-9999
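With the load status tables in place, each datamart source qualifier can filter on the comparison window. A minimal sketch of such a filter (the source table and its timestamp column are assumed names for illustration):

-- Illustrative source qualifier filter; src_orders and last_upd_tmstmp are assumed names.
-- DSR_STG_LD_STS holds exactly one row, so the join cannot multiply records.
SELECT src.*
FROM   src_orders src,
       dsr_stg_ld_sts stg
WHERE  src.last_upd_tmstmp >  stg.ld_cmprsn_start_tmstmp
AND    src.last_upd_tmstmp <= stg.ld_cmprsn_end_tmstmp

Because LD_CMPRSN_START_TMSTMP carries the end timestamp of the previous run, a failed or skipped run simply widens the next run's window rather than losing records.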

2 Incremental Load When Source Record Has No Timestamp
To extract and load data incrementally, one needs timestamp information in the source records. However, you may often come across a source system that does not have timestamps. We had two such source systems, and we had to handle the extra complexity described below.
The source system table did not have any record_status_code information to indicate whether a record is active or not; all inactive records were physically deleted from the source table. In the datamart, however, we followed a better way of handling deletion, i.e. soft delete. That means we kept a record_status_code column (in our tables), which gets populated with the value 'D' once the record is deleted from the business.
In summary, we had a source that has no timestamps and that physically deletes records. From this we had to populate a datamart incrementally while also adhering to the soft delete policy.

2.1 Solution
In most data warehousing scenarios there is a staging area between the source and the datamart. Utilize that staging area to build the timestamp and record status information (the staging area table will have the same structure as the source table plus two extra columns, i.e. TIME_STMP and RCRD_STS_CD). The only way to achieve this is to compare today's snapshot of the source table with yesterday's snapshot of the same table.


One simple mechanism for the comparison would be to maintain a temporary table that stores yesterday's snapshot of the source table. However, this solution demands a lot of overhead in terms of replicating the whole source system database.
There are better solutions; the one we built in our case is described below:
2.1.1 For each staging area table, create two stored procedures to execute from the Informatica mapping (target pre-load and target post-load). The functionality of each stored procedure is as follows:
2.1.2 The target pre-load procedure sets the RCRD_STS_CD column value to 'T' for every row whose value is not 'D'. For example: procedure name = sp_upd_pre (see the sketch at the end of this section).
2.1.3 The target post-load procedure sets the RCRD_STS_CD column value to 'D' for every row that still has the value 'T', and puts sysdate in the timestamp column. For example: procedure name = sp_upd_post.
2.1.4 Let's take a case where more than one mapping loads data into a single table.
2.1.5 The first mapping calls procedure sp_upd_pre through a Stored Procedure transformation, which does the target pre-load.
2.1.6 All the mappings either insert new records or update existing ones and set RCRD_STS_CD to 'A'. The timestamp value is set to sysdate for a new record, but while updating an existing record the timestamp value remains unchanged.
2.1.7 The last mapping calls procedure sp_upd_post through a Stored Procedure transformation, which does the target post-load.
2.1.8 Once all mappings run successfully, the two columns of the table will be updated as per the logic below.

Type of source record                                     Rcrd_sts_cd   Rcrd_sts_cd   Timestamp     Timestamp
                                                          before load   after load    before load   after load
New record                                                (none)        A             (none)        Sysdate
Old record                                                A             A             Old date      Old date
Old record deleted from source after previous load run    A             D             Old date      Sysdate
Old record deleted from source before previous load run   D             D             Old date      Old date

2.1.9 In the next part of the workflow, the mappings that load data from the staging area to the datamart will extract only the 1st and 3rd types of records described in the table above. That means the mappings will extract only the new records and those records that were deleted recently (after the previous load run).
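A minimal Oracle sketch of the two procedures from steps 2.1.2 and 2.1.3; the staging table name stg_src_tbl is assumed, while the procedure names and column behavior come from the steps above:

-- Target pre-load: flag every non-deleted row as a deletion candidate.
CREATE OR REPLACE PROCEDURE sp_upd_pre AS
BEGIN
  UPDATE stg_src_tbl
  SET    rcrd_sts_cd = 'T'
  WHERE  rcrd_sts_cd <> 'D';
END;
/

-- Target post-load: rows still flagged 'T' were not re-sent by the source,
-- so mark them soft-deleted and stamp the deletion time.
CREATE OR REPLACE PROCEDURE sp_upd_post AS
BEGIN
  UPDATE stg_src_tbl
  SET    rcrd_sts_cd = 'D',
         time_stmp   = SYSDATE
  WHERE  rcrd_sts_cd = 'T';
END;
/

Any row the source still sends is reset to 'A' by the mappings in between, so only truly deleted rows remain 'T' when sp_upd_post runs.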


3 Duplicate Records from Source

A number of times you will face data quality issues such as duplicate data arriving from the source. Various solutions are available, depending on the treatment of duplicate records agreed with the client. In our case the agreed treatment was to pick up any one record from each set of duplicates.
A simple solution in this scenario would be to use DISTINCT in the source qualifier query. However, this solution has two limitations:
1) Using DISTINCT in a query can degrade query performance heavily.
2) If the source is not an RDBMS, you will not be able to use a source qualifier query.
In our case we designed a better solution.

3.1 Solution
We can use a Sorter transformation (with the Distinct option checked) to remove duplicate records. Please note in the screenshot below that the Distinct option must be checked to get distinct records.

[Screenshot: Sorter transformation properties; verify that the Distinct option is checked]


There could be a more complex scenario where "duplicate" means there is more than one record for a given set of columns (the business key, which is supposed to be unique), while those records need not have the same values in all the other columns. In this case, use an Aggregator transformation with a group-by on the business key columns, as in the screenshot and the SQL analogue below.


[Screenshot: Aggregator transformation; check the Group By box so that Informatica selects the last record from each set of records having the same delivery number (business key). The rest of the records are not processed further.]
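For reference, a database-side analogue of this deduplication (the actual work is done in the Aggregator, not in SQL, and the table and column names here are assumed):

-- Keep one row per business key; the latest row wins, mirroring the
-- Aggregator's "last record per group" behavior. Names are illustrative.
SELECT *
FROM  (SELECT d.*,
              ROW_NUMBER() OVER (PARTITION BY delivery_number
                                 ORDER BY     last_upd_tmstmp DESC) AS rn
       FROM   src_delivery d) t
WHERE  t.rn = 1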

4 Two Database Technologies in a Single Mapping

In our case one database was Oracle and the other was SQL Server. The unique challenge was to extract incremental records from SQL Server while the control information for the incremental mechanism was stored in an Oracle table.

4.1 Solution
We can build this mechanism using Informatica mapping variables, by creating two pipelines (threads) inside one mapping.
The first pipeline extracts the comparison_start_timestamp of the last load from the Oracle table and sets a variable (say $$STARTTIME) with that timestamp.
The second pipeline extracts all records from the worldlink SQL Server tables that have a timestamp greater than $$STARTTIME. A sample source qualifier query using the variable follows.



SELECT
    Packages.PackageId,
    PackageDetails.ItemNumber,
    PackageDetails.Quantity,
    PackageDetails.UOM,
    PackageDetails.POLineNumber,
    PackageDetails.DeleteFlag,
    Sites.CrossReference
FROM
    Packages,
    PackageDetails,
    Sites,
    Shipments
WHERE
    PackageDetails.PackageId = Packages.Id
    AND Shipments.SiteId = Sites.Id
    AND Packages.ShipmentId = Shipments.Id
    AND PackageDetails.UpdateDateTime >= CONVERT(datetime, '$$STARTTIME')

Note: Ensure that you put single quotes (' ') before and after the variable ($$STARTTIME) when you use it in the SQL query, even though it is a datetime-type variable.
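As a hedged sketch of the first pipeline (the Oracle side), assuming the control table from section 1 and an assumed date format mask:

-- Oracle side: fetch the last load's comparison start timestamp, formatted
-- as a string that SQL Server's CONVERT can parse. The mask is an assumption.
SELECT TO_CHAR(ld_cmprsn_start_tmstmp, 'MM/DD/YYYY HH24:MI:SS') AS start_time
FROM   dsr_stg_ld_sts

In an Expression transformation, a port such as SETVARIABLE($$STARTTIME, start_time) then carries the value into the second pipeline's source qualifier.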

5 How to Optimize Source Read

When two mappings take different information from the same very large source table, we can optimize the total load time by merging the two mappings, so that the source table is read only once while both target tables are loaded.
Here we discuss an example we handled, which had an added complexity: the two target tables were a fact table and its related dimension table, where the fact table has a foreign key to the dimension table.

5.1 Solution
Loading the dimension and fact tables simultaneously is technically not feasible if the foreign key constraint on the fact table is enabled. To overcome this challenge we used two threads inside the mapping.
1st Thread
After the source qualifier, the record flow splits into two branches. The 1st branch loads the target dimension table. The 2nd branch handles fact records.


2nd branch of 1st Thread (for fact records)

The 2nd branch again splits into a 3rd and a 4th branch. Here we introduced a temporary fact table (say fact_tmp) that does not have any referential integrity constraint. In the 3rd branch, all new fact records (instead of being inserted into the fact table) populate fact_tmp. In the 4th branch, all old fact records are updated directly in the fact table.
Note: All the branches described above are inside the 1st thread and they share a single source qualifier.

2nd Thread
Once the 1st thread is completed (more specifically, once loading of the dimension table is completed), all the records from the fact_tmp table are inserted into the fact table; fact_tmp is then truncated, as in the sketch below.
The performance optimization comes from the use of a single source qualifier (meaning the very large source table is read only once instead of twice) to load both target tables (the dimension and the fact table). The following attached document shows one such example pictorially.

[Attachment: Diagram for loading dim & fact in same mapping.pdf]
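A minimal sketch of the 2nd thread's work, assuming simplified table structures (fact_tbl, fact_tmp, and the column names are illustrative):

-- Move the staged fact rows now that the dimension keys they reference exist.
INSERT INTO fact_tbl (fact_id, dim_key, measure_amt)
SELECT fact_id, dim_key, measure_amt
FROM   fact_tmp;

COMMIT;

-- Empty the staging fact table for the next run (TRUNCATE commits implicitly in Oracle).
TRUNCATE TABLE fact_tmp;

Because fact_tmp has no foreign key constraint, the 1st thread can write fact rows before their dimension rows are committed; the constraint is only checked here, after the dimension load has finished.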

6 Source Fact Record Does Not Have a Dimension

Normally a source system provides all the information needed to build all the dimension keys for a fact record. However, data quality issues sometimes arise where a few fact records have dimension information missing (a NULL value in the related source columns). Due to the referential integrity of the fact table, a direct insert of such records into the fact table is not feasible.

6.1 Solution
To achieve this we have to create dummy dimension keys, as described below.
The relevant dimension, for which a few source fact records have a NULL value, has to be populated with one extra entry called "dummy" with the id -9999 (that is what we did in our case).
If a fact record coming from the source does not have that specific dimension information, the record is still inserted into the fact table with dimension_key = -9999.
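A minimal sketch of the two pieces involved, under assumed names (product_dim, product_key): the dummy member is seeded once, and the mapping defaults the key when the dimension lookup returns NULL:

-- One-time seed of the dummy dimension member; table and columns are illustrative.
INSERT INTO product_dim (product_key, product_name)
VALUES (-9999, 'UNKNOWN');

In the mapping, an Expression transformation port such as IIF(ISNULL(lkp_product_key), -9999, lkp_product_key) then substitutes the dummy key for fact records whose dimension lookup found no match.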

