ETL Design Challenges and Solutions through Informatica
Akshayananda Maiti
Contents

1 Incremental load mechanism
  1.1 Solution
2 Incremental load when Source record has no timestamp
  2.1 Solution
3 Duplicate records from Source
  3.1 Solution
4 Two database technologies in a single mapping
  4.1 Solution
5 How to optimize source read
  5.1 Solution
6 Source fact record does not have a dimension
  6.1 Solution
TCS Confidential
Copyright 2007 by Tata Consultancy Services. No part of this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted
in any form or by any means electronic, mechanical, photocopying, recording, or otherwise without the permission of Tata Consultancy Services.
1 Incremental load mechanism

1.1 Solution
Usually any ETL system has a control table that stores daily run information such as run start time, run end time and a run success flag.
Utilize that table to also store the timestamp of the last record extracted from the source, e.g. 30th March 2008 9:00 am. During the next run, extract only the records whose timestamp is greater than the value stored by the previous run (30th March 2008 9:00 am in this example).
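As a minimal sketch of this filter (the control table, source table and column names below are illustrative assumptions, not the actual project objects), the incremental extract in the source qualifier could look like this:

  -- Extract only records changed after the timestamp stored by the previous successful run.
  SELECT src.*
  FROM   SRC_ORDERS src                      -- hypothetical source table
  WHERE  src.LAST_UPD_TMSTMP >
         (SELECT MAX(LD_END_TMSTMP)          -- hypothetical control-table column
          FROM   CTRL_RUN_STS                -- hypothetical control table
          WHERE  RUN_STS_CD = 'S')           -- 'S' = successful run (assumed flag value)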
In our case, we created two tables having the same structure, as shown below:

  DSR_STG_LD_STS
  DSR_LD_STS

Table Name: DSR_LD_STS

  Field Name                Data Type      Description
  DSR_LD_STS_ID             Number(10)
  LD_START_TMSTMP           Date
  LD_END_TMSTMP             Date
  LD_CMPRSN_START_TMSTMP    Date
  LD_CMPRSN_END_TMSTMP      Date
  LD_STS_CD                 Char(1)
  DSR_TRNSCTN_TMSTMP        Date           SYSDATE
  DSR_TRNSCTN_TYP_CD        Varchar2(1)    'I' if Insert, 'U' if Update
  DSR_TRNSCTN_USER_ID       Varchar2(30)   DSR_INFORMATICA_USER
  DSR_EFCTV_END_DT          Date           12-31-9999
Before the data loading into the datamart starts, two columns in the loading status tables (LD_CMPRSN_START_TMSTMP and LD_CMPRSN_END_TMSTMP, listed in the table above) are set to identify the time window for source records. Any source record created or updated inside this time window is extracted through the datamart mappings. The following points give the details of the mechanism:

  - DSR_STG_LD_STS stores the latest workflow run record (Current).
  - DSR_LD_STS stores one record for each run of the workflow (History + Current).
  - Mapping m_USD_Load_Status_ins_at_start runs before the datamart load.
  - Mapping m_USD_Load_Status_upd_at_end runs after the datamart load.
Values set by m_USD_Load_Status_ins_at_start (at the start of the load):

  Field Name                Value
  DSR_LD_STS_ID             previous id + 1
  LD_START_TMSTMP           Sysdate
  LD_END_TMSTMP
  LD_CMPRSN_START_TMSTMP
  LD_CMPRSN_END_TMSTMP      Sysdate
  LD_STS_CD
  DSR_TRNSCTN_TMSTMP        Sysdate
  DSR_TRNSCTN_TYP_CD
  DSR_TRNSCTN_USER_ID       DSR_INFORMATICA_USER
  DSR_EFCTV_END_DT          12-31-9999
Values set by m_USD_Load_Status_upd_at_end (at the end of the load):

  Field Name                Value
  DSR_LD_STS_ID             previous id + 1
  LD_START_TMSTMP           Sysdate
  LD_END_TMSTMP             Sysdate
  LD_CMPRSN_START_TMSTMP
  LD_CMPRSN_END_TMSTMP      Sysdate
  LD_STS_CD
  DSR_TRNSCTN_TMSTMP        Sysdate
  DSR_TRNSCTN_TYP_CD
  DSR_TRNSCTN_USER_ID       DSR_INFORMATICA_USER
  DSR_EFCTV_END_DT          12-31-9999
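As a minimal sketch, a datamart mapping can apply this window in its source qualifier as follows. DSR_STG_LD_STS holds only the current run's record, as noted above; the source table and its timestamp column are assumptions:

  -- Extract only source records created/updated inside the comparison window
  -- set by m_USD_Load_Status_ins_at_start for the current run.
  SELECT src.*
  FROM   SRC_TRANSACTION src                 -- hypothetical source table
  WHERE  src.LAST_UPD_TMSTMP >  (SELECT LD_CMPRSN_START_TMSTMP FROM DSR_STG_LD_STS)
    AND  src.LAST_UPD_TMSTMP <= (SELECT LD_CMPRSN_END_TMSTMP   FROM DSR_STG_LD_STS)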
2 Incremental load when Source record has no timestamp

2.1 Solution
In most data warehousing scenarios there is a staging area between the source and the datamart. Utilize that staging area to build the timestamp and record status information (the staging area table has the same structure as the source table plus two extra columns, TIME_STMP and RCRD_STS_CD). The only way to achieve this is to compare today's snapshot of the source table with yesterday's snapshot of the same table.
One simple mechanism for the comparison would be to maintain a temporary table that stores yesterday's snapshot of the source table. However, this solution demands a lot of overhead in terms of replicating the whole source system database.
There are better solutions; the one we built in our case is described below:
2.1.1 For each staging area table, create two stored procedures to be executed from the Informatica mapping (target pre load and target post load). The functionality of each stored procedure is as follows:
2.1.2 The target pre load procedure sets the RCRD_STS_CD column to T for every record whose value is not D. For example: procedure name = sp_upd_pre.
2.1.3 The target post load procedure sets the RCRD_STS_CD column to D for records that still have the value T, and puts sysdate in the timestamp column. For example: procedure name = sp_upd_post. (A sketch of both procedures is given at the end of this section.)
2.1.4 Let us take a case where more than one mapping loads data into a single table.
2.1.5 The first mapping calls procedure sp_upd_pre through a Stored Procedure transformation, which performs the target pre load.
2.1.6 All the mappings either insert new records or update existing ones and set RCRD_STS_CD to A. The timestamp value is set to sysdate for a new record, but while updating an existing record the timestamp value remains unchanged.
2.1.7 The last mapping calls procedure sp_upd_post through a Stored Procedure transformation, which performs the target post load.
2.1.8 Once all the mappings run successfully, the two columns of the table will have been updated as per the logic below.
  Record type                      | Rcrd_sts_cd before load | Rcrd_sts_cd after load | Timestamp before load | Timestamp after load
  New record                       |                         | A                      |                       | Sysdate
  Old record (still in source)     | A                       | A                      | Old date              | Old date
  Old record (deleted in this run) | A                       | D                      | Old date              | Sysdate
  Old record (deleted earlier)     | D                       | D                      | Old date              | Old date
2.1.9 In the next part of the workflow, the mappings that load data from the staging area to the datamart extract only the 1st and 3rd type of records described in the table above. That means the mappings extract only the new records, or those records which have been deleted recently (after the previous load run).
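As referenced in 2.1.3, here is a minimal Oracle PL/SQL sketch of the two procedures. The staging table name STG_CUSTOMER is a hypothetical example; TIME_STMP and RCRD_STS_CD are the extra staging columns described above, and commit behaviour is left to the Informatica session/connection settings:

  CREATE OR REPLACE PROCEDURE sp_upd_pre AS
  BEGIN
    -- Before the staging load: mark every record that is not already deleted
    -- as 'T' (to be re-verified by this run).
    UPDATE STG_CUSTOMER
    SET    RCRD_STS_CD = 'T'
    WHERE  RCRD_STS_CD <> 'D';
  END;
  /

  CREATE OR REPLACE PROCEDURE sp_upd_post AS
  BEGIN
    -- After the staging load: any record still marked 'T' was not found in
    -- today's source extract, so flag it as deleted and stamp it with sysdate.
    UPDATE STG_CUSTOMER
    SET    RCRD_STS_CD = 'D',
           TIME_STMP   = SYSDATE
    WHERE  RCRD_STS_CD = 'T';
  END;
  /

The staging-to-datamart mappings can then pick up the 1st and 3rd record types simply by filtering on TIME_STMP, since only new and just-deleted records carry the current run's date.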
3 Duplicate records from Source

3.1 Solution
We can use a Sorter transformation (with the Distinct option checked) to remove duplicate records. Note that the Distinct property of the Sorter transformation must be checked to get distinct records.
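When the duplicates come from a relational source, an alternative to the Sorter-based approach described above is to eliminate them in the database itself with a SELECT DISTINCT in the source qualifier override, for example (table and column names are assumptions):

  -- Remove exact duplicates at the source database before they reach the mapping.
  SELECT DISTINCT CUST_ID, CUST_NAME, CUST_CITY
  FROM   SRC_CUSTOMER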
4 Two database technologies in a single mapping

4.1 Solution
We can build this mechanism by using Informatica mapping variables and by creating two threads (pipelines) inside one mapping.
The first thread extracts the comparison_start_timestamp of the last load from the Oracle table and sets a mapping variable (say $$STARTTIME) to that timestamp.
The second thread extracts all records from the WorldLink SQL Server tables that have a timestamp greater than $$STARTTIME. A sample source qualifier override using the variable is sketched below.
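A minimal sketch of such a source qualifier SQL override (the original sample is not reproduced here; the WL_ORDER table and its columns are assumptions):

  -- SQL Server source qualifier override. $$STARTTIME is the Informatica mapping
  -- variable set by the first thread from the Oracle control table; it is expanded
  -- as text before the query runs, so the CONVERT/date format may need adjusting
  -- to the variable's format.
  SELECT WL_ORDER.ORDER_ID,
         WL_ORDER.ORDER_AMT,
         WL_ORDER.LAST_UPD_TMSTMP
  FROM   WL_ORDER
  WHERE  WL_ORDER.LAST_UPD_TMSTMP > CONVERT(DATETIME, '$$STARTTIME')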
5 How to optimize source read

5.1 Solution
Simultaneous loading of the dimension and fact tables is technically not feasible if the foreign key constraint on the fact table is enabled. To overcome this challenge we used two threads inside the mapping.

1st Thread
After the source qualifier, the record flow splits into two branches. The 1st branch loads the target dimension table. The 2nd branch writes the fact records into a temporary table (fact_tmp).
2nd Thread
Once the 1st thread is completed (more specifically, once loading into the dimension table is completed), all the records from the fact_tmp table are inserted into the fact table. The fact_tmp table is then truncated.
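A minimal SQL sketch of what the 2nd thread does (fact_tmp and the final fact table come from the description above; the table and column names are assumptions):

  -- The dimension rows are already in place, so the foreign key on the fact table
  -- is satisfied; move the staged fact rows, then empty the temporary table.
  INSERT INTO FACT_SALES (DIM_KEY, SALES_AMT, LOAD_TMSTMP)
  SELECT DIM_KEY, SALES_AMT, SYSDATE
  FROM   FACT_TMP;

  TRUNCATE TABLE FACT_TMP;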
The performance optimization comes from using a single source qualifier (that is, the very large source table is read only once instead of twice) to load both target tables (the dimension and the fact table). An attached Acrobat document shows one such example pictorially.
6 Source fact record does not have a dimension

6.1 Solution
To achieve this, we have to create dummy dimension keys as described below.
The dimension for which some source fact records have a NULL value has to be populated with one extra entry called "dummy", with the id -9999 (that is what we did in our case).
If a fact record coming from the source does not have that specific dimension information, the record is still inserted into the fact table with dimension_key = -9999.
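A minimal SQL sketch of the idea (the dimension and fact table and column names are assumptions):

  -- One-time setup: add the dummy member to the dimension.
  INSERT INTO DIM_PRODUCT (PRODUCT_KEY, PRODUCT_NAME)
  VALUES (-9999, 'DUMMY');

  -- While loading the fact: substitute -9999 whenever the dimension lookup finds no match.
  INSERT INTO FACT_SALES (PRODUCT_KEY, SALES_AMT)
  SELECT NVL(d.PRODUCT_KEY, -9999), s.SALES_AMT
  FROM   STG_SALES s
  LEFT JOIN DIM_PRODUCT d ON d.PRODUCT_CD = s.PRODUCT_CD;

Inside the mapping itself, the same substitution can be done after the dimension Lookup with an expression such as IIF(ISNULL(PRODUCT_KEY), -9999, PRODUCT_KEY).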