
Informatica Join Vs Database Join

In this DWBI Concepts original article, we test the performance of the Informatica PowerCenter 8.5 Joiner transformation against an Oracle 10g database join. The results give application developers insight for making informed performance-tuning decisions.

Which is the fastest? Informatica or Oracle?

In our previous article, we tested the performance of the ORDER BY operation in Informatica and Oracle and found that, under our test conditions, Oracle sorts 14% faster than Informatica. This time we look at the JOIN operation, not only because JOIN is the single most important data set operation, but also because JOIN performance gives a developer the data needed to design proper pushdown optimization manually.

Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate and trust their information assets. Oracle database, on the other hand, is arguably the most successful and powerful RDBMS, trusted since the 1980s in all sorts of business domains and across all major platforms. Each of these systems is among the best in the technology it supports. But when it comes to application development, developers often struggle to strike the right balance of operational load between the two. This article will help them make an informed decision.
Which JOINs data faster? Oracle or Informatica?

As an application developer, you can either use join syntax at the database level to join your data or use a Joiner transformation in Informatica to achieve the same outcome. The question is: which system does this faster?
Test Preparation

We will perform the same test at 4 different data points (data volumes) and log the results. We will start with 1 million rows in the detail table and 0.1 million rows in the master table. Subsequently we will test with 2 million, 4 million and 6 million detail rows against 0.2 million, 0.4 million and 0.6 million master rows. Here are the details of the setup we will use:

1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as ETL tool
3. Database and Informatica set up on different physical servers running HP UNIX
4. Source database table has no constraint, no index, no database statistics and no partition
5. Source database table is not available in the Oracle shared pool before it is read
6. There is no session-level partition in Informatica PowerCenter
7. There is no parallel hint provided in the extraction SQL query
8. Informatica Joiner has enough cache size

We have used two sets of Informatica PowerCenter mappings created in Informatica PowerCenter Designer. The first mapping, m_db_side_join, uses an INNER JOIN clause in the source qualifier to join data at the database level. The second mapping, m_Infa_side_join, uses an Informatica Joiner to join data at the Informatica level. We executed these mappings at the different data points and logged the results. In addition to the above test, we executed the m_db_side_join mapping once again, this time with proper database-side indexes and statistics, and logged the results.
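The difference between the two mappings can be sketched outside Informatica. The following is only an illustration, not the original mappings: it assumes Python's built-in sqlite3 module in place of Oracle, hypothetical master and detail tables, and a plain dictionary standing in for the Joiner cache. It shows that both approaches produce the same joined rows; it says nothing about their relative speed.

```python
import sqlite3

def build_db():
    # Hypothetical tiny master/detail tables (not the article's test data).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE master (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE detail (master_id INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO master VALUES (?, ?)",
                     [(1, "A"), (2, "B"), (3, "C")])
    conn.executemany("INSERT INTO detail VALUES (?, ?)",
                     [(1, 10), (1, 20), (2, 30), (4, 40)])
    return conn

def db_side_join(conn):
    # Database-side approach: the INNER JOIN runs inside the database.
    sql = ("SELECT d.master_id, m.name, d.amount "
           "FROM detail d INNER JOIN master m ON m.id = d.master_id")
    return sorted(conn.execute(sql).fetchall())

def tool_side_join(conn):
    # Tool-side approach: read both tables separately, cache the master
    # rows by join key, then stream the detail rows against that cache.
    cache = {}
    for master_id, name in conn.execute("SELECT id, name FROM master"):
        cache.setdefault(master_id, []).append(name)
    joined = []
    for master_id, amount in conn.execute("SELECT master_id, amount FROM detail"):
        for name in cache.get(master_id, []):
            joined.append((master_id, name, amount))
    return sorted(joined)

conn = build_db()
db_rows = db_side_join(conn)
tool_rows = tool_side_join(conn)
print(db_rows == tool_rows)
```

In the tool-side variant every master and detail row has to travel from the database to the tool before the join happens, which is one reason the network assumption stated for this test matters.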
Result

The following graph shows the performance of Informatica and the database in terms of the time taken by each system to join data. The average time is plotted along the vertical axis and the data points are plotted along the horizontal axis.

Data Point   Master Table Record Count   Detail Table Record Count
1            0.1 M                       1 M
2            0.2 M                       2 M
3            0.4 M                       4 M
4            0.6 M                       6 M

Verdict

In our test environment, Oracle 10g performs the JOIN operation 24% faster than the Informatica Joiner transformation without a database index, and 42% faster with a database index.

Assumption

1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments

Note

1. This data can only be used for performance comparison and cannot be used for performance benchmarking.
2. This data is only indicative and may vary under different testing conditions.

What is incremental aggregation?


When using incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. If the source changes only incrementally and you can capture changes, you can configure the session to process only those changes. This allows the Informatica Server to update your target incrementally, rather than forcing it to process the entire source and recalculate the same calculations each time you run the session.

Comparing Performance of SORT operation (Order By) in Informatica and Oracle


In this DWBI Concepts original article, we make Oracle database and Informatica PowerCenter lock horns to find out which of them handles the data sorting operation faster. The results give application developers insight for making informed performance-tuning decisions.

Which is the fastest? Informatica or Oracle?

Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate and trust their information assets. Oracle database, on the other hand, is arguably the most successful and powerful RDBMS, trusted since the 1980s in all sorts of business domains and across all major platforms. Each of these systems is among the best in the technology it supports. But when it comes to application development, developers often struggle to strike the right balance of operational load between the two.

Think about a typical ETL operation used in enterprise-level data integration. A lot of the data processing can be directed either to the database or to the ETL tool. In general, both the database and the ETL tool can perform such operations with almost the same efficiency and capability. But to achieve optimized performance, a developer must carefully decide which system to trust with each individual processing task. In this article we take a basic database operation, sorting, and put the two systems to the test to determine which one does it faster, if at all.
Which sorts data faster? Oracle or Informatica?

As an application developer, you can either use ORDER BY at the database level to sort your data or use a Sorter transformation in Informatica to achieve the same outcome. The question is: which system does this faster?
Test Preparation

We will perform the same test at different data points (data volumes) and log the results. We will start with 1 million records and double the volume for each subsequent data point. Here are the details of the setup we will use:

1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as ETL tool
3. Database and Informatica set up on different physical servers running HP UNIX
4. Source database table has no constraint, no index, no database statistics and no partition
5. Source database table is not available in the Oracle shared pool before it is read
6. There is no session-level partition in Informatica PowerCenter
7. There is no parallel hint provided in the extraction SQL query
8. The source table has 10 columns and the first 8 columns are used for sorting
9. Informatica Sorter has enough cache size

We have used two sets of Informatica PowerCenter mappings created in Informatica PowerCenter Designer. The first mapping, m_db_side_sort, uses an ORDER BY clause in the source qualifier to sort data at the database level. The second mapping, m_Infa_side_sort, uses an Informatica Sorter to sort data at the Informatica level. We executed these mappings at the different data points and logged the results.
Result

The following graph shows the performance of Informatica and Database in terms of time taken by each system to sort data. The time is plotted along vertical axis and data volume is plotted along horizontal axis.

Verdict

The above experiment demonstrates that Oracle database is faster than Informatica in the SORT operation by an average of 14%.

Assumption

1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments

Note

This data can only be used for performance comparison and cannot be used for performance benchmarking. For the Informatica and Oracle performance comparison of the JOIN operation, see the article "Informatica Join Vs Database Join".

Implementing Informatica Incremental Aggregation


Using incremental aggregation, we apply captured changes in the source data (the CDC part) to aggregate calculations in a session. If the source changes incrementally and we can capture the changes, then we can configure the session to process only those changes. This allows the Integration Service to update the target incrementally, rather than forcing it to delete the previous load's data, process the entire source data and recalculate the same values each time we run the session.
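The idea can be sketched in a few lines of Python. This is only a conceptual illustration under simplifying assumptions: the aggregate is a SUM, the captured changes are new rows only, and the saved totals are kept in a plain dictionary. The function names and sample data are hypothetical, and the sketch does not reproduce how the Integration Service stores its aggregate data.

```python
from collections import defaultdict

def full_aggregate(rows):
    # Recalculate every total from the entire source (the costly path).
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)

def apply_increment(saved_totals, new_rows):
    # Apply only the captured changes to the previously saved totals.
    totals = dict(saved_totals)
    for key, amount in new_rows:
        totals[key] = totals.get(key, 0) + amount
    return totals

day1 = [("D10", 100), ("D20", 50)]   # first load
day2 = [("D10", 25), ("D30", 70)]    # captured changes from the next run

saved = full_aggregate(day1)
incremental = apply_increment(saved, day2)
recomputed = full_aggregate(day1 + day2)
print(incremental == recomputed)
```

Both paths reach the same totals, but the incremental path reads only the rows in day2 on the second run.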

Using Informatica Normalizer Transformation


Normalizer, a native transformation in Informatica, can ease many complex data transformation requirements. Learn how to use the Normalizer effectively here.
Using Normalizer Transformation

A Normalizer is an active transformation that returns multiple rows from a single source row; it returns duplicate data for single-occurring source columns. The Normalizer transformation parses multiple-occurring columns from COBOL sources, relational tables, or other sources. The Normalizer can be used to transpose the data in columns to rows. In effect, the Normalizer does the opposite of the Aggregator!
Example of Data Transpose using Normalizer

Think of a relational table that stores four quarters of sales by store, where we need to create a row for each sales occurrence. We can configure a Normalizer transformation to return a separate row for each quarter, as shown below.
The following source rows contain four quarters of sales by store:

Source Table

Store    Quarter1   Quarter2   Quarter3   Quarter4
Store1   100        300        500        700
Store2   250        450        650        850

The Normalizer returns a row for each store and sales combination. It also returns an index (GCID) that identifies the quarter number:

Target Table

Store    Sales   Quarter
Store1   100     1
Store1   300     2
Store1   500     3
Store1   700     4
Store2   250     1
Store2   450     2
Store2   650     3
Store2   850     4

How Informatica Normalizer Works

Suppose we have the following data in the source:

Name   Month   Transportation   House Rent   Food
Sam    Jan     200              1500         500
John   Jan     300              1200         300
Tom    Jan     300              1350         350
Sam    Feb     300              1550         450
John   Feb     350              1200         290
Tom    Feb     350              1400         350

and we need to transform the source data and populate the target table as below:

Name   Month   Expense Type   Expense
Sam    Jan     Transport      200
Sam    Jan     House rent     1500
Sam    Jan     Food           500
John   Jan     Transport      300
John   Jan     House rent     1200
John   Jan     Food           300
Tom    Jan     Transport      300
Tom    Jan     House rent     1350
Tom    Jan     Food           350

... and so on.

Now below is the screen-shot of a complete mapping which shows how to achieve this result using Informatica PowerCenter Designer. Image: Normalization Mapping Example 1

I will explain the mapping further below.


Setting Up Normalizer Transformation Property

First we need to set the number of occurrences of the Expense head to 3 in the Normalizer tab of the Normalizer transformation, since we have Food, House Rent and Transportation. This in turn creates the corresponding 3 input ports in the Ports tab, along with the fields Individual and Month.

In the Ports tab of the Normalizer, the ports are created automatically as configured in the Normalizer tab. Interestingly, we will observe two new columns, namely GK_EXPENSEHEAD and GCID_EXPENSEHEAD. The GK field generates a sequence number starting from the value defined in the Sequence field, while the GCID field holds the value of the occurrence, i.e. the column number of the input Expense head. Here 1 is for FOOD, 2 is for HOUSERENT and 3 is for TRANSPORTATION.

The GCID therefore tells us which expense type each output row corresponds to when the columns are converted to rows. Below is the screen-shot of the expression that handles this GCID:
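Since the screen-shots are not reproduced here, the column-to-row transposition and the GCID decoding can be sketched in Python. This is an illustration of the concept only, not Informatica expression syntax: the normalize function and the EXPENSE_TYPE mapping are hypothetical names, the GK column is left out, and the GCID numbering follows the order stated above (1 for Food, 2 for House Rent, 3 for Transportation).

```python
def normalize(rows, repeating_columns):
    # Emit one output row per occurrence of the repeating column group.
    # GCID is the 1-based position of the occurrence within the source row.
    out = []
    for row in rows:
        for gcid, column in enumerate(repeating_columns, start=1):
            out.append({
                "Name": row["Name"],
                "Month": row["Month"],
                "Expense": row[column],
                "GCID": gcid,
            })
    return out

# Decoding step: translate the GCID back into an expense type label.
EXPENSE_TYPE = {1: "Food", 2: "House rent", 3: "Transport"}

source = [
    {"Name": "Sam", "Month": "Jan",
     "Transportation": 200, "House Rent": 1500, "Food": 500},
]

normalized = normalize(source, ["Food", "House Rent", "Transportation"])
target = [(r["Name"], r["Month"], EXPENSE_TYPE[r["GCID"]], r["Expense"])
          for r in normalized]
for line in target:
    print(line)
```

One source row for Sam becomes three target rows, one per expense type, with the single-occurring columns Name and Month repeated on each.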

What is the difference between Normal load and Bulk load?


Load types:
1) Bulk load
2) Normal load

Normal load:
1) Used in case of less data.
2) We can get its log details.
3) We can rollback and commit.
4) Session recovery is possible.
5) Performance may be low.

Bulk load:
1) Used in case of large data.
2) No log details are available.
3) We can't rollback and commit.
4) Session recovery is not possible.
5) Performance improves.

Implementing Informatica Partitions


Why use Informatica Pipeline Partition?

Identifying and eliminating performance bottlenecks optimizes session performance. After tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the system hardware while processing the session.
PowerCenter Informatica Pipeline Partition

Different Types of Informatica Partitions

We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin.

Informatica Pipeline Partitioning Explained


Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the transformations and the target. When the Integration Service runs the session, it can achieve higher performance by partitioning the pipeline and performing the extract, transformation, and load for each partition in parallel.

A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option, we can configure multiple partitions for a single pipeline stage.

Setting partition attributes includes partition points, the number of partitions, and the partition types. In the session properties we can add or edit partition points. When we change partition points we can define the partition type and add or delete partitions (number of partitions). We can set the following attributes to partition a pipeline:

Partition point: Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a pipeline between any two partition points. The Integration Service redistributes rows of data at partition points. When we add a partition point, we increase the number of pipeline stages by one. Increasing the number of partitions or partition points increases the number of threads. We cannot create partition points at Source instances or at Sequence Generator transformations.

Number of partitions: A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we increase the number of processing threads, which can improve session performance. We can define up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline. The number of partitions remains consistent throughout the pipeline. The Integration Service runs the partition threads concurrently.

Partition types: The Integration Service creates a default partition type at each partition point. If we have the Partitioning option, we can change the partition type. The partition type controls how the Integration Service distributes data among partitions at partition points. We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin.

Database partitioning: The Integration Service queries the database system for table partition information. It reads partitioned data from the corresponding nodes in the database.

Pass-through: The Integration Service processes data without redistributing rows among partitions. All rows in a single partition stay in the partition after crossing a pass-through partition point. Choose pass-through partitioning when we want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.

Round-robin: The Integration Service distributes data evenly among all partitions. Use round-robin partitioning when we want each partition to process approximately the same number of rows, i.e. load balancing.

Hash auto-keys: The Integration Service uses a hash function to group rows of data among partitions. The Integration Service groups the data based on a partition key. The Integration Service uses all grouped or sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter, and unsorted Aggregator transformations.

Hash user keys: The Integration Service uses a hash function to group rows of data among partitions. We define the number of ports to generate the partition key.

Key range: The Integration Service distributes rows of data based on a port or set of ports that we define as the partition key. For each port, we define a range of values. The Integration Service uses the key and ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in the pipeline are partitioned by key range.

We cannot create a partition key for hash auto-keys, round-robin, or pass-through partitioning. Add, delete, or edit partition points on the Partitions view on the Mapping tab of the session properties of a session in Workflow Manager.

The PowerCenter Partitioning Option increases the performance of PowerCenter through parallel data processing. This option provides a thread-based architecture and automatic data partitioning that optimizes parallel processing on multiprocessor and grid-based hardware environments.
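How rows end up in partitions under three of these types can be sketched in Python. This is a conceptual model only, not the Integration Service's implementation: the function names are hypothetical, CRC32 is an arbitrary stand-in for whatever hash function the product uses, and rows that fall outside every key range are simply skipped in this sketch.

```python
import zlib

def round_robin(rows, n):
    # Deal rows out in turn so every partition gets a similar row count.
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_keys(rows, n, key):
    # Rows with the same partition key always land in the same partition.
    parts = [[] for _ in range(n)]
    for row in rows:
        h = zlib.crc32(str(key(row)).encode())  # stable stand-in hash
        parts[h % n].append(row)
    return parts

def key_range(rows, ranges, key):
    # ranges holds one (start inclusive, end exclusive) pair per partition.
    parts = [[] for _ in ranges]
    for row in rows:
        for i, (start, end) in enumerate(ranges):
            if start <= key(row) < end:
                parts[i].append(row)
                break
    return parts

# Hypothetical (DEPTNO, SALARY) rows.
rows = [(10, 100), (20, 200), (10, 300), (30, 400), (20, 500), (40, 600)]
deptno = lambda row: row[0]

rr = round_robin(rows, 3)
hashed = hash_keys(rows, 3, deptno)
ranged = key_range(rows, [(0, 25), (25, 50)], deptno)
print([len(p) for p in rr], [len(p) for p in ranged])
```

Round-robin balances row counts but scatters rows of one department across partitions, whereas hash and key-range partitioning keep each department together, which is what grouping transformations such as Sorter or Aggregator need.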

Implementing Informatica Persistent Cache


You must have noticed that Informatica can take a long time to build the lookup cache, depending on the size of the lookup table. Using a persistent cache, you can save a lot of that time.
What is Persistent Cache?

Lookups are cached by default in Informatica. This means that, by default, Informatica brings the entire data of the lookup table from the database server to the Informatica server as part of building the lookup cache during the session run. If the lookup table is huge, this is bound to take quite some time. Now consider this scenario: what if you look up the same table several times, using different lookups in different mappings? Do you want to spend the time building the lookup cache again and again for each lookup? Of course not! Just use the persistent cache option. A lookup cache can be either non-persistent or persistent. The Integration Service saves or deletes the lookup cache files after a successful session run, depending on whether the lookup cache is marked as persistent.
Where and when we shall use persistent cache:

Suppose we have a lookup table with the same lookup condition and return/output ports, and the lookup table is used many times in multiple mappings. Let us say a Customer Dimension table is used in many mappings to populate the surrogate key in the fact tables based on their source system keys. If we cache the same Customer Dimension table multiple times in multiple mappings, that will definitely affect the SLA loading timeline.
So the solution is to use Named Persistent Cache.

In the first mapping we will create the Named Persistent Cache file by setting three properties in the Properties tab of the Lookup transformation.

Lookup cache persistent: To be checked, i.e. a Named Persistent Cache will be used.

Cache File Name Prefix: user_defined_cache_file_name, i.e. the Named Persistent Cache file name that will be used in all the other mappings using the same lookup table. Enter the prefix name only. Do not enter .idx or .dat.

Re-cache from lookup source: To be checked, i.e. the Named Persistent Cache file will be rebuilt or refreshed with the current data of the lookup table.

Next, in all the mappings where we want to use the same already built Named Persistent Cache, we need to set two properties in the Properties tab of the Lookup transformation.

Lookup cache persistent: To be checked, i.e. the lookup will use a Named Persistent Cache that is already saved in the Cache Directory. If the cache file is not there, the session will not fail; it will just create the cache file instead.

Cache File Name Prefix: user_defined_cache_file_name, i.e. the Named Persistent Cache file name that was defined in the mapping where the persistent cache file was created.
Note:

If there is any Lookup SQL Override, then the SQL statement in all the lookups must match exactly; even an extra blank space will fail the session that uses the already built persistent cache file. So if the incoming source data volume is high, the volume of lookup table data that needs to be cached is also high, and the same lookup table is used in many mappings, then the best way to handle the situation is to use a named persistent cache that is built once and reused.
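The build-once, reuse-everywhere behaviour can be modelled with a small Python analogy. Everything here is an assumption made for illustration: the get_lookup_cache function, the pickle file with a made-up .cache extension (Informatica writes .idx and .dat files instead), and the read_customer_dim loader that stands in for the query against the lookup table.

```python
import os
import pickle
import tempfile

def get_lookup_cache(cache_dir, prefix, load_from_source, recache=False):
    # Reuse the saved cache file when present; otherwise, or when
    # recache is requested, rebuild it from the lookup source and save it.
    path = os.path.join(cache_dir, prefix + ".cache")
    if recache or not os.path.exists(path):
        cache = load_from_source()
        with open(path, "wb") as fh:
            pickle.dump(cache, fh)
        return cache
    with open(path, "rb") as fh:
        return pickle.load(fh)

source_reads = []

def read_customer_dim():
    # Hypothetical stand-in for the expensive read of the lookup table:
    # source system key -> surrogate key.
    source_reads.append(1)
    return {"C001": 1, "C002": 2}

with tempfile.TemporaryDirectory() as cache_dir:
    # First mapping: build (re-cache) the named cache.
    first = get_lookup_cache(cache_dir, "customer_dim", read_customer_dim,
                             recache=True)
    # Later mapping: same prefix, so the saved file is reused.
    second = get_lookup_cache(cache_dir, "customer_dim", read_customer_dim)
    reads_after_reuse = len(source_reads)

print(first == second, reads_after_reuse)
```

The second call gets the same lookup data without touching the source again, which mirrors the time saved by the mappings that only reuse the named cache.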

Aggregation without Informatica Aggregator


Since Informatica processes data row by row, it is generally possible to handle data aggregation even without an Aggregator transformation. In certain cases, you may get a huge performance gain using this technique!

General Idea of Aggregation without Aggregator Transformation


Let us take an example: suppose we want to find the SUM of SALARY for each department of the Employee table. The SQL query for this would be:

SELECT DEPTNO, SUM(SALARY) FROM EMP_SRC GROUP BY DEPTNO;

If we need to implement this in Informatica, it would be very easy, as we would obviously go for an Aggregator transformation. By taking the DEPTNO port as GROUP BY and one output port as SUM(SALARY), the problem can be solved easily. Now the trick is to use only an Expression transformation to achieve the functionality of the Aggregator. We rely on a basic property of the Expression transformation: it can hold the value of an attribute from the previous row.
But wait... why would we do this? Aren't we complicating the thing here?

Yes, we are. But in many cases it can bring a performance benefit (especially if the input is already sorted, or when you know the input data will not violate the order, for example when you load daily data and want to sort it by day). Remember that Informatica holds all the rows in the Aggregator cache for the aggregation operation. This needs time and cache space, and it also breaks the normal row-by-row processing in Informatica. By replacing the Aggregator with an Expression, we reduce the cache space requirement and keep the row-by-row processing. The mapping below shows how to do this.
Image: Aggregation with Expression and Sorter 1

Sorter (SRT_SAL) Ports Tab

The sorter is shown here just to illustrate the concept. If you already have sorted data from the source, you need not use it, which increases the performance benefit.

Expression (EXP_SAL) Ports Tab
Image: Expression Ports Tab Properties

Sorter (SRT_SAL1) Ports Tab

Expression (EXP_SAL2) Ports Tab

Filter (FIL_SAL) Properties Tab

This is how we can implement aggregation without using Informatica aggregator transformation. Hope you liked it!
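For readers without access to the screen-shots, the core idea of holding the previous row's value while streaming sorted rows can be sketched in Python. This is not a transcription of the mapping above: the function name and sample data are hypothetical, and where the mapping uses a second sorter and a filter to keep one row per department, this sketch simply emits the total whenever the department changes.

```python
def sum_by_dept_sorted(rows):
    # rows must already be sorted by DEPTNO; only the previous key and a
    # running total are held, never the whole data set.
    result = []
    prev_dept = None
    running = 0
    for deptno, salary in rows:
        if deptno == prev_dept:
            running += salary          # same group: keep accumulating
        else:
            if prev_dept is not None:
                result.append((prev_dept, running))  # group finished
            prev_dept = deptno
            running = salary           # start the next group
    if prev_dept is not None:
        result.append((prev_dept, running))          # last group
    return result

# Hypothetical (DEPTNO, SALARY) rows from EMP_SRC.
emp_src = [(10, 1000), (20, 800), (10, 1500), (30, 900), (20, 1200)]

# sorted() plays the role of the sorter; skip it if the source is sorted.
totals = sum_by_dept_sorted(sorted(emp_src))
print(totals)
```

If the input is not sorted by department, the same department is emitted more than once, which is why the technique depends on sorted input.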

Informatica Dynamic Lookup Cache


A lookup cache does not change once it is built. But what if the data in the underlying lookup table changes after the lookup cache is created? Is there a way to keep the cache up to date even when the underlying table changes?