
Process for Initial & Delta Loads into UCM

The process outlined below enables us to perform initial and delta loads into UCM through the SDH tables, thereby maintaining a history of any merging of records that occurs during loading. This history enables the unmerging of records if it is later determined that a merge should not have been done. The process requires that the Customer data be pre-processed outside of UCM to prepare batches of data which can be fed into UCM. Data Quality software such as Trillium, First Logic, or IIR is required for pre-processing the Customer data before loading it into UCM.

Need for Pre-Processing:
In the standard batch loading process into UCM (without pre-processing), the data to be cleansed and de-duplicated is loaded into the UCM SDH tables, which act as staging tables. A UCM Workflow picks up these records from the SDH tables and processes them one at a time. This Workflow runs in single-threaded (single-instance) mode and utilizes only one CPU: even if the machine has multiple CPUs, the workflow process uses only one of them because the workflow process manager runs on a single CPU. The throughput is therefore constrained by the speed of that CPU; the higher the CPU speed, the higher the throughput. If we are able to start multiple workflow process managers, we can utilize multiple CPUs, and if we can start multiple instances of the workflow on each process manager, those instances will compete for the CPU on which that process manager runs.

Out of the box, UCM is set up to run only one instance of this Workflow for two reasons:
1. Matching occurs only between the record being processed by each Workflow instance and the records in the base tables of UCM. So, if multiple instances are run in parallel, the records being processed by the different Workflow instances are not compared (matched) with each other. This may result in duplicates in UCM.
2. If two or more records being processed by the multiple Workflow instances match the same base table record, and an auto-merge is attempted by two or more Workflows simultaneously, there may be contention over the sequence of merging. In other words, two or more incoming records may merge simultaneously with the same base table record instead of merging sequentially one after the other. This may produce unpredictable results.

However, if we can guarantee that the records being processed by the multiple Workflow instances each have a unique Window Key, both issues are avoided. Unique Window Keys imply that the records do not match each other (they are not match candidates) and that each record will match a different base table record (if it matches at all), thus eliminating any contention and unpredictable results.

Step 1: Pre-Processing Customer Data
Customer data received from a source application is fed into a Data Quality batch process (e.g. Trillium, First Logic, or IIR). The batch process generates a Window Key for each record based on pre-configured matching parameters. The output of the batch process is a file that contains all the input records with the Window Key assigned to each record.

Note: Records with identical Window Keys are candidates for duplicates. During the matching process, records with identical Window Keys are fed into the matching engine, where matching rules are evaluated on them. We do not need to perform the matching in this step; we only need to generate the Window Keys. The matching process will be called from UCM in a later step. It is recommended to also perform data cleansing and standardization in this batch process, prior to generating the Window Keys.

Scripts can be written (e.g. in Perl) to partition the data in the output file from the batch process. The script partitions the data into multiple batches such that each batch contains only records with unique Window Keys; records that have identical Window Keys end up in different batches. This ensures that in any given batch there are no potential duplicates. We will end up with as many batches as the largest number of records having identical Window Keys.
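Before walking through the partitioning example below, it may help to make the Window Key concept concrete. The sketch that follows is purely illustrative and is not how the key is actually computed: in practice the Window Key is produced by the data quality engine (Trillium, First Logic, IIR) from its pre-configured matching parameters, typically with fuzzier rules such as phonetic codes. The simplified rule here keys each record on a surname prefix plus a postal code prefix.

# Illustrative only: real Window Keys come from the data quality engine's
# configured match parameters; this simplified rule just shows why records
# describing the same customer tend to share a key.
def window_key(last_name: str, postal_code: str) -> str:
    name_part = "".join(ch for ch in last_name.upper() if ch.isalpha())[:4]
    zip_part = "".join(ch for ch in postal_code if ch.isdigit())[:3]
    return name_part + zip_part

print(window_key("Smith", "94065-1234"))  # SMIT940
print(window_key(" smith ", "94065"))     # SMIT940 -> same window, match candidates
print(window_key("Johnson", "10001"))     # JOHN100 -> different window

Records that end up with the same key become match candidates and are compared against each other by the matching engine; records with different keys are never compared, which is exactly the property the partitioning below exploits.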

Example: Data set prior to partitioning based on Window Keys

#    Record    Window Key
1    A1        WK1
2    A2        WK1
3    A3        WK1
4    A4        WK2
5    A5        WK2
6    A6        WK3
7    A7        WK3
8    A8        WK4
9    A9        WK4
10   A10       WK4
11   A11       WK5
12   A12       WK5
13   A13       WK6
14   A14       WK7

Batch #1 of the data after partitioning

#    Record    Window Key
1    A1        WK1
2    A4        WK2
3    A6        WK3
4    A8        WK4
5    A11       WK5
6    A13       WK6
7    A14       WK7

Batch #2 of the data after partitioning

#    Record    Window Key
1    A2        WK1
2    A5        WK2
3    A7        WK3
4    A9        WK4
5    A12       WK5

Batch #3 of the data after partitioning

#    Record    Window Key
1    A3        WK1
2    A10       WK4
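As a concrete sketch of the partitioning script described in Step 1, the following outline (in Python rather than the Perl suggested above; the language is an assumption) takes the data quality output already parsed into (record, window key) pairs and reproduces the three batches shown in the example:

from collections import defaultdict

def partition_by_window_key(records):
    # Split (record_id, window_key) pairs into batches so that no batch
    # contains two records with the same Window Key. The i-th record seen
    # for a given key goes into batch i, so the number of batches equals
    # the size of the largest group of identical Window Keys.
    groups = defaultdict(list)
    for record_id, window_key in records:
        groups[window_key].append(record_id)

    batches = defaultdict(list)
    for window_key, ids in groups.items():
        for i, record_id in enumerate(ids):
            batches[i].append((record_id, window_key))
    return [batches[i] for i in sorted(batches)]

# The example data set from above.
data = [("A1", "WK1"), ("A2", "WK1"), ("A3", "WK1"),
        ("A4", "WK2"), ("A5", "WK2"),
        ("A6", "WK3"), ("A7", "WK3"),
        ("A8", "WK4"), ("A9", "WK4"), ("A10", "WK4"),
        ("A11", "WK5"), ("A12", "WK5"),
        ("A13", "WK6"), ("A14", "WK7")]

for n, batch in enumerate(partition_by_window_key(data), start=1):
    print(f"Batch #{n}: {[record_id for record_id, _ in batch]}")
# Batch #1: ['A1', 'A4', 'A6', 'A8', 'A11', 'A13', 'A14']
# Batch #2: ['A2', 'A5', 'A7', 'A9', 'A12']
# Batch #3: ['A3', 'A10']

Because three records share WK1 (and three share WK4), the largest group size is three, so the data is split into three batches, matching the example above.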

Step 2: Processing the Partitioned Batches through UCM
The partitioned batches (the result of pre-processing) are loaded into the UCM SDH (Source Data & History) tables, one batch at a time, by an EIM process. Only after the UCM Workflow has finished processing all the records from the first batch does the EIM process load the next batch, so at any given point the SDH tables contain records from only a single batch. Loading one batch at a time into the SDH tables can be automated by writing a custom Workflow in Siebel UCM that monitors (by polling) the SDH tables, determines when a batch is complete, and starts the EIM process to load the next batch.

The UCM Workflow that performs cleansing (if it is not done in the batch process outside of UCM), matching, and survivorship is set up to run multiple instances/threads in parallel (for example, 200 Workflow processes in parallel). In a hypothetical scenario in which the Workflow takes about 1 second per record, and a call to the data quality software takes less than 0.5 seconds to respond to the match step in the Workflow, 200 Workflow processes running in parallel should yield a throughput of 200 records/second. This translates to 17.28 million records/day.

Note: The time taken by the Workflow per record depends heavily on how long the data quality server takes to respond to the match step in the Workflow (which makes a call out to the data quality software) with matching records and their corresponding scores.

The UCM Workflow creates new records, auto-merges duplicates, and queues the suspect duplicates for review by the Data Steward. The suspect duplicates have to be processed manually (merge or new-record decision) by the Data Steward team. As the Data Steward team makes merge decisions, UCM performs the merges and publishes the Merge messages if they are required by other systems in the enterprise.

Note: If the number of suspect duplicates is large (more than a few thousand records), the manual review by the Data Steward (or Data Steward team) could take a long time (weeks or months).

Step 3: Delta Loads & Initial Loads from Other Source Applications
The same process (Steps 1 & 2) described above can be used for monthly delta loads from a source application, as well as for the initial load of data from other source applications (after the initial load from the first source application is completed).

Advantages of using this process for initial & delta loads:
- A complete history of merge transactions is available in case a need for unmerge arises
- The same process can be used for initial loads and recurring delta loads
- Minimal customization is needed in UCM for performing initial & delta loads
- Standard UCM functionality generates the UUID and populates the cross-reference tables (if data is loaded directly into the base tables, custom steps are required for generating the UUID and populating the cross-reference tables)
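For the batch orchestration described in Step 2 (poll the SDH tables, detect that the current batch is fully processed, then start the EIM load for the next batch), the outline below sketches the control loop in Python. It is only an illustration under assumed names: the staging table, status column, and the command that launches the EIM task are placeholders rather than actual Siebel UCM object names, and in practice this logic would live in the custom Siebel Workflow described above.

import subprocess
import time

# Placeholder names: the real SDH staging table, status column, and the command
# that launches the EIM task are deployment-specific and come from the Siebel
# UCM configuration; they are assumptions here, not actual object names.
PENDING_COUNT_SQL = "SELECT COUNT(*) FROM UCM_SDH_STAGING WHERE BATCH_STATUS = 'PENDING'"
POLL_INTERVAL_SECONDS = 60

def pending_records(connection):
    # Count staged records the UCM Workflow has not yet processed.
    cursor = connection.cursor()
    cursor.execute(PENDING_COUNT_SQL)
    (count,) = cursor.fetchone()
    return count

def load_next_batch(batch_file):
    # Start the EIM load for one partitioned batch; the shell script name is
    # a stand-in for however the EIM task is launched in a given environment.
    subprocess.run(["./run_eim_load.sh", batch_file], check=True)

def orchestrate(connection, batch_files):
    # Feed the batches one at a time: load a batch, then wait until every
    # record from it has been processed before loading the next one.
    for batch_file in batch_files:
        load_next_batch(batch_file)
        while pending_records(connection) > 0:
            time.sleep(POLL_INTERVAL_SECONDS)

The essential design point is the same whether this runs as an external script or as a custom Workflow: the next EIM load must not start until the SDH tables are empty of unprocessed records, which is what guarantees that all in-flight records have unique Window Keys.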
