Informatica Best Practices - Error Handling

Table of Contents
Error Handling
Error Handling Process
Error Handling Strategies - B2B Data Transformation
Error Handling Strategies - Data Warehousing
Error Handling Strategies - General
Error Handling Techniques - PowerCenter Mappings
Error Handling Techniques - PowerCenter Workflows and Data Analyzer
2013 Informatica Corporation. All rights reserved. Velocity Methodology
Error Handling Process
Challenge
For an error handling strategy to be implemented successfully, it must be integral to the load process as
a whole. The method of implementation for the strategy will vary depending on the data integration
requirements for each project.
The resulting error handling process should, however, always involve the following three steps:
1. Error identification
2. Error retrieval
3. Error correction
This Best Practice describes how each of these steps can be facilitated within the PowerCenter environment.
Description
A typical error handling process leverages the best-of-breed error management technology available in PowerCenter, such as:
* Relational database error logging
* Email notification of workflow failures
* Session error thresholds
* The reporting capabilities of PowerCenter Data Analyzer
* Data profiling
These capabilities can be integrated to facilitate error identification, retrieval, and correction as described in the flow chart
below:
Error Identification
The first step in the error handling process is error identification. Error identification is often achieved
through the use of the ERROR() function within mappings, enablement of relational error logging in PowerCenter, and
referential integrity constraints at the database.
This approach ensures that row-level issues such as database errors (e.g., referential integrity failures), transformation errors,
and business rule exceptions for which the ERROR() function was called are captured in relational error logging tables.
Enabling the relational error logging functionality automatically writes row-level data to a set of four error handling tables
(PMERR_MSG, PMERR_DATA, PMERR_TRANS, and PMERR_SESS). These tables can be centralized in the PowerCenter
repository and store information such as error messages, error data, and source row data. Row-level errors trapped in this
manner include any database errors, transformation errors, and business rule exceptions for which the ERROR() function was
called within the mapping.
Error Retrieval
The second step in the error handling process is error retrieval. After errors have been captured in the PowerCenter repository,
it is important to make their retrieval simple and automated so that the process is as efficient as possible. Data Analyzer can be
customized to create error retrieval reports from the information stored in the PowerCenter repository. A typical error report
prompts a user for the folder and workflow name, and returns a report with information such as the session, error message, and
data that caused the error. In this way, the error is successfully captured in the repository and can be easily retrieved through a
Data Analyzer report, or through an email alert that notifies a user when a certain threshold is crossed (such as "number of errors is
greater than zero").
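The threshold alert described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module and a hypothetical, simplified `error_log` table. The table name, column names, and session names are assumptions made for the example; they do not reflect the actual PMERR_* table layout or the Data Analyzer alerting interface.

```python
import sqlite3

def sessions_over_threshold(conn, threshold=0):
    """Return (session_name, error_count) pairs whose error count exceeds threshold."""
    rows = conn.execute(
        "SELECT session_name, COUNT(*) FROM error_log "
        "GROUP BY session_name HAVING COUNT(*) > ? ORDER BY session_name",
        (threshold,),
    ).fetchall()
    return [(name, count) for name, count in rows]

# Hypothetical sample data standing in for captured row-level errors.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE error_log (session_name TEXT, error_msg TEXT)")
conn.executemany(
    "INSERT INTO error_log VALUES (?, ?)",
    [
        ("s_load_customers", "NULL in required column"),
        ("s_load_customers", "Referential integrity failure"),
        ("s_load_orders", "Invalid date"),
    ],
)

# Threshold of zero: alert on any session that logged at least one error.
alerts = sessions_over_threshold(conn, threshold=0)
for name, count in alerts:
    print(f"ALERT: session {name} logged {count} error(s)")
```

Raising the threshold suppresses alerts for sessions with only occasional errors, which may suit loads where a small number of rejected rows is tolerated.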
Error Correction
The final step in the error handling process is error correction. As PowerCenter automates the process of error identification,
and Data Analyzer can be used to simplify error retrieval, error correction is straightforward. After retrieving an error through
Data Analyzer, the error report (which contains information such as workflow name, session name, error date, error message,
error data, and source row data) can be exported to various file formats including Microsoft Excel, Adobe PDF, CSV, and
others. Upon retrieval of an error, the error report can be extracted into a supported format and emailed to a developer or DBA
to resolve the issue, or it can be entered into a defect management tracking tool. The Data Analyzer interface supports
emailing a report directly through the web-based interface to make the process even easier.
For further automation, a report broadcasting rule that emails the error report to a developer's inbox can be set up to run on a
pre-defined schedule. After the developer or DBA identifies the condition that caused the error, a fix for the error can be
implemented. The exact method of data correction depends on various factors such as the number of records with errors, data
availability requirements per SLA, the level of data criticality to the business unit(s), and the type of error that occurred.
Considerations made during error correction include:
* The 'owner' of the data should always fix the data errors. For example, if the source data is coming from an external
system, then the errors should be sent back to the source system to be fixed.
* In some situations, a simple re-execution of the session will reprocess the data.
* Partial data that has been loaded into the target systems may need to be backed out in order to avoid duplicate
processing of rows.
* Lastly, errors can also be corrected through a manual SQL load of the data. If the volume of errors is low, the rejected data
can be easily exported from the Data Analyzer error reports to Microsoft Excel or CSV format and corrected in a
spreadsheet. The corrected data can then be manually inserted into the target table using a SQL statement.
Any approach to correct erroneous data should be precisely documented and followed as a standard.
If the data errors occur frequently, then the reprocessing process can be automated by designing a special mapping or session
to correct the errors and load the corrected data into the ODS or staging area.
Data Profiling Option
For organizations that want to identify data irregularities post-load but do not want to reject such rows at load time, the
PowerCenter Data Profiling option can be an important part of the error management solution. The PowerCenter Data Profiling
option enables users to create data profiles through a wizard-driven GUI that provides profile reporting
such as orphan record identification, business rule violation, and data irregularity identification (such as NULL or default
values). The Data Profiling option comes with a license to use Data Analyzer reports that source the data profile warehouse to
deliver data profiling information through an intuitive BI tool. This is a recommended best practice since error handling reports
and data profile reports can be delivered to users through the same easy-to-use application.
Integrating Error Handling, Load Management, and Metadata
Error handling forms only one part of a data integration application. By necessity, it is tightly coupled to the load management
process and the load metadata; it is the integration of all these approaches that ensures the system is sufficiently robust for
successful operation and management. The flow chart below illustrates this in the end-to-end load process.
Error handling underpins the data integration system from end-to-end. Each of the load components
performs validation checks, the results of which must be reported to the operational team. These components are not just
PowerCenter processes such as business rule and field validation, but cover the entire data integration architecture, for
example:
* Process Validation. Are all the resources in place for the processing to begin (e.g., connectivity to source systems)?
* Source File Validation. Is the source file datestamp later than the previous load?
* File Check. Does the number of rows successfully loaded match the source rows read?
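The three checks listed above can be sketched as small, independently reportable functions. This is an illustrative outline only: the function names are hypothetical, resource availability is approximated by checking that a path exists, and the timestamps and row counts are made-up sample values.

```python
import os

def process_validation(required_paths):
    """Process validation: every required resource must be in place before the load starts."""
    return all(os.path.exists(path) for path in required_paths)

def source_file_validation(file_timestamp, previous_load_timestamp):
    """Source file validation: the file datestamp must be later than the previous load."""
    return file_timestamp > previous_load_timestamp

def file_check(rows_read, rows_loaded):
    """File check: rows successfully loaded must match the source rows read."""
    return rows_read == rows_loaded

def failed_checks(results):
    """Return the names of the checks that did not pass, for reporting to the operational team."""
    return [name for name, passed in results if not passed]

failed = failed_checks([
    ("process_validation", process_validation([os.getcwd()])),
    ("source_file_validation", source_file_validation(1_700_000_600, 1_700_000_000)),
    ("file_check", file_check(rows_read=1000, rows_loaded=998)),
])
print(failed)
```

Keeping each check as a named result makes it straightforward to report exactly which validation failed rather than a single pass or fail flag for the whole load.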
Error Handling Strategies - B2B Data Transformation
Challenge
The challenge for B2B Data Transformation (B2B DT) based solutions is to create efficient, accurate
processes for transforming data to appropriate intermediate data formats and to subsequently transform
data from those formats to correct output formats. Error handling strategies are a core part of assuring
the accuracy of any transformation process.
Error handling strategies in B2B Data Transformation solutions should address the following two needs:
1. Detection of errors in the transformation leading to successive refinement of the transformation logic during an iterative
development cycle.
2. Correct error detection, retrieval, and handling in production environments.
In general, errors can be characterized as either expected or unexpected.
An expected error is an error condition that is anticipated to occur periodically. For example, a printer running out of paper is
an expected error. In a B2B scenario, this may correspond to a partner company sending a file in an incorrect format; although
it is an error condition, it is expected from time to time. Usually, processing of an expected error is part of normal system
functionality and does not constitute a failure of the system to perform as designed.
Unexpected errors typically occur when the designers of a system believe a particular scenario is handled, but due to logic
flaws or some other implementation fault, the scenario is not handled. These errors might include hardware failures, out of
memory situations or unexpected situations due to software bugs.
Errors can also be classified by severity (e.g., warning errors and fatal errors).
For unexpected fatal errors, the transformation process is often unable to complete and may result in a loss of data. In these
cases, the emphasis is on prompt discovery and reporting of the error and support of any troubleshooting process.
Often the appropriate action for fatal unexpected errors is not addressed at the individual B2B Data Transformation translation
level but at the level of the calling process.
This Best Practice describes various strategies for handling expected and unexpected errors both from production and
development troubleshooting points of view and discusses the error handling features included as part of Informatica's B2B
Data Transformation 8.x.
Description
This Best Practice is intended to help designers of B2B DT solutions decide which error handling strategies to employ in their
solutions and to familiarize them with new features in Informatica B2B DT.
Terminology
B2B DT is used as a generic term for the parsing, transformation and serialization technologies provided in Informatica's B2B
DT products. These technologies have been made available through the B2B Data Transformation, Unstructured Data Option
for PowerCenter and as standalone products known as B2B Data Transformation and PowerExchange for Complex Data.
Note: Informatica's B2B DT was previously known as PowerExchange for Complex Data Exchange (CDE) or Itemfield Content
Master (CM).
Errors in B2B Data Transformation Solutions
There are several types of errors possible in a B2B data transformation. The common types of errors that should be handled
while designing and developing are:
* Logic errors
* Errors in structural aspects of inbound data (missing syntax, etc.)
* Value errors
* Errors reported by downstream components (i.e., legacy components in data hubs)
* Data-type errors for individual fields
* Unrealistic values (e.g., impossible dates)
* Business rule breaches
Production Errors vs. Flaws in the Design - Production errors are those where the source data or the environmental setup
does not conform to the specifications for the development, whereas flaws in design occur when the development does not
conform to the specification. For example, a production error can be an incorrect source file format that does not conform to the
specification layout given for development. A flaw in design could be as trivial as defining an element to be mandatory where
the possibility of non-occurrence of the element cannot be ruled out completely.
Unexpected Errors vs. Expected Errors - Expected errors are those that can be anticipated for a solution scenario based
upon experience (e.g., the EDI message file does not conform to the latest EDI specification). Unexpected errors are most
likely caused by environment setup issues or unknown bugs in the program (e.g., a corrupted file system).
Severity of Errors - Not all errors in a system are equally important. Some errors may require that the process be halted
until they are corrected (e.g., an incorrect format of source files). These types of errors are termed critical/fatal errors.
In other scenarios, a description field may be longer than the field length specified, but the truncation does not
affect the process. These types of errors are termed warnings. The severity of a particular error can only be defined with
respect to the impact it creates on the business process it supports. In B2B DT, the severity of errors is classified into the
following categories:
Information: A normal operation performed by B2B Data Transformation.
Warning: A warning about a possible error. For example, B2B Data Transformation generates a warning event if an operation overwrites the existing content of a data holder. The execution continues.
Failure: A component failed. For example, an anchor fails if B2B Data Transformation cannot find it in the source document. The execution continues.
Optional Failure: An optional component, configured with the optional property, failed. For example, an optional anchor is missing from the source document. The execution continues.
Fatal error: A serious error occurred, for example, a parser has an illegal configuration. B2B Data Transformation halts the execution.
Unknown: The event status cannot be determined.
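For calling code that post-processes events, the severity categories above can be modeled as an enumeration. The sketch below is a hypothetical illustration, not part of the B2B DT API: the class name, the `summarize` helper, and the sample events are assumptions. It relies only on the rule stated in the table that a fatal error is the level at which execution halts.

```python
from collections import Counter
from enum import Enum

class Severity(Enum):
    INFORMATION = "Information"
    WARNING = "Warning"
    FAILURE = "Failure"
    OPTIONAL_FAILURE = "Optional Failure"
    FATAL_ERROR = "Fatal error"
    UNKNOWN = "Unknown"

def summarize(events):
    """Count events per severity and report whether a fatal error halted execution."""
    counts = Counter(severity for severity, _message in events)
    halted = counts[Severity.FATAL_ERROR] > 0
    return counts, halted

# Hypothetical events, using the examples given in the severity table.
events = [
    (Severity.INFORMATION, "Parser started"),
    (Severity.WARNING, "Existing content of a data holder was overwritten"),
    (Severity.OPTIONAL_FAILURE, "Optional anchor missing from the source document"),
    (Severity.FATAL_ERROR, "Parser has an illegal configuration"),
]
counts, halted = summarize(events)
print({severity.value: n for severity, n in counts.items()}, "halted:", halted)
```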
Error Handling in Data Integration Architecture
The Error Handling Framework in the context of B2B DT defines a basic infrastructure and the mechanisms for building more
reliable and fault-tolerant data transformations. It integrates error handling facilities into the overall data integration architecture.
How do you integrate the necessary error handling into the data integration architecture?
User interaction: Even in erroneous situations, the data transformation should behave in a controlled way and the user should
be informed appropriately about the system's state. User interaction with the error handling should be designed to avoid cyclic
dependencies.
Robustness: The error handling should be simple. All additional code for handling error situations makes the transformation
more complex, which itself increases the probability of errors. Thus the error handling should provide some basic mechanism
for handling internal errors. However, for the error handling code it is even more important to be correct and to avoid any
nested error situations.
Separation of error handling code: Without any separation, the normal code will be cluttered by a lot of error handling code.
This makes the code less readable, more error prone, and more difficult to maintain.
Specific error handling versus complexity: Errors must be classified more precisely in order to handle them effectively and
to take measures tailored to specific errors.
Detailed error information versus complexity: Whenever the transformation terminates due to an error, suitable information
is needed to analyze the error. Otherwise, it is not feasible to investigate the original fault that caused the error.
Performance: Error handling should add little overhead during normal operation.
Reusability: The services of the error handling component should be designed for reuse because it is a basic component
useful for a number of transformations.
Error Handling Mechanisms in B2B DT
The common techniques that would help a B2B DT designer in designing an error handling strategy are summarized below.
Debug: This method of error handling underlines the usage of the built-in capabilities of B2B DT for most basic errors. The
debug of a B2B DT parser/serializer can be done in multiple ways.
* Highlight the selection of an element on the example source file.
* Use a Writeval component along with disabling automatic output in the project property.
* Use of the disable and enable feature for each of the components.
* Run the parser/serializer and browse the event log for any failures.
All the debug components should be removed before deploying the service in production.
Schema Modification: This method of error handling demonstrates one of the ways to communicate the erroneous record
once it is identified. The erroneous data can be captured at different levels (e.g., at field level or at record level). In this
approach, additional XML elements are added to the schema structure to hold the error data and the error message.
This allows the developer to validate each of the elements against the business rules; if any element or record does
not conform to the rules, that data and a corresponding error message can be stored in the XML structure.
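The schema modification approach can be illustrated outside of B2B DT with a short sketch using Python's standard xml.etree.ElementTree module. The element names (`Record`, `Amount`, `Error`, `ErrorData`, `ErrorMessage`) and the validation rule are hypothetical choices for the example; a real solution would define these elements in its own XML schema and perform the validation inside the transformation.

```python
import xml.etree.ElementTree as ET

def validate_record(record):
    """Attach an <Error> element holding the bad data and a message if a rule is breached."""
    amount = record.findtext("Amount", default="")
    try:
        # Hypothetical business rule: Amount must be a non-negative number.
        if float(amount) < 0:
            raise ValueError("negative amount")
    except ValueError as exc:
        error = ET.SubElement(record, "Error")
        ET.SubElement(error, "ErrorData").text = amount
        ET.SubElement(error, "ErrorMessage").text = f"Invalid Amount: {exc}"
    return record

doc = ET.fromstring(
    "<Records>"
    "<Record><Amount>12.50</Amount></Record>"
    "<Record><Amount>abc</Amount></Record>"
    "</Records>"
)
for rec in doc.findall("Record"):
    validate_record(rec)

# Valid records pass through unchanged; invalid ones carry their own error details.
print(ET.tostring(doc, encoding="unicode"))
```

Because the error travels with the record, a downstream consumer can route records on the presence of the error element without a separate lookup.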
Error Data in a Different File: This method stores the erroneous records or elements in a
separate file rather than in the output data stream. It is useful when a business-critical timeline for the data processing
cannot be compromised for a couple of erroneous records. It allows the correct records to be
processed while the erroneous records are inspected and corrected as a separate stream. In this method, the business
validations are done for each of the elements with the specified rules, and if any elements or records fail to conform, they
are directed to a predefined error file. The path to the error file is generally passed in the output file for further investigation, or the
file is written to a static path on which a script is executed to send the error files to operations for correction.
Design Time Tools for Error Handling in B2B DT
A failure is an event that prevents a component from processing data in the expected way. An anchor might fail if it searches for
text that does not exist in the source document. A transformer or action might fail if its input is empty or has an inappropriate
data type.
A failure can be a perfectly normal occurrence. For example, a source document might contain an optional date. A parser
contains a Content anchor that processes the date, if it exists. If the date does not exist in a particular source document, the
Content anchor fails. By configuring the transformation appropriately, you can control the result of a failure. In the example, you
might configure the parser to ignore the missing data and continue processing.
B2B Data Transformation offers various mechanisms for error handling during design time:
Feature: Description
B2B DT event log: This is a B2B DT specific event generation mechanism where each event corresponds to an action taken by a transformation, such as recognizing a particular lexical sequence. It is useful in the troubleshooting of work in progress, but event files can grow very large, hence it is not recommended for production systems. It is distinct from the event system offered by other B2B DT products and from the OS based event system. Custom events can be generated within transformation scripts. Event based failures are reported as exceptions or other errors in the calling environment.
B2B DT trace files: Trace files are controlled by the B2B DT configuration application. Automated strategies may be applied for the recycling of trace files.
Custom error information: At the simplest levels, custom errors can be generated as B2B DT events (using the AddEventAction). However, if the event mechanism is disabled for memory or performance reasons, these are omitted. Other alternatives include generation of custom error files, integration with OS event tracking mechanisms and integration with 3rd party management platform software. Integration with OS eventing or 3rd party platform software requires custom extensions to B2B DT.
The event log is the main troubleshooting tool in B2B DT solutions. It captures all of the details in an event log file when an
error occurs in the system. These files can be generated when testing in a development studio environment or running a
service engine. These files reside in the CM_Reports directory specified in the CM_Config file under the installation directory of
B2B DT. In the studio environment, the default location is Results/events.cme in the project folder. The error messages
appearing in the event log file are either system generated or user-defined (which can be accomplished by adding an
AddEventAction). The AddEventAction enables the developer to pass a user-defined error message to the event log file when
a specific error condition occurs.
Overall the B2B DT event mechanism is the simplest to implement. But for large or high volume production systems, the event
mechanism can create very large event files, and it offers no integration with popular enterprise software administration
platforms. Informatica recommends using B2B DT Events for troubleshooting purposes during development only.
In some cases, performance constraints may determine the error handling strategy. For example, updating an external event
system may cause performance bottlenecks, and producing a formatted error report can be time consuming. In some cases,
operator interaction may be required that could potentially block a B2B DT transformation from completing.
Finally, it is worth looking at whether some part of the error handling can be offloaded outside of B2B DT to avoid performance
bottlenecks.
When using custom error schemes, consider the following:
* Multiple invocations of the same transformation may execute in parallel
* Don't hardwire error file paths
* Don't assume a single error output file
Avoid the use of the B2B DT event log for production systems (especially when processing Excel files).
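The considerations above for custom error schemes (parallel invocations, no hardwired paths, no single shared error file) can be sketched as follows. This is an illustrative outline rather than B2B DT code: the `DT_ERROR_DIR` environment variable, the `loan_parser` service name, the function names, and the validity rule are all hypothetical.

```python
import csv
import os
import tempfile
import uuid

def error_file_path(error_dir, service_name):
    """Build a unique error file path per invocation so that parallel runs never collide."""
    unique = f"{os.getpid()}_{uuid.uuid4().hex}"
    return os.path.join(error_dir, f"{service_name}_{unique}.csv")

def split_records(records, is_valid, error_dir, service_name):
    """Return the valid records; write invalid ones to a per-invocation error file."""
    good, bad = [], []
    for record in records:
        (good if is_valid(record) else bad).append(record)
    path = None
    if bad:
        os.makedirs(error_dir, exist_ok=True)
        path = error_file_path(error_dir, service_name)
        with open(path, "w", newline="") as handle:
            csv.writer(handle).writerows(bad)
    return good, path

# The error directory comes from configuration (a hypothetical variable), not a hardwired path.
error_dir = os.environ.get("DT_ERROR_DIR", tempfile.mkdtemp())
good, path = split_records(
    [["1001", "2024-01-15"], ["1002", "2024-02-30x"]],
    is_valid=lambda record: len(record[1]) == 10,  # hypothetical rule: 10-character date
    error_dir=error_dir,
    service_name="loan_parser",
)
print(good, path)
```

Returning the error file path to the caller, instead of assuming a fixed location, lets the calling process pass the path on for investigation or attach the file to a notification.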
The trace files capture the state of the system along with the process ID and failure messages. They report each
error along with a time stamp and capture details about the system in different category areas, including file system,
environment, and networking. They also record the process ID and the thread ID that was processing the execution, which aids in
identifying the system-level error (if there is one). The name of the trace file can be modified in the Configuration wizard. The
maximum size of the trace file can also be limited in the CMConfiguration editor.
If the Data Transformation Engine runs under multiple user accounts, the user logs may overwrite each other, or it may be
difficult to identify the logs belonging to a particular user. Prevent this by configuring users with different log locations.
In addition to the logs of service events, there is an Engine initialization event log. This log records problems that occur when
the Data Transformation Engine starts without reference to any service or input data. View this log to diagnose installation
problems such as missing environment variables.
The initialization log is located in the CMReports\Init directory.
New Error Handling Features in B2B DT 8.x
Using the Optional Property to Handle Failures
If the optional property of a component is not selected, a failure of the component causes its parent to fail. If the parent is also
non-optional, its own parent fails, and so forth. For example, suppose that a Parser contains a Group, and the Group contains a
Marker. All the components are non-optional. If the Marker does not exist in the source document, the
Marker fails. This causes the Group to fail, which in turn causes the Parser to fail.
If the optional property of a component is selected, a failure of the component does not bubble up to the parent. For example,
suppose that a Parser contains a Group, and the Group contains a Marker. In this example, suppose that the Group is optional.
The failed Marker causes the Group to fail, but the Parser does not fail.
Note, however, that certain components lack the optional property because the components never fail, regardless of their input.
An example is the Sort action. If the Sort action finds no data to sort, it simply does nothing. It does not report a failure.
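The propagation rule described above can be modeled with a small conceptual sketch. This is not B2B DT code: the `Component` class and `run` function are hypothetical, and the model captures only the rule that a failure bubbles up to the parent unless the failing component is optional.

```python
class Component:
    """Conceptual stand-in for a Parser, Group, or Marker (not a B2B DT class)."""

    def __init__(self, name, optional=False, children=(), fails=False):
        self.name = name
        self.optional = optional
        self.children = list(children)
        self.fails = fails  # True simulates, e.g., a Marker missing from the source

def run(component, failed):
    """Record failed components; return True if the failure propagates to the parent."""
    own_failure = component.fails
    for child in component.children:
        if run(child, failed):
            own_failure = True
    if own_failure:
        failed.append(component.name)
    # An optional component absorbs its own failure instead of passing it upward.
    return own_failure and not component.optional

# All components non-optional: the Marker failure reaches the Parser.
strict = []
run(Component("Parser", children=[
    Component("Group", children=[Component("Marker", fails=True)])
]), strict)
print(strict)

# Optional Group: the Group still fails, but the Parser does not.
lenient = []
run(Component("Parser", children=[
    Component("Group", optional=True, children=[Component("Marker", fails=True)])
]), lenient)
print(lenient)
```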
Rollback
If a component fails, its effects are rolled back. For example, suppose that a Group contains three non-optional Content
anchors that store values in data holders. If the third Content anchor fails, the Group fails. Data Transformation rolls back the
effects of the first two Content anchors. The data that the first two Content anchors already stored in data holders is removed.
The rollback applies only to the main effects of a transformation such as a parser storing values in data holders or a serializer
writing to its output. The rollback does not apply to side effects. In the above example, if the Group contains an ODBCAction
that performs an INSERT query on a database, the record that the action added to the database is not deleted.
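The distinction between rolled-back main effects and persistent side effects can be illustrated with a conceptual sketch. The `run_group` function, the tuple layout of the steps, and the sample values are hypothetical and do not represent B2B DT internals; a `None` value simulates a failing anchor, and a list stands in for an external database.

```python
def run_group(steps, data_holders, side_effects):
    """Run steps in order; on failure, restore data holders but leave side effects in place."""
    snapshot = dict(data_holders)
    for name, value, side_effect in steps:
        if side_effect is not None:
            # Simulates an external action such as a database INSERT; it is never undone.
            side_effects.append(side_effect)
        if value is None:
            # Simulated failure: roll back everything stored since the group started.
            data_holders.clear()
            data_holders.update(snapshot)
            return False
        data_holders[name] = value
    return True

data_holders = {}
side_effects = []
succeeded = run_group(
    [
        ("first", "A", None),
        ("second", "B", "INSERT row for second"),
        ("third", None, None),  # the third anchor fails
    ],
    data_holders,
    side_effects,
)
print(succeeded, data_holders, side_effects)
```

The sketch shows why side effects such as database writes may need their own compensation logic in the calling process, since the rollback of data holders does not reverse them.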
Writing a Failure Message to the User Log
A component can be configured to output failure events to a user-defined log. For example, if an anchor fails to find text in the
source document, it can write a message in the user log. This can occur even if the anchor is defined as optional so that the
failure does not terminate the transformation processing.
The user log can contain the following types of information:
* Failure level: Information, Warning, or Error
* Name of the component that failed
* Failure description
* Location of the failed component in the IntelliScript
* Additional information about the transformation status (such as the values of data holders)
CustomLog
The CustomLog component can be used as the value of the on_fail property. In the event of a failure, the CustomLog
component runs a serializer that prepares a log message. The system writes the message to a specified location.
Property: Description
run_serializer: A serializer that prepares the log message.
output: The output location. The options include:
* MSMQOutput. Writes to an MSMQ queue.
* OutputDataHolder. Writes to a data holder.
* OutputFile. Writes to a file.
* ResultFile. Writes to the default results file of the transformation.
* OutputCOM. Uses a custom COM component to output the data.
Additional choices:
* OutputPort. The name of an AdditionalOutputPort where the data is written.
* StandardErrorLog. Writes to the user log.
Error Handling in B2B DT with PowerCenter Integration
In a B2B DT solution scenario, both expected and unexpected errors can occur, whether caused by a production issue or a
flaw in the design. If the right error handling processes are not in place, then when an error occurs, the processing aborts with a
description of the error in the log (event file). This can also result in data loss if the erroneous records are not captured and
reported correctly. It also fails the program from which it is called. For example, if the B2B Data Transformation service is used through
PowerCenter UDO/B2B DT, then an error causes the PowerCenter session to fail.
This section focuses on how to orchestrate PowerCenter and B2B DT if the B2B DT services are being
called from a PowerCenter mapping. Below are the most common ways of orchestrating the error trapping and error handling
mechanism.
1. Use PowerCenter's Robustness and Reporting Function: In general, the PowerCenter engine is robust
and powerful enough to handle complex erroneous scenarios. Thus, the usual practice is to perform any business
validation or valid values comparison in PowerCenter. This enables error records to be directed to the already established
Bad Files or Reject Tables in PowerCenter. This feature also allows the repository to store information about the number
of records loaded and the number of records rejected, and thus aids in easier reporting of errors.
2. Output the Error in an XML Tag: When complex parsing validations are involved, B2B DT is more powerful than
PowerCenter in handling them (e.g., string functions and regular expressions). In these scenarios, the validations are
performed in the B2B DT engine and the schema is redesigned to capture the error information in the associated tag of
the XML. When this XML is parsed in a PowerCenter mapping, the error tags are directed to custom-built
error reporting tables from which the errors can be reported. The design of the custom-built error tables
depends on the design of the error handling XML schema. Generally, these tables correspond one-to-one with the XML
structure, with a few additional metadata fields such as processing date and source system.
3. Output to the PowerCenter Log Files: If an unexpected error occurs in the B2B DT processing, the error
descriptions and details are stored in the log file directory specified in CMconfig.xml. The path to the file and the
fatal errors are reported to the PowerCenter log so that operators can quickly detect problems. With care, this unexpected error
handling can also be used for user-defined errors in the B2B DT transformation by adding the AddEventAction
and marking the error type as "Failure".
Best Practices for Handling Errors in Production
In a production environment, the turnaround time of the processes should be as short as possible and as automated as
possible. Using B2B DT integration with PowerCenter, these requirements should be met seamlessly, without intervention from
IT professionals for error reporting, correction of the data file, and reprocessing of data.
Example Scenario 1 - HIPAA Error Reporting
Example Scenario 2 - Emailing Error Files to Operator
Below is a case study for an implementation at a major financial client. The solution was implemented with total automation for
the sequence of error trapping, error reporting, correction and reprocessing of data. The high level solution steps are:
* Analyst receives loan tape via Email from a dealer
* Analyst saves the file to a file system on a designated file share
* A J2EE server monitors the file share for new files and pushes them to PowerCenter
* PowerCenter invokes B2B DT to process (passing XML data fragment, supplying path to loan tape file and other
parameters)
* Upon a successful outcome, PowerCenter saves the data to the target database
* PowerCenter notifies the Analyst via Email
* On failure, PowerCenter Emails the XLS error file containing the original data and errors
Error Handling Strategies - Data Warehousing
Challenge
A key requirement for any successful data warehouse or data integration project is that it attains
credibility within the user community. At the same time, it is imperative that the warehouse be as
up-to-date as possible since the more recent the information derived from it is, the more relevant it is to
the business operations of the organization, thereby providing the best opportunity to gain an advantage
over the competition.
Transactional systems can manage to function even with a certain amount of error since the impact of an individual transaction
(in error) has a limited effect on the business figures as a whole, and corrections can be applied to erroneous data after the
event (i.e., after the error has been identified). In data warehouse systems, however, any systematic error (e.g., for a particular
load instance) not only affects a larger number of data items, but may potentially distort key reporting metrics. Such data
cannot be left in the warehouse "until someone notices" because business decisions may be driven by such information.
Therefore, it is important to proactively manage errors, identifying them before, or as, they occur. If errors occur, it is equally
important either to prevent them from getting to the warehouse at all, or to remove them from the warehouse immediately (i.e.,
before the business tries to use the information in error).
The types of error to consider include:
* Source data structures
* Sources presented out-of-sequence
* 'Old' sources re-presented in error
* Incomplete source files
* Data-type errors for individual fields
* Unrealistic values (e.g., impossible dates)
* Business rule breaches
* Missing mandatory data
* O/S errors
* RDBMS errors
These cover both high-level (i.e., related to the process or a load as a whole) and low-level (i.e., field or column-related errors)
concerns.
Description
In an ideal world, when an analysis is complete, you have a precise definition of source and target data; you can be sure that
every source element was populated correctly, with meaningful values, never missing a value, and fulfilling all relational
constraints. At the same time, source data sets always have a fixed structure, are always available on time (and in the correct
order), and are never corrupted during transfer to the data warehouse. In addition, the OS and RDBMS never run out of
resources, or have permissions and privileges change.
Realistically, however, the operational applications are rarely able to cope with every possible business scenario or
combination of events; operational systems crash, networks fall over, and users may not use the transactional systems in quite
the way they were designed. The operational systems also typically need some flexibility to allow non-fixed data to be stored
(typically as free-text comments). In every case, there is a risk that the source data does not match what the data warehouse
expects.
Because of the credibility issue, in-error data must not be propagated to the metrics and measures used by the business
managers. If erroneous data does reach the warehouse, it must be identified and removed immediately (before the current
version of the warehouse can be published). Preferably, error data should be identified during the load process and prevented
from reaching the warehouse at all. Ideally, erroneous source data should be identified before a load even begins, so that no
resources are wasted trying to load it.
As a principle, data errors should be corrected at the source. As soon as any attempt is made to correct errors within the
warehouse, there is a risk that the lineage and provenance of the data will be lost. From that point on, it
becomes impossible to guarantee that a metric or data item came from a specific source via a specific chain of processes. As a
by-product, adopting this principle also helps to tie both the end-users and those responsible for the source data into the
warehouse process; source data staff understand that their professionalism directly affects the quality of the reports, and
end-users become owners of their data.
As a final consideration, error management (the implementation of an error handling strategy) complements and overlaps load
management, data quality and key management, and operational processes and procedures.
Load management processes record at a high-level if a load is unsuccessful; error management records the details of why the
failure occurred.
Quality management defines the criteria whereby data can be identified as in error; and error management identifies the
specific error(s), thereby allowing the source data to be corrected.
Operational reporting shows a picture of loads over time, and error management allows analysis to identify systematic errors,
perhaps indicating a failure in operational procedure.
Error management must therefore be tightly integrated within the data warehouse load process. This is shown in the high level
flow chart below:
Error Management Considerations
High-Level Issues
From previous discussion of load management, a number of checks can be performed before any attempt is made to load a source
data set. Without load management in place, it is unlikely that the warehouse process will be robust enough to satisfy any
end-user requirements, and error correction processing becomes moot (in so far as nearly all maintenance and development
resources will be working full time to manually correct bad data in the warehouse). The following assumes that you have
implemented load management processes similar to Informatica's best practices.
* Process Dependency checks in the load management can identify when a source data set is missing, duplicates a previous
version, or has been presented out of sequence, and where the previous load failed but has not yet been corrected.
* Load management prevents this source data from being loaded. At the same time, error management processes should record
the details of the failed load; noting the source instance, the load affected, and when and why the load was aborted.
* Source file structures can be compared to expected structures stored as metadata, either from header information or by
attempting to read the first data row.
* Source table structures can be compared to expectations; typically this can be done by interrogating the RDBMS catalogue
directly (and comparing to the expected structure held in metadata), or by simply running a 'describe' command against the
table (again comparing to a pre-stored version in metadata).
* Control file totals (for file sources) and row number counts (table sources) are also used to determine if files have been
corrupted or truncated during transfer, or if tables have no new data in them (suggesting a fault in an operational
application).
* In every case, information should be recorded to identify where and when an error occurred, what sort of error it was, and
any other relevant process-level details.
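The structure and control-total checks above can be sketched outside PowerCenter as follows. This is a minimal illustration only: the function name, the error labels, and the comma-separated file layout are assumptions, and in practice the expected columns and control total would come from the load management metadata.

```python
import csv
import io

def check_source_file(text, expected_columns, control_total):
    """Return a list of process-level error strings; an empty list means the file passes.

    expected_columns and control_total stand in for values held in metadata
    (hypothetical names, not part of any PowerCenter API).
    """
    errors = []
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["EMPTY_FILE: no header row found"]
    header, data = rows[0], rows[1:]
    # Structure check: compare the header row to the expected structure.
    if header != expected_columns:
        errors.append(f"STRUCTURE: expected {expected_columns}, found {header}")
    # Reconciliation check: compare the data row count to the control total.
    if len(data) != control_total:
        errors.append(f"ROW_COUNT: control total {control_total}, found {len(data)}")
    return errors
```

Both checks run even if the first one fails, so that the recorded error details describe everything that is wrong with the source instance.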
Low-Level Issues
Assuming that the load is to be processed normally (i.e., that the high-level checks have not caused the load to abort), further
error management processes need to be applied to the individual source rows and fields.
* Individual source fields can be compared to expected data-types against standard metadata within the repository, or
additional information added by the development team. In some instances, this is enough to abort the rest of the load; if the
field structure is incorrect, it is much more likely that the source data set as a whole either cannot be processed at all or (more
worryingly) is likely to be processed unpredictably.
* Data conversion errors can be identified on a field-by-field basis within the body of a mapping. Built-in error handling can
be used to spot failed date conversions, conversions of string to numbers, or missing required data. In rare cases, stored
procedures can be called if a specific conversion fails; however this cannot be generally recommended because of the
potentially crushing impact on performance if a particularly error-filled load occurs.
* Business rule breaches can then be picked up. It is possible to define allowable values, or acceptable value ranges within
PowerCenter mappings (if the rules are few, and it is clear from the mapping metadata that the business rules are included
in the mapping itself). A more flexible approach is to use external tables to codify the business rules. In this way, only the
rules tables need to be amended if a new business rule needs to be applied. Informatica has suggested methods to
implement such a process.
* Missing Key/Unknown Key issues have already been defined in their own best practice document Key Management
in Data Warehousing Solutions with suggested management techniques for identifying and handling them. However,
from an error handling perspective, such errors must still be identified and recorded, even when key management
techniques do not formally fail source rows with key errors. Unless a record is kept of the frequency with which particular
source data fails, it is difficult to realize when there is a systematic problem in the source systems.
* Inter-row errors may also have to be considered. These may occur when a business process expects a certain hierarchy of
events (e.g., a customer query, followed by a booking request, followed by a confirmation, followed by a payment). If the
events arrive from the source system in the wrong order, or where key events are missing, it may indicate a major problem
with the source system, or the way in which the source system is being used.
* An important principle to follow is to try to identify all of the errors on a particular row before halting processing, rather
than rejecting the row at the first instance. This seems to break the rule of not wasting resources trying to load a source
data set if we already know it is in error; however, since the row needs to be corrected at source, then reprocessed
subsequently, it is sensible to identify all the corrections that need to be made before reloading, rather than fixing the
first, re-running, and then identifying a second error (which halts the load for a second time).
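The principle of identifying every error on a row before rejecting it can be illustrated with a short sketch. The field names (customer_id, order_date, amount) and the individual rules are hypothetical; the point is only that each check appends to a list rather than returning at the first failure.

```python
from datetime import datetime

def validate_row(row):
    """Collect every error found on the row instead of stopping at the first.

    Returns a list of (field, reason) pairs; an empty list means the row is clean.
    """
    errors = []
    # Missing mandatory data
    if not row.get("customer_id"):
        errors.append(("customer_id", "missing mandatory data"))
    # Data conversion: string to date (also catches impossible dates)
    try:
        datetime.strptime(row.get("order_date") or "", "%Y-%m-%d")
    except ValueError:
        errors.append(("order_date", "invalid date"))
    # Data conversion: string to number, then a simple business rule
    try:
        if float(row.get("amount") or "") < 0:
            errors.append(("amount", "business rule breach: negative amount"))
    except ValueError:
        errors.append(("amount", "not numeric"))
    return errors
```

A row with a blank key, an impossible date, and a non-numeric amount yields three entries in one pass, so all three corrections can be requested from the source system before the row is reprocessed.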
OS and RDBMS Issues
Since best practice means that referential integrity (RI) issues are proactively managed within the loads, instances where the
RDBMS rejects data for referential reasons should be very rare (i.e., the load should already have identified that reference
information is missing).
However, there is little that can be done to identify the more generic RDBMS problems that are likely to occur: changes to
schema permissions, running out of temporary disk space, dropping of tables and schemas, invalid indexes, no further table
space extents available, missing partitions and the like.
Similarly, interaction with the OS means that changes in directory structures, file permissions, disk space, command syntax,
and authentication may occur outside of the data warehouse. Often such changes are driven by Systems Administrators who,
from an operational perspective, are not aware that there is likely to be an impact on the data warehouse, or are not aware that
the data warehouse managers need to be kept up to speed.
In both of the instances above, the nature of the errors may be such that not only will they cause a load to fail, but it may be
impossible to record the nature of the error at that point in time. For example, if RDBMS user IDs are
revoked, it may be impossible to write a row to an error table if the error process depends on the revoked ID; if disk space runs
out during a write to a target table, this may affect all other tables (including the error tables); if file permissions on a UNIX host
are amended, bad files themselves (or even the log files) may not be accessible.
Most of these types of issues can be managed by a proper load management process, however. Since setting the status of a
load to 'complete' should be absolutely the last step in a given process, any failure before, or including, that point leaves the
load in an 'incomplete' state. Subsequent runs should note this, and enforce correction of the last load before beginning the
new one.
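A minimal sketch of this status check follows. It assumes the load management process can supply an ordered history of (load id, status) pairs; the function name and the status strings are illustrative only.

```python
def can_start_load(load_history):
    """Decide whether a new load may begin.

    load_history is a list of (load_id, status) pairs in run order (an assumed
    representation of the load management tables). Returns (ok, blocking_load_id).
    """
    if not load_history:
        return True, None
    last_id, last_status = load_history[-1]
    # 'complete' is set as the very last step of a load, so anything else
    # means the previous run failed somewhere and must be corrected first.
    if last_status != "complete":
        return False, last_id
    return True, None
```

Because the check relies only on the absence of a 'complete' status, it still works when the failure itself could not be written to an error table.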
The best practice to manage such OS and RDBMS errors is, therefore, to ensure that the Operational Administrators and
DBAs have proper and working communication with the data warehouse management to allow proactive control of changes.
Administrators and DBAs should also be available to the data warehouse operators to rapidly explain and resolve such errors if
they occur.
Auto-Correction vs. Manual Correction
Load management and key management best practices (Key Management in Data Warehousing Solutions) have already
defined auto-correcting processes; the former to allow loads themselves to launch, roll back, and reload without manual
intervention, and the latter to allow RI errors to be managed so that the quantitative quality of the warehouse data is preserved,
and incorrect key values are corrected as soon as the source system provides the missing data.
We cannot conclude from these two specific techniques, however, that the warehouse should attempt to change source data
as a general principle. Even if this were possible (which is debatable), such functionality would mean that the absolute link
between the source data and its eventual incorporation into the data warehouse would be lost. As soon as one of the
warehouse metrics was identified as incorrect, unpicking the error would be impossible, potentially requiring a whole section of
the warehouse to be reloaded entirely from scratch.
In addition, such automatic correction of data might hide the fact that one or other of the source systems had a generic fault, or
more importantly, had acquired a fault because of on-going development of the transactional applications, or a failure in user
training.
The principle to apply here is to identify the errors in the load, and then alert the source system users that data should be
corrected in the source system itself, ready for the next load to pick up the right data. This maintains the data lineage, allows
source system errors to be identified and ameliorated in good time, and permits extra training needs to be identified and
managed.
Error Management Techniques
Simple Error Handling Structure
The following data structure is an example of the error metadata that should be captured as a minimum within the error
handling strategy.
The example defines three main sets of information:
* The ERROR_DEFINITION table, which stores descriptions for the various types of errors, including:
- process-level (e.g., incorrect source file, load started out-of-sequence)
- row-level (e.g., missing foreign key, incorrect data-type, conversion errors) and
- reconciliation (e.g., incorrect row numbers, incorrect file total etc.).
* The ERROR_HEADER table provides a high-level view on the process, allowing a quick identification of the frequency of
error for particular loads and of the distribution of error types. It is linked to the load management processes via the
SRC_INST_ID and PROC_INST_ID, from which other process-level information can be gathered.
* The ERROR_DETAIL table stores information about actual rows with errors, including how to identify the specific row that
was in error (using the source natural keys and row number) together with a string of field identifier/value pairs
concatenated together. It is not expected that this information will be deconstructed as part of an automatic correction
load, but if necessary this can be pivoted (e.g., using simple UNIX scripts) to separate out the field/value pairs for
subsequent reporting.
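As a rough illustration of the three tables, the sketch below creates them in an in-memory SQLite database. Only the table names and the SRC_INST_ID and PROC_INST_ID columns come from the description above; every other column name, the data types, and the 'FIELD=value|FIELD=value' packing of the field/value string are assumptions made for the example.

```python
import sqlite3

# Hypothetical column layout; only the table names, SRC_INST_ID and
# PROC_INST_ID are taken from the text.
SCHEMA = """
CREATE TABLE ERROR_DEFINITION (
    ERROR_ID    INTEGER PRIMARY KEY,
    ERROR_LEVEL TEXT NOT NULL,   -- 'process', 'row' or 'reconciliation'
    ERROR_DESC  TEXT NOT NULL
);
CREATE TABLE ERROR_HEADER (
    ERROR_HDR_ID INTEGER PRIMARY KEY,
    SRC_INST_ID  INTEGER NOT NULL,
    PROC_INST_ID INTEGER NOT NULL,
    ERROR_ID     INTEGER NOT NULL REFERENCES ERROR_DEFINITION(ERROR_ID),
    ERROR_COUNT  INTEGER NOT NULL
);
CREATE TABLE ERROR_DETAIL (
    ERROR_HDR_ID INTEGER NOT NULL REFERENCES ERROR_HEADER(ERROR_HDR_ID),
    SRC_ROW_NUM  INTEGER NOT NULL,
    NATURAL_KEY  TEXT,
    FIELD_VALUES TEXT            -- concatenated field identifier/value pairs
);
"""

def build_error_store():
    """Create the three error tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def split_field_values(packed):
    """Pivot an assumed 'FIELD=value|FIELD=value' string into a dict for reporting."""
    return dict(pair.split("=", 1) for pair in packed.split("|") if pair)
```

The split_field_values helper plays the role of the pivot step mentioned above; a delimiter that cannot occur in the source data would need to be chosen in a real implementation.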
Error Handling Strategies - General
Challenge
The challenge is to accurately and efficiently load data into the target data architecture. This Best
Practice describes various loading scenarios, the use of data profiles, an alternate method for identifying
data errors, methods for handling data errors, and alternatives for addressing the most common types of
problems. For the most part, these strategies are relevant whether your data integration project is
loading an operational data structure (as with data migrations, consolidations, or loading various sorts of
operational data stores) or loading a data warehousing structure.
Description
Regardless of target data structure, your loading process must validate that the data conforms to known rules of the business.
When the source system data does not meet these rules, the process needs to handle the exceptions in an appropriate
manner. The business needs to be aware of the consequences of either permitting invalid data to enter the target or rejecting it
until it is fixed. Both approaches present complex issues. The business must decide what is acceptable and prioritize two
conflicting goals:
* The need for accurate information.
* The ability to analyze or process the most complete information available with the understanding that errors can exist.
Data Integration Process Validation
In general, there are three methods for handling data errors detected in the loading process:
* Reject All. This is the simplest to implement since all errors are rejected from entering the target when they are
detected. This provides a very reliable target that the users can count on as being correct, although it may not be complete.
Both dimensional and factual data can be rejected when any errors are encountered. Reports indicate what the errors are
and how they affect the completeness of the data.
Dimensional or Master Data errors can cause valid factual data to be rejected because a foreign key relationship cannot be
created. These errors need to be fixed in the source systems and reloaded on a subsequent load. Once the corrected rows
have been loaded, the factual data will be reprocessed and loaded, assuming that all errors have been fixed. This delay may
cause some user dissatisfaction since the users need to take into account that the data they are looking at may not be a
complete picture of the operational systems until the errors are fixed. For an operational system, this delay may affect
downstream transactions.
The development effort required to fix a Reject All scenario is minimal, since the rejected data can be processed through
existing mappings once it has been fixed. Minimal additional code may need to be written since the data will only enter the
target if it is correct, and it would then be loaded into the data mart using the normal process.
* Reject None. This approach gives users a complete picture of the available data without having to consider data that
was not available due to it being rejected during the load process. The problem is that the data may not be complete or
accurate. All of the target data structures may contain incorrect information that can lead to incorrect decisions or faulty
transactions.
With Reject None, the complete set of data is loaded, but the data may not support correct transactions or aggregations.
Factual data can be allocated to dummy or incorrect dimension rows, resulting in grand total numbers that are correct, but
incorrect detail numbers. After the data is fixed, reports may change, with detail information being redistributed along different
hierarchies.
The development effort to fix this scenario is significant. After the errors are corrected, a new loading process needs to correct
all of the target data structures, which can be a time-consuming effort based on the delay between an error being detected and
fixed. The development strategy may include removing information from the target, restoring backup tapes for each night's load,
and reprocessing the data. Once the target is fixed, these changes need to be propagated to all downstream data structures or
data marts.
* Reject Critical. This method provides a balance between missing information and incorrect information. It involves
examining each row of data and determining the particular data elements to be rejected. All changes that are valid are
processed into the target to allow for the most complete picture. Rejected elements are reported as errors so that they can
be fixed in the source systems and loaded on a subsequent run of the ETL process.
This approach requires categorizing the data in two ways: 1) as key elements or attributes, and 2) as inserts or updates.
Key elements are required fields that maintain the data integrity of the target and allow for hierarchies to be summarized at
various levels in the organization. Attributes provide additional descriptive information per key element.
Inserts are important for dimensions or master data because subsequent factual data may rely on the existence of the
dimension data row in order to load properly. Updates do not affect the data integrity as much because the factual data can
usually be loaded with the existing dimensional data unless the update is to a key element.
The development effort for this method is more extensive than Reject All since it involves classifying fields as critical or
non-critical, and developing logic to update the target and flag the fields that are in error. The effort also incorporates some
tasks from the Reject None approach, in that processes must be developed to fix incorrect data in the entire target data
architecture.
Informatica generally recommends using the Reject Critical strategy to maintain the accuracy of the target. By providing the
most fine-grained analysis of errors, this method allows the greatest amount of valid data to enter the target on each run of the
ETL process, while at the same time screening out the unverifiable data fields. However, business management needs to
understand that some information may be held out of the target, and also that some of the information in the target data
structures may be at least temporarily allocated to the wrong hierarchies.
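One possible reading of the Reject Critical approach is sketched below: a row is rejected outright only when a key element is in error, while an attribute in error is held out and reported so that the rest of the row can load. The field names and the KEY_FIELDS classification are hypothetical and would come from the business in practice.

```python
# Hypothetical classification of fields into key elements; everything else
# in a row is treated as a descriptive attribute.
KEY_FIELDS = {"store_id", "region_id"}

def reject_critical(row, field_errors):
    """Apply a Reject Critical decision to one row.

    field_errors is the set of field names that failed validation.
    Returns (loadable_row_or_None, sorted list of fields to report as errors).
    """
    failed_keys = KEY_FIELDS & field_errors
    if failed_keys:
        # A key element is in error: the whole row is held out of the target.
        return None, sorted(field_errors)
    # Only attributes failed: load the valid fields and report the rest.
    loadable = {name: value for name, value in row.items() if name not in field_errors}
    return loadable, sorted(field_errors)
```

In both branches the failing fields are returned for reporting, so they can be fixed in the source system and picked up on a subsequent run.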
Handling Errors in Dimension Profiles
Profiles are tables used to track history changes to the source data. As the source systems change, profile records are created
with date stamps that indicate when the change took place. This allows power users to review the target data using either
current (As-Is) or past (As-Was) views of the data.
A profile record should occur for each change in the source data. Problems occur when two fields change in the source system
and one of those fields results in an error. The first value passes validation, which produces a new profile record, while the
second value is rejected and is not included in the new profile. When this error is fixed, it would be desirable to update the
existing profile rather than creating a new one, but the logic needed to perform this UPDATE instead of an INSERT is
complicated. If a third field is changed in the source before the error is fixed, the correction process is complicated further.
The following example represents three field values in a source system. The first row on 1/1/2000 shows the original values. On
1/5/2000, Field 1 changes from Closed to Open, and Field 2 changes from Black to BRed, which is invalid. On 1/10/2000, Field
3 changes from Open 9-5 to Open 24hrs, but Field 2 is still invalid. On 1/15/2000, Field 2 is finally fixed to Red.
Date        Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000    Closed Sunday    Black            Open 9-5
1/5/2000    Open Sunday      BRed             Open 9-5
1/10/2000   Open Sunday      BRed             Open 24hrs
1/15/2000   Open Sunday      Red              Open 24hrs
Three methods exist for handling the creation and update of profiles:
1. The first method produces a new profile record each time a change is detected in the source. If a field value was invalid,
then the original field value is maintained.
Date        Profile Date    Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000    1/1/2000        Closed Sunday    Black            Open 9-5
1/5/2000    1/5/2000        Open Sunday      Black            Open 9-5
1/10/2000   1/10/2000       Open Sunday      Black            Open 24hrs
1/15/2000   1/15/2000       Open Sunday      Red              Open 24hrs
By applying all corrections as new profiles in this method, we simplify the process by applying all changes in the source
system directly to the target. Each change, regardless of whether it is a fix to a previous error, is applied as a new change that
creates a new profile. This incorrectly shows in the target that two changes occurred to the source information when, in reality, a
mistake was entered on the first change and should be reflected in the first profile. The second profile should not have been
created.
2. The second method updates the first profile created on 1/5/2000 until all fields are corrected on 1/15/2000, which loses the
profile record for the change to Field 3.
If we try to apply changes to the existing profile, as in this method, we run the risk of losing profile information. If the third field
changes before the second field is fixed, we show the third field changed at the same time as the first. When the second field
was fixed, it would also be added to the existing profile, which incorrectly reflects the changes in the source system.
3. The third method creates only two new profiles, but then causes an update to the profile records on 1/15/2000 to fix the
Field 2 value in both.
Date        Profile Date          Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000    1/1/2000              Closed Sunday    Black            Open 9-5
1/5/2000    1/5/2000              Open Sunday      Black            Open 9-5
1/10/2000   1/10/2000             Open Sunday      Black            Open 24hrs
1/15/2000   1/5/2000 (Update)     Open Sunday      Red              Open 9-5
1/15/2000   1/10/2000 (Update)    Open Sunday      Red              Open 24hrs
If we try to implement a method that updates old profiles when errors are fixed, as in this option, we need to create complex
algorithms that handle the process correctly. It involves being able to determine when an error occurred and examining all
profiles generated since then and updating them appropriately. And, even if we create the algorithms to handle these methods,
we still have an issue of determining if a value is a correction or a new value. If an error is never fixed in the source system, but
a new value is entered, we would identify it as a previous error, causing an automated process to update old profile records,
when in reality a new profile record should have been entered.
Recommended Method
A method exists to track old errors so that we know when a value was rejected. Then, when the process encounters a new,
correct value it flags it as part of the load strategy as a potential fix that should be applied to old Profile records. In this way, the
corrected data enters the target as a new Profile record, but the process of fixing old Profile records, and potentially deleting
the newly inserted record, is delayed until the data is examined and an action is decided. Once an action is decided, another
process examines the existing Profile records and corrects them as necessary. This method only delays the As-Was analysis
of the data until the correction method is determined because the current information is reflected in the new Profile.
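The recommended method can be sketched against the example data above. This is an illustration under stated assumptions, not a PowerCenter implementation: only Field 2 is validated, VALID_FIELD2 stands in for whatever rule rejects 'BRed', and the structure of the pending-fix list is invented for the example.

```python
VALID_FIELD2 = {"Black", "Red"}  # assumed validation rule for Field 2

def build_profiles(changes):
    """Create one profile per source change and flag potential fixes.

    changes is a list of (date, field1, field2, field3) tuples in date order.
    Returns (profiles, pending_fixes). Each pending fix is a tuple of
    (date the error first appeared, date the valid value arrived, valid value),
    left for a later process to review and apply to old profiles.
    """
    profiles, pending_fixes = [], []
    open_error_since = None   # date of the first rejected value not yet resolved
    last_valid_f2 = None
    for date, f1, f2, f3 in changes:
        if f2 in VALID_FIELD2:
            if open_error_since is not None:
                # A valid value after a rejected one: flag it, do not auto-correct.
                pending_fixes.append((open_error_since, date, f2))
                open_error_since = None
            last_valid_f2 = f2
        else:
            if open_error_since is None:
                open_error_since = date
            f2 = last_valid_f2    # keep the last valid value in the new profile
        profiles.append((date, f1, f2, f3))
    return profiles, pending_fixes
```

Run on the four example rows, this produces the same four profiles as the first method plus a single pending fix covering 1/5/2000 to 1/15/2000 with the value Red. Whether that fix is applied to the 1/5/2000 and 1/10/2000 profiles remains a decision for the review step.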
Data Quality Edits
Quality indicators can be used to record definitive statements regarding the quality of the data received and stored in the target.
The indicators can be appended to existing data tables or stored in a separate table linked by the primary key. Quality indicators
can be used to:
* Show the record and field level quality associated with a given record at the time of extract.
* Identify data sources and errors encountered in specific records.
* Support the resolution of specific record error types via an update and resubmission process.
Quality indicators can be used to record several types of errors, e.g., fatal errors (missing primary key value), missing data in a
required field, wrong data type/format, or invalid data value. If a record contains even one error, data quality (DQ) fields will be
appended to the end of the record, one field for every field in the record. A data quality indicator code is included in the DQ
fields corresponding to the original fields in the record where the errors were encountered. Records containing a fatal error are
stored in a Rejected Record Table and associated to the original file name and record number. These records cannot be
loaded to the target because they lack a primary key field to be used as a unique record identifier in the target.
The following types of errors cannot be processed:
* A source record does not contain a valid key. This record would be sent to a reject queue. Metadata will be saved and
used to generate a notice to the sending system indicating that x number of invalid records were received and could not be
processed. However, in the absence of a primary key, no tracking is possible to determine whether the invalid record has
been replaced or not.
* The source file or record is illegible. The file or record would be sent to a reject queue. Metadata
indicating that x number of invalid records were received and could not be processed may or may not be available for a
general notice to be sent to the sending system. In this case, due to the nature of the error, no tracking is possible to
determine whether the invalid record has been replaced or not. If the file or record is illegible, it is likely that individual
unique records within the file are not identifiable. While information can be provided to the source system site indicating
there are file errors for x number of records, specific problems may not be identifiable on a record-by-record basis.
In these error types, the records can be processed, but they contain errors:
* A required (non-key) field is missing.
* The value in a numeric or date field is non-numeric.
* The value in a field does not fall within the range of acceptable values identified for the field. Typically, a reference table is
used for this validation.
When an error is detected during ingest and cleansing, the identified error type is recorded.
Quality Indicators (Quality Code Table)
The requirement to validate virtually every data element received from the source data systems mandates the development,
implementation, capture and maintenance of quality indicators. These are used to indicate the quality of incoming data at an
elemental level. Aggregated and analyzed over time, these indicators provide the information necessary to identify acute data
quality problems, systemic issues, business process problems and information technology breakdowns.
The quality indicators (0-No Error, 1-Fatal Error, 2-Missing Data from a Required Field, 3-Wrong Data Type/Format, 4-Invalid
Data Value, and 5-Outdated Reference Table in Use) provide a concise indication of the quality of the data within specific fields
for every data type. These indicators provide the opportunity for operations staff, data quality analysts and users to readily
identify issues potentially impacting the quality of the data. At the same time, these indicators provide the level of detail
necessary for acute quality problems to be remedied in a timely manner.
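The following sketch shows how the indicator codes might be assigned and appended as DQ fields. The rule dictionary format and the "_DQ" suffix are assumptions for illustration; code 5 (Outdated Reference Table in Use) is defined but not produced here because it depends on reference-table versioning that the sketch does not model, and routing fatal records to a Rejected Record Table is also left out.

```python
# Quality indicator codes as listed in the text.
NO_ERROR, FATAL, MISSING_REQUIRED, WRONG_TYPE, INVALID_VALUE, OUTDATED_REF = range(6)

def dq_code(value, rule):
    """Return the quality indicator code for one field.

    rule is a hypothetical dict with optional entries: 'key' (bool),
    'required' (bool), 'type' (a converter such as int) and 'allowed' (set).
    """
    if value in (None, ""):
        if rule.get("key"):
            return FATAL
        return MISSING_REQUIRED if rule.get("required") else NO_ERROR
    try:
        converted = rule.get("type", str)(value)
    except ValueError:
        return WRONG_TYPE
    allowed = rule.get("allowed")
    if allowed is not None and converted not in allowed:
        return INVALID_VALUE
    return NO_ERROR

def append_dq_fields(record, rules):
    """Append one DQ field per ruled field, but only if the record has an error."""
    codes = {f"{name}_DQ": dq_code(record.get(name), rule) for name, rule in rules.items()}
    if any(codes.values()):
        return {**record, **codes}
    return dict(record)
```

A clean record is returned unchanged, while a record with any error gains a DQ field for every ruled field, including the fields that passed (code 0).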
Handling Data Errors
The need to periodically correct data in the target is inevitable. But how often should these corrections be performed?
The correction process can be as simple as updating field information to reflect actual values, or as complex as deleting data
from the target, restoring previous loads from tape, and then reloading the information correctly. Although we try to avoid
performing a complete database restore and reload from a previous point in time, we cannot rule this out as a possible solution.
Reject Tables vs. Source System
As errors are encountered, they are written to a reject file so that business analysts can examine reports of the data and the
related error messages indicating the causes of error. The business needs to decide whether analysts should be allowed to fix
data in the reject tables, or whether data fixes will be restricted to source systems. If errors are fixed in the reject tables, the
target will not be synchronized with the source systems. This can present credibility problems when trying to track the history of
changes in the target data architecture. If all fixes occur in the source systems, then these fixes must be applied correctly to the
target data.
Attribute Errors and Default Values
Attributes provide additional descriptive information about a dimension concept. Attributes include things like the color of a
product or the address of a store. Attribute errors are typically things like an invalid color or inappropriate characters in the
address. These types of errors do not generally affect the aggregated facts and statistics in the target data; the attributes are
most useful as qualifiers and filtering criteria for drilling into the data (e.g., to find specific patterns for market research).
Attribute errors can be fixed by waiting for the source system to be corrected and reapplied to the data in the target.
When attribute errors are encountered for a new dimensional value, default values can be assigned to let the new record enter
the target. Some rules that have been proposed for handling defaults are as follows:
Value Types         Description                                         Default
Reference Values    Attributes that are foreign keys to other tables    Unknown
Small Value Sets    Y/N indicator fields                                No
Other               Any other type of attribute                         Null or Business provided value
Reference tables are used to normalize the target model to prevent the duplication of data. When a source value does not
translate into a reference table value, we use the Unknown value. (All reference tables contain a value of Unknown for this
purpose.)
The business should provide default values for each identified attribute. Fields that are restricted to a limited domain of values
(e.g., On/Off or Yes/No indicators) are referred to as small-value sets. When errors are encountered in translating these
values, we use the value that represents Off or No as the default. Other values, like numbers, are handled on a case-by-case
basis. In many cases, the data integration process is set to populate Null into these fields, which means undefined in the target.
After a source system value is corrected and passes validation, it is corrected in the target.
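The default rules above can be sketched as a small Python function. This is only an illustration of the decision logic, not PowerCenter code; the constant names and the default_for function are hypothetical.

```python
from typing import Optional

# Hypothetical labels for the three value types listed in the defaults table.
REFERENCE = "reference"
SMALL_VALUE_SET = "small_value_set"
OTHER = "other"


def default_for(value_type: str, business_default: Optional[str] = None) -> Optional[str]:
    """Return the default to load when an attribute fails validation."""
    if value_type == REFERENCE:
        # Every reference table is assumed to carry an Unknown row.
        return "Unknown"
    if value_type == SMALL_VALUE_SET:
        # Use the value that represents Off or No.
        return "No"
    # Any other attribute: the business-provided value if one exists,
    # otherwise None, which stands in for a database Null here.
    return business_default
```

For example, default_for(OTHER) yields None (Null in the target), while default_for(OTHER, "0") yields the business-provided value.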
Primary Key Errors
The business also needs to decide how to handle new dimensional values such as locations. Problems occur when the new
key is actually an update to an old key in the source system. For example, a location number is assigned and the new location
is transferred to the target using the normal process; then the location number is changed due to some source business rule
such as: All Warehouses should be in the 5000 range. The process assumes that the change in the primary key is actually a
new warehouse and that the old warehouse was deleted. This type of error causes a separation of fact data, with some data
being attributed to the old primary key and some to the new. An analyst would be unable to get a complete picture.
Fixing this type of error involves integrating the two records in the target data, along with the related facts. Integrating the two
rows involves combining the profile information, taking care to coordinate the effective dates of the profiles to sequence
properly. If two profile records exist for the same day, then a manual decision is required as to which is correct. If facts were
loaded using both primary keys, then the related fact rows must be added together and the originals deleted in order to correct
the data.
The situation is more complicated when the opposite condition occurs (i.e., two primary keys mapped to the same target data
ID really represent two different IDs). In this case, it is necessary to restore the source information for both dimensions and
facts from the point in time at which the error was introduced, deleting affected records from the target and reloading from the
restore to correct the errors.
DM Facts Calculated from EDW Dimensions
If information is captured as dimensional data from the source, but used as measures residing on the fact records in the target,
we must decide how to handle the facts. From a data accuracy view, we would like to reject the fact until the value is corrected.
If we load the facts with the incorrect data, the process to fix the target can be time consuming and difficult to implement.
If we let the facts enter downstream target structures, we need to create processes that update them after the dimensional data
is fixed. If we reject the facts when these types of errors are encountered, the fix process becomes simpler. After the errors are
fixed, the affected rows can simply be loaded and applied to the target data.
Fact Errors
If there are no business rules that reject fact records except for relationship errors to dimensional data, then when we
encounter errors that would cause a fact to be rejected, we save these rows to a reject table for reprocessing the following
night. This nightly reprocessing continues until the data successfully enters the target data structures. Initial and periodic
analyses should be performed on the errors to determine why they are not being loaded.
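The nightly reprocessing loop can be pictured with the following Python sketch. It assumes, purely for illustration, that each rejected fact is a dictionary with a hypothetical location_id key and that a fact becomes loadable once its key exists in the dimension.

```python
def reprocess_rejects(rejects, dimension_keys):
    """Split saved reject rows into (loadable, still_rejected).

    rejects        -- fact rows saved in the reject table on earlier nights
    dimension_keys -- keys currently present in the dimension (hypothetical)
    """
    loadable, still_rejected = [], []
    for fact in rejects:
        if fact["location_id"] in dimension_keys:
            # The relationship error is resolved, so the fact can be loaded.
            loadable.append(fact)
        else:
            # Keep the row in the reject table for the next nightly run.
            still_rejected.append(fact)
    return loadable, still_rejected
```

Rows that stay in still_rejected night after night are the ones the periodic analysis should examine.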
Data Stewards
Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new entities in
dimensional data, and designating one primary data source when multiple sources exist. Reference data and translation tables
enable the target data architecture to maintain consistent descriptions across multiple source systems, regardless of how the
source system stores the data. New entities in dimensional data include new locations, products, hierarchies, etc. Multiple
source data occurs when two source systems can contain different data for the same dimensional entity.
Reference Tables
The target data architecture may use reference tables to maintain consistent descriptions. Each table contains a short code
value as a primary key and a long description for reporting purposes. A translation table is associated with each reference table
to map the codes to the source system values. Using both of these tables, the ETL process can load data from the source
systems into the target structures.
The translation tables contain one or more rows for each source value and map the value to a matching row in the reference
table. For example, the SOURCE column in FILE X on System X can contain O, S or W. The data steward would be
responsible for entering in the translation table the following values:
Source Value   Code Translation
O              OFFICE
S              STORE
W              WAREHSE
These values are used by the data integration process to correctly load the target. Other source systems that maintain a similar
field may use a two-letter abbreviation like OF, ST and WH. The data steward would make the following entries into the
translation table to maintain consistency across systems:
Source Value   Code Translation
OF             OFFICE
ST             STORE
WH             WAREHSE
The data stewards are also responsible for maintaining the reference table that translates the codes into descriptions. The ETL
process uses the reference table to populate the following values into the target:
Code Translation   Code Description
OFFICE             Office
STORE              Retail Store
WAREHSE            Distribution Warehouse
Error handling is needed when the data steward enters incorrect information for these mappings and needs to correct it after
data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered ST as translating to
OFFICE by mistake). The only way to determine which rows should be changed is to restore and reload source data from the
first time the mistake was entered. Processes should be built to handle these types of situations, including correction of the
entire target data architecture.
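As a rough illustration of how the translation and reference tables work together, the lookup can be sketched in Python with two dictionaries. The system names SYSTEM_X and SYSTEM_Y, the UNKNOWN code, and the resolve function are assumptions made for this sketch; the codes and descriptions come from the tables above.

```python
from typing import Tuple

# Translation table: (source system, source value) -> reference code.
# "SYSTEM_X" and "SYSTEM_Y" are hypothetical system names.
TRANSLATION = {
    ("SYSTEM_X", "O"): "OFFICE",
    ("SYSTEM_X", "S"): "STORE",
    ("SYSTEM_X", "W"): "WAREHSE",
    ("SYSTEM_Y", "OF"): "OFFICE",
    ("SYSTEM_Y", "ST"): "STORE",
    ("SYSTEM_Y", "WH"): "WAREHSE",
}

# Reference table: code -> long description. The UNKNOWN key is an assumed
# spelling of the Unknown value that every reference table carries.
REFERENCE = {
    "OFFICE": "Office",
    "STORE": "Retail Store",
    "WAREHSE": "Distribution Warehouse",
    "UNKNOWN": "Unknown",
}


def resolve(system: str, source_value: str) -> Tuple[str, str]:
    """Map a source value to its (code, description) pair."""
    # Values with no translation entry fall back to the Unknown row.
    code = TRANSLATION.get((system, source_value), "UNKNOWN")
    return code, REFERENCE[code]
```

A mistaken entry such as ("SYSTEM_Y", "ST") pointing to OFFICE would silently load every store as an office, which is why correcting it afterwards requires a restore and reload.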
Dimensional Data
New entities in dimensional data present a more complex issue. New entities in the target may include Locations and Products,
at a minimum. Dimensional data uses the same concept of translation as reference tables. These translation tables map the
source system value to the target value. For location, this is straightforward, but over time, products may have multiple source
system values that map to the same product in the target. (Other similar translation issues may also exist, but Products serves
as a good example for error handling.)
There are two possible methods for loading new dimensional entities. Either require the data steward to enter the translation
data before allowing the dimensional data into the target, or create the translation data through the ETL process and force the
data steward to review it. The first option requires the data steward to create the translation for new entities, while the second
lets the ETL process create the translation, but marks the record as Pending Verification until the data steward reviews it and
changes the status to Verified before any facts that reference it can be loaded.
However, when the dimensional value is left as Pending Verification, facts may be rejected or allocated to dummy values. This
requires the data stewards to review the status of new values on a daily basis. A potential solution to this issue is to generate
an email each night if there are any translation table entries pending verification. The data steward then opens a report that lists
them.
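The nightly check for entries awaiting review might look like the following Python sketch. The row layout (source_value, target_value, status) and both function names are hypothetical; only the two status values come from the text.

```python
PENDING = "Pending Verification"
VERIFIED = "Verified"


def pending_entries(translation_rows):
    """Return the translation rows still awaiting data steward review."""
    return [row for row in translation_rows if row["status"] == PENDING]


def nightly_notice(translation_rows):
    """Build the body of the nightly email, or None when nothing is pending."""
    pending = pending_entries(translation_rows)
    if not pending:
        return None
    lines = [f"Translation entries pending verification: {len(pending)}"]
    lines += [f"  {row['source_value']} -> {row['target_value']}" for row in pending]
    return "\n".join(lines)
```

Returning None when nothing is pending lets the calling job skip the email entirely on quiet nights.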
A problem specific to Product arises when a product that is created as new is really just an existing product with a changed SKU number.
This causes additional fact rows to be created, which produces an inaccurate view of the product when reporting. When this is
fixed, the fact rows for the various SKU numbers need to be merged and the original rows deleted. Profiles would also have to
be merged, requiring manual intervention.
The situation is more complicated when the opposite condition occurs (i.e., two products are mapped to the same product, but
really represent two different products). In this case, it is necessary to restore the source information for all loads since the error
was introduced. Affected records from the target should be deleted and then reloaded from the restore to correctly split the
data. Facts should be split to allocate the information correctly and dimensions split to generate correct profile information.
Manual Updates
Over time, any system is likely to encounter errors that are not correctable using source systems. A method needs to be
established for manually entering fixed data and applying it correctly to the entire target data architecture, including beginning
and ending effective dates. These dates are useful for both profile and date event fixes. Further, a log of these fixes should be
maintained to enable identifying the source of the fixes as manual rather than part of the normal load process.
Multiple Sources
The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources contain
subsets of the required information. For example, one system may contain Warehouse and Store information while another
contains Store and Hub information. Because they share Store information, it is difficult to decide which source contains the
correct information.
When this happens, both sources have the ability to update the same row in the target. If both sources are allowed to update
the shared information, data accuracy and profile problems are likely to occur. If we update the shared information on only one
source system, the two systems then contain different information. If the changed system is loaded into the target, it creates a
new profile indicating the information changed. When the second system is loaded, it compares its old unchanged value to the
new profile, assumes a change occurred and creates another new profile with the old, unchanged value. If the two systems
remain different, the process causes two profiles to be loaded every day until the two source systems are synchronized with
the same information.
To avoid this type of situation, the business analysts and developers need to designate, at a field level, the source that should
be considered primary for the field. Then, only if the field changes on the primary source would it be changed. While this
sounds simple, it requires complex logic when creating Profiles, because multiple sources can provide information toward the
one profile record created for that day.
One solution to this problem is to develop a system of record for all sources. This allows developers to pull the information from
the system of record, knowing that there are no conflicts for multiple sources. Another solution is to indicate, at the field level, a
primary source where information can be shared from multiple sources. Developers can use the field level information to
update only the fields that are marked as primary. However, this requires additional effort by the data stewards to mark the
correct source fields as primary and by the data integration team to customize the load process.
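The field-level primary source rule can be illustrated with a short Python sketch. The field names, system names, and the apply_update function are hypothetical; the point is only that a source may change a field when, and only when, it is marked primary for that field.

```python
# Hypothetical field-level designation maintained by the data stewards:
# field name -> the source system considered primary for that field.
PRIMARY_SOURCE = {
    "store_name": "SYSTEM_A",
    "store_address": "SYSTEM_B",
}


def apply_update(target_row, source, incoming):
    """Apply only the fields for which `source` is the primary source.

    Returns (updated_row, changed); a new profile would be created
    only when changed is True.
    """
    changes = {}
    for field, value in incoming.items():
        if PRIMARY_SOURCE.get(field) != source:
            # Ignore values from a source that is not primary for this field.
            continue
        if target_row.get(field) != value:
            changes[field] = value
    return {**target_row, **changes}, bool(changes)
```

Because a non-primary source can never report a change, the two-profiles-per-day cycle described above does not occur even while the source systems disagree.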
Error Handling Techniques - PowerCenter Mappings
Challenge
Identifying and capturing data errors using a mapping approach, and making such errors available for
further processing or correction.
Description
Identifying errors and creating an error handling strategy is an essential part of a data integration project. In the production
environment, data must be checked and validated prior to entry into the target system. One strategy for catching data errors is
to use PowerCenter mappings and error logging capabilities to catch specific data validation errors and unexpected
transformation or database constraint errors.
Data Validation Errors
The first step in using a mapping to trap data validation errors is to understand and identify the error handling requirements.
Consider the following questions:
* What types of data errors are likely to be encountered?
* Of these errors, which ones should be captured?
* What process can capture the possible errors?
* Should errors be captured before they have a chance to be written to the target database?
* Will any of these errors need to be reloaded or corrected?
* How will the users know if errors are encountered?
* How will the errors be stored?
* Should descriptions be assigned for individual errors?
* Can a table be designed to store captured errors and the error descriptions?
Capturing data errors within a mapping and re-routing these errors to an error table facilitates analysis by end users and
improves performance. One practical application of the mapping approach is to capture foreign key constraint errors (e.g.,
executing a lookup on a dimension table prior to loading a fact table). Referential integrity is assured by including this sort of
functionality in a mapping. While the database still enforces the foreign key constraints, erroneous data is not written to the
target table; constraint errors are captured within the mapping so that the PowerCenter server does not have to write them to
the session log and the reject/bad file, thus improving performance.
Data content errors can also be captured in a mapping. Mapping logic can identify content errors and attach descriptions to
them. This approach can be effective for many types of data content error, including: date conversion, null values intended for
not null target fields, and incorrect data formats or data types.
Sample Mapping Approach for Data Validation Errors
In the following example, customer data is to be checked to ensure that invalid null values are intercepted before being written
to not null columns in a target CUSTOMER table. Once a null value is identified, the row containing the error is to be separated
from the data flow and logged in an error table.
One solution is to implement a mapping similar to the one shown below:
An expression transformation can be employed to validate the source data, applying rules and flagging
records with one or more errors.
A router transformation can then separate valid rows from those containing the errors. It is good practice to append error rows
with a unique key; this can be a composite consisting of a MAPPING_ID and ROW_ID, for example. The MAPPING_ID would
refer to the mapping name and the ROW_ID would be created by a sequence generator.
The composite key is designed to allow developers to trace rows written to the error tables that store information useful for error
reporting and investigation. In this example, two error tables are suggested, namely: CUSTOMER_ERR and ERR_DESC_TBL.
The table ERR_DESC_TBL is designed to hold information about the error, such as the mapping name, the ROW_ID, and the error
description. This table can be used to hold all data validation error descriptions for all mappings, giving a single point of
reference for reporting.
The CUSTOMER_ERR table can be an exact copy of the target CUSTOMER table appended with two additional columns:
ROW_ID and MAPPING_ID. These columns allow the two error tables to be joined. The CUSTOMER_ERR table stores the
entire row that was rejected, enabling the user to trace the error rows back to the source and potentially build mappings to
reprocess them.
The mapping logic must assign a unique description for each error in the rejected row. In this example, any null value intended
for a not null target field could generate an error message such as 'NAME is NULL' or 'DOB is NULL'. This
step can be done in an expression transformation (e.g., EXP_VALIDATION in the sample mapping).
After the field descriptions are assigned, the error row can be split into several rows, one for each possible error using a
normalizer transformation. After a single source row is normalized, the resulting rows can be filtered to leave only errors that
are present (i.e., each record can have zero to many errors). For example, if a row has three errors, three error rows would be
generated with appropriate error descriptions (ERROR_DESC) in the table ERR_DESC_TBL.
The following tables show how the resulting error data may look.
Table Name: CUSTOMER_ERR
NAME   DOB    ADDRESS   ROW_ID   MAPPING_ID
NULL   NULL   NULL      1        DIM_LOAD
Table Name: ERR_DESC_TBL
FOLDER_NAME   MAPPING_ID   ROW_ID   ERROR_DESC        LOAD_DATE    SOURCE        TARGET
CUST          DIM_LOAD     1        Name is NULL      10/11/2006   CUSTOMER_FF   CUSTOMER
CUST          DIM_LOAD     1        DOB is NULL       10/11/2006   CUSTOMER_FF   CUSTOMER
CUST          DIM_LOAD     1        Address is NULL   10/11/2006   CUSTOMER_FF   CUSTOMER
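The validate, route, and normalize steps described above can be mimicked in plain Python to show the shape of the data each step produces. This is a sketch of the logic only, not PowerCenter transformation code; the function names are hypothetical, and the row position stands in for the value a sequence generator would supply.

```python
NOT_NULL_COLUMNS = ("NAME", "DOB", "ADDRESS")
MAPPING_ID = "DIM_LOAD"


def describe_errors(row):
    """Return one description per null value found in a not-null column."""
    # Expression step: one slot per checked column, None when the value is fine.
    slots = [f"{col} is NULL" if row.get(col) is None else None
             for col in NOT_NULL_COLUMNS]
    # Normalizer and filter steps: one entry per error that is actually present.
    return [desc for desc in slots if desc is not None]


def route(rows):
    """Split rows into (valid, customer_err, err_desc) record lists."""
    valid, customer_err, err_desc = [], [], []
    for row_id, row in enumerate(rows, start=1):
        errors = describe_errors(row)
        if not errors:
            valid.append(row)
            continue
        # The rejected row is kept whole, tagged with the composite key.
        customer_err.append({**row, "ROW_ID": row_id, "MAPPING_ID": MAPPING_ID})
        err_desc.extend(
            {"MAPPING_ID": MAPPING_ID, "ROW_ID": row_id, "ERROR_DESC": desc}
            for desc in errors
        )
    return valid, customer_err, err_desc
```

A row with three null values produces one CUSTOMER_ERR record and three ERR_DESC_TBL records that share the same MAPPING_ID and ROW_ID, which is what allows the two tables to be joined.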
The efficiency of a mapping approach can be increased by employing reusable objects. Common logic should be placed in
mapplets, which can be shared by multiple mappings. This improves productivity in implementing and managing the capture of
data validation errors.
Data validation error handling can be extended by including mapping logic to grade error severity. For example, data
validation errors can be flagged as 'soft' or 'hard':
* A 'hard' error can be defined as one that would fail when being written to the database, such as a constraint error.
* A 'soft' error can be defined as a data content error.
A record flagged as 'hard' can be filtered from the target and written to the error tables, while a record flagged as 'soft' can be
written to both the target system and the error tables. This gives business analysts an opportunity to evaluate and correct data
imperfections while still allowing the records to be processed for end-user reporting.
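The routing rule for graded errors can be stated compactly in Python. The function below is a hypothetical sketch that assumes each error is represented as a (severity, description) pair.

```python
def route_by_severity(errors):
    """Decide where a record goes, given its (severity, description) errors.

    Returns (to_target, to_error_table).
    """
    # Any error at all is logged to the error tables.
    to_error_table = bool(errors)
    # Only records without a 'hard' error may still reach the target.
    to_target = all(severity != "hard" for severity, _ in errors)
    return to_target, to_error_table
```

A clean record gives (True, False), a record with only soft errors gives (True, True), and any hard error gives (False, True).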
Ultimately, business organizations need to decide if the analysts should fix the data in the reject table or in the source systems.
The advantage of the mapping approach is that all errors are identified as either data errors or constraint errors and can be
properly addressed. The mapping approach also reports errors based on projects or categories by identifying the mappings
that contain errors. The most important aspect of the mapping approach, however, is its flexibility. Once an error type is
identified, the error handling logic can be placed anywhere within a mapping. By using the mapping approach to capture
identified errors, the operations team can effectively communicate data quality issues to the business users.
Constraint and Transformation Errors
Perfect data can never be guaranteed. In implementing the mapping approach described above to detect errors and log them
to an error table, how can we handle unexpected errors that arise in the load? For example, PowerCenter may apply the
validated data to the database; however, the relational database management system (RDBMS) may reject it for some
unexpected reason. An RDBMS may, for example, reject data if constraints are violated. Ideally, we would like to detect these
database-level errors automatically and send them to the same error table used to store the soft errors caught by the mapping
approach described above.
In some cases, the 'stop on errors' session property can be set to '1' to stop source data for which unhandled errors were
encountered from being loaded. In this case, the process will stop with a failure, the data must be corrected, and the entire
source may need to be reloaded or recovered. This is not always an acceptable approach.
An alternative might be to have the load process continue in the event of records being rejected, and then reprocess only the
records that were found to be in error. This can be achieved by configuring the 'stop on errors' property to 0 and switching on
relational error logging for a session. By default, the error messages from the RDBMS and any uncaught transformation errors
are sent to the session log. Switching on relational error logging redirects these messages to a selected database in which four
tables are automatically created: PMERR_MSG, PMERR_DATA, PMERR_TRANS and PMERR_SESS.
The PowerCenter Workflow Administration Guide contains detailed information on the structure of these tables. For the purpose
of this discussion, the PMERR_MSG table stores the error messages that were encountered in a session. The following four
columns of this table allow us to retrieve any RDBMS errors:
* SESS_INST_ID: A unique identifier for the session. Joining this table with the Metadata Exchange (MX) View
REP_LOAD_SESSIONS in the repository allows the MAPPING_ID to be retrieved.
* TRANS_NAME: Name of the transformation where an error occurred. When an RDBMS error occurs, this is the name of the
target transformation.
* TRANS_ROW_ID: Specifies the row ID generated by the last active source. This field contains the row number at the target
when the error occurred.
* ERROR_MSG: Error message generated by the RDBMS.
With this information, all RDBMS errors can be extracted and stored in an applicable error table. A post-load session (i.e., an
additional PowerCenter session) can be implemented to read the PMERR_MSG table, join it with the MX View
REP_LOAD_SESSIONS in the repository, and insert the error details into ERR_DESC_TBL. When the post process ends,
ERR_DESC_TBL will contain both 'soft' errors and 'hard' errors.
One problem with capturing RDBMS errors in this way is mapping them to the relevant source key to provide lineage. This can
be difficult when the source and target rows are not directly related (i.e., one source row can actually result in zero or more
rows at the target). In this case, the mapping that loads the source must write translation data to a staging table (including the
source key and target row number). The translation table can then be used by the post-load session to identify the source key
by the target row number retrieved from the error log. The source key stored in the translation table could be a row number in
the case of a flat file, or a primary key in the case of a relational data source.
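The post-load step can be pictured with the following Python sketch, which works on in-memory stand-ins rather than the real tables. The column names SESS_INST_ID, TRANS_ROW_ID, and ERROR_MSG are taken from the description above; the session_to_mapping dictionary (standing in for the join to the MX view), the row_translation dictionary (standing in for the staging translation table), and the function name are assumptions for illustration.

```python
def collect_rdbms_errors(pmerr_msg_rows, session_to_mapping, row_translation):
    """Turn logged RDBMS errors into records for the shared error table.

    pmerr_msg_rows     -- dicts with SESS_INST_ID, TRANS_ROW_ID, ERROR_MSG
    session_to_mapping -- SESS_INST_ID -> mapping identifier (hypothetical
                          stand-in for the join to the MX view)
    row_translation    -- (SESS_INST_ID, target row number) -> source key
                          (hypothetical stand-in for the staging table)
    """
    records = []
    for msg in pmerr_msg_rows:
        sess_id = msg["SESS_INST_ID"]
        target_row = msg["TRANS_ROW_ID"]
        records.append({
            "MAPPING_ID": session_to_mapping.get(sess_id),
            # None when no lineage was recorded for this target row.
            "SOURCE_KEY": row_translation.get((sess_id, target_row)),
            "ERROR_DESC": msg["ERROR_MSG"],
        })
    return records
```

Keying the translation data by session and target row number keeps the lookup unambiguous when several sessions log errors into the same set of tables.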
Reprocessing
After the load and post-load sessions are complete, the error table (e.g., ERR_DESC_TBL) can be analyzed by members of
the business or operational teams. The rows listed in this table have not been loaded into the target database. The operations
team can, therefore, fix the data in the source that resulted in 'soft' errors and may be able to explain and remediate the 'hard'
errors.
Once the errors have been fixed, the source data can be reloaded. Ideally, only the rows resulting in errors during the first run
should be reprocessed in the reload. This can be achieved by including a filter and a lookup in the original load mapping and
using a parameter to configure the mapping for an initial load or for a reprocess load. If the mapping is reprocessing, the lookup
searches for each source row number in the error table, while the filter removes source rows for which the lookup has not found
errors. If initial loading, all rows are passed through the filter, validated, and loaded.
With this approach, the same mapping can be used for initial and reprocess loads. During a reprocess run, the records
successfully loaded should be deleted (or marked for deletion) from the error table, while any new errors encountered should
be inserted as if an initial run. On completion, the post-load process is executed to capture any new RDBMS errors. This
ensures that reprocessing loads are repeatable and result in reducing numbers of records in the error table over time.
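The parameter-driven filter and the error table maintenance can be sketched in Python as follows. Both function names are hypothetical, and source rows are assumed to arrive as (row_number, row) pairs.

```python
def select_rows(source_rows, error_row_ids, reprocess):
    """Return the source rows that this run should process.

    reprocess plays the role of the mapping parameter: False for an
    initial load, True for a reprocess load.
    """
    if not reprocess:
        # Initial load: every row passes through the filter.
        return list(source_rows)
    # Reprocess load: keep only rows that the lookup finds in the error table.
    return [(number, row) for number, row in source_rows if number in error_row_ids]


def refresh_error_table(error_row_ids, loaded_ok, new_errors):
    """Drop rows that loaded successfully and add any newly failed rows."""
    return (set(error_row_ids) - set(loaded_ok)) | set(new_errors)
```

Repeated reprocess runs shrink the set returned by refresh_error_table as long as fixes keep landing, which matches the repeatable behavior described above.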
Error Handling Techniques - PowerCenter Workflows and Data Analyzer
Challenge
Implementing an efficient strategy to identify different types of errors in the ETL process, correct the
errors, and reprocess the corrected data.
Description
Identifying errors and creating an error handling strategy is an essential part of a data warehousing project. The errors in an
ETL process can be broadly categorized into two types: data errors in the load process, which are defined by the standards of
acceptable data quality; and process errors, which are driven by the stability of the process itself.
The first step in implementing an error handling strategy is to understand and define the error handling requirement. Consider
the following questions:
* What tools and methods can help in detecting all the possible errors?
* What tools and methods can help in correcting the errors?
* What is the best way to reconcile data across multiple systems?
* Where and how will the errors be stored? (i.e., relational tables or flat files)
A robust error handling strategy can be implemented using PowerCenter's built-in error handling capabilities along with Data
Analyzer as follows:
* Process Errors: Configure an email task to notify the PowerCenter Administrator immediately of any process
failures.
* Data Errors: Set up the ETL process to:
- Use the Row Error Logging feature in PowerCenter to capture data errors in the PowerCenter error tables for
analysis, correction, and reprocessing.
- Set up Data Analyzer alerts to notify the PowerCenter Administrator in the event of any rejected rows.
- Set up customized Data Analyzer reports and dashboards at the project level to provide information on failed
sessions, sessions with failed rows, load time, etc.
Configuring an Email Task to Handle Process Failures
Configure all workflows to send an email to the PowerCenter Administrator, or any other designated recipient, in the event of a
session failure. Create a reusable email task and use it in the "On Failure Email" property settings in the Components tab of the
session, as shown in the following figure.
When you configure the subject and body of a post-session email, use email variables to include
information about the session run, such as session name, mapping name, status, total number of records loaded, and total
number of records rejected. The following table lists all the available email variables:
Email Variables for Post-Session Email
Email Variable   Description
%s               Session name.
%e               Session status.
%b               Session start time.
%c               Session completion time.
%i               Session elapsed time (session completion time-session start time).
%l               Total rows loaded.
%r               Total rows rejected.
%t               Source and target table details, including read throughput in bytes per second and write throughput
                 in rows per second. The PowerCenter Server includes all information displayed in the session detail
                 dialog box.
%m               Name of the mapping used in the session.
%n               Name of the folder containing the session.
%d               Name of the repository containing the session.
%g               Attach the session log to the message.
%a               Attach the named file. The file must be local to the PowerCenter Server. The following are valid file
                 names: %a or %a . Note: The file name cannot include the greater than character (>) or a line
                 break.
Note: The PowerCenter Server ignores %a, %g, or %t when you include them in the email subject. Include these
variables in the email message only.
Configuring Row Error Logging in PowerCenter
PowerCenter provides you with a set of four centralized error tables into which all data errors can be logged. Using these tables
to capture data errors greatly reduces the time and effort required to implement an error handling strategy when compared with
a custom error handling solution.
When you configure a session, you can choose to log row errors in this central location. When a row error occurs, the
PowerCenter Server logs error information that allows you to determine the cause and source of the error. The PowerCenter
Server logs information such as source name, row ID, current row data, transformation, timestamp, error code, error message,
repository name, folder name, session name, and mapping information. This error metadata is logged for all row-level errors,
including database errors, transformation errors, and errors raised through the ERROR() function, such as business rule
violations.
Logging row errors into relational tables rather than flat files enables you to report on and fix the errors easily. When you enable
error logging and choose the 'Relational Database' Error Log Type, the PowerCenter Server offers you the following features:
* Generates the following tables to help you track row errors:
- PMERR_DATA. Stores data and metadata about a transformation row error and its corresponding source row.
- PMERR_MSG. Stores metadata about an error and the error message.
- PMERR_SESS. Stores metadata about the session.
- PMERR_TRANS. Stores metadata about the source and transformation ports, such as name and datatype,
when a transformation error occurs.
* Appends error data to the same tables cumulatively, if they already exist, for subsequent runs of the session.
* Allows you to specify a prefix for the error tables. For instance, if you want all your EDW session errors to go to one
set of error tables, you can specify the prefix as 'EDW_'.
* Allows you to collect row errors from multiple sessions in a centralized set of four error tables. To do this, you
specify the same error log table name prefix for all sessions.
Example:
In the following figure, the session 's_m_Load_Customer' loads Customer Data into the EDW Customer table. The Customer
Table in EDW has the following structure:
CUSTOMER_ID       NOT NULL   NUMBER (PRIMARY KEY)
CUSTOMER_NAME     NULL       VARCHAR2(30)
CUSTOMER_STATUS   NULL       VARCHAR2(10)
There is a primary key constraint on the column CUSTOMER_ID.
To take advantage of PowerCenter's built-in error handling features, you would set the session properties as shown below:
The session property 'Error Log Type' is set to 'Relational Database', and 'Error Log DB Connection' and
'Table name Prefix' values are given accordingly.
When the PowerCenter server detects any rejected rows because of Primary Key Constraint violation, it writes information into
the Error Tables as shown below:
EDW_PMERR_DATA
WORKFLOW_RUN_ID  WORKLET_RUN_ID  SESS_INST_ID  TRANS_NAME      TRANS_ROW_ID  TRANS_ROW_DATA                               SOURCE_ROW_ID  SOURCE_ROW_TYPE  SOURCE_ROW_DATA  LINE_NO
8                0               3             Customer_Table  1             D:1001:0000000 0 0000D:Elvis PresD:Valid     -1             -1               N/A              1
8                0               3             Customer_Table  2             D:1002:0000000 0 0000D:James BondD:Valid     -1             -1               N/A              1
8                0               3             Customer_Table  3             D:1003:0000000 0 0000D:Michael JaD:Valid     -1             -1               N/A              1
EDW_PMERR_MSG
WORKFLOW_RUN_ID  SESS_INST_ID  SESS_START_TIME  REPOSITORY_NAME  FOLDER_NAME  WORKFLOW_NAME  TASK_INST_PATH  MAPPING_NAME  LINE_NO
6                3             9/15/2004 18:31  pc711            Folder1      wf_test1       s_m_test1       m_test1       1
7                3             9/15/2004 18:33
8                3             9/15/2004 18:34
EDW_PMERR_SESS
WORKFLOW_RUN_ID  SESS_INST_ID  SESS_START_TIME  REPOSITORY_NAME  FOLDER_NAME  WORKFLOW_NAME  TASK_INST_PATH  MAPPING_NAME  LINE_NO
6                3             9/15/2004 18:31
7                3             9/15/2004 18:33
8                3             9/15/2004 18:34
EDW_PMERR_TRANS
WORKFLOW_RUN_ID  SESS_INST_ID  TRANS_NAME      TRANS_GROUP  TRANS_ATTR                                           LINE_NO
8                3             Customer_Table  Input        Customer_Id:3, Customer_Name:12, Customer_Status:12  1
By looking at the workflow run id and other fields, you can analyze the errors and reprocess them after fixing the errors.
Error Detection and Notification using Data Analyzer
Informatica provides Data Analyzer for PowerCenter Repository Reports with every PowerCenter license. Data Analyzer is
Informatica's powerful business intelligence tool that is used to provide insight into the PowerCenter repository metadata.
You can use the Operations Dashboard provided with the repository reports as one central location to gain insight into
production environment ETL activities. In addition, the following capabilities of Data Analyzer are recommended best practices:
* Configure alerts to send an email or a pager message to the PowerCenter Administrator whenever an entry is made
in the error tables PMERR_DATA or PMERR_TRANS.
* Configure reports and dashboards to provide detailed session run information grouped by project or PowerCenter folder for
easy analysis.
* Configure reports to provide detailed information on the row-level errors for each session. This can be accomplished by
using the four error tables (PMERR_DATA, PMERR_MSG, PMERR_SESS, and PMERR_TRANS) as sources of data for the reports.
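Data Analyzer alerts are configured within the product itself, so no code is required for them. Purely to illustrate the condition that such an alert evaluates, the following Python sketch checks an error table for workflow runs newer than the last one reviewed and builds one notification line per run. The in-memory SQLite table, the reduced column list, and the function name new_error_alerts are assumptions made for this example; delivery by email or pager is left out.

```python
import sqlite3


def new_error_alerts(conn, last_seen_run_id):
    """Return one alert line per workflow run newer than last_seen_run_id that logged row errors."""
    query = """
        SELECT WORKFLOW_RUN_ID, COUNT(*)
        FROM PMERR_DATA
        WHERE WORKFLOW_RUN_ID > ?
        GROUP BY WORKFLOW_RUN_ID
        ORDER BY WORKFLOW_RUN_ID
    """
    return [
        f"Workflow run {run_id}: {count} row error(s) logged in PMERR_DATA"
        for run_id, count in conn.execute(query, (last_seen_run_id,))
    ]


# Stand-in error table holding the three rejected rows of workflow run 8.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PMERR_DATA (WORKFLOW_RUN_ID INTEGER, TRANS_ROW_ID INTEGER)")
conn.executemany("INSERT INTO PMERR_DATA VALUES (?, ?)", [(8, 1), (8, 2), (8, 3)])

for line in new_error_alerts(conn, last_seen_run_id=7):
    print(line)  # prints: Workflow run 8: 3 row error(s) logged in PMERR_DATA
```

The same check could be applied to PMERR_TRANS; tracking the last reviewed run ID keeps the Administrator from being notified repeatedly about the same errors.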
Data Reconciliation Using Data Analyzer
Business users often like to see certain metrics matching from one system to another (e.g., source system to ODS, ODS to
targets, etc.) to ascertain that the data has been processed accurately. This is frequently accomplished by writing tedious
queries, comparing two separately produced reports, or using constructs such as DBLinks.
Upgrading the Data Analyzer license from Repository Reports to a full license enables Data Analyzer to source your company's
data (e.g., source systems, staging areas, ODS, data warehouse, and data marts) and provide a reliable and reusable way to
accomplish data reconciliation. Using Data Analyzer's reporting capabilities, you can select data from
various data sources such as ODS, data marts, and data warehouses to compare key reconciliation metrics and numbers
through aggregate reports. You can further schedule the reports to run automatically every time the relevant PowerCenter
sessions complete, and set up alerts to notify the appropriate business or technical users in case of any discrepancies.
For example, a report can be created to verify that the number of customers in the ODS matches the number in the data
warehouse and any downstream data marts. The reconciliation reports should be relevant to a business user by comparing
key metrics (e.g., customer counts, aggregated financial metrics, etc.) across data silos. Such reconciliation reports can be run
automatically after PowerCenter loads the data, or they can be run on demand by technical or business users. This process
allows users to verify the accuracy of the data and builds confidence in the data warehouse solution.
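The customer-count comparison described above can be sketched in Python as follows. This is only an illustration of the reconciliation logic, not a Data Analyzer report definition: the table names ods_customer and dw_customer, the sample rows, and the function name reconcile_counts are hypothetical, and a single in-memory SQLite database stands in for what would normally be separate ODS and data warehouse connections.

```python
import sqlite3


def reconcile_counts(conn, metric_queries):
    """Run each named COUNT query and report whether all systems return the same value."""
    counts = {
        name: conn.execute(sql).fetchone()[0]
        for name, sql in metric_queries.items()
    }
    all_match = len(set(counts.values())) == 1
    return counts, all_match


# Hypothetical ODS and warehouse tables; the warehouse is missing one customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ods_customer (customer_id INTEGER);
CREATE TABLE dw_customer  (customer_id INTEGER);
INSERT INTO ods_customer VALUES (1001), (1002), (1003);
INSERT INTO dw_customer  VALUES (1001), (1002);
""")

counts, matched = reconcile_counts(conn, {
    "ODS": "SELECT COUNT(*) FROM ods_customer",
    "Data warehouse": "SELECT COUNT(*) FROM dw_customer",
})
print(counts, "match" if matched else "DISCREPANCY")
```

A discrepancy flag of this kind is the condition on which an alert to business or technical users would be raised; aggregated financial metrics can be compared the same way by substituting SUM queries for the COUNT queries.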