Challenge
The challenge is to accurately and efficiently load data into the target data architecture. This Best Practice describes various loading scenarios, the use of data profiles, an alternate method for identifying data errors, methods for handling data errors, and alternatives for addressing the most common types of problems. For the most part, these strategies are relevant whether your data integration project is loading an operational data structure (as with data migrations, consolidations, or loading various sorts of operational data stores) or loading a data warehousing structure.
Description
Regardless of target data structure, your loading process must validate that the data conforms to known rules of the business. When the source system data does not meet these rules, the process needs to handle the exceptions in an appropriate manner. The business needs to be aware of the consequences of either permitting invalid data to enter the target or rejecting it until it is fixed. Both approaches present complex issues. The business must decide what is acceptable and prioritize two conflicting goals: 1) the need for accurate information, and 2) the ability to analyze or process the most complete information available, with the understanding that errors can exist.
Reject None. This approach gives users a complete picture of the available data without having to account for data that is unavailable because it was rejected during the load process. The problem is that the data may not be complete or accurate. All of the target data structures may contain incorrect information that can lead to incorrect decisions or faulty transactions. With Reject None, the complete set of data is loaded, but the data may not support correct transactions or aggregations. Factual data can be allocated to dummy or incorrect dimension rows, resulting in grand totals that are correct but detail numbers that are incorrect. After the data is fixed, reports may change, with detail information being redistributed along different hierarchies. The development effort to fix this scenario is significant. After the errors are corrected, a new loading process needs to correct all of the target data structures, which can be time-consuming, depending on the delay between an error being detected and fixed. The development strategy may include removing information from the target, restoring backup tapes for each night's load, and reprocessing the data. Once the target is fixed, these changes need to be propagated to all downstream data structures or data marts.
Reject Critical. This method provides a balance between missing information and incorrect information. It involves examining each row of data and determining the particular data elements to be rejected. All changes that are valid are processed into the target to allow for the most complete picture. Rejected elements are reported as errors so that they can be fixed in the source systems and loaded on a subsequent run of the ETL process. This approach requires categorizing the data in two ways: 1) as key elements or attributes, and 2) as inserts or updates. Key elements are required fields that maintain the data integrity of the target and allow for hierarchies to be summarized at various levels in the organization. Attributes provide additional descriptive information per key element. Inserts are important for dimensions or master data because subsequent factual data may rely on the existence of the dimension data row in order to load properly. Updates do not affect the data integrity as much because the factual data can usually be loaded with the existing dimensional data unless the update is to a key element. The development effort for this method is more extensive than Reject All since it involves classifying fields as critical or non-critical, and developing logic to update the target and flag the fields that are in error. The effort also incorporates some tasks from
Three methods exist for handling the creation and update of profiles: 1. The first method produces a new profile record each time a change is detected in the source. If a field value was invalid, then the original field value is maintained.
Date      | Profile Date | Field 1 Value | Field 2 Value | Field 3 Value
1/1/2000  | 1/1/2000     | Closed Sunday | Black         | Open 9-5
1/10/2000 | 1/10/2000    | Open Sunday   | Black         |
1/15/2000 | 1/15/2000    | Open Sunday   | Red           |
By applying all corrections as new profiles, this method simplifies the process: every change to the source system is applied directly to the target. Each change, regardless of whether it is a fix to a previous error, is applied as a new change that creates a new profile. This incorrectly shows in the target that two changes occurred to the source information when, in reality, a mistake was entered on the first change and should be reflected in the first profile. The second profile should not have been created.
2. The second method updates the first profile created on 1/5/2000 until all fields are corrected on 1/15/2000, which loses the profile record for the change to Field 3. If we try to apply changes to the existing profile, as in this method, we run the risk of losing profile information. If the third field changes before the second field is fixed, we show the third field as having changed at the same time as the first. When the second field is fixed, it is also added to the existing profile, which incorrectly reflects the changes in the source system.
3. The third method creates only two new profiles, but then causes an update to the profile records on 1/15/2000 to fix the Field 2 value in both.
Date: 1/1/2000, 1/5/2000
Profile Date: 1/1/2000, 1/5/2000
Field 1 Value: Closed Sunday
Field 2 Value: Black
Field 3 Value: Open 9-5, Open 9-5, Open 24hrs, Open 9-5, Open 24hrs
If we try to implement a method that updates old profiles when errors are fixed, as in this option, we need to create complex algorithms that handle the process correctly. It involves being able to determine when an error occurred and examining all profiles generated since then and updating them appropriately. And, even if we create the algorithms to handle these methods, we still have an issue of determining if a value is a correction or a new value. If an error is never fixed in the source system, but a new value is entered, we would identify it as a previous error, causing an automated process to update old profile records, when in reality a new profile record should have been entered.
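The first method above (a new profile for every detected change, with the previous value kept for any field that fails validation) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Profile` class, the `apply_change` function, the field names, and the invalid value "BRed" are hypothetical and are not taken from any particular ETL tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Profile:
    profile_date: str
    values: Dict[str, str]


def apply_change(
    profiles: List[Profile],
    load_date: str,
    source_row: Dict[str, str],
    is_valid: Callable[[str, str], bool],
) -> List[Profile]:
    """Append a new profile when the validated source row differs from the latest profile."""
    latest = profiles[-1].values
    # Keep the previously loaded value for any field that fails validation.
    candidate = {
        field: value if is_valid(field, value) else latest[field]
        for field, value in source_row.items()
    }
    if candidate != latest:
        profiles.append(Profile(load_date, candidate))
    return profiles


def is_valid(field: str, value: str) -> bool:
    # Hypothetical rule: "BRed" stands in for a mistyped Field 2 value.
    return not (field == "field2" and value == "BRed")


profiles = [
    Profile("1/1/2000", {"field1": "Closed Sunday", "field2": "Black", "field3": "Open 9-5"})
]
# Field 1 changes, but Field 2 arrives with an invalid value and keeps "Black".
apply_change(
    profiles,
    "1/5/2000",
    {"field1": "Open Sunday", "field2": "BRed", "field3": "Open 9-5"},
    is_valid,
)
# The correction to Field 2 arrives later and is recorded as a separate profile.
apply_change(
    profiles,
    "1/15/2000",
    {"field1": "Open Sunday", "field2": "Red", "field3": "Open 9-5"},
    is_valid,
)
```

In this sketch the correction ends up as its own profile, which is exactly the drawback described for the first method: the target shows two changes where the source user intended one.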
Recommended Method
Attribute type                                   | Default value
Attributes that are foreign keys to other tables | Unknown
Y/N indicator fields                             | No
Any other type of attribute                      | Null or business-provided value
Reference tables are used to normalize the target model to prevent the duplication of data. When a source value does not translate into a reference table value, we use the Unknown value. (All reference tables contain a value of Unknown for this purpose.) The business should provide default values for each identified attribute. Fields that are restricted to a limited domain of values (e.g., On/Off or Yes/No indicators) are referred to as small-value sets. When errors are encountered in translating these values, we use the value that represents Off or No as the default. Other values, like numbers, are handled on a case-by-case basis. In many cases, the data integration process is set to populate Null into these fields, which means undefined in the target. After a source system value is corrected and passes validation, it is corrected in the target.
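The default-value rules above could be expressed along these lines. This is a minimal sketch, not a prescribed implementation: the attribute names, the `ATTRIBUTE_KINDS` classification, the `BUSINESS_DEFAULTS` mapping, and the use of "N" to represent No are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set

UNKNOWN = "Unknown"

# Hypothetical classification of target attributes; the field names are illustrative only.
ATTRIBUTE_KINDS: Dict[str, str] = {
    "location_type": "foreign_key",
    "open_sunday": "indicator",
    "square_feet": "other",
}

# Hypothetical business-provided defaults; attributes not listed fall back to None (Null).
BUSINESS_DEFAULTS: Dict[str, Optional[str]] = {}


def default_for(field: str) -> Optional[str]:
    """Return the default used when a field fails validation."""
    kind = ATTRIBUTE_KINDS[field]
    if kind == "foreign_key":
        return UNKNOWN  # reference tables carry an Unknown row for this purpose
    if kind == "indicator":
        return "N"  # small-value sets default to the Off/No value
    return BUSINESS_DEFAULTS.get(field)  # None stands in for Null


def apply_defaults(row: Dict[str, str], invalid_fields: Set[str]) -> Dict[str, Optional[str]]:
    """Replace every field flagged as invalid with its default; keep the rest unchanged."""
    return {
        field: default_for(field) if field in invalid_fields else value
        for field, value in row.items()
    }


row = {"location_type": "XX", "open_sunday": "maybe", "square_feet": "abc"}
cleaned = apply_defaults(row, {"location_type", "open_sunday", "square_feet"})
```

Once the source value is corrected and passes validation, the field would simply no longer appear in `invalid_fields`, and the real value would flow through to the target.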
Fact Errors
If there are no business rules that reject fact records except for relationship errors to dimensional data, then when we encounter errors that would cause a fact to be rejected, we save these rows to a reject table for reprocessing the following night. This nightly reprocessing continues until the data successfully enters the target data structures. Initial and periodic analyses should be performed on the errors to determine why they are not being loaded.
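The nightly reject-and-reprocess cycle might look like the following sketch. It assumes the only rejection rule is a missing dimension row, as described above; the `load_facts` function, the `location_key` field, and the sample keys are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Fact = Dict[str, object]


def load_facts(
    incoming: List[Fact],
    rejects: List[Fact],
    dimension_keys: Set[str],
) -> Tuple[List[Fact], List[Fact]]:
    """Load facts whose dimension row exists; return (loaded, still_rejected)."""
    loaded: List[Fact] = []
    still_rejected: List[Fact] = []
    # Retry the previously rejected rows ahead of tonight's new rows.
    for fact in rejects + incoming:
        if fact["location_key"] in dimension_keys:
            loaded.append(fact)
        else:
            still_rejected.append(fact)
    return loaded, still_rejected


# Night 1: STORE-2 has no dimension row yet, so its fact goes to the reject table.
night1_loaded, reject_table = load_facts(
    incoming=[
        {"location_key": "STORE-1", "sales": 100},
        {"location_key": "STORE-2", "sales": 250},
    ],
    rejects=[],
    dimension_keys={"STORE-1"},
)

# Night 2: the dimension row now exists, so the rejected fact is reprocessed and loaded.
night2_loaded, reject_table_after = load_facts(
    incoming=[],
    rejects=reject_table,
    dimension_keys={"STORE-1", "STORE-2"},
)
```

Rows that remain in the reject table night after night are the ones the periodic error analysis should examine.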
Data Stewards
Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new entities in dimensional data, and designating one primary data source when multiple sources exist. Reference data and translation tables enable the target data architecture to maintain consistent descriptions across multiple source systems, regardless of how the source system stores the data. New entities in dimensional data include new locations, products, hierarchies, etc. Multiple source data occurs when two source systems can contain different data for the same dimensional entity.
Reference Tables
The target data architecture may use reference tables to maintain consistent descriptions. Each table contains a short code value as a primary key and a long description for reporting purposes. A translation table is associated with each reference table to map the codes to the source system values. Using both of these tables, the ETL process can load data from the source systems into the target structures.
These values are used by the data integration process to correctly load the target. Other source systems that maintain a similar field may use a two-letter abbreviation like OF, ST and WH. The data steward would make the following entries into the translation table to maintain consistency across systems:
Source Value | Code Translation
OF           | OFFICE
ST           | STORE
WH           | WAREHSE
The data stewards are also responsible for maintaining the reference table that translates the codes into descriptions. The ETL process uses the reference table to populate the following values into the target:
Code Translation | Code Description
OFFICE           | Office
STORE            | Retail Store
WAREHSE          | Distribution Warehouse
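The two lookups described above, source value to code and code to description, can be sketched as plain dictionaries. The table contents come from the example; the `UNKNOWN` code name and the `translate` function are assumptions added to show how an unmapped source value could fall back to the Unknown entry.

```python
from typing import Dict, Tuple

# Translation table: source system value -> target code.
TRANSLATION: Dict[str, str] = {"OF": "OFFICE", "ST": "STORE", "WH": "WAREHSE"}

# Reference table: target code -> long description for reporting.
# The "UNKNOWN" code name is an assumption for illustration.
REFERENCE: Dict[str, str] = {
    "OFFICE": "Office",
    "STORE": "Retail Store",
    "WAREHSE": "Distribution Warehouse",
    "UNKNOWN": "Unknown",
}


def translate(source_value: str) -> Tuple[str, str]:
    """Map a source value to (code, description), falling back to Unknown."""
    code = TRANSLATION.get(source_value, "UNKNOWN")
    return code, REFERENCE[code]


store = translate("ST")
unmapped = translate("HB")  # no translation entry exists for this value
```

Keeping the source-specific mapping separate from the reference table means a new source system only needs new translation rows; the codes and descriptions stay unchanged.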
Error handling is required when the data steward enters incorrect information for these mappings and needs to correct it after data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered ST as translating to OFFICE by mistake). The only way to determine which rows should be changed is to restore and reload source data from the first time the mistake was entered. Processes should be built to handle these types of situations, including correction of the entire target data architecture.
Dimensional Data
New entities in dimensional data present a more complex issue. New entities in the target may include Locations and Products, at a minimum. Dimensional data uses the same concept of translation as reference tables. These translation tables map the source system value to the target value. For location, this is straightforward, but over time, products may have multiple source system values that map to the same product in the target. (Other similar translation issues may also exist, but Products serves as a good example for error handling.) There are two possible methods for loading new dimensional entities. Either require the data steward to enter the translation data before allowing the dimensional data into the target, or create the translation data through the ETL process and force the data steward to review it. The first option requires the data steward to create the translation for new entities, while the second lets the ETL process create the translation, but marks the record as Pending Verification until
Manual Updates
Over time, any system is likely to encounter errors that are not correctable using source systems. A method needs to be established for manually entering fixed data and applying it correctly to the entire target data architecture, including beginning and ending effective dates. These dates are useful for both profile and date event fixes. Further, a log of these fixes should be maintained to enable identifying the source of the fixes as manual rather than part of the normal load process.
Multiple Sources
The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources contain subsets of the required information. For example, one system may contain Warehouse and Store information while another contains Store and Hub information. Because they share Store information, it is difficult to decide which source contains the correct information. When this happens, both sources have the ability to update the same row in the target. If both sources are allowed to update the shared information, data accuracy and profile problems are likely to occur. If we update the shared information on only one source system, the two systems then contain different information. If the changed system is loaded into the target, it creates a new profile indicating the information changed. When the second system is loaded, it compares its old unchanged value to the new profile, assumes a change occurred and creates another new profile with the old, unchanged value. If the two systems remain different, the process causes two profiles to be loaded every day until the two source systems are synchronized with