
ETL Testing Process

Author: Srinivasa Rao P.V (srinivas.p@in.ibm.com)

Abstract:
This paper introduces the data warehouse and the different strategies used to test a data warehouse application. It suggests various approaches that can be beneficial while testing the ETL process in a DW. A data warehouse is a critical business application, and defects in it result in business losses that cannot be accounted for. Here, we walk you through some of the basic phases and strategies to minimize defects.

Introduction:
This is an era of global competition, and ignorance is one of the greatest threats to modern business. As such, organizations across the globe are relying on IT services for strategic decision-making. A data warehouse implementation is one such tool that comes to the rescue. Given the criticality of a DW application, a defect-free DW implementation is a dream come true for any organization. As QA and testing personnel, our role is to ensure this, thereby leading to maximized profits, better decisions and customer satisfaction. A bug in the system traced at a later stage not only increases the cost associated with rework, but also means that incorrect data may have been used to make strategic decisions. Hence, pre-implementation defect detection should be ensured. In light of the above discussion, let us take a look at the definition and examples of a DW and then at the various strategies involved in the testing cycle for a DW application.

What is a Data Warehouse?
According to Inmon, famous author of several data warehouse books, "A data warehouse is a subject oriented, integrated, time variant, non volatile collection of data in support of management's decision making process."

Example: In order to store data, over the years, many application designers in each branch have made their individual decisions as to how an application and database should be built. So source systems will differ in naming conventions, variable measurements, encoding structures, and physical attributes of data. Consider a bank that has several branches in several countries, has millions of customers, and whose lines of business are savings and loans. The following example explains how the data is integrated from source systems to target systems.

Example of Source Data: In the example below, the attribute name, column name, data type and values are entirely different from one source system to another. This inconsistency in data can become a problem while generating statistics from the historical data, so we have to avoid it by integrating the data into a data warehouse with good standards.

Example of Source Data:
System Name       Attribute Name              Column Name                 Data type      Values
Source System 1   Customer Application Date   CUSTOMER_APPLICATION_DATE   NUMERIC(8,0)   11012005
Source System 2   Customer Application Date   CUST_APPLICATION_DATE       DATE           11012005
Source System 3   Application Date            APPLICATION_DATE            DATE           01NOV2005

Example of Target Data (Data Warehouse):
Record      Attribute Name              Column Name                 Data type   Values
Record #1   Customer Application Date   CUSTOMER_APPLICATION_DATE   DATE        01112005
Record #2   Customer Application Date   CUSTOMER_APPLICATION_DATE   DATE        01112005
Record #3   Customer Application Date   CUSTOMER_APPLICATION_DATE   DATE        01112005

In the above example of target data, attribute names, column names, and data types are consistent throughout the target system. This is how data from various source systems is integrated and accurately stored in the data warehouse.
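To make the standardization concrete, here is a minimal Python sketch that parses the three source representations shown above into the single attribute, column and date format used in the target. The date formats and record layout are assumptions made for this example rather than features of any particular ETL tool.

from datetime import datetime

# Hypothetical raw values as they might arrive from the three source systems
# described above: a NUMERIC(8,0) value, a string in the same layout, and a
# DDMONYYYY string. The formats are assumptions made only for illustration.
source_rows = [
    {"system": "Source System 1", "column": "CUSTOMER_APPLICATION_DATE", "value": "11012005"},
    {"system": "Source System 2", "column": "CUST_APPLICATION_DATE",     "value": "11012005"},
    {"system": "Source System 3", "column": "APPLICATION_DATE",          "value": "01NOV2005"},
]

def to_target_date(raw: str) -> datetime:
    """Parse a source-specific date representation into a single DATE value."""
    for fmt in ("%m%d%Y", "%d%b%Y"):          # try MMDDYYYY, then DDMONYYYY
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date value: {raw!r}")

# Every record lands in the warehouse under one attribute and one column name.
for row in source_rows:
    target = {
        "attribute": "Customer Application Date",
        "column": "CUSTOMER_APPLICATION_DATE",
        "value": to_target_date(row["value"]).date(),
    }
    print(row["system"], "->", target)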

A warehouse is a relational database that is designed for query and analysis rather than transaction processing. A DW usually contains historical data that is derived from transaction data. It separates the analysis workload from the transaction workload and enables a business to consolidate data from several sources. In addition to a relational database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users. There are three types of data warehouses:

Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.

ODS (Operational Data Store) - This has a broad, enterprise-wide scope, but unlike the real enterprise data warehouse, data is refreshed in near real time and used for routine business activity. One of the typical applications of the ODS is to hold recent data before migration to the data warehouse. Typically, an ODS is not conceptually equivalent to the data warehouse, although it does store data with a deeper level of history than the OLTP data.

Data Mart - A data mart is a subset of a data warehouse, and it supports a particular region, business unit or business function.

%he !' %estin& (i e $"cle:

As with an" other piece of software a DW implementation undergoes the natural c"cle of /nit testing! ,"stem testing! 2egression testing! Integration testing and Acceptance testing. owever! unlike others there are no off$the$shelf testing products available for a DW.

Unit testing: Traditionall" this has been the task of the developer. This is a white$bo' testing to ensure the module or component is coded as per agreed upon design specifications. The developer should focus on the following+ a8 All inbound and outbound director" structures are created properl" with appropriate permissions and sufficient disk space. All tables used during the ETL; are present with necessar" privileges. b8 The ETL routines give e'pected results+ All transformation logics work as designed from source till target ?oundar" conditions are satisfied@ e.g. check for date fields with leap "ear dates ,urrogate ke"s have been generated properl" 4/LL values have been populated where e'pected 2e)ects have occurred where e'pected and log for re)ects is created with sufficient details Error recover" methods c8 That the data loaded into the target is complete+ All source data that is e'pected to get loaded into target! actuall" get loaded@ compare counts between source and target and use data profiling tools All fields are loaded with full contents@ i.e. no data field is truncated while transforming

4o duplicates are loaded Aggregations take place in the target properl" Data integrit" constraints are properl" taken care of
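As a rough illustration of the count, duplicate and NULL checks listed above, the following self-contained Python sketch uses an in-memory SQLite database. The table and column names (src_customer, tgt_customer, customer_id) are hypothetical; against a real warehouse the same queries would run on the actual source and target schemas.

import sqlite3

# Minimal stand-in tables so the sketch runs on its own; in practice the
# connections would point at the real source and target databases.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_customer (customer_id INTEGER, application_date TEXT);
    CREATE TABLE tgt_customer (customer_id INTEGER, application_date TEXT);
    INSERT INTO src_customer VALUES (1, '2005-11-01'), (2, '2005-11-02');
    INSERT INTO tgt_customer VALUES (1, '2005-11-01'), (2, '2005-11-02');
""")

def scalar(sql: str) -> int:
    return conn.execute(sql).fetchone()[0]

# 1. Source and target row counts must match.
assert scalar("SELECT COUNT(*) FROM src_customer") == \
       scalar("SELECT COUNT(*) FROM tgt_customer"), "row count mismatch"

# 2. No duplicate business keys were loaded into the target.
assert scalar("""SELECT COUNT(*) FROM (
                   SELECT customer_id FROM tgt_customer
                   GROUP BY customer_id HAVING COUNT(*) > 1)""") == 0, "duplicates found"

# 3. Mandatory columns were populated (NULLs only where expected).
assert scalar("SELECT COUNT(*) FROM tgt_customer WHERE customer_id IS NULL") == 0

print("unit-level ETL checks passed")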

System testing: %enerall" the &A team owns this responsibilit". Aor them the design document is the bible and the entire set of test cases is directl" based upon it. ere we test for the functionalit" of the application and mostl" it is black$bo'. The ma)or challenge here is preparation of test data. An intelligentl" designed input dataset can bring out the flaws in the application more >uickl". Wherever possible use production$like data. Bou ma" also use data generation tools or customi#ed tools of "our own to create test data. We must test for all possible combinations of input and specificall" check out the errors and e'ceptions. An unbiased approach is re>uired to ensure ma'imum efficienc". Cnowledge of the business process is an added advantage since we must be able to interpret the results functionall" and not )ust code$wise. The &A team must test for+ Data completeness and correctness@ match source to target counts and validate the data. Data aggregations@ match aggregated data against staging tables andDor 0D, Lookups/Transformations is applied correctl" as per specifications Granularity of data is as per specifications Error logs and audit tables are generated and populated properl" Notifications to IT and/or business are generated in proper format ETL Data Validation Components to be considered : 0rgani#ations t"picall" have Edirt" dataF that must be cleansed or scrubbed before being loaded into the data warehouse. In an ideal world! there would not be dirt" data. The data

in operational s"stems would be clean. /nfortunatel"! this is virtuall" never the case. The data in these source s"stems is the result of poor data >ualit" practices and little can be done about the data that is alread" there. While organi#ations should move toward improving data >ualit" at the source s"stem level! nearl" all data warehousing initiatives must cope with dirt" data! at least in the short term. There are man" reasons for dirt" data! including+ Dummy alues. Inappropriate values have been entered into fields. Aor e'ample! a customer service representative! in a hurr" and not perceiving entering correct data as being particularl" important! might enter the storeGs HIP code rather than the customerGs HIP! or enters III$II$IIII whenever a ,,4 is unknown. The operational s"stem accepts the input! but it is not correct. !bsence of data" Data was not entered for certain fields. This is not alwa"s attributable to la#" data entr" habits and the lack of edit checks! but to the fact that different business units ma" have different needs for certain data values in order to run their operations. Aor e'ample! the department that originates mortgage loans ma" have a federal reporting re>uirement to capture the se' and ethnicit" of a customer! whereas the department that originates consumer loans does not. #ultipurpose fields. A field is used for multiple purposesJ conse>uentl"! it does not consistentl" store the same thing. This can happen with packaged applications that include fields that are not re>uired to run the application. Different departments ma" use the Ee'traF fields for their own purposes! and as a result! what is stored in the fields is not consistent. Cryptic data" It is not clear what data is stored in a field. The documentation is poor and the attribute name provides little help in understanding the fieldGs content. The field ma" be derived from other fields or the field ma" have been used for different purposes over the "ears. Contradicting data" The data should be the same but it isnGt. Aor e'ample! a customer ma" have different addresses in different source s"stems. Inappropriate use of address lines" Data has been incorrectl" entered into address lines. Address lines are commonl" broken down into! for e'ample! Line . for first! middle! and last name! Line 9 for street address! Line ; for apartment number! and so on. Data is not alwa"s entered into the correct line! which makes it difficult to parse the data for later use.

Violation of business rules" ,ome of the values stored in a field are inconsistent with business realit". Aor e'ample! a source s"stem ma" have recorded an ad)ustable rate mortgage loan where the value of the minimum interest rate is higher than the value of the ma'imum interest rate.

$eused primary keys" A primar" ke" is not uni>ueJ it is used with multiple occurrences. There are man" wa"s that this problem can occur. Aor e'ample! assume that a branch bank has a uni>ue identifier 5i.e.! a primar" ke"8. The branch is closed and the primar" ke" is no longer in use. ?ut two "ears later! a new branch is opened! and the old identifier is reused. The primar" ke" is the same for the old and the new branch.

Non%uni&ue identifiers" An item of interest! such as a customer! has been assigned multiple identifiers. Aor e'ample! in the health care field! it is common for health care providers to assign their own identifier to patients. This makes it difficult to integrate patient records to provide a comprehensive understanding of a patientGs health care histor".

Data integration problems" The data is difficult or impossible to integrate. This can be due to non$uni>ue identifiers! or the absence of an appropriate primar" ke". To illustrate! for decades customers have been associated with their accounts through a customer name field on the account record. Integrating multiple customer accounts in this situation can be difficult. When we e'amine all the account records that belong to one customer! we find different spellings or abbreviations of the same customer name! sometimes the customer is recorded under an alias or a maiden name! and occasionall" two or three customers have a )oint account and all of their names are s>uee#ed into one name field.

There are several alternatives for cleansing dirty data. One option is to rely on the basic cleansing capabilities of ETL software. Another option is to custom-write data cleansing routines. The final alternative is to use special-purpose data cleansing software. Regardless of the alternative selected, the basic process is the same. The first step is to parse the individual data elements that are extracted from the source systems (Lyon, 1998). For example, a customer record might be broken down into first name, middle name, last name, title, firm, street number, street, city, state, and ZIP code.
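As a simplified illustration of this parsing step, the sketch below breaks a free-form customer record into individual elements with a regular expression. Real cleansing tools use much richer parsing rules and reference dictionaries; the record layout and pattern here are assumptions made only for the example.

import re

# Hypothetical free-form customer record to be parsed into elements.
raw = "Ms. Elizabeth James, 350 S Butler Dr, Chicago, IL 60601"

pattern = re.compile(
    r"(?P<title>Mr\.|Ms\.|Dr\.)?\s*(?P<first>\w+)\s+(?P<last>\w+),\s*"
    r"(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})"
)

match = pattern.match(raw)
if match:
    parsed = match.groupdict()
    print(parsed)
    # {'title': 'Ms.', 'first': 'Elizabeth', 'last': 'James',
    #  'street': '350 S Butler Dr', 'city': 'Chicago', 'state': 'IL', 'zip': '60601'}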

Data algorithms 5possibl" based on AI techni>ues8 and secondar"! e'ternal data sources 5such as /, -ensus data8 are then used to correct and enhance the parsed data. Aor e'ample! a vanit" address 5like Lake -alumet8 is replaced with the ErealF address 5-hicago8 and the plus four digits are added to the HIP code. 4e't! the parsed data is standardi)ed. /sing both standard and custom business rules! the data is transformed into its preferred and consistent format. Aor e'ample! a prename ma" be added 5e.g.! 1s.! Dr.8! first name match standards ma" be identified 5e.g! ?eth ma" be Eli#abeth! ?ethan"! or ?ethel8! and a standard street name ma" be applied 5e.g.! ,outh ?utler Drive ma" be transformed to ,. ?utler Dr.8. The parsed! corrected! and standardi#ed data is then scanned to match records. The matching ma" be based on simple business rules! such as whether the name and address are the same! or AI based methods that utili#e sophisticated pattern recognition techni>ues. 1atched records are then consolidated. The consolidated records integrate the data from the different sources and reflect the standards that have been applied. Aor e'ample! source s"stem number one ma" not contain phone numbers but source s"stem number two does. The consolidated record contains the phone number. The consolidated record also contains the applied standards! such as recording 1s. Eli#abeth Kames as the personGs name! with the appropriate pre$name applied. 0nce the data is cleaned! transformed! and integrated! it is read" for loading into the warehouse. The first loading provides the initial data for the warehouse. ,ubse>uent loadings can be done in one of two wa"s. 0ne alternative is to bulk load the warehouse ever" time. With this approach! all of the data 5i.e.! the old and the new8 is loaded each time. This approach re>uires simple processing logic but becomes impractical as the volume of data increases. The more common approach is to refresh the warehouse with onl" newl" generated data. Another issue that must be addressed is how fre>uentl" to load the warehouse. Aactors that affect this decision include the business need for the data and the business c"cle that provides the data. Aor e'ample! users of the warehouse ma" need dail"! weekl"! or monthl" updates! depending on their use of the data. 1ost business processes have a natural business c"cle that

generates data that can be loaded into the warehouse at various points in the cycle. For example, a company's payroll is typically run on a weekly basis; consequently, data from the payroll application is loaded to the warehouse on a weekly basis. The trend is toward continuous updating of the data warehouse. This approach is sometimes referred to as "trickle" loading of the warehouse. Several factors are driving this near real-time updating of the warehouse. As data warehouses are increasingly being used to support operational processes, having current data is important. Also, when trading partners are given access to warehouse data, the expectation is that the data is up to date. Finally, many firms operate on a global basis and there is no good time to load the warehouse: users around the world need access to the warehouse on a 24x7 basis, so a long "load window" is not acceptable.

Regression testing: A DW application is not a one-time solution. It is possibly the best example of an incremental design, where requirements are enhanced and refined quite often based on business needs and feedback. In such a situation it is very critical to test that the existing functionality of a DW application is not broken whenever an enhancement is made to it. Generally this is done by running all functional tests for existing code whenever a new piece of code is introduced. However, a better strategy could be to preserve earlier test input data and result sets and run the same tests again; the new results can then be compared against the older ones to ensure proper functionality (a sketch of such a baseline comparison appears after the performance-testing discussion below).

Integration testing: This is done to ensure that the application developed works from an end-to-end perspective. Here we must consider the compatibility of the DW application with upstream and downstream flows, and we need to ensure data integrity across the flow. Our test strategy should include testing for:
- Sequence of jobs to be executed, with job dependencies and scheduling
- Re-startability of jobs in case of failures

- Generation of error logs
- Cleanup scripts for the environment, including the database
This activity is a combined responsibility, and the participation of experts from all related applications is a must in order to avoid misinterpretation of results.

Acceptance testing: This is the most critical part, because here the actual users validate your output datasets. They are the best judges to ensure that the application works as they expect. However, business users may not have proper ETL knowledge; hence, the development and test teams should be ready to provide answers regarding the ETL process as it relates to data population. The test team must have sufficient business knowledge to translate the results into business terms. Also, the load windows, the refresh period for the DW, and the views created should be signed off by the users.

Performance testing: In addition to the above tests, a DW must necessarily go through another phase called performance testing. Any DW application is designed to be scalable and robust; therefore, when it goes into the production environment, it should not cause performance problems. Here, we must test the system with huge volumes of data. We must ensure that the load window is met even under such volumes. This phase should involve the DBA team, the ETL experts and others who can review and validate your code for optimization.
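Returning to the regression-testing strategy described earlier, preserving the result set of an earlier run and comparing each new run against it can be automated with a small utility. The sketch below assumes the result sets are exported to CSV files keyed by a business key; the file names and the key column are hypothetical.

import csv
from pathlib import Path

BASELINE = Path("baseline_results.csv")   # result set preserved from an earlier run
CURRENT = Path("current_results.csv")     # result set produced by the latest run

def load(path: Path) -> dict:
    """Index a result set by its business key so rows can be compared."""
    with path.open(newline="") as fh:
        return {row["customer_id"]: row for row in csv.DictReader(fh)}

def compare(baseline: dict, current: dict) -> list:
    """Return (key, baseline row, current row) for every differing or missing row."""
    diffs = []
    for key in baseline.keys() | current.keys():
        if baseline.get(key) != current.get(key):
            diffs.append((key, baseline.get(key), current.get(key)))
    return diffs

if __name__ == "__main__":
    differences = compare(load(BASELINE), load(CURRENT))
    for key, old, new in differences:
        print(f"key {key}: baseline={old} current={new}")
    print(f"{len(differences)} differing rows")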

$onclusion: *inall" a few words of caution to end with. Testing a DW application should be done
with a sense of utmost responsibilit". A bug in a DW traced at a later stage results in unpredictable losses. And the task is even more difficult in the absence of an" single end$to$end testing tool. ,o the strategies for testing should be methodicall" developed! refined and streamlined. This is also true since the re>uirements of a DW are often d"namicall" changing. /nder such circumstances repeated discussions with development team and users is of utmost

importance to the test team. Another area of concern is test coverage. This has to be reviewed multiple times to ensure completeness of testing. Alwa"s remember! a DW tester must go an e'tra mile to ensure near defect free solutions.

Examples of Data Validation Testing Levels

There are several levels of testing that can be performed during data warehouse testing. Some examples:
- Constraint testing
- Source to target counts
- Source to target data validation
- Error processing
The level of testing to be performed should be defined as part of the testing strategy.

Constraints: During constraint testing, the objective is to validate unique constraints, primary keys, foreign keys, indexes, and relationships. The test script should include these validation points. Some ETL processes can be developed to validate constraints during the loading of the warehouse. If the decision is made to add constraint validation to the ETL process, the ETL code must validate all business rules and relational data requirements. Depending solely on the automation of constraint testing is risky: when the setup is not done correctly, or not maintained through the ever-changing requirements process, the validation could become incorrect and nullify the tests.

Counts: The objective of the count test scripts is to determine whether the record counts in the source match the record counts in the target. Some ETL processes are capable of capturing record count information such as records read, records written, records in error, etc. If the ETL process being used can capture that level of detail and create a list of the counts, allow it to do so; this will save time during the validation process.

Source to Target: No ETL process is smart enough to perform source to target field-to-field validation on its own. This piece of the testing cycle is the most labor intensive and requires the most thorough analysis of the data. There are a variety of tests that can be performed during source to target validation. Below is a list of tests that are best practices:

Threshold testing - expose any truncation that may occur during the transformation or loading of data. For example:
Source: table1.field1 (VARCHAR40)
Stage: table2.field5 (VARCHAR25)
Target: table3.field2 (VARCHAR40)
In this example the source field has a threshold of 40, the stage field has a threshold of 25, and the target mapping has a threshold of 40. The last 15 characters will be truncated during the ETL process of the stage table: any data stored in positions 26-40 will be lost during the move from source to staging.

Field to Field - is a constant value being populated during the ETL process? It should not be, unless it is documented in the requirements and subsequently documented in the test scripts. Do the values in the source fields match the values in the respective target fields? Below are two additional field-to-field tests that should occur.

Initialization - during the ETL process, if the code does not re-initialize the cursor (or working storage) after each record, there is a chance that fields with null values may contain data from a previous record. For example:
Record 125: Source field1 = Red, Target field1 = Red
Record 126: Source field1 = null, Target field1 = Red
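A field-to-field comparison that catches both ordinary value mismatches and the initialization defect shown above might look like the following sketch. The records and field names are illustrative; in practice, source and target rows would be fetched by the same business key from the two databases.

# Hypothetical source and target rows keyed by record number, mirroring the
# Record 125/126 example above (record 126 shows the carry-over defect).
source_rows = {125: {"field1": "Red"}, 126: {"field1": None}}
target_rows = {125: {"field1": "Red"}, 126: {"field1": "Red"}}

for key, src in source_rows.items():
    tgt = target_rows[key]
    for field, src_value in src.items():
        tgt_value = tgt.get(field)
        if src_value is None and tgt_value is not None:
            # Likely a working-storage/cursor re-initialization defect:
            # the target kept the value from the previous record.
            print(f"record {key}.{field}: source is NULL but target holds {tgt_value!r}")
        elif src_value != tgt_value:
            print(f"record {key}.{field}: source {src_value!r} != target {tgt_value!r}")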

Acron"ms: .. DW@ Data Warehouse 9. &A@ &ualit" Assurance ;. ETL@ E'traction! Transformation and Loading L. 0D,@ 0perational Data ,tore Re erences: .. Data Warehousing@ ,oumendra 1ohant" 9. ,trategies for testing data warehouse applications@ Keff Theobald! DW review 1aga#ine! Kune 977N issue ;. The Data Warehouse Toolkit@ 2alph Cimball
