
Cognizant 20-20 Insights | September 2015

From Relational Database Management to Big Data: Solutions for Data Migration Testing

A successful approach to big data migration testing requires end-to-end automation and swift verification of huge volumes of data to produce quick and lasting results.
Executive Summary

Large enterprises face numerous challenges in connecting multiple CRM applications and their data warehouse systems to reach end users across the multitude of products they offer. When their disparate data is spread across multiple systems, these enterprises cannot:

• Conduct sophisticated analytics that substantially improve business decision-making.

• Offer better search and data sharing.

• Gain a holistic view of a single individual across multiple identities; customers may have multiple accounts due to multiple locations or devices, such as company or Facebook IDs.

• Unlock the power of data science to create reports using tools of their choice.

In such situations, companies lose the ability to understand customers. Overcoming these obstacles is critical to gaining the insights needed to customize user experience and personalize interactions. By applying Code Halo™ thinking[1], distilling insights from the swirl of data that surrounds people, processes, organizations and devices, companies of all shapes and sizes and across all sectors can gain a deep understanding of customers. Such insight will reveal what customers are buying, doing, saying, thinking and feeling, as well as what they need.

But this requires capturing and analyzing huge pools of interactional and transactional data. Capturing such large data sets has created a double-edged sword for many companies: on the plus side, it affords them the opportunity to make meaning from Code Halo intersections; the downside is figuring out how and where to store all this data.

Enter Hadoop, the de facto open-source standard that is increasingly being used by many companies in large data migration projects. Hadoop is an open-source framework that allows for the distributed processing of large data sets. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. As data from different sources flows into Hadoop, the biggest challenge is data validation from source to Hadoop. In fact, according to a report published by IDG Enterprise, "70% of enterprises have either deployed or are planning to deploy big data projects and programs this year."[2]



With the huge amount of data migrated to Hadoop and other big data platforms, the challenge of data quality emerges. The simple, widely used but cumbersome solution is manual validation. However, manual validation is not scalable, may not offer any significant value-add to customers, impacts project schedules and squeezes testing cycle times.

This white paper posits a solution: a framework that can be adopted across industries to perform effective big data migration testing entirely with open-source tools.

Challenges in RDBMS to Big Data Migration Testing

Big data migration typically involves multiple source systems and large volumes of data, yet most organizations lack the open-source tools to handle this important task. The right tool should be quick to set up and offer multiple customization options. Migration generally happens in entity batches: a set of entities is selected, migrated and tested, and this cycle repeats until all application data is migrated.
An easily scalable solution can reduce consecutive testing cycles, and even minimal human intervention can hinder testing efforts. Another challenge comes when defining effective scenarios for each entity. Performing 100% field-to-field validation of the data is ideal, but when the data volume is in petabytes, test execution duration increases tremendously. A proper sampling method should therefore be adopted, and solid data transformation rules should be considered in testing.

Big Data Migration Process

Hadoop as a service is offered by Amazon Web Services (AWS), a cloud computing solution that abstracts the operational challenges of running Hadoop and makes medium- and large-scale data processing accessible, easy, fast and inexpensive. The typical services used include Amazon S3 (Simple Storage Service) and Amazon EMR (Elastic MapReduce). Also preferred is Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service.

The migration to the AWS Hadoop environment is a three-step process:

• Cloud service: Virtual or physical machines connect to the source databases and extract the tables using Sqoop, which pushes them to Amazon S3 (see the sketch after this list).

• Cloud storage: Amazon S3 cloud storage holds all the data sent by the virtual machines, stored in flat file format.

• Data processing: Amazon EMR processes and distributes vast amounts of data using Hadoop. The data is picked up from S3 and stored as Hive tables (see Glossary).
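A minimal sketch of the extraction step is shown below. It assumes a MySQL source schema named crm, a password file on HDFS and an S3 bucket named my-migration-bucket; the connection string, credentials, table name and bucket are placeholders rather than the actual environment used in this project.

#!/bin/bash
# Hypothetical example: extract one RDBMS table with Sqoop and land it
# in Amazon S3 as comma-delimited flat files. All names are placeholders.

SRC_JDBC="jdbc:mysql://source-db.example.com:3306/crm"
SRC_USER="etl_user"
TABLE="S_CONTACT"
S3_TARGET="s3a://my-migration-bucket/landing/${TABLE}"

sqoop import \
  --connect "${SRC_JDBC}" \
  --username "${SRC_USER}" \
  --password-file /user/etl/.db_password \
  --table "${TABLE}" \
  --num-mappers 4 \
  --fields-terminated-by ',' \
  --target-dir "${S3_TARGET}"

From S3, Amazon EMR can then expose the extracted files as Hive tables for the validation steps that follow.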
RDBMS to Big Data Migration Testing Solution

Step 1: Define Scenarios

To test the migrated data, a one-to-one comparison of all the entities is required. Since big data volumes are (as the term suggests) huge, three test scenarios are performed for each entity:

• Count reconciliation for all rows.

• Find missing primary keys for all rows.

• Compare field-to-field data for sample records.

These steps are required to, first, verify the record count in the source and target databases and, second, ensure that all records from the source systems flow to the target database, which is verified by checking the primary key of every record in both the source and the target systems; this confirms that all records are present in the target database. Third, and most important, is comparing the source and target databases across all columns for sample records. This ensures that the data is not corrupted, date formats are maintained and data is not truncated. The number of records for sample testing can be decided according to the data volume; basic data corruption can be identified by testing 100 sample records.
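As an illustration of these three scenarios, the sketch below runs them for a single entity with sqlcmd, assuming that the source extract and the Hive extract have already been loaded into a QA SQL Server database as SRC_S_CONTACT and HIVE_S_CONTACT; the server, database, table and column names are hypothetical.

#!/bin/bash
# Hypothetical sketch of the three checks for one entity. The staging
# tables SRC_S_CONTACT and HIVE_S_CONTACT are assumed to exist already.

SQLCMD="sqlcmd -S qa-sqlserver -d MigrationQA -E"

# 1. Count reconciliation for all rows.
$SQLCMD -Q "SELECT (SELECT COUNT(*) FROM SRC_S_CONTACT)  AS src_cnt,
                   (SELECT COUNT(*) FROM HIVE_S_CONTACT) AS tgt_cnt;"

# 2. Missing primary keys: ROW_IDs present in the source but not in Hive.
$SQLCMD -Q "SELECT s.ROW_ID
            FROM SRC_S_CONTACT s
            LEFT JOIN HIVE_S_CONTACT h ON s.ROW_ID = h.ROW_ID
            WHERE h.ROW_ID IS NULL;"

# 3. Field-to-field comparison for a sample of 100 records.
$SQLCMD -Q "SELECT TOP 100 s.ROW_ID
            FROM SRC_S_CONTACT s
            JOIN HIVE_S_CONTACT h ON s.ROW_ID = h.ROW_ID
            WHERE s.FST_NAME <> h.FST_NAME OR s.LAST_NAME <> h.LAST_NAME;"

In the framework described later, checks like these are wrapped in stored procedures so they can run for every table in the batch.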
Step 2: Choose the Appropriate Method of Testing

Per our analysis, we shortlisted two methods of testing:

• UNIX shell script and T-SQL-based reconciliation.

• PIG scripting.



Testing Approach Comparison

UNIX shell script and T-SQL-based reconciliation

Prerequisites: Load the target Hadoop data into a central QA server (SQL Server) as separate entities and validate it against the source tables. A SQL Server database stores the tables and performs the comparisons using SQL queries. A preconfigured linked server in the SQL Server database is needed to connect to all source databases.

Efforts: Initial coding for five to 10 tables takes one week; consecutive additions take two days for ~10 tables.

Automation/manual: Full automation possible.

Performance (Windows XP, 3 GB RAM, 1 CPU): Delivers results quickly compared with the other method. For 15 tables with an average of 100K records each: ~30 minutes for counts, ~20 minutes for a 100-record sample, ~1 hour for missing primary keys.

Highlights: Full automation and job scheduling possible; fast comparison; no permission or security issues faced while accessing big data on AWS.

Low points: Initial framework setup is time-consuming.

PIG scripting

Prerequisites: Migrate the data from the RDBMS to HDFS and compare the QA HDFS files with the Dev HDFS files using Pig scripting. Flat files for each entity are created using the Sqoop tool.

Efforts: Compares flat files; scripting is needed for each column in the table, so effort is directly proportional to the number of tables and their columns.

Automation/manual: No automation possible.

Performance (Windows XP, 3 GB RAM, 1 CPU): Requires migration of the source tables to HDFS files as a prerequisite, which is time-consuming; the processing itself can be faster than the other method.

Highlights: Offers a lot of flexibility in coding; very useful for more complex transformations.

Low points: Greater effort for decoding, reporting results and handling script errors.
Figure 1

Another option is to use the Microsoft Hive ODBC Driver to access Hive data, but this approach is more appropriate for smaller volumes.

Figure 1 shows a comparison of the two methods. Based on this comparison, we recommend a focus on the first approach, where full end-to-end automation is possible. If any transformations are present, they need to be performed in a staging layer, which can then be treated as the source for the same solution. According to the above analysis, PIG scripting is more appropriate for testing migrations with complex transformation logic; but for this type of simple migration, the PIG scripting approach is very time-consuming and resource-intensive.
a focus on the first approach, where full



High-Level Validation Approach

Figure 2 depicts the recommended validation flow. Data from the source systems (Oracle, SQL Server, MySQL and other RDBMSs) is migrated to HDFS using Sqoop and exposed as Hive tables on AWS Hadoop with the LOAD DATA INPATH 'hdfs_file' INTO TABLE command. A dynamically generated "Get Data" shell script exports, for each Hive table named in a CSV list of table names, a CSV file with the record count, a CSV file with the ROW_IDs of all rows and a CSV file with the first 100 records of all columns. WinSCP downloads these files to a Windows server (a Jenkins slave machine), where Windows batch scripts load their contents into QA tables on the SQL Server. Stored procedures on the QA server then compute the count of each source table, pull the ROW_IDs from all source tables to find missing or extra rows in the Hive results, and pull the source column data for the sampled records, all through a linked server to the various source databases, compare them with the Hive results and report any data mismatch.
Figure 2

Step 3: Build a Framework

Here we bring the data from Hadoop into a SQL Server consolidation database and validate it against the source. Figure 2 illustrates the set of methods we recommend.

• UNIX shell scripting: In the migration process, the development team uses the Sqoop tool to migrate RDBMS tables as HDFS files, which are stored in Amazon S3; the LOAD DATA INPATH command then loads these files into the tables defined in the Hive metastore. To fetch data from Hive into flat files:

>> Store the table list in a CSV file on a UNIX server.

>> Write a UNIX shell script that takes the table-list CSV file as input and generates another shell script that extracts the Hive data into CSV files for each table. This generated shell script is executed from the Windows batch file.

>> Because the extraction script is generated dynamically, only the table-list CSV file needs to be updated in each iteration/release when new tables are added.
server. WinSCP batch command interface can

Sample Code from Final Shell Script

Figure 3
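The listing in Figure 3 is not reproduced here. As an illustration only, the generated extraction step could look roughly like the following, assuming the table list is a plain file with one Hive table name per line and that every table carries a row_id column; file names and paths are placeholders.

#!/bin/bash
# Illustrative sketch (not the original Figure 3 listing): read a list of
# Hive table names and generate get_data.sh, which exports each table's
# row count, all ROW_IDs and the first 100 records into flat files.

TABLE_LIST="hive_tables.csv"      # one Hive table name per line (assumed)
OUT_SCRIPT="get_data.sh"
EXPORT_DIR="/tmp/hive_exports"

echo '#!/bin/bash' > "${OUT_SCRIPT}"
echo "mkdir -p ${EXPORT_DIR}" >> "${OUT_SCRIPT}"

while read -r TABLE; do
  [ -z "${TABLE}" ] && continue
  cat >> "${OUT_SCRIPT}" <<EOF
hive -e "SELECT '${TABLE}', COUNT(*) FROM ${TABLE};" > ${EXPORT_DIR}/${TABLE}_count.csv
hive -e "SELECT row_id FROM ${TABLE};" > ${EXPORT_DIR}/${TABLE}_rowids.csv
hive -e "SELECT * FROM ${TABLE} LIMIT 100;" > ${EXPORT_DIR}/${TABLE}_sample.csv
EOF
done < "${TABLE_LIST}"

chmod +x "${OUT_SCRIPT}"

Running get_data.sh on the Hadoop environment produces the per-table flat files that WinSCP later downloads to the Windows server.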



Results of Count Reconciliation for Hive Tables Migrated from a Webpage

SIEBEL HIVE COUNT CLUSTER RECON SUMMARY
AUD_ID: 153 | EXECUTION_DATE: 2015-04-14 21:55:27.787 | SCHEMA_NAME: SIEBELPRD | TARGET_DB: HIVE | TOTAL_PASS: 60 | TOTAL_FAIL: 94 | ENV: PRD

SIEBEL HIVE COUNT CLUSTER RECON DETAIL (all rows: AUD_ID 153, EXEC_DATE 2015-04-14 21:55:27.787)
AUD_SEQ | SOURCE_TAB_NAME | SOURCE_ROW_CNT | HIVE_TAB_NAME   | HIVE_ROW_CNT | DIFF   | PERCENTAGE_DIFF | STATUS
1       | S_ADDR_PER      | 353420         | S_ADDR_PER      | 343944       | 9476   | 2.68            | FAIL
2       | S_PARTY         | 2730468        | S_PARTY         | 2730468      | 0      | 0               | PASS
3       | S_ORG_GROUP     | 16852          | S_ORG_GROUP     | 16852        | 0      | 0               | PASS
4       | S_LST_OF_VAL    | 29624          | S_LST_OF_VAL    | 29624        | 0      | 0               | PASS
5       | S_GROUP_CONTACT | 413912         | S_GROUP_CONTACT | 413912       | 0      | 0               | PASS
6       | S_CONTACT       | 1257758        | S_CONTACT       | 1257758      | 0      | 0               | PASS
7       | S_CON_ADDR      | 6220           | S_CON_ADDR      | 6220         | 0      | 0               | PASS
8       | S_CIF_CON_MAP   | 28925          | S_CIF_CON_MAP   | 28925        | 0      | 0               | PASS
9       | S_ADDR_PER      | 93857          | S_ADDR_PER      | 93857        | 0      | 0               | PASS
10      | S_PROD_LN       | 1114           | S_PROD_LN       | 1106         | 8      | 0.72            | FAIL
11      | S_ASSET_REL     | 696178         | S_ASSET_REL     | 690958       | 5220   | 0.75            | FAIL
12      | S_AGREE_ITM_REL | 925139         | S_AGREE_ITM_REL | 917657       | 7482   | 0.81            | FAIL
13      | S_REVN          | 131111         | S_REVN          | 128949       | 2162   | 1.65            | FAIL
14      | S_ENTLMNT       | 127511         | S_ENTLMNT       | 125144       | 2367   | 1.86            | FAIL
15      | S_ASSET_XA      | 5577029        | S_ASSET_XA      | 5457724      | 119305 | 2.14            | FAIL
16      | S_BU            | 481            | S_BU            | 470          | 11     | 2.29            | FAIL
17      | S_ORG_EXT       | 345276         | S_ORG_EXT       | 336064       | 9212   | 2.67            | FAIL
18      | S_ORG_BU        | 345670         | S_ORG_BU        | 336424       | 9246   | 2.67            | FAIL

Figure 4
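Detail rows like those in Figure 4 can be derived once the per-table counts from both sides are staged in the QA database. The sketch below assumes hypothetical staging tables SRC_COUNTS and HIVE_COUNTS, each with TAB_NAME and ROW_CNT columns; in the framework itself, stored procedures produce these results.

#!/bin/bash
# Hypothetical sketch: compute per-table difference, percentage difference
# and pass/fail status from staged count tables. Names are placeholders.

sqlcmd -S qa-sqlserver -d MigrationQA -E -Q "
SELECT s.TAB_NAME                            AS SOURCE_TAB_NAME,
       s.ROW_CNT                             AS SOURCE_ROW_CNT,
       h.ROW_CNT                             AS HIVE_ROW_CNT,
       s.ROW_CNT - h.ROW_CNT                 AS DIFF,
       CAST(100.0 * (s.ROW_CNT - h.ROW_CNT) / NULLIF(s.ROW_CNT, 0)
            AS DECIMAL(6,2))                 AS PERCENTAGE_DIFF,
       CASE WHEN s.ROW_CNT = h.ROW_CNT THEN 'PASS' ELSE 'FAIL' END AS STATUS
FROM SRC_COUNTS s
JOIN HIVE_COUNTS h ON h.TAB_NAME = s.TAB_NAME
ORDER BY PERCENTAGE_DIFF DESC;"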

• WinSCP: The next step is to transfer the files from the Hadoop environment to the Windows server. The WinSCP batch command interface can be used for this. The WinSCP batch file (.SCP) connects to the Hadoop environment using an open sftp command, and a simple "GET" command with the file name copies each file to the Windows server.

• SQL Server database usage: SQL Server is the main database used for loading the Hive data and the final reconciliation results. A SQL script loads the data from the .CSV files into the database tables using the "Bulk Insert" command (a sketch follows this list).

• Windows batch command: The above steps of transferring the data in .CSV files, importing the files into SQL Server and validating the source and target data should all run sequentially. The entire validation process can be automated by creating a Windows batch file. The batch file executes the shell script on the Hadoop environment from the Windows server using the Plink command; in this way, all Hive data is loaded into the SQL Server tables. The next step is to execute the SQL Server procedures that perform the count, primary key and sample data comparisons; we use SQLCMD to run these procedures from the batch file.

• Jenkins: The end-to-end validation process can be triggered by Jenkins. Jenkins jobs can be scheduled to execute on an hourly, daily or weekly basis without manual intervention. On Jenkins, an auto-scheduled ANT script invokes a Java program that connects to the SQL Server and generates an HTML report of the latest results. Jenkins jobs can e-mail the results to predefined recipients in HTML format.
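A minimal sketch of the load-and-compare step is shown below; in the setup described here it would be driven from the Windows batch file via SQLCMD, but the commands are equivalent. The database, file path, staging table and procedure names are assumptions for illustration only.

#!/bin/bash
# Hypothetical sketch: bulk-load one downloaded Hive extract into the QA
# SQL Server database and run a comparison procedure. Names are placeholders.

SQLCMD="sqlcmd -S qa-sqlserver -d MigrationQA -E"

# Load the flat file; ROWTERMINATOR 0x0a is the linefeed (char(10)) needed
# for files created on a UNIX server (see Figure 5, issue 4).
$SQLCMD -Q "BULK INSERT dbo.HIVE_S_CONTACT
            FROM 'C:\staging\S_CONTACT_sample.csv'
            WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0a');"

# Run a (hypothetical) reconciliation procedure for this entity.
$SQLCMD -Q "EXEC dbo.usp_reconcile_entity @entity = 'S_CONTACT';"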



Overcoming Test Migration Challenges

1. Issue: Column data type mismatch errors while loading .CSV files and Hive data into the SQL Server tables.
   Resolution: Create the tables in SQL Server by matching the Hive table data types.

2. Issue: No FTP access to the Hadoop environment to transfer files.
   Resolution: Use WinSCP.

3. Issue: Column sequence mismatch between Hive tables and source tables, which causes the load of the .CSV files into the Hive_* tables to fail.
   Resolution: Create the tables in SQL Server for the target entities by matching the Hive table column order.

4. Issue: Inability to load .CSV files due to an end-of-file issue in SQL Server bulk insert.
   Resolution: Update the SQL statement with the appropriate row terminator, char(10) (linefeed), which allows import of .CSV files created on a UNIX server.

5. Issue: Performance issues on primary key validations.
   Resolution: Tune the SQL Server stored procedures and increase the SQL Server temp DB space.

6. Issue: Commas in column values.
   Resolution: Create a TSV file so the embedded commas do not break the load; remove NULL/null values from the TSV and generate a .txt file; finally convert it from UTF-8 to UTF-16 and generate an XLS file, which can be loaded into the SQL Server database.
Figure 5

The Advantages of a Big Data Migration Testing Framework*

Scenario            | Manual (mins.) | Framework (mins.) | Gain (mins.) | % Gain
Count               | 20             | 2                 | 18           | 90.00%
Sample 100 Records  | 100            | 1.3               | 98.7         | 98.70%
Missing Primary Key | 40             | 4                 | 36           | 90.00%

* Effort calculated for one table with around 500K records, including summary report generation.
Figure 6

Implementation Issues and Resolutions

Organizations may face a number of implementation issues; Figure 5 lists common issues and probable resolutions.

Impact of Testing

Figure 6 compares manual, Excel-based testing with our framework for one of the customer's CRM applications, which is based on Oracle and SQL Server databases.

Looking Forward

More and more organizations are using big data tools and techniques to quickly and effectively analyze data for improved customer understanding and product/service delivery. This white paper presents a framework to help organizations conduct big data migration testing more quickly, efficiently and accurately. As your organization moves forward, here are key points to consider before implementing a framework like the one presented in this white paper:

• Think big when it comes to big data testing. Choose an optimum data subset for testing; sampling should be based on geographies, priority customers, customer types, product types and product mix (see the sketch after this list).

• Create an environment to accommodate huge data sets. Cloud setups are recommended.

• Be aware of the Agile/Scrum cadence mismatch. Break up data into smaller incremental blocks as a work-around.

• Get smart about open-source capabilities. Spend a good amount of time up front understanding the tools and techniques that drive success.


Glossary
• AWS: Amazon Web Services is a collection of remote computing services, also called Web services,
that make up a cloud computing platform from Amazon.com.

• Amazon EMR: Amazon Elastic MapReduce is a Web service that makes it easy to quickly and cost-
effectively process vast amounts of data.

• Amazon S3: Amazon Simple Storage Service provides developers and IT teams with secure, durable,
highly-scalable object storage.

• Hadoop: Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing.

• Hive: Hive is a data warehouse infrastructure built on top of Hadoop for providing data summariza-
tion, query and analysis. Amazon maintains a software fork of Apache Hive that is included in Amazon
EMR on AWS.

• Jenkins: Jenkins is an open-source, continuous-integration tool written in Java. Jenkins provides continuous integration services for software development.

• PIG scripting: PIG is a high-level platform for creating MapReduce programs used with Hadoop. The
language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java
MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of
SQL for RDBMS systems.

• RDBMS: A relational database management system is a database management system (DBMS) that
is based on the relational model as invented by E.F. Codd, of IBM’s San Jose Research Laboratory.

• Sqoop: Sqoop (SQL-to-Hadoop) is a big data tool that offers the capability to extract data from
non-Hadoop data stores, transform the data into a form usable by Hadoop and then load the data into
HDFS. This process is briefly called extract, transform and load (ETL).

• WinSCP: Windows Secure Copy is a free and open-source SFTP, SCP and FTP client for Microsoft
Windows. Its main function is to secure file transfer between a local and a remote computer. Beyond
this, WinSCP offers basic file manager and file synchronization functionality.

• Unix shell scripting: A shell script is a computer program designed to be run by the Unix shell, a
command line interpreter.

• T-SQL: Transact-SQL is Microsoft’s and Sybase’s proprietary extension to Structured Query Language
(SQL).



Footnotes
1 For more on Code Halos and innovation, read “Code Rules: A Playbook for Managing at the Crossroads,”
Cognizant Technology Solutions, June 2013, http://www.cognizant.com/Futureofwork/Documents/
code-rules.pdf, and the book, Code Halos: How the Digital Lives of People, Things, and Organizations Are
Changing the Rules of Business, by Malcolm Frank, Paul Roehrig and Ben Pring, published by John Wiley &
Sons, April 2014, http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118862074.html.
2 2014 IDG Enterprise Big Data Report, http://www.idgenterprise.com/report/big-data-2.

About the Author


Rashmi Khanolkar is a Senior Architect within Cognizant's Comms-Tech Business Unit. Proficient in application architecture, data architecture and technical design, Rashmi has 15-plus years of experience in the software industry. She has managed multiple data migration quality projects involving large volumes of data. Rashmi also has extensive experience on multiple development projects on .NET and MOSS 2007, and has broad knowledge of the CRM, insurance and banking domains. She can be reached at Rashmi.Khanolkar@cognizant.com.

About Cognizant
Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world's leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 100 development and delivery centers worldwide and approximately 218,000 employees as of June 30, 2015, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.

World Headquarters
500 Frank W. Burr Blvd.
Teaneck, NJ 07666 USA
Phone: +1 201 801 0233
Fax: +1 201 801 0243
Toll Free: +1 888 937 3277
Email: inquiry@cognizant.com

European Headquarters
1 Kingdom Street
Paddington Central
London W2 6BD
Phone: +44 (0) 20 7297 7600
Fax: +44 (0) 20 7121 0102
Email: infouk@cognizant.com

India Operations Headquarters
#5/535, Old Mahabalipuram Road
Okkiyam Pettai, Thoraipakkam
Chennai, 600 096 India
Phone: +91 (0) 44 4209 6000
Fax: +91 (0) 44 4209 6060
Email: inquiryindia@cognizant.com

­­© Copyright 2015, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is
subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.
Codex 1439
