Another option is to use the Microsoft Hive ODBC Driver to access Hive data, but this approach is more appropriate for smaller volumes.

Figure 1 shows a comparison of the two methods.

Hence, based on this comparison, we recommend a focus on the first approach, where full end-to-end automation is possible. If any transformations are present, those need to be performed in the staging layer – which can be treated as the source – to further implement similar solutions. According to the above analysis, PIG scripting is more appropriate for testing migrations with complex transformation logic. But for this type of simple migration, the PIG scripting approach is very time-consuming and resource-intensive.

Figure 2: AWS Hadoop
Step 3: Build a Framework

Here we bring in data from Hadoop to a SQL server's consolidation database and validate it against the source. Figure 2 illustrates the set of methods we recommend.

• UNIX shell scripting: In the migration process, the development team uses the Sqoop tool to migrate RDBMS tables as HDFS files. The table definitions are created in the Hive metastore, and the LOAD DATA INPATH command loads the HDFS files into these tables. The HDFS files are stored in Amazon S3.

To fetch data from Hive to a flat file:

>> Store the table list in a CSV file on a UNIX server.
>> Write a UNIX shell script that takes the table list CSV file as input and generates another shell script to extract the Hive data into CSV files for each table.
»» This shell script will be executed from the Windows batch file.
>> Generate the UNIX shell script dynamically, so that only the table list CSV file needs to be updated in each iteration/release for new table additions.
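For illustration only, the extraction script could take a form similar to the minimal sketch below. The paths, database names and delimiter handling are assumptions made for this sketch rather than details of the original framework; in practice the script would be generated from the table list for each release, as described above.

    #!/bin/sh
    # Minimal sketch: read table names from the table list CSV and export each
    # Hive table to a CSV file. Paths and CSV layout (db_name,table_name) are hypothetical.
    TABLE_LIST=/home/etl/table_list.csv      # assumed location of the table list
    OUT_DIR=/home/etl/hive_extracts          # assumed output directory

    while IFS=',' read -r db_name table_name; do
        # The Hive CLI prints tab-separated rows; convert tabs to commas for the CSV file.
        hive --hiveconf hive.cli.print.header=true \
             -e "SELECT * FROM ${db_name}.${table_name};" \
             | tr '\t' ',' > "${OUT_DIR}/${table_name}.csv"
    done < "${TABLE_LIST}"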
Figure 3
Figure 4
• WinSCP: The next step is to transfer the files in the Hadoop environment to the Windows server. The WinSCP batch command interface can be implemented for this. The WinSCP batch file (.SCP) connects to the Hadoop environment using an open sftp command. A simple “GET” command with the file name can copy the file to the Windows server.
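A WinSCP batch script of this kind might resemble the following sketch; the host name, credentials and directory paths shown are placeholders, not values from the original project.

    # Hypothetical WinSCP script (e.g. getfiles.scp), run with: winscp.com /script=getfiles.scp
    # The server's host key must already be cached, or supplied with the -hostkey switch.
    option batch abort
    option confirm off
    open sftp://etl_user:password@hadoop-edge-node/
    get /home/etl/hive_extracts/*.csv C:\staging\hive_csv\
    exit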
• SQL server database usage: The SQL server is the main database used for loading the Hive data and the final reconciliation results. A SQL script is created to load data from the .CSV files into the database tables. The script uses the “Bulk Insert” command.
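Such a load script might look like the following T-SQL sketch; the staging table name, file path and delimiter settings are assumptions for illustration.

    -- Hypothetical load of one extracted Hive table into a staging table.
    BULK INSERT dbo.stg_customer
    FROM 'C:\staging\hive_csv\customer.csv'
    WITH (
        FIELDTERMINATOR = ',',   -- column delimiter written by the extraction script
        ROWTERMINATOR   = '\n',  -- one row per line
        FIRSTROW        = 2,     -- skip the header row emitted by the Hive CLI
        TABLOCK
    );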
• Windows batch command: The above-mentioned processes of transferring the data in .CSV files, importing the files into the SQL server and validating the source and target data should all be done sequentially. All validation processes can be automated by creating a Windows batch file. The batch file executes the shell script from the Windows server on the Hadoop environment using the Plink command. In this way, all Hive data is loaded into the SQL server table. The next step is to execute the SQL server procedure to perform the count/primary key/sample data comparison. We use SQLCMD to execute the SQL server procedure from the batch file.
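A simplified batch file covering this sequence might look like the sketch below; the server names, file names and stored procedure name are placeholders rather than the actual implementation.

    @echo off
    rem Hypothetical end-to-end sequence: extract on Hadoop, transfer, load and compare.
    rem 1. Run the Hive extraction shell script on the Hadoop edge node via Plink.
    plink -ssh etl_user@hadoop-edge-node -pw %HADOOP_PWD% "sh /home/etl/extract_hive_tables.sh"
    rem 2. Pull the generated CSV files to the Windows server with the WinSCP script.
    winscp.com /script=getfiles.scp
    rem 3. Bulk-insert the CSV files into the SQL server consolidation database.
    sqlcmd -S SQLSRV01 -d ReconDB -E -i load_hive_csv.sql
    rem 4. Run the comparison procedure (count/primary key/sample data checks).
    sqlcmd -S SQLSRV01 -d ReconDB -E -Q "EXEC dbo.usp_compare_source_target"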
• Jenkins: End-to-end validation processes can be triggered by Jenkins. Jenkins jobs can be scheduled to execute on an hourly/daily/weekly basis without manual intervention. On Jenkins, an auto-scheduled ANT script invokes a Java program that connects to the SQL server and generates an HTML report of the latest records. Jenkins jobs can e-mail the results to the predefined recipients in HTML format.
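The Ant-to-Java hand-off could be expressed roughly as in the sketch below; the target name, Java class and jar locations are assumptions used purely for illustration.

    <!-- Hypothetical Ant build file run by the scheduled Jenkins job. -->
    <project name="recon-report" default="generate-report">
      <target name="generate-report">
        <!-- Invokes a Java program that reads the latest reconciliation records
             from the SQL server and writes them out as an HTML report. -->
        <java classname="com.example.recon.HtmlReportGenerator" fork="true" failonerror="true">
          <classpath>
            <pathelement location="lib/recon-report.jar"/>
            <pathelement location="lib/mssql-jdbc.jar"/>
          </classpath>
          <arg value="--output=reports/latest_recon.html"/>
        </java>
      </target>
    </project>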
Implementation Issues and Resolutions

Organizations may face a number of implementation issues. Figure 5 provides probable resolutions.

Impact of Testing

Figure 6 summarizes the impact of using our framework, compared with manual Excel-based testing, for one customer's CRM application based on Oracle and SQL server databases.

Looking Forward

More and more organizations are using big data tools and techniques to quickly and effectively analyze data for improved customer understanding and product/service delivery. This white paper presents a framework to help organizations more quickly, efficiently and accurately conduct big data migration testing. As your organization moves forward, here are key points to consider before implementing a framework like the one presented in this white paper.

• Think big when it comes to big data testing. Choose an optimum data subset for testing; sampling should be based on geographies, priority customers, customer types, product types and product mix.

• Create an environment to accommodate huge data sets. Cloud setups are recommended.

• Be aware of the Agile/Scrum cadence mismatch. Break up data into smaller incremental blocks as a work-around.

• Get smart about open-source capabilities. Spend a good amount of time up front understanding the tools and techniques that drive success.
Glossary

• Amazon EMR: Amazon Elastic MapReduce is a Web service that makes it easy to quickly and cost-effectively process vast amounts of data.
• Amazon S3: Amazon Simple Storage Service provides developers and IT teams with secure, durable,
highly-scalable object storage.
• Hadoop: Hadoop is an open-source software framework for storing and processing big data in a dis-
tributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks:
massive data storage and faster processing.
• Hive: Hive is a data warehouse infrastructure built on top of Hadoop for providing data summariza-
tion, query and analysis. Amazon maintains a software fork of Apache Hive that is included in Amazon
EMR on AWS.
• PIG scripting: PIG is a high-level platform for creating MapReduce programs used with Hadoop. The
language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java
MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of
SQL for RDBMS systems.
• RDBMS: A relational database management system is a database management system (DBMS) that
is based on the relational model as invented by E.F. Codd, of IBM’s San Jose Research Laboratory.
• Sqoop: Sqoop (SQL-to-Hadoop) is a big data tool that offers the capability to extract data from
non-Hadoop data stores, transform the data into a form usable by Hadoop and then load the data into
HDFS. This process is known as extract, transform and load (ETL).
• WinSCP: Windows Secure Copy is a free and open-source SFTP, SCP and FTP client for Microsoft
Windows. Its main function is to secure file transfer between a local and a remote computer. Beyond
this, WinSCP offers basic file manager and file synchronization functionality.
• Unix shell scripting: A shell script is a computer program designed to be run by the Unix shell, a
command line interpreter.
• T-SQL: Transact-SQL is Microsoft’s and Sybase’s proprietary extension to Structured Query Language
(SQL).
About Cognizant
Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business
process outsourcing services, dedicated to helping the world’s leading companies build stronger busi-
nesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfac-
tion, technology innovation, deep industry and business process expertise, and a global, collaborative
workforce that embodies the future of work. With over 100 development and delivery centers worldwide
and approximately 218,000 employees as of June 30, 2015, Cognizant is a member of the NASDAQ-100,
the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and
fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.
© Copyright 2015, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is
subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.
Codex 1439