
Informatica Data Lake

Management on the AWS Cloud


Quick Start Reference Deployment

January 2018

Informatica Big Data Team


Vinod Shukla – AWS Quick Start Reference Team

Contents
Overview
  Informatica Components
  Costs and Licenses
Architecture
  Informatica Services on AWS
Planning the Data Lake Management Deployment
  Deployment Options
  Prerequisites
Deployment Steps
  Step 1. Prepare Your AWS Account
  Step 2. Upload Your Informatica License
  Step 3. Launch the Quick Start
  Step 4. Monitor the Deployment
  Step 5. Download and Install Informatica Developer
  Manual Cleanup
Troubleshooting
Using Informatica Data Lake Management on AWS
  Transient and Persistent Clusters
  Common AWS Architecture Patterns for Informatica Data Lake Management
  Process Flow
Additional Resources
GitHub Repository
Document Revisions

This Quick Start deployment guide was created by Amazon Web Services (AWS) in
partnership with Informatica.

Quick Starts are automated reference deployments that use AWS CloudFormation
templates to deploy key technologies on AWS, following AWS best practices.

Overview
This Quick Start reference deployment guide provides step-by-step instructions for
deploying the Informatica Data Lake Management solution on the AWS Cloud.

A data lake uses a single, Hadoop-based data repository that you create to manage the
supply and demand of data. Informatica’s solution on the AWS Cloud integrates, organizes,
administers, governs, and secures large volumes of both structured and unstructured data.
The solution delivers actionable, fit-for-purpose, reliable, and secure information for
business insights.

Consider the following key principles when you implement a data lake:
• The data lake must remove barriers to onboarding data of any type and size from any source.
• Data must be easily refined and immediately provisioned for consumption.
• Data must be easy to find, retrieve, and share within the organization.
• Data is a corporate accountable asset, managed collaboratively by data governance, data quality, and data security initiatives.

This Quick Start is for users who want to deploy and develop an Informatica Data Lake
Management solution on the AWS Cloud.


Informatica Components
The Data Lake Management solution uses the following Informatica products:

Informatica Big Data Management enables your organization to process large, diverse,
and fast-changing datasets so you can gain insights into your data. Use Big Data
Management to perform big data integration and transformation without writing or
maintaining Apache Hadoop code: collect diverse data faster, build business logic in a
visual environment, and eliminate hand-coding.

Informatica Enterprise Data Catalog brings together all data assets in an enterprise
and presents a comprehensive view of the data assets and data asset relationships.
Enterprise Data Catalog captures the technical, business, and operational metadata for a
large number of data assets that you use to determine the effectiveness of enterprise data.
From across the enterprise, Enterprise Data Catalog gathers information related to
metadata, including column data statistics, data domains, data object relationships, and
data lineage information. A comprehensive view of enterprise metadata can help you make
critical decisions on data integration, data quality, and data governance in the enterprise.

The Developer tool includes the native and Hadoop run-time environments for optimal
processing. In the native environment, the Data Integration Service processes the data. In
the Hadoop environment, the Data Integration Service pushes the processing to nodes in a
Hadoop cluster.

Costs and Licenses


You are responsible for the cost of the AWS services used while running this Quick Start
reference deployment. There is no additional cost for using the Quick Start.

The AWS CloudFormation template for this Quick Start includes configuration parameters
that you can customize. Some of these settings, such as instance type, will affect the cost of
deployment. See the pricing pages for each AWS service you will be using for cost estimates.

This Quick Start requires a license to deploy the Informatica Data Lake Management
solution, as described in the Prerequisites section. To sign up for a demo license, contact
Informatica.


Architecture
Figure 1 shows the typical components of a generic data lake management solution.

Figure 1: Components of a data lake management solution

The solution includes the following core components, beginning with the lower part of the
diagram in Figure 1:

Big Data Infrastructure: From a connectivity perspective (for example, on-premises,
cloud, IoT, unstructured, semi-structured), the solution reliably accommodates an
expanding volume and variety of data types. The solution has the capacity to scale up (when
you increase individual hardware capacity) or scale out (when you increase infrastructure
capacity linearly for parallel processing), and can be deployed directly into your AWS
environment.

Big Data Storage: The solution can store large amounts of a variety of data (structured,
unstructured, semi-structured) at scale, with performance that guarantees timely
delivery of data to business analysts.

Big Data Processing: The solution can process data at any latency, such as real time,
near real time, and batch, using big data processing frameworks such as Apache Spark.

Metadata Intelligence manages all the metadata from a variety of data sources. For
example, a data catalog manages data generated by big data and by traditional sources. To
do this, it collects, indexes, and applies machine learning to metadata. It also provides
metadata services such as semantic search, automated data domain discovery and tagging,
and data intelligence that can guide user behavior.

Big Data Integration: A data lake architecture must integrate data from various
disparate data sources, at any latency, with the ability to rapidly develop ELT (extract, load,
and transform) or ETL (extract, transform, and load) data flows.

Big Data Governance and Quality are critical to a data lake, especially when dealing
with a variety of data. The purpose of big data governance is to deliver trusted, timely, and
relevant information to support the business outcome.

Big Data Security is the process of minimizing data risk. Activities include discovering,
identifying, classifying, and protecting sensitive data, as well as analyzing its risk based on
value, location, protection, and proliferation.

Finally, Intelligent Data Applications (Self-Service Data Preparation, Enterprise Data
Catalog, and Data Security Intelligence) provide data analysts, data scientists, data
stewards, and data architects with a collaborative self-service platform for data governance
and security that can discover, catalog, and prepare data for big data analytics.

Informatica Services on AWS
Deploying this Quick Start with default parameters builds the Informatica Data Lake
environment illustrated in Figure 2 in the AWS Cloud. The Quick Start deployment
automatically creates the following Informatica elements:
• Domain
• Model Repository Service
• Data Integration Service

In addition, the deployment automatically embeds Hadoop clusters in the virtual private
cloud (VPC) for metadata storage and processing.

The deployment then assigns the connection to the Amazon EMR cluster for the Hadoop
Distributed File System (HDFS) and Hive. It also sets up connections to enable scanning of
Amazon Simple Storage Service (Amazon S3) and Amazon Redshift environments as part of
the data lake.


The Informatica domain and repository database are hosted on Amazon Relational
Database Service (Amazon RDS) using Oracle, which handles management tasks such as
backups, patch management, and replication.

To access Informatica Services on the AWS Cloud, you can install the Informatica client to
run Big Data Management on a Microsoft Windows machine. You can then access
Enterprise Data Catalog by using a web browser.

Figure 2 shows the Informatica Data Lake Management solution deployed on AWS.

Figure 2: Informatica Data Lake Management solution deployed on AWS

The Quick Start sets up a highly available architecture that spans two Availability Zones,
and a VPC configured with public and private subnets according to AWS best practices.
Managed network address translation (NAT) gateways are deployed into the public subnets
and configured with Elastic IP addresses for outbound internet connectivity.


The Quick Start also installs and configures the following Informatica services during the
one-click deployment:
• Informatica domain, which is the fundamental administrative unit of the Informatica platform. The platform has a service-oriented architecture that provides the ability to scale services and share resources across multiple machines.
• Model Repository Service, which manages the model repository, a relational database that stores all the metadata for projects created using Informatica client tools. The model repository also stores run-time and configuration information for applications that are deployed to a Data Integration Service.
• Data Integration Service, which is a compute component within the Informatica domain that manages requests to submit big data integration, big data quality, and profiling jobs to the Hadoop cluster for processing.
• Content Management Service, which manages reference data. It provides reference data information to the Data Integration Service and Informatica Developer.
• Analyst Service, which runs the Analyst tool in the Informatica domain. The Analyst Service manages the connections between the service components and the users who log in to the Analyst tool. You can perform column and rule profiling, manage scorecards, and manage bad records and duplicate records in the Analyst tool.
• Profiling, which helps you find the content, quality, and structure of data sources of an application, schema, or enterprise. A profile is a repository object that finds and analyzes all data irregularities across data sources in the enterprise, and hidden data problems that put data projects at risk. The profiling results include unique values, null values, data domains, and data patterns. When you use this Quick Start, you can run profiling on the Data Integration Service (the default) or on Hadoop.
• Business Glossary, which consists of online glossaries of business terms and policies that define important concepts within an organization. Data stewards create and publish terms that include information such as descriptions, relationships to other terms, and associated categories. Glossaries are stored in a central location for easy lookup by consumers. Glossary assets include business terms, policies, and categories that contain information that consumers might search for. A glossary is a high-level container that stores Glossary assets. A business term defines relevant concepts within the organization, and a policy defines the business purpose that governs practices related to the term. Business terms and policies can be associated with categories, which are descriptive classifications.
• Catalog Service, which runs Enterprise Data Catalog and manages connections between service components and external applications.
• An embedded Hadoop cluster that uses Hortonworks, running HDFS, HBase, YARN, and Solr.
• Informatica Cluster Service, which runs and manages all Hadoop services, the Apache Ambari server, and the Apache Ambari agents on the embedded Hadoop cluster.
• Metadata and Catalog, which include the metadata persistence store, search index, and graph database in an embedded Hadoop cluster. The catalog represents an indexed inventory of all the data assets in the enterprise that you configure in Enterprise Data Catalog. Enterprise Data Catalog organizes all the enterprise metadata in the catalog and enables the users of external applications to discover and understand the data.
The Informatica domain and the Informatica Model Repository databases are configured
on Amazon RDS using Oracle.

Planning the Data Lake Management Deployment

Deployment Options
This Quick Start provides two deployment options:
• Deployment of the Data Lake Management solution into a new VPC (end-to-end deployment). This option builds a new virtual private cloud (VPC) with public and private subnets, and then deploys the Informatica Data Lake Management solution into that infrastructure.
• Deployment of the Data Lake Management solution into an existing VPC. This option provisions data lake components into your existing AWS infrastructure.
The Quick Start provides separate templates for these options. It also lets you configure
CIDR blocks, instance types, and data lake settings, as discussed later in this guide.

Prerequisites
Specialized Knowledge
Before you deploy this Quick Start, we recommend that you become familiar with the
following AWS services:
• Amazon VPC
• Amazon EC2
• Amazon EMR

If you are new to AWS, see the Getting Started Resource Center.


Technical Requirements
Before you deploy this Quick Start, verify the following prerequisites:
• You have an account with AWS, and you know the account login information.
• You have purchased a license for the Informatica Data Lake Management solution. To sign up for a demo license, please contact Informatica, your sales representative, or the consulting partner you're working with.
The license file should have a name like AWSDatalakeLicense.key.

Deployment Steps
Step 1. Prepare Your AWS Account
1. If you don’t already have an AWS account, create one at https://aws.amazon.com by
following the on-screen instructions.
2. Use the region selector in the navigation bar to choose the AWS Region where you want
to deploy the Informatica Data Lake Management solution on AWS.
3. Create a key pair in your preferred region.
When you log in to an Amazon EC2 instance or Amazon EMR cluster, you use a private
key file for authentication. The file has a file name extension of .pem. If you do not
have an existing .pem key to use, follow the instructions in the AWS documentation to
create a key pair.

Note Your administrator might ask you to use a particular existing key pair.

When you create a key pair, you save the .pem file to your desktop system.
Simultaneously, AWS saves the key pair to your account. Make a note of the key pair
that you want to use for the Data Lake Management instance, so that you can provide
the key pair name during network configuration.
4. If necessary, request a service limit increase for the Amazon EC2 M3 and M4 instance
types. You might need to do this if you already have an existing deployment that uses
these instance types, and you think you might exceed the default limit with this
reference deployment.
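The key-pair step can also be scripted. The sketch below is a hedged example, not part of the Quick Start itself: the `datalake-key` name and `us-east-2` region are placeholders, and calling `create_key_pair` assumes the boto3 package and valid AWS credentials. The helper saves the returned private key with the owner-only permissions that SSH clients require.

```python
import os
import stat

def save_private_key(key_name, key_material, directory="."):
    """Write the private key material to <key_name>.pem, readable only by the owner."""
    path = os.path.join(directory, key_name + ".pem")
    with open(path, "w") as f:
        f.write(key_material)
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # mode 600; SSH rejects looser permissions
    return path

def create_key_pair(ec2_client, key_name, directory="."):
    """Create an EC2 key pair and save its .pem file locally.

    ec2_client is a boto3 EC2 client, e.g. boto3.client("ec2", region_name="us-east-2").
    AWS keeps the public half of the pair in your account under the same name.
    """
    response = ec2_client.create_key_pair(KeyName=key_name)
    return save_private_key(key_name, response["KeyMaterial"], directory)
```

You would then pass the key pair name (here the assumed `datalake-key`) as the Key Pair Name parameter when you launch the template.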

Step 2. Upload Your Informatica License
Upload the license for the Informatica Data Lake Management solution to an S3 bucket,
following the instructions in the Amazon S3 documentation. You will be prompted for the
bucket name during deployment.


To sign up for a demo license, please contact Informatica, your sales representative, or the
consulting partner you’re working with.
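The upload can also be done programmatically. This is a hedged sketch rather than the Quick Start's own tooling: the bucket name is something you supply, and it assumes boto3 with valid credentials. The helper enforces the constraint (noted later in the parameter tables) that the license key file must sit at the top level of the bucket, not in a subfolder.

```python
import os

def license_object_key(local_path):
    """Derive the S3 object key from a local file path.

    Using only the file name keeps the object at the top level of the bucket,
    which is where the Quick Start expects to find the license key.
    """
    key = os.path.basename(local_path)
    if not key.endswith(".key"):
        raise ValueError("expected a .key license file, e.g. AWSDatalakeLicense.key")
    return key

def upload_license(s3_client, bucket, local_path):
    """Upload the Informatica license file to the top level of the given bucket.

    s3_client is a boto3 S3 client, e.g. boto3.client("s3").
    """
    key = license_object_key(local_path)
    s3_client.upload_file(local_path, bucket, key)
    return key
```

Make a note of the bucket name you used; the template prompts for it as the License Key Location parameter.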

Step 3. Launch the Quick Start


Note You are responsible for the cost of the AWS services used while running this
Quick Start reference deployment. There is no additional cost for using this Quick
Start. For full details, see the pricing pages for each AWS service you will be using in
this Quick Start. Prices are subject to change.

1. Choose one of the following options to launch the AWS CloudFormation template into
your AWS account. For help choosing an option, see deployment options earlier in this
guide.

Option 1: Deploy data lake into a new VPC on AWS
Option 2: Deploy data lake into an existing VPC on AWS

Important If you’re deploying Informatica Data Lake Management into an
existing VPC, make sure that your VPC has two private and two public subnets in
different Availability Zones for the database instances. These subnets require NAT
gateways or NAT instances in their route tables, to allow the instances to download
packages and software without exposing them to the Internet. You’ll also need the
domain name option configured in the DHCP options as explained in the Amazon
VPC documentation. You’ll be prompted for your VPC settings when you launch the
Quick Start.

Each deployment takes about two hours to complete.


2. Check the region that’s displayed in the upper-right corner of the navigation bar, and
change it if necessary. This is where the network infrastructure for Informatica Data
Lake Management will be built. The template is launched in the US East (Ohio) Region
by default.
3. On the Select Template page, keep the default setting for the template URL, and then
choose Next.


4. On the Specify Details page, change the stack name if needed. Review the parameters
for the template. Provide values for the parameters that require input. For all other
parameters, review the default settings and customize them as necessary. When you
finish reviewing and customizing the parameters, choose Next.
In the following tables, parameters are listed by category and described separately for
the two deployment options:
– Parameters for deploying Informatica components into a new VPC
– Parameters for deploying Informatica components into an existing VPC

Note The templates for the two scenarios share most, but not all, of the same
parameters. For example, the template for an existing VPC prompts you for the VPC
and subnet IDs in your existing VPC environment. You can also download the
templates and edit them to create your own parameters based on your specific
deployment scenario.
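If you prefer the command line to the console, the same template can be launched with a short script. This is a hedged sketch, not part of the Quick Start: the stack name, template URL, and parameter values are placeholders, and it assumes the boto3 package with valid AWS credentials. The parameter keys come from the tables that follow.

```python
def to_cfn_parameters(values):
    """Convert a plain dict into the ParameterKey/ParameterValue list CloudFormation expects."""
    return [{"ParameterKey": k, "ParameterValue": str(v)} for k, v in sorted(values.items())]

def launch_quick_start(cfn_client, stack_name, template_url, values):
    """Launch the Quick Start template non-interactively.

    cfn_client is a boto3 CloudFormation client, e.g. boto3.client("cloudformation").
    CAPABILITY_IAM is an assumption here; Quick Starts commonly create IAM resources.
    """
    return cfn_client.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=to_cfn_parameters(values),
        Capabilities=["CAPABILITY_IAM"],
    )
```

Because the deployment takes about two hours, a follow-up call such as `cfn_client.get_waiter("stack_create_complete").wait(StackName=stack_name)` needs its default polling window widened via the waiter's `WaiterConfig`.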

Option 1: Parameters for deploying into a new VPC

View template
Network Configuration:

• Availability Zones (AvailabilityZones). Default: Requires input. The two Availability Zones that will be used to deploy Informatica Data Lake Management components. The Quick Start preserves the logical order you specify.
• VPC CIDR (VPCCIDR). Default: 10.0.0.0/16. The CIDR block for the VPC.
• Private Subnet 1 CIDR (PrivateSubnet1CIDR). Default: 10.0.0.0/19. The CIDR block for the private subnet located in Availability Zone 1.
• Private Subnet 2 CIDR (PrivateSubnet2CIDR). Default: 10.0.32.0/19. The CIDR block for the private subnet located in Availability Zone 2.
• Public Subnet 1 CIDR (PublicSubnet1CIDR). Default: 10.0.128.0/20. The CIDR block for the public (DMZ) subnet located in Availability Zone 1.
• Public Subnet 2 CIDR (PublicSubnet2CIDR). Default: 10.0.144.0/20. The CIDR block for the public (DMZ) subnet located in Availability Zone 2.
• IP Address Range (RemoteAccessCIDR). Default: Requires input. The CIDR IP range that is permitted to access the Informatica domain and the Amazon EMR cluster. We recommend that you use a constrained CIDR range to reduce the potential of inbound attacks from unknown IP addresses. For example, to permit the eight addresses 10.20.30.40 through 10.20.30.47, enter 10.20.30.40/29.
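If you customize these CIDR values, you can sanity-check them before launching. The sketch below uses Python's standard `ipaddress` module (the addresses are the defaults from the table above) to confirm that the four subnets nest inside the VPC block without overlapping, and to show how many addresses a constrained /29 remote-access range actually covers.

```python
import ipaddress
from itertools import combinations

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = [ipaddress.ip_network(c) for c in
           ("10.0.0.0/19", "10.0.32.0/19", "10.0.128.0/20", "10.0.144.0/20")]

# Every subnet must fall inside the VPC CIDR...
assert all(s.subnet_of(vpc) for s in subnets)
# ...and no two subnets may overlap.
assert not any(a.overlaps(b) for a, b in combinations(subnets, 2))

# A constrained remote-access range: a /29 spans exactly eight addresses.
remote = ipaddress.ip_network("10.20.30.40/29")
print(remote[0], "-", remote[-1], remote.num_addresses)  # 10.20.30.40 - 10.20.30.47 8
```

The same check applies to the existing-VPC template's IPAddressRange parameter later in this guide.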

Amazon EC2 Configuration:

• Informatica Embedded Cluster Size (ICSClusterSize). Default: Small. The size of the Informatica embedded cluster. Choose from the following: Small (c4.8xlarge, single node), Medium (c4.8xlarge, three nodes), or Large (c4.8xlarge, six nodes).
• Informatica Domain Instance Type (InformaticaServerInstanceType). Default: c4.4xlarge. The EC2 instance type for the instance that hosts the Informatica domain. The two options are c4.4xlarge and c4.8xlarge.
• Key Pair Name (KeyPairName). Default: Requires input. A public/private key pair, which allows you to connect securely to your instance after it launches. This is the key pair that you created in your preferred region when you set up your AWS account.

Amazon EMR Configuration:

• EMR Cluster Name (EMRClusterName). Default: Requires input. The name of the Amazon EMR cluster where the Data Lake Management instance will be deployed.
• EMR Core Instance Type (EMRCoreInstanceType). Default: m4.xlarge. The instance type for Amazon EMR core nodes.
• EMR Core Nodes (EMRCoreNodes). Default: Requires input. The number of core nodes. Enter a value between 1 and 500.
• EMR Master Instance Type (EMRMasterInstanceType). Default: m4.xlarge. The instance type for the Amazon EMR master node.
• EMR Logs Bucket Name (EMRLogBucket). Default: Requires input. The S3 bucket where the Amazon EMR logs will be stored.


Amazon RDS Configuration:

• Informatica Database Username (DBUser). Default: awsquickstart. The user name for the database instance associated with the Informatica domain and services (such as Model Repository Service, Data Integration Service, and Content Management Service). The user name is an 8-18 character string.
• Informatica Database Instance Password (DBPassword). Default: Requires input. The password for the database instance associated with the Informatica domain and services. The password is an 8-18 character string.

Amazon Redshift Configuration:

• Redshift Cluster Type (RedshiftClusterType). Default: single-node. The type of cluster. You can specify single-node or multi-node. If you specify multi-node, use the Redshift Number of Nodes parameter to specify how many nodes you would like to provision in your cluster.
• Redshift Database Name (RedshiftDatabaseName). Default: dev. The name of the first database to create when the cluster is created.
• Redshift Database Port (RedshiftDatabasePort). Default: 5439. The port number on which the cluster accepts incoming connections.
• Redshift Number of Nodes (RedshiftNumberOfNodes). Default: 1. The number of compute nodes in the cluster. For multi-node clusters, this parameter must be greater than 1.
• Redshift Node Type (RedshiftNodeType). Default: ds2.xlarge. The compute, memory, storage, and I/O capacity of the cluster's nodes. For node size specifications, see the Amazon Redshift documentation.
• Redshift Username (RedshiftUsername). Default: defaultuser. The user name that is associated with the master user account for the cluster that is being created.
• Redshift Password (RedshiftPassword). Default: Requires input. The password that is associated with the master user account for the cluster that is being created. The password must be an 8-64 character string that consists of at least one uppercase letter, one lowercase letter, and one number.


Informatica Enterprise Catalog and BDM Configuration:

• Informatica Administrator Username (InformaticaAdminUser). Default: Requires input. The administrator user name for accessing Big Data Management. You can specify any string. Make a note of the user name and password, and use them later to log in to the Administrator tool to configure the Informatica domain.
• Informatica Administrator Password (InformaticaAdminPassword). Default: Requires input. The administrator password for accessing Big Data Management. You can specify any string. Make a note of the user name and password, and use them later to log in to the Administrator tool to configure the Informatica domain.
• License Key Location (InformaticaKeyS3Bucket). Default: Requires input. The name of the S3 bucket in your account that contains the Informatica license key.
• License Key Name (InformaticaKeyName). Default: Requires input. The Informatica license key name; for example, INFALicense_10_2.key. Note: The key file must be in the top level of the S3 bucket and not in a subfolder.
• Import Sample Content (ImportSampleData). Default: No. Select Yes to import sample catalog data. You can use the sample data to get started with the product.

AWS Quick Start Configuration:

Informatica recommends that you do not change the default values for the
parameters in this category.

• Quick Start S3 Bucket Name (QSS3BucketName). Default: quickstart-reference. The S3 bucket name for the Quick Start assets. This bucket name can include numbers, lowercase letters, uppercase letters, and hyphens (-), but should not start or end with a hyphen. You can specify your own bucket if you copy all of the assets and submodules into it, if you want to customize the templates and override the Quick Start behavior for your specific implementation.
• Quick Start S3 Key Prefix (QSS3KeyPrefix). Default: informatica/datalake/latest/. The S3 key name prefix for your copy of the Quick Start assets. This prefix can include numbers, lowercase letters, uppercase letters, hyphens (-), and forward slashes (/). This parameter enables you to customize or extend the Quick Start for your specific implementation.


Option 2: Parameters for deploying into an existing VPC

View template

Network Configuration:

• VPC (VPCID). Default: Requires input. The ID of your existing VPC where you want to deploy the Informatica Data Lake Management solution (for example, vpc-0343606e). The VPC must meet the following requirements: it must be set up with public access through the internet via an attached internet gateway; the DNS Resolution property of the VPC must be set to Yes; and the Edit DNS Hostnames property of the VPC must be set to Yes.
• Informatica Domain Subnet (InformaticaServerSubnetID). Default: Requires input. A publicly accessible subnet ID where the Informatica domain will reside. Select one of the available subnets listed.
• Informatica Database Subnets (DBSubnetIDs). Default: Requires input. The IDs of two private subnets in the selected VPC. Note: These subnets must be in different Availability Zones in the selected VPC.
• IP Address Range (IPAddressRange). Default: Requires input. The CIDR IP range that is permitted to access the Informatica domain and the Informatica embedded cluster. We recommend that you use a constrained CIDR range to reduce the potential of inbound attacks from unknown IP addresses. For example, to permit the eight addresses 10.20.30.40 through 10.20.30.47, enter 10.20.30.40/29.

Amazon EC2 Configuration:

• Key Pair Name (KeyPairName). Default: Requires input. A public/private key pair, which allows you to connect securely to your instance after it launches. This is the key pair that you created in your preferred region when you set up your AWS account.
• Informatica Domain Instance Type (InformaticaServerInstanceType). Default: c4.4xlarge. The EC2 instance type for the instance that hosts the Informatica domain. The two options are c4.4xlarge and c4.8xlarge.
• Informatica Embedded Cluster Size (ICSClusterSize). Default: Small. The size of the Informatica embedded cluster. Choose from the following: Small (c4.8xlarge, single node), Medium (c4.8xlarge, three nodes), or Large (c4.8xlarge, six nodes).

Amazon EMR Configuration:

• EMR Master Instance Type (EMRMasterInstanceType). Default: m4.xlarge. The instance type for the Amazon EMR master node.
• EMR Core Instance Type (EMRCoreInstanceType). Default: m4.xlarge. The instance type for Amazon EMR core nodes.
• EMR Cluster Name (EMRClusterName). Default: Requires input. The name of the Amazon EMR cluster where the Data Lake Management instance will be deployed.
• EMR Core Nodes (EMRCoreNodes). Default: Requires input. The number of core nodes. Enter a value between 1 and 500.
• EMR Logs Bucket Name (EMRLogBucket). Default: Requires input. The S3 bucket where the Amazon EMR logs will be stored.

Amazon RDS Configuration:

• Informatica Database Username (DBUser). Default: awsquickstart. The user name for the database instance associated with the Informatica domain and services (such as Model Repository Service, Data Integration Service, and Content Management Service). The user name is an 8-18 character string.
• Informatica Database Instance Password (DBPassword). Default: Requires input. The password for the database instance associated with the Informatica domain and services. The password is an 8-18 character string.

Amazon Redshift Configuration:

• Redshift Database Name (RedshiftDatabaseName). Default: dev. The name of the first database to create when the cluster is created.
• Redshift Cluster Type (RedshiftClusterType). Default: single-node. The type of cluster. You can specify single-node or multi-node. If you specify multi-node, use the Redshift Number of Nodes parameter to specify how many nodes you would like to provision in your cluster.
• Redshift Number of Nodes (RedshiftNumberOfNodes). Default: 1. The number of compute nodes in the cluster. For multi-node clusters, this parameter must be greater than 1.
• Redshift Node Type (RedshiftNodeType). Default: ds2.xlarge. The compute, memory, storage, and I/O capacity of the cluster's nodes. For node size specifications, see the Amazon Redshift documentation.
• Redshift Username (RedshiftUsername). Default: defaultuser. The user name that is associated with the master user account for the cluster that is being created.
• Redshift Password (RedshiftPassword). Default: Requires input. The password that is associated with the master user account for the cluster that is being created. The password must be an 8-64 character string that consists of at least one uppercase letter, one lowercase letter, and one number.
• Redshift Database Port (RedshiftDatabasePort). Default: 5439. The port number on which the cluster accepts incoming connections.

Informatica Enterprise Catalog and BDM Configuration:

Informatica Administrator Username (InformaticaAdminUsername)
  Default: Requires input
  The administrator user name for accessing Big Data Management. You can specify any string. Make a note of the user name and password, and use them later to log in to the Administrator tool to configure the Informatica domain.

Informatica Administrator Password (InformaticaAdminPassword)
  Default: Requires input
  The administrator password for accessing Big Data Management. You can specify any string.

License Key Location (InformaticaKeyS3Bucket)
  Default: Requires input
  The name of the S3 bucket in your account that contains the Informatica license key.

License Key Name (InformaticaKeyName)
  Default: Requires input
  The Informatica license key name; for example, INFALicense_10_2.key.
  Note: The key file must be in the top level of the S3 bucket, not in a subfolder.

Import Sample Content (ImportSampleData)
  Default: No
  Select Yes to import sample catalog data. You can use the sample data to get started with the product.

AWS Quick Start Configuration:

Informatica recommends that you do not change the default values for the parameters in this category.

Quick Start S3 Bucket Name (QSS3BucketName)
  Default: quickstart-reference
  The S3 bucket name for the Quick Start assets. This bucket name can include numbers, lowercase letters, uppercase letters, and hyphens (-), but should not start or end with a hyphen. If you want to customize the templates and override the Quick Start behavior for your specific implementation, you can specify your own bucket, provided that you copy all of the assets and submodules into it.

Quick Start S3 Key Prefix (QSS3KeyPrefix)
  Default: informatica/datalake/latest/
  The S3 key name prefix for your copy of the Quick Start assets. This prefix can include numbers, lowercase letters, uppercase letters, hyphens (-), and forward slashes (/). This parameter enables you to customize or extend the Quick Start for your specific implementation.

When you finish reviewing and customizing the parameters, choose Next.

5. On the Options page, you can specify tags (key-value pairs) for resources in your stack and set advanced options. When you’re done, choose Next.
6. On the Review page, review and confirm the template settings. Under Capabilities, select the check box to acknowledge that the template will create IAM resources.
7. Choose Create to deploy the stack.

Step 4. Monitor the Deployment

During deployment, you can monitor the creation of the cluster instance and the Informatica domain, and get more information about system resources.

1. Choose the stack that you are creating, and then choose the Events tab to monitor the creation of the stack. Figure 3 shows part of the Events tab.

Figure 3: Monitoring the deployment in the Events tab

When stack creation is complete, the Status field shows CREATE_COMPLETE, and
the Outputs tab displays a list of stacks that have been created, as shown in
Figure 4.

Figure 4: Stack creation complete
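If you prefer the command line, the same monitoring can be scripted with the AWS CLI. This is a sketch, not part of the Quick Start: it assumes the CLI is configured with credentials for your account, and the stack name in the example is hypothetical. Setting AWS=echo prints the commands instead of running them.

```shell
AWS=${AWS:-aws}   # set AWS=echo to preview the commands without running them

# Block until the stack reaches CREATE_COMPLETE, then dump its event
# history in the same form the console Events tab shows.
monitor_stack() {
  local stack_name="$1"
  $AWS cloudformation wait stack-create-complete --stack-name "$stack_name"
  $AWS cloudformation describe-stack-events --stack-name "$stack_name" \
    --query 'StackEvents[].[Timestamp,ResourceStatus,LogicalResourceId]' \
    --output table
}

# Example (hypothetical stack name):
# monitor_stack Informatica-DataLake
```

The wait command polls until the stack finishes creating and returns a nonzero exit code if creation fails, so it is convenient in scripts that should stop on a failed deployment.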

2. Choose the Resources tab.

This tab displays information about the stack and the Data Lake instance. You can select the linked physical ID properties of individual resources to get more information about them, as shown in Figure 5.

Figure 5: Resources tab

3. Choose the Outputs tab.

When the Informatica domain setup is complete, the Outputs tab displays the following information:

RedShiftIamRole: Amazon Resource Name (ARN) for the Amazon Redshift IAM role
EICCatalogURL: URL for the Informatica EIC user console
InstanceID: Informatica domain host name
InformaticaAdminConsoleURL: URL for the Informatica Administrator console
EtcHostFileEntry: Host file entry to add to the /etc/hosts file to enable access to the domain by using the host name of the Administrative Server
EICAdminURL: URL for the EIC Administrator
EMRResourceManagerURL: URL for the Amazon EMR Resource Manager
RedShiftClusterEndpoint: Amazon Redshift cluster endpoint
CloudFormationLogs: Location of the AWS CloudFormation installation log
S3DatalakeBucketName: Name of the S3 bucket used for the data lake
InstanceSetupLogs: Location of the setup log for the Informatica domain EC2 instance
InformaticaHadoopInstallLogs: Location of the master node Hadoop installation log
InformaticaDomainDatabaseEndPoint: Informatica domain database endpoint
InformaticaAdminConsoleServerLogs: Location of the Informatica domain installation log
InformaticaHadoopClusterURL: URL to the IHS Hadoop gateway node
InformaticaBDMDeveloperClient: Location where you can download the Informatica Developer tool (see step 5)

Note If the Outputs tab is not populated with this information, wait for domain
setup to be complete.
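Once the outputs are available, you can also read them with the AWS CLI instead of the console, and apply the EtcHostFileEntry value with a small script. This is a sketch: the stack name and host entry in the examples are hypothetical, and the hosts-file path is a parameter so you can try the function against a scratch copy before editing /etc/hosts (which requires root).

```shell
AWS=${AWS:-aws}   # set AWS=echo to preview instead of calling AWS

# Print the value of one stack output, e.g. EtcHostFileEntry or EICCatalogURL.
get_output() {
  local stack_name="$1" key="$2"
  $AWS cloudformation describe-stacks --stack-name "$stack_name" \
    --query "Stacks[0].Outputs[?OutputKey=='$key'].OutputValue" \
    --output text
}

# Append a host entry to a hosts file unless an identical line already exists.
# Defaults to /etc/hosts; pass a different path to test safely.
add_hosts_entry() {
  local entry="$1" hosts_file="${2:-/etc/hosts}"
  grep -qxF "$entry" "$hosts_file" || printf '%s\n' "$entry" >> "$hosts_file"
}

# Example (hypothetical stack name):
# add_hosts_entry "$(get_output Informatica-DataLake EtcHostFileEntry)"
```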

4. Use the links in the Outputs tab to access Informatica management tools. For example:

InformaticaAdminConsoleURL: Open the Instance Administration screen. You can use this screen to manage Informatica services and resources, and to get additional information about the instance, such as the public DNS and public IP address.
EICAdminURL: Administer the Enterprise Data Catalog environment.
EICCatalogURL: Access Enterprise Data Catalog. See the Informatica Enterprise Data Catalog User Guide for information about logging in to Enterprise Data Catalog.

Step 5. Download and Install Informatica Developer

Informatica Developer (the Developer tool) is an application that you use to design and implement data integration, data quality, data profiling, data services, and big data solutions. You can use the Developer tool to import metadata, create connections, and create data objects. You can also use the Developer tool to create and run profiles, mappings, and workflows.

1. Log in to the AWS CloudFormation console at https://console.aws.amazon.com/cloudformation/.
2. Choose the Outputs tab.
3. Right-click the value of the InformaticaBDMDeveloperClient key to download the Developer tool client installer.
4. Uncompress and launch the installer to install the Developer tool on a local drive.

Manual Cleanup
If you deploy the Quick Start for a new VPC, Amazon EMR creates security groups that are
not deleted when you delete the Amazon EMR cluster. To clean up after deployment, follow
these steps:
1. Delete the Amazon EMR cluster.
2. Delete the Amazon EMR-managed security groups (ElasticMapReduce-master and ElasticMapReduce-slave): first revoke the rules in each group that reference the other group (the circular dependency that blocks deletion), and then delete the security groups themselves.
3. Delete the AWS CloudFormation stack.
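The revoke-then-delete sequence in step 2 can be scripted with the AWS CLI. This is a sketch, not part of the Quick Start: the group IDs are placeholders you must look up first, and the exact rules on the EMR-managed groups can vary by deployment, so inspect each group (aws ec2 describe-security-groups) before revoking. Set AWS=echo to preview the commands.

```shell
AWS=${AWS:-aws}   # set AWS=echo to preview the commands without running them

# The two EMR-managed groups reference each other, which is why a plain
# delete fails. Revoke the cross-references first, then delete the groups.
delete_emr_security_groups() {
  local master_sg="$1" slave_sg="$2"
  $AWS ec2 revoke-security-group-ingress --group-id "$master_sg" \
    --protocol all --source-group "$slave_sg"
  $AWS ec2 revoke-security-group-ingress --group-id "$slave_sg" \
    --protocol all --source-group "$master_sg"
  $AWS ec2 delete-security-group --group-id "$master_sg"
  $AWS ec2 delete-security-group --group-id "$slave_sg"
}

# Example (placeholder group IDs; find the real ones in the EC2 console):
# delete_emr_security_groups sg-0123abcd sg-0456efgh
```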

Troubleshooting
Q. I encountered a CREATE_FAILED error when I launched the Quick Start.
A. If you encounter this error in the AWS CloudFormation console, we recommend that you relaunch the template with Rollback on failure set to No. (This setting is under Advanced in the AWS CloudFormation console, Options page.) With this setting, the stack’s state is retained and the instance is left running, so you can troubleshoot the issue. For the locations of the relevant log files, see the CloudFormationLogs, InstanceSetupLogs, and InformaticaAdminConsoleServerLogs keys in the stack’s Outputs tab.

Important When you set Rollback on failure to No, you’ll continue to incur AWS charges for this stack. Please make sure to delete the stack when you’ve finished troubleshooting.

For additional information, see Troubleshooting AWS CloudFormation on the AWS website.

Q. I encountered an error while installing the Informatica domain and services.

A. We recommend that you view the /installation.log log file to get more information about the errors you encountered.

Q. I encountered a size limitation error when I deployed the AWS CloudFormation templates.

A. We recommend that you launch the Quick Start templates from the location we’ve provided or from another S3 bucket. If you deploy the templates from a local copy on your computer or from a non-S3 location, you might encounter template size limitations when you create the stack. For more information about AWS CloudFormation limits, see the AWS documentation.
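If you maintain a customized copy of the templates, the S3 round trip can be scripted. This sketch assumes a configured AWS CLI; the bucket, file, and stack names are placeholders, and CAPABILITY_IAM is passed because the template creates IAM resources. Set AWS=echo to preview the commands.

```shell
AWS=${AWS:-aws}   # set AWS=echo to preview the commands without running them

# Upload a customized template to S3 and launch it by URL, which avoids the
# smaller size limit that applies to templates passed inline from disk.
launch_template_from_s3() {
  local template_file="$1" bucket="$2" stack_name="$3"
  local key
  key=$(basename "$template_file")
  $AWS s3 cp "$template_file" "s3://$bucket/$key"
  $AWS cloudformation create-stack --stack-name "$stack_name" \
    --template-url "https://$bucket.s3.amazonaws.com/$key" \
    --capabilities CAPABILITY_IAM
}

# Example (placeholder names):
# launch_template_from_s3 my-custom-template.json my-template-bucket informatica-datalake
```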

Using Informatica Data Lake Management on AWS

After you deploy this Quick Start, you can apply any of the patterns described in this section to work with the Informatica Data Lake Management solution on AWS.

Transient and Persistent Clusters

Amazon EMR provides two ways to configure a cluster: transient and persistent.
Transient clusters are shut down when the jobs are complete. For example, if a batch-
processing job pulls web logs from Amazon S3 and processes the data once a day, it is more
cost-effective to use transient clusters to process web log data and shut down the nodes
when the processing is complete. Persistent clusters continue to run after data processing is
complete. The Informatica Data Lake Management solution supports both cluster types.
For more information, see the Amazon EMR best practices whitepaper.

This Quick Start sets up a persistent EMR cluster with a configurable number of core nodes,
as defined by the EMRCoreNodes parameter.
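For comparison, a transient cluster for a job like the daily web-log batch can be created from the AWS CLI with --auto-terminate, so the cluster shuts itself down when its last step finishes. This sketch is not part of the Quick Start; the release label, application list, cluster name, and instance types are assumptions. Set AWS=echo to preview the command.

```shell
AWS=${AWS:-aws}   # set AWS=echo to preview the command without running it

# A transient cluster terminates when its last step finishes
# (--auto-terminate); drop that flag for a persistent cluster like the
# one this Quick Start creates.
create_transient_cluster() {
  local log_bucket="$1" core_nodes="$2"
  $AWS emr create-cluster \
    --name "transient-web-log-batch" \
    --release-label emr-5.8.0 \
    --applications Name=Hadoop Name=Spark \
    --instance-groups \
      InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge \
      InstanceGroupType=CORE,InstanceCount="$core_nodes",InstanceType=m4.xlarge \
    --log-uri "s3://$log_bucket/emr-logs/" \
    --use-default-roles \
    --auto-terminate
}

# Example (placeholder bucket):
# create_transient_cluster my-emr-logs-bucket 3
```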

Common AWS Architecture Patterns for Informatica Data Lake Management

Informatica Data Lake Management supports the following patterns that leverage AWS for big data processing.

Pattern 1: Using Amazon S3

In this first pattern, data is loaded to Amazon S3 using Informatica. For data processing,
the Informatica Big Data Management mapping logic pulls data from Amazon S3 and sends
it for processing to Amazon EMR.

Amazon EMR does not copy the data to the local disk or HDFS. Instead, the mappings open
multithreaded HTTP connections to Amazon S3, pull data to the Amazon EMR cluster, and
process data in streams, as illustrated in Figure 6.

Figure 6: Pattern 1 using Amazon S3

Pattern 2: Using HDFS and Amazon S3 as Backup Storage

In this pattern, Informatica writes data directly to HDFS and leverages the Amazon EMR
task nodes to process the data and periodically copy data to Amazon S3 as the backup
storage, as illustrated in Figure 7.

Compared with Pattern 1, keeping the data in HDFS can improve processing performance, but the trade-off is durability. Because Amazon EMR uses ephemeral disks to store data, data could be lost if an EC2 instance in the Amazon EMR cluster fails. HDFS replicates data within the Amazon EMR cluster and can usually recover from node failures. However, data loss could still occur if the number of lost nodes is greater than your replication factor. Informatica recommends that you back up HDFS data to Amazon S3 periodically.
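One way to implement that periodic backup is S3DistCp, which ships with Amazon EMR and copies HDFS data to Amazon S3 in parallel. The sketch below submits it as an EMR step through the AWS CLI; the cluster ID and paths are placeholders. Set AWS=echo to preview the command.

```shell
AWS=${AWS:-aws}   # set AWS=echo to preview the command without running it

# Submit an S3DistCp step that copies an HDFS directory to S3 in parallel.
# Run this on a schedule to keep an S3 copy of data whose primary home is
# the cluster's HDFS.
backup_hdfs_to_s3() {
  local cluster_id="$1" hdfs_src="$2" s3_dest="$3"
  $AWS emr add-steps --cluster-id "$cluster_id" --steps \
    "Type=CUSTOM_JAR,Name=HdfsBackup,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[s3-dist-cp,--src,$hdfs_src,--dest,$s3_dest]"
}

# Example (placeholder values):
# backup_hdfs_to_s3 j-1ABC2DEF3GHIJ hdfs:///user/informatica/data s3://my-datalake/hdfs-backup/
```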

Figure 7: Pattern 2 using HDFS and Amazon S3 as backup

Pattern 3: Using Amazon Kinesis and Kinesis Firehose for Real-Time and
Streaming Analytics
In the third pattern, unbounded event streams that are continuously generated by devices, IoT applications, and cloud applications are ingested into Amazon Kinesis in real time by using Informatica Edge Data Streaming. With Informatica Big Data Streaming, which leverages the existing Informatica platform, you can build streaming pipelines from prebuilt transformations, connectors, and parsers. These elements are optimized to execute on an Amazon EMR cluster in streaming mode by using Spark Streaming. They can consume data records from an Amazon Kinesis stream and act as a producer that writes data to a defined Amazon Kinesis Firehose delivery stream. Data can be persisted to Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service (Amazon ES) and delivered as JSON and binary payloads.

For more information about deploying Informatica Big Data Streaming on AWS, please
contact Informatica or your implementation partner.

Figure 8 shows the Informatica Big Data Streaming architecture.

Figure 8: Pattern 3 using the Informatica Big Data Streaming architecture

Pattern 4: Using AWS for Self-Service Data Discovery and Preparation

In the last pattern, Informatica Enterprise Data Lake provides data analysts with a
collaborative, self-service, big data discovery and preparation solution. Analysts can rapidly
discover and turn raw data into insights, with quality and governance powered by data
intelligence deployed on AWS.

When deployed on AWS, Informatica Enterprise Data Lake leverages the existing
Informatica platform, which allows analysts to discover, search, and explore data assets for
analysis using an AI-driven data catalog. The Data Lake Management solution makes
recommendations based on the behavior and shared knowledge of the data assets used for
analysis. Once analysts find the relevant data, they can blend, transform, cleanse, and
enrich data by using a Microsoft Excel-like data preparation interface, at scale on an
Amazon EMR cluster.

Data is prepared, published, and made available for consumption in the data lake. An
analyst can assess the prepared data using ad-hoc queries to generate charts, tables, and
other visual formats. IT can operationalize the ad-hoc data preparation work done by analysts by converting it into Informatica big data mappings, which run in batch on an Amazon EMR cluster.

You can deploy Informatica Enterprise Data Lake on the same AWS infrastructure that
supports Informatica Big Data Management and Informatica Enterprise Data Catalog.

Figure 9 shows the data flows for Informatica Enterprise Data Lake.

Figure 9: Data flows used in pattern 4

Process Flow
Figure 10 shows the process flow for using the Informatica Data Lake Management solution
on AWS. It illustrates the data flow process using the Informatica Data Lake Management
solution and Amazon EMR, Amazon S3, and Amazon Redshift.

Figure 10: Informatica Data Lake Management Solution process flow using Amazon EMR

The numbers in Figure 10 refer to the following steps:

Step 1: Collect and move data from on-premises systems into Amazon S3 storage. Consider
offloading infrequently used data, and batch-load raw data to a defined landing zone in
Amazon S3.

Step 2: Collect cloud application and streaming data generated by machines and sensors in
Amazon S3 storage instead of staging it in a temporary file system or a data warehouse.

Step 3: Discover and profile data stored in Amazon S3, using Amazon EMR as the
processing infrastructure. Profile data to better understand its structure and context. Parse
raw data, either in multi-structured or unstructured formats, to extract features and
entities, and cleanse data with data quality tasks. To prepare data for analysis, you can execute prebuilt transformations and data quality rules natively on Amazon EMR.

Step 4: Match duplicate data within and across big data sources and link them to create a
single view.

Step 5: Perform data masking to protect confidential data such as credit card information,
social security numbers, names, addresses, and phone numbers from unintended exposure
to reduce the risk of data breaches. Data masking helps IT organizations manage the access
to their most sensitive data, providing enterprise-wide scalability, robustness, and
connectivity to a vast array of databases.

Step 6: Data analysts and data scientists can prepare and collaborate on data for analytics
by incorporating semantic search, data discovery, and intuitive data preparation tools for
interactive analysis with trusted, secure, and governed data assets.

Step 7: After cleansing and transforming data on Amazon EMR, move high-value curated
data back to Amazon S3 or to Amazon Redshift. From there, users can directly access data
with BI reports and applications.
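For the Amazon Redshift side of this step, the usual bulk path is a COPY from S3. The sketch below issues one through psql, using the cluster defaults from this Quick Start (port 5439, database dev, user defaultuser); the table name, S3 prefix, and IAM role ARN are placeholders. For the role, use the RedShiftIamRole value from the stack's Outputs tab; it must grant the cluster read access to the bucket. Set PSQL=echo to preview the command.

```shell
PSQL=${PSQL:-psql}   # set PSQL=echo to preview the command without connecting

# Issue a Redshift COPY that bulk-loads curated CSV files from S3.
# The target table (curated_sales) is a placeholder, not a Quick Start object.
load_curated_data() {
  local endpoint="$1" s3_prefix="$2" iam_role_arn="$3"
  $PSQL "host=$endpoint port=5439 dbname=dev user=defaultuser" -c \
    "COPY curated_sales FROM '$s3_prefix' IAM_ROLE '$iam_role_arn' FORMAT AS CSV;"
}

# Example (placeholder values):
# load_curated_data example.abc123.us-west-2.redshift.amazonaws.com \
#   s3://my-datalake/curated/sales/ arn:aws:iam::111122223333:role/RedshiftS3Role
```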

Additional Resources
AWS services
• AWS CloudFormation
  http://aws.amazon.com/documentation/cloudformation/
• Amazon EBS
  http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html
• Amazon EC2
  http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/
• Amazon EMR
  https://aws.amazon.com/documentation/emr/
• Amazon Redshift
  https://aws.amazon.com/documentation/redshift/
• Amazon S3
  https://aws.amazon.com/documentation/s3/
• Amazon VPC
  http://aws.amazon.com/documentation/vpc/

Informatica
• Informatica Network: a source for product documentation, Knowledge Base articles, and other information
  https://network.informatica.com

Quick Start reference deployments
• AWS Quick Start home page
  https://aws.amazon.com/quickstart/

GitHub Repository
You can visit our GitHub repository to download the templates and scripts for this Quick
Start, to post your comments, and to share your customizations with others.

Document Revisions
Date Change In sections

January 2018 Initial publication —

© 2018, Amazon Web Services, Inc. or its affiliates, and Informatica LLC. All rights
reserved.

Notices

This document is provided for informational purposes only. It represents AWS’s current product offerings
and practices as of the date of issue of this document, which are subject to change without notice. Customers
are responsible for making their own independent assessment of the information in this document and any
use of AWS’s products or services, each of which is provided “as is” without warranty of any kind, whether
express or implied. This document does not create any warranties, representations, contractual
commitments, conditions or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities
and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of,
nor does it modify, any agreement between AWS and its customers.

The software included with this paper is licensed under the Apache License, Version 2.0 (the "License"). You
may not use this file except in compliance with the License. A copy of the License is located at
http://aws.amazon.com/apache2.0/ or in the "license" file accompanying this file. This code is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
