It can integrate data from the widest range of enterprise and external
data sources
It implements data validation rules
It is useful in processing and transforming large amounts of data
It uses a scalable parallel processing approach
It can handle complex transformations and manage multiple integration
processes
It leverages direct connectivity to enterprise applications as sources or
targets
It leverages metadata for analysis and maintenance
It operates in batch, real time, or as a Web service
Data transformation
Jobs
Parallel processing
Relational databases
Mainframe databases
Business and analytic applications
Enterprise resource planning (ERP) or customer relationship
management (CRM) databases
Online analytical processing (OLAP) or performance management
databases
An IBM InfoSphere DataStage job consists of individual stages that are linked
together. It describes the flow of data from a data source to a data target.
Usually, a stage has a minimum of one data input and/or one data output.
However, some stages can accept more than one data input and output to
more than one stage.
The various stages you can use in job design are:
Transform stage
Filter stage
Aggregator stage
Remove duplicates stage
Join stage
Lookup stage
Copy stage
Sort stage
Containers
The above image explains how IBM InfoSphere DataStage interacts with other
elements of the IBM Information Server platform. DataStage is divided into
two sections: Shared Components and Runtime Architecture.
InfoSphere DataStage Server 9.1.2 or above
Microsoft Visual Studio .NET 2010 Express Edition C++
Oracle client (full client, not an instant client) if connecting to an Oracle
database
DB2 client if connecting to a DB2 database
To migrate your data from an older version of InfoSphere to a new version,
use the asset interchange tool.
Installation Files
To install and configure InfoSphere DataStage, you must have the following
files in your setup.
For Windows,
EtlDeploymentPackage-windows-oracle.pkg
EtlDeploymentPackage-windows-db2.pkg
For Linux,
EtlDeploymentPackage-linux-db2.pkg
EtlDeploymentPackage-linux-oracle.pkg
1. The "InfoSphere CDC" service for the database monitors and captures
changes from the source database.
2. According to the replication definition, "InfoSphere CDC" transfers the
changed data to "InfoSphere CDC for InfoSphere DataStage."
3. The "InfoSphere CDC for InfoSphere DataStage" server sends data to
the "CDC Transaction stage" through a TCP/IP session. The
"InfoSphere CDC for InfoSphere DataStage" server also sends a
COMMIT message (along with bookmark information) to mark the
transaction boundary in the captured log.
4. For each COMMIT message sent by the "InfoSphere CDC for
InfoSphere DataStage" server, the "CDC Transaction stage" creates
end-of-wave (EOW) markers. These markers are sent on all output links
to the target database connector stage.
5. When the "target database connector stage" receives an end-of-wave
marker on all input links, it writes bookmark information to a bookmark
table and then commits the transaction to the target database.
6. The "InfoSphere CDC for InfoSphere DataStage" server requests
bookmark information from a bookmark table on the "target database."
7. The "InfoSphere CDC for InfoSphere DataStage" server receives the
Bookmark information.
The bookmark information is used to:
Determine the starting point in the transaction log where changes are
read when replication begins.
Determine whether the existing transaction log can be cleaned up.
You will also create two tables (Product and Inventory) and populate them
with sample data. Then you can test the integration
between SQL Replication and DataStage.
Moving forward, you will set up SQL Replication by creating control tables,
subscription sets, registrations, and subscription-set members. We will
learn more about this in detail in the next section.
Here we will take a retail sales database as our example and create two
tables, Inventory and Product. These tables will load data from source to
target through these sets (control tables, subscription sets, registrations,
and subscription-set members).
Step 3) Turn on archival logging for the SALES database. Also, back up the
database by using the following commands:
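The commands themselves are not reproduced in this excerpt. As a sketch, assuming a local DB2 instance where SALES is cataloged (the backup path is a placeholder), the step would resemble:

```
# Enable archival logging for the SALES database (required by SQL Replication)
db2 update db cfg for SALES using LOGARCHMETH1 LOGRETAIN
# Changing the logging method leaves the database in backup-pending state,
# so a backup is mandatory; adjust the target path for your system
db2 backup db SALES to /tmp/db2backup
```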
Step 5) Create the Inventory table and import data into it by running the
following command:
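The tutorial's DDL and data files are not shown here; a hypothetical equivalent, assuming an inventory.sql script and an inventory.ixf export file shipped with the setup (both file names are placeholders), would be:

```
# Create the INVENTORY table from a DDL script
db2 -tvf inventory.sql
# Import the sample rows from an IXF export file into the new table
db2 import from inventory.ixf of ixf insert into INVENTORY
```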
Step 2) In the file replace <db2-connect-ID> and "<password>" with your user
ID and password for connecting to the SALES database.
asnclp -f crtCtlTablesCaptureServer.asnclp
Step 5) Now in the same command prompt use the following command to
create apply control tables.
asnclp -f crtCtlTablesApplyCtlServer.asnclp
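The contents of the two .asnclp scripts are not reproduced in this tutorial. As a rough sketch (exact syntax varies by replication version), a Capture control-table script typically contains statements such as:

```
ASNCLP SESSION SET TO SQL REPLICATION;
SET RUN SCRIPT NOW STOP ON SQL ERROR ON;
SET SERVER CAPTURE TO DB SALES ID <db2-connect-ID> PASSWORD "<password>";
CREATE CONTROL TABLES FOR CAPTURE SERVER;
```

The Apply version points at the STAGEDB database and creates the Apply control tables instead.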
Step 6) Locate the crtRegistration.asnclp script files and replace all instances
of <db2-connect-ID> with the user ID for connecting to the SALES database.
Also, change "<password>" to the connection password.
Step 7) To register the source tables, use the following script. As part of
creating the registration, the ASNCLP program will create two CD tables,
CDPRODUCT and CDINVENTORY.
asnclp -f crtRegistration.asnclp
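For reference, the core of a registration script like crtRegistration.asnclp can be sketched as follows (clauses vary by version; the actual script ships with the tutorial files):

```
ASNCLP SESSION SET TO SQL REPLICATION;
SET RUN SCRIPT NOW STOP ON SQL ERROR ON;
SET SERVER CAPTURE TO DB SALES ID <db2-connect-ID> PASSWORD "<password>";
CREATE REGISTRATION (PRODUCT) DIFFERENTIAL REFRESH STAGE CDPRODUCT;
CREATE REGISTRATION (INVENTORY) DIFFERENTIAL REFRESH STAGE CDINVENTORY;
```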
After making the changes, run the script to create the subscription set (ST00)
that groups the source and target tables. The script also creates two
subscription-set members and CCD (consistent change data) tables in the
target database that will store the modified data. This data will be consumed
by InfoSphere DataStage.
Step 10) Run the script to create the subscription set, subscription-set
members, and CCD tables.
asnclp -f crtSubscriptionSetAndAddMembers.asnclp
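A simplified sketch of what this script does (the real crtSubscriptionSetAndAddMembers.asnclp carries many more options) is:

```
ASNCLP SESSION SET TO SQL REPLICATION;
SET SERVER CAPTURE TO DB SALES;
SET SERVER CONTROL TO DB STAGEDB;
SET SERVER TARGET TO DB STAGEDB;
CREATE SUBSCRIPTION SET SETNAME ST00 APPLYQUAL AQ00 ACTIVATE YES;
CREATE MEMBER IN SETNAME ST00 APPLYQUAL AQ00 ACTIVATE YES
  SOURCE PRODUCT TARGET NAME PRODUCT_CCD TYPE CCD;
```

A second CREATE MEMBER statement does the same for INVENTORY and INVENTORY_CCD.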
Various options are used for creating the subscription set and its two members.
Step 11) Due to a defect in the replication administration tools, you have to
execute another batch file to set the TARGET_CAPTURE_SCHEMA column
in the IBMSNAP_SUBS_SET control table to null.
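The batch file essentially boils down to a single SQL UPDATE; assuming the default ASN schema on the STAGEDB control server and the ST00 set, it is equivalent to:

```
db2 connect to STAGEDB
db2 "UPDATE ASN.IBMSNAP_SUBS_SET SET TARGET_CAPTURE_SCHEMA = NULL WHERE SET_NAME = 'ST00'"
db2 connect reset
```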
To connect the CCD tables with DataStage, you need to create DataStage
definition (.dsx) files. The .dsx file format is used by DataStage to import and
export job definitions. You will use an ASNCLP script to create two .dsx files.
For example, here we have created two .dsx files.
1. One job sets a synchpoint where DataStage left off in extracting data
from the two tables. The job gets this information by selecting the
SYNCHPOINT value for the ST00 subscription set from the
IBMSNAP_SUBS_SET table and inserting it into the
MAX_SYNCHPOINT column of the IBMSNAP_FEEDETL table.
2. Two jobs that extract data from the PRODUCT_CCD and
INVENTORY_CCD tables. The jobs know which rows to start extracting
by selecting the MIN_SYNCHPOINT and MAX_SYNCHPOINT values
from the IBMSNAP_FEEDETL table for the subscription set.
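In SQL terms, the bookkeeping these jobs perform can be sketched as follows (the ASN schema and the IBMSNAP_COMMITSEQ column used for range filtering are assumptions based on standard SQL Replication table layouts):

```
# Job 1: record where extraction should stop for subscription set ST00
db2 "UPDATE ASN.IBMSNAP_FEEDETL
     SET MAX_SYNCHPOINT = (SELECT SYNCHPOINT FROM ASN.IBMSNAP_SUBS_SET
                           WHERE SET_NAME = 'ST00')
     WHERE SET_NAME = 'ST00'"
# Jobs 2 and 3: extract only the rows inside the recorded range
db2 "SELECT * FROM PRODUCT_CCD
     WHERE IBMSNAP_COMMITSEQ > (SELECT MIN_SYNCHPOINT FROM ASN.IBMSNAP_FEEDETL
                                WHERE SET_NAME = 'ST00')
       AND IBMSNAP_COMMITSEQ <= (SELECT MAX_SYNCHPOINT FROM ASN.IBMSNAP_FEEDETL
                                 WHERE SET_NAME = 'ST00')"
```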
Starting Replication
To start replication, follow the steps below. When the CCD tables are
populated with data, the replication setup is validated. To view the replicated
data in the target CCD tables, use the DB2 Control Center graphical user
interface.
Step 1) Make sure that DB2 is running; if not, use the db2start command.
Step 2) Then use the asncap command from an operating system prompt to
start the Capture program. For example:
asncap capture_server=SALES
The above command specifies the SALES database as the Capture server.
Keep the command window open while the capture is running.
Step 3) Now open a new command prompt. Then start the Apply program
by using the asnapply command.
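The asnapply invocation is not shown in this excerpt; by analogy with the asncap example above, and assuming AQ00 as the Apply qualifier and STAGEDB as the Apply control server, it would resemble:

```
asnapply control_server=STAGEDB apply_qual=AQ00
```

As with the Capture window, keep this command window open while Apply is running.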
Step 4) Now open another command prompt and issue the db2cc command
to launch the DB2 Control Center. Accept the default Control Center.
Step 5) Now in the left navigation tree, open All Databases > STAGEDB and
then click Tables. Double-click the table name (Product CCD) to open the
table. It will look something like this.
Once installation and replication are done, you need to create a project. In
DataStage, projects are a method for organizing your data. This includes
defining data files, stages, and build jobs in a specific project.
Step 2) For connecting to the DataStage server from your DataStage client,
enter details like Domain name, user ID, password, and server information
1. Name
2. Location of file
3. Click 'OK'
Each project contains:
DataStage jobs
Built-in components. These are predefined components used in a job.
User-defined components. These are customized components created
using the DataStage Manager or DataStage Designer.
The stages in the InfoSphere DataStage and QualityStage Designer client are
stored in the Designer tool palette.
The following stages are included in InfoSphere QualityStage:
Investigate stage
Standardize stage
Match Frequency stage
One-source Match stage
Two-source Match stage
Survive stage
Standardization Quality Assessment (SQA) stage
Parallel Job
Sequence Job
Mainframe Job
Server Job
Step 1) Start the DataStage and QualityStage Designer. Click Start > All
programs > IBM Information Server > IBM WebSphere DataStage and
QualityStage Designer
Domain
User Name
Password
Project Name
OK
Step 3) Now from the File menu, click Import -> DataStage Components.
A new DataStage Repository Import window will open.
Step 5) Under the Designer Repository pane, open the SQLREP folder. Inside
the folder, you will see a Sequence job and four parallel jobs.
In DataStage, you use data connection objects with related connector stages
to quickly define a connection to a data source in a job design.
Step 1) STAGEDB contains both the Apply control tables that DataStage uses
to synchronize its data extraction and the CCD tables from which the data is
extracted. Use following commands
Step 3) You will have a window with two tabs, Parameters and General.
Click the browse button next to the 'Connect using Stage Type' field,
and in the Open window navigate the repository tree to
Stage Types -> Parallel -> Database -> DB2 Connector.
Click Open.
Step 5) In Connection parameters table, enter details like
ConnectionString: STAGEDB2
Username: User ID for connecting to STAGEDB database
Password: Password for connecting to STAGEDB database
Instance: Name of DB2 instance that contains STAGEDB database
Step 6) In the next window, save the data connection by clicking the 'Save' button.
Step 1) Select Import > Table Definitions > Start Connector Import Wizard
Step 2) From connector selection page of the wizard, select the DB2
Connector and click Next.
Step 3) Click Load on the connection detail page. This will populate the wizard
fields with connection information from the data connection that you created in
the previous chapter.
Step 4) Click Test connection on the same page. This will prompt DataStage
to attempt a connection to the STAGEDB database. You can see the
message "connection is successful". Click Next.
Step 5) Make sure that on the Data source location page the Hostname and
Database name fields are correctly populated. Then click Next.
Step 6) On the Schema page, enter the schema of the Apply control tables
(ASN), or check that the ASN schema is pre-populated in the schema field.
Then click Next. The selection page will show the list of tables that are defined
in the ASN schema.
Step 9) Repeat steps 1-8 two more times to import the definitions for the
PRODUCT_CCD table and then the INVENTORY_CCD table.
NOTE: While importing definitions for the Inventory and Product tables, make
sure you change the schema from ASN to the schema under which
PRODUCT_CCD and INVENTORY_CCD were created.
Now DataStage has all the details that it requires to connect to the SQL
Replication target database.
Stages have predefined properties that are editable. Here we will change
some of these properties for the STAGEDB_ASN_PRODUCT_CCD_extract
parallel job.
Step 1) Browse the Designer repository tree. Under SQLREP folder select the
STAGEDB_ASN_PRODUCT_CCD_extract parallel job. To edit, right-click the
job. The design window of the parallel job opens in the Designer Palette.
Step 2) Locate the green icon. This icon signifies the DB2 connector stage. It
is used for extracting data from the CCD table. Double-click the icon. A stage
editor window opens.
Step 3) In the editor click Load to populate the fields with connection
information. To close the stage editor and save your changes click OK.
Step 5) Now click the Load button to populate the fields with connection
information.
NOTE: If you are using a database other than STAGEDB as your Apply
control server, then select the option to load the connection information for
the getSynchPoints stage, which interacts with the control tables rather than
the CCD table.
Under the Properties tab, make sure the Target folder is open and the
File = DATASETNAME property is highlighted.
On the right, you will have a file field
Enter the full path to the productdataset.ds file
Click 'OK'.
You have now updated all necessary properties for the product CCD table.
Close the design window and save all changes.
NOTE:
If your control server is not STAGEDB, you have to load the connection
information for the control server database into the stage editor for the
getSynchPoints stage.
For the STAGEDB_ST00_AQ00_getExtractRange and
STAGEDB_ST00_AQ00_markRangeProcessed parallel jobs, open all
the DB2 connector stages. Then use the load function to add
connection information for the STAGEDB database
Step 1) Under the SQLREP folder, select each of the five jobs (Ctrl+Shift).
Then right-click and choose the Multiple job compile option.
Step 2) You will see that five jobs are selected in the DataStage Compilation
Wizard. Click Next.
Step 5) In the project navigation pane on the left, click the SQLREP folder.
This brings all five jobs into the Director status table.
Now check whether changed rows that are stored in the PRODUCT_CCD and
INVENTORY_CCD tables were extracted by DataStage and inserted into the
two data set files.
Step 8) Accept the defaults in the rows to be displayed window. Then click
OK. A data browser window will open to show the contents of the data set file.
Testing integration between SQL Replication and
DataStage
In the previous step, we compiled and executed the job. In this section, we will
check the integration of SQL Replication and DataStage. For that, we will make
changes to the source tables and see if the same changes are updated in
DataStage.
The SQL script will do various operations like Update, Insert and delete on
both tables (PRODUCT, INVENTORY) in the Sales database.
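The script itself is not listed here; a hypothetical fragment performing one operation of each kind (column names are invented for illustration) could look like:

```
db2 connect to SALES
db2 "INSERT INTO PRODUCT (PRODUCT_ID, DESCRIPTION) VALUES (1001, 'New item')"
db2 "UPDATE INVENTORY SET QUANTITY = 25 WHERE PRODUCT_ID = 1001"
db2 "DELETE FROM PRODUCT WHERE PRODUCT_ID = 1001"
db2 connect reset
```

Each of these changes is captured from the SALES transaction log and should eventually appear as a row in the corresponding CCD table.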
Step 5) On the system where DataStage is running, open the DataStage
Director and execute the STAGEDB_AQ00_S00_sequence job. Click Job >
Run Now.
When you run the job, the following activities will be carried out.
You can check that the above steps took place by looking at the data sets.
The dataset contains three new rows. The easiest way to check that the
changes were implemented is to scroll to the far right of the Data Browser.
Now look at the last three rows (see image below).
The letters I, U, and D specify the INSERT, UPDATE, and DELETE operations
that resulted in each new row.
Summary:
DataStage is an ETL tool which extracts, transforms, and loads data
from source to target.
It facilitates business analysis by providing quality data to help in
gaining business intelligence.
DataStage is divided into two sections: Shared Components and
Runtime Architecture.
DataStage has four main components,
o Administrator
o Manager
o Designer
o Director
Following are the key aspects of IBM InfoSphere DataStage:
o Data transformation
o Jobs
o Parallel processing
The various stages involved in job design are:
o Transform stage
o Filter stage
o Aggregator stage
o Remove duplicates stage
o Join stage
o Lookup stage