Ab Initio Architecture
Overview of Graph
Ab Initio functions
Basic components
Partitioning and De-partitioning
Case Studies
July 6, 2010
Introduction
Ab Initio is a data processing tool from Ab Initio Software Corporation
(http://www.abinitio.com).
[Diagram: architecture layers, top to bottom — User Applications; Development Environments; the Co>Operating System, running component programs (partitions, transforms, etc.) and user programs; the native Operating System (Unix, Windows NT).]
The GDE can talk to the Co>Operating System using several protocols, such as Telnet.
Note: During deployment, the GDE sets AB_COMPATIBILITY to the Co>Operating System version number. So, a change ...
The Graph Model
A Graph: the logical modular unit of an application.
Building blocks: Components (found in the Component Organizer), Datasets, Flows, Ports, and Layouts, together with expression metadata describing the data.
Graph settings include a Setup Command and an End Script; the End Script runs at the back-end after the graph has completed running.
Layouts
Serial: a file on a single host (e.g., a file on Host W).
Parallel: a multifile (e.g., a 3-way multifile on Hosts X, Y, Z).
A layout can also be constructed manually ("Run on these hosts").
Database components can use the same layout as a database table.
Phases
A graph can be divided into phases (Phase 0, Phase 1, ...); select a phase number to view or set a component's phase.
May 18, 2010
Anatomy of a Running Job
[Diagram: the GDE client talks to the Host; the Host starts Agent processes on the processing nodes, and each Agent starts its component processes.]
Component Execution
Component processes do their jobs.
Component processes communicate directly with datasets and with each other to move data around.
Agent Termination
When all of an Agent's component processes exit, the Agent informs the Host process that those components are finished.
The Agent process then exits.
Host Termination
When all Agents have exited, the Host process informs the GDE that the job is complete.
The Host process then exits.
Failure case:
Agent Termination
When every component process of an Agent has been killed, the Agent informs the Host process that those components are finished.
The Agent process then exits.
Host Termination
When all Agents have exited, the Host process informs the GDE that the job failed.
The Host process then exits.
Ab Initio Functions
DML (Data Manipulation Language)
DML provides different sets of data types, including base and compound types as well as user-defined data types.

Fixed-length records (field names and data types):
0345John Smith
0212Sam Spade
0322Elvis Jones
0492Sue West
0221William Black

record
  decimal(4) id;
  string(10) first_name;
  string(6) last_name;
end

Delimited records (note the delimiters, and the precision and scale of salary_per_day):
0345,01-09-02,1000.00John,Smith
0212,05-07-03, 950.00Sam,Spade
0322,17-01-00, 890.50Elvis,Jones
0492,25-12-02,1000.00Sue,West
0221,28-02-03, 500.00William,Black

record
  decimal(",") id;
  date("DD-MM-YY")(",") join_date;
  decimal(7,2) salary_per_day;
  string(",") first_name;
  string("\n") last_name;
end
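The delimited record format above can be sketched as a parser in plain Python. This is an illustration of how the delimiters and the fixed-length decimal(7,2) field carve up one input line, not Ab Initio code; the function and field names are chosen to mirror the DML block.

```python
# Illustrative sketch: parse one delimited record per the DML format above.
# decimal(",") and date(...)(",") are comma-delimited; decimal(7,2) is a
# fixed 7-character field (e.g. "1000.00"); string("\n") ends the record.
def parse_record(line):
    id_, join_date, rest = line.split(",", 2)
    salary = rest[:7]                      # fixed-length decimal(7,2)
    first_name, last_name = rest[7:].split(",", 1)
    return {
        "id": int(id_),
        "join_date": join_date,            # "DD-MM-YY"
        "salary_per_day": float(salary),
        "first_name": first_name,
        "last_name": last_name.rstrip("\n"),
    }

rec = parse_record("0345,01-09-02,1000.00John,Smith\n")
# rec["id"] == 345, rec["salary_per_day"] == 1000.0
```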
Function categories
Date functions
Lookup functions
Math functions
Miscellaneous functions
String functions
Example: using Last-Visits as a lookup file.
lookup_next(<file label>): returns successive data records from a Lookup File.
NOTE: Data needs to be partitioned on the same key before using the lookup_local functions.
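The lookup semantics can be sketched in plain Python: a lookup file behaves like an in-memory table keyed on the lookup key, where the first matching record is returned and successive matches come back one at a time. This is an illustration only, not the Ab Initio API; the class and field names are hypothetical.

```python
# Illustrative sketch of lookup / lookup_next semantics (not the real API).
class LookupFile:
    def __init__(self, records, key):
        self.table = {}
        for rec in records:                 # index all records on the key
            self.table.setdefault(rec[key], []).append(rec)
        self._iter = iter(())

    def lookup(self, key_value):
        matches = self.table.get(key_value, [])
        self._iter = iter(matches[1:])      # remaining matches, if any
        return matches[0] if matches else None

    def lookup_next(self):
        return next(self._iter, None)       # successive matching records

visits = LookupFile([{"cust_id": 1, "visit": "A"},
                     {"cust_id": 1, "visit": "B"}], key="cust_id")
first = visits.lookup(1)       # {"cust_id": 1, "visit": "A"}
second = visits.lookup_next()  # {"cust_id": 1, "visit": "B"}
```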
Transform Functions: XFRs
Transform functions direct the behavior of a transform component.
A transform function is a named, parameterized sequence of local variable definitions, statements, and rules that computes expressions from input values and variables, and assigns the results to the output object.
Syntax:
output-var[, output-var ...] :: xform-name(input-var[, input-var ...]) =
begin
  local-variable-declaration-list
  statement-list
  rule-list
end;
The list of local variable definitions, if any, must precede the list of statements.
The list of statements, if any, must appear before the list of rules.
Example:
1. temp :: trans1(in) =
   begin
     temp.sum :: 0;    /* initializes the temporary field sum */
   end;
2. out, temp :: trans2(temp, in) =
   begin
     temp.sum :: temp.sum + in.amount;
     out.city :: in.city;
     out.sum :: temp.sum;
   end;
Basic Components
Filter by Expression
Reformat
Redefine Format
Replicate
Join
Sort
Rollup
Aggregate
Dedup Sorted
Filter by Expression
[Diagram: each record is tested against expr; if the expression is true ("Yes") the record goes to the out port, otherwise ("No") to the deselect port.]
Reformat
Reformat uses the following transform function to write output to out port out0:
record
  string(56) personal_info;
  decimal(8,2) salary;
end
Replicate
Parameters: None
Aggregate
Aggregate uses a key specifier to sort the data and a transform function to write output.
After processing, the graph produces the Output File.
Ports and parameters of Join:
count: an integer from 2 to 20 specifying the number of the following ports and parameters. Default is 2.
  in ports
  unused ports
  reject ports
  error ports
  record-required parameters
  dedup parameters
  select parameters
  override-key parameters
key: names of the fields in the input records that must have matching values for Join to call the transform function.
sorted-input:
"Input must be sorted": Join requires sorted input; the maintain-order parameter is not available.
"In memory: Input need not be sorted": Join accepts unsorted input and permits the use of the maintain-order parameter.
Default is "Input must be sorted".
logging: specifies whether or not you want the component to generate log records for certain events. The value of the logging parameter is True or False; the default is False.
log_input: indicates how often you want the component to send an input record to its log port. For example, if you select 100, the component sends every 100th input record to its log port.
log_output: indicates how often you want the component to send an output record to its log port. For example, if you select 100, the component sends every 100th output record to its log port.
log_reject: indicates how often you want the component to send a reject record to its log port. For example, if you select 100, the component sends every 100th reject record to its log port.
log_intermediate: indicates how often you want the component to send an intermediate record to its log port.
max-core: maximum memory usage in bytes.
transform: either the name of the file containing the transform function, or the transform string.
selectn: filter for data records before the join; one per in port (select0 for in0, select1 for in1, ...).
reject-threshold: the component's tolerance for reject events:
  Abort on first reject: the component stops execution of the graph at the first reject event it generates.
  Never Abort: the component does not stop execution of the graph, no matter how many reject events it generates.
  Use limit/ramp: the component uses the settings of the ramp and limit parameters to determine how many reject events to allow before it stops execution of the graph.
    limit: an integer representing a number of reject events.
    ramp: a real number representing a rate of reject events per number of records processed.
driving: the number of the port to which you connect the driving input. The driving input is the largest input; all the other inputs are read into memory. Only available when the sorted-input parameter is set to "In memory: Input need not be sorted". Specify the port number as the value of the driving parameter; Join reads all other inputs into memory. Default is 0.
max-memory: maximum memory usage in bytes before Join writes temporary files to disk. Only available when the sorted-input parameter is set to "Inputs must be sorted".
join-type: Join uses the default value, Inner Join, for the join-type parameter.
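Join's default behavior (an inner join on the key, with the non-driving input read into memory as in the "In memory" mode described above) can be sketched in plain Python. This is an illustration of the semantics, not Ab Initio code; the function and record names are hypothetical.

```python
# Illustrative sketch of an in-memory inner join: the non-driving input is
# indexed on the join key, then the driving (largest) input is streamed.
def inner_join(driving, other, key):
    index = {}
    for rec in other:                       # non-driving input read into memory
        index.setdefault(rec[key], []).append(rec)
    out = []
    for rec in driving:                     # driving input is streamed
        for match in index.get(rec[key], []):
            merged = dict(match)
            merged.update(rec)              # combine the matching records
            out.append(merged)
    return out

rows = inner_join([{"id": 1, "city": "NY"}],
                  [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}],
                  key="id")
# rows == [{"id": 1, "amount": 10, "city": "NY"}]  (id 2 has no match)
```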
Example of Join
Given the preceding data, record formats, parameters, and transform function, the graph produces an Output File with the following data.
Rollup processing model:
in:   Initialize — do for the first record in each group ...
temp: Rollup — do for every record in each group ...
out:  the result written for each group
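The in/temp/out model above can be sketched in plain Python: per key group, an initialize step runs on the first record, a rollup step on every record, and the final temp becomes the output. The example mirrors trans1/trans2 from the transform-function section (a sum of amount per city); this is an illustrative sketch, not Ab Initio code.

```python
# Illustrative sketch of the rollup model: initialize on the first record of
# each group, roll over every record, emit one output record per group.
from itertools import groupby

def rollup(records, key, initialize, roll):
    out = []
    for _, group in groupby(sorted(records, key=key), key=key):
        group = list(group)
        temp = initialize(group[0])     # do for the first record in each group
        for rec in group:               # do for every record in each group
            temp = roll(temp, rec)
        out.append(temp)                # the result written for each group
    return out

result = rollup(
    [{"city": "NY", "amount": 10}, {"city": "NY", "amount": 5},
     {"city": "LA", "amount": 7}],
    key=lambda r: r["city"],
    initialize=lambda first: {"city": first["city"], "sum": 0},
    roll=lambda temp, rec: {"city": temp["city"], "sum": temp["sum"] + rec["amount"]},
)
# result == [{"city": "LA", "sum": 7}, {"city": "NY", "sum": 15}]
```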
Multifiles
A multifile is a global view of a set of ordinary files, called partitions, usually located on different disks or systems.
Ab Initio provides shell-level utilities prefixed with m_ (m_ls, m_cp, m_mkdir, etc.).
Example URLs:
mfile://pluto.us.com/usr/ed/mfs1/new.dat
mfile://host1/u/jo/mfs/mydir
  control file: //host1/u1/jo/mfs/mydir (<.mdir>)
  partitions:   //host1/vol4/pA/mydir  //host2/vol3/pB/mydir  //host3/vol7/pC/mydir
mfile://host1/u/jo/mfs/mydir/myfile.dat
  control file: //host1/u1/jo/mfs/mydir/myfile.dat
  partitions:   //host1/vol4/pA/mydir/myfile.dat  //host2/vol3/pB/mydir/myfile.dat  //host3/vol7/pC/mydir/myfile.dat
A multifile consists of a control file plus its partitions (serial files).
Forms of Parallelism
Component parallelism: different components process different data at the same time (e.g., sorting Customers while sorting Transactions).
Pipeline parallelism: inherent in Ab Initio; a downstream component can process a record (e.g., record 99) while upstream components are still producing later ones.
Data parallelism: the same component runs against multiple partitions of the data; a multifile provides both an expanded view (the partitions) and a global view (one logical file).
Partition by Round-robin
Partition by Key
Partition by Expression
Partition by Range
Partition by Percentage
Broadcast
Partition by Round-robin
[Diagram: successive records are distributed cyclically across the partitions.]
Partition by Key
Example with three partitions, assigning each record to partition (key % 3):
  100 % 3 = 1   91 % 3 = 1   57 % 3 = 0   25 % 3 = 1   122 % 3 = 2   213 % 3 = 0
  Partition0: 57, 213   Partition1: 100, 91, 25   Partition2: 122
Data may not be evenly distributed across partitions
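The modulo example above can be sketched in plain Python. This illustrates only the key-to-partition arithmetic shown in the example, not Ab Initio's actual hash function; the names are hypothetical.

```python
# Illustrative sketch of Partition by Key with three partitions:
# each record goes to partition (key % num_partitions).
def partition_by_key(keys, num_partitions=3):
    partitions = [[] for _ in range(num_partitions)]
    for k in keys:
        partitions[k % num_partitions].append(k)
    return partitions

parts = partition_by_key([100, 91, 57, 25, 122, 213])
# parts == [[57, 213], [100, 91, 25], [122]] — uneven distribution
```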
Partition by Expression
Distributes data records to partitions according to DML expression values.
Example with three partitions and the DML expression value / 40:
  99 / 40 = 2   91 / 40 = 2   57 / 40 = 1   25 / 40 = 0   22 / 40 = 0   73 / 40 = 1
  Partition0: 25, 22   Partition1: 57, 73   Partition2: 99, 91
Does not guarantee even distribution across partitions
Using Partition by Expression, cascaded Filter by Expression components can be avoided
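The expression example above can likewise be sketched in plain Python, with the DML expression (integer division by 40) supplied as a function whose value selects the partition. An illustrative sketch with hypothetical names, not Ab Initio code.

```python
# Illustrative sketch of Partition by Expression: the expression is evaluated
# per record and its value is the partition number.
def partition_by_expression(values, expr, num_partitions=3):
    partitions = [[] for _ in range(num_partitions)]
    for v in values:
        partitions[expr(v)].append(v)
    return partitions

parts = partition_by_expression([99, 91, 57, 25, 22, 73],
                                expr=lambda v: v // 40)
# parts == [[25, 22], [57, 73], [99, 91]]
```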
Broadcast
Combines all data records it receives into a single flow
Writes copy of that flow into each output data partition
[Example: input records A through G are combined into one flow and a full copy (A, B, C, D, E, F, G) is written to each of Partition0, Partition1, and Partition2.]
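Broadcast's behavior can be sketched in one line of plain Python: the combined flow is copied whole into every output partition. Illustrative only; the names are hypothetical.

```python
# Illustrative sketch of Broadcast: every partition gets a full copy.
def broadcast(records, num_partitions=3):
    return [list(records) for _ in range(num_partitions)]

parts = broadcast(["A", "B", "C", "D", "E", "F", "G"])
# each of the 3 partitions holds the full flow A..G
```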
73
Record-independent
Round-robin No Even
parallelism
Depends on the key Key-dependent
Partition by Key Yes
value parallelism
Partition by Depends on data and
Yes Application specific
Expression expression
Record-independent
Broadcast No Even
parallelism
Key-dependent
Partition by Range Yes Depends on splitters parallelism, Global
Ordering
De-partitioning components: Gather, Concatenate, Merge, Interleave.

Component  Key-based?  Result order  Use
Merge      Yes         Sorted        Creating ordered serial flows
Gather     No          Arbitrary     Unordered departitioning
Case Study 1
In a shop, the customer file contains the following fields:
Cust_id amount
215657 1000
462310 1500
462310 2000
215657 2500
462310 5500
215657 4500
Develop the Ab Initio graph that does the following:
it takes the first three records of each Cust_id and sums the amounts;
the output file is as follows
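The graph's logic can be sketched in plain Python (in Ab Initio this would typically be a Sort on Cust_id followed by a Rollup): group by Cust_id, take the first three records of each group, and sum the amounts. The data is the customer file above; the function name is illustrative.

```python
# Illustrative sketch of the case-study logic: sum the first three amounts
# per cust_id, preserving each customer's record order.
def sum_first_three(records):
    groups = {}
    for cust_id, amount in records:
        groups.setdefault(cust_id, []).append(amount)
    return {cid: sum(amounts[:3]) for cid, amounts in groups.items()}

totals = sum_first_three([
    (215657, 1000), (462310, 1500), (462310, 2000),
    (215657, 2500), (462310, 5500), (215657, 4500),
])
# totals == {215657: 8000, 462310: 9000}
```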
Thank You!!!
bipm.services@tcs.com