
FROM THE BEGINNING

INTRODUCTION
Ab Initio is a Latin phrase that translates to
"from first principles" or
"from the beginning."
INTRODUCTION
- Ab Initio is a general-purpose data processing platform for enterprise-
class, mission-critical applications such as data warehousing, clickstream
processing, data movement, data transformation and analytics.
- Supports integration of arbitrary data sources and programs, and
provides complete metadata management across the enterprise.
- Proven best-of-breed ETL solution.
- Applications of Ab Initio:
ETL for data warehouses, data marts and operational data sources.
Parallel data cleansing and validation.
Parallel data transformation and filtering.
High-performance analytics.
Real-time, parallel data capture.
ARCHITECTURE
Ab Initio comprises three core components:
- The Co>Operating System
- Enterprise Meta>Environment (EME)
- Graphical Development Environment (GDE)
Supplementary components:
- Continuous>Flows
- Plan>It
- Re>source
- Data>Profiler
ARCHITECTURE
[Architecture diagram: the Ab Initio Co>Operating System runs on top of the native operating system (UNIX, Windows NT). Above it sit the Component Library, user-defined components and third-party components; the application development environments (Graphical, C++, Shell); applications; and the Ab Initio metadata repository.]
OPERATING ENVIRONMENTS
Ab Initio has been developed to support a wide variety of operating
environments.
The Co>Operating System and EME are available in many flavors and
can be installed on:
Windows NT based
AIX
HP-UX
Linux
Solaris
Tru64 Unix
The GDE part of Ab Initio is currently available for Windows-based
operating systems only.
CO>OPERATING SYSTEM
The Co>Operating System and the Graphical Development Environment
(GDE) form the basis of Ab Initio software. They interact with each other to
create and run an Ab Initio application:
The GDE provides a canvas on which you manipulate icons that represent
data, programs, and the connections between them. The result looks like a
data flow diagram, and represents the solution to a data manipulation
problem.
The GDE communicates this solution to the Co>Operating System as a
Korn shell script.
The Co>Operating System executes the script.
ENTERPRISE META>ENVIRONMENT
Enterprise Meta>Environment (EME) is an Ab Initio repository and
environment for storing and managing metadata. It provides the capability to
store both business and technical metadata. EME metadata can be accessed
from the Ab Initio GDE, a web browser, or the Co>Operating System command
line (air commands).
The basic situation when you work with the EME is that you have files in two
locations, of two kinds:
A personal work area that you specify, located in some filesystem.
Essentially this is a formalized directory structure that usually only
you have access to. This is where you work on files.
A system storage area.
This is the EME system area where every version that you save of the
files you work on is permanently preserved, organized in projects.
You (and other users) have indirect access to this area through the GDE
and the command line. This area's generic name is EME datastore.
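For example, EME metadata can be listed from the command line with air commands. A minimal sketch (the project path /Projects/sales is hypothetical, and available commands vary by Co>Operating System version):

air object ls /Projects/sales        # list objects stored under an EME project path
air project show /Projects/sales     # show the attributes of an EME project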
FROM THE BEGINNING
DAY - 2
GRAPHICAL DEVELOPMENT
ENVIRONMENT
GDE is a graphical application for developers, used for
designing and running Ab Initio graphs.
The ETL process in Ab Initio is represented by Ab Initio graphs. Graphs
are formed by components (from the standard components library or
custom), flows (data streams) and parameters.
A user-friendly frontend for designing Ab Initio ETL graphs.
Ability to run and debug Ab Initio jobs and trace execution logs.
GDE graph compilation results in the generation of a UNIX
shell script which may be executed on a machine without the GDE
installed.
GDE
[GDE window layout: the Menu bar with the Main, Tools, Run, Edit, Phases and Debug toolbars; the Sandbox and Components panes; the Application Output pane; and the Graph Design area.]
THE GRAPH
An Ab Initio application is called a graph: a graphical representation of
datasets, programs, and the connections between them. Within the
framework of a graph, you can arrange and rearrange the datasets,
programs, and connections and specify the way they operate. In this way,
you can build a graph to perform any data manipulation you want.
You develop graphs in the GDE.
THE GRAPH
A graph is a data flow diagram that defines the various processing stages of a task and the
streams of data as they move from one stage to another. In a graph, a component represents
a stage and a flow represents a data stream. In addition, there are parameters for specifying
various aspects of graph behavior.
Graphs are built in the GDE by dragging and dropping components, connecting them with flows,
and then defining values for parameters. You run, debug, and tune your graph in the GDE. In
the process of building a graph, you are developing an Ab Initio application, and thus graph
development is called graph programming. When you are ready to deploy your application, you
save the graph as a script that you can run from the command line.
THE GRAPH
A graph is composed of components and flows:
Components represent datasets (dataset components) and the programs
(program components) that operate on them.
Dataset components are frequently referred to as file components when
they represent files, or table components when they represent database
tables.
Components have ports through which they read and write data.
Ports have metadata attached to them that specifies how to interpret the
data passing through them.
Program components have parameters that you can set to control the
specifics of how the program operates.
Flows represent the connections between ports. Data moves from one
component to another through flows.
THE GRAPH
A graph can run on one or many processors. The processors can be on one or
many computers.
All the computers that participate in running a graph must have the
Co>Operating System installed.
When you run a graph from the GDE, the GDE connects to the Co>Operating
System on the run host, the computer that hosts the Co>Operating System
installation that controls the execution of the graph. The GDE transmits a Korn
shell script that contains all the information the Co>Operating System needs to
execute the programs and processes represented by the graph, including:
Managing all connections between the processors or computers that
execute various parts of the graph
Program execution
Transferring data from one program to the next
Transferring data from one processor or computer to another
Execution in phases
Restoring the graph to its state at the last completed checkpoint in case of a
failure
Writing and removal of temporary files
SETTING UP GDE
Before using the GDE to run a graph or plan, a connection must be established between the GDE
and the installation of the Co>Operating System on which you want to run your graph. The
computer that hosts this installation is called the run host. Typically, the GDE is on one
computer and the run host is a different computer. But even if the GDE is local to the run host,
you must still establish the connection.
Use the Host Settings dialog (in the GDE: Run > Settings > Host tab) to specify the information the GDE
needs to log in to the run host.
The GDE creates a host settings file with the name you specified. The host settings file
contains the information you specified, such as the name of the run host, login name or
username, password, Co>Operating System version and location, and type of shell you log
in with.
DATA PROCESSING
Generally, data processing involves the following tasks:
Collection of data from various sources.
Cleansing and standardizing data and its datatypes.
Data integrity checks.
Profiling of data.
Staging it for further operations.
Joining with various datasets.
Transforming data based on business rules.
Confirming data validity.
Translating data to types compatible with target systems.
Loading data to target systems.
Tying out source and target data for quality and quantity.
Anomaly, rejection and error handling.
Reporting process updates.
Metadata updates.
COMMON TASKS
Selecting: The SELECT clause of an SQL statement.
Filtering: Statements in the WHERE clause of an SQL statement.
Sorting: Specifications in an ORDER BY clause.
Transformation: Various transformation functions used on the fields in the
SELECT clause.
Aggregation: Summarization over a set of records (ADD, AVG, MIN, etc.).
Switch: Conditional operations using CASE statements.
String Operations: String selection and manipulation (CONCATENATE,
SUBSTRING, etc.).
Mathematical Operations: Addition, division, etc.
Inquiry: Test for attributes and content of a field of a record.
Joins: Specifications using JOIN clauses (SQL Server, ANSI) or the WHERE
clause (Oracle).
Lookups: Getting relevant fields from dimensions.
Rollups: Summarized data grouped using the GROUP BY clause of an SQL
statement.
COMPONENTS
Components are classified based on their functionality. Following are a
few important component categories.
COMPONENTS
Category | Description | Example
Compress | Provides components to compress/decompress data files | DEFLATE
Continuous | Components used in Continuous Flows (optional) | PUBLISH
Database | Provide an interface between Ab Initio graphs and the major databases | RUN SQL
Datasets | Represent records or act on records | INTERMEDIATE FILE
Departition | Components for controlling the flow of partitions | GATHER
Deprecated | Discontinued components, kept for backward compatibility |
FTP | Components to handle FTP operations | SFTP TO
Miscellaneous | Components to perform a variety of tasks | REDEFINE FORMAT
Partition | Components to handle multifile parallelism | PARTITION BY KEY
Sort | Sort records | FIND SPLITTERS
Transform | Modify records | AGGREGATE
Translate | Interface for translating record formats | READ XML
Validate | Test, debug and check records | COMPARE RECORDS
COMPONENTS
The dataset components represent records or act on records, as follows:
INPUT FILE represents records read as input to a graph from one or more serial files or
from a multifile.
INPUT TABLE unloads records from a database into an Ab Initio graph, allowing you to
specify as the source either a database table, or an SQL statement that selects records
from one or more tables.
INTERMEDIATE FILE represents one or more serial files or a multifile of intermediate
results that a graph writes during execution, and saves for your review after execution.
LOOKUP FILE represents one or more serial files or a multifile of records small enough to
be held in main memory, letting a transform function retrieve records much more quickly
than it could if they were stored on disk.
OUTPUT FILE represents records written as output from a graph into one or multiple
serial files or a multifile.
OUTPUT TABLE loads records from a graph into a database, letting you specify the
records' destination either directly as a single database table, or through an SQL
statement that inserts records into one or more tables.
READ MULTIPLE FILES sequentially reads from a list of input files.
READ SHARED reduces the disk read rate when multiple graphs (as opposed to multiple
components in the same graph) are reading the same very large file.
WRITE MULTIPLE FILES writes records to a set of output files.
COMPONENTS
The database components are the following:
CALL STORED PROCEDURE calls a stored procedure that returns multiple
result sets. The stored procedure can also take parameters.
INPUT TABLE unloads records from a database into an Ab Initio graph,
allowing you to specify as the source either a database table, or an SQL
statement that selects records from one or more tables.
JOIN WITH DB joins records from the flow or flows connected to its input
port with records read directly from a database, and outputs new records
containing data based on, or calculated from, the joined records.
MULTI UPDATE TABLE executes multiple SQL statements for each input
record.
OUTPUT TABLE loads records from a graph into a database, letting you
specify the records' destination either directly as a single database table, or
through an SQL statement that inserts records into one or more tables.
RUN SQL executes SQL statements in a database and writes confirmation
messages to the log port.
TRUNCATE TABLE deletes all the rows in a database table, and writes
confirmation messages to the log port.
UPDATE TABLE executes UPDATE, INSERT, or DELETE statements in
embedded SQL format to modify a table in a database, and writes status
information to the log port.
COMPONENTS
MULTI REFORMAT changes the record format of records flowing between
one and 20 pairs of in and out ports by dropping fields, or by using DML
expressions to add fields, combine fields, or transform the data in the records.
NORMALIZE generates multiple output records from each input record; you can
specify the number of output records, or the number of output records can
depend on a field or fields in each input record. NORMALIZE can separate a
record with a vector field into several individual records, each containing one
element of the vector.
REFORMAT changes the record format of records by dropping fields, or by
using DML expressions to add fields, combine fields, or transform the data in the
records.
ROLLUP generates records that summarize groups of records. ROLLUP gives
you more control over record selection, grouping, and aggregation than
AGGREGATE. ROLLUP can process either grouped or ungrouped input. When
processing ungrouped input, ROLLUP maximizes performance by keeping
intermediate results in main memory.
SCAN generates a series of cumulative summary records such as successive
year-to-date totals for groups of records. SCAN can process either grouped or
ungrouped input. When processing ungrouped input, SCAN maximizes
performance by keeping intermediate results in main memory.
SCAN WITH ROLLUP performs the same operations as SCAN; in addition, it
generates a summary record for each input group.
COMPONENTS
The miscellaneous components perform a variety of tasks:
ASSIGN KEYS assigns a value to a surrogate key field in each record on the in port, based on the value of a
natural key field in that record, and then sends the record to one or two of three output ports.
BUFFERED COPY copies records from its in port to its out port without changing the values in the records. If
the downstream flow stops, BUFFERED COPY copies records to a buffer and defers outputting them until the
downstream flow resumes.
DOCUMENTATION provides a facility for documenting a transform.
GATHER LOGS collects the output from the log ports of components for analysis of a graph after execution.
LEADING RECORDS copies records from input to output, stopping after the given number of records.
META PIVOT converts each input record into a series of separate output records: one separate output record
for each field of data in the original input record.
REDEFINE FORMAT copies records from its input to its output without changing the values in the records.
You can use REDEFINE FORMAT to change or rename fields in a record format without changing the values
in the records.
REPLICATE arbitrarily combines all the records it receives into a single flow and writes a copy of that flow to
each of its output flows.
RUN PROGRAM runs an executable program.
THROTTLE copies records from its input to its output, limiting the rate at which records are processed.
RECIRCULATE and COMPUTE CLOSURE can only be used together. These two components
calculate the complete set of direct and derived relationships among a set of input key-pairs; in other words,
the transitive closure of the relationship within the set.
TRASH ends a flow by accepting all the records in it and discarding them.
COMPONENTS
The transform components modify or manipulate records by using one or several
transform functions. Some of these are multistage transform components that modify
records in up to five stages: input selection, temporary initialization, processing,
finalization, and output selection. Each stage is written as a DML transform function.
AGGREGATE generates records that summarize groups of records. AGGREGATE can
process either grouped or ungrouped input. When processing ungrouped input,
AGGREGATE maximizes performance by keeping intermediate results in main memory.
DEDUP SORTED separates one specified record in each group of records from the rest of
the records in the group. DEDUP SORTED requires grouped input.
DENORMALIZE SORTED consolidates groups of related records into a single record with
a vector field for each group, and optionally computes summary fields for each group.
DENORMALIZE SORTED requires grouped input.
FILTER BY EXPRESSION filters records according to a specified DML expression.
FUSE applies a transform to corresponding records from each input flow. The transform is
first applied to the first record on each flow, then to the second record on each flow, and
so on. The result of the transform is sent to the out port.
JOIN performs inner, outer, and semi-joins on multiple flows of records. JOIN can process
either sorted or unsorted input. When processing unsorted input, JOIN maximizes
performance by loading input records into main memory.
MATCH SORTED combines and performs transform operations on multiple flows of
records. MATCH SORTED requires grouped input.
COMPONENTS
The sort components sort and merge records, and perform related tasks:
CHECKPOINTED SORT sorts and merges records, inserting a checkpoint
between the sorting and merging phases.
FIND SPLITTERS sorts records according to a key specifier, and then finds
the ranges of key values that divide the total number of input records
approximately evenly into a specified number of partitions.
PARTITION BY KEY AND SORT repartitions records by key values and
then sorts the records within each partition; the number of input and output
partitions can be different.
SAMPLE selects a specified number of records at random from one or more
input flows. The probability of any one input record appearing in the output
flow is the same; it does not depend on the position of the record in the
input flow.
SORT orders and merges records.
SORT WITHIN GROUPS refines the sorting of records already sorted
according to one key specifier: it sorts the records within the groups formed
by the first sort according to a second key specifier.
COMPONENTS
The partition components distribute records to multiple flow partitions or
multiple straight flows to support data parallelism or component parallelism:
BROADCAST arbitrarily combines all the records it receives into a single
flow and writes a copy of that flow to each of its output flow partitions.
PARTITION BY EXPRESSION distributes records to its output flow partitions
according to a specified DML expression.
PARTITION BY KEY distributes records to its output flow partitions
according to key values.
PARTITION BY PERCENTAGE distributes a specified percentage of the total
number of input records to each output flow.
PARTITION BY RANGE distributes records to its output flow partitions
according to the ranges of key values specified for each partition.
PARTITION BY ROUND-ROBIN distributes records evenly to each output
flow.
PARTITION WITH LOAD BALANCE distributes records to its output flow
partitions, writing more records to the flow partitions that consume records
faster.
COMPONENTS
The departition components combine multiple flow partitions or multiple
straight flows into a single flow to support data parallelism or component
parallelism:
CONCATENATE appends multiple flow partitions of records one after
another.
GATHER combines records from multiple flow partitions arbitrarily.
INTERLEAVE combines blocks of records from multiple flow partitions in
round-robin fashion.
MERGE combines records from multiple flow partitions that have all been
sorted according to the same key specifier and maintains the sort order.
COMPONENTS
The compress components compress data or expand compressed data:
DEFLATE and INFLATE work on all platforms.
DEFLATE reduces the volume of data in a flow and INFLATE reverses
the effects of DEFLATE.
COMPRESS and UNCOMPRESS are available on Unix and Linux
platforms; not on Windows.
COMPRESS reduces the volume of data in a flow and UNCOMPRESS
reverses the effects of COMPRESS.
COMPRESS cannot output more than 2 GB of data on Linux platforms.
COMPONENTS
The MVS dataset components perform as follows:
MVS INPUT FILE reads an MVS dataset as an input to your graph.
MVS INTERMEDIATE FILE represents MVS datasets that contain
intermediate results that your graph writes and saves for review.
MVS LOOKUP FILE contains shared MVS data for use with the DML
lookup functions; it allows access to records according to a key.
MVS OUTPUT FILE represents records written as output from a graph into
an MVS dataset.
VSAM LOOKUP executes a VSAM lookup for each input record.
COMPONENTS
The EME (Enterprise Meta>Environment) components are in the
AB_HOME/Connectors > EME folder. They transfer data into and out of the
EME.
LOAD ANNOTATION VALUES attaches annotation values to objects in an EME
datastore, and validates the annotation for the object to which it is to be
attached.
LOAD CATEGORY stores annotation values in EME datastores for objects that
are members of a given category.
LOAD FILE DATASET creates a file dataset in an EME datastore, and attaches
its record format.
LOAD MIMEOBJ creates a MIME object in an EME datastore with the specified
MIME type.
LOAD TABLE DATASET creates a table dataset in an EME datastore and
attaches its record format.
LOAD TYPE creates a DML record object in an EME datastore.
REMOVE OBJECTS AND ANNOTATIONS allows you to delete objects and
annotation values from an EME datastore.
UNLOAD CATEGORY unloads the annotation rules of all objects that are
members of a given category.
COMPONENTS
The FTP (File Transfer Protocol) components transfer data, as follows:
FTP FROM transfers files of records from a computer not running the
Co>Operating System to a computer running the Co>Operating System.
FTP TO transfers files of records to a computer not running the
Co>Operating System from a computer running the Co>Operating System.
SFTP FROM transfers files of records from a computer not running the
Co>Operating System to a computer running the Co>Operating System
using the sftp or scp utilities to connect to a Secure Shell (SSH) server on
the remote machine and transfer the files via the encrypted connection
provided by SSH.
SFTP TO transfers files of records from a computer running the
Co>Operating System to a computer not running the Co>Operating System
using the sftp or scp utilities to connect to a Secure Shell (SSH) server on
the remote machine and transfer the files via the encrypted connection
provided by SSH.
COMPONENTS
The validate components test, debug, and check records, and produce data for
testing Ab Initio graphs:
CHECK ORDER tests whether records are sorted according to a key specifier.
COMPARE CHECKSUMS compares two checksums generated by COMPUTE
CHECKSUM. Typically, you use COMPARE CHECKSUMS to compare
checksums generated from two sets of records, each set computed from the
same data by a different method, in order to check the correctness of the
records.
COMPARE RECORDS compares records from two flows one by one.
COMPUTE CHECKSUM calculates a checksum for records.
GENERATE RANDOM BYTES generates a specified number of records, each
consisting of a specified number of random bytes. Typically, the output of
GENERATE RANDOM BYTES is used for testing a graph. For more control over
the content of the records, use GENERATE RECORDS.
GENERATE RECORDS generates a specified number of records with fields of
specified lengths and types. You can let GENERATE RECORDS generate
random values within the specified length and type for each field, or you can
control various aspects of the generated values. Typically, you use the output of
GENERATE RECORDS to test a graph.
VALIDATE RECORDS separates valid records from invalid records.
COMPONENTS
Configuring a graph and its components involves answering questions such as:
Where do the actual data files represented by
INPUT FILE components reside?
Where should the Co>Operating System write the
files represented by OUTPUT FILE components, or
do those files already exist?
If a file represented by an OUTPUT FILE
component does exist, should the graph append
the data it produces to the existing data in the file,
or should the new data overwrite the old?
What field of the data should a program
component use as a key when processing the
data?
What is the record format attached to a particular
port?
How much memory should a component use for
processing before it starts writing temporary files to
disk?
How many partitions do you want to divide the
data into at any particular point in the graph?
What are the locations of the processors you want
to use to execute various parts of the graph?
What is the location of the Co>Operating System
you want to use to control the execution of the
graph?
COMPONENTS
Configuring components:
Components comprise settings that dictate how they affect the data
flowing through them. These settings are viewed and set on the following
tabs of the Properties window:
Description
Access
Layout
Parameters
Ports
COMPONENT PROPERTIES
DESCRIPTION:
Specify labels
Component-specific selections
Data locations
Partitions
Comments
Name/Author/Version
COMPONENT PROPERTIES
ACCESS:
Specification of file handling methods.
Creation/Deletion
Rollback settings
Exclusive access
File protection
COMPONENT PROPERTIES
LAYOUT:
The locations of files.
The number and locations of the partitions of
multifiles.
The number of partitions of program
components and the locations where they
execute.
A layout is one of the following:
A URL that specifies the location of a serial file
A URL that specifies the location of the control
partition of a multifile
A list of URLs that specify the locations of:
The partitions of an ad hoc multifile
The working directories of a program
component
Every component in a graph, both dataset and program components, has a
layout. Some graphs use one layout throughout; others use several layouts and
repartition data when needed for processing by a greater or lesser number of
processors.
COMPONENT PROPERTIES
PARAMETERS:
Set various component-specific
configurations
Specify transform settings
Handle rejects/logs/thresholds
Apply filters
Specify keys for Sort and Partition
Specify URLs for transforms
Parameter interpretation
COMPONENT PROPERTIES
PORTS:
Assign record formats to the various
ports
Specify the URL of a record format
Interpretation settings for any
environment parameters
EDITORS
EDITORS:
Configuring components on the various tabs involves changing settings for
Transforms
Filters
Keys
Record Formats
Expressions
These settings are created using
Text editors
The various editors available for each respective item.
EDITORS: RECORD FORMAT
The Record Format Editor enables you to easily create and edit record
formats.
record
  decimal(6) cust_id;
  string(18) last_name;
  string(16) first_name;
  string(26) street_addr;
  string(2) state;
  string(1) newline;
end;
DML DATA TYPES
Data Type | Description | Syntax | Example
integer | Describes data that is a binary number. An integer is always defined with its size. | signed integer(1) | 100
real | Describes data as a binary floating-point number. Reals are used for calculations involving fractional values. | real(4) | 1.23456
decimal | Interprets data as decimal numbers. | decimal(8.3) | 1234.567
string | Describes data as text. A string can be of fixed or variable length. | string(11) | sample text
date/datetime | Interprets data as a string representing a calendar date/time of day. | date("YYYYMMDD"), datetime("YYYYMMMDDhhmiss") | 20100910, 2010SEP10233443
void | Specifies the size of a block of data whose meaning or internal structure is not necessary to describe. | void(2) | 0xff 0xfe
vector | A one-dimensional collection of identically typed data items. | string(",")[3] | apple,plum,pear
user-defined type | Custom DML types that are defined in terms of other existing DML types. | |
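A minimal sketch combining several of these types in one record format (the field names, sizes, and the name_t type are illustrative only, not from the original slide):

type name_t = string(20);      /* user-defined type built from an existing DML type */
record
  decimal(6) cust_id;          /* decimal number */
  name_t last_name;            /* fixed-length text via the user-defined type */
  date("YYYYMMDD") dob;        /* calendar date */
  decimal(8.2)[12] sales;      /* vector of 12 decimals */
  string(1) newline;
end;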
EDITORS: EXPRESSION EDITOR
Fields:
Displays the input record
formats available for use
in an expression.
Operators:
Displays the built-in DML operators.
Functions:
Displays the built-in DML
functions.
EDITORS: KEY SPECIFIER
Key Specifier Editor:
Sort Sequences
Sequence | Details
phonebook | Digits are treated as the lowest-value characters, followed by the letters of the alphabet in the order AaBbCcDd..., followed by spaces. All other characters, such as punctuation, are ignored. The order of digits is 0 1 2 3 4 5 6 7 8 9.
index | The same as phonebook ordering, except that punctuation characters are not ignored; they have lower values than all other characters. The order of punctuation characters is the machine sequence.
machine | Uses character code values in the sequence in which they are arranged in the character set of the string. For ASCII-based character sets, digits are the lowest-value characters, followed by uppercase letters, followed by lowercase letters. For EBCDIC character sets, lowercase letters are the lowest-value characters, followed by uppercase, followed by digits. For Unicode, the order is from the lowest character code value to the highest.
custom | Uses a user-defined sort order. You construct a custom sequence modifier by naming groups of characters or by naming the characters themselves.
EDITORS: TRANSFORM EDITOR
out::reformat(in) =
begin
  out.* :: in.*;
  out.score :: if (in.income > 0) 1 else 0;
end;
COMPONENTS
INPUT FILE
Represents records read as input to a graph from one or
more serial files or from a multifile.
Location: Datasets
Key Settings:
Data Location URL
Ports
OUTPUT FILE
Represents records written as output from a graph into one
or more serial files or a multifile.
Location: Datasets
Key Settings:
Data Location URL
Ports
Note: When the target of an OUTPUT FILE component is a
particular file (such as /dev/null, NUL, a named pipe, or some
other special file), the Co>Operating System never deletes
and re-creates that file, nor does it ever truncate it.
COMPONENTS
SORT
Sorts and merges records. SORT can be used to order
records before you send them to a component that
requires grouped or sorted records.
Location: Sort
Key Parameters:
key: The field(s) to be used as the sort key.
max-core: Amount of memory to be used per
partition.
Note:
Use the Key Specifier editor to add/modify keys,
sort order and sequences.
Sort order is affected by the sort sequence used.
The character set used can also have an impact on the
sort order.
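For example, the key parameter might hold a key specifier such as the following (the field names are hypothetical; descending is one of the available order modifiers):

{ last_name; income descending }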
COMPONENTS
FILTER BY EXPRESSION
Filters records according to a DML expression.
Location: Transform
Key Settings:
select_expr: The expression defining the
filter.
Note: Use the Expression Editor to add/modify the
DML expression.
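A minimal sketch of a select_expr, assuming hypothetical state and income fields on the input records; records for which the expression evaluates to true are written to the out port, the rest to the deselect port:

state == "CA" and income > 50000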
COMPONENTS
REFORMAT
Changes the format of records by dropping fields,
or by using DML expressions to add fields,
combine fields, or transform the data in the records.
Location: Transform
Key Settings:
count: Number of transformed outputs, n.
transformn: The rules defining the
transformation.
select: The expression specifying any filters
for input.
reject-threshold: Defines the rule to abort the
process.
Note: Use the Expression Editor to add/modify the
DML expression for select.
Use the Transform Editor to add/modify transform
rules.
TRANSFORM FUNCTIONS
A transform function is a collection of rules that specifies how to produce result records
from input records.
The exact behavior of a transform function depends on the component using it.
Transform functions express record reformatting logic.
Transform functions encapsulate a wide variety of computational knowledge that
cleanses records, merges records, and aggregates records.
Transform functions perform their operations on the data records flowing into the
component and write the resulting data records to the out flow.
Example:
The purpose of the transform function in the REFORMAT component is to construct data
records flowing out of the component by:
Using all the fields in the record format of the data records flowing into the component
Creating a new field containing a title (Mr. or Ms.) using the gender field of the data
records flowing into the component
Creating a new field containing a score computed by dividing the income field by 10
Transform functions are created and edited in the GDE Transform Editor.
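A sketch of such a transform in DML (the gender and income input fields come from the example above; the title and score output fields and the "M" code value are assumptions):

out::reformat(in) =
begin
  out.* :: in.*;                                        /* carry all input fields through */
  out.title :: if (in.gender == "M") "Mr." else "Ms.";  /* new field derived from gender */
  out.score :: in.income / 10;                          /* new field computed from income */
end;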
TRANSFORM FUNCTIONS
Transform functions (or transforms) drive nearly all data transformation and
computation in Ab Initio graphs. Typical simple transforms can:
Extract the year from a date.
Combine first and last names into a full name.
Determine the amount owed in sales tax from a transaction amount.
Validate and cleanse existing fields.
Delete bad values or reject records containing invalid data.
Set default values.
Standardize field formats.
Merge or aggregate records.
A transform function is a collection of business rules, local variables, and
statements. The transform expresses the connections between the rules,
variables, and statements, as well as the connections between these
elements and the input and output fields.
TRANSFORM EDITOR
Creating Transforms
Transforms consist of
Rules
Variables
and Statements
Relationships between them are created using the Transform Editor.
The Transform Editor has two views:
Grid view: Rules are created by dragging and dropping fields from the
input flow, functions and operators.
Text view: Rules are created using DML syntax.
TRANSFORM EDITOR
[Grid view: input fields on the left are mapped to output fields on the right through the transform function.]
TRANSFORM EDITOR
Text View:
Typically, you use the text view of the Transform Editor to enter or edit
transforms using DML. In addition, Ab Initio software supports the following
text alternatives:
You can enter the DML into a standalone text file, then include that file in the
transform's package.
You can select Embed and enter the DML code into the Value box on the
Parameters tab of the component Properties dialog.
TRANSFORM FUNCTIONS
Untyped transforms:
out :: trans(in) =
begin
  out.x :: in.a;
  out.y :: in.b + 1;
  out.z :: in.c + 2;
end;
Typed transforms:
decimal(8) out :: add_one(decimal(8) in) =
begin
  out :: in + 1;
end;
record integer(4) x, y; end out :: fiddle(in, double n) =
begin
  out.x :: size_of(in);
  out.y :: n + n;
end;
TRANSFORM EDITOR
Rule:
A rule, or business rule, is an instruction in a transform function that directs
the construction of one field in an output record. Rules can express everything
from simple reformatting logic for field values to complex computations.
Rules are created in the Expression Editor, triggered by right-clicking the rule
line.
Rules can utilize functions and operators to define a specific logic.
TRANSFORM EDITOR
Prioritized Rules:
Priorities can optionally be assigned to the rules for a particular output
field.
They are evaluated in order of priority, starting with the assignment of
lowest-numbered priority and proceeding to assignments of higher-
numbered priority.
The last rule evaluated will be the one with blank priority, which places
it after all others in priority.
A single output field can have multiple rules attached to it.
Prioritized rules are always evaluated in ascending order of
priority.
In Text View:
out :: average_part_cost(in) =
begin
  out :1: if (in.num_parts > 0) in.total_cost / in.num_parts;
  out :2: 0;
end;
TRANSFORM FUNCTIONS
Local Variable:
A local variable is a variable declared within a transform function. You can use local
variables to simplify the structure of rules or to hold values used by multiple rules.
To declare (or modify) local variables, use the Variables Editor.
To initialize local variables, drag and drop the variable from the Variables tab to the
Output pane.
Alternatively, enter the equivalent DML code in the text view of the Transform Editor. For
example:
out::rollup(in) =
begin
  let string(7) myvar = "a";
  let decimal(8) mydec = 0;
end;
Statements:
A statement can assign a value to a local variable, a global variable, or an output field;
define processing logic; or control the number of iterations of another statement.
COMPONENTS
INPUT TABLE
Unloads records from a database into a graph,
allowing you to specify as the source either a
database table or an SQL statement that selects
records from one or more tables.
Location: Database / Datasets
Key Settings:
Config File: Location of the file defining the key
parameters to access the database.
Source: Choose to use a table or an SQL
statement.
Note: Database settings and other prerequisites
need to be completed to create the config file.
COMPONENTS
JOIN
JOIN reads data from two or more input ports,
combines records with matching keys
according to the transform you specify, and
sends the transformed records to the output port.
Additional ports allow you to collect rejected and
unused records. JOIN can have as many as 20
input ports.
Location: Transform
Key Settings:
count: An integer from 2 to 20 specifying the
total number of inputs (in ports) to join.
sorted-input: Specifies whether the input data is sorted.
key: Name(s) of the field(s) in the input records
that must have matching values for JOIN to call
the transform function.
transform: Transform function specifying the
resultant fields from the join.
COMPONENTS
Key Settings (continued):
join-type: Specifies the join method: Inner Join, Outer Join, or
Explicit.
record-requiredn: Dependent setting of join-type.
dedupn: Removes duplicate records on the specified port.
selectn: Acts as a component-level filter for the port.
override-keyn: Alternative name(s) for the key field(s) for a particular
inn port.
driving: Specifies the port that drives the join.
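A minimal sketch of a two-input join transform (count = 2, key {cust_id}); the component and field names are hypothetical:

out::join(in0, in1) =
begin
  out.cust_id :: in0.cust_id;   /* the matching key value */
  out.name    :: in0.name;      /* field taken from the first input */
  out.amount  :: in1.amount;    /* field taken from the second input */
end;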
COMPONENTS
JOIN WITH DB
JOIN WITH DB joins records from the flow or flows
connected to its in port with records read directly from
a database, and outputs new records containing data
based on, or calculated from, the joined records.
Location: Database
Key Settings:
DBConfigFile: A database configuration file. Specifies
the database to connect to.
select_sql: The SELECT statement to perform for each
input record.
SPECIAL TRANSFORMS
AGGREGATE
AGGREGATE generates records that summarize
groups of records.
ROLLUP is the newer version of the AGGREGATE
component. ROLLUP offers better control over
record selection, grouping and aggregation.
VECTORS
A vector is an array of elements, indexed from 0. An element can be a
single field or an entire record. Vectors are often used to provide a
logical grouping of information.
An array is a collection of elements that are logically grouped for ease of
access. Each element has an index by which the element can be
referenced (read or written).
Example:
char myarray[10] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'};
Above is an example of an array with 10 elements of type char. The
elements of this array are addressed as
myarray[0]
The above statement returns the first element, 'a'. Indices run
from 0 to n-1, where n is the declared number of array elements.
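In DML, a vector field is declared with the element type followed by the length in square brackets. A sketch (field names illustrative):

record
  decimal(6) cust_id;
  decimal(8.2)[12] monthly_sales;  /* vector of 12 decimals, indexed 0..11 */
end;

A rule such as out.january :: in.monthly_sales[0]; then reads the first element.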
SPECIAL COMPONENTS
NORMALIZE
NORMALIZE generates multiple output records
from each of its input records. You can directly
specify the number of output records for each input
record, or the number of output records can
depend on some calculation.
Location: Transform
Key Settings:
transform: Either the name of the file containing
the types and transform functions, or a transform
string.
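A NORMALIZE transform typically supplies a length function (how many output records to produce per input record) and a normalize function (how to build each one). A sketch, assuming an input record with a num_items field and an items vector:

out::length(in) =
begin
  out :: in.num_items;              /* one output record per item */
end;

out::normalize(in, index) =
begin
  out.cust_id :: in.cust_id;        /* copy the parent key */
  out.item    :: in.items[index];   /* emit one vector element per output record */
end;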
SPECIAL COMPONENTS
DENORMALIZE SORTED
DENORMALIZE SORTED consolidates
groups of related records by key into a
single output record with a vector field for
each group, and optionally computes
summary fields in the output record for each
group.
Location: Transform
Key Settings:
key: Specifies the name(s) of the key
field(s) the component uses to define
groups of records.
transform: Specifies either the name of
the file containing the transform function, or
a transform string.
PHASES AND PARALLELISM
Parallelism and multifiles
Graphs can be scaled to accommodate any amount of data by introducing parallelism, doing
more than one thing at the same time. Data parallelism, in particular, separates the
data flowing through the graph into divisions called partitions. Partitions can be sent
to as many processors as needed to produce the result in the desired length of time.
Ab Initio multifiles store partitioned data wherever convenient (on different disks, on
different computers in various locations, and so on) and yet manage all the partitions as a
single entity.
Phases and checkpoints
Resources can be regulated by dividing a graph into stages called phases, each of
which must complete before the next begins.
Checkpoints are used at the end of any phase to guard against loss in case of failure. If a
problem occurs, you can return the state of the graph to the last successfully completed
checkpoint and then rerun it without having to start over from the beginning.
You set phases and checkpoints graphically in the GDE.
PARALLELISM
The power of Ab Initio software to process large quantities of data is based
on its use of parallelism: doing more than one thing at the same time.
The Co>Operating System uses three types of parallelism:
Component parallelism
Pipeline parallelism
Data parallelism
COMPONENT PARALLELISM
Component parallelism occurs when program components execute
simultaneously on different branches of a graph.
In the graph above, the CUSTOMERS and TRANSACTIONS datasets are
unloaded and sorted simultaneously, then merged into a dataset named
MERGED INFORMATION.
Component parallelism scales to the number of branches of a graph: the
more branches a graph has, the greater the component parallelism. If a
graph has only one branch, component parallelism cannot occur.
PIPELINE PARALLELISM
Both SCORE and SELECT read records as they become available and
write each record immediately after processing it. After SCORE finishes
scoring the first CUSTOMER record and sends it to SELECT, SELECT
determines the destination of the record and sends it on. At the same time,
SCORE reads the second CUSTOMER record. The two processing stages
of the graph run concurrently; this is pipeline parallelism.
Pipeline parallelism occurs when several connected program components
on the same branch of a graph execute simultaneously.
DATA PARALLELISM
Data parallelism occurs when a graph separates data into multiple
divisions, allowing multiple copies of program components to operate on
the data in all the divisions simultaneously.
The divisions of the data and the copies of the program components that
create data parallelism are called partitions, and a component partitioned
in this way is called a parallel component. If each partition of a parallel
program component runs on a separate processor, the increase in the speed
of processing is almost directly proportional to the number of partitions.
PARTITIONS
The divisions of the data and the copies of the program components that create data
parallelism are called partitions, and a component partitioned in this way is called a
parallel component. If each partition of a parallel program component runs on a separate
processor, the increase in the speed of processing is almost directly proportional to the
number of partitions.
FLOW PARTITIONS
When you divide a component into partitions, you divide the flows that connect to it as well.
These divisions are called flow partitions.
PORT PARTITIONS
The port to which a partitioned flow connects is partitioned as well, with the same
number of port partitions as the flow connected to it.
DEPTH OF PARALLELISM
The number of partitions of a component, flow, port, graph, or section of a graph determines
its depth of parallelism.
PARTITIONS
PARALLEL FILES
Sometimes you will want to store partitioned data in its partitioned state in
a parallel file for further parallel processing. If you locate the partitions of
the parallel file on different disks, the parallel components in a graph can all
read from the file at the same time, rather than being limited by all having to
take turns reading from the same disk. These parallel files are called
multifiles.
MULTIFILE SYSTEM
Multifiles are parallel files composed of individual files, which may be located
on separate disks or systems. These individual files are the partitions of the
multifile. Understanding the concept of multifiles is essential when you are
developing parallel applications that use files, because the parallelization of data
drives the parallelization of the application.
Data parallelism makes it possible for Ab Initio software to process large
amounts of data very quickly. In order to take full advantage of the power of
data parallelism, you need to store data in its parallel state. In this state, the
partitions of the data are typically located on different disks on various
machines. An Ab Initio multifile is a parallel file that allows you to manage all
the partitions of the data, no matter where they are located, as a single entity.
Multifiles are organized by using a multifile system, which has a directory
tree structure that allows you to work with multifiles and the directory structures
that contain them in the same way you would work with serial files. Using an Ab
Initio multifile system, you can apply familiar file management operations to all
partitions of a multifile from a central point of administration, no matter how
diverse the machines on which they are located. You do this by referencing the
control partition of the multifile with a single Ab Initio URL.
MULTIFILE SYSTEM
Multifile
An Ab Initio multifile organizes all partitions of a multifile into one
single virtual file that you can reference as one entity.
An Ab Initio multifile can only exist in the context of a multifile system.
Multifiles are created by creating a multifile system, and then either
outputting parallel data to it with an Ab Initio graph, or using the
m_touch command to create an empty multifile in the multifile system.
Ad Hoc Multifiles
A parallel dataset with partitions that are an arbitrary set of serial
files containing similar data.
Created explicitly by listing a set of serial files as partitions, or by
using a shell expression that expands at runtime to a list of serial files.
The serial files listed can be anything from a serial dataset divided into
multiple serial files to any set of serial files containing similar data.
MULTIFILE SYSTEM
- Visualize a directory tree containing
subdirectories and files.
- Now imagine 3 identical copies
of the same tree located on
several disks, and number them
0 to 2. These are the data
partitions of the multifile system.
- Then add one more copy of the
tree to serve as the control
partition.
- This multifile system is referenced in the GDE using Ab Initio URLs:
mfile://pluto.us.com/usr/ed/mfs3
MULTIFILE SYSTEM
Creating the Multifile system MFS
m_mkfs //pluto.us.com/usr/ed/mfs3 \
       //pear.us.com/p/mfs3-0 \
       //pear.us.com/q/mfs3-1 \
       //plum.us.com/0:/mfs3-2
Multidirectory
m_mkdir //pluto.us.com/usr/ed/mfs3/cust
Multifile
m_touch //pluto.us.com/usr/ed/mfs3/cust/t.out
ASSIGNMENT
Creating a Multidirectory
Open the command console.
Enter the following command:
m_mkdir c:\data\mfs\mfsway\employee
LAYOUTS
A layout of a component specifies:
The locations of files: a URL specifying the location of a file.
The number and locations of the partitions of multifiles: a URL that
specifies the location of the control partition of a multifile.
Every component in a graph, both dataset and program components, has a
layout. Some graphs use one layout throughout; others use several layouts.
The layouts you choose can be critical to the success or failure of a graph. For a
layout to be effective, it must fulfill the following requirements:
The Co>Operating System must be installed on the computers specified by
the layout.
The run host must be able to connect to the computers specified by the
layout.
The working directories in the layout must exist.
The permissions on the directories of the layout must allow the graph to
write files there.
The layout must allow enough space for the files the graph needs to write
there.
During execution, a graph writes various files in the layouts of some or all
of the components in it.
FLOWS
Flows indicate the type of data transfer between the connected
components.
FLOWS
Straight Flow:
A straight flow connects components with the same depth of parallelism,
including serial components, to each other. If the components are serial, the
flow is serial. If the components are parallel, the flow has the same depth of
parallelism as the components.
FLOWS
Fan-out Flow:
A fan-out flow connects a component with a lesser number of partitions to one
with a greater number of partitions; in other words, it follows a one-to-many
pattern. The component with the greater depth of parallelism determines the
depth of parallelism of a fan-out flow.
NOTE: You can only use a fan-out flow when the result of dividing the greater number of partitions by
the lesser number of partitions is an integer. If this is not the case, you must use an all-to-all flow.
FLOWS
Fan-in Flow:
A fan-in flow connects a component with a greater depth of parallelism to one
with a lesser depth; in other words, it follows a many-to-one pattern. As with a
fan-out flow, the component with the greater depth of parallelism determines the
depth of parallelism of a fan-in flow.
NOTE: You can only use a fan-in flow when the result of dividing the greater number of partitions by
the lesser number of partitions is an integer. If this is not the case, you must use an all-to-all flow.
FLOWS
All-to-All Flow:
An all-to-all flow connects components with the same or different degrees of
parallelism in such a way that each output port partition of one component is
connected to each input port partition of the other component.
SANDBOX
A sandbox is a special directory (folder) containing a certain minimum
number of specific subdirectories for holding Ab Initio graphs and related
files. These subdirectories have standard names that indicate their function.
The sandbox directory itself can have any name; its properties are recorded
in various special and hidden files that lie at its top.
File/Dir | Description | Example
bin | Other executables |
db | Database configuration | OraSales.dbc
dml | Non-embedded record formats | r_customers.dml
mp | Graphs | ext_products.mp
plan | Plan>It plans | SalesOrders.plan
resource | Plan>It resource pools |
run | Graphs deployed as ksh | ext_products.ksh
sql | SQL scripts | customers.sql
xfr | Non-embedded transforms | r_customers.xfr
.air-lock | |
.air-project-parameters | Sandbox parameters |
.air-sandbox-overrides | Sandbox override parameters |
project.ksh | First executable; sets variables |
project-end.ksh | Post-completion scripts |
project-start.ksh | Pre-start scripts |
ab_project_setup.ksh | |
SANDBOX
Parts of a graph
Data files (.dat)
The Transactions input dataset and the Transaction Data output
dataset. If these are multifiles, the actual datasets will occupy multiple
files.
Record formats (.dml)
There are two separate record formats in the graph, one for the input
file, the other for the output. These record formats could be
embedded in the Input File and Output File components themselves.
Transforms (.xfr)
The transform function can be embedded in the component, but (as
with the record formats) it can also be stored in a separate file.
Graph file (.mp)
The graph itself is stored complete as an .mp file.
Deployed script (.ksh)
If deployed as a script, the graph will also exist as a .ksh file, which
has to be stored somewhere.
SANDBOX
Sandbox Parameters:
Sandbox parameters are variables which are visible to any component in any
graph stored in that sandbox. Here are some examples of sandbox
parameters:
$PROJECT_DIR
$DML
$RUN
Graphs refer to the sandbox subdirectories by using sandbox parameters.
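For example, rather than hard-coding absolute paths, component parameters typically reference these variables (a sketch reusing the r_customers files from the directory table above):

Record format URL:  $DML/r_customers.dml
Transform URL:      $XFR/r_customers.xfr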
SANDBOX
Sandbox Parameters Editor
SANDBOX
The default sandbox parameters in a GDE-created sandbox are these
eight:
PROJECT_DIR: absolute path to the sandbox directory
DML: relative sandbox path to the dml subdirectory
XFR: relative sandbox path to the xfr subdirectory
RUN: relative sandbox path to the run subdirectory
DB: relative sandbox path to the db subdirectory
MP: relative sandbox path to the mp subdirectory
RESOURCE: relative sandbox path to the resource subdirectory
PLAN: relative sandbox path to the plan subdirectory
GRAPH PARAMETERS
Graph parameters are associated with individual graphs and are
private to them.
They affect the execution only of the graph for which they are defined.
All the specifiable values of a graph, including each component's
parameters (as well as other values such as URLs, file protections, and
record formats) comprise that graph's parameters.
Graph parameters are part of the graph they are associated with and are
checked in or out along with the graph.
Graph parameters are used when a graph needs to be reused, or to
facilitate runtime changes to certain settings of components used in the
graph.
They are created using the Graph Parameters editor.
Values for the graph parameters can be stored in a file (a pset) and supplied
during the execution of the graph.
GRAPH PARAMETERS
Graph Parameters Editor
PARAMETER SETTING
Name:
Value: Unique identifier of the parameter.
Description: String. First character must be a letter; remaining characters can be a combination of letters, digits,
and underscores. Spaces are not allowed.
Example: OUT_DATA_MFS
Scope:
Value: Local or Formal.
Description:
Local: parameter receives its value from the Value column.
Formal: parameter receives its value at runtime from the command line. When run from the GDE, a dialog
appears prompting for input values.
Kind:
Sandbox
Value: Unspecified or Keyword.
Description:
Unspecified: parameter values are directly assigned on the command line.
Keyword: the value for a keyword parameter is identified by a keyword preceding it. The
keyword is the name of the parameter.
Graph
Value: Positional, Keyword or Environment.
Description:
Positional: the value for a positional parameter is identified by its position on the command line.
Keyword: same as the Sandbox setting.
Environment: parameter receives its value from the environment.
PARAMETER SETTING
Type:
Value: Common Project, Dependent, Switch or String.
Description:
Common Project: to include other shared, or common, project values within the current
sandbox.
Dependent: parameters whose values depend on the value of a switch parameter.
Switch: the purpose of a switch parameter is to allow you to change your sandbox's context.
String: a normal string representing record formats, file paths, etc.
Dependent on:
Value: Name of the switch parameter.
Value:
Value: The parameter's value, consistent with its type.
Interpretation:
Value: Constant, $ substitution, ${} substitution, or Shell.
Description: Determines how the string value is evaluated.
Required:
Value: Required or Optional.
Description: Specifies whether the value is required or optional.
Export:
Value: Checked/Unchecked.
Description: Specifies whether the parameter/value needs to be exported as an environment variable.
ORDER OF EVALUATION
When you run a graph, parameters are evaluated in the following order:
1. The host setup script is run.
2. Common (that is, included) project (sandbox) parameters are
evaluated.
3. Project (sandbox) parameters are evaluated.
4. The project-start.ksh script is run.
5. Formal parameters are evaluated.
6. Graph parameters are evaluated.
7. The graph start script is run.
GRAPH PARAMS V/S SANDBOX PARAMS
Graph parameters are visible only to the particular graph to which they
belong.
Sandbox parameters are visible to all the graphs stored in a particular
sandbox.
Graph parameters are created via Edit > Parameters in the Graph window.
Sandbox parameters are automatically created (defaults) and can be
edited via Project > Edit Sandbox,
or by editing the .air-project-parameters file in the root directory of the
sandbox.
Graph parameters are set after sandbox parameters. If a graph
parameter and a sandbox parameter share the same name, the graph
parameter has the higher precedence.
PARAMETER NTERPRETATON
The vaIue of a parameter often contains references to other parameters. This attribute specifies how
you want such references to be handIed in this parameter. There are four possibiIities:
substitution:
This specifies that if the value for a parameter contains the name of another parameter preceded by a
dollar sign (), the dollar sign and name are replaced with the value of the parameter that's referred to. n
other words, parameter is replaced by the value of parameter. parameter is said to be "a $ reference" or
"dollar-sign reference to parameter".
< substitution:
${} substitution is similar to $ substitution, except that it additionally requires that you surround the name of
the referenced parameter with curly braces <. f you do not use curly braces, no substitution occurs,
and the dollar sign (and the name that follows it) is taken literally. You should use ${} substitution for
parameters that are likely to have values that contain as a character, such as names of relational
database tables.
constant:
The GDE uses the parameter's specified value literally, as is, with no further interpretation. f the value
contains any special characters, such as , the GDE surrounds them with single quotes in the deployed
shell script, so that they are protected from any further interpretation by the shell.
shell:
The shell specified in the Host Settings dialog for the graph, usually the Korn shell, interprets the parameter's value. Note that the value is not a full shell command; rather, it is the equivalent of the value assigned in the definition of a shell variable.
PDL:
The Parameter Definition Language (PDL) is available as an interpretation method only for the value of a dependent graph parameter. PDL is a simple set of notations for expressing the values of parameters in components, graphs, and projects.
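A minimal sketch, assuming PDL's inline-computation notation $[ ] (which evaluates a DML expression when the parameter is resolved; the parameter names here are invented):

  OUTPUT_FILE = daily_$[string_upcase(REGION)].dat   resolves to daily_EU.dat when REGION is "eu"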
LOOKUP
A lookup file is a file of data records that is small enough to fit in main memory, letting a transform function retrieve records much more quickly than it could if they were stored on disk. Lookup files associate key values with corresponding data values to index records and retrieve them.
Ab Initio's lookup components are dataset components with special features.
Data in lookups is accessed using specific functions inside transforms.
Indexes are used to access the data when called by lookup functions.
The data is loaded into memory for quick access.
A common use for a lookup file is to hold data frequently used by a transform component. For example, if the data records of a lookup file contain numeric product codes and their corresponding product names, a component can rapidly search the lookup file to translate codes to names or names to codes.
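For example, a minimal transform sketch (assuming a LOOKUP FILE component labeled "Products", keyed on product_code, with fields product_code and product_name):

  out :: translate_code(in) =
  begin
    out.product_code :: in.product_code;
    // returns the first record whose key matches in.product_code
    out.product_name :: lookup("Products", in.product_code).product_name;
  end;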
LOOKUP
Guidelines:
Unlike other dataset components, LOOKUP FILE is not connected to other components in a graph. It has no ports. However, its contents are accessible from other components in the same phase or later phases of a graph.
LOOKUP FILE components are referenced in other components by calling the lookup, lookup_count, or lookup_next DML functions in any transform function or expression parameter.
The first argument to these lookup functions is the name of the LOOKUP FILE. The remaining arguments are values to be matched against the fields named by the key parameter. The lookup functions return a record that matches the key values and has the format given by the RecordFormat parameter.
A file to be used as a LOOKUP FILE must be small enough to fit into main memory.
Information about LOOKUP FILE components can be stored in a catalog, so you can share the components with other graphs.
LOOKUP
COMPONENTS:
LOOKUP FILE
LOOKUP TEMPLATE
WRITE LOOKUP
WRITE BLOCK COMPRESSED LOOKUP
LOOKUP
DML lookup functions locate records in a lookup file based on a key. The lookup can find an exact value, a regular expression (regex) that matches the lookup expression, or an interval (range) in which the lookup expression falls.
Exact:
A lookup file with an exact key allows lookups that compare one value to another. The lookup expression (in the lookup function) must evaluate to a value, and can be composed of any number of key fields.
Regex:
Lookups can be configured for pattern matching: the records of the lookup file have regular expressions in their key fields, and you specify that the key is regex.
Interval:
The lookup expression is considered to match a lookup file record if its value matches any value in the range (i.e., interval). Another way to state this: the lookup function compares a value to a series of intervals in a lookup file to determine in which interval the value falls.
LOOKUP
Interval Lookup:
Intervals defined using one field in each lookup record:
interval is specified to imply that the lookup file is set up such that each key comprises only one field: that field contains a value that is both the upper endpoint of the previous interval and the lower endpoint of the current interval.
Intervals defined by two fields in each lookup record:
Intervals specified with interval_top and interval_bottom imply that each lookup file record contains two key fields, one holding the bottom endpoint of the range; the other, the top.
LOOKUP
Key specifier types:

interval
Specifies that the field holds a value that is both:
the lower endpoint of the current interval (inclusive), and
the upper endpoint of the previous interval (exclusive).

interval_bottom
Specifies that the field holds a value that is the lower endpoint of an interval (inclusive).
The interval_bottom keyword must precede the interval_top keyword.

interval_top
Specifies that the field holds a value that is the upper endpoint of an interval (inclusive).
Note that interval_top must follow the interval_bottom keyword.

inclusive (default)
Modifies interval_bottom and interval_top to make them inclusive.
Use only with interval_top and interval_bottom.

exclusive
Modifies interval_bottom and interval_top to make them exclusive.
Use only with interval_top and interval_bottom.
LOOKUP
Each record in the lookup file represents an interval, that is, a range of values. The lower and upper bounds of the range are usually given in two separate fields of the same type.
A key field marked interval_bottom holds the lower endpoint of the interval.
A key field marked interval_top holds the upper endpoint.
If a field in the lookup file's key specifier is specified as interval, it must be the only key field for that lookup file. You cannot specify a multipart key as an interval lookup.
The file must contain well-formed intervals:
For each record, the value of the lower endpoint of the interval must be less than or equal to the value of the upper endpoint.
The intervals must not overlap.
The intervals must be sorted into ascending order.
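A sketch of a two-field interval lookup (names invented; the key-specifier syntax is hedged from the keyword table above): a "Rates" lookup file mapping sale-amount brackets to commission rates:

  record format:  record decimal(10) low; decimal(10) high; decimal(6.2) pct; end;
  key specifier:  {low interval_bottom; high interval_top}
  data (sorted, non-overlapping): (0, 999, 2.50) (1000, 9999, 5.00) (10000, 99999, 7.50)

  out.pct :: lookup("Rates", in.sale_amount).pct;  // returns the record whose interval contains in.sale_amount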
LOOKUP
Ways to use Lookup:
Static: LOOKUP FILE component.
The lookup file is read into memory when its graph phase starts and stays in memory until the phase ends.
You set up this kind of lookup by placing a LOOKUP FILE component in the graph. Unlike other dataset components, the LOOKUP FILE component has no ports; it does not need to be connected to other components. However, its contents are accessible from other components in the same or later phases of the graph.
A LOOKUP FILE component lookup has the advantage of being simple to use: you add a LOOKUP FILE component to a graph, set its properties, and then call a lookup function.
LOOKUP
Dynamic:
The lookup file (or just its index if the file is block-compressed) is read into memory only when you call a loading function from within a transform; you later unload the lookup file by calling a lookup_unload function.
A dynamically loaded lookup provides these advantages:
You can specify the lookup file name at runtime rather than when you create the graph.
You can avoid having multiple lookup files or indexes take up a lot of memory; this is especially relevant in a graph that uses many lookup files.
A graph need load only the relevant partition of a lookup file rather than the entire file; this saves memory in a graph where a component and lookup file are identically partitioned.
LOOKUP
Catalogued Lookup:
Information about lookup files can be stored in a catalog, so that you can share the lookup files among multiple graphs by simply sharing the catalog. A catalog is a file that stores a list of lookup files on a particular run host. To make a catalog available to a graph, you use the Shared Catalog Path on the Catalog tab of the Run Settings dialog to indicate where to find the shared catalog. The Catalog tab offers four modes:
Ignore shared catalog, use lookup files only (default) - Ignores the shared catalog.
Use shared catalog and lookup files - Accesses the files in the shared catalog along with lookup files in the graph.
Add lookup files to shared catalog permanently - Adds lookup files in the graph to the shared catalog, then uses the shared catalog.
Create new shared catalog from lookup files - Removes all lookup files from the shared catalog and replaces them with the lookup files in the graph.
LOOKUP
Uncompressed lookup files:
Any file can serve as an uncompressed lookup file if the data is not compressed and has a field defined as a key. These files can also be created using the WRITE LOOKUP component.
The component writes two files:
a file containing the lookup data, and
an index file that references the data file.
Block-compressed lookup files:
Lookup files written in compressed, indexed blocks. These files are created using the WRITE BLOCK COMPRESSED LOOKUP component.
The component writes two files:
a data file containing the blocks of compressed data, and
an index file that contains indexes to the blocks.
When you load a block-compressed lookup file you are actually loading its index file. The compressed data remains on disk.
LOOKUP
Lookup files can be either serial or partitioned (multifiles).
Serial:
When a serial lookup file is accessed in a component, the Co>Operating System loads the entire file into memory. In most cases the file is memory-mapped: every component in your graph that uses that lookup file (and is on the same machine) shares a copy of that file. If the component is running in parallel, all partitions on the same machine also share a single copy of the lookup file.
lookup, lookup_count, lookup_match, lookup_next, and lookup_nth are used with serial files.
Partitioned:
Partitioning a lookup file is particularly useful when a multifile system is partitioned across several machines. The depth of the lookup file (the number of ways it is partitioned) must match the layout depth of the component accessing it.
lookup_local, lookup_count_local, lookup_match_local, lookup_next_local and lookup_nth_local are used with partitioned lookup files.
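A minimal sketch (names invented): if a component runs at the same depth as a lookup multifile "CustSegments", both partitioned on cust_id, each partition consults only its local portion of the file:

  out.segment :: lookup_local("CustSegments", in.cust_id).segment;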
LOOKUP
Lookup Functions:
lookup[_local]
record lookup[_local](string file_label [ , expression ... ])
Returns the first matching data record, if any, from a [partition of a] lookup file.
lookup_count[_local]
unsigned long lookup_count[_local](string file_label [ , expression ... ])
Returns the number of matching data records in a [partition of a] lookup file.
lookup_match[_local]
unsigned long lookup_match[_local](string file_label [ , expression ... ])
Looks up a specified match in a [partition of a] lookup file.
LOOKUP
lookup_next[_local]
record lookup_next[_local](string file_label)
Returns the next successive matching record, if any, from a [partition of a] lookup file after a successful call to lookup or lookup_count.
lookup_nth[_local]
record lookup_nth[_local](string file_label, unsigned integer record_num)
Returns a specific record from a [partition of a] lookup file.
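A sketch combining these functions to process multiple matches (names invented; the null handling of unmatched rules is simplified):

  out :: summarize(in) =
  begin
    // lookup_count also establishes the match set that lookup_next iterates over
    let integer(8) n = lookup_count("Orders", in.cust_id);
    out.order_count :: n;
    out.first_order_id  :: if (n >= 1) lookup_next("Orders").order_id;
    out.second_order_id :: if (n >= 2) lookup_next("Orders").order_id;
  end;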
LOOKUP
Dynamically loaded lookup:
Unlike the normal lookup process, dynamically loaded lookups load the dataset only when it is called by a lookup_load function in any transformation in the graph.
This provides an opportunity to dynamically specify which file to use as a lookup.
Dynamically loaded lookups are used in a specific sequence:
Prepare a LOOKUP TEMPLATE component.
Use lookup_load to load the dataset and its index.
lookup(lookup_id, "MyLookupTemplate", in.expression)
Call the standard lookup function within a transform.
Unload the lookup using the lookup_unload function.
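A sketch of the full sequence in a transform (hedged: the exact lookup_load signature and the identifier type name may differ across Co>Operating System versions; file and field names are invented):

  out :: enrich(in) =
  begin
    // load the file chosen at runtime, described by the LOOKUP TEMPLATE "MyLookupTemplate"
    let lookup_identifier_type id = lookup_load(in.lookup_path, "MyLookupTemplate");
    out.name :: lookup(id, in.code).name;  // standard lookup via the returned identifier
    lookup_unload(id);                     // release the memory when finished
  end;

In practice you would load once rather than per record, for example by guarding the lookup_load call so it runs only for the first record.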
LOOKUP
Tips:
A lookup file works best if it has a fixed-length record format, because this creates a smaller index and thus faster lookups.
When creating a lookup file from source data, drop any fields that are not keys or lookup results.
A lookup file represented by a LOOKUP FILE component must fit into memory. If a file is too large to fit into memory, use INPUT FILE followed by MATCH SORTED or JOIN instead.
If the lookup file is large, partition the file on its key the same way the input data is partitioned. Then use the functions lookup_local, lookup_count_local, or lookup_next_local. For more information, see "Partitioned lookup files".