
ABINITIO TRAINING


DAY ONE

- Introduction to Data warehouse
- ETL
- AbInitio
- AbInitio Features
- Architecture
- GDE
- CO>Operating System
- EME
- Setting up Environment
- Data set types and Components
- Data types and DML
- I/P File, O/P file, Intermediate file and Lookup file
- Filter by Expression, Replicate, Reformat and Redefine

Introduction to Data warehouse


A Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

ETL

- Extract: reading the source data.
- Transform: applying business, transformation, and technical rules.
- Load: loading the data into the target.

AbInitio
"Ab Initio" is Latin for "from the beginning".

AbInitio software is a general-purpose data processing platform for mission-critical applications such as:
- Data warehousing
- Batch processing
- Click-stream analysis
- Data movement
- Data transformation

AbInitio Features

- Transformation of disparate sources.
- Aggregation and other processing.
- Referential integrity checking.
- Database loading.
- Extraction for external processing.
- Aggregation and loading of data marts.
- Processing of just about any form and volume of data.
- Parallel sort/merge processing.
- Data transformation.
- Re-hosting of corporate data.
- Parallel execution of existing applications.

Architecture
From top to bottom, the AbInitio stack consists of:
- User applications
- Development environments: the GDE and the shell
- Component library: built-in components, user-defined components and 3rd-party components
- AbInitio Co>Operating System, alongside the EME repository
- Native operating system

GDE

The GDE (Graphical Development Environment) is the design environment in which applications are built as dataflow graphs by wiring components together; finished graphs are executed by the Co>Operating System.

CO>Operating System

- Parallel and distributed application execution.
- Control.
- Data transport.
- Transactional semantics at the application level.
- Checkpointing.
- Monitoring and debugging.
- Parallel file management.
- Metadata-driven components.

CO>Operating System

AbInitio Co>Operating System runs on:
- Sun Solaris
- IBM AIX
- Hewlett-Packard HP-UX
- Siemens Pyramid Reliant Unix
- IBM DYNIX/ptx
- Silicon Graphics IRIX
- Red Hat Linux
- Windows NT 4.0 (x86)
- Windows 2000 (x86)
- Compaq Tru64 UNIX
- IBM OS/390
- NCR MP-RAS

EME
The EME (Enterprise Meta>Environment) is a repository used for version control and for documentation.

Setting up Environment


Data set types and Components


- Data set components
- Flow components
- Transform components
- Partitioning components

Data types and DML


Base types:
- void
- number (integer, decimal, real)
- string
- date
- datetime

Compound types:
- vector
- record
- union

DML
DML (Data Manipulation Language) is used to define the complete record structure.
- Can be defined either in grid mode or in text mode.
- Can be stored in a file that can be referenced multiple times, or can be embedded in the graph.
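As a minimal sketch, a DML record format for a fixed-length customer file might look like this (field names and sizes are illustrative):

    record
      decimal(8)       customer_id;    /* 8-character decimal key      */
      string(30)       name;           /* fixed-length character field */
      date("YYYYMMDD") dob;            /* date with an explicit format */
      string(1)        newline;        /* record terminator            */
    end;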

I/P File, O/P file, Intermediate file and Lookup file


- Input File: reads data records from a serial file or multifile in the file system.
- Output File: writes data records to a serial file or multifile in the file system.
- Intermediate File: writes data records to a file in the middle of a graph; helps in debugging and in further processing of the intermediate data.
- Lookup File: represents one or more serial files or a multifile of data records small enough to be held in main memory, letting a transform function retrieve records much more quickly than it could if they were stored on disk. A lookup file is not connected to other components in the graph.
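Inside a transform, a lookup file is read with the built-in lookup() function rather than through a flow. A minimal sketch, assuming a lookup file named "Customers" keyed on customer_id (both names hypothetical):

    out :: reformat(in) =
    begin
      /* fetch the matching record from the in-memory lookup file */
      out.customer_name :: lookup("Customers", in.customer_id).name;
      out.order_id      :: in.order_id;
      out.amount        :: in.amount;
    end;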

Filter By Expression, Replicate, Reformat and Redefine


- Filter by Expression: enables the user to track down a particular record or records, or to put together a sample of records to assist with analysis. Filters the data with an expression that selects only the records you need; can also be used for data validation (see the sketch below).
- Replicate: used when you want to make multiple copies of a flow for separate processing.
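The select_expr parameter of Filter by Expression is an ordinary DML boolean expression. A sketch that keeps only valid records (field names hypothetical):

    amount > 0 and is_defined(customer_id) and not is_blank(name)

Records that fail the test leave through the component's deselect port.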

Reformat: changes the record format of data records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records. Manipulates one record at a time and does work like validation and cleansing, e.g. deleting bad values, setting default values, standardizing field formats, or rejecting records with invalid dates. Transformation rules are defined in the transform parameter (transform0). A common use of Reformat is to clean input data so that all of the records conform to the same convention.
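A minimal sketch of a Reformat transform function (field names hypothetical):

    out :: reformat(in) =
    begin
      out.customer_id :: in.customer_id;
      /* combine two fields into one */
      out.full_name   :: string_concat(in.first_name, " ", in.last_name);
      /* set a default for a blank field */
      out.status      :: if (is_blank(in.status)) "U" else in.status;
    end;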

Redefine: copies data records from its input to its output without changing the values in them; used to change or rename fields in a record format without changing the values in the records.

DAY TWO

- Sort, Sort within Group, Dedup Sort
- Rollup and Scan
- Reject, Error Handling and Debugging

Sort, Sort within Group, Dedup Sort


Sort: used to order a group of records by a key. Looks at all the records in the flow before it produces the final output.

Sort, Sort within Group, Dedup Sort


Sort Within Group: refines the order of an already sorted dataset by sorting further according to a minor key parameter within the order specified by a major key parameter, i.e. it imposes an order on the records of each major-key group according to the minor key.
Sort, Sort within Group, Dedup Sort


Dedup Sort: used to remove duplicate records (a group of records that share the same key), keeping a single record. How it is used:
1. First sort the data.
2. Set the key for grouping in the Dedup component.
3. Finally, choose which duplicate to keep (first, last, or unique-only).

Rollup and Scan


Rollup: produces a single record from a group of records identified by a common key (or keys). Useful for summarizing groups of records, e.g. totals, averages, max, min.
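In its simple (template) form, the Rollup transform applies built-in aggregation functions per key group. A sketch, with hypothetical field names:

    out :: rollup(in) =
    begin
      out.customer_id :: in.customer_id;
      out.total_spent :: sum(in.amount);      /* sum over the key group */
      out.max_amount  :: max(in.amount);      /* largest single amount  */
      out.purchases   :: count(1);            /* number of records      */
    end;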

Rollup and Scan


Scan: generates a series of cumulative summary records, such as successive year-to-date totals, for groups of data records. Unlike Rollup, which emits one summary per group, Scan produces an intermediate (running) summary record for every input record.
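In its expanded form, a Scan carries running state through initialize/scan/finalize functions. A sketch of a running total per customer (field names hypothetical):

    type temporary_type =
      record
        decimal(18) running_total;
      end;

    temp :: initialize(in) =
    begin
      temp.running_total :: 0;
    end;

    temp :: scan(temp, in) =
    begin
      temp.running_total :: temp.running_total + in.amount;
    end;

    out :: finalize(temp, in) =
    begin
      out.customer_id :: in.customer_id;
      out.ytd_total   :: temp.running_total;   /* emitted for every record */
    end;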

Reject, Error Handling and Debugging

- Invalid data goes to a component's reject port.
- The reject-threshold parameter inside a component controls how many rejects are tolerated before the component aborts.
- The GDE has a built-in debugger capability.
- Add a watcher file to a flow to inspect the records that pass through it.

DAY THREE
- Join
- Multi Files
- Parallelism
- Partition and De-Partition
- Layout, Fan-in, Fan-out and All-to-All

Join
Join: used to combine data from two or more flows of records based on a matching key (or keys). Join deals with two activities:
1. Transforming data sources with different record formats.
2. Combining data sources with the same record format.

Join

Join types: Inner Join, Full Outer Join, Explicit Join.
- Inner Join: uses only records with matching keys on both inputs.
- Full Outer Join: uses all records from both inputs; if a record from one input has no matching record in the other, a NULL record stands in for the missing record.
- Explicit Join: uses all records from one specified input (chosen via True/False settings), while matching records in the other inputs are optional; again, a NULL record stands in for missing records.
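The join transform function receives one argument per input port and builds the output record. A minimal sketch (field names hypothetical):

    out :: join(in0, in1) =
    begin
      out.customer_id :: in0.customer_id;
      out.name        :: in0.name;
      out.order_total :: in1.total;
    end;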

Multi Files

- Essentially a global view of a set of ordinary files, each of which may be located anywhere the AbInitio Co>Operating System is installed.
- Each partition of a multifile is an ordinary file, residing in one of a set of multidirectories.
- Identified using URL syntax with mfile: as the protocol part.
- One control file ties the partitions together.
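Multifiles are managed with the Co>Operating System's m_* shell utilities. A sketch, assuming a hypothetical 2-way multifile system (hosts, paths and exact argument forms are illustrative):

    # create a multifile system: control directory first, then data directories
    m_mkfs //host0/u/mfs //host1/u/mfs_p0 //host2/u/mfs_p1
    # list its contents
    m_ls mfile://host0/u/mfs
    # copy a serial file in, creating a 2-way multifile
    m_cp customers.dat mfile://host0/u/mfs/customers.dat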

Parallelism

Parallelism means processing datasets in parallel for better performance. Types of parallelism:
1. Component
2. Pipeline
3. Data

Component parallelism: more than one component running at the same time on different data streams. Comes for free with graph programming. Limitation: scales only to the number of branches in a graph.

Parallelism
Pipeline parallelism: two or more connected components processing data record by record. Limitations: scales only to the length of the branches in a graph, and some operations, like sorting, do not pipeline.

Data parallelism: occurs when multiple copies of a process act on different sets of data at the same time, processing the whole more quickly by using multiple CPUs simultaneously.

Partition and De Partition


Partition components are used to divide data sets into multiple sets for further processing. Types:
- Partition by Expression: divides data according to a DML expression.
- Partition by Key: groups data by a key, like dealing cards into piles according to their suit.

Partition and De Partition

- Partition with Load Balance: partitions data by dynamic load balancing; more data goes to CPUs that are less busy and vice versa, maximizing throughput.
- Partition by Percentage: distributes data so that the output is proportional to fractions of 100.
- Partition by Range: divides data evenly among nodes, based on a key and a set of partitioning ranges.
- Partition by Round-robin: distributes data evenly, in block-size chunks, across the output partitions, like dealing cards.

Partition and De Partition


De-partition components read data from multiple flows and recombine the data records into a single flow; the opposite of partitioning. Types:
- Concatenate: produces a single output flow that contains all the records from the first input partition, then all the records from the second, and so on.
- Gather: collects inputs from multiple partitions in an arbitrary order and produces a single output flow. It does not maintain sort order, but is the most efficient departitioner.

Partition and De Partition


- Interleave: collects records from many sources in round-robin fashion; the effect is like taking a card from each player in turn, forming a deck of cards.
- Merge: collects inputs from multiple sorted partitions and maintains the sort order.

Layout, Fan-in, Fan-out and All-to-All


- Layout: determines the location of a resource; either serial or parallel.
- Fan-In: the flow pattern drawn when a departition component collects data from multiple partitioned flows into fewer partitions.
- Fan-Out: the flow pattern drawn when a partition component divides a dataset into multiple sets for further processing.
- All-to-All: the flow pattern drawn when partitioned data is repartitioned, so every source partition may send records to every destination partition.

DAY FOUR
- DBC File, Input Table, Output Table, Join with DB
- Sub Graph, Phasing, Check Point, Recovery
- Normalize, Denormalize Sorted

DBC File, Input Table, Output Table, Join with DB


DBC File: required by AbInitio to connect to any database system; by default it has the extension .dbc. DBC file fields:
- dbms_version: the version of your database.
- db_home: the location of your database software (e.g. ORACLE_HOME).
- db_name: the identifier for your database instance. For Oracle, this is the value of the ORACLE_SID environment variable; for SQL*Net, use @db_name.
- db_nodes: a list of database-accessible nodes with Ab Initio installed. Note: if Oracle is on an SMP machine you usually use one host name, unless you are running Oracle OPS (parallel), in which case you may need a list of all the nodes the database runs on.
- #user and #password: comment fields listing your user name and password. If your database is Oracle and you are identified externally, leave these fields as comments.
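Putting those fields together, a minimal Oracle .dbc might look like the sketch below (all values are illustrative, and the exact field syntax varies by database and Co>Operating System version):

    dbms_version: 9.2
    db_home: /u01/app/oracle/product/9.2
    db_name: @ORCL
    db_nodes: dbhost1
    #user: scott
    #password: tiger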

DBC File, Input Table, Output Table, Join with DB


Input Table: unloads data records from a database into an AbInitio graph, allowing you to specify as the source either a database table or an SQL statement that selects data records from one or more tables.
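For example, the source might be given as a SELECT statement joining two tables (table and column names hypothetical):

    SELECT c.customer_id, c.name, o.order_id, o.total
    FROM   customers c, orders o
    WHERE  c.customer_id = o.customer_id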

DBC File, Input Table, Output Table, Join with DB


Output Table: loads data records from a graph into a database. Specify the destination either directly as a single database table, or through an SQL statement that inserts records into one or more tables. By default, it calls the database's fast loader to perform the output operation(s).

DBC File, Input Table, Output Table, Join with DB


Join with DB: Joins records from the flow or flows connected to its input port with records read directly from a database, and outputs new records containing data based on, or calculated from, the joined records.


Sub Graph, Phasing, Check Point, Recovery


- Sub Graph: a logical subset of a graph; used for manageability.
- Phasing: breaking an application into separate processing units. Breaking an application into phases limits the contention for main memory and processor(s); the cost is disk space, since data is landed to disk at each phase break.

Sub Graph, Phasing, Check Point, Recovery


- Check Point: any phase break can be made a checkpoint, at which the state of the application is committed to disk.
- Recovery: if the graph fails, it can be restarted from the last completed checkpoint rather than from the beginning.

Normalize, Denormalize Sorted


Normalize: generates multiple output records from each input record; you can specify the number of output records, or the number can depend on a field or fields in each input record. Typically used to separate a record with a vector field into several individual records, each containing one element of the vector. Generates the series of output records by calling a transform function repeatedly.
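A Normalize transform supplies a length function (how many output records per input record) and a normalize function (how to build each one). A sketch that splits a vector of transactions (field names hypothetical):

    out :: length(in) =
    begin
      out :: length_of(in.transactions);      /* one output per element */
    end;

    out :: normalize(in, index) =
    begin
      out.customer_id :: in.customer_id;
      out.amount      :: in.transactions[index].amount;
    end;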

Normalize, Denormalize Sorted


Denormalize Sorted: consolidates groups of related data records into a single output record with a vector field for each group, optionally computing summary fields in the output record for each group. Denormalize Sorted requires grouped input.

DAY FIVE
- Memory Management
- Dead Lock
- Sandbox Setting, Graph and Project Parameters
- User-defined and Built-in Functions

Memory Management
- Memory is required for Sorting, Rollup and Join.
- "Input must be sorted" vs. in-memory: these components can either stream over pre-sorted input using little memory, or hold whole key groups in main memory.
- AI_GRAPH_MAX_CORE_SETTING governs the max-core value, i.e. how much memory such a component may use before spilling to disk.

Dead Lock
How to avoid deadlock:
- Use Concatenate and Merge with care.
- Use flow buffering (the GDE default for a new graph; "Automatic Flow Buffering" is enabled).
- Insert a phase break before the departitioner.
- Don't serialize data unnecessarily; repartition instead of departitioning.

Sandbox Setting, Graph and Project Parameter


Sandbox Setting: the workspace is called a sandbox. Setting up a standard working environment helps a development team (or any other team) work together, and allows an application to be designed to be portable.

Sandbox Setting, Graph and Project Parameter


Default sandbox directories:
- $AI_RUN: run directory
- $AI_DML: record format files
- $AI_XFR: transform files
- $AI_MP: graphs
- $AI_DB: database config files

Sandbox Setting, Graph and Project Parameter


A parameter is simply a name-value pair with a number of additional attributes.
- Parameters that reside in your sandbox are known as sandbox parameters; they set the context of your sandbox.
- Parameters that reside in the repository are called project parameters.
- Graph parameters apply only to the graph in which they are defined.
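For portability, components reference files through these parameters rather than through hard-coded paths. A sketch (file names hypothetical):

    $AI_DML/customer.dml          record format used by an Input File component
    $AI_XFR/clean_customer.xfr    transform applied by a Reformat component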

User defined function and Built-in functions


- User-defined functions work like AbInitio built-in functions.
- They are globally usable across applications.
- Like built-in transforms, they are stored as .xfr files.
- Built-in function examples: next_in_sequence(), is_blank(), is_defined(), etc.
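A sketch of a user-defined function saved in an .xfr file (name and body hypothetical):

    /* clean_name.xfr: trim surrounding blanks and upper-case a string */
    out :: clean_name(s) =
    begin
      out :: string_upcase(string_lrtrim(s));
    end;

Once the .xfr file is included, clean_name() can be called from any transform, just like a built-in function.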
