DWH Training Material
Version 1.0

REVISION HISTORY

Date         Version  Description       Author / Contributor
01-Nov-2004  1.0      Initial Document
14-Sep-2010  1.1      Updated Document
Table of Contents

1 Introduction
  1.1 Purpose
2 ORACLE
  2.1 DEFINITIONS
    NORMALIZATION
      First Normal Form
      Second Normal Form
      Third Normal Form
      Boyce-Codd Normal Form
      Fourth Normal Form
    ORACLE SET OF STATEMENTS
      Data Definition Language (DDL)
      Data Manipulation Language (DML)
      Data Querying Language (DQL)
      Data Control Language (DCL)
      Transactional Control Language (TCL)
      Syntaxes
    ORACLE JOINS
      Equi Join/Inner Join
      Non-Equi Join
      Self Join
      Natural Join
      Cross Join
      Outer Join
      Left Outer Join
      Right Outer Join
      Full Outer Join
    What's the difference between View and Materialized View?
      View
      Materialized View
      Inline view
    Indexes
    Why are hints required?
    Explain Plan
    Stored Procedure
    Packages
    Triggers
    Data files Overview
  2.2 IMPORTANT QUERIES
3 DWH CONCEPTS
    What is BI?
4 ETL-INFORMATICA
5 UNIX
2 ORACLE

2.1 DEFINITIONS

Organizations can store data on various media and in different formats, such as a hard-copy document in a filing cabinet or data stored in electronic spreadsheets or in databases.

A database is an organized collection of information.

To manage databases, you need database management systems (DBMS). A DBMS is a program that stores, retrieves, and modifies data in the database on request.
NORMALIZATION:
Some Oracle databases were modeled according to the rules of
normalization that were intended to eliminate redundancy.
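As an illustration, a sketch of what normalization looks like in DDL (the table and column names here are ours, not from the data model used elsewhere in this document):

CREATE TABLE emp_denorm (
  empno NUMBER(4) PRIMARY KEY,
  ename VARCHAR2(30),
  add1  VARCHAR2(40),  -- repeating address columns: violates first normal form
  add2  VARCHAR2(40)
);

-- Normalized equivalent: the repeating group moves to a child table
CREATE TABLE emp_n (
  empno NUMBER(4) PRIMARY KEY,
  ename VARCHAR2(30)
);

CREATE TABLE emp_address (
  empno   NUMBER(4) REFERENCES emp_n (empno),
  address VARCHAR2(40),
  PRIMARY KEY (empno, address)  -- one row per address removes the redundancy
);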
Data Manipulation Language (DML)
  Insert
  Update
  Delete
Data Querying Language (DQL)
  Select
Data Control Language (DCL)
  Grant
  Revoke
Transactional Control Language (TCL)
  Commit
  Rollback
  Savepoint
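A minimal sketch exercising these statement categories (emp is the sample table used elsewhere in this document; the column list is an assumption):

INSERT INTO emp (empno, ename, sal) VALUES (100, 'stev', 1000);  -- DML
SAVEPOINT after_insert;                                          -- TCL
UPDATE emp SET sal = 2000 WHERE empno = 100;                     -- DML
ROLLBACK TO after_insert;      -- undoes only the update
COMMIT;                        -- makes the insert permanent
GRANT SELECT ON emp TO scott;  -- DCL
SELECT * FROM emp;             -- DQL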
Syntaxes:

Synonym:
CREATE OR REPLACE SYNONYM HZ_PARTIES FOR SCOTT.HZ_PARTIES;

Database link:
CREATE DATABASE LINK CAASEDW CONNECT TO ITO_ASA IDENTIFIED BY exact123 USING 'CAASEDW';

Materialized View syntax:
CREATE MATERIALIZED VIEW EBIBDRO.HWMD_MTH_ALL_METRICS_CURR_VIEW
REFRESH COMPLETE
START WITH SYSDATE
NEXT TRUNC(SYSDATE+1) + 4/24
WITH PRIMARY KEY
AS
SELECT * FROM HWMD_MTH_ALL_METRICS_CURR_VW;

Another method to refresh:
EXEC DBMS_MVIEW.REFRESH('MV_COMPLEX', 'C');
Case statement:
SELECT name,
       (CASE
          WHEN class_code = 'Subscription'
          THEN attribute_category
          ELSE task_type
        END) task_type,
       currency_code
FROM emp;

Decode():
SELECT empname,
       DECODE(address, 'HYD',  'Hyderabad',
                       'BANG', 'Bangalore',
                       address) AS address
FROM emp;
Procedure:
CREATE OR REPLACE PROCEDURE update_bal (
  cust_id_IN IN NUMBER,
  amount_IN  IN NUMBER DEFAULT 1) AS
BEGIN
  UPDATE account_tbl SET amount = amount_IN WHERE cust_id = cust_id_IN;
END;
Trigger:
DECLARE
BEGIN
  IF (:NEW.last_upd_tmst <> :OLD.last_upd_tmst) THEN
    -- Insert a record into the control table
    INSERT INTO emp_w VALUES ('wrk', SYSDATE);
  ELSE
    -- Execute the procedure
    update_sysdate;
  END IF;
END;
ORACLE JOINS:
  Equi join
  Non-equi join
  Self join
  Natural join
  Cross join
  Outer join
    Left outer
    Right outer
    Full outer
Equi Join/Inner Join:
Ex:
SELECT e.empno, e.ename, d.dname
FROM emp e, dept d
WHERE e.deptno = d.deptno;

Self Join
Joining the table to itself is called a self join.
Ex:
SELECT e1.ename, e2.ename
FROM emp e1, emp e2
WHERE e1.empno = e2.mgr;
Natural Join
Natural join compares all the common columns.
Ex:
SELECT * FROM emp NATURAL JOIN dept;
Outer Join
Outer join gives the non-matching records along with the matching records.

Left Outer Join
This displays all the matching records plus the records of the left-hand table that have no match in the right-hand table.
Ex:
SELECT e.ename, d.dname
FROM emp e LEFT OUTER JOIN dept d ON (e.deptno = d.deptno);
Or
SELECT e.ename, d.dname FROM emp e, dept d WHERE e.deptno = d.deptno(+);

Right Outer Join
This displays all the matching records plus the records of the right-hand table that have no match in the left-hand table.
Ex:
SELECT e.ename, d.dname
FROM emp e RIGHT OUTER JOIN dept d ON (e.deptno = d.deptno);
OR
SELECT e.ename, d.dname FROM emp e, dept d WHERE e.deptno(+) = d.deptno;
View:
Why Use Views?
To restrict data access
Inline view:
A SELECT statement written in the FROM clause is nothing but an inline view.
Ex: Get dept-wise max sal along with empname and empno.
SELECT a.empname, a.empno, b.sal, b.deptno
FROM emp a,
     (SELECT MAX(sal) sal, deptno FROM emp GROUP BY deptno) b
WHERE a.sal = b.sal
  AND a.deptno = b.deptno;
What is the difference between view and materialized view?

View: It is a database object that stores only the query definition; the data is fetched from the base tables every time the view is queried.

Materialized view: It is a database object that physically stores the result set of the query; it must be refreshed to reflect changes in the base tables.
Rowid vs Row-num

Rowid is the physical address of a row and is permanent; row-num is temporary.

The ROWNUM pseudocolumn returns a number indicating the order in which Oracle selects the row from a table or set of joined rows.
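A quick sketch of both pseudocolumns against the emp sample table:

SELECT rowid, ROWNUM, empno
FROM emp
WHERE ROWNUM <= 3;  -- ROWNUM is assigned as rows are returned; rowid identifies the stored row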
Having clause
Both the WHERE and HAVING clauses can be used to filter data. HAVING is used with GROUP BY and filters groups after aggregation, whereas WHERE filters individual rows and a GROUP BY is not mandatory for it.
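For example, against the emp sample table (the job column is an assumption):

SELECT deptno, AVG(sal)
FROM emp
WHERE job <> 'PRESIDENT'   -- WHERE filters individual rows before grouping
GROUP BY deptno
HAVING AVG(sal) > 2000;    -- HAVING filters the groups after aggregation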
MERGE Statement
You can use the MERGE command to perform insert and update in a single command.
Ex:
MERGE INTO student1 s1
USING (SELECT * FROM student2) s2
ON (s1.no = s2.no)
WHEN MATCHED THEN
  UPDATE SET s1.marks = s2.marks
WHEN NOT MATCHED THEN
  INSERT (no, marks) VALUES (s2.no, s2.marks);
Sub-query vs Co-related sub-query
A sub-query is executed once for the parent query, whereas a co-related sub-query is executed once for each row of the parent query.
Indexes:
1. Bitmap indexes are most appropriate for columns having low distinct values, such as GENDER, MARITAL_STATUS, and RELATION. This assumption is not completely accurate, however. In reality, a bitmap index is always advisable for systems in which data is not frequently updated by many concurrent systems. In fact, a bitmap index on a column with 100-percent unique values (a column candidate for primary key) is as efficient as a B-tree index.
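For illustration, the two index types side by side (the index and column names are ours):

CREATE BITMAP INDEX emp_gender_bix ON emp (gender);  -- low-cardinality column
CREATE INDEX emp_ename_ix ON emp (ename);            -- ordinary B-tree index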
When to Create an Index
You should create an index if:
- The column is queried frequently
- The column is used frequently in WHERE clauses or join conditions
- The column contains a wide range of values
- The table is large and most queries are expected to retrieve only a small percentage of the rows
ALL_ROWS
One of the hints that invokes the cost-based optimizer. ALL_ROWS is usually used for batch processing or data warehousing systems.

FIRST_ROWS
One of the hints that invokes the cost-based optimizer. FIRST_ROWS is usually used for OLTP systems.

CHOOSE
One of the hints that invokes the cost-based optimizer. This hint lets the server choose between ALL_ROWS and FIRST_ROWS, based on the statistics gathered.
Additional Hints

HASH
Hashes one table (full scan) and creates a hash index for that table. Then it hashes the other table and uses the hash index to find corresponding records. Therefore it is not suitable for < or > join conditions.
/*+ use_hash */

Use a hint to force using an index.
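A sketch of how these hints are embedded in queries (the index name in the last example is an assumption):

SELECT /*+ ALL_ROWS */ * FROM emp;
SELECT /*+ FIRST_ROWS */ * FROM emp;
SELECT /*+ USE_HASH(e d) */ e.ename, d.dname
FROM emp e, dept d
WHERE e.deptno = d.deptno;
SELECT /*+ INDEX(e emp_ename_ix) */ * FROM emp e WHERE e.ename = 'stev';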
First check the explain plan of the query to see whether it does a full table scan; if it is a full table scan, the cost will be high, so create indexes on the joining columns and run the query again, which should give better performance. Also analyze the tables if the last analysis happened long ago. The ANALYZE statement can be used to gather statistics for a specific table, index or cluster:

ANALYZE TABLE employees COMPUTE STATISTICS;
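On later Oracle releases, the DBMS_STATS package is the preferred way to gather the same statistics (the schema name here is an assumption):

EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'HR', tabname => 'EMPLOYEES');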
If we still have a performance issue then we use HINTS; a hint is nothing but a clue to the optimizer. The hints described above (ALL_ROWS, FIRST_ROWS, CHOOSE, and HASH /*+ use_hash */) apply here as well.
Hints are most useful to optimize the query performance.
Stored Procedure:

What are the differences between stored procedures and triggers?

A stored procedure is normally used for performing tasks, whereas a trigger is normally used for tracing and auditing logs.

A stored procedure must be called explicitly by the user in order to execute, whereas a trigger is called implicitly based on the events defined on the table.

A stored procedure can run independently, whereas a trigger has to be part of a DML event on the table.

A stored procedure can be executed from a trigger, but a trigger cannot be executed from a stored procedure.

Stored procedures can have parameters; a trigger cannot have any parameters.

Stored procedures are compiled collections of programs or SQL statements in the database. Using a stored procedure we can access and modify data present in many tables. A stored procedure is not associated with any particular database object. Triggers, however, are event-driven special procedures which are attached to a specific database object, say a table. Stored procedures are not run automatically; they have to be called explicitly by the user. Triggers get executed when the event associated with them fires.
Packages:
Packages provide a method of encapsulating related
procedures, functions, and associated cursors and variables
together as a unit in the database.
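As a minimal sketch (the package name is ours; the procedure is the update_bal example shown earlier):

CREATE OR REPLACE PACKAGE account_pkg AS
  PROCEDURE update_bal (cust_id_IN IN NUMBER, amount_IN IN NUMBER DEFAULT 1);
END account_pkg;
/
CREATE OR REPLACE PACKAGE BODY account_pkg AS
  PROCEDURE update_bal (cust_id_IN IN NUMBER, amount_IN IN NUMBER DEFAULT 1) AS
  BEGIN
    UPDATE account_tbl SET amount = amount_IN WHERE cust_id = cust_id_IN;
  END update_bal;
END account_pkg;
/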
Triggers:
Oracle lets you define procedures called triggers that run implicitly when an INSERT, UPDATE, or DELETE statement is issued against the associated table.
Types of Triggers
This section describes the different types of triggers:
INSTEAD OF Triggers
Row Triggers
A row trigger is fired each time the table is affected by the
triggering statement. For example, if an UPDATE statement
updates multiple rows of a table, a row trigger is fired once for
each row affected by the UPDATE statement. If a triggering
statement affects no rows, a row trigger is not run.
BEFORE and AFTER Triggers
When defining a trigger, you can specify the trigger timing: whether the trigger action is to be run before or after the triggering statement.
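For instance, a sketch of an AFTER row trigger (the trigger name and the emp_audit table are hypothetical):

CREATE OR REPLACE TRIGGER emp_sal_audit
AFTER UPDATE OF sal ON emp
FOR EACH ROW
BEGIN
  -- record the old and new salary for auditing
  INSERT INTO emp_audit (empno, old_sal, new_sal, changed_on)
  VALUES (:OLD.empno, :OLD.sal, :NEW.sal, SYSDATE);
END;
/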
Stored Procedures vs Functions
A procedure needs to be executed manually (called explicitly), whereas a function can be called from within a SQL statement. Both are stored as pseudo-code in the database, i.e. in compiled form.
2. Delete duplicate rows:
delete from emp where rowid not in (select max(rowid) from emp group by empno);
3. Transpose columns into rows. Sample table:

Name  No   Add1    Add2
abc   100  hyd     bang
xyz   200  mysore  pune

4. Below query transposes rows into columns:

select emp_id,
       max(decode(rank_id, 1, address)) as add1,
       max(decode(rank_id, 2, address)) as add2,
       max(decode(rank_id, 3, address)) as add3
from (select emp_id, address,
             rank() over (partition by emp_id order by emp_id, address) rank_id
      from temp)
group by emp_id;
5. Rank query:
select empno, ename, sal, r
from (select empno, ename, sal, rank() over (order by sal desc) r from emp);
6. Dense rank query:
The DENSE_RANK function works like the RANK function except that it assigns consecutive ranks:
select empno, ename, sal, r
from (select empno, ename, sal, dense_rank() over (order by sal desc) r from emp);
7. Top 5 salaries by using rank:
select empno, ename, sal, r
from (select empno, ename, sal, dense_rank() over (order by sal desc) r from emp)
where r <= 5;
Or
select * from (select * from emp order by sal desc) where rownum <= 5;
8. 2nd highest sal:
select empno, ename, sal, r
from (select empno, ename, sal, dense_rank() over (order by sal desc) r from emp)
where r = 2;
9. Top sal:
select * from emp where sal = (select max(sal) from emp);
Hierarchical queries
Starting at the root, walk from the top down, and eliminate employee Higgins in the result, but process the child rows:

SELECT department_id, employee_id, last_name, job_id, salary
FROM employees
WHERE last_name != 'Higgins'
START WITH manager_id IS NULL
CONNECT BY PRIOR employee_id = manager_id;
3 DWH CONCEPTS

What is BI?
Business Intelligence refers to a set of methods and techniques that are used by organizations for tactical and strategic decision making. It leverages methods and technologies that focus on counts, statistics and business objectives to improve business performance.
The objective of Business Intelligence is to better understand customers and improve customer service, make the supply and distribution chain more efficient, and identify and address business problems and opportunities quickly.
A warehouse is used for high-level data analysis: predictions, time-series analysis, financial analysis, what-if simulations, etc. Basically it is used for better decision making.
What is a Data Warehouse?
A Data Warehouse is a "subject-oriented, integrated, time-variant, non-volatile collection of data in support of decision making".
Subject Oriented:
Data that gives information about a particular subject instead of
about a company's ongoing operations.
Integrated:
Data that is gathered into the data warehouse from a variety of
sources and merged into a coherent whole.
Time-variant:
All data in the data warehouse is identified with a particular
time period.
Non-volatile:
Data is stable in a data warehouse. More data is added but data
is never removed.
What is a DataMart?
A data mart is usually sponsored at the department level and developed with a specific detail or subject in mind; a data mart is a subset of the data warehouse with a focused objective.
What is the difference between a data warehouse and a
data mart?
In terms of design data warehouse and data mart are almost
the same.
In general a Data Warehouse is used on an enterprise level and
a Data Marts is used on a business division/department level.
Snowflake schema: a dimension of a star schema may be normalized into several related tables. If a dimension is normalized, we say it is a snowflaked design. It is called a snowflake schema because the diagram resembles a snowflake.
Types of facts?
There are three types of facts: additive, semi-additive, and non-additive.
What is Granularity?
Principle: create fact tables with the most granular data possible to support analysis of the business process.
In data warehousing, grain refers to the level of detail available in a given fact table as well as to the level of detail provided by a star schema.
It is usually given as the number of records per key within the table. In general, the grain of the fact table is the grain of the star schema.
Facts: facts must be consistent with the grain; all facts are at a uniform grain.
Dimensional Model

Conceptual Data Model
At this level, the data modeler attempts to identify the highest-level relationships among the different entities. No attribute is specified.

Logical Data Model
Features of the logical data model include:

Physical Data Model
At this level, the data modeler will specify how the logical data model will be realized in the database schema.
The steps for physical data model design are as follows:
1. Convert entities into tables.
2. Convert relationships into foreign keys.
3. Convert attributes into columns.
9. http://www.learndatamodeling.com/dm_standard.htm
10. Modeling is an efficient and effective way to represent the organization's needs; it provides information in a graphical way to the members of an organization to understand and communicate the business rules and processes. Business Modeling and Data Modeling are the two important types of modeling.
The differences between a logical data model and a physical data model:

Logical Data Model                             Physical Data Model
Represents business information and defines    Represents the physical implementation of
business rules                                 the model in a database
Entity                                         Table
Attribute                                      Column
Primary Key                                    Primary Key Constraint
Alternate Key                                  Unique Constraint or Unique Index
Rule                                           Check Constraint, Default Value
Relationship                                   Foreign Key
Definition                                     Comment
[Data model diagrams: the ACW subject area, showing the staging tables (ACW_DF_FEES_STG, ACW_PCBA_APPROVAL_STG, ACW_DF_APPROVAL_STG), the fact tables (ACW_DF_FEES_F, ACW_PCBA_APPROVAL_F, ACW_DF_APPROVAL_F), the dimension tables (ACW_ORGANIZATION_D, ACW_USERS_D, ACW_PART_TO_PID_D, ACW_PRODUCTS_D, ACW_SUPPLY_CHANNEL_D) and EDW_TIME_HIERARCHY, first with primary keys and non-key attributes, then with full column datatypes (NUMBER, CHAR, VARCHAR2, FLOAT, DATE). Each fact and dimension table carries the D_CREATED_BY, D_CREATION_DATE, D_LAST_UPDATED_BY and D_LAST_UPDATE_DATE audit columns; relationships shown include Users and PID_for_DF_Fees.]
Type 1 Slowly Changing Dimension

In Type 1, the new record replaces the original record; no trace of the old record exists. Before the change:

Customer Key  Name       State
1001          Christina  Illinois

After Christina moves, the record is overwritten:

Customer Key  Name       State
1001          Christina  California
Advantages:
- This is the easiest way to handle the Slowly Changing
Dimension problem, since there is no need to keep track of the
old information.
Disadvantages:
- All history is lost. By applying this methodology, it is
not possible to trace back in history. For example, in
this case, the company would not be able to know that
Christina lived in Illinois before.
Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not
necessary for the data warehouse to keep track of historical
changes.
Type 2 Slowly Changing Dimension

In Type 2, a new record is added to represent the new information. Before the change:

Customer Key  Name       State
1001          Christina  Illinois

After Christina moves, a new record is added:

Customer Key  Name       State
1001          Christina  Illinois
1005          Christina  California
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases
where the number of rows for the table is very high to start
with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is
necessary for the data warehouse to track historical changes.
Type 3 Slowly Changing Dimension

In Type 3, additional columns store the original value and the date of the change. Before the change:

Customer Key  Name       State
1001          Christina  Illinois

To accommodate Type 3, we add the columns Original State, Current State and Effective Date. After Christina moves:

Customer Key  Name       Original State  Current State  Effective Date
1001          Christina  Illinois        California     15-JAN-2003
Advantages:
- This does not increase the size of the table, since new information is updated in place.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once; only the original and current values survive.
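A sketch of how each type lands in SQL (customer_dim and its columns are illustrative):

-- Type 1: overwrite in place, history lost
UPDATE customer_dim SET state = 'California' WHERE customer_key = 1001;

-- Type 2: insert a new row with a new surrogate key
INSERT INTO customer_dim (customer_key, name, state)
VALUES (1005, 'Christina', 'California');

-- Type 3: keep the old value in a dedicated column
UPDATE customer_dim
SET    current_state  = 'California',
       effective_date = DATE '2003-01-15'   -- original_state still holds 'Illinois'
WHERE  customer_key = 1001;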
4 ETL-INFORMATICA

4.1 Informatica Overview

Informatica is a powerful Extraction, Transformation, and Loading tool that has been deployed at GE Medical Systems for data warehouse development in the Business Intelligence Team. Informatica comes with the following clients to perform various tasks.

Informatica Transformations:

Mapping: A mapping is the Informatica object which contains a set of transformations that define how data flows from sources to targets.

A mapplet cannot contain the following objects:
  Normalizer transformations
  COBOL sources
  XML sources
  Target definitions
  Other mapplets

Unconnected Lookup: It is not connected to the pipeline; it receives input values from the result of a :LKP expression in another transformation, takes its lookup condition values as arguments, and returns one value to that expression.
Lookup Caches:
When configuring a lookup cache, you can specify any of the following options:
  Persistent cache
  Static cache (the default cache)
  Dynamic cache
  Shared cache

With a dynamic cache, the NewLookupRow value indicates what the Integration Service did with each row:

NewLookupRow Value  Description
0                   The Integration Service does not update or insert the row in the cache.
1                   The Integration Service inserts the row into the cache.
2                   The Integration Service updates the row in the cache.
Union Transformation:
The Union transformation is a multiple input group transformation that you can use to merge data from multiple pipelines or pipeline branches into one pipeline branch. It merges data from multiple sources similar to the UNION ALL SQL statement, which combines the results from two or more SQL statements. Like UNION ALL, the Union transformation does not remove duplicate rows. Input groups should have a similar structure.
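In SQL terms the behavior matches UNION ALL (the table names here are hypothetical):

SELECT empno, ename FROM emp_us
UNION ALL                        -- keeps duplicates, like the Union transformation
SELECT empno, ename FROM emp_uk;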
Aggregator Transformation:
Transformation type:
Active
Connected
The Aggregator transformation performs aggregate
calculations, such as averages and sums. The Aggregator
transformation is unlike the Expression transformation, in that
you use the Aggregator transformation to perform calculations
on groups. The Expression transformation permits you to
perform calculations on a row-by-row basis only.
Components of the Aggregator Transformation:
The Aggregator is an active transformation, changing the
number of rows in the pipeline. The Aggregator transformation
has the following components and options
Aggregate cache: The Integration Service stores data in the
aggregate cache until it completes aggregate calculations. It
stores group values in an index cache and row data in the data
cache.
Group by port: Indicate how to create groups. The port can be
any input, input/output, output, or variable port. When grouping
data, the Aggregator transformation outputs the last row of
each group unless otherwise specified.
Sorted input: Select this option to improve session performance; to use sorted input, you must pass data to the Aggregator sorted by the group by ports.

SQL Transformation:
Transformation type: Active/Passive, Connected
The SQL transformation processes SQL queries midstream in a
pipeline. You can insert, delete, update, and retrieve rows from
a database. You can pass the database connection information
to the SQL transformation as input data at run time. The
transformation processes external SQL scripts or SQL queries
that you create in an SQL editor. The SQL transformation
processes the query and returns rows and database errors.
For example, you might need to create database tables before
adding new transactions. You can create an SQL transformation
to create the tables in a workflow. The SQL transformation
returns database errors in an output port. You can configure
another workflow to run if the SQL transformation returns no
errors.
When you create an SQL transformation, you configure the
following options:
Mode. The SQL transformation runs in one of the following modes:
Script mode. The SQL transformation runs ANSI SQL scripts
that are externally located. You pass a script name to the
transformation with each input row. The SQL transformation
outputs one row for each input row.
Query mode. The SQL transformation executes a query that
you define in a query editor. You can pass strings or parameters
to the query to define dynamic queries or change the selection
parameters. You can output multiple rows when the query has a
SELECT statement.
Database type. The type of database the SQL transformation
connects to.
Connection type. Pass database connection information to the
SQL transformation or use a connection object.
Script Mode
In script mode, the SQL transformation receives the name of the script to run for each input row through the ScriptName input port.

Transaction Control Transformation
Transformation type: Active, Connected
PowerCenter lets you control commit and roll back transactions
based on a set of rows that pass through a Transaction Control
transformation. A transaction is the set of rows bound by
commit or roll back rows. You can define a transaction based on
a varying number of input rows. You might want to define
transactions based on a group of rows ordered on a common
key, such as employee ID or order entry date.
In PowerCenter, you define transaction control at the following
levels:
Within a mapping. Within a mapping, you use the Transaction
Control transformation to define a transaction. You define
transactions using an expression in a Transaction Control
transformation. Based on the return value of the expression,
you can choose to commit, roll back, or continue without any
transaction changes.
Within a session. When you configure a session, you
configure it for user-defined commit. You can choose to commit
or roll back a transaction if the Integration Service fails to
transform or write any row to the target.
When you run the session, the Integration Service evaluates the
expression for each row that enters the transformation. When it
evaluates a commit row, it commits all rows in the transaction
to the target or targets. When the Integration Service evaluates
a roll back row, it rolls back all rows in the transaction from the
target or targets.
If the mapping has a flat file target you can generate an output
file each time the Integration Service starts a new transaction.
You can dynamically name each target flat file.
What is the difference between Joiner and Lookup?
A Joiner joins two sources on a join condition, whereas in a Lookup we concentrate on the cache concept to look up values from a table.
3) Sample parameter file:

[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_GEHC_APO_WEEKLY_HIST_BAAN.ST:s_m_GEHC_APO_BAAN_SALES_HIST_AUSTRI]
$InputFileName_BAAN_SALE_HIST=/interface/dev/etl/apo/srcfiles/HS_025_20070921
$DBConnection_Target=DMD2_GEMS_ETL
$$CountryCode=AT
$$CustomerNumber=120165
[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_GEHC_APO_WEEKLY_HIST_BAAN.ST:s_m_GEHC_APO_BAAN_SALES_HIST_BELUM]
$DBConnection_Source=DEVL1C1_GEMS_ETL
$OutputFileName_BAAN_SALES=/interface/dev/etl/apo/trgfiles/HS_002_20070921
$$CountryCode=BE
$$CustomerNumber=101495
Developer Changes:
For example, in PowerCenter:
  The PowerCenter Server has become a service, the Integration Service.
  There is no more Repository Server, but PowerCenter includes a Repository Service.
  Client applications are the same, but work on top of the new services framework.

Below are the differences between Informatica 7.1 and 8.1:
1) PowerCenter Connect for SAP NetWeaver BW option
2) SQL transformation is added
3) Service-oriented architecture
4) Grid concept is an additional feature
5) Random file names can be generated in the target
Load Balancer
[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_GEHC_APO_WEEKLY_HIST_BAAN.ST:s_m_GEHC_APO_BAAN_SALES_HIST_AUSTRI]
$DBConnection_Source=DMD2_GEMS_ETL
$DBConnection_Target=DMD2_GEMS_ETL
$$LastUpdateDateTime=01/01/1940
Main mapping
Workflow Design
4.2 Informatica Scenarios:

1)
Transaction  City
AP           HYD
AP           TPT
KA           BANG
KA           MYSORE
KA           HUBLI
Source:
Ename   EmpNo
stev    100
methew  100
john    101
tom     101

Target:
Ename        EmpNo
stev methew  100
john tom     101
Source:
Ename   EmpNo
stev    100
Stev    100
john    101
Mathew  102

Output:

Target_1:
Ename   EmpNo
Stev    100
John    101
Mathew  102

Target_2:
Ename   EmpNo
Stev    100
After 30 minutes, if the file does not exist we will send out an email notification.

2) Event wait and Event raise approach
We can put an event wait before the actual session run in the workflow to wait for an indicator file; if the file is available then it will run the session, otherwise the event wait will wait indefinitely until the indicator file is available.
4.3 Development Guidelines

General Development Guidelines
The starting point of the development is the logical model created by the Data Architect. This logical model forms the foundation for metadata, which will be continuously maintained throughout the Data Warehouse Development Life Cycle (DWDLC). The logical model is formed from the requirements of the project. At the completion of the logical model, technical documentation is produced defining the sources, targets, requisite business-rule transformations, mappings and filters. This documentation serves as the basis for the creation of the Extraction, Transformation and Loading processes that actually manipulate the data from the application sources into the Data Warehouse/Data Mart.

To start development on any data mart you should have the following things set up by the Informatica Load Administrator:

Informatica folder. The development team, in consultation with the BI Support Group, can decide a three-letter code for the project, which would be used to prefix all the objects of the project.
Quick Reference

Object Type             Syntax
Folder
Mapping                 m_fXY_ZZZ_<Target Table Name>_x.x
Session                 s_fXY_ZZZ_<Target Table Name>_x.x
Batch
Source Definition
Target Definition
Aggregator              AGG_<Purpose>
Expression              EXP_<Purpose>
Filter                  FLT_<Purpose>
Joiner
Lookup
Normalizer              Norm_<Source Name>
Rank                    RNK_<Purpose>
Router                  RTR_<Purpose>
Sequence Generator
Source Qualifier
Stored Procedure        STP_<Database Name>_<Procedure Name>
Update Strategy
Mapplet                 MPP_<Purpose>
Input Transformation
Output Transformation
Database Connections    XXX_<Database Name>_<Schema Name>
4.4 Performance Tips

What is performance tuning in Informatica?
The aim of performance tuning is to optimize session performance so that sessions run during the available load window for the Informatica Server.

Increase the session performance as follows:

The performance of the Informatica Server is related to network connections. Data generally moves across a network at less than 1 MB per second, whereas a local disk moves data five to twenty times faster. Thus network connections often affect session performance, so minimize the number of network connections.
1. Cache lookups if the source table is under 500,000 rows and don't cache for tables over 500,000 rows.
2. Reduce the number of transformations. Don't use an Expression transformation just to collect fields. Don't use an Update Strategy transformation if you are only inserting; insert mode is the default.
3. If a value is used in multiple ports, calculate the value once (in a variable) and reuse the result instead of recalculating it for multiple ports.
4. Reuse objects where possible.
5. Delete unused ports, particularly in the Source Qualifier and Lookups.
6. Use operators in expressions instead of functions where possible.
19. If the lookup table is on the same database as the source table, instead of using a Lookup transformation, join the tables in the Source Qualifier transformation itself if possible.
20. If the lookup table does not change between sessions, configure the Lookup transformation to use a persistent lookup cache. The Informatica Server saves and reuses cache files from session to session, eliminating the time required to read the lookup table.
21. Use the :LKP reference qualifier in expressions only when calling unconnected Lookup transformations.
22. The Informatica Server generates an ORDER BY statement for a cached lookup that contains all lookup ports. By providing an override ORDER BY clause with fewer columns, session performance can be improved.
23. Eliminate unnecessary data type conversions from mappings.
24. Reduce the number of rows being cached by using the Lookup SQL Override option to add a WHERE clause to the default SQL statement.
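As an illustration of tip 24, a Lookup SQL Override of this shape (the table, columns and filter value are illustrative) cuts the rows read into the cache:

SELECT PRCHG_ID, PRCHG_ST_CDE
FROM T_PRCHG
WHERE PRCHG_ST_CDE = 'A'   -- cache only the rows the mapping can actually match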
UTP Template:

Each test case records: Step #, Description, Test Conditions, Expected Results, Actual Results (Pass or Fail), and Tested By.

1. Check the total count of records fetched from the source tables against the total records in the PRCHG table for a particular session timestamp.
   Test conditions:
   SOURCE:
   SELECT count(*) FROM XST_PRCHG_STG;
   TARGET:
   SELECT count(*) FROM T_PRCHG;
   Expected Results: Should be same as the expected count. Actual Results: Pass. Tested By: Stev

2. Check all the target columns to verify whether they are getting populated correctly with source data.
   Test conditions:
   SELECT PRCHG_ID, PRCHG_DESC, DEPT_NBR, EVNT_CTG_CDE, PRCHG_TYP_CDE, PRCHG_ST_CDE FROM T_PRCHG
   MINUS
   SELECT PRCHG_ID, PRCHG_DESC, DEPT_NBR, EVNT_CTG_CDE, PRCHG_TYP_CDE, PRCHG_ST_CDE FROM PRCHG;
   Expected Results: Should be same as the expected (the MINUS should return no rows). Actual Results: Pass. Tested By: Stev

3. Check the Insert strategy to load records into the target table.
   Test conditions: It should insert a record into the target table with source data.
   Expected Results: Should be same as the expected. Actual Results: Pass. Tested By: Stev

4. Check the Update strategy to load records into the target table.
   Test conditions: It should update the existing record in the target table with the changed source data.
   Expected Results: Should be same as the expected. Actual Results: Pass. Tested By: Stev
5 UNIX

How strong are you in UNIX?
1) I have the UNIX shell scripting knowledge that Informatica work requires, e.g. running workflows from UNIX using pmcmd.
Below is the script to run a workflow from UNIX:

cd /pmar/informatica/pc/pmserver/
/pmar/informatica/pc/pmserver/pmcmd startworkflow -u $INFA_USER -p $INFA_PASSWD -s $INFA_SERVER:$INFA_PORT -f $INFA_FOLDER -wait $1 >> $LOG_PATH/$LOG_FILE
Basic Commands:

cat file1                 -- displays the file (cat > file1 creates a non-zero-byte file)
cat file1 file2 > all     -- combines file1 and file2 into all (creates all if it doesn't exist)
cat file1 >> file2        -- appends file1 to file2
25 22 15 * 0 /usr/local/bin/backup_jobs
                          -- crontab entry: runs backup_jobs at 22:25 on the 15th of every month and on every Sunday
cp file1 file2            -- copies a file
mv file1 newname          -- renames a file
mv file1 ~/AAA/           -- moves file1 into the AAA directory under your home directory
tail -7 filename          -- displays the last 7 lines of the file

You can find out what shell you are using with the command:
echo $SHELL

Interactive History
A feature of bash and tcsh (and sometimes others): you can use the up-arrow keys to access your previous commands, edit them, and re-execute them.
Basics of the vi editor

Opening a file:
vi filename

Creating text
Edit modes: these keys enter editing modes and let you type in the text of your document.
i     Insert text before the cursor
r     Replace 1 character
R     Replace mode

:q    Quit.
:q!   Quit without saving changes.