
http://www.academia.edu/10664164/Data_Stage_Interview_Questions
http://www.redbooks.ibm.com/redbooks/pdfs/sg247830.pdf
What is the main difference between the Data Set and File Set stages?
A Data Set is an internal DataStage format. The main points to consider before using one are:
1) It stores data in binary, in DataStage's internal format, so reading from and writing to a data set takes less time than with any other source/target.
2) It preserves the partitioning scheme, so you don't have to partition the data again.
3) You cannot view the data without DataStage.
Now, about the File Set:
1) It stores data in a format similar to a sequential file.
2) The only advantage of a file set over a sequential file is that it preserves the partitioning scheme.
3) You can view the data, but in the order defined by the partitioning scheme.
What is the difference between the Join, Lookup and Merge stages? How do they react if duplicate records arrive on the input links?
Join Stage:
1.) It has n input links (one primary and the rest secondary), one output link, and no reject link.
2.) It supports 4 join operations: inner join, left outer join, right outer join and full outer join.
3.) A join occupies less memory, hence performance is high in the Join stage.
4.) The default partitioning technique is Hash.
5.) Prerequisite: the data must be sorted before the join operation is performed.
Lookup Stage:
1.) It has n input links, one output link and 1 reject link.
2.) It can perform only 2 join operations: inner join and left outer join.
3.) A lookup occupies more memory, hence performance reduces.
4.) The default partitioning technique is Entire.
Merge Stage:
1.) It has n input links (one master link and n-1 update links) and up to n-1 reject links.
2.) It can also perform 2 join operations: inner join and left outer join.
3.) The Hash partitioning technique is used by default.
4.) It uses very little memory, hence performance is high.
5.) Sorted data on the master and update links is mandatory.
How many reject links can I have in a Merge stage?
In a Join stage, if one input has col1, col2, col3 and the other has col4, col5, col6, how do you join them and perform a left outer join?

When do we use the Lookup stage?

DataStage doesn't know how large your data is, so it cannot make an informed choice about whether to combine data using a Join stage or a Lookup stage.
Here's how to decide which to use:
If the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets.
This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential.
Once the sort is over, the join processing is very fast and never involves paging or other I/O.
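The trade-off described above can be sketched in a few lines of Python. This is only an illustration, not DataStage code: the function names and the (key, value) row shape are made up for the example. The lookup version holds the whole reference set in memory, while the join version sorts both inputs and then merges them in a single sequential pass.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, str]  # (key, value); a hypothetical row shape for illustration


def lookup_left_outer(stream: List[Row], reference: List[Row]) -> List[Tuple[int, str, Optional[str]]]:
    """Lookup style: load the reference data into memory, then stream the input."""
    # In this sketch a dict keeps one reference row per key (the last one seen).
    table: Dict[int, str] = {key: value for key, value in reference}
    return [(key, value, table.get(key)) for key, value in stream]


def sort_merge_inner(left: List[Row], right: List[Row]) -> List[Tuple[int, str, str]]:
    """Join style: sort both inputs on the key, then merge in one pass."""
    left_sorted = sorted(left)
    right_sorted = sorted(right)
    out: List[Tuple[int, str, str]] = []
    i = j = 0
    while i < len(left_sorted) and j < len(right_sorted):
        left_key, right_key = left_sorted[i][0], right_sorted[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Emit every right row sharing this key, then advance the left side.
            k = j
            while k < len(right_sorted) and right_sorted[k][0] == left_key:
                out.append((left_key, left_sorted[i][1], right_sorted[k][1]))
                k += 1
            i += 1
    return out
```

The lookup needs memory proportional to the reference data, which is why it suits small reference sets; the merge only needs the two sorted inputs to be read in order.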
Unlike the Join and Lookup stages, the Merge stage allows you to specify several reject links, as many as there are update links.
The Lookup stage does not require the data to be sorted.
Can we use Hash partitioning for the reference link in a Lookup stage? Yes.
How many types of joins are supported by the Merge/Join/Lookup stages?
Which partitioning methods should we use in Merge/Lookup/Join? Explain why.
There are 9 partitioning methods:
Auto
DB2
Entire
Hash
Modulus
Random
Range
Round robin
Same
What is keyless partitioning?
Keyless partitioning methods distribute rows without examining the contents of the data.
Same: Retains the existing partitioning from the previous stage.
Round-robin: Distributes rows evenly across partitions, in a round-robin partition assignment.
Random: Distributes rows evenly across partitions, in a random partition assignment.
Entire: Each partition receives the entire dataset.
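As a rough illustration of the keyless methods, here is a small Python sketch. It is not how DataStage implements them; the function names and the fixed random seed are assumptions made for the example. None of the functions looks at the row contents, only at the row's position or a random number.

```python
import random
from typing import List, Sequence


def same(partitions: List[list]) -> List[list]:
    """Same: keep whatever partitioning the previous stage produced."""
    return partitions


def round_robin(rows: Sequence, n: int) -> List[list]:
    """Round-robin: row i goes to partition i mod n."""
    parts: List[list] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def random_partition(rows: Sequence, n: int, seed: int = 0) -> List[list]:
    """Random: each row goes to a randomly chosen partition."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    parts: List[list] = [[] for _ in range(n)]
    for row in rows:
        parts[rng.randrange(n)].append(row)
    return parts


def entire(rows: Sequence, n: int) -> List[list]:
    """Entire: every partition receives a full copy of the data."""
    return [list(rows) for _ in range(n)]
```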
Keyed partitioning:
Keyed partitioning examines the data values in one or more key columns, ensuring that records with the same values in those key columns are assigned to the same partition.
Hash: Assigns rows with the same values in one or more key columns to the same partition using an internal hashing algorithm.
Modulus: Assigns rows with the same values in a single integer key column to the same partition using a simple modulus calculation.
Range: Assigns rows with the same values in one or more key columns to the same partition using a specified range map generated by pre-reading the dataset.
DB2: For DB2 Enterprise Server Edition with DPF (DB2/UDB) only. Matches the internal partitioning of the specified source or target table.
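The keyed methods can be sketched the same way. This is an illustration under stated assumptions, not DataStage's implementation: rows are plain dicts, zlib.crc32 stands in for the internal hashing algorithm, and the range map is passed in as a sorted list of boundary values rather than generated by pre-reading the data.

```python
import bisect
import zlib
from typing import Dict, List, Sequence


def hash_partition(rows: Sequence[Dict], key: str, n: int) -> List[list]:
    """Hash: equal key values always hash to the same partition."""
    parts: List[list] = [[] for _ in range(n)]
    for row in rows:
        # crc32 is a stand-in hash chosen for the sketch.
        h = zlib.crc32(repr(row[key]).encode("utf-8"))
        parts[h % n].append(row)
    return parts


def modulus_partition(rows: Sequence[Dict], key: str, n: int) -> List[list]:
    """Modulus: partition number is the integer key value mod n."""
    parts: List[list] = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts


def range_partition(rows: Sequence[Dict], key: str, boundaries: List[int]) -> List[list]:
    """Range: boundaries is a sorted range map; it yields len(boundaries) + 1 partitions."""
    parts: List[list] = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect.bisect_right(boundaries, row[key])].append(row)
    return parts
```

In all three cases rows with the same key value land in the same partition, which is the property a Join or Merge stage relies on.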
How do you remove duplicates from a table without using an inner query?
What is a "degenerate dimension"?

What are transport blocks?

The following environment variables are all concerned with the block size used for the internal transfer of data as jobs run.
They are used only for fixed-length records:
APT_MIN_TRANSPORT_BLOCK_SIZE
APT_MAX_TRANSPORT_BLOCK_SIZE
APT_DEFAULT_TRANSPORT_BLOCK_SIZE
APT_LATENCY_COEFFICIENT
The default value of APT_DEFAULT_TRANSPORT_BLOCK_SIZE is 131072 bytes.

1) What are system variables?

IBM InfoSphere DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only.
@DATE
The internal date when the program started. See the Date function.
@DAY
The day of the month extracted from the value in @DATE.
@FALSE
The compiler replaces the value with 0.
2) Active stage: the "T" of ETL. Passive stage: the "E" and "L" of ETL.
3) Define data aggregation. It summarizes the data.
4) An InterProcess (IPC) stage is a passive stage which provides a communication channel between IBM InfoSphere DataStage processes running simultaneously in the same job.
It can speed up data transfer between two data sources:
http://www-01.ibm.com/support/knowledgecenter/SSZJPZ_8.7.0/com.ibm.swg.im.iis.ds.serverjob.dev.doc/topics/c_dsvjbref_InterProcess_Stages.html

5) What are stage variables?

What are Stage Variables, Derivations and Constants?
Stage Variable - An intermediate processing variable that retains its value during read and does not pass the value into a target column.
Derivation - An expression that specifies the value to be passed on to the target column.
Constant - A condition that is either true or false and that specifies the flow of data within a link. (In the Transformer stage this condition is called a constraint.)
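To make the three terms concrete, here is a small Python analogy of a Transformer-style step. It is not DataStage syntax; the column names and the function are hypothetical. The variable prev_customer plays the role of a stage variable (it keeps its value from row to row and is never written out), first_in_group is a derivation, and the test on amount is the true/false condition that controls whether a row flows down the link.

```python
from typing import Dict, List


def transform(rows: List[Dict]) -> List[Dict]:
    prev_customer = None  # "stage variable": persists across rows, not an output column
    out: List[Dict] = []
    for row in rows:
        # "Derivation": an expression computing the value of an output column.
        first_in_group = row["customer"] != prev_customer
        prev_customer = row["customer"]
        # Link condition: only rows where it is true are passed down the link.
        if row["amount"] > 0:
            out.append({
                "customer": row["customer"],
                "amount": row["amount"],
                "first_in_group": first_in_group,
            })
    return out
```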
6) Containers: usage and types?
A container is a collection of stages used for the purpose of reusability.
There are 2 types of containers:
a) Local container: job specific.
b) Shared container: used in any job within a project.
There are two types of shared container:
1. Server shared container. Used in server jobs (can also be used in parallel jobs).
2. Parallel shared container. Used in parallel jobs. You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job).
7) Where does DataStage store its repository? Mostly in SQL Server or Oracle.
8) What is a surrogate key?
9) What are routines?
10) What are job parameters?
11) What is the DataStage architecture?
12) What is the Ora Bulk stage?
This stage is used to bulk load the Oracle target table.
13)
