
Datastage Interview Questions - Answers



What is the flow of loading data into fact & dimensional tables?
Fact table - Table with Collection of Foreign Keys corresponding to the Primary Keys in Dimensional
table. Consists of fields with numeric values.
Dimension table - Table with Unique Primary Key.
Load - Data should be first loaded into dimensional table. Based on the primary key values in
dimensional table, the data should be loaded into Fact table.
What is the default cache size? How do you change the cache size if needed?
The default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab and specifying the cache size there.
What does a Config File in parallel extender consist of?
Config file consists of the following.
a) Number of Processes or Nodes.
b) Actual Disk Storage Location
What are Modulus and Splitting in a Dynamic Hashed File?
In a dynamic hashed file the size of the file changes as data is added or removed. The modulus is the current number of groups (buckets) in the file; when the file grows past a threshold, a group is split into two, which is called splitting (groups are merged again when the file shrinks).
What are Stage Variables, Derivations and Constraints?
Stage Variable - An intermediate processing variable that retains its value between rows during processing and is not passed into a target column.
Derivation - An expression that specifies the value to be passed on to the target column.
Constraint - A condition that evaluates to true or false and controls the flow of data down a link.
Types of views in Datastage Director?
There are 3 types of views in Datastage Director:
a) Job View - dates the job was compiled and run.
b) Status View - status of the job's last run.
c) Log View - warning messages, event messages, program-generated messages.
Types of Parallel Processing?
Parallel Processing is broadly classified into 2 types:
a) SMP - Symmetric Multi-Processing.
b) MPP - Massively Parallel Processing.
Orchestrate Vs Datastage Parallel Extender?
Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on UNIX platforms. DataStage used Orchestrate with DataStage XE (beta version of 6.0) to incorporate parallel processing capabilities. Ascential then acquired Orchestrate and integrated it with DataStage XE, releasing a new version, DataStage 6.0, i.e. Parallel Extender.
Importance of Surrogate Key in Data warehousing?
A Surrogate Key is the primary key of a dimension table. Its main importance is that it is independent of the underlying database, i.e. the surrogate key is not affected by changes going on in the source systems or the database.
How to run a Shell Script within the scope of a Data stage job?
By using "ExcecSH" command at Before/After job properties.
How do you execute datastage job from command line prompt?
Using "dsjob" command as follows.
dsjob -run -jobstatus projectname jobname
Functionality of Link Partitioner and Link Collector?
Link Partitioner: It actually splits data into various partitions or data flows using various partition
methods.
Link Collector: It collects the data coming from partitions, merges it into a single data flow and loads to
target.
Types of Dimensional Modeling?
Dimensional modeling is sub-divided into the following types:
a) Star Schema - simple and much faster; denormalized form.
b) Snowflake Schema - complex with more granularity; more normalized form.
c) Galaxy Schema (complex multi-star schema).

Differentiate Primary Key and Partition Key?
A Primary Key is a combination of unique and not null; it can be a collection of key columns called a composite primary key. A Partition Key is just a part of the Primary Key. There are several methods of partitioning such as Hash, DB2, Random etc. When using Hash partitioning we specify the Partition Key.

Differentiate Database data and Data warehouse data?
Database (OLTP) data is:
a) Detailed or transactional.
b) Both readable and writable.
c) Current.
Data warehouse data, by contrast, is typically summarized and historical, mostly read-only, and time-variant.
Containers Usage and Types?
Container is a collection of stages used for the purpose of Reusability.
There are 2 types of Containers.
a) Local Container: Job Specific
b) Shared Container: Used in any job within a project.

Compare and Contrast ODBC and Plug-In stages?


ODBC: a) Poor Performance.
b) Can be used for Variety of Databases.
c) Can handle Stored Procedures.
Plug-In: a) Good Performance.
b) Database specific. (Only one database)
c) Cannot handle Stored Procedures.
Dimension Modelling types along with their significance
Data Modelling is broadly classified into 2 types:
a) E-R Diagrams (Entity-Relationship).
b) Dimensional Modelling.
What are the Ascential DataStage products and connectivity options?

Ascential Products
Ascential DataStage
Ascential DataStage EE (3)
Ascential DataStage EE MVS
Ascential DataStage TX
Ascential QualityStage
Ascential MetaStage
Ascential RTI (2)
Ascential ProfileStage
Ascential AuditStage
Ascential Commerce Manager
Industry Solutions
Connectivity
Files
RDBMS
Real-time
PACKs
EDI
Other
Explain Data Stage Architecture?
Data Stage contains two types of components:
Client components and Server components.
Client Components:
Data Stage Administrator
Data Stage Manager
Data Stage Designer
Data Stage Director
Server Components:
Data Stage Engine
Meta Data Repository
Package Installer

Data Stage Administrator (Roles and Responsibilities):
Used to create projects.
Contains the set of project properties.
We can set the buffer size (by default 128 MB) and increase it if needed.
We can set the Environment Variables.
In the Tunables tab we have in-process and inter-process row buffering:
In-process - rows are passed between linked active stages within a single process.
Inter-process - the linked active stages run as separate processes and rows are buffered between them.
It is just an interface to the metadata.

Data Stage Manager:
We can view and edit the Meta data Repository.
We can import table definitions.
We can export the Data stage components in .xml or .dsx format.
We can create routines and transforms.
We can compile multiple jobs.
Data Stage Designer:
We can create, compile and run jobs. We can declare stage variables in a Transformer; we can call routines, transforms, macros and functions.
We can write constraints.
Data Stage Director:
We can run the jobs.
We can schedule the jobs. (Schedule can be done daily, weekly, monthly, quarterly)
We can monitor the jobs.
We can release the jobs.

What is Meta Data Repository?


Meta data is data about the data.
The repository also contains:
Query statistics
ETL statistics
Business subject area
Source information
Target information
Source-to-target mapping information

What is Data Stage Engine?


It is the server engine running in the background that executes DataStage jobs. (The DataStage server engine is derived from the UniVerse engine; it is not a Java engine.)

What is Dimensional Modeling?


Dimensional Modeling is a logical design technique that seeks to present the data in a standard framework that is intuitive and allows for high-performance access.

What is Star Schema?


A Star Schema is a de-normalized multi-dimensional model. It contains a central fact table surrounded by dimension tables.
Dimension Table: it contains a primary key and descriptive attributes that give context to the facts.
Fact Table: It contains foreign keys to the dimension tables, measures and aggregates.

What is surrogate Key?


It is typically a 4-byte integer which replaces the transactional / business / OLTP key in the dimension table.
With 4 bytes we can store up to about 2 billion records.
Why do we need a surrogate key?
It is used for integrating data from multiple sources and serves better than the business key as the primary key.
It also helps with index maintenance, joins, table size, key updates, disconnected inserts and partitioning.

What is a Snowflake schema?
It is a partially normalized dimensional model in which at least one dimension is represented by two or more hierarchically related tables.
Explain Types of Fact Tables?
Factless Fact: contains only foreign keys to the dimension tables, with no measures.
Additive Fact: measures can be added across any dimension.
Semi-Additive: measures can be added across some dimensions only. E.g. percentages, discounts.
Non-Additive: measures cannot be added across any dimension. E.g. averages.
Conformed Fact: the same measure (same definition and equation) appears in two or more fact tables, so the facts can be compared across those tables using the same set of dimensions.

Explain the Types of Dimension Tables?
Conformed Dimension: if a dimension table is connected to more than one fact table, the granularity defined in the dimension table is common across those fact tables.
Junk Dimension: a dimension table which contains only flags and indicators.
Monster Dimension: a very large dimension, or one whose attributes change rapidly.
Degenerate Dimension: a dimension key (for example an order or invoice line number) stored in the fact table itself; typical of line-item-oriented fact table designs.
What are stage variables?
Stage variables are declaratives in Transformer Stage used to store values. Stage variables are
active at the run time. (Because memory is allocated at the run time).
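As an illustration (the link, column and variable names InLink.CustID, svPrevKey and svIsNewKey are hypothetical), stage variables can be declared in a Transformer to detect when a key value changes between consecutive rows; they are evaluated top to bottom for every input row:

svIsNewKey : IF InLink.CustID <> svPrevKey THEN @TRUE ELSE @FALSE
svPrevKey  : InLink.CustID

svIsNewKey can then be referenced in an output constraint or derivation, for example to emit a row only when a new key group starts.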
What is sequencer?
It sets the sequence of execution of server jobs.
What are Active and Passive stages?
Active Stage: active stages model the flow of data and provide mechanisms for combining data streams, aggregating data and converting data from one data type to another. E.g. Transformer, Aggregator, Sort, Row Merger etc.
Passive Stage: a passive stage handles access to databases or files for the extraction or writing of data. E.g. IPC stage, file stages, Universe, Unidata, DRS stage etc.
What is ODS?
Operational Data Store is a staging area where data can be rolled back.
What are Macros?
They are built from Data Stage functions and do not require arguments.
A number of macros are provided in the JOBCONTROL.H file to facilitate getting information
about the current job, and links and stages belonging to the current job. These can be used in
expressions (for example for use in Transformer stages), job control routines, filenames and table
names, and before/after subroutines.
DSHostName
DSJobStatus
DSProjectName
DSJobName
DSJobController
DSJobStartDate
DSJobStartTime
DSJobStartTimestamp
DSJobWaveNo
DSJobInvocations
DSJobInvocationId
DSStageLastErr
DSStageType
DSStageInRowNum
DSStageVarList
DSLinkLastErr
DSLinkName
DSStageName
DSLinkRowCount
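These macros take no arguments and can be used directly in derivations, job control code and before/after subroutines. For example (the call type "Audit" and the message text are arbitrary), a before-job subroutine could log an audit line:

Call DSLogInfo("Job " : DSJobName : " in project " : DSProjectName : " started at " : DSJobStartTimestamp, "Audit")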
What is KeyMgtGetNextValue?
It is a built-in (SDK) transform that generates sequential numbers. Its input type is a literal string and its output type is string.
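A minimal sketch of how it might be used as the derivation of an output key column, assuming the literal string argument names the key sequence (the name "CUST_KEY" is purely illustrative):

KeyMgtGetNextValue("CUST_KEY")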
What index is created on Data Warehouse?
Bitmap index is created in Data Warehouse.
What is container?
A container is a group of stages and links. Containers enable you to simplify and modularize
your server job designs by replacing complex areas of the diagram with a single container stage.
You can also use shared containers as a way of incorporating server job functionality into
parallel jobs.
DataStage provides two types of container:

Local containers. These are created within a job and are only accessible by that job. A local container is edited in a tabbed page of the job's Diagram window.
Shared containers. These are created separately and are stored in the Repository in the same way that jobs are. There are two types of shared container: server shared containers and parallel shared containers.
What is function? ( Job Control Examples of Transform Functions )
Functions take arguments and return a value.

BASIC functions: A function performs mathematical or string manipulations on the arguments supplied to it, and returns a value. Some functions have 0 arguments; most have 1 or more. Arguments are always in parentheses, separated by commas, as shown in this general syntax:
FunctionName (argument, argument)
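For instance (the link and column names below are purely illustrative), two BASIC function calls with their arguments:

FullName = Trim(InLink.FirstName) : " " : Trim(InLink.LastName)
NameLen = Len(FullName)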
DataStage BASIC functions: These functions can be used in a job control routine, which is defined as part of a job's properties and allows other jobs to be run and controlled from the first job. Some of the functions can also be used for getting status information on the current job; these are useful in active stage expressions and before- and after-stage subroutines. (A short usage sketch follows the table below.)
To do this ...  ->  Use this function ...
Specify the job you want to control  ->  DSAttachJob
Set parameters for the job you want to control  ->  DSSetParam
Set limits for the job you want to control  ->  DSSetJobLimit
Request that a job is run  ->  DSRunJob
Wait for a called job to finish  ->  DSWaitForJob
Get the meta data details for the specified link  ->  DSGetLinkMetaData
Get information about the current project  ->  DSGetProjectInfo
Get buffer size and timeout value for an IPC or Web Service stage  ->  DSGetIPCStageProps
Get information about the controlled job or current job  ->  DSGetJobInfo
Get information about the meta bag properties associated with the named job  ->  DSGetJobMetaBag
Get information about a stage in the controlled job or current job  ->  DSGetStageInfo
Get the names of the links attached to the specified stage  ->  DSGetStageLinks
Get a list of stages of a particular type in a job  ->  DSGetStagesOfType
Get information about the types of stage in a job  ->  DSGetStageTypes
Get information about a link in a controlled job or current job  ->  DSGetLinkInfo
Get information about a controlled job's parameters  ->  DSGetParamInfo
Get the log event from the job log  ->  DSGetLogEntry
Get a number of log events on the specified subject from the job log  ->  DSGetLogSummary
Get the newest log event, of a specified type, from the job log  ->  DSGetNewestLogId
Log an event to the job log of a different job  ->  DSLogEvent
Stop a controlled job  ->  DSStopJob
Return a job handle previously obtained from DSAttachJob  ->  DSDetachJob
Log a fatal error message in a job's log file and abort the job  ->  DSLogFatal
Log an information message in a job's log file  ->  DSLogInfo
Put an info message in the job log of a job controlling the current job  ->  DSLogToController
Log a warning message in a job's log file  ->  DSLogWarn
Generate a string describing the complete status of a valid attached job  ->  DSMakeJobReport
Insert arguments into the message template  ->  DSMakeMsg
Ensure a job is in the correct state to be run or validated  ->  DSPrepareJob
Interface to the system send mail facility  ->  DSSendMail
Log a warning message to a job log file  ->  DSTransformError
Convert a job control status or error code into an explanatory text message  ->  DSTranslateCode
Suspend a job until a named file either exists or does not exist  ->  DSWaitForFile
Check if a BASIC routine is cataloged, either in VOC as a callable item or in the catalog space  ->  DSCheckRoutine
Execute a DOS or DataStage Engine command from a before/after subroutine  ->  DSExecute
Set a status message for a job to return as a termination message when it finishes  ->  DSSetUserStatus
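To show how a few of these functions fit together, here is a minimal job-control sketch in DataStage BASIC (the job name LoadCustomerDim and the parameter ProcessDate are hypothetical): it attaches a job, sets a parameter, runs it, waits for completion and checks the finishing status.

* Attach the job, aborting this routine if the attach fails
hJob = DSAttachJob("LoadCustomerDim", DSJ.ERRFATAL)
* Set a job parameter, then request a normal run and wait for it to finish
ErrCode = DSSetParam(hJob, "ProcessDate", "2006-10-14")
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)
* Check how the job finished and log the outcome
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If Status = DSJS.RUNOK Then
   Call DSLogInfo("LoadCustomerDim finished OK", "JobControl")
End Else
   Call DSLogWarn("LoadCustomerDim finished with status " : Status, "JobControl")
End
ErrCode = DSDetachJob(hJob)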

What are Routines?
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or edit them. The following programming components are classified as routines:
Transform functions, Before/After subroutines, Custom UniVerse functions, ActiveX (OLE) functions, Web Service routines.
What is Hash file stage and what is it used for?

Used for Look-ups. It is like a reference table. It is also used in-place of ODBC, OCI tables for better
performance.
What are the types of Hashed File?
Hashed files are classified broadly into 2 types:
A) Static - sub-divided into 17 types based on the primary key pattern.
B) Dynamic - sub-divided into 2 types:
i) Generic
ii) Specific
The default hashed file type is Dynamic - Type 30.
What are Static Hash files and Dynamic Hash files?
As the names themselves suggest what they mean. In general we use Type-30 dynamic hashed files. The data file has a default size limit of 2 GB and the overflow file is used if the data exceeds the 2 GB size.
How did you handle reject data?
Typically a Reject link is defined and the rejected data is captured (for example into a reject file or table) so that it can be analyzed, corrected and reloaded. A Reject link has to be defined on every output link from which you wish to collect rejected data. Rejected data is typically bad data such as duplicate primary keys or null rows where data is expected.
What are other Performance tunings you have done in your last project to increase the performance
of slowly running jobs?
Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using
Hash/Sequential files for optimum performance also for data recovery in case job aborts.
Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster
inserts, updates and selects.
Tuned the 'Project Tunables' in Administrator for better performance.
Used sorted data for Aggregator.
Sorted the data as much as possible in DB and reduced the use of DS-Sort for better
performance of jobs.
Removed the data not used from the source as early as possible in the job.
Worked with DB-admin to create appropriate Indexes on tables for better performance of DS
queries.
Converted some of the complex joins/business logic in DS to stored procedures in the database for faster execution of the jobs.
If an input file has an excessive number of rows and can be split-up then use standard logic to
run jobs in parallel.
Before writing a routine or a transform, make sure that the required functionality is not already available in one of the standard routines supplied in the sdk or ds utilities categories.
Constraints are generally CPU intensive and take a significant amount of time to process. This
may be the case if the constraint calls routines or external macros but if it is inline code then the
overhead will be minimal.
Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the
unnecessary records even getting in before joins are made.
Tuning should occur on a job-by-job basis.
Use the power of DBMS.
Try not to use a sort stage when you can use an ORDER BY clause in the database.
Using a constraint to filter a record set is much slower than performing a SELECT WHERE.
Make every attempt to use the bulk loader for your particular database. Bulk loaders are
generally faster than using ODBC or OLE.
Tell me one situation from your last project where you faced a problem and how did you solve it?
1. The jobs in which data is read directly from OCI stages were running extremely slowly. I had to stage the data before sending it to the transformer to make the jobs run faster.
2. A job aborted in the middle of loading some 500,000 rows. We had the option of either cleaning/deleting the loaded data and then running the fixed job, or running the job again from the row at which it had aborted. To make sure the load was proper we opted for the former.
Tell me the environment in your last projects.
Give the OS of the server and the OS of the client of your most recent project.
How did you connect to DB2 in your last project?
Most of the time the data was sent to us in the form of flat files; the data was dumped and sent to us. In some cases where we needed to connect to DB2 for lookups, we used ODBC drivers to connect to DB2 (or DB2-UDB) depending on the situation and availability. Certainly DB2-UDB is better in terms of performance, as native drivers are always better than ODBC drivers. 'iSeries Access ODBC Driver 9.00.02.02' - the ODBC driver to connect to AS400/DB2.
What are Routines and where/how are they written and have you written any routines before?
Routines are stored in the Routines branch of the DataStage Repository, where you can create, view or
edit.
The following are different types of Routines:
1. Transform Functions
2. Before-After Job subroutines
3. Job Control Routines
How did you handle an 'Aborted' sequencer?
In almost all cases we have to delete the data inserted by this from DB manually and fix the job and then
run the job again.
Read the String functions in DS
Functions like [] (the substring operator) and ':' (the concatenation operator).
Syntax:
string [ [ start, ] length ]
string [ delimiter, instance, repeats ]
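A small illustrative sketch, assuming a (hypothetical) column InLink.Phone holding "123-456-7890":

AreaCode = InLink.Phone[1, 3]       ;* characters 1 to 3 -> "123"
LastPart = InLink.Phone["-", 3, 1]  ;* 3rd "-"-delimited field -> "7890"
Display = "(" : AreaCode : ") " : LastPart  ;* concatenation -> "(123) 7890"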
What will you do in a situation where somebody wants to send you a file and use that file as an input or reference and then run the job?
Under Windows: Use the 'WaitForFileActivity' under the Sequencers and then run the job. You could schedule the sequencer around the time the file is expected to arrive.
Under UNIX: Poll for the file. Once the file has arrived, start the job or sequencer depending on the file.

What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?
Use the crontab utility to invoke the dsjob command with the proper parameters passed.
Did you work in UNIX environment?
Yes. One of the most important requirements.
How would you call an external Java function which is not supported by DataStage?
Starting from DS 6.0 we have the ability to call external Java functions using a Java package from Ascential. In this case we can even use the command line to invoke the Java function, write the return values from the Java program (if any) to a file, and use that file as a source in the DataStage job.
How will you determine the sequence of jobs to load into the data warehouse?
First we execute the jobs that load the data into the dimension tables, then the fact tables, then the aggregator tables (if any).
This might raise another question: why do we have to load the dimension tables first and then the fact tables?
As we load the dimension tables the primary keys are generated, and these keys are the foreign keys in the fact tables.
Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a Truncate statement to the DB or does it do some kind of Delete logic?
There is no TRUNCATE on ODBC stages. 'Clear the table' issues a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options. They are radically different in permissions (Truncate requires you to have alter table permissions whereas Delete doesn't).
How do you rename all of the jobs to support your new file-naming conventions?
Create an Excel spreadsheet with new and old names. Export the whole project as a dsx. Write a Perl program which can do a simple rename of the strings by looking them up in the Excel file. Then import the new dsx file, preferably into a new project for testing. Recompile all jobs. Be cautious that the names of the jobs may also have to be changed in your job control jobs or Sequencer jobs, so you have to make the necessary changes to these Sequencers.
When should we use an ODS?
DWHs are typically read-only and batch-updated on a schedule.
ODSs are maintained in more real time, trickle-fed constantly.

What other ETL's you have worked with?


Informatica and also DataJunction if it is present in your Resume.
How good are you with your PL/SQL?
On the scale of 1-10 say 8.5-9
What versions of DS you worked with?
DS 7.5, DS 7.0.2, DS 6.0, DS 5.2
What's the difference between a Datastage Developer and a Datastage Designer?
A DataStage developer is the one who codes the jobs. A DataStage designer is the one who designs the jobs, i.e. deals with the blueprints and decides which stages are required to develop the code.
What are the command line functions that import and export the DS jobs?
dsimport.exe - imports the DataStage components.
dsexport.exe - exports the DataStage components.
How to handle date conversions in Datastage? Convert mm/dd/yyyy format to yyyy-dd-mm?
We use:
a) the "Iconv" function - internal conversion.
b) the "Oconv" function - external conversion.
An expression to convert mm/dd/yyyy format to yyyy-dd-mm is:
Oconv(Iconv(Fieldname, "D/MDY[2,2,4]"), "D-YDM[4,2,2]")
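As a quick check of the conversion codes above (the sample value is arbitrary): Oconv(Iconv("10/14/2006", "D/MDY[2,2,4]"), "D-YDM[4,2,2]") should return "2006-14-10", i.e. yyyy-dd-mm.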
Did you parameterize the job or hard-code the values in the jobs?
Always parameterize the job. Either the values come from Job Properties or from a Parameter Manager - a third-party tool. There is no way you should hard-code parameters in your jobs. The most often parameterized variables in a job are: DB DSN name, username, password, and the dates with respect to which the data is to be looked up.

What are the requirements for your ETL tool?


Do you have large sequential files (1 million rows, for example) that need to be compared every day
versus yesterday?
If so, then ask how each vendor would do that. Think about what process they are going to do. Are they requiring you to load yesterday's file into a table and do lookups?
If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files?
Then maybe they are the right one. It all depends on what you need the ETL to do.
If you are small enough in your data sets, then either would probably be OK.
What are the main differences between Ascential DataStage and Informatica PowerCenter?
Chuck Kelley's answer: You are right; they have pretty much similar functionality. However, what are the requirements for your ETL tool? Do you have large sequential files (1 million rows, for example) that need to be compared every day versus yesterday? If so, then ask how each vendor would do that. Think about what process they are going to do. Are they requiring you to load yesterday's file into a table and do lookups? If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do. If you are small enough in your data sets, then either would probably be OK.

Les Barbusinski's answer: Without getting into specifics, here are some differences you may want to explore with each vendor:

Does the tool use a relational or a proprietary database to store its metadata and scripts? If proprietary, why?

What add-ons are available for extracting data from industry-standard ERP, Accounting,
and CRM packages?

Can the tool's metadata be integrated with third-party data modeling and/or business intelligence tools? If so, how and with which ones?

How well does each tool handle complex transformations, and how much external
scripting is required?

What kinds of languages are supported for ETL script extensions?

Almost any ETL tool will look like any other on the surface. The trick is to find out which one will work best in your environment. The best way I've found to make this determination is to ascertain how successful each vendor's clients have been using their product - especially clients who closely resemble your shop in terms of size, industry, in-house skill sets, platforms, source systems, data volumes and transformation complexity.
Ask both vendors for a list of their customers with characteristics similar to your own that have used their ETL product for at least a year. Then interview each client (preferably several people at each site) with an eye toward identifying unexpected problems, benefits, or quirkiness with the tool that have been encountered by that customer. Ultimately, ask each customer whether, if they had it all to do over again, they'd choose the same tool and why. You might be surprised at some of the answers.
Joyce Bischoff's answer: You should do careful research when selecting products. You should first document your requirements, identify all possible products and evaluate each product against the detailed requirements. There are numerous ETL products on the market and it seems that you are looking at only two of them. If you are unfamiliar with the many products available, you may refer to www.tdan.com, the Data Administration Newsletter, for product lists.
If you ask the vendors, they will certainly be able to tell you which of their product's features are stronger than the other product's. Ask both vendors and compare the answers, which may or may not be totally accurate. After you are very familiar with the products, call their references and be
sure to talk with technical people who are actually using the product. You will not want the
vendor to have a representative present when you speak with someone at the reference site. It is
also not a good idea to depend upon a high-level manager at the reference site for a reliable
opinion of the product. Managers may paint a very rosy picture of any selected product so that
they do not look like they selected an inferior product.
In how many places can you call Routines?
You can call routines in four places:
1. Transforms
   a. Date transformation
   b. Upstring transformation
2. The Before & After subroutines
3. XML transformation
4. Web-based transformation
What is a Batch Program and how can you generate one?
A batch program is a program generated at run time and maintained by DataStage itself, but you can easily change it on the basis of your requirements (Extraction, Transformation, Loading). Batch programs are generated depending on your job nature, either a simple job or a sequencer job; you can see this program under the job control option.
Suppose 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4). If job 1 has 10,000 rows and after running the job only 5,000 rows have been loaded into the target table, the remaining rows are not loaded and the job aborts. How can you sort out the problem?
Suppose the job sequencer synchronizes or controls 4 jobs but job 1 has a problem. In this condition you should go to the Director and check what type of problem is showing: a data type problem, a warning message, job failed or job aborted. If the job failed it means a data type problem or a missing column action. So you should go to the Run window -> Click -> Tracing -> Performance, or in your target table -> General -> Action and select one of these two options:
(i) On Fail - Commit, Continue
(ii) On Skip - Commit, Continue.

First check how much data has already been loaded, then select the On Skip option and Continue; for the remaining data that was not loaded, select On Fail, Continue. Run the job again and you should get a success message.
What happens if RCP is disabled?
In such a case OSH has to perform import and export every time the job runs, and the processing time of the job is also increased.
What are Sequencers?
Sequencers are job control programs that execute other jobs with preset Job parameters.
What is the difference between the Filter stage and the Switch stage?
There are two main differences, and probably some minor ones as well. The two main differences are as follows:
1) The Filter stage can send one input row to more than one output link. The Switch stage cannot; like the C switch construct, it has an implicit break in every case.
2) The Switch stage is limited to 128 output links; the Filter stage can have a theoretically unlimited number of output links. (Note: this is not a challenge!)

How can I achieve constraint-based loading using DataStage 7.5? My target tables have interdependencies, i.e. primary key / foreign key constraints. I want my primary key tables to be loaded first and then my foreign key tables, and the primary key tables should be committed before the foreign key tables are executed. How can I go about it?
1) Create a Job Sequencer to load your tables in sequential mode. In the sequencer, call all the primary-key table loading jobs first, followed by the foreign-key tables; trigger the foreign-key table load jobs only when the primary-key load jobs run successfully (i.e. an OK trigger).
2) To improve the performance of the job, you can disable all the constraints on the tables and load them. Once loading is done, check the integrity of the data; raise the records which do not meet it as exception data and cleanse them. This is only a suggestion - normally when loading with constraints enabled, performance goes down drastically.

3) If you use star schema modeling, when you create the physical DB from the model you can delete all constraints, and the referential integrity will be maintained in the ETL process by looking up all your dimension keys while loading the fact tables. Once all dimension keys are assigned to a fact, the dimensions and the fact can be loaded together; at the same time RI is maintained at the ETL process level.
How do you merge two files in DS?
Either use the Copy command as a before-job subroutine if the metadata of the 2 files is the same, or create a job to concatenate the 2 files into one if the metadata is different.

How do you eliminate duplicate rows?


DataStage provides us with the Remove Duplicates stage in Enterprise Edition. Using that stage we can eliminate duplicates based on a key column.
How do you pass a filename as a parameter to a job?
During job development we can create a parameter 'FILE_NAME' and its value can be passed when the job is run.
Is there a mechanism available to export/import individual DataStage ETL jobs from the UNIX command line?
Try dscmdexport and dscmdimport. They won't handle the "individual job" requirement - you can only export full projects from the command line.
You can find the export and import executables on the client machine, usually someplace like: C:\Program Files\Ascential\DataStage.

Difference between the JOIN stage and the MERGE stage.
JOIN: performs join operations on two or more data sets input to the stage and then outputs the resulting data set.
MERGE: combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record that are required.
A master record and an update record are merged only if both of them have the same values for the merge key column(s) that we specify. Merge key columns are one or more columns that exist in both the master and update records.
Advantages of DataStage?
Business advantages:
It helps make better business decisions.
It is able to integrate data coming from all parts of the company.
It helps to understand new and already existing clients.
We can collect data about different clients and compare them.
It makes the exploration of new business possibilities possible.
We can analyze trends in the data it reads.
Technological advantages:
It handles all company data and adapts to the needs.


It offers the possibility of organizing complex business intelligence.
It is flexible and scalable.
It accelerates the running of a project.
It is easily implementable.

What is the architecture of data stage?


Basically the architecture of DS is a client/server architecture, with client components and server components.
Client components are of 4 types:
1. Data stage designer
2. Data stage administrator
3. Data stage director
4. Data stage manager
Data stage designer is used to design the jobs.
Data stage manager is used to import & export projects and to view & edit the contents of the repository.
Data stage administrator is used for creating projects, deleting projects & setting the environment variables.
Data stage director is used to run the jobs, validate the jobs and schedule the jobs.
Server components
DS server: runs executable server jobs, under the control of the DS director, that extract, transform, and
load data into a DWH.
DS Package installer: a user interface used to install packaged DS jobs and plug-ins.
Repository or project: a central store that contains all the information required to build DWH or data
mart.
What are the stages you worked on?
I have some jobs where every month the log details are automatically deleted. What steps do you have to take for that?
We have to set the auto-purge option in DS Administrator.
I want to run multiple jobs in a single job. How can you handle that?
In job properties set the option ALLOW MULTIPLE INSTANCES.
What is version controlling in DS?
In DS, version controlling is used to back up the project or jobs. This option is available from DS version 7.1 onwards.
Version controls are of 2 types:
1. VSS - Visual Source Safe
2. CVSS - Concurrent Visual Source Safe
VSS is designed by Microsoft, but the disadvantage is that only one user can access it at a time; other users must wait until the first user completes the operation.
With CVSS many users can access it concurrently. Compared to VSS, the cost of CVSS is high.
What is the difference between clearing the log file and clearing the status file?
Clear log - we can clear the log details by using DS Director. Under the Job menu the Clear Log option is available; by using this option we can clear the log details of a particular job.
Clear status file - lets the user remove the status of the records associated with all stages of the selected jobs (in DS Director).
I developed a job with 50 stages; at run time one stage is missing. How can you identify which stage is missing?
By using the usage analysis tool, which is available in DS Manager, we can find out which items are used in the job.
My job takes 30 minutes to run and I want it to run in less than 30 minutes. What steps do we have to take?
By using the performance tuning aspects which are available in DS, we can reduce the time:
In DS Administrator: in-process and inter-process row buffering.
In between passive stages: the inter-process (IPC) stage.
OCI stage: array size and transaction size.
And also use the Link Partitioner & Link Collector stages in between passive stages.
How to do row transposition in DS?
The Pivot stage is used for transposition. Pivot is an active stage that maps sets of columns in an input table to a single column in an output table.
If a job is locked by some user, how can you unlock that particular job in DS?
We can unlock the job by using the Clean Up Resources option which is available in DS Director. Otherwise we can find the PID (process id) and kill the process on the UNIX server.
I am getting an input value like X = Iconv("31 DEC 1967", "D"). What is the value of X?
The X value is zero.
The Iconv function converts a string to an internal storage format. It takes 31 DEC 1967 as day zero and counts days from that date (31-dec-1967).

What are unit testing, integration testing and system testing?
Unit testing: for DS, a unit test will check for data type mismatches, the size of particular data types and column mismatches.
Integration testing: according to the dependencies we integrate all jobs into one sequence, called a control sequence.
System testing: system testing is nothing but the performance tuning aspects in DS.
How many hashing algorithms are available for static hash file and dynamic hash file?
Sixteen hashing algorithms for static hash file.
Two hashing algorithms for dynamic hash file( GENERAL or SEQ.NUM)
What happens when you have a job that links two passive stages together?
Obviously there is some process going on. Under the covers DS inserts a cut-down transformer stage between the passive stages, which just passes data straight from one stage to the other.
What is the use of the Nested Condition activity?
Nested Condition: allows you to further branch the execution of a sequence depending on a condition.
I have three jobs A, B and C which are dependent on each other. I want to run jobs A & C daily and job B only on Sunday. How can you do it?
First schedule jobs A & C Monday to Saturday in one sequence.
Then take the three jobs according to their dependency in one more sequence and schedule that sequence only on Sunday.
What are the ways to execute datastage jobs?
A job can be run using a few different methods:

from Datastage Director (menu Job -> Run now...)

from command line using a dsjob command

Datastage routine can run a job (DsRunJob command)

by a job sequencer
How to invoke a Datastage shell command?
Datastage shell commands can be invoked from :

Datastage administrator (projects tab -> Command)

Telnet client connected to the datastage server


How to stop a job when its status is running?
To stop a running job go to DataStage Director and click the stop button (or Job -> Stop from the menu). If it doesn't help, go to Job -> Cleanup Resources, select the process which holds a lock and click Logout.
If it still doesn't help go to the datastage shell and invoke the following command: ds.tools
It will open an administration panel. Go to 4.Administer processes/locks , then try invoking one of the
clear locks commands (options 7-10).
How to run and schedule a job from the command line?
To run a job from the command line use the dsjob command.
Command syntax: dsjob [-file <file> <server> | [-server <server>] [-user <user>] [-password <password>]] <primary command> [<arguments>]
The command can be placed in a batch file and run by a system scheduler.
How to release a lock held by jobs?
Go to the datastage shell and invoke the following command: ds.tools
It will open an administration panel. Go to 4.Administer processes/locks , then try invoking one of the
clear locks commands (options 7-10).

User privileges for the default DataStage roles?


The role privileges are:

DataStage Developer - user with full access to all areas of a DataStage project

DataStage Operator - has privileges to run and manage deployed DataStage jobs

-none- - no permission to log on to DataStage

What is a command to analyze hashed file?


There are two ways to analyze a hashed file. Both should be invoked from the datastage command shell.
These are:

FILE.STAT command

ANALYZE.FILE command
Is it possible to run two versions of datastage on the same pc?
Yes, even though different versions of Datastage use different system dll libraries.
To dynamically switch between Datastage versions install and run DataStage Multi-Client Manager.
That application can unregister and register system libraries used by Datastage.
Error in Link collector - Stage does not support in-process active-to-active inputs or outputs
To get rid of the error just go to the Job Properties -> Performance and select Enable row buffer.
Then select Inter process which will let the link collector run correctly.
Buffer size set to 128Kb should be fine, however it's a good idea to increase the timeout.

What is the DataStage equivalent to like option in ORACLE


The following statement in Oracle:
select * from ARTICLES where article_name like '%WHT080%';
Can be written in DataStage (for example as the constraint expression):
incol.empname matches '...WHT080...'
What is the difference between the logging text and the final text message in the Terminator stage?
Every stage has a 'Logging Text' area on its General tab which logs an informational message when the stage is triggered or started.
Informational - a green line, a DSLogInfo() type message.
The Final Warning Text - the red fatal message, which is included in the sequence abort message.
Error in STPstage - SOURCE Procedures must have an output link
The error appears in Stored Procedure (STP) stage when there are no stages going out of that stage.
To get rid of it go to 'stage properties' -> 'Procedure type' and select Transform

How to invoke an Oracle PLSQL stored procedure from a server job


To run a pl/sql procedure from Datastage a Stored Procedure (STP) stage can be used.
However it needs a flow of at least one record to run.
It can be designed in the following way:

source odbc stage which fetches one record from the database and maps it to one
column - for example: select sysdate from dual

A transformer which passes that record through. If required, add pl/sql procedure parameters as columns on the right-hand side of the transformer's mapping.

Put Stored Procedure (STP) stage as a destination. Fill in connection parameters, type in
the procedure name and select Transform as procedure type. In the input tab select 'execute
procedure for each row' (it will be run once).
Design of a DataStage server job with Oracle plsql procedure call

Is it possible to run a server job in parallel?


Yes, even server jobs can be run in parallel.
To do that go to 'Job properties' -> General and check the Allow Multiple Instance button.
The job can now be run simultaneously from one or many sequence jobs. When it happens datastage
will create new entries in Director and new job will be named with automatically generated suffix (for
example second instance of a job named JOB_0100 will be named JOB_0100.JOB_0100_2). It can be
deleted at any time and will be automatically recreated by datastage on the next run.

Error in STPstage - STDPROC property required for stage xxx


The error appears in Stored Procedure (STP) stage when the 'Procedure name' field is empty. It occurs
even if the Procedure call syntax is filled in correctly.
To get rid of error fill in the 'Procedure name' field.

Datastage routine to open a text file with error catching


Note: work_dir and file1 are parameters passed to the routine.
* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")

ABORT
END

Datastage routine which reads the first line from a text file
Note: work_dir and file1 are parameters passed to the routine.
* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")
ABORT
END
READSEQ FILE1.RECORD FROM H.FILE1 ELSE
Call DSLogWarn("******************** File is empty", "JobControl")
END
firstline = Trim(FILE1.RECORD[1,32], " ", "A")  ;* take the first 32 chars and strip all spaces
Call DSLogInfo("******************** Record read: " : firstline, "JobControl")
CLOSESEQ H.FILE1
How to test a datastage routine or transform?
To test a datastage routine or transform go to the Datastage Manager.
Navigate to Routines, select a routine you want to test and open it. First compile it and then click
'Test...' which will open a new window. Enter test parameters in the left-hand side column and click run
all to see the results.
Datastage will remember all the test arguments during future tests.

When hashed files should be used? What are the benefits or using them?
Hashed files are the best way to store data for lookups. They're very fast when looking up the key-value
pairs.
Hashed files are especially useful if they store information with data dictionaries (customer details,
countries, exchange rates). Stored this way it can be spread across the project and accessed from
different jobs.

How to construct a container and deconstruct it or switch between local and shared?
To construct a container go to Datastage designer, select the stages that would be included in the
container and from the main menu select Edit -> Construct Container and choose between local and
shared.
Local will be only visible in the current job, and share can be re-used. Shared containers can be viewed
and edited in Datastage Manager under 'Routines' menu.
Local Datastage containers can be converted at any time to shared containers in datastage designer by
right clicking on the container and selecting 'Convert to Shared'. In the same way it can be converted
back to local.
Corresponding datastage data types to ORACLE types?
Most of the datastage variable types map very well to oracle types. The biggest problem is to map
correctly oracle NUMBER(x,y) format.
The best way to do that in Datastage is to convert oracle NUMBER format to Datastage Decimal type
and to fill in Length and Scale column accordingly.
There are no problems with string mappings: oracle Varchar2 maps to datastage Varchar, and oracle
char to datastage char.
How to adjust commit interval when loading data to the database?
In earlier versions of datastage the commit interval could be set up in:
General -> Transaction size (in version 7.x it's obsolete)
Starting from Datastage 7.x it can be set up in properties of ODBC or ORACLE stage in Transaction
handling -> Rows per transaction.
If set to 0 the commit will be issued at the end of a successful transaction.
What is the use of INROWNUM and OUTROWNUM datastage variables?
@INROWNUM and @OUTROWNUM are internal datastage variables which do the following:

@INROWNUM counts incoming rows to a transformer in a datastage job
@OUTROWNUM counts outgoing rows from a transformer in a datastage job
These variables can be used to generate sequences, primary keys, id's, numbering rows and also for debugging and error tracing.
They play a similar role to sequences in Oracle.
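Two illustrative uses (the job parameter StartKey and the row limit are hypothetical): a Transformer derivation that builds a surrogate key from the output row counter, and a constraint that passes only the first few rows for debugging:

* derivation for a surrogate key column
@OUTROWNUM + StartKey
* constraint that lets only the first 10 rows through
@INROWNUM <= 10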
Datastage trim function cuts out more characters than expected
By default the datastage Trim function works this way:
Trim("  a  b  c  d  ") will return "a b c d" (internal runs of spaces are collapsed to single spaces), while in many other programming/scripting languages only the leading and trailing spaces would be removed.
That is because by default the R parameter is assumed: R removes leading and trailing occurrences of the character and reduces multiple occurrences to a single occurrence.
To remove only the leading and trailing blanks use the trim function in the following way: Trim("  a  b  c  d  ", " ", "B")
Database update actions in ORACLE stage
The destination table can be updated using various Update actions in Oracle stage. Be aware of the fact
that it's crucial to select the key columns properly as it will determine which column will appear in the
WHERE part of the SQL statement. Available actions:

Clear the table then insert rows - deletes the contents of the table (DELETE statement) and
adds new rows (INSERT).

Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement)
and adds new rows (INSERT).

Insert rows without clearing - only adds new rows (INSERT statement).

Delete existing rows only - deletes matched rows (issues only the DELETE statement).

Replace existing rows completely - deletes the existing rows (DELETE statement), then adds
new rows (INSERT).

Update existing rows only - updates existing rows (UPDATE statement).

Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.

Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.
User-defined SQL - the data is written using a user-defined SQL statement.


User-defined SQL file - the data is written using a user-defined SQL statement from a file.
Use and examples of ICONV and OCONV functions?

ICONV and OCONV functions are quite often used to handle data in Datastage.
ICONV converts a string to an internal storage format and OCONV converts an expression to an
output format.
Syntax:
Iconv (string, conversion code)
Oconv(expression, conversion code)
Some useful iconv and oconv examples:
Iconv("10/14/06", "D2/") = 14167
Oconv(14167, "D-E") = "14-10-2006"
Oconv(14167, "D DMY[,A,]") = "14 OCTOBER 2006"
Oconv(12003005, "MD2$,") = "$120,030.05"
That expression formats a number and rounds it to 2 decimal places:
Oconv(L01.TURNOVER_VALUE*100,"MD2")
Iconv and oconv can be combined in one expression to reformat date format easily:
Oconv(Iconv("10/14/06", "D2/"),"D-E") = "14-10-2006"
ERROR 81021 Calling subroutine DSR_RECORD ACTION=2
Datastage system help gives the following error description:
SYS.HELP. 081021
MESSAGE.. dsrpc: Error writing to Pipe.

The problem appears when a job sequence is used and it contains many stages (usually more than 10)
and very often when a network connection is slow.
Basically the cause of a problem is a failure between DataStage client and the server communication.
The solution to the issue is:
Do not log in to Datastage Designer using the 'Omit' option on the login screen. Type in the username and password explicitly and the job should compile successfully.
If the above does not help, execute the DS.REINDEX ALL command from the Datastage shell.
How to check Datastage internal error descriptions
To check the description of a number go to the datastage shell (from administrator or telnet to the
server machine) and invoke the following command:
SELECT * FROM SYS.MESSAGE WHERE @ID='081021'; - where in that case the number 081021 is an
error number
The command will produce a brief error description which probably will not be helpful in resolving an
issue but can be a good starting point for further analysis.

Error timeout waiting for mutex

The error message usually looks like follows:
... ds_ipcgetnext() - timeout waiting for mutex
There may be several reasons for the error and thus solutions to get rid of it.
The error usually appears when using Link Collector, Link Partitioner and Interprocess (IPC) stages. It
may also appear when doing a lookup with the use of a hash file or if a job is very complex, with the use
of many transformers.
There are a few things to consider to work around the problem:
- increase the buffer size (up to to 1024K) and the Timeout value in the Job properties (on the
Performance tab).
- ensure that the key columns in active stages or hashed files are composed of allowed characters - get rid of nulls and try to avoid language-specific characters which may cause the problem.
- try to simplify the job as much as possible (especially if it's very complex). Consider splitting it into two or three smaller jobs, review fetches and lookups and try to optimize them (especially have a look at the SQL statements).
ERROR 30107 Subroutine failed to complete successfully
Datastage system help gives the following error description:
SYS.HELP. 930107
MESSAGE.. DataStage/SQL: Illegal placement of parameter markers

The problem appears when a project is moved from one project to another (for example when
deploying a project from a development environment to production).
The solution to the issue is:
Rebuild the repository index by executing the DS.REINDEX ALL command from the Datastage shell
Datastage Designer hangs when editing job activity properties
The appears when running Datastage Designer under Windows XP after installing patches or the Service
Pack 2 for Windows.
After opening a job sequence and navigating to the job activity properties window the application
freezes and the only way to close it is from the Windows Task Manager.
The solution to the problem is very simple: just download and install the XP SP2 patch for the
Datastage client.
It can be found on the IBM client support site (need to log in):
https://www.ascential.com/eservice/public/welcome.do
Go to the software updates section and select an appropriate patch from the Recommended DataStage
patches section.
Sometimes users face problems when trying to log in (for example when the license doesn't cover
IBM Active Support); then it may be necessary to contact IBM support, which can be reached at
WDISupport@us.ibm.com

Can Datastage use Excel files as a data input?

Microsoft Excel spreadsheets can be used as a data input in Datastage. Basically there are two possible
approaches available:
Access the Excel file via ODBC - this approach requires creating an ODBC connection to the Excel file on the
Datastage server machine and using an ODBC stage in Datastage. The main disadvantage is that it is
impossible to do this on a Unix machine. On Datastage servers running on Windows it can be set up here:
Control Panel -> Administrative Tools -> Data Sources (ODBC) -> User DSN -> Add -> Driver for Microsoft
Excel (*.xls) -> Provide a data source name -> Select the workbook -> OK
Save the Excel file as CSV - save the data from the Excel spreadsheet to a CSV text file and use a Sequential
File stage in Datastage to read the data.

Parallel processing
Datastage jobs are highly scalable due to the implementation of parallel processing. The EE
architecture is process-based (rather than thread processing), platform independent and uses the
processing node concept. Datastage EE is able to execute jobs on multiple CPUs (nodes) in
parallel and is fully scalable, which means that a properly designed job can run across resources
within a single machine or take advantage of parallel platforms like a cluster, GRID, or MPP
architecture (massively parallel processing).

Partitioning and Pipelining


Partitioning means breaking a dataset into smaller sets and distributing them evenly across the
partitions (nodes). Each partition of data is processed by the same operation and transformed in
the same way.
The main outcome of using a partitioning mechanism is linear scalability. This means
for instance that once the data is evenly distributed, a 4 CPU server will process the data four
times faster than a single CPU machine.
Pipelining means that each part of an ETL process (Extract, Transform, Load) is executed
simultaneously, not sequentially. The key concept of ETL Pipeline processing is to start the
Transformation and Loading tasks while the Extraction phase is still running.

Datastage Enterprise Edition automatically combines pipelining, partitioning and parallel


processing. The concept is hidden from a Datastage programmer. The job developer only
chooses a method of data partitioning and the Datastage EE engine will execute the partitioned
and parallelized processes.

Section 1.01 Differences between Datastage Enterprise and Server Edition


1. The major difference between Infosphere Datastage Enterprise and Server edition is that
Enterprise Edition (EE) introduces parallel jobs. Parallel jobs support a completely new set of stages,
which implement the scalable and parallel data processing mechanisms. In most cases parallel jobs and
stages look similar to the Datastage Server objects, however their capabilities are quite different.
In rough outline:
- Parallel jobs are executable Datastage programs, managed and controlled by the Datastage Server
runtime environment.
- Parallel jobs have a built-in mechanism for pipelining, partitioning and parallelism. In most cases no
manual intervention is needed to implement those techniques optimally.
- Parallel jobs are a lot faster in ETL tasks such as sorting, filtering and aggregating.
2. Datastage EE jobs are compiled into OSH (Orchestrate Shell script language). OSH executes
operators - instances of executable C++ classes, pre-built components representing the stages used in
Datastage jobs. Server jobs are compiled into BASIC, which is an interpreted pseudo-code. This is why
parallel jobs run faster, even if processed on one CPU.
3. Datastage Enterprise Edition adds functionality to the traditional server stages, for instance
record and column level format properties.
4. Datastage EE also brings completely new stages implementing the parallel concept, for example:
- Enterprise database connectors for Oracle, Teradata & DB2
- Development and debug stages - Peek, Column Generator, Row Generator, Head, Tail, Sample ...
- Data Set, File Set, Complex Flat File, Lookup File Set ...
- Join, Merge, Funnel, Copy, Modify, Remove Duplicates ...
5. When processing large data volumes, Datastage EE jobs are the right choice; however, when dealing
with a smaller data environment, Server jobs might be easier to develop, understand and manage.
When a company has both Server and Enterprise licenses, both types of jobs can be used.
6. Sequence jobs are the same in Datastage EE and Server editions.

What is the difference between DS 7.5 & 8.1?

The newer version of DS (8.x) supports QualityStage, ProfileStage etc., and it also contains a web-based console.
1. To implement SCD we have a separate stage (the SCD stage).
2. We don't have the Manager client tool in version 8; it is incorporated into the Designer itself.
3. There is no need to hard-code the parameters for every job - we have an option called Parameter Set.
If we create a parameter set, we can call it for the whole project, a job or a sequence.

What happens when a job is compiling?

During compilation of a DataStage parallel job there is very high CPU and memory utilization
on the server, and the job may take a very long time to compile.
What is APT_CONFIG in DS?
APT_CONFIG is just an environment variable used to identify the *.apt file. Don't confuse it with the *.apt
file itself, which holds the node information and the configuration of the SMP/MPP server.
The APT configuration file stores the node information, the disk storage locations and the scratch disk
locations, and Datastage understands the architecture of the system from this configuration file. For
parallel processing normally two or more nodes are defined.
More precisely, the APT_CONFIG_FILE variable (not just APT_CONFIG) points to the
configuration file that defines the nodes (with their disk and scratch/temp
areas) for the specific project.
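For illustration, a minimal two-node configuration file might look like the sketch below (the host name
and directory paths are assumptions and must match your own environment):

{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/datastage/node1" {pools ""}
    resource scratchdisk "/scratch/datastage/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/datastage/node2" {pools ""}
    resource scratchdisk "/scratch/datastage/node2" {pools ""}
  }
}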
Is it possible to add extra nodes in the configuration file?
What is RCP and how does it work?
Runtime column propagation is used in the case of partial schema usage: when we only know about the
columns to be processed and want all other columns to be propagated to the target as they are, we check
the enable RCP option in the Administrator, on the output page Columns tab or the stage page General tab,
and we only need to specify the schema of the tables we are concerned with.
According to the documentation, Runtime Column Propagation (RCP) allows DataStage to be flexible about
the columns you define in a job. If RCP is enabled for a project you can just define the columns you are
interested in using in a job, but ask DataStage to propagate the other columns through the various stages.
Such columns can be extracted from the data source and end up on your data target without explicitly
being operated on in between.
Sequential files, unlike most other data sources, do not have inherent column definitions, so DataStage
cannot always tell where there are extra columns that need propagating. You can only use RCP on
sequential files if you have used the Schema File property to specify a schema which describes all the
columns in the sequential file. You need to specify the same schema file for any similar stages in the job
where you want to propagate columns. Stages that require a schema file are: Sequential File, File Set,
External Source, External Target, Column Import, Column Export.
Runtime column propagation can also be used with the Column Import stage. If RCP is enabled in our
project we can define only the columns which we are interested in, and DataStage will send the rest of
the columns through the various other stages. This ensures such columns reach the target even though
they are not used in between the stages.
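For illustration, a schema file referenced through the Schema File property is a plain text file like the
sketch below (the column names and types are hypothetical):

record
(
  CUSTOMER_ID: int32;
  CUSTOMER_NAME: string[max=50];
  ORDER_DATE: date;
  AMOUNT: decimal[10,2];
)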
Star schema and snowflake schema - what is the difference?

Star schema:
- De-normalized data structure
- Category-wise single dimension table
- More data dependency and redundancy
- No need to use complicated joins
- Query results are faster
- No parent table
- Simple DB structure

Snowflake schema:
- Normalized data structure
- Dimension table split into many pieces
- Less data dependency and no redundancy
- Complicated joins
- Some delay in query processing
- May contain parent tables
- Complicated DB structure

Difference bet OLTP n Datawarehouse?


The OLTP database records transactions in real time and aims to automate clerical data entry processes
of a business entity. Addition, modification and deletion of data in the OLTP database is essential and
the semantics of the application used in the front end impact on the organization of the data in the
database.
The data warehouse on the other hand does not cater to real time operational requirements of the
enterprise. It is more a storehouse of current and historical data and may also contain data extracted
from external data sources.
However, the data warehouse supports OLTP system by providing a place for the latter to offload data
as it accumulates and by providing services which would otherwise degrade the performance of the
database.
Differences Data warehouse database and OLTP database
Data warehouse database
Designed for analysis of business measures by categories and attributes
Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
Loaded with consistent, valid data; requires no real time validation
Supports few concurrent users relative to OLTP
OLTP database
Designed for real time business operations.
Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table.
Optimized for validation of incoming data during transactions; uses validation data tables.
Supports thousands of concurrent users.
What is data modelling?
The analysis of data objects and their relationships to other data objects. Data modeling is often
the first step in database design and object-oriented programming as the designers first create a
conceptual model of how data items relate to each other. Data modeling involves a progression
from conceptual model to logical model to physical schema.
Data modelling is the process of identifying entities, the relationship between those entities and
their attributes. There are a range of tools used to achieve this such as data dictionaries, decision
trees, decision tables, schematic diagrams and the process of normalisation.
How to retrieve the second highest salary?
-- using an analytic function (Oracle); this replaces the plain "rownum <= 2" approach, which returns
-- the top two rows rather than only the second one:
select ename, esal
from (select ename, esal, dense_rank() over (order by esal desc) rnk from hsal)
where rnk = 2;
-- using a correlated subquery:
select max(salary) from emp where salary < (select max(salary) from emp);
-- using a hierarchical query:
select max(sal) from emp
where level = 2
connect by prior sal > sal
group by level;
How to remove duplicates from a table?
-- compare the total row count with the distinct row count to check for duplicates:
select count(*) from MyTable;
select distinct * from MyTable;
-- copy all distinct values into a new table (SQL Server syntax):
select distinct * into NewTable from MyTable;
-- list the values that occur more than once:
SELECT email, COUNT(email) AS NumOccurrences
FROM users
GROUP BY email
HAVING COUNT(email) > 1;
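To actually delete the duplicate rows while keeping one copy of each, a ROWID-based delete can be used
(a sketch assuming an Oracle database and the same hypothetical users/email example as above):

DELETE FROM users a
WHERE a.ROWID > (SELECT MIN(b.ROWID)
                 FROM users b
                 WHERE b.email = a.email);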
Difference bet egrep n fgrep?
There is a difference. fgrep cannot search for regular expressions in a string; it is used for plain
string matching.
egrep can search regular expressions too.
grep covers both: by default it uses basic regular expressions, with the -E option it behaves like egrep
(extended regular expressions) and with the -F option like fgrep (fixed strings).
Hence the most flexible one to use is grep.
fgrep = "Fixed GREP".
fgrep searches for fixed strings only. The "f" does not stand for "fast" - in fact, "fgrep foobar *.c"
is usually slower than "egrep foobar *.c" (Yes, this is kind of surprising. Try it.)
Fgrep still has its uses though, and may be useful when searching a file for a larger number of
strings than egrep can handle.
egrep = "Extended GREP"
egrep uses fancier regular expressions than grep. Many people use egrep all the time, since it has
some more sophisticated internal algorithms than grep or fgrep, and is usually the fastest of the
three programs
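A small usage sketch (the log file name is hypothetical):

grep 'aborted' job.log           # basic regular expressions
grep -E 'warning|fatal' job.log  # extended regular expressions, same as egrep
grep -F 'a.b*c' job.log          # literal fixed-string match, same as fgrep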

CHMOD command?
Permissions
u - User who owns the file.
g - Group that owns the file.
o - Other.
a - All.
r - Read the file.
w - Write or edit the file.
x - Execute or run the file as a program.
Numeric Permissions:
chmod permissions can also be set by using numeric permissions:
400 read by owner
040 read by group
004 read by anybody (other)
200 write by owner
020 write by group
002 write by anybody
100 execute by owner
010 execute by group
001 execute by anybody
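For example, combining the numeric values (the script name is hypothetical):

chmod 755 run_job.sh           # owner rwx (4+2+1), group r-x (4+1), others r-x (4+1)
chmod u=rwx,go=rx run_job.sh   # the equivalent symbolic form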

What is the difference between DS 7.5 & 8.1?

The main difference is that the DS Manager client is combined with the DS Designer in 8.1, and the
following are new in 8.1: the SCD (type 2) stage, data connection objects, parameter sets, QualityStage
and range lookup.
Difference between internal sort and external sort?
Performance-wise the internal sort is best because it doesn't use any extra buffer, whereas the external
sort takes buffer memory to store the records.
How do you pass only a required number of records through the partitions?
Go to Job Properties -> Execution -> enable trace compile -> give the required number of records.
What happens when a job is compiling?
1. All processing stages generate OSH code.
2. Transformers generate C++ code in the background.
3. The job information is updated in the metadata repository.
4. The job is compiled.
What is APT_CONFIG in DS?
It points to the configuration file which defines the degree of parallelism for our jobs.
How many types of parallelism and partitions are there?
Two types of systems: SMP and MPP.
Is it possible to add extra nodes in the configuration file?
Yes, it is possible to add extra nodes: go to the configuration management tool (where you find the APT
configuration file) and edit it for the required number of nodes.
What is RCP and how does it work?
Runtime column propagation is used to propagate the columns which are not defined in the metadata.
How does data move from one stage to another?
In the form of virtual data sets.
Is it possible to run multiple instances of a single job?
Yes - go to Job Properties, where there is an option to allow multiple instances.
What is APT_DUMP_SCORE?

APT_DUMP_SCORE shows the operators, datasets, nodes, partitions, combinations and
processes used in a job. It is an environment variable set through the Administrator.
Pipeline parallelism:
here each stage will work on a separate processor.
What is the difference between Job Control and Job Sequence?
Job control is used to control jobs programmatically: through it we can pass parameters,
conditions, log file information, dashboard information, load recovery etc.
A job sequence is used to run a group of jobs based on certain conditions. For final/incremental
processing we keep all the jobs in one sequence and run them together by using triggers.
What is the max size of the Data Set stage (PX)? No limit.
Performance in the Sort stage
If the source is an Oracle DB then you can write a user-defined query to sort and remove duplicates in
the source itself, and by maintaining suitable key partitioning techniques you can improve the performance.
If that is not the case, go for a key-based partitioning technique in the Sort stage, keeping the same
partitioning as in the previous stage. Don't allow duplicates - remove duplicates and give a unique
partition key.
How to develop SCD using the LOOKUP stage?
We can implement SCD by using the LOOKUP stage, but only for SCD type 1, not for SCD type 2.
We have to take the source (file or DB) and a dataset as the reference link (for the lookup), then in the
LOOKUP stage compare the source with the dataset, giving the lookup failure condition as Continue/Continue.
After that, in the transformer we give the condition, and then we take two targets for insert and update,
where we manually write the SQL insert and update statements.
If you see the design, you can easily understand it.

What is the difference between IBM WebSphere DataStage 7.5 (Enterprise Edition) & standard
Ascential DataStage 7.5?
IBM Information Server, also known as DS 8, has more features like QualityStage & MetaStage.
It maintains its repository in DB2, unlike files in 7.5. It also has a stage specifically for SCD types 1 & 2.
I think there is no version called 'standard Ascential DataStage 7.5'; I know only the advanced
edition of Datastage, i.e. WebSphere DataStage and QualityStage, released by IBM with version 8.0.1.
In this there are only 3 client tools (Administrator, Designer, Director); the Manager has been removed
and its import/export functionality is included in the Designer itself. Some extra stages have been added,
like the SCD stage, with which we can implement SCD1 and SCD2 directly, and there are some other
advanced stages as well.
They have also included QualityStage, which is used for data validation and is very important
for a DWH. There are so many things available in QualityStage that we can think of it as a
separate tool for the DWH.
What are the errors you experienced with DataStage?
In Datastage, warnings and fatal errors appear in the job log file.
If there is a fatal error the job gets aborted; warnings do not abort the job, but we have to handle
them as well - the log should ideally be clear of warnings too.
Many different errors come up in different jobs, for example:
- Parameter not found in job load recover.
- A child job failed because of some reason.
- A control job failed because of some reason.
- etc.
What are the main differences between server jobs and parallel jobs in Datastage?
In server jobs we have few stages; they are mainly logic-intensive, we use the transformer for
most things, and they do not use MPP systems.
In parallel jobs we have lots of stages, they are stage-intensive, for each particular task there are
built-in stages, and they use MPP systems.

***********************************************************
In server jobs we don't have an option to process the data on multiple nodes as in parallel jobs. In parallel
jobs we have the advantage of processing the data in pipelines and by partitioning, whereas we don't have
any such concept in server jobs.
There are lots of differences when using the same stages in server and parallel jobs. For example, in parallel
jobs a sequential file (or any other file stage) can have either an input link or an output link, but in server
jobs it can have both (and more than one).
********************************************************************
Server jobs compile and run within the Datastage server, but parallel jobs compile and run
within the Datastage UNIX server.
Server jobs extract all the rows from the source to the next stage; only then does that stage become
active and pass the rows on towards the target level or DWH, which is time consuming.
In parallel jobs there are two types of parallelism:
1. Pipeline parallelism - based on performance, some rows can be extracted from the source to the next
stage while, at the same time, that stage is already active and passing rows on to the target level or
DWH; it maintains only one node between source and target.
2. Partition parallelism - maintains more than one node between source and target.
Why do you need the Modify stage?
When you are able to handle null handling and data type changes in ODBC stages, why do you
need the Modify stage?

It is used to change data types: if the source contains a varchar and the target contains an integer,
then we use the Modify stage and change the type according to the requirement. We can also make
some modifications to the length.
The Modify stage is used for the purpose of data type changes.
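For illustration, a few Modify stage specifications of the kind typically used are sketched below (the
column names are hypothetical, and the exact conversion function names should be checked against
your version's documentation):

AMOUNT:int32 = int32_from_decimal(AMOUNT)
CUSTOMER_NAME = handle_null(CUSTOMER_NAME, 'UNKNOWN')
DROP COMMENTS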

What is the difference between the Sequential File stage & the Data Set stage? When do you use them?
a) The Sequential File stage is used for sequential (text) file formats, while the Data Set stage stores data
in Datastage's internal format for parallel access.
b) Parallel jobs use data sets to manage data within a job. You can think of each link in a job as
carrying a data set. The Data Set stage allows you to store the data being operated on in a persistent
form, which can then be used by other WebSphere DataStage jobs. Data sets are operating
system files, each referred to by a control file, which by convention has the suffix .ds. Using data
sets wisely can be key to good performance in a set of linked jobs. You can also manage data
sets independently of a job using the Data Set Management utility, available from the
WebSphere DataStage Designer or Director. In a dataset, the data is stored in an internal (binary) format,
i.e. we can view the data through the View Data facility in Datastage, but it cannot be viewed in
Linux or from the back-end system. Sequential file data can be viewed anywhere. Extraction of data
from a dataset is much faster than from a sequential file.
How can we improve the performance of a job while handling huge amounts of data?
a) Minimize the transformer stages. If the reference table has a huge amount of data then use the Join
stage; if the reference table has a small amount of data then use the Lookup.
b) This requires job-level tuning or server-level tuning.
Job-level tuning:
- use Join for huge amounts of data rather than Lookup;
- use the Modify stage rather than a Transformer for simple transformations;
- sort the data before the Remove Duplicates stage.
Server-level tuning:
this can only be done with adequate knowledge of the server-level parameters which can
improve the server execution performance.
How can we create read-only jobs in Datastage?
By creating a protected project. In a protected project all jobs are read-only; you cannot modify the jobs.
b) A job can also be made read-only by the following process:
export the job in .dsx format and change the attribute which stores the read-only information
from 0 (0 refers to an editable job) to 1 (1 refers to a read-only job),
then import the job again and overwrite or rename the existing job so you have both forms.

There are 3 kinds of routines in Datastage:

1. Server routines, used in server jobs; these routines are written in the BASIC language.
2. Parallel routines, used in parallel jobs; these routines are written in C/C++.
3. Mainframe routines, used in mainframe jobs.

DataStage Parallel routines made really easy


http://blogs.ittoolbox.com/dw/soa/archives/datastage-parallel-routines-made-really-easy-20926

How will you determine the sequence of jobs to load into the data warehouse?
First we execute the jobs that load the data into the dimension tables, then the fact tables, then the
aggregate tables (if any).
The sequence of the jobs can also be determined by the parent-child relationships of the target tables
to be loaded: a parent table always needs to be loaded before its child tables.

Error while connecting to DS Admin?

All you have to do is go to Settings -> Control Panel -> User Accounts and create a new user with a password.
Restart your computer and log in with the new user name. Try using the new user name in
Datastage and you should be able to connect.

DataStage - delete header and footer on the source sequential file

How do you delete the header and footer on the source sequential file, and how do you create a
header and footer on the target sequential file using Datastage?

In the Designer palette, under Development/Debug, we can find the Head & Tail stages; by using these we can do it.

How can we implement Slowly Changing Dimensions in DataStage?

a) We can implement SCD in Datastage:
1. Type 1 SCD: insert-else-update in the ODBC stage.
2. Type 2 SCD: insert new rows if the primary key is the same, and update the existing row with an
effective-from date of the job run date and a to-date of some max date.
3. Type 3 SCD: insert the old value into a separate column and update the existing column with the
new value.

b) By using the Lookup stage and the Change Capture stage we can implement SCD.
We have 3 types of SCDs:
type 1: maintains the current values only;
type 2: maintains both current and historical values;
type 3: maintains the current and partial historical values.

Differentiate database data and data warehouse data?

Data in a database is:
- detailed or transactional
- both readable and writable
- current
b) By 'database' one means OLTP (On-Line Transaction Processing). This can be the source
systems or the ODS (Operational Data Store), which contains the transactional data.
c) Database data is in the form of OLTP and data warehouse data is in the form of OLAP.
OLTP is for transactional processing and OLAP is for analysis purposes.
d) DWH:
- contains current and historical data
- highly summarized data
- follows denormalization
- dimensional model
- non-volatile
What is the difference between Datastage and Informatica?
a) The main difference between Datastage and Informatica is scalability: Informatica is claimed to be more
scalable than Datastage.
b) In my view Datastage is also scalable; the difference lies in the number of built-in functions,
which makes DataStage more user friendly.
c) In my view, Datastage has fewer transforms compared to Informatica, which can make it more
difficult to work with.
d) The main difference is the vendors. Each one has pluses from its architecture. For Datastage
it is a top-down approach. Based on the business needs we have to choose products.
e) The main difference lies in parallelism: Datastage uses the parallelism concept through node
configuration, whereas Informatica does not.
f) I have used both Datastage and Informatica. In my opinion, DataStage is more powerful
and scalable than Informatica. Informatica has more developer-friendly features, but when it
comes to scalability in performance, it is inferior compared to Datastage.
Here are a few areas where Informatica is inferior:
1. Partitioning - Datastage PX provides many more robust partitioning options than Informatica.
You can also re-partition the data whichever way you want.
2. Parallelism - Informatica does not support full pipeline parallelism (although it claims to).
3. File lookup - Informatica supports flat-file lookup, but the caching is poor. DataStage
supports hash files, lookup filesets and datasets for much more efficient lookups.
4. Merge/Funnel - Datastage has very rich functionality for merging or funnelling streams.
In Informatica the only way is to do a Union, which by the way is always a Union-all.
g) Informatica and DataStage are both ETL tools used for the data acquisition process. The main
difference is that for Informatica the repository (the container of metadata) is a database - metadata
is stored in a database - whereas for Datastage the repository is file based - metadata is stored in files.
Before running the ETL, both Informatica & DataStage check the repository for metadata; accessing a
file is faster than accessing a database because a file is static, but data is more secure in a database
than in a file (data may get corrupted in a file). Hence we can conclude that Datastage may perform
faster than Informatica, but when it comes to security Informatica is better than DataStage.
h) SAS DI Studio is best when compared to Informatica and Datastage, as it generates SAS code at
the back end; SAS is highly flexible compared to other BI solutions.
Why is a sequential file not used in a hash lookup?
The question is not quite proper, but here is an answer:
because a sequential file has no key. Sequential files are converted into hash files, and hashed files
are used for hash lookups.
What is Invocation ID?
This only appears if the job identified by Job Name has 'Allow Multiple Instance' enabled. Enter a
name for the invocation, or a job parameter allowing the instance name to be supplied at run time.
An 'invocation id' is what makes a 'multi-instance' job unique at runtime. With normal jobs, you
can only have one instance of it running at any given time. Multi-instance jobs extend that and
allow you to have multiple instances of that job running (hence the name). They are still a
'normal' job under the covers, so still have the restriction of one at a time - it's just that now that
'one' includes the invocation id. So, you can run multiple 'copies' of the same job as long as the
currently running invocation ids are unique.
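For example, a multi-instance job can be started twice in parallel from the command line by appending
different invocation ids to the job name (the project, job and invocation names below are hypothetical):

dsjob -run -jobstatus MyProject LoadCustomers.EMEA
dsjob -run -jobstatus MyProject LoadCustomers.APAC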
How to connect two stages which do not have any common columns between them?
If the two stages don't have the same column names, then use a Transformer stage in between
and map the required columns.
Difference between a hash file and a sequential file?
A hash file stores the data based on a hash algorithm and a key value; a sequential file is just a
file with no key column. A hash file can be used as a reference for a lookup; a sequential file cannot.
b) A hash file can be cached in DS memory (buffer) but a sequential file cannot. Duplicates are
removed in a hash file, i.e. there are no duplicates on the key in a hash file.
How do you fix the error "OCI has fetched truncated data" in DataStage?
a) Can we use the Change Capture stage to get the truncated data? Members please confirm.
b) I have the same problem and don't know what the solution is. I encounter this problem only for
part of the data.
c) This kind of error occurs when you have a CLOB in the back end and a Varchar in DataStage. So
check the back end and accordingly put LongVarchar in DataStage with the maximum
length used in the database.
Which partitioning do we have to use for the Aggregator stage in parallel jobs?

a) By default this stage uses the Auto mode of partitioning. The best partitioning depends on the
operating mode of this stage and of the preceding stage. If the aggregator is operating in sequential
mode, it will first collect the data before writing it to the file, using the default Auto
collection method. If the aggregator is in parallel mode then we can pick any type of partitioning
from the drop-down list on the partitioning tab; generally Auto or Hash can be used.
b) I think the above answer is a little misleading. Most of the time you'll be using the Aggregator stage in
parallel mode. If you use Auto partitioning, it does not guarantee that the key columns
you are grouping on will lie in the same partition, so the result will not be correct for the
aggregation.
1) Identify the grouping keys you want to aggregate on.
2) In a stage prior to the aggregator, do a hash partition on the grouping keys. This will ensure that all
rows with the same group keys lie in a particular partition.
3) Now the result of the aggregation will be appropriate.
4) The Entire partitioning method could also be used, but it has a slightly higher overhead
compared to hash partitioning.

How do we create an index in Datastage?

What type of index are you looking for? If it is only based on rows, use @INROWNUM or
@OUTROWNUM.
What is the flow of loading data into fact & dimensional tables?
a) Here is the sequence for loading a data warehouse:
1. The source data is first loaded into the staging area, where data cleansing takes place.
2. The data from the staging area is then loaded into the dimensions/lookups.
3. Finally the fact tables are loaded from the corresponding source tables in the staging area.

b) The data is extracted from the different source systems. After extraction the data is transferred to
the staging layer for cleansing purposes (cleansing means LTRIM/RTRIM etc.). The data comes
periodically to the staging layer. An ODS is used to store the recent data. The ODS and the
staging area are the two layers between the source system and the target system. After
that the data is transformed according to the business needs with the help of the ETL
transformations, and then the data is finally loaded into the target system or data warehouse.
Why does a sequential file have a single input link?
A sequential file always has a single link because it cannot accept multiple links or threads;
data in a sequential file is always processed sequentially.
Aggregators: what does the warning "Hash table has grown to xyz ..." mean?
The Aggregator cannot land data onto disk the way the Sort stage does; your system memory
is occupied by the data going into the aggregator. If your system memory fills up then you get that
kind of message.
I dealt with that kind of error once; my solution was to use multiple chunks of data and
multiple aggregators.
What is a hashing algorithm?
Hashing is a technique for how you store the data in dynamic files.
There are a few algorithms for doing this; read a data structures book for the algorithm models.
Hash files are created as dynamic files using a hashing algorithm.
How do you load partial data after a job failed?
The source has 10000 records and the job failed after 5000 records were loaded, so the status of the
job is 'aborted'. Instead of removing the 5000 records from the target, how can I resume the load?
a) There are lots of ways of doing this, but we keep the Extract, Transform and Load processes
separate. Generally the load job never fails unless there is a data issue, and all data issues are
cleared beforehand in the transform step. There are some DB tools that do this automatically.
If you want to do this manually, keep track of the number of loaded records in a hash file or text file
and update the file as you insert records.
If the job fails in the middle, read the number from the file and process only the records after that
point, ignoring the record numbers before it.
Try the @INROWNUM variable for better results.
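A minimal sketch of such a restart check as a transformer constraint, assuming a hypothetical job
parameter RESTART_ROW that holds the number of rows already committed:

@INROWNUM > RESTART_ROW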
Is a hashed file an active or passive stage? When is it useful?
The Hashed File stage is a passive stage.
Stages which do some processing are called active stages, e.g. Transformer, Sort,
Aggregator.
How do you extract job parameters from a file?
Through a User Variables activity in a sequence job, or by calling a routine.
1.What about System variables?
2.How can we create Containers?
3.How can we improve the performance of DataStage?
4.what are the Job parameters?
5.what is the difference between routine and transform and function?
6.What are all the third party tools used in DataStage?
7.How can we implement Lookup in DataStage Server jobs?
8.How can we implement Slowly Changing Dimensions in DataStage?.
9.How can we join one Oracle source and Sequential file?.
10.What is iconv and oconv functions?
11.Difference between Hashfile and Sequential File?
12. Maximum how many characters we can give for a Job name in DataStage?
a) Answers to your questions in simple words:
2. Containers are nothing but a set of stages with links.
3. Tune in-process buffering and the transaction size.
4. Nothing but parameters to pass at runtime.
6. Lots of them.
7. Using a hashed file.
8. The CDC stage.
10. Powerful functions for date transformation.
11. Access speed is slower in a sequential file than in a hashed file.
12. If you know, just tell me.
b) System variables comprise a set of variables which are used to get system information, and
they can be accessed from a transformer or a routine. They are read only and start with an @.
c) In the server canvas we can improve performance in 2 ways:
firstly, we can increase the memory by enabling inter-process row buffering in the job properties,
and
secondly, by inserting an IPC stage we break a process into 2 processes. We can use this stage to
connect two passive stages or 2 active stages.

d)The following rules apply to the names that you can give DataStage jobs:
Job names can be any length.
They must begin with an alphabetic character.
They can contain alphanumeric characters and underscores.
Job category names can be any length and consist of any characters, including spaces
e) 1. System variables are inbuilt functions that can be called in a transformer stage.
2. Containers are a group of stages and links; they are of 2 types, local containers and shared
containers.
3. Using IPC, managing the array and transaction size; project tunables can be set through the
Administrator.
4. Values that are required during the job run.
5. Routines call jobs or other actions to be performed using DS; transforms are the
manipulation of data during the load.
6. -
7. Using a hash file.
8. Using the target Oracle stage, depending on the update action.
9. Using a row id or sequence-generated numbers.
10. Date functions.
11. A sequential file reads data sequentially; using a hash file the read process is faster.
12. It can be any length.

What are the difficulties faced in using DataStage? Or what are the constraints in using
DataStage?
a) 1) If the number of lookups is high.
2) What happens while loading the data when the job aborts for some reason.
b) 1. I feel the most difficult part is understanding the Datastage Director job log error
messages. It doesn't give you a properly readable message.
2. We don't have as many date functions available as in Informatica or traditional relational
databases.
3. Datastage is a unique product in terms of functions. For example, most databases or ETL tools
use UPPER for converting from lower case to upper case; Datastage uses UCASE.
Datastage is peculiar when we compare it to other ETL tools.
Other than that, I don't see any issues with Datastage.
c) * The issue that I faced with Datastage is that it was very difficult to find the errors from the
error code, since the error table did not specify the reason for the error. And as a fresher I did not
know what the error codes stand for :)
* Another issue is that the help in Datastage was not of much use, since it was not specific and
was more general.
* I do not know about other tools since this is the only tool that I have used until now. But it was
simple to use, so I liked using it in spite of the above issues.
Have you ever been involved in upgrading DS versions, like DS 5.X? If so, tell us some of the steps you
have taken.
A) Yes. The following are some of the steps I have taken in doing so:
1) Definitely take a backup of the whole project(s) by exporting each project
as a .dsx file.
2) See that you are using the same parent folder for the new version, so that
your old jobs that use hard-coded file paths still work.
3) After installing the new version, import the old project(s); you have to
compile them all again. You can use the Compile All tool for this.
4) Make sure that all your DB DSNs are created with the same names as the old ones.
This step is for moving DS from one machine to another.
5) In case you are just upgrading your DB from Oracle 8i to Oracle 9i, there
is a tool on the DS CD that can do this for you.
6) Do not stop the 6.0 server before the upgrade; the version 7.0
install process collects project information during the upgrade. There is NO
rework (recompilation of existing jobs/routines) needed after the upgrade.

What are XML files, how do you read data from XML files, and which stage is to be used?
a) In the palette there are real-time stages like XML Input, XML Output and XML Transformer.
b) First, you can use the XML metadata importer to import the XML source definition. Once that is done,
you can use XML Input to read the XML document. For each and every element of the XML, we
should give the XPATH expression in the XML Input stage.
The XML stage documentation clearly explains this.
c) This is how it can be done:
- Define the XML file path in the Administrator under the environment parameters, and import the
XML file metadata into the Designer repository.
- Use a transformer stage (without an input link) to get this path in the server job.
- Use the XML Input stage. On the input tab, under the XML source, place this value from the
transformer.
- On the output tab you can import the metadata (columns) of the XML file and then use them as
input columns in the rest of the job.
How do you track performance statistics and enhance it?
Through Monitor we can view the performance statistics.
b)You can right click on the server job and select the "view performance statistics" option. This
will show the output in the number of rows per second format when the job runs.
Types of views in Datastage Director?
There are 3 types of views in Datastage Director
a) Job View - Dates of Jobs Compiled.
b) Log View - Status of Job last run
c) Status View - Warning Messages, Event Messages, Program Generated Messages.
There are four views in Director. Job view is not one of them.
From what I know there are four views
1> Status
2> Schedule
3> Log
4> Detail.
What is the default cache size? How do you change the cache size if needed?
The default cache size is 256 MB. We can increase it by going into Datastage Administrator,
selecting the Tunables tab and specifying the cache size there.
a) The default read cache size is 128 MB. We can likewise increase it from the Tunables tab in the
Datastage Administrator.

b) The default cache size is 128 MB. This is primarily used for the hash file data cache on the server.
This setting can only be changed in the Administrator, not at job level. Job-level tuning is available
only for the buffer size.
How do you pass parameters to the job sequence if the job is running at night?
1. Set the default values of the parameters in the job sequence and map these parameters to the jobs.
2. Run the job in the sequence using the dsjob utility, where we can specify the values to be taken for
each parameter.
b) You can insert the parameter values into a table and read them when the package runs using an
ODBC stage or plug-in stage and use DS variables to assign them in the data pipeline,
or
pass the parameters using DSSetParam from the controlling job (batch job or job sequence) or a
job control routine from within DS,
or
use dsjob -param from within a shell script or a DOS batch file when running from the CLI.
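For example (the project, sequence and parameter names below are hypothetical):

dsjob -run -param PROCESS_DATE=2006-10-14 -param SOURCE_DIR=/data/incoming -jobstatus MyProject NightlyLoadSeq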
How do you catch bad rows from OCI stage?
The question itself is a little ambiguous to me. I think the answer to the question might be, we
will place some conditions like 'where' inside the OCI stage and the rejected rows can be
obtained as shown in the example below:
1) Say, there are four departments in an office, 501 through 504. We place a where condition,
where deptno <= 503. Only these rows are output through the output link.
2) Now what we do is, take another output link to a seq. file or another stage where you want to
capture the rejected rows. In that link, we will define: where deptno > 503
3) Once the rows are output from the OCI stage, you can send them into a transformer, place
some constraint on it and use the reject row mechanism to collect the rows.
I am a little tentative because, I am not sure if I have answered the question or not. Please do
verify and let us know if this answer is wrong.

What are QualityStage and ProfileStage?

QualityStage: it is used for data cleansing. ProfileStage: it is used for data profiling.
b) ProfileStage is used for analysing data and its relationships.
What is the use and advantage of a procedure in Datastage?
To trigger database operations before or after access to the DB stage.
What are the important considerations when using the Join stage instead of Lookups?
a) If the volume of data is high then we should use the Join stage instead of a Lookup.
b) If you need to capture mismatches between the two sources, lookups provide an easy option.
How to implement a type 2 slowly changing dimension in Datastage? Give an example.
Slowly changing dimensions are a common problem in data warehousing. For example: there is a
customer called Lisa in a company ABC and she lives in New York. Later she moves to
Florida. The company must modify her address now. In general there are 3 ways to solve this problem:
Type 1: the new record replaces the original record, with no trace of the old record at all. Type 2: a
new record is added to the customer dimension table; the customer is therefore treated
essentially as two different people. Type 3: the original record is modified to reflect the
change.
In Type 1 the new value overwrites the existing one, which means no history is maintained;
the history of where the person stayed before is lost. It is simple to use.
In Type 2 a new record is added, so both the original and the new record are present,
and the new record gets its own primary key. The advantage of using Type 2 is that historical
information is maintained, but the size of the dimension table grows, so storage and performance can
become a concern.
Type 2 should only be used if it is necessary for the data warehouse to track the historical
changes.
In Type 3 there are 2 columns, one to indicate the original value and the other to indicate the
current value. For example, a new column is added which shows the original address as New
York and the current address as Florida. This helps in keeping some part of the history and the table size
is not increased. But one problem is that when the customer moves from Florida to Texas the New
York information is lost, so Type 3 should only be used if the changes occur only a finite
number of times.
b) You can use the Change Capture stage. This will tell you whether the source record is an insert/update/copy
after comparing it with the DWH record, and then you can choose the action accordingly.
How to implement the type 2 Slowly Changing dimension in DataStage?
You can use change-capture & change apply stages for this

What are static hash files and dynamic hash files?

As the names themselves suggest what they mean: in general we use Type 30 dynamic hash files. The
data file has a default size of 2 GB and the overflow file is used if the data exceeds the 2 GB size.
a) Hashed files have the default size established by their modulus and separation when you
create them, and this can be static or dynamic.
Overflow space is only used when the data grows over the reserved size for one of the groups
(sectors) within the file. There are as many groups as specified by the modulus.
b) Dynamic hash files can automatically adjust their size - modulus (number of groups) and
separation (group size) - based on the incoming data; Type 30 files are dynamic.
Static files do not adjust their modulus automatically and are best when the data is static.
Overflow groups are used when the data row size is equal to or greater than the specified large
record size in dynamic hash files.
Since static hash files do not create hashing groups automatically, when a group cannot accommodate
a row it goes to overflow.
Overflow should be minimized as much as possible for optimal performance.
What is the difference between Datastage server jobs and Datastage parallel jobs?
The basic difference is that a server job usually runs on a Windows platform and a parallel job runs on a
UNIX platform;
a server job runs on one node whereas a parallel job runs on more than one node.
What is 'insert for update' in Datastage?
The question is still not clear; I think 'insert for update' means the updated value is inserted to maintain history.
There is also a 'lock for update' option in the Hashed File stage, which locks the hashed file for updating
when the search key in the lookup is not found.
How did u connect to DB2 in your last project?
Using DB2 ODBC drivers.
The following stages can connect to DB2 Database:
ODBC
DB2 Plug-in Stage
Dynamic Relational Stage
How do you merge two files in DS?
Either use the copy command as a before-job subroutine if the metadata of the 2 files is the same, or
create a job to concatenate the 2 files into one if the metadata is different.
b) Using the Funnel stage you can merge the data from the two files together.
c) We can use either the Funnel stage, or a Sequential File stage reading more than one file (both
files' formats should be the same).
What is the order of execution done internally in the transformer, with the stage editor having
input links on the left hand side and output links on the right?
a) Stage variables, then constraints, then column derivations/expressions.
b) There is only one primary input link to the transformer; there can be many reference input
links and many output links. You can output to multiple output links by defining
constraints on the output links.
You can edit the order of the input and output links from the Link Ordering tab in the transformer
stage properties dialog.
How will you call an external function or subroutine from Datastage?
There is a Datastage option to call external programs: ExecSH.
b) You can call external functions and subroutines by using the before/after stage/job subroutines
ExecSH or
ExecDOS,
or by using the Command Stage plug-in, or by calling the routine from an external Command activity
in a job sequence.
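For illustration, in the job properties the before-job subroutine would be set roughly as follows (the
script path and the parameter are hypothetical; job parameters can be referenced between # symbols
in the input value):

Before-job subroutine: ExecSH
Input value: /opt/etl/scripts/archive_source_files.sh #SOURCE_DIR#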
What happens if the job fails at night?
The job sequence aborts.
b) If you are on call, you will be called to fix and rerun the job.
c) You can define the job sequence to send an e-mail using an SMTP/Notification activity if the job fails,
or log the failure to a log file using DSLogFatal/DSLogEvent from a controlling job or an after-job
routine,
or
use dsjob -log from the CLI.
Types of parallel processing?
Parallel processing is broadly classified into 2 types:
a) SMP - Symmetrical Multi Processing
b) MPP - Massive Parallel Processing

c) Then how about pipeline and partition parallelism - are they also 2 types of parallel processing?
d) 3 types of parallelism:
data parallelism,
pipeline parallelism,
round robin.
e) There are two types of parallel processing: 1) SMP (Symmetrical Multi Processing), 2) MPP
(Massive Parallel Processing).
f) Parallel processing is of two types:
1) Pipeline parallel processing
2) Partitioning parallel processing
g) Hardware-wise there are 3 types of parallel processing systems available:
1. SMP (symmetric multiprocessing: multiple CPUs, shared memory, single OS)
2. MPP (massively parallel processing: multiple CPUs, each having its own set of
resources - memory, OS, etc. - but physically housed in the same machine)
3. Clusters: same as MPP, but physically dispersed (not in the same box, connected via high-speed
networks).
DS offers 2 types of parallelism to take advantage of the above hardware:
1. Pipeline parallelism
2. Partition parallelism

What is DS Administrator used for - did you use it?

The Administrator enables you to set up DataStage users, control the purging of the repository,
and, if National Language Support (NLS) is enabled, install and manage maps and locales.
b) It is primarily used to create Datastage projects, assign user roles to a project and set job
parameters at the project level; assigning users to the project can also be done here.
How do you do an Oracle 4-way inner join if there are 4 Oracle input files?
The question is asked incorrectly:
there won't be any 'Oracle file', it is an Oracle table or view object.
I never heard of an Oracle input file.
Can you please explain what your actual question is?
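If the intent is simply to join four Oracle tables, the join can be written as user-defined SQL in the
Oracle/OCI stage or built with Join stages; a sketch with hypothetical table and column names:

SELECT c.customer_id, o.order_id, p.product_name, r.region_name
FROM customers c
INNER JOIN orders o   ON o.customer_id = c.customer_id
INNER JOIN products p ON p.product_id  = o.product_id
INNER JOIN regions r  ON r.region_id   = c.region_id;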
How do you pass a filename as a parameter to a job?
a) During job development we can create a parameter 'FILE_NAME' and the value can be passed
while running the job.
b) 1. Go to DataStage Administrator -> Projects -> Properties -> Environment -> User Defined. Here
you can see a grid, where you can enter your parameter name and the corresponding path of
the file.
2. Go to the Stage tab of the job, select the NLS tab, click on the "Use Job Parameter" button and select
the parameter name which you have given above. The selected parameter name appears in
the text box beside the "Use Job Parameter" button. Copy the parameter name from the text box
and use it in your job. Keep the project default in the text box.

c) 1. Define the job parameter at the job level or project level.
2. Use the file name parameter in the stage (source, target or lookup).
3. Supply the file name at run time.
How do you populate source files?

There are many ways to populate them; writing an SQL statement in Oracle is one way.
How to handle date conversions in Datastage? Convert a mm/dd/yyyy format to yyyy-dd-mm.
We use a) the "Iconv" function - internal conversion,
b) the "Oconv" function - external conversion.
A function to convert mm/dd/yyyy format to yyyy-dd-mm is:
Oconv(Iconv(FieldName,"D/MDY[2,2,4]"),"D-MDY[2,2,4]")
b) Here is the right conversion:
the function to convert mm/dd/yyyy format to yyyy-dd-mm is
Oconv(Iconv(FieldName,"D/MDY[2,2,4]"),"D-YDM[4,2,2]")
c) ToChar(%date%, %format%)
This should work; in the format argument specify which format you want, i.e. 'yyyy-dd-mm'.

Differentiate Primary Key and Partition Key?


A primary key is a combination of unique and not null. It can be a collection of key values called
a composite primary key. A partition key is just a part of the primary key. There are several
methods of partitioning like Hash, DB2, Random etc.; while using Hash partitioning we specify the
partition key.
a) The primary key is the key we define on a table column or set of columns (composite PK) to make
sure all the rows in the table are unique.
The partition key is the key we use while partitioning the table (in the database) or processing the
source records in the ETL (in the ETL tools). We should define the partitioning based on the stages
(in Datastage) or transformations (in Informatica) we use in the job (Datastage) or mapping
(Informatica). We use partitioning to improve the target load process.
If you need more info, please go through the database, Datastage or Informatica documentation on
partitioning.
What are all the third party tools used in DataStage?
a)Autosys, TNG, event coordinator are some of them that I know and worked with
b) The Maestro scheduler is another third-party tool.
c) The Control-M job scheduler.
How do you eliminate duplicate rows?
a)Use Remove Duplicate Stage: It takes a single sorted data set as input, removes all duplicate
records, and writes the results to an output data set.
b) If you don't have the Remove Duplicates stage, you can use a hash file to eliminate duplicates.
What is the difference between a routine, a transform and a function?
The difference between routines and transforms is that routines describe the business logic, while
transforms change the data from one form to another by applying transformation rules.
b) By using routines we can return values, but transforms cannot return values.
Is it possible to calculate a hash total for an EBCDIC file and have the hash total stored as
EBCDIC using Datastage?
Currently, the total is converted to ASCII, even though the individual records are stored as EBCDIC.
If you're running 4-way parallel and you have 10 stages on the canvas, how many processes does
Datastage create?
The answer is 40:
you have 10 stages and each stage can be partitioned and run on 4 nodes, which makes the total
number of processes generated 40.
b) It depends on the number of active stages on the canvas and how they are linked, as only active
stages create processes. For example, if there are 6 active stages (like transformers) linked by some
passive stages, the total number of processes is 6x4 = 24.
Explain the differences between Oracle 8i/9i?
a) Multiprocessing, databases, more dimensional modelling.
b) Oracle 8i does not support the pseudo-column sysdate, but 9i does.
c) In Oracle 8i we can create 256 columns in a table, but in 9i we can have up to 1000 columns (fields).
what is an environment variable??
a)Basically Environment variable is predefined variable those we can use while creating DS
job.We can set eithere as Project level or Job level.Once we set specific variable that variable
will be availabe into the project/job.
We can also define new envrionment variable.For that we can got to DS Admin .

For further details, refer to the DataStage Administrator guide.
b) These are variables used at the project or job level. We can use them to configure the job, e.g. to associate the configuration file (without which a parallel job cannot run) or to increase the sequential file or data set read/write buffers.
Example: $APT_CONFIG_FILE
There are many such environment variables. Go to Job Properties, click on the Parameters tab and then on "Add Environment Variable..." to see most of them.
c) Here is a fuller note on this topic.
Creating project-specific environment variables:
- Start up DataStage Administrator.
- Choose the project and click the "Properties" button.
- On the General tab click the "Environment..." button.
- Click on the "User Defined" folder to see the list of job-specific environment variables.
There are two types of variables - string and encrypted. If you create an encrypted environment variable it will appear as the string "*******" in the Administrator tool and as junk text when saved to the DSParams file or when displayed in a job log. This provides robust security of the value.
Migrating project-specific job parameters:
It is possible to set or copy job-specific environment variables directly in the DSParams file in the project directory. There is also a DSParams.keep file in this directory; if you make manual changes to the DSParams file, Administrator can roll those changes back to DSParams.keep. It is possible to copy project-specific parameters between projects by overwriting the DSParams and DSParams.keep files. It may be safer to replace only the User Defined section of these files and not the General and Parallel sections.
Environment variables as job parameters:
- Open up a job.
- Go to Job Properties and move to the Parameters tab.
- Click on the "Add Environment Variables..." button.
- Set the default value of the new parameter to "$PROJDEF".
When the job parameter is first created it has a default value the same as the value entered in the Administrator. By changing this value to $PROJDEF you instruct DataStage to retrieve the latest value for this variable at job run time. If you have an encrypted environment variable it should also be an encrypted job parameter; set the value of these encrypted job parameters to $PROJDEF (you will need to type it twice into the password entry box).
Using environment variable job parameters:
These job parameters are used just like normal parameters, by adding them to stages in your job enclosed by the # symbol, for example:
Database=#$DW_DB_NAME#
Password=#$DW_DB_PASSWORD#
File=#$PROJECT_PATH#/#SOURCE_DIR#/Customers_#PROCESS_DATE#.csv
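Conceptually, DataStage substitutes the parameter values into those #...# references at run time. Here is a rough Python sketch of that substitution, purely for illustration (the parameter values are invented, and this is not how the engine is actually implemented):
import re

params = {                                  # assumed run-time values
    "$DW_DB_NAME": "dwh",
    "$DW_DB_PASSWORD": "secret",
}

def resolve(text, params):
    # Replace every #NAME# token with the corresponding parameter value,
    # leaving unknown tokens untouched.
    return re.sub(r"#([^#]+)#", lambda m: params.get(m.group(1), m.group(0)), text)

print(resolve("Database=#$DW_DB_NAME#", params))   # -> Database=dwh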
How do you find duplicate records using the Transformer stage in Server edition?
a) This question has more than one answer, as eliminating duplicates is situation specific. Depending on the situation, choose the best option:
1. Write a SQL query on the key fields at the source.
2. Use a hashed file, which by nature does not allow duplicates; attach a reject link to see the duplicates for verification.

b) Use a Transformer stage to identify and remove duplicates on one output link, and direct all input rows to another output (the "rejects"). This approach requires sorted input: compare the current key with the previous key held in a stage variable, and pass a row through only when the key changes.
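The sorted-input technique from answer b) can be sketched in plain Python rather than Transformer derivations (keys and values are invented): keep the previous key in a variable and pass a row through only when the key changes.
# Works only because the input is sorted on the key, exactly as the
# Transformer approach above requires.
sorted_rows = [("A", 1), ("A", 2), ("B", 3), ("C", 4), ("C", 5)]

prev_key = None                      # plays the role of a stage variable
for key, value in sorted_rows:
    if key != prev_key:              # first occurrence of this key
        print("keep  ", key, value)
    else:                            # duplicate -> send down the reject link
        print("reject", key, value)
    prev_key = key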
What is a phantom error in DataStage?
a) If a running process is killed, sometimes it keeps running in the background; such a process is called a phantom process. You can use the resource manager to clean up that kind of process.
b) For every job in DataStage, a phantom is generated for the job as well as for every active stage which contributes to the job. These phantoms write logs regarding the stage/job. If any abnormality occurs, an error message is written; these errors are called phantom errors. The logs are stored in the &PH& folder.
Phantoms can be killed through DataStage Administrator or at server level.

What is the use of environment variables?
a) Environment variables are predefined variables that we can use while creating DataStage jobs. They can be set at project level or job level; once set, they are available throughout the project.
b) Environment variables configure the environment. Once you set these variables in DataStage you can use them in any job as a parameter. For example, to connect to a database you need a user id, password and schema. These are constant throughout the project, so they are created as environment variables and referenced wherever needed with #Var#. If the password or schema changes, you do not need to touch every job; change it once at the environment variable level and that takes care of all the jobs.

How can we run a batch using the command line?
The dsjob command is used to run DataStage jobs from the command line. In older architectures people used to create a batch job to control the remaining DataStage jobs in the process (as in the Ken Blend architecture). With the dsjob command you can run any DataStage job in the DataStage environment.
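A minimal sketch of driving dsjob from a script, here using Python's subprocess module (the project and job names are placeholders, only the dsjob options already quoted in this document are used, and dsjob is assumed to be on the PATH of the DataStage server):
import subprocess

# Equivalent to typing on the command line:
#   dsjob -run -jobstatus myproject BatchLoadJob
cmd = ["dsjob", "-run", "-jobstatus", "myproject", "BatchLoadJob"]
result = subprocess.run(cmd, capture_output=True, text=True)

print(result.stdout)
print("dsjob exit code:", result.returncode)   # maps to the job status; see the dsjob documentation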
What is fact load?
a) You load the fact table in the data mart with the combined input of ODBC (or DataStage engine) data sources. You also create transformation logic to redirect output to an alternate target, the REJECTS table, using a row constraint.
b) In a star schema there are fact and dimension tables to load in any data warehouse environment. You generally load the dimension tables first and then the facts, because the fact rows carry the keys of the related dimensions.
Explain a specific scenario where we would use range partitioning?
a) It is used when the data volume is high; the data is partitioned by ranges of a column's values.
b) If the data is large and you cannot process the full data set in one pass, you will generally use range partitioning.
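A rough illustration of partitioning by value ranges, in plain Python (the boundaries, column name and rows are invented):
# Assign each row to a partition based on which range its key falls into.
boundaries = [100, 200, 300]              # assumed range boundaries
rows = [{"amount": 42}, {"amount": 150}, {"amount": 999}]

def range_partition(value, boundaries):
    for i, upper in enumerate(boundaries):
        if value < upper:
            return i
    return len(boundaries)                # overflow partition

for row in rows:
    print(row, "-> partition", range_partition(row["amount"], boundaries))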

What is job commit in DataStage?
a) Job commit means the changes made are saved to the database.
b) By default a DataStage job commits each record, but you can force DataStage to take a set of records and commit them together. In the case of the Oracle stage, the Transaction Handling tab lets you set the number of rows per transaction.
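The rows-per-transaction idea can be sketched with a generic database in Python; SQLite stands in for Oracle here purely for illustration, and the table, rows and batch size are invented:
import sqlite3

# Insert rows and commit once per batch instead of once per row,
# mirroring the "rows per transaction" setting of the Oracle stage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target_table (id INTEGER, name TEXT)")

rows = [(i, "row%d" % i) for i in range(1, 2501)]
ROWS_PER_TXN = 1000

for i, row in enumerate(rows, start=1):
    conn.execute("INSERT INTO target_table VALUES (?, ?)", row)
    if i % ROWS_PER_TXN == 0:
        conn.commit()                     # commit a full batch
conn.commit()                             # commit the final partial batch
print(conn.execute("SELECT COUNT(*) FROM target_table").fetchone())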

What are the disadvantages of a staging area?
a) The main disadvantage of a staging area is disk space, since we have to dump the data into a local area. To my knowledge there is no other disadvantage of a staging area.
b) Yes, the disadvantage of a staging area is that it takes more space in the database, and it may not be cost effective for the client.
How can we remove duplicates using the Sort stage?
a) Set the "Allow Duplicates" option to false.
b) In plain Java the same idea looks like this (with the needed imports and an example input array added):
import java.util.Arrays;
import java.util.TreeSet;
String[] names = {"b", "a", "b"};   // example input
TreeSet<String> set = new TreeSet<String>(Arrays.asList(names));
for (String name : set)
    System.out.println(name);
This is enough for sorting and removing duplicate elements (using Java 5 in this example).
What is the difference between RELEASE THE JOB and KILL THE JOB?
Releasing a job releases it from any dependencies so that it can run.
Killing a job stops the job that is currently running or scheduled to run.
Can you convert a snowflake schema into a star schema?
a) Yes, we can convert it by attaching one hierarchy to the lowest level of another hierarchy (i.e. denormalizing the dimension tables).
b) No, it is not possible.

What is a repository?
The repository resides in a specified database. It holds all the metadata, raw data and the respective mapping information.
In short, the repository is the store that contains all the metadata (information about the data).
What is fact loading, how to do it?
a) First you have to run the hashed file jobs, then the dimension jobs and lastly the fact jobs.
b) Once we have loaded our dimensions, then as per business requirements we identify the facts (columns or measures on which the business is measured) and load them into the fact tables.
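A toy illustration of why the dimensions are loaded first: the fact load looks up the surrogate keys that the dimension load produced. A plain Python sketch with invented keys and amounts:
# Dimension already loaded: natural key -> surrogate key
customer_dim = {"CUST-001": 1, "CUST-002": 2}

# Incoming fact records still carry the natural key
source_facts = [("CUST-001", 99.50), ("CUST-002", 10.00)]

fact_rows = []
for natural_key, amount in source_facts:
    surrogate = customer_dim[natural_key]      # lookup against the dimension
    fact_rows.append((surrogate, amount))

print(fact_rows)                               # -> [(1, 99.5), (2, 10.0)]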
What is the alternative way in which we can do job control?
a) Job control is possible through scripting; how you control jobs depends on the requirements of the job.
b) Job control can be done using:
- DataStage job sequencers
- DataStage custom routines
- Scripting
- Scheduling tools like Autosys
Where can we use the Link Partitioner, Link Collector and Inter Process (IPC) stages - in server jobs or in parallel jobs? And is SMP parallel or server?
You can use the Link Partitioner and Link Collector stages in server jobs to speed up processing. Suppose you have a source, a target and a Transformer in between that does some processing, applying functions etc. You can speed it up by using a Link Partitioner to split the data from the source into different links, apply the business logic on each link, and then collect the data back with a Link Collector and pump it into the output.
The IPC (Inter Process) stage is also intended to speed up processing, by letting the stages on either side of it run as separate processes at the same time.
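A loose analogy for the partition / process / collect pattern, using Python's multiprocessing pool (the data and transformation are invented; a real server job would use the Link Partitioner and Link Collector stages rather than code like this):
from multiprocessing import Pool

def transform(row):
    # Stand-in for the Transformer's business logic
    return row * 10

if __name__ == "__main__":
    source = list(range(8))
    with Pool(processes=4) as pool:            # "partition" the work over 4 workers
        target = pool.map(transform, source)   # results are "collected" back in order
    print(target)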
Where can you output data using the Peek stage?
a) In DataStage Director - look at the Director log.
b) The output of the Peek stage can be viewed in the Director log, and it can also be saved to a separate text file.
Do you know about MetaStage?
a) In simple terms, metadata is data about data, and MetaStage manages that metadata for DataStage objects (data sets, sequential files, etc.).
b) MetaStage is used to handle the metadata, which is very useful for data lineage and data analysis later on. Metadata defines the type of data we are handling; these data definitions are stored in the repository and can be accessed with the use of MetaStage.
c) MetaStage is a metadata repository in which you can store metadata (DDLs etc.) and perform analysis on dependencies, change impact etc.
d) MetaStage is DataStage's native reporting tool; it contains lots of functions and reports.
e) MetaStage is a persistent metadata directory that uniquely synchronizes metadata across multiple separate silos, eliminating rekeying and the manual establishment of cross-tool relationships. Based on patented technology, it provides seamless cross-tool integration throughout the entire Business Intelligence and data integration lifecycle and toolsets.
