
Production Support/Application Testing/Software Defect and IBM Mainframe COBOL ABEND Research


When an application ABEND (ABnormal END-of-job) occurs, z/OS stops executing your program, closes files and
buffers, and generates a single high-level message in the form of a System Completion Code (Sxxx). The System
Completion Code is usually written to an output listing through your //SYSOUT DD SYSOUT=* JCL entry. This
completion code indicates why the system decided to stop executing your application. It is often only loosely
related to what is really wrong with your application. Because of this, the System Completion Code
represents only the starting point for your analysis of the problem.

Other Debugging Assistance


Along with the System Completion Code, use IBM's Problem Determination tools (PD Tools). These generate a
listing (SYSOUT) which describes:

The System Completion Code (and often a short text description of what it designates)

A short explanation of the cause of the ABEND

The COBOL instruction (statement) or line number which contained the invalid operation causing
z/OS to halt execution

A "core-dump" (a hexadecimal printout) of the internal machine storage and registers relevant to the
areas of your program surrounding the COBOL instruction which caused z/OS to halt execution.

This information is useful for beginning to understand and research the problem, but it is usually far from
sufficient to solve the problem, which could be any combination of:

Incomplete, incorrect or invalid COBOL procedural logic

A typo such as a misplaced period, or incorrectly specified field

Incorrect or invalid input data

Batch jobs run out of sequence

Input files missing or corrupted (hardware errors)

Errors which relate to JCL problems

etc.
There are as many different ways to analyze and research COBOL ABENDs as there are individual approaches to
writing procedural logic. However, if you have never done this type of "logic-detective" work on a large scale,
the following five-step approach can help you get started with this complex and crucial process:

Preparation
Research
Hypothesis
Solution
Resolution

As a final note before beginning, understand that there are really two distinct phases of Production Support:
1. Data Center on-call ABEND resolution - wherein a technician receives notification that a job or
transaction has ABENDed and must be "fixed" within an extremely short timeframe (usually minutes to
hours). In this case, the technician's main concern is to "patch" the problem - get the system back online, or
get the batch jobstream back into production ("Patch-It").
2. Next-day problem resolution - wherein technician(s) actually track down and solve the problem that
caused the ABEND ("Fix-It").
The steps below represent a process for "Fix-It" - they go well beyond the scope of the emergency measures used to
"patch" the problem during an on-call emergency.

1. Preparation - Collect all necessary background information (WHAT happened and WHERE the ABEND occurred)

Print out the ABEND information

Collect all supporting ABEND output (SYSOUT) from the job - (ABEND-AID, DISPLAY
statements, etc.)
Obtain copies of the run-time:

JCL

Program source and all copybooks (or expanded source listing)

From the JCL, learn the dataset names of the input and output files accessed by the program (which you
may need to browse as part of your research)
Learn the nature of the batch job from system documentation, or from an application business expert
(at least at the level of module-flow and file-access)

2. Research - Construct a mental map (understanding) of the program's execution (HOW the ABEND occurred)

To make the correct WHY determination usually requires a combination of "Static" and "Dynamic"
analysis - complementary research and investigative approaches. Note: These steps need not be
followed in this order. Rather, in time you will develop an "intuition" as to which kind(s) of analysis
will be most likely to provide the information you need to solve your problem. In a production support
role

Static Analysis:
1. Structural Visualization: the generation of an accurate mental map, understanding
or mental image of the program's control structure, or logic-architecture. Using the
starting point represented by the ABEND condition (the statement which caused z/OS to
halt execution) and using electronic-assisted tools (such as IBM's Rational Asset
Analyzer or Rational Developer for System z), build an accurate understanding of the
code invocation at:

The module/file level (System View)

Paragraph/Section level (Hierarchy chart)

Statement level (Flow chart), if necessary (i.e. if the code is dense or complex)

Structural Visualization can be done "top-down", by asking open-ended questions, such
as learning how a particular routine "hangs together" logically, or it can be done "bottom-up",
by asking specific close-ended questions about a program, such as "How does this
particular paragraph get executed?" or "How did this module get invoked?"

2. Data Flow Analysis: A combination of control structure analysis and data item
analysis, which seeks to determine the usage of particular fields throughout a
program. Data flow analysis is used to determine (from a given instance of a data item)
where the next occurrence(s) of that item exist in your program, and how the data item is
used: as a receiving field in a MOVE or mathematical operation, as the sending field in
a MOVE statement, or as part of a logic-branch (IF, PERFORM UNTIL/VARYING, etc.).

3. Data Impact Analysis: An expansion of Data Flow Analysis which traces the
movement of data from field-to-field throughout a program, or throughout an entire
application; including I/O (screens and files). Using Data Impact Analysis, you can
identify all fields that might have had an impact on the contents of a field (before the
ABEND occurred). Just as importantly, you can learn the effect that changing this field
will have on the behavior of the application.

4. Textual or Data Item Usage: Utilized more for application maintenance and
enhancement requests, this type of Static Analysis involves searching for "categories"
of program-items, such as "List all fields that contain *JUL*, *GREG*, *YR*, *YEAR*"
(suspect date candidates for Year2000 conversion), or "List all such fields with two-digit
(numeric) or two-byte (alphanumeric) definitions".

5. Code Partitioning: Again, utilized more for application maintenance, enhancements
and application reengineering, Code Partitioning involves mentally organizing and
analyzing code by function or process, such that you understand and can distinguish the
usage of code by business process. For example: "Find all code that relates to the
calculation of premium renewal payments" or "Isolate the code that edits a particular
file", with an eye towards creating a shared subroutine from the code.

Dynamic Analysis:
1. Tracing: Source-level interactive debugging. Watch the program execute statement-by-statement,
and line-by-line. This is very useful for detailed debugging, particularly of
dense or complex instructions. Some software (for example, Rational Developer for
System z) allows you to trace the program logic, attempting to re-create the sequence of
events (COBOL statements) that transpired up to and including the ABEND
condition. Tracing is an invaluable method for detailed debugging. However, given the
size and scope of production applications, it is generally more practical to trace only specific
problem areas of a program.

2. Interactive Execution: Execute (run) a program, stopping at
selective Breakpoints (pause execution each time a certain field-value changes, or when a
value exceeds some threshold), and examining the contents (value) of specific
fields. Interactive Execution must be done by (or with) an application analyst who
understands how the system is supposed to operate. Interactive Execution is useful for
observing control flow, and is often combined with line-by-line tracing by setting
selective breakpoints, monitoring values, "running" the application to the breakpoints,
and then tracing the code line-by-line.

3. Selective Data State Collection: Execute code and establish a functional summary
of specific data states that it creates. Use these states in subsequent test runs to compare
results of current values to expected values.

4. Coverage: Analyze the number of times each COBOL statement is executed for a given
run. This technique is extremely useful for analyzing test data coverage of a given
application. It can also be used effectively for debugging if it makes apparent problems
such as infinite loops (S222, S322 and B37 ABENDs) and over-loaded tables (loading
tables beyond the maximum OCCURS clause and overlaying storage, which can cause
S0C1, S0C4 and S0C7 ABENDs).

Using a COBOL research and analysis tool (such as IBM's Rational Asset Analyzer or Rational
Developer for System z, or some other source-level analysis software), perform Static and/or Dynamic
Analysis on the specific areas of the application relating to the ABEND, to determine HOW this
particular problem occurred in the application (based on WHERE the problem manifested itself to the
system, obtained from the ABEND-AID listing of which statement caused the ABEND).

3. Hypothesis - Determine WHY the ABEND occurred

With the research in steps 1 and 2, you should be able to describe WHAT, WHERE and HOW the
ABEND occurred (at what point in the program the logic failed, and what sequence of COBOL
statements caused the failure).
However, before modifying any logic, you must determine WHY these statements (or sequence of
events) caused this particular failure (e.g. "Why did this production input file contain spaces in a
numeric field?" "Why did the program's logic perform the Initialization routine twice?" "Why did the
Read routine execute past end-of-file?", etc.).
Only through a determination of WHY will you be able to make a change to production business logic
safely, and with confidence that:

Your change will resolve the ABEND

Your change will not introduce new (additional) ABENDs


Sometimes it is relatively easy to come to an understanding of WHY certain ABEND conditions
occurred. For example, perhaps a period was left off the appropriate termination point for
an IF statement - which caused execution to perform an operation out of
sequence. Or perhaps an IF ... NUMERIC test (which should have been coded for all numeric fields
in a file) was forgotten. Or a paragraph was performed through the wrong paragraph-exit, or a
production job was released before certain files were available (causing I/O errors). These types of
ABEND situations can be understood (and usually resolved) fairly quickly. However, this is not
always the case.
What if - in the case of the IF statement with the incorrect termination point - the logic as
coded correctly processed the first 100,000 records in the file? Making a change to a
critical IF condition could very well affect other down-stream processing within the program,
wreaking havoc with subsequent routines. Or what if - in the case of the file containing blanks in the
numeric fields - the input file was supposed to be "clean" (validated) by this point in the jobstream,
having gone through allegedly "exhaustive" edits in prior modules? By simply adding an IF test you
may solve your program's specific ABEND, but you will not have resolved the actual problem - which
exists somewhere else in the system. In other words, provincial approaches to resolving production
ABENDs are not recommended - as they usually change the problem, instead of solving it.
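
The misplaced-period case discussed above can be sketched in a few lines of COBOL. This is only an illustration under stated assumptions: the program name and all field names are hypothetical, and END-IF requires a COBOL-85 level compiler (VS COBOL II or later).

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. PERIODX.
      * Hypothetical sketch: a stray period ends an IF too early.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-BALANCE          PIC S9(5) VALUE +100.
       01  WS-FEE-COUNT        PIC 9(3)  VALUE ZERO.
       01  WS-NOTICE-COUNT     PIC 9(3)  VALUE ZERO.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * Faulty form: the period after the first ADD closes the IF,
      * so the second ADD runs unconditionally, even though the
      * indentation suggests it belongs to the IF.
           IF WS-BALANCE < ZERO
               ADD 1 TO WS-FEE-COUNT.
               ADD 1 TO WS-NOTICE-COUNT.
      * Safer form: END-IF marks the end of the IF explicitly.
           IF WS-BALANCE < ZERO
               ADD 1 TO WS-FEE-COUNT
               ADD 1 TO WS-NOTICE-COUNT
           END-IF.
           DISPLAY 'FEES=' WS-FEE-COUNT
                   ' NOTICES=' WS-NOTICE-COUNT.
           GOBACK.
```

With a positive balance neither IF is true, yet the notice counter is still incremented once by the faulty form. A defect like this can process many records "correctly" before the data exposes it, which is why the WHY matters before the code is changed.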
Note that a clear understanding of the business functionality automated by this process is
usually required to completely resolve WHY something has gone wrong. Calling on
"application/business" experts who understand "the big picture", and the context in which the job
executes, is the rule rather than the exception in this process.
Developing a clear and accurate determination of WHY a problem that led to an ABEND condition
exists may take a considerable amount of time, depending on the:

Size, complexity and structure of the code

Your familiarity with the program's business purpose - coupled with your ability to grasp the
point of each statement (assuming you didn't write the code)

Type of ABEND and reason for the problem (some are more diabolical than others)

Size of the input/output files, and capabilities of your file editor


Note that, in addition to an understanding of the reason for the ABEND, the results of your
investigation should produce an understanding of the solution to the problem (the fix itself).

4. Solution - Fix the problem and test your solution

Take the appropriate action to resolve any business- or system-wide issues. Depending on how
extensive the damage caused by the problem is, or how long any problems have persisted undetected:

Files may have to be restored from backups from a previous point-in-time

Jobs may have to be re-run from a previous point-in-time (synchronized with file
generations)

Files may have to be modified with "one-shot" programs, written to resolve issues that
require "surgery" on the data

Take the appropriate action to fix the technical (coding) problem

Edit program source - modifying the existing production logic and/or

Modify the JCL (if the error included JCL issues)


Test your solution

Compile and Link the new version of the application

Create an "image copy" of the production file system, in order to test your fix

Re-Run the batch job and analyze results

Run "Regression Tests" against the new code - analyze for unexpected results

5. Resolution - Build and migrate back into production

Promote your changes into production


Schedule and re-run the cycle

Appendix - ABEND Completion Codes and some typical causes


While there is a wide variety of reasons for ABEND conditions ("WHYs") in production systems, it is possible (and
useful) to categorize and organize HOW certain conditions often lead to certain types of ABEND completion codes
- in order to expedite or streamline your analysis and research (an 80/20 approach to analysis). The following
information on a few common z/OS ABEND completion codes, and the conditions which generate them, is
included to help you make effective use of ABEND-AID listings and the above debugging, research and analysis
process.

S0C1

Attempt to execute an invalid machine instruction

S0C1s occur due to COBOL:

Table-handling overlay (MOVEs to table subscripts/indexes which are out-of-range - and which
overwrite PROCEDURE DIVISION instructions)

Statements referencing LINKAGE SECTION fields incorrectly

CALLs to an invalid subroutine name

The COBOL compiler always generates valid machine instructions. S0C1s usually occur when populating
tables beyond the valid OCCURS range.

Typical Reasons for S0C1s

Reason: Moving elements to a table using a subscript or index which contains a value beyond
the maximum OCCURS in the table declaration
Explanation: This usually happens because of a loop that is not terminated correctly - such as a
routine which populates a table from an input file containing more records than the table OCCURS
declaration provides for. It can also happen through a MOVE or invalid math statement which
computes an invalid subscript/index value.

Reason: Referencing incorrectly defined/passed LINKAGE SECTION fields
Explanation: If the definitions of your LINKAGE SECTION fields do not match, or the definitions in
the called program are larger than those in the calling program, you could be attempting to reference
data outside of valid storage when statements which reference those fields execute.

Reason: CALL to an invalid or unavailable module-name
Explanation: If your program makes a dynamic CALL and the module-name being called is not found,
you can get S806, S0C4 or S0C1 system errors. The reasons for invalid module-names include:
misspelling the name; incorrectly specifying the STEPLIB/JOBLIB DSN= in the JCL (or incorrectly
concatenating the STEPLIB/JOBLIB datasets); and leaving out apostrophes (or quotes) on a CALL
literal, which causes the COBOL compiler to treat the statement as a CALL identifier. If an
identifier with that name exists in the Data Division, COBOL will attempt a dynamic CALL to the
value of the identifier.
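
The table-overlay case can be guarded against by testing the entry count before it is used as a subscript. The following is a minimal sketch, not a definitive implementation: the program name, field names and table size are all hypothetical, and a counter stands in for the input file so that the example is self-contained.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. TBLLOAD.
      * Hypothetical sketch: stop loading when the table is full
      * rather than storing past the OCCURS limit.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * WS-MAX-ENTRIES must match the OCCURS value below.
       01  WS-MAX-ENTRIES      PIC 9(4) COMP VALUE 5.
      * WS-INCOMING simulates an input file with 7 records.
       01  WS-INCOMING         PIC 9(4) COMP VALUE 7.
       01  WS-LOADED           PIC 9(4) COMP VALUE ZERO.
       01  WS-REJECTED         PIC 9(4) COMP VALUE ZERO.
       01  WS-ITEM-NO          PIC 9(4) COMP VALUE ZERO.
       01  WS-RATE-TABLE.
           05  WS-RATE         PIC 9(4) OCCURS 5 TIMES.
       PROCEDURE DIVISION.
       MAIN-PARA.
           PERFORM LOAD-ONE-ITEM
               VARYING WS-ITEM-NO FROM 1 BY 1
               UNTIL WS-ITEM-NO > WS-INCOMING.
           DISPLAY 'LOADED=' WS-LOADED ' REJECTED=' WS-REJECTED.
           GOBACK.
       LOAD-ONE-ITEM.
      * Test the count before using it as a subscript.
           IF WS-LOADED < WS-MAX-ENTRIES
               ADD 1 TO WS-LOADED
               MOVE WS-ITEM-NO TO WS-RATE (WS-LOADED)
           ELSE
               ADD 1 TO WS-REJECTED
           END-IF.
```

Here five items are stored and two are counted as rejected instead of being written beyond the table. Whether the overflow should be counted, reported or treated as a fatal error is a business decision that this sketch does not try to settle.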

S0C4

Attempt to reference an invalid storage address

S0C4s occur due to COBOL:

Table-handling overlay errors (MOVEs to table subscripts/indexes which are out-of-range - and which
overwrite PROCEDURE DIVISION instructions)

Statements referencing LINKAGE SECTION fields incorrectly

CALLs to an invalid subroutine name

STOP RUN or GOBACK in the INPUT or OUTPUT PROCEDURE when using the
COBOL SORT verb

Attempt to access an unopened dataset

Unless your program is executing with "bounds-checking" (supported by CA-Capex Optimizing, COBOL II
and COBOL/370 - and generally not used in production), your table routines could overlay the contents of
storage beyond the boundary of the OCCURS clause. This can cause S0C7s (see below), S0C1s and S0C4s, either by
overwriting field values in the Data Division (S0C7s) or by overwriting the instructions in
your PROCEDURE DIVISION, producing invalid addresses (operands) for the executable (machine) code
(which in turn can cause S0C1s and S0C4s).

Typical Reasons for S0C4s

Reason: Table subscript or index contains a zero value
Explanation: Verify that all table-handling subscript/index references are within the allowable
range of the table's OCCURS clause (>= 1, <= OCCURS max).

Reason: Moving elements to a table using a subscript or index which contains a value beyond
the maximum OCCURS in the table declaration
Explanation: This usually happens because of a loop that is not terminated correctly - such as a
routine which populates a table from an input file containing more records than the table OCCURS
declaration provides for. It can also happen through a MOVE or invalid math statement which
computes an invalid subscript/index value.

Reason: Referencing incorrectly defined/passed LINKAGE SECTION fields
Explanation: If the definitions of your LINKAGE SECTION fields do not match, or the definitions in
the called program are larger than those in the calling program, you could be attempting to reference
data outside of valid storage when statements which reference those fields execute.

Reason: CALL to an invalid or unavailable module-name
Explanation: If your program makes a dynamic CALL and the module-name being called is not found,
you can get S806, S0C4 or S0C1 system errors. The reasons for invalid module-names include:
misspelling the name; incorrectly specifying the STEPLIB/JOBLIB DSN= in the JCL (or incorrectly
concatenating the STEPLIB/JOBLIB datasets); and leaving out apostrophes (or quotes) on a CALL
literal, which causes the COBOL compiler to treat the statement as a CALL identifier. If an
identifier with that name exists in the Data Division, COBOL will attempt a dynamic CALL to the
value of the identifier.
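
The range check described for subscripts (>= 1, <= OCCURS max) might be coded along the following lines. This is a sketch under assumptions: the program and field names are hypothetical, and the zero value stands in for an input field that was never filled in.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. SUBCHK.
      * Hypothetical sketch: validate a subscript before using it.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-MONTH-NO         PIC 9(2)  VALUE ZERO.
       01  WS-MONTH-TABLE.
           05  WS-MONTH-NAME   PIC X(3)  OCCURS 12 TIMES.
       PROCEDURE DIVISION.
       MAIN-PARA.
           MOVE 'JAN' TO WS-MONTH-NAME (1).
           MOVE 'FEB' TO WS-MONTH-NAME (2).
      * WS-MONTH-NO is still zero here, as it would be if an
      * input field had never been filled in.
           PERFORM SHOW-MONTH.
           MOVE 2 TO WS-MONTH-NO.
           PERFORM SHOW-MONTH.
           GOBACK.
       SHOW-MONTH.
      * Reference the table only when the subscript is in range.
           IF WS-MONTH-NO >= 1 AND WS-MONTH-NO <= 12
               DISPLAY 'MONTH=' WS-MONTH-NAME (WS-MONTH-NO)
           ELSE
               DISPLAY 'INVALID MONTH NUMBER: ' WS-MONTH-NO
           END-IF.
```

The first PERFORM reports the zero subscript as invalid and the second displays FEB. An explicit test like this costs little at run time, which makes it a practical alternative where compiler bounds-checking is switched off in production.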

S0C7

Data exception (invalid numeric data in numeric field - caught by a Convert-to-Binary machine instruction
during a mathematical operation or numeric compare)

S0C7s can occur on COBOL:

Arithmetic instructions:

ADD, SUBTRACT, MULTIPLY, DIVIDE, COMPUTE

Comparisons involving tests of numeric fields (which can occur with the following statements):

IF, EVALUATE, PERFORM UNTIL, PERFORM VARYING, GO TO DEPENDING

MOVE statements when the receiving field is packed (COMP-3) or binary (COMP) and the sending
field contains invalid numeric data

S0C7s occur when z/OS finds invalid numeric data in a field defined as PIC 9 (all PIC 9 fields - DISPLAY,
COMP, COMP-3 and floating point) during arithmetic or compare operations.

Note: S0C7s do not occur on an IF statement comparing PIC X fields

Typical Reasons for S0C7s

Reason: Failure to initialize a WORKING-STORAGE field
Explanation: Be sure all numeric work areas contain a VALUE clause at the elementary level
or are correctly INITIALIZEd before they are used (as other than receiving fields in MOVE
statements) within your program. Be particularly careful with "counters and accumulators". Also,
always initialize elementary (rather than group) COMP-3 fields.

Reason: Non-numeric "input data" in a numeric field
Explanation: May need an "IF NUMERIC" test - or may need to browse output files produced in a
previous job step ("input" to this program).

Reason: Fall-through logic, or invalid branching sequence
Explanation: Sometimes program logic errors force program execution into a paragraph out of
sequence, such as executing an edit routine before a record is READ, or after the file has been closed
(and spaces or HIGH-VALUES have been moved to the record).

Reason: MOVE statements when the receiving field's definition is COMP or COMP-3
Explanation: The type of MOVE statement generated by the COBOL compiler is based on the datatype
definition of the receiving field. If the receiving field is COMP or COMP-3, COBOL generates an
"algebraic MOVE". This will result in a S0C7 if the sending field contains invalid numeric data.
"IF NUMERIC" tests on the sending field may be necessary prior to the MOVE statement.

Reason: Table-loading overlay errors
Explanation: This can happen if a table-loading process overlays data beyond the table OCCURS range
(i.e. non-numeric data can be moved to numeric-defined fields that are adjacent to the storage area
set aside for the table through its OCCURS clause).

Reason: Referencing incorrectly defined/passed LINKAGE SECTION fields
Explanation: If the definitions of your LINKAGE SECTION fields do not match, your program may
reference non-numeric data through numeric field definitions.
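
The "IF NUMERIC" test mentioned above could look roughly like this. It is a minimal sketch with hypothetical program and field names; the two MOVE statements simulate one good input record and one record that arrives with spaces in its numeric field.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. NUMCHK.
      * Hypothetical sketch: test a zoned-decimal input field with
      * IF NUMERIC before using it in arithmetic.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-INPUT-REC.
           05  IN-AMOUNT-X     PIC X(5).
           05  IN-AMOUNT REDEFINES IN-AMOUNT-X
                               PIC 9(5).
      * Accumulators carry a VALUE clause so they start clean.
       01  WS-TOTAL            PIC 9(7) COMP-3 VALUE ZERO.
       01  WS-BAD-COUNT        PIC 9(3)        VALUE ZERO.
       PROCEDURE DIVISION.
       MAIN-PARA.
           MOVE '00125' TO IN-AMOUNT-X.
           PERFORM ADD-AMOUNT.
      * Spaces in a numeric field are a classic S0C7 trigger.
           MOVE SPACES TO IN-AMOUNT-X.
           PERFORM ADD-AMOUNT.
           DISPLAY 'TOTAL=' WS-TOTAL ' BAD=' WS-BAD-COUNT.
           GOBACK.
       ADD-AMOUNT.
      * Only numeric data reaches the packed-decimal ADD.
           IF IN-AMOUNT NUMERIC
               ADD IN-AMOUNT TO WS-TOTAL
           ELSE
               ADD 1 TO WS-BAD-COUNT
           END-IF.
```

The good record is added to the total and the blank one is counted as bad rather than reaching the ADD. As noted in the Hypothesis step, a test like this removes the ABEND but does not explain why an upstream job delivered invalid data.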

S0CB

Attempt to divide by zero/decimal-divide overflow

S0CBs occur due to COBOL:

DIVIDE statements if the quotient in a division using a decimal operand is greater than the size of the
receiving field

Division by zero

Note that S0CB ABENDs may be intercepted by COBOL library subroutines (which automatically check for
zero before dividing). If this is the case, a zero-divide will result in "user" return-codes:

U0203 - OS/VS COBOL

U1061 - VS COBOL II

Typical Reasons for S0CBs

Reason: DIVIDE by zero
Explanation: Program logic should always check to see if the divisor has been properly initialized
or updated or, in the case of input edits and data validation, that the divisor is not zero before
doing the division. Also check to see whether a fractional value was MOVEd to an integer field,
truncating the fractional value and resulting in a zero divide.

Reason: Decimal DIVIDE exception
Explanation: Check the specification of the COMP-3 receiving field, the placement of the V in the
receiving field definition (and the overall definition of the receiving field). Also, check to see
if the ON SIZE ERROR condition should have been coded.
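
Both checks from the explanations above, a zero-divisor test and ON SIZE ERROR, can be combined in one paragraph. This is a sketch under assumptions: the program name, field names and sample values are hypothetical and chosen only to exercise each branch.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. DIVCHK.
      * Hypothetical sketch: guard a DIVIDE against a zero divisor
      * and against a quotient too large for the receiving field.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  WS-TOTAL-AMT        PIC 9(7)V99 VALUE 50000.00.
       01  WS-UNITS            PIC 9(5)    VALUE ZERO.
       01  WS-UNIT-COST        PIC 9(3)V99 VALUE ZERO.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * First pass: WS-UNITS is zero, so the division is skipped.
           PERFORM COMPUTE-UNIT-COST.
      * Second pass: 50000 / 2 = 25000, too big for 9(3)V99.
           MOVE 2 TO WS-UNITS.
           PERFORM COMPUTE-UNIT-COST.
      * Third pass: 50000 / 400 = 125.00, which fits.
           MOVE 400 TO WS-UNITS.
           PERFORM COMPUTE-UNIT-COST.
           GOBACK.
       COMPUTE-UNIT-COST.
           IF WS-UNITS = ZERO
               DISPLAY 'DIVISOR IS ZERO - DIVISION SKIPPED'
           ELSE
               DIVIDE WS-TOTAL-AMT BY WS-UNITS
                   GIVING WS-UNIT-COST ROUNDED
                   ON SIZE ERROR
                       DISPLAY 'QUOTIENT DOES NOT FIT'
                   NOT ON SIZE ERROR
                       DISPLAY 'UNIT COST=' WS-UNIT-COST
               END-DIVIDE
           END-IF.
```

When ON SIZE ERROR is taken, the receiving field is left unchanged, so the program has to decide what value (if any) is safe to carry forward.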

S001

Input/Output problem

S001s occur due to COBOL logic errors:

File READ/WRITE error

File OPEN/CLOSE error

S001 errors occur primarily due to incorrect COBOL logic (fall-thru errors, logic executed out of sequence,
etc.)

Typical Reasons for S001s

Reason: S001 on a READ operation
Explanation: Occurs if your program READs before opening a file or READs after closing a file
(place file OPEN/CLOSE statements in dedicated Initialization and Termination paragraphs).
Can also occur if your program READs past the end-of-file condition (create a unique
end-of-file switch for each file your program reads, and watch the "switch" on the READ statement
and PERFORM UNTIL).
Can also occur if your program attempts to READ from a file OPEN for OUTPUT.

Reason: S001 on a WRITE operation
Explanation: Occurs if you WRITE before opening a file or after closing a file (see above on
Initialization/Termination routines).
Can also occur if your program attempts to WRITE to a file OPEN for INPUT.
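
The recommendations above (dedicated Initialization and Termination paragraphs, one end-of-file switch per file, PERFORM UNTIL driven by that switch) can be sketched as follows. The program name, file name and field names are hypothetical, and WORKDD is an assumed DD name that the run-time JCL would have to supply. The sketch writes its own three-record test file first so that it needs no other input.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. READCHK.
      * Hypothetical sketch: OPEN and CLOSE live in dedicated
      * paragraphs and one end-of-file switch drives the READ loop.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
      * WORKDD is an assumed DD name; the JCL must supply it.
           SELECT WORK-FILE ASSIGN TO WORKDD
               ORGANIZATION IS SEQUENTIAL
               FILE STATUS IS WS-WORK-STATUS.
       DATA DIVISION.
       FILE SECTION.
       FD  WORK-FILE.
       01  WORK-REC            PIC X(20).
       WORKING-STORAGE SECTION.
       01  WS-WORK-STATUS      PIC X(2)  VALUE SPACES.
       01  WS-SAMPLE-REC       PIC X(20) VALUE 'SAMPLE RECORD'.
       01  WS-WORK-EOF-SW      PIC X     VALUE 'N'.
           88  WORK-EOF                  VALUE 'Y'.
       01  WS-WORK-OPEN-SW     PIC X     VALUE 'N'.
           88  WORK-IS-OPEN              VALUE 'Y'.
       01  WS-REC-COUNT        PIC 9(5)  VALUE ZERO.
       PROCEDURE DIVISION.
       MAIN-PARA.
           PERFORM BUILD-TEST-FILE.
           PERFORM INIT-PARA.
           PERFORM READ-PARA UNTIL WORK-EOF.
           PERFORM TERM-PARA.
           DISPLAY 'RECORDS READ=' WS-REC-COUNT.
           GOBACK.
       BUILD-TEST-FILE.
      * Writes three records so the sketch needs no other input.
           OPEN OUTPUT WORK-FILE.
           PERFORM 3 TIMES
               WRITE WORK-REC FROM WS-SAMPLE-REC
           END-PERFORM.
           CLOSE WORK-FILE.
       INIT-PARA.
           OPEN INPUT WORK-FILE.
           IF WS-WORK-STATUS = '00'
               SET WORK-IS-OPEN TO TRUE
           ELSE
               DISPLAY 'OPEN FAILED, STATUS=' WS-WORK-STATUS
               SET WORK-EOF TO TRUE
           END-IF.
       READ-PARA.
      * AT END sets the switch tested by PERFORM UNTIL, so the
      * loop never READs past end-of-file.
           READ WORK-FILE
               AT END
                   SET WORK-EOF TO TRUE
               NOT AT END
                   ADD 1 TO WS-REC-COUNT
           END-READ.
      * Any status other than success or end-of-file ends the loop.
           IF WS-WORK-STATUS NOT = '00'
              AND WS-WORK-STATUS NOT = '10'
               DISPLAY 'READ FAILED, STATUS=' WS-WORK-STATUS
               SET WORK-EOF TO TRUE
           END-IF.
       TERM-PARA.
      * CLOSE only a file that was opened successfully.
           IF WORK-IS-OPEN
               CLOSE WORK-FILE
           END-IF.
```

Checking the FILE STATUS field after each READ is an addition beyond what the text above recommends. It is included so that an I/O error other than end-of-file ends the loop instead of leaving it to repeat indefinitely.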

S013

Conflict in DCB (Data Control Block) parameters

S013s occur due to inconsistencies between COBOL file description statements in your program, and:

The DCB (data control block) parameter specified on the file DD statement in your JCL (for output
files) or

The DCB entry taken from the physical file DCB parameters, stored on the file's device header.

Typical Reasons for S013s

Reason: S013 on an OPEN statement for an input file
Explanation: Occurs if your program's RECORD CONTAINS clause conflicts with the physical file's
record length, or if your program's BLOCK CONTAINS clause conflicts with the physical file's
blocking factor. Suggestion - on input files, do not specify RECORD CONTAINS. Code
BLOCK CONTAINS 0 RECORDS.

Reason: S013 on an OPEN statement for an output file
Explanation: Occurs if your program's RECORD CONTAINS clause conflicts with the file's JCL
(LRECL= size), or if your program's BLOCK CONTAINS clause conflicts with the file's JCL BLKSIZE=
parameter. Suggestion - on output files, code BLOCK CONTAINS 0 RECORDS.

S213

File open error

S213s occur when an input file is not found. This can happen if:

The file does not exist or

The filename is misspelled on the JCL DSN= parameter

Typical Reasons for S213s

Reason: S213 on an OPEN statement for an input file
Explanation: Occurs on file OPEN when the system cannot find the input filename as specified
in your JCL. This can happen because of a simple typo in the JCL, or because a
previous job failed to complete successfully.

S122/222/322

Operator cancel

S122/S222s occur when an operator cancels a job

S122 means the job was canceled and a storage dump was requested

S222 means the job was canceled, but a dump was not requested (although, depending on which z/OS
routine was active when the job was canceled, a dump may have been produced)

S322s occur when z/OS cancels a job because the default or specified CPU time limit for a job step or
procedure was exceeded

(Note on S122/222) It is important to note that S122/222 job cancellations are "judgment calls" by the system
operator, and that in fact, there may be nothing wrong at all. Always begin your research by calling the operator
and requesting an explanation of why they canceled the job.
(Note on S322) If a job that normally processes 100,000 records jumps to 10,000,000, or if it is run on a slower
CPU with slower external devices, S322 may simply signify that you have to increase the CPU time in the JCL.
However, S122/222/322s can also occur because of program logic or job execution errors:

Typical Reasons for S122/222/322s

Reason: Job is deadlocked (program is in a Wait state)
Explanation: Occurs when a file your program has requested cannot be allocated to your process,
because some other program is using it. This generally occurs when jobs are initiated out of sequence.

Reason: Program is in an infinite loop
Explanation: Occurs when your logic repeatedly executes the same routines over and over. Generally
due to incorrectly setting or checking switches and return-codes, or some type of fall-through error.

S806

Requested Load Module not found

S806s occur when a called program (or system subroutine) is not found. This can happen if:

The module name is misspelled on the CALL statement

The module was not successfully LINKed into the application

The program name is misspelled on the JCL EXEC PGM= parameter


The STEPLIB/JOBLIB DD statements point to incorrect load libraries, or the libraries are incorrectly
concatenated

Typical Reasons for S806s

Reason: Module name is misspelled
Explanation: If your program makes a dynamic CALL and the module-name being called is not found,
you can get S806, S0C4 or S0C1 system errors. The reasons for invalid module-names include:
misspelling the name; incorrectly specifying the STEPLIB/JOBLIB DSN= in the JCL (or incorrectly
concatenating the STEPLIB/JOBLIB datasets); and leaving out apostrophes (or quotes) on a CALL
literal, which causes the COBOL compiler to treat the statement as a CALL identifier. If an
identifier with that name exists in the Data Division, COBOL will attempt a dynamic CALL to the
value of the identifier.
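
For a dynamic CALL, the ON EXCEPTION phrase gives the program a chance to handle a module that cannot be made available. The sketch below assumes a COBOL-85 level compiler. The program name is hypothetical, and RATECALC is an invented module name that is not expected to exist in any load library.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. CALLCHK.
      * Hypothetical sketch: trap a dynamic CALL whose target
      * module cannot be found instead of letting the job ABEND.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      * RATECALC is an invented module name used for illustration.
       01  WS-MODULE-NAME      PIC X(8)  VALUE 'RATECALC'.
       01  WS-CALL-AREA        PIC X(20) VALUE SPACES.
       PROCEDURE DIVISION.
       MAIN-PARA.
      * CALL identifier is always a dynamic CALL.
           CALL WS-MODULE-NAME USING WS-CALL-AREA
               ON EXCEPTION
                   DISPLAY 'MODULE NOT FOUND: ' WS-MODULE-NAME
                   MOVE 16 TO RETURN-CODE
               NOT ON EXCEPTION
                   DISPLAY 'MODULE CALLED: ' WS-MODULE-NAME
           END-CALL.
           GOBACK.
```

Setting RETURN-CODE to 16 is an arbitrary choice made for this sketch. The underlying cause (a misspelled name or an incorrect STEPLIB/JOBLIB concatenation) still has to be found and corrected.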

B37/E37

Out of space condition

B37/E37s occur when there is insufficient space on an output device. This can occur because of:

Insufficient SPACE allocated through the JCL for an output file - in which case you should re-estimate the SPACE requirements for your output file, and increase the SPACE allocation

Insufficient SPACE on a particular DASD device - in which case you should either choose a different
device, or remove some files from the pack.

A program logic error such as an infinite loop which includes WRITE statements

Typical Reasons for B37/E37s

Reason: Program is in an infinite loop in a WRITE routine
Explanation: Occurs when your logic repeatedly executes a WRITE statement over and over. Generally
due to incorrectly setting or checking switches and return-codes, or some type of fall-through error.
