
MODULE III

Introduction to compiling: Compilers, analysis of a source program, the phases of a compiler.
Lexical Analysis: The role of the lexical analyzer, input buffering, specification of tokens, recognition of tokens, finite automata, conversion of an NFA to a DFA, from a regular expression to an NFA.

COMPILERS

Introduction to Compilers

Translator
A translator is a program that takes a program written in one programming language as input and produces a program in another language as output. If the source language is a high-level language and the object language is a low-level language, then such a translator is called a compiler.

    Source Program -> Compiler -> Object Program

Analysis of Source Program

The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them. It then uses this structure to create an intermediate representation of the source program.

If the analysis part detects any error, it must provide informative messages so the user can take corrective action.

The analysis part also collects information about the source program and stores it in a data structure called the symbol table (SYMTAB).

The synthesis part constructs the desired target program from the intermediate representation and the information in the SYMTAB.

The analysis part is often called the front end, and the synthesis part is called the back end.

The phases form a pipeline; the symbol table is shared by all of them:

    source program
      -> Lexical Analyzer          -> token stream
      -> Syntax Analyzer           -> syntax tree
      -> Semantic Analyzer         -> syntax tree
      -> Intermediate Code Generator -> intermediate representation
      -> Machine-Independent Code Optimizer -> intermediate representation
      -> Code Generator            -> target machine code
      -> Machine-Dependent Code Optimizer -> target machine code

Phases of a compiler

Lexical Analysis (Scanning)
- The first phase of a compiler.
- The lexical analyzer reads the stream of characters from the source program and groups the characters into meaningful sequences called lexemes.
- For each lexeme, the lexical analyzer produces a token as output of the form
      (token-name, attribute-value)
  where token-name is an abstract symbol that is used during syntax analysis, and attribute-value points to an entry in the symbol table for this token.

Eg. position = initial + rate * 60

The lexemes and tokens are:
1) position is a lexeme that would be mapped into a token <id,1>, where id is an abstract symbol for identifier and 1 points to the SYMTAB entry for position.
2) = is a lexeme that is mapped into the token <=>. Since this token needs no attribute value, we have omitted the second component.
3) initial - <id,2>
4) +       - <+>
5) rate    - <id,3>
6) *       - <*>
7) 60      - <60>
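The mapping above can be sketched in Python; a minimal illustration (not the book's code), where identifiers are numbered by their symbol-table entry and operators carry no attribute:

```python
# Minimal sketch: split the statement into lexemes, then emit
# (token-name, attribute) pairs as described above.
import re

def tokenize(source):
    symtab = []              # symbol table: list of identifier lexemes
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|[=+*]", source):
        if lexeme[0].isalpha() or lexeme[0] == "_":
            if lexeme not in symtab:
                symtab.append(lexeme)
            tokens.append(("id", symtab.index(lexeme) + 1))
        elif lexeme.isdigit():
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((lexeme, None))   # operators need no attribute
    return tokens, symtab

tokens, symtab = tokenize("position = initial + rate * 60")
print(tokens)
# -> [('id', 1), ('=', None), ('id', 2), ('+', None), ('id', 3), ('*', None), ('number', 60)]
```

The symbol table ends up holding position, initial and rate in entries 1, 2 and 3, matching the lexeme list above.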

Syntax Analysis (Parsing)

The second phase of the compiler. The parser uses the first components of the tokens produced by the lexical analyzer to create syntax trees.

The syntax tree for the above example is:

            =
           / \
      <id,1>  +
             / \
        <id,2>  *
               / \
          <id,3>  60

Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the SYMTAB to check the source program for semantic consistency with the language definition.
It also gathers type information and saves it in either the syntax tree or the SYMTAB, for subsequent use during intermediate code generation.
An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands (e.g., the compiler must report an error if a float value is used as an array index).
Eg. Suppose position, initial and rate are float numbers, while the lexeme 60 is an integer. The type checker in the semantic analyzer discovers that the operator * is applied to the float rate and the int 60, so the int 60 is converted to float.

Intermediate Code Generation

In the process of translation from source to target code, the compiler may construct one or more intermediate representations.

This intermediate representation should be (a) easy to produce and (b) easy to translate into the target machine code.

Eg. t1 = inttofloat(60)
    t2 = id3 * t1
    t3 = id2 + t2
    id1 = t3
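The three-address sequence above can be produced by a postorder walk of the syntax tree that allocates a fresh temporary per interior node. A hedged sketch (the tree shape and helper names are illustrative, not the book's code):

```python
def gen(node, code, counter):
    """Emit three-address code for an expression tree; return the place holding its value."""
    if isinstance(node, str):                 # leaf: an id or a literal
        return node
    if node[0] == "inttofloat":               # unary conversion inserted by the type checker
        arg = gen(node[1], code, counter)
        counter[0] += 1
        t = f"t{counter[0]}"
        code.append(f"{t} = inttofloat({arg})")
        return t
    op, left, right = node
    l = gen(left, code, counter)
    r = gen(right, code, counter)
    counter[0] += 1
    t = f"t{counter[0]}"
    code.append(f"{t} = {l} {op} {r}")
    return t

# the type-checked tree for position = initial + rate * 60
tree = ("+", "id2", ("*", "id3", ("inttofloat", "60")))
code, counter = [], [0]
code.append(f"id1 = {gen(tree, code, counter)}")
```

Running this reproduces exactly the four instructions listed above, t1 through t3 followed by the assignment to id1.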

Code Optimization

The machine-independent code optimization phase attempts to improve the intermediate code so that better target code will result.

A simple intermediate code generation algorithm followed by code optimization is a reasonable way to generate good target code.

The optimizer can deduce that the conversion of 60 from int to float can be done once, at compile time. So the inttofloat operation can be eliminated by replacing the int 60 by the float 60.0.

Eg. t1 = id3 * 60.0
    id1 = id2 + t1

Code Generation

The code generator takes as input an intermediate representation of the source program and maps it to the target language.

If the target language is machine code, registers or memory locations are selected for each of the variables used by the program. Then the intermediate instructions are translated into sequences of machine instructions.

Eg. LDF  R2, id3
    MULF R2, R2, #60.0
    LDF  R1, id2
    ADDF R1, R1, R2
    STF  id1, R1

SYMBOL TABLE MANAGEMENT

An essential function of a compiler is to record the variable names used in the source program and collect information about the various attributes of each name.
This data structure should be designed to allow the compiler to find the record for each name quickly, and to store or retrieve data from that record quickly.

Translation of an assignment statement

position = initial + rate * 60

Lexical Analyzer:
    <id,1> = <id,2> + <id,3> * <60>

Syntax Analyzer:
            =
           / \
      <id,1>  +
             / \
        <id,2>  *
               / \
          <id,3>  60

Semantic Analyzer:
            =
           / \
      <id,1>  +
             / \
        <id,2>  *
               / \
          <id,3>  inttofloat(60)

Intermediate code generator:
    t1 = inttofloat(60)
    t2 = id3 * t1
    t3 = id2 + t2
    id1 = t3

Code optimizer:
    t1 = id3 * 60.0
    id1 = id2 + t1

Code Generator:
    LDF  R2, id3
    MULF R2, R2, #60.0
    LDF  R1, id2
    ADDF R1, R1, R2
    STF  id1, R1

Role of Lexical Analyzer

The main task of the lexical analyzer is to read the input characters, group them into lexemes, and produce tokens. The stream of tokens is sent to the parser for syntax analysis.

    source program -> Lexical Analyzer --token--> Parser -> to semantic analysis

The parser requests each token with getNextToken; both the lexical analyzer and the parser consult the symbol table.

Tasks (Role) of Lexical Analyzer

- Identification of lexemes.
- Removal of comments and white space (blank, newline, tab, etc.).
- Correlating error messages generated by the compiler with the source program.

Lexical analysis is sometimes divided into two processes:
a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive white-space characters into one.
b) Lexical analysis proper is the more complex process, where the scanner produces the sequence of tokens as output.

Tokens, Patterns and Lexemes

- A token is a pair with a token name and an optional attribute value.
- A pattern is a description of the form that the lexemes of a token may take.
- A lexeme is a sequence of characters in the source program that matches the pattern for a token.

INPUT BUFFERING

Specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character.

Two buffers are alternately reloaded. Each buffer is of the same size N, where N is the size of a disk block.

For example, while scanning the input E = M * ..., the pointer lexemeBegin marks the start of the current lexeme, forward scans ahead, and eof marks the end of the input held in the buffers.

Input Buffering
Two pointers are required:
- lexemeBegin marks the beginning of the current lexeme.
- forward scans ahead until a pattern match is found.
Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.

Sentinels
Sentinels are used to mark the end of input. The natural choice is the character eof.
A sentinel eof is placed at the end of each buffer, so the test for the end of a buffer and the test for the character itself are combined. Any eof that appears other than at the end of a buffer means that the input is at an end.

switch (*forward++) {
case eof:
    if (forward is at the end of the first buffer) {
        reload the second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at the end of the second buffer) {
        reload the first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}

SPECIFICATION OF TOKENS

Strings and Languages
- An alphabet is a finite set of symbols. The set {0,1} is the binary alphabet.
- A string over an alphabet is a finite sequence of symbols drawn from that alphabet. |s| denotes the length of a string s; e.g. banana is a string of length 6.
- A language is any countable set of strings over some fixed alphabet. This includes abstract languages such as ∅, the empty set, and {ε}, the set containing only the empty string.
- The empty string ε is the identity under concatenation; that is, for any string s, εs = sε = s.
- Exponentiation of strings: s^0 is ε, and for all i > 0, s^i is s^(i-1)s. Since εs = s, s^1 = s, s^2 = ss, s^3 = sss, and so on.

Operations on Languages

OPERATION                                   DEFINITION
union of L and M, written L ∪ M             L ∪ M = {s | s is in L or s is in M}
concatenation of L and M, written LM        LM = {st | s is in L and t is in M}
Kleene closure of L, written L*             L* = ∪ (i >= 0) L^i
                                            (zero or more concatenations of L)
positive closure of L, written L+           L+ = ∪ (i >= 1) L^i
                                            (one or more concatenations of L)

Operations on Languages (contd.)

Example:
Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits. Other languages constructed from L and D are:

1. L ∪ D is the set of letters and digits - strictly speaking, the language with 62 (52 + 10) strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 (52 x 10) strings of length two, each consisting of one letter followed by one digit. Ex: A1, a1, B0, etc.
3. L^4 is the set of all 4-letter strings (ex: aaba, bcef).
4. L* is the set of all strings of letters, including ε.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
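These operations can be tried out on small finite languages represented as Python sets; a sketch using two-symbol stand-ins for L and D so the results stay small:

```python
# Stand-ins for the alphabets above (truncated to keep the sets printable).
L = {"a", "b"}        # plays the role of the 52 letters
D = {"0", "1"}        # plays the role of the 10 digits

union = L | D                                 # L ∪ D
concat = {s + t for s in L for t in D}        # LD: every letter followed by every digit

# exponentiation: X^n, with X^0 = {""} (the empty string)
power = lambda X, n: {""} if n == 0 else {s + t for s in power(X, n - 1) for t in X}

assert concat == {"a0", "a1", "b0", "b1"}     # |L| * |D| strings of length two
assert len(power(L, 4)) == 2 ** 4             # analogue of L^4: all 4-symbol strings
```

With the full 52-letter L and 10-digit D, the same expressions give the counts in the text: 62 for the union and 520 for LD.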

Regular Expression & Regular Language

- A regular expression is a notation that allows us to define a pattern in a high-level way.
- Each regular expression r denotes a language L(r), the set of strings matching the regular expression r.
- Note: each word (token pattern) in a program can be expressed by a regular expression.

Eg. Suppose we want to describe the set of valid C identifiers. If letter_ stands for any letter or the underscore, and digit stands for any digit, then we would describe the language of C identifiers by:
    letter_ (letter_ | digit)*
- | means union.
- ( ) are used to group subexpressions.
- * means zero or more occurrences of the preceding expression.
- The juxtaposition of letter_ with the remainder of the expression signifies concatenation.
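The same pattern can be written with Python's re module; a small sketch using the character classes [A-Za-z_] for letter_ and [0-9] for digit:

```python
import re

# letter_ (letter_ | digit)*  anchored so the whole string must match
c_identifier = re.compile(r"[A-Za-z_][A-Za-z_0-9]*\Z")

assert c_identifier.match("rate")
assert c_identifier.match("_tmp1")
assert not c_identifier.match("60rate")   # may not begin with a digit
```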

Rules for constructing regular expressions

Regular expressions are built recursively out of smaller regular expressions using the following rules.

BASIS:
1. ε is a regular expression denoting {ε}, the language containing only the empty string: L(ε) = {ε}.
2. If a is a symbol in the alphabet Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position. (We use italics for symbols and boldface for their corresponding regular expressions.)

INDUCTION:
Let r and s be regular expressions with languages L(r) and L(s). Then
a) (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting the language (L(r))*.
d) (r) is a regular expression denoting the language L(r).

Precedence
- * has the highest precedence.
- Concatenation has the second highest precedence.
- | has the lowest precedence.
Eg. (a) | ((b)*(c)) may be replaced by a | b*c.

Algebraic laws of Regular Expressions

AXIOM                           DESCRIPTION
r | s = s | r                   | is commutative
r | (s | t) = (r | s) | t       | is associative
(rs)t = r(st)                   concatenation is associative
r(s|t) = rs | rt                concatenation distributes over |
(s|t)r = sr | tr
εr = rε = r                     ε is the identity element for concatenation
r* = (r | ε)*                   relation between * and ε
r** = r*                        * is idempotent

Regular Definitions
We can give names to certain regular expressions and use those names in subsequent expressions:
    d1 -> r1
    d2 -> r2
    ...
    dn -> rn

E.g. C identifiers are strings of letters, digits and underscores:
    letter_ -> A | B | ... | Z | a | b | ... | z | _
    digit   -> 0 | 1 | 2 | ... | 9
    id      -> letter_ (letter_ | digit)*
This can also be written with character classes as:
    letter_ -> [A-Za-z_]
    digit   -> [0-9]
    id      -> letter_ (letter_ | digit)*
We shall conventionally use italics for the symbols defined in regular definitions.

Recognition of tokens

In this topic we study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.

Consider the following example, a grammar for branching statements:

    stmt -> if expr then stmt
          | if expr then stmt else stmt
          | ε
    expr -> term relop term
          | term
    term -> id
          | number

For relop, we use the comparison operators. The patterns for the tokens are:

    digit  -> [0-9]
    digits -> digit+
    number -> digits (. digits)? (E [+-]? digits)?
    letter -> [A-Za-z]
    id     -> letter (letter | digit)*
    if     -> if
    then   -> then
    else   -> else
    relop  -> < | > | <= | >= | = | <>

The pattern for white space is:
    ws -> (blank | tab | newline)+
Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the white space.
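The token patterns above (ws, number, keywords, id, relop) can be sketched as a Python tokenizer built from one combined regular expression; an illustrative sketch, not production scanner code:

```python
import re

# One named group per token pattern; ws is matched but never returned,
# and the keyword patterns are listed before id so they win.
token_spec = [
    ("ws",     r"[ \t\n]+"),
    ("number", r"\d+(?:\.\d+)?(?:E[+-]?\d+)?"),
    ("if",     r"if\b"), ("then", r"then\b"), ("else", r"else\b"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("relop",  r"<=|>=|<>|<|>|="),
]
master = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in token_spec))

def tokens(text):
    for m in master.finditer(text):
        if m.lastgroup != "ws":        # discard white space, restart the scan
            yield (m.lastgroup, m.group())

print(list(tokens("if x1 <= 60 then y")))
```

The \b word boundaries keep a keyword from matching inside a longer identifier such as iffy, playing the role of the reserved-word check discussed later.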

LEXEMES        TOKEN NAME    ATTRIBUTE VALUE
any ws         -             -
if             if            -
then           then          -
else           else          -
any id         id            pointer to table entry
any number     number        pointer to table entry
<              relop         LT
<=             relop         LE
=              relop         EQ
<>             relop         NE
>              relop         GT
>=             relop         GE

Tokens, their patterns, and attribute values

Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first convert patterns into "transition diagrams".
Transition diagrams have a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.

Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols.
All our transition diagrams are deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.

Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found. (We always indicate an accepting state by a double circle, and if there is an action to be taken, typically returning a token and an attribute value to the parser, we shall attach that action to the accepting state.)
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start", entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.

Transition diagram for relop

We begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <>, or <=. Therefore we go to state 1 and look at the next character.
If it is =, then we recognize the lexeme <=, enter state 2, and return the token relop with attribute LE, the symbolic constant representing this particular comparison operator.
If in state 1 the next character is >, then instead we have the lexeme <>, and we enter state 3 to return an indication that the not-equals operator has been found.
On any other character, the lexeme is < by itself, and we enter state 4 to return that information. State 4 has a * to indicate that we must retract the input one position.
If in state 0 we see any character besides <, =, or >, we cannot possibly be seeing a relop lexeme, so this transition diagram will not be used.
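The relop diagram can be transcribed directly into code; a sketch in Python where the function returns the token plus the number of characters consumed, and the starred (retracting) states simply consume one character fewer:

```python
def relop(s):
    """Run the relop transition diagram on the front of s.
    Returns ((token, attribute), chars_consumed) or (None, 0)."""
    if not s:
        return None, 0
    if s[0] == "<":                         # state 1
        if len(s) > 1 and s[1] == "=":
            return ("relop", "LE"), 2       # state 2
        if len(s) > 1 and s[1] == ">":
            return ("relop", "NE"), 2       # state 3
        return ("relop", "LT"), 1           # state 4*: retract one position
    if s[0] == "=":
        return ("relop", "EQ"), 1
    if s[0] == ">":
        if len(s) > 1 and s[1] == "=":
            return ("relop", "GE"), 2
        return ("relop", "GT"), 1           # starred state: retract
    return None, 0                          # diagram not used for this input
```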

Recognition of Reserved Words and Identifiers

Usually, keywords like if or then are reserved, so they are not identifiers even though they look like identifiers.

Transition diagram for id's and keywords:

            letter or digit (loop at 10)
    start --letter--> (10) --other--> (11)*   return(getToken(), installID())

There are two ways that we can handle reserved words that look like identifiers:

1) Install the reserved words in the symbol table initially.
When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. Any identifier not in the symbol table before lexical analysis cannot be a reserved word, so its token is id.

The function getToken examines the symbol-table entry for the lexeme found, and returns whatever token name the symbol table says this lexeme represents: either id or one of the keyword tokens that was initially installed in the table.
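Approach (1) can be sketched in a few lines of Python; the dictionary stands in for the symbol table, and the lexeme itself stands in for a table pointer (both simplifications for illustration):

```python
# Symbol table pre-loaded with the reserved words, each mapped to its own token name.
symtab = {"if": "if", "then": "then", "else": "else"}

def installID(lexeme):
    symtab.setdefault(lexeme, "id")    # added as an ordinary id only if not already there
    return lexeme                      # stands in for a pointer to the table entry

def getToken(lexeme):
    return symtab[lexeme]              # whatever token name the table records

assert getToken(installID("if")) == "if"      # reserved word keeps its keyword token
assert getToken(installID("rate")) == "id"    # a new name becomes an identifier
```

Because setdefault never overwrites an existing entry, installing the lexeme "if" cannot demote the keyword to an ordinary identifier.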

2) Create separate transition diagrams for each keyword.
For example, the transition diagram for then spells out t-h-e-n state by state and accepts on a following nonletter/nondigit, with a * to retract the input.

There is also a transition diagram for unsigned numbers, which follows the pattern digits (. digits)? (E [+-]? digits)?.

A transition diagram for whitespace loops on delim. Here we look for one or more white-space characters, represented by delim; these characters would be blank, tab, newline, etc.
In the accepting state (state 24), we have found a block of consecutive white-space characters followed by a non-white-space character. We retract the input to begin at the non-white-space character, but we do not return to the parser.

Design of Lexical Analyzer

The initial step is to form flowcharts for the valid possible tokens. Flowcharts for a lexical analyzer are known as transition diagrams. Their components are:
- States, represented by circles.
- Edges, the arrows connecting the states. The labels on the edges indicate the input characters that can appear after that state.

Transition diagram for identifier:

                letter or digit (loop at 1)
    start -> 0 --letter--> 1 --delimiter--> 2*

Fig: Transition diagram for identifier

The next step is to produce code for each of the states.

The code for State 0:
    State 0: C := GETCHAR();
             if LETTER(C) then goto State 1
             else FAIL()
Here LETTER is a Boolean-valued function that returns true if C is a letter. FAIL is a routine which retracts the lookahead pointer and starts up the next transition diagram, or calls the error routine.

The code for State 1:
    State 1: C := GETCHAR();
             if LETTER(C) or DIGIT(C) then goto State 1
             else if DELIMITER(C) then goto State 2
             else FAIL()
Here DIGIT is a Boolean-valued function that returns true if C is one of the digits 0, 1, ..., 9. DELIMITER is a function which returns true whenever C is a character that could follow an identifier.

The code for State 2:
    State 2: RETRACT();
             return (id, INSTALL())
State 2 indicates that an identifier has been found. Since the delimiter is not part of the token found, the procedure RETRACT will move the lookahead pointer one character back; the * marks states on which input retraction must take place.
The INSTALL() procedure will install the identifier into the symbol table if it is not already there.
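The three states above can be transcribed into Python as a sketch; FAIL is modeled by returning None, and RETRACT by simply not consuming the delimiter:

```python
def recognize_identifier(text):
    """Run states 0-2 of the identifier diagram on the front of text."""
    pos = 0
    # State 0: the first character must be a letter, otherwise FAIL().
    if pos >= len(text) or not text[pos].isalpha():
        return None
    pos += 1
    # State 1: absorb letters and digits until a delimiter appears.
    while pos < len(text) and text[pos].isalnum():
        pos += 1
    # State 2: RETRACT() -- the delimiter is not part of the lexeme.
    return ("id", text[:pos])

assert recognize_identifier("count1 = 0") == ("id", "count1")
assert recognize_identifier("9lives") is None    # digits cannot start an identifier
```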

Token        Code    Value
begin        1       ---
end          2       ---
if           3       ---
then         4       ---
else         5       ---
identifier   6       pointer to symbol table
constant     7       pointer to symbol table
<            8       1
<=           8       2
=            8       3
<>           8       4
>            8       5
>=           8       6

Fig: Tokens recognized

Keywords:

The keyword diagrams spell out each reserved word character by character and accept on a blank or newline, retracting the input (*):

    B-E-G-I-N  --blank/newline-->  *  return(1, )
    E-N-D      --blank/newline-->  *  return(2, )
    I-F        --blank/newline-->  *  return(3, )
    T-H-E-N    --blank/newline-->  *  return(4, )
    E-L-S-E    --blank/newline-->  *  return(5, )

Identifier:

                 letter or digit (loop at 24)
    start -> 23 --letter--> 24 --not letter or digit--> 25*   return(6, INSTALL())

Constant:

                 digit (loop at 27)
    start -> 26 --digit--> 27 --not digit--> 28*   return(7, INSTALL())

Relops:

    start -> 29
    29 --<--> 30:  on "=" -> return(8,2) [<=];  on ">" -> return(8,4) [<>];
                   on anything else -> * return(8,1) [<]
    29 --=--> return(8,3) [=]
    29 -->--> 35:  on "=" -> return(8,6) [>=];
                   on anything else -> * return(8,5) [>]

Regular Expressions

Strings and Languages
- Alphabet or character class: any finite set of symbols. Eg: {0,1} is an alphabet with two symbols, 0 and 1.
- String: a finite sequence of symbols. Eg: 001, 10101, ...

Operations with strings
- Length: |x| denotes the length of string x, the number of characters in x. ε is the empty string; |ε| = 0.
- Concatenation of x and y is denoted by x.y or xy, formed by appending string y to x. Eg: if x = abc and y = de, then x.y = abcde. εx = xε = x, where ε is the identity under concatenation.
- Exponentiation: x^i means string x repeated i times. Eg: x^1 = x, x^2 = xx, x^3 = xxx, ..., and x^0 = ε.
- Prefix: obtained by discarding 0 or more trailing symbols of x. Eg: abc, abcd, a, ... are prefixes of abcde.
- Suffix: obtained by discarding 0 or more leading symbols of x. Eg: cde, e, ... are suffixes of abcde.
- Substring: obtained by deleting a prefix and a suffix from x. Eg: cd, abc, de, abcde are substrings of abcde. Every suffix and prefix is a substring, but a substring need not be a suffix or a prefix. ε and x itself are prefixes, suffixes, and substrings of x.

Language
A language is the set of strings formed from a specific alphabet. If L and M are two languages, the possible operations are:
- Concatenation: L.M is found by selecting a string x from L and a string y from M and joining them in that order:
      LM = {xy | x is in L and y is in M},  with L∅ = ∅L = ∅.
- Exponentiation: L^i = LLL...L (i times); L^0 = {ε}, and {ε}L = L{ε} = L.

- Union: L ∪ M = {x | x is in L or x is in M}, with ∅ ∪ L = L ∪ ∅ = L.
- Closure: * denotes zero or more instances; L* = ∪ (i >= 0) L^i.
  Eg: let L = {aa}. Then L* is the set of all strings of an even number of a's:
      L^0 = {ε}, L^1 = {aa}, L^2 = {aaaa}, ...
- Positive closure: + means one or more instances. To exclude ε, take L+ = L.(L*):
      L.(L*) = L . ∪ (i >= 0) L^i = ∪ (i >= 0) L^(i+1) = ∪ (i >= 1) L^i = L+

Regular Expressions
Regular expressions are used to describe the tokens. Eg, for an identifier:
    identifier = letter (letter | digit)*
They are also used to define a language.

Regular expression construction rules:
1. ε is a regular expression denoting {ε}, the language containing only the empty string.
2. For each a in Σ, a is a regular expression denoting {a}, the language with only one string, that string consisting of the single symbol a.
3. If R and S are regular expressions denoting languages LR and LS respectively, then
   (a) R | S denotes LR ∪ LS,
   (b) RS denotes LRLS, and
   (c) R* denotes (LR)*.

A regular expression is defined in terms of primitive regular expressions (the basis) and compound regular expressions (the induction rules). So rules (1) and (2) form the basis, and rule (3) forms the induction.

Example Regular Expressions
1. a* denotes all strings of 0 or more a's.
2. aa* denotes the strings of one or more a's (a+).
3. (a|b)* is the set of all strings of a's and b's, i.e. (a*b*)*.
4. (aa|ab|ba|bb)* is the set of all strings of even length.
5. a | b | ε denotes strings of length 0 or 1.
6. (a|b)(a|b)(a|b) denotes strings of length 3, so (a|b)(a|b)(a|b)(a|b)* denotes strings of length 3 or more, and
   ε | a | b | (a|b)(a|b)(a|b)(a|b)* denotes all strings whose length is not 2.

Regular Expressions for the tokens:
    keyword    = BEGIN | END | IF | THEN | ELSE
    identifier = letter (letter | digit)*
    constant   = digit+
    relop      = < | <= | = | <> | > | >=
If two regular expressions R and S denote the same language, then R and S are equivalent, i.e. (a|b)* = (a*b*)*.

Algebraic laws with Regular Expressions
1. R|S = S|R                             (| is commutative)
2. (R|S)|T = R|(S|T)                     (| is associative)
3. R(ST) = (RS)T                         (. is associative)
4. R(S|T) = RS|RT and (S|T)R = SR|TR     (. distributes over |)
5. Rε = εR = R                           (ε is the identity for concatenation)

Finite Automata

Language Recognizer
A recognizer is a program that identifies the presence of a token in the input. It takes a string x as its input and answers "yes" if x is a sentence of L, and "no" otherwise.
How does it work? To determine whether x belongs to the language L(R), x is decomposed into a sequence of substrings denoted by the primitive subexpressions in R.
Example: Given R = (a|b)*abb, the set of all strings ending in abb, and x = aabb. Since R = R1R2 where R1 = (a|b)* and R2 = abb, it is easy to show that a is in the language of R1 (a is an element of the language denoted by (a|b)*) and that abb is in the language of R2, so x = a.abb is in L(R).

Nondeterministic Finite Automata (NFA)
An NFA is the generalized transition diagram that is derived from the regular expression.

Fig: A nondeterministic finite automaton for (a|b)*abb
(state 0 loops on a and b, then a leads to state 1, b to state 2, and b to the accepting state 3)

The nodes are called states and the labeled edges are called transitions. Edges can be labeled by ε as well as by characters. Also, the same character can label two or more transitions out of one state. An NFA has one start state and one or more final (accepting) states.

Transition table
The tabular form representing the transitions of an NFA. In the transition table, there is a row for each state and a column for each admissible input symbol and ε. The entry for row i and symbol a is the set of possible next states for state i on input a.

    State |   a   |  b
      0   | {0,1} | {0}
      1   |  --   | {2}
      2   |  --   | {3}

Fig: Transition table

The path for the input string aabb can be represented by the following sequence of moves:

    State | Remaining input
      0   | aabb
      0   | abb
      1   | bb
      2   | b
      3   | ε

The language defined by an NFA is the set of input strings it accepts.
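An NFA can be simulated directly from its transition table by tracking the set of states reachable after each input symbol; a sketch for the (a|b)*abb automaton above:

```python
# Transition table of the NFA for (a|b)*abb, keyed by (state, symbol).
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
accepting = {3}

def accepts(word):
    states = {0}                             # start in state 0
    for ch in word:
        # take every transition available from every current state
        states = set().union(*(nfa.get((s, ch), set()) for s in states))
    return bool(states & accepting)

assert accepts("aabb")        # the worked example: 0 -> 0 -> 1 -> 2 -> 3
assert not accepts("abab")    # does not end in abb
```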

NFA accepting aa* | bb*:
(an ε-transition from the start state leads to one branch that reads a and then loops on a, and to another branch that reads b and then loops on b)

Algorithm to construct an NFA from a Regular Expression

Input: A regular expression R over alphabet Σ.
Output: An NFA N accepting the language denoted by R.
Method: Decompose R into its primitive components. For each component, construct a finite automaton inductively, using the basis and induction rules below.

Finite Automata construction from regular expression

The basis and induction rules are:
1. NFA for ε:   i --ε--> f,  where i and f are a new initial state and a new final state.
2. NFA for a:   i --a--> f,  where each state is new.
Each time we need a new state, we give that state a new name. Even if a appears several times in the regular expression R, we give each instance of a a separate finite automaton with its own states.
Having constructed components for the basis regular expressions, we proceed to combine them in ways that correspond to the way compound regular expressions are formed from smaller regular expressions.

3. NFA for R1 | R2:
Let N1 and N2 be the NFAs for R1 and R2 respectively. There is an ε-transition from the new initial state i to the initial states of N1 and N2, and there is an ε-transition from the final states of N1 and N2 to the new final state f. Any path from i to f must pass through either N1 or N2 exclusively.

4. NFA for R1R2:
Let N1 and N2 be the NFAs for R1 and R2 respectively. The initial state of N2 is identified with the accepting state of N1. A path from i to f must go first through N1, then through N2.

5. NFA for R1*:
New initial and final states i and f are added around N1, connected by ε-transitions so that we can go from i to f directly along a path labeled ε (matching zero occurrences), or go through N1 one or more times.
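The five rules can be sketched as constructor functions that each return a (start, accept) pair and record their edges in a shared list. One deliberate simplification: instead of identifying the accept state of N1 with the start state of N2 as in rule 4, this sketch links them with an ε-edge, a common equivalent variant.

```python
edges, counter = [], [0]          # global edge list and state counter

def new_state():
    counter[0] += 1
    return counter[0]

def symbol(a):                    # basis: i --a--> f, both states fresh
    i, f = new_state(), new_state()
    edges.append((i, a, f))
    return i, f

def union(n1, n2):                # rule 3: new i and f with eps edges around both
    i, f = new_state(), new_state()
    for start, _ in (n1, n2):
        edges.append((i, "eps", start))
    for _, accept in (n1, n2):
        edges.append((accept, "eps", f))
    return i, f

def concat(n1, n2):               # rule 4 (variant): link with an eps edge
    edges.append((n1[1], "eps", n2[0]))
    return n1[0], n2[1]

def star(n):                      # rule 5: skip edge, loop-back edge, wrapper states
    i, f = new_state(), new_state()
    edges.extend([(i, "eps", n[0]), (n[1], "eps", f),
                  (i, "eps", f), (n[1], "eps", n[0])])
    return i, f

# (a|b)*abb built bottom-up, as in the decomposition that follows
nfa = concat(concat(concat(star(union(symbol("a"), symbol("b"))),
                           symbol("a")), symbol("b")), symbol("b"))
```

Every instance of a symbol gets its own pair of states, so the counts are predictable: five symbol NFAs plus the union and star wrappers give 14 states and 16 edges for this expression.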

Decomposition of (a|b)*abb

The expression is decomposed into its primitive components, innermost first:

    R1  = a
    R2  = b
    R3  = R1 | R2
    R4  = (R3)          (N4 is the same as N3)
    R5  = (R4)*
    R6  = a
    R7  = R5 R6
    R8  = b
    R9  = R7 R8
    R10 = b
    R11 = R9 R10

For each Ri an NFA Ni is built with the rules above; N11 is the NFA for the whole expression (a|b)*abb.

Deterministic Finite Automata (DFA)

Since in an NFA the transition function is multivalued and may involve ε-transitions, it is difficult to simulate an NFA with a computer program. A finite automaton is deterministic if
(i) it has no transitions on input ε, and
(ii) for each state s and input symbol a, there is at most one edge labeled a leaving s.
For each NFA, we can find a DFA accepting the same language.

Applying the subset construction to the NFA for (a|b)*abb (states 0 to 10):

    ε-closure({0}) = {0, 1, 2, 4, 7}                          = A
    A on a -> {3, 8}:  ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8}   = B
    A on b -> {5}:     ε-closure({5})    = {1, 2, 4, 5, 6, 7}      = C
    B on a -> {3, 8} -> B
    B on b -> {5, 9}:  ε-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9}   = D
    C on a -> {3, 8} -> B;   C on b -> {5} -> C
    D on a -> {3, 8} -> B
    D on b -> {5, 10}: ε-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E
    E on a -> {3, 8} -> B;   E on b -> {5} -> C

    State      |  a  |  b
    A (start)  |  B  |  C
    B          |  B  |  D
    C          |  B  |  C
    D          |  B  |  E
    E (accept) |  B  |  C

The resulting DFA has start state A and accepting state E.

Minimizing the number of states

States A and C behave identically on every input, so they can be merged. The minimum-state DFA is:

    State      |  a  |  b
    A (start)  |  B  |  A
    B          |  B  |  D
    D          |  B  |  E
    E (accept) |  B  |  A

Constructing DFA from NFA

Algorithm
Input: an NFA N.
Output: a DFA D accepting the same language.
Let us define the function ε-CLOSURE(s) to be the set of states of N built by applying the following rules:
1. s is added to ε-CLOSURE(s).
2. If t is in ε-CLOSURE(s), and there is an edge labeled ε from t to u, then u is added to ε-CLOSURE(s) if u is not already there. Rule 2 is repeated until no more states can be added to ε-CLOSURE(s).
Thus, ε-CLOSURE(s) is the set of states that can be reached from s on ε-transitions only. If T is a set of states, then ε-CLOSURE(T) is the union over all states s in T of ε-CLOSURE(s).

Constructing DFA from NFA

Algorithm: ε-CLOSURE(T)
    push all states in T onto stack;
    ε-closure(T) := T;
    while stack is not empty do
    begin
        pop s, the top element, off the stack;
        for each state t with an edge labeled ε from s to t do
            if t is not in ε-closure(T) then
            begin
                add t to ε-closure(T);
                push t onto stack
            end
    end
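The stack-based procedure above translates line for line into Python; here `eps` maps each state to its ε-successors, using the ε-edges of the (a|b)*abb NFA from the worked example:

```python
def e_closure(T, eps):
    """Set of NFA states reachable from any state in T on eps-transitions only."""
    closure = set(T)
    stack = list(T)                  # push all states in T onto the stack
    while stack:
        s = stack.pop()              # pop the top element
        for t in eps.get(s, ()):     # every edge labeled eps from s
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

# eps-edges of the NFA for (a|b)*abb (states 0-10)
eps = {0: [1, 2, 4, 7], 3: [6], 5: [6], 6: [1, 2, 4, 7]}
assert e_closure({0}, eps) == {0, 1, 2, 4, 7}             # state A
assert e_closure({3, 8}, eps) == {1, 2, 3, 4, 6, 7, 8}    # state B
```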

Constructing DFA from NFA

Algorithm: Subset construction
    initially, ε-CLOSURE(s0) is the only state of D, and it is unmarked;
    while there is an unmarked state x = {s1, s2, ..., sn} of D do
    begin
        mark x;
        for each input symbol a do
        begin
            let T be the set of states to which there is a transition
                on a from some state si in x;
            y := ε-CLOSURE(T);
            if y has not yet been added to the set of states of D then
                make y an unmarked state of D;
            add a transition from x to y labeled a, if not already present
        end
    end
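A Python sketch of the subset construction, with an ε-closure helper defined inline so the block is self-contained; applied to the (a|b)*abb NFA it reproduces the five DFA states A through E of the worked example:

```python
def e_closure(T, eps):
    closure, stack = set(T), list(T)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def subset_construction(start, moves, eps, alphabet):
    d_start = frozenset(e_closure({start}, eps))
    dstates, unmarked, dtran = {d_start}, [d_start], {}
    while unmarked:
        x = unmarked.pop()                       # mark x
        for a in alphabet:
            T = {t for s in x for t in moves.get((s, a), ())}
            y = frozenset(e_closure(T, eps))
            if y not in dstates:                 # y not yet a state of D
                dstates.add(y)
                unmarked.append(y)
            dtran[(x, a)] = y
    return d_start, dstates, dtran

# NFA for (a|b)*abb: eps-edges plus symbol moves
eps = {0: [1, 2, 4, 7], 3: [6], 5: [6], 6: [1, 2, 4, 7]}
moves = {(2, "a"): [3], (4, "b"): [5], (7, "a"): [8], (8, "b"): [9], (9, "b"): [10]}
start, dstates, dtran = subset_construction(0, moves, eps, "ab")
```

frozenset is used for the DFA states so they can serve as dictionary keys in the transition table dtran.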

Minimizing the number of states in a DFA

Algorithm
Input: a DFA M.
Output: a minimum-state DFA M'.
    If some states in M ignore some inputs, add transitions to a dead state.
    Let P = {accepting states, all nonaccepting states};
    let Pnew = {};
    loop: for each group G in P do
        partition G into subgroups so that s and t (in G) belong to the same
        subgroup if and only if, for each input a, states s and t have
        transitions on a to states in the same group of P;
        put those subgroups in Pnew;
    if (Pnew != P) then P := Pnew; goto loop;
    Remove any dead states and unreachable states.
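The refinement loop can be sketched in Python and checked against the 5-state DFA built earlier (states A to E, accepting state E); the loop below stops when a pass produces no new split, and it correctly merges A with C:

```python
# Transition table of the DFA for (a|b)*abb from the worked example.
dtran = {("A", "a"): "B", ("A", "b"): "C", ("B", "a"): "B", ("B", "b"): "D",
         ("C", "a"): "B", ("C", "b"): "C", ("D", "a"): "B", ("D", "b"): "E",
         ("E", "a"): "B", ("E", "b"): "C"}

def minimize(states, accepting, dtran, alphabet):
    # Initial partition: accepting states vs. all nonaccepting states.
    partition = [set(accepting), set(states) - set(accepting)]
    while True:
        def group_of(s):
            return next(i for i, g in enumerate(partition) if s in g)
        new = []
        for g in partition:
            # Two states stay together iff every input sends them
            # to the same group of the current partition.
            buckets = {}
            for s in g:
                key = tuple(group_of(dtran[(s, a)]) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new.extend(buckets.values())
        if len(new) == len(partition):     # no group was split: done
            return new
        partition = new

groups = minimize("ABCDE", {"E"}, dtran, "ab")
```

The result has four groups, with A and C merged into one, matching the minimized table above.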

NFA to DFA Example 2

For an NFA with states 0 to 8:

    ε-closure({0}) = {0, 1, 3, 7} = A
    subset(A, a) = {2, 4, 7} = B;   subset(A, b) = {8} = C
    ε-closure({2, 4, 7}) = {2, 4, 7}
    subset(B, a) = {7} = D;         subset(B, b) = {5, 8} = E
    ε-closure({8}) = {8}
    subset(C, a) = ∅;               subset(C, b) = {8} = C
    ε-closure({7}) = {7}
    subset(D, a) = {7} = D;         subset(D, b) = {8} = C

DFA states:
    A = {0, 1, 3, 7}
    B = {2, 4, 7}
    C = {8}
    D = {7}
    E = {5, 8}
    F = {6, 8}

Minimizing the Number of States of a DFA

(figure: the DFA of Example 2 before and after merging equivalent states)

A language for specifying Lexical Analyzers

A LEX source program is a specification of a lexical analyzer, consisting of a set of regular expressions together with an action for each regular expression.

The action is a piece of code which is to be executed whenever a token specified by the corresponding regular expression is recognized.

The output of LEX is a lexical analyzer program constructed from the LEX source specification.

Creating a Lexical Analyzer with Lex

    lex source program -> lex compiler -> lexical analyzer L
    input stream -> lexical analyzer L -> sequence of tokens

A LEX source program consists of two parts: auxiliary definitions and translation rules.

Auxiliary Definitions
The auxiliary definitions are statements of the form
    D1 = R1
    D2 = R2
    ...
    Dn = Rn
Eg: letter     = A | B | ... | Z
    digit      = 0 | 1 | ... | 9
    identifier = letter (letter | digit)*

Translation Rules
The translation rules of a LEX program are statements of the form
    P1  {A1}
    P2  {A2}
    ...
    Pm  {Am}
where each Pi is a regular expression called a pattern and each Ai is a program fragment. The pattern describes the form of the tokens; the program fragment describes what action the lexical analyzer should take when a token matching Pi is found.
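The pattern/action idea can be sketched in Python: each rule pairs a regular expression with a callable action, mirroring the Pi {Ai} form. (This simplified scanner takes the first matching rule rather than the longest match that real Lex uses, so the keyword rules must come before the identifier rule.)

```python
import re

# Each rule is (pattern, action); the action receives the matched lexeme.
rules = [
    (r"BEGIN", lambda m: 1),
    (r"END",   lambda m: 2),
    (r"[A-Za-z][A-Za-z0-9]*", lambda m: ("id", m)),
    (r"[0-9]+",               lambda m: ("const", int(m))),
]

def scan(text):
    out, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():          # skip white space between tokens
            pos += 1
            continue
        for pattern, action in rules:    # first matching rule wins
            m = re.match(pattern, text[pos:])
            if m:
                out.append(action(m.group()))
                pos += len(m.group())
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return out

print(scan("BEGIN x1 42 END"))
```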

AUXILIARY DEFINITIONS
    letter = A | B | ... | Z
    digit  = 0 | 1 | ... | 9

TRANSLATION RULES
    BEGIN                   {return 1}
    END                     {return 2}
    IF                      {return 3}
    THEN                    {return 4}
    ELSE                    {return 5}
    letter(letter|digit)*   {LEXVAL := INSTALL(); return 6}
    digit+                  {LEXVAL := INSTALL(); return 7}
    <                       {LEXVAL := 1; return 8}
    <=                      {LEXVAL := 2; return 8}
    =                       {LEXVAL := 3; return 8}
    <>                      {LEXVAL := 4; return 8}
    >                       {LEXVAL := 5; return 8}
    >=                      {LEXVAL := 6; return 8}

Regular Expressions in Lex

    x        match the character x
    \.       match the character .
    "string" match the contents of the string of characters literally
    .        match any character except newline
    ^        match the beginning of a line
    $        match the end of a line
    [xyz]    match one character: x, y, or z (use \ to escape -)
    [^xyz]   match any character except x, y, and z
    [a-z]    match one of a to z
    r*       closure (match zero or more occurrences)
    r+       positive closure (match one or more occurrences)
    r?       optional (match zero or one occurrence)
    r1r2     match r1 then r2 (concatenation)
    r1|r2    match r1 or r2 (union)
    (r)      grouping
    r1/r2    match r1 when followed by r2
    {d}      match the regular expression defined by d
