
MODULE III

Introduction to compiling: Compilers, analysis of a source program, the phases of a compiler.
Lexical Analysis: The role of the lexical analyzer, input buffering, specification of tokens, recognition of tokens, finite automata, conversion of an NFA to a DFA, from a regular expression to an NFA.

COMPILERS

Introduction to Compilers

Translator
A translator is a program that takes a program written in one programming language as input and produces a program in another language as output. If the source language is a high-level language and the object language is a low-level language, then such a translator is called a compiler.

    Source Program -> Compiler -> Object Program

Analysis of Source Program

The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them. It then uses this structure to create an intermediate representation of the source program.

If the analysis part detects any error, it must provide informative messages so the user can take corrective action.

The analysis part also collects information about the source program and stores it in a data structure called the symbol table (SYMTAB).

The synthesis part constructs the desired target program from the intermediate representation and the information in the SYMTAB.

The analysis part is often called the front end, and the synthesis part is called the back end.

The phases form a pipeline; the symbol table is shared by all of them:

    source program
      -> Lexical Analyzer          -> token stream
      -> Syntax Analyzer           -> syntax tree
      -> Semantic Analyzer         -> syntax tree
      -> Intermediate Code Generator -> intermediate representation
      -> Machine-Independent Code Optimizer -> intermediate representation
      -> Code Generator            -> target machine code
      -> Machine-Dependent Code Optimizer -> target machine code

Phases of a compiler

Lexical Analysis (Scanning)
- The first phase of a compiler.
- The lexical analyzer reads the stream of characters from the source program and groups the characters into meaningful sequences called lexemes.
- For each lexeme, the lexical analyzer produces a token as output of the form
      (token-name, attribute-value)
  where token-name is an abstract symbol that is used during syntax analysis, and attribute-value points to an entry in the symbol table for this token.

Eg. position = initial + rate * 60

The lexemes and tokens are:
1) position is a lexeme that would be mapped into a token <id,1>, where id is an abstract symbol for identifier and 1 points to the SYMTAB entry for position.
2) = is a lexeme that is mapped into the token <=>. Since this token needs no attribute value, we have omitted the second component.
3) initial - <id,2>
4) +       - <+>
5) rate    - <id,3>
6) *       - <*>
7) 60      - <60>
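The mapping above can be sketched in Python; a minimal illustration (not the book's code), where identifiers are numbered by their symbol-table entry and operators carry no attribute:

```python
# Minimal sketch: split the statement into lexemes, then emit
# (token-name, attribute) pairs as described above.
import re

def tokenize(source):
    symtab = []              # symbol table: list of identifier lexemes
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|[=+*]", source):
        if lexeme[0].isalpha() or lexeme[0] == "_":
            if lexeme not in symtab:
                symtab.append(lexeme)
            tokens.append(("id", symtab.index(lexeme) + 1))
        elif lexeme.isdigit():
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((lexeme, None))   # operators need no attribute
    return tokens, symtab

tokens, symtab = tokenize("position = initial + rate * 60")
print(tokens)
# -> [('id', 1), ('=', None), ('id', 2), ('+', None), ('id', 3), ('*', None), ('number', 60)]
```

The symbol table ends up holding position, initial and rate in entries 1, 2 and 3, matching the lexeme list above.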

Syntax Analysis (Parsing)

The second phase of the compiler. The parser uses the first components of the tokens produced by the lexical analyzer to create syntax trees.

The syntax tree for the above example is:

            =
           / \
      <id,1>  +
             / \
        <id,2>  *
               / \
          <id,3>  60

Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the SYMTAB to check the source program for semantic consistency with the language definition.
It also gathers type information and saves it in either the syntax tree or the SYMTAB, for subsequent use during intermediate code generation.
An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands (e.g., the compiler must report an error if a float value is used as an array index).
Eg. Suppose position, initial and rate are float numbers, while the lexeme 60 is an integer. The type checker in the semantic analyzer discovers that the operator * is applied to the float rate and the int 60, so the int 60 is converted to float.

Intermediate Code Generation

In the process of translation from source to target code, the compiler may construct one or more intermediate representations.

This intermediate representation should be (a) easy to produce and (b) easy to translate into the target machine code.

Eg. t1 = inttofloat(60)
    t2 = id3 * t1
    t3 = id2 + t2
    id1 = t3
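The three-address sequence above can be produced by a postorder walk of the syntax tree that allocates a fresh temporary per interior node. A hedged sketch (the tree shape and helper names are illustrative, not the book's code):

```python
def gen(node, code, counter):
    """Emit three-address code for an expression tree; return the place holding its value."""
    if isinstance(node, str):                 # leaf: an id or a literal
        return node
    if node[0] == "inttofloat":               # unary conversion inserted by the type checker
        arg = gen(node[1], code, counter)
        counter[0] += 1
        t = f"t{counter[0]}"
        code.append(f"{t} = inttofloat({arg})")
        return t
    op, left, right = node
    l = gen(left, code, counter)
    r = gen(right, code, counter)
    counter[0] += 1
    t = f"t{counter[0]}"
    code.append(f"{t} = {l} {op} {r}")
    return t

# the type-checked tree for position = initial + rate * 60
tree = ("+", "id2", ("*", "id3", ("inttofloat", "60")))
code, counter = [], [0]
code.append(f"id1 = {gen(tree, code, counter)}")
```

Running this reproduces exactly the four instructions listed above, t1 through t3 followed by the assignment to id1.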

Code Optimization

The machine-independent code optimization phase attempts to improve the intermediate code so that better target code will result.

A simple intermediate code generation algorithm followed by code optimization is a reasonable way to generate good target code.

The optimizer can deduce that the conversion of 60 from int to float can be done once, at compile time. So the inttofloat operation can be eliminated by replacing the int 60 by the float 60.0.

Eg. t1 = id3 * 60.0
    id1 = id2 + t1

Code Generation

The code generator takes as input an intermediate representation of the source program and maps it to the target language.

If the target language is machine code, registers or memory locations are selected for each of the variables used by the program. Then the intermediate instructions are translated into sequences of machine instructions.

Eg. LDF  R2, id3
    MULF R2, R2, #60.0
    LDF  R1, id2
    ADDF R1, R1, R2
    STF  id1, R1

SYMBOL TABLE MANAGEMENT

An essential function of a compiler is to record the variable names used in the source program and collect information about the various attributes of each name.
This data structure should be designed to allow the compiler to find the record for each name quickly, and to store or retrieve data from that record quickly.

Translation of an assignment statement

position = initial + rate * 60

Lexical Analyzer:
    <id,1> = <id,2> + <id,3> * <60>

Syntax Analyzer:
            =
           / \
      <id,1>  +
             / \
        <id,2>  *
               / \
          <id,3>  60

Semantic Analyzer:
            =
           / \
      <id,1>  +
             / \
        <id,2>  *
               / \
          <id,3>  inttofloat(60)

Intermediate code generator:
    t1 = inttofloat(60)
    t2 = id3 * t1
    t3 = id2 + t2
    id1 = t3

Code optimizer:
    t1 = id3 * 60.0
    id1 = id2 + t1

Code Generator:
    LDF  R2, id3
    MULF R2, R2, #60.0
    LDF  R1, id2
    ADDF R1, R1, R2
    STF  id1, R1

Role of Lexical Analyzer

The main task of the lexical analyzer is to read the input characters, group them into lexemes, and produce tokens. The stream of tokens is sent to the parser for syntax analysis.

    source program -> Lexical Analyzer --token--> Parser -> to semantic analysis

The parser requests each token with getNextToken; both the lexical analyzer and the parser consult the symbol table.

Tasks (Role) of Lexical Analyzer

- Identification of lexemes.
- Removal of comments and white space (blank, newline, tab, etc.).
- Correlating error messages generated by the compiler with the source program.

Lexical analysis is sometimes divided into two processes:
a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive white-space characters into one.
b) Lexical analysis proper is the more complex process, where the scanner produces the sequence of tokens as output.

Tokens, Patterns and Lexemes

- A token is a pair with a token name and an optional attribute value.
- A pattern is a description of the form that the lexemes of a token may take.
- A lexeme is a sequence of characters in the source program that matches the pattern for a token.

INPUT BUFFERING

Specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character.

Two buffers are alternately reloaded. Each buffer is of the same size N, where N is the size of a disk block.

For example, while scanning the input E = M * ..., the pointer lexemeBegin marks the start of the current lexeme, forward scans ahead, and eof marks the end of the input held in the buffers.

Input Buffering
Two pointers are required:
- lexemeBegin marks the beginning of the current lexeme.
- forward scans ahead until a pattern match is found.
Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.

Sentinels
Sentinels are used to mark the end of input. The natural choice is the character eof.
A sentinel eof is placed at the end of each buffer, so the test for the end of a buffer and the test for the character itself are combined. Any eof that appears other than at the end of a buffer means that the input is at an end.

switch (*forward++) {
case eof:
    if (forward is at the end of the first buffer) {
        reload the second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at the end of the second buffer) {
        reload the first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}

SPECIFICATION OF TOKENS

Strings and Languages
- An alphabet is a finite set of symbols. The set {0,1} is the binary alphabet.
- A string over an alphabet is a finite sequence of symbols drawn from that alphabet. |s| denotes the length of a string s; e.g. banana is a string of length 6.
- A language is any countable set of strings over some fixed alphabet. This includes abstract languages such as ∅, the empty set, and {ε}, the set containing only the empty string.
- The empty string ε is the identity under concatenation; that is, for any string s, εs = sε = s.
- Exponentiation of strings: s^0 is ε, and for all i > 0, s^i is s^(i-1)s. Since εs = s, s^1 = s, s^2 = ss, s^3 = sss, and so on.

Operations on Languages

OPERATION                                   DEFINITION
union of L and M, written L ∪ M             L ∪ M = {s | s is in L or s is in M}
concatenation of L and M, written LM        LM = {st | s is in L and t is in M}
Kleene closure of L, written L*             L* = ∪ (i >= 0) L^i
                                            (zero or more concatenations of L)
positive closure of L, written L+           L+ = ∪ (i >= 1) L^i
                                            (one or more concatenations of L)

Operations on Languages (contd.)

Example:
Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits. Other languages constructed from L and D are:

1. L ∪ D is the set of letters and digits - strictly speaking, the language with 62 (52 + 10) strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 (52 x 10) strings of length two, each consisting of one letter followed by one digit. Ex: A1, a1, B0, etc.
3. L^4 is the set of all 4-letter strings (ex: aaba, bcef).
4. L* is the set of all strings of letters, including ε.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
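These operations can be tried out on small finite languages represented as Python sets; a sketch using two-symbol stand-ins for L and D so the results stay small:

```python
# Stand-ins for the alphabets above (truncated to keep the sets printable).
L = {"a", "b"}        # plays the role of the 52 letters
D = {"0", "1"}        # plays the role of the 10 digits

union = L | D                                 # L ∪ D
concat = {s + t for s in L for t in D}        # LD: every letter followed by every digit

# exponentiation: X^n, with X^0 = {""} (the empty string)
power = lambda X, n: {""} if n == 0 else {s + t for s in power(X, n - 1) for t in X}

assert concat == {"a0", "a1", "b0", "b1"}     # |L| * |D| strings of length two
assert len(power(L, 4)) == 2 ** 4             # analogue of L^4: all 4-symbol strings
```

With the full 52-letter L and 10-digit D, the same expressions give the counts in the text: 62 for the union and 520 for LD.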

Regular Expression & Regular Language

- A regular expression is a notation that allows us to define a pattern in a high-level way.
- Each regular expression r denotes a language L(r), the set of strings matching the regular expression r.
- Note: each word (token pattern) in a program can be expressed by a regular expression.

Eg. Suppose we want to describe the set of valid C identifiers. If letter_ stands for any letter or the underscore, and digit stands for any digit, then we would describe the language of C identifiers by:
    letter_ (letter_ | digit)*
- | means union.
- ( ) are used to group subexpressions.
- * means zero or more occurrences of the preceding expression.
- The juxtaposition of letter_ with the remainder of the expression signifies concatenation.
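The same pattern can be written with Python's re module; a small sketch using the character classes [A-Za-z_] for letter_ and [0-9] for digit:

```python
import re

# letter_ (letter_ | digit)*  anchored so the whole string must match
c_identifier = re.compile(r"[A-Za-z_][A-Za-z_0-9]*\Z")

assert c_identifier.match("rate")
assert c_identifier.match("_tmp1")
assert not c_identifier.match("60rate")   # may not begin with a digit
```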

Rules for constructing regular expressions

Regular expressions are built recursively out of smaller regular expressions using the following rules.

BASIS:
1. ε is a regular expression denoting {ε}, the language containing only the empty string: L(ε) = {ε}.
2. If a is a symbol in the alphabet Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position. (We use italics for symbols and boldface for their corresponding regular expressions.)

INDUCTION:
Let r and s be regular expressions with languages L(r) and L(s). Then
a) (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting the language (L(r))*.
d) (r) is a regular expression denoting the language L(r).

Precedence
- * has the highest precedence.
- Concatenation has the second highest precedence.
- | has the lowest precedence.
Eg. (a) | ((b)*(c)) may be replaced by a | b*c.

Algebraic laws of Regular Expressions

AXIOM                           DESCRIPTION
r | s = s | r                   | is commutative
r | (s | t) = (r | s) | t       | is associative
(rs)t = r(st)                   concatenation is associative
r(s|t) = rs | rt                concatenation distributes over |
(s|t)r = sr | tr
εr = rε = r                     ε is the identity element for concatenation
r* = (r | ε)*                   relation between * and ε
r** = r*                        * is idempotent

Regular Definitions
We can give names to certain regular expressions and use those names in subsequent expressions:
    d1 -> r1
    d2 -> r2
    ...
    dn -> rn

E.g. C identifiers are strings of letters, digits and underscores:
    letter_ -> A | B | ... | Z | a | b | ... | z | _
    digit   -> 0 | 1 | 2 | ... | 9
    id      -> letter_ (letter_ | digit)*
This can also be written with character classes as:
    letter_ -> [A-Za-z_]
    digit   -> [0-9]
    id      -> letter_ (letter_ | digit)*
We shall conventionally use italics for the symbols defined in regular definitions.

Recognition of tokens

In this topic we study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.

Consider the following example, a grammar for branching statements:

    stmt -> if expr then stmt
          | if expr then stmt else stmt
          | ε
    expr -> term relop term
          | term
    term -> id
          | number

For relop, we use the comparison operators. The patterns for the tokens are:

    digit  -> [0-9]
    digits -> digit+
    number -> digits (. digits)? (E [+-]? digits)?
    letter -> [A-Za-z]
    id     -> letter (letter | digit)*
    if     -> if
    then   -> then
    else   -> else
    relop  -> < | > | <= | >= | = | <>

The pattern for white space is:
    ws -> (blank | tab | newline)+
Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the white space.
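The token patterns above (ws, number, keywords, id, relop) can be sketched as a Python tokenizer built from one combined regular expression; an illustrative sketch, not production scanner code:

```python
import re

# One named group per token pattern; ws is matched but never returned,
# and the keyword patterns are listed before id so they win.
token_spec = [
    ("ws",     r"[ \t\n]+"),
    ("number", r"\d+(?:\.\d+)?(?:E[+-]?\d+)?"),
    ("if",     r"if\b"), ("then", r"then\b"), ("else", r"else\b"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("relop",  r"<=|>=|<>|<|>|="),
]
master = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in token_spec))

def tokens(text):
    for m in master.finditer(text):
        if m.lastgroup != "ws":        # discard white space, restart the scan
            yield (m.lastgroup, m.group())

print(list(tokens("if x1 <= 60 then y")))
```

The \b word boundaries keep a keyword from matching inside a longer identifier such as iffy, playing the role of the reserved-word check discussed later.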

LEXEMES        TOKEN NAME    ATTRIBUTE VALUE
any ws         -             -
if             if            -
then           then          -
else           else          -
any id         id            pointer to table entry
any number     number        pointer to table entry
<              relop         LT
<=             relop         LE
=              relop         EQ
<>             relop         NE
>              relop         GT
>=             relop         GE

Tokens, their patterns, and attribute values

Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first convert patterns into "transition diagrams".
Transition diagrams have a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.

Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols.
All our transition diagrams are deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.

Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found. (We always indicate an accepting state by a double circle, and if there is an action to be taken, typically returning a token and an attribute value to the parser, we shall attach that action to the accepting state.)
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start", entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.

Transition diagram for relop

We begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <>, or <=. Therefore we go to state 1 and look at the next character.
If it is =, then we recognize the lexeme <=, enter state 2, and return the token relop with attribute LE, the symbolic constant representing this particular comparison operator.
If in state 1 the next character is >, then instead we have the lexeme <>, and we enter state 3 to return an indication that the not-equals operator has been found.
On any other character, the lexeme is < by itself, and we enter state 4 to return that information. State 4 has a * to indicate that we must retract the input one position.
If in state 0 we see any character besides <, =, or >, we cannot possibly be seeing a relop lexeme, so this transition diagram will not be used.
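The relop diagram can be transcribed directly into code; a sketch in Python where the function returns the token plus the number of characters consumed, and the starred (retracting) states simply consume one character fewer:

```python
def relop(s):
    """Run the relop transition diagram on the front of s.
    Returns ((token, attribute), chars_consumed) or (None, 0)."""
    if not s:
        return None, 0
    if s[0] == "<":                         # state 1
        if len(s) > 1 and s[1] == "=":
            return ("relop", "LE"), 2       # state 2
        if len(s) > 1 and s[1] == ">":
            return ("relop", "NE"), 2       # state 3
        return ("relop", "LT"), 1           # state 4*: retract one position
    if s[0] == "=":
        return ("relop", "EQ"), 1
    if s[0] == ">":
        if len(s) > 1 and s[1] == "=":
            return ("relop", "GE"), 2
        return ("relop", "GT"), 1           # starred state: retract
    return None, 0                          # diagram not used for this input
```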

Recognition of Reserved Words and Identifiers

Usually, keywords like if or then are reserved, so they are not identifiers even though they look like identifiers.

Transition diagram for id's and keywords:

            letter or digit (loop at 10)
    start --letter--> (10) --other--> (11)*   return(getToken(), installID())

There are two ways that we can handle reserved words that look like identifiers:

1) Install the reserved words in the symbol table initially.
When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. Any identifier not in the symbol table before lexical analysis cannot be a reserved word, so its token is id.

The function getToken examines the symbol-table entry for the lexeme found, and returns whatever token name the symbol table says this lexeme represents: either id or one of the keyword tokens that was initially installed in the table.
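Approach (1) can be sketched in a few lines of Python; the dictionary stands in for the symbol table, and the lexeme itself stands in for a table pointer (both simplifications for illustration):

```python
# Symbol table pre-loaded with the reserved words, each mapped to its own token name.
symtab = {"if": "if", "then": "then", "else": "else"}

def installID(lexeme):
    symtab.setdefault(lexeme, "id")    # added as an ordinary id only if not already there
    return lexeme                      # stands in for a pointer to the table entry

def getToken(lexeme):
    return symtab[lexeme]              # whatever token name the table records

assert getToken(installID("if")) == "if"      # reserved word keeps its keyword token
assert getToken(installID("rate")) == "id"    # a new name becomes an identifier
```

Because setdefault never overwrites an existing entry, installing the lexeme "if" cannot demote the keyword to an ordinary identifier.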

2) Create separate transition diagrams for each keyword.
For example, the transition diagram for then spells out t-h-e-n state by state and accepts on a following nonletter/nondigit, with a * to retract the input.

There is also a transition diagram for unsigned numbers, which follows the pattern digits (. digits)? (E [+-]? digits)?.

A transition diagram for whitespace loops on delim. Here we look for one or more white-space characters, represented by delim; these characters would be blank, tab, newline, etc.
In the accepting state (state 24), we have found a block of consecutive white-space characters followed by a non-white-space character. We retract the input to begin at the non-white-space character, but we do not return to the parser.

Design of Lexical Analyzer

The initial step is to form flowcharts for the valid possible tokens. Flowcharts for a lexical analyzer are known as transition diagrams. Their components are:
- States, represented by circles.
- Edges, the arrows connecting the states. The labels on the edges indicate the input characters that can appear after that state.

Transition diagram for identifier:

                letter or digit (loop at 1)
    start -> 0 --letter--> 1 --delimiter--> 2*

Fig: Transition diagram for identifier

The next step is to produce code for each of the states.

The code for State 0:
    State 0: C := GETCHAR();
             if LETTER(C) then goto State 1
             else FAIL()
Here LETTER is a Boolean-valued function that returns true if C is a letter. FAIL is a routine which retracts the lookahead pointer and starts up the next transition diagram, or calls the error routine.

The code for State 1:
    State 1: C := GETCHAR();
             if LETTER(C) or DIGIT(C) then goto State 1
             else if DELIMITER(C) then goto State 2
             else FAIL()
Here DIGIT is a Boolean-valued function that returns true if C is one of the digits 0, 1, ..., 9. DELIMITER is a function which returns true whenever C is a character that could follow an identifier.

The code for State 2:
    State 2: RETRACT();
             return (id, INSTALL())
State 2 indicates that an identifier has been found. Since the delimiter is not part of the token found, the procedure RETRACT will move the lookahead pointer one character back; the * marks states on which input retraction must take place.
The INSTALL() procedure will install the identifier into the symbol table if it is not already there.
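The three states above can be transcribed into Python as a sketch; FAIL is modeled by returning None, and RETRACT by simply not consuming the delimiter:

```python
def recognize_identifier(text):
    """Run states 0-2 of the identifier diagram on the front of text."""
    pos = 0
    # State 0: the first character must be a letter, otherwise FAIL().
    if pos >= len(text) or not text[pos].isalpha():
        return None
    pos += 1
    # State 1: absorb letters and digits until a delimiter appears.
    while pos < len(text) and text[pos].isalnum():
        pos += 1
    # State 2: RETRACT() -- the delimiter is not part of the lexeme.
    return ("id", text[:pos])

assert recognize_identifier("count1 = 0") == ("id", "count1")
assert recognize_identifier("9lives") is None    # digits cannot start an identifier
```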

Token        Code    Value
begin        1       ---
end          2       ---
if           3       ---
then         4       ---
else         5       ---
identifier   6       pointer to symbol table
constant     7       pointer to symbol table
<            8       1
<=           8       2
=            8       3
<>           8       4
>            8       5
>=           8       6

Fig: Tokens recognized

Keywords:

The keyword diagrams spell out each reserved word character by character and accept on a blank or newline, retracting the input (*):

    B-E-G-I-N  --blank/newline-->  *  return(1, )
    E-N-D      --blank/newline-->  *  return(2, )
    I-F        --blank/newline-->  *  return(3, )
    T-H-E-N    --blank/newline-->  *  return(4, )
    E-L-S-E    --blank/newline-->  *  return(5, )

Identifier:

                 letter or digit (loop at 24)
    start -> 23 --letter--> 24 --not letter or digit--> 25*   return(6, INSTALL())

Constant:

                 digit (loop at 27)
    start -> 26 --digit--> 27 --not digit--> 28*   return(7, INSTALL())

Relops:

    start -> 29
    29 --<--> 30:  on "=" -> return(8,2) [<=];  on ">" -> return(8,4) [<>];
                   on anything else -> * return(8,1) [<]
    29 --=--> return(8,3) [=]
    29 -->--> 35:  on "=" -> return(8,6) [>=];
                   on anything else -> * return(8,5) [>]

Regular Expressions

Strings and Languages
- Alphabet or character class: any finite set of symbols. Eg: {0,1} is an alphabet with two symbols, 0 and 1.
- String: a finite sequence of symbols. Eg: 001, 10101, ...

Operations with strings
- Length: |x| denotes the length of string x, the number of characters in x. ε is the empty string; |ε| = 0.
- Concatenation of x and y is denoted by x.y or xy, formed by appending string y to x. Eg: if x = abc and y = de, then x.y = abcde. εx = xε = x, where ε is the identity under concatenation.
- Exponentiation: x^i means string x repeated i times. Eg: x^1 = x, x^2 = xx, x^3 = xxx, ..., and x^0 = ε.
- Prefix: obtained by discarding 0 or more trailing symbols of x. Eg: abc, abcd, a, ... are prefixes of abcde.
- Suffix: obtained by discarding 0 or more leading symbols of x. Eg: cde, e, ... are suffixes of abcde.
- Substring: obtained by deleting a prefix and a suffix from x. Eg: cd, abc, de, abcde are substrings of abcde. Every suffix and prefix is a substring, but a substring need not be a suffix or a prefix. ε and x itself are prefixes, suffixes, and substrings of x.

Language
A language is the set of strings formed from a specific alphabet. If L and M are two languages, the possible operations are:
- Concatenation: L.M is found by selecting a string x from L and a string y from M and joining them in that order:
      LM = {xy | x is in L and y is in M},  with L∅ = ∅L = ∅.
- Exponentiation: L^i = LLL...L (i times); L^0 = {ε}, and {ε}L = L{ε} = L.

- Union: L ∪ M = {x | x is in L or x is in M}, with ∅ ∪ L = L ∪ ∅ = L.
- Closure: * denotes zero or more instances; L* = ∪ (i >= 0) L^i.
  Eg: let L = {aa}. Then L* is the set of all strings of an even number of a's:
      L^0 = {ε}, L^1 = {aa}, L^2 = {aaaa}, ...
- Positive closure: + means one or more instances. To exclude ε, take L+ = L.(L*):
      L.(L*) = L . ∪ (i >= 0) L^i = ∪ (i >= 0) L^(i+1) = ∪ (i >= 1) L^i = L+

Regular Expressions
Regular expressions are used to describe the tokens. Eg, for an identifier:
    identifier = letter (letter | digit)*
They are also used to define a language.

Regular expression construction rules:
1. ε is a regular expression denoting {ε}, the language containing only the empty string.
2. For each a in Σ, a is a regular expression denoting {a}, the language with only one string, that string consisting of the single symbol a.
3. If R and S are regular expressions denoting languages LR and LS respectively, then
   (a) R | S denotes LR ∪ LS,
   (b) RS denotes LRLS, and
   (c) R* denotes (LR)*.

A regular expression is defined in terms of primitive regular expressions (the basis) and compound regular expressions (the induction rules). So rules (1) and (2) form the basis, and rule (3) forms the induction.

Example Regular Expressions
1. a* denotes all strings of 0 or more a's.
2. aa* denotes the strings of one or more a's (a+).
3. (a|b)* is the set of all strings of a's and b's, i.e. (a*b*)*.
4. (aa|ab|ba|bb)* is the set of all strings of even length.
5. a | b | ε denotes strings of length 0 or 1.
6. (a|b)(a|b)(a|b) denotes strings of length 3, so (a|b)(a|b)(a|b)(a|b)* denotes strings of length 3 or more, and
   ε | a | b | (a|b)(a|b)(a|b)(a|b)* denotes all strings whose length is not 2.

Regular Expressions for the tokens:
    keyword    = BEGIN | END | IF | THEN | ELSE
    identifier = letter (letter | digit)*
    constant   = digit+
    relop      = < | <= | = | <> | > | >=
If two regular expressions R and S denote the same language, then R and S are equivalent, i.e. (a|b)* = (a*b*)*.

Algebraic laws with Regular Expressions
1. R|S = S|R                             (| is commutative)
2. (R|S)|T = R|(S|T)                     (| is associative)
3. R(ST) = (RS)T                         (. is associative)
4. R(S|T) = RS|RT and (S|T)R = SR|TR     (. distributes over |)
5. Rε = εR = R                           (ε is the identity for concatenation)

Finite Automata

Language Recognizer
A recognizer is a program that identifies the presence of a token in the input. It takes a string x as its input and answers "yes" if x is a sentence of L, and "no" otherwise.
How does it work? To determine whether x belongs to the language L(R), x is decomposed into a sequence of substrings denoted by the primitive subexpressions in R.
Example: Given R = (a|b)*abb, the set of all strings ending in abb, and x = aabb. Since R = R1R2 where R1 = (a|b)* and R2 = abb, it is easy to show that a is in the language of R1 (a is an element of the language denoted by (a|b)*) and that abb is in the language of R2, so x = a.abb is in L(R).

Nondeterministic Finite Automata (NFA)
An NFA is the generalized transition diagram that is derived from the regular expression.

Fig: A nondeterministic finite automaton for (a|b)*abb
(state 0 loops on a and b, then a leads to state 1, b to state 2, and b to the accepting state 3)

The nodes are called states and the labeled edges are called transitions. Edges can be labeled by ε as well as by characters. Also, the same character can label two or more transitions out of one state. An NFA has one start state and one or more final (accepting) states.

Transition table
The tabular form representing the transitions of an NFA. In the transition table, there is a row for each state and a column for each admissible input symbol and ε. The entry for row i and symbol a is the set of possible next states for state i on input a.

    State |   a   |  b
      0   | {0,1} | {0}
      1   |  --   | {2}
      2   |  --   | {3}

Fig: Transition table

The path for the input string aabb can be represented by the following sequence of moves:

    State | Remaining input
      0   | aabb
      0   | abb
      1   | bb
      2   | b
      3   | ε

The language defined by an NFA is the set of input strings it accepts.
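An NFA can be simulated directly from its transition table by tracking the set of states reachable after each input symbol; a sketch for the (a|b)*abb automaton above:

```python
# Transition table of the NFA for (a|b)*abb, keyed by (state, symbol).
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
accepting = {3}

def accepts(word):
    states = {0}                             # start in state 0
    for ch in word:
        # take every transition available from every current state
        states = set().union(*(nfa.get((s, ch), set()) for s in states))
    return bool(states & accepting)

assert accepts("aabb")        # the worked example: 0 -> 0 -> 1 -> 2 -> 3
assert not accepts("abab")    # does not end in abb
```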

NFA accepting aa* | bb*:
(an ε-transition from the start state leads to one branch that reads a and then loops on a, and to another branch that reads b and then loops on b)

Algorithm to construct an NFA from a Regular Expression

Input: A regular expression R over alphabet Σ.
Output: An NFA N accepting the language denoted by R.
Method: Decompose R into its primitive components. For each component, construct a finite automaton inductively, using the basis and induction rules below.

Finite Automata construction from regular expression

The basis and induction rules are:
1. NFA for ε:   i --ε--> f,  where i and f are a new initial state and a new final state.
2. NFA for a:   i --a--> f,  where each state is new.
Each time we need a new state, we give that state a new name. Even if a appears several times in the regular expression R, we give each instance of a a separate finite automaton with its own states.
Having constructed components for the basis regular expressions, we proceed to combine them in ways that correspond to the way compound regular expressions are formed from smaller regular expressions.

3. NFA for R1 | R2:
Let N1 and N2 be the NFAs for R1 and R2 respectively. There is an ε-transition from the new initial state i to the initial states of N1 and N2, and there is an ε-transition from the final states of N1 and N2 to the new final state f. Any path from i to f must pass through either N1 or N2 exclusively.

4. NFA for R1R2:
Let N1 and N2 be the NFAs for R1 and R2 respectively. The initial state of N2 is identified with the accepting state of N1. A path from i to f must go first through N1, then through N2.

5. NFA for R1*:
New initial and final states i and f are added around N1, connected by ε-transitions so that we can go from i to f directly along a path labeled ε (matching zero occurrences), or go through N1 one or more times.
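The five rules can be sketched as constructor functions that each return a (start, accept) pair and record their edges in a shared list. One deliberate simplification: instead of identifying the accept state of N1 with the start state of N2 as in rule 4, this sketch links them with an ε-edge, a common equivalent variant.

```python
edges, counter = [], [0]          # global edge list and state counter

def new_state():
    counter[0] += 1
    return counter[0]

def symbol(a):                    # basis: i --a--> f, both states fresh
    i, f = new_state(), new_state()
    edges.append((i, a, f))
    return i, f

def union(n1, n2):                # rule 3: new i and f with eps edges around both
    i, f = new_state(), new_state()
    for start, _ in (n1, n2):
        edges.append((i, "eps", start))
    for _, accept in (n1, n2):
        edges.append((accept, "eps", f))
    return i, f

def concat(n1, n2):               # rule 4 (variant): link with an eps edge
    edges.append((n1[1], "eps", n2[0]))
    return n1[0], n2[1]

def star(n):                      # rule 5: skip edge, loop-back edge, wrapper states
    i, f = new_state(), new_state()
    edges.extend([(i, "eps", n[0]), (n[1], "eps", f),
                  (i, "eps", f), (n[1], "eps", n[0])])
    return i, f

# (a|b)*abb built bottom-up, as in the decomposition that follows
nfa = concat(concat(concat(star(union(symbol("a"), symbol("b"))),
                           symbol("a")), symbol("b")), symbol("b"))
```

Every instance of a symbol gets its own pair of states, so the counts are predictable: five symbol NFAs plus the union and star wrappers give 14 states and 16 edges for this expression.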

Decomposition of (a|b)*abb

The expression is decomposed into its primitive components, innermost first:

    R1  = a
    R2  = b
    R3  = R1 | R2
    R4  = (R3)          (N4 is the same as N3)
    R5  = (R4)*
    R6  = a
    R7  = R5 R6
    R8  = b
    R9  = R7 R8
    R10 = b
    R11 = R9 R10

For each Ri an NFA Ni is built with the rules above; N11 is the NFA for the whole expression (a|b)*abb.

Deterministic Finite Automata (DFA)

Since in an NFA the transition function is multivalued and may involve ε-transitions, it is difficult to simulate an NFA with a computer program. A finite automaton is deterministic if
(i) it has no transitions on input ε, and
(ii) for each state s and input symbol a, there is at most one edge labeled a leaving s.
For each NFA, we can find a DFA accepting the same language.

Applying the subset construction to the NFA for (a|b)*abb (states 0 to 10):

    ε-closure({0}) = {0, 1, 2, 4, 7}                          = A
    A on a -> {3, 8}:  ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8}   = B
    A on b -> {5}:     ε-closure({5})    = {1, 2, 4, 5, 6, 7}      = C
    B on a -> {3, 8} -> B
    B on b -> {5, 9}:  ε-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9}   = D
    C on a -> {3, 8} -> B;   C on b -> {5} -> C
    D on a -> {3, 8} -> B
    D on b -> {5, 10}: ε-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E
    E on a -> {3, 8} -> B;   E on b -> {5} -> C

    State      |  a  |  b
    A (start)  |  B  |  C
    B          |  B  |  D
    C          |  B  |  C
    D          |  B  |  E
    E (accept) |  B  |  C

The resulting DFA has start state A and accepting state E.

Minimizing the number of states

States A and C behave identically on every input, so they can be merged. The minimum-state DFA is:

    State      |  a  |  b
    A (start)  |  B  |  A
    B          |  B  |  D
    D          |  B  |  E
    E (accept) |  B  |  A

Constructing DFA from NFA

Algorithm
Input: an NFA N.
Output: a DFA D accepting the same language.
Let us define the function ε-CLOSURE(s) to be the set of states of N built by applying the following rules:
1. s is added to ε-CLOSURE(s).
2. If t is in ε-CLOSURE(s), and there is an edge labeled ε from t to u, then u is added to ε-CLOSURE(s) if u is not already there. Rule 2 is repeated until no more states can be added to ε-CLOSURE(s).
Thus, ε-CLOSURE(s) is the set of states that can be reached from s on ε-transitions only. If T is a set of states, then ε-CLOSURE(T) is the union over all states s in T of ε-CLOSURE(s).

Constructing DFA from NFA

Algorithm: ε-CLOSURE(T)
    push all states in T onto stack;
    ε-closure(T) := T;
    while stack is not empty do
    begin
        pop s, the top element, off the stack;
        for each state t with an edge labeled ε from s to t do
            if t is not in ε-closure(T) then
            begin
                add t to ε-closure(T);
                push t onto stack
            end
    end
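The stack-based procedure above translates line for line into Python; here `eps` maps each state to its ε-successors, using the ε-edges of the (a|b)*abb NFA from the worked example:

```python
def e_closure(T, eps):
    """Set of NFA states reachable from any state in T on eps-transitions only."""
    closure = set(T)
    stack = list(T)                  # push all states in T onto the stack
    while stack:
        s = stack.pop()              # pop the top element
        for t in eps.get(s, ()):     # every edge labeled eps from s
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

# eps-edges of the NFA for (a|b)*abb (states 0-10)
eps = {0: [1, 2, 4, 7], 3: [6], 5: [6], 6: [1, 2, 4, 7]}
assert e_closure({0}, eps) == {0, 1, 2, 4, 7}             # state A
assert e_closure({3, 8}, eps) == {1, 2, 3, 4, 6, 7, 8}    # state B
```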

Constructing DFA from NFA

Algorithm: Subset construction
    initially, ε-CLOSURE(s0) is the only state of D, and it is unmarked;
    while there is an unmarked state x = {s1, s2, ..., sn} of D do
    begin
        mark x;
        for each input symbol a do
        begin
            let T be the set of states to which there is a transition
                on a from some state si in x;
            y := ε-CLOSURE(T);
            if y has not yet been added to the set of states of D then
                make y an unmarked state of D;
            add a transition from x to y labeled a, if not already present
        end
    end
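A Python sketch of the subset construction, with an ε-closure helper defined inline so the block is self-contained; applied to the (a|b)*abb NFA it reproduces the five DFA states A through E of the worked example:

```python
def e_closure(T, eps):
    closure, stack = set(T), list(T)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def subset_construction(start, moves, eps, alphabet):
    d_start = frozenset(e_closure({start}, eps))
    dstates, unmarked, dtran = {d_start}, [d_start], {}
    while unmarked:
        x = unmarked.pop()                       # mark x
        for a in alphabet:
            T = {t for s in x for t in moves.get((s, a), ())}
            y = frozenset(e_closure(T, eps))
            if y not in dstates:                 # y not yet a state of D
                dstates.add(y)
                unmarked.append(y)
            dtran[(x, a)] = y
    return d_start, dstates, dtran

# NFA for (a|b)*abb: eps-edges plus symbol moves
eps = {0: [1, 2, 4, 7], 3: [6], 5: [6], 6: [1, 2, 4, 7]}
moves = {(2, "a"): [3], (4, "b"): [5], (7, "a"): [8], (8, "b"): [9], (9, "b"): [10]}
start, dstates, dtran = subset_construction(0, moves, eps, "ab")
```

frozenset is used for the DFA states so they can serve as dictionary keys in the transition table dtran.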

Minimizing the number of states in a DFA

Algorithm
Input: a DFA M.
Output: a minimum-state DFA M'.
    If some states in M ignore some inputs, add transitions to a dead state.
    Let P = {accepting states, all nonaccepting states};
    let Pnew = {};
    loop: for each group G in P do
        partition G into subgroups so that s and t (in G) belong to the same
        subgroup if and only if, for each input a, states s and t have
        transitions on a to states in the same group of P;
        put those subgroups in Pnew;
    if (Pnew != P) then P := Pnew; goto loop;
    Remove any dead states and unreachable states.
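The refinement loop can be sketched in Python and checked against the 5-state DFA built earlier (states A to E, accepting state E); the loop below stops when a pass produces no new split, and it correctly merges A with C:

```python
# Transition table of the DFA for (a|b)*abb from the worked example.
dtran = {("A", "a"): "B", ("A", "b"): "C", ("B", "a"): "B", ("B", "b"): "D",
         ("C", "a"): "B", ("C", "b"): "C", ("D", "a"): "B", ("D", "b"): "E",
         ("E", "a"): "B", ("E", "b"): "C"}

def minimize(states, accepting, dtran, alphabet):
    # Initial partition: accepting states vs. all nonaccepting states.
    partition = [set(accepting), set(states) - set(accepting)]
    while True:
        def group_of(s):
            return next(i for i, g in enumerate(partition) if s in g)
        new = []
        for g in partition:
            # Two states stay together iff every input sends them
            # to the same group of the current partition.
            buckets = {}
            for s in g:
                key = tuple(group_of(dtran[(s, a)]) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new.extend(buckets.values())
        if len(new) == len(partition):     # no group was split: done
            return new
        partition = new

groups = minimize("ABCDE", {"E"}, dtran, "ab")
```

The result has four groups, with A and C merged into one, matching the minimized table above.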

NFA to DFA Example 2

For an NFA with states 0 to 8:

    ε-closure({0}) = {0, 1, 3, 7} = A
    subset(A, a) = {2, 4, 7} = B;   subset(A, b) = {8} = C
    ε-closure({2, 4, 7}) = {2, 4, 7}
    subset(B, a) = {7} = D;         subset(B, b) = {5, 8} = E
    ε-closure({8}) = {8}
    subset(C, a) = ∅;               subset(C, b) = {8} = C
    ε-closure({7}) = {7}
    subset(D, a) = {7} = D;         subset(D, b) = {8} = C

DFA states:
    A = {0, 1, 3, 7}
    B = {2, 4, 7}
    C = {8}
    D = {7}
    E = {5, 8}
    F = {6, 8}

Minimizing the Number of States of a DFA

(figure: the DFA of Example 2 before and after merging equivalent states)

A language for specifying Lexical Analyzers

A LEX source program is a specification of a lexical analyzer, consisting of a set of regular expressions together with an action for each regular expression.

The action is a piece of code which is to be executed whenever a token specified by the corresponding regular expression is recognized.

The output of LEX is a lexical analyzer program constructed from the LEX source specification.

Creating a Lexical Analyzer with Lex

    lex source program -> lex compiler -> lexical analyzer L
    input stream -> lexical analyzer L -> sequence of tokens

A LEX source program consists of two parts: auxiliary definitions and translation rules.

Auxiliary Definitions
The auxiliary definitions are statements of the form
    D1 = R1
    D2 = R2
    ...
    Dn = Rn
Eg: letter     = A | B | ... | Z
    digit      = 0 | 1 | ... | 9
    identifier = letter (letter | digit)*

Translation Rules
The translation rules of a LEX program are statements of the form
    P1  {A1}
    P2  {A2}
    ...
    Pm  {Am}
where each Pi is a regular expression called a pattern and each Ai is a program fragment. The pattern describes the form of the tokens; the program fragment describes what action the lexical analyzer should take when a token matching Pi is found.
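The pattern/action idea can be sketched in Python: each rule pairs a regular expression with a callable action, mirroring the Pi {Ai} form. (This simplified scanner takes the first matching rule rather than the longest match that real Lex uses, so the keyword rules must come before the identifier rule.)

```python
import re

# Each rule is (pattern, action); the action receives the matched lexeme.
rules = [
    (r"BEGIN", lambda m: 1),
    (r"END",   lambda m: 2),
    (r"[A-Za-z][A-Za-z0-9]*", lambda m: ("id", m)),
    (r"[0-9]+",               lambda m: ("const", int(m))),
]

def scan(text):
    out, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():          # skip white space between tokens
            pos += 1
            continue
        for pattern, action in rules:    # first matching rule wins
            m = re.match(pattern, text[pos:])
            if m:
                out.append(action(m.group()))
                pos += len(m.group())
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return out

print(scan("BEGIN x1 42 END"))
```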

AUXILIARY DEFINITIONS
    letter = A | B | ... | Z
    digit  = 0 | 1 | ... | 9

TRANSLATION RULES
    BEGIN                   {return 1}
    END                     {return 2}
    IF                      {return 3}
    THEN                    {return 4}
    ELSE                    {return 5}
    letter(letter|digit)*   {LEXVAL := INSTALL(); return 6}
    digit+                  {LEXVAL := INSTALL(); return 7}
    <                       {LEXVAL := 1; return 8}
    <=                      {LEXVAL := 2; return 8}
    =                       {LEXVAL := 3; return 8}
    <>                      {LEXVAL := 4; return 8}
    >                       {LEXVAL := 5; return 8}
    >=                      {LEXVAL := 6; return 8}

Regular Expressions in Lex

    x        match the character x
    \.       match the character .
    "string" match the contents of the string of characters literally
    .        match any character except newline
    ^        match the beginning of a line
    $        match the end of a line
    [xyz]    match one character: x, y, or z (use \ to escape -)
    [^xyz]   match any character except x, y, and z
    [a-z]    match one of a to z
    r*       closure (match zero or more occurrences)
    r+       positive closure (match one or more occurrences)
    r?       optional (match zero or one occurrence)
    r1r2     match r1 then r2 (concatenation)
    r1|r2    match r1 or r2 (union)
    (r)      grouping
    r1/r2    match r1 when followed by r2
    {d}      match the regular expression defined by d
