COMPILERS
Introduction to Compilers
Translator
A translator is a program that takes a program written in one
programming language as input and produces a program in another
language as output. If the source language is a high-level language
and the object language is a low-level language, then such a
translator is called a compiler.
Source Program -> Compiler -> Object Program
Phases of a compiler
Source program
-> Lexical Analyzer -> token stream
-> Syntax Analyzer -> syntax tree
-> Semantic Analyzer -> syntax tree
-> Intermediate code generator -> intermediate representation
-> Machine independent code optimizer -> intermediate representation
-> Code generator -> target machine code
-> Machine dependent code optimizer -> target machine code
The Symbol Table is shared by the phases.
-The
Syntax Analysis(Parsing)
Fig: syntax tree for <id,2> + (<id,3> * 60)
Semantic Analysis
The semantic analyzer uses the syntax tree and the
information in the SYMTAB to check the source program for
semantic consistency with the language definition.
It also gathers type information and saves it in either the
syntax tree or the SYMTAB, for subsequent use during
intermediate code generation.
An important part of semantic analysis is type checking,
where the compiler checks that each operator has matching
operands.
(e.g., the compiler must report an error if a float value is used
as an array index).
Eg. Suppose position, initial and rate are float numbers. The
lexeme 60 is an integer. The type checker in the semantic
analyzer discovers that the operator * is applied to a float
number rate and an int 60, so the int 60 is converted to float.
Eg.
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
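The coercion step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple-based tree, the `TYPES` table and the helper names `widen`, `gen` and `translate` are hypothetical and do not come from the slides; only `inttofloat` and the names id1, id2, id3 are taken from the text.

```python
from itertools import count

# Assumed types: position (id1), initial (id2) and rate (id3) are float.
TYPES = {"id1": "float", "id2": "float", "id3": "float"}


def widen(place, typ, code, temps):
    """Emit an inttofloat conversion when an int operand meets a float one."""
    if typ == "int":
        t = f"t{next(temps)}"
        code.append(f"{t} = inttofloat({place})")
        return t
    return place


def gen(node, code, temps):
    """Return (place, type) for a node, appending three-address code."""
    if isinstance(node, int):          # integer constant such as 60
        return str(node), "int"
    if isinstance(node, str):          # identifier looked up in TYPES
        return node, TYPES[node]
    op, left, right = node             # interior node: (operator, left, right)
    lp, lt = gen(left, code, temps)
    rp, rt = gen(right, code, temps)
    result_type = lt
    if lt != rt:                       # mixed int/float: convert the int side
        lp = widen(lp, lt, code, temps)
        rp = widen(rp, rt, code, temps)
        result_type = "float"
    t = f"t{next(temps)}"
    code.append(f"{t} = {lp} {op} {rp}")
    return t, result_type


def translate(target, expr):
    code, temps = [], count(1)
    place, _ = gen(expr, code, temps)
    code.append(f"{target} = {place}")
    return code


# id1 = id2 + id3 * 60
for line in translate("id1", ("+", "id2", ("*", "id3", 60))):
    print(line)
```

For this tree the sketch emits the same four instructions as the example, because the conversion is generated just before the multiplication that needs it.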
Code Optimization
Code Generation
Eg.
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1
Fig: Translation of an assignment statement
Syntax tree: <id,2> + (<id,3> * 60)
Semantic Analyzer
Syntax tree: <id,1> = <id,2> + (<id,3> * inttofloat(60))
Code Generator
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1
The
The
Fig: Source Program -> Lexical Analyzer -> (token) -> Parser -> to semantic analysis.
The Parser asks the Lexical Analyzer for each token with getNextToken; both use the Symbol Table.
of Lexemes
Removal
Correlating
b)
A token
A pattern
A lexeme
INPUT BUFFERING
Specialized
Two
Fig: input buffer pair (visible contents: M * and eof) with the pointers lexemeBegin and forward
Input Buffering
Two pointers are required:
lexemeBegin - marks the beginning of the current lexeme
forward - scans ahead until a pattern match is found.
Advancing forward requires that we first test
whether we have reached the end of one of the
buffers, and if so, we must reload the other
buffer from the input and move forward to
the beginning of the newly loaded buffer.
Sentinels
Used
Natural
Any
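Fig: buffer pair with an eof sentinel at the end of each buffer (visible contents: C * and eof) and the pointers lexemeBegin and forward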
switch (*forward++) {
case eof:
    if (forward is at the end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at the end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else
        /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
}
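The sentinel scheme above can be tried out with a small Python sketch. It is an illustration only: the class name `SentinelBuffer`, the tiny buffer size `N`, the choice of `"\0"` as the eof sentinel and the helper `read_all` are assumptions made for this example, and the lexemeBegin pointer is left out for brevity.

```python
import io

EOF = "\0"  # sentinel character; assumed never to occur in the source text
N = 8       # characters per buffer half (tiny, for illustration)


class SentinelBuffer:
    def __init__(self, stream):
        self.stream = stream
        # Two halves of N characters, each followed by one sentinel slot.
        self.buf = [EOF] * (2 * (N + 1))
        self._reload(0)
        self.forward = 0

    def _reload(self, start):
        chunk = self.stream.read(N)
        self.buf[start:start + len(chunk)] = chunk
        # After a short read the sentinel lands inside the half,
        # which is how the end of the input is recognised.
        self.buf[start + len(chunk)] = EOF

    def next_char(self):
        ch = self.buf[self.forward]
        self.forward += 1
        if ch != EOF:
            return ch                    # common case: a single test
        if self.forward == N + 1:        # sentinel ending the first half
            self._reload(N + 1)          # forward already points at the second half
            return self.next_char()
        if self.forward == 2 * (N + 1):  # sentinel ending the second half
            self._reload(0)
            self.forward = 0
            return self.next_char()
        self.forward -= 1                # eof within a buffer: end of input
        return None


def read_all(text):
    buf = SentinelBuffer(io.StringIO(text))
    return "".join(iter(buf.next_char, None))


print(read_all("E = M * C ** 2"))
```

The point of the design is visible in `next_char`: ordinary characters need one comparison, and the buffer-boundary tests run only when the sentinel is seen.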
SPECIFICATION OF TOKENS
Exponentiation
Operations on Languages
OPERATION                                DEFINITION
union of L and M, written L U M          L U M = {s | s is in L or s is in M}
concatenation of L and M, written LM     LM = {st | s is in L and t is in M}
Kleene closure of L, written L*          $L^* = \bigcup_{i=0}^{\infty} L^i$
positive closure of L, written L+        $L^+ = \bigcup_{i=1}^{\infty} L^i$
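For finite languages these four operations can be tried directly with Python sets. The sketch below is illustrative: the function names are made up for this example, and because the true closures are infinite, the hypothetical `max_i` parameter cuts the union off after a fixed number of powers.

```python
def union(L, M):
    return L | M


def concat(L, M):
    # LM = {st | s is in L and t is in M}
    return {s + t for s in L for t in M}


def power(L, i):
    # L^0 = {""} (the empty string); L^i = L^(i-1) L
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene_closure(L, max_i):
    # Union of L^0 .. L^max_i: a finite approximation of L*
    return set().union(*(power(L, i) for i in range(max_i + 1)))


def positive_closure(L, max_i):
    # Union of L^1 .. L^max_i: a finite approximation of L+
    return set().union(*(power(L, i) for i in range(1, max_i + 1)))


L = {"a", "b"}
M = {"0", "1"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(kleene_closure({"aa"}, 2)))
```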
Expression
Regular
Each regular expression r denotes a language.
INDUCTION:
Precedence
* has the highest precedence
Concatenation has the second highest precedence
| has the lowest precedence
Examples
Algebraic laws of Regular Expressions
AXIOM                         DESCRIPTION
r|s = s|r                     | is commutative
r|(s|t) = (r|s)|t             | is associative
(rs)t = r(st)                 concatenation is associative
r(s|t) = rs|rt                concatenation distributes over |
(s|t)r = sr|tr
εr = r                        ε is the identity element for concatenation
rε = r
r* = (r|ε)*                   relation between * and ε
r** = r*                      * is idempotent
Regular Definitions
We
Recognition of tokens
digit   -> [0-9]
digits  -> digit+
number
letter  -> [A-Za-z]
id      -> letter (letter | digit)*
if      -> if
then    -> then
else    -> else
relop   -> < | > | <= | >= | = | <>
06/13/15
LEXEMES       TOKEN NAME    ATTRIBUTE VALUE
Any ws        -             -
if            if            -
then          then          -
else          else          -
Any id        id            pointer to symbol table entry
Any number    number        pointer to symbol table entry
<             relop         LT
<=            relop         LE
=             relop         EQ
<>            relop         NE
>             relop         GT
>=            relop         GE
Transition Diagrams
As an intermediate step in the
construction of a lexical analyzer, we
first convert patterns into "transition
diagrams".
Transition diagrams have a collection
of nodes or circles, called states.
Each state represents a condition that
could occur during the process of
scanning the input looking for a
lexeme that matches one of several
patterns.
Edges
Each edge is labeled by a symbol or set of symbols.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final.
These states indicate that a lexeme has been found.
(We always indicate an accepting state by a
double circle, and if there is an action to be
taken, typically returning a token and an
attribute value to the parser, we shall attach
that action to the accepting state.)
2. In addition, if it is necessary to retract the
forward pointer one position (i.e., the lexeme
does not include the symbol that got us to the
accepting state), then we shall additionally
place a * near that accepting state.
State
In
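Fig: transition diagram fragment with edges labeled start, letter, other and nonlet/dig, and a state 10
Fig: transition diagram for identifiers. Start --(letter)--> 1, which loops on "letter or digit"; 1 --(delimiter)--> 2 *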
Token         Code    Value
begin         1       ---
end           2       ---
if            3       ---
then          4       ---
else          5       ---
identifier    6       pointer to symbol table
constant      7       pointer to symbol table
<             8       1
<=            8       2
=             8       3
<>            8       4
>             8       5
>=            8       6
Fig: Tokens recognized
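Keywords :
BEGIN : Start --(B)--> 1 --(E)--> 2 --(G)--> 3 --(I)--> 4 --(N)--> 5 --(blank/newline)--> 6 *   return(1, )
END   : Start --(E)--> 7 --(N)--> 8 --(D)--> 9 --(blank/newline)--> 10 *   return(2, )
ELSE  : 7 --(L)--> 11 --(S)--> 12 --(E)--> 13 --(blank/newline)--> 14 *   return(5, )
IF    : Start --(I)--> 15 --(F)--> 16 --(blank/newline)--> 17 *   return(3, )
THEN  : Start --(T)--> 18 --(H)--> 19 --(E)--> 20 --(N)--> 21 --(blank/newline)--> 22 *   return(4, )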
Identifier :
Start 23 --(letter)--> 24
24 --(letter or digit)--> 24
24 --(not letter or digit)--> 25 *   return(6, INSTALL( ))
Constant :
Start 26 --(digit)--> 27
27 --(digit)--> 27
27 --(not digit)--> 28 *   return(7, INSTALL( ))
Relops :
Start 29 --(<)--> 30
30 --(not = or >)--> 31 *   return(8,1)
30 --(=)--> 32              return(8,2)
30 --(>)--> 33              return(8,4)
29 --(=)--> 34              return(8,3)
29 --(>)--> 35
35 --(not =)--> 36 *        return(8,5)
35 --(=)--> 37              return(8,6)
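The relational-operator diagram can be turned into code almost mechanically. The following Python sketch is one possible rendering, not the slides' own implementation: the function name `relop` and its `(token, next position)` return convention are assumptions, while the pairs (8,1) to (8,6) are the return values shown in the diagram. Retracting the forward pointer (the * states) shows up as advancing by only one character.

```python
def relop(text, pos=0):
    """Recognise a relational operator starting at text[pos].

    Returns ((8, value), next_pos), or None if no operator starts here.
    """
    def peek(i):
        return text[i] if i < len(text) else ""

    c = peek(pos)
    if c == "<":
        d = peek(pos + 1)
        if d == "=":
            return (8, 2), pos + 2      # <=
        if d == ">":
            return (8, 4), pos + 2      # <>
        return (8, 1), pos + 1          # <  (starred state: lookahead not consumed)
    if c == "=":
        return (8, 3), pos + 1          # =
    if c == ">":
        if peek(pos + 1) == "=":
            return (8, 6), pos + 2      # >=
        return (8, 5), pos + 1          # >  (starred state: lookahead not consumed)
    return None


print(relop("<= 10"))
print(relop("<a"))
```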
Regular Expressions
y = de
Prefix
A prefix of x is obtained by discarding 0 or more trailing
symbols of x
Eg: abc, abcd and a are prefixes of abcde
Suffix
A suffix of x is obtained by discarding 0 or more leading
symbols of x
Eg: cde and e are suffixes of abcde
Substring
A substring of x is obtained by deleting a prefix and a suffix
from x
Eg: cd, abc, de and abcde are substrings of abcde
Every prefix and every suffix of x is a substring of x, but a
substring need not be a prefix or a suffix.
ε and x are prefixes, suffixes, and substrings of x
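These three definitions map directly onto Python string slicing. The helper names below are hypothetical and chosen only for this illustration; the empty string `""` plays the role of ε.

```python
def prefixes(x):
    # Discard 0 or more trailing symbols.
    return {x[:i] for i in range(len(x) + 1)}


def suffixes(x):
    # Discard 0 or more leading symbols.
    return {x[i:] for i in range(len(x) + 1)}


def substrings(x):
    # Delete a prefix and a suffix.
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}


x = "abcde"
print(sorted(prefixes(x), key=len))
print(sorted(suffixes(x), key=len))
print("cd" in substrings(x), "cd" in prefixes(x) | suffixes(x))
```

The last line illustrates the remark above: cd is a substring of abcde without being a prefix or a suffix.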
Language
A language is a set of strings formed from a specific
alphabet.
If L and M are two languages, the possible operations
are:
Concatenation
The concatenation of L and M is denoted L.M and is
found by selecting a string x from L and a string y from M
and joining them in that order.
LM = {xy | x is in L and y is in M}
∅L = L∅ = ∅
Exponentiation
L^i = LLL...L (i times)
L^0 = {ε}, {ε}L = L{ε} = L
Union
L U M = {x | x is in L or x is in M}
∅ U L = L U ∅ = L
Closure
$L^* = \bigcup_{i=0}^{\infty} L^i$
Eg: let L = { aa }; then L* = {ε, aa, aaaa, ...}
$L^+ = \bigcup_{i=1}^{\infty} L^i$
$L.(L^*) = L.\bigcup_{i=0}^{\infty} L^i = \bigcup_{i=0}^{\infty} L^{i+1} = \bigcup_{i=1}^{\infty} L^i = L^+$
Regular Expressions
Regular expressions are used to describe the tokens.
Eg: for identifier,
identifier = letter ( letter / digit )*
They are also used to define a language.
Regular Expression construction rules
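1. ε is a regular expression denoting {ε}, that is, the
language containing only the empty string.
2. For each a in Σ, a is a regular expression denoting
{a}, the language with only one string, that string
consisting of the single symbol a.
3. If R and S are regular expressions denoting
languages LR and LS respectively, then
(R) / (S) is a regular expression denoting LR U LS,
(R) . (S) is a regular expression denoting LR . LS, and
(R)* is a regular expression denoting LR*.
R / S = S / R   ( / is commutative)
R . (S . T) = (R . S) . T   ( . is associative)
R . (S / T) = R . S / R . T and (S / T) . R = S . R / T . R   ( . distributes over / )
εR = Rε = R   (ε is the identity for concatenation)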
Finite Automata
Language Recognizer
A language recognizer is a program that identifies the presence of
a token on the input. It takes a string x as its input and answers
yes if x is a sentence of L and no otherwise.
How does it work?
To determine whether x belongs to a language L, x is decomposed
into a sequence of substrings denoted by the primitive
subexpressions in R.
Example
Given R = (a/b)*abb, the set of all strings ending in abb, and
x = aabb.
Since R = R1R2 where R1 = (a/b)* and R2 = abb, x can be split
into a followed by abb, where a is denoted by R1 and abb is
denoted by R2.
Nondeterministic Automata
It is the generalized transition diagram that is derived from
the expression.
Fig: A nondeterministic finite automaton for (a / b)*abb.
State 0 is the start state and loops on a and b; 0 --(a)--> 1 --(b)--> 2 --(b)--> 3, where 3 is the accepting state.
The nodes are called states and the labeled edges are
called transitions. Edges can be labeled by ε as well as by characters.
Also, the same character can label two or more transitions out
of one state. An NFA has one start state and one or more
final states (accepting states).
Transition table
The transition table is the tabular form of the transitions of an NFA.
In the transition table, there is a row for each state and a
column for each admissible input symbol and ε.
The entry for row i and symbol a is the set of possible
next states for state i on input a.
State    Input symbol
         a        b
0        {0,1}    {0}
1        ---      {2}
2        ---      {3}
Moves of the NFA on the input aabb:
State    Remaining i/p
0        aabb
0        abb
1        bb
2        b
3        ε
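A transition table of this kind can be simulated by tracking the set of states the NFA could be in after each input symbol. The Python sketch below assumes the table for (a/b)*abb with states 0 to 3, start state 0 and accepting state 3; the dictionary layout and the function name `accepts` are choices made for this example. This particular table has no ε-moves, so no closure step is needed.

```python
# Transition table: NFA[state][symbol] = set of possible next states.
NFA = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}


def accepts(x, start=0, final=frozenset({3})):
    current = {start}
    for ch in x:
        # Follow every possible transition on ch from every current state.
        current = {t for s in current for t in NFA[s].get(ch, ())}
    return bool(current & final)


for word in ["aabb", "abb", "aab", "babb", ""]:
    print(repr(word), accepts(word))
```

Tracking a set of states avoids guessing which of the two a-transitions out of state 0 to take.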
2. NFA for a
Fig: NFA for a, with states i' and f'
3. NFA for R1 / R2
Let N1 and N2 be the NFAs corresponding to R1 and R2 respectively.
Fig: NFAs constructed from N1 and N2
Decomposition of (a / b)*abb
Fig: parse tree of (a / b)*abb with the subexpressions R1 to R11
Subexpression     NFA
R1 = a            N1
R2 = b            N2
R3 = R1/R2        N3
R5 = (R4)*        N5
R6 = a            N6
R7 = R5R6         N7
R8 = b            N8
R9 = R7R8         N9
R10 = b           N10
R11 = R9R10       N11
ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} ------------- (B)
From B: a leads to {3, 8}, b leads to {5, 9}
ε-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} ------------ (E)
From E: a leads to {3, 8}, b leads to {5}
Fig: transition table and transition diagram of the DFA with states A (Start), B, C, D and E (Accept) over the input symbols a and b
Fig: transition table and transition diagram of a DFA with states A (Start), B, D and E (Accept) over the input symbols a and b
Algorithm
Input: an NFA N.
Output: a DFA D accepting the same language.
Let us define the function ε-CLOSURE(s) to be the set of states of N built by applying the following rules:
Algorithm ε-CLOSURE
push all states in T onto stack;
ε-closure(T) := T;
while stack is not empty do
begin
    pop s, the top element, off the stack;
    for each state t with an edge from s to t labeled ε do
        if t is not in ε-closure(T) then
        begin
            add t to ε-closure(T);
            push t onto stack
        end
end
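The stack-based ε-closure loop, and the subset construction built on top of it, can be sketched in Python as follows. The NFA used here is an assumption: a Thompson-style NFA for (a|b)*abb with states 0 to 10, numbered so that its closures agree with the sets {1, 2, 3, 4, 6, 7, 8} and {1, 2, 4, 5, 6, 7, 10} that appear in these notes. The names `EPS`, `eps_closure`, `move` and `subset_construction` are illustrative.

```python
EPS = ""  # label used here for ε-edges

# Assumed NFA for (a|b)*abb: NFA[state][label] = set of next states.
NFA = {
    0: {EPS: {1, 7}}, 1: {EPS: {2, 4}}, 2: {"a": {3}}, 3: {EPS: {6}},
    4: {"b": {5}}, 5: {EPS: {6}}, 6: {EPS: {1, 7}}, 7: {"a": {8}},
    8: {"b": {9}}, 9: {"b": {10}}, 10: {},
}


def eps_closure(T):
    closure = set(T)
    stack = list(T)                    # push all states in T onto the stack
    while stack:
        s = stack.pop()
        for t in NFA[s].get(EPS, ()):  # edges from s labeled ε
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(T, a):
    return {t for s in T for t in NFA[s].get(a, ())}


def subset_construction(start, alphabet):
    start_state = eps_closure({start})
    dtran = {}                         # DFA state -> {symbol: DFA state}
    unmarked = [start_state]
    while unmarked:
        T = unmarked.pop()
        dtran[T] = {}
        for a in alphabet:
            U = eps_closure(move(T, a))
            dtran[T][a] = U
            if U not in dtran and U not in unmarked:
                unmarked.append(U)
    return start_state, dtran


start, dtran = subset_construction(0, "ab")
for T, row in dtran.items():
    print(sorted(T), {a: sorted(U) for a, U in row.items()})
```

Each DFA state is a frozenset of NFA states, so it can be used directly as a dictionary key.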
Algorithm
Input: a DFA M
Output: a minimum state DFA M'
If some states in M ignore some inputs, add
transitions to a dead state.
Let P = {all accepting states, all nonaccepting states}
Let P' = {}
Loop: for each group G in P do
    Partition G into subgroups so that s and t (in G)
    belong to the same subgroup if and only if, for
    each input a, states s and t have transitions
    to states in the same group of P
    put those subgroups in P'
if (P' != P) then set P = P', P' = {} and goto Loop
Remove any dead states and unreachable states.
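The partition-refinement loop can be sketched in Python as below. The example DFA (states A to E for (a|b)*abb, with E accepting) is an assumed input whose transition entries are supplied for illustration, and the function name `minimize` is hypothetical. The sketch assumes the DFA is already total, so the dead-state step and the final clean-up are omitted.

```python
def minimize(states, alphabet, delta, accepting):
    """Return the final partition of states into equivalence groups."""
    accepting = set(accepting)
    P = [g for g in (accepting, set(states) - accepting) if g]
    while True:
        def group_of(s):
            # Index of the group of the current partition P that contains s.
            return next(i for i, g in enumerate(P) if s in g)

        new_P = []
        for G in P:
            # States stay together only if, for every input symbol,
            # they move into the same group of P.
            buckets = {}
            for s in sorted(G):
                key = tuple(group_of(delta[s][a]) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_P.extend(buckets.values())
        if len(new_P) == len(P):  # refinement only splits, so equal size means P' = P
            return new_P
        P = new_P


# Assumed example DFA for (a|b)*abb.
DELTA = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}

groups = minimize("ABCDE", "ab", DELTA, {"E"})
print(sorted(sorted(g) for g in groups))
```

On this input the states A and C end up in one group, which gives a four-state minimum DFA.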
ε-closure({0}) = {0,1,3,7}
subset({0,1,3,7},a) = {2,4,7}
subset({0,1,3,7},b) = {8}
ε-closure({2,4,7}) = {2,4,7}
subset({2,4,7},a) = {7}
subset({2,4,7},b) = {5,8}
ε-closure({8}) = {8}
subset({8},a) = ∅
subset({8},b) = {8}
ε-closure({7}) = {7}
subset({7},a) = {7}
subset({7},b) = {8}
DFA states
A = {0,1,3,7}
B = {2,4,7}
C = {8}
D = {7}
E = {5,8}
F = {6,8}
Fig: transition diagrams of the resulting DFA over the input symbols a and b, with accepting states labeled by the patterns a1, a2 and a3 that they match
Creating a Lexical
Analyzer with Lex
lex source program -> lex compiler -> lexical analyzer L
input stream -> lexical analyzer L -> sequence of tokens
Translation Rules
The translation rules of a LEX program are statements of the form
P1    {A1}
P2    {A2}
.
.
Pm    {Am}
where each Pi is a regular expression called a pattern and
each Ai is a program fragment.
The patterns describe the form of the tokens.
The program fragment describes what action the lexical
analyzer should take when token Pi is found.
AUXILIARY DEFINITIONS
letter = A | B | ... | Z
digit = 0 | 1 | ... | 9
TRANSLATION RULES
BEGIN                      {return 1}
END                        {return 2}
IF                         {return 3}
THEN                       {return 4}
ELSE                       {return 5}
letter(letter | digit)*    {LEXVAL := INSTALL(); return 6}
digit+                     {LEXVAL := INSTALL(); return 7}
<                          {LEXVAL := 1; return 8}
<=                         {LEXVAL := 2; return 8}
=                          {LEXVAL := 3; return 8}
<>                         {LEXVAL := 4; return 8}
>                          {LEXVAL := 5; return 8}
>=                         {LEXVAL := 6; return 8}
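The way such pattern/action rules drive a scanner can be imitated in Python with the standard `re` module. This is a sketch under assumptions, not how LEX itself is implemented: the rule list, the `install` helper standing in for INSTALL() and the `(code, LEXVAL)` pairs are made up for this example. It picks the longest match at each position and, on a tie, the rule listed first, and it assumes constants contain at least one digit.

```python
import re

symtab = []  # stand-in for the symbol table


def install(lexeme):
    """Hypothetical INSTALL(): return the table index of the lexeme."""
    if lexeme not in symtab:
        symtab.append(lexeme)
    return symtab.index(lexeme)


# (pattern, action) pairs in priority order; actions return (code, LEXVAL).
RULES = [
    (r"BEGIN", lambda s: (1, None)),
    (r"END", lambda s: (2, None)),
    (r"IF", lambda s: (3, None)),
    (r"THEN", lambda s: (4, None)),
    (r"ELSE", lambda s: (5, None)),
    (r"[A-Z][A-Z0-9]*", lambda s: (6, install(s))),
    (r"[0-9]+", lambda s: (7, install(s))),
    (r"<", lambda s: (8, 1)),
    (r"<=", lambda s: (8, 2)),
    (r"=", lambda s: (8, 3)),
    (r"<>", lambda s: (8, 4)),
    (r">", lambda s: (8, 5)),
    (r">=", lambda s: (8, 6)),
]
COMPILED = [(re.compile(p), action) for p, action in RULES]


def tokens(text):
    out, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():          # blanks and newlines separate tokens
            pos += 1
            continue
        best = None
        for pattern, action in COMPILED:
            m = pattern.match(text, pos)
            # Strictly longer matches win; ties keep the earlier rule.
            if m and (best is None or m.end() > best[0].end()):
                best = (m, action)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        m, action = best
        out.append(action(m.group()))
        pos = m.end()
    return out


print(tokens("IF X <= 10 THEN Y"))
```

Because the longest match wins, an input such as IFX is reported as one identifier rather than the keyword IF followed by X.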