Recognition of Tokens

Specification of Tokens
 Regular expressions are an important notation for specifying patterns.
 Operations on languages
 Regular expressions
 Regular definitions
Winter 2007 SEG2101 Chapter 8 1

Operations on Languages

Regular Expressions
 A regular expression is a compact notation for describing a set of strings.
 In Pascal, an identifier is a letter followed by zero or more letters or digits:
letter(letter|digit)*
 |: or
 *: zero or more instances of
 a(a|d)*
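The identifier pattern can be tried out with Python's `re` module. This is a minimal sketch: the names `LETTER`, `DIGIT`, and `is_identifier` are illustrative, and the letter class is assumed to be the ASCII letters.

```python
import re

# letter(letter|digit)* written with character classes.
# LETTER and DIGIT are illustrative names; ASCII letters are an assumption.
LETTER = r"[A-Za-z]"
DIGIT = r"[0-9]"
IDENTIFIER = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")

def is_identifier(text: str) -> bool:
    """Return True when the whole string matches letter(letter|digit)*."""
    return IDENTIFIER.fullmatch(text) is not None
```

`fullmatch` is used so that the entire string, not just a prefix, has to fit the pattern.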
Rules
 ε is a regular expression that denotes {ε}, the set containing the empty string.
 If a is a symbol in Σ, then a is a regular expression that denotes {a}, the set containing the string a.
 Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
  (r)|(s) is a regular expression denoting L(r) ∪ L(s).
  (r)(s) is a regular expression denoting L(r)L(s).
  (r)* is a regular expression denoting (L(r))*.
  (r) is a regular expression denoting L(r).
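The three language operations behind these rules can be sketched on finite sets of strings. The function names are illustrative, and since L* is infinite in general, `closure` only approximates it up to a chosen number of repetitions.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M

def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {x + y for x in L for y in M}

def closure(L, max_repeats):
    """Approximate L* using at most max_repeats concatenations of L."""
    result = {""}          # zero repetitions give the empty string
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, L)
        result |= layer
    return result
```

For example, `closure({"a", "b"}, 2)` contains the empty string, the two strings of length one, and the four strings of length two.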

Precedence Conventions
 The unary operator * has the highest precedence and is left associative.
 Concatenation has the second highest precedence and is left associative.
 | has the lowest precedence and is left associative.
 Under these conventions, (a)|(b)*(c) can be written as a|b*c.
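Python's `re` module uses the same relative precedence for these three operators, so the convention can be checked by comparing a fully parenthesized pattern with the abbreviated one. This is a small sketch; `matches` and the sample strings are illustrative.

```python
import re

# Same language, written without and with the redundant parentheses.
plain = re.compile(r"a|b*c")
explicit = re.compile(r"(a)|((b)*(c))")

def matches(pattern, text):
    """True when the whole text is in the language of the pattern."""
    return pattern.fullmatch(text) is not None

samples = ["a", "c", "bc", "bbbc", "ac", "ab", ""]
agree = all(matches(plain, s) == matches(explicit, s) for s in samples)
```

If | bound more tightly than concatenation, a|b*c would mean (a|b*)c and would accept "ac"; under the conventions above it does not.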

Example of Regular Expressions

Properties of Regular Expressions

Regular Definitions
 If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1 -> r1
d2 -> r2
...
dn -> rn
 where each di is a distinct name, and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, …, di-1}, i.e., the basic symbols and the previously defined names.
Examples of Regular Definitions
Example 3.5. Unsigned numbers

Recognition of Tokens
 A grammar for branching statements
 stmt -> if expr then stmt
       | if expr then stmt else stmt
       | ε
 expr -> term relop term
       | term
 term -> id
       | number
Example
 Patterns for tokens in the grammar
 digit -> [0-9]
 digits -> digit+
 number -> digits (. digits)? (E [+-]? digits)?
 id -> letter (letter | digit)*
 if -> if
 then -> then
 else -> else
 relop -> < | > | <= | >= | = | <>
 ws -> (blank | tab | newline)+
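A regular definition builds each pattern from names defined earlier. The sketch below mirrors that for digit, digits, and number with Python strings; the variable names follow the definitions, while `is_number` is an illustrative helper.

```python
import re

# Each name is defined only in terms of earlier names, as in a regular definition.
digit = r"[0-9]"
digits = f"{digit}+"
number = rf"{digits}(\.{digits})?(E[+-]?{digits})?"

NUMBER = re.compile(number)

def is_number(text: str) -> bool:
    """Return True when the whole string is an unsigned number."""
    return NUMBER.fullmatch(text) is not None
```

The optional fraction and the optional exponent are the two `( ... )?` groups, so 314, 3.14, and 3.14E-5 are all accepted.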
Tokens, their patterns, and attribute values
Lexemes      Token name   Attribute value
Any ws       -            -
if           if           -
then         then         -
else         else         -
Any id       id           Pointer to table entry
Any number   number       Pointer to table entry
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
>            relop        GT
>=           relop        GE
Example
C=a+b*5
<id, pointer to symbol table entry>
<relop, EQ>
<assign_op, ->
<multi_op, ->
<num, pointer to symbol table entry>
Transition Diagrams
 Nodes: states, conditions that could occur during the process of scanning
the input looking for a lexeme that matches one of several patterns
 Edges: directed from state to state
 Labeled by a symbol or set of symbols
 Deterministic: there’s never more than one edge out of a given state with a
given symbol among its labels
 Certain states are accepting or final: a lexeme has been found
 Double circle
 If it’s necessary to retract the forward pointer, we shall additionally place a
* near that accepting state
 Start state, or initial state, is indicated by an edge, labeled “start”, entering
from nowhere
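Partial transition diagram for relop (operators beginning with <):
 start: state 0
 state 0: on <, go to state 1
 state 1: on =, go to state 2; on >, go to state 3; on other, go to state 4
 state 2 (accepting): return(relop, LE)
 state 3 (accepting): return(relop, NE)
 state 4 (accepting, *): return(relop, LT)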
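Example Transition Diagram for relop
 start: state 0
 state 0: on <, go to state 1; on =, go to state 5; on >, go to state 6
 state 1: on =, go to state 2; on >, go to state 3; on other, go to state 4
 state 2 (accepting): return(relop, LE)
 state 3 (accepting): return(relop, NE)
 state 4 (accepting, *): return(relop, LT)
 state 5 (accepting): return(relop, EQ)
 state 6: on =, go to state 7; on other, go to state 8
 state 7 (accepting): return(relop, GE)
 state 8 (accepting, *): return(relop, GT)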
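The relop diagram can be turned into code with one branch per state. This is a sketch rather than the course's own implementation: `get_relop` is a hypothetical name, and instead of reading the "other" character and retracting, it simply does not advance past it, which has the same effect as the * states.

```python
def get_relop(text, pos=0):
    """Simulate the relop diagram starting at text[pos].

    Returns ((token, attribute), next_pos), or None if no relop starts at pos.
    """
    state = 0
    i = pos
    while True:
        ch = text[i] if i < len(text) else ""   # "" stands for end of input
        if state == 0:
            if ch == "<":
                state = 1
            elif ch == "=":
                state = 5
            elif ch == ">":
                state = 6
            else:
                return None                      # not a relational operator
            i += 1
        elif state == 1:
            if ch == "=":
                return ("relop", "LE"), i + 1    # state 2
            if ch == ">":
                return ("relop", "NE"), i + 1    # state 3
            return ("relop", "LT"), i            # state 4*: "other" not consumed
        elif state == 5:
            return ("relop", "EQ"), i            # state 5
        elif state == 6:
            if ch == "=":
                return ("relop", "GE"), i + 1    # state 7
            return ("relop", "GT"), i            # state 8*: "other" not consumed
```

The returned position is where scanning of the next lexeme would start.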
Recognition of Reserved Words and Identifiers
 Problem: keywords look like identifiers
 Solution (two alternatives):
  Install the reserved words in the symbol table initially
  Create a separate transition diagram for each keyword
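Examples for Identifiers and Keywords
 start: state 9
 state 9: on letter, go to state 10
 state 10: on letter or digit, stay in state 10; on other, go to state 11
 state 11 (accepting, *): return(getToken(), installID())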
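The first solution, installing the reserved words in the symbol table up front, can be sketched as follows. The table layout and the Python names `install_id`, `get_token`, and `scan_identifier` are assumptions made for illustration; here the lexeme itself stands in for the pointer to the table entry, and ASCII letters are assumed.

```python
import string

LETTERS = set(string.ascii_letters)
LETTERS_OR_DIGITS = LETTERS | set(string.digits)

# Reserved words are installed before scanning starts (lexeme -> token name).
symbol_table = {"if": "if", "then": "then", "else": "else"}

def install_id(lexeme):
    """Enter the lexeme as an id unless it is already in the table."""
    symbol_table.setdefault(lexeme, "id")
    return lexeme   # stands in for a pointer to the table entry

def get_token(lexeme):
    """Look up the token name: a keyword name or id."""
    return symbol_table[lexeme]

def scan_identifier(text, pos=0):
    """States 9-11: a letter, then letters or digits, stopping at the first other character."""
    if pos >= len(text) or text[pos] not in LETTERS:              # state 9
        return None
    end = pos + 1
    while end < len(text) and text[end] in LETTERS_OR_DIGITS:    # state 10
        end += 1
    lexeme = text[pos:end]        # state 11*: the "other" character is not consumed
    entry = install_id(lexeme)
    return (get_token(lexeme), entry), end
```

Because keywords are already in the table, `install_id` leaves them alone and `get_token` reports them under their own token name instead of id.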
Completion of the Running Example: Unsigned Numbers
3.14E-5
3.14
314
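Transition Diagram for Whitespace
 start: state 22
 state 22: on delim, go to state 23
 state 23: on delim, stay in state 23; on other, go to state 24
 state 24 (accepting, *)
 delim -> blank | tab | newline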

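Transition Diagram
[Figure: the transition diagrams for relop (states 0-8), identifiers and keywords (states 9-11), unsigned numbers (states 12-21), and whitespace (states 22-24), drawn together.]
 Unsigned numbers:
  start: state 12
  state 12: on digit, go to state 13
  state 13: on digit, stay in state 13; on ., go to state 14; on E, go to state 16; on other, go to state 20
  state 14: on digit, go to state 15
  state 15: on digit, stay in state 15; on E, go to state 16; on other, go to state 21
  state 16: on + or -, go to state 17; on digit, go to state 18
  state 17: on digit, go to state 18
  state 18: on digit, stay in state 18; on other, go to state 19
  states 19, 20, 21 (accepting, *)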
C code to find next start state
C Code for Lexical analyzers
Finite Automata
 Finite automata are recognizers
  They simply say “yes” or “no” about each input string
 Two kinds:
  Nondeterministic finite automata (NFA)
   No restrictions on the labels of the edges
  Deterministic finite automata (DFA)
   For each state, and for each symbol, there’s exactly one edge with that symbol leaving that state
Nondeterministic Finite Automata
 An NFA consists of
  A finite set of states S
  A set of input symbols Σ, the input alphabet
  A transition function that gives, for each state and for each symbol in Σ ∪ {ε}, a set of next states
  A state s0 from S (the start state or initial state)
  A set of states F, a subset of S (the accepting states, or final states)
 An NFA can be represented by a transition graph
 There’s an edge labeled a from state s to state t iff t is one of the next states for state s and input a
 It’s similar to a transition diagram except:
  The same symbol can label edges from one state to several different states
  An edge may be labeled by ε, in addition to symbols from the input alphabet
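An Example NFA: (a|b)*abb
Transition graph
 start: state 0; accepting: state 3
 state 0: on a, go to state 0 or state 1; on b, go to state 0
 state 1: on b, go to state 2
 state 2: on b, go to state 3
Transition table
State   a        b      ε
0       {0, 1}   {0}    ∅
1       ∅        {2}    ∅
2       ∅        {3}    ∅
3       ∅        ∅      ∅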
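The transition table for this NFA can be simulated by tracking the set of states the automaton could be in. A minimal sketch, with `NFA_MOVES` and `nfa_accepts` as illustrative names; this particular NFA has no ε-moves, so no ε-closure step is needed here.

```python
# Transition table of the NFA for (a|b)*abb; missing entries mean the empty set.
NFA_MOVES = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
ACCEPTING = {3}

def nfa_accepts(text: str) -> bool:
    """Follow every possible move at once; accept if any path ends in state 3."""
    current = {START}
    for ch in text:
        current = {t for s in current for t in NFA_MOVES.get((s, ch), set())}
    return bool(current & ACCEPTING)
```

On input a, state 0 offers two choices (stay in 0 or move to 1); keeping both in `current` is what handles the nondeterminism.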
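Example NFA: aa*|bb*
 start: state 0; accepting: states 3 and 4
 state 0: on ε, go to state 1 or state 2
 state 1: on a, go to state 3
 state 3: on a, stay in state 3
 state 2: on b, go to state 4
 state 4: on b, stay in state 4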
Deterministic Finite Automata
 A DFA is a special case of an NFA where:
  There are no moves on input ε
  For each state s and input symbol a, there’s exactly one edge out of s labeled a
 Every regular expression and every NFA can be converted to a DFA accepting the same language
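Example DFA accepting (a|b)*abb
 start: state 0; accepting: state 3
 state 0: on a, go to state 1; on b, stay in state 0
 state 1: on a, stay in state 1; on b, go to state 2
 state 2: on a, go to state 1; on b, go to state 3
 state 3: on a, go to state 1; on b, go to state 0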
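Simulating a DFA needs only one current state. The sketch below assumes the usual four-state DFA for (a|b)*abb, with the transitions written out in `DFA_MOVES`; the names are illustrative.

```python
# Assumed transitions of the four-state DFA for (a|b)*abb (start 0, accepting 3).
DFA_MOVES = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}

def dfa_accepts(text: str) -> bool:
    """Follow the single edge for each input symbol; accept in state 3."""
    state = 0
    for ch in text:
        if ch not in DFA_MOVES[state]:
            return False          # symbol outside the alphabet {a, b}
        state = DFA_MOVES[state][ch]
    return state == 3
```

Each state has exactly one entry per input symbol, which is the defining property of a DFA.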
Construction of an NFA from a Regular Expression (Thompson’s algorithm)
 Basis:
  For expression ε, construct the NFA
   start: state i; state i: on ε, go to state f (accepting)
  For subexpression a in Σ, construct the NFA
   start: state i; state i: on a, go to state f (accepting)
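NFA for the concatenation of two regular expressions: N(s)N(t)
 start: state i, the start state of N(s); accepting: state f, the accepting state of N(t)
 the accepting state of N(s) is merged with the start state of N(t)
 Example: abb
  start: state 0; accepting: state 3
  state 0: on a, go to state 1
  state 1: on b, go to state 2
  state 2: on b, go to state 3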
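NFA for the union of two regular expressions: r = s|t
 start: state i; accepting: state f
 state i: on ε, go to the start state of N(s) or the start state of N(t)
 accepting state of N(s): on ε, go to state f
 accepting state of N(t): on ε, go to state f
 Example: a|b
  start: state 0; accepting: state 5
  state 0: on ε, go to state 1 or state 3
  state 1: on a, go to state 2
  state 3: on b, go to state 4
  state 2: on ε, go to state 5
  state 4: on ε, go to state 5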
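NFA for the closure of a regular expression: N(s)*
 start: state i; accepting: state f
 state i: on ε, go to the start state of N(s) or to state f
 accepting state of N(s): on ε, go to the start state of N(s) or to state f
 Example: (a|b)*
  start: state 0; accepting: state 7
  state 0: on ε, go to state 1 or state 7
  state 1: on ε, go to state 2 or state 4
  state 2: on a, go to state 3
  state 4: on b, go to state 5
  state 3: on ε, go to state 6
  state 5: on ε, go to state 6
  state 6: on ε, go to state 1 or state 7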

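NFA for (a|b)*abb#
 start: state 0; accepting: state 11
 state 0: on ε, go to state 1 or state 7
 state 1: on ε, go to state 2 or state 4
 state 2: on a, go to state 3
 state 4: on b, go to state 5
 state 3: on ε, go to state 6
 state 5: on ε, go to state 6
 state 6: on ε, go to state 1 or state 7
 state 7: on a, go to state 8
 state 8: on b, go to state 9
 state 9: on b, go to state 10
 state 10: on #, go to state 11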
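The basis and the three inductive cases can be put together in a short sketch of the construction. Assumptions: the regular expression is given as a nested tuple rather than parsed from text, ε-edges are labeled `None`, state numbers will not match the figures, and all names (`NFA`, `build`, `thompson`, `accepts`) are illustrative.

```python
from collections import defaultdict

EPS = None  # label used for ε-edges in this sketch

class NFA:
    def __init__(self):
        self.edges = defaultdict(set)   # (state, label) -> set of next states
        self.count = 0

    def new_state(self):
        self.count += 1
        return self.count - 1

    def add(self, src, label, dst):
        self.edges[(src, label)].add(dst)

def build(nfa, node, start, final):
    """Add the fragment for node between the given start and final states."""
    kind = node[0]
    if kind == "eps":                       # basis: ε
        nfa.add(start, EPS, final)
    elif kind == "sym":                     # basis: a single symbol
        nfa.add(start, node[1], final)
    elif kind == "cat":                     # concatenation shares the middle state
        mid = nfa.new_state()
        build(nfa, node[1], start, mid)
        build(nfa, node[2], mid, final)
    elif kind == "alt":                     # union: ε into and out of each branch
        for branch in node[1:]:
            s, f = nfa.new_state(), nfa.new_state()
            nfa.add(start, EPS, s)
            build(nfa, branch, s, f)
            nfa.add(f, EPS, final)
    elif kind == "star":                    # closure: skip edge and loop-back edge
        s, f = nfa.new_state(), nfa.new_state()
        nfa.add(start, EPS, s)
        nfa.add(start, EPS, final)
        build(nfa, node[1], s, f)
        nfa.add(f, EPS, s)
        nfa.add(f, EPS, final)
    else:
        raise ValueError(f"unknown node kind: {kind}")

def thompson(node):
    nfa = NFA()
    start, final = nfa.new_state(), nfa.new_state()
    build(nfa, node, start, final)
    return nfa, start, final

def eps_closure(nfa, states):
    """All states reachable from the given states using ε-edges only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.edges.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(node, text):
    nfa, start, final = thompson(node)
    current = eps_closure(nfa, {start})
    for ch in text:
        moved = {t for s in current for t in nfa.edges.get((s, ch), ())}
        current = eps_closure(nfa, moved)
    return final in current

A, B = ("sym", "a"), ("sym", "b")
ABB = ("cat", ("cat", ("cat", ("star", ("alt", A, B)), A), B), B)   # (a|b)*abb
```

For `ABB` the sketch allocates 11 states, the same count as states 0 to 10 in the figure for (a|b)*abb, although they are numbered in a different order.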
