Compilers
David Walker
Outline
Last Week: Introduction to ML
Today: Lexical Analysis
Reading: Chapter 2 of Appel
[Figure: compiler front end. Lexer -> stream of tokens -> Parser -> abstract syntax -> Type Checker]
Lexical Analysis
Lexical Analysis: Breaks stream of ASCII
characters (source) into tokens
Token: An atomic unit of program syntax
i.e., a word as opposed to a sentence
Type:     Token:
ID        ID(foo), ID(x), ...
REAL      REAL(10.45), REAL(3.14), ...
SEMI      SEMI
LPAREN    LPAREN
NUM       NUM(50), NUM(100)
IF        IF
Example: tokens are produced left to right as the source is scanned:
ID(x)
ID(x) ASSIGN
ID(x) ASSIGN LPAREN ID(y) PLUS REAL(4.0) RPAREN SEMI
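The token stream above can be reproduced with a small hand-rolled tokenizer. This is only a sketch in Python: the token names come from the slide, but the input text `x = (y + 4.0);`, the spelling of ASSIGN as `=`, and the exact patterns are assumptions made for illustration.

```python
import re

# Token rules in priority order. Python's alternation takes the first
# alternative that matches, so REAL is listed before NUM.
# The patterns and the "=" spelling of ASSIGN are assumptions.
TOKEN_RULES = [
    ("REAL",   r"[0-9]+\.[0-9]+"),
    ("NUM",    r"[0-9]+"),
    ("ID",     r"[a-zA-Z][a-zA-Z0-9]*"),
    ("ASSIGN", r"="),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("PLUS",   r"\+"),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_RULES))
VALUED = {"ID", "REAL", "NUM"}  # token types that carry their matched text


def tokenize(source):
    """Return the list of tokens for source, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"illegal character {source[pos]!r} at {pos}")
        kind = m.lastgroup
        if kind != "SKIP":
            tokens.append(f"{kind}({m.group()})" if kind in VALUED else kind)
        pos = m.end()
    return tokens


print(" ".join(tokenize("x = (y + 4.0);")))
# prints: ID(x) ASSIGN LPAREN ID(y) PLUS REAL(4.0) RPAREN SEMI
```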
Lexer Implementation
Implementation Options:
1. Write a Lexer from scratch
Boring, error-prone and too much work
[Figure: Lexer Specification -> lexer generator -> Lexer -> stream of tokens]
Some Definitions
We want to define the language of legal tokens that our lexer can recognize.
Alphabet: a collection of symbols (ASCII is an alphabet)
String: a finite sequence of symbols taken from our alphabet
Language of legal tokens: a set of strings
Language of ML keywords: the set of all strings which are ML keywords (FINITE)
Language of ML tokens: the set of all strings which map to ML tokens (INFINITE)
Some people use the word language to mean more general sets:
e.g., ML Language: the set of all strings representing correct ML programs (INFINITE).
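One way to picture the difference between a finite and an infinite language: a finite language can be written out as a set, while an infinite one has to be described by a membership test. The sketch below uses Python; the keyword list is a small illustrative subset (not the full set of ML keywords), and the identifier shape is an assumption.

```python
import re

# A finite language can be listed outright.
# Illustrative subset only, not the complete set of ML keywords.
ML_KEYWORDS = frozenset({"if", "then", "else", "let", "in", "end", "fun", "val"})

# An infinite language cannot be listed, so it is given by a membership test.
# The identifier shape used here is an assumption for the example.
IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9_']*")


def in_keyword_language(s):
    """Membership in a finite language is a set lookup."""
    return s in ML_KEYWORDS


def in_identifier_language(s):
    """Membership in an infinite language is decided by a recognizer."""
    return IDENTIFIER.fullmatch(s) is not None
```

Identifiers of any length pass the second test, which is why that language is infinite even though every individual string in it is finite.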
Regular Expressions
Integers begin with an optional minus sign,
continue with a sequence of digits
Regular Expression:
(- | ε) (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*
So writing (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
and even worse (a | b | c | ...) gets
tedious...
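The integer expression can be tried out directly. The sketch below transcribes it into Python's `re` syntax, writing the empty string ε as an empty alternative. It also shows a variant that requires at least one digit, since the starred form as written accepts the empty string and a lone minus sign. The names `INTEGER_STAR` and `INTEGER_PLUS` are made up for this example.

```python
import re

DIGIT = "(0|1|2|3|4|5|6|7|8|9)"

# (- | ε) digit*  : the expression as written on the slide
INTEGER_STAR = re.compile(f"(-|){DIGIT}*")

# (- | ε) digit digit*  : a variant that insists on at least one digit
INTEGER_PLUS = re.compile(f"(-|){DIGIT}{DIGIT}*")


def matches(regex, s):
    """True when the whole string s is in the language of regex."""
    return regex.fullmatch(s) is not None


for text in ["42", "-42", "", "-", "4.0"]:
    print(repr(text), matches(INTEGER_STAR, text), matches(INTEGER_PLUS, text))
```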
Regular Expressions
common abbreviations:
[a-c]   == (a | b | c)
.       == any character except \n
\n      == new line character
a+      == one or more
a?      == zero or one
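These abbreviations behave the same way in most regular expression dialects. As a quick check, the sketch below runs each one through Python's `re` module, which is used here only as a convenient stand-in for ML-Lex; the helper `full` and the example strings are chosen for illustration.

```python
import re


def full(pattern, s):
    """True when pattern matches the whole string s."""
    return re.fullmatch(pattern, s) is not None


# (pattern, input, expected result)
EXAMPLES = [
    ("[a-c]", "b", True),      # same as (a|b|c)
    ("(a|b|c)", "b", True),
    ("[a-c]", "d", False),
    (".", "x", True),          # any character ...
    (".", "\n", False),        # ... except newline
    (r"\n", "\n", True),       # the newline character itself
    ("a+", "aaa", True),       # one or more
    ("a+", "", False),
    ("a?", "", True),          # zero or one
    ("a?", "a", True),
    ("a?", "aa", False),
]

for pattern, text, expected in EXAMPLES:
    assert full(pattern, text) == expected, (pattern, text)
```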
How do we tokenize:
foobar ==>
if ==>
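Lexer generators in the lex family, ML-Lex included, conventionally settle such questions with two rules: prefer the longest match, and on a tie prefer the rule listed first. The sketch below applies those two rules in Python. The rule set (an `IF` keyword and an `ID` pattern of lowercase letters) is an assumption for illustration.

```python
import re

# Rules in priority order: on a tie in length, the earlier rule wins.
# Both rules are assumptions chosen to mirror the question above.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z]+")),
]


def next_token(source, pos=0):
    """Return (kind, text) for the longest match at pos; first rule wins ties."""
    best = None  # (length, kind, text)
    for kind, regex in RULES:
        m = regex.match(source, pos)
        # Strictly greater, so an earlier rule keeps the win on equal length.
        if m and (best is None or len(m.group()) > best[0]):
            best = (len(m.group()), kind, m.group())
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best[1], best[2]


print(next_token("foobar"))  # ('ID', 'foobar'): longest match, not several shorter tokens
print(next_token("if"))      # ('IF', 'if'): same length as ID, earlier rule wins
print(next_token("iffy"))    # ('ID', 'iffy'): the longer ID match beats the keyword
```

Each Python pattern here happens to return its own longest match. That is not true of every Python regular expression, so a real generator works on automata rather than on a backtracking matcher.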
ML-Lex Specification
Lexical specification consists of 3 parts:
User Declarations
%%
ML-LEX Definitions
%%
Rules
User Declarations
User Declarations:
User can define various values that are
available to the action fragments.
Two values must be defined in this section:
type lexresult: type of the value returned by each rule action.
fun eof (): called by the lexer when the end of the input stream is reached.
ML-LEX Definitions
ML-LEX Definitions:
User can define regular expression
abbreviations:
DIGITS = [0-9]+;
LETTER = [a-zA-Z];
Rules
Rules:
<lexer_list> regular_expression => (action.code) ;
Rules
Rule actions can use any value defined in the
User Declarations section, including
type lexresult
type of value returned by each rule action
special variables:
yytext: input substring matched by regular expression
yypos: file position of the beginning of matched string
continue (): used to call the lexer recursively
A Simple Lexer
datatype token = Num of int | Id of string | IF | THEN | ELSE | EOF
type lexresult = token   (* mandatory *)
fun eof () = EOF         (* mandatory *)
fun itos s = case Int.fromString s of SOME x => x | NONE => raise Fail "itos: not an integer"
%%
NUM = [1-9][0-9]*;
ID = [a-zA-Z] ([a-zA-Z] | {NUM})*;
%%
if      => (IF);
then    => (THEN);
else    => (ELSE);
{NUM}   => (Num (itos yytext));
{ID}    => (Id yytext);
(* mandatory *)
(* mandatory *)
%%
%s COMMENT;
%%
<INITIAL> if        => ();
<INITIAL> [a-z]+    => ();
<INITIAL> "(*"      => (YYBEGIN COMMENT; continue ());
<COMMENT> "*)"      => (YYBEGIN INITIAL; continue ());
<COMMENT> \n | .    => (continue ());
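The effect of the INITIAL and COMMENT start states can be sketched outside ML-Lex as a lexer that keeps a current state and only tries the rules listed for that state. The Python below is an illustration, not a translation of ML-Lex internals. It reports the words seen outside comments so that the behaviour is observable, and the whitespace rule is an assumption that the specification above does not contain.

```python
import re

# One rule list per start state: (regex, next_state, emit).
# next_state=None means "stay in the current state".
RULES = {
    "INITIAL": [
        (re.compile(r"\(\*"), "COMMENT", False),  # "(*" enters the comment state
        (re.compile(r"[a-z]+"), None, True),      # words, including "if"
        (re.compile(r"\s+"), None, False),        # whitespace skipping: an assumption
    ],
    "COMMENT": [
        (re.compile(r"\*\)"), "INITIAL", False),  # "*)" returns to the initial state
        (re.compile(r"(?s)."), None, False),      # \n | . : discard comment text
    ],
}


def words_outside_comments(source):
    """Scan source and return the words matched while in the INITIAL state."""
    state, pos, words = "INITIAL", 0, []
    while pos < len(source):
        for regex, next_state, emit in RULES[state]:
            m = regex.match(source, pos)
            if m:
                break
        else:
            raise ValueError(f"no rule matches at position {pos} in state {state}")
        if emit:
            words.append(m.group())
        if next_state is not None:
            state = next_state  # plays the role of YYBEGIN
        pos = m.end()
    return words


print(words_outside_comments("if (* hidden if\n text *) then"))
# prints: ['if', 'then']
```

As in the specification, comments do not nest: the first `*)` always returns to the initial state.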
(* mandatory *)
(* mandatory *)
%%
%s COMMENT;
INT = [1-9] [0-9]*;
%%
<INITIAL> if        => ("IF");
<INITIAL> then      => ("THEN");
<INITIAL> {INT}     => ("INT(" ^ yytext ^ ")");
<INITIAL> "(*"      => (YYBEGIN COMMENT; continue ());
<COMMENT> "*)"      => (YYBEGIN INITIAL; continue ());
<COMMENT> \n | .    => (continue ());
Implementing Lexers
By compiling, of course:
convert REs into non-deterministic finite automata
convert non-deterministic finite automata into deterministic finite automata
convert deterministic finite automata into a blazingly fast table-driven algorithm
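The middle step, turning a non-deterministic automaton into a deterministic one, is usually done with the subset construction: each DFA state is a set of NFA states. The Python sketch below is a minimal version under stated assumptions. The NFA is written by hand for an optional minus sign followed by one or more digits, and its state numbers and dictionary encoding are choices made for this example.

```python
from collections import deque

EPS = None  # label used for epsilon moves


def eps_closure(nfa, states):
    """All NFA states reachable from states using only epsilon moves."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(nfa, start, accepting, alphabet):
    """Build a DFA whose states are frozensets of NFA states."""
    d_start = eps_closure(nfa, {start})
    table, seen, queue = {}, {d_start}, deque([d_start])
    while queue:
        current = queue.popleft()
        for ch in alphabet:
            moved = {t for s in current for t in nfa.get((s, ch), ())}
            if not moved:
                continue  # no transition: the DFA rejects here
            target = eps_closure(nfa, moved)
            table[(current, ch)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    d_accepting = {s for s in seen if s & accepting}
    return d_start, table, d_accepting


def dfa_accepts(dfa, text):
    """Run the DFA over text and report whether it ends in an accepting state."""
    state, table, accepting = dfa[0], dfa[1], dfa[2]
    for ch in text:
        state = table.get((state, ch))
        if state is None:
            return False
    return state in accepting


# Hand-written NFA: 0 --'-' or epsilon--> 1 --digit--> 2 --digit--> 2
DIGITS = "0123456789"
NFA = {(0, "-"): {1}, (0, EPS): {1}}
NFA.update({(1, d): {2} for d in DIGITS})
NFA.update({(2, d): {2} for d in DIGITS})
DFA = subset_construction(NFA, start=0, accepting={2}, alphabet="-" + DIGITS)
```

For this NFA the construction yields three DFA states: {0, 1}, {1} and {2}.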
Table-driven algorithm
[Figure: a DFA drawn next to its transition table. Surviving labels: states 1, 2, 4; input symbols a, b, c, +]
[Figure: a DFA with state 1 and transitions labeled a-z]
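A table-driven recognizer stores the DFA as a two-dimensional array indexed by state and character class, so each input character costs one lookup. The Python sketch below assumes a hypothetical DFA for `[a-z][a-z0-9]*`, not the automaton drawn on the slides. It also tracks the last accepting position so that it returns the longest match.

```python
# Hypothetical DFA for [a-z][a-z0-9]*; DEAD is the error state.
DEAD, START, IN_ID = 0, 1, 2
ACCEPTING = {IN_ID}


def char_class(ch):
    """Map a character to a column of the table."""
    if "a" <= ch <= "z":
        return 0  # letter
    if "0" <= ch <= "9":
        return 1  # digit
    return 2      # anything else


# TABLE[state][char_class] -> next state
TABLE = [
    [DEAD,  DEAD,  DEAD],  # DEAD
    [IN_ID, DEAD,  DEAD],  # START
    [IN_ID, IN_ID, DEAD],  # IN_ID
]


def longest_match(source, pos=0):
    """Return the longest accepted prefix of source[pos:], or None."""
    state, last_accept = START, None
    i = pos
    while i < len(source) and state != DEAD:
        state = TABLE[state][char_class(source[i])]
        i += 1
        if state in ACCEPTING:
            last_accept = i  # remember the end of the best match so far
    return None if last_accept is None else source[pos:last_accept]


print(longest_match("foo42+bar"))  # prints: foo42
```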
Summary
A Lexer:
input: stream of characters
output: stream of tokens