Stephen Brookes
regular expressions
using a datatype
structural recursion and induction
higher-order functional programming
Regular expression
From Wikipedia, the free encyclopedia (adapted by me)
A regular expression is a sequence of characters that forms a search pattern,
mainly for use in pattern matching with strings. The concept arose in the 1950s,
when the American mathematician Stephen Kleene formalized the description of a
regular language, and came into common use with the Unix utilities ed, an editor,
and grep (global regular expression print), a filter.
Each character in a regular expression is either a metacharacter with a special meaning,
or a literal character with its usual meaning.
A regular expression can be used to find the same word spelled different ways,
or multiple occurrences of a word, or a word appearing in a specific context.
regular expression                 strings
seriali(s+z)e                      matches "serialise" and "serialize"
Standard ML( )*{of New Jersey}     matches "Standard ML" and "Standard ML of New Jersey"
R1{R2} is redundant
in practice
DNA sequencing
Ranking search results
Spam, spam, spam,
regular expressions are widely used
Spam detection software has identified this incoming email as possible spam.
rule name             description
-----------------------------------------------------------------------
NO_REAL_NAME          From: does not include a real name
UNDISC_RECIPS         Says To: "undisclosed-recipients"
INVALID_DATE          Invalid Date: header (not RFC 2822)
DATE_IN_PAST_12_24    Date: is 12 to 24 hours before Received: date
X-MAS BONANZA LOAN OFFER !! PINNACLE LOANS CORPORATION is offering loans
at 0.2% interest rate without any collateral,We offer consolidation loan, student loan, mortgage,and
business loans, Do you need Loan for individual or corporate concern?
PINNACLE LOANS CORPORATION ,DO YOU NEED AN URGENT LOAN ? GET UP TO $500,000.00
USD:::PERSONAL AND BUSINESS LOANS:::: CONTACT US ONLY ON OUR WORK EMAIL
regular expressions
R ::= 0 | 1 | c | R1+R2 | R1 R2 | R*
0        empty language
1        the empty string
c        literal characters
R1+R2    alternation
R1 R2    concatenation
R*       iteration
regular expressions
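The set of values of type regexp
is inductively characteri{s+z}ed
by the following rules
Zero and One are values
Char c is a value, for each value c : char
If R1 and R2 are values, so are Plus(R1,R2) and Times(R1,R2)
If R is a value, so is Star R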
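The rules for building regexp values correspond to a datatype declaration. The slide showing the declaration did not survive extraction, so the following is a sketch: the constructor names are the ones used throughout these slides, while the layout and comments are assumptions.

```sml
(* A sketch of the regexp datatype, one constructor per grammar rule. *)
datatype regexp =
    Zero                      (* 0     : the empty language      *)
  | One                       (* 1     : only the empty string   *)
  | Char of char              (* c     : a literal character     *)
  | Plus of regexp * regexp   (* R1+R2 : alternation             *)
  | Times of regexp * regexp  (* R1 R2 : concatenation           *)
  | Star of regexp            (* R*    : iteration               *)

(* Values are built by applying constructors to existing values. *)
val r : regexp = Star (Plus (Char #"a", Char #"b"))
```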
representation
regular expression
foo
regexp value
(a+b)
(a+b)*
There may be many ways to represent the same regular expression.
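The regexp values for these three expressions can be written out as follows. This is a sketch: the value names (fooRight, fooLeft, aOrB, aOrBStar) are hypothetical, and the datatype is restated only so that the fragment stands alone. The two versions of foo illustrate how one regular expression can have several representations.

```sml
datatype regexp =
    Zero
  | One
  | Char of char
  | Plus of regexp * regexp
  | Times of regexp * regexp
  | Star of regexp

(* foo: the concatenation can be nested to the right or to the left. *)
val fooRight = Times (Char #"f", Times (Char #"o", Char #"o"))
val fooLeft  = Times (Times (Char #"f", Char #"o"), Char #"o")

(* (a+b) and (a+b)* *)
val aOrB     = Plus (Char #"a", Char #"b")
val aOrBStar = Star aOrB
```

fooRight and fooLeft are different values of type regexp, yet both stand for the regular expression foo.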
regular languages
A regular expression R
denotes a language L(R) ... a set of char lists
L(Zero) = { }
L(One) = { [ ] }
L(Char c) = { [c] }
L(Plus(R1,R2)) = L(R1) ∪ L(R2)
L(Times(R1,R2)) = {L1@L2 | L1 ∈ L(R1), L2 ∈ L(R2)}
L(Star(R)) = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L2 ∈ L(Star R)}
comments
L(Star(R)) = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L2 ∈ L(Star R)}
string/char list
explode : string -> char list
implode : char list -> string
strings and character lists are in 1-1 correspondence
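A small illustration of the correspondence, using the top-level explode and implode functions. The names cs, s, and roundTrip are hypothetical.

```sml
(* explode turns a string into its list of characters ... *)
val cs : char list = explode "foo"

(* ... and implode turns the list back into the same string. *)
val s : string = implode cs

(* Going there and back gives the original string. *)
val roundTrip : bool = implode (explode "serialize") = "serialize"
```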
languages
regular expression
foo
regexp value
strings in language
(a+b)
(a+b)*
{a, b}
{ε, a, b, aa, ab, bb, ...}
building a regexp
EXERCISE
specification
Write a function
accepts : regexp -> char list -> bool
to check if L ∈ L(R)
problem
Not easy to solve directly
For Times(R1, R2) it's possible to ...
solution
Generalize the problem...
Does L have a prefix in L(R)
with a suffix that satisfies a success condition?
L = L1@L2
L1 is a prefix of L, with suffix L2
success condition: some total function from char list to bool
intuition
match : regexp -> char list -> (char list -> bool) -> bool
the third argument, of type char list -> bool, is the success condition
match R L p = true
iff L has a split L=L1@L2 with
L1 ∈ L(R) & p(L2)=true
(a prefix of L is in L(R), and the rest of L satisfies p)
design
match will use structural recursion on regexp
fun match Zero L p = (* easy *)
| match One L p = (* easy *)
| match (Char c) L p = (* easy *)
| match (Plus(R1, R2)) L p = (* use match R1 and match R2 *)
| match (Times(R1, R2)) L p = (* use match R1 and match R2 *)
| match (Star R) L p = (* use match R *)
Zero
L(Zero) = { }
match Zero L p = false
no prefix of L is in L(Zero)
One
L(One) = { [ ] }
match One L p = p L
the only prefix worth checking is [ ]
Char c
L(Char c) = { [c] }
match (Char c) L p =
case L of
[ ] => false
| x::L => (c=x) andalso p(L)
match (Char c) [ ] p = false
match (Char c) (x::L) p = (c=x) andalso p(L)
Plus
L(Plus(R1,R2)) = L(R1) ∪ L(R2)
match (Plus(R1, R2)) L p =
(match R1 L p) orelse (match R2 L p)
why is this the right thing to do?
Plus
property (a)
Plus
property (b)
Times
L(Times(R1,R2)) = {L1@L2 | L1 ∈ L(R1), L2 ∈ L(R2)}
match (Times(R1,R2)) L p =
match R1 L (fn L => match R2 L p)
the success continuation says what to do when R1 matches a prefix...
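A worked evaluation may help show how the continuation hands the remaining suffix from R1 to R2. This is a sketch under assumptions: a and b stand for two characters, and the success condition is null (true only on the empty list), as in the Star examples.

match (Times(Char a, Char b)) [a,b] null
=>* match (Char a) [a,b] (fn L => match (Char b) L null)
=>* (a=a) andalso (fn L => match (Char b) L null) [b]
=>* match (Char b) [b] null
=>* (b=b) andalso null [ ]
=>* true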
Star
R* = 1 + R R*
match (Star R) L p =
match (Plus(One, Times(R, Star R))) L p
Star
L(Star(R)) = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L2 ∈ L(Star R)}
match (Star R) L p =
p(L) orelse match R L (fn L => match (Star R) L p)
check for a match with R followed by a match with Star R
Star
match (Star R) L p =
p(L) orelse match R L (fn L => match (Star R) L p)
match (Star (Char c)) [ ] null = true
match (Star (Char c)) [c] null = true
match (Star (Char c)) [c,c] null = true
Star
match (Star R) L p =
p(L) orelse match R L (fn L => match (Star R) L p)
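match (Star One) [c] null
=>* ... (loops forever)
Should be false, because L(Star One) = {[ ]}
diagnosis
L(One) = {[ ]}, so match One [c] k applies its continuation k to the unchanged list [c].
The continuation then calls match (Star One) [c] null again, and evaluation never makes progress.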
Star (corrected)
L(Star(R)) = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L2 ∈ L(Star R)}
= {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L1 ≠ [ ], L2 ∈ L(Star R)}
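match (Star R) L p =
p(L) orelse match R L (fn L' => L' <> L andalso match (Star R) L' p)
check for a match with R on a non-empty prefix, followed by a match with Star R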
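Putting the cases together gives a complete matcher. The following is a sketch, not necessarily the exact code from the lecture: the Star clause is read off the analysis trace (the continuation rejects a suffix equal to the whole input), accepts is assumed to be match with null as the success condition, and lit is a hypothetical helper for building literal strings.

```sml
datatype regexp =
    Zero
  | One
  | Char of char
  | Plus of regexp * regexp
  | Times of regexp * regexp
  | Star of regexp

(* match R L p = true iff L = L1 @ L2 with L1 in L(R) and p L2 = true *)
fun match Zero _ _ = false
  | match One L p = p L
  | match (Char c) [] _ = false
  | match (Char c) (x :: L) p = (c = x) andalso p L
  | match (Plus (R1, R2)) L p = match R1 L p orelse match R2 L p
  | match (Times (R1, R2)) L p = match R1 L (fn L' => match R2 L' p)
  | match (Star R) L p =
      p L orelse
      (* L' <> L forces R to consume a non-empty prefix, so the
         recursive call is always on a shorter list. *)
      match R L (fn L' => L' <> L andalso match (Star R) L' p)

(* Assumption: L is in L(R) iff R matches a prefix leaving an empty suffix. *)
fun accepts R L = match R L null

(* Hypothetical helper: the concatenation of the characters of s. *)
fun lit s = foldr (fn (c, r) => Times (Char c, r)) One (explode s)

(* seriali(s+z)e and (a+b)* from the earlier slides *)
val serial = Times (lit "seriali", Times (Plus (Char #"s", Char #"z"), Char #"e"))
val abStar = Star (Plus (Char #"a", Char #"b"))

val t1 = accepts serial (explode "serialise")
val t2 = accepts serial (explode "serialize")
val t3 = accepts abStar (explode "abba")
val t4 = accepts (Star One) [#"c"]   (* terminates with the corrected Star clause *)
```

Comparing the suffix L' with the input L needs equality on char list, which SML provides. Without that test, the last call would not terminate.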
analysis
match (Star One) [c] null
=>* null [c] orelse match One [c] (fn ...)
=>* match One [c] (fn L => L <> [c] andalso ...)
=>* (fn L => L <> [c] andalso ...) [c]
=>* [c] <> [c] andalso ...
=>* false
correctness?
Let P(R) be:
For all values L:char list
and all total functions p : char list -> bool,
(a) match R L p = true
if there are L1, L2 such that
L=L1@L2 & L1 ∈ L(R) & p(L2)=true
(b) match R L p = false
otherwise
THEOREM
For all values R : regexp, P(R) holds.
proof outline
By structural induction on R
Base cases: Zero, One, Char c
Easier inductive cases: Plus(R1, R2), Times(R1, R2)
Use P(R1) and P(R2) as hypotheses
Key fact: P(R2) implies
fn L => match R2 L p
is total
reflection