
15-150 Fall 2014

Stephen Brookes

regular expressions
using a datatype
structural recursion and induction
higher-order functional programming

Regular expression
From Wikipedia, the free encyclopedia (adapted by me)

A regular expression is a sequence of characters that forms a search pattern,
mainly for use in pattern matching with strings. The concept arose in the 1950s,
when the American mathematician Stephen Kleene formalized the description of a
regular language, and came into common use with the Unix utilities ed, an editor,
and grep (global regular expression print), a filter.
Each character in a regular expression is either a metacharacter with a special meaning,
or a literal character with its usual meaning.
A regular expression can be used to find the same word spelled different ways,
or multiple occurrences of a word, or a word appearing in a specific context.
regular expression               strings
seriali(s+z)e                    matches "serialise" and "serialize"
Standard ML( )*{of New Jersey}   matches "Standard ML" and "Standard ML of New Jersey"

Regular expressions are built from

empty string
literal characters     c (including space)
concatenation          R1R2
alternation            R1+R2 and R1{R2} (= R1 + R1R2)
iteration              R*

R1{R2} is redundant

in practice
DNA sequencing

Ranking search results

Spam, spam, spam,

regular expressions are widely used

Spam detection software has identified this incoming email as possible spam.

rule name            description
-----------------------------------------------------------------------
NO_REAL_NAME         From: does not include a real name
UNDISC_RECIPS        Says To: "undisclosed-recipients"
INVALID_DATE         Invalid Date: header (not RFC 2822)
DATE_IN_PAST_12_24   Date: is 12 to 24 hours before Received: date

X-MAS BONANZA LOAN OFFER !! PINNACLE LOANS CORPORATION is offering loans
at 0.2% interest rate without any collateral,We offer consolidation loan, student loan, mortgage,and
business loans, Do you need Loan for individual or corporate concern?
PINNACLE LOANS CORPORATION ,DO YOU NEED AN URGENT LOAN ? GET UP TO $500,000.00
USD:::PERSONAL AND BUSINESS LOANS:::: CONTACT US ONLY ON OUR WORK EMAIL

regular expressions
R ::= 0 | 1 | c | R1+R2 | R1 R2 | R*

datatype regexp = Zero | One | Char of char
                | Plus of regexp * regexp
                | Times of regexp * regexp
                | Star of regexp

Zero  : regexp                      empty language
One   : regexp                      the empty string
Char  : char -> regexp              literal characters
Plus  : regexp * regexp -> regexp   alternation
Times : regexp * regexp -> regexp   concatenation
Star  : regexp -> regexp            iteration

regular expressions
The set of values of type regexp
is inductively characteri{s+z}ed
by the following rules

Zero and One are values

When c is a character, Char c is a value

If R1 and R2 are values, so are
Plus(R1, R2) and Times(R1, R2)

If R is a value, so is Star R

representation
regular expression   regexp value

foo                  Times(Char #"f", Times(Char #"o", Times(Char #"o", One)))
                     Times(Char #"f", Times(Char #"o", Char #"o"))

(a+b)                Plus(Char #"a", Char #"b")

(a+b)*               Star(Plus(Char #"a", Char #"b"))

there may be many ways to represent the same regular expression
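
As a small sketch, the representations above can be written down directly as ML values. The names foo1, foo2, aOrB and aOrBStar are made up for this illustration; only the datatype comes from the slides.

```sml
datatype regexp = Zero | One | Char of char
                | Plus of regexp * regexp
                | Times of regexp * regexp
                | Star of regexp

(* two different values that both represent the regular expression foo *)
val foo1 = Times (Char #"f", Times (Char #"o", Times (Char #"o", One)))
val foo2 = Times (Char #"f", Times (Char #"o", Char #"o"))

(* (a+b) and (a+b)* *)
val aOrB = Plus (Char #"a", Char #"b")
val aOrBStar = Star aOrB
```

Note that foo1 and foo2 are different values of type regexp, even though they stand for the same regular expression.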

regular languages
A regular expression R

denotes a language L(R) ... a set of char lists
L(Zero) = { }
L(One) = { [ ] }
L(Char c) = { [c] }
L(Plus(R1,R2)) = L(R1) ∪ L(R2)
L(Times(R1,R2)) = {L1@L2 | L1 ∈ L(R1), L2 ∈ L(R2)}
L(Star(R)) = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L2 ∈ L(Star R)}

comments
L(Star(R)) = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L2 ∈ L(Star R)}

This is a recursive description of L(Star(R))

We mean that L(Star(R)) is the smallest set S
of char lists that satisfies the equation

S = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L2 ∈ S}

Hence L(Star(R)) consists of all lists of form

L1@L2@...@Ln where n ≥ 0 and each Li ∈ L(R)

string/char list
explode : string -> char list

implode : char list -> string

strings and character lists are in 1-1 correspondence

explode "foo" = [#"f", #"o", #"o"]
implode [#"f", #"o", #"o"] = "foo"

We say that string s is in the language of R
iff (explode s) ∈ L(R)

explode(s1^s2) = (explode s1) @ (explode s2)
explode("") = [ ]
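
The facts above can be checked at the top level with a few bindings. This is only an illustrative sketch: the value names are invented here, while explode, implode, ^, @ and null are the standard functions.

```sml
val cs = explode "foo"                 (* the char list [#"f", #"o", #"o"] *)
val s  = implode [#"f", #"o", #"o"]    (* the string "foo" *)

(* implode undoes explode *)
val roundTrip = implode (explode "Standard ML") = "Standard ML"

(* explode turns string concatenation into list append *)
val respectsAppend = explode ("ab" ^ "cd") = explode "ab" @ explode "cd"

(* the empty string corresponds to the empty list *)
val emptyCase = null (explode "")
```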

languages
regular expression   regexp value                                    strings in language

foo                  Times(Char #"f", Times(Char #"o", Char #"o"))   {"foo"}

(a+b)                Plus(Char #"a", Char #"b")                      {"a", "b"}

(a+b)*               Star(Plus(Char #"a", Char #"b"))                {"", "a", "b", "aa", "ab", "bb", ...}

building a regexp
EXERCISE

Using foldr and map, define a function


string2reg : string -> regexp

such that for all s : string,


L (string2reg s) = {explode s}
HINT: Use Times, One, Char, explode
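
One possible solution, offered as a sketch rather than as the intended answer: since Times has type regexp * regexp -> regexp, it can be passed to foldr directly, with One as the base case.

```sml
datatype regexp = Zero | One | Char of char
                | Plus of regexp * regexp
                | Times of regexp * regexp
                | Star of regexp

(* string2reg s builds Times(Char c1, Times(Char c2, ... One))
   for the characters c1, c2, ... of s *)
fun string2reg (s : string) : regexp =
  foldr Times One (map Char (explode s))
```

With this definition, string2reg "" is One, whose language is {[ ]}, as the specification requires.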

specification
Write a function
accepts : regexp -> char list -> bool
to check if L ∈ L(R)

problem
Not easy to solve directly

For Times(R1,R2) it's possible to
generate-and-test all splits L1@L2 of L

But this can be very costly



And what about Star R?
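
To make the cost concrete, here is a rough sketch of the generate-and-test idea. The helpers splits and someSplit are hypothetical names, not part of the lecture code. A list of length n has n+1 splits, and every one of them may have to be tested, recursively, at each Times node.

```sml
(* all ways to write L as L1 @ L2, as a list of pairs (L1, L2) *)
fun splits [] = [([], [])]
  | splits (x :: xs) =
      ([], x :: xs) :: map (fn (l1, l2) => (x :: l1, l2)) (splits xs)

(* generate-and-test: is there a split whose prefix passes inR1
   and whose suffix passes inR2? *)
fun someSplit inR1 inR2 L =
  List.exists (fn (l1, l2) => inR1 l1 andalso inR2 l2) (splits L)
```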

solution
Generalize the problem...
Does L have a prefix in L(R)
with a suffix that satisfies a success condition?
L = L1@L2
L1 is a prefix of L, with suffix L2
success? some total function from char list to bool

intuition
match : regexp -> char list -> (char list -> bool) -> bool
(p is the success condition)
match R L p = true
iff L has a split L=L1@L2 with
L1 ∈ L(R) & p(L2)=true
(a prefix of L is in L(R), and the rest of L satisfies p)

the generalized problem


Write an ML function
match : regexp -> char list -> (char list -> bool) -> bool
such that for all values R : regexp, the following property P(R) holds:
For all L, and all total p
(a) match R L p = true
    if there are L1, L2 such that
    L=L1@L2 & L1 ∈ L(R) & p(L2)=true
(b) match R L p = false
    otherwise
NOTE: P(R) implies: if p is total, so is (fn L => match R L p).

how that helps


Can then define
accepts : regexp -> char list -> bool
fun accepts R L = match R L null
REQUIRES: true

ENSURES:

accepts R L = true if L is in L(R)

accepts R L = false otherwise

design
match will use structural recursion on regexp
fun match Zero L p = (* easy *)
  | match One L p = (* easy *)
  | match (Char c) L p = (* easy *)
  | match (Plus(R1, R2)) L p =
      (* use match R1 and match R2 *)
  | match (Times(R1, R2)) L p =
      (* use match R1 and match R2 *)
  | match (Star R) L p =
      (* use match R *)

use spec as guide

Zero
L(Zero) = { }
match Zero L p = false

no prefix of L is in L(Zero)

One
L(One) = { [ ] }
match One L p = p L
the only prefix worth checking is [ ]

Char c
L(Char c) = { [c] }

the only possible prefix would be [c]

match (Char c) L p =
  case L of
    [ ] => false
  | x::L' => (c=x) andalso p(L')

Equivalently, in clausal form:

match (Char c) [ ] p = false
match (Char c) (x::L) p = (c=x) andalso p(L)

Plus
L(Plus(R1,R2)) = L(R1) ∪ L(R2)
match (Plus(R1, R2)) L p =
(match R1 L p) orelse (match R2 L p)
why is this the right thing to do?

Plus

property (a)

L(Plus(R1,R2)) = L(R1) ∪ L(R2)

match (Plus(R1, R2)) L p =
(match R1 L p) orelse (match R2 L p)

=>* true
  if match R1 L p =>* true
     (L has a prefix in L(R1) with suffix satisfying p)
  or match R1 L p =>* false & match R2 L p =>* true
     (L has a prefix in L(R2) with suffix satisfying p)

=>* true if L has a prefix in L(R1) ∪ L(R2) with suffix satisfying p

Plus

property (b)

L(Plus(R1,R2)) = L(R1) ∪ L(R2)

match (Plus(R1, R2)) L p =
(match R1 L p) orelse (match R2 L p)

=>* false
  if match R1 L p =>* false
     (L has no prefix in L(R1) with suffix satisfying p)
   & match R2 L p =>* false
     (L has no prefix in L(R2) with suffix satisfying p)

=>* false if L has no prefix in L(R1) ∪ L(R2) with suffix satisfying p

Times
L(Times(R1,R2)) = {L1@L2 | L1 ∈ L(R1), L2 ∈ L(R2)}

match (Times(R1,R2)) L p =
match R1 L (fn L' => match R2 L' p)

The success continuation (fn L' => match R2 L' p) says what to do
when R1 matches a prefix: try matching R2 on the suffix

Star
R* = 1 + R R*
match (Star R) L p =
match (Plus(One, Times(R, Star R))) L p

Yes, this should be true.

But not much good as a definition
... not structurally recursive!

Star
L(Star(R)) = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L2 ∈ L(Star R)}

match (Star R) L p =
p(L) orelse match R L (fn L' => match (Star R) L' p)

The second disjunct checks for a match with R
followed by a match with Star R

match (Star R) uses match R
and calls itself recursively on a suffix

Star
match (Star R) L p =
p(L) orelse match R L (fn L' => match (Star R) L' p)

match (Star (Char c)) [ ] null = true
match (Star (Char c)) [c] null = true
match (Star (Char c)) [c,c] null = true

L(Star(Char c)) = {[ ], [c], [c,c], ...}

Star
match (Star R) L p =
p(L) orelse match R L (fn L' => match (Star R) L' p)

match (Star One) [c] null
=>* ...

Should be false
L(Star One) = {[ ]}

diagnosis
L(One) = {[ ]}

null [c] = false

match (Star One) [c] null
=>+ null [c] orelse match One [c] (fn L => match (Star One) L null)
=>+ match One [c] (fn L => match (Star One) L null)
=>+ (fn L => match (Star One) L null) [c]
=>+ match (Star One) [c] null

match (Star One) [c] null loops forever

Star (corrected)
L(Star(R)) = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L2 ∈ L(Star R)}
           = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L1 ≠ [ ], L2 ∈ L(Star R)}

match (Star R) L p =
  p(L) orelse match R L (fn L' => L' <> L andalso
                                  match (Star R) L' p)

The test L' <> L checks for a non-trivial match with R
(one that consumes at least one character),
followed by a match with Star R

analysis
match (Star One) [c] null
=>* null [c] orelse match One [c] (fn ...)
=>* match One [c] (fn L' => L' <> [c] andalso ...)
=>* (fn L' => L' <> [c] andalso ...) [c]
=>* [c] <> [c] andalso ...
=>* false
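
Putting the cases together gives the following sketch of the complete matcher. It collects the clauses derived on the preceding slides, using the corrected Star clause. The primed name L' for the suffix passed to a continuation is a naming choice made here for readability.

```sml
datatype regexp = Zero | One | Char of char
                | Plus of regexp * regexp
                | Times of regexp * regexp
                | Star of regexp

(* match R L p = true iff L = L1 @ L2 with L1 in L(R) and p L2 = true *)
fun match Zero _ _ = false
  | match One L p = p L
  | match (Char c) [] _ = false
  | match (Char c) (x :: L) p = (c = x) andalso p L
  | match (Plus (R1, R2)) L p = match R1 L p orelse match R2 L p
  | match (Times (R1, R2)) L p = match R1 L (fn L' => match R2 L' p)
  | match (Star R) L p =
      (* insist that R consumes something before recursing on Star R *)
      p L orelse match R L (fn L' => L' <> L andalso match (Star R) L' p)

(* accepts R L = true iff L is in L(R): the suffix must be empty *)
fun accepts R L = match R L null
```

For example, accepts (Star One) [#"c"] now evaluates to false instead of looping, and accepts (Star (Plus (Char #"a", Char #"b"))) (explode "abba") evaluates to true.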

correctness?
Let P(R) be:

For all values L : char list
and all total functions p : char list -> bool,
(a) match R L p = true
    if there are L1, L2 such that
    L=L1@L2 & L1 ∈ L(R) & p(L2)=true
(b) match R L p = false
    otherwise
THEOREM

For all values R : regexp, P(R) holds.

proof outline
By structural induction on R

Base cases: Zero, One, Char c

Easier inductive cases: Plus(R1,R2), Times(R1,R2)
Use P(R1) and P(R2) as hypotheses
Key fact: P(R2) implies fn L => match R2 L p is total

Tricky inductive case: Star(R)
Use P(R) and induction on L
(why is this needed?)

reflection

Most regular expressions built with Zero
denote the empty language

match may be slow, only to return false

There's a Zero-free regexp for the same language

Write a function
DeZero : regexp -> regexp
to remove Zero (except at top level)
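
A possible sketch of such a function follows. It is one way to do it, not necessarily the intended solution. It simplifies bottom-up using three facts that follow from the definition of L: L(Plus(Zero,R)) = L(Plus(R,Zero)) = L(R), L(Times(Zero,R)) = L(Times(R,Zero)) = { }, and L(Star Zero) = {[ ]} = L(One). The result is either Zero itself or contains no Zero at all.

```sml
datatype regexp = Zero | One | Char of char
                | Plus of regexp * regexp
                | Times of regexp * regexp
                | Star of regexp

(* DeZero R denotes the same language as R;
   the result is Zero, or else contains no Zero *)
fun DeZero Zero = Zero
  | DeZero One = One
  | DeZero (Char c) = Char c
  | DeZero (Plus (R1, R2)) =
      (case (DeZero R1, DeZero R2) of
           (Zero, R2') => R2'
         | (R1', Zero) => R1'
         | (R1', R2') => Plus (R1', R2'))
  | DeZero (Times (R1, R2)) =
      (case (DeZero R1, DeZero R2) of
           (Zero, _) => Zero
         | (_, Zero) => Zero
         | (R1', R2') => Times (R1', R2'))
  | DeZero (Star R) =
      (case DeZero R of
           Zero => One
         | R' => Star R')
```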
