Slides13 Regex PDF

15-150 Fall 2014
Stephen Brookes
regular expressions
using a datatype
structural recursion and induction
higher-order functional programming
Regular expression
From Wikipedia, the free encyclopedia (adapted by me)

A regular expression is a sequence of characters that forms a search pattern,
mainly for use in pattern matching with strings. The concept arose in the 1950s,
when the American mathematician Stephen Kleene formalized the description of a
regular language, and came into common use with the Unix utilities ed, an editor,
and grep (global regular expression print), a filter.
Each character in a regular expression is either a metacharacter with a special meaning,
or a literal character with its usual meaning.
A regular expression can be used to find the same word spelled different ways,
or multiple occurrences of a word, or a word appearing in a specific context.
regular expression                strings
seriali(s+z)e                     matches "serialise" and "serialize"
Standard ML( )*{of New Jersey}    matches "Standard ML" and "Standard ML of New Jersey"

Regular expressions are built from
empty string
literal characters    c (including space)
concatenation         R1R2
alternation           R1+R2 and R1{R2} (= R1 + R1R2)
iteration             R*

R1{R2} is redundant
in practice
regular expressions are widely used
DNA sequencing
Ranking search results
Spam, spam, spam

Spam detection software has identified this incoming email as possible spam.
rule name             description
-----------------------------------------------------------------------
NO_REAL_NAME          From: does not include a real name
UNDISC_RECIPS         Says To: "undisclosed-recipients"
INVALID_DATE          Invalid Date: header (not RFC 2822)
DATE_IN_PAST_12_24    Date: is 12 to 24 hours before Received: date

X-MAS BONANZA LOAN OFFER !! PINNACLE LOANS CORPORATION is offering loans
at 0.2% interest rate without any collateral,We offer consolidation loan, student loan, mortgage,and
business loans, Do you need Loan for individual or corporate concern?
PINNACLE LOANS CORPORATION ,DO YOU NEED AN URGENT LOAN ? GET UP TO $500,000.00
USD:::PERSONAL AND BUSINESS LOANS:::: CONTACT US ONLY ON OUR WORK EMAIL
regular expressions
R ::= 0 | 1 | c | R1+R2 | R1 R2 | R*
datatype regexp = Zero | One | Char of char
                | Plus of regexp * regexp
                | Times of regexp * regexp
                | Star of regexp
Zero  : regexp                       empty language
One   : regexp                       the empty string
Char  : char -> regexp               literal characters
Plus  : regexp * regexp -> regexp    alternation
Times : regexp * regexp -> regexp    concatenation
Star  : regexp -> regexp             iteration
regular expressions
The set of values of type regexp
is inductively characteri{s+z}ed
by the following rules
Zero and One are values
When c is a character, Char c is a value
If R1 and R2 are values, so are Plus(R1, R2) and Times(R1, R2)
If R is a value, so is Star R
representation
regular expression    regexp value
foo                   Times(Char #"f", Times(Char #"o", Times(Char #"o", One)))
                      Times(Char #"f", Times(Char #"o", Char #"o"))
(a+b)                 Plus(Char #"a", Char #"b")
(a+b)*                Star(Plus(Char #"a", Char #"b"))
there may be many ways to represent the same regular expression
regular languages
A regular expression R denotes a language L(R) ... a set of char lists
L(Zero) = { }
L(One) = { [ ] }
L(Char c) = { [c] }
L(Plus(R1,R2)) = L(R1) ∪ L(R2)
L(Times(R1,R2)) = {L1@L2 | L1 ∈ L(R1), L2 ∈ L(R2)}
L(Star(R)) = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L2 ∈ L(Star R)}
comments
This is a recursive description of L(Star(R))
We mean that L(Star(R)) is the smallest set S
of char lists that satisfies the equation
S = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L2 ∈ S}
Hence L(Star(R)) consists of all lists of form
L1@L2@...@Ln where n ≥ 0 and each Li ∈ L(R)
string/char list
explode : string -> char list
implode : char list -> string
strings and character lists are in 1-1 correspondence
explode "foo" = [#"f", #"o", #"o"]
implode [#"f", #"o", #"o"] = "foo"
We say that string s is in the language of R
iff (explode s) ∈ L(R)
explode(s1^s2) = (explode s1) @ (explode s2)
explode("") = [ ]
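The correspondence above can be checked directly at the REPL. A minimal sketch using only the Basis functions explode and implode; the value names cs, s, distributes and emptyCase are illustrative, not from the slides.

```sml
(* explode and implode come from the Standard ML Basis. *)
val cs : char list = explode "foo"     (* [#"f", #"o", #"o"] *)
val s  : string    = implode cs        (* "foo" *)

(* explode distributes over string concatenation. *)
val distributes : bool =
  explode ("foo" ^ "bar") = explode "foo" @ explode "bar"

(* The empty string corresponds to the empty list. *)
val emptyCase : bool = null (explode "")
```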
languages
regular expression    regexp value                                     strings in language
foo                   Times(Char #"f", Times(Char #"o", Char #"o"))    {"foo"}
(a+b)                 Plus(Char #"a", Char #"b")                       {"a", "b"}
(a+b)*                Star(Plus(Char #"a", Char #"b"))                 {"", "a", "b", "aa", "ab", "bb", ...}
building a regexp
EXERCISE
Using foldr and map, define a function

string2reg : string -> regexp
such that for all s : string,

L (string2reg s) = {explode s}
HINT: Use Times, One, Char, explode
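One possible answer to the exercise, offered as a sketch rather than the intended solution. It follows the hint: explode the string, wrap each character with Char, and fold with Times, using One as the base. The datatype is repeated so the fragment stands alone.

```sml
datatype regexp = Zero | One | Char of char
                | Plus of regexp * regexp
                | Times of regexp * regexp
                | Star of regexp

(* string2reg : string -> regexp
   ENSURES: L(string2reg s) = {explode s} *)
fun string2reg (s : string) : regexp =
  foldr Times One (map Char (explode s))
```

With this definition, string2reg "foo" builds the first of the two representations of foo shown on the representation slide, the one ending in One.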
specification
Write a function
accepts : regexp -> char list -> bool
to check if L ∈ L(R)
problem
Not easy to solve directly

For Times(R1,R2) it's possible to
generate-and-test all splits L1@L2 of L

But this can be very costly!

And what about Star R?
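To make the generate-and-test idea concrete, here is a minimal sketch. The names splits and timesAccepts are hypothetical helpers, not part of the lecture code; timesAccepts assumes we are already given deciders acc1 and acc2 for L(R1) and L(R2).

```sml
(* splits : 'a list -> ('a list * 'a list) list
   ENSURES: splits L lists every pair (L1, L2) with L1 @ L2 = L *)
fun splits [] = [([], [])]
  | splits (x::xs) =
      ([], x::xs) :: map (fn (l1, l2) => (x::l1, l2)) (splits xs)

(* Decide membership in L(Times(R1,R2)) by trying every split of L. *)
fun timesAccepts (acc1 : char list -> bool)
                 (acc2 : char list -> bool)
                 (L : char list) : bool =
  List.exists (fn (l1, l2) => acc1 l1 andalso acc2 l2) (splits L)
```

A list of length n has n+1 splits, and each split may trigger further splitting in nested Times, which is where the cost comes from.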
solution
Generalize the problem...
Does L have a prefix in L(R)
with a suffix that satisfies a success condition?
L = L1@L2
L1 is a prefix of L, with suffix L2
success? some total function from char list to bool
intuition
match : regexp -> char list -> (char list -> bool) -> bool
(the third argument is the success condition)
match R L p = true
iff L has a split L=L1@L2 with L1 ∈ L(R) & p(L2)=true
(a prefix of L is in L(R), and the rest of L satisfies p)
the generalized problem
Write an ML function
match : regexp -> char list -> (char list -> bool) -> bool
such that for all values R : regexp, the following property P(R) holds:
For all L, and all total p
(a) match R L p = true
    if there are L1, L2 such that L=L1@L2 & L1 ∈ L(R) & p(L2)=true
(b) match R L p = false
    otherwise
NOTE: P(R) implies that if p is total, so is (fn L => match R L p).
how that helps

Can then define
accepts : regexp -> char list -> bool
fun accepts R L = match R L null
REQUIRES: true

ENSURES:

accepts R L = true if L is in L(R)

accepts R L = false otherwise
design
match will use structural recursion on regexp
fun match Zero L p = (* easy *)
  | match One L p = (* easy *)
  | match (Char c) L p = (* easy *)
  | match (Plus(R1, R2)) L p =
      (* use match R1 and match R2 *)
  | match (Times(R1, R2)) L p =
      (* use match R1 and match R2 *)
  | match (Star R) L p =
      (* use match R *)
use spec as guide
Zero
L(Zero) = { }
match Zero L p = false
no prefix of L is in L(Zero)
One
L(One) = { [ ] }
match One L p = p L
the only prefix worth checking is [ ]
Char c
L(Char c) = { [c] }
the only possible prefix would be [c]
match (Char c) L p =
  case L of
    [ ] => false
  | x::L' => (c=x) andalso p(L')

match (Char c) [ ] p = false
match (Char c) (x::L) p = (c=x) andalso p(L)
Plus
L(Plus(R1,R2)) = L(R1) ∪ L(R2)
match (Plus(R1, R2)) L p =
  (match R1 L p) orelse (match R2 L p)
why is this the right thing to do?
Plus
property (a)
L(Plus(R1,R2)) = L(R1) ∪ L(R2)
match (Plus(R1,R2)) L p =>* true
  if match R1 L p =>* true
     (L has a prefix in L(R1) with suffix satisfying p)
  or match R1 L p =>* false & match R2 L p =>* true
     (L has a prefix in L(R2) with suffix satisfying p)
So match (Plus(R1,R2)) L p =>* true if L has a prefix in L(R1) ∪ L(R2) with suffix satisfying p
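Plus
property (b)
L(Plus(R1,R2)) = L(R1) ∪ L(R2)
match (Plus(R1,R2)) L p =>* false
  if match R1 L p =>* false & match R2 L p =>* false
     (L has no prefix in L(R1) with suffix satisfying p,
      and L has no prefix in L(R2) with suffix satisfying p)
So match (Plus(R1,R2)) L p =>* false if L has no prefix in L(R1) ∪ L(R2) with suffix satisfying p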
Times
L(Times(R1,R2)) = {L1@L2 | L1 ∈ L(R1), L2 ∈ L(R2)}
match (Times(R1,R2)) L p =
  match R1 L (fn L' => match R2 L' p)
success continuation
says what to do when R1 matches a prefix...
try matching R2 on suffix
Star
R* = 1 + R R*
match (Star R) L p =
match (Plus(One, Times(R, Star R))) L p

Yes, this should be true.

But not much good as a definition!
... not structurally recursive!
Star
match (Star R) L p =
  p(L) orelse match R L (fn L' => match (Star R) L' p)
check for a match with R followed by a match with Star R
match (Star R) uses match R
and calls itself recursively on a suffix
Star
match (Star (Char c)) [ ] null = true
match (Star (Char c)) [c] null = true
match (Star (Char c)) [c,c] null = true
L(Star(Char c)) = {[ ], [c], [c,c], ...}
Star
match (Star One) [c] null =>* ...
Should be false
L(Star One) = {[ ]}
diagnosis
L(One) = {[ ]}
null [c] = false
match (Star One) [c] null
=>+ null [c] orelse match One [c] (fn L => match (Star One) L null)
=>+ match One [c] (fn L => match (Star One) L null)
=>+ (fn L => match (Star One) L null) [c]
=>+ match (Star One) [c] null
match (Star One) [c] null loops forever
Star (corrected)
L(Star(R)) = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L2 ∈ L(Star R)}
           = {[ ]} ∪ {L1@L2 | L1 ∈ L(R), L1 ≠ [ ], L2 ∈ L(Star R)}
match (Star R) L p =
  p(L) orelse match R L (fn L' => L' <> L andalso
                                  match (Star R) L' p)
check for a non-trivial match with R followed by a match with Star R
analysis
match (Star One) [c] null
=>* null [c] orelse match One [c] (fn ...)
=>* match One [c] (fn L' => L' <> [c] andalso ...)
=>* (fn L' => L' <> [c] andalso ...) [c]
=>* [c] <> [c] andalso ...
=>* false
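Putting the clauses from the preceding slides together gives the following sketch of the whole matcher, with the corrected Star case. The bound variable of each continuation is written L' here to keep it distinct from the outer L; that renaming is an assumption about the intended code, since the two lists must differ for the test L' <> L to mean anything.

```sml
datatype regexp = Zero | One | Char of char
                | Plus of regexp * regexp
                | Times of regexp * regexp
                | Star of regexp

(* match : regexp -> char list -> (char list -> bool) -> bool
   REQUIRES: p is total
   ENSURES: match R L p = true iff L = L1 @ L2 with
            L1 in L(R) and p L2 = true *)
fun match Zero _ _ = false
  | match One L p = p L
  | match (Char c) L p =
      (case L of
         [] => false
       | x::L' => c = x andalso p L')
  | match (Plus (R1, R2)) L p =
      match R1 L p orelse match R2 L p
  | match (Times (R1, R2)) L p =
      match R1 L (fn L' => match R2 L' p)
  | match (Star R) L p =
      (* only continue with Star R if R consumed something *)
      p L orelse
      match R L (fn L' => L' <> L andalso match (Star R) L' p)

(* accepts : regexp -> char list -> bool *)
fun accepts R L = match R L null
```

With the non-trivial-match check in place, accepts (Star One) [#"c"] terminates and returns false, as the analysis above predicts.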
correctness?
Let P(R) be:

For all values L:char list

and all total functions p : char list -> bool,
(a) match R L p = true
    if there are L1, L2 such that L=L1@L2 & L1 ∈ L(R) & p(L2)=true
(b) match R L p = false
    otherwise
THEOREM

For all values R : regexp, P(R) holds.
proof outline
By structural induction on R
Base cases: Zero, One, Char c
Easier inductive cases: Plus(R1,R2), Times(R1,R2)
  Use P(R1) and P(R2) as hypotheses
  Key fact: P(R2) implies
    fn L => match R2 L p
  is total
Tricky inductive case: Star(R)
  Use P(R) and induction on L
  (why is this needed?)
reflection
Most regular expressions built with Zero denote the empty language
match may be slow, only to return false
There's a Zero-free regexp for the same language

Write a function
DeZero : regexp -> regexp
to remove Zero (except at top level)
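One way DeZero might be written, as a sketch under an assumed reading of the spec: the result denotes the same language as the argument and is either Zero itself or contains no Zero at all. It relies on the identities 0+R = R+0 = R, 0R = R0 = 0 and 0* = 1.

```sml
datatype regexp = Zero | One | Char of char
                | Plus of regexp * regexp
                | Times of regexp * regexp
                | Star of regexp

(* DeZero : regexp -> regexp
   ENSURES: L(DeZero R) = L(R), and DeZero R is either Zero
            or has no occurrence of Zero *)
fun DeZero Zero = Zero
  | DeZero One = One
  | DeZero (Char c) = Char c
  | DeZero (Plus (R1, R2)) =
      (case (DeZero R1, DeZero R2) of
         (Zero, R2') => R2'
       | (R1', Zero) => R1'
       | (R1', R2') => Plus (R1', R2'))
  | DeZero (Times (R1, R2)) =
      (case (DeZero R1, DeZero R2) of
         (Zero, _) => Zero
       | (_, Zero) => Zero
       | (R1', R2') => Times (R1', R2'))
  | DeZero (Star R) =
      (case DeZero R of
         Zero => One
       | R' => Star R')
```

After this preprocessing a matcher can answer false immediately when the result is Zero, instead of searching.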
