
Computation and Linguistic Analysis
Damir Ćavar
Indiana University
January 2006
Regular Expressions

• Test Regular Expressions:


• http://www.personeel.unimaas.nl/H.Schotel/Testaregex/

• Advanced literature and references therein:


• Karttunen, L. et al. (1997) Regular Expressions for Language Engineering

© 2006 by Damir Ćavar, Indiana University


Regular Expressions
• Practical View:
• For Programming Languages
• matching strings: specification of strings
to be matched
• small and restricted: application scope
and RE language itself
• Input 1: a regular expression
• Input 2: some string or text
Regular Expressions
• Example:
• Input 1 RE: “test”
• Input 2: “This is a test”
• Result: match with occurrence of “test”
• A character or a string as a regular expression matches the same character or string in the text or string that is provided as the second parameter.
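The two inputs from the example can be tried out directly. The following is a minimal sketch, assuming Python's standard `re` module as the RE engine; the variable names are only illustrative.

```python
import re

pattern = "test"            # Input 1: the regular expression
text = "This is a test"     # Input 2: the string to search in

# re.search scans the text and returns the first match, or None.
match = re.search(pattern, text)
if match:
    # span() gives the start and end offsets of the match in the text.
    print("match:", match.group(), "at", match.span())
else:
    print("no match")
```

A plain string used as a pattern simply matches itself wherever it occurs in the second input.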
Regular Expressions

• Special Language Elements (meta characters):
• . ^ $ * + - ? { } [ ] \ | ( )
• Each meta character has at least one special
meaning or function



Regular Expressions
[ ]
• Specification of a set of characters or a
character class
• Examples:
• [a-c] Match with any character from
the set: {a, b, c}

• [aeiou] Match with any vowel


Regular Expressions
• Examples:
• [0-9] Match with any digit from the
set: {0, 1, 2, ..., 9}

• [Aa] Match with lowercase or capital letter “a”
• [dolar$] Match with any character from the set: {d, o, l, a, r, $}
Regular Expressions
^

• Complementing set
• [^Aa] Match with any character that is neither lowercase nor capital letter “a”
• [^9] Match with any character except the digit “9”
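A brief sketch of complemented sets in Python's `re` module follows; the test strings are invented. It also illustrates that “^” only complements the set when it is the first character inside the brackets.

```python
import re

text = "Aa9b"

not_a = re.findall(r"[^Aa]", text)    # everything except "A" and "a"
not_9 = re.findall(r"[^9]", text)     # everything except the digit 9

# When "^" is not the first character in the set, it is an ordinary character.
caret_literal = re.findall(r"[a^]", "a^b")

print(not_a, not_9, caret_literal)
```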



Regular Expressions
\

• Escape character
• Treat meta characters as characters
• Examples:
• \[ Matches with any occurrence of “[”
• \\ Matches with any occurrence of “\”
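Escaping can be tried out as follows. This is a sketch in Python, where raw strings (`r"..."`) keep the backslashes of the RE intact; the sample text is invented, and `re.escape` is shown as one convenient way to escape a whole string.

```python
import re

# Raw string: the backslash in "C:\temp" stays a literal backslash.
text = r"list[0] and C:\temp"

brackets = re.findall(r"\[", text)     # literal "["
backslashes = re.findall(r"\\", text)  # literal "\"

# re.escape adds the escape characters for us.
escaped = re.escape("[a]")

print(brackets, backslashes, escaped)
```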
Regular Expressions
.
• Matches any character (in Python, any character except a newline by default)
• Examples:
• [\s,.] Matches white space, a “,” and a “.” (inside [ ] the “.” is a literal dot)
• d.g “The dog is digging in the garden.”
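The two examples can be run like this (a sketch with Python's `re`; the second sample string is made up):

```python
import re

sentence = "The dog is digging in the garden."

# "." stands for exactly one arbitrary character between "d" and "g".
dot_hits = re.findall(r"d.g", sentence)

# Inside a set, "." is literal: white space, comma or full stop.
separators = re.findall(r"[\s,.]", "Hi, you. Bye")

# By default "." does not match a newline in Python.
across_newline = re.search(r"d.g", "d\ng")

print(dot_hits, separators, across_newline)
```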
Regular Expressions
*
• Matches the preceding character (or set or group) zero or more times
• Examples:
• is* “The issue is that John doesn’t
want to read Rilke.”

• do*g dog dg doog dooog ...


Regular Expressions

• What does the following RE match with?


• d[oiu]*g
•?
Regular Expressions
+
• Matches the preceding character (or set or group) one or more times (difference to *)
• Examples:
• is+ “The issue is that John doesn’t
want to read Rilke.”
• do+g dog doog dooog
Regular Expressions

• What does the following RE match with?


• d[oiu]+g
•?
Regular Expressions
?
• Matches the preceding character (or set or group) zero times or once (difference to * and +)
• Examples:
• is? “The issue is that John doesn’t
want to read Rilke.”
• do?g dog dg
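To compare *, + and ? side by side, the sketch below applies the three variants of the examples to the same inputs. It assumes Python's `re`; `matching` is a hypothetical helper written only for this illustration.

```python
import re

words = ["dg", "dog", "doog", "dooog"]
sentence = "The issue is that John doesn't want to read Rilke."

def matching(pattern):
    """Return the words that the pattern matches completely."""
    return [w for w in words if re.fullmatch(pattern, w)]

star = matching(r"do*g")       # zero or more "o"
plus = matching(r"do+g")       # one or more "o"
optional = matching(r"do?g")   # zero or one "o"

# The same three operators applied to "s" inside running text.
is_star = re.findall(r"is*", sentence)
is_plus = re.findall(r"is+", sentence)
is_optional = re.findall(r"is?", sentence)

print(star, plus, optional)
print(is_star, is_plus, is_optional)
```

The bare “i” in “Rilke” is found by `is*` and `is?`, but not by `is+`, which requires at least one “s”.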
Regular Expressions

• What does the following RE match with?


• d[oiu]?g
•?
Regular Expressions
{m,n}
• Matches the preceding character (or set or group) at least m times and at most n times (difference to *, +, ?)
• Examples:
• is{1,2} “The issue is that John doesn’t want to read Rilke.”
• do{1,3}g dog doog dooog
Regular Expressions

{m,n}

• What does the following RE match with?


• d[oiu]{2,3}g
•?
Regular Expressions
{m,n}
• Omitting m: m = 0
• Omitting n: n = ∞
• {0,} = *
• {1,} = +
• {0,1} = ?
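The equivalences above can be checked mechanically. This is a small sketch with Python's `re`; the word list and the helper `matching` are illustrative only. The bounds are written without a space inside the braces, since Python's `re` is, as far as I know, strict about that.

```python
import re

words = ["dg", "dog", "doog", "dooog", "doooog"]

def matching(pattern):
    """Return the words that the pattern matches completely."""
    return [w for w in words if re.fullmatch(pattern, w)]

bounded = matching(r"do{1,3}g")   # between one and three "o"

# Each open-ended or small range behaves like the shorthand operator.
same_as_star = matching(r"do{0,}g") == matching(r"do*g")
same_as_plus = matching(r"do{1,}g") == matching(r"do+g")
same_as_optional = matching(r"do{0,1}g") == matching(r"do?g")

print(bounded, same_as_star, same_as_plus, same_as_optional)
```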
Regular Expressions
• Predefined sequences or macros
• \d Matches any decimal digit; this is
equivalent to the class [0-9].
• \D Matches any non-digit character; this
is equivalent to the class [^0-9].
• \s Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
Regular Expressions
• Predefined sequences or macros
• \S Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
• \w Matches any alphanumeric character; this is
equivalent to the class [a-zA-Z0-9_].
• \W Matches any non-alphanumeric character;
this is equivalent to the class
[^a-zA-Z0-9_].
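The predefined sequences can be tried on a short string. This is a sketch for Python 3, where the sequences are Unicode-aware for text patterns by default; the `re.ASCII` flag is passed here so that they behave like the ASCII classes listed above. The sample string is invented.

```python
import re

# "\t" in a normal (non-raw) string literal is a real tab character.
text = "Room 42,\tfloor_3!"

digits = re.findall(r"\d", text, flags=re.ASCII)
non_digit_runs = re.findall(r"\D+", text, flags=re.ASCII)
spaces = re.findall(r"\s", text, flags=re.ASCII)
non_space_runs = re.findall(r"\S+", text, flags=re.ASCII)
words = re.findall(r"\w+", text, flags=re.ASCII)
non_word_chars = re.findall(r"\W", text, flags=re.ASCII)

print(digits, non_digit_runs, spaces, non_space_runs, words, non_word_chars)
```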



Regular Expressions
• Try REs out and find applications that use
them.

• Example: tokenization
• Find the places where a string can be
broken up.

• Find the places where tokens end and


start.
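As one possible illustration of tokenization with REs, the sketch below treats runs of word characters as tokens and every other non-space character as a token of its own. The pattern and the name `tokenize` are assumptions made for this example, not a definitive tokenizer; contractions such as “doesn't” would be split into three pieces.

```python
import re

# A token is either a run of word characters or one non-space symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

tokens = tokenize("The dog, it seems, is digging.")
print(tokens)
```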
Regular Expressions

• Tokenization
• Morpheme boundaries
• Word boundaries
• Phrase boundaries
• Sentence boundaries
Linguistic Units
• Construction of words from smaller units:
meaningful units

• Inflectional Morphology
• root + suffix
• Verbs: -ed, -s, -ing
• call: called, calls, calling
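The root + suffix pattern can be approximated with an RE. The following is a deliberately naive sketch: the pattern and the function `split_verb` are hypothetical, and because the RE only looks at spelling it will over-segment words such as “sing”.

```python
import re

# Shortest possible root followed by one of the three verbal suffixes.
SUFFIX_RE = re.compile(r"^([a-z]+?)(ed|s|ing)$")

def split_verb(word):
    """Return (root, suffix); suffix is None if no suffix is found."""
    m = SUFFIX_RE.match(word)
    return (m.group(1), m.group(2)) if m else (word, None)

for form in ["call", "called", "calls", "calling", "sing"]:
    print(form, split_verb(form))
```

The last test word shows the limit of a purely orthographic approach: “sing” is analysed as “s” + “ing”.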
Linguistic Units
• Construction of words from smaller units:
morphemes = meaningful units

• Derivational Morphology (word type → new word type)
• nouns to adjectives: -ly (friend: friend-ly)
• adjectives to nouns: -ness (friendly: friendliness)
Linguistic Units
• Morpheme regularities:
• stem + derivational + inflectional
• Phonological regularities:
• onset + nucleus + coda
• Syntactic regularities:
• SVO - SOV - VSO...
Regular Expressions
• Matching with pronouns in English:
• Limit: subject pronouns
• What is the regular expression for that?
• What does a subject pronoun look like?

• Where can it occur?


Regular Expressions

• English subject pronouns:


• I, you, he, she, it, we, they
• ambiguous: you
• Contextual properties:
•?
Regular Expressions

• Orthographic contextual information:


• Punctuation marks
• White space
• Lexical contextual information:
• Words preceding and following
Regular Expressions

• I, you, he, she, it, we, they


• Capitalization in sentence beginning
• Seldom in sentence final position
• Very seldom preceded and followed by
articles, adjectives...



Regular Expressions

• Grouping: ( )
• Or-operator: |
• And-operator is implicit
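Grouping and the or-operator combine naturally for the subject pronouns listed earlier. The sketch below is one possible RE, not the only one: it uses the word-boundary sequence `\b` of Python's `re`, which is not introduced on these slides, and allows a capitalized first letter for the sentence-initial case. Writing symbols next to each other (the implicit and-operator) means they must occur in sequence.

```python
import re

# ( ) groups the alternatives, | separates them, \b marks a word boundary.
PRONOUN_RE = re.compile(r"\b(I|[Yy]ou|[Hh]e|[Ss]he|[Ii]t|[Ww]e|[Tt]hey)\b")

text = "She said they saw it, but I think he forgot."
pronouns = PRONOUN_RE.findall(text)
print(pronouns)
```

The RE cannot tell subject “you” or “it” from object uses; that would need the lexical context discussed above.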



Regular Expressions
• RE that matches the preposition on.
• on alone RE: on
• on in sentence-initial position RE: ^[oO]n
• on preceded and followed by some non-alphabetic symbol (e.g. “,” or “.”)
• RE: (^|[^a-zA-Z])[oO]n[^a-zA-Z]
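The last RE can be tested against a few sentences. This is a sketch with invented test sentences; it also shows one limitation of the RE as written: because a non-alphabetic character is required after “on”, a sentence that ends in “on” without punctuation is not matched. One conceivable repair, offered only as a suggestion, would be to allow the end of the string as well: `([^a-zA-Z]|$)`.

```python
import re

ON_RE = re.compile(r"(^|[^a-zA-Z])[oO]n[^a-zA-Z]")

sentences = [
    "On Monday we left.",
    "The book is on the table.",
    "Come on, hurry!",
    "London is only one option.",   # "on" only inside longer words
    "Turn it on",                   # "on" at the very end, nothing follows
]

results = {s: bool(ON_RE.search(s)) for s in sentences}
for sentence, found in results.items():
    print(found, sentence)
```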
Regular Expressions

• In Python:
• see RETest.py and RETest2.py



Regular Expressions
• REs describe a set of strings or character sequences that they match.
• This set is potentially infinite: REs can express unbounded repetition (+, *, {m,}) and can be very long.
• Starting from the set of expressions an RE matches, we can also use an alternative formalism.
Automata
• Regular Expressions (RE)
• Description of finite-state automata (FSA)
• Any RE can be implemented as an FSA.
• Any FSA can be described by an RE.
• REs characterize a formal language: regular
language.



Automata
• A set of expressions:
• ba!
• bas!
• bass!
• basss! ...
• RE: bas*!
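A quick check of which strings belong to this set, sketched with Python's `re.fullmatch` so that the whole string has to be covered by the RE. The list of candidates is invented for illustration.

```python
import re

BAS_RE = re.compile(r"bas*!")

candidates = ["ba!", "bas!", "bass!", "basss!", "b!", "bs!", "bas"]

# fullmatch only succeeds if the RE describes the entire string.
accepted = [s for s in candidates if BAS_RE.fullmatch(s)]
rejected = [s for s in candidates if not BAS_RE.fullmatch(s)]

print("in the set:    ", accepted)
print("not in the set:", rejected)
```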
Automata
• Description in terms of instructions
followed step by step:



Automata

• For recognition
• For every input symbol change to
possible new state.

• For generation
• Change to a new state and emit symbol.
Finite State Automata

• Directed graph:
• Finite set of vertices/nodes = states.
• A set of directed links between pairs of
vertices = arcs or transitions.



Finite State Automata

4 states = nodes
Q0 = start state
Q3 = final/accepting state
4 transitions = arcs
Finite State Automata
• Accepting Automaton (Recognition)
• start in start state Q0
• repeat:
  • if next symbol on tape matches any arc symbol from current state
    • move to next state as indicated by the corresponding arc
    • if in final state and the tape is exhausted: accept and terminate
    • move to next symbol on tape
  • else: reject



Finite State Automata
State Transition Table (∅ = no transition; “:” marks the final state)
         Input
State    b    a    s    !
0        1    ∅    ∅    ∅
1        ∅    2    ∅    ∅
2        ∅    ∅    2    3
3:       ∅    ∅    ∅    ∅
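A recognizer driven by this table can be sketched in a few lines of Python. The sketch assumes the reading of the table suggested by the RE bas*!: only the four cells with a real target state are transitions, and every other cell means “no transition”. The names `TRANSITIONS`, `START_STATE`, `FINAL_STATES` and `accepts` are chosen for this illustration.

```python
# (state, input symbol) -> next state; missing pairs mean "no transition".
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "s"): 2,
    (2, "!"): 3,
}
START_STATE = 0
FINAL_STATES = {3}

def accepts(tape):
    """Return True if the automaton ends in a final state after reading the tape."""
    state = START_STATE
    for symbol in tape:
        next_state = TRANSITIONS.get((state, symbol))
        if next_state is None:
            return False          # no arc for this symbol: reject
        state = next_state
    # Accept only if the whole tape was read and we stopped in a final state.
    return state in FINAL_STATES

for tape in ["ba!", "basss!", "b!", "bas", "ba!!"]:
    print(tape, accepts(tape))
```

Storing the table as a dictionary keeps the recognizer itself independent of the particular automaton: a different language only needs a different table.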



Finite State Automata
• FSA definition:
• Q: a finite set of n + 1 states Q0, Q1, ..., Qn
• Σ: a finite input alphabet of symbols
• Q0: the start state
• F: the set of final states, F ⊆ Q
• δ(q,i): transition function or transition matrix between states
Finite State Automata
• “bas!”-FSA definition:
• Q = {Q0, Q1, Q2, Q3}
• Σ = {b, a, s, !}
• start state: Q0
• set of final states: {Q3}
• δ(q,i): relation between states and input symbols to new states as in transition table
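The five components of the definition map directly onto a small data structure. The sketch below is one possible encoding, with the class `FSA` and the function `is_well_formed` invented for this illustration; it only checks that the parts fit together (start state in Q, F ⊆ Q, δ defined over Q and Σ), not which strings are accepted.

```python
from typing import NamedTuple

class FSA(NamedTuple):
    states: frozenset      # Q
    alphabet: frozenset    # Σ
    start: str             # Q0
    finals: frozenset      # F
    delta: dict            # δ: (state, symbol) -> state

def is_well_formed(fsa):
    """Check that start state, final states and transitions stay inside Q and Σ."""
    if fsa.start not in fsa.states:
        return False
    if not fsa.finals <= fsa.states:          # F must be a subset of Q
        return False
    return all(
        q in fsa.states and i in fsa.alphabet and target in fsa.states
        for (q, i), target in fsa.delta.items()
    )

BAS = FSA(
    states=frozenset({"Q0", "Q1", "Q2", "Q3"}),
    alphabet=frozenset("bas!"),
    start="Q0",
    finals=frozenset({"Q3"}),
    delta={("Q0", "b"): "Q1", ("Q1", "a"): "Q2",
           ("Q2", "s"): "Q2", ("Q2", "!"): "Q3"},
)

print(is_well_formed(BAS))
```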
Assignment
• How would you use FSAs to generate expressions?
• Define the FSAs for: (vertices & arcs model and formal def. with transition table)
• ab, abab, ababab, abababab, ...
• a, ab, ac, acc, accc, acccc, ...
• Define the regular expressions for the two languages above.