Linguistic Analysis
Damir Ćavar
Indiana University
January 2006
Regular Expressions
• Meta characters: . ^ $ * + - ? { } [ ] \ | ( )
• Each meta character has at least one special
meaning or function
• Complementing set
• [^Aa] Matches any character that is
neither lowercase nor uppercase “a”
• Escape character
• Treat meta characters as characters
• Examples:
• \[ Matches any occurrence of “[”
• \\ Matches any occurrence of “\”
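The complemented set and the escape character can be tried out with Python's standard re module. This is a minimal sketch; the sample strings and variable names are illustrative assumptions, not taken from the slides.

```python
import re

# [^Aa] is a complemented set: any single character except "a" or "A".
non_a = re.findall(r"[^Aa]", "Banana")
print(non_a)  # ['B', 'n', 'n']

# A backslash turns a meta character into an ordinary character.
brackets = re.findall(r"\[", "x[0] + y[1]")
backslashes = re.findall(r"\\", r"C:\temp\file")
print(brackets)          # ['[', '[']
print(len(backslashes))  # 2
```

Raw strings (r"...") keep Python itself from interpreting the backslashes before the regular expression engine sees them.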
© 2006 by Damir Ćavar, Indiana University
Regular Expressions
.
• Matches any character (in Python, any character except a newline by default)
• Examples:
• [\s,.] Matches white space, a “,” and a “.”
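A short sketch of both behaviours in Python's re module. The sample strings are assumptions chosen for illustration. Note that inside a character class the "." loses its special meaning and stands for a literal period.

```python
import re

# "." matches any single character except a newline (Python's default).
dot_hits = re.findall(r"c.t", "cat cut c\nt c t")
print(dot_hits)  # ['cat', 'cut', 'c t']

# Inside [...] the "." is a plain period, so this class matches
# one white-space character, one comma, or one period.
class_hits = re.findall(r"[\s,.]", "a, b.c")
print(class_hits)  # [',', ' ', '.']
```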
{m,n}
• At least m and at most n repetitions of the preceding expression
• Omitting m: m = 0
• Omitting n: n = ∞
• {0,} = *
• {1,} = +
• {0,1} = ?
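The equivalences above can be checked directly in Python. This is a sketch on an assumed sample string. In Python's re module the braces must be written without spaces, since a form such as {0, } is treated as literal text rather than as a quantifier.

```python
import re

text = "b ba baa baaa"

# Each brace form should find exactly what its shorthand finds.
pairs = [(r"ba{0,}", r"ba*"), (r"ba{1,}", r"ba+"), (r"ba{0,1}", r"ba?")]
equivalent = [re.findall(braces, text) == re.findall(short, text)
              for braces, short in pairs]
print(equivalent)  # [True, True, True]

# Explicit bounds: one or two "a" after "b" (greedy, so two when available).
bounded = re.findall(r"ba{1,2}", text)
print(bounded)  # ['ba', 'baa', 'baa']

# Omitted lower bound: zero to two "a" after "b".
upto_two = re.findall(r"ba{,2}", text)
print(upto_two)  # ['b', 'ba', 'baa', 'baa']
```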
Regular Expressions
• Predefined sequences or macros
• \d Matches any decimal digit; this is
equivalent to the class [0-9].
• \D Matches any non-digit character; this
is equivalent to the class [^0-9].
• \s Matches any whitespace character;
this is equivalent to the class [ \t\n\r\f\v].
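A small sketch of these predefined sequences in Python, using an assumed sample string. The stated class equivalences hold for ASCII input; on Python 3 text strings \d and \s also match non-ASCII digits and white space unless the re.ASCII flag is given.

```python
import re

sample = "Room 101,\tfloor 3\n"

digits = re.findall(r"\d", sample)
digits_class = re.findall(r"[0-9]", sample)
print(digits)                  # ['1', '0', '1', '3']
print(digits == digits_class)  # True for this ASCII sample

non_digit_runs = re.findall(r"\D+", sample)
print(non_digit_runs)  # ['Room ', ',\tfloor ', '\n']

spaces = re.findall(r"\s", sample)
print(spaces)  # [' ', '\t', ' ', '\n']
```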
Regular Expressions
• Predefined sequences or macros
• \S Matches any non-whitespace character; this
is equivalent to the class [^ \t\n\r\f\v].
• \w Matches any alphanumeric character; this is
equivalent to the class [a-zA-Z0-9_].
• \W Matches any non-alphanumeric character;
this is equivalent to the class
[^a-zA-Z0-9_].
• Example: tokenization
• Find the places where a string can be
broken up.
• Tokenization
• Morpheme boundaries
• Word boundaries
• Phrase boundaries
• Sentence boundaries
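As a rough illustration of finding word and sentence boundaries with regular expressions, here is a naive sketch. The sample text and both patterns are assumptions made for this example. Real tokenizers need to handle abbreviations, numbers, and clitics, which this one does not.

```python
import re

text = "The dog barked. It ran off, fast!"

# Word-level tokens: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)
print(words)
# ['The', 'dog', 'barked', '.', 'It', 'ran', 'off', ',', 'fast', '!']

# Naive sentence boundaries: white space that follows ".", "!" or "?".
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
# ['The dog barked.', 'It ran off, fast!']
```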
Linguistic Units
• Construction of words from smaller units:
meaningful units
• Inflectional Morphology
• root + suffix
• Verbs: -ed, -s, -ing
• call: called, calls, calling
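The root + suffix pattern can be sketched with a grouped regular expression. The pattern and the helper name split_suffix are hypothetical and deliberately naive: the sketch only looks at the spelling, so it would also cut "sing" into "s" + "ing".

```python
import re

# Hypothetical pattern: a lowercase root followed by -ed, -s or -ing.
SUFFIX_RE = re.compile(r"^([a-z]+?)(ed|s|ing)$")

def split_suffix(word):
    """Return (root, suffix); the suffix is empty if none is found."""
    match = SUFFIX_RE.match(word)
    if match:
        return match.group(1), match.group(2)
    return word, ""

for form in ["call", "called", "calls", "calling"]:
    print(form, split_suffix(form))
# call ('call', '')
# called ('call', 'ed')
# calls ('call', 's')
# calling ('call', 'ing')
```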
Linguistic Units
• Construction of words from smaller units:
morphemes = meaningful units
• Grouping: ( )
• Or-operator: |
• And-operator is implicit
• RE: (^|[^a-zA-Z])[oO]n[^a-zA-Z]
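The expression above combines grouping, the or-operator, and implicit concatenation to find the word "on". The sketch below runs it on a few assumed test strings. It also shows one limitation: because the pattern requires a non-letter after "on", it misses "on" at the very end of a string. The \b variant is an alternative added here for comparison, not part of the slides, and it treats digits and underscores as word characters.

```python
import re

ON_RE = re.compile(r"(^|[^a-zA-Z])[oO]n[^a-zA-Z]")
ON_BOUNDARY_RE = re.compile(r"\b[oO]n\b")  # comparison only (assumption)

tests = ["On the table", "turn it on.", "bonus", "Monday", "switch on"]

hits = [bool(ON_RE.search(s)) for s in tests]
print(hits)  # [True, True, False, False, False]

boundary_hits = [bool(ON_BOUNDARY_RE.search(s)) for s in tests]
print(boundary_hits)  # [True, True, False, False, True]
```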
Regular Expressions
• In Python:
• see RETest.py and RETest2.py
• For recognition
• For every input symbol, change to a
possible new state.
• For generation
• Change to a new state and emit a symbol.
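Generation can be sketched as a walk through a transition table that emits one symbol per transition. The table below is a hypothetical four-state machine invented for this example (it accepts "b", one or more "a", then "!"), and the function name generate and its length limit are likewise assumptions.

```python
from collections import deque

# Hypothetical transition table: state -> list of (symbol, next_state).
ARCS = {
    "Q0": [("b", "Q1")],
    "Q1": [("a", "Q2")],
    "Q2": [("a", "Q2"), ("!", "Q3")],
    "Q3": [],
}
START, FINAL = "Q0", {"Q3"}

def generate(max_len):
    """Return all strings of at most max_len symbols that end in a final state."""
    results = []
    queue = deque([(START, "")])
    while queue:
        state, emitted = queue.popleft()
        if state in FINAL:
            results.append(emitted)
        if len(emitted) < max_len:
            # Change to each possible new state and emit the arc's symbol.
            for symbol, next_state in ARCS[state]:
                queue.append((next_state, emitted + symbol))
    return results

print(generate(5))  # ['ba!', 'baa!', 'baaa!']
```

The length limit is needed because the loop on Q2 would otherwise let the machine emit arbitrarily long strings.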
Finite State Automata
• Directed graph:
• Finite set of vertices/nodes = states.
• A set of directed links between pairs of
vertices = arcs or transitions.
[Figure: example automaton with 4 states (nodes) and 4 transitions (arcs); Q0 = start state, Q3 = final/accepting state]
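One way to hold such a directed graph in a program is a list of labelled arcs from which the state set is derived. The arc labels below are assumptions, since the original diagram is not reproduced here; only the counts (4 states, 4 transitions) and the roles of Q0 and Q3 come from the slide.

```python
# Hypothetical arcs as (source state, symbol, target state) triples.
ARCS = [
    ("Q0", "b", "Q1"),
    ("Q1", "a", "Q2"),
    ("Q2", "a", "Q2"),
    ("Q2", "!", "Q3"),
]
START, FINAL = "Q0", {"Q3"}

# The vertices are every state that appears at either end of an arc.
states = sorted({src for src, _, _ in ARCS} | {dst for _, _, dst in ARCS})

# Adjacency view: outgoing (symbol, target) pairs per state.
outgoing = {q: [(sym, dst) for src, sym, dst in ARCS if src == q]
            for q in states}

print(states)          # ['Q0', 'Q1', 'Q2', 'Q3']
print(len(ARCS))       # 4
print(outgoing["Q2"])  # [('a', 'Q2'), ('!', 'Q3')]
```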
Finite State Automata
• repeat:
• if next symbol on tape matches any arc symbol from current state: follow that arc and advance the tape
• else: reject
• Σ = {b, a, s, !}
• start state: Q0
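The recognition loop can be sketched as follows. The alphabet and the start state Q0 are taken from the slide; the transition table DELTA and the choice of Q3 as the only accepting state for this table are assumptions made for illustration, as is the rule that the input is accepted only if the tape ends in a final state.

```python
ALPHABET = {"b", "a", "s", "!"}

# Hypothetical deterministic transitions: (state, symbol) -> next state.
DELTA = {
    ("Q0", "b"): "Q1",
    ("Q1", "a"): "Q2",
    ("Q2", "a"): "Q2",
    ("Q2", "!"): "Q3",
}
START, FINAL = "Q0", {"Q3"}

def recognize(tape):
    """Return True if the automaton ends in a final state after reading tape."""
    state = START
    for symbol in tape:
        next_state = DELTA.get((state, symbol))
        if next_state is None:
            return False  # no matching arc from the current state: reject
        state = next_state
    return state in FINAL

for tape in ["ba!", "baa!", "b!", "baa", "bas!"]:
    print(tape, recognize(tape))
# ba! True
# baa! True
# b! False
# baa False
# bas! False
```

In this assumed table the symbol "s" belongs to the alphabet but labels no arc, so any input containing it is rejected.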