Regular expressions are used mainly for parsing text: search or search-and-replace. But this is not your momma's search-and-replace. They are available in a variety of environments:
Text editors, including TextPad.
Unix commands such as grep.
Most programming languages.
Preliminaries
Before getting started, configure TextPad:
Check "Regular expression" in the search or search-and-replace dialog box.
Edit the TextPad preferences: Configure -> Preferences -> Editor. Check "Use POSIX regular expression syntax".
Literals
The simplest regular expressions are no different than garden-variety searching: Ordinary characters match themselves. "Ordinary characters" are those that have no special meaning within regular expression syntax.
/in/
LAZINESS: The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write laborsaving programs that other people will find useful, and document what you wrote so you don't have to answer so many questions about it. Hence, the first great virtue of a programmer.
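Some characters, such as the period and the backslash, do have special meaning within regular expression syntax. To search for such characters literally, precede them with a backslash, which is known as escaping the character: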
/\./
Mr. Green, with the revolver, in the billiard room.

/\\/
C:\MPC\regular_expressions_intro.doc
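The literal and escaping rules above can be tried outside TextPad. The following is a minimal sketch using Python's re module, which is an assumption made for illustration and not a tool the original slides use; the sample strings come from the examples above.

```python
import re

# Sample text taken from the examples above.
clue = "Mr. Green, with the revolver, in the billiard room."
path = r"C:\MPC\regular_expressions_intro.doc"

# Ordinary characters match themselves: find every literal 'in'.
print(re.findall(r"in", clue))

# An escaped period matches only a real period, not "any character".
print(len(re.findall(r"\.", clue)))

# An escaped backslash matches one literal backslash.
print(len(re.findall(r"\\", path)))

# re.escape() is a Python helper that inserts the backslashes for you.
print(re.escape("a.b"))
```

Raw strings (the r"..." prefix) keep Python from interpreting the backslashes before the regular expression engine sees them.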
The wildcard
A period matches any character (with one exception, to be covered later):
/a..e/
IMPATIENCE: The anger you feel when the computer is being lazy. This makes you write programs that don't just react to your needs, but actually anticipate them. Or at least pretend to. Hence, the second great virtue of a programmer. See also laziness and hubris.
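As a rough illustration of the wildcard, the sketch below runs /a..e/ against a few short strings using Python's re module (an assumption; the slides themselves use TextPad). The test words are hypothetical, apart from 'anger', which appears in the sample text above. In Python the one character the period does not match by default is the newline.

```python
import re

pattern = r"a..e"

# 'a', any two characters, then 'e'.
print(re.search(pattern, "anger").group())   # the 'ange' in 'anger'
print(re.search(pattern, "acme").group())

# Only one character between 'a' and 'e': no match.
print(re.search(pattern, "axe"))

# By default the period does not match a newline...
print(re.search(pattern, "a\n\ne"))

# ...unless the DOTALL flag is set (a Python-specific option).
print(re.search(pattern, "a\n\ne", re.DOTALL) is not None)
```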
Quantifiers
Regular expression syntax provides several ways to specify the number of times that a particular element should occur:
+        1 or more times
?        0 or 1 time; item is optional
*        0 or more times; item is optional or repeatable
{N}      N times
{N,}     N or more times
{N,M}    N to M times
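Each quantifier in the table can be checked against a small string. This is a sketch in Python's re module (an assumption; not the TextPad dialog the slides describe), and the helper name whole() and the test strings are hypothetical.

```python
import re

def whole(pattern, text):
    """Return True if the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text) pairs chosen to show where each quantifier succeeds or fails.
cases = [
    (r"ab+c", "abbbc"),       # +     : one or more 'b'
    (r"ab+c", "ac"),          # +     : zero 'b' is not enough
    (r"ab?c", "ac"),          # ?     : 'b' is optional
    (r"ab*c", "ac"),          # *     : zero or more 'b'
    (r"[0-9]{4}", "1970"),    # {N}   : exactly four digits
    (r"[0-9]{4}", "197"),     # {N}   : three digits fall short
    (r"a{2,}", "aaaa"),       # {N,}  : two or more
    (r"a{2,3}", "aaaa"),      # {N,M} : four is too many
]

for pattern, text in cases:
    print(pattern, repr(text), whole(pattern, text))
```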
Generosity?
Some implementations of regular expressions allow for non-greedy quantifiers. In Perl, a question mark following a quantifier causes it to match as few characters as possible.
/john.+?\.edu/
john.aim@sa.edu education    # Matches this
john.aim@sa.edu education    # rather than this.

/IPUMSI?/
IPUMSI is good, never evil.    # Matches this ('IPUMSI').
IPUMSI is good, never evil.    # Not this ('IPUMS').
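The difference between greedy and non-greedy matching shows up most clearly when the text offers more than one place to stop. The sketch below uses Python's re module, which follows the Perl convention for the trailing question mark (using Python here is an assumption); the two-address string is hypothetical and is not taken from the slides.

```python
import re

# Hypothetical text with two '.edu' endings, so the two forms can differ.
text = "john.aim@sa.edu, jane.doe@sb.edu"

# Greedy: '.+' grabs as much as possible, so the match runs to the last '.edu'.
greedy = re.search(r"john.+\.edu", text).group()

# Non-greedy: '.+?' grabs as little as possible, so the match stops at the first '.edu'.
lazy = re.search(r"john.+?\.edu", text).group()

print(greedy)
print(lazy)

# A plain '?' quantifier is greedy too: it takes the optional 'I' when it can.
print(re.search(r"IPUMSI?", "IPUMSI is good, never evil.").group())
```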
Anchoring
Positional requirements can be placed on patterns. This is called anchoring.
/^PATTERN/    Anchor to start of line.
/PATTERN$/    Anchor to end of line.
/ department$/
john.levin@yale.edu history department
marcus_peterson@hotmail.com BS department
john@nation.com department of knowledge    # Not a match
john_modell@brown.edu history department

/^P.{23}4/    # Finds 4 in column 25 on person record.
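The two anchors can be exercised on the sample lines above. This is a sketch in Python's re module (an assumption, since the slides use TextPad); the fixed-width person record is hypothetical and built only to put a 'P' in column 1 and a '4' in column 25.

```python
import re

lines = [
    "john.levin@yale.edu history department",
    "marcus_peterson@hotmail.com BS department",
    "john@nation.com department of knowledge",
    "john_modell@brown.edu history department",
]

# '$' anchors ' department' to the end of each line.
at_end = [line for line in lines if re.search(r" department$", line)]
print(len(at_end))

# Hypothetical record: 'P' in column 1, 23 filler characters, '4' in column 25.
record = "P" + "x" * 23 + "4" + " rest of record"
print(bool(re.search(r"^P.{23}4", record)))

# Shift the record by one column and the '^' anchor no longer lines up.
print(bool(re.search(r"^P.{23}4", " " + record)))
```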
Word anchors
Most regular expression syntaxes have word anchors, which force a pattern to be located at word boundaries. The syntax for word anchors in TextPad:
\>    End of word.
\<    Beginning of word.
# No match in 'evil'.
Perl has a different and more robust syntax for word anchors.
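As a rough illustration of word anchors outside TextPad, the sketch below uses \b, the word-boundary anchor in Perl-style syntaxes, through Python's re module (an assumption). The patterns and the phrase 'a vile plan' are hypothetical; only the word 'evil' comes from the slide.

```python
import re

# \b matches at a word boundary: the edge between a word character and a non-word character.
starts_with_vi = r"\bvi"   # 'vi' only at the beginning of a word
ends_with_il = r"il\b"     # 'il' only at the end of a word

# 'vi' sits in the middle of 'evil', so the beginning-of-word pattern fails.
print(re.search(starts_with_vi, "evil"))

# In 'a vile plan' the 'vi' starts a word, so it matches.
print(re.search(starts_with_vi, "a vile plan").group())

# 'il' ends the word 'evil', so the end-of-word pattern matches.
print(re.search(ends_with_il, "evil").group())
```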
Implication 2: The end-of-line anchor assumes the existence of the newline character, so you do not need to specify it. These two patterns both find 'department' only if it occurs at the end of the line, but they differ in the text matched: the first matches only 'department', while the second also includes the newline.
/department$/
marcus_peterson@hotmail.com BS department

/department\n/
marcus_peterson@hotmail.com BS department
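The difference in matched text can be made visible by printing each match with repr(). This is a sketch in Python's re module (an assumption); the sample line is the one above with an explicit trailing newline added.

```python
import re

line = "marcus_peterson@hotmail.com BS department\n"

# '$' matches just before the trailing newline, so the newline is not part of the match.
with_anchor = re.search(r"department$", line).group()

# '\n' consumes the newline, so it becomes part of the matched text.
with_newline = re.search(r"department\n", line).group()

print(repr(with_anchor))
print(repr(with_newline))
```

In a search-and-replace, the second form would also replace the line break, joining the line to the one that follows.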
Character classes
Character classes provide a way to define sets within a pattern. Enclose one or more characters within square brackets. The characters can be typed directly or using intuitive ranges.
[brc]at             Matches 'bat', 'rat', or 'cat'.
unit[0-9]           Matches 'unit' followed by any digit.
unit[7-9]           Matches 'unit7', 'unit8', or 'unit9'.
[a-z]{2}[0-9]{4}    Matches MPC sample IDs, such as ih1970.
Character classes can also be defined in a negative fashion by placing the caret symbol as the first item in the brackets.
unit[^0-9] Matches 'unit' followed by any non-digit.
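The character classes above, including the negated form, can be checked directly. This is a minimal sketch in Python's re module (an assumption); the test strings 'hat', 'unit6', 'unitA', and 'unit5' are hypothetical.

```python
import re

def whole(pattern, text):
    """Return True if the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# A set of individual characters.
print([w for w in ["bat", "rat", "cat", "hat"] if whole(r"[brc]at", w)])

# Ranges of characters.
print(whole(r"unit[7-9]", "unit8"), whole(r"unit[7-9]", "unit6"))
print(whole(r"[a-z]{2}[0-9]{4}", "ih1970"))

# A leading caret negates the class: any single character that is not a digit.
print(whole(r"unit[^0-9]", "unitA"), whole(r"unit[^0-9]", "unit5"))

# A negated class still has to match one character, so bare 'unit' fails.
print(whole(r"unit[^0-9]", "unit"))
```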
Grouping or sub-patterns
A regular expression can be divided into parts, often called sub-patterns. Enclosing a portion of a regular expression in parentheses defines a sub-pattern.
/STUFF(SUB_PATTERN)MORE_STUFF(SUB_PATTERN)/
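A sub-pattern's match can be reused later in the same expression with a back-reference (\1, \2, and so on). The following pattern matches lines that begin with three digits and later contain the same three digits in reverse order:
^([0-9])([0-9])([0-9]).+\3\2\1
232 123 232 123 237 123 612 456 321 456 456 732 456 216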
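The back-reference pattern above can be tried on a few lines. This is a sketch in Python's re module (an assumption); the three test lines are hypothetical arrangements of numbers like those in the sample data, not a reconstruction of the original slide's rows.

```python
import re

# Three captured digits, at least one other character, then the digits reversed.
pattern = r"^([0-9])([0-9])([0-9]).+\3\2\1"

# '123' is followed later by '321': match.
print(bool(re.search(pattern, "123 456 321")))

# '232' reversed is still '232', and it appears again later: match.
print(bool(re.search(pattern, "232 123 232")))

# '123' is never followed by '321': no match.
print(bool(re.search(pattern, "123 456 123")))

# The captured digits are available as groups 1 to 3.
print(re.search(pattern, "123 456 321").groups())
```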
Matches a date in the mm-yy format. Stores the month and year portions as sub-matches. Converts the match to a tab-delimited string: year, then month.

Before    After
06-60     1960    06
02-60     1960    02
10-70     1970    10
03-70     1970    03
06-80     1980    06
03-80     1980    03
03-80     1980    03
11-80     1980    11
12-90     1990    12
03-90     1990    03
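The search and replacement expressions for this example did not survive in the text above, so the sketch below uses a hypothetical pair that produces the same Before/After result. It is written with Python's re module (an assumption), and the choice to prefix the year with '19' is inferred from the After column.

```python
import re

dates = ["06-60", "02-60", "10-70", "03-70", "06-80"]

# Hypothetical pattern: group 1 is the month, group 2 is the two-digit year.
pattern = r"^([0-9]{2})-([0-9]{2})$"

# Hypothetical replacement: '19' + year, a tab, then the month.
replacement = r"19\2\t\1"

converted = [re.sub(pattern, replacement, d) for d in dates]

for before, after in zip(dates, converted):
    print(before, "->", repr(after))
```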
Parses numbers into their integer and decimal components. Matches one or more digits, optionally followed by a decimal point and some more digits. Preserves every component in the replacement: full match, integer, entire decimal portion, and just the digits following the decimal.

Before     After
460.914    460.914    460    .914    914
336.591    336.591    336    .591    591
60         60         60     \2      \3
108.767    108.767    108    .767    767
148.368    148.368    148    .368    368
24.911     24.911     24     .911    911
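The description above implies nested sub-patterns: one for the integer, one for the whole decimal portion, and one inside it for the digits alone. The pattern below is a hypothetical reconstruction along those lines, written with Python's re module (an assumption). Where the After column shows a literal \2 and \3 for the value 60, Python instead reports an unmatched optional group as None.

```python
import re

# Hypothetical pattern: group 1 = integer, group 2 = '.digits', group 3 = digits only.
NUMBER = re.compile(r"([0-9]+)(\.([0-9]+))?")

def split_number(text):
    """Return (full match, integer, decimal portion, decimal digits), or None."""
    m = NUMBER.fullmatch(text)
    if m is None:
        return None
    return (m.group(0), m.group(1), m.group(2), m.group(3))

for value in ["460.914", "336.591", "60", "24.911"]:
    print(value, "->", split_number(value))
```

Groups are numbered by the position of their opening parenthesis, which is why the inner digits-only group is number 3.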
Alternation
The pipe symbol can be used to specify alternatives within a regular expression. Whereas a character class provides alternatives at the level of individual characters, this syntax can be applied to entire sub-patterns.
/\.(edu|gov)$/
john.levin@yale.edu
marie.johnson@argonmedical.com
samantha.johnson@bateswhite.gov
peterson@pop.com
marcus_g_peterson@alum.mit.edu

/^(A|An|The) .+/    # Matches entire lines that start with 'A', 'An', or 'The'.
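Both alternation patterns can be run against sample text. This is a sketch in Python's re module (an assumption); the addresses come from the example above, while the three short titles are hypothetical.

```python
import re

addresses = [
    "john.levin@yale.edu",
    "marie.johnson@argonmedical.com",
    "samantha.johnson@bateswhite.gov",
    "peterson@pop.com",
    "marcus_g_peterson@alum.mit.edu",
]

# Keep only addresses that end in '.edu' or '.gov'.
edu_or_gov = [a for a in addresses if re.search(r"\.(edu|gov)$", a)]
print(edu_or_gov)

# Hypothetical titles: the article must be followed by a space.
titles = ["The end", "An apple", "Then what"]
with_article = [t for t in titles if re.search(r"^(A|An|The) .+", t)]
print(with_article)
```

'Then what' is rejected because the pattern requires a space immediately after 'The'.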
Quick reference
Characters with special meaning:
. \ + ? * ^ $ () [] {} |

Basic special characters:
\        Treat the next character as literal text.
.        Match any character except newline.
\t       Tab.
\n       Newline.

Quantifiers:
+        1 or more times
?        0 or 1 time; item is optional
*        0 or more times; item is optional or repeatable
{N}      N times
{N,}     N or more times
{N,M}    N to M times
?        Preceding quantifier non-greedy (Perl, not TextPad)

Anchors:
^        Start of line.
$        End of line.
\>       End of word (TextPad).
\<       Beginning of word (TextPad).

Character classes:
[]       Define character class.
[^]      Define character class in negative fashion.

Sub-patterns and alternation:
()       Define a sub-pattern.
|        Define alternative sub-patterns.
\N       Use the Nth back-reference.
&        Use the entire match in the replacement.
\0       Ditto.