
What are regular expressions?

Formally, a regular expression defines a set of strings.


/[Dd]at(a|um)/ Defines the set: Data, data, Datum, datum
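
As a rough illustration of "a set of strings", here is a minimal Python sketch. Python and the names `pattern`, `candidates`, and `accepted` are assumptions made for this example; the slides themselves use TextPad and Perl notation. It checks a handful of candidate strings and keeps only those the pattern accepts in full.

```python
import re

# The pattern from the slide. fullmatch() requires the whole string to fit.
pattern = re.compile(r"[Dd]at(a|um)")

# A few candidates, some inside the set and some outside it.
candidates = ["Data", "data", "Datum", "datum", "date", "DATA", "datums"]
accepted = [s for s in candidates if pattern.fullmatch(s)]

print(accepted)
```

Only the four strings listed on the slide survive the filter.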

Used mainly for parsing text: search or search-and-replace. But this is not your momma's search-and-replace.

Available in a variety of environments:
  Text editors, including TextPad.
  Unix commands such as grep.
  Most programming languages.

Page 1 of 27

Preliminaries
Before getting started, configure TextPad:
  Check "Regular expression" in the search or search-and-replace dialog box.
  Edit the TextPad preferences: Configure -> Preferences -> Editor. Check "Use POSIX regular expression syntax".


Terminology and notation


A regular expression defines a pattern. It is said to find a match when it succeeds. This tutorial uses the Perl idiom of enclosing regular expression patterns in forward slashes:
/PATTERN/
s/SEARCH/REPLACE/


Literals
The simplest regular expressions are no different than garden-variety searching: Ordinary characters match themselves. "Ordinary characters" are those that have no special meaning within regular expression syntax.
/in/

LAZINESS: The quality that makes you go to great effort to reduce overall energy expenditure. It makes you write laborsaving programs that other people will find useful, and document what you wrote so you don't have to answer so many questions about it. Hence, the first great virtue of a programmer.


Escaped character literals


The following characters have special meaning in regular expressions:
+ ? . * ^ $ () [] {} | \

To search for such characters, precede them with a backslash, which is known as escaping the character:
/\./     Mr. Green, with the revolver, in the billiard room.
/\\/     C:\MPC\regular_expressions_intro.doc
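
The two escaped patterns can be tried out with a short Python sketch. Python is an assumption here (the slides target TextPad), and the variable names are made up for the example. `re.escape` is a standard-library helper that adds the backslashes for you when the search text is meant literally.

```python
import re

# \. matches a literal period rather than "any character".
sentence = "Mr. Green, with the revolver, in the billiard room."
periods = re.findall(r"\.", sentence)

# \\ matches a single literal backslash.
path = r"C:\MPC\regular_expressions_intro.doc"
backslashes = re.findall(r"\\", path)

# re.escape() escapes special characters in text that should match literally.
escaped = re.escape("a.b")

print(len(periods), len(backslashes), escaped)
```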


Other special characters


Sometimes you need to match special whitespace characters. Here are the most common:
\t     tab
\n     newline (end of line marker)


The wildcard
A period matches any character (with one exception, to be covered later):
/a..e/

IMPATIENCE: The anger you feel when the computer is being lazy. This makes you write programs that don't just react to your needs, but actually anticipate them. Or at least pretend to. Hence, the second great virtue of a programmer. See also laziness and hubris.
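
A small Python sketch of the wildcard, assuming Python's `re` module rather than TextPad. The first string is the opening sentence of the sample text; the second, "made a key", is a made-up string added to show that a period also matches a space.

```python
import re

# Each period stands for exactly one character, whatever it is.
text = "The anger you feel when the computer is being lazy."
matches = re.findall(r"a..e", text)

# The wildcard is not limited to letters: here it matches a space and 'k'.
with_space = re.findall(r"a..e", "made a key")

print(matches, with_space)
```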


Quantifiers
Regular expression syntax provides several ways to specify the number of times that a particular element should occur:

+         1 or more times
?         0 or 1 time; item is optional
*         0 or more times; item is optional or repeatable
{N}       N times
{N,}      N or more times
{N,M}     N to M times


Quantifiers: basic examples


To use quantifiers in a regular expression you place them after the element that you want to quantify:
/IPUMSI?/       Matches 'IPUMS' and 'IPUMSI'.
/40+9/          Matches 409, 4009, 40009, etc.
/ab{2,3}a/      Matches 'abba' and 'abbba'.
/.+/            Matches any line with at least one character.
/.*/            Matches any line, even blank ones.
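
The examples above can be checked with a minimal Python sketch. Python, the helper `matches_whole`, and the extra test strings (such as 'IPUMSII' and 'aba') are assumptions added for illustration. Each pattern is tested against strings just inside and just outside its range.

```python
import re

def matches_whole(pattern, s):
    """Return True when the pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None

# Pattern -> candidate strings, some of which should be rejected.
checks = {
    "IPUMSI?": ["IPUMS", "IPUMSI", "IPUMSII"],
    "40+9": ["49", "409", "40009"],
    "ab{2,3}a": ["aba", "abba", "abbba", "abbbba"],
}
results = {
    pattern: [s for s in samples if matches_whole(pattern, s)]
    for pattern, samples in checks.items()
}

# .+ needs at least one character; .* is satisfied by an empty line.
blank_matches_plus = matches_whole(".+", "")
blank_matches_star = matches_whole(".*", "")

print(results, blank_matches_plus, blank_matches_star)
```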


Quantifiers: a greedy example


/john.+\.edu/

john.levin@yale.edu history
marcus_g_peterson@hotmail.com geography
john@nationwidecash.com computers
john_modell@brown.edu history
waijohnchang@yahoo.com geography
carol.johnson@normandale.edu sociology
john.johnson@ibm.com commerce
holly_johnston@hotmail.com biology
marie.johnson@argonmedical.com other
john.aim@sa.edu education
chewie1974@gmail.test science
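
To see which records the pattern selects, here is a grep-like filter sketched in Python. The language and the names `records`, `pattern`, and `hits` are assumptions for this example; the records are copied from the slide.

```python
import re

records = [
    "john.levin@yale.edu history",
    "marcus_g_peterson@hotmail.com geography",
    "john@nationwidecash.com computers",
    "john_modell@brown.edu history",
    "waijohnchang@yahoo.com geography",
    "carol.johnson@normandale.edu sociology",
    "john.johnson@ibm.com commerce",
    "holly_johnston@hotmail.com biology",
    "marie.johnson@argonmedical.com other",
    "john.aim@sa.edu education",
    "chewie1974@gmail.test science",
]

# 'john', then at least one character, then a literal '.edu'.
pattern = re.compile(r"john.+\.edu")

# search() looks for the pattern anywhere in the line, as grep would.
hits = [line for line in records if pattern.search(line)]

for line in hits:
    print(line)
```

Note that 'john' does not have to start the address: carol.johnson@normandale.edu qualifies because the pattern is not anchored.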


Regular expressions are greedy by default


Quantifiers match as many characters as possible, consistent with the overall goal of finding a successful match.
/john.+\.edu/

john.aim@sa.edu education     # The quantifier consumes the line,
john.aim@sa.edu education     # but the match fails...
john.aim@sa.edu education     # ...so the matching engine...
john.aim@sa.edu education     # ...backs off the quantifier...
john.aim@sa.edu education     # ...step...
john.aim@sa.edu education     # ...by step...
john.aim@sa.edu education     # ...until the pattern succeeds.
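
The backing-off can be imitated by hand in Python. This is only a sketch of the idea, not how a real matching engine is implemented, and the loop and its variable names are assumptions made for illustration. It tries the longest candidate first and shortens it one character at a time until the whole pattern fits.

```python
import re

line = "john.aim@sa.edu education"
pattern = re.compile(r"john.+\.edu")

# Start with the whole line, then give back one character per step.
final = None
for end in range(len(line), 0, -1):
    candidate = line[:end]
    if pattern.fullmatch(candidate):
        final = candidate
        break

# The real engine arrives at the same text.
engine_result = pattern.search(line).group()

print(final, engine_result)
```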


Generosity?
Some implementations of regular expressions allow for non-greedy quantifiers. In Perl, a question mark following a quantifier causes it to match as few characters as possible.
/john.+?\.edu/

john.aim@sa.edu education     # Matches this
john.aim@sa.edu education     # rather than this.

/IPUMSI?/

IPUMSI is good, never evil.   # Matches this.
IPUMSI is good, never evil.   # Not this.

TextPad lacks this feature.
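
Python's `re` module follows the Perl convention described above, so the difference can be sketched there. The two-address string is a made-up example, chosen because greedy and non-greedy matching only differ when more than one ending is available.

```python
import re

# Hypothetical line with two possible '.edu' endings.
line = "john@a.edu, jane@b.edu"

greedy = re.search(r"john.+\.edu", line).group()   # runs to the last '.edu'
lazy = re.search(r"john.+?\.edu", line).group()    # stops at the first '.edu'

# The optional quantifier is greedy as well: I? takes the 'I' when it can.
sentence = "IPUMSI is good, never evil."
optional_greedy = re.search(r"IPUMSI?", sentence).group()
optional_lazy = re.search(r"IPUMSI??", sentence).group()

print(greedy, lazy, optional_greedy, optional_lazy)
```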


Anchoring
Positional requirements can be placed on patterns. This is called anchoring.
/^PATTERN/      Anchor to start of line.
/PATTERN$/      Anchor to end of line.

/ department$/

john.levin@yale.edu history department
marcus_peterson@hotmail.com BS department
john@nation.com department of knowledge     # Not a match
john_modell@brown.edu history department

/^P.{23}4/      # Finds 4 in column 25 on person record.

H00003101110011000000000310400000001
P00003111110011000000000301110881088
H00004101110000000000000410400000001
P00004111110000000000000408420011001
H00005101110000001100000510400000001
P00005111110000001110000507410031003
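
Both anchored patterns can be tried in Python; the language and the variable names are assumptions for this sketch, and the data is copied from the slide. Each record is tested as its own string, so `^` and `$` refer to the start and end of that record.

```python
import re

lines = [
    "john.levin@yale.edu history department",
    "marcus_peterson@hotmail.com BS department",
    "john@nation.com department of knowledge",
    "john_modell@brown.edu history department",
]
# ' department' must be the last thing on the line.
ends_with_department = [l for l in lines if re.search(r" department$", l)]

records = [
    "H00003101110011000000000310400000001",
    "P00003111110011000000000301110881088",
    "H00004101110000000000000410400000001",
    "P00004111110000000000000408420011001",
    "H00005101110000001100000510400000001",
    "P00005111110000001110000507410031003",
]
# 'P' in column 1, any 23 characters, then '4' in column 25.
person_with_4 = [r for r in records if re.search(r"^P.{23}4", r)]

print(ends_with_department)
print(person_with_4)
```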



Word anchors
Most regular expression syntaxes have word anchors, which force a pattern to be located at a word boundary. Syntax for word anchors in TextPad:

\>     End of word.
\<     Beginning of word.

IPUMSI is good, never evil.     # No match in 'evil'.

Perl has a different and more robust syntax for word anchors.
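
Perl writes the word anchor as `\b`, and Python's `re` module uses the same notation, so the idea can be sketched in Python. The two patterns below are made up for illustration and are not the ones from the original slide.

```python
import re

sentence = "IPUMSI is good, never evil."

# \b marks a word boundary: the 'i' must begin a word.
starts_with_i = re.findall(r"\bi\w*", sentence)

# Here the 'l' must end a word.
ends_with_l = re.findall(r"\w*l\b", sentence)

print(starts_with_i, ends_with_l)
```

The 'i' inside 'evil' is skipped by the first pattern because it does not sit at the start of a word.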


Regular expressions are line-based


By default, regular expressions are applied one line at a time.

Implication 1: The wildcard does not match the newline character.
/.+/ # Matches full line, not entire document.

Implication 2: The end-of-line anchor assumes the existence of the newline character, so you do not need to specify it. These two patterns both find 'department' only if it exists at the end of the line, but they differ in the text matched.

/department$/       marcus_peterson@hotmail.com BS department
/department\n/      marcus_peterson@hotmail.com BS department
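
The difference in matched text is easier to see in code. This is a Python sketch under one stated assumption: Python searches the whole string at once, so the `re.MULTILINE` flag is needed to get the line-by-line behavior of `$` that the slide describes for TextPad. The two-line `document` string is assembled from the sample data.

```python
import re

document = (
    "marcus_peterson@hotmail.com BS department\n"
    "john@nation.com department of knowledge\n"
)

# The wildcard stops at the newline, so .+ returns one line, not the document.
first_line = re.search(r".+", document).group()

# $ anchors at the end of the line without consuming the newline.
with_anchor = re.search(r"department$", document, re.MULTILINE).group()

# Spelling out \n finds the same place but includes the newline in the match.
with_newline = re.search(r"department\n", document).group()

print(repr(first_line), repr(with_anchor), repr(with_newline))
```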

Character classes
Character classes provide a way to define sets within a pattern. Enclose one or more characters within square brackets. The characters can be typed directly or using intuitive ranges.
[brc]at                 Matches 'bat', 'rat', or 'cat'.
unit[0-9]               Matches 'unit' followed by any digit.
unit[7-9]               Matches 'unit7', 'unit8', or 'unit9'.
[a-z]{2}[0-9]{4}        Matches MPC sample IDs, such as ih1970.

Character classes can also be defined in a negative fashion by placing the caret symbol as the first item in the brackets.
unit[^0-9] Matches 'unit' followed by any non-digit.
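
A short Python sketch of the character classes above. Python and the test strings ('hat', 'unitA', and so on) are assumptions added for illustration.

```python
import re

def keep(pattern, samples):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

animals = keep(r"[brc]at", ["bat", "rat", "cat", "hat"])

units = ["unit0", "unit7", "unit9", "unitA"]
any_digit = keep(r"unit[0-9]", units)
high_digit = keep(r"unit[7-9]", units)
non_digit = keep(r"unit[^0-9]", units)   # the caret negates the class

sample_ids = keep(r"[a-z]{2}[0-9]{4}", ["ih1970", "IH1970", "ih70"])

print(animals, any_digit, high_digit, non_digit, sample_ids)
```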


Grouping or sub-patterns
A regular expression can be divided into parts, often called sub-patterns. Enclosing a portion of a regular expression in parentheses defines a sub-pattern.
/STUFF(SUB_PATTERN)MORE_STUFF(SUB_PATTERN)/


How sub-patterns are used


1. To apply quantifiers to a subset of a regular expression, rather than just to a single character.
2. To search for text with repeated elements.
3. To use portions of a pattern when defining the replacement string in search-and-replace operations.

In the latter two situations, the sub-patterns can be referred to using a \N notation. These are known as back-references.

\1      first sub-pattern
\2      second sub-pattern
...
\9      ninth sub-pattern

Perl's syntax for back-references in the replacement string is $N; inside the pattern itself, Perl also uses \N.



Sub-pattern examples: with quantifiers


house(cat)?     Matches 'house' or 'housecat'.
(ha)+           Matches 'ha', 'haha', 'hahaha', etc.
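
The effect of the parentheses can be sketched in Python. The language, the helper `keep`, and the extra test strings are assumptions for this example. The last pattern drops the parentheses to show that the quantifier then applies only to the single preceding character.

```python
import re

def keep(pattern, samples):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

# The ? applies to the whole sub-pattern 'cat'.
houses = keep(r"house(cat)?", ["house", "housecat", "housecatcat"])

# The + repeats the whole sub-pattern 'ha'.
laughs = keep(r"(ha)+", ["ha", "haha", "hah", "hahaha"])

# Without parentheses, + repeats only the 'a'.
no_group = keep(r"ha+", ["ha", "haaa", "haha"])

print(houses, laughs, no_group)
```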


Sub-pattern examples: text with repeated elements


/([0-9\-]+) +\1/

232-456-789     763-456-7890
123-456-789     123-456-789
232-456-789     612-612-3245
123-456-789     763-456-7890
232-456-789     232-456-789
123-456-789     612-612-3245
612-612-3245    612-612-3245

^([0-9])([0-9])([0-9]).+\3\2\1

232   456
123   321
232   456
123   456
237   732
123   456
612   216
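
Back-references inside a pattern work the same way in Python's `re` module, so both searches can be sketched there. Python and the variable names are assumptions for this example, and only a few rows in the style of the sample data are used.

```python
import re

# \1 must repeat exactly what the first sub-pattern matched.
number_rows = [
    "123-456-789 123-456-789",
    "232-456-789 612-612-3245",
    "612-612-3245 612-612-3245",
]
repeated = [r for r in number_rows if re.search(r"([0-9\-]+) +\1", r)]

# Three digits, then later the same three digits in reverse order.
digit_rows = ["123 321", "232 456", "237 732"]
mirrored = [
    r for r in digit_rows
    if re.search(r"^([0-9])([0-9])([0-9]).+\3\2\1", r)
]

print(repeated)
print(mirrored)
```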


Using sub-patterns in the replacement string -- preliminaries


In a search-and-replace operation, the replacement string is not a regular expression. It is mainly a literal string with a few bits of added functionality, which vary considerably from one environment to another. The primary TextPad features are the following:

\N          Use the Nth sub-pattern in the replacement.
&           Use the entire match in the replacement.
\0          Ditto.
\p          Use the clipboard contents.
\i          Generate a sequence number.
\i(N,M)     Ditto, starting at N and incrementing by M.


Sub-pattern examples: using sub-patterns in the replacement string


s/([0-9]{2})-([0-9]{2})/19\2\t\1/

Matches a date in the mm-yy format.
Stores the month and year portions as sub-matches.
Converts match to a tab-delimited string -- year then month.

Before      After
06-60       1960    06
02-60       1960    02
10-70       1970    10
03-70       1970    03
06-80       1980    06
03-80       1980    03
03-80       1980    03
11-80       1980    11
12-90       1990    12
03-90       1990    03
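
The same substitution can be sketched with Python's `re.sub`, which accepts `\1` and `\2` in the replacement string much as TextPad does. Python and the names `before` and `after` are assumptions for this example; only the first three dates are used.

```python
import re

before = ["06-60", "02-60", "10-70"]

# Group 1 is the month, group 2 the year. The replacement writes
# '19', the year, a tab, then the month.
after = [
    re.sub(r"([0-9]{2})-([0-9]{2})", r"19\2\t\1", date)
    for date in before
]

for line in after:
    print(line)
```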


Sub-pattern examples: using sub-patterns in the replacement string


s/([0-9]+)(\.([0-9]+))?/\0\t\1\t\2\t\3/

Parses numbers into their integer and decimal components.
Matches one or more digits, optionally followed by a decimal point and some more digits.
Preserves every component in the replacement: full match, integer, entire decimal portion, and just the digits following the decimal.

Before      After
460.914     460.914   460   .914   914
336.591     336.591   336   .591   591
60          60        60    \2     \3
108.767     108.767   108   .767   767
148.368     148.368   148   .368   368
24.911      24.911    24    .911   911
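
A Python sketch of the same operation, with two differences that are specific to Python and not taken from the slides: the entire match is written `\g<0>` in the replacement (Python does not read `\0` as a back-reference), and a sub-pattern that did not take part in the match is replaced by an empty string (Python 3.5 and later) rather than being left as `\2` or `\3`. The function name `split_number` is made up for the example.

```python
import re

pattern = re.compile(r"([0-9]+)(\.([0-9]+))?")

def split_number(text):
    """Return full match, integer part, decimal part, and decimal digits,
    separated by tabs."""
    return pattern.sub(r"\g<0>\t\1\t\2\t\3", text)

with_decimal = split_number("460.914")
without_decimal = split_number("60")   # groups 2 and 3 are unmatched here

print(repr(with_decimal))
print(repr(without_decimal))
```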

Alternation
The pipe symbol can be used to specify alternatives within a regular expression. Whereas a character class provides alternatives at the level of individual characters, this syntax can be applied to entire sub-patterns.

/\.(edu|gov)$/

john.levin@yale.edu
marie.johnson@argonmedical.com
samantha.johnson@bateswhite.gov
peterson@pop.com
marcus_g_peterson@alum.mit.edu

/^(A|An|The) .+/     # Matches entire lines that start with 'A', 'An', or 'The'.
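
Both alternation patterns can be tried in Python. The language is an assumption, the addresses are copied from the slide, and the four titles are made-up examples.

```python
import re

addresses = [
    "john.levin@yale.edu",
    "marie.johnson@argonmedical.com",
    "samantha.johnson@bateswhite.gov",
    "peterson@pop.com",
    "marcus_g_peterson@alum.mit.edu",
]
# The address must end in either '.edu' or '.gov'.
edu_or_gov = [a for a in addresses if re.search(r"\.(edu|gov)$", a)]

# Hypothetical titles: the article must be followed by a space.
titles = ["The Hobbit", "An Essay", "Another Day", "A Room"]
with_article = [t for t in titles if re.search(r"^(A|An|The) .+", t)]

print(edu_or_gov)
print(with_article)
```

'Another Day' is rejected because neither 'A' nor 'An' is followed by a space there.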


Text parsing -- a typical MPC example


Example: Czechoslovakia 1991 codebook.

General points:
  Identify regularities -- without them, the task is not amenable to automated solutions.
  Identify irregularities -- these are the challenges.
  Control white space.
  Be practical: use strategic "manual" editing. Do not force automation unless the magnitude of the job demands it.
  Know the strengths of your tools, and use them in combination: Excel, TextPad, even Word.


Quick reference
Characters with special meaning:
. \ + ? * ^ $ () [] {} |

Basic special characters:
\           Treat the next character as literal text.
.           Match any character except newline.
\t          Tab.
\n          Newline.

Quantifiers:
+           1 or more times
?           0 or 1 time; item is optional
*           0 or more times; item is optional or repeatable
{N}         N times
{N,}        N or more times
{N,M}       N to M times
?           Preceding quantifier non-greedy (Perl, not TextPad)

Anchors:
^           Start of line.
$           End of line.
\>          End of word (TextPad).
\<          Beginning of word (TextPad).

Character classes:
[]          Define character class.
[^]         Define character class in negative fashion.

Sub-patterns and back-references:
()          Define a sub-pattern.
|           Define alternative sub-patterns.
\N          Use the Nth back-reference.
&           Use the entire match in the replacement.
\0          Ditto.

Other TextPad options in the replacement string:
\p          Use the clipboard contents in the replacement.
\i          Generate a sequence number (TextPad, not Perl).
\i(N,M)     Ditto, starting at N and incrementing by M.

