Introduction
Regular expressions are tiny programs in their own special language, built inside Perl. They allow fast, flexible, and reliable string handling. A regular expression, often called a pattern in Perl, is a template that either matches or doesn't match a given string. That is, there are an infinite number of possible text strings; a given pattern divides that infinite set into two groups: the ones that match, and the ones that don't. Don't confuse regular expressions with shell filename-matching patterns, called globs, which are a different sort of pattern with their own rules.
Simple Pattern
To match a pattern (regular expression) against the contents of $_, simply put the pattern between a pair of forward slashes (/).
    $_ = "yabba dabba doo";
    if (/abba/) {
        print "It matched!\n";
    }
The expression /abba/ looks for that four-letter string in $_; if it finds it, it returns a true value.
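The idea that a pattern splits strings into two groups can be sketched with grep, which sets $_ to each list element in turn. The sample strings below are assumptions chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my @strings = ("yabba dabba doo", "scooby dooby doo", "abba", "ABBA");

# grep sets $_ to each element in turn, so a bare /abba/ tests that element.
my @matched   = grep {  /abba/ } @strings;
my @unmatched = grep { !/abba/ } @strings;

print "matched:   @matched\n";
print "unmatched: @unmatched\n";
```

Note that "ABBA" lands in the unmatched group, because matching is case-sensitive by default.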
Unicode Properties
Unicode characters know something about themselves; they aren't just sequences of bits. Instead of matching on a particular character, you can match a type of character. To match a particular property, you put its name in \p{PROPERTY}.
    if (/\p{Space}/) {    # 26 different possible characters
        print "The string has some whitespace.\n";
    }
    if (/\p{Digit}/) {    # 411 different possible characters
        print "The string has a digit.\n";
    }
Meta-characters
The dot (.) is a wildcard character: it matches any single character except a newline.
/bet.y/ matches betty, betsy, bet=y, and bet.y; it doesn't match bety or betsey.
The dot always matches exactly one character. If you want the dot to match just a period, backslash it.
/3\.141/ matches 3.141596456; it doesn't match 3a141545.
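As a quick check of the two rules above, this sketch runs the same sample strings through both patterns and records which ones match. The hash names %dot and %period are illustrative only.

```perl
use strict;
use warnings;

# Sample strings taken from the examples above.
my (%dot, %period);

for ("betty", "betsy", "bet=y", "bet.y", "bety", "betsey") {
    $dot{$_} = /bet.y/ ? 1 : 0;       # unescaped dot: exactly one character
}
for ("3.141596456", "3a141545") {
    $period{$_} = /3\.141/ ? 1 : 0;   # escaped dot: a literal period only
}

print "$_ => $dot{$_}\n"    for sort keys %dot;
print "$_ => $period{$_}\n" for sort keys %period;
```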
Simple Quantifiers
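* means zero or more occurrences of the preceding item.
+ means one or more occurrences of the preceding item.
? means zero or one occurrence, so the preceding item is optional.
/fred\t*barney/ matches fredbarney, fred\tbarney, fred\t\tbarney
/fred.*barney/ matches fredbarney, fredabcdbarney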
Grouping in Patterns
Use parentheses, ( ), to group parts of a pattern, so parentheses are also meta-characters.
/fred+/ matches fredddd, fredd (the + applies only to the d)
/(fred)+/ matches fred, fredfred, fredfredfred
/(fred)*/ matches hello, barney, fred, fredfred (in fact any string, since zero occurrences are allowed)
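The following sketch tabulates how the quantifier and grouping patterns above behave on a few assumed test strings. Each row holds one 0/1 flag per pattern, in the order shown in the comments.

```perl
use strict;
use warnings;

my @results;
for ("fredbarney", "fred\tbarney", "fred barney", "hello") {
    push @results, [
        (/fred\t*barney/ ? 1 : 0),   # zero or more tabs between the names
        (/fred.*barney/  ? 1 : 0),   # zero or more of any character between them
        (/(fred)+/       ? 1 : 0),   # one or more copies of "fred"
        (/(fred)*/       ? 1 : 0),   # zero or more copies: always succeeds
    ];
}

# One line of flags per test string, in the same order as the list above.
print join(" ", @$_), "\n" for @results;
```

The last row shows why /(fred)*/ is rarely useful on its own: it succeeds even on "hello", by matching zero copies.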
Using parentheses also makes Perl store the matched text in the special variables $1, $2, and so on. The number denotes the capture group.
    $_ = "perl version is 5.14";
    if (/perl version is (.*)/) {
        print $1;    # prints 5.14
    }
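A pattern may contain more than one capture group. This sketch splits the version into two parts; the variable names $major and $minor are assumptions for illustration.

```perl
use strict;
use warnings;

$_ = "perl version is 5.14";

my ($major, $minor);
if (/perl version is (.*)\.(.*)/) {
    # $1 holds the text before the literal period, $2 the text after it.
    ($major, $minor) = ($1, $2);
    print "major=$major minor=$minor\n";
}
```

Copying $1 and $2 into named variables right away is a cautious habit, since the next successful match overwrites them.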
Use back references to refer to text that you matched in the parentheses, called a capture group. You denote a back reference as a backslash followed by a number, like \1, \2, and so on.
    $_ = "abba";
    if (/(.)\1/) {    # matches 'bb'
        print "It matched same character next to itself!\n";
    }

    $_ = "yabba dabba doo";
    if (/y(....) d\1/) {
        print "It matched the same after y and d!\n";
    }

    $_ = "yabba dabba doo";
    if (/y(.)(.)\2\1/) {    # matches 'abba'
        print "It matched after the y!\n";
    }
How do you know which group gets which number? Just count the opening parentheses in order, ignoring nesting.
    $_ = "yabba dabba doo";
    if (/y((.)(.)\3\2) d\1/) {
        print "It matched!\n";
    }
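To see the numbering rule in action, this sketch prints what each group captured for the nested pattern above. The array @groups is an illustrative name.

```perl
use strict;
use warnings;

$_ = "yabba dabba doo";

my @groups;
if (/y((.)(.)\3\2) d\1/) {
    # Opening parentheses, left to right: 1 = outer group, 2 and 3 = inner dots.
    @groups = ($1, $2, $3);
    print "\$1 = $1\n";
    print "\$2 = $2\n";
    print "\$3 = $3\n";
}
```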
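Consider the problem where you want to use a back reference next to a part of the pattern that is a number. In this regular expression, you want to use \1 to repeat the character you matched in the parentheses and follow that with the literal string 11. Perl cannot tell where the back reference ends: it reads \111 as a single escape rather than as \1 followed by 11, so the pattern does not do what you intended.
    $_ = "aa11bb";
    if (/(.)\111/) {    # intended as \1 followed by 11
        print "It matched!\n";
    }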
Starting from perl 5.10, by using \g{1}, you disambiguate the back reference and the literal parts of the pattern:
    use 5.010;
    $_ = "aa11bb";
    if (/(.)\g{1}11/) {
        print "It matched!\n";
    }
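With the \g{N} notation, you can also use negative numbers. A negative number is a relative back reference: \g{-1} refers to the capture group immediately before it, which here is the second pair of parentheses.
    use 5.010;
    $_ = "xaa11bb";
    if (/(.)(.)\g{-1}11/) {
        print "It matched!\n";
    }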
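The difference between absolute and relative back references can be checked side by side. This is a minimal sketch; the variable names are assumptions for illustration.

```perl
use 5.010;
use strict;
use warnings;

$_ = "aa11bb";
# Group 1 captures 'a', \g{1} repeats it, then comes the literal 11.
my $absolute = /(.)\g{1}11/ ? 1 : 0;

$_ = "xaa11bb";
# \g{-1} means "the group just before this point", which is group 2 here.
my $relative = /(.)(.)\g{-1}11/ ? 1 : 0;
# \g{1} would need the first captured character to repeat, which never happens here.
my $first_group = /(.)(.)\g{1}11/ ? 1 : 0;

print "absolute=$absolute relative=$relative first_group=$first_group\n";
```

A relative reference keeps working if you later add another group at the front of the pattern, which is the usual reason to prefer it.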
Alternatives
The vertical bar (|), often called "or" in this usage, means that if the part of the pattern on the left of the bar fails, the part on the right gets a chance to match.
/fred|barney|betty/ matches fred, barney, or betty.
/fred( |\t)+barney/ matches if fred and barney are separated by spaces, tabs, or a mixture of the two.
/fred( +|\t+)barney/ matches if fred and barney are separated only by spaces or only by tabs, not by a mixture of the two.
/fred (and|or) barney/ matches fred and barney or fred or barney. It is the same as the pattern /fred and barney|fred or barney/.
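The two whitespace patterns above differ only in where the quantifier sits. This sketch builds a few assumed test strings with different gaps between the names and records the result for each pattern.

```perl
use strict;
use warnings;

# Hypothetical gaps to place between "fred" and "barney".
my @gaps   = (" ", "\t\t", " \t ", "");
my @labels = ("one space", "two tabs", "space tab space", "nothing");

my (@mixed_ok, @same_only);
for my $i (0 .. $#gaps) {
    $_ = "fred$gaps[$i]barney";
    $mixed_ok[$i]  = /fred( |\t)+barney/  ? 1 : 0;   # each repetition picks a space or a tab
    $same_only[$i] = /fred( +|\t+)barney/ ? 1 : 0;   # the whole gap is all spaces or all tabs
    print "$labels[$i]: mixed_ok=$mixed_ok[$i] same_only=$same_only[$i]\n";
}
```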
Character Classes
A character class is a list of possible characters inside square brackets. It matches just one single character, but that one character may be any of the ones you list in the brackets.
[abcwxyz] matches a,b,c,w,x,y,z (any of those seven characters)
    $_ = "The HAL-9000 requires authorization to continue.";
    if (/HAL-[0-9]+/) {
        print "The string mentions some model of HAL computer.\n";
    }
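The range [0-9] matches only the ten ASCII digits. The \d shortcut for digits is broader, however: there are many more digits than the 0 to 9 that you may expect from ASCII, so a pattern such as /HAL-\d+/ will also match HAL- followed by digits from other scripts. Recognizing this problematic shift from ASCII to Unicode, Perl 5.14 adds the /a modifier; placed on the end of the match operator, it tells Perl to use the old ASCII interpretation.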
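The effect of the /a modifier can be sketched with a string that contains non-ASCII digits. The test string is an assumption for illustration: "HAL-" followed by four ARABIC-INDIC digits, written as escapes so the source file stays plain ASCII.

```perl
use 5.014;
use strict;
use warnings;

# "HAL-" followed by ARABIC-INDIC digits nine, zero, zero, zero.
$_ = "HAL-\x{0669}\x{0660}\x{0660}\x{0660}";

my $unicode_digits = /HAL-\d+/    ? 1 : 0;   # \d accepts any Unicode digit
my $ascii_only     = /HAL-\d+/a   ? 1 : 0;   # /a restricts \d to 0 through 9
my $explicit_range = /HAL-[0-9]+/ ? 1 : 0;   # the range lists ASCII digits only

print "unicode_digits=$unicode_digits ascii_only=$ascii_only explicit_range=$explicit_range\n";
```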
\s matches any whitespace, which is almost the same as the Unicode property \p{Space}.
\h matches only horizontal whitespace.
\v matches only vertical whitespace. Taken together, \h and \v are the same as \p{Space}.
\R, introduced in Perl 5.10, matches any sort of line break, independent of operating system.
\w matches a "word" character; under ASCII rules that is the set [a-zA-Z0-9_].
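A short sketch can show how the whitespace shortcuts overlap. For each sample character it builds a three-letter summary (s, h, v, or a dash when the shortcut does not match), then tries \R against three common line-ending styles. The hash and array names are illustrative only.

```perl
use 5.010;
use strict;
use warnings;

my %text = (space => " ", tab => "\t", newline => "\n");

my %class;
for my $name (sort keys %text) {
    $_ = $text{$name};
    # Build a summary such as "sh-": matched by \s and \h, but not by \v.
    $class{$name} = join "", (/\s/ ? "s" : "-"), (/\h/ ? "h" : "-"), (/\v/ ? "v" : "-");
    print "$name: $class{$name}\n";
}

# \R treats each of these line endings as a single line break.
my @line_breaks;
for ("a\nb", "a\r\nb", "a\rb") {
    push @line_breaks, (/a\Rb/ ? 1 : 0);
}
print "line breaks: @line_breaks\n";
```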
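The /pattern/ form used so far is a shortcut for the pattern match operator, m/pattern/. With m, you may choose any pair of delimiters; if you choose the forward slash as the delimiter, you may omit the initial m. Choose wisely: pick a delimiter that doesn't appear in your pattern.
Write m%^http://% instead of /^http:\/\// to match an initial "http://"; the ^ anchors the pattern to the start of the string.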
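The three spellings below are equivalent ways to write the same match; only the delimiters differ. This is a sketch under two assumptions: the URL is a made-up example, and the ^ anchor (which ties the pattern to the start of the string) is included so that only an initial "http://" counts.

```perl
use strict;
use warnings;

# Hypothetical URL used only for illustration.
$_ = "http://www.example.com/index.html";

my $with_slashes = /^http:\/\// ? 1 : 0;   # default delimiter: every slash needs a backslash
my $with_percent = m%^http://%  ? 1 : 0;   # percent signs as delimiters: no escaping needed
my $with_braces  = m{^http://}  ? 1 : 0;   # paired delimiters also work

print "slashes=$with_slashes percent=$with_percent braces=$with_braces\n";
```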
Match Modifiers
Case-Insensitive Matching with /i
    $_ = "Is Freddy there?";
    if (/freddy/i) {
        print "Yes Freddy is here\n";
    }
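Matching Any Character with /s
By default, the dot doesn't match a newline. The /s modifier changes that, so the dot matches any character at all. Take, for example, a pattern such as /Barney.*Fred/s applied to a string in which Barney and Fred appear on different lines. Without the /s modifier, that match would fail, since the two names aren't on the same line. What if you still want to match any character except a newline? You could use the character class [^\n], or the shortcut \N, which Perl 5.12 added to mean the complement of \n.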
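The effect of /s, and of \N and [^\n] as ways to exclude newlines again, can be sketched as follows. The multi-line sample text is an assumption for illustration, with the two names placed on different lines.

```perl
use 5.012;
use strict;
use warnings;

# Hypothetical text with the two names on different lines.
$_ = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";

my $without_s   = /Barney.*Fred/       ? 1 : 0;   # dot stops at the first newline
my $with_s      = /Barney.*Fred/s      ? 1 : 0;   # dot may cross newlines
my $backslash_n = /Barney\N*Fred/s     ? 1 : 0;   # \N never matches a newline, even with /s
my $negated     = /Barney[^\n]*Fred/s  ? 1 : 0;   # same idea with a character class

print "without_s=$without_s with_s=$with_s backslash_n=$backslash_n negated=$negated\n";
```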
Many other modifiers are described in the perlop documentation. A few are described below.
Adding Whitespace with /x
The /x modifier lets you add whitespace to a pattern to make it easier to read. Perl considers comments a type of whitespace, so you can also put comments into the pattern to tell what you are trying to do:
    /
        -?        # an optional minus sign
        [0-9]+    # one or more digits before the decimal point
        \.?       # an optional decimal point
        [0-9]*    # some optional digits after the decimal point
    /x            # end of pattern
Use the escaped character, \#, or the character class, [#], if you need to match a literal pound sign, since a bare # indicates the start of a comment:
    /
        [0-9]+    # one or more digits before the decimal point
        [#]       # literal pound sign
    /x            # end of pattern
Be careful not to include the closing delimiter inside the comments, or it will prematurely terminate the pattern. This pattern ends before you think it does:
    /
        -?        # with / without - <--- OOPS!
        [0-9]+    # one or more digits before the decimal point
        \.?       # an optional decimal point
        [0-9]*    # some optional digits after the decimal point
    /x            # end of pattern
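One way to sidestep the closing-delimiter trap is to pick delimiters that are unlikely to appear in the comments. The sketch below uses m{ }x for that reason, which is a design choice of this example rather than a requirement; the test strings are assumptions for illustration.

```perl
use strict;
use warnings;

my @numbers;
for ("42", "-3.14", "abc", "7.") {
    push @numbers, $_ if m{
        -?        # an optional minus sign
        [0-9]+    # one or more digits before the decimal point
        \.?       # an optional decimal point
        [0-9]*    # optional digits after the point; a / in this comment is harmless
    }x;
}
print "numbers: @numbers\n";

# Inside a character class, # is a literal pound sign even under /x.
$_ = "42#";
my $pound = /[0-9]+ [#]/x ? 1 : 0;
print "pound=$pound\n";
```

With brace delimiters, the character to keep out of the comments becomes an unbalanced brace instead of a slash.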
Misc
The trick with a good pattern is to not match more than you ever mean to match.