Reg Ex

Published by: good_friend_1233054 on Jul 27, 2011
Copyright: Attribution Non-commercial

Sections

  • 1. Regular Expression Tutorial
  • 2. Literal Characters
  • 3. First Look at How a Regex Engine Works Internally
  • 4. Character Classes or Character Sets
  • 5. The Dot Matches (Almost) Any Character
  • 6. Start of String and End of String Anchors
  • 7. Word Boundaries
  • 8. Alternation with The Vertical Bar or Pipe Symbol
  • 9. Optional Items
  • 10. Repetition with Star and Plus
  • 11. Use Round Brackets for Grouping
  • 12. Use Round Brackets for Grouping
  • 13. Regex Matching Modes
  • 14. Atomic Grouping and Possessive Quantifiers
  • 15. Lookahead and Lookbehind Zero-Width Assertions
  • 16. Testing The Same Part of The String for More Than One Requirement
  • 17. Finding Matches Only Inside a Section of The String
  • 18. Continuing at The End of The Previous Match
  • 19. If-Then-Else Conditionals in Regular Expressions
  • 20. Adding Comments to Regular Expressions

1. Regular Expression Tutorial
In this tutorial, I will teach you all you need to know to be able to craft powerful time-saving regular expressions. I will start with the most basic concepts, so that you can follow this tutorial even if you know nothing at all about regular expressions yet. But I will not stop there. I will also explain how a regular expression engine works on the inside, and alert you to the consequences. This will help you to understand quickly why a particular regex does not do what you initially expected. It will save you lots of guesswork and head-scratching when you need to write more complex regexes.

What Regular Expressions Are Exactly - Terminology
Basically, a regular expression is a pattern describing a certain amount of text. Their name comes from the mathematical theory on which they are based. But we will not dig into that. Since most people, including myself, are too lazy to type the full name, you will usually find it abbreviated to regex or regexp. I prefer regex, because it is easy to pronounce the plural "regexes". In this book, regular expressions are printed between guillemets: «regex». They clearly separate the pattern from the surrounding text and punctuation.

This first example is actually a perfectly valid regex. It is the most basic pattern, simply matching the literal text „regex”. A "match" is the piece of text, or sequence of bytes or characters, that the pattern was found to correspond to by the regex processing software. Matches are indicated by double quotation marks, with the left one at the base of the line.

«\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b» is a more complex pattern. It describes a series of letters, digits, dots, underscores, percentage signs and hyphens, followed by an at sign, followed by another series of letters, digits, dots, underscores, percentage signs and hyphens, finally followed by a single dot and between two and four letters. In other words: this pattern describes an email address.

With the above regular expression pattern, you can search through a text file to find email addresses, or verify if a given string looks like an email address. In this tutorial, I will use the term "string" to indicate the text that I am applying the regular expression to. I will indicate strings using regular double quotes. The term “string” or “character string” is used by programmers to indicate a sequence of characters. In practice, you can use regular expressions with whatever data you can access using the application or programming language you are working with.
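As a rough sketch of how such a pattern might be used from code, the snippet below searches a piece of text for email-like strings and checks single candidates. Python and its `re` module are chosen here purely for illustration; the tutorial itself does not prescribe a language. The sample text and the names `EMAIL_RE`, `find_emails` and `looks_like_email` are made up, the last part of the pattern is limited to two to four letters as the prose describes, and case-insensitive matching is an assumption (the pattern only lists uppercase letters).

```python
import re

# Email pattern as described in the text. IGNORECASE is an assumption of
# this sketch: the pattern itself only lists uppercase letters.
EMAIL_RE = re.compile(r"\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b", re.IGNORECASE)


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)


def looks_like_email(candidate):
    """Return True if the whole candidate string looks like an email address."""
    return EMAIL_RE.fullmatch(candidate) is not None


sample = "Write to jane.doe@example.com or to support@example.org today."
found = find_emails(sample)
print(found)
```

Searching and validating use the same pattern; the only difference in this sketch is whether the pattern may match anywhere (`findall`) or must cover the entire string (`fullmatch`).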

Different Regular Expression Engines
A regular expression “engine” is a piece of software that can process regular expressions, trying to match the pattern to the given string. Usually, the engine is part of a larger application and you do not access the engine directly. Rather, the application will invoke it for you when needed, making sure the right regular expression is applied to the right file or data.

As usual in the software world, different regular expression engines are not fully compatible with each other. It is not possible to describe every kind of engine and regular expression syntax (or “flavor”) in this tutorial. I will focus on the regex flavor used by Perl 5, for the simple reason that this regex flavor is the most popular one, and deservedly so. Many more recent regex engines are very similar, but not identical, to the one of Perl 5. Examples are the open source PCRE engine (used in many tools and languages like PHP), the .NET regular expression library, and the regular expression package included with version 1.4 and later of the Java JDK. I will point out to you whenever differences in regex flavors are important, and which features are specific to the Perl-derivatives mentioned above.

Give Regexes a First Try
You can easily try the following yourself in a text editor that supports regular expressions, such as EditPad Pro. If you do not have such an editor, you can download the free evaluation version of EditPad Pro to try this out. EditPad Pro's regex engine is fully functional in the demo version. As a quick test, copy and paste the text of this page into EditPad Pro. Then select Edit|Search and Replace from the menu. In the search pane that appears near the bottom, type in «regex» in the box labeled “Search Text”. Mark the “Regular expression” checkbox, unmark “All open documents” and mark “Start from beginning”. Then click the Search button and see how EditPad Pro's regex engine finds the first match. When “Start from beginning” is checked, EditPad Pro uses the entire file as the string to try to match the regex to. When the regex has been matched, EditPad Pro will automatically turn off “Start from beginning”. When you click the Search button again, the remainder of the file, after the highlighted match, is used as the string. When the regex can no longer match the remaining text, you will be notified, and “Start from beginning” is automatically turned on again. Now try to search using the regex «reg(ular expressions?|ex(p|es)?)». This regex will find all names, singular and plural, I have used on this page to say “regex”. If we only had plain text search, we would have needed 5 searches. With regexes, we need just one search. Regexes save you time when using a tool like EditPad Pro. If you are a programmer, your software will run faster since even a simple regex engine applying the above regex once will outperform a state of the art plain text search algorithm searching through the data five times. Regular expressions also reduce development time. With a regex engine, it takes only one line (e.g. in Perl, PHP, Java or .NET) or a couple of lines (e.g. in C using PCRE) of code to, say, check if the user's input looks like a valid email address.
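The single-search claim above can be tried outside a text editor as well. The following is a minimal sketch in Python (my choice of language, not the tutorial's); the sample sentence and the names `NAME_RE` and `find_names` are hypothetical.

```python
import re

# One pattern covering regex, regexp, regexes, regular expression(s).
NAME_RE = re.compile(r"reg(ular expressions?|ex(p|es)?)")


def find_names(text):
    """Return every spelling of "regex" found in text, in order."""
    # finditer + group(0) yields the full match; findall would return the
    # contents of the capturing groups instead.
    return [m.group(0) for m in NAME_RE.finditer(text)]


sample = ("A regular expression is also called a regex or regexp; "
          "regular expressions are regexes.")
names = find_names(sample)
print(names)
```

One pass with one pattern collects all five spellings, where a plain text search would need one pass per spelling.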

2. Literal Characters
The most basic regular expression consists of a single literal character, e.g.: «a». It will match the first occurrence of that character in the string. If the string is “Jack is a boy”, it will match the „a” after the “J”. The fact that this “a” is in the middle of the word does not matter to the regex engine. If it matters to you, you will need to tell that to the regex engine by using word boundaries. We will get to that later. This regex can match the second „a” too. It will only do so when you tell the regex engine to start searching through the string after the first match. In a text editor, you can do so by using its “Find Next” or “Search Forward” function. In a programming language, there is usually a separate function that you can call to continue searching through the string after the previous match. Similarly, the regex «cat» will match „cat” in “About cats and dogs”. This regular expression consists of a series of three literal characters. This is like saying to the regex engine: find a «c», immediately followed by an «a», immediately followed by a «t». Note that regex engines are case sensitive by default. «cat» does not match “Cat”, unless you tell the regex engine to ignore differences in case.
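The behaviour described above (first occurrence, continuing after the previous match, case sensitivity) can be observed with a few calls. This is only an illustrative sketch in Python; the variable names are mine, and positions are 0-based, so index 1 is the second character of the string.

```python
import re

text = "Jack is a boy"

# The first match is the "a" right after the "J" (index 1).
first = re.search(r"a", text)

# Continue searching after the previous match to reach the second "a".
second = re.compile(r"a").search(text, first.end())

# Or collect the start position of every match in one go.
all_starts = [m.start() for m in re.finditer(r"a", text)]

# A series of literals: "cat" is found inside "cats".
cat_span = re.search(r"cat", "About cats and dogs").span()

# Matching is case sensitive unless told otherwise.
case_sensitive = re.search(r"cat", "Cat")
case_insensitive = re.search(r"cat", "Cat", re.IGNORECASE)

print(first.start(), second.start(), all_starts, cat_span)
```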

Special Characters
Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. In the regex flavors discussed in this tutorial, there are 11 characters with special meanings: the opening square bracket «[», the backslash «\», the caret «^», the dollar sign «$», the period or dot «.», the vertical bar or pipe symbol «|», the question mark «?», the asterisk or star «*», the plus sign «+», the opening round bracket «(» and the closing round bracket «)». These special characters are often called “metacharacters”. If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match „1+1=2”, the correct regex is «1\+1=2». Otherwise, the plus sign will have a special meaning. Note that «1+1=2», with the backslash omitted, is a valid regex. So you will not get an error message. But it will not match “1+1=2”. It would match „111=2” in “123+111=234”, due to the special meaning of the plus character. If you forget to escape a special character where its use is not allowed, such as in «+1», then you will get an error message. All other characters should not be escaped with a backslash. That is because the backslash is also a special character. The backslash in combination with a literal character can create a regex token with a special meaning. E.g. «\d» will match a single digit from 0 to 9.
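To make the difference between the escaped and the unescaped plus sign concrete, here is a small Python sketch (the language and the helper name `compiles` are my own choices). It uses raw strings so that the backslash reaches the regex engine unchanged.

```python
import re


def compiles(pattern):
    """Return True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Escaped plus: matches the literal text 1+1=2.
escaped = re.search(r"1\+1=2", "1+1=2")

# Unescaped plus: valid regex, but "+" now means "one or more 1s".
unescaped_same = re.search(r"1+1=2", "1+1=2")
unescaped_other = re.search(r"1+1=2", "123+111=234")

# A quantifier with nothing to repeat is rejected outright.
plus_first_ok = compiles(r"+1")

# \d is a token built from a backslash and a literal character.
digit = re.search(r"\d", "room 7")

print(escaped.group(), unescaped_same, unescaped_other.group(), plus_first_ok)
```

If you do not want to escape by hand, Python offers `re.escape`, which adds the backslashes for you; other languages have comparable helpers, but their names differ.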

Special Characters and Programming Languages
If you are a programmer, you may be surprised that characters like the single quote and double quote are not special characters. That is correct. When using a regular expression or grep tool like PowerGREP or the search function of a text editor like EditPad Pro, you should not escape or repeat the quote characters like you do in a programming language.

In your source code, you have to keep in mind which characters get special treatment inside strings by your programming language. That is because those characters will be processed by the compiler, before the regex library sees the string. So the regex «1\+1=2» must be written as "1\\+1=2" in C++ code. The C++ compiler will turn the escaped backslash in the source code into a single backslash in the string that is passed on to the regex library. To match „c:\temp”, you need to use the regex «c:\\temp». As a string in C++ source code, this regex becomes "c:\\\\temp". Four backslashes to match a single one indeed.

See the tools and languages section in this book for more information on how to use regular expressions in various programming languages.
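The same doubling of backslashes can be observed in other languages whose ordinary string literals treat the backslash specially. The sketch below uses Python rather than C++, which is an assumption of this example; Python additionally offers raw strings (the `r"..."` prefix), which pass backslashes through untouched. All variable names are made up.

```python
import re

# Ordinary string literal: the language consumes one level of backslashes,
# just as the C++ compiler does in the text.
plus_regex = "1\\+1=2"
# Raw string literal: what you type is what the regex engine sees.
plus_regex_raw = r"1\+1=2"

# Four backslashes in source code -> two in the regex -> one in the match.
path_regex = "c:\\\\temp"
path_regex_raw = r"c:\\temp"

# The text to search: the seven characters c : \ t e m p
path_text = "c:\\temp"

match = re.search(path_regex, path_text)
print(plus_regex, path_regex, match.group())
```

Forgetting the doubling in an ordinary literal fails silently here: `"c:\temp"` is a valid Python string, but it contains a tab character instead of a backslash and a "t".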

Non-Printable Characters
You can use special character sequences to put non-printable characters in your regular expression. «\t» will match a tab character (ASCII 0x09), «\r» a carriage return (0x0D) and «\n» a line feed (0x0A). Remember that Windows text files use “\r\n” to terminate lines, while UNIX text files use “\n”. You can include any character in your regular expression if you know its hexadecimal ASCII or ANSI code for the character set that you are working with. In the Latin-1 character set, the copyright symbol is character 0xA9. So to search for the copyright symbol, you can use «\xA9». Another way to search for a tab is to use «\x09». Note that the leading zero is required.
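A few of these escapes in action, as a hedged Python sketch. One assumption to be aware of: for text (str) patterns, Python interprets «\xA9» as the Unicode code point U+00A9, which coincides with the Latin-1 value mentioned above. The variable names are illustrative only.

```python
import re

# \t and \x09 are two ways of writing the same tab character.
tab_by_name = re.search(r"\t", "a\tb").start()
tab_by_code = re.search(r"\x09", "a\tb").start()

# The copyright sign by its hexadecimal code.
copyright_sign = re.search(r"\xA9", "Copyright \u00a9 2011").group()

# Split on either Windows (\r\n) or UNIX (\n) line terminators.
lines = re.split(r"\r?\n", "one\r\ntwo\nthree")

print(tab_by_name, tab_by_code, lines)
```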

3. First Look at How a Regex Engine Works Internally
Knowing how the regex engine works will enable you to craft better regexes more easily. It will help you understand quickly why a particular regex does not do what you initially expected. This will save you lots of guesswork and head-scratching when you need to write more complex regexes.

There are two kinds of regular expression engines: text-directed engines, and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. All the regex flavors treated in this tutorial are based on regex-directed engines. This is because certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines. No surprise that this kind of engine is more popular.

Notable tools that use text-directed engines are awk, egrep, flex, lex, MySQL and Procmail. For awk and egrep, there are a few versions of these tools that use a regex-directed engine.

You can easily find out whether the regex flavor you intend to use has a text-directed or regex-directed engine. If backreferences and/or lazy quantifiers are available, you can be certain the engine is regex-directed. You can do the test by applying the regex «regex|regex not» to the string “regex not”. If the resulting match is only „regex”, the engine is regex-directed. If the result is „regex not”, then it is text-directed. The reason behind this is that the regex-directed engine is “eager”.

In this tutorial, after introducing a new regex token, I will explain step by step how the regex engine actually processes that token. This inside look may seem a bit long-winded at certain times. But understanding how the regex engine works will enable you to use its full power and help you avoid common mistakes.

The Regex-Directed Engine Always Returns the Leftmost Match
This is a very important point to understand: a regex-directed engine will always return the leftmost match, even if a “better” match could be found later. When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.

When applying «cat» to “He captured a catfish for his cat.”, the engine will try to match the first token in the regex «c» to the first character in the string, “H”. This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the «c» with the “e”. This fails too, as does matching the «c» with the space. Arriving at the 4th character in the string, «c» matches „c”. The engine will then try to match the second token «a» to the 5th character, „a”. This succeeds too. But then, «t» fails to match “p”. At that point, the engine knows the regex cannot be matched starting at the 4th character in the string. So it will continue with the 5th: “a”. Again, «c» fails to match here and the engine carries on. At the 15th character in the string, «c» again matches „c”. The engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that «a» matches „a” and «t» matches „t”.

The entire regular expression could be matched starting at character 15. The engine is "eager" to report a match. It will therefore report the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any “better” matches. The first match is considered good enough.

In this first example of the engine's internals, our regex engine simply appears to work like a regular text search routine. A text-directed engine would have returned the same result too. However, it is important that you can follow the steps the engine takes in your mind. In following examples, the way the engine works will have a profound impact on the matches it will find. Some of the results may be surprising. But they are always logical and predetermined, once you know how the engine works.
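The walk through “He captured a catfish for his cat.” can be mimicked with a few lines of code. The function below is a deliberately simplified stand-in for a real engine: it handles only a sequence of literal characters, and its name `leftmost_literal_match` is invented for this sketch. It is compared against Python's `re` module, which is a regex-directed (backtracking) engine. Positions are 0-based, so index 14 is the 15th character.

```python
import re


def leftmost_literal_match(literal, text):
    """Try the literal at every start position, left to right.

    Returns the 0-based index of the first position where it fits,
    or None if it fits nowhere.
    """
    for start in range(len(text) - len(literal) + 1):
        if text.startswith(literal, start):
            return start  # eager: report the first success, look no further
    return None


sentence = "He captured a catfish for his cat."

hand_rolled = leftmost_literal_match("cat", sentence)
library = re.search(r"cat", sentence).start()

# The eagerness test from the text: a regex-directed engine reports
# "regex", because the first alternative already succeeds.
eager_result = re.search(r"regex|regex not", "regex not").group()

print(hand_rolled, library, eager_result)
```

Both approaches stop at the "cat" in "catfish" and never look at the word "cat" at the end of the sentence.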

4. Character Classes or Character Sets
With a "character class", also called “character set”, you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. If you want to match an a or an e, use «[ae]». You could use this in «gr[ae]y» to match either „gray” or „grey”. Very useful if you do not know whether the document you are searching through is written in American or British English.

A character class matches only a single character. «gr[ae]y» will not match “graay”, “graey” or any such thing. The order of the characters inside a character class does not matter. The results are identical.

You can use a hyphen inside a character class to specify a range of characters. «[0-9]» matches a single digit between 0 and 9. You can use more than one range. «[0-9a-fA-F]» matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. «[0-9a-fxA-FX]» matches a hexadecimal digit or the letter X. Again, the order of the characters and the ranges does not matter.

Useful Applications
Find a word, even if it is misspelled, such as «sep[ae]r[ae]te» or «li[cs]en[cs]e». Find an identifier in a programming language with «[A-Za-z_][A-Za-z_0-9]*». Find a C-style hexadecimal number with «0[xX][A-Fa-f0-9]+».

Negated Character Classes
Typing a caret after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class. Unlike the dot, negated character classes also match (invisible) line break characters.

It is important to remember that a negated character class still must match a character. «q[^u]» does not mean: “a q not followed by a u”. It means: “a q followed by a character that is not a u”. It will not match the q in the string “Iraq”. It will match the q and the space after the q in “Iraq is a country”. Indeed: the space will be part of the overall match, because it is the “character that is not a u” that is matched by the negated character class in the above regexp. If you want the regex to match the q, and only the q, in both strings, you need to use negative lookahead: «q(?!u)». But we will get to that later.

Metacharacters Inside Character Classes
Note that the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^) and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash. To search for a star or plus, use «[+*]». Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.
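The character class examples discussed so far (a plain class, a negated class versus negative lookahead, and metacharacters inside a class) can be checked with a short script. This is a sketch in Python, chosen for illustration only; the test strings other than those quoted in the text are made up.

```python
import re

# A class matches exactly one character: "graay" and "graey" are skipped.
spellings = re.findall(r"gr[ae]y", "gray grey graay graey")

# A negated class must still consume a character.
q_at_end = re.search(r"q[^u]", "Iraq")
q_and_space = re.search(r"q[^u]", "Iraq is a country").group()

# Negative lookahead matches the q alone in both strings.
q_only_1 = re.search(r"q(?!u)", "Iraq").group()
q_only_2 = re.search(r"q(?!u)", "Iraq is a country").group()

# Star and plus are ordinary characters inside a class.
operators = re.findall(r"[+*]", "2+3*4")

# Patterns from the "Useful Applications" paragraph.
is_hex = re.fullmatch(r"0[xX][A-Fa-f0-9]+", "0x1F") is not None
is_identifier = re.fullmatch(r"[A-Za-z_][A-Za-z_0-9]*", "_tmp1") is not None

print(spellings, repr(q_and_space), operators)
```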

To include a backslash as a character without any special meaning inside a character class, you have to escape it with another backslash. «[\\x]» matches a backslash or an x. The closing bracket (]), the caret (^) and the hyphen (-) can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning. I recommend the latter method, since it improves readability. To include a caret, place it anywhere except right after the opening bracket. «[x^]» matches an x or a caret. You can put the closing bracket right after the opening bracket, or the negating caret. «[]x]» matches a closing bracket or an x. «[^]x]» matches any character that is not a closing bracket or an x. The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both «[-x]» and «[x-]» match an x or a hyphen.

Shorthand Character Classes
Since certain character classes are used often, a series of shorthand character classes are available. «\d» is short for «[0-9]».

«\w» stands for “word character”. Exactly which characters it matches differs between regex flavors. In all flavors, it will include «[A-Za-z]». In most, the underscore and digits are also included. In some flavors, word characters from other languages may also match. In EditPad Pro, for example, the actual character range depends on the script you have chosen in Options|Font. If you are using the Western script, characters with diacritics used in languages such as French and Spanish will be included. If you are using the Cyrillic script, Russian characters will be included, etc. The best way to find out is to do a couple of tests with the regex flavor you are using. In the screen shot, you can see the characters matched by «\w» in PowerGREP when using the Western script.

«\s» stands for “whitespace character”. Again, which characters this actually includes, depends on the regex flavor. In all flavors discussed in this tutorial, it includes «[ \t]». That is: «\s» will match a space or a tab. In most flavors, it also includes a carriage return or a line feed as in «[ \t\r\n]». Some flavors include additional, rarely used non-printable characters such as vertical tab and form feed.

Shorthand character classes can be used both inside and outside the square brackets. «\s\d» matches a whitespace character followed by a digit. «[\s\d]» matches a single character that is either whitespace or a digit. When applied to “1 + 2 = 3”, the former regex will match „ 2” (space two), while the latter matches „1” (one). «[\da-fA-F]» matches a hexadecimal digit, and is equivalent to «[0-9a-fA-F]».

Negated Shorthand Character Classes
The above three shorthands also have negated versions. «\D» is the same as «[^\d]», «\W» is short for «[^\w]» and «\S» is the equivalent of «[^\s]».
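The placement rules for metacharacters inside a class and the difference between «\s\d» and «[\s\d]» are easy to verify. The sketch below uses Python, which is only one possible flavor; as the text warns, the exact contents of `\d`, `\w` and `\s` vary between flavors (in Python they cover more than ASCII for text patterns unless the `re.ASCII` flag is given). The sample strings are made up, apart from “1 + 2 = 3”.

```python
import re

# Metacharacters placed where they lose their special meaning.
bracket_or_x = re.findall(r"[]x]", "a]bx")
x_or_caret = re.findall(r"[x^]", "x^y")
hyphen_first = re.findall(r"[-x]", "a-x")
hyphen_last = re.findall(r"[x-]", "a-x")
backslash_or_x = re.findall(r"[\\x]", r"a\x")

# Two tokens in sequence versus one class containing two shorthands.
sequence = re.search(r"\s\d", "1 + 2 = 3").group()
either = re.search(r"[\s\d]", "1 + 2 = 3").group()

# A shorthand mixed with ranges inside a class.
is_hex_word = re.fullmatch(r"[\da-fA-F]+", "c0ffee") is not None

print(bracket_or_x, x_or_caret, repr(sequence), repr(either))
```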

Be careful when using the negated shorthands inside square brackets. «[\D\S]» is not the same as «[^\d\s]». The latter will match any character that is not a digit or whitespace. So it will match „x”, but not “8”. The former, however, will match any character that is either not a digit, or is not whitespace. Because a digit is not whitespace, and whitespace is not a digit, «[\D\S]» will match any character, digit, whitespace or otherwise.

Repeating Character Classes
If you repeat a character class by using the «?», «*» or «+» operators, you will repeat the entire character class, and not just the character that it matched. The regex «[0-9]+» can match „837” as well as „222”.

If you want to repeat the matched character, rather than the class, you will need to use backreferences. «([0-9])\1+» will match „222” but not “837”. When applied to the string “833337”, it will match „3333” in the middle of this string. If you do not want that, you need to use lookahead and lookbehind.

But I digress. I did not yet explain how character classes work inside the regex engine. Let us take a look at that first.

Looking Inside The Regex Engine
As I already said: the order of the characters inside a character class does not matter. «gr[ae]y» will match „grey” in “Is his hair grey or gray?”, because that is the leftmost match. We already saw how the engine applies a regex consisting only of literal characters. Below, I will explain how it applies a regex that has more than one permutation. That is: «gr[ae]y» can match both „gray” and „grey”.

Nothing noteworthy happens for the first twelve characters in the string. The engine will fail to match «g» at every step, and continue with the next character in the string. When the engine arrives at the 13th character, „g” is matched. The engine will then try to match the remainder of the regex with the text. The next token in the regex is the literal «r», which matches the next character in the text. So the third token, «[ae]» is attempted at the next character in the text (“e”). The character class gives the engine two options: match «a» or match «e». It will first attempt to match «a», and fail.

But because we are using a regex-directed engine, it must continue trying to match all the other permutations of the regex pattern before deciding that the regex cannot be matched with the text starting at character 13. So it will continue with the other option, and find that «e» matches „e”. The last regex token is «y», which can be matched with the following character as well. The engine has found a complete match with the text starting at character 13. It will return „grey” as the match result, and look no further. Again, the leftmost match was returned, even though we put the «a» first in the character class, and „gray” could have been matched in the string. But the engine simply did not get that far, because another equally valid match was found to the left of it.
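The difference between repeating a class and repeating the matched character, and the leftmost-match behaviour of «gr[ae]y», can be reproduced as follows. This is an illustrative Python sketch; the sample string “837 and 222” is invented, the other strings come from the text. Positions are 0-based, so index 12 is the 13th character.

```python
import re

# Repeating the class: any run of digits qualifies.
digit_runs = re.findall(r"[0-9]+", "837 and 222")

# Repeating the matched character requires a backreference.
no_repeat = re.search(r"([0-9])\1+", "837")
all_same = re.search(r"([0-9])\1+", "222").group()
middle = re.search(r"([0-9])\1+", "833337").group()

# The leftmost match wins, regardless of the order inside the class.
grey = re.search(r"gr[ae]y", "Is his hair grey or gray?")
grey_swapped = re.search(r"gr[ea]y", "Is his hair grey or gray?")

# Negated shorthands inside a class: [\D\S] matches everything.
digit_via_union = re.search(r"[\D\S]", "8")
digit_via_negation = re.search(r"[^\d\s]", "8")

print(digit_runs, all_same, middle, grey.group(), grey.start())
```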

5. The Dot Matches (Almost) Any Character
In regular expressions, the dot or period is one of the most commonly used metacharacters. Unfortunately, it is also the most commonly misused metacharacter.

The dot matches a single character, without caring what that character is. The only exception are newline characters. In all regex flavors discussed in this tutorial, the dot will not match a newline character by default. So by default, the dot is short for the negated character class «[^\n]» (UNIX regex flavors) or «[^\r\n]» (Windows regex flavors).

This exception exists mostly because of historic reasons. The first tools that used regular expressions were line-based. They would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the string could never contain newlines, so the dot could never match them.

Modern tools and languages can apply regular expressions to very large strings or even entire files. All regex flavors discussed here have an option to make the dot match all characters, including newlines. In RegexBuddy, EditPad Pro or PowerGREP, you simply tick the checkbox labeled “dot matches newline”.

In Perl, the mode where the dot also matches newlines is called "single-line mode". This is a bit unfortunate, because it is easy to mix up this term with “multi-line mode”. Multi-line mode only affects anchors, and single-line mode only affects the dot. You can activate single-line mode by adding an s after the regex code, like this: m/^regex$/s.

Other languages and regex libraries have adopted Perl's terminology. When using the regex classes of the .NET framework, you activate this mode by specifying RegexOptions.Singleline, such as in Regex.Match("string", "regex", RegexOptions.Singleline).

In all programming languages and regex libraries I know, activating single-line mode has no effect other than making the dot match newlines. So if you expose this option to your users, please give it a clearer label like was done in RegexBuddy, EditPad Pro and PowerGREP.

Use The Dot Sparingly
The dot is a very powerful regex metacharacter. It allows you to be lazy. Put in a dot, and everything will match just fine when you test the regex on valid data. The problem is that the regex will also match in cases where it should not match. If you are new to regular expressions, some of these cases may not be so obvious at first.

I will illustrate this with a simple example. Let's say we want to match a date in mm/dd/yy format, but we want to leave the user the choice of date separators. The quick solution is «\d\d.\d\d.\d\d». Seems fine at first. It will match a date like „02/12/03” just fine. Trouble is: „02512703” is also considered a valid date by this regular expression. In this match, the first dot matched „5”, and the second matched „7”. Obviously not what we intended.

«\d\d[- /.]\d\d[- /.]\d\d» is a better solution. This regex allows a dash, space, dot and forward slash as date separators. Remember that the dot is not a metacharacter inside a character class, so we do not need to escape it with a backslash.
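Both points, the dot's treatment of newlines and the overly permissive date pattern, can be tried out in a few lines. The sketch uses Python, where the option that the text calls single-line mode is spelled `re.DOTALL`; the separator set dash, space, dot and slash is taken from the description above, and the variable names are mine.

```python
import re

# By default the dot stops at a newline; DOTALL lifts that restriction.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

quick = r"\d\d.\d\d.\d\d"
better = r"\d\d[- /.]\d\d[- /.]\d\d"

# The quick pattern accepts a real date, but also a string of digits.
quick_date = re.fullmatch(quick, "02/12/03") is not None
quick_digits = re.fullmatch(quick, "02512703") is not None

# The character class only admits the four intended separators.
better_digits = re.search(better, "02512703")
accepted = [sep for sep in "- /." if re.fullmatch(better, f"02{sep}12{sep}03")]

print(quick_date, quick_digits, better_digits, accepted)
```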

This regex is still far from perfect. It matches „99/99/99” as a valid date. «[0-1]\d[- /.][0-3]\d[- /.]\d\d» is a step ahead, though it will still match „19/39/99”. How perfect you want your regex to be depends on what you want to do with it. If you are validating user input, it has to be perfect. If you are parsing data files from a known source that generates its files in the same way every time, our last attempt is probably more than sufficient to parse the data without errors. You can find a better regex to match dates in the example section.

Use Negated Character Sets Instead of the Dot
I will explain this in depth when I present you the repeat operators star and plus, but the warning is important enough to mention it here as well. I will illustrate with an example.

Suppose you want to match a double-quoted string. Sounds easy. We can have any number of any character between the double quotes, so «".*"» seems to do the trick just fine. The dot matches any character, and the star allows the dot to be repeated any number of times, including zero. If you test this regex on “Put a "string" between double quotes”, it will match „"string"” just fine. Now go ahead and test it on “Houston, we have a problem with "string one" and "string two". Please respond.”

Ouch. The regex matches „"string one" and "string two"”. Definitely not what we intended. The reason for this is that the star is greedy.

In the date-matching example, we improved our regex by replacing the dot with a character class. Here, we will do the same. Our original definition of a double-quoted string was faulty. We do not want any number of any character between the quotes. We want any number of characters that are not double quotes or newlines between the quotes. So the proper regex is «"[^"\r\n]*"».
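The greedy dot-star and its negated-class replacement can be compared directly on the “Houston” sentence, and the stricter date pattern can be probed with the two dates mentioned above. The following is a minimal Python sketch; the variable names are hypothetical.

```python
import re

houston = 'Houston, we have a problem with "string one" and "string two". Please respond.'

# Greedy dot-star runs from the first quote to the last one on the line.
greedy = re.search(r'".*"', houston).group()

# The negated class cannot step over a quote, so each string is found separately.
strings = re.findall(r'"[^"\r\n]*"', houston)

# With a single quoted string both patterns agree.
single = re.search(r'".*"', 'Put a "string" between double quotes').group()

# The stricter date pattern rejects 99/99/99 but still lets 19/39/99 through.
date_re = r"[0-1]\d[- /.][0-3]\d[- /.]\d\d"
rejects_99 = re.fullmatch(date_re, "99/99/99") is None
accepts_19_39 = re.fullmatch(date_re, "19/39/99") is not None

print(greedy, strings, rejects_99, accepts_19_39)
```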

6. Start of String and End of String Anchors
Thus far, I have explained literal characters and character classes. In both cases, putting one in a regex will cause the regex engine to try to match a single character.

Anchors are a different breed. They do not match any character at all. Instead, they match a position before, after or between characters. They can be used to “anchor” the regex match at a certain position. The caret «^» matches the position before the first character in the string. Applying «^a» to “abc” matches „a”. «^b» will not match “abc” at all, because the «b» cannot be matched right after the start of the string, matched by «^». See below for the inside view of the regex engine.

Similarly, «$» matches right after the last character in the string. «c$» matches „c” in “abc”, while «a$» does not match at all.

Useful Applications
When using regular expressions in a programming language to validate user input, using anchors is very important. If you use the code if ($input =~ m/\d+/) in a Perl script to see if the user entered an integer number, it will accept the input even if the user entered “qsdf4ghjk”, because «\d+» matches the 4. The correct regex to use is «^\d+$». Because “start of string” must be matched before the match of «\d+», and “end of string” must be matched right after it, the entire string must consist of digits for «^\d+$» to be able to match.

It is easy for the user to accidentally type in a space. When Perl reads a line from a text file, the line break will also be stored in the variable. So before validating input, it is good practice to trim leading and trailing whitespace. «^\s+» matches leading whitespace and «\s+$» matches trailing whitespace. In Perl, you could use $input =~ s/^\s+|\s+$//g. Handy use of alternation and /g allows us to do this in a single line of code.

Using ^ and $ as Start of Line and End of Line Anchors
If you have a string consisting of multiple lines, like “first line\nsecond line” (where \n indicates a line break), it is often desirable to work with lines, rather than the entire string. Therefore, all the regex engines discussed in this tutorial have the option to expand the meaning of both anchors. «^» can then match at the start of the string (before the “f” in the above string), as well as after each line break (between “\n” and “s”). Likewise, «$» will still match at the end of the string (after the last “e”), and also before every line break (between “e” and “\n”).

In text editors like EditPad Pro or GNU Emacs, and regex tools like PowerGREP, the caret and dollar always match at the start and end of each line. This makes sense because those applications are designed to work with entire files, rather than short strings.

In every programming language and regex library I know, you have to explicitly activate this extended functionality. It is traditionally called "multi-line mode". In Perl, you do this by adding an m after the regex code, like this: m/^regex$/m. In .NET, the anchors match before and after newlines when you specify RegexOptions.Multiline, such as in Regex.Match("string", "regex", RegexOptions.Multiline).
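The validation, trimming and multi-line examples above translate almost one to one into other Perl-style flavors. Here is a hedged sketch in Python, where multi-line mode is requested with the `re.MULTILINE` flag; the function names `is_integer` and `trim` are invented for this example.

```python
import re


def is_integer(user_input):
    """Return True only if the input consists of digits from start to end."""
    return re.search(r"^\d+$", user_input) is not None


def trim(user_input):
    """Remove leading and trailing whitespace with a single substitution."""
    return re.sub(r"^\s+|\s+$", "", user_input)


# Without anchors, a single digit anywhere is enough to "validate".
unanchored = re.search(r"\d+", "qsdf4ghjk") is not None
anchored = is_integer("qsdf4ghjk")

text = "first line\nsecond line"

# Default: ^ and $ only match at the start and end of the whole string.
first_words_default = re.findall(r"^\w+", text)
# Multi-line mode: they also match after and before each line break.
first_words_multiline = re.findall(r"^\w+", text, re.MULTILINE)
last_words_multiline = re.findall(r"\w+$", text, re.MULTILINE)

print(unanchored, anchored, repr(trim("  42\n")), first_words_multiline)
```

Trimming first and validating second, as the text recommends, makes `is_integer(trim("  42\n"))` succeed on input that carries stray spaces or a line break.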

"^". Since this token is a zero-width token. In VB. but rather with the position before the character that the regex engine has reached so far. This is true in all regex flavors discussed in this tutorial. «\z» matches after the line break. Reading a line from a file with the text “joe” results in the string “joe\n”. including Java. The first token in the regular expression is «^». In EditPad Pro and PowerGREP.Multiline). rather than at the very end of the string. «\Z» only ever matches at the end of the string. use «\z» (lower case z instead of upper case Z). it can result in a zero-length match. There are no other permutations of the . Zero-Length Matches We saw that the anchors match at a position. the regex engine starts at the first character: “7”. where the caret and dollar always match at the start and end of lines. If the string ends with a line break. When applied to this string. for example. However.51 Permanent Start of String and End of String Anchors «\A» only ever matches at the start of the string. It remains at “7”. it is common to prepend a “greater than” symbol and a space to each line of the quoted message. These two tokens never match at line breaks. matching only a position can be very useful. the engine does not try to match it with the character. As usual. this can be very useful or undesirable. The engine then advances to the next regex token: «4». just like we want it. This means that when a regex only consists of one or more anchors. In Perl. However.Replace method will remove the regex match from the string. This “enhancement” was introduced by Perl. «\A[a-z]+\z» does not match “joe\n”. Since the match does not include any characters. and the replacement string is inserted there. nothing is deleted. «4» is a literal character. the match does include a starting position. See below. The Regex. there is one exception.Replace(Original. the regex engine does not advance to the next character in the string. 
Using «^\d*$» to test if the user entered a number (notice the use of the star instead of the plus). rather than matching a character. even when you turn on “multiline mode”. Depending on the situation. Since the previous token was zero-width.NET and PCRE. would cause the script to accept an empty string as a valid input. which does not match “7”. and insert the replacement string (greater than symbol and a space). the resulting string will end with a line break. so the regex «^» matches at the start of the quoted message. then «\Z» and «$» will match at the position before that line break. both «^[a-z]+$» and «\A[a-z]+\Z» will match „joe”. and after each newline. "> ".NET. Likewise. In email. and is copied by many regex flavors. RegexOptions. . We are using multi-line mode. when reading a line from a file. Strings Ending with a Line Break Even though «\Z» and «$» only match at the end of the string (when the option for the caret and dollar to match at embedded line breaks is off). «\A» and «\Z» only match at the start and the end of the entire file. Looking Inside the Regex Engine Let's see what happens when we try to match «^4$» to “749\n486\n4” (where \n represents a newline character) in multi-line mode. If you only want a match at the absolute very end of the string. which is not matched by the character class. we can easily do this with Dim Quoted as String = Regex. «^» indeed matches the position before “7”.

the position before “\n” is preceded by a character. where the caret does not match. but the star turns the failure of the «\d» into a zero-width success. This time. the dollar matches successfully. and the mighty dollar is a strange beast. At this point. but does not advance the character position in the string. the entire regex has matched the empty string. In fact. and the engine reports success. and the engine advances both the regex token and the string character. one of the star's effects is that it makes the «\d». without advancing the position in the string. optional. “9”. because MatchPosition can point to the void after the string. and that character is not a newline. We already saw that those match. If you would query the engine for the length of the match. That fails. Again. The current regex token is advanced to «$». the engine successfully matches «4» with „4”. Let's see why. Since «$» was the last token in the regex. The engine will try to match «\d» with the void after the string. the regex engine tries to match the first token at the third “4” in the string. “8”. Finally. at “\n”. «4» matches „4”. and fails again. Same at the six and the newline. at the next character: “4”. Yet again. Then.52 regex. and the void after the string. It is zero-width. The «^» can match at the position before the “4”. and that character is not a newline. This position is preceded by a character. and that character is not a newline. . the regex engine arrives at the second “4” in the string. because this position is preceded by a character. in this case. «^» cannot match at the position before the 4. The dollar cannot match here. The first token in the regex is «^». also fails. it was successfully matched at the second “4”. for «$» to match the position before the current character. or the void after the string. the engine has found a successful match: the last „4” in the string. 
and the current character is advanced to the very last position in the string: the void after the string. the engine must try to match the first token again. If you would query the engine for the character position. so the engine starts again with the first regex token. The next attempt. After that. The engine continues at “9”. No regex token that needs a character to match can match here. It does not matter that this “character” is the void after the string. With success. It must be either a newline. Another Inside Look Earlier I mentioned that «^\d*$» would successfully match an empty string. so it will try to match the position before the current character. it would return the length of the string if string indices are zero-based. This can also happen with «^» and «^$» if the last character in the string is a newline. Since that is the case after the example. we are trying to match a dollar sign. What you have to watch out for is that String[Regex. The engine will proceed with the next regex token. There is only one “character” position in an empty string: the void after the string. it would return zero. As we will see later.MatchPosition] may cause an access violation or segmentation fault. or the length+1 if string indices are one-based in your programming language. The next token is «\d*». Now the engine attempts to match «$» at the position before (indeed: before) the “8”. because it is preceded by a newline character. the dollar will check the current character. It matches the position before the void after the string. Again. However. so the engine continues at the next character. Caution for Programmers A regular expression such as «$» all by itself can indeed match after the string. Not even a negated character class. «4». the regex engine advances to the next regex token. Previously. So the engine arrives at «$». because it is preceded by the void before the string.
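The anchor behavior described in this section can be tried out in a few lines of code. What follows is only a sketch, assuming Python's re module as a stand-in for the Perl and .NET flavors used in the examples above; the helper names is_integer and trim are hypothetical and not part of any library. One flavor difference to keep in mind: Python spells the "absolute end of string" anchor as \Z, which plays the role that «\z» plays in the flavors discussed in this tutorial.

```python
import re

# Without anchors, a single digit anywhere in the input is enough to "pass".
unanchored = re.search(r"\d+", "qsdf4ghjk")

def is_integer(text):
    # Hypothetical helper: «^\d+$» requires the whole string to be digits.
    return re.search(r"^\d+$", text) is not None

def trim(text):
    # Hypothetical helper: Python analogue of the Perl one-liner
    # $input =~ s/^\s+|\s+$//g
    return re.sub(r"^\s+|\s+$", "", text)

# Multi-line mode: «^» also matches right after each line break.
lines = "first line\nsecond line"
first_words_default = re.findall(r"^\w+", lines)
first_words_multiline = re.findall(r"^\w+", lines, flags=re.MULTILINE)

# Zero-length matches: quote a message by inserting "> " at every «^».
quoted = re.sub(r"^", "> ", "Hello Jan,\nhow are you?", flags=re.MULTILINE)

# The inside-look example: only the last line consists of just "4".
last_four = re.search(r"^4$", "749\n486\n4", flags=re.MULTILINE)

# «^\d*$» accepts the empty string.
accepts_empty = re.search(r"^\d*$", "") is not None

# «$» on its own matches at the void after the string, with length zero.
# Indexing the string at that position would raise IndexError in Python.
end = re.search(r"$", "abc")

# A string ending with a line break: «$» matches before that line break.
before_newline = re.search(r"^[a-z]+$", "joe\n")

# Python-specific spelling (assumption stated above): \Z is the strict
# end-of-string anchor, so this does not match "joe\n".
strict_end = re.search(r"\A[a-z]+\Z", "joe\n")
```

When validating input, the combination trim first, then test with the anchored regex, mirrors the advice given under Useful Applications.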

7. Word Boundaries
The metacharacter «\b» is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length.

There are four different positions that qualify as word boundaries:
• Before the first character in the string, if the first character is a word character.
• After the last character in the string, if the last character is a word character.
• Between a word character and a non-word character following right after the word character.
• Between a non-word character and a word character following right after the non-word character.

Simply put: «\b» allows you to perform a “whole words only” search using a regular expression in the form of «\bword\b». A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”. The exact list of characters is different for each regex flavor, but all word characters are always matched by the short-hand character class «\w». All non-word characters are always matched by «\W».

In Perl and the other regex flavors discussed in this tutorial, there is only one metacharacter that matches both before a word and after a word. This is because any position between characters can never be both at the start and at the end of a word. Using only one operator makes things easier for you.

Note that «\w» usually also matches digits. So «\b4\b» can be used to match a 4 that is not part of a larger number. This regex will not match “44 sheets of a4”. So saying “«\b» matches before and after an alphanumeric sequence” is more exact than saying “before and after a word”.

Negated Word Boundary
«\B» is the negated version of «\b». «\B» matches at every position where «\b» does not. Effectively, «\B» matches at any position between two word characters as well as at any position between two non-word characters.

Looking Inside the Regex Engine
Let's see what happens when we apply the regex «\bis\b» to the string “This island is beautiful”. The engine starts with the first token «\b» at the first character “T”. Since this token is zero-length, the position before the character is inspected. «\b» matches here, because the T is a word character and the character before it is the void before the start of the string. The engine continues with the next token: the literal «i». The engine does not advance to the next character in the string, because the previous regex token was zero-width. «i» does not match “T”, so the engine retries the first token at the next character position.

«\b» cannot match at the position between the “T” and the “h”. It cannot match between the “h” and the “i” either, and neither between the “i” and the “s”.

The next character in the string is a space. «\b» matches here because the space is not a word character, and the preceding character is. Again, the engine continues with the «i» which does not match with the space.

Advancing a character and restarting with the first regex token, «\b» matches between the space and the second “i” in the string. Continuing, the regex engine finds that «i» matches „i” and «s» matches „s”. Now, the engine tries to match the second «\b» at the position before the “l”. This fails because this position is between two word characters. The engine reverts to the start of the regex and advances one character to the “s” in “island”. Again, the «\b» fails to match and continues to do so until the second space is reached. It matches there, but matching the «i» fails.

But «\b» matches at the position before the third “i” in the string. The engine continues, and finds that «i» matches „i” and «s» matches „s”. The last token in the regex, «\b», also matches at the position before the third space in the string because the space is not a word character, and the character before it is.

The engine has successfully matched the word „is” in our string, skipping the two earlier occurrences of the characters i and s. If we had used the regular expression «is», it would have matched the „is” in “This”.
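The walkthrough above can be checked with a short script. This is a sketch only, assuming Python's re module as a stand-in for the flavors discussed in this tutorial; the variable names are made up for illustration.

```python
import re

text = "This island is beautiful"

# «\bis\b» skips the "is" in "This" and "island" and finds the whole word.
whole_word = re.search(r"\bis\b", text)

# Without word boundaries, the first "is" found is the one inside "This".
plain = re.search(r"is", text)

# «\b4\b» needs a 4 that stands alone: neither "44" nor "a4" qualifies.
standalone_four = re.search(r"\b4\b", "44 sheets of a4")

# «\B» is the opposite: "is" only counts when a word character precedes it.
inside_word = [m.start() for m in re.finditer(r"\Bis", text)]
```

Comparing whole_word.start() with plain.start() shows how far the engine had to travel before both boundaries could be satisfied.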

8. Alternation with The Vertical Bar or Pipe Symbol
I already explained how you can use character classes to match a single character out of several possible characters. Alternation is similar. You can use alternation to match a single regular expression out of several possible regular expressions.

If you want to search for the literal text «cat» or «dog», separate both options with a vertical bar or pipe symbol: «cat|dog». If you want more options, simply expand the list: «cat|dog|mouse|fish».

The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you will need to use round brackets for grouping. If we want to improve the first example to match whole words only, we would need to use «\b(cat|dog)\b». This tells the regex engine to find a word boundary, then either “cat” or “dog”, and then another word boundary. If we had omitted the round brackets, the regex engine would have searched for “a word boundary followed by cat”, or, “dog followed by a word boundary”.

Remember That The Regex Engine Is Eager
I already explained that the regex engine is eager. It will stop searching as soon as it finds a valid match. The consequence is that in certain situations, the order of the alternatives matters. Suppose you want to use a regex to match a list of function names in a programming language: Get, GetValue, Set or SetValue. The obvious solution is «Get|GetValue|Set|SetValue». Let's see how this works out when the string is “SetValue”.

The regex engine starts at the first token in the regex, «G», and at the first character in the string, “S”. The match fails. However, the regex engine studied the entire regular expression before starting. So it knows that this regular expression uses alternation, and that the entire regex has not failed yet. So it continues with the second option, being the second «G» in the regex. The match fails again. The next token is the first «S» in the regex. The match succeeds, and the engine continues with the next character in the string, as well as the next token in the regex. The next token in the regex is the «e» after the «S» that just successfully matched. «e» matches „e”. The next token, «t» matches „t”.

At this point, the third option in the alternation has been successfully matched. Because the regex engine is eager, it considers the entire alternation to have been successfully matched as soon as one of the options has. In this example, there are no other tokens in the regex outside the alternation, so the entire regex has successfully matched „Set” in “SetValue”.

Contrary to what we intended, the regex did not match the entire string. There are several solutions. One option is to take into account that the regex engine is eager, and change the order of the options. If we use «GetValue|Get|SetValue|Set», «SetValue» will be attempted before «Set», and the engine will match the entire string. We could also combine the four options into two and use the question mark to make part of them optional: «Get(Value)?|Set(Value)?». Because the question mark is greedy, «SetValue» will be attempted before «Set».

The best option is probably to express the fact that we only want to match complete words. We do not want to match Set or SetValue if the string is “SetValueFunction”. So the solution is «\b(Get|GetValue|Set|SetValue)\b» or «\b(Get(Value)?|Set(Value)?)\b». Since all options have the same end, we can optimize this further to «\b(Get|Set)(Value)?\b».
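To see the effect of eagerness for yourself, the alternatives above can be run side by side. This sketch assumes Python's re module, which is a backtracking, regex-directed engine and so behaves like the flavors described here; the variable names are only illustrative.

```python
import re

# The eager engine stops at the first alternative that works: „Set”.
naive = re.search(r"Get|GetValue|Set|SetValue", "SetValue").group()

# Putting the longer names first lets «SetValue» be tried before «Set».
reordered = re.search(r"GetValue|Get|SetValue|Set", "SetValue").group()

# The greedy question mark tries «Value» before giving up on it.
optional = re.search(r"Get(Value)?|Set(Value)?", "SetValue").group()

# Whole words only: word boundaries on both sides of the alternation.
whole_words = re.compile(r"\b(Get|Set)(Value)?\b")
exact = whole_words.search("SetValue").group()
longer_name = whole_words.search("SetValueFunction")
```

The last result is the reason the word-boundary version is preferred: it refuses to report a partial function name.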

9. Optional Items
The question mark makes the preceding token in the regular expression optional. E.g.: «colou?r» matches both „colour” and „color”.

You can make several tokens optional by grouping them together using round brackets, and placing the question mark after the closing bracket. E.g.: «Nov(ember)?» will match „Nov” and „November”.

You can write a regular expression that matches many alternatives by including more than one question mark. «Feb(ruary)? 23(rd)?» matches „February 23rd”, „February 23”, „Feb 23rd” and „Feb 23”.

Important Regex Concept: Greediness
With the question mark, I have introduced the first metacharacter that is greedy. The question mark gives the regex engine two choices: try to match the part the question mark applies to, or do not try to match it. The engine will always try to match that part. Only if this causes the entire regular expression to fail, will the engine try ignoring the part the question mark applies to.

The effect is that if you apply the regex «Feb 23(rd)?» to the string “Today is Feb 23rd, 2003”, the match will always be „Feb 23rd” and not „Feb 23”. You can make the question mark lazy (i.e. turn off the greediness) by putting a second question mark after the first.

I will say a lot more about greediness when discussing the other repetition operators.

Looking Inside The Regex Engine
Let's apply the regular expression «colou?r» to the string “The colonel likes the color green”.

The first token in the regex is the literal «c». The first position where it matches successfully is the „c” in “colonel”. The engine continues, and finds that «o» matches „o”, «l» matches „l” and another «o» matches „o”. Then the engine checks whether «u» matches “n”. This fails. However, the question mark tells the regex engine that failing to match «u» is acceptable. Therefore, the engine will skip ahead to the next regex token: «r». But this fails to match “n” as well. Now, the engine can only conclude that the entire regular expression cannot be matched starting at the „c” in “colonel”. Therefore, the engine starts again trying to match «c» to the first o in “colonel”.

After a series of failures, «c» will match with the „c” in “color”, and «o», «l» and «o» match the following characters. Now the engine checks whether «u» matches “r”. This fails. Again: no problem. The question mark allows the engine to continue with «r». This matches „r” and the engine reports that the regex successfully matched „color” in our string.
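The optional-item examples from this section can be replayed as follows. The sketch assumes Python's re module as a stand-in for the flavors discussed in this tutorial; the sample strings other than those quoted in the text are made up for illustration.

```python
import re

sentence = "The colonel likes the color green"

# «colou?r» fails inside "colonel" and succeeds at "color".
color = re.search(r"colou?r", sentence)

# Two question marks give four accepted spellings of the date.
dates = "February 23rd, February 23, Feb 23rd, Feb 23"
spellings = [m.group() for m in re.finditer(r"Feb(ruary)? 23(rd)?", dates)]

# Greedy question mark: the optional part is matched when possible.
greedy = re.search(r"Feb 23(rd)?", "Today is Feb 23rd, 2003").group()

# Lazy question mark (a second ? after the first): the optional part
# is skipped unless the rest of the regex needs it.
lazy = re.search(r"Feb 23(rd)??", "Today is Feb 23rd, 2003").group()
```

The only difference between the last two patterns is the extra question mark, which is enough to change what is reported as the match.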

10. Repetition with Star and Plus
I already introduced one repetition operator or quantifier: the question mark. It tells the engine to attempt to match the preceding token zero times or once, in effect making it optional.

The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more. «<[A-Za-z][A-Za-z0-9]*>» matches an HTML tag without any attributes. The sharp brackets are literals. The first character class matches a letter. The second character class matches a letter or digit. The star repeats the second character class. Because we used the star, it's OK if the second character class matches nothing. So our regex will match a tag like „<B>”. When matching „<HTML>”, the first character class will match „H”. The star will cause the second character class to be repeated three times, matching „T”, „M” and „L” with each step.

I could also have used «<[A-Za-z0-9]+>». I did not, because this regex would match „<1>”, which is not a valid HTML tag. But this regex may be sufficient if you know the string you are searching through does not contain any such invalid tags.

Limiting Repetition
Modern regex flavors, like those discussed in this tutorial, have an additional repetition operator that allows you to specify how many times a token can be repeated. The syntax is {min,max}, where min is a non-negative integer number indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is infinite. So «{0,}» is the same as «*», and «{1,}» is the same as «+». Omitting both the comma and max tells the engine to repeat the token exactly min times.

You could use «\b[1-9][0-9]{3}\b» to match a number between 1000 and 9999. «\b[1-9][0-9]{2,4}\b» matches a number between 100 and 99999. Notice the use of the word boundaries.

Watch Out for The Greediness!
Suppose you want to use a regex to match an HTML tag. You know that the input will be a valid HTML file, so the regular expression does not need to exclude any invalid use of sharp brackets. If it sits between sharp brackets, it is an HTML tag.

Most people new to regular expressions will attempt to use «<.+>». They will be surprised when they test it on a string like “This is a <EM>first</EM> test”. You might expect the regex to match „<EM>” and when continuing after that match, „</EM>”.

But it does not. The regex will match „<EM>first</EM>”. Obviously not what we wanted. The reason is that the plus is greedy. That is, the plus causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail, will the regex engine backtrack. That is, it will go back to the plus, make it give up the last iteration, and proceed with the remainder of the regex.

Like the plus, the star and the repetition using curly braces are greedy.

Let's take a look inside the regex engine to see in detail how this works and why this causes our regex to fail. After that, I will present you with two possible solutions.

Looking Inside The Regex Engine
The first token in the regex is «<». This is a literal. As we already know, the first place where it will match is the first „<” in the string. The next token is the dot, which matches any character except newlines. The dot is repeated by the plus. The plus is greedy. Therefore, the engine will repeat the dot as many times as it can. The dot matches „E”, so the regex continues to try to match the dot with the next character. „M” is matched, and the dot is repeated once more. The next character is the “>”. You should see the problem by now. The dot matches the „>”, and the engine continues repeating the dot. The dot will match all remaining characters in the string. The dot fails when the engine has reached the void after the end of the string. Only at this point does the regex engine continue with the next token: «>».

So far, «<.+» has matched „<EM>first</EM> test” and the engine has arrived at the end of the string. «>» cannot match here. The engine remembers that the plus has repeated the dot more often than is required. (Remember that the plus requires the dot to match only once.) Rather than admitting failure, the engine will backtrack. It will reduce the repetition of the plus by one, and then continue trying the remainder of the regex.

So the match of «.+» is reduced to „EM>first</EM> tes”. The next token in the regex is still «>». But now the next character in the string is the last “t”. Again, these cannot match, causing the engine to backtrack further. The total match so far is reduced to „<EM>first</EM> te”. But «>» still cannot match. So the engine continues backtracking until the match of «.+» is reduced to „EM>first</EM”. Now, «>» can match the next character in the string. The last token in the regex has been matched. The engine reports that „<EM>first</EM>” has been successfully matched.

Remember that the regex engine is eager to return a match. It will not continue backtracking further to see if there is another possible match. It will report the first valid match it finds. Because of greediness, this is the leftmost longest match.

Laziness Instead of Greediness
The quick fix to this problem is to make the plus lazy instead of greedy. You can do that by putting a question mark behind the plus in the regex. You can do the same with the star, the curly braces and the question mark itself. So our example becomes «<.+?>». Let's have another look inside the regex engine.

Again, «<» matches the first „<” in the string. The next token is the dot, this time repeated by a lazy plus. This tells the regex engine to repeat the dot as few times as possible. The minimum is one. So the engine matches the dot with „E”. The requirement has been met, and the engine continues with «>» and “M”. This fails. Again, the engine will backtrack. But this time, the backtracking will force the lazy plus to expand rather than reduce its reach. So the match of «.+» is expanded to „EM”, and the engine tries again to continue with «>». Now, „>” is matched successfully. The last token in the regex has been matched. The engine reports that „<EM>” has been successfully matched. That's more like it.

An Alternative to Laziness
In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: «<[^>]+>». The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code.

Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when such a regex is used repeatedly in a tight loop in a script that you are writing, or perhaps in a custom syntax coloring scheme for EditPad Pro.

Finally, remember that this tutorial only talks about regex-directed engines. Text-directed engines do not backtrack. They do not get the speed penalty, but they also do not support lazy repetition operators.
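The difference between the greedy plus, the lazy plus and the negated character class is easy to observe in practice. The sketch below assumes Python's re module (a regex-directed, backtracking engine) in place of the flavors used in this tutorial; the test strings for the curly-brace examples are invented for illustration.

```python
import re

html = "This is a <EM>first</EM> test"

# Greedy plus: the dot runs to the end of the string and backtracks
# only as far as the last ">".
greedy = re.findall(r"<.+>", html)

# Lazy plus: the dot is expanded one character at a time.
lazy = re.findall(r"<.+?>", html)

# Negated character class: same result as the lazy version, but the
# class can never run past a ">", so no backtracking is needed here.
negated = re.findall(r"<[^>]+>", html)

# Limiting repetition with curly braces and word boundaries.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "99 1000 9999 10000 0123")
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")
```

The results of lazy and negated are identical for this input; the difference lies only in how much work the engine does to get there.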

11. Use Round Brackets for Grouping
By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a regex operator, e.g. a repetition operator, to the entire group. I have already used round brackets for this purpose in previous topics throughout this tutorial.

Note that only round brackets can be used for grouping. Square brackets define a character class, and curly braces are used by a special repetition operator.

Round Brackets Create a Backreference
Besides grouping part of a regular expression together, round brackets also create a “backreference”. A backreference stores the part of the string matched by the part of the regular expression inside the parentheses.

That is, unless you use non-capturing parentheses. Remembering part of the regex match in a backreference slows down the regex engine because it has more work to do. If you do not use the backreference, you can speed things up by using non-capturing parentheses, at the expense of making your regular expression slightly harder to read.

The regex «Set(Value)?» matches „Set” or „SetValue”. In the first case, the first backreference will be empty, because it did not match anything. In the second case, the first backreference will contain „Value”.

If you do not use the backreference, you can optimize this regular expression into «Set(?:Value)?». The question mark and the colon after the opening round bracket are the special syntax that you can use to tell the regex engine that this pair of brackets should not create a backreference. Note the question mark after the opening bracket is unrelated to the question mark at the end of the regex. That question mark is the regex operator that makes the previous token optional. This operator cannot appear after an opening round bracket, because an opening bracket by itself is not a valid regex token. Therefore, there is no confusion between the question mark as an operator to make a token optional, and the question mark as a character to change the properties of a pair of round brackets. The colon indicates that the change we want to make is to turn off capturing the backreference.

How to Use Backreferences
Backreferences allow you to reuse part of the regex match. You can reuse it inside the regular expression (see below), or afterwards. What you can do with it afterwards depends on the tool you are using. In EditPad Pro or PowerGREP, you can use the backreference in the replacement text during a search-and-replace operation by typing \1 (backslash one) into the replacement text. If you searched for «EditPad (Lite|Pro)» and use “\1 version” as the replacement, the actual replacement will be “Lite version” in case „EditPad Lite” was matched, and “Pro version” in case „EditPad Pro” was matched.

EditPad Pro and PowerGREP have a unique feature that allows you to change the case of the backreference. \U1 inserts the first backreference in uppercase, \L1 in lowercase and \F1 with the first character in uppercase and the remainder in lowercase. Finally, \I1 inserts it with the first letter of each word capitalized, and the other letters in lowercase.

Regex libraries in programming languages also provide access to the backreference. In Perl, you can use the magic variables $1, $2, etc. to access the part of the string matched by the backreference. In .NET (dot net), you can use the Match object that is returned by the Match method of the Regex class. This object has a property called Groups, which is a collection of Group objects. To get the string matched by the third backreference in C#, you can use MyMatch.Groups[3].Value.

The .NET (dot net) Regex class also has a method Replace that can do a regex-based search-and-replace on a string. In the replacement text, you can use $1, $2, etc. to insert backreferences.

To figure out the number of a particular backreference, scan the regular expression from left to right and count the opening round brackets. The first bracket starts backreference number one, the second number two, etc. Non-capturing parentheses are not counted. This fact means that non-capturing parentheses have another benefit: you can insert them into a regular expression without changing the numbers assigned to the backreferences. This can be very useful when modifying a complex regular expression.

The Entire Regex Match As Backreference Zero
Certain tools make the entire regex match available as backreference zero. In EditPad Pro or PowerGREP, you can use the entire regex match in the replacement text during a search and replace operation by typing \0 (backslash zero) into the replacement text. In Perl, the magic variable $& holds the entire regex match.

In libraries like .NET (dot net), where backreferences are made available as an array or numbered list, the item with index zero holds the entire regex match. Using backreference zero is more efficient than putting an extra pair of round brackets around the entire regex, because that would force the engine to continuously keep an extra copy of the entire regex match.

Using Backreferences in The Regular Expression
Backreferences can not only be used after a match has been found, but also during the match. Suppose you want to match a pair of opening and closing HTML tags, and the text in between. By putting the opening tag into a backreference, we can reuse the name of the tag for the closing tag. Here's how: «<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>». This regex contains only one pair of parentheses, which capture the string matched by «[A-Z][A-Z0-9]*» into the first backreference. This backreference is reused with «\1» (backslash one). The «/» before it is simply the forward slash in the closing HTML tag that we are trying to match.

You can reuse the same backreference more than once. «([a-c])x\1x\1» will match „axaxa”, „bxbxb” and „cxcxc”. If a backreference was not used in a particular match attempt (such as in the first example where the question mark made the first backreference optional), it is simply empty. Using an empty backreference in the regex is perfectly fine. It will simply be replaced with nothingness.

A backreference cannot be used inside itself. «([abc]\1)» will not work. Depending on your regex flavor, it will either give an error message, or it will fail to match anything without an error message. Therefore, \0 cannot be used inside a regex, only in the replacement.
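Backreferences inside the regex can be tried out with the patterns from this section. This sketch assumes Python's re module rather than Perl or .NET; in Python, group(0) plays the role of backreference zero, and a backreference placed inside its own group is rejected with an error when the pattern is compiled. The string "42-foo-bar" is an invented example.

```python
import re

# The closing tag must repeat whatever name the first group captured.
tag_pair = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>")
found = tag_pair.search("Testing <B><I>bold italic</I></B> text")
entire_match = found.group(0)   # the entire regex match
tag_name = found.group(1)       # the first backreference

# The same backreference may be reused more than once.
repeated = re.compile(r"([a-c])x\1x\1")
results = [repeated.fullmatch(s) is not None
           for s in ("axaxa", "bxbxb", "cxcxc", "axbxa")]

# Numbering counts opening brackets; non-capturing groups are skipped.
numbered = re.search(r"(?:\d+)-(\w+)-(\w+)", "42-foo-bar")

# A backreference inside its own group is refused by Python.
try:
    re.compile(r"([abc]\1)")
    self_reference_rejected = False
except re.error:
    self_reference_rejected = True
```

Other flavors may silently fail to match instead of raising an error, as described above, so the last check is specific to the assumed library.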

Looking Inside The Regex Engine
Let's see how the regex engine applies the above regex to the string “Testing <B><I>bold italic</I></B> text”. The first token in the regex is the literal «<». The regex engine will traverse the string until it can match at the first „<” in the string. The next token is «[A-Z]». The regex engine also takes note that it is now inside the first pair of capturing parentheses. «[A-Z]» matches „B”. The engine advances to «[A-Z0-9]» and “>”. This match fails. However, because of the star, that's perfectly fine. The position in the string remains at “>”. The position in the regex is advanced to «[^>]».

This step crosses the closing bracket of the first pair of capturing parentheses. This prompts the regex engine to store what was matched inside them into the first backreference. In this case, „B” is stored.

After storing the backreference, the engine proceeds with the match attempt. «[^>]» does not match „>”. Again, because of another star, this is not a problem. The position in the string remains at “>”, and the position in the regex is advanced to «>». These obviously match. The next token is a dot, repeated by a lazy star. Because of the laziness, the regex engine will initially skip this token, taking note that it should backtrack in case the remainder of the regex fails.

The engine has now arrived at the second «<» in the regex, and the second “<” in the string. These match. The next token is «/». This does not match “I”, and the engine is forced to backtrack to the dot. The dot matches the second „<” in the string. The star is still lazy, so the engine again takes note of the available backtracking position and advances to «<» and “I”. These do not match, so the engine again backtracks.

The backtracking continues until the dot has consumed „<I>bold italic”. At this point, «<» matches the third „<” in the string, and the next token is «/» which matches “/”. The next token is «\1». Note that the token is the backreference, and not «B». The engine does not substitute the backreference in the regular expression. Every time the engine arrives at the backreference, it will read the value that was stored. This means that if the engine had backtracked beyond the first pair of capturing parentheses before arriving the second time at «\1», the new value stored in the first backreference would be used. But this did not happen here, so „B” it is. This fails to match at “I”, so the engine backtracks again, and the dot consumes the third “<” in the string.

Backtracking continues again until the dot has consumed „<I>bold italic</I>”. At this point, «<» matches „<” and «/» matches „/”. The engine arrives again at «\1». The backreference still holds „B”. «B» matches „B”. The last token in the regex, «>» matches „>”. A complete match has been found: „<B><I>bold italic</I></B>”.

Repetition and Backreferences
As I mentioned in the above inside look, the regex engine does not permanently substitute backreferences in the regular expression. It will use the last match saved into the backreference each time it needs to be used. If a new match is found by capturing parentheses, the previously saved match is overwritten. There is a clear difference between «([abc]+)» and «([abc])+». Though both successfully match „cab”, the first regex will put „cab” into the first backreference, while the second regex will only store „b”. That is because in the second regex, the plus caused the pair of parentheses to repeat three times. The first time, „c” was stored. The second time „a” and the third time „b”. Each time, the previous value was overwritten, so „b” remains.

This also means that «([abc]+)=\1» will match „cab=cab”, and that «([abc])+=\1» will not. The reason is that when the engine arrives at «\1», it holds «b» which fails to match “c”.
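The walkthrough and the difference between the two capture styles can be checked directly. What follows is a minimal sketch in Python (my choice of flavor, not the tutorial's). The tag-matching pattern is an assumption: it is rebuilt from the tokens the walkthrough names, since the regex itself appears on an earlier page.

```python
import re

# Assumption: this is the regex the walkthrough refers to, rebuilt from its tokens.
tag_pair = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>")
match = tag_pair.search("Testing <B><I>bold italic</I></B> text")
whole_match = match.group(0)   # text from the opening tag to its matching closing tag
tag_name = match.group(1)      # what the first backreference holds

# ([abc]+) captures the whole run once; ([abc])+ overwrites the capture on each repeat.
run_capture = re.search(r"([abc]+)", "cab").group(1)
last_capture = re.search(r"([abc])+", "cab").group(1)

# The backreference therefore succeeds in one regex and fails in the other.
whole_run = re.search(r"([abc]+)=\1", "cab=cab")
last_only = re.search(r"([abc])+=\1", "cab=cab")

print(whole_match, tag_name, run_capture, last_capture)
print(whole_run is not None, last_only is None)
```

The second pattern does match a string such as “cab=b”, because the backreference holds only the last repetition.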

This is obvious when you look at a simple example like this one, but it is a common cause of difficulty with regular expressions nonetheless. When using backreferences, always double check that you are really capturing what you want.

Useful Example: Checking for Doubled Words
When editing text, doubled words such as “the the” easily creep in. Using the regex «\b(\w+)\s+\1\b» in your text editor, you can easily find them. To delete the second word, simply type in “\1” as the replacement text and click the Replace button.

Parentheses and Backreferences Cannot Be Used Inside Character Classes
Round brackets cannot be used inside character classes, at least not as metacharacters. When you put a round bracket in a character class, it is treated as a literal character. So the regex «[(a)b]» matches „a”, „b”, „(” and „)”.

Backreferences also cannot be used inside a character class. The \1 in a regex like «(a)[\1b]» will be interpreted as an octal escape in most regex flavors. So this regex will match an a followed by either «\x01» or a «b».
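The doubled-word search and the character-class behavior can be tried outside a text editor as well. This is a small Python sketch; the sample sentence is made up for illustration, and the last two lines rely on Python's `re` treating numeric escapes inside a character class as characters, which matches the "most regex flavors" behavior described above.

```python
import re

doubled = re.compile(r"\b(\w+)\s+\1\b")
text = "Paris in the the spring is is lovely"   # hypothetical sample text

# Each hit is a word that is immediately repeated.
repeated_words = [m.group(1) for m in doubled.finditer(text)]

# Replacing the whole match with \1 keeps only the first copy of the word.
cleaned = doubled.sub(r"\1", text)

# Inside a character class, round brackets are literal characters.
class_members = re.findall(r"[(a)b]", "(a)b")

# Inside a character class, \1 is an octal escape, not a backreference.
octal_match = re.fullmatch(r"(a)[\1b]", "a\x01")
not_a_backreference = re.fullmatch(r"(a)[\1b]", "aa")

print(repeated_words, cleaned, class_members)
```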

12. Use Round Brackets for Grouping
All modern regular expression engines support capturing groups, which are numbered from left to right, starting with one. The numbers can then be used in backreferences to match the same text again in the regular expression, or to use part of the regex match for further processing. In a complex regular expression with many capturing groups, the numbering can get a little confusing.

Named Capture with Python, PCRE and PHP
Python's regex module was the first to offer a solution: named capture. By assigning a name to a capturing group, you can easily reference it by name. «(?P<name>group)» captures the match of «group» into the backreference “name”. You can reference the contents of the group with the numbered backreference «\1» or the named backreference «(?P=name)».

The open source PCRE library has followed Python's example, and offers named capture using the same syntax. The PHP preg functions offer the same functionality, since they are based on PCRE.

Python's sub() function allows you to reference a named group as “\1” or “\g<name>”. This does not work in PHP. In PHP, you can use double-quoted string interpolation with the $regs parameter you passed to preg_match(): “{$regs['name']}”.

Named Capture with .NET's System.Text.RegularExpressions
The regular expression classes of the .NET framework also support named capture. Unfortunately, the Microsoft developers decided to invent their own syntax, rather than follow the one pioneered by Python. Currently, no other regex flavor supports Microsoft's version of named capture.

Here is an example with two capturing groups in .NET style: «(?<first>group)(?'second'group)». As you can see, .NET offers two syntaxes to create a capturing group: one using sharp brackets, and the other using single quotes. The first syntax is preferable in strings, where single quotes may need to be escaped. The second syntax is preferable in ASP code, where the sharp brackets are used for HTML tags. You can use the pointy bracket flavor and the quoted flavor interchangeably.

To reference a capturing group inside the regex, use «\k<name>» or «\k'name'». Again, you can use the two syntactic variations interchangeably.

When doing a search-and-replace, you can reference the named group with the familiar dollar sign syntax: “${name}”. Simply use a name instead of a number between the curly braces.

RegexBuddy supports both Python's and Microsoft's style, and will convert one flavor of named capture into the other when generating source code snippets for Python, PHP/preg, PHP, or one of the .NET languages.

Names and Numbers for Capturing Groups
Here is where things get a bit ugly. Python and PCRE treat named capturing groups just like unnamed capturing groups, and number both kinds from left to right, starting with one.
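A short Python sketch of the named-capture syntax described above, showing that a named group is also reachable by its number. The group name `word` and the sample string are made up for illustration.

```python
import re

# (?P<name>...) names a group; (?P=name) is the named backreference inside the regex.
doubled = re.compile(r"(?P<word>\w+)\s+(?P=word)")
sample = "it was was fine"          # hypothetical sample text
match = doubled.search(sample)

by_name = match.group("word")
by_number = match.group(1)          # the named group is also group number 1

# In the replacement text, \g<word> and \1 refer to the same group.
replaced_by_name = doubled.sub(r"\g<word>", sample)
replaced_by_number = doubled.sub(r"\1", sample)

print(by_name, by_number, replaced_by_name)
```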

The regex «(a)(?P<x>b)(c)(?P<y>d)» matches „abcd” as expected. If you do a search-and-replace with this regex and the replacement “\1\2\3\4”, you will get “abcd”. All four groups were numbered from left to right, from one till four. Easy and logical.

Things are quite a bit more complicated with the .NET framework. The regex «(a)(?<x>b)(c)(?<y>d)» again matches „abcd”. However, if you do a search-and-replace with “$1$2$3$4” as the replacement, you will get “acbd”. Probably not what you expected.

The .NET framework does number named capturing groups from left to right, but numbers them after all the unnamed groups have been numbered. So the unnamed groups «(a)» and «(c)» get numbered first, from left to right, starting at one. Then the named groups «(?<x>b)» and «(?<y>d)» get their numbers, continuing from the unnamed groups, in this case: three.

To make things simple, when using .NET's regex support, just assume that named groups do not get numbered at all, and reference them by name exclusively. To keep things compatible across regex flavors, I strongly recommend that you do not mix named and unnamed capturing groups at all. Either give a group a name, or make it non-capturing as in «(?:nocapture)». Non-capturing groups are more efficient, since the regex engine does not need to keep track of their matches.
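The Python half of this comparison is easy to verify, and the recommendation to avoid mixing named and unnamed groups can be sketched alongside it. This example covers Python's `re` only; the .NET numbering described above is not reproduced here.

```python
import re

# Mixed named and unnamed groups: Python numbers all four from left to right.
mixed = re.compile(r"(a)(?P<x>b)(c)(?P<y>d)")
left_to_right = mixed.sub(r"\1\2\3\4", "abcd")
numbers_of_named = dict(mixed.groupindex)   # name -> group number

# The recommended style: name the groups you need, make the rest non-capturing.
names_only = re.compile(r"(?:a)(?P<x>b)(?:c)(?P<y>d)")
group_count = names_only.groups             # only the named groups are counted
swapped = names_only.sub(r"\g<y>\g<x>", "abcd")

print(left_to_right, numbers_of_named, group_count, swapped)
```

With only named and non-capturing groups, the replacement text never depends on how a particular flavor assigns numbers.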

13. Regex Matching Modes
All regular expression engines discussed in this tutorial support the following three matching modes:
• /i makes the regex match case insensitive.
• /s enables "single-line mode". In this mode, the dot matches newlines.
• /m enables "multi-line mode". In this mode, the caret and dollar match before and after newlines in the subject string.

Many regex flavors have additional modes or options that have single letter equivalents, but these differ widely.

Most tools that support regular expressions have checkboxes or similar controls that you can use to turn these modes on or off. Most programming languages allow you to pass option flags when constructing the regex object. E.g. in Perl, m/regex/i turns on case insensitivity, while Pattern.compile(“regex”, Pattern.CASE_INSENSITIVE) does the same in Java.

Specifying Modes Inside The Regular Expression
Sometimes, the tool or language does not provide the ability to specify matching options. E.g. the handy String.matches() method in Java does not take a parameter for matching options like Pattern.compile() does.

In that situation, you can add a mode modifier to the start of the regex. E.g. (?i) turns on case insensitivity, while (?ism) turns on all three options.

Turning Modes On and Off for Only Part of The Regular Expression
Modern regex flavors allow you to apply modifiers to only part of the regular expression. If you insert the modifier (?ism) in the middle of the regex, the modifier only applies to the part of the regex to the right of the modifier. You can turn off a mode by preceding it with a minus sign. All modes listed after the minus sign are turned off. E.g. (?i-sm) turns on case insensitivity, and turns off both single-line mode and multi-line mode.

Not all regex flavors support this. The latest versions of all tools and languages discussed in this book do. Older regex flavors usually apply the option to the entire regular expression, no matter where you placed it.

You can quickly test this. The regex «(?i)te(?-i)st» should match „test” and „TEst”, but not “teST” or “TEST”.

Modifier Spans
Instead of using two modifiers, one to turn an option on, and one to turn it off, you use a modifier span. «(?i)ignorecase(?-i)casesensitive(?i)ignorecase» is equivalent to «(?i)ignorecase(?-i:casesensitive)ignorecase». You have probably noticed the resemblance between the modifier span and the non-capturing group «(?:group)». Technically, the non-capturing group is a modifier span that does not change any modifiers. It is obvious that the modifier span does not create a backreference.
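As an illustration of these modes in one concrete flavor, here is a Python sketch. Python's `re` is my choice, not the tutorial's: it takes modes as flags or as a leading `(?i)`, and it supports modifier spans such as `(?i:...)`. Recent Python versions do not accept a bare `(?i)` or `(?-i)` in the middle of a pattern, so the «(?i)te(?-i)st» test is written with a span instead.

```python
import re

# Option flag versus mode modifier at the start of the regex.
with_flag = re.search(r"regex", "REGEX", re.IGNORECASE)
with_modifier = re.search(r"(?i)regex", "REGEX")

# Multi-line mode: caret and dollar match at line breaks.
lines = re.findall(r"^\w+$", "one\ntwo", re.MULTILINE)

# Single-line mode: the dot matches newlines.
dot_without_mode = re.search(r"a.b", "a\nb")
dot_with_mode = re.search(r"a.b", "a\nb", re.DOTALL)

# Modifier span: only "te" is case insensitive, "st" stays case sensitive.
span = re.compile(r"(?i:te)st")
candidates = ["test", "TEst", "teST", "TEST"]
accepted = [s for s in candidates if span.fullmatch(s)]

print(lines, accepted)
```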

14. Atomic Grouping and Possessive Quantifiers
When discussing the repetition operators or quantifiers, I explained the difference between greedy and lazy repetition. Greediness and laziness determine the order in which the regex engine tries the possible permutations of the regex pattern. A greedy quantifier will first try to repeat the token as many times as possible, and gradually give up matches as the engine backtracks to find an overall match. A lazy quantifier will first repeat the token as few times as required, and gradually expand the match as the engine backtracks through the regex to find an overall match.

Because greediness and laziness change the order in which permutations are tried, they can change the overall regex match. However, they do not change the fact that the regex engine will backtrack to try all possible permutations of the regular expression in case no match can be found.

First, let's see why backtracking can lead to problems.

Catastrophic Backtracking
Recently I got a complaint from a customer that EditPad Pro hung (i.e. it stopped responding) when trying to find lines in a comma-delimited text file where the 12th item on a line started with a “P”. The customer was using the regexp «^(.*?,){11}P».

At first sight, this regex looks like it should do the job just fine. The lazy dot and comma match a single comma-delimited field, and the {11} skips the first 11 fields. Finally, the P checks if the 12th field indeed starts with P. In fact, this is exactly what will happen when the 12th field indeed starts with a P.

The problem rears its ugly head when the 12th field does not start with a P. Let's say the string is “1,2,3,4,5,6,7,8,9,10,11,12,13”. When the «P» fails to match the “1” at the start of the 12th field, the regex engine will backtrack. It will backtrack to the point where «^(.*?,){11}» had consumed „1,2,3,4,5,6,7,8,9,10,11”, giving up the last match of the comma. The next token is again the dot. The dot matches a comma. The dot matches the comma! However, the comma does not match the “1” in the 12th field, so the dot continues until the 11th iteration of «.*?,» has consumed „11,12,”. You can already see the root of the problem: the part of the regex (the dot) matching the contents of the field also matches the delimiter (the comma). Because of the double repetition (star inside {11}), this leads to a catastrophic amount of backtracking.

The regex engine now checks whether the 13th field starts with a P. It does not. Since there is no comma after the 13th field, the regex engine can no longer match the 11th iteration of «.*?,». But it does not give up there. It backtracks to the 10th iteration, expanding the match of the 10th iteration to „10,11,”. Since there is still no P, the 10th iteration is expanded to „10,11,12,”. Reaching the end of the string again, the same story starts with the 9th iteration, subsequently expanding it to „9,10,”, „9,10,11,”, „9,10,11,12,”. But between each expansion, there are more possibilities to be tried. When the 9th iteration consumes „9,10,”, the 10th could match just „11,” as well as „11,12,”. Continuously failing, the engine backtracks to the 8th iteration, again trying all possible combinations for the 9th, 10th, and 11th iterations.

You get the idea: the possible number of combinations that the regex engine will try for each line where the 12th field does not start with a P is huge. This causes software like EditPad Pro to stop responding, or even crash as the regex engine runs out of memory trying to remember all backtracking positions.
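The root of the problem, the dot also matching the delimiter, can be observed without provoking the catastrophic case. The Python sketch below uses two short made-up lines. In the second one only the 13th field starts with a P, yet the customer's regex still reports a match, because the last iteration of the group swallows two fields.

```python
import re

# The customer's regex: 11 lazily matched "fields", then a P.
loose = re.compile(r"^(.*?,){11}P")

twelfth_p = "1,2,3,4,5,6,7,8,9,10,11,P12,13"      # 12th field starts with P
thirteenth_p = "1,2,3,4,5,6,7,8,9,10,11,12,P13"   # only the 13th field does

intended = loose.search(twelfth_p)
accidental = loose.search(thirteenth_p)

# group(1) holds the text matched by the last iteration of the repeated group.
last_iteration_intended = intended.group(1)
last_iteration_accidental = accidental.group(1)

print(last_iteration_intended, last_iteration_accidental)
```

The same freedom that produces this accidental match is what the engine exhausts, combination by combination, when no P can be found at all.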

Preventing Catastrophic Backtracking
The solution is simple. When nesting repetition operators, make absolutely sure that there is only one way to match the same match. If repeating the inner loop 4 times and the outer loop 7 times results in the same overall match as repeating the inner loop 6 times and the outer loop 2 times, you can be sure that the regex engine will try all those combinations.

In our example, the solution is to be more exact about what we want to match. We want to match 11 comma-delimited fields. The fields must not contain commas. So the regex becomes: «^([^,\r\n]*,){11}P». If the P cannot be found, the engine will still backtrack. But it will backtrack only 11 times, and each time the «[^,\r\n]» is not able to expand beyond the comma, forcing the regex engine to the previous one of the 11 iterations immediately, without trying further options.

Atomic Grouping and Possessive Quantifiers
Recent regex flavors have introduced two additional solutions to this problem: atomic grouping and possessive quantifiers. Their purpose is to prevent backtracking, allowing the regex engine to fail faster.

In the above example, we could easily reduce the amount of backtracking to a very low level by better specifying what we wanted. But that is not always possible in such a straightforward manner. In that case, you should use atomic grouping to prevent the regex engine from backtracking.

Using atomic grouping, the above regex becomes «^(?>(.*?,){11})P». Everything between (?>) is treated as one single token by the regex engine, once the regex engine leaves the group. Because the entire group is one token, no backtracking can take place once the regex engine has found a match for the group. If backtracking is required, the engine has to backtrack to the regex token before the group (the caret in our example). If there is no token before the group, the regex must retry the entire regex at the next position in the string.

Possessive quantifiers are a limited form of atomic grouping with a cleaner notation. To make a quantifier possessive, place a plus after it. «x++» is the same as «(?>x+)». Similarly, you can use «x*+», «x?+» and «x{m,n}+». Note that you cannot make a lazy quantifier possessive. It would match the minimum number of matches and never expand the match because backtracking is not allowed.

Tool and Language Support for Atomic Grouping and Possessive Quantifiers
Atomic grouping is a recent addition to the regex scene, and only supported by the latest versions of most regex flavors. Perl supports it starting with version 5.6. Java supports it starting with JDK version 1.4.2, though the JDK documentation uses the term “independent group” rather than “atomic group”. All versions of .NET support atomic grouping, as do recent versions of PCRE and PHP's preg functions. Python does not support atomic grouping.

At this time, possessive quantifiers are only supported by the Java JDK 1.4.0 and later, and PCRE version 4 and later.

The latest versions of EditPad Pro and PowerGREP support both atomic grouping and possessive quantifiers, as do all versions of RegexBuddy.

Atomic Grouping Inside The Regex Engine
Let's see how «^(?>(.*?,){11})P» is applied to “1,2,3,4,5,6,7,8,9,10,11,12,13”. The caret matches at the start of the string and the engine enters the atomic group. The star is lazy, so the dot is initially skipped. But the comma does not match “1”, so the engine backtracks to the dot. That's right: backtracking is allowed here. The star is not possessive, and is not immediately enclosed by an atomic group. That is, the regex engine did not cross the closing round bracket of the atomic group. The dot matches „1”, and the comma matches too. «{11}» causes further repetition until the atomic group has matched „1,2,3,4,5,6,7,8,9,10,11,”.

Now, the engine leaves the atomic group. Because the group is atomic, all backtracking information is discarded and the group is now considered a single token. The engine now tries to match «P» to the “1” in the 12th field. This fails.

So far, everything happened just like in the original, troublesome regular expression. Now comes the difference. «P» failed to match, so the engine backtracks. The previous token is an atomic group, so the group's entire match is discarded and the engine backtracks further to the caret. The engine now tries to match the caret at the next position in the string, which fails. The engine walks through the string until the end, and declares failure. Failure is declared after 30 attempts to match the caret, and just one attempt to match the atomic group, rather than after 30 attempts to match the caret and a huge number of attempts to try all combinations of both quantifiers in the regex.

That is what atomic grouping and possessive quantifiers are for: efficiency by disallowing backtracking. The most efficient regex for our problem at hand would be «^(?>((?>[^,\r\n]*),){11})P», since possessive, greedy repetition of the star is faster than a backtracking lazy dot. If possessive quantifiers are available, you can reduce clutter by writing «^(?>([^,\r\n]*+,){11})P».

When To Use Atomic Grouping or Possessive Quantifiers
Atomic grouping and possessive quantifiers speed up failure by eliminating backtracking. They do not speed up success, only failure. When nesting quantifiers like in the above example, you really should use atomic grouping and/or possessive quantifiers whenever possible.

While «x[^x]*+x» and «x(?>[^x]*)x» fail faster than «x[^x]*x», the increase in speed is minimal. If the final x in the regex cannot be matched, the regex engine backtracks once for each character matched by the star. With simple repetition, the amount of time wasted with pointless backtracking increases in a linear fashion to the length of the string. With combined repetition, the amount of time wasted increases exponentially and will very quickly exhaust the capabilities of your computer.

Still, if you are smart about combined repetition, you often can avoid the problem without atomic grouping as in the example above.

If you are simply doing a search in a text editor, using simple repetition, you will not earn back the extra time to type in the characters for the atomic grouping. If the regex will be used in a tight loop in an application, or process huge amounts of data, then atomic grouping may make a difference.

Note that atomic grouping and possessive quantifiers can alter the outcome of the regular expression match. «\d+6» will match „123456” in “123456789”. «\d++6» will not match at all. «\d+» will match the entire string. With the former regex, the engine backtracks until the 6 can be matched. In the latter case, no backtracking is allowed, and the match fails. Again, the cause of this is that the token «\d» that is repeated can also match the delimiter «6». Sometimes this is desirable, often it is not.

This shows again that understanding how the regex engine works on the inside will enable you to avoid many pitfalls and craft efficient regular expressions that match exactly what you want.
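The change in outcome between «\d+6» and «\d++6» can be reproduced even in a flavor without atomic groups. The Python sketch below does not use the «(?>...)» or «++» syntax. It swaps in a common emulation instead: a lookahead captures the greedy run, and a backreference then consumes it, so the engine cannot give characters back. Treat it as an illustration of the behavior, not as the notation discussed above. The group name `run` is made up.

```python
import re

text = "123456789"

# Ordinary greedy repetition: the engine backtracks until the 6 can be matched.
backtracking = re.search(r"\d+6", text)

# Emulated atomic group: the lookahead captures the longest run of digits and
# (?P=run) consumes exactly that run, so no digits can be given up afterwards.
atomic_like = re.search(r"(?=(?P<run>\d+))(?P=run)6", text)

# When the next token does not compete with \d, both forms find the same match.
plain_x = re.search(r"\d+x", "123x")
atomic_like_x = re.search(r"(?=(?P<run>\d+))(?P=run)x", "123x")

print(backtracking.group(0), atomic_like, atomic_like_x.group(0))
```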

15. Lookahead and Lookbehind Zero-Width Assertions
Perl 5 introduced two very powerful constructs: “lookahead” and “lookbehind”. Collectively, these are called “lookaround”. They are also called “zero-width assertions”. They are zero-width just like the start and end of line, and start and end of word anchors that I already explained. The difference is that lookarounds will actually match characters, but then give up the match and only return the result: match or no match. That is why they are called “assertions”. They do not consume characters in the string, but only assert whether a match is possible or not. Lookarounds allow you to create regular expressions that are impossible to create without them, or that would get very longwinded without them. All regex flavors discussed in this book support lookaround. The exception is JavaScript, which supports lookahead but not lookbehind.

Positive and Negative Lookahead
Negative lookahead is indispensable if you want to match something not followed by something else. When explaining character classes, I already explained why you cannot use a negated character class to match a “q” not followed by a “u”. Negative lookahead provides the solution: «q(?!u)». The negative lookahead construct is the pair of round brackets, with the opening bracket followed by a question mark and an exclamation point. Inside the lookahead, we have the trivial regex «u».

Positive lookahead works just the same. «q(?=u)» matches a q that is followed by a u, without making the u part of the match. The positive lookahead construct is a pair of round brackets, with the opening bracket followed by a question mark and an equals sign.

You can use any regular expression inside the lookahead. (Note that this is not the case with lookbehind. I will explain why below.) If it contains capturing parentheses, the backreferences will be saved. Note that the lookahead itself does not create a backreference. So it is not included in the count towards numbering the backreferences. If you want to store the match of the regex inside a backreference, you have to put capturing parentheses around the regex inside the lookahead, like this: «(?=(regex))». The other way around will not work, because the lookahead will already have discarded the regex match by the time the backreference is to be saved.

Regex Engine Internals
First, let's see how the engine applies «q(?!u)» to the string “Iraq”. The first token in the regex is the literal «q». As we already know, this will cause the engine to traverse the string until the „q” in the string is matched. The position in the string is now the void behind the string. The next token is the lookahead. The engine takes note that it is inside a lookahead construct now, and begins matching the regex inside the lookahead. So the next token is «u». This does not match the void behind the string. The engine notes that the regex inside the lookahead failed. Because the lookahead is negative, this means that the lookahead has successfully matched at the current position. At this point, the entire regex has matched, and „q” is returned as the match.

Let's try applying the same regex to “quit”. «q» matches „q”. The next token is the «u» inside the lookahead. The next character is the “u”. These match. The engine advances to the next character: “i”. However, it is done with the regex inside the lookahead. The engine notes success, and discards the regex match. This causes the engine to step back in the string to “u”.

Because the lookahead is negative, the successful match inside it causes the lookahead to fail. Since there are no other permutations of this regex, the engine has to start again at the beginning. Since «q» cannot match anywhere else, the engine reports failure.

Let's take one more look inside, to make sure you understand the implications of the lookahead. Let's apply «q(?=u)i» to “quit”. I have made the lookahead positive, and put a token after it. Again, «q» matches „q” and «u» matches „u”. Again, the match from the lookahead must be discarded, so the engine steps back from “i” in the string to “u”. The lookahead was successful, so the engine continues with «i». But «i» cannot match “u”. So this match attempt fails. All remaining attempts will fail as well, because there are no more q's in the string.

Positive and Negative Lookbehind
Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. «(?<!a)b» matches a “b” that is not preceded by an “a”, using negative lookbehind. It will not match “cab”, but will match the „b” (and only the „b”) in “bed” or “debt”. «(?<=a)b» (positive lookbehind) matches the „b” (and only the „b”) in „cab”, but does not match “bed” or “debt”.

The construct for positive lookbehind is «(?<=text)»: a pair of round brackets, with the opening bracket followed by a question mark, “less than” symbol and an equals sign. Negative lookbehind is written as «(?<!text)», using an exclamation point instead of an equals sign.

More Regex Engine Internals
Let's apply «(?<=a)b» to “thingamabob”. The engine starts with the lookbehind and the first character in the string. In this case, the lookbehind tells the engine to step back one character, and see if an “a” can be matched there. The engine cannot step back one character because there are no characters before the “t”. So the lookbehind fails, and the engine starts again at the next character, the “h”. (Note that a negative lookbehind would have succeeded here.) Again, the engine temporarily steps back one character to check if an “a” can be found there. It finds a “t”, so the positive lookbehind fails again.

The lookbehind continues to fail until the regex reaches the “m” in the string. The engine again steps back one character, and notices that the „a” can be matched there. The positive lookbehind matches. Because it is zero-width, the current position in the string remains at the “m”. The next token is «b», which cannot match here. The next character is the second “a” in the string. The engine steps back, and finds out that the “m” does not match «a».

The next character is the first “b” in the string. The engine steps back and finds out that „a” satisfies the lookbehind. «b» matches „b”, and the entire regex has been matched successfully. It matches one character: the first „b” in the string.

Important Notes About Lookbehind
The good news is that you can use lookbehind anywhere in the regex, not only at the start. If you want to find a word not ending with an “s”, you could use «\b\w+(?<!s)\b». This is definitely not the same as

«\b\w+[^s]\b». When applied to “John's”, the former will match „John” and the latter „John'” (including the apostrophe). I will leave it up to you to figure out why. (Hint: «\b» matches between the apostrophe and the “s”). The latter will also not match single-letter words like “a” or “I”. The correct regex without using lookbehind is «\b\w*[^s\W]\b» (star instead of plus, and \W in the character class). Personally, I find the lookbehind easier to understand. The last regex, which works correctly, has a double negation (the \W in the negated character class). Double negations tend to be confusing to humans. Not to regex engines, though.

The bad news is that you cannot use just any regex inside a lookbehind. The reason is that regular expressions do not work backwards. The string must be traversed from left to right. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

Therefore, many regex flavors, including those used by Perl 5 and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.

Some regex flavors support the above, plus alternation with strings of different lengths. But each string in the alternation must still be of fixed length, so only literals and character classes can be used. This includes PCRE, PHP, RegexBuddy, EditPad Pro and PowerGREP.

Finally, some more advanced flavors support the above, plus finite repetition. This means you can still not use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. These regex flavors recognize the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. The only regex flavor that I know of that currently supports this is Sun's regex package in the JDK 1.4.

Technically, the .NET framework can apply regular expressions backwards, and will allow you to use any regex, including infinite repetition, inside lookbehind. However, the semantics of applying a regular expression backwards are currently not well-defined. Microsoft has promised to resolve this in version 2.0 of the .NET framework. Until that happens, I recommend you use only fixed-length strings, alternation and character classes inside lookbehind.

Finally, JavaScript does not support lookbehind at all.

Even with these limitations, lookbehind is a valuable addition to the regular expression syntax.
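The lookahead and lookbehind examples from this chapter can be checked with a few lines of Python, one of the fixed-length lookbehind flavors mentioned above. This is only a sketch. The variable-length lookbehind at the end is a made-up pattern, included to show how Python rejects it.

```python
import re

# Negative and positive lookahead on the "Iraq" and "quit" examples.
iraq = re.search(r"q(?!u)", "Iraq")          # the q is followed by the end of the string
quit_negative = re.search(r"q(?!u)", "quit")
quit_positive = re.search(r"q(?=u)", "quit") # the u is checked but not consumed
quit_then_i = re.search(r"q(?=u)i", "quit")  # fails: after q comes u, not i

# Positive lookbehind on "thingamabob": only a b preceded by an a matches.
thingamabob = re.search(r"(?<=a)b", "thingamabob")

# Words not ending in s, applied to "John's".
with_lookbehind = re.findall(r"\b\w+(?<!s)\b", "John's")
with_negated_class = re.findall(r"\b\w+[^s]\b", "John's")
double_negation = re.findall(r"\b\w*[^s\W]\b", "John's")

# Python only accepts lookbehind of a predetermined length.
try:
    re.compile(r"(?<=a+)b")                  # hypothetical variable-length lookbehind
    variable_length_allowed = True
except re.error:
    variable_length_allowed = False

print(iraq.span(), quit_positive.group(0), thingamabob.start())
print(with_lookbehind, with_negated_class, double_negation)
```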

16. Testing The Same Part of The String for More Than One Requirement
Lookaround, which I introduced in detail in the previous topic, is a very powerful concept. Unfortunately, it is often underused by people new to regular expressions, because lookaround is a bit confusing. The confusing part is that the lookaround is zero-width. So if you have a regex in which a lookahead is followed by another piece of regex, or a lookbehind is preceded by another piece of regex, then the regex will traverse part of the string twice.

To make this clear, I would like to give you another, a bit more practical example. Let's say we want to find a word that is six letters long and contains the three subsequent letters “cat”. Actually, we can match this without lookaround. We just specify all the options and lump them together using alternation: «cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat». Easy enough. But this method gets unwieldy if you want to find any word between 6 and 12 letters long containing either “cat”, “dog” or “mouse”.

Lookaround to The Rescue
In this example, we basically have two requirements for a successful match. First, we want a word that is 6 letters long. Second, the word we found must contain the word “cat”.

Matching a 6-letter word is easy with «\b\w{6}\b». Matching a word containing “cat” is equally easy: «\b\w*cat\w*\b».

Combining the two, we get: «(?=\b\w{6}\b)\b\w*cat\w*\b». Easy! Here's how this works. At each character position in the string where the regex is attempted, the engine will first attempt the regex inside the positive lookahead. This sub-regex, and therefore the lookahead, matches only when the current character position in the string is at the start of a 6-letter word in the string. If not, the lookahead will fail, and the engine will continue trying the regex from the start at the next character position in the string.

The lookahead is zero-width. So when the regex inside the lookahead has found the 6-letter word, the current position in the string is still at the beginning of the 6-letter word. At this position the regex engine will attempt the remainder of the regex. Because we already know that a 6-letter word can be matched at the current position, we know that «\b» matches and that the first «\w*» will match 6 times. The engine will then backtrack, reducing the number of characters matched by «\w*», until «cat» can be matched. If «cat» cannot be matched, the engine has no other choice but to restart at the beginning of the regex, at the next character position in the string. This is at the second letter in the 6-letter word we just found, where the lookahead will fail, causing the engine to advance character by character until the next 6-letter word.

If «cat» can be successfully matched, the second «\w*» will consume the remaining letters, if any, in the 6-letter word. After that, the last «\b» in the regex is guaranteed to match where the second «\b» inside the lookahead matched. Our double-requirement-regex has matched successfully.
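Both approaches can be compared on a short sample. In the Python sketch below, the word list is made up, and the alternation is wrapped in «\b...\b» (my addition) so that it tests whole words the same way the lookahead version does.

```python
import re

# Two requirements at once: a 6-letter word (lookahead) that contains "cat".
with_lookahead = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")

# The alternation from the text, anchored to whole words for a fair comparison.
with_alternation = re.compile(r"\b(?:cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat)\b")

words = "bobcat catalog locate cattle scatter cat"   # hypothetical sample text

found_by_lookahead = with_lookahead.findall(words)
found_by_alternation = with_alternation.findall(words)

print(found_by_lookahead)
print(found_by_alternation)
```

The 7-letter words and the bare “cat” are skipped by both patterns, but only the lookahead version scales to longer length ranges without listing every position.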

Optimizing Our Solution

While the above regex works just fine, it is not the most efficient solution. This is not a problem if you are just doing a search in a text editor. But optimizing things is a good idea if this regex will be used repeatedly and/or on large chunks of data in an application you are developing.

You can discover these optimizations by yourself if you carefully examine the regex and follow how the regex engine applies it, as I did above. I said the third and last «\b» are guaranteed to match. Since they are zero-width, and therefore do not change the result returned by the regex engine, we can remove them, leaving: «(?=\b\w{6}\b)\w*cat\w*». Though the last «\w*» is also guaranteed to match, we cannot remove it because it adds characters to the regex match. Remember that the lookahead discards its match, so it does not contribute to the match returned by the regex engine. If we omitted the «\w*», the resulting match would be the start of a 6-letter word containing “cat”, up to and including “cat”, instead of the entire word.

But we can optimize the first «\w*». As it stands, it will match 6 letters and then backtrack. But we know that in a successful match, there can never be more than 3 letters before “cat”. So we can optimize this to «\w{0,3}». Note that making the asterisk lazy would not have optimized this sufficiently. The lazy asterisk would find a successful match sooner, but if a 6-letter word does not contain “cat”, it would still cause the regex engine to try matching “cat” at the last two letters, at the last single letter, and even at one character beyond the 6-letter word.

So we have «(?=\b\w{6}\b)\w{0,3}cat\w*». One last, minor optimization involves the first «\b». Since it is zero-width itself, there's no need to put it inside the lookahead. So the final regex is: «\b(?=\w{6}\b)\w{0,3}cat\w*».

A More Complex Problem

So, what would you use to find any word between 6 and 12 letters long containing either “cat”, “dog” or “mouse”? Again we have two requirements, which we can easily combine using a lookahead: «\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*». Very easy, once you get the hang of it. This regex will also put “cat”, “dog” or “mouse” into the first backreference.
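A quick way to gain confidence in such optimizations is to check that the original and the optimized regex return the same matches. The following is a minimal sketch, again assuming Python's `re` module; the sample strings are hypothetical, and agreement on one sample is of course not a proof of equivalence.

```python
import re

# Unoptimized and optimized forms of the 6-letter "cat" regex.
original = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")
optimized = re.compile(r"\b(?=\w{6}\b)\w{0,3}cat\w*")

# The more general regex: 6 to 12 letters, containing cat, dog or mouse.
general = re.compile(r"\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*")

text = "bobcat cathedral catnip scatter vacate cat"
same = original.findall(text) == optimized.findall(text)

# group(0) is the whole word; group(1) is the first backreference.
words = "bulldog mousetrap cat dogma concatenate hotdog"
found = [(m.group(0), m.group(1)) for m in general.finditer(words)]
print(same, found)
```

Here "cat" and "dogma" are rejected by the length check in the lookahead, while the capturing group records which of the three alternatives was found in each matching word.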

17. Finding Matches Only Inside a Section of The String

Lookahead allows you to create regular expression patterns that are impossible to create without it. One example is matching a particular regex only inside specific sections of the string or file searched through.

To keep things simple, I will use «wanted» as a substitute for the regular expression that we are trying to match inside the section, «start» as the regex matching the start of the section, and «stop» as the regex matching the end of the section.

If “wanted” occurs only once inside the section, we can do without lookahead. «start.*?wanted.*?stop» would do the trick. However, this will not work if “wanted” occurs more than once inside a single section. The reason is that this regular expression consumes the entire section. The entire section is included in the regex match. When we apply the regex again to the same string or file, it will continue after „stop”, not after „wanted”. So we need a way to match „wanted” without matching the rest of the section.

This is possible with lookahead. The final regular expression will be in the form of «wanted(?=insidesection)». First we match the string we want, and then we test if it is inside the proper section.

You may be tempted to use a combination of lookbehind and lookahead like in «(?<=start.*?)wanted(?=.*?stop)». However, this will not work. Most regex engines will refuse to compile this regular expression, because they require lookbehind to be of fixed length. Since we do not know in advance how many characters there will be between “start” and “wanted”, we need to use «.*?». The star is obviously not of fixed length.

So we have to resort to using lookahead only. How do we know if we matched «wanted» inside a section? First, we must be able to match «stop» after matching «wanted». If not, we found a match after a section rather than inside a section. Second, we must not be able to match «start» between matching «wanted» and matching «stop». If we could, we found a match before a section rather than inside a section. Note that these two rules will only yield success if the string or file searched through is properly divided into sections. That is, each match of «start» must be followed by exactly one match of «stop».

So inside the lookahead we need to look for a series of unspecified characters that do not match the start of a section anywhere in the series, and end with the section stop. In a regex, this is written as: «((?!start).)*?stop». The dot and negative lookahead match any character that is not the first character of the start of a section. This, we repeat zero or more times with the star. After this, we need to match the end of the section. I used a lazy star to make the regex more efficient. Because of the negative lookahead inside the star, the star will also stop at the start of a section, at which point «stop» cannot be matched and thus the regex will fail. Effectively, the lazy star will continue to repeat until the end of the section is reached.

The final regular expression becomes: «wanted(?=((?!start).)*?stop)». Substitute «wanted», «start» and «stop» with the regexes of your choice.

Example: Search and Replace within Header Tags

Using the above generic regular expression, you can easily build a regex to do a search and replace on HTML files, replacing a certain word with another, but only inside header tags. A header tag starts with «<H[1-6]» and ends with «</H[1-6]>». I omitted the closing > in the start tag to allow for attributes. So the regex becomes «wanted(?=((?!\<H[1-6]).)*?</H[1-6]>)».

You may have noticed that I escaped the < of the opening tag in the final regex. I did that because some regex flavors interpret «(?!<» as identical to «(?<!», or negative lookbehind. But lookahead is what we need here. Escaping the < takes care of the problem.
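The header tag regex can be exercised with a small search-and-replace script. This is a sketch assuming Python's `re` module, which does not confuse «(?!<» with lookbehind, so the < is left unescaped here; the HTML fragment and the replacement text are invented for illustration.

```python
import re

# "wanted" followed, without crossing the start of another header tag,
# by a closing header tag. Only then is it inside a header.
inside_header = re.compile(r"wanted(?=((?!<H[1-6]).)*?</H[1-6]>)")

# Hypothetical input: "wanted" occurs inside headers and inside a paragraph.
html = "<H1>wanted and wanted</H1><P>wanted</P><H2 class=x>wanted</H2>"

# Because the lookahead consumes nothing, both occurrences in <H1> are found.
result = inside_header.sub("FOUND", html)
print(result)
```

The occurrence inside the paragraph is left alone: scanning forward from it, the lookahead runs into the start of the next header tag before it can reach a closing header tag. Note that the dot does not match line breaks by default, so headers spanning several lines would need the flavor's "dot matches newline" mode.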

18. Continuing at The End of The Previous Match

The anchor «\G» matches at the position where the previous match ended. During the first match attempt, «\G» matches at the start of the string in the way «\A» does.

Applying «\G\w» to the string “test string” matches „t”. Applying it again matches „e”. The 3rd attempt yields „s” and the 4th attempt matches the second „t” in the string. The fifth attempt fails. During the fifth attempt, the only place in the string where «\G» matches is after the second t. But that position is not followed by a word character, so the match fails.

End of The Previous Match vs Start of The Match Attempt

With some regex flavors or tools, «\G» matches at the start of the match attempt, rather than at the end of the previous match result. This is the case with EditPad Pro, where «\G» matches at the position of the text cursor, rather than the end of the previous match. When a match is found, EditPad Pro will select the match, and move the text cursor to the end of the match. The result is that «\G» matches at the end of the previous match result only when you do not move the text cursor between two searches. All in all, this makes a lot of sense in the context of a text editor.

\G Magic with Perl

In Perl, the position where the last match ended is a “magical” value that is remembered separately for each string variable. The position is not associated with any regular expression. This means that you can use «\G» to make a regex continue in a subject string where another regex left off.

If a match attempt fails, the stored position for «\G» is reset to the start of the string. To avoid this, specify the continuation modifier /c, which Perl honors in combination with /g.

All this is very useful to make several regular expressions work together. E.g. you could parse an HTML file in the following fashion:

    # Find each opening bracket; /g stores the position after it for \G.
    while ($string =~ m/</g) {
      # /gc continues at that position and keeps it if the attempt fails.
      if ($string =~ m/\GB>/gc) {
        # Bold
      } elsif ($string =~ m/\GI>/gc) {
        # Italics
      } else {
        # ...etc...
      }
    }

The regex in the while loop searches for the tag's opening bracket, and the regexes inside the loop check which tag we found. This way you can parse the tags in the file in the order they appear in the file, without having to write a single big regex that matches all tags you are interested in.
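The «\G\w» walk-through on “test string” can be imitated in flavors that lack «\G». The sketch below assumes Python's `re` module, which has no «\G» anchor; instead, the `pos` argument of a compiled pattern's `match()` method anchors the attempt at a given offset, and the loop stores the end of each match by hand. It is an analogue of the behavior described above, not Perl's mechanism.

```python
import re

word_char = re.compile(r"\w")
subject = "test string"

pos = 0          # plays the role of the remembered end of the previous match
matched = []
while True:
    # match() only succeeds if the pattern matches exactly at pos,
    # which is what \G\w would require.
    m = word_char.match(subject, pos)
    if m is None:
        break    # the fifth attempt: a space follows the second "t"
    matched.append(m.group(0))
    pos = m.end()

print(matched, pos)
```

The loop stops after four matches even though more word characters occur later in the string, because each attempt is tied to the position where the previous match ended.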

\G in Other Programming Languages

This flexibility is not available with most other programming languages. E.g. in Java, the position for «\G» is remembered by the Matcher object. The Matcher is strictly associated with a single regular expression and a single subject string. What you can do, though, is add a line of code to make the match attempt of the second Matcher start where the match of the first Matcher ended. «\G» will then match at this position.

19. If-Then-Else Conditionals in Regular Expressions

A special construct «(?ifthen|else)» allows you to create conditional regular expressions. If the if part evaluates to true, then the regex engine will attempt to match the then part. Otherwise, the else part is attempted instead. The syntax consists of a pair of round brackets. The opening bracket must be followed by a question mark, immediately followed by the if part, immediately followed by the then part. This part can be followed by a vertical bar and the else part. You may omit the else part, and the vertical bar with it.

For the if part, you can use the lookahead and lookbehind constructs. Using positive lookahead, the syntax becomes «(?(?=regex)then|else)». Because the lookahead has its own parentheses, the if and then parts are clearly separated.

Remember that the lookaround constructs do not consume any characters. If you use a lookahead as the if part, then the regex engine will attempt to match the then or else part (depending on the outcome of the lookahead) at the same position where the if was attempted.

For the then and else, you can use any regular expression. If you want to use alternation, you will have to group the then or else together using parentheses, like in «(?(?=condition)(then1|then2|then3)|(else1|else2|else3))». Otherwise, there is no need to use parentheses around the then and else parts.
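Not every flavor accepts a lookahead as the if part. As a hedged illustration, assuming Python's `re` module (which does not support lookaround conditions), the conditional «(?(?=regex)then|else)» can be emulated by pairing a positive and a negative lookahead in an alternation: «(?:(?=regex)then|(?!regex)else)». The condition, the then part and the else part below are invented for the example.

```python
import re

# if the token starts with a digit: then it must be exactly four digits,
# else it must consist of letters only.
conditional = re.compile(r"\b(?:(?=\d)\d{4}|(?!\d)[A-Za-z]+)\b")

# Hypothetical sample: "12" fails the then part, "x9" fails the else part.
tokens = conditional.findall("abc 2011 12 x9")
print(tokens)
```

Both branches are attempted at the same position where the condition was tested, because the lookaheads consume no characters. When the condition holds but the then part fails, as with "12", the else branch cannot step in, since its negative lookahead rejects that position.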

20. Adding Comments to Regular Expressions

If you have worked through the entire tutorial, I guess you will agree that regular expressions can quickly become rather cryptic. Therefore, many modern regex flavors allow you to insert comments into regexes. The syntax is «(?#comment)» where “comment” can be whatever you want, as long as it does not contain a closing round bracket. The regex engine ignores everything after the «(?#» until the first closing round bracket.

E.g. I could clarify the regex to match a valid date by writing it as «(?#year)(19|20)\d\d[- /.](?#month)(0[1-9]|1[012])[- /.](?#day)(0[1-9]|[12][0-9]|3[01])». Now it is instantly obvious that this regex matches a date in yyyy-mm-dd format. Some software, such as RegexBuddy, EditPad Pro and PowerGREP, can apply syntax coloring to regular expressions while you write them. That makes the comments really stand out, enabling the right comment in the right spot to make a complex regular expression much easier to understand.
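To see that the engine really skips such comments, the commented date regex can be run as is. This is a minimal sketch assuming Python's `re` module, which accepts the «(?#comment)» syntax; the separator class `[- /.]` is an assumption chosen so that the yyyy-mm-dd form is accepted, and the test dates are made up.

```python
import re

# The comments label each part of the pattern and are ignored when matching.
# Adjacent string literals are joined into one pattern string.
date = re.compile(
    r"(?#year)(19|20)\d\d[- /.]"
    r"(?#month)(0[1-9]|1[012])[- /.]"
    r"(?#day)(0[1-9]|[12][0-9]|3[01])"
)

m = date.fullmatch("2011-07-27")
parts = m.groups()                      # century, month and day groups
bad = date.fullmatch("2011-13-27")      # month 13 is rejected
print(parts, bad)
```

The comments do not create capturing groups, so the group numbering is the same as in the uncommented regex.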
