1. Regular Expression Tutorial
In this tutorial, I will teach you all you need to know to be able to craft powerful time-saving regular expressions. I will start with the most basic concepts, so that you can follow this tutorial even if you know nothing at all about regular expressions yet. But I will not stop there. I will also explain how a regular expression engine works on the inside, and alert you to the consequences. This will help you to understand quickly why a particular regex does not do what you initially expected. It will save you lots of guesswork and head-scratching when you need to write more complex regexes.

What Regular Expressions Are Exactly - Terminology
Basically, a regular expression is a pattern describing a certain amount of text. Their name comes from the mathematical theory on which they are based. But we will not dig into that. Since most people including myself are too lazy to type, you will usually find the name abbreviated to regex or regexp. I prefer regex, because it is easy to pronounce the plural "regexes". In this book, regular expressions are printed between guillemots: «regex». They clearly separate the pattern from the surrounding text and punctuation.

This first example is actually a perfectly valid regex. It is the most basic pattern, simply matching the literal text „regex”. A "match" is the piece of text, or sequence of bytes or characters, that the pattern was found to correspond to by the regex processing software. Matches are indicated by double quotation marks, with the left one at the base of the line.

«\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b» is a more complex pattern. It describes a series of letters, digits, dots, percentage signs, underscores and hyphens, followed by an at sign, followed by another series of letters, digits, dots, percentage signs, underscores and hyphens, finally followed by a single dot and between two and four letters. In other words: this pattern describes an email address. With the above regular expression pattern, you can search through a text file to find email addresses, or verify if a given string looks like an email address.

In this tutorial, I will use the term "string" to indicate the text that I am applying the regular expression to. I will indicate strings using regular double quotes. The term “string” or “character string” is used by programmers to indicate a sequence of characters. In practice, you can use regular expressions with whatever data you can access using the application or programming language you are working with.

Different Regular Expression Engines
A regular expression “engine” is a piece of software that can process regular expressions, trying to match the pattern to the given string. Usually, the engine is part of a larger application and you do not access the engine directly. Rather, the application will invoke it for you when needed, making sure the right regular expression is applied to the right file or data. As usual in the software world, different regular expression engines are not fully compatible with each other. It is not possible to describe every kind of engine and regular expression syntax (or “flavor”) in this tutorial. I will focus on the regex flavor used by Perl 5, for the simple reason that this regex flavor is the most popular one, and deservedly so. Many more recent regex engines are very similar, but not identical, to the one of Perl 5. Examples are the open source PCRE engine (used in many tools and languages like PHP), the .NET regular expression library, and the regular expression package included with version 1.4 and later of the Java JDK. I will point out to you whenever differences in regex flavors are important, and which features are specific to the Perl-derivatives mentioned above.

Give Regexes a First Try
You can easily try the following yourself in a text editor that supports regular expressions, such as EditPad Pro. If you do not have such an editor, you can download the free evaluation version of EditPad Pro to try this out. EditPad Pro's regex engine is fully functional in the demo version. As a quick test, copy and paste the text of this page into EditPad Pro. Then select Edit|Search and Replace from the menu. In the search pane that appears near the bottom, type in «regex» in the box labeled “Search Text”. Mark the “Regular expression” checkbox, unmark “All open documents” and mark “Start from beginning”. Then click the Search button and see how EditPad Pro's regex engine finds the first match. When “Start from beginning” is checked, EditPad Pro uses the entire file as the string to try to match the regex to. When the regex has been matched, EditPad Pro will automatically turn off “Start from beginning”. When you click the Search button again, the remainder of the file, after the highlighted match, is used as the string. When the regex can no longer match the remaining text, you will be notified, and “Start from beginning” is automatically turned on again. Now try to search using the regex «reg(ular expressions?|ex(p|es)?)». This regex will find all names, singular and plural, I have used on this page to say “regex”. If we only had plain text search, we would have needed 5 searches. With regexes, we need just one search. Regexes save you time when using a tool like EditPad Pro. If you are a programmer, your software will run faster since even a simple regex engine applying the above regex once will outperform a state of the art plain text search algorithm searching through the data five times. Regular expressions also reduce development time. With a regex engine, it takes only one line (e.g. in Perl, PHP, Java or .NET) or a couple of lines (e.g. in C using PCRE) of code to, say, check if the user's input looks like a valid email address.
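The single-search claim above can be tried outside a text editor as well. The following is a minimal sketch, assuming Python's built-in re module rather than EditPad Pro; the sample sentence and the helper name find_regex_names are made up for illustration, and the groups are written as non-capturing so that the whole match is collected.

```python
import re

# The alternation regex from the text, with non-capturing groups so that
# the whole match (not a sub-group) is what gets collected.
NAME_PATTERN = re.compile(r"reg(?:ular expressions?|ex(?:p|es)?)")

def find_regex_names(text):
    """Return every spelling of "regex" found in text, in order."""
    return [m.group(0) for m in NAME_PATTERN.finditer(text)]

# Hypothetical sample text containing all five spellings.
sample = ("A regular expression, or regex (some say regexp); "
          "two regexes; many regular expressions.")
names = find_regex_names(sample)
print(names)
```

One pass over the text reports all five spellings, where a plain text search would need five separate searches.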

2. Literal Characters
The most basic regular expression consists of a single literal character, e.g.: «a». It will match the first occurrence of that character in the string. If the string is “Jack is a boy”, it will match the „a” after the “J”. The fact that this “a” is in the middle of the word does not matter to the regex engine. If it matters to you, you will need to tell that to the regex engine by using word boundaries. We will get to that later. This regex can match the second „a” too. It will only do so when you tell the regex engine to start searching through the string after the first match. In a text editor, you can do so by using its “Find Next” or “Search Forward” function. In a programming language, there is usually a separate function that you can call to continue searching through the string after the previous match. Similarly, the regex «cat» will match „cat” in “About cats and dogs”. This regular expression consists of a series of three literal characters. This is like saying to the regex engine: find a «c», immediately followed by an «a», immediately followed by a «t». Note that regex engines are case sensitive by default. «cat» does not match “Cat”, unless you tell the regex engine to ignore differences in case.
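As a rough illustration of the behavior described above, here is a small sketch using Python's re module (an assumption; the text itself is tool-neutral). Python counts positions from zero, so the „a” after the “J” is at index 1.

```python
import re

text = "Jack is a boy"
pattern = re.compile(r"a")

# The first attempt finds the leftmost "a": the one right after the "J".
first = pattern.search(text)
print(first.start())

# Like "Find Next": resume the search right after the previous match.
second = pattern.search(text, first.end())
print(second.start())

# A sequence of literals must appear in order, back to back.
cat = re.search(r"cat", "About cats and dogs")
print(cat.group(0), cat.start())

# Matching is case sensitive unless the engine is told otherwise.
no_match = re.search(r"cat", "Cat")
ignore_case = re.search(r"cat", "Cat", re.IGNORECASE)
print(no_match, ignore_case.group(0))
```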

Special Characters
Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. In the regex flavors discussed in this tutorial, there are 11 characters with special meanings: the opening square bracket «[», the backslash «\», the caret «^», the dollar sign «$», the period or dot «.», the vertical bar or pipe symbol «|», the question mark «?», the asterisk or star «*», the plus sign «+», the opening round bracket «(» and the closing round bracket «)». These special characters are often called “metacharacters”. If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match „1+1=2”, the correct regex is «1\+1=2». Otherwise, the plus sign will have a special meaning. Note that «1+1=2», with the backslash omitted, is a valid regex. So you will not get an error message. But it will not match “1+1=2”. It would match „111=2” in “123+111=234”, due to the special meaning of the plus character. If you forget to escape a special character where its use is not allowed, such as in «+1», then you will get an error message. All other characters should not be escaped with a backslash. That is because the backslash is also a special character. The backslash in combination with a literal character can create a regex token with a special meaning. E.g. «\d» will match a single digit from 0 to 9.
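The «1\+1=2» example can be checked with a few lines of code. This is a sketch assuming Python's re module; the helper compiles is a hypothetical name, and re.escape is shown only as one convenient way to escape a literal automatically.

```python
import re

# Escaped plus sign: matches the literal text 1+1=2.
escaped = re.search(r"1\+1=2", "1+1=2")

# Unescaped plus sign: "one or more 1s, then 1=2". It is a valid regex,
# but it does not match "1+1=2" and does match part of "123+111=234".
unescaped_same = re.search(r"1+1=2", "1+1=2")
unescaped_other = re.search(r"1+1=2", "123+111=234")

def compiles(pattern):
    """Return True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A plus sign with nothing to repeat is rejected with an error.
plus_first_ok = compiles(r"+1")

# re.escape adds the backslashes a literal string needs.
auto = re.search(re.escape("1+1=2"), "x 1+1=2 y")

print(escaped.group(0), unescaped_same, unescaped_other.group(0), plus_first_ok)
```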

Special Characters and Programming Languages
If you are a programmer, you may be surprised that characters like the single quote and double quote are not special characters. That is correct. When using a regular expression or grep tool like PowerGREP or the search function of a text editor like EditPad Pro, you should not escape or repeat the quote characters like you do in a programming language. In your source code, you have to keep in mind which characters get special treatment inside strings by your programming language. That is because those characters will be processed by the compiler, before the regex library sees the string. So the regex «1\+1=2» must be written as "1\\+1=2" in C++ code. The C++ compiler will turn the escaped backslash in the source code into a single backslash in the string that is passed on to the regex library. To match „c:\temp”, you need to use the regex «c:\\temp». As a string in C++ source code, this regex becomes "c:\\\\temp". Four backslashes to match a single one indeed. See the tools and languages section in this book for more information on how to use regular expressions in various programming languages.
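The same double layer of escaping exists in most languages. As a sketch in Python rather than the C++ of the text: an ordinary string literal needs the doubled backslashes, while a raw string literal (the r prefix) passes backslashes through unchanged, so the regex can be typed as it is printed in this book.

```python
import re

# Ordinary literal: the language consumes one level of backslashes.
plain = "1\\+1=2"
# Raw literal: what you type is what the regex engine receives.
raw = r"1\+1=2"
print(plain == raw)

# The text to search: c:\temp (seven characters).
path = "c:\\temp"

# The regex «c:\\temp», written both ways.
regex_plain = "c:\\\\temp"   # four backslashes in the source code
regex_raw = r"c:\\temp"      # two backslashes with a raw string
print(regex_plain == regex_raw, len(regex_raw))

found = re.search(regex_raw, path)
print(found.group(0))
```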

Non-Printable Characters
You can use special character sequences to put non-printable characters in your regular expression. «\t» will match a tab character (ASCII 0x09), «\r» a carriage return (0x0D) and «\n» a line feed (0x0A). Remember that Windows text files use “\r\n” to terminate lines, while UNIX text files use “\n”. You can include any character in your regular expression if you know its hexadecimal ASCII or ANSI code for the character set that you are working with. In the Latin-1 character set, the copyright symbol is character 0xA9. So to search for the copyright symbol, you can use «\xA9». Another way to search for a tab is to use «\x09». Note that the leading zero is required.
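A short sketch of these escapes, assuming Python's re module. One assumption to note: with text strings, Python treats «\xA9» as the code point U+00A9, which coincides with the Latin-1 value used in the text; the sample strings are invented.

```python
import re

# \t and \x09 both match a tab character.
tab_pos = re.search(r"\t", "a\tb").start()
hex_tab_pos = re.search(r"\x09", "a\tb").start()

# Windows files end lines with \r\n, UNIX files with \n only.
windows_text = "one\r\ntwo\r\n"
unix_text = "one\ntwo\n"
crlf_count = len(re.findall(r"\r\n", windows_text))
cr_in_unix = len(re.findall(r"\r", unix_text))

# Splitting on either style of line break.
lines = re.split(r"\r\n|\n", "one\r\ntwo\nthree")

# \xA9 finds the copyright symbol.
copyright_match = re.search(r"\xA9", "Copyright \u00a9 Example Corp")

print(tab_pos, hex_tab_pos, crlf_count, cr_in_unix, lines)
```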

3. First Look at How a Regex Engine Works Internally
Knowing how the regex engine works will enable you to craft better regexes more easily. It will help you understand quickly why a particular regex does not do what you initially expected. This will save you lots of guesswork and head-scratching when you need to write more complex regexes.

There are two kinds of regular expression engines: text-directed engines, and regex-directed engines. Jeffrey Friedl calls them DFA and NFA engines, respectively. All the regex flavors treated in this tutorial are based on regex-directed engines. This is because certain very useful features, such as lazy quantifiers and backreferences, can only be implemented in regex-directed engines. No surprise that this kind of engine is more popular.

Notable tools that use text-directed engines are awk, egrep, flex, lex, MySQL and Procmail. For awk and egrep, there are a few versions of these tools that use a regex-directed engine.

You can easily find out whether the regex flavor you intend to use has a text-directed or regex-directed engine. If backreferences and/or lazy quantifiers are available, you can be certain the engine is regex-directed. You can do the test by applying the regex «regex|regex not» to the string “regex not”. If the resulting match is only „regex”, the engine is regex-directed. If the result is „regex not”, then it is text-directed. The reason behind this is that the regex-directed engine is “eager”.

In this tutorial, after introducing a new regex token, I will explain step by step how the regex engine actually processes that token. This inside look may seem a bit long-winded at certain times. But understanding how the regex engine works will enable you to use its full power and help you avoid common mistakes.

The Regex-Directed Engine Always Returns the Leftmost Match
This is a very important point to understand: a regex-directed engine will always return the leftmost match, even if a “better” match could be found later. When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.

When applying «cat» to “He captured a catfish for his cat.”, the engine will try to match the first token in the regex «c» to the first character in the string “H”. This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the «c» with the “e”. This fails too, as does matching the «c» with the space. Arriving at the 4th character in the string, «c» matches „c”. The engine will then try to match the second token «a» to the 5th character, „a”. This succeeds too. But then, «t» fails to match “p”. At that point, the engine knows the regex cannot be matched starting at the 4th character in the string. So it will continue with the 5th: “a”. Again, «c» fails to match here and the engine carries on. At the 15th character in the string, «c» again matches „c”. The engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that «a» matches „a” and «t» matches „t”.

The entire regular expression could be matched starting at character 15. The engine is "eager" to report a match. It will therefore report the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any “better” matches. The first match is considered good enough.

In this first example of the engine's internals, our regex engine simply appears to work like a regular text search routine. A text-directed engine would have returned the same result too. However, it is important that you can follow the steps the engine takes in your mind. In following examples, the way the engine works will have a profound impact on the matches it will find. Some of the results may be surprising. But they are always logical and predetermined, once you know how the engine works.
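The leftmost-match rule and the «regex|regex not» test can both be tried in code. The sketch below assumes Python's re module, which is a regex-directed engine; leftmost_match is a hypothetical helper that mimics the engine's outer loop by attempting a match at each starting position in turn. It is an illustration of the idea, not how any engine is actually implemented.

```python
import re

def leftmost_match(pattern, text):
    """Try the regex at position 0, then 1, and so on; stop at the first hit."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        attempt = compiled.match(text, start)  # anchored at 'start'
        if attempt:
            return attempt
    return None

sentence = "He captured a catfish for his cat."
hit = leftmost_match(r"cat", sentence)
# Python counts from zero, so the 15th character has index 14.
print(hit.start(), hit.group(0))

# The built-in search reports the same leftmost match.
builtin = re.search(r"cat", sentence)
print(builtin.start())

# The eagerness test: a regex-directed engine stops at the first
# alternative that works and reports only "regex".
eager = re.search(r"regex|regex not", "regex not").group(0)
print(eager)
```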

4. Character Classes or Character Sets
With a "character class", also called “character set”, you can tell the regex engine to match only one out of several characters. Simply place the characters you want to match between square brackets. If you want to match an a or an e, use «[ae]». You could use this in «gr[ae]y» to match either „gray” or „grey”. Very useful if you do not know whether the document you are searching through is written in American or British English.

A character class matches only a single character. «gr[ae]y» will not match “graay”, “graey” or any such thing. The order of the characters inside a character class does not matter. The results are identical.

You can use a hyphen inside a character class to specify a range of characters. «[0-9]» matches a single digit between 0 and 9. You can use more than one range. «[0-9a-fA-F]» matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. «[0-9a-fxA-FX]» matches a hexadecimal digit or the letter X. Again, the order of the characters and the ranges does not matter.

Useful Applications
Find a word, even if it is misspelled, such as «sep[ae]r[ae]te» or «li[cs]en[cs]e».
Find an identifier in a programming language with «[A-Za-z_][A-Za-z_0-9]*».
Find a C-style hexadecimal number with «0[xX][A-Fa-f0-9]+».

Negated Character Classes
Typing a caret after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class. Unlike the dot, negated character classes also match (invisible) line break characters.

It is important to remember that a negated character class still must match a character. «q[^u]» does not mean: “a q not followed by a u”. It means: “a q followed by a character that is not a u”. It will not match the q in the string “Iraq”. It will match the q and the space after the q in “Iraq is a country”. Indeed: the space will be part of the overall match, because it is the “character that is not a u” that is matched by the negated character class in the above regexp. If you want the regex to match the q, and only the q, in both strings, you need to use negative lookahead: «q(?!u)». But we will get to that later.

Metacharacters Inside Character Classes
Note that the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^) and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash. To search for a star or plus, use «[+*]». Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.
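The character class examples above can be verified with a few calls. This is a sketch assuming Python's re module; the test strings other than those quoted in the text are invented.

```python
import re

# A class matches exactly one character, so "graay" and "graey" fail.
spellings = re.findall(r"gr[ae]y", "gray grey graay graey")
print(spellings)

# A negated class must still consume a character.
q_alone = re.search(r"q[^u]", "Iraq")
q_space = re.search(r"q[^u]", "Iraq is a country")
print(q_alone, repr(q_space.group(0)))

# Negative lookahead matches only the q, in both strings.
q_lookahead = re.search(r"q(?!u)", "Iraq")
print(q_lookahead.group(0))

# The "useful applications" patterns.
is_identifier = re.fullmatch(r"[A-Za-z_][A-Za-z_0-9]*", "_tmp1") is not None
not_identifier = re.fullmatch(r"[A-Za-z_][A-Za-z_0-9]*", "1abc") is None
hex_number = re.search(r"0[xX][A-Fa-f0-9]+", "mask = 0x1F;").group(0)
star_or_plus = re.findall(r"[+*]", "a+b*c")
print(is_identifier, not_identifier, hex_number, star_or_plus)
```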

To include a backslash as a character without any special meaning inside a character class, you have to escape it with another backslash. «[\\x]» matches a backslash or an x. The closing bracket (]), the caret (^) and the hyphen (-) can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning. I recommend the latter method, since it improves readability. To include a caret, place it anywhere except right after the opening bracket. «[x^]» matches an x or a caret. You can put the closing bracket right after the opening bracket, or the negating caret. «[]x]» matches a closing bracket or an x. «[^]x]» matches any character that is not a closing bracket or an x. The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both «[-x]» and «[x-]» match an x or a hyphen.

Shorthand Character Classes
Since certain character classes are used often, a series of shorthand character classes are available. «\d» is short for «[0-9]».

«\w» stands for “word character”. Exactly which characters it matches differs between regex flavors. In all flavors, it will include «[A-Za-z]». In most, the underscore and digits are also included. In some flavors, word characters from other languages may also match. In EditPad Pro, for example, the actual character range depends on the script you have chosen in Options|Font. If you are using the Western script, characters with diacritics used in languages such as French and Spanish will be included. If you are using the Cyrillic script, Russian characters will be included, etc. The best way to find out is to do a couple of tests with the regex flavor you are using. In the screen shot, you can see the characters matched by «\w» in PowerGREP when using the Western script.

«\s» stands for “whitespace character”. Again, which characters this actually includes depends on the regex flavor. In all flavors discussed in this tutorial, it includes «[ \t]». That is: «\s» will match a space or a tab. In most flavors, it also includes a carriage return or a line feed as in «[ \t\r\n]». Some flavors include additional, rarely used non-printable characters such as vertical tab and form feed.

Shorthand character classes can be used both inside and outside the square brackets. «\s\d» matches a whitespace character followed by a digit. «[\s\d]» matches a single character that is either whitespace or a digit. When applied to “1 + 2 = 3”, the former regex will match „ 2” (space two), while the latter matches „1” (one). «[\da-fA-F]» matches a hexadecimal digit, and is equivalent to «[0-9a-fA-F]».

Negated Shorthand Character Classes
The above three shorthands also have negated versions. «\D» is the same as «[^\d]», «\W» is short for «[^\w]» and «\S» is the equivalent of «[^\s]».
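The positioning rules for metacharacters inside a class, and the difference between «\s\d» and «[\s\d]», are easy to confirm by experiment. A sketch assuming Python's re module, whose character class syntax follows the same rules for these cases; the short test strings are invented.

```python
import re

# Placing ], ^ and - where they cannot take on a special meaning.
bracket_or_x = re.findall(r"[]x]", "a]xb")
not_bracket_or_x = re.findall(r"[^]x]", "a]xb")
caret_or_x = re.findall(r"[x^]", "x^y")
hyphen_first = re.findall(r"[-x]", "x-y")
hyphen_last = re.findall(r"[x-]", "x-y")

# A backslash inside a class must itself be escaped.
backslash_or_x = re.findall(r"[\\x]", r"a\x")

# Shorthands outside versus inside square brackets.
sequence = re.search(r"\s\d", "1 + 2 = 3").group(0)     # whitespace, then digit
either = re.search(r"[\s\d]", "1 + 2 = 3").group(0)     # one char: either kind

# [\da-fA-F] behaves like [0-9a-fA-F] on ASCII input.
hex_digits = re.findall(r"[\da-fA-F]", "3fZ9g")

print(bracket_or_x, not_bracket_or_x, caret_or_x, hyphen_first, hyphen_last)
print(backslash_or_x, repr(sequence), either, hex_digits)
```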

Be careful when using the negated shorthands inside square brackets. «[\D\S]» is not the same as «[^\d\s]». The latter will match any character that is not a digit or whitespace. So it will match „x”, but not “8”. The former, however, will match any character that is either not a digit, or is not whitespace. Because a digit is not whitespace, and whitespace is not a digit, «[\D\S]» will match any character, digit, whitespace or otherwise.

Repeating Character Classes
If you repeat a character class by using the «?», «*» or «+» operators, you will repeat the entire character class, and not just the character that it matched. The regex «[0-9]+» can match „837” as well as „222”.

If you want to repeat the matched character, rather than the class, you will need to use backreferences. «([0-9])\1+» will match „222” but not “837”. When applied to the string “833337”, it will match „3333” in the middle of this string. If you do not want that, you need to use lookahead and lookbehind.

But I digress. I did not yet explain how character classes work inside the regex engine. Let us take a look at that first.

Looking Inside The Regex Engine
As I already said: the order of the characters inside a character class does not matter. «gr[ae]y» will match „grey” in “Is his hair grey or gray?”, because that is the leftmost match. We already saw how the engine applies a regex consisting only of literal characters. Below, I will explain how it applies a regex that has more than one permutation. That is: «gr[ae]y» can match both „gray” and „grey”.

Nothing noteworthy happens for the first twelve characters in the string. The engine will fail to match «g» at every step, and continue with the next character in the string. When the engine arrives at the 13th character, „g” is matched. The engine will then try to match the remainder of the regex with the text. The next token in the regex is the literal «r», which matches the next character in the text. So the third token, «[ae]», is attempted at the next character in the text (“e”). The character class gives the engine two options: match «a» or match «e». It will first attempt to match «a», and fail.

But because we are using a regex-directed engine, it must continue trying to match all the other permutations of the regex pattern before deciding that the regex cannot be matched with the text starting at character 13. So it will continue with the other option, and find that «e» matches „e”. The last regex token is «y», which can be matched with the following character as well. The engine has found a complete match with the text starting at character 13. It will return „grey” as the match result, and look no further. Again, the leftmost match was returned, even though we put the «a» first in the character class, and „gray” could have been matched in the string. But the engine simply did not get that far, because another equally valid match was found to the left of it.
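The difference between repeating a class and repeating the matched character, and the leftmost „grey” result, can be checked as follows. A sketch assuming Python's re module; indices are zero-based, so the 13th character has index 12.

```python
import re

# Repeating the class: any run of digits qualifies.
class_837 = re.fullmatch(r"[0-9]+", "837") is not None
class_222 = re.fullmatch(r"[0-9]+", "222") is not None

# Repeating the matched character needs a backreference.
back_222 = re.search(r"([0-9])\1+", "222").group(0)
back_837 = re.search(r"([0-9])\1+", "837")
back_mid = re.search(r"([0-9])\1+", "833337").group(0)

# [\D\S] matches any character at all; [^\d\s] excludes digits and whitespace.
any_char = all(re.fullmatch(r"[\D\S]", ch) for ch in "8 x\n")
neither_8 = re.fullmatch(r"[^\d\s]", "8")
neither_x = re.fullmatch(r"[^\d\s]", "x")

# The leftmost match wins, whatever the order inside the class.
grey = re.search(r"gr[ae]y", "Is his hair grey or gray?")

print(class_837, class_222, back_222, back_837, back_mid)
print(any_char, neither_8, neither_x.group(0), grey.group(0), grey.start())
```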

5. The Dot Matches (Almost) Any Character
In regular expressions, the dot or period is one of the most commonly used metacharacters. Unfortunately, it is also the most commonly misused metacharacter.

The dot matches a single character, without caring what that character is. The only exception are newline characters. In all regex flavors discussed in this tutorial, the dot will not match a newline character by default. So by default, the dot is short for the negated character class «[^\n]» (UNIX regex flavors) or «[^\r\n]» (Windows regex flavors).

This exception exists mostly because of historic reasons. The first tools that used regular expressions were line-based. They would read a file line by line, and apply the regular expression separately to each line. The effect is that with these tools, the string could never contain newlines, so the dot could never match them.

Modern tools and languages can apply regular expressions to very large strings or even entire files. All regex flavors discussed here have an option to make the dot match all characters, including newlines. In RegexBuddy, EditPad Pro or PowerGREP, you simply tick the checkbox labeled “dot matches newline”.

In Perl, the mode where the dot also matches newlines is called "single-line mode". This is a bit unfortunate, because it is easy to mix up this term with “multi-line mode”. Multi-line mode only affects anchors, and single-line mode only affects the dot. You can activate single-line mode by adding an s after the regex code, like this: m/^regex$/s.

Other languages and regex libraries have adopted Perl's terminology. When using the regex classes of the .NET framework, you activate this mode by specifying RegexOptions.Singleline, such as in Regex.Match("string", "regex", RegexOptions.Singleline).

In all programming languages and regex libraries I know, activating single-line mode has no effect other than making the dot match newlines. So if you expose this option to your users, please give it a clearer label like was done in RegexBuddy, EditPad Pro and PowerGREP.

Use The Dot Sparingly
The dot is a very powerful regex metacharacter. It allows you to be lazy. Put in a dot, and everything will match just fine when you test the regex on valid data. The problem is that the regex will also match in cases where it should not match. If you are new to regular expressions, some of these cases may not be so obvious at first.

I will illustrate this with a simple example. Let's say we want to match a date in mm/dd/yy format, but we want to leave the user the choice of date separators. The quick solution is «\d\d.\d\d.\d\d». Seems fine at first. It will match a date like „02/12/03” just fine. Trouble is: „02512703” is also considered a valid date by this regular expression. In this match, the first dot matched „5”, and the second matched „7”. Obviously not what we intended.

«\d\d[- /.]\d\d[- /.]\d\d» is a better solution. This regex allows a dash, space, dot and forward slash as date separators. Remember that the dot is not a metacharacter inside a character class, so we do not need to escape it with a backslash.
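Both points, the dot's newline exception and the over-permissive date regex, can be demonstrated briefly. This sketch assumes Python's re module, where single-line mode is spelled re.DOTALL and where the dot by default excludes only the line feed «\n».

```python
import re

# By default the dot refuses to match a newline.
default_dot = re.search(r"a.b", "a\nb")
# "Single-line mode" (re.DOTALL in Python) lifts that exception.
dotall_dot = re.search(r"a.b", "a\nb", re.DOTALL)

LAZY_DATE = re.compile(r"\d\d.\d\d.\d\d")
BETTER_DATE = re.compile(r"\d\d[- /.]\d\d[- /.]\d\d")

# The dot version accepts a real date, but also eight plain digits.
lazy_ok = LAZY_DATE.fullmatch("02/12/03") is not None
lazy_bogus = LAZY_DATE.fullmatch("02512703") is not None

# The character class accepts the four separators and nothing else.
better_results = {
    text: BETTER_DATE.fullmatch(text) is not None
    for text in ["02/12/03", "02-12-03", "02.12.03", "02 12 03", "02512703"]
}

print(default_dot, dotall_dot.group(0) == "a\nb", lazy_ok, lazy_bogus)
print(better_results)
```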

This regex is still far from perfect. It matches „99/99/99” as a valid date. «[0-1]\d[- /.][0-3]\d[- /.]\d\d» is a step ahead, though it will still match „19/39/99”. How perfect you want your regex to be depends on what you want to do with it. If you are validating user input, it has to be perfect. If you are parsing data files from a known source that generates its files in the same way every time, our last attempt is probably more than sufficient to parse the data without errors. You can find a better regex to match dates in the example section.

Use Negated Character Sets Instead of the Dot
I will explain this in depth when I present you the repeat operators star and plus, but the warning is important enough to mention it here as well. I will illustrate with an example.

Suppose you want to match a double-quoted string. Sounds easy. We can have any number of any character between the double quotes, so «".*"» seems to do the trick just fine. The dot matches any character, and the star allows the dot to be repeated any number of times, including zero. If you test this regex on “Put a “string” between double quotes”, it will match „“string”” just fine. Now go ahead and test it on “Houston, we have a problem with “string one” and “string two”. Please respond.”

Ouch. The regex matches „“string one” and “string two””. Definitely not what we intended. The reason for this is that the star is greedy.

In the date-matching example, we improved our regex by replacing the dot with a character class. Here, we will do the same. Our original definition of a double-quoted string was faulty. We do not want any number of any character between the quotes. We want any number of characters that are not double quotes or newlines between the quotes. So the proper regex is «"[^"\r\n]*"».
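The double-quoted string example is worth running once to see the greedy star at work. A sketch assuming Python's re module, with plain ASCII double quotes standing in for the typographic quotes used in the printed example.

```python
import re

GREEDY = re.compile(r'".*"')
NEGATED = re.compile(r'"[^"\r\n]*"')

simple = 'Put a "string" between double quotes'
houston = 'Houston, we have a problem with "string one" and "string two". Please respond.'

# With a single quoted string, both regexes agree.
greedy_simple = GREEDY.search(simple).group(0)
negated_simple = NEGATED.search(simple).group(0)

# With two quoted strings, the greedy dot-star runs to the last quote.
greedy_houston = GREEDY.search(houston).group(0)

# The negated class cannot cross a quote, so each string is found separately.
negated_houston = NEGATED.findall(houston)

print(greedy_simple, negated_simple)
print(greedy_houston)
print(negated_houston)
```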

6. Start of String and End of String Anchors
Thus far, I have explained literal characters and character classes. In both cases, putting one in a regex will cause the regex engine to try to match a single character.

Anchors are a different breed. They do not match any character at all. Instead, they match a position before, after or between characters. They can be used to “anchor” the regex match at a certain position. The caret «^» matches the position before the first character in the string. Applying «^a» to “abc” matches „a”. «^b» will not match “abc” at all, because the «b» cannot be matched right after the start of the string, matched by «^». See below for the inside view of the regex engine.

Similarly, «$» matches right after the last character in the string. «c$» matches „c” in “abc”, while «a$» does not match at all.

Useful Applications
When using regular expressions in a programming language to validate user input, using anchors is very important. If you use the code if ($input =~ m/\d+/) in a Perl script to see if the user entered an integer number, it will accept the input even if the user entered “qsdf4ghjk”, because «\d+» matches the 4. The correct regex to use is «^\d+$». Because “start of string” must be matched before the match of «\d+», and “end of string” must be matched right after it, the entire string must consist of digits for «^\d+$» to be able to match.

It is easy for the user to accidentally type in a space. When Perl reads a line from a text file, the line break will also be stored in the variable. So before validating input, it is good practice to trim leading and trailing whitespace. «^\s+» matches leading whitespace and «\s+$» matches trailing whitespace. In Perl, you could use $input =~ s/^\s+|\s+$//g. Handy use of alternation and /g allows us to do this in a single line of code.

Using ^ and $ as Start of Line and End of Line Anchors
If you have a string consisting of multiple lines, like “first line\nsecond line” (where \n indicates a line break), it is often desirable to work with lines, rather than the entire string. Therefore, all the regex engines discussed in this tutorial have the option to expand the meaning of both anchors. «^» can then match at the start of the string (before the “f” in the above string), as well as after each line break (between “\n” and “s”). Likewise, «$» will still match at the end of the string (after the last “e”), and also before every line break (between “e” and “\n”).

In text editors like EditPad Pro or GNU Emacs, and regex tools like PowerGREP, the caret and dollar always match at the start and end of each line. This makes sense because those applications are designed to work with entire files, rather than short strings.

In every programming language and regex library I know, you have to explicitly activate this extended functionality. It is traditionally called "multi-line mode". In Perl, you do this by adding an m after the regex code, like this: m/^regex$/m. In .NET, the anchors match before and after newlines when you specify RegexOptions.Multiline, such as in Regex.Match("string", "regex", RegexOptions.Multiline).
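The validation, trimming and multi-line behavior described above translate almost directly to other languages. The following sketch assumes Python's re module instead of Perl; the function names looks_like_integer and trim are hypothetical.

```python
import re

def looks_like_integer(user_input):
    """Anchored check: the whole string must consist of digits."""
    return re.search(r"^\d+$", user_input) is not None

def trim(user_input):
    """Strip leading and trailing whitespace with one alternation."""
    return re.sub(r"^\s+|\s+$", "", user_input)

# Without anchors, a single digit anywhere is enough to "validate".
unanchored = re.search(r"\d+", "qsdf4ghjk") is not None
anchored = looks_like_integer("qsdf4ghjk")
trimmed = trim("  42 \n")

# Multi-line mode lets ^ and $ match at every line break.
text = "first line\nsecond line"
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)
ends_multiline = re.findall(r"\w+$", text, re.MULTILINE)

print(unanchored, anchored, repr(trimmed), looks_like_integer(trimmed))
print(starts_default, starts_multiline, ends_multiline)
```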

Permanent Start of String and End of String Anchors
«\A» only ever matches at the start of the string. Likewise, «\Z» only ever matches at the end of the string. These two tokens never match at line breaks. This is true in all regex flavors discussed in this tutorial, even when you turn on “multiline mode”. In EditPad Pro and PowerGREP, where the caret and dollar always match at the start and end of lines, «\A» and «\Z» only match at the start and the end of the entire file.

Zero-Length Matches
We saw that the anchors match at a position, rather than matching a character. This means that when a regex only consists of one or more anchors, it can result in a zero-length match. Depending on the situation, this can be very useful or undesirable. Using «^\d*$» to test if the user entered a number (notice the use of the star instead of the plus), would cause the script to accept an empty string as a valid input. See below.

However, matching only a position can be very useful. In email, for example, it is common to prepend a “greater than” symbol and a space to each line of the quoted message. In VB.NET, we can easily do this with Dim Quoted as String = Regex.Replace(Original, "^", "> ", RegexOptions.Multiline). We are using multi-line mode, so the regex «^» matches at the start of the quoted message, and after each newline. The Regex.Replace method will remove the regex match from the string, and insert the replacement string (greater than symbol and a space). Since the match does not include any characters, nothing is deleted. However, the match does include a starting position, and the replacement string is inserted there, just like we want it.

Strings Ending with a Line Break
Even though «\Z» and «$» only match at the end of the string (when the option for the caret and dollar to match at embedded line breaks is off), there is one exception. If the string ends with a line break, then «\Z» and «$» will match at the position before that line break, rather than at the very end of the string. This “enhancement” was introduced by Perl, and is copied by many regex flavors, including Java, .NET and PCRE. In Perl, when reading a line from a file, the resulting string will end with a line break. Reading a line from a file with the text “joe” results in the string “joe\n”. When applied to this string, both «^[a-z]+$» and «\A[a-z]+\Z» will match „joe”.

If you only want a match at the absolute very end of the string, use «\z» (lower case z instead of upper case Z). «\A[a-z]+\z» does not match “joe\n”. «\z» matches after the line break, which is not matched by the character class.

Looking Inside the Regex Engine
Let's see what happens when we try to match «^4$» to “749\n486\n4” (where \n represents a newline character) in multi-line mode. As usual, the regex engine starts at the first character: “7”. The first token in the regular expression is «^». Since this token is a zero-width token, the engine does not try to match it with the character, but rather with the position before the character that the regex engine has reached so far. «^» indeed matches the position before “7”. The engine then advances to the next regex token: «4». Since the previous token was zero-width, the regex engine does not advance to the next character in the string. It remains at “7”. «4» is a literal character, which does not match “7”. There are no other permutations of the regex, so the engine starts again with the first regex token, at the next character: “4”. This time, «^» cannot match at the position before the 4. This position is preceded by a character, and that character is not a newline. The engine continues at “9”, and fails again. The next attempt, at “\n”, also fails. Again, the position before “\n” is preceded by a character, “9”, and that character is not a newline.

Then, the regex engine arrives at the second “4” in the string. The «^» can match at the position before the “4”, because it is preceded by a newline character. Again, the regex engine advances to the next regex token, «4», but does not advance the character position in the string. «4» matches „4”, and the engine advances both the regex token and the string character. Now the engine attempts to match «$» at the position before (indeed: before) the “8”. The dollar cannot match here, because this position is preceded by a character, and that character is not a newline.

Yet again, the engine must try to match the first token again. Previously, it was successfully matched at the second “4”, so the engine continues at the next character, “8”, where the caret does not match. Same at the six and the newline.

Finally, the regex engine tries to match the first token at the third “4” in the string. With success. After that, the engine successfully matches «4» with „4”. The current regex token is advanced to «$», and the current character is advanced to the very last position in the string: the void after the string. No regex token that needs a character to match can match here. Not even a negated character class. However, we are trying to match a dollar sign, and the mighty dollar is a strange beast. It is zero-width, so it will try to match the position before the current character. It does not matter that this “character” is the void after the string. In fact, the dollar will check the current character. It must be either a newline, or the void after the string, for «$» to match the position before the current character. Since that is the case after the example, the dollar matches successfully. Since «$» was the last token in the regex, the engine has found a successful match: the last „4” in the string.

Another Inside Look
Earlier I mentioned that «^\d*$» would successfully match an empty string. Let's see why. There is only one “character” position in an empty string: the void after the string. The first token in the regex is «^». It matches the position before the void after the string, because it is preceded by the void before the string. The next token is «\d*». As we will see later, one of the star's effects is that it makes the «\d», in this case, optional. The engine will try to match «\d» with the void after the string. That fails, but the star turns the failure of the «\d» into a zero-width success. The engine will proceed with the next regex token, without advancing the position in the string. So the engine arrives at «$», and the void after the string. We already saw that those match. At this point, the entire regex has matched the empty string, and the engine reports success.

Caution for Programmers
A regular expression such as «$» all by itself can indeed match after the string. If you would query the engine for the character position, it would return the length of the string if string indices are zero-based, or the length+1 if string indices are one-based in your programming language. If you would query the engine for the length of the match, it would return zero.

What you have to watch out for is that String[Regex.MatchPosition] may cause an access violation or segmentation fault, because MatchPosition can point to the void after the string. This can also happen with «^» and «^$» if the last character in the string is a newline.
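The anchor behaviour described in this chapter is easy to try out in a script. The sketch below uses Python's re module, which is my own choice for illustration and not one of the tools used in the text; re.MULTILINE plays the role of multi-line mode, and all variable names are made up. Keep in mind that Python's «\Z» matches only at the absolute end of the string, so it behaves like the «\z» described above rather than like Perl's «\Z».

```python
import re

# «^4$» in multi-line mode against the string from the inside look.
subject = "749\n486\n4"
match = re.search(r"^4$", subject, flags=re.MULTILINE)
match_span = match.span()  # only the last "4" sits alone on its line

# «^\d*$» accepts an empty string, «^\d+$» does not.
star_accepts_empty = re.search(r"^\d*$", "") is not None
plus_accepts_empty = re.search(r"^\d+$", "") is not None

# «$» on its own gives a zero-length match in the void after the string,
# so the match position equals the length of the string.
end_position = re.search(r"$", "abc").start()

# Quoting an email: the replacement is inserted at each zero-length «^» match.
original = "first line\nsecond line"
quoted = re.sub(r"^", "> ", original, flags=re.MULTILINE)

# «$» also matches before a final line break; Python's «\Z» does not.
dollar_matches = re.search(r"^[a-z]+$", "joe\n") is not None
absolute_end_matches = re.search(r"^[a-z]+\Z", "joe\n") is not None
```

In Python, indexing "abc"[end_position] raises an IndexError rather than causing an access violation, but the underlying caution is the same: a match position may point to the void after the string.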

7. Word Boundaries
The metacharacter «\b» is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length.

There are four different positions that qualify as word boundaries:
• Before the first character in the string, if the first character is a word character.
• After the last character in the string, if the last character is a word character.
• Between a word character and a non-word character following right after the word character.
• Between a non-word character and a word character following right after the non-word character.

Simply put: «\b» allows you to perform a “whole words only” search using a regular expression in the form of «\bword\b». A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”. The exact list of characters is different for each regex flavor, but all word characters are always matched by the short-hand character class «\w». All non-word characters are always matched by «\W».

In Perl and the other regex flavors discussed in this tutorial, there is only one metacharacter that matches both before a word and after a word. This is because any position between characters can never be both at the start and at the end of a word. Using only one operator makes things easier for you.

Note that «\w» usually also matches digits. So «\b4\b» can be used to match a 4 that is not part of a larger number. This regex will not match “44 sheets of a4”. So saying “«\b» matches before and after an alphanumeric sequence” is more exact than saying “before and after a word”.

Negated Word Boundary
«\B» is the negated version of «\b». «\B» matches at every position where «\b» does not. Effectively, «\B» matches at any position between two word characters as well as at any position between two non-word characters.

Looking Inside the Regex Engine
Let's see what happens when we apply the regex «\bis\b» to the string “This island is beautiful”. The engine starts with the first token «\b» at the first character “T”. Since this token is zero-length, the position before the character is inspected. «\b» matches here, because the T is a word character and the character before it is the void before the start of the string. The engine continues with the next token: the literal «i». The engine does not advance to the next character in the string, because the previous regex token was zero-width. «i» does not match “T”, so the engine retries the first token at the next character position.

«\b» cannot match at the position between the “T” and the “h”. It cannot match between the “h” and the “i” either, and neither between the “i” and the “s”.

The next character in the string is a space. «\b» matches here because the space is not a word character, and the preceding character is. Again, the engine continues with the «i» which does not match with the space.

Advancing a character and restarting with the first regex token, «\b» matches between the space and the second “i” in the string. Continuing, the regex engine finds that «i» matches „i” and «s» matches „s”. Now, the engine tries to match the second «\b» at the position before the “l”. This fails because this position is between two word characters. The engine reverts to the start of the regex and advances one character to the “s” in “island”. Again, the «\b» fails to match and continues to do so until the second space is reached. It matches there, but matching the «i» fails.

But «\b» matches at the position before the third “i” in the string. The engine continues, and finds that «i» matches „i” and «s» matches „s”. The last token in the regex, «\b», also matches at the position before the third space in the string because the space is not a word character, and the character before it is.

The engine has successfully matched the word „is” in our string, skipping the two earlier occurrences of the characters i and s. If we had used the regular expression «is», it would have matched the „is” in “This”.
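The word boundary examples from this chapter can be checked with a few lines of code. This is a minimal sketch using Python's re module, which is my assumption for illustration (the chapter itself is flavor-neutral); the variable names are hypothetical.

```python
import re

subject = "This island is beautiful"

# «\bis\b» only matches the stand-alone word "is".
whole_word_span = re.search(r"\bis\b", subject).span()

# Plain «is» is satisfied by the first "is" the engine meets, inside "This".
plain_span = re.search(r"is", subject).span()

# «\b4\b» needs a non-word character (or the edge of the string) on both sides.
no_lone_four = re.search(r"\b4\b", "44 sheets of a4")
lone_four = re.search(r"\b4\b", "4 sheets of a4")

# «\B» matches wherever «\b» does not: in "44", only between the two digits.
inner_positions = [m.start() for m in re.finditer(r"\B", "44")]
```

The spans are zero-based character offsets, so the difference between the two searches shows which occurrence of the letters i and s was reported.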

8. Alternation with The Vertical Bar or Pipe Symbol
I already explained how you can use character classes to match a single character out of several possible characters. Alternation is similar. You can use alternation to match a single regular expression out of several possible regular expressions.

If you want to search for the literal text «cat» or «dog», separate both options with a vertical bar or pipe symbol: «cat|dog». If you want more options, simply expand the list: «cat|dog|mouse|fish».

The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you will need to use round brackets for grouping. If we want to improve the first example to match whole words only, we would need to use «\b(cat|dog)\b». This tells the regex engine to find a word boundary, then either “cat” or “dog”, and then another word boundary. If we had omitted the round brackets, the regex engine would have searched for “a word boundary followed by cat”, or, “dog followed by a word boundary”.

Remember That The Regex Engine Is Eager
I already explained that the regex engine is eager. It will stop searching as soon as it finds a valid match. The consequence is that in certain situations, the order of the alternatives matters. Suppose you want to use a regex to match a list of function names in a programming language: Get, GetValue, Set or SetValue. The obvious solution is «Get|GetValue|Set|SetValue». Let's see how this works out when the string is “SetValue”.

The regex engine starts at the first token in the regex, «G», and at the first character in the string, “S”. The match fails. However, the regex engine studied the entire regular expression before starting. So it knows that this regular expression uses alternation, and that the entire regex has not failed yet. So it continues with the second option, being the second «G» in the regex. The match fails again. The next token is the first «S» in the regex. The match succeeds, and the engine continues with the next character in the string, as well as the next token in the regex. The next token in the regex is the «e» after the «S» that just successfully matched. «e» matches „e”. The next token, «t» matches „t”.

At this point, the third option in the alternation has been successfully matched. Because the regex engine is eager, it considers the entire alternation to have been successfully matched as soon as one of the options has. In this example, there are no other tokens in the regex outside the alternation, so the entire regex has successfully matched „Set” in “SetValue”.

Contrary to what we intended, the regex did not match the entire string. There are several solutions. One option is to take into account that the regex engine is eager, and change the order of the options. If we use «GetValue|Get|SetValue|Set», «SetValue» will be attempted before «Set», and the engine will match the entire string. We could also combine the four options into two and use the question mark to make part of them optional: «Get(Value)?|Set(Value)?». Because the question mark is greedy, «SetValue» will be attempted before «Set».

The best option is probably to express the fact that we only want to match complete words. We do not want to match Set or SetValue if the string is “SetValueFunction”. So the solution is «\b(Get|GetValue|Set|SetValue)\b» or «\b(Get(Value)?|Set(Value)?)\b». Since all options have the same end, we can optimize this further to «\b(Get|Set)(Value)?\b».
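To see the eagerness of the engine in practice, the variations on the function-name regex can be run side by side. The following is a small sketch in Python's re module, chosen by me for illustration; any regex-directed flavor should behave the same way for these patterns. The variable names are hypothetical.

```python
import re

subject = "SetValue"

# The engine is eager: «Set» is listed before «SetValue», so it wins.
eager = re.search(r"Get|GetValue|Set|SetValue", subject).group()

# Reordering the options puts the longer alternative first.
reordered = re.search(r"GetValue|Get|SetValue|Set", subject).group()

# A greedy question mark tries «Value» before giving up on it.
optional = re.search(r"Get(Value)?|Set(Value)?", subject).group()

# Word boundaries force a complete-word match, whatever the order.
whole = re.search(r"\b(Get|GetValue|Set|SetValue)\b", subject).group()
compact = re.search(r"\b(Get|Set)(Value)?\b", subject).group()

# Neither Set nor SetValue is reported inside a longer identifier.
inside_longer = re.search(r"\b(Get|Set)(Value)?\b", "SetValueFunction")
```

Only the first pattern stops at „Set”; the word-boundary versions are the ones that remain correct if somebody later reorders the alternatives.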

9. Optional Items
The question mark makes the preceding token in the regular expression optional. E.g.: «colou?r» matches both „colour” and „color”.

You can make several tokens optional by grouping them together using round brackets, and placing the question mark after the closing bracket. E.g.: «Nov(ember)?» will match „Nov” and „November”.

You can write a regular expression that matches many alternatives by including more than one question mark. «Feb(ruary)? 23(rd)?» matches „February 23rd”, „February 23”, „Feb 23rd” and „Feb 23”.

Important Regex Concept: Greediness
With the question mark, I have introduced the first metacharacter that is greedy. The question mark gives the regex engine two choices: try to match the part the question mark applies to, or do not try to match it. The engine will always try to match that part. Only if this causes the entire regular expression to fail, will the engine try ignoring the part the question mark applies to.

The effect is that if you apply the regex «Feb 23(rd)?» to the string “Today is Feb 23rd, 2003”, the match will always be „Feb 23rd” and not „Feb 23”. You can make the question mark lazy (i.e. turn off the greediness) by putting a second question mark after the first.

I will say a lot more about greediness when discussing the other repetition operators.

Looking Inside The Regex Engine
Let's apply the regular expression «colou?r» to the string “The colonel likes the color green”.

The first token in the regex is the literal «c». The first position where it matches successfully is the „c” in “colonel”. The engine continues, and finds that «o» matches „o”, «l» matches „l” and another «o» matches „o”. Then the engine checks whether «u» matches “n”. This fails. However, the question mark tells the regex engine that failing to match «u» is acceptable. Therefore, the engine will skip ahead to the next regex token: «r». But this fails to match “n” as well. Now, the engine can only conclude that the entire regular expression cannot be matched starting at the „c” in “colonel”. Therefore, the engine starts again trying to match «c» to the first o in “colonel”.

After a series of failures, «c» will match with the „c” in “color”, and «o», «l» and «o» match the following characters. Now the engine checks whether «u» matches “r”. This fails. Again: no problem. The question mark allows the engine to continue with «r». This matches „r” and the engine reports that the regex successfully matched „color” in our string.
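The difference between the greedy and the lazy question mark is easiest to see by running both. The sketch below uses Python's re module, which is my own choice for illustration; the names are made up.

```python
import re

# «colou?r» cannot match inside "colonel" and reports "color" further on.
colour = re.search(r"colou?r", "The colonel likes the color green")
colour_text, colour_start = colour.group(), colour.start()

# Two optional groups give four accepted spellings.
date_regex = re.compile(r"Feb(ruary)? 23(rd)?")
spellings = ["February 23rd", "February 23", "Feb 23rd", "Feb 23"]
all_accepted = all(date_regex.fullmatch(s) is not None for s in spellings)

subject = "Today is Feb 23rd, 2003"
# Greedy: the optional part is tried first and kept if the regex succeeds.
greedy = re.search(r"Feb 23(rd)?", subject).group()
# Lazy (second question mark): the optional part is skipped first.
lazy = re.search(r"Feb 23(rd)??", subject).group()
```

Since nothing follows the optional group in the lazy version, the engine never has a reason to go back and include „rd”.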

10. Repetition with Star and Plus
I already introduced one repetition operator or quantifier: the question mark. It tells the engine to attempt to match the preceding token zero times or once, in effect making it optional.

The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more. «<[A-Za-z][A-Za-z0-9]*>» matches an HTML tag without any attributes. The sharp brackets are literals. The first character class matches a letter. The second character class matches a letter or digit. The star repeats the second character class. Because we used the star, it's OK if the second character class matches nothing. So our regex will match a tag like „<B>”. When matching „<HTML>”, the first character class will match „H”. The star will cause the second character class to be repeated three times, matching „T”, „M” and „L” with each step.

I could also have used «<[A-Za-z0-9]+>». I did not, because this regex would match „<1>”, which is not a valid HTML tag. But this regex may be sufficient if you know the string you are searching through does not contain any such invalid tags.

Limiting Repetition
Modern regex flavors, like those discussed in this tutorial, have an additional repetition operator that allows you to specify how many times a token can be repeated. The syntax is {min,max}, where min is a non-negative integer number indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is infinite. So «{0,}» is the same as «*», and «{1,}» is the same as «+». Omitting both the comma and max tells the engine to repeat the token exactly min times.

You could use «\b[1-9][0-9]{3}\b» to match a number between 1000 and 9999. «\b[1-9][0-9]{2,4}\b» matches a number between 100 and 99999. Notice the use of the word boundaries.

Watch Out for The Greediness!
Suppose you want to use a regex to match an HTML tag. You know that the input will be a valid HTML file, so the regular expression does not need to exclude any invalid use of sharp brackets. If it sits between sharp brackets, it is an HTML tag.

Most people new to regular expressions will attempt to use «<.+>». They will be surprised when they test it on a string like “This is a <EM>first</EM> test”. You might expect the regex to match „<EM>” and when continuing after that match, „</EM>”.

But it does not. The regex will match „<EM>first</EM>”. Obviously not what we wanted. The reason is that the plus is greedy. That is, the plus causes the regex engine to repeat the preceding token as often as possible. Only if that causes the entire regex to fail, will the regex engine backtrack. That is, it will go back to the plus, make it give up the last iteration, and proceed with the remainder of the regex.

Let's take a look inside the regex engine to see in detail how this works and why this causes our regex to fail. After that, I will present you with two possible solutions.

Like the plus, the star and the repetition using curly braces are greedy.
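Before looking inside the engine, it may help to confirm the claims of this chapter with a quick experiment. This is a minimal sketch in Python's re module, which I am assuming here purely for illustration; the sample strings and variable names are mine.

```python
import re

# The tag regex accepts <B> and <HTML>, but not <1>.
tag = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")
tag_ok = [tag.fullmatch(s) is not None for s in ("<B>", "<HTML>", "<1>")]

# «{3}» repeats exactly three times, «{2,4}» between two and four times.
four_digits = re.compile(r"\b[1-9][0-9]{3}\b")
three_to_five = re.compile(r"\b[1-9][0-9]{2,4}\b")
found_four = four_digits.findall("999 1000 9999 10000")
found_range = three_to_five.findall("99 100 99999 100000")

# «{0,}» behaves like «*» and «{1,}» like «+».
same_as_star = re.fullmatch(r"a{0,}", "") is not None
same_as_plus = re.fullmatch(r"a{1,}", "") is None

# The greedy plus in «<.+>» runs on to the last ">" it can reach.
greedy = re.search(r"<.+>", "This is a <EM>first</EM> test").group()
```

Without the word boundaries, the four-digit regex would also report „1000” inside “10000”, which is why they are worth noticing.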

Looking Inside The Regex Engine
The first token in the regex is «<». This is a literal. As we already know, the first place where it will match is the first „<” in the string. The next token is the dot, which matches any character except newlines. The dot is repeated by the plus. The plus is greedy. Therefore, the engine will repeat the dot as many times as it can. The dot matches „E”, so the regex continues to try to match the dot with the next character. „M” is matched, and the dot is repeated once more. The next character is the “>”. You should see the problem by now. The dot matches the „>”, and the engine continues repeating the dot. The dot will match all remaining characters in the string. The dot fails when the engine has reached the void after the end of the string. Only at this point does the regex engine continue with the next token: «>».

So far, «<.+» has matched „<EM>first</EM> test” and the engine has arrived at the end of the string. «>» cannot match here. The engine remembers that the plus has repeated the dot more often than is required. (Remember that the plus requires the dot to match only once.) Rather than admitting failure, the engine will backtrack. It will reduce the repetition of the plus by one, and then continue trying the remainder of the regex.

So the match of «.+» is reduced to „EM>first</EM> tes”. The next token in the regex is still «>». But now the next character in the string is the last “t”. Again, these cannot match, causing the engine to backtrack further. The total match so far is reduced to „<EM>first</EM> te”. But «>» still cannot match. So the engine continues backtracking until the match of «.+» is reduced to „EM>first</EM”. Now, «>» can match the next character in the string. The last token in the regex has been matched. The engine reports that „<EM>first</EM>” has been successfully matched.

Remember that the regex engine is eager to return a match. It will not continue backtracking further to see if there is another possible match. It will report the first valid match it finds. Because of greediness, this is the leftmost longest match.

Laziness Instead of Greediness
The quick fix to this problem is to make the plus lazy instead of greedy. You can do that by putting a question mark behind the plus in the regex. You can do the same with the star, the curly braces and the question mark itself. So our example becomes «<.+?>». Let's have another look inside the regex engine.

Again, «<» matches the first „<” in the string. The next token is the dot, this time repeated by a lazy plus. This tells the regex engine to repeat the dot as few times as possible. The minimum is one. So the engine matches the dot with „E”. The requirement has been met, and the engine continues with «>» and “M”. This fails. Again, the engine will backtrack. But this time, the backtracking will force the lazy plus to expand rather than reduce its reach. So the match of «.+» is expanded to „EM”, and the engine tries again to continue with «>». Now, „>” is matched successfully. The last token in the regex has been matched. The engine reports that „<EM>” has been successfully matched. That's more like it.

An Alternative to Laziness
In this case, there is a better option than making the plus lazy. We can use a greedy plus and a negated character class: «<[^>]+>». The reason why this is better is because of the backtracking. When using the lazy plus, the engine has to backtrack for each character in the HTML tag that it is trying to match. When using the negated character class, no backtracking occurs at all when the string contains valid HTML code.

Backtracking slows down the regex engine. You will not notice the difference when doing a single search in a text editor. But you will save plenty of CPU cycles when such a regex is used repeatedly in a tight loop in a script that you are writing, or perhaps in a custom syntax coloring scheme for EditPad Pro.

Finally, remember that this tutorial only talks about regex-directed engines. Text-directed engines do not backtrack. They do not get the speed penalty, but they also do not support lazy repetition operators.
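The three variants of the tag regex discussed in this chapter can be compared directly. The sketch below is written for Python's re module, which is a regex-directed (backtracking) engine; that choice, the sample lines and the variable names are my own assumptions for illustration.

```python
import re

subject = "This is a <EM>first</EM> test"

# Greedy dot-plus: one match that runs from the first "<" to the last ">".
greedy = re.findall(r"<.+>", subject)

# Lazy dot-plus: expands one character at a time until ">" can match.
lazy = re.findall(r"<.+?>", subject)

# Negated character class: same result as lazy, without the backtracking.
negated = re.findall(r"<[^>]+>", subject)

# In a tight loop, compile the regex once and reuse the compiled object.
tag = re.compile(r"<[^>]+>")
lines = ["<B>bold</B>", "plain text", "<BR>"]
tags_per_line = [len(tag.findall(line)) for line in lines]
```

The lazy and the negated versions return the same matches on valid HTML; the difference lies only in how much work the engine does to find them.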

11. Use Round Brackets for Grouping
By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a regex operator, e.g. a repetition operator, to the entire group. I have already used round brackets for this purpose in previous topics throughout this tutorial.

Note that only round brackets can be used for grouping. Square brackets define a character class, and curly braces are used by a special repetition operator.

Round Brackets Create a Backreference
Besides grouping part of a regular expression together, round brackets also create a “backreference”. A backreference stores the part of the string matched by the part of the regular expression inside the parentheses.

That is, unless you use non-capturing parentheses. Remembering part of the regex match in a backreference, slows down the regex engine because it has more work to do. If you do not use the backreference, you can speed things up by using non-capturing parentheses, at the expense of making your regular expression slightly harder to read.

The regex «Set(Value)?» matches „Set” or „SetValue”. In the first case, the first backreference will be empty, because it did not match anything. In the second case, the first backreference will contain „Value”.

If you do not use the backreference, you can optimize this regular expression into «Set(?:Value)?». The question mark and the colon after the opening round bracket are the special syntax that you can use to tell the regex engine that this pair of brackets should not create a backreference. Note the question mark after the opening bracket is unrelated to the question mark at the end of the regex. That question mark is the regex operator that makes the previous token optional. This operator cannot appear after an opening round bracket, because an opening bracket by itself is not a valid regex token. Therefore, there is no confusion between the question mark as an operator to make a token optional, and the question mark as a character to change the properties of a pair of round brackets. The colon indicates that the change we want to make is to turn off capturing the backreference.

How to Use Backreferences
Backreferences allow you to reuse part of the regex match. You can reuse it inside the regular expression (see below), or afterwards. What you can do with it afterwards, depends on the tool you are using. In EditPad Pro or PowerGREP, you can use the backreference in the replacement text during a search-and-replace operation by typing \1 (backslash one) into the replacement text. If you searched for «EditPad (Lite|Pro)» and use “\1 version” as the replacement, the actual replacement will be “Lite version” in case „EditPad Lite” was matched, and “Pro version” in case „EditPad Pro” was matched.

EditPad Pro and PowerGREP have a unique feature that allows you to change the case of the backreference. \U1 inserts the first backreference in uppercase, \L1 in lowercase and \F1 with the first character in uppercase and the remainder in lowercase. Finally, \I1 inserts it with the first letter of each word capitalized, and the other letters in lowercase.
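Capturing and non-capturing groups, and the use of \1 in a replacement, can be tried out as follows. This is a sketch in Python's re module, which I am assuming for illustration only. Note one flavor detail: where the text above says the backreference is empty, Python reports None for a group that did not take part in the match. The variable names are hypothetical.

```python
import re

# A capturing group remembers what it matched.
with_value = re.fullmatch(r"Set(Value)?", "SetValue").group(1)
# The optional group did not participate, so Python returns None.
without_value = re.fullmatch(r"Set(Value)?", "Set").group(1)

# «(?:...)» groups without creating a backreference.
non_capturing = re.compile(r"Set(?:Value)?")
group_count = non_capturing.groups  # number of capturing groups

# "\1 version" as the replacement text.
texts = ["EditPad Lite", "EditPad Pro"]
replaced = [re.sub(r"EditPad (Lite|Pro)", r"\1 version", t) for t in texts]

# Python has no \U1; a replacement function can change the case instead.
upper = re.sub(r"EditPad (Lite|Pro)", lambda m: m.group(1).upper(), "EditPad Lite")
```

The replacement function in the last line is only a rough stand-in for the case-conversion feature of EditPad Pro and PowerGREP, not an equivalent syntax.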

Regex libraries in programming languages also provide access to the backreference. In Perl, you can use the magic variables $1, $2, etc. to access the part of the string matched by the backreference. In .NET (dot net), you can use the Match object that is returned by the Match method of the Regex class. This object has a property called Groups, which is a collection of Group objects. To get the string matched by the third backreference in C#, you can use MyMatch.Groups[3].Value.

The .NET (dot net) Regex class also has a method Replace that can do a regex-based search-and-replace on a string. In the replacement text, you can use $1, $2, etc. to insert backreferences.

To figure out the number of a particular backreference, scan the regular expression from left to right and count the opening round brackets. The first bracket starts backreference number one, the second number two, etc. Non-capturing parentheses are not counted. This fact means that non-capturing parentheses have another benefit: you can insert them into a regular expression without changing the numbers assigned to the backreferences. This can be very useful when modifying a complex regular expression.

The Entire Regex Match As Backreference Zero
Certain tools make the entire regex match available as backreference zero. In EditPad Pro or PowerGREP, you can use the entire regex match in the replacement text during a search and replace operation by typing \0 (backslash zero) into the replacement text. In Perl, the magic variable $& holds the entire regex match.

In libraries like .NET (dot net), where backreferences are made available as an array or numbered list, the item with index zero holds the entire regex match. Using backreference zero is more efficient than putting an extra pair of round brackets around the entire regex, because that would force the engine to continuously keep an extra copy of the entire regex match.

Using Backreferences in The Regular Expression
Backreferences can not only be used after a match has been found, but also during the match. Suppose you want to match a pair of opening and closing HTML tags, and the text in between. By putting the opening tag into a backreference, we can reuse the name of the tag for the closing tag. Here's how: «<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>». This regex contains only one pair of parentheses, which capture the string matched by «[A-Z][A-Z0-9]*» into the first backreference. This backreference is reused with «\1» (backslash one). The «/» before it is simply the forward slash in the closing HTML tag that we are trying to match.

You can reuse the same backreference more than once. «([a-c])x\1x\1» will match „axaxa”, „bxbxb” and „cxcxc”. If a backreference was not used in a particular match attempt (such as in the first example where the question mark made the first backreference optional), it is simply empty. Using an empty backreference in the regex is perfectly fine. It will simply be replaced with nothingness.

A backreference cannot be used inside itself. «([abc]\1)» will not work. Depending on your regex flavor, it will either give an error message, or it will fail to match anything without an error message. Therefore, \0 cannot be used inside a regex, only in the replacement.
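Backreference numbering and the reuse of a backreference inside the regex can be checked with a short script. The sketch below assumes Python's re module, where group(0) plays the role of backreference zero and group(n) returns the n-th backreference; the third regex and all variable names are made up for illustration.

```python
import re

# The same backreference can be reused more than once.
repeated = re.compile(r"([a-c])x\1x\1")
samples = ("axaxa", "bxbxb", "cxcxc", "axbxa")
accepted = [repeated.fullmatch(s) is not None for s in samples]

# Opening brackets are counted from left to right; «(?:...)» is skipped.
numbered = re.fullmatch(r"(?:Get|Set)(Value)(\d+)", "SetValue42")
first, second = numbered.group(1), numbered.group(2)

# Matching a pair of opening and closing HTML tags with «\1».
pair = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>")
found = pair.search("Testing <B><I>bold italic</I></B> text")
whole_match = found.group(0)  # backreference zero: the entire regex match
tag_name = found.group(1)     # first backreference: the tag name
```

Because the non-capturing group at the start of the second regex is not counted, „Value” is still backreference number one.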

Looking Inside The Regex Engine

Let's see how the regex engine applies the above regex to the string “Testing <B><I>bold italic</I></B> text”. The first token in the regex is the literal «<». The regex engine will traverse the string until it can match at the first „<” in the string. The next token is «[A-Z]». The regex engine also takes note that it is now inside the first pair of capturing parentheses. «[A-Z]» matches „B”. The engine advances to «[A-Z0-9]» and “>”. This match fails. However, because of the star, that's perfectly fine. The position in the string remains at “>”. The position in the regex is advanced to «[^>]».

This step crosses the closing bracket of the first pair of capturing parentheses. This prompts the regex engine to store what was matched inside them into the first backreference. In this case, „B” is stored.

After storing the backreference, the engine proceeds with the match attempt. «[^>]» does not match „>”. Again, because of another star, this is not a problem. The position in the string remains at “>”, and position in the regex is advanced to «>». These obviously match. The next token is a dot, repeated by a lazy star. Because of the laziness, the regex engine will initially skip this token, taking note that it should backtrack in case the remainder of the regex fails.

The engine has now arrived at the second «<» in the regex, and the second “<” in the string. These match. The next token is «/». This does not match “I”, and the engine is forced to backtrack to the dot. The dot matches the second „<” in the string. The star is still lazy, so the engine again takes note of the available backtracking position and advances to «<» and “I”. These do not match, so the engine again backtracks.

The backtracking continues until the dot has consumed „<I>bold italic”. At this point, «<» matches the third „<” in the string, and the next token is «/» which matches “/”. The next token is «\1». Note that the token is the backreference, and not «B». The engine does not substitute the backreference in the regular expression. Every time the engine arrives at the backreference, it will read the value that was stored. This means that if the engine had backtracked beyond the first pair of capturing parentheses before arriving the second time at «\1», the new value stored in the first backreference would be used. But this did not happen here, so „B” it is. This fails to match at “I”, so the engine backtracks again, and the dot consumes the third “<” in the string.

Backtracking continues again until the dot has consumed „<I>bold italic</I>”. At this point, «<» matches „<” and «/» matches „/”. The engine arrives again at «\1». The backreference still holds „B”. «B» matches „B”. The last token in the regex, «>» matches „>”. A complete match has been found: „<B><I>bold italic</I></B>”.

Repetition and Backreferences

As I mentioned in the above inside look, the regex engine does not permanently substitute backreferences in the regular expression. It will use the last match saved into the backreference each time it needs to be used. If a new match is found by capturing parentheses, the previously saved match is overwritten. There is a clear difference between «([abc]+)» and «([abc])+». Though both successfully match „cab”, the first regex will put „cab” into the first backreference, while the second regex will only store „b”. That is because in the second regex, the plus caused the pair of parentheses to repeat three times. The first time, „c” was stored. The second time „a” and the third time „b”. Each time, the previous value was overwritten, so „b” remains.

This also means that «([abc]+)=\1» will match „cab=cab”, and that «([abc])+=\1» will not. The reason is that when the engine arrives at «\1», it holds «b» which fails to match “c”. Obvious when you look at a simple example like this one, but a common cause of difficulty with regular expressions nonetheless. When using backreferences, always double check that you are really capturing what you want.

Useful Example: Checking for Doubled Words

When editing text, doubled words such as “the the” easily creep in. Using the regex «\b(\w+)\s+\1\b» in your text editor, you can easily find them. To delete the second word, simply type in “\1” as the replacement text and click the Replace button.

Parentheses and Backreferences Cannot Be Used Inside Character Classes

Round brackets cannot be used inside character classes, at least not as metacharacters. When you put a round bracket in a character class, it is treated as a literal character. So the regex «[(a)b]» matches „a”, „b”, „(” and „)”.

Backreferences also cannot be used inside a character class. The \1 in a regex like «(a)[\1b]» will be interpreted as an octal escape in most regex flavors. So this regex will match an a followed by either «\x01» or a «b».
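The behaviour of capturing groups and backreferences described above can be checked with a few lines of Python. This is only a sketch using Python's re module. The tag-matching pattern is an assumption, reconstructed from the tokens named in the walkthrough, and the sample strings other than those quoted in the text are made up for illustration.

```python
import re

# Assumed form of the tag-matching regex walked through above.
tag_pair = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>")
tag_match = tag_pair.search("Testing <B><I>bold italic</I></B> text")

# A group around the repetition keeps everything it matched; a repeated
# group keeps only what its last iteration matched.
whole_group = re.match(r"([abc]+)", "cab").group(1)
last_iteration = re.match(r"([abc])+", "cab").group(1)

# The backreference therefore holds "cab" in one regex and "b" in the other.
matches_with_whole = re.fullmatch(r"([abc]+)=\1", "cab=cab") is not None
matches_with_last = re.fullmatch(r"([abc])+=\1", "cab=cab") is not None

# Doubled words: keep the first copy, drop the repeat.
deduplicated = re.sub(r"\b(\w+)\s+\1\b", r"\1", "Paris in the the spring")

# Inside a character class, round brackets are literal characters.
class_hits = re.findall(r"[(a)b]", "(a)b")

print(tag_match.group(0), tag_match.group(1))
print(whole_group, last_iteration, matches_with_whole, matches_with_last)
print(deduplicated, class_hits)
```

Comparing `whole_group` with `last_iteration` is a quick way to confirm what a group really captured before relying on its backreference.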

12. Use Round Brackets for Grouping

All modern regular expression engines support capturing groups, which are numbered from left to right, starting with one. The numbers can then be used in backreferences to match the same text again in the regular expression, or to use part of the regex match for further processing. In a complex regular expression with many capturing groups, the numbering can get a little confusing.

Named Capture with Python, PCRE and PHP

Python's re module was the first to offer a solution: named capture. By assigning a name to a capturing group, you can easily reference it by name. «(?P<name>group)» captures the match of «group» into the backreference “name”. You can reference the contents of the group with the numbered backreference «\1» or the named backreference «(?P=name)».

The open source PCRE library has followed Python's example, and offers named capture using the same syntax. The PHP preg functions offer the same functionality, since they are based on PCRE.

Python's sub() function allows you to reference a named group as “\1” or “\g<name>”. This does not work in PHP. In PHP, you can use double-quoted string interpolation with the $regs parameter you passed to preg_match(): “$regs['name']”.

Named Capture with .NET's System.Text.RegularExpressions

The regular expression classes of the .NET framework also support named capture. Unfortunately, the Microsoft developers decided to invent their own syntax, rather than follow the one pioneered by Python. Currently, no other regex flavor supports Microsoft's version of named capture.

Here is an example with two capturing groups in .NET style: «(?<first>group)(?'second'group)». As you can see, .NET offers two syntaxes to create a capturing group: one using sharp brackets, and the other using single quotes. The first syntax is preferable in strings, where single quotes may need to be escaped. The second syntax is preferable in ASP code, where the sharp brackets are used for HTML tags. You can use the pointy bracket flavor and the quoted flavors interchangeably.

To reference a capturing group inside the regex, use «\k<name>» or «\k'name'». Again, you can use the two syntactic variations interchangeably.

When doing a search-and-replace, you can reference the named group with the familiar dollar sign syntax: “${name}”. Simply use a name instead of a number between the curly braces.

Names and Numbers for Capturing Groups

Here is where things get a bit ugly. Python and PCRE treat named capturing groups just like unnamed capturing groups, and number both kinds from left to right, starting with one. The regex «(a)(?P<x>b)(c)(?P<y>d)» matches „abcd” as expected. If you do a search-and-replace with this regex and the replacement “\1\2\3\4”, you will get “abcd”. All four groups were numbered from left to right, from one till four. Easy and logical.

Things are quite a bit more complicated with the .NET framework. The regex «(a)(?<x>b)(c)(?<y>d)» again matches „abcd”. However, if you do a search-and-replace with “$1$2$3$4” as the replacement, you will get “acbd”. Probably not what you expected.

The .NET framework does number named capturing groups from left to right, but numbers them after all the unnamed groups have been numbered. So the unnamed groups «(a)» and «(c)» get numbered first, from left to right, starting at one. Then the named groups «(?<x>b)» and «(?<y>d)» get their numbers, continuing from the unnamed groups, in this case: three.

To make things simple, when using .NET's regex support, just assume that named groups do not get numbered at all, and reference them by name exclusively. To keep things compatible across regex flavors, I strongly recommend that you do not mix named and unnamed capturing groups at all. Either give a group a name, or make it non-capturing as in «(?:nocapture)». Non-capturing groups are more efficient, since the regex engine does not need to keep track of their matches.

RegexBuddy supports both Python's and Microsoft's style, and will convert one flavor of named capture into the other when generating source code snippets for Python, PHP/preg, PHP, or one of the .NET languages.
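The Python side of the numbering rules above can be sketched as follows. The snippet only demonstrates Python's re module; the .NET numbering cannot be reproduced here, and the sample strings are illustrative assumptions.

```python
import re

pattern = re.compile(r"(a)(?P<x>b)(c)(?P<y>d)")
match = pattern.fullmatch("abcd")

# Named and unnamed groups share one numbering, left to right.
numbered = match.groups()
by_name = (match.group("x"), match.group("y"))
name_to_number = dict(pattern.groupindex)

# Replacement text may refer to groups by number or as \g<name>.
by_number = pattern.sub(r"\1\2\3\4", "abcd")
reversed_text = pattern.sub(r"\g<y>\3\g<x>\1", "abcd")

# (?P=name) is the named backreference inside the regex itself.
doubled = re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", "it is is here")

print(numbered, by_name, name_to_number)
print(by_number, reversed_text, doubled.group("word"))
```

`groupindex` makes the mapping from names to numbers visible, which helps when a regex mixes both kinds of groups despite the recommendation above.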

13. Regex Matching Modes

All regular expression engines discussed in this tutorial support the following three matching modes:

• /i makes the regex match case insensitive.
• /s enables "single-line mode". In this mode, the dot matches newlines.
• /m enables "multi-line mode". In this mode, the caret and dollar match before and after newlines in the subject string.

Many regex flavors have additional modes or options that have single letter equivalents, but these differ widely.

Most tools that support regular expressions have checkboxes or similar controls that you can use to turn these modes on or off. Most programming languages allow you to pass option flags when constructing the regex object. E.g. in Perl, m/regex/i turns on case insensitivity, while Pattern.compile(“regex”, Pattern.CASE_INSENSITIVE) does the same in Java.

Specifying Modes Inside The Regular Expression

Sometimes, the tool or language does not provide the ability to specify matching options. E.g. the handy String.matches() method in Java does not take a parameter for matching options like Pattern.compile() does.

In that situation, you can add a mode modifier to the start of the regex. E.g. (?i) turns on case insensitivity, while (?ism) turns on all three options.

Turning Modes On and Off for Only Part of The Regular Expression

Modern regex flavors allow you to apply modifiers to only part of the regular expression. If you insert the modifier (?ism) in the middle of the regex, the modifier only applies to the part of the regex to the right of the modifier. You can turn off a mode by preceding it with a minus sign. To turn off several modes, list their letters after a single minus sign. E.g. (?i-sm) turns on case insensitivity, and turns off both single-line mode and multi-line mode.

Not all regex flavors support this. The latest versions of all tools and languages discussed in this book do. Older regex flavors usually apply the option to the entire regular expression, no matter where you placed it.

You can quickly test this. The regex «(?i)te(?-i)st» should match „test” and „TEst”, but not “teST” or “TEST”.

Modifier Spans

Instead of using two modifiers, one to turn an option on, and one to turn it off, you use a modifier span. «(?i)ignorecase(?-i)casesensitive(?i)ignorecase» is equivalent to «(?i)ignorecase(?-i:casesensitive)ignorecase». You have probably noticed the resemblance between the modifier span and the non-capturing group «(?:group)». Technically, the non-capturing group is a modifier span that does not change any modifiers. It is obvious that the modifier span does not create a backreference.
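The three modes and the modifier span can be tried out with this small Python sketch. It assumes Python 3.6 or later. As far as I know, Python's re module accepts a bare modifier such as «(?i)» only at the start of the pattern and does not accept «(?-i)» on its own, so the quick test from the text is written with a modifier span instead.

```python
import re

# Flag passed when matching, the counterpart of Perl's m/regex/i.
flag_hit = re.fullmatch("regex", "ReGeX", re.IGNORECASE) is not None
# The same mode requested from inside the pattern.
inline_hit = re.fullmatch(r"(?i)regex", "ReGeX") is not None

# Single-line mode: the dot also matches a newline.
dot_default = re.search(r"a.b", "a\nb") is not None
dot_single_line = re.search(r"a.b", "a\nb", re.DOTALL) is not None

# Multi-line mode: caret and dollar also match at line breaks.
lines_default = re.findall(r"^\w+$", "one\ntwo")
lines_multi = re.findall(r"^\w+$", "one\ntwo", re.MULTILINE)

# Modifier span: only "te" is matched case insensitively.
span = re.compile(r"(?i:te)st")
span_results = {s: span.fullmatch(s) is not None
                for s in ("test", "TEst", "teST", "TEST")}

# Turning a mode off for part of an otherwise case-insensitive regex.
mixed = re.compile(r"(?i)ignorecase(?-i:casesensitive)ignorecase")
mixed_ok = mixed.fullmatch("IGNORECASEcasesensitiveIgnoreCase") is not None
mixed_bad = mixed.fullmatch("ignorecaseCASESENSITIVEignorecase") is not None

print(flag_hit, inline_hit, dot_default, dot_single_line)
print(lines_default, lines_multi, span_results, mixed_ok, mixed_bad)
```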

14. Atomic Grouping and Possessive Quantifiers

When discussing the repetition operators or quantifiers, I explained the difference between greedy and lazy repetition. Greediness and laziness determine the order in which the regex engine tries the possible permutations of the regex pattern. A greedy quantifier will first try to repeat the token as many times as possible, and gradually give up matches as the engine backtracks to find an overall match. A lazy quantifier will first repeat the token as few times as required, and gradually expand the match as the engine backtracks through the regex to find an overall match.

Because greediness and laziness change the order in which permutations are tried, they can change the overall regex match. However, they do not change the fact that the regex engine will backtrack to try all possible permutations of the regular expression in case no match can be found.

First, let's see why backtracking can lead to problems.

Catastrophic Backtracking

Recently I got a complaint from a customer that EditPad Pro hung (i.e. it stopped responding) when trying to find lines in a comma-delimited text file where the 12th item on a line started with a “P”. The customer was using the regexp «^(.*?,){11}P».

At first sight, this regex looks like it should do the job just fine. The lazy dot and comma match a single comma-delimited field, and the {11} skips the first 11 fields. Finally, the P checks if the 12th field indeed starts with P. In fact, this is exactly what will happen when the 12th field indeed starts with a P.

The problem rears its ugly head when the 12th field does not start with a P. Let's say the string is “1,2,3,4,5,6,7,8,9,10,11,12,13”. At that point, the regex engine will backtrack. It will backtrack to the point where «^(.*?,){11}» had consumed „1,2,3,4,5,6,7,8,9,10,11”, giving up the last match of the comma. The next token is again the dot. The dot matches a comma. The dot matches the comma! However, the comma does not match the “1” in the 12th field, so the dot continues until the 11th iteration of «.*?,» has consumed „11,12,”. You can already see the root of the problem: the part of the regex (the dot) matching the contents of the field also matches the delimiter (the comma). Because of the double repetition (star inside {11}), this leads to a catastrophic amount of backtracking.

The regex engine now checks whether the 13th field starts with a P. It does not. Since there is no comma after the 13th field, the regex engine can no longer match the 11th iteration of «.*?,». But it does not give up there. It backtracks to the 10th iteration, expanding the match of the 10th iteration to „10,11,”. Since there is still no P, the 10th iteration is expanded to „10,11,12,”. Reaching the end of the string again, the same story starts with the 9th iteration, subsequently expanding it to „9,10,”, „9,10,11,”, „9,10,11,12,”. But between each expansion, there are more possibilities to be tried. When the 9th iteration consumes „9,10,”, the 10th could match just „11,” as well as „11,12,”. Continuously failing, the engine backtracks to the 8th iteration, again trying all possible combinations for the 9th, 10th, and 11th iterations.

You get the idea: the possible number of combinations that the regex engine will try for each line where the 12th field does not start with a P is huge. This causes software like EditPad Pro to stop responding, or even crash as the regex engine runs out of memory trying to remember all backtracking positions.
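To get a feel for how the number of combinations grows, the following Python sketch counts the ways in which the 11 iterations of «.*?,» can each end on a comma. It is a simplified model rather than a trace of any real engine: it ignores the steps spent inside each iteration, and the helper name `count_p_attempts` is made up for this illustration.

```python
from math import comb


def count_p_attempts(line: str, iterations: int = 11) -> int:
    """Count the ways `iterations` copies of «.*?,» can each end on a comma.

    Every such arrangement is one point at which a backtracking engine
    gets to test «P», so on a line that never matches all of them are tried.
    """
    commas = [i for i, ch in enumerate(line) if ch == ","]

    def arrangements(first_comma: int, remaining: int) -> int:
        if remaining == 0:
            return 1
        # The current iteration may end on any comma that is still ahead.
        return sum(
            arrangements(k + 1, remaining - 1)
            for k in range(first_comma, len(commas))
        )

    return arrangements(0, iterations)


sample = "1,2,3,4,5,6,7,8,9,10,11,12,13"
wider = ",".join(str(n) for n in range(1, 17))  # 16 fields, 15 commas

print(count_p_attempts(sample), count_p_attempts(wider))
# The same figure in closed form: choose 11 of the available commas.
print(comb(12, 11), comb(15, 11), comb(30, 11))
```

Under this model the count equals the binomial coefficient "number of commas choose 11", so every additional field on a non-matching line multiplies the work.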

Preventing Catastrophic Backtracking

The solution is simple. When nesting repetition operators, make absolutely sure that there is only one way to match the same match. If repeating the inner loop 4 times and the outer loop 7 times results in the same overall match as repeating the inner loop 6 times and the outer loop 2 times, you can be sure that the regex engine will try all those combinations.

In our example, the solution is to be more exact about what we want to match. We want to match 11 comma-delimited fields. The fields must not contain commas. So the regex becomes: «^([^,\r\n]*,){11}P». If the P cannot be found, the engine will still backtrack. But it will backtrack only 11 times, and each time the «[^,\r\n]» is not able to expand beyond the comma, forcing the regex engine to the previous one of the 11 iterations immediately, without trying further options.

Atomic Grouping and Possessive Quantifiers

Recent regex flavors have introduced two additional solutions to this problem: atomic grouping and possessive quantifiers. Their purpose is to prevent backtracking, allowing the regex engine to fail faster.

In the above example, we could easily reduce the amount of backtracking to a very low level by better specifying what we wanted. But that is not always possible in such a straightforward manner. In that case, you should use atomic grouping to prevent the regex engine from backtracking.

Using atomic grouping, the above regex becomes «^(?>(.*?,){11})P». Everything between (?>) is treated as one single token by the regex engine, once the regex engine leaves the group. Because the entire group is one token, no backtracking can take place once the regex engine has found a match for the group. If backtracking is required, the engine has to backtrack to the regex token before the group (the caret in our example). If there is no token before the group, the regex must retry the entire regex at the next position in the string.

Possessive quantifiers are a limited form of atomic grouping with a cleaner notation. To make a quantifier possessive, place a plus after it. «x++» is the same as «(?>x+)». Similarly, you can use «x*+», «x?+» and «x{m,n}+». Note that you cannot make a lazy quantifier possessive. It would match the minimum number of matches and never expand the match because backtracking is not allowed.

Tool and Language Support for Atomic Grouping and Possessive Quantifiers

Atomic grouping is a recent addition to the regex scene, and only supported by the latest versions of most regex flavors. Perl supports it starting with version 5.6. Java supports it starting with JDK version 1.4.2, though the JDK documentation uses the term “independent group” rather than “atomic group”. All versions of .NET support atomic grouping, as do recent versions of PCRE and PHP's preg functions. Python does not support atomic grouping.

At this time, possessive quantifiers are only supported by the Java JDK 1.4.0 and later, and PCRE version 4 and later.

The latest versions of EditPad Pro and PowerGREP support both atomic grouping and possessive quantifiers, as do all versions of RegexBuddy.
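The more exact regex from this section can be exercised with Python's re module as sketched below. The sample lines and variable names are assumptions made for the illustration. The last check uses a long non-matching line to show that this version still fails promptly.

```python
import re

# Fields are "anything but a comma or line break", so a field can never
# swallow its own delimiter.
twelfth_starts_with_p = re.compile(r"^([^,\r\n]*,){11}P")

lines = [
    "1,2,3,4,5,6,7,8,9,10,11,12,13",
    "a,b,c,d,e,f,g,h,i,j,k,Pear,m",
    "a,b,c,d,e,f,g,h,i,j,Plum,l,m",   # P in the 11th field only
]
hits = [line for line in lines if twelfth_starts_with_p.match(line)]

# A long line without any P: there is only one way to split it into fields.
long_line = ",".join(["x"] * 200)
long_line_matches = twelfth_starts_with_p.match(long_line) is not None

print(hits, long_line_matches)
```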

Atomic Grouping Inside The Regex Engine

Let's see how «^(?>(.*?,){11})P» is applied to “1,2,3,4,5,6,7,8,9,10,11,12,13”. The caret matches at the start of the string and the engine enters the atomic group. The star is lazy, so the dot is initially skipped. But the comma does not match “1”, so the engine backtracks to the dot. That's right: backtracking is allowed here. The star is not possessive, and is not immediately enclosed by an atomic group. That is, the regex engine did not cross the closing round bracket of the atomic group. The dot matches „1”, and the comma matches too. «{11}» causes further repetition until the atomic group has matched „1,2,3,4,5,6,7,8,9,10,11,”.

Now, the engine leaves the atomic group. Because the group is atomic, all backtracking information is discarded and the group is now considered a single token. The engine now tries to match «P» to the “1” in the 12th field. This fails.

So far, everything happened just like in the original, troublesome regular expression. Now comes the difference. «P» failed to match, so the engine backtracks. The previous token is an atomic group, so the group's entire match is discarded and the engine backtracks further to the caret. The engine now tries to match the caret at the next position in the string, which fails. The engine walks through the string until the end, and declares failure. Failure is declared after 30 attempts to match the caret, and just one attempt to match the atomic group, rather than after 30 attempts to match the caret and a huge number of attempts to try all combinations of both quantifiers in the regex.

That is what atomic grouping and possessive quantifiers are for: efficiency by disallowing backtracking. The most efficient regex for our problem at hand would be «^(?>((?>[^,\r\n]*),){11})P», since possessive, greedy repetition of the star is faster than a backtracking lazy dot. If possessive quantifiers are available, you can reduce clutter by writing «^(?>([^,\r\n]*+,){11})P».

When To Use Atomic Grouping or Possessive Quantifiers

Atomic grouping and possessive quantifiers speed up failure by eliminating backtracking. They do not speed up success, only failure. When nesting quantifiers like in the above example, you really should use atomic grouping and/or possessive quantifiers whenever possible.

While «x[^x]*+x» and «x(?>[^x]*)x» fail faster than «x[^x]*x», the increase in speed is minimal. If the final x in the regex cannot be matched, the regex engine backtracks once for each character matched by the star. With simple repetition, the amount of time wasted with pointless backtracking increases in a linear fashion to the length of the string. With combined repetition, the amount of time wasted increases exponentially and will very quickly exhaust the capabilities of your computer. Still, if you are smart about combined repetition, you often can avoid the problem without atomic grouping as in the example above.

If you are simply doing a search in a text editor, using simple repetition, you will not earn back the extra time to type in the characters for the atomic grouping. If the regex will be used in a tight loop in an application, or process huge amounts of data, then atomic grouping may make a difference.

Note that atomic grouping and possessive quantifiers can alter the outcome of the regular expression match. «\d+6» will match „123456” in “123456789”. «\d++6» will not match at all. «\d+» will match the entire string. With the former regex, the engine backtracks until the 6 can be matched. In the latter case, no backtracking is allowed, and the match fails. Again, the cause of this is that the token «\d» that is repeated can also match the delimiter «6». Sometimes this is desirable, often it is not.

This shows again that understanding how the regex engine works on the inside will enable you to avoid many pitfalls and craft efficient regular expressions that match exactly what you want.
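The change in outcome between «\d+6» and «\d++6» can be reproduced in Python with the sketch below. So that it also runs on Python versions without the «(?>...)» and «++» syntax, it uses a commonly cited workaround rather than a real possessive quantifier: a lookahead captures the greedy match and a named backreference then consumes exactly that text. This relies on the assumption that the engine does not backtrack into a lookahead once it has matched. The group names are made up for the example.

```python
import re

text = "123456789"

# Ordinary greedy repetition backtracks until «6» can match.
greedy = re.search(r"\d+6", text)

# Emulated «\d++6»: the lookahead grabs all the digits, the backreference
# consumes exactly that text, and none of it can be given back.
possessive_like = re.search(r"(?=(?P<digits>\d+))(?P=digits)6", text)

# When the repeated token cannot match the delimiter, both forms agree.
plain = re.search(r"x[^x]*x", "x123x")
locked = re.search(r"x(?=(?P<body>[^x]*))(?P=body)x", "x123x")

print(greedy.group(), possessive_like, plain.group(), locked.group())
```

The second pair shows the safe case: because «[^x]» can never match the closing x, forbidding backtracking changes only how quickly a failure is reported, not which text is matched.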

15. Lookahead and Lookbehind Zero-Width Assertions

Perl 5 introduced two very powerful constructs: “lookahead” and “lookbehind”. Collectively, these are called “lookaround”. They are also called “zero-width assertions”. They are zero-width just like the start and end of line, and start and end of word anchors that I already explained. The difference is that lookarounds will actually match characters, but then give up the match and only return the result: match or no match. That is why they are called “assertions”. They do not consume characters in the string, but only assert whether a match is possible or not. Lookarounds allow you to create regular expressions that are impossible to create without them, or that would get very longwinded without them.

All regex flavors discussed in this book support lookaround. The exception is JavaScript, which supports lookahead but not lookbehind.

Positive and Negative Lookahead

Negative lookahead is indispensable if you want to match something not followed by something else. When explaining character classes, I already explained why you cannot use a negated character class to match a “q” not followed by a “u”. Negative lookahead provides the solution: «q(?!u)». The negative lookahead construct is the pair of round brackets, with the opening bracket followed by a question mark and an exclamation point. Inside the lookahead, we have the trivial regex «u».

Positive lookahead works just the same. «q(?=u)» matches a q that is followed by a u, without making the u part of the match. The positive lookahead construct is a pair of round brackets, with the opening bracket followed by a question mark and an equals sign.

You can use any regular expression inside the lookahead. (Note that this is not the case with lookbehind. I will explain why below.) If it contains capturing parentheses, the backreferences will be saved. Note that the lookahead itself does not create a backreference. So it is not included in the count towards numbering the backreferences. If you want to store the match of the regex inside a backreference, you have to put capturing parentheses around the regex inside the lookahead, like this: «(?=(regex))». The other way around will not work, because the lookahead will already have discarded the regex match by the time the backreference is to be saved.

Regex Engine Internals

First, let's see how the engine applies «q(?!u)» to the string “Iraq”. The first token in the regex is the literal «q». As we already know, this will cause the engine to traverse the string until the „q” in the string is matched. The position in the string is now the void behind the string. The next token is the lookahead. The engine takes note that it is inside a lookahead construct now, and begins matching the regex inside the lookahead. So the next token is «u». This does not match the void behind the string. The engine notes that the regex inside the lookahead failed. Because the lookahead is negative, this means that the lookahead has successfully matched at the current position. At this point, the entire regex has matched, and „q” is returned as the match.

Let's try applying the same regex to “quit”. «q» matches „q”. The next token is the «u» inside the lookahead. The next character is the “u”. These match. The engine advances to the next character: “i”. However, it is done with the regex inside the lookahead. The engine notes success, and discards the regex match. This causes the engine to step back in the string to “u”.

Because the lookahead is negative, the successful match inside it causes the lookahead to fail. Since there are no other permutations of this regex, the engine has to start again at the beginning. Since «q» cannot match anywhere else, the engine reports failure.

Let's take one more look inside, to make sure you understand the implications of the lookahead. Let's apply «q(?=u)i» to “quit”. I have made the lookahead positive, and put a token after it. Again, «q» matches „q” and «u» matches „u”. Again, the match from the lookahead must be discarded, so the engine steps back from “i” in the string to “u”. The lookahead was successful, so the engine continues with «i». But «i» cannot match “u”. So this match attempt fails. All remaining attempts will fail as well, because there are no more q's in the string.

Positive and Negative Lookbehind

Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there. «(?<!a)b» matches a “b” that is not preceded by an “a”, using negative lookbehind. It will not match “cab”, but will match the „b” (and only the „b”) in “bed” or “debt”. «(?<=a)b» (positive lookbehind) matches the „b” (and only the „b”) in „cab”, but does not match “bed” or “debt”.

The construct for positive lookbehind is «(?<=text)»: a pair of round brackets, with the opening bracket followed by a question mark, “less than” symbol and an equals sign. Negative lookbehind is written as «(?<!text)», using an exclamation point instead of an equals sign.

More Regex Engine Internals

Let's apply «(?<=a)b» to “thingamabob”. The engine starts with the lookbehind and the first character in the string. In this case, the lookbehind tells the engine to step back one character, and see if an “a” can be matched there. The engine cannot step back one character because there are no characters before the “t”. So the lookbehind fails, and the engine starts again at the next character, the “h”. (Note that a negative lookbehind would have succeeded here.) Again, the engine temporarily steps back one character to check if an “a” can be found there. It finds a “t”, so the positive lookbehind fails again.

The lookbehind continues to fail until the regex reaches the “m” in the string. The engine again steps back one character, and notices that the „a” can be matched there. The positive lookbehind matches. Because it is zero-width, the current position in the string remains at the “m”. The next token is «b», which cannot match here. The next character is the second “a” in the string. The engine steps back, and finds out that the “m” does not match «a».

The next character is the first “b” in the string. The engine steps back and finds out that „a” satisfies the lookbehind. «b» matches „b”, and the entire regex has been matched successfully. It matches one character: the first „b” in the string.

Important Notes About Lookbehind

The good news is that you can use lookbehind anywhere in the regex, not only at the start. If you want to find a word not ending with an “s”, you could use «\b\w+(?<!s)\b». This is definitely not the same as «\b\w+[^s]\b». When applied to “John's”, the former will match „John” and the latter „John'” (including the apostrophe). I will leave it up to you to figure out why. (Hint: «\b» matches between the apostrophe and the “s”). The latter will also not match single-letter words like “a” or “I”. The correct regex without using lookbehind is «\b\w*[^s\W]\b» (star instead of plus, and \W in the character class). Personally, I find the lookbehind easier to understand. The last regex, which works correctly, has a double negation (the \W in the negated character class). Double negations tend to be confusing to humans. Not to regex engines, though.

The bad news is that you cannot use just any regex inside a lookbehind. The reason is that regular expressions do not work backwards. The string must be traversed from left to right. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

Therefore, many regex flavors, including those used by Perl 5 and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.

Some regex flavors support the above, plus alternation with strings of different lengths. But each string in the alternation must still be of fixed length, so only literals and character classes can be used. This includes PCRE, PHP, RegexBuddy, EditPad Pro and PowerGREP.

Finally, some more advanced flavors support the above, plus finite repetition. This means you can still not use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. These regex flavors recognize the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. The only regex flavor that I know of that currently supports this is Sun's regex package in the JDK 1.4.

Technically, the .NET framework can apply regular expressions backwards, and will allow you to use any regex, including infinite repetition, inside lookbehind. However, the semantics of applying a regular expression backwards are currently not well-defined. Microsoft has promised to resolve this in version 2.0 of the .NET framework. Until that happens, I recommend you use only fixed-length strings, alternation and character classes inside lookbehind.

Finally, JavaScript does not support lookbehind at all.

Even with these limitations, lookbehind is a valuable addition to the regular expression syntax.
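The lookahead and lookbehind examples from this chapter can be checked with Python's re module, as in the sketch below. The variable names are assumptions for the illustration. The final part relies on Python being one of the flavors that only accepts fixed-length lookbehind.

```python
import re

# Negative lookahead: a q that is not followed by a u.
iraq = re.search(r"q(?!u)", "Iraq")
quit_negative = re.search(r"q(?!u)", "quit")

# Positive lookahead checks for the u without adding it to the match.
quit_positive = re.search(r"q(?=u)", "quit")
# The lookahead leaves the position before "u", so «i» cannot match there.
quit_then_i = re.search(r"q(?=u)i", "quit")
# Capturing parentheses inside the lookahead keep what it saw.
captured = re.search(r"q(?=(u\w))", "quit")

# Positive and negative lookbehind.
after_a = re.search(r"(?<=a)b", "thingamabob")
not_after_a = [re.search(r"(?<!a)b", word) for word in ("cab", "bed", "debt")]

# Words not ending in s: lookbehind versus negated character classes.
with_lookbehind = re.search(r"\b\w+(?<!s)\b", "John's").group()
with_negated_class = re.search(r"\b\w+[^s]\b", "John's").group()
double_negation = re.search(r"\b\w*[^s\W]\b", "John's").group()

# A lookbehind whose length cannot be predetermined is rejected.
try:
    re.compile(r"(?<=a+)b")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(iraq.group(), quit_negative, quit_positive.span(), quit_then_i)
print(captured.groups(), after_a.start(), with_lookbehind, with_negated_class)
```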

16. Testing The Same Part of The String for More Than One Requirement

Lookaround, which I introduced in detail in the previous topic, is a very powerful concept. Unfortunately, it is often underused by people new to regular expressions, because lookaround is a bit confusing. The confusing part is that the lookaround is zero-width. So if you have a regex in which a lookahead is followed by another piece of regex, or a lookbehind is preceded by another piece of regex, then the regex will traverse part of the string twice.

To make this clear, I would like to give you another, a bit more practical example. Let's say we want to find a word that is six letters long and contains the three subsequent letters “cat”. Actually, we can match this without lookaround. We just specify all the options and lump them together using alternation: «cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat». Easy enough. But this method gets unwieldy if you want to find any word between 6 and 12 letters long containing either “cat”, “dog” or “mouse”.

Lookaround to The Rescue

In this example, we basically have two requirements for a successful match. First, we want a word that is 6 letters long. Second, the word we found must contain the word “cat”.

Matching a 6-letter word is easy with «\b\w{6}\b». Matching a word containing “cat” is equally easy: «\b\w*cat\w*\b».

Combining the two, we get: «(?=\b\w{6}\b)\b\w*cat\w*\b». Easy! Here's how this works. At each character position in the string where the regex is attempted, the engine will first attempt the regex inside the positive lookahead. This sub-regex, and therefore the lookahead, matches only when the current character position in the string is at the start of a 6-letter word in the string. If not, the lookahead will fail, and the engine will continue trying the regex from the start at the next character position in the string.

The lookahead is zero-width. So when the regex inside the lookahead has found the 6-letter word, the current position in the string is still at the beginning of the 6-letter word. At this position the regex engine will attempt the remainder of the regex. Because we already know that a 6-letter word can be matched at the current position, we know that «\b» matches and that the first «\w*» will match 6 times. The engine will then backtrack, reducing the number of characters matched by «\w*», until «cat» can be matched. If «cat» cannot be matched, the engine has no other choice but to restart at the beginning of the regex, at the next character position in the string. This is at the second letter in the 6-letter word we just found, where the lookahead will fail, causing the engine to advance character by character until the next 6-letter word.

If «cat» can be successfully matched, the second «\w*» will consume the remaining letters, if any, in the 6-letter word. After that, the last «\b» in the regex is guaranteed to match where the second «\b» inside the lookahead matched. Our double-requirement-regex has matched successfully.
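The double-requirement regex can be compared against the alternation-based version with a short Python sketch. The sample sentence is made up, and the word boundaries wrapped around the alternation are an assumption added here so that both patterns are restricted to whole words.

```python
import re

sample = "bobcat catalog cattle locate tomcat concatenate"

# Requirement 1 in the lookahead (6 letters), requirement 2 after it ("cat").
with_lookahead = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")

# The same search spelled out as four alternatives.
with_alternation = re.compile(
    r"\b(?:cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat)\b"
)

lookahead_words = with_lookahead.findall(sample)
alternation_words = with_alternation.findall(sample)

print(lookahead_words)
print(alternation_words)
```

Both patterns skip “catalog” and “concatenate” because those words contain “cat” but are not six letters long.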

Optimizing Our Solution
While the above regex works just fine, it is not the most optimal solution. This is not a problem if you are just doing a search in a text editor. But optimizing things is a good idea if this regex will be used repeatedly and/or on large chunks of data in an application you are developing.

You can discover these optimizations by yourself if you carefully examine the regex and follow how the regex engine applies it, as I did above. I said the third and last «\b» are guaranteed to match. Since they are zero-width, and therefore do not change the result returned by the regex engine, we can remove them, leaving: «(?=\b\w{6}\b)\w*cat\w*». Though the last «\w*» is also guaranteed to match, we cannot remove it because it adds characters to the regex match. Remember that the lookahead discards its match, so it does not contribute to the match returned by the regex engine. If we omitted the «\w*», the resulting match would be the start of a 6-letter word containing “cat”, up to and including “cat”, instead of the entire word.

But we can optimize the first «\w*». As it stands, it will match 6 letters and then backtrack. But we know that in a successful match, there can never be more than 3 letters before “cat”. So we can optimize this to «\w{0,3}». Note that making the asterisk lazy would not have optimized this sufficiently. The lazy asterisk would find a successful match sooner, but if a 6-letter word does not contain “cat”, it would still cause the regex engine to try matching “cat” at the last two letters, at the last single letter, and even at one character beyond the 6-letter word.

So we have «(?=\b\w{6}\b)\w{0,3}cat\w*». One last, minor, optimization involves the first «\b». Since it is zero-width itself, there's no need to put it inside the lookahead. So the final regex is: «\b(?=\w{6}\b)\w{0,3}cat\w*».

A More Complex Problem
So, what would you use to find any word between 6 and 12 letters long containing either “cat”, “dog” or “mouse”? Again we have two requirements, which we can easily combine using a lookahead: «\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*». Very easy, once you get the hang of it. This regex will also put “cat”, “dog” or “mouse” into the first backreference.
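The following Python sketch checks the claims above on a small, made-up word list: that the optimized regex finds the same words as the original one, that dropping the final «\w*» truncates the match, and that the 6-to-12-letter regex stores the inner word in the first group. The variable names and sample string are illustrative assumptions.

```python
import re

UNOPTIMIZED = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")
OPTIMIZED = re.compile(r"\b(?=\w{6}\b)\w{0,3}cat\w*")
# Same as OPTIMIZED but without the trailing \w*, to show why it is needed.
TRUNCATED = re.compile(r"\b(?=\w{6}\b)\w{0,3}cat")
COMPLEX = re.compile(r"\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*")

# Hypothetical sample: words of various lengths, some containing cat/dog/mouse.
SAMPLE = "bobcat catnip concatenate bulldogs mousetrap cat dogma"

unoptimized_hits = UNOPTIMIZED.findall(SAMPLE)
optimized_hits = OPTIMIZED.findall(SAMPLE)
truncated_hits = TRUNCATED.findall(SAMPLE)

# For the complex regex, pair each whole match with the first capturing group.
complex_hits = [(m.group(0), m.group(1)) for m in COMPLEX.finditer(SAMPLE)]

print(optimized_hits)
print(truncated_hits)
print(complex_hits)
```

Note how the truncated pattern returns only “cat” for “catnip”: the lookahead confirmed the 6-letter word but contributed nothing to the match itself.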

17. Finding Matches Only Inside a Section of The String
Lookahead allows you to create regular expression patterns that are impossible to create without it. One example is matching a particular regex only inside specific sections of the string or file searched through.

To keep things simple, I will use «wanted» as a substitute for the regular expression that we are trying to match inside the section, «start» as the regex matching the start of the section, and «stop» as the regex matching the end of the section.

You may be tempted to use a combination of lookbehind and lookahead like in «(?<=start.*?)wanted(?=.*?stop)». However, this will not work. Lookbehind must be of fixed length. Since we do not know in advance how many characters there will be between “start” and “wanted”, we need to use «.*?». The star is obviously not of fixed length. The regex engine will refuse to compile this regular expression. So we have to resort to using lookahead only.

If “wanted” occurs only once inside the section, we can do without lookahead. «start.*?wanted.*?stop» would do the trick. However, this will not work if “wanted” occurs more than once inside a single section. The reason is that this regular expression consumes the entire section. The entire section is included in the regex match. When we apply the regex again to the same string or file, it will continue after „stop”, not after „wanted”.

So we need a way to match „wanted” without matching the rest of the section. This is possible with lookahead, because lookahead is zero-width. The final regular expression will be in the form of «wanted(?=insidesection)». First we match the string we want, and then we test if it is inside the proper section.

How do we know if we matched «wanted» inside a section? First, we must be able to match «stop» after matching «wanted». If not, we found a match after a section rather than inside a section. Second, we must not be able to match «start» between matching «wanted» and matching «stop». If we could, we found a match before a section rather than inside a section. Note that these two rules will only yield success if the string or file searched through is properly divided into sections. That is, each match of «start» must be followed exactly by one match of «stop».

So inside the lookahead we need to look for a series of unspecified characters that do not match the start of a section anywhere in the series, and end with the section stop. In a regex, this is written as: «((?!start).)*?stop». The dot and negative lookahead match any character that is not the first character of the start of a section. This, we repeat zero or more times with the star. After this, we need to match the end of the section. I used a lazy star to make the regex more efficient. Effectively, the lazy star will continue to repeat until the end of the section is reached. Because of the negative lookahead inside the star, the star will also stop at the start of a section, at which point «stop» cannot be matched and thus the regex will fail.

The final regular expression becomes: «wanted(?=((?!start).)*?stop)». Substitute «wanted», «start» and «stop» with the regexes of your choice.

Example: Search and Replace within Header Tags
Using the above generic regular expression, you can easily build a regex to do a search and replace on HTML files, replacing a certain word with another, but only inside title tags. A title tag starts with «<H[1-6]» and ends with «</H[1-6]>». I omitted the closing > in the start tag to allow for attributes. So the regex becomes «wanted(?=((?!\<H[1-6]).)*?</H[1-6]>)».

You may have noticed that I escaped the < of the opening tag in the final regex. I did that because some regex flavors interpret «(?!<» as identical to «(?<!», or negative lookbehind. But lookahead is what we need here. Escaping the < takes care of the problem.
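As a rough illustration of the generic pattern «wanted(?=((?!start).)*?stop)», the Python sketch below builds it from three sub-regexes and uses it for a search and replace. The helper name `inside_section`, the BEGIN/END markers and the sample strings are assumptions made for this example. It also checks that Python's built-in `re` module, like the flavors described above, refuses a lookbehind of variable length.

```python
import re

def inside_section(wanted, start, stop):
    """Return wanted(?=((?!start).)*?stop) as a pattern string.

    Hypothetical helper: each argument is a regex fragment. A fragment that
    uses top-level alternation needs its own grouping before it is passed in.
    """
    return rf"{wanted}(?=((?!{start}).)*?{stop})"

# Generic case with made-up section markers BEGIN and END.
generic_result = re.sub(
    inside_section("wanted", "BEGIN", "END"),
    "FOUND",
    "wanted BEGIN wanted x wanted END wanted",
)

# Header-tag case from the text: replace "cat" with "dog" inside <H1>..<H6> only.
header_result = re.sub(
    inside_section("cat", r"\<H[1-6]", r"</H[1-6]>"),
    "dog",
    "<p>cat</p><H1 class='x'>cat and cat</H1><p>cat</p>",
)

# The tempting lookbehind version does not compile in Python's re module,
# because the lookbehind is not of fixed length.
try:
    re.compile(r"(?<=start.*?)wanted(?=.*?stop)")
    lookbehind_compiles = True
except re.error:
    lookbehind_compiles = False

print(generic_result)
print(header_result)
print(lookbehind_compiles)
```

Only the occurrences between a section start and its matching stop are replaced, including both occurrences inside the same section. Since the dot does not match line breaks by default, this sketch only covers sections that sit on a single line.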

18. Continuing at The End of The Previous Match
The anchor «\G» matches at the position where the previous match ended. During the first match attempt, «\G» matches at the start of the string in the way «\A» does.

Applying «\G\w» to the string “test string” matches „t”. Applying it again matches „e”. The 3rd attempt yields „s” and the 4th attempt matches the second „t” in the string. The fifth attempt fails. During the fifth attempt, the only place in the string where «\G» matches is after the second t. But that position is not followed by a word character, so the match fails.

End of The Previous Match vs Start of The Match Attempt
With some regex flavors or tools, «\G» matches at the start of the match attempt, rather than at the end of the previous match result. This is the case with EditPad Pro, where «\G» matches at the position of the text cursor, rather than the end of the previous match. When a match is found, EditPad Pro will select the match, and move the text cursor to the end of the match. The result is that «\G» matches at the end of the previous match result only when you do not move the text cursor between two searches. All in all, this makes a lot of sense in the context of a text editor.

\G Magic with Perl
In Perl, the position where the last match ended is a “magical” value that is remembered separately for each string variable. The position is not associated with any regular expression. This means that you can use «\G» to make a regex continue in a subject string where another regex left off.

If a match attempt fails, the stored position for «\G» is reset to the start of the string. To avoid this, specify the continuation modifier /c.

All this is very useful to make several regular expressions work together. E.g. you could parse an HTML file in the following fashion:

    while ($string =~ m/</g) {
      if ($string =~ m/\GB>/c) {
        # Bold
      } elsif ($string =~ m/\GI>/c) {
        # Italics
      } else {
        # ...etc...
      }
    }

The regex in the while loop searches for the tag's opening bracket, and the regexes inside the loop check which tag we found. This way you can parse the tags in the file in the order they appear in the file, without having to write a single big regex that matches all tags you are interested in.
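The “test string” walkthrough can be reproduced in Python, with one caveat: Python's built-in `re` module has no «\G» anchor. The sketch below therefore emulates it with the `pos` argument of a compiled pattern's `match()` method, which only succeeds if the match starts exactly at that position. The function name `matches_continuing` is made up for this example.

```python
import re

def matches_continuing(pattern, subject):
    """Apply pattern repeatedly, each time anchored where the last match ended.

    Emulates applying \\G + pattern over and over: match() with a pos argument
    only succeeds if the pattern matches starting exactly at pos.
    """
    compiled = re.compile(pattern)
    pos = 0          # the first attempt starts at the beginning, like \A
    found = []
    while True:
        m = compiled.match(subject, pos)
        if m is None or m.end() == m.start():
            break    # no match at this position (or an empty match): stop
        found.append(m.group())
        pos = m.end()
    return found

word_chars = matches_continuing(r"\w", "test string")
print(word_chars)
```

The loop stops after the second „t”, because the next position holds a space rather than a word character, just like the failing fifth attempt described above.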

\G in Other Programming Languages
This flexibility is not available with most other programming languages. E.g. in Java, the position for «\G» is remembered by the Matcher object. The Matcher is strictly associated with a single regular expression and a single subject string. What you can do though is to add a line of code to make the match attempt of the second Matcher start where the match of the first Matcher ended. «\G» will then match at this position.
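The same hand-off idea can be sketched in Python, where the position is passed along explicitly rather than remembered by the string or by a matcher object. This is an illustration under assumptions, not a translation of any Java code: the pattern names and the `classify_tags` function are invented, and the `pos` argument of `search()` and `match()` stands in for «\G».

```python
import re

OPEN_BRACKET = re.compile(r"<")
BOLD = re.compile(r"B>")      # plays the role of \GB>
ITALIC = re.compile(r"I>")    # plays the role of \GI>

def classify_tags(html):
    """Classify each tag by continuing where the bracket search left off."""
    kinds = []
    pos = 0
    while True:
        opening = OPEN_BRACKET.search(html, pos)
        if opening is None:
            break
        # The extra line of code: start the next attempts where this match ended.
        pos = opening.end()
        if BOLD.match(html, pos):
            kinds.append("bold")
        elif ITALIC.match(html, pos):
            kinds.append("italic")
        else:
            kinds.append("other")
    return kinds

tag_kinds = classify_tags("<B>bold</B> and <I>italic</I>")
print(tag_kinds)
```

Closing tags such as `</B>` fall into the "other" branch here, because the character right after the bracket is a slash.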

19. If-Then-Else Conditionals in Regular Expressions
A special construct «(?ifthen|else)» allows you to create conditional regular expressions. If the if part evaluates to true, then the regex engine will attempt to match the then part. Otherwise, the else part is attempted instead. The syntax consists of a pair of round brackets. The opening bracket must be followed by a question mark, immediately followed by the if part, immediately followed by the then part. This part can be followed by a vertical bar and the else part. You may omit the else part, and the vertical bar with it.

For the if part, you can use the lookahead and lookbehind constructs. Using positive lookahead, the syntax becomes «(?(?=regex)then|else)». Because the lookahead has its own parentheses, the if and then parts are clearly separated.

Remember that the lookaround constructs do not consume any characters. If you use a lookahead as the if part, then the regex engine will attempt to match the then or else part (depending on the outcome of the lookahead) at the same position where the if was attempted.

For the then and else, you can use any regular expression. If you want to use alternation, you will have to group the then or else together using parentheses, like in «(?(?=condition)(then1|then2|then3)|(else1|else2|else3))». Otherwise, there is no need to use parentheses around the then and else parts.
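Not every flavor accepts a lookaround as the if part. Python's built-in `re` module, for instance, does not, so the sketch below emulates «(?(?=condition)then|else)» with the alternation «(?:(?=condition)then|(?!condition)else)», which tries the then part only where the condition holds and the else part only where it does not. The helper `conditional` and the sample data are assumptions for illustration, and the emulation is a swapped-in technique rather than the conditional construct itself.

```python
import re

def conditional(condition, then_part, else_part):
    """Emulate (?(?=condition)then|else) using two lookaheads and alternation.

    Hypothetical helper. As with the real construct, a then or else part that
    uses alternation must be grouped with parentheses by the caller.
    """
    return rf"(?:(?={condition}){then_part}|(?!{condition}){else_part})"

# If the word starts with a digit, it must be exactly four digits;
# otherwise it must consist of lowercase letters only.
TOKEN = re.compile(r"\b" + conditional(r"\d", r"\d{4}", r"[a-z]+") + r"\b")

tokens = [m.group() for m in TOKEN.finditer("year 2024 and 99 bottles")]
print(tokens)
```

The token “99” is skipped: the condition is true there, so only the then part is tried, and it fails. The else part is never attempted at that position, which is exactly the behavior described for the conditional.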

20. Adding Comments to Regular Expressions
If you have worked through the entire tutorial, I guess you will agree that regular expressions can quickly become rather cryptic. Therefore, many modern regex flavors allow you to insert comments into regexes. The syntax is «(?#comment)» where “comment” can be whatever you want, as long as it does not contain a closing round bracket. The regex engine ignores everything after the «(?#» until the first closing round bracket.

E.g. I could clarify the regex to match a valid date by writing it as «(?#year)(19|20)\d\d[- /.](?#month)(0[1-9]|1[012])[- /.](?#day)(0[1-9]|[12][0-9]|3[01])». Now it is instantly obvious that this regex matches a date in yyyy-mm-dd format. Some software, such as RegexBuddy, EditPad Pro and PowerGREP can apply syntax coloring to regular expressions while you write them. That makes the comments really stand out, enabling the right comment in the right spot to make a complex regular expression much easier to understand.
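Python's `re` module is one of the flavors that accepts «(?#comment)». The short sketch below compiles the commented date regex next to the same regex without comments and checks that both behave identically. The separator class «[- /.]» and the test dates are assumptions of this example.

```python
import re

# The commented version: (?#...) groups are ignored by the engine.
COMMENTED_DATE = re.compile(
    r"(?#year)(19|20)\d\d[- /.](?#month)(0[1-9]|1[012])[- /.]"
    r"(?#day)(0[1-9]|[12][0-9]|3[01])"
)
# The same regex with the comments stripped out.
PLAIN_DATE = re.compile(
    r"(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])"
)

valid = COMMENTED_DATE.fullmatch("2024-02-29")
invalid = COMMENTED_DATE.fullmatch("2024-13-01")   # month 13 is rejected

# Comments do not create capturing groups, so the numbering is unchanged.
commented_groups = valid.groups()
plain_groups = PLAIN_DATE.fullmatch("2024-02-29").groups()

print(commented_groups)
print(invalid)
```

Because the comments are not capturing groups, the century, month and day still land in groups 1, 2 and 3.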
