Você está na página 1de 24

Regular Expression Basic Syntax Reference

Characters

Character Description Example

Any character All characters except the listed a matches a


except[\^$.|?*+() special characters match a
single instance of
themselves. { and } are literal
characters, unless they're part
of a valid regular expression
token (e.g. the {n} quantifier).

\ (backslash) A backslash escapes special \+ matches +


followed by any characters to suppress their
of [\^$.|?*+(){} special meaning.

\Q...\E Matches the characters \Q+-*/\E matches+-*/


between \Q and \E literally,
suppressing the meaning of
special characters.

\xFF where FF are 2 Matches the character with the \xA9 matches © when using the Latin-1
hexadecimal digits specified ASCII/ANSI value, code page.
which depends on the code
page used. Can be used in
character classes.

\n, \r and \t Match an LF character, CR \r\n matches a DOS/Windows CRLF line


character and a tab character break.
respectively. Can be used in
character classes.

\a, \e, \f and \v Match a bell character (\x07),  


escape character (\x1B), form
feed (\x0C) and vertical tab
(\x0B) respectively. Can be
used in character classes.

\cA through \cZ Match an ASCII character \cM\cJ matches a DOS/Windows CRLF


Control+A through Control+Z, line break.
equivalent
to \x01 through \x1A. Can be
used in character classes.

Character Classes or Character Sets [abc]

Character Description Example

[ (opening square Starts a character class. A


bracket) character class matches a
single character out of all the
possibilities offered by the
character class. Inside a
character class, different rules
apply. The rules in this section
 
are only valid inside character
classes. The rules outside this
section are not valid in
character classes, except for a
few character escapes that are
indicated with "can be used
inside character classes".

Any character All characters except the listed [abc] matches a, b orc


except^-]\ add that special characters.
character to the
possible matches for
the character class.

\ (backslash) A backslash escapes special [\^\]] matches ^ or ]


followed by any characters to suppress their
of ^-]\ special meaning.
- (hyphen) except Specifies a range of [a-zA-Z0-9] matches any letter or digit
immediately after the characters. (Specifies a
opening [ hyphen if placed immediately
after the opening [)

^ (caret) immediately Negates the character class, [^a-d] matches x (any character except


after the opening [ causing it to match a single a, b, c or d)
character not listed in the
character class. (Specifies a
caret if placed anywhere
except after the opening [)

\d, \w and \s Shorthand character classes [\d\s] matches a character that is a digit


matching digits, word or whitespace
characters (letters, digits, and
underscores), and whitespace
(spaces, tabs, and line
breaks). Can be used inside
and outside character classes.

\D, \W and \S Negated versions of the above. \D matches a character that is not a digit
Should be used only outside
character classes. (Can be
used inside, but that is
confusing.)

[\b] Inside a character class, \b is [\b\t] matches a backspace or tab


a backspace character. character

Dot

Character Description Example

. (dot) Matches any single character . matches x or (almost) any other


except line break characters \r character
and \n. Most regex flavors
have an option to make the dot
match line break characters
too.

Anchors

Character Description Example

^ (caret) Matches at the start of the ^. matches a inabc\ndef. Also


string the regex pattern is matches d in "multi-line" mode.
applied to. Matches a position
rather than a character. Most
regex flavors have an option to
make the caret match after line
breaks (i.e. at the start of a line
in a file) as well.

$ (dollar) Matches at the end of the .$ matches f inabc\ndef. Also


string the regex pattern is matches c in "multi-line" mode.
applied to. Matches a position
rather than a character. Most
regex flavors have an option to
make the dollar match before
line breaks (i.e. at the end of a
line in a file) as well. Also
matches before the very last
line break if the string ends
with a line break.

\A Matches at the start of the \A. matches a in abc


string the regex pattern is
applied to. Matches a position
rather than a character. Never
matches after line breaks.

\Z Matches at the end of the .\Z matches f inabc\ndef


string the regex pattern is
applied to. Matches a position
rather than a character. Never
matches before line breaks,
except for the very last line
break if the string ends with a
line break.

\z Matches at the end of the .\z matches f inabc\ndef


string the regex pattern is
applied to. Matches a position
rather than a character. Never
matches before line breaks.

Word Boundaries

Character Description Example

\b Matches at the position .\b matches c in abc


between a word character
(anything matched by \w) and
a non-word character (anything
matched by [^\w] or \W) as
well as at the start and/or end
of the string if the first and/or
last characters in the string are
word characters.

\B Matches at the position \B.\B matches b in abc


between two word characters
(i.e the position between \w\w)
as well as at the position
between two non-word
characters (i.e. \W\W).

Alternation

Character Description Example

| (pipe) Causes the regex engine to abc|def|xyz matchesabc, def or xyz


match either the part on the left
side, or the part on the right
side. Can be strung together
into a series of options.

| (pipe) The pipe has the lowest abc(def|xyz)matches abcdef orabcxyz


precedence of all operators.
Use grouping to alternate only
part of the regular expression.

Quantifiers

Character Description Example

? (question mark) Makes the preceding item abc? matches ab orabc


optional. Greedy, so the
optional item is included in the
match if possible.

?? Makes the preceding item abc?? matches ab orabc


optional. Lazy, so the optional
item is excluded in the match if
possible. This construct is
often excluded from
documentation because of its
limited use.

* (star) Repeats the previous item zero ".*" matches"def" "ghi" inabc "def"


or more times. Greedy, so as "ghi" jkl
many items as possible will be
matched before trying
permutations with less
matches of the preceding item,
up to the point where the
preceding item is not matched
at all.

*? (lazy star) Repeats the previous item zero ".*?" matches "def"inabc "def"


or more times. Lazy, so the "ghi" jkl
engine first attempts to skip the
previous item, before trying
permutations with ever
increasing matches of the
preceding item.

+ (plus) Repeats the previous item ".+" matches"def" "ghi" inabc "def"


once or more. Greedy, so as "ghi" jkl
many items as possible will be
matched before trying
permutations with less
matches of the preceding item,
up to the point where the
preceding item is matched only
once.

+? (lazy plus) Repeats the previous item ".+?" matches "def"inabc "def"


once or more. Lazy, so the "ghi" jkl
engine first matches the
previous item only once,
before trying permutations with
ever increasing matches of the
preceding item.

{n} where n is an Repeats the previous item a{3} matches aaa


integer >= 1 exactly n times.

{n,m} where n >= 0 Repeats the previous item a{2,4} matches aaaa,aaa or aa


and m >= n between n and m times.
Greedy, so repeating m times
is tried before reducing the
repetition to n times.

{n,m}? where n >= 0 Repeats the previous item a{2,4}? matches aa,aaa or aaaa


and m >= n between n and m times. Lazy,
so repeating n times is tried
before increasing the repetition
to m times.

{n,} where n >= 0 Repeats the previous item at a{2,} matches aaaaain aaaaa


least n times. Greedy, so as
many items as possible will be
matched before trying
permutations with less
matches of the preceding item,
up to the point where the
preceding item is matched only
n times.

{n,}? where n >= 0 Repeats the previous item n or a{2,}? matches aa inaaaaa


more times. Lazy, so the
engine first matches the
previous item n times, before
trying permutations with ever
increasing matches of the
preceding item.

Regular Expression Advanced Syntax Reference

Grouping and Backreferences

Syntax Description Example

(regex) Round brackets group (abc){3}matchesabcabcabc. First group


the regex between them. matches abc.
They capture the text
matched by the regex
inside them that can be
reused in a
backreference, and they
allow you to apply regex
operators to the entire
grouped regex.

(?:regex) Non-capturing (?:abc){3}matchesabcabcabc. No


parentheses group the groups.
regex so you can apply
regex operators, but do
not capture anything and
do not create
backreferences.

\1 through \9 Substituted with the text (abc|def)=\1matchesabc=abc o


matched between the 1st rdef=def, but not abc=def ordef=abc.
through 9th pair of
capturing parentheses.
Some regex flavors allow
more than 9
backreferences.

Modifiers

Syntax Description Example

(?i) Turn on case insensitivity te(?i)stmatches teSTbut not TEST.


for the remainder of the
regular expression.
(Older regex flavors may
turn it on for the entire
regex.)

(?-i) Turn off case insensitivity (?i)te(?-i)stmatches TEstbut


for the remainder of the not TEST.
regular expression.

(?s) Turn on "dot matches  


newline" for the
remainder of the regular
expression. (Older regex
flavors may turn it on for
the entire regex.)

(?-s) Turn off "dot matches  


newline" for the
remainder of the regular
expression.
(?m) Caret and dollar match  
after and before newlines
for the remainder of the
regular expression.
(Older regex flavors may
apply this to the entire
regex.)

(?-m) Caret and dollar only  


match at the start and
end of the string for the
remainder of the regular
expression.

(?x) Turn on free-spacing  


mode to ignore
whitespace between
regex tokens, and allow
# comments.

(?-x) Turn off free-spacing  


mode.

(?i-sm) Turns on the options "i"  


and "m", and turns off "s"
for the remainder of the
regular expression.
(Older regex flavors may
apply this to the entire
regex.)

(?i-sm:regex) Matches the regex inside (?i:te)stmatches TEstbut not TEST.


the span with the options
"i" and "m" turned on,
and "s" turned off.

Atomic Grouping and Possessive Quantifiers


Syntax Description Example

(?>regex) Atomic groups prevent x(?>\w+)x is more efficient than x\w+x if


the regex engine from the second x cannot be matched.
backtracking back into
the group (forcing the
group to discard part of
its match) after a match
has been found for the
group. Backtracking can
occur inside the group
before it has matched
completely, and the
engine can backtrack
past the entire group,
discarding its match
entirely. Eliminating
needless backtracking
provides a speed
increase. Atomic
grouping is often
indispensable when
nesting quantifiers to
prevent a catastrophic
amount of backtracking
as the engine needlessly
tries pointless
permutations of the
nested quantifiers.

?+, *+, ++ and{m,n}+ Possessive quantifiers x++ is identical to (?>x+)


are a limited yet
syntactically cleaner
alternative to atomic
grouping. Only available
in a few regex flavors.
They behave as normal
greedy quantifiers,
except that they will not
give up part of their
match for backtracking.

Lookaround

Syntax Description Example

(?=regex) Zero-width positive t(?=s)matches the second t instreets.


lookahead. Matches at a
position where the
pattern inside the
lookahead can be
matched. Matches only
the position. It does not
consume any characters
or expand the match. In
a pattern likeone(?
=two)three,
both two and three hav
e to match at the position
where the match
of one ends.

(?!regex) Zero-width negative t(?!s)matches the firstt in streets.


lookahead. Identical to
positive lookahead,
except that the overall
match will only succeed if
the regex inside the
lookahead fails to match.

(?<=text) Zero-width positive (?<=s)tmatches the firstt in streets.


lookbehind. Matches at a
position to the left of
which text appears.
Since regular
expressions cannot be
applied backwards, the
test inside the lookbehind
can only be plain text.
Some regex flavors allow
alternation of plain text
options in the
lookbehind.

(?<!text) Zero-width negative (?<!s)tmatches the second t instreets.


lookbehind. Matches at a
position if the text does
not appear to the left of
that position.

Continuing from The Previous Match

Syntax Description Example

\G Matches at the position \G[a-z] first matches a, then


where the previous matches b and then fails to match in ab_cd.
match ended, or the
position where the
current match attempt
started (depending on
the tool or regex flavor).
Matches at the start of
the string during the first
match attempt.

Conditionals

Syntax Description Example

(?(?=regex)then| If the lookahead (?(?<=a)b|c)matches the second b and


else) succeeds, the "then" part the first c inbabxcac
must match for the
overall regex to match. If
the lookahead fails, the
"else" part must match
for the overall regex to
match. Not just positive
lookahead, but all four
lookarounds can be
used. Note that the
lookahead is zero-width,
so the "then" and "else"
parts need to match and
consume the part of the
text matched by the
lookahead as well.

(?(1)then|else) If the first capturing (a)?(?(1)b|c)matches ab, the first c and


group took part in the the second c inbabxcac
match attempt thus far,
the "then" part must
match for the overall
regex to match. If the first
capturing group did not
take part in the match,
the "else" part must
match for the overall
regex to match.

Comments

Syntax Description Example

(?#comment) Everything between (? a(?#foobar)bmatches a


# and ) is ignored by the
regex engine.
printf - a quick look at Perl and Java

In this cheat sheet I'm going to show all the examples using Perl, but I thought at first it
might help to one printf example using both Perl and Java. So, here's a simple Perl printf
example to get us started:

printf("the %s jumped over the %s, %d times", "cow", "moon", 2);

And here are three different ways of using printf format specifier syntax with Java:

System.out.format("the %s jumped over the %s, %d times", "cow",


"moon", 2);
System.err.format("the %s jumped over the %s, %d times", "cow",
"moon", 2);
String result = String.format("the %s jumped over the %s, %d
times", "cow", "moon", 2);

As you can see in that last String.format example, that line of code doesn't print any
output, while the first line prints to standard output, and the second line prints to standard
error.

In the remainder of this document I'm going to use Perl examples, but again, the actual
format specifier strings can be used in many different languages.

A summary of the printf format specifiers


Here's a quick summary of the available print format specifiers:

%c character

%d decimal (integer) number (base 10)

%e exponential floating-point number

%f floating-point number
%i integer (base 10)

%o octal number (base 8)

%s a string of characters

%u unsigned decimal (integer) number

%x number in hexadecimal (base 16)

%% print a percent sign

\% print a percent sign

Controlling printf integer width

The "%3d" specifier means a minimum width of three spaces, which, by default, will be
right-justified. (Note: the alignment is not currently being displayed properly here.)

printf("%3d", 0); 0

printf("%3d", 123456789); 123456789

printf("%3d", -10); -10

printf("%3d", -123456789); -123456789


Left-justifying printf integer output
To left-justify those previous printf examples, just add a minus sign ( -) after the % symbol,
like this:

printf("%-3d", 0); 0

printf("%-3d", 123456789); 123456789

printf("%-3d", -10); -10

printf("%-3d", -123456789); -123456789

The printf zero-fill option


To zero-fill your integer output, just add a zero ( 0) after the % symbol, like this:

printf("%03d", 0); 000

printf("%03d", 1); 001

printf("%03d", 123456789); 123456789

printf("%03d", -10); -10

printf("%03d", -123456789); -123456789

printf - integers with formatting

Here is a collection of examples for integer printing. Several different options are shown,
including a minimum width specification, left-justified, zero-filled, and also a plus sign for
positive numbers.

Description Code Result


At least five wide printf("'%5d'", 10); ' 10'

At least five-wide, left-justified printf("'%-5d'", 10); '10 '

At least five-wide, zero-filled printf("'%05d'", 10); '00010'

At least five-wide, with a plus sign printf("'%+5d'", 10); ' +10'

Five-wide, plus sign, left-justified printf("'%-+5d'", 10); '+10 '

Description Code Result

Print one position after the decimal printf("'%.1f'", 10.3456); '10.3'

Two positions after the decimal printf("'%.2f'", 10.3456); '10.35'

Eight-wide, two positions after the decimal printf("'%8.2f'", 10.3456); ' 10.35'

Eight-wide, four positions after the decimal printf("'%8.4f'", 10.3456); ' 10.3456'

Eight-wide, two positions after the decimal,


printf("'%08.2f'", 10.3456); '00010.35'
zero-filled

Eight-wide, two positions after the decimal,


printf("'%-8.2f'", 10.3456); '10.35 '
left-justified
Printing a much larger number with that printf("'%-8.2f'",
'101234567.35'
same format 101234567.3456);

How to print strings with printf formatting


Here are several printf formatting examples that show how to format string output
with printf format specifiers.

Description Code Result

A simple string printf("'%s'", "Hello"); 'Hello'

A string with a minimum length printf("'%10s'", "Hello"); ' Hello'

Minimum length, left-justified printf("'%-10s'", "Hello"); 'Hello '

Summary of special printf characters


The following character sequences have a special meaning when used as printf format
specifiers:

\a audible alert

\b backspace

\f form feed

\n newline, or linefeed

\r carriage return

\t tab
\v vertical tab

\\ backslash

As you can see from that last example, because the backslash character itself is treated
specially, you have to print two backslash characters in a row to get one backslash
character to appear in your output.

Here are a few examples of how to use this special characters:

Description Code Result

Insert a tab character in a string printf("Hello\tworld"); Hello world

Insert a newline character in a Hello


printf("Hello\nworld");
string world

Typical use of the newline


printf("Hello world\n"); Hello world
character

A DOS/Windows path with printf("C:\\Windows\\System32\\")


C:\Windows\System32\
backslash characters ;
Algorithms: Big-Oh Notation
How time and space grow as the amount of data increases

It's useful to estimate the cpu or memory resources an algorithm requires. This "complexity analysis"
attempts to characterize the relationship between the number of data elements and resource usage (time or
space) with a simple formula approximation. Many programmers have had ugly surprises when they moved
from small test data to large data sets. This analysis will make you aware of potential problems.

Dominant Term

Big-Oh (the "O" stands for "order of") notation is concerned with what happens for very large values of N,
therefore only the largest term in a polynomial is needed. All smaller terms are dropped.

For example, the number of operations in some sorts is N 2 - N. For large values of N, the single N term is
insignificant compared to N2, therefore one of these sorts would be described as an O(N 2) algorithm.

Similarly, constant multipliers are ignored. So a O(4*N) algorithm is equivalent to O(N), which is how it
should be written. Ultimately you want to pay attention to these multipliers in determining the performance,
but for the first round of analysis using Big-Oh, you simply ignore constant factors.

Why Size Matters

Here is a table of typical cases, showing how many "operations" would be performed for various values of N.
Logarithms to base 2 (as used here) are proportional to logarithms in other base, so this doesn't affect the
big-oh formula.

  constant logarithmic linear   quadratic cubic

n O(1) O(log N) O(N) O(N log N) O(N2) O(N3)

1 1 1 1 1 1 1

2 1 1 2 2 4 8

4 1 2 4 8 16 64

8 1 3 8 24 64 512

16 1 4 16 64 256 4,096

1,024 1 10 1,024 10,240 1,048,576 1,073,741,824

1,048,576 1 20 1,048,576 20,971,520 1012 1016


Does anyone really have that much data?
It's quite common. For example, it's hard to find a digital camera that that has fewer than a million pixels (1
mega-pixel). These images are processed and displayed on the screen. The algorithms that do this had
better not be O(N2)! If it took one microsecond (1 millionth of a second) to process each pixel, an O(N 2)
algorithm would take more than a week to finish processing a 1 megapixel image, and more than three
months to process a 3 megapixel image (note the rate of increase is definitely not linear).

Another example is sound. CD audio samples are 16 bits, sampled 44,100 times per second for each of two
channels. A typical 3 minute song consists of about 8 million data points. You had better choose the write
algorithm to process this data.

A dictionary I've used for text analysis has about 125,000 entries. There's a big difference between a linear
O(N), binary O(log N), or hash O(1) search.

Best, worst, and average cases

You should be clear about which cases big-oh notation describes. By default it usually refers to the average
case, using random data. However, the characteristics for best, worst, and average cases can be very
different, and the use of non-random data (often more realistic) data can have a big effect on some
algorithms.

Why big-oh notation isn't always useful

Complexity analysis can be very useful, but there are problems with it too.

 Too hard to analyze. Many algorithms are simply too hard to analyze mathematically.

 Average case unknown. There may not be sufficient information to know what the most
important "average" case really is, therefore analysis is impossible.

 Unknown constant. Both walking and traveling at the speed of light have a time-as-
function-of-distance big-oh complexity of O(N). Altho they have the same big-oh
characteristics, one is rather faster than the other. Big-oh analysis only tells you how it
grows with the size of the problem, not how efficient it is.

 Small data sets. If there are no large amounts of data, algorithm efficiency may not be
important.

Benchmarks are better

Big-oh notation can give very good ideas about performance for large amounts of data, but the only real
way to know for sure is to actually try it with large data sets. There may be performance issues that are not
taken into account by big-oh notation, eg, the effect on paging as virtual memory usage grows. Although
benchmarks are better, they aren't feasible during the design process, so Big-Oh complexity analysis is the
choice.

Typical big-oh values for common algorithms

Searching

Here is a table of typical cases.

Type of Search Big-Oh Comments


Linear search array/ArrayList/LinkedList O(N)  

Binary search sorted array/ArrayList O(log N) Requires sorted data.

Search balanced tree O(log N)  

Search hash table O(1)  

Other Typical Operations

array
Algorithm LinkedList
ArrayList

access front O(1) O(1)

access back O(1) O(1)

access middle O(1) O(N)

insert at front O(N) O(1)

insert at back O(1) O(1)

insert in middle O(N) O(1)

Sorting arrays/ArrayLists

Some sorting algorithms show variability in their Big-Oh performance. It is therefore interesting to look at
their best, worst, and average performance. For this description "average" is applied to uniformly distributed
values. The distribution of real values for any given application may be important in selecting a particular
algorithm.

Type of Sort Best Worst Average Comments

BubbleSort O(N) O(N )2


O(N )2
Not a good sort, except with ideal data.

Selection O(N2) O(N2) O(N2) Perhaps best of O(N2) sorts


sort

QuickSort O(N log N) O(N2) O(N log Good, but it worst case is O(N2)
N)

HeapSort O(N log N) O(N log N) O(N log Typically slower than QuickSort, but worst
N) case is much better.

Example - choosing a non-optimal algorithm


I had to sort a large array of numbers. The values were almost always already in order, and even when they
weren't in order there was typically only one number that was out of order. Only rarely were the values
completely disorganized. I used a bubble sort because it was O(1) for my "average" data. This was many
years ago when CPUs were 1000 times slower. Today I would simply use the library sort for the amount of
data I had because the difference in execution time would probably be unnoticed. However, there are always
data sets which are so large that a choice of algorithms really matters.
Example - O(N3) surprise
I once wrote a text-processing program to solve some particular customer problem. After seeing how well it
processed the test data, the customer produced real data, which I confidently ran the program on. The
program froze -- the problem was that I had inadvertently used an O(N 3) algorithm and there was no way it
was going to finish in my lifetime. Fortunately, my reputation was restored when I was able to rewrite the
offending algorithm within an hour and process the real data in under a minute. Still, it was a sobering
experience, illustrating dangers in ignoring complexity analysis, using unrealistic test data, and giving
customer demos.

Same Big-Oh, but big differences

Altho two algorithms have the same big-oh characteristics, they may differ by a factor of three (or more) in
practical implementations. Remember that big-oh notation ignores constant overhead and constant factors.
These can be substantial and can't be ignored in practical implementations.

Time-space tradeoffs

Sometimes it's possible to reduce execution time by using more space, or reduce space requirements by
using a more time-intensive algorithm.

Você também pode gostar