<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Regular Expressions</title><link rel="stylesheet" href="core.css" type="text/css"/><meta name="generator" content="DocBook XSL Stylesheets V1.74.0"/></head><body><div class="sect1" title="Regular Expressions"><div class="titlepage"><div><div><h1 class="title"><a id="learnjava3-CHP-10-SECT-7"/>Regular Expressions</h1></div></div></div><p>Now it’s time to take a brief detour on our trip through Java and
enter the land of <span class="emphasis"><em>regular expressions</em></span>. A regular
expression, or regex for short, describes a text pattern. Regular
expressions are used with many tools—including the <a id="I_indexterm10_id732046" class="indexterm"/><code class="literal">java.util.regex</code> package,
text editors, and many scripting languages—to provide sophisticated
text-searching and powerful string-manipulation capabilities.</p><p>If you are already familiar with the concept of regular expressions
and how they are used with other languages, you may wish to skim through
this section. At the very least, you’ll need to look at the “The
java.util.regex API” section later in this chapter, which covers the Java
classes necessary to use them. On the other hand, if you’ve come to this
point on your Java journey with a clean slate on this topic and you’re
wondering exactly what regular expressions are, then pop open your
favorite beverage and get ready. You are about to learn about the most
powerful tool in the arsenal of text manipulation and what is, in fact, a
tiny language within a language, all in the span of a few pages.</p><div class="sect2" title="Regex Notation"><div class="titlepage"><div><div><h2 class="title"><a id="learnjava3-CHP-10-SECT-7.1"/>Regex Notation</h2></div></div></div><p><a id="idx10587" class="indexterm"/>A regular expression describes a pattern in text. By
pattern, we mean just about any feature you can imagine identifying in
text from the literal characters alone, without actually understanding
their meaning. This includes features, such as words, word groupings,
lines and paragraphs, punctuation, case, and more generally, strings and
numbers with a specific structure to them, such as phone numbers, email
addresses, and quoted phrases. With regular expressions, you can search
the dictionary for all the words that have the letter “q” without its
pal “u” next to it, or words that start and end with the same letter.
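</p><p>As a rough preview of where this section is heading, here is a minimal sketch of both dictionary searches in Java. It relies on notation and on the <code class="literal">java.util.regex</code> classes that are explained later in this chapter. The class name <code class="literal">DictionaryPreview</code>, its method names, and the sample words are illustrative assumptions, not part of any API:</p><pre class="programlisting">import java.util.regex.Pattern;

public class DictionaryPreview {
    // Naked regex: q([^u]|$)
    // A "q" followed by any character other than "u", or by the end of the input.
    static final Pattern Q_WITHOUT_U = Pattern.compile("q([^u]|$)");

    // Naked regex: (\w)\w*\1
    // A word character, any number of word characters, then the first one again.
    static final Pattern SAME_ENDS = Pattern.compile("(\\w)\\w*\\1");

    // True if the word contains a "q" that is not followed by "u".
    static boolean hasLoneQ(String word) {
        return Q_WITHOUT_U.matcher(word).find();
    }

    // True if the whole word starts and ends with the same character.
    static boolean sameEnds(String word) {
        return SAME_ENDS.matcher(word).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample words standing in for a dictionary
        String[] words = { "quiz", "qat", "Iraq", "level", "regex" };
        for (String word : words) {
            System.out.println(word + ": lone q = " + hasLoneQ(word)
                + ", same ends = " + sameEnds(word));
        }
    }
}</pre><p>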
Once you have constructed a pattern, you can use simple tools to hunt
for it in text or to determine if a given string matches it. A regex can
also be arranged to help you dismember specific parts of the text it
matched, which you could then use as elements of replacement text if you
wish.</p><div class="sect3" title="Write once, run away"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.1"/>Write once, run away</h3></div></div></div><p><a id="I_indexterm10_id732113" class="indexterm"/>Before moving on, we should say a few words about
regular expression syntax in general. At the beginning of this
section, we casually mentioned that we would be discussing a new
language. Regular expressions do, in fact, constitute a simple form of
programming language. If you think for a moment about the examples we
cited earlier, you can see that something like a language is going to
be needed to describe even simple patterns—such as email
addresses—that have some variation in form.</p><p>A computer science textbook would classify regular expressions
at the bottom of the hierarchy of computer languages, in terms of both
what they can describe and what you can do with them. They are still
capable of being quite sophisticated, however. As with most
programming languages, the elements of regular expressions are simple,
but they can be built up in combination to arbitrary complexity. And
that is where things start to get sticky.</p><p>Since regexes work on strings, it is convenient to have a very
compact notation that can be easily wedged between characters. But
compact notation can be very cryptic, and experience shows that it is
much easier to write a complex statement than to read it again later.
Such is the curse of the regular expression. You may find that in a
moment of late-night, caffeine-fueled inspiration, you can write a
single glorious pattern to simplify the rest of your program down to
one line. When you return to read that line the next day, however, it
may look like Egyptian hieroglyphics to you. Simpler is generally
better. If you can break your problem down and do it more clearly in
several steps, maybe you should.</p></div><div class="sect3" title="Escaped characters"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.2"/>Escaped characters</h3></div></div></div><p><a id="idx10516" class="indexterm"/> <a id="idx10519" class="indexterm"/> <a id="idx10557" class="indexterm"/>Now that you’re properly warned, we have to throw one
more thing at you before we build you back up. Not only can the regex
notation get a little hairy, but it is also somewhat ambiguous with
ordinary Java strings. An important part of the notation is the
escaped character, a character with a backslash in front of it. For
example, the escaped <code class="literal">d</code> character,
<a id="I_indexterm10_id732204" class="indexterm"/><code class="literal">\d</code> (backslash ‘d’),
is shorthand that matches any single digit character (0-9). However,
you cannot simply write <code class="literal">\d</code> as part
of a Java string, because Java uses the backslash for its own special
characters and to specify Unicode character sequences (<code class="literal">\uxxxx</code>). Fortunately, Java gives us a
way out: an escaped backslash, written as two backslashes (<code class="literal">\\</code>),
stands for a single literal backslash. The rule is, when you want a backslash to
appear in your regex, you must escape it with an extra one:</p><a id="I_10_tt627"/><pre class="programlisting"> <code class="s">"\\d"</code> <code class="c1">// Java string that yields backslash "d"</code></pre><p><a id="I_indexterm10_id732241" class="indexterm"/> <a id="I_indexterm10_id732247" class="indexterm"/>And just to make things crazier, because regex notation
itself uses backslash to denote special characters, it must provide
the same “escape hatch” as well—allowing you to double up backslashes
if you want a literal backslash. So if you want to specify a regular
expression that includes a single literal backslash, it looks like
this:</p><a id="I_10_tt628"/><pre class="programlisting"> <code class="s">"\\\\"</code> <code class="c1">// Java string yields two backslashes; regex yields one</code></pre><p>Most of the “magic” operator characters you read about in this
section have a special meaning to the regex engine, so they also
must be escaped if you want their literal meaning. This includes such
characters as <a id="I_indexterm10_id732275" class="indexterm"/><a id="I_indexterm10_id732282" class="indexterm"/><code class="literal">.</code>, <a id="I_indexterm10_id732298" class="indexterm"/><a id="I_indexterm10_id732305" class="indexterm"/><code class="literal">*</code>, <a id="I_indexterm10_id732320" class="indexterm"/><a id="I_indexterm10_id732328" class="indexterm"/><code class="literal">+</code>, braces <a id="I_indexterm10_id732344" class="indexterm"/><a id="I_indexterm10_id732354" class="indexterm"/><code class="literal">{}</code>, and parentheses
<a id="I_indexterm10_id732369" class="indexterm"/><a id="I_indexterm10_id732374" class="indexterm"/><code class="literal">()</code>.</p><p>If you need to create part of an expression that has lots of
literal characters in it, you can use the special delimiters
<a id="I_indexterm10_id732388" class="indexterm"/><code class="literal">\Q</code> and <a id="I_indexterm10_id732399" class="indexterm"/><code class="literal">\E</code> to help you. Any
text appearing between <code class="literal">\Q</code> and
<code class="literal">\E</code> is automatically escaped. (You
still need the Java <code class="literal">String</code>
escapes—double backslashes for backslash, but not quadruple.) There is
also a static method <a id="I_indexterm10_id732428" class="indexterm"/><code class="literal">Pattern.quote()</code>,
which does the same thing, returning a properly escaped version of
whatever string you give
it.</p><p>Beyond that, my only suggestion to help maintain your sanity
when working with these examples is to keep two copies—a comment line
showing the naked regular expression and the real Java string, where
you must double up all backslashes.<a id="I_indexterm10_id732451" class="indexterm"/><a id="I_indexterm10_id732458" class="indexterm"/><a id="I_indexterm10_id732465" class="indexterm"/></p></div><div class="sect3" title="Characters and character classes"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.3"/>Characters and character classes</h3></div></div></div><p><a id="idx10515" class="indexterm"/> <a id="idx10556" class="indexterm"/>Now, let’s dive into the actual regex syntax. The
simplest form of a regular expression is plain, literal text, which
has no special meaning and is matched directly (character for
character) in the input. This can be a single character or more. For
example, in the following string, the pattern “s” can match the
character <code class="literal">s</code> in the words <code class="literal">rose</code> and <code class="literal">is</code>:</p><a id="I_10_tt629"/><pre class="programlisting"> <code class="s">"A rose is $1.99."</code></pre><p>The pattern “rose” can match only the literal word <code class="literal">rose</code>. But this isn’t very interesting. Let’s
crank things up a notch by introducing some special characters and the
notion of character “classes.”</p><div class="variablelist"><dl><dt><span class="term"><span class="emphasis"><em>Any character: dot</em></span> (<code class="literal">.</code>)</span></dt><dd><p>The special character dot (<code class="literal">.</code>) matches any single character. The
pattern “.ose” matches rose, nose, _ose (space followed by ose)
or any other character followed by the sequence ose. Two dots
match any two characters, and so on. The dot operator is not
discriminating; it normally stops only for an end-of-line
character (and, optionally, you can tell it not to; we discuss
that later).</p><p>We can consider “.” to represent the group or class of all
characters. And regexes define more interesting character
classes as well.</p></dd><dt><span class="term"><span class="emphasis"><em>Whitespace or nonwhitespace character:</em></span>
<a id="I_indexterm10_id732588" class="indexterm"/> <code class="literal">\s</code>, <code class="literal">\S</code></span></dt><dd><p>The special character <code class="literal">\s</code> matches a literal-space character
or one of the following characters: <a id="I_indexterm10_id732613" class="indexterm"/><code class="literal">\t</code> (tab),
<a id="I_indexterm10_id732624" class="indexterm"/><code class="literal">\r</code> (carriage
return), <a id="I_indexterm10_id732635" class="indexterm"/><code class="literal">\n</code> (newline),
<a id="I_indexterm10_id732646" class="indexterm"/><code class="literal">\f</code> (formfeed),
and vertical tab (<code class="literal">\x0B</code>). The corresponding special character <a id="I_indexterm10_id732658" class="indexterm"/><code class="literal">\S</code> does the
inverse, matching any character except whitespace.</p></dd><dt><span class="term"><span class="emphasis"><em>Digit or nondigit character</em></span>: <code class="literal">\d</code>, <code class="literal">\D</code></span></dt><dd><p><a id="I_indexterm10_id732686" class="indexterm"/> <code class="literal">\d</code> matches any
of the digits 0-9. <a id="I_indexterm10_id732699" class="indexterm"/><code class="literal">\D</code> does the
inverse, matching all characters except digits.</p></dd><dt><span class="term"><span class="emphasis"><em>Word or nonword character</em></span>: <code class="literal">\w</code>, <code class="literal">\W</code></span></dt><dd><p><a id="I_indexterm10_id732727" class="indexterm"/> <code class="literal">\w</code> matches a
“word” character, including upper- and lowercase letters A-Z,
a-z, the digits 0-9, and the underscore character (_).
<a id="I_indexterm10_id732742" class="indexterm"/><code class="literal">\W</code> matches
everything except those characters.<a id="I_indexterm10_id732753" class="indexterm"/><a id="I_indexterm10_id732760" class="indexterm"/></p></dd></dl></div></div><div class="sect3" title="Custom character classes"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.4"/>Custom character classes</h3></div></div></div><p><a id="I_indexterm10_id732774" class="indexterm"/> <a id="I_indexterm10_id732786" class="indexterm"/> <a id="idx10518" class="indexterm"/> <a id="I_indexterm10_id732807" class="indexterm"/>You can define your own character classes using the
notation [...]. For example, the following class matches any of the
characters a, b, c, x, y, or z:</p><a id="I_10_tt630"/><pre class="programlisting"> <code class="o">[</code><code class="n">abcxyz</code><code class="o">]</code></pre><p>The special x-y range notation can be used as shorthand for the
alphabetic characters. The following example defines a character class
containing all upper- and lowercase letters:</p><a id="I_10_tt631"/><pre class="programlisting"> <code class="o">[</code><code class="n">A</code><code class="o">-</code><code class="n">Za</code><code class="o">-</code><code class="n">z</code><code class="o">]</code></pre><p><a id="I_indexterm10_id732841" class="indexterm"/> <a id="I_indexterm10_id732852" class="indexterm"/>Placing a caret (^) as the first character inside the
brackets inverts the character class. This example matches any
character except uppercase A-F:</p><a id="I_10_tt632"/><pre class="programlisting"> <code class="o">[^</code><code class="n">A</code><code class="o">-</code><code class="n">F</code><code class="o">]</code> <code class="c1">// G, H, I, ..., a, b, c, ... etc.</code></pre><p><a id="I_indexterm10_id732874" class="indexterm"/>Nesting character classes simply adds them:</p><a id="I_10_tt633"/><pre class="programlisting"> <code class="o">[</code><code class="n">A</code><code class="o">-</code><code class="n">F</code><code class="o">[</code><code class="n">G</code><code class="o">-</code><code class="n">Z</code><code class="o">]]</code> <code class="c1">// A-Z</code></pre><p><a id="I_indexterm10_id732891" class="indexterm"/>The && logical AND notation can be used to take
the intersection (characters in
common):<a id="I_indexterm10_id732904" class="indexterm"/></p><a id="I_10_tt634"/><pre class="programlisting"> <code class="o">[</code><code class="n">a</code><code class="o">-</code><code class="n">p</code><code class="o">&&[</code><code class="n">l</code><code class="o">-</code><code class="n">z</code><code class="o">]]</code> <code class="c1">// l, m, n, o, p</code>
<code class="o">[</code><code class="n">A</code><code class="o">-</code><code class="n">Z</code><code class="o">&&[^</code><code class="n">P</code><code class="o">]]</code> <code class="c1">// A through Z except P</code></pre></div><div class="sect3" title="Position markers"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.5"/>Position markers</h3></div></div></div><p><a id="idx10537" class="indexterm"/> <a id="idx10561" class="indexterm"/>The pattern “[Aa] rose” (including an upper- or
lowercase A) matches three times in the following phrase:</p><a id="I_10_tt635"/><pre class="programlisting"> <code class="s">"A rose is a rose is a rose"</code></pre><p>Position characters allow you to designate the relative location
of a match. The most important are <code class="literal">^</code> and <a id="I_indexterm10_id732972" class="indexterm"/><a id="I_indexterm10_id732977" class="indexterm"/><code class="literal">$</code>, which match the
beginning and end of a line, respectively:</p><a id="I_10_tt636"/><pre class="programlisting"> <code class="o">^[</code><code class="n">Aa</code><code class="o">]</code> <code class="n">rose</code> <code class="c1">// matches "A rose" at the beginning of line</code>
<code class="o">[</code><code class="n">Aa</code><code class="o">]</code> <code class="n">rose$</code> <code class="c1">// matches "a rose" at end of line</code></pre><p>By default, <code class="literal">^</code> and <code class="literal">$</code> match the beginning and end of “input,”
which is often a line. If you are working with multiple lines of text
and wish to match the beginnings and endings of lines within a single
large string, you can turn on “multiline” mode as described later in
this chapter.</p><p>The position markers <a id="I_indexterm10_id733018" class="indexterm"/><code class="literal">\b</code> and <a id="I_indexterm10_id733028" class="indexterm"/><code class="literal">\B</code> match a word
boundary or nonword boundary, respectively. For example, the following
pattern matches rose and rosemary, but not primrose:<a id="I_indexterm10_id733040" class="indexterm"/><a id="I_indexterm10_id733047" class="indexterm"/></p><a id="I_10_tt637"/><pre class="programlisting"> <code class="err">\</code><code class="n">brose</code></pre></div><div class="sect3" title="Iteration (multiplicity)"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.6"/>Iteration (multiplicity)</h3></div></div></div><p><a id="idx10527" class="indexterm"/> <a id="idx10533" class="indexterm"/> <a id="idx10559" class="indexterm"/>Simply matching fixed character patterns would not get
us very far. Next, we look at operators that count the number of
occurrences of a character (or more generally, of a pattern, as we’ll
see in <a class="xref" href="ch10s07.html#learnjava3-CHP-10-SECT-7.1.8" title="Capture groups">Capture groups</a>):</p><div class="variablelist"><dl><dt><span class="term"><span class="emphasis"><em>Any (zero or more iterations): asterisk</em></span>
(<code class="literal">*</code>)</span></dt><dd><p><a id="I_indexterm10_id733129" class="indexterm"/> <a id="I_indexterm10_id733137" class="indexterm"/>Placing an asterisk (*) after a character or
character class means “allow any number of that type of
character”—in other words, zero or more. For example, the
following pattern matches a digit with any number of leading
zeros (possibly none):</p><a id="I_10_tt638"/><pre class="programlisting"> <code class="mi">0</code><code class="o">*</code><code class="err">\</code><code class="n">d</code> <code class="c1">// match a digit with any number of leading zeros</code></pre></dd><dt><span class="term"><span class="emphasis"><em>Some (one or more iterations): plus
sign</em></span> (<code class="literal">+</code>)</span></dt><dd><p><a id="I_indexterm10_id733176" class="indexterm"/> <a id="I_indexterm10_id733185" class="indexterm"/>The plus sign (+) means “one or more” iterations
and is equivalent to XX* (pattern followed by pattern asterisk).
For example, the following pattern matches a number with one or
more digits, plus optional leading zeros:</p><a id="I_10_tt639"/><pre class="programlisting"> <code class="mi">0</code><code class="o">*</code><code class="err">\</code><code class="n">d</code><code class="o">+</code> <code class="c1">// match a number (one or more digits) with optional leading </code>
<code class="c1">// zeros</code></pre><p>It may seem redundant to match the zeros at the beginning
of an expression because zero is a digit and is thus matched by
the <code class="literal">\d+</code> portion of the
expression anyway. However, we’ll show later how you can pick
apart the string using a regex and get at just the pieces you
want. In this case, you might want to strip off the leading
zeros and keep only the digits.</p></dd><dt><span class="term"><span class="emphasis"><em>Optional (zero or one iteration): question
mark</em></span> (<code class="literal">?</code>)</span></dt><dd><p><a id="I_indexterm10_id733237" class="indexterm"/> <a id="I_indexterm10_id733246" class="indexterm"/>The question mark operator (?) allows exactly zero
or one iteration. For example, the following pattern matches a
credit-card expiration date, which may or may not have a slash
in the middle:</p><a id="I_10_tt640"/><pre class="programlisting"> <code class="err">\</code><code class="n">d</code><code class="err">\</code><code class="n">d</code><code class="o">/?</code><code class="err">\</code><code class="n">d</code><code class="err">\</code><code class="n">d</code> <code class="c1">// match four digits with an optional slash in the middle</code></pre></dd><dt><span class="term"><span class="emphasis"><em>Range (between x and y iterations,
inclusive</em></span>): <code class="literal">{x,y}</code></span></dt><dd><p>The <code class="literal">{x,y}</code> curly-brace
range operator is the most general iteration operator. It
specifies a precise range to match. A range takes two arguments:
a lower bound and an upper bound, separated by a comma. This
regex matches any word with five to seven characters,
inclusive:</p><a id="I_10_tt641"/><pre class="programlisting"> <code class="err">\</code><code class="n">b</code><code class="err">\</code><code class="n">w</code><code class="o">{</code><code class="mi">5</code><code class="o">,</code><code class="mi">7</code><code class="o">}</code><code class="err">\</code><code class="n">b</code> <code class="c1">// match words with at least 5 and at most 7 characters</code></pre></dd><dt><span class="term"><span class="emphasis"><em>At least x or more iterations (y is
infinite</em></span>): <code class="literal">{x,}</code></span></dt><dd><p>If you omit the upper bound, simply leaving a dangling
comma in the range, the upper bound becomes infinite. This is a
way to specify a minimum of occurrences with no
maximum.<a id="I_indexterm10_id733314" class="indexterm"/><a id="I_indexterm10_id733321" class="indexterm"/><a id="I_indexterm10_id733328" class="indexterm"/></p></dd></dl></div></div><div class="sect3" title="Grouping"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.7"/>Grouping</h3></div></div></div><p><a id="I_indexterm10_id733342" class="indexterm"/> <a id="idx10524" class="indexterm"/> <a id="I_indexterm10_id733358" class="indexterm"/> <a id="I_indexterm10_id733365" class="indexterm"/>Just as in logical or mathematical operations,
parentheses can be used in regular expressions to make subexpressions
or to put boundaries on parts of expressions. This power lets us
extend the operators we’ve talked about to work not only on
characters, but also on words or other regular expressions. For
example:</p><a id="I_10_tt642"/><pre class="programlisting"> <code class="o">(</code><code class="n">yada</code><code class="o">)+</code></pre><p>Here we are applying the + (one or more) operator to the whole
pattern <code class="literal">yada</code>, not just one
character. It matches yada, yadayada, yadayadayada, and so on.</p><p>Using grouping, we can start building more complex expressions.
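</p><p>To see what the parentheses actually change, the following minimal sketch compares the grouped pattern with its ungrouped form. The class name <code class="literal">GroupingDemo</code> and the sample inputs are illustrative assumptions; <code class="literal">Pattern.compile()</code> and <code class="literal">Matcher.matches()</code> are the standard <code class="literal">java.util.regex</code> calls covered later in this chapter:</p><pre class="programlisting">import java.util.regex.Pattern;

public class GroupingDemo {
    // Naked regex: (yada)+
    // One or more repetitions of the whole word "yada".
    static final Pattern GROUPED = Pattern.compile("(yada)+");

    // Naked regex: yada+
    // "yad" followed by one or more "a" characters.
    static final Pattern UNGROUPED = Pattern.compile("yada+");

    public static void main(String[] args) {
        // Hypothetical sample inputs
        String[] inputs = { "yada", "yadayada", "yadaaa" };
        for (String input : inputs) {
            System.out.println(input
                + ": (yada)+ = " + GROUPED.matcher(input).matches()
                + ", yada+ = " + UNGROUPED.matcher(input).matches());
        }
    }
}</pre><p>Only the grouped form accepts yadayada, and only the ungrouped form accepts yadaaa, because without parentheses the + applies to the final a alone.</p><p>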
For example, while many email addresses have a three-part structure
(e.g., <span class="emphasis"><em>foo@bar.com</em></span>), the domain name portion can,
in actuality, contain an arbitrary number of dot-separated components.
To handle this properly, we can use an expression like this
one:</p><a id="I_10_tt643"/><pre class="programlisting"> <code class="err">\</code><code class="n">w</code><code class="o">+</code><code class="err">@\</code><code class="n">w</code><code class="o">+(</code><code class="err">\</code><code class="o">.</code><code class="err">\</code><code class="n">w</code><code class="o">+)+</code> <code class="c1">// Match an email address</code></pre><p>This expression matches a word, followed by an <code class="literal">@</code> symbol, followed by another word and then
one or more literal dot-separated words—e.g.,
<span class="email"><a class="email" href="mailto:pat@pat.net">pat@pat.net</a></span>, <span class="email"><a class="email" href="mailto:friend@foo.bar.com">friend@foo.bar.com</a></span>, or
<span class="email"><a class="email" href="mailto:mate@foo.bar.co.uk">mate@foo.bar.co.uk</a></span>.<a id="I_indexterm10_id733436" class="indexterm"/></p></div><div class="sect3" title="Capture groups"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.8"/>Capture groups</h3></div></div></div><p><a id="idx10514" class="indexterm"/> <a id="idx10555" class="indexterm"/>In addition to basic grouping of operations, parentheses
have an important, additional role: the text matched by each
parenthesized subexpression can be separately retrieved. That is, you
can isolate the text that matched each subexpression. There is then a
special syntax for referring to each capture group within the regular
expression by number. This important feature has two uses.</p><p>First, you can construct a regular expression that refers to the
text it has already matched and uses this text as a parameter for
further matching. This allows you to express some very powerful
things. For example, we can show the dictionary example we mentioned
in the introduction. Let’s find all the words that start and end with
the same letter:</p><a id="I_10_tt644"/><pre class="programlisting"> <code class="err">\</code><code class="n">b</code><code class="o">(</code><code class="err">\</code><code class="n">w</code><code class="o">)</code><code class="err">\</code><code class="n">w</code><code class="o">*</code><code class="err">\</code><code class="mi">1</code><code class="err">\</code><code class="n">b</code> <code class="c1">// match words beginning and ending with the same letter</code></pre><p>See the <code class="literal">\1</code> in this expression?
It’s a reference to the first capture group in the expression,
<code class="literal">(\w)</code>. References to capture groups
take the form <code class="literal">\</code><em class="replaceable"><code>n</code></em> where
<em class="replaceable"><code>n</code></em> is the number of the capture group,
counting from left to right. In this example, the first capture group
matches a word character on a word boundary. Then we allow any number
of word characters up to the special reference <code class="literal">\1</code> (also followed by a word boundary). The
<code class="literal">\1</code> means “the value matched in
capture group one.” Because these characters must be the same, this
regex matches words that start and end with the same character.</p><p>The second use of capture groups is in referring to the matched
portions of text while constructing replacement text. We’ll show you
how to do that a bit later when we talk about the Regular Expression
API.</p><p>Capture groups can contain more than one character, of course,
and you can have any number of groups. You can even nest capture
groups. Next, we discuss exactly how they are numbered.<a id="I_indexterm10_id733548" class="indexterm"/><a id="I_indexterm10_id733555" class="indexterm"/></p></div><div class="sect3" title="Numbering"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.9"/>Numbering</h3></div></div></div><p><a id="I_indexterm10_id733569" class="indexterm"/> <a id="I_indexterm10_id733575" class="indexterm"/>Capture groups are numbered, starting at 1, and moving
from left to right, by counting the number of open parentheses it
takes to reach them. The special group number 0 always refers to the
entire expression match. For example, consider the following
expression:</p><a id="I_10_tt645"/><pre class="programlisting"> <code class="n">one</code> <code class="o">((</code><code class="n">two</code><code class="o">)</code> <code class="o">(</code><code class="n">three</code> <code class="o">(</code><code class="n">four</code><code class="o">)))</code></pre><p>Matched against the text “one two three four,” this expression yields the following groups:</p><a id="I_10_tt646"/><pre class="programlisting"> <code class="n">Group</code> <code class="mi">0</code><code class="o">:</code> <code class="n">one</code> <code class="n">two</code> <code class="n">three</code> <code class="n">four</code>
<code class="n">Group</code> <code class="mi">1</code><code class="o">:</code> <code class="n">two</code> <code class="n">three</code> <code class="n">four</code>
<code class="n">Group</code> <code class="mi">2</code><code class="o">:</code> <code class="n">two</code>
<code class="n">Group</code> <code class="mi">3</code><code class="o">:</code> <code class="n">three</code> <code class="n">four</code>
<code class="n">Group</code> <code class="mi">4</code><code class="o">:</code> <code class="n">four</code></pre><p>Before going on, we should note one more thing. So far in this
section we’ve glossed over the fact that parentheses are doing double
duty: creating logical groupings for operations and defining capture
groups. What if the two roles conflict? Suppose we have a complex
regex that uses parentheses to group subexpressions and to create
capture groups? In that case, you can use a special noncapturing group
operator <code class="literal">(?:)</code> to do logical
grouping instead of using parentheses. You probably won’t need to do
this often, but it’s good to know.</p></div><div class="sect3" title="Alternation"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.10"/>Alternation</h3></div></div></div><p><a id="I_indexterm10_id733632" class="indexterm"/> <a id="idx10513" class="indexterm"/> <a id="I_indexterm10_id733648" class="indexterm"/> <a id="I_indexterm10_id733659" class="indexterm"/>The vertical bar (|) operator denotes the logical OR
operation, also called alternation or choice. The | operator does not
operate on individual characters but instead applies to everything on
either side of it. It splits the expression in two unless constrained
by parentheses grouping. For example, a slightly naive approach to
parsing dates might be the following:</p><a id="I_10_tt647"/><pre class="programlisting"> <code class="err">\</code><code class="n">w</code><code class="o">+,</code> <code class="err">\</code><code class="n">w</code><code class="o">+</code> <code class="err">\</code><code class="n">d</code><code class="o">+</code> <code class="err">\</code><code class="n">d</code><code class="o">+|</code><code class="err">\</code><code class="n">d</code><code class="err">\</code><code class="n">d</code><code class="o">/</code><code class="err">\</code><code class="n">d</code><code class="err">\</code><code class="n">d</code><code class="o">/</code><code class="err">\</code><code class="n">d</code><code class="err">\</code><code class="n">d</code> <code class="c1">// pattern 1 or pattern 2</code></pre><p>In this expression, the left matches patterns such as Fri, Oct
12 2001 (with no comma after the day, since the pattern does not allow one), and the right matches dates such as 10/12/01.</p><p>The following regex might be used to match email addresses with
one of three domains (<span class="emphasis"><em>net</em></span>,
<span class="emphasis"><em>edu</em></span>, and <span class="emphasis"><em>gov</em></span>):<a id="I_indexterm10_id733694" class="indexterm"/></p><a id="I_10_tt648"/><pre class="programlisting"> <code class="err">\</code><code class="n">w</code><code class="o">+</code><code class="err">@</code><code class="o">[</code><code class="err">\</code><code class="n">w</code><code class="err">\</code><code class="o">.]*</code><code class="err">\</code><code class="o">.(</code><code class="n">net</code><code class="o">|</code><code class="n">edu</code><code class="o">|</code><code class="n">gov</code><code class="o">)</code> <code class="c1">// email address ending in .net, .edu, or .gov</code></pre></div><div class="sect3" title="Special options"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.11"/>Special options</h3></div></div></div><p><a id="I_indexterm10_id733717" class="indexterm"/>There are several special options that affect the way
the regex engine performs its matching. These options can be applied
in two ways:</p><div class="itemizedlist"><ul class="itemizedlist"><li class="listitem"><p>You can pass in one or more flags during the <code class="literal">Pattern.compile()</code> step (discussed later
in this chapter).</p></li><li class="listitem"><p>You can include a special block of code in your
regex.</p></li></ul></div><p>We’ll show the latter approach here. To do this, include one or
more flags in a special block <code class="literal">(?</code><em class="replaceable"><code>x</code></em><code class="literal">)</code>, where <em class="replaceable"><code>x</code></em> is the
flag for the option we want to turn on. Generally, you do this at the
beginning of the regex. You can also turn off flags by adding a minus
sign <code class="literal">(?-</code><em class="replaceable"><code>x</code></em><code class="literal">)</code>, which allows you to apply flags to select
parts of your pattern.</p><p>The following flags are available:</p><div class="variablelist"><dl><dt><span class="term"><span class="emphasis"><em>Case-insensitive</em></span>: <code class="literal">(?i)</code></span></dt><dd><p>The <a id="I_indexterm10_id733792" class="indexterm"/><code class="literal">(?i)</code> flag tells
the regex engine to ignore case while matching, for
example:</p><a id="I_10_tt649"/><pre class="programlisting"> <code class="o">(?</code><code class="n">i</code><code class="o">)</code><code class="n">yahoo</code> <code class="c1">// match Yahoo, yahoo, yahOO, etc.</code></pre></dd><dt><span class="term"><span class="emphasis"><em>Dot all</em></span>: <code class="literal">(?s)</code></span></dt><dd><p>The <a id="I_indexterm10_id733825" class="indexterm"/><code class="literal">(?s)</code> flag turns
on “dot all” mode, allowing the dot character to match anything,
including end-of-line characters. It is useful if you are
matching patterns that span multiple lines. The <code class="literal">s</code> stands for “single-line mode,” a
somewhat confusing name derived from Perl.</p></dd><dt><span class="term"><span class="emphasis"><em>Multiline</em></span>: <code class="literal">(?m)</code></span></dt><dd><p>By default, <code class="literal">^</code> and
<code class="literal">$</code> don’t really match the
beginning and end of lines (as defined by carriage return or
newline combinations); they instead match the beginning or end
of the entire input text. Turning on multiline mode with
<a id="I_indexterm10_id733874" class="indexterm"/><code class="literal">(?m)</code> causes
them to match the beginning and end of every line as well as the
beginning and end of input. Specifically, this means the spot
before the first character, the spot after the last character,
and the spots just after and before line terminators inside the
string.</p></dd><dt><span class="term"><span class="emphasis"><em>Unix lines</em></span>: <code class="literal">(?d)</code></span></dt><dd><p>The <a id="I_indexterm10_id733900" class="indexterm"/><code class="literal">(?d)</code> flag limits
the definition of the line terminator for the <code class="literal">^</code>, <code class="literal">$</code>, and <code class="literal">.</code> special characters to Unix-style
newline only (<code class="literal">\n</code>). By
default, other line terminators such as carriage return newline (<code class="literal">\r\n</code>) are also recognized.</p></dd></dl></div></div><div class="sect3" title="Greediness"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.12"/>Greediness</h3></div></div></div><p><a id="idx10523" class="indexterm"/> <a id="idx10558" class="indexterm"/>We’ve seen hints that regular expressions are capable of
sorting out some complex patterns. But there are cases where what should
be matched is ambiguous (at least to us, though not to the regex
engine). Probably the most important example has to do with the number
of characters the iteration operators consume before stopping. The
<a id="I_indexterm10_id733976" class="indexterm"/><a id="I_indexterm10_id733984" class="indexterm"/><code class="literal">.*</code> operation best
illustrates this. Consider the following string:</p><a id="I_10_tt650"/><pre class="programlisting"> <code class="s">"Now is the time for <bold>action</bold>, not words."</code></pre><p>Suppose we want to search for all the HTML-style tags (the parts
between the < and > characters), perhaps because we want to
remove them.</p><p>We might naively start with this regex:</p><a id="I_10_tt651"/><pre class="programlisting"> <code class="o"></?.*></code> <code class="c1">// match <, optional /, and then anything up to ></code></pre><p>We then get the following match, which is much too long:</p><a id="I_10_tt652"/><pre class="programlisting"> <code class="o"><</code><code class="n">bold</code><code class="o">></code><code class="n">action</code><code class="o"></</code><code class="n">bold</code><code class="o">></code></pre><p>The problem is that the <code class="literal">.*</code>
operation, like all the iteration operators, is by default “greedy,”
meaning that it consumes absolutely everything it can, up until the
last match for the terminating character (in this case, >) in the
file or line.</p><p>There are solutions for this problem. The first is to “say what
it is”—that is, to be specific about what is allowed between the
angle brackets. The content of an HTML tag cannot actually include
<span class="emphasis"><em>anything</em></span>; for example, it cannot include a
closing bracket (>). So we could rewrite our expression as:</p><a id="I_10_tt653"/><pre class="programlisting"> <code class="o"></?</code><code class="err">\</code><code class="n">w</code><code class="o">*></code> <code class="c1">// match <, optional /, any number of word characters, then ></code></pre><p>But suppose the content is not so easy to describe. For example,
we might be looking for quoted strings in text, which could include
just about any text. In that case, we can use a second approach and
“say what it is not.” We can invert our logic from the previous
example and specify that anything <span class="emphasis"><em>except</em></span> a
closing bracket is allowed inside the brackets:</p><a id="I_10_tt654"/><pre class="programlisting"> <code class="o"></?[^>]*></code></pre><p>This is probably the most efficient way to tell the regex engine
what to do. It then knows exactly what to look for to stop reading.
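</p><p>As a rough sketch of this idea in Java, the following class applies a greedy pattern and a negated character class to the quoted-string case, using the standard <code class="literal">Pattern</code> and <code class="literal">Matcher</code> classes of <code class="literal">java.util.regex</code>. The class name <code class="literal">QuoteDemo</code>, the <code class="literal">firstMatch()</code> helper, and the sample text are hypothetical, chosen only for illustration:</p><pre class="programlisting">import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: QuoteDemo and firstMatch() are made-up names
class QuoteDemo {
    // Return the first substring of text matched by regex, or null if none
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "She said \"yes\" and then \"no\" again.";
        // Greedy: runs from the first quote to the last quote in the text
        System.out.println(firstMatch("\".*\"", text));     // "yes" and then "no"
        // Negated class: stops at the first closing quote
        System.out.println(firstMatch("\"[^\"]*\"", text)); // "yes"
    }
}
</pre><p>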
This approach has limitations, however. It is not obvious how to do
this if the delimiter is more complex than a single character. It is
also not very elegant.</p><p>Finally, we come to our general solution: the use of “reluctant”
operators. For each of the iteration operators, there is an
alternative, nongreedy form that consumes as few characters as
possible, while still trying to get a match with what comes after it.
This is exactly what we needed in our previous example.</p><p>Reluctant operators take the form of the standard operator with
a “?” appended. (Yes, we know that’s confusing.) We can now write our
regex as:</p><a id="I_10_tt655"/><pre class="programlisting"> <code class="o"></?.*?></code> <code class="c1">// match <, optional /, minimum number of any chars, then ></code></pre><p>We have appended <a id="I_indexterm10_id734118" class="indexterm"/><a id="I_indexterm10_id734126" class="indexterm"/><code class="literal">?</code> to <code class="literal">.*</code> to cause <code class="literal">.*</code> to match as few characters as possible
while still making the final match of >. The same technique
(appending the <code class="literal">?</code>) works with all
the iteration operators, as in the two following examples:<a id="I_indexterm10_id734160" class="indexterm"/><a id="I_indexterm10_id734167" class="indexterm"/></p><a id="I_10_tt656"/><pre class="programlisting"> <code class="o">.+?</code> <code class="c1">// one or more, nongreedy</code>
<code class="o">.{</code><code class="n">x</code><code class="o">,</code><code class="n">y</code><code class="o">}?</code> <code class="c1">// between x and y, nongreedy</code></pre></div><div class="sect3" title="Lookaheads and lookbehinds"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.13"/>Lookaheads and lookbehinds</h3></div></div></div><p><a id="idx10530" class="indexterm"/> <a id="idx10560" class="indexterm"/>In order to understand our next topic, let’s return for
a moment to the position-marking characters (<code class="literal">^</code>, <code class="literal">$</code>,
<code class="literal">\b</code>, and <code class="literal">\B</code>) that we discussed earlier. Think about
what exactly these special markers do for us. We say, for example,
that the <code class="literal">\b</code> marker matches a word
boundary. But the word “match” here may be a bit too strong. In
reality, it “requires” a word boundary to appear at the specified
point in the regex. Suppose we didn’t have <code class="literal">\b</code>; how could we construct it? Well, we
could try constructing a regex that matches the word boundary. It
might seem easy, given the word and nonword character classes
(<code class="literal">\w</code> and <code class="literal">\W</code>):</p><a id="I_10_tt657"/><pre class="programlisting"> <code class="err">\</code><code class="n">w</code><code class="err">\</code><code class="n">W</code><code class="o">|</code><code class="err">\</code><code class="n">W</code><code class="err">\</code><code class="n">w</code> <code class="c1">// match the start or end of a word</code></pre><p>But now what? We could try inserting that pattern into our
regular expressions wherever we would have used <code class="literal">\b</code>, but it’s not really the same. We’re
actually matching those characters, not just requiring them. This
regular expression matches the two characters composing the word
<span class="emphasis"><em>boundary</em></span> in addition to whatever else matches
afterward, whereas the <code class="literal">\b</code> operator
simply <span class="emphasis"><em>requires</em></span> the word boundary but doesn’t
match any text. The distinction is that <code class="literal">\b</code> isn’t a matching pattern but a kind of
lookahead. A <span class="emphasis"><em>lookahead</em></span> is a pattern that is
required to match next in the string, but is not consumed by the regex
engine. When a lookahead pattern succeeds, the pattern moves on, and
the characters are left in the stream for the next part of the pattern
to use. If the lookahead fails, the match fails (or it backtracks and
tries a different approach).</p><p>We can make our own lookaheads with the lookahead operator
<a id="I_indexterm10_id734316" class="indexterm"/><a id="I_indexterm10_id734322" class="indexterm"/><a id="I_indexterm10_id734327" class="indexterm"/><code class="literal">(?=)</code>. For example, to
match the letter X at the end of a word, we could use:</p><a id="I_10_tt658"/><pre class="programlisting"> <code class="o">(?=</code><code class="err">\</code><code class="n">w</code><code class="err">\</code><code class="n">W</code><code class="o">)</code><code class="n">X</code> <code class="c1">// Find X at the end of a word</code></pre><p>Here the regex engine requires the <code class="literal">\w\W</code> pattern to match but not consume the
characters, leaving them for the next part of the pattern. This
effectively allows us to write overlapping patterns (like the previous
example). For instance, we can match the word “Pat” only when it’s
part of the word “Patrick,” like so:</p><a id="I_10_tt659"/><pre class="programlisting"> <code class="o">(?=</code><code class="n">Patrick</code><code class="o">)</code><code class="n">Pat</code> <code class="c1">// Find Pat only in Patrick</code></pre><p>Another operator, <code class="literal">(?!)</code>, the
<span class="emphasis"><em>negative lookahead</em></span>, requires that the pattern not
match. We can find all the occurrences of Pat not inside of a Patrick
with this:</p><a id="I_10_tt660"/><pre class="programlisting"> <code class="o">(?!</code><code class="n">Patrick</code><code class="o">)</code><code class="n">Pat</code> <code class="c1">// Find Pat never in Patrick</code></pre><p>It’s worth noting that we could have written all of these
examples in other ways, by simply matching a larger amount of text.
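</p><p>The two styles can be compared in a short Java sketch. It assumes the standard <code class="literal">String.replaceAll()</code> method, which accepts a regex and a replacement string; the <code class="literal">PatDemo</code> class, its method names, and the sample sentence are hypothetical and serve only as an illustration:</p><pre class="programlisting">// Illustrative only: PatDemo and its methods are made-up names
class PatDemo {
    // Lookahead version: only "Pat" is matched, so only "Pat" is replaced
    static String withLookahead(String text) {
        return text.replaceAll("(?=Patrick)Pat", "PAT");
    }

    // Larger-match version: the whole word is matched, so the tail
    // "rick" must be captured in group 1 and put back with $1
    static String withCapture(String text) {
        return text.replaceAll("Pat(rick)", "PAT$1");
    }

    // Negative lookahead: every "Pat" that does not begin "Patrick"
    static String withNegativeLookahead(String text) {
        return text.replaceAll("(?!Patrick)Pat", "PAT");
    }

    public static void main(String[] args) {
        String text = "Pat met Patrick and Patricia.";
        System.out.println(withLookahead(text));         // Pat met PATrick and Patricia.
        System.out.println(withCapture(text));           // Pat met PATrick and Patricia.
        System.out.println(withNegativeLookahead(text)); // PAT met Patrick and PATricia.
    }
}
</pre><p>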
For instance, in the first example we could have matched the whole
word “Patrick.” But that is not as precise, and if we wanted to use
capture groups to pull out the matched text or parts of it later, we’d
have to play games to get what we want. For example, suppose we wanted
to substitute something for Pat (say, change the font). We’d have to
use an extra capture group and replace the text with itself. Using
lookaheads is easier.</p><p>In addition to looking ahead in the stream, we can use the
<a id="I_indexterm10_id734406" class="indexterm"/><a id="I_indexterm10_id734412" class="indexterm"/><a id="I_indexterm10_id734418" class="indexterm"/><code class="literal">(?<=)</code> and
<a id="I_indexterm10_id734429" class="indexterm"/><a id="I_indexterm10_id734435" class="indexterm"/><a id="I_indexterm10_id734441" class="indexterm"/><code class="literal">(?<!)</code> <span class="emphasis"><em><span>lookbehind</span></em></span> operators to
look backward in the stream. For example, we can find my last name,
but only when it refers to me:</p><a id="I_10_tt661"/><pre class="programlisting"> <code class="o">(?<=</code><code class="n">Pat</code> <code class="o">)</code><code class="n">Niemeyer</code> <code class="c1">// Niemeyer, only when preceded by Pat</code></pre><p>Or we can find the string “bean” when it is not part of the
phrase “Java bean”:</p><a id="I_10_tt662"/><pre class="programlisting"> <code class="o">(?<!</code><code class="n">Java</code> <code class="o">*)</code><code class="n">bean</code> <code class="c1">// The word bean, not preceded by Java</code></pre><p>In these cases, the lookbehind and the matched text didn’t
overlap because the lookbehind was before the matched text. But you can place a lookahead or
lookbehind at either point, before or after the match. For example, we
could also match Pat Niemeyer like this:<a id="I_indexterm10_id734496" class="indexterm"/><a id="I_indexterm10_id734503" class="indexterm"/><a id="I_indexterm10_id734510" class="indexterm"/></p><a id="I_10_tt663"/><pre class="programlisting"> <code class="n">Niemeyer</code><code class="o">(?<=</code><code class="n">Pat</code> <code class="n">Niemeyer</code><code clas