<?xml version="1.0" encoding="UTF-8" standalone="no"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"><head><title>Regular Expressions</title><link rel="stylesheet" href="core.css" type="text/css"/><meta name="generator" content="DocBook XSL Stylesheets V1.74.0"/></head><body><div class="sect1" title="Regular Expressions"><div class="titlepage"><div><div><h1 class="title"><a id="learnjava3-CHP-10-SECT-7"/>Regular Expressions</h1></div></div></div><p>Now it’s time to take a brief detour on our trip through Java and enter the land of <span class="emphasis"><em>regular expressions</em></span>. A regular expression, or regex for short, describes a text pattern. Regular expressions are used with many tools—including the <a id="I_indexterm10_id732046" class="indexterm"/><code class="literal">java.util.regex</code> package, text editors, and many scripting languages—to provide sophisticated text-searching and powerful string-manipulation capabilities.</p><p>If you are already familiar with the concept of regular expressions and how they are used with other languages, you may wish to skim through this section. At the very least, you’ll need to look at the “The java.util.regex API” section later in this chapter, which covers the Java classes necessary to use them. On the other hand, if you’ve come to this point on your Java journey with a clean slate on this topic and you’re wondering exactly what regular expressions are, then pop open your favorite beverage and get ready. 
You are about to learn about the most powerful tool in the arsenal of text manipulation and what is, in fact, a tiny language within a language, all in the span of a few pages.</p><div class="sect2" title="Regex Notation"><div class="titlepage"><div><div><h2 class="title"><a id="learnjava3-CHP-10-SECT-7.1"/>Regex Notation</h2></div></div></div><p><a id="idx10587" class="indexterm"/>A regular expression describes a pattern in text. By pattern, we mean just about any feature you can imagine identifying in text from the literal characters alone, without actually understanding their meaning. This includes features, such as words, word groupings, lines and paragraphs, punctuation, case, and more generally, strings and numbers with a specific structure to them, such as phone numbers, email addresses, and quoted phrases. With regular expressions, you can search the dictionary for all the words that have the letter “q” without its pal “u” next to it, or words that start and end with the same letter. Once you have constructed a pattern, you can use simple tools to hunt for it in text or to determine if a given string matches it. A regex can also be arranged to help you dismember specific parts of the text it matched, which you could then use as elements of replacement text if you wish.</p><div class="sect3" title="Write once, run away"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.1"/>Write once, run away</h3></div></div></div><p><a id="I_indexterm10_id732113" class="indexterm"/>Before moving on, we should say a few words about regular expression syntax in general. At the beginning of this section, we casually mentioned that we would be discussing a new language. Regular expressions do, in fact, constitute a simple form of programming language. 
If you think for a moment about the examples we cited earlier, you can see that something like a language is going to be needed to describe even simple patterns—such as email addresses—that have some variation in form.</p><p>A computer science textbook would classify regular expressions at the bottom of the hierarchy of computer languages, in terms of both what they can describe and what you can do with them. They are still capable of being quite sophisticated, however. As with most programming languages, the elements of regular expressions are simple, but they can be built up in combination to arbitrary complexity. And that is where things start to get sticky.</p><p>Since regexes work on strings, it is convenient to have a very compact notation that can be easily wedged between characters. But compact notation can be very cryptic, and experience shows that it is much easier to write a complex statement than to read it again later. Such is the curse of the regular expression. You may find that in a moment of late-night, caffeine-fueled inspiration, you can write a single glorious pattern to simplify the rest of your program down to one line. When you return to read that line the next day, however, it may look like Egyptian hieroglyphics to you. Simpler is generally better. If you can break your problem down and do it more clearly in several steps, maybe you should.</p></div><div class="sect3" title="Escaped characters"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.2"/>Escaped characters</h3></div></div></div><p><a id="idx10516" class="indexterm"/> <a id="idx10519" class="indexterm"/> <a id="idx10557" class="indexterm"/>Now that you’re properly warned, we have to throw one more thing at you before we build you back up. Not only can the regex notation get a little hairy, but it is also somewhat ambiguous with ordinary Java strings. 
An important part of the notation is the escaped character, a character with a backslash in front of it. For example, the escaped <code class="literal">d</code> character, <a id="I_indexterm10_id732204" class="indexterm"/><code class="literal">\d</code> (backslash ‘d’), is shorthand that matches any single digit character (0-9). However, you cannot simply write <code class="literal">\d</code> as part of a Java string, because Java uses the backslash for its own special characters and to specify Unicode character sequences (<code class="literal">\uxxxx</code>). Fortunately, Java gives us a way out: an escaped backslash, written as two backslashes (<code class="literal">\\</code>), stands for a single literal backslash. The rule is, when you want a backslash to appear in your regex, you must escape it with an extra one:</p><a id="I_10_tt627"/><pre class="programlisting"> <code class="s">"\\d"</code> <code class="c1">// Java string that yields backslash "d"</code></pre><p><a id="I_indexterm10_id732241" class="indexterm"/> <a id="I_indexterm10_id732247" class="indexterm"/>And just to make things crazier, because regex notation itself uses backslash to denote special characters, it must provide the same “escape hatch” as well, allowing you to double up backslashes if you want a literal backslash. So if you want to specify a regular expression that includes a single literal backslash, it looks like this:</p><a id="I_10_tt628"/><pre class="programlisting"> <code class="s">"\\\\"</code> <code class="c1">// Java string yields two backslashes; regex yields one</code></pre><p>Most of the “magic” operator characters you read about in this section operate on the character that precedes them, so these also must be escaped if you want their literal meaning. 
This includes such characters as <a id="I_indexterm10_id732275" class="indexterm"/><a id="I_indexterm10_id732282" class="indexterm"/><code class="literal">.</code>, <a id="I_indexterm10_id732298" class="indexterm"/><a id="I_indexterm10_id732305" class="indexterm"/><code class="literal">*</code>, <a id="I_indexterm10_id732320" class="indexterm"/><a id="I_indexterm10_id732328" class="indexterm"/><code class="literal">+</code>, braces <a id="I_indexterm10_id732344" class="indexterm"/><a id="I_indexterm10_id732354" class="indexterm"/><code class="literal">{}</code>, and parentheses <a id="I_indexterm10_id732369" class="indexterm"/><a id="I_indexterm10_id732374" class="indexterm"/><code class="literal">()</code>.</p><p>If you need to create part of an expression that has lots of literal characters in it, you can use the special delimiters <a id="I_indexterm10_id732388" class="indexterm"/><code class="literal">\Q</code> and <a id="I_indexterm10_id732399" class="indexterm"/><code class="literal">\E</code> to help you. Any text appearing between <code class="literal">\Q</code> and <code class="literal">\E</code> is automatically escaped. (You still need the Java <code class="literal">String</code> escapes—double backslashes for backslash, but not quadruple.) 
There is also a static method <a id="I_indexterm10_id732428" class="indexterm"/><code class="literal">Pattern.quote()</code>, which does the same thing, returning a properly escaped version of whatever string you give it.</p><p>Beyond that, my only suggestion to help maintain your sanity when working with these examples is to keep two copies—a comment line showing the naked regular expression and the real Java string, where you must double up all backslashes.<a id="I_indexterm10_id732451" class="indexterm"/><a id="I_indexterm10_id732458" class="indexterm"/><a id="I_indexterm10_id732465" class="indexterm"/></p></div><div class="sect3" title="Characters and character classes"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.3"/>Characters and character classes</h3></div></div></div><p><a id="idx10515" class="indexterm"/> <a id="idx10556" class="indexterm"/>Now, let’s dive into the actual regex syntax. The simplest form of a regular expression is plain, literal text, which has no special meaning and is matched directly (character for character) in the input. This can be a single character or more. For example, in the following string, the pattern “s” can match the character <code class="literal">s</code> in the words <code class="literal">rose</code> and <code class="literal">is</code>:</p><a id="I_10_tt629"/><pre class="programlisting"> <code class="s">"A rose is $1.99."</code></pre><p>The pattern “rose” can match only the literal word <code class="literal">rose</code>. But this isn’t very interesting. Let’s crank things up a notch by introducing some special characters and the notion of character “classes.”</p><div class="variablelist"><dl><dt><span class="term"><span class="emphasis"><em>Any character: dot</em></span> (<code class="literal">.</code>)</span></dt><dd><p>The special character dot (<code class="literal">.</code>) matches any single character. 
The pattern “.ose” matches rose, nose, _ose (space followed by ose) or any other character followed by the sequence ose. Two dots match any two characters, and so on. The dot operator is not discriminating; it normally stops only for an end-of-line character (and, optionally, you can tell it not to; we discuss that later).</p><p>We can consider “.” to represent the group or class of all characters. And regexes define more interesting character classes as well.</p></dd><dt><span class="term"><span class="emphasis"><em>Whitespace or nonwhitespace character:</em></span> <a id="I_indexterm10_id732588" class="indexterm"/> <code class="literal">\s</code>, <code class="literal">\S</code></span></dt><dd><p>The special character <code class="literal">\s</code> matches a literal-space character or one of the following characters: <a id="I_indexterm10_id732613" class="indexterm"/><code class="literal">\t</code> (tab), <a id="I_indexterm10_id732624" class="indexterm"/><code class="literal">\r</code> (carriage return), <a id="I_indexterm10_id732635" class="indexterm"/><code class="literal">\n</code> (newline), <a id="I_indexterm10_id732646" class="indexterm"/><code class="literal">\f</code> (formfeed), and <code class="literal">\x0B</code> (vertical tab). The corresponding special character <a id="I_indexterm10_id732658" class="indexterm"/><code class="literal">\S</code> does the inverse, matching any character except whitespace.</p></dd><dt><span class="term"><span class="emphasis"><em>Digit or nondigit character</em></span>: <code class="literal">\d</code>, <code class="literal">\D</code></span></dt><dd><p><a id="I_indexterm10_id732686" class="indexterm"/> <code class="literal">\d</code> matches any of the digits 0-9. 
<a id="I_indexterm10_id732699" class="indexterm"/><code class="literal">\D</code> does the inverse, matching all characters except digits.</p></dd><dt><span class="term"><span class="emphasis"><em>Word or nonword character</em></span>: <code class="literal">\w</code>, <code class="literal">\W</code></span></dt><dd><p><a id="I_indexterm10_id732727" class="indexterm"/> <code class="literal">\w</code> matches a “word” character, including upper- and lowercase letters A-Z, a-z, the digits 0-9, and the underscore character (_). <a id="I_indexterm10_id732742" class="indexterm"/><code class="literal">\W</code> matches everything except those characters.<a id="I_indexterm10_id732753" class="indexterm"/><a id="I_indexterm10_id732760" class="indexterm"/></p></dd></dl></div></div><div class="sect3" title="Custom character classes"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.4"/>Custom character classes</h3></div></div></div><p><a id="I_indexterm10_id732774" class="indexterm"/> <a id="I_indexterm10_id732786" class="indexterm"/> <a id="idx10518" class="indexterm"/> <a id="I_indexterm10_id732807" class="indexterm"/>You can define your own character classes using the notation [...]. For example, the following class matches any of the characters a, b, c, x, y, or z:</p><a id="I_10_tt630"/><pre class="programlisting"> <code class="o">[</code><code class="n">abcxyz</code><code class="o">]</code></pre><p>The special x-y range notation can be used as shorthand for the alphabetic characters. 
The following example defines a character class containing all upper- and lowercase letters:</p><a id="I_10_tt631"/><pre class="programlisting"> <code class="o">[</code><code class="n">A</code><code class="o">-</code><code class="n">Za</code><code class="o">-</code><code class="n">z</code><code class="o">]</code></pre><p><a id="I_indexterm10_id732841" class="indexterm"/> <a id="I_indexterm10_id732852" class="indexterm"/>Placing a caret (^) as the first character inside the brackets inverts the character class. This example matches any character except uppercase A-F:</p><a id="I_10_tt632"/><pre class="programlisting"> <code class="o">[^</code><code class="n">A</code><code class="o">-</code><code class="n">F</code><code class="o">]</code> <code class="c1">// G, H, I, ..., a, b, c, ... etc.</code></pre><p><a id="I_indexterm10_id732874" class="indexterm"/>Nesting character classes simply adds them:</p><a id="I_10_tt633"/><pre class="programlisting"> <code class="o">[</code><code class="n">A</code><code class="o">-</code><code class="n">F</code><code class="o">[</code><code class="n">G</code><code class="o">-</code><code class="n">Z</code><code class="o">]]</code> <code class="c1">// A-Z</code></pre><p><a id="I_indexterm10_id732891" class="indexterm"/>The &amp;&amp; logical AND notation can be used to take the intersection (characters in common):<a id="I_indexterm10_id732904" class="indexterm"/></p><a id="I_10_tt634"/><pre class="programlisting"> <code class="o">[</code><code class="n">a</code><code class="o">-</code><code class="n">p</code><code class="o">&amp;&amp;[</code><code class="n">l</code><code class="o">-</code><code class="n">z</code><code class="o">]]</code> <code class="c1">// l, m, n, o, p</code>
 <code class="o">[</code><code class="n">A</code><code class="o">-</code><code class="n">Z</code><code class="o">&amp;&amp;[^</code><code class="n">P</code><code class="o">]]</code> <code class="c1">// A through Z except P</code></pre></div><div class="sect3" title="Position markers"><div 
class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.5"/>Position markers</h3></div></div></div><p><a id="idx10537" class="indexterm"/> <a id="idx10561" class="indexterm"/>The pattern “[Aa] rose” (including an upper- or lowercase A) matches three times in the following phrase:</p><a id="I_10_tt635"/><pre class="programlisting"> <code class="s">"A rose is a rose is a rose"</code></pre><p>Position characters allow you to designate the relative location of a match. The most important are <code class="literal">^</code> and <a id="I_indexterm10_id732972" class="indexterm"/><a id="I_indexterm10_id732977" class="indexterm"/><code class="literal">$</code>, which match the beginning and end of a line, respectively:</p><a id="I_10_tt636"/><pre class="programlisting"> <code class="o">^[</code><code class="n">Aa</code><code class="o">]</code> <code class="n">rose</code> <code class="c1">// matches "A rose" at the beginning of line</code>
 <code class="o">[</code><code class="n">Aa</code><code class="o">]</code> <code class="n">rose$</code> <code class="c1">// matches "a rose" at end of line</code></pre><p>By default, <code class="literal">^</code> and <code class="literal">$</code> match the beginning and end of “input,” which is often a line. If you are working with multiple lines of text and wish to match the beginnings and endings of lines within a single large string, you can turn on “multiline” mode as described later in this chapter.</p><p>The position markers <a id="I_indexterm10_id733018" class="indexterm"/><code class="literal">\b</code> and <a id="I_indexterm10_id733028" class="indexterm"/><code class="literal">\B</code> match a word boundary or nonword boundary, respectively. 
For example, the following pattern matches rose and rosemary, but not primrose:<a id="I_indexterm10_id733040" class="indexterm"/><a id="I_indexterm10_id733047" class="indexterm"/></p><a id="I_10_tt637"/><pre class="programlisting"> <code class="err">\</code><code class="n">brose</code></pre></div><div class="sect3" title="Iteration (multiplicity)"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.6"/>Iteration (multiplicity)</h3></div></div></div><p><a id="idx10527" class="indexterm"/> <a id="idx10533" class="indexterm"/> <a id="idx10559" class="indexterm"/>Simply matching fixed character patterns would not get us very far. Next, we look at operators that count the number of occurrences of a character (or more generally, of a pattern, as we’ll see in <a class="xref" href="ch10s07.html#learnjava3-CHP-10-SECT-7.1.8" title="Capture groups">Capture groups</a>):</p><div class="variablelist"><dl><dt><span class="term"><span class="emphasis"><em>Any (zero or more iterations): asterisk</em></span> (<code class="literal">*</code>)</span></dt><dd><p><a id="I_indexterm10_id733129" class="indexterm"/> <a id="I_indexterm10_id733137" class="indexterm"/>Placing an asterisk (*) after a character or character class means “allow any number of that type of character”—in other words, zero or more. 
For example, the following pattern matches a digit with any number of leading zeros (possibly none):</p><a id="I_10_tt638"/><pre class="programlisting"> <code class="mi">0</code><code class="o">*</code><code class="err">\</code><code class="n">d</code> <code class="c1">// match a digit with any number of leading zeros</code></pre></dd><dt><span class="term"><span class="emphasis"><em>Some (one or more iterations): plus sign</em></span> (<code class="literal">+</code>)</span></dt><dd><p><a id="I_indexterm10_id733176" class="indexterm"/> <a id="I_indexterm10_id733185" class="indexterm"/>The plus sign (+) means “one or more” iterations and is equivalent to XX* (pattern followed by pattern asterisk). For example, the following pattern matches a number with one or more digits, plus optional leading zeros:</p><a id="I_10_tt639"/><pre class="programlisting"> <code class="mi">0</code><code class="o">*</code><code class="err">\</code><code class="n">d</code><code class="o">+</code> <code class="c1">// match a number (one or more digits) with optional leading zeros</code></pre><p>It may seem redundant to match the zeros at the beginning of an expression because zero is a digit and is thus matched by the <code class="literal">\d+</code> portion of the expression anyway. However, we’ll show later how you can pick apart the string using a regex and get at just the pieces you want. In this case, you might want to strip off the leading zeros and keep only the digits.</p></dd><dt><span class="term"><span class="emphasis"><em>Optional (zero or one iteration): question mark</em></span> (<code class="literal">?</code>)</span></dt><dd><p><a id="I_indexterm10_id733237" class="indexterm"/> <a id="I_indexterm10_id733246" class="indexterm"/>The question mark operator (?) allows exactly zero or one iteration. 
For example, the following pattern matches a credit-card expiration date, which may or may not have a slash in the middle:</p><a id="I_10_tt640"/><pre class="programlisting"> <code class="err">\</code><code class="n">d</code><code class="err">\</code><code class="n">d</code><code class="o">/?</code><code class="err">\</code><code class="n">d</code><code class="err">\</code><code class="n">d</code> <code class="c1">// match four digits with an optional slash in the middle</code></pre></dd><dt><span class="term"><span class="emphasis"><em>Range (between x and y iterations, inclusive</em></span>): <code class="literal">{x,y}</code></span></dt><dd><p>The <code class="literal">{x,y}</code> curly-brace range operator is the most general iteration operator. It specifies a precise range to match. A range takes two arguments: a lower bound and an upper bound, separated by a comma. This regex matches any word with five to seven characters, inclusive:</p><a id="I_10_tt641"/><pre class="programlisting"> <code class="err">\</code><code class="n">b</code><code class="err">\</code><code class="n">w</code><code class="o">{</code><code class="mi">5</code><code class="o">,</code><code class="mi">7</code><code class="o">}</code><code class="err">\</code><code class="n">b</code> <code class="c1">// match words with at least 5 and at most 7 characters</code></pre></dd><dt><span class="term"><span class="emphasis"><em>At least x or more iterations (y is infinite</em></span>): <code class="literal">{x,}</code></span></dt><dd><p>If you omit the upper bound, simply leaving a dangling comma in the range, the upper bound becomes infinite. 
This is a way to specify a minimum of occurrences with no maximum.<a id="I_indexterm10_id733314" class="indexterm"/><a id="I_indexterm10_id733321" class="indexterm"/><a id="I_indexterm10_id733328" class="indexterm"/></p></dd></dl></div></div><div class="sect3" title="Grouping"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.7"/>Grouping</h3></div></div></div><p><a id="I_indexterm10_id733342" class="indexterm"/> <a id="idx10524" class="indexterm"/> <a id="I_indexterm10_id733358" class="indexterm"/> <a id="I_indexterm10_id733365" class="indexterm"/>Just as in logical or mathematical operations, parentheses can be used in regular expressions to make subexpressions or to put boundaries on parts of expressions. This power lets us extend the operators we’ve talked about to work not only on characters, but also on words or other regular expressions. For example:</p><a id="I_10_tt642"/><pre class="programlisting"> <code class="o">(</code><code class="n">yada</code><code class="o">)+</code></pre><p>Here we are applying the + (one or more) operator to the whole pattern <code class="literal">yada</code>, not just one character. It matches yada, yadayada, yadayadayada, and so on.</p><p>Using grouping, we can start building more complex expressions. For example, while many email addresses have a three-part structure (e.g., <span class="emphasis"><em>foo@bar.com</em></span>), the domain name portion can, in actuality, contain an arbitrary number of dot-separated components. 
To handle this properly, we can use an expression like this one:</p><a id="I_10_tt643"/><pre class="programlisting"> <code class="err">\</code><code class="n">w</code><code class="o">+</code><code class="err">@\</code><code class="n">w</code><code class="o">+(</code><code class="err">\</code><code class="o">.</code><code class="err">\</code><code class="n">w</code><code class="o">+)+</code> <code class="c1">// Match an email address</code></pre><p>This expression matches a word, followed by an <code class="literal">@</code> symbol, followed by another word and then one or more further words, each preceded by a literal dot; for example, <span class="email"><a class="email" href="mailto:pat@pat.net">pat@pat.net</a></span>, <span class="email"><a class="email" href="mailto:friend@foo.bar.com">friend@foo.bar.com</a></span>, or <span class="email"><a class="email" href="mailto:mate@foo.bar.co.uk">mate@foo.bar.co.uk</a></span>.<a id="I_indexterm10_id733436" class="indexterm"/></p></div><div class="sect3" title="Capture groups"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.8"/>Capture groups</h3></div></div></div><p><a id="idx10514" class="indexterm"/> <a id="idx10555" class="indexterm"/>In addition to basic grouping of operations, parentheses have an important, additional role: the text matched by each parenthesized subexpression can be separately retrieved. That is, you can isolate the text that matched each subexpression. There is then a special syntax for referring to each capture group within the regular expression by number. This important feature has two uses.</p><p>First, you can construct a regular expression that refers to the text it has already matched and uses this text as a parameter for further matching. This allows you to express some very powerful things. For example, we can show the dictionary example we mentioned in the introduction. 
Let’s find all the words that start and end with the same letter:</p><a id="I_10_tt644"/><pre class="programlisting"> <code class="err">\</code><code class="n">b</code><code class="o">(</code><code class="err">\</code><code class="n">w</code><code class="o">)</code><code class="err">\</code><code class="n">w</code><code class="o">*</code><code class="err">\</code><code class="mi">1</code><code class="err">\</code><code class="n">b</code> <code class="c1">// match words beginning and ending with the same letter</code></pre><p>See the <code class="literal">\1</code> in this expression? It’s a reference to the first capture group in the expression, <code class="literal">(\w)</code>. References to capture groups take the form <code class="literal">\</code><em class="replaceable"><code>n</code></em> where <em class="replaceable"><code>n</code></em> is the number of the capture group, counting from left to right. In this example, the first capture group matches a word character on a word boundary. Then we allow any number of word characters up to the special reference <code class="literal">\1</code> (also followed by a word boundary). The <code class="literal">\1</code> means “the value matched in capture group one.” Because these characters must be the same, this regex matches words that start and end with the same character.</p><p>The second use of capture groups is in referring to the matched portions of text while constructing replacement text. We’ll show you how to do that a bit later when we talk about the Regular Expression API.</p><p>Capture groups can contain more than one character, of course, and you can have any number of groups. You can even nest capture groups. 
Next, we discuss exactly how they are numbered.<a id="I_indexterm10_id733548" class="indexterm"/><a id="I_indexterm10_id733555" class="indexterm"/></p></div><div class="sect3" title="Numbering"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.9"/>Numbering</h3></div></div></div><p><a id="I_indexterm10_id733569" class="indexterm"/> <a id="I_indexterm10_id733575" class="indexterm"/>Capture groups are numbered, starting at 1, and moving from left to right, by counting the number of open parentheses it takes to reach them. The special group number 0 always refers to the entire expression match. For example, consider the following expression:</p><a id="I_10_tt645"/><pre class="programlisting"> <code class="n">one</code> <code class="o">((</code><code class="n">two</code><code class="o">)</code> <code class="o">(</code><code class="n">three</code> <code class="o">(</code><code class="n">four</code><code class="o">)))</code></pre><p>Matched against the text <code class="literal">one two three four</code>, this expression creates the following groups:</p><a id="I_10_tt646"/><pre class="programlisting"> <code class="n">Group</code> <code class="mi">0</code><code class="o">:</code> <code class="n">one</code> <code class="n">two</code> <code class="n">three</code> <code class="n">four</code>
 <code class="n">Group</code> <code class="mi">1</code><code class="o">:</code> <code class="n">two</code> <code class="n">three</code> <code class="n">four</code>
 <code class="n">Group</code> <code class="mi">2</code><code class="o">:</code> <code class="n">two</code>
 <code class="n">Group</code> <code class="mi">3</code><code class="o">:</code> <code class="n">three</code> <code class="n">four</code>
 <code class="n">Group</code> <code class="mi">4</code><code class="o">:</code> <code class="n">four</code></pre><p>Before going on, we should note one more thing. So far in this section we’ve glossed over the fact that parentheses are doing double duty: creating logical groupings for operations and defining capture groups. 
What if the two roles conflict? Suppose we have a complex regex that uses parentheses both to group subexpressions and to create capture groups. In that case, you can use the special noncapturing group operator <code class="literal">(?:)</code> for the purely logical groupings instead of plain parentheses, so that they are skipped when capture groups are numbered. You probably won’t need to do this often, but it’s good to know.</p></div><div class="sect3" title="Alternation"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.10"/>Alternation</h3></div></div></div><p><a id="I_indexterm10_id733632" class="indexterm"/> <a id="idx10513" class="indexterm"/> <a id="I_indexterm10_id733648" class="indexterm"/> <a id="I_indexterm10_id733659" class="indexterm"/>The vertical bar (|) operator denotes the logical OR operation, also called alternation or choice. The | operator does not operate on individual characters but instead applies to everything on either side of it. It splits the expression in two unless constrained by parentheses grouping. 
For example, a slightly naive approach to parsing dates might be the following:</p><a id="I_10_tt647"/><pre class="programlisting"> <code class="err">\</code><code class="n">w</code><code class="o">+,</code> <code class="err">\</code><code class="n">w</code><code class="o">+</code> <code class="err">\</code><code class="n">d</code><code class="o">+,</code> <code class="err">\</code><code class="n">d</code><code class="o">+|</code><code class="err">\</code><code class="n">d</code><code class="err">\</code><code class="n">d</code><code class="o">/</code><code class="err">\</code><code class="n">d</code><code class="err">\</code><code class="n">d</code><code class="o">/</code><code class="err">\</code><code class="n">d</code><code class="err">\</code><code class="n">d</code><code class="err">\</code><code class="n">d</code><code class="err">\</code><code class="n">d</code> <code class="c1">// pattern 1 or pattern 2</code></pre><p>In this expression, the left alternative matches patterns such as Fri, Oct 12, 2001, and the right alternative matches 10/12/2001.</p><p>The following regex might be used to match email addresses with one of three domains (<span class="emphasis"><em>net</em></span>, <span class="emphasis"><em>edu</em></span>, and <span class="emphasis"><em>gov</em></span>):<a id="I_indexterm10_id733694" class="indexterm"/></p><a id="I_10_tt648"/><pre class="programlisting"> <code class="err">\</code><code class="n">w</code><code class="o">+</code><code class="err">@</code><code class="o">[</code><code class="err">\</code><code class="n">w</code><code class="err">\</code><code class="o">.]*</code><code class="err">\</code><code class="o">.(</code><code class="n">net</code><code class="o">|</code><code class="n">edu</code><code class="o">|</code><code class="n">gov</code><code class="o">)</code> <code class="c1">// email address ending in .net, .edu, or .gov</code></pre></div><div class="sect3" title="Special options"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.11"/>Special options</h3></div></div></div><p><a id="I_indexterm10_id733717" 
class="indexterm"/>There are several special options that affect the way the regex engine performs its matching. These options can be applied in two ways:</p><div class="itemizedlist"><ul class="itemizedlist"><li class="listitem"><p>You can pass in one or more flags during the <code class="literal">Pattern.compile()</code> step (discussed later in this chapter).</p></li><li class="listitem"><p>You can include a special block of code in your regex.</p></li></ul></div><p>We’ll show the latter approach here. To do this, include one or more flags in a special block <code class="literal">(?</code><em class="replaceable"><code>x</code></em><code class="literal">)</code>, where <em class="replaceable"><code>x</code></em> is the flag for the option we want to turn on. Generally, you do this at the beginning of the regex. You can also turn off flags by adding a minus sign <code class="literal">(?-</code><em class="replaceable"><code>x</code></em><code class="literal">)</code>, which allows you to apply flags to select parts of your pattern.</p><p>The following flags are available:</p><div class="variablelist"><dl><dt><span class="term"><span class="emphasis"><em>Case-insensitive</em></span>: <code class="literal">(?i)</code></span></dt><dd><p>The <a id="I_indexterm10_id733792" class="indexterm"/><code class="literal">(?i)</code> flag tells the regex engine to ignore case while matching, for example:</p><a id="I_10_tt649"/><pre class="programlisting"> <code class="o">(?</code><code class="n">i</code><code class="o">)</code><code class="n">yahoo</code> <code class="c1">// match Yahoo, yahoo, yahOO, etc.</code></pre></dd><dt><span class="term"><span class="emphasis"><em>Dot all</em></span>: <code class="literal">(?s)</code></span></dt><dd><p>The <a id="I_indexterm10_id733825" class="indexterm"/><code class="literal">(?s)</code> flag turns on “dot all” mode, allowing the dot character to match anything, including end-of-line characters. 
It is useful if you are matching patterns that span multiple lines. The <code class="literal">s</code> stands for “single-line mode,” a somewhat confusing name derived from Perl.</p></dd><dt><span class="term"><span class="emphasis"><em>Multiline</em></span>: <code class="literal">(?m)</code></span></dt><dd><p>By default, <code class="literal">^</code> and <code class="literal">$</code> don’t really match the beginning and end of lines (as defined by carriage return or newline combinations); they instead match the beginning or end of the entire input text. Turning on multiline mode with <a id="I_indexterm10_id733874" class="indexterm"/><code class="literal">(?m)</code> causes them to match the beginning and end of every line as well as the beginning and end of input. Specifically, this means the spot before the first character, the spot after the last character, and the spots just after and before line terminators inside the string.</p></dd><dt><span class="term"><span class="emphasis"><em>Unix lines</em></span>: <code class="literal">(?d)</code></span></dt><dd><p>The <a id="I_indexterm10_id733900" class="indexterm"/><code class="literal">(?d)</code> flag limits the definition of the line terminator for the <code class="literal">^</code>, <code class="literal">$</code>, and <code class="literal">.</code> special characters to Unix-style newline only (<code class="literal">\n</code>). By default, other line terminators, such as a lone carriage return (<code class="literal">\r</code>) or a carriage return followed by a newline (<code class="literal">\r\n</code>), are also recognized.</p></dd></dl></div></div><div class="sect3" title="Greediness"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.12"/>Greediness</h3></div></div></div><p><a id="idx10523" class="indexterm"/> <a id="idx10558" class="indexterm"/>We’ve seen hints that regular expressions are capable of sorting out some complex patterns. But there are cases where what should be matched is ambiguous (at least to us, though not to the regex engine). 
Probably the most important example has to do with the number of characters the iteration operators consume before stopping. The <a id="I_indexterm10_id733976" class="indexterm"/><a id="I_indexterm10_id733984" class="indexterm"/><code class="literal">.*</code> operation best illustrates this. Consider the following string:</p><a id="I_10_tt650"/><pre class="programlisting"> <code class="s">"Now is the time for &lt;bold&gt;action&lt;/bold&gt;, not words."</code></pre><p>Suppose we want to search for all the HTML-style tags (the parts between the &lt; and &gt; characters), perhaps because we want to remove them.</p><p>We might naively start with this regex:</p><a id="I_10_tt651"/><pre class="programlisting"> <code class="o">&lt;/?.*&gt;</code> <code class="c1">// match &lt;, optional /, and then anything up to &gt;</code></pre><p>We then get the following match, which is much too long:</p><a id="I_10_tt652"/><pre class="programlisting"> <code class="o">&lt;</code><code class="n">bold</code><code class="o">&gt;</code><code class="n">action</code><code class="o">&lt;/</code><code class="n">bold</code><code class="o">&gt;</code></pre><p>The problem is that the <code class="literal">.*</code> operation, like all the iteration operators, is by default “greedy,” meaning that it consumes absolutely everything it can, up until the last match for the terminating character (in this case, &gt;) in the file or line.</p><p>There are solutions for this problem. The first is to “say what it is”: that is, to be specific about what is allowed between the angle brackets. The content of an HTML tag cannot include just <span class="emphasis"><em>anything</em></span>; for example, it cannot include a closing bracket (&gt;). 
So we could rewrite our expression as:</p><a id="I_10_tt653"/><pre class="programlisting"> <code class="o">&lt;/?</code><code class="err">\</code><code class="n">w</code><code class="o">*&gt;</code> <code class="c1">// match &lt;, optional /, any number of word characters, then &gt;</code></pre><p>But suppose the content is not so easy to describe. For example, we might be looking for quoted strings in text, which could include just about any text. In that case, we can use a second approach and “say what it is not.” We can invert our logic from the previous example and specify that anything <span class="emphasis"><em>except</em></span> a closing bracket is allowed inside the brackets:</p><a id="I_10_tt654"/><pre class="programlisting"> <code class="o">&lt;/?[^&gt;]*&gt;</code></pre><p>This is probably the most efficient way to tell the regex engine what to do. It then knows exactly what to look for to stop reading. This approach has limitations, however. It is not obvious how to do this if the delimiter is more complex than a single character. It is also not very elegant.</p><p>Finally, we come to our general solution: the use of “reluctant” operators. For each of the iteration operators, there is an alternative, nongreedy form that consumes as few characters as possible, while still trying to get a match with what comes after it. This is exactly what we needed in our previous example.</p><p>Reluctant operators take the form of the standard operator with a “?” appended. (Yes, we know that’s confusing.) 
We can now write our regex as:</p><a id="I_10_tt655"/><pre class="programlisting"> <code class="o">&lt;/?.*?&gt;</code> <code class="c1">// match &lt;, optional /, minimum number of any chars, then &gt;</code></pre><p>We have appended <a id="I_indexterm10_id734118" class="indexterm"/><a id="I_indexterm10_id734126" class="indexterm"/><code class="literal">?</code> to <code class="literal">.*</code> to cause <code class="literal">.*</code> to match as few characters as possible while still making the final match of &gt;. The same technique (appending the <code class="literal">?</code>) works with all the iteration operators, as in the two following examples:<a id="I_indexterm10_id734160" class="indexterm"/><a id="I_indexterm10_id734167" class="indexterm"/></p><a id="I_10_tt656"/><pre class="programlisting"> <code class="o">.+?</code> <code class="c1">// one or more, nongreedy</code>
 <code class="o">.{</code><code class="n">x</code><code class="o">,</code><code class="n">y</code><code class="o">}?</code> <code class="c1">// between x and y, nongreedy</code></pre></div><div class="sect3" title="Lookaheads and lookbehinds"><div class="titlepage"><div><div><h3 class="title"><a id="learnjava3-CHP-10-SECT-7.1.13"/>Lookaheads and lookbehinds</h3></div></div></div><p><a id="idx10530" class="indexterm"/> <a id="idx10560" class="indexterm"/>In order to understand our next topic, let’s return for a moment to the position marking characters (<code class="literal">^</code>, <code class="literal">$</code>, <code class="literal">\b</code>, and <code class="literal">\B</code>) that we discussed earlier. Think about what exactly these special markers do for us. We say, for example, that the <code class="literal">\b</code> marker matches a word boundary. But the word “match” here may be a bit too strong. In reality, it “requires” a word boundary to appear at the specified point in the regex. Suppose we didn’t have <code class="literal">\b</code>; how could we construct it? 
Well, we could try constructing a regex that matches the word boundary. It might seem easy, given the word and nonword character classes (<code class="literal">\w</code> and <code class="literal">\W</code>):</p><a id="I_10_tt657"/><pre class="programlisting"> <code class="err">\</code><code class="n">w</code><code class="err">\</code><code class="n">W</code><code class="o">|</code><code class="err">\</code><code class="n">W</code><code class="err">\</code><code class="n">w</code> <code class="c1">// match the start or end of a word</code></pre><p>But now what? We could try inserting that pattern into our regular expressions wherever we would have used <code class="literal">\b</code>, but it’s not really the same. We’re actually matching those characters, not just requiring them. This regular expression matches the two characters composing the word <span class="emphasis"><em>boundary</em></span> in addition to whatever else matches afterward, whereas the <code class="literal">\b</code> operator simply <span class="emphasis"><em>requires</em></span> the word boundary but doesn’t match any text. The distinction is that <code class="literal">\b</code> isn’t a matching pattern but a kind of lookahead. A <span class="emphasis"><em>lookahead</em></span> is a pattern that is required to match next in the string, but is not consumed by the regex engine. When a lookahead pattern succeeds, the pattern moves on, and the characters are left in the stream for the next part of the pattern to use. If the lookahead fails, the match fails (or it backtracks and tries a different approach).</p><p>We can make our own lookaheads with the lookahead operator <a id="I_indexterm10_id734316" class="indexterm"/><a id="I_indexterm10_id734322" class="indexterm"/><a id="I_indexterm10_id734327" class="indexterm"/><code class="literal">(?=)</code>. 
For example, to match the letter X at the end of a word, we could use:</p><a id="I_10_tt658"/><pre class="programlisting"> <code class="o">(?=</code><code class="err">\</code><code class="n">w</code><code class="err">\</code><code class="n">W</code><code class="o">)</code><code class="n">X</code> <code class="c1">// Find X at the end of a word</code></pre><p>Here the regex engine requires the <code class="literal">\w\W</code> pattern to match but not consume the characters, leaving them for the next part of the pattern. This effectively allows us to write overlapping patterns (like the previous example). For instance, we can match the word “Pat” only when it’s part of the word “Patrick,” like so:</p><a id="I_10_tt659"/><pre class="programlisting"> <code class="o">(?=</code><code class="n">Patrick</code><code class="o">)</code><code class="n">Pat</code> <code class="c1">// Find Pat only in Patrick</code></pre><p>Another operator, <code class="literal">(?!)</code>, the <span class="emphasis"><em>negative lookahead</em></span>, requires that the pattern not match. We can find all the occurrences of Pat not inside of a Patrick with this:</p><a id="I_10_tt660"/><pre class="programlisting"> <code class="o">(?!</code><code class="n">Patrick</code><code class="o">)</code><code class="n">Pat</code> <code class="c1">// Find Pat never in Patrick</code></pre><p>It’s worth noting that we could have written all of these examples in other ways, by simply matching a larger amount of text. For instance, in the first example we could have matched the whole word “Patrick.” But that is not as precise, and if we wanted to use capture groups to pull out the matched text or parts of it later, we’d have to play games to get what we want. For example, suppose we wanted to substitute something for Pat (say, change the font). We’d have to use an extra capture group and replace the text with itself. 
Using lookaheads is easier.</p><p>In addition to looking ahead in the stream, we can use the <a id="I_indexterm10_id734406" class="indexterm"/><a id="I_indexterm10_id734412" class="indexterm"/><a id="I_indexterm10_id734418" class="indexterm"/><code class="literal">(?&lt;=)</code> and <a id="I_indexterm10_id734429" class="indexterm"/><a id="I_indexterm10_id734435" class="indexterm"/><a id="I_indexterm10_id734441" class="indexterm"/><code class="literal">(?&lt;!)</code> <span class="emphasis"><em><span>lookbehind</span></em></span> operators to look backward in the stream. For example, we can find my last name, but only when it refers to me:</p><a id="I_10_tt661"/><pre class="programlisting"> <code class="o">(?&lt;=</code><code class="n">Pat</code> <code class="o">)</code><code class="n">Niemeyer</code> <code class="c1">// Niemeyer, only when preceded by Pat</code></pre><p>Or we can find the string “bean” when it is not part of the phrase “Java bean”:</p><a id="I_10_tt662"/><pre class="programlisting"> <code class="o">(?&lt;!</code><code class="n">Java</code> <code class="o">*)</code><code class="n">bean</code> <code class="c1">// The word bean, not preceded by Java</code></pre><p>In these cases, the lookbehind and the matched text didn’t overlap because the lookbehind was before the matched text. But you can place a lookahead or lookbehind at either point, before or after the match. For example, we could also match Pat Niemeyer like this:<a id="I_indexterm10_id734496" class="indexterm"/><a id="I_indexterm10_id734503" class="indexterm"/><a id="I_indexterm10_id734510" class="indexterm"/></p><a id="I_10_tt663"/><pre class="programlisting"> <code class="n">Niemeyer</code><code class="o">(?&lt;=</code><code class="n">Pat</code> <code class="n">Niemeyer</code><code clas