Regular Expression Pocket Reference, 2nd Edition


This book review was published by Slashdot, 2008-03-24.

Regular Expression Pocket Reference, 2nd Edition

When software developers need to manipulate text programmatically — such as finding all substrings within some text that match a particular pattern — the most concise and flexible solution is to use "regular expressions", which are strings of characters and symbols that can look anything but regular. Nonetheless, they can be invaluable for locating text that matches a pattern (the "expression"), and optionally replacing the matched text with new text. Regular expressions have proven so popular that they have been incorporated into most if not all major programming languages and editors, and even at least one Web server. But each one implements regular expressions in its own way — which is reason enough for programmers to appreciate the latest edition of Regular Expression Pocket Reference, by Tony Stubblebine.

The second edition of the book was published by O'Reilly Media on 18 July 2007, under the ISBN 978-0596514273. On the book's Web page, the publisher makes available the book's table of contents and index, as well as links for providing feedback and any errata. As of this writing, there are no unconfirmed errata (those submitted by readers but not yet checked by the author to see whether they are valid), and no confirmed ones, either. In fact, in my review of the first edition, published in 2004, it was noted that there were no unconfirmed errata, despite the book being out for some time prior to that review. The most likely explanation is that the author — in addition to any technical reviewers — did a thorough job of checking all of the regular expressions in the book, along with the sample code that make use of them. These efforts have paid off with the apparent absence of any errors in this new edition — something unseen in any other technical book with which I am familiar.

Before discussing this particular book, it may be of value to briefly discuss the essential concept of regular expressions, for the benefit of any readers who are not familiar with them. As noted earlier, a regular expression (frequently termed a "regex") is a string of characters intended for matching substrings in a block of text. A regex pattern can match literally, such as the pattern "book" matching both "book" and "bookshelf". A pattern can also use special characters and character combinations — often termed metasymbols and metasequences — such as \w to indicate a single word character (A-Z, a-z, 0-9, or '_'). Thus, the regex "b\w\wk" would match "book", but not "brook".

Here is a simple example to show the use of regexes in code, written in Perl: The statement "$text =~ m/book/;" would find the first instance of the string "book" inside the scalar variable $text, which presumably contains some text. To substitute all instances of the string with the word "publication", you could use the statement "$text =~ s/book/publication/g;" ('g' for globally search) or use "$text =~ s/bo{2}k/publication/g;". In this simplistic example, the second statement makes use of a quantifier, {2}, indicating two of the preceding letter.

These examples employ only one metacharacter (\w) and one quantifier ({2}). The total number of metacharacters, metasymbols, quantifiers, character classes, and assertions (to say nothing of capturing, clustering, and alternation) that are available, in most regex-enabled languages, is tremendous. However, the same cannot be said for the readability of all but the simplest regular expressions — especially lengthy ones not improved by whitespace and comments. As a consequence, when using regexes in their code, many programmers find themselves repeatedly consulting reference materials that do not focus on regular expressions. These resources comprise convoluted Perl books, incomplete tutorials on the Internet, and confusing discussions in technical newsgroups. For too many years, there was no published book providing the details of regexes for the various languages that utilize them, in addition to a clear explanation of how to use regexes wisely.

Fortunately, O'Reilly Media offers two titles in hopes of meeting that need: Mastering Regular Expressions, by Jeffrey Friedl, and Regular Expression Pocket Reference, by Tony Stubblebine. In several respects, the books are related — particularly in that Stubblebine bases his slender monograph upon Friedl's larger and more extensive title, justifiably characterized by Stubblebine as "the definitive work on the subject". In addition, Stubblebine's book follows the structure of Friedl's book, and contains page references to the same. Another major difference is that Regular Expression Pocket Reference is, just as the title indicates, for reference purposes only, and not intended as a tutorial.

At first glance, it is clear that Stubblebine's book packs a great deal of information into its modest 126 pages. That may partly be a result of the terseness of most, if not all, of the regular expression syntax; a metasymbol of more than two characters would be considered long-winded! Yet the high information density is likely also due to the manner in which Stubblebine has distilled the operators and rules, as well as the meaning and usage thereof, down to the bare bones. But this does not imply that the book is bereft of examples. Most of the sections contain at least one, and sometimes several, code fragments that illustrate the regex elements under discussion.

The book begins with a brief introduction to regexes and pattern matching, followed by an even briefer cookbook section, with Perl-style regexes for a dozen commonly-needed tasks, e.g., validating dates. The bulk of the book's material is divided into 11 sections, each one devoted to the usage of regexes within a particular language, application, or library: Perl 5.8, Java,.NET and C#, PHP, Python, Ruby, JavaScript, PCRE, the Apache Web server, the vi programmer's editor, and shell tools.

Each of these sections begins with a brief overview of how regexes fit into the overall language covered in that section. Following this is a subsection listing all of the supported metacharacters, with a summary of their meanings, in tabular format. In most cases, this is followed by a subsection showing the usage of those metacharacters — either in the form of operators or pattern-matching functions, depending upon how regular expressions are used within that language. Next is a subsection providing several examples, which is often the first material that most programmers turn to when trying to quickly figure out how to use one aspect of a language. Each section concludes with a short listing of other resources related to regexes for that particular language.

There are no glaring problems in this book, and I can only assume that all of the regular expressions themselves have been tested by the author and by previous readers. However, there is a minor weakness that should be pointed out, and could be corrected in the next edition. In most of the sections' examples, Stubblebine wisely formats the code so that every left brace ("{") is on the same line as the beginning of the statement that uses that brace, and each closing brace ("}") is lined up directly underneath the first character of the statement. This format saves space and makes it easier to match up the statement with its corresponding close brace. However, in the.NET / C# and PCRE library sections, the open braces consume their own lines, and also are indented inconsistently, as are the close braces, which makes the code less readable, as well as less consistent among the sections.

Some readers may fault the book's sparse index. Admittedly, an inadequate index in any sizable programming book can make it difficult if not impossible to find what one is looking for. As a result, one ends up flipping through the book's pages hoping to luckily spot the desired topic. This is the rather unpleasant method to which a reader must resort when a technical book has no index, or one that is inadequate — which is far too often the case. Stubblebine's index offers only several dozen entries for all the letters of the alphabet, and only two symbols. Some readers might demand that all of the metacharacters and metasequences be listed in the index, so they can be found even faster than otherwise. But given the large number of metacharacters and metasequences, as well as method names, module functions, and everything else relevant, creating an exhaustive index would almost double the size of the book, and be largely redundant with the language-specific sections. Within each language, there is typically a limited enough number of pages that scanning through them to find a particular topic, would not be onerous. On the other hand, some of the index's inclusions and omissions are odd. For instance, two symbols are listed, and yet no others; why bother with those two? Also, a few key concepts are missing, such as grouping and capturing.

Yet aside from these minor blemishes, Regular Expression Pocket Reference is a concise, well-written, and information-rich resource that should be kept on hand by any busy software developer.

Copyright © 2008 Michael J. Ross. All rights reserved.

This book is available on Amazon