Regular Expression Pocket Reference


This book review was published by SDOUG, 2004-04-12.

For any computer programmer who has needed to manipulate text (specifically, matching certain substrings and perhaps replacing them with other text), regular expressions usually seem heaven-sent. That is because they are by far the most concise and powerful way to find a pattern (the "expression") within text, and to specify replacements for one or all of the substrings that match the pattern.

For those readers who are not familiar with regular expressions ("regexes"), here is a simple example, written in Perl: The statement "$input =~ m/password/;" would find the first instance of the string "password" inside the scalar variable $input, which presumably contains some text that the user has entered. To replace all instances of the string (i.e., globally) with eight X's, you could use the statement "$input =~ s/password/XXXXXXXX/g;". This simplistic example does not even utilize any metacharacters, which make regexes much more powerful. For example, to replace every run of eight consecutive "word" characters (letters, digits, and underscores), you could use the statement "$input =~ s/\w{8}/XXXXXXXX/g;". Note that this pattern also matches eight-character runs inside longer words; limiting it to whole words of exactly eight characters would require a word-boundary assertion (\b) on each side.
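The statements above can be tried out with a short script. What follows is only a minimal sketch: the sample text assigned to $input is an assumption made for illustration (the review specifies no particular input), and the final substitution, which adds \b word-boundary assertions, is an extra variant included to show how the match could be restricted to whole words of exactly eight characters.

```perl
use strict;
use warnings;

# Hypothetical input; the review says only that $input holds user-entered text.
my $input = "My password is hunter22, and my old password was abcdefghijkl.";

# m// succeeds if the literal string occurs anywhere in $input.
my $found = ($input =~ m/password/) ? "yes" : "no";

# Copy $input first, so that each substitution starts from the original text.
# s///g replaces every occurrence, not just the first.
(my $masked = $input) =~ s/password/XXXXXXXX/g;

# \w{8} matches any run of eight word characters, even inside a longer word.
(my $runs = $input) =~ s/\w{8}/XXXXXXXX/g;

# Adding \b on each side limits the match to whole eight-character words.
(my $words = $input) =~ s/\b\w{8}\b/XXXXXXXX/g;

print "$found\n";   # yes
print "$masked\n";  # My XXXXXXXX is hunter22, and my old XXXXXXXX was abcdefghijkl.
print "$runs\n";    # My XXXXXXXX is XXXXXXXX, and my old XXXXXXXX was XXXXXXXXijkl.
print "$words\n";   # My XXXXXXXX is XXXXXXXX, and my old XXXXXXXX was abcdefghijkl.
```

The difference between the last two lines of output is the twelve-character word at the end: without the word-boundary assertions, its first eight characters are replaced and "ijkl" is left behind.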

That last example employs only one metasymbol (\w) and one quantifier ({8}). The total number of metacharacters, metasymbols, quantifiers, character classes, and assertions (to say nothing of capturing, clustering, and alternation) available in most regex-enabled languages is tremendous. Unfortunately, the same cannot be said for the readability of all but the simplest regular expressions. As a consequence, when using regexes in their code, most programmers find themselves repeatedly consulting reference materials that do not focus on regular expressions: overrated Perl books, incomplete tutorials on the Internet, or confusing discussions in technical newsgroups. For too many years, there was no published book that provided the details of regexes for the various languages that use them, along with a clear explanation of how to use regexes wisely.
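To give a sense of how quickly readability suffers once several of these elements are combined, here is a small sketch in Perl. The date-matching pattern and the sample sentence are hypothetical, written for this illustration rather than taken from the book under review. The same pattern is shown twice: once in compact form, and once spread out under Perl's /x modifier, which permits whitespace and comments inside the pattern.

```perl
use strict;
use warnings;

# Compact form: correct, but hard to take in at a glance.
my $compact = qr/\b(\d{4})[-\/](0[1-9]|1[0-2])[-\/](?:0[1-9]|[12]\d|3[01])\b/;

# The same pattern under /x, with each kind of element labeled.
my $commented = qr{
    \b                             # assertion: word boundary
    ( \d{4} )                      # capture 1: digit metasymbol plus quantifier
    [-/]                           # character class: hyphen or slash
    ( 0[1-9] | 1[0-2] )            # capture 2: month, using alternation
    [-/]
    (?: 0[1-9] | [12]\d | 3[01] )  # clustering without capture: day of month
    \b
}x;

# Hypothetical sample text.
my $text = "This review is dated 2004-04-12.";

# In list context, a successful match returns the captured substrings.
my ($year,  $month)  = $text =~ $compact;
my ($year2, $month2) = $text =~ $commented;

print "$year $month\n";    # 2004 04
print "$year2 $month2\n";  # 2004 04
```

Both forms match exactly the same text; only the second can be read without counting parentheses.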

Fortunately, O'Reilly & Associates has made available two titles in hopes of meeting that need: Mastering Regular Expressions, by Jeffrey Friedl, and Regular Expression Pocket Reference (published in May 2003 under the ISBN 978-0596004156), by Tony Stubblebine. In several respects, the books are related — particularly in that Stubblebine bases his slender monograph upon Friedl's larger and more extensive title, justifiably characterized by Stubblebine as "the definitive work on the subject". In addition, Stubblebine's book follows the structure of Friedl's book, and contains page references to the same. In this article, we will review the Regular Expression Pocket Reference.

At first glance, it is clear that the publication in question packs a great deal of information into only 93 pages. That may partly be a result of the terseness of most, if not all, of the regular expression syntax; a metasymbol of more than two characters would be considered long-winded! Yet the high information density is likely also due to the manner in which Stubblebine has distilled the operators and rules, as well as the meaning and usage thereof, down to the bare bones. But this does not imply that the book is bereft of examples. Most of the sections contain at least one, and sometimes several, code fragments that illustrate the regex elements under discussion.

The book is divided into nine major sections, each one devoted to the usage of regexes within a particular language, application, or library: Perl 5.8, Java, .NET and C#, Python, PCRE Lib, PHP, vi editor, JavaScript, and shell tools. In addition, there is a brief introduction, as well as an index. The latter is especially valuable for this subject matter: regular expression elements can be so small that they are difficult to locate by flipping through the pages of a book and hoping to spot the desired topic. (This is the rather unpleasant method to which a reader must resort when a technical book has no index, or an inadequate one, which is far too often the case.)

Each of the nine sections in the Regular Expression Pocket Reference begins with a very brief overview of how regexes fit into the overall language covered in that section. Following this is a subsection listing all of the supported metacharacters, with a summary of their meaning, in tabular format. In most cases, this is followed by a subsection showing the usage of those metacharacters, either in the form of operators or pattern-matching functions, depending upon how regular expressions are used within that language. Next is a subsection providing several examples, which is often the first material that most programmers turn to when trying to quickly figure out how to use one aspect of a language. Each section concludes with a short listing of other resources related to regexes for that particular language.

There are no glaring faults in this book, and I can only assume that most if not all of the regular expressions themselves have been tested by the author and by previous readers. However, there are some minor problems that should be pointed out, and that could be corrected in future editions. The first concerns the formatting of code. In the Perl and Java examples, Stubblebine wisely formats the code so that every left brace, "{", is on the same line as the beginning of the statement that uses that brace, and each closing brace, "}", is lined up directly underneath the first character of the statement. This format saves space and makes it easier to match up the statement with its corresponding close brace. However, in the ".NET and C#" and "PCRE Lib" sections, the open braces occupy their own lines and are indented, as are the close braces, which makes the code less readable and less consistent among the sections.

The second minor fault is also related to the formatting of code. In all of the function calls, there is no space after the opening parenthesis and none before the closing parenthesis. This is of no great consequence, but it is unfortunate, as it makes the code just that much more difficult to read.

If you do purchase the book (and I recommend that you do if you use regular expressions in any of the languages covered), and you find an erratum not already discovered and reported, then you should visit the website for the book, and pass along your feedback. You may notice that the cover image on the site (at the time of this writing) differs from the cover of the copy under review, which is identical to that of Friedl's book. It is quite possible that the publisher has now rightly chosen to give Stubblebine's book its own cover image, though both covers feature the same animal, the owl.

On the topic of errata, it should be noted that of all the errata pages that this reviewer has seen, for at least 20 technical books, this may be the only one in which the page for unconfirmed errata (those submitted by readers but not yet checked by the author for validity) contained no unprocessed entries. The most likely explanation is that Stubblebine is doing a commendable job of checking all of the errata submitted by readers and processing them: either verifying that they are valid and adding them to the confirmed errata page, or determining that they do not apply. I have never understood why authors allow unconfirmed errata to pile up on the websites for their books. Have they no interest in staying on top of reported problems with their own books? Like the owls on the cover of his book, Stubblebine is likely keeping an eye on things.

Copyright © 2004 Michael J. Ross. All rights reserved.
