JavaScript
Article

An Alternative to Regular Expressions: apg-exp

By Lowell D. Thomas

This article was peer reviewed by Sebastian Seitz and Almir Bijedic. Thanks to all of SitePoint’s peer reviewers for making SitePoint content the best it can be!

Hardly any programmer escapes the need to use regular expressions in one form or another from time to time. For many, the pattern syntax can seem cryptic and forbidding. This tutorial will introduce a new pattern-matching engine, apg-exp—a feature-rich alternative to RegExp with an ABNF pattern syntax that is a little easier on the eyes.

A Quick Comparison

Have you ever needed to verify an email address and come across something like this?

^[\w!#$%&'*+/=?^_`{|}~-]+(?:\.[\w!#$%&'*+/=?^_`{|}~-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$

A pattern-matching engine is the right tool for the job. This is a well-designed, well-written regular expression. It works great. So what’s not to like?

Well, if you are an expert with regular expressions, nothing at all. But for the rest of us, they may be

  • Hard to read
  • Even harder to write
  • Hard to maintain

The regular expression syntax has a long, time-honored history and is deeply integrated into many of the tools and languages that we, as programmers, use every day.

There is, however, an alternative syntax that has been around almost as long, is very popular with writers and users of Internet technical specifications, has all the power of regular expressions but is seldom used in the world of JavaScript programming. Namely, the Augmented Backus-Naur Form, or ABNF, formally defined by the IETF in RFC 5234 and RFC 7405.

Let’s see what that same email address might look like in ABNF.

email-address   = local "@" domain
local           = local-word *("." local-word)
domain          = 1*(sub-domain ".") top-domain
local-word      = 1*local-char
sub-domain      = 1*sub-domain-char
top-domain      = 2*6top-domain-char
local-char      = alpha / num / special
sub-domain-char = alpha / num / "-"
top-domain-char = alpha
alpha           = %d65-90 / %d97-122
num             = %d48-57
special         = %d33 / %d35 / %d36-39 / %d42-43 / %d45 / %d47 
                / %d61 / %d63 / %d94-96 / %d123-126

Not as compact, for sure, but like HTML and XML it is designed to be read by humans as well as machines. I’m guessing that with nothing more than a passing knowledge of wild card search patterns, you can just about read what is going on here in “plain English”.

  • the email address is defined as a local part and a domain separated by @
  • the local part is one word followed by optional dot-separated words
  • the domain is one or more dot-separated sub-domains followed by a single top domain
  • the only things you might not know here, but can probably guess, are:
    • just as the wild card character * means “zero or more”, 1* means “one or more” and 2*6 means min 2 and max 6 repetitions
    • / separates alternate choices
    • %d defines decimal character codes and character code ranges
    • for example, %d35 represents #, ASCII decimal 35
    • %d65-90 represents any character in the range A-Z, ASCII decimals 65-90

RegExp and apg-exp are compared for this email address in example 1.

apg-exp is a pattern-matching engine designed to have the look and feel of RegExp but to use the ABNF syntax for pattern definitions. In the next few sections I’ll walk you through:

  • How to get apg-exp into your app
  • A short guide to the ABNF syntax
  • Working with apg-exp—a few examples
  • Where to go next—more details, advanced examples

Up and Running—How to Get It

npm

If you are working in a Node.js environment, from your project directory run:

npm install apg-exp --save

You can then access it in your code with require().

For example:

var ApgExp = require("apg-exp");
var exp = new ApgExp(pattern, flags);
var result = exp.exec(stringToMatch);

GitHub

To get a copy of the code from GitHub, you can clone the repository to your project directory:

git clone https://github.com/ldthomas/apg-js2-exp.git apg-exp

or download it as a zip file.

Then in page.html:

<!-- optional stylesheet used in tutorial examples -->
<link rel="stylesheet" href="./apg-exp/apgexp.css">
<script src="./apg-exp/apgexp-min.js"></script>

<script>
  var useApgExp = function(){
      var exp = new ApgExp(pattern, flags); 
      var result = exp.exec(stringToMatch);
      /* do something with the result */
  }
</script>

CDN

You can also create a CDN version directly from the GitHub source using RawGit. However, be sure to read the no uptime or support guarantees (In fact, be sure to read the entire FAQ).

The following are used in all of the examples in this tutorial.

<link rel="stylesheet"
 href="https://cdn.rawgit.com/ldthomas/apg-js2-exp/89c6681798ba9e47583b685c87b244406b18a26d/apgexp.css">
<script
 src="https://cdn.rawgit.com/ldthomas/apg-js2-exp/c0fc3adac954a6f6ad6f265fd2f8f06f68001e10/apgexp-min.js"
 charset="utf-8"></script>

These files are cached on the MaxCDN servers and you are free to use them for testing as long as they remain available. However, for production, you should place copies of apgexp-min.js and apgexp.css on your own servers for guaranteed access
and include them in your pages as best suited to your application.

A Short Guide to ABNF

ABNF is a syntax to describe phrases, a phrase being any string. As you saw in the email example above, it allows you to break down complex phrases into a collection of simpler phrases. A phrase definition has the form:

name = elements LF

where LF is a line feed (newline \n) character.

The table below is a short guide to the elements (see SABNF for the full guide).

Element Definition Comments/Examples
name rule name alphanum + hyphen (see note 1 below)
%d32 single character code decimal value of the character code
%d97.98.99 string of character codes abc
%d48-57 character code range any ASCII digit 0-9
“abc” case-insensitive literal string abc or ABC, etc.
%s”aBc” case-sensitive literal string aBc only (= %d97.66.99)
space concatenates two elements %d97 %d98 (= ab)
/ separates two alternate elements %d97 / %d98 (= a or b)
*element zero or more repetitions of element (see note 2 below)
(elements) grouping, treated as single element (see note 3 below)
[elements] optional grouping [%d97] %d98 (ab or b)
%^ beginning of input string matches position only as empty phrase
%$ end of input string matches position only as empty phrase
&element look ahead for element element must follow current string position
!element look ahead for no element element must NOT follow current string position
&&element look behind for element element must precede current string position
!!element look behind for no element element must NOT precede current string position
\name back reference to rule “name” matches previous phrase found for “name”
;comment comment comment runs from ; to end of line

Note 1: Rule names must begin with an alphabetic character in the first column of the line. Continuation lines must begin with a space or tab. The first rule defines the entire phrase or pattern to be matched. The following rules define the named sub-phrases (or named capture) within the entire phrase.

Note 2: The general form of a repetition is n*m, defining a minimum of n and maximum of m repetitions. Short hand notation can be *m for zero to m, n* for n to infinity or just nfor n*n.

Note 3: Groupings are important for keeping alternation and concatenation behaving as expected. Concatenation has tighter binding than alternation. If

phrase1 = elem (foo / bar) blat LF
phrase2 = (elem foo) / (bar blat) LF
phrase3 = elem foo / bar blat LF

then, phrase1 matches elem foo blat or elem bar blat and both phrase2 and phrase3 match elem foo or bar blat. Be careful and use groupings liberally.

Using apg-exp—A Few Examples

Now that you have apg-exp included in your app and know the basics of writing the pattern syntax, let’s go straight to the fun part and look at some examples of how to use it.

I’ve put the boring details of the constructed object in the next section, below. You can refer to it as needed to understand the examples. I’ll also skip error handling but you should be aware that if there is a pattern error, the constructor will throw an ApgExpError exception object which has a couple of handy functions for formatted display of the pattern errors. Your try/catch block might look something like this:

try {
  var exp = new ApgExp(pattern, flags);
  var result = exp.exec(stringToMatch);
  if (result) {
    // do something with results
  } else {
    // handle failure
  }
} catch(e) {
  if (e.name === "ApgExpError") {
    // display pattern errors to console
    console.log(e.toText());
    // display pattern errors to HTML page
    $("#errors").html(e.toHtml());
  } else {
    // handle other exceptions
  }
}

Telephone Numbers

Telephone numbers present a common form validation problem and a good opportunity to present some basics. Often you just want to know that 10 digits have been entered in a form that looks something like a phone number without worrying about the details of the North American Plan or international numbers.

Let’s just require that it starts with a parenthesis, (, or a digit and has blocks of 3, 3 & 4 digits separated by zero to three non-digits. That’s general enough to accept the most common formats and specific enough to grab the area, office and subscriber codes as capture groups for any reformatting.

Here’s what it looks like in ABNF:

phone-number = ["("] area-code sep office-code sep subscriber
area-code    = 3digit                       ; 3 digits
office-code  = 3digit                       ; 3 digits
subscriber   = 4digit                       ; 4 digits
sep          = *3(%d32-47 / %d58-126 / %d9) ; 0-3 ASCII non-digits
digit        = %d48-57                      ; 0-9

Example 2 demonstrates this and gives you an opportunity to vary the phone number formats:

Example 3 let’s you play with the pattern syntax and get a look at the error messages you might expect to see:

The RegExp syntax,

\(?(\d{3})\D{0,3}(\d{3})\D{0,3}(\d{4})

is not difficult either and example 4 gives a side-by-side comparison.

Dates

Now let’s ramp it up a little an see how apg-exp and RegExp compare when matching dates.

Our date format requirements are:

mm/dd/yy or mm/dd/yyyy
dd/mm/yy or dd/mm/yyyy
mm, 1-12 or 01-12, i.e. with or without leading zero
dd, 1-31 or 01-31, i.e. with or without leading zero
yy, 00-99
yyyy, 1900-1999 or 2000-2099

The mm/dd/yyyy format itself is not so difficult, but constraining the numeric ranges is what ramps it up. Getting there with ABNF looks like this:

date     = %^ (mm-first / dd-first) %$
mm-first = mm "/" dd "/" yyyy   ; month before day
dd-first = dd "/" mm "/" yyyy   ; day before month
dd       = "0" digit1           ;    01-09
         / ("1"/"2") digit      ; or 10-29
         / "3" %d48-49          ; or 30-31
         / digit1               ; or 1-9
mm       = "0" digit1           ;    01-09
         / "1" %d48-50          ; or 10-12
         / digit1               ; or 1-9
yyyy     = ("19" / "20") 2digit ; 1900-1999 or 2000-2099
         / 2digit               ; or 00-99
digit    = %d48-57              ; 0-9
digit1   = %d49-57              ; 1-9

The comments make it fairly self-explanatory. Note that dd, mm and yyyy have the shortest alternative last. This is very important, since apg-exp always takes a “first match wins” approach to the alternatives. Alternates are tried left to right and once a match is found any remaining alternates are ignored. Here, this means that when a pattern can match one or two digits, the two-digit pattern needs to be first.

Following exactly the same approach as above, breaking down the date into alternate patterns of dd, mm and [yy]yy then combining them for the full date results in the following RegExp syntax:

^(?:((?:0[1-9])|(?:1[0-2])|(?:[1-9]))/((?:0[1-9])|(?:(?:1|2)[0-9])|(?:3[0-1])|(?:[1-9]))|((?:0[1-9])|(?:(?:1|2)[0-9])|(?:3[0-1])|(?:[1-9]))/((?:0[1-9])|(?:1[0-2])|(?:[1-9])))/((?:19|20)?[0-9][0-9])$

I’m not a RegExp expert, so there may be ways to shorten this, but this one works. You can compare for yourself in example 5. Jump over there and give it a work out.

Matching nested pairs (()) with recursion

Finally, I’d like to show how to match nested pairs of parentheses, brackets and the like. While this is a very important pattern-matching problem, you can’t do it with RegExp. Consider the following ABNF for matched pairs of parentheses:

P = L P R / L R
L = "("
R = ")"

Notice that the rule P appears in its own definition. This is called recursion. And while some flavors of regular expression engines do support recursion and some RegExp tools do provide a measure of recursion capabilities, JavaScript’s RegExp has no support for it at all. L and R above have been chosen to match parentheses, but they can be most anything as long as L can’t match the same thing as R.

Head over to example 6 and we’ll have some fun with recursion.

Before leaving the subject of matching, nested pairs, I’d like to demonstrate a couple of real-world examples that apg-exp can help you with.

In example 6 you saw how to match pairs within pairs and include text between the brackets. Suppose your assignment were to write a program such that: if the cursor is on an opening curly bracket, {, highlight the matching closing bracket, } .

Example 7 shows you a solution to this problem. You will need to understand sticky mode.

One final nested-pair example. Did you ever want to comment out a large block of HTML only to find that there were already comments in the block? Frustrating? Just do an Internet search for “HTML nested comment problems” and feel the frustration. Example 8 shows you one possible solution to this problem. Warning—this one might stretch you a little. You will need to understand the result.rules object and global mode.

Bringing it all together

To close out and bring it all together let’s see how a full form-validation example might look.

Typically when creating a new account you are asked for user name, email address, password and password confirmation. Let’s require that user names be 3-32 ASCII letters, hyphens and periods. The password must be 8-16 upper case letters, lower case letters or digits and must have at least one of each.

The form will display a descriptive error message above any invalid entry before continuing. Example 9 brings it all together.

See the Pen apg-exp: A RegExp Alternative: Example 9 by SitePoint (@SitePoint) on CodePen.

The library API

I’ve tried to provide enough information in this section to understand the examples above. However, there are a number of advanced features that require a more in-depth understanding of parsing theory to use and I’ve just indicated those with an “advanced” comment.

The input arguments

The apg-exp constructor, ApgExp here, takes up to four arguments.

var exp = new ApgExp(pattern[, flags[, nodeHits[, treeDepth]]]);
  • pattern
    • string: An SABNF pattern syntax* as described in the previous section.
    • object: An instantiated APG parser object. (Advanced option.)
  • flags: string: any of the characters "gyud"
    • g – global mode: Iterate over all pattern matches (see Example 8)
    • y – sticky mode: Anchors the match at exp.lastIndex (see Example 7)
    • u – Unicode mode: Returns results as arrays of integer character codes instead of strings
    • d – debug mode: Advanced option – exposes APG trace object
  • nodeHits: integer > 0: default = Infinity: Limits the matching algorithm to "nodeHits" steps. Protects against catastrophic backtracking.
  • treeDepth: integer > 0: default = Infinity: Limits the matching parser’s tree depth.

* For example, to match an alphanumeric name you might use the input string:

var pattern = "alphanum = alpha *(alpha / num)\n"
            + "alpha = %d65-90 / %d97-122\n"
            + "num = %d48-57\n";

Properties and methods

The constructed object, exp, itself has 16 properties and 14 methods.

Property Type Description
ast object (Advanced: the APG Abstract Syntax Tree object)
debug boolean true if the debug flag, d, was set, false otherwise
flags string a re-formatted copy of the flags argument
global boolean true if the global flag, g, was set, false otherwise
input string/array(*) a copy of the input string
lastIndex integer the character index to begin the match attempt at(**)
leftContext string/array(*) prefix of the matched pattern
lastMatch string/array(*) the matched pattern (=result[0])
nodeHits integer the input value (see result.nodeHits for actual value)
rightContext string/array(*) the suffix of the matched pattern
rules object See result.rules. Only the last matched phrase is retained here.
source string the input SABNF pattern syntax string
sticky boolean true if the sticky flag, y, was set, false otherwise
trace object (Advanced: see the APG trace object.) undefined if the debug flag is false.
treeDepth integer the input value (see result.treeDepth for actual value)
unicode boolean true if the Unicode flag, u, was set, false otherwise

Note 1: If the unicode flag is true this will be an array of integer character codes, otherwise, a string.

Note 2: The user is free to set this to any value. The pattern match always begins at this index value. Its use and value after the match attempt are affected by global and sticky mode. See examples 7 and 8.

Method Arguments Description
defineUdt(name, func) string, function Advanced: used for SABNF’s UDT feature.
exclude([string]) array of rule names a list of rule names to exclude from the result.rules object
exec(str) string attemps a pattern match in str
include([string]) array of rule names list of rule names to include exclusively in the result.rules object
maxCallStackDepth() none returns upper bound on the call stack depth
replace(str, repl) string, string or function match patten in str, replace with repl
sourceToHtml() none returns input pattern in HTML format
sourceToHtmlPage() none returns input pattern as complete HTML page
sourceToText() none returns input pattern in ASCII text format
split(str[, limit]) string, integer splits input string at pattern matches
test(str) string returns true if a pattern match is found, false otherwise (See example 9.)
toHtml() none returns this object’s properties in HTML format
toHtmlPage() none returns this object’s properties as complete HTML page
toText() none returns this object’s properties in ASCII text format

The result object

A successful pattern match returns the results in a result object with 7 properties and 3 methods:

var result = exp.exec(str);
Property Type Description
[0] string/array(*) the matched pattern
input string/array(*) a copy of the input string
index integer index of first character of the matched pattern
length integer the number of characters in the matched pattern
nodeHits integer the actual number of parsing steps required for the match
rules object (named capture) An array of all matched phrases for each named rule(***)
treeDepth integer the actual maximum parse tree depth reached during the match

Note 3: For example, result.rules["name"][i] = {phrase: string/array, index: integer} See example 8.

Method Arguments Description
toHtml() none returns properties in HTML format
toHtmlPage() none returns properties as a complete HTML page
toText() none returns properties in ASCII text format

Where to Go Next

I’ve tried to keep it to a minimum here just to give you a taste of what the apg-exp library and ABNF look like and how they stack up against RegExp. But there is much, much more you can do. If you want to improve your pattern matching skills or you are just feeling adventurous take look at these more advanced examples and refer to the full user’s guide.

Have I convinced you to give apg-exp a try in your next project? Do you think ABNF is easier to work with than regular expressions? I’d love to hear from you in the comments below!

  • stephane chretien

    Hi thomas

    This sounds really good, I have always had a big concern with regexp syntax
    Now, did you get a chance to ‘benchmark’ your engine and do a comparaison with native regexp performence (indeed In the context of NodeJs)

    Thanks

    • Lowell Thomas

      Thanks for your interest. But no, I haven’t gotten around to any benchmarks. I was more interested in added features, like recursion, Abstract Syntax Trees and user-defined (user-written JavaScript) patterns than speed.

  • JoshuaTenner

    Just pointing out that the syntax you are describing for expressions matches “peg.js” and can be an alternative. I’ve used peg quite a bit myself.

    • Lowell Thomas

      Thanks for pointing out peg.js. I hadn’t seen it before. I agree that PEG would also be a good starting point for developing a pattern-matching engine. I’ve often felt that APG, with its terminal expressions, greedy repetitions and prioritized-choice (first match wins) alternates was, in fact, equivalent to PEG. But I’ve never had the interest or CS chops to definitively prove it. A couple of points though. A parser generator by itself is not a RegExp-type pattern matching engine. That’s quite a bit more work to do. And along the way I had to add some features to APG that you probably won’t find in most PEG parser generators – especially look-behind and back references.

      • JoshuaTenner

        Yeah these are all valid points. Parser Generators are /fantastic/ for what they do.

        Thanks for the thoughtful response.

Recommended

Learn Coding Online
Learn Web Development

Start learning web development and design for free with SitePoint Premium!

Get the latest in JavaScript, once a week, for free.