Optimizing regular expressions

We lispers generally look down our noses at regular expressions. Regular expressions are ugly. They are not expressive. However, they are a reality of programming. When used with care, they can express complex text patterns concisely.

Writing software almost always means processing some form of user input. When the format of that input is pre-determined and (mostly) guaranteed to be valid, regular expressions are just so much overhead. But when the validity of user input can vary widely, such as with form validation, regular expressions are much more concise than writing a custom parser.

Regular expressions are slower for simple matching

A regular expression must be compiled. Incautiously crafted patterns can take up quite a bit of memory or result in poorly performing matches. Routines that make heavy use of regular expressions can quickly become a bottlneck.

If a match operation doesn’t need features of regular expressions, such as alternating patterns or backreferences, it is faster to use string matching. It takes much less time to see if a string is terminated by a semi-colon in Python using string.endswith(';') than re.compile(r';$').search(string), especially over many iterations.

The exception is when a pattern is matched often over lengthy iterations. Once compiled, regular expression matching is O(log n). Regular string matching is typically O(n). However, it generally takes an extraordinarily large number of iterations before regular expressions become more efficient than string matching.

Optimizing for speed

Entire books have been written on this subject. Here are a few tips.

Alternating patterns (such as (abc|def)) can be expensive. Always try to put the pattern most likely to match first ((John|Rumplestiltskin), rather than (Rumplestiltskin|John)).

Avoid nesting repeating patterns when possible. They grow quickly in memory and increase the number of possible matches exponentially (thereby slowing a match down noticeably).

As the target string gets longer, matching slows drastically when using repeating patterns. Follow indefinite patterns with quantifiers ({min, max}) or with a literal or atomic group.

When possible, use anchors (^ and $) or lookahead/lookbehind to limit the scope of a pattern and make failures occur faster.

Optimizing for memory usage

Some types of expressions can grow to quite large sizes in memory. A little reading shows how apparently simple expressions can grow when compiled:

(abc|def){2,4}
is compiled as if it were
(abc|def)(abc|def)((abc|def)(abc|def)?)?

Indefinite and large quantifiers (*, +, and {min,max}) are expanded. Imagine if the example above were (abc|def){2,1000}, or if the pattern matched were more complex and contained internal quantifiers, such as (abc+|def){2,1000}. A little care is needed to prevent such exponential expansion.

One solution is to use a subroutine. Although slower, a subroutine calls back a previous match without the memory-eating expansion:

(abc+|def)(?1){1,999}

The subroutine backreferences the first grouped match and repeats it. However, subroutine matches are treated as atomic groups. When an indefinite match fails, the engine will typically backtrack and see if a smaller substring of the target matches.

For example, \w+0 matches one or more letters followed by a zero. If it is matched against “abcdefg0″, it matches all of the letters and the zero. If it is matched against “abcdefg1″, it will backtrack and attempt to match against substrings of the target (“abcdefg”, “abcdef”, “abcde”, …).

This may or may not be the desired behavior, but it will keep the pattern’s footprint down.

Links

Submit article

Artful Code