5.2. Input Validation Tools including Regular Expressions

There are many ways to validate input. Number ranges can be checked using typical condtions such as less-than. If a string can only be one of a short list of possibilities, simply enumerate the possibilities and ensure that the input is one of them. If the input is extremely complex, tools often used to create compilers (such as lexers and parser generators) may be appropriate, though be sure that these tools are prepared to process malicious input.

In many cases regular expression libraries are especially useful for input validation. Many whitelists are easily expressed as regular expressions, making them a very easy tool to use. In addition, regular expression libraries are built-in or easily available in almost all language (the POSIX specification even requires one).

5.2.1. Introduction to regular expressions

The regular expression language is a simple language for describing text patterns. There are three major variants of the language in use: the very old POSIX “basic regular expresion (BRE)” format, the POSIX “extended regular expression (ERE)”, and the perl-compatible regular expression (PCRE) format. From here on we’ll assume you’re using the ERE or PCRE variations of the language. In the regular expression language, a latin letter or digit simply represents itself. A dot (period) matches any one character (with the possible exception of newline, depending on various options).

A bracketed expression matches one character, as long as that one character is one of the characters listed inside the brackets. Inside brackets the period has no special meaning (it just matches a period), and a “-” inside brackets indicates a range, so “[A-Za-z0-9]” matches one Latin alphanumeric character (presuming you’re not using EBCDIC).

You can also indicate repetition, e.g., “?” means that the previous expression is optional (may occur 0 or 1 times), “+” means the previous expression may repeat 1 or more times, and a “*” means that the previous expression may repeat 0 or more times. More generally, “{N,M}” indicates that the previous expression can occur N through M number of repetitions. Parentheses can group a sequence so that it is considered a single pattern. A much more complete discussion of regular expressions is given in [Friedl 1997].

5.2.2. Using regular expressions for input validation

The regular expression language was originally designed for searching, not for describing input filters. To use regular expressions as whitelists, your whitelists will typically begin with “^” (which normally means “match the beginning of the string”) and end with ''$'' (which normally means “match the end of the string”). Thus, you can require that an input have a Latin letter, followed by one or more digits, using this expression: “[A-Za-z][0-9]+”.

A word of warning: Regular expressions support the “|” operator, which means “any one of these”. However, the precedence of “|” is different from what many expect, and unwary developers can end up having vulnerable input validation routines as a result. For example, the expression “^x|y$” means “begins with x, or ends with y”. In practically all cases you should surround the “|” branches with parentheses when using regular expressions for input filtering, e.g., “^(x|y)$” means “either an x or a y”.

5.2.3. Regular expression denial of service (reDOS) attacks

In some cases, when using regular expressions (regexes) there is a risk of enabling regular expression denial of service (reDOS) attacks. Some regexes, on some implementations, can take exponential time and memory to process certain data. Such regexes are called "evil" regexes. Attackers can intentionally provide triggering data (and maybe regexes!) to cause this exponential growth, leading to a denial-of-service. Thus, when using regexes, developers need to avoid these regexes or limit these effects. In many cases this is not hard, once you're aware of the issue.

Fundamentally, many modern regex engines (including those in PCRE, perl, Java, etc.) use backtracking to implement regexes. In these implementations, if there is more than one potential solution for a match, if will first try one branch to try to find a match, and if it doesn't match, it will repeatedly backtrack to the last untried solution and try again until all options are exhausted. The problem is that an attacker may be able to cause many backtracks. In general, you want to bound the number of backtracks that occur. The primary risks are groups with repetition, pariticularly if they are inside more repetition or alternation with overlapping patterns. The regex "^([a-zA-Z]+)*$" with data "aaa1" involves a large number of backtracks; once the engine encounters the "1", many implementations will backtrack through all possible combinations of "+" and "*" before it can determine there is no match.

Simply avoiding the use of regexes doesn't reliably counter reDOS attacks, because naively implementing the regex processing causes exactly the same problem. There are, however, simple things that can be done. First, avoid running regexes provided by an attacker (or limit the time they can run). If you can, use a Thompson NFA-to-DFA implementation; these never backtrack and thus are immune to the problem (though they can't provide some useful functions like backreferences). Otherwise, review regexes to prevent backtracking if you can. At any point, any given character should cause only one branch to be taken in regex (just imagine that the regex is code). For every repetition, you should be able to uniquely determine if the code will repeat or not based on the single next input character. You should especially examine any repetition in a repetition - if possible, eliminate them (these in particular cause a combinatorial explosion). You can use regex fuzzers and static analysis tools to examine these. In addition, you can limit the input data size first before before using a regex; this greatly limits the effects of exponential growth in time. You can find more information in [Crosby2003] and the OWASP's "Regular Expression Denial of Service"