First, make sure you identify all inputs from potentially untrusted users, so that you validate them all. Where you can, eliminate the inputs or make it impossible for untrusted users to provide information to them. At each remaining input from potentially untrusted users you need to validate the data that comes in.
You should determine what is legal, as narrowly as you reasonably can, and reject anything that does not match that definition. The rules that define what is legal, and by implication reject everything else, are called a whitelist. Do not do the reverse, that is, do not try to identify what is illegal and write code to reject those cases. This bad approach, where you try to list everything that should be rejected, is called blacklisting; the list of inputs that should be rejected is called a blacklist. Blacklisting typically leads to security vulnerabilities, because you are likely to forget to handle one or more important cases of illegal input. Improper input validation is such a common cause of security vulnerabilities that it has its own CWE identifier, CWE-20.
There is a good reason for identifying “illegal” values, though: they make a good set of tests for checking that your validation code is thorough. Some of these tests may be executed only in your head, but at least a few should become test cases. When I set up an input filter, I mentally attack my whitelist with a few pre-identified illegal values to make sure that the obvious ones will not get through. Depending on the input, here are a few examples of common “illegal” values that your input filters may need to prevent: the empty string, “.”, “..”, “../”, anything starting with “/” or “.”, anything with “/” or “&” inside it, any control characters (especially NUL and newline), and/or any characters with the “high bit” set (especially decimal values 254 and 255; character 133 is the Unicode Next Line character used by OS/390). Again, your code should not be checking for “bad” values; you should do this check mentally to be sure that your pattern ruthlessly limits input values to legal values. If your pattern isn’t sufficiently narrow, you need to carefully re-examine the pattern to see if there are other problems.
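As a minimal sketch of this approach, the following Python fragment defines a whitelist for one field and then attacks it with the illegal values listed above. The field (a username), its pattern, and the names `is_valid_username` and `whitelist_leaks` are assumptions made for illustration; your own definition of “legal” will differ.

```python
import re

# Hypothetical rule: a username is 1 to 32 characters, starts with a
# lowercase ASCII letter, and continues with lowercase ASCII letters,
# digits, or underscores. Both length limits are part of the pattern.
USERNAME_PATTERN = re.compile(r"[a-z][a-z0-9_]{0,31}")


def is_valid_username(value: str) -> bool:
    """Return True only if the whole string matches the whitelist."""
    # fullmatch() anchors the pattern at both ends, so a trailing
    # newline or extra characters cannot slip past it.
    return USERNAME_PATTERN.fullmatch(value) is not None


# Pre-identified illegal values. They are used only to test the
# whitelist; the validation code itself never looks for them.
ILLEGAL_SAMPLES = [
    "", ".", "..", "../", "/etc/passwd", ".hidden",
    "a/b", "a&b", "a\x00b", "name\n",
    "caf\xe9", "\x85", "\xfe", "\xff",
    "a" * 33,  # one character over the maximum length
]


def whitelist_leaks() -> list:
    """Return every illegal sample that the whitelist wrongly accepts."""
    return [s for s in ILLEGAL_SAMPLES if is_valid_username(s)]
```

If `whitelist_leaks()` ever returns a non-empty list, the fix is to narrow the pattern, not to add a special case for the value that got through.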
Limit the maximum character length (and minimum length if appropriate), and be sure to not lose control when such lengths are exceeded (see Chapter 6 for more about buffer overflows).
Here are a few common data types, and things you should validate before using them from an untrusted user:
For strings, identify the legal characters or legal patterns (e.g., as a regular expression) and reject anything not matching that form. There are special problems when strings contain control characters (especially linefeed or NUL) or metacharacters (especially shell metacharacters); it is often best to “escape” such metacharacters immediately when the input is received so that such characters are not accidentally sent. CERT goes further and recommends escaping all characters that aren’t in a list of characters not needing escaping [CERT 1998, CMU 1998]. See Section 8.3 for more information on metacharacters. Note that line ending encodings vary on different computers: Unix-based systems use character 0x0a (linefeed), CP/M and DOS based systems (including Windows) use 0x0d 0x0a (carriage-return linefeed, and some programs incorrectly reverse the order), the Apple MacOS uses 0x0d (carriage return), and IBM OS/390 uses 0x85 (next line, sometimes called newline).
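One way to cope with the differing line endings is to fold them all into a single form as soon as the text arrives, and then refuse any control character that remains. The sketch below assumes the input has already been decoded to a Python string (so that 0x85 appears as U+0085); the name `normalize_text` is hypothetical, and rejecting tab along with the other control characters is a choice you may want to relax.

```python
import re
from typing import Optional

# The line endings described above: CR LF, the reversed LF CR,
# a lone CR, and next line (0x85). A lone LF is already in the
# target form and is left alone.
_LINE_ENDINGS = re.compile(r"\r\n|\n\r|\r|\x85")

# Every control character except linefeed: the C0 range without
# 0x0a, DEL (0x7f), and the C1 range (0x80 to 0x9f).
_CONTROL_CHARS = re.compile(r"[\x00-\x09\x0b-\x1f\x7f-\x9f]")


def normalize_text(value: str) -> Optional[str]:
    """Return the text with linefeed-only line endings, or None if
    it contains any other control character (including NUL)."""
    normalized = _LINE_ENDINGS.sub("\n", value)
    if _CONTROL_CHARS.search(normalized):
        return None
    return normalized
```

This only settles line endings and control characters; the result still has to pass the whitelist for the particular field.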
Limit all numbers to the minimum (often zero) and maximum allowed values.
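A range check is only as good as the parsing that precedes it. In Python, for instance, `int()` by itself accepts surrounding whitespace, a leading sign, underscores between digits, and non-ASCII digits, so a sketch of a bounded parser might first whitelist the textual form. The function name and the 10-digit cap are assumptions for illustration.

```python
import re
from typing import Optional

# Only plain ASCII digits, and at most 10 of them, so that absurdly
# long digit strings are refused before any conversion is attempted.
_PLAIN_DECIMAL = re.compile(r"[0-9]{1,10}")


def parse_bounded_int(text: str, minimum: int, maximum: int) -> Optional[int]:
    """Return the number if text is a plain decimal integer within
    [minimum, maximum]; otherwise return None."""
    if _PLAIN_DECIMAL.fullmatch(text) is None:
        return None
    number = int(text)
    if number < minimum or number > maximum:
        return None
    return number
```

For example, `parse_bounded_int(port_text, 1, 65535)` would accept a TCP port number and nothing else. Callers must compare the result with `is None`, since 0 can be a legal value.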
A full email address checker is actually quite complicated, because there are legacy formats that greatly complicate validation if you need to support all of them; see mailaddr(7) and IETF RFC 822 [RFC 822] for more information if such checking is necessary. Friedl developed a regular expression to check if an email address is valid (according to the specification); his “short” regular expression is 4,724 characters, and his “optimized” expression (in appendix B) is 6,598 characters long. And even that regular expression isn’t perfect; it can’t recognize local email addresses, and it can’t handle nested parentheses in comments (as the specification permits). Often you can simplify and only permit the “common” Internet address formats.
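To give a sense of what “common” might mean in practice, here is a deliberately simplified sketch. It is not an RFC 822 checker: it accepts only an unquoted local part followed by “@” and a dotted domain name, and it refuses comments, quoted strings, and local addresses. The pattern, the length limits, and the name `is_common_email` are assumptions; adjust them to the addresses you actually need to support.

```python
import re

_COMMON_EMAIL = re.compile(
    # Local part: a conservative set of unquoted characters.
    r"[A-Za-z0-9._%+-]{1,64}"
    r"@"
    # One or more domain labels, each starting and ending with a
    # letter or digit, each followed by a dot.
    r"(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+"
    # Final label: letters only.
    r"[A-Za-z]{2,63}"
)


def is_common_email(value: str) -> bool:
    """Return True for addresses of the simple form user@host.domain."""
    # The overall length cap is an assumption of this sketch.
    return len(value) <= 254 and _COMMON_EMAIL.fullmatch(value) is not None
```

The pattern is looser than the specification in some places (it allows consecutive dots in the local part) and much stricter in others, which is the trade-off being made: a short rule that can be read and reviewed, at the cost of turning away some legal but unusual addresses.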
Filenames should be checked; see Section 5.6 for more information on filenames.
URIs (including URLs) should be checked for validity. If you are directly acting on a URI (i.e., you’re implementing a web server or web-server-like program and the URL is a request for your data), make sure the URI is valid, and be especially careful of URIs that try to “escape” the document root (the area of the filesystem that the server is responding to). The most common ways to escape the document root are via “..” or a symbolic link, so most servers check any “..” directories themselves and ignore symbolic links unless specially directed. Also remember to decode any encoding first (via URL encoding or UTF-8 encoding), or an encoded “..” could slip through. URIs aren’t supposed to even include UTF-8 encoding, so the safest thing is to reject any URIs that include characters with high bits set.
If you are implementing a system that uses the URI/URL as data, you are not home free either; you need to ensure that malicious users can’t insert URIs that will harm other users. See Section 5.13.4 for more information about this.
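For the first case, where a program acts directly on a request URI, the checks might be sketched as follows: reject high-bit and control characters, decode exactly once, whitelist what remains, refuse any “..” segment, and finally confirm that the result is still under the document root. The root directory, the character whitelist, and the name `resolve_request_path` are assumptions for illustration. The sketch works on strings only, so it does not detect symbolic links; that needs a separate check against the real filesystem.

```python
import posixpath
import re
from typing import Optional
from urllib.parse import unquote

DOCUMENT_ROOT = "/var/www/html"  # hypothetical document root

# Hypothetical whitelist for the decoded path.
_LEGAL_DECODED_PATH = re.compile(r"/[A-Za-z0-9._/-]*")


def resolve_request_path(raw_path: str,
                         document_root: str = DOCUMENT_ROOT) -> Optional[str]:
    """Map a request path to a path under document_root, or return
    None if the request is not acceptable."""
    # Reject control characters, spaces, DEL, and high-bit characters.
    if any(ord(ch) < 0x21 or ord(ch) > 0x7E for ch in raw_path):
        return None
    # Decode percent-escapes first, so an encoded ".." cannot hide.
    # errors="strict" makes invalid UTF-8 (such as an overlong
    # encoding of ".") raise instead of being silently replaced.
    try:
        decoded = unquote(raw_path, errors="strict")
    except UnicodeDecodeError:
        return None
    if _LEGAL_DECODED_PATH.fullmatch(decoded) is None:
        return None
    # Refuse ".." outright rather than trying to resolve it.
    if ".." in decoded.split("/"):
        return None
    candidate = posixpath.normpath(
        posixpath.join(document_root, decoded.lstrip("/")))
    # Final check: the result must still be inside the document root.
    if candidate != document_root and not candidate.startswith(document_root + "/"):
        return None
    return candidate
```

Decoding happens only once here, so a doubly encoded request such as “%252e%252e” decodes to a string that still contains “%” and is rejected by the whitelist.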
When accepting cookie values, make sure to check that the domain value for any cookie you’re using is the expected one. Otherwise, a (possibly cracked) related site might be able to insert spoofed cookies. Here’s an example from IETF RFC 2965 of how failing to do this check could cause a problem:
User agent makes request to victim.cracker.edu, gets back cookie session_id="1234" and sets the default domain victim.cracker.edu.
User agent makes request to spoof.cracker.edu, gets back cookie session-id="1111", with Domain=".cracker.edu".
User agent makes request to victim.cracker.edu again, and passes:
Cookie: $Version="1"; session_id="1234", $Version="1"; session_id="1111"; $Domain=".cracker.edu"
Unless you account for them, the legal character patterns must not include characters or character sequences that have special meaning to either the program internals or the eventual output:
A character sequence may have special meaning to the program’s internal storage format. For example, if you store data (internally or externally) in delimited strings, make sure that the delimiters are not permitted data values. A number of programs store data in comma (,) or colon (:) delimited text files; inserting the delimiters in the input can be a problem unless the program accounts for it (i.e., by preventing it or encoding it in some way). Other characters often causing these problems include single and double quotes (used for surrounding strings) and the less-than sign "<" (used in SGML, XML, and HTML to indicate a tag’s beginning; this is important if you store data in these formats). Most data formats have an escape sequence to handle these cases; use it, or filter such data on input.
A character sequence may have special meaning if sent back out to a user. A common example of this is permitting HTML tags in data input that will later be posted to other readers (e.g., in a guestbook or “reader comment” area). However, the problem is much more general. See Section 7.16 for a general discussion on the topic, and see Section 5.13 for a specific discussion about filtering HTML.
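Both cases can be handled by letting the data format’s own escape mechanism do the work. The sketch below uses Python’s standard `csv` module for comma-delimited storage and `html.escape` for text that will be placed in an HTML page; the three function names are hypothetical. It shows only the encoding approach; you may prefer to forbid the delimiters on input instead.

```python
import csv
import html
import io


def to_csv_line(fields):
    """Encode fields as one comma-delimited line; the csv module
    quotes any field that contains a comma, quote, or newline."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue()


def from_csv_line(line):
    """Decode one line produced by to_csv_line() back into fields."""
    return next(csv.reader(io.StringIO(line)))


def to_html_text(value):
    """Escape &, <, >, and both kinds of quote so the value is
    displayed as text instead of being interpreted as markup."""
    return html.escape(value, quote=True)
```

With these helpers, a value such as `Smith, John` survives a round trip through the delimited file as a single field, and a submitted `<script>` tag reaches other readers as visible text, not as an element the browser acts on.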
These tests should usually be centralized in one place so that the validity tests can be easily examined for correctness later.
Make sure that your validity test is actually correct; this is particularly a problem when checking input that will be used by another program (such as a filename, email address, or URL). Often these tests have subtle errors, producing the so-called “deputy problem” (where the checking program makes different assumptions than the program that actually uses the data). If there’s a relevant standard, look at it, but also search to see if the program has extensions that you need to know about.
While parsing user input, it’s a good idea to temporarily drop all privileges, or even create separate processes (with the parser having permanently dropped privileges, and the other process performing security checks against the parser requests). This is especially true if the parsing task is complex (e.g., if you use a lex-like or yacc-like tool), or if the programming language doesn’t protect against buffer overflows (e.g., C and C++). See Section 7.4 for more information on minimizing privileges.
When using data for security decisions (e.g., “let this user in”), be sure to use trustworthy channels. For example, on a public Internet, don’t just use the machine IP address or port number as the sole way to authenticate users, because in most environments this information can be set by the (potentially malicious) user. See Section 7.12 for more information.