Filenames and Pathnames in Shell: How to do it correctly

David A. Wheeler

2012-10-08

Traditionally, Unix/Linux/POSIX filenames and pathnames can be almost any sequence of bytes. Unfortunately, most developers and users of Bourne shells (including bash, dash, ash, and ksh) don’t handle filenames and pathnames correctly. Even good textbooks on shell programming, and many examples in the POSIX standard, get filename and pathname processing completely wrong. Thus, many shell scripts are buggy, leading to surprising failures. These failures are a significant source of security vulnerabilities (see the “Secure Programming for Linux and Unix HOWTO” section on filenames, CERT’s “Secure Coding” item MSC09-C, CWE 78, CWE 73, CWE 116, and the 2009 CWE/SANS Top 25 Most Dangerous Programming Errors).

This little essay explains how to correctly process filenames in Bourne shells. I presume that you already know how to write Bourne shell scripts.

First, some terminology. A pathname lets you select a particular file, and may include zero or more “/” characters. Thus, “/usr/bin” and “../etc/passwd” are pathnames. Most files are regular files or directories (though there are a few other kinds of files). Each pathname component (separated by “/”) is a filename; filenames cannot contain “/”. Neither filenames nor pathnames can contain the ASCII NUL character (\0), because that is the terminator. Many people use the term “filename” to include both filenames and pathnames; we’ll do that from here on.

The problem is that most Unix/Linux/POSIX systems allow filenames (including pathnames) to include any other bytes, including those for space, leading dash (-), tabs, newlines, shell metacharacters, and sequences that aren’t legal UTF-8 values. The result: programs often fail.

How to do it wrong

First, let’s go through some examples that are wrong, because the first step to fixing things is to know what’s broken. We’ll presume that there’s no header in your script (e.g., to set IFS). Given that, here’s how to process files incorrectly:

 cat * > ../collection  # WRONG

This is wrong. If a filename in the current directory begins with “-”, it will be misinterpreted as an option instead of as a filename. For example, if there’s a file named “-n”, it will suddenly enable cat’s “-n” option instead. In general you should never have a glob that begins with *; it should be prefixed with “./”. Also, if there are no (unhidden) files in the directory, the glob will return the pattern ("*") itself, so cat will be asked to read a nonexistent file named “*”.

 for file in * ; do  # WRONG
   cat "$file" >> ../collection
 done

Also wrong, for the same reason; a file named “-n” will fool the cat program, and if the pattern does not match, the loop will execute once with the pattern itself as the value.

 cat $(find . -type f) > ../collection  # WRONG

Wrong. If any filename contains a space, newline, or tab, its name will be split (file “a b” will be incorrectly parsed as two files, “a” and “b”).

 ( for file in $(find . -type f) ; do  # WRONG
     cat "$file"
   done ) > ../collection

Wrong, for the same reason; it breaks up filenames that contain space, newline, or tab.

 ( find . -type f |   # WRONG
   while read filename ; do cat "$filename" ; done ) > ../collection

Wrong. This works if a filename has spaces in the middle, but it won’t work correctly if the filename begins or ends with whitespace (they will get chopped off). Also, if a filename includes “\”, it’ll get corrupted; in particular, if it ends in “\”, it will be combined with the next filename (trashing both). In general, using “read” in shell without the “-r” option is usually a mistake, and in many cases you should set IFS="" just before the read.

 ( find . -type f | xargs cat ) > ../collection # WRONG, WAY WRONG

Wrong. By default, xargs’ input is parsed, so space characters (as well as newlines) separate arguments, and the backslash, apostrophe, and double-quote characters are used for quoting. According to the POSIX standard, you have to include the option -E "" or underscore may have a special meaning too. Note that many of the examples in the POSIX standard xargs section are wrong; filenames with spaces, newlines, or many other characters will cause many of the examples to fail.

 ( find . -type f |
   while IFS="" read -r filename ; do cat "$filename" ; done ) \
          > ../collection # WRONG

Wrong. Actually, this is fine if filenames can’t include newline, but if filenames can include newline, then any line-at-a-time processing of filenames (such as this one) will fail on filenames that include newline.

 cat $filename  # WRONG

Wrong. If $filename can contain an arbitrary filename, then it could include shell metacharacters like “*”, which would then be interpreted.
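
The splitting problem is easy to see for yourself. Here is a small sketch (not from any standard; the helper name count_args and the use of "mktemp -d" are my own assumptions) that creates one file named “a b” and counts how many arguments a command receives from an unquoted command substitution versus from find's own -exec:

```shell
# Hypothetical demo: one file whose name contains a space.
demo_dir=$(mktemp -d)            # assumes mktemp -d is available
: > "$demo_dir/a b"              # create an empty file named "a b"

count_args() { echo "$#"; }      # helper: prints how many arguments it got

# WRONG: with the default IFS, the unquoted substitution is split at the space.
split_count=$(count_args $(find "$demo_dir" -type f))

# Correct: find passes each pathname to the command as a single argument.
exec_count=$(find "$demo_dir" -type f -exec sh -c 'echo "$#"' sh {} +)

echo "unquoted substitution: $split_count argument(s)"
echo "find -exec:            $exec_count argument(s)"
rm -rf "$demo_dir"
```

The single file shows up as two arguments in the first case and as one in the second, assuming the temporary directory's own path contains no whitespace.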

Doing it correctly: A quick summary

So, how can you process filenames correctly in shell? Here's a quick summary about how to do it correctly, for the impatient who "just want the answer". In short: double-quote variable references (use "$variable" instead of $variable), set IFS to just newline and tab, prefix all globs/filenames so they cannot begin with "-" when expanded, and use one of a few templates that work correctly. Here are some of those templates:

 IFS="$(printf '\n\t')"   # Remove 'space', so filenames with spaces work well.

 # Correct glob use: always use "for" loop, prefix glob, check for existence:
 for file in ./* ; do        # Use "./*", NEVER bare "*"
   if [ -e "$file" ] ; then  # Make sure it isn't an empty match
     COMMAND ... "$file" ...
   fi
 done

 # Correct glob use, but requires nonstandard bash extension:
 shopt -s nullglob  # Bash extension, so globs with no matches return empty
 for file in ./* ; do        # Use "./*", NEVER bare "*"
   COMMAND ... "$file" ...
 done

 # These handle all filenames correctly; can be unwieldy if COMMAND is large:
 find ... -exec COMMAND... {} \;
 find ... -exec COMMAND... {} \+  # If multiple files are okay for COMMAND

 # Okay if filenames can't contain tabs or newlines; beware the assumption:
 IFS="$(printf '\n\t')"
 for file in $(find .) ; do
   COMMAND "$file" ...
 done

 # Skip filenames with embedded control chars, including newline and tab:
 IFS="$(printf '\n\t')"
 controlchars="$(printf '*[\001-\037\177]*')"
 for file in $(find . ! -name "$controlchars") ; do
   COMMAND "$file" ...
 done

 # Requires nonstandard but common extensions in find and xargs:
 find . -print0 | xargs -0 COMMAND

 # Requires nonstandard extensions to find and to shell (bash works);
 # variables might not stay set once the loop ends:
 find . -print0 | while IFS="" read -r -d "" file ; do ...
   COMMAND "$file" # Use quoted "$file", not $file, everywhere.
 done

 # Requires nonstandard extensions to find and to shell (bash works);
 # underlying system must inc. named pipes (FIFOs) or the /dev/fd mechanism.
 # In this version, variables *do* stay set after the loop ends, and
 # you can read from stdin (change the 4s to another number if fd 4 is needed):
 while IFS="" read -r -d "" file <&4 ; do
   COMMAND "$file" # Use quoted "$file", not $file, everywhere.
 done 4< <(find . -print0)

 # Named pipe version.
 # Requires nonstandard extensions to find and to shell's read (bash works);
 # underlying system must inc. named pipes (FIFOs). Again,
 # in this version, variables *do* stay set after the loop ends, and
 # you can read from stdin (change the 4s to something else if fd 4 needed).
 mkfifo mypipe
 find . -print0 > mypipe &
 while IFS="" read -r -d "" file <&4 ; do
   COMMAND "$file" # Use quoted "$file", not $file, everywhere.
 done 4< mypipe

 # Requires author's nul2pfb program. This uses "find . -print0"; for
 # POSIX 2008 compliance, replace that with: find . -exec printf '%s\0' {} \;
 for encoded_filename in $(find . -print0 | nul2pfb) ; do
   filename="$(printf "%bX" "$encoded_filename")" ; filename="${filename%X}"
   # Use "$filename" from here on...
 done

Below, I explain this in more detail.

How to do it right: The simple stuff

Here are some basic rules on how to correctly process filenames (where COMMAND is some arbitrary command).

Double-quote variable uses and substitutions

Always surround with double-quotes (") any substitution that might produce input field separator characters (by default these are space, newline, and tab), including one that might produce a filename. For example, when using (not setting) a variable or command substitution with a filename, surround it with double quotes. Otherwise, they’ll be expanded and misinterpreted. You don’t need to surround variable references if you know it can contain only alphanumeric characters, but it’s best to get in the habit, since the script might change in the future. Note: There is no portable way to store truly arbitrary multiple filenames in a single shell variable, because a filename might include awkward sequences such as newline, so see below for options on how to deal with this. Here are some examples:

 Don't use               Instead use
 $filename               "$filename"
 $(pwd)                  "$(pwd)"
 $(dirname $filename)    "$(dirname "$filename")"
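
As a quick illustration of the table, here is a sketch using a made-up pathname with spaces (the pathname and the count_args helper are hypothetical; dirname does not need the file to exist):

```shell
filename="/tmp/my dir/report 1.txt"   # hypothetical pathname with two spaces

count_args() { echo "$#"; }           # helper: prints how many arguments it got

unquoted=$(count_args $filename)      # default IFS splits this at each space
quoted=$(count_args "$filename")      # quoted: always exactly one argument

# Quote both the inner variable and the outer substitution.
dir="$(dirname "$filename")"

echo "unquoted: $unquoted, quoted: $quoted, dir: $dir"
```

Only the quoted forms keep the pathname in one piece.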

Set IFS at the start of each script

One of the first non-comment commands in every shell script should be:

   IFS="$(printf '\n\t')"
   # or:
   IFS="`printf '\n\t'`"
   # or, not portable but widely supported:
   IFS=$'\n\t'

This sets the IFS variable so that the “space” character is no longer an input field separator, and thus only newline and tab are field separators. Setting IFS and always using double-quotes around filename variables eliminates most of the problems caused by filenames with spaces. It also eliminates some other common scripting errors involving space, so this is generally a very good idea.
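
To see the effect, here is a minimal sketch (the two names and the count_args helper are invented for the demonstration). The expansion is left unquoted on purpose, to show where splitting now happens:

```shell
IFS="$(printf '\n\t')"           # newline and tab only; space is no longer a separator

count_args() { echo "$#"; }      # helper: prints how many arguments it got

# Two hypothetical names, each containing a space, separated by a newline.
names="$(printf 'first file\nsecond file')"

count=$(count_args $names)       # deliberately unquoted
echo "$count"
```

With this IFS the list splits only at the newline, so each name stays whole; with the default IFS the same expansion would be split at the spaces as well.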

You can still build a list of command options inside a single shell variable, even when space isn’t in IFS. However, you need to use tab or newline to separate parameters, and not space. You can embed filenames in this variable, even if it has spaces in it. However, you can only include filenames this way if each filename does not include a newline, tab, or a shell globbing character (“*” or “?” or “[” at least, possibly “!”), does not begin with “-”, and does not begin with a tilde. Here’s an example:

  tab="$(printf '\t')"     # Use tab as separator
  options="--option1${tab}--option2${tab}option_filename"
  options="$options${tab}--anotheroption"  # Build up the options.
  COMMAND $options "$filename"
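
Here is a runnable sketch of the same idea, where show_args is a hypothetical stand-in for COMMAND that prints each argument it receives inside angle brackets, and the option names are made up:

```shell
IFS="$(printf '\n\t')"           # space is no longer a field separator
tab="$(printf '\t')"

show_args() { printf '<%s>\n' "$@"; }    # hypothetical stand-in for COMMAND

options="--title${tab}My Report"         # an option value containing a space
options="$options${tab}--verbose"        # build up the options
output=$(show_args $options "final file.txt")
printf '%s\n' "$output"
```

Each tab-separated piece arrives as its own argument, and the embedded spaces are left alone.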

You might also want to put “set -eu” at the beginning of your scripts; it does nothing for filenames, but it can help detect other script errors.

Prefix all globs/filenames

A "glob" is a pattern for filename matching like “*.pdf”. Whenever you use globbing to select files, never begin with a globbing character (typically the characters “*”, “?”, or “[”) or with a value that might begin with “-”. If you’re starting from the current directory, prefix the glob with “./”. In short, use:

 cat ./*                   # Use this, NOT "cat *" ... Must have 1+ files.
 for file in ./* ; do      # Use this, NOT "for file in *" (beware empty lists)
   ...
 done

Similarly, if you read in a filename, if it begins with “-” you should immediately prefix it with “./”.

If you always prefix filenames (e.g., those acquired through globs), then filenames starting with “-” will always be handled correctly. Globbing is often the easiest way to handle all files, or a subset of them, in a specific directory, but you need to make sure you do it correctly.
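
A small sketch of why the prefix matters, assuming "mktemp -d" is available (the directory and file contents are made up). It creates a file literally named “-n” and reads it through a prefixed glob:

```shell
demo_dir=$(mktemp -d)                    # assumes mktemp -d is available
printf 'hello\n' > "$demo_dir/-n"        # a file literally named "-n"

# Inside the directory, "./*" expands to "./-n", which cat treats as a file.
# A bare "*" would expand to "-n", which cat would treat as an option.
content=$(cd "$demo_dir" && cat ./*)
echo "$content"

rm -rf "$demo_dir"
```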

The good news about globbing in shell is that glob expansion is done after field splitting (the step that IFS controls), so as long as you directly use globs as command parameters or in for...in... lists, you will have no problem with filenames containing whitespace or control characters (including newline).

Many books instead recommend that you carefully use “--” on every command before each filename. I think that is stupid and completely impractical advice. Many programs don’t accept “--” at all. Even when they do, in practice it’s just too hard to remember to use “--” perfectly every time you invoke a command; sooner or later you will make a mistake. It’s usually much easier to fix the locations where the filename pattern is specified or the input is provided; there are typically fewer of them, they tend to be easier to find, and this approach works even with the many commands that do not accept “--”.

Remember that globbing normally skips hidden files (those beginning with "."). Often that is what you want; you may want to use "find" instead if that is not what you want.

Beware of globs if there might be empty lists of filenames

Beware of globbing if there might be no matches with the pattern (and this is often the case). By default, if a glob like ./*.pdf matches no files, then the original glob pattern will be returned instead. This is almost never what you want, e.g., in a "for" loop this will cause the loop to execute once, but with the pattern instead of a filename!

You can use globbing in a for loop, even if it might not match anything, using one of two approaches. One approach, which is completely portable, is to re-test for the existence of the file before using it in the loop:

 for file in ./* ; do        # Use this, NOT "for file in *"
   if [ -e "$file" ] ; then  # Make sure it exists and isn't an empty match
     COMMAND ... "$file" ...
   fi
 done

This is both ugly and a little inefficient (you have to re-test each file again). There are also pathological cases where the pattern doesn't match but there is a file that is identical to the unmatched pattern (though for typical patterns that can't happen), so you have to check to see if that could happen.
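
Here is a sketch of the portable approach run against an empty directory (assuming "mktemp -d" is available and a shell with the traditional behavior for unmatched globs; the pattern starts with an absolute path, so no "./" prefix is needed):

```shell
demo_dir=$(mktemp -d)            # a fresh, empty directory

matches=0
for file in "$demo_dir"/*.pdf ; do
  if [ -e "$file" ] ; then       # false for the unmatched pattern itself
    matches=$((matches + 1))
  fi
done

echo "real matches: $matches"
rm -rf "$demo_dir"
```

Without the test, the loop body would have run on the literal pattern.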

A more efficient solution for empty matches is a nonstandard shell extension called "null globbing", which replaces an unmatched pattern with nothing at all. In bash you can enable nullglob with "shopt -s nullglob". In zsh, you can use setopt NULL_GLOB for the same result. Then this will work correctly:

 shopt -s nullglob  # Bash extension, so that empty glob matches will work
 for file in ./* ; do        # Use this, NOT "for file in *"
     COMMAND ... "$file" ...
 done

If the match might be empty, you should normally not use globbing as part of a command. Thus, use "cat ./*.pdf" only if you know there's at least one .pdf file. One exception: If you enable null globbing, and if the command does nothing when handed an empty list of files, then things will be fine. But this condition is often untrue, and in any case, if there are too many matches it will also fail. In short, in robust scripts, globbing should normally be used only as a "for" loop's list.

The globstar extension

Traditional globbing is only useful when you want to process files in a particular directory. Some shells have added a "globstar" extension, but it is nonstandard and has various limitations. I discuss it here, but you probably want to use find (discussed next). With the globstar extension, the pattern "**" returns every filename (including directories) in the current directory, recursively; it omits dot files, doesn't descend into dot dirs, and sorts the file list.

Bash version 4 recently added this, but you must enable it with "shopt -s globstar". The zsh shell originally came up with this, and ksh93 was the first to copy it (but in ksh you have to enable it with "set -G"). Note that there's no standard way to invoke it!

If you use this in a for loop list and combine it with nullglob, you can handle absolutely all filenames easily and efficiently, including the empty case. That sounds great, but watch the fine print... I think there are many reasons to avoid this right now. It's nonstandard, and gives you little control over the recursion. Most importantly, at least some implementations have trouble if there are links in the directories. Bash 4, at least, can get stuck in infinite loops if there are links. In many cases, find is currently the better approach for reliably doing recursive descent into directories.

Use find correctly

If you want to process files beyond what normal globbing can do (e.g., recursively handle directories), or you don't like the limitations on having to re-check for non-matches, use find. The find command is always passed a starting directory; as long as the starting directory doesn’t begin with “-” you won’t have a problem with leading “-”. The find command can be badly misused; here are ways that work.

  1. If what you do with each file is simple, then just use the following very portable construct, which works on all filenames:
     find ... -exec COMMAND... {} \;
    
    This gets ugly fast if COMMAND becomes complicated. Also, a separate process is started to run COMMAND for every file, causing overhead if there are lots of files.

  2. If the command is simple but there are many files, you can use this, which also works on all filenames:
     find ... -exec COMMAND... {} \+
    
    This causes a set of files to be listed, not one file at a time. It’s in the POSIX standard, though not all implementations of find include it.

  3. Although technically non-standard extensions, many systems (including GNU and *BSDs) have a "-print0" option in find, and a corresponding "-0" option with xargs, which makes it easy to do this:
     find ... -print0 | xargs -0 ....
    
  4. If you begin your shell script with IFS="$(printf '\n\t')" (as recommended above), and filenames cannot include tab or newline, then you can use find in the “normal” way, either inside `...` or $(...) on a command line, or with a normal “for” loop:
      # CORRECT if filenames can't include tab/newline *and* if IFS omits space:
      COMMAND $(find .)
      # OR:
      for file in $(find .) ; do
        COMMAND "$file" ...
      done
    

    This is a simple and clear solution, and it can handle filenames with spaces, leading dashes, most shell metacharacters, and so on. One caveat: the unquoted result still undergoes pathname expansion, so a filename containing a globbing character (“*”, “?”, or “[”) can be expanded again unless you disable globbing with “set -f”. In short, this is the best and clearest solution for non-trivial processing as long as filenames cannot include tab or newline. Below we discuss how to ignore filenames with control characters (like tab or newline); if you add that, then this correctly handles or ignores all filenames. Note that `...` or $(...) on a command line cannot handle lists that are too large; if that might be a problem, use the for loop instead.

    Similarly, as long as filenames can’t include tab or newline, you can store filenames in files with one record per newline-separated line, and tabs can separate the fields. This format is well supported by tools like cut, join, and paste. I think it’d be best if POSIX systems simply forbade filenames from including control characters like tab and newline; many programs assume it anyway, and many filesystems require it. Then these constructs would just work, no matter what.

  5. Here’s another solution as long as filenames cannot include newline:
     find . | while IFS="" read -r file ; do ...
       COMMAND "$file" # Use "$file" not $file everywhere.
     done
    

    This one can handle tabs in filenames. However, the values of any variables set in the while loop will be lost when the loop ends (the pipe causes the loop to run in a subshell). In addition, you cannot read from standard input inside the loop. I recommend using the previous “for” loop instead, in most cases, because the “for” loop is easier to understand and more flexible.

  6. One head-busting solution handles arbitrary filenames using find, but it doesn't scale to larger commands (quoting will quickly become too complicated), and it's ugly as well:
    find . -exec sh -c '
     for file do
        ...
     done' sh {} +
    

  7. You could combine xargs with find using a pipe, and use newlines to separate filenames, but don’t do it. The problem is that xargs interprets many characters in surprising ways, so it’s hard to use xargs correctly when using newlines as separators. The correct portable way to use xargs with newline separators requires that you pipe filenames through another command like sed to do character substitutions, like this:
     find . | sed -e 's/[^A-Za-z0-9]/\\&/g' | xargs -E "" COMMAND
    
    This is complicated, hard to read, ridiculously inefficient, and isn't better than many other alternatives (e.g., it doesn't handle newlines either). Don’t do this; instead, use one of the better ways described in this paper.

  8. A big advantage of find is that it has lots of options for controlling how you process files. You can use options to limit it to one directory, determine the ordering, and so on. It normally processes all files (including hidden ones), but you can use this pattern to skip hidden files:
     find . -name '.?*' -prune -o ....
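
To make approach 6 from the list above concrete, here is a sketch that runs it on a scratch directory containing awkward names, including one with an embedded newline. The directory contents are invented for the demonstration, and it assumes "mktemp -d" and a find that supports "-exec ... {} +":

```shell
demo_dir=$(mktemp -d)
: > "$demo_dir/plain"
: > "$demo_dir/has space"
: > "$demo_dir/$(printf 'has\nnewline')"

# The inner sh receives each pathname as a positional parameter,
# so "for file do" iterates over them with no splitting or globbing.
result=$(find "$demo_dir" -type f -exec sh -c '
  n=0
  for file do
    [ -f "$file" ] && n=$((n + 1))
  done
  echo "$n"' sh {} +)

echo "files seen intact: $result"
rm -rf "$demo_dir"
```

Every name reaches the loop as one intact argument, which is why the -f test succeeds for each of them. With very many files, find may run the inner script more than once, so a count like this would then be printed once per batch.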
    

Consider ignoring filenames with control characters

If you might have files whose names contain newline or tab, and you need to use find, you can sometimes simply declare that filenames with control characters (including newline and tab) are “bad” and should be ignored. If you can simply ignore “bad” files, you can easily add an option to find to skip such filenames, and use all the simple approaches above that use find. Here’s an example:

  # Correctly skips filenames with control chars, inc. newline and tab:
  controlchars="$(printf '*[\001-\037\177]*')"
  for file in $(find . ! -name "$controlchars") ; do
    COMMAND "$file" ...
  done
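
Here is the same loop run against a scratch directory, as a sketch. Two things are my own additions rather than part of the template: "LC_ALL=C" on find, so the byte range in the pattern is interpreted predictably, and "set -f", so the unquoted result is not globbed. The file names are invented:

```shell
demo_dir=$(mktemp -d)
: > "$demo_dir/good name"
: > "$demo_dir/$(printf 'bad\tname')"    # name with an embedded tab

IFS="$(printf '\n\t')"
set -f                                   # my addition: no globbing of the result
controlchars="$(printf '*[\001-\037\177]*')"
kept=0
for file in $(LC_ALL=C find "$demo_dir" -type f ! -name "$controlchars") ; do
  kept=$((kept + 1))
  printf 'keeping: %s\n' "$file"
done
set +f
rm -rf "$demo_dir"
```

The file whose name contains a tab is never handed to the loop, so only the name with a space is processed.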

Using null-separated filenames

In many cases, the previous section’s approaches are sufficient, especially if you skip filenames with control characters whenever you use find.

Still, let’s imagine that you need to walk directories, and you truly must handle arbitrary filenames including ones with newline and tab. You could create a shell function to walk the directory tree, using globbing and recursion. However, the “easy” way to implement this in shell can get stuck in infinite loops when confronted with certain kinds of symbolic links and hard links. Re-implementing accurate directory traversal in the shell is possible, but both painful and silly. After all, the find tool is specifically designed to handle this stuff; it’d be better to just use find instead of rewriting find.

A common solution in this case is to use byte 0 (aka \0 or null) to separate filenames, and use tools like find to walk the directory. This works because filenames, by definition, cannot include byte 0.

There are many downsides to this approach:

But if you want maximum generality when recursing into subdirectories, this is the usual way to do it. So, let’s look at ways to do this:

  1. Simple use of find -print0 and xargs:
    # CORRECT but nonstandard:
    find . -print0 | xargs -0 COMMAND
    
    Note that COMMAND is run on a set of files, but not necessarily on all of them, and processes restart each time, so it’s hard to keep track of things or stop processing in the middle. If you want to use xargs, then this is the way to do it, but note that it requires a non-standard extension.

  2. Using find -print0 and a while loop:
     find . -print0 |
     while IFS="" read -r -d "" file ; do ...
       COMMAND "$file" # Use "$file" not $file everywhere.
     done
    
    This depends on find’s nonstandard -print0 option, as well as the bash-specific read -d option (the -d with empty string makes \0 be the delimiter). Note that you have to set IFS to be empty; otherwise, a filename that includes IFS characters at the end would be corrupted (see the POSIX.1-2008 specification lines 103920-103925). This works well in many cases, but it does have a subtle weakness: If you need to process filenames in a way that “remembers” what happens as you go, or afterwards, it doesn’t work well. Because we use a pipe, the "find" and "while..." commands will be executed in separate processes in separate subshells. Thus, if any variables are set inside the “while” loop, their values will disappear once we exit the loop (because the loop’s subshell will disappear).

  3. Using find -print0, a while loop, and process substitution
    # This handles all filenames, but uses nonstandard extensions:
    while IFS="" read -r -d "" file ; do ...
      COMMAND "$file" # Use "$file" not $file everywhere.
      # You can set variables, and they'll stay set.
    done < <(find . -print0)
    
    With this version we can now loop through all the filenames, no matter what they are, and retain any variable values we set. Unfortunately, this construct is hard to read and non-portable. It not only requires the use of a nonstandard find option (-print0), but uses the nonstandard "process substitution" extension (bash, zsh, and ksh 93 have it; dash and ksh 88 do not). In fact, process substitution doesn’t even work on all systems that support bash; it has to support named pipes or /dev/fd (thankfully these are common). This approach means we can’t read the original standard input (standard input is used to provide the filenames), which in many programs would be a problem, leading us to our final version...

  4. Using find -print0, a while loop, process substitution, and another fd
    # This handles all filenames, but uses nonstandard extensions
    # and is a little ugly too:
    while IFS="" read -r -d "" file <&4 ; do ...
      COMMAND "$file" # Use "$file" not $file everywhere.
      # You can set variables, and they'll stay set.
    done 4< <(find . -print0)
    
    Here is the most general file looping mechanism while staying within shell. This version can loop through all filenames, you can set variables and retain their value, and you can simultaneously use stdin. Change the "4" in both places to some other number more than 2 if the loop needs to use file descriptor 4. (I previously used bash's "-u" option for read, but that isn't standard while <&4 is in the POSIX standard.) Again, this needs lots of nonstandard extensions. Also, the shell "read" command is often a little slow (it's often implemented by reading a character at a time), which is a basic downside of all the while...read constructs.
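
For a quick check that null separation really does survive every awkward name, here is a sketch based on the first approach in the list (find -print0 with xargs -0). It assumes "mktemp -d" plus the nonstandard -print0 and -0 extensions; the names are made up:

```shell
demo_dir=$(mktemp -d)
: > "$demo_dir/$(printf 'tab\there')"
: > "$demo_dir/$(printf 'new\nline')"
: > "$demo_dir/-dash"

# Each null-terminated pathname becomes exactly one argument of the inner sh.
count=$(find "$demo_dir" -type f -print0 |
        xargs -0 sh -c 'echo "$#"' sh)

echo "arguments received: $count"
rm -rf "$demo_dir"
```

Three files go in and three arguments come out, even though one name contains a tab and another a newline.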

Of course, if filenames can contain newline and tab, you can’t use those characters as separators inside a shell variable or data file. Many shells (including bash) do not permit shell variables to contain byte 0, making this more difficult. One possible solution is to use shell arrays, if your shell supports them, but that gets even more complicated.

After a certain point, you may find it easier to switch languages.

Encoding filenames

The usual in-the-field approach to dealing with all possible filenames is null byte termination; it's simple, has some support in key tools, and you can easily store lists of filenames this way (you can easily store them in files with other data per file if the filename is the last item). However, it is possible to encode filenames so that all filenames can be handled. Unfortunately, there is no single standard encoding for filenames that doesn't use null bytes; instead, there are many similar yet incompatible encodings. What's worse, utilities often do not support any of them well. But, let's take a look.

printf(1) %b (pfb) encoding

One approach is to use "printf %b" encoding, pfb encoding for short. This is the encoding supported by the printf(1) "%b" format, and it has several advantages: printf is part of the POSIX 2008 specification, it is widely implemented, and it is typically a shell builtin (making it speedier). This encoding uses the backslash ('\') to introduce an escape sequence, and the POSIX specification supports '\\' (for backslash itself), '\a', '\b', '\f', '\n', '\r', '\t', '\v', and '\0ddd' (where ddd is a 0-3 digit octal number). This is really easy to read, because it's similar to other formats. As long as you escape all characters ranged 1-31, you can easily include the encoded filename in other tab-delimited, newline-per-record files. (If you're on an EBCDIC system (!), you can still escape \n, but in EBCDIC the \n is outside the range of 1-31.) There's a gotcha in shell, though; filenames can end in newline, and that would be consumed if it's bare in a command substitution. So, this would be a wrong way to decode it:

 # WRONG: this decoding fails if encoded_filename ends in \n
 cat encoded_filenames |
  while IFS="" read -r encoded_filename ; do
    filename="$(printf "%b" "$encoded_filename")"
    ...
  done

Instead, you need to decode such filenames this way in shell:

 # CORRECT: Unencodes all possible filenames.
 cat pfb_encoded_filenames |
  while IFS="" read -r encoded_filename ; do
    filename="$(printf "%bX" "$encoded_filename")" ; filename="${filename%X}"
    ...
  done
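
The difference between the two decoders can be checked directly. In this sketch the encoded string is a made-up example whose decoded form ends in a newline; the variable names are mine:

```shell
# Hypothetical pfb-encoded name; the real filename ends with a newline.
encoded='ends\nwith\nnewline\n'

# WRONG: command substitution strips the trailing newline.
wrong="$(printf '%b' "$encoded")"

# CORRECT: the sentinel X protects the newline, then X is removed.
right="$(printf '%bX' "$encoded")" ; right="${right%X}"

echo "wrong length: ${#wrong}"
echo "right length: ${#right}"
```

The decoded name is 18 characters long; the first decoder silently loses the last one.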

But what about generating or encoding filenames into pfb format? Unfortunately, the POSIX standard does not include an easy option for find or other tools to generate this format. The easiest way to do this is to use the widely-implemented -print0 option with find, then filter it through something to convert to this format. Here's a small C program I wrote, nul2pfb.c, which converts filename lists ending in \0 to line-oriented lists using the pfb escapes.

There are several different ways to encode into pfb; I recommend encoding newline as \n and tab as \t, to make them obvious, and encoding space as \040 (so you don't need to worry about unintentional splitting). You really need to encode all characters less than or equal to 32, and those greater than or equal to 127. Yes, that makes international characters harder to read, but it preserves them, and that's more important. Since you cannot be sure of the character encoding used in filenames, and there's no guarantee that a filename obeys the rules anyway, it's safest to encode everything. And since we're going that far, we may as well encode the shell metacharacters "*", "?", "[", and "!", so that if we forget to surround variable references with double-quotes we'll be okay. In the end, I decided to encode every character except for alphanumerics, "/", ".", "_", and ":", since that gives maximum safety while still making most ordinary system filenames readable. (This encodes "-", which is good because a leading "-" can cause problems). A nice side-effect of encoding so many characters is that if you forget to decode them, you'll probably detect that fairly early in testing. It'd be nice if find and ls could generate pfb encoding directly, and if xargs could directly process them, but at least you can use null byte encoding to get there.

GNU ls includes some nonstandard quoting options, but none of them (at this time) quite output the pfb format. You would think its "-b" (aka --escape or --quoting-style=escape) would be the same thing, but it's not; this option changes space to '\ ', and pfb won't accept that. GNU ls' -Q (--quote-name or --quoting-style=c) encloses filenames in double-quotes, but even when you strip out the external double-quotes, it generates '\"' for double-quotes which also isn't in pfb. At this time it's easier to use find, and then use depth control if you just want one level.

Anyway, this brings us to a final approach to handling filenames in shell. This one lets us use a for loop, which is very nice, because we now keep all file descriptors (including stdin) and variable setting works as expected. The encoding and decoding we have to do is unfortunate, but there it is:

 # CORRECT, requires find -print0 and author's nul2pfb program.
 for encoded_filename in $(find . -print0 | nul2pfb) ; do
   filename="$(printf "%bX" "$encoded_filename")" ; filename="${filename%X}"
   # Use "$filename" from here on...
 done

Other encodings

There are many other possible encodings. GNU ls has several ways to encode filenames; its "-b" format is quite reasonable, and is similar to pfb, but note that it puts \ in front of space, so it's a little more trouble to decode.

Another option is percent encoding, aka URL encoding. Basically, any byte can be represented by '%' followed by a 2-digit hexadecimal number.
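
As a rough sketch of what such an encoder could look like in shell, here is a function (the name percent_encode is hypothetical) that encodes every byte, which is simpler than deciding which bytes are safe to leave alone. It relies only on od, tr, and sed:

```shell
# Hypothetical helper: percent-encode every byte of its first argument.
percent_encode() {
  # Dump the bytes as hex pairs, then drop od's spaces and newlines.
  pe_hex=$(printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n')
  # Put a '%' in front of each two-digit hex pair.
  printf '%s\n' "$pe_hex" | sed 's/\(..\)/%\1/g'
}

encoded=$(percent_encode 'a b')
echo "$encoded"
```

The result is longer than necessary, since even letters are encoded, but it contains only "%" and hex digits, so it is safe inside newline- or tab-separated data.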

The default format of xargs is a stinking, rotting mess, and I do not recommend it at all. It's not compatible with anything, including the find command it's supposed to work with. You can double-quote or single-quote text, but they have to end before the end of a line. The only way to encode newline is to precede it with "\", outside of a double-quote or single, which makes it hideously hard to deal with. You're better off using the widely-supported -0 option, even if it is technically non-portable. If you want to use encodings, it'd be nice if xargs supported pfb encoding, or if there was some other widely-supported encoding.

Displaying or storing filenames

Try to avoid displaying filenames. Filenames could contain control characters that control the terminal and/or X-windows, causing nasty side-effects on display. Displaying filenames can even cause a security vulnerability. If you must display them, consider stripping out control characters first.
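
If you do have to display a filename, one possible way to strip control characters is sketched below. The helper name strip_controls is invented for this example. It removes only ASCII control bytes (0x00-0x1F and 0x7F); it does not catch other troublesome sequences, such as C1 control characters encoded in UTF-8, so treat it as a minimum and not a complete defense.

```sh
# strip_controls is a hypothetical helper name.
# LC_ALL=C makes tr work byte by byte, so it deletes ASCII control
# characters and passes every other byte through unchanged.
strip_controls() {
  printf '%s' "$1" | LC_ALL=C tr -d '[:cntrl:]'
}

# Example: a name containing an escape sequence that would clear the screen.
filename="$(printf 'evil\033[2Jname')"
printf 'Processing: %s\n' "$(strip_controls "$filename")"
```

Use the stripped value only for display; keep using the original "$filename" when you actually open or process the file.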

Similarly, if a filename is stored as data in a file or sent elsewhere (e.g., as part of HTML or XML), you’ll need to escape the filename as necessary.

In addition, you have no way of knowing for certain what the filename’s character encoding is, so if you got a filename from someone else who uses non-ASCII characters, you’re likely to end up with garbage (mojibake). In practice, what most people do is exchange filenames in UTF-8. If you both use the same locale, you could do something else, but UTF-8 is the only encoding in wide use that can handle arbitrary languages. I encourage you to always encode filenames in UTF-8.

Could the POSIX standard be changed to make file processing easier?

The POSIX standard could (and should!) be modified to make it easier to handle the outrageously permissive filenames that are permitted today. Basically, we need extensions to make globbing and find easier to use.

Globbing

There are two basic problems with globbing:

  1. Globbing in shell returns junk (the pattern) when there are zero matches. There should be a shell option (typically called a "nullglob" option) so an empty list is returned if nothing matches and there was at least one metacharacter. Oddly enough, the underlying glob() function has an option that's close to this, but there's no standard way for shells to take advantage of it! Bash, ksh, and others have support, but not in a common standard way, and glob() doesn't support exactly what is needed either (bugid:247).
  2. Globbing normally returns filenames beginning with "-". There should be an option that, when set, prepends "./" to any glob result that begins with "-". This should be an option for glob(), as well as for the shell. I think the standard should also state that implementations may enable this by default. Not all real-world commands support "--", and users often forget to add it; we need a mechanism that automatically deals with filenames beginning with "-".
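
Until options like these exist, both problems can be worked around by hand in portable shell. The sketch below is one way to do it; the function name list_pdfs and the "*.pdf" pattern are just placeholders for illustration.

```sh
# list_pdfs is a hypothetical helper name.
# Prefixing the pattern with "./" means no result can begin with "-".
list_pdfs() {
  for file in ./*.pdf ; do
    # With zero matches the loop sees the literal pattern "./*.pdf";
    # the existence tests filter it out (-L also keeps dangling symlinks).
    [ -e "$file" ] || [ -L "$file" ] || continue
    printf '%s\n' "$file"
  done
}
```

This is exactly the kind of boilerplate that a standard nullglob option and an automatic "./" prefix would make unnecessary.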

Find / null separators

There also needs to be a standard way to use find with arbitrary filenames. The normal way to handle this is by separating filenames with the null (\0) character; a few changes would simplify this:

  1. Extend existing commands to generate or use null-separated filename lists. At the least, add "find -print0" (bugid:243) and "xargs -0" (bugid:244) since these are already widely implemented. For consistency, I think "-0" should be the standard option name for null-separated lists. It'd be useful to add "grep -0" (GNU grep accepts either -Z or --null) and "sort -0" (GNU sort uses -z). Once added, this common template would be portable:
     find . -print0 | xargs -0 COMMAND
    
  2. Extend the shell's read so that it can easily read null-separated streams. (bugid:245) Bash can do this today, but it's painful; the command is IFS="" read -d "" -r, which is overly complicated. I believe there should be a new "-0" option for read, which says "ignore IFS, and just read until the next \0 byte" (Here's a bash 4.1 patch). You can then do this (which makes it easier to have long command sequences, as long as you don't need stdin):
     find . -print0 | while read -0 file ; do ... done
    
  3. Extend the shell so that its for loop can handle a null-separated list. This one is harder; it's not obvious how to do this. My current theory is that there should be a new shell option "nullseparator"; when enabled, IFS is ignored, and instead \0 is the input separator. Then, extend the shell's for loop syntax so that if you say "then in" instead of "in", this mode is temporarily enabled while the list is processed (the original setting is then restored). (I originally had "null in", but using "then in" means that no new keywords are needed.) You could then do this:
     for file then in $(find . -print0) do ... done
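
None of these extensions is standard yet, so it may help to see a stopgap that needs no separator at all: let find pass the names as arguments to a child shell. The sketch below uses only standard find features; the function name count_regular_files is invented for this example, and it assumes the directory argument does not itself begin with "-".

```sh
# count_regular_files is a hypothetical helper name.
# find passes the names as arguments to a child shell, so no separator
# character is involved and any byte in a filename is safe.
count_regular_files() {
  find "$1" -type f -exec sh -c '
    for file do
      # Use "$file" here; this sketch just emits one marker line per file.
      printf "x\n"
    done' sh {} + | wc -l
}
```

The drawback is that the loop body runs in a separate shell, so variables set inside it are not visible to the calling script. That limitation is one reason the extensions proposed above would still be worth having.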
    

As a side note, it'd be nice if the $'...' construct was standard, as it makes certain things easier (bugid:249).

A quick aside about newline

Newline can be a little tricky to get into a shell variable. You can't do:
  newline="$(printf '\n')"
This fails because command substitution, $(...), removes any trailing newlines from the command's output.

One alternative is:

newline='
'
But this can get corrupted by programs that change the encoding of file end-of-lines.

The following is a standards-compliant trick to get newline into a variable:

newline="$(printf '\nX')"
newline="${newline%X}"
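
One common use for such a variable is an IFS of only tab and newline, so that spaces in names no longer split words. The following sketch shows the effect; count_words is a hypothetical helper that does its splitting in a subshell, so the caller's IFS and globbing settings are left alone.

```sh
newline="$(printf '\nX')"
newline="${newline%X}"
tab="$(printf '\t')"

# count_words is a hypothetical helper: it splits its argument using an IFS
# of tab and newline only, with globbing disabled, and prints the word count.
count_words() (
  IFS="${tab}${newline}"
  set -f
  set -- $1
  printf '%s\n' "$#"
)

count_words "two words${newline}second line"   # prints 2: spaces do not split
```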

If filenames were limited, would it be better?

Shell programming is remarkably easy in many cases; what’s sad is that this common case (file processing) is far more complicated than it needs to be. Fundamentally, the rules on filenames are too permissive. Extending POSIX would make it somewhat easier, and we should do that. However, it would be much simpler if systems imposed a few simple rules on filenames, such as prohibiting control characters (bugid:251), prohibiting leading “-”, and requiring filenames to be UTF-8. Then you could always print filenames safely, and these “normal” shell constructs would always work:

 # This works if filenames never begin with "-" and nullglob is enabled:
 for file in *.pdf ; do ... done           # Use "$file" not $file
 # This works if filenames have no control chars and IFS is tab and newline:
 for file in $(find .) ; do ... done        # Use "$file" not $file

I think that we should both extend the POSIX standard and limit the permitted filenames. Not all systems will limit filenames, so we need standard mechanisms for them. But the new standard mechanisms simply can't be as simple as restricting filenames; restricting filenames makes systems far easier to use correctly.

Please see my paper on fixing Unix/Linux filenames for more about this.

I've also done some work on how to encode/decode filenames; see the encodef home page for more information.

But for now, this is how to handle filenames properly in shell programs.


Feel free to see my home page at http://www.dwheeler.com. You may also want to look at my paper Why OSS/FS? Look at the Numbers! and my book on how to develop secure programs. And, of course, my paper on fixing Unix/Linux/POSIX filenames.

(C) Copyright 2010-2011 David A. Wheeler. Released under Creative Commons CC-BY-SA (any version), GNU GPL v2+, and the Open Publication License (version 1.0 or later). You can use this under any of those licenses; if you do not say otherwise, then you release it under all of them. In addition, Mendel Cooper has explicit authorization to include this (or any modified portion) as part of his "Advanced Bash Scripting Guide". Let me know if you need other exceptions; my goal is to get this information out to the world!