David A. Wheeler's Blog

Sun, 26 Jul 2009

Limiting Unix/Linux/POSIX filenames simplifies things: Lowercasing filenames

My essay Fixing Unix/Linux/POSIX Filenames: Control Characters (such as Newline), Leading Dashes, and Other Problems argues that adding some limitations on legal Unix/Linux/POSIX filenames would be an improvement. In particular, a few minor limitations (which most people assume anyway) would eliminate certain kinds of bugs, some of which end up being security vulnerabilities. Forbidding crazy things (like control characters in filenames) simplifies creating programs that work all the time.

Here’s a little example of this. I wanted to convert all the filenames inside a directory tree to all lowercase letters. I didn’t want to lose any files without checking on them first, so I wanted it to ask before doing a rename in a way that would eliminate a file (i.e., I wanted to use mv -i). I didn’t find such a program built into my distro, so I wrote a short script to do it (which is just as well, because it makes a nice simple example). I wanted it to be portable, since I might need it again later.

So how do we write this? A simple glob like “*” won’t work: the script needs to descend recursively through a tree of directories, and simple globs skip hidden filesystem objects (which I want to include). I could write a more complex glob that included hidden files and directories and recursed down through subdirectories, but the naive way of recursing can have many problems (e.g., it could get stuck in endless loops created by symbolic links). If we need to handle a tree recursively, there’s a better tool designed for the purpose: find.

Unfortunately, an ordinary find . has an interesting problem — it will pick the upper-level names first, and if we rename the upper-level names first, find will fail when it tries to enter them (since they will no longer exist). No problem — if we are manipulating the tree structure (including renames), we can use the -depth option of find, which will process each directory’s contents before the directory itself. We can then rename just the basename of what find returns, so we won’t change anything before find descends into it.
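To make the ordering difference concrete, here is a small sketch that builds a throwaway tree and lists it both ways. The names Top, Sub, and File are invented for the demonstration, and mktemp -d is assumed to be available (it is widespread, though not required by older POSIX editions).

```shell
#!/bin/sh
# Sketch: compare find's default listing with the -depth listing.
# Top/Sub/File are made-up names; mktemp -d is assumed to exist.
set -eu

demo=`mktemp -d`                     # throwaway directory
mkdir -p "$demo/Top/Sub"
: > "$demo/Top/Sub/File"

# Run each find in a subshell so the caller's working directory is untouched.
pre=`cd "$demo" && find .`           # parents listed before their contents
post=`cd "$demo" && find . -depth`   # contents listed before their parents

rm -rf "$demo"

printf 'default order:\n%s\n\n' "$pre"
printf 'with -depth:\n%s\n' "$post"
```

The default listing starts with "." and works downward to ./Top/Sub/File; the -depth listing is the reverse for this tree, starting at ./Top/Sub/File and ending with ".". Renaming in the second order never pulls a directory out from under an entry that is still waiting to be processed.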

Now, if we can assume that newlines and tabs cannot be in filenames, as recommended in Fixing Unix/Linux/POSIX Filenames…, then a simple for loop around the results of find is enough. My shell script mklowercase renames files to lowercase letters recursively. Here is its essence:

  #!/bin/sh
  # mklowercase - change all filenames to lowercase recursively from "." down.
  # Will prompt if there's an existing file of that name (mv -i)
  # Presumes that filenames don't include newline or tab.

  set -eu
  set -f                       # No globbing: names with * ? [ stay literal.
  IFS=`printf '\n\t'`
  
  for file in `find . -depth` ; do
    [ "." = "$file" ] && continue                  # Skip "." entry.
    dir=`dirname "$file"`
    base=`basename "$file"`
    oldname="$dir/$base"
    newbase=`printf "%s" "$base" | tr A-Z a-z`
    newname="$dir/$newbase"
    if [ "$oldname" != "$newname" ] ; then
      mv -i "$file" "$newname"
    fi
  done

This script skips “.”, which is not strictly necessary, but I thought it would be a good idea to point out that you may need to skip “.” sometimes.

Yes, this could be modified to handle literally all possible Unix/Linux/POSIX filenames, but those modifications make it more complicated and uglier. One approach would be to have one script run find…-exec, which then invokes a second script to do the renaming. But then you have to maintain two scripts and keep them in sync. You could embed the command into find, but then the find command becomes hideously complicated.
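For a sense of what "embedded into find" might look like, here is one possible sketch, not the script I actually use. The function name lower_tree is made up for illustration. It uses parameter expansion instead of dirname/basename and appends a sentinel "x" before lowercasing, because command substitution silently drops trailing newlines.

```shell
#!/bin/sh
# Sketch: the rename logic embedded in find via -exec sh -c.
# lower_tree is a hypothetical name; it works from "." down.
set -eu

lower_tree() {
  find . -depth -exec sh -c '
    for file do
      dir=${file%/*}            # everything before the last slash
      base=${file##*/}          # everything after the last slash
      # Append "x" so a trailing newline survives command substitution.
      newbase=`printf "%sx" "$base" | tr A-Z a-z`
      newbase=${newbase%x}      # strip the sentinel again
      if [ "$base" != "$newbase" ] ; then
        mv -i "$file" "$dir/$newbase"
      fi
    done
  ' sh {} +
}
```

Run it as lower_tree from the top of the tree to be converted. Since find passes the names as arguments rather than through a pipe, mv -i can still read its answer from the terminal. It works on paper, but compare its readability with the simple for loop above.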

Another solution to handling all filenames would be to change the loop to:

  find . -depth -print0 |
  while IFS="" read -r -d '' file ; do ...

However, this requires non-standard extensions to find (-print0, from GNU) and to the shell (bash’s read -d), as well as being uglier and more complicated. Also, if “mv” is implemented as required by the Single UNIX Specification, then “mv -i” will fail badly if it tries to rename a file onto an existing name. That’s because it sends its prompt to stderr but expects the response from stdin, and inside this loop stdin is the pipe carrying the list of filenames! So mv would read part of the filename list as its answer.
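The stdin collision is easy to reproduce without touching any files. In this sketch, ask is a made-up stand-in for mv -i: it reads one line from stdin as its "answer". To stay in plain POSIX sh, the list is newline-separated rather than NUL-separated, but the collision is the same.

```shell
#!/bin/sh
# Sketch of the stdin collision. "ask" is a hypothetical stand-in for
# mv -i: like mv -i, it reads its answer from standard input.
ask() {
  read -r answer || answer=""
  printf 'answer for %s: %s\n' "$1" "$answer"
}

# Broken: ask shares stdin with the loop, so it eats the next list entry.
collide() {
  printf '%s\n' A B C D |
  while IFS= read -r file ; do
    ask "$file"
  done
}

# One workaround: give the inner command its own stdin.
# Here that is /dev/null; a real script would more likely use /dev/tty.
no_collide() {
  printf '%s\n' A B C D |
  while IFS= read -r file ; do
    ask "$file" < /dev/null
  done
}
```

Calling collide prints "answer for A: B" and "answer for C: D": entries B and D were swallowed as answers and never processed. Calling no_collide visits all four entries, each with an empty answer. Redirecting from /dev/tty restores the prompt, at the cost of yet more complication.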

And it’s all silly anyway. If you put newlines in filenames, lots of scripts fail. It’s simply too much of a pain to deal with them “correctly”. Which is the point of Fixing Unix/Linux/POSIX Filenames: adding some limitations on legal Unix/Linux/POSIX filenames would be an improvement. At the least, by default let’s forbid control characters (so simple “find” and filename display are safe), forbid leading dash characters (so simple globbing is safe), require that all filenames be UTF-8 (so displaying filenames always works), and perhaps forbid trailing spaces (since these are dangerously misleading to end-users). I would like to see kernels build in the mechanisms to forbid certain kinds of filenames, so that administrators can then specify the specific “bad filename” policy they would like to use.
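Until kernels offer something like that, a script can at least check names itself. The following is only a rough user-space sketch of three of the rules above; the function name is_bad_name is made up, and it deliberately leaves out the UTF-8 validity check, which needs more than shell pattern matching.

```shell
#!/bin/sh
# Sketch: flag a single filename (not a path) that breaks the policy above.
# is_bad_name is a hypothetical helper; UTF-8 validity is NOT checked here.
# Returns 0 (true) if the name is "bad", 1 otherwise.
is_bad_name() {
  name=$1
  case $name in
    -*)            return 0 ;;   # leading dash
    *' ')          return 0 ;;   # trailing space
    *[[:cntrl:]]*) return 0 ;;   # any control character (newline, tab, ...)
  esac
  return 1
}
```

For example, is_bad_name "-rf" and is_bad_name "report " both succeed, while is_bad_name "report.txt" fails. A check like this is advisory only; nothing stops another program from creating such a name a moment later, which is why enforcement belongs in the kernel.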

So please take a look at: Fixing Unix/Linux/POSIX Filenames: Control Characters (such as Newline), Leading Dashes, and Other Problems. I’ve made a few recent additions, thanks to some interesting comments people have sent, but the basic message is the same.
