Encodef

This is the main web site for encodef, a suite of programs that tries to make it easier to process filenames in Unix/Linux/POSIX systems. You can get code, etc., from the encodef project page.

Historically, Unix/Linux/POSIX allow almost any byte in a filename, but this flexibility is the source of many problems. I describe the problem in Fixing Unix/Linux/POSIX Filenames: Control Characters (such as Newline), Leading Dashes, and Other Problems. I discuss ways of writing shell programs to work around this, using existing tools, in Filenames and Pathnames in Shell: How to do it correctly.

The “encodef” program takes filenames (which may include newlines, tabs, ESC, leading dash, space, and other nastiness), and encodes them into a format that’s easier to process. The “decodef” program reverses the process. The “xargsf” program is a stub prototype so that you can see how these integrate into the standard xargs program.
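
To make the idea of encoding concrete, here is a minimal sketch of a toy encoder written in POSIX shell. It is not encodef's actual format (see the man page for that); the function name toy_encode and the "=XX" hex escape scheme are assumptions invented purely for illustration. It shows how a name containing a newline or a leading dash can be turned into a single line of harmless characters.

```shell
# toy_encode: hypothetical helper for illustration only.
# This is NOT the encoding that encodef uses.
# Bytes outside a small safe set are written as "=" plus two hex digits.
toy_encode() {
  out=
  # od prints each byte of the name as a two-digit hex token.
  for hex in $(printf '%s' "$1" | od -An -v -tx1); do
    case $hex in
      # Safe bytes: . / 0-9 A-Z _ a-z (dash and "=" are deliberately excluded)
      2e|2f|3[0-9]|4[1-9a-f]|5[0-9a]|5f|6[1-9a-f]|7[0-9a])
        # Convert the hex token back into the literal character.
        out=$out$(printf "\\$(printf '%03o' "0x$hex")") ;;
      *)
        # Everything else (newline, space, dash, "=", ...) is escaped.
        out=$out=$hex ;;
    esac
  done
  printf '%s\n' "$out"
}

# A name with an embedded newline becomes one line of plain text.
awkward='a
b'
toy_encode "$awkward"
toy_encode '-rf'
```

Because the escape character itself is always escaped, a matching decoder could reverse this unambiguously; a real tool also has to decide which characters to escape and how to signal the end of each name.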

The Encodef man page has more details.

At this point, it's "usable", and more than adequate for prototyping and testing ideas about encoding filenames.

Feel free to download the encodef source code in tarball format (Free-libre/open source software, MIT license). It includes a self-test suite, so you can get more confidence that it works. Also, it follows common compilation and installation processes, which let you easily control how to install it (e.g., by setting DESTDIR and --prefix).

Here are a few thoughts, based in part on my experimentation with them:

  1. If POSIX systems always forbade or escaped bad filenames (such as those containing control characters), many problems would disappear.
  2. If bad filenames are possible, then there must be a way to easily deal with them. Forbidding the creation of bad filenames helps somewhat (because then in certain cases they won't happen), but then you still have to be able to process bad filenames.
  3. The conventional way to do this is to use the null byte \0 to terminate/separate filenames. This is widely supported by find and xargs; you can then store these in files, and some programs can process them. This approach is very efficient. I believe that the POSIX standard should be modified so that the POSIX shell's read command could also easily process null-terminated data (I suggest using -0), and there's a good argument for grep as well.
  4. You could also escape the filenames, and that's what encodef does. In the long term, if encoding is to be supported, I believe that at least xargs and printf(1) should be modified to directly support decoding, and find should be modified to directly support encoding. The advantage of encoding is that then any text processing tool can process (encoded) filenames. But encoding/decoding has higher overhead, creates new issues (which characters are encoded? Which encoding system?), and it's more work for utilities to implement encoding compared to using null byte separators.
  5. If bad filenames must exist, I think that it'd be best if POSIX added support for both null byte termination and encoding. Null byte termination could be used for simple common cases (using find(1), shell read, grep, storing them in a file, using xargs). For more complicated cases, encoding/decoding could be used so that the full suite of POSIX tools could be used. If the encoder/decoder could also process null byte termination, then it could fill the gap when more complex tools are needed. A tool like pax could be trivially modified to output files with null byte terminators; the encoding tool could then transform that to nicely encoded filenames with newline terminators.
  6. If you're going to encode, it's best to encode a large number of characters. This reduces the risk of improperly handling metacharacters, and also increases the likelihood that testing will detect when you've forgotten to decode an encoded filename.
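
As a rough illustration of the null-byte approach in point 3, the sketch below compares newline-separated and null-separated output from find. It does not use encodef. It assumes a find that supports -print0, an xargs that supports -0, and a mktemp that supports -d; the scratch directory and the filenames are made up for the example.

```shell
# Create a scratch directory containing one awkward filename (with a newline).
tmpdir=$(mktemp -d)
touch "$tmpdir/plain.txt" "$tmpdir/two
lines.txt"

# Newline-separated output is ambiguous: the awkward name spans two lines,
# so counting lines overcounts the files. (tr strips padding some wc's add.)
naive_count=$(find "$tmpdir" -type f | wc -l | tr -d ' ')

# Null-separated output is unambiguous: xargs -0 passes each name to the
# command as exactly one argument, and the inner shell reports how many it got.
safe_count=$(find "$tmpdir" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "newline-separated lines: $naive_count"
echo "null-separated names:    $safe_count"

rm -rf "$tmpdir"
```

With two files present, the line count should come out as three while the null-separated count stays at two, which is the mismatch that breaks naive line-oriented scripts.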

Some systems, like FreeBSD, have the tools vis(1) and unvis(1), but vis and unvis are terrible tools for this problem:

  1. vis(1) expects filenames on its command line, which it reads. That means it doesn't work easily with find(1); you end up with very complicated expressions that have to create multiple processes for each filename. It doesn't even slightly compete with the simpler find . -exec encodef {} \+
  2. unvis(1)'s decoder doesn't consider the complications of decoding in shell. The shell's command substitution removes all trailing newlines; a filename decoder should optionally append some static character so that the shell can get the data without corruption.
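
The trailing-newline problem in point 2 can be demonstrated in plain POSIX shell without any decoder at all. In this sketch, printf stands in for a decoder's output, and the sentinel character "x" is an arbitrary choice made for the example, not something unvis or encodef is known to emit.

```shell
# A filename that ends in a newline (legal on most POSIX filesystems).
name='report
'

# Plain command substitution silently drops the trailing newline.
lost=$(printf '%s' "$name")

# Workaround: have the producer append a sentinel character ("x" here),
# then strip exactly that one character after the substitution.
kept=$(printf '%s' "$name"; printf x)
kept=${kept%x}

[ "$lost" = "$name" ] && echo "plain: preserved" || echo "plain: corrupted"
[ "$kept" = "$name" ] && echo "sentinel: preserved" || echo "sentinel: corrupted"
```

A decoder that can append such a character itself saves every caller from having to wrap the command this way.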


You might want to look at my Secure Programming HOWTO web page, or some of my other writings such as Open Standards and Security, Open Source Software and Software Assurance (Security), and High Assurance (for Security or Safety) and Free-Libre / Open Source Software (FLOSS).

You can also view my home page.