Traditionally, Unix/Linux/POSIX filenames and pathnames can be almost any sequence of bytes. Unfortunately, most developers and users of Bourne shells (including bash, dash, ash, and ksh) don’t handle filenames and pathnames correctly. Even good textbooks on shell programming, and many examples in the POSIX standard, get filename and pathname processing completely wrong. Thus, many shell scripts are buggy, leading to surprising failures. These failures are a significant source of security vulnerabilities (see the “Secure Programming for Linux and Unix HOWTO” section on filenames, CERT’s “Secure Coding” item MSC09-C, CWE 78, CWE 73, CWE 116, and the 2009 CWE/SANS Top 25 Most Dangerous Programming Errors).
This little essay explains how to correctly process filenames in Bourne shells. I presume that you already know how to write Bourne shell scripts.
First, some terminology. A pathname lets you select a particular file, and may include zero or more “/” characters. Thus, “/usr/bin” and “../etc/passwd” are pathnames. Most files are regular files or directories (though there are a few other kinds of files). Each pathname component (separated by “/”) is a filename; filenames cannot contain “/”. Neither filenames nor pathnames can contain the ASCII NUL character (\0), because that is the terminator. Many people use the term “filename” to include both filenames and pathnames; we’ll do that from here on.
The problem is that most Unix/Linux/POSIX systems allow filenames (including pathnames) to include any other bytes, including those for space, leading dash (-), tabs, newlines, shell metacharacters, and sequences that aren’t legal UTF-8 values. The result: programs often fail.
First, let’s go through some examples that are wrong, because the first step to fixing things is to know what’s broken. We’ll presume that there’s no header in your script (e.g., to set IFS). Given that, here’s how to process files incorrectly:
cat * > ../collection # WRONG
This is wrong. If a filename in the current directory begins with “-”, it will be misinterpreted as an option instead of as a filename. For example, if there’s a file named “-n”, it will suddenly enable cat’s “-n” option instead. In general you should never have a glob that begins with *; it should be prefixed with “./”. Also, if there are no (unhidden) files in the directory, the glob will return the pattern ("*") itself, so cat will be handed the unprocessed pattern instead of a list of files.
for file in * ; do # WRONG
cat "$file" >> ../collection
done
Also wrong, for the same reason; a file named “-n” will fool the cat program, and if the pattern does not match, it will loop once with the pattern itself as the value.
cat $(find . -type f) > ../collection # WRONG
Wrong. If any filename contains a space, newline, or tab, its name will be split (file “a b” will be incorrectly parsed as two files, “a” and “b”).
( for file in $(find . -type f) ; do # WRONG
cat "$file"
done ) > ../collection
Wrong, for the same reason; it breaks up filenames that contain space, newline, or tab.
( find . -type f | # WRONG
while read filename ; do cat "$filename" ; done ) > ../collection
Wrong. This works if a filename has spaces in the middle, but it won’t work correctly if the filename begins or ends with whitespace (they will get chopped off). Also, if a filename includes “\”, it’ll get corrupted; in particular, if it ends in “\”, it will be combined with the next filename (trashing both). In general, using “read” in shell without the “-r” option is usually a mistake, and in many cases you should set IFS="" just before the read.
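Here's a small sketch of the two failure modes just described (trimmed whitespace and mangled backslashes). The sample string and the variable names are made up for illustration, and it assumes the default IFS (hence the unset):

```shell
unset IFS                     # assumption: start from the default IFS
line=' lead\slash '           # leading space, a backslash, trailing space

# Plain "read": trims the surrounding whitespace and eats the backslash.
plain=$(printf '%s\n' "$line" | { read name ; printf '%s' "$name" ; })

# IFS="" plus -r: the line comes through unchanged.
safe=$(printf '%s\n' "$line" | { IFS="" read -r name ; printf '%s' "$name" ; })

printf 'plain=[%s]\nsafe=[%s]\n' "$plain" "$safe"
```

The first result loses both the spaces and the backslash; the second keeps them. Neither form helps with embedded newlines, of course.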
( find . -type f | xargs cat ) > ../collection # WRONG, WAY WRONG
Wrong. By default, xargs’ input is parsed, so space characters (as well as newlines) separate arguments, and the backslash, apostrophe, double-quote, and ampersand characters are used for quoting. According to the POSIX standard, you have to include the option -E "" or underscore may have a special meaning too. Note that many of the examples in the POSIX standard xargs section are wrong; filenames with spaces, newlines, or many other characters will cause many of the examples to fail.
( find . -type f |
while IFS="" read -r filename ; do cat "$filename" ; done ) \
> ../collection # WRONG
Wrong. Actually, this is fine if filenames can’t include newline, but if filenames can include newline, then any line-at-a-time processing of filenames (such as this one) will fail on filenames that include newline.
cat $filename
Wrong. If $filename can contain an arbitrary filename, then it could include shell metacharacters like “*”, which would then be interpreted.
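To see the metacharacter problem concretely, here's a minimal sketch. The count_args helper and the scratch directory are hypothetical, and mktemp -d is assumed to be available (it's common, but not in POSIX):

```shell
count_args() { echo "$#" ; }     # hypothetical helper: how many arguments arrived?

demo_dir=$(mktemp -d)            # scratch directory, so nothing real is touched
: > "$demo_dir/one" ; : > "$demo_dir/two"

filename='*'                     # a perfectly legal filename
unquoted=$(cd "$demo_dir" && count_args $filename)    # glob-expanded to the files present
quoted=$(cd "$demo_dir" && count_args "$filename")    # stays the single literal name
rm -rf "$demo_dir"

printf 'unquoted=%s quoted=%s\n' "$unquoted" "$quoted"
```

With two files in the directory, the unquoted form hands the command two arguments, neither of which is the file you meant.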
So, how can you process filenames correctly in shell? Here's a quick summary about how to do it correctly, for the impatient who "just want the answer". In short: Double-quote to use "$variable" instead of $variable, set IFS to just newline and tab, prefix all globs/filenames so they cannot begin with "-" when expanded, and use one of a few templates that work correctly. Here are some of those templates that work correctly:
IFS="$(printf '\n\t')" # Remove 'space', so filenames with spaces work well.
# Correct glob use: always use "for" loop, prefix glob, check for existence:
for file in ./* ; do # Use "./*", NEVER bare "*"
if [ -e "$file" ] ; then # Make sure it isn't an empty match
COMMAND ... "$file" ...
fi
done
# Correct glob use, but requires nonstandard bash extension:
shopt -s nullglob # Bash extension, so globs with no matches return empty
for file in ./* ; do # Use "./*", NEVER bare "*"
COMMAND ... "$file" ...
done
# These handle all filenames correctly; can be unwieldy if COMMAND is large:
find ... -exec COMMAND... {} \;
find ... -exec COMMAND... {} \+ # If multiple files are okay for COMMAND
# Okay if filenames can't contain tabs or newlines; beware the assumption:
IFS="$(printf '\n\t')"
for file in $(find .) ; do
COMMAND "$file" ...
done
# Skip filenames with embedded control chars, including newline and tab:
IFS="$(printf '\n\t')"
controlchars="$(printf '*[\001-\037\177]*')"
for file in $(find . ! -name "$controlchars") ; do
COMMAND "$file" ...
done
# Requires nonstandard but common extensions in find and xargs:
find . -print0 | xargs -0 COMMAND
# Requires nonstandard extensions to find and to shell (bash works);
# variables might not stay set once the loop ends:
find . -print0 | while IFS="" read -r -d "" file ; do ...
COMMAND "$file" # Use quoted "$file", not $file, everywhere.
done
# Requires nonstandard extensions to find and to shell (bash works);
# underlying system must inc. named pipes (FIFOs) or the /dev/fd mechanism.
# In this version, variables *do* stay set after the loop ends, and
# you can read from stdin (change the 4s to another number if fd 4 is needed):
while IFS="" read -r -d "" file <&4 ; do
COMMAND "$file" # Use quoted "$file", not $file, everywhere.
done 4< <(find . -print0)
# Named pipe version.
# Requires nonstandard extensions to find and to shell's read (bash works);
# underlying system must inc. named pipes (FIFOs). Again,
# in this version, variables *do* stay set after the loop ends, and
# you can read from stdin (change the 4s to something else if fd 4 needed).
mkfifo mypipe
find . -print0 > mypipe &
while IFS="" read -r -d "" file <&4 ; do
COMMAND "$file" # Use quoted "$file", not $file, everywhere.
done 4< mypipe
# Requires author's nul2pfb program. This uses "find . -print0"; for
# POSIX 2008 compliance, replace that with: find . -exec printf '%s\0' {} \;
for encoded_filename in $(find . -print0 | nul2pfb) ; do
filename="$(printf "%bX" "$encoded_filename")" ; filename="${filename%X}"
# Use "$filename" from here on...
done
Below, I explain this in more detail.
Here are some basic rules on how to correctly process filenames (where COMMAND is some arbitrary command).
Always surround with double-quotes (") any substitution that might produce input field separator characters (by default these are space, newline, and tab), including one that might produce a filename. For example, when using (not setting) a variable or command substitution with a filename, surround it with double quotes. Otherwise, they’ll be expanded and misinterpreted. You don’t need to surround variable references if you know it can contain only alphanumeric characters, but it’s best to get in the habit, since the script might change in the future. Note: There is no portable way to store truly arbitrary multiple filenames in a single shell variable, because a filename might include awkward sequences such as newline, so see below for options on how to deal with this. Here are some examples:
| Don't use | Instead use |
|---|---|
| $filename | "$filename" |
| $(pwd) | "$(pwd)" |
| $(dirname $filename) | "$(dirname "$filename")" |
One of the first non-comment commands in every shell script should be:
IFS="$(printf '\n\t')"
# or:
IFS="`printf '\n\t'`"
# or, not portable but widely supported:
IFS=$'\n\t'
This sets the IFS variable so that the “space” character is no longer an input field separator, and thus only newline and tab are field separators. Setting IFS and always using double-quotes around filename variables eliminates most of the problems caused by filenames with spaces. It also eliminates some other common scripting errors involving space, so this is generally a very good idea.
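A quick sketch of what changing IFS buys you. The count_args helper and the sample filename are hypothetical; the unquoted expansions are deliberate, to show the splitting:

```shell
count_args() { echo "$#" ; }     # hypothetical helper: how many arguments arrived?

filename='my summer photo.jpg'

unset IFS                        # default IFS: space, tab, newline
default_ifs=$(count_args $filename)     # split on the spaces

IFS="$(printf '\n\t')"           # the recommended setting
safer_ifs=$(count_args $filename)       # no longer split on spaces

printf 'default=%s safer=%s\n' "$default_ifs" "$safer_ifs"
```

This is a safety net, not a substitute for quoting: glob characters in an unquoted variable are still expanded, and tabs and newlines still split.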
You can still build a list of command options inside a single shell variable, even when space isn’t in IFS. However, you need to use tab or newline to separate parameters, and not space. You can embed filenames in this variable, even if it has spaces in it. However, you can only include filenames this way if each filename does not include a newline, tab, or a shell globbing character (“*” or “?” or “[” at least, possibly “!”), does not begin with “-”, and does not begin with a tilde. Here’s an example:
tab="$(printf '\t')" # Use tab as separator
options="--option1${tab}--option2${tab}option_filename"
options="$options${tab}--anotheroption" # Build up the options.
COMMAND $options "$filename"
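Here's a runnable sketch of the same idea, with a hypothetical count_args function standing in for COMMAND and made-up option names. It assumes IFS has been set to newline and tab as recommended above:

```shell
IFS="$(printf '\n\t')"
tab="$(printf '\t')"
count_args() { echo "$#" ; }     # hypothetical stand-in for COMMAND

options="--title=Annual Report${tab}--verbose"
options="$options${tab}--output=final draft.txt"   # values with spaces are fine
filename='input file.txt'

n=$(count_args $options "$filename")   # $options is deliberately unquoted here
printf 'COMMAND received %s arguments\n' "$n"
```

The three tab-separated options and the quoted filename arrive as four separate arguments, spaces intact.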
You might also want to put “set -eu” at the beginning of your scripts; it does nothing for filenames, but it can help detect other script errors.
A "glob" is a pattern for filename matching like “*.pdf”. Whenever you use globbing to select files, never begin with a globbing character (typically the characters “*”, “?”, or “[”) or with a value that might begin with “-”. If you’re starting from the current directory, prefix the glob with “./”. In short, use:
cat ./* # Use this, NOT "cat *" ... Must have 1+ files.
for file in ./* ; do # Use this, NOT "for file in *" (beware empty lists)
...
done
Similarly, if you read in a filename, if it begins with “-” you should immediately prefix it with “./”.
If you always prefix filenames (e.g., those acquired through globs), then filenames starting with “-” will always be handled correctly. Globbing is often the easiest way to handle all files, or a subset of them, in a specific directory, but you need to make sure you do it correctly.
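For filenames that arrive from somewhere other than a glob, the prefixing can be sketched like this. The make_safe helper and the "safe" variable are hypothetical names of my own; it handles only the leading-dash case described above:

```shell
# Hypothetical helper: stores a dash-safe version of $1 in the variable "safe".
# A variable is used instead of $(...) so that trailing newlines in the
# name are not stripped.
make_safe() {
    case "$1" in
        -*) safe="./$1" ;;       # leading dash: anchor it to the current directory
        *)  safe="$1" ;;         # anything else is passed through unchanged
    esac
}

make_safe '-n'        ; dashed="$safe"
make_safe 'notes.txt' ; plain="$safe"
printf '%s\n' "$dashed" "$plain"
```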
The good news about globbing in shell is that glob expansion is done after IFS expansion, so as long as you directly use globs as command parameters or for...in... you will have no problem with filenames containing whitespace or control characters (including newline).
Many books instead recommend that you carefully use “--” on every command before each filename. I think that is stupid and completely impractical advice. Many programs don’t accept “--” at all. Even when they do, in practice it’s just too hard to remember to use “--” perfectly every time you invoke a command; sooner or later you will make a mistake. It’s usually much easier to fix the locations where the filename pattern is specified or the input is provided; there are typically fewer of them, they tend to be easier to find, and this approach works even with the many commands that do not accept “--”.
Remember that globbing normally skips hidden files (those beginning with "."). Often that is what you want; you may want to use "find" instead if that is not what you want.
Beware of globbing if there might be no matches with the pattern (and this is often the case). By default, if a glob like ./*.pdf matches no files, then the original glob pattern will be returned instead. This is almost never what you want, e.g., in a "for" loop this will cause the loop to execute once, but with the pattern instead of a filename!
You can use globbing in a for loop, even if it might not match anything, using one of two approaches. One approach, which is completely portable, is to re-test for the existence of the file before using it in the loop:
for file in ./* ; do # Use this, NOT "for file in *"
if [ -e "$file" ] ; then # Make sure it exists and isn't an empty match
COMMAND ... "$file" ...
fi
done
This is both ugly and a little inefficient (you have to re-test each file again). There are also pathological cases where the pattern doesn't match but there is a file that is identical to the unmatched pattern (though for typical patterns that can't happen), so you have to check to see if that could happen.
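To check that the re-test really does cope with the empty case, here's a small sketch. The count_pdfs helper and the scratch directory are hypothetical, and mktemp -d is assumed to exist:

```shell
count_pdfs() {                   # hypothetical helper; $1 is a directory
    n=0
    for file in "$1"/*.pdf ; do
        if [ -e "$file" ] ; then n=$((n + 1)) ; fi
    done
    echo "$n"
}

demo_dir=$(mktemp -d)            # scratch directory, so nothing real is touched
none=$(count_pdfs "$demo_dir")   # no match: the loop sees the pattern, the test rejects it
: > "$demo_dir/a b.pdf"
: > "$demo_dir/-n.pdf"
two=$(count_pdfs "$demo_dir")
rm -rf "$demo_dir"

printf 'empty dir: %s, after creating two files: %s\n' "$none" "$two"
```

One caveat: [ -e ... ] is false for a dangling symbolic link, so add a [ -L "$file" ] test if you need to see those too.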
A more efficient but nonstandard solution for empty matches is to use a nonstandard shell extension called "null globbing". Null globbing fixes this by replacing an unmatched pattern with nothing at all. In bash you can enable nullglob with "shopt -s nullglob". In zsh, you can use setopt NULL_GLOB for the same result. Then this will work correctly:
shopt -s nullglob # Bash extension, so that empty glob matches will work
for file in ./* ; do # Use this, NOT "for file in *"
COMMAND ... "$file" ...
done
If the match might be empty, you should normally not use globbing as part of a command. Thus, use "cat *.pdf" only if you know there's at least one .pdf file. One exception: If you enable null globbing, and if the command does nothing when handed an empty list of files, then things will be fine. But this condition is often untrue, and in any case, if there are too many matches it will also fail. In short, in robust scripts, globbing should normally be used only as a "for" loop's list.
Traditional globbing is only useful when you want to process files in a particular directory. Some shells have added a nonstandard "globstar" extension, but it's both nonstandard and has various limitations. I discuss it here, but you probably want to use find (discussed next). With the globstar extension, the pattern "**" returns every filename (including directories) in the current directory, recursively; it omits dot files, doesn't descend into dot dirs, and sorts the file list.
Bash version 4 recently added this, but you must enable it with "shopt -s globstar". The zsh shell originally came up with this, and ksh93 was the first to copy it (but in ksh you have to enable it with "set -G"). Note that there's no standard way to invoke it!
If you use this in a for loop list and combine it with nullglob, you can handle absolutely all filenames easily and efficiently, including the empty case. That sounds great, but watch the fine print... I think there are many reasons to avoid this right now. It's nonstandard, and gives you little control over the recursion. Most importantly, at least some implementations have trouble if there are links in the directories. Bash 4, at least, can get stuck in infinite loops if there are links. In many cases, find is currently the better approach for reliably doing recursive descent into directories.
If you want to process files beyond what normal globbing can do (e.g., recursively handle directories), or you don't like the limitations on having to re-check for non-matches, use find. The find command is always passed a starting directory; as long as the starting directory doesn’t begin with “-” you won’t have a problem with leading “-”. The find command can be badly misused; here are ways that work.
find ... -exec COMMAND... {} \;
This gets ugly fast if COMMAND becomes complicated.
Also, every file will start up a separate process for each COMMAND,
causing overhead if there are lots of files.
find ... -exec COMMAND... {} \+
This causes a set of files to be listed, not one file at a time.
It’s in the POSIX standard, though not all implementations of find
include it.
find ... -print0 | xargs -0 ....
# CORRECT if filenames can't include tab/newline *and* if IFS omits space:
COMMAND $(find .)
# OR:
for file in $(find .) ; do
COMMAND "$file" ...
done
This is a simple and clear solution, and it can handle filenames with spaces, leading dashes, shell metacharacters, and so on. In short, this is the best and clearest solution for non-trivial processing as long as filenames cannot include tab or newline. Below we discuss how to ignore filenames with control characters (like tab or newline); if you add that, then this correctly handles or ignores all filenames. Note that the “COMMAND $(find .)” form cannot handle lists that are too large; if that might be a problem, use the for loop instead.
Similarly, as long as filenames can’t include tab or newline, you can store filenames in files with one record per newline-separated line, and tabs can separate the fields. This format is well-supported by tools like cut, join, and paste. I think it’d be best if POSIX systems simply forbade filenames from including control characters like tab and newline; many programs assume it anyway, and many filesystems require it. Then these constructs would just work, no matter what.
find . | while IFS="" read -r file ; do ...
COMMAND "$file" # Use "$file" not $file everywhere.
done
This one can handle tabs in filenames. However, the values of any variables set in the while loop will be lost when the loop ends (the pipe causes the loop to run in a subshell). In addition, you cannot read from standard input inside the loop. I recommend using the previous “for” loop instead, in most cases, because the “for” loop is easier to understand and more flexible.
find . -exec sh -c '
for file do
...
done' sh {} +
find . | sed -e 's/[^A-Za-z0-9]/\\&/g' | xargs -E "" COMMAND
This is complicated, hard to read, ridiculously inefficient, and isn't better than many other alternatives (e.g., it doesn't handle newlines either). Don’t do this; instead, use one of the better ways described in this paper.
find . -name '.?*' -prune -o ....
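The "find . -exec sh -c '...' sh {} +" form listed above deserves a quick demonstration, because it copes even with newlines using only standard features. This is a sketch under my own assumptions: the scratch directory and file names are made up, and mktemp -d is assumed to exist. It compares a naive line count against a loop inside the inner shell:

```shell
demo_dir=$(mktemp -d)            # scratch directory, so nothing real is touched
: > "$demo_dir/plain"
: > "$demo_dir/two words"
: > "$demo_dir/$(printf 'new\nline')"    # one file whose name contains a newline

# Line-oriented counting is fooled by the embedded newline.
naive=$(find "$demo_dir" -type f | wc -l | tr -d ' ')

# Here each filename arrives intact as one positional parameter of the inner sh.
exact=$(find "$demo_dir" -type f -exec sh -c '
    for file do
        printf "x"               # emit one byte per file actually seen
    done' sh {} + | wc -c | tr -d ' ')
rm -rf "$demo_dir"

printf 'naive=%s exact=%s\n' "$naive" "$exact"
```

The trailing "sh" becomes $0 of the inner shell, so the filenames start at $1. The price is that the loop body runs in a separate shell, so variables it sets are not visible to the calling script.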
If you might have files in the filesystem that contain newline and tab, and you need to use find, you can sometimes simply declare that filenames with control characters (including newline and tab) are “bad” and should be ignored. If you can simply ignore “bad” files, you can easily add an option to find to skip such filenames, and use all the simple approaches above that use find. Here’s an example:
# Correctly skips filenames with control chars, inc. newline and tab:
controlchars="$(printf '*[\001-\037\177]*')"
for file in $(find . ! -name "$controlchars") ; do
COMMAND "$file" ...
done
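Here's a sketch that exercises the control-character filter on a scratch directory. The file names are made up, mktemp -d is assumed to exist, and I've added LC_ALL=C to the find call as my own assumption, so that the bracket range is interpreted bytewise regardless of locale:

```shell
demo_dir=$(mktemp -d)            # scratch directory, so nothing real is touched
: > "$demo_dir/ok name"
: > "$demo_dir/$(printf 'tab\there')"
: > "$demo_dir/$(printf 'new\nline')"

IFS="$(printf '\n\t')"
controlchars="$(printf '*[\001-\037\177]*')"
kept=0
# LC_ALL=C is an assumption of this sketch, not part of the template above.
for file in $(LC_ALL=C find "$demo_dir" -type f ! -name "$controlchars") ; do
    kept=$((kept + 1))           # only names free of control characters get here
done
rm -rf "$demo_dir"

printf 'files kept: %s\n' "$kept"
```

Only "ok name" survives the filter, and its space does no harm because space is no longer in IFS.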
In many cases, the previous section’s approaches are sufficient, especially if you skip filenames with control characters whenever you use find.
Still, let’s imagine that you need to walk directories, and you truly must handle arbitrary filenames including ones with newline and tab. You could create a shell function to walk the directory tree, using globbing and recursion. However, the “easy” way to implement this in shell can get stuck in infinite loops when confronted with certain kinds of symbolic links and hard links. Re-implementing accurate directory traversal in the shell is possible, but both painful and silly. After all, the find tool is specifically designed to handle this stuff; it’d be better to just use find instead of rewriting find.
A common solution in this case is to use byte 0 (aka \0 or null) to separate filenames, and use tools like find to walk the directory. This works because filenames, by definition, cannot include byte 0.
There are many downsides to this approach:
But if you want maximum generality when recursing into subdirectories, this is the usual way to do it. So, let’s look at ways to do this:
# CORRECT but nonstandard:
find . -print0 | xargs -0 COMMAND
Note that COMMAND is run on a set of files, but not necessarily on all of them, and processes restart each time, so it’s hard to keep track of things or stop processing in the middle. If you want to use xargs, then this is the way to do it, but note that it requires a non-standard extension.
find . -print0 | while IFS="" read -r -d "" file ; do ...
COMMAND "$file" # Use "$file" not $file everywhere.
done
This depends on find’s nonstandard -print0 option, as well as the bash-specific read -d option (the -d with empty string makes \0 be the delimiter). Note that you have to set IFS to be empty; otherwise, a filename that includes IFS characters at the end would be corrupted (see the POSIX.1-2008 specification lines 103920-103925). This works well in many cases, but it does have a subtle weakness: If you need to process filenames in a way that “remembers” what happens as you go, or afterwards, it doesn’t work well. Because we use a pipe, the "find" and "while..." commands will be executed in separate processes in separate subshells. Thus, if any variables are set inside the “while” loop, their values will disappear once we exit the loop (because the loop’s subshell will disappear).
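The subshell effect is easy to see for yourself. This sketch uses a newline-separated list in a scratch file (so it stays within POSIX features); the same thing happens with -print0 and read -d. The file and variable names are made up, and mktemp is assumed to exist:

```shell
list=$(mktemp)                        # scratch file holding one name per line
printf '%s\n' 'a b' 'c' 'd' > "$list"

piped=0
cat "$list" | while IFS="" read -r file ; do
    piped=$((piped + 1))              # in many shells this updates only a subshell's copy
done

redirected=0
while IFS="" read -r file ; do
    redirected=$((redirected + 1))    # runs in the current shell, so the count survives
done < "$list"
rm -f "$list"

printf 'piped=%s redirected=%s\n' "$piped" "$redirected"
```

In bash and dash the piped counter is still 0 afterwards; some shells (ksh, zsh) run the last pipeline stage in the current shell and would keep the count. The redirected loop keeps its count in every shell, which is why the later versions feed the loop through a redirection instead of a pipe.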
# This handles all filenames, but uses nonstandard extensions:
while IFS="" read -r -d "" file ; do ...
COMMAND "$file" # Use "$file" not $file everywhere.
# You can set variables, and they'll stay set.
done < <(find . -print0)
With this version we can now loop through all the filenames, no matter what they are, and retain any variable values we set. Unfortunately, this construct is hard to read and non-portable. It not only requires the use of a nonstandard find option (-print0), but uses the nonstandard "process substitution" extension (bash, zsh, and ksh 93 have it; dash and ksh 88 do not). In fact, process substitution doesn’t even work on all systems that support bash; it has to support named pipes or /dev/fd (thankfully these are common). This approach means we can’t read the original standard input (standard input is used to provide the filenames), which in many programs would be a problem, leading us to our final version...
# This handles all filenames, but uses nonstandard extensions
# and is a little ugly too:
while IFS="" read -r -d "" file <&4 ; do ...
COMMAND "$file" # Use "$file" not $file everywhere.
# You can set variables, and they'll stay set.
done 4< <(find . -print0)
Here is the most general file looping mechanism while staying within shell. This version can loop through all filenames, you can set variables and retain their value, and you can simultaneously use stdin. Change the "4" in both places to some other number more than 2 if the loop needs to use file descriptor 4. (I previously used bash's "-u" option for read, but that isn't standard while <&4 is in the POSIX standard.) Again, this needs lots of nonstandard extensions. Also, the shell "read" command is often a little slow (it's often implemented by reading a character at a time), which is a basic downside of all the while...read constructs.
Of course, if variables can contain newline and tab, you can’t use those values inside a shell variable or data file as separators. Many shells (including bash) do not permit shell variables to contain byte 0, making this more difficult. One possible solution is to use shell arrays, if your shell supports them, but that gets even more complicated.
After a certain point, you may find it easier to switch languages.
The usual in-the-field approach to dealing with all possible filenames is null byte termination; it's simple, has some support in key tools, and you can easily store lists of filenames this way (you can easily store them in files with other data per file if the filename is the last item). However, it is possible to encode filenames so that all filenames can be handled. Unfortunately, there is no single standard encoding for filenames that doesn't use null bytes; instead, there are many similar yet incompatible encodings. What's worse, utilities often do not support any of them well. But, let's take a look.
One approach is to use "printf %b" encoding, pfb encoding for short. This is the encoding supported by the printf(1) "%b" format, and it has several advantages: printf is part of the POSIX 2008 specification, it is widely implemented, and it is typically a shell builtin (making it speedier). This encoding uses the backslash ('\') to introduce an escape sequence, and the POSIX specification supports '\\' (for backslash itself), '\a', '\b', '\f', '\n', '\r', '\t', '\v', and '\0ddd' (where ddd is a 0-3 digit octal number). This is really easy to read, because it's similar to other formats. As long as you escape all characters in the range 1-31, you can easily include the encoded filename in other tab-delimited, newline-per-record files. (If you're on an EBCDIC system (!), you can still escape \n, but in EBCDIC the \n is outside the range of 1-31.) There's a gotcha in shell, though; filenames can end in newline, and that would be consumed if it's bare in a command substitution. So, this would be a wrong way to decode it:
# WRONG: this encoding fails if encoded_filename ends in \n
cat encoded_filenames |
while IFS="" read -r encoded_filename ; do
filename="$(printf "%b" "$encoded_filename")"
...
done
Instead, you need to decode such filenames this way in shell:
# CORRECT: Unencodes all possible filenames.
cat pfb_encoded_filenames |
while IFS="" read -r encoded_filename ; do
filename="$(printf "%bX" "$encoded_filename")" ; filename="${filename%X}"
...
done
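The difference between the two decoders shows up as soon as a name ends in newline. Here's a minimal sketch with a made-up encoded name:

```shell
encoded='a\040b\n'                     # pfb encoding of: a, space, b, newline

# Bare command substitution strips the trailing newline.
wrong="$(printf '%b' "$encoded")"

# The sentinel X protects the newline; the X is then removed.
right="$(printf '%bX' "$encoded")" ; right="${right%X}"

printf 'wrong has %s characters, right has %s\n' "${#wrong}" "${#right}"
```

The first decode silently yields a different (shorter) filename, which is exactly the kind of bug that only shows up when someone creates such a file on purpose.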
But what about generating or encoding filenames into pfb format? Unfortunately, the POSIX standard does not include an easy option for find or other tools to generate this format. The easiest way to do this is to use the widely-implemented -print0 option with find, then filter it through something to convert to this format. Here's a small C program I wrote, nul2pfb.c, which converts filename lists ending in \0 to line-oriented lists using the pfb escapes. There are several different ways to encode into pfb; I recommend encoding newline as \n and tab as \t, to make them obvious, and encoding space as \040 (so you don't need to worry about unintentional splitting). You really need to encode all characters less than or equal to 32, and those greater than or equal to 127. Yes, that makes international characters harder to read, but it preserves them, and that's more important. Since you cannot be sure of the character encoding used in filenames, and there's no guarantee that a filename obeys the rules anyway, it's safest to encode everything. And since we're going that far, we may as well encode the shell metacharacters "*", "?", "[", and "!", so that if we forget to surround variable references with double-quotes we'll be okay. In the end, I decided to encode every character except for alphanumerics, "/", ".", "_", and ":", since that gives maximum safety while still making most ordinary system filenames readable. (This encodes "-", which is good because a leading "-" can cause problems). A nice side-effect of encoding so many characters is that if you forget to decode them, you'll probably detect that fairly early in testing. It'd be nice if find and ls could generate pfb encoding directly, and if xargs could directly process them, but at least you can use null byte encoding to get there.
GNU ls includes some nonstandard quoting options, but none of them (at this time) quite output the pfb format. You would think its "-b" (aka --escape or --quoting-style=escape) would be the same thing, but it's not; this option changes space to '\ ', and pfb won't accept that. GNU ls' -Q (--quote-name or --quoting-style=c) encloses filenames in double-quotes, but even when you strip out the external double-quotes, it generates '\"' for double-quotes which also isn't in pfb. At this time it's easier to use find, and then use depth control if you just want one level.
Anyway, this brings us to a final approach to handling filenames in shell. This one lets us use a for loop, which is very nice, because we now keep all file descriptors (including stdin) and variable setting works as expected. The encoding and decoding we have to do is unfortunate, but there it is:
# CORRECT, requires find -print0 and author's nul2pfb program.
for encoded_filename in $(find . -print0 | nul2pfb) ; do
filename="$(printf "%bX" "$encoded_filename")" ; filename="${filename%X}"
# Use "$filename" from here on...
done
There are many other possible encodings. GNU ls has several ways to encode filenames; its "-b" format is quite reasonable, and is similar to pfb, but note that it puts \ in front of space, so it's a little more trouble to decode.
Another option is percent encoding, aka URL encoding. Basically, any byte can be represented by '%' followed by a 2-digit hexadecimal number.
The default format of xargs is a stinking, rotting mess, and I do not recommend it at all. It's not compatible with anything, including the find command it's supposed to work with. You can double-quote or single-quote text, but they have to end before the end of a line. The only way to encode newline is to precede it with "\", outside of a double-quote or single, which makes it hideously hard to deal with. You're better off using the widely-supported -0 option, even if it is technically non-portable. If you want to use encodings, it'd be nice if xargs supported pfb encoding, or if there was some other widely-supported encoding.
Try to avoid displaying filenames. Filenames could contain control characters that control the terminal and/or X-windows, causing nasty side-effects on display. Displaying filenames can even cause a security vulnerability. If you must display them, consider stripping out control characters first.
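If you do have to display a filename, stripping the control characters can be sketched as follows. The display_name helper is a hypothetical name of my own, and it removes only ASCII control characters (bytes 1-31 and 127); it makes no attempt to validate the character encoding:

```shell
# Hypothetical helper: print a filename with ASCII control characters removed.
display_name() {
    printf '%s' "$1" | LC_ALL=C tr -d '\001-\037\177'
    printf '\n'
}

# A name containing an escape sequence that would otherwise clear the screen.
shown=$(display_name "$(printf 'evil\033[2Jname')")
printf '%s\n' "$shown"
```

The ESC byte is dropped, so the rest of the sequence shows up as harmless literal text. Remember that the displayed name is for human eyes only; keep using the original, unmodified name when you actually open the file.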
Similarly, if a filename is stored as data in a file or sent elsewhere (e.g., as part of HTML or XML), you’ll need to escape the filename as necessary.
In addition, you have no way of knowing for certain what the filename’s character encoding is, so if you got a filename from someone else who uses non-ASCII characters, you’re likely to end up with garbage mojibake. In practice, what most people do is exchange filenames in UTF-8. If you both use the same locale, you could do something else, but UTF-8 is the only encoding in wide use that can handle arbitrary languages. I encourage you to always encode filenames in UTF-8.
The POSIX standard could (and should!) be modified to make it easier to handle the outrageously permissive filenames that are permitted today. Basically, we need extensions to make globbing and find easier to use.
There are two basic problems with globbing:
There also needs to be a standard way to use find with arbitrary filenames. The normal way to handle this is by separating filenames with the null (\0) character; a few changes would simplify this:
find . -print0 | xargs -0 COMMAND
find . -print0 | while read -0 file ; do ... done
for file then in $(find . -print0) do ... done
As a side note, it'd be nice if the $'...' construct was standard, as it makes certain things easier (bugid:249).
newline="$(printf '\n')"
This does not work, because after the $(...) command is executed, any trailing newline is removed.
One alternative is:
newline='
'
But this can get corrupted by programs that change the encoding of file end-of-lines.
The following is a standards-compliant trick to get newline into a variable:
newline="$(printf '\nX')"
newline="${newline%X}"
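A quick way to convince yourself of the difference between the first attempt and the trick (the variable name "stripped" is made up for this sketch):

```shell
stripped="$(printf '\n')"              # the trailing newline is removed, leaving nothing

newline="$(printf '\nX')"              # the X shields the newline from being stripped
newline="${newline%X}"                 # then the X is removed again

printf 'stripped length=%s, newline length=%s\n' "${#stripped}" "${#newline}"
```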
Shell programming is remarkably easy in many cases; what’s sad is that this common case (file processing) is far more complicated than it needs to be. Fundamentally, the rules on filenames are too permissive. Extending POSIX would make it somewhat easier, and we should do that. However, it would be much simpler if systems imposed a few simple rules on filenames, such as prohibiting control characters (bugid:251), prohibiting leading “-”, and requiring filenames to be UTF-8. Then you could always print filenames safely, and these “normal” shell constructs would always work:
# This works if filenames never begin with "-" and nullglob is enabled:
for file in *.pdf ; do ... done # Use "$file" not $file
# This works if filenames have no control chars and IFS is tab and newline:
for file in $(find .) ; do ... done # Use "$file" not $file
I think that we should both extend the POSIX standard and limit the permitted filenames. Not all systems will limit filenames, so we need standard mechanisms for them. But the new standard mechanisms simply can't be as simple as restricting filenames; restricting filenames makes systems far easier to use correctly.
Please see my paper on fixing Unix/Linux filenames for more about this.
I've also done some work on how to encode/decode filenames; see the encodef home page for more information.
But for now, this is how to handle filenames properly in shell programs.
Feel free to see my home page at http://www.dwheeler.com. You may also want to look at my paper Why OSS/FS? Look at the Numbers! and my book on how to develop secure programs. And, of course, my paper on fixing Unix/Linux/POSIX filenames.
(C) Copyright 2010-2011 David A. Wheeler. Released under Creative Commons CC-BY-SA (any version), GNU GPL v2+, and the Open Publication License (version 1.0 or later). You can use this under any of those licenses; if you do not say otherwise, then you release it under all of them. In addition, Mendel Cooper has explicit authorization to include this (or any modified portion) as part of his "Advanced Bash Scripting Guide". Let me know if you need other exceptions; my goal is to get this information out to the world!