When Bash Scripts Bite

There are abundant resources online trying to scare programmers away from using shell scripts. Most of them, if anything, succeed in convincing the reader to blindly put something that resembles

set -euo pipefail

at the top of their scripts. Let’s focus on the “-e” flag. What does this do? Well, here are descriptions of this flag from the first two results on Google for “writing safe bash scripts”:

“If a command fails, set -e will make the whole script exit, instead of just resuming on the next line” (https://sipb.mit.edu/doc/safe-shell/)
“This tells bash that it should exit the script if any statement returns a non-true return value.” (http://www.davidpashley.com/articles/writing-robust-shell-scripts/)

Unfortunately, this is bash we are talking about and the story is never that simple.

A couple months ago, a particular production bash script (if that doesn’t sound horrifying, hopefully it will by the end of this post) failed in the worst kind of way: silently. The script generates a list of valid users at Jane Street and pushes this out to our inbound mail servers. It looks something like:

set -euo pipefail
...
echo "($(ldap-query-for-valid-users))" > "/tmp/all-users.sexp"
...
push-all-users-if-different

On this one particular day, a file was deployed with the contents “()”. But why didn’t set -e cause the script to exit when ldap-query-for-valid-users failed? A quick look at the bash man page answers this question. It turns out that this flag has some surprising subtleties. Here are two of them:

`set -e` works on “simple commands”

A script will exit early if the exit status of a simple command is nonzero. So how is a simple command executed? In short, bash performs all expansions and then checks whether there is still a command to run. If there is, the exit status of the simple command is the exit status of that command. If there is not, the exit status of the simple command is the exit status of the last command substitution performed. Here are some example commands that all have exit status 0, and so would not cause a set -e script to exit:

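# echo, local and export are themselves the command being run, and they normally
# exit with status 0 whatever the command substitution returned
# (local is only valid inside a function)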
echo "$(/bin/false)"
local foo="$(/bin/false)"
export foo="$(/bin/false)"

# the last command substitution has exit status 0
foo="$(/bin/false)$(/bin/true)"
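
The rule above can be checked directly, and it also suggests a common way around the masking: do the command substitution in a bare assignment, then use the variable. The sketch below is illustrative only. The variable names (`users`, `masked_status` and so on) are made up, and `false` stands in for a failing query. Each case runs in a child bash so that a set -e exit does not end the demo itself.

```shell
#!/usr/bin/env bash
# Illustrative sketch: `false` stands in for a failing query, and all
# variable names here are hypothetical.

# Masked: echo is the command that runs, so set -e sees echo's status (0)
# and the child script carries on to the next line.
masked_status=0
masked_output="$(bash -c 'set -e; echo "($(false))"; echo "still running"')" || masked_status=$?

# Not masked: a bare assignment has no command name, so its status is that
# of the command substitution, and set -e exits before either echo runs.
split_status=0
split_output="$(bash -c 'set -e; users="$(false)"; echo "($users)"; echo "still running"')" || split_status=$?

printf 'masked: status=%s output=%s\n' "$masked_status" "$masked_output"
printf 'split:  status=%s output=%s\n' "$split_status" "$split_output"
```

Splitting the assignment from the use is not a complete fix, since it only covers this one way of hiding a failure, but it would have stopped a script shaped like the one above before it wrote the file.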

`set -e` does not get passed to subshells in command substitution (without `--posix`)

Here is an example consequence of this:

set -e

foo() {
    /bin/false
    echo "foo"
}
echo "$(foo)"

Running this script with bash will print “foo”, while running it with bash --posix (or as sh, where sh is bash) will print only an empty line. Either way, the script exits with status 0.
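
Bash 4.4 and later also provide `shopt -s inherit_errexit`, which makes command substitutions inherit -e without switching to POSIX mode. The sketch below compares the two behaviours. It assumes bash 4.4 or newer, and the wrapper variables (`script`, `default_output`, `inherit_output`) are hypothetical names used only for this demo. Square brackets are added around the output to make an empty result visible.

```shell
#!/usr/bin/env bash
# Assumes bash >= 4.4; older versions do not know inherit_errexit.

# The script from above, with brackets around the substitution.
script='
foo() {
    false
    echo "foo"
}
echo "[$(foo)]"
'

# Default bash: -e is not inherited, so foo keeps going after false fails.
default_status=0
default_output="$(bash -c "set -e; $script")" || default_status=$?

# With inherit_errexit: the substitution's subshell exits at false, so
# nothing is captured. echo still runs and still masks the failure.
inherit_status=0
inherit_output="$(bash -c "set -e; shopt -s inherit_errexit; $script")" || inherit_status=$?

printf 'default: status=%s output=%s\n' "$default_status" "$default_output"
printf 'inherit: status=%s output=%s\n' "$inherit_status" "$inherit_output"
```

Note that the option changes what the substitution produces, not whether the outer script notices: because echo is the command being run, the failure remains hidden from set -e in both cases.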

Tangible takeaway

This is not to say that something like set -euo pipefail should not be used at the top of all bash scripts, but it should not give you a false sense of security. Like all production code, you must reason about all failure conditions and ensure they are handled appropriately. Even if you are some kind of bash expert who knows all these subtleties, chances are your peers do not. The execution of shell scripts is subtle and confusing, and for production code, there is likely a better tool for the job.