When we set up a schedule on a computer, such as a list of commands to run every day at particular times via Linux cron jobs, we expect that schedule to execute reliably. Of course we’ll check the logs to see whether the job has failed, but we never question whether the cron daemon itself will function. We always assume that it will, as it always has done; we are not expecting mutiny in the ranks of the operating system.

So when we recently experienced problems with Linux logins and cron jobs failing for no apparent reason, it came as a big surprise. It was as if the cron daemon had gone on vacation. There was something deeply disconcerting about such a foundational piece of infrastructure suddenly becoming unreliable.

There was a common fingerprint that accompanied these events, a clue discovered in /var/log/messages:

systemd[1]: Failed to propagate agent release message: No buffer space available
systemd-logind[2781]: Failed to abandon session scope: No buffer space available
systemd-logind[2781]: Failed to start session scope session-19928.scope: No buffer space available
dbus[1467]: [system] Rejected: destination has a full message queue, 0 matched rules; type="signal", sender=":1.0" (uid=0 pid=1 comm="/usr/lib/systemd/systemd --switched-root --system ") interface="org.freedesktop.DBus.Properties" member="PropertiesChanged" error name="(unset)" requested_reply="0" destination="org.freedesktop.DBus" (uid=0 pid=2781 comm="/usr/lib/systemd/systemd-logind ")

Systemd is a daemon responsible for starting and tracking Linux processes, among other things, which it does by defining units for every item of interest, each of which has a collection of properties and dependency information. D-Bus is a message bus system, designed to make it easier for different Linux components to communicate with one another. How these components related to cron was unclear, but we discovered a trail leading back to the Linux automounter, which is responsible for mounting directories automatically as users traverse more deeply into a directory hierarchy.

We observed that if a lot of automounted directories were mounted in quick succession (known in the trade as a “mount storm”), systemd would choke. This was discussed upstream, where the advice was that the systemd buffer space should be increased – and indeed it had been very recently.

With this information, we had a fix in hand. But there were still some intriguing questions that had not yet been answered:

  • What is this buffer space and why did it run out?
  • How close are we to the limit?
  • By increasing the buffer’s size, are we just kicking the can down the road?

Turning on the tap

I decided to investigate, beginning with SystemTap, a diagnostic tool for instrumenting and inspecting parts of Linux that traditional tools cannot reach. Among other things, SystemTap lets you place probes that fire when particular functions run – the script is compiled into a kernel module – and from there you can query the function parameters or the return value, walk the stack, or display a stack trace. These scripts can be launched on a running system and can query already-running processes.
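
To give a flavour of what that looks like, here is a tiny illustrative script – not one used in this investigation, and the function name varies between kernel versions – that fires every time the kernel's do_mount() returns and prints who called it, what it returned and the kernel stack that led there:

// Illustrative only: report every return from the kernel's do_mount(),
// along with the calling process and a kernel stack trace.
probe kernel.function("do_mount").return
{
    printf("%s(%d): do_mount returned %d\n", execname(), pid(), $return)
    print_backtrace()
}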

With SystemTap, all I needed was a way to reproduce the original condition. I found that listing the contents of thousands of automounted directories would occasionally cause the issue.

The notes in systemd issue #13674 indicated that the problem related to the BUS_RQUEUE_MAX or BUS_WQUEUE_MAX settings in systemd. A quick rg through the systemd source code showed that these C macros were compared against the bus->rqueue_size and bus->wqueue_size counters, where bus is an sd_bus structure:

$ rg 'BUS_[RW]QUEUE_MAX'
src/libsystemd/sd-bus/sd-bus.c
1677:        if (bus->rqueue_size >= BUS_RQUEUE_MAX)
1785:                if (bus->wqueue_size >= BUS_WQUEUE_MAX)

src/libsystemd/sd-bus/bus-internal.h
329:#define BUS_WQUEUE_MAX (192*1024)
330:#define BUS_RQUEUE_MAX (192*1024)

Next, I found all of the functions that changed these queue sizes and instrumented them with a stap script. First, I confirmed that the function parameters I needed were available:

# stap -L 'process("/usr/lib/systemd/systemd").function("bus_send_internal").return'
process("/usr/lib/systemd/systemd").function("bus_send_internal@src/libsystemd/sd-bus/sd-bus.c:1718").return $return:int $bus:sd_bus* $_m:sd_bus_message* $cookie:uint64_t* $hint_sync_call:_Bool $m:sd_bus_message* $__PRETTY_FUNCTION__:char[] const

I could display the size of the queues and the return codes with a script I wrote, systemd-bus.stp.
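
The full script is not reproduced here, but the core idea is sketched below. It is simplified: the real systemd-bus.stp also probes the other queue-touching functions seen in the output that follows, translates error return values such as ENOBUFS into names, and prints a timestamped header. The probe paths assume the sd-bus functions are linked into the systemd binary itself, as they were on the system in question.

// Sketch of the idea behind systemd-bus.stp: probe the sd-bus functions on
// return and print the current queue sizes from the sd_bus structure.
// $1 is the interval, in milliseconds, between header lines.
probe begin, timer.ms($1)
{
    printf("%6s %-25s %6s %6s %s\n", "PID", "FUNCTION", "RQUEUE", "WQUEUE", "RETURN")
}

probe process("/usr/lib/systemd/systemd").function("bus_send_internal").return,
      process("/usr/lib/systemd/systemd").function("sd_bus_call").return
{
    printf("%6d %-25s %6d %6d %d\n", pid(), ppfunc(),
           $bus->rqueue_size, $bus->wqueue_size, $return)
}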

# stap systemd-bus.stp 5000

5000 is the number of milliseconds to wait between re-displaying each header. Here’s the output from around the time the initial problem was reproduced:

   PID FUNCTION                  RQUEUE WQUEUE RETURN
     1 bus_send_internal              0 196601 1
     1 bus_send_internal              0 196602 1
     1 bus_send_internal              0 196603 1
     1 bus_send_internal              0 196604 1
     1 bus_send_internal              0 196605 1
     1 bus_send_internal              0 196606 1
     1 bus_send_internal              0 196607 1
     1 bus_send_internal              0 196608 ENOBUFS
--- Wed Nov 20 09:02:05 2019 GMT ---
   PID FUNCTION                  RQUEUE WQUEUE RETURN
     1 bus_send_internal              0 196608 ENOBUFS
--- Wed Nov 20 09:02:15 2019 GMT ---
   PID FUNCTION                  RQUEUE WQUEUE RETURN
     1 bus_send_internal              0 196608 ENOBUFS
--- Wed Nov 20 09:02:25 2019 GMT ---
   PID FUNCTION                  RQUEUE WQUEUE RETURN
     1 bus_send_internal              0 196608 ENOBUFS
--- Wed Nov 20 09:02:35 2019 GMT ---
   PID FUNCTION                  RQUEUE WQUEUE RETURN
     1 bus_send_internal              0 196608 ENOBUFS
--- Wed Nov 20 09:02:45 2019 GMT ---
   PID FUNCTION                  RQUEUE WQUEUE RETURN
     1 bus_send_internal              0 196608 ENOBUFS
--- Wed Nov 20 09:02:55 2019 GMT ---
   PID FUNCTION                  RQUEUE WQUEUE RETURN
     1 bus_send_internal              0 196608 ENOBUFS
     1 dispatch_rqueue                0  53865 -
     1 bus_rqueue_make_room           0  53865 0
     1 bus_socket_make_message        0  53865 1
     1 dispatch_rqueue                0  53849 -
     1 bus_rqueue_make_room           0  53849 0
     1 bus_socket_make_message        0  53849 1
     1 bus_send_internal              0  53849 1
     1 bus_rqueue_make_room           0  53850 0
     1 bus_socket_make_message        0  53850 1
     1 bus_rqueue_make_room           1  53850 0
     1 bus_socket_make_message        1  53850 1
     1 bus_rqueue_make_room           2  53850 0
     1 bus_socket_make_message        2  53850 1
--- Wed Nov 20 09:03:05 2019 GMT ---
   PID FUNCTION                  RQUEUE WQUEUE RETURN
     1 bus_rqueue_make_room           3      0 0
     1 bus_socket_make_message        3      0 1
     1 sd_bus_call                    0  53849 1
     1 bus_send_internal              3      0 1
     1 sd_bus_call                    3      0 1
     1 dispatch_rqueue                3      0 -
     1 dispatch_rqueue                2      0 -
     1 dispatch_rqueue                1      0 -
     1 bus_send_internal              0      0 1
     1 bus_rqueue_make_room           0      0 0
     1 bus_socket_make_message        0      0 1
     1 sd_bus_call                    0      0 1

When the problem occurs, it is the write queue (WQUEUE) that fills up, as seen just before 09:02:05 when it reaches 196,608 items, at which point the bus_send_internal function starts returning ENOBUFS instead. About a minute later the number of items in the write queue dropped back to zero, so something eventually drained the queue.

However, I was finding it difficult to reproduce this, and I did not know why; if I was lucky, I could observe the behaviour at most once a day. When I did reproduce the problem and collected the output above, an attempt to log in to the host in another window hung until the size of the queue dropped, at which point I observed the following error in the messages log:

systemd-logind[2184]: Failed to start session scope session-750.scope: Connection timed out

Taking a closer look

SystemTap allowed me to identify precisely which queue was filling up and watch in real-time while it did.

However, I wanted to know what data was going into the queue and what was responsible for draining it. That would help me understand the relationship with the automounter – and also help explain why I was seeing this issue only sporadically, not every time I traversed a tree of automounted directories.

Examining the source code I could see that the bus_send_internal() function accepted a pointer to a structure called sd_bus_message. The definition of that structure begins with the following members:

struct sd_bus_message {
        unsigned n_ref;

        sd_bus *bus;

        uint64_t reply_cookie;

        const char *path;
        const char *interface;
        const char *member;
        const char *destination;
        const char *sender;
...

With some trial and error, I found that interface, member and path contained the pertinent information. Displaying this data required the following probe:

probe process("/usr/lib/systemd/systemd").function("bus_send_internal").return
{
    printf("%-40s %-25s %s\n",
		user_string_quoted(@entry($_m->interface)),
            	user_string_quoted(@entry($_m->member)),
            	user_string_quoted(@entry($_m->path)))
}

The probe output a screenful of information, which began:

"org.freedesktop.systemd1.Manager"       "UnitNew"                 "/org/freedesktop/systemd1"
"org.freedesktop.systemd1.Manager"       "UnitRemoved"             "/org/freedesktop/systemd1"
"org.freedesktop.systemd1.Manager"       "UnitNew"                 "/org/freedesktop/systemd1"
"org.freedesktop.DBus.Properties"        "PropertiesChange\260"    "/org/freedesktop/systemd1/unit/usr_2dmount_2dapp_2dbucket1_2emount"
"org.freedesktop.DBus.Properties"        "PropertiesChange\260"    "/org/freedesktop/systemd1/unit/usr_2dmount_2dapp_2dbucket1_2emount"
"org.freedesktop.DBus.Properties"        "PropertiesChange\260"    "/org/freedesktop/systemd1/unit/usr_2dmount_2dapp_2dbucket2_2emount"
"org.freedesktop.DBus.Properties"        "PropertiesChange\260"    "/org/freedesktop/systemd1/unit/usr_2dmount_2dapp_2dbucket2_2emount"
"org.freedesktop.DBus.Properties"        "PropertiesChanged"       "/org/freedesktop/systemd1/unit/usr_2dmount_2dapp_2dbucket3_2emount"
"org.freedesktop.DBus.Properties"        "PropertiesChanged"       "/org/freedesktop/systemd1/unit/usr_2dmount_2dapp_2dbucket3_2emount"
...

This made me suspect that systemd units were being set up for each automounted directory, as I recognised the /usr_2dmount_2dapp_2dbucket1 text as one of our mount-points: /usr/mount/app/bucket1. A quick check confirmed this:

$ systemctl list-units | grep '\.mount'
-.mount                                      loaded active mounted   /
dev-hugepages.mount                          loaded active mounted   Huge Pages File System
dev-mqueue.mount                             loaded active mounted   POSIX Message Queue File System
home-mark.mount                              loaded active mounted   /home/mark
usr-mount-app-bucket1.mount                  loaded active mounted   /usr/mount/app/bucket1
usr-mount-app-bucket2.mount                  loaded active mounted   /usr/mount/app/bucket2
usr-mount-app-bucket3.mount                  loaded active mounted   /usr/mount/app/bucket3
usr-mount-app-bucket4.mount                  loaded active mounted   /usr/mount/app/bucket4
usr-mount-app-bucket5.mount                  loaded active mounted   /usr/mount/app/bucket5
run-user-0.mount                             loaded active mounted   /run/user/0
run-user-100.mount                           loaded active mounted   /run/user/100
run-user-400.mount                           loaded active mounted   /run/user/400
sys-kernel-config.mount                      loaded active mounted   Configuration File System
sys-kernel-debug.mount                       loaded active mounted   Debug File System
tmp.mount                                    loaded active mounted   /tmp
var-lib-nfs-rpc_pipefs.mount                 loaded active mounted   RPC Pipe File System
var-lib-sss-db.mount                         loaded active mounted   Mount SSSD cache to tmpfs
var.mount                                    loaded active mounted   /var

$ systemctl show home-mark.mount | wc -l
117

Automounted directories – specifically those that were being mounted automatically by /usr/sbin/automount – were ending up as systemd units. Picking one example (an automounted home directory) revealed it contained 117 properties. This was not just an unexpectedly large number – it was surprising to me that systemd would have a unit defined for /home/mark at all. The automounter was responsible for mounting and unmounting it; why did systemd need to know anything about it?

There are actually two distinct systems in Linux that can handle automounting and they function independently of each other: the traditional automounter, which we were using, and the systemd automounter. Searching through the automounter source code for anything related to systemd revealed nothing. I could not figure out how or where the automounter was sending details of each of its mounts to systemd, nor could I guess at the purpose for this exchange of information.

“Use the Source, Luke!”

My mind was cast back to the words of an old colleague who was a masterful debugger, able to find the root cause of almost any Linux-related problem, no matter how deep into user-space, the kernel, or machine code the problem was hiding. “Use the Source, Luke!” he would say, conjuring images of the Jedi Order.

I continued to examine the systemd source code, and it wasn’t long before I encountered the mount_sigchld_event() function and this comment:

/* So here's the thing, we really want to know before /usr/bin/mount or /usr/bin/umount exit whether
 * they established/remove a mount. This is important when mounting, but even more so when unmounting
 * since we need to deal with nested mounts and otherwise cannot safely determine whether to repeat
 * the unmounts. In theory, the kernel fires /proc/self/mountinfo changes off before returning from
 * the mount() or umount() syscalls, and thus we should see the changes to the proc file before we
 * process the waitid() for the /usr/bin/(u)mount processes. However, this is unfortunately racy: we
 * have to waitid() for processes using P_ALL (since we need to reap unexpected children that got
 * reparented to PID 1), but when using P_ALL we might end up reaping processes that terminated just
 * instants ago, i.e. already after our last event loop iteration (i.e. after the last point we might
 * have noticed /proc/self/mountinfo events via epoll). This means event loop priorities for
 * processing SIGCHLD vs. /proc/self/mountinfo IO events are not as relevant as we want. To fix that
 * race, let's explicitly scan /proc/self/mountinfo before we start processing /usr/bin/(u)mount
 * dying. It's ugly, but it makes our ordering systematic again, and makes sure we always see
 * /proc/self/mountinfo changes before our mount/umount exits. */
(void) mount_process_proc_self_mountinfo(u->manager);

This was the missing piece of the puzzle! Systemd watches /proc/self/mountinfo for changes. When changes are detected, it sets up systemd units by calling mount_setup_unit(). This was a very interesting discovery; I had not been aware of this interaction before. When I checked with the upstream maintainer of the Linux automounter, it seemed that this interaction surprised him too.

When the Linux kernel mounts or unmounts a filesystem, it updates a table of mount points that can be viewed through a file in the /proc pseudo filesystem. Systemd watches this file – /proc/self/mountinfo – for changes and creates units for everything that is mounted. So there was no direct link between systemd and the automounter, and the problem was not specific to automounting either: the same thing would happen when mounting any filesystem, even with the mount command directly.
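
This is straightforward to confirm on a live system with the same tooling. A minimal sketch – parameter names differ between systemd versions, so this simply counts calls in PID 1 rather than printing the mount details:

// Sketch: count calls to mount_setup_unit() in PID 1 while filesystems are
// being mounted, to confirm the /proc/self/mountinfo -> mount unit path.
global calls

probe process("/usr/lib/systemd/systemd").function("mount_setup_unit")
{
    if (pid() == 1)
        calls++
}

probe end
{
    printf("mount_setup_unit() was called %d times\n", calls)
}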

Looking at a systemd unit for an automounted directory, my attention was drawn to the Conflicts attribute:

$ systemctl show -p Conflicts home-mark.mount
Conflicts=umount.target

From systemd.special(7):

umount.target
    A special target unit that umounts all mount and automount points 
    on system shutdown.

    Mounts that shall be unmounted on system shutdown shall add
    Conflicts dependencies to this unit for their mount unit, which is 
    implicitly done when DefaultDependencies=yes is set (the default).

systemd would seem to want to be responsible for unmounting targets that were added by the Linux automounter. From systemd.mount(5):

Mount points created at runtime (independently of unit files or
/etc/fstab) will be monitored by systemd and appear like any other
mount unit in systemd. See /proc/self/mountinfo description in proc(5).

However, the automounter is capable of unmounting its own targets, and umount -a at shutdown will also unmount them, so it remained unclear why systemd needed to know anything about mount points other than those an administrator had explicitly configured as systemd units for dependency resolution.

What is draining the queue?

Having determined that systemd was creating its own mount units, I shifted the investigation back to the question: why does this problem not happen all the time? Most of the time the queue stays at zero, suggesting that something is draining it. If I could understand what was draining the queue, perhaps I’d have a better idea of why it sometimes failed to do so fast enough.

The bus_send_internal() function in systemd ends up writing to file descriptor 45, as I discovered with this SystemTap probe:

probe process("/usr/lib/systemd/systemd").function("bus_socket_write_message") {
    printf("%d %d\n", pid(), $bus->output_fd)
}

This could also be seen in the output of strace on PID 1. So where does fd 45 go? Who is reading from the other end?

# ls -l /proc/1/fd/45
lrwx------ 1 root root 64 Nov 20 16:37 /proc/1/fd/45 -> socket:[17738]

# ss -xp | grep 17738
u_str ESTAB 0 0 * 17738 * 23294 users:(("systemd",pid=1,fd=45))
u_str ESTAB 0 0 /run/dbus/system_bus_socket 23294 * 17738 users:(("dbus-daemon",pid=1696,fd=9))

According to an strace of the dbus-daemon, when it reads from fd 9 it writes to fd 10, so chasing that down:

# ls -l /proc/1696/fd/{9,10}
lrwx------ 1 root root 64 Nov 22 13:07 /proc/1696/fd/10 -> socket:[23452]
lrwx------ 1 root root 64 Nov 22 13:07 /proc/1696/fd/9 -> socket:[23294]

# ss -xp | grep 23452
u_str ESTAB 0 0 * 18029 * 23452 users:(("systemd-logind",pid=2267,fd=12))
u_str ESTAB 0 0 /run/dbus/system_bus_socket 23452 * 18029 users:(("dbus-daemon",pid=1696,fd=10))

When I mount a filesystem, systemd is using dbus to send messages to systemd-logind. Is that right? Why? And are they all going there?

To dig deeper, I really wanted to see a count of how many messages were sent out of the write socket from systemd vs. how many were received by systemd-logind on the read side of the socket. I wrote another stap script, systemd-buffers-trace.stp.
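
The full systemd-buffers-trace.stp is again not reproduced here, but a simplified sketch of the counting approach looks like this. (The real script also records the file descriptors involved, and the probe paths assume the sd-bus functions are linked into the systemd and systemd-logind binaries, as they were here.)

// Sketch: count the sd-bus send path in systemd (PID 1) and the receive
// path in systemd-logind, then report the totals when the script exits.
global sent, received

probe process("/usr/lib/systemd/systemd").function("bus_socket_write_message").return
{
    if (pid() == 1)
        sent++
}

probe process("/usr/lib/systemd/systemd-logind").function("bus_socket_read_message").return
{
    received++
}

probe end
{
    printf("systemd: bus_socket_write_message() called %d times\n", sent)
    printf("systemd-logind: bus_socket_read_message() called %d times\n", received)
}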

While running this script with stap, I mounted 2,148 directories via the automounter and got the following output:

systemd/bus_send_internal() returned 1, 1339624 times
systemd/bus_socket_write_message() sent 1339624 messages -> fd 45
systemd-logind/bus_socket_read_message() recv 2666566 messages <- fd 12
systemd-logind/bus_socket_make_message() called 1333283 times
systemd-logind/bus_process_internal() returned 1, 1333283 times

To mount just over 2,000 directories, bus_socket_write_message() sent over 1.3 million messages and bus_socket_read_message() received over 2.6 million messages?! That’s incredible! I thought that 2,000 directories was only a mount-storm-in-a-teacup, but having been expanded now to over a million messages, the teacup had clearly broken into pieces on the floor.

I added an extra probe to count how many messages with identical interface, member and path components were received by bus_process_object() (a sketch of this probe appears after the output below), and then ran stap while mounting a single directory:

systemd/bus_send_internal() returned 1, 65 times
systemd/bus_socket_write_message() sent 65 messages -> fd 45
systemd-logind/bus_socket_read_message() recv 126 messages <- fd 12
systemd-logind/bus_socket_make_message() called 63 times
systemd-logind/bus_process_internal() returned 1, 63 times

Top 20 messages handled by bus_process_object()

COUNT INTERFACE MEMBER PATH
4 "org.freedesktop.DBus.Properties" "PropertiesChanged" "/org/freedesktop/systemd1/unit/usr_2dmount_2dapp_2dbucket1_2emount"
4 "org.freedesktop.DBus.Properties" "PropertiesChanged" "/org/freedesktop/systemd1/unit/usr_2dmount_2dapp_2dbucket2_2emount"
4 "org.freedesktop.DBus.Properties" "PropertiesChanged" "/org/freedesktop/systemd1/unit/usr_2dmount_2dapp_2dbucket3_2emount"
4 "org.freedesktop.DBus.Properties" "PropertiesChanged" "/org/freedesktop/systemd1/unit/usr_2dmount_2dapp_2dbucket4_2emount"
4 "org.freedesktop.DBus.Properties" "PropertiesChanged" "/org/freedesktop/systemd1/unit/usr_2dmount_2dapp_2dbucket5_2emount"
4 "org.freedesktop.DBus.Properties" "PropertiesChanged" "/org/freedesktop/systemd1/unit/usr_2dmount_2dapp_2dbucket6_2emount"
4 "org.freedesktop.DBus.Properties" "PropertiesChanged" "/org/freedesktop/systemd1/unit/home_2dmark_2emount"
4 "org.freedesktop.DBus.Properties" "PropertiesChanged" "/org/freedesktop/systemd1/unit/dev_2dmapper_2dvg01_5cx2ddata_2edevice"
4 "org.freedesktop.DBus.Properties" "PropertiesChanged" "/org/freedesktop/systemd1/unit/dev_2dmapper_2dvg01_5cx2dvar_2edevice"
4 "org.freedesktop.DBus.Properties" "PropertiesChanged" "/org/freedesktop/systemd1/unit/dev_2dmapper_2dvg01_5cx2dstate_2edevice"
4 "org.freedesktop.DBus.Properties" "PropertiesChanged" "/org/freedesktop/systemd1/unit/dev_2dmapper_2dvg01_5cx2dtmp_2edevice"
4 "org.freedesktop.DBus.Properties" "PropertiesChanged" "/org/freedesktop/systemd1/unit/dev_2dmapper_2dvg01_5cx2droot_2edevice"
...
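
For reference, the extra counting probe was essentially an aggregation keyed on that tuple; something along these lines (a sketch – it assumes bus_process_object() receives its message in a parameter named m and that it is probed on the systemd-logind side):

// Sketch: count identical (interface, member, path) tuples handled by
// bus_process_object() and print the most frequent ones on exit.
global msgs

probe process("/usr/lib/systemd/systemd-logind").function("bus_process_object")
{
    msgs[user_string_quoted($m->interface),
         user_string_quoted($m->member),
         user_string_quoted($m->path)]++
}

probe end
{
    printf("%5s %s %s %s\n", "COUNT", "INTERFACE", "MEMBER", "PATH")
    foreach ([intf, memb, path] in msgs- limit 20)
        printf("%5d %s %s %s\n", msgs[intf, memb, path], intf, memb, path)
}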

At first I couldn’t understand this output. I had mounted a single directory; systemd would have noticed that, because it watches /proc/self/mountinfo for changes, and would want to create a new mount unit for it. But that single mount resulted in messages for all the other current mount points on the system – four messages per mount point, in fact.

Then the penny dropped. Whenever anything changed in the mount table, systemd re-communicated every single current mount point over the bus to systemd-logind – and it did so again each time any single entry changed. So if we mount just over 2,000 directories, one at a time, then every time a new mount point from that list is added, the entire current list of mount points is sent over the bus again; and, of course, that list is growing each time. That is why we were seeing millions of messages on the bus. A mount storm had produced a tsunami of IPC messages that was swamping the system.
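
As a rough back-of-envelope check (my own estimate, not a figure from the systemd source): if mounting the k-th directory triggers a re-send covering all of the roughly k mount points that exist at that moment, then mounting N directories generates on the order of 1 + 2 + ... + N = N(N+1)/2 notifications. For N = 2,148 that is about 2.3 million. Event coalescing will reduce the real number somewhat, but it is the right order of magnitude for the counts measured above, and it shows that the traffic grows quadratically rather than linearly with the size of the mount storm.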

Summary

Under normal operating conditions, the systemd-logind process reads all of the messages destined for it on its channel of the message bus quickly enough. However, if systemd-logind happens to be busy processing other messages at the time, the queue fills up and anything that relies on the systemd bus chokes. This is what caused our logins and cron jobs to fail. During the investigation, I discovered that mounting just over 2,000 directories via the Linux automounter caused systemd to produce millions of messages over dbus.

There is a partial fix for this issue upstream that attempts to address the problem with a form of rate limiting. Unfortunately, it was backed out of the upstream master branch and apparently needs more work. I have proposed a way forward and hope that a final solution will be forthcoming in the not-too-distant future.