January 24, 2021

A lot of Go applications try to do something clever with signals and end up dropping signals on the floor. I've definitely written this kind of bug myself. It's not community practice to lean on an application server rather than the stdlib, which creates plenty of opportunity for folks to implement signal handling incorrectly from scratch.

Note that we're not talking about signal-safety(7). For purposes of this discussion we're going to merrily assume the authors of os/signal.Notify have avoided any signal-unsafe code. Although it'd be neat to dig into how that worked out with the Go scheduler at some point.

The docs for os/signal.Notify say:

Package signal will not block sending to c: the caller must ensure that c has sufficient buffer space to keep up with the expected signal rate. For a channel used for notification of just one signal value, a buffer of size 1 is sufficient.

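We have to read this a bit carefully: it says a buffer of size 1 is sufficient for a channel that's notified of just one signal value. That's not the same as saying it's sufficient once one channel is registered for several different signals.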

Suppose we have a server that can reload its configuration on SIGHUP and does a graceful shutdown on SIGINT (or SIGTERM). If we're in the middle of doing a configuration load and get a shutdown notice, we'll queue up the shutdown signal and process it afterwards. But that queued signal fills the one-slot buffer, so any other signal sent during that window will get dropped.

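package main

import (
    "fmt"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    c := make(chan os.Signal, 1) // one slot shared by both signals
    signal.Notify(c, syscall.SIGINT, syscall.SIGHUP)
    for {
        s := <-c
        switch s {
        case syscall.SIGHUP:
            fmt.Println("Got SIGHUP, reloading config...", s)
            time.Sleep(1 * time.Second) // stand-in for a slow reload
        case syscall.SIGINT:
            fmt.Println("Got SIGINT, gracefully shutting down...", s)
            time.Sleep(1 * time.Second) // stand-in for draining requests
            return
        }
    }
}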

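If we run this program in one terminal and then send it three signals in a row, we can see we drop one of them. The second SIGHUP sits in the buffer while the first is being handled, so the SIGINT has nowhere to go.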

# first terminal
$ go run .
Got SIGHUP, reloading config... hangup
Got SIGHUP, reloading config... hangup

# second terminal
$ pkill -SIGHUP signals; pkill -SIGHUP signals; pkill -SIGINT signals

This would be a catastrophic bug in an init system or process supervisor (and/or something like ContainerPilot, where it actually was a bug in early versions), which needs to catch SIGCHLD to reap zombie processes. It'd also cause dropped signals for an interactive terminal application, where we'd probably be catching SIGWINCH to detect terminal window size changes.

But for most web applications this isn't a huge deal. Typically where this bites us is when an orchestration layer sends SIGINT or SIGTERM for graceful shutdown and then kills the process unceremoniously after a timeout, while some kind of automated process is picking up changes from the environment and firing SIGHUP to do a config reload. If we drop the graceful shutdown signal because we're stuck in a config reload, then from the orchestrator's point of view the application has ignored the interrupt. After 10 seconds or whatever your timeout is, the orchestrator says "whelp, I give up" and sends a SIGKILL. And then our application drops in-flight requests and users are unhappy.


