I have a project where I need to generate unique database keys on the client rather than using autoincrementing keys, in order to support client-side sharding. (I'm effectively re-implementing Etsy's MySQL master-master scheme, if you want to know more.) I don't want to use UUIDs because they're 128 bits instead of 64, and because they're not time-ordered they can result in poor insert performance. So I'm implementing Twitter's now-retired Snowflake, which produces roughly time-ordered but unique IDs. In doing so, I ran into an interesting performance riddle that's worth sharing.
Be forewarned: this whole post is a bit of a shaggy-dog story. There isn't a satisfying conclusion with a smoking gun at the end, just a place to start work and some lessons about being careful with your assumptions about your test environment. But if you're interested in benchmarking golang code and how virtualization can foul the results, you might find it worth your while.
The Snowflakes I'm generating are 64-bit unsigned ints. The original Twitter Snowflake used int64, or 63 useful bits plus the unused sign bit, for reasons I assume have to do with the same legacy concerns they cited for wanting 64 bits in the first place. The first 41 bits are milliseconds since a custom epoch, which gives me 69 years' worth of IDs. The major constraint here is that we have to ensure we're using a clock that never goes backwards within the lifetime of a process. The next 11 bits are a unique worker ID, which gives me 2048 possible worker processes.
These worker IDs are the sole point of coordination required; whatever schedules the worker will need to reserve the worker IDs. In practice this is 100 times more workers than I'm likely to need for this project but that gives me plenty of headroom. It's safe to recycle worker IDs so long as they're unique among the running workers. The last 12 bits of the ID are a monotonic sequence ID. Each worker can therefore generate a maximum of 4096 IDs per millisecond.
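To make the later listings easier to follow, here's a rough sketch of that bit layout as Go constants. The names match the code shown further down in this post, but timestampBits and the specific values of nano and idEpoch are my own illustrative reconstruction rather than anything quoted from the real implementation.

package snowflake

import "time"

const (
	timestampBits = 41 // milliseconds since the custom epoch: ~69 years of IDs
	workerIDBits  = 11 // 2048 possible worker processes
	sequenceBits  = 12 // 4096 IDs per worker per millisecond

	maxSequence = (1 << sequenceBits) - 1 // 4095

	nano    = int64(time.Millisecond) // nanoseconds per millisecond
	idEpoch = int64(1451606400000)    // custom epoch in milliseconds (illustrative value: 2016-01-01 UTC)
)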
That 4096-per-millisecond limit means that our ID "server" for each worker process needs to make sure it doesn't issue sequence IDs larger than 4095 within a single millisecond; otherwise we overflow and end up repeating IDs. I also want the server to be safe to use from multiple threads (or goroutines in this case, because I'm writing it in golang because... I'm too dumb to use generics?). I poked around on GitHub for existing implementations first, without much satisfaction.
One version was a straight port of Twitter's Scala code which looked OK but obviously wasn't very idiomatic golang code. Another was laughably unthreadsafe. Another had an API that tangled up the server setup with implementation details for their worker ID generation. And so on. All the implementations I found return a uint64 or int64 directly without wrapping it. I want any consuming code I write later to notice if I accidentally do something stupid like try to do math on an ID, so embedding the uint64 in a Snowflake type gives me a modicum of type safety. There was also a nagging doubt in my mind about the number of times the original design was checking the time. (Ooo... foreshadowing!)
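Since that wrapper type shows up throughout the listings below, here's roughly what it looks like; the Snowflake name is from the code later in the post, while storeID and the example values are just my illustration of the compile-time check you get.

package snowflake

// Snowflake is a defined type over uint64, so raw integers can't be passed
// where an ID is expected without an explicit conversion.
type Snowflake uint64

func storeID(id Snowflake) {
	// persist the ID somewhere
}

func example() {
	var raw uint64 = 42
	// storeID(raw)         // compile error: cannot use raw (type uint64) as type Snowflake
	storeID(Snowflake(raw)) // the explicit conversion makes the intent obvious
}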
So I implemented my own version, where the server listens on a channel for millisecond ticks and resets the sequence ID counter. In theory this avoids a lot of calls to get the time, at the cost of one extra mutex acquisition per millisecond. Because I'm paranoid, I wanted to make sure I wasn't missing some profound performance regression by ditching the original Twitter design.
I also took the library that was the closest direct implementation of Twitter's design, sdming/gosnow, and ported it to use the same API as my own. Then I gave the same treatment to a slightly different approach, Matt Ho's savaki/snowflake. After writing a bunch of test code to make sure all three implementations behaved correctly, I put them all through the same benchmark (each with different names, of course):
var result Snowflake // avoids compiler optimizing-away output

func BenchmarkSnowflakeGen(b *testing.B) {
	server := NewServer()
	var id Snowflake
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			id = server.GetSnowflake()
		}
	})
	result = id
}
Before I ran the benchmark I wanted to do some reasoning from first principles. Because I can't generate more than 4096 IDs per millisecond, I know we have a lower bound of about 244 nanoseconds per ID (1,000,000 ns ÷ 4096 ≈ 244 ns), assuming we can saturate the server. A single thread could manage this only if all the rest of the code were free, so that's our target for multi-threaded code. I ran all three benchmarks on my Linux development machine and was pretty happy with the results.
$ cat /proc/cpuinfo | grep "model name" | head -1
model name : Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz
$ nproc
8
$ go test -v -run NULL -benchmem -bench .
BenchmarkSavakiSnowflakeGen-8 5000000 334 ns/op 0 B/op 0 allocs/op
BenchmarkSnowflakeGen-8 5000000 244 ns/op 0 B/op 0 allocs/op
BenchmarkTwitterSnowflakeGen-8 5000000 292 ns/op 0 B/op 0 allocs/op
PASS
ok snowflake 5.247s
$ GOMAXPROCS=4 go test -v -run NULL -benchmem -bench .
BenchmarkSavakiSnowflakeGen-4 5000000 360 ns/op 0 B/op 0 allocs/op
BenchmarkSnowflakeGen-4 5000000 244 ns/op 0 B/op 0 allocs/op
BenchmarkTwitterSnowflakeGen-4 5000000 250 ns/op 0 B/op 0 allocs/op
PASS
ok snowflake 5.148s
My own implementation hits the lower-bound benchmark! But the Savaki and Twitter-clone libraries both do respectably, so my concerns about lots of calls to gettimeofday seem to have been for nothing. Note that I've run this on both 8 cores and 4 (via GOMAXPROCS) to check for hidden Amdahl's Law problems. These and all other tests are run under golang 1.7.
All done, right? Well, I'm not going to be deploying on my 8-core Linux laptop. I'll probably be deploying to a Docker container on Triton, so I wanted to make sure the relative benchmark results would hold up there. I didn't happen to have Triton keys on my Linux laptop, so I switched over to my Mac, which did. But first I needed to build a container with the code, so I did that and sanity-checked it under Docker for Mac. At this point someone in the audience is slapping their head and saying "duh!", to which I can only say "ssh... you'll spoil it for everyone else!"
$ nproc
4
$ go test -v -run NULL -benchmem -bench .
BenchmarkSavakiSnowflakeGen-4 500000 3228 ns/op 0 B/op 0 allocs/op
BenchmarkSnowflakeGen-4 5000000 273 ns/op 0 B/op 0 allocs/op
BenchmarkTwitterSnowflakeGen-4 1000000 1086 ns/op 0 B/op 0 allocs/op
PASS
ok _/go/snowflake 4.430s
Wat?
Ok, for some reason when this post was first shared on Twitter a bunch of dumbasses who didn't bother to read the article decided to heap snark upon it because "hurr hurr I ran my software under virtualization and it was slower." That's not the interesting thing here! The interesting thing, which I'm sure I don't have to spell out to you because you've read this far, is that one implementation performed only a tiny bit worse while the other implementations performed much much worse.
Here's where we go off the rails a bit. The first thing I did, which wasn't all that necessary, was to run the benchmark in other environments to make sure I hadn't made some strange mistake. There's no allocation in any of the algorithms, so from here on out I'll elide the -benchmem data.
Here I run it on the MacOS host:
$ sysctl -n hw.ncpu
8
$ sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
$ go test -v -run NULL -bench .
BenchmarkSavakiSnowflakeGen-8 5000000 291 ns/op
BenchmarkSnowflakeGen-8 5000000 256 ns/op
BenchmarkTwitterSnowflakeGen-8 5000000 291 ns/op
PASS
ok snowflake 5.045s
$ GOMAXPROCS=4 go test -v -run NULL -bench .
BenchmarkSavakiSnowflakeGen-4 5000000 342 ns/op
BenchmarkSnowflakeGen-4 5000000 245 ns/op
BenchmarkTwitterSnowflakeGen-4 5000000 244 ns/op
PASS
ok snowflake 4.989s
Here I run it on a c4.xlarge on AWS:
$ nproc
4
$ cat /proc/cpuinfo | grep "model name" | head -1
model name : Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz
$ go test -v -run NULL -bench .
BenchmarkSavakiSnowflakeGen-4 5000000 373 ns/op
BenchmarkSnowflakeGen-4 5000000 263 ns/op
BenchmarkTwitterSnowflakeGen-4 10000000 247 ns/op
PASS
ok _/go/snowflake 6.228s
And here I run it on a Triton g4-highcpu-4G. Note that in this case we're running in a bare-metal container where we can see all the cores of the underlying host, but the process is only given time on those processors according to the Fair Share Scheduler (FSS). This instance has CPU shares very roughly equivalent to the 4 vCPUs the AWS instance has, albeit on slightly slower processors and at half the price. This benchmark is also pure CPU, which means that Triton's profound I/O advantages in avoiding hardware virtualization don't come into play at all! So don't try to directly compare the "marketing number" of 4 vCPU.
$ nproc
48
$ cat /proc/cpuinfo | grep "model name" | head -1
model name : Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
# wow, lots of Amdahl's law appears with 48 processors visible!
$ go test -v -run NULL -bench .
BenchmarkSavakiSnowflakeGen-48 1000000 1280 ns/op
BenchmarkSnowflakeGen-48 1000000 1135 ns/op
BenchmarkTwitterSnowflakeGen-48 1000000 1414 ns/op
PASS
ok _/go/snowflake 4.778s
$ GOMAXPROCS=8 go test -v -run NULL -bench .
BenchmarkSavakiSnowflakeGen-8 3000000 523 ns/op
BenchmarkSnowflakeGen-8 5000000 437 ns/op
BenchmarkTwitterSnowflakeGen-8 2000000 659 ns/op
PASS
ok _/go/snowflake 6.721s
$ GOMAXPROCS=4 go test -v -run NULL -bench .
BenchmarkSavakiSnowflakeGen-4 3000000 449 ns/op
BenchmarkSnowflakeGen-4 5000000 356 ns/op
BenchmarkTwitterSnowflakeGen-4 3000000 497 ns/op
PASS
ok _/go/snowflake 6.001s
Ok, so now that we've run the benchmark everywhere, we can see that the problem appears to be unique to Docker for Mac. Which isn't a production environment, so no need to get bent out of shape about it. Right? Well, you'll note from the position of your scroll bar that I decided to get bent out of shape about it anyway, just out of morbid curiosity.
The next step was to take a profiler to the implementations, both on my development machine and in Docker for Mac. It would have been neat to see the profiling on MacOS as well by way of comparison, but apparently go's profiling is totally broken outside of Linux.
First I ran the benchmark again on my Linux laptop, this time outputting a CPU profile. I'll then run this output through go's pprof tool. I first looked at the top 20 calls, sorted by cumulative time. The cumulative time in pprof refers to the total amount of time spent in the call including all its callees. As we'd hope and expect for a tight-looping benchmark, we spend the vast majority of our time in the code we're benchmarking. Perhaps unsurprisingly we spend about 70% of our time locking and unlocking a mutex.
$ go test -v -run NULL -bench BenchmarkSnowflake -cpuprofile cpu-snowflake.prof
BenchmarkSnowflakeGen-8 5000000 244 ns/op
PASS
ok snowflake 1.471s
$ go tool pprof cpu-snowflake.prof
Entering interactive mode (type "help" for commands)
(pprof) top20 -cum
2.77s of 3.65s total (75.89%)
Dropped 8 nodes (cum <= 0.02s)
Showing top 20 nodes out of 46 (cum >= 0.21s)
flat flat% sum% cum cum%
0.33s 9.04% 9.04% 2.88s 78.90% snowflake.(*Server).GetSnowflake
0 0% 9.04% 2.87s 78.63% runtime.goexit
0.01s 0.27% 9.32% 2.86s 78.36% snowflake.BenchmarkSnowflakeGen.func1
0 0% 9.32% 2.86s 78.36% testing.(*B).RunParallel.func1
0.51s 13.97% 23.29% 1.88s 51.51% sync.(*Mutex).Lock
0.83s 22.74% 46.03% 0.83s 22.74% sync/atomic.CompareAndSwapUint32
0.01s 0.27% 46.30% 0.74s 20.27% runtime.mcall
0 0% 46.30% 0.73s 20.00% runtime.park_m
0 0% 46.30% 0.68s 18.63% runtime.schedule
0.10s 2.74% 49.04% 0.67s 18.36% sync.(*Mutex).Unlock
0.02s 0.55% 49.59% 0.65s 17.81% runtime.findrunnable
0.06s 1.64% 51.23% 0.38s 10.41% runtime.lock
0.01s 0.27% 51.51% 0.34s 9.32% sync.runtime_Semacquire
0.03s 0.82% 52.33% 0.33s 9.04% runtime.semacquire
0.31s 8.49% 60.82% 0.31s 8.49% runtime/internal/atomic.Xchg
0.27s 7.40% 68.22% 0.27s 7.40% sync/atomic.AddUint32
0.02s 0.55% 68.77% 0.25s 6.85% runtime.semrelease
0 0% 68.77% 0.25s 6.85% sync.runtime_Semrelease
0.05s 1.37% 70.14% 0.21s 5.75% runtime.unlock
0.21s 5.75% 75.89% 0.21s 5.75% runtime/internal/atomic.Xadd
But where is that mutex? For clarity we can list the GetSnowflake function and see the call times compared to each section of code.
(pprof) list GetSnowflake
Total: 3.65s
ROUTINE ======================== snowflake.(*Server).GetSnowflake in snowflake.go
330ms 2.94s (flat, cum) 80.55% of Total
. . 87:
. . 88:// GetSnowflake ...
. . 89:func (s *Server) GetSnowflake() Snowflake {
. . 90: // we don't exit scope except in the happy case where we
. . 91: // return, so we can't 'defer s.lock.Unlock()' here
30ms 1.91s 92: s.lock.Lock()
140ms 140ms 93: sequence := s.id
. . 94: if sequence > maxSequence {
. . 95: s.lock.Unlock()
. 60ms 96: return s.GetSnowflake()
. . 97: }
. . 98: s.id++
10ms 10ms 99: timestamp := s.lastTimestamp
10ms 680ms 100: s.lock.Unlock()
. . 101: return Snowflake(
140ms 140ms 102: (timestamp << (workerIDBits + sequenceBits)) |
. . 103: (uint64(s.workerID) << sequenceBits) |
. . 104: (uint64(sequence)))
. . 105:}
This code probably makes little sense without the rest of the server code, so here's that.
// Server ...
type Server struct {
	workerID      workerID
	ticker        *time.Ticker
	lock          sync.Mutex
	lastTimestamp uint64
	id            uint16
}

// NewServer ...
func NewServer() *Server {
	ticker := time.NewTicker(time.Millisecond)
	server := &Server{
		ticker:        ticker,
		workerID:      0,
		lock:          sync.Mutex{},
		id:            0,
		lastTimestamp: uint64(time.Now().UnixNano()/nano - idEpoch),
	}
	go func() {
		for tick := range ticker.C {
			server.reset(tick)
		}
	}()
	return server
}

func (s *Server) reset(tick time.Time) {
	s.lock.Lock()
	defer s.lock.Unlock()
	s.lastTimestamp = uint64(tick.UnixNano()/nano - idEpoch)
	s.id = 0
}
Let's now do the same for the Twitter port. A bit less time is being spent in the mutex and a bit more in the runtime scheduler; nothing unexpected. I've included the annotated code listing again below.
$ go tool pprof cpu-snowflake-twitter.prof
Entering interactive mode (type "help" for commands)
(pprof) top20 -cum
2.50s of 4s total (62.50%)
Dropped 16 nodes (cum <= 0.02s)
Showing top 20 nodes out of 61 (cum >= 0.25s)
flat flat% sum% cum cum%
0.03s 0.75% 0.75% 2.94s 73.50% snowflake.BenchmarkTwitterSnowflakeGen.func1
0 0% 0.75% 2.94s 73.50% runtime.goexit
0 0% 0.75% 2.94s 73.50% testing.(*B).RunParallel.func1
0.24s 6.00% 6.75% 2.88s 72.00% snowflake.(*TwitterServer).GetSnowflake
0.29s 7.25% 14.00% 1.46s 36.50% sync.(*Mutex).Lock
0.03s 0.75% 14.75% 0.87s 21.75% runtime.mcall
0.03s 0.75% 15.50% 0.84s 21.00% runtime.park_m
0.02s 0.5% 16.00% 0.77s 19.25% runtime.schedule
0.04s 1.00% 17.00% 0.75s 18.75% sync.(*Mutex).Unlock
0.18s 4.50% 21.50% 0.73s 18.25% runtime.lock
0.12s 3.00% 24.50% 0.70s 17.50% runtime.findrunnable
0.08s 2.00% 26.50% 0.57s 14.25% runtime.semacquire
0 0% 26.50% 0.57s 14.25% sync.runtime_Semacquire
0.01s 0.25% 26.75% 0.54s 13.50% sync.runtime_Semrelease
0.06s 1.50% 28.25% 0.53s 13.25% runtime.semrelease
0.11s 2.75% 31.00% 0.44s 11.00% runtime.systemstack
0.42s 10.50% 41.50% 0.42s 10.50% sync/atomic.CompareAndSwapUint32
0.33s 8.25% 49.75% 0.33s 8.25% runtime.procyield
0.26s 6.50% 56.25% 0.26s 6.50% runtime/internal/atomic.Xchg
0.25s 6.25% 62.50% 0.25s 6.25% runtime/internal/atomic.Cas
(pprof) list GetSnowflake
Total: 4s
ROUTINE ======================== snowflake.(*TwitterServer).GetSnowflake in snowflake/snowflake_twitter.go
240ms 2.88s (flat, cum) 72.00% of Total
. . 41:
. . 42:// GetSnowflake ...
. . 43:func (s *TwitterServer) GetSnowflake() Snowflake {
. . 44:
20ms 40ms 45: ts := uint64(time.Now().UnixNano()/nano - idTwitterEpoch)
120ms 1.58s 46: s.lock.Lock()
20ms 220ms 47: defer s.lock.Unlock()
. . 48: // TODO: need to block on clock running backwards
. . 49:
40ms 40ms 50: if s.lastTimestamp == ts {
10ms 10ms 51: s.id = (s.id + 1) & maxSequence
. . 52: if s.id == 0 {
. . 53: for ts <= s.lastTimestamp {
. . 54: time.Sleep(10 * time.Microsecond)
. . 55: ts = uint64(time.Now().UnixNano()/nano - idTwitterEpoch)
. . 56: }
. . 57: }
. . 58: } else {
. . 59: s.id = 0
. . 60: }
. . 61: s.lastTimestamp = ts
. . 62: id := Snowflake(
20ms 20ms 63: (s.lastTimestamp << (workerIDBits + sequenceBits)) |
. . 64: (uint64(s.workerID) << sequenceBits) |
10ms 10ms 65: (uint64(s.id)))
. 960ms 66: return id
. . 67:
. . 68:}
I've replicated the results above under hardware-virtualized Linux (the AWS c4.xlarge) and bare-metal containers (the Triton g4-highcpu-4G) without learning anything new. From this point, in the interest of space, I'm going to abandon analysis of the SavakiServer and focus on just my version and the Twitter version. You'll have to trust me when I say that its profile matches the Twitter version's pretty closely, apart from some small details. But now I'll take the CPU profiling to the Docker for Mac environment.
First, my Snowflake server, whose profile looks mostly unchanged, as we'd expect from its benchmark numbers above.
$ go tool pprof cpu-snowflake.prof
Entering interactive mode (type "help" for commands)
(pprof) top20 -cum
1.51s of 1.85s total (81.62%)
Showing top 20 nodes out of 58 (cum >= 0.04s)
flat flat% sum% cum cum%
0.19s 10.27% 10.27% 1.64s 88.65% _/go/snowflake.(*Server).GetSnowflake
0 0% 10.27% 1.39s 75.14% runtime.goexit
0.03s 1.62% 11.89% 1.37s 74.05% _/go/snowflake.BenchmarkSnowflakeGen.func1
0 0% 11.89% 1.37s 74.05% testing.(*B).RunParallel.func1
0.32s 17.30% 29.19% 1.08s 58.38% sync.(*Mutex).Lock
0.49s 26.49% 55.68% 0.49s 26.49% sync/atomic.CompareAndSwapUint32
0.12s 6.49% 62.16% 0.37s 20.00% sync.(*Mutex).Unlock
0.19s 10.27% 72.43% 0.19s 10.27% sync/atomic.AddUint32
0.16s 8.65% 81.08% 0.16s 8.65% runtime.procyield
0.01s 0.54% 81.62% 0.16s 8.65% sync.runtime_doSpin
0 0% 81.62% 0.08s 4.32% runtime.mcall
0 0% 81.62% 0.08s 4.32% runtime.park_m
0 0% 81.62% 0.08s 4.32% runtime.semacquire
0 0% 81.62% 0.08s 4.32% sync.runtime_Semacquire
0 0% 81.62% 0.06s 3.24% runtime.findrunnable
0 0% 81.62% 0.06s 3.24% runtime.schedule
0 0% 81.62% 0.05s 2.70% runtime.copystack
0 0% 81.62% 0.05s 2.70% runtime.morestack
0 0% 81.62% 0.05s 2.70% runtime.newstack
0 0% 81.62% 0.04s 2.16% runtime.cansemacquire
And now for the Twitter server.
$ go tool pprof cpu-snowflake-twitter.prof
Entering interactive mode (type "help" for commands)
(pprof) top20 -cum
3.12s of 3.22s total (96.89%)
Dropped 9 nodes (cum <= 0.02s)
Showing top 20 nodes out of 24 (cum >= 0.02s)
flat flat% sum% cum cum%
0 0% 0% 2.84s 88.20% runtime._System
2.83s 87.89% 87.89% 2.83s 87.89% runtime._ExternalCode
0.01s 0.31% 88.20% 0.35s 10.87% _/go/snowflake.BenchmarkTwitterSnowflakeGen.func1
0 0% 88.20% 0.35s 10.87% runtime.goexit
0 0% 88.20% 0.35s 10.87% testing.(*B).RunParallel.func1
0.02s 0.62% 88.82% 0.34s 10.56% _/go/snowflake.(*TwitterServer).GetSnowflake
0.07s 2.17% 90.99% 0.27s 8.39% sync.(*Mutex).Lock
0.10s 3.11% 94.10% 0.10s 3.11% runtime.procyield
0 0% 94.10% 0.10s 3.11% sync.runtime_doSpin
0.02s 0.62% 94.72% 0.04s 1.24% runtime.deferreturn
0.04s 1.24% 95.96% 0.04s 1.24% sync/atomic.CompareAndSwapUint32
0 0% 95.96% 0.03s 0.93% runtime.mcall
0 0% 95.96% 0.03s 0.93% runtime.park_m
0.01s 0.31% 96.27% 0.03s 0.93% runtime.schedule
0 0% 96.27% 0.03s 0.93% runtime.semacquire
0 0% 96.27% 0.03s 0.93% sync.runtime_Semacquire
0.01s 0.31% 96.58% 0.03s 0.93% sync.runtime_canSpin
0 0% 96.58% 0.02s 0.62% runtime.findrunnable
0.01s 0.31% 96.89% 0.02s 0.62% runtime.runqempty
0 0% 96.89% 0.02s 0.62% runtime.runqgrab
Finally, a clue! But what are runtime._System and runtime._ExternalCode? The golang docs don't give us much help here, but this post by Dmitry Vyukov on an Intel blog has this to say:
System means time spent in goroutine scheduler, stack management code and other auxiliary runtime code. ExternalCode means time spent in native dynamic libraries.
We're pushing into areas where I don't have deep expertise, but given that I've got a statically linked binary here, this sounds like we're talking about system calls. So I put strace on the benchmark runs and counted up the calls. I've elided all the noise here and taken the top handful of syscalls from each one.
My version, on native Linux.
$ strace -cf go test -run NULL -bench BenchmarkSnowflake
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
74.57 1.592028 42 37914 8385 futex
25.40 0.542223 9 62410 select
0.02 0.000493 0 19141 sched_yield
0.00 0.000053 5 10 wait4
0.00 0.000015 0 1766 read
0.00 0.000015 0 80 55 unlinkat
...
------ ----------- ----------- --------- --------- ----------------
100.00 2.134836 128426 8858 total
The Twitter version, on native Linux.
$ strace -cf go test -run NULL -bench BenchmarkTwitter
...
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
78.46 1.642500 35 46413 5088 futex
20.76 0.434634 9 51023 select
0.77 0.016064 1606 10 wait4
0.01 0.000137 0 20525 sched_yield
0.00 0.000022 0 833 close
0.00 0.000017 0 1528 13 lstat
0.00 0.000012 0 604 rt_sigaction
0.00 0.000012 2 6 rt_sigreturn
...
------ ----------- ----------- --------- --------- ----------------
100.00 2.093406 126847 5561 total
My version, on Docker for Mac.
$ strace -cf go test -run NULL -bench BenchmarkSnowflake
...
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
76.19 4.201271 550 7641 1664 futex
19.11 1.053696 119 8856 select
3.27 0.180172 101 1776 read
0.52 0.028537 2 16723 clock_gettime
0.37 0.020437 2044 10 wait4
0.36 0.020000 3333 6 waitid
0.17 0.009448 12 785 sched_yield
0.01 0.000342 57 6 rt_sigreturn
...
------ ----------- ----------- --------- --------- ----------------
100.00 5.514387 42723 2108 total
The Twitter version, on Docker for Mac.
$ strace -cf go test -run NULL -bench BenchmarkTwitter
...
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
67.48 2.568430 1468 1750 258 futex
21.61 0.822401 707 1164 select
10.63 0.404649 3 118067 clock_gettime
0.27 0.010230 1023 10 wait4
0.00 0.000171 0 1515 9 lstat
0.00 0.000136 0 749 openat
...
------ ----------- ----------- --------- --------- ----------------
100.00 3.806243 129994 702 total
We expect to see lots of futex calls ("fast userspace mutex") because the golang runtime uses them heavily under the hood, both for its own scheduling and for our mutex.Lock/mutex.Unlock calls. We can see that my version spends more of its total time in system calls than the Twitter version in both cases. We can also see the large amount of overhead that strace itself incurs; these benchmarks only take ~1 second without it! If only Linux had dtrace! So we should take the timings with a heaping tablespoon of salt. But wait a second: where is clock_gettime in the native Linux version?
It's time to get vDSO! (That probably sounded cooler in my head.) Virtual dynamic shared objects are a pretty nifty performance optimization. Normally when a userspace program makes a system call, there's a fairly expensive bit of shuffling around: registers get saved, the CPU switches into kernel mode, the kernel does its work, and then everything gets switched back again. But lots of programs want to make certain system calls, like checking what time it is, over and over. So the kernel maps a small read-only region of code and data for those calls into every process's address space, and userspace can call those functions directly without the context switch of a real syscall. That's also why clock_gettime doesn't show up in the native Linux strace output above: the vDSO path never traps into the kernel, so strace has nothing to see. I'm not super-familiar with the history of this feature, but there's an older mechanism called vsyscall that vDSO has replaced, and it looks like the feature started in ever-practical-but-often-a-bit-gross Linux and got ported or emulated in the BSDs and Illumos family later down the road. (I'm relying on Cunningham's Law to correct me on this point.)
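As an aside, here's a small diagnostic I find handy for making the vDSO effect concrete (my own sketch, not something from the original investigation): a program that does nothing but call time.Now in a tight loop. Run it under strace -c -e trace=clock_gettime,gettimeofday on each system; where the vDSO fast path is working, strace sees almost no clock calls at all, and where the runtime falls back to real system calls you'll see roughly one per iteration, along with a much higher cost per call.

package main

import (
	"fmt"
	"time"
)

func main() {
	const iterations = 1000000
	start := time.Now()
	for i := 0; i < iterations; i++ {
		_ = time.Now() // on native Linux this should stay on the vDSO fast path
	}
	elapsed := time.Since(start)
	fmt.Printf("%d calls to time.Now() in %v (%v per call)\n",
		iterations, elapsed, elapsed/time.Duration(iterations))
}

With that background, back to the profile.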
We can use go tool pprof to disassemble our function, but we don't get real machine assembler in this case. Instead we get the Plan 9-flavored pseudo-assembly that the golang toolchain uses.
# irrelevant sections elided
(pprof) disasm GetSnowflake
Total: 4.15s
ROUTINE ======================== snowflake.(*TwitterServer).GetSnowflake
230ms 3.17s (flat, cum) 76.39% of Total
. 20ms 46ee6a: CALL time.Now(SB)
. . 46ef47: CALL time.Now(SB)
But we can take the test binary built by the CPU profiler and dump it with objdump to see how the time-related calls are actually made, and whether vDSO might be a factor here.
# machine code bytes elided for space
$ objdump -d snowflake.test| grep gettime
45887e: mov 0x16935b(%rip),%rax # 5c1be0 <runtime.__vdso_clock_gettime_sym>
4588b5: mov 0x139864(%rip),%rax # 592120 <runtime.__vdso_gettimeofday_sym>
4588ee: mov 0x1692eb(%rip),%rax # 5c1be0 <runtime.__vdso_clock_gettime_sym>
45892e: mov 0x1397eb(%rip),%rax # 592120 <runtime.__vdso_gettimeofday_sym>
A bit of googling leads us to this bit of golang assembly in the golang runtime code. Note that this is not x86 assembly, and even if it were, I'm running into the limits of my useful knowledge. Presumably it's calling into the vDSO page that's mapped into userspace.
// func now() (sec int64, nsec int32)
TEXT time·now(SB),NOSPLIT,$16
	// Be careful. We're calling a function with gcc calling convention here.
	// We're guaranteed 128 bytes on entry, and we've taken 16, and the
	// call uses another 8.
	// That leaves 104 for the gettime code to use. Hope that's enough!
	MOVQ	runtime·__vdso_clock_gettime_sym(SB), AX
	CMPQ	AX, $0
	JEQ	fallback
	MOVL	$0, DI // CLOCK_REALTIME
	LEAQ	0(SP), SI
	CALL	AX
	MOVQ	0(SP), AX  // sec
	MOVQ	8(SP), DX  // nsec
	MOVQ	AX, sec+0(FP)
	MOVL	DX, nsec+8(FP)
	RET
fallback:
	LEAQ	0(SP), DI
	MOVQ	$0, SI
	MOVQ	runtime·__vdso_gettimeofday_sym(SB), AX
	CALL	AX
	MOVQ	0(SP), AX  // sec
	MOVL	8(SP), DX  // usec
	IMULQ	$1000, DX
	MOVQ	AX, sec+0(FP)
	MOVL	DX, nsec+8(FP)
	RET
Funny story about that "hope that's enough" comment: my colleagues at Joyent recently ran into it while investigating a segfaulting golang program on LX. Spoiler alert: it was not enough.
So why is there a difference between Docker for Mac and all the other environments? And if it's something specific to the Mac, why didn't we see it when running natively on MacOS? There's one more wrench in this story, and that's the Hypervisor.framework and xhyve. Unlike all the other cases, Docker for Mac runs its Linux VM under xhyve, a hypervisor that runs as an ordinary userspace process on top of MacOS's Hypervisor.framework. Presumably somewhere in that userspace emulation our cheap time lookups stop being cheap; that would square with clock_gettime showing up in the Docker for Mac strace output as a real (and comparatively slow) system call rather than staying on the vDSO fast path.
This is unfortunately where my enthusiasm for following this to the end starts to flag. That's probably a disappointing finish, but in fairness I totally warned you! If you want to tackle this further, xhyve is open source and developed on GitHub. The real conclusion is to be careful that you're benchmarking what you really think you are. Even in cases where you've got "pure" CPU benchmarks you can be caught by surprise!