Locality

June 4, 2019

On Memorial Day weekend I grilled a steak, as one does. ¹ A few days before, I took it out of the freezer and put it into the fridge to thaw on a plate. Late Sunday afternoon I took it out of the fridge and patted it down with a paper towel. Then I put it back in the fridge. Then I took it out and rubbed it down with oil and salt and pepper. Then I put it back in the fridge. I stepped outside to get the charcoals going. When I came back I found that my partner had seen the steak sitting in the fridge unattended and put it back into the freezer. I took the steak out of the freezer and back into the fridge. Then I took it out of the fridge and moved it to the grill. After a few minutes I ~~stretched this metaphor to its breaking point~~ took the steak off the grill and put it back into the fridge. Then I took it out of the fridge and put it back on the grill to cook the other side. Then I put it back in the fridge. Finally, I took it out to serve.

This is your data on serverless.

In December a paper Serverless Computing: One Step Forward, Two Steps Back (Joseph M. Hellerstein, Jose Faleiro, Joseph E. Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov and Chenggang Wu) discussed this problem at more depth and seriousness that I've done here. The authors address missed opportunities in the current serverless landscape such as specialized hardware (ex. GPUs), but that's a matter of feature development and not inherent to the model as it currently exists.

The more fundamental problem they illustrate is that serverless is a "data shipping architecture" where communication between tasks is via storage I/O. Instead of being able to take advantage of all the last couple decades' worth of advances in distributed computing, we're relying on a giant blob of global state. This problem persists even if we assume that the various operational difficulties of deploying serverless can be resolved with better tooling. (I don't see any reason this shouldn't be the case, see companies like IOPipe for an example of the possibilities). But in the existing implementations of serverless, we can't get around the problem that your serverless functions are a sea of unstructured side-effects.

In addition to semantics that'll make Haskell developers cry, the lack of data locality undermines mechanical sympathy. How can we we reason about performance when the underlying compute is so profoundly abstracted and your next "cache line" is an S3 API response 200ms away? This isn't so bad if you're a large cloud provider charging a premium for those milliseconds. But if performance is important to your workload (or perhaps you just care about the environmental impact of all that extraneous compute power), it's worth considering if the tradeoffs are worth it.

There's an interesting historical note here in that only a short time ago the industry understood this problem of data locality, and this led to the Hadoop hype. In a typical map-reduce workflow, your data is distributed across HDFS and then your mapping computation happens physically co-located with the data. This hasn't ever been my particular area of expertise, but it seems that there were a couple of factors that contributed to the fizzling of the Hadoop hype. One is that it doesn't support update-in-place semantics, so you can't quite support arbitrary Unix applications. The second factor is the dominance of object storage in the form of S3 and the various upstack services that AWS has created on top of it. The pricing of S3 is aggressive relative to trying to build HDFS on top of instance storage or EBS, so if you're all-in on the cloud it's hard to make the economics work.

A counterexample of this trend is Joyent's Manta. They have an object store built on top of their Triton platform that allows you to instantiate a container (a SmartOS zone) directly "on" the objects in the object store. So you get a full Unix environment to perform compute on the objects without moving the data. Your ability to parallelize workloads is limited only by the replication factor and size of the storage cluster. Under the hood it's all built on ZFS, zones, and cleverly managed Postgres. It's really amazing technology and as a bonus it's open source!

There are definitely a few barriers to Manta's wider adoption. Without Linux support for the compute zones, machine learning teams are less likely to adopt it. It doesn't support the S3 API so organizations potentially have a bunch of third-party tooling to recreate. While Manta is open source, it's decidedly not a standalone application but really a way of building an entire datacenter. So it can't be deployed onto AWS if you're already there. (Joyent does have an excellent cloud offering if you don't need much in the way of AWS upstack services.) And most importantly from the standpoint of serverless workflows, there's not yet a way to "watch" for events on Manta or get a changefeed as an end user; this could allow Lambda-like workflows.

If you are a smaller cloud provider or just an organization struggling with problems of data locality in your data pipeline, you could do much worse than standing on the shoulders of giants and taking a look at Manta. Even if you're already all-in on AWS and/or Linux containerization schedulers like k8s or Nomad, there's an opportunity for a sufficiently motivated team² to take inspiration from Manta to build a system that brings better mechanical sympathy to serverless.

Yes, more cooking metaphors. ↩︎
Which could include me, if you were to hire me to work on projects like this at your org! ↩︎