A ZFS Device Driver for Nomad, Part 1

February 7, 2021

After yesterday's post, my colleague Chris Baker Tweeted:

I'm especially curious whether the device lifecycle is gonna give you the hooks you need without hacks.

Ominous foreshadowing.

The first challenge I identified in the device plugin API was that the plugin gets a list of device IDs in the Reserve method, but the scheduler only knows which device IDs are available from the fingerprint. We could create the datasets ahead of time out of band, but that's not a great experience for the job submitter. Coincidentally, this is similar to the problem we face with CSI in claiming a unique volume per allocation. That's going to require scheduler changes, which I'm working on in nomad/#7877. So maybe the volume implementation should anticipate applying "unique per allocation" interpolation to other resources as well.

Assuming we somehow solve that problem, there's another that comes to mind. I'm looking at the API for the device plugin and I see Reserve without a mirrored method to release the claim when we're done with it. And if we're reserving a device, shouldn't it be for a particular allocation? Then I see this part of the docs:

After helping to provision a task with a scheduled device, a device plugin does not have any responsibility (or ability) to monitor the task.
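The shape of the API makes the asymmetry easy to see. Here's a paraphrased sketch of the lifecycle-relevant part of the device plugin interface; the types and the `zfsPlugin` implementation are my own simplified stand-ins, not the real `plugins/device` structs:

```go
package main

import "fmt"

// ContainerReservation is a simplified stand-in for what Reserve returns:
// the real one carries envs, mounts, and device cgroup entries for the task.
type ContainerReservation struct {
	Mounts map[string]string // host path -> path inside the task
}

// DevicePlugin paraphrases the lifecycle-relevant half of the interface
// (the real one also has config, info, fingerprint, and stats methods).
type DevicePlugin interface {
	// Reserve tells the plugin a set of device IDs is about to be used
	// and asks how to expose them to the task...
	Reserve(deviceIDs []string) (*ContainerReservation, error)
	// ...and that's it: there is no Release counterpart for when the
	// task is done with the device.
}

// zfsPlugin is a hypothetical implementation to show the call pattern.
type zfsPlugin struct{}

func (z *zfsPlugin) Reserve(ids []string) (*ContainerReservation, error) {
	m := make(map[string]string)
	for _, id := range ids {
		m["/pool/"+id] = "/srv/" + id // pretend each ID names a dataset
	}
	return &ContainerReservation{Mounts: m}, nil
}

func main() {
	var p DevicePlugin = &zfsPlugin{}
	res, _ := p.Reserve([]string{"vol-0"})
	fmt.Println(res.Mounts["/pool/vol-0"]) // prints /srv/vol-0
}
```

Once Reserve has answered, the plugin's role in that task's life is over, which is exactly what the docs quoted above are saying.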

Hold up. This is reminding me of the old Seinfeld bit: anyone can just take the reservation; the important part is holding the reservation. How does this actually work?

Now at this point you might be asking yourself how an experienced engineer who's been working on a code base all day every day for a year and a half can simply not know how a feature like this works. Some of that is size: Nomad is 400k lines of Go code, plus another 50k or so of JavaScript, not counting vendored dependencies. Roughly half of that is tests. That's a lot to digest.

But Nomad is also reasonably well-architected (with plenty of room for improvement, of course!). Many features can be implemented as discrete "hooks" that get called by the various event loops. So once you get one of those features working you can mentally unload that context and it will need minimal maintenance. Abstraction of components isn't good just for the sake of it, but because it lets mere mortal developers like me build software solving galaxy brain problems.

With that out of the way, let's look at what's happening under the hood with the device plugin API. Note that throughout this section I'm linking to a specific tag so that the line references don't change over time. If you're reading this much later, you may find you want to look for the same functions at different line numbers on the current version.

Each Nomad client has a device manager that runs the device plugins. For each plugin instance it tracks, the manager runs a fingerprint: once at startup and then periodically. The instance fingerprint asks the plugin for a FingerprintResponse. We get back a list of devices which, according to the docs, we're supposed to assume are interchangeable from the perspective of the scheduler.
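As a concrete sketch of what that means for this project (types paraphrased, not the real ones): a ZFS plugin's fingerprint would report a device group whose instances are pre-created dataset names, and the scheduler would treat those IDs as interchangeable:

```go
package main

import "fmt"

// DeviceGroup paraphrases what a FingerprintResponse conveys: groups of
// devices keyed by vendor/type/name, whose device IDs the scheduler treats
// as interchangeable with one another.
type DeviceGroup struct {
	Vendor, Type, Name string
	DeviceIDs          []string
}

// fingerprintDatasets builds the group a hypothetical ZFS plugin might
// report. This is exactly the awkward part: the plugin decides the device
// IDs up front, before any job has asked for one.
func fingerprintDatasets(datasets []string) DeviceGroup {
	return DeviceGroup{
		Vendor:    "zfs",
		Type:      "storage",
		Name:      "dataset",
		DeviceIDs: datasets,
	}
}

func main() {
	g := fingerprintDatasets([]string{"pool/vol-0", "pool/vol-1"})
	fmt.Printf("%s/%s/%s: %v\n", g.Vendor, g.Type, g.Name, g.DeviceIDs)
}
```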

The fingerprint goes up to the device manager, which calls its updater to add the fingerprint to the client node's state. This makes its way via the Node.UpdateStatus RPC to the server, where it finally gets persisted as a NodeDeviceResource in the state store along with the rest of the client's resources. At this point, the servers know what devices are available on the client. This is the root of our problem around unique device IDs; the plugin is telling the server what device IDs are available, and not the other way around.

Now let's see what happens when we try to schedule a job with one of these devices. I'm going to skip past most of the scheduler logic here, but check out my colleague schmichael's awesome deep dive if you want to learn more. tl;dr we eventually get to a point where the scheduler has to rank which nodes can best fulfill the request. In the ranking iterator we attempt to get an "offer" for that device. The AssignDevice method checks the placement's feasibility: whether the node has enough of the requested devices available that match our constraints. But note that the scheduler is checking that the server's state of the world says we have enough of the devices; it's not communicating with the plugin at this point. Nomad's scheduler workers always work with an in-memory snapshot of the server state and don't perform I/O until they submit the plan to the leader.
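That feasibility step can be sketched as pure bookkeeping over the snapshot. This is a toy model of my own, not the actual AssignDevice code, but it shows the essential point that no plugin I/O is involved:

```go
package main

import "fmt"

// hasEnoughFree is a toy model of the question the scheduler's feasibility
// check answers from its in-memory snapshot: of the instances the
// fingerprint reported for a device group, are at least `count` not already
// claimed by existing allocations? No plugin is consulted here.
func hasEnoughFree(instances []string, claimed map[string]bool, count int) bool {
	free := 0
	for _, id := range instances {
		if !claimed[id] {
			free++
		}
	}
	return free >= count
}

func main() {
	instances := []string{"vol-0", "vol-1", "vol-2"}
	claimed := map[string]bool{"vol-1": true}
	// vol-0 and vol-2 are free, so a request for 2 is feasible.
	fmt.Println(hasEnoughFree(instances, claimed, 2)) // prints true
}
```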

So when do we talk to the device plugin? Once the plan is made and the client receives a placement for the allocation, the client fires a series of hooks for the allocation and all the tasks in the allocation. The device pre-start hook is what finally takes the list of device IDs and calls the plugin's Reserve method.

But just as we suspected from the plugin API, there's no matching post-stop hook. The Nomad server is responsible for keeping track of whether or not a device has been reserved, which makes Reserve a bit of a misnomer. Nomad isn't asking the device plugin to reserve the device; it's telling the plugin that Nomad has reserved the device, and asking it to tell the client where the device has been mounted.

Which leaves us in a tricky spot.

The device plugin can only get the state of the device via the fingerprint, so unless there's a visible side-effect of the device being used by a task, the plugin doesn't know when a task is done with the device.

Well, I did warn you that this series would include mistakes and dead ends. Unfortunately it looks like we're beyond the point of "without hacks", so what are my options?

First, I could certainly "cheat" and try to get a change in the device plugin behavior into Nomad itself. Perhaps the plugin should be sent a notice that the device isn't needed? Or maybe the Reserve API should have more information about the allocation? But that's the riskiest approach because we're pretty serious about backwards compatibility in Nomad's APIs, so we'd have to live with that change for a long time. And besides, our team has plenty of more important work on its plate than my silly experiments!

I could implement this whole workflow as a separate pre-start task using a lifecycle block (this was suggested by @anapsix on Twitter). We often recommend pre-start tasks because they make a great "escape hatch" for Nomad feature requests that we're unsure we want to implement, or that we just don't have time for on the roadmap. But in this case it doesn't meet the design requirement of having separate operator and job submitter personas. The pre-start task would have to be privileged, and that lets a job submitter execute arbitrary code as root on the host.

A variant on that idea would be to implement a custom task driver that only exposes a dataset for other tasks in the allocation. This would improve on the arbitrary pre-start task by having a plugin configuration controlled by the operator. There are a few disadvantages: there would be extra running processes for each dataset, and a task driver is simply a lot more work to implement. But a task driver already has hooks for the task's lifecycle, so it would let Nomad manage when the ZFS workflows happen.

Or I could implement a CSI plugin after all. I had wanted to steer away from that because of how painful I'd found CSI plugins. But at least I know in this case the CSI plugin will be well-behaved with Nomad. Curses.

Lastly, just because the Nomad plugin API doesn't do what I want, doesn't mean my device plugin client couldn't also communicate with Nomad via its HTTP API. The plugin is already privileged code running as Nomad's user (typically root), so it could make blocking queries to the client to get allocation state on its own node.
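A sketch of what that side channel could look like, with the HTTP call stubbed out: Nomad's blocking queries return an index, and passing it back as the wait index parks the request until something changes, so the plugin could wake up when allocations on its node reach a terminal state and only then run its cleanup. The `Alloc` type, `fetch` function, and statuses here are my simplified stand-ins; a real plugin would use the official `github.com/hashicorp/nomad/api` client against the local agent:

```go
package main

import "fmt"

// Alloc is a pared-down allocation as the HTTP API might report it.
// DeviceIDs stands in for digging the reserved device IDs out of the
// allocation's resources.
type Alloc struct {
	ID           string
	ClientStatus string
	DeviceIDs    []string
}

// watchAllocs sketches the side channel: fetch stands in for a blocking
// query against the local client's allocations endpoint (passing the
// last-seen index so the call parks until something changes), and release
// is the cleanup the plugin runs once an allocation holding a device
// reaches a terminal status. fetch returning ok=false stands in for
// shutdown or an unrecoverable error.
func watchAllocs(fetch func(waitIndex uint64) (allocs []Alloc, newIndex uint64, ok bool), release func(deviceID string)) {
	var index uint64
	released := map[string]bool{}
	for {
		allocs, newIndex, ok := fetch(index)
		if !ok {
			return
		}
		index = newIndex
		for _, a := range allocs {
			if a.ClientStatus != "complete" && a.ClientStatus != "failed" {
				continue
			}
			for _, id := range a.DeviceIDs {
				if !released[id] {
					released[id] = true
					release(id) // e.g. snapshot and destroy the dataset
				}
			}
		}
	}
}

func main() {
	// Canned responses: one poll while running, then the alloc finishes.
	responses := [][]Alloc{
		{{ID: "a1", ClientStatus: "running", DeviceIDs: []string{"vol-0"}}},
		{{ID: "a1", ClientStatus: "complete", DeviceIDs: []string{"vol-0"}}},
	}
	fetch := func(waitIndex uint64) ([]Alloc, uint64, bool) {
		if int(waitIndex) >= len(responses) {
			return nil, waitIndex, false
		}
		return responses[waitIndex], waitIndex + 1, true
	}
	watchAllocs(fetch, func(id string) { fmt.Println("release", id) }) // prints: release vol-0
}
```

The appeal of this option is that it needs no changes to Nomad at all; the cost is that the plugin now carries its own event loop and its own view of allocation state.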

Although weeks of coding can save hours of planning, I think I'm at the point where it's time to do some small experiments to explore these options. But next post I want to take a short detour to talk about RFCs.


© Timothy Gross

Except where otherwise noted, content on this site is licensed under the Creative Commons Attribution 3.0 Unported License. The code of this blog and all code content is licensed under the MIT license.