0x74696d

Software Defined Culture, Part 1 - Reliability

February 18, 2018

This five-part series of posts covers a talk titled Software Defined Culture that I gave at DevOps Days Philadelphia 2016, GOTO Chicago 2017, and Velocity San Jose 2017.

If you'd like to read the rest of the series:

  1. Part 0: Software Defined Culture
  2. Part 1: Build for Reliability
  3. Part 2: Build for Operability
  4. Part 3: Build for Observability
  5. Part 4: Build for Responsibility

Building for Reliability

Failure to build for reliability means we develop a culture of firefighting. Firefighting becomes a rut for organizations. Things are broken, we rush out fixes, and we pile on technical debt. The "temporary" hacks lead to problems that come back, grow, and cascade. Urgency starts to supersede importance. That is, the thing that needs to be fixed right now sets back the mission of the organization as a whole. Problems become crises, and the organization simply starts lurching from crisis to crisis. This has insidious effects on the organization's culture as we start rewarding our best firefighters. We misalign incentives away from the creation of value and towards the prevention of largely self-inflicted harm.

This is going to burn out your team. People don't like getting paged. They want to sleep. They want to spend time with their families. They want to spend their working hours tackling the cool "hard problems" you sold them on when you recruited them. Ironically, the best firefighters will be the ones to burn out, and then you'll be without them, as either they quit or their burnout starts to affect their personalities and performance. (Speaking from personal hard-won experience here!)

So given that we want to build for reliability, how do we get there? Far smarter people than me have given this serious, detailed treatment, but from a high-level view there are some obvious rough guidelines.

No More Resume Driven Development

Bleeding-edge technology is broken all. the. time. It's great that your engineers want to learn about Elixir or Vue.js or AerospikeDB. They should do that on their own time, or at least in your internal systems and dev tooling. Don't #YOLO that shit into production!

Not only is this bad for reliability, it also tells people that this is good engineering decision making: that it's okay to push untrusted systems into production. In fact, in many organizations you'll be rewarded for this behavior because you've "shipped". Pay no attention to the unmaintainable tire fire afterwards!

This doesn't just reward the behavior on your team; it's also a problem for hiring. If your job listings look like a Markov chain from the front page of Hacker News, what kind of developer is this attracting to your organization?

This sort of thing creates a vicious cycle. There are certain people within our profession who never really want to maintain anything. They want to chase the new shiny and then move on to the next thing. If one of these folks lands in your organization and they're allowed free rein, they'll bring in all their friends as well. I've witnessed these roving gangs of locust developers coming in, dropping a new framework or programming language on an organization, and heading off to the next organization to screw up.

Lack of reliability biases your organization against experienced hires. If you've been through a couple of cycles of burnout already (aside: it's super fucked up that this is even a thing), you're going to see a stack of shiny new tech in a job posting and immediately toss it out, because you know there's no way you want to live through that again. It also biases your organization against older folks in general, who will tend to have families at home, and particularly against women, as they still manage a disproportionate share of household duties. They don't want to be up all night helping you debug MongoDB because you decided to switch to the new storage engine after only a couple of days of testing.

Choose Boring Tech

We should have a strong bias towards choosing boring technology. You have a limited amount of time and energy to pour into innovation. That energy should be spent on the things that will have the biggest positive impact for your organization's mission.

Building for reliability encourages the development of a certain set of cultural values. It builds a culture of trust between developers and operators. It builds a culture of inclusivity. It builds a culture of sensible attitudes towards risk. And it builds a culture of work-life balance.

