Marginal cost of making mistakes

01Aug17

In a note to my last post ‘Safety first’ I promised more on this topic, so here goes…

TL;DR

As software learns from manufacturing by adopting the practices we’ve come to call DevOps, we’ve got better at catching mistakes earlier and more often in our ‘production lines’ to reduce their cost; but what if the whole point of software engineering is to make mistakes? What if the mistake is the unit of production?

Marginal cost

Wikipedia has a pretty decent definition of marginal cost:

In economics, marginal cost is the change in the opportunity cost that arises when the quantity produced is incremented by one unit, that is, it is the cost of producing one more unit of a good. Intuitively, marginal cost at each level of production includes the cost of any additional inputs required to produce the next unit.

This raises the question of what a ‘unit of a good’ is with software.

What do we make?

Taking the evolutionary steps of industrial design maturity that I like to use when explaining DevOps, it seems we could say the following:

  • Design for purpose (software as a cottage industry) – we make a bespoke application. Making another one isn’t an incremental thing, it’s a whole additional dev team.
  • Design for manufacture (packaged software) – when software came in boxes this stuff looked like traditional manufactured goods, but the fixed costs associated with the dev team would be huge versus the incremental costs of another cardboard box, CD and set of manuals. As we’ve shifted to digital distribution marginal costs have tended towards zero, so thinking about marginal cost isn’t really useful if we’re thinking that the ‘good’ is a given piece of packaged software.
  • Design for operations (software as a service/software based services) – as we shift to this paradigm then the unit of good becomes more meaningful – a paying user, or a subscription. These are often nice businesses to be in as the marginal costs of adding more subscribers are generally small and can scale well against underlying infrastructure/platform costs that can also be consumed as services.
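To make that last point concrete, here’s a toy sketch (all numbers invented) where marginal cost is just C(n+1) − C(n), and the per-subscriber cost is dwarfed by the fixed dev-team cost:

```python
def total_cost(subscribers: int) -> float:
    """Toy cost model (made-up numbers): a big fixed dev-team cost
    plus a small metered per-subscriber infrastructure cost."""
    fixed = 1_000_000.0      # dev team, tooling, etc.
    per_subscriber = 0.05    # platform cost consumed as a service
    return fixed + per_subscriber * subscribers

def marginal_cost(n: int) -> float:
    """Cost of producing one more unit: C(n + 1) - C(n)."""
    return total_cost(n + 1) - total_cost(n)

print(round(marginal_cost(10), 2))           # 0.05 whether n is ten...
print(round(marginal_cost(10_000_000), 2))   # ...or ten million
```

The marginal cost is flat and tiny at any scale, which is why ‘a paying subscriber’ works as a unit of good where ‘a copy of the software’ no longer does.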

The cost of mistakes

Mistakes cost money, and the earlier you eliminate a mistake from a value chain the less money you waste on it. This is the thinking that lies at the heart of economic doctrine from our agricultural and industrial history. We don’t want rotten apples, so better to leave them unpicked versus spending effort on harvesting, transportation etc. just to get something to market that won’t be sold. It’s the same in manufacturing – we don’t want a car where the engine won’t run, or the panels don’t fit, so we’ve optimised factory floors to identify and eliminate mistakes as early as possible, and we’ve learned to build feedback mechanisms to identify the causes of mistakes and eliminate them from designs (for the product itself, and how it’s made).

What we now label ‘DevOps’ is largely the software industry relearning the lessons of 20th century manufacturing – catch mistakes early in the process, and systematically eliminate their causes.

Despite our best efforts mistakes make it through, and in the software world they become ‘bugs’ or ‘vulnerabilities’. For any sufficiently large code base we can start building statistical models for probability and impact of those mistakes, and we can even use the mistakes we’ve found already to build a model for the mistakes we’ve not found yet[1].
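As one hedged illustration of that kind of modelling (not necessarily the method used in the referenced paper), capture-recapture – a technique borrowed from ecology – estimates the defects not yet found from the overlap between two independent reviews:

```python
def estimate_total_defects(found_a: set, found_b: set) -> float:
    """Lincoln-Petersen capture-recapture estimate.

    If two independent reviews find defect sets A and B, the
    estimated total defect population is |A| * |B| / |A intersect B|."""
    overlap = len(found_a & found_b)
    if overlap == 0:
        raise ValueError("no overlap; the estimate is unbounded")
    return len(found_a) * len(found_b) / overlap

# Hypothetical reviews: A found 12 bugs, B found 10, and 6 were found by both.
reviews_a = {f"bug-{i}" for i in range(12)}
reviews_b = {f"bug-{i}" for i in range(6)} | {f"bug-{i}" for i in range(12, 16)}

total = estimate_total_defects(reviews_a, reviews_b)
print(total)                               # 20.0 – estimated total defects
print(total - len(reviews_a | reviews_b))  # 4.0 – estimated still undiscovered
```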

Externality and software

Once again I can point to a great Wikipedia definition for externality:

In economics, an externality is the cost or benefit that affects a party who did not choose to incur that cost or benefit. Economists often urge governments to adopt policies that “internalize” an externality, so that costs and benefits will affect mainly parties who choose to incur them.

Externalities, where the cost of a mistake doesn’t affect the makers of the mistake, happen a lot with software, and particularly with packaged software and the open source that’s progressively replaced it in many areas. It’s different at the other extremes. If I build a trading robot that goes awry and kills my fund then the cost of that mistake is internalised. Similarly if subscribers can’t watch their favourite show then although that might initially look like an externality (the service has their money, and the subscriber has to find something else to do with their time) it quickly gets internalised if it impacts subscriber loyalty.

Exploring the problem space

Where we really worry the most about mistakes in software is when there’s a potential real world impact – we don’t want planes falling out of the sky, or nuclear reactors melting down etc. This is the cause of statements like, ‘that’s fine for [insert thing I’ll trivialise here], but I wouldn’t build a [insert important thing here] like that’.

Software as a service (or software based services) can explore its problem space all the way into production using techniques like canary releases[2]. People developing industrial control systems don’t have that luxury (as impact is high, and [re]release cycles are long), so they necessarily need to spend more time on simulation and modelling, thinking through what could go wrong and figuring out how to stop it. This dichotomy can easily distil down to a statement on the relative merits of waterfall versus agile design approaches, which Paul Downey nailed as:

Agile: make it up as you go along.
Waterfall: make it up before you start, live with the consequences.

It can be helpful to look at these through the lens of risk. ‘Make it up as you go along’ can actually make a huge amount of sense if you’re exploring something that’s unknown (or a priori unknowable), which is why it makes so much sense for ‘genesis’ activities[3]. ‘Live with the consequences’ is fine if you know what those consequences might be. In each case the risk appetite can be balanced against an ability to absorb or mitigate risk.
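The canary-release idea mentioned above can be sketched in a few lines. This is a deliberately naive illustration (real canary analysis, such as Netflix’s, compares many metrics statistically; all names and thresholds here are invented):

```python
import random

def route(rng: random.Random, canary_fraction: float = 0.01) -> str:
    """Send a small, random fraction of traffic to the new release."""
    return "canary" if rng.random() < canary_fraction else "stable"

def should_roll_back(stable_errors: int, stable_total: int,
                     canary_errors: int, canary_total: int,
                     tolerance: float = 1.5) -> bool:
    """Roll back if the canary's error rate exceeds the stable
    error rate by more than the given tolerance factor."""
    stable_rate = stable_errors / stable_total
    canary_rate = canary_errors / canary_total
    return canary_rate > stable_rate * tolerance

# Route a sample of traffic, then compare error rates before promoting.
rng = random.Random(0)
sample = [route(rng) for _ in range(10_000)]
print(sample.count("canary"))                 # roughly 100 of 10,000 requests
print(should_roll_back(50, 10_000, 2, 100))   # True: 2% vs 0.5% error rate
print(should_roll_back(50, 10_000, 0, 100))   # False: canary looks healthy
```

The mistake (a bad release) still happens, but it happens to 1% of traffic for a few minutes, which is what makes its marginal cost so low.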

This can be where the ‘architecture’ thing breaks down

We frequently use ‘architecture’ when talking about software, but it’s a word taken from the building industry, and professional architects get quite upset about their trade moniker being (ab)used elsewhere. When you pour concrete, mistakes get expensive, because fixing the mistake involves physical labour (with picks and shovels) to smash down what was done wrong before fresh concrete can be poured again.

Fixing a software mistake (if it’s caught soon enough) is nothing like smashing down concrete, which is why as an industry we’ve invested so much in moving towards continuous integration (CI) and related techniques in order to catch mistakes as quickly and cheaply as possible.
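The CI principle can be illustrated with something as small as this sketch – a fast, automated check that makes a mistake cost seconds rather than a release (a real pipeline would use a test runner such as pytest and run on every commit; the function here is an invented example):

```python
def apply_discount(price: float, percent: float) -> float:
    """Business logic under test: reduce a price by a percentage."""
    return price * (1 - percent / 100)

def test_apply_discount():
    # Cheap checks run on every change, long before production.
    assert round(apply_discount(100.0, 10), 6) == 90.0
    assert apply_discount(100.0, 0) == 100.0

test_apply_discount()
print("checks passed")
```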

Turning this whole thing around

What if the unit of production is the mistake?

What then if we make the cost per unit as low as possible?

That’s an approach that lets us discover our way through a problem space as cheaply as possible: testing what works and finding what doesn’t – experimentation on a massive scale – or, as Edison put it:

I’ve not failed. I’ve just found 10,000 ways that won’t work.

What we see software as a service and software based services companies doing is finding ways that work by eliminating thousands of ways that don’t, as cheaply and quickly as possible. The point is that this approach isn’t limited to those types of companies. When we simulate and model we can discover our way through almost any problem space. This is what banks do with the millions of ‘bump runs’ through Monte Carlo simulation of their financial instruments in every overnight risk analysis, and similar techniques lie at the heart of most science and engineering.
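A miniature, hedged version of such a run: simulating overnight prices under geometric Brownian motion (an assumed toy model with invented parameters; real risk engines use far richer models across whole portfolios) and reading off a Value at Risk:

```python
import math
import random

def simulate_terminal_prices(s0: float, mu: float, sigma: float,
                             horizon: float, n_paths: int,
                             seed: int = 42) -> list[float]:
    """Simulate terminal prices under geometric Brownian motion.
    Each path is one 'mistake' we can afford: a scenario that
    mostly won't happen, computed because it's cheap to compute."""
    rng = random.Random(seed)
    prices = []
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        st = s0 * math.exp((mu - 0.5 * sigma ** 2) * horizon
                           + sigma * math.sqrt(horizon) * z)
        prices.append(st)
    return prices

def value_at_risk(s0: float, prices: list[float],
                  confidence: float = 0.99) -> float:
    """Loss exceeded only (1 - confidence) of the time."""
    losses = sorted(s0 - p for p in prices)
    return losses[int(confidence * len(losses))]

prices = simulate_terminal_prices(s0=100.0, mu=0.05, sigma=0.2,
                                  horizon=1 / 252, n_paths=100_000)
print(round(value_at_risk(100.0, prices), 2))  # overnight 99% VaR, per 100 notional
```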

Of course there’s still scope for ‘stupid’ mistakes – mistakes made (accidentally or intentionally) when we should know better. This is why a big part of the manufacturing discipline now finding its way into software is ‘it’s OK to make mistakes, but try not to make the same mistake twice’.

Wrapping up

As children we’re taught not to make mistakes – for our own safety, and throughout our education the pressure is to get things right. With that deep cultural foundation it’s easy to characterise software development as a process that seeks to minimise the frequency and cost of mistakes. That’s a helpful approach to some degree, but as we get to the edges of our understanding it can be useful to turn things around. The point of software can be to make mistakes – lots of them, as quickly and cheaply as possible, because it’s often only by eliminating what doesn’t work that we find what does.

Acknowledgement

I’d like to thank Open Cloud Forum’s Tony Schehtman for making me re-examine the whole concept of marginal cost after an early conversation on this topic – it’s what prompted me to go a lot deeper and figure out that the unit of production might be the mistake.

Notes

[1] ‘Milk or Wine: Does software security improve with age?’
[2] I’d highly recommend Roy Rapoport’s ‘Canary Analyze All The Things: How We Learned to Keep Calm and Release Often’, which explains the Netflix approach.
[3] Oblique reference to Wardley maps, where I’d recommend a look at: the OSCON video, the CIO magazine article, the blog intro, the (incomplete) book (as a series of Medium posts by chapter; Chapter 2 has the key stuff about mapping), and the online course.


