Amazon’s Chris Munns announced at the recent Serverless Conference NYC that AWS Lambda will soon support a feature called traffic shifting. This will allow a weight to be applied to Lambda function aliases to shift traffic between two versions of a function. The feature will enable the use of canary releases and blue/green deployment.
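Based on how the feature was described, weighted alias routing might end up looking something like the sketch below (boto3; the RoutingConfig shape and the function/alias names are my assumptions from the announcement, not a published API):

```python
# Sketch only: the feature was announced but not yet released at the time of
# writing, so the RoutingConfig parameter shown here is an assumption.
import boto3

lam = boto3.client("lambda")

# Point the 'live' alias at version 1, but send ~5% of invocations to version 2.
lam.update_alias(
    FunctionName="my-function",      # hypothetical function name
    Name="live",
    FunctionVersion="1",             # the stable version gets the remaining 95%
    RoutingConfig={"AdditionalVersionWeights": {"2": 0.05}},
)

# Promoting the canary would then just be repointing the alias and clearing the weights.
lam.update_alias(
    FunctionName="my-function",
    Name="live",
    FunctionVersion="2",
    RoutingConfig={"AdditionalVersionWeights": {}},
)
```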

Continue reading the full story at InfoQ.


I’m writing this for my fellow DXCers, but I’d expect the points I make here to apply to any open source project.

The first thing I’ll check is the README.md

Because that’s the first thing that somebody visiting the project will see.

Is the README written for them – the newbies – the people who’ve never seen this stuff before?

The next thing I’ll check is the README.md

Does it explain the purpose of the project (why)?
Does it explain what is needed to get the project and its dependencies installed?
Does it explain how to use the project to fulfil its intended purpose?

Then I’ll check the README.md again

Does the writing flow, with proper grammar and correct spelling? Are the links to external resources correct? Are the links to other parts of the project correct (beware stuff carried over from previous repos where the project might have lived during earlier development)?

OK – I’m done with README.md – what else?

Is the Description field filled out (and correct, and sufficient to keep the lawyers happy)?

Is the project name in line with standards/conventions?

Have we correctly acknowledged the work of others (and their Trademarks etc.) where appropriate?

Is the LICENSE.md correct (dates, legal entities etc.)?

Is there a CONTRIBUTING.md telling people how they can become part of the community we’re trying to build around this thing (which is generally the whole point of open sourcing something)?

Are you ready for a Pull Request?

I might just do one to find out; but seriously – somebody needs to be on the hook to respond to PRs, and they need a combination of empowerment (to get things done) and discretion (to know what’s OK and what’s not).


TL;DR

VMs on public cloud don’t provide the same level of control over sizing as on-premises VMs, and this has a number of impacts on how capacity is managed. Most importantly, ‘T-shirt’ sizing can provide a sub-optimal fit of workload to infrastructure, and the ability to over-commit CPUs is very much curtailed.

Introduction

Capacity Management is an essential discipline in IT – important enough to be a whole area of the IT Infrastructure Library (ITIL). As organisations shift from on-premises virtualisation to cloud-based infrastructure as a service (IaaS), it’s worth looking at how things change with respect to capacity management. Wikipedia describes the primary goal of capacity management as ensuring that:

IT resources are right-sized to meet current and future business requirements in a cost-effective manner

I’ll use VMware and AWS as my typical examples of on-prem virtualisation and cloud IaaS, but many of the points are equally applicable to their competitors.

Bin packing

Right-sizing a current environment is essentially a bin packing problem, where as much workload as possible should be squeezed onto as little physical equipment as possible. Bin packing is a combinatorial optimisation problem that’s NP-hard (non-deterministic polynomial time hard), which means we could spend a lot of time and computer cycles coming up with a perfect answer. In practice perfect isn’t needed, and what needs to be packed into the bins keeps changing anyway, so various shortcuts can be taken to a good enough solution.
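To make ‘good enough’ concrete, here’s a minimal sketch of the classic first-fit decreasing heuristic, packing hypothetical VM memory demands onto fixed-size hosts:

```python
def first_fit_decreasing(items, bin_capacity):
    """Pack items (resource demands) into as few bins (hosts) as practical.

    A classic approximation: sort demands largest-first, then drop each one
    into the first bin it fits in, opening a new bin only when necessary.
    """
    bins = []  # each bin is a list of item sizes
    for item in sorted(items, reverse=True):
        for b in bins:
            if sum(b) + item <= bin_capacity:
                b.append(item)
                break
        else:
            bins.append([item])
    return bins

# Hypothetical VM memory demands (GB) packed onto 64 GB hosts.
vms = [4, 16, 8, 2, 32, 8, 4, 24, 12, 6]
hosts = first_fit_decreasing(vms, bin_capacity=64)
print(len(hosts), hosts)  # good enough, not provably optimal
```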

Inside out

It’s worth noting that the resources consumed within a VM (an inside measure) aren’t the same as the VM’s sizing (an outside measure). The inside measure is what’s really being used, whilst the outside measure sets constraints, e.g. if a VM is given 4GB of memory and it’s running a Java app that never goes over 1.5GB of heap then it’s probably oversized – we could get away with a 2GB VM, or even a 1.6GB VM.

Hypervisors are great at instrumentation

The hypervisor virtualisation layer has perfect knowledge of the hardware resources that it’s presenting to the VMs running on it, so it’s a rich source of data to inform sizing questions. In the example above the hypervisor can tell us that its guest is never consuming more than 1.6GB of RAM.

#1 important point about IaaS – the cloud service provider (CSP) runs the hypervisor rather than the end user. They may choose to offer some instrumentation from their environment, but it’s unlikely to be at full fidelity. Of course they have this data themselves to inform their own capacity management.

T-shirt sizes and fits

T-shirt sizes

IaaS is generally sold in various sizes like Small, Medium, Large, XL etc. – like (men’s) T-shirts. Unlike real-life T-shirts, which are incrementally bigger, cloud T-shirts are generally exponentially bigger, so an XL is twice a Large is twice a Medium is twice a Small, meaning that the XL is 8x as big (and expensive) as a Small.

A typical instance type on AWS is an m4.xlarge, which has 4 vCPU and 16 GiB RAM, so if I have a workload that needs 2 vCPU and 9 GiB RAM I need that instance type because it’s the smallest T-shirt that fits (an m4.large with 8 GiB RAM would be short on RAM). In a VMware environment I’d just size that VM to 2 vCPU and 9 GiB RAM, but I don’t have that degree of control in the cloud.

#2 important point about IaaS – T-shirt sizes mean that fine control over capacity allocation isn’t possible.

It’s worth taking a quick detour here to explore the meaning of vCPU, because it means entirely different things in VMware and AWS. In a VMware environment the vCPUs allocated to a VM represent a largely arbitrary mapping to the host’s physical CPUs. In modern AWS instances the mapping is much more clearly defined – a vCPU is a hyperthread on a physical core (and thus 2 vCPUs are a whole core). The exception is T2 type instances, which have shared cores and quite a neat usage credit system to ensure fair allocation.

#3 important point about IaaS – CPU over-commit is only possible by using the specific instance types that support it.
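The T2 credit system mentioned in the detour above can be thought of as a token bucket. Here’s a toy sketch with made-up earn/spend rates and baseline (not the real T2 numbers), just to show the shape of the mechanism:

```python
def simulate_cpu_credits(utilisation_by_hour, baseline=0.10, earn_per_hour=6.0,
                         initial_credits=30.0):
    """Toy model of a burstable-instance credit bucket (all numbers hypothetical).

    One credit ~= one vCPU-minute. Credits accrue at a fixed hourly rate; running
    above the baseline utilisation spends them, and an empty bucket means the
    instance is throttled back to the baseline.
    """
    credits = initial_credits
    for hour, util in enumerate(utilisation_by_hour):
        spend = max(util - baseline, 0.0) * 60  # vCPU-minutes above baseline
        credits = credits + earn_per_hour - spend
        if credits < 0:
            credits = 0.0
            util = baseline  # throttled
        print(f"hour {hour}: utilisation {util:.0%}, credits {credits:.1f}")

# A mostly quiet instance that bursts to 80% CPU for two hours.
simulate_cpu_credits([0.05, 0.05, 0.80, 0.80, 0.05, 0.05])
```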

T-shirt fits

Just as different T-shirt fits suit different body shapes, different instance types suit different workload shapes. The M and T types are a ‘general purpose’ mix, whilst C is ‘compute optimised’, X and R are ‘memory optimised’, and I is ‘storage optimised’. AWS can use the data from their actual users to tailor the fits to real-world usage.

Returning to the misfit above (a 2 vCPU and 9 GiB RAM workload), this will fit onto an r4.large, which is $0.133/hr – a saving of $0.067/hr versus the m4.xlarge at $0.20/hr. Compared to an m4.large at $0.10/hr it’s 33% more expensive to cover the 12.5% extra RAM the workload needs, but that’s a whole lot better than paying 100% more.
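The ‘smallest T-shirt that fits’ logic is easy to automate. A sketch, using a tiny illustrative price list rather than live (or complete) AWS pricing:

```python
# Illustrative catalogue only: a handful of 2017-era us-east-1 Linux on-demand
# prices, not a live or complete price list.
INSTANCE_TYPES = [
    # (name, vCPU, RAM GiB, $/hr)
    ("m4.large",   2,  8,    0.100),
    ("m4.xlarge",  4, 16,    0.200),
    ("r4.large",   2, 15.25, 0.133),
    ("r4.xlarge",  4, 30.5,  0.266),
    ("c4.large",   2,  3.75, 0.100),
]

def cheapest_fit(vcpu_needed, ram_needed):
    """Return the cheapest instance type that satisfies both constraints."""
    candidates = [t for t in INSTANCE_TYPES
                  if t[1] >= vcpu_needed and t[2] >= ram_needed]
    return min(candidates, key=lambda t: t[3]) if candidates else None

# The misfit workload from the text: 2 vCPU and 9 GiB RAM.
print(cheapest_fit(2, 9))   # -> ('r4.large', 2, 15.25, 0.133)
```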

How do containers change things?

Bin packing is easier with lots of small things, and containers tend to be smaller than VMs, so in general containers provide a less lumpy problem that’s easier to optimise.

This shouldn’t be done manually

As noted above bin packing is NP-hard, so if you ask humans to do it by hand the approximations will be pretty atrocious. This is work that should be left to machines. VMware is great at scheduling work onto CPUs, but it doesn’t optimise a bunch of VMs across a set of machines. This is where 3rd party solutions like Turbonomic come into play, which can take care of rightsizing VMs (fit of outside to inside) and optimising the bin packing across physical machines.

Google has been doing a great job of this on their estate for some time, and I’d recommend a look at John Wilkes’ ‘Cluster Management at Google‘. That best practice has been steadily leaking out, and Google’s Kubernetes project now provides a (rudimentary) scheduler for container-based workloads.

What about PaaS and FaaS?

Platform as a Service (like Cloud Foundry) and Functions as a Service (like AWS Lambda) abstract away from servers and the capacity management tasks associated with them. It’s also worth noting that if containers provide smaller things to pack into capacity bins then functions take that to an entirely different level, with much smaller and more numerous elements of work to be considered in aggregate as workload.

Planning for the future

Most of this post has been about capacity management in the present, but a huge part of the discipline is about managing capacity into the future, which usually means planning for growth so that additional capacity is available on time. This is where IaaS has a clear advantage, as the future capacity management is a service provider responsibility.

Conclusion

IaaS saves its users from the need to do future capacity planning, but it’s less flexible in the present as the T-shirt sizes and fits provide only an approximate fit to any given workload rather than a perfect fit – so from a bin packing perspective it can leave a lot of space in the bin that’s been paid for but that can’t be used.


A little while ago I got myself an original[1] Wileyfox Swift to replace my ageing Samsung S4 Mini. The Amazon page I bought it from gave the impression that it would run Android 7, but that page was (and likely still is)[2] really confusing as it covered multiple versions of the Swift line up.

The phone I received came running Android 5.0.1 (Lollipop), which is pretty ancient, and yet the check for updates always reported that no updates were found. I went looking for a way to manually update, and found Cyanogen Update Tracker. There’s a footnote on the downloads page that specifically refers to my phone and the Cyanogen OS version it carried (YOG7DAS2FI):

Wileyfox Swift (OLD VERSION) – Install this package only if your Wileyfox Swift is running one of the following Cyanogen OS versions: YOG4PAS1T1, YOG4PAS33J or YOG7DAS2FI.
Afterwards, install the package marked as “LATEST VERSION”.

As things turned out I didn’t need the “LATEST VERSION”, as over-the-air (OTA) updates sprang to life as soon as I did the first update[3].

Doing the manual update

  1. Check that you’re on ‘Wileyfox Swift (OLD VERSION)’ by going into Settings > About Phone and confirming that the OS Version is YOG4PAS1T1, YOG4PAS33J or YOG7DAS2FI (mine was the last of these).
  2. Download cm-13.0-ZNH0EAS2NH-crackling-signed-9c92ed2cde_recovery.zip (or whatever Cyanogen Update Tracker has on offer) onto the phone, keeping it on internal storage in the Downloads folder[4].
  3. Turn the phone off.
  4. Turn the phone on into recovery mode by holding the volume down button whilst pressing the power button.
  5. Use the volume buttons to move and the power button to select: Install update > Install from internal memory > browse to Downloads and select the zip file that was downloaded.
  6. Wait – it will take a while to install the update (Android with spinning matrix in front of it) and a while longer to update apps.

After the manual update

As soon as my phone was ready it started nagging me to accept OTA updates, which eventually took me to Android 7.1.2 (Nougat).

Notes

[1] I got an original Swift rather than the more recent Swift 2X as the original can handle two (micro) SIMs and a MicroSD card at the same time, which suits my travel needs of UK SIM (for Three ‘Feel at Home’), US SIM (for local calls in the US) and MicroSD (for my music and podcasts). Sadly the 2X makes you choose between that second SIM (one of which needs to be a nano SIM) and MicroSD.
[2] An example of an increasingly frequent anti-pattern on Amazon where entirely different products have their Q&A and reviews all munged together.
[3] Though perhaps I could have saved some time here, as it took about 4 OTA updates to get me to the latest version.
[4] My initial attempt to upgrade from my MicroSD card didn’t work – perhaps because it’s 64GB.


I’ve had the links below in a OneNote snippet for some time, so that I can easily email them to people who want to know more about Wardley mapping; but I thought I might as well post them here too:

The OSCON video

The CIO magazine article

The blog intro

The (incomplete) book (as a series of Medium posts by chapter) – Chapter 2 has the key stuff about mapping.

The online course (NB note for DXC/LEF staff on price plans)

Atlas2 – the online tool


LessOps

09Aug17

JeffConf have posted the video from my talk there on LessOps (or should that be ‘LessOps’?), which is how I see operations working out in a world of ‘serverless’ cloud services:

The full playlist is here, and I’ve also published the slides:


In a note to my last post ‘Safety first‘ I promised more on this topic, so here goes…

TL;DR

As software learns from manufacturing by adopting the practices we’ve called DevOps, we’ve got better at catching mistakes earlier and more often in our ‘production lines’ to reduce their cost; but what if the whole point of software engineering is to make mistakes? What if the mistake is the unit of production?

Marginal cost

Wikipedia has a pretty decent definition of marginal cost:

In economics, marginal cost is the change in the opportunity cost that arises when the quantity produced is incremented by one unit, that is, it is the cost of producing one more unit of a good. Intuitively, marginal cost at each level of production includes the cost of any additional inputs required to produce the next unit.

This raises the question of what a ‘unit of a good’ is with software.

What do we make?

Taking the evolutionary steps of industrial design maturity that I like to use when explaining DevOps it seems that we could say the following:

  • Design for purpose (software as a cottage industry) – we make a bespoke application. Making another one isn’t an incremental thing; it’s a whole additional dev team.
  • Design for manufacture (packaged software) – when software came in boxes this stuff looked like traditional manufactured goods, but the fixed costs associated with the dev team would be huge versus the incremental costs of another cardboard box, CD and set of manuals. As we’ve shifted to digital distribution marginal costs have tended towards zero, so thinking about marginal cost isn’t really useful if we’re thinking that the ‘good’ is a given piece of packaged software.
  • Design for operations (software as a service/software based services) – as we shift to this paradigm then the unit of good becomes more meaningful – a paying user, or a subscription. These are often nice businesses to be in as the marginal costs of adding more subscribers are generally small and can scale well against underlying infrastructure/platform costs that can also be consumed as services.

The cost of mistakes

Mistakes cost money, and the earlier you eliminate a mistake from a value chain the less money you waste on it. This is the thinking that lies at the heart of economic doctrine from our agricultural and industrial history. We don’t want rotten apples, so better to leave them unpicked versus spending effort on harvesting, transportation etc. just to get something to market that won’t be sold. It’s the same in manufacturing – we don’t want a car where the engine won’t run, or the panels don’t fit, so we’ve optimised factory floors to identify and eliminate mistakes as early as possible, and we’ve learned to build feedback mechanisms to identify the causes of mistakes and eliminate them from designs (for the product itself, and how it’s made).

What we now label ‘DevOps’ is largely the software industry relearning the lessons of 20th century manufacturing – catch mistakes early in the process, and systematically eliminate their causes.

Despite our best efforts mistakes make it through, and in the software world they become ‘bugs’ or ‘vulnerabilities’. For any sufficiently large code base we can start building statistical models for probability and impact of those mistakes, and we can even use the mistakes we’ve found already to build a model for the mistakes we’ve not found yet[1].

Externality and software

Once again I can point to a great Wikipedia definition for externality:

In economics, an externality is the cost or benefit that affects a party who did not choose to incur that cost or benefit. Economists often urge governments to adopt policies that “internalize” an externality, so that costs and benefits will affect mainly parties who choose to incur them.

Externalities, where the cost of a mistake doesn’t affect the makers of the mistake, happen a lot with software, and particularly with packaged software and the open source that’s progressively replaced it in many areas. It’s different at the other extremes. If I build a trading robot that goes awry and kills my fund then the cost of that mistake is internalised. Similarly, if subscribers can’t watch their favourite show then although that might initially look like an externality (the service has their money, and the subscriber has to find something else to do with their time), it quickly gets internalised if it impacts subscriber loyalty.

Exploring the problem space

Where we really worry the most about mistakes in software is when there’s a potential real world impact – we don’t want planes falling out of the sky, or nuclear reactors melting down etc. This is the cause of statements like, ‘that’s fine for [insert thing I’ll trivialise here], but I wouldn’t build a [insert important thing here] like that’.

Software as a service (or software based services) can explore their problem space all the way into production using techniques like canary releases[2]. People developing industrial control systems don’t have that luxury (as impact is high, and [re]release cycles are long), so they necessarily need to spend more time on simulation and modelling thinking through what could go wrong and figuring out how to stop that. This dichotomy can easily distil down to a statement on the relative merits of waterfall versus agile design approaches, which Paul Downey nailed as:

Agile: make it up as you go along.
Waterfall: make it up before you start, live with the consequences.

It can be helpful to look at these through the lens of risk. ‘Make it up as you go along’ can actually make a huge amount of sense if you’re exploring something that’s unknown (or a priori unknowable), which is why it makes so much sense for ‘genesis’ activities[3]. ‘Live with the consequences’ is fine if you know what those consequences might be. In each case the risk appetite can be balanced against an ability to absorb or mitigate risk.

This can be where the ‘architecture’ thing breaks down

We frequently use ‘architecture’ when talking about software, but it’s a word taken from the building industry, and professional architects get quite upset about their trade moniker being (ab)used elsewhere. When you pour concrete, mistakes get expensive, because fixing the mistake involves physical labour (with picks and shovels) to smash down what was done wrong before fresh concrete can be poured again.

Fixing a software mistake (if it’s caught soon enough) is nothing like smashing down concrete, which is why as an industry we’ve invested so much in moving towards continuous integration (CI) and related techniques in order to catch mistakes as quickly and cheaply as possible.

Turning this whole thing around

What if the unit of production is the mistake?

What then if we make the cost per unit as low as possible?

That’s an approach that lets us discover our way through a problem space as cheaply as possible. To test what works and find out what doesn’t – experimentation on a massive scale, or as Edison put it:

I’ve not failed. I’ve just found 10,000 ways that won’t work.

What we see software as a service and software based services companies doing is finding ways that work by eliminating thousands of ways that don’t work as cheaply and quickly as possible. The ultimate point is that their approach isn’t limited to those types of companies. When we simulate and model we can discover our way through almost any problem space. This is what banks do with the millions of ‘bump runs’ through Monte Carlo simulation of their financial instruments in every overnight risk analysis, and similar techniques lie at the heart of most science and engineering.
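As a toy illustration of that sort of discovery by simulation (with made-up portfolio numbers that bear no resemblance to a real bank’s risk model):

```python
import random

def one_day_var(portfolio_value, mu, sigma, runs=100_000, confidence=0.99):
    """Crude Monte Carlo estimate of 1-day value-at-risk.

    Simulates many possible daily returns and reports the loss that is only
    exceeded in (1 - confidence) of the runs. Parameters here are made up.
    """
    losses = sorted(
        -portfolio_value * random.gauss(mu, sigma) for _ in range(runs)
    )
    return losses[int(confidence * runs)]

# Hypothetical £10m book with 0% drift and 2% daily volatility.
print(f"99% 1-day VaR ~ £{one_day_var(10_000_000, 0.0, 0.02):,.0f}")
```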

Of course there’s still scope for ‘stupid’ mistakes – mistakes made (accidentally or intentionally) when we should know better. This is why a big part of the manufacturing discipline now finding its way into software is ‘it’s OK to make mistakes, but try not to make the same mistake twice’.

Wrapping up

As children we’re taught not to make mistakes – for our own safety, and throughout our education the pressure is to get things right. With that deep cultural foundation it’s easy to characterise software development as a process that seeks to minimise the frequency and cost of mistakes. That’s a helpful approach to some degree, but as we get to the edges of our understanding it can be useful to turn things around. The point of software can be to make mistakes – lots of them, as quickly and cheaply as possible, because it’s often only by eliminating what doesn’t work that we find what does.

Acknowledgement

I’d like to thank Open Cloud Forum’s Tony Schehtman for making me re-examine the whole concept of marginal cost after an early conversation on this topic – it’s what prompted me to go a lot deeper and figure out that the unit of production might be the mistake.

Notes

[1] ‘Milk or Wine: Does software security improve with age?‘
[2] I’d highly recommend Roy Rapoport’s ‘Canary Analyze All The Things: How We Learned to Keep Calm and Release Often‘, which explains the Netflix approach.
[3] Oblique reference to Wardley maps, where I’d recommend a look at: The OSCON video, The CIO magazine article, The blog intro, The (incomplete) book (as a series of Medium posts by chapter – Chapter 2 has the key stuff about mapping), and The online course.


Safety first

27Jul17

Google’s Project Aristotle spent a bunch of time trying to figure out what made some teams perform better than others, and in the end they identified psychological safety as the primary factor[1]. It’s why one of the guiding principles to Modern Agile is ‘Make Safety a Prerequisite’.

The concept of safety comes up in Adrian Cockcroft’s comment on innovation culture that I referenced in my Wage Slaves post:

Here’s an analogy: just about everyone knows how to drive on the street, but if you take your team to a racetrack, sit them in a supercar and tell them to go as fast as they like, you’ll get three outcomes.

  1. Some people will be petrified, drive slowly, white knuckles on the steering wheel, and want to get back to driving on the street. Those are the developers that should stay in a high process, low risk culture.
  2. Some people will take off wildly at high speed and crash on the first corner. They also need process and structure to operate safely.
  3. The people that thrive in a high performance culture will take it easy for the first few laps to learn the track, gradually speed up, put a wheel off the track now and again as they push the limits, and enjoy the experience.

Numbering added by me for easier referencing

This unpacks to being about risk appetite and approach to learning, and it’s all a bit Goldilocks and the three bears:

  1. The risk appetite of the go slow racer is too cold, which means that they don’t create opportunities for learning.
  2. The risk appetite of the crash and burn racer is too hot; they don’t create opportunities for learning either.
  3. The risk appetite of the progressive racer is just right. They create a series of learning opportunities as they explore the limits of the car, the track and their skill.

This is where I think I’m going to diverge from Adrian’s view, and I’m somewhat taking what he says at face value, so there will inevitably be nuance that I’ve missed… I read Adrian as saying that forward leaning companies (like Netflix and Amazon) will set up their hiring, retention and development to favour type 3 people – the ‘natural’ racers.

I have a problem with that, because ‘natural’ talent is mostly a myth. If I think back to my own first track day (thanks Borland) I’d have been picked out as a type 2. I don’t know how many times I spun that VX220 out backwards from a corner on the West track at Bedford Autodrome, but it was a lot, and the (very scared looking) instructor would surely have said that I wasn’t learning (at least not quickly enough).

I returned to the same track(s) a few years later and had a completely different experience. The lessons from the first day had sunk in, I’d altered the way I drove on ordinary roads, I’d bought a sports car and learned to be careful with it, I’d spent time playing racing games on PCs and consoles. Second time out I’d clearly become a type 3 – maybe not the fastest on the track, but able to improve from one lap to the next, and certainly not a danger to myself and others.

So it seems that the easy path here is to pick out the type 3s; but there’s another approach that involves getting the type 1s to take on more risk, and getting the type 2s to rein it in a little. Both of these activities happen away from the track, in a safer environment – the classroom and video games (or their equivalents) that let people explore their risk envelope and learning opportunities without threat to actual life or limb; somewhere that the marginal cost of making mistakes is low[2].

The story doesn’t end there. Once we have our type 3s (either by finding them or converting them) there’s still plenty that can be done for safety, and the racing world is a rich source for analogy. Bedford Autodrome is reputed to be one of the safest circuits in the world. It’s been purpose designed for teaching people to race rather than to be used as a venue for high profile competitions. Everywhere that you’re likely to spin out has been designed so that you won’t crash into things, or take off and crash land or whatever. So we can do things to the environment that ensure that a mistake is a learning experience and not a life ending, property destroying tragedy.

Some thought should also be given to the vehicles we drive and the protective clothing we wear. Nomex suits, crash helmets, tyre tethers, roll-over bars – there have been countless improvements in racing safety over the years. When I watched F1 back in the days of James Hunt it felt like every race was a life or death experience. We lost Ayrton Senna, and Niki Lauda still wears the scars from his brush with death; it’s much better that I can watch Lewis Hamilton take on Sebastian Vettel pretty sure that everybody will still be alive and uninjured as the chequered flag gets waved. It’s the same with software, as agile methodologies, test-driven development (TDD), chaos engineering and continuous integration/delivery (CI/CD) have converged on bringing us software that’s less likely to crash, and crashes that are less likely to injure. It’s generally easier to be safer if we use the right ‘equipment’.

This connects into the wider DevOps arc because the third DevOps way is continuous learning by experimentation. Learning organisations need to be places where people can take risk, and most people will only take risk when they feel safe. There may be some people out there who are ‘naturals’ at calibrating their approach to risk and learning from taking risks, but I expect that most people who seem to be ‘naturals’ are actually people who’ve found a safe environment to learn. So if we want learning organisations we must create safe organisations, and do everything we can to change the environment and ‘equipment’ to make that so.

Notes

[1] For more on Aristotle and its outcome check out Matt Sakaguchi’s QCon presentation ‘What Google Learned about Creating Effective Teams‘ and/or the interview he did with Shane Hastie on the ‘Key to High Performing Teams at Google‘.
[2] This is a huge topic in its own right, so I’ll cover it in a future post.


Wage Slaves

26Jul17

I recently had the good fortune of meeting Katz Kiely and learning about the Behavioural Enterprise Engagement Platform (BEEP) that she’s building. After that meeting I listened to Katz’s ‘Change for the Better‘ presentation, which provided some inspiring food for thought.

Katz’s point is that so much human potential is locked away by the way we construct organisations and manage people. If we change things to unlock that potential we have a win-win – happier people, and more productive organisations. It’s not hard to see the evidence of this at Netflix, Amazon (especially their Zappos acquisition), Apple etc.

The counterpoint hit home for me on the way home as I read an Umair Haque post subtitled ‘Slavery, Segregation and Stagnation‘. His observation is that the US economy started out based on slavery, then moved to a derivative of slavery, and then to a slightly different derivative of slavery. Student debt and the (pre-existing) conditions associated with health insurance might not be anywhere near as bad as actual slavery, but they’re still artefacts of a systemically coercive relationship between capital and labour. Coercion might have seemed necessary in a world of farm hands and factory workers (though likely it was counterproductive even then), but it’s the wrong way to go in a knowledge economy.

Adrian Cockcroft puts it brilliantly in his response to (banking) CIOs asking where Netflix gets its amazing talent from, “we hired them from you and got out of their way”. He goes on to comment:

An unenlightened high overhead culture will drag down all engineers to a low level, maybe producing a third of what they would do, working on their own.

Steve Jobs similarly said:

It doesn’t make sense to hire smart people and then tell them what to do; we hire smart people so they can tell us what to do.

So the task at hand becomes to build organisations based on empowerment rather than coercion, and that starts with establishing trust (because so many of the things that take power away sprout from a lack of trust).



In a footnote to yesterday’s application intimacy post I said:

in time there will be services for provisioning, monitoring and logging, and all that will remain of ‘infrastructure’ will be the config of those services; and since we might treat that config as code then ultimately the NoOps ‘just add code – we’ll take care of the rest’ dream will become a reality. Barring any surprises, that time is likely something in the region of 5 years away.

That came from an extensive conversation with my colleague Simon Wardley on whether NoOps is really a thing. The conversation started at Serverlessconf London where I ended up editorialising the view that Serverless Operations is Not a Solved Problem. It’s worth pointing out a couple of things about my take on Simon’s perspective:

  1. Simon sees DevOps as a label for the (co-evolved) practices emerging from IaaS utilisation, and hence it’s not at the leading edge as we look to a more PaaS/FaaS future.
  2. Simon is a great visionary, so what he expects to come true isn’t the same as what’s actually there right now.

This whole debate was due to come up once again at London CloudCamp on 6 July at an event titled “Serverless and the death of DevOps“. Sadly I’m going to miss CloudCamp this time around, but in the meantime the topic has taken on a life of its own in a post from James Governor:

it’s a fun event and a really vibrant community, but the whole “death of devops” thing really grinds my gears. I blame Simon Wardley. 😉

Whilst not explicitly invoking Gene Kim and the ‘3 Ways’ of DevOps (Flow, Feedback and Continuous Learning by Experimentation), it seems that James and I are on the same page about the ongoing need to apply what manufacturing learned from the 50s onwards to today’s software industry (including Serverless).

Meanwhile Paul Johnston steps in with an excellent comment and follows up with a complete post ‘Serverless is SuperOps‘. In his conclusion Paul says:

Ops becomes your primary task, and Dev becomes the tool to deliver the custom business logic the system needs.

I think that’s a sentiment born from the fact that (beyond trivial use cases) using Serverless right now is just the opposite of NoOps; the ops part is really hard, and ends up being the majority of the overall effort. There may no longer be a need to worry about VMs and OSes and patching and all of those IaaS concerns (that have in many cases been automated to the point of triviality); but there’s still a need to worry about provisioning, config management, logging and monitoring.

Something that Paul and I dived into recently is the set of issues around testing. Paul suggests ‘The serverless approach to testing is different and may actually be easier‘, but concludes:

we’re currently lacking the testing tools to really drive home the value – looking forward to when they arrive.

I asked him, “How do you do canarying in Serverless?“, which led to a well-thought-through response in ‘Serverless and Deployment Issues‘. TL;DR: canarying is pretty much impossible right now unless you build your own content router, which is something that’s right up there on the stupid and dangerous list; this is stuff that the platform should do for you.
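To be clear about what rolling your own content router involves, the core of it is just weighted dispatch (sketched below, with hypothetical handler names standing in for two deployed function versions); everything that makes it genuinely hard – health-based rollback, sticky sessions, segregated metrics – is left out:

```python
import random

# Hypothetical handlers standing in for two deployed function versions.
def handler_v1(event): return {"version": 1, "result": "stable"}
def handler_v2(event): return {"version": 2, "result": "canary"}

CANARY_WEIGHT = 0.05  # send ~5% of traffic to the new version

def route(event):
    """A DIY canary router: weighted random dispatch between two versions.

    The hard parts (automatic rollback on errors, stickiness, separate
    monitoring per version) aren't here, which is rather the point.
    """
    handler = handler_v2 if random.random() < CANARY_WEIGHT else handler_v1
    return handler(event)

if __name__ == "__main__":
    from collections import Counter
    print(Counter(route({})["version"] for _ in range(10_000)))
```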

Things will be better in the future. As Simon keeps pointing out the operational practices will co-evolve with the technologies. Right now Serverless is only being used by the brave pioneers, and behind them will come the settlers and town planners. Those later users won’t come until stuff like canarying has been sorted out, so the scope of what a Functions as a Service (FaaS) platform does will expand, and the effort to make things work will correspondingly contract. In due course it’s possible that if we look at it just right (and squint a bit) we could call that NoOps. Of course to do that we will have had to learn how to encode everything we want to do with provisioning, logging and monitoring into the (infrastructure as code) config management; we will have had to teach the machine (or at least the machine learning behind it) how to care on our behalf. Until then, as Charity Majors says – ‘you can’t outsource caring‘.