This is a practice I’m trying to get traction with at work, but it’s not something I’ve seen or read about other people doing. Then again, it seems so obvious that other people must be doing it, so I’d love to hear more about that.

It’s pretty typical for a post incident review (aka ‘post-mortem’) to include a detailed timeline of what was done. But was each action helpful, harmful, or of no consequence (other than maybe wasting time)?

For this I’m suggesting a traffic light system:

  • RED – this is the stuff that made a bad situation worse. Next time we’re dealing with something similar we want to make sure to avoid doing that again.
  • AMBER – this is stuff that we tried, but it didn’t help. Thankfully it didn’t make things any worse, but it also took time, so best avoided next time around.
  • GREEN – this is the helpful stuff that actually drove us towards resolution. If we can just do green next time then we’re on the happy path to a quick resolution.
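
To make the idea concrete, here’s a minimal sketch of what an annotated timeline might look like, in Python for want of a better notation. The incident, times and actions are all invented for illustration, and the ratings come out of the review discussion rather than anything automated:

# A hypothetical annotated incident timeline (times and actions invented for illustration)
timeline = [
    ("09:02", "Restarted the API pods",           "AMBER"),  # didn't help, cost us 10 minutes
    ("09:15", "Failed over the primary database", "RED"),    # made a bad situation worse
    ("09:40", "Rolled back the morning deploy",   "GREEN"),  # drove us towards resolution
]

# Summarise what to repeat and what to avoid next time around
for colour in ("GREEN", "AMBER", "RED"):
    print(colour, [action for _, action, rating in timeline if rating == colour])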

For a little while I’ve been maintaining a FaaS on Kubernetes list to track the many implementations of Functions as a Service running on top of Kubernetes. Today brings CloudState as the first addition in a little while, and it’s quite an interesting one for a variety of reasons.

Knative

I became aware of CloudState via Google’s Mark Chmarny linking to news from my InfoQ colleague Diogo Caeleto in a tweet:

Exciting to see new project building on top of @KnativeProject. We should really start that “built on Knative” page

I’ve spent a little time over the past few weeks trying to get my head around Knative, as I found it pretty bewildering. There was even a suggestion that Knative should itself be added to the FaaSonK8s list, which I finally decided was a no. The journey began with Ian Miell asking:

Anyone interested in how to set up a working knative environment for dev spun up with one script in the cloud for $0.18 per hour?

and me wondering why? Ian’s reply (along with some rumblings from DXC colleagues) got me pulled back in:

Ahhh. Well, I know right? But I get it now, as we belatedly realised recently that we were re-implementing it ourselves at $CORP and that this would save us a lot of sweat…

The point that emerged is that Knative isn’t itself a serverless platform that runs on K8s, but rather a kit of parts for those wanting to build such a thing. Google have already built their own in the shape of Cloud Run (which I touched upon yesterday in ‘Kubernetes and the 3 stage tech maturity model‘). I guess if a bank or similar determined that they couldn’t just use Cloud Run then their platform team could build their own using Knative. But CloudState is interesting for other reasons…

Stateful (and actor based)

Every meaningful IT system manages state, but for the sake of simplicity it’s pretty common for us to push the state management off to another system (the database) and build stateless systems. This lies at the heart of The Twelve-Factor App approach, and many other architectures.

Stateless is simple, but it can also be horribly inefficient. When I worked on grid computing some 15 years ago most of the apps were stateless, which meant that the grid spent roughly half of its time waiting for data with the CPUs idle. Often the data it was waiting for was exactly the same data that was used for the last set of calculations, so we were effectively throwing away good data and wasting time loading it up again. Given the cost of running thousands of servers that wasn’t sustainable, and we moved to a model where data was cached by a ‘data grid’.
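
The data grid was essentially a shared cache sitting between the calculations and the database. As a minimal, single process sketch of the difference (invented function names and a placeholder calculation, assuming nothing about the real grid):

from functools import lru_cache

def load_market_data(trade_date):
    # hypothetical stand-in for a slow, large fetch from the database
    print("loading market data for", trade_date)
    return {"curve": [0.01, 0.02, 0.03]}

@lru_cache(maxsize=8)
def load_market_data_cached(trade_date):
    # a very local stand-in for the data grid: fetch once, reuse thereafter
    return load_market_data(trade_date)

def price(portfolio, data):
    return len(portfolio) * sum(data["curve"])  # placeholder calculation

# Stateless: every run reloads the same data, leaving the CPUs idle while it waits
for portfolio in (["tradeA"], ["tradeB"], ["tradeC"]):
    price(portfolio, load_market_data("2019-09-04"))

# Cached: the data is loaded once and reused by subsequent runs
for portfolio in (["tradeA"], ["tradeB"], ["tradeC"]):
    price(portfolio, load_market_data_cached("2019-09-04"))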

I’ve been reminded of this in recent L8ist Sh9y podcasts interviewing Simon Crosby about his work at Swim.ai.

What I see in CloudState has a number of commonalities with Swim.ai.

So I’d expect that the powerful arguments Simon makes for these things in the context of his own work also extend to CloudState.

The serverless world has previously dealt with state by putting it into serverless state management services, but this model of keeping it close to the functional code shows promise of hitting the trifecta of better, faster, cheaper.


I’ve seen this emerge a few times:

  1. I want a thing
  2. Eek – too many things – I need a thing manager
  3. I don’t care about things, just do the thing for me

Applying the pattern to Kubernetes:

  1. I want a Kubernetes
  2. Eek – too many Kubernetes – I need a Kubernetes manager
  3. I don’t care about Kubernetes, just run my distributed app for me

If I look at some recent industry announcements:

  1. VMware’s next generation of vSphere with Project Pacific will have Kubernetes baked in
  2. VMware’s Tanzu is a ‘Mission Control’ for Kubernetes.
  3. Google’s Cloud Run is a ‘serverless’ service with Kubernetes underneath, but hiding all the gory details.

I’m not making a value judgement on Google being in some way ahead of VMware here – they’re skating to different pucks being played by different customers; because technology diffusion curves.

Mapping to ‘design for…’

When I talk about DevOps I usually talk about the shift from design for purpose, to design for manufacture, to design for operations. I can see some broad, but imprecise, alignment between those three stages and the three stages described above.

Needs change

Pat Kerpan first brought this to my attention with his observations from Cohesive Networks:

  1. I need an overlay network (and I must manage it myself, because part of my whole threat model is that I don’t want to entirely trust the underlying cloud service provider)
  2. Eek – too many networks to manage, give me a manager of managers (Cohesive created ‘Mothership’, now VNS3:ms)
  3. I don’t want to manage my own networks any more, just run them for me as a service

NB that at stage 3 the control requirement that was present at stage 1 has evaporated.

Pat also observed that shifts from one stage to the next normally coincided with people changes at the customers. The engineers who bought a technical solution at 1 gave way to managers who needed to scale at 2, who in turn gave way to new managers who just wanted simplicity at 3.


I hear these words a lot.

They’re a shield for ignorance.

A statement that the details don’t matter (when they really do).

Learning has stopped.

Submission to the people in the conversation who are technical – it’s your problem now, “I wash my hands of it”.

A power play, “I care about the business, you’re just playing with toys”.

It’s not OK.

“Software is eating the world”, and if you’re ‘not technical’ it will eat you too.

Update 7 Sep 2019

My customary tweet sparked off quite some conversation on Twitter, including this thread from Christian Reilly. Forrest Brazeal chimed in with his excellent FaaS and Furious cartoon ‘Not Technical’, which he was kind enough to permit me to copy here.


Policy debt


Background

When we talk about technical debt the conversation is usually about old code, or the legacy systems that run it. I’ve observed another type of debt, which comes from policies, and seems to be most harmful in the area of security policies.

Firewalls or encryption?

A primary purpose for this post is to put out a statement I’ve been using in discussions for the past few years:

Any company that wrote its security policy prior to the advent of SSH is doomed to do with firewalls things that should be done with encryption

I’m using SSH as a marker for the adoption of public key cryptography. The protocol itself is irrelevant to the discussion, and most likely it’s TLS that’s being used in systems that we care about.

I’ve also presented a false choice here – it’s not firewalls or encryption, it can be firewalls and encryption (belt and braces).

The point is that if your policy says that you must use firewalls then you’re going to need a bunch of firewalls, and a bunch of the network segments that they imply; and that’s a bunch of extra cost and complexity that a newer organisation might forego in favour of having a policy that tells them to use TLS.
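
For illustration, ‘use TLS’ needn’t amount to much more than the sketch below (Python’s standard library ssl module, with a hypothetical host name): the client verifies who it’s talking to and encrypts the traffic itself, rather than relying on being inside the right network segment.

import socket
import ssl

host = "internal-service.example.com"  # hypothetical endpoint

# Verify the server's certificate and encrypt end to end, instead of trusting
# whichever network segments the traffic happens to cross
context = ssl.create_default_context()
with socket.create_connection((host, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=host) as tls:
        print(tls.version(), tls.getpeercert()["subject"])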

Cloud natives

‘Cloud native’ organisations and their architectures will usually favour encryption over firewalls. In fact the insistence on firewalls (and hardware security modules [HSMs] {and especially HSMs behind firewalls}) will ruin a cloud native architecture, or maybe cloud adoption itself.

Password cycling

Another clear example we can look at is periodic password resets. For a long while it was accepted best practice that passwords should be cycled (every 90 days or so), and that practice found its way into policies.

A few years back CESG and NIST decided (within a few weeks of each other) that periodic password cycling wasn’t helpful, and changed their guidance accordingly. They now advise that passwords should only be changed when there is evidence of compromise.

The best practice has changed, but largely the policies have not. In part this is inertia, and in part it is fear that a change in policy might violate some compliance requirement. The problem here is that regulators have a nasty habit of using practice by value rather than practice by reference, so there will be cases where the older NIST or similar guidance has been hard coded. This is compounded by the fact that most published policy demands ‘what’ (and sometimes ‘how’) without bothering to explain ‘why’, so the threads of connection to the regulation that shaped policy get cut, making it much harder to determine the impact of a policy change.
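
A contrived sketch of the by value / by reference point, with invented policy and guidance objects: the hard coded policy keeps saying 90 days long after the guidance has moved on, while the referencing policy follows the guidance without being rewritten.

# Practice by value: the old guidance was copied into the policy and frozen there
policy_by_value = {"password_max_age_days": 90}

# Practice by reference: the policy points at the current guidance, so a change
# by CESG or NIST flows through without the policy itself being edited
current_guidance = {"password_rotation": "change only on evidence of compromise"}
policy_by_reference = {"password_rotation": lambda: current_guidance["password_rotation"]}

print(policy_by_value["password_max_age_days"])    # stays 90 until someone rewrites the policy
print(policy_by_reference["password_rotation"]())  # tracks whatever the guidance now says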

That we’re mostly still cycling passwords every 90 days, years after the standards bodies announced that this was a bad practice, serves as ample evidence of policy debt.

Why does this matter?

Organisations are less agile, because they can’t embrace new technology and approaches.

Organisations are also less secure. Not just because they can’t embrace new technology and approaches, but also because they can’t stop doing bad things after overwhelming evidence emerges that those things are bad.

What can be done?

Policy debt needs to be tackled alongside other aspects of organisational and cultural change, otherwise it impedes that change. If culture is ‘the way we do things around here’ then policy encodes that, so if culture needs to change (for a DevOps adoption or Digital transformation or whatever else) then the policy needs to be dragged along with it.

Conclusion

There is clear evidence of policy debt accumulating in older organisations, and it’s getting in the way of them adapting to the realities of the business context and threat landscape they now operate in. Policy debt will continue to get in the way until it’s understood and tackled as part of larger change.


Saturday was very rainy, so I thought I’d finally get around to upgrading my home lab from ESXi 5.5 to 6.5. I started with my #2 Gen 8 Microserver as I don’t have any live VMs on it, and thus began many wasted hours of reboot after reboot failing to get anything to work.

Slow iLO

The Integrated Lights Out (iLO) management on the Gen 8s is their best feature, and I was able to start out by mounting the .iso for ESXi 6.5 through the iLO remote console and rebooting the server into that to kick off the upgrade.

Sadly the first pass failed on one of the VIBs that’s standard in the HPE ESXi 5.5 bundle (and not used by the Microserver).

After that things spiralled into doom. It wouldn’t boot from the Virtual CD, then it wouldn’t boot from a USB stick, then it wouldn’t complete the upgrade of the internal USB stick I boot from, then it wouldn’t boot from a fresh USB stick.

All the while the iLO was really slow compared to my #1 Microserver.

I did all the usual stuff

Power cycles, iLO resets, BIOS updates, iLO firmware updates (which took ages).

I thought maybe the CPU was overheating and checked the thermal paste.

Maybe it’s a USB problem

I disconnected the internal boot USB stick (and checked it in my laptop).

I disconnected the external KVM (it’s not like I really need it with iLO) in case that was causing issues.

And still the iLO was slow, and USB boot wasn’t working.

At this stage I took a close look at the Active Health System Log, which was about 1GB.

After Googling ‘slow iLO’ I found a Reddit thread, ‘HP iLO4 very slow’, which suggested that flash problems could cause this sort of behaviour, and that a giant AHS log file might be what was upsetting the flash.

Reformatting iLO Flash

Perhaps I should have just cleared the log, but I instead went for reformatting the iLO NAND.

Curiously the iLO Self-Test wasn’t reporting a problem with the embedded Flash/SD_CARD, so I wasn’t able to do things the easy way from the (now v2.7.0) iLO web interface. I had to download the HP Lights-Out Configuration utility and feed it a lump of XML to send over to the iLO.

HPQLOCFG.exe -f Force_Format.xml -s iLO_IP -u administrator -p mypassword

<RIBCL VERSION="2.0">
  <LOGIN USER_LOGIN="administrator" PASSWORD="mypassword">
    <RIB_INFO MODE="write">
      <!-- reformat the iLO's embedded NAND flash -->
      <FORCE_FORMAT VALUE="all"/>
    </RIB_INFO>
  </LOGIN>
</RIBCL>

How did this happen?

I still don’t know. I’d love to read the AHS logs, but the tools to do that live behind a redirect loop on the HPE web site, and may require an active support contract.

Perhaps it’s because the server was powered down, but with the iLO still running, for a few years?

The iLO is supposed to be out of band, and so it shouldn’t affect things like the host USB bus, but I’m guessing that a few corners might have been cut to keep down the cost of adding iLO to the Gen 8 Microserver. I’m also guessing that decision hasn’t impacted many users: a lot of those machines went into home labs, yet this doesn’t seem to be a common problem.

Reformatting worked

After the reformat and a reboot of the iLO it was back to snappy performance. Better still USB storage started working again, so I could finally do that ESXi upgrade.


If you’re here for my experiments in culinary science move along swiftly, this post isn’t for you. This is all about enterprise architecture versus cloud native architecture.

RDBMS is a meatball

Enterprises use (or at least have used) Relational Database Management Systems (RDBMS), and such things have become deeply embedded into the organisation and culture around maintaining ‘books and records’ of the firm. Something I’ve previously labelled the ‘cult of the DBA'[1].

Enterprise meatballs don’t scale

RDBMS are generally limited by ‘the biggest box money can buy’. That’s not entirely true since the advent of Oracle Real Application Clusters (RAC), but by then the norms of RDBMS use were already well established. The story goes roughly like this.

You might have a business need for some of my data, but you can’t use my database because my application is already running the biggest box money can buy to the ragged edge of its performance envelope[2]. Get your own RDBMS with its own box, and I’ll give you a copy of the data with an Extract Transform Load (ETL) job

and so we get spaghetti

RDBMS to ETL to RDBMS to ETL to RDBMS to you get the drift… Meatballs and spaghetti, spaghetti and meatballs.

It quickly gets messy. The worst part is the T in ETL, because the shape and naming keep changing as data gets re-purposed for different uses.
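
A tiny, contrived example of the ‘T’ problem, with invented field names: each hop reshapes and renames the same underlying data, so by the time there are a few copies nobody can agree what a field is called or how it was derived.

# Hypothetical views of the same record after successive ETL hops
source_row    = {"cust_id": 42, "yld_crv": [0.011, 0.014, 0.018]}        # system of record
risk_copy     = {"customerId": 42, "yieldCurve": [0.011, 0.014, 0.018]}  # first ETL renames the fields
reporting_row = {"CUSTOMER": "0042", "curve_pct": [1.1, 1.4, 1.8]}       # second ETL reshapes and rescales

# Tracing 'yield curve' back to the source now needs knowledge of every transform in the chain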

What changes in the cloud?

Sticking with the metaphor, the cloud has infinitely large meatballs[3], so no need for spaghetti any more.

‘Cloud scale’ architecture liberates us from ‘the biggest box money can buy’ because the clouders found ways to scale horizontally. This is largely achieved by throwing off the shackles of ‘relational’, though we get to keep that if it’s really needed, and we can still use SQL too if that’s useful.

This simplifies things greatly. Every app can rally around the same source of truth, and ‘master data management’ boils down to the good management of one giant database rather than the cat herding exercise of figuring out how you got 122 different ways of describing ‘yield curve’.

This does not map well to present enterprise organisations

Meatballs, and the monoliths built on top of them, fit super snugly into traditional organisation structures (the purpose, boundaries and budgets for each siloed function). The spaghetti that wired everything together then became a cultural norm (how we do things around here).

Enterprise adoption of cloud native data management might hold the promise of greatly simplifying everything, but will be fought every step of the way as it cuts across the organisation structure and culture that evolved around it.

If (as Adrian Cockcroft says) ‘DevOps is a reorg’ then this is the same. Somehow ‘cloud data management is a reorg’ sounds less catchy. It should probably happen alongside the DevOps reorg anyway.

Notes

[1] See NoSQL as a governance arbitrage
[2] This is usually somewhere between a small fib and a massive lie. The biggest box that money can buy has been bought in anticipation of many things that might affect capacity management over time, including how long it takes to get approval to buy anything. But the lie is told anyway, because who wants to worry about another group’s capacity needs (or worse still setting up an internal chargeback for their usage)?
[3] Not actually true, but in the real world you’ll run out of money before they run out of capacity.


TL;DR

Decision making is at the heart of an organisation’s purpose, but it’s rare to see much effort being spent on improving the quality of decision making, and typical to see all decisions mired in time consuming bureaucratic process. We can do better, with a little coarse filtering, some doctrine and situational awareness, and a bias towards tightening feedback loops.

Background

Over the past few months this topic has come up in a few different places for me. First there was Sam Harris’s ‘Mental Models‘ podcast conversation with Farnam Street[1] blog founder Shane Parrish. Then there was Dominic Cummings‘[2] epic[3] ‘High performance government, ‘cognitive technologies’, Michael Nielsen, Bret Victor, & ‘Seeing Rooms’‘. All against a background of daily tweets from Simon Wardley about his mapping, culminating in an excellent explainer video from Mike Lamb.

Do good, or avoid bad?

My first observation would be that most organisations import the human frailty of loss aversion, and so the machinery of decision making (generally labelled ‘governance’) is usually arranged to stop bad decisions rather than to promote good decisions.

It’s also usual for the same governance processes to be applied to all decisions, whether they’re important or not. Amazon’s founder and CEO Jeff Bezos is a visible example of somebody who’s figured this out and done something about it. Bezos distinguishes between irreversible (Type 1) and reversible (Type 2) decisions. In his 2015 letter to shareholders he writes in a section headed ‘Invention Machine’:

Some decisions are consequential and irreversible or nearly irreversible – one-way doors – and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation. If you walk through and don’t like what you see on the other side, you can’t get back to where you were before. We can call these Type 1 decisions. But most decisions aren’t like that – they are changeable, reversible – they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long. You can reopen the door and go back through. Type 2 decisions can and should be made quickly by high judgement individuals or small groups.

As organizations get larger, there seems to be a tendency to use the heavy-weight Type 1 decision-making process on most decisions, including many Type 2 decisions. The end result of this is slowness, unthoughtful risk aversion, failure to experiment sufficiently, and consequently diminished invention.[4] We’ll have to figure out how to fight that tendency.

And one-size-fits-all thinking will turn out to be only one of the pitfalls. We’ll work hard to avoid it… and any other large organization maladies we can identify.

Data driven decision making

If you torture your data hard enough, it will tell you exactly what you want to hear[5]

‘Data is the new oil’ has been a slogan for the last decade or so, and Google (perhaps more than any other organisation) has championed the idea that every problem can be solved by starting with the data.

Unfortunately data is just raw material, and data management systems (whether they’re ‘big’ or not) are just tools. Data driven decisions need the right data (scope), correct data (accuracy), appropriate processing and presentation, and a proper insertion point into the decision making process. The Google approach can easily become A/B testing into mediocrity; but most organisations don’t even get that far. They spend $tons on some Hadoops or similar and a giant infrastructure to run it on, then build what they hope is a data lake, which in fact is a data swamp, somehow expecting insight to squirt forth directly into the boardroom.

Strategy first, then the data machinery to support that strategy, not the other way around.

Being agile

That’s a deliberate lower-case ‘a’.

Whether we’re learning from evolution or OODA loops we know that the fastest adapter wins. So a relatively high level decision that an organisation might commit to is being adaptive to customer needs.

Agility, agility, agility – we want to adapt to ever-changing customer needs, which means we need Agile software development, which means we need an agile infrastructure… buy a cloud from us. The latter part is a jokey reference to behaviour I saw in the IT industry a few years back, and I think most players have now figured out that clouds don’t magically confer agility, and that you actually need to build something that provides end-to-end connectivity from need to implementation.

The point here is that you can’t just pick a single aspect of ‘agile’, like buying a cloud service, or choosing to do Agile development[6]. It has to be driven end-to-end. This means that leaders can’t just decree that somebody else (lower down) in the organisation will ‘do agile’, they have to get involved themselves, and the ‘governance’ processes need to be dragged along too.

The Wardley adherents following along will at this stage be struggling to contain:

But Chris, Simon says that Agile is only suitable for genesis activities, and we should use Lean and Six Sigma for products and commodities.

To which I respond, genesis is the only area of any interest. For sure products and services should be bought; and even if you’re a product or service company the only interesting things happening within those companies are genesis. It’s turtles all the way down for the Lean and Six Sigma stuff, and it’s not interesting for decision making because (by definition) we already know how to do those things and do them well.

Doctrine

Also don’t waste time on decisions that other people have already figured out the answers to. That’s what doctrine’s all about, and Mr Wardley has been kind enough to catalogue it for us. This is why he tells us that there’s no point in using techniques like Pioneer, Settler, Town Planner (PST) until doctrine is straightened out, because it’s like building without foundations.

Conclusion

Organisations function by making decisions, about Why, What and How, so it’s startling how bad most organisations are at it, and how easily organisations that get good at decision making can outpace and outmanoeuvre their competition (or even just the status quo for organisations that don’t compete). It’s also sad but true that some of the best brains for decision making are sat within investment funds, effectively throwing tips from the sidelines rather than getting directly involved in the game.

The first step is doctrine – don’t spend time and treasure on stuff that’s already figured out. The next step is to categorise decisions by their reversibility (which is inevitably a proxy for impact) and stream different categories through different levels of scrutiny. Then comes the time to focus on making good timely decisions in addition to avoiding bad decisions.

Notes

[1] Named after Warren Buffett‘s residence in Omaha where he spends his time reading and thinking about how to steer the fortunes of Berkshire Hathaway.
[2] Cummings is a contentious figure for me. I despise what he did as the Campaign Director of Vote Leave (wonderfully portrayed by Benedict Cumberbatch in Channel 4’s ‘Brexit: The Uncivil War‘); but I find that I must admire the way that he did it. He ran a thoughtful 21st century campaign against a bunch of half-hearted nitwits who clearly struggled to drag themselves out of the Victorian era. No wonder he won. I should also note that he’s disavowed what’s subsequently become of Brexit, as his vision and strategy have not been taken on by those now handling the execution.
[3] It seems every Cummings post is an example of ‘If I had more time, I would have written a shorter letter’. He’s obviously still keeping himself very busy.
[4] Bezos footnotes: “The opposite situation is less interesting and there is undoubtedly some survivorship bias. Any companies that habitually use the light-weight Type 2 decision-making process to make Type 1 decisions go extinct before they get large.”
[5] With apologies to Ronald Coase who originally said, ‘If you torture the data enough, nature will always confess’.
[6] For an excellent treatise on the state of Agile adoption I can recommend Charles Lambdin’s ‘Dear Agile, I’m Tired of Pretending’.


As we hit the second anniversary of NotPetya, this retrospective is based on the author’s personal involvement in the post-incident activities.

Continue reading the full story at InfoQ.


It turned out that my TMS9995 system had no modules in common with my CP/M system, as it’s using the ROM and RAM modules left over from the CP/M upgrade. All I needed was another backplane to be able to run both at once.

SC116 3 slot backplane

As the TMS9995 uses three modules (CPU, ROM and RAM) I wanted a 3 slot backplane[1], and it turns out that Steve Cousins makes just the thing with his SC116, which is available on Tindie. I ordered one yesterday, and it arrived today :)

It only took a few minutes to put together (in part because I left out all the bits I didn’t need).

Pi terminal server

It’s great to use my RC2014s from a serial terminal on my laptop or PC, but that means I can’t do anything with them when I’m away from home.

To get over that I’ve used the Raspberry Pi that’s sat on the UPS on my desk (my first original Model B) along with a couple of UART cables.

I wanted to get another FT232 cable, but the eBay supplier I used last time is away. So instead I ordered a pair of PL2303 based cables from Amazon. These turned out to work fine with my laptop, but not so great with the Pi, where I could only get one working at a time (due to power issues?). I also hit a somewhat well documented issue with (clone?) PL2303X chips where it’s necessary to run:

sudo modprobe -r pl2303 && sudo modprobe pl2303

The compromise I ended up with was the FT232 for the CP/M system using:

screen -S Z80 /dev/ttyUSB0 115200

and the PL2303 for the TMS9995 using:

sudo minicom -D /dev/ttyUSB1

This means I can now SSH to my Pi and connect from there to the two RC2014 systems.

Note

[1] At some stage I may make a TMS9901 I/O module, and I’d also like to see if I can add a TMS9918 video adapter (like this one), so I might have to upgrade to 5 slots later on.