Virtual Machine Capacity Management

15Oct17

TL;DR

VMs on public cloud don’t provide the same level of control over sizing as on-premises VMs, and this changes how capacity is managed. Most importantly, ‘T-shirt’ sizing can give a suboptimal fit of workload to infrastructure, and the ability to overcommit CPUs is very much curtailed.

Introduction

Capacity Management is an essential discipline in IT – important enough to be a whole area of the IT Infrastructure Library (ITIL). As organisations shift from on premises based virtualisation to the use of cloud based infrastructure as a service (IaaS) it’s worth looking at how things change with respect to capacity management. Wikipedia describes the primary goal of capacity management as ensuring that:

IT resources are right-sized to meet current and future business requirements in a cost-effective manner

I’ll use VMware and AWS as my typical examples of on prem virtualisation and cloud IaaS, but many of the points are equally applicable to their competitors.

Bin packing

Right sizing a current environment is essentially a bin packing problem where as much workload as possible should be squeezed onto as little physical equipment as possible. This type of problem is ‘combinatorial non-deterministic polynomial time (NP) hard’, which means that we can spend a lot of time and computer cycles coming up with a perfect answer. In practice perfect isn’t needed, and what needs to be packed into the bins keeps changing anyway, so various shortcuts can be taken to a good enough solution.
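One common shortcut is the first-fit decreasing heuristic: sort the workloads largest first, then drop each into the first host that still has room. The sketch below is only an illustration of that idea; the VM sizes and the 64 GiB host capacity are made-up figures, and a real placement engine would also weigh CPU, storage, network and affinity rules.

```python
def first_fit_decreasing(demands, capacity):
    """Pack demands into bins of the given capacity using first-fit decreasing.

    This is a heuristic: it gives a good answer quickly, not a guaranteed optimum.
    """
    bins = []  # each bin is the list of demands placed on one host
    for demand in sorted(demands, reverse=True):
        if demand > capacity:
            raise ValueError(f"demand {demand} exceeds host capacity {capacity}")
        for contents in bins:
            if sum(contents) + demand <= capacity:
                contents.append(demand)
                break
        else:
            # No existing host had room, so open a new one.
            bins.append([demand])
    return bins


# Hypothetical VM memory demands in GiB, packed onto hypothetical 64 GiB hosts.
vm_memory_gib = [32, 24, 16, 16, 12, 8, 8, 4, 4, 4]
hosts = first_fit_decreasing(vm_memory_gib, capacity=64)
for number, contents in enumerate(hosts, start=1):
    print(f"host {number}: {contents} uses {sum(contents)} GiB")
```

Because the workload mix keeps changing, a cheap heuristic like this can simply be re-run whenever it does.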

Inside out

It’s worth noting that the resources consumed within a VM (an inside measure) aren’t the same as the VM’s sizing (an outside measure). The inside measure is what’s really being used, whilst the outside measure sets constraints. For example, if a VM is given 4GB of memory and it’s running a Java app that never goes over 1.5GB of heap then it’s probably oversized: we could get away with a 2GB VM, or even a 1.6GB VM.

Hypervisors are great at instrumentation

The hypervisor virtualisation layer has perfect knowledge of the hardware resources that it’s presenting to the VMs running on it, so it’s a rich source of data to inform sizing questions. In the example above the hypervisor can tell us that its guest is never consuming more than 1.6GB of RAM.
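To make the inside versus outside comparison concrete, here is a minimal sketch that turns a series of observed memory readings into a suggested VM size. The function name, the 10% headroom and the 0.5 GB rounding step are all assumptions chosen for illustration, and the samples are invented rather than taken from any real hypervisor API.

```python
import math


def rightsize_memory_gb(samples_gb, headroom=0.1, step_gb=0.5):
    """Suggest an outside size from inside measurements.

    Takes the peak observed usage, adds headroom, and rounds up to the
    nearest allocation step. All three choices are illustrative assumptions.
    """
    if not samples_gb:
        raise ValueError("need at least one usage sample")
    target = max(samples_gb) * (1 + headroom)
    return math.ceil(target / step_gb) * step_gb


# Hypothetical readings for the Java app VM described above (GB consumed).
samples = [1.1, 1.3, 1.6, 1.4, 1.2]
allocated_gb = 4.0
suggested_gb = rightsize_memory_gb(samples)
print(f"allocated {allocated_gb} GB, suggested {suggested_gb} GB")
```

With these figures the suggestion comes out at 2 GB, half of what was allocated.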

#1 important point about IaaS – the cloud service provider (CSP) runs the hypervisor rather than the end user. They may choose to offer some instrumentation from their environment, but it’s unlikely to be at full fidelity. Of course they have this data themselves to inform their own capacity management.

T-shirt sizes and fits

T-shirt sizes

IaaS is generally sold in various sizes like Small, Medium, Large, XL etc. – like (men’s) T-shirts. Unlike real life T-shirts, which are incrementally bigger, cloud T-shirts are generally exponentially bigger, so an XL is twice a Large is twice a Medium is twice a Small, meaning that the XL is 8x as big (and expensive) as a Small.

A typical instance type on AWS is an m4.xlarge, which has 4 vCPU and 16 GiB RAM, so if I have a workload that needs 2 vCPU and 9 GiB RAM I need that instance type because it’s the smallest T-shirt that fits (as an m4.large with 8 GiB RAM would be short on RAM). In a VMware environment I’d just size that VM to 2 vCPU and 9 GiB RAM, but I don’t have that degree of control in the cloud.
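Picking the smallest T-shirt that fits can be sketched as a walk up a size-ordered catalogue. The two entries below use the figures quoted above (with 2 vCPU for the m4.large, in line with the doubling pattern); the `InstanceType` class and `smallest_fit` function are hypothetical names for this illustration, not part of any AWS SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: int
    ram_gib: float


# Ordered smallest first; a deliberately tiny catalogue for illustration.
CATALOGUE = [
    InstanceType("m4.large", 2, 8),
    InstanceType("m4.xlarge", 4, 16),
]


def smallest_fit(vcpu_needed, ram_needed_gib, catalogue=CATALOGUE):
    """Return the first (smallest) instance type that satisfies both needs."""
    for instance in catalogue:
        if instance.vcpu >= vcpu_needed and instance.ram_gib >= ram_needed_gib:
            return instance
    return None  # nothing in the catalogue is big enough


choice = smallest_fit(2, 9)
unused_vcpu = choice.vcpu - 2
unused_ram_gib = choice.ram_gib - 9
print(f"{choice.name}: {unused_vcpu} vCPU and {unused_ram_gib} GiB paid for but unused")
```

One extra GiB of RAM tips the workload into the next size up, leaving 2 vCPU and 7 GiB idle.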

#2 important point about IaaS – T-shirt sizes mean that fine control over capacity allocation isn’t possible.

It’s worth taking a quick detour here to explore the meaning of vCPU, because these are entirely different between VMware and AWS. In a VMware environment the vCPUs allocated to a VM represent a largely arbitrary mapping to the host’s physical CPUs. In modern AWS instances the mapping is much more clearly defined – a vCPU is a hyperthread on a physical core (and thus 2 vCPUs are a whole core). The exception is T2 type instances, which have shared cores, and quite a neat usage credit system to ensure fair allocation.

#3 important point about IaaS – CPU over commit is only possible by using the specific instance types that support it.

T-shirt fits

Just as different T-shirt fits apply to different body shapes, different instance types apply to different workload shapes. The M or T types are a ‘general purpose’ mix, whilst C is ‘compute optimised’, X or R are ‘memory optimised’ and I is ‘storage optimised’. AWS can use the data from their actual users to tailor the fits to real world usage.

Returning to the misfit above (2 vCPU and 9 GiB RAM workload) this will fit onto an R4.large, which is $0.133/hr, a saving of $0.067/hr versus the M4.xlarge at $0.2/hr. Comparing to an M4.large at $0.1/hr it’s 33% more expensive for 12.5% more RAM needed, but that’s a whole lot better than 100%.
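Working the figures through as a quick check: the prices below are the ones quoted in the text, with the M4.xlarge assumed to be $0.2/hr (twice the M4.large, per the doubling of T-shirt sizes). The `premium_over` helper is just a name invented for this sketch.

```python
# Hourly prices in USD. The m4.xlarge figure is an assumption: twice the m4.large.
PRICE_PER_HOUR = {"m4.large": 0.100, "r4.large": 0.133, "m4.xlarge": 0.200}


def premium_over(baseline, candidate, prices=PRICE_PER_HOUR):
    """Percentage by which candidate costs more than baseline."""
    return (prices[candidate] / prices[baseline] - 1) * 100


r4_premium = premium_over("m4.large", "r4.large")
xlarge_premium = premium_over("m4.large", "m4.xlarge")
saving_per_hour = PRICE_PER_HOUR["m4.xlarge"] - PRICE_PER_HOUR["r4.large"]
extra_ram_needed = (9 / 8 - 1) * 100  # 9 GiB needed versus 8 GiB on the m4.large

print(f"R4.large premium over M4.large: {r4_premium:.0f}%")
print(f"M4.xlarge premium over M4.large: {xlarge_premium:.0f}%")
print(f"Saving of R4.large versus M4.xlarge: ${saving_per_hour:.3f}/hr")
print(f"Extra RAM needed beyond M4.large: {extra_ram_needed:.1f}%")
```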

How do containers change things?

Bin packing is easier with lots of small things, and containers tend to be smaller than VMs, so in general containers provide a less lumpy problem that’s easier to optimise.
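A toy comparison shows the effect of granularity. The same total demand is packed twice, once as a few large lumps and once as many small pieces, using a simple first-fit decreasing pass. The unit sizes and host capacity are invented for illustration.

```python
def hosts_needed(demands, capacity):
    """Count hosts used by a first-fit decreasing pass (a heuristic, not an optimum)."""
    free = []  # remaining capacity on each host opened so far
    for demand in sorted(demands, reverse=True):
        if demand > capacity:
            raise ValueError(f"demand {demand} exceeds host capacity {capacity}")
        for index, remaining in enumerate(free):
            if demand <= remaining:
                free[index] = remaining - demand
                break
        else:
            # Nothing had room, so open another host.
            free.append(capacity - demand)
    return len(free)


# The same 30 units of demand at two granularities, on hosts of 10 units each.
vm_sized = [6] * 5          # five lumpy, VM-sized workloads
container_sized = [2] * 15  # the same total as fifteen smaller pieces

vm_hosts = hosts_needed(vm_sized, capacity=10)
container_hosts = hosts_needed(container_sized, capacity=10)
print(f"VM-sized: {vm_hosts} hosts, container-sized: {container_hosts} hosts")
```

The lumpy workloads strand 4 units on every host, whilst the smaller pieces fill each host completely.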

This shouldn’t be done manually

As noted above bin packing is NP-hard, so if you ask humans to do it by hand the approximations will be pretty atrocious. This is work that should be left to machines. VMware is great at scheduling work onto CPUs, but it doesn’t optimise a bunch of VMs across a set of machines. This is where 3rd party solutions like Turbonomic come into play, which can take care of rightsizing VMs (fit of outside to inside) and optimising the bin packing across physical machines.

Google has been doing a great job of this on their estate for some time, and I’d recommend a look at John Wilkes’ ‘Cluster Management at Google’. That best practice has been steadily leaking out, and Google’s Kubernetes project now provides a (rudimentary) scheduler for container based workloads.

What about PaaS and FaaS?

Platform as a Service (like Cloud Foundry) and Functions as a Service (like AWS Lambda) abstract away from servers and the capacity management tasks associated with them. It’s also worth noting that if containers provide smaller things to pack into capacity bins then functions take that to an entirely different level, with much more discrete elements of work to be considered in aggregate as workload.

Planning for the future

Most of this post has been about capacity management in the present, but a huge part of the discipline is about managing capacity into the future, which usually means planning for growth so that additional capacity is available on time. This is where IaaS has a clear advantage, as the future capacity management is a service provider responsibility.

Conclusion

IaaS saves its users from the need to do future capacity planning, but it’s less flexible in the present as the T-shirt sizes and fits provide only an approximate fit to any given workload rather than a perfect fit – so from a bin packing perspective it can leave a lot of space in the bin that’s been paid for but that can’t be used.



2 Responses to “Virtual Machine Capacity Management”

  1. Reblogged this on Cloud Information Management and commented:
    Bin Packing T-Shirts!

  2. Justin

    This is a really good post.

    The art of ITIL capacity management seems to have been lost.

    I still believe it is absolutely key even when running public cloud. To make public cloud economically viable and extract maximum benefit from the provider an active capacity management function is key in my view.

