Grids, clouds and why productivity matters most

19Nov10

My friend Randy posted a few days ago on ‘Grid, Cloud, HPC … What’s the Diff?’. I started to make a comment on the blog, but it was getting too long, so I moved it here.

Randy does a good job of pinning down both performance and scalability, but in my experience productivity trumps both. This is another way of saying that there are sometimes smarter ways of reaching an outcome than brute force. There’s a DARPA initiative spun up around this – High Productivity Computing Systems – which I think came about when somebody looked at what Moore’s Law implied for the energy and cooling characteristics of a Humvee full of C4I kit.

A crucial point when considering productivity is to look at the overall system (which can be a lot more than what’s in the data centre). Whilst it may be possible to squeeze the work done by hundreds of commodity machines onto a single FPGA, that has implications for development and maintenance time and effort. In a banking environment, where the quants who develop this stuff are far from cheap, it may actually be worth throwing a few $m at servers and electricity rather than impacting developer productivity.

I’d argue that the lines between message passing interface (MPI) workloads and embarrassingly parallel problems (EPP) are blurrier than Randy makes out. It’s all about the ratio of compute to data, and data dependency matters a lot – if the outcome of the next calculation depends on the results of an earlier one then you can end up shovelling a lot of data around (and high-speed, low-latency interconnect might be vital). On the other hand, if the results of the next calculation are independent of previous results then there’s less data to be wrangled. Monte Carlo simulation, which is used a lot in finance, tends to have less data dependency than other types of algorithm.
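
To make that concrete, here’s a minimal Python sketch of why Monte Carlo fits the EPP mould (the option parameters and function names are all mine, purely for illustration): each trial depends only on the initial inputs, never on another trial’s result, so the only data that moves is a handful of scalars in and one number out.

```python
import math
import random
from multiprocessing import Pool

def price_one_path(args):
    """One independent trial: simulate a terminal asset price under
    geometric Brownian motion and return the discounted call payoff."""
    spot, strike, rate, vol, horizon, seed = args
    z = random.Random(seed).gauss(0.0, 1.0)
    terminal = spot * math.exp((rate - 0.5 * vol ** 2) * horizon
                               + vol * math.sqrt(horizon) * z)
    return math.exp(-rate * horizon) * max(terminal - strike, 0.0)

if __name__ == "__main__":
    trials = 100_000
    # Each task carries a handful of input scalars and returns one float,
    # so the compute-to-data ratio is high and no fast interconnect is
    # needed between workers.
    inputs = [(100.0, 105.0, 0.05, 0.2, 1.0, seed) for seed in range(trials)]
    with Pool() as pool:  # local stand-in for farming tasks out to a grid
        payoffs = pool.map(price_one_path, inputs)
    print("estimated option price:", sum(payoffs) / trials)
```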

Most ‘EPP’ workloads are low in data dependency (usually with just initial input variables and an output result), and so the systems are designed to be stateless (in an effort to keep them simple). This causes a duty cycle effect, where some time is spent loading data versus the time spent working on the data to produce a result (and if the result set is large there may also be dead time spent moving that around). Duty cycles can be improved in many cases by moving to a more stateful architecture, where input data is passed by reference rather than by value and cached (so if some input data is already there from a previous calculation it can be reused immediately rather than hauled across the network). This is what ‘data grid’ is all about.
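
As a toy illustration of the data grid idea (the store, the key and the timings below are all invented), a worker that accepts a reference to its input and keeps a node-local cache only pays the network cost once:

```python
import time

_cache = {}  # node-local cache of input data sets, keyed by reference

def fetch_from_store(ref):
    """Stand-in for the expensive haul across the network."""
    print(f"cache miss: fetching {ref} over the network")
    time.sleep(1.0)                  # simulated transfer time
    return list(range(1_000_000))    # simulated bulk input data

def get_input(ref):
    """Return the data behind `ref`, reusing a local copy if one exists."""
    if ref not in _cache:
        _cache[ref] = fetch_from_store(ref)
    return _cache[ref]

def run_task(data_ref):
    """A calculation that names its input by reference, not by value."""
    data = get_input(data_ref)   # cheap after the first use of data_ref
    return sum(data)             # the actual work

if __name__ == "__main__":
    for _ in range(3):
        start = time.perf_counter()
        run_task("marketdata/2010-11-19")
        print(f"task took {time.perf_counter() - start:.2f}s")
    # Only the first task pays the transfer cost; the rest hit the cache,
    # so the duty cycle improves.
```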

Getting back to the question of ‘grid’ versus ‘cloud’, I agree with Randy that there’s a big overlap, and it’s encouraging to see services like Amazon’s Cluster Compute Instances (and their new cousin, the Cluster GPU Instances). I will however return to a point I’ve made before – Cloud works despite the network, not because of it. The ‘thin straw’ between any data and the cloud capability that may work on it remains painful – making the duty cycle worse than it might otherwise be. For that reason I expect that even those with ‘EPP’ type problems will have to think more carefully about the question of stateful versus stateless than they may have done before. It matters a lot for the overall productivity of many of these systems.
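
A back-of-envelope calculation shows how much the thin straw can hurt; the bandwidth and compute figures below are invented purely for illustration:

```python
def duty_cycle(data_gb, bandwidth_gbps, compute_secs):
    """Compute time as a fraction of (transfer + compute) wall-clock time."""
    transfer_secs = data_gb * 8 / bandwidth_gbps  # gigabytes -> gigabits
    return compute_secs / (transfer_secs + compute_secs)

# 10 GB of input and 60s of compute: a fat local pipe vs a thin WAN straw.
print(f"10 Gbit/s LAN: {duty_cycle(10, 10, 60):.0%} useful")   # ~88%
print(f"100 Mbit/s WAN: {duty_cycle(10, 0.1, 60):.0%} useful")  # ~7%
```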