Field Programmable Gate Arrays (FPGAs) have been around for decades, but they’ve become a hot topic again. Intel recently announced Xeon chips with FPGAs added on, Microsoft are using FPGAs to speed up search on Bing, and there are Kickstarter projects such as miniSpartan6+ trying to bring the ease of use and mass appeal of the Arduino to FPGAs. Better accessibility is a good thing, because whilst the hardware might be easy to get at, the skills to use it are thin on the ground. That could be a big deal as Moore’s law comes to an end and people start looking more closely at optimised hardware for improved speed.
I first came across FPGAs whilst doing my final year project in the compute lab of the Electronics department at the University of York. Neil Howard sat nearby and was working on compiling C (or at least a subset of C) directly to hardware on an FPGA. Using Conway’s Game of Life as a benchmark, he was seeing a 1000x speed improvement on the FPGA versus his NeXT workstation. Those three orders of magnitude are still on the table today, as FPGAs have kept pace by taking advantage of the same Moore’s law improvements in fabrication technology as CPUs.
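The Game of Life suits hardware well because every cell follows the same tiny local rule. A minimal software version is sketched below in Python; the wrap-around edges and the name `life_step` are assumptions for illustration, not details of Neil Howard’s benchmark. A CPU has to visit the cells one at a time, whereas an FPGA can, in principle, give each cell its own small circuit and update them all in the same clock cycle, which goes a long way to explaining speed-ups of that order.

```python
def life_step(grid):
    """Return the next Game of Life generation of a 2D grid of 0/1 cells.

    Assumption: the grid wraps around at the edges (a torus). Each new cell
    depends only on its own state and its eight neighbours, so in hardware
    every cell could in principle get its own circuit and update in parallel.
    """
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count live neighbours, wrapping around at the edges.
            live = sum(
                grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            # A cell is born with exactly 3 neighbours and survives with 2 or 3.
            new[r][c] = 1 if live == 3 or (grid[r][c] == 1 and live == 2) else 0
    return new


# A "blinker": three cells in a row that flip between horizontal and vertical.
blinker = [[0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 1, 1, 1, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0]]
for row in life_step(blinker):
    print(row)
```

The software version does rows × cols units of work per generation one after another; the hardware version trades that loop for silicon area.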
My next encounter
FPGAs came up again when I was working on market risk management systems in financial services. I’d done the engineering work on a multi-thousand-node compute grid, which was a large and costly endeavour. If we could seize a 1000x performance boost (or even just 100x), then we could potentially downsize from thousands of nodes to a handful. The data centre savings were very tantalising.
I found a defence contractor with FPGA experience that was looking to break into the banking sector. They very quickly knocked up a demo of a Monte Carlo simulation of a Bermudan option. It ran about 400x faster than the reference C/C++ code. A slam dunk, one might think.
Mind the skills gap
When the quants first saw the demo going 400x faster they were wowed. By the end of the demo it was clear that we weren’t going to be buying. The quant team had none of the skills needed to maintain FPGA code themselves, and were unwilling to outsource future development to a third party.
There was an element of ‘not invented here’ and other organisational politics in play, but this was also an example of local optimisation versus global optimisation. If we could switch off a thousand nodes in the data centre, that would save some $1m/yr. However, if it took more than a couple of extra quants to make that switch, then that would cost >$1m/yr (quants don’t come cheap).
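The global view boils down to a break-even headcount. A minimal sketch, assuming the roughly $1,000 per node per year implied by the figures above and a purely hypothetical fully loaded cost of $400,000 per quant:

```python
def break_even_headcount(nodes_retired, cost_per_node_per_year, cost_per_head_per_year):
    """How many extra staff the yearly hardware saving could pay for (rounded down).

    All inputs are illustrative assumptions, not figures from a real budget.
    """
    annual_saving = nodes_retired * cost_per_node_per_year
    return annual_saving // cost_per_head_per_year


# 1,000 nodes at ~$1,000/yr each is ~$1m/yr; a hypothetical $400k per quant.
print(break_even_headcount(1_000, 1_000, 400_000))
```

Under those assumptions the saving pays for two quants, so a third would already make the switch a net loss.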
Field programmable means something that can be modified away from the factory, and a gate array is just a grid of elementary logic gates (classically NANDs; modern FPGAs actually use small lookup tables, or LUTs, that can be configured to implement any logic function of their few inputs). The programming is generally done using a hardware description language (HDL) such as Verilog or VHDL. HDLs are about as user friendly as assembly language, so they’re not a super productive environment.
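Two ideas from the paragraph above can be sketched in a few lines of Python (rather than an HDL, to keep the example self-contained): a logic element whose behaviour is set purely by configuration bits, and the universality of NAND. The 2-input LUT is a simplification (real devices typically use LUTs with 4 to 6 inputs), and the helper names are hypothetical.

```python
def make_lut2(truth_table):
    """Return a 2-input logic function defined by a 4-entry truth table.

    truth_table[i] is the output for inputs (a, b) with i = 2 * a + b.
    Changing these four bits "reprograms" the element without any rewiring.
    """
    bits = tuple(truth_table)
    if len(bits) != 4:
        raise ValueError("a 2-input LUT needs exactly 4 configuration bits")
    return lambda a, b: bits[2 * a + b]


# Configure the LUT as a NAND: the output is 0 only when both inputs are 1.
nand = make_lut2([1, 1, 1, 0])


# Every other gate can then be built from NANDs alone.
def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))


def xor(a, b):
    t = nand(a, b)  # classic four-NAND XOR
    return nand(nand(a, t), nand(b, t))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_(a, b), or_(a, b), xor(a, b))
```

An HDL describes the same gates and connections, but a synthesis tool then maps them onto the device’s configurable elements rather than running them as software.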
My electronics degree had a tiny bit of PIC programming in it, but I didn’t really learn HDL. Likewise, my friends doing computer science didn’t get much lower level than C (and many courses these days never go below Java). Enlightened schools might use a text like The Elements of Computing Systems: Building a Modern Computer from First Principles, aka Nand2Tetris, which uses a basic HDL for the hardware chapters, but I fear they are in the minority.
Since HDLs pretty much aren’t taught in schools, the only place people learn them is on the job, in roles where they’re designing hardware (whether FPGA based or using application specific integrated circuits [ASICs]). The skills are out there, but very much concentrated in the hubs of semiconductor development such as the Far East, Silicon Valley and Cambridge.
The open source hardware community (such as London’s OSHUG) also represents a small puddle of FPGA/HDL skill. I was fortunate enough to attend a Chip Hack workshop with my son recently. It’s a lot of fun to go, in the space of a weekend, from blinking a few LEDs to booting Linux on an OpenRISC soft core that you’ve just flashed yourself.
The other speed issue
FPGAs are able to go very fast for certain dedicated operations, which is why specialist hardware is used for things like packet processing in networks. Programming FPGAs is also reasonably fast – even a relatively complex system like an OpenRISC soft core can be flashed in a matter of seconds. The problem is getting from HDL to a configuration for the array of gates: synthesis turns the HDL into a netlist of logic elements, then a process known as place and route decides where to put each element and how to wire them together. That is a very compute intensive and time consuming operation, which can take hours for a complex design. Worst of all, even a trivial change in the HDL normally means starting from scratch to work out the new layout.
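To give a feel for why placement is so expensive, here is a toy placer in Python using simulated annealing, the technique behind classic academic placers such as VPR. It is a sketch under heavy simplifying assumptions: one cell per grid site, half-perimeter wirelength as the only cost, and no routing or timing analysis; the function names and parameters are hypothetical. Even 16 cells on 16 sites can be arranged in 16! (about 2.1 × 10^13) ways, so tools have to search heuristically rather than exhaustively.

```python
import math
import random


def wirelength(positions, nets):
    """Half-perimeter wirelength: sum over nets of bounding-box width + height."""
    total = 0
    for net in nets:
        xs = [positions[cell][0] for cell in net]
        ys = [positions[cell][1] for cell in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal_placement(n_cells, nets, width, height, steps=20_000, t0=5.0, seed=0):
    """Toy simulated-annealing placer on a width x height grid of sites.

    slots[i] is the (x, y) site of cell i for i < n_cells; the remaining
    entries are free sites, so swapping with one of them moves a cell.
    """
    if n_cells > width * height:
        raise ValueError("more cells than sites")
    rng = random.Random(seed)
    slots = [(x, y) for x in range(width) for y in range(height)]
    rng.shuffle(slots)  # random initial placement
    cost = wirelength(slots, nets)
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-3  # linear cooling schedule
        a, b = rng.randrange(n_cells), rng.randrange(len(slots))
        slots[a], slots[b] = slots[b], slots[a]
        new_cost = wirelength(slots, nets)  # real placers update this incrementally
        delta = new_cost - cost
        # Always accept improvements; accept some worse moves while still "hot".
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
        else:
            slots[a], slots[b] = slots[b], slots[a]  # reject: undo the swap
    return slots[:n_cells], cost


# Sixteen cells wired in a chain, placed on a 4 x 4 grid.
chain = [(i, i + 1) for i in range(15)]
placement, cost = anneal_placement(16, chain, 4, 4)
print(cost)
```

This toy needs tens of thousands of trial moves, each re-costed, for just 16 cells. A large FPGA design has hundreds of thousands of logic elements or more, which helps explain why runs take hours and why a small HDL change that perturbs the layout is so expensive.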
Google’s Urs Hölzle alluded to this issue in a recent interview, explaining why he wouldn’t be following Microsoft in using FPGAs for search.
Whilst FPGAs didn’t catch on for market risk at banks, they’ve become a ubiquitous component of the ‘race to zero’ in high frequency trading. The teams managing those systems now run grids of overclocked servers to speed up place and route and so get new designs into production faster.
Hard or soft core?
Intel may only recently have started strapping FPGAs onto its high-end x86 processors, but many FPGAs have had their own CPUs built in for some time. Hard cores, usually ARM (or PowerPC in older designs), provide an easy way to combine hardware- and software-based approaches. FPGAs can also be programmed to become CPUs by using a soft core design such as OpenRISC or OpenSPARC.
Programming hardware directly offers potentially massive speed gains over running software on generic CPUs, but there’s a trade-off in developer productivity, and FPGA skills are pretty thin on the ground. That might start to change as Moore’s law comes to an end and there’s more incentive to put in the extra effort. There are also good materials out there for self-study, so people can pick up the skills. I also hope that FPGAs become more accessible from a tools perspective, as there’s nothing better than a keen hobbyist community to drive forward what happens next in industry – just look at what the Arduino and Raspberry Pi have enabled.
 ‘The use of field-programmable gate arrays for the hardware acceleration of design automation tasks’ seems to be the main paper that emerged from his research (pdf download).
 From building line speed network traffic analysis tools
 NANDs are used because every type of digital circuit can be made up from them, and a NAND can be made with just a handful of transistors (four for a two-input CMOS NAND). The other universal option is NOR.
 If I recall correctly we used schematic tools rather than an HDL.
 My colleagues at York actually learned Ada rather than C, a peculiar anomaly of the time (the DoD Ada mandate was still alive) and place (York created one of the original Ada compilers, and the Computer Science department was chock full of Ada talent).
 It’s a shame. My generation, the 8-bit generation, pretty much grew up learning computers and programming from first principles because the first machines we had were so basic. Subsequent generations have learned everything on top of vast layers of abstraction, often with little understanding of what’s happening underneath.
 Bank of England paper ‘The race to zero’ (pdf)
Filed under: technology
Tags: FPGA, HDL, Nand2tetris, programming, skills, speed, Verilog, VHDL