CULA – work in progress

A big chunk of my research work, including my PhD, has been the development of the Markov fitness model (MFM). In short, this is a probabilistic model of the fitness function in an evolutionary algorithm. Originally it was used within an estimation of distribution algorithm to generate new solutions; more recently, I’ve looked at some other applications for it, such as using it as a fitness surrogate. (There’s another paper giving a lot more detail appearing soon in IEEE Transactions on Evolutionary Computation, if you’re interested.)

Anyway, one of the drawbacks of the MFM is that it’s quite computationally expensive to build. Fitting the model means solving a system of equations with a least-squares fit, which is at best O(n^3) in the number of unknowns. This has meant that we’ve only really been able to use it with fitness functions that are themselves quite costly to evaluate. One idea I’ve had is to use the CULA library to speed things up by offloading the work to the computer’s graphics card (GPU).
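
To make that concrete, here’s the shape of the computation, in illustrative notation rather than the exact formulation from the papers: fitting the model means solving a linear least-squares problem, which we do via singular value decomposition (SVD).

    minimise ||A w - f||^2 over the weight vector w
    A = U S V^T              (the SVD of the m-by-n design matrix A)
    w = V S^+ U^T f          (S^+ inverts the non-zero singular values)

Here A holds the model terms evaluated on the sampled solutions and f their observed fitness values; computing the SVD is the cubic-cost step that dominates the build time.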

Why is this a good idea? Well, solving a system of equations the way we do is essentially one big matrix operation, and matrix operations are quite easy to parallelise: the computation can be divided between several (or many) processor cores. Typical desktop CPUs currently have around 4-6 cores (up to 12 virtual cores with hyperthreading), but modern GPUs can have hundreds, being designed to work on millions of pixels at once. That means a big potential speed-up, though in practice it’s not that easy. Although there are more of them, GPU cores run at lower clock speeds than CPU cores (in pure clock terms, a few hundred MHz rather than a couple of GHz). More importantly, the data has to be copied into the graphics card’s memory before the GPU can work on it, and the results copied back into main system RAM afterwards, and that transfer is quite costly in itself (see the sketch below).
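
To give a flavour of that transfer cost, the round trip looks something like this, written against the CUDA runtime API that CULA builds on. It’s a minimal sketch: the function name is just illustrative, and the compute step is a placeholder.

    #include <cuda_runtime.h>
    #include <vector>

    // Copy data to the GPU, do some work there, and copy the result back.
    // host_out is assumed to be pre-sized to match host_in.
    void gpuRoundTrip(const std::vector<float>& host_in, std::vector<float>& host_out)
    {
        size_t bytes = host_in.size() * sizeof(float);
        float* dev = nullptr;

        cudaMalloc(&dev, bytes);                        // allocate GPU memory
        cudaMemcpy(dev, host_in.data(), bytes,
                   cudaMemcpyHostToDevice);             // copy in: a costly PCIe transfer

        // ... the actual computation runs here, on the device ...

        cudaMemcpy(host_out.data(), dev, bytes,
                   cudaMemcpyDeviceToHost);             // copy the results back: costly again
        cudaFree(dev);                                  // release GPU memory
    }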

I’ve been wanting to play around with the idea for some time, but I haven’t had a graphics card compatible with CULA, and couldn’t justify spending money on one just to see if it worked. Yesterday I realised that the temporary PC I’ve been given at work does have a suitable card, so I gave it a try. It’s an Nvidia GeForce 8400 GS: not exactly top-end for this purpose, with 16 cores running at 460MHz, but still worth a go.

Setup

It took me an age to get set up. CULA doesn’t seem to come with Java bindings, so I had to dig through the rusty bits of my brain to recover the little C++ I know. I won’t go into much detail, but I had a long argument with the C++ plugin for Eclipse (my normal development environment). Apparently there’s a weird issue where Cygwin (which provides C++ compilers) no longer supports Windows-style filenames/paths, while Eclipse refuses to work with anything else, which causes problems when you try to use an external library like CULA. Eventually I gave up and tried the CodeLite IDE, which worked ridiculously well in comparison. I ran a few examples and satisfied myself that CULA was working. I then spotted that one of the examples was set up for benchmarking, and that one of its tests was singular value decomposition (which is what we use for the least-squares fitting). The test compares a run using CULA against a run using the Intel MKL library (on the CPU only), over several matrix sizes.
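
For reference, calling the SVD through CULA’s LAPACK-style host interface looks roughly like the sketch below. The call mirrors LAPACK’s SGESVD (minus the workspace arguments), but check the exact signature against the cula.h shipped with your version.

    #include <cula.h>
    #include <cstdio>
    #include <vector>

    int main()
    {
        culaStatus status = culaInitialize();          // set up the GPU context
        if (status != culaNoError) {
            printf("CULA init failed: %s\n", culaGetStatusString(status));
            return 1;
        }

        int m = 1024, n = 1024;
        std::vector<culaFloat> a(m * n, 1.0f);         // column-major input matrix (overwritten)
        std::vector<culaFloat> s(n);                   // singular values, length min(m, n)
        std::vector<culaFloat> u(m * m), vt(n * n);    // left/right singular vectors

        // 'A'/'A' requests all columns of U and all rows of V^T
        status = culaSgesvd('A', 'A', m, n, a.data(), m,
                            s.data(), u.data(), m, vt.data(), n);
        if (status != culaNoError)
            printf("SGESVD failed: %s\n", culaGetStatusString(status));

        culaShutdown();                                // release the GPU
        return 0;
    }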

Results

Well, the results are somewhat disappointing!


-- SGESVD Benchmark --

  Size   CULA (s)   MKL (s)   Speedup
------   --------   -------   -------
  1024      14.78      3.44    0.2325
  1280      29.02      6.52    0.2248
  1536      47.05     10.60    0.2252
  1792      74.69     16.04    0.2148
  2048     112.48     21.96    0.1952
  2304     158.65     30.61    0.1929
  2560     212.97     38.59    0.1812
  2816     285.53     49.07    0.1719
  3072     367.10     76.47    0.2083
  3328     466.77     84.54    0.1811
  3584     587.30    104.99    0.1788
  3840     708.19    118.69    0.1676
  4096     867.12    162.01    0.1868

CULA takes a good bit longer to run than MKL, and the relative slowdown actually grows as the problem size increases. (The speedup column is the MKL time divided by the CULA time, so values below one mean CULA is slower.) However, as I hinted at above, this is a fairly low-end graphics card, and the slowdown is only around 4-6x. It depends on just how many cores the SVD operation can actually exploit, but this works well enough that it seems worth finding better hardware and giving it another go. After all, a current high-end card costs only a few hundred dollars and has over a thousand cores. Well, maybe I’ll start by seeing if there’s something a bit faster lying around the department.
