I've been doing some parallel number crunching lately using Java 1.6 (1.6.0_17) on a MacPro (4,1) with a 2.93 GHz quad-core Nehalem processor. This weekend I decided to brush the dust off my C programming skills and re-implement the core of my Java code in C using Apple's Grand Central Dispatch (GCD) technology to see if I could get a dramatic speed boost. The results were both better and worse than expected.
The number crunching in question was the examination of data from a pseudo-random number generator (PRNG). Specifically, the PRNG output was interpreted as 30,000 triplets of 32-bit signed integers. The triplets were treated as points in 3-space, and the number crunching consisted of the search for every point coplanar with every possible plane those points could define. It's been done before, of course, and that work found weaknesses in PRNGs like multiplicative congruential generators. For my own edification, I wanted to look for the same weaknesses in some other PRNGs. I have yet to complete a single run, however, so those who've done it in the past either used better techniques than my brute-force approach, had more than my roughly 23 GHz of aggregate clock speed to work with, or were very patient. Maybe some even used worse techniques that were faster. I don't know. In any case, I've been giving the brute-force approach a shot.
My implementation identifies coplanarity as discussed in Wolfram MathWorld: by computing the determinant of a 4×4 matrix. If the determinant is zero, the fourth point is coplanar with the three points that define the plane. The determinant computation includes a lot of multiplies that can be precomputed, because the three points that define the plane don't change across each 30,000-point test, and my code (Java and C) dutifully precomputes the lot. [Some future version of the C code may be able to bring the CPU's vector processing unit, or the GPU's resources, to bear on this problem, but I've never tried vector processing, so that's a challenge I'll leave for another time.]
With that out of the way, here are the results for checking all 30,000 points for coplanarity with each of the first 5,999,600 planes. As you'd expect, lower test times are better.
| Implementation | Environment | Time |
| --- | --- | --- |
| Java 1.6 (64-bit) | Mac OS X command line | 811 secs. |
| Java 1.6 (64-bit) | IntelliJ IDEA 7.0.5 console | 814 secs. |
| C & GCD (64-bit) | Mac OS X command line | 658 secs. |
| C & GCD (64-bit) | Xcode 3.2.1 debugger console | 657 secs. |
| C & GCD (64-bit) | Xcode 3.2.1 debugger console with GuardMalloc 18 enabled | 1,814 secs. |
More details: The C compiler was the one supplied with the current Xcode tools, and the debugger identified itself as GNU gdb 6.3.50-20050815 (Apple version gdb-1346), configured as "x86_64-apple-darwin". The Java runtime version was 1.6.0_17-b04-248-10M3025 and the virtual machine was "Java HotSpot(TM) 64-Bit Server VM" – in other words, the current version of Java supplied by Apple for Mac OS X 10.6.2.
The results: Running on a quiescent machine, both the Java and C/GCD applications were able to push the CPU utilization to roughly 795% and keep it there at all times (the "Hyper-Threading" feature of the Nehalem cores enables each of the four cores to act like two, so 800% was the theoretical maximum in this case). In general, Java was 23% slower than C/GCD. Personally, I'd expected one of two outcomes: either Java would closely approximate or even slightly exceed C/GCD's performance (because the Java just-in-time compiler can optimize and natively compile the code for the specific hardware it finds itself running on at any given time, whereas the C compiler may have to make more generalized assumptions), or C/GCD would beat Java by many times. Neither outcome manifested, though the first was closer to the truth.
I think that the actual results can be interpreted in a couple of ways: (1) "C with GCD is meaningfully faster than Java for this sort of number crunching", or (2) "Java is a reasonable choice for this sort of number crunching." I lean toward the second interpretation, for two reasons. First, the Java code was, in my opinion, much cleaner, clearer and more re-usable than the C code. Second, while it crunches numbers, Java is also doing a lot of helpful work for you – work that can affect the reliability of your code and/or the integrity of its results, like ensuring memory integrity and runtime type-correctness. That's work that C mostly doesn't do at all, or does only at great cost: running the C version of the program with GuardMalloc enabled took 2.24 times longer than the Java program.
Those issues aside, it must be said that, for this comparatively simple program, Apple's Grand Central Dispatch and its extensions to the C language worked very well, and were easy to use. I still find the new block syntax difficult, but past experience with closures, and the examples in Apple's documentation, were enough to let me produce the code I needed, in spite of that. Whether the GCD language extensions, or the GCD API, provide as complete a set of services for concurrent programming as the Java language and its "java.util.concurrent" package, I don't know. The general impression I took away from the GCD documentation was that it provides fewer features to aid the development of concurrent software than does Java, but a day's work with GCD isn't sufficient to let me judge.
To conclude: Compared to C, Java's not half bad for the kind of number crunching I'm currently doing (in fact, it's somewhere between merely 23% bad and 224% good); and Apple's Grand Central Dispatch is an interesting and useful extension to the C language.