I've been doing some parallel number crunching lately using Java 1.6 (1.6.0_17) on a MacPro (4,1) with a 2.93 GHz quad-core Nehalem processor. This weekend I decided to brush the dust off my C programming skills and re-implement the core of my Java code in C using Apple's Grand Central Dispatch (GCD) technology to see if I could get a dramatic speed boost. The results were both better and worse than expected.
The number crunching in question was the examination of data from a pseudo-random number generator (PRNG). Specifically, the PRNG output was interpreted as 30,000 triplets of 32-bit signed integers. The triplets were treated as points in 3-space, and the number crunching consisted of the search for every point coplanar with every possible plane those points could define. This kind of analysis has been done before, of course, and has found weaknesses in PRNGs like multiplicative congruential generators. For my own edification, I wanted to look for the same weaknesses in some other PRNGs. I have yet to complete a single run, however, so those who've done it in the past either used better techniques than my brute-force approach, had more than my 23 GHz to work with, or were very patient. Maybe some even used worse techniques that were faster. I don't know. In any case, I've been giving the brute-force approach a shot.
My implementation identifies coplanarity as discussed in Wolfram MathWorld: by computing the determinant of a 4×4 matrix. If the determinant is zero, the four points are coplanar. The determinant computation includes a lot of multiplications that can be precomputed, because the three points that define the plane don't change for each 30,000-point test, and my code (Java and C) dutifully precomputes the lot. [Some future version of the C code may be able to bring the CPU's vector processing unit, or the GPU's resources, to bear on this problem, but I've never tried vector processing, so that's a challenge I'll leave for another time.]
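The core of that test can be sketched in a few lines of C. This is my own minimal version, not the post's actual code: the struct and function names are mine, and expanding the MathWorld 4×4 determinant along its column of ones reduces it to a scalar triple product of difference vectors. Since a product of three 32-bit differences can exceed 64 bits, the sketch uses `__int128` (a GCC/Clang extension) in place of whatever overflow handling the real code uses.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { int32_t x, y, z; } Point3;

/* True when p4 lies on the plane defined by p1, p2, and p3.
 * Computes the scalar triple product (p2-p1) . ((p3-p1) x (p4-p1)),
 * which equals the MathWorld 4x4 determinant expanded along its
 * column of ones.  __int128 keeps the intermediate products exact. */
static bool coplanar(Point3 p1, Point3 p2, Point3 p3, Point3 p4) {
    __int128 ax = (__int128)p2.x - p1.x;
    __int128 ay = (__int128)p2.y - p1.y;
    __int128 az = (__int128)p2.z - p1.z;
    __int128 bx = (__int128)p3.x - p1.x;
    __int128 by = (__int128)p3.y - p1.y;
    __int128 bz = (__int128)p3.z - p1.z;
    __int128 cx = (__int128)p4.x - p1.x;
    __int128 cy = (__int128)p4.y - p1.y;
    __int128 cz = (__int128)p4.z - p1.z;
    __int128 det = ax * (by * cz - bz * cy)
                 - ay * (bx * cz - bz * cx)
                 + az * (bx * cy - by * cx);
    return det == 0;
}
```

The precomputation the post mentions would fall out naturally here: for a fixed plane, the `a` and `b` difference vectors and their cross product are constant across all 30,000 candidate points.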
With that out of the way, here are the results for checking all 30,000 points for coplanarity with each of the first 5,999,600 planes. As you'd expect, lower test times are better.
| Language | Environment | Time |
|---|---|---|
| Java 1.6 (64-bit) | Mac OS X command line | 811 secs. |
| Java 1.6 (64-bit) | IntelliJ IDEA 7.0.5 console | 814 secs. |
| C & GCD (64-bit) | Mac OS X command line | 658 secs. |
| C & GCD (64-bit) | Xcode 3.2.1 debugger console | 657 secs. |
| C & GCD (64-bit) | Xcode 3.2.1 debugger console with GuardMalloc 18 enabled | 1,814 secs. |
More details: The C toolchain was the one supplied with the current Xcode tools; the debugger was GNU gdb 6.3.50-20050815 (Apple version gdb-1346), configured as "x86_64-apple-darwin". The Java runtime version was 1.6.0_17-b04-248-10M3025, and the virtual machine was "Java HotSpot(TM) 64-Bit Server VM" – in other words, the current version of Java supplied by Apple for Mac OS X 10.6.2.
The results: Running on a quiescent machine, both the Java and C/GCD applications were able to push CPU utilization to roughly 795% and keep it there at all times (the "Hyper-Threading" feature of the Nehalem cores enabled each of the four cores to act like two, so 800% was the theoretical maximum in this case). In general, Java was 23% slower than C/GCD. Personally, I'd expected either that Java would closely approximate or even slightly exceed C/GCD's performance (because the Java just-in-time compiler can optimize and natively compile the code for the specific hardware it finds itself running on at any given time, whereas the C compiler might have to make more generalized assumptions), or that C/GCD would beat Java by many times. Neither outcome manifested, though the first was closer to the truth.
I think the actual results can be interpreted in a couple of ways: (1) "C with GCD is meaningfully faster than Java for this sort of number crunching", or (2) "Java is a reasonable choice for this sort of number crunching." I lean toward the second interpretation, for two reasons. First, the Java code was, in my opinion, much cleaner, clearer, and more re-usable than the C code. Second, Java does a lot of extra work for you while it is crunching numbers – helpful work, like ensuring memory integrity and runtime type-correctness, that can affect the reliability of your code and the integrity of its results. That's work C mostly doesn't do at all, or does very badly: running the C version of the program with GuardMalloc enabled took 2.24 times longer than the Java program.
Those issues aside, it must be said that, for this comparatively simple program, Apple's Grand Central Dispatch and its extensions to the C language worked very well and were easy to use. I still find the new block syntax difficult, but past experience with closures, plus the examples in Apple's documentation, was enough to let me produce the code I needed. Whether the GCD language extensions and API provide as complete a set of services for concurrent programming as the Java language and its "java.util.concurrent" package, I don't know. The general impression I took away from the GCD documentation was that it provides fewer features to aid the development of concurrent software than Java does, but a day's work with GCD isn't sufficient to let me judge.
To conclude: Compared to C, Java's not half bad for number crunching as I'm currently doing it (in fact, it's somewhere between merely 23% bad and 224% good); and Apple's Grand Central Dispatch is an interesting and useful extension to the C language.
Any chance you could post the source? There might be some corners to optimize.
Since there's interest, yes, I'll make all the source available.
It would have been good if you had provided a glimpse of the parallel algorithm you've used.
Many questions come to mind. How is the system configured for parallelism? In the case of Java, how many "worker" threads are you using? In GCD, are you dispatching each generated triplet onto a queue for execution? Are you using blocks? I'm wondering if you intend to share the GCD-based code involved.
Also, your program seems to have two phases, one involving generation of the points and another involving the calculation of coplanar properties. Admittedly, the second phase is more compute-intensive. However, both phases can be parallelized separately, pipelined, or interleaved. Which of these approaches does your program use?
Other similar questions exist, and they can affect the interpretation of the results. So it would be good to share more about the implementation.
Masood,
I made the source code available in my next posting a few days later. Take a look. It's very simple and should answer your questions.
To answer a few of your questions explicitly:
I used 8 worker threads in Java, for two reasons: (1) it matches what I know about the architecture of my machine, and (2) it's also the value that “Runtime.availableProcessors” returned.
In GCD, I used “dispatch_group_create” to create a group for each (slightly fewer than) 30,000-point coplanarity check. I then dispatched work to the group using “dispatch_group_async”. Once all ≈30,000 coplanarity checks had been dispatched, I called “dispatch_group_wait(Group, DISPATCH_TIME_FOREVER)” and, as soon as that completed, “dispatch_release(Group)”.
As to the two phases of the program, they are kept distinct. In the first phase, all 30,000 3-space points are acquired by reading 90,000 32-bit, big-endian integers from a pre-generated data file. That takes no time at all, so no effort was made to parallelize that phase (premature optimization). Then the second phase, the coplanarity scanning, begins. That's what takes the time.
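The first phase – reading 32-bit big-endian integers – can be sketched portably; the helper names here are mine, and the post doesn't show the real code's I/O strategy:

```c
#include <stdint.h>
#include <stdio.h>

/* Assemble a signed 32-bit integer from four big-endian bytes.
 * Works regardless of the host machine's endianness. */
static int32_t be32(const unsigned char b[4]) {
    return (int32_t)((uint32_t)b[0] << 24 | (uint32_t)b[1] << 16 |
                     (uint32_t)b[2] << 8  | (uint32_t)b[3]);
}

/* Read up to n big-endian 32-bit integers from fp into out;
 * returns the number actually read. */
static size_t read_be32(FILE *fp, int32_t *out, size_t n) {
    unsigned char buf[4];
    size_t i;
    for (i = 0; i < n && fread(buf, 1, 4, fp) == 4; i++)
        out[i] = be32(buf);
    return i;
}
```

With 90,000 integers per file, a single sequential pass through `read_be32` is cheap enough that, as noted above, parallelizing it would gain nothing.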
Thanks Chris, and thank you for your further explanation of your strategy here.
I will certainly take a look.