Saturday, April 19, 2008

JDK 7 - Fork-join and ParallelArrays

A few weeks ago, I had the opportunity of attending an interesting presentation by Sun's Brian Goetz on the new concurrency features in JDK 7 which will allow applications to take advantage of idle CPU cores. Here is a summary of how this works:

In the recent past we have stopped seeing increases in CPU clock rates. We have started getting more cores per chip rather than faster cores. This brings new challenges in programming. We need to learn the art of decomposing our applications into units of work that can be executed in parallel over multiple cores.

Enter the fork-join framework - a library to provide better support for fine-grained parallelism. The most promising candidates for parallelism are sorting and searching. Sorting can be parallelised with merge sort. Searching can be parallelised by searching sub-regions of the data in parallel and then merging the results. This is an example of divide-and-conquer, in which a problem is broken down into sub-problems which are solved and the results combined. In merge sort, we divide the data set into pieces, sort the pieces and merge the results. Result is still O(n log n), but sub-problems can be solved in parallel. We can apply this process recursively, until sub problems are small enough that a sequential solution would be faster.

Divide-and-conquer algorithms take this general form

Result solve(Problem problem) {
if (problem.size < SEQUENTIAL_THRESHOLD)
return solveSequentially(problem);
else {
Result left, right;
left = solve(extractLeftHalf(problem));
right = solve(extractRightHalf(problem));
return combine(left, right);

The INVOKE-IN-PARALLEL step creates two or more new tasks (fork) and suspends the current task until the new tasks complete (join). Naive implementations would create a new thread for each task and then invoke Thread.join() for the join operation. But thread creation is expensive and requires O(log n) idle threads. A non-naive implementation is available in package java.util.concurrent.forkjoin in JDK 7 or you can download jsr166y.

The fork-join framework is implemented using work stealing. You create a limited number of worker threads (best choice is Runtime.availableProcessors()). Each worker maintains a double-ended work queue (deque). When forking, worker pushes the new task at the head of its deque. When waiting or idle, worker pops a task off the head of its deque and executes it instead of sleeping. If worker's deque is empty, steals an element off the tail of the deque of another randomly chosen worker. This means that there is never any contention for head and almost never contention for tail. Stealing is infrequent - because order is LIFO, when a worker steals, it generally steals a big chunk which will keep it from having to steal again for a while.

Then pool.invoke() is called, task is placed on a random deque. That worker executes the task and usually just pushes two more tasks onto its deque. It then starts on one of the subtasks. Soon some other worker steals the other top-level subtask. Pretty soon, most of the forking is done and the tasks are distributed among the various work queues. Now the workers start on the sequential subtasks. If work is unequally distributed, corrected via stealing. The result is that there is reasonable load balancing, with no central coordination, little scheduling overhead and minimal synchronisation costs.

We can reduce the amount of boilerplate code in fork-join tasks, by using the ParallelArray classes which let you declaratively specify aggregate operations on data arrays. ParallelArrays use fork-join to efficiently execute on the available hardware.

Further reading:

No comments:

Post a Comment