Henrik Löf
Division of Scientific Computing and Division of Computer Systems
Department of Information Technology
Uppsala University
Efficient solutions of partial differential equations require a match
between the algorithm and the underlying architecture. The new chip-
multiprocessors, CMPs (a.k.a. multicore), feature low intra-chip
communication cost and smaller per-thread caches compared to previous
systems. From an algorithmic point of view this means that data
locality issues become more important than communication overheads.
This may require re-evaluation of many existing algorithms.
We have investigated parallel implementations of multigrid methods
using a temporally blocked, naturally ordered, smoother
implementation. Compared with the standard multigrid solution based
on the two-color red-black algorithm, we improve the data locality
often as much as ten times, while our use of a fine-grained locking
scheme keeps the parallel efficiency high.
While our algorithm initially was inspired by CMPs, it was surprising
to see our OpenMP multigrid implementation run up to 40 percent
faster than the standard red-black algorithm on an 8-way SMP system.
Studying the smoother part of the algorithm in isolation often shows
it performing two iterations at the same time as a single iteration
with an ordinary red-black smoother. Running our smoother on a 32-
thread UltraSPARC T1 (Niagara) SMT/CMP and a simulated 32-way CMP
demonstrates the communication cost of our algorithm to be low on
such architectures.