Autotuning on a multicore CPU/manycore GPU system

(Direct link to animation)

Dynamics of a cylinder flow simulation using a vortex-based discretization and the fast multipole method (FMM). The computations run concurrently on a multicore CPU and a manycore GPU. Using such hybrid hardware optimally is complicated, and the approach tested here is based on autotuning for performance. Two parameters control the balance between work offloaded to the GPU and work kept on the CPU. To the left, in blue, is the number of levels in the FMM. A large number of levels means that less work is offloaded to the GPU. To the right, in green, is a variable theta that controls the multipole acceptance criterion (the theta criterion). A large value of theta, say, close to 1, means that a larger number of multipole coefficients is used, but also that the communication stencils become smaller. The autotuner continuously tests variations of these parameters and measures the benefit of accepting each suggested change. By also measuring how costly each parameter test is, the added cost of using the autotuner itself can be controlled.
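The accept-or-reject loop described above can be illustrated with a small toy sketch. This is not the algorithm from the paper cited under References. It only mimics the two ideas in the text: a suggested parameter change is kept when the measured time improves, and the time lost on unsuccessful tests is tracked so that testing pauses once it exceeds a chosen fraction of the total run time. The class `Autotuner`, the cost model `toy_step_time`, and the 5% budget are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Autotuner:
    """Toy one-parameter autotuner with a budget on its own overhead."""
    value: int                    # current setting, e.g. number of FMM levels
    lo: int                       # smallest admissible setting
    hi: int                       # largest admissible setting
    max_overhead: float = 0.05    # tolerated fraction of time lost on failed tests
    best_time: float = float("inf")
    total_time: float = 0.0
    wasted_time: float = 0.0      # extra time spent on tests that were slower
    direction: int = 1            # which neighbour to test next (+1 or -1)

    def propose(self) -> int:
        """Return the setting to use for the next time step."""
        if self.best_time == float("inf"):
            return self.value     # no baseline measurement yet
        if self.wasted_time > self.max_overhead * self.total_time:
            return self.value     # overhead budget exceeded: skip testing
        candidate = self.value + self.direction
        if candidate < self.lo or candidate > self.hi:
            self.direction = -self.direction
            candidate = self.value + self.direction
        return min(max(candidate, self.lo), self.hi)

    def report(self, used: int, elapsed: float) -> None:
        """Record the measured time of a step run with setting `used`."""
        self.total_time += elapsed
        if used == self.value:
            self.best_time = elapsed          # refresh the baseline
        elif elapsed < self.best_time:
            self.value, self.best_time = used, elapsed   # accept the change
        else:
            self.wasted_time += elapsed - self.best_time  # pay for the test
            self.direction = -self.direction              # try the other side next


def toy_step_time(levels: int) -> float:
    """Hypothetical cost model standing in for one timed FMM evaluation."""
    return 1.0 + 0.1 * (levels - 6) ** 2


def run(steps: int) -> Autotuner:
    """Drive the tuner for a number of simulated time steps."""
    tuner = Autotuner(value=3, lo=2, hi=10)
    for _ in range(steps):
        levels = tuner.propose()
        tuner.report(levels, toy_step_time(levels))
    return tuner
```

In a real simulation the call to `toy_step_time` would be replaced by a wall-clock measurement of the actual time step, and a second tuner instance (or a joint search) would be needed for theta. The sketch keeps the two bookkeeping quantities separate on purpose: `best_time` decides whether a change is accepted, while `wasted_time` decides whether testing is currently affordable.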

References

The fast multipole method employed in the simulations was detailed in S. Engblom: On well-separated sets and fast multipole methods, in Appl. Numer. Math. 61(10):1096--1102, 2011: (doi)

The GPU implementation itself was described in A. Goude and S. Engblom: Adaptive fast multipole methods on the GPU, in J. Supercomput. 63(3):897--918, 2013: (doi)

The idea of autotuning and the algorithm used are worked out in M. Holm, S. Engblom, A. Goude, S. Holmgren: Dynamic autotuning of adaptive fast multipole methods on hybrid multicore CPU & GPU systems, in SIAM J. Sci. Comput. 36(4):C376--C399, 2014: (doi)

Stefan Engblom

Last modified: Mon Dec 29 13:53:56 MEST 2015