David Clarke
School of Computer Science and Informatics
University College Dublin
Dublin, Ireland
Traditional data-parallel scientific applications, designed for heterogeneous clusters of workstations, employ data partitioning algorithms based on overly simplistic performance models. These algorithms assume that the absolute speed of a processor does not depend on the size of the computational task, an assumption that has proved inaccurate for modern, highly heterogeneous HPC platforms. The class of data partitioning algorithms based on the functional performance model (FPM), which represents the speed of a processor as a function of problem size, has been shown to be more accurate. These functional models are built empirically by measuring the execution time of the core computational kernel of an application; they are therefore both platform and application specific. Depending on the requirements of the application user and the computational platform, the models may be built in full at compile time or dynamically at application run time. The partitioning algorithm takes these FPMs and computes a distribution such that all devices complete their workload in the same time. In this presentation we will cover: two FPM-based partitioning algorithms; a method for using these algorithms to partition a matrix so that the volume of communication is minimised; a hierarchical partitioning algorithm for a heterogeneous cluster of CPU+GPU nodes; and a 2D partitioning algorithm for applications that require two partitioning parameters to distribute the workload across a two-dimensional grid of processors.
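To illustrate the balancing condition behind FPM-based partitioning — choose workloads d_i so that every device finishes at the same time t, with speed a function of problem size — the following is a minimal sketch, not one of the algorithms from the talk. The speed functions, the nested-bisection solver, and all numeric values are illustrative assumptions; it only assumes that per-device execution time w/s(w) grows monotonically with workload w.

```python
def time_for(speed, w):
    """Execution time for workload w on a device with speed function s(w)."""
    return w / speed(w) if w > 0 else 0.0

def workload_at_time(speed, t, w_max):
    """Largest workload w with w/speed(w) <= t (time assumed increasing in w)."""
    lo, hi = 0.0, w_max
    for _ in range(60):  # bisection on the workload
        mid = (lo + hi) / 2
        if time_for(speed, mid) <= t:
            lo = mid
        else:
            hi = mid
    return lo

def partition(speeds, n):
    """Split total workload n so all devices finish at (nearly) the same time."""
    # Outer bisection on the common completion time t: total assigned
    # workload grows with t, so we search for the t that consumes exactly n.
    lo, hi = 0.0, max(time_for(s, n) for s in speeds)
    for _ in range(60):
        t = (lo + hi) / 2
        total = sum(workload_at_time(s, t, n) for s in speeds)
        if total < n:
            lo = t
        else:
            hi = t
    parts = [workload_at_time(s, (lo + hi) / 2, n) for s in speeds]
    scale = n / sum(parts)  # absorb residual bisection error
    return [p * scale for p in parts]

# Hypothetical speed functions: a fast device whose speed degrades with
# problem size, and a slower device with constant speed.
speeds = [lambda w: 100.0 / (1.0 + w / 5000.0),
          lambda w: 40.0]
parts = partition(speeds, 10000)
```

Under a constant-speed model the faster device would simply receive proportionally more work; the point of the functional model is that the first device's share shrinks as its speed degrades at larger problem sizes.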