Afshin Zafari
Division of Scientific Computing
Department of Information Technology
Uppsala University
Uppsala
Current high-performance computer systems used for scientific computing typically combine shared memory compute nodes in a distributed memory environment. Extracting high performance from these complex systems requires tailored approaches. Task-based parallel programming has been successful both in simplifying programming and in exploiting the available hardware parallelism. We have developed a task library for shared memory systems that performs well compared with other libraries. With the DuctTeip framework, we extend this approach to distributed memory architectures. We use a hierarchical decomposition of tasks and data to accommodate the different levels of the hardware. Our experiments with a distributed Cholesky factorization show that our framework performs at least twice as well as ScaLAPACK in both weak and strong scaling experiments.
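As a point of reference for the algorithm used in the experiments, the sketch below shows, in plain C++ and independent of any runtime (it does not use the DuctTeip API), how a blocked right-looking Cholesky factorization decomposes into per-block kernels. In a task-based framework, each kernel invocation would be submitted as a task whose dependencies follow from the blocks it reads and writes.

#include <algorithm>
#include <cmath>
#include <vector>

// Dense SPD matrix stored row-major; only the lower triangle is referenced.
struct Matrix {
    int n;
    std::vector<double> a;
    explicit Matrix(int n_) : n(n_), a(n_ * n_, 0.0) {}
    double& operator()(int i, int j) { return a[i * n + j]; }
};

// POTRF: unblocked Cholesky of the kb x kb diagonal block at (k0, k0).
static void potrf(Matrix& A, int k0, int kb) {
    for (int j = k0; j < k0 + kb; ++j) {
        double d = A(j, j);
        for (int k = k0; k < j; ++k) d -= A(j, k) * A(j, k);
        A(j, j) = std::sqrt(d);
        for (int i = j + 1; i < k0 + kb; ++i) {
            double s = A(i, j);
            for (int k = k0; k < j; ++k) s -= A(i, k) * A(j, k);
            A(i, j) = s / A(j, j);
        }
    }
}

// TRSM: overwrite the off-diagonal block A(i-block, k-block) with
// A(i-block, k-block) * L(k,k)^{-T}, where L(k,k) is the factored diagonal block.
static void trsm(Matrix& A, int i0, int ib, int k0, int kb) {
    for (int i = i0; i < i0 + ib; ++i)
        for (int j = k0; j < k0 + kb; ++j) {
            double s = A(i, j);
            for (int k = k0; k < j; ++k) s -= A(i, k) * A(j, k);
            A(i, j) = s / A(j, j);
        }
}

// GEMM/SYRK: trailing update A(i-block, j-block) -= A(i-block, k-block) * A(j-block, k-block)^T.
static void update(Matrix& A, int i0, int ib, int j0, int jb, int k0, int kb) {
    for (int i = i0; i < i0 + ib; ++i)
        for (int j = j0; j < j0 + jb && j <= i; ++j) {  // lower triangle only
            double s = 0.0;
            for (int k = k0; k < k0 + kb; ++k) s += A(i, k) * A(j, k);
            A(i, j) -= s;
        }
}

// Right-looking blocked Cholesky with block size bs; every kernel call
// below is the unit of work a task-based runtime would schedule.
void blocked_cholesky(Matrix& A, int bs) {
    for (int k = 0; k < A.n; k += bs) {
        int kb = std::min(bs, A.n - k);
        potrf(A, k, kb);                                    // factor diagonal block
        for (int i = k + kb; i < A.n; i += bs)              // panel solves
            trsm(A, i, std::min(bs, A.n - i), k, kb);
        for (int j = k + kb; j < A.n; j += bs)              // trailing-matrix update
            for (int i = j; i < A.n; i += bs)
                update(A, i, std::min(bs, A.n - i), j, std::min(bs, A.n - j), k, kb);
    }
}

Roughly, the hierarchical decomposition described in the abstract applies this kind of splitting at two levels: large blocks are distributed across nodes, and each node's shared memory runtime splits its blocks again into finer-grained tasks.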