1. Why don't we get perfect speedup when running a parallel program with OpenMP on a distributed shared memory computer? Give at least five reasons for this (the parallel overhead) and explain each of them. ("Parallel overheads in OpenMP.") Suggested solution.
a) Serial sections (non-parallelized parts) in the code. Only one thread is doing any work - the others are idle.
[ This can be seen as an extreme version of load imbalance. See c) below! ]
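The effect of a serial section can be quantified with Amdahl's law (an illustration of ours, not part of the original answer): a serial fraction f limits the speedup on p threads to 1 / (f + (1 - f) / p).

```c
#include <assert.h>

/* Illustration (not from the solution text): Amdahl's law shows how a
   serial fraction f caps the achievable speedup on p threads. */
double amdahl_speedup(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}
```

For example, with a 10% serial fraction, even an unbounded number of threads cannot give more than 10x speedup.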
b) Synchronization. Implicit / explicit. Parallel for, single, sections. Barrier, taskwait, locks.
Extra overhead because of synchronized writes for all threads.
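As a hypothetical illustration of this cost (the example is ours): a critical section inside a loop forces every single update to synchronize, while a reduction clause lets each thread accumulate privately and combine only once per thread at the end.

```c
#include <assert.h>

/* Every iteration enters the critical section: heavy synchronization. */
long sum_critical(int n) {
    long sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical      /* synchronizes on every iteration */
        sum += i;
    }
    return sum;
}

/* Each thread sums privately; only one combine per thread at the end. */
long sum_reduction(int n) {
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += i;
    return sum;
}
```

Both versions compute the same result; they differ only in how often the threads must synchronize.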
c) Load imbalance. (A lot) more work is done by one thread compared to the others.
The time to complete the program's computations depends on the thread with the heaviest work load.
d) "Communication" - cache misses due to:
i) True sharing - data in my thread's cache line is updated by another thread, which invalidates the line in my cache.
ii) False sharing - my thread's cache line is updated and invalidated by another thread even though my data in the cache line remains unchanged. (In that sense the invalidation is "false", but it still results in a cache miss.)
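The standard cure for false sharing can be sketched as follows (a minimal example of ours, assuming a 64-byte cache line): per-thread counters packed next to each other would share one cache line, so each counter is padded out to a full line.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define NTHREADS 4
#define CACHELINE 64   /* assumed cache-line size */

/* Padding puts each thread's counter on its own cache line, so one
   thread's increments no longer invalidate the others' lines. */
struct padded { long value; char pad[CACHELINE - sizeof(long)]; };

long count_padded(int n) {
    struct padded cnt[NTHREADS] = {{0}};
    #pragma omp parallel for num_threads(NTHREADS)
    for (int i = 0; i < n; i++) {
        int id = 0;
        #ifdef _OPENMP
        id = omp_get_thread_num();
        #endif
        cnt[id].value++;          /* each thread writes only its own line */
    }
    long total = 0;
    for (int t = 0; t < NTHREADS; t++)
        total += cnt[t].value;
    return total;
}
```

The result is unchanged by the padding; only the cache-line invalidation traffic disappears.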
e) Data locality in Non-Uniform Memory Access (NUMA). "Nearly all CPU architectures use a small amount of very fast non-shared memory known as cache to exploit locality of reference in memory accesses. With NUMA, maintaining cache coherence across shared memory has a significant overhead." (https://en.wikipedia.org/wiki/Non-uniform_memory_access)
Data may be stored in another thread's memory (by first touch), which results in expensive remote memory accesses (and hence contributes to the overhead when using OpenMP).
"What matters, therefore, is the memory policy in effect when the allocation occurs [malloc]. This is called the first touch." (http://queue.acm.org/detail.cfm?id=2513149)
f) Fork-join. It takes time to create and terminate threads. Cache locality can get lost when threads are re-scheduled.
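The fork-join cost can be made visible in a sketch (our own example): forking and joining a thread team on every time step pays the thread-management overhead once per step, whereas one enclosing parallel region pays it once in total, leaving only the cheap implicit barrier of the worksharing loop per step.

```c
#include <assert.h>

#define N 1000

/* Forks and joins a team on every time step: overhead * nsteps. */
void step_many_regions(double *a, int nsteps) {
    for (int s = 0; s < nsteps; s++) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] += 1.0;
    }
}

/* One enclosing parallel region: the team is created once; the implicit
   barrier of the omp for still separates the time steps correctly. */
void step_one_region(double *a, int nsteps) {
    #pragma omp parallel
    for (int s = 0; s < nsteps; s++) {
        #pragma omp for
        for (int i = 0; i < N; i++)
            a[i] += 1.0;
    }
}
```

Keeping the same team alive also helps cache locality, since threads are not re-scheduled between steps.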
g) When code is restructured to fit the demands of parallelization, extra computations may become necessary. (Compare with assignment 3, 2016, where the serial algorithm was restructured to allow parallelization. A red-black grid was introduced, which resulted in a need for more iterations and therefore a slow-down of the overall program.) Source: http://user.it.uu.se/~hesc0353/paral16/2016-03-01-OpenMP.pdf, pp 17-21.
2. For each of the parallel overheads above (problem 1), describe or discuss techniques for how the overheads can be reduced. Suggested solution - techniques to remedy the problems listed above.
a) Serial sections (non-parallelized parts) in the code. Only one process/thread is doing any work while the others are idle. Remedy: Overlap serial sections (ordered/master/single/critical) with parallel activities so that there are fewer or no idle threads.
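One such overlap can be sketched with `single nowait` (a minimal example of ours): exactly one thread runs the serial bookkeeping, and because of the nowait the other threads proceed straight to the independent loop instead of idling behind it.

```c
#include <assert.h>

#define N 100
int serial_flag;
int results[N];

/* One thread does the serial part; the rest start the loop immediately. */
void overlapped(void) {
    serial_flag = 0;
    #pragma omp parallel
    {
        #pragma omp single nowait
        serial_flag = 1;          /* serial bookkeeping, one thread only */

        #pragma omp for
        for (int i = 0; i < N; i++)
            results[i] = 2 * i;   /* independent parallel work */
    }
}
```

This is safe only because the loop does not depend on the serial part; otherwise the nowait must be dropped.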
b) Synchronization. Implicit / explicit. Parallel for, single, sections. Barrier, taskwait, locks.
Extra overhead because of synchronized writes for all threads. Remedy: Minimize load imbalance, so that threads spend less time waiting at barriers.
Analyze and remove implicit barriers between independent loops, using nowait.
Use large parallel regions and avoid fork-join synchronization (also good for cache performance, since threads are not re-scheduled).
Overlap activities to remove barriers (iterative solver).
Use locks to remove global barriers (LU factorization).
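The nowait idiom can be sketched as follows (arrays a and b are our own example): when two loops use the identical schedule(static) over the same iteration count, OpenMP guarantees each thread is assigned the same iterations in both, so the thread reading a[i] is the one that wrote it and the implicit barrier after the first loop can be dropped.

```c
#include <assert.h>

#define N 1000
double a[N], b[N];

void two_loops(void) {
    #pragma omp parallel
    {
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < N; i++)
            a[i] = (double)i;

        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * a[i];    /* same i handled by the same thread */
    }
}
```

If the schedules differed (or were dynamic), the nowait would introduce a race and the barrier would be required.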
c) Load imbalance. (A lot) more work is done by one thread compared to the others.
The time to complete the program's computations
depends on the thread with the heaviest work load. Remedy: Assign chunks to threads - dynamic or guided scheduling for good load balance.
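A hypothetical triangular workload (the example is ours) shows the idea: the cost of row i grows with i, so schedule(static) would overload the thread that owns the last rows, while schedule(dynamic, chunk) hands out small chunks on demand so threads that finish early keep taking work.

```c
#include <assert.h>

#define N 200

/* Cost of row i grows linearly with i: a deliberately imbalanced load. */
long row_work(int i) {
    long s = 0;
    for (int j = 0; j <= i; j++)
        s += j;
    return s;
}

/* dynamic scheduling: chunks of 8 rows are dealt out as threads free up. */
long total_work(void) {
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (int i = 0; i < N; i++)
        total += row_work(i);
    return total;
}
```

The chunk size trades scheduling overhead against balance: tiny chunks balance best but cost more bookkeeping.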
d) "Communication" - cache misses due to:
i) True sharing - data in my thread's cache line is updated by another thread, which invalidates the line in my cache.
ii) False sharing - my thread's cache line is updated and invalidated by another thread even though my data in the cache line remains unchanged. (In that sense the invalidation is "false", but it still results in a cache miss.) Remedy: Use static scheduling for good data locality (cache), and pad or privatize data so that different threads do not write to the same cache line.
e) Data locality in Non-Uniform Memory Access (NUMA). Remedy: Avoid serial initialization on NUMA. Instead, use explicit load balancing supplied by the user, e.g. Loadbal(lowerbound, upperbound, nthr) to fill per-thread iteration bounds, and then let each thread first-touch its own part of the data:
#pragma omp parallel private(id, j)
{
    id = omp_get_thread_num();
    for (j = lowerbound[id]; j < upperbound[id]; j++)
        A[j] = INIT(j);      /* first touch: pages end up near thread id */
    #pragma omp barrier
    for (j = lowerbound[id]; j < upperbound[id]; j++)
        WORK(A[j]);          /* same ranges as the init: accesses stay local */
}
3. Gram-Schmidt algorithm ... (yada, yada, yada) Suggested solution. ( We don't expect this on the exam. )
4. In the course we have been discussing different metrics used for evaluating parallel algorithms and programs.
Describe and explain these different metrics. Suggested solution.
Yet to be written!
5. What is collective communication in MPI and what is characteristic for a collective communication call?
Give at least five examples of collective communication operations and explain their effect. Suggested solution.
Yet to be written!
6. Assume that we have a set of heterogeneous multi-core nodes with different numbers of cores within different nodes, but all cores over all nodes are identical. The nodes are connected with some kind of interconnect. This gives us a heterogeneous distributed (local address space) memory parallel computer. We can use MPI to communicate between the nodes and OpenMP to parallelize over the cores within a node. Assume that we want to do parallel matrix-matrix multiplication on this parallel machine when we have three nodes with 6 cores, 2 cores, and 4 cores, respectively. The matrix sizes are 1200x1200. Construct an efficient parallel algorithm for this problem using all cores and nodes. Suggested solution.
Yet to be written!