Uppsala University, Department of Information Technology, Scientific Computing
http://www.it.uu.se/

                            Exam, Programming of Parallel Computers, 2008-06-02

                                        © Henrik D Schulze & Joel Axelsson 2016. Suggested solutions.

1. Why don't we get perfect speedup when running a parallel program with OpenMP on a distributed shared memory computer?
Give at least five reasons for this (the parallel overhead) and explain each of them. ( "Parallel overheads in OpenMP"? )
Suggested solution.
 a) Serial sections (non-parallelized parts) in the code. Only one thread is doing any work - the others are idle.
   [ This can be seen as an extreme version of load imbalance. See c) below! ]
 b) Synchronization, implicit or explicit: parallel for, single and sections carry implicit barriers; barrier, taskwait and locks are explicit.
     Extra overhead because threads must wait for each other and for synchronized writes to shared data.
 c) Load imbalance. (A lot) more work is done by one thread compared to the others.
     The time to complete the parallel computation is determined by the thread with the heaviest workload.
 d) "Communication" - cache misses due to:
          i)  True sharing - my thread's data in a cacheline is updated by another thread, which invalidates the data in my cache.
          ii) False sharing - my thread's cacheline is updated and invalidated by another thread although my data in the
              cacheline remains unchanged. (In that sense the invalidation is "false", but it still results in a cache miss.
              A minimal code sketch of false sharing is given after this answer.)
 e) Data locality in Non-Uniform Memory Access (NUMA). "Nearly all CPU architectures use a small amount of very fast
     non-shared memory known as cache to exploit locality of reference in memory accesses. With NUMA, maintaining
     cache coherence across shared memory has a significant overhead." (https://en.wikipedia.org/wiki/Non-uniform_memory_access)
    Data ends up in another thread's local memory (because of first touch), which results in costly remote memory accesses
    (and hence contributes to overhead when using OpenMP).
    "What matters, therefore, is the memory policy in effect when the allocation occurs [malloc]. This is called the first touch."
    (http://queue.acm.org/detail.cfm?id=2513149)
 f) Fork-join. It takes time to create and terminate threads. Cache locality can get lost when threads are re-scheduled.
 g) When code is restructured to fit the demands of a parallelization, extra computations may become necessary.
    (Compare with assignment 3, 2016, where the serial algorithm was restructured to allow parallelization: a red-black grid was
     introduced, which required more iterations and therefore slowed down the overall program.)
Source: http://user.it.uu.se/~hesc0353/paral16/2016-03-01-OpenMP.pdf, pp 17-21.
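
   To illustrate d) ii) above, here is a minimal sketch (my own, not from the slides) in C/OpenMP: each thread updates
   only its own counter, but the counters lie next to each other in memory and therefore share cachelines, so every
   update invalidates the other threads' copies of that line.

        #include <omp.h>
        #include <stdio.h>

        #define NTHREADS 4

        int main(void)
        {
            long counters[NTHREADS] = {0};        /* adjacent elements -> same cacheline */

            #pragma omp parallel num_threads(NTHREADS)
            {
                int id = omp_get_thread_num();
                for (long i = 0; i < 100000000L; i++)
                    counters[id]++;               /* false sharing: the cacheline ping-pongs between cores */
            }

            long sum = 0;
            for (int t = 0; t < NTHREADS; t++)
                sum += counters[t];
            printf("%ld\n", sum);
            return 0;
        }

   Padding each counter to a full cacheline, or accumulating in a private variable, removes the false sharing
   (see the remedy for d) in problem 2 below).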

2. For each of the parallel overheads above (problem 1) describe or discuss techniques for how the overheads can be reduced.
Suggested solution - techniques to remedy the problems listed above.
    a) Serial sections (non-parallelized parts) in the code. Only one process/thread is doing any work while the others are idle.
   Remedy: Overlap serial sections (ordered/master/single/critical) with parallel activities so that there are fewer or no idle threads.
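
   A minimal sketch of this idea (my own, not from the slides), assuming hypothetical helpers write_log() and process():
   one thread handles the serial task inside the parallel region with single nowait, while the remaining threads go
   straight on with the loop.

        #include <omp.h>

        void write_log(void);            /* hypothetical serial task */
        void process(int i);             /* hypothetical independent work */

        void compute(int n)
        {
            #pragma omp parallel
            {
                #pragma omp single nowait
                write_log();             /* done by one thread, no barrier afterwards */

                /* a dynamic schedule lets the other threads pick up the iterations
                   that the logging thread would otherwise be holding on to */
                #pragma omp for schedule(dynamic, 64)
                for (int i = 0; i < n; i++)
                    process(i);
            }
        }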

     b) Synchronization, implicit or explicit: parallel for, single and sections carry implicit barriers; barrier, taskwait and locks are explicit.
         Extra overhead because threads must wait for each other and for synchronized writes to shared data.
   Remedy: Minimize load imbalance (so that threads wait less at barriers).
          Analyze and remove implicit barriers between independent loops, using nowait (see the sketch below).
          Use large parallel regions and avoid fork-join synchronization (also good for cache performance, threads are not re-scheduled).
          Overlap activities to remove barriers (iterative solver).
          Use locks to remove global barriers (LU factorization).
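
   A minimal sketch of the nowait point (my own, not from the slides): the two loops touch different arrays, so the
   implicit barrier between them can be removed safely.

        #include <omp.h>

        void scale(double *a, double *b, int n, double alpha, double beta)
        {
            #pragma omp parallel
            {
                #pragma omp for nowait   /* no barrier here: a and b are independent */
                for (int i = 0; i < n; i++)
                    a[i] *= alpha;

                #pragma omp for          /* keep this barrier (it coincides with the end of the region) */
                for (int i = 0; i < n; i++)
                    b[i] *= beta;
            }
        }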

    c) Load imbalance. (A lot) more work is done by one thread compared to the others.
         The time to complete the parallel computation is determined by the thread with the heaviest workload.
   Remedy: Assign chunks of iterations to threads - dynamic or guided scheduling gives good load balance (see the sketch below).
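
   A minimal sketch (my own, not from the slides), assuming a hypothetical expensive_work(i) whose runtime varies
   strongly with i: dynamic scheduling with a modest chunk size balances the load while keeping scheduling overhead low.
   schedule(guided) is the compromise that starts with large chunks and shrinks them towards the end.

        #include <omp.h>

        double expensive_work(int i);    /* hypothetical: cost varies strongly with i */

        double total(int n)
        {
            double sum = 0.0;
            #pragma omp parallel for schedule(dynamic, 8) reduction(+:sum)
            for (int i = 0; i < n; i++)
                sum += expensive_work(i);
            return sum;
        }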

     d) "Communication" - cache misses due to:
          i)  True sharing - my thread's data in a cacheline is updated by another thread, which invalidates the data in my cache.
          ii) False sharing - my thread's cacheline is updated and invalidated by another thread although my data in the
              cacheline remains unchanged. (In that sense the invalidation is "false", but it still results in a cache miss.)
   Remedy: Use static scheduling for good data locality (cache). Against false sharing: pad shared per-thread data to whole
           cachelines, or accumulate in private variables (e.g. with a reduction clause), as sketched below.
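
   A minimal sketch of the padding idea (my own, not from the slides): each thread's partial sum gets a whole cacheline
   of its own (64 bytes assumed), so updates by different threads never invalidate each other's lines. In practice a
   reduction clause is the idiomatic way to get the same effect.

        #include <omp.h>

        #define CACHELINE 64             /* assumed cacheline size in bytes */
        #define NTHREADS   8

        struct padded_sum { double val; char pad[CACHELINE - sizeof(double)]; };

        double sum_array(const double *a, long n)
        {
            struct padded_sum partial[NTHREADS] = {{0}};

            #pragma omp parallel num_threads(NTHREADS)
            {
                int id = omp_get_thread_num();
                #pragma omp for schedule(static)     /* static: good cache and NUMA locality */
                for (long i = 0; i < n; i++)
                    partial[id].val += a[i];
            }

            double total = 0.0;
            for (int t = 0; t < NTHREADS; t++)
                total += partial[t].val;
            return total;
        }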

     e) Data locality in Non-Uniform Memory Access (NUMA).
   Remedy: Avoid serial initialization on NUMA. Instead, let each thread first-touch its own part of the data, here with
        explicit load balancing supplied by the user (Loadbal is assumed to fill the arrays lowerbound/upperbound with
        per-thread index ranges):
        Loadbal(lowerbound, upperbound, nthr);
        #pragma omp parallel private(id, j)
        {
            id = omp_get_thread_num();
            for (j = lowerbound[id]; j < upperbound[id]; j++)
                A[j] = INIT(j);              /* first touch: pages land in this thread's local memory */
        #pragma omp barrier
            for (j = lowerbound[id]; j < upperbound[id]; j++)
                WORK(A[j]);                  /* same static distribution -> local accesses */
        }

 f) Fork-join. It takes time to create and terminate threads. Cache locality can get lost when threads are re-scheduled.
   Remedy: Don't create too many threads!
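
   A minimal sketch (my own, not from the slides): put one parallel region around the whole time loop instead of
   forking and joining a thread team in every time step; the implicit barrier of the inner for keeps the steps in order.

        #include <omp.h>

        void time_loop(double *a, long n, int nsteps)
        {
            #pragma omp parallel                     /* threads are created once ... */
            {
                for (int t = 0; t < nsteps; t++)     /* ... and reused for every step */
                {
                    #pragma omp for schedule(static)
                    for (long i = 0; i < n; i++)
                        a[i] = 0.5 * (a[i] + 1.0);   /* placeholder update */
                }
            }
        }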
Source: http://user.it.uu.se/~hesc0353/paral16/2016-03-01-OpenMP.pdf, pp 17-21.

3. Gram-Schmidt algorithm ... (yada, yada, yada)
Suggested solution. (We don't expect this on the exam.)

4. In the course we have been discussing different metrics used for evaluating parallel algorithms and programs.
Describe and explain these different metrics.
Suggested solution.
    Yet to be written!

5. What is collective communication in MPI and what is characteristic for a collective communication call?
Give at least five examples of collective communication operations and explain their effect.
Suggested solution.
    Yet to be written!

6. Assume that we have a set of heterogeneous multi-core nodes with different numbers of cores within different
 nodes, but all cores over all nodes are identical. The nodes are connected with some kind of interconnect. This gives
 us a heterogeneous distributed-memory (local address space) parallel computer. We can use MPI to communicate
 between the nodes and OpenMP to parallelize over the cores within a node. Assume that we want to do parallel
 matrix-matrix multiplication on this parallel machine when we have three nodes with 6 cores, 2 cores, and 4 cores,
 respectively. The matrix sizes are 1200x1200. Construct an efficient parallel algorithm for this problem using all
 cores and nodes.
Suggested solution.
    Yet to be written!