

Uppsala Universitet

## What is a parallel computer?

Answer 1: A set …

Answer 2: "All" modern computers a…

Compare:

| Machine  | Year | Clock rate | Performance  |
|----------|------|------------|--------------|
| EDSAC    |      |            | 100 Flop/s   |
| Cray 1   |      | 80 MHz     | 4 Mflop/s    |
| Cray 1   |      | 80 MHz     | 12 Mflop/s   |
| Intel P4 | 2004 | 3.8 GHz    | 7.6 Gflop/s  |
| Intel i7 | 2010 | 2.3 GHz    | 37 Gflop/s   |
| Nvidia   | 2012 | 732 MHz    | 3.95 Tflop/s |

Explanation: Intel …
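Dividing each performance figure in the comparison by the clock rate gives a rough idea of how much work is done per cycle. The sketch below assumes the clock rates pair with the machines in the order listed (EDSAC has no clock rate given and is left out). The function name `flops_per_cycle` is only illustrative.

```python
# Clock rate (Hz) and performance (flop/s) as listed in the comparison.
# The pairing of clock rates with machines is an assumption based on list order.
machines = [
    ("Cray 1 (first entry)", 80e6, 4e6),
    ("Cray 1 (second entry)", 80e6, 12e6),
    ("Intel P4 (2004)", 3.8e9, 7.6e9),
    ("Intel i7 (2010)", 2.3e9, 37e9),
    ("Nvidia (2012)", 732e6, 3.95e12),
]


def flops_per_cycle(clock_hz, flops):
    """Floating-point operations completed per clock cycle."""
    return flops / clock_hz


for name, clock_hz, flops in machines:
    print(f"{name:24s} {flops_per_cycle(clock_hz, flops):10.2f} flop/cycle")
```

The P4 comes out at 2 flop/cycle, while the Nvidia entry needs several thousand flop/cycle at a much lower clock rate. That is only possible with many operations in flight at once, which suggests the later gains come from parallelism rather than from the clock.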





## Solutions to run faster:

1. Increase clock rate
   - $\Rightarrow$ Slow devices (memory) still stall execution
   - $\Rightarrow$ More contention on the memory bus
   - $\Rightarrow$ High power consumption: increasing the frequency requires increasing the voltage, and $P \sim f \cdot V^2$ (energy cost, cooling problems, battery time)
2. Introduce parallelism
   - a. Within a CPU
   - b. Multiple CPUs
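The power argument can be made concrete with a small calculation. This is a minimal sketch of the proportionality $P \sim f \cdot V^2$ only. The 20% figures are hypothetical, and `relative_power` is an illustrative name.

```python
def relative_power(freq_scale, voltage_scale):
    """Relative power under the proportionality P ~ f * V^2."""
    return freq_scale * voltage_scale ** 2


# Hypothetical case: a 20% higher clock that also needs a 20% higher voltage.
faster_clock = relative_power(1.2, 1.2)
print(f"1.2x clock, 1.2x voltage: {faster_clock:.3f}x power")

# Alternative: two CPUs at the original clock and voltage.
two_cpus = 2 * relative_power(1.0, 1.0)
print(f"two CPUs at base clock:   {two_cpus:.3f}x power")
```

Under these assumptions the faster clock costs about 1.73 times the power for at most 1.2 times the speed, whereas two CPUs cost 2 times the power for up to 2 times the throughput, provided the work can be parallelized.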









## Types of memory

PROBLEM: Enhancements (1) - (4) $\Rightarrow$ Memory can't keep up and deliver data at processor speed!

- DRAM (Dynamic Random Access Memory): charge-based devices that need to be refreshed at read/write, which takes time, ~10<sup>-8</sup> s. Cheap but slow, used for main memory. (Improvements: SDRAM, DDR SDRAM, RDRAM use memory banks and can overlap accesses.)
- SRAM (Static Random Access Memory): gate-based devices (transistors). No need for refresh, fast but expensive, used for registers and cache.
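To see why an access time of about 10<sup>-8</sup> s is a problem, it helps to express it in clock cycles. The sketch below uses the DRAM time quoted above and clock rates that appear in the earlier machine comparison. `stall_cycles` is an illustrative name, and the model ignores caches and overlapped accesses.

```python
DRAM_ACCESS_S = 1e-8  # DRAM access time quoted above (~10^-8 s)


def stall_cycles(access_time_s, clock_hz):
    """Number of clock cycles that pass during one memory access."""
    return access_time_s * clock_hz


for label, clock_hz in [("80 MHz", 80e6), ("2.3 GHz", 2.3e9), ("3.8 GHz", 3.8e9)]:
    cycles = stall_cycles(DRAM_ACCESS_S, clock_hz)
    print(f"{label:>8s}: about {cycles:.1f} cycles per DRAM access")
```

At 80 MHz a DRAM access fits within a single cycle, but at 3.8 GHz the processor waits roughly 38 cycles for one access unless the data is already in a faster memory.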



## Caches

- L1 cache (~128 kB, on chip)
- L2 cache (~4 MB, on/off chip)
- L3 cache (8+ MB, off chip)

Types of caches (techniques to store/replace data), compared to a notebook:

- direct mapped: pages in alphabetical order, so each entry has exactly one possible place
- fully associative: blank pages, so an entry can go anywhere
- set associative: free sections, with each section ordered

Memory hierarchies make the data layout and the data access patterns in your code very important!

Register → Cache → Main memory → Virtual memory
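The three cache types and the effect of access order can be explored with a small simulator. This is a minimal sketch, not a model of any real processor: it assumes LRU replacement, a hypothetical cache of 32 lines of 64 bytes, and a 64 x 64 matrix of 8-byte elements stored row by row. All names are illustrative.

```python
from collections import OrderedDict


def count_misses(addresses, num_lines, line_size, ways):
    """Count misses in a set-associative cache with LRU replacement.

    ways == 1 gives a direct-mapped cache,
    ways == num_lines gives a fully associative cache.
    """
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size       # memory line containing this address
        s = sets[line % num_sets]      # the only set this line may be stored in
        if line in s:
            s.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)  # set is full: evict least recently used
            s[line] = True
    return misses


N, ELEM = 64, 8  # hypothetical N x N matrix of 8-byte elements, row-major
row_order = [(i * N + j) * ELEM for i in range(N) for j in range(N)]
col_order = [(i * N + j) * ELEM for j in range(N) for i in range(N)]

# Two addresses that compete for the same place in a direct-mapped cache.
ping_pong = [0, 32 * 64] * 10

for ways in (1, 2, 32):
    print(f"ways={ways:2d}",
          "row-wise:", count_misses(row_order, 32, 64, ways),
          "column-wise:", count_misses(col_order, 32, 64, ways),
          "ping-pong:", count_misses(ping_pong, 32, 64, ways))
```

With these parameters the row-wise sweep misses 512 times (once per cache line) and the column-wise sweep misses on all 4096 accesses, regardless of associativity. The ping-pong pattern misses 20 times in the direct-mapped cache but only twice once two lines can share a set.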





## Vector processors

1. Vector instructions: vadd, vmult, etc. (complex instructions)
2. Vector registers: compare with scalar registers in RISC
3. Memory pipelines (1024 memory banks): pipelined memory accesses, no need for caches, deliver data at clock speed
4. Vector units: arithmetical pipelines
5. Compiler directives for vectorization of code: similar to those for OpenMP
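The way vector instructions and vector registers work together on a long loop can be sketched as strip mining: the loop is cut into strips that each fit in one vector register. The code below is only an illustration in plain Python. The register length of 64 elements is an assumption, and `vadd` here is a stand-in for a single hardware instruction, not a real API.

```python
VECTOR_LENGTH = 64  # assumed number of elements in one vector register


def vadd(a, b):
    """Stand-in for one vector add instruction on one register-full of data."""
    assert len(a) == len(b) <= VECTOR_LENGTH
    return [x + y for x, y in zip(a, b)]


def strip_mined_add(a, b):
    """Add two equal-length arrays strip by strip.

    Returns the result and the number of vadd "instructions" issued.
    """
    result = []
    instructions = 0
    for start in range(0, len(a), VECTOR_LENGTH):
        stop = start + VECTOR_LENGTH
        result.extend(vadd(a[start:stop], b[start:stop]))
        instructions += 1
    return result, instructions


a = list(range(200))
b = [1] * 200
c, issued = strip_mined_add(a, b)
print(f"{len(c)} elements added with {issued} vector instructions")
```

For 200 elements this issues 4 vector instructions (three full strips and one partial strip) instead of 200 scalar additions.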





























|                             | 2003         | 2005         | 2007        | 2008        | 2009         | 2010          |
|-----------------------------|--------------|--------------|-------------|-------------|--------------|---------------|
|                             | AMD Opteron™ | AMD Opteron™ | "Barcelona" | "Shanghai"  | "Istanbul"   | "Magny-Cours" |
| Mfg. process                | 90nm SOI     | 90nm SOI     | 65nm SOI    | 45nm SOI    | 45nm SOI     | 45nm SOI      |
| CPU core                    | K8           | K8           | Greyhound   | Greyhound+  | Greyhound+   | Greyhound+    |
| L2/L3                       | 1MB/0        | 1MB/0        | 512kB/2MB   | 512kB/6MB   | 512kB/6MB    | 512kB/12MB    |
| HyperTransport™ Technology  | 3x 1.6 GT/s  | 3x 1.6 GT/s  | 3x 2 GT/s   | 3x 4.0 GT/s | 3x 4.8 GT/s  | 4x 6.4 GT/s   |
| Memory                      | 2x DDR1 300  | 2x DDR1 400  | 2x DDR2 667 | 2x DDR2 800 | 2x DDR2 1066 | 4x DDR3 1333  |

| 2010                      | 2011                      | 2012                   | 2013                 |
|---------------------------|---------------------------|------------------------|----------------------|
| Magny-Cours 8- to 12-Core | Interlagos 12- to 16-Core | Terramar Up to 20-Core | Dublin Up to 20-Core |

## Graphics Processing Units (GPU)

Main vendors: NVIDIA, AMD

Architectural features:

- Many simple processing elements (16-2668)
- 1000s to 1 000 000s of threads
- Hardware thread scheduling (1 cycle)
- Focus on throughput (data parallel tasks)
- Limited memory (small on-chip memory)
- Limited bandwidth CPU ⇔ GPU (bottleneck)
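The effect of the CPU ⇔ GPU bottleneck can be estimated with a crude timing model: time on the GPU is transfer time plus compute time. The sketch below uses the Intel i7 and Nvidia performance figures from the earlier comparison as compute rates. The transfer rate of 8 GB/s is a hypothetical value chosen only for illustration, and the model ignores memory bandwidth on both devices.

```python
CPU_FLOPS = 37e9         # Intel i7 figure from the machine comparison
GPU_FLOPS = 3.95e12      # Nvidia figure from the machine comparison
LINK_BYTES_PER_S = 8e9   # hypothetical CPU <-> GPU transfer rate (assumption)


def cpu_time(flops):
    """Time to do the work on the CPU at its listed rate."""
    return flops / CPU_FLOPS


def gpu_time(flops, bytes_moved):
    """Time to move the data over the link and do the work on the GPU."""
    return bytes_moved / LINK_BYTES_PER_S + flops / GPU_FLOPS


# Vector add: n flops, three vectors of n 8-byte elements cross the link.
n = 10_000_000
print("vector add:      CPU", cpu_time(n), "s, GPU", gpu_time(n, 3 * 8 * n), "s")

# Matrix multiply: 2*m^3 flops, three m x m matrices cross the link.
m = 2000
print("matrix multiply: CPU", cpu_time(2 * m**3), "s, GPU",
      gpu_time(2 * m**3, 3 * 8 * m * m), "s")
```

In this model the vector add is slower on the GPU because almost all the time goes into the transfer, while the matrix multiply does enough arithmetic per transferred byte to pay off. The crossover point depends entirely on the assumed rates.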







Problem:

- Not scalable (above 64 processors): the memory bottleneck becomes worse.
- Cache-coherency problem (CC): when several processors have the same data element in their caches and one of them changes the data, the copies in the other caches must be invalidated. (Handled with snoop or directory based protocols.)

This is an important problem to consider when programming with OpenMP: *true sharing* and *false sharing* effects result in communication of invalidated data between caches.
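False sharing can be illustrated by counting invalidations in a toy write-invalidate model. This is a simplified sketch, not a real coherency protocol: it tracks only which thread last wrote each cache line, assumes a 64-byte line, and has each thread update only its own counter. All names are illustrative.

```python
LINE_SIZE = 64  # assumed cache line size in bytes
ELEM_SIZE = 8   # one 8-byte counter per thread


def count_invalidations(writes, stride_bytes):
    """Count writes that invalidate another cache's copy of a line.

    writes: sequence of thread ids. Thread t writes only its own counter,
    located at byte offset t * stride_bytes.
    """
    owner = {}  # cache line -> thread whose cache holds the valid copy
    invalidations = 0
    for t in writes:
        line = (t * stride_bytes) // LINE_SIZE
        if line in owner and owner[line] != t:
            invalidations += 1  # another cache held this line: invalidate it
        owner[line] = t
    return invalidations


writes = [0, 1] * 1000  # two threads alternately updating their own counter

packed = count_invalidations(writes, ELEM_SIZE)   # counters share one line
padded = count_invalidations(writes, LINE_SIZE)   # one counter per line
print("adjacent counters:", packed, "invalidations")
print("padded counters:  ", padded, "invalidations")
```

With adjacent counters, 1999 of the 2000 writes invalidate the other thread's copy even though the threads never touch the same variable. With each counter padded to its own cache line there are none.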

























**Parallel Comp…**

IT-servers:

| Server                                                         | Hardware                                                                       |
|----------------------------------------------------------------|--------------------------------------------------------------------------------|
| fries.it.uu.se                                                 | Intel Core2 Quad, 10 nodes x 2 CPUs x 4 cores, Gigabit Ethernet between nodes |
| gullviva.it.uu.se<br>tussilago.it.uu.se<br>vitsippa.it.uu.se   | AMD Opteron, 16 cores                                                          |





