











| Where is Supercomputing heading? |
|----------------------------------|
| <ul> <li>1997, 500 fastest machines in the world: 319 MPPs, 73 bus-based shared memory (SMP), 106 parallel vector processors (PVP)</li> </ul> |
| <ul> <li>2000, 381 of 500 fastest: 144 IBM SP (~cluster), 121 Sun (bus SMP), 62 SGI (NUMA SMP), 54 Cray (NUMA SMP)</li> </ul> |
| Parallel Computer Architecture: A Hardware/Software Approach, David E. Culler, Jaswinder Pal Singh, with Anoop Gupta. San Francisco: Morgan Kaufmann, c1999. |
| http://www.top500.org/ |

| Popular Flynn Categories for Parallel Computers |
|-------------------------------------------------|
| <ul> <li>SISD (Single Instruction Single Data) <ul> <li>Uniprocessors</li> </ul> </li> </ul> |
| <ul> <li>MISD (Multiple Instruction Single Data) <ul> <li>Multiple processors on a single data stream</li> </ul> </li> </ul> |
| <ul> <li>SIMD (Single Instruction Multiple Data) <ul> <li>Early examples: Illiac-IV, CM-2</li> <li>Phrase reused by Intel marketing for media instructions ~ vector</li> </ul> </li> </ul> |
| <ul> <li>MIMD (Multiple Instruction Multiple Data) <ul> <li>Examples: Sun Enterprise 5000, Cray T3D, SGI Origin</li> <li>Flexible</li> <li>Use off-the-shelf micros</li> </ul> </li> </ul> |
| <ul> <li>MIMD current winner: concentrate on major design emphasis &lt;= 128 processor MIMD machines</li> </ul> |

| Major MIMD Styles |
|-------------------|
| 1. Centralized shared memory ("Uniform Memory Access" time or "Shared Memory Processor") |
| 2. Decentralized memory (memory module with CPU) |
| <ul> <li>Shared memory with "Non-Uniform Memory Access" time (NUMA)</li> </ul> |
| <ul> <li>Message passing "multicomputer" with separate address space per processor</li> </ul> |

| Parallel Architecture |
|-----------------------|
| Parallel architecture extends traditional computer architecture with a communication architecture |
| <ul> <li>abstractions (HW/SW interface)</li> </ul> |
| <ul> <li>organizational structure to realize abstraction efficiently</li> </ul> |

| Parallel Framework |
|--------------------|
| Layers: |
| <ul> <li>Programming Model: <ul> <li>Multiprogramming: lots of jobs, no communication</li> <li>Shared address space: communicate via memory</li> <li>Message passing: send and receive messages</li> <li>Data parallel: one operation, multiple data sets</li> </ul> </li> </ul> |
| <ul> <li>Communication Abstraction: <ul> <li><u>Shared address space</u>: e.g., load, store, etc =&gt; multiprocessors</li> <li>Message passing: e.g., send, receive library calls</li> <li>Debate over this topic (ease of programming, scaling)</li> </ul> </li> </ul> |
| May mix shared address space and message passing at different layers |

| Shared Address/Memory Processor Model                                                                                                                       |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>Each processor can name every physical location in the machine</li> </ul>                                                                          |
| Each process can name all data it shares with other processes                                                                                               |
| Data transfer via load and store                                                                                                                            |
| Data size: byte, word, or cache blocks                                                                                                                      |
| <ul> <li>Uses virtual memory to map virtual to local or remote<br/>physical</li> </ul>                                                                      |
| <ul> <li>Memory hierarchy model applies: now communication<br/>moves data to local processor cache (as load moves data<br/>from memory to cache)</li> </ul> |
| Latency, bandwidth, and scalability when communicating? |

| Shared-Memory Programming Examples |
|------------------------------------|

<pre>
#include &lt;pthread.h&gt;

#define MAX_THR 16                  /* assumed upper bound on threads */
#define N 512                       /* assumed matrix dimension */

double X[N][N], Y[N][N], Z[N][N];   /* shared by all threads */
int n = N, num_thr = 4;             /* assumed problem size and thread count */

struct alloc_t { int first; int last; } alloc[MAX_THR];
pthread_t tid[MAX_THR];

/* Each thread computes rows [first, last) of Z += X * Y. */
void *dmm_func(void *arg) {
    struct alloc_t *a = arg;
    for (int i = a->first; i < a->last; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                Z[i][j] += X[i][k] * Y[k][j];
    return NULL;
}

int main(void) {
    for (int i = 0; i < num_thr; i++) {
        alloc[i].first = i * (n / num_thr);
        /* the last thread also takes any leftover rows */
        alloc[i].last = (i != num_thr - 1) ? (i + 1) * (n / num_thr) : n;
        pthread_create(&tid[i] /* thread id */, NULL /* attributes */,
                       dmm_func /* thread function */, &alloc[i] /* parameters */);
    }
    for (int i = 0; i < num_thr; i++)
        pthread_join(tid[i] /* thread id */, NULL /* return value */);
    return 0;
}
</pre>
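
The row split in the example can be isolated into a small helper. This is a minimal sketch, assuming a block-row decomposition in which the last thread takes any leftover rows; `struct range` and `row_range` are hypothetical names that do not appear on the slide.

```c
#include <assert.h>

/* Half-open row interval [first, last) owned by one thread. */
struct range { int first; int last; };

/* Block-row decomposition of n rows over num_thr threads (hypothetical helper).
   Every thread gets n / num_thr rows; the last one also takes the remainder. */
struct range row_range(int i, int n, int num_thr) {
    assert(num_thr > 0 && i >= 0 && i < num_thr);
    int chunk = n / num_thr;
    struct range r;
    r.first = i * chunk;
    r.last = (i != num_thr - 1) ? (i + 1) * chunk : n;
    return r;
}
```

With 10 rows and 3 threads, the threads would own rows [0, 3), [3, 6), and [6, 10), so the load is slightly uneven whenever `n` is not a multiple of `num_thr`.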

| Shared Address/Memory Multiprocessor Model |
|--------------------------------------------|
| <ul> <li>Communicate via Load and Store <ul> <li>Oldest and most popular model</li> </ul> </li> <li>Based on timesharing: processes on multiple processors vs. sharing single processor</li> <li>Process: a virtual address space and one or more threads of control <ul> <li>ALL threads of a process share a process address space</li> <li>Example: Pthreads</li> </ul> </li> </ul> |
| <ul> <li>Writes to shared address space by one thread are visible to reads of other threads</li> </ul> |
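
The visibility rule can be illustrated with two threads and one shared variable. This is a small sketch, assuming POSIX threads; `shared_value`, `writer`, and `read_after_writer` are illustrative names, and `pthread_join` is used as the synchronization point that orders the store before the load.

```c
#include <assert.h>
#include <pthread.h>

static int shared_value = 0;    /* lives in the process address space */

/* Thread body: communicates by an ordinary store. */
static void *writer(void *arg) {
    (void)arg;
    shared_value = 42;
    return NULL;
}

/* Returns the value the calling thread observes after the writer finishes. */
int read_after_writer(void) {
    pthread_t t;
    int rc = pthread_create(&t, NULL, writer, NULL);
    assert(rc == 0);
    rc = pthread_join(t, NULL);  /* orders the writer's store before our load */
    assert(rc == 0);
    return shared_value;         /* ordinary load sees the other thread's write */
}
```

No data is copied or sent explicitly: both threads simply name the same location.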





| Message Passing Model |
|-----------------------|
| <ul> <li>Whole computers (CPU, memory, I/O devices); explicit send/receive as explicit I/O operations</li> </ul> |
| <ul> <li><u>Send</u> specifies local buffer + receiving process on remote computer</li> </ul> |
| <ul> <li><u>Receive</u> specifies sending process on remote computer + local buffer to place data</li> </ul> |
| <ul> <li>Send+receive =&gt; memory-memory copy, where each side supplies a local address</li> </ul> |
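
The send/receive pairing can be sketched in a single process, with a one-slot buffer standing in for the network. Everything here (`struct channel`, `msg_send`, `msg_recv`, `MSG_MAX`) is a hypothetical toy interface, not a real message-passing library; the point is only that each side names the peer process and its own local buffer, and the result is a memory-to-memory copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MSG_MAX 64

/* Stand-in for the network: holds at most one in-flight message. */
struct channel {
    int    src;             /* sending process id */
    int    dst;             /* receiving process id */
    size_t len;
    int    full;
    char   data[MSG_MAX];
};

/* Send: local buffer + receiving process. Returns 0, or -1 if the channel is busy. */
int msg_send(struct channel *ch, int self, int dst, const void *buf, size_t len) {
    assert(len <= MSG_MAX);
    if (ch->full) return -1;
    memcpy(ch->data, buf, len);
    ch->src = self; ch->dst = dst; ch->len = len; ch->full = 1;
    return 0;
}

/* Receive: sending process + local buffer. Returns bytes copied, or -1 if
   there is no matching message or the local buffer is too small. */
int msg_recv(struct channel *ch, int self, int src, void *buf, size_t cap) {
    if (!ch->full || ch->dst != self || ch->src != src || ch->len > cap) return -1;
    memcpy(buf, ch->data, ch->len);
    ch->full = 0;
    return (int)ch->len;
}
```

Process 0 would call `msg_send(&ch, 0, 1, out, n)` and process 1 would call `msg_recv(&ch, 1, 0, in, sizeof in)`; neither side ever names an address in the other's memory.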

| Advantages of Message-Passing Communication |
|---------------------------------------------|
| The hardware can be much simpler and is usually standard |
| Explicit communication => simpler to understand, helps focus effort on reducing communication cost |
| <ul> <li>Synchronization is naturally associated with sending/receiving messages</li> </ul> |
| Easier to use sender-initiated communication, which may have some advantages in performance |
| Important, but will not be discussed in detail |



| Amdahl's Law and Parallel Computers |
|-------------------------------------|
| Amdahl's Law: speedup is limited by the fraction of the program that can be parallelized |
| Speedup ≤ $1 / (1-f)$, where $f$ is the fraction of the program that can be parallelized |
| How small must the sequential fraction $1-f$ be if we want 80X speedup from 100 processors? |
| $1 / ((1-f) + f/100) = 80$ |
| $1-f \approx 0.25\%$, i.e. $f \approx 99.75\%$! |
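
The arithmetic behind the 80X example can be checked with a few lines of code. This is a sketch that writes `s` for the sequential fraction of the program and `p` for the processor count; `amdahl_speedup` and `max_serial_fraction` are illustrative names.

```c
#include <assert.h>

/* Amdahl's Law: speedup on p processors when a fraction s of the work is sequential. */
double amdahl_speedup(double s, int p) {
    assert(s >= 0.0 && s <= 1.0 && p >= 1);
    return 1.0 / (s + (1.0 - s) / p);
}

/* Largest sequential fraction that still reaches `target` speedup on p processors.
   Solves 1 / (s + (1 - s)/p) = target for s. */
double max_serial_fraction(double target, int p) {
    assert(p > 1 && target >= 1.0 && target <= p);
    return (1.0 / target - 1.0 / p) / (1.0 - 1.0 / p);
}
```

For a target of 80 on 100 processors, `max_serial_fraction(80.0, 100)` evaluates to 0.0025 / 0.99, roughly 0.25%, which matches the figure on the slide.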

| What Does Coherency Mean? |
|---------------------------|
| <ul> <li>Informally: <ul> <li>"Any read must return the most recent write"</li> <li>Too strict and too difficult to implement</li> </ul> </li> </ul> |
| <ul> <li>Better: <ul> <li>"Any write must eventually be seen by a read"</li> <li>All writes are seen in proper order ("serialization")</li> </ul> </li> </ul> |
| <ul> <li>Two rules to ensure this: <ul> <li>"If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart"</li> <li>Writes to a single location are serialized: seen in one order <ul> <li>Latest write will be seen</li> <li>Otherwise could see writes in illogical order (could see older value after a newer value)</li> </ul> </li> </ul> </li> </ul> |
| Cache coherency in multiprocessors: How does a processor know about changes in the caches of other processors? How do other processors know about changes in this cache? |

| Potential HW Coherency Solutions |
|----------------------------------|
| Snooping Solution (Snoopy Bus): |
| <ul> <li>Send all requests for data to all processors</li> </ul> |
| <ul> <li>Processors snoop to see if they have a copy and respond accordingly</li> </ul> |
| <ul> <li>Requires broadcast, since caching information is at processors</li> </ul> |
| <ul> <li>Works well with bus (natural broadcast medium)</li> </ul> |
| <ul> <li>Dominates for small scale machines (most of the market)</li> </ul> |
| Directory-Based Schemes (discuss later) |
| <ul> <li>Keep track of what is being shared in 1 centralized place (logically)</li> </ul> |
| <ul> <li>Distributed memory =&gt; distributed directory for scalability (avoids bottlenecks)</li> </ul> |
| <ul> <li>Send point-to-point requests to processors via network</li> </ul> |
| <ul> <li>Scales better than Snooping</li> </ul> |
| <ul> <li>Actually existed BEFORE Snooping-based schemes</li> </ul> |
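
The slides defer the details of directory schemes, so the following is only one possible sketch of the "keep track of what is being shared" idea. It assumes a full bit-vector directory with one bit per processor (at most 32 here); `struct dir_entry`, `dir_read`, and `dir_write` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* One directory entry per memory block: bit p is set if processor p may hold a copy. */
struct dir_entry { uint32_t sharers; };

/* Record that processor p read the block. */
void dir_read(struct dir_entry *e, int p) {
    assert(p >= 0 && p < 32);
    e->sharers |= (uint32_t)1 << p;
}

/* Processor p writes the block. Returns the set of processors that must be
   sent point-to-point invalidates; afterwards p is the only holder. */
uint32_t dir_write(struct dir_entry *e, int p) {
    assert(p >= 0 && p < 32);
    uint32_t self = (uint32_t)1 << p;
    uint32_t to_invalidate = e->sharers & ~self;
    e->sharers = self;
    return to_invalidate;
}
```

Because the directory knows exactly who holds a copy, a write triggers messages only to those processors rather than a broadcast to all of them.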

| Basic Snoopy Protocols |
|------------------------|
| Write Invalidate Protocol: |
| <ul> <li>Multiple readers, single writer</li> </ul> |
| <ul> <li>Write to shared data: an invalidate is sent to all caches, which snoop and <i>invalidate</i> any copies</li> </ul> |
| <ul> <li>Read miss: <ul> <li>Write-through: memory is always up-to-date</li> <li>Write-back: snoop in caches to find most recent copy</li> </ul> </li> </ul> |
| Write Broadcast Protocol (typically with write-through): |
| <ul> <li>Write to shared data: broadcast on bus, processors snoop, and <i>update</i> any copies</li> </ul> |
| <ul> <li>Read miss: memory is always up-to-date</li> </ul> |
| Write serialization: bus serializes requests! |
| <ul> <li>Bus is single point of arbitration</li> </ul> |
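
A toy model can make the write-invalidate behavior concrete. This sketch assumes a single memory block, write-through caches, and four processors; `struct bus_system`, `cpu_read`, `cpu_write`, and `NCACHE` are illustrative names, and the loop over caches stands in for the bus broadcast.

```c
#include <assert.h>

#define NCACHE 4

/* One block, write-through memory, one cache line per processor (toy model). */
struct bus_system {
    int memory;
    int valid[NCACHE];
    int value[NCACHE];
};

/* Read: hit if the local copy is valid, otherwise a read miss served by memory. */
int cpu_read(struct bus_system *s, int p) {
    assert(p >= 0 && p < NCACHE);
    if (!s->valid[p]) {
        s->value[p] = s->memory;    /* write-through: memory is up to date */
        s->valid[p] = 1;
    }
    return s->value[p];
}

/* Write: the invalidate goes on the bus; every other cache snoops it and drops its copy. */
void cpu_write(struct bus_system *s, int p, int v) {
    assert(p >= 0 && p < NCACHE);
    for (int q = 0; q < NCACHE; q++)
        if (q != p) s->valid[q] = 0;
    s->value[p] = v;
    s->valid[p] = 1;
    s->memory = v;                  /* write-through */
}
```

After processor 0 writes, processor 1's copy is invalid, so its next read misses and fetches the new value instead of returning stale data.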

| Basic Snoopy Protocols |
|------------------------|
| Write Invalidate versus Broadcast: |
| <ul> <li>Invalidate requires one transaction per write-run</li> </ul> |
| <ul> <li>Invalidate uses spatial locality: one transaction per block</li> </ul> |
| <ul> <li>Broadcast has lower latency between write and read</li> </ul> |
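
The "one transaction per write-run" point can be made concrete by counting bus transactions over a trace of writes to one block. This is a simplified sketch: the trace lists only which processor performed each write, a write run is taken to be a maximal sequence of consecutive writes by the same processor, and read misses are ignored. Both function names are hypothetical.

```c
#include <assert.h>

/* Write broadcast: every write is put on the bus. */
int broadcast_transactions(const int *writer, int n) {
    (void)writer;
    assert(n >= 0);
    return n;
}

/* Write invalidate: only the first write of each run needs the bus;
   later writes in the run hit a block no other cache holds. */
int invalidate_transactions(const int *writer, int n) {
    assert(n >= 0);
    int runs = 0;
    for (int i = 0; i < n; i++)
        if (i == 0 || writer[i] != writer[i - 1]) runs++;
    return runs;
}
```

For the trace `{0, 0, 0, 1, 1, 0}` this gives 3 transactions under invalidate and 6 under broadcast, at the cost of a miss for whichever processor reads next.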

























| Implementing Snooping Caches |
|------------------------------|
| Bus serializes writes; getting the bus ensures no one else can perform a memory operation |
| On a miss in a write-back cache, another cache may have the desired copy and it's dirty, so it must reply |
| Add extra state bit to cache to determine whether the block is shared or not |
| <ul> <li>Add 4th state (MESI) <ul> <li><u>M</u>odified (private, != Memory)</li> <li><u>E</u>xclusive (private, = Memory)</li> <li><u>S</u>hared (shared, = Memory)</li> <li><u>I</u>nvalid</li> </ul> </li> </ul> |



| MESI Highlights |
|-----------------|
| Actions: |
| <ul> <li>Read miss on a block: send read request onto bus</li> </ul> |
| <ul> <li>Write miss on a block: send write request onto bus</li> </ul> |
| <ul> <li>Receive bus read request: transition the block to shared state</li> </ul> |
| <ul> <li>Receive bus write request: transition the block to invalid state</li> </ul> |
| <ul> <li>Must write back data when transitioning from modified state</li> </ul> |
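
The snoop-side actions listed above can be written as a small transition function. This is a sketch of only those rules, not a complete MESI controller: it covers how a cache reacts to another processor's bus request, assumes a block in the invalid state ignores the request, and uses illustrative names (`enum mesi`, `enum bus_req`, `snoop`).

```c
#include <assert.h>

enum mesi    { INVALID, SHARED, EXCLUSIVE, MODIFIED };
enum bus_req { BUS_READ, BUS_WRITE };

/* New state of a cached block after snooping another processor's request.
   *writeback is set to 1 when the block leaves the modified state, since
   the dirty data must be written back first. */
enum mesi snoop(enum mesi state, enum bus_req req, int *writeback) {
    assert(writeback);
    *writeback = (state == MODIFIED);
    if (state == INVALID) return INVALID;       /* nothing cached, nothing to do */
    return (req == BUS_READ) ? SHARED : INVALID;
}
```

For example, a modified block that observes a bus read moves to shared and signals a write-back, while an exclusive block that observes a bus write moves to invalid with no write-back.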