Parallel Architectures: Ever Faster
There is a continuing drive to create faster computers. Some large engineering problems require massive computing resources. Increases in clock speed alone will not produce a sufficient increase in computer performance. Parallelism provides the ability to get more done in the same amount of time.
COMP375 Computer Architecture and Organization
Intel Performance
[Chart: Intel processor performance, 1987-2003, showing the relative contributions of clock speed and architectural improvements.]
Parallelism
The key to making a computer fast is to do as much in parallel as possible. Processors in modern home PCs do a lot in parallel:
Superscalar execution
Multiple ALUs
Hyperthreading
Pipelining
Demands exceed the capabilities of even the fastest current uniprocessor systems.
from http://meseec.ce.rit.edu/eecc756-spring2002/756-3-8-2005.ppt
Hyperthreading
Hyperthreading allows the apparently parallel execution of two threads. Each thread has its own set of registers, but they share a single CPU. The CPU alternates executing instructions from each thread. Pipelining is improved because adjacent instructions come from different threads and therefore do not have data hazards.
Hyperthreading Performance
Also called Multiple Instruction Issue processors.
Overall throughput is increased.
Individual threads do not run as fast as they would without hyperthreading.
[Diagram: pipeline stages alternating instructions from Thread 1 and Thread 2.]
Hyperthreading
1. Runs best in single-threaded applications
2. Replaces pipelining
3. Duplicates the user register set
4. Replaces dual-core processors
Multiple Threads
The programmer must create multiple threads in the program, as sketched below.
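A minimal sketch of creating threads with POSIX pthreads; the work each thread performs here is made up for illustration:

#include <stdio.h>
#include <pthread.h>

/* work done by each thread; here it simply reports which thread it is */
void *worker(void *arg) {
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int ids[2] = {0, 1};

    /* the programmer explicitly creates the threads ... */
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);

    /* ... and waits for them to finish */
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}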
Multiple processes
Different programs
Programmer never notices
MIMD Computers
Shared Memory
  Attached processors
  SMP: multiple processor systems, multiple instruction issue processors, multi-core processors
  NUMA
Separate Memory
  Clusters: message passing multiprocessors
  Hypercube, Mesh
Vector Processors
A vector is a one-dimensional array. Vector processors are SIMD machines that have instructions to perform operations on vectors. A vector processor might have vector registers with 64 doubles in each register. A single instruction could add two vectors.
Vector Example
/* traditional vector addition */
double a[64], b[64], c[64];
int i;
for (i = 0; i < 64; i++)
    c[i] = a[i] + b[i];

/* vector processor: the whole addition in one instruction (pseudocode) */
ci = ai + bi
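For comparison, a minimal sketch of the same 64-element addition using x86 AVX intrinsics (assuming a CPU with AVX support); this is SIMD on a general-purpose processor, four doubles per instruction rather than a full 64-element vector register:

#include <immintrin.h>

/* c = a + b, processing four doubles per 256-bit register */
void vec_add(const double a[64], const double b[64], double c[64]) {
    for (int i = 0; i < 64; i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]);            /* load a[i..i+3] */
        __m256d vb = _mm256_loadu_pd(&b[i]);            /* load b[i..i+3] */
        _mm256_storeu_pd(&c[i], _mm256_add_pd(va, vb)); /* store the sums */
    }
}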
Non-contiguous Vectors
A simple vector is a one-dimensional array with each element in successive addresses. In a two-dimensional array, the rows can be considered simple vectors. The columns are also vectors, but they are not stored in consecutive addresses. The distance between elements of a vector is called the stride.
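For example, in a C two-dimensional array stored in row-major order (a hypothetical 4x4 matrix below), a row has stride 1 while a column has a stride equal to the row length:

#include <stdio.h>

#define ROWS 4
#define COLS 4

int main(void) {
    double m[ROWS][COLS];
    double *flat = &m[0][0];            /* view the matrix as one linear array */

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = 10 * i + j;

    /* row 1: consecutive addresses, stride 1 */
    for (int j = 0; j < COLS; j++)
        printf("%.0f ", flat[1 * COLS + j]);
    printf("\n");

    /* column 2: elements are COLS addresses apart, stride = COLS */
    for (int i = 0; i < ROWS; i++)
        printf("%.0f ", flat[i * COLS + 2]);
    printf("\n");
    return 0;
}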
Vector Stride
Matrix multiplication
[ a11 a12 a13 ]   [ b11 b12 ]
[ a21 a22 a23 ] * [ b21 b22 ]
                  [ b31 b32 ]

Each element of the product is the dot product of a row of A (stride 1) with a column of B (stride 2, the number of columns of B).
Cray-2
Built in 1985 with ten times the performance of the Cray-1
A computed value is passed from one processor to the next. Each processor may perform a different operation. Minimizes memory accesses.
Systolic Architectures
Systolic arrays were meant to be special-purpose processors or coprocessors and were very fine-grained.
The processors implement a limited and very simple computation and are usually called cells.
Communication is very fast; granularity is meant to be very fine (a small number of computational operations per communication).
Very fast clock rates due to the regular, synchronous structure.
Data moves through pipelined computational units in a regular and rhythmic fashion.
Warp and iWarp were examples of systolic arrays.
from www-csag.ucsd.edu/teaching/cse160s05/lectures/Lecture15.pdf
[Animation: a 3x3 systolic array computing the matrix product A x B. Over time steps T = 0 through 7, elements of A flow in from the left and elements of B from the top, aligned in time so that each cell accumulates one result element, e.g. a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0. Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/]
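The data flow in that animation can be imitated in software. Below is a minimal, hypothetical C simulation of an output-stationary 3x3 systolic array (the values in A and B are made up): each cell holds one element of the result, A values shift right, B values shift down, and the inputs are skewed in time so that matching operands meet in the correct cell.

#include <stdio.h>

#define N 3

int main(void) {
    double A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double B[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
    double C[N][N] = {{0}};      /* each cell (i,j) accumulates c[i][j] */
    double a_in[N][N] = {{0}};   /* A value currently at cell (i,j) */
    double b_in[N][N] = {{0}};   /* B value currently at cell (i,j) */

    /* a[i][k] meets b[k][j] in cell (i,j) at time t = i + j + k,
       so 3N - 2 time steps complete the product */
    for (int t = 0; t < 3 * N - 2; t++) {
        /* shift A values one cell to the right and B values one cell down */
        for (int i = 0; i < N; i++)
            for (int j = N - 1; j > 0; j--)
                a_in[i][j] = a_in[i][j - 1];
        for (int j = 0; j < N; j++)
            for (int i = N - 1; i > 0; i--)
                b_in[i][j] = b_in[i - 1][j];

        /* feed time-skewed inputs at the left and top edges */
        for (int i = 0; i < N; i++) {
            int k = t - i;                     /* a[i][k] enters row i at t = i + k */
            a_in[i][0] = (k >= 0 && k < N) ? A[i][k] : 0.0;
        }
        for (int j = 0; j < N; j++) {
            int k = t - j;                     /* b[k][j] enters column j at t = j + k */
            b_in[0][j] = (k >= 0 && k < N) ? B[k][j] : 0.0;
        }

        /* every cell multiplies its current inputs and accumulates */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] += a_in[i][j] * b_in[i][j];
    }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%6.1f ", C[i][j]);
        printf("\n");
    }
    return 0;
}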
Systolic Architectures
Systolic arrays never caught on for general-purpose computation:
Regular structures are hard to achieve.
Real efficiency requires local communication, but the flexibility of global communication is often critical.
Specialized Processors
Some architectures have a general-purpose processor with additional specialized processors:
I/O processors (I/O controllers with DMA)
Graphics processors
Vector processors
Topologies
The CPUs can be interconnected in different ways. Goals of an interconnection method are:
Minimize the number of connections per CPU
Minimize the average distance between nodes
Minimize the diameter (the maximum distance between any two nodes)
Maximize the throughput
Simple routing of messages
Grid or Mesh
Each node has 2, 3 or 4 connections. Routing is simple and many routes are available. For a √N x √N mesh the diameter is 2(√N - 1).
Torus
Similar to a mesh, but with the end nodes connected. Many similar routes are available. The diameter is about half that of a mesh, roughly √N.
Hypercube
Each node has log2 N connections. Routing can be done by changing one bit of the address at a time. The diameter is log2 N.
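A sketch of that routing rule in C (node addresses are assumed to be 0..N-1 with N a power of two): the route from src to dst fixes one differing address bit per hop, so no path is longer than log2 N. For comparison, with N = 1024 nodes a mesh has diameter 2(32 - 1) = 62, a torus about 32, and a hypercube only log2 1024 = 10.

#include <stdio.h>

#define DIM 3                       /* a 3-cube: N = 8 nodes, diameter = 3 */

/* print the hop-by-hop route from src to dst, fixing one address bit per step */
void route(unsigned src, unsigned dst) {
    unsigned node = src;
    printf("%u", node);
    for (int bit = 0; bit < DIM; bit++) {
        unsigned mask = 1u << bit;
        if ((node ^ dst) & mask) {  /* this address bit still differs */
            node ^= mask;           /* move to the neighbor across that dimension */
            printf(" -> %u", node);
        }
    }
    printf("\n");
}

int main(void) {
    route(0, 7);   /* opposite corners: takes DIM = log2 N hops */
    route(5, 6);   /* addresses differ in two bits: two hops */
    return 0;
}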
NUMA systems
In NonUniform Memory Access (NUMA) systems each node has local memory but can also access the memory of other nodes. Most memory accesses are to local memory. The memory of other nodes can be accessed, but the access is slower than local memory. Cache coherence algorithms are used when shared memory is accessed. NUMA is more scalable than SMP.
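On Linux, memory placement on a particular node can be requested explicitly. A minimal sketch using the libnuma API (assuming libnuma is installed and the program is linked with -lnuma):

#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {      /* kernel or hardware provides no NUMA support */
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    printf("nodes 0..%d\n", numa_max_node());

    /* allocate 1 MiB placed on node 0: local (fast) for CPUs on that node,
       remote (slower) for CPUs on any other node */
    size_t size = 1 << 20;
    void *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL)
        return 1;
    /* ... use buf ... */
    numa_free(buf, size);
    return 0;
}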
[Diagram: NUMA system]