Parallel Architectures: Ever Faster
There is a continuing drive to create faster computers. Some large engineering problems require massive computing resources. Increases in clock speed alone will not produce a sufficient increase in computer performance. Parallelism provides the ability to get more done in the same amount of time.
COMP375 Computer Architecture and Organization
Intel Performance
[Chart: Intel processor performance, 1987-2003, showing the relative contributions of clock speed and architectural improvements.]
Parallelism
The key to making a computer fast is to do as much in parallel as possible. Processors in modern home PCs do a lot in parallel:
Superscalar execution
Multiple ALUs
Hyperthreading
Pipelining
Demands exceed the capabilities of even the fastest current uniprocessor systems.
from http://meseec.ce.rit.edu/eecc756-spring2002/756-3-8-2005.ppt
Hyperthreading
Hyperthreading allows the apparently parallel execution of two threads. Each thread has its own set of registers, but they share a single CPU. The CPU alternates executing instructions from each thread. Pipelining is improved because adjacent instructions come from different threads and therefore do not have data hazards.
Hyperthreading Performance
Also called Multiple Instruction Issue processors.
Overall throughput is increased.
Individual threads do not run as fast as they would without hyperthreading.
[Diagram: pipeline stages alternating instructions from Thread 1 and Thread 2.]
Hyperthreading
1. Runs best in single-threaded applications
2. Replaces pipelining
3. Duplicates the user register set
4. Replaces dual-core processors
Multiple Threads
The programmer must create multiple threads in the program, as sketched below.
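A minimal sketch of creating threads with POSIX pthreads; the work each thread performs here is made up for illustration:

#include <stdio.h>
#include <pthread.h>

/* work done by each thread; here it simply reports which thread it is */
void *worker(void *arg) {
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int ids[2] = {0, 1};

    /* the programmer explicitly creates the threads ... */
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);

    /* ... and waits for them to finish */
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}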
Multiple processes
Different programs
Programmer never notices
MIMD Computers
Shared Memory
  Attached processors
  SMP: multiple processor systems, multiple instruction issue processors, multi-core processors
  NUMA
Separate Memory
  Clusters: message passing multiprocessors
  Hypercube, Mesh
Vector Processors
A vector is a one-dimensional array. Vector processors are SIMD machines that have instructions to perform operations on vectors. A vector processor might have vector registers with 64 doubles in each register. A single instruction could add two vectors.
Vector Example
/* traditional vector addition */
double a[64], b[64], c[64];
int i;
for (i = 0; i < 64; i++)
    c[i] = a[i] + b[i];

/* vector processor: the whole addition in one instruction (pseudocode) */
ci = ai + bi
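For comparison, a minimal sketch of the same 64-element addition using x86 AVX intrinsics (assuming a CPU with AVX support); this is SIMD on a general-purpose processor, four doubles per instruction rather than a full 64-element vector register:

#include <immintrin.h>

/* c = a + b, processing four doubles per 256-bit register */
void vec_add(const double a[64], const double b[64], double c[64]) {
    for (int i = 0; i < 64; i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]);            /* load a[i..i+3] */
        __m256d vb = _mm256_loadu_pd(&b[i]);            /* load b[i..i+3] */
        _mm256_storeu_pd(&c[i], _mm256_add_pd(va, vb)); /* store the sums */
    }
}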
Non-contiguous Vectors
A simple vector is a one-dimensional array with each element in successive addresses. In a two-dimensional array, the rows can be considered simple vectors. The columns are also vectors, but they are not stored in consecutive addresses. The distance between elements of a vector is called the stride.
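For example, in a C two-dimensional array stored in row-major order (a hypothetical 4x4 matrix below), a row has stride 1 while a column has a stride equal to the row length:

#include <stdio.h>

#define ROWS 4
#define COLS 4

int main(void) {
    double m[ROWS][COLS];
    double *flat = &m[0][0];            /* view the matrix as one linear array */

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = 10 * i + j;

    /* row 1: consecutive addresses, stride 1 */
    for (int j = 0; j < COLS; j++)
        printf("%.0f ", flat[1 * COLS + j]);
    printf("\n");

    /* column 2: elements are COLS addresses apart, stride = COLS */
    for (int i = 0; i < ROWS; i++)
        printf("%.0f ", flat[i * COLS + 2]);
    printf("\n");
    return 0;
}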
Vector Stride
Matrix multiplication
[ a11 a12 a13 ]   [ b11 b12 ]
[ a21 a22 a23 ] * [ b21 b22 ]
                  [ b31 b32 ]

Each element of the product is the dot product of a row of A (stride 1) with a column of B (stride 2, the number of columns of B).
Cray-2
Built in 1985 with ten times the performance of the Cray-1
A computed value is passed from one processor to the next. Each processor may perform a different operation. Minimizes memory accesses.
Systolic Architectures
Systolic arrays were meant to be special-purpose processors or coprocessors and were very fine-grained.
The processors implement a limited and very simple computation and are usually called cells.
Communication is very fast; granularity is meant to be very fine (a small number of computational operations per communication).
Very fast clock rates due to the regular, synchronous structure.
Data moves through pipelined computational units in a regular and rhythmic fashion.
Warp and iWarp were examples of systolic arrays.
from www-csag.ucsd.edu/teaching/cse160s05/lectures/Lecture15.pdf
[Animation: a 3x3 systolic array computing the matrix product A x B. Over time steps T = 0 through 7, elements of A flow in from the left and elements of B from the top, aligned in time so that each cell accumulates one result element, e.g. a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0. Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/]
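The data flow in that animation can be imitated in software. Below is a minimal, hypothetical C simulation of an output-stationary 3x3 systolic array (the values in A and B are made up): each cell holds one element of the result, A values shift right, B values shift down, and the inputs are skewed in time so that matching operands meet in the correct cell.

#include <stdio.h>

#define N 3

int main(void) {
    double A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double B[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
    double C[N][N] = {{0}};      /* each cell (i,j) accumulates c[i][j] */
    double a_in[N][N] = {{0}};   /* A value currently at cell (i,j) */
    double b_in[N][N] = {{0}};   /* B value currently at cell (i,j) */

    /* a[i][k] meets b[k][j] in cell (i,j) at time t = i + j + k,
       so 3N - 2 time steps complete the product */
    for (int t = 0; t < 3 * N - 2; t++) {
        /* shift A values one cell to the right and B values one cell down */
        for (int i = 0; i < N; i++)
            for (int j = N - 1; j > 0; j--)
                a_in[i][j] = a_in[i][j - 1];
        for (int j = 0; j < N; j++)
            for (int i = N - 1; i > 0; i--)
                b_in[i][j] = b_in[i - 1][j];

        /* feed time-skewed inputs at the left and top edges */
        for (int i = 0; i < N; i++) {
            int k = t - i;                     /* a[i][k] enters row i at t = i + k */
            a_in[i][0] = (k >= 0 && k < N) ? A[i][k] : 0.0;
        }
        for (int j = 0; j < N; j++) {
            int k = t - j;                     /* b[k][j] enters column j at t = j + k */
            b_in[0][j] = (k >= 0 && k < N) ? B[k][j] : 0.0;
        }

        /* every cell multiplies its current inputs and accumulates */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] += a_in[i][j] * b_in[i][j];
    }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%6.1f ", C[i][j]);
        printf("\n");
    }
    return 0;
}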
Systolic Architectures
Systolic arrays never caught on for general-purpose computation:
Regular structures are hard to achieve.
Real efficiency requires local communication, but the flexibility of global communication is often critical.
Specialized Processors
Some architectures have a general-purpose processor with additional specialized processors:
I/O processors (I/O controllers with DMA)
Graphics processors
Vector processors
Topologies
The CPUs can be interconnected in different ways. Goals of an interconnection method are:
Minimize the number of connections per CPU
Minimize the average distance between nodes
Minimize the diameter (the maximum distance between any two nodes)
Maximize the throughput
Simple routing of messages
Grid or Mesh
Each node has 2, 3 or 4 connections. Routing is simple and many routes are available. For a √N x √N mesh the diameter is 2(√N - 1).
Torus
Similar to a mesh, but with the end nodes connected. Many similar routes are available. The diameter is about half that of a mesh, roughly √N.
Hypercube
Each node has log2 N connections. Routing can be done by changing one bit of the address at a time. The diameter is log2 N.
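A sketch of that routing rule in C (node addresses are assumed to be 0..N-1 with N a power of two): the route from src to dst fixes one differing address bit per hop, so no path is longer than log2 N. For comparison, with N = 1024 nodes a mesh has diameter 2(32 - 1) = 62, a torus about 32, and a hypercube only log2 1024 = 10.

#include <stdio.h>

#define DIM 3                       /* a 3-cube: N = 8 nodes, diameter = 3 */

/* print the hop-by-hop route from src to dst, fixing one address bit per step */
void route(unsigned src, unsigned dst) {
    unsigned node = src;
    printf("%u", node);
    for (int bit = 0; bit < DIM; bit++) {
        unsigned mask = 1u << bit;
        if ((node ^ dst) & mask) {  /* this address bit still differs */
            node ^= mask;           /* move to the neighbor across that dimension */
            printf(" -> %u", node);
        }
    }
    printf("\n");
}

int main(void) {
    route(0, 7);   /* opposite corners: takes DIM = log2 N hops */
    route(5, 6);   /* addresses differ in two bits: two hops */
    return 0;
}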
NUMA systems
In NonUniform Memory Access (NUMA) systems each node has local memory but can also access the memory of other nodes. Most memory accesses are to local memory. The memory of other nodes can be accessed, but the access is slower than local memory. Cache coherence algorithms are used when shared memory is accessed. NUMA is more scalable than SMP.
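On Linux, memory placement on a particular node can be requested explicitly. A minimal sketch using the libnuma API (assuming libnuma is installed and the program is linked with -lnuma):

#include <stdio.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {      /* kernel or hardware provides no NUMA support */
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    printf("nodes 0..%d\n", numa_max_node());

    /* allocate 1 MiB placed on node 0: local (fast) for CPUs on that node,
       remote (slower) for CPUs on any other node */
    size_t size = 1 << 20;
    void *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL)
        return 1;
    /* ... use buf ... */
    numa_free(buf, size);
    return 0;
}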
[Diagram: NUMA system]