High performance computing

Lectures
BY ME @😁🪄
Introduction to HPC
🪄

What is HPC?
• The ability to process data and perform complex calculations efficiently, reliably, and at high speed.
• Uses supercomputers and computer clusters to solve advanced computation problems.
• The practice of using parallel data processing to improve computing performance and perform complex calculations.

Why is HPC important?
• As technologies like the Internet of Things (IoT), artificial intelligence (AI), and 3-D imaging evolve (three HPC application areas), the size and amount of data that organizations have to work with grows exponentially.

Top HPC Industries
• Healthcare, urban planning
• Engineering, finance, and business
• Aerospace

What is a Supercomputer?
• A computer with a high level of performance compared to a general-purpose computer.
• The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS): a measure of a computer's performance based on the number of floating-point arithmetic calculations that the processor can perform within a second.
• Since 2017, there have existed supercomputers that can perform over 10^17 FLOPS.
• For comparison, a desktop computer has performance in the range of hundreds of giga-FLOPS (10^11) to tens of tera-FLOPS (10^13).

Fields using Supercomputers
• Computational science, weather forecasting, climate research, oil and gas exploration, molecular modeling, nuclear weapons, nuclear fusion.

What is a Cluster?
• A computer cluster is a group of two or more computers, or nodes, that run in parallel to achieve a common goal.
• A cluster is a group of interconnected computers or hosts that work together to support applications.
• Note: big companies prefer using clusters over supercomputers because they have a large number of users.
• The performance of a program depends on the effectiveness of:
1. Algorithms.
2. The software system (OS/compiler).
3. The computer that executes the machine instructions (processor and memory).
4. I/O systems.

- Algorithms: An algorithm is a step-wise representation of a solution to a given problem.


• Searching Algorithms:
1- Linear Search:
• A very basic and simple search algorithm.
• In linear search, we search for an element or value in a given array by traversing the array from the start until the desired element or value is found.
2- Binary Search:
• It is mandatory for the target array to be sorted.
• In binary search, we first determine the middle of the array using the formula mid = low + (high - low) / 2, and then decide which half to search next based on whether the target value is greater or less than the value at mid (see the sketch below).
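For illustration, here is a minimal C sketch of both searches (not from the lecture; the example array, its length n, and the target value are made-up placeholders):

#include <stdio.h>

/* Linear search: scan from the start until the value is found. */
int linear_search(const int a[], int n, int target) {
    for (int i = 0; i < n; i++)
        if (a[i] == target) return i;
    return -1;                                   /* not found */
}

/* Binary search: the array must already be sorted. */
int binary_search(const int a[], int n, int target) {
    int low = 0, high = n - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;        /* formula from the notes */
        if (a[mid] == target) return mid;
        else if (a[mid] < target) low = mid + 1; /* target is in the upper half */
        else high = mid - 1;                     /* target is in the lower half */
    }
    return -1;
}

int main(void) {
    int a[] = {2, 5, 8, 12, 16, 23, 38};         /* sorted example data */
    int n = sizeof a / sizeof a[0];
    printf("linear: %d, binary: %d\n", linear_search(a, n, 23), binary_search(a, n, 23));
    return 0;
}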
• Sorting Algorithms:
1- The selection sort:
• For ascending order, a selection sort looks for the largest value as it makes a pass and, after completing the pass, places it in its proper location.
Advantages:
▪ Implementation is very easy.
▪ Suitable for ordering small lists.
▪ Sorts in place; it needs no extra memory, since it swaps numbers within their original locations.
Disadvantages:
▪ Slow.
▪ Blind algorithm: even if the list becomes sorted partway through, it does not terminate early and still performs all the steps.
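A minimal C sketch of the variant described above (each pass selects the largest remaining value); an illustration, not taken from the slides:

/* Selection sort, ascending: each pass finds the largest remaining value
   and swaps it into its final position at the end of the unsorted part. */
void selection_sort(int a[], int n) {
    for (int end = n - 1; end > 0; end--) {
        int max_idx = 0;
        for (int i = 1; i <= end; i++)
            if (a[i] > a[max_idx]) max_idx = i;
        int tmp = a[max_idx];                    /* place the pass's largest value */
        a[max_idx] = a[end];                     /* in its proper location         */
        a[end] = tmp;
    }
}

Note that the passes run unconditionally, which is exactly the "blind" behaviour listed as a disadvantage.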

2- The bubble sort:
• Makes multiple passes through a list. It compares adjacent items and exchanges those that are out of order.
Advantages:
▪ Simple, suitable for ordering small lists.
▪ Sorts in place.
▪ Not blind: if a whole pass makes no swap, the list is already in the right order and the algorithm stops.
Disadvantages:
▪ Slow.
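A matching C sketch (again an illustration, not from the slides) showing the early exit that makes bubble sort "not blind":

/* Bubble sort, ascending, with early termination:
   if a full pass performs no exchange, the list is sorted and we stop. */
void bubble_sort(int a[], int n) {
    for (int pass = 0; pass < n - 1; pass++) {
        int swapped = 0;
        for (int i = 0; i < n - 1 - pass; i++) {
            if (a[i] > a[i + 1]) {
                int tmp = a[i]; a[i] = a[i + 1]; a[i + 1] = tmp;
                swapped = 1;
            }
        }
        if (!swapped) break;                     /* no exchange: already in the right order */
    }
}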
Set of Comparisons:
Algorithm: Linear Search   Binary Search   Selection Sort   Bubble Sort
Big-O:     O(n)            O(log n)        O(n^2)           O(n^2)

Main types of HPC

The HPC clusters:
• Clusters of servers interconnected using a high-speed connection.
• Their sizes range from tens of servers to tens of thousands of servers (10-10,000).
• Solve problems by dividing them into smaller problems (divide and conquer).
• The most popular form of HPC.

Supercomputer:
• Provides high-performance chips.
• Provides processors that work on large amounts of data (vectorization).
• Provides more than one CPU on the same chip.
• CPUs on the same chip share the address space (memory).
• Solves problems faster.
Sequential and Parallel computing
🪄

Sequential computing:
Traditionally, software has been written for serial computation:
• To be run on a single computer having a single Central Processing Unit (CPU).
• A problem is broken into a discrete series of instructions.
• Instructions are executed one after another.
• Only one instruction may execute at any moment in time.

Parallel computing:
A parallel computer is a computer consisting of:
• two or more processors that can cooperate and communicate to solve a large problem fast,
• one or more memory modules,
• an interconnection network that connects the processors with each other and/or with the memory modules.

Example of parallelizable problems


• Calculate the potential energy of each of several thousand independent conformations of a molecule. When done, find the minimum-energy conformation. (Each conformation here is independent, similar to an item of a matrix.)
o The problem can be solved in parallel: each molecular conformation is determined independently.
o The calculation of the minimum-energy conformation is also a parallelizable problem.

Example of non-parallelizable problems


• Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, …) using the formula:
F(k+2) = F(k+1) + F(k)
• This is a non-parallelizable problem because the calculation, as shown, entails dependent rather than independent calculations: each term needs the two terms computed before it (see the sketch below).
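To make the dependency concrete, here is a minimal C loop (an illustration, not from the slides); an iteration cannot start until the two previous results exist, so the iterations cannot run concurrently:

#include <stdio.h>

int main(void) {
    long long F[50];
    F[0] = 1; F[1] = 1;                    /* the series from the notes: 1, 1, 2, 3, 5, ... */
    for (int k = 0; k + 2 < 50; k++)
        F[k + 2] = F[k + 1] + F[k];        /* each step depends on the two previous iterations */
    printf("F[49] = %lld\n", F[49]);
    return 0;
}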
Amdahl’s Law
_________________________________________________________________________________

• Used to predict the maximum speedup obtainable when using multiple processors.

Let f = fraction of work performed sequentially, so that
(1 − f) = fraction of work that is parallelizable, and
P = number of processors.

On 1 CPU:          T_1 = f + (1 − f) = 1
On P processors:   T_P = f + (1 − f) / P
Speedup:           S = T_1 / T_P = 1 / ( f + (1 − f) / P ) < 1 / f

Foster’s Design methodology


_________________________________________________________________________________

Problem → Partition → Communicate → Agglomerate → Map


Parallel computer memory Architecture
_________________________________________________________________________________

Shared memory:
• Multiple processors can operate independently but share the same memory resources.
• Changes in a memory location made by one processor are visible to all other processors (global address space).
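For concreteness, a minimal shared-memory sketch in C. OpenMP is used here as one common shared-memory programming model; it is an assumption of this example, not something named in the notes. All threads read and write the same array in the shared address space:

#include <omp.h>
#include <stdio.h>

int main(void) {
    double a[1000];

    /* Every thread sees the same array 'a' (global address space);
       each thread fills a different part of it independently. */
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        a[i] = i * 0.5;

    /* The update made by whichever thread ran i = 999 is visible here. */
    printf("%f\n", a[999]);
    return 0;
}

(Compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp.)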

Distributed memory:
• Requires a communication network to connect the processors.
• Each processor has its own local memory, so:
✓ it operates independently;
✓ changes it makes to its local memory have no effect on the memory of other processors.
• When a processor needs to access data in another processor, it is the task of the programmer to define how and when.

Shared memory:
• Global address space.
• Lack of scalability between memory and the number of CPUs connected to it.
• Adding more CPUs increases traffic.
• The programmer is responsible for specifying synchronization (read & write).

Distributed memory:
• Memory is scalable with the number of processors.
• Each processor can rapidly access its own memory.
• Needs a strong security system.

Hybrid Distributed-Shared Memory(DSM)


• Increased scalability is an important advantage.
• Increased programmer complexity is an important
disadvantage.
Data Parallelism – Flynn's Classical Taxonomy
_________________________________________________________________________________

• Flynn’s taxonomy distinguishes multi-processor computer architectures according to:


o how they can be classified along the two independent dimensions of Instruction Stream and Data Stream.

Single Instruction, Single Data (SISD):
• A serial (non-parallel) computer.
• Single Instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
• Single Data: only one data stream is being used as input during any one clock cycle.
• Deterministic execution.
• This is the oldest type of computer.

Single Instruction, Multiple Data (SIMD):
• A type of parallel computer.
• Single Instruction: all processing units execute the same instruction at any given clock cycle.
• Multiple Data: each processing unit can operate on a different data element.

Multiple Instruction, Single Data (MISD):
• A type of parallel computer.
• Multiple Instruction: each processing unit operates on the data independently via separate instruction streams.
• Single Data: a single data stream is fed into multiple processing units.

Multiple Instruction, Multiple Data (MIMD):
• A type of parallel computer.
• Multiple Instruction: every processor may be executing a different instruction stream.
• Multiple Data: every processor may be working with a different data stream.
Amdahl’s law
🪄

• If the program has a given percentage of parallelism (part serial and part parallel):

speedup = 1 / ( (1 − p) + p / n )

o n → number of cores
o p → fraction of the program's work that is parallel
o hint: (1 − p) → fraction of the program's work that is serial
Ex.1 on Amdahl's law
• Using 3 cores and a 75% parallelized program:
o (1 − p) → fraction of the program's work that is serial = 1 − 0.75 = 0.25
o p → 0.75, n → 3
o speedup = 1 / ( 0.25 + 0.75 / 3 ) = 1 / 0.5 = 2

• The parallelizable portion runs 3 times faster with 3 cores compared to 1 core.
• The non-parallelizable portion runs at the same speed as on 1 core.
• Combining the two parts, the overall speedup is 2×. The 25% of the work that cannot be sped up by adding more cores limits the overall speedup; even with unlimited cores, this program could never exceed 1/0.25 = 4× speedup.
Ex.2: Which is better?
• A 10% parallelized program on 90 cores, or a 90% parallelized program on 10 cores?
o Scenario 1 (10% parallelized program on 90 cores): speedup S1 = 1 / ( 0.90 + 0.10/90 ) ≈ 1.11
o Scenario 2 (90% parallelized program on 10 cores): speedup S2 = 1 / ( 0.10 + 0.90/10 ) ≈ 5.26

• Given the choice between more parallelism in the program and more cores, prefer the higher degree of parallelization: a mostly serial program cannot benefit from extra cores, whereas a highly parallel program makes good use of even a modest number of cores.
• So the smart choice is to focus on maximizing the degree of parallelization in your program, in order to make the most of the available computational resources and achieve better performance.
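A small C helper (a sketch, not part of the notes) that evaluates the formula above; running it reproduces the numbers from Ex.1 and Ex.2:

#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - p) + p / n),
   where p is the parallel fraction and n the number of cores. */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    printf("75%% parallel on  3 cores: %.2f\n", amdahl_speedup(0.75, 3));   /* 2.00  */
    printf("10%% parallel on 90 cores: %.2f\n", amdahl_speedup(0.10, 90));  /* ~1.11 */
    printf("90%% parallel on 10 cores: %.2f\n", amdahl_speedup(0.90, 10));  /* ~5.26 */
    return 0;
}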
Effectiveness of parallel processing
_________________________________________________________________________________

• p → number of processors
• W(p) → number of operations (total work) performed using p processors
• T(p) → execution time using p processors

Speedup:      S(p) = T(1) / T(p)   (serial time / parallel time)
Efficiency:   E(p) = T(1) / (p · T(p)) = S(p) / p
Redundancy:   R(p) = W(p) / W(1)
Utilization:  U(p) = W(p) / (p · T(p))
Quality:      Q(p) = T(1)^3 / (p · T(p)^2 · W(p))
Example of measuring efficiency
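Since the worked example itself is not reproduced in these notes, the following C sketch shows how the five measures would be computed; the measured values (T(1)=100 and T(4)=30 time units, W(1)=100 and W(4)=120 operations on p=4 processors) are hypothetical, made up purely for illustration:

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Hypothetical measurements, not from the lecture: */
    double T1 = 100.0, Tp = 30.0;   /* serial and parallel execution times          */
    double W1 = 100.0, Wp = 120.0;  /* operations performed serially / in parallel  */
    int    p  = 4;                  /* number of processors                         */

    double S = T1 / Tp;                          /* speedup     */
    double E = S / p;                            /* efficiency  */
    double R = Wp / W1;                          /* redundancy  */
    double U = Wp / (p * Tp);                    /* utilization */
    double Q = pow(T1, 3) / (p * Tp * Tp * Wp);  /* quality     */

    printf("S=%.2f  E=%.2f  R=%.2f  U=%.2f  Q=%.2f\n", S, E, R, U, Q);
    return 0;
}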
Multiprocessor Interconnection networks
🪄

Mode of operation
Synchronous:
• A single global clock is used by all components in the system (lock-step manner).
Asynchronous:
• No global clock is required.
• Handshaking signals are used to coordinate the operation of asynchronous systems.

Control strategy
Centralized:
• One central control unit is used to control the operations of the components of the system.
Decentralized:
• The control function is distributed among the different components in the system.

Switching Techniques
Circuit switching:
• A complete path has to be established prior to the start of communication between a source and a destination.
Packet switching:
• Communication between a source and a destination takes place via messages divided into smaller entities, called packets.
Topology
• Describes how processors and memories are connected to other processors and memories.
Static:
• Direct fixed links are established among the nodes to form a fixed network.
• Paths are fixed; links may be uni-directional or bi-directional between processors.
• For a completely connected network: number of links O(N^2), delay complexity O(1).
Dynamic:
• Connections are established when needed.

Static INs

Dynamic INs

Bus-Based Dynamic INs

Single Bus:
- Simplest way to connect multiprocessor systems.
- The use of local caches reduces the processor-memory traffic.
- The size of such a system varies between 2 and 50 processors.
- Single bus multiprocessors are inherently limited by:
1. The bandwidth of the bus.
2. Only one processor can access the bus at a time.
3. Only one memory access can take place at any given time.

Multiple Bus:
- Several parallel buses interconnect multiple processors and multiple memory modules.
- Many connection schemes are possible:
• Multiple Bus with Full Bus-Memory Connection (MBFBMC).
• Multiple Bus with Single Bus-Memory Connection (MBSBMC).
• Multiple Bus with Partial Bus-Memory Connection (MBPBMC).
• Multiple Bus with Class-based Bus-Memory Connection (MBCBMC).

Switch-Based Dynamic INs

Crossbar:
• Provides simultaneous connections among all its inputs and all its outputs.
• A Switching Element (SE) is at the intersection of any two lines extended horizontally or vertically inside the switch.
• It is a non-blocking network, allowing multiple input-output connection patterns to be achieved simultaneously.
Symmetric and asymmetric multiprocessors
Symmetric:
• All processors have equal access to all peripheral devices.
• All processors are identical.
Asymmetric:
• One processor (the master) executes the operating system.
• Other processors may be of different types and may be dedicated to special tasks.

Parallel Computer Memory Architectures


• Two (three) categories of parallel computers are distinguished on the basis of their memory organization:
o Shared Memory:
1) Uniform Memory Access (UMA)
2) Non-Uniform Memory Access (NUMA)
3) Cache Only Memory Access (COMA)
o Distributed Memory
o Hybrid Memory

Shared Memory
General Characteristics:
o Shared memory parallel computers vary widely, but generally have in common the ability
for all processors to access all memory as global address space.
o Multiple processors can operate independently but share the same memory resources.
o Changes in a memory location effected by one processor are visible to all other processors.
Uniform Memory Access (UMA):
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
• Identical processors, with equal access and equal access times to memory.
• Sometimes called Cache Coherent UMA (CC-UMA).
• Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update.
• Cache coherence is accomplished at the hardware level.

Non-Uniform Memory Access (NUMA):
• Often made by physically linking two or more SMPs.
• One SMP can directly access the memory of another SMP.
• Not all processors have equal access times to all memories.
• Memory access across the link is slower.
• If cache coherency is maintained, it may also be called CC-NUMA.

Cache Only Memory Access (COMA):
• The Cache-Only Memory Architecture increases the chances of data being available locally, because the hardware transparently replicates the data and migrates it to the memory module of the node that is currently accessing it.
• Each memory module acts as a huge cache memory in which each block has a tag with the address and the state.
• Data can be migrated or replicated in the various memory banks of the central main memory.
MIMD processing

Tightly coupled multiprocessors:
• Shared global memory address space.
• Traditional multiprocessing: symmetric multiprocessing (SMP).
• Programming model similar to uniprocessors.
• Operations on shared data require synchronization.

Loosely coupled multiprocessors:
• No shared global memory address space.
• Multicomputer network: a "network-based multiprocessor".
• Usually programmed via message passing: explicit calls (send, receive) for communication.
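To illustrate the message-passing style mentioned above, here is a minimal MPI sketch in C; MPI is assumed here as a typical message-passing library, since the notes only speak of explicit send/receive calls:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;                                      /* lives in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* explicit send to processor 1   */
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                         /* explicit receive from rank 0   */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Because there is no shared address space, the only way rank 1 can see rank 0's data is through this explicit communication.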

Why the Sequential Bottleneck?


• Parallel machines have a sequential bottleneck.
• Main cause: non-parallelizable operations on data (e.g. non-parallelizable loops):

for (i = 1; i < N; i++)
    A[i] = (A[i] + A[i-1]) / 2;   /* loop-carried dependency: iteration i needs the result of iteration i-1 */

• A single thread prepares the data and spawns the parallel tasks; that part is usually sequential.

Task Assignments

Static Assignment:
• No movement of tasks between processors.
• Inefficient: it underutilizes resources when the load is not balanced, because it assumes all processors execute roughly the same number of instructions.
• Ex: multiplying matrix blocks that are all zeros or ones can be faster than multiplying arbitrary values, so equally sized tasks may take unequal time.

Dynamic Assignment:
• Efficient: it better utilizes resources when the load is not balanced.
• Ex: computing a histogram (distribution) of a large set of values, or counting the number of occurrences of specific words in a book.
• Each task could count the occurrences of each word in a set of pages; some pages may be empty or contain few sentences, so those tasks will finish faster (see the scheduling sketch below).
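As a concrete illustration (assumed, not from the notes), OpenMP exposes both policies through its schedule clause; with uneven work per iteration, dynamic scheduling keeps all threads busy:

#include <omp.h>
#include <stdio.h>

/* Simulated uneven work: some "pages" take much longer than others. */
static long work(int page) {
    long s = 0;
    for (long i = 0; i < (page % 7) * 1000000L; i++) s += i;
    return s;
}

int main(void) {
    long total = 0;

    /* Static assignment: iterations are divided among the threads once, up front. */
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int page = 0; page < 1000; page++) total += work(page);

    /* Dynamic assignment: an idle thread grabs the next chunk of iterations,
       which balances the load when iterations take unequal time. */
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (int page = 0; page < 1000; page++) total += work(page);

    printf("%ld\n", total);
    return 0;
}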
Parallel speedup example
R = a4·x^4 + a3·x^3 + a2·x^2 + a1·x + a0

• Assume each operation takes 1 second and there is no communication cost.


• How fast is this with a single processor?
• How fast is this with 3 processors?
