HPC Lectures 1–5
Introduction to HPC
Fields that use supercomputers: computational science, weather forecasting, climate research, oil and gas exploration, molecular modeling, nuclear weapons, nuclear fusion.
Shared memory:
• Multiple processors can operate independently but share the same memory resources.
• Changes in a memory location made by one processor are visible to all other processors (global address space); see the sketch after this list.
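A minimal shared-memory sketch (my own illustration, not from the lecture), using POSIX threads in C: both threads live in one address space, so the update each makes to the global counter is visible to the other.

#include <pthread.h>
#include <stdio.h>

/* One counter in the single, global address space shared by all threads. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* coordinate access to the shared location */
        counter++;                   /* this change is visible to every thread   */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* expect 200000: both threads saw the same memory */
    return 0;
}

(Compile with something like cc -pthread shared.c.)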
Distributed memory:
• Requires a communication network to connect processors.
• Each processor has its own local memory, so:
✓ it operates independently;
✓ changes it makes to its local memory have no effect on the memory of other processors.
• When a processor needs to access data that lives in another processor's memory, it is the task of the programmer to define how and when the data is communicated (see the sketch after this list).
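By contrast, a minimal distributed-memory sketch (assuming MPI, which the lecture does not name): each rank owns its own memory, and a value only appears in another rank's memory because the programmer explicitly sends and receives it.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                          /* exists only in rank 0's local memory        */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* programmer decides how and when data moves  */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);               /* now a separate copy in rank 1's memory      */
    }

    MPI_Finalize();
    return 0;
}

/* Run with at least two processes, e.g.: mpirun -np 2 ./a.out */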
Single Instruction, Single Data (SISD)
• A serial (non-parallel) computer.
• Single Instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
• Single Data: only one data stream is being used as input during any one clock cycle.
• Deterministic execution.
• This is the oldest type of computer.
Single Instruction, Multiple Data (SIMD)
• A type of parallel computer.
• Single Instruction: all processing units execute the same instruction at any given clock cycle.
• Multiple Data: each processing unit can operate on a different data element.
Multiple Instruction, Single Data (MISD)
• A type of parallel computer.
• Multiple Instruction: each processing unit operates on the data independently via separate instruction streams.
• Single Data: a single data stream is fed into multiple processing units.
Multiple Instruction, Multiple Data (MIMD)
• A type of parallel computer.
• Multiple Instruction: every processor may be executing a different instruction stream.
• Multiple Data: every processor may be working with a different data stream.
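As a rough illustration of the SIMD idea (my own example, not from the slides): a loop that applies the same operation to every element of an array maps naturally onto SIMD hardware, because one instruction can process several data elements at once. MIMD, by contrast, corresponds to independent threads or processes each running their own instruction stream, as in the shared- and distributed-memory sketches above.

#include <stdio.h>
#include <stddef.h>

/* SIMD-friendly loop: the same "add" operation is applied to many data
   elements, so a vectorizing compiler can issue one instruction that
   processes several elements of a, b, and c at once. */
static void vector_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    vector_add(a, b, c, 4);
    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);   /* 11 22 33 44 */
    return 0;
}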
Amdahl’s law
• If we have a program that is only partly parallel (part serial and part parallel), the speedup on n cores is:

speedup = 1 / ((1 − p) + p/n)

o n → number of cores
o p → fraction (percentage) of the program's work that runs in parallel
o hint: (1 − p) → fraction of the program's work that is serial
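A tiny helper (my own sketch) for plugging numbers into Amdahl's law; the printed values match Ex.1 and Ex.2 below.

#include <stdio.h>

/* Amdahl's law: p = parallel fraction of the work, n = number of cores. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    printf("p=0.75, n=3  -> speedup = %.2f\n", amdahl_speedup(0.75, 3));   /* 2.00       */
    printf("p=0.10, n=90 -> speedup = %.2f\n", amdahl_speedup(0.10, 90));  /* about 1.11 */
    printf("p=0.90, n=10 -> speedup = %.2f\n", amdahl_speedup(0.90, 10));  /* about 5.26 */
    return 0;
}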
Ex.1 on Amdahl’s law
• Using 3 cores and a 75% parallelized program:
o (1 − p) → fraction of the program's work that is serial = 1 − 0.75 = 0.25
o p → 75%, n → 3
o speedup = 1 / (0.25 + 0.75/3) = 1 / 0.5 = 2
• The parallelizable portion runs 3 times faster with 3 cores compared to 1 core.
• The non-parallelizable portion runs at the same speed whether there is 1 core or many.
• When you combine these two parts, the overall speedup is 2 times. This is because you still have that 25% of the work that cannot be sped up by adding more cores, which limits the overall speedup to 2 times in this scenario (and to at most 1/0.25 = 4 times no matter how many cores are added).
Ex.2 : Which is better ?
• A 10% parallelized program on 90 cores, or a 90% parallelized program on 10 cores?
Scenario 1 (10% parallelized program on 90 cores): speedup S1 = 1 / (0.90 + 0.10/90) ≈ 1.11
Scenario 2 (90% parallelized program on 10 cores): speedup S2 = 1 / (0.10 + 0.90/10) ≈ 5.26
• If you have two options and one has a larger fraction of the work parallelized while the other simply has more cores, go for the larger parallel fraction if you want things to get done faster: it is the serial part of the program, not the number of workers, that limits the overall speedup.
• So the smart choice is to focus on maximizing the degree of parallelization in your program to make the most of the available computational resources and achieve better performance.
Effectiveness of parallel processing
_________________________________________________________________________________
• p → number of processors (processes) used
• W(p) → total number of operations performed when p processors are used
• T(p) → execution time when p processors are used
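The notes list only the symbols; as a labelled assumption, the usual way these quantities are combined is speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p, as in the small sketch below.

#include <stdio.h>

/* Assumed standard definitions (not spelled out in the notes):
   speedup    S(p) = T(1) / T(p)
   efficiency E(p) = S(p) / p     */
int main(void)
{
    double t1 = 12.0;   /* hypothetical serial time T(1), in seconds          */
    double tp = 4.0;    /* hypothetical time T(p) measured with p processors  */
    int    p  = 4;

    double speedup    = t1 / tp;        /* 3.00                         */
    double efficiency = speedup / p;    /* 0.75 -> 75% of ideal speedup */

    printf("S(%d) = %.2f, E(%d) = %.2f\n", p, speedup, p, efficiency);
    return 0;
}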
Mode of operation
Synchronous
• A single global clock is used by all components in the system (lock-step manner).
Asynchronous
• No global clock is required.
• Handshaking signals are used to coordinate the operation of asynchronous systems.
Control strategy
Centralized
• One central control unit is used to control the operations of the components of the system.
Decentralized
• The control function is distributed among the different components in the system.
Switching Techniques
Circuit switching
• A complete path has to be established prior to the start of communication between a source and a destination.
Packet switching
• Communication between a source and a destination takes place via messages divided into smaller entities, called packets.
Topology
• Describes how to connect processors and memories to other processors and memories.
Static
• Direct fixed links are established among nodes to form a fixed network.
• Paths are fixed; links between processors can be uni-directional or bi-directional.
Dynamic
• Connections are established when needed.
Static INs
• Example: the completely connected network, with number of links O(N²) and delay complexity O(1).
Dynamic INs
Bus-Based Dynamic INs
• Single bus:
- Simplest way to connect multiprocessor systems.
- The use of local caches reduces the processor-memory traffic.
- The size of such a system varies between 2 and 50 processors.
- Single-bus multiprocessors are inherently limited by:
1. The bandwidth of the bus.
2. Only 1 processor can access the bus at a time.
3. Only 1 memory access can take place at any given time.
• Multiple bus, with several bus-memory connection schemes:
- Multiple Bus with Full Bus-Memory Connection (MBFBMC).
- Multiple Bus with Single Bus-Memory Connection (MBSBMC).
- Multiple Bus with Partial Bus-Memory Connection (MBPBMC).
- Multiple Bus with Class-based Bus-Memory Connection (MBCBMC).
Switch-Based Dynamic INs
• Provide simultaneous connections among all their inputs and all their outputs.
Shared Memory
General Characteristics:
o Shared memory parallel computers vary widely, but they generally have in common the ability for all processors to access all memory as a global address space.
o Multiple processors can operate independently but share the same memory resources.
o Changes in a memory location made by one processor are visible to all other processors.
Uniform Memory Access (UMA)
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
• Identical processors, with equal access and equal access times to memory.
• Sometimes called Cache Coherent UMA (CC-UMA).
• Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update.
• Cache coherence is accomplished at the hardware level.
Non-Uniform Memory Access (NUMA)
• Often made by physically linking two or more SMPs.
• One SMP can directly access the memory of another SMP.
• Not all processors have equal access time to all memories.
• Memory access across the link is slower.
• If cache coherency is maintained, it may also be called CC-NUMA.
Cache Only Memory Access (COMA)
• The Cache-Only Memory Architecture (COMA) increases the chances of data being available locally, because the hardware transparently replicates the data and migrates it to the memory module of the node that is currently accessing it.
• Each memory module acts as a huge cache memory in which each block has a tag with the address and the state.
• Data can be migrated or replicated in the various memory banks of the central main memory.
MIMD processing
Task Assignments