Unit 2
Parallel Programming
Syllabus
• Principles of Parallel Algorithm Design: Preliminaries,
Decomposition Techniques, Characteristics of Tasks and
Interactions, Mapping Techniques for Load Balancing, Methods for
Containing Interaction Overheads, Parallel Algorithm Models
• Processor Architecture, Interconnect, Communication, Memory
Organization, and Programming Models in high performance
computing architecture examples: IBM CELL BE, Nvidia Tesla GPU,
Intel Larrabee Micro architecture and Intel Nehalem micro
architecture
• Memory hierarchy and transaction specific memory design, Thread
Organization
Preliminaries: Decomposition, Tasks, and
Dependency Graphs
• Observations:
– Tasks share the vector b but they have no control dependencies.
– There are no edges in the task-dependency graph.
– All tasks are of the same size in terms of the number of operations.
• Is this the maximum number of tasks we could decompose this
problem into?
Example: Database Query Processing
Consider the execution of the query:
MODEL = “CIVIC” AND YEAR = “2001” AND
(COLOR = “GREEN” OR COLOR = “WHITE”)
on the following database:
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
Example: Database Query Processing
• Assume the query is divided into four subtasks
– Each task generates an intermediate table of entries
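A minimal sketch of this decomposition in plain Python (the subtask split into MODEL, YEAR, and the two COLOR scans is from the slide; the set-based representation of the intermediate tables is an illustrative choice). Dealer and price columns are omitted since the query does not use them.

```python
# (id, model, year, color) rows of the table above; dealer/price omitted
ROWS = [
    (4523, "Civic",   2002, "Blue"),
    (3476, "Corolla", 1999, "White"),
    (7623, "Camry",   2001, "Green"),
    (9834, "Prius",   2001, "Green"),
    (6734, "Civic",   2001, "White"),
    (5342, "Altima",  2001, "Green"),
    (3845, "Maxima",  2001, "Blue"),
    (8354, "Accord",  2000, "Green"),
    (4395, "Civic",   2001, "Red"),
    (7352, "Civic",   2002, "Red"),
]

# Four independent subtasks; each scans the table once and produces an
# intermediate table (here, a set of matching IDs).
civic = {r[0] for r in ROWS if r[1] == "Civic"}   # Task 1: MODEL = "Civic"
y2001 = {r[0] for r in ROWS if r[2] == 2001}      # Task 2: YEAR = 2001
green = {r[0] for r in ROWS if r[3] == "Green"}   # Task 3: COLOR = "Green"
white = {r[0] for r in ROWS if r[3] == "White"}   # Task 4: COLOR = "White"

# Combining steps depend on the subtask outputs (edges in the task graph).
green_or_white = green | white
result = civic & y2001 & green_or_white
print(sorted(result))  # [6734] -- the only row satisfying the whole query
```

The four scans have no dependencies among themselves and could run concurrently; only the combining steps must wait for their inputs.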
• Processes (not in the UNIX sense): logical computing agents that perform tasks
– Task + task data + task code required to produce the task’s output
• Decomposition:
– The process of dividing the computation into smaller pieces of work, i.e., tasks
procedure SERIAL_MIN(A, n)
begin
   min := A[0];
   for i := 1 to n − 1 do
      if (A[i] < min) min := A[i];
   endfor;
   return min;
end SERIAL_MIN
Example: Finding the Minimum
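The SERIAL_MIN procedure above can be decomposed by partitioning the input into chunks, letting one task find a local minimum per chunk, and reducing the partial results. A hedged sketch (sequential Python standing in for concurrent tasks; the function names are illustrative):

```python
def local_min(chunk):
    # One task: the same loop as SERIAL_MIN, over a chunk of the input.
    m = chunk[0]
    for x in chunk[1:]:
        if x < m:
            m = x
    return m

def parallel_min(A, ntasks):
    size = (len(A) + ntasks - 1) // ntasks        # ceiling division
    chunks = [A[i:i + size] for i in range(0, len(A), size)]
    partials = [local_min(c) for c in chunks]     # independent tasks
    return local_min(partials)                    # final reduction step

print(parallel_min([4, 9, 1, 7, 8, 11, 2, 12], 4))  # 1
```

The chunk tasks have no dependencies on one another; only the final reduction depends on all of them.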
• In the frequency counting example, the input (i.e., the transaction set) can
be partitioned.
– This induces a task decomposition in which each task generates partial counts
for all itemsets. These are combined subsequently for aggregate counts.
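A sketch of that induced decomposition, using an assumed toy transaction set (the data and itemset choices are illustrative, not from the slide): each task counts all itemsets over its own partition of the transactions, and the partial counts are summed afterwards.

```python
from collections import Counter

transactions = [
    {"bread", "milk"}, {"bread", "beer"}, {"milk", "beer"},
    {"bread", "milk"}, {"bread", "milk", "beer"}, {"milk"},
]
itemsets = [frozenset({"bread", "milk"}), frozenset({"beer"})]

def partial_counts(part):
    # One task: counts every itemset, but only over its own partition.
    c = Counter()
    for t in part:
        for s in itemsets:
            if s <= t:          # itemset contained in transaction
                c[s] += 1
    return c

# Partition the input between two tasks, then combine the partial counts.
half = len(transactions) // 2
total = partial_counts(transactions[:half]) + partial_counts(transactions[half:])
print(total[frozenset({"bread", "milk"})])  # 3
print(total[frozenset({"beer"})])           # 3
```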
Partitioning Input and Output Data
Intermediate Data Partitioning
Stage I
Task 01: D1,1,1 = A1,1 B1,1 Task 02: D2,1,1 = A1,2 B2,1
Task 03: D1,1,2 = A1,1 B1,2 Task 04: D2,1,2 = A1,2 B2,2
Task 05: D1,2,1 = A2,1 B1,1 Task 06: D2,2,1 = A2,2 B2,1
Task 07: D1,2,2 = A2,1 B1,2 Task 08: D2,2,2 = A2,2 B2,2
Stage II
Task 09: C1,1 = D1,1,1 + D2,1,1 Task 10: C1,2 = D1,1,2 + D2,1,2
Task 11: C2,1 = D1,2,1 + D2,2,1 Task 12: C2,2 = D1,2,2 + D2,2,2
Intermediate Data Partitioning: Example
• Performs the same or more aggregate work than the sequential algorithm,
but never less
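The twelve-task decomposition above can be sketched directly for a 2×2 block product C = A·B, treating each block as a scalar (the concrete values of A and B are illustrative): eight independent tasks compute the intermediate products D, and four tasks then sum them into C.

```python
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

# Stage I: D[k][i][j] = A[i][k] * B[k][j] -- tasks 1..8, no dependencies.
D = [[[A[i][k] * B[k][j] for j in range(2)] for i in range(2)]
     for k in range(2)]

# Stage II: C[i][j] = D[0][i][j] + D[1][i][j] -- tasks 9..12, each depending
# on exactly two Stage I tasks.
C = [[D[0][i][j] + D[1][i][j] for j in range(2)] for i in range(2)]
print(C)  # [[19, 22], [43, 50]]
```

The intermediate array D is the extra ("more aggregate work") storage and addition traffic that this decomposition pays for in exchange for twelve independent or loosely coupled tasks.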
Example: Discrete Event Simulation
Speculative Execution
• Block Distribution
– Used to load-balance a variety of parallel computations that operate on
multi-dimensional arrays
• Cyclic Distribution
• Block-Cyclic Distribution
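The three distributions differ only in which process owns a given array index. A minimal sketch for a one-dimensional array of n elements over p processes (block size b for block-cyclic; the function names are illustrative):

```python
def block_owner(i, n, p):
    return i // ((n + p - 1) // p)   # contiguous blocks of ~n/p elements

def cyclic_owner(i, p):
    return i % p                     # round-robin, element by element

def block_cyclic_owner(i, p, b):
    return (i // b) % p              # round-robin in blocks of size b

n, p, b = 8, 2, 2
print([block_owner(i, n, p) for i in range(n)])         # [0,0,0,0,1,1,1,1]
print([cyclic_owner(i, p) for i in range(n)])           # [0,1,0,1,0,1,0,1]
print([block_cyclic_owner(i, p, b) for i in range(n)])  # [0,0,1,1,0,0,1,1]
```

Block-cyclic interpolates between the two extremes: b = 1 gives the cyclic distribution, b = n/p gives the block distribution.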
• For example, the task mapping of the binary task-dependency tree of
quicksort cannot use a large number of processes, since there are only a
few tasks near the root of the tree.
• For this reason, task mapping can be used at the top levels and data
partitioning within each task at the lower levels.
Hierarchical Mappings
• When a process runs out of work, it requests more work from the master.
• When the number of processes increases, the master may become the
bottleneck.
• Selecting large chunk sizes may lead to significant load imbalances as well.
• Work Pool Model: The work pool or task pool model is characterized by a
dynamic mapping of tasks onto processes for load balancing, in which any
task may potentially be performed by any process.
– There is no desired premapping of tasks onto processes; the mapping may
be centralized or decentralized.
– Pointers to the tasks may be stored in a physically shared list, priority
queue, hash table, or tree, or they could be stored in a physically
distributed data structure.
– The work may be statically available in the beginning, or could be
dynamically generated; i.e., the processes may generate work and add it to
the global (possibly distributed) work pool.
– If the work is generated dynamically and a decentralized mapping is used,
then a termination detection algorithm is required so that all processes
can actually detect the completion of the entire program (i.e., exhaustion
of all potential tasks) and stop looking for more work.
• Example: Parallelization of loops by chunk scheduling
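A minimal work-pool sketch of that loop example, assuming a centralized pool held in a shared queue (the loop body, chunk size, and thread count are illustrative): worker threads repeatedly take a chunk of iterations until the pool is empty, so any chunk may be executed by any worker.

```python
import queue
import threading

N, CHUNK = 100, 10
pool = queue.Queue()
for start in range(0, N, CHUNK):
    pool.put((start, min(start + CHUNK, N)))    # each task = one chunk

partial = []
lock = threading.Lock()

def worker():
    while True:
        try:
            lo, hi = pool.get_nowait()          # request more work
        except queue.Empty:
            return                              # pool exhausted: terminate
        s = sum(i * i for i in range(lo, hi))   # the loop body, one chunk
        with lock:
            partial.append(s)                   # publish the partial result

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(partial))  # 328350 = sum of squares 0..99
```

Because the work is statically available up front and the queue is centralized, no termination detection algorithm is needed here; an empty queue suffices.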