0% found this document useful (0 votes)

270 views

ACA Notes

This document describes an advanced computer architecture course. The course covers 8 units related to computer architecture fundamentals, pipelining, instruction-level parallelism, multiprocessing, memory hierarchy, hardware and software for VLIW and EPIC architectures. It includes information on topics, hours, textbooks, and a table of contents for the course material.

Uploaded by

Sharath Monappa

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

270 views

ACA Notes

Uploaded by

Sharath Monappa

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 156

Advance Computer Architecture

10CS74

A d v a n c e C o mp u t e r A r c h i t e c t u r e
Subject Code: 10CS74
Hours/Week : 04
Total Hours : 52

I.A. Marks : 25
Exam Hours: 03
Exam Marks: 100

PART - A
UNIT - 1
FUNDAMENTALS OF COMPUTER DESIGN: Introduction; Classes of computers;
Defining computer architecture; Trends in Technology, power in Integrated Circuits and
cost; Dependability; Measuring, reporting and summarizing Performance; Quantitative
Principles of computer design.
6 hours
UNIT - 2
PIPELINING: Introduction; Pipeline hazards; Implementation of pipeline; What makes
pipelining hard to implement?
6 Hours
UNIT - 3
INSTRUCTION LEVEL PARALLELISM 1: ILP: Concepts and challenges; Basic
Compiler Techniques for exposing ILP; Reducing Branch costs with prediction;
Overcoming Data hazards with Dynamic scheduling; Hardware-based speculation.
7 Hours
UNIT - 4
INSTRUCTION LEVEL PARALLELISM 2: Exploiting ILP using multiple issue
and static scheduling; Exploiting ILP using dynamic scheduling, multiple issue and
speculation; Advanced Techniques for instruction delivery and Speculation; The Intel
Pentium 4 as example.
7 Hours

Department of CSE, SJBIT

Page 1

Advance Computer Architecture

10CS74

PART - B
UNIT - 5
MULTIPROCESSORS AND THREAD LEVEL PARALLELISM: Introduction;
Symmetric shared-memory architectures; Performance of symmetric sharedmemory
multiprocessors; Distributed shared memory and directory-based coherence; Basics of
synchronization; Models of Memory Consistency. 7 Hours
UNIT - 6
REVIEW OF MEMORY HIERARCHY: Introduction; Cache performance; Cache
Optimizations, Virtual memory.
6 Hours
UNIT - 7
MEMORY HIERARCHY DESIGN: Introduction; Advanced optimizations of Cache
performance; Memory technology and optimizations; Protection: Virtual memory and
virtual machines.
6 Hours
UNIT - 8
HARDWARE AND SOFTWARE FOR VLIW AND EPIC: Introduction: Exploiting
Instruction-Level Parallelism Statically; Detecting and Enhancing Loop-Level
Parallelism; Scheduling and Structuring Code for Parallelism; Hardware Support for
Exposing Parallelism: Predicated Instructions; Hardware Support for Compiler
Speculation; The Intel IA-64 Architecture and Itanium Processor; Conclusions. 7 Hours

TEXT BOOK:
1.

Computer Architecture, A Quantitative Approach John L. Hennessey and

David A. Patterson:, 4th Edition, Elsevier, 2007.

REFERENCE BOOKS:
1.

Advanced Computer Architecture Parallelism, Scalability Kai Hwang:,

IC Logic technology:
Transistor density increases by about 35%per year. Increase in die size
corresponds to about 10 % to 20% per year. The combined effect is a growth rate
in transistor count on a chip of about 40% to 55% per year. Semiconductor DRAM
technology:cap acity increases by about 40% per year.
Storage Technology:
Before 1990: the storage density increased by about 30% per year.
After 1990: the storage density increased by about 60 % per year.
Disks are still 50 to 100 times cheaper per bit than DRAM.

Department of CSE, SJBIT

Page 9

Advance Computer Architecture

10CS74

Network Technology:
Network performance depends both on the per formance of the switches and
on the performance of the transmission system. Although the technology improves
continuously, the impact of these improvements can be in discrete leaps.
Performance trends: Bandwidth or throughput is the total amount of work done in
given time.
Latency or response time is the time between the start and the completion of an
event. (for eg. Millisecond for disk access)

A simple rule of thumb is that bandwidth gro ws by at least the square of the
improvement in latency. Computer designers should make plans accordingly.
IC Processes are characterizes by the f ature sizes.
Feature sizes decreased from 10 microns(1971) to 0.09 microns(2006)
Feature sizes shrink, devices shrink quadr atically.
Shrink in vertical direction makes the operating v oltage of the transistor to
reduce.
Transistor performance improves linearly with decreasing
feature size
Department of CSE, SJBIT

Page 10

Advance Computer Architecture

10CS74

.
Transistor count improves quadratically with a linear improvement in Transistor
performance.
!!! Wire delay scales poo rly comp ared to Transistor performance.
Feature sizes shrink, wires get shorter.
Signal delay fo r a wire increases in proportion to the product of Resistance and
Capacitance.

Trends in Power in Integrated Circuits

For CMOS chips, the dominant source of energy consumption is due to switching
transistor, also called as Dynamic power and is given b y the following equation.
Power = (1/2)*Capacitive load* Voltage
* Frequency switched dynamic
For mobile devices, energy is the better metric
Energy dynamic = Capacitive load x Voltage 2
For a fix ed task, slowing clock rate (frequency switched) reduces power, but not energy
Capacitive load a function of number of transistors connected to output and technology,
which determines capacitance of wires and transistors
Dropping voltage helps both, so went from 5V down to 1V
To save energy & dynamic power, most CPUs now turn off clock of inactive modules
Distributing the power, removing the heat and preventing hot spots have become
increasingly difficult challenges.
The leakage current flows even when a transistor is off. Therefore static power is
equally important.
Power static= Current static * Voltage
Leakage current increases in processors with smaller transistor sizes
Increasing the number of transistors increases power even if they are turned off
In 2006, goal for leakage is 25% of total power consumption; high performance designs
at 40%
Very low power systems even gate voltage to inactive modules to control loss due to
leakage

Trends in Cost
The underlying principle that drives the cost down is the learning curvemanufacturing
costs decrease over time.
Volume is a second key factor in determining cost. Volume decreases cost since it
increases purchasing manufacturing efficiency. As a rule of thumb, the cost decreases
Department of CSE, SJBIT

Page 11

Advance Computer Architecture

10CS74

about 10% for each doubling of volume.

Cost of an Integrated Circuit
Although the cost of ICs have dropped exponentially, the basic process of silicon
manufacture is unchanged. A wafer is still tested and chopped into dies that are
packaged.
Cost of IC = Cost of [die+ testing die+ Packaging and final test] / (Final test yoeld)
Cost of die = Cost of wafer/ (Die per wafer x Die yield)
The number of dies per wafer is approximately the area of the wafer divided by the area
of the die.
Die per wafer = [_ * (Wafer Dia/2)2/Die area]-[_* wafer dia/_(2*Die area)]
The first term is the ratio of wafer area to die area and the second term compensates for
the rectangular dies near the periphery of round wafers(as shown in figure).

Dependability:
The Infrastructure providers offer Service Level Agreement (SLA) or Service
Level Objectives (SLO) to guarantee that their networking or power services would be
dependable.
Department of CSE, SJBIT

Page 12

Advance Computer Architecture

10CS74

Systems alternate between 2 states of service with respect to an SLA:

1. Service accomplishment, where the service is delivered as specified in SLA
2. Service interruption, where the delivered service is different from the SLA
Failure = transition from state 1 to state 2
Restoration = transition from state 2 to state 1
The two main measures of Dependability are Module Reliability and Module
Availability. Module reliability is a measure of continuous service accomplishment (or
time to failure) from a reference initial instant.
1. Mean Time To Failure (MTTF) measures Reliability
2. Failures In Time (FIT) = 1/MTTF, the rate of failures
Traditionally reported as failures per billion hours of operation
Mean Time To Repair (MTTR) measures Service Interruption
Mean Time Between Failures (MTBF) = MTTF+MTTR
Module availability measures service as alternate between the 2 states of
accomplishment and interruption (number between 0 and 1, e.g. 0.9)
Module availability = MTTF / ( MTTF + MTTR)

Performance:
The Execution time or Response time is defined as the time between the start and
completion of an event. The total amount of work done in a given time is defined as the
Throughput.
The Administrator of a data center may be interested in increasing the
Throughput. The computer user may be interested in reducing the Response time.
Computer user says that computer is faster when a program runs in less time.

The routinely executed programs are the best candidates for evaluating the performance
of the new computers. To evaluate new system the user would simply compare the
execution time of their workloads.

Department of CSE, SJBIT

Page 13

Advance Computer Architecture

10CS74

Benchmarks
The real applications are the best choice of benchmarks to evaluate the
performance. However, for many of the cases, the workloads will not be known at the
time of evaluation. Hence, the benchmark program which resemble the real applications
are chosen. The three types of benchmarks are:
KERNELS, which are small, key pieces of real applications;
Toy Programs: which are 100 line programs from beginning programming
assignments, such Quicksort etc.,
Synthetic Benchmarks: Fake programs invented to try to match the profile and
behavior of real applications such as Dhrystone.
To make the process of evaluation a fair justice, the following points are to be followed.
Source code modifications are not allowed.
Source code modifications are allowed, but are essentially impossible.
Source code modifications are allowed, as long as the modified version produces
the same output.
To increase predictability, collections of benchmark applications, called
benchmark suites, are popular
SPECCPU: popular desktop benchmark suite given by Standard Performance
Evaluation committee (SPEC)
CPU only, split between integer and floating point programs
SPECint2000 has 12 integer, SPECfp2000 has 14 integer programs
SPECCPU2006 announced in Spring 2006.
SPECSFS (NFS file server) and SPECWeb (WebServer) added as server
benchmarks
Transaction Processing Council measures server performance and
costperformance for databases
TPC-C Complex query for Online Transaction Processing
TPC-H models ad hoc decision support
TPC-W a transactional web benchmark
TPC-App application server and web services benchmark
SPEC Ratio: Normalize execution times to reference computer, yielding a ratio
proportional to performance = time on reference computer/time on computer being rated
If program SPECRatio on Computer A is 1.25 times bigger than Computer B, then

Department of CSE, SJBIT

Page 14

Advance Computer Architecture

10CS74

Note : when comparing 2 computers as a ratio, execution times on the reference

computer drop out, so choice of reference computer is irrelevant.

Quantitative Principles of Computer Design

While designing the computer, the advantage of the following points can be
exploited to enhance the performance.
* Parallelism: is one of most important methods for improving performance.
- One of the simplest ways to do this is through pipelining ie, to over lap the
instruction Execution to reduce the total time to complete an instruction
sequence.
- Parallelism can also be exploited at the level of detailed digital design.
- Set- associative caches use multiple banks of memory that are typically searched
n parallel. Carry look ahead which uses parallelism to speed the process of
computing.
* Principle of locality: program tends to reuse data and instructions they have used
recently. The rule of thumb is that program spends 90 % of its execution time in only
10% of the code. With reasonable good accuracy, prediction can be made to find what
instruction and data the program will use in the near future based on its accesses in the
recent past.
* Focus on the common case while making a design trade off, favor the frequent case
over the infrequent case. This principle applies when determining how to spend
resources, since the impact of the improvement is higher if the occurrence is frequent.
Amdahls Law: Amdahls law is used to find the performance gain that can be obtained
by improving some portion or a functional unit of a computer Amdahls law defines the
speedup that can be gained by using a particular feature.
Speedup is the ratio of performance for entire task without using the enhancement
when possible to the performance for entire task without using the enhancement.
Execution time is the reciprocal of performance. Alternatively, speedup is defined as thee
ratio of execution time for entire task without using the enhancement to the execution
time for entair task using the enhancement when possible.
Speedup from some enhancement depends an two factors:
Department of CSE, SJBIT

Page 15

Advance Computer Architecture

10CS74

i. The fraction of the computation time in the original computer that can be
converted to take advantage of the enhancement. Fraction enhanced is always less than or
equal to
Example: If 15 seconds of the execution time of a program that takes 50
seconds in total can use an enhancement, the fraction is 15/50 or 0.3
ii. The improvement gained by the enhanced execution mode; ie how much
faster the task would run if the enhanced mode were used for the entire program. Speedup
enhanced is the time of the original mode over the time of the enhanced mode and is always
greater then 1.

Page 19

Advance Computer Architecture

10CS74

3. Execution / Effective address Cycle (EXE)

The ALU operates on the operands prepared in the previous cycle and performs
one of the following function defending on the instruction type.

* Register- Register ALU instruction: ALU performs the operation specified in the
instruction using the values read from the register file.
* Register- Immediate ALU instruction: ALU performs the operation specified in the
instruction using the first value read from the register file and that sign extended
immediate.
4. Memory access (MEM)
For a load instruction, using effective address the memory is read. For a store
instruction memory writes the data from the 2nd register read using effective address.
5. Write back cycle (WB)
Write the result in to the register file, whether it comes from memory system (for
a LOAD instruction) or from the ALU.

10CS74

Basic Performance issues in Pipelining

Pipelining increases the CPU instruction throughput but, it does not reduce the
executiontime of an individual instruction. In fact, the pipelining increases the execution
time of each instruction due to overhead in the control of the pipeline. Pipeline overhead
arises from the combination of register delays and clock skew. Imbalance among the pipe
stages reduces the performance since the clock can run no faster than the time needed for
the slowest pipeline stage.

Department of CSE, SJBIT

Page 23

Advance Computer Architecture

10CS74

Pipeline Hazards
Hazards may cause the pipeline to stall. When an instruction is stalled, all the
instructions issued later than the stalled instructions are also stalled. Instructions issued
earlier than the stalled instructions will continue in a normal way. No new instructions
are fetched during the stall. Hazard is situation that prevents the next instruction in the
instruction stream fromk executing during its designated clock cycle. Hazards will reduce
the pipeline performance.
Performance with Pipeline stall
A stall causes the pipeline performance to degrade from ideal performance. Performance
improvement from pipelining is obtained from:

Assume that,
i) cycle time overhead of pipeline is ignored
ii) stages are balanced
With theses assumptions

10CS74

DADD instruction produces the value of R1 in WB stage (Clock cycle 5) but the DSUB
instruction reads the value during its ID stage (clock cycle 3). This problem is called Data
Hazard. DSUB may read the wrong value if precautions are not taken. AND instruction
will read the register during clock cycle 4 and will receive the wrong results. The XOR
instruction operates properly, because its register read occurs in clock cycle 6 after
DADD writes in clock cycle 5. The OR instruction also operates without incurring a
hazard because the register file reads are performed in the second half of the cycle
whereas the writes are performed in the first half of the cycle.

Minimizing data hazard by Forwarding

The DADD instruction will produce the value of R! at the end of clock cycle 3. DSUB
instruction requires this value only during the clock cycle 4. If the result can be moved
from the pipeline register where the DADD store it to the point (input of LAU) where
DSUB needs it, then the need for a stall can be avoided. Using a simple hardware
technique called Data Forwarding or Bypassing or short circuiting, data can be made
available from the output of the ALU to the point where it is required (input of LAU) at
the beginning of immediate next clock cycle.
Forwarding works as follows:
i) The output of ALU from EX/MEM and MEM/WB pipeline register is always
feedback to the ALU inputs.
ii) If the Forwarding hardware detects that the previous ALU output serves as the
source for the current ALU operations, control logic selects the forwarded result
as the input rather than the value read from the register file. Forwarded results are
required not only from the immediate previous instruction, but also from an instruction
that started 2 cycles earlier. The result of ith instruction Is required to be forwarded to
(i+2)th instruction also. Forwarding can be generalized to include passing a result directly
to the functional unit that requires it.
Data Hazard requiring stalls
LD R1, 0(R2)
DADD R3, R1, R4
AND R5, R1, R6
OR
R7, R1, R8
The pipelined data path for these instructions is shown in the timing diagram (figure 2.7)

Department of CSE, SJBIT

Page 27

Advance Computer Architecture

10CS74

The LD instruction gets the data from the memory at the end of cycle 4. even with
forwarding technique, the data from LD instruction can be made available earliest during
clock cycle 5. DADD instruction requires the result of LD instruction at the beginning of
clock cycle 5. DADD instruction requires the result of LD instruction at the beginning of
clock cycle 4. This demands data forwarding of clock cycle 4. This demands data
forwarding in negative time which is not possible. Hence, the situation calls for a pipeline
stall.Result from the LD instruction can be forwarded from the pipeline register to the
and instruction which begins at 2 clock cycles later after the LD instruction. The load
instruction has a delay or latency that cannot be eliminated by forwarding alone. It is
necessary to stall pipeline by 1 clock cycle. A hardware called Pipeline interlock detects a
hazard and stalls the pipeline until the hazard is cleared. The pipeline interlock helps to
preserve the correct execution pattern by introducing a stall or bubble. The CPI for the
stalled instruction increases by the length of the stall. Figure 2.7 shows the pipeline
before and after the stall. Stall causes the DADD to move 1 clock cycle later in time.
Forwarding to the AND instruction now goes through the register file or forwarding is
not required for the OR instruction. No instruction is started during the clock cycle 4.
Control Hazard
When a branch is executed, it may or may not change the content of PC. If a branch is
taken, the content of PC is changed to target address. If a branch is taken, the content of
PC is not changed

The simple way of dealing with the branches is to redo the fetch of the instruction
following a branch. The first IF cycle is essentially a stall, because, it never performs
useful work. One stall cycle for every branch will yield a performance loss 10% to 30%
depending on the branch frequency
Department of CSE, SJBIT

Page 28

Advance Computer Architecture

10CS74

Reducing the Brach Penalties

There are many methods for dealing with the pipeline stalls caused by branch
delay
1. Freeze or Flush the pipeline, holding or deleting any instructions after the
ranch until the branch destination is known. It is a simple scheme and branch penalty is
fixed and cannot be reduced by software
2. Treat every branch as not taken, simply allowing the hardware to continue as if
the branch were not to executed. Care must be taken not to change the processor
state until the branch outcome is known.
Instructions were fetched as if the branch were a normal instruction. If the branch
is taken, it is necessary to turn the fetched instruction in to a no-of instruction and restart
the fetch at the target address. Figure 2.8 shows the timing diagram of both the situations.

3. Treat every branch as taken: As soon as the branch is decoded and target
Address is computed, begin fetching and executing at the target if the branch target is
known before branch outcome, then this scheme gets advantage.
For both predicated taken or predicated not taken scheme, the compiler can
improve performance by organizing the code so that the most frequent path
matches the hardware choice.
4. Delayed branch technique is commonly used in early RISC processors.
In a delayed branch, the execution cycle with a branch delay of one is
Branch instruction
Sequential successor-1
Branch target if taken
The sequential successor is in the branch delay slot and it is executed irrespective of
whether or not the branch is taken. The pipeline behavior with a branch delay is shown in
Figure 2.9. Processor with delayed branch, normally have a single instruction delay.
Compiler has to make the successor instructions valid and useful there are three ways in
Department of CSE, SJBIT

Page 29

Advance Computer Architecture

10CS74

which the to delay slot can be filled by the compiler.

The limitations on delayed branch arise from

i) Restrictions on the instructions that are scheduled in to delay slots.
ii) Ability to predict at compiler time whether a branch is likely to be taken or
not taken.
The delay slot can be filled from choosing an instruction
a) From before the branch instruction
b) From the target address
c) From fall- through path.
The principle of scheduling the branch delay is shown in fig 2.10

Department of CSE, SJBIT

Page 30

Advance Computer Architecture

10CS74

What makes pipelining hard to implements?

Dealing with exceptions: Overlapping of instructions makes it more difficult to
know whether an instruction can safely change the state of the CPU. In a pipelined CPU,
an instruction execution extends over several clock cycles. When this instruction is in
execution, the other instruction may raise exception that may force the CPU to abort the
instruction in the pipeline before they complete

Types of exceptions:
The term exception is used to cover the terms interrupt, fault and exception.
I/O device request, page fault, Invoking an OS service from a user program, Integer
arithmetic overflow, memory protection overflow, Hardware malfunctions, Power failure
etc. are the different classes of exception. Individual events have important characteristics
that determine what action is needed corresponding to that exception.
i)

Synchronous versus Asynchronous

If the event occurs at the same place every time the program is executed with the
same data and memory allocation, the event is asynchronous. Asynchronous events are
caused by devices external to the CPU and memory such events are handled after the
completion of the current instruction.
ii)
User requested versus coerced:
User requested exceptions are predictable and can always be handled after the
current instruction has completed. Coerced exceptions are caused by some
hardware event that is not under the control of the user program. Coerced
exceptions are harder to implement because they are not predictable
iii)

User maskable versus user non maskable :

If an event can be masked by a user task, it is user maskable. Otherwise it is user

non maskable.
iv)
Within versus between instructions:
Exception that occur within instruction are usually synchronous, since the
instruction triggers the exception. It is harder to implement exceptions that occur
withininstructions than those between instructions, since the instruction must be
stopped and restarted. Asynchronous exceptions that occurs within instructions arise from
catastrophic situations and always causes program termination.
v) Resume versus terminate:
If the programs execution continues after the interrupt, it is a resuming event
otherwise if is terminating event. It is easier implement exceptions that terminate
execution. 29

Department of CSE, SJBIT

Page 31

Advance Computer Architecture

10CS74

Stopping and restarting execution:

The most difficult exception have 2 properties:
1. Exception that occur within instructions
2. They must be restartable
For example, a page fault must be restartable and requires the intervention of OS. Thus
pipeline must be safely shutdown, so that the instruction can be restarted in the correct
state. If the restarted instruction is not a branch, then we will continue to fetch the
sequential successors and begin their execution in the normal fashion. 11) Restarting is
usually implemented by saving the PC of the instruction at which to restart. Pipeline
control can take the following steps to save the pipeline state safely.
i) Force a trap instruction in to the pipeline on the next IF
ii) Until the trap is taken, turn off all writes for the faulting instruction and for all
instructions that follow in pipeline. This prevents any state changes for instructions that
will not be completed before the exception is handled.
iii) After the exception handling routine receives control, it immediately saves
the PC of the faulting instruction. This value will be used to return from the exception
later.
NOTE:
1. with pipelining multiple exceptions may occur in the same clock cycle because
there are multiple instructions in execution.
2 Handling the exception becomes still more complicated when the instructions are
allowed to execute in out of order fashion.

Operation: send out the [PC] and fetch the instruction from memory in to the Instruction
Register (IR). Increment PC by 4 to address the next sequential instruction.
2. Instruction decode / Register fetch cycle (ID)

Operation: decode the instruction and access that register file to read the registers
( rs and rt). File to read the register (rs and rt). A & B are the temporary registers.
Operands are kept ready for use in the next cycle.
Department of CSE, SJBIT

Page 32

Advance Computer Architecture

10CS74

Decoding is done in concurrent with reading register. MIPS ISA has fixed length
Instructions. Hence, these fields are at fixed locations.
3. Execution/ Effective address cycle (EX)
One of the following operations are performed depending on the instruction
type.
* Memory reference:
:
Operation: ALU adds the operands to compute the effective address and places
the result in to the register ALU output.
Register Register ALU instruction:

Operation: The ALU performs the operation specified by the function code on the value
taken from content of register A and register B.
*. Register- Immediate ALU instruction:

Operation: the content of register A and register Imm are operated (function Op) and
result is placed in temporary register ALU output.
*. Branch:

Department of CSE, SJBIT

Page 33

Advance Computer Architecture

10CS74

UNIT - 3
INSTRUCTION LEVEL PARALLELISM 1: ILP
Concepts and challenges
Basic Compiler Techniques for exposing ILP
Reducing Branch costs with prediction
Overcoming Data hazards with Dynamic scheduling
Hardware-based speculation.
7 Hours

Department of CSE, SJBIT

Page 34

Advance Computer Architecture

10CS74

UNIT III
Instruction Level Parallelism
The potential overlap among instruction execution is called Instruction Level Parallelism
(ILP) since instructions can be executed in parallel. There are mainly two approaches to
exploit ILP.
i)

Hardware based approach: An approach that relies on hardware to help

discover and exploit the parallelism dynamically. Intel Pentium series which
has dominated in the market) uses this approach.

ii)

Software based approach: An approach that relies on software technology to

find parallelism statically at compile time. This approach has limited use in
scientific or application specific environment. Static approach of exploiting
ILP is found in Intel Itanium.

Factors of both programs and processors limit the amount of parallelism that can be
exploited among instructions and these limit the performance achievable. The
performance of the pipelined processors is given by:
Pipeline CPI= Ideal Pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
By reducing each of the terms on the right hand side, it is possible to minimize the overall
pipeline CPI.
To exploit the ILP, the primary focus is on Basic Block (BB). The BB is a straight line
code sequence with no branches in except the entry and no branches out except at the
exit. The average size of the BB is very small i.e., about 4 to 6 instructions. The flow
diagram segment of a program is shown below (Figure 3.1). BB1 , BB2 and BB3 are the
Basic Blocks.

Figure 3.1 Flow diagram segment

Department of CSE, SJBIT

Page 35

Advance Computer Architecture

10CS74

The amount of overlap that can be exploited within a Basic Block is likely to be less than
the average size of BB. To further enhance ILP, it is possible to look at ILP across
multiple BB. The simplest and most common way to increase the ILP is to exploit the
parallelism among iterations of a loop (Loop level parallelism). Each iteration of a loop
can overlap with any other iteration.

Data Dependency and Hazard

If two instructions are parallel, they can execute simultaneously in a pipeline of
arbitrary length without causing any stalls, assuming the pipeline has sufficient resources.
If two instructions are dependent, they are not parallel and must be executed in sequential
order.
There are three different types dependences.
Data Dependences (True Data Dependency)
Name Dependences
Control Dependences
Data Dependences
An instruction j is data dependant on instruction i if either of the following holds:
i) Instruction i produces a result that may be used by instruction j
Eg1: i: L.D F0, 0(R1)
j: ADD.D F4, F0, F2
ith instruction is loading the data into the F0 and jth instruction use F0 as one the
operand. Hence, jth instruction is data dependant on ith instruction.
Eg2: DADD R1, R2, R3
DSUB R4, R1, R5
ii) Instruction j is data dependant on instruction k and instruction k data dependant on
instruction i
Eg: L.D F4, 0(R1)
MUL.D F0, F4, F6
ADD.D F5, F0, F7
Dependences are the property of the programs. A Data value may flow between
instructions either through registers or through memory locations. Detecting the data flow
and dependence that occurs through registers is quite straight forward. Dependences that
flow through the memory locations are more difficult to detect. A data dependence
convey three things.
a) The possibility of the Hazard.
b) The order in which results must be calculated and
c) An upper bound on how much parallelism can possibly exploited.

Department of CSE, SJBIT

Page 36

Advance Computer Architecture

10CS74

Name Dependences
A Name Dependence occurs when two instructions use the same Register or Memory
location, but there is no flow of data between the instructions associated with that name.
Two types of Name dependences:
i) Antidependence: between instruction i and instruction j occurs when instruction j
writes a register or memory location that instruction i reads. he original ordering must be
preserved to ensure that i reads the correct value.
Eg: L.D F0, 0(R1)
DADDUI R1, R1, R3
ii) Output dependence: Output Dependence occurs when instructions i and j write to the
same register or memory location.
Ex: ADD.D F4, F0, F2
SUB.D F4, F3, F5
The ordering between the instructions must be preserved to ensure that the value finally
written corresponds to instruction j.The above instruction can be reordered or can be
executed simultaneously if the name of the register is changed. The renaming can be
easily done either statically by a compiler or dynamically by the hardware.
Data hazard: Hazards are named by the ordering in the program that must be preserved
by the pipeline
RAW (Read After Write): j tries to read a source before i writes it, so j in correctly gets
old value, this hazard is due to true data dependence.
WAW (Write After Write): j tries to write an operand before it is written by i. WAW
hazard arises from output dependence.
WAR (Write After Read): j tries to write a destination before it is read by i, so that I
incorrectly gets the new value. WAR hazard arises from an antidependence and normally
cannot occur in static issue pipeline.
CONTROL DEPENDENCE:
A control dependence determines the ordering of an instruction i with respect to a branch
instruction,
Ex: if P1 {
S1;
}
if P2 {
S2;
}
S1 is Control dependent on P1 and
Department of CSE, SJBIT

Page 37

Advance Computer Architecture

10CS74

S2 is control dependent on P2 but not on P1.

Tournament predictor is a multi level branch predictor and uses n bit saturating counter
to chose between predictors. The predictors used are global predictor and local predictor.
Advantage of tournament predictor is ability to select the right predictor for a
particular branch which is particularly crucial for integer benchmarks.
A typical tournament predictor will select the global predictor almost 40% of the
time for the SPEC integer benchmarks and less than 15% of the time for the SPEC
FP benchmarks
Existing tournament predictors use a 2-bit saturating counter per branch to choose
among two different predictors based on which predictor was most effective oin
recent prediction.
Department of CSE, SJBIT

Page 43

Advance Computer Architecture

10CS74

Dynamic Branch Prediction Summary

Prediction is becoming important part of execution as it improves the performance of
the pipeline.
Branch History Table: 2 bits for loop accuracy
Correlation: Recently executed branches correlated with next branch
Either different branches (GA)
Or different executions of same branches (PA)
Tournament predictors take insight to next level, by using multiple predictors
usually one based on global information and one based on local information,
and combining them with a selector
In 2006, tournament predictors using 30K bits are in processors like the Power
and Pentium 4

Tomasulu algorithm and Reorder Buffer

Tomasulu idea:
1. Have reservation stations where register renaming is possible
2. Results are directly forwarded to the reservation station along with the final
registers. This is also called short circuiting or bypassing.
ROB:
1.The instructions are stored sequentially but we have indicators to say if it is speculative
or completed execution.
2. If completed execution and not speculative and reached head of the queue then we
commit it.

Department of CSE, SJBIT

Page 44

Advance Computer Architecture

10CS74

Speculating on Branch Outcomes

Reorder Buffer

Example
Loop: LD

MULTD

SUBI

BNEZ

Loop

Assume instructions in the loop have been issued twice

Assume L.D and MUL.D from the first iteration have committed and all other
instructions have completed
Assume effective address for store is computed prior to its issue
Show data structures

Department of CSE, SJBIT

Page 48

Advance Computer Architecture

10CS74

Reorder Buffer

Notes
If a branch is mispredicted, recovery is done by flushing the ROB of all entries that
appear after the mispredicted branch
entries before the branch are allowed to continue
restart the fetch at the correct branch successor
When an instruction commits or is flushed from the ROB then the corresponding slots
become available for subsequent instructions
Advantages of hardware-based speculation:

-able to disambiguate memory references;

-better when hardware-based branch prediction is better than software-based
branch
prediction done at compile time; - maintains a completely precise exception
model even for speculated instructions;
does not require compensation or bookkeeping code;
main disadvantage:
complex and requires substantial hardware resources;

Department of CSE, SJBIT

Page 49

Advance Computer Architecture

10CS74

UNIT - IV
INSTRUCTION LEVEL PARALLELISM 2:
Exploiting ILP using multiple issue and static scheduling
Exploiting ILP using dynamic scheduling
Multiple issue and speculation
Advanced Techniques for instruction delivery and Speculation
The Intel Pentium 4 as example.

Department of CSE, SJBIT

7 Hours

Page 50

Advance Computer Architecture

10CS74

UNIT IV
INSTRUCTION LEVEL PARALLELISM 2
What is ILP?
Instruction Level Parallelism
Number of operations (instructions) that can be performed in parallel
Formally, two instructions are parallel if they can execute simultaneously in a pipeline
of arbitrary depth without causing any stalls assuming that the pipeline has sufficient
resources
Primary techniques used to exploit ILP
Deep pipelines
Multiple issue machines
Basic program blocks tend to have 4-8 instructions between branches
Little ILP within these blocks
Must find ILP between groups of blocks
Example Instruction Sequences
Independent instruction sequence:

lw $10, 12($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7
add $14, $8, $9
Dependent instruction sequence:

lw $10, 12($1)
sub $11, $2, $10
and $12, $11, $10
or $13, $6, $7
add $14, $8, $13
Finding ILP:
Must deal with groups of basic code blocks
Common approach: loop-level parallelism
Department of CSE, SJBIT

Page 51

Advance Computer Architecture

10CS74

Example:
In MIPS (assume $s0 initialized properly):
for (i=1000; i > 0; i--)
x[i] = x[i] + s;
Loop: lw $t0, 0($s1) # t0 = array element
addu $t0, $t0, $s2 # add scalar in $s2
sw $t0, 0($s1) # store result
addi $s1, $s1, -4 # decrement pointer
bne $s1, $0, Loop # branch $s1 != 0

Loop Unrolling:
Technique used to help scheduling (and performance)
Copy the loop body and schedule instructions from different iterations of the
loop gether
MIPS example (from prev. slide):
Loop: lw $t0, 0($s1) # t0 = array element
addu $t0, $t0, $s2 # add scalar in $s2
sw $t0, 0($s1) # store result
lw $t1, -4($s1)
addu $t1, $t1, $s2
sw $t1, -4($s1)
addi $s1, $s1, -8 # decrement pointer
bne $s1, $0, Loop # branch $s1 != 0
Note the new register & counter adjustment!
Previous example, we unrolled the loop once
This gave us a second copy
Why introduce a new register ($t1)?
Antidependence (name dependence)
Loop iterations would reuse register $t0
No data overlap between loop iterations!
Compiler RENAMED the register to prevent a dependence
Allows for better instruction scheduling and identification of true dependencies
In general, you can unroll the loop as much as you want
A factor of the loop counter is generally used
Limited advantages to unrolling more than a few times

Loop Unrolling: Performance:

Performance (dis)advantage of unrolling
Assume basic 5-stage pipeline
Recall lw requires a bubble if value used immediately after
For original loop
10 cycles to execute first iteration
Department of CSE, SJBIT

Page 52

Advance Computer Architecture

10CS74

16 cycles to execute two iterations

Assuming perfect prediction
For unrolled loop
14 cycles to execute first iteration -- without reordering
Gain from skipping addi, bne
12 cycles to execute first iteration -- with reordering
Put lw together, avoid bubbles after ea

Loop Unrolling: Limitations

Overhead amortization decreases as loop is unrolled more
Increase in code size
Could be bad if ICache miss rate increases
Register pressure
Run out of registers that can be used in renaming process

Exploiting ILP: Deep Pipelines

Deep Pipelines
Increase pipeline depth beyond 5 stages
Generally allows for higher clock rates
UltraSparc III -- 14 stages
Pentium III -- 12 stages
Pentium IV -- 22 stages
Some versions have almost 30 stages
Core 2 Duo -- 14 stages
AMD Athlon -- 9 stages
AMD Opteron -- 12 stages
Motorola G4e -- 7 stages
IBM PowerPC 970 (G5) -- 14 stages
Increases the number of instructions executing at the same time
Most of the CPUs listed above also issue multiple instructions per cycle

Issues with Deep Pipelines

Branch (Mis-)prediction
Speculation: Guess the outcome of an instruction to remove it as a dependence
to other instructions
Tens to hundreds of instructions in flight
Have to flush some/all if a branch is mispredicted
Memory latencies/configurations
To keep latencies reasonable at high clock rates, need fast caches
Generally smaller caches are faster
Smaller caches have lower hit rates
Techniques like way prediction and prefetching can help lower latencies

Optimal Pipelining Depths

Several papers published on this topic
Esp. the 29th International Symposium on Computer Architecture (ISCA)
Department of CSE, SJBIT

10CS74

Static Mult. Issue w/Loop Unrolling

Department of CSE, SJBIT

Page 56

Advance Computer Architecture

10CS74

Exploiting ILP:Multiple Issue Computers Dynamic Scheduling

Dynamic Multiple Issue Computers
Superscalar computers
CPU generally manages instruction issuing and ordering
Compiler helps, but CPU dominates
Process
Instructions issue in-order
Instructions can execute out-of-order
Execute once all operands are ready
Instructions commit in-order
Commit refers to when the architectural register file is updated (current completed state
of program
Aside: Data Hazard Refresher
Two instructions (i and j), j follows i in program order
Read after Read (RAR)
Read after Write (RAW)
Type:
Problem:
Write after Read (WAR)
Type:
Problem:
Write after Write (WAW)
Type: Problem:
Superscalar Processors
Department of CSE, SJBIT

Page 57

Advance Computer Architecture

10CS74

Register Renaming
Use more registers than are defined by the architecture
Architectural registers: defined by ISA
Physical registers: total registers
Help with name dependencies
Antidependence
Write after Read hazard
Output dependence
Write after Write hazard

Tomasulos Superscalar Computers

R. M. Tomasulo, An Efficient Algorithm for Exploiting Multiple Arithmetic Units,
IBM J. of Research and Development, Jan. 1967
See also: D. W. Anderson, F. J. Sparacio, and R. M. Tomasulo, The IBM System/360
model 91: Machine philosophy and instruction-handling, IBM J. of Research and
evelopment, Jan. 1967
Allows out-of-order execution
Tracks when operands are available
Minimizes RAW hazards
Introduced renaming for WAW and WAR
hazards
Tomasulos Superscalar Computers

Department of CSE, SJBIT

Page 58

Advance Computer Architecture

10CS74

Instruction Execution Process

Three parts, arbitrary number of cycles/part
Above does not allow for speculative execution
Issue (aka Dispatch)
If empty reservation station (RS) that matches instruction, send to RS with operands
rom register file and/or know which functional unit will send operand
If no empty RS, stall until one is available
Rename registers as appropriate
Instruction Execution Process
Execute
All branches before instruction must be resolved
Preserves exception behavior
When all operands available for an instruction, send it to functional unit
Monitor common data bus (CDB) to see if result is needed by RS entry
For non-load/store reservation stations
If multiple instructions ready, have to pick one to send to functional unit
For load/store
Compute address, then place in buffer
Loads can execute once memory is free
Stores must wait for value to be stored, then execute
Write Back
Functional unit places on CDB
Goes to both register file and reservation stations
Use of CDB enables forwarding for RAW hazards
Also introduces a latency between result and use of a value

Reservation Stations
Require 7 fields
Operation to perform on operands (2 operands)
Tags showing which RS/Func. Unit will be producing operand (or zero if operand
available/unnecessary)
Two source operand values
A field for holding memory address calculation data
Initially, immediate field of instruction
Later, effective address
Busy
Indicates that RS and its functional unit are busy
Register file support
Each entry contains a field that identifies which RS/func. unit will be writing into this
entry (or blank/zero if noone will be writing to it) Limitation of Current Machine
Instruction execution requires branches to be resolved
Department of CSE, SJBIT

Page 59

Advance Computer Architecture

10CS74

For wide-issue machines, may issue one branch per clock cycle!
Desire:
Predict branch direction to get more ILP
Eliminate control dependencies
Approach:
Predict branches, utilize speculative instruction execution
Requires mechanisms for fixing machine when speculation is incorrect
Tomasulos w/Hardware Speculation

Tomasulos w/HW Speculation

Key aspects of this design
Separate forwarding (result bypassing) from actual instruction completion
Assuming instructions are executing speculatively
Can pass results to later instructions, but prevents instruction from performing updates
that cant be undone
Once instruction is no longer speculative it can update register file/memory
New step in execution sequence: instruction commit
Requires instructions to wait until they can commit Commits still happen in order
Reorder Buffer (ROB)
Instructions hang out here before committing
Provides extra registers for RS/RegFile
Department of CSE, SJBIT

Page 60

Advance Computer Architecture

10CS74

Is a source for operands

Non-traditional instruction cache
Recall x86 ISA
Department of CSE, SJBIT

Page 64

Advance Computer Architecture

10CS74

CISC/VLIW: ugly assembly instructions of varying lengths

Hard for HW to decode
Ended up translating code into RISC-like microoperations to execute
Trace Cache holds sequences of RISC-like micro-ops
Less time decoding, more time executing
Sequence storage similar to normal instruction cache

P4: Branch Handling

BTBs (Branch Target Buffers)
Keep both branch history and branch target addresses
Target address is instruction immediately after branch
Predict if no entry in BTB for branch
Static prediction
If a backwards branch, see how far target is from current; if within a threshold, predict
taken, else predict not taken
If a forward branch, predict not taken
Also some other rules
Front-end BTB is L2 (like) for the trace cache BTB (L1 like)

P4: Execution Core

Tomasulos algorithm-like
Can have up to 126 instructions in-flight
Max of 3 micro-ops sent to core/cycle
Max of 48 loads, 32 stores
Send up to 6 instructions to functional units per cycle via 4 ports
Port 0: Shared between first fast ALU and FP/Media move scheduler
Port 1: Shared between second fast ALU and Complex integer and FP/Media scheduler
Port 2: Load
Port 3: Store

P4: Rapid Execution Engine

Execute 6 micro-ops/cycle
Simple ALUs run at 2x machine clock rate
Can generate 4 simple ALU results/cycle
Do one load and one store per cycle
Loads involve data speculation
Assume that most loads hit L1 and Data Translation Look-aside Buffer (DTLB)
Get data into execution, while doing address check
Fix if L1 miss occurred

P4: Memory Tricks

Department of CSE, SJBIT

Page 65

Advance Computer Architecture

10CS74

Store-to-Load Forwarding
Stores must wait to write until non-speculative
Loads occasionally want data from store location
Check both cache and Store Forwarding Buffer
SFB is where stores are waiting to be written
If hit when comparing load address to SFB address, use SFB data, not cache data
Done on a partial address
Memory Ordering Buffer
Ensures that store-to-load forwarding was correct
If not, must re-execute load
Force forwarding
Mechanism for forwarding in case addresses are misaligned
MOB can tell SFB to forward or not
False forwarding
Fixes partial address match between load and SFB

P4: Specs for Rest of Slides

For one running at 3.2 GHz
From grad arch book
L1 Cache
Int: Load to use - 4 cycles
FP: Load to use - 12 cycles
Can handle up to 8 outstanding load misses
L2 Cache (2 MB)
18 cycle access time

Department of CSE, SJBIT

Page 66

Advance Computer Architecture

10CS74

P4: Branch Prediction

P4: Misspeculation Percentages

P4: Data Cache Miss Rates

Department of CSE, SJBIT

Page 67

Advance Computer Architecture

10CS74

P4: CPI

P4 vs. AMD Opteron

Department of CSE, SJBIT

Page 68

Advance Computer Architecture

10CS74

P4 vs. Opteron: Real Performance

Department of CSE, SJBIT

Page 69

Advance Computer Architecture

10CS74

PART - B
UNIT - 5
MULTIPROCESSORS AND THREAD LEVEL PARALLELISM:
Introduction
Symmetric shared-memory architectures
Performance of symmetric sharedmemory multiprocessors
Distributed shared memory and directory-based coherence
Basics of synchronization
Models of Memory Consistency.

Department of CSE, SJBIT

7 Hours

Page 70

Advance Computer Architecture

10CS74

UNIT V
Multiprocessors and Thread-Level Parallelism
We have seen the renewed interest in developing multiprocessors in early 2000:
- The slowdown in uniprocessor performance due to the diminishing returns in exploring
instruction-level parallelism.
- Difficulty to dissipate the heat generated by uniprocessors with high clock rates.
- Demand for high-performance servers where thread-level parallelism is natural.
For all these reasons multiprocessor architectures has become increasingly attractive.

A Taxonomy of Parallel Architectures

The idea of using multiple processors both to increase performance and to
improve availability dates back to the earliest electronic computers. About 30 years ago,
Flynn proposed a simple model of categorizing all computers that is still useful today. He
looked at the parallelism in the instruction and data streams called for by the instructions
at the most constrained component of the multiprocessor, and placed all computers in one
of four categories:
1.Single instruction stream, single data stream
(SISD)This category is the uniprocessor.

2.Single instruction stream, multiple data streams

(SIMD)The same instruction is executed by multiple processors using different data
streams. Each processor has its own data memory (hence multiple data), but there is a
single instruction memory and control processor, which fetches and dispatches
instructions. Vector architectures are the largest class of processors of this type.
Department of CSE, SJBIT

Page 71

Advance Computer Architecture

10CS74

3.Multiple instruction streams, single data stream (MISD)No commercial

multiprocessor of this type has been built to date, but may be in the future. Some special
purpose stream processors approximate a limited form of this (there is only a single data
stream that is operated on by successive functional units).

4. Multiple instruction streams, multiple data streams (MIMD)Each processor

fetches its own instructions and operates on its own data. The processors are often offthe-shelf microprocessors. This is a coarse model, as some multiprocessors are hybrids of
these categories. Nonetheless, it is useful to put a framework on the design space.

Department of CSE, SJBIT

Page 72

Advance Computer Architecture

10CS74

1. MIMDs offer flexibility. With the correct hardware and software support, MIMDs
can function as single-user multiprocessors focusing on high performance for one
application, as multiprogrammed multiprocessors running many tasks simultaneously, or
as some combination of these functions.
2. MIMDs can build on the cost/performance advantages of off-the-shelf
microprocessors. In fact, nearly all multiprocessors built today use the same
microprocessors found in workstations and single-processor servers.
With an MIMD, each processor is executing its own instruction stream. In many cases,
each processor executes a different process. Recall from the last chapter, that a process is
an segment of code that may be run independently, and that the state of the process
contains all the information necessary to execute that program on a processor. In a
multiprogrammed environment, where the processors may be running independent tasks,
each process is typically independent of the processes on other processors. It is also
useful to be able to have multiple processors executing a single program and sharing the
code and most of their address space. When multiple processes share code and data in
this way, they are often called threads
. Today, the term thread is often used in a casual way to refer to multiple loci of
execution that may run on different processors, even when they do not share an address
space. To take advantage of an MIMD multiprocessor with n processors, we must usually
have at least n threads or processes to execute. The independent threads are typically
identified by the programmer or created by the compiler. Since the parallelism in this
situation is contained in the threads, it is called thread-level parallelism.
Threads may vary from large-scale, independent processesfor example,
independent programs running in a multiprogrammed fashion on different processors to
parallel iterations of a loop, automatically generated by a compiler and each executing for
perhaps less than a thousand instructions. Although the size of a thread is important in
considering how to exploit thread-level parallelism efficiently, the important qualitative
Department of CSE, SJBIT

Page 73

Advance Computer Architecture

10CS74

distinction is that such parallelism is identified at a high-level by the software system and
that the threads consist of hundreds to millions of instructions that may be executed in
parallel. In contrast, instruction level parallelism is identified by primarily by the
hardware, though with software help in some cases, and is found and exploited one
instruction at a time.
Existing MIMD multiprocessors fall into two classes, depending on the number of
processors involved, which in turn dictate a memory organization and interconnect
strategy. We refer to the multiprocessors by their memory organization, because what
constitutes a small or large number of processors is likely to change over time.
The first group, which we call

Centralized shared memory architectures have at most a few dozen processors in

2000. For multiprocessors with small processor counts, it is possible for the processors to
share a single centralized memory and to interconnect the processors and memory by a
bus. With large caches, the bus and the single memory, possibly with multiple banks, can
satisfy the memory demands of a small number of processors. By replacing a single bus
with multiple buses, or even a switch, a centralized shared memory design can be scaled
to a few dozen processors. Although scaling beyond that is technically conceivable,
sharing a centralized memory, even organized as multiple banks, becomes less attractive
as the number of processors sharing it increases.
Because there is a single main memory that has a symmetric relationship to all
processos and a uniform access time from any processor, these multiprocessors are often
called symmetric (shared-memory) multiprocessors ( SMPs), and this style of architecture
is sometimes called UMA for uniform memory access. This type of centralized
sharedmemory architecture is currently by far the most popular organization.
Department of CSE, SJBIT

Page 74

Advance Computer Architecture

10CS74

The second group consists of multiprocessors with physically distributed memory.

To support larger processor counts, memory must be distributed among the processors
rather than centralized; otherwise the memory system would not be able to support the
bandwidth demands of a larger number of processors without incurring excessively long
access latency. With the rapid increase in processor performance and the associated
increase in a processors memory bandwidth requirements, the scale of multiprocessor for
which distributed memory is preferred over a single, centralized memory continues to
decrease in number (which is another reason not to use small and large scale). Of course,
the larger number of processors raises the need for a high bandwidth interconnect.

Distributing the memory among the nodes has two major benefits. First, it is a
costeffective way to scale the memory bandwidth, if most of the accesses are to the local
memory in the node. Second, it reduces the latency for accesses to the local memory.
These two advantages make distributed memory attractive at smaller processor counts as
processors get ever faster and require more memory bandwidth and lower memory
latency. The key disadvantage for a distributed memory architecture is that
communicating data between processors becomes somewhat more complex and has
higher latency, at least when there is no contention, because the processors no longer
share a single centralized memory. As we will see shortly, the use of distributed memory
leads to two different paradigms for interprocessor communication. Typically, I/O as well
as memory is distributed among the nodes of the multiprocessor, and the nodes may be
small SMPs (28 processors). Although the use of multiple processors in a node together
with a memory and a network interface is quite useful from the cost-efficiency viewpoint.

Department of CSE, SJBIT

Page 75

Advance Computer Architecture

Two rules to ensure this:

If P writes x and then P1 reads it, Ps write will be seen by P1 if the read
and write are sufficiently far apart
Writes to a single location are serialized: seen in one order
Latest write will be seen
Otherwise could see writes in illogical order (could see older
value after a newer value)

The definition contains two different aspects of memory system:

Coherence
Consistency
A memory system is coherent if,
Program order is preserved.
Processor should not continuously read the old data value.
Write to the same location are serialized.
The above three properties are sufficient to ensure coherence,When a written value will
be seen is also important. This issue is defined by memory consistency model. Coherence
and consistency are complementary.

Distributed shared-memory architectures
Separate memory per processor
Local or remote access via memory controller
The physical address space is statically distributed Coherence
Problems
Simple approach: uncacheable
shared data are marked as uncacheable and only private data are
kept in caches
very long latency to access memory for shared data
Alternative: directory for memory blocks
The directory per memory tracks state of every block in every
cache
which caches have a copies of the memory block, dirty vs. clean,
...
Two additional complications
The interconnect cannot be used as a single point of arbitration like the
bus
Because the interconnect is message oriented, many messages must have
explicit responses
To prevent directory becoming the bottleneck, we distribute directory entries with
memory, each keeping track of which processors have copies of their memory blocks
Directory Protocols
Similar to Snoopy Protocol: Three states
Shared: 1 or more processors have the block cached, and the value in
memory is up-to-date (as well as in all the caches)
Uncached: no processor has a copy of the cache block (not valid in any
cache)
Exclusive: Exactly one processor has a copy of the cache block, and it
has written the block, so the memory copy is out of date
Department of CSE, SJBIT

Page 84

Advance Computer Architecture

10CS74

The processor is called the owner of the block

In addition to tracking the state of each cache block, we must track the
processors that have copies of the block when it is shared (usually a bit vector for
each memory block: 1 if processor has copy)
Keep it simple(r):
Writes to non-exclusive data => write miss
Processor blocks until access completes
Assume messages received and acted upon in order sent

local node: the node where a request originates

home node: the node where the memory location and directory entry of an address
reside
remote node: the node that has a copy of a cache block (exclusive or shared)

Comparing to snooping protocols:

identical states
Department of CSE, SJBIT

Page 85

Advance Computer Architecture

10CS74

stimulus is almost identical

write a shared cache block is treated as a write miss (without fetch the
block)
cache block must be in exclusive state when it is written
any shared block must be up to date in memory
write miss: data fetch and selective invalidate operations sent by the directory
controller (broadcast in snooping protocols)
Directory Operations: Requests and Actions
Message sent to directory causes two actions:
Update the directory
More messages to satisfy request
Block is in Uncached state: the copy in memory is the current value; only
possible requests for that block are:
Read miss: requesting processor sent data from memory &requestor
made only sharing node; state of block made Shared.
Write miss: requesting processor is sent the value & becomes the
Sharing node. The block is made Exclusive to indicate that the only valid copy is
cached. Sharers indicates the identity of the owner.
Block is Shared => the memory value is up-to-date:
Read miss: requesting processor is sent back the data from memory &
requesting processor is added to the sharing set.
Write miss: requesting processor is sent the value. All processors in the
set Sharers are sent invalidate messages, & Sharers is set to identity of
requesting processor. The state of the block is made Exclusive.
Block is Exclusive: current value of the block is held in the cache of the
processor identified by the set Sharers (the owner) => three possible directory requests:
Read miss: owner processor sent data fetch message, causing state of
block in owners cache to transition to Shared and causes owner to send data to directory,
where it is written to memory & sent back to requesting processor.
Identity of requesting processor is added to set Sharers, which still contains the
identity of the processor that was the owner (since it still has a readable copy). State is
shared.
Data write-back: owner processor is replacing the block and hence must
write it back, making memory copy up-to-date (the home directory
essentially becomes the owner), the block is now Uncached, and the Sharer set is
empty.
Write miss: block has a new owner. A message is sent to old owner
causing the cache to send the value of the block to the directory from which it is sent to
the requesting processor, which becomes the new owner. Sharers is set to identity of new
owner, and state of block is made Exclusive.
Synchronization: The Basics
Synchronization mechanisms are typically built with user-level software routines
that rely on hardware supplied synchronization instructions.
Department of CSE, SJBIT

Page 86

Advance Computer Architecture

10CS74

Why Synchronize?
Need to know when it is safe for different processes to use shared data
Issues for Synchronization:
Uninterruptable instruction to fetch and update memory (atomic
operation);
User level synchronization operation using this primitive;
For large scale MPs, synchronization can be a bottleneck; techniques to
reduce contention and latency of synchronization
Uninterruptable Instruction to Fetch and Update Memory
Atomic exchange: interchange a value in a register for a value in memory
0 _ synchronization variable is free
1 _ synchronization variable is locked and unavailable
Set register to 1 & swap
New value in register determines success in getting lock
0 if you succeeded in setting the lock (you were first)
1 if other processor had already claimed access
Key is that exchange operation is indivisible
Test-and-set: tests a value and sets it if the value passes the test
Fetch-and-increment: it returns the value of a memory location and atomically
increments it
0 _ synchronization variable is free
Hard to have read & write in 1 instruction: use 2 instead
Load linked (or load locked) + store conditional
Load linked returns the initial value
Store conditional returns 1 if it succeeds (no other store to same memory
location since preceding load) and 0 otherwise
Example doing atomic swap with LL & SC:
try:
mov R3,R4 ;
mov exchange value
ll

R2,0(R1) ; load linked

R3,0(R1) ; store conditional

beqz

R3,try ; branch store fails (R3 = 0)

mov

R4,R2 ; put load value in R4

Example doing fetch & increment with LL & SC:

try:
ll
R2,0(R1) ; load linked
addi
R2,R2,#1 ;
increment (OK if regreg)
sc
R2,0(R1) ;
store conditional
beqz R2,try ;
branch store fails (R2 = 0)
User Level SynchronizationOperation Using this Primitive

Goal is provide a memory system with cost per byte than the next lower level
Each level maps addresses from a slower, larger memory to a smaller but faster
memory higher in the hierarchy.
Address mapping
Address checking.
Hence protection scheme for address for scrutinizing addresses are also part of
the memory hierarchy.
Why More on Memory Hierarchy?

Department of CSE, SJBIT

So, misses per instruction can be used per memory reference

Cache Optimizations
Six basic cache optimizations
1. Larger block size to reduce miss rate:
- To reduce miss rate through spatial locality.
- Increase block size.
- Larger block size reduce compulsory misses.
- But they increase the miss penalty.
2. Bigger caches to reduce miss rate:
- capacity misses can be reduced by increasing the cache capacity.
- Increases larger hit time for larger cache memory and higher cost and power.
3. Higher associativity to reduce miss rate:
- Increase in associativity reduces conflict misses.
4. Multilevel caches to reduce penalty:
- Introduces additional level cache
- Between original cache and memory.
- L1- original cache
L2- added cache.
L1 cache: - small enough
- speed matches with clock cycle time.
L2 cache: - large enough
- capture many access that would go to main memory.
Average access time can be redefined as
Hit timeL1+ Miss rate L1 X ( Hit time L2 + Miss rate L2 X Miss penalty L2)
5. Giving priority to read misses over writes to reduce miss penalty:
- write buffer is a good place to implement this optimization.
- write buffer creates hazards: read after write hazard.
6. Avoiding address translation during indexing of the cache to reduce hit time:
- Caches must cope with the translation of a virtual address from the processor to
a physical address to access memory.
- common optimization is to use the page offset.
- part that is identical in both virtual and physical addresses- to index the cache.

Department of CSE, SJBIT

Page 94

Advance Computer Architecture

10CS74

Advanced Cache Optimizations

Reducing hit time
Small and simple caches
Way prediction
Trace caches
Increasing cache bandwidth
Pipelined caches
Multibanked caches
Nonblocking caches
Reducing Miss Penalty
Critical word first
Merging write buffers
Reducing Miss Rate
Compiler optimizations
Reducing miss penalty or miss rate via parallelism
Hardware prefetching
Compiler prefetching

First Optimization : Small and Simple Caches

Index tag memory and then compare takes time
_ Small cache can help hit time since smaller memory takes less time to index
E.g., L1 caches same size for 3 generations of AMD microprocessors:
K6, Athlon, and Opteron
Also L2 cache small enough to fit on chip with the processor avoids time
penalty of going off chip
Simple _ direct mapping
Can overlap tag check with data transmission since no choice
Access time estimate for 90 nm using CACTI model 4.0
Median ratios of access time relative to the direct-mapped caches are 1.32,
1.39, and 1.43 for 2-way, 4-way, and 8-way caches

Second Optimization: Way Prediction

How to combine fast hit time of Direct Mapped and have the lower conflict
misses of 2-way SA cache?
Department of CSE, SJBIT

Page 95

Advance Computer Architecture

10CS74

Way prediction: keep extra bits in cache to predict the way, or block within
the set, of next cache access.

Multiplexer is set early to select desired block, only 1 tag comparison performed that
clock cycle in parallel with reading the cache data
Miss _ 1st check other blocks for matches in next clock cycle
Accuracy 85%
Drawback: CPU pipeline is hard if hit takes 1 or 2 cycles
- Used for instruction caches vs. data caches
Third optimization: Trace Cache
Find more instruction level parallelism?
How to avoid translation from x86 to microops?
Trace cache in Pentium 4
1. Dynamic traces of the executed instructions vs. static sequences of instructions
as determined by layout in memory
Built-in branch predictor
2. Cache the micro-ops vs. x86 instructions
Decode/translate from x86 to micro-ops on trace cache miss
+ 1. _ better utilize long blocks (dont exit in middle of block, dont enter
at label in middle of block)
- 1. _ complicated address mapping since addresses no longer aligned to
powerof2 multiples of word size
- 1. _ instructions may appear multiple times in multiple dynamic traces
due to different branch outcomes
Fourth optimization: pipelined cache access to increase bandwidth
Pipeline cache access to maintain bandwidth, but higher latency
Instruction cache access pipeline stages:
1: Pentium
2: Pentium Pro through Pentium III
4: Pentium 4
- _ greater penalty on mispredicted branches
- _ more clock cycles between the issue of the load and the use of the data
Fifth optimization: Increasing Cache Bandwidth Non-Blocking Caches
Department of CSE, SJBIT

Page 96

Advance Computer Architecture

10CS74

Non-blocking cache or lockup-free cache allow data cache to continue to supply

cache hits during a miss
requires F/E bits on registers or out-of-order execution
requires multi-bank memories
hit under miss reduces the effective miss penalty by working during miss vs.
ignoring CPU requests
hit under multiple miss or miss under miss may further lower the effective
miss penalty by overlapping multiple misses
Significantly increases the complexity of the cache controller as there
can be multiple outstanding memory accesses
Requires multiple memory banks (otherwise cannot support)
Pentium Pro allows 4 outstanding memory misses
Value of Hit Under Miss for SPEC

Data Prefetch
Load data into register (HP PA-RISC loads)
Cache Prefetch: load into cache
(MIPS IV, PowerPC, SPARC v. 9)
Special prefetching instructions cannot cause faults;
a form of speculative execution
Issuing Prefetch Instructions takes time
Is cost of prefetch issues < savings in reduced misses?
Higher superscalar reduces difficulty of issue bandwidth
The techniques to improve hit time, bandwidth, miss penalty and miss rate generally
affect the other components of the average memory access equation as well as the
complexity of the memory hierarchy.
Department of CSE, SJBIT

How to Improve Cache Performance?
Cache optimizations
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache
Cache Optimisation
Why improve Cache performance:

Performance improvement of CPU vs Memory- CPU fabrication has advanced

much more than memory- hence need to use cache optimization techniques.

Department of CSE, SJBIT

Page 105

Advance Computer Architecture

10CS74

Review: 6 Basic Cache Optimizations

Reducing hit time
1.Address Translation during Cache Indexing
Reducing Miss Penalty
2. Multilevel Caches
3. Giving priority to read misses over write misses
Reducing Miss Rate
4. Larger Block size (Compulsory misses)
5. Larger Cache size (Capacity misses)
6. Higher Associativity (Conflict misses)
11 Advanced Cache Optimizations
Reducing hit time
1. Small and simple caches
2. Way prediction
3. Trace caches
Increasing cache bandwidth
4. Pipelined caches
5. Multibanked caches
6. Nonblocking caches
Reducing Miss Penalty
7. Critical word first
8. Merging write buffers
Reducing Miss Rate
9.Compiler optimizations
Reducing miss penalty or miss rate via parallelism
10.Hardware prefetching
11.Compiler prefetching

1. Fast Hit times via Small and Simple Caches

Index tag memory and then compare takes time
Small cache can help hit time since smaller memory takes less time to index
E.g., L1 caches same size for 3 generations of AMD icroprocessors:
K6, Athlon, and Opteron
Also L2 cache small enough to fit on chip with the processor avoids
time penalty of going off chip
Simple direct mapping
Can overlap tag check with data transmission since no choice
2. Fast Hit times via Way Prediction
How to combine fast hit time of Direct Mapped and have the lower conflict
misses of 2-way SA cache?
Way prediction: keep extra bits in cache to predict the way, or block within
the set, ofnext cache access.
Multiplexer is set early to select desired block, only 1 tag comparison performed
that clock cycle in parallel with reading the cache data
Department of CSE, SJBIT

Page 106

Advance Computer Architecture

10CS74

Miss - 1st check other blocks for matches in next clock cycle
3. Fast Hit times via Trace Cache
Find more instruction level parallelism?
How avoid translation from x86 to microops?- Trace cache in Pentium 4
1. Dynamic traces of the executed instructions vs. static sequence of instructions
as determined by layout in memory
Built-in branch predictor
2. Cache the micro-ops vs. x86 instructions - Decode/translate from x86 to
micro-ops on trace cache miss
+ 1. better utilize long blocks (dont exit in middle of block, dont enter at label in
middle of block)
- 1. complicated address mapping since addresses no longer aligned to power-of-2
multiples of word size
- 1. instructions may appear multiple times in multiple dynamic traces due to different
branch outcomes
4: Increasing Cache Bandwidth by Pipelining
Pipeline cache access to maintain bandwidth, but higher latency
Instruction cache access pipeline stages:
1: Pentium
2: Pentium Pro through Pentium III
4: Pentium 4
- greater penalty on mispredicted branches
- more clock cycles between the issue of the load and the use of the data

5. Increasing Cache Bandwidth:

Non-Blocking Caches- Reduce Misses/Penalty
Non-blocking cache or lockup-free cache allow data cache to continue to supply
cache hits during a m iss
requires F/E bits on registers or out-of-order execution
requires multi-bank memories
hit under miss reduces the effective miss penalty by working
during miss vs. ignoring CPU requests
hit under multiple miss or miss under miss may further lower the effective
miss penalty by overlapping multiple misses
Significantly increases the complexity of the cache controller as there
can be multiple outstanding memory accesses
Requires muliple memory banks (otherwise cannot support)
Penium Pro allows 4 outstanding memory misses
6: Increasing Cache Bandwidth via Multiple Banks
Rather than treat the cache as a single monolithic block, divide into independent banks
that can support simultaneous accesses
E.g.,T1 (Niagara) L2 has 4 banks
Department of CSE, SJBIT

Page 107

Advance Computer Architecture

10CS74

Banking works best when accesses naturally spread themselves across banks
mapping of addresses to banks affects behavior of memory system

Simple mapping that works well is sequential interleaving

Spread block addresses sequentially across banks
E,g, if there 4 banks, Bank 0 has all blocks whose address modulo 4 is
0; bank 1 has all blocks whose address modulo 4 is 1; .
7. Reduce Miss Penalty:
Early Restart and Critical Word First
Dont wait for full block before restarting CPU
Early restartAs soon as the requested word of the block arrives, send it to the CPU
and let the CPU continue execution
Spatial locality - tend to want next sequential word, so not clear size of benefit of just
early restart
Critical Word FirstRequest the missed word first from memory and send it to the
CPU as soon as it arrives; let the CPU continue execution while filling the rest of the
words in the block

Department of CSE, SJBIT

Page 108

Advance Computer Architecture

10CS74

8. Merging Write Buffer to Reduce Miss Penalty

Write buffer to allow processor to continue while waiting to write to memory
If buffer contains modified blocks, the addresses can be checked to see if address
of new data matches the address of a valid write buffer entry -If so, new data are
combined with that entry
Increases block size of write for write-through cache of writes to sequential
words, bytes since multiword writes more efficient to memory
The Sun T1 (Niagara) processor, among many others, uses write merging

9. Reducing Misses by Compiler Optimizations

McFarling [1989] reduced caches misses by 75% on 8KB direct mapped cache, 4 byte
blocks in software
Instructions
Reorder procedures in memory so as to reduce conflict misses
Profiling to look at conflicts (using tools they developed)
Data
Merging Arrays: improve spatial locality by single array of compound elements vs. 2
arrays
Loop Interchange: change nesting of loops to access data in order
stored in memory
Loop Fusion: Combine 2 independent loops that have same looping and some variables
overlap
Blocking: Improve temporal locality by accessing blocks of data repeatedly vs.
going down whole columns or rows
Department of CSE, SJBIT

Page 109

Advance Computer Architecture

10CS74

Compiler Optimizations- Reduction comes from software (no Hw ch.)

Loop Interchange
Motivation: some programs have nested loops that access data in nonsequential order
Solution: Simply exchanging the nesting of the loops can make the code access the data
in the order it is stored =>
reduce misses by improving spatial locality; reordering maximizes use of data in a cache
block before it is discarded
Loop Interchange Example
/* Before */
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
Blocking
Motivation: multiple arrays, some accessed by rows and some by columns
Storing the arrays row by row (row major order) or column by column (column major
order) does not help: both rows and columns are used in every iteration of the loop
(Loop Interchange cannot help)
Solution: instead of operating on entire rows and columns of an array, blocked
algorithms operate on submatrices or blocks => maximize accesses to the data loaded
into the cache before the data is replaced
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1){
r = r + y[i][k]*z[k][j];};
x[i][j] = r;
};
/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j < min(jj+B,N); j = j+1)
{r = 0;
for (k = kk; k < min(kk+B,N); k = k + 1)
r = r + y[i][k]*z[k][j];
x[i][j] = x[i][j] + r;
};
Snapshot of x, y, z when
i=1
Department of CSE, SJBIT

Page 110

Advance Computer Architecture

10CS74

White:

White: not yet touched

Managing Virtual Memory

Virtual memory offers many of the features required for hardware virtualization
Separates the physical memory onto multiple processes
Each process thinks it has a linear address space of full size
Processor holds a page table translating virtual addresses used by a process and the
according physical memory
Additional information restricts processes from
Reading a page of on another process or
Allow reading but not modifying a memory page or
Do not allow to interpret data in the memory page as instructions and do not try to
execute them
Virtual Memory management thus requires
Mechanisms to limit memory access to protected memory
At least two modes of execution for instructions
Privileged mode: an instruction is allowed to do what it whatever it wants -> kernel
mode for OS
Non-privileged mode: user-level processes
Intel x86 Architecture: processor supports four levels
Level 0 used by OS
Level 3 used by regular applications
Provide mechanisms to go from non-privileged mode to privileged mode -> system call
Provide a portion of processor state that a user process can read but not modify
E.g. memory protection information
Each guest OS maintains its page tables to do the mapping from virtual address to
physical address
Most simple solution: VMM holds an additional table which maps the physical address
of a guest OS onto the machine address
Introduces a third level of redirection for every memory access
Alternative solution: VMM maintains a shadow page table of each guest OS
Copy of the page table of the OS
Page tables still works with regular physical addresses
Only modifications to the page table are intercepted by the VMM

Protection via Virtual Machines

-some definitions
VMs include all emulation methods to provide a standard software interface
Different ISAs can be used (emulated) on the native machine
When the ISAs match the hardware we call it (operating) system virtual
machines
Multiple OSes all share the native hardware
Cost of Processor Virtualisation
VM is much smaller than traditional OS
Isolation portion is only about 10000 lines for a VMM
Processor bound programs have very little virtualisation overhead
Department of CSE, SJBIT

Page 118

Advance Computer Architecture

10CS74

I/O bound jobs have more overhead

ISA emulation is costly

Other benefits of VMs

Managing software
Complete software stack
Old Oses like DOS
Current stable OS
Next OS release
Managing Hardware
Multiple servers avoided
VMs enable hardware sharing
Migration of a running VM to another m/c
For balancing load or evacuate from failing HW
Requirements of a VMM
Guest sw should behave exactly on VM as if on native hw
Guest sw should not be able to change allocation of RT resources directly
Timer interrupts should be virtualised
Two processor modes- system and user
Priveleged subset of instruction available only in system mode

More issues on VM usage

ISA support for virtual machine
IBM360 support
80x86 do no support
Use of virtual memory
Concept of virtual- real- physical memories
Instead of extra indirection use shadow page table
Virtualising I/Os
More i/o
More diversity
Physical disks to partitioned virtual disks
Network cards time sliced

Department of CSE, SJBIT

Page 119

Advance Computer Architecture

10CS74

UNIT - 8
HARDWARE AND SOFTWARE FOR VLIW AND EPIC:
Introduction
Exploiting Instruction-Level Parallelism Statically
Detecting and Enhancing Loop-Level Parallelism
Scheduling and Structuring Code for Parallelism
Hardware Support for Exposing Parallelism
Predicated Instructions; Hardware Support for Compiler Speculation
The Intel IA-64 Architecture and Itanium Processor; Conclusions.
7 Hours

Department of CSE, SJBIT

Page 120

Advance Computer Architecture

10CS74

UNIT VIII
HARDWARE AND SOFTWARE FOR VLIW AND EPIC
Loop Level Parallelism- Detection and Enhancement
Static Exploitation of ILP
Use compiler support for increasing parallelism
Supported by hardware
Techniques for eliminating some types of dependences
Applied at compile time (no run time support)
Finding parallelism
Reducing control and data dependencies
Using speculation
Unrolling Loops High-level
for (i=1000; i>0; i=i-1) x[i] = x[i] + s;
C equivalent of unrolling to block four iterations into one:
for (i=250; i>0; i=i-1)
{
x[4*i] = x[4*i] + s;
x[4*i-1] = x[4*i-1] + s;
x[4*i-2] = x[4*i-2] + s;
x[4*i-3] = x[4*i-3] + s;
}
Enhancing Loop-Level Parallelism
Consider the previous running example:
for (i=1000; i>0; i=i-1) x[i] = x[i] + s;
there is no loop-carried dependence where data used in a later iteration depends on
data produced in an earlier one
in other words, all iterations could (conceptually) be executed in parallel
Contrast with the following loop:
for (i=1; i<=100; i=i+1) { A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */ }
what are the dependences?
A Loop with Dependences
For the loop:
for (i=1; i<=100; i=i+1) { A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */ }
what are the dependences?
There are two different dependences:
loop-carried: (prevents parallel operation of iterations)
S1 computes A[i+1] using value of A[i] computed in previous iteration
S2 computes B[i+1] using value of B[i] computed in previous iteration
not loop-carried: (parallel operation of iterations is ok)
Department of CSE, SJBIT

Page 121

Advance Computer Architecture

10CS74

S2 uses the value A[i+1] computed by S1 in the same iteration

The loop-carried dependences in this case force successive iterations of the loop to
execute in series. Why?
S1 of iteration i depends on S1 of iteration i-1 which in turn depends on , etc.
Another Loop with Dependences
Generally, loop-carried dependences hinder ILP
if there are no loop-carried dependences all iterations could be executed in parallel
even if there are loop-carried dependences it may be possible to parallelize the loop an
analysis of the dependences is required
For the loop:
for (i=1; i<=100; i=i+1) { A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */ }
what are the dependences?
There is one loop-carried dependence:
S1 uses the value of B[i] computed in a previous iteration by S2
but this does not force iterations to execute in series. Why?
because S1 of iteration i depends on S2 of iteration i-1, and the chain of
dependences stops here!
Parallelizing Loops with Short Chains of Dependences
Parallelize the loop:
for (i=1; i<=100; i=i+1) { A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */ }
Parallelized code:
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1)
{ B[i+1] = C[i] + D[i];
A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
the dependence between the two statements in the loop is no longer loop-carried and
iterations of the loop may be executed in parallel

Loop-Carried Dependence Detection: affine array index: a x i+b

To detect loop-carried dependence in a loop, the Greatest Common Divisor (GCD) test
can be used by the compiler, which is based on the following:
If an array element with index: a x i + b is stored and element: c x i + d of
the same array is loaded later where index runs from m to n, a dependence exists if
the following two conditions hold:
1.There are two iteration indices, j and k , m <= j , k <= n
(within iteration limits)
2.The loop stores into an array element indexed by:
axj+b
and later loads from the same array the element indexed by:
cxk+d
Department of CSE, SJBIT

Page 122

ADD.D
S.D
Iteration i+1: L.D
ADD.D
S.D
Iteration i+2: L.D
ADD.D
S.D

10CS74

F4,F0,F2
F4,0(R1)
F0,0(R1)
F4,F0,F2
F4,0(R1)
F0,0(R1)
F4,F0,F2
F4,0(R1)

SW pipelined loop with startup and cleanup code

#startup, assume i runs from 0 to n
ADDI
R1,R1-16
LD
F0,16(R1)
ADDD
F4,F0,F2
LD
F0,8(R1)
#body for (i=2;i<=n-2;i++)
Loop: SD
F4,16(R1)
ADDD
F4,F0,F2
LD F0,0(R1)
ADDI
R1,R1,-8
BNE
R1,R2,Loop
#cleanup
SD
F4,8(R1)
ADDD
F4,F0,F2
SD
F4,0(R1)
Software pipelining versus unrolling

#point to v[n-2]
#load v[n]
#add v[n]
#load v[n-1]
#store to v[i]
#add to v[i-1]
#load v[i-2]

#store v[1]
#add v[0]
#store v[0]

Performance effects of SW pipelining vs. unrolling

Unrolling reduces loop overhead per iteration
SW pipelining reduces startup-cleanup pipeline overhead

Department of CSE, SJBIT

Page 126

Advance Computer Architecture

10CS74

Software pipelining versus unrolling (cont.)

Software pipelining

Advantages
Less code space than conventional unrolling
Loop runs at peak speed during steady state
Overhead only at loop initiation and termination
Complements unrolling
Disadvantages
Hard to overlap long latencies
Unrolling combined with SW pipelining
Requires advanced compiler transformations
Global Code Scheduling
Global code scheduling aims to compact a code fragment with internal control structure
into the shortest possible sequence that preserves the data and control dependences.

Page 132

Advance Computer Architecture

10CS74

Super Blocks for Global Scheduling

Motivation :
Entries and exits into trace schedule code are complicated
Compiler cannot do a good cost analysis about compensation code
Superblocks
Are like traces
One entry point
Unlike traces
Different exit points
Common in for loops
Single entry and exit points
Code motion across exit only need be considered

Superblock Construction

Department of CSE, SJBIT

Page 133

Advance Computer Architecture

10CS74

Tail duplication
Creates a separate block that corresponds to the portion of trace after the entry
When proceeding as per prediction Take the path of superblock code
When exit from
superblock
Residual loop that handles rest of the iterations

Analysis on Superblocks
Reduces the complexity of bookkeeping and scheduling
Unlike the trace approach
Can have larger code size though
Assessing the cost of duplication
Compilation process is not simple any more
H/W Support : Conditional Execution
Also known as Predicated Execution
Enhancement to instruction set
Can be used to eliminate branches
All control dependences are converted to data dependences
Instruction refers to a condition
Evaluated as part of the execution
True?
Executed normally
False?
Execution continues as if the instruction were a no-op
Example :
Conditional move between registers

Example
if (A==0)
S = T;
Straightforward Code
BNEZ R1,
L;
ADDU R2, R3, R0
L:
Conditional Code
CMOVZ R2, R3, R1
Annulled if R1 is not 0
Conditional Instruction
Can convert control to data dependence
In vector computing, its called if conversion.
Traditionally, in a pipelined system
Dependence has to be resolved closer to front of pipeline
For conditional execution
Dependence is resolved at end of pipeline, closer to the register write

Department of CSE, SJBIT

Page 134

Advance Computer Architecture

10CS74

Another example
A = abs(B)
if (B < 0)
A = -B;
else
A = B;
Two conditional moves
One unconditional and one conditional move
The branch condition has moved into the
instruction
Control dependence becomes data dependence

Limitations of Conditional Moves

Conditional moves are the simplest form of predicated instructions
Useful for short sequences
For large code, this can be inefficient
Introduces many conditional moves
Some architectures support full predication
All instructions, not just moves
Very useful in global scheduling
Can handle nonloop branches nicely
Eg : The whole if portion can be predicated if the frequent path is not taken

Assume : Two issues, one to ALU and one to memory; or branch by itself
Wastes a memory operation slot in second cycle
Can incur a data dependence stall if branch is not taken
R9 depends on R8
Predicated Execution
Assume : LWC is predicated load and loads if third operand is not 0

Department of CSE, SJBIT

Page 135

Advance Computer Architecture

10CS74

One instruction issue slot is eliminated

On mispredicted branch, predicated instruction will not have any effect
If sequence following the branch is short, the entire block of the code can be predicated
Some Complications
Exception Behavior
Must not generate exception if the predicate is false
If R10 is zero in the previous example
LW R8, 0(R10) can cause a protection fault
If condition is satisfied
A page fault can still occur
Biggest Issue Decide when to annul an instruction
Can be done during issue
Early in pipeline
Value of condition must be known early, can induce stalls
Can be done before commit
Modern processors do this
Annulled instructions will use functional resources
Register forwarding and such can complicate implementation

Limitations of Predicated Instructions

Annulled instructions still take resources
Fetch and execute atleast
For longer code sequences, benefits of conditional move vs branch is not clear
Only useful when predicate can be evaluated early in the instruction stream
What if there are multiple branches?
Predicate on two values?
Higher cycle count or slower clock rate for predicated instructions
More hardware overhead
MIPS, Alpha, Pentium etc support partial predication
IA-64 has full predication

Preserve control and data flow, precise interrupts in Predication
Speculative predicated instructions may not throw illegal exceptions
LWC may not throw exception if R10 == 0
LWC may throw recoverable page fault if R10 6= 0
Instruction conversion to nop
Early condition detection may not be possible due to data dependence
Late condition detection incurs stalls and consumes pipeline resources needlessly
Instructions may be dependent on multiple branches
Compiler able to find instruction slots and reorder
Hardware support for speculation
Alternatives for handling speculative exceptions
Hardware and OS ignore exceptions from speculative instructions
Mark speculative instructions and check for exceptions
Additional instructions to check for exceptions and recover
Registers marked with poison bits to catch exceptions upon read
Hardware buffers instruction results until instruction is no longer speculative
Exception classes
Recoverable: exception from speculative instruction may harm performance, but not
preciseness
Unrecoverable: exception from speculative instruction compromises preciseness
Solution I: Ignore exceptions
HW/SW solution
Instruction causing exception returns undefined value
Value not used if instruction is speculative
Incorrect result if instruction is non-speculative
Compiler generates code to throw regular exception
Rename registers receiving speculative results

Department of CSE, SJBIT

Page 139

Advance Computer Architecture

10CS74

Solution I: Ignore exceptions

Example

Solution II: mark speculative instructions

Advance Computer Architecture

10CS74

Itanium2 Specs
6 Integer ALU's
6 multimedia ALU's
2 Extended Precision FP Units
2 Single Precision FP units
2 Load and Store Units
3 Branch Units
8 Stage 6 Wide Pipeline
32k L1 Cache
256K L2 Cache
3MB L3 Cache(on die)
1Ghz Clock initially
Up to 1.66Ghz on Montvale
Itanium2 Improvements
Initially a 180nm process
Increased to 130nm in 2003
Further increased to 90nm in 2007
Improved Thermal Management
Clock Speed increased to 1.0Ghz
Bus Speed Increase from 266Mhz to 400Mhz

Department of CSE, SJBIT

Page 150

Advance Computer Architecture

10CS74

_Use predicates to eliminate branches, move instructions across branches

_Conditional execution of an instruction based on predicate register (64 1-bit
predicate registers)
_Predicates are set by compare instructions
_Most instructions can be predicated each instruction code contains predicate
field
_If predicate is true, the instruction updates the computation state; otherwise, it
behaves like a nop

Scheduling and Speculation

Basic block: code with single entry and exit, exit point can be multiway branch
Control Improve ILP by statically move ahead long latency code blocks.
path is a frequent execution path
Schedule for control paths
Because of branches and loops, only small percentage of code is executed regularly
Analyze dependences in blocks and paths
Compiler can analyze more efficiently - more time, memory, larger view of the program
Compiler can locate and optimize the commonly executed blocks

Department of CSE, SJBIT

Page 151

Advance Computer Architecture

10CS74

Control speculation
_ Not all the branches can be removed using predication.
_ Loads have longer latency than most instructions and tend to start timecritical
chains of instructions
_ Constraints on code motion on loads limit parallelism
_ Non-EPIC architectures constrain motion of load instruction
_ IA-64: Speculative loads, can safely schedule load instruction before one or
more prior branches
Control Speculation
_Exceptions are handled by setting NaT (Not a Thing) in target register
_Check instruction-branch to fix-up code if NaT flag set
_Fix-up code: generated by compiler, handles exceptions
_NaT bit propagates in execution (almost all IA-64 instructions)
_NaT propagation reduces required check points
Speculative Load
_ Load instruction (ld.s) can be moved outside of a basic block even if branch target
is not known
_ Speculative loads does not produce exception - it sets the NaT
_ Check instruction (chk.s) will jump to fix-up code if NaT is set
Data Speculation
Department of CSE, SJBIT

Page 152

Advance Computer Architecture

10CS74

_ The compiler may not be able to determine the location in memory being
referenced (pointers)
_ Want to move calculations ahead of a possible memory dependency
_ Traditionally, given a store followed by a load, if the compiler cannot
determine if the addresses will be equal, the load cannot be moved ahead of the
store.
_ IA-64: allows compiler to schedule a load before one or more stores
_ Use advance load (ld.a) and check (chk.a) to implement
_ ALAT (Advanced Load Address Table) records target register, memory
address accessed, and access size
Data Speculation
1. Allows for loads to be moved ahead of stores even if the compiler is unsure if
addresses are the same
2. A speculative load generates an entry in the ALAT
3. A store removes every entry in the ALAT that have the same address
4. Check instruction will branch to fix-up if the given address is not in the ALAT

Department of CSE, SJBIT

Page 153

Advance Computer Architecture

10CS74

Use address field as the key for comparison

If an address cannot be found, run recovery code
ALAT are smaller and simpler implementation than equivalent structures
for superscalars

Register Model
_128 General and Floating Point Registers
_32 always available, 96 on stack
_As functions are called, compiler allocates a specific number of local and output
registers to use in the function by using register allocation instruction Alloc.
Department of CSE, SJBIT

Page 154

Advance Computer Architecture

10CS74

_Programs renames registers to start from 32 to 127.

_Register Stack Engine (RSE) automatically saves/restores stack to memory when
needed
_RSE may be designed to utilize unused memory bandwidth to perform register
spill and fill operations in the background

On function call, machine shifts register window such that previous output registers
become new locals starting at r32
Software Pipelining
_loops generally encompass a large portion of a programs execution time, so its
important to expose as much loop-level parallelism as possible.
_Overlapping one loop iteration with the next can often increase the parallelism.
Software Pipelining

We can implement loops in parallel by resolve some problems.

Department of CSE, SJBIT

Page 155

Advance Computer Architecture

10CS74

_Managing the loop count,

_Handling the renaming of registers for the pipeline,
_Finishing the work in progress when the loop ends,
_Starting the pipeline when the loop is entered, and
_Unrolling to expose cross-iteration parallelism.
IA-64 gives hardware support to compilers managing a software pipeline
Facilities for managing loop count, loop termination, and rotating registers
The combination of these loop features and predication enables the compiler to
generate compact code, which performs the essential work of the loop in a highly parallel
form.

Department of CSE, SJBIT

Page 156

Advanced Computer Architecture
No ratings yet
Advanced Computer Architecture
2 pages
Accenture PLSQL
No ratings yet
Accenture PLSQL
173 pages
Advanced Computer Architecture
No ratings yet
Advanced Computer Architecture
156 pages
Cse Viii Advanced Computer Architectures (06cs81) Notes
No ratings yet
Cse Viii Advanced Computer Architectures (06cs81) Notes
156 pages
Cse Viii Advanced Computer Architectures 06CS81 Notes PDF
No ratings yet
Cse Viii Advanced Computer Architectures 06CS81 Notes PDF
156 pages
Cse.m-ii-Advances in Computer Architecture (12scs23) - Notes
No ratings yet
Cse.m-ii-Advances in Computer Architecture (12scs23) - Notes
213 pages
Complete Lessonplan Aca 12 Unki
No ratings yet
Complete Lessonplan Aca 12 Unki
19 pages
CS-704 Handouts Version 1
No ratings yet
CS-704 Handouts Version 1
477 pages
ACA Syllabus
No ratings yet
ACA Syllabus
10 pages
ACA Mod1
No ratings yet
ACA Mod1
118 pages
Aca
No ratings yet
Aca
3 pages
Unit-1 ACA
No ratings yet
Unit-1 ACA
86 pages
Computer Organization Notes
No ratings yet
Computer Organization Notes
82 pages
PDF
No ratings yet
PDF
41 pages
Cs6303comparchnotes PDF
No ratings yet
Cs6303comparchnotes PDF
250 pages
Cs1304-Computer Architecture Department of Cse & It
No ratings yet
Cs1304-Computer Architecture Department of Cse & It
105 pages
CS3350B Computer Architecture: Marc Moreno Maza
100% (1)
CS3350B Computer Architecture: Marc Moreno Maza
45 pages
ECE2015 - CA - All Slides
No ratings yet
ECE2015 - CA - All Slides
553 pages
CSE 820 Graduate Computer Architecture: Dr. Enbody
No ratings yet
CSE 820 Graduate Computer Architecture: Dr. Enbody
25 pages
Study Notes COAL Mids
No ratings yet
Study Notes COAL Mids
14 pages
Advanced Computer Systems (
No ratings yet
Advanced Computer Systems (
24 pages
CS6303 Syllabus
No ratings yet
CS6303 Syllabus
3 pages
ECE690 Syllabus
No ratings yet
ECE690 Syllabus
2 pages
Brief History of Computer Evolution
No ratings yet
Brief History of Computer Evolution
13 pages
Lect 01 PDF
No ratings yet
Lect 01 PDF
33 pages
Introduction To Computer Architecture
No ratings yet
Introduction To Computer Architecture
17 pages
Computer Architecture and Organization: General Introduction
No ratings yet
Computer Architecture and Organization: General Introduction
72 pages
Comp Arch Ch1 Ch2 Ch3 Ch4
No ratings yet
Comp Arch Ch1 Ch2 Ch3 Ch4
161 pages
Advanced Computer Architecture: CSE-401 E
No ratings yet
Advanced Computer Architecture: CSE-401 E
71 pages
BIS124 COMPUTER ARCHITECTURE AND ORGANISATION Course Outline
No ratings yet
BIS124 COMPUTER ARCHITECTURE AND ORGANISATION Course Outline
5 pages
Week 1 and Week 2 - Lecture 1 of 4v1
No ratings yet
Week 1 and Week 2 - Lecture 1 of 4v1
26 pages
Smd150 Computer Architecture: Per Lindgren Eislab, Lectures Andrey Kruglyak, Syncsim Johan Eriksson, VHDL
No ratings yet
Smd150 Computer Architecture: Per Lindgren Eislab, Lectures Andrey Kruglyak, Syncsim Johan Eriksson, VHDL
43 pages
Advanced Computer Architecture ECE 6373: Pauline Markenscoff N320 Engineering Building 1 E-Mail: Markenscoff@uh - Edu
No ratings yet
Advanced Computer Architecture ECE 6373: Pauline Markenscoff N320 Engineering Building 1 E-Mail: Markenscoff@uh - Edu
151 pages
Modern Computer Architecture: Lecture1 Fundamentals of Quantitative Design and Analysis (I)
No ratings yet
Modern Computer Architecture: Lecture1 Fundamentals of Quantitative Design and Analysis (I)
41 pages
Advance Operating System-Computer Organization: Chap 1a: Overview
No ratings yet
Advance Operating System-Computer Organization: Chap 1a: Overview
71 pages
Computer Organization 01
No ratings yet
Computer Organization 01
22 pages
Aula Advanced Comp Syst 2008 Dartmouth
No ratings yet
Aula Advanced Comp Syst 2008 Dartmouth
25 pages
RTSEC Documentation
No ratings yet
RTSEC Documentation
4 pages
COA Syllabus Acad Coun Apprved
No ratings yet
COA Syllabus Acad Coun Apprved
2 pages
CA Lec1
No ratings yet
CA Lec1
29 pages
Architecture1 1 (2012)
No ratings yet
Architecture1 1 (2012)
87 pages
Aca
No ratings yet
Aca
71 pages
CA Classes-26-30
No ratings yet
CA Classes-26-30
5 pages
Coa Final Merged
No ratings yet
Coa Final Merged
105 pages
Pressed
No ratings yet
Pressed
222 pages
Defining Computer Architecture
No ratings yet
Defining Computer Architecture
6 pages
A Brief History of Computer Architecture
No ratings yet
A Brief History of Computer Architecture
12 pages
EEM 486: Computer Architecture: Course Introduction and The Five Components of A Computer
No ratings yet
EEM 486: Computer Architecture: Course Introduction and The Five Components of A Computer
13 pages
CS5204/EE5364 - Advanced Computer Architecture - Introduction
No ratings yet
CS5204/EE5364 - Advanced Computer Architecture - Introduction
28 pages
Fundamentals of Modern Computer Architecture: From Logic Gates to Parallel Processing
From Everand
Fundamentals of Modern Computer Architecture: From Logic Gates to Parallel Processing
Sam Steed
No ratings yet
1 - Intro
No ratings yet
1 - Intro
12 pages
ReviewedCSC303 CompiledNote 2023 24
No ratings yet
ReviewedCSC303 CompiledNote 2023 24
78 pages
Two-Marks Material Computer Architecture
No ratings yet
Two-Marks Material Computer Architecture
35 pages
CH 1 - Introduction To Computer Architecture and Performance Measurement
No ratings yet
CH 1 - Introduction To Computer Architecture and Performance Measurement
42 pages
CS 404 - COA Course Plan
No ratings yet
CS 404 - COA Course Plan
8 pages
Aca Co
100% (1)
Aca Co
2 pages
Unit I Overview & Instructions: Cs6303-Computer Architecture
100% (1)
Unit I Overview & Instructions: Cs6303-Computer Architecture
16 pages
CompArch - Chapter One
No ratings yet
CompArch - Chapter One
9 pages
Embedded Systems Programming with C: Writing Code for Microcontrollers
From Everand
Embedded Systems Programming with C: Writing Code for Microcontrollers
Larry Jones
No ratings yet
Upgrading Old PCs
From Everand
Upgrading Old PCs
Isaac Berners-Lee
No ratings yet
Mastering Embedded C: The Ultimate Guide to Building Efficient Systems
From Everand
Mastering Embedded C: The Ultimate Guide to Building Efficient Systems
Robert Johnson
No ratings yet
Procedures and Functions
No ratings yet
Procedures and Functions
4 pages
Java Project Report
100% (2)
Java Project Report
46 pages
Siebel Workflow: Page 1 of 66
No ratings yet
Siebel Workflow: Page 1 of 66
66 pages
POLYMORPHISM
No ratings yet
POLYMORPHISM
21 pages
Mod Menu Log - Com - Mobile.legends
No ratings yet
Mod Menu Log - Com - Mobile.legends
933 pages
AP350
No ratings yet
AP350
23 pages
Master of Science in Information Technology (MScIT-NEW) - Semester 3
No ratings yet
Master of Science in Information Technology (MScIT-NEW) - Semester 3
29 pages
CS508 FinalTerm Solved Short Questions
No ratings yet
CS508 FinalTerm Solved Short Questions
40 pages
Compile and Runtime Errors in Java
0% (1)
Compile and Runtime Errors in Java
208 pages
COM323 Programming in Java May 2023 Set 1 Rubric
No ratings yet
COM323 Programming in Java May 2023 Set 1 Rubric
16 pages
Opp
No ratings yet
Opp
9 pages
Each Question Carries 2 Marks
No ratings yet
Each Question Carries 2 Marks
35 pages
TS2104 Queued Message Handler
No ratings yet
TS2104 Queued Message Handler
19 pages
OCAJP 8 Exam Topics: Java Basics
No ratings yet
OCAJP 8 Exam Topics: Java Basics
3 pages
C++Coursematerial MentorLabs
No ratings yet
C++Coursematerial MentorLabs
149 pages
Final Jimjim Oop - Learning Module - Final
No ratings yet
Final Jimjim Oop - Learning Module - Final
50 pages
Blackberry JVM Error Codes Listed
No ratings yet
Blackberry JVM Error Codes Listed
7 pages
Development II in Microsoft Dynamics AX 2012
No ratings yet
Development II in Microsoft Dynamics AX 2012
5 pages
Excel VBA Best Practice: Simon Murphy Developer - Codematic LTD
No ratings yet
Excel VBA Best Practice: Simon Murphy Developer - Codematic LTD
28 pages
Struts 1.2
No ratings yet
Struts 1.2
5 pages
Unit 4 Introduction To ARM CORTEX M4
100% (1)
Unit 4 Introduction To ARM CORTEX M4
84 pages
D 51925
No ratings yet
D 51925
13 pages
Interrupts and Exceptions
No ratings yet
Interrupts and Exceptions
7 pages
Advanced Java Programming
No ratings yet
Advanced Java Programming
129 pages
Lecture 5 - Part 2 Collections and Exceptions
No ratings yet
Lecture 5 - Part 2 Collections and Exceptions
26 pages
Final DBMS Lab
No ratings yet
Final DBMS Lab
31 pages
Java Titbits
No ratings yet
Java Titbits
24 pages
JDBC Connectivity With Ms-Access
No ratings yet
JDBC Connectivity With Ms-Access
54 pages
Java Exam Preparation Model Testn
No ratings yet
Java Exam Preparation Model Testn
8 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

ACA Notes

Uploaded by

ACA Notes

Uploaded by

Advance Computer Architecture

Department of CSE, SJBIT

Advance Computer Architecture

Computer Architecture, A Quantitative Approach John L. Hennessey and

Advanced Computer Architecture Parallelism, Scalability Kai Hwang:,

Parallel Computer Architecture, A Hardware / Software Approach David

Department of CSE, SJBIT

Advance Computer Architecture

Department of CSE, SJBIT

Advance Computer Architecture

Introduction; Classes of computers

Department of CSE, SJBIT

Advance Computer Architecture

Department of CSE, SJBIT

Advance Computer Architecture

Figure 1.1 The evolution of various classes of computers:

Department of CSE, SJBIT

Advance Computer Architecture

Department of CSE, SJBIT

Advance Computer Architecture

Defining C omputer Arch itecture

Advance Computer Architecture

v)Operations:The general category o f operations are:

Variable Length ISA

Number of Registers and number of Addressing modes hav e significant

Department of CSE, SJBIT

Advance Computer Architecture

Advance Computer Architecture

Trends in Power in Integrated Circuits

Advance Computer Architecture

about 10% for each doubling of volume.

Advance Computer Architecture

Systems alternate between 2 states of service with respect to an SLA:

Department of CSE, SJBIT

Advance Computer Architecture

Department of CSE, SJBIT

Advance Computer Architecture

Note : when comparing 2 computers as a ratio, execution times on the reference

Quantitative Principles of Computer Design

Advance Computer Architecture

The Processor performance Equation:

Department of CSE, SJBIT

Advance Computer Architecture

Department of CSE, SJBIT

Advance Computer Architecture

Department of CSE, SJBIT

Advance Computer Architecture

2. Instruction decode / Register fetch cycle (ID):

Department of CSE, SJBIT

Advance Computer Architecture

3. Execution / Effective address Cycle (EXE)

Five stage Pipeline for a RISC processor

Department of CSE, SJBIT

Advance Computer Architecture

Department of CSE, SJBIT

Advance Computer Architecture

Department of CSE, SJBIT

Advance Computer Architecture

Basic Performance issues in Pipelining

Department of CSE, SJBIT

Advance Computer Architecture

Department of CSE, SJBIT

Advance Computer Architecture

If there are no pipeline stalls,

Department of CSE, SJBIT

Advance Computer Architecture

Pipeline stall is commonly called Pipeline bubble or just simply bubble

Department of CSE, SJBIT

Advance Computer Architecture

Minimizing data hazard by Forwarding

Department of CSE, SJBIT

Advance Computer Architecture

Advance Computer Architecture

Reducing the Brach Penalties

Advance Computer Architecture

which the to delay slot can be filled by the compiler.