485 COURSES Computer Architecture I 14103
Objectives: This course introduces students to the fundamental concepts underlying modern
computer organization and architecture. The emphasis is on studying and analysing fundamental
issues in architecture design and their impact on performance, and on familiarizing students with
hardware design, including the basic structure and behaviour of the various functional modules of
the computer and how they interact to provide the processing needs of the user.
Course Outlines:
Fundamentals of Computer Architecture: Basic computer architecture, types, organization
of the von Neumann machine, instruction formats, fetch/execute cycle, instruction
decoding and execution, registers and register files, instruction types and
addressing modes.
Processor System Design: The CPU interface: clock, control, data and address buses;
address decoding and memory interfacing; basic parallel and serial interfaces; pipelining;
CISC and RISC.
Memory System Organization and Architecture: Memory systems hierarchy, main
memory organization and its characteristics and performance, latency, cycle time,
bandwidth, and interleaving, cache memories.
The Processor: The processor is the workhorse of the system; it is the component that executes a
program by performing arithmetic and logical operations on data. It is the only
component that creates new information by combining or modifying current information.
In a typical system there is only one processor, known as the Central Processing
Unit (CPU). Modern high-performance systems such as vector processors and parallel
processors often have more than one processor. Systems with only one processor are called
serial processors or, especially among computational scientists, scalar processors.
Input/Output (I/O) Devices: These transfer information, without altering it, between the
external world and one or more internal components. I/O devices can be secondary
memories, for example disks and tapes, or devices used to communicate directly with
users, such as video displays, keyboards, and mice.
The Communication Channels: These tie the system together and can be either simple
links that connect two devices or more complex switches that interconnect several
components and allow any pair of them to communicate at a given point in time. When a switch is
configured to allow two devices to exchange information, all other devices that rely on
the switch are blocked, i.e. they must wait until the switch can be reconfigured.
Computer Architecture is a set of rules and methods that describe the functionality, organisation
and implementation of computer systems. Some definitions of architecture define it as describing
the capabilities and programming model of a computer but not a particular implementation.
Computer architecture can also be defined as the design methodology governing how computer
hardware components interact, shaped by the constraints of real-world components and technology
as well as by current market demands.
Building an architecture deals with the materials (components, subsystems) at hand, and many
levels of detail are required to completely specify a given implementation. A very good
example is the von Neumann architecture, which is still used by most types of computers today.
It was proposed by the mathematician John von Neumann in 1945. It describes the design of an
electronic computer with its CPU, which includes the arithmetic logic unit, control unit,
registers, memory for data and instructions, an input/output interface and external storage
functions.
The von Neumann architecture is based on the stored-program computer concept, where instruction
data and program data are stored in the same memory. This design, still used in most computers
produced today, is shown in figure 2.
A CPU is not very useful unless there is some way to send information to it and receive
information back from it. This is usually done over a bus, the input/output (I/O) gateway for the
CPU. The system memory with which the CPU primarily communicates is commonly known as
Random Access Memory (RAM). Depending on the platform, the CPU may communicate with
other parts of the system directly, or it may communicate just through memory. The
CPU contains the ALU, the control unit and a variety of registers.
Memory Unit
The memory unit consists of RAM, sometimes referred to as primary or main memory. Unlike a
hard drive (secondary memory), this memory is fast and directly accessible by the CPU.
RAM is split into partitions; each partition consists of an address and its contents (both in binary
form). The address uniquely identifies every location in the memory. Loading data from
permanent memory (the hard drive) into the faster, directly accessible temporary memory (RAM)
allows the CPU to operate much more quickly.
Registers
Registers are high-speed storage areas in the CPU. All data and instructions must be stored in a
register before they can be processed. A register is a group of flip-flops, with each flip-flop
capable of storing one bit of information. An n-bit register has a group of n flip-flops and is
capable of storing n bits of binary information.
A register consists of a group of flip-flops and gates. The flip-flops hold the binary information
and the gates control when and how new information is transferred into the register. Various types
of registers are available commercially. The simplest register consists of only flip-flops
with no external gates. These days registers are also implemented as register files. Related
terms and register types include:
Register Load: The transfer of new information into a register is referred to as loading the
register. If all the bits of the register are loaded simultaneously with a common clock pulse, then
the loading is said to be done in parallel.
Register Transfer Language: The symbolic notation used to describe the micro-operation
transfers amongst registers is called register transfer language. The term "register transfer"
implies the availability of hardware logic circuits that can perform a stated micro-operation
and transfer the result of the operation to the same or another register. The word "language"
is borrowed from programmers, who apply this term to programming languages; a
programming language is a procedure for writing symbols to specify a given computational
process.
Accumulator: This is the most common register, used to store data fetched from
memory.
General Purpose Registers: These are used to store data and intermediate results during program
execution. They can be accessed via assembly programming.
Special Purpose Registers: Users do not access these registers; they are reserved for the
computer system. Examples include:
MAR: The Memory Address Register holds the address for the
memory unit.
MBR: The Memory Buffer Register stores instructions and data received from,
or to be sent to, the memory.
PC: The Program Counter points to the next instruction to be executed.
IR: The Instruction Register holds the instruction to be executed.
Table 1: Types of Register
Memory Address Register (MAR): Holds the memory location of data that needs to be accessed.
Memory Data Register (MDR): Holds data that is being transferred to or from memory.
Types of Micro-Operations
The micro-operations in digital computers are of 4 types: register transfer, arithmetic, logic
and shift micro-operations.
Arithmetic Micro-Operations:
These carry out arithmetic operations (add, subtract, etc.) on numeric data stored in
registers. Some of the basic arithmetic micro-operations are addition, subtraction, increment and
decrement.
a) Add Micro-Operation
It is defined by the following statement:
R3 ← R1 + R2
The above statement instructs that the contents of register R1 be added to the contents
of register R2 and the sum transferred to register R3.
b) Subtract Micro-Operation
Let us again take an example:
R3 ← R1 + R2' + 1
In the subtract micro-operation, instead of using a minus operator we take the 1's
complement of the register to be subtracted and add 1 (forming its 2's complement), i.e.
R3 ← R1 − R2 is equivalent to R3 ← R1 + R2' + 1.
c) Increment/Decrement Micro-Operation
Increment and decrement micro-operations are generally performed by adding and subtracting
1 to and from the register respectively.
R1 ← R1 + 1
R1 ← R1 − 1
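The arithmetic micro-operations above can be sketched in Python. This is an illustrative model, not hardware: the 8-bit register width (MASK) is an assumption, and carries out of the top bit are simply dropped, as they would be without a carry flip-flop.

```python
MASK = 0xFF  # model 8-bit registers; the width is an assumption for illustration

def add(r1, r2):
    """Add micro-operation: R3 <- R1 + R2 (carry out of bit 7 is dropped)."""
    return (r1 + r2) & MASK

def subtract(r1, r2):
    """Subtract via 1's complement: R3 <- R1 + R2' + 1 (2's complement of R2)."""
    return (r1 + (~r2 & MASK) + 1) & MASK

def increment(r1):
    """Increment micro-operation: R1 <- R1 + 1."""
    return (r1 + 1) & MASK

def decrement(r1):
    """Decrement micro-operation: R1 <- R1 - 1."""
    return (r1 - 1) & MASK

print(add(0x0A, 0x05))       # 15
print(subtract(0x0A, 0x05))  # 5
print(increment(0xFF))       # 0 (wraps around, carry is dropped)
```

Note how subtraction is realized exactly as the text describes: no minus operator, only complement-and-add hardware.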
Logic Micro-Operations
These are binary micro-operations performed on the bits stored in the registers. These
operations consider each bit separately and treat them as binary variables.
Let us consider the X-OR micro-operation with the contents of two registers R1 and R2.
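The bitwise treatment above can be made concrete in Python. The 4-bit register width and the sample register contents are assumptions chosen for illustration:

```python
MASK = 0xF  # 4-bit registers, an assumed width for illustration

R1 = 0b1010
R2 = 0b1100

print(bin(R1 & R2))     # AND micro-operation -> 0b1000
print(bin(R1 | R2))     # OR  micro-operation -> 0b1110
print(bin(R1 ^ R2))     # XOR micro-operation -> 0b110 (1 where the bits differ)
print(bin(~R1 & MASK))  # complement (NOT)    -> 0b101
```

Each bit position is computed independently of the others, which is exactly what makes these operations "logic micro-operations" on binary variables.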
Shift Micro-Operations
These are used for serial transfer of data. That means we can shift the contents of the register to
the left or right. In the shift left operation the serial input transfers a bit to the right most position
and in shift right operation the serial input transfers a bit to the left most position.
There are three types of shifts as follows:
a) Logical Shift
It transfers 0 through the serial input. The symbol "shl" is used for logical shift left
and "shr" is used for logical shift right.
The register symbol must be the same on both sides of the arrow.
b) Circular Shift
This circulates or rotates the bits of the register around the two ends without any loss of data or
contents. Here, the serial output of the shift register is connected to its serial
input. "cil" and "cir" are used for circular shift left and right respectively.
c) Arithmetic Shift
This shifts a signed binary number to the left or right. An arithmetic shift left multiplies a signed
binary number by 2 and an arithmetic shift right divides it by 2. An arithmetic shift
micro-operation leaves the sign bit unchanged, because the sign of the number remains the same
when it is multiplied or divided by 2.
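The three shift types can be sketched as Python functions on an 8-bit register. The width is an assumption; the point is where the serially shifted-in bit comes from (a constant 0, the wrapped-around bit, or the replicated sign bit).

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1

def shl(r):
    """Logical shift left: 0 enters at the rightmost position."""
    return (r << 1) & MASK

def shr(r):
    """Logical shift right: 0 enters at the leftmost position."""
    return r >> 1

def cil(r):
    """Circular shift left: the MSB wraps around to the LSB."""
    return ((r << 1) | (r >> (WIDTH - 1))) & MASK

def cir(r):
    """Circular shift right: the LSB wraps around to the MSB."""
    return ((r >> 1) | ((r & 1) << (WIDTH - 1))) & MASK

def ashr(r):
    """Arithmetic shift right: the sign bit is replicated (divides by 2)."""
    sign = r & (1 << (WIDTH - 1))
    return (r >> 1) | sign

r = 0b10010011
print(format(shl(r), '08b'))   # 00100110
print(format(shr(r), '08b'))   # 01001001
print(format(cil(r), '08b'))   # 00100111
print(format(cir(r), '08b'))   # 11001001
print(format(ashr(r), '08b'))  # 11001001 (sign bit preserved)
```

Comparing shr and ashr on the same negative pattern shows why the arithmetic variant is needed for signed numbers: shr destroys the sign bit while ashr replicates it.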
Buses
Buses are the means by which data is transmitted from one part of a computer to another,
connecting all major internal components to the CPU and memory. A standard CPU system
bus comprises a control bus, a data bus and an address bus.
Address Bus: Carries the addresses of data (but not the data) between the processor and memory.
Data Bus: Carries data between the processor, the memory unit and the input/output devices.
Control Bus: Carries control and timing signals between the processor and the other components.
Addressing Mode
Addressing modes are the ways in which architectures specify the address of an object they want
to access. In machines, an addressing mode can specify a constant, a register or a location in
memory.
The operation field of an instruction specifies the operation to be performed. This operation will
be executed on data stored in computer registers or in main memory. The way
any operand is selected during program execution depends on the addressing mode of the
instruction. The purposes of using addressing modes are as follows:
To give programming versatility to the user.
To reduce the number of bits in the addressing field of the instruction.
Register Mode
In this mode the operand is stored in a register, and this register is present in the CPU. The
instruction contains the address of the register where the operand is stored.
Instruction Codes
A program is a set of instructions that specify the operations, the operands, and the sequence
in which processing has to occur. An instruction code is a group of bits that tells the computer to
perform a specific operation.
Operation Code
The operation code of an instruction is a group of bits that define operations such as add, subtract,
multiply, shift and complement. The number of bits required for the operation code depends upon
the total number of operations available on the computer. The operation code must consist of at
least n bits for 2^n operations. The operation part of an instruction code specifies the
operation to be performed.
Register Part
The operation must be performed on data stored in registers. An instruction code therefore
specifies not only the operation to be performed but also the registers where the operands (data)
will be found, as well as the register where the result has to be stored.
In computers with a single processor register, that register is known as the Accumulator (AC). The
operation is then performed with a memory operand and the content of the AC.
COMPUTER INSTRUCTIONS
The basic computer has three instruction code formats. The Operation Code (opcode) part of the
instruction contains 3 bits, and the meaning of the remaining 13 bits depends upon the operation
code encountered.
Three-Address Instructions
Computers with three-address instruction formats can use each address field to specify either a
processor register or a memory operand. The program in assembly language that evaluates X =
(A + B) ∗ (C + D) is shown below, together with comments that explain the register transfer
operation of each instruction.
ADD R1, A, B R1 ← M [A] + M [B]
ADD R2, C, D R2 ← M [C] + M [D]
MUL X, R1, R2 M [X] ← R1 ∗ R2
It is assumed that the computer has two processor registers, R1 and R2. The symbol M [A]
denotes the operand at memory address symbolized by A.
The advantage of the three-address format is that it results in short programs when evaluating
arithmetic expressions. The disadvantage is that the binary-coded instructions require too many
bits to specify three addresses. An example of a commercial computer that uses three-address
instructions is the Cyber 170. The instruction formats in the Cyber computer are restricted to
either three register address fields or two register address fields and one memory address field.
Two-Address Instructions
Two address instructions are the most common in commercial computers. Here again each
address field can specify either a processor register or a memory word. The program to evaluate
X = (A + B) ∗ (C + D) is as follows:
MOV R1, A R1 ← M [A]
ADD R1, B R1 ← R1 + M [B]
MOV R2, C R2 ← M [C]
ADD R2, D R2 ← R2 + M [D]
MUL R1, R2 R1 ← R1∗R2
MOV X, R1 M [X] ← R1
The MOV instruction moves or transfers the operands to and from memory and processor
registers. The first symbol listed in an instruction is assumed to be both a source and the
destination where the result of the operation is transferred.
One-Address Instructions
One-address instructions use an implied accumulator (AC) register for all data manipulation. For
multiplication and division there is a need for a second register; however, here we will neglect
the second register and assume that the AC contains the result of all operations. The program to
evaluate
X = (A + B) ∗ (C + D) is
LOAD A AC ← M [A]
ADD B AC ← AC + M [B]
STORE T M [T] ← AC
LOAD C AC ← M [C]
ADD D AC ← AC + M [D]
MUL T AC ← AC ∗ M [T]
STORE X M [X] ← AC
All operations are done between the AC register and a memory operand. T is the address of a
temporary memory location required for storing the intermediate result.
Zero-Address Instructions
A stack-organized computer does not use an address field for the instructions ADD and MUL.
The PUSH and POP instructions, however, need an address field to specify the operand that
communicates with the stack. The following program shows how X = (A + B) ∗ (C + D) will be
written for a stack-organized computer. (TOS stands for top of stack.)
PUSH A TOS ← A
PUSH B TOS ← B
ADD TOS ← (A + B)
PUSH C TOS ← C
PUSH D TOS ← D
ADD TOS ← (C + D)
MUL TOS ← (C + D) ∗ (A + B)
POP X M [X] ← TOS
To evaluate arithmetic expressions in a stack computer, it is necessary to convert the expression
into reverse Polish notation. The name “zero-address” is given to this type of computer because
of the absence of an address field in the computational instructions.
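The stack program above can be sketched as a reverse Polish evaluator in Python. The token format and the sample memory contents are assumptions for illustration; the stack operations mirror PUSH, ADD, MUL and POP exactly.

```python
def evaluate_rpn(tokens, memory):
    """Evaluate a reverse Polish expression the way a stack machine would:
    operands are PUSHed, and each operator POPs two values and PUSHes the result."""
    stack = []
    for t in tokens:
        if t == '+':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)        # ADD: TOS <- a + b
        elif t == '*':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)        # MUL: TOS <- a * b
        else:
            stack.append(memory[t])    # PUSH t: TOS <- M[t]
    return stack.pop()                 # POP X: M[X] <- TOS

# X = (A + B) * (C + D) in reverse Polish notation: A B + C D + *
mem = {'A': 2, 'B': 3, 'C': 4, 'D': 5}
print(evaluate_rpn(['A', 'B', '+', 'C', 'D', '+', '*'], mem))  # (2+3)*(4+5) = 45
```

The computational instructions (+ and *) carry no address field at all, which is precisely why this organization is called zero-address.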
Input-Output Instructions
These instructions are recognised by the operation code 111 with a 1 in the leftmost bit of the
instruction. The remaining 12 bits are used to specify the input-output operation.
Format of Instruction
Basic fields of an instruction format are given below:
Instruction Cycle
An instruction cycle, also known as the fetch-decode-execute cycle, is the basic operational
process of a computer. This process is repeated continuously by the CPU from boot up to shut
down of the computer: once an instruction completes, the cycle repeats by fetching the next
instruction.
Figure 3: Instruction Cycle
Central Processing Unit architecture operates according to the Instruction Set Architecture
(ISA) it was designed for. There are two prevailing concepts for implementing the processor
hardware architecture: CISC and RISC.
Hardware designers invent numerous technologies and tools to realize the desired architecture.
A hardware architecture may be implemented to favour either the hardware or the software side,
and depending on the application both approaches are used in the required measure.
Figure 4: RISC and CISC Architecture
CISC Architecture
The CISC approach attempts to minimize the number of instructions per program, sacrificing the
number of cycles per instruction. Computers based on the CISC architecture are designed to
decrease the memory cost: large programs need more storage, which increases the memory cost,
and larger memories are more expensive. To solve this problem, the number of instructions per
program can be reduced by embedding several operations in a single instruction, thereby making
the instructions more complex.
In CISC, a single instruction such as MUL can load two values from memory into separate
registers, multiply them and store the result. CISC uses the minimum possible number of
instructions by implementing such complex operations directly in hardware.
An Instruction Set Architecture is a medium that permits communication between the programmer
and the hardware. Executing, copying, deleting and editing data are user commands carried out on
the microprocessor, and it is through the instruction set architecture that the microprocessor is
operated. The main elements of an instruction set architecture are as follows:
Instruction Set:
The group of instructions given to execute the program; they direct the computer by manipulating
data. Instructions take the form opcode (operational code) and operand, where the opcode is the
instruction applied (load, store, etc.) and the operand is the memory register to which the
instruction is applied.
Addressing Modes:
Addressing modes are the manner in which data is accessed. Depending upon the type of
instruction applied, addressing modes are of various types, such as direct mode, where the data
itself is accessed, or indirect mode, where the location of the data is accessed. Processors having
identical ISAs may be very different in organization, and even processors with identical ISAs and
nearly identical organizations still differ in their hardware implementation.
CPU performance is given by the fundamental performance equation:
CPU Time = Instruction Count × CPI × Clock Cycle Time
Thus, CPU performance is dependent upon the instruction count, the CPI (cycles per instruction)
and the clock cycle time, and all three are affected by the instruction set architecture.
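The performance law can be turned into a small calculation. The workload figures below (instruction count, average CPI, clock period) are hypothetical values chosen for illustration:

```python
def cpu_time(instruction_count, cpi, clock_cycle_time_ns):
    """CPU Time = Instruction Count x CPI x Clock Cycle Time (result in ns)."""
    return instruction_count * cpi * clock_cycle_time_ns

# Hypothetical program: 1 million instructions, average CPI of 2,
# 0.5 ns clock cycle time (i.e. a 2 GHz clock).
print(cpu_time(1_000_000, 2, 0.5))  # 1000000.0 ns, i.e. 1 ms
```

Halving any one of the three factors halves the CPU time, which is why ISA decisions that trade instruction count against CPI (the CISC/RISC trade-off) matter so much.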
This underlines the importance of the instruction set architecture. There are two prevalent
instruction set architectures.
RISC (Reduced Instruction Set Computer) is used in portable devices due to its power efficiency.
RISC is a type of microprocessor architecture that uses a highly optimized set of instructions.
RISC does the opposite of CISC: it reduces the cycles per instruction at the cost of the number
of instructions per program. Pipelining is one of the unique features of RISC; it is performed by
overlapping the execution of several instructions in a pipeline fashion.
This gives RISC a high performance advantage over CISC. RISC processors take simple
instructions, each of which executes within a clock cycle.
Example: Apple iPod and Nintendo DS.
Semantic Gap
Both RISC and CISC architectures have been developed as attempts to bridge the semantic gap,
the gap between high-level languages and machine instructions. With the objective of improving
the efficiency of software development, several powerful programming languages have come up,
viz. Ada, C, C++, Java, etc. They provide a high level of abstraction, conciseness and power, and
with this evolution the semantic gap grows. To enable efficient compilation of high-level language
programs, CISC and RISC designs are the two options.
CISC designs involve very complex architectures, including a large number of instructions and
addressing modes, whereas RISC designs involve a simplified instruction set adapted to the real
requirements of user programs.
MEMORY ORGANIZATION
A memory unit is the collection of storage units or devices together. The memory unit stores the
binary information in the form of bits. Generally, memory/storage is classified into 2 categories:
Volatile Memory: This loses its data, when power is switched off.
Non-Volatile Memory: This is a permanent storage and does not lose any data when power
is switched off.
Memory Hierarchy
Auxiliary memory access time is generally about 1000 times that of main memory; hence
auxiliary memory is at the bottom of the hierarchy.
The main memory occupies the central position because it is equipped to communicate
directly with the CPU and with auxiliary memory devices through the Input/Output
processor (I/O).
When a program not residing in main memory is needed by the CPU, it is brought in from
auxiliary memory. Programs not currently needed in main memory are transferred into auxiliary
memory to provide space in main memory for other programs that are currently in use.
The cache memory is used to store program data which is currently being executed in the
CPU. The approximate access time ratio between cache memory and main memory is about 1
to 7~10.
1. Random Access: Main memories are random access memories, in which each memory
location has a unique address. Using this unique address any memory location can be reached
in the same amount of time in any order.
2. Sequential Access: This method allows memory access in a sequence or in order.
3. Direct Access: In this mode, information is stored in tracks, with each track having a separate
read/write head.
1. Main Memory
The memory unit that communicates directly with the CPU, auxiliary memory and cache
memory is called main memory. It is the central storage unit of the computer system: a large
and fast memory used to store data during computer operations. Main memory is made up
of RAM and ROM, with RAM integrated circuit chips holding the major share.
i. Random Access Memory (RAM):
DRAM: Dynamic RAM is made of capacitors and transistors, and must be refreshed
every 10~100 ms. It is slower and cheaper than SRAM.
SRAM: Static RAM has a six-transistor circuit in each cell and retains data until
powered off.
NVRAM: Non-Volatile RAM retains its data even when turned off. Example: flash
memory.
ii. Read Only Memory (ROM): is non-volatile and is more like a permanent storage for
information. It also stores the bootstrap loader program, to load and start the operating
system when computer is turned on. PROM (Programmable ROM), EPROM (Erasable
PROM) and EEPROM (Electrically Erasable PROM) are some commonly used ROMs.
2. Auxiliary Memory
Devices that provide backup storage are called auxiliary memory. For example: Magnetic disks
and tapes are commonly used auxiliary devices. Other devices used as auxiliary memory are
magnetic drums, magnetic bubble memory and optical disks. It is not directly accessible to the
CPU, and is accessed using the Input/Output channels.
3. Cache Memory
The data or contents of main memory that are used again and again by the CPU are stored in the
cache memory so that they can be accessed in a shorter time.
Whenever the CPU needs to access memory, it first checks the cache. If the data is not
found in cache memory, the CPU moves on to the main memory. It also transfers blocks of
recently used data into the cache, deleting old data from the cache to accommodate the new.
Hit Ratio
The performance of cache memory is measured in terms of a quantity called the hit ratio. When
the CPU refers to memory and finds the word in cache, it is said to produce a hit. If the word is
not found in cache but is in main memory, it counts as a miss.
The ratio of the number of hits to the total CPU references to memory is called hit ratio.
Hit Ratio = Hit/(Hit + Miss)
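The hit ratio can be demonstrated with a tiny cache simulation. This is a sketch only: the two-line cache capacity, the LRU replacement policy and the address trace are all assumptions chosen to keep the example small.

```python
from collections import OrderedDict

def simulate(addresses, capacity=2):
    """Count hits and misses for a tiny fully-associative LRU cache.
    The capacity and replacement policy are assumptions for illustration."""
    cache = OrderedDict()
    hits = misses = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            misses += 1
            cache[addr] = True             # bring the block into the cache
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return hits, misses

hits, misses = simulate([1, 2, 1, 3, 1, 2, 1])
print(hits, misses)            # 3 4
print(hits / (hits + misses))  # hit ratio = 3/7, about 0.43
```

The final line is exactly the formula above: Hit Ratio = Hit / (Hit + Miss).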
4. Associative Memory
It is also known as content addressable memory (CAM). It is a memory chip in which each bit
position can be compared. In this the content is compared in each bit cell which allows very fast
table lookup. Since the entire chip can be compared, contents are randomly stored without
considering addressing scheme. These chips have less storage capacity than regular memory
chips.
The three common techniques for mapping main memory blocks into the cache are:
Associative Mapping
Direct Mapping
Set Associative Mapping
5. Virtual Memory
Virtual memory is the separation of logical memory from physical memory. This separation
provides large virtual memory for programmers when only small physical memory is available.
Virtual memory is used to give programmers the illusion that they have a very large memory
even though the computer has a small main memory. It makes the task of programming easier
because the programmer no longer needs to worry about the amount of physical memory
available.
Parallel Processing and Data Transfer Modes in a Computer System
Instead of processing each instruction sequentially, a parallel processing system provides
concurrent data processing to decrease the execution time. Such a system may have two or
more ALUs and be able to execute two or more instructions at the same time. The
purpose of parallel processing is to speed up the computer's processing capability and increase its
throughput.
NOTE: Throughput is the number of instructions that can be executed in a unit of time.
Parallel processing can be viewed from various levels of complexity. At the lowest level, we
distinguish between parallel and serial operations by the type of registers used. At the higher level
of complexity, parallel processing can be achieved by using multiple functional units that perform
many operations simultaneously.
Pipelining
Pipelining is the process of feeding instructions to the processor through a pipeline. It
allows storing and executing instructions in an orderly process. It is also known as pipeline
processing.
Pipelining is a technique where multiple instructions are overlapped during execution. The
pipeline is divided into stages and these stages are connected with one another to form a
pipe-like structure. Instructions enter at one end and exit at the other.
Note: Pipelining increases the overall instruction throughput.
Advantages of Pipelining
1. The cycle time of the processor is reduced.
2. It increases the throughput of the system
3. It makes the system reliable.
Disadvantages of Pipelining
1. The design of a pipelined processor is complex and costly to manufacture.
2. The latency of an individual instruction is higher.
In a pipeline system, each segment consists of an input register followed by a combinational
circuit. The register is used to hold data and the combinational circuit performs operations on it.
The output of the combinational circuit is applied to the input register of the next segment.
A pipeline system is like a modern-day assembly line in a factory. For example, in the car
manufacturing industry, huge assembly lines are set up with robotic arms performing a certain
task at each station, after which the car moves on to the next arm.
Types of Pipeline
It is divided into 2 categories: the arithmetic pipeline and the instruction pipeline.
i. Arithmetic Pipeline
Arithmetic pipelines are found in most computers. They are used for floating point
operations, multiplication of fixed point numbers, etc. For example, the inputs to a floating
point adder pipeline are:
X = A × 2^a
Y = B × 2^b
Here A and B are mantissas (the significant digits of the floating point numbers), while a and b
are the exponents.
The floating point addition and subtraction is done in 4 parts:
1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.
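The four stages of the floating point adder pipeline can be sketched in Python. The representation is a deliberate simplification: mantissas are held as Python floats in [0.5, 1) rather than as bit fields, an assumption made purely for illustration.

```python
def fp_add(a_mant, a_exp, b_mant, b_exp):
    """Four-stage floating point add: X = A*2^a + B*2^b.
    Mantissas are modelled as floats, a simplification for illustration."""
    # Stage 1: compare the exponents
    shift = a_exp - b_exp
    # Stage 2: align the mantissas by shifting the smaller operand right
    if shift >= 0:
        b_mant /= 2 ** shift
        exp = a_exp
    else:
        a_mant /= 2 ** (-shift)
        exp = b_exp
    # Stage 3: add the mantissas
    mant = a_mant + b_mant
    # Stage 4: normalize the result back into [0.5, 1)
    while abs(mant) >= 1:
        mant /= 2
        exp += 1
    return mant, exp

# 0.5 * 2^3 + 0.5 * 2^2  =  4 + 2  =  6  =  0.75 * 2^3
print(fp_add(0.5, 3, 0.5, 2))  # (0.75, 3)
```

In a hardware arithmetic pipeline each of these four stages is a separate segment, so four different additions can be in flight at once, one per stage.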
Pipeline Conflicts
There are some factors that cause the pipeline to deviate from its normal performance. Some of
these factors are given below:
i. Timing Variations
All stages cannot take the same amount of time. This problem generally occurs in instruction
processing, where different instructions have different operand requirements and thus different
processing times.
Pipeline Hazards
There are situations, called hazards, that prevent the next instruction in the instruction stream from
being executed during its designated clock cycle. Hazards reduce the performance from the ideal
speedup gained by pipelining.
A hazard is created whenever there is a dependence between instructions, and they are close
enough that the overlap caused by pipelining would change the order of access to an operand.
1. Structural Hazards. They arise from resource conflicts when the hardware cannot support all
possible combinations of instructions in simultaneous overlapped execution.
2. Data Hazards. They arise when an instruction depends on the result of a previous
instruction in a way that is exposed by the overlapping of instructions in the pipeline.
3. Control Hazards. They arise from the pipelining of branches and other instructions that
change the PC.
i. Branching
In order to fetch and execute the next instruction, we must know what that instruction is. If the
present instruction is a conditional branch, and its result will lead us to the next instruction, then
the next instruction may not be known until the current one is processed.
ii. Interrupts
Interrupts insert unwanted instructions into the instruction stream and thereby affect the
execution of instructions.
The principles of pipelining will be described using DLX (pronounced "Deluxe") and a simple
version of its pipeline. Those principles can be applied to more complex instruction sets than
DLX, although the resulting pipelines are more complex. DLX has a simple pipeline architecture
for the CPU.
The architecture of DLX was chosen based on observations about most frequently used primitives
in programs. DLX provides a good architectural model for study, not only because of the recent
popularity of this type of machine, but also because it is easy to understand.
Like most recent load/store machines, DLX emphasizes a simple set of data types:
For integer data: 8-bit bytes, 16-bit half words, 32-bit words
For floating point: 32-bit single precision, 64-bit double precision
The DLX operations work on 32-bit integers and 32- or 64-bit floating point. Bytes and
half words are loaded into registers with either zeros or the sign bit replicated to fill the
32 bits of the registers.
Memory
Byte addressable
32-bit address
Two addressing modes (immediate and displacement, with a 16-bit field). Register deferred
and absolute addressing are accomplished as special cases of displacement addressing.
Memory references are loads and stores between memory and the GPRs or FPRs, and all
memory accesses must be aligned.
There are instructions for moving between a FPR and a GPR.
Instructions
An Implementation of DLX
This un-pipelined implementation is not the most economical or the highest-performance
implementation without pipelining. Instead, it is designed to lead naturally to a pipelined
implementation. Implementing the instruction set requires the introduction of several temporary
registers that are not part of the architecture. Every DLX instruction can be implemented in at
most five clock cycles.
On each cycle an instruction proceeds from the IF stage towards the WB stage; if a cycle appears
to change nothing, that cycle is not active for the given instruction type. A detailed description of
each of the five clock cycles is as follows:
Instruction fetch cycle (IF):
IR ← Mem[PC]
NPC ← PC + 4
Operation:
• Send out the PC and fetch the instruction from memory into the instruction register (IR)
• Increment the PC by 4 to address the next sequential instruction
• The IR is used to hold the instruction that will be needed on subsequent clock cycles
• The NPC is used to hold the next sequential PC (program counter)
Instruction decode/register fetch cycle (ID):
i. Decode the instruction and access the register file to read the registers.
ii. The outputs of the general-purpose registers are read into two temporary registers (A
and B) for use in later clock cycles.
iii. The lower 16 bits of the IR are also sign-extended and stored into the temporary
register IMM, for use in the next cycle.
iv. Decoding is done in parallel with reading registers, which is possible because these
fields are at a fixed location in the DLX instruction format. This technique is known
as fixed-field decoding.
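Fixed-field decoding can be sketched with bit masks in Python. The 6/5/5/16 field layout below follows the common DLX-style I-type format, which is an assumption for illustration:

```python
def decode(instruction):
    """Split a 32-bit DLX-style I-type word into its fixed fields.
    The 6-bit opcode / 5-bit rs1 / 5-bit rd / 16-bit immediate layout
    is an assumed illustration of fixed-field decoding."""
    opcode = (instruction >> 26) & 0x3F   # bits 31..26
    rs1    = (instruction >> 21) & 0x1F   # bits 25..21
    rd     = (instruction >> 16) & 0x1F   # bits 20..16
    imm    = instruction & 0xFFFF         # bits 15..0
    if imm & 0x8000:                      # sign-extend the 16-bit immediate
        imm -= 0x10000
    return opcode, rs1, rd, imm

# Because the register fields sit at fixed bit positions, the register file
# can be read in parallel with opcode decoding, without waiting to learn
# which instruction this is.
word = (0x23 << 26) | (1 << 21) | (2 << 16) | 0xFFFC
print(decode(word))  # (35, 1, 2, -4)
```

The sign extension mirrors step iii above: the lower 16 bits are extended and held (in IMM) before the execute cycle ever inspects the opcode.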
Execution/effective address cycle (EX):
Memory reference:
ALUOutput ← A + Imm
Operation: The ALU adds the operands to form the effective address and places the result into
the register ALUOutput.
Register-Register ALU instruction:
ALUOutput ← A op B
Operation: The ALU performs the operation specified by the opcode on the value in register A
and on the value in register B. The result is placed in the register ALUOutput.
Register-Immediate ALU instruction:
ALUOutput ← A op Imm
Operation: The ALU performs the operation specified by the opcode on the value in register A
and on the value in register Imm. The result is placed in the register ALUOutput.
Branch:
ALUOutput ← NPC + Imm
Cond ← (A op 0)
Operation:
• The ALU adds the NPC to the sign-extended immediate value in Imm to compute
the address of the branch target.
• Register A, which has been read in the prior cycle, is checked to determine
whether the branch is taken.
• The comparison operation op is the relational operator determined by the branch
opcode (e.g. op is "==" for the instruction BEQZ).
Memory access/branch completion cycle (MEM):
Memory reference:
LMD ← Mem[ALUOutput] or Mem[ALUOutput] ← B
Operation:
Access memory if needed.
If the instruction is a load, data returns from memory and is placed in the LMD (load
memory data) register.
If the instruction is a store, data from the B register is written into memory.
In either case the address used is the one computed during the prior cycle
and stored in the register ALUOutput.
Branch:
Operation:
- If the branch is taken, the PC is replaced with the branch destination address in the register
ALUOutput.
- Otherwise, the PC is replaced with the incremented PC in the register NPC.
Write-back cycle (WB):
Memory reference:
Regs[IR 11..15] ← LMD
Operation:
Write the result into the register file, whether it comes from memory (LMD) or from the ALU
(ALUOutput).
The register destination field is in one of two positions depending on the opcode.
Limitations on practical depth of a pipeline arise from:
Pipeline latency. The fact that the execution time of each instruction does not decrease
puts limitations on pipeline depth;
Imbalance among pipeline stages. Imbalance among the pipe stages reduces
performance since the clock can run no faster than the time needed for the slowest
pipeline stage;
Pipeline overhead. Pipeline overhead arises from the combination of pipeline register
delay (setup time plus propagation delay) and clock skew.
Once the clock cycle is as small as the sum of the clock skew and latch overhead, no further
pipelining is useful, since there is no time left in the cycle for useful work.
Example
1. Consider a non-pipelined machine with 6 execution stages of lengths 50 ns, 50 ns, 60 ns,
60 ns, 50 ns, and 50 ns.
i. Find the instruction latency on this machine.
ii. How much time does it take to execute 100 instructions?
Solution:
i. Instruction latency = 50 + 50 + 60 + 60 + 50 + 50 = 320 ns
ii. Time to execute 100 instructions = 100 × 320 = 32,000 ns
2. Suppose we introduce pipelining on this machine, and that pipelining adds 5 ns of overhead
to each stage. How much time does it take to execute 100 instructions?
Solution:
Remember that in the pipelined implementation, the length of the pipe stages must all be the
same, i.e., the speed of the slowest stage plus overhead. With 5 ns overhead it comes to:
60 + 5 = 65 ns per stage, so 100 instructions take (6 + 99) × 65 = 6825 ns
3. What is the speedup obtained from pipelining?
Solution:
Speedup is the ratio of the average instruction time without pipelining to the average instruction
time with pipelining.
Average instruction time not pipelined = 320 ns
Average instruction time pipelined = 65 ns
Speedup for 100 instructions = 32000 / 6825 = 4.69
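The timing model used in the example above can be captured in a few lines of Python. The formulas assume, as the example does, that an un-pipelined machine runs instructions back to back, while a pipelined one pays one cycle per stage to fill the pipe and then retires one instruction per cycle:

```python
def unpipelined_time(stages, n):
    """Total time: each of n instructions passes through all stages in turn."""
    return n * sum(stages)

def pipelined_time(stages, n, overhead):
    """Every pipe stage runs at the speed of the slowest stage plus overhead;
    the pipe takes len(stages) cycles to fill, then one instruction completes
    per cycle: (k + n - 1) cycles total."""
    cycle = max(stages) + overhead
    return (len(stages) + n - 1) * cycle

stages = [50, 50, 60, 60, 50, 50]  # stage lengths in ns, from the example above
t_seq = unpipelined_time(stages, 100)
t_pipe = pipelined_time(stages, 100, overhead=5)
print(t_seq)           # 32000 ns
print(t_pipe)          # (6 + 99) * 65 = 6825 ns
print(t_seq / t_pipe)  # speedup of about 4.69
```

Note that the speedup (about 4.7) stays below the stage count of 6: stage imbalance (60 ns vs 50 ns stages), the 5 ns register overhead, and the fill time of the pipe all eat into the ideal speedup, exactly the limitations listed earlier.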