
UNIT 1

(OVERVIEW & INSTRUCTIONS)

1. Briefly explain the 8 great ideas in computer architecture.


These ideas are so powerful they have lasted long after the first computer that used them.
1. Design for Moore's Law
2. Use Abstraction to Simplify Design
3. Make the Common Case Fast
4. Performance via Parallelism
5. Performance via Pipelining
6. Performance via Prediction
7. Hierarchy of Memories
8. Dependability via Redundancy
Design for Moore's Law
Moore's Law states that integrated circuit resources double every 18-24 months. Computer
architects must anticipate where the technology will be when the design finishes rather than
design for where it starts, because the resources available per chip can easily double or quadruple
between the start and finish of the project.
Use Abstraction to Simplify Design
A major productivity technique for hardware and software is to use abstractions to represent
the design at different levels of representation; lower-level details are hidden to offer a simpler
model at higher levels.
Make the Common Case Fast
Making the common case fast will tend to enhance performance better than optimizing the rare
case. Ironically, the common case is often simpler than the rare case and hence is often easier
to enhance.
Performance via Parallelism
Computer architects have offered designs that get more performance by performing operations
in parallel.

Performance via Pipelining


A particular pattern of parallelism is so prevalent in computer architecture that it merits its own
name: pipelining.

Performance via Prediction


In some cases it can be faster on average to guess and start working rather than wait
until you know for sure, assuming that the mechanism to recover from a misprediction is not
too expensive and your prediction is relatively accurate.
Hierarchy of Memories
Programmers want memory to be fast, large, and cheap, as memory speed often shapes
performance, capacity limits the size of problems that can be solved, and the cost of memory
today is often the majority of computer cost. A hierarchy of memories addresses these conflicting
demands, with the fastest, smallest, and most expensive memory per bit at the top of the hierarchy
and the slowest, largest, and cheapest per bit at the bottom. Caches give the programmer the illusion
that main memory is nearly as fast as the top of the hierarchy and nearly as big and cheap as the bottom of the hierarchy.
Dependability via Redundancy
Computers not only need to be fast; they need to be dependable. Since any physical device can
fail, we make systems dependable by including redundant components that can take over when
a failure occurs and to help detect failures.
=============================================================================

2. Write short notes on the system software used in a computer.


Software is organized primarily in a hierarchical fashion, with applications being the outermost ring and a
variety of systems software sitting between the hardware and the applications software. There are many types
of systems software, but two are central to every computer system today: an
operating system and a compiler. An operating system interfaces between a user's program and the
hardware and provides a variety of services and supervisory functions. Among the most important functions
are:
Handling basic input and output operations
Allocating storage and memory
Providing for protected sharing of the computer among multiple applications using it simultaneously.
Examples of operating systems in use today are Linux, iOS, and Windows.
FIGURE 1.3 A simplified view of hardware
and software as hierarchical layers.
Compilers perform another vital function: the translation of a
program written in a high-level language, such as C, C++, Java, or
Visual Basic into instructions that the hardware can execute.
From a High-Level Language to the Language of Hardware
Assembler. This program translates a symbolic version of an
instruction into the binary version. For example, the programmer
would write
add A,B
and the assembler would translate this notation into 1000110010100000.
The binary language that the machine understands is the machine language. Assembly language
requires the programmer to write one line for every instruction that the computer will follow, forcing the
programmer to think like the computer. At a later stage, high-level programming languages and compilers
were introduced that translate high-level language statements into instructions.
Example:
High-level language:       a = a + b;
Assembly language:         add A,B
Binary / machine language: 1000110010100000
High-level programming languages offer several important benefits.
They allow the programmer to think in a more natural language, using English words and algebraic
notation.
Fortran was designed for scientific computation.
Cobol for business data processing.
Lisp for symbol manipulation.
It improved programmer productivity.
Programming languages allow programs to be independent of the computer on which they were
developed, since compilers and assemblers can translate high-level language programs to the binary
instructions of any computer.
========================================================================

3. Write short notes on the 5 classic components of a computer.


The five classic components of a computer are input, output, memory, data path, and control, with the last
two sometimes combined and called the processor.
I/O EQUIPMENT:
The most fascinating I/O device is probably the graphics display. Liquid crystal displays (LCDs)
are used to get a thin, low-power display. The LCD is not the source of light; instead, it controls
the transmission of light.
A typical LCD includes rod-shaped
molecules in a liquid that form a twisting
helix that bends light entering the display,
from either a light source behind the display
or less often from reflected light. The rods
straighten out when a current is applied and
no longer bend the light. Since the liquid
crystal material is between two screens
polarized at 90 degrees, the light cannot
pass through unless it is bent. Today, most
LCD displays use an active matrix that has
a tiny transistor switch at each pixel to
precisely control current and make sharper images. A red-green-blue mask associated with each dot on the
display determines the intensity of the three color components in the final image; in a color active matrix
LCD, there are three transistor switches at each point.
The image is composed of a matrix of picture elements, or pixels, which can be represented as a
matrix of bits, called a bit map. A color display might use 8 bits for each of the three colors (red, blue, and
green). The computer hardware support for graphics consists mainly of a raster refresh buffer, or frame
buffer, to store the bit map. The image to be represented onscreen is stored in the frame buffer, and the bit
pattern per pixel is read out to the graphics display at the refresh rate.
The processor: is the active part of the computer, following the instructions of a program to the
letter. It adds numbers, tests numbers, signals I/O devices to activate, and so on. The processor logically
comprises two main components: data path and control, the respective brawn and brain of the processor. The
data path performs the arithmetic operations, and control tells the data path, memory, and I/O devices what
to do according to the wishes of the instructions of the program.
The memory: is where the programs are kept when they are running; it also contains the data needed
by the running programs. The memory is built from DRAM chips. DRAM stands for dynamic random
access memory. Multiple DRAMs are used together to contain the instructions and data of a program. In
contrast to sequential access memories, such as magnetic tapes, the RAM portion of the term DRAM means
that memory accesses take basically the same amount of time no matter what portion of the memory is read.
=============================================================================
4. Write short notes on chip manufacturing process?
The manufacture of a chip begins with silicon, a substance found in sand. Because silicon does not
conduct electricity well, it is called a semiconductor. With a special chemical process, it is possible to add
materials to silicon that allow tiny areas to transform into one of three devices:
Excellent conductors of electricity (using either microscopic copper or aluminum wire)
Excellent insulators from electricity (like plastic sheathing or glass)
Areas that can conduct or insulate under special conditions (as a switch)
Transistors fall in the last category. A VLSI circuit, then, is just billions of combinations of conductors,
insulators, and switches manufactured in a single small package.
Figure shows the process for integrated chip manufacturing. The process starts with a silicon crystal
ingot, which looks like a giant sausage. Today, ingots are 8-12 inches in diameter and about 12-24 inches
long. An ingot is finely sliced into wafers no more than 0.1 inches thick. These wafers then go through a

series of processing steps, during which patterns of chemicals are placed on each wafer, creating the
transistors, conductors, and insulators.

The simplest way to cope with imperfection is to place many independent components on a single wafer.
The patterned wafer is then chopped up, or diced, into these components, called dies and more
informally known as chips. To reduce the cost, using the next generation process shrinks a large die as it
uses smaller sizes for both transistors and wires. This improves the yield and the die count per wafer.
Once you've found good dies, they are connected to the input/output pins of a package, using a
process called bonding. These packaged parts are tested a final time, since mistakes can occur in packaging,
and then they are shipped to customers.
=============================================================================
5. Write short notes on performance?
Running a program on two different desktop computers, you'd say that the faster one is the desktop
computer that gets the job done first. If you were running a datacenter that had several servers running jobs
submitted by many users, you'd say that the faster computer was the one that completed the most jobs
during a day. As an individual computer user, you are interested in reducing response time (the time
between the start and completion of a task), also referred to as execution time.
Datacenter managers are often interested in increasing throughput or bandwidth: the total amount
of work done in a given time.
Measuring Performance:
The computer that performs the same amount of work in the least time is the fastest. Program
execution time is measured in seconds per program. CPU execution time or simply CPU time, which
recognizes this distinction, is the time the CPU spends computing for this task and does not include time
spent waiting for I/O or running other programs. CPU time can be further divided into the CPU time spent
in the program, called user CPU time, and the CPU time spent in the operating system performing tasks on
behalf of the program, called system CPU time.
The term system performance is used to refer to elapsed time on an unloaded system, and CPU
performance refers to user CPU time.
CPU Performance and Its Factors:
CPU execution time for a program = CPU clock cycles for a program x Clock cycle time
Alternatively, because clock rate and clock cycle time are inverses,
CPU execution time for a program = CPU clock cycles for a program / Clock rate
This formula makes it clear that the hardware designer can improve performance by reducing the number
of clock cycles required for a program or the length of the clock cycle.
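As an illustrative calculation (numbers assumed for this sketch, not from the text): a program that takes
10 x 10^9 clock cycles on a 4 GHz (4 x 10^9 cycles/second) processor runs in
CPU execution time = 10 x 10^9 / (4 x 10^9) = 2.5 seconds
Halving either the clock cycle count or the clock cycle time would halve this execution time.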

Instruction Performance:
The performance equations above did not include any reference to the number of instructions needed
for the program. The execution time must depend on the number of instructions in a program: it equals
the number of instructions executed multiplied by the average time per instruction. The clock cycles
required for a program can therefore be written as
CPU clock cycles = Instructions for a program X Average clock cycles per instruction
The term clock cycles per instruction, which is the average number of clock cycles each instruction takes to
execute, is often abbreviated as CPI. CPI provides one way of comparing two different implementations
of the same instruction set architecture, since the number of instructions executed for a program will be the
same.
The Classic CPU Performance Equation:
The basic performance equation in terms of instruction count (the number of instructions executed by the
program), CPI, and clock cycle time:
CPU time = Instruction count X CPI X Clock cycle time
or, since the clock rate is the inverse of clock cycle time:
CPU time = (Instruction count x CPI) / Clock rate
These formulas are particularly useful because they separate the three key factors that affect performance.
Components of performance            Units of measure
CPU execution time for a program     Seconds for the program
Instruction count                    Instructions executed for the program
Clock cycles per instruction (CPI)   Average number of clock cycles per instruction
Clock cycle time                     Seconds per clock cycle
We can measure the CPU execution time by running the program, and the clock cycle time is
usually published as part of the documentation for a computer. The instruction count and CPI can be more
difficult to obtain. Of course, if we know the clock rate and CPU execution time, we need only one of the
instruction count or the CPI to determine the other.
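As an illustrative calculation (values assumed for this sketch): a program with an instruction count of
10^9, a CPI of 2.0, and a 2 GHz clock rate gives
CPU time = (10^9 x 2.0) / (2 x 10^9) = 1 second
If a better compiler reduced the CPI to 1.5 with the same instruction count and clock rate, the CPU time
would drop to 0.75 seconds.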
=========================================================================================

6. Write short notes on power wall?


Both clock rate and power increased rapidly for decades, and then flattened off recently. The energy
metric joules is a better measure than a power rate like watts, which is just joules/second.
Dominant technology for integrated circuits is called CMOS (complementary metal oxide
semiconductor). For CMOS, the primary source of energy consumption is so-called dynamic energy, that
is, energy that is consumed when transistors switch states from 0 to 1 and vice versa.
Energy ∝ Capacitive load x (Voltage)^2
This equation is the energy of a pulse during the logic transition of 0 -> 1 -> 0 or 1 -> 0 -> 1. The energy of
a single transition is then
Energy ∝ 1/2 x Capacitive load x (Voltage)^2
The power required per transistor is just the product of the energy of a transition and the frequency of
transitions:
Power ∝ Energy x Frequency switched
[or]
Power ∝ 1/2 x Capacitive load x (Voltage)^2 x Frequency switched
Frequency switched is a function of the clock rate. The capacitive load per transistor is a function of both the
number of transistors connected to an output (called the fan out) and the technology, which determines the
capacitance of both wires and transistors.

Energy and thus power can be reduced by lowering the voltage, which occurred with each new
generation of technology, and power is a function of the voltage squared. In 20 years, voltages have gone
from 5 V to 1 V, which is why the increase in power is only 30 times. To try to address the power problem,
designers have already attached large devices to increase cooling, and they turn off parts of the chip that
are not used in a given clock cycle.
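As an illustrative calculation (scaling factors assumed for this sketch): suppose a new process technology
gives 85% of the capacitive load, 85% of the voltage, and 85% of the switching frequency of the old one. Then
Power new / Power old = 0.85 x (0.85)^2 x 0.85 = (0.85)^4 ≈ 0.52
so the new chip consumes roughly half the power of the old one.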
=============================================================================
7. Write short notes on operations and operands in computer hardware?
Operations in MIPS:
Every computer must be able to perform arithmetic. The MIPS assembly language notation
add a, b, c
instructs a computer to add the two variables b and c and to put their sum in a.
This notation is rigid in that each MIPS arithmetic instruction performs only one operation and must always
have exactly three variables.
EXAMPLE: To add the four variables b, c, d, and e and place the sum in a:
add a, b, c # The sum of b and c is placed in a
add a, a, d # The sum of b, c, and d is now in a
add a, a, e # The sum of b, c, d, and e is now in a
Thus, it takes three instructions to sum the four variables.
Design Principle 1: Simplicity favors regularity.
EXAMPLE: Compiling Two C Assignment Statements into MIPS
This segment of a C program contains the five variables a, b, c, d, and e. Since Java evolved from C, this
example and the next few work for either high-level programming language:
a = b + c;
d = a - e;
The translation from C to MIPS assembly language instructions is performed by the compiler. Show the
MIPS code produced by a compiler. A MIPS instruction operates on two source operands and places the
result in one destination operand. Hence, the two simple statements above compile directly into these two
MIPS assembly language instructions:
add a, b, c
sub d, a, e
Operands in MIPS:
The operands of arithmetic instructions are restricted; they must be from a limited number of special
locations built directly in hardware called registers. The size of a register in the MIPS architecture is 32
bits; groups of 32 bits occur so frequently that they are given the name word in the MIPS architecture.
Design Principle 2: Smaller is faster.
A very large number of registers may increase the clock cycle time simply because it takes electronic signals
longer when they must travel farther. So, 32 registers were used in MIPS architecture. The MIPS convention
is to use two-character names following a dollar sign to represent a register. eg: $s0, $s1
Example: f = (g + h) - (i + j); implemented using registers:
add $t0,$s1,$s2 # register $t0 contains g + h
add $t1,$s3,$s4 # register $t1 contains i + j
sub $s0,$t0,$t1 # f gets $t0 - $t1, which is (g + h) - (i + j)
Memory Operands: Programming languages have simple variables that contain single data
elements, as in these examples, but they also have more complex data structures, such as arrays and structures.
These complex data structures can contain many more data elements than there are registers in a computer.
The processor can keep only a small amount of data in registers, but computer memory contains billions
of data elements. So, MIPS must include instructions that transfer data between memory and registers.
Such instructions are called data transfer instructions. To access a word in memory, the instruction
must supply the memory address.

The data transfer instruction that copies data from memory to a register is traditionally called load.
The format of the load instruction is the name of the operation followed by the register to be loaded, then
a constant and register used to access memory. The sum of the constant portion of the instruction and the
contents of the second register forms the memory address.
The actual MIPS name for this instruction is lw, standing for load word.
EXAMPLE:
g = h + A[8];
To get A[8] from memory, use lw:
lw $t0,8($s3) # Temporary reg $t0 gets A[8]
Then use the result of A[8], i.e., stored in $t0:
add $s1,$s2,$t0 # g = h + A[8]
The constant in a data transfer instruction (8) is called the offset, and the register added to form the
address ($s3) is called the base register.
In MIPS, words must start at addresses that are multiples of 4. This requirement is called an
alignment restriction, and many architectures have it. (Since in MIPS each 32 bits form a word in memory,
the address of one word to the next jumps in multiples of 4.)

Byte addressing also affects the array index. To get the proper byte address in the code above, the offset to
be added to the base register $s3 must be 4 x 8, or 32 (as per the previous example).
Above EXAMPLE:
g = h + A[8]; (implemented based on byte address)
To get A[8] from memory, use lw and calculate (8 x 4) = 32, which is the actual offset value:
lw $t0,32($s3) # Temporary reg $t0 gets A[8]
Then use the result of A[8], i.e., stored in $t0:
add $s1,$s2,$t0 # g = h + A[8]
The instruction complementary to load is traditionally called store; it copies data from a register to
memory. The format of a store is similar to that of a load: the name of the operation, followed by the
register to be stored, then the offset to select the array element, and finally the base register. Once again, the
MIPS address is specified in part by a constant and in part by the contents of a register. The actual
MIPS name is sw, standing for store word.
EXAMPLE:
A[12] = h + A[8];
lw $t0,32($s3) # Temporary reg $t0 gets A[8], note (8 x 4) used.
add $t0,$s2,$t0 # Temporary reg $t0 gets h + A[8]
sw $t0,48($s3) # Stores h + A[8] back into A[12], note (12 x 4) used.
Constant or Immediate Operands:
For example, to add the constant 4 to register $s3, we could use the code
lw $t0, AddrConstant4($s1) # $t0 = constant 4
add $s3,$s3,$t0 # $s3 = $s3 + $t0 ($t0 == 4)
An alternative that avoids the load instruction is to offer versions of the arithmetic instructions in which one
operand is a constant, for example, add immediate or addi:
addi $s3,$s3,4 # $s3 = $s3 + 4
Constant operands occur frequently, and by including constants inside arithmetic instructions, operations
are much faster and use less energy than if constants were loaded from memory.
=============================================================================
8. Write short notes on Instructions and its types that are used in MIPS?
Registers are referred to in instructions, there must be a convention to map register names into
numbers. In MIPS assembly language, registers $s0 to $s7 map onto registers 16 to 23, and registers $t0
to $t7 map onto registers 8 to 15. Hence, $s0 means register 16, $s1 means register 17, $s2 means register
18, . . . , $t0 means register 8, $t1 means register 9, and so on.
MIPS Fields for instruction

op: Basic operation of the instruction, traditionally called the opcode.


rs: The first register source operand.
rt: The second register source operand.
rd: The register destination operand. It gets the result of the operation.
shamt: Shift amount. (Section 2.6 explains shift instructions and this term; it
will not be used until then, and hence the field contains zero in this section.)
funct: Function. This field, often called the function code, selects the specific
variant of the operation in the op field.
A problem occurs when an instruction needs longer fields than those shown above.
The MIPS designers kept all instructions the same length, thereby requiring different kinds of
instruction formats for different kinds of instructions. The format above is called R-type (for register)
or R-format. A second type of instruction format is called I-type (for immediate) or I-format and is used
by the immediate and data transfer instructions. The fields of the two formats are shown below.
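Since the original field-layout figure is not reproduced here, the standard MIPS field widths are:
R-format: op (6 bits) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
I-format: op (6 bits) | rs (5) | rt (5) | constant or address (16 bits)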

Multiple formats complicate the hardware, but we can reduce the complexity by keeping the formats
similar. For example, the first three fields of the R-type and I-type formats are the same size and have the
same names; the length of the fourth field in I-type is equal to the sum of the lengths of the last three
fields of R-type. Note that the meaning of the rt field has changed for the I-format: the rt field specifies
the destination register.

FIG: 2.5 MIPS instruction encoding.

Example: MIPS instruction encoding in computer hardware.


Consider A[300] = h + A[300]; the MIPS instructions for the operations are:
lw $t0,1200($t1)
# Temporary reg $t0 gets A[300]
add $t0,$s2,$t0
# Temporary reg $t0 gets h + A[300]
sw $t0,1200($t1)
# Stores h + A[300] back into A[300]
The tabulation below shows how the hardware decodes and determines the three machine language
instructions (decimal values; $t1 is register 9, $t0 is register 8, $s2 is register 18):

      op  rs  rt  rd  shamt  funct / address
lw    35   9   8               1200
add    0  18   8   8    0        32
sw    43   9   8               1200

The lw instruction is identified by 35 in the op field, the add instruction that follows is specified with 0 in
the op field (with funct 32), and the sw instruction is identified with 43 in the op field. The binary version
is the same tabulation with each field written out in binary.

============================================================================
9. Write short notes on logical operators / operations.
List of logical operators used in MIPS and other languages along with symbolic notation.

SHIFT LEFT (sll )


The first class of such operations is called shifts. They move all the bits in a word to the left or right,
filling the emptied bits with 0s. For example, if register $s0 contained
0000 0000 0000 0000 0000 0000 0000 1001two = 9ten
and the instruction to shift left by 4 was executed, the new value would be:
0000 0000 0000 0000 0000 0000 1001 0000two = 144ten
The dual of a shift left is a shift right. The actual name of the two MIPS shift instructions are called shift left
logical (sll) and shift right logical (srl).
sll $t2,$s0,4 # reg $t2 = reg $s0 << 4 bits (shifted 4 places)
The shamt field in the R-format is used in shift instructions; it stands for shift amount. The encoded
version of the above shift instruction is shown below.
Shifting left by i bits gives the same result as multiplying by 2^i (refer to the representation above for
9 and 144): 9 x 2^4 = 9 x 16 = 144, where i = 4, since the left shift is done 4 times.
LOGICAL AND (and)
AND is a bit-by-bit operation that leaves a 1 in the result only if both bits of the operands are 1.
For example, if register $t2 contains
0000 0000 0000 0000 0000 1101 1100 0000two
and
register $t1 contains
0000 0000 0000 0000 0011 1100 0000 0000two
then, after executing the MIPS instruction

and $t0,$t1,$t2
# reg $t0 = reg $t1 & reg $t2
the value of register $t0 would be
0000 0000 0000 0000 0000 1100 0000 0000two
(Example bit by bit; note: do not add:
  00101
  10111
Bit-wise AND gives 00101 (use the AND truth table for each bit))
LOGICAL OR (or)
It is a bit-by-bit operation that places a 1 in the result if either operand bit is a 1. To elaborate, if the registers
$t1 and $t2 are unchanged from the preceding i.e.,
register $t2 contains
0000 0000 0000 0000 0000 1101 1100 0000two
and
register $t1 contains
0000 0000 0000 0000 0011 1100 0000 0000two
then, after executing the MIPS instruction
or $t0,$t1,$t2
# reg $t0 = reg $t1 | reg $t2
the value in register $t0 would be
0000 0000 0000 0000 0011 1101 1100 0000two
(Example bit by bit:
  00101
  10111
Bit-wise OR gives 10111 (use the OR truth table for each bit))
LOGICAL NOT (nor)
The final logical operation is a contrarian. NOT takes one operand and places a 1 in the result if the operand
bit is a 0, and vice versa. Since MIPS needs a three-operand format, the designers of MIPS decided to
include the instruction NOR (NOT OR) instead of NOT. NOR of a register with zero gives NOT of that register:
Step 1: Perform bit-wise OR:
  00101
  00000 (dummy operand register filled with zeros)
giving 00101
Step 2: Take the inverse of the above result; we get 11010.
Instruction: nor $t0,$t1,$t3 # reg $t0 = ~ (reg $t1 | reg $t3)

Constants are useful in AND and OR logical operations as well as in arithmetic operations, so MIPS
also provides the instructions and immediate (andi) and or immediate (ori).
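For instance (mask and bit values chosen for illustration only):
andi $t0,$t1,0x00FF # $t0 = $t1 & 0x00FF, keeping only the low-order 8 bits
ori $t0,$t0,0x0100 # $t0 = $t0 | 0x0100, forcing bit 8 of the result to 1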
=============================================================================
10. Write short notes on Decision making and branching instructions? ( control operations )
Branch and Conditional branches: Decision making is commonly represented in programming languages
using the if statement, sometimes combined with go to statements and labels. MIPS assembly language
includes two decision-making instructions, similar to an if statement with a go to. The first instruction is
beq register1, register2, L1
This instruction means go to the statement labeled L1 if the value in register1 equals the value in register2.
The mnemonic beq stands for branch if equal. The second instruction is
bne register1, register2, L1
It means go to the statement labeled L1 if the value in register1 does not equal the value in register2.
The mnemonic bne stands for branch if not equal. These two instructions are traditionally called
conditional branches.
EXAMPLE: if (i == j) f = g + h; else f = g - h; the MIPS version of the given statements is
bne $s3,$s4, Else # go to Else if i != j
add $s0,$s1,$s2 # f = g + h (skipped if i != j)
j Exit # go to Exit
Else: sub $s0,$s1,$s2 # f = g - h (skipped if i == j)
Exit:

Here bne is used instead of beq, because testing for the opposite condition (not equal) is more
efficient. This example introduces another kind of branch, often called an unconditional branch.
This instruction says that the processor always follows the branch. To distinguish between conditional
and unconditional branches, the MIPS name for this type of instruction is jump, abbreviated as j.
(In the example, f, g, h, i, and j are variables mapped to five registers $s0 through $s4.)

Loops:
Decisions are important both for choosing between two alternatives (found in if statements) and for
iterating a computation (found in loops). The same assembly instructions are the basic building blocks for
both cases (if and loop).
EXAMPLE: while (save[i] == k)
i += 1;
The MIPS version of the given statements is as follows. Assume that i and k correspond to registers $s3
and $s5 and the base of the array save is in $s6.
The first instruction multiplies i by 4 to get the byte offset:
Loop: sll $t1,$s3,2 # Temp reg $t1 = i * 4
To get the address of save[i], we add $t1 and the base of save in $s6:
add $t1,$t1,$s6 # $t1 = address of save[i]
Now we can use that address to load save[i] into a temporary register:
lw $t0,0($t1) # Temp reg $t0 = save[i]
The next instruction performs the loop test, exiting if save[i] != k:
bne $t0,$s5, Exit # go to Exit if save[i] != k
The next instruction adds 1 to i:
addi $s3,$s3,1 # i = i + 1
The end of the loop branches back to the while test at the top of the loop, and we add the Exit label after it:
j Loop # go to Loop
Exit:
=============================================================================
11. Write short notes on addressing and addressing modes?
Addressing types: Three-address instructions
Syntax: opcode source1, source2, destination
Eg: ADD A, B, C (operation is C = A + B)
Two-address instructions
Syntax: opcode source, destination
Eg: ADD A, B (operation is B = A + B)
One-address instructions (to fit in one word length)
Syntax: opcode source
Eg: STORE C (copies the content of the accumulator to memory location C)
where the accumulator is an implicit processor register.
Zero-address instructions - stack operation
Syntax: opcode
Eg: PUSH A (all addresses are implicit; pushes the value at A onto the stack)
Addressing Modes:
The different ways in which the location of an operand is specified in an instruction are
referred to as addressing modes. It is a method used to determine which part of memory is being referred
by a machine instruction.
Register mode:
Operand is the content of a processor register. The register name/address is
given in the instruction. Value of R2 is moved to R1.
Example: MOV R1, R2
Absolute mode (direct):
Operand is in the memory location. Address of the location is given
explicitly. Here value in A is moved to 1000H.
Example: MOV 1000, A
Immediate mode:
Address and data constants can be given explicitly in the instruction.
Here value constant 200 is moved to R0 register.
Example: MOV #200, R0
Indirect Mode:
The processor reads the content of the register (R1 in this case), which does not hold the
operand value directly. Instead, it holds the address or location in which the value is stored.
The fetched value is then added to the value in register R0.
Example: ADD (R1), R0

Indexed / Relative Addressing Mode:
The processor takes the R1 register content as the base address and adds the constant 20
(offset / displacement) to the base address to get the derived or actual memory location of
the value stored in memory. It fetches the value and then adds it to the R2 register.
Example: ADD 20(R1), R2
Auto increment mode and Auto decrement Mode: The value in the register / address that is
supplied in the instruction is incremented or decremented.
Example: Increment R1 (Increments the given register / address content by one)
Example: Decrement R2 (Decrements the given register / address content by one)
===============================================================================================

UNIT 2

ARITHMETIC OPERATIONS

PART B
1. Write short notes on Addition and Subtraction?
Digits are added bit by bit from right to left, with carries passed to the next digit to the left.

adding 6ten to 7ten in binary


0000 0000 0000 0000 0000 0000 0000 0111 two = 7ten
+ 0000 0000 0000 0000 0000 0000 0000 0110two = 6ten
___________________________________________________________
= 0000 0000 0000 0000 0000 0000 0000 1101two = 13ten

adding 6ten to 7ten in binary (showing carries)


Taking the last 4 bit
(1)
(1)
(0) ------- carries
0
1
1
1
0
1
1
0
____________________
(0)1 (1)1 (1)0 (0)1

Subtracting 6ten from 7ten in binary:

  0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
- 0000 0000 0000 0000 0000 0000 0000 0110two = 6ten
_______________________________________________________
= 0000 0000 0000 0000 0000 0000 0000 0001two = 1ten

Subtraction via addition, using the two's complement representation of -6:

  0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
+ 1111 1111 1111 1111 1111 1111 1111 1010two = -6ten (two's complement of 6)
_______________________________________________________
= 0000 0000 0000 0000 0000 0000 0000 0001two = 1ten
Overflow cannot occur in the following cases:
When adding operands with different signs, e.g., -10 + 4 = -6: the sum is not larger in magnitude than one
of the operands, so no overflow occurs when adding a positive and a negative operand.
When subtracting operands of the same sign, overflow cannot occur either, e.g., c - a = c + (-a), since
subtraction is done via an add operation by negating the second operand, which turns it into an addition
of operands with different signs.
Overflow occurrence detection:
Adding or subtracting two 32-bit numbers can yield a result that needs 33 bits to be fully expressed.
The lack of a 33rd bit means that when overflow occurs, the sign bit is set with a value of the result
instead of the proper sign of the result. Overflow occurs when we:
add two positive numbers and the sum is negative (or two negative numbers and the sum is positive);
subtract a negative number from a positive number and get a negative result;
subtract a positive number from a negative number and get a positive result.
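For example (a 4-bit illustration of the first rule): 0111two + 0001two = 1000two. Two positive numbers
(7 + 1) produce a result whose sign bit is 1 (-8 in two's complement), so overflow has occurred.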
MIPS has two choices of instructions:
Add (add), add immediate (addi), and subtract (sub) cause exceptions on overflow. E.g., Fortran
compilers use this type of instruction.
Add unsigned (addu), add immediate unsigned (addiu), and subtract unsigned (subu) do not cause
exceptions on overflow. E.g., C compilers use this type of instruction.

2. Explain Multiplication Algorithm along with an example.


Multiplication of decimal numbers in longhand can be used to show the steps
of multiplication and the names of the operands.

Multiplying 1000ten by 1001ten:

Multiplicand      1000ten
Multiplier      x 1001ten
                ---------
                     1000
                    0000
                   0000
                  1000
                ---------
Product        1001000ten
The number of digits in the product is considerably larger than the number in either the multiplicand
or the multiplier. The product of an n-bit multiplicand and an m-bit multiplier is n + m bits long
(ignoring the sign bit), so n + m bits are required to represent all possible products. In this case one
also has to consider the overflow condition.

Based on the paper-and-pencil technique shown above, the multiplicand is shifted left each step and
the multiplier is shifted right after each bit has performed its intermediate step. The number of
iterations needed to find the product equals the number of bits in the multiplier; in MIPS this means
32 iterations.
Example: Multiply 2ten x 3ten, or 0010two x 0011two. (4 bits are used to save space.)
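Since the original iteration figure is not reproduced here, the following is a sketch of the standard
shift-and-add iterations for 0010two x 0011two (the multiplicand sits in the right half of an 8-bit
register and shifts left; the product register starts at 0):

Iteration  Step                                        Multiplier  Multiplicand  Product
0          Initial values                              0011        0000 0010     0000 0000
1          LSB = 1: Product += Multiplicand            0011        0000 0010     0000 0010
           Shift multiplicand left, multiplier right   0001        0000 0100     0000 0010
2          LSB = 1: Product += Multiplicand            0001        0000 0100     0000 0110
           Shift multiplicand left, multiplier right   0000        0000 1000     0000 0110
3          LSB = 0: no add; shift both                 0000        0001 0000     0000 0110
4          LSB = 0: no add; shift both                 0000        0010 0000     0000 0110

The final product is 0000 0110two = 6ten = 2ten x 3ten.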

3. Explain Division Algorithm along with an example.


The reciprocal operation of multiply is divide, an operation that is even less frequent
and even more quirky.
An example is dividing 1,001,010ten by 1000ten:

                   1001ten      Quotient
Divisor 1000ten | 1001010ten    Dividend
                  -1000
                     10
                     101
                     1010
                    -1000
                       10ten    Remainder

Division takes two operands, called the dividend and the divisor.
The result, called the quotient, is accompanied by a second result, called the remainder.
Here is another way to express the relationship between the components:
Dividend = Quotient x Divisor + Remainder
where the remainder is smaller than the divisor.

Binary numbers contain only 0 or 1, so binary division is restricted to these two choices,
thereby simplifying binary division.

Each iteration of the algorithm needs to move the divisor to the right one digit, so we start with the
divisor placed in the left half of the 64-bit Divisor register and shift it right 1 bit each step.
The Remainder register is initialized with the dividend.
Example,
Using a 4-bit version of the algorithm to save pages, divide 7ten by 2ten,
or 0000 0111two by 0010two.
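As a quick check of this example by binary long division: 0000 0111two / 0010two gives quotient
0011two = 3ten and remainder 0001two = 1ten, and indeed Dividend = Quotient x Divisor + Remainder
= 3 x 2 + 1 = 7.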

4. Explain floating point number representation and the IEEE 754 format.
In mathematics, a floating-point number is otherwise known as the real number representation.
Examples:
3.14159265ten (pi)
1.0ten x 10^-9 (seconds in a nanosecond)
3.15576ten x 10^9 (seconds in a typical century)
A number in scientific notation that has a single digit to the left of the decimal point and
no leading 0s is called a normalized number.
For example, 1.0ten x 10^-9 is in normalized scientific notation,
but 0.1ten x 10^-8 and 10.0ten x 10^-10 are not.
Binary numbers can also be written in scientific notation: 1.0two x 2^-1; here the point is called
the binary point for clarity.
The general syntax for a binary floating-point number is 1.xxxxxxxxxtwo x 2^yyyy.

Floating-Point Representation:
The representation has two parts: a fraction part and an exponent part. The fraction size
is used to increase or decrease precision, while the exponent size is used to increase or decrease
the range of the number.
The representation also has 1 sign bit, where bit 0 represents a positive value and bit 1 represents
a negative value.

In general, floating-point numbers are of the form
(-1)^S x F x 2^E
where F involves the value in the fraction field and E involves the value in the exponent field.
Overflow here means that the exponent is too large to be represented in the exponent field.
When a nonzero fraction is too small to be represented in the fraction field, we call it underflow.

Single and double precision representation:
The range of the single precision format is from 2.0 x 10^-38 to 2.0 x 10^38.
The range of the double precision format is from 2.0 x 10^-308 to 2.0 x 10^308.
To pack even more bits into the significand, IEEE 754 makes the leading 1-bit of
normalized binary numbers implicit. Hence, the number is actually 24 bits long in single precision
(implied 1 and a 23-bit fraction), and 53 bits long in double precision (1 + 52).
So the new representation will be (-1)^S x (1 + F) x 2^E.
The IEEE format uses NaN, for Not a Number, to represent the results of undefined operations such as
0/0, and separate bit patterns to represent infinity.
Consider the representation of exponent values such as 2^+1 and 2^-1. If the exponent -1 were stored
in two's complement format, the value -1 would look as if a large exponent value were sitting in the
exponent field; to avoid this confusion, a bias value is used.
IEEE 754 uses a bias of 127 for single precision, so an exponent of -1 is represented by the bit
pattern of the value -1 + 127ten, or 126ten, and +1 is represented by +1 + 127ten, or 128ten.
So the value represented by a floating-point number is
(-1)^S x (1 + F) x 2^(Exponent - Bias)
The range of single precision numbers is then from as small as
+/-1.00000000000000000000000two x 2^-126
to as large as
+/-1.11111111111111111111111two x 2^+127
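As a worked encoding example (a standard exercise, with the value chosen for illustration): represent
-0.75ten in IEEE 754 single precision.
-0.75ten = -0.11two = -1.1two x 2^-1
Sign = 1; Exponent = -1 + 127 = 126ten = 0111 1110two; Fraction = 1000 0000 0000 0000 0000 000two
So the 32-bit pattern is 1 01111110 10000000000000000000000.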
5. State and explain the arithmetic addition operations on floating point numbers.
Add numbers in scientific notation: e.g., 9.999ten x 10^1 + 1.610ten x 10^-1.
Assume that we can store only four decimal digits of the significand and two decimal digits of the
exponent.
Step 1. Align the decimal point of the number that has the smaller exponent. Here 1.610ten x 10^-1 is
the number with the smaller exponent of the two. Moving the decimal point left, we achieve the same
exponent value as the larger number:
1.610ten x 10^-1 = 0.1610ten x 10^0 = 0.01610ten x 10^1
Only 4 digits are kept, so the number becomes 0.016ten x 10^1.


Step 2. Next comes the addition of the significands:
  9.999ten
+ 0.016ten
__________
 10.015ten
The sum is 10.015ten x 10^1.
Step 3. This sum is not in normalized scientific notation, so we need to adjust it:
10.015ten x 10^1 = 1.0015ten x 10^2
Whenever the exponent is increased or decreased, we must check for overflow or underflow; that is, we
must make sure that the exponent still fits in its field.
Step 4. Since we assumed that the significand can be only four digits long (excluding the sign), we must
round the number. The number 1.0015ten x 10^2 is rounded to four digits in the significand to
1.002ten x 10^2.

Example based on the above algorithm, using binary numbers (Binary Floating-Point Addition):
Add the numbers 0.5ten and -0.4375ten in binary. Converting the given numbers to binary we get:
0.5ten = 0.1two = 0.1two x 2^0 = 1.000two x 2^-1
-0.4375ten = -0.0111two = -0.0111two x 2^0 = -1.110two x 2^-2
Step 1: The number with the smaller exponent is shifted right: -1.110two x 2^-2 = -0.111two x 2^-1
Step 2: Add the significands: 1.000two + (-0.111two) = 0.001two, so the sum is 0.001two x 2^-1
Step 3: Normalize the sum, checking for underflow and overflow:
0.001two x 2^-1 = 0.010two x 2^-2 = 0.100two x 2^-3 = 1.000two x 2^-4
Step 4: Round the sum: 1.000two x 2^-4 already fits in the 4-bit significand, so no change.
The decimal value of 1.000two x 2^-4 = 0.0001two is, on conversion, 1 x 2^-4 = 1/16ten = 0.0625ten.
When added normally, the sum of 0.5ten and -0.4375ten = 0.0625ten. Thus verified.

6. State and explain the arithmetic multiplication operations on floating point numbers.
Multiplying decimal numbers in scientific notation by hand:
1.110ten x 10^10  x  9.200ten x 10^-5
Assume that we can store only four digits of the significand and two digits of the exponent.
Step 1. Unlike addition, we calculate the exponent of the product by simply adding the exponents of the
operands together:
New exponent = 10 + (-5) = 5
Step 2. Next comes the multiplication of the significands:
     1.110ten
x    9.200ten
-------------
        0000
       0000
      2220
     9990
-------------
  10212000ten
There are three digits to the right of the decimal point for each operand, so the decimal point is placed six
digits from the right in the product significand:
10.212000ten
Assuming that we can keep only three digits to the right of the decimal point, the product is 10.212 x 10^5.
Step 3. This product is unnormalized, so we need to normalize it:
10.212ten x 10^5 = 1.0212ten x 10^6
and check for overflow and underflow.
Step 4. We assumed that the significand is only four digits long (excluding the sign), so we must round the
number. The number 1.0212ten x 10^6 is rounded to four digits in the significand to 1.021ten x 10^6.
Step 5. The sign of the product depends on the signs of the original operands. If they are both the same, the
sign is positive; otherwise, it is negative. Hence, the product is
+1.021ten x 10^6
Example based on the above algorithm, using binary numbers (Binary Floating-Point Multiplication):
Multiply the numbers 0.5ten and -0.4375ten in binary.
Convert the numbers to binary:
0.5ten = 0.1two = 0.1two x 2^0 = 1.000two x 2^-1
-0.4375ten = -0.0111two = -0.0111two x 2^0 = -1.110two x 2^-2
Step 1. Add the exponents without bias: -1 + (-2) = -3
Step 2. Multiply the significands:
     1.000two
x    1.110two
-------------
        0000
       1000
      1000
     1000
-------------
  1110000two
There are three digits to the right of the binary point in each operand, so the product is
1.110000two x 2^-3; since we need to keep only 4 bits, it is 1.110two x 2^-3.
Step 3. Check whether the product is normalized, and check for underflow or overflow.
It is already normalized, so the product is 1.110two x 2^-3.
Step 4. Round the product to 4 bits; it already fits, so no action is needed:
1.110two x 2^-3
Step 5. Since the signs of the operands are different, the product is shown as negative:
-1.110two x 2^-3
Converting to decimal: -1.110two x 2^-3 = -0.001110two = -7 x 2^-5 = -7/32ten = -0.21875ten

7. Write short notes on floating point instructions in MIPS?


MIPS supports the IEEE 754 single precision and double precision formats with these instructions:
Floating-point addition, single (add.s) and addition, double (add.d)
Floating-point subtraction, single (sub.s) and subtraction, double (sub.d)
Floating-point multiplication, single (mul.s) and multiplication, double (mul.d)
Floating-point division, single (div.s) and division, double (div.d)
Floating-point comparison, single (c.x.s) and comparison, double (c.x.d),
where x may be equal (eq), not equal (neq), less than (lt), less than or equal (le), greater than (gt), or
greater than or equal (ge)
Floating-point branch, true (bc1t) and branch, false (bc1f)
The MIPS designers decided to add separate floating-point registers, called $f0, $f1, $f2, ..., used either
for single precision or double precision. Hence, they included separate loads and stores for floating-point
registers: lwc1 and swc1.
E.g., lwc1 $f4,c($sp) # Load 32-bit F.P. number into F4
Here $sp is a normal integer register, since the address representation remains the same; c is the offset value.
swc1 $f2,b($sp) # Store 32-bit F.P. number from F2
Again $sp is a normal integer register and b is the offset value.
A double precision number uses a pair of single precision registers, such as $f2 and $f3 (an even-odd
register number combination).
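As a small usage sketch (the stack offsets for the three variables are assumed for illustration):
computing c = a + b in single precision, with a, b, and c at offsets 0, 4, and 8 from $sp:
lwc1 $f4,0($sp) # $f4 = a (32-bit F.P. load)
lwc1 $f6,4($sp) # $f6 = b
add.s $f2,$f4,$f6 # $f2 = a + b, single precision add
swc1 $f2,8($sp) # c = a + b (32-bit F.P. store)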

UNIT - 3 PROCESSOR AND CONTROL UNIT


1. How can memory access be made faster, and which hazard can be reduced when memory is accessed faster?
Ans: Pipelining is an implementation technique in which multiple instructions are
overlapped in execution. Today, pipelining is key to making processors fast; hence memory access is faster.
Control hazard can be reduced.
What would be the effect if we increase the number of pipelining stages?
Ans: Limitations on practical depth of a pipeline arise from:
Pipeline latency. The fact that the execution time of each instruction does not decrease puts limitations on
pipeline depth;
Imbalance among pipeline stages. Imbalance among the pipe stages reduces performance since the clock
can run no faster than the time needed for the slowest pipeline stage;
Pipeline overhead. Pipeline overhead arises from the combination of pipeline register delay (setup time
plus propagation delay) and clock skew.
What is a hazard?
Ans: There are situations in pipelining when the next instruction cannot execute in the
following clock cycle. These events are called hazards, and there are three different
types.
What is dynamic branch prediction?
Ans: One approach is to look up the address of the instruction to see if a branch was taken the last time
this instruction was executed, and, if so, to begin fetching new instructions from the same place as the last
time. This technique is called dynamic branch prediction.
List the key aspects of gaining performance in a pipelined system.
Ans: The processor idle time is reduced, and a proper program will lead to fewer data / control hazards.
PART B

1. Write short Notes on Pipelining?


Pipelining is an implementation technique in which multiple instructions are
overlapped in execution.
EXAMPLE: LAUNDRY EXAMPLE
The non-pipelined approach. Determine the number of stages in the whole process.
STAGES:
1. Place dirty clothes in the washer.
2. After stage 1, load the washed (wet) clothes into the dryer.
3. After stage 2 is finished, place the dry load on a table and fold.
4. After stage 3, put the folded clothes in the storage area.
If there are 4 people to use a single washing machine, the following diagram shows the non-pipelined
approach. Queue waiting time is long, and the idle time of each stage between usages is long.

The pipelined approach, Overlapping of each stages without disturbing the other stages.
Now each person's waiting time is reduced and each stage Idle time is reduced.
The same principles apply to processors where the pipeline instruction-execution is applied. So, the
MIPS instructions classically take five steps or stages:
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
All the pipeline stages take a single clock cycle, so the clock cycle must be long
enough to accommodate the slowest operation. Here the lw instruction takes 800 ps (worst case), so the
clock cycle must be at least 800 ps.
If the stages are perfectly balanced, then the time between instructions on the pipelined processor,
assuming ideal conditions, is equal to
Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipe stages
Under ideal conditions and with a large number of instructions, the speed-up from pipelining is
approximately equal to the number of pipe stages; a five-stage pipeline is nearly five times faster.
Non pipelined Approach, EXAMPLE - sequences of load Instruction

pipelined Approach, Above EXAMPLE - sequences of load Instruction, but stages are overlapped
during execution as given below.

Total time taken for execution is 1400 ps, while in the non-pipelined approach it is 2400 ps,
but the formula suggests
Time taken (pipelined approach) = Time taken (non-pipelined approach) / Number of stages
= 2400 ps / 5
= 480 ps
The practical result, however, is 1400 ps. Only when the number of instructions in pipelined execution
is high enough can the theoretical execution speed be achieved or nearly achieved.
Pipelining improves performance by increasing instruction throughput.
----------------------------------
2. Explain types of pipeline Hazards?
There are situations in pipelining when the next instruction cannot execute in the following clock cycle.
These events are called hazards. There are 3 types of Hazards:
1. structural hazard: When a planned instruction cannot execute in the proper clock cycle
because the hardware does not support the combination of instructions that are set to execute.
2. data hazard / pipeline data hazard: When a planned instruction cannot execute in the
proper clock cycle because data that is needed to execute the instruction is not yet available.
3. control hazard / branch hazard: When the proper instruction cannot execute in the
proper pipeline clock cycle because the instruction that was fetched is not the one that is needed; that is, the
flow of instruction addresses is not what the pipeline expected.

structural hazard:
The hardware cannot support the combination of instructions that we want to execute in the same
clock cycle. Assume that we had a single memory instead of two memories. If the pipeline in the figure of
Part B Q.No 1 had a fourth instruction, we would see that in the same clock cycle the first instruction
is accessing data from memory while the fourth instruction is fetching an instruction from that same
memory. Without two memories, our pipeline could have a structural hazard.

Data Hazards:
Data hazards occur when the pipeline must be stalled because one step must wait for another to
complete. In a computer pipeline, data hazards arise from the dependence of one
instruction on an earlier one that is still in the pipeline.
For example, suppose we have an add instruction followed immediately by a subtract instruction

add $s0, $t0, $t1
sub $t2, $s0, $t3

The add instruction doesn't write its result until the fifth stage, meaning that we would have to waste
three clock cycles in the pipeline.
To resolve the data hazard for the code sequence above: as soon as the ALU creates the sum for the
add, we can supply it as an input for the subtract. This is done by adding extra hardware to retrieve
the missing item early from the internal resources, a technique called forwarding or bypassing.

Figure below shows the connection to forward the value in $s0 after the execution stage of the add
instruction as input to the execution stage of the sub instruction.

forwarding paths are valid only if the destination stage is later in time than the source stage.

For example, there cannot be a valid forwarding path from the output of the memory access stage in
the first instruction to the input of the execution stage of the following, since that would mean going
backward in time.
Forwarding cannot prevent all pipeline stalls.
Suppose the first instruction was a load of $s0 instead of an add. The desired data would be available only
after the fourth stage of the first instruction in the dependence, which is too late for the input of the third
stage of the sub instruction.
lw $s0, 20($t1)

sub $t2, $s0, $t3

Even with forwarding, we would have to stall one stage for a load-use data hazard. This illustrates
an important pipeline concept, officially called a pipeline stall, but often given the nickname bubble.
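A rough cycle-by-cycle sketch of this load-use stall (stage labels as in the five-stage pipeline above):
lw $s0, 20($t1)    IF ID EX MEM WB
sub $t2, $s0, $t3     IF ID ** EX MEM WB
The ** is the one-cycle bubble; after it, the loaded value can be forwarded from the MEM/WB register
to the EX stage of the sub.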

Control Hazards:
A control hazard arises from the need to make a decision based on the results of one instruction while
others are executing. The equivalent decision task in a computer is the branch instruction.
Having only just received the branch instruction from memory, the pipeline cannot possibly know what
the next instruction to be loaded should be. One possible solution is to stall immediately after we fetch a
branch, waiting until the pipeline determines the outcome of the branch.

The cost of this option is too high for most computers to use, and so a second solution to the control hazard
is to predict. This option does not slow down the pipeline when the prediction is correct.
One simple approach is to predict always that branches will be untaken.
Only when branches are taken does the pipeline stall (the prediction goes wrong).
In the case of loops, branches are predicted to be taken because they jump back to the top of the loop.
In the case of conditional statements, branches are predicted to be untaken because they jump in the
forward direction with reference to the program flow.
Applying the PREDICTION METHODOLOGIES
Static branch prediction
o Based on typical branch behavior of the branching statements.
o Example: loop and if-statement branches

loop: predict taken


If: Predict not taken
Dynamic branch prediction
o Hardware measures the actual branch behavior of branches,
e.g., it records the recent history of each branch.
o It assumes future behavior will continue the trend and refers to the stored history.
When the history data goes wrong, the pipeline stalls while re-fetching from the correct
branch target and updates the history accordingly.
----------------------------------
3. Explain data path in pipeline or pipelined data path?
The data path is separated into five pieces, with each piece named corresponding to a stage of instruction
execution:
1. IF: Instruction fetch
2. ID: Instruction decode and register file read
3. EX: Execution or address calculation
4. MEM: Data memory access
5. WB: Write back

Refer to the above diagram: instructions and data move generally from left to right through the five
stages as they complete execution.
There are, however, two exceptions to this left-to-right flow of instructions:
o The write-back stage, which places the result back into the register file in the middle of the
data path.
o The selection of the next value of the PC, choosing between the incremented PC and the
branch address from the MEM stage.
Data flowing from right to left does not affect the current instruction; these reverse data movements
influence only later instructions in the pipeline.
o The first right-to-left flow of data can lead to data hazards
o The second right-to-left leads to control hazards.
To retain the value of an individual instruction for its other four stages, the value read from instruction
memory must be saved in a register. Similar arguments apply to every pipeline stage, so we must
place registers wherever there are dividing lines between stages.
The registers are named for the two stages separated by that register. For example, the
pipeline register between the IF and ID stages is called IF/ID. Notice that there is no pipeline register at the
end of the write-back stage.
All instructions must update some state in the processor: the register file, memory, or the PC. This
permanent state must be saved correctly when an exception occurs, while the contents of the pipeline
registers can be discarded. The registers can be referred to in the diagram below.
Consider load instruction abbreviation lw and how the operations of this instruction in a
pipe line is done are as follows: (lw $s0, 32($t0))
1. Instruction fetch: The top portion of Figure shows the instruction being read from memory using the
address in the PC and then being placed in the IF/ID pipeline register. The PC address is incremented by 4
and then written back into the PC to be ready for the next clock cycle. This incremented
address is also saved in the IF/ID pipeline register in case it is needed later for a branch instruction, such as
beq.
2. Instruction decode and register file read: The bottom portion of Figure shows the instruction portion
of the IF/ID pipeline register supplying the 16-bit immediate field of the load instruction (here, in our
example, 32), which is sign-extended to 32 bits, and the register numbers to read the two registers ($s0
and $t0). All three values are stored in the ID/EX pipeline register, along with the incremented PC address.
3. Execute or address calculation: The figure shows the load instruction reading the contents of the first read register ($t0) and the sign-extended immediate (value 32) from the ID/EX pipeline register and adding them using the ALU. That sum is placed in the EX/MEM pipeline register.
4. Memory access: The top portion of the figure shows the load instruction reading the data memory using the address from the EX/MEM pipeline register and loading the data into the MEM/WB pipeline register.
5. Write-back: The bottom portion of the figure shows the final step: reading the data from the MEM/WB pipeline register and writing it into the register file ($s0) in the middle of the figure.

The following figure shows the reading and writing operations performed during a stage (here the instruction decode stage). It shows that a read is performed from the IF/ID register to obtain input for the stage, and that after the stage completes execution the results are written into the ID/EX register.
Note: The read and write operations are shaded differently.
The other stages follow the same read/write pattern with respect to the execution of each stage and its pipeline buffers.
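The contents each pipeline register must carry for this lw walkthrough can be sketched as C structs. This is only an illustrative model of the values named above, not a hardware description; the field names are assumptions.

    #include <stdint.h>

    /* IF/ID: the fetched instruction and the incremented PC. */
    struct IF_ID  { uint32_t instruction; uint32_t pc_plus4; };

    /* ID/EX: the two register values read, the sign-extended immediate,
       and the incremented PC (still needed for a later branch target). */
    struct ID_EX  { uint32_t read_data1; uint32_t read_data2;
                    int32_t  imm_sext;   uint32_t pc_plus4; };

    /* EX/MEM: the ALU result, e.g. the effective address 32 + $t0 for lw. */
    struct EX_MEM { uint32_t alu_result; uint32_t write_data; };

    /* MEM/WB: the word loaded from data memory, ready for write-back to $s0. */
    struct MEM_WB { uint32_t mem_data; uint32_t alu_result; };

Each clock edge copies a stage's outputs into the register on its right, which is exactly why the value of an instruction survives for its remaining stages.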

---------------------------------------
4. Explain control path in pipeline or pipelined control?


Now in this section we add control to the pipelined data path. The PC is written on each clock cycle,
so there is no separate write signal for the PC. There are no separate write signals for the pipeline registers
(IF/ID, ID/EX, EX/MEM, and MEM/WB), since the pipeline registers are also written during each clock
cycle.
Each control line is associated with a component that is active in only a single pipeline stage. Hence we can divide the control lines into five groups according to the pipeline stage.
1. Instruction fetch: The control signals to read instruction memory and to write the PC are always
asserted, so there is nothing special to control in this pipeline stage.
2. Instruction decode/register file read: As in the previous stage, the same thing happens at every clock
cycle, so there are no optional control lines to set.
3. Execution/address calculation: The signals to be set are RegDst, ALUOp, and ALUSrc. These signals select the result (destination) register, the ALU operation, and either Read data 2 or a sign-extended immediate as the second ALU input.
4. Memory access: The control lines set in this stage are Branch, MemRead, and MemWrite. The branch
equal, load, and store instructions set these signals, respectively. Recall that PCSrc in the figure selects the
next sequential address unless control asserts Branch and the ALU result was 0.
5. Write-back: The two control lines are MemtoReg, which decides between sending the ALU result or the memory value to the register file, and RegWrite, which writes the chosen value.

Control signals are derived from the instruction, as in the single-cycle implementation; table 4.49 shows which signals of each instruction are active during its respective stages. Implementing control means setting the nine control lines to these values in each stage for each instruction. The simplest way to do this is to extend the pipeline registers to include control information. Since the control lines are first used in the EX stage, we can create all the control information during instruction decode. Figure 4.50 shows that these control signals are then used in the appropriate pipeline stage as the instruction moves down the pipeline, just as the destination register number for loads moves down the pipeline.
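The idea of extending the pipeline registers with control information can be sketched as follows. The grouping of the nine control lines follows the five stages listed above; the struct layout itself is an assumption for illustration.

    /* Control lines created during decode and carried stage to stage. */
    struct ExControl  { unsigned RegDst:1;   unsigned ALUOp:2;   unsigned ALUSrc:1;   };
    struct MemControl { unsigned Branch:1;   unsigned MemRead:1; unsigned MemWrite:1; };
    struct WbControl  { unsigned MemtoReg:1; unsigned RegWrite:1; };

    /* The control bundle shrinks as it moves down the pipeline, because
       each stage consumes (and drops) its own group of signals. */
    struct ID_EX_Ctrl  { struct ExControl ex; struct MemControl mem; struct WbControl wb; };
    struct EX_MEM_Ctrl { struct MemControl mem; struct WbControl wb; };  /* EX group used up */
    struct MEM_WB_Ctrl { struct WbControl wb; };                         /* MEM group used up */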

--------------------------------------
5. Explain data hazard and the ways to detect and handle the data hazards?
Consider this sequence:

    sub $2,  $1, $3     # Register $2 written by sub
    and $12, $2, $5     # 1st operand ($2) depends on sub
    or  $13, $6, $2     # 2nd operand ($2) depends on sub
    add $14, $2, $2     # 1st ($2) & 2nd ($2) depend on sub
    sw  $15, 100($2)    # Base ($2) depends on sub
The last four instructions are all dependent on the result in register $2 of the first instruction. Assume register $2 holds the value 10 before the subtract instruction and -20 after it. The intention of the code is that the four instructions after the subtraction use the new value -20 from register $2.

The above diagram shows the dependence of each instruction on the first instruction, sub, and on the result stored in register $2.
As shown, the and & or instructions would get the incorrect value 10 (the value of $2 before the sub executes).
The instructions that would get the correct value of -20 are add & sw, since both read $2 at or after CC5 (clock cycle 5).
TO DETECT AND FORWARD WITHOUT STALL:
To avoid a stall, the result can be forwarded to the and & or instructions from CC3, where the result is available at the end of the EX stage (as shown in the above diagram).
To detect the data hazard, the dependence of a source register on a destination register of a previous instruction can be found by checking the following conditions:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
When the result is available in CC3 and one of the above conditions is TRUE, the hazard is detected and forwarding must be initiated to the next instruction at the end of the EX stage (CC3) of the current instruction. In this example, condition 1a is TRUE:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 is TRUE.
Similarly, when the result is available in CC4 and one of the conditions below is TRUE, the hazard is detected and forwarding must be initiated to the next instruction at the end of the MEM stage (CC4) of the current instruction:
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
In the given example, condition 2b is TRUE:
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt = $2 is TRUE.
Since $2 is the second operand of the or instruction, 2b is TRUE. All four conditions apply only when the earlier instruction actually writes a register (RegWrite is asserted) and its destination register Rd is not register zero.

To forward/bypass the result in CC3, at the end of the EX stage:

Either condition 1a and EX/MEM.RegisterRd ≠ 0 must be TRUE,
or condition 1b and EX/MEM.RegisterRd ≠ 0 must be TRUE.

To forward/bypass the result in CC4, at the end of the MEM stage:

Either condition 2a and MEM/WB.RegisterRd ≠ 0 must be TRUE,
or condition 2b and MEM/WB.RegisterRd ≠ 0 must be TRUE.
All the above conditions are checked by a special piece of hardware called the forwarding unit; refer to the figure below.
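The forwarding unit is combinational logic; its checks for the first ALU operand can be modelled in C as below. The 0/1/2 mux encoding is one common convention, assumed here for illustration, and a symmetric function using ID/EX.RegisterRt would compute the select for the second operand.

    /* Mux select for the first ALU operand (rs):
       0 = register-file value, 2 = forward EX/MEM result, 1 = forward MEM/WB value. */
    int forward_a(int exmem_regwrite, int exmem_rd,
                  int memwb_regwrite, int memwb_rd, int idex_rs) {
        if (exmem_regwrite && exmem_rd != 0 && exmem_rd == idex_rs)
            return 2;   /* condition 1a: take the ALU result from EX/MEM */
        if (memwb_regwrite && memwb_rd != 0 && memwb_rd == idex_rs)
            return 1;   /* condition 2a: take the value from MEM/WB */
        return 0;       /* no hazard: use the value read from the register file */
    }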

TO INTRODUCE A STALL WHEN FORWARDING FAILS

Now consider an example where the first instruction is a LOAD instruction and the second instruction is dependent on it, needing the result of the LOAD in its EX stage. Forwarding is then not possible, because data cannot be forwarded backward in time.
The following diagram shows the need for a stall operation.
As before, to detect the data hazard the source and destination register fields in the IF/ID and ID/EX registers are compared; if the register numbers are the same (and the instruction in EX is a load), a stall is introduced.
STEPS TO INTRODUCE THE STALL
o Force the control values in the ID/EX register to 0.
o Introduce a NOP: stages EX, MEM and WB perform no operation; in this case the stall is introduced from the second instruction.
o Prevent the update of the PC and the IF/ID register (so the 3rd instruction is not loaded immediately).
o The same instruction is decoded again (the second instruction, as per the given example) and the following instruction is fetched again (the 3rd instruction, as per the given example).
o Now the first instruction has moved on to its MEM cycle, so the result can be forwarded to the second instruction (in the given example).
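The load-use check just described is performed by the hazard detection unit during decode. A behavioural sketch in C, using the same field names as the conditions above:

    /* Stall if the instruction in EX is a load (MemRead asserted) whose
       destination register (rt) is a source of the instruction in decode. */
    int must_stall(int idex_memread, int idex_rt,
                   int ifid_rs, int ifid_rt) {
        return idex_memread &&
               (idex_rt == ifid_rs || idex_rt == ifid_rt);
    }

When must_stall() is true, the control values in ID/EX are zeroed (the bubble) and the PC and IF/ID register are held, exactly as in the steps above.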

PART A
1. How can memory access be made faster, and which hazard can be reduced when memory is accessed faster?
Ans: Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Today, pipelining is key to making processors fast; hence memory access becomes faster. The control hazard can be reduced.

2. What would be the effect of increasing the number of pipeline stages?

Ans: Limitations on the practical depth of a pipeline arise from:
Pipeline latency. The fact that the execution time of each instruction does not decrease puts limitations on pipeline depth.
Imbalance among pipeline stages. Imbalance among the pipe stages reduces performance, since the clock can run no faster than the time needed for the slowest pipeline stage.
Pipeline overhead. Pipeline overhead arises from the combination of pipeline register delay (setup time plus propagation delay) and clock skew.
3. What is a hazard?
Ans: There are situations in pipelining when the next instruction cannot execute in the following clock cycle. These events are called hazards, and there are three different types.
4. What is dynamic branch prediction?
Ans: One approach is to look up the address of the instruction to see whether the branch was taken the last time this instruction was executed and, if so, to begin fetching new instructions from the same place as last time. This technique is called dynamic branch prediction.
5. List the key aspects in gaining performance in a pipelined system?
Ans: Processor idle time is reduced, and a properly structured program will lead to fewer data/control hazards.

UNIT 4 - PARALLELISM
PART B
1. Explain instruction level parallelism and the difficulties in implementing it?
Ans: The goal of ILP is to achieve not only instruction overlap, but the actual execution of more than one instruction at a time through dynamic scheduling, and to maximize the throughput of a processor. For typical RISC processors, instructions usually depend on each other, and as a result the amount of overlap is limited.
Instruction level dependencies and hazards during ILP
If two instructions are not dependent, they can execute simultaneously, assuming sufficient resources (that is, no structural hazards). If one instruction depends on another, they must execute in order, though they may still partially overlap.
1. Data dependencies: Data dependence means that one instruction is dependent on another if there exists a chain of dependencies between them. Compilers can be of great help in detecting and scheduling around these sorts of hazards; hardware can resolve these dependencies only with severe limitations.
2. Name Dependencies: A name dependency occurs when two instructions use the same register or memory
location, called a name, but there is no flow of data between them.
An anti-dependence occurs when j writes a register/memory location that i reads.
An output dependence occurs when i and j write to the same register/memory location.
The name used in the instruction is changed so that the instructions do not conflict. This technique is known as register renaming (it uses temporary registers).
3. Data Hazards:
A data hazard arises when there is a data dependency between instructions and they are close enough to cause the pipeline to stall. There are three types of data hazards (illustrated in code after the next item):
read after write (RAW): j tries to read a source before i writes it; this is the most common type and is a true data dependence.
example sequence: 1. i=i+1; 2. j=i+1
write after write (WAW): j tries to write an operand before it is written by i; this corresponds to the output dependence.
example sequence: 1. i=i+1; 2. print i; 3. i=j+1
write after read (WAR): j tries to write a destination before i has read it; this corresponds to an anti-dependence.
example sequence: 1. read i; 2. i=j+1
4. Control Dependencies:
A control dependency determines the order of an instruction i with respect to a branch, e.g.:
if (p1)
S1
if (p2)
S2
S1 is control dependent on p1, and S2 is control dependent on p2.
Speculatively executed instructions must not affect the program state until the branch result is determined. This implies that instructions executed speculatively must not raise an exception or otherwise cause side effects.
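The three data-hazard types above can be made concrete in C; the variable names here are invented purely for illustration, and each pair of statements stands for a pair of machine instructions i and j.

    void hazards_demo(int b, int c, int e, int f) {
        int a, d;

        /* RAW (true dependence): the second statement reads what the first writes. */
        a = b + c;
        d = a + 1;

        /* WAR (anti-dependence): the second statement writes what the first reads. */
        d = a + 1;
        a = e + f;

        /* WAW (output dependence): both statements write the same location. */
        a = b + c;
        a = e + f;

        (void)a; (void)d;   /* silence unused-value warnings in this demo */
    }

Reordering either pair changes the result, which is exactly why hardware (or the compiler) must respect these dependencies.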

Implementation of ILP and overcoming hazards or dependencies

To implement ILP, three methods are used to avoid delays during ILP:
1. Scoreboarding.
2. Tomasulo's solution for dynamic scheduling.
3. Branch prediction.
1. Scoreboarding:
Instructions are issued when they are ready, not necessarily in order; hence out-of-order execution. To implement out-of-order issue we need to split the instruction decode phase into two:
1. issue: decode instructions and check for structural hazards;
2. read operands: wait until no data hazards remain, then read the operands and start executing.
The scoreboard dynamically schedules the pipeline. Instructions must pass through the issue phase in order, but they can stall or bypass each other in the read-operands phase and enter, or even complete, execution out of order.
e.g., the CDC 6600 used a scoreboard. The goal of a scoreboard is to maintain a processor throughput of one instruction per clock cycle (when there is no structural hazard). If the next instruction would stall, it is stored on a queue and a later instruction is started; the scoreboard takes full responsibility for instruction issue and execution. The CDC 6600 used as many as 16 separate functional units.
2. Tomasulo's solution for dynamic scheduling:
Instructions are executed only when their operands are available; a waiting instruction is stored in a reservation station. Reservation stations keep track of pending instructions (RAW), and WAW can be avoided using register renaming (80 registers).
The Tomasulo architecture executes instructions in three phases; each phase may take more than one clock cycle:
Phase 1 (issue): gets the next instruction from the fetch unit. If there is a reservation station available, insert the instruction; if not (a structural hazard is detected), wait for a reservation station.
Phase 2 (execute): each reservation station monitors the output of the functional units. When all operands are available, the appropriate functional unit can execute the instruction (RAW is eliminated). To preserve exception behaviour, no instruction is allowed to initiate execution until all preceding branches are finalised.
Phase 3 (write results): when execution finishes and the results are available, data are routed to both the register file and any reservation stations that depend on the result. A memory store is usually held in a separate queue until both its address and data are available.

3. Branch prediction:
The predictor used is a simple saturating n-bit counter. Each time a particular branch is taken, its entry is incremented; otherwise it is decremented. If the most significant bit of the counter is set, the branch is predicted taken.
Limitations of ILP
1. To expose the full ILP, an instruction stream needs to be run on an ideal processor with no significant limitations.
2. The ideal processor always predicts branches correctly and has no structural hazards.
3. This eliminates all control and name dependencies (only data dependencies remain).
4. Theoretically, it is then possible for the last dynamically executed instruction in the program to be scheduled on the first cycle.
-------------------------------------------------------------------------------------------------------------------------------
2. How is parallel processing implemented and what architectures are used? or Explain Flynn's classification?
Ans: A processor executes programs by executing machine instructions in sequence, one at a time. Each instruction is executed in a sequence of operations (fetch instruction, fetch operands, perform operation, store result).
The taxonomy first introduced by Flynn is still the most common way of categorizing systems with parallel processing capability. Flynn proposed the following categories of computer system:
Single Instruction, Single Data (SISD) system: A single processor executes a single instruction stream on data stored in a single memory. Uniprocessors fall into this category.
Single Instruction, Multiple Data (SIMD) system: A single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis. Each processing element has an associated data memory, so that each instruction is executed on a different set of data by the different processors. Vector and array processors fall into this category.
Multiple Instruction, Single Data (MISD) system: A sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence. This structure has never been implemented.
Multiple Instruction, Multiple Data (MIMD) system: A set of processors simultaneously execute different instruction sequences on different data sets. SMPs, clusters, and NUMA systems fit into this category.
MIMD can be subdivided into two main categories:

Symmetric multiprocessor (SMP): In an SMP, multiple processors share a single memory or a pool of memory by means of a shared bus or other interconnection mechanism. A distinguishing feature is that the memory access time to any region of memory is approximately the same for each processor.

Nonuniform memory access (NUMA): The memory access time to different regions of memory may differ for a NUMA processor.

Symmetric Multiprocessors (SMP)


A symmetric multiprocessor (SMP) can be defined as a standalone computer system with the following characteristics:
1. There are two or more similar processors of comparable capability.
2. The processors share the same main memory and I/O facilities and are interconnected by a bus or other internal connection scheme.
3. All processors share access to I/O devices, either through the same channels or through different channels.
4. All processors can perform the same functions.
5. The system is controlled by an integrated operating system that provides interaction between the processors and their programs at the job, task, file and data element levels.

SMP has potential advantages over a uniprocessor architecture:

Performance: A system with multiple processors will perform better than one with a single processor of the same type if the task can be organized so that some portion of the work can be done in parallel.
Availability: Since all processors can perform the same functions in a symmetric multiprocessor, the failure of a single processor does not stop the machine. Instead, the system can continue to function at a reduced performance level.
Incremental growth: A user can enhance the performance of a system by adding an additional processor.
Scaling: Vendors can offer a range of products with different price and performance characteristics based on the number of processors configured in the system.

Organization:
The organization of a multiprocessor system is shown in Figure 10.1

Figure 10.1: Block diagram of tightly coupled multiprocessors


The processors can communicate with each other through memory (messages and status information left in common data areas).
It may also be possible for processors to exchange signals directly.
The memory is often organized so that multiple simultaneous accesses to separate blocks of memory are possible.
In some configurations, each processor may also have its own private main memory and I/O channels in addition to the shared resources.

Non-uniform Memory Access(NUMA)


In the NUMA architecture, all processors have access to all parts of main memory using loads and stores. The memory access time of a processor differs depending on which region of main memory is accessed. This is true for all processors; however, which memory regions are slower and which are faster differ from processor to processor. A NUMA system in which cache coherence is maintained among the caches of the various processors is known as cache-coherent NUMA (CC-NUMA).
A typical CC-NUMA organization is shown in Figure 10.4.

Figure 10.4: CC- NUMA Organization

There are multiple independent nodes; each node contains multiple processors, each with its own L1 and L2 caches, plus main memory. The node is the basic building block of the overall CC-NUMA organization.
The nodes are interconnected by means of some communication facility, which could be a switching mechanism, a ring, or some other networking facility. Each node in the CC-NUMA system includes some main memory.
Memory: From the point of view of the processors, there is only a single addressable memory, with each location having a unique system-wide address.
When a processor initiates a memory access, if the requested memory location is not in the processor's cache, then the L2 cache initiates a fetch operation. If the desired line is in the local portion of main memory, the line is fetched across the local bus. If the desired line is in a remote portion of main memory, an automatic request is sent out to fetch that line across the interconnection network, deliver it to the local bus, and then deliver it to the requesting cache on that bus.
All of this activity is automatic and transparent to the processor and its cache.
In this configuration, cache coherence is a central concern, so each node must maintain some sort of directory that gives it an indication of the location of various portions of memory and also cache status information.
============================================================================

3. Write short notes on the different organisations in SMP?

Ans: The organization of a multiprocessor system can be classified as follows:
Time shared or common bus
Multiport memory
Central control unit
Time shared Bus:
The time shared bus is the simplest mechanism for constructing a multiprocessor system. The bus consists of control, address and data lines. The block diagram is shown in the figure below.
The following features are provided in time-shared bus organization:

Addressing: It must be possible to distinguish modules on the bus to determine the source and destination
of data
Arbitration: Any I/O module can temporarily function as master. A mechanism is provided to arbitrate
competing request for bus control, using some sort of priority scheme.

Time sharing: When one module is controlling the bus, other modules are locked out and must, if necessary, suspend operation until bus access is achieved.
The bus organization has several advantages compared with other approaches:

Simplicity: This is the simplest approach to multiprocessor organization. The physical interface and
the addressing, arbitration and time sharing logic of each processor remain the same as in a single
processor system.
Flexibility: It is generally easy to expand the system by attaching more processor to the bus.
Reliability: The bus is essentially a passive medium and the failure of any attached device should
not cause failure of the whole system.

The main drawback of the bus organization is performance: all memory references pass through the common bus, so the speed of the system is limited by the bus cycle time. To improve performance, each processor can be equipped with a local cache memory.
Multiport Memory:

The multiport memory approach allows direct, independent access to the main memory modules by each processor and I/O module. The multiport memory system is shown in Figure 10.3 (Multiport memory).

The multiport memory approach is more complex than the bus approach, requiring a fair amount of logic to be added to the memory system. Logic associated with the memory is required for resolving conflicts. The method often used to resolve conflicts is to assign permanently designated priorities to each memory port.
============================================================================
4. Explain multi core organization and its processors? or Explain the implementation of multicore
organisation?
Ans: At the top level of description, the main variables in a multicore organization are as follows:
The number of core processors on the chip.
The number of levels of cache memory.
The amount of cache memory that is shared.

Dedicated L1 cache:
Figure 18.8a is an organization found in some of the earlier multicore computer chips and is still seen in embedded chips. In this organization, the only on-chip cache is L1 cache, with each core having its own dedicated L1 cache. Almost invariably, the L1 cache is divided into instruction and data caches. An example of this organization is the ARM11 MPCore.
Dedicated L2 cache:
Figure 18.8b is also an organization with no on-chip cache sharing; here there is enough area available on the chip to allow for a dedicated L2 cache per core. An example of this organization is the AMD Opteron.
Shared L2 cache:
Figure 18.8c shows a similar allocation of chip space to memory, but with the use of a shared L2 cache. The Intel Core Duo has this organization.
Shared L3 cache:
In Figure 18.8d, as the amount of cache memory available on the chip continues to grow, performance considerations dictate splitting off a separate, shared L3 cache, with dedicated L1 and L2 caches for each core processor. The Intel Core i7 is an example of this organization (shared L3 cache).
The use of a shared higher-level cache on the chip has several potential advantages over exclusive reliance on dedicated caches:
1. Constructive interference can reduce overall miss rates. That is, if a thread on one core accesses a main memory location, this brings the frame containing the referenced location into the shared cache; if a thread on another core soon thereafter accesses the same memory block, the memory locations will already be available in the shared on-chip cache.
2. A related advantage is that data shared by multiple cores is not replicated at the shared cache level.
3. With proper frame replacement algorithms, the amount of shared cache allocated to each core is dynamic.
4. Interprocessor communication is easy to implement, via shared memory locations.
A potential advantage of having only dedicated L2 caches on the chip is that each core enjoys more rapid access to its private L2 cache. This is advantageous for threads that exhibit strong locality.

Intel Core Duo superscalar cores:

The Intel Core Duo implements two x86 superscalar processors with a shared L2 cache (fig 18.8c). The general structure of the Intel Core Duo is shown in Figure 18.9. Let us consider the key elements starting from the top of the figure.
As is common in multicore systems, each core has its own dedicated L1 cache; in this case, each core has a 32-KB instruction cache and a 32-KB data cache.
Each core has an independent thermal control unit. With the high transistor density of today's chips, thermal management is a fundamental capability, especially for laptop and mobile systems.
The Core Duo thermal control unit is designed to manage chip heat dissipation to maximize processor performance within thermal constraints. Thermal management also improves ergonomics, with a cooler system and lower fan acoustic noise.
In essence, the thermal management unit monitors digital sensors for high-accuracy die temperature measurements. Each core can be defined as an independent thermal zone. The maximum temperature for each thermal zone is reported separately via dedicated registers that can be polled by software.
If the temperature in a core exceeds a threshold, the thermal control unit reduces the clock rate for that core to reduce heat generation.
Advanced Programmable Interrupt Controller (APIC): The APIC performs a number of functions, including the following:
1. The APIC can provide interprocessor interrupts, which allow any process to interrupt any other processor or set of processors. In the case of the Core Duo, a thread in one core can generate an interrupt, which is accepted by the local APIC, routed to the APIC of the other core, and communicated as an interrupt to the other core.
2. The APIC accepts I/O interrupts and routes these to the appropriate core.
3. Each APIC includes a timer, which can be set by the OS to generate an interrupt to the local core.
The power management logic is responsible for reducing power consumption when possible, thus
increasing battery life for mobile platforms, such as laptops. In essence, the power management logic
monitors thermal conditions and CPU activity and adjusts voltage levels and power consumption
appropriately. It includes an advanced power-gating capability that allows for an ultra fine-grained logic
control that turns on individual processor logic subsystems only if and when they are needed.
The Core Duo chip includes a shared 2-MB L2 cache. The cache logic allows for a dynamic
allocation of cache space based on current core needs, so that one core can be assigned up to 100% of the L2
cache.

A cache line gets the M (modified) state when a processor writes to it. If the line is not in the E or M state prior to the write, the cache sends a Read-For-Ownership (RFO) request, which ensures that the line exists in this L1 cache and is in the I (invalid) state in the other L1 cache.
When a core issues an RFO, if the line is shared only by the other cache within the local die, the RFO can be resolved internally very quickly, without going to the external bus at all. Only if the line is shared with another agent on the external bus does the RFO need to be issued externally.
The bus interface connects to the external bus, known as the Front Side Bus, which connects to main memory, I/O controllers, and other processor chips.
===========================================================================

UNIT V (I/O Systems)


1. Write short notes on I/O modules?
Input/Output Organization
The computer system's input/output (I/O) architecture is its interface to the outside world. We have already discussed two important modules of the computer system: the processor and the memory module. The third key component of a computer system is a set of I/O modules. Each I/O module interfaces to the system bus and controls one or more peripheral devices.
There are several reasons why an I/O device or peripheral device is not directly connected to the system bus. Some of them are as follows:
There is a wide variety of peripherals with various methods of operation. It would be impractical to incorporate the necessary logic within the processor to control several devices.
The data transfer rate of peripherals is often much slower than that of the memory or processor. Thus, it is impractical to use the high-speed system bus to communicate directly with a peripheral.
Peripherals often use different data formats and word lengths than the computer to which they are attached.
Thus, an I/O module is required.
Input/Output Modules
The major functions of an I/O module are categorized as follows
Control and timing
Processor Communication
Device Communication
Data Buffering
Error Detection
During any period of time, the processor may communicate with one or more external devices in an unpredictable manner, depending on the program's need for I/O. The internal resources, such as main memory and the system bus, must be shared among a number of activities, including data I/O.
Control & timing: The I/O function includes a control and timing requirement, to coordinate the flow of traffic between internal resources and external devices. For example, the control of the transfer of data from an external device to the processor might involve the following sequence of steps:
1. The processor interacts with the I/O module to check the status of the attached device.
2. The I/O module returns the device status.
3. If the device is operational and ready to transmit, the processor requests the transfer of data,
by means of a command to the I/O module.
4. The I/O module obtains a unit of data from external device.
5. The data are transferred from the I/O module to the processor.
If the system employs a bus, then each of the interactions between the processor and the I/O module
involves one or more bus arbitrations.
Processor & Device Communication: During an I/O operation, the I/O module must communicate with the processor and with the external device. Processor communication involves the following:

Command decoding: The I/O module accepts commands from the processor, typically sent as signals on the control bus.
Data: Data are exchanged between the processor and the I/O module over the data bus.
Status reporting: Because peripherals are so slow, it is important to know the status of the I/O module. For example, if an I/O module is asked to send data to the processor (read), it may not be ready to do so because it is still working on the previous I/O command. This fact can be reported with a status signal. Common status signals are BUSY and READY.

Address recognition: Just as each word of memory has an address, so does each I/O device. Thus an I/O module must recognize one unique address for each peripheral it controls.
On the other hand, the I/O module must be able to perform device communication. This communication involves commands, status information and data.
Data Buffering: An essential task of an I/O module is data buffering. Buffering is required due to the mismatch between the speeds of the CPU, memory and the peripheral devices. In general, the speed of the CPU is higher than that of the peripheral devices, so the I/O module stores data in a buffer and regulates the transfer of data to match the speed of the devices.
In the opposite direction, data are buffered so as not to tie up memory in a slow transfer operation. Thus the I/O module must be able to operate at both device and memory speeds.
Error Detection: Another task of the I/O module is error detection and subsequently reporting errors to the processor. One class of error includes mechanical and electrical malfunctions reported by the device (e.g. a paper jam). Another class consists of unintentional changes to the bit pattern as it is transmitted from the device to the I/O module.
Block diagram of I/O Module is shown in the Figure 6.1.

Figure 6.1: Block diagram of I/O Module.


=============================================================================

2. Write short notes on how I/O devices are mapped to memory?


There may be many I/O devices connected through I/O modules to the system. Each device is identified by a unique address.
When the processor issues an I/O command, the command contains the address of the device used by the command. Each I/O module must interpret the address lines to check whether the command is for itself.
Generally, in most processors, the processor, main memory and I/O share a common bus (data, address and control buses).
Two types of addressing are possible:
Memory-mapped I/O
Isolated or I/O-mapped I/O
Memory-mapped I/O:
There is a single address space for memory locations and I/O devices. The processor treats the status and address registers of the I/O modules as memory locations.
For example, if the size of the address bus of a processor is 16, then there are 2^16 combinations, and altogether 2^16 address locations can be addressed with these 16 address lines. Out of these 2^16 address locations, some can be used to address I/O devices, while the other locations are used to address memory locations.

Since I/O devices are included in the same memory address space, the status and address registers of the I/O modules are treated as memory locations by the processor. Therefore, the same machine instructions are used to access both memory and I/O devices.
Isolated or I/O-mapped I/O:
In this scheme, the full range of addresses may be available for both. Whether an address refers to a memory location or to an I/O device is specified with the help of a command line.
In general, a command line (often called IO/M) is used to identify a memory location or an I/O device:
If the line = 1, the address present on the address bus is the address of an I/O device (i.e., IO = 1 and M = 0).
If the line = 0, the address present on the address bus is the address of a memory location (i.e., IO = 0 and M = 1).
Since the full range of addresses is available for both memory and I/O devices, with 16 address lines the system may now support both 2^16 memory locations and 2^16 I/O addresses.
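In C terms, memory-mapped I/O means a device register can be touched with an ordinary load or store through a pointer, while isolated I/O needs special instructions (such as the x86 IN/OUT) that plain C cannot express portably. A minimal sketch, with an invented device address and register layout:

    #include <stdint.h>

    /* Hypothetical device registers mapped into the single address space. */
    #define DEV_STATUS (*(volatile uint32_t *)0xFFFF0000u)
    #define DEV_DATA   (*(volatile uint32_t *)0xFFFF0004u)

    uint32_t read_device_status(void) {
        return DEV_STATUS;   /* an ordinary load instruction reaches the device */
    }

    void write_device_data(uint32_t value) {
        DEV_DATA = value;    /* an ordinary store instruction reaches the device */
    }

The volatile qualifier tells the compiler that every access matters, since the "memory" here is really a device register.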
(Note: If I/O module is asked for 16 marks try to combine both 1st and 2nd question content)
====================================================================
3. Brief about I/O subsystems or different I/O systems?
There are three basic forms of input and output:

Programmed I/O
Interrupt-driven I/O
Direct Memory Access (DMA)

With programmed I/O, the processor executes a program that gives it direct control of the I/O operation, including sensing device status, sending a read or write command, and transferring the data.
With interrupt-driven I/O, the processor issues an I/O command, continues to execute other instructions, and is interrupted by the I/O module when the I/O module completes its work.
With Direct Memory Access (DMA), the I/O module and main memory exchange data directly, without processor involvement.
(Diagram: the memory address space and the special I/O device address space.)
With both programmed I/O and interrupt-driven I/O, the processor is responsible for extracting data from main memory for an output operation and storing data in main memory for an input operation. To send data to an output device, the CPU simply moves that data to a special memory location in the I/O address space if I/O-mapped input/output is used, or to an address in the memory address space if memory-mapped I/O is used.
To read data from an input device, the CPU simply moves data from the address (I/O or memory) of that device into the CPU.
Input/Output Operation: An input or output operation looks very similar to a memory read or write operation, except that it usually takes more time, since peripheral devices are slower than main memory modules. The working principle of the three methods for input of a block of data is shown in Figure 6.2.
Fig 6.2 Working of three techniques for input of block of data.

================================================================
4. Write short notes on programmed I/O? (use flowchart from Q. No 3, fig a)
In programmed I/O, the processor executes a program that gives it direct control of the I/O operation, including sensing device status, sending a read or write command, and transferring the data.
Input/Output Port: An I/O port is a device that looks like a memory cell to the computer but contains connections to the outside world. An I/O port typically uses a latch. When the CPU writes to the address associated with the latch, the latch captures the data and makes it available on a set of wires external to the CPU and memory system. I/O ports can be read-only, write-only, or read/write. The write-only port is shown in Figure 6.3.

Figure 6.3: The write only port


First, the CPU places the address of the device on the I/O address bus, and with the help of an address decoder a signal is generated which enables the latch. Next, the CPU indicates that the operation is a write by asserting the appropriate signal on the CPU write control line. The data to be transferred are then placed on the CPU bus and stored in the latch for onward transmission to the device. Both the address decode and write control lines must be active for the latch to operate.
The read/write or input/output port is shown in Figure 6.4. The device is identified by putting the appropriate address on the I/O address lines; the address decoder generates the signal for the address decode lines. According to the operation, read or write, either of the two latches is selected. If it is a write operation, data are placed in the latch from the CPU for onward transmission to the output device. If it is a read operation, the data already stored in the latch are transferred to the CPU.
A read-only (input) port is simply the lower half of Figure 6.4.
Figure 6.4: Read / Write port
In the case of I/O-mapped I/O, a separate address space is used for I/O devices, distinct from the memory address space. In the case of memory-mapped I/O, the same address space is used for both memory and I/O devices; some of the memory address space is kept reserved for I/O devices.
The difference between I/O-mapped and memory-mapped input/output lies in the instructions used: for memory-mapped I/O, any instruction that accesses memory can access a memory-mapped I/O port, while I/O-mapped input/output uses special instructions to access I/O ports.
Generally, a given peripheral device will use more than a single I/O port. For example, for a printer, the input port returns control signals from the printer; these signals indicate whether the printer is ready to accept another character, is off-line, is out of paper, etc. The output port transmits control information to the printer, such as whether data is available to print.
In programmed I/O, the data transfer between the CPU and an I/O device is carried out with the help of a software routine. When a processor is executing a program and encounters an instruction relating to I/O, it executes that instruction by issuing a command to the appropriate I/O module. The following steps are followed while executing the I/O operation (a polling sketch is given after this list):

The I/O module performs the requested action and then sets the appropriate bits in the I/O status register.
The I/O module takes no further action to alert the processor.
It is the responsibility of the processor to check the status of the I/O module periodically until it finds that the operation is complete.
When the processor issues a command to an I/O module, it must wait until the I/O operation is complete.
The I/O devices are slower than the processor, so in this scheme CPU time is wasted: the CPU keeps checking the status of the I/O module periodically without doing any other work.
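The busy-wait behaviour can be sketched in C as below; the register addresses and the READY bit are hypothetical, standing in for the I/O status register just described.

    #include <stdint.h>

    /* Hypothetical memory-mapped status/data registers and "ready" bit. */
    #define STATUS (*(volatile uint32_t *)0xFFFF0000u)
    #define DATA   (*(volatile uint32_t *)0xFFFF0004u)
    #define READY  0x1u

    /* Read n words with programmed I/O: the CPU does nothing but poll. */
    void pio_read(uint32_t *buf, int n) {
        for (int i = 0; i < n; i++) {
            while (!(STATUS & READY))
                ;                  /* busy-wait: this is where CPU time is wasted */
            buf[i] = DATA;         /* each word passes through the CPU */
        }
    }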

I/O Commands
To execute an I/O-related instruction, the processor issues an address, specifying the particular I/O module and external device, and an I/O command. There are four types of I/O commands that an I/O module may receive when it is addressed by a processor:

Control: Used to activate a peripheral device and instruct it what to do. For example, a magnetic-tape unit may be instructed to rewind or to move forward one record. These commands are specific to a particular type of peripheral device.

Test: Used to test various status conditions associated with an I/O module and its peripherals. The processor will want to know whether the most recent I/O operation has completed or whether any error has occurred.

Read: Causes the I/O module to obtain an item of data from the peripheral and place it in an internal buffer.

Write: Causes the I/O module to take an item of data (byte or word) from the data bus and subsequently transmit the data item to the peripheral.

================================================================
5. Write short notes on interrupt-driven I/O? (use flowchart from Q. No 3, fig b)
The problem with programmed I/O is that the processor has to wait a long time for the I/O module of concern to be ready for either reception or transmission of data. The processor, while waiting, must repeatedly interrogate the status of the I/O module.
The solution to this problem is to provide an interrupt mechanism. In this approach the processor issues an I/O command to a module and then goes on to do some other useful work. The I/O module then interrupts the processor to request service when it is ready to exchange data with the processor. The processor then executes the data transfer, and once the data transfer is over, it resumes its former processing.
I/O module point of view

For input, the I/O module receives a READ command from the processor.

The I/O module then proceeds to read data from an associated peripheral device.

Once the data are in the module's data register, the module issues an interrupt to the processor over a control line.

The module then waits until its data are requested by the processor.

When the request is made, the module places its data on the data bus and is then ready for another
I/O operation.
Processor point of view

The processor issues a READ command.

It then does something else (e.g. the processor may be working on several different programs at the
same time)

At the end of each instruction cycle, the processor checks for interrupts.

When the interrupt from an I/O module occurs, the processor saves the context (e.g. program counter
& processor registers) of the current program and processes the interrupt.

In this case, the processor reads the word of data from the I/O module and stores it in memory.

It then restores the context of the program it was working on and resumes execution.

Interrupt Processing
The occurrence of an interrupt triggers a number of events, both in the processor hardware and in software. When an I/O device completes an I/O operation, the following sequence of hardware events occurs:
1. The device issues an interrupt signal to the processor.
2. The processor finishes execution of the current instruction before responding to the interrupt.
3. The processor tests for the interrupt; if there is an interrupt pending, the processor sends an acknowledgement signal to the device that issued the interrupt. After getting the acknowledgement, the device removes its interrupt signal.
4. The processor now needs to prepare to transfer control to the interrupt routine. It needs to save the information required to resume the current program at the point of interrupt. The minimum information to save is the processor status word (PSW) and the location of the next instruction to be executed, which is nothing but the contents of the program counter. These can be pushed onto the system control stack.
5. The processor now loads the program counter with the entry location of the interrupt handling
program that will respond to the interrupt.
================================================================
6. Explain Interrupt processing in detail?
Interrupt Processing (instruction level)
1. An interrupt occurs while the processor is executing the instruction at location N (the current instruction).
2. At that point, the value of the program counter is N+1 (the next instruction).
3. The processor services the interrupt after completing execution of the current instruction (N).
4. First, it moves the contents of the general registers to the system stack.
5. Then it moves the program counter value (N+1) to the system stack.
6. The top of the system stack is maintained by the stack pointer.
7. The value of the stack pointer is modified to point to the top of the stack.
8. If M elements are moved to the system stack, the value of the stack pointer changes from T to T-M.

9. Next, the program counter is loaded with the starting address of the interrupt service routine.
10. The processor starts executing the interrupt service routine.

Figure 6.5: Changes of memory and register for an interrupt.


Return from Interrupt (Instruction Level):
1. The interrupt service routine starts at location X and the return instruction is at location X + L.
2. After fetching the return instruction, the value of the program counter becomes X + L + 1.
3. While returning to the user's program, the processor must restore the earlier values.
4. From the control stack, it restores the value of the program counter and the general registers.
5. Accordingly it resets the value of the top of the stack, and the stack pointer is updated.
6. Now the processor resumes execution of the user's (interrupted) program from memory location N + 1.

Figure 6.6: Return from interrupt


Once the program counter has been loaded, the processor proceeds to the next instruction cycle, which begins with an instruction fetch. Control transfers to the interrupt handler routine for the current interrupt. The following operations are performed at this point:
1. At this point, the program counter and PSW relating to the interrupted program have been saved on the system stack. In addition, some more information must be saved about the current processor state, including the contents of the processor registers, because these registers may be used by the interrupt handler. Typically, the interrupt handler will begin by saving the contents of all registers on the stack.
2. The interrupt handler next processes the interrupt. This includes an examination of status information relating to the I/O operation, or other event, that caused the interrupt.
3. When interrupt processing is complete, the saved register values are retrieved from the stack and restored to the registers.
4. The final act is to restore the PSW and program counter values from the stack. As a result, the next instruction to be executed will be from the previously interrupted program.
===============================================================
Note: If Interrupt driven I/O asked in 16 marks question write Qno 5 and write important
point from Q.No 6
7. Discuss the design issues in implementing interrupt I/O and the methods to solve them.

Two design issues arise in implementing interrupt I/O:
1. With multiple I/O modules, the processor needs to identify which device issued the interrupt.
2. If multiple interrupts have occurred, the processor needs to decide which one to process first.

Device Identification (for issue 1, multiple I/O modules)

Four general categories of techniques are in common use:
Multiple interrupt lines
Software poll
Daisy chain (hardware poll, vectored)
Bus arbitration (vectored)

Multiple Interrupt Lines:

The most straightforward approach is to provide multiple interrupt lines between the processor and the I/O modules. However, it is impractical to dedicate more than a few bus lines or processor pins to interrupt lines. Thus, even though multiple interrupt lines are used, it is likely that each line will have multiple I/O modules attached to it, so one of the other three techniques must be used on each line.

Software Poll

When the processor detects an interrupt, it branches to an interrupt service routine whose job is to poll each I/O module to determine which module caused the interrupt.
The poll could be implemented with the help of a separate command line (e.g. TEST I/O): the processor raises TEST I/O and places the address of a particular I/O module on the address lines, and the I/O module responds positively if it set the interrupt.
Alternatively, each I/O module could contain an addressable status register; the processor then reads the status register of each I/O module to identify the interrupting module.
Once the correct module is identified, the processor branches to a device service routine specific to that device (see the sketch below).
The main disadvantage of software polling is that it is time consuming: the processor has to check the status of each I/O module, and in the worst case this equals the number of I/O modules.
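A behavioural sketch of the general interrupt service routine for software polling; the helper names are invented, and in a real system they would read each module's addressable status register and run its device-specific routine.

    #define NUM_MODULES 8

    /* Hypothetical stubs standing in for real status reads and service code. */
    static int  module_status(int m)  { (void)m; return 0; }
    static void module_service(int m) { (void)m; }

    /* Poll in a fixed order: the order is also the priority order,
       and the worst case checks every module. */
    void general_isr(void) {
        for (int m = 0; m < NUM_MODULES; m++) {
            if (module_status(m)) {
                module_service(m);
                return;
            }
        }
    }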

Daisy Chain:

In this method, all I/O modules share a common interrupt request line, while the interrupt acknowledge line is connected in a daisy-chain fashion. When the processor senses an interrupt, it sends out an interrupt acknowledgement.
The interrupt acknowledge signal propagates through the series of I/O modules until it reaches a requesting module.
The requesting module typically responds by placing a word on the data lines. This word is referred to as a vector and is either the address of the I/O module or some other unique identification.
In either case, the processor uses the vector as a pointer to the appropriate device service routine. This avoids the need to execute a general interrupt service routine first; the technique is referred to as a vectored interrupt. The daisy chain arrangement is shown in Figure 6.7.

Figure 6.7: Daisy chain arrangement

where INTR - Interrupt request, INTA - Interrupt acknowledge.


Bus Arbitration:
In the bus arbitration method, an I/O module must first gain control of the bus before it can raise the interrupt request line. Thus, only one module can raise the line at a time. When the processor detects the interrupt, it responds on the interrupt acknowledge line. The requesting module then places its vector on the data lines.
Handling multiple interrupts (for issue 2, multiple interrupts)
There are several techniques to identify the requesting I/O module. These techniques also provide a way of assigning priorities when more than one device is requesting interrupt service.
With multiple lines, the processor just picks the interrupt line with the highest priority; priorities can be assigned to the interrupt lines during the processor design phase itself.
With software polling, the order in which the modules are polled determines their priority.
In the daisy chain configuration, the priority of a module is determined by its position in the chain: the module nearer to the processor has higher priority, because it is the first to receive the acknowledge signal generated by the processor.
In the bus arbitration method, more than one module may need control of the bus. Since only one module at a time can successfully transmit over the bus, some method of arbitration is needed. The various methods can be classified into two groups: centralized and distributed.
In a centralized scheme, a single hardware device, referred to as a bus controller or arbiter, is responsible for allocating time on the bus. The device may be a separate module or part of the processor.
In a distributed scheme, there is no central controller. Rather, each module contains access control logic, and the modules act together to share the bus.
It is also possible to combine different device identification techniques both to identify the devices and to set their priorities. For example, multiple interrupt lines and daisy chaining can be combined to give access to more devices: on one interrupt line, more than one device can be connected in daisy-chain fashion, and the high-priority devices should be connected to the interrupt lines that have higher priority.
A possible arrangement is shown in Figure 6.8.

Figure 6.8 Possible arrangement to handle multiple interrupt.


Interrupt Nesting
The arrival of an interrupt request from an external device causes the processor to suspend the execution of one program and start the execution of another: the interrupt service routine for that device.
Interrupts may arrive at any time, so during the execution of an interrupt service routine another interrupt may arrive. This is known as nesting of interrupts.
Generally, nesting of interrupts is allowed, but with some restrictions: a high-priority device may interrupt a low-priority device, but not vice versa. The processor provides instructions to enable and disable interrupts; if interrupts are disabled, the CPU will not respond to any interrupt signal. When multiple lines are used for interrupts and priorities are assigned to these lines, an interrupt received on a low-priority line will not be served while an interrupt routine is executing for a high-priority device (low priority has to wait).
===========================================================================
8. Write short notes on DMA? (use flowchart from Q. No 3, fig c)
Programmed I/O and interrupt-driven I/O require the active intervention of the processor to transfer data between memory and the I/O module, and any data transfer must traverse a path through the processor. Thus both forms of I/O suffer from two inherent drawbacks:

The I/O transfer rate is limited by the speed with which the processor can test and service a device.
The processor is tied up in managing the I/O transfer; a number of instructions must be executed for each I/O transfer.

To transfer large blocks of data at high speed, a special control unit may be provided that allows a block of data to be transferred directly between an external device and main memory, without continuous intervention by the processor. This approach is called direct memory access, or DMA. DMA transfers are performed by a control circuit associated with the I/O device, referred to as the DMA controller. The DMA controller allows direct data transfer between the device and main memory without involving the processor.
To transfer data between memory and I/O devices, the DMA controller takes over control of the system from the processor, and the transfer of data takes place over the system bus. For this purpose, the DMA controller must either use the bus only when the processor does not need it, or force the processor to suspend operation temporarily. The latter technique is more common and is referred to as cycle stealing, because the DMA module in effect steals a bus cycle.

The typical block diagram of a DMA controller is shown in Figure 6.9. When the processor wishes to read or write a block of data, it issues a command to the DMA module, sending the following information (see the programming sketch after the figure):

Whether a read or write is requested, using the read or write control line between the processor and the DMA module.
The address of the I/O device involved, communicated on the data lines.
The starting location in memory to read from or write to, communicated on the data lines and stored by the DMA module in its address register.
The number of words to be read or written, again communicated via the data lines and stored in the data count register.

Fig 6.9: Typical DMA block diagram

The processor then continues with other work; it has delegated this I/O operation to the DMA
module. The DMA module checks the status of the I/O device whose address was communicated to it by the
processor. If the specified I/O device is ready for data transfer, the DMA module raises a DMA request to
the processor. The processor then indicates the release of the system bus through DMA acknowledge.
The DMA module transfers the entire block of data, one word at a time, directly to or from memory,
without going through the processor. When the transfer is completed, the DMA module sends an interrupt
signal to the processor. After receiving the interrupt signal, the processor takes back the system bus. Thus the
processor is involved only at the beginning and end of the transfer; in between, it is suspended whenever a
bus cycle is stolen.
The processor need not complete the current instruction before being suspended; it may be
suspended just after the completion of the current bus cycle. In effect, the processor is suspended just
before it needs the system bus, because the DMA controller uses only the system bus, not the processor
itself. The points in the instruction cycle where the processor may be suspended are shown in Figure 6.10.

Figure 6.10: DMA break points


When the processor is suspended, the DMA module transfers one word and returns control to the
processor. Note that this is not an interrupt: the processor does not save a context and do something else.
Rather, the processor pauses for one bus cycle. During that time the processor may perform some other
task that does not involve the system bus; in the worst case, it simply waits until the DMA controller
releases the bus.

The net effect is that the processor executes somewhat more slowly. Overall, however, performance
is enhanced, because for a multiple-word I/O transfer DMA is far more efficient than interrupt-driven
or programmed I/O. A sketch of how a processor might set up such a transfer follows.
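
As a minimal sketch of the setup sequence just described, the C fragment below programs a hypothetical memory-mapped DMA controller modelled on Figure 6.9. The register layout, base address and control-bit assignments are all assumptions made for illustration, not the interface of any real controller.

    #include <stdint.h>

    /* Hypothetical register layout of the DMA controller of Fig. 6.9. */
    typedef struct {
        volatile uint32_t control;      /* bit 0: start; bit 1: 0 = read, 1 = write */
        volatile uint32_t device_addr;  /* address of the I/O device                */
        volatile uint32_t mem_addr;     /* starting main-memory location            */
        volatile uint32_t word_count;   /* number of words to transfer              */
        volatile uint32_t status;       /* bit 0: done (controller then interrupts) */
    } dma_regs_t;

    #define DMA ((dma_regs_t *)0xFFFF0000u)   /* assumed memory-mapped base address */

    /* Delegate a block read from a device into memory and return at once;
       the DMA module interrupts the processor when the whole block has moved. */
    void dma_read_block(uint32_t dev, uint32_t mem, uint32_t nwords)
    {
        DMA->device_addr = dev;     /* which device                        */
        DMA->mem_addr    = mem;     /* where in memory                     */
        DMA->word_count  = nwords;  /* how many words                      */
        DMA->control     = 0x1u;    /* start a read; the processor is free */
    }

After dma_read_block() returns, the processor runs other code; completion is signalled by the controller's interrupt, exactly as described above.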
==========================================================================
9. Write short notes on different DMA mechanism and its configuration?
(Note: if DMA is asked for 16 marks combine both Q.No 8 and Q.No 9 put points
according to the question)
The DMA mechanism can be configured in different ways. The most common amongst them are:

Single bus, detached DMA - I/O configuration.

Single bus, Integrated DMA - I/O configuration.

Using separate I/O bus.

Single bus, detached DMA - I/O configuration:

In this organization all modules share the same system bus. The DMA module here acts as a
surrogate processor: it uses programmed I/O to exchange data between memory and an I/O module through
itself. Each transfer uses the bus twice, once to move the data between the I/O module and the DMA
module, and once to move it between the DMA module and memory. Since the bus is used twice per
transfer, the processor may be suspended twice; the transfer consumes two bus cycles.
The interconnection organization is shown in Figure 6.11.

Figure 6.11: Single bus arrangement for DMA transfer


Single bus, Integrated DMA - I/O configuration:
By integrating the DMA and I/O functions, the number of required bus cycles can be reduced. In
this configuration, the DMA module and one or more I/O modules are integrated in such a way that the
transfer between them does not involve the system bus. The DMA logic may actually be a part of an I/O
module, or it may be a separate module that controls one or more I/O modules.
The DMA module, the processor and the memory module are connected through the system bus. In this
configuration each transfer uses the system bus only once, so the processor is suspended only once: the
system bus is not involved when data is transferred between the DMA module and the I/O device, so the
processor is not suspended then; it is suspended only when data is transferred between the DMA module
and memory.
The configuration is shown in Figure 6.12.

Figure 6.12: Single bus integrated DMA transfer


Using separate I/O bus:
In this configuration the I/O modules are connected to the DMA module through a separate I/O bus,
so the number of DMA modules is reduced to one. Transfer of data between an I/O module and the DMA
module is carried out over this I/O bus; the system bus is not in use, so the processor need not be
suspended. In the other transfer phase, between the DMA module and memory, the system bus is needed
and the processor is suspended for one bus cycle.
The configuration is shown in Figure 6.13.

Figure 6.13: Separate I/O bus for DMA transfer


============================================================================

UNIT V (MEMORY)
1. Write short notes on Memories?
Digital computers work on the stored-program concept introduced by Von Neumann. We use
memory to store information, which includes both program and data. For several reasons we have
different kinds of memory, used at different levels. The memory of a computer is broadly divided into
two categories:
1. Internal and
2. External
Internal memory is used by the CPU to perform its tasks, while external memory is used to store bulk
information, including large software and data. Memory stores information in digital form. The memory
hierarchy is given by:

Register
Cache Memory
Main Memory
Magnetic Disk
Removable media (Magnetic tape)

Register: Registers are part of the Central Processing Unit, so they reside inside the CPU. Information from
main memory is brought to the CPU and kept in registers. Due to space and cost constraints, a CPU has
only a limited number of registers. These are the fastest storage devices.
Cache Memory: Cache memory is a storage device placed between the CPU and main memory. Caches are
semiconductor memories, faster than main memory. We cannot have a large volume of cache memory due
to its higher cost and some constraints of the CPU; for the same cost reason we cannot replace the whole
main memory with faster memory. Generally, the most recently used information is kept in the cache
memory, brought there from the main memory. Nowadays, CPUs come with internal caches.
Main Memory: Like cache memory, main memory is also semiconductor memory, but it is relatively
slower. We have to first bring information (whether data or program) into main memory: the CPU can
work only with information available in main memory.
Magnetic Disk: This is a bulk storage device. Many applications deal with huge amounts of data, and we
do not have enough semiconductor memory to keep all of this information in the computer. Moreover,
semiconductor memories are volatile in nature: they lose their contents once the computer is switched off.
For permanent storage we use magnetic disks, whose storage capacity is very high.
Removable media: Different applications use different data, and it may not be possible to keep all the
information on magnetic disk. Whatever data is not currently in use can be kept on removable media.
Magnetic tape is one kind of removable medium; a CD is another, which is an optical device.
Registers, cache memory and main memory are internal memory; magnetic disks and removable media
are external memory. Internal memories are semiconductor memories, which are categorized as volatile
and non-volatile.
RAM: Random Access Memories are volatile in nature. As soon as the computer is switched off,
the contents of memory are lost.
ROM: Read Only Memories are non-volatile in nature. The storage is permanent, but it is read-only
memory: we cannot store new information in ROM.
Several types of ROM are available:
PROM: Programmable Read Only Memory; it can be programmed once, as per user
requirements.
EPROM: Erasable Programmable Read Only Memory; the contents of the memory can be erased
and new data stored into the memory. In this case, the whole contents must be erased at once.
EEPROM: Electrically Erasable Programmable Read Only Memory; in this type of memory the
contents of a particular location can be changed without affecting the contents of other locations.
Main Memory: The main memory of a computer is semiconductor memory. It basically consists of two
kinds of memory:
RAM : Random access memory, which is volatile in nature.
ROM : Read only memory, which is non-volatile.
Permanent information is kept in ROM, and the user space is basically in RAM. The smallest unit of
information is the bit (binary digit); one memory cell stores one bit of information, and 8 bits together are
termed a byte. The maximum size of main memory that can be used in any computer is determined by the
addressing scheme.
A computer that generates 16-bit addresses is capable of addressing up to 2^16 = 64K memory
locations. Similarly, with 32-bit addresses, the total capacity will be 2^32 = 4G memory locations. In some
computers, the smallest addressable unit of information is a memory word, and the machine is called
word-addressable. In others, an individual address is assigned to each byte of information, and the
machine is called a byte-addressable computer. In such a computer, one memory word contains one
or more memory bytes which can be addressed individually.
In a byte-addressable 32-bit computer, each memory word contains 4 bytes. A possible way of
address assignment is shown in Figure 3.1. The address of a word is always an integer multiple of 4.
The main memory is usually designed to store and retrieve data in word-length quantities. The word
length of a computer is generally defined by the number of bits actually stored or retrieved in one main
memory access.
Consider a machine with a 32-bit address bus. If the word size is 32 bits, then the high-order 30
bits specify the address of a word. If we want to access any byte of the word, it can be specified by the
lower two bits of the address bus.
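For instance, take byte address 0x2D (binary 101101): the upper 30 bits give the word address 0x2C (an
integer multiple of 4, as required), and the lower two bits, 01, select byte 1 within that word.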
The data transfer between main memory and the CPU
takes place through two CPU registers.
MAR : Memory Address Register
MDR : Memory Data Register.

Figure 3.1: Address assignment to a 4-byte word

If the MAR is k bits long, then the total number of addressable memory locations is 2^k. If the MDR is
n bits long, then n bits of data are transferred in one memory cycle. The transfer of data takes place
through the memory bus, which consists of the address bus and the data bus. In the above example, the
size of the data bus is n bits and the size of the address bus is k bits.
The memory bus also includes control lines like Read, Write and Memory Function Complete (MFC) for
coordinating data transfer. In a byte-addressable computer, another control line is added to indicate a
byte transfer instead of a whole word.
The CPU initiates a memory operation by loading the appropriate data, i.e., the address, into the MAR.
If it is a memory read operation, the CPU sets the read memory control line to 1. The contents
of the memory location are then brought to the MDR, and the memory control circuitry indicates this to
the CPU by setting MFC to 1.
If the operation is a memory write, the CPU places the data into the MDR and sets the
write memory control line to 1. Once the contents of the MDR are stored in the specified memory
location, the memory control circuitry indicates the end of the operation by setting MFC to 1. A toy
simulation of this handshake is sketched below.
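
The handshake just described can be mimicked in software. The C simulation below illustrates only the sequencing (load MAR, raise Read or Write, wait for MFC); the variables standing in for registers and control lines are toy stand-ins, not real hardware.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy stand-ins for the registers and control lines named above. */
    #define MEM_WORDS 1024
    static uint32_t memory[MEM_WORDS];
    static uint32_t MAR, MDR;        /* memory address / data registers */
    static int READ, WRITE, MFC;     /* control lines                   */

    static void memory_cycle(void)   /* what the memory control circuitry does */
    {
        if (READ)  MDR = memory[MAR % MEM_WORDS];
        if (WRITE) memory[MAR % MEM_WORDS] = MDR;
        MFC = 1;                     /* signal Memory Function Complete */
    }

    uint32_t cpu_read(uint32_t addr)
    {
        MAR = addr; READ = 1; WRITE = 0; MFC = 0;
        memory_cycle();              /* hardware would wait until MFC == 1 */
        READ = 0;
        return MDR;
    }

    void cpu_write(uint32_t addr, uint32_t data)
    {
        MAR = addr; MDR = data; WRITE = 1; READ = 0; MFC = 0;
        memory_cycle();
        WRITE = 0;
    }

    int main(void)
    {
        cpu_write(5, 0xABCDu);
        printf("word 5 = 0x%X\n", (unsigned)cpu_read(5));
        return 0;
    }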
==========================================================================
2. Brief about principle of locality? (6 marks, or add to any memory-related answer)
Principle of locality states that programs access a relatively small portion of their address space at
any instant of time.

Temporal locality (locality in time): If an item is referenced, it will tend to be referenced again
soon.
Example: If you recently brought a book to your desk to look at, you will probably
need to look at it again soon.
Spatial locality (locality in space): If an item is referenced, items whose addresses are close by
will tend to be referenced soon. Example: Libraries put books on the same topic together on
the same shelves to increase spatial locality.

Most programs contain loops, so instructions and data are likely to be accessed repeatedly, showing high
amounts of temporal locality.
Sequential accesses to elements of an array or a record will naturally have high degrees of spatial
locality.
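
Both kinds of locality appear in the classic example below: a simple C loop summing an array. This is a generic illustration, not tied to any particular machine.

    /* 'sum', 'i' and the loop instructions are reused on every iteration
       (temporal locality); a[0], a[1], ... are adjacent in memory, so each
       access is near the previous one (spatial locality). */
    double sum_array(const double *a, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }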
==========================================================================
3. Write short notes on Memory Hierarchy Levels?
The goal is to present the user with as much memory as is
available in the cheapest technology, while providing access at the
speed offered by the fastest memory. When we move away from the
processor, the levels take progressively longer to access. Data is
copied between only two adjacent levels at a time.
The upper level (the one closer to the processor) is smaller and faster. The figure shows that the
minimum unit of information that can be either present or not present in a two-level hierarchy is called a
block or a line.
Block (line): the unit of copying; it may be multiple words, a single word, or a byte.
If the accessed data is present in the upper level, the access is called a HIT. The hit rate, or hit ratio,
is the fraction of memory accesses found in the upper level:
Hit ratio = hits / accesses
If the accessed data is absent from the upper level, the access is called a MISS. The lower level in the
hierarchy is then accessed to retrieve the block containing the requested data. The miss rate (1 - hit rate)
is the fraction of memory accesses not found in the upper level:
Miss ratio = misses / accesses
The requested data is then supplied to the processor from the upper level.
Hit time is the time to access the upper level of the memory hierarchy, including the time needed to
determine whether the access is a hit or a miss.
The miss penalty is the time to replace a block in the upper level with the corresponding block from the
lower level, plus the time to deliver this block to the processor.
The hit time is much smaller than the time to access the next level in the hierarchy, which is the major
component of the miss penalty. These quantities combine into a simple estimate of average access time,
worked below.
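The standard combination of these quantities is the average memory access time (AMAT); the formula is
textbook-standard, while the figures below are assumed purely for illustration:
AMAT = hit time + miss rate × miss penalty
For example, with an assumed hit time of 1 clock cycle, a miss rate of 5%, and a miss penalty of 100
cycles: AMAT = 1 + 0.05 × 100 = 6 clock cycles per access.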
============================================================================
4. Discuss about the memory technologies that are implemented in computer storage systems.
Memory Technology
SRAM Technology

SRAMs are integrated circuits that are memory arrays with (usually) a single access port that can
provide either a read or a write.
SRAMs have a fixed access time, but read and write access times differ.
SRAMs need no refresh, so the access time is very close to the cycle time.
SRAMs typically use six to eight transistors per bit to prevent the information from being
disturbed when read.
Only minimal power is needed to retain the contents in standby mode.

DRAM Technology

In a DRAM, the value kept in a cell is stored as a charge in a capacitor.

A single transistor is used to access this stored charge (for both read and write).
Because DRAMs use only a single transistor per bit of storage, they are much denser and cheaper
per bit than SRAM.
Since DRAMs store the charge on a capacitor, it cannot be kept indefinitely and must periodically
be refreshed; hence the name dynamic.
To refresh a cell, its contents are read and written back; the charge can be kept for several
milliseconds.
DRAMs use a two-level decoding structure, which allows an entire row to be refreshed with a read
cycle followed immediately by a write cycle.
This row organization also lets the DRAM buffer a row for repeated access; the row buffer acts
like an SRAM.
To improve the interface to processors, DRAMs added clocks and are properly called
Synchronous DRAMs, or SDRAMs.
The advantage is that the use of a clock eliminates the time for the memory and processor to
synchronize.
The fastest version is called Double Data Rate (DDR) SDRAM: data transfers on both the rising
and falling edges of the clock, doubling the transfer rate. (The latest version is DDR4.)
SDRAMs use multiple banks, each with its own row buffer.
For example, with four banks there is just one access time per bank, and accesses rotate between
the four banks. This rotating access scheme is called address interleaving, illustrated below.
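As a small worked illustration (the bank count is assumed): with four banks, word address A is placed in
bank A mod 4, so sequential addresses 0, 1, 2, 3 fall in banks 0, 1, 2, 3. While bank 0 is still completing
its access, banks 1, 2 and 3 can already begin theirs, so consecutive accesses overlap instead of each
waiting one full access time.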

Flash Memory (a type of electrically erasable programmable read-only memory, EEPROM)

Writes can wear out flash memory bits.

To cope with this, most flash products include a controller that spreads the writes by remapping
blocks that have been written many times to less-trodden blocks. This technique is called wear leveling.
Wear leveling can reduce performance, but it can save hardware by extending the usable life of
the device.
The controller can also use the same mechanism to map out memory cells with manufacturing
defects.

Disk Memory

A magnetic hard disk consists of a collection of platters, which rotate on a spindle at 5400 to 15,000
revolutions per minute. Each platter is a metal disc with a magnetic coating. A movable arm containing a
small electromagnetic coil called a read-write head is located just above each surface.
TRACKS: each disk surface is divided into concentric circles called tracks (more than 10,000 per
surface).
SECTORS: each track is divided into sectors (more than 1,000 sectors per track). Sector sizes range
from 512 to 4096 bytes.
The sequence recorded on a track is: sector number, then the information of that sector (including
an error correction code), then the next sector number.
The disk heads are connected together and move in conjunction, so every head is over the same
track on its surface. The term cylinder refers to all the tracks under the heads at a given point on all
surfaces.

To access data, a three-stage process is used:

1. SEEK: move the head to the desired track. The time to move the head to the desired track is
called the seek time.
2. Wait for the desired sector to rotate under the read/write head. This time is called the
rotational latency or rotational delay.
3. Transfer: transfer the block of bits. The transfer time is a function of the sector size, the
rotation speed, and the recording density of a track (between 100 and 200 MB/sec as of 2012).
ADV: Magnetic disks are nonvolatile like flash, but unlike flash there is no write wear-out problem.
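As a worked illustration (all figures assumed): for a disk rotating at 7200 RPM, one full rotation takes
60/7200 s = 8.33 ms, so the average rotational latency is half a rotation, about 4.17 ms. With an assumed
average seek of 5 ms and a 4 KiB sector transferred at 100 MB/sec (about 0.04 ms), an average access
takes roughly 5 + 4.17 + 0.04 ≈ 9.2 ms, dominated by the mechanical delays.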
============================================================================
5. Explain cache memory? (for 16 marks, write Q. no 5 and Q.no 6)
Cache was the name chosen to represent the level of the memory hierarchy between the processor
and main memory in the first commercial computer to have this extra level. The memories in the data
path (the buffers IF/ID, ID/EX, EX/MEM, MEM/WB) are simply replaced by caches.
When the processor requests a word from the cache, each block in the cache is equal in size to the
word defined by the processor. Suppose the processor requests a word Xn that is not in the cache. This
request results in a miss, and the word Xn is brought from memory into the cache. Each word in memory
is assigned a cache location based on the address of the word in memory. This cache structure is called
direct mapped: each memory location is mapped directly to exactly one location in the cache.
DIRECT MAPPING IN CACHE
The location in the cache for a given datum is determined by its address in memory. The number of
blocks in the cache memory is equal to 2^i, where i is the number of bits used to index the cache. A
block's cache location matches the lower-order bits of its memory address. The mapping is done using
the formula:
(Block address of memory) modulo (Number of blocks in cache)

To identify which memory block occupies a cache location, a set of tags is added to the cache. The tags
contain the address information required to identify whether a word in the cache corresponds to the
requested word. The tag contains only the upper portion of the address (it does not include the bits that
are used as an index into the cache).
As per the previous example, the address has 5 bits, of which the higher-order 2 bits are used for the
tag and the lower-order 3 bits are used for indexing the cache. The index value of the cache is always
used, or referred to, as the block number of the cache.
To check whether the data present in a cache block is valid (in use), another bit called the valid bit is
added: if this bit is 1 the data is valid, and if it is 0 the data is invalid (the block is free). To confirm that
the data present in the cache is mapped to a particular memory location, the lower-order address bits of
the memory must match the index field of the cache block and the higher-order address bits must match
the tag field value of that cache block.
memory address = (tag field) combined with (cache index)
A sketch of this index/tag arithmetic is given after the figure below.

Fig: Address subdivision of the 32-bit MIPS address fields.
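
A minimal C sketch of the direct-mapped lookup arithmetic, assuming a cache of 8 one-word blocks (3 index bits, as in the 5-bit example above) and integer word addresses; the structure and sizes are illustrative, not a hardware design.

    #include <stdint.h>

    #define INDEX_BITS 3                     /* 2^3 = 8 blocks, as in the example */
    #define NUM_BLOCKS (1u << INDEX_BITS)

    typedef struct {
        int      valid;                      /* is this block in use?        */
        uint32_t tag;                        /* upper portion of the address */
        uint32_t data;                       /* one word per block, as above */
    } cache_line_t;

    static cache_line_t cache[NUM_BLOCKS];

    /* index = (block address) modulo (number of blocks): the low-order bits;
       the remaining high-order bits form the tag. Returns 1 on a hit. */
    int cache_lookup(uint32_t word_addr, uint32_t *out)
    {
        uint32_t index = word_addr % NUM_BLOCKS;
        uint32_t tag   = word_addr >> INDEX_BITS;

        if (cache[index].valid && cache[index].tag == tag) {
            *out = cache[index].data;        /* hit: supply the word */
            return 1;
        }
        return 0;                            /* miss: fetch the block from memory,
                                                store data and tag, set valid */
    }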

============================================================================
6. Write short notes on handling the cache Misses and data inconsistency?
The steps to be taken on an instruction cache miss:
1. Send the original PC value (current PC - 4) to the memory.
2. Instruct main memory to perform a read and wait for the memory to complete its access.
3. Write the cache entry, putting the data from memory in the data portion of the entry, writing the upper
bits of the address (from the ALU) into the tag field, and turning the valid bit on.
4. Restart the instruction execution at the first step, which will refetch the instruction, this time finding it
in the cache.
The control of the cache on a data access is essentially identical: on a miss, we simply stall the
processor until the memory responds with the data.
First scheme: write-through.
Suppose a store instruction wrote the data into only the data cache (without changing main
memory); then, after the write into the cache, memory would have a different value from that in the
cache. The cache and memory are then said to be inconsistent.
To keep main memory and the cache consistent, every data write is performed in both the memory
and the cache. By itself this does not provide good performance, because these memory writes can take
on the order of 100 processor clock cycles each.
One solution to this problem is to use a write buffer: the memory write is done by placing the
content in the write buffer, and when the buffered data has been written to memory, that part of the
buffer is freed.
Alternative scheme: write-back.
The new value is written only to the block in the cache; the modified block is written to the lower
level of the hierarchy when it is replaced. Write-back schemes can improve performance, but they are
more complex to implement than write-through. Both policies are sketched below.
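
A minimal contrast of the two write policies on a toy one-block "cache" with a single backing memory word; all structures here are illustrative, not a real cache organization.

    #include <stdint.h>

    /* Toy one-block cache with a single backing memory word. */
    static uint32_t memory_word;
    static struct { uint32_t data; int dirty; } line;

    void write_through(uint32_t value)
    {
        line.data   = value;
        memory_word = value;  /* memory is updated on every store; a real design
                                 queues this in a write buffer rather than stall */
    }

    void write_back(uint32_t value)
    {
        line.data  = value;   /* only the cache block is updated          */
        line.dirty = 1;       /* remember that it now differs from memory */
    }

    void evict(void)          /* on replacement, flush only if modified   */
    {
        if (line.dirty) {
            memory_word = line.data;
            line.dirty  = 0;
        }
    }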
==========================================================================
7. Explain Virtual Memory ?(if asked for 16 marks, write Q. No 7 and Q. No 8)
The main memory can act as a cache for the secondary storage; this technique is called virtual
memory.
Need for virtual memory: to allow efficient and safe sharing of memory among multiple programs.
Virtual memory's role is protection: ensuring that a program can only read and write the portions of main
memory that have been assigned to it. Each program is compiled into its own address space, a separate
range of memory locations accessible only to that program.
Virtual memory implements the translation of a program's address space to physical addresses.
This translation process enforces the protection of a program's address space.
Physical address: an address in main memory.
Protection: a set of mechanisms for ensuring that multiple processes sharing the processor, memory, or
I/O devices cannot interfere, intentionally or unintentionally, with one another by reading or writing each
other's data. Protection is also used to isolate the OS process from user processes.
User programs: virtual memory allows a single user program to exceed the size of primary memory.
Before virtual memory, programmers divided programs into pieces and identified the pieces that were
mutually exclusive. These overlays were loaded or unloaded (to and from main memory) under user
program control during execution. User program control had to ensure that the program never accessed
an overlay that was not loaded and that the overlays never exceeded the allocated size of memory.
Overlays are organized as modules, each containing both code and data.
In virtual memory,
a virtual memory block is called a page, and a virtual memory miss is called a page fault. The
processor produces a virtual address, which is translated by a combination of hardware and software
to a physical address, which in turn can be used to access main memory. This process is called address
mapping or address translation.
Pages are of fixed size (e.g., 4K).

Address Translation:
Virtual memory also simplifies loading the program for execution by providing relocation, which
allows us to load the program anywhere in main memory. All virtual memory systems in use provide
relocation. Because the program is divided into a set of fixed-size blocks (pages), there is no need to find
a contiguous block of memory for a program; instead, the operating system need only find a sufficient
number of pages in main memory. In virtual memory, the address is broken into a virtual page number
and a page offset (refer to the previous diagram).
The physical page number constitutes the upper portion of the physical address, while the page offset,
which is not changed, constitutes the lower portion. The number of bits in the page offset field
determines the page size. Having a larger number of virtual pages than physical pages is essentially what
provides the illusion of an unbounded amount of virtual memory.
A page fault leads to an enormous miss penalty (millions of clock cycles).
To design a virtual memory system, the following issues are considered:
1. Pages should be large enough to amortize the high access time. Sizes from 4 KiB to 16 KiB are typical
today (depending on the computer type).
2. The primary technique used is to allow fully associative placement of pages in memory.
3. Page faults can be handled in software, because the overhead of doing so is small compared to the disk
access time (and clever software placement reduces the miss penalty).
4. Write-through will not work for virtual memory, since writes take too long; instead, virtual memory
systems use write-back.
A sketch of the address split described above follows.
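
A minimal C sketch of breaking a virtual address into virtual page number and page offset, assuming 32-bit addresses and 4 KiB pages (12 offset bits); the sizes are illustrative assumptions.

    #include <stdint.h>

    #define PAGE_SHIFT 12u                 /* log2(4096): 12 offset bits */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)

    uint32_t vpn(uint32_t vaddr)    { return vaddr >> PAGE_SHIFT; }       /* virtual page number      */
    uint32_t offset(uint32_t vaddr) { return vaddr & (PAGE_SIZE - 1u); }  /* unchanged by translation */

    /* The physical address keeps the offset and replaces the virtual page
       number with the physical page number found by translation. */
    uint32_t physical(uint32_t ppn, uint32_t vaddr)
    {
        return (ppn << PAGE_SHIFT) | offset(vaddr);
    }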
==========================================================================
8. Explain page management in virtual memory and translation using page table?
Placing a Page and Finding It Again:
Pages are located by using a table that indexes the memory; this structure is called a page table, and
it resides in memory. A page table is indexed with the virtual page number from the virtual address to
discover the corresponding physical page number. Each program has its own page table, which maps
the virtual address space of that program to main memory. To indicate the location of the page table in
memory, the hardware includes a register that points to the start of the page table; we call this the
page table register (refer to the figure). The page table register, the virtual address, and the indexed page
table entry show how the hardware can form a physical address. A valid bit is used in each page table
entry, as in a cache:
If the bit is off, the page is not present in main memory and a page fault occurs.
If the bit is on, the page is in memory and the entry contains the physical page number.
Because the page table contains a mapping for every possible virtual page, no tags are required.

Fig: Translation Using a Page Table


Handling Page Faults:
If the valid bit for a virtual page is off, a page fault occurs. The OS then finds the page in the next
level of the hierarchy (usually a secondary storage device) and decides where to place the requested page
in main memory. The operating system usually creates the space on the secondary device for all the
pages of a process when it creates the process; this space is called the swap space. (Which page will be
needed in memory cannot be predicted in advance.) To keep track of where each virtual page is stored on
disk, a data structure is created, which may be part of the page table or an auxiliary page table.
When a page fault occurs and all the pages in main memory are in use, the operating system must
choose a page to replace. Operating systems follow the least recently used (LRU) replacement scheme,
and the replaced pages are written to the swap space on the disk. Because a complete LRU scheme is
expensive, some computers provide a reference bit, or use bit, to help the operating system estimate the
LRU pages: the operating system periodically clears the reference bits, so it can determine which pages
were touched during a particular time period.
Replacement and Writes:
In a virtual memory system, a write to the lower level (disk) can take millions of processor clock
cycles; therefore, building a write buffer to allow the system to write through to disk would be
completely impractical. So a write-back scheme is used: writes are performed into the page in memory,
and the page is copied back to disk when it is replaced. To track whether a page has been written since it
was read into memory, a dirty bit is added to the page table entry. The dirty bit is set when any word in a
page is written, and it indicates whether the page needs to be written out to disk. A modified page is
often called a dirty page. A sketch of a page-table entry with these bits follows.
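
A minimal C sketch of a page-table entry and lookup using the valid, reference and dirty bits described above; the entry layout and table size are assumptions made for illustration.

    #include <stdint.h>

    #define NUM_VPAGES 1024   /* assumed size of the virtual page space */

    typedef struct {
        unsigned valid : 1;   /* page present in main memory?                  */
        unsigned dirty : 1;   /* written since it was brought in?              */
        unsigned ref   : 1;   /* reference bit, periodically cleared by the OS */
        uint32_t ppn;         /* physical page number                          */
    } pte_t;

    static pte_t page_table[NUM_VPAGES];   /* base held in the page table register */

    /* Returns 1 and the physical page number on success; 0 signals a page
       fault, which the OS handles (fetch from swap space, pick a victim by
       approximate LRU, write it back if dirty, update the entries). */
    int translate(uint32_t vpn, uint32_t *ppn_out)
    {
        pte_t *e = &page_table[vpn % NUM_VPAGES];
        if (!e->valid)
            return 0;         /* page fault: the operating system takes over */
        e->ref = 1;           /* help the OS estimate LRU                    */
        *ppn_out = e->ppn;
        return 1;
    }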

=======================================================================
9. Explain Fast translation using a TLB?
Page tables are stored in main memory, so a memory access by a program can take at least twice as long:
1. One memory access to obtain the physical address.
2. A second access to get the data.
To improve performance, we rely on locality of reference: when a virtual page is translated, it will
probably be referenced again soon (due to temporal and spatial locality). So a special address translation
cache is used, traditionally referred to as a translation-lookaside buffer (TLB) or translation cache. The
TLB is accessed instead of the page table on every reference, so the TLB also needs to include the other
status bits, such as the dirty and reference bits.
When a miss in the TLB occurs, we must determine whether it is a page fault or merely a TLB miss.
If the page exists in memory, then the TLB miss indicates only that the translation is missing, so the
processor can handle the TLB miss by loading the translation from the page table into the TLB. If the
page is not present in memory, then the TLB miss indicates a true page fault.
Typical values for a TLB might be:
1. TLB size: 16-512 entries
2. Block size: 1-2 page table entries (typically 4-8 bytes each)
3. Hit time: 0.5-1 clock cycle
4. Miss penalty: 10-100 clock cycles
5. Miss rate: 0.01%-1%

The TLB uses a small, fully associative mapping (for a lower miss rate). Handling a TLB miss or a page
fault requires using the exception mechanism to interrupt the active process, transferring control to the
operating system, and later resuming execution of the interrupted process. A sketch of such a TLB
lookup follows.
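
A minimal C sketch of consulting a small fully associative TLB before falling back to the page table; the entry count and fields are illustrative assumptions.

    #include <stdint.h>

    #define TLB_ENTRIES 16   /* small and fully associative */

    typedef struct {
        int      valid;
        uint32_t vpn;        /* virtual page number acts as the tag */
        uint32_t ppn;        /* the cached translation              */
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_ENTRIES];

    /* Fully associative: every entry is compared. Returns 1 on a TLB hit.
       A miss is not necessarily a page fault: the handler first walks the
       page table, and only if the page itself is absent from memory does a
       true page fault occur. */
    int tlb_lookup(uint32_t vpn, uint32_t *ppn_out)
    {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *ppn_out = tlb[i].ppn;
                return 1;
            }
        }
        return 0;   /* TLB miss: load the translation from the page table */
    }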
