
B.S. Anangpuria Institute of Technology & Management
Branch: CSE/IT (4th SEM), Session 2009

Computer Architecture and Organization CSE 210-E


Unit 1: Basic Principles
Lecture 1: Digital logics Boolean Algebra Logic Gates Truth table

Submitted by: Prerna Mittal

In this chapter we will be dealing with the basic digital circuits of our computer: what hardware components we are using, how these hardware components are related and interact with each other, and how this hardware is accessed or seen by the user. This gives birth to the classification of our computer study into:
Computer design: This is concerned with the hardware design of the computer. Here the designer decides on the specifications of the computer system.
Computer organization: This is concerned with the way the hardware components operate and the way they are connected to form the computer system.
Computer architecture: This is concerned with the structure and behavior of the computer as seen by the user. It includes the information formats, the instruction set and the addressing modes for accessing memory.
In our course we will be dealing with computer architecture and organization. Before starting with these, let us discuss the components which make up the hardware of the computer, which is composed of digital circuits handled by the digital computer.
Digital computers: the term implies that the computer deals with digital information.
Digital information: is represented by binary digits (0 and 1).
Gates: blocks of hardware that produce 1 or 0 when input logic requirements are satisfied.

[Figure: a gate block with binary digital input signals entering and a binary digital output signal leaving]

Functions of gates can be described by:
Truth Table
Boolean Function
Karnaugh Map

[Table 1.1: the various logic gates]

Boolean algebra
Algebra with binary (Boolean) variables and logic operations. Boolean algebra is useful in the analysis and synthesis of digital logic circuits:
Input and output signals can be represented by Boolean variables.
The function of a digital logic circuit can be represented by logic operations, i.e. Boolean function(s).
From a Boolean function, a logic diagram can be constructed using AND, OR, and NOT (inverter) gates.

Note:

Many different circuits can realize the same Boolean expression.

Truth Table
The most elementary specification of the function of a digital logic circuit is the truth table: a table that describes the output values for all combinations of the input values, called MINTERMS. With n input variables there are 2^n minterms. (A small programming sketch of this idea follows the summary below.)

Summary:
Computer design: what hardware components we need.
Computer organization: how these hardware components interact.
Computer architecture: how these components appear to and connect with the user.
Logic gates: blocks of hardware giving a result of 0 or 1. Of the 8 common logic gates, 3 (AND, OR, and NOT) are basic.
Boolean algebra: the representation of input and output signals in the form of expressions.
Truth table: a table that describes the output values for all combinations of the input values.
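As promised above, a minimal Python sketch (all names are illustrative, not from the notes) that enumerates the 2^n minterms of a Boolean function and prints its truth table:

    from itertools import product

    def truth_table(f, n):
        """Print the truth table of an n-input Boolean function f."""
        for bits in product([0, 1], repeat=n):   # all 2**n minterms
            print(*bits, '|', f(*bits))

    # Example: F(x, y) = x AND (NOT y)
    truth_table(lambda x, y: x & (1 - y), 2)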

Lecture 2: Combinational logic Blocks Multiplexers Adders Encoders Decoders


Combinational circuits are circuits without memory, where the outputs are obtained from the current inputs only. An n-input, m-output combinational circuit has the form shown below.

[Figure: n input lines -> combinational circuit -> m output lines]

A multiplexer is a combinational circuit which selects one of many inputs depending on selection inputs. The number of selection inputs depends on the number of data inputs in the manner 2^x = y: if y is the number of inputs, then x is the number of selection lines. Thus if we have 4 input lines, we use 2 selection lines (since 2^2 = 4), and so on. Such a circuit is called a 4:1 (or 4x1) multiplexer, as shown in the function table below.

4-to-1 multiplexer function table:

Select S1 S0    Output Y
      0  0      I0
      0  1      I1
      1  0      I2
      1  1      I3
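A minimal Python sketch of the same 4-to-1 selection behaviour (function and variable names are illustrative):

    def mux4to1(i0, i1, i2, i3, s1, s0):
        """Select one of four inputs using select lines s1 s0."""
        return [i0, i1, i2, i3][(s1 << 1) | s0]

    assert mux4to1(7, 8, 9, 10, s1=1, s0=0) == 9   # S1 S0 = 10 selects I2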


Adders
Half Adder
Full Adder

Half Adder: adds 2 bits and gives out a carry and a sum as the result.


Truth table of the half adder:

x y | c s
0 0 | 0 0
0 1 | 0 1
1 0 | 0 1
1 1 | 1 0

From the Karnaugh maps:
c = xy
s = x'y + xy' = x XOR y

[Figure: digital circuit of the half adder]

Full Adder: adds 2 bits together with a carry-in and gives a carry-out and a sum as the result.

Truth table of the full adder:

x y Cin | Cout S
0 0  0  |  0   0
0 0  1  |  0   1
0 1  0  |  0   1
0 1  1  |  1   0
1 0  0  |  0   1
1 0  1  |  1   0
1 1  0  |  1   0
1 1  1  |  1   1

From the Karnaugh maps:
Cout = xy + xCin + yCin = xy + (x XOR y)Cin
S = x'y'Cin + x'yCin' + xy'Cin' + xyCin = x XOR y XOR Cin
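A small Python sketch of these equations (illustrative names; the XOR/AND form follows the derivation above):

    def full_adder(x, y, cin):
        """Return (sum, carry_out) for one-bit inputs."""
        s = x ^ y ^ cin
        cout = (x & y) | ((x ^ y) & cin)
        return s, cout

    # Verify against the truth table above
    for x in (0, 1):
        for y in (0, 1):
            for cin in (0, 1):
                print(x, y, cin, full_adder(x, y, cin))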

Decoder: a decoder takes n inputs and gives 2^n outputs. That is, we get 8 outputs for 3 inputs, and such a circuit is called a 3x8 decoder. We also have 2x4 decoders, 4x16 decoders, and so on. A decoder can be implemented with the help of NAND gates.

Using NAND gates, it becomes more economical.
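Leaving aside the NAND-level implementation, a hedged Python sketch of the behaviour of an n-to-2^n decoder (names illustrative):

    def decode(n, value):
        """Return the 2**n output lines for an n-bit input value;
        exactly one line is active (1)."""
        return [1 if i == value else 0 for i in range(2 ** n)]

    print(decode(3, 5))  # 3x8 decoder: output line 5 is high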

Summary:
Combinational circuits: circuits where the outputs are obtained from the inputs only. Various combinational circuits are:
o Multiplexers: the number of selection inputs depends on the number of data inputs in the manner 2^x = y.
o Half Adder: adds 2 bits and gives a carry and a sum as the result.
o Full Adder: adds 2 bits with a carry-in and gives a carry-out and a sum as the result.
o Encoder: takes 2^n inputs and gives n outputs.
o Decoder: takes n inputs and gives 2^n outputs.
Important Questions derived from this:
Q1. What is the difference between a multiplexer and a decoder?
Q2. Draw a 2x4 decoder with the help of AND gates.

Lecture 3: Sequential logic Blocks Latches Flip flops Registers Counters

Sequential logic blocks: logic blocks whose output logic value depends on the input values and on the state of the block. Here we have the concept of memory, which was not applicable to combinational circuits.

The various sequential blocks or circuits are: Latches:


A latch is a kind of bistable multivibrator, an electronic circuit which has two stable states and thereby can store one bit of information. Today the word is mainly used for simple transparent storage elements, while slightly more advanced non-transparent (or clocked) devices are described as flip-flops. Informally, as this distinction is quite new, the two words are sometimes used interchangeably. S-R latch:

[Figure: S-R latch; the input combination (S,R) = (1,1) is restricted]

To overcome the restricted combination, one can add gates to the inputs that would convert (S,R) = (1,1) to one of the non-restricted combinations. That can be:
Q = 1 (1,0): referred to as an S-latch
Q = 0 (0,1): referred to as an R-latch
Keep state (0,0): referred to as an E-latch

D-LATCH: forbidden input values are forced not to occur by using an inverter between the inputs.

[Figure: D latch - data input D and enable E feed an S-R latch through an inverter, producing outputs Q and Q']

Characteristic table of the D latch:
D | Q(t+1)
0 |   0
1 |   1

Flip Flops:

D flip flop:

[Figure: D flip-flop]

If you compare the D flip-flop and the D latch, the only difference you find in the circuit is that latches do not have clocks whereas flip-flops do. So you can note down the differences between latches and flip-flops as:
A latch is a level-triggered device, whereas a flip-flop is edge-triggered.
The output of a latch changes independently of a clock signal, whereas the output of a flip-flop changes at specific times determined by a clocking signal.
For latches we do not require clock pulses; flip-flops are clocked devices.
Characteristics: in an edge-triggered flip-flop, the state transition occurs at the rising edge or the falling edge of the clock pulse.

[Figure: timing diagram - latches respond to the input during the whole period the enable is active; (positive) edge-triggered flip-flops respond to the input only at the rising clock edge]


Counters: A counter is a device which stores (and sometimes displays) the number of times a particular event or process has occurred, often in relationship to a clock signal.

4-bit binary counter:
[Figure: four J-K flip-flops in cascade with outputs A0, A1, A2, A3, a common Clock, a Count Enable input, and an Output Carry]

RING COUNTER:

In a ring counter the output of each flip-flop is moved to the input of the next flip-flop, and the output of the last flip-flop is fed back to the input of the first.


JOHNSON COUNTER :

In a Johnson counter the output of the last flip-flop is inverted and given to the first flip-flop (a simulation sketch of both counters follows below).
Registers: A register refers to a group of flip-flops operating as a coherent unit to hold data. This is different from a counter, which is a group of flip-flops operating to generate new data by tabulating it.
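Returning to the two counters above, a hedged Python sketch of both feedback rules (a list models the flip-flop outputs; names illustrative):

    def step_ring(state):
        """Ring counter: the last output feeds the first stage."""
        return [state[-1]] + state[:-1]

    def step_johnson(state):
        """Johnson counter: the inverted last output feeds the first stage."""
        return [1 - state[-1]] + state[:-1]

    s = [1, 0, 0, 0]
    for _ in range(4):
        s = step_ring(s)   # cycles 0100, 0010, 0001, back to 1000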


Shift register: A register that is capable of shifting data one bit at a time is called a shift register. The logical configuration of a serial shift register consists of a chain of flip-flops connected in cascade, with the output of one flip-flop connected to the input of its neighbor. The operation of the shift register is synchronous; thus each flip-flop is connected to a common clock. Using D flip-flops forms the simplest type of shift register.

Bidirectional shift register with parallel load:

[Figure: 4-bit bidirectional shift register with parallel load - each stage A0..A3 is a D flip-flop driven by a 4x1 MUX; the select lines S1 S0 choose among the serial input for shift right, the serial input for shift left, the parallel inputs I0..I3, or holding the present value; all stages share a common Clock]
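A minimal Python model of the serial (shift-right) behaviour of such a register (illustrative; one call represents one clock edge):

    def shift_right(state, serial_in):
        """On each clock edge every D flip-flop takes its neighbor's output."""
        return [serial_in] + state[:-1]

    reg = [0, 0, 0, 0]
    for bit in [1, 0, 1, 1]:        # serial data enters one bit per clock
        reg = shift_right(reg, bit)
    print(reg)                      # [1, 1, 0, 1]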

Summary:

Sequential circuits: the output logic value depends on the input values and on the state of the blocks. These circuits have memory. Various sequential circuits are:
o Latches: an electronic circuit which has two stable states and thereby can store one bit of information.
o Flip flops: also have two stable states, but change state only at times determined by a clock signal.
o Counters: a device which stores the number of times a particular event or process has occurred.
o Registers: a group of flip-flops operating as a coherent unit to hold data.
Important Questions derived from this:
Q1. What is the difference between a latch and a flip-flop?
Q2. Explain the Johnson counter.
Q3. Draw a shift register with parallel load.


Lecture 4: Stored Program Control Concept; Flynn's classification of computers: SISD, SIMD, MISD, MIMD

After the discussion of the basic principles of hardware and the combinational and sequential circuits we have in our computer system, let us see how these components interact to make the computer system we use. We will be starting with the basic architectures of the computer system, and the most basic question is how programs are stored in our computer system, or how the different programs and data are arranged in our system.

Stored Program Control Concept
The simplest way to organize a computer is to have one processor register and an instruction code with 2 parts:
Opcode (what operation is to be completed)
Address (address of the operands on which the operation is to be computed)
A computer that by design includes an instruction set architecture can store in memory a set of instructions (a program) that details the computation, together with the data on which the computation is to be done.

Instruction format (16 bits):
bits 15-12: Opcode
bits 11-0:  Address

[Fig 1: Stored Program Organization - a 4096 x 16 memory holds instructions (program) and operands (data); a 16-bit processor register (accumulator, AC) holds the binary operand]


The opcode tells us the operation to be performed. The address tells us the memory location where to find the operand. For a memory unit of 4096 words we need 12 bits to specify the address.

When we store an instruction code in memory, 4 bits are available to specify one of 16 operations (as 12 bits are used for the operand address). For an operation, control fetches the instruction from memory, decodes the operation (one out of 16), finds the operand, and then performs the operation. Computers with one processor register generally name it the accumulator (AC). The operation is performed with the operand and the content of AC. In case no operand is specified, the operation is computed on the accumulator alone, e.g. clear AC, complement AC, etc.
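A hedged Python sketch of splitting such a 16-bit instruction word into its two fields (the 4/12 split follows the format above; names are illustrative):

    def decode(instruction):
        """Split a 16-bit instruction word into opcode and address."""
        opcode = (instruction >> 12) & 0xF   # bits 15-12
        address = instruction & 0xFFF        # bits 11-0
        return opcode, address

    op, addr = decode(0b0010_0000_0000_0101)
    print(op, addr)   # opcode 2, address 5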

PARALLEL COMPUTERS
The organization we studied above is very basic, but sometimes we have very large computations for which one processor with a general architecture will not be of much help. Thus we take the help of many processors, or divide the processor's functions into many functional units, or perform the same computation on many data values. To cover all these cases we have various types of computers.

Architectural Classification
Flynn's classification is based on the multiplicity of instruction streams and data streams:
Instruction stream: the sequence of instructions read from memory.
Data stream: the operations performed on the data in the processor.

                                 Number of Data Streams
                                 Single     Multiple
Number of            Single     SISD       SIMD
Instruction Streams  Multiple   MISD       MIMD

Fig 2: Classification according to instruction and data streams

There are a variety of ways parallel processing can be classified. M. J. Flynn considered the organization of a computer system by the number of instructions and data items manipulated simultaneously. The normal operation of a computer is to fetch instructions from memory and execute them in the processor.


The sequence of instructions read from memory constitutes an instruction stream. The operations performed on the data in the processor constitute a data stream. Parallel processing can be implemented in the instruction stream, in the data stream, or in both.

SISD COMPUTER SYSTEMS
SISD (single instruction, single data stream) is the simplest computer available. It contains no parallelism: it has a single instruction stream and a single data stream. The instructions associated with SISD are executed sequentially, and the system may or may not have external parallel processing capabilities.

[Figure: Control Unit -> (instruction stream) -> Processor Unit -> (data stream) -> Memory]

Fig 3: SISD Architecture

Characteristics:
- Standard von Neumann machine
- Instructions and data are stored in memory
- One operation at a time

Limitations: the von Neumann bottleneck. The maximum speed of the system is limited by the memory bandwidth (bits/sec or bytes/sec):
- Limitation on memory bandwidth
- Memory is shared by CPU and I/O

Examples: superscalar processors, super-pipelined processors, VLIW

MISD COMPUTER SYSTEMS
MISD (multiple instruction, single data stream) is of little practical use, as it is rare for many instructions to be executed on a single data stream.


[Figure: MISD organization - memories M1..Mn feed control units CU1..CUn, which drive processors P1..Pn; a single data stream from memory passes through all processors while each receives its own instruction stream]

Fig 4: MISD Architecture

Characteristics:
- There is no computer at present that can be classified as MISD.

SIMD COMPUTER SYSTEMS
SIMD (single instruction, multiple data stream) is a computer where a single instruction operates on different sets of data. Execution is carried out by many processing units controlled by a single control unit. The shared memory must contain multiple modules so that it can communicate with all the processors at the same time. Main memory is used for storage of the program; the master control unit decodes each instruction and determines the instruction to be executed.

[Figure: SIMD organization - the control unit fetches the instruction stream from memory over the data bus and broadcasts it to processor units P1..Pn; each processor operates on its own data stream and reaches memory modules M1..Mn through an alignment network]

Fig 5: SIMD Architecture

Characteristics:
- Only one copy of the program exists
- A single controller executes one instruction at a time
Examples: array processors, systolic arrays, associative processors

MIMD COMPUTER SYSTEMS
MIMD (multiple instruction, multiple data stream) refers to a computer system where different processing elements work on different data. Under this class fall the various multiprocessors and multicomputers.

Characteristics:
- Multiple processing units
- Execution of multiple instructions on multiple data

[Fig 6: MIMD Architecture - processors P1..Pn with local memories M1..Mn connected through an interconnection network to a shared memory]

Types of MIMD computer systems:
- Shared memory multiprocessors (UMA, NUMA)
- Message-passing multicomputers

SHARED MEMORY MULTIPROCESSORS
Example systems:
- Bus and cache-based systems: Sequent Balance, Encore Multimax
- Multistage IN-based systems: Ultracomputer, Butterfly, RP3, HEP


- Crossbar switch-based systems: C.mmp, Alliant FX/8

Limitations:
- Memory access latency
- Hot spot problem

SHARED MEMORY MULTIPROCESSORS (UMA)


[Figure: UMA organization - processors P1..Pn access shared memory modules M1..Mn through an interconnection network]

Fig 7: Uniform Memory Access (UMA)

Characteristics: all processors have equally direct access to one large memory address space. The access time to that memory is the same for every processor, hence the name UMA.

SHARED MEMORY MULTIPROCESSORS (NUMA)

[Figure: NUMA organization - each processor P1..Pn has its own local memory M1..Mn, and all are connected through an interconnection network]


Fig 8: Non-Uniform Memory Access (NUMA)

Characteristics: all processors can access one large shared memory address space, and each processor also has its own local memory. The access time to reach the different memories is therefore different for each processor, hence the name NUMA.

MESSAGE-PASSING MULTICOMPUTER

[Figure: processors P1..Pn, each with private memory, connected by a message-passing network using point-to-point connections]

Fig 9: Message-passing multicomputer architecture

Characteristics:
- Interconnected computers
- Each processor has its own memory and communicates via message passing

Example systems:
- Tree structure: Teradata, DADO
- Mesh-connected: Rediflow, Series 2010, J-Machine
- Hypercube: Cosmic Cube, iPSC, NCUBE, FPS T Series, Mark III

Limitations:
- Communication overhead
- Hard to program

Summary:
Stored program control concept: in this type of organization, instructions and data are stored separately.
Flynn's classification of computers: it divides processing work into data streams and instruction streams, resulting in:


o SISD (single instruction, single data)
o SIMD (single instruction, multiple data)
o MISD (multiple instruction, single data)
o MIMD (multiple instruction, multiple data)

Important Questions:
Q1. Explain the stored program control concept.
Q2. Explain Flynn's classification of computers.
Q3. Describe the concept of data stream and instruction stream.

Lecture 5: Multilevel viewpoint of a machine: macro architecture, ISA, micro architecture; CPU; caches; main memory and secondary memory units; input/output mapping
After the discussion of the stored program control concept and the various types of parallel computers, let us study the different components of the computer structure.

MULTILEVEL VIEWPOINT OF A MACHINE
Our computer is built of various layers. These layers are basically divided into:
Software layer
Hardware layer
Instruction set architecture


MACRO ARCHITECTURE - SOFTWARE LAYER:
- User application layer
- Compiler, assembler
- OS: MS-DOS, Windows, UNIX/Linux

INSTRUCTION SET ARCHITECTURE (ISA):
- Processor, memory, I/O system

MICRO ARCHITECTURE - HARDWARE LAYER:
- Data path and control
- Gate level design
- Circuit level design
- Silicon layout layer

Fig 1: Multilevel viewpoint of a machine

Computer system architecture is decided on the basis of the type of applications or usage of the computer. The computer architect decides the different layers and the function of each layer for a specific computer; these layers, and the functions of each, can vary from one organization to another. Our layered architecture is basically divided into 3 parts:

Macro-architecture: as a unit of deployment, we will talk about client applications and COM servers. Computer architecture is the conceptual design and fundamental operational structure of a computer system. It is a blueprint and functional description of requirements (especially speeds and interconnections) and design implementations for the various parts of a computer. This is basically our software layer of the computer. It comprises:

User application layer: the user layer basically gives the user an interface to the computer for which the computer is designed. At this layer the user specifies what processing has to be done; the requirements given by the user have to be implemented by the computer architect with the help of the other layers.

High-level language:


A high-level programming language is a programming language with strong abstraction from the details of the computer. In comparison to low-level programming languages, it may use natural language elements, be easier to use, or be more portable across platforms. Such languages hide the details of CPU operations such as memory access models and management of scope. E.g. C/Fortran/Pascal; these are not computer dependent.

Assembly language: assembly language refers to the lowest-level human-readable method for programming a particular computer. Assembly languages are platform specific, and therefore a different assembly language is necessary for programming every different type of computer.

Machine language: machine languages consist entirely of numbers and are almost impossible for humans to read and write.

Operating system: operating systems interface with hardware to provide the necessary services for application software, e.g. MS-DOS, Linux, UNIX, etc.
Functions of an operating system:
Process management
Memory management
File management
Device management
Error detection
Security
Types of operating system:
Multiprogramming operating system
Multiprocessing operating system
Time-sharing operating system
Real-time operating system
Distributed operating system
Network operating system

Compiler: software that translates a program written in a high-level programming language (C/C++, COBOL, etc.) into machine language. A compiler usually generates assembly language first and then translates the assembly language into machine language. A utility known as a "linker" then combines all required machine language modules into an executable program that can run in the computer.


Assembler: the software that translates assembly language into machine language. Contrast with a compiler, which is used to translate a high-level language, such as COBOL or C, into assembly language first and then into machine language.

Instruction set architecture: this is an abstraction on the interface between the hardware and the low-level software. It deals with the functional behavior of a computer system as viewed by a programmer (whereas computer organization deals with structural relationships that are not visible to a programmer). The instruction set architecture is the attribute of a computing system as seen by the assembly language programmer or the compiler. The ISA is determined by:
Data storage
Memory addressing modes
Operations in the instruction set
Instruction formats
Encoding of the instruction set
Compiler's view

Micro-architecture: inside a unit of deployment we will talk about the running process, COM apartment, thread concurrency and synchronization, and memory sharing. Micro architecture, also known as computer organization, is a lower-level, more concrete description of the system that involves how the constituent parts of the system are interconnected and how they interoperate in order to implement the ISA. The size of a computer's cache, for instance, is an organizational issue that generally has nothing to do with the ISA. The hardware layer comprises:

Processor, memory, I/O system: these are the basic hardware devices required for the processing of any system application.

Data path and control: different computers have different numbers and types of registers and other logic circuits. The data path and control decide the flow of information within the various parts of the computer system and its circuits.

Gate level design: circuits such as registers, counters, etc. are implemented in the form of the various available gates.

Circuit level design: to combine the gates into a logical circuit or a component we have the basic circuit level design, which ultimately gives birth to all the hardware components of a computer system.

Silicon layout layer.

Other than the architecture of the computer, we have some very basic units which are important for our computer.

Memory units:
Main memory: the main memory of the computer is also known as RAM, standing for Random Access Memory. It is constructed from integrated circuits and needs to have electrical power in order to maintain its information; when power is lost, the information is lost too! It can be directly accessed by the CPU.
Caches: a CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations. Cache memory is random access memory that a computer microprocessor can access more quickly than it can access regular RAM. As the microprocessor processes data, it looks first in the cache memory, and if it finds the data there (from a previous reading of data), it does not have to do the more time-consuming reading of data from the larger memory.
Secondary memory: secondary memory, which is sometimes called backing store or external memory, allows the permanent storage of large quantities of data. Examples: hard disk, floppy disk, CDs, etc.

CPU: a central processing unit (CPU) is a machine that can execute computer programs. The fundamental operation of most CPUs, regardless of the physical form they take, is to execute a sequence of stored instructions called a program. The program is represented by a series of numbers that are kept in some kind of computer memory. There are four steps that nearly all CPUs use in their operation: fetch, decode, execute, and writeback.

I/O units: I/O refers to the communication between an information processing system (such as a computer) and the outside world, possibly a human or another information processing system. Inputs are the signals or data received by the system, and outputs are the signals or data sent from it.

Summary:
The multilevel viewpoint of a machine describes the complete structure of the computer system in a hierarchical manner, which comprises:
o Macro architecture: the software components
Operating system
High level language
Assembly language
Compiler
Assembler
o Micro architecture: the hardware components
o ISA: how the hardware and software components are connected. It describes:
Data storage
Memory addressing modes


Operations in the instruction set
Instruction formats
Encoding of the instruction set
Compiler's view
Other than the structured organization of the computer, the other important elements are:
o Memory
o CPU
o I/O
Important Questions:
Q1. Explain the multilevel viewpoint of a machine.
Q2. Describe micro architecture.
Q3. Describe macro architecture.
Q4. Explain the ISA and why we call it a link between the hardware and software components.
Q5. What is an operating system?


Lecture 6: CPU performance measures MIPS MFLOPS

After the discussion of all the elements of computer structure in the previous topics, in this lecture we describe the performance of a computer with the help of performance metrics.
Performance of a machine is determined by:
Instruction count
Clock cycle time
Clock cycles per instruction
Processor design (datapath and control) will determine:
Clock cycle time
Clock cycles per instruction
Single cycle processor: one clock cycle per instruction.
Advantages: simple design, low CPI.
Disadvantages: long cycle time, which is limited by the slowest instruction.
CPU time = Instruction count x CPI x Cycle time

We have different methods to calculate the performance of a CPU, or to compare two CPUs, but the result depends strongly on what type of instructions we give to these CPUs. The two measures we generally use are:
MIPS
MFLOPS

MIPS: for a specific program running on a specific computer, MIPS is a measure of how many millions of instructions are executed per second:
MIPS = Instruction count / (Execution time x 10^6)
     = Instruction count / (CPU clocks x Cycle time x 10^6)
     = (Instruction count x Clock rate) / (Instruction count x CPI x 10^6)
     = Clock rate / (CPI x 10^6)


Faster execution time usually means a faster MIPS rating. MIPS is a useful measure, but it also has some pitfalls.
Problems with the MIPS rating:
- No account is taken of the instruction set used.
- Program-dependent: a single machine does not have a single MIPS rating, since the rating may depend on the program used.
- Easy to abuse: the program used to get the MIPS rating is often omitted.
- Cannot be used to compare computers with different instruction sets.
- A higher MIPS rating in some cases may not mean higher performance or better execution time, e.g. due to compiler design variations.
For a machine with the following instruction classes:

Instruction class    CPI
A                    1
B                    2
C                    3

For a given program, two compilers produced the following instruction counts (in millions) for each instruction class:

Code from:     A    B    C
Compiler 1     5    1    1
Compiler 2    10    1    1

The machine is assumed to run at a clock rate of 100 MHz.
MIPS = Clock rate / (CPI x 10^6)
CPI = CPU execution cycles / Instruction count
CPU time = (Instruction count x CPI) / Clock rate

For compiler 1:
CPI1 = (5 x 1 + 1 x 2 + 1 x 3) / (5 + 1 + 1) = 10 / 7 = 1.43
MIPS1 = (100 x 10^6) / (1.43 x 10^6) = 70.0
CPU time1 = ((5 + 1 + 1) x 10^6 x 1.43) / (100 x 10^6) = 0.10 seconds

For compiler 2:
CPI2 = (10 x 1 + 1 x 2 + 1 x 3) / (10 + 1 + 1) = 15 / 12 = 1.25
MIPS2 = (100 x 10^6) / (1.25 x 10^6) = 80.0
CPU time2 = ((10 + 1 + 1) x 10^6 x 1.25) / (100 x 10^6) = 0.15 seconds
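A short Python sketch reproducing this worked example (same numbers as above; variable names are illustrative):

    def perf(counts_millions, cpis, clock_hz):
        """Return (CPI, MIPS, CPU time in seconds) for one compiled program."""
        insts = sum(counts_millions) * 1e6
        cycles = sum(c * 1e6 * cpi for c, cpi in zip(counts_millions, cpis))
        cpi = cycles / insts
        mips = clock_hz / (cpi * 1e6)
        time = insts * cpi / clock_hz
        return cpi, mips, time

    print(perf([5, 1, 1], [1, 2, 3], 100e6))   # compiler 1: ~1.43, 70, 0.10 s
    print(perf([10, 1, 1], [1, 2, 3], 100e6))  # compiler 2: 1.25, 80, 0.15 s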


MFLOPS: for a specific program running on a specific computer, MFLOPS is a measure of millions of floating-point operations (megaflops) per second:
MFLOPS = Number of floating-point operations / (Execution time x 10^6)
MFLOPS is a better comparison measure between different machines than MIPS, but it also has some pitfalls.

Problems with MFLOPS:
- A floating-point operation is an addition, subtraction, multiplication, or division operation applied to numbers represented by a single or double precision floating-point representation.
- Program-dependent: different programs have different percentages of floating-point operations; e.g. compilers perform virtually no floating-point operations and therefore yield a MFLOPS rating of zero.
- Dependent on the type of floating-point operations present in the program.

Summary:
Performance of a machine is determined by:
Instruction count
Clock cycle time
Clock cycles per instruction
MIPS = Instruction count / (Execution time x 10^6)
MFLOPS = Number of floating-point operations / (Execution time x 10^6)

Important Questions: Q1. What is MIPS? Q2. What is MFLOPS? Q3. What is the difference between MIPS and MFLOPS? Q4. What are CPU performance measures?


Lecture 7:

Cache Memory Main Memory Secondary Memory

We have basically 3 types of memory attached to our processor:
Cache memory
Main memory
Secondary memory
Primary storage, presently known as memory, is the only kind directly accessible to the CPU. The CPU continuously reads instructions stored there and executes them as required. Any data actively operated on is also stored there in a uniform manner. Besides the main large-capacity RAM, there are two more sub-layers of primary storage:

Processor registers are located inside the processor. Each register typically holds a word of data (often 32 or 64 bits). CPU instructions instruct the arithmetic and logic unit to perform various calculations or other operations on this data (or with the help of it). Registers are technically among the fastest of all forms of computer data storage.
Processor cache is an intermediate stage between the ultra-fast registers and the much slower main memory. It is introduced solely to increase the performance of the computer. The most actively used information in the main memory is duplicated in the cache memory, which is faster but of much smaller capacity; on the other hand, it is much slower, but much larger, than the processor registers. A multi-level hierarchical cache setup is also commonly used: the primary cache is the smallest and fastest and is located inside the processor; the secondary cache is somewhat larger and slower.

These are the types of memory accessed when we work with the processor. But if we have to store some data permanently we need the help of secondary or auxiliary memory. Secondary memory (or secondary storage) is the slowest and cheapest form of memory. It cannot be processed directly by the CPU; it must first be copied into primary storage (also known as RAM). Secondary memory devices include magnetic disks like hard drives and floppy disks; optical disks such as CDs and CD-ROMs; and magnetic tapes, which were the first form of secondary memory.

33

Primary memory                             Secondary memory
1. Fast                                    1. Slow
2. Expensive                               2. Cheap
3. Low capacity                            3. Large capacity
4. Connects directly to the processor      4. Not connected directly to the processor

Hard disks: hard disks, like cassette tapes, use magnetic recording techniques - the magnetic medium can be easily erased and rewritten, and it will "remember" the magnetic flux patterns stored onto the medium for many years. A hard drive consists of platters, a control circuit board, and interface parts. A hard disk is a sealed unit containing a number of platters in a stack. Hard disks may be mounted in a horizontal or a vertical position; in this description, the hard drive is mounted horizontally. Electromagnetic read/write heads are positioned above and below each platter. As the platters spin, the drive heads move in toward the center surface and out toward the edge. In this way, the drive heads can reach the entire surface of each platter.

On a hard disk, data is stored in thin, concentric bands. A drive head, while in one position, can read or write a circular ring, or band, called a track. There can be more than a thousand tracks on a 3.5-inch hard disk. Sections within each track are called sectors. A sector is the smallest physical storage unit on a disk, and is almost always 512 bytes (0.5 kB) in size. The stack of platters rotates at a constant speed. The drive head, while positioned close to the center of the disk, reads from a surface that is passing by more slowly than the surface at the outer edges of the disk. To compensate for this physical difference, tracks near the outside of the disk are less densely populated with data than the tracks near the center of the disk. The result of the different data density is that the same amount of data can be read over the same period of time from any

drive head position. The disk space is filled with data according to a standard plan. One side of one platter contains space reserved for hardware track-positioning information and is not available to the operating system. Thus, a disk assembly containing two platters has three sides available for data. Track-positioning data is written to the disk during assembly at the factory. The system disk controller reads this data to place the drive heads in the correct sector position.

Magnetic tapes: an electric current in a coil of wire produces a magnetic field similar to that of a bar magnet, and that field is much stronger if the coil has a ferromagnetic (iron-like) core.

Tape heads are made from rings of ferromagnetic material with a gap where the tape contacts it so the magnetic field can fringe out to magnetize the emulsion on the tape. A coil of wire around the ring carries the current to produce a magnetic field proportional to the signal to be recorded. If an already magnetized tape is passed beneath the head, it can induce a voltage in the coil. Thus the same head can be used for recording and playback.
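Returning to the hard-disk geometry described earlier: raw capacity is just the product of data surfaces, tracks per surface, sectors per track, and the 512-byte sector size. A hedged Python sketch (the geometry numbers are invented for the example, not from the notes):

    def disk_capacity(surfaces, tracks, sectors_per_track, sector_bytes=512):
        """Raw capacity of a disk with the given geometry, in bytes."""
        return surfaces * tracks * sectors_per_track * sector_bytes

    # e.g. 3 data surfaces (two platters, one side reserved), 1000 tracks, 63 sectors
    print(disk_capacity(3, 1000, 63) / 1e6, "MB")   # ~96.8 MB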


Lecture 8: Instruction set based classification of computers
Three-address instructions
Two-address instructions
One-address instructions
Zero-address instructions
RISC instructions
CISC instructions
RISC vs CISC

In the last chapter we discussed the various architectures and the layers of the computer architecture. In this chapter we explain the middle layer of the multilevel viewpoint of a machine, i.e. the instruction set architecture.
Instruction Set Architecture (ISA) is an abstraction on the interface between the hardware and the low-level software. It comprises:
Instruction formats
Memory addressing modes
Operations in the instruction set
Encoding of the instruction set
Data storage
Compiler's view

Instruction format: the representation of the instruction. It contains the various instruction fields:
Opcode field: specifies the operation to be performed
Address field(s): designate memory address(es) or processor register(s)
Mode field(s): determine how the address field is to be interpreted to get the effective address or the operand
The number of address fields in the instruction format depends on the internal organization of the CPU. The three most common CPU organizations are:
- Single accumulator organization:
ADD X            /* AC <- AC + M[X] */
- General register organization:
ADD R1, R2, R3   /* R1 <- R2 + R3 */
ADD R1, R2       /* R1 <- R1 + R2 */
MOV R1, R2       /* R1 <- R2 */
ADD R1, X        /* R1 <- R1 + M[X] */
- Stack organization:
PUSH X           /* TOS <- M[X] */
ADD              /* adds the two top stack elements */


Address Instructions:

Three-address instructions
- Program to evaluate X = (A + B) * (C + D):
ADD R1, A, B   /* R1 <- M[A] + M[B] */
ADD R2, C, D   /* R2 <- M[C] + M[D] */
MUL X, R1, R2  /* M[X] <- R1 * R2 */
- Results in short programs
- Instructions become long (many bits)

Two-address instructions
- Program to evaluate X = (A + B) * (C + D):
MOV R1, A      /* R1 <- M[A] */
ADD R1, B      /* R1 <- R1 + M[B] */
MOV R2, C      /* R2 <- M[C] */
ADD R2, D      /* R2 <- R2 + M[D] */
MUL R1, R2     /* R1 <- R1 * R2 */
MOV X, R1      /* M[X] <- R1 */

One-address instructions
- Use an implied AC register for all data manipulation
- Program to evaluate X = (A + B) * (C + D):
LOAD A         /* AC <- M[A] */
ADD B          /* AC <- AC + M[B] */
STORE T        /* M[T] <- AC */
LOAD C         /* AC <- M[C] */
ADD D          /* AC <- AC + M[D] */
MUL T          /* AC <- AC * M[T] */
STORE X        /* M[X] <- AC */

Zero-address instructions
- Found in stack-organized computers
- Program to evaluate X = (A + B) * (C + D):
PUSH A         /* TOS <- A */
PUSH B         /* TOS <- B */
ADD            /* TOS <- (A + B) */
PUSH C         /* TOS <- C */
PUSH D         /* TOS <- D */
ADD            /* TOS <- (C + D) */
MUL            /* TOS <- (C + D) * (A + B) */
POP X          /* M[X] <- TOS */
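A hedged Python sketch of the zero-address evaluation above, modelling the stack explicitly (all names illustrative):

    def run(program, mem):
        """Tiny stack machine: ops are PUSH addr, POP addr, ADD, MUL."""
        stack = []
        for op, *arg in program:
            if op == 'PUSH':
                stack.append(mem[arg[0]])
            elif op == 'POP':
                mem[arg[0]] = stack.pop()
            elif op == 'ADD':
                stack.append(stack.pop() + stack.pop())
            elif op == 'MUL':
                stack.append(stack.pop() * stack.pop())
        return mem

    mem = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
    prog = [('PUSH', 'A'), ('PUSH', 'B'), ('ADD',),
            ('PUSH', 'C'), ('PUSH', 'D'), ('ADD',),
            ('MUL',), ('POP', 'X')]
    print(run(prog, mem)['X'])   # (1 + 2) * (3 + 4) = 21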

CISC (Complex Instruction Set Computer)


These computers with many instructions and addressing modes came to be known as Complex Instruction Set Computers (CISC).


One goal for CISC machines was to have a machine language instruction to match each high-level language statement type.

Criticisms of CISC:
- Complex instruction formats, lengths, and addressing modes; complicated instruction cycle control due to the complex decoding hardware and decoding process
- Multiple-memory-cycle instructions: operations on memory data, multiple memory accesses per instruction
- Microprogrammed control is a necessity: the microprogram control storage takes a substantial portion of the CPU chip area, and the semantic gap between machine instruction and microinstruction is large
- A general-purpose instruction set includes all the features required by individually different applications; when any one application is running, the features required by the other applications are an extra burden on that application

RISC
In the late 70s and early 80s there was a reaction to the shortcomings of the CISC style of processors, and Reduced Instruction Set Computers (RISC) were proposed as an alternative. The underlying idea behind RISC processors is to simplify the instruction set and reduce instruction execution time.
Note: in RISC-type instruction sets, we cannot access memory operands directly.
Evaluate X = (A + B) * (C + D):
MOV R1, A        /* R1 <- M[A] */
MOV R2, B        /* R2 <- M[B] */
ADD R1, R1, R2   /* R1 <- R1 + R2 */
MOV R2, C        /* R2 <- M[C] */
MOV R3, D        /* R3 <- M[D] */
ADD R2, R2, R3   /* R2 <- R2 + R3 */
MUL R1, R1, R2   /* R1 <- R1 * R2 */
MOV X, R1        /* M[X] <- R1 */

RISC processors often feature:
- Few instructions
- Few addressing modes
- Only load and store instructions access memory


- All other operations are done using on-processor registers
- Fixed-length instructions
- Single-cycle execution of instructions
- The control unit is hardwired, not microprogrammed

Since all instructions (except load and store) use only registers for operands, only a few addressing modes are needed. By having all instructions the same length, reading them in is easy and fast; the fetch and decode stages are simple, looking much more like Mano's basic computer than a CISC machine. The instruction and address formats are designed to be easy to decode: unlike variable-length CISC instructions, the opcode and register fields of RISC instructions can be decoded simultaneously. The control logic of a RISC processor is designed to be simple and fast: it is simple because of the small number of instructions and the simple addressing modes, and it is hardwired, rather than microprogrammed, because hardwired control is faster.

ADVANTAGES OF RISC
VLSI realization:
- The control area is considerably reduced; RISC chips allow a large number of registers on the chip
- Enhancement of performance and HLL support
- Higher regularization factor and lower VLSI design cost
Computing speed:
- Simpler, smaller control unit: faster
- Simpler instruction set, addressing modes, and instruction format: faster decoding
- Register operations are faster than memory operations
- The register window enhances the overall speed of execution
- Identical instruction length and one-cycle instruction execution are suitable for pipelining: faster
Design costs and reliability:
- Shorter design time: reduction in the overall design cost, and less risk that the end product will be obsolete by the time the design is completed
- Simpler, smaller control unit: higher reliability
- Simple instruction format (of fixed length):


ease of virtual memory management
High-level language support:
- A single choice of instruction: shorter, simpler compiler
- A large number of CPU registers: more efficient code
- Register window: direct support of HLLs
- Reduced burden on the compiler writer

RISC VS CISC
The CISC approach: the entire task of multiplying two numbers can be completed with one instruction:
MULT 2:3, 5:2
One of the primary advantages of this system is that the compiler has to do very little work to translate a high-level language statement into assembly. Because the length of the code is relatively short, very little RAM is required to store instructions. The emphasis is put on building complex instructions directly into the hardware.

The RISC approach: in order to perform the exact series of steps described in the CISC approach, a programmer would need to code four lines of assembly:
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
At first, this may seem like a much less efficient way of completing the operation. Because there are more lines of code, more RAM is needed to store the assembly-level instructions, and the compiler must also perform more work to convert a high-level language statement into code of this form.

RISC vs CISC
CISC                                           RISC
Emphasis on hardware                           Emphasis on software
Transistors used for storing                   Spends more transistors on memory registers
complex instructions
Includes multi-clock complex instructions      Single-clock, reduced instructions only
Memory-to-memory: "LOAD" and "STORE"           Register-to-register: "LOAD" and "STORE"
incorporated in instructions                   are independent instructions
Small code sizes                               Large code sizes
High cycles per second                         Low cycles per second

Summary:
The instruction format is composed of the opcode field, address field, and mode field.
The different types of address instructions used are three-address, two-address, one-address, and zero-address.
RISC and CISC: introduction, advantages, and criticisms.
RISC vs CISC.
Important Questions:
Q1. Explain the different addressing formats in detail with examples.
Q2. Explain RISC and CISC with their advantages and criticisms.
Q3. Numerical problems.


Lecture 9:

Addressing modes Implied Mode Immediate Mode Register Mode Register Indirect Mode Autoincrement or Autodecrement Mode Direct Addressing Mode Indirect Addressing Mode Relative addressing Mode
In the last lecture we studied the instruction formats; now we study how instructions use the different types of addressing modes.

Addressing Modes
* Specifies a rule for interpreting or modifying the address field of the instruction (before the operand is actually referenced)
* A variety of addressing modes exist:
- to give programming flexibility to the user
- to use the bits in the address field of the instruction efficiently
In simple words, an addressing mode is the way to fetch operands (or data) from memory.

TYPES OF ADDRESSING MODES


Implied mode: the address of the operands is specified implicitly in the definition of the instruction.
- No need to specify an address in the instruction
- EA = AC, or EA = Stack[SP]
- Examples from the basic computer: CLA, CME, INP

Immediate mode: instead of specifying the address of the operand, the operand itself is specified.
- No need to specify an address in the instruction
- However, the operand itself needs to be specified
- (-) Sometimes requires more bits than an address
- (+) Fast to acquire an operand
- Useful for initializing registers to a constant value

Register mode: the address specified in the instruction is a register address.


- The designated operand needs to be in a register
- (+) Shorter address than a memory address, saving address bits in the instruction
- (+) Faster operand acquisition than with memory addressing
- EA = IR(R) (IR(R): register field of IR)

Register indirect mode: the instruction specifies a register which contains the memory address of the operand.
- (+) Saves instruction bits, since a register address is shorter than a memory address
- (-) Slower operand acquisition than either register addressing or memory addressing
- EA = [IR(R)] ([x]: content of x)

Autoincrement or autodecrement mode: similar to register indirect mode, except that when the address in the register is used to access memory, the value in the register is incremented or decremented by 1 automatically.

Direct address mode: the instruction specifies the memory address, which can be used directly to access memory.
- (+) Faster than the other memory addressing modes
- (-) Too many bits are needed to specify the address for a large physical memory space
- EA = IR(addr) (IR(addr): address field of IR)
- E.g. the address field in a branch-type instruction

Indirect addressing mode: the address field of the instruction specifies the address of a memory location that contains the address of the operand.
- (-) Slow operand acquisition because of the additional memory access
- EA = M[IR(address)]

Relative addressing modes: the address field of the instruction specifies part of the address (an abbreviated address) which is used along with a designated register to calculate the address of the operand.
--> Effective address = address part of the instruction + content of a special register
- (+) A large physical memory can be accessed with a small number of address bits
- EA = f(IR(address), R), where R is sometimes implied --> typically EA = IR(address) + R
- There are 3 different relative addressing modes depending on R:
* (PC) relative addressing mode (R = PC)
* Indexed addressing mode (R = IX, where IX is an index register)
* Base register addressing mode (R = BAR, a base address register)
* Indexed addressing mode vs. base register addressing mode:
- IR(address) (address field of the instruction): base address vs. displacement
- R (index/base register): displacement vs. base address


- Difference: the way they are used (NOT the way they are computed)
* Indexed addressing mode: processing many operands in an array using the same instruction
* Base register addressing mode: facilitates the relocation of programs in memory in multiprogramming systems

Addressing modes: examples are given in the sketch below.
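A hedged Python sketch computing the effective address under a few of these modes (register and PC contents are made-up values for illustration):

    def effective_address(mode, addr_field, pc=0, reg=0):
        """Return the effective address for a few common modes."""
        if mode == 'direct':
            return addr_field                 # EA = IR(address)
        if mode == 'relative':
            return pc + addr_field            # EA = PC + IR(address)
        if mode == 'indexed':
            return addr_field + reg           # EA = IR(address) + XR
        if mode == 'register_indirect':
            return reg                        # EA = [R]
        raise ValueError(mode)

    print(effective_address('relative', addr_field=20, pc=100))   # 120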

Summary:
Addressing modes specify a rule for interpreting or modifying the address field of the instruction. The different types of addressing modes are: implied mode, immediate mode, register mode, register indirect mode, autoincrement or autodecrement mode, direct mode, indirect mode, and relative addressing mode.
Important Questions:
Q1. Explain the addressing modes with suitable examples.

Lecture 10:

Instruction set Data Transfer Instructions o Typical Data Transfer Instructions o Data Transfer Instructions with Different Addressing Modes Data Manipulation Instructions o Arithmetic instructions o Logical and bit manipulation instructions o Shift instructions Program Control Instructions o Conditional Branch Instructions o Subroutine Call & Return
DATA TRANSFER INSTRUCTIONS
These are the instructions used only for the transfer of data between registers, between registers and memory operands, and between other memory components. No manipulation is done on the data values.

Typical Data Transfer Instructions

Name        Mnemonic
Load        LD
Store       ST
Move        MOV
Exchange    XCH
Input       IN
Output      OUT
Push        PUSH
Pop         POP

Table 3.1

These are instructions in which there is no use of the various addressing modes; we have a direct transfer between the various registers and memory components.

Load and store are used for the transfer of data to and from the accumulator, e.g.:
LD 20
ST D
Move and exchange are used for data transfer between the various general-purpose registers:
MOV R1, R2
MOV R1, X
XCH R1, R2
Input and output are used for data transfer between memory and I/O devices. Push and pop operations are used for information flow between the stack and memory.

Data Transfer Instructions with Different Addressing Modes

Mode                 Assembly Convention    Register Transfer
Direct address       LD ADR                 AC <- M[ADR]
Indirect address     LD @ADR                AC <- M[M[ADR]]
Relative address     LD $ADR                AC <- M[PC + ADR]
Immediate operand    LD #NBR                AC <- NBR
Index addressing     LD ADR(X)              AC <- M[ADR + XR]
Register             LD R1                  AC <- R1
Register indirect    LD (R1)                AC <- M[R1]
Autoincrement        LD (R1)+               AC <- M[R1], R1 <- R1 + 1
Autodecrement        LD -(R1)               R1 <- R1 - 1, AC <- M[R1]

Table 3.2

In these types of data transfer we use the different addressing modes for loading the operand value into the accumulator register.

DATA MANIPULATION INSTRUCTIONS
Three basic types:
Arithmetic instructions


Logical and bit manipulation instructions
Shift instructions

Arithmetic instructions: these are the instructions used for arithmetic calculations like addition, subtraction, increment, etc.

Name                       Mnemonic
Increment                  INC
Decrement                  DEC
Add                        ADD
Subtract                   SUB
Multiply                   MUL
Divide                     DIV
Add with carry             ADDC
Subtract with borrow       SUBB
Negate (2's complement)    NEG

Table 3.3
Logical and Bit Manipulation Instructions
These are instructions in which operations are performed on strings of bits. The bits are treated individually, so an operation can be done on an individual bit or a group of bits while ignoring the value as a whole, and even insertion of new bits is possible. For example:
CLR R1 will make all the bits 0.
COM R1 will invert all the bits.
AND, OR and XOR produce their result from the corresponding individual bits of the two operands, e.g. the AND of 0011 and 1100 results in 0000.
The AND instruction is also known as the mask instruction: if we have to mask some bits of an operand, we AND those positions with 0s while giving the other inputs as 1 (high). E.g. suppose we have to mask the register value 11000110 in the 1st, 3rd and 7th bits; then we AND it with the value 01011101.
CLRC, SETC and COMC work on only 1 bit of the operand, i.e. the carry.


Similarly, in the case of EI and DI we work on only 1 bit, the interrupt flip-flop, to enable or disable it.

Name                   Mnemonic
Clear                  CLR
Complement             COM
AND                    AND
OR                     OR
Exclusive-OR           XOR
Clear carry            CLRC
Set carry              SETC
Complement carry       COMC
Enable interrupt       EI
Disable interrupt      DI

Table 3.4

Shift instructions: these are instructions which modify the whole operand value by shifting its bits to the left or right. Say R1 has the value 11001100:
o SHR inserts 0 at the leftmost position. Result: 01100110
o SHL inserts 0 at the rightmost position. Result: 10011000
o SHRA: in the case of SHRA the sign bit remains the same while every other bit shifts right accordingly. Result: 11100110
o SHLA is the same as SHL, inserting 0 at the end. Result: 10011000
o In ROR, all the bits are shifted towards the right and the rightmost one moves to the leftmost position. Result: 01100110
o In ROL, all the bits are shifted towards the left and the leftmost one moves to the rightmost position. Result: 10011001


o In the case of RORC, suppose we have a carry bit of 0 with register R1. All the bits of the register are shifted right, the value of the carry is moved to the leftmost position, and the old rightmost bit is moved into the carry. Result: 01100110 with carry 0
o Similarly, in the case of ROLC, all the bits of the register are shifted left, the value of the carry is moved to the rightmost position, and the old leftmost bit is moved into the carry. Result: 10011000 with carry 1

Name                      Mnemonic
Logical shift right       SHR
Logical shift left        SHL
Arithmetic shift right    SHRA
Arithmetic shift left     SHLA
Rotate right              ROR
Rotate left               ROL
Rotate right thru carry   RORC
Rotate left thru carry    ROLC

Table 3.5
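A hedged Python sketch of a few of these shift and rotate operations on an 8-bit register value (masking with 0xFF keeps the 8-bit width; names are illustrative):

    def shr(v):                 # logical shift right, 0 enters at MSB
        return v >> 1

    def shl(v):                 # logical shift left, 0 enters at LSB
        return (v << 1) & 0xFF

    def ror(v):                 # rotate right: LSB wraps around to MSB
        return ((v >> 1) | (v << 7)) & 0xFF

    def rorc(v, carry):         # rotate right through carry
        return ((v >> 1) | (carry << 7)) & 0xFF, v & 1

    r1 = 0b11001100
    print(bin(shr(r1)), bin(ror(r1)), rorc(r1, 0))
    # shr and ror of 11001100 both give 01100110; rorc gives (01100110, carry 0)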
PROGRAM CONTROL INSTRUCTIONS:
Before starting with program control instructions, let us study the concept of the PC, i.e. the program counter. The program counter is the register which tells us the address of the next instruction to be executed. When we fetch the instruction pointed to by the PC from memory, it changes its value, giving us the address of the next instruction to be fetched. For sequential instructions it simply increments itself; for branching or modular programs it gives us the address of the first instruction of the called program. After the execution of the called program, the program counter points back to the instruction following the one from which the subprogram was called. For go-to kinds of instruction, the program counter simply changes its value without keeping any reference to the previous instruction.



PC <- PC + 1: in-line sequencing (the next instruction is fetched from the next adjacent location in memory)
PC <- address from another source (current instruction, stack, etc.): branch, conditional branch, subroutine, etc.

Program control instructions: these instructions are used for the transfer of control to other instructions, i.e. they are used when we have to execute the next instruction from some other location instead of proceeding sequentially. The situations can be:
Calling a subprogram
Returning to the main program
Jumping to some other instruction or location
Skipping instructions, as with break and exit, or when a tested condition is false, and so on

Name              Mnemonic
Branch            BR
Jump              JMP
Skip              SKP
Call              CALL
Return            RTN
Compare (by -)    CMP
Test (by AND)     TST

Table 3.6

*CMP and TST instructions do not retain the results of their operations (subtraction and AND, respectively); they only set or clear certain flags.

Conditional branch instructions: these are instructions in which we test some condition and, depending on the result, either branch or continue sequentially.


Mnemonic   Branch condition              Tested condition
BZ         Branch if zero                Z = 1
BNZ        Branch if not zero            Z = 0
BC         Branch if carry               C = 1
BNC        Branch if no carry            C = 0
BP         Branch if plus                S = 0
BM         Branch if minus               S = 1
BV         Branch if overflow            V = 1
BNV        Branch if no overflow         V = 0

Unsigned compare conditions (A - B):
BHI        Branch if higher              A > B
BHE        Branch if higher or equal     A >= B
BLO        Branch if lower               A < B
BLOE       Branch if lower or equal      A <= B
BE         Branch if equal               A = B
BNE        Branch if not equal           A != B

Signed compare conditions (A - B):
BGT        Branch if greater than        A > B
BGE        Branch if greater or equal    A >= B
BLT        Branch if less than           A < B
BLE        Branch if less or equal       A <= B
BE         Branch if equal               A = B
BNE        Branch if not equal           A != B

Table 3.7
Subroutine Call and Return:
Subroutine call: also known as call subroutine, jump to subroutine, branch to subroutine, or branch and save return address. Two important operations are implied:
* Branch to the beginning of the subroutine (same as a branch or conditional branch)
* Save the return address, to get back to the location in the calling program upon exit from the subroutine
Possible locations for storing the return address:
- A fixed location in the subroutine (memory)
- A fixed location in memory
- A processor register
- A memory stack (the most efficient way)

CALL:
SP <- SP - 1
M[SP] <- PC
PC <- EA

RTN:
PC <- M[SP]
SP <- SP + 1
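A hedged Python sketch of these two register-transfer sequences, with memory as a list and the stack growing downward (all names illustrative):

    def call(cpu, mem, ea):
        """CALL: push the return address, then jump to the subroutine."""
        cpu['SP'] -= 1
        mem[cpu['SP']] = cpu['PC']
        cpu['PC'] = ea

    def rtn(cpu, mem):
        """RTN: pop the return address back into PC."""
        cpu['PC'] = mem[cpu['SP']]
        cpu['SP'] += 1

    cpu, mem = {'PC': 40, 'SP': 100}, [0] * 128
    call(cpu, mem, ea=70)    # PC becomes 70, return address 40 saved at M[99]
    rtn(cpu, mem)            # PC restored to 40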

Summary:
Data transfer instructions are of two kinds: typical data transfer instructions and data transfer instructions with different addressing modes.
Data manipulation instructions are of three types: arithmetic instructions, logical and bit manipulation instructions, and shift instructions.
Program control instructions can be divided into conditional branch instructions and subroutine call & return instructions.
Important Questions:
Q1. Explain the data transfer instructions.
Q2. Explain the data manipulation instructions.
Q3. Explain the program control instructions with examples.


Lecture 11:

Program Interrupts MASM


PROGRAM INTERRUPT:
Types of interrupts:
1. External interrupts: external interrupts are initiated from outside the CPU and memory.
- I/O device -> data transfer request or data transfer complete
- Timing device -> timeout
- Power failure
- Operator
2. Internal interrupts (traps): internal interrupts are caused by the currently running program.
- Register or stack overflow
- Divide by zero
- Opcode violation
- Protection violation
3. Software interrupts: both external and internal interrupts are initiated by the computer hardware, whereas software interrupts are initiated by an executing instruction.
- Supervisor call -> switches from user mode to supervisor mode -> allows execution of a certain class of operations which are not allowed in user mode.

MASM:
If you have used a modern word processor such as Microsoft Word, you may have noticed the macros feature, where you can record a series of frequently used actions or commands into a macro. For example, suppose you always need to insert a 2-by-4 table with the titles "Date" and "Time": you can start the macro recorder, create the table as you wish, and then save the macro. The next time you need to create the same kind of table, you just execute the macro. The same applies to a macro assembler: it enables you to record frequently performed actions or a frequently used block of code so that you do not have to re-type it each time.
The Microsoft Macro Assembler (abbreviated MASM) is an x86 high-level assembler for DOS and Microsoft Windows. It supports a wide variety of macro facilities and structured programming idioms, including high-level constructs for looping and procedures. Later versions added the capability of

producing programs for Windows. MASM is one of the few Microsoft development tools that target 16-bit, 32-bit and 64-bit platforms. Earlier versions were MS-DOS applications; versions 5.1 and 6.0 were OS/2 applications, and later versions were Win32 console applications. Versions 6.1 and 6.11 included Phar Lap's TNT DOS extender so that MASM could run in MS-DOS. The name MASM originally stood for Macro Assembler, but over the years it has become synonymous with Microsoft Assembler. An assembly-language translator converts each macro into several machine-language instructions.

MASM isn't the fastest assembler around (it's not particularly slow, except in a couple of degenerate cases, but there are faster assemblers available). Though very powerful, there are a couple of assemblers that, arguably, are more powerful (e.g. TASM and HLA). MASM is only usable for creating DOS and Windows applications; you cannot effectively use it to create software for other operating systems.

Benefits of MASM today:
- Steve Hutchessen's ("Hutch") MASM32 package provides the support for MASM that Microsoft no longer provides.
- You can download MASM (and MASM32) free from Microsoft and other sites.
- Most Windows assembly-language examples on the Internet today use MASM syntax.
- You may download MASM directly from Webster as part of the MASM32 package.

Summary: Program interrupts can be external, internal or software interrupts. MASM is the Microsoft Macro Assembler, used for implementing macros.
Important Questions:
Q1. What are program interrupts? Explain the types of program interrupts.
Q2. Explain MASM in detail.

Lecture 10: CPU Architecture types
o Accumulator
o Register
o Stack
o Memory / Register
Detailed data path of a register-based CPU

In Unit 3 we discussed the instruction set architecture (ISA), which covers the various types of address instructions, the addressing modes, and the different types of instructions in various computer architectures. In this chapter we discuss the various types of computer organization. In general, most processors are organized in one of 3 ways:
- Single register (accumulator) organization: the Basic Computer is a good example; the accumulator is the only general-purpose register.
- Stack organization: all operations are done using the hardware stack. For example, an OR instruction pops the two top elements from the stack, ORs them, and pushes the result back onto the stack.
- General register organization: used by most modern processors; any of the registers can be used as the source or destination of an operation.
Accumulator type of organization: one operand is in memory and the other is in the accumulator. The microoperations we can run with the accumulator are:
AC ← AC ∧ DR                 AND with DR
AC ← AC + DR                 Add with DR
AC ← DR                      Transfer from DR
AC(0-7) ← INPR               Transfer from INPR
AC ← AC′                     Complement
AC ← shr AC, AC(15) ← E      Shift right
AC ← shl AC, AC(0) ← E       Shift left
AC ← 0                       Clear
AC ← AC + 1                  Increment

Circuit required:
Fig: Adder and logic circuit feeding the 16-bit accumulator (AC). Inputs come from DR (16 bits) and INPR (8 bits); the LD, INR and CLR control gates load, increment and clear AC under the clock, and AC drives the 16-bit bus.

Stack Organization: a stack is a very useful feature for nested subroutines and nested interrupt service routines, and is also efficient for arithmetic expression evaluation. It is storage accessed in LIFO order through a pointer (SP); only PUSH and POP operations are applicable. Stack organization comes in two forms: the register stack and the memory stack.

REGISTER STACK ORGANIZATION
Fig: A 64-word register stack. The stack pointer SP (6 bits) addresses the top of the stack; DR supplies and receives data; the FULL and EMPTY flags mark the boundary conditions. Items A, B, C occupy locations 1-3 (SP = 3); address 63 is the last location.

Push and Pop operations (initially SP = 0, EMPTY = 1, FULL = 0):
PUSH: SP ← SP + 1; M[SP] ← DR; if (SP = 0) then (FULL ← 1); EMPTY ← 0
POP:  DR ← M[SP]; SP ← SP − 1; if (SP = 0) then (EMPTY ← 1); FULL ← 0
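As a hedged illustration, the register stack can be modelled in C, with the 6-bit SP represented by wrap-around arithmetic modulo 64. The names (push, pop, full, empty) are ours, not from any particular machine.

    #include <stdio.h>

    #define WORDS 64                 /* 6-bit SP -> 64 locations */

    int stk[WORDS];
    unsigned sp = 0;                 /* wraps like a 6-bit register          */
    int full = 0, empty = 1;         /* initially SP = 0, EMPTY = 1, FULL = 0 */

    void push(int dr)
    {
        sp = (sp + 1) % WORDS;       /* SP <- SP + 1                */
        stk[sp] = dr;                /* M[SP] <- DR                 */
        if (sp == 0) full = 1;       /* SP wrapped around: now full */
        empty = 0;
    }

    int pop(void)
    {
        int dr = stk[sp];            /* DR <- M[SP]                          */
        sp = (sp - 1) % WORDS;       /* SP <- SP - 1 (unsigned wrap is safe) */
        if (sp == 0) empty = 1;      /* SP back at 0: stack is now empty     */
        full = 0;
        return dr;
    }

    int main(void)
    {
        push(1); push(2); push(3);   /* items A, B, C of the figure */
        int c = pop(), b = pop();
        printf("%d %d empty=%d\n", c, b, empty);  /* 3 2 empty=0 */
        return 0;
    }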


MEMORY STACK ORGANIZATION
Fig: Memory with program, data and stack segments. PC points into the program segment (starting at 1000), AR into the data segment (starting at 3000), and SP to the top of the stack (initially 4001); the stack grows toward lower addresses (4001, 4000, 3999, 3998, 3997, ...).

A portion of memory is used as a stack, with a processor register (SP) as the stack pointer:
PUSH: SP ← SP − 1; M[SP] ← DR
POP:  DR ← M[SP]; SP ← SP + 1
Note: most computers do not provide hardware to check for stack overflow (full stack) or underflow (empty stack); this must be done in software.
Register type of organization: here we take the help of several registers, say R1 to R7, for the transfer and manipulation of data. The detailed data path of a typical register-based CPU is shown below.


Fig: Data path of a register-based CPU. Seven clocked registers R1-R7 feed two multiplexers: MUX 1, controlled by SELS1, drives the S1 bus, and MUX 2, controlled by SELS2, drives the S2 bus into the ALU, whose operation is selected by OPR. A 3x8 decoder, controlled by SELD, asserts Load on the destination register; the ALU output is also available as the result.
To avoid direct memory access (which is time-consuming and therefore costly), we prefer the register organization, as it proves more efficient and saves time. Here we use 7 registers. The two multiplexers and the decoder decide which registers are used as operand sources and which register is the destination for storing the result. MUX 1 selects the first operand register according to SELS1 (selector for source 1); similarly, SELS2 drives MUX 2 for the second operand. These two operands reach the ALU over the S1 and S2 buses. OPR denotes the type of operation to be performed, and the computation is carried out by the ALU. The result is then stored back in one of the 7 registers: the decoder, driven by SELD, decides which register receives it.
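A minimal C sketch of one clock cycle of this datapath, assuming seven registers, two source selectors and one destination selector; the opr codes are invented for illustration.

    #include <stdio.h>

    int R[8];                      /* R[1]..R[7] are the seven registers */

    /* One clock cycle: route two sources onto the S1/S2 buses, run the
       ALU operation named by opr, latch the result into the register
       selected by seld.                                               */
    void cycle(int sels1, int sels2, int seld, char opr)
    {
        int s1 = R[sels1];         /* MUX 1 drives the S1 bus */
        int s2 = R[sels2];         /* MUX 2 drives the S2 bus */
        int result;

        switch (opr) {             /* ALU */
        case '+': result = s1 + s2; break;
        case '-': result = s1 - s2; break;
        case '&': result = s1 & s2; break;
        default:  result = s1;      break;
        }
        R[seld] = result;          /* decoder asserts Load on one register */
    }

    int main(void)
    {
        R[1] = 5; R[2] = 7;
        cycle(1, 2, 3, '+');       /* R3 <- R1 + R2 */
        printf("R3 = %d\n", R[3]); /* prints 12     */
        return 0;
    }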


Lecture 13: Address Sequencing / Microinstruction Sequencing Implementation of control unit

Address Sequencing / Microinstruction Sequencing: microinstructions are stored in control memory in groups, with each group specifying a routine. The hardware that controls the address sequencing of the control memory must be capable of sequencing the microinstructions within a routine and of branching from one routine to another, with the help of the circuit below.

Fig: Selection of address for control memory. A multiplexer, driven by the branch logic and the selected status bit, chooses the next address for the control address register (CAR) from: the incrementer (CAR + 1), a branch address taken from the microinstruction, the mapping logic output derived from the instruction code, or the subroutine register (SBR). The control memory (ROM) output supplies the microoperations, the branch address and the status-bit select.

Steps: an initial address is loaded into CAR when power is turned on; this is usually the address of the first microinstruction, which activates the instruction fetch routine. This routine may be sequenced by incrementing CAR. At the end of the fetch routine the instruction is in the IR of the computer. Next, the control memory computes the effective address of the operand. The next step is the execution of the instruction fetched from memory. The transformation from the instruction code bits to an address in control memory where the routine is located is referred to as a mapping process.

At the completion of the execution of the instruction, control must return to the fetch routine by executing an unconditional branch microinstruction to the first address of the fetch routine.
Sequencing capabilities required in a control storage:
- Incrementing of the control address register
- Unconditional and conditional branches
- A mapping process from the bits of the machine instruction to an address in control memory
- A facility for subroutine call and return
A sketch of this next-address selection follows.
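A hedged C sketch of the next-address logic: one selector chooses among the four capabilities just listed. The field names and the mapping rule (placing each routine four words apart, a common convention) are illustrative assumptions, not a fixed standard.

    #include <stdio.h>

    unsigned car = 0, sbr = 0;    /* control address / subroutine registers */

    enum { SEQ_INC, SEQ_BRANCH, SEQ_MAP, SEQ_CALL, SEQ_RET };

    /* One tick of the next-address multiplexer: braddr comes from the
       microinstruction, opcode feeds the mapping logic, cond gates a
       conditional branch.                                             */
    void next_address(int sel, int cond, unsigned braddr, unsigned opcode)
    {
        switch (sel) {
        case SEQ_INC:    car = car + 1;                 break; /* incrementer */
        case SEQ_BRANCH: car = cond ? braddr : car + 1; break; /* branches    */
        case SEQ_MAP:    car = opcode << 2;             break; /* mapping     */
        case SEQ_CALL:   sbr = car + 1; car = braddr;   break; /* call: save  */
        case SEQ_RET:    car = sbr;                     break; /* return      */
        }
    }

    int main(void)
    {
        next_address(SEQ_MAP, 0, 0, 5);  /* map opcode 5 to its routine */
        printf("CAR = %u\n", car);       /* 5 << 2 = 20                 */
        return 0;
    }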

Design of the control unit: after obtaining the microoperations we have to execute them, but first they must be decoded.
Fig: Decoding of microoperation fields. The three microoperation fields F1, F2 and F3 each feed a 3x8 decoder. Decoder outputs such as ADD and AND drive the arithmetic logic and shift unit (fed by AC and DR, with results loaded back into AC); outputs such as PCTAR and DRTAR select, through a 2x1 multiplexer, whether PC or DR(0-10) is loaded into AR under the clock.

Because we have 8 microoperations, represented by 3 bits in each field, and there are 3 such fields, the microoperation field bits are decoded with three 3x8 decoders. After decoding, each microoperation is routed to the appropriate circuit. Data-manipulation microoperations such as AND, ADD and SUB are given to the ALU, and the corresponding results are moved to AC; the ALU receives its data from AC and DR. For data-transfer microoperations such as PCTAR or DRTAR we simply need to transfer values. Because there are two possible sources for a transfer into AR, we use a 2x1 MUX whose select line is attached to the DRTAR signal: if DRTAR is high, the MUX routes DR to AR; otherwise PC's value is moved to AR. The actual data movement occurs through the Load input: if either signal is high, the value is loaded into AR. The clock signal is provided for the synchronization of microoperations.


Lecture 13: Fetch and decode cycle Control Unit

Fetch and Decode

T0: AR ← PC (S2S1S0 = 010, T0 = 1)
T1: IR ← M[AR], PC ← PC + 1 (S2S1S0 = 111, T1 = 1)
T2: D0, ..., D7 ← Decode IR(12-14), AR ← IR(0-11), I ← IR(15)
Fig: Register transfers for the fetch phase over the common bus. At T0, the bus select lines (S2 S1 S0) place PC on the bus and LD(AR) is enabled; at T1, a memory read places M[AR] on the bus, LD(IR) is enabled, and INR(PC) increments PC. All transfers are synchronized by the clock.


Control Unit: the control unit (CU) of a processor translates machine instructions into the control signals for the microoperations that implement them. Control units are implemented in one of two ways:
- Hardwired control: the CU is made up of sequential and combinational circuits that generate the control signals.
- Microprogrammed control: a control memory on the processor contains microprograms that activate the necessary control signals.
We will consider a hardwired implementation of the control unit for the Basic Computer.


Lecture 15: Memory hierarchy and its organization Need of memory hierarchy Locality of reference principle

In the last units we studied the various instructions, data and registers associated with our computer organization. Let's now turn to the microarchitecture of the computer, of which an important part is the memory. Let's study what a memory is and what types of memory are available.
The memory unit is an essential component of a computer, used for storing programs and data. We use main memory for running programs, plus additional capacity for storage. We have various levels of memory units arranged in a memory hierarchy.
MEMORY HIERARCHY
The goal of the memory hierarchy is to obtain the highest possible access speed while minimizing the total cost of the memory system. The various components are:
- Main memory: the memory unit that communicates directly with the CPU. The programs and data currently needed by the processor reside in main memory.
- Auxiliary memory: devices that provide backup storage, e.g. magnetic tapes and magnetic disks.
- Cache memory: the memory which lies between main memory and the CPU.

Fig: Memory hierarchy. Magnetic tapes and magnetic disks (auxiliary memory) connect through the I/O processor to main memory, which the CPU accesses through the cache. Ordered from slowest and cheapest to fastest and most expensive: magnetic tape, magnetic disk, main memory, cache, registers.
In this hierarchy, magnetic tapes sit at the lowest level, meaning they are very slow and very cheap. Moving to upper levels, we reach main memory, where speed increases but so does cost. Thus, as we go toward the upper levels:
- Speed increases
- Cost per bit increases
- Access time decreases
- Size decreases
Many operating systems are designed to enable the CPU to process a number of independent programs concurrently; this concept is called multiprogramming. It is made possible by two programs residing in different parts of the memory hierarchy at the same time, e.g. CPU processing and I/O transfer.
The locality of reference, also known as the locality principle, is the phenomenon that the collection of data locations referenced in a short period of time in a running computer often consists of relatively well-predictable clusters. Analysis of a large number of typical programs has shown that references to memory in any given interval of time tend to be confined to a few localized areas of memory. This phenomenon is known as locality of reference.

Important special cases of locality are temporal, spatial, equidistant and branch locality.

Temporal locality: if at one point in time a particular memory location is referenced, it is likely that the same location will be referenced again in the near future. There is temporal proximity between adjacent references to the same memory location. In this case it is common to store a copy of the referenced data in special memory storage that can be accessed faster. Temporal locality is a special case of spatial locality, namely when the prospective location is identical to the present location.
Spatial locality: if a particular memory location is referenced at a particular time, it is likely that nearby memory locations will be referenced in the near future. There is spatial proximity between memory locations referenced at almost the same time. In this case it is common to try to guess how big a neighbourhood around the current reference is worth preparing for faster access.
Equidistant locality: halfway between spatial locality and branch locality. Consider a loop accessing locations in an equidistant pattern, i.e. the path in the spatial-temporal coordinate space is a dotted line; a simple linear function can then predict which location will be accessed in the near future.
Branch locality: there are only a few possible alternatives for the prospective part of the path in the spatial-temporal coordinate space. This is the case when an instruction loop has a simple structure, or when the possible outcomes of a small system of conditional branch instructions are restricted to a small set of possibilities. Branch locality is typically not a spatial locality, since the few possibilities can be located far from each other.
Sequential locality: in a typical program, the execution of instructions follows a sequential order unless branch instructions create out-of-order execution. This also involves spatial locality, since sequential instructions are stored near each other. A short example of the two most common cases is sketched below.
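A short C sketch of temporal and spatial locality in an assumed array traversal; nothing here is specific to any machine.

    #include <stdio.h>

    #define N 1024

    int a[N];

    int main(void)
    {
        long sum = 0;

        /* Spatial locality: a[0], a[1], a[2], ... are adjacent in memory,
           so each cache line fetched serves several iterations.
           Temporal locality: sum and i are referenced on every iteration,
           so they stay in registers or in the nearest cache level.       */
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("%ld\n", sum);
        return 0;
    }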

In order to benefit from the very frequently occurring temporal and spatial kinds of locality, most information storage systems are hierarchical. Equidistant locality is usually supported by the diverse non-trivial increment instructions of processors. For branch locality, contemporary processors have sophisticated branch predictors, and on the basis of these predictions the processor's memory manager tries to collect and preprocess the data of the plausible alternatives.

Reasons for locality


There are several reasons for locality. These reasons are either goals to achieve or circumstances to accept, depending on the aspect. The reasons below are not disjoint; in fact, the list below goes from the most general case to special cases.

68

Predictability: locality is merely one type of predictable behavior in computer systems. Luckily, many practical problems are decidable, and hence the corresponding program can behave predictably if it is well written.
Structure of the program: locality often occurs because of the way computer programs are created for handling decidable problems. Generally, related data is stored in nearby locations in storage. One common pattern in computing involves processing several items one at a time: if a lot of processing is done, a single item will be accessed more than once, leading to temporal locality of reference. Furthermore, moving to the next item implies that the next item will be read, hence spatial locality of reference, since memory locations are typically read in batches.
Linear data structures: locality often occurs because code contains loops that reference arrays or other data structures by indices. Sequential locality, a special case of spatial locality, occurs when relevant data elements are arranged and accessed linearly; for example, the simple traversal of elements in a one-dimensional array, from the base address to the highest element, exploits the sequential locality of the array in memory. The more general equidistant locality occurs when the linear traversal is over a longer area of adjacent data structures of identical structure and size, and when not the whole structures are accessed but only mutually corresponding elements of the structures; this is the case when a matrix is represented as a sequence of rows and the requirement is to access a single column of the matrix.

Use of locality in general


If most of the time a substantial portion of the references aggregate into clusters, and if the shape of this system of clusters can be predicted well, then it can be used for speed optimization. There are several ways to benefit from locality. The common optimization techniques are:

- To increase the locality of references: usually achieved on the software side.
- To exploit the locality of references: usually achieved on the hardware side. Temporal and spatial locality can be capitalized on by hierarchical storage hardware. Equidistant locality can be used by appropriately specialized processor instructions; this possibility is not only the responsibility of the hardware but of the software as well, whose structure must be suitable for compiling a binary program that calls the specialized instructions in question. Branch locality is a more elaborate possibility, hence more development effort is needed, but there is a much larger reserve for future exploration in this kind of locality than in all the remaining ones.

Lecture 16: Main Memory o RAM chip organization o ROM chip organization Expansion of main memory o Memory connections to CPU o Memory address map

Till now we have discussed the memory interconnections and their comparisons. Let's take each in detail.
Main memory: main memory is a large (compared with cache memory) and fast (compared with magnetic tapes, disks, etc.) memory used to store programs and data during computer operation. The I/O processor manages data transfers between auxiliary memory and main memory. The principal technology used for main memory is based on semiconductor integrated circuits, and it is available in 2 types: RAM and ROM.
RAM: the part of main memory where we can both read and write data.
Typical RAM chip (128 x 8): pins CS1 (chip select 1), CS2 (chip select 2), RD (read), WR (write), a 7-bit address input AD7, and an 8-bit bidirectional data bus.
CS1 and CS2 are used to enable or disable a particular RAM chip. The corresponding truth table is described below.

The RAM is enabled when CS1 = 1 and CS2 = 0; for every other CS1/CS2 combination the chip is inhibited and its data bus is in the high-impedance state. With the chip selected, the RD pin indicates that the RAM is being used for a read operation, and the WR pin that a write operation is being performed. If both RD and WR are inactive, no operation takes place and the data bus stays in the high-impedance state; if both are asserted, read is chosen, since otherwise we would risk inconsistent data. Since this is a 128 x 8 RAM, we have 128 words of 8 bits each; thus we need an 8-bit bidirectional data bus to transfer the data, and since 128 = 2^7, we need 7 address bits to access the 128 words.
Integrated-circuit RAM chips are available in 2 modes: static memory and dynamic memory (discussed in Lecture 17).

Typical ROM chip (512 x 8): pins CS1 (chip select 1), CS2 (chip select 2), a 9-bit address input AD9, and an 8-bit data bus.

The ROM is enabled when CS1 = 1 and CS2 = 0. No WR pin is needed, as ROM does not allow write operations; no RD pin is needed either, because when the ROM is enabled it can only be read. Since this is a 512 x 8 ROM, we have 512 words of 8 bits each; thus we need an 8-bit data bus, unidirectional since it only allows reading, and since 512 = 2^9, we need 9 address bits.

Memory expansion: sometimes we need to combine RAM or ROM chips to expand memory. Suppose we need 512 words of RAM built from 128-word RAM chips, plus 512 words of ROM: we then use 4 RAM chips of 128 words each and one 512-word ROM. Accessing a particular word of memory takes 3 steps:
1. Address a particular word within a chip: 7 bits for a 128-word RAM, or 9 bits for the 512-word ROM.
2. Choose one RAM chip out of four: 2 bits, hence a 2x4 decoder.
3. Choose between RAM and ROM: 1 more bit.
These connections to the CPU are shown in the following circuit. We have 4 RAMs and one ROM. Address lines 1-7 go to all the RAM chips; address lines 1-9 go to the ROM. Bits 8 and 9 select among the 4 RAMs through the 2x4 decoder. Bit 10 distinguishes RAM from ROM: when bit 10 is low a RAM is enabled, otherwise the ROM is enabled.

Fig: Memory connection to the CPU. The CPU's 16-bit address bus supplies lines 1-7 to every 128 x 8 RAM chip (AD7) and lines 1-9 to the 512 x 8 ROM (AD9); lines 8-9 feed the 2x4 decoder whose outputs drive CS1 of RAM 1-4; line 10 selects between the RAMs (low) and the ROM (high); RD and WR from the CPU go to each RAM; all chips share the data bus.


To represent them properly we take the help of a memory address map:

Component | Hexadecimal address | Address bus (bits 10 9 8 7 6 5 4 3 2 1)
RAM 1     | 0000 - 007F         | 0 0 0 x x x x x x x
RAM 2     | 0080 - 00FF         | 0 0 1 x x x x x x x
RAM 3     | 0100 - 017F         | 0 1 0 x x x x x x x
RAM 4     | 0180 - 01FF         | 0 1 1 x x x x x x x
ROM       | 0200 - 03FF         | 1 x x x x x x x x x


We use x (don't care) for bits 1-7 of the RAM rows and bits 1-9 of the ROM row, since whatever their value, the address still lies within that particular chip. Bits 8 and 9 select among the 4 RAMs:
- RAM1: 0 0
- RAM2: 0 1
- RAM3: 1 0
- RAM4: 1 1
Bit 10 distinguishes RAM from ROM: when bit 10 is low one of the RAMs is enabled, otherwise the ROM is enabled. A hedged C sketch of this decoding follows.
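A small C function modelling the address map above; it is an illustration of the decoding, not real hardware. Address line 1 is treated as the least-significant bit.

    #include <stdio.h>

    /* Bit 10 picks RAM vs ROM, bits 9-8 pick one of the four RAM chips
       (the 2x4 decoder), and the low bits address a word in the chip. */
    void decode(unsigned addr)
    {
        if (addr & 0x200)                       /* bit 10 high -> ROM */
            printf("ROM  word %u\n", addr & 0x1FF);   /* bits 9-1     */
        else
            printf("RAM%u word %u\n",
                   ((addr >> 7) & 3) + 1,       /* bits 9-8            */
                   addr & 0x7F);                /* bits 7-1            */
    }

    int main(void)
    {
        decode(0x0000);   /* RAM1 word 0   */
        decode(0x0080);   /* RAM2 word 0   */
        decode(0x01FF);   /* RAM4 word 127 */
        decode(0x0200);   /* ROM  word 0   */
        return 0;
    }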


Lecture 17: Static RAM and Dynamic RAM Associative Memory

In the last lecture we discussed RAM and ROM chips and their expansion mechanisms. Let's discuss the various types of RAM.
Static random access memory (SRAM) is a type of semiconductor memory where the word static indicates that, unlike dynamic RAM (DRAM), it does not need to be periodically refreshed, as SRAM uses bistable latching circuitry to store each bit. SRAM exhibits data remanence, but is still volatile in the conventional sense that data is eventually lost when the memory is not powered.
Dynamic random access memory (DRAM) is a type of random access memory that stores each bit of data in a separate capacitor within an integrated circuit. Since real capacitors leak charge, the information eventually fades unless the capacitor charge is refreshed periodically. Because of this refresh requirement, it is a dynamic memory, as opposed to SRAM and other static memory. The advantage of DRAM is its structural simplicity: only one transistor and a capacitor are required per bit, compared to six transistors in SRAM. This allows DRAM to reach very high density. Unlike flash memory, it is volatile memory, since it loses its data when the power supply is removed.
Static RAM holds its data without external refresh for as long as power is supplied to the circuit, in contrast to dynamic RAM (DRAM), which must be refreshed many times per second to hold its data contents. SRAMs are used for specific applications within the PC, where their strengths outweigh their weaknesses compared to DRAM:

- Simplicity: SRAMs don't require external refresh circuitry or other work to keep their data intact.
- Speed: SRAM is faster than DRAM.

In contrast, SRAMs have the following weaknesses, compared to DRAMs:


- Cost: SRAM is, byte for byte, several times more expensive than DRAM.
- Size: SRAMs take up much more space than DRAMs (which is part of why the cost is higher).

We normally access memory by address, but sometimes we get the chance to access memory by content value rather than by address, as in search mechanisms. This gives rise to the concept of associative memory.
Associative memory: the type of memory where we access data by searching or matching the contents rather than by an address value.
- Accessed by the content of the data rather than by an address
- Also called content addressable memory (CAM)
Fig: Associative memory block diagram. The argument register (A) holds the word to search for, the key register (K) masks which bits participate, and the associative memory array and logic (m words, n bits per word) compares A against every word, setting the corresponding bit of the match register (M); read and write lines complete the interface.
The data we need to search for is kept in the argument register, which has the same length as the word size: since we have m words of n bits, the argument register is n bits long. The match register M gives the matching result by setting the corresponding bits high; it has one bit per word of memory, so it is m bits long. For example, suppose we have to search for 1011 among the words: 1011, 0111, 1000, 1100, 0010, 1011, 0111, 1011.

Here 1011 occurs 3 times, as the 1st, 6th and 8th words. Thus the match register is high in the 1st, 6th and 8th positions and low elsewhere.

Fig: Associative search with argument register = 1011 and key register = 1111. The words 1011, 0111, 1000, 1100, 0010, 1011, 0111, 1011 yield match register bits 1 0 0 0 0 1 0 1 (words 1, 6 and 8 match).
Here the key register holds 1111, meaning all bits of the argument register are compared against every word of the associative memory. If we need to check only some bits, say to find all words ending in 1, the key register is set to 0001 (matching only the last bit), and the match register becomes 11000111. A hedged C sketch of this masked search follows.
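A minimal C model of the masked CAM search, using the 4-bit words from the example; the parallel compare is written as a loop only for illustration. Bit 0 of the returned mask corresponds to word 1.

    #include <stdio.h>

    #define WORDS 8
    #define MASK  0xF       /* 4-bit words, as in the example */

    unsigned cam[WORDS] = { 0xB, 0x7, 0x8, 0xC, 0x2, 0xB, 0x7, 0xB };

    /* Compare the argument against every word, but only in the bit
       positions where the key register has a 1; bit i of the result
       is set exactly when word i matches.                           */
    unsigned search(unsigned arg, unsigned key)
    {
        unsigned match = 0;
        for (int i = 0; i < WORDS; i++)
            if (((cam[i] ^ arg) & key & MASK) == 0)
                match |= 1u << i;
        return match;
    }

    int main(void)
    {
        /* key 1111: full compare; 1011 matches words 1, 6, 8 -> 0xA1 */
        printf("%#x\n", search(0xB, 0xF));
        /* key 0001: only the last bit; "words ending in 1" -> 0xE3   */
        printf("%#x\n", search(0x1, 0x1));
        return 0;
    }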


Lecture 18: Cache Memory: o Locality of reference o Associative mapped cache organizations o Direct mapped cache organizations o Set associative mapped cache organizations

Cache Memory: the basic idea of cache organization is that by keeping the most frequently accessed instructions and data in fast memory, i.e. the cache, the average memory access time is reduced. Examples are important subprograms, iterative procedures, etc. If these active portions of the program and data are placed in a fast small memory, the average access time is reduced, and with it the total execution time of the program. Such a fast small memory is referred to as a cache memory, and it is placed between main memory and the CPU. When the CPU needs to access memory, the cache is examined first: if the word is found in the cache, it is read from this fast memory, which is called a cache hit; otherwise we access main memory, which is called a cache miss. The performance of cache memory is frequently measured in terms of the hit ratio. Analysis of a large number of typical programs has shown that references to memory in any given interval of time tend to be confined to a few localized areas of memory; this phenomenon is known as locality of reference.
The types of locality exploited by the cache are:
- Temporal locality: information that will be used in the near future is likely to be in use already (e.g. reuse of information in loops).
- Spatial locality: if a word is accessed, adjacent (nearby) words are likely to be accessed soon (e.g. related data items such as arrays are usually stored together; instructions are executed sequentially).
- Sequential locality: in a typical program the execution of instructions follows a sequential order unless branch instructions create out-of-order execution; this also involves spatial locality, since sequential instructions are stored near each other.
Performance of a cache memory system:

Hit ratio (h): the fraction of memory accesses satisfied by the cache.
Te: effective memory access time of the cache memory system
Tc: cache access time
Tm: main memory access time
Te = Tc + (1 − h) Tm
Example: Tc = 0.4 µs, Tm = 1.2 µs, h = 0.85, so Te = 0.4 + (1 − 0.85) × 1.2 = 0.58 µs.
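The same worked example in C, as a one-line sketch of the formula:

    #include <stdio.h>

    /* Te = Tc + (1 - h) * Tm : every access pays the cache lookup, and
       a miss (probability 1 - h) additionally pays the memory access. */
    double effective_time(double tc, double tm, double h)
    {
        return tc + (1.0 - h) * tm;
    }

    int main(void)
    {
        /* Tc = 0.4 us, Tm = 1.2 us, h = 0.85 */
        printf("Te = %.2f us\n", effective_time(0.4, 1.2, 0.85)); /* 0.58 */
        return 0;
    }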

MEMORY AND CACHE MAPPING


Mapping function: the specification of the correspondence between main memory blocks and cache blocks. There are 3 types of mapping mechanisms:
- Associative mapping
- Direct mapping
- Set-associative mapping

ASSOCIATIVE MAPPING
- Any block of memory can be stored in any cache location → most flexible (we can place a word of any address value anywhere)
- The mapping table is implemented in an associative memory → fast, but very expensive (the associative logic is costly, and the stored word grows by the number of bits in the main-memory address)
- The mapping table stores both the address and the content of the memory word


Suppose main memory holds the words:

Memory address (octal) | Memory data
00000 | 1220
00777 | 2340
01000 | 3450
01777 | 4560
02000 | 5670
02777 | 6710

We fetch the important words and place them into the cache; but since the CPU supplies only address values, the address must be stored in the cache too, as part of the content. The address to be fetched is placed in the argument register and searched for in the cache as content, using the associative-memory concept; when it is found, the corresponding data is returned.
Fig: Associative mapping cache (CAM) holding address-data pairs, e.g. (01000, 3450), (02777, 6710), (22235, 1234), searched by the 15-bit address in the argument register.

DIRECT MAPPING
- Each memory block has only one place where it can be loaded in the cache
- The mapping table is made of RAM instead of CAM
- An n-bit memory address consists of 2 parts: k bits of index field and n − k bits of tag field
- The n-bit address accesses main memory, and the k-bit index accesses the cache
Addressing relationships:

Here we divide the main-memory address into 2 fields:
- Index: as many bits as are required to address the cache (9 bits for a 512-word cache).
- Tag: the remaining address bits (15 − 9 = 6 bits).
Fig: Addressing relationships between main memory and cache. Main memory is 32K x 12 (15-bit address, octal 00000-77777, 12-bit data); the cache is 512 x 12 (9-bit address, octal 000-777, 12-bit data). The 15-bit address splits into a 6-bit tag and a 9-bit index.
The reason for this division is that a main-memory word is placed at the cache address equal to its index bits. Example: main memory holds 1220 at address 00000; splitting this address gives tag 00 and index 000, so 1220 is saved at cache address 000. But the data at address 01000 also has index 000, so to distinguish between them we save the tag value along with the data in the cache. Similarly, 2340 at main-memory address 00777 is saved at cache address 777 with tag 00, and so on.
Fig: Direct-mapped cache organization. Main memory: 00000→1220, 00777→2340, 01000→3450, 01777→4560, 02000→5670, 02777→6710. Cache: index address 000 holds tag 00, data 1220; index address 777 holds tag 02, data 6710.

Problem: in this scheme we cannot simultaneously hold the data of addresses 00000, 01000 and 02000, since they share index 000; we must replace one to store another. Similarly, we cannot hold both 00777 and 01777. That means that even when free words remain in the cache, we cannot store two words with the same index. A hedged C sketch of a direct-mapped lookup follows.
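A minimal C sketch of the tag/index split and the conflict just described, using the 6-bit tag and 9-bit index from the example; the structure and names are ours.

    #include <stdio.h>

    #define INDEX_BITS 9
    #define CACHE_WORDS (1 << INDEX_BITS)          /* 512 words */

    struct line { unsigned tag, data; int valid; };
    struct line cache[CACHE_WORDS];

    /* Split a 15-bit address into a 9-bit index and 6-bit tag; hit
       only if the stored tag matches.                              */
    int lookup(unsigned addr, unsigned *data)
    {
        unsigned index = addr & (CACHE_WORDS - 1); /* low 9 bits       */
        unsigned tag   = addr >> INDEX_BITS;       /* remaining 6 bits */

        if (cache[index].valid && cache[index].tag == tag) {
            *data = cache[index].data;             /* cache hit */
            return 1;
        }
        return 0;                                  /* miss: go to main memory */
    }

    void fill(unsigned addr, unsigned data)        /* load a word after a miss */
    {
        unsigned index = addr & (CACHE_WORDS - 1);
        cache[index].tag   = addr >> INDEX_BITS;
        cache[index].data  = data;
        cache[index].valid = 1;
    }

    int main(void)
    {
        unsigned d;
        fill(00000, 01220);                     /* octal, as in the text      */
        fill(01000, 03450);                     /* same index 000: evicts it  */
        printf("hit=%d\n", lookup(00000, &d));  /* 0: lost to the conflict    */
        printf("hit=%d\n", lookup(01000, &d));  /* 1: d == 03450              */
        return 0;
    }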


On the other hand, direct mapping is relatively inexpensive, and the stored word is smaller than in the associative organization. To avoid the problems of both the direct and associative organizations we use the third concept, set-associative mapping.
Fig: Two-way set-associative cache. Index 000 holds (tag 01, data 3450) and (tag 02, data 5670); index 777 holds (tag 02, data 6710) and (tag 00, data 2340).

In this organization we can save more than one data value under the same index, storing the tag alongside each data value.

Lecture 17,18,19 and 20: Cache to memory write o Write back policy o Write through policy Cache Coherence o Software precautions o Snoopy controller

In the last lecture we studied how, on the basis of the locality principle, important or repetitive data is placed in cache memory. We can change data in the cache, but since main memory holds a copy too, we need to make the change in main memory as well. In this lecture we study how this updating is done and what problems we face.
Cache to memory write: once we access data in the cache and change it, we need to reflect the change in main memory too. Two policies are used.
Write-through: when writing into memory,
- on a hit, both the cache and memory are written in parallel;
- on a miss, memory is written;
- for a read miss, the missing block may be loaded into a cache block.
Memory is always up to date, which is important when the CPU and DMA I/O are both executing; but it is slow, due to the memory access time on every write.
Write-back (copy-back): when writing into memory,
- on a hit, only the cache is written;
- on a miss, the missing block is brought into the cache and written there;
- for a read miss, the candidate block must first be written back to memory.
Memory is not always up to date: the same item in cache and memory may have different values. A hedged C sketch of the two policies follows.
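A minimal C sketch contrasting the two policies. The dirty flag, array sizes and function names are illustrative assumptions, not a particular machine's design.

    #include <stdio.h>

    unsigned cache_data[512], main_mem[32768];
    int dirty[512];

    /* Write-through: on a hit, update cache and main memory together,
       so memory is always current (matters when DMA also reads it).  */
    void write_through(unsigned index, unsigned addr, unsigned value)
    {
        cache_data[index] = value;
        main_mem[addr]    = value;    /* memory written on every store */
    }

    /* Write-back: on a hit, update only the cache and mark the line
       dirty; memory catches up only when the line is evicted.        */
    void write_back(unsigned index, unsigned value)
    {
        cache_data[index] = value;
        dirty[index] = 1;             /* memory is stale from now on   */
    }

    void evict(unsigned index, unsigned addr)
    {
        if (dirty[index]) {           /* copy back only if modified    */
            main_mem[addr] = cache_data[index];
            dirty[index] = 0;
        }
    }

    int main(void)
    {
        write_through(0, 0, 7);       /* cache and memory both hold 7  */
        write_back(1, 9);             /* only the cache holds 9        */
        evict(1, 100);                /* now main_mem[100] catches up  */
        printf("%u %u\n", main_mem[0], main_mem[100]);   /* 7 9 */
        return 0;
    }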


The mechanism is chosen depending on 2 important considerations: (1) how frequent the changes in cache memory are, and (2) whether main memory is also used by some other cache. The latter case arises in multiprocessor systems, where all the processors use the same main memory but have their own separate caches. If a part of main memory containing the value of X is shared by all the processors' caches, then whichever policy we use, write-through or write-back, we can end up with inconsistent data; this problem is known as cache coherence.
Cache coherence:

Maintaining cache coherency:
- Shared cache: disallow private caches entirely; simple, but adds access-time delay.
- Software approaches: only read-only data are cacheable; the private cache holds read-only data, and shared writable data are not cacheable. The compiler tags data as cacheable or non-cacheable. This degrades performance due to the software overhead.

- Centralized global table: the status of each memory block is maintained in a central global table as RO (read-only) or RW (read/write). All caches may hold copies of RO blocks, but only one cache can hold a copy of an RW block.
- Hardware approaches: the snoopy cache controller. Cache controllers monitor all bus requests from CPUs and IOPs, and all caches attached to the bus monitor the write operations. When a word in a cache is written, memory is also updated (write-through); the local snoopy controllers in all other caches then check whether they hold a copy of that word, and if so, that location is marked invalid (a future reference to it causes a cache miss).

Lecture 20: Goals of parallelism o Segmentation of the processor into functional units Amdahl's law

Parallel Processing: in the last unit we discussed the various types of instructions and the microinstructions generated; here we discuss how instructions can be executed in parallel. Parallel processing is the term used for the simultaneous execution of 2 or more instructions. The levels of parallel processing are:
- Job or program level
- Task or procedure level
- Inter-instruction level
- Intra-instruction level

Parallel processing means the execution of concurrent events in the computing process to achieve faster computational speed. The first type of parallelism is implemented by increasing the number of processors; the classification, given by M. J. Flynn, is based on instruction and data streams: SISD, SIMD, MISD, MIMD. The other technique is, instead of using more than one processor, to divide the work among several processing units within one processor, called segmentation of the processor into functional units.

Fig: Processor with multiple functional units. The processor registers (connected to memory) feed independent units: adder/subtractor, multiplier/divisor, logic unit, shift unit, incrementer, floating-point multiply, floating-point add/subtract, and floating-point divide.
Through this we save the cost of adding more and more processors: one processor is divided into various functional units that can work simultaneously on different types of instructions. Thus an instruction requiring a shift operation can run simultaneously with an instruction requiring an addition, on a single processor. Another way to improve performance is pipelining: the technique of decomposing a sequential process into suboperations, with each subprocess executed in a special dedicated segment that operates concurrently with all other segments. Whether we add more processors or segment one processor into functional units, the speedup we achieve depends on the percentage of the work that can execute in parallel. This observation gave birth to Amdahl's law.


Amdahl's law, also known as Amdahl's argument, is named after computer architect Gene Amdahl and is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup from using multiple processors: the speedup of a program running on multiple processors is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours on a single processor core, and a particular 1-hour portion cannot be parallelized while the remaining 19 hours (95%) can, then regardless of how many processors we devote to the parallelized execution, the minimum execution time cannot fall below that critical 1 hour; the speedup is therefore limited to 20x.
Gene Amdahl, in his 1967 paper "Validity of the single processor approach to achieving large scale computing capabilities", stated: if F is the fraction of the calculation that is sequential and (1 − F) is the fraction that can be parallelized, then the maximum speedup achievable with P processors is

Speedup = 1 / (F + (1 − F)/P)

Examples:
1. If 50% of the work can be parallelized, adding one more processor gives a speedup of only 1.33x rather than 2x: 1/(0.50 + 0.50/2) = 1.33. Even with 5 processors the speedup is only 1.67x.
2. If 75% of the work can be parallelized, the speedup with one extra processor is 1.6x, more than the 1.33x of the 50% case.

Percent parallel | P = 2 | P = 5 | P = 10 | P = 100
0%               | 1     | 1     | 1      | 1
50%              | 1.33  | 1.667 | 1.81   | 1.98
75%              | 1.6   | 2.4   | 3.08   | 3.88
90%              | 1.82  | 3.57  | 5.26   | 9.17
95%              | 1.9   | 4.17  | 6.9    | 16.8
100%             | 2     | 5     | 10     | 100

Thus speedup depends directly on the percentage of the parallel portion and the number of processors, but only up to a limit. The extreme cases are: with a 0% parallel portion, no matter how many processors there are, there is no speedup.

With a 100% parallel portion, the maximum speedup equals the number of processors used.
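The formula as a small C function, reproducing the values in the table above:

    #include <stdio.h>

    /* Amdahl's law: F is the serial fraction, P the processor count. */
    double speedup(double f, int p)
    {
        return 1.0 / (f + (1.0 - f) / p);
    }

    int main(void)
    {
        printf("%.3f\n", speedup(0.50, 2));   /* 50%% parallel, P=2 : 1.333 */
        printf("%.3f\n", speedup(0.50, 5));   /* 50%% parallel, P=5 : 1.667 */
        printf("%.3f\n", speedup(0.25, 2));   /* 75%% parallel, P=2 : 1.600 */
        return 0;
    }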


Lecture 21: Pipelining or Pipeline processing o Example o Data table

Another way to improve performance is through pipelining. Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments. Example: we have the instruction A * B + C, to be executed for a series of data values, represented as:

Ai * Bi + Ci

for i = 1, 2, 3, ... , 7

Fig: Three-segment pipeline. Segment 1 loads Ai into R1 and Bi into R2 from memory; segment 2 feeds R1 and R2 to the multiplier (product into R3) while Ci is loaded into R4; segment 3 feeds R3 and R4 to the adder, with the result in R5.
Here we have divided the execution of the instruction into steps (segments); while one segment works, the others are not left idle: segments work simultaneously, but on different data values. If we executed the data sequentially we would need 7 * 3 = 21 clock pulses (one pulse per segment); to decrease the clock pulses we take the help of pipelining.

This is implemented as follows.
Step 1: in the first clock pulse, R1 is loaded with the first value of A (A1) and R2 with B1.
Step 2: in the second clock pulse, the values of R1 and R2 feed the multiplier (product into R3) and C1 is loaded into R4. R1 and R2 are then free, so A2 and B2 are loaded into them: in pulse 2 both segment 1 and segment 2 are working.
Step 3: in the third clock pulse, R3 (A1*B1) and R4 (C1) are given to the adder. The multiplier, R3 and R4 become free, so segment 2 multiplies A2 and B2 and loads C2; segment 1, also free, takes A3 and B3 into R1 and R2.
Similarly, in the next step: segment 1 loads A4 and B4 into R1 and R2; segment 2 multiplies A3 and B3 with C3 loaded into R4; segment 3 adds A2*B2 and C2.

Data table:
Clock pulse | Segment 1 (R1, R2) | Segment 2 (R3, R4) | Segment 3 (R5)
1           | A1, B1             | ---                | ---
2           | A2, B2             | A1*B1, C1          | ---
3           | A3, B3             | A2*B2, C2          | A1*B1 + C1
4           | A4, B4             | A3*B3, C3          | A2*B2 + C2
5           | A5, B5             | A4*B4, C4          | A3*B3 + C3
6           | A6, B6             | A5*B5, C5          | A4*B4 + C4
7           | A7, B7             | A6*B6, C6          | A5*B5 + C5
8           | ---                | A7*B7, C7          | A6*B6 + C6
9           | ---                | ---                | A7*B7 + C7

So the clock-pulse count for sequential execution, (number of steps) * (number of data streams), is replaced by (number of steps) + (number of data streams) − 1.
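A short C sketch of the timing, reproducing the data table schedule; the loop stands in for the hardware clock and is purely illustrative.

    #include <stdio.h>

    /* A k-segment pipeline finishes n data sets in k + n - 1 clock
       pulses, versus k * n when executed sequentially.             */
    int main(void)
    {
        int k = 3, n = 7;                   /* 3 segments, 7 data sets */
        printf("sequential: %d pulses\n", k * n);       /* 21 */
        printf("pipelined : %d pulses\n", k + n - 1);   /*  9 */

        /* Segment s holds data set (t - s + 1) at clock pulse t. */
        for (int t = 1; t <= k + n - 1; t++) {
            printf("pulse %d:", t);
            for (int s = 1; s <= k; s++) {
                int item = t - s + 1;
                if (item >= 1 && item <= n) printf(" seg%d=item%d", s, item);
                else                        printf(" seg%d=----", s);
            }
            printf("\n");
        }
        return 0;
    }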


Lecture 22: Instruction level parallelism o Instruction steps o Example o Flowchart Pipelining hazards

Instruction Pipelining: the pipelining discussed in the last lecture was an example of SIMD (a single instruction over various data values). We can also segment the steps of an instruction itself and execute them in a pipeline; this is known as instruction pipelining or instruction-level parallelism. The steps of a particular instruction are:
[1] Fetch the instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place

* Some instructions skip some phases
* Effective address calculation can be done as part of the decoding phase
* Storage of the operation result into a register is done automatically in the execution phase
==> 4-stage pipeline:
[1] FI: Fetch the instruction from memory
[2] DA: Decode the instruction and calculate the effective address of the operand
[3] FO: Fetch the operand
[4] EX: Execute the operation

Say we have 3 instructions; executed sequentially, the timing is:
i:    FI DA FO EX
i+1:              FI DA FO EX
i+2:                          FI DA FO EX

But if we use pipelining the scenario becomes:
i:    FI DA FO EX
i+1:     FI DA FO EX
i+2:        FI DA FO EX

But there are exceptions, as in the case of branching and interrupts. Let's discuss this with the help of a flowchart.

Fig: Timing of the instruction pipeline with branching (steps 1-13, instructions 1-7). Each instruction passes through FI, DA, FO, EX one step behind its predecessor. Instruction 3 is a branch: after it is decoded, the already-fetched instruction 4 is discarded, FI idles until the branch completes EX, and fetching resumes from the branch target.

Here instruction pipelining proceeds sequentially up to instruction 3. When the 3rd instruction is decoded, we learn that it branches to some other address, so the 4th instruction already fetched is not the next instruction to be executed. We therefore discard it and wait until the 3rd instruction has executed, which gives us the address to be executed next; that instruction becomes the new 4th instruction, and the pipelining continues.

Fig: Four-segment instruction pipeline. Segment 1: fetch the instruction from memory. Segment 2: decode the instruction and calculate the effective address, then test for a branch; on a branch, update PC and empty the pipe. Segment 3: fetch the operand from memory. Segment 4: execute the instruction; on an interrupt, handle the interrupt and update PC before continuing.

Limitations of pipelining / pipeline hazards: pipelining has various advantages and uses, but there are some problem areas. The major hazards faced in pipelined execution are:

Structural hazards: occur when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute. Example: with one memory port, a data fetch and an instruction fetch cannot be initiated in the same clock cycle:
i:    FI DA FO EX
i+1:     FI DA FO EX
i+2:        stall stall FI DA FO EX
The pipeline is stalled for a structural hazard: two loads with a one-port memory.

A two-port memory would serve both accesses without a stall.
Data hazards: occur when the execution of an instruction depends on the result of a previous instruction, e.g.
ADD R1, R2, R3
SUB R4, R1, R5
Data hazards can be dealt with by either hardware or software techniques.
Hardware techniques:
- Interlock: hardware detects the data dependency and delays the scheduling of the dependent instruction by stalling enough clock cycles.
- (Operand) forwarding (bypassing, short-circuiting): a data path routes a value from its source (usually the ALU) directly to its user, bypassing the designated register; this allows the value to be used at an earlier pipeline stage than would otherwise be possible.
Software technique:
- Instruction scheduling by the compiler (e.g. for delayed load).
Control hazards: branches break the sequential instruction stream. Techniques to handle them include:

- Prefetch target instruction: fetch instructions from both streams, branch-not-taken and branch-taken; both are saved until the branch is executed, then the right instruction stream is selected and the wrong stream discarded.

- Branch target buffer (BTB, an associative memory): each entry holds the address of a previously executed branch together with the target instruction and the next few instructions. When fetching an instruction, the BTB is searched; if found, the instruction stream is fetched from the BTB; if not, a new stream is fetched and the BTB updated.
- Loop buffer (a high-speed register file): stores an entire loop, allowing it to execute without accessing memory.
- Branch prediction: guess the branch condition and fetch an instruction stream based on the guess; a correct guess eliminates the branch penalty.
- Delayed branch: the compiler detects the branches and rearranges the instruction sequence, inserting useful instructions that keep the pipeline busy in the presence of a branch instruction.

Lecture 23: Vector Processors Super computers Memory Interleaving Array Processors o SIMD array processor o Attached array processor

There are various types of processors which are specialized for particular operations.
Vector processors: a vector processor can process vectors, and related data structures such as matrices and multi-dimensional arrays, much faster than a conventional computer. Vector processing applications are problems that can be efficiently formulated in terms of vectors:
- Long-range weather forecasting
- Petroleum exploration
- Seismic data analysis
- Medical diagnosis
- Aerodynamics and space flight simulations
- Artificial intelligence and expert systems
- Mapping the human genome
- Image processing
Vector processors may also be pipelined. Example:

Fortran loop:
      DO 20 I = 1, 100
  20  C(I) = B(I) + A(I)

Conventional (scalar) computer:
      Initialize I = 0
  20  Read A(I)
      Read B(I)
      Store C(I) = A(I) + B(I)
      Increment I = I + 1
      If I <= 100 go to 20

Vector computer:
      C(1:100) = A(1:100) + B(1:100)
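The same contrast in C (the notes use Fortran; this is an equivalent sketch, with the vector instruction described in a comment since standard C has no vector statement):

    #include <stdio.h>

    #define N 100

    float a[N], b[N], c[N];

    int main(void)
    {
        /* Conventional (scalar) computer: one element per trip through
           read-read-add-store, plus loop bookkeeping every iteration. */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        /* A vector computer issues the whole statement
           C(1:100) = A(1:100) + B(1:100) as ONE instruction and
           streams all 100 element pairs through a pipelined adder.   */
        printf("%f\n", c[0]);
        return 0;
    }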
Supercomputer: a broad term for one of the fastest computers currently available. Such computers are typically used for number crunching, including scientific simulations, (animated) graphics, analysis of geological data (e.g. in petrochemical prospecting), structural analysis, computational fluid dynamics, physics, chemistry, electronic design, nuclear energy research and meteorology. Perhaps the best-known supercomputer manufacturer is Cray Research. The chief difference between a supercomputer and a mainframe is that a supercomputer channels all its power into executing a few programs as fast as possible, whereas a mainframe uses its power to execute many programs concurrently.
A supercomputer is a computer that leads the world in processing capacity, particularly speed of calculation, at the time of its introduction. The first supercomputers were introduced in the 1960s, led primarily by Seymour Cray at Control Data Corporation (CDC), which led the market into the 1970s until Cray split off to form his own company, Cray Research, which then took over the market. In the 1980s a large number of smaller competitors entered the market, a parallel to the creation of the minicomputer market a decade earlier; many of them disappeared in the mid-1990s "supercomputer market crash". Today supercomputers are typically one-off custom designs produced by "traditional" companies such as IBM and HP, who purchased many of the 1980s companies to gain their experience.

Technologies developed for supercomputers include:

- Vector processing: a vector processor, or array processor, is a CPU design whose instruction set includes operations that can perform mathematical operations on multiple data elements simultaneously, in contrast to a scalar processor, which handles one element at a time using multiple instructions. The vast majority of CPUs are scalar (or close to it). Vector processors were common in scientific computing, forming the basis of most supercomputers through the 1980s and into the 1990s, but general increases in performance and changes in processor design led to the near disappearance of the vector processor as a general-purpose CPU.
- Liquid cooling: an uncommon practice of submersing the computer's components in a thermally conductive liquid. Computers cooled in this manner generally do not require fans or pumps, and may be cooled exclusively by passive heat exchange between the computer's parts, the cooling fluid and the ambient air.
- Non-uniform memory access (NUMA): a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, i.e. memory local to another processor or memory shared between processors.
- Striped disks (the first instance of what was later called RAID): in computer data storage, data striping is the segmentation of logically sequential data, such as a single file, so that segments can be assigned to multiple physical devices (usually disk drives in the case of RAID storage, or network interfaces in the case of grid-oriented storage) in round-robin fashion and thus written concurrently.
- Parallel filesystems: in computing, a file system is a method for storing and organizing computer files and the data they contain to make them easy to find and access. File systems may use a data storage device such as a hard disk or CD-ROM, maintaining the physical location of the files; they may provide access to data on a file server by acting as clients for a network protocol (e.g. NFS, SMB or 9P); or they may be virtual, existing only as an access method for virtual data (e.g. procfs). A file system is distinguished from a directory service and a registry.

Memory Interleaving

Also known as multiple memory modules and interleaving. Memory interleaving means combining several memory modules and distributing consecutive addresses among them, so that successive accesses go to different modules and the modules can work concurrently, both for assigning addresses and for interchanging data. A short sketch of the address mapping follows.
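A hedged C illustration of low-order interleaving with an assumed four modules; the module/word split is the standard address arithmetic, nothing machine-specific.

    #include <stdio.h>

    #define MODULES 4   /* assumed: four modules, low-order interleaving */

    int main(void)
    {
        /* Consecutive addresses land in different modules, so a burst
           of sequential accesses proceeds in all modules concurrently. */
        for (unsigned addr = 0; addr < 8; addr++)
            printf("address %u -> module %u, word %u\n",
                   addr, addr % MODULES, addr / MODULES);
        return 0;
    }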

Array Processors: a microprocessor that executes one instruction at a time, but on an array or table of data rather than on single data elements. An array processor performs a single instruction in multiple execution units in the same clock cycle; the different execution units apply the same instruction to different vectors of the array. Features of an array processor:
- Parallel execution units for processing different vectors of the arrays
- Memory interleaving, with n memory address registers and n memory data registers in the case of k pipelines, and vector register files
A computer/processor with an architecture especially designed for processing arrays (e.g. matrices) of numbers includes a number of processing elements (say 64 x 64) working simultaneously, each handling one element of the array, so that a single operation applies to all elements of the array in parallel. To obtain the same effect in a conventional processor, the operation must be applied to each element of the array sequentially, and consequently much more slowly.

An array processor may be built as a self-contained unit attached to a main computer via an I/O port or internal bus; alternatively, it may be a distributed array processor whose processing elements are distributed throughout, and closely linked to, a section of the computer's memory. Array processors are very powerful tools for handling problems with a high degree of parallelism. They do, however, demand a modified approach to programming: the conversion of conventional (sequential) programs to serve array processors is not a trivial task, and it is sometimes necessary to select different (parallel) algorithms to suit the parallel approach.
Array processors are most importantly implemented in 2 ways:
SIMD array processor: a computer with multiple processing units operating in parallel. The processing units are synchronized to perform the same operation under the control of a common control unit, providing a single instruction stream, multiple data stream organization. Data-level parallelism is exploited in the array processor: for example, the multiplier unit pipelines operate in parallel, computing x[i] * y[i] in a number of parallel units, with the multiple functional units simultaneously performing the same action on different data.

Fig: SIMD array processor.
Attached array processor: the components of this structure are:
- General-purpose computer: used for general processing
- Main memory: memory attached to the general-purpose computer
- I/O interface: connects the two processors
- Attached array processor: the array processor required for heavy computations
- Local memory: attached to the array processor

The attached array processor has an input/output interface to a common processor and another interface with a local memory; the local memory interconnects with main memory.

Fig: Attached Array Processor


Lecture 24: Instruction Codes Type of Instructions o Memory reference type of Instructions o Register reference type of Instructions o I/O reference type of Instructions

Instruction Codes: in the last topics we studied the various types of organization of our computer; today we study the various types of instructions supported by it. First, the cycle of an instruction. In the Basic Computer, a machine instruction is executed in the following cycle:
1. Fetch an instruction from memory
2. Decode the instruction
3. Read the effective address from memory if the instruction has an indirect address
4. Execute the instruction
After an instruction is executed, the cycle starts again at step 1 for the next instruction.
Note: every processor has its own (different) instruction cycle.

The Basic Computer memory-reference instruction format (opcode = 000 ~ 110) is:
bit 15: I | bits 14-12: opcode | bits 11-0: address

These instructions refer to memory to fetch their operands, and are therefore called memory-reference instructions. Here I is the mode bit, which tells us whether the operand is fetched by direct or indirect addressing: I = 0 means direct address, I = 1 means indirect address. The opcode field tells us the type of operation to be performed; being 3 bits, it allows at most 2^3 = 8 memory-reference operations. The possible memory-reference operations are:

Symbol | Decoder | Symbolic description
AND    | D0      | AC ← AC ∧ M[AR]
ADD    | D1      | AC ← AC + M[AR], E ← Cout
LDA    | D2      | AC ← M[AR]
STA    | D3      | M[AR] ← AC
BUN    | D4      | PC ← AR
BSA    | D5      | M[AR] ← PC, PC ← AR + 1
ISZ    | D6      | M[AR] ← M[AR] + 1, if M[AR] + 1 = 0 then PC ← PC + 1
Address Address is the field which tells us the address on which we have to fetch the operand. The effective address of the instruction is in AR and was placed there during timing signal T2 when I = 0, or during timing signal T3 when I = 1 - Memory cycle is assumed to be short enough to complete in a CPU cycle - The execution of MR instruction starts with T4 AND to AC D0T4: DR M[AR] D0T5: AC AC DR, SC 0 ADD to AC D1T4: DR M[AR] D1T5: AC AC + DR, E Cout, SC 0 LDA: Load to AC D2T4: DR M[AR] D2T5: AC DR, SC 0 STA: Store AC D3T4: M[AR] AC, SC 0 BUN: Branch Unconditionally D4T4: PC AR, SC 0 BSA: Branch and Save Return Address M[AR] PC, PC AR + 1 Read operand AND with AC Read operand Add to AC and store carry in E


Fig: Example of BSA instruction execution — memory before and after. The instruction "0 BSA 135" is stored at address 20, with the next instruction at address 21. Executing it saves the return address in memory (M[135] ← 21, with AR = 135), and execution continues at PC = 136, where the subroutine begins. The subroutine returns through an indirect branch, "1 BUN 135", which reloads PC with the saved return address 21.
BSA:
  D5T4: M[AR] ← PC, AR ← AR + 1
  D5T5: PC ← AR, SC ← 0
ISZ: Increment and Skip-if-Zero
  D6T4: DR ← M[AR]
  D6T5: DR ← DR + 1
  D6T6: M[AR] ← DR, if (DR = 0) then (PC ← PC + 1), SC ← 0
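The control-flow instructions can be sketched the same way (again a simulation for illustration, not the hardware itself; the 12-bit PC mask is an assumption from the 4096-word memory):

```python
def execute_flow(d, mem, AR, PC):
    """BUN/BSA/ISZ from the register transfers above (sketch)."""
    if d == 4:                           # BUN: PC <- AR
        PC = AR
    elif d == 5:                         # BSA: save return address, branch
        mem[AR] = PC                     # M[AR] <- PC
        PC = (AR + 1) & 0xFFF            # PC <- AR + 1
    elif d == 6:                         # ISZ: increment and skip if zero
        mem[AR] = (mem[AR] + 1) & 0xFFFF
        if mem[AR] == 0:
            PC = (PC + 1) & 0xFFF
    return PC

mem = {135: 0}
PC = execute_flow(5, mem, AR=135, PC=21)   # "BSA 135" fetched from address 20
print(mem[135], PC)                        # 21 136, as in the figure above
```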


Fig: Flowchart for the memory-reference instructions — each decoder output (D0 AND, D1 ADD, D2 LDA, D3 STA, D4 BUN, D5 BSA, D6 ISZ) triggers the register transfers listed above during timing signals T4-T6, and every path ends with SC ← 0.

Register Reference Instructions
The instruction format to represent the register-reference type of instructions is (OP-code = 111, I = 0):

bit 15: 0 | bits 14-12: 1 1 1 | bits 11-0: Register operation

In this case the upper four bits (15-12) are fixed, i.e. 0111, and the remaining 12 bits denote the operation to be performed. These 12 bits, B0 to B11, each represent an individual instruction to be performed.


In this type of instruction, the instruction itself tells us the operation and the register on which it has to be performed.
- D7 = 1, I = 0
- The register-reference instruction is specified in B0 ~ B11 of IR
- Execution starts with timing signal T3
r = D7 I' T3 => Register Reference Instruction
Bi = IR(i), i = 0, 1, 2, ..., 11

r:   SC ← 0   (common to all register-reference instructions)
CLA  rB11: AC ← 0
CLE  rB10: E ← 0
CMA  rB9:  AC ← AC'
CME  rB8:  E ← E'
CIR  rB7:  AC ← shr AC, AC(15) ← E, E ← AC(0)
CIL  rB6:  AC ← shl AC, AC(0) ← E, E ← AC(15)
INC  rB5:  AC ← AC + 1
SPA  rB4:  if (AC(15) = 0) then (PC ← PC + 1)
SNA  rB3:  if (AC(15) = 1) then (PC ← PC + 1)
SZA  rB2:  if (AC = 0) then (PC ← PC + 1)
SZE  rB1:  if (E = 0) then (PC ← PC + 1)
HLT  rB0:  S ← 0   (S is a start-stop flip-flop)
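A short sketch of how the individual bits B0-B11 select a register-reference operation; only a subset of the operations from the table is modelled, and the bit masks are illustrative assumptions:

```python
def execute_rr(IR, AC, E, PC):
    """Register-reference sketch: bit i of IR(0-11) selects one operation."""
    B = [(IR >> i) & 1 for i in range(12)]
    if B[11]: AC = 0                          # CLA: clear AC
    if B[10]: E = 0                           # CLE: clear E
    if B[9]:  AC = (~AC) & 0xFFFF             # CMA: complement AC
    if B[7]:                                  # CIR: circulate right through E
        lsb = AC & 1
        AC = (AC >> 1) | (E << 15)
        E = lsb
    if B[2] and AC == 0:                      # SZA: skip next if AC is zero
        PC = (PC + 1) & 0xFFF
    return AC, E, PC

AC, E, PC = execute_rr(1 << 7, AC=0b10, E=1, PC=10)   # CIR only
print(bin(AC), E)   # 0b1000000000000001 0  (E enters bit 15, old bit 0 -> E)
```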

Input-Output Instructions
The instruction format to represent input-output type of instructions is (OP-code = 111, I = 1):

bit 15: 1 | bits 14-12: 1 1 1 | bits 11-0: I/O operation

In this case the upper four bits (15-12) are fixed, i.e. 1111, and the remaining 12 bits denote the operation to be performed. These 12 bits, B0 to B11, each represent an individual instruction to be performed.


To understand these operations, let's discuss a simple computer with input and output devices connected to it, and how the control unit identifies which type of instruction is being executed.

Fig: Input-Output Configuration — a terminal communicates with the computer through interface registers and flip-flops: the keyboard sends characters through a transmitter interface into INPR (with flag FGI), and OUTR sends characters through a receiver interface to the printer (with flag FGO); both registers communicate with the AC.



Here are the details of the registers used in this organization:
INPR, Input register (8 bits): When we enter some value from the keyboard (or from any other input device), its alphanumeric code gets stored in INPR and then moves on to the accumulator.
OUTR, Output register (8 bits): Similar to INPR, OUTR is the register which holds the alphanumeric code it gets from the accumulator before it is printed on the printer (or displayed on the monitor).
AC, Accumulator (16 bits): The accumulator is the main processor register, which receives the first inputs and holds the last outputs.
FGI, Input flag (1 bit): A control flip-flop used to synchronize the timing difference between the input devices and the processor's speed.
FGO, Output flag (1 bit): Similar to FGI, a control flip-flop used to synchronize the timing difference between the output devices and the processor's speed.
IEN, Interrupt enable (1 bit): A flip-flop which tells us whether to allow interruption of operations or not.
Important points:


- The terminal sends and receives serial information.
- The serial information from the keyboard is shifted into INPR.
- The serial information for the printer is stored in OUTR.
- INPR and OUTR communicate with the terminal serially and with the AC in parallel.
- The flags are needed to synchronize the timing difference between the I/O devices and the computer.
The process continues as follows. Initially, the input flag FGI is cleared to 0. When a key is struck on the keyboard, an 8-bit alphanumeric code is shifted into INPR and the input flag FGI is set to 1. As long as FGI is set to 1, new information cannot be entered into INPR. The computer checks the flag bit; if it is 1, the information from INPR is transferred in parallel to AC and FGI is cleared to 0, meaning INPR is ready to take a new key input.
The operation is similar in the case of output devices, except for the direction of flow. Initially FGO is set to 1. The computer checks the flag bit; if it is 1, the information from AC is transferred to OUTR and FGO is cleared to 0. The output device accepts the coded information, prints the corresponding character and, when the operation is completed, sets the flag back to 1. OUTR does not accept a new character until FGO is 1 again.
After understanding the operation of the I/O organization, let's discuss the various operations of the input-output instructions. In this type of instruction, the instruction itself tells us the operation to be performed:
p = D7IT3, Bi = IR(i), i = 6, ..., 11
That means only six operations are supported for input/output and interrupt type instructions, using bits B6 to B11; bits B0 to B5 hold no importance. The operations are:

p:        SC ← 0                             Clear SC
INP pB11: AC(0-7) ← INPR, FGI ← 0            Input character to AC
OUT pB10: OUTR ← AC(0-7), FGO ← 0            Output character from AC
SKI pB9:  if (FGI = 1) then (PC ← PC + 1)    Skip on input flag
SKO pB8:  if (FGO = 1) then (PC ← PC + 1)    Skip on output flag
ION pB7:  IEN ← 1                            Interrupt enable on
IOF pB6:  IEN ← 0                            Interrupt enable off

INP: Transfers the character in INPR to AC(0-7) and clears FGI to 0 (issued once FGI = 1, i.e. a character is available).
OUT: Transfers AC(0-7) to OUTR and clears FGO to 0 (issued once FGO = 1, i.e. the output register is free).
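The INP handshake described above is, in effect, a polling protocol. Here is a minimal Python sketch of the input side (the function names and the busy-wait structure are illustrative assumptions, not part of the Basic Computer):

```python
import time

FGI = 0          # input flag
INPR = 0         # 8-bit input register

def key_struck(code):
    """Device side: deposit a character into INPR only when FGI is clear."""
    global FGI, INPR
    if FGI == 0:
        INPR = code & 0xFF
        FGI = 1                  # a character is now available

def read_char():
    """Processor side: wait until FGI = 1, then do the INP transfer."""
    global FGI
    while FGI == 0:              # in the Basic Computer: SKI skips when
        time.sleep(0.001)        # FGI = 1, otherwise a branch loops back
    char = INPR                  # INP: AC(0-7) <- INPR
    FGI = 0                      # FGI <- 0: ready for the next keystroke
    return char

key_struck(ord('A'))
print(chr(read_char()))          # A
```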


SKI: Skip the next instruction if the input flag is set (FGI = 1), i.e. if a character is ready for the processor. SKO: Similarly, skip the next instruction if the output flag is set (FGO = 1), i.e. if the output device has finished and OUTR is free. Note that these skip instructions are normally paired with a branch instruction, so that the program loops back and checks the flag again until the device is ready. ION: This turns interrupts ON, i.e. operations can be interrupted, by setting the IEN (interrupt enable) flag to 1. IOF: This turns interrupts OFF, so that no interrupt is possible. To explain more about interrupts and see how the value of the IEN flag is used, let's discuss the interrupt cycle.

Fig: Flowchart of the interrupt cycle — while R = 0 the normal instruction cycle (fetch and decode, then execute) runs; if IEN = 1 and FGI or FGO is 1, R is set to 1 and the interrupt cycle is entered: M[0] ← PC (store return address in location 0), PC ← 1 (branch to location 1), IEN ← 0, R ← 0.

R is the interrupt flip-flop, which indicates whether instruction execution is in the normal fetch condition or servicing an interrupt. Thus we check whether R = 0 or not. If R is 0, this is the case of the normal instruction cycle: we fetch, decode and execute the instruction, while checking in parallel whether there is an interrupt or not. The system will only accept interrupts if IEN is 1; thus, if IEN is 0, there is no chance of an interrupt cycle. If IEN is 1, we then check the flags FGI and FGO; if both are 0, no device is requesting service, so no interrupt occurs. If either is 1, we can go for the interrupt, so we set R to 1 and continue with the interrupt cycle. In case of an interrupt, we have to store the address of the next instruction in the normal execution somewhere: we store it at memory address 0, set the value of PC to 1, and also clear IEN and R to 0 to avoid further interrupts until this interrupt cycle is completed.
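The decision logic of this flowchart can be written out as a short sketch; the flag names follow the text, while the function structure is hypothetical:

```python
def check_interrupt(IEN, FGI, FGO):
    """End-of-instruction test from the flowchart: set R if an interrupt is due."""
    return 1 if IEN == 1 and (FGI == 1 or FGO == 1) else 0

def interrupt_cycle(mem, PC):
    """R = 1 path: M[0] <- PC, PC <- 1, IEN <- 0, R <- 0."""
    mem[0] = PC        # save the return address in location 0
    return 1, 0, 0     # new PC, IEN, R

mem = {}
R = check_interrupt(IEN=1, FGI=1, FGO=0)
if R:
    PC, IEN, R = interrupt_cycle(mem, PC=256)
print(mem[0], PC)      # 256 1
```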

Lecture 25: Computer Registers
Instruction set completeness
Timing and control circuit

Another type of organization uses some processor registers other than the accumulator. Some important points to be noticed are:
- A processor has many registers to hold instructions, addresses, data, etc.
- The processor has a register, the Program Counter (PC), that holds the memory address of the next instruction to fetch. Since the memory in the Basic Computer has only 4096 locations, the PC needs only 12 bits.
- In direct or indirect addressing, the processor needs to keep track of which locations in memory it is addressing: the Address Register (AR) is used for this. The AR is a 12-bit register in the Basic Computer.
- When an operand is found, using either direct or indirect addressing, it is placed in the Data Register (DR). The processor then uses this value as data for its operation.
- The Basic Computer has a single general purpose register, the Accumulator (AC). The significance of a general purpose register is that it can be referred to in instructions, e.g. load AC with the contents of a specific memory location; store the contents of AC into a specified memory location.
- Often a processor needs a scratch register to store intermediate results or other temporary data; in the Basic Computer this is the Temporary Register (TR).
- The Basic Computer uses a very simple model of input/output (I/O) operations: input devices send 8 bits of character data to the processor, and the processor can send 8 bits of character data to output devices. The Input Register (INPR) holds an 8-bit character received from an input device; the Output Register (OUTR) holds an 8-bit character to be sent to an output device.
The organization of these basic registers looks like:


Fig: Basic Computer registers and memory — Memory: 4096 x 16; PC and AR: 12 bits (0-11); IR, DR, TR, AC: 16 bits (0-15); INPR and OUTR: 8 bits (0-7).
The data registers are 16 bits long and the address registers are 12 bits.

Common Bus System


The common bus system deals with how these various registers are connected and how they interact with each other. The registers in the Basic Computer are connected using a bus, which gives a saving in circuitry over complete connections between registers: connecting each register directly to every other register would be very complex and would require a large number of connections.
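A small sketch of the idea (the select encoding and register set here are illustrative assumptions): one set of select lines decides which register drives the bus, so every transfer shares a single path.

```python
# Illustrative bus: select lines choose the single source register.
registers = {"AR": 0, "PC": 0, "DR": 0, "AC": 0, "IR": 0, "TR": 0, "MEM": 0}
SOURCES = ["AR", "PC", "DR", "AC", "IR", "TR", "MEM"]  # hypothetical S2S1S0 codes

def bus_transfer(select, dest):
    """dest <- bus, bus <- selected source; one shared path for all."""
    bus = registers[SOURCES[select]]
    registers[dest] = bus

registers["PC"] = 0x0AF
bus_transfer(1, "AR")        # AR <- PC over the common bus
print(hex(registers["AR"]))  # 0xaf
```

With n registers fully interconnected we would need on the order of n(n-1) dedicated paths; the bus replaces them with one shared path plus select logic.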


Fig: Basic Computer registers connected to a common 16-bit bus — the memory unit (4096 x 16, with Read/Write controls) and the registers AR, PC, DR, AC, TR (each with LD/INR/CLR controls), IR and OUTR (LD control), INPR, and the ALU (with the E flip-flop) all exchange data over the bus; select inputs decide which unit drives the bus on each clock cycle.


In the Basic Computer, there is only one general purpose register, the Accumulator (AC). In modern CPUs, there are many general purpose registers. It is advantageous to have many registers: transfers between registers within the processor are relatively fast, while going off the processor to access memory is much slower.

Instruction set completeness: A computer should have a set of instructions so that the user can construct machine language programs to evaluate any function that is known to be computable. Instruction types:
Functional Instructions - Arithmetic, logic, and shift instructions, e.g. ADD, CMA, INC, CIR, CIL, AND, CLA
Transfer Instructions - Data transfers between the main memory and the processor registers, e.g. LDA, STA
Control Instructions - Program sequencing and control, e.g. BUN, BSA, ISZ
Input/Output Instructions - Input and output, e.g. INP, OUT


Flowchart for complete computer operations:

Fig: Flowchart for complete computer operations — after start (SC ← 0, IEN ← 0, R ← 0), if R = 0 the instruction cycle runs: R'T0: AR ← PC; R'T1: IR ← M[AR], PC ← PC + 1; R'T2: AR ← IR(0-11), I ← IR(15), decode IR(12-14) into D0...D7. If R = 1 the interrupt cycle runs instead: RT0: AR ← 0, TR ← PC; RT1: M[AR] ← TR, PC ← 0; RT2: PC ← PC + 1, IEN ← 0, R ← 0, SC ← 0. The decoder output D7 then separates memory-reference instructions (with a further direct/indirect step, D7'IT3: AR ← M[AR]; D7'I'T3: idle) from register-reference (D7I'T3) and I/O (D7IT3) execution; memory-reference execution continues at D7'T4.


Lecture 26: Instruction Cycle
o Flowchart for determining the type of instruction
o Timing and control circuit
o Timing Signals

We have the instruction cycle as fetch, decode and execute. At the time of decoding the instruction we find out the type of instruction. In this section we will discuss the flowchart and the corresponding circuit to do so.

Fig: Flowchart for determining the type of instruction — T0: AR ← PC; T1: IR ← M[AR], PC ← PC + 1; T2: decode opcode in IR(12-14), AR ← IR(0-11), I ← IR(15). At T3 the decoder output D7 and the mode bit I separate the four cases (input-output, register-reference, and direct or indirect memory-reference, the latter continuing at T4), and each execution path ends with SC ← 0.

D7'IT3: AR ← M[AR] (indirect memory-reference)
D7'I'T3: Nothing (direct memory-reference)
D7I'T3: Execute a register-reference instruction
D7IT3: Execute an input-output instruction

Control unit of Basic Computer


Fig: Control unit of the Basic Computer — the instruction register (IR) feeds bit 15 (I), bits 12-14 into a 3x8 decoder (outputs D0-D7), and bits 0-11 directly into the combinational control logic; a 4-bit sequence counter (SC) with increment (INR) and clear (CLR) inputs, driven by the clock, feeds a 4x16 decoder producing timing signals T0-T15; together with other inputs these generate the control signals.

In this circuit we explain how the instruction is fetched into IR, and how the corresponding bits 12, 13 and 14 are decoded to check the type of instruction. To take this decision we use the combinational control logic together with the additional mode-bit information. After checking the type of instruction, the corresponding control signals are generated. To synchronize the fetch, decode and execute phases of the instruction cycle we use a timing circuit. This contains a 4-bit sequence counter which gives us 16 timing signals through a 4x16 decoder. Since the timing signals are fixed at 16, we clear the counter for every instruction and increment it through the various phases; when a new instruction is fetched, we clear SC so that the timing signals start back from T0. To explain this further, we take the example of the instruction STA, which executes at D3T4:
- The timing signals are generated by the 4-bit sequence counter and the 4x16 decoder.
- The SC can be incremented or cleared.
- Example: T0, T1, T2, T3, T4, T0, T1, ... Assume that at time T4, SC is cleared to 0 if decoder output D3 is active.

D3T4: SC ← 0
Fig: Timing diagram — D3 becomes active during T4; its AND with T4 drives the CLR input of SC, so the next clock pulse returns the sequence to T0.
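A sketch of this behaviour in Python (pure simulation; the real circuit is a 4-bit counter feeding a 4x16 decoder):

```python
def timing_signals(cycles=10):
    """Yield the active timing signal each clock; D3T4 clears SC,
    as in the STA example above (sketch only)."""
    SC = 0
    D3 = 1                          # assume the decoder has selected D3 (STA)
    for _ in range(cycles):
        yield f"T{SC}"
        if D3 and SC == 4:          # D3T4: SC <- 0
            SC = 0
        else:
            SC = (SC + 1) % 16      # otherwise INR on each clock pulse

print(list(timing_signals()))
# ['T0', 'T1', 'T2', 'T3', 'T4', 'T0', 'T1', 'T2', 'T3', 'T4']
```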


Lecture 27: Control Memory
o Its Organization
o Mapping Logic
o Microprogram Example

Control Memory: The function of the control unit in a digital computer is to initiate sequences of microoperations, and the number of these microoperations determines the complexity of the digital system. The control signals can be hardwired (using conventional logic design techniques) or microprogrammed. Generally, a control function is a binary variable that can be in either the 1 state or the 0 state depending on the application, and these variables can be represented by a string of 1s and 0s called a control word. A control unit whose binary control variables are stored in memory is called a microprogrammed control unit. Each word in the control memory contains a microinstruction, which specifies a set of microoperations. A sequence of microinstructions constitutes a microprogram. Since alterations are not required once the control unit is in operation, the control memory can be a static memory or ROM (read only memory). We can also use the technique of dynamic microprogramming, which permits writing (to change the microprogram) but is used mostly for reading; this type of memory is also called a writable control memory. Thus we can say: a memory that is part of the control unit is known as control memory. A computer having a microprogrammed control unit has 2 separate memories:
Main Memory: This is used for storing programs, which can be altered.
Control Memory: This holds a fixed microprogram that cannot be altered by the occasional user; it specifies the microinstructions which contain the internal control signals for execution of register operations. These microinstructions generate the microoperations to:
Fetch the instruction from memory.
Evaluate the effective address.
Execute the operation specified by the instruction.


Return control to the fetch phase in order to repeat the cycle for the next instruction.

Configuration of a microprogrammed control unit:
External input → Next Address Generator (sequencer) → Control Address Register (CAR) → Control Memory (ROM) → Control Data Register → Control Word

Fig: Microprogrammed Control Unit
The control memory is assumed to be a ROM, within which all control information is permanently stored. The control address register contains the address of the microinstruction. The control data register holds the microinstruction read from memory. The microinstruction contains a control word that specifies one or more microoperations for the data processor. After the execution of these microoperations we need the location of the next microinstruction, which may also depend on external input. To find the next address we use the next address generator, also called a sequencer, as it determines the address sequence that is read from control memory. The typical functions of a microprogram sequencer are:
Incrementing the CAR by 1 (in case of sequential execution).
Loading the CAR with an address from control memory (in case of branching).
Transferring an external address (in case of interrupts).
Loading an initial address to start the control operations (in case of the first microoperation).
The control data register holds the present microinstruction while the next address is computed and read from memory. It is also called a pipeline register, as it allows the execution of microoperations simultaneously with the generation of the next microinstruction. It requires a 2-phase clock, one phase applied to the address register and one to the data register. We can also work without the control data register, using a single-phase clock, in which case the control word and next-address information are taken directly from the control memory. The ROM operates as a combinational circuit, with the address value as the input and the corresponding word as the output; the content of the specified word remains on the output as long as the address value remains in the address register.
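The fetch-and-sequence rhythm described above can be sketched as a small loop; the three-field microinstruction layout and the routine addresses here are invented purely for illustration:

```python
# Hypothetical 3-field microinstruction: (control_word, branch_flag, next_addr)
control_memory = {
    0: ("AR <- PC",      0, 1),
    1: ("IR <- M[AR]",   0, 2),
    2: ("decode",        1, 8),   # branch: CAR loaded from the microinstruction
    8: ("AC <- AC + DR", 0, 0),   # routine body; real code would loop to fetch
}

def run_microprogram(steps=4):
    CAR = 0                               # control address register
    for _ in range(steps):
        word, branch, nxt = control_memory[CAR]   # CDR <- ROM[CAR]
        print(word)                       # "issue" the control word
        CAR = nxt if branch else CAR + 1  # sequencer: load or increment

run_microprogram()   # AR <- PC / IR <- M[AR] / decode / AC <- AC + DR
```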


The main advantage of microprogrammed control is the fact that once the hardware configuration is established, there should be no need for further hardware or wiring changes; the only thing that changes is the microprogram residing in the control memory.

Mapping of instructions: Mapping from the OP-code of an instruction to the address of the microinstruction that is the starting microinstruction of its execution microprogram.

Machine instruction OP-code: 1011
Mapping bits: 0 xxxx 00
Microinstruction address: 0 1011 00

Here we have to generate the address of the microinstruction with the help of the instruction. We fetch the instruction and get the opcode of that particular instruction. For the mapping, we copy the opcode bits directly into the microinstruction address, but we append some bits at the start and at the end. What values are appended is completely a decision of the designer; in this example we have appended a 0 before the copied opcode bits and 00 at the end, and this mapping rule is generalized for all opcodes/instructions. Note: the number of bits appended at the end determines the maximum length of each microprogram routine (here 2^2 = 4 microinstructions per routine). In the next diagram we show the mapping of the various instructions to their particular microinstructions or microprograms.
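With the appended bits of this example (a leading 0 and two trailing 0s), the mapping function reduces to a 2-bit left shift; a one-line sketch:

```python
def map_opcode(opcode):
    """4-bit opcode -> 7-bit microinstruction address: 0 xxxx 00."""
    return (opcode & 0xF) << 2    # the leading 0 is implicit in 7 bits

print(format(map_opcode(0b1011), '07b'))  # 0101100, as in the diagram
```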


MICROPROGRAM EXAMPLE
Computer Hardware Configuration

Fig: Computer hardware configuration for the microprogram example — main memory (2048 x 16) addressed through a MUX from PC (11 bits) and AR (11 bits), with DR (16 bits) for data; the control unit's control memory (128 x 20) addressed through a MUX from CAR (7 bits) and SBR (7 bits); an arithmetic, logic and shift unit operates on DR and AC (16 bits).

This configuration contains two memory units:
Main memory: for storing instructions and data.
Control memory: for storing the microinstructions/microprogram.
The main memory is accessed with the help of PC, AR and DR, and the transfer of information takes place through multiplexers instead of a common bus. Similarly, the control memory is accessed with the help of CAR, and the manipulation of address sequencing is supported through SBR. The control signals fetched from control memory then drive the arithmetic, logic and shift unit, which takes its values from DR and AC and stores the result in AC.


Direct Mapping:
OP-codes of instructions: ADD 0000, AND 0001, LDA 0010, STA 0011, BUN 0100, ...
With direct mapping, the opcode itself selects the routine in control storage:
Address 0000: ADD Routine
Address 0001: AND Routine
Address 0010: LDA Routine
Address 0011: STA Routine
Address 0100: BUN Routine
With mapping bits 10 xxxx 010, each routine instead starts at a spaced address in control storage:
Address 10 0000 010: ADD Routine
Address 10 0001 010: AND Routine
Address 10 0010 010: LDA Routine
Address 10 0011 010: STA Routine
Address 10 0100 010: BUN Routine

Mapping function implemented by ROM or PLA:
OP-code → Mapping memory (ROM or PLD) → Control address register → Control memory
The mapping function is sometimes implemented by means of an integrated circuit called a programmable logic device (PLD). This is similar to a ROM, except that the mapping function is expressed in terms of Boolean expressions that are implemented inside the PLD.


Lecture 30: Direct Memory Access

DMA is used for block transfers of data from high-speed devices: drum, disk, tape.
* DMA controller - an interface which allows I/O transfer directly between memory and the device, freeing the CPU for other tasks.
* The CPU initializes the DMA controller by sending the memory address and the block size (number of words).

Fig: CPU bus signals for DMA transfer — the DMA controller raises Bus Request (BR); the CPU places its address bus (ABUS), data bus (DBUS), Read (RD) and Write (WR) lines in a high-impedance (disabled) state and replies with Bus Granted (BG), after which the DMA controller drives the buses directly.

Fig: Block diagram of the DMA controller — address-bus and data-bus buffers connect to the system buses; the control logic receives DMA select (DS), register select (RS), RD, WR and Bus Grant (BG), and issues Bus Request (BR) and Interrupt; an internal bus links the address register, word count register and control register; DMA request and DMA acknowledge lines connect to the I/O device.

Starting an I/O:
- The CPU executes instructions to load the memory address register, load the word counter, load the function (read or write) to be performed, and issue a GO command.
- Upon receiving the GO command, the DMA performs the I/O operation as follows, independently of the CPU.
Input:
[1] Input Device ← R (read control signal)
[2] Buffer (DMA controller) ← input byte; the controller assembles the bytes into a word until the word is full
[3] M ← memory address, W (write control signal)
[4] Address Reg ← Address Reg + 1; WC (word counter) ← WC - 1
[5] If WC = 0, then interrupt to signal done, else go to [1]
Output:
[1] M ← memory address, R (read control signal); Address Reg ← Address Reg + 1, WC ← WC - 1
[2] Disassemble the word
[3] Buffer ← one byte; Output Device ← W, for all disassembled bytes
[4] If WC = 0, then interrupt to signal done, else go to [1]
While the DMA I/O takes place, the CPU is also executing instructions. The DMA controller and the CPU both access memory, so a memory access conflict arises. A memory bus controller coordinates the activities of all devices requesting memory access through a priority system: memory accesses by the CPU and the DMA controller are interwoven, with the top priority given to the DMA controller. This is called cycle stealing:
- The CPU is usually much faster than I/O (DMA), thus the CPU uses most of the memory cycles.
- The DMA controller steals memory cycles from the CPU; for those stolen cycles, the CPU remains idle.
- For a slow CPU, the DMA controller may steal most of the memory cycles, which may cause the CPU to remain idle for a long time.
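The input sequence above amounts to a loop over the word counter; here is a hedged Python sketch in which the byte assembly and bus signalling are abstracted away (the two-byte word size is an assumption):

```python
def dma_input(device_bytes, mem, addr_reg, word_count, bytes_per_word=2):
    """Sketch of DMA input: assemble bytes into words, write each word to
    memory, increment the address register and decrement WC until WC = 0."""
    it = iter(device_bytes)
    while word_count > 0:
        word = 0
        for _ in range(bytes_per_word):          # [1]-[2] read and assemble
            word = (word << 8) | next(it)
        mem[addr_reg] = word                     # [3] M <- buffer (Write)
        addr_reg += 1                            # [4] Address Reg + 1,
        word_count -= 1                          #     WC - 1
    return "interrupt: block transfer complete"  # [5] WC = 0 -> interrupt CPU

mem = {}
print(dma_input([0x12, 0x34, 0x56, 0x78], mem, addr_reg=100, word_count=2))
print(mem)   # {100: 4660, 101: 22136}, i.e. 0x1234 and 0x5678
```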


DMA TRANSFER
Fig: DMA transfer in a computer system — the CPU, the random-access memory unit (RAM) and the DMA controller share the address and data buses and the read/write control lines; BR/BG connect the DMA controller to the CPU, the interrupt line signals completion, and DMA request/acknowledge lines connect the controller to the I/O peripheral device.


Lecture 30: Interrupts
o Types of interrupts
o Interrupt cycle

Types of Interrupts:
External interrupts: initiated from outside the CPU and memory, e.g.
- I/O device: data transfer request or data transfer complete
- Timing device: timeout
- Power failure
- Operator
Internal interrupts (traps): caused by the currently running program, e.g.
- Register or stack overflow
- Divide by zero
- OP-code violation
- Protection violation
Software interrupts: Both external and internal interrupts are initiated by the computer hardware, whereas software interrupts are initiated by executing an instruction.
- Supervisor call: switching from user mode to supervisor mode; allows execution of a certain class of operations which are not allowed in user mode.
Interrupt procedure (compared with a subroutine call):
- The interrupt is usually initiated by an internal or external signal rather than by the execution of an instruction (except for software interrupts).
- The address of the interrupt service program is determined by the hardware rather than by the address field of an instruction.
- An interrupt procedure usually stores all the information necessary to define the state of the CPU, rather than storing only the PC. The state of the CPU is determined from:
- the content of the PC,


- the content of all processor registers,
- the content of the status bits.
There are many ways of saving the CPU state, depending on the CPU architecture; a minimal sketch follows.
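As a hedged illustration of the contrast above (an interrupt saves the full CPU state, not just the PC), here is a small Python sketch; the stack, register and status structures are hypothetical, not the Basic Computer's actual mechanism:

```python
def save_cpu_state(stack, PC, registers, status):
    """Interrupt entry: push everything needed to resume transparently."""
    stack.append({"PC": PC, "regs": dict(registers), "status": dict(status)})

def restore_cpu_state(stack):
    """Return-from-interrupt: pop and resume exactly where we left off."""
    state = stack.pop()
    return state["PC"], state["regs"], state["status"]

stack = []
save_cpu_state(stack, PC=0x0A5, registers={"AC": 7}, status={"E": 1, "IEN": 0})
PC, regs, status = restore_cpu_state(stack)
print(hex(PC), regs, status)   # 0xa5 {'AC': 7} {'E': 1, 'IEN': 0}
```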

Flowchart of interrupts: To explain more about interrupts and the role of the IEN flag, let's discuss the interrupt cycle.

Fig: Flowchart of the interrupt cycle — while R = 0 the normal instruction cycle (fetch and decode, then execute) runs; if IEN = 1 and FGI or FGO is 1, R is set to 1 and the interrupt cycle is entered: M[0] ← PC (store return address in location 0), PC ← 1 (branch to location 1), IEN ← 0, R ← 0.

R is the interrupt flip-flop, which indicates whether instruction execution is in the normal fetch condition or servicing an interrupt. Thus we check whether R = 0 or not. If R is 0, this is the case of the normal instruction cycle: we fetch, decode and execute the instruction, while checking in parallel whether there is an interrupt or not. The system will only accept interrupts if IEN is 1; thus, if IEN is 0, there is no chance of an interrupt cycle. If IEN is 1, we then check the flags FGI and FGO; if both are 0, no device is requesting service, so no interrupt occurs. If either is 1, we can go for the interrupt, so we set R to 1 and continue with the interrupt cycle. In case of an interrupt, we have to store the address of the next instruction in the normal execution somewhere: we store it at memory address 0, set the value of PC to 1, and also clear IEN and R to 0 to avoid further interrupts until this interrupt cycle is completed.

