
ENIAC (Electronic Numerical Integrator and Computer): This was the first widely known general-purpose electronic computer.

EDVAC (Electronic Discrete Variable Automatic Computer):


IAS (Institute for Advanced Study computer):
IBM S/360 (International Business Machines System/360):
CDC 6600: A powerful computer by CDC (Control Data Corporation).
CDC STAR-100: STring ARray computer by CDC.
TI-ASC: Texas Instruments Advanced Scientific Computer.
ILLIAC IV (Illinois Automatic Computer): A supercomputer with 64 processing elements under a single control unit.
Butterfly Computer: By BBN Laboratories Inc. It is a multiprocessor that uses 256 CPUs based on Motorola processors.
iPSC (Intel Personal SuperComputer): It is a multiprocessor that uses 128 CPUs based on the Intel 80286 processor.
The CDC and CYBER series computers use a large number of IOPs (I/O processors) with a high degree of autonomy. In addition, each CPU is subdivided into a number of Processing Elements (PEs).
A CPU organization called pipelining is used in the CDC STAR-100 and TI-ASC.
DAP (Distributed Array Processor): It is an associative processor, designed by International Computers Limited (ICL).
CYBER 205: A computer by Control Data Corporation.
CYBER 205 by CDC and Cray-1: These two computers are pipeline processors and can be considered to fall in the MISD category if we assume that a single data stream passes through a pipeline and is processed by different (micro-)instruction streams in different segments of the pipeline.

Microprogramming: It is a technique for implementing the control function of a processor in a systematic and flexible manner. Each instruction of the processor being controlled causes a sequence of microinstructions, called a microprogram, to be fetched from a special ROM or RAM called a control memory. The microinstructions specify the sequence of micro-operations, or register-transfer operations, needed to interpret and execute the main instruction. Each instruction fetched from main memory thus initiates a sequence of microinstruction fetches from control memory.
The instruction set of a microprogrammed machine can be changed by replacing the contents of the control memory. This makes it possible for a microprogrammed computer to directly execute programs written in the machine language of a different computer, a process called emulation. If microprograms are stored in ROM instead of RAM, they are called firmware.
A microprogrammed control unit is costly and slower than a hardwired control unit.
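To make the idea concrete, the sketch below models a control memory as a table that maps each machine-level opcode to its sequence of micro-operations. This is only an illustration: the opcode names, micro-operation names, and table contents are invented for the example and do not describe any particular machine.

/* Illustrative sketch of a microprogrammed control unit:
   each machine-level opcode indexes a "control memory" entry listing
   the micro-operations used to interpret that instruction.
   All names and sequences here are hypothetical. */
#include <stdio.h>

typedef enum { FETCH_OPERAND, ALU_ADD, ALU_SUB, WRITE_RESULT, END } MicroOp;

static const char *micro_op_name[] = {
    "fetch operand", "ALU add", "ALU subtract", "write result", "end"
};

/* Control memory: one microprogram (sequence of micro-ops) per opcode. */
static const MicroOp control_memory[2][4] = {
    /* opcode 0: ADD */ { FETCH_OPERAND, ALU_ADD, WRITE_RESULT, END },
    /* opcode 1: SUB */ { FETCH_OPERAND, ALU_SUB, WRITE_RESULT, END },
};

static void interpret(int opcode) {
    printf("opcode %d ->\n", opcode);
    for (int i = 0; control_memory[opcode][i] != END; i++)
        printf("  micro-op: %s\n", micro_op_name[control_memory[opcode][i]]);
}

int main(void) {
    interpret(0);   /* interpret an ADD instruction */
    interpret(1);   /* interpret a SUB instruction  */
    return 0;
}

Replacing the contents of control_memory would change the instruction set being interpreted, which is exactly the flexibility (emulation) described above.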
The major steps in processing an instruction are described below:
1. Generate the next instruction address.
2. Fetch the instruction.
3. Decode the instruction.
4. Generate the operand address.
5. Fetch the operands.
6. Execute the instruction.
7. Store the results.
Parallel Processing
The term parallel processing refers to a large class of methods that attempt to increase computing speed by performing more than one
computation concurrently. A parallel processor is a computer that implements some parallel processing technique. Like any type of
computing, parallel processing can be viewed at various levels of complexity. At the gate level, for instance, a distinction is made
between serial arithmetic circuits, which process numbers 1 bit at a time, and parallel arithmetic circuits, in which all bits of a number are
processed concurrently, i.e., during the same clock cycle. At the register level, where the basic unit of information is the word, serial and
parallel machines are distinguished based on whether one or more words (instructions or data) can be processed in parallel. Thus a vector
computer, which performs computations on all words of a multiword vector simultaneously, is considered to be a type of parallel
processor.
All modern computers involve some degree of parallelism. Early electronic computers such as the EDVAC processed data bit by bit and are now classed as serial computers. A computer, or a portion thereof, that is designed to process more than one basic CPU instruction in parallel is referred to as a parallel processor. Conventional machines that can only execute one CPU instruction at a time are termed sequential processors.
Types of Parallel Processors
There are many ways of classifying parallel processors based on their structure or behavior. The major classification methods consider the number of instructions and/or operand sets that can be processed simultaneously, the internal organization of the processors, the interprocessor connection structure, or the methods used to control the flow of instructions and data through the system.
Flynn's classification
A typical CPU operates by fetching instructions and operands from main memory, executing the instructions, and placing the results in
main memory. The steps associated with the processing of an instruction are listed in Table 1.
Michael J. Flynn has made an informal and widely used classification of processor parallelism based on the number of simultaneous
instruction and data streams seen by the processor during program execution. Suppose that a processor P is operating at its maximum
capacity, so that its full degree of parallelism is being exhibited. Let mI and mD denote the maximum number of instruction and data streams, respectively, that are being actively processed in any of the seven steps listed in Table 1. mI and mD are termed the instruction-stream multiplicity and data-stream multiplicity of P, and measure its degree of parallelism.
Computers can be roughly divided into four major groups based on the values of mI and mD associated with their CPUs.
Single instruction stream single data stream (SISD): mI = mD = 1. Most conventional machines with one CPU containing a single arithmetic-logic unit capable only of scalar arithmetic fall into this category.
Single instruction stream multiple data stream (SIMD): mI = 1, mD > 1. This category includes machines with a single program control unit and multiple execution units.
Multiple instruction stream single data stream (MISD): mI > 1, mD = 1. Computers that rely heavily on pipeline processing may be considered MISD machines if the viewpoint is taken that a single data stream passes through a pipeline and is processed by different instruction streams in different segments of the pipeline. Fault-tolerant computers in which several CPUs process the same data using different programs also belong to the MISD class.
Multiple instruction stream multiple data stream (MIMD): mI > 1, mD > 1. This covers multiprocessors, which are computers with more than one CPU and the ability to execute several programs simultaneously.
Conventional electronic computers are generally called von Neumann machines.
Processing Element: A CPU without memory is called a Processing Element (PE).
Computing Element: A CPU with memory is called a Computing Element (CE).
The speed of a computer can be increased by increasing the speed of its hardware, but hardware speed is ultimately bounded by physical limits such as the speed of light, so the speedup obtainable from hardware alone is fixed to a certain value. In order to get more speed, we must use parallel processing methods.
Examples of early supercomputers are the UNIVAC LARC (Livermore Atomic Research Computer) and the IBM 7030. These machines execute CPU operations simultaneously or in parallel, with the IOPs performing secondary housekeeping functions in parallel with CPU operation; this speeds up the CPU directly and is referred to as parallel processing.
Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process. Concurrency implies parallelism, simultaneity, and pipelining. Parallel events may occur in multiple resources during the same time interval; simultaneous events may occur at the same time instant; and pipelined events occur in overlapped time spans. These concurrent events are attainable in a computer system at various processing levels. Parallel processing demands concurrent execution of many programs in the computer, in contrast to sequential processing. It is a cost-effective means to improve system performance through concurrent activities in the computer.
Parallel processing and distributed processing are closely related. Generally, distributed processing is one way of achieving parallel processing.
Associative memory is content addressable, and an array processor designed with associative memory is called an associative processor.
Parallel Computer System / Parallel Computer Architecture / Structural class: It can be characterized as i) pipeline computers, ii) array processors, and iii) multiprocessor systems. Newer computing concepts are the data flow and VLSI approaches.
The development and application of these computer systems requires a broad knowledge of the underlying hardware and software structure and close interaction between parallel computing algorithms and the optimal allocation of machine resources. Parallel processing can be applied at the hardware/software system level or at the algorithm and programming level.
Generation of Computer System (Trends towards Parallel Processing):
From the application point of view, the use of computers is experiencing a trend of four ascending levels of sophistication: i) data processing, ii) information processing, iii) knowledge processing, and iv) intelligence processing.
The data space is the largest, including numeric numbers in various formats, character symbols, and multidimensional measures. Huge amounts of data are being generated daily in all walks of life. Computer usage started with data processing, which is still a major task of today's computers.

An information item is a collection of data objects that are related by some syntactic structure or relation. Therefore, information items form a subspace of the data space. As more and more data structures have been developed, many users are shifting computer use from pure data processing to information processing.
Knowledge consists of information items plus some semantic meanings. Thus knowledge items form a subspace of the information space. The use of computers for knowledge processing is seen in expert systems, where the computer can reach a level of performance comparable to that of a human expert.
Intelligence is derived from a collection of knowledge items; it is likewise a subspace of the knowledge space. Today's computers can be made very knowledgeable but are far from being intelligent, and scientists are trying to create intelligence in computers (artificial intelligence).
From the operating system point of view, computer systems have improved through: i) batch processing, ii) multiprogramming, iii) time sharing, and iv) multiprocessing.
From the programmatic point of view, parallel processing is viewed at the following levels:
i) Job or Program Level: The highest level of parallel processing is conducted among multiple jobs or programs through multiprogramming, time sharing, and multiprocessing. This level requires the development of parallel processable algorithms. The implementation of parallel algorithms depends on the efficient allocation of limited hardware/software resources to multiple programs being used to solve a large computational problem. This level is handled mainly algorithmically (in software). Moving down from this level, the software approach decreases and the hardware approach increases.
ii) Task or Procedure Level: This level of parallel processing is conducted among procedures or tasks (program segments) within the same program. This involves the decomposition of a program into multiple tasks.
iii) Interinstruction Level: It exploits concurrency among multiple instructions. Data dependency analysis is often performed to reveal parallelism among instructions. Vectorization may be desired among scalar operations within DO loops.
iv) Intrainstruction Level: This is the fastest level; there is concurrent operation within each instruction. This level is implemented by hardware. Moving up from this level, the hardware approach decreases and the software approach increases.
Types of Parallel Processors
There are many ways of classifying parallel processors based on their structure or behavior. The major classification methods consider the number of instructions and/or operand sets that can be processed simultaneously, the internal organization of the processors, the interprocessor connection structure, or the methods used to control the flow of instructions and data through the system.
Classification of parallel processors based on their architecture or behavior (Flynn's Classification): It is based on the number of simultaneous instruction and data streams seen by the processor during program execution. Suppose that a processor P is operating at its maximum capacity, so that its full degree of parallelism is being exhibited. If mI and mD denote the instruction-stream multiplicity and data-stream multiplicity of P respectively, then mI and mD measure its degree of parallelism.
Computers can be roughly divided into four major groups based on the values of mI and mD associated with their CPUs.

1. Single instruction stream single data stream (SISD): mI = mD = 1. Most conventional machines with one CPU containing a single arithmetic-logic unit capable only of scalar arithmetic fall into this category. SISD computers and sequential computers are thus synonymous. Instructions are executed sequentially but may be overlapped in their execution stages (pipelining). An SISD computer may have more than one functional unit under the control of one control unit. Examples: IBM 7090, VAX 11/780.
2. Single instruction stream multiple data stream (SIMD): mI = 1, mD > 1. This category includes machines with a single program control unit and multiple execution units or PEs. All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams. It is further divided into i) bit-slice and ii) word-slice organizations. This class corresponds to array processors. Examples: ILLIAC IV, which has a single program control unit and multiple execution units; it also includes associative processors such as DAP.
3. Multiple instruction stream single data stream (MISD): mI > 1, mD = 1. Computers that rely heavily on pipeline processing may be considered MISD machines if the viewpoint is taken that a single data stream passes through a pipeline and is processed by different instruction streams in different segments of the pipeline. The results (output) of one processor become the input (operands) of the next processor in the macropipe. Fault-tolerant computers in which several CPUs process the same data using different programs also belong to the MISD class. Examples: CYBER 205 by CDC and Cray-1. This class is not popular.
4. Multiple instruction stream multiple data stream (MIMD): mI > 1, mD > 1. This covers multiprocessors, which are computers with more than one CPU and the ability to execute several programs simultaneously. An intrinsic MIMD computer is called tightly coupled if the degree of interaction among the processors is high; otherwise it is considered loosely coupled. Examples: NCUBE/ten, IBM 370. Sometimes Cray-1 is considered to fall in this category.
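As a compact illustration (a minimal sketch, not part of the source notes), Flynn's four categories can be expressed as a simple decision on the two multiplicities mI and mD:

/* Map the instruction-stream and data-stream multiplicities (mI, mD)
   to Flynn's category. Purely illustrative. */
#include <stdio.h>

static const char *flynn_category(int mI, int mD) {
    if (mI == 1 && mD == 1) return "SISD";
    if (mI == 1 && mD > 1)  return "SIMD";
    if (mI > 1 && mD == 1)  return "MISD";
    return "MIMD";          /* mI > 1 and mD > 1 */
}

int main(void) {
    printf("mI=1, mD=1  -> %s\n", flynn_category(1, 1));   /* conventional uniprocessor     */
    printf("mI=1, mD=64 -> %s\n", flynn_category(1, 64));  /* array processor, e.g. ILLIAC IV */
    printf("mI=4, mD=4  -> %s\n", flynn_category(4, 4));   /* multiprocessor                */
    return 0;
}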

Classification of parallel processors based on degree of parallelism, or serial vs. parallel processing (Feng's Classification): The maximum number of binary digits (bits) that can be processed within a unit time by a computer is called the maximum parallelism degree P. Let Pi be the number of bits that can be processed within the ith processor cycle (or ith clock period). Consider T processor cycles indexed by i = 1, 2, 3, ..., T. The average parallelism degree Pa is defined as:
Pa = (1/T) Σ Pi   (summed over i = 1, 2, ..., T)
The utilization rate μ of a computer within T cycles is defined by:

μ = Pa / P
If the computing power of the processor is fully utilized (or the parallelism is fully exploited), then we have Pi = P for all i, and μ = 1 for 100 percent utilization. The utilization rate depends on the application program being executed.
The maximum parallelism degree P(C) of a given computer C is represented by the product of the word length n and the bit-slice length m; that is, P(C) = n · m.
Bit slice: A bit slice is a string of bits, one from each of the words at the same vertical bit position. For example, the TI-ASC has a word length of 64 and 4 arithmetic pipelines; each pipe has 8 stages. Thus there are 8 × 4 = 32 bits per bit slice in the 4 pipes, so it is represented as P(TI-ASC) = (64, 32).
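The following sketch (with a made-up per-cycle trace; only the maximum degree 64 × 32 = 2048 comes from the TI-ASC figures above) computes the average parallelism degree Pa and the utilization rate from the definitions just given:

/* Feng's measures: average parallelism degree Pa and utilization rate mu,
   computed from a hypothetical trace of bits processed per cycle. */
#include <stdio.h>

int main(void) {
    int P = 2048;                                      /* maximum parallelism degree, e.g. 64 x 32 */
    int Pi[] = { 2048, 1024, 2048, 512, 2048, 2048 };  /* bits processed in each of T cycles (made up) */
    int T = sizeof Pi / sizeof Pi[0];

    double sum = 0.0;
    for (int i = 0; i < T; i++) sum += Pi[i];

    double Pa = sum / T;      /* Pa = (1/T) * sum of Pi       */
    double mu = Pa / P;       /* utilization rate mu = Pa / P */

    printf("Pa = %.1f bits/cycle, utilization = %.2f\n", Pa, mu);
    return 0;
}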
There are four types of processing methods in this classification, as follows:
i. Word-serial and bit-serial (WSBS): here n = m = 1; it is also called bit-serial processing.
ii. Word-parallel and bit-serial (WPBS): here n = 1, m > 1; it is also called bis (bit-slice) processing.
iii. Word-serial and bit-parallel (WSBP): here n > 1, m = 1; it is also called word-slice processing.
iv. Word-parallel and bit-parallel (WPBP): here n > 1, m > 1; it is also called fully parallel processing.
Classification of parallel processors based on pipelining degree and parallelism degree, i.e., parallelism vs. pipelining (Handler's Classification):
Its levels are:
i) Processor Control Unit (PCU): denoted by K. Each PCU corresponds to one processor.
ii) Arithmetic Logic Unit (ALU): denoted by D. It is equivalent to the PE used in an array processor.
iii) Bit Level Circuit (BLC): it corresponds to the combinational circuit that performs a 1-bit operation in the ALU.
We denote T(C) = <K × K', D × D', W × W'>
where
K is the number of PCUs,
K' is the number of PCUs that can be pipelined,
D is the number of ALUs under the control of one PCU,
D' is the number of ALUs that can be pipelined,
W is the word length of an ALU or PCU, and
W' is the number of pipeline stages in all ALUs or in the PCU.
The TI-ASC has one controller controlling four arithmetic pipelines, each with a 64-bit word length and 8 stages. Thus we have
T(TI-ASC) = <1 × 1, 4 × 1, 64 × 8> = <1, 4, 64 × 8>   (when the second entity of a pair is 1, it can be dropped).
Another example is the CDC 6600, which has a CPU with an ALU that has 10 specialized hardware functions, each with a word length of 60 bits. Up to 10 of these functions can be linked into a longer pipeline. It also has 10 peripheral IOPs which can operate in parallel, each with one ALU of 12-bit word length. Thus we can specify the CDC 6600 as:
T(CDC 6600) = T(central processor) × T(IOPs) = <1, 1 × 10, 60> × <10, 1, 12>
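A small data-structure sketch (illustrative only; the struct and field names are mine, and the values simply transcribe the two triples above) shows how the Handler descriptor <K × K', D × D', W × W'> can be recorded:

/* Handler's triple T(C) = <K x K', D x D', W x W'> recorded as a struct.
   Values follow the TI-ASC and CDC 6600 examples given in the text. */
#include <stdio.h>

typedef struct {
    const char *name;
    int K, Kp;   /* PCUs and PCUs that can be pipelined          */
    int D, Dp;   /* ALUs per PCU and ALUs that can be pipelined  */
    int W, Wp;   /* ALU word length and pipeline stages          */
} HandlerTriple;

int main(void) {
    HandlerTriple machines[] = {
        { "TI-ASC",          1, 1, 4,  1, 64, 8 },
        { "CDC 6600 (CPU)",  1, 1, 1, 10, 60, 1 },
        { "CDC 6600 (IOPs)", 10, 1, 1, 1, 12, 1 },
    };
    for (int i = 0; i < 3; i++)
        printf("T(%s) = <%d x %d, %d x %d, %d x %d>\n", machines[i].name,
               machines[i].K, machines[i].Kp, machines[i].D, machines[i].Dp,
               machines[i].W, machines[i].Wp);
    return 0;
}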
Multiprocessor System: If we extend the structure of a computer to include multiple processors with a shared memory space and peripherals under the control of one operating system, then such a computer is called a multiprocessor system. An example is C.mmp by Carnegie-Mellon University.
Parallel Processing in a Uniprocessor System:
The mechanisms of parallel processing in a uniprocessor system can be classified as i) hardware approaches, ii) software approaches, and iii) operating system approaches.
System Throughput: It is defined as the number of instructions (or basic computations) performed per unit time.
Hardware approaches to provide parallelism in a uniprocessor system (all examples given in this section are uniprocessors):

1. Multiplicity of functional units: The functions of the ALU can be distributed to multiple, specialized functional units that can operate in parallel. These functional units are independent of each other and may operate simultaneously. The CDC 6600 has 10 functional units built into its CPU. The IBM 360/91 has two parallel execution units, one for fixed-point and the other for floating-point arithmetic.
2. Parallelism and pipelining within the CPU: Parallel adders, using such techniques as carry-lookahead and carry-save, are now built into almost all ALUs. High-speed multiplier recoding and convergence division are techniques for exploiting parallelism and the sharing of hardware resources for the multiply and divide functions. Various phases of instruction execution are now pipelined, including instruction fetch, decode, operand fetch, arithmetic-logic execution, and storing of the result. To facilitate overlapped instruction execution through the pipe, instruction prefetch and data buffering techniques have been developed.

3. Overlapped CPU and I/O operations: I/O operations can be performed simultaneously with CPU computations by using separate I/O controllers, channels, or I/O processors. The DMA method can be used for this purpose. Back-end database machines can be used to manage large databases stored on disk.
4. Use of a hierarchical memory system: It minimizes the speed gap between the CPU and memory. The innermost level is the register file, directly addressable by the CPU.
Other approaches to provide parallelism in a uniprocessor system:

Balancing of subsystem bandwidth: The bandwidth of a system is defined as the number of operations performed per unit time. Generally the disk cycle time is the longest, followed by main memory and then the CPU. With these speed gaps between the subsystems, we need to match their processing bandwidths in order to avoid a system bottleneck. In order to achieve a totally balanced system, the following relation should be satisfied: Bp + Bd = Bm.

Software approaches to provide parallelism in a uniprocessor system:

Multiprogramming and Time sharing:

Multiprogramming: Within the same time interval, there may be multiple processes active in a computer, competing for memory, I/O, and CPU resources. Some computer programs are I/O-bound and some are CPU-bound. We can mix the execution of various types of programs in the computer to balance bandwidths among the various functional units. This interleaving of CPU and I/O operations among several programs is called multiprogramming.
Time sharing: Multiprogramming on a uniprocessor is centered around the sharing of the CPU by many programs. Sometimes a high-priority program may occupy the CPU for too long to allow others to share. This problem can be overcome by using a time-sharing operating system. The concept extends multiprogramming by assigning fixed or variable time slices to multiple programs. In other words, equal opportunities are given to all programs competing for the use of the CPU. The time-sharing use of the CPU by multiple programs in a uniprocessor computer creates the concept of virtual processors. Each user at a terminal can interact with the computer on an instantaneous basis; each user thinks that he or she is the sole user of the system, because the response is so fast (the waiting time between time slices is not noticeable by the user). Time sharing is particularly effective when applied to a computer system connected to many interactive terminals, and it is indispensable to the development of real-time computer systems.
Performance of Parallel Computer:
The speedup that can be achieved by a parallel computer with n identical processors working concurrently on a single problem is at most
n times faster than a single processor. In practice, the speedup is much less, since some processors are idle at a given time because of
conflicts over memory access or communication paths, inefficient algorithms for exploiting the natural concurrency in the computing
problem, or many other reasons. Generally the speedup has a lower bound of log2 n and an upper bound of n/ln n.
The lower bound log2 n is known as Minsky's conjecture. Most commercial multiprocessor systems have from 2 to 4 processors (the C.mmp and S-1 systems have 16 processors). Using Minsky's conjecture, only a speedup of 2 to 4 can be expected from existing multiprocessors with 4 to 16 processors. A more optimistic speedup estimate is upper bounded by n/ln n, as derived below:
Consider a computing problem which can be executed by a uniprocessor in unit time, T1 = 1. Let fi be the probability of assigning the same problem to i processors working equally with an average load di = 1/i per processor. Assume equal probability of each operating mode using i processors, that is, fi = 1/n for the n operating modes i = 1, 2, ..., n. The average time required to solve the problem on an n-processor system is given below, where the summation represents the n operating modes:

Tn = Σ fi · di = (1/n) Σ (1/i)   (summed over i = 1, 2, ..., n)

The average speedup S is obtained as the ratio of T1 = 1 to Tn; that is,

S = T1 / Tn = n / Σ (1/i) ≤ n / ln n
For a given multiprocessor system with 2, 4, 8, or 16 processors, the respective average speedups are 1.33, 1.92, 3.08, and 6.93. The above analysis explains why a typical commercial multiprocessor system consists of only 2 to 4 processors (the speedup is better with 2 to 4 processors).
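The harmonic-sum speedup is easy to tabulate; the short program below (an illustrative sketch) evaluates S = n / Σ(1/i) together with the two bounds log2 n and n/ln n. The values it prints follow directly from the formula and may differ slightly from the figures quoted in these notes.

/* Average speedup S = n / (1 + 1/2 + ... + 1/n), compared with the
   lower bound log2(n) (Minsky's conjecture) and upper bound n/ln(n). */
#include <stdio.h>
#include <math.h>

int main(void) {
    int sizes[] = { 2, 4, 8, 16 };
    for (int k = 0; k < 4; k++) {
        int n = sizes[k];
        double harmonic = 0.0;
        for (int i = 1; i <= n; i++) harmonic += 1.0 / i;
        double S = n / harmonic;
        printf("n=%2d  S=%.2f  log2(n)=%.2f  n/ln(n)=%.2f\n",
               n, S, log2((double)n), n / log((double)n));
    }
    return 0;
}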
Parallel Computer Structural Classes: They can be characterized as follows:
1. Pipeline computers: perform overlapped computations to exploit temporal parallelism.
2. Array processors: use multiple synchronized ALUs to achieve spatial parallelism.
3. Multiprocessor systems: achieve asynchronous parallelism through a set of interactive processors with shared resources (memories, databases, etc.).
These three parallel approaches to computer system design are not mutually exclusive. In fact, most existing computers are now pipelined, and some of them also assume an array or a multiprocessor structure.
The fundamental difference between an array processor and multiprocessor system is that the processing elements in an array
processor operate synchronously but processors in a multiprocessor system may operate asynchronously.
New computing concepts are the data flow and VLSI approaches. These approaches use extensive hardware to achieve parallelism.
Data flow computers: Conventional von Neumann machines are called control flow computers because the instructions are executed sequentially as controlled by the program counter. Data flow computers are intended to obtain maximum parallelism. The basic concept is to enable the execution of an instruction whenever its required operands are available. Thus no program counter is needed to hold the address of the next instruction to be executed. Instruction initiation depends on data availability, independent of the physical location of an instruction in the program. In other words, the instructions in a program are not ordered.
VLSI computing structures: The parallel algorithm is implemented directly in hardware in this technology, so it is faster.
If the speedup of a method is directly proportional to the number of processors, then the method is said to scale well.
Temporal Parallelism: In this method each job is divided into subtasks. For example, to examine four questions in an answer sheet we use four examiners, and each examiner examines one question. The term temporal means pertaining to time. It is also called overlapped parallelism, assembly-line processing, or pipeline processing.
This method is appropriate if:
1. The jobs to be carried out are identical.
2. A job can be divided into many independent tasks, i.e., each task can be done independently of the other tasks.
3. The time taken for each task is the same.
4. The number of tasks is much smaller than the total number of jobs to be carried out.
5. The time to send a job from one unit to another is small compared to the time needed to execute a task.

Advantages of this method:
1. It is simple.
2. Each unit only needs to be able to solve its own subtask rather than the whole job.

Problems of this method:
1. Synchronization: For a smooth flow of jobs, each task should take roughly equal time.
2. Bubbles: If some tasks are absent in a job, then the units whose tasks are absent remain idle; this forms bubbles in the pipeline.
3. Fault tolerance: The system does not tolerate faults. If one unit is damaged, the entire process stops.
4. Intertask communication: The time to send a job from one unit to another must be small compared to the time needed to execute a task, so one must pay attention to interunit communication.


Calculate the value of speedup:
Let
the number of jobs = n,
the time taken to complete one job = p, and
each job be divisible into k tasks (the number of processing units), each task taking equal time to execute. One unit is dedicated to a particular task of every job.
Thus the time taken to execute each task = p/k.
Time taken to complete n jobs without pipelining = np.
Time taken to complete n jobs with pipelining = p + (n − 1)p/k.
Speedup due to pipeline processing = np / [p + (n − 1)p/k] = nk / (k + n − 1) ≈ k if n >> k.
Since the speedup is (for large n) directly proportional to the number of processing units, this method scales well.
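A quick numerical check of the formula above (a sketch with made-up values of p, k and n):

/* Pipeline (temporal-parallelism) speedup: nk / (k + n - 1),
   which approaches k when the number of jobs n >> number of stages k. */
#include <stdio.h>

int main(void) {
    double p = 8.0;            /* time for one whole job (arbitrary units) */
    int k = 4;                 /* number of pipeline stages / units        */
    int jobs[] = { 4, 16, 64, 1024 };

    for (int j = 0; j < 4; j++) {
        int n = jobs[j];
        double t_serial   = n * p;
        double t_pipeline = p + (n - 1) * p / k;
        printf("n=%5d  speedup = %.2f (limit k = %d)\n",
               n, t_serial / t_pipeline, k);
    }
    return 0;
}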
Data Parallelism: In this method the set of jobs is divided among the units (not the tasks within each job). Since the jobs (input data) are divided into independent sets and processed simultaneously, it is called data parallelism. This method is useful in numerical computing, in which long vectors and large matrices are used as data.
Advantages of this method:
1. There is no interunit communication delay because each unit works separately.
2. No synchronization is needed because each unit can work at its own speed.
3. The problem of bubbles is absent; if any task in a job is absent, the processing time of that unit is simply reduced.
4. This method is fault tolerant; if one unit is damaged, the whole system does not stop.

Disadvantages of this method:
1. The assignment of jobs to each unit is predefined, which is called static assignment. If one unit is faster, it finishes its share of jobs in less time and then sits idle even though the total work is not complete. Likewise, if some subtasks of a job are absent, the unit doing that job finishes early. This problem can be solved by dynamic assignment.
2. The set of jobs must be partitionable into subsets of mutually independent jobs, and each subset should take about the same time to complete. The time to partition the jobs must be small.
3. Each unit must be able to solve all the subtasks of a job, rather than just one task as in temporal parallelism.

Data parallelism with dynamic assignment: In this method an extra unit (called the head) is needed, whose function is to assign jobs. If one unit completes its job more quickly than the others, more jobs are given to that unit. This solves the problem of static assignment discussed above. It has one drawback: as we increase the number of units, the speedup increases only up to a certain number of units; beyond that it stops increasing, because when the number of units is large the head unit cannot hand out jobs as quickly as the units need them, so many units remain idle and the speedup falls.
Calculate the value of speedup:
Let
the total number of jobs = n.
If no parallel processing is employed, then the time to do the n jobs = Σ pi (summed over i = 1, ..., n), where pi is the time to complete the ith job.
The number of units working in parallel = k.
The time a unit needs to get the ith job from the head unit and return it to the head unit, excluding processing time, = qi.
The time a unit needs to get the ith job from the head unit and return it after processing = qi + pi, where pi is the processing time.
Assuming that all units work simultaneously and their loads are perfectly balanced, the total time taken to complete the jobs by k units is

(1/k) Σ (qi + pi)   (summed over i = 1, ..., n).

Speedup = Σ pi / [(1/k) Σ (qi + pi)] = k / (1 + q/p),

where p = (Σ pi)/n and q = (Σ qi)/n. As long as q << p, the speedup approaches k.
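The sketch below (with invented per-job times pi and head-unit overheads qi) evaluates the speedup k / (1 + q/p) derived above and checks it against the direct ratio of serial to parallel time:

/* Data parallelism with dynamic assignment:
   speedup = sum(pi) / ((1/k) * sum(qi + pi)) = k / (1 + q/p). */
#include <stdio.h>

int main(void) {
    double pi[] = { 5.0, 7.0, 6.0, 5.5, 6.5, 7.5, 5.0, 6.0 };  /* processing times (made up)    */
    double qi[] = { 0.2, 0.3, 0.2, 0.2, 0.3, 0.2, 0.3, 0.2 };  /* head-unit overheads (made up) */
    int n = 8, k = 4;

    double sum_p = 0.0, sum_q = 0.0;
    for (int i = 0; i < n; i++) { sum_p += pi[i]; sum_q += qi[i]; }

    double t_serial   = sum_p;                 /* one unit, no parallelism            */
    double t_parallel = (sum_p + sum_q) / k;   /* k units, perfectly balanced loads   */
    printf("speedup = %.2f, formula k/(1+q/p) = %.2f\n",
           t_serial / t_parallel, k / (1.0 + (sum_q / n) / (sum_p / n)));
    return 0;
}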

Some features of parallel computers:
1. Better quality of solution: When arithmetic operations are distributed over many computers, each one does a smaller number of operations, so rounding errors are lower.
2. Better algorithms: The availability of many computers that can work simultaneously leads to different algorithms that are not relevant for purely sequential computers. It is possible to explore different facets of a solution simultaneously using several processors, which gives better insight into the solutions of several physical problems.
3. Better storage distribution: Certain types of parallel computing systems provide much larger storage, which is distributed. This feature is of special interest in many applications such as information retrieval and computer-aided design.
4. Greater reliability: In principle a parallel computer will keep working even if a processor fails. We can build parallel computer hardware and software for better fault tolerance.

Shared-Memory Architectures
The shared-memory architecture provides a very efficient medium for processes to exchange data. In our implementation, each task owns a shared buffer created with the shmget() system call. The task id is used as the "key" to the shared segment. If the key is already being used by another user, PVM will assign a different id to the task. A task communicates with other tasks by mapping their message buffers into its own memory space.
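For reference, a minimal sketch of the System V calls named above (this is not PVM's actual implementation; the key and segment size are arbitrary values chosen for illustration):

/* Minimal System V shared-memory sketch (not PVM's code): create a segment
   with shmget(), map it with shmat(), write into it, then detach and remove it. */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
    key_t key = 0x1234;                      /* in PVM, the task id plays the role of the key */
    int shmid = shmget(key, 4096, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    char *buf = (char *)shmat(shmid, NULL, 0);   /* map the segment into this address space */
    if (buf == (char *)-1) { perror("shmat"); return 1; }

    strcpy(buf, "message placed in the shared buffer");
    printf("%s\n", buf);

    shmdt(buf);                              /* unmap the segment            */
    shmctl(shmid, IPC_RMID, NULL);           /* remove the segment           */
    return 0;
}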
Vector Processing Requirements (VP requirements)
Examples of VPs are: the Vector Parallel Processor (VPP 500) by Fujitsu.
Characteristics of VPs
A vector operand contains an ordered set of n elements, where n is called the length of the vector. Each element in a vector is a scalar quantity, which may be of any data type. Vector instructions are classified into four primitive types:
f1 : V → V
f2 : V → S
f3 : V × V → V
f4 : V × S → V
The dot product of two vectors (V1 · V2) is generated by applying f3 (vector multiply) and f2 (vector sum) in sequence, i.e., V1 · V2 = Σ V1i · V2i.
Besides the above, some special vector instructions are used in pipelines.
Examples of vector instructions:
VMULF Vi, Vj, Vk, n : Vi ← Vj × Vk, where n is the vector length.
VSUMI Vi, Rj, n : Rj ← the sum of the n elements of Vi.
VLOAD Vi, MemAddr, sz, i1, n : load a vector from memory into vector register Vi, where
MemAddr is the start address of the vector,
sz is the size of the vector,
i1 is the interleave factor (address of element i minus address of element i−1), and
n is the number of elements to be loaded.
VSTORE Vi, MemAddr, sz, i1, n : store the vector Vi into memory.
Vector Processors
Vector-Vector Instructions: Examples are f1: Vi → Vj and f2: Vj × Vk → Vi, e.g., V1 = sin(V2), V3 = V1 + V2.
Vector-Scalar Instructions: An example is f3: s × Vi → Vj, where Vi and Vj are vectors of equal length, e.g., V2 = s × V1.
Vector-Memory Instructions: Examples are f4: M → V (vector load) and f5: V → M (vector store), where V and M denote a vector and memory respectively.
Vector-Reduction Instructions: Examples are f6: Vi → sj and f7: Vi × Vj → sk, such as finding the maximum, minimum, or sum of a vector, or the dot product of two vectors.
Gather and Scatter Instructions: f8: M → V1 × V2 (gather), f9: V1 × V2 → M (scatter). Gather is an operation that fetches from memory the nonzero elements of a sparse vector using indices that are themselves indexed; scatter does the opposite, storing into memory the nonzero entries of a sparse vector under control of an index vector. V1 contains the data and V2 contains the indices.
Masking Instructions: f10: Vj × Vm → Vk, where Vm is a mask vector.
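In scalar code the effect of gather and scatter can be written as two simple index loops (an illustrative sketch with made-up data; a real vector unit would perform each loop as a single instruction):

/* Gather and scatter expressed as scalar loops.
   V2 holds indices into memory M; V1 holds the data elements. */
#include <stdio.h>

int main(void) {
    double M[10] = { 0, 10, 20, 30, 40, 50, 60, 70, 80, 90 };  /* "memory"      */
    int    V2[4] = { 7, 2, 9, 4 };                             /* index vector  */
    double V1[4];

    for (int i = 0; i < 4; i++)      /* gather:  V1[i] = M[V2[i]]                        */
        V1[i] = M[V2[i]];

    for (int i = 0; i < 4; i++)      /* scatter: M[V2[i]] = V1[i] (written back doubled) */
        M[V2[i]] = 2.0 * V1[i];

    for (int i = 0; i < 4; i++)
        printf("V1[%d] = %.0f, M[%d] = %.0f\n", i, V1[i], V2[i], M[V2[i]]);
    return 0;
}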
Vector balance point: It is defined as the percentage of vector code in a program required to achieve equal utilization of the vector and scalar hardware. In other words, we require equal time spent in vector and scalar hardware so that no resources will be idle.
Register-to-Register Architecture: Cray Y-MP, C-90.
Memory-to-Memory Architecture: STAR-100 by CDC and CYBER 205.
8.3. Compound Vector Processing:
The handling of compound vector operations involves: i) compound vector functions, ii) multipipeline chaining, and iii) networking techniques.
Compound vector function: It is defined as a composite function of vector operations converted from a looping structure of linked scalar operations.
Vector looping or strip-mining: When a vector has a length greater than that of the vector registers, segmentation of the long vector into fixed-length segments is necessary. This technique is called vector looping or strip-mining.
Networking techniques: The idea of linking vector operations is the key concept of networking techniques.
Vectorization: It is the process of converting scalar looping operations into equivalent vector instruction execution.
Parallelization: It is defined as the process of converting sequential code into parallel form, which enables parallel execution by multiple processors.
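Strip-mining can be pictured in scalar code as an outer loop that walks a long vector in register-sized strips (a sketch only; the strip length 64 is an assumed vector-register length, not a figure from the text):

/* Strip-mining (vector looping): process a long vector in fixed-length
   strips no longer than the vector register length (assumed 64 here). */
#include <stdio.h>

#define VLEN 64          /* assumed vector-register length */
#define N    200         /* total vector length            */

int main(void) {
    double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

    for (int start = 0; start < N; start += VLEN) {        /* one strip per iteration        */
        int len = (N - start < VLEN) ? (N - start) : VLEN;  /* the last strip may be shorter  */
        /* this inner loop is what a single vector instruction would execute */
        for (int i = 0; i < len; i++)
            y[start + i] += 2.0 * x[start + i];
    }

    printf("y[0]=%.1f  y[199]=%.1f\n", y[0], y[N - 1]);
    return 0;
}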

Multiprocessors:
A multiprocessor is an MIMD computer system, characterized by the presence of several CPUs or, more generally, processing elements (PEs), which cooperate on common or shared computational tasks. Multiprocessors are distinguished from multicomputer systems and computer networks, which are systems with multiple PEs operating independently on separate tasks. The various PEs making up a multiprocessor typically share such resources as communication facilities, I/O devices, program libraries, etc., and are controlled by a common operating system.
The two main reasons for including multiple PEs in a single computer system are to improve performance and to increase reliability. Performance improvement is obtained either by allowing many PEs to share the computation load associated with a single large task, or else by allowing many smaller tasks to be performed in parallel in separate PEs.
Types of Multiprocessors:
Multiprocessors can be classified by the organization of their main-memory systems as:
i) Shared-memory systems (computers), or tightly coupled multiprocessors
ii) Distributed-memory systems (computers), or loosely coupled multiprocessors
If the main memory, or a major portion thereof, can be directly accessed by all the PEs of a multiprocessor, then the system is termed a shared-memory computer, and the shared portion of main memory is called global memory. Information can therefore be shared among the processors simply by placing it in the global memory.
Distributed-memory computers, on the other hand, have no global memory. Instead each processing element PEi has its own private or local main memory Mi, and the combination of PEi and Mi is referred to as the processor Pi.
In shared-memory systems in which all of main memory is global, the terms processing element and processor coincide.
Distributed-memory systems share information by transmitting it in the form of messages between the local memories of different processors; such message passing requires a series of relatively slow I/O operations.
Operating System:
Multiprocessors impose special requirements both on the operating system needed to control them and on the programming techniques needed to make effective use of their parallelism. The operating system for a multiprocessor resembles that of a uniprocessor with multiprogramming capability, since it must allow multiple users to timeshare the system resources, which now include multiple CPUs, in a non-conflicting manner.
To prevent conflicting use of each shared resource by several processes, the operating system function associated with allocating that resource should be exercised by only one CPU at a time. There are three general ways of organizing the operating system of a multiprocessor.
Lecture 01
Parallel Computer Structure:
Parallel computer structures can be characterized as pipelined computers, array processors and multiprocessor systems.

Evolution of Computer Systems:

Relays and vacuum tubes (1940s–1950s)

Diodes and transistors (1950s–1960s)

Small- and medium-scale integrated (SSI/MSI) circuits (1960s–1970s)

Large- and very-large-scale integrated (LSI/VLSI) devices (1970s and beyond)

Increase in device speed and reliability and reductions in hardware cost and physical size have greatly enhanced computer performance.
To design a powerful and cost effective computer system and to devise efficient programs to solve a computational problem, one must
understand the underlying hardware and software system structures and the computing algorithms to be implemented on the machine
with some user-oriented programming languages.

Generation of Computer Systems:


The division of computer systems into generations is determined by the device technology, system architecture, processing mode, and
languages used.

The First Generation (1938–1953) - The ENIAC (Electronic Numerical Integrator and Computer), the first widely known general-purpose electronic computer, was introduced during this period. The CPU structure was bit-serial, i.e., arithmetic was done on a bit-by-bit fixed-point basis. In 1950, the first stored-program computer, EDVAC (Electronic Discrete Variable Automatic Computer), was developed.

The Second Generation (1952–1963) - The first transistorized digital computer, TRADIC, was built by Bell Laboratories in 1954. Assembly languages were used until the development of high-level languages: Fortran in 1956 and Algol (algorithmic language) in 1960. The first IBM scientific transistorized computer, the IBM 1620, became available in 1959.

The Third Generation (1962–1975) - Computers began to use SSI and MSI circuits, such as the CDC 6600 and CDC 7600. Many high-performance computers, like the IBM 360/91, Illiac IV, TI-ASC and several vector processors, were developed in the early seventies.

The Fourth Generation (1972–1985) - LSI circuits were used. Most operating systems were time-sharing. A high degree of pipelining and multiprocessing was greatly emphasized in commercial supercomputers. A massively parallel processor (MPP) was custom-designed in 1982.

The Fifth Generation (1985–present) - VLSI chips are used along with high-density modular design. Multiprocessing and parallel processing have been achieved. More than several thousand mega floating-point operations per second (MFLOPS) are possible.

Lecture 02
Parallel Processing
Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process. Concurrency implies parallelism, simultaneity and pipelining. Parallel processing demands concurrent execution of many programs in the computer. It is a cost-effective means to improve system performance through concurrent activities in the computer.
Trends Towards Parallel Processing
From the viewpoint of an OS, computer systems have improved chronologically in four phases:

Batch processing

Multiprogramming

Time sharing

Multiprocessing

In these four operating modes, the degree of parallelism increases sharply from phase to phase. The highest level of parallel processing is
conducted among multiple jobs or programs through multiprogramming, time sharing and multiprocessing. This level requires the
development of parallel processable algorithms.
As far as parallel processing is concerned, the general architectural trend is being shifted away from conventional uniprocessor systems to
multiprocessor systems or to an array of processing elements controlled by one uniprocessor.
Parallelism in Uniprocessor Systems
Basic Uniprocessor Architecture:
A typical uniprocessor computer consists of three major components: i) the main memory, ii) the central processing unit (CPU), and iii) the input-output (I/O) subsystem [See Fig. 1.3, page 9].
Parallel Processing Mechanisms:
A number of parallel processing mechanisms have been developed in uniprocessor computers. They can be categorized into the following
six groups:

Multiplicity of functional units - The early computers had only one arithmetic and logic unit (ALU) and could perform only one function at a time. In practice, many of the functions of the ALU can be distributed to multiple and specialized functional units which can operate in parallel [See Fig. 1.5].

Parallelism and pipelining within the CPU - The use of multiple functional units is a form of parallelism within the CPU. Various phases of instruction execution are now pipelined, including instruction fetch, decode, operand fetch, arithmetic-logic execution, and storing of the result.

Overlapped CPU and I/O operations - I/O operations can be performed simultaneously with the CPU computations by using separate I/O controllers, channels, or I/O processors. The direct-memory access (DMA) channel can be used to provide direct information transfer between the I/O devices and the main memory.

Use of hierarchical memory system - Usually, the CPU is about 1000 times faster than memory access. A hierarchical memory system can be used to close up the speed gap [See Fig. 1.6]. Cache memory can be used to serve as a buffer between the CPU and the main memory.

Balancing of subsystem bandwidths - In general, the CPU is the fastest unit in a computer, with a processor cycle tp of tens of nanoseconds; the main memory has a cycle time tm of hundreds of nanoseconds; and the I/O devices are the slowest, with an average access time td of a few milliseconds. It is thus observed that

td > tm > tp

The bandwidth of a system is defined as the number of operations performed per unit time. In the case of a main memory system, the memory bandwidth is measured by the number of memory words that can be accessed per unit time. Let W be the number of words delivered per memory cycle tm; then the maximum memory bandwidth Bm is given by

Bm = W / tm   (words/s or bytes/s)
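For example (assumed numbers, not from the text), a memory that delivers 8 words of 8 bytes each per 100 ns cycle has the bandwidth computed below:

/* Maximum memory bandwidth Bm = W / tm, with assumed example values. */
#include <stdio.h>

int main(void) {
    double W  = 8.0;        /* words delivered per memory cycle (assumed)     */
    double tm = 100e-9;     /* memory cycle time in seconds (assumed 100 ns)  */

    double Bm = W / tm;     /* words per second */
    printf("Bm = %.2e words/s (= %.2e bytes/s for 8-byte words)\n", Bm, Bm * 8.0);
    return 0;
}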

Multiprogramming and time sharing -

Lecture 03
Parallel Computers Structures
Parallel computers can be divided into three architectural configurations:
1. Pipeline computers
2. Array processors
3. Multiprocessor systems

A pipeline computer performs overlapped computations to exploit temporal parallelism. An array processor uses multiple synchronized
arithmetic logic units to achieve spatial parallelism. A multiprocessor system achieves asynchronous parallelism through a set of
interactive processors with shared resources (memory, database, etc.).
The fundamental difference between an array processor and a multiprocessor system is that the processing elements in an array processor
operate synchronously but processors in a multiprocessor system may operate asynchronously.

1. Pipeline Computers
Usually the process of executing an instruction in a digital computer involves four major steps: i) instruction fetch (IF) from the main memory; ii) instruction decoding (ID), identifying the operation to be performed; iii) operand fetch (OF), if needed in the execution; and iv) execution (EX) of the decoded arithmetic-logic operation. In a nonpipelined computer, these four steps must be completed before the next instruction can be issued. In a pipelined computer, on the other hand, successive instructions are executed in an overlapped fashion as illustrated in Fig 1.10 [See page 21]. The four pipeline stages IF, ID, OF and EX are arranged into a linear cascade.
An instruction cycle consists of multiple pipeline cycles. A pipeline cycle can be set equal to the delay of the slowest stage. The flow of data from stage to stage is triggered by a common clock of the pipeline. Interface latches are used between adjacent segments to hold the intermediate results. In a nonpipelined computer, it takes four pipeline cycles to complete one instruction; once a pipeline is filled up, an output result is produced from the pipeline on each cycle. The instruction cycle has been effectively reduced to one-fourth of the original cycle time by such overlapped execution.

2. Array Computers
An array processor is a synchronous parallel computer with multiple arithmetic logic units, called processing elements (PEs), that can operate in parallel in a lock-step fashion. The PEs are synchronized to perform the same function at the same time. An appropriate data-routing mechanism must be established among the PEs. A typical array processor is depicted in Fig. 1.12 [See page 24]. Each PE consists of an ALU with registers and a local memory. The PEs are interconnected by a data-routing network. The interconnection pattern to be established for a specific computation is under program control from the control unit (CU). Vector instructions are broadcast to the PEs for distributed execution over different component operands fetched directly from the local memories. Instruction fetch and decode are done by the control unit. The PEs are passive devices without instruction decoding capabilities.

3. Multiprocessor Systems
A basic multiprocessor organization is conceptually depicted in Fig. 1.13 [See page 26]. The system contains two or more processors of approximately comparable capabilities. All processors share access to common sets of memory modules, I/O channels, and peripherals. The entire system must be controlled by a single integrated operating system providing interactions between processors and their programs at various levels. Besides the shared memories and I/O devices, each processor has its own local memory and private devices. Interprocessor communications can be done through the shared memories or through an interrupt network.

Lecture 04
Memory and Input-Output subsystem: Hierarchical memory structure, Virtual memory systems, Memory allocation and management,
Cache memories and management, Input-Output subsystems.

Memory Hierarchy:
Memories in a hierarchy can be classified on the basis of several attributes. One common attribute is the accessing method, which divides the memories into three basic classes: i) random access memory (RAM), ii) sequential access memory (SAM), and iii) direct access storage devices (DASD). In RAM, the access time ta of a memory word is independent of its location. In SAM, information is accessed serially or sequentially, as in a first-in first-out (FIFO) buffer, charge-coupled devices (CCDs), and magnetic bubble memories. DASDs are rotational devices made of magnetic materials where any block of information can be accessed directly.
Another attribute often used to classify memory is the speed or access time of the memory. In most computer systems, the memory
hierarchy is often organized so that the highest level has the fastest memory speed and the lowest level has the slowest speed. On the
basis of access time, memory can be further classified into primary memory and secondary memory.

In general, the memory hierarchy is structured so that memories at lower-numbered levels are faster, smaller, and costlier per byte than those at higher-numbered levels. If ci, ti and si are respectively the cost per byte, average access time, and total memory size at level i, the following relationships normally hold between levels i and i+1: ci > ci+1, ti < ti+1, and si < si+1 for i >= 1.
In modeling the performance of a hierarchical memory, it is often assumed that the memory management policy is characterized by a success function or hit ratio H, which is the probability of finding the requested information in the memory of a given level. It has been found that H is most sensitive to the memory size s; hence the success function may be written as H(s). The miss ratio (or miss probability) is then F(s) = 1 - H(s). Since the copies of information in level i are assumed to exist also in levels greater than i, the probability of a hit at level i and of misses at levels 1 to i-1 is:

hi = H(si) - H(si-1)

where hi is the access frequency at level i and indicates the relative number of successful accesses to level i. The missing-item fault frequency at level i is then fi = 1 - hi.
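Given cumulative hit-ratio values H(si) for each level (the numbers below are hypothetical), the access frequencies hi follow directly from the formula above:

/* Access frequency per memory level: hi = H(si) - H(s_{i-1}), with H(s0) = 0.
   The hit-ratio values H(si) below are hypothetical. */
#include <stdio.h>

int main(void) {
    double H[] = { 0.90, 0.97, 0.999, 1.0 };   /* H(s1)..H(s4) for four levels (made up) */
    int levels = 4;

    double prev = 0.0;                         /* H(s0) = 0 */
    for (int i = 0; i < levels; i++) {
        double hi = H[i] - prev;               /* fraction of accesses satisfied at level i+1 */
        printf("level %d: h = %.3f  (missing-item fault frequency f = %.3f)\n",
               i + 1, hi, 1.0 - hi);
        prev = H[i];
    }
    return 0;
}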

Virtual Memory Concept:


Memory Allocation and Management
Classification of Memory Policies:
Two classes of memory management policies are often used in multiprogramming systems: fixed partitioning and variable partitioning.

Cache memories and management


Characteristics & Organizations
