
CS230 Bachelor of Computer Science UiTM Perak (Tapah)

______________________________________________________________________________

CSC580
PARALLEL PROCESSING

Lecture 1: Introduction to Parallel Computing

• What is Parallel Computing?


- In order to understand parallel computing, we must first understand the meaning
of serial computing

Serial Computing:

• Traditionally, software has been written for serial computation:


o A problem is broken into a discrete series of instructions
o Instructions are executed sequentially one after another
o Executed on a single processor
o Only one instruction may execute at any moment in time

Parallel Computing:

• In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
o A problem is broken into discrete parts that can be solved concurrently
o Each part is further broken down to a series of instructions
o Instructions from each part execute simultaneously on different
processors
o An overall control/coordination mechanism is employed

• Why Use Parallel Computing?


- Main reasons
o Save time and money
o Solve larger / more complex problems
o Provide concurrency
o Take advantage of non-local resources
o Make better use of underlying parallel hardware

• Motivating Parallelism
- The role of parallelism includes:
o To accelerate (speed up) computing speeds
o To provide multiplicity of datapaths and increased access to storage
elements

• Scope of Parallel Computing


- Who is Using Parallel Computing?
o Applications in Engineering and Design
o Scientific Applications
o Commercial Applications
o Applications in Computer Systems

• von Neumann Architecture


- Comprised of four main components:
o Memory
o Control Unit
o Arithmetic Logic Unit
o Input/Output

- Parallel computers still follow this basic design, just multiplied in units. The
basic, fundamental architecture remains the same.

• Flynn's Classical Taxonomy

- There are different ways to classify parallel computers.


- One of the more widely used classifications, in use since 1966, is called Flynn's
Taxonomy.
- Flynn's taxonomy distinguishes multi-processor computer architectures according
to how they can be classified along the two independent dimensions
of Instruction Stream and Data Stream. Each of these dimensions can have only
one of two possible states: Single or Multiple.
- The 4 possible classifications according to Flynn are:

a) Single Instruction, Single Data (SISD):

• A serial (non-parallel) computer


• Single Instruction: Only one instruction stream is being acted on by the CPU
during any one clock cycle
• Single Data: Only one data stream is being used as input during any one
clock cycle
• Deterministic execution
• This is the oldest type of computer
• Examples: older generation mainframes, minicomputers, workstations and
single processor/core PCs.

b) Single Instruction, Multiple Data (SIMD):

• A type of parallel computer


• Single Instruction: All processing units execute the same instruction at any
given clock cycle
• Multiple Data: Each processing unit can operate on a different data element
• Best suited for specialized problems characterized by a high degree of
regularity, such as graphics/image processing.
• Synchronous (lockstep) and deterministic execution
• Two varieties: Processor Arrays and Vector Pipelines
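
As an illustration, the C loop below applies the same operation to every element of an array; a vectorizing compiler can map such a loop onto SIMD instructions, with one instruction operating on several data elements per clock. This is a minimal sketch, not tied to any particular processor array or vector machine:

#include <stdio.h>

#define N 8

int main(void) {
    float a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[N];

    /* Same instruction (add), many data elements: a vectorizing
       compiler can execute several iterations of this loop in
       lockstep with a single SIMD instruction. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}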

c) Multiple Instructions, Single Data (MISD):

• A type of parallel computer


• Multiple Instructions: Each processing unit operates on the data independently
via separate instruction streams.
• Single Data: A single data stream is fed into multiple processing units.
• Few (if any) actual examples of this class of parallel computer have ever
existed.
• Some conceivable uses might be:
o multiple frequency filters operating on a single signal stream
o multiple cryptography algorithms attempting to crack a single coded
message.

d) Multiple Instructions, Multiple Data (MIMD):

• A type of parallel computer


• Multiple Instruction: Every processor may be executing a different instruction
stream
• Multiple Data: Every processor may be working with a different data stream
• Execution can be synchronous or asynchronous, deterministic or non-
deterministic
• Currently, the most common type of parallel computer - most modern
supercomputers fall into this category.
• Examples: most current supercomputers, networked parallel computer clusters
and "grids", multi-processor SMP computers, multi-core PCs.
• Note: many MIMD architectures also include SIMD execution sub-components

Lecture 2: Parallel Platforms – Part 1

• Implicit Parallelism: Trends in Microprocessor Architectures

- In general, conventional (old-fashioned) architectures comprise a few components: a processor, a memory system and a datapath
- Each component may present a performance bottleneck (something that causes a process to slow down)
- Parallelism may help to relieve these bottlenecks; however, different components call for different aspects of parallelism
o For example: data intensive applications utilize high aggregate throughput,
server applications utilize high aggregate network bandwidth, and
scientific applications typically utilize high processing and memory
system performance
- In current technology, higher levels of device integration place very large numbers of transistors at the microprocessor designer's disposal
- Current processors therefore use these resources (transistors, etc.) to execute multiple instructions in the same cycle

❖ Machine Cycle
❖ The steps performed by the computer processor for each machine-language instruction received. The machine cycle is a four-step cycle that includes reading and interpreting a machine-language instruction, executing it, and storing the result.
❖ Four steps of the machine cycle:
1) Fetch - Retrieve an instruction from memory.
2) Decode - Translate the retrieved instruction into a series of computer commands.
3) Execute - Execute the computer commands.
4) Store - Send and write the results back to memory.
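
As a sketch of these four steps, the toy interpreter below runs a hypothetical two-instruction machine (the ADD and HALT opcodes and the accumulator are invented for illustration):

#include <stdio.h>

/* Hypothetical instruction format: an opcode and one operand. */
enum { HALT = 0, ADD = 1 };
typedef struct { int opcode; int operand; } Instr;

int main(void) {
    Instr memory[] = { {ADD, 5}, {ADD, 7}, {HALT, 0} };  /* program */
    int pc = 0;       /* program counter */
    int acc = 0;      /* accumulator */
    int running = 1;

    while (running) {
        Instr ir = memory[pc++];        /* 1) Fetch                     */
        switch (ir.opcode) {            /* 2) Decode                    */
        case ADD:
            acc = acc + ir.operand;     /* 3) Execute                   */
            break;                      /* 4) Store: result kept in acc */
        case HALT:
            running = 0;
            break;
        }
    }
    printf("acc = %d\n", acc);  /* prints acc = 12 */
    return 0;
}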

❖ Pipelining and superscalar execution


❖ Pipelining
❖ An instruction pipeline is a technique used in the design of computers to
increase their instruction throughput (the number of instructions that can
be executed in a unit of time).
❖ The basic instruction cycle is broken up into a series of steps called a pipeline.

❖ Rather than processing each instruction sequentially, each instruction is split up into a sequence of steps so that different steps can be executed concurrently and in parallel
❖ Pipelining increases instruction throughput by performing multiple
operations at the same time
❖ Pipelining overlaps various stages of instruction execution to achieve
performance.
❖ At a high level of abstraction, an instruction can be executed while the
next one is being decoded and the next one is being fetched.
❖ (This is very similar to the car manufacturing process and the fish-frying example given in class.)
o Let's say that frying a fish takes 1 hour to make sure it is well cooked (half an hour for each side), and assume that the pan can fry only two fish at a time. What is the minimum time needed to fry 3 fish? The answer is 1.5 hours, not 2: fry the first sides of fish A and B (30 min), then the second side of A with the first side of C (30 min), then the second sides of B and C (30 min). Keeping the pan fully busy by overlapping work is exactly what pipelining does with instruction stages.

Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). In the fourth clock cycle (the green column), the earliest instruction is in the MEM stage, and the latest instruction has not yet entered the pipeline.

❖ However, pipelining has several limitations:


o The speed of a pipeline is limited by the slowest stage
o Multiple pipelines can reduce bottlenecks
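
The payoff of pipelining can be quantified with a simple model (a sketch assuming one clock per stage and no stalls): executing n instructions on a k-stage pipeline takes k + (n − 1) cycles instead of the n × k cycles needed serially, a speedup of nk / (k + n − 1), which approaches k for long instruction streams. For the five-stage pipeline above, 1000 instructions take 1004 cycles rather than 5000.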

❖ Superscalar
❖ Superscalar design involves the processor being able to issue multiple
instructions in a single clock, with redundant facilities to execute an
instruction.
❖ A superscalar processor executes more than one instruction during a clock
cycle by simultaneously dispatching multiple instructions to different
functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU, such as an arithmetic logic unit, a bit shifter, or a multiplier.

Simple superscalar pipeline. By fetching and dispatching two instructions at a time, a maximum of two instructions per cycle can be completed. (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back, i = Instruction number, t = Clock cycle [i.e., time])

❖ Scheduling of instructions is determined by a number of factors (illustrated in the code sketch after this list):


o True Data Dependency: The result of one operation is an input to
the next.
o Resource Dependency: Two operations require the same resource.
o Branch Dependency: Scheduling instructions across conditional
branch statements cannot be done deterministically a-priori.
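
A minimal C sketch of the three dependency types (the variable names and the single-multiplier assumption are illustrative):

int schedule_demo(int b, int c) {
    int a, d, e, f;

    a = b + c;   /* True data dependency: the next statement needs a, */
    d = a * 2;   /* so it cannot be issued before the add completes.  */

    e = b * c;   /* Resource dependency: these two mutually           */
    f = c * d;   /* independent multiplies still compete for a        */
                 /* single multiplier unit.                           */

    if (a > 0)   /* Branch dependency: the statements below cannot be */
        e += 1;  /* scheduled deterministically a priori, since the   */
    else         /* path taken depends on the run-time value of a.    */
        f -= 1;

    return e + f;
}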

❖ Advantages and Disadvantages of Superscalars

❖ Very long instruction word processors


❖ The performance of superscalar execution is eventually limited by instruction-issue constraints (whether issue is done in order or dynamically).
❖ To address these issues, VLIW processors rely on compile time analysis to
identify and bundle together instructions that can be executed
concurrently.
❖ These instructions are packed and dispatched together and thus the name
very long instruction word.

• Limitations of Memory System Performance


– The memory system (and not processor speed) is often the bottleneck for many applications.
– Memory system performance is largely captured by two parameters, latency and
bandwidth.
– Latency is the time from the issue of a memory request to the time the data is
available at the processor.
o latency is the time it takes to send a minimal (0 byte) message from point
A to point B. Commonly expressed as microseconds.
– Bandwidth is the rate at which data can be pumped to the processor by the
memory system.
o bandwidth is the amount of data that can be communicated per unit of
time. Commonly expressed as megabytes/sec or gigabytes/sec.
– Sending many small messages can cause latency to dominate communication
overheads. Often it is more efficient to package small messages into a larger
message, thus increasing the effective communications bandwidth.
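
A simple cost model shows why aggregation helps (a sketch, assuming a fixed per-message latency l and bandwidth B): transferring s bytes costs roughly t(s) = l + s/B. Sending n small messages costs n·l + ns/B, while one aggregated message of ns bytes costs l + ns/B, saving (n − 1)·l of pure latency. With illustrative numbers l = 50 µs and B = 100 MB/s, a thousand 1 KB messages cost about 1000 × (50 µs + 10 µs) = 60 ms, whereas a single 1 MB message costs about 50 µs + 10 ms ≈ 10 ms.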

❖ Improving effective memory latency using caches


– Caches are small and fast memory elements between the processor and
DRAM.
– This memory acts as a low-latency high-bandwidth storage.
– If a piece of data is repeatedly used, the effective latency of this memory
system can be reduced by the cache.
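
The effect is often summarized by a simple model (a sketch, assuming a cache hit rate h, cache access time t_cache, and DRAM access time t_mem): the effective latency is approximately h·t_cache + (1 − h)·t_mem. With illustrative numbers t_cache = 1 ns, t_mem = 100 ns, and h = 90%, the effective latency is 0.9 × 1 + 0.1 × 100 = 10.9 ns, roughly nine times better than going to DRAM on every access.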

❖ Impact of memory bandwidth


– Memory bandwidth is determined by the bandwidth of the memory bus as
well as the memory units.
– Memory bandwidth can be improved by increasing the size of memory
blocks.
– The underlying system takes l time units (where l is the latency of the
system) to deliver b units of data (where b is the block size).
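
A worked example under this model (the numbers are illustrative): if l = 100 ns and each request returns a block of b = 1 word, the memory system delivers at most one word per 100 ns. Increasing the block size to b = 4 words delivers four words per 100 ns for consecutively stored data, quadrupling the effective bandwidth at the same latency, provided the program actually uses all the words in each block (i.e., it exhibits spatial locality).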

❖ Tradeoffs of multithreading and prefetching


❖ Alternate Approaches for Hiding Memory Latency
– Consider the problem of browsing the web on a very slow network
connection. We deal with the problem in one of three possible ways:
o Prefetching: we anticipate which pages we are going to
browse ahead of time and issue requests for them in advance;
o Multithreading: we open multiple browsers and access
different pages in each browser, thus while we are waiting for
one page to load, we could be reading others; or
o Spatial locality: we access a whole bunch of pages in one go, amortizing the latency across the various accesses.
– Multithreading and prefetching are critically impacted by the memory
bandwidth.
– Multithreading and prefetching only address the latency problem and may often make the bandwidth problem worse.
– Multithreading and prefetching also require significantly more
hardware resources in the form of storage.
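
A minimal pthreads sketch of hiding latency with multithreading (the 100 ms sleep stands in for one slow memory or network access; the numbers are illustrative): issued one after another, four such requests would take about 400 ms, but with four threads the waits overlap and the program finishes in roughly 100 ms of wall-clock time. Compile with cc -pthread.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NREQ 4

/* Each thread simulates one long-latency access (e.g., a page load
   or a remote memory fetch) by sleeping for about 100 ms. */
static void *fetch(void *arg) {
    int id = *(int *)arg;
    usleep(100 * 1000);               /* the "latency" */
    printf("request %d done\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NREQ];
    int ids[NREQ];

    /* All four requests are outstanding at the same time, so their
       latencies overlap instead of adding up. */
    for (int i = 0; i < NREQ; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, fetch, &ids[i]);
    }
    for (int i = 0; i < NREQ; i++)
        pthread_join(threads[i], NULL);
    return 0;
}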

• Dichotomy of Parallel Computing Platforms


❖ Control Structure of Parallel Platforms
– Processing units in parallel computers either operate under the centralized
control of a single control unit or work independently.
o If there is a single control unit that dispatches the same instruction
to various processors (that work on different data), the model is
referred to as single instruction stream, multiple data stream
(SIMD).
o If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).

❖ Communication Model of Parallel Platforms


– There are two primary forms of data exchange between parallel tasks -
accessing a shared data space and exchanging messages.
o Platforms that provide a shared data space are called shared-
address-space machines or multiprocessors.
o Platforms that support messaging are also called message passing platforms or multicomputers.
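
As an illustration of the message-passing style, the sketch below passes one integer between two processes. It assumes an MPI implementation (e.g., MPICH or Open MPI) is installed and is run with mpirun -np 2; on a shared-address-space machine the same exchange would simply be an assignment to a shared variable.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicit send: the processes share no memory. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}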

❖ Shared-Address-Space Platforms
– Part (or all) of the memory is accessible to all processors.
– Processors interact by modifying data objects stored in this shared address space.
– If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.

(Figures: UMA and NUMA shared-address-space architectures)

❖ Network Topologies

a) Buses

b) Crossbars

c) Multistage Networks

d) Omega Network

• The perfect shuffle patterns are connected using 2×2 switches.


• The switches operate in two modes – crossover or passthrough.

(Figures: a 2×2 switch in pass-through and crossover modes)



A complete omega network connecting eight inputs and eight outputs
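
For n = 2^k endpoints, the perfect shuffle connects input i to the left rotation of i's k-bit binary representation. A minimal sketch that prints the wiring for the eight-input network above:

#include <stdio.h>

/* Perfect shuffle for n = 2^k endpoints: left-rotate the k-bit index,
   i.e., shuffle(i) = (2i mod n) plus the bit shifted out the top.    */
static int shuffle(int i, int n) {
    return (2 * i) % n + (2 * i) / n;
}

int main(void) {
    int n = 8;  /* eight inputs/outputs, as in the figure above */
    for (int i = 0; i < n; i++)
        printf("%d -> %d\n", i, shuffle(i, n));
    return 0;
}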

e) Completely-Connected Network

f) Star Connected Network



g) Linear Arrays

h) Two and Three Dimensional Meshes

Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh
with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.

i) Hypercubes

j) Tree-Based Network

Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.

References:
https://computing.llnl.gov/tutorials/parallel_comp/

http://www.computerhope.com/jargon/m/machcycl.htm

http://en.wikipedia.org/wiki/Instruction_pipeline

http://en.wikipedia.org/wiki/Superscalar
