ACA Unit 4

Unit 4
THREAD LEVEL PARALLELISM
1)(i) Describe the basic structure of a centralized shared memory multiprocessor in detail. (Apr/May 2011)
(ii) Describe the implementation of directory-based cache coherence.

2)(i) What are the advantages and disadvantages of distributed-memory multiprocessors? Describe
the basic structure of a distributed-memory multiprocessor in detail.
(ii) Describe the sequential and relaxed consistency models. (Apr/May 2011)
OR What is the need for a memory consistency model? Explain its various types.

3) Explain the functions of symmetric shared memory architecture. Compare it with distributed
shared memory architecture. (May/June 2014)
Refer to Q.No 1 and 2
4) Discuss in detail the performance issues in symmetric and distributed shared memory
architectures. (Apr/May 2015)
5)(i) Discuss the various cache-coherence protocols used in symmetric shared memory
architectures.
(ii) What are the hardware primitives to resolve synchronization issues in a multiprocessor
environment? Give examples. (May/June 2012)
6) Discuss the performance of symmetric shared memory multiprocessors for a
multiprogrammed workload consisting of both user activity and OS activity. (May/June 2012)
7)(i) Describe the architecture of a typical CMT processor.
(ii) Briefly compare SMT and CMP architectures. (Nov/Dec 2011)
OR
Discuss SMT and CMP architecture in detail. (May/June 2013)
DESIGN ISSUES:
SMT and CMP Architectures

The design issues determine the performance of each processor. Limits on how the issue
slots can be used also determine performance.
Why Multithreading
Today ILP is largely exhausted and TLP is in favor. There is a large performance gap between memory and
processor. Many transistors are available on a chip. More multithreaded (MT) applications exist today.
Multiprocessors fit on a single chip. Network latency is long, too.
5.3.1 DESIGN CHALLENGES OF SMT
What is the impact of fine-grained scheduling on single-thread performance?
At first glance, a preferred-thread approach appears to sacrifice neither throughput nor single-thread performance.
Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput when the preferred thread stalls.
Reason for loss of throughput
The pipeline is less likely to hold a mix of instructions from several threads, resulting in a
greater probability that either empty slots or a stall will occur.
Design Challenges
A larger register file is needed to hold multiple contexts.
The clock cycle time must not be affected, especially in:
Instruction issue: more candidate instructions need to be considered
Instruction completion: choosing which instructions to commit may be
challenging
Cache and TLB conflicts generated by SMT must not degrade performance.
There are two main observations:
The potential performance overhead due to multithreading is small.
The efficiency of current superscalars is low, which leaves room for significant
improvement.
An SMT processor works well if:
The number of compute-intensive threads does not exceed the number of threads supported by the SMT hardware.
The threads have highly different characteristics, e.g.
one thread doing mostly integer operations and another doing mostly floating-point operations.
It does not work well if threads try to use the same functional units, or if threads are
assigned to processors poorly.
E.g., in a dual-core processor system where each core runs 2 threads simultaneously,
2 compute-intensive application processes might end up on the same core
instead of on different cores.
The problem here is that the operating system does not see the difference between SMT logical processors and real
processors.
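The assignment problem above can be sketched in a few lines. This is only an illustration: the topology table and both placement functions are hypothetical and do not model any real operating system scheduler. It assumes logical CPUs 0 and 1 share core 0, and logical CPUs 2 and 3 share core 1.

```python
from collections import defaultdict

# Hypothetical topology: logical CPU id -> physical core id.
# Two cores with two SMT hardware threads each, as in the example above.
TOPOLOGY = {0: 0, 1: 0, 2: 1, 3: 1}

def place_naive(num_tasks, topology):
    """Fill logical CPUs in numeric order, ignoring which core they share."""
    return sorted(topology)[:num_tasks]

def place_smt_aware(num_tasks, topology):
    """Pick, for each task, a free logical CPU on the least-loaded core."""
    load = defaultdict(int)              # physical core -> tasks placed so far
    free = sorted(topology)
    chosen = []
    for _ in range(min(num_tasks, len(free))):
        cpu = min(free, key=lambda c: (load[topology[c]], c))
        free.remove(cpu)
        load[topology[cpu]] += 1
        chosen.append(cpu)
    return chosen

def cores_used(cpus, topology):
    """Distinct physical cores that a placement occupies."""
    return sorted({topology[c] for c in cpus})

naive = place_naive(2, TOPOLOGY)
aware = place_smt_aware(2, TOPOLOGY)
print("naive:", naive, "cores:", cores_used(naive, TOPOLOGY))
print("aware:", aware, "cores:", cores_used(aware, TOPOLOGY))
```

With two compute-intensive tasks, the naive placement puts both on one core, where they compete for the same functional units, while the topology-aware placement spreads them over both cores.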
Transient Faults
Faults that persist for a short duration.
Cause: cosmic rays (e.g., neutrons).
Effect: they knock off electrons and discharge capacitors.
Solution: shielding does not help, since there is no practical absorbent for cosmic rays.
Estimated fault rate: 1 fault per 1000 computers per year.
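To get a feel for the estimated rate, the arithmetic can be scaled to a fleet of machines. The sketch below uses only the figure quoted above. The function names are illustrative, and the probability calculation assumes faults are independent and Poisson-distributed, which these notes do not state.

```python
import math

# From the estimate above: 1 fault per 1000 computers per year.
COMPUTERS_PER_FAULT_YEAR = 1000

def expected_faults(num_computers, years=1.0):
    """Expected number of transient faults across a fleet."""
    return num_computers * years / COMPUTERS_PER_FAULT_YEAR

def prob_at_least_one(num_computers, years=1.0):
    """Chance of one or more faults (assumption: Poisson arrivals)."""
    return 1.0 - math.exp(-expected_faults(num_computers, years))

print(expected_faults(10_000))      # fleet of 10,000 machines for one year
print(prob_at_least_one(1_000))     # fleet of 1,000 machines for one year
```

Even at a low per-machine rate, a large installation should expect several transient faults per year, which is why detection matters.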
Processor Utilization vs. Latency
R = the run length to a long-latency event
L = the amount of latency
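The notes define R and L but give no formula. A common textbook simplification, assumed here rather than taken from these notes, is that a single thread keeps the processor busy for R out of every R + L cycles. With N threads and a context-switch cost C, utilization grows linearly with N until it saturates at R / (R + C). The function and parameter names are illustrative.

```python
def utilization(run_length, latency, threads=1, switch_cost=0.0):
    """Processor utilization under an assumed simple multithreading model.

    run_length  -- R, cycles a thread runs before a long-latency event
    latency     -- L, cycles the long-latency event takes
    threads     -- N, hardware thread contexts available
    switch_cost -- C, cycles lost on each context switch (assumed parameter)
    """
    if run_length <= 0 or latency < 0 or threads < 1 or switch_cost < 0:
        raise ValueError("R must be > 0, L and C >= 0, N >= 1")
    saturated = run_length / (run_length + switch_cost)
    linear = threads * run_length / (run_length + switch_cost + latency)
    return min(saturated, linear)

# R = 10 cycles of work, L = 90 cycles of latency
for n in (1, 4, 10, 20):
    print(n, utilization(10, 90, threads=n))
```

Under this model, one thread with R = 10 and L = 90 uses a tenth of the processor, and adding threads hides more of the latency until the processor is fully busy.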
Simultaneous and Redundantly Threaded Processor (SRT)
SRT = SMT + Fault Detection
+ Less hardware compared to replicated microprocessors: SMT needs ~5% more hardware than a uniprocessor, and SRT adds very little hardware overhead to existing SMT
+ Better performance than complete replication: better use of resources
+ Lower cost: avoids complete replication
SRT Design Challenges
Lockstepping doesn't work because SMT may issue the same instruction from the redundant
threads in different cycles. Instructions from the redundant threads must be fetched and scheduled carefully,
since branch mispredictions and cache misses will occur.
Transient Fault Detection in CMPs
CRT (Chip-level Redundantly Threaded processor) borrows the detection scheme from the SMT-based Simultaneous and
Redundantly Threaded (SRT) processors and applies the scheme to CMPs.
It runs two replicated, communicating threads (the leading and trailing threads)
and compares the results of the two.

CRT executes the leading and trailing threads on different processors to achieve
load balancing and to reduce the probability of a fault corrupting both threads.
CRT replicates register values (in the register file in each core) but not memory values. CRT's leading thread commits stores only
after checking, so that memory is guaranteed to be correct. CRT compares only the stores and
uncached loads, but not the register values, of the two threads.
Because an incorrect value caused by a fault propagates through computations and is eventually
consumed by a store, checking only stores suffices for detection; other instructions commit
without checking.
CRT uses a store buffer (StB) in which the leading thread places its committed store
values and addresses. The store values and addresses of the trailing thread are compared against
the StB entries to determine whether a fault has occurred. Only one checked store reaches the
cache hierarchy.
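The store-buffer check can be sketched as a toy model. This is not the CRT hardware design: the class and method names are hypothetical, a dictionary stands in for the cache hierarchy, and the sketch assumes both threads issue their stores in the same program order.

```python
from collections import deque

class StoreBuffer:
    """Toy StB: leading stores wait here until the trailing thread confirms them."""

    def __init__(self):
        self._pending = deque()   # (address, value) pairs from the leading thread
        self.memory = {}          # stand-in for the cache hierarchy

    def leading_store(self, addr, value):
        """Leading thread places a committed store in the buffer."""
        self._pending.append((addr, value))

    def trailing_store(self, addr, value):
        """Compare the trailing store with the oldest leading store.

        Returns True and writes one checked store to memory on a match.
        Returns False on a mismatch (fault detected); memory is untouched.
        """
        if not self._pending:
            raise RuntimeError("trailing thread has no leading store to check")
        if self._pending.popleft() != (addr, value):
            return False
        self.memory[addr] = value
        return True

stb = StoreBuffer()
stb.leading_store(0x10, 7)
print(stb.trailing_store(0x10, 7), stb.memory)   # matching pair reaches memory
stb.leading_store(0x20, 1)
print(stb.trailing_store(0x20, 3), stb.memory)   # mismatch is flagged
```

Only the checked copy of each store is written, so a corrupted value in either thread is caught before it reaches memory.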
Transient Fault Recovery for CMPs
Unlike CRT, CRTR (Chip-level Redundantly Threaded processor with Recovery) must not allow any trailing instruction to commit before it is checked
for faults, so that the register state of the trailing thread may be used for recovery. However, the
leading thread in CRTR may commit register state before checking, as in CRT.
This asymmetric commit strategy allows CRTR to employ a long slack to absorb inter-processor
latencies. As in CRT, CRTR commits stores only after checking. In addition to
communicating branch outcomes, load addresses, load values, store addresses, and store values
like CRT, CRTR also communicates register values.
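The commit rules of the two schemes can be summarized as a small decision function. This is only a restatement of the rules described above in code form. The function name, the string labels, and the boolean flags are illustrative and are not taken from the CRT or CRTR designs.

```python
def may_commit(scheme, thread, is_store, checked):
    """Return True if an instruction may commit under the rules above.

    scheme   -- "CRT" or "CRTR"
    thread   -- "leading" or "trailing"
    is_store -- True if the instruction is a store
    checked  -- True if the instruction has already been checked for faults
    """
    if scheme not in ("CRT", "CRTR"):
        raise ValueError("unknown scheme")
    if thread not in ("leading", "trailing"):
        raise ValueError("unknown thread")
    if checked:
        return True                    # checked instructions can always commit
    if is_store:
        return False                   # both schemes commit stores only after checking
    if scheme == "CRTR" and thread == "trailing":
        return False                   # trailing state must stay clean for recovery
    return True                        # other instructions commit without checking

for scheme in ("CRT", "CRTR"):
    for thread in ("leading", "trailing"):
        print(scheme, thread, may_commit(scheme, thread, is_store=False, checked=False))
```

The only difference between the schemes in this sketch is the unchecked trailing instruction, which is the asymmetry that lets CRTR use the trailing thread's register state for recovery.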
Challenges with this approach

I-Cache:
Instruction bandwidth
I-Cache misses:
Since instructions are fetched from many different contexts,
instruction locality is degraded and the I-cache miss rate rises.
Register file access time:
Register file access time increases because the register file has to
grow significantly to accommodate many separate contexts.
In fact, the HEP and Tera use SRAM to implement the register file, which
means longer access times.
Single thread performance:
Single-thread performance is significantly degraded since the context is
forced to switch to a new thread even if none is available.
A very high bandwidth network is needed, one that is both fast and wide.
Retries on load-empty or store-full conditions
To maximize SMT performance, consider: issue slots, functional units, renaming registers
