Stack Overflow

What is the difference between superscaling and pipelining?

This may look too simple a question to ask, but I am asking after going through a few
slide decks on both.
Both methods increase instruction throughput, and superscaling almost always
makes use of pipelining as well. Superscaling has more than one execution unit, and
so does pipelining, or am I wrong here?
Tags: processor, pipelining

asked Nov 1 '09 at 7:34 by Alex Xander; edited Nov 1 '09 at 8:17 by Michael Petrotta

I removed all the comments that weren't on-topic to the question. That didn't
leave any. Please keep it civil, people. (Marc Gravell, Nov 1 '09 at 8:56)

Good idea. Otherwise a perfectly good question would have been closed as
"subjective and argumentative"! (RCIX, Nov 1 '09 at 9:48)

5 Answers
Superscalar design involves the processor being able to issue multiple instructions
in a single clock, with redundant facilities to execute an instruction. We're talking
about within a single core, mind you -- multicore processing is different.
Pipelining divides an instruction into steps, and since each step is executed in a
different part of the processor, multiple instructions can be in different "phases" each
clock.
They're almost always used together. This image from Wikipedia shows both
concepts in use, as these concepts are best explained graphically:

Here, two instructions are being executed at a time in a five-stage pipeline.


To break it down further, given your recent edit:


In the example above, an instruction goes through 5 stages to be "performed". These
are IF (instruction fetch), ID (instruction decode), EX (execute), MEM (memory
access), and WB (write back of the result to the register file).
In a very simple processor design, every clock a different stage would be completed
so we'd have:
1. IF
2. ID
3. EX
4. MEM
5. WB

Which would do one instruction in five clocks. If we then add a redundant execution
unit and introduce superscalar design, we'd have this, for two instructions A and B:
1. IF(A) IF(B)
2. ID(A) ID(B)
3. EX(A) EX(B)
4. MEM(A) MEM(B)
5. WB(A) WB(B)

Two instructions in five clocks -- a theoretical maximum gain of 100%.


Pipelining allows the parts to be executed simultaneously, so we would end up with
something like (for ten instructions A through J):
1. IF(A) IF(B)
2. ID(A) ID(B) IF(C) IF(D)
3. EX(A) EX(B) ID(C) ID(D) IF(E) IF(F)
4. MEM(A) MEM(B) EX(C) EX(D) ID(E) ID(F) IF(G) IF(H)
5. WB(A) WB(B) MEM(C) MEM(D) EX(E) EX(F) ID(G) ID(H) IF(I) IF(J)
6. WB(C) WB(D) MEM(E) MEM(F) EX(G) EX(H) ID(I) ID(J)
7. WB(E) WB(F) MEM(G) MEM(H) EX(I) EX(J)
8. WB(G) WB(H) MEM(I) MEM(J)
9. WB(I) WB(J)

In nine clocks, we've executed ten instructions -- you can see where pipelining really
moves things along. And that is an explanation of the example graphic, not how it's
actually implemented in the field (that's black magic).
The Wikipedia articles for Superscalar and Instruction pipeline are pretty good.
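
The clock-by-clock tables in this answer can be regenerated with a short script. This is only a sketch of the idealized model used here (no stalls, no dependencies, no branches); the names `schedule` and `total_clocks` and the `width` and `pipelined` parameters are made up for illustration and do not come from any real tool.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def schedule(names, width=2, pipelined=True):
    """Return one list of 'STAGE(name)' strings per clock for an idealized machine.

    width     -- how many instructions are issued together (1 = scalar)
    pipelined -- if False, a group must finish WB before the next group starts
    Assumption: no stalls, hazards or branches of any kind.
    """
    groups = [names[i:i + width] for i in range(0, len(names), width)]
    gap = 1 if pipelined else len(STAGES)  # clocks between two group starts
    total = (len(groups) - 1) * gap + len(STAGES) if groups else 0
    clocks = [[] for _ in range(total)]
    for g, group in enumerate(groups):
        for s, stage in enumerate(STAGES):
            clocks[g * gap + s].extend(f"{stage}({name})" for name in group)
    return clocks


def total_clocks(n, width=2, pipelined=True):
    """Number of clocks needed for n independent instructions."""
    return len(schedule([str(i) for i in range(n)], width, pipelined))


if __name__ == "__main__":
    # Ten instructions A..J, two per clock, five-stage pipeline.
    for clock, row in enumerate(schedule(list("ABCDEFGHIJ")), start=1):
        print(f"{clock}. " + " ".join(row))
```

Calling `total_clocks(10)` gives the nine clocks counted above, while `total_clocks(2, pipelined=False)` gives the five clocks of the superscalar-only table.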
answered Nov 1 '09 at 7:42 by Jed Smith (accepted answer); edited Nov 1 '09 at 8:36

They are used together primarily because both techniques are available, both
are good ideas and modern process manufacturing technology makes it
possible. Notable chips that are pipelined but not super-scalar include the Intel
i486 and some of the early ARM and MIPS CPUs, as well as the first Alpha
processor. (Paul Hsieh, Nov 1 '09 at 8:20)

The first "execute" should be an "issue" and then you can use "execute" instead
of "do". That's what that phase is called in the Hennessy & Patterson book.
(yeyeyerman, Nov 1 '09 at 8:35)

@yeyeyerman: Thank you for the feedback, I have revised the answer. I
haven't had much exposure to texts on the material, so forgive the oversight.
(Jed Smith, Nov 1 '09 at 8:37)

Thank you, it's all clear to me now! :) (hardsetting, Jun 15 at 16:33)

"Redundant" means "superfluous", "unnecessary", or "not strictly necessary to
functioning but included in case of failure in another component." But the
functional units on a superscalar don't even need to provide overlapping
functionality (for example in the case where you have a separate branch unit,
ALU and memory unit). (Wandering Logic, Jul 3 at 14:51)

A long time ago, CPUs executed only one machine instruction at a time. Only when it
was completely finished did the CPU fetch the next instruction from memory (or, later, the
instruction cache).
Eventually, someone noticed that this meant that most of a CPU did nothing most of the
time, since there were several execution subunits (such as the instruction decoder, the
integer arithmetic unit, the FP arithmetic unit, etc.) and executing an instruction kept only
one of them busy at a time.
Thus, "simple" pipelining was born: once one instruction was done decoding and went on
towards the next execution subunit, why not already fetch and decode the next instruction?
If you had 10 such "stages", then by having each stage process a different instruction
you could theoretically increase the instruction throughput tenfold without increasing the
CPU clock at all! Of course, this only works flawlessly when there are no conditional
jumps in the code (this led to a lot of extra effort to handle conditional jumps specially).
Later, with Moore's law continuing to be correct for longer than expected, CPU makers
found themselves with ever more transistors to make use of and thought "why have only
one of each execution subunit?". Thus, superscalar CPUs with multiple execution
subunits able to do the same thing in parallel were born, and CPU designs became
much, much more complex in order to distribute instructions across these fully parallel
units while ensuring the results were the same as if the instructions had been executed
sequentially.
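
The "tenfold" figure can be checked with a little arithmetic. The sketch below assumes an idealized machine where every stage takes one clock; the function names and the `stall_cycles` parameter (a lump sum of clocks lost to things like conditional jumps) are hypothetical and only serve the illustration.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction runs through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages, stall_cycles=0):
    """The first instruction needs n_stages clocks; each later one finishes
    one clock after its predecessor, plus whatever clocks are lost to stalls."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


def speedup(n_instructions, n_stages, stall_cycles=0):
    """How many times faster the pipelined machine is under this model."""
    return (cycles_unpipelined(n_instructions, n_stages)
            / cycles_pipelined(n_instructions, n_stages, stall_cycles))


if __name__ == "__main__":
    # Approaches 10 as the instruction count grows.
    print(speedup(1000, 10))
    # Assumed example: 100 jumps that each cost 9 lost clocks.
    print(speedup(1000, 10, stall_cycles=900))
```

With a long enough instruction stream the speedup approaches the number of stages, and any lost clocks pull it back down, which is why conditional jumps get so much special attention.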
answered Nov 1 '09 at 8:23 by Michael Borgwardt; edited Nov 1 '09 at 10:07

It's answers like these that should end the debate going on about the value of such
questions on SO. (Alex Xander, Nov 1 '09 at 8:40)

A long time ago, in a die far, far away? (Jed Smith, Nov 1 '09 at 8:50)

I'd vote this up but the description of superscalar CPUs is incorrect. You're
describing a vector processor; superscalar processors are subtly different.
(Wedge, Nov 1 '09 at 9:10)

Now that calls for another question: what is the difference between vector and
superscalar processors? (AJ., Nov 1 '09 at 9:24)

@nurabha: in practice, some forms of pipelining were done very early, and the real
question is how deep the pipeline of a given processor is. I think the Pentium IV had
a pretty extreme one with 40+ stages. (Michael Borgwardt, Nov 10 at 12:01)
An Analogy: Washing Clothes

Imagine a dry cleaning store with the following facilities: a rack for hanging dirty or clean
clothes, a washer and a dryer (each of which can handle one garment at a time), a folding
table, and an ironing board.
The attendant who does all of the actual washing and drying is rather dim-witted, so the
store owner, who takes the dry cleaning orders, takes special care to write out each
instruction very carefully and explicitly.
On a typical day these instructions may be something along the lines of:

take the shirt from the rack
wash the shirt
dry the shirt
iron the shirt
fold the shirt
put the shirt back on the rack
take the pants from the rack
wash the pants
dry the pants
fold the pants
put the pants back on the rack
take the coat from the rack
wash the coat
dry the coat
iron the coat
put the coat back on the rack

The attendant follows these instructions to a tee, being very careful never to do
anything out of order. As you can imagine, it takes a long time to get the day's laundry done
because it takes a long time to fully wash, dry, and fold each piece of laundry, and it must
all be done one piece at a time.
However, one day the attendant quits and a new, smarter attendant is hired who notices that
most of the equipment is lying idle at any given time during the day. While the pants were
drying, neither the ironing board nor the washer was in use. So he decided to make better
use of his time. Thus, instead of the above series of steps, he would do this:

take the shirt from the rack
wash the shirt, take the pants from the rack
dry the shirt, wash the pants
iron the shirt, dry the pants
fold the shirt, (take the coat from the rack)
put the shirt back on the rack, fold the pants, (wash the coat)
put the pants back on the rack, (dry the coat)
(iron the coat)
(put the coat back on the rack)

This is pipelining: sequencing unrelated activities such that they use different components
at the same time. By keeping as many of the different components active at once as
possible, you maximize efficiency and speed up execution time.
Now, the little dry cleaning shop started to make more money because they could work so
much faster, so the owner bought an extra washer, dryer, ironing board, and folding station,
and even hired another attendant. Now things are even faster; instead of the above, you have:

take the shirt from the rack, take the pants from the rack
wash the shirt, wash the pants, (take the coat from the rack)
dry the shirt, dry the pants, (wash the coat)
iron the shirt, fold the pants, (dry the coat)
fold the shirt, put the pants back on the rack, (iron the coat)
put the shirt back on the rack, (put the coat back on the rack)

This is superscalar design: multiple sub-components capable of doing the same task
simultaneously, with the processor deciding how to use them.
Older processors, such as the 386 or 486, are scalar processors: they issue at most one
instruction per clock, in exactly the order in which the instructions were received (the 486
is pipelined, but not superscalar). Modern consumer processors since the PowerPC/Pentium
are pipelined and superscalar. A Core2 CPU is capable of running the same code that was
compiled for a 486 while still taking advantage of instruction level parallelism because it
contains its own internal logic that analyzes machine code and determines how to reorder
and run it (what can be run in parallel, what can't, etc.). This is the essence of superscalar
design and why it's so practical.
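
To make "what can be run in parallel, what can't" a little more concrete, here is a toy sketch of the kind of check such logic performs. It is a deliberately simplified assumption: it models only in-order issue, two instructions per clock, and it refuses to pair an instruction with an earlier one in the same clock if it reads or overwrites that instruction's result. The `Instr` type and `issue_groups` function are invented for this example; real processors also reorder instructions and are far more elaborate.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                # human-readable form, for printing only
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


def issue_groups(program, width=2):
    """Greedy in-order grouping: keep adding consecutive instructions to the
    current clock until the issue width is reached or the next instruction
    depends on a register written earlier in the same group."""
    groups, current, written = [], [], set()
    for ins in program:
        conflict = any(s in written for s in ins.srcs) or ins.dest in written
        if current and (len(current) == width or conflict):
            groups.append(current)
            current, written = [], set()
        current.append(ins)
        if ins.dest is not None:
            written.add(ins.dest)
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    program = [
        Instr("add r1, r2, r3", "r1", ("r2", "r3")),
        Instr("sub r4, r5, r6", "r4", ("r5", "r6")),
        Instr("mul r7, r1, r4", "r7", ("r1", "r4")),
        Instr("add r8, r7, r2", "r8", ("r7", "r2")),
    ]
    for clock, group in enumerate(issue_groups(program), start=1):
        print(clock, [ins.text for ins in group])
```

In this made-up program the first two instructions are independent and can share a clock, while the last two each depend on the instruction just before them and have to go one at a time.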
In contrast, a vector parallel processor performs operations on several pieces of data at once
(a vector). Thus, instead of just adding x and y, a vector processor would add, say, x0,x1,x2
to y0,y1,y2 (resulting in z0,z1,z2). The problem with this design is that it is tightly coupled
to the specific degree of parallelism of the processor. If you run scalar code on a vector
processor (assuming you could) you would see no advantage from the vector parallelization
because it needs to be explicitly used. Similarly, if you wanted to take advantage of a newer
vector processor with more parallel processing units (e.g. capable of adding vectors of 12
numbers instead of just 3) you would need to recompile your code. Vector processor
designs were popular in the oldest generation of supercomputers because they were easy to
design and there are large classes of problems in science and engineering with a great deal
of natural parallelism.
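
The coupling to a fixed width can be sketched in a few lines. Everything here is a stand-in: `VECTOR_WIDTH` is an assumed hardware width matching the x0..x2 example, and `vector_add` merely plays the role of one hardware vector instruction; neither is a real API.

```python
VECTOR_WIDTH = 3  # assumed hardware width, as in the x0,x1,x2 example


def vector_add(xs, ys):
    """Stand-in for a single vector instruction: adds exactly VECTOR_WIDTH lanes."""
    if len(xs) != VECTOR_WIDTH or len(ys) != VECTOR_WIDTH:
        raise ValueError("operands must match the hardware vector width")
    return [x + y for x, y in zip(xs, ys)]


def add_arrays(xs, ys):
    """Code written explicitly for this width: it walks the arrays one
    vector at a time and would have to be rewritten for a wider machine."""
    if len(xs) != len(ys) or len(xs) % VECTOR_WIDTH:
        raise ValueError("array lengths must be equal multiples of the vector width")
    out = []
    for i in range(0, len(xs), VECTOR_WIDTH):
        out.extend(vector_add(xs[i:i + VECTOR_WIDTH], ys[i:i + VECTOR_WIDTH]))
    return out


if __name__ == "__main__":
    print(add_arrays([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]))
```

The width shows up in the program itself, which is the point of the paragraph above: a superscalar processor finds its parallelism at run time, whereas here it is baked in when the code is written or compiled.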
Superscalar processors can also have the ability to perform speculative execution. Rather
than waiting for a code path to finish executing before branching, a processor can make a
best guess and start executing code past the branch before prior code has finished
processing. When execution of the prior code catches up to the branch point, the processor
can then compare the actual branch with the branch guess and either continue on if the
guess was correct (already well ahead of where it would have been by just waiting) or
invalidate the results of the speculative execution and run the code for the correct
branch.
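
The guess-then-check cycle can be illustrated with a toy model. The guessing rule used here ("assume the branch does whatever it did last time") and the `penalty` of 4 clocks per wrong guess are assumptions picked for the example; actual processors use considerably more elaborate predictors.

```python
def run_branches(outcomes, penalty=4):
    """Count wrong guesses for a 'same as last time' branch guess.

    outcomes -- sequence of booleans, True meaning the branch was taken
    penalty  -- assumed clocks of speculative work thrown away per wrong guess
    Returns (mispredictions, wasted_clocks).
    """
    guess = True  # assumed initial guess: taken
    mispredictions = 0
    for actual in outcomes:
        if guess != actual:
            mispredictions += 1  # speculative results are invalidated
        guess = actual           # next time, guess what just happened
    return mispredictions, mispredictions * penalty


if __name__ == "__main__":
    loop_branch = [True] * 9 + [False]  # a loop that runs, then exits
    alternating = [True, False] * 5     # worst case for this guessing rule
    print(run_branches(loop_branch))
    print(run_branches(alternating))
```

A typical loop branch is guessed wrong only once, at the exit, so almost all of the speculative work is kept; a branch that flips every time defeats this particular rule.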
answered Nov 1 '09 at 9:26 by Wedge; edited Nov 1 '09 at 10:45

Pipelining is what a car company does in the manufacturing of their cars. They break down
the process of putting together a car into stages and perform the different stages at different
points along an assembly line, done by different people. The net result is that cars come off
the line at exactly the rate of the slowest stage alone.
In CPUs the pipelining process is exactly the same. An "instruction" is broken down into
various stages of execution, usually something like 1. fetch instruction, 2. fetch operands
(registers or memory values that are read), 3. perform computation, 4. write results (to
memory or registers). The slowest of these might be the computation part, in which case the
overall throughput of the instructions through this pipeline is just the speed of the
computation part (as if the other parts were "free").
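
As a rough numerical sketch of the "slowest stage" point: the stage times below are invented, and the model assumes every stage is given one clock whose length is set by the slowest stage. The function name `pipeline_timing` is hypothetical.

```python
def pipeline_timing(stage_times_ns):
    """Return (clock_ns, latency_ns, throughput_per_ns) for a simple pipeline.

    Assumption: all stages advance in lockstep, so the clock period must be
    long enough for the slowest stage.
    """
    clock = max(stage_times_ns)            # slowest stage sets the pace
    latency = clock * len(stage_times_ns)  # time for one instruction to pass through
    throughput = 1.0 / clock               # instructions finished per ns when full
    return clock, latency, throughput


if __name__ == "__main__":
    # fetch, operand fetch, compute, write back (made-up numbers)
    print(pipeline_timing([1.0, 1.0, 2.0, 1.0]))
```

Speeding up any stage other than the slowest one leaves the throughput unchanged in this model, which is the sense in which the other parts are "free".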
Super-scalar in microprocessors refers to the ability to run several instructions from a single
execution stream at once in parallel. So if a car company ran two assembly lines then
obviously they could produce twice as many cars. But if the process of putting a serial
number on the car was at the last stage and had to be done by a single person, then that
person would have to alternate between the two lines and guarantee that they could get each
done in half the time of the slowest stage in order to avoid becoming the slowest stage
themselves.
Super-scalar in microprocessors is similar but usually has far more restrictions. So the
instruction fetch stage will typically produce more than one instruction during its stage;
this is what makes super-scalar in microprocessors possible. There would then be two fetch
stages, two execution stages, and two write back stages. This obviously generalizes to more
than just two pipelines.


This is all fine and dandy, but from the perspective of sound execution both techniques
could lead to problems if done blindly. For correct execution of a program, it is assumed
that the instructions are executed completely one after another, in order. If two sequential
instructions have inter-dependent calculations or use the same registers then there can be a
problem: the later instruction needs to wait for the write back of the previous instruction to
complete before it can perform its operand fetch stage. Thus you need to stall the second
instruction by two stages before it is executed, which defeats the purpose of what was
gained by these techniques in the first place.
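
The two-stage stall can be reproduced with a small model of the four-stage pipeline described in this answer (fetch, operand fetch, compute, write back). It assumes one instruction enters per clock, in order, with no forwarding, so a dependent instruction's operand fetch has to come strictly after the producer's write back. The function names and the `(dest, srcs)` tuple format are invented for this sketch.

```python
OPERAND_FETCH, WRITE_BACK = 1, 3  # stage indices in the four-stage pipeline


def start_cycles(program):
    """program: list of (dest, srcs) tuples. Returns the clock at which each
    instruction enters the fetch stage, delaying it when it must wait."""
    starts = []
    last_writer = {}  # register -> index of the latest instruction writing it
    for i, (dest, srcs) in enumerate(program):
        start = starts[-1] + 1 if starts else 0  # at best one clock behind
        for reg in srcs:
            if reg in last_writer:
                producer_done = starts[last_writer[reg]] + WRITE_BACK
                # operand fetch (start + 1) must fall after the write back
                start = max(start, producer_done + 1 - OPERAND_FETCH)
        starts.append(start)
        if dest is not None:
            last_writer[dest] = i
    return starts


def stall_count(program):
    """Clocks lost compared with the ideal of one new instruction per clock."""
    starts = start_cycles(program)
    return starts[-1] - (len(program) - 1) if starts else 0


if __name__ == "__main__":
    dependent = [("r1", ()), ("r2", ("r1",))]  # second reads the first's result
    independent = [("r1", ()), ("r2", ())]
    print(start_cycles(dependent), stall_count(dependent))
    print(start_cycles(independent), stall_count(independent))
```

For two back-to-back dependent instructions the model loses two clocks, and placing one unrelated instruction between them cuts that to one.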
There are many techniques used to reduce the problem of needing to stall that are a bit
complicated to describe, but I will list them: 1. register forwarding (also store-to-load
forwarding), 2. register renaming, 3. score-boarding, 4. out-of-order execution, 5.
speculative execution with rollback (and retirement). All modern CPUs use pretty much all
these techniques to implement super-scalar and pipelining. However, these techniques tend
to have diminishing returns with respect to the number of pipelines in a processor before
stalls become inevitable. In practice no CPU manufacturer makes more than 4 pipelines in a
single core.
Multi-core has nothing to do with any of these techniques. This is basically ramming two
micro-processors together to implement symmetric multiprocessing on a single chip and
sharing only those components which make sense to share (typically L3 cache, and I/O).
However a technique that Intel calls "hyperthreading" is a method of trying to virtually
implement the semantics of multi-core within the super-scalar framework of a single core.
So a single micro-architecture contains the registers of two (or more) virtual cores and
fetches instructions from two (or more) different execution streams, but executing from a
common super-scalar system. The idea is that because the registers cannot interfere with
each other, there will tend to be more parallelism leading to fewer stalls. So rather than
simply executing two virtual core execution streams at half the speed, it is better due to the
overall reduction in stalls. This would seem to suggest that Intel could increase the number
of pipelines. However this technique has been found to be somewhat lacking in practical
implementations. As it is integral to super-scalar techniques, though, I have mentioned it
anyway.
answered Nov 1 '09 at 9:22 by Paul Hsieh

Pipelining is simultaneous execution of different stages of multiple instructions in the
same cycle. It is based on splitting instruction processing into stages and having
specialized units for each stage and registers for storing intermediate results.
Superscaling is dispatching multiple instructions (or microinstructions) to multiple
execution units existing in the CPU. It is thus based on redundant units in the CPU.
Of course, these approaches can complement each other.
answered Nov 1 '09 at 7:44 by elder_george

site design / logo 2014 stack exchange inc; user contributions licensed under cc by-sa 3.0 with
attribution required