Computing in A Parallel Universe

A reprint from
American Scientist
the magazine of Sigma Xi, The Scientific Research Society
This reprint is provided for personal and noncommercial use. For any other use, please send a request Brian Hayes by
electronic mail to bhayes@amsci.org.
Computing Science
Computing in a Parallel Universe

Brian Hayes
T he pace of change in com

puter technology can be breath
taking—and sometimes infuriating.
Multicore chips
laws is an economic miracle: As tran
sistors or other components are made
smaller and packed more densely on
You bring home a new computer, and
before you can get it plugged in you’re
could bring about the surface of a silicon chip, the cost
of producing the chip remains nearly
hearing rumors of a faster and cheaper
model. In the 30 years since the micro the biggest change constant. Thus the cost per transistor
steadily declines; it’s now measured in
processor first came on the scene, com nanodollars. This extraordinary fact is
puter clock speeds have increased by a in computing since the basis of Moore’s Law, formulated
factor of a thousand (from a few mega in 1965 by Gordon E. Moore, one of the
hertz to a few gigahertz) and memory the microprocessor founders of Intel. Moore observed that
capacity has grown even more (from the number of transistors on a state-of-
kilobytes to gigabytes). the-art chip doubles every year or two.
Through all this frenzy of upgrades of doublings and redoublings, with Less famous than Moore’s Law but
and speed bumps, one aspect of com the number of cores per chip following equally important are several “scaling
puter hardware has remained stub the same kind of exponential growth laws” first stated in 1974 by Robert
bornly resistant to change. Until re curve that earlier traced the rise in H. Dennard and several colleagues at
cently, that new computer you brought clock speed and memory capacity. IBM. Dennard asked: When we reduce
home surely had only one CPU, or The next computer you bring home, the size of a transistor, how should we
central processing unit—the computer- a few years from now, could have hun adjust the other factors that control its
within-the-computer where programs dreds or even thousands of processors. operation, such as voltages and cur
are executed and calculations are per If all goes according to plan, you may rents? And what effect will the changes
formed. Over the years there were notice nothing different about the new have on performance? He found that
many experiments with multiproces machines apart from another boost in voltage and current should be pro
sors and other exotica in the world of performance. Inside, though, coordi portional to the linear dimensions of
supercomputers, but the desktops and nating all those separate computation the device, which implies that power
laptops familiar to most of us contin al cores is going to require profound consumption (the product of voltage
ued to rely on a single-CPU architec changes in the way programs are de and current) will be proportional to
ture whose roots go back to the age of signed. Up to now, most software has the area of the transistor. This was an
the vacuum tube and the punch card. been like music written for a solo per encouraging discovery; it meant that
Now a major shift is under way. former; with the current generation even as the number of devices on a
Many of the latest computers are of chips we’re getting a little experi chip increased, the total power density
equipped with “dual core” proces ence with duets and quartets and other would remain constant.
sor chips; they bundle two CPUs on small ensembles; but scoring a work Dennard’s conclusion about per
a single slab of silicon. The two pro for large orchestra and chorus is a dif formance was even more cheering. In
cessors are meant to share the work ferent kind of challenge. digital circuits, transistors act essen
of computation, potentially doubling tially as on-off switches, and the cru
the machine’s power. Quad-core chips Free Lunch cial measure of their performance is
are also available; Intel has announced Why build chips with multiple proces the switching delay: the time it takes to
an eight-core product, due in 2009; sors? Why not just keep cranking up go from the conducting to the noncon
Sun Microsystems has been testing a the speed of a single CPU? If the latter ducting state or vice versa. The scaling
16-core chip. A company called Tilera option were feasible, the chipmakers laws show that delay is proportional to
even offers 64 cores. It seems we are would be delighted to adopt it. They size, and so as circuits shrink they can
on the threshold of another sequence are turning to multicore systems only be operated at higher speed.
because the path to higher gigahertz Taken together, these findings sug
Brian Hayes is Senior Writer for American Sci seems to be blocked, at least for now. gest that our universe is an especially
entist. Additional material related to the “Comput- The causes of this impasse lie in the friendly place for making electronic
ing Science” column appears in Hayes’s Weblog at peculiar physical and economic laws computers. In other realms, the laws
http://bit-player.org. Address: 211 Dacian Avenue, that govern the design of integrated of nature seem designed to thwart us.
Durham, NC 27701. Internet: bhayes@amsci.org circuits. The most celebrated of those Thermodynamics and quantum me
476 American Scientist, Volume 95 © 2007 Brian Hayes. Reproduction with permission only.
Contact bhayes@amsci.org.
chanics tell us what we can’t hope to
do; levers amplify either force or dis
tance but not both. Everywhere we
turn, there are limits and tradeoffs,
and no free lunch. But Moore’s Law
and the Dennard scaling rules promise
circuits that gain in both speed and
capability, while cost and power con
sumption remain constant. From this
happy circumstance comes the whole
bonanza of modern microelectronics.
The Impasse
Free lunch is great, but there’s still a
bill to pay for breakfast and dinner.
Throughout the past decade, chip de
signers have struggled with two big
problems.
First, although CPUs are a thou
sand times faster, memory speed has
increased only by a factor of ten or so.
Back in the 1980s, reading a bit from
main memory took a few hundred
nanoseconds, which was also the time
needed to execute a single instruction
in a CPU. The memory and the proces
sor cycles were well matched. Today,
a processor could execute a hundred
instructions in the time it takes to get
data from memory.
One strategy for fixing the memory
bottleneck is to transfer data in large
blocks rather than single bits or bytes; A “quad core” microprocessor chip manufactured by Advanced Micro Devices has four sepa-
this improves throughput (bits per sec rate processors that act in parallel. The cores are the four large areas of irregular geometry;
ond), but not latency (the delay be most of the gridlike regions hold cache memory. The chip is part of the AMD Opteron product
fore the first bit arrives). To mitigate line and is also known by its prerelease codename Barcelona. The total silicon area of 285
square millimeters holds about 600 million transistors.
the latency problem, computers are
equipped with an elaborate hierarchy
of cache memories, which surround point, all the benefits of any further that could drive a wind tunnel, and
the processor core like a series of wait boost in processor speed will be eaten laptops burn your knees.
ing rooms and antechambers. Data and up by the demand for more cache. In the future, even the 0.85 voltage
instructions that are likely to be needed The second problem that plagues reduction looks problematic. As volt
immediately are held in the innermost, chip designers is a power crisis. Den age is lowered, transistors become
first-level cache, which has only a nard’s scaling laws promised that leaky, like valves that cannot be com
small capacity but is built for very high power density would remain constant pletely shut off. The leakage current
speed. The second-level cache, larger even as the number of transistors and now accounts for roughly a third of
but a little slower, holds information their switching speed increased. For total power consumption; with further
that is slightly less urgent. Some sys that rule to hold, however, voltages reductions in voltage, leakage could
tems have a third-level cache. have to be reduced in proportion to become unmanageable. On the other
Reliance on cache memory puts a the linear dimensions of the transistor. hand, without those continuing volt
premium on successfully predicting Manufacturers have not been able to age reductions, the clock rate cannot
which data and instructions a program lower operating voltages that steeply. be increased.
is going to call for next, and there’s Historically, each successive genera These problems with memory la
a heavy penalty when the prediction tion of processor chips has scaled the tency and power density are some
is wrong. Moreover, processor chips linear dimensions by a factor of 0.7, times viewed as signalling the end of
have to sacrifice a large fraction of their which yields an area reduction of one- Moore’s Law, but that’s not the apoca
silicon area to make room for caches half. (In other words, density doubles.) lypse we’re facing. We can still pack
and the logic circuits that control them. The scaling factor for voltages, howev more transistors onto a chip and man
As the disparity between memory and er, has been 0.85 rather than 0.7, with ufacture it for roughly constant cost.
CPU speed grows more extreme, a pro the result that power density has been The semiconductor industry “road
cessor begins to look like a shopping rising steadily with each new genera map” calls for increasing the number
mall where the stores are dwarfed by tion of chips. That’s why desktop ma of transistors on a processor chip from
the surrounding parking lot. At some chines now come equipped with fans a few hundred million today to more
www.americanscientist.org © 2007 Brian Hayes. Reproduction with permission only. 2007 November–December 477
TPVSDF ESBJO tion Machine, which had 216 single-bit and storing results from a third. When
HBUF
processors (and 212 blinking red lights). ever possible, two or more instructions
Another notable project was the Trans are executed simultaneously. Through
puter, created by the British semicon such “instruction-level parallelism” a
ductor firm Inmos. Transputer chips single CPU can have a throughput of
were single processors designed for more than one instruction per cycle,
TJMJDPOTVCTUSBUF interconnection, with built-in commu on average.
nications links and facilities for man
The field-effect transistor, seen here in cross aging parallel programs. Shared Memories
section, is the building block of virtually all Software innovators were also It is surely no coincidence that the
microelectronic circuits. A voltage applied drawn to the challenges of parallelism. kinds of parallelism in widest use
to the gate controls the flow of current from The Occam programming language today are the kinds that seem to be
source to drain. The transistor is fabricated by was devised for the Transputer, and easiest for programmers to manage.
implanting ions in selected areas (green) and languages called *Lisp and C* were Instruction-level parallelism is all but
depositing layers of insulating silicon dioxide written for the Connection Machine. invisible to the programmer; you cre
(light gray) and metal interconnects (dark Gelernter introduced the Linda pro ate a sequential series of instructions,
gray). The width of the gate is a crucial dimen-
gramming system, in which multiple and it’s up to the hardware to find op
sion; in recent chips it is 65 nanometers.
processors pluck tasks from a cloud portunities for concurrent execution.
called “tuple space.” In writing a program to run on a
than 12 billion by 2020. What appears What became of all these ventures? cluster or server farm, you can’t be
to be ending, or at least dramatically They were flattened by the steam oblivious to parallelism, but the archi
slowing, is the scaling law that allows roller of mass-market technology and tecture of the system imposes a helpful
processor speed to keep climbing. We economics. Special-purpose, limited- discipline. Each node of the cluster is
can still have smaller circuits, but not production designs are hard to justify essentially an independent computer,
faster ones. And hence the new Lilli when the same investment will buy with its own processor and private
putian strategy of Silicon Valley: lots of hundreds or thousands of commodity memory. The nodes are only loosely
little processors working in parallel. PCs, which you can mount in racks coupled; they communicate by passing
and link together in a loose federation messages. This protocol limits the op
A Notorious Hangout via Ethernet. Such clusters and “server portunities for interprocess mischief.
Parallel processing is hardly a new farms” soon came to dominate large- The software development process is
idea in computer science. Machines scale computing, especially in the sci not radically different; programs are
with multiple processors were built ences. The vendors of supercomputers often written in a conventional lan
as early as the 1960s, when it was al eventually gave in and began selling guage such as Fortran or C, augment
ready widely believed that some form systems built on the same principle. ed by a library of routines that handle
of “massive parallelism” was the way All of the fastest supercomputers are the details of message passing.
of the future. By the 1980s that future now elaborations of this concept. In Clusters work well for tasks that
was at hand. David Gelernter of Yale other words, parallelism wasn’t de readily break apart into lots of nearly
University wrote that “parallel com feated; it was co-opted. independent pieces. In weather predic
puting, long a notorious hangout for It’s also important to note that paral tion, for example, each region of the at
utopians, theorists, and backyard tin lelism of a different kind insinuated it mosphere can be assigned its own CPU.
kerers, has almost arrived and is defi self into mainstream processor designs. The same is true of many algorithms
nitely for sale.” The impressive performance of recent in graphics and image synthesis. Web
Throughout that decade and into the CPU chips comes not only from giga servers are another candidate for this
early 1990s novel parallel architectures hertz clock rates but also from doing treatment, since each visitor’s requests
became a wonderful playground for more during each clock cycle. The pro can be handled independently.
computer designers. For example, W. cessors “pipeline” their instructions, In principle, multicore computer
Daniel Hillis developed the Connec decoding one while executing another systems could be organized in the

BDUVBM

TXJUDIJOHEFMBZ
BDUVBM

MJOFBSTJ[F
WPMUBHF
QPXFS
BSFB
JEFBM

JEFBM

HFOFSBUJPO HFOFSBUJPO HFOFSBUJPO HFOFSBUJPO HFOFSBUJPO
Scaling laws relate the physical dimensions of transistors to their electrical properties. In each successive generation of microprocessors lin-
ear dimensions (such as the gate width of a transistor) are reduced by a factor of about 0.7, which means the area of a transistor is cut in half.
Switching delay (the reciprocal of processing speed) is proportional to the linear size. If operating voltage could also be lowered by a factor of
0.7, a transistor’s power consumption would be proportional to its surface area, and the power density of the entire chip would remain constant.
But voltages have actually been reduced only by 0.85 per generation, with the result that power and heat have become limiting factors.
same way as clusters, with each CPU "MJDF Y #PC "MJDF Y #PC "MJDF Y #PC
having its own private memory and
with communication governed by a
message-passing protocol. But with
YY YY
many CPUs on the same physical
YY YY YY
substrate, it’s tempting to allow much
closer collaboration. In particular, mul
ticore hardware makes it easy to build
shared-memory systems, where pro
YY
cessors can exchange information sim
ply by reading and writing the same
location in memory. In software for
a shared-memory machine, multiple
computational processes all inhabit the
same space, allowing more interesting Software for parallel processors is susceptible to subtle errors that cannot arise in strictly
and flexible patterns of interaction, not sequential programs. Here two concurrent processes both access a shared location in memory
to mention subtler bugs. designated by the variable name x. Each process reads the current value of x, increments it by
1, and writes the new value back to the same location. The outcome depends on the way the
Losing My Minds two transactions are interleaved, and the timing of these events is not under the programmer’s
If our universe is a peculiarly friendly control. Only the rightmost case is correct.
place for builders of digital computers,
it is not so benign for creators of pro all about them. Many of the same is are things happening all around me
grams that run concurrently on paral sues arise even in uniprocessor systems that I don’t remember doing. “I con
lel hardware. Or maybe the difficulty where “time slicing” creates the illusion tain multitudes,” declared Walt Whit
lies in the human mind rather than in that multiple programs are running at man, but for a computer programmer
the nature of the universe. the same time. this is not a healthy state of mind.
Think of a program that reserves Writing correct concurrent programs Edward A. Lee, of the University of
airline seats. Travelers access the pro is not impossible or beyond human California, Berkeley, recently described
gram through a Web site that shows abilities, but parallelism does seem to the mental challenge of writing non
a diagram of the aircraft interior, with make extreme demands on mental dis deterministic programs:
each seat marked as either vacant or cipline. The root of the difficulty is non
A folk definition of insanity is to
occupied. When I click on seat 3A, the determinism: Running the same set of
do the same thing over and over
program first checks its database to programs on the same set of inputs can
again and expect the results to be
make sure 3A is still available; if it is, yield different results depending on the
different. By this definition, we in
I get a confirming message, and the exact timing of events. This is discon
fact require that programmers of
database is updated to show that seat certing if your approach to program
multithreaded systems be insane.
3A has been taken. All’s well, at least ming is to try to think like a computer.
Were they sane, they could not
in a sequential world. But you too may Even though the brain is a highly
understand their programs.
be booking a seat on the same flight, parallel neural network, the mind
and you may want 3A. If my trans seems to be single-threaded. You may Lee also writes:
action completes before your request be able to walk and chew gum at the
I conjecture that most multi
arrives, then I’m afraid you’re out of same time, but it’s hard to think two
threaded general-purpose appli
luck. On the other hand, if you are thoughts at once. Consciousness is sin
cations are so full of concurrency
quicker with the mouse, I’m the one gular. In trying to understand a com
bugs that—as multicore architec
who will be disappointed. But what puter program, I often imagine myself
tures become commonplace—
happens if the two requests are essen standing at a certain place in the pro
these bugs will begin to show up
tially simultaneous and are handled gram text or in a flow chart. As the in
as system failures. This scenario is
in parallel by a multiprocessing com structions are executed, I follow along,
bleak for computer vendors: Their
puter? Suppose the program has just tracing out the program’s path. I may
next-generation machines will be
assigned the seat to me but has not yet have to jump from place to place to
come widely known as the ones
revised the database record when your follow branches and loops, but at any
on which many programs crash.
request reaches the Web server. At that given moment there is always one lo
instant a check of the database indi cation that I can call here. Furthermore, Cynics, of course, will reply that com
cates 3A is still vacant, and so we both I am the only actor on the stage. Noth puters of every generation have exactly
get confirming messages. It’s going to ing ever happens behind my back or that reputation.
be a cozy flight! out of sight. Those airline seats can’t be
Of course there are remedies for this assigned unless I assign them. The Slice-and-Dice Compiler
problem. Programming techniques for That’s how it works in a sequential Not everyone shares Lee’s bleak out
ensuring exclusive access to resources program. With parallel processing, the look. There may well be ways to tame
have been known for 50 years; they are sense of single-mindedness is lost. If I the multicore monster.
key assets in the intellectual heritage try to trace the path of execution, I have One idea is to let the operating sys
of computer science, and the airline’s to stand in many places at once. I don’t tem deal with the problems of allocat
programmer should certainly know know who “I” am anymore, and there ing tasks to processors and balancing
www.americanscientist.org © 2007 Brian Hayes. Reproduction with permission only. 2007 November–December 479
the workload. This is the main ap colleagues have written hundreds of multicore designs are not fundamen
proach taken today with time-sliced programs that run on very large clus tal physical limits; they are merely
multiprocessing and, more recently, ters of computers, all using a program hurdles that engineers have not yet
with dual-core chips. Whether it will ming model they call MapReduce. learned to leap. New materials or new
continue to work well with hundreds The lesson of these examples ap fabrication techniques could upset all
of cores is unclear. In the simplest case, pears to be that we shouldn’t waste our assumptions.
an operating system would adopt a effort trying to adapt or convert exist A year ago, IBM and Georgia Tech
one-processor-per-program rule. Thus ing software. A new computer archi tested an experimental silicon-germa
the spreadsheet running in the back tecture calls for a new mental model, nium chip at a clock rate of 500 giga
ground would never slow down the ac a new metaphor. We need to rethink hertz—more than a hundred times the
tion in the video game on your screen. the problem as well as the solution. In speed of processors now on the mar
But this policy leaves processors idle if other words, we have a historic oppor ket. Reaching that clock rate required
there aren’t enough programs running, tunity to clean out the closet of com cooling the device to 4 Kelvins, which
and it would do nothing to help any puter science, to throw away all those might seem to rule it out as a practi
single program run faster. To make bet dusty old sorting algorithms and the cal technology. But which is harder:
ter use of the hardware, each program design patterns that no longer fit. We Writing reliable and efficient parallel
needs to be divided into many threads get to make a fresh start. (Be ready to software, or building a liquid-helium
of execution. buy all new software along with your cooler for a laptop computer? I’m not
An alternative is to put the burden new kilocore computer.) sure I know the answer.
on the compiler—the software that
translates a program text into machine The Helium-cooled Laptop Bibliography
code. The dream is to start with an or Although there’s doubtless a multicore Agarwal, Anant, and Markus Levy. 2007.
dinary sequential program and have processor in my future (and yours), I’m Thousand-core chips: The kill rule for multi
core. In Proceedings of the 44th Annual Con-
it magically sliced and diced for ex not yet entirely convinced that massive
ference on Design Automation DAC ’07, pp.
ecution on any number of processors. parallelism is the direction computing 750–753. New York: ACM Press.
Needless to say, this Vegematic com will follow for decades to come. There Asanovic, Krste, Ras Bodik, Bryan Christopher
piler doesn’t yet exist, although some could be further detours and devia Catanzaro, Joseph James Gebis, Parry Hus
compilers do detect certain opportuni tions. There could be a U-turn. bands, Kurt Keutzer, David A. Patterson,
ties for parallel processing. The multicore design averts a power William Lester Plishker, John Shalf, Samuel
Webb Williams and Katherine A. Yelick.
Both of these strategies rely on the catastrophe, but it won’t necessarily 2006. The landscape of parallel computing
wizardry of a programming elite— break through the memory bottleneck. research: A view from Berkeley. University
those who build operating systems All of those cores crammed onto a sin of California, Berkeley, Electrical Engineer
and compilers—allowing the rest of gle silicon chip have to compete for ing and Computer Sciences Technical Re
port UCB/EECS-2006-183. http://www.
us to go on pretending we live in a the same narrow channel to reach off- eecs.berkeley.edu/Pubs/TechRpts/2006/
sequential world. But if massive paral chip main memory. As the number of EECS-2006-183.html
lelism really is the way of the future, it cores increases, contention for memory Brock, David C., ed. 2006. Understanding
can’t remain hidden behind the curtain bandwidth may well be the factor that Moore’s Law: Four Decades of Innovation.
forever. Everyone who writes software limits overall system performance. Philadelphia: Chemical Heritage Press.
will have to confront the challenge of In the present situation we have Carriero, Nicholas, and David Gelernter.
creating programs that run correctly an abundance of transistors available 1990. How to Write Parallel Programs: A First
Course. Cambridge, Mass.: The MIT Press.
and efficiently on multicore systems. but no clear idea of the best way to
Dean, Jeffrey, and Sanjay Ghemawat. 2004.
Contrarians argue that parallel pro make use of them. Lots of little pro MapReduce: Simplified data processing
gramming is not really much harder cessors is one solution, but there are on large clusters. In Proceedings of the Sixth
than sequential programming; it just alternatives. One idea is to combine Symposium on Operating Systems Design and
requires a different mode of thinking. a single high-performance CPU and Implementation, pp. 137–150. http://labs.
google.com/papers/mapreduce.html
Both Hillis and Gelernter have taken several gigabytes of main memory on
Dennard, Robert, Fritz Gaensslen, Hwa-Nien
this position, backing it up with de the same sliver of silicon. This system- Yu, V. Leo Rideout, Ernest Bassous and An
tailed accounts drawn from their own on-a-chip is an enticing possibility; it dre LeBlanc. 1974. Design of ion-implanted
experience. For example, Hillis and would have benefits in price, power MOSFETs with very small physical dimen
Guy L. Steele, Jr., describe their search and performance. But there are also sions. IEEE Journal of Solid State Circuits SC-
for the best sorting algorithm on the impediments. For one thing, the steps 9(5):256–268.
Connection Machine. They found, to in fabricating a CPU are different from Hillis, W. Daniel, and Guy L. Steele, Jr. 1986.
Data parallel algorithms. Communications of
no one’s surprise, that solutions from those that create the highest-density the ACM 29:1170–1183.
the uniprocessor world are seldom op memories, so it’s not easy to put both Intel Corporation. 2007. Special issue on tera
timal when you have 65,536 processors kinds of devices on one chip. There are scale computing. Intel Technical Journal 11(3).
to play with. What’s more illuminating also institutional barriers: Semiconduc http://www.intel.com/technology/itj/
is their realization that having immedi tor manufacturers tend to have exper International Technology Roadmap for Semicon
ate access to every element of a large tise in microprocessors or in memories ductors. 2005. http://www.itrs.net/Links/
2005ITRS/Home2005.htm
data set means you may not need to but not in both.
Lee, Edward A. 2006. The problem with
sort at all. More recently, Jeffrey Dean Finally, we haven’t necessarily seen threads. IEEE Computer 39(5):33–42.
and Sanjay Ghemawat of Google have the last of the wicked-fast uniprocessor. Sutter, Herb, and James Larus. 2005. Software
described a major success story for The power and memory constraints and the concurrency revolution. ACM
parallel programming. They and their that have lately driven chipmakers to Queue 3(7):54–62.

Computing in A Parallel Universe

Enviado por

Dados do documento

Título original

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

Computing in A Parallel Universe

Enviado por

Direitos autorais:

Formatos disponíveis

A reprint from