Scalable Lock-Free Dynamic Memory Allocation


Maged M. Michael
IBM Thomas J. Watson Research Center
P.O. Box 218, Yorktown Heights, NY 10598, USA
magedm@us.ibm.com
ABSTRACT

Dynamic memory allocators (malloc/free) rely on mutual exclusion locks for protecting the consistency of their shared data structures under multithreading. The use of locking has many disadvantages with respect to performance, availability, robustness, and programming flexibility. A lock-free memory allocator guarantees progress regardless of whether some threads are delayed or even killed, and regardless of scheduling policies. This paper presents a completely lock-free memory allocator. It uses only widely-available operating system support and hardware atomic instructions. It offers guaranteed availability even under arbitrary thread termination and crash-failure, and it is immune to deadlock regardless of scheduling policies, and hence it can be used even in interrupt handlers and real-time applications without requiring special scheduler support. Also, by leveraging some high-level structures from Hoard, our allocator is highly scalable, limits space blowup to a constant factor, and is capable of avoiding false sharing. In addition, our allocator allows finer concurrency and much lower latency than Hoard. We use PowerPC shared memory multiprocessor systems to compare the performance of our allocator with the default AIX 5.1 libc malloc, and two widely-used multithread allocators, Hoard and Ptmalloc. Our allocator outperforms the other allocators in virtually all cases, often by substantial margins, under various levels of parallelism and allocation patterns. Furthermore, our allocator also offers the lowest contention-free latency among the allocators, by significant margins.

Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent Programming; D.3.3 [Programming Languages]: Language Constructs and Features - dynamic storage management; D.4.1 [Operating Systems]: Process Management - concurrency, deadlocks, synchronization, threads.

General Terms: Algorithms, Performance, Reliability.

Keywords: malloc, lock-free, async-signal-safe, availability.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI'04, June 9-11, 2004, Washington, DC, USA.
Copyright 2004 ACM 1-58113-807-5/04/0006 ...$5.00.
1. INTRODUCTION
Dynamic memory allocation functions, such as malloc and free, are heavily used by a wide range of important multithreaded applications, from commercial database and web servers to data mining and scientific applications. In order to be safe under multithreading (MT-safe), current allocators employ mutual exclusion locking in a variety of ways, ranging from the use of a single lock wrapped around single-thread malloc and free, to the distributed use of locks in order to allow more concurrency and higher scalability. The use of locking causes many problems and limitations with respect to performance, availability, robustness, and programming flexibility.

A desirable but challenging alternative approach for achieving MT-safety is lock-free synchronization. A shared object is lock-free (nonblocking) if it guarantees that whenever a thread executes some finite number of steps, at least one operation on the object by some thread must have made progress during the execution of these steps. Lock-free synchronization implies several inherent advantages:

Immunity to deadlock: By definition, a lock-free object must be immune to deadlock and livelock. Therefore, it is much simpler to design deadlock-free systems when all or some of their components are lock-free.

Async-signal-safety: Due to the use of locking in current implementations of malloc and free, they are not considered async-signal-safe [9], i.e., signal handlers are prohibited from using them. The reason for this prohibition is that if a thread receives a signal while holding a user-level lock in the allocator, and if the signal handler calls the allocator, and in the process it must acquire the same lock held by the interrupted thread, then the allocator becomes deadlocked due to circular dependence. The signal handler waits for the interrupted thread to release the lock, while the thread cannot resume until the signal handler completes. Masking interrupts or using kernel-assisted locks in malloc and free is too costly for such heavily-used functions. In contrast, a completely lock-free allocator is capable of being async-signal-safe without incurring any performance cost.

Tolerance to priority inversion: Similarly, in real-time applications, user-level locking is susceptible to deadlock due to priority inversion. That is, a high-priority thread can be waiting for a user-level lock to be released by a lower-priority thread that will not be scheduled until the high-priority thread completes its task. Lock-free synchronization guarantees progress regardless of scheduling policies.
Kill-tolerant availability: A lock-free object must be immune to deadlock even if any number of threads are killed while operating on it. Accordingly, a lock-free object must offer guaranteed availability regardless of arbitrary thread termination or crash-failure. This is particularly useful for servers that require a high level of availability, but can tolerate the infrequent loss of tasks or servlets that may be killed by the server administrator in order to relieve temporary resource shortages.

    CAS(addr,expval,newval) atomically do
        if (*addr == expval) {
            *addr = newval;
            return true;
        } else
            return false;

Figure 1: Compare-and-Swap

    AtomicInc(addr)
        do {
            oldval = *addr;
            newval = oldval+1;
        } until CAS(addr,oldval,newval);

Figure 2: Atomic increment using CAS
Preemption-tolerance: When a thread is preempted while holding a mutual exclusion lock, other threads waiting for the same lock either spin uselessly, possibly for the rest of their time slices, or have to pay the performance cost of yielding their processors in the hope of giving the lock holder an opportunity to complete its critical section. Lock-free synchronization offers preemption-tolerant performance, regardless of arbitrary thread scheduling.

It is clear that it is desirable for memory allocators to be completely lock-free. The question is how, and more importantly, how to be completely lock-free and (1) offer good performance competitive with the best lock-based allocators (i.e., low latency, scalability, avoiding false sharing, constant space blowup factor, and robustness under contention and preemption), (2) use only widely-available hardware and OS support, and (3) avoid trivializing assumptions that make lock-free progress easy but result in unacceptable memory consumption or impose unreasonable restrictions.
For example, it is trivial to design a wait-free allocator with pure per-thread private heaps. That is, each thread allocates from its own heap and also frees blocks to its own heap. However, this is hardly an acceptable general-purpose solution, as it can lead to unbounded memory consumption (e.g., under a producer-consumer pattern [3]), even when the program's memory needs are in fact very small. Other unacceptable characteristics include the need for initializing large parts of the address space, putting an artificial limit on the total size or number of allocatable dynamic blocks, or restricting beforehand regions of the address space to specific threads or specific block sizes. An acceptable solution must be general-purpose and space efficient, and should not impose artificial limitations on the use of the address space.
In this paper we present a completely lock-free allocator that offers excellent performance, uses only widely-available hardware and OS support, and is general-purpose.

For constructing our lock-free allocator, with only the simple atomic instructions supported on current mainstream processor architectures as our memory access tools, we break down malloc and free into fine atomic steps, and organize the allocator's data structures such that if any thread is delayed arbitrarily (or even killed) at any point, then any other thread using the allocator will be able to determine enough of the state of the allocator to proceed with its own operation without waiting for the delayed thread to resume.

By leveraging some high-level structures from Hoard [3], a scalable lock-based allocator, we achieve concurrency between operations on multiple processors, avoid inducing false sharing, and limit space blowup to a constant factor. In addition, our allocator uses a simpler and finer-grained organization that allows more concurrency and lower latency than Hoard.

We use POWER3 and POWER4 shared memory multiprocessors to compare the performance of our allocator with the default AIX 5.1 libc malloc, and two widely-used lock-based allocators with mechanisms for scalable performance, Hoard [3] and Ptmalloc [6]. The experimental performance results show that not only is our allocator competitive with some of the best lock-based allocators, but also that it outperforms them, often by substantial margins, in virtually all cases, including under various levels of parallelism and various sharing patterns, and also offers the lowest contention-free latency.

The rest of the paper is organized as follows. In Section 2, we discuss atomic instructions and related work. Section 3 describes the new allocator in detail. Section 4 presents our experimental performance results. We conclude the paper with Section 5.
2. BACKGROUND
2.1 Atomic Instructions
Current mainstream processor architectures support either Compare-and-Swap (CAS) or the pair Load-Linked and Store-Conditional (LL/SC). Other weaker instructions, such as Fetch-and-Add and Swap, may be supported, but in any case they are easily implemented using CAS or LL/SC.

CAS was introduced on the IBM System 370 [8]. It is supported on Intel (IA-32 and IA-64) and Sun SPARC architectures. In its simplest form, it takes three arguments: the address of a memory location, an expected value, and a new value. If the memory location is found to hold the expected value, the new value is written to it, atomically. A Boolean return value indicates whether the write occurred. If it returns true, it is said to succeed. Otherwise, it is said to fail. Figure 1 shows the semantics of CAS.
LL and SC are supported on the PowerPC, MIPS, and Alpha architectures. LL takes one argument, the address of a memory location, and returns its contents. SC takes two arguments, the address of a memory location and a new value. Only if no other thread has written the memory location since the current thread last read it using LL, the new value is written to the memory location, atomically. A Boolean return value indicates whether the write occurred. Similar to CAS, SC is said to succeed or fail if it returns true or false, respectively. For architectural reasons, current architectures that support LL/SC do not allow the nesting or interleaving of LL/SC pairs, and infrequently allow SC to fail spuriously, even if the target location was never written since the last LL. These spurious failures happen, for example, if the thread was preempted or a different location in the same cache line was written by another processor.
For generality, we present the algorithms in this paper using CAS. If LL/SC are supported rather than CAS, then CAS(addr,expval,newval) can be simulated in a lock-free manner as follows: {do {if (LL(addr)!=expval) return false;} until SC(addr,newval); return true;}. Support for CAS and restricted LL/SC on aligned 64-bit blocks is available on both 32-bit and 64-bit architectures, e.g., CMPXCHG8B on IA-32. However, support for CAS or LL/SC on wider block sizes is generally not available even on 64-bit architectures. Therefore, we focus our presentation of the algorithms on 64-bit mode, as it is the more challenging case, while 32-bit mode is simpler.
For a very simple example of lock-free synchronization, Figure 2 shows the classic lock-free implementation of Atomic Increment using CAS [8]. Note that if a thread is delayed at any point while executing this routine, other active threads will be able to proceed with their operations without waiting for the delayed thread, and every time a thread executes a full iteration of the loop, some operation must have made progress. If the CAS succeeds, then the increment of the current thread has taken effect. If the CAS fails, then the value of the counter must have changed during the loop. The only way the counter changes is if a CAS succeeds. Then, some other thread's CAS must have succeeded during the loop, and hence that other thread's increment must have taken effect.
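As an aside for concreteness, the loop of Figure 2 maps directly onto C11 atomics. The following sketch is ours, not part of the original presentation; atomic_compare_exchange_weak plays the role of CAS and, on failure, reloads the expected value, so the retry loop needs no separate re-read:

    #include <stdatomic.h>

    /* Lock-free increment in the style of Figure 2. The weak CAS may
       fail spuriously (like SC), so it sits in a retry loop. */
    void atomic_inc(_Atomic unsigned *addr) {
        unsigned oldval = atomic_load(addr);
        /* On failure, oldval is refreshed with the current value. */
        while (!atomic_compare_exchange_weak(addr, &oldval, oldval + 1))
            ;
    }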
2.2 Related Work
The concept of lock-free synchronization goes back more than two decades. It is attributed to early work by Lamport [12] and to the motivating basis for introducing the CAS instruction in the IBM System 370 documentation [8]. The impossibility and universality results of Herlihy [7] had significant influence on the theory and practice of lock-free synchronization, by showing that atomic instructions such as CAS and LL/SC are more powerful than others such as Test-and-Set, Swap, and Fetch-and-Add, in their ability to provide lock-free implementations of arbitrary object types. In other publications [17, 19], we review practical lock-free algorithms for dynamic data structures in light of recent advances in lock-free memory management.
Wilson et al. [23] present a survey of sequential memory allocation. Berger [2, 3] presents an overview of concurrent allocators, e.g., [4, 6, 10, 11, 13]. In our experiments, we compare our allocator with two widely-used malloc replacement packages for multiprocessor systems, Ptmalloc and Hoard. We also leverage some scalability-enabling high-level structures from Hoard.
Ptmalloc [6], developed by Wolfram Gloger and based on Doug Lea's dlmalloc sequential allocator [14], is part of GNU glibc. It uses multiple arenas in order to reduce the adverse effect of contention. The granularity of locking is the arena. If a thread executing malloc finds an arena locked, it tries the next one. If all arenas are found to be locked, the thread creates a new arena to satisfy its malloc and adds the new arena to the main list of arenas. To improve locality and reduce false sharing, each thread keeps thread-specific information about the arena it used in its last malloc. When a thread frees a chunk (block), it returns the chunk to the arena from which the chunk was originally allocated, and the thread must acquire that arena's lock.
Hoard [2, 3], developed by Emery Berger, uses multiple processor heaps in addition to a global heap. Each heap contains zero or more superblocks. Each superblock contains one or more blocks of the same size. Statistics are maintained individually for each superblock as well as collectively for the superblocks of each heap. When a processor heap is found to have too much available space, one of its superblocks is moved to the global heap. When a thread finds that its processor heap does not have available blocks of the desired size, it checks if any superblocks of the desired size are available in the global heap. Threads use their thread ids to decide which processor heap to use for malloc. For free, a thread must return the block to its original superblock and update the fullness statistics for the superblock as well as the heap that owns it. Typically, malloc and free require one and two lock acquisitions, respectively.
Dice and Garthwaite [5] propose a partly lock-free allocator. The allocator requires special operating system support, which makes it not readily portable across operating systems and programming environments. In the environment for their allocator, the kernel monitors thread migration and preemption and posts upcalls to user-mode. When a thread is scheduled to run, the kernel posts the CPU id of the processor that the thread is to run on during its upcoming time slice. The kernel also saves the user-mode instruction pointer in a thread-specific location and replaces it with the address of a special notification routine that will be the first thing the thread executes when it resumes. The notification routine checks if the thread was in a critical section when it was preempted. If so, the notification routine passes control to the beginning of the critical section instead of the original instruction pointer, so that the thread can retry its critical section. The allocator can apply this mechanism only to CPU-specific data. So, it is only used for the CPU's local heap. For all other operations, such as freeing a block that belongs to a remote CPU heap or any access to the global heap, mutual exclusion locks are used. The allocator is not completely lock-free, and hence, without additional special support from the kernel, it is susceptible to deadlock under arbitrary thread termination or priority inversion.
3. LOCK-FREE ALLOCATOR

This section describes our lock-free allocator in detail. Without loss of generality, we focus on the case of a 64-bit address space. The 32-bit case is simpler, as 64-bit CAS is supported on 32-bit architectures.
3.1 Overview

First, we start with the general structure of the allocator. Large blocks are allocated directly from the OS and freed directly to the OS. For smaller block sizes, the heap is composed of large superblocks (e.g., 16 KB). Each superblock is divided into multiple equal-sized blocks. Superblocks are distributed among size classes based on their block sizes. Each size class contains multiple processor heaps proportional to the number of processors in the system. A processor heap contains at most one active superblock. An active superblock contains one or more blocks available for reservation that are guaranteed to be available to threads that reach them through the header of the processor heap. Each superblock is associated with a descriptor. Each allocated block contains a prefix (8 bytes) that points to the descriptor of its superblock. On the first call to malloc, the static structures for the size classes and processor heaps (about 16 KB for a 16 processor machine) are allocated and initialized in a lock-free manner.
Malloc starts by identifying the appropriate processor heap, based on the requested block size and the identity of the calling thread. Typically, the heap already has an active superblock with blocks available for reservation. The thread atomically reads a pointer to the descriptor of the active superblock and reserves a block. Next, the thread atomically pops a block from that superblock and updates its descriptor. A typical free pushes the freed block into the list of available blocks of its original superblock by atomically updating its descriptor. We discuss the less frequent, more complicated cases below when describing the algorithms in detail.

    // Superblock descriptor structure
    typedef anchor : // fits in one atomic block
        unsigned avail:10,count:10,state:2,tag:42;
    // state codes ACTIVE=0 FULL=1 PARTIAL=2 EMPTY=3
    typedef descriptor :
        anchor Anchor;
        descriptor* Next;
        void* sb;          // pointer to superblock
        procheap* heap;    // pointer to owner procheap
        unsigned sz;       // block size
        unsigned maxcount; // superblock size/sz

    // Processor heap structure
    typedef active : unsigned ptr:58,credits:6;
    typedef procheap :
        active Active;       // initially NULL
        descriptor* Partial; // initially NULL
        sizeclass* sc;       // pointer to parent sizeclass

    // Size class structure
    typedef sizeclass :
        descList Partial;    // initially empty
        unsigned sz;         // block size
        unsigned sbsize;     // superblock size

Figure 3: Structures
3.2 Structures and Algorithms

For the most part, we provide detailed (C-like) code for the algorithms, as we believe that it is essential for understanding lock-free algorithms, unlike lock-based algorithms where sequential components protected by locks can be described clearly using high-level pseudocode.
3.2.1 Structures
Figure 3 shows the details of the above-mentioned structures. The Anchor field in the superblock descriptor structure contains subfields that can be updated together atomically using CAS or LL/SC. The subfield avail holds the index of the first available block in the superblock, count holds the number of unreserved blocks in the superblock, state holds the state of the superblock, and tag is used to prevent the ABA problem, as discussed below.

The Active field in the processor heap structure is primarily a pointer to the descriptor of the active superblock owned by the processor heap. If the value of Active is not NULL, it is guaranteed that the active superblock has at least one block available for reservation. Since the addresses of superblock descriptors can be guaranteed to be aligned to some power of 2 (e.g., 64), as an optimization, we can carve a credits subfield to hold the number of blocks available for reservation in the active superblock less one. That is, if the value of credits is n, then the active superblock contains n+1 blocks available for reservation through the Active field.

Note that the number of blocks in a superblock is not limited to the maximum reservations that can be held in the credits subfield. In a typical malloc operation (i.e., when Active != NULL and credits > 0), the thread reads Active and then atomically decrements credits while validating that the active superblock is still valid.
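As an illustration (our sketch, not the paper's code), the Active field's pointer and credits subfields can be carved out of a single 64-bit CAS-able word as follows, relying on the 64-byte alignment of descriptors; the widths follow Figure 3 (58-bit pointer, 6-bit credits, hence MAXCREDITS = 64):

    #include <stdint.h>

    #define CREDITS_BITS 6
    #define CREDITS_MASK ((1ULL << CREDITS_BITS) - 1)  /* low 6 bits */

    typedef uint64_t active_t;

    static inline active_t make_active(void *desc, unsigned credits) {
        /* desc is 64-byte aligned, so its low 6 bits are zero */
        return (uint64_t)(uintptr_t)desc | (credits & CREDITS_MASK);
    }
    static inline void *active_ptr(active_t a) {       /* mask_credits */
        return (void *)(uintptr_t)(a & ~CREDITS_MASK);
    }
    static inline unsigned active_credits(active_t a) {
        return (unsigned)(a & CREDITS_MASK);
    }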
3.2.2 Superblock States

A superblock can be in one of four states: ACTIVE, FULL, PARTIAL, or EMPTY. A superblock is ACTIVE if it is the active superblock in a heap, or if a thread intends to try to install it as such. A superblock is FULL if all its blocks are either allocated or reserved. A superblock is PARTIAL if it is not ACTIVE and contains unreserved available blocks. A superblock is EMPTY if all its blocks are free and it is not ACTIVE. An EMPTY superblock is safe to be returned to the OS if desired.
3.2.3 Malloc

Figure 4 shows the malloc algorithm. The outline of the algorithm is as follows. If the block size is large, then the block is allocated directly from the OS and its prefix is set to indicate the block's size. Otherwise, the appropriate heap is identified using the requested block size and the id of the requesting thread. Then, the thread tries the following in order until it allocates a block: (1) Allocate a block from the heap's active superblock. (2) If no active superblock is found, try to allocate a block from a PARTIAL superblock. (3) If none are found, allocate a new superblock and try to install it as the active superblock.
Malloc from Active Superblock

The vast majority of malloc requests are satisfied from the heap's active superblock, as shown in the MallocFromActive routine in Figure 4. The routine consists of two main steps. The first step (lines 1-6) involves reading a pointer to the active superblock and then atomically decrementing the number of available credits (thereby reserving a block) while validating that the active superblock is still valid. Upon the success of the CAS in line 6, the thread is guaranteed that a block in the active superblock is reserved for it.
The second step of MallocFromActive (lines 7-18) is primarily a lock-free pop from a LIFO list [8]. The thread reads the index of the first block in the list from Anchor.avail in line 8, then it reads the index of the next block in line 10 (see footnote 1), and finally in line 18 it tries to swing the head pointer (i.e., Anchor.avail) atomically to the next block, while validating that at that time what it "thinks" to be the first two indexes in the list (i.e., oldanchor.avail and next) are indeed the first two indexes in the list, and hence in effect popping the first available block from the list.
Validating that the CAS in line 18 succeeds only if Anchor.avail is equal to oldanchor.avail follows directly from the semantics of CAS. However, validating that at that time *addr==next is more subtle, and without the tag subfield it is susceptible to the ABA problem [8, 19]. Consider the case where in line 8 thread X reads the value A from Anchor.avail and in line 10 reads the value B from *addr. After line 10, X is delayed, and some other thread or threads pop (i.e., allocate) block A then block B, and then push (i.e., free) some block C and then block A back in the list. Later, X resumes and executes the CAS in line 18. Without the tag subfield (for simplicity ignore the count subfield), the CAS would find Anchor equal to oldanchor and would succeed where it should not, as the new head of the free list would become block B, which is actually not free.
Footnote 1: This is correct even if there is no next block, because in such a case no subsequent malloc will target this superblock before one of its blocks is freed.
    void* malloc(sz) {
        // Use sz and thread id to find heap.
    1   heap = find_heap(sz);
    2   if (!heap) // Large block
    3       Allocate block from OS and return its address.
        while(1) {
    4       addr = MallocFromActive(heap);
    5       if (addr) return addr;
    6       addr = MallocFromPartial(heap);
    7       if (addr) return addr;
    8       addr = MallocFromNewSB(heap);
    9       if (addr) return addr;
    } }
    void* MallocFromActive(heap) {
        do { // First step: reserve block
    1       newactive = oldactive = heap->Active;
    2       if (!oldactive) return NULL;
    3       if (oldactive.credits == 0)
    4           newactive = NULL;
            else
    5           newactive.credits--;
    6   } until CAS(&heap->Active,oldactive,newactive);
        // Second step: pop block
    7   desc = mask_credits(oldactive);
        do {
            // state may be ACTIVE, PARTIAL or FULL
    8       newanchor = oldanchor = desc->Anchor;
    9       addr = desc->sb+oldanchor.avail*desc->sz;
    10      next = *(unsigned*)addr;
    11      newanchor.avail = next;
    12      newanchor.tag++;
    13      if (oldactive.credits == 0) {
                // state must be ACTIVE
    14          if (oldanchor.count == 0)
    15              newanchor.state = FULL;
                else {
    16              morecredits = min(oldanchor.count,MAXCREDITS);
    17              newanchor.count -= morecredits;
            } }
    18  } until CAS(&desc->Anchor,oldanchor,newanchor);
    19  if (oldactive.credits==0 && oldanchor.count>0)
    20      UpdateActive(heap,desc,morecredits);
    21  *addr = desc; return addr+EIGHTBYTES;
    }
    UpdateActive(heap,desc,morecredits) {
    1   newactive = desc;
    2   newactive.credits = morecredits-1;
    3   if CAS(&heap->Active,NULL,newactive) return;
        // Someone installed another active sb
        // Return credits to sb and make it partial
        do {
    4       newanchor = oldanchor = desc->Anchor;
    5       newanchor.count += morecredits;
    6       newanchor.state = PARTIAL;
    7   } until CAS(&desc->Anchor,oldanchor,newanchor);
    8   HeapPutPartial(desc);
    }
    void* MallocFromPartial(heap) {
    retry:
    1   desc = HeapGetPartial(heap);
    2   if (!desc) return NULL;
    3   desc->heap = heap;
        do { // reserve blocks
    4       newanchor = oldanchor = desc->Anchor;
    5       if (oldanchor.state == EMPTY) {
    6           DescRetire(desc); goto retry;
            }
            // oldanchor state must be PARTIAL
            // oldanchor count must be > 0
    7       morecredits = min(oldanchor.count-1,MAXCREDITS);
    8       newanchor.count -= morecredits+1;
    9       newanchor.state = (morecredits > 0) ? ACTIVE : FULL;
    10  } until CAS(&desc->Anchor,oldanchor,newanchor);
        do { // pop reserved block
    11      newanchor = oldanchor = desc->Anchor;
    12      addr = desc->sb+oldanchor.avail*desc->sz;
    13      newanchor.avail = *(unsigned*)addr;
    14      newanchor.tag++;
    15  } until CAS(&desc->Anchor,oldanchor,newanchor);
    16  if (morecredits > 0)
    17      UpdateActive(heap,desc,morecredits);
    18  *addr = desc; return addr+EIGHTBYTES;
    }
    descriptor* HeapGetPartial(heap) {
        do {
    1       desc = heap->Partial;
    2       if (desc == NULL)
    3           return ListGetPartial(heap->sc);
    4   } until CAS(&heap->Partial,desc,NULL);
    5   return desc;
    }
    void* MallocFromNewSB(heap) {
    1   desc = DescAlloc();
    2   desc->sb = AllocNewSB(heap->sc->sbsize);
    3   Organize blocks in a linked list starting with index 0.
    4   desc->heap = heap;
    5   desc->Anchor.avail = 1;
    6   desc->sz = heap->sc->sz;
    7   desc->maxcount = heap->sc->sbsize/desc->sz;
    8   newactive = desc;
    9   newactive.credits = min(desc->maxcount-1,MAXCREDITS)-1;
    10  desc->Anchor.count = (desc->maxcount-1)-(newactive.credits+1);
    11  desc->Anchor.state = ACTIVE;
    12  memory fence.
    13  if CAS(&heap->Active,NULL,newactive) {
    14      addr = desc->sb;
    15      *addr = desc; return addr+EIGHTBYTES;
        } else {
    16      Free the superblock desc->sb.
    17      DescRetire(desc); return NULL;
        }
    }

Figure 4: Malloc
[Figure 5 diagram: four configurations (a)-(d) of an active superblock and its descriptor, each showing the Active pointer with its credits subfield and the Anchor subfields (avail, count, state, tag) over the superblock's list of blocks.]

Figure 5: An example of a typical malloc and free from an active superblock. In configuration (a), the active superblock contains 5 available blocks organized in a linked list (5,3,6,7,1), four of which are available for reservation as indicated by Active.credits=3. In the first step of malloc, a block is reserved by decrementing Active.credits, resulting in configuration (b). In the second step of malloc, block 5 is popped, resulting in configuration (c). Free pushes the freed block (block 5), resulting in configuration (d).
Without the tag subfield, X is unable to detect that the value of Anchor.avail changed from A to B and finally back to A again (hence the name ABA). To prevent this problem for the Anchor field, we use the classic IBM tag mechanism [8]. We increment the tag subfield (line 12) on every pop and validate it atomically with the other subfields of Anchor. Therefore, in the above-mentioned scenario, when the tag is used, the CAS fails, as it should, and X starts over from line 8. The tag must have enough bits to make full wraparound practically impossible in a short time. For an absolute solution of the ABA problem, an efficient implementation of ideal LL/SC, which inherently prevents the ABA problem, using pointer-sized CAS can be used [18, 19].
In lines 13-17, the thread checks if it has taken the last credit in Active.credits. If so, it checks if the superblock has more available blocks, either because maxcount is larger than MAXCREDITS or because blocks were freed. If more blocks are available, the thread reserves as many as possible (lines 16 and 17). Otherwise, it declares the superblock FULL (line 15). The reason for doing that is that FULL superblocks are not pointed to by any allocator structures, so the first thread to free a block back to a FULL superblock needs to know that, in order to take responsibility for linking it back to the allocator structures.
If the thread has taken credits, it tries to update Active by executing UpdateActive. There is no risk of more than one thread trying to take credits from the same superblock at the same time. Only the thread that sets Active to NULL in line 6 can do that. Other concurrent threads find Active either with credits>0 or not pointing to desc at all.
Finally, the thread stores desc (i.e., the address of the descriptor of the superblock from which the block was allocated) into the prefix of the newly allocated block (line 21), so that when the block is subsequently freed, free can determine from which superblock it was originally allocated. Each block includes an 8-byte prefix (overhead).
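For illustration, the prefix convention can be sketched as follows (our code, not the paper's; EIGHTBYTES is the 8-byte prefix size used in Figures 4 and 6):

    #include <stdint.h>

    enum { EIGHTBYTES = 8 };

    /* malloc stores the descriptor address just before the address it
       returns to the caller (line 21 of Figure 4). */
    static inline void *install_prefix(void *block, void *desc) {
        *(void **)block = desc;
        return (char *)block + EIGHTBYTES;
    }

    /* free steps back over the prefix to recover the descriptor
       (line 2 of Figure 6). */
    static inline void *prefix_desc(void *ptr) {
        return *(void **)((char *)ptr - EIGHTBYTES);
    }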
Note that, after a thread finds Active.credits>0 and after the success of the CAS in line 6, and before the thread proceeds to a successful CAS in line 18, it is possible that the "active" superblock might have become FULL (if all available blocks were reserved), PARTIAL, or even the ACTIVE superblock of a different processor heap (but it must be of the same size class). However, it cannot be EMPTY. These possibilities do not matter to the original thread. After the success of the CAS in line 6, the thread is guaranteed a block from this specific superblock, and all it need do is pop a block from the superblock and leave the superblock's Anchor.state unchanged. Figure 5 shows a typical malloc and free from an active superblock.
Updating Active Credits

Typically, when the routine UpdateActive in Figure 4 is called, it ends with the success of the CAS operation in line 3 that reinstalls desc->sb as the active superblock for heap with one or more credits. However, it is possible that after the current thread had set heap->Active to NULL (line 6 of MallocFromActive), some other thread installed a new superblock. If so, the current thread must return the credits, indicate that the superblock is PARTIAL, and make the superblock available for future use in line 8 by calling HeapPutPartial (described below).
Malloc from Partial Superblock

The thread calls MallocFromPartial in Figure 4 if it finds Active==NULL. The thread tries to get a PARTIAL superblock by calling HeapGetPartial. If it succeeds, it tries to reserve as many blocks as it can (including one for itself) from the superblock's descriptor. Upon the success of the CAS in line 10, the thread is guaranteed to have reserved one or more blocks. It then proceeds in lines 11-15 to pop its reserved block, and if it has reserved more, it deposits the additional credits in Active by calling UpdateActive.

In HeapGetPartial, the thread first tries to pop a superblock from the Partial slot associated with the thread's processor heap. If Partial==NULL, then the thread checks the Partial list associated with the size class, as described in Section 3.2.6.
Malloc from New Superblock

If the thread does not find any PARTIAL superblocks, it calls MallocFromNewSB in Figure 4. The thread allocates a descriptor by calling DescAlloc (line 1), allocates a new superblock, and sets its fields appropriately (lines 2-11). Finally, it tries to install it as the active superblock in Active using CAS in line 13. If the CAS fails, the thread deallocates the superblock and retires the descriptor (alternatively, the thread could take the block, return the credits to the superblock, and install the superblock as PARTIAL). The failure of the CAS in line 13 implies that heap->Active is no longer NULL, and therefore a new active superblock must have been installed by another thread. In order to avoid having too many PARTIAL superblocks, and hence causing unnecessary external fragmentation, we prefer to deallocate the superblock rather than take a block from it and keep it as PARTIAL.
On systems with memory consistency models [1] weaker than sequential consistency, where the processors might execute and observe memory accesses out of order, fence instructions are needed to enforce the ordering of memory accesses. The memory fence instruction in line 12 serves to ensure that the new values of the descriptor fields are observed by other processors before the CAS in line 13 can be observed. Otherwise, if the CAS succeeds, then threads running on other processors may read stale values from the descriptor (see footnote 2).
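In C11 terms, the fence-then-CAS publication in lines 12-13 corresponds to a CAS with release ordering. The following sketch is ours, under that assumption, and is not the paper's code:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Publish a fully initialized descriptor as the active superblock.
       The release ordering makes the prior writes to the descriptor's
       fields visible no later than the new Active value. */
    bool install_active(_Atomic uint64_t *Active, uint64_t newactive) {
        uint64_t expected = 0;  /* NULL: no active superblock */
        return atomic_compare_exchange_strong_explicit(
            Active, &expected, newactive,
            memory_order_release,
            memory_order_relaxed);  /* failure ordering: none needed */
    }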
3.2.4 Free

Figure 6 shows the free algorithm. Large blocks are returned directly to the OS. The free algorithm for small blocks is simple. It primarily involves pushing the freed block into its superblock's available list and adjusting the superblock's state appropriately.
The instruction fence in line 14 is needed to ensure that the read in line 13 is executed before the success of the CAS in line 18. The memory fence in line 17 is needed to ensure that the write in line 8 is observed by other processors no later than the CAS in line 18 is observed.
If a thread is the first to return a block to a FULL superblock, then it takes responsibility for making it PARTIAL by calling HeapPutPartial, where it atomically swaps the superblock with the prior value in the Partial slot of the heap that last owned the superblock. If the previous value of heap->Partial is not NULL, i.e., it held a partial superblock, then the thread puts that superblock in the partial list of the size class, as described in Section 3.2.6.
If a thread frees the last allocated block in a superblock, then it takes responsibility for indicating that the superblock is EMPTY and frees it. The thread then tries to retire the associated descriptor. If the descriptor is in the Partial slot of a processor heap, a simple CAS will suffice to remove it. Otherwise, the descriptor may be in the Partial list of the size class (possibly in the middle). We discuss this case in Section 3.2.6.
Footnote 2: Due to the variety in memory consistency models and fence instructions among architectures, it is customary for concurrent algorithms presented in the literature to ignore them. In this paper, we opt to include fence instructions in the code, but for clarity we assume a typical PowerPC-like architecture. However, different architectures (including future ones) may use different consistency models.
    free(ptr) {
    1   if (!ptr) return;
    2   ((void**)ptr)--; // get prefix
    3   desc = *(descriptor**)ptr;
    4   if (large_block_bit_set(desc))
            // Large block - desc holds sz+1
    5       { Return block to OS. return; }
    6   sb = desc->sb;
        do {
    7       newanchor = oldanchor = desc->Anchor;
    8       *(unsigned*)ptr = oldanchor.avail;
    9       newanchor.avail = (ptr-sb)/desc->sz;
    10      if (oldanchor.state == FULL)
    11          newanchor.state = PARTIAL;
    12      if (oldanchor.count==desc->maxcount-1) {
    13          heap = desc->heap;
    14          instruction fence.
    15          newanchor.state = EMPTY;
            } else
    16          newanchor.count++;
    17      memory fence.
    18  } until CAS(&desc->Anchor,oldanchor,newanchor);
    19  if (newanchor.state == EMPTY) {
    20      Free the superblock sb.
    21      RemoveEmptyDesc(heap,desc);
    22  } else if (oldanchor.state == FULL)
    23      HeapPutPartial(desc);
    }

    HeapPutPartial(desc) {
    1   do { prev = desc->heap->Partial;
    2   } until CAS(&desc->heap->Partial,prev,desc);
    3   if (prev) ListPutPartial(prev);
    }

    RemoveEmptyDesc(heap,desc) {
    1   if CAS(&heap->Partial,desc,NULL)
    2       DescRetire(desc);
    3   else ListRemoveEmptyDesc(heap->sc);
    }

Figure 6: Free
3.2.5 Descriptor List

Figure 7 shows the DescAlloc and DescRetire routines. In DescAlloc, the thread first tries to pop a descriptor from the list of available descriptors (lines 3-4). If none is found, the thread allocates a superblock of descriptors, takes one descriptor, and tries to install the rest in the global available descriptor list. In order to avoid unnecessarily allocating too many descriptors, if the thread finds that some other thread has already made some descriptors available (i.e., the CAS in line 8 fails), then it returns the superblock to the OS and starts over in line 1, with the hope of finding an available descriptor. DescRetire is a straightforward lock-free push that follows the classic freelist push algorithm [8].

As mentioned above in the case of the pop operation in the MallocFromActive routine, care must be taken that the CAS does not succeed where it should not, due to the ABA problem. We indicate this in line 4 by using the term SafeCAS (i.e., ABA-safe). We use the hazard pointer methodology [17, 19], which uses only pointer-sized instructions, in order to prevent the ABA problem for this structure.
    descriptor* DescAvail; // initially NULL

    descriptor* DescAlloc() {
        while (1) {
    1       desc = DescAvail;
    2       if (desc) {
    3           next = desc->Next;
    4           if SafeCAS(&DescAvail,desc,next) break;
            } else {
    5           desc = AllocNewSB(DESCSBSIZE);
    6           Organize descriptors in a linked list.
    7           memory fence.
    8           if CAS(&DescAvail,NULL,desc->Next) break;
    9           Free the superblock desc.
            }
        }
    10  return desc;
    }

    DescRetire(desc) {
        do {
    1       oldhead = DescAvail;
    2       desc->Next = oldhead;
    3       memory fence.
    4   } until CAS(&DescAvail,oldhead,desc);
    }

Figure 7: Descriptor allocation
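As an aside, DescRetire maps naturally onto C11 atomics. In this sketch of ours (not the paper's code), the release ordering on the CAS takes the place of the explicit memory fence in line 3:

    #include <stdatomic.h>

    typedef struct descriptor descriptor;
    struct descriptor { descriptor *Next; /* ... other fields ... */ };

    _Atomic(descriptor *) DescAvail;  /* initially NULL */

    /* Classic lock-free freelist push (Treiber stack). */
    void desc_retire(descriptor *desc) {
        descriptor *oldhead =
            atomic_load_explicit(&DescAvail, memory_order_relaxed);
        do {
            desc->Next = oldhead;  /* link before publishing */
        } while (!atomic_compare_exchange_weak_explicit(
                     &DescAvail, &oldhead, desc,
                     memory_order_release, memory_order_relaxed));
    }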
In the current implementation, superblock descriptors are not reused as regular blocks and cannot be returned to the OS. This is acceptable, as descriptors constitute on average less than 1% of allocated memory. However, if desired, space for descriptors can be reused arbitrarily or returned to the OS, by organizing descriptors in a similar manner to regular blocks and maintaining special descriptors for superblocks of descriptors, with virtually no effect on average performance, whether contention-free or under high contention. This can be applied on as many levels as desired, such that at most 1% of 1% (and so on) of allocated space is restricted from being reused arbitrarily or returned to the OS.

Similarly, in order to reduce the frequency of calls to mmap and munmap, we allocate superblocks (e.g., 16 KB) in batches of hyperblocks (e.g., 1 MB; superblocks of superblocks) and maintain descriptors for such hyperblocks, allowing them eventually to be returned to the OS. We organize the descriptor Anchor field in a slightly different manner, such that superblocks are not written until they are actually used, thus saving disk swap space for unused superblocks.
3.2.6 Lists of Partial Superblocks

For managing the list of partial superblocks associated with each size class, we need to provide three functions: ListGetPartial, ListPutPartial, and ListRemoveEmptyDesc. The goal of the latter is to ensure that empty descriptors are eventually made available for reuse, and not necessarily to remove a specific empty descriptor immediately.

In one possible implementation, the list is managed in a LIFO manner, with the possibility of removing descriptors from the middle of the list. The simpler version in [19] of the lock-free linked list algorithm in [16] can be used to manage such a list. ListPutPartial inserts desc at the head of the list. ListGetPartial pops a descriptor from the head of the list. ListRemoveEmptyDesc traverses the list until it removes some empty descriptor or reaches the end of the list.
Another implementation, which we prefer, manages the list in a FIFO manner and thus reduces the chances of contention and false sharing. ListPutPartial enqueues desc at the tail of the list. ListGetPartial dequeues a descriptor from the head of the list. ListRemoveEmptyDesc keeps dequeuing descriptors from the head of the list until it dequeues a nonempty descriptor or reaches the end of the list; if it dequeues a nonempty descriptor, it reenqueues that descriptor at the tail of the list. By removing any one empty descriptor or moving two nonempty descriptors from the head of the list to its end, we are guaranteed that no more than half the descriptors in the list are left empty. We use a version of the lock-free FIFO queue algorithm in [20], with memory management optimized for the purposes of the new allocator.
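Rendered literally in the paper's C-like pseudocode style, the ListRemoveEmptyDesc policy just described would look roughly as follows. This is our sketch, not the original code; Dequeue and Enqueue stand for the lock-free FIFO queue operations, and the state check is schematic (a concurrent free can change a descriptor's state at any time):

    ListRemoveEmptyDesc(sc) {
        while (desc = Dequeue(&sc->Partial)) {
            if (desc->Anchor.state == EMPTY)
                DescRetire(desc);           // retire empties as found
            else {
                Enqueue(&sc->Partial,desc); // nonempty: move to tail
                return;                     // and stop
            }
        }
    }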
For preventing the ABA problem for pointer-sized variables in the above-mentioned list implementations, we cannot use IBM ABA-prevention tags (such as in Anchor.tag); instead, we use ideal LL/SC constructions based on pointer-sized CAS [18]. Note that these constructions as described in [18] use memory allocation; however, a general-purpose malloc is not needed, and in our implementation we allocate such blocks in a manner similar to, but simpler than, allocating descriptors.

Note that in our allocator, unlike Hoard [3], we do not maintain fullness classes or keep statistics about the fullness of processor heaps, and we are quicker to move partial superblocks to the partial list of the size class. This simplicity allows lower latency and lower fragmentation.
But one concern may be that this makes it more likely for blocks to be freed to a superblock in the size class partial lists. However, this is not a disadvantage at all, unlike in Hoard [3] and also [5], where this can cause contention on the global heap's lock. In our allocator, freeing a block into such a superblock does not cause any contention with operations on other superblocks, and in general is no more complex or less efficient than freeing a block into a superblock that is in the thread's own processor heap.

Another possible concern is that by moving partial superblocks out of the processor heap too quickly, contention and false sharing may arise. This is why we use a most-recently-used Partial slot (multiple slots can be used if desired) in the processor heap structure, and use a FIFO structure for the size class partial lists.
4. EXPERIMENTAL RESULTS
In this section, we describe our experimental performance results on two PowerPC multiprocessor systems. The first system has sixteen 375 MHz POWER3-II processors, with 24 GB of memory and 4 MB second level caches. The second system has eight 1.2 GHz POWER4+ processors, with 16 GB of memory and 32 MB third level caches. Both systems run AIX 5.1. We ran experiments on both systems. The results on the POWER3 system (with more processors) provided more insights into the allocators' scalability and ability to avoid false sharing and contention. The results on the POWER4 system provided insights into the contention-free latency of the allocators and contention-free synchronization costs on recent processor architectures.
We compare our allocator with the default AIX libc malloc (see footnote 3), Hoard [3] version 3.0.2 (December 2003), and Ptmalloc2 (Nov. 2002) [6]. All allocators and benchmarks were compiled using gcc and g++ with the highest optimization level (-O6) in 64-bit mode. We used pthreads for multithreading. All allocators were dynamically linked as shared libraries. For meaningful comparison, we tried to use optimal versions of Hoard and Ptmalloc as best we could. We modified the PowerPC lightweight locking routines in Hoard by removing a sync instruction from the beginning of the lock acquisition path, replacing the sync at the end of lock acquisition with isync, and adding eieio before lock release. These changes reduced the average contention-free latency of a pair of malloc and free using Hoard from 1.76 μs. to 1.51 μs. on POWER3, and from 885 ns. to 560 ns. on POWER4. The default distribution of Ptmalloc2 uses pthread mutex for locking. We replaced calls to pthread mutex by calls to a lightweight mutex that we coded using inline assembly. This reduced the average contention-free latency of a pair of malloc and free using Ptmalloc by more than 50%, from 1.93 μs. to 923 ns. on POWER3 and from 812 ns. to 404 ns. on POWER4. In addition, Ptmalloc showed substantially better scalability using the lightweight locks than it did using pthread mutex locks.

                       375 MHz POWER3          1.2 GHz POWER4
                       New  Hoard  Ptmalloc    New  Hoard  Ptmalloc
    Linux scalability  2.2  1.1    1.8         2.7  1.3    1.9
    Threadtest         2.1  1.2    1.9         2.3  1.2    1.9
    Larson             2.9  2.2    2.5         2.9  2.3    2.6

Table 1: Contention-free speedup over libc malloc

Footnote 3: Our observations on the default libc malloc are based on external experimentation only and are not based on any knowledge of its internal design.
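The barrier placement described above (isync after a successful acquisition, eieio before release) corresponds to acquire/release semantics in portable terms. The following C11 test-and-set spinlock is a sketch of ours with that placement, offered for illustration only; it is not the inline-assembly lock used in the experiments:

    #include <stdatomic.h>

    typedef atomic_flag lwlock_t;  /* initialize with ATOMIC_FLAG_INIT */

    static inline void lw_lock(lwlock_t *l) {
        /* acquire ordering on success maps to isync on PowerPC */
        while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
            ;  /* spin; a production lock would back off or yield */
    }
    static inline void lw_unlock(lwlock_t *l) {
        /* release ordering maps to eieio/lwsync before the store */
        atomic_flag_clear_explicit(l, memory_order_release);
    }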
4.1 Benchmarks
Due to the lack of standard benchmarks for multithreaded dynamic memory allocation, we use microbenchmarks that focus on specific performance characteristics. We use six benchmarks: benchmark 1 of Linux scalability [15], Threadtest, Active-false, and Passive-false from Hoard [3], Larson [13], and a lock-free producer-consumer benchmark that we describe below.
In Linux scalability, each thread performs 10 million malloc/free pairs of 8-byte blocks in a tight loop. In Threadtest, each thread performs 100 iterations of allocating 100,000 8-byte blocks and then freeing them in order. These two benchmarks capture allocator latency and scalability under regular private allocation patterns.
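The Linux scalability loop is simple enough to sketch. The following is our scaffolding around the constants quoted above, not the benchmark's actual source:

    #include <pthread.h>
    #include <stdlib.h>

    #define PAIRS     10000000   /* 10 million malloc/free pairs */
    #define BLOCKSIZE 8          /* 8-byte blocks */

    static void *worker(void *arg) {
        (void)arg;
        for (long i = 0; i < PAIRS; i++) {
            void *p = malloc(BLOCKSIZE);
            free(p);
        }
        return NULL;
    }

    /* main would create one worker per processor and time them. */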
In Active-false, each thread performs 10,000 malloc/free pairs (of 8-byte blocks), and each time it writes 1,000 times to each byte of the allocated block. Passive-false is similar to Active-false, except that initially one thread allocates blocks and hands them to the other threads, which free them immediately and then proceed as in Active-false. These two benchmarks capture the allocator's ability to avoid causing false sharing [22], whether actively or passively.
In Larson, initially one thread allocates and frees random-sized blocks (16 to 80 bytes) in random order, then an equal number of blocks (1024) is handed over to each of the remaining threads. In the parallel phase, which lasts 30 seconds, each thread randomly selects a block and frees it, then allocates a new random-sized block in its place. The benchmark measures how many free/malloc pairs are performed during the parallel phase. Larson captures the robustness of malloc's latency and scalability under irregular allocation patterns with respect to block size and order of deallocation over a long period of time.
In the lock-free Producer-consumer benchmark, we measure the number of tasks performed by t threads in 30 seconds. Initially, a database of 1 million items is initialized randomly. One thread is the producer and the others, if any, are consumers. For each task, the producer selects a random-sized (10 to 20) random set of array indexes, allocates a block of matching size (40 to 80 bytes) to record the array indexes, then allocates a fixed-size task structure (32 bytes) and a fixed-size queue node (16 bytes), and enqueues the task in a lock-free FIFO queue [19, 20]. A consumer thread repeatedly dequeues a task, creates histograms from the database for the indexes in the task, and then spends time proportional to a parameter work performing local work similar to the work in Hoard's Threadtest benchmark. When the number of tasks in the queue exceeds 1000, the producer helps the consumers by dequeuing a task from the queue and processing it. Each task involves 3 malloc operations on the part of the producer, and one malloc and 4 free operations on the part of the consumer. The consumer spends substantially more time on each task than the producer. Producer-consumer captures malloc's robustness under the producer-consumer sharing pattern, where threads free blocks allocated by other threads.
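For concreteness, a rough sketch (ours, with hypothetical helper names; not the benchmark's actual code) of one producer task, using the sizes quoted above:

    #include <stdlib.h>

    #define DB_ITEMS 1000000            /* 1 million database items */

    typedef struct task {               /* fixed-size task structure  */
        unsigned *indexes;              /* (32 bytes in the benchmark; */
        unsigned nindexes;              /*  this layout is schematic)  */
    } task_t;

    extern void lf_enqueue(task_t *t);  /* hypothetical: lock-free FIFO
                                           enqueue that allocates the
                                           16-byte queue node inside */

    void produce_one_task(void) {
        unsigned n = 10 + rand() % 11;                /* 10-20 indexes */
        unsigned *idx = malloc(n * sizeof(unsigned)); /* 40-80 bytes   */
        for (unsigned i = 0; i < n; i++)
            idx[i] = rand() % DB_ITEMS;
        task_t *t = malloc(sizeof(task_t));
        t->indexes = idx; t->nindexes = n;
        lf_enqueue(t);   /* third malloc (queue node) happens here */
    }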
4.2 Results

4.2.1 Latency

Table 1 presents contention-free (see footnote 4) speedups over libc malloc for the new allocator, Hoard, and Ptmalloc, for the benchmarks that are affected by malloc latency: Linux scalability, Threadtest, and Larson. Malloc's latency had little or no effect on the performance of Active-false, Passive-false, and Producer-consumer.
The new allocator achieves significantly lower contention-free latency than the other allocators under both regular and irregular allocation patterns. The reason is that it has a faster execution path in the common case. Also, unlike lock-based allocators, it operates only on the actual allocator variables, without the need to operate on additional lock-related variables and to synchronize these accesses with the accesses to the allocator variables through fence instructions.
The new allocator requires only one memory fence instruction (line 17 of free) in the common case for each pair of malloc and free, while every lock acquisition and release requires an instruction fence before the critical section, to prevent reads inside the critical section from reading stale data before lock acquisition, and a memory fence after the end of the critical section, to ensure that the lock is not observed to be free before the writes inside the critical section are also observed by other processors. In the common case, a pair of malloc and free using Ptmalloc and Hoard need to acquire and release two and three locks, respectively.

Footnote 4: It appears that libc malloc as well as Hoard use a technique where the parent thread bypasses synchronization if it knows that it has not spawned any threads yet. We applied the same technique to our allocator, and the average single-thread latency for our allocator was lower than those for libc malloc and Hoard. However, in order to measure true contention-free latency under multithreading, in our experiments the parent thread creates an additional thread at initialization time which does nothing and exits immediately, before starting time measurement.
Interestingly, when we conducted experiments with a lightweight test-and-set mutual exclusion lock on the POWER4 system, we found that the average contention-free latency for a pair of lock acquire and release is 165 ns. On the other hand, the average contention-free latency for a pair of malloc and free in Linux scalability using our allocator is 282 ns., i.e., less than twice that of a minimal critical section protected by a lightweight test-and-set lock. That is, on that architecture it is highly unlikely, if not impossible, for a lock-based allocator (without per-thread private heaps) to have lower latency than our lock-free allocator, even if it uses the fastest lightweight lock to protect malloc and free and does nothing in these critical sections.
4.2.2 Scalability and Avoiding False Sharing
Figure 8(a) shows speedup results relative to contention-free libc malloc for Linux scalability. Our allocator, Ptmalloc, and Hoard scale well, with varying slopes proportional to their contention-free latency. Libc malloc does not scale at all; its speedup drops to 0.4 on two processors and continues to decline with more processors. On 16 processors, the execution time of libc malloc is 331 times as much as that of our allocator.
The results for Threadtest (Figure 8(b)) show that our allocator and Hoard scale in proportion to their contention-free latencies. Ptmalloc scales, but at a lower rate under high contention, as it becomes more likely that threads take over the arenas of other threads when their own arenas have no free blocks available, which increases the chances of contention and false sharing.
Figures 8(c-d) show the results for Active-false and Passive-false. The latency of malloc itself plays little role in these results. The results reflect only the effect of the allocation policy on inducing or avoiding false sharing. Our allocator and Hoard are less likely to induce false sharing than Ptmalloc and libc malloc.
In Larson (Figure 8(e)), which is intended to simulate server workloads, our allocator and Hoard scale, while Ptmalloc does not, probably due to frequent switching of threads between arenas, and consequently more frequent cases of freeing blocks to arenas locked by other threads. We also noticed, when running this benchmark, that Ptmalloc creates more arenas than the number of threads, e.g., 22 arenas for 16 threads, indicating frequent switching among arenas by threads. Even though freeing blocks to remote heaps in Hoard can degrade performance, this effect is eliminated after a short time. Initially threads free blocks that were allocated by another thread, but then in the steady state they free blocks that they have allocated from their own processor heaps.
4.2.3 Robustness under Producer-Consumer

For Producer-consumer we ran experiments with various values for work (the parameter for local work per task). Figures 8(f-h) show the results for work set to 500, 750, and 1000, respectively. The results for all the allocators are virtually identical under no contention, thus the latency of the allocator plays a negligible role in the results for this benchmark. The purpose of this benchmark is to show the robustness of the allocators under the producer-consumer sharing pattern when the benchmark is scalable. The case where the benchmark cannot scale even using a perfect allocator is not of interest. We focus on the knee of the curve, where the differences in robustness between allocators impact the scalability of the benchmark.
Our allocator scales perfectly with work set to 1000 and 750, and up to 13 processors with work set to 500. With more than 13 processors (and with work set to 500), we found that the producer could not keep up with the consumers (as the queue was always empty at the end of each experiment), which is not an interesting case, as the application would not be scalable in any case. Our allocator's scalability is limited only by the scalability of the application.
Ptmalloc scales to a lesser degree, but at the cost of higher external memory fragmentation, as the producer keeps creating and switching arenas due to contention with consumers, even though most arenas already have available blocks.
Hoard's scalability suffers due to high contention on the producer's heap, as 75% of all malloc and free operations are targeted at the same heap. Our allocator's performance does not suffer, although it faces exactly the same situation. The main reason is that in Hoard, even in the common case, free operations need to acquire either the processor heap's lock or the global heap's lock. In our allocator, typical free operations are very simple and operate only on the superblock descriptor associated with the freed block, thus allowing substantially more concurrency than Hoard. Other minor reasons for our allocator's ability to perform well even under contention on the same superblock are: (a) In our allocator, read-modify-write code segments are shorter in duration, compared with critical sections in Hoard. (b) Successful lock-free operations can overlap in time, while mutual exclusion locks by definition must strictly serialize critical sections.
4!"!4 -&timization for
8ni&rocessors
With uniprocessors in mind, we modified a version of our allocator such that threads use only one heap, and thus, when executing malloc, threads do not need to know their id. This optimization achieved a 15% increase in contention-free speedup on Linux-scalability on POWER3. When we used multiple threads on the same processor, performance remained unaffected, as our allocator is preemption-tolerant. In practice, the allocator can determine the number of processors in the system at initialization time by querying the system environment.
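For example, on POSIX systems the check can be a one-time sysconf query at initialization; a minimal sketch, assuming _SC_NPROCESSORS_ONLN is available (it is common but not universally mandated):

#include <unistd.h>
#include <stdbool.h>

static bool use_single_heap; /* set once, before any malloc traffic */

void allocator_init(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    /* On a uniprocessor, all threads can share one heap, so malloc
       never needs to look up the calling thread's id. */
    use_single_heap = (ncpus <= 1);
}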
5.2.6 Space Efficiency
We tracked the maximum space used by our allocator, Hoard, and Ptmalloc when running the benchmarks that allocate a large number of blocks: Threadtest, Larson, and Producer-consumer. The maximum space used by our allocator was consistently slightly less than that used by Hoard, as in our allocator each processor heap holds at most two superblocks, while in Hoard each processor heap holds a variable number of superblocks proportional to the number of allocated blocks. The maximum space allocated by Ptmalloc was consistently more than that allocated by Hoard and our allocator. The ratio of the maximum space allocated by Ptmalloc to that allocated by ours, on 16 processors, ranged from 1.16 in Threadtest to 3.83 in Larson.
[Figure 8: Speedup results on a 16-way 375 MHz POWER3. Each panel plots speedup over contention-free libc malloc against the number of processors (1-16) for our allocator ("new"), Hoard, Ptmalloc, and libc malloc. Panels: (a) Linux-scalability, (b) Threadtest, (c) Active false sharing, (d) Passive false sharing, (e) Larson, (f) Producer-consumer with work = 500, (g) Producer-consumer with work = 750, (h) Producer-consumer with work = 1000.]
6. SUMMARY
In this paper we presented a completely lock-free dynamic memory allocator. Being completely lock-free, our allocator is immune to deadlock regardless of scheduling policies, even when threads may be killed arbitrarily. Therefore, it can offer async-signal safety, tolerance to priority inversion, kill-tolerance, and preemption-tolerance, without requiring any special kernel support or incurring performance overhead. Our allocator is portable across software and hardware platforms, as it requires only widely-available OS support and hardware atomic primitives. It is general-purpose and does not impose any unreasonable restrictions on the use or initialization of the address space. It is space efficient and limits space blowup [3] to a constant factor.
Our experimental results compared our allocator with the default AIX 5.1 libc malloc and two of the best multithreaded allocators, Hoard [3] and Ptmalloc [6]. Our allocator outperformed the other allocators in all cases, often by significant margins, under various levels of parallelism and allocation patterns. It showed near-perfect scalability under various allocation and sharing patterns. Under maximum contention on 16 processors, it achieved a speedup of 331 over libc malloc.
Equally significant, our allocator offers substantially lower latency than the other allocators. Under no contention, it achieved speedups of 2.75, 1.99, and 1.43 over libc malloc and highly-optimized versions of Hoard and Ptmalloc, respectively. Scalable allocators are often criticized for achieving their scalability at the cost of higher latency in the more common case of no contention. Our allocator achieves both scalability and low latency, in addition to many other performance and qualitative advantages.
Furthermore, this work, in combination with recent lock-free methods for safe memory reclamation [17, 19] and ABA prevention [18] that use only single-word CAS, allows lock-free algorithms, including efficient algorithms for important object types such as LIFO stacks [8], FIFO queues [20], and linked lists and hash tables [16, 21], to be both completely dynamic and completely lock-free, including in 64-bit applications and on systems without support for automatic garbage collection, all efficiently, without requiring special OS support, and using only widely-available 64-bit atomic instructions.
Acknowledgments
The author thanks Emery Berger, Michael Scott, Yefim Shuf, and the anonymous referees for valuable comments on the paper.
7. REFERENCES
[1] Sarita V. Adve and Kourosh Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66-76, 1996.
[2] Emery D. Berger. Memory Management for High-Performance Applications. PhD thesis, University of Texas at Austin, August 2002.
[3] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 117-128, November 2000.
[4] Bruce M. Bigler, Stephen T. Allan, and Rodney R. Oldehoeft. Parallel dynamic storage allocation. In Proceedings of the 1985 International Conference on Parallel Processing, pages 272-275, August 1985.
[5] Dave Dice and Alex Garthwaite. Mostly lock-free malloc. In Proceedings of the 2002 International Symposium on Memory Management, pages 269-280, June 2002.
[6] Wolfram Gloger. Dynamic Memory Allocator Implementations in Linux System Libraries. http://www.dent.med.uni-muenchen.de/~wmglo/.
[7] Maurice P. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124-149, January 1991.
[8] IBM. IBM System/370 Extended Architecture, Principles of Operation, 1983. Publication No. SA22-7085.
[9] IEEE. IEEE Std 1003.1, 2003 Edition, 2003.
[10] Arun K. Iyengar. Dynamic Storage Allocation on a Multiprocessor. PhD thesis, MIT, 1992.
[11] Arun K. Iyengar. Parallel dynamic storage allocation algorithms. In Proceedings of the Fifth IEEE Symposium on Parallel and Distributed Processing, pages 82-91, December 1993.
[12] Leslie Lamport. Concurrent reading and writing. Communications of the ACM, 20(11):806-811, November 1977.
[13] Per-Åke Larson and Murali Krishnan. Memory allocation for long-running server applications. In Proceedings of the 1998 International Symposium on Memory Management, pages 176-185, October 1998.
[14] Doug Lea. A Memory Allocator. http://gee.cs.oswego.edu/dl/html/malloc.html.
[15] Chuck Lever and David Boreham. malloc() performance in a multithreaded Linux environment. In Proceedings of the FREENIX Track of the 2000 USENIX Annual Technical Conference, June 2000.
[16] Maged M. Michael. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 73-82, August 2002.
[17] Maged M. Michael. Safe memory reclamation for dynamic lock-free objects using atomic reads and writes. In Proceedings of the Twenty-First Annual ACM Symposium on Principles of Distributed Computing, pages 21-30, July 2002.
[18] Maged M. Michael. ABA prevention using single-word instructions. Technical Report RC 23089, IBM T. J. Watson Research Center, January 2004.
[19] Maged M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Transactions on Parallel and Distributed Systems, 2004. To appear. See www.research.ibm.com/people/m/michael/pubs.htm.
[20] Maged M. Michael and Michael L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing, pages 267-275, May 1996.
[21] Ori Shalev and Nir Shavit. Split-ordered lists: Lock-free extensible hash tables. In Proceedings of the Twenty-Second Annual ACM Symposium on Principles of Distributed Computing, pages 102-111, July 2003.
[22] Josep Torrellas, Monica S. Lam, and John L. Hennessy. False sharing and spatial locality in multiprocessor caches. IEEE Transactions on Computers, 43(6):651-663, June 1994.
[23] Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles. Dynamic storage allocation: A survey and critical review. In Proceedings of the 1995 International Workshop on Memory Management, pages 1-116, September 1995.