[Figure 5 shows four superblock configurations (a)-(d), each depicting the Anchor subfields (avail, count, state, tag), the superblock pointer sb, and the superblock's blocks, marked allocated, free, or dontcare.]

Figure 5: An example of a typical malloc and free from an active superblock. In configuration (a), the active superblock contains 5 available blocks organized in a linked list (5, 3, 6, 7, 1), four of which are available for reservation as indicated by Active.credits=3. In the first step of malloc, a block is reserved by decrementing Active.credits, resulting in configuration (b). In the second step of malloc, block 5 is popped, resulting in configuration (c). Free pushes the freed block (block 5), resulting in configuration (d).
it should not, as the new head of the free list would become block B, which is actually not free. Without the tag subfield, X is unable to detect that the value of Anchor.avail changed from A to B and finally back to A again (hence the name ABA).

To prevent this problem for the Anchor field, we use the classic IBM tag mechanism [8]. We increment the tag subfield (line 12) on every pop and validate it atomically with the other subfields of Anchor. Therefore, in the above-mentioned scenario, when the tag is used, the CAS fails, as it should, and X starts over from line 8. The tag must have enough bits to make full wraparound practically impossible in a short time. For an absolute solution of the ABA problem, an efficient implementation of ideal LL/SC (which inherently prevents the ABA problem) using pointer-sized CAS can be used [18, 19].
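As an illustration of the tag mechanism, the following is a minimal sketch of an ABA-guarded pop, assuming (hypothetically) that the avail, count, state, and tag subfields are packed into a single 64-bit word updated with one CAS; the field widths and names here are illustrative, not the allocator's exact layout.

    #include <stdint.h>
    #include <stdatomic.h>

    /* Hypothetical packing of the Anchor subfields; widths are illustrative. */
    typedef union {
        uint64_t whole;
        struct {
            uint64_t avail : 10; /* index of the first available block */
            uint64_t count : 10; /* number of unreserved available blocks */
            uint64_t state : 2;  /* ACTIVE, FULL, PARTIAL, or EMPTY */
            uint64_t tag   : 42; /* ABA-prevention tag */
        };
    } anchor_t;

    /* Second step of malloc: pop the head of the superblock's free list.
       Incrementing tag on every pop makes a delayed thread's CAS fail even
       if avail has meanwhile returned to its old value. */
    void* pop_block(_Atomic uint64_t* anchor, char* sb, unsigned sz) {
        anchor_t olda, newa;
        char* addr;
        do {
            olda.whole = atomic_load(anchor);
            addr = sb + olda.avail * sz;
            newa = olda;
            newa.avail = *(unsigned*)addr; /* next free index, stored in the block */
            newa.tag++;                    /* invalidate stale in-flight CASes */
        } while (!atomic_compare_exchange_weak(anchor, &olda.whole, newa.whole));
        return addr;
    }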
In lines 13–17, the thread checks if it has taken the last credit in Active.credits. If so, it checks if the superblock has more available blocks, either because maxcount is larger than MAXCREDITS or because blocks were freed. If more blocks are available, the thread reserves as many as possible (lines 16 and 17). Otherwise, it declares the superblock FULL (line 15). The reason for doing that is that FULL superblocks are not pointed to by any allocator structures, so the first thread to free a block back to a FULL superblock needs to know that, in order to take responsibility for linking it back to the allocator structures.
If the thread has taken credits, it tries to update Active by executing UpdateActive. There is no risk of more than one thread trying to take credits from the same superblock at the same time; only the thread that sets Active to NULL in line 6 can do that. Other concurrent threads find Active either with credits > 0 or not pointing to desc at all.
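For concreteness, the following is a minimal sketch of the credit-reservation step (the first step of malloc), assuming Active packs a descriptor pointer together with a credits subfield in the pointer's low bits; the bit count and names are illustrative assumptions, not the allocator's exact constants.

    #include <stdint.h>
    #include <stdatomic.h>

    #define CREDIT_BITS 6u                        /* assumes MAXCREDITS == 64 */
    #define CREDIT_MASK ((1ull << CREDIT_BITS) - 1)

    /* Reserve one block by decrementing credits. The thread that takes the
       last credit (credits == 0) sets Active to NULL and becomes responsible
       for reserving more credits or declaring the superblock FULL. */
    uint64_t reserve_block(_Atomic uint64_t* active) {
        uint64_t oldactive, newactive;
        do {
            oldactive = atomic_load(active);
            if (!oldactive)
                return 0;                         /* no active superblock */
            if ((oldactive & CREDIT_MASK) == 0)
                newactive = 0;                    /* taking the last credit */
            else
                newactive = oldactive - 1;        /* one fewer credit */
        } while (!atomic_compare_exchange_weak(active, &oldactive, newactive));
        return oldactive;                         /* descriptor | old credits */
    }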
Finally, the thread stores desc (i.e., the address of the descriptor of the superblock from which the block was allocated) into the prefix of the newly allocated block (line 21), so that when the block is subsequently freed, free can determine from which superblock it was originally allocated. Each block includes an 8-byte prefix (overhead).
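A small sketch of this prefix convention (the helper names are hypothetical):

    typedef struct descriptor descriptor;

    /* Store the owning superblock's descriptor just before the user data. */
    static void* install_prefix(void* block, descriptor* desc) {
        *(descriptor**)block = desc;                /* the store of line 21 */
        return (void*)((descriptor**)block + 1);    /* user pointer after prefix */
    }

    /* Recover the descriptor in free (cf. lines 2-3 of Figure 6). */
    static descriptor* prefix_of(void* ptr) {
        return *((descriptor**)ptr - 1);
    }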
Note that, after a thread finds Active.credits > 0 and after the success of the CAS in line 6, and before the thread proceeds to a successful CAS in line 18, it is possible that the "active" superblock might have become FULL if all available blocks were reserved, PARTIAL, or even the ACTIVE superblock of a different processor heap (but it must be of the same size class). However, it cannot be EMPTY. These possibilities do not matter to the original thread. After the success of the CAS in line 6, the thread is guaranteed a block from this specific superblock, and all it need do is pop a block from the superblock and leave the superblock's Anchor.state unchanged. Figure 5 shows a typical malloc and free from an active superblock.
Updating Active Credits

Typically, when the routine UpdateActive in Figure 4 is called, it ends with the success of the CAS operation in line 3 that reinstalls desc->sb as the active superblock for heap, with one or more credits. However, it is possible that after the current thread had set heap->Active to NULL (line 6 of MallocFromActive), some other thread installed a new superblock. If so, the current thread must return the credits, indicate that the superblock is PARTIAL, and make the superblock available for future use in line 8, by calling HeapPutPartial (described below).
Malloc from Partial Superblock

The thread calls MallocFromPartial in Figure 4 if it finds Active = NULL. The thread tries to get a PARTIAL superblock by calling HeapGetPartial. If it succeeds, it tries to reserve as many blocks as it can, including one for itself, from the superblock's descriptor. Upon the success of the CAS in line 10, the thread is guaranteed to have reserved one or more blocks. It then proceeds in lines 11–15 to pop its reserved block, and if it has reserved more, it deposits the additional credits in Active by calling UpdateActive.

In HeapGetPartial, the thread first tries to pop a superblock from the Partial slot associated with the thread's processor heap. If Partial = NULL, then the thread checks the Partial list associated with the size class, as described in Section 3.2.6.
Malloc from New Superblock

If the thread does not find any PARTIAL superblocks, it calls MallocFromNewSB in Figure 4. The thread allocates a descriptor by calling DescAlloc (line 1), allocates a new superblock, and sets its fields appropriately (lines 2–11). Finally, it tries to install it as the active superblock in Active, using CAS in line 13. If the CAS fails, the thread deallocates the superblock and retires the descriptor (alternatively, the thread could take the block, return the credits to the superblock, and install the superblock as PARTIAL). The failure of the CAS in line 13 implies that heap->Active is no longer NULL, and therefore a new active superblock must have been installed by another thread. In order to avoid having too many PARTIAL superblocks, and hence causing unnecessary external fragmentation, we prefer to deallocate the superblock rather than take a block from it and keep it as PARTIAL.
On systems with memory consistency models [1] weaker than sequential consistency, where the processors might execute and observe memory accesses out of order, fence instructions are needed to enforce the ordering of memory accesses. The memory fence instruction in line 12 serves to ensure that the new values of the descriptor fields are observed by other processors before the CAS in line 13 can be observed. Otherwise, if the CAS succeeds, threads running on other processors may read stale values from the descriptor.

Due to the variety in memory consistency models and fence instructions among architectures, it is customary for concurrent algorithms presented in the literature to ignore them. In this paper, we opt to include fence instructions in the code, but for clarity we assume a typical PowerPC-like architecture. However, different architectures, including future ones, may use different consistency models.
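In portable C11 terms, the fence of line 12 corresponds to a release fence (or a release CAS) when publishing the newly initialized descriptor. A minimal sketch under that assumption follows; the real Active word also packs credits, which this simplification omits.

    #include <stdatomic.h>

    struct descriptor;
    extern _Atomic(struct descriptor*) Active;   /* simplified: no credits */

    int install_new_sb(struct descriptor* desc) {
        /* ... desc's fields are initialized here (cf. lines 2-11) ... */
        atomic_thread_fence(memory_order_release);  /* memory fence, line 12 */
        struct descriptor* expected = NULL;
        /* The CAS of line 13; the release fence above already orders the
           descriptor initialization before a successful install. */
        return atomic_compare_exchange_strong_explicit(
            &Active, &expected, desc,
            memory_order_relaxed, memory_order_relaxed);
    }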
3.2.4 Free

Figure 6 shows the free algorithm. Large blocks are returned directly to the OS. The free algorithm for small blocks is simple: it primarily involves pushing the freed block onto its superblock's available list and adjusting the superblock's state appropriately.
The instruction fence in line 14 is needed to ensure that the read in line 13 is executed before the success of the CAS in line 18. The memory fence in line 17 is needed to ensure that the write in line 8 is observed by other processors no later than the CAS in line 18 is observed.
If a thread is the first to return a block to a FULL superblock, then it takes responsibility for making the superblock PARTIAL, by calling HeapPutPartial, where it atomically swaps the superblock with the prior value in the Partial slot of the heap that last owned the superblock. If the previous value of heap->Partial is not NULL, i.e., it held a partial superblock, then the thread puts that superblock in the partial list of the size class, as described in Section 3.2.6.
If a thread frees the last allocated block in a superblock, then it takes responsibility for indicating that the superblock is EMPTY and frees it. The thread then tries to retire the associated descriptor. If the descriptor is in the Partial slot of a processor heap, a simple CAS suffices to remove it. Otherwise, the descriptor may be in the Partial list of the size class (possibly in the middle). We discuss this case in Section 3.2.6.
free(ptr) {
1    if (!ptr) return;
2    ((void**)ptr)--; // get prefix
3    desc = *(descriptor**)ptr;
4    if (large block bit set(desc)) // large block: desc holds sz+1
5      { Return block to OS. return; }
6    sb = desc->sb;
     do {
7      newanchor = oldanchor = desc->Anchor;
8      *(unsigned*)ptr = oldanchor.avail;
9      newanchor.avail = (ptr-sb)/desc->sz;
10     if (oldanchor.state == FULL)
11       newanchor.state = PARTIAL;
12     if (oldanchor.count == desc->maxcount-1) {
13       heap = desc->heap;
14       instruction fence.
15       newanchor.state = EMPTY;
       } else
16       newanchor.count++;
17     memory fence.
18   } until CAS(&desc->Anchor,oldanchor,newanchor);
19   if (newanchor.state == EMPTY) {
20     Free the superblock sb.
21     RemoveEmptyDesc(heap,desc);
22   } elseif (oldanchor.state == FULL)
23     HeapPutPartial(desc);
}

HeapPutPartial(desc) {
1    do { prev = desc->heap->Partial;
2    } until CAS(&desc->heap->Partial,prev,desc);
3    if (prev) ListPutPartial(prev);
}

RemoveEmptyDesc(heap,desc) {
1    if CAS(&heap->Partial,desc,NULL)
2      DescRetire(desc);
3    else ListRemoveEmptyDesc(heap->sc);
}

Figure 6: Free.
3.2.5 Descriptor List

Figure 7 shows the DescAlloc and DescRetire routines. In DescAlloc, the thread first tries to pop a descriptor from the list of available descriptors (lines 3–4). If none is found, the thread allocates a superblock of descriptors, takes one descriptor, and tries to install the rest in the global available descriptor list. In order to avoid unnecessarily allocating too many descriptors, if the thread finds that some other thread has already made some descriptors available (i.e., the CAS in line 8 fails), then it returns the superblock to the OS and starts over in line 1, with the hope of finding an available descriptor. DescRetire is a straightforward lock-free push that follows the classic freelist push algorithm [8].
As mentioned above for the pop operation in the MallocFromActive routine, care must be taken that the CAS does not succeed where it should not, due to the ABA problem. We indicate this in line 4 by using the term SafeCAS (i.e., ABA-safe). We use the hazard pointer methodology [17, 19], which uses only pointer-sized instructions, in order to prevent the ABA problem for this structure.
descriptor* DescAvail; // initially NULL

descriptor* DescAlloc() {
     while (1) {
1      desc = DescAvail;
2      if (desc) {
3        next = desc->Next;
4        if SafeCAS(&DescAvail,desc,next) break;
       } else {
5        desc = AllocNewSB(DESCSBSIZE);
6        Organize descriptors in a linked list.
7        memory fence.
8        if CAS(&DescAvail,NULL,desc->Next) break;
9        Free the superblock desc.
       }
     }
10   return desc;
}

DescRetire(desc) {
     do {
1      oldhead = DescAvail;
2      desc->Next = oldhead;
3      memory fence.
4    } until CAS(&DescAvail,oldhead,desc);
}

Figure 7: Descriptor allocation.
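The SafeCAS in line 4 of DescAlloc hides the hazard-pointer protocol. A minimal sketch of what it expands to follows; hp is this thread's hazard pointer slot, the matching reuse path (deferring reuse of a retired descriptor until no hazard pointer refers to it) is omitted, and the names and signatures are illustrative.

    #include <stdatomic.h>

    typedef struct descriptor { struct descriptor* Next; /* ... */ } descriptor;
    extern _Atomic(descriptor*) DescAvail;
    extern _Atomic(descriptor*)* hp;      /* this thread's hazard pointer slot */

    descriptor* desc_alloc_pop(void) {
        descriptor *desc, *next;
        for (;;) {
            desc = atomic_load(&DescAvail);
            if (!desc)
                return NULL;              /* caller allocates a new batch */
            atomic_store(hp, desc);       /* announce desc as in use */
            if (atomic_load(&DescAvail) != desc)
                continue;                 /* raced with another pop; revalidate */
            next = desc->Next;            /* safe: desc cannot be reused now */
            if (atomic_compare_exchange_strong(&DescAvail, &desc, next))
                break;                    /* the SafeCAS of line 4 */
        }
        atomic_store(hp, NULL);
        return desc;
    }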
In the current implementation, superblock descriptors are not reused as regular blocks and cannot be returned to the OS. This is acceptable, as descriptors constitute on average less than 1% of allocated memory. However, if desired, space for descriptors can be reused arbitrarily or returned to the OS, by organizing descriptors in a manner similar to regular blocks and maintaining special descriptors for superblocks of descriptors, with virtually no effect on average performance, whether contention-free or under high contention. This can be applied on as many levels as desired, such that at most 1% of 1%, and so on, of allocated space is restricted from being reused arbitrarily or returned to the OS.
Similarly, in order to reduce the frequency of calls to mmap and munmap, we allocate superblocks (e.g., 16 KB) in batches, as hyperblocks (superblocks of superblocks, e.g., 1 MB), and maintain descriptors for such hyperblocks, allowing them eventually to be returned to the OS. We organize the descriptor Anchor field in a slightly different manner, such that superblocks are not written until they are actually used, thus saving disk swap space for unused superblocks.
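A sketch of this batching follows, with illustrative sizes and without the allocator's lock-free bookkeeping (a real implementation would track the hyperblock through descriptors rather than a private, unsynchronized cursor):

    #include <sys/mman.h>
    #include <stddef.h>

    #define SB_SIZE    (16 * 1024)       /* example superblock size */
    #define HYPER_SIZE (1024 * 1024)     /* example hyperblock size */

    static char*  hyper_base;            /* current hyperblock */
    static size_t hyper_used;            /* bytes carved out so far */

    /* Carve the next superblock out of the current hyperblock, mapping a new
       hyperblock when needed. Untouched pages of an anonymous mapping consume
       no physical memory or swap until they are first written. */
    void* alloc_superblock(void) {
        if (!hyper_base || hyper_used == HYPER_SIZE) {
            void* p = mmap(NULL, HYPER_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                return NULL;
            hyper_base = p;
            hyper_used = 0;
        }
        void* sb = hyper_base + hyper_used;
        hyper_used += SB_SIZE;
        return sb;
    }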
3.2.6 Lists of Partial Superblocks

For managing the list of partial superblocks associated with each size class, we need to provide three functions: ListGetPartial, ListPutPartial, and ListRemoveEmptyDesc. The goal of the latter is to ensure that empty descriptors are eventually made available for reuse, not necessarily to remove a specific empty descriptor immediately.

In one possible implementation, the list is managed in a LIFO manner, with the possibility of removing descriptors from the middle of the list. The simpler version in [19] of the lock-free linked list algorithm in [16] can be used to manage such a list. ListPutPartial inserts desc at the head of the list. ListGetPartial pops a descriptor from the head of the list. ListRemoveEmptyDesc traverses the list until it removes some empty descriptor or reaches the end of the list.
Another implementation, which we prefer, manages the list in a FIFO manner and thus reduces the chances of contention and false sharing. ListPutPartial enqueues desc at the tail of the list. ListGetPartial dequeues a descriptor from the head of the list. ListRemoveEmptyDesc keeps dequeuing descriptors from the head of the list until it dequeues a non-empty descriptor or reaches the end of the list. If the function dequeues a non-empty descriptor, then it re-enqueues that descriptor at the tail of the list. By removing any one empty descriptor, or moving two non-empty descriptors from the head of the list to its end, we are guaranteed that no more than half the descriptors in the list are left empty. We use a version of the lock-free FIFO queue algorithm in [20], with memory management optimized for the purposes of the new allocator.
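A sketch of this cleanup policy follows, with hypothetical signatures for the underlying lock-free FIFO queue and the emptiness test:

    typedef struct descriptor descriptor;
    typedef struct queue queue;                 /* lock-free FIFO of [20] */

    extern descriptor* dequeue(queue* q);       /* NULL when the list is empty */
    extern void        enqueue(queue* q, descriptor* d);
    extern int         sb_is_empty(descriptor* d);
    extern void        desc_retire(descriptor* d);

    /* Dequeue from the head until a non-empty descriptor is re-enqueued at
       the tail or the end of the list is reached; empty descriptors dequeued
       along the way are retired for reuse. */
    void list_remove_empty_desc(queue* q) {
        descriptor* d;
        while ((d = dequeue(q)) != NULL) {
            if (!sb_is_empty(d)) {
                enqueue(q, d);                  /* move non-empty to the tail */
                return;
            }
            desc_retire(d);                     /* removed from the list */
        }
    }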
For preventing the ABA problem for pointer-sized variables in the above-mentioned list implementations, we cannot use IBM ABA-prevention tags (such as in Anchor.tag); instead we use ideal LL/SC constructions based on pointer-sized CAS [18]. Note that these constructions as described in [18] use memory allocation; however, a general-purpose malloc is not needed. In our implementation we allocate such blocks in a manner similar to, but simpler than, allocating descriptors.

Note that in our allocator, unlike Hoard [3], we do not maintain fullness classes or keep statistics about the fullness of processor heaps, and we are quicker to move partial superblocks to the partial list of the size class. This simplicity allows lower latency and lower fragmentation. One concern may be that this makes it more likely for blocks to be freed to a superblock in the size class partial lists. However, this is not a disadvantage at all, unlike in Hoard [3] and also [5], where this can cause contention on the global heap's lock. In our allocator, freeing a block into such a superblock does not cause any contention with operations on other superblocks, and in general is no more complex or less efficient than freeing a block into a superblock that is in the thread's own processor heap.
Another possible concern is that, by moving partial superblocks out of the processor heap too quickly, contention and false sharing may arise. This is why we use a most-recently-used Partial slot (multiple slots can be used if desired) in the processor heap structure, and use a FIFO structure for the size class partial lists.
4. EXPERIMENTAL RESULTS

In this section, we describe our experimental performance results on two PowerPC multiprocessor systems. The first system has sixteen 375 MHz POWER3-II processors, with 24 GB of memory and 4 MB second-level caches. The second system has eight 1.2 GHz POWER4+ processors, with 16 GB of memory and 32 MB third-level caches. Both systems run AIX 5.1. We ran experiments on both systems. The results on the POWER3 system (with more processors) provided more insights into the allocators' scalability and ability to avoid false sharing and contention. The results on the POWER4 system provided insights into the contention-free latency of the allocators and contention-free synchronization costs on recent processor architectures.
We compare our allocator with the default AIX libc malloc, Hoard [3] version 3.0.2 (December 2003), and Ptmalloc2 (Nov. 2002) [6]. (Our observations on the default libc malloc are based on external experimentation only and are not based on any knowledge of its internal design.)

                       375 MHz POWER3           1.2 GHz POWER4+
                       New   Hoard  Ptmalloc    New   Hoard  Ptmalloc
    Linux scalability  2.2   1.1    1.8         2.7   1.3    1.9
    Threadtest         2.1   1.2    1.9         2.3   1.2    1.9
    Larson             2.9   2.2    2.5         2.9   2.3    2.6

Table 1: Contention-free speedup over libc malloc.

All allocators and benchmarks were compiled using gcc and g++ with the highest optimization level (O6) in 64-bit mode. We used pthreads for multithreading. All allocators were dynamically linked as shared libraries. For meaningful comparison, we tried to use optimal versions of Hoard and Ptmalloc as best we could. We modified the PowerPC lightweight locking routines in Hoard by removing a sync instruction from the beginning of the lock acquisition path, replacing the sync at the end of lock acquisition with isync, and adding eieio before lock release. These changes reduced the average contention-free latency of a pair of malloc and free using Hoard from 1.76 µs to 1.51 µs on POWER3, and from 885 ns to 560 ns on POWER4. The default distribution of Ptmalloc2 uses pthread mutex for locking. We replaced calls to pthread mutex by calls to a lightweight mutex that we coded using inline assembly. This reduced the average contention-free latency of a pair of malloc and free using Ptmalloc by more than 50%, from 1.93 µs to 923 ns on POWER3, and from 812 ns to 404 ns on POWER4. In addition, Ptmalloc showed substantially better scalability using the lightweight locks than it did using pthread mutex locks.
4.1 Benchmarks

Due to the lack of standard benchmarks for multithreaded dynamic memory allocation, we use microbenchmarks that focus on specific performance characteristics. We use six benchmarks: benchmark 1 of Linux scalability [15], Threadtest, Active-false, and Passive-false from Hoard [3], Larson [13], and a lock-free producer-consumer benchmark that we describe below.

In Linux scalability, each thread performs 10 million malloc/free pairs of 8-byte blocks in a tight loop. In Threadtest, each thread performs 100 iterations of allocating 100,000 8-byte blocks and then freeing them in order. These two benchmarks capture allocator latency and scalability under regular private allocation patterns.
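For illustration, the per-thread loop of the first benchmark amounts to the following (a sketch; the actual benchmark harness differs):

    #include <stdlib.h>

    /* Benchmark 1 of Linux scalability: 10 million malloc/free pairs of
       8-byte blocks per thread, in a tight loop. */
    void* linux_scalability_thread(void* arg) {
        (void)arg;
        for (long i = 0; i < 10000000L; i++) {
            void* p = malloc(8);
            if (!p)
                abort();
            free(p);
        }
        return NULL;
    }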
In Active-false, each thread performs 10,000 malloc/free pairs (of 8-byte blocks), and each time it writes 1,000 times to each byte of the allocated block. Passive-false is similar to Active-false, except that initially one thread allocates blocks and hands them to the other threads, which free them immediately and then proceed as in Active-false. These two benchmarks capture the allocator's ability to avoid causing false sharing [22], whether actively or passively.
In Larson, initially one thread allocates and frees random-sized blocks (16 to 80 bytes) in random order; then an equal number of blocks (1024) is handed over to each of the remaining threads. In the parallel phase, which lasts 30 seconds, each thread randomly selects a block and frees it, then allocates a new random-sized block in its place. The benchmark measures how many free/malloc pairs are performed during the parallel phase. Larson captures the robustness of malloc's latency and scalability under irregular allocation patterns with respect to block size and order of deallocation over a long period of time.

In the lock-free Producer-consumer benchmark, we measure the number of tasks performed by t threads in 30 seconds. Initially, a database of 1 million items is initialized randomly. One thread is the producer and the others, if any, are consumers. For each task, the producer selects a random-sized (10 to 20) random set of array indexes, allocates a block of matching size (40 to 80 bytes) to record the array indexes, then allocates a fixed-size task structure (32 bytes) and a fixed-size queue node (16 bytes), and enqueues the task in a lock-free FIFO queue [19, 20]. A consumer thread repeatedly dequeues a task, creates histograms from the database for the indexes in the task, and then spends time proportional to a parameter work performing local work similar to the work in Hoard's Threadtest benchmark. When the number of tasks in the queue exceeds 1000, the producer helps the consumers by dequeuing a task from the queue and processing it. Each task involves 3 malloc operations on the part of the producer, and one malloc and 4 free operations on the part of the consumer. The consumer spends substantially more time on each task than the producer. Producer-consumer captures malloc's robustness under the producer-consumer sharing pattern, where threads free blocks allocated by other threads.
4.2 Results

4.2.1 Latency

Table 1 presents contention-free speedups over libc malloc for the new allocator, Hoard, and Ptmalloc, for the benchmarks that are affected by malloc latency: Linux scalability, Threadtest, and Larson. Malloc's latency had little or no effect on the performance of Active-false, Passive-false, and Producer-consumer.

It appears that libc malloc, as well as Hoard, uses a technique where the parent thread bypasses synchronization if it knows that it has not spawned any threads yet. We applied the same technique to our allocator, and the average single-thread latency for our allocator was lower than those for libc malloc and Hoard. However, in order to measure true contention-free latency under multithreading, in our experiments the parent thread creates an additional thread at initialization time, which does nothing and exits immediately, before starting time measurement.
The new allocator achieves significantly lower contention-free latency than the other allocators, under both regular and irregular allocation patterns. The reason is that it has a faster execution path in the common case. Also, unlike lock-based allocators, it operates only on the actual allocator variables, without the need to operate on additional lock-related variables and to synchronize these accesses with the accesses to the allocator variables through fence instructions.
The new allocator requires only one memory fence instruction (line 17 of free) in the common case for each pair of malloc and free, while every lock acquisition and release requires an instruction fence before the critical section, to prevent reads inside the critical section from reading stale data before lock acquisition, and a memory fence after the end of the critical section, to ensure that the lock is not observed to be free before the writes inside the critical section are also observed by other processors. In the common case, a pair of malloc and free using Ptmalloc and Hoard needs to acquire and release two and three locks, respectively.
Interestingly, when we conducted experiments with a lightweight test-and-set mutual exclusion lock on the POWER4 system, we found that the average contention-free latency for a pair of lock acquire and release is 165 ns. On the other hand, the average contention-free latency for a pair of malloc and free in Linux scalability using our allocator is 282 ns, i.e., less than twice that of a minimal critical section protected by a lightweight test-and-set lock. That is, on that architecture, it is highly unlikely, if not impossible, for a lock-based allocator (without per-thread private heaps) to have lower latency than our lock-free allocator, even if it uses the fastest lightweight lock to protect malloc and free and does nothing in these critical sections.
4.2.2 Scalability and Avoiding False Sharing

Figure 8(a) shows speedup results relative to contention-free libc malloc for Linux scalability. Our allocator, Ptmalloc, and Hoard scale well, with slopes proportional to their contention-free latency. Libc malloc does not scale at all; its speedup drops to 0.4 on two processors and continues to decline with more processors. On 16 processors the execution time of libc malloc is 331 times that of our allocator.

The results for Threadtest (Figure 8(b)) show that our allocator and Hoard scale in proportion to their contention-free latencies. Ptmalloc scales, but at a lower rate under high contention, as it becomes more likely that threads take over the arenas of other threads when their own arenas have no free blocks available, which increases the chances of contention and false sharing.
Figures 8(c–d) show the results for Active-false and Passive-false. The latency of malloc itself plays little role in these results; the results reflect only the effect of the allocation policy on inducing or avoiding false sharing. Our allocator and Hoard are less likely to induce false sharing than Ptmalloc and libc malloc.
In Larson (Figure 8(e)), which is intended to simulate server workloads, our allocator and Hoard scale, while Ptmalloc does not, probably due to frequent switching of threads between arenas, and consequently more frequent cases of freeing blocks to arenas locked by other threads. We also noticed, when running this benchmark, that Ptmalloc creates more arenas than the number of threads (e.g., 22 arenas for 16 threads), indicating frequent switching among arenas by threads. Even though freeing blocks to remote heaps in Hoard can degrade performance, this effect is eliminated after a short time. Initially, threads free blocks that were allocated by another thread, but then in the steady state they free blocks that they have allocated from their own processor heaps.
4.2.3 Robustness under Producer-Consumer

For Producer-consumer, we ran experiments with various values of work (the parameter for local work per task). Figures 8(f–h) show the results for work set to 500, 750, and 1000, respectively. The results for all the allocators are virtually identical under no contention; thus the latency of the allocator plays a negligible role in the results for this benchmark. The purpose of this benchmark is to show the robustness of the allocators under the producer-consumer sharing pattern when the benchmark is scalable. The case where the benchmark cannot scale even using a perfect allocator is not of interest. We focus on the knee of the curve, where the differences in robustness between allocators impact the scalability of the benchmark.
Our allocator scales perfectly with work set to 1000 and 750, and up to 13 processors with work set to 500. With more than 13 processors (and with work set to 500), we found that the producer could not keep up with the consumers (as the queue was always empty at the end of each experiment), which is not an interesting case, as the application would not be scalable in any case. Our allocator's scalability is limited only by the scalability of the application.
Ptmalloc scales to a lesser degree, but at the cost of higher external memory fragmentation, as the producer keeps creating and switching arenas due to contention with consumers, even though most arenas already have available blocks.
Hoard's scalability suffers due to high contention on the producer's heap, as 75% of all malloc and free operations are targeted at the same heap. Our allocator's performance does not suffer, although it faces exactly the same situation. The main reason is that in Hoard, even in the common case, free operations need to acquire either the processor heap's lock or the global heap's lock. In our allocator, typical free operations are very simple and operate only on the superblock descriptor associated with the freed block, thus allowing substantially more concurrency than Hoard. Other, minor reasons for our allocator's ability to perform well even under contention on the same superblock are: (a) in our allocator, read-modify-write code segments are shorter in duration than Hoard's critical sections; (b) successful lock-free operations can overlap in time, while mutual exclusion locks by definition must strictly serialize critical sections.
4.2.4 Optimization for Uniprocessors

With uniprocessors in mind, we modified a version of our allocator such that threads use only one heap, and thus, when executing malloc, threads do not need to know their id. This optimization achieved a 15% increase in contention-free speedup on Linux scalability on POWER3. When we used multiple threads on the same processor, performance remained unaffected, as our allocator is preemption-tolerant. In practice, the allocator can determine the number of processors in the system at initialization time by querying the system environment.
4.2.5 Space Efficiency

We tracked the maximum space used by our allocator, Hoard, and Ptmalloc when running the benchmarks that allocate a large number of blocks: Threadtest, Larson, and Producer-consumer. The maximum space used by our allocator was consistently slightly less than that used by Hoard, as in our allocator each processor heap holds at most two superblocks, while in Hoard each processor heap holds a variable number of superblocks proportional to allocated blocks. The maximum space allocated by Ptmalloc was consistently more than that allocated by Hoard and our allocator. The ratio of the maximum space allocated by Ptmalloc to the maximum space allocated by ours, on 16 processors, ranged from 1.16 in Threadtest to 3.83 in Larson.
[Figure 8 plots speedup over contention-free libc malloc versus the number of processors (1 to 16) for the new allocator, Hoard, Ptmalloc, and libc. Panels: (a) Linux scalability, (b) Threadtest, (c) Active false sharing, (d) Passive false sharing, (e) Larson, (f) Producer-consumer with work = 500, (g) Producer-consumer with work = 750, (h) Producer-consumer with work = 1000.]

Figure 8: Speedup results on the 16-way 375 MHz POWER3.
5. SUMMARY

In this paper we presented a completely lock-free dynamic memory allocator. Being completely lock-free, our allocator is immune to deadlock regardless of scheduling policies, even when threads may be killed arbitrarily. Therefore, it can offer async-signal safety, tolerance to priority inversion, kill-tolerance, and preemption-tolerance, without requiring any special kernel support or incurring performance overhead. Our allocator is portable across software and hardware platforms, as it requires only widely-available OS support and hardware atomic primitives. It is general-purpose and does not impose any unreasonable restrictions regarding the use or initialization of the address space. It is space-efficient and limits space blowup [3] to a constant factor.

Our experimental results compared our allocator with the default AIX 5.1 libc malloc and two of the best multithreaded allocators, Hoard [3] and Ptmalloc [6]. Our allocator outperformed the other allocators in all cases, often by significant margins, under various levels of parallelism and allocation patterns. Our allocator showed near-perfect scalability under various allocation and sharing patterns. Under maximum contention on 16 processors, it achieved a speedup of 331 over libc malloc.

Equally significant, our allocator offers substantially lower latency than the other allocators. Under no contention, it achieved speedups of 2.75, 1.99, and 1.43 over libc malloc and highly-optimized versions of Hoard and Ptmalloc, respectively. Scalable allocators are often criticized for achieving their scalability at the cost of higher latency in the more common case of no contention. Our allocator achieves both scalability and low latency, in addition to many other performance and qualitative advantages.

Furthermore, this work, in combination with recent lock-free methods for safe memory reclamation [17, 19] and ABA prevention [18] that use only single-word CAS, allows lock-free algorithms, including efficient algorithms for important object types such as LIFO stacks [8], FIFO queues [20], and linked lists and hash tables [16, 21], to be both completely dynamic and completely lock-free, including in 64-bit applications and on systems without support for automatic garbage collection, all efficiently, without requiring special OS support, and using only widely-available 64-bit atomic instructions.
Acknowledgments

The author thanks Emery Berger, Michael Scott, Yefim Shuf, and the anonymous referees for valuable comments on the paper.
6. REFERENCES

[1] Sarita V. Adve and Kourosh Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66–76, 1996.

[2] Emery D. Berger. Memory Management for High-Performance Applications. PhD thesis, University of Texas at Austin, August 2002.

[3] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 117–128, November 2000.

[4] Bruce M. Bigler, Stephen J. Allan, and Rodney R. Oldehoeft. Parallel dynamic storage allocation. In Proceedings of the 1985 International Conference on Parallel Processing, pages 272–275, August 1985.

[5] Dave Dice and Alex Garthwaite. Mostly lock-free malloc. In Proceedings of the 2002 International Symposium on Memory Management, pages 269–280, June 2002.

[6] Wolfram Gloger. Dynamic Memory Allocator Implementations in Linux System Libraries. http://www.dent.med.uni-muenchen.de/~wmglo/.

[7] Maurice P. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124–149, January 1991.

[8] IBM. IBM System/370 Extended Architecture, Principles of Operation, 1983. Publication No. SA22-7085.

[9] IEEE. IEEE Std 1003.1, 2003 Edition, 2003.

[10] Arun K. Iyengar. Dynamic Storage Allocation on a Multiprocessor. PhD thesis, MIT, 1992.

[11] Arun K. Iyengar. Parallel dynamic storage allocation algorithms. In Proceedings of the Fifth IEEE Symposium on Parallel and Distributed Processing, pages 82–91, December 1993.

[12] Leslie Lamport. Concurrent reading and writing. Communications of the ACM, 20(11):806–811, November 1977.

[13] Per-Åke Larson and Murali Krishnan. Memory allocation for long-running server applications. In Proceedings of the 1998 International Symposium on Memory Management, pages 176–185, October 1998.

[14] Doug Lea. A Memory Allocator. http://gee.cs.oswego.edu/dl/html/malloc.html.

[15] Chuck Lever and David Boreham. malloc() performance in a multithreaded Linux environment. In Proceedings of the FREENIX Track of the 2000 USENIX Annual Technical Conference, June 2000.

[16] Maged M. Michael. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 73–82, August 2002.

[17] Maged M. Michael. Safe memory reclamation for dynamic lock-free objects using atomic reads and writes. In Proceedings of the Twenty-First Annual ACM Symposium on Principles of Distributed Computing, pages 21–30, July 2002.

[18] Maged M. Michael. ABA prevention using single-word instructions. Technical Report RC 23089, IBM T. J. Watson Research Center, January 2004.

[19] Maged M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Transactions on Parallel and Distributed Systems, 2004. To appear. See www.research.ibm.com/people/m/michael/pubs.htm.

[20] Maged M. Michael and Michael L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing, pages 267–275, May 1996.

[21] Ori Shalev and Nir Shavit. Split-ordered lists: Lock-free extensible hash tables. In Proceedings of the Twenty-Second Annual ACM Symposium on Principles of Distributed Computing, pages 102–111, July 2003.

[22] Josep Torrellas, Monica S. Lam, and John L. Hennessy. False sharing and spatial locality in multiprocessor caches. IEEE Transactions on Computers, 43(6):651–663, June 1994.

[23] Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles. Dynamic storage allocation: A survey and critical review. In Proceedings of the 1995 International Workshop on Memory Management, pages 1–116, September 1995.