Hot Topics in Operating Systems
TU Berlin, March 2006
Arwed Starke

Abstract: When designing an operating system kernel for a shared memory symmetric multiprocessor system, shared data has to be protected from concurrent access. Critical issues in this area are the increasing code complexity as well as the performance and scalability of an SMP kernel. An introduction to SMP-safe locking primitives and to how locking can be applied to SMP kernels will be given. We will focus on how to increase scalability by reducing lock contention, and on the growing negative impact of caches and memory barriers on locking performance. New, performance-aware approaches for mutual exclusion in SMP systems that made it into today's Linux 2.6 kernel will be presented: the SeqLock and the read-copy-update (RCU) mechanism.

1 Introduction

1.1 Introduction to SMP systems

As Moore's law is about to fail, since clock speeds can no longer be raised by a factor of two every year as they used to be in the "good old times", most of the gains in computing power are now achieved by increasing the number of processors or processing units working in parallel. The triumph of SMP systems is inevitable. The abbreviation SMP stands for tightly coupled, shared memory symmetric multiprocessor system. A set of equal CPUs accesses a common physical memory (and I/O ports) via a shared front side bus. Thus, the FSB becomes a contended resource. A bus master manages all read/write accesses to the bus. A read or write operation is guaranteed to complete atomically, which means before any other read or write operation is carried out on the bus. If two CPUs access the bus within the same clock cycle, the bus master nondeterministically (from the programmer's view) selects one of them to be first to access the bus. If a CPU accesses the bus while it is still occupied, the operation is delayed. This can be seen as a hardware measure of synchronisation.

1.2 Introduction to Locking

If more than one process can access data at the same time, as is the case in preemptive multitasking systems and SMP systems, mutual exclusion must be introduced to protect this shared data. We can divide mutual exclusion into three classes: short-term mutual exclusion, short-term mutual exclusion with interrupts, and long-term mutual exclusion [Sch94].
Let us take a look at the typical uniprocessor (UP) kernel solutions for these problem classes, and why they do not work for SMP systems.

Short-term mutual exclusion refers to preventing race conditions in short critical sections. They occur when two processes access the same data structure in memory "at the same time", thus causing inconsistent states of data. On UP systems, this could only occur if one process is preempted by the other. To protect critical sections, they are guarded with some sort of preempt_disable/preempt_enable call to disable preemption, so a process can finish the critical section without being interrupted by another process. In a non-preemptive kernel, no measures have to be taken at all. Unfortunately, this does not work for SMP systems, because processes do not have to be preempted to run "in parallel": there can be two processes executing the exact same line of code at the exact same time. No disabling of preemption will prevent that.

Short-term mutual exclusion with interrupts involves interrupt handlers that access shared data. To prevent interrupt handler code from interrupting a process in a critical section, it is sufficient to guard a critical section in process context with some sort of cli/sti (disable/enable all interrupts) call. Unfortunately, this approach does not work on SMP systems either, because all other CPUs' interrupts are still active and can execute the interrupt handler code at any time.

Long-term mutual exclusion refers to processes being held up accessing a shared resource for a longer time. For example, once a write system call to a regular file begins, it is guaranteed by the operating system that any other read or write system calls to the same file will be held until the current one completes. A write system call may require one or more disk I/O operations in order to complete. Disk I/O operations are relatively long operations when compared to the amount of work that the CPU can accomplish during that time. It would therefore be highly undesirable to inhibit preemption for such long operations, because the CPU would sit idle waiting for the I/O to complete. To avoid this, the process executing the write system call needs to allow itself to be preempted so other processes can run. As you probably already know, semaphores are used to solve this problem. This also holds true for SMP systems.

2 The basic SMP locking primitives

When we talk about mutual exclusion, we mean that we want changes to appear as if they were an atomic operation. If we can not update data with an atomic operation, we need to make the update uninterruptible and sequentialize it with all other processes that could access the data. But sometimes, we can.

2.1 Atomic Operations

Most SMP architectures possess some operations that read and change data within a single, uninterruptible step, called atomic operations. Common atomic operations are test and set (TSR), which returns the current value of a memory location and replaces it with a given new value; compare and swap (CAS), which compares the content of a memory location with a given value and, if they are equal, replaces it with a given new value; and the load link/store conditional instruction pair (LL/SC). Many SMP systems also feature atomic arithmetic operations: addition by a given value, subtraction by a given value, atomic increment and decrement, among others.
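To make these primitives tangible (this example is not from the paper): on a reasonably current GCC toolchain, the same operations can be tried out in user space through the compiler's __sync builtins. The names below are GCC's, not the kernel's.

#include <stdio.h>

int main(void)
{
    int lock_var = 0;
    int counter = 40;

    /* test and set: atomically store 1 and return the previous value */
    int was_locked = __sync_lock_test_and_set(&lock_var, 1);

    /* compare and swap: replace 40 with 41 only if the current value
       is still 40; the return value is what was found in memory */
    int found = __sync_val_compare_and_swap(&counter, 40, 41);

    /* atomic increment, corresponding to atomic_inc() in Listing 1 */
    __sync_fetch_and_add(&counter, 1);

    /* prints: was_locked=0 found=40 counter=42 */
    printf("was_locked=%d found=%d counter=%d\n",
           was_locked, found, counter);
    return 0;
}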
The table below is an example of how the line counter++ might appear in assembler code [Sch94]. If this line is executed at the same time by two CPUs, the result is wrong, because the operation is not atomic.

Time | CPU 1: Instruction | CPU 1: R0 | Value of counter | CPU 2: Instruction | CPU 2: R0
  1  | load R0, counter   | 0         | 0                | load R0, counter   | 0
  2  | add R0, 1          | 1         | 0                | add R0, 1          | 1
  3  | store R0, counter  | 1         | 1                | store R0, counter  | 1

To solve such problems without extra locking, one can use an atomic increment operation as shown in Listing 1. (In Linux, the atomic operations are defined in atomic.h. Operations not supported by the hardware are emulated with critical sections.) The shared data is still there, but the critical section could be eliminated.

atomic_t counter = ATOMIC_INIT(0);
atomic_inc(&counter);

Listing 1: Atomic increment in Linux

Atomic updates can be done on several common occasions, for example the replacement of a linked list element (Listing 2). Not even a special atomic operation is necessary to do that. Non-blocking synchronisation algorithms rely solely on atomic operations.

/* set up new element */
new->data = some_data;
new->next = old->next;
/* replace old element with it */
prev->next = new;

Listing 2: Atomic update of a singly linked list (replace "old" with "new")

2.2 Spin Locks

Spin locks are based on some atomic operation, for example test and set. The principle is simple: a flag variable indicates whether a process is currently in the critical section (lock_var == 1) or not (lock_var == 0). A process spins (busy waits) until the lock is reset, then sets the lock. Testing and setting of the lock status flag must be done in one step, with an atomic operation. To release the lock, a process resets the lock variable. Listing 3 shows a possible implementation of lock and unlock.

void lock(volatile int *lock_var_p)
{
    while (test_and_set_bit(0, lock_var_p) == 1);
}

void unlock(volatile int *lock_var_p)
{
    *lock_var_p = 0;
}

Listing 3: Spin lock [Sch94]

Note that a spin lock can not be acquired recursively: it would deadlock on the second call to lock. This has two consequences. A process holding a spin lock may not be preempted, or else a deadlock situation could occur. And spin locks can not be used within interrupt handlers, because if an interrupt handler tries to acquire a lock that is already held by the process it interrupted, it deadlocks.
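To see the primitive in action in user space (a sketch of mine, not from the paper: test_and_set_bit is a kernel function, so GCC's __sync builtins stand in for it here), two threads hammering a shared counter under the Listing 3 lock always end up with the exact total:

#include <pthread.h>
#include <stdio.h>

static volatile int lock_var = 0;
static long shared_counter = 0;

static void lock(volatile int *lock_var_p)
{
    /* atomically store 1 and return the previous value,
       emulating test_and_set_bit() from Listing 3 */
    while (__sync_lock_test_and_set(lock_var_p, 1) == 1)
        ; /* busy wait */
}

static void unlock(volatile int *lock_var_p)
{
    __sync_lock_release(lock_var_p); /* reset the flag to 0 */
}

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        lock(&lock_var);
        shared_counter++; /* the critical section */
        unlock(&lock_var);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* always prints 2000000; without the lock, usually less */
    printf("%ld\n", shared_counter);
    return 0;
}

Compiled with gcc -std=c99 -pthread. Removing the lock/unlock calls makes the final count fall short of 2000000, which is exactly the race shown in the counter++ table above.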
2.2.1 IRQ-Safe Spin Locks

The Linux kernel features several spin lock variants that are safe for use with interrupts. A critical section in process context is guarded by spin_lock_irq and spin_unlock_irq, while critical sections in interrupt handlers are guarded by the normal spin_lock and spin_unlock. The only difference between these functions is that the irq-safe versions of spin_lock disable all interrupts on the local CPU for the critical section. The possibility of a deadlock is therefore eliminated. Figure 1 shows how two CPUs interact when trying to acquire the same irq-safe spin lock. While CPU 1 (in process context) is holding the lock, any incoming interrupt requests on CPU 1 are stalled until the lock is released. An interrupt on CPU 2 busy waits on the lock, but it does not deadlock. After CPU 1 releases the lock, CPU 2 (in interrupt context) obtains it, and CPU 1 (now executing the interrupt handler) waits for CPU 2 to release it.

[Figure 1: IRQ-safe spin locks]

2.2.2 Enhancements of the simple Spin Lock

Sometimes it is desirable to allow spin locks to be nested. To do so, the spin lock is extended by a nesting counter and a variable indicating which CPU holds the lock. If the lock is held, the lock function checks whether the current CPU is the one holding the lock. In this case, the nesting counter is incremented and the function exits from the spin loop. The unlock function decrements the nesting counter. The lock is released when the unlock function has been called the same number of times as the lock function before. This kind of spin lock is called a recursive lock (see the sketch below). Locks can also be modified to allow blocking. Linux' and FreeBSD's big kernel lock is dropped if the process holding it sleeps (blocks), and reacquired when it wakes up. Solaris 2.x provides a type of locking known as adaptive locks. When one thread attempts to acquire one of these that is held by another thread, it checks to see if the second thread is active on a processor. If it is, the first thread spins. If the second thread is blocked, the first thread blocks as well.
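A minimal user-space sketch of such a recursive lock (my illustration, not kernel code; pthread_self() stands in for the CPU number a kernel implementation would record, and a GCC builtin replaces test_and_set):

#include <pthread.h>

typedef struct {
    volatile int lock_var;   /* 0 = free, 1 = taken            */
    pthread_t    owner;      /* valid only while lock_var == 1 */
    int          nesting;    /* how often the owner re-entered */
} rec_lock_t;

static void rec_lock(rec_lock_t *l)
{
    /* the owner check is only meaningful if we wrote "owner"
       ourselves; for any other thread it simply fails */
    if (l->lock_var && pthread_equal(l->owner, pthread_self())) {
        l->nesting++;        /* already ours: just count the nesting */
        return;
    }
    while (__sync_lock_test_and_set(&l->lock_var, 1) == 1)
        ; /* busy wait until the lock is free */
    l->owner   = pthread_self();
    l->nesting = 1;
}

static void rec_unlock(rec_lock_t *l)
{
    if (--l->nesting == 0)
        __sync_lock_release(&l->lock_var); /* last unlock releases */
}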
2.3 Semaphores (mutex)

Aside from the classical use of semaphores explained in section 1.2, semaphores (initialized with a counter value of 1) can also be used for protecting critical sections. For performance reasons, semaphores used for this kind of work are often realized as a separate primitive, called mutex, that replaces the counter with a simple lock status flag. Using mutexes instead of spin locks pays off if the critical section takes longer than a context switch; otherwise, the overhead of blocking exceeds that of busy waiting for the lock to be released. On the pro side, mutexes imply a kind of fairness, while processes could starve on heavily contended spin locks. Mutexes can not be used in interrupts, because it is generally not allowed to block in interrupt context. A semaphore is a complex shared data structure itself, and must therefore be protected by a spin lock of its own.

2.4 Reader-Writer Locks

As reading a data structure does not affect the integrity of the data, it is not necessary to mutually exclude two processes from reading the same data at the same time. If a data structure is read often, allowing readers to operate in parallel is a great advantage for SMP software. An rwlock keeps count of the readers currently holding a read-only lock and has a queue for both waiting writers and waiting readers. If the writer queue is empty, new readers may grab the lock. If a writer enters the scene, it has to wait for all readers to complete, then it gets an exclusive lock. Meanwhile arriving writers or readers are queued until the write lock is dropped. Then, all readers waiting in the queue are allowed to enter, and the game starts anew (waiting writers are put on hold after a writer completes, to prevent starvation of readers). Figure 2 shows a typical sequence. The rwlock involves a marginal overhead, but should yield almost linear scalability for read-mostly data structures (we will see about this later).

[Figure 2: Reader/writer lock]

3 Locking Granularity in SMP Kernels

3.1 Giant Locking

The designers of the Linux operating system did not have to worry much about mutual exclusion in their uniprocessor kernels, because they made the whole kernel non-preemptive (see section 1.2). The first Linux SMP kernel (version 2.0) used the simplest approach to make the traditional UP kernel code work on multiple CPUs: it protected the whole kernel with a single lock, the big kernel lock (BKL). The BKL is a spin lock that can be nested and is blocking-safe (see section 2.2.2). No two CPUs could be in the kernel at the same time. The only advantage of this was that the rest of the kernel could be left unchanged.

3.2 Coarse-grained Locking

In Linux 2.2, the BKL was removed from the kernel entry points, and each subsystem was protected by a lock of its own. Now, a file system call would not have to wait for a sound driver routine or a network subsystem call to finish. Still, it was not data that was protected by the locks, but rather concurrent function calls that were sequentialized. Also, the BKL could not be removed from all modules, because it was often unclear which data it protected. And data protected by the BKL could be accessed anywhere in the kernel.

3.3 Fine-grained Locking

Fine-grained locking means: individual data structures, not whole subsystems or modules, are protected by their own locks. The degree of granularity can be increased from locks protecting big data structures (like, for example, a whole file system or the whole process table) to locks protecting individual data structures (for example, a single file or a process control block) or even single elements of a data structure; a per-bucket hash table lock of this kind is sketched below. Fine-grained locking was introduced in the Linux 2.4 kernel series, and has been furthered in the 2.6 series. Fine-grained locking has also been introduced into the FreeBSD operating system by the SMPng team, into the Solaris kernel, and into the Windows NT kernels as well. Unfortunately, the BKL is still not dead. Changes to locking code had to be implemented very cautiously, so as not to introduce hard-to-track-down deadlock failures. So, every time a BKL was considered useless for a piece of code, it was moved into the functions this code called, because it was not always obvious whether these functions relied on the BKL. Thus, the occurrences of the BKL increased even more, and module maintainers did not always react to calls to remove the BKL from their code.

[Figure 3: Visual description of locking granularity in OS kernels [Kåg05]]
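As a small illustration of the finest granularity mentioned above (my example, not the paper's): a hash table can carry one spin lock per bucket, so that operations on different buckets never contend with each other.

#define NBUCKETS 256

struct node {
    unsigned int key;
    struct node *next;
};

struct bucket {
    volatile int lock;    /* per-bucket spin lock, 0 = free */
    struct node *chain;   /* singly linked hash chain       */
};

static struct bucket table[NBUCKETS];

static void table_insert(struct node *n)
{
    struct bucket *b = &table[n->key % NBUCKETS];

    while (__sync_lock_test_and_set(&b->lock, 1) == 1)
        ; /* spins only if another CPU works on this very bucket */
    n->next  = b->chain;  /* insert at the head of the chain */
    b->chain = n;
    __sync_lock_release(&b->lock);
}

Two CPUs inserting into different buckets proceed fully in parallel; a single global table lock would serialize them.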
4 Performance Considerations

The Linux 2.0.40 kernel contains a total of 17 BKL calls, while the Linux 2.4.30 kernel contains a total of 226 BKL, 329 spin lock and 121 rwlock calls. The Linux 2.6.11.7 kernel contains 101 BKL, 1717 spin lock and 349 rwlock calls, as well as 56 seq lock and 14 RCU calls (more on these synchronisation mechanisms later); numbers taken from [Kåg05].

The reason why the Linux programmers took so much work upon themselves is that coarse-grained kernels scale poorly on more than 3-4 CPUs. The optimal performance for an n-CPU SMP system is n times the performance of a 1-CPU system of the same type. But this optimal performance can only be achieved if all CPUs are doing productive work all the time. Busy waiting on a lock wastes time, and the more contended a lock is, the more processors will likely busy wait to get it. Hence, kernel developers run special lock contention benchmarks to detect which locks have to be split up to distribute the invocations on them. The lock variables are extended by a lock-information structure that contains a counter for hits (successful attempts to grab a lock), misses (unsuccessful attempts to grab a lock), and spins (total number of waiting loops) [Cam93]. The ratio spins/misses shows how contended a lock is. If this number is high, processes waste a lot of time waiting for the lock.

[Figure 4: Extract from a lock-contention benchmark on Unix SVR4/MP [Cam93]]
Lock                      | hits   | misses | spins  | spins/miss
PageTableLockInfo 1       | 9,656  | 80     | 9,571  | 120
DispatcherQueueLockInfo 1 | 49,979 | 382    | 30,508 | 80
SleepHashQueueLockInfo 1  | 25,549 | 708    | 56,192 | 79

Measuring lock contention is a common practice to look for bottlenecks. Bryant and Hawkes wrote a specialized tool to measure lock contention in the kernel, which they used to analyze filesystem performance [Bry02]. Others [Kra01] focused on contention in the 2.4.x scheduler, which has since been completely rewritten. Today, the Linux scheduler mostly operates on per-CPU ready queues and scales fine up to 512 CPUs. Lock contention is most pronounced with applications that access shared resources, such as the virtual filesystem (VFS) and the network, and with applications that spawn many processes. Etsion et al. used several benchmarks that stress these subsystems as an example of how the KLogger kernel logging and analysis tool can be used for measuring lock contention, using varying degrees of parallelization: they measured the percentage of time spent spinning on locks during make running a parallel compilation of the Linux kernel, Netperf (a network performance evaluation tool), and an Apache web server with Perl CGI being stress-tested [Eti05] (see Figure 5). Measurements like these help to spot and eliminate locking bottlenecks.

[Figure 5: Percentage of cycles spent spinning on locks for each of the test applications [Eti05]]

4.1 Scalability

Simon Kågström ran similar benchmarks to compare the scalability of the Linux kernel on 1-8 CPUs from version 2.0 to 2.6. He measured the relative speedup with the Postmark benchmark, in regard to the (giant locked) 2.0.40 UP kernel (Figure 6). The result of this benchmark is not surprising: the more we increase locking granularity (Linux 2.6), the better the system scales. But how far can we increase locking granularity?

[Figure 6: Postmark benchmark of several Linux kernels [Kåg05]]

Of course, we cannot ignore the increasing complexity induced by finer locking granularity. As more locks have to be held to perform a specific operation, the risk of deadlock increases. There is an ongoing discussion in the Linux community about how much locking hierarchy is too much. With more locking comes more need for documentation of locking order, or need for tools like deadlock analyzers. Deadlock faults are among the most difficult to track down. The overhead of locking operations matters as well. The CPU does not only spend time in a critical section, it also takes some time to acquire and to release a lock. Compare the graph of the 2.6 kernel with the 2.4 kernel for fewer than four CPUs: the kernel with more locks is the slower one. The efficiency of executing a critical section can be mathematically expressed as:

efficiency = T(critical section) / (T(critical section) + T(acquire lock) + T(release lock))
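To get a feeling for the magnitudes (the following numbers are hypothetical, chosen only for illustration): if acquiring and releasing a lock cost 100 ns each and the critical section itself runs for 400 ns, the efficiency is 400 / (400 + 100 + 100), roughly 67%. Splitting that critical section into two sections of 200 ns each doubles the locking overhead, and the combined efficiency drops to 400 / (400 + 2 x (100 + 100)) = 50%.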
If you split a critical section into two, the time to acquire and release a lock is roughly multiplied by two. Surprisingly, even a single acquisition of a lock is generally one too many: the performance penalty of a simple lock acquisition, even if successful at the first attempt, is becoming worse and worse. To understand why, we have to forget the basic model of a simple scalar processor without caches and look at today's reality.

4.2 Performance Penalty of Lock Operations

Image 1 shows typical instruction costs of several operations on an 8-CPU 1.45 GHz PPC system[1]. The gap between normal instructions, cache-hitting memory accesses (not listed here; they are generally 3-4 times faster than an atomic increment operation) and a lock operation becomes obvious. Let us look at the architecture of today's SMP systems and its impact on our spin lock.

[Image 1: Instruction costs on an 8-CPU 1.45 GHz PPC system [McK05]]

[1] If you wonder why an instruction takes less time than a CPU cycle, remember that we are looking at an 8-CPU SMP system, and view these numbers as "typical instruction cost".

4.2.1 Caches

While CPU power has increased roughly by a factor of 2 each year, memory speeds have not kept pace, increasing by only 10 to 15% each year. Thus, memory operations impose a big performance penalty on today's computers. As a consequence of this development, small SRAM caches were introduced, which are much faster than main memory. Due to temporal and spatial locality of reference in programs (see [Sch94] for an explanation), even a comparatively small cache achieves hit ratios of 90% and higher. On SMP systems, each processor has its own cache. This has the big advantage that cache hits cause no load on the common memory bus, but it introduces the problem of cache consistency.

When a memory word is accessed by a CPU, it is first looked up in the CPU's local cache. If it is not found there, the whole cache line[2] containing the memory word is copied into the cache. This is called a cache miss. (To increase the number of cache hits, it is thus very advisable to align data along cache lines in physical memory and to operate on data structures that fit within a single cache line; see the sketch below.) Subsequent read accesses to that memory address will cause a cache hit. But what happens on a write access to a memory word that lies in the cache? This depends on the "write policy". "Write through" means that after every write access, the cache line is written back to main memory. This ensures consistency between the cache and memory (and between all caches of an SMP system, if the other caches snoop the bus for write accesses), but it is also the slowest method, because a memory access is needed on every write access to a cache line. The "write back" policy is much more common. On a write access, data is not written back to memory immediately, but the cache line gets a "modified" tag. If a cache line with a modified tag is finally replaced by another line, its content is written back to memory. Subsequent write operations hit in the cache as long as the line is not removed.

[2] To find data in the cache, each line of the cache (think of the cache as a spreadsheet) has a tag containing its address. If one cache line consisted of only one memory word, a lot of lines, and thus a lot of address tags, would be needed. To reduce this overhead, cache lines usually contain about 32-128 bytes, accessed by the same tag, and the least significant bits of the address serve as the byte offset within the cache line.

On SMP systems, the same piece of physical memory could lie in more than one cache, so the SMP architecture needs a protocol to ensure consistency between all caches. If two CPUs want to read the same memory word from their cache, everything goes well; both read operations can even execute at the same time. But if two CPUs wanted to write to the same memory word in their cache at the same time, there would be a modified version of this cache line in both caches afterwards, and thus two versions of the same cache line would exist. To prevent this, a CPU trying to modify a cache line has to get the "exclusive" right on it. With that, this cache line is marked invalid in all other caches. Another CPU trying to modify the cache line has to wait until the first CPU drops the exclusive right, and has to re-read the cache line from that CPU's cache.

Let us look at the effects of the simple spin lock code from Listing 3 if the lock is held by CPU 1, and CPUs 2 and 3 wait for it: the lock variable is set by the test_and_set operation on every spinning cycle. While CPU 1 is in the critical section, CPUs 2 and 3 constantly read and write the cache line containing the lock variable. The line is constantly transferred from one cache to the other, because both CPUs must acquire an exclusive copy of the line when they test-and-set the lock variable again. This is called "cache line bouncing", and it imposes a big load on the memory bus. The impact on performance would be even worse if the data protected by the lock were also lying in the same cache line.
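The last two remarks suggest a simple layout rule. A hedged sketch of cache-line-conscious data placement (my illustration; GCC syntax, and a 64-byte line size is assumed, where real sizes vary between 32 and 128 bytes):

#define CACHE_LINE 64 /* assumed line size */

/* Keep the lock and the data it protects in separate cache lines, so
   that the spinning CPUs bounce only the lock's line between their
   caches, while the holder works undisturbed on the data's line. */
struct protected_counter {
    volatile int lock;                      /* the spin lock...        */
    char pad[CACHE_LINE - sizeof(int)];     /* ...alone in its line    */
    long value;                             /* protected data, next line */
} __attribute__((aligned(CACHE_LINE)));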
We can, however, modify the implementation of the spin lock to fit the functionality of a cache. The atomic read-modify-write operation cannot possibly acquire the lock while it is held by another processor. It is therefore unnecessary to use such an operation until the lock is freed. Instead, other processors trying to acquire a lock that is in use can simply read the current state of the lock and only use the atomic operation once the lock has been freed. Listing 4 gives an alternate implementation of the lock function using this technique. Here, one attempt is made to acquire the lock before entering the inner loop, which then waits until the lock is freed. If the lock has already been taken again by the time of the test_and_set operation, the CPU spins again in the inner loop. CPUs spinning in the inner loop only work on a shared cache line and do not request the cache line exclusively. They work cache-local and do not waste bus bandwidth. When CPU 1 releases the lock, it marks the cache line exclusive and sets the lock variable to zero. The other CPUs re-read the cache line and try to acquire the lock again.

void lock(volatile lock_t *lock_status)
{
    while (test_and_set(lock_status) == 1)
        while (*lock_status == 1); /* spin */
}

Listing 4: Spin lock implementation avoiding excessive cache line bouncing [Sch94]

Nevertheless, spin lock operations are still very time-consuming, because they usually involve at least one cache line transfer between caches or from memory.

4.2.2 Memory Barriers

With the superscalar architecture, parallelism was introduced into the CPU cores. In a superscalar CPU, there are several functional units of the same type, along with additional circuitry to dispatch instructions to the units. For instance, most superscalar designs include more than one arithmetic-logical unit. The dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching them to the units. The performance of the dispatcher is key to the overall performance of a superscalar design: the units' pipelines should be as full as possible. A superscalar CPU's dispatcher hardware therefore reorders instructions for optimal throughput. This holds true for load/store operations as well. For example, imagine a program that adds two integers from main memory. The first argument that is fetched is not in the cache and must be fetched from main memory. Meanwhile, the second argument is fetched from the cache. The second load operation is likely to complete earlier. Meanwhile, a third load operation can be issued. The dispatcher uses interlocks to prevent the add operation from being issued before the load operations it depends on have finished. Also, most modern CPUs sport a small register set called store buffers, where several store operations are gathered to be executed together at a later time. They can be buffered in order (which is called total store ordering) or, as is common with superscalar CPUs, out of order (partial store ordering). In short: as long as a load or store operation does not access the same memory word as a prior store operation (or vice versa), they can be executed in any possible order by the CPU. This calls for further measures to ensure the correctness of SMP code.
The simple atomic list insert code from section 2.1 could be executed as shown in Figure 7. The method requires the new node's next pointer (and all its data) to be initialized before the new element is inserted at the list's head. If these instructions are executed out of order, the list will be in an inconsistent state until the second instruction completes. Meanwhile, another CPU could traverse the list, the thread could be preempted, and so on. It is not necessary that both operations are executed right after each other, but it is important that the first one is executed before the second.

[Figure 7: Impact of a non-sequential memory model on the atomic list insertion algorithm]
code in memory:          possible execution order:
new->next = i->next;     i->next = new;
i->next = new;           new->next = i->next;

To force completion of all read or write operations in the instruction pipeline before the next operation is fetched, superscalar CPUs have so-called memory barrier instructions. We distinguish read memory barriers (wait until all pending read operations have completed), write memory barriers, and full memory barriers (wait until all pending memory operations, read and write, have completed). Correct code would read:

new->next = i->next;
smp_wmb(); /* write memory barrier */
i->next = new;

Listing 5: Correct code for atomic list insertion on machines without a sequential memory model

Instruction reordering can also cause operations in a critical section to "bleed out" (Figure 8). The line that claims to be in a critical section obviously is not, because the operation releasing the lock variable was executed earlier. Another CPU could long since have altered part of the data, with unpredictable results.

[Figure 8: Impact of weak store ordering on critical sections]
code in memory:                             possible execution order:
data.foo = y;      /* in critical section */   *lock_var = 0;   /* unlock */
data.next = &bar;                              data.foo = y;    /* in critical section */
*lock_var = 0;     /* unlock */                data.next = &bar;

To prevent this, we have to alter our locking operations again. (Note that it is not necessary to prevent load/store operations prior to the critical section from "bleeding" into it. And of course, dispatcher units do not override the logical instruction flow, so every operation in the critical section will be executed after the CPU exits the while loop.)

void lock(volatile lock_t *lock_var)
{
    while (test_and_set(lock_var) == 1)
        while (*lock_var == 1); /* spin */
}

void unlock(volatile lock_t *lock_var)
{
    mb(); /* read/write memory barrier */
    *lock_var = 0;
}

Listing 6: Spin lock with memory barrier
A memory barrier flushes the store buffer and stalls the pipelines (to carry out all pending read/write operations before new ones are executed), so it impacts performance negatively, in proportion to the number of pipeline stages and the number of functional units. This is why the memory barrier operations take so much time in the chart presented earlier. Atomic operations take so long because they, too, flush the store buffer in order to be carried out immediately.

4.2.3 Hash-Table Benchmark

Image 2 shows the results of a benchmark that performed search operations on a hash table with a dense array of buckets, doubly linked hash chains, and one element per hash chain. Among the locking designs tested for this hash table were: a global spin lock, a global reader/writer lock, per-bucket spin locks and rwlocks, and Linux' big reader lock. Especially the results for the allegedly parallel reader/writer locks may seem surprising, but they only support what was said in the last two sections: the locking instructions' overhead thwarts any parallelism. The explanation is rather simple (Figure 9): the acquisition and release of rwlocks take so much time (remember the cache line bouncing etc.) that the actual critical section is not executed in parallel any more.

[Image 2: Performance of several locking strategies for a hash table [McK05]]

[Figure 9: Effects of cache line bouncing and memory synchronisation delay on the rwlock's efficiency [McK03]]
4" a write access too place while the data was read, the data is in*alid and has to $e read again% The identi"ication o" write accesses is reali;ed with a co#nter 0see Fig#re <02% A*ery writer increments this ;ero)initiali;ed co#nter once $e"ore he changes any data, and again a"ter all changes are done% The reader reads the co#nter *al#e $e"ore he reads the data, then compares it to the c#rrent co#nter *al#e a"ter reading the data% 4" the co#nter *al#e has increased, the reading was tampered $y one or more conc#rrent writers and the data has to $e read again% Also, i" the co#nter *al#e was #ne*en at the $eginning o" the read)side critical section, a writer was in progress while the data was read and it has to $e discarded% So, strictly speaing, the while loop,s condition is ((count_pre /= count_post) && (count_pre 2 3 == 0))% 4n the worst case, the readers wo#ld ha*e to loop in"initely i" there was a non)ending chain o" writers% B#t #nder normal conditions, the readers read the data s#ccess"#lly within a "ew, i" not only one tries% By minimi;ing the time spent in the read)side critical section, the pro$a$ility o" $eing interr#pted $y a writer can $e red#ced greatly% There"ore, it is part o" the method to only copy the shared data in the critical section and wor on it later% -isting D shows how to read shared data protected $y the se/ loc "#nctions o" the -in#' ernel% time_loc is the se/ loc *aria$le, $#t read_se!be4in and read_ se!retr. only read the se/ loc,s co#nter and do not access the loc *aria$le% 4.2 !e $ead&0o"y&7"date Mec!anism As yo# can see, synchronisation is a mechanism and a coding con*ention% The coding con*ention "or m#t#al e'cl#sion with a spin loc is, that yo# ha*e to hold a loc $e"ore yo# access the data protected $y it, and that yo# ha*e to release it a"ter yo# are done% The <D unsi4ned lon4 se!; do * se! = read_se!be4in(&time_loc); no# = time; - #'ile( read_se!retr.(&time_loc+se!) ); "" (alue in 5no#5 can no# be used Listing @: Se? Lock: *ead-side critical section (igure 10: Se? 
5.2 The Read-Copy-Update Mechanism

As you can see, synchronisation is a mechanism plus a coding convention. The coding convention for mutual exclusion with a spin lock is that you have to hold the lock before you access the data protected by it, and that you have to release it after you are done. The coding convention for non-blocking synchronisation is that every data manipulation needs only a single atomic operation (e.g. CAS, CAS2, or our atomic list update example).

The RCU mechanism is based on something called quiescent states. A quiescent state is a point in time where a process that has been reading shared data no longer holds any references to this data. With the RCU mechanism, processes can enter a read-side critical section at any time and can assume that the data they read is consistent as long as they work on it. But after a process leaves its read-side critical section, it must not hold any references to the data any longer. The process enters the quiescent state. This imposes some constraints on how to update shared data structures. As readers do not check whether the data they read is consistent (as in the SeqLock mechanism), writers have to apply all their changes with one atomic operation. If a reader reads the data before the update, it sees the old state; if it reads the data after the update, it sees the new state. Of course, readers should read data once and then work with it, and not read the same data several times and then fail if they see differing versions.

Consider a linked list protected by RCU. To update an element of the list (see Figure 11), the writer has to read the old element's contents and make a copy of them (1), update this copy (2), and then exchange the old element for the new one with an atomic operation, writing the new element's address to the previous element's next pointer (3). As you can see, readers could still read stale data even after the writer has finished updating, if they entered the read-side critical section before the writer finished (4). Therefore, the writer cannot immediately delete the old element. It has to defer the destruction until all processes that were in a read-side critical section at the time the writer finished have dropped their references to the stale data (5), or in other words, have entered the quiescent state. This time span is called the grace period. After that time, there can be readers holding references to the data, but none of them could possibly reference the old data, because they started at a time when the old data was no longer visible. The old element can be deleted (6).

[Figure 11: The six steps of the RCU mechanism]

The RCU mechanism requires data that is stored within some sort of container that is referenced by a pointer. The update step consists of changing that pointer. Thus, linked lists are the most common type of data protected by RCU. Insertion and deletion of elements is done as presented in section 2.1. Of course, we need memory barrier operations on machines with weak ordering. More complex updates, like sorting a list, need some other kind of synchronisation mechanism. If we assume that readers traverse a list in search of an element once, and not several times back and forth (as we assumed anyway), we can also use doubly linked lists (see Listing 8).

static inline void __list_add_rcu(struct list_head *new,
                                  struct list_head *prev,
                                  struct list_head *next)
{
    new->next = next;
    new->prev = prev;
    smp_wmb();
    next->prev = new;
    prev->next = new;
}

Listing 8: Extract from the Linux 2.6.14 kernel's list.h

The RCU mechanism is optimal for read-mostly data structures where readers can tolerate stale data (it is, for example, used in the Linux routing table implementation). While readers generally do not have to worry about much, things get more complex on the writer's side. First of all, writers have to acquire a lock, just like with the seqlock mechanism. If they did not, two writers could each obtain a copy of a data element, perform their changes, and then replace it. The data structure would still be intact, but one update would be lost.
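Put together as code, the six steps of Figure 11 might look like this (a sketch under the paper's API as printed in Listing 9 below; struct element, the updated field, and rcu_replace are made up for illustration, and the caller is assumed to hold the writers' lock):

struct element {
    struct element *next;
    int data;
};

void rcu_replace(struct element *prev, struct element *old, int new_data)
{
    struct element *copy = kmalloc(sizeof(*copy), GFP_KERNEL);

    *copy = *old;            /* (1) copy the old element                  */
    copy->data = new_data;   /* (2) update the private copy               */
    smp_wmb();               /* make the copy visible before linking it   */
    prev->next = copy;       /* (3) single atomic pointer store           */
                             /* (4) old readers may still see "old"       */
    synchronize_kernel();    /* (5) wait until the grace period is over   */
    kfree(old);              /* (6) nobody can reference "old" any more   */
}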
Second, writers have to defer the destruction of an old version of the data. When exactly is it safe to delete old data? After all readers that were in a read-side critical section at the time of the update have left their critical section, or, in other words, have entered a quiescent state (Figure 12). A simple approach would be a counter that indicates how many processes are currently within a read-side critical section, deferring destruction of all stale versions of data elements until it drops to zero. But later readers are not taken into account, so this approach fails. We could also include a reference counter in every data element, if the architecture features an atomic increment operation. Every reader would increment this reference counter as it gets a reference to the data element, and decrement it when the read-side critical section completes. If an old data element has a refcount of zero, it can be deleted [McK01]. While this solves the problem, it reintroduces the performance issues of atomic operations that we wanted to avoid.

[Figure 12: RCU grace period: after all processes that entered a read-side critical section (gray) before the writer (red) finished have entered a quiescent state, it is safe to delete an old element]

Let us assume that a process in an RCU read-side critical section does not yield the CPU. This means: preemption is disabled while in the critical section, and no functions that might block may be used [McK04]. Then no references can be held across a context switch, and we can safely assume a CPU that has gone through a context switch to be in a quiescent state. The earliest time when we can be absolutely sure that no process on any other CPU is still holding a reference to stale data is after every other CPU has gone through a context switch at least once after the writer finished. The writer has to defer destruction of stale data until then, either by waiting or by registering a callback function that frees the space occupied by the stale data. This callback function is called after the grace period is over. The Linux kernel features both variants. A simple mechanism to detect when all CPUs have gone through a context switch is to start a high priority thread on CPU 1 that repeatedly reschedules itself on the next CPU until it reaches the last CPU (Figure 13). This thread then executes the callback functions or wakes up any processes waiting for the grace period to end. There are more effective algorithms for detecting the end of a grace period, but these are beyond the scope of this document.

[Figure 13: Simple detection of a grace period: thread "u" runs once on every CPU [McK05]]

Listing 9 presents the Linux RCU API (without the Linux RCU list API). Note that, while it is necessary to guard read-side critical sections with rcu_read_lock and rcu_read_unlock, the only thing these functions do (except for visually highlighting a critical section) is disable preemption for the critical section. If the Linux kernel is compiled with PREEMPT_ENABLE=no, they do nothing. Write-side critical sections are protected by spin_lock() and spin_unlock(); afterwards, the writer either waits for the grace period with synchronize_kernel() or registers a callback function that destroys the old element with call_rcu().

void synchronize_kernel(void);

void call_rcu(struct rcu_head *head,
              void (*func)(void *arg), void *arg);

struct rcu_head {
    struct list_head list;
    void (*func)(void *obj);
    void *arg;
};

void rcu_read_lock(void);
void rcu_read_unlock(void);

Listing 9: The Linux 2.6 RCU API functions
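The non-blocking variant registers a callback instead of waiting (again a sketch of mine against Listing 9's printed API; the element layout is illustrative, now embedding a struct rcu_head as the bookkeeping handle):

struct element {
    struct element *next;
    struct rcu_head rcu;  /* handle passed to call_rcu() */
    int data;
};

static void free_element(void *arg)
{
    kfree(arg); /* runs only after every CPU has passed a quiescent state */
}

void retire_element(struct element *old)
{
    /* returns immediately; "old" must already be unlinked from the
       list, but readers inside a critical section may still use it */
    call_rcu(&old->rcu, free_element, old);
}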
The RCU mechanism is widely believed to have been developed at Sequent Computer Systems, who were later bought by IBM, which holds several patents on this technique. The patent holders have given permission to use this mechanism under the GPL. Therefore, Linux is currently the only major OS using it. RCU is also part of the SCO claims in the SCO vs. IBM lawsuit.

6 Conclusion

The introduction of SMP systems has greatly increased the complexity of locking in OS kernels. In order to strive for optimal performance on all platforms, operating system designers have to meet conflicting goals: increase granularity to improve the scalability of their kernels, and reduce the use of locks to increase the efficiency of critical sections, and thus the performance of their code. While a spin lock can always be used, it is not always the right tool for the job. Non-blocking synchronisation, seq locks and the RCU mechanism offer better performance than spin locks or rwlocks. But these synchronisation methods require more effort than a simple drop-in replacement. RCU requires a complete rethinking and rewriting of the data structures it is used for, and of the code it is used in. It took Linux half a decade to get from its first giant locked SMP kernel implementation to a reasonably fine granular one. This time could have been greatly reduced if the Linux kernel had been written as a preemptive kernel with fine granular locking from the beginning. When it comes to mutual exclusion, it is always a good thing to think the whole thing through from the beginning. Starting with an approach that is ugly but works, and tuning it into a well running solution later, often leaves you coding the same thing twice, and experiencing greater problems than if you had tried to do it properly from the start. As multiprocessor systems and modern architectures like superscalar, super-pipelined, and hyperthreaded CPUs as well as multi-level caches become normalcy, simple code that looks fine at first glance can have a severe impact on performance. Thus, programmers need to have a thorough understanding of the hardware they write code for.

Further Reading

This paper detailed the performance aspects of locking in SMP kernels. If you are interested in the implementation complexity of fine grained SMP kernels or in experiences from performing a multiprocessor port, please refer to [Kåg05]. For a more in-depth introduction to SMP architecture and caching, read Curt Schimmel's book [Sch94]. If you want to gain deeper knowledge of the RCU mechanism, you can start at Paul E. McKenney's RCU website (http://www.rdrop.com/users/paulmck/RCU).
(er"ormance and 4mplementation &omple'ity in M#ltiprocessor Operating System Cernels% Elekinge )nstitute of Dechnology, 200B 6Cra0<9 M% Cra*et;, H% Frane. Anhancing the -in#' sched#ler% 5roceedings of the .ttawa Linux Symposium, 200< 6McC0<9 (a#l A% McCenney. 1ead)&opy Update% 5roceedings of the .ttawa Linux Symposium, 2002 6McC0=9 (a#l A% McCenney. Cernel Corner ) Using 1&U in the -in#' 2%B Cernel% Linux 4aga>ine, 200= 6McC089 (a#l A% McCenney. 1&U *s% -ocing (er"ormance on :i""erent Types o" &(Us% http:,,www.rdrop.com,users,paulmck,*63,L6A200%.02.1"a.pdf, 2008 6McC0B9 (a#l McCenney. A$straction, 1eality &hecs, and 1&U% http:,,www.rdrop.com,users,paulmck,*63,*63intro.2001.0@.27bt.pdf, 200B 6H#a089 PRrgen H#ade, A*a)Catharina C#nst. -in#' Trei$er entwiceln% Fpunkt.'erlag, 2008 6Sch789 &#rt Schimmel. U+4S Systems "or Modern Architect#res. Symmetric M#ltiprocessing and &aching "or Cernel (rogrammers% Addison Gesley, <778 22