Spin Locks and Contention
Spin Locks and Contention Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Focus so far: Correctness and Progress
Art of Multiprocessor Programming 2 Focus so far: Correctness and Progress Models Accurate (we never lied to you) But idealized (so we forgot to mention a few things) Protocols Elegant Important But naïve
New Focus: Performance
Art of Multiprocessor Programming 3 New Focus: Performance Models More complicated (not the same as complex!) Still focus on principles (not soon obsolete) Protocols Elegant (in their fashion) Important (why else would we pay attention) And realistic (your mileage may vary)
Kinds of Architectures
Art of Multiprocessor Programming 4 Kinds of Architectures SISD (Uniprocessor) Single instruction stream Single data stream SIMD (Vector) Single instruction Multiple data MIMD (Multiprocessors) Multiple instruction Multiple data.
Kinds of Architectures
Art of Multiprocessor Programming 5 Kinds of Architectures SISD (Uniprocessor) Single instruction stream Single data stream SIMD (Vector) Single instruction Multiple data MIMD (Multiprocessors) Multiple instruction Multiple data. Our space (1)
MIMD Architectures
Art of Multiprocessor Programming 6 MIMD Architectures Memory Contention Communication Contention Communication Latency Shared Bus memory Distributed
Today: Revisit Mutual Exclusion
Art of Multiprocessor Programming 7 Today: Revisit Mutual Exclusion Performance, not just correctness Proper use of multiprocessor architectures A collection of locking algorithms… (1)
What Should you do if you can’t get a lock?
Art of Multiprocessor Programming 8 What Should you do if you can’t get a lock? Keep trying “spin” or “busy-wait” Good if delays are short Give up the processor Good if delays are long Always good on uniprocessor (1)
What Should you do if you can’t get a lock?
Art of Multiprocessor Programming 9 What Should you do if you can’t get a lock? Keep trying “spin” or “busy-wait” Good if delays are short Give up the processor Good if delays are long Always good on uniprocessor our focus
Basic Spin-Lock
Art of Multiprocessor Programming 10 Basic Spin-Lock CS Resets lock upon exit spin lock critical section . . .
Basic Spin-Lock
Art of Multiprocessor Programming 11 Basic Spin-Lock CS Resets lock upon exit spin lock critical section . . . …lock introduces sequential bottleneck
Basic Spin-Lock
Art of Multiprocessor Programming 12 Basic Spin-Lock CS Resets lock upon exit spin lock critical section . . . …lock suffers from contention
Basic Spin-Lock
Art of Multiprocessor Programming 13 Basic Spin-Lock CS Resets lock upon exit spin lock critical section . . . Notice: these are distinct phenomena …lock suffers from contention
Basic Spin-Lock
Art of Multiprocessor Programming 14 Basic Spin-Lock CS Resets lock upon exit spin lock critical section . . . Seq Bottleneck no parallelism …lock suffers from contention
Basic Spin-Lock
Art of Multiprocessor Programming 15 Basic Spin-Lock CS Resets lock upon exit spin lock critical section . . . Contention ??? …lock suffers from contention
Review: Test-and-Set
Art of Multiprocessor Programming 16 Review: Test-and-Set Boolean value Test-and-set (TAS) Swap true with current value Return value tells if prior value was true or false Can reset just by writing false TAS aka “getAndSet”
Review: Test-and-Set
Art of Multiprocessor Programming 17 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet( boolean newValue) { boolean prior = value; value = newValue; return prior; } } (5)
Review: Test-and-Set
Art of Multiprocessor Programming 18 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; } } Package java.util.concurrent.atomic
Review: Test-and-Set
Art of Multiprocessor Programming 19 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet( boolean newValue) { boolean prior = value; value = newValue; return prior; } } Swap old and new values
Review: Test-and-Set
Art of Multiprocessor Programming 20 Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean( false ) boolean prior = lock.getAndSet( true )
Review: Test-and-Set
Art of Multiprocessor Programming 21 Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false) boolean prior = lock.getAndSet( true ) (5) Swapping in true is called “test-and-set” or TAS
Test-and-Set Locks
Art of Multiprocessor Programming 22 Test-and-Set Locks Locking Lock is free: value is false Lock is taken: value is true Acquire lock by calling TAS If result is false , you win If result is true , you lose Release lock by writing false
Test-and-set Lock
Art of Multiprocessor Programming 23 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean( false ); void lock() { while (state.getAndSet( true )) {} } void unlock() { state.set( false ); }}
Test-and-set Lock
Art of Multiprocessor Programming 24 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean( false ); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Lock state is AtomicBoolean
Test-and-set Lock
Art of Multiprocessor Programming 25 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet( true )) {} } void unlock() { state.set(false); }} Keep trying until lock acquired
Test-and-set Lock
Art of Multiprocessor Programming 26 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set( false ); }} Release lock by resetting state to false
Space Complexity
Art of Multiprocessor Programming 27 Space Complexity TAS spin-lock has small “footprint” N thread spin-lock uses O(1) space As opposed to O(n) Peterson/Bakery How did we overcome the (n) lower bound? We used a RMW operation…
Performance
Art of Multiprocessor Programming 28 Performance Experiment n threads Increment shared counter 1 million times How long should it take? How long does it take?
Graph
Art of Multiprocessor Programming 29 Graph ideal time threads no speedup because of sequential bottleneck
Mystery #1
Art of Multiprocessor Programming 30 Mystery #1 time threads TAS lock Ideal What is going on?
Test-and-Test-and-Set Locks
Art of Multiprocessor Programming 31 Test-and-Test-and-Set Locks Lurking stage Wait until lock “looks” free Spin while read returns true (lock taken) Pouncing state As soon as lock “looks” available Read returns false (lock free) Call TAS to acquire lock If TAS loses, back to lurking
Test-and-test-and-set Lock
Art of Multiprocessor Programming 32 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean( false ); void lock() { while ( true ) { while (state.get()) {} if (!state.getAndSet( true )) return ; } }
Test-and-test-and-set Lock
Art of Multiprocessor Programming 33 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; } } Wait until lock looks free
Test-and-test-and-set Lock
Art of Multiprocessor Programming 34 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet( true )) return ; } } Then try to acquire it
Mystery #2
Art of Multiprocessor Programming 35 Mystery #2 TAS lock TTAS lock Ideal time threads
Mystery
Art of Multiprocessor Programming 36 Mystery Both TAS and TTAS Do the same thing (in our model) Except that TTAS performs much better than TAS Neither approaches ideal
Opinion
Art of Multiprocessor Programming 37 Opinion Our memory abstraction is broken TAS & TTAS methods Are provably the same (in our model) Except they aren’t (in field tests) Need a more detailed model …
Bus-Based Architectures
Art of Multiprocessor Programming 38 Bus-Based Architectures Bus cache memory cache cache
Bus-Based Architectures
Art of Multiprocessor Programming 39 Bus-Based Architectures Bus cache memory cache cache Random access memory (10s of cycles)
Bus-Based Architectures
Art of Multiprocessor Programming 40 Bus-Based Architectures cache memory cache cache Shared Bus Broadcast medium One broadcaster at a time Processors and memory all “snoop” Bus
Bus-Based Architectures
Art of Multiprocessor Programming 41 Bus-Based Architectures Bus cache memory cache cache Per-Processor Caches Small Fast: 1 or 2 cycles Address & state information
Jargon Watch
Art of Multiprocessor Programming 42 Jargon Watch Cache hit “I found what I wanted in my cache” Good Thing™
Jargon Watch
Art of Multiprocessor Programming 43 Jargon Watch Cache hit “I found what I wanted in my cache” Good Thing™ Cache miss “I had to shlep all the way to memory for that data” Bad Thing™
Cave Canem
Art of Multiprocessor Programming 44 Cave Canem This model is still a simplification But not in any essential way Illustrates basic principles Will discuss complexities later
Processor Issues Load Request
Art of Multiprocessor Programming 45 Bus Processor Issues Load Request cache memory cache cache data
Processor Issues Load Request
Art of Multiprocessor Programming 46 Bus Processor Issues Load Request Bus cache memory cache cache data Gimme data
Memory Responds
Art of Multiprocessor Programming 47 cache Bus Memory Responds Bus memory cache cache data Got your data right here data
Processor Issues Load Request
Art of Multiprocessor Programming 48 Bus Processor Issues Load Request memory cache cache data data Gimme data
Processor Issues Load Request
Art of Multiprocessor Programming 49 Bus Processor Issues Load Request Bus memory cache cache data data Gimme data
Processor Issues Load Request
Art of Multiprocessor Programming 50 Bus Processor Issues Load Request Bus memory cache cache data data I got data
Other Processor Responds
Art of Multiprocessor Programming 51 Bus Other Processor Responds memory cache cache data I got data data data Bus
Other Processor Responds
Art of Multiprocessor Programming 52 Bus Other Processor Responds memory cache cache data data data Bus
Modify Cached Data
Art of Multiprocessor Programming 53 Modify Cached Data Bus data memory cache data data (1)
Modify Cached Data
Art of Multiprocessor Programming 54 Modify Cached Data Bus data memory cache data data data (1)
Modify Cached Data
Art of Multiprocessor Programming 55 memory Bus data Modify Cached Data cache data data
Modify Cached Data
Art of Multiprocessor Programming 56 memory Bus data Modify Cached Data cache What’s up with the other copies? data data
Cache Coherence
Art of Multiprocessor Programming 57 Cache Coherence We have lots of copies of data Original copy in memory Cached copies at processors Some processor modifies its own copy What do we do with the others? How to avoid confusion?
Write-Back Caches
Art of Multiprocessor Programming 58 Write-Back Caches Accumulate changes in cache Write back when needed Need the cache for something else Another processor wants it On first modification Invalidate other entries Requires non-trivial protocol …
Write-Back Caches
Art of Multiprocessor Programming 59 Write-Back Caches Cache entry has three states Invalid: contains raw seething bits Valid : I can read but I can’t write Dirty : Data has been modified Intercept other load requests Write back to memory before using cache
Invalidate
Art of Multiprocessor Programming 60 Bus Invalidate memory cache data data data
Invalidate
Art of Multiprocessor Programming 61 Bus Invalidate Bus memory cache data data data Mine, all mine!
Invalidate
Art of Multiprocessor Programming 62 Bus Invalidate Bus memory cache data data data cache Uh,oh
Invalidate
Art of Multiprocessor Programming 63 cache Bus Invalidate memory cache data data Other caches lose read permission
Invalidate
Art of Multiprocessor Programming 64 cache Bus Invalidate memory cache data data Other caches lose read permission This cache acquires write permission
Invalidate
Art of Multiprocessor Programming 65 cache Bus Invalidate memory cache data data Memory provides data only if not present in any cache, so no need to change it now (expensive)
Another Processor Asks for Data
Art of Multiprocessor Programming 66 cache Bus Another Processor Asks for Data memory cache data data Bus
Owner Responds
Art of Multiprocessor Programming 67 cache data Bus Owner Responds memory cache data data Bus Here it is!
End of the Day …
Art of Multiprocessor Programming 68 Bus End of the Day … memory cache data data Reading OK, no writing data data
Mutual Exclusion
Art of Multiprocessor Programming 69 Mutual Exclusion What do we want to optimize? Bus bandwidth used by spinning threads Release/Acquire latency Acquire latency for idle lock
Simple TASLock
Art of Multiprocessor Programming 70 Simple TASLock TAS invalidates cache lines Spinners Miss in cache Go to bus Thread wants to release lock delayed behind spinners
Test-and-test-and-set
Art of Multiprocessor Programming 71 Test-and-test-and-set Wait until lock “looks” free Spin on local cache No bus use while lock busy Problem: when lock is released Invalidation storm …
Local Spinning while Lock is Busy
Art of Multiprocessor Programming 72 Local Spinning while Lock is Busy Bus memory busy busy busy busy
On Release
Art of Multiprocessor Programming 73 Bus On Release memory free invalid invalid free
On Release
Art of Multiprocessor Programming 74 On Release Bus memory free invalid invalid free miss miss Everyone misses, rereads (1)
On Release
Art of Multiprocessor Programming 75 On Release Bus memory free invalid invalid free TAS(…) TAS(…) Everyone tries TAS (1)
Problems
Art of Multiprocessor Programming 76 Problems Everyone misses Reads satisfied sequentially Everyone does TAS Invalidates others’ caches Eventually quiesces after lock acquired How long does this take?
Measuring Quiescence Time
Measuring Quiescence Time Acquire lock Pause without using bus Use bus heavily Art of Multiprocessor Programming 77 1 P 2 P n P If pause > quiescence time, critical section duration independent of number of threads If pause < quiescence time, critical section duration slower with more threads
Quiescence Time
Art of Multiprocessor Programming 78 Quiescence Time Increses linearly with the number of processors for bus architecture time threads
Mystery Explained
Art of Multiprocessor Programming 79 Mystery Explained TAS lock TTAS lock Ideal time threads Better than TAS but still not as good as ideal
Solution: Introduce Delay
Art of Multiprocessor Programming 80 Solution: Introduce Delay spin lock time d d 1 r d 2 r If the lock looks free But I fail to get it There must be contention Better to back off than to collide again
Dynamic Example: Exponential Backoff
Art of Multiprocessor Programming 81 Dynamic Example: Exponential Backoff time d 2d 4d spin lock If I fail to get lock wait random duration before retry Each subsequent failure doubles expected wait
Exponential Backoff Lock
Art of Multiprocessor Programming 82 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while ( true ) { while (state.get()) {} if (!lock.getAndSet(true)) return ; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}}
Exponential Backoff Lock
Art of Multiprocessor Programming 83 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Fix minimum delay
Exponential Backoff Lock
Art of Multiprocessor Programming 84 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Wait until lock looks free
Exponential Backoff Lock
Art of Multiprocessor Programming 85 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return ; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} If we win, return
Exponential Backoff Lock
Art of Multiprocessor Programming 86 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Back off for random duration
Exponential Backoff Lock
Art of Multiprocessor Programming 87 Exponential Backoff Lock public class Backoff implements lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!lock.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Double max delay, within reason
Spin-Waiting Overhead
Art of Multiprocessor Programming 88 Spin-Waiting Overhead TTAS Lock Backoff lock time threads
Backoff: Other Issues
Art of Multiprocessor Programming 89 Backoff: Other Issues Good Easy to implement Beats TTAS lock Bad Must choose parameters carefully Not portable across platforms
Idea
Art of Multiprocessor Programming 90 Idea Avoid useless invalidations By keeping a queue of threads Each thread Notifies next in line Without bothering the others
Anderson Queue Lock
Art of Multiprocessor Programming 91 Anderson Queue Lock flags next T F F F F F F F