Futures, Scheduling, and Work Distribution
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
How to write Parallel Apps?
- Split a program into parallel parts, in an effective way
- Thread management
Matrix Multiplication
class Worker extends Thread {      // each worker is a thread
  int row, col;                    // which matrix entry to compute
  Worker(int row, int col) {
    this.row = row;
    this.col = col;
  }
  public void run() {              // the actual computation
    double dotProduct = 0.0;
    for (int i = 0; i < n; i++)
      dotProduct += a[row][i] * b[i][col];
    c[row][col] = dotProduct;
  }
}
Matrix Multiplication
void multiply() {
  Worker[][] worker = new Worker[n][n];
  for (int row …)                  // create n x n workers
    for (int col …)
      worker[row][col] = new Worker(row, col);
  for (int row …)                  // start them
    for (int col …)
      worker[row][col].start();
  for (int row …)                  // wait for them to finish
    for (int col …)
      worker[row][col].join();
}
What’s wrong with this picture?
Thread Overhead
- Threads require resources: memory for stacks, setup and teardown costs, scheduler overhead
- With short-lived threads, the ratio of useful work to overhead is bad
Thread Pools
- More sensible to keep a pool of long-lived threads
- Threads are assigned short-lived tasks: run the task, rejoin the pool, wait for the next assignment
Thread Pool = Abstraction
- Insulates the programmer from the platform: big machine, big pool; small machine, small pool
- Portable code that works across platforms: worry about the algorithm, not the platform
ExecutorService Interface
- In java.util.concurrent
- Task = Runnable object, if no result value is expected; the executor calls its run() method
- Task = Callable<T> object, if a result value of type T is expected; the executor calls its T call() method
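A minimal sketch of the two task flavors (class and variable names here are illustrative, not from the slides):

```java
import java.util.concurrent.*;

public class SubmitDemo {
    // Callable<Integer> task: a result is expected, so the pool calls call()
    static int computeWithCallable() throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(2);
        Callable<Integer> answer = () -> 6 * 7;
        Future<Integer> future = exec.submit(answer);
        int value = future.get();      // blocks until the value is available
        exec.shutdown();
        return value;
    }

    public static void main(String[] args) throws Exception {
        // Runnable task: no result value, so submit() returns Future<?>
        ExecutorService exec = Executors.newFixedThreadPool(2);
        Runnable hello = () -> System.out.println("running");
        exec.submit(hello).get();      // get() returns null for Runnables
        exec.shutdown();

        System.out.println(computeWithCallable()); // prints 42
    }
}
```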
Future<T>
Callable<T> task = …;
Future<T> future = executor.submit(task);
T value = future.get();

Submitting a Callable<T> task returns a Future<T> object. The Future’s get() method blocks until the value is available.
Future<?>
Runnable task = …;
Future<?> future = executor.submit(task);
future.get();

Submitting a Runnable task returns a Future<?> object. The Future’s get() method blocks until the computation is complete.
Note
ExecutorService submissions, like New England traffic signs, are purely advisory in nature. The executor, like the Boston driver, is free to ignore any such advice, and could execute tasks sequentially …
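For instance, a single-threaded executor is a perfectly legal ExecutorService that runs every submitted task sequentially, in submission order (a small sketch; names are illustrative):

```java
import java.util.*;
import java.util.concurrent.*;

public class SequentialExecutorDemo {
    static List<Integer> runOrder() throws Exception {
        // a legal executor that "ignores the advice" and serializes everything
        ExecutorService exec = Executors.newSingleThreadExecutor();
        List<Integer> order = Collections.synchronizedList(new ArrayList<>());
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            final int id = i;
            futures.add(exec.submit(() -> { order.add(id); }));
        }
        for (Future<?> f : futures) f.get();  // wait for all tasks
        exec.shutdown();
        return order;  // always [0, 1, 2, 3]: one task at a time, FIFO
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runOrder());
    }
}
```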
Matrix Addition
4 parallel additions
Matrix Addition Task
class AddTask implements Runnable {
  Matrix a, b; // add this!
  public void run() {
    if (a.dim == 1) {
      c[0][0] = a[0][0] + b[0][0];  // base case: add directly
    } else {
      // partition a, b into half-size matrices a_ij and b_ij
      // (a constant-time operation)
      // submit 4 tasks:
      Future<?> f00 = exec.submit(add(a00, b00));
      …
      Future<?> f11 = exec.submit(add(a11, b11));
      // let them finish:
      f00.get(); …; f11.get();
    }
  }
}
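The slide's Matrix class and add(…) helper are pseudocode; a self-contained, compilable sketch of the same divide-and-conquer addition on plain double[][] arrays (assuming the dimension is a power of 2) might look like:

```java
import java.util.concurrent.*;

public class ParallelAdd {
    static ExecutorService exec = Executors.newCachedThreadPool();

    // add the dim x dim blocks of a and b at (row, col) into c
    static void add(double[][] a, double[][] b, double[][] c,
                    int row, int col, int dim) throws Exception {
        if (dim == 1) {
            c[row][col] = a[row][col] + b[row][col];   // base case
        } else {
            int h = dim / 2;
            Future<?>[] f = new Future<?>[4];
            int i = 0;
            for (int dr = 0; dr < dim; dr += h)        // submit 4 quadrant tasks
                for (int dc = 0; dc < dim; dc += h) {
                    final int r = row + dr, c2 = col + dc;
                    f[i++] = exec.submit(() -> { add(a, b, c, r, c2, h); return null; });
                }
            for (Future<?> g : f) g.get();             // let them finish
        }
    }

    public static void main(String[] args) throws Exception {
        double[][] a = {{1, 2}, {3, 4}}, b = {{10, 20}, {30, 40}};
        double[][] c = new double[2][2];
        add(a, b, c, 0, 0, 2);
        System.out.println(java.util.Arrays.deepToString(c)); // [[11.0, 22.0], [33.0, 44.0]]
        exec.shutdown();
    }
}
```

The cached thread pool creates fresh threads on demand, so the blocking get() calls inside pool threads cannot deadlock the recursion.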
Dependencies
- The matrix example is not typical: its tasks are independent
- Independent: you don’t need the results of one task to complete another
- Often, tasks are not independent
Fibonacci
F(n) = 1, if n = 0 or 1
F(n) = F(n-1) + F(n-2), otherwise

Note the potential parallelism, and the dependencies.
Disclaimer
- This Fibonacci implementation is egregiously inefficient, so don’t try this at home or at work!
- But it illustrates our point: how to deal with dependencies
- Exercise: make this implementation efficient!
Multithreaded Fibonacci
class FibTask implements Callable<Integer> {
  static ExecutorService exec = Executors.newCachedThreadPool();
  int arg;
  public FibTask(int n) { arg = n; }
  public Integer call() throws Exception {
    if (arg > 2) {
      // parallel calls:
      Future<Integer> left = exec.submit(new FibTask(arg-1));
      Future<Integer> right = exec.submit(new FibTask(arg-2));
      // pick up & combine results:
      return left.get() + right.get();
    } else {
      return 1;
    }
  }
}
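A self-contained, compilable version with a small driver (note that call() must declare throws Exception, because Future.get() throws checked exceptions that the slide elides):

```java
import java.util.concurrent.*;

public class FibDemo {
    static class FibTask implements Callable<Integer> {
        static ExecutorService exec = Executors.newCachedThreadPool();
        int arg;
        FibTask(int n) { arg = n; }
        public Integer call() throws Exception {
            if (arg > 2) {
                // parallel calls
                Future<Integer> left  = exec.submit(new FibTask(arg - 1));
                Future<Integer> right = exec.submit(new FibTask(arg - 2));
                return left.get() + right.get();  // pick up & combine results
            } else {
                return 1;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int result = FibTask.exec.submit(new FibTask(6)).get();
        System.out.println(result); // F(6) with F(1) = F(2) = 1, i.e. 8
        FibTask.exec.shutdown();
    }
}
```

The cached thread pool spawns a new thread for each submission, so the recursive blocking get() calls do not starve (a fixed-size pool could deadlock here).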
Dynamic Behavior
- A multithreaded program is a directed acyclic graph (DAG) that unfolds dynamically
- Each node is a single unit of work
Fibonacci DAG
(figure: the DAG for fib(4) unfolds dynamically: fib(4) spawns fib(3) and fib(2), fib(3) spawns fib(2) and fib(1), each fib(2) spawns fib(1) leaves; call edges fan out and get edges join the results)
How Parallel is That?
- Define work: the total time on one processor
- Define critical-path length: the longest dependency path; can’t beat that!
Unfolded DAG

(figure: an example computation DAG of 18 unit-work nodes, analyzed on the next slides)
Parallelism?
Serial fraction = 3/18 = 1/6. Amdahl’s Law says speedup cannot exceed 6.
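Checking the arithmetic: with serial fraction s, Amdahl's Law bounds the speedup on P processors by 1/(s + (1-s)/P), which tends to 1/s as P grows. For s = 3/18 = 1/6, the limit is 6:

```java
public class AmdahlCheck {
    // Amdahl's Law: speedup on P processors with serial fraction s
    static double speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    public static void main(String[] args) {
        double s = 3.0 / 18.0;                     // serial fraction 1/6
        System.out.println(speedup(s, 1_000_000)); // just under 6
        System.out.println(1.0 / s);               // the limit: 6.0
    }
}
```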
Work?
T_1: the time needed on one processor. Just count the nodes: T_1 = 18.
Critical Path?
T_∞: the time needed on as many processors as you like. The longest path: T_∞ = 9.
Notation Watch
- T_P = time on P processors
- T_1 = work (time on 1 processor)
- T_∞ = critical-path length (time on ∞ processors)
Simple Laws
- Work Law: T_P ≥ T_1/P (in one step, can’t do more than P work)
- Critical Path Law: T_P ≥ T_∞ (can’t beat infinite resources)
Performance Measures
- Speedup on P processors: the ratio T_1/T_P, how much faster with P processors
- Linear speedup: T_1/T_P = Θ(P)
- Max speedup (average parallelism): T_1/T_∞
Sequential Composition
A followed by B. Work: T_1(A) + T_1(B). Critical path: T_∞(A) + T_∞(B).
Parallel Composition
A in parallel with B. Work: T_1(A) + T_1(B). Critical path: max{T_∞(A), T_∞(B)}.
Matrix Addition
4 parallel additions
Addition
Let A_P(n) be the running time for an n x n matrix on P processors. For example, A_1(n) is the work, and A_∞(n) is the critical-path length.
Addition
Work is A_1(n) = 4 A_1(n/2) + Θ(1): four spawned additions, plus partitioning, synchronization, etc.
Addition
Work is A_1(n) = 4 A_1(n/2) + Θ(1) = Θ(n²), the same as double-loop summation.
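The Θ(n²) bound follows from the master theorem; a quick sketch:

```latex
% Work recurrence for parallel matrix addition:
%   a = 4 subproblems, each of size n/2 (b = 2), f(n) = \Theta(1).
% Master theorem, case 1: n^{\log_b a} = n^{\log_2 4} = n^2 dominates f(n), so
\[
  A_1(n) = 4\,A_1(n/2) + \Theta(1) = \Theta\!\left(n^{\log_2 4}\right) = \Theta(n^2).
\]
```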
Addition
Critical-path length is A_∞(n) = A_∞(n/2) + Θ(1): the spawned additions run in parallel, plus partitioning, synchronization, etc.
Addition
Critical-path length is A_∞(n) = A_∞(n/2) + Θ(1) = Θ(log n).
Matrix Multiplication Redux
First Phase …
8 multiplications
Second Phase …
4 additions
Multiplication
Work is M_1(n) = 8 M_1(n/2) + A_1(n): eight spawned half-size multiplications, plus the final addition.
Multiplication
Work is M_1(n) = 8 M_1(n/2) + Θ(n²) = Θ(n³), the same as the serial triple-nested loop.
Multiplication
Critical-path length is M_∞(n) = M_∞(n/2) + A_∞(n): the half-size multiplications run in parallel, followed by the final addition.
Multiplication
Critical-path length is M_∞(n) = M_∞(n/2) + A_∞(n) = M_∞(n/2) + Θ(log n) = Θ(log² n).
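Unrolling the recurrence shows where Θ(log² n) comes from:

```latex
% Critical-path recurrence: M_\infty(n) = M_\infty(n/2) + \Theta(\log n).
% Unrolling over the \log_2 n levels of recursion:
\[
  M_\infty(n) = \sum_{i=0}^{\log_2 n} \Theta\!\left(\log \frac{n}{2^i}\right)
              = \Theta\!\left(\sum_{i=0}^{\log_2 n} (\log n - i)\right)
              = \Theta(\log^2 n).
\]
```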
Parallelism
Parallelism is M_1(n)/M_∞(n) = Θ(n³/log² n). To multiply two 1000 x 1000 matrices: 10⁹/10² = 10⁷, much more than the number of processors on any real machine.
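The back-of-the-envelope figure is easy to check: for n = 1000, the work is n³ = 10⁹ and log₂²(1000) is roughly 10², so the parallelism is about 10⁷:

```java
public class ParallelismCheck {
    public static void main(String[] args) {
        double n = 1000.0;
        double work = n * n * n;                              // n^3 = 10^9
        double span = Math.pow(Math.log(n) / Math.log(2), 2); // log2(1000)^2 ~ 10^2
        System.out.println(work / span);                      // roughly 10^7
    }
}
```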
Shared-Memory Multiprocessors
- Parallel applications do not have direct access to hardware processors
- They share the machine with a mix of other jobs, all running together, coming and going dynamically
Ideal Scheduling Hierarchy
Tasks → user-level scheduler → processors
Realistic Scheduling Hierarchy
Tasks → user-level scheduler → threads → kernel-level scheduler → processors
For Example
- Initially, all P processors are available for the application
- A serial computation takes over one processor, leaving P-1 for us
- It waits for I/O, and we get that processor back …
Speedup
- Map threads onto P processes
- Cannot get P-fold speedup: what if the kernel doesn’t cooperate?
- Can try for speedup proportional to P
Scheduling Hierarchy
- User-level scheduler: tells the kernel which threads are ready
- Kernel-level scheduler: picks p_i threads to schedule at step i (synchronous, for analysis, not correctness!)
Greedy Scheduling
- A node is ready if its predecessors are done
- Greedy: schedule as many ready nodes as possible
- An optimal scheduler is greedy (why?), but not every greedy scheduler is optimal
Greedy Scheduling
There are P processors.
- Complete step: ≥ P nodes ready; run any P of them
- Incomplete step: < P nodes ready; run them all
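The complete/incomplete-step behavior can be simulated on a small DAG (a sketch with an illustrative diamond-shaped example, not from the slides); counting steps lets us check the greedy bound proved below:

```java
import java.util.*;

public class GreedySim {
    // preds[v] = predecessors of node v; run up to P ready nodes per step
    static int schedule(int[][] preds, int P) {
        int n = preds.length;
        boolean[] done = new boolean[n];
        int steps = 0, remaining = n;
        while (remaining > 0) {
            List<Integer> ready = new ArrayList<>();
            for (int v = 0; v < n; v++) {
                if (done[v]) continue;
                boolean ok = true;
                for (int u : preds[v]) if (!done[u]) ok = false;
                if (ok) ready.add(v);            // all predecessors are done
            }
            // greedy: run as many ready nodes as possible, at most P
            for (int i = 0; i < Math.min(P, ready.size()); i++) {
                done[ready.get(i)] = true;
                remaining--;
            }
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        // diamond DAG: 0 -> {1, 2} -> 3; work T_1 = 4, critical path T_inf = 3
        int[][] preds = { {}, {0}, {0}, {1, 2} };
        int t2 = schedule(preds, 2);
        System.out.println(t2);              // 3 steps with P = 2
        System.out.println(t2 <= 4 / 2 + 3); // greedy bound T_P <= T_1/P + T_inf
    }
}
```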
Theorem
For any greedy scheduler: T_P ≤ T_1/P + T_∞
- T_P: the actual time
- T_1/P: no better than the work divided among P processors
- T_∞: no better than the critical-path length
T_P ≤ T_1/P + T_∞

Proof: The number of complete steps is at most T_1/P, because each one performs P work. The number of incomplete steps is at most T_∞, because each one shortens the unexecuted critical path by 1. QED
Near-Optimality
- Theorem: any greedy scheduler is within a factor of 2 of optimal
- Remark: finding the optimal schedule is NP-hard!
Proof of Near-Optimality
Let T_P* be the optimal time.
T_P* ≥ max{T_1/P, T_∞} (from the work and critical path laws)
T_P ≤ T_1/P + T_∞ (theorem)
T_P ≤ 2 max{T_1/P, T_∞}
T_P ≤ 2 T_P*
QED
Work Distribution
zzz…
Work Dealing
Yes!
The Problem with Work Dealing
D’oh! D’oh! D’oh!
Work Stealing
No work… Yes!
Lock-Free Work Stealing
- Each thread has a pool of ready work
- Remove work without synchronizing
- If you run out of work, steal someone else’s
- Choose the victim at random
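The bullets above can be sketched in a few lines. This is NOT the lock-free double-ended queue developed later in the chapter: here java.util.concurrent's ConcurrentLinkedDeque stands in for the lock-free structure, and all names are illustrative. Each worker pops its own work from one end and steals from a random victim's opposite end:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkStealingSketch {
    static int run(int workers, int tasksPerWorker) throws Exception {
        final int total = workers * tasksPerWorker;
        List<Deque<Runnable>> pools = new ArrayList<>();
        AtomicInteger executed = new AtomicInteger();
        for (int i = 0; i < workers; i++) {
            Deque<Runnable> pool = new ConcurrentLinkedDeque<>();
            for (int t = 0; t < tasksPerWorker; t++)
                pool.addLast(executed::incrementAndGet);   // each task counts itself
            pools.add(pool);
        }
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            final int me = i;
            threads[i] = new Thread(() -> {
                while (executed.get() < total) {
                    Runnable task = pools.get(me).pollLast();  // own work: LIFO end
                    if (task == null) {
                        int victim = ThreadLocalRandom.current().nextInt(workers);
                        task = pools.get(victim).pollFirst();  // steal: FIFO end
                    }
                    if (task != null) task.run();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        return executed.get();   // every task ran exactly once
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(4, 100)); // 400
    }
}
```

Popping and stealing from opposite ends keeps owner and thieves mostly out of each other's way, which is the point of the per-thread pools.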
Local Work Pools