
Game of Loom: implementation patterns and performance implications playing with virtual threads

Virtual threads will very likely be the next big game changer in the Java ecosystem, allowing developers to combine the scalability of asynchronous programming models with the simplicity of synchronous code. Their main claim is that, unlike native threads, which are a costly and therefore scarce resource, you can create as many virtual threads as you want, with a known and much lower memory and performance impact than native ones. But is this always true? What are the costs of scheduling thousands or even millions of virtual threads? Do the more frequent context switches have performance implications? What about the cache misses that these context switches could potentially imply? During this talk we will try to answer these questions in a fun way, by analyzing an implementation of Conway's classic Game of Life based on the communicating sequential processes (CSP) model, using both virtual and native threads with different algorithms in order to compare their performance. Based on this analysis we will also try to derive some rules of thumb on when and how to use virtual threads.

Mario Fusco

September 18, 2022


Transcript

  1. What virtual threads are … ❖ JDK native threads are

    thin wrappers around operating system threads
    ➢ Native threads are costly and memory consuming
    ➢ You cannot have too many of them
    ❖ Virtual threads make better use of OS threads by getting rid of the 1:1 relationship with operating system threads
    ➢ Many virtual threads can be multiplexed on the same OS thread (the carrier), as in the sketch below
    ➢ When a virtual thread calls a blocking operation, the JDK performs a non-blocking OS call and automatically suspends the virtual thread until the operation finishes
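A minimal sketch of this multiplexing (assuming JDK 21, or JDK 19+ with preview features enabled; the thread count and sleep time are arbitrary): ten thousand virtual threads block at the same time, yet they are served by a handful of carrier threads because the blocking call unmounts each virtual thread.

    import java.time.Duration;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class CarrierSharingDemo {
        public static void main(String[] args) {
            // 10,000 virtual threads that all block at the same time:
            // they are multiplexed over a small pool of carrier (OS) threads
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 10_000; i++) {
                    executor.submit(() -> {
                        try {
                            // the blocking call suspends only the virtual thread;
                            // its carrier is immediately free to run other virtual threads
                            Thread.sleep(Duration.ofSeconds(1));
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                }
            } // close() waits for all submitted tasks to finish
        }
    }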
  2. … and why we need them - the WRONG reason

    ❖ Native threads:
    ➢ 2kB metadata
    ➢ Preallocated 1MB stack
    ➢ 1-10 µs for context switch
    ❖ Virtual threads:
    ➢ 200-300B metadata
    ➢ Pay-as-you-go stack (allocated on heap)
    ➢ Some ns (or below 1µs) for context switch
  3. … and why we need them - the WRONG reason

    ❖ Native threads:
    ➢ 2kB metadata
    ➢ Preallocated 1MB stack
    ➢ 1-10 µs for context switch
    ❖ Virtual threads:
    ➢ 200-300B metadata
    ➢ Pay-as-you-go stack (allocated on heap)
    ➢ Some ns (or below 1µs) for context switch
    You can create MILLIONS of threads!!!
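A hedged illustration of the "millions of threads" claim (assuming JDK 21; the count and sleep duration are arbitrary). Based on the 1 MB preallocated stack mentioned above, a million platform threads would reserve on the order of a terabyte of stack space, while virtual threads only allocate small, growable stacks on the heap.

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.List;

    public class MillionThreads {
        public static void main(String[] args) throws InterruptedException {
            List<Thread> threads = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) {
                // each virtual thread only needs a small, heap-allocated stack
                threads.add(Thread.startVirtualThread(() -> {
                    try {
                        Thread.sleep(Duration.ofSeconds(10));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }));
            }
            for (Thread t : threads) {
                t.join();
            }
            System.out.println(threads.size() + " virtual threads completed");
        }
    }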
  4. Virtual threads aren't great because you can have gazillions of

    them. That's just a side effect of their design and behavior, which are the actual great things.
  5. … and why we need them - the RIGHT reason

    Virtual threads allow you to write asynchronous code in a way that looks synchronous. The main goal of virtual threads is having scalable applications without paying the cost of more complex code.
    [Chart: program scalability vs. program complexity, comparing one thread per task, reactive programming and virtual threads]
  6. … and why we need them - the RIGHT reason

    What virtual threads are really good for is… waiting.
    Virtual threads allow you to write asynchronous code in a way that looks synchronous. The main goal of virtual threads is having scalable applications without paying the cost of more complex code.
    [Chart: program scalability vs. program complexity, comparing one thread per task, reactive programming and virtual threads]
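A sketch of what "good at waiting" means in practice (assuming JDK 21; the URLs and the handler are made up for illustration): the handler below is plain blocking code that is easy to read, debug and profile, yet ten thousand concurrent requests only cost ten thousand cheap virtual threads parked on I/O.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class WaitingDemo {
        // plain blocking code: reads top to bottom, no callbacks ...
        static String handleRequest(HttpClient client, String userId) throws Exception {
            HttpResponse<String> user = client.send(
                    HttpRequest.newBuilder(URI.create("https://example.com/users/" + userId)).build(),
                    HttpResponse.BodyHandlers.ofString());   // blocks, but only the virtual thread
            HttpResponse<String> orders = client.send(
                    HttpRequest.newBuilder(URI.create("https://example.com/orders?user=" + userId)).build(),
                    HttpResponse.BodyHandlers.ofString());   // blocks again
            return user.body() + orders.body();
        }

        public static void main(String[] args) {
            HttpClient client = HttpClient.newHttpClient();
            // ... but it scales, because each waiting request occupies only a cheap virtual thread
            try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 10_000; i++) {
                    String id = String.valueOf(i);
                    pool.submit(() -> handleRequest(client, id));
                }
            }
        }
    }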
  7. The Spiral of Innovation Concurrent Programming in Java Green threads

    Native threads Events ( actors / message passing )
  8. The Spiral of Innovation Concurrent Programming in Java Green threads

    Native threads Events ( actors / message passing ) Reactive programming
  9. The Spiral of Innovation Concurrent Programming in Java Green threads

    Native threads Events ( actors / message passing ) Reactive programming Virtual threads (Project Loom)
  10. Virtual threads are syntactic sugar for reactive programming Virtual threads

    do not enable anything that was impossible before. They allow you to achieve the same results as reactive programming in a much easier and less verbose way.
  11. Virtual threads are syntactic sugar for reactive programming // sum

    ages of persons whose name starts with A ...
    int sumAges = persons.stream() .filter( p -> p.getName().startsWith("A") ) .reduce( 0, (acc, p) -> acc + p.getAge(), Integer::sum);
    Virtual threads do not enable anything that was impossible before. They allow you to achieve the same results as reactive programming in a much easier and less verbose way. Loom is for concurrent programming what lambdas have been for functional programming.
  12. Virtual threads are syntactic sugar for reactive programming // sum

    ages of persons whose name starts with A ...
    int sumAges = persons.stream() .filter( p -> p.getName().startsWith("A") ) .reduce( 0, (acc, p) -> acc + p.getAge(), Integer::sum);
    // ... same without lambdas
    sumAges = persons.stream() .filter(new Predicate<Person>() { @Override public boolean test(Person person) { return person.getName().startsWith("A"); } }) .reduce(0, new BiFunction<Integer, Person, Integer>() { @Override public Integer apply(Integer acc, Person person) { return acc + person.getAge(); } }, new BinaryOperator<Integer>() { @Override public Integer apply(Integer a, Integer b) { return a + b; } });
    Virtual threads do not enable anything that was impossible before. They allow you to achieve the same results as reactive programming in a much easier and less verbose way. Loom is for concurrent programming what lambdas have been for functional programming. You could do FP even without λs, but at what cost?
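To make the analogy concrete on the concurrency side, here is a hedged sketch (the fetchUser/fetchOrders methods are hypothetical, not from the talk) of the same two-step call written in reactive style with CompletableFuture and in direct style on a virtual thread.

    import java.util.concurrent.CompletableFuture;

    public class ReactiveVsLoom {
        // hypothetical blocking services, used only for illustration
        static String fetchUser(String id)     { return "user-" + id;  }   // imagine a blocking remote call
        static String fetchOrders(String user) { return user + ":orders"; }

        // reactive style: the flow is chopped into callbacks
        static CompletableFuture<String> reactive(String id) {
            return CompletableFuture.supplyAsync(() -> fetchUser(id))
                    .thenApplyAsync(user -> fetchOrders(user))
                    .exceptionally(t -> "error: " + t.getMessage());
        }

        // Loom style: the same logic reads top to bottom and can use plain try/catch
        static String direct(String id) {
            try {
                String user = fetchUser(id);   // just blocks the (cheap) virtual thread
                return fetchOrders(user);
            } catch (RuntimeException t) {
                return "error: " + t.getMessage();
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println(reactive("42").join());
            Thread vt = Thread.startVirtualThread(() -> System.out.println(direct("42")));
            vt.join();
        }
    }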
  13. Virtual threads are syntactic sugar for reactive programming … but

    not only that — but that is a topic for other conversations (that should have been a blog post 😎)
  14. (Possible) issues with Virtual Threads ❖ Loom uses the Java

    heap to store virtual thread stacks and other metadata. For a large number of virtual threads with deep stacks this may have a relevant impact on the GC.
  15. (Possible) issues with Virtual Threads ❖ Loom uses the Java

    heap to store virtual thread stacks and other metadata. For a large number of virtual threads with deep stacks this may have a relevant impact on the GC.
    ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, and therefore block not only the virtual thread but also its carrier.
  16. (Possible) issues with Virtual Threads ❖ Loom uses the Java

    heap to store virtual thread stacks and other metadata. For a large number of virtual threads with deep stacks this may have a relevant impact on the GC.
    ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, and therefore block not only the virtual thread but also its carrier.
    ❖ Massive context switches (in user space) of a huge number of virtual threads can potentially cause a larger number of cache misses.
  17. (Possible) issues with Virtual Threads ❖ Loom uses the Java

    heap to store virtual thread stacks and other metadata. For a large number of virtual threads with deep stacks this may have a relevant impact on the GC.
    ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, and therefore block not only the virtual thread but also its carrier.
    ❖ Massive context switches (in user space) of a huge number of virtual threads can potentially cause a larger number of cache misses.
    ❖ The virtual thread scheduler is not pluggable and is currently hardcoded to a ForkJoinPool-based scheduler.
  18. (Possible) issues with Virtual Threads ❖ Loom uses the Java

    heap to store virtual thread stacks and other metadata. For a large number of virtual threads with deep stacks this may have a relevant impact on the GC.
    ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, and therefore block not only the virtual thread but also its carrier.
    ❖ Massive context switches (in user space) of a huge number of virtual threads can potentially cause a larger number of cache misses.
    ❖ The virtual thread scheduler is not pluggable and is currently hardcoded to a ForkJoinPool-based scheduler.
    ❖ Virtual threads are not preemptive: they cannot be descheduled while running heavy computations that never call a JDK blocking method, so they are not a good fit for CPU-bound tasks when fairness is key.
  19. (Possible) issues with Virtual Threads ❖ Loom uses the Java

    heap to store virtual thread stacks and other metadata. For a large number of virtual threads with deep stacks this may have a relevant impact on the GC.
    ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, and therefore block not only the virtual thread but also its carrier (a small sketch follows this list).
    ❖ Massive context switches (in user space) of a huge number of virtual threads can potentially cause a larger number of cache misses.
    ❖ The virtual thread scheduler is not pluggable and is currently hardcoded to a ForkJoinPool-based scheduler.
    ❖ Virtual threads are not preemptive: they cannot be descheduled while running heavy computations that never call a JDK blocking method, so they are not a good fit for CPU-bound tasks when fairness is key.
    ❖ Virtual threads support thread locals, but since virtual threads can be very numerous, thread locals could have a huge memory footprint and should be used sparingly. They could be replaced by extent-local variables (JEP 429).
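A minimal sketch of the pinning issue (assuming the JDK 19–21 behaviour current at the time of the talk, where synchronized pins its carrier; running with -Djdk.tracePinnedThreads=full reports the pinned stack). The same blocking call is harmless when guarded by a java.util.concurrent lock.

    import java.time.Duration;
    import java.util.concurrent.locks.ReentrantLock;

    public class PinningDemo {
        private static final Object monitor = new Object();
        private static final ReentrantLock lock = new ReentrantLock();

        public static void main(String[] args) throws InterruptedException {
            Thread pinned = Thread.startVirtualThread(() -> {
                synchronized (monitor) {   // monitor held: the virtual thread is pinned ...
                    sleep();               // ... so this blocking call also blocks the carrier
                }
            });
            Thread notPinned = Thread.startVirtualThread(() -> {
                lock.lock();               // j.u.c locks are Loom-friendly
                try {
                    sleep();               // the virtual thread unmounts, the carrier stays free
                } finally {
                    lock.unlock();
                }
            });
            pinned.join();
            notPinned.join();
        }

        private static void sleep() {
            try {
                Thread.sleep(Duration.ofMillis(100));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }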
  20. What is CSP? CSP is a concurrent programming model using

    communication as its synchronization primitive
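As a tiny illustration of "communication as synchronization" (a sketch, not code from the talk; assumes JDK 21): two virtual threads rendezvous on a SynchronousQueue, so neither needs any other lock or signal.

    import java.util.concurrent.SynchronousQueue;

    public class CspHello {
        public static void main(String[] args) throws InterruptedException {
            // a SynchronousQueue has no capacity: every put waits for a matching take,
            // so the channel itself is the synchronization (rendezvous) point
            SynchronousQueue<String> channel = new SynchronousQueue<>();

            Thread producer = Thread.startVirtualThread(() -> {
                try {
                    channel.put("ping");                    // blocks until the consumer is ready
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Thread consumer = Thread.startVirtualThread(() -> {
                try {
                    System.out.println(channel.take());     // blocks until the producer offers a value
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            producer.join();
            consumer.join();
        }
    }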
  21. Game of Loom – The original Loom based implementation ❖

    All communications between threads happen through a (unidirectional) blocking channel (a usage sketch follows the class):
    public class Channel<T> {
        private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();

        T take() {
            try {
                return queue.take();
            } catch (InterruptedException e) {
                return null;
            }
        }

        void put(T value) {
            try {
                queue.put(value);
            } catch (InterruptedException e) {
                // abort
            }
        }
    }
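A possible usage sketch of the Channel above (assuming it lives in the same package, since take/put are package-private): a producing and a consuming virtual thread synchronize only through the channel.

    public class ChannelDemo {
        public static void main(String[] args) throws InterruptedException {
            Channel<Integer> channel = new Channel<>();

            Thread producer = Thread.startVirtualThread(() -> {
                for (int i = 0; i < 5; i++) {
                    channel.put(i);   // never blocks here, since the underlying queue is unbounded
                }
            });
            Thread consumer = Thread.startVirtualThread(() -> {
                for (int i = 0; i < 5; i++) {
                    // take() parks the consuming virtual thread until a value is available
                    System.out.println("received " + channel.take());
                }
            });
            producer.join();
            consumer.join();
        }
    }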
  22. Game of Loom – The original Loom based implementation ❖

    Each grid cell is an independent process communicating with its neighbours via channels
    public class Cell {
        private boolean alive;
        private final Channel<Boolean> tickChannel;
        private final Channel<Boolean> resultChannel;
        private final List<Channel<Boolean>> inChannels;
        private final List<Channel<Boolean>> outChannels;

        Cell(CellOptions options) {
            this.alive = options.alive();
            this.tickChannel = options.tickChannel();
            this.resultChannel = options.resultChannel();
            this.inChannels = options.inChannels();
            this.outChannels = options.outChannels();
        }

        void start() {
            Thread.startVirtualThread(this::run);
        }
    }
  23. Game of Loom – The original Loom based implementation ❖

    Each grid cell reads its neighbours' states, then calculates and publishes its own, in an infinite loop
    public class Cell {
        private void run() {
            while (true) {
                tickChannel.take();                        // wait for tick stimulus
                outChannels.forEach(ch -> ch.put(alive));  // announce liveness to neighbors
                // receive liveness from neighbors
                int neighbors = inChannels.stream().map(Channel::take).mapToInt(b -> b ? 1 : 0).sum();
                alive = nextState(alive, neighbors);       // calculate next state based on game of life rules
                resultChannel.put(alive);                  // announce resulting next state
            }
        }

        private static boolean nextState(boolean alive, int neighbors) {
            if (alive) {
                return neighbors == 2 || neighbors == 3;
            }
            return neighbors == 3;
        }
    }
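As a sanity-check sketch (not from the talk), the rules encoded in nextState can be verified in isolation by reproducing the same method, since it is private in Cell:

    public class GameOfLifeRules {
        // same rules as Cell.nextState above, reproduced here to test them in isolation
        static boolean nextState(boolean alive, int neighbors) {
            if (alive) {
                return neighbors == 2 || neighbors == 3;   // survival
            }
            return neighbors == 3;                         // birth
        }

        public static void main(String[] args) {
            assert !nextState(true, 1);                         // under-population
            assert nextState(true, 2) && nextState(true, 3);    // survival
            assert !nextState(true, 4);                         // over-population
            assert nextState(false, 3);                         // reproduction
            assert !nextState(false, 2);
            System.out.println("Conway's rules hold (run with -ea to enable asserts)");
        }
    }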
  24. Game of Loom – The original Loom based implementation ❖

    The GameOfLife class coordinates all cells, gathering their states and publishing them for UI consumption
    public class GameOfLife {
        void start() {
            cells.forEach(Cell::start);
            Thread.startVirtualThread(this::run);
        }

        private void run() {
            while (true) {
                dims.forEachRowCol((r, c) -> tickChannels.get(r).get(c).put(true));       // emit tick for every cell
                boolean[][] grid = new boolean[dims.rows()][dims.cols()];                 // prepare result boolean matrix
                dims.forEachRowCol((r, c) -> grid[r][c] = resultChannels.get(r).get(c).take()); // populate results
                gridChannel.put(grid);                                                    // emit aggregated liveness matrix
                sleep();
            }
        }
    }
  25. Benchmarking Game of Life @State(Scope.Benchmark) @Warmup(iterations = 5) @Measurement(iterations =

    10) @Fork(2) public class GameOfLifeBenchmark { @Param({"5", "25", "100"}) // 874, 5074, 49324 cells private int padding; @Param({"true", "false"}) private boolean useVirtualThreads; @Param({"true", "false"}) private boolean threadPerCell; @Param({"BlockingQueue", "BlockingTransfer", "LockedSingleValue", "OneToOneParking"}) private String channelType; private GameOfLife gameOfLife; @Setup public void setup() { // setup GameOfLife according to benchmark parameters, omitted ... } @Benchmark @BenchmarkMode(Mode.Throughput) public boolean[][] benchmark() { return gameOfLife.calculateFrameBlocking(); } }
  26. Benchmarking Game of Life @State(Scope.Benchmark) @Warmup(iterations = 5) @Measurement(iterations =

    10) @Fork(2) public class GameOfLifeBenchmark { @Param({"5", "25", "100"}) // 874, 5074, 49324 cells private int padding; @Param({"true", "false"}) private boolean useVirtualThreads; @Param({"true", "false"}) private boolean threadPerCell; @Param({"BlockingQueue", "BlockingTransfer", "LockedSingleValue", "OneToOneParking"}) private String channelType; private GameOfLife gameOfLife; @Setup public void setup() { // setup GameOfLife according to benchmark parameters, omitted ... } @Benchmark @BenchmarkMode(Mode.Throughput) public boolean[][] benchmark() { return gameOfLife.calculateFrameBlocking(); } } Use either native or virtual threads
  27. Benchmarking Game of Life @State(Scope.Benchmark) @Warmup(iterations = 5) @Measurement(iterations =

    10) @Fork(2) public class GameOfLifeBenchmark { @Param({"5", "25", "100"}) // 874, 5074, 49324 cells private int padding; @Param({"true", "false"}) private boolean useVirtualThreads; @Param({"true", "false"}) private boolean threadPerCell; @Param({"BlockingQueue", "BlockingTransfer", "LockedSingleValue", "OneToOneParking"}) private String channelType; private GameOfLife gameOfLife; @Setup public void setup() { // setup GameOfLife according to benchmark parameters, omitted ... } @Benchmark @BenchmarkMode(Mode.Throughput) public boolean[][] benchmark() { return gameOfLife.calculateFrameBlocking(); } } Use either native or virtual threads Having one thread per cell or per core
  28. Benchmarking Game of Life @State(Scope.Benchmark) @Warmup(iterations = 5) @Measurement(iterations =

    10) @Fork(2) public class GameOfLifeBenchmark { @Param({"5", "25", "100"}) // 874, 5074, 49324 cells private int padding; @Param({"true", "false"}) private boolean useVirtualThreads; @Param({"true", "false"}) private boolean threadPerCell; @Param({"BlockingQueue", "BlockingTransfer", "LockedSingleValue", "OneToOneParking"}) private String channelType; private GameOfLife gameOfLife; @Setup public void setup() { // setup GameOfLife according to benchmark parameters, omitted ... } @Benchmark @BenchmarkMode(Mode.Throughput) public boolean[][] benchmark() { return gameOfLife.calculateFrameBlocking(); } } Use either native or virtual threads Having one thread per cell or per core Different channel implementations
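For completeness, one way to launch such a benchmark programmatically with the JMH Runner API (a sketch only; the parameter values are illustrative and the talk's actual command line is not shown):

    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class BenchmarkLauncher {
        public static void main(String[] args) throws RunnerException {
            Options options = new OptionsBuilder()
                    .include("GameOfLifeBenchmark")               // regex matching the benchmark class
                    .param("padding", "25")                       // restrict to one grid size
                    .param("useVirtualThreads", "true", "false")  // compare both thread kinds
                    .build();
            new Runner(options).run();
        }
    }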
  29. Native vs. Virtual Threads public class GameOfLife { private final

    Consumer<Runnable> runner; public GameOfLife(boolean useVirtualThreads) { if (useVirtualThreads) this.runner = Thread::startVirtualThread; else this.runner = Executors.newFixedThreadPool(getThreadPoolSize(), r -> { Thread t = new Thread(r); t.setDaemon(true); return t; })::execute; } public boolean[][] calculateFrame() { tick.tick(); // emit tick event for every cell boolean[][] grid = new boolean[dimensions.rows()][dimensions.cols()]; // prepare result boolean matrix resultChannels.forEachChannel( Channel::take, grid ); // populate matrix with results gridChannel.put(grid); // emit aggregated liveness matrix return grid; } public boolean[][] calculateFrameBlocking() { runner.accept(() -> calculateFrame()); return gridChannel.take(); // block until the result is ready } }
  30. Native vs. Virtual Threads public class GameOfLife { private final

    Consumer<Runnable> runner; public GameOfLife(boolean useVirtualThreads) { if (useVirtualThreads) this.runner = Thread::startVirtualThread; else this.runner = Executors.newFixedThreadPool(getThreadPoolSize(), r -> { Thread t = new Thread(r); t.setDaemon(true); return t; })::execute; } public boolean[][] calculateFrame() { tick.tick(); // emit tick event for every cell boolean[][] grid = new boolean[dimensions.rows()][dimensions.cols()]; // prepare result boolean matrix resultChannels.forEachChannel( Channel::take, grid ); // populate matrix with results gridChannel.put(grid); // emit aggregated liveness matrix return grid; } public boolean[][] calculateFrameBlocking() { runner.accept(() -> calculateFrame()); return gridChannel.take(); // block until the result is ready } } Executes a Runnable on a virtual or a native thread
  31. Native vs. Virtual Threads public class GameOfLife { private final

    Consumer<Runnable> runner; public GameOfLife(boolean useVirtualThreads) { if (useVirtualThreads) this.runner = Thread::startVirtualThread; else this.runner = Executors.newFixedThreadPool(getThreadPoolSize(), r -> { Thread t = new Thread(r); t.setDaemon(true); return t; })::execute; } public boolean[][] calculateFrame() { tick.tick(); // emit tick event for every cell boolean[][] grid = new boolean[dimensions.rows()][dimensions.cols()]; // prepare result boolean matrix resultChannels.forEachChannel( Channel::take, grid ); // populate matrix with results gridChannel.put(grid); // emit aggregated liveness matrix return grid; } public boolean[][] calculateFrameBlocking() { runner.accept(() -> calculateFrame()); return gridChannel.take(); // block until the result is ready } } Executes a Runnable on a virtual or a native thread Called by UI to get next frame
  32. Native vs. Virtual Threads public class GameOfLife { private final

    Consumer<Runnable> runner; public GameOfLife(boolean useVirtualThreads) { if (useVirtualThreads) this.runner = Thread::startVirtualThread; else this.runner = Executors.newFixedThreadPool(getThreadPoolSize(), r -> { Thread t = new Thread(r); t.setDaemon(true); return t; })::execute; } public boolean[][] calculateFrame() { tick.tick(); // emit tick event for every cell boolean[][] grid = new boolean[dimensions.rows()][dimensions.cols()]; // prepare result boolean matrix resultChannels.forEachChannel( Channel::take, grid ); // populate matrix with results gridChannel.put(grid); // emit aggregated liveness matrix return grid; } public boolean[][] calculateFrameBlocking() { runner.accept(() -> calculateFrame()); return gridChannel.take(); // block until the result is ready } } Executes a Runnable on a virtual or a native thread Called by UI to get next frame Blocking headless version Used in benchmarks
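An alternative way (not the talk's code) to abstract over the two thread kinds is a ThreadFactory plus a thread-per-task executor; note that, unlike the fixed pool on the slide, this variant starts one platform thread per task when useVirtualThreads is false.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadFactory;

    public class RunnerFactory {
        // Same idea as the Consumer<Runnable> above, expressed with the JDK's thread builders:
        // the rest of the code only sees an ExecutorService, and the thread kind is a flag.
        static ExecutorService newRunner(boolean useVirtualThreads) {
            ThreadFactory factory = useVirtualThreads
                    ? Thread.ofVirtual().name("cell-", 0).factory()
                    : Thread.ofPlatform().daemon(true).name("cell-", 0).factory();
            return Executors.newThreadPerTaskExecutor(factory);
        }

        public static void main(String[] args) {
            try (ExecutorService runner = newRunner(true)) {
                runner.execute(() -> System.out.println("running on " + Thread.currentThread()));
            }
        }
    }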
  33. Native vs. Virtual Threads – Benchmark Results

    Benchmark                      (padding)  (useVirtualThreads)   Mode  Cnt    Score     Error  Units
    GameOfLifeBenchmark.benchmark      5       true                thrpt   20  972.997 ±  51.172  ops/s
    GameOfLifeBenchmark.benchmark      5       false               thrpt   20  128.482 ±   5.816  ops/s
    GameOfLifeBenchmark.benchmark     25       true                thrpt   20  118.315 ±   4.583  ops/s
    GameOfLifeBenchmark.benchmark     25       false               thrpt   20   17.654 ±   0.579  ops/s
    GameOfLifeBenchmark.benchmark    100       true                thrpt   20    4.376 ±   0.425  ops/s
    GameOfLifeBenchmark.benchmark    100       false               thrpt   20      NaN             ---
  34. Native vs. Virtual Threads – Benchmark Results

    Benchmark                      (padding)  (useVirtualThreads)   Mode  Cnt    Score     Error  Units
    GameOfLifeBenchmark.benchmark      5       true                thrpt   20  972.997 ±  51.172  ops/s
    GameOfLifeBenchmark.benchmark      5       false               thrpt   20  128.482 ±   5.816  ops/s
    GameOfLifeBenchmark.benchmark     25       true                thrpt   20  118.315 ±   4.583  ops/s
    GameOfLifeBenchmark.benchmark     25       false               thrpt   20   17.654 ±   0.579  ops/s
    GameOfLifeBenchmark.benchmark    100       true                thrpt   20    4.376 ±   0.425  ops/s
    GameOfLifeBenchmark.benchmark    100       false               thrpt   20      NaN             ---
    ~7x (virtual vs. native throughput) — OOM (the native-thread run on the largest grid fails)
  35. Mixed-Mode Flame Graphs

    ❖ X-axis: alphabetical stack sort, i.e. NOT A TIME SERIES!
    ❖ Width is proportional to sample presence
    ❖ Colors
    ➢ green - Java
    ➢ aqua - inlined Java
    ➢ orange - kernel code
    ➢ red - C libraries (e.g. JNI)
    ➢ yellow - C++ (e.g. JVM)
    ❖ Color intensity is randomized to differentiate frames
    ❖ Thread ID is added as a base frame
  36. Native vs. Virtual Threads – Context Switch jmh -prof “perfnorm:events=context-switches”

    Benchmark                              (padding)  (useVirtualThreads)   Mode  Cnt      Score    Error  Units
    GameOfLifeBenchmark.benchmark               25     true                thrpt   20    118.315 ±  4.583  ops/s
    GameOfLifeBenchmark:context-switches        25     true                thrpt   20     10.923           #/op
    GameOfLifeBenchmark.benchmark               25     false               thrpt   20     17.654 ±  0.579  ops/s
    GameOfLifeBenchmark:context-switches        25     false               thrpt   20  15741.886           #/op
  37. Native vs. Virtual Threads – Context Switch jmh -prof “perfnorm:events=context-switches”

    Benchmark                              (padding)  (useVirtualThreads)   Mode  Cnt      Score    Error  Units
    GameOfLifeBenchmark.benchmark               25     true                thrpt   20    118.315 ±  4.583  ops/s
    GameOfLifeBenchmark:context-switches        25     true                thrpt   20     10.923           #/op
    GameOfLifeBenchmark.benchmark               25     false               thrpt   20     17.654 ±  0.579  ops/s
    GameOfLifeBenchmark:context-switches        25     false               thrpt   20  15741.886           #/op
  38. Native vs. Virtual Threads – Context Switch

    [htop screenshots: native threads vs. virtual threads — CPU time split between kernel space and user space]
    jmh -prof “perfnorm:events=context-switches”
    Benchmark                              (padding)  (useVirtualThreads)   Mode  Cnt      Score    Error  Units
    GameOfLifeBenchmark.benchmark               25     true                thrpt   20    118.315 ±  4.583  ops/s
    GameOfLifeBenchmark:context-switches        25     true                thrpt   20     10.923           #/op
    GameOfLifeBenchmark.benchmark               25     false               thrpt   20     17.654 ±  0.579  ops/s
    GameOfLifeBenchmark:context-switches        25     false               thrpt   20  15741.886           #/op
  39. Experimenting with different Channels types ❖ All communications between threads

    happen through a (unidirectional) blocking channel
    public class Channel<T> {
        private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();

        T take() {
            try {
                return queue.take();
            } catch (InterruptedException e) {
                return null;
            }
        }

        void put(T value) {
            try {
                queue.put(value);
            } catch (InterruptedException e) {
                // abort
            }
        }
    }
  40. Experimenting with different Channels types ❖ All communications between threads

    happen through a (unidirectional) blocking channel
    ❖ The channel contains (at most) a single value at a time and is used to synchronize producer and consumer (CSP pattern), acting as a rendezvous point.
    public class Channel<T> {
        private final BlockingRendezVous<T> rendezVous;

        public Channel(BlockingRendezVous.Type type) {
            rendezVous = BlockingRendezVous.of(type);
        }

        T take() {
            return rendezVous.take();
        }

        void put(T value) {
            rendezVous.put(value);
        }
    }
  41. Experimenting with different Channels types public interface BlockingRendezVous<T> { void

    put(T x) throws InterruptedException;
    T take() throws InterruptedException;

    enum Type {
        BlockingQueue,      // the original one, based on a LinkedBlockingQueue
        BlockingTransfer,   // same but based on a LinkedTransferQueue (variation of M&S lock-free queue algorithm)
        LockedSingleValue,  // locked data structure similar to a BlockingQueue but holding one single value
        OneToOneParking     // lock-free specialized one-producer/one-consumer channel
    }

    static <T> BlockingRendezVous<T> of(Type type) {
        return switch (type) {
            case BlockingQueue -> new BlockingQueue<>();
            case BlockingTransfer -> new BlockingTransfer<>();
            case LockedSingleValue -> new LockedSingleValue<>();
            case OneToOneParking -> new OneToOneParkingSingleValue<>();
            default -> throw new AssertionError("unknown type: " + type);
        };
    }
}
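A further variant one could plug into this enum (not among the four benchmarked in the talk, just a sketch assuming it sits next to the BlockingRendezVous interface) is a rendezvous that simply delegates to a SynchronousQueue, which already implements hand-off semantics directly.

    import java.util.concurrent.SynchronousQueue;

    // Possible additional variant: a rendezvous backed by a SynchronousQueue
    final class SynchronousQueueRendezVous<T> implements BlockingRendezVous<T> {
        private final SynchronousQueue<T> queue = new SynchronousQueue<>();

        @Override
        public void put(T x) throws InterruptedException {
            queue.put(x);          // blocks until a consumer takes the value
        }

        @Override
        public T take() throws InterruptedException {
            return queue.take();   // blocks until a producer offers a value
        }
    }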
  42. Channels types – LockedSingleValue final class LockedSingleValue<T> implements BlockingRendezVous<T> {

    private final Lock lock = new ReentrantLock();
    private final Condition notFull = lock.newCondition();
    private final Condition notEmpty = lock.newCondition();
    private T value;

    @Override
    public void put(T x) throws InterruptedException {
        lock.lock();
        try {
            while (value != null) {   // await in a loop to guard against spurious wakeups
                notFull.await();
            }
            value = x;
            notEmpty.signal();
        } finally {
            lock.unlock();
        }
    }

    @Override
    public T take() throws InterruptedException {
        lock.lock();
        try {
            while (value == null) {
                notEmpty.await();
            }
            T x = value;
            value = null;
            notFull.signal();
            return x;
        } finally {
            lock.unlock();
        }
    }
}
  43. Channels types – OneToOneParkingSingleValue final class OneToOneParkingSingleValue<T> implements BlockingRendezVous<T> {

    private volatile Thread waitSomeValue = null;
    private volatile Thread waitEmpty = null;
    private volatile T value;

    @Override
    public void put(T x) throws InterruptedException {
        Objects.requireNonNull(x);
        while (value != null) {
            Thread cThread = Thread.currentThread();
            waitEmpty = cThread;
            try {
                if (value == null) break;
                LockSupport.park();
                if (cThread.isInterrupted()) {
                    throw new InterruptedException();
                }
            } finally {
                waitEmpty = null;
            }
        }
        value = x;
        LockSupport.unpark(waitSomeValue);
    }

    @Override
    public T take() throws InterruptedException {
        T t;
        while ((t = value) == null) {
            Thread cThread = Thread.currentThread();
            waitSomeValue = cThread;
            try {
                t = value;
                if (t != null) break;
                LockSupport.park();
                if (cThread.isInterrupted()) {
                    throw new InterruptedException();
                }
            } finally {
                waitSomeValue = null;
            }
        }
        assert t != null;
        value = null;
        LockSupport.unpark(waitEmpty);
        return t;
    }
}
  44. Different Channels types – Benchmark results

    Native Threads
    Benchmark                      (channelType)       (padding)   Mode  Cnt   Score    Error  Units
    GameOfLifeBenchmark.benchmark  BlockingQueue           25     thrpt   20  17.654 ±  0.579  ops/s
    GameOfLifeBenchmark.benchmark  BlockingTransfer        25     thrpt   20  16.652 ±  0.650  ops/s
    GameOfLifeBenchmark.benchmark  LockedSingleValue       25     thrpt   20  17.428 ±  0.683  ops/s
    GameOfLifeBenchmark.benchmark  OneToOneParking         25     thrpt   20  18.440 ±  0.893  ops/s
  45. Different Channels types – Benchmark results

    Native Threads
    Benchmark                      (channelType)       (padding)   Mode  Cnt    Score     Error  Units
    GameOfLifeBenchmark.benchmark  BlockingQueue           25     thrpt   20   17.654 ±   0.579  ops/s
    GameOfLifeBenchmark.benchmark  BlockingTransfer        25     thrpt   20   16.652 ±   0.650  ops/s
    GameOfLifeBenchmark.benchmark  LockedSingleValue       25     thrpt   20   17.428 ±   0.683  ops/s
    GameOfLifeBenchmark.benchmark  OneToOneParking         25     thrpt   20   18.440 ±   0.893  ops/s
    Virtual Threads
    Benchmark                      (channelType)       (padding)   Mode  Cnt    Score     Error  Units
    GameOfLifeBenchmark.benchmark  BlockingQueue           25     thrpt   20  118.315 ±   4.583  ops/s
    GameOfLifeBenchmark.benchmark  BlockingTransfer        25     thrpt   20  154.734 ±   6.313  ops/s
    GameOfLifeBenchmark.benchmark  LockedSingleValue       25     thrpt   20  161.351 ±   7.030  ops/s
    GameOfLifeBenchmark.benchmark  OneToOneParking         25     thrpt   20  284.477 ±  12.121  ops/s
  46. Different Channels types – Benchmark results

    Native Threads
    Benchmark                      (channelType)       (padding)   Mode  Cnt    Score     Error  Units
    GameOfLifeBenchmark.benchmark  BlockingQueue           25     thrpt   20   17.654 ±   0.579  ops/s
    GameOfLifeBenchmark.benchmark  BlockingTransfer        25     thrpt   20   16.652 ±   0.650  ops/s
    GameOfLifeBenchmark.benchmark  LockedSingleValue       25     thrpt   20   17.428 ±   0.683  ops/s
    GameOfLifeBenchmark.benchmark  OneToOneParking         25     thrpt   20   18.440 ±   0.893  ops/s
    Virtual Threads
    Benchmark                      (channelType)       (padding)   Mode  Cnt    Score     Error  Units
    GameOfLifeBenchmark.benchmark  BlockingQueue           25     thrpt   20  118.315 ±   4.583  ops/s
    GameOfLifeBenchmark.benchmark  BlockingTransfer        25     thrpt   20  154.734 ±   6.313  ops/s
    GameOfLifeBenchmark.benchmark  LockedSingleValue       25     thrpt   20  161.351 ±   7.030  ops/s
    GameOfLifeBenchmark.benchmark  OneToOneParking         25     thrpt   20  284.477 ±  12.121  ops/s
  47. Thread per cell vs. Thread per core public abstract class

    GameOfLife { public void start() { startCells(); startGame(); } protected abstract int getThreadPoolSize(); protected abstract void startCells(); }
  48. Thread per cell vs. Thread per core public abstract class

    GameOfLife { public void start() { startCells(); startGame(); } protected abstract int getThreadPoolSize(); protected abstract void startCells(); } Abstracts the number of threads to be used … and how cells are evaluated
  49. Thread per cell vs. Thread per core public class ThreadPerCellGameOfLife

    extends GameOfLife { public ThreadPerCellGameOfLife(Dimensions dim, boolean[][] seed, int period, Channel<boolean[][]> gridChannel, boolean logRate, boolean useVirtualThreads, BlockingRendezVous.Type type) { super(dim, seed, period, gridChannel, logRate, useVirtualThreads, type); } @Override protected void startCells() { cells.forEach(cell -> runner.accept(cell::run)); } @Override protected int getThreadPoolSize() { return cells.size() + 1; } } public abstract class GameOfLife { public void start() { startCells(); startGame(); } protected abstract int getThreadPoolSize(); protected abstract void startCells(); } Abstracts the number of threads to be used … and how cells are evaluated
  50. Thread per cell vs. Thread per core public class ThreadPerCellGameOfLife

    extends GameOfLife { public ThreadPerCellGameOfLife(Dimensions dim, boolean[][] seed, int period, Channel<boolean[][]> gridChannel, boolean logRate, boolean useVirtualThreads, BlockingRendezVous.Type type) { super(dim, seed, period, gridChannel, logRate, useVirtualThreads, type); } @Override protected void startCells() { cells.forEach(cell -> runner.accept(cell::run)); } @Override protected int getThreadPoolSize() { return cells.size() + 1; } } public abstract class GameOfLife { public void start() { startCells(); startGame(); } protected abstract int getThreadPoolSize(); protected abstract void startCells(); } Run each cell in its own thread Abstracts the number of threads to be used … and how cells are evaluated One thread per cell + 1 for coordinator
  51. Thread per cell vs. Thread per core public class ThreadPerCoreGameOfLife

    extends GameOfLife { private final List<CellsGroup> cellsGroups; public ThreadPerCoreGameOfLife(Dimensions dim, boolean[][] seed, int period, Channel<boolean[][]> gridChannel, boolean logRate, boolean useVirtualThreads, BlockingRendezVous.Type type) { super(dim, seed, period, gridChannel, logRate, useVirtualThreads, type); cellsGroups = createCellGroups(); } private List<CellsGroup> createCellGroups() { List<CellsGroup> cellsGroups = new ArrayList<>(); int nThread = Runtime.getRuntime().availableProcessors(); int cellsPerGroup = (int) Math.round( (double) cells.size() / (double) (nThread-1) ); int groupStart = 0; for (int i = 0; i <= nThread-1; i++) { cellsGroups.add( new CellsGroup( cells.subList(groupStart, groupStart + cellsPerGroup) ) ); groupStart += cellsPerGroup; } cellsGroups.add( new CellsGroup( cells.subList(groupStart, cells.size()) ) ); return cellsGroups; } @Override protected void startCells() { cellsGroups.forEach( cg -> runner.accept(cg::run) ); } @Override protected int getThreadPoolSize() { return Runtime.getRuntime().availableProcessors(); } }
  52. Thread per cell vs. Thread per core public class ThreadPerCoreGameOfLife

    extends GameOfLife { private final List<CellsGroup> cellsGroups; public ThreadPerCoreGameOfLife(Dimensions dim, boolean[][] seed, int period, Channel<boolean[][]> gridChannel, boolean logRate, boolean useVirtualThreads, BlockingRendezVous.Type type) { super(dim, seed, period, gridChannel, logRate, useVirtualThreads, type); cellsGroups = createCellGroups(); } private List<CellsGroup> createCellGroups() { List<CellsGroup> cellsGroups = new ArrayList<>(); int nThread = Runtime.getRuntime().availableProcessors(); int cellsPerGroup = (int) Math.round( (double) cells.size() / (double) (nThread-1) ); int groupStart = 0; for (int i = 0; i <= nThread-1; i++) { cellsGroups.add( new CellsGroup( cells.subList(groupStart, groupStart + cellsPerGroup) ) ); groupStart += cellsPerGroup; } cellsGroups.add( new CellsGroup( cells.subList(groupStart, cells.size()) ) ); return cellsGroups; } @Override protected void startCells() { cellsGroups.forEach( cg -> runner.accept(cg::run) ); } @Override protected int getThreadPoolSize() { return Runtime.getRuntime().availableProcessors(); } } Thread per core Run each group in its own thread group cells in n-1 groups n-th is for coordinator
  53. Cells group public class CellsGroup { private final List<Cell> cells;

    public CellsGroup(List<Cell> cells) { this.cells = cells; } public void run() { while (true) { cells.forEach( Cell::notifyLiveness ); cells.forEach( Cell::calculateNextState ); } } }
  54. Thread per cell vs. Thread per core – Benchmark results

    Thread per cell
    Benchmark                      (padding)  (useVirtualThreads)   Mode  Cnt    Score     Error  Units
    GameOfLifeBenchmark.benchmark      5       true                thrpt   20  972.997 ±  51.172  ops/s
    GameOfLifeBenchmark.benchmark      5       false               thrpt   20  128.482 ±   5.816  ops/s
    GameOfLifeBenchmark.benchmark     25       true                thrpt   20  118.315 ±   4.583  ops/s
    GameOfLifeBenchmark.benchmark     25       false               thrpt   20   17.654 ±   0.579  ops/s
  55. Thread per cell vs. Thread per core – Benchmark results

    Thread per cell
    Benchmark                      (padding)  (useVirtualThreads)   Mode  Cnt     Score     Error  Units
    GameOfLifeBenchmark.benchmark      5       true                thrpt   20   972.997 ±  51.172  ops/s
    GameOfLifeBenchmark.benchmark      5       false               thrpt   20   128.482 ±   5.816  ops/s
    GameOfLifeBenchmark.benchmark     25       true                thrpt   20   118.315 ±   4.583  ops/s
    GameOfLifeBenchmark.benchmark     25       false               thrpt   20    17.654 ±   0.579  ops/s
    Thread per core
    Benchmark                      (padding)  (useVirtualThreads)   Mode  Cnt     Score     Error  Units
    GameOfLifeBenchmark.benchmark      5       true                thrpt   20  1569.228 ±  91.740  ops/s
    GameOfLifeBenchmark.benchmark      5       false               thrpt   20  1890.566 ± 139.547  ops/s
    GameOfLifeBenchmark.benchmark     25       true                thrpt   20   158.609 ±   2.007  ops/s
    GameOfLifeBenchmark.benchmark     25       false               thrpt   20   163.086 ±   1.434  ops/s
  56. Thread per cell vs. Thread per core – Benchmark results

    Thread per cell
    Benchmark                      (padding)  (useVirtualThreads)   Mode  Cnt     Score     Error  Units
    GameOfLifeBenchmark.benchmark      5       true                thrpt   20   972.997 ±  51.172  ops/s
    GameOfLifeBenchmark.benchmark      5       false               thrpt   20   128.482 ±   5.816  ops/s
    GameOfLifeBenchmark.benchmark     25       true                thrpt   20   118.315 ±   4.583  ops/s
    GameOfLifeBenchmark.benchmark     25       false               thrpt   20    17.654 ±   0.579  ops/s
    Thread per core
    Benchmark                      (padding)  (useVirtualThreads)   Mode  Cnt     Score     Error  Units
    GameOfLifeBenchmark.benchmark      5       true                thrpt   20  1569.228 ±  91.740  ops/s
    GameOfLifeBenchmark.benchmark      5       false               thrpt   20  1890.566 ± 139.547  ops/s
    GameOfLifeBenchmark.benchmark     25       true                thrpt   20   158.609 ±   2.007  ops/s
    GameOfLifeBenchmark.benchmark     25       false               thrpt   20   163.086 ±   1.434  ops/s
    Change the rules of the game and you’ll have a different winner
  57. Benchmarking native compilation vs. JDK

    BlockingQueue
    Native: Completed 5000 frames in 62586ms = 79.890071 frames per second
    JDK:    Completed 5000 frames in 35618ms = 140.378460 frames per second
    2000 frames warmup + 5000 frames benchmarking
  58. Benchmarking native compilation vs. JDK

    BlockingQueue
    Native: Completed 5000 frames in 62586ms = 79.890071 frames per second
    JDK:    Completed 5000 frames in 35618ms = 140.378460 frames per second
    OneToOneParking
    Native: Completed 5000 frames in 19681ms = 254.052131 frames per second
    JDK:    Completed 5000 frames in 13398ms = 373.190028 frames per second
    2000 frames warmup + 5000 frames benchmarking
  59. Conclusions ❖ Virtual threads aren’t faster threads ➢ They don’t

    magically execute more instructions per second than native ones do.
    ➢ What they are really good for is waiting (but pay attention to pinning).
    ➢ In other words, their goal is maximizing the utilization of external resources and thus improving throughput.
    ➢ This is the same goal as reactive programming, but achieved in a much more readable and developer-friendly way.
  60. Conclusions ❖ Virtual threads aren’t faster threads ➢ They don’t

    magically execute more instructions per second than native ones do.
    ➢ What they are really good for is waiting (but pay attention to pinning).
    ➢ In other words, their goal is maximizing the utilization of external resources and thus improving throughput.
    ➢ This is the same goal as reactive programming, but achieved in a much more readable and developer-friendly way.
    ❖ How good are virtual threads?
    ➢ They are really good in I/O-bound scenarios.
    ➢ The work-stealing scheduler and the lack of fairness (they are not preemptive) make them less suitable for CPU-bound scenarios.
    ➢ They are not drop-in replacements for native threads in each and every situation.
  61. Conclusions ❖ Virtual threads aren’t faster threads ➢ They don’t

    magically execute more instructions per second than native ones do.
    ➢ What they are really good for is waiting (but pay attention to pinning).
    ➢ In other words, their goal is maximizing the utilization of external resources and thus improving throughput.
    ➢ This is the same goal as reactive programming, but achieved in a much more readable and developer-friendly way.
    ❖ How good are virtual threads?
    ➢ They are really good in I/O-bound scenarios.
    ➢ The work-stealing scheduler and the lack of fairness (they are not preemptive) make them less suitable for CPU-bound scenarios.
    ➢ They are not drop-in replacements for native threads in each and every situation.
    Domain knowledge still matters … a lot!
  62. References ❖ Original Game of Life CSP implementation ➢ https://github.com/ebarlas/game-of-life-csp

    ❖ My fork used in this presentation ➢ https://github.com/mariofusco/game-of-life-csp ❖ Loom and Thread Fairness ➢ https://www.morling.dev/blog/loom-and-thread-fairness/ ❖ From Imperative to Pure-Functional and Back Again: Monads vs. Scoped Continuations ➢ https://www.javacodegeeks.com/2015/08/from-imperative-to-pure-functional-and-back-again-monads-vs-scoped-continuations.html ❖ Coming to Java 19: Virtual threads and platform threads ➢ https://blogs.oracle.com/javamagazine/post/java-loom-virtual-threads-platform-threads ❖ Going inside Java’s Project Loom and virtual threads ➢ https://blogs.oracle.com/javamagazine/post/going-inside-javas-project-loom-and-virtual-threads ❖ Webtide’s Loom Report ➢ https://webtide.com/do-looms-claims-stack-up-part-1/ ➢ https://webtide.com/do-looms-claims-stack-up-part-2/ ❖ Java Mixedmode Flame Graphs ➢ https://www.slideshare.net/brendangregg/javaone-2015-java-mixedmode-flame-graphs ❖ Single Writer Principle ➢ https://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html ❖ Red/Blue functions problem ➢ http://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/ ➢ https://www.tedinski.com/2018/11/13/function-coloring.html ❖ JEP 429: Extent-Local Variables (Incubator) ➢ https://openjdk.org/jeps/429