Upgrade to Pro — share decks privately, control downloads, hide ads and more …

groovy wayang

Avatar for paulking paulking
September 15, 2025
6

groovy wayang

Avatar for paulking

paulking

September 15, 2025
Tweet

Transcript

  1. Cross-platform machine & deep learning with Apache Wayang (Incubating) &

    Groovy Dr Paul King, VP Apache Groovy & Distinguished Engineer, Object Computing Groovy: Repos: Blogs: Slides: Twitter/X | Mastodon | Bluesky: https://groovy.apache.org/ https://groovy-lang.org/ https://github.com/paulk-asert/groovy-wayang-tensorflow https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeyWayang https://groovy.apache.org/blog/wayang-tensorflow https://groovy.apache.org/blog/using-groovy-with-apache-wayang https://speakerdeck.com/paulk/groovy-wayang @ApacheGroovy | @[email protected] | @groovy.apache.org
  2. What is Apache Groovy? It’s like a super version of

    Java: • Simpler scripting: more powerful yet more concise • Extension methods: 2000+ enhancements to Java classes for a great out-of-the box experience (batteries included) • Flexible typing: from dynamic duck-typing (terse code) to extensible stronger-than-Java static typing (better checking) • Improved OO & functional features: from traits (more powerful and flexible OO designs) to tail recursion and memorizing/partial application of pure functions • AST transforms: 10s of lines instead of 100/1000s of lines • Java features earlier: recent features on older JDK
  3. What is Apache Wayang (incubating)? • A unified data processing

    framework that seamlessly integrates and orchestrates multiple data platforms to deliver unparalleled performance and flexibility JDK Spark Groovy Custom Flink Groovy Groovy JDK Spark Custom Flink Groovy Wayang Tensor Flow
  4. What is Apache Wayang (incubating)? • A unified data processing

    framework that seamlessly integrates and orchestrates multiple data platforms to deliver unparalleled performance and flexibility Image source: Apache Wayang documentation
  5. Source: https://wayang.apache.org/docs/introduction/about • A unified data processing framework that seamlessly

    integrates and orchestrates multiple data platforms to deliver unparalleled performance and flexibility What is Apache Wayang (incubating)?
  6. Source: https://wayang.apache.org/docs/introduction/about • A unified data processing framework that seamlessly

    integrates and orchestrates multiple data platforms to deliver unparalleled performance and flexibility What is Apache Wayang (incubating)?
  7. Clustering case study: Whiskey flavor profiles • 86 scotch whiskies

    • 12 flavor categories Pictures: https://prasant.net/clustering-scotch-whisky-grouping-distilleries-by-k-means-clustering-81f2ecde069c https://www.r-bloggers.com/where-the-whisky-flavor-profile-data-came-from/ https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/
  8. Clustering Overview Clustering: • Grouping similar items Algorithm families: •

    Hierarchical • Partitioning k-means, x-means • Density-based • Graph-based Aspects: • Disjoint vs overlapping • Preset cluster number • Dimensionality reduction PCA • Nominal feature support Applications: • Market segmentation • Recommendation engines • Search result grouping • Social network analysis • Medical imaging
  9. Clustering Overview Clustering: • Grouping similar items Algorithm families: •

    Hierarchical • Partitioning k-means, x-means • Density-based • Graph-based Aspects: • Disjoint vs overlapping • Preset cluster number • Dimensionality reduction PCA • Nominal feature support Applications: • Market segmentation • Recommendation engines • Search result grouping • Social network analysis • Medical imaging
  10. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid
  11. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid
  12. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  13. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  14. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  15. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  16. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
  17. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
  18. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
  19. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
  20. Apache Wayang Offers two approaches for us: • Roll your

    own Kmeans algorithm using existing operators o Built upon 4 abstractions: UnaryToUnaryOperator, BinaryToUnaryOperator, UnarySource, UnarySink o Built-in operators: Map, Filter, Reduce, Distinct, Count, GroupBy • ML4all abstracts most ML algorithms with seven operators: o Transform, Stage, Compute, Update, Sample, Converge, Loop o Built-in algorithms including Kmeans
  21. Roll your own Kmeans: domain classes record Point(double[] pts) implements

    Serializable { } record PointGrouping(double[] pts, int cluster, long count) implements Serializable { PointGrouping(List<Double> pts, int cluster, long count) { this(pts as double[], cluster, count) } PointGrouping plus(PointGrouping that) { var newPts = pts.indices.collect{ pts[it] + that.pts[it] } new PointGrouping(newPts, cluster, count + that.count) } PointGrouping average() { new PointGrouping(pts.collect{ double d -> d/count }, cluster, 1) } }
  22. Roll your own Kmeans: domain classes record Point(double[] pts) implements

    Serializable { } record PointGrouping(double[] pts, int cluster, long count) implements Serializable { PointGrouping(List<Double> pts, int cluster, long count) { this(pts as double[], cluster, count) } PointGrouping plus(PointGrouping that) { var newPts = pts.indices.collect{ pts[it] + that.pts[it] } new PointGrouping(newPts, cluster, count + that.count) } PointGrouping average() { new PointGrouping(pts.collect{ double d -> d/count }, cluster, 1) } }
  23. Roll your own Kmeans: algorithm class class SelectNearestCentroid implements ExtendedSerializableFunction<Point,

    PointGrouping> { Iterable<PointGrouping> centroids void open(ExecutionContext context) { centroids = context.getBroadcast('centroids') } PointGrouping apply(Point p) { var minDistance = Double.POSITIVE_INFINITY var nearestCentroidId = -1 for (c in centroids) { var distance = sqrt(p.pts.indices.collect{ p.pts[it] - c.pts[it] } .sum{ it ** 2 } as double) if (distance < minDistance) { minDistance = distance nearestCentroidId = c.cluster } } new PointGrouping(p.pts, nearestCentroidId, 1) } }
  24. Apache Wayang: Roll your own Kmeans class PipelineOps { public

    static SerializableFunction<PointGrouping, Integer> cluster = pg -> pg.cluster public static SerializableFunction<PointGrouping, PointGrouping> average = pg -> pg.average() public static SerializableBinaryOperator<PointGrouping> plus = (pg1, pg2) -> pg1 + pg2 } import static PipelineOps.* int k = 5 int iterations = 10 // read in data from our file var url = WhiskeyWayang.classLoader.getResource('whiskey.csv').file def rows = new File(url).readLines()[1..-1]*.split(',') var distilleries = rows*.getAt(1) var pointsData = rows.collect{ new Point(it[2..-1] as double[]) } var dims = pointsData[0].pts.size() // create some random points as initial centroids var r = new Random() var randomPoint = { (0..<dims).collect { r.nextGaussian() + 2 } as double[] } var initPts = (1..k).collect(randomPoint)
  25. Apache Wayang: Roll your own Kmeans var context = new

    WayangContext() .withPlugin(Java.basicPlugin()) .withPlugin(Spark.basicPlugin()) var planBuilder = new JavaPlanBuilder(context, "KMeans ($url, k=$k, iterations=$iterations)") var points = planBuilder .loadCollection(pointsData).withName('Load points') var initialCentroids = planBuilder .loadCollection((0..<k).collect{ idx -> new PointGrouping(initPts[idx], idx, 0) }) .withName('Load random centroids') var finalCentroids = initialCentroids.repeat(iterations, currentCentroids -> points.map(new SelectNearestCentroid()) .withBroadcast(currentCentroids, 'centroids').withName('Find nearest centroid') .reduceByKey(cluster, plus).withName('Aggregate points') .map(average).withName('Average points') .withOutputClass(PointGrouping) ).withName('Loop').collect()
  26. Apache Wayang: Roll your own Kmeans println 'Centroids:' finalCentroids.each {

    c -> var pretty = c.pts.collect('%.2f'::formatted) println "Cluster $c.cluster: ${pretty.join(', ')}" } Centroids: Cluster 0: 2.53, 1.65, 2.76, 2.12, 0.29, 0.65, 1.65, 0.59, 1.35, 1.41, 1.35, 0.94 Cluster 2: 3.33, 2.56, 1.67, 0.11, 0.00, 1.89, 1.89, 2.78, 2.00, 1.89, 2.33, 1.33 Cluster 3: 1.42, 2.47, 1.03, 0.22, 0.06, 1.00, 1.03, 0.47, 1.19, 1.72, 1.92, 2.08 Cluster 4: 2.25, 2.38, 1.38, 0.08, 0.13, 1.79, 1.54, 1.33, 1.75, 2.17, 1.75, 1.79
  27. Apache Wayang: Roll your own Kmeans var allocator = new

    SelectNearestCentroid(centroids: finalCentroids) var allocations = pointsData.withIndex() .collect{ pt, idx -> [allocator.apply(pt).cluster, distilleries[idx]] } .groupBy{ cluster, ds -> "Cluster $cluster" } .collectValues{ v -> v.collect{ it[1] } } .sort{ e1, e2 -> e1.key <=> e2.key } allocations.each{ c, ds -> println "$c (${ds.size()} members): ${ds.join(', ')}" } Cluster 0 (17 members): Ardbeg, Balblair, Bowmore, Bruichladdich, Caol Ila, Clynelish, GlenGarioch, GlenScotia, Highland Park, Isle of Jura, Lagavulin, Laphroig, Oban, OldPulteney, Springbank, Talisker, Teaninich Cluster 2 (9 members): Aberlour, Balmenach, Dailuaine, Dalmore, Glendronach, Glenfarclas, Macallan, Mortlach, RoyalLochnagar Cluster 3 (36 members): AnCnoc, ArranIsleOf, Auchentoshan, Aultmore, Benriach, Bladnoch, Bunnahabhain, Cardhu, Craigganmore, Dalwhinnie, Dufftown, GlenElgin, GlenGrant, GlenMoray, GlenSpey, Glenallachie, Glenfiddich, Glengoyne, Glenkinchie, Glenlossie, Glenmorangie, Inchgower, Linkwood, Loch Lomond, Mannochmore, Miltonduff, RoyalBrackla, Speyburn, Speyside, Strathmill, Tamdhu, Tamnavulin, Tobermory, Tomintoul, Tomore, Tullibardine Cluster 4 (24 members): Aberfeldy, Ardmore, Auchroisk, Belvenie, BenNevis, Benrinnes, Benromach, BlairAthol, Craigallechie, Deanston, Edradour, GlenDeveronMacduff, GlenKeith, GlenOrd, Glendullan, Glenlivet, Glenrothes, Glenturret, Knochando, Longmorn, OldFettercairn, Scapa, Strathisla, Tomatin
  28. Apache Wayang Offers two approaches for us: • Roll your

    own Kmeans algorithm using existing operators o Built upon 4 abstractions: UnaryToUnaryOperator, BinaryToUnaryOperator, UnarySource, UnarySink o Built-in operators: Map, Filter, Reduce, Distinct, Count, GroupBy • ML4all abstracts most ML algorithms with seven operators: o Transform, Stage, Compute, Update, Sample, Converge, Loop o Built-in algorithms including Kmeans
  29. Apache Wayang: ML4all int k = 3 int maxIterations =

    100 double accuracy = 0 class TransformCSV extends Transform<double[], String> { double[] transform(String input) { input.split(',')[2..-1] as double[] } } class KMeansStageWithRandoms extends LocalStage { int k, dimension private r = new Random() void staging(ML4allModel model) { double[][] centers = new double[k][] for (i in 0..<k) { centers[i] = (0..<dimension).collect { r.nextGaussian() + 2 } as double[] } model.put('centers', centers) } }
  30. Apache Wayang: ML4all var url = WhiskeyWayangML.classLoader.getResource('whiskey_noheader.csv').path var dims =

    12 var context = new WayangContext() .withPlugin(Spark.basicPlugin()) .withPlugin(Java.basicPlugin()) var plan = new ML4allPlan( transformOp: new TransformCSV(), localStage: new KMeansStageWithRandoms(k: k, dimension: dims), computeOp: new KMeansCompute(), updateOp: new KMeansUpdate(), loopOp: new KMeansConvergeOrMaxIterationsLoop(accuracy, maxIterations) ) var model = plan.execute('file:' + url, context) model.getByKey("centers").eachWithIndex { center, idx -> var pts = center.collect { sprintf '%.2f', it }.join(', ') println "Cluster$idx: $pts" } Cluster0: 1.57, 2.32, 1.32, 0.45, 0.09, 1.08, 1.19, 0.60, 1.26, 1.74, 1.72, 1.85 Cluster1: 3.43, 1.57, 3.43, 3.14, 0.57, 0.14, 1.71, 0.43, 1.29, 1.43, 1.29, 0.14 Cluster2: 2.73, 2.42, 1.46, 0.04, 0.04, 1.88, 1.69, 1.88, 1.92, 2.04, 2.12, 1.81
  31. Classification Overview Classification: • Predicting class of some data Algorithms:

    • Logistic Regression, Naïve Bayes, Stochastic Gradient Descent, K-Nearest Neighbors, Decision Tree, Random Forest, Support Vector Machine Aspects: • Over/underfitting • Ensemble • Confusion Applications: • Image/speech recognition • Spam filtering • Medical diagnosis • Fraud detection • Customer behaviour prediction
  32. Classification Overview Classification: • Predicting class of some data Algorithms:

    • Logistic Regression, Naïve Bayes, Stochastic Gradient Descent, K-Nearest Neighbors, Decision Tree, Random Forest, Support Vector Machine Aspects: • Over/underfitting • Ensemble • Confusion Applications: • Image/speech recognition • Spam filtering • Medical diagnosis • Fraud detection • Customer behaviour prediction
  33. Case Study: classification of Iris flowers British statistician & biologist

    Ronald Fisher 1936 paper: “The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis” 150 samples, 50 each of three species of Iris: • setosa • versicolor • virginica Four features measured for each sample: • sepal length • sepal width • petal length • petal width https://en.wikipedia.org/wiki/Iris_flower_data_set https://archive.ics.uci.edu/ml/datasets/Iris setosa versicolor virginica sepal petal
  34. Case Study: classification of Iris flowers British statistician & biologist

    Ronald Fisher 1936 paper: “The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis” 150 samples, 50 each of three species of Iris: • setosa • versicolor • virginica Four features measured for each sample: • sepal length • sepal width • petal length • petal width https://en.wikipedia.org/wiki/Iris_flower_data_set https://archive.ics.uci.edu/ml/datasets/Iris
  35. Apache Wayang: TensorFlow def fileOperation(URI uri, boolean random) { var

    textFileSource = new TextFileSource(uri.toString()) var line2tupleOp = new MapOperator<>(line -> line.split(",").with{ new Tuple(it[0..-2]*.toFloat() as float[], LABEL_MAP[it[-1]]) }, String, Tuple) var mapData = new MapOperator<>(tuple -> tuple.field0, Tuple, float[]) var mapLabel = new MapOperator<>(tuple -> tuple.field1, Tuple, Integer) if (random) { Random r = new Random() var randomOp = new SortOperator<>(e -> r.nextInt(), String, Integer) textFileSource.connectTo(0, randomOp, 0) randomOp.connectTo(0, line2tupleOp, 0) } else { textFileSource.connectTo(0, line2tupleOp, 0) } line2tupleOp.connectTo(0, mapData, 0) line2tupleOp.connectTo(0, mapLabel, 0) new Tuple<>(mapData, mapLabel) }
  36. Apache Wayang: TensorFlow var TEST_PATH = getClass().classLoader.getResource("iris_test.csv").toURI() var TRAIN_PATH =

    getClass().classLoader.getResource("iris_train.csv").toURI() var LABEL_MAP = ["Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2] var trainSource = fileOperation(TRAIN_PATH, true) var testSource = fileOperation(TEST_PATH, false) /* labels & features */ Operator trainData = trainSource.field0 Operator trainLabel = trainSource.field1 Operator testData = testSource.field0 Operator testLabel = testSource.field1 int[] noShape = null var features = new Input(noShape, Input.Type.FEATURES) var labels = new Input(noShape, Input.Type.LABEL, Op.DType.INT32)
  37. Apache Wayang: TensorFlow /* model */ Op l1 = new

    Linear(4, 32, true) Op s1 = new Sigmoid().with(l1.with(features)) Op l2 = new Linear(32, 3, true).with(s1) DLModel model = new DLModel(l2) /* training options */ Op criterion = new CrossEntropyLoss(3).with(model.out, labels) Optimizer optimizer = new Adam(0.1f) // optimizer with learning rate int batchSize = 45 int epoch = 10 var option = new DLTrainingOperator.Option(criterion, optimizer, batchSize, epoch) option.setAccuracyCalculation(new Mean(0).with( new Cast(Op.DType.FLOAT32).with( new Eq().with(new ArgMax(1).with(model.out), labels) ))) var trainingOp = new DLTrainingOperator<>(model, option, float[], Integer) var predictOp = new PredictOperator<>(float[], float[])
  38. Apache Wayang: TensorFlow /* map to label */ var bestFitOp

    = new MapOperator<>(array -> array.indexed().max{ it.value }.key, float[], Integer) /* sink */ var predicted = [] var predictedSink = createCollectingSink(predicted, Integer) var groundTruth = [] var groundTruthSink = createCollectingSink(groundTruth, Integer) trainData.connectTo(0, trainingOp, 0) trainLabel.connectTo(0, trainingOp, 1) trainingOp.connectTo(0, predictOp, 0) testData.connectTo(0, predictOp, 1) predictOp.connectTo(0, bestFitOp, 0) bestFitOp.connectTo(0, predictedSink, 0) testLabel.connectTo(0, groundTruthSink, 0)
  39. Apache Wayang: TensorFlow var wayangPlan = new WayangPlan(predictedSink, groundTruthSink) new

    WayangContext().with { register(Java.basicPlugin()) register(Tensorflow.plugin()) execute(wayangPlan) } println "labels: $LABEL_MAP" println "predicted: $predicted" println "ground truth: $groundTruth" var correct = predicted.indices.count{ predicted[it] == groundTruth[it] } println "test accuracy: ${correct / predicted.size()}" Start training: [epoch 1, batch 1] loss: 6.300267 accuracy: 0.111111 [epoch 1, batch 2] loss: 2.127365 accuracy: 0.488889 [epoch 1, batch 3] loss: 1.647756 accuracy: 0.333333 …
  40. Apache Wayang: TensorFlow … [epoch 2, batch 1] loss: 1.245312

    accuracy: 0.333333 [epoch 2, batch 2] loss: 1.901310 accuracy: 0.422222 [epoch 2, batch 3] loss: 1.388500 accuracy: 0.244444 [epoch 3, batch 1] loss: 0.593732 accuracy: 0.888889 [epoch 3, batch 2] loss: 0.856900 accuracy: 0.466667 [epoch 3, batch 3] loss: 0.595979 accuracy: 0.755556 [epoch 4, batch 1] loss: 0.749081 accuracy: 0.666667 [epoch 4, batch 2] loss: 0.945480 accuracy: 0.577778 [epoch 4, batch 3] loss: 0.611283 accuracy: 0.755556 [epoch 5, batch 1] loss: 0.625158 accuracy: 0.666667 [epoch 5, batch 2] loss: 0.717461 accuracy: 0.577778 [epoch 5, batch 3] loss: 0.525020 accuracy: 0.600000 [epoch 6, batch 1] loss: 0.308523 accuracy: 0.888889 [epoch 6, batch 2] loss: 0.830118 accuracy: 0.511111 [epoch 6, batch 3] loss: 0.637414 accuracy: 0.600000 [epoch 7, batch 1] loss: 0.265740 accuracy: 0.888889 [epoch 7, batch 2] loss: 0.676369 accuracy: 0.511111 [epoch 7, batch 3] loss: 0.443011 accuracy: 0.622222 … … [epoch 8, batch 1] loss: 0.345936 accuracy: 0.666667 [epoch 8, batch 2] loss: 0.599690 accuracy: 0.577778 [epoch 8, batch 3] loss: 0.395788 accuracy: 0.755556 [epoch 9, batch 1] loss: 0.342955 accuracy: 0.688889 [epoch 9, batch 2] loss: 0.477057 accuracy: 0.933333 [epoch 9, batch 3] loss: 0.376597 accuracy: 0.822222 [epoch 10, batch 1] loss: 0.202404 accuracy: 0.888889 [epoch 10, batch 2] loss: 0.515777 accuracy: 0.600000 [epoch 10, batch 3] loss: 0.318649 accuracy: 0.911111 Finish training. labels: [Iris-setosa:0, Iris-versicolor:1, Iris-virginica:2] predicted: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2] ground truth: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2] test accuracy: 1
  41. Cross-platform machine & deep learning with Apache Wayang (Incubating) &

    Groovy Dr Paul King, VP Apache Groovy & Distinguished Engineer, Object Computing Groovy: Repos: Blogs: Slides: Twitter/X | Mastodon | Bluesky: https://groovy.apache.org/ https://groovy-lang.org/ https://github.com/paulk-asert/groovy-wayang-tensorflow https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeyWayang https://groovy.apache.org/blog/wayang-tensorflow https://groovy.apache.org/blog/using-groovy-with-apache-wayang https://speakerdeck.com/paulk/groovy-wayang @ApacheGroovy | @[email protected] | @groovy.apache.org Questions?