groovy wayang

Cross-platform machine & deep learning with Apache Wayang (Incubating) &
Groovy Dr Paul King, VP Apache Groovy & Distinguished Engineer, Object Computing Groovy: Repos: Blogs: Slides: Twitter/X | Mastodon | Bluesky: https://groovy.apache.org/ https://groovy-lang.org/ https://github.com/paulk-asert/groovy-wayang-tensorflow https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeyWayang https://groovy.apache.org/blog/wayang-tensorflow https://groovy.apache.org/blog/using-groovy-with-apache-wayang https://speakerdeck.com/paulk/groovy-wayang @ApacheGroovy | @[email protected] | @groovy.apache.org

• Introduction • Whisky Clustering • Iris Classification

What is Apache Groovy? It’s like a super version of
Java: • Simpler scripting: more powerful yet more concise • Extension methods: 2000+ enhancements to Java classes for a great out-of-the box experience (batteries included) • Flexible typing: from dynamic duck-typing (terse code) to extensible stronger-than-Java static typing (better checking) • Improved OO & functional features: from traits (more powerful and flexible OO designs) to tail recursion and memorizing/partial application of pure functions • AST transforms: 10s of lines instead of 100/1000s of lines • Java features earlier: recent features on older JDK

What is Apache Wayang (incubating)? • A unified data processing
framework that seamlessly integrates and orchestrates multiple data platforms to deliver unparalleled performance and flexibility JDK Spark Groovy Custom Flink Groovy Groovy JDK Spark Custom Flink Groovy Wayang Tensor Flow

What is Apache Wayang (incubating)? • A unified data processing
framework that seamlessly integrates and orchestrates multiple data platforms to deliver unparalleled performance and flexibility Image source: Apache Wayang documentation

Source: https://wayang.apache.org/docs/introduction/about • A unified data processing framework that seamlessly
integrates and orchestrates multiple data platforms to deliver unparalleled performance and flexibility What is Apache Wayang (incubating)?

Clustering case study: Whiskey flavor profiles • 86 scotch whiskies
• 12 flavor categories Pictures: https://prasant.net/clustering-scotch-whisky-grouping-distilleries-by-k-means-clustering-81f2ecde069c https://www.r-bloggers.com/where-the-whisky-flavor-profile-data-came-from/ https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/

Clustering Overview Clustering: • Grouping similar items Algorithm families: •
Hierarchical • Partitioning k-means, x-means • Density-based • Graph-based Aspects: • Disjoint vs overlapping • Preset cluster number • Dimensionality reduction PCA • Nominal feature support Applications: • Market segmentation • Recommendation engines • Search result grouping • Social network analysis • Medical imaging

Clustering with KMeans Step 1: • Guess k cluster centroids
at random

Step 2: • Assign points to closest centroid

Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points

Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached

Clustering case study: Whiskey flavor profiles • Distributed clustering?

Clustering case study: Whiskey flavor profiles Node 1 Node 2

Apache Wayang Offers two approaches for us: • Roll your
own Kmeans algorithm using existing operators o Built upon 4 abstractions: UnaryToUnaryOperator, BinaryToUnaryOperator, UnarySource, UnarySink o Built-in operators: Map, Filter, Reduce, Distinct, Count, GroupBy • ML4all abstracts most ML algorithms with seven operators: o Transform, Stage, Compute, Update, Sample, Converge, Loop o Built-in algorithms including Kmeans

Roll your own Kmeans: domain classes record Point(double[] pts) implements
Serializable { } record PointGrouping(double[] pts, int cluster, long count) implements Serializable { PointGrouping(List<Double> pts, int cluster, long count) { this(pts as double[], cluster, count) } PointGrouping plus(PointGrouping that) { var newPts = pts.indices.collect{ pts[it] + that.pts[it] } new PointGrouping(newPts, cluster, count + that.count) } PointGrouping average() { new PointGrouping(pts.collect{ double d -> d/count }, cluster, 1) } }

Roll your own Kmeans: algorithm class class SelectNearestCentroid implements ExtendedSerializableFunction<Point,
PointGrouping> { Iterable<PointGrouping> centroids void open(ExecutionContext context) { centroids = context.getBroadcast('centroids') } PointGrouping apply(Point p) { var minDistance = Double.POSITIVE_INFINITY var nearestCentroidId = -1 for (c in centroids) { var distance = sqrt(p.pts.indices.collect{ p.pts[it] - c.pts[it] } .sum{ it ** 2 } as double) if (distance < minDistance) { minDistance = distance nearestCentroidId = c.cluster } } new PointGrouping(p.pts, nearestCentroidId, 1) } }

Apache Wayang: Roll your own Kmeans class PipelineOps { public
static SerializableFunction<PointGrouping, Integer> cluster = pg -> pg.cluster public static SerializableFunction<PointGrouping, PointGrouping> average = pg -> pg.average() public static SerializableBinaryOperator<PointGrouping> plus = (pg1, pg2) -> pg1 + pg2 } import static PipelineOps.* int k = 5 int iterations = 10 // read in data from our file var url = WhiskeyWayang.classLoader.getResource('whiskey.csv').file def rows = new File(url).readLines()[1..-1]*.split(',') var distilleries = rows*.getAt(1) var pointsData = rows.collect{ new Point(it[2..-1] as double[]) } var dims = pointsData[0].pts.size() // create some random points as initial centroids var r = new Random() var randomPoint = { (0..<dims).collect { r.nextGaussian() + 2 } as double[] } var initPts = (1..k).collect(randomPoint)

Apache Wayang: Roll your own Kmeans var context = new
WayangContext() .withPlugin(Java.basicPlugin()) .withPlugin(Spark.basicPlugin()) var planBuilder = new JavaPlanBuilder(context, "KMeans ($url, k=$k, iterations=$iterations)") var points = planBuilder .loadCollection(pointsData).withName('Load points') var initialCentroids = planBuilder .loadCollection((0..<k).collect{ idx -> new PointGrouping(initPts[idx], idx, 0) }) .withName('Load random centroids') var finalCentroids = initialCentroids.repeat(iterations, currentCentroids -> points.map(new SelectNearestCentroid()) .withBroadcast(currentCentroids, 'centroids').withName('Find nearest centroid') .reduceByKey(cluster, plus).withName('Aggregate points') .map(average).withName('Average points') .withOutputClass(PointGrouping) ).withName('Loop').collect()

Apache Wayang: Roll your own Kmeans println 'Centroids:' finalCentroids.each {
c -> var pretty = c.pts.collect('%.2f'::formatted) println "Cluster $c.cluster: ${pretty.join(', ')}" } Centroids: Cluster 0: 2.53, 1.65, 2.76, 2.12, 0.29, 0.65, 1.65, 0.59, 1.35, 1.41, 1.35, 0.94 Cluster 2: 3.33, 2.56, 1.67, 0.11, 0.00, 1.89, 1.89, 2.78, 2.00, 1.89, 2.33, 1.33 Cluster 3: 1.42, 2.47, 1.03, 0.22, 0.06, 1.00, 1.03, 0.47, 1.19, 1.72, 1.92, 2.08 Cluster 4: 2.25, 2.38, 1.38, 0.08, 0.13, 1.79, 1.54, 1.33, 1.75, 2.17, 1.75, 1.79

Apache Wayang: Roll your own Kmeans var allocator = new
SelectNearestCentroid(centroids: finalCentroids) var allocations = pointsData.withIndex() .collect{ pt, idx -> [allocator.apply(pt).cluster, distilleries[idx]] } .groupBy{ cluster, ds -> "Cluster $cluster" } .collectValues{ v -> v.collect{ it[1] } } .sort{ e1, e2 -> e1.key <=> e2.key } allocations.each{ c, ds -> println "$c (${ds.size()} members): ${ds.join(', ')}" } Cluster 0 (17 members): Ardbeg, Balblair, Bowmore, Bruichladdich, Caol Ila, Clynelish, GlenGarioch, GlenScotia, Highland Park, Isle of Jura, Lagavulin, Laphroig, Oban, OldPulteney, Springbank, Talisker, Teaninich Cluster 2 (9 members): Aberlour, Balmenach, Dailuaine, Dalmore, Glendronach, Glenfarclas, Macallan, Mortlach, RoyalLochnagar Cluster 3 (36 members): AnCnoc, ArranIsleOf, Auchentoshan, Aultmore, Benriach, Bladnoch, Bunnahabhain, Cardhu, Craigganmore, Dalwhinnie, Dufftown, GlenElgin, GlenGrant, GlenMoray, GlenSpey, Glenallachie, Glenfiddich, Glengoyne, Glenkinchie, Glenlossie, Glenmorangie, Inchgower, Linkwood, Loch Lomond, Mannochmore, Miltonduff, RoyalBrackla, Speyburn, Speyside, Strathmill, Tamdhu, Tamnavulin, Tobermory, Tomintoul, Tomore, Tullibardine Cluster 4 (24 members): Aberfeldy, Ardmore, Auchroisk, Belvenie, BenNevis, Benrinnes, Benromach, BlairAthol, Craigallechie, Deanston, Edradour, GlenDeveronMacduff, GlenKeith, GlenOrd, Glendullan, Glenlivet, Glenrothes, Glenturret, Knochando, Longmorn, OldFettercairn, Scapa, Strathisla, Tomatin

Apache Wayang Offers two approaches for us: • Roll your
own Kmeans algorithm using existing operators o Built upon 4 abstractions: UnaryToUnaryOperator, BinaryToUnaryOperator, UnarySource, UnarySink o Built-in operators: Map, Filter, Reduce, Distinct, Count, GroupBy • ML4all abstracts most ML algorithms with seven operators: o Transform, Stage, Compute, Update, Sample, Converge, Loop o Built-in algorithms including Kmeans

Apache Wayang: ML4all int k = 3 int maxIterations =
100 double accuracy = 0 class TransformCSV extends Transform<double[], String> { double[] transform(String input) { input.split(',')[2..-1] as double[] } } class KMeansStageWithRandoms extends LocalStage { int k, dimension private r = new Random() void staging(ML4allModel model) { double[][] centers = new double[k][] for (i in 0..<k) { centers[i] = (0..<dimension).collect { r.nextGaussian() + 2 } as double[] } model.put('centers', centers) } }

Apache Wayang: ML4all var url = WhiskeyWayangML.classLoader.getResource('whiskey_noheader.csv').path var dims =
12 var context = new WayangContext() .withPlugin(Spark.basicPlugin()) .withPlugin(Java.basicPlugin()) var plan = new ML4allPlan( transformOp: new TransformCSV(), localStage: new KMeansStageWithRandoms(k: k, dimension: dims), computeOp: new KMeansCompute(), updateOp: new KMeansUpdate(), loopOp: new KMeansConvergeOrMaxIterationsLoop(accuracy, maxIterations) ) var model = plan.execute('file:' + url, context) model.getByKey("centers").eachWithIndex { center, idx -> var pts = center.collect { sprintf '%.2f', it }.join(', ') println "Cluster$idx: $pts" } Cluster0: 1.57, 2.32, 1.32, 0.45, 0.09, 1.08, 1.19, 0.60, 1.26, 1.74, 1.72, 1.85 Cluster1: 3.43, 1.57, 3.43, 3.14, 0.57, 0.14, 1.71, 0.43, 1.29, 1.43, 1.29, 0.14 Cluster2: 2.73, 2.42, 1.46, 0.04, 0.04, 1.88, 1.69, 1.88, 1.92, 2.04, 2.12, 1.81

Classification Overview Classification: • Predicting class of some data Algorithms:
• Logistic Regression, Naïve Bayes, Stochastic Gradient Descent, K-Nearest Neighbors, Decision Tree, Random Forest, Support Vector Machine Aspects: • Over/underfitting • Ensemble • Confusion Applications: • Image/speech recognition • Spam filtering • Medical diagnosis • Fraud detection • Customer behaviour prediction

Case Study: classification of Iris flowers British statistician & biologist
Ronald Fisher 1936 paper: “The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis” 150 samples, 50 each of three species of Iris: • setosa • versicolor • virginica Four features measured for each sample: • sepal length • sepal width • petal length • petal width https://en.wikipedia.org/wiki/Iris_flower_data_set https://archive.ics.uci.edu/ml/datasets/Iris setosa versicolor virginica sepal petal

Case Study: classification of Iris flowers British statistician & biologist
Ronald Fisher 1936 paper: “The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis” 150 samples, 50 each of three species of Iris: • setosa • versicolor • virginica Four features measured for each sample: • sepal length • sepal width • petal length • petal width https://en.wikipedia.org/wiki/Iris_flower_data_set https://archive.ics.uci.edu/ml/datasets/Iris

Apache Wayang: TensorFlow def fileOperation(URI uri, boolean random) { var
textFileSource = new TextFileSource(uri.toString()) var line2tupleOp = new MapOperator<>(line -> line.split(",").with{ new Tuple(it[0..-2]*.toFloat() as float[], LABEL_MAP[it[-1]]) }, String, Tuple) var mapData = new MapOperator<>(tuple -> tuple.field0, Tuple, float[]) var mapLabel = new MapOperator<>(tuple -> tuple.field1, Tuple, Integer) if (random) { Random r = new Random() var randomOp = new SortOperator<>(e -> r.nextInt(), String, Integer) textFileSource.connectTo(0, randomOp, 0) randomOp.connectTo(0, line2tupleOp, 0) } else { textFileSource.connectTo(0, line2tupleOp, 0) } line2tupleOp.connectTo(0, mapData, 0) line2tupleOp.connectTo(0, mapLabel, 0) new Tuple<>(mapData, mapLabel) }

Apache Wayang: TensorFlow var TEST_PATH = getClass().classLoader.getResource("iris_test.csv").toURI() var TRAIN_PATH =
getClass().classLoader.getResource("iris_train.csv").toURI() var LABEL_MAP = ["Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2] var trainSource = fileOperation(TRAIN_PATH, true) var testSource = fileOperation(TEST_PATH, false) /* labels & features */ Operator trainData = trainSource.field0 Operator trainLabel = trainSource.field1 Operator testData = testSource.field0 Operator testLabel = testSource.field1 int[] noShape = null var features = new Input(noShape, Input.Type.FEATURES) var labels = new Input(noShape, Input.Type.LABEL, Op.DType.INT32)

Apache Wayang: TensorFlow /* model */ Op l1 = new
Linear(4, 32, true) Op s1 = new Sigmoid().with(l1.with(features)) Op l2 = new Linear(32, 3, true).with(s1) DLModel model = new DLModel(l2) /* training options */ Op criterion = new CrossEntropyLoss(3).with(model.out, labels) Optimizer optimizer = new Adam(0.1f) // optimizer with learning rate int batchSize = 45 int epoch = 10 var option = new DLTrainingOperator.Option(criterion, optimizer, batchSize, epoch) option.setAccuracyCalculation(new Mean(0).with( new Cast(Op.DType.FLOAT32).with( new Eq().with(new ArgMax(1).with(model.out), labels) ))) var trainingOp = new DLTrainingOperator<>(model, option, float[], Integer) var predictOp = new PredictOperator<>(float[], float[])

Apache Wayang: TensorFlow /* map to label */ var bestFitOp
= new MapOperator<>(array -> array.indexed().max{ it.value }.key, float[], Integer) /* sink */ var predicted = [] var predictedSink = createCollectingSink(predicted, Integer) var groundTruth = [] var groundTruthSink = createCollectingSink(groundTruth, Integer) trainData.connectTo(0, trainingOp, 0) trainLabel.connectTo(0, trainingOp, 1) trainingOp.connectTo(0, predictOp, 0) testData.connectTo(0, predictOp, 1) predictOp.connectTo(0, bestFitOp, 0) bestFitOp.connectTo(0, predictedSink, 0) testLabel.connectTo(0, groundTruthSink, 0)

Apache Wayang: TensorFlow var wayangPlan = new WayangPlan(predictedSink, groundTruthSink) new
WayangContext().with { register(Java.basicPlugin()) register(Tensorflow.plugin()) execute(wayangPlan) } println "labels: $LABEL_MAP" println "predicted: $predicted" println "ground truth: $groundTruth" var correct = predicted.indices.count{ predicted[it] == groundTruth[it] } println "test accuracy: ${correct / predicted.size()}" Start training: [epoch 1, batch 1] loss: 6.300267 accuracy: 0.111111 [epoch 1, batch 2] loss: 2.127365 accuracy: 0.488889 [epoch 1, batch 3] loss: 1.647756 accuracy: 0.333333 …

Apache Wayang: TensorFlow … [epoch 2, batch 1] loss: 1.245312
accuracy: 0.333333 [epoch 2, batch 2] loss: 1.901310 accuracy: 0.422222 [epoch 2, batch 3] loss: 1.388500 accuracy: 0.244444 [epoch 3, batch 1] loss: 0.593732 accuracy: 0.888889 [epoch 3, batch 2] loss: 0.856900 accuracy: 0.466667 [epoch 3, batch 3] loss: 0.595979 accuracy: 0.755556 [epoch 4, batch 1] loss: 0.749081 accuracy: 0.666667 [epoch 4, batch 2] loss: 0.945480 accuracy: 0.577778 [epoch 4, batch 3] loss: 0.611283 accuracy: 0.755556 [epoch 5, batch 1] loss: 0.625158 accuracy: 0.666667 [epoch 5, batch 2] loss: 0.717461 accuracy: 0.577778 [epoch 5, batch 3] loss: 0.525020 accuracy: 0.600000 [epoch 6, batch 1] loss: 0.308523 accuracy: 0.888889 [epoch 6, batch 2] loss: 0.830118 accuracy: 0.511111 [epoch 6, batch 3] loss: 0.637414 accuracy: 0.600000 [epoch 7, batch 1] loss: 0.265740 accuracy: 0.888889 [epoch 7, batch 2] loss: 0.676369 accuracy: 0.511111 [epoch 7, batch 3] loss: 0.443011 accuracy: 0.622222 … … [epoch 8, batch 1] loss: 0.345936 accuracy: 0.666667 [epoch 8, batch 2] loss: 0.599690 accuracy: 0.577778 [epoch 8, batch 3] loss: 0.395788 accuracy: 0.755556 [epoch 9, batch 1] loss: 0.342955 accuracy: 0.688889 [epoch 9, batch 2] loss: 0.477057 accuracy: 0.933333 [epoch 9, batch 3] loss: 0.376597 accuracy: 0.822222 [epoch 10, batch 1] loss: 0.202404 accuracy: 0.888889 [epoch 10, batch 2] loss: 0.515777 accuracy: 0.600000 [epoch 10, batch 3] loss: 0.318649 accuracy: 0.911111 Finish training. labels: [Iris-setosa:0, Iris-versicolor:1, Iris-virginica:2] predicted: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2] ground truth: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2] test accuracy: 1

Cross-platform machine & deep learning with Apache Wayang (Incubating) &
Groovy Dr Paul King, VP Apache Groovy & Distinguished Engineer, Object Computing Groovy: Repos: Blogs: Slides: Twitter/X | Mastodon | Bluesky: https://groovy.apache.org/ https://groovy-lang.org/ https://github.com/paulk-asert/groovy-wayang-tensorflow https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/WhiskeyWayang https://groovy.apache.org/blog/wayang-tensorflow https://groovy.apache.org/blog/using-groovy-with-apache-wayang https://speakerdeck.com/paulk/groovy-wayang @ApacheGroovy | @[email protected] | @groovy.apache.org Questions?

groovy wayang

groovy wayang

More Decks by paulking

Featured

Transcript