Groovy Whiskey

Whiskey Clustering with Apache Projects: Groovy, Commons CSV, Commons Math, Ignite, Spark, Wayang, Beam, Flink

Dr Paul King
Object Computing & VP Apache Groovy
Twitter/X | Mastodon: @paulk_asert | @[email protected]
Apache Groovy: https://groovy.apache.org/ https://groovy-lang.org/
Repo: https://github.com/paulk-asert/groovy-data-science
Slides: https://speakerdeck.com/paulk/groovy-whiskey

objectcomputing.com © 2018, Object Computing, Inc. (OCI). All rights reserved. No part of these notes may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior, written permission of Object Computing, Inc. (OCI)

• Apache Groovy
• Clustering Overview
• Whiskey Clustering & Visualization
• Scaling Whiskey Clustering

Apache Groovy Programming Language
• Multi-faceted extensible language
• Imperative/OO & functional
• Dynamic & static
• Aligned closely with Java
• 20+ years since inception
• 3.5+B downloads (partial count)
• 520+ contributors
• 240+ releases
• https://www.youtube.com/watch?v=eIGOG-F9ZTw&feature=youtu.be

Friends of Apache Groovy Open Collective

Why use Groovy in 2024?
It’s like a super version of Java:
• Simpler scripting: more powerful yet more concise
• Extension methods: 2000+ enhancements to Java classes for a great out-of-the-box experience (batteries included)
• Flexible typing: from dynamic duck-typing (terse code) to extensible stronger-than-Java static typing (better checking)
• Improved OO & functional features: from traits (more powerful and flexible OO designs) to tail recursion and memoizing/partial application of pure functions
• AST transforms: tens of lines instead of hundreds or thousands of lines
• Java features earlier: recent features on older JDKs

Scripting for Data Science
• Same example
• Same library

    import org.apache.commons.math3.linear.*;

    public class MatrixMain {
        public static void main(String[] args) {
            double[][] matrixData = { {1d,2d,3d}, {2d,5d,3d}};
            RealMatrix m = MatrixUtils.createRealMatrix(matrixData);

            double[][] matrixData2 = { {1d,2d}, {2d,5d}, {1d, 7d}};
            RealMatrix n = new Array2DRowRealMatrix(matrixData2);

            RealMatrix o = m.multiply(n);

            // Invert o, using LU decomposition
            RealMatrix oInverse = new LUDecomposition(o).getSolver().getInverse();

            RealMatrix p = oInverse.scalarAdd(1d).scalarMultiply(2d);

            RealMatrix q = o.add(p.power(2));

            System.out.println(q);
        }
    }

Output: Array2DRowRealMatrix{{15.1379501385,40.488531856},{21.4354570637,59.5951246537}}

Thanks to operator overloading and extensible tooling
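The closing remark credits operator overloading for the shorter Groovy version. As a minimal sketch of the mechanism, the script below uses a hypothetical `Mat` class (not Commons Math, so it runs without dependencies) to show that Groovy resolves `*` to `multiply`, `+` to `plus` and `**` to `power`. The computation is a small illustration, not the one on the slide.

```groovy
// Hypothetical 2D matrix wrapper for illustration; not part of Commons Math.
class Mat {
    List<List<Double>> rows

    // a * b  resolves to  a.multiply(b)
    Mat multiply(Mat that) {
        def cols = that.rows.transpose()
        new Mat(rows: rows.collect { r ->
            cols.collect { c -> (0..<r.size()).sum { r[it] * c[it] } as double }
        })
    }

    // a + b  resolves to  a.plus(b)
    Mat plus(Mat that) {
        new Mat(rows: (0..<rows.size()).collect { i ->
            (0..<rows[i].size()).collect { j -> (rows[i][j] + that.rows[i][j]) as double }
        })
    }

    // a ** n  resolves to  a.power(n); repeated multiplication, n >= 1
    Mat power(int n) {
        (1..<n).inject(this) { acc, ignored -> acc * this }
    }
}

def m = new Mat(rows: [[1d, 2d, 3d], [2d, 5d, 3d]])
def n = new Mat(rows: [[1d, 2d], [2d, 5d], [1d, 7d]])
def o = m * n          // matrix product
def q = o + o ** 2     // power binds tighter than plus
println o.rows
println q.rows
```

Any class that supplies methods with these names gets the operators, which is why library types can be used with mathematical notation.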

Clustering Overview
Clustering:
• Grouping similar items
Algorithm families:
• Hierarchical
• Partitioning (k-means, x-means)
• Density-based
• Graph-based
Aspects:
• Disjoint vs overlapping
• Preset cluster number
• Dimensionality reduction (PCA)
• Nominal feature support
Applications:
• Market segmentation
• Recommendation engines
• Search result grouping
• Social network analysis
• Medical imaging

Clustering https://commons.apache.org/proper/commons-math/userguide/ml.html

Clustering with KMeans
Step 1:
• Guess k cluster centroids at random
Step 2:
• Assign points to closest centroid
Step 3:
• Calculate new centroids based on selected points
Repeat steps 2 and 3 until stable or some limit reached
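The three steps can be sketched in plain Groovy without any library. This is an illustrative sketch, not the Commons Math implementation: the six 2D points are made up, and the initial centroids are fixed rather than random so that the run is repeatable.

```groovy
// Squared Euclidean distance between two points (lists of numbers)
def distSq = { a, b -> (0..<a.size()).sum { (a[it] - b[it]) ** 2 } }
// Coordinate-wise mean of a non-empty list of points
def mean = { pts -> pts.transpose().collect { it.sum() / it.size() } }

// Made-up sample data: two obvious groups
def points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]

// Step 1: initial guess (fixed here instead of random, an assumption for repeatability)
def centroids = [[0.0, 0.0], [10.0, 10.0]]

def assignments = [:]
for (iteration in 1..10) {                       // "some limit reached"
    // Step 2: assign each point to the index of its closest centroid
    assignments = points.groupBy { p ->
        (0..<centroids.size()).min { distSq(p, centroids[it]) }
    }
    // Step 3: recompute each centroid from its assigned points
    def updated = (0..<centroids.size()).collect { i ->
        assignments[i] ? mean(assignments[i]) : centroids[i]
    }
    if (updated == centroids) break              // "until stable"
    centroids = updated
}
println centroids
```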

Clustering case study: Whiskey flavor profiles
• 86 scotch whiskies
• 12 flavor categories
Pictures:
https://prasant.net/clustering-scotch-whisky-grouping-distilleries-by-k-means-clustering-81f2ecde069c
https://www.r-bloggers.com/where-the-whisky-flavor-profile-data-came-from/
https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/

Clustering case study: Whiskey flavor profiles
• Read CSV records
• Slice out segments of interest

    var file = getClass().classLoader.getResource('whiskey.csv').file as File
    var builder = RFC4180.builder().build()
    var records = file.withReader { r -> builder.parse(r).records*.toList() }
    var features = records[0][2..-1]
    var data = records[1..-1].collect{ new DoublePoint(it[2..-1] as int[]) }
    var distilleries = records[1..-1]*.get(1)

[Diagram: CSV grid with column indices 0, 1, 2 … -1 and row indices 0, 1 …, marking the distilleries, data and features regions]
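The slicing can be tried on a tiny in-memory extract. This sketch assumes the file layout implied by the slides (a `RowID` column, a `Distillery` column, then the flavor columns). The three rows reuse the first three flavor values printed for these distilleries in the medoid output, while the `RowID` values are placeholders.

```groovy
// Tiny stand-in for whiskey.csv; RowID values are placeholders
def csv = '''\
RowID,Distillery,Body,Sweetness,Smoky
1,Aberfeldy,2,2,2
2,Cardhu,1,3,1
3,Clynelish,3,2,3
'''

// Split into a list of rows, each row a list of strings
def records = csv.readLines()*.split(',')*.toList()

def features = records[0][2..-1]                              // header row, column 2 to the end
def data = records[1..-1].collect { it[2..-1]*.toDouble() }   // numeric block
def distilleries = records[1..-1]*.get(1)                     // column 1 of every data row

println features
println distilleries
println data
```

Negative indices count from the end, so `2..-1` means "from column 2 to the last column" regardless of how many flavor columns there are.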

Whiskey Clusters – Apache Commons Math

    var clusterer = new KMeansPlusPlusClusterer(4)
    Map<Integer, List> clusterPts = [:]
    var clusters = clusterer.cluster(data)
    println features.join(', ')
    var centroids = categoryDataset()
    clusters.eachWithIndex { ctrd, num ->
        var cpt = ctrd.center.point
        clusterPts[num] = ctrd.points.collect { pt -> data.point.findIndexOf { it == pt.point } }
        println cpt.collect { sprintf '%.3f', it }.join(', ')
        cpt.eachWithIndex { val, idx -> centroids.addValue(val, "Cluster ${num + 1}", features[idx]) }
    }

Output:
    Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
    1.630, 2.333, 1.148, 0.222, 0.037, 1.185, 1.037, 0.556, 1.963, 1.630, 2.000, 2.111
    2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545
    1.450, 2.550, 1.150, 0.400, 0.150, 0.850, 1.400, 0.600, 0.450, 1.800, 1.700, 2.000
    2.607, 2.357, 1.643, 0.107, 0.036, 1.893, 1.679, 1.821, 1.679, 2.107, 1.929, 1.536


Whiskey Clusters – Apache Commons Math

    println "\n${features.join(', ')}, Medoid"
    var medoids = categoryDataset()
    clusters.eachWithIndex { ctrd, num ->
        var cpt = ctrd.center.point
        // the medoid is the cluster member closest to the centroid
        var closest = ctrd.points.min { pt ->
            sumSq((0..<cpt.size()).collect { cpt[it] - pt.point[it] } as double[])
        }
        var medoidIdx = data.findIndexOf { row -> row.point == closest.point }
        println data[medoidIdx].point.collect { sprintf '%.3f', it }.join(', ') + ", ${distilleries[medoidIdx]}"
        data[medoidIdx].point.eachWithIndex { val, idx ->
            medoids.addValue(val, distilleries[medoidIdx], features[idx])
        }
    }

Output:
    Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral, Medoid
    1.000, 3.000, 1.000, 0.000, 0.000, 1.000, 1.000, 0.000, 2.000, 2.000, 2.000, 2.000, Cardhu
    3.000, 2.000, 3.000, 3.000, 1.000, 0.000, 2.000, 0.000, 1.000, 1.000, 2.000, 0.000, Clynelish
    1.000, 3.000, 1.000, 0.000, 0.000, 1.000, 1.000, 0.000, 1.000, 2.000, 2.000, 2.000, Glenallachie
    2.000, 2.000, 2.000, 0.000, 0.000, 2.000, 1.000, 2.000, 2.000, 2.000, 2.000, 2.000, Aberfeldy
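The medoid selection comes down to "pick the member with the smallest sum of squared differences from the centroid". A minimal sketch with made-up three-dimensional values follows. The centroid, the member coordinates and the local `sumSq` closure are assumptions for illustration; the closure stands in for the helper called on the slide.

```groovy
// Hypothetical centroid and cluster members (first three flavor dimensions only)
def centroid = [1.6, 2.3, 1.1]
def members = [Cardhu: [1, 3, 1], Clynelish: [3, 2, 3], Aberfeldy: [2, 2, 2]]

// Local stand-in for a sum-of-squares helper
def sumSq = { List diffs -> diffs.sum { it * it } }

// min with a one-argument closure returns the map entry with the smallest score
def medoid = members.min { e ->
    sumSq((0..<centroid.size()).collect { centroid[it] - e.value[it] })
}
println "Medoid: ${medoid.key} ${medoid.value}"
```

Unlike a centroid, the medoid is always an actual data point, which is why the slide can label each one with a distillery name.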

Dimensionality reduction

Whiskey – Screeplot

    import …
    def table = Table.read().csv('whiskey.csv')
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
                "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def data = table.as().doubleMatrix(*cols)
    def pca = new PCA(data)
    pca.projection = 2
    def plots = [PlotCanvas.screeplot(pca)]
    def projected = pca.project(data)
    table = table.addColumns(
        *(1..2).collect { idx ->
            DoubleColumn.create("PCA$idx", (0..<data.size()).collect { projected[it][idx - 1] })
        }
    )
    def colors = [RED, BLUE, GREEN, ORANGE, MAGENTA, GRAY]
    def symbols = ['*', 'Q', '#', 'Q', '*', '#']
    (2..6).each { k ->
        def clusterer = new KMeans(data, k)
        double[][] components = table.as().doubleMatrix('PCA1', 'PCA2')
        plots << ScatterPlot.plot(components, clusterer.clusterLabel,
            symbols[0..<k] as char[], colors[0..<k] as Color[])
    }
    SwingUtil.show(size: [1200, 900], new PlotPanel(*plots))

Whiskey – Exploring Weka clustering algorithms

Whiskey – clustering and visualizing centroids

    …
    def data = table.as().doubleMatrix(*cols)
    def pca = new PCA(data)
    pca.projection = 3
    def projected = pca.project(data)
    def clusterer = new KMeans(data, 5)
    def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) }
    table = table.addColumns(
        *(0..<3).collect { idx ->
            DoubleColumn.create("PCA${idx+1}", (0..<data.size()).collect{ projected[it][idx] })},
        StringColumn.create("Cluster", labels),
        DoubleColumn.create("Centroid", [10] * labels.size())
    )
    def centroids = pca.project(clusterer.centroids())
    def toAdd = table.emptyCopy(1)
    (0..<centroids.size()).each { idx ->
        toAdd[0].setString("Cluster", "Cluster " + (idx+1))
        (1..3).each { toAdd[0].setDouble("PCA" + it, centroids[idx][it-1]) }
        toAdd[0].setDouble("Centroid", 50)
        table.append(toAdd)
    }
    def title = "Clusters x Principal Components w/ centroids"
    Plot.show(Scatter3DPlot.create(title, table, *(1..3).collect { "PCA$it" }, "Centroid", "Cluster"))

Whiskey – Hierarchical clustering with Dendrogram

    …
    def dendrogram = new Dendrogram(clusters.tree, clusters.height, FOREST_GREEN).canvas().tap {
        title = 'Whiskey Dendrogram'
        setAxisLabels('Distilleries', 'Similarity')
        def lb = lowerBounds
        setBound([lb[0] - 1, lb[1] - 20] as double[], upperBounds)
        distilleries.eachWithIndex { String label, int i ->
            add(new Label(label, [i, -1] as double[], 0, 0, ninetyDeg, font, colorMap[partitions[i]]))
        }
    }.panel()

    def pca = PCA.fit(data)
    pca.projection = 2
    def projected = pca.project(data)
    char mark = '#'
    def scatter = ScatterPlot.of(projected, partitions, mark).canvas().tap {
        title = 'Clustered by dendrogram partitions'
        setAxisLabels('PCA1', 'PCA2')
    }.panel()

    new PlotGrid(dendrogram, scatter).window()

Clustering case study: Whiskey flavor profiles • Distributed clustering?

Clustering case study: Whiskey flavor profiles
[Diagram: Node 1, Node 2]

Scaling up machine learning: Apache Ignite
• Apache Ignite is a distributed database for high-performance computing with in-memory speed. In simple terms, it makes a cluster (or grid) of nodes appear like an in-memory cache.
• It has cluster-aware machine learning and deep learning algorithms for Classification, Regression, Clustering, and Recommendation, among others.
Image source: Apache Ignite documentation

• 12 flavor categories
• Apache Ignite has special capabilities for reading data into the cache
• In a cluster environment, use IgniteDataStreamer or IgniteCache.loadCache() to load data from files, stream sources, database sources, etc.
• For our little example, we have a small CSV file and a single node, so we’ll just read our data using Apache Commons CSV

• 12 flavor categories • Let’s select the regions of interest

Clustering case study: Whiskey flavor profiles
• Read CSV rows
• Slice out segments of interest

    var file = getClass().classLoader.getResource('whiskey.csv').file as File
    var rows = file.withReader {r -> RFC4180.parse(r).records*.toList() }
    var data = rows[1..-1].collect{ it[2..-1]*.toDouble() } as double[][]
    var distilleries = rows[1..-1]*.get(1)
    var features = rows[0][2..-1]

[Diagram: CSV grid with column indices 0, 1, 2 … -1 and row indices 0, 1 …, marking the distilleries, data and features regions]

Clustering case study: Whiskey flavor profiles
• Set up configuration & define some helper variables

    // configure to all run on local machine but could be a cluster (can be hidden in XML)
    var cfg = new IgniteConfiguration(
        peerClassLoadingEnabled: true,
        discoverySpi: new TcpDiscoverySpi(
            ipFinder: new TcpDiscoveryMulticastIpFinder(
                addresses: ['127.0.0.1:47500..47509']
            )
        )
    )

    var pretty = this.&sprintf.curry('%.4f')
    var dist = new EuclideanDistance() // or ManhattanDistance
    var vectorizer = new DoubleArrayVectorizer().labeled(FIRST)


Whiskey flavors – scaling clustering

    Ignition.start(cfg).withCloseable { ignite ->
        println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
        var dataCache = ignite.createCache(new CacheConfiguration<Integer, double[]>(
            name: "TEST_${UUID.randomUUID()}",
            affinity: new RendezvousAffinityFunction(false, 10)))
        data.indices.each { int i -> dataCache.put(i, data[i]) }
        var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5)
        var mdl = trainer.fit(ignite, dataCache, vectorizer)
        println ">>> KMeans centroids:\n${features.join(', ')}"
        var centroids = mdl.centers*.all()
        var cols = centroids.collect{ it*.get() }
        cols.each { c -> println c.collect(pretty).join(', ') }
        dataCache.destroy()
    }

Output:
    [11:48:48] [Ignite ASCII-art startup banner]
    [11:48:48] ver. 2.15.0#20230425-sha1:f98f7f35
    [11:48:48] 2023 Copyright(C) Apache Software Foundation
    …
    >>> Ignite grid started for data: 86 rows X 12 cols
    >>> KMeans centroids:
    Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
    2.3793, 1.0345, 0.2414, 0.0345, 0.8966, 1.1034, 0.5517, 1.5517, 1.6207, 2.1724, 2.1379
    2.5556, 1.4444, 0.0556, 0.0000, 1.8333, 1.6667, 2.3333, 2.0000, 2.0000, 2.2222, 1.5556
    3.1429, 1.0000, 0.2857, 0.1429, 0.8571, 0.5714, 0.7143, 0.7143, 1.5714, 0.7143, 1.5714
    2.0476, 1.7619, 0.3333, 0.1429, 1.7619, 1.7619, 0.7143, 1.0952, 2.1429, 1.6190, 1.8571
    1.5455, 2.9091, 2.7273, 0.4545, 0.4545, 1.4545, 0.5455, 1.5455, 1.4545, 1.1818, 0.5455

Whiskey flavors – scaling clustering

    …
    var clusters = [:].withDefault{ [] }
    dataCache.query(new ScanQuery()).withCloseable { observations ->
        observations.each { observation ->
            def (k, v) = observation.with{ [getKey(), getValue()] }
            int prediction = mdl.predict(vectorizer.extractFeatures(k, v))
            clusters[prediction] += distilleries[k]
        }
    }
    clusters.sort{ e -> e.key }.each{ k, v ->
        println "Cluster ${k+1}: ${v.join(', ')}"
    }
    …

Output:
    …
    Cluster 1: Bunnahabhain, Dufftown, Glenmorangie, Teaninich, Glenallachie, Longmorn, Scapa, Tobermory, AnCnoc, Cardhu, GlenElgin, Mannochmore, Speyside, Craigganmore, GlenGrant, Tullibardine, Auchentoshan, Bladnoch, GlenKeith, Glengoyne, Knochando, Strathmill, GlenMoray, Aultmore, Tamdhu, Balblair, Glenlossie, Linkwood, Tamnavulin
    Cluster 2: Aberfeldy, Balmenach, RoyalLochnagar, Aberlour, Edradour, Glenrothes, Glendronach, Glenturret, Macallan, Glendullan, Glenfarclas, Mortlach, Strathisla, Dailuaine, Auchroisk, BlairAthol, Dalmore, Glenlivet
    Cluster 3: GlenSpey, GlenDeveronMacduff, Speyburn, Miltonduff, Tomore, ArranIsleOf, Glenfiddich
    Cluster 4: Loch Lomond, Belvenie, BenNevis, Tomatin, Benriach, Highland Park, Tomintoul, Ardmore, Benrinnes, Craigallechie, GlenGarioch, Inchgower, Benromach, Glenkinchie, OldFettercairn, Bowmore, Dalwhinnie, GlenOrd, Bruichladdich, Deanston, RoyalBrackla
    Cluster 5: Caol Ila, Ardbeg, Clynelish, Springbank, Isle of Jura, Lagavulin, Talisker, Laphroig, OldPulteney, GlenScotia
    …
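The grouping idiom used here, a map whose missing keys default to an empty list, can be tried without Ignite. In this sketch the `predictions` list is a hypothetical stand-in for the values `mdl.predict(...)` would return; the four distilleries and their cluster numbers follow the listing above.

```groovy
def distilleries = ['Ardbeg', 'Cardhu', 'Lagavulin', 'Macallan']
def predictions = [4, 0, 4, 1]   // hypothetical model outputs (zero-based cluster ids)

// Missing keys start out as an empty list, so no explicit initialisation is needed
def clusters = [:].withDefault { [] }
predictions.eachWithIndex { prediction, k ->
    clusters[prediction] += distilleries[k]
}

// Sort by cluster id and print with one-based labels, as on the slide
def lines = clusters.sort { e -> e.key }.collect { k, v ->
    "Cluster ${k + 1}: ${v.join(', ')}".toString()
}
lines.each { println it }
```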

Whiskey flavors – scaling clustering

    var dist = new EuclideanDistance()
    …
    (Ignition.start(cfg) block as on the earlier Ignite slide, using KMeansTrainer().withDistance(dist).withAmountOfClusters(5))

[Diagram: Euclidean distance, labelled 5]

Whiskey flavors – scaling clustering

    var dist = new ManhattanDistance()
    …
    (Ignition.start(cfg) block as on the earlier Ignite slide, using KMeansTrainer().withDistance(dist).withAmountOfClusters(5))

[Diagram: Manhattan distance, legs labelled 4 and 3: 3 + 4 = 7]
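The two distance measures differ only in how the per-coordinate differences are combined. A small plain-Groovy sketch follows; the closures are illustrative and are not Ignite's `EuclideanDistance` or `ManhattanDistance` classes. It uses points three units apart on one axis and four on the other.

```groovy
// Straight-line distance: square root of the sum of squared differences
def euclidean = { List a, List b ->
    Math.sqrt((0..<a.size()).sum { (a[it] - b[it]) ** 2 } as double)
}
// City-block distance: sum of absolute differences
def manhattan = { List a, List b ->
    (0..<a.size()).sum { Math.abs(a[it] - b[it]) }
}

def p = [0, 0]
def q = [3, 4]
println "Euclidean: ${euclidean(p, q)}"   // 5.0
println "Manhattan: ${manhattan(p, q)}"   // 7
```

Because the measures can rank neighbours differently, swapping one for the other can change which centroid a point is assigned to.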

Whiskey flavors – scaling clustering

    Ignition.start(cfg).withCloseable { ignite ->
        println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
        var dataCache = ignite.createCache(new CacheConfiguration<Integer, double[]>(
            name: "TEST_${UUID.randomUUID()}",
            affinity: new RendezvousAffinityFunction(false, 10)))
        data.indices.each { int i -> dataCache.put(i, data[i]) }
        var trainer = new GmmTrainer().withMaxCountOfClusters(5)
        var mdl = trainer.fit(ignite, dataCache, vectorizer)
        …
        dataCache.destroy()
    }

Image source: wikipedia

Apache Spark
• Multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters
[Diagram: driver node with Spark Session, Cluster Manager, and worker nodes each running an Executor with Cache and Tasks]

Apache Spark MLlib
ML Algorithms
• Clustering
    • Kmeans
    • Bisecting Kmeans
    • Latent Dirichlet Allocation
    • Gaussian Mixture Model
    • Power Iteration Clustering
• Classification
• Regression
• Feature engineering
• Stats
• Utility functions
[Diagram: Your APP on top of MLlib and Spark SQL, on top of Spark Core]


Whiskey flavors – scaling clustering

    var spark = builder().config('spark.master', 'local[8]').appName('Whiskey').orCreate
    var file = WhiskeySpark.classLoader.getResource('whiskey.csv').file
    var rows = spark.read().format('com.databricks.spark.csv')
        .options(header: 'true', inferSchema: 'true').load(file)
    String[] colNames = rows.columns().toList() - ['RowID', 'Distillery']
    var assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features')
    var dataset = assembler.transform(rows)
    var kmeans = new KMeans(k: 5, seed: 1L)
    var model = kmeans.fit(dataset)
    println 'Cluster centers:'
    model.clusterCenters().each { println it.values().collect { sprintf '%.2f', it }.join(', ') }
    var result = model.transform(dataset)
    var clusters = result.toLocalIterator().collect { row ->
        [row.getAs('prediction'), row.getAs('Distillery')]
    }.groupBy { it[0] }.collectValues { it*.get(1) }
    clusters.each { k, v -> println "Cluster$k: ${v.join(', ')}"}
    spark.stop()

Output:
    Cluster centers:
    2.89, 2.42, 1.53, 0.05, 0.00, 1.84, 1.58, 2.11, 2.11, 2.11, 2.26, 1.58
    1.45, 2.35, 1.06, 0.26, 0.06, 0.84, 1.13, 0.45, 1.26, 1.65, 2.19, 2.10
    1.83, 3.17, 1.00, 0.33, 0.17, 1.00, 0.67, 0.83, 0.83, 1.50, 0.50, 1.50
    3.00, 1.50, 3.00, 2.80, 0.50, 0.30, 1.40, 0.50, 1.50, 1.50, 1.30, 0.50
    1.85, 2.20, 1.70, 0.40, 0.10, 1.85, 1.80, 1.00, 1.35, 2.00, 1.40, 1.85
    Cluster0: Aberfeldy, Aberlour, Auchroisk, Balmenach, BenNevis, Benrinnes, BlairAthol, Dailuaine, Dalmore, Edradour, Glendronach, Glendullan, Glenfarclas, Glenrothes, Longmorn, Macallan, Mortlach, RoyalLochnagar, Strathisla
    Cluster1: AnCnoc, Auchentoshan, Aultmore, Balblair, Benriach, Bladnoch, Bunnahabhain, Cardhu, Craigganmore, Dufftown, GlenElgin, GlenGrant, GlenKeith, GlenMoray, Glenallachie, Glenfiddich, Glengoyne, Glenkinchie, Glenlossie, Glenmorangie, Linkwood, Loch Lomond, Mannochmore, RoyalBrackla, Speyside, Strathmill, Tamdhu, Tamnavulin, Teaninich, Tobermory, Tullibardine
    Cluster3: Ardbeg, Caol Ila, Clynelish, GlenScotia, Isle of Jura, Lagavulin, Laphroig, Oban, OldPulteney, Talisker
    Cluster4: Ardmore, Belvenie, Benromach, Bowmore, Bruichladdich, Craigallechie, Dalwhinnie, Deanston, GlenGarioch, GlenOrd, Glenlivet, Glenturret, Highland Park, Inchgower, Knochando, OldFettercairn, Scapa, Springbank, Tomatin, Tomintoul
    Cluster2: ArranIsleOf, GlenDeveronMacduff, GlenSpey, Miltonduff, Speyburn, Tomore

Apache Wayang
• A unified data processing framework that seamlessly integrates and orchestrates multiple data platforms to deliver unparalleled performance and flexibility
Image source: Apache Wayang documentation

Apache Wayang
• Offers two approaches for us:
    • Roll your own Kmeans algorithm using existing operators
        • Built upon 4 abstractions: UnaryToUnaryOperator, BinaryToUnaryOperator, UnarySource, UnarySink
        • Many built-in operators: Map, Filter, Reduce, Distinct, Count, GroupBy
    • ML4all abstracts most ML algorithms with seven operators:
        • Transform, Stage, Compute, Update, Sample, Converge, Loop
        • Kmeans implementation included in next release
Image source: Apache Wayang documentation

Apache Wayang: Roll your own Kmeans
Domain classes:

    record Point(double[] pts) implements Serializable { }

    record PointGrouping(double[] pts, int cluster, long count) implements Serializable {
        PointGrouping(List<Double> pts, int cluster, long count) {
            this(pts as double[], cluster, count)
        }

        PointGrouping plus(PointGrouping that) {
            var newPts = pts.indices.collect{ pts[it] + that.pts[it] }
            new PointGrouping(newPts, cluster, count + that.count)
        }

        PointGrouping average() {
            new PointGrouping(pts.collect{ double d -> d/count }, cluster, 1)
        }
    }

Apache Wayang: Roll your own Kmeans
Algorithm class:

    class SelectNearestCentroid implements ExtendedSerializableFunction<Point, PointGrouping> {
        Iterable<PointGrouping> centroids

        void open(ExecutionContext context) {
            centroids = context.getBroadcast('centroids')
        }

        PointGrouping apply(Point p) {
            var minDistance = Double.POSITIVE_INFINITY
            var nearestCentroidId = -1
            for (c in centroids) {
                var distance = sqrt(p.pts.indices.collect{ p.pts[it] - c.pts[it] }.sum{ it ** 2 } as double)
                if (distance < minDistance) {
                    minDistance = distance
                    nearestCentroidId = c.cluster
                }
            }
            new PointGrouping(p.pts, nearestCentroidId, 1)
        }
    }

Apache Wayang: Roll your own Kmeans

    class PipelineOps {
        public static SerializableFunction<PointGrouping, Integer> cluster = tpc -> tpc.cluster
        public static SerializableFunction<PointGrouping, PointGrouping> average = tpc -> tpc.average()
        public static SerializableBinaryOperator<PointGrouping> plus = (tpc1, tpc2) -> tpc1 + tpc2
    }
    import static PipelineOps.*

    int k = 5
    int iterations = 10

    // read in data from our file
    var url = WhiskeyWayang.classLoader.getResource('whiskey.csv').file
    def rows = new File(url).readLines()[1..-1]*.split(',')
    var distilleries = rows*.getAt(1)
    var pointsData = rows.collect{ new Point(it[2..-1] as double[]) }
    var dims = pointsData[0].pts.size()

    // create some random points as initial centroids
    var r = new Random()
    var randomPoint = { (0..<dims).collect { r.nextGaussian() + 2 } as double[] }
    var initPts = (1..k).collect(randomPoint)

Apache Wayang: Roll your own Kmeans

    var context = new WayangContext()
        .withPlugin(Java.basicPlugin())
        .withPlugin(Spark.basicPlugin())
    var planBuilder = new JavaPlanBuilder(context, "KMeans ($url, k=$k, iterations=$iterations)")

    var points = planBuilder
        .loadCollection(pointsData).withName('Load points')

    var initialCentroids = planBuilder
        .loadCollection((0..<k).collect{ idx -> new PointGrouping(initPts[idx], idx, 0) })
        .withName('Load random centroids')

    var finalCentroids = initialCentroids.repeat(iterations, currentCentroids ->
        points.map(new SelectNearestCentroid())
            .withBroadcast(currentCentroids, 'centroids').withName('Find nearest centroid')
            .reduceByKey(cluster, plus).withName('Aggregate points')
            .map(average).withName('Average points')
            .withOutputClass(PointGrouping)
    ).withName('Loop').collect()
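The loop body is a map, a reduce-by-key and another map. A sketch of one iteration using plain Groovy collections instead of Wayang operators may make the data flow easier to follow. The maps with `pts`, `cluster` and `count` keys mirror the fields of `PointGrouping`, and the four points and two centroids are made up.

```groovy
def points = [[1d, 1d], [2d, 1d], [8d, 9d], [9d, 9d]]   // made-up data
def centroids = [0: [0d, 0d], 1: [10d, 10d]]            // cluster id -> position

// id of the centroid with the smallest squared distance to pt
def nearest = { pt ->
    centroids.min { e -> (0..<pt.size()).sum { (pt[it] - e.value[it]) ** 2 } }.key
}

// 'Find nearest centroid': tag each point with a cluster id and a count of 1
def tagged = points.collect { pt -> [pts: pt, cluster: nearest(pt), count: 1] }

// 'Aggregate points': per cluster, add coordinates and counts (the reduceByKey step)
def plus = { a, b ->
    [pts: [a.pts, b.pts].transpose()*.sum(), cluster: a.cluster, count: a.count + b.count]
}
def aggregated = tagged.groupBy { it.cluster }.collect { id, group -> group.inject(plus) }

// 'Average points': divide the summed coordinates by the count
def averaged = aggregated.collect { g ->
    [pts: g.pts.collect { it / g.count }, cluster: g.cluster, count: 1]
}
averaged.each { println "Cluster ${it.cluster}: ${it.pts}" }
```

In the Wayang version the same `plus` and `average` steps run as operators, so the framework can choose whether the Java or Spark plugin executes them.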

Apache Wayang: Roll your own Kmeans

    println 'Centroids:'
    finalCentroids.each { c ->
        println "Cluster $c.cluster: ${c.pts.collect { sprintf '%.2f', it }.join(', ')}"
    }

Output:
    Centroids:
    Cluster 0: 2.53, 1.65, 2.76, 2.12, 0.29, 0.65, 1.65, 0.59, 1.35, 1.41, 1.35, 0.94
    Cluster 2: 3.33, 2.56, 1.67, 0.11, 0.00, 1.89, 1.89, 2.78, 2.00, 1.89, 2.33, 1.33
    Cluster 3: 1.42, 2.47, 1.03, 0.22, 0.06, 1.00, 1.03, 0.47, 1.19, 1.72, 1.92, 2.08
    Cluster 4: 2.25, 2.38, 1.38, 0.08, 0.13, 1.79, 1.54, 1.33, 1.75, 2.17, 1.75, 1.79

    var allocator = new SelectNearestCentroid(centroids: finalCentroids)
    var allocations = pointsData.withIndex()
        .collect{ pt, idx -> [allocator.apply(pt).cluster, distilleries[idx]] }
        .groupBy{ cluster, ds -> "Cluster $cluster" }
        .collectValues{ v -> v.collect{ it[1] } }
        .sort{ e1, e2 -> e1.key <=> e2.key }
    allocations.each{ c, ds -> println "$c (${ds.size()} members): ${ds.join(', ')}" }

Output:
    Cluster 0 (17 members): Ardbeg, Balblair, Bowmore, Bruichladdich, Caol Ila, Clynelish, GlenGarioch, GlenScotia, Highland Park, Isle of Jura, Lagavulin, Laphroig, Oban, OldPulteney, Springbank, Talisker, Teaninich
    Cluster 2 (9 members): Aberlour, Balmenach, Dailuaine, Dalmore, Glendronach, Glenfarclas, Macallan, Mortlach, RoyalLochnagar
    Cluster 3 (36 members): AnCnoc, ArranIsleOf, Auchentoshan, Aultmore, Benriach, Bladnoch, Bunnahabhain, Cardhu, Craigganmore, Dalwhinnie, Dufftown, GlenElgin, GlenGrant, GlenMoray, GlenSpey, Glenallachie, Glenfiddich, Glengoyne, Glenkinchie, Glenlossie, Glenmorangie, Inchgower, Linkwood, Loch Lomond, Mannochmore, Miltonduff, RoyalBrackla, Speyburn, Speyside, Strathmill, Tamdhu, Tamnavulin, Tobermory, Tomintoul, Tomore, Tullibardine
    Cluster 4 (24 members): Aberfeldy, Ardmore, Auchroisk, Belvenie, BenNevis, Benrinnes, Benromach, BlairAthol, Craigallechie, Deanston, Edradour, GlenDeveronMacduff, GlenKeith, GlenOrd, Glendullan, Glenlivet, Glenrothes, Glenturret, Knochando, Longmorn, OldFettercairn, Scapa, Strathisla, Tomatin

Apache Wayang: ML4all

    int k = 3
    int maxIterations = 100
    double accuracy = 0

    class TransformCSV extends Transform<double[], String> {
        double[] transform(String input) {
            input.split(',')[2..-1] as double[]
        }
    }

    class KMeansStageWithRandoms extends LocalStage {
        int k, dimension
        private r = new Random()

        void staging(ML4allModel model) {
            double[][] centers = new double[k][]
            for (i in 0..<k) {
                centers[i] = (0..<dimension).collect { r.nextGaussian() + 2 } as double[]
            }
            model.put('centers', centers)
        }
    }

Apache Wayang: ML4all

    var url = WhiskeyWayangML.classLoader.getResource('whiskey_noheader.csv').path
    var dims = 12
    var context = new WayangContext()
        .withPlugin(Spark.basicPlugin())
        .withPlugin(Java.basicPlugin())

    var plan = new ML4allPlan(
        transformOp: new TransformCSV(),
        localStage: new KMeansStageWithRandoms(k: k, dimension: dims),
        computeOp: new KMeansCompute(),
        updateOp: new KMeansUpdate(),
        loopOp: new KMeansConvergeOrMaxIterationsLoop(accuracy, maxIterations)
    )

    var model = plan.execute('file:' + url, context)
    model.getByKey("centers").eachWithIndex { center, idx ->
        var pts = center.collect { sprintf '%.2f', it }.join(', ')
        println "Cluster$idx: $pts"
    }

Output:
    Cluster0: 1.57, 2.32, 1.32, 0.45, 0.09, 1.08, 1.19, 0.60, 1.26, 1.74, 1.72, 1.85
    Cluster1: 3.43, 1.57, 3.43, 3.14, 0.57, 0.14, 1.71, 0.43, 1.29, 1.43, 1.29, 0.14
    Cluster2: 2.73, 2.42, 1.46, 0.04, 0.04, 1.88, 1.69, 1.88, 1.92, 2.04, 2.12, 1.81

Apache Beam®
• Apache Beam offers a unified programming model for batch and streaming data processing pipelines
• The pipeline abstraction encapsulates all the data and steps in your data processing task
• Apache Beam unifies multiple data processing engines and SDKs around its distinctive Beam model
• Several language SDKs: Java, Groovy (via Java SDK), Python, Go, SQL, …
Image sources: Apache Beam documentation

Apache Beam Kmeans

    record Point(double[] pts) implements Serializable {
        private static Random r = new Random()
        private static Closure<double[]> randomPoint = { dims ->
            (1..dims).collect { r.nextGaussian() + 2 } as double[]
        }

        static Point ofRandom(int dims) {
            new Point(randomPoint(dims))
        }

        String toString() {
            "Point[${pts.collect{ sprintf '%.2f', it }.join('. ')}]"
        }
    }

    record Points(List<Point> pts) implements Serializable { }

Apache Beam Kmeans

    var readCsv = new DoFn<String, Point>() {
        @ProcessElement
        void processElement(@Element String path, OutputReceiver<Point> receiver) throws IOException {
            def parser = CSV.builder().setHeader().setSkipHeaderRecord(true).build()
            def records = new File(path).withReader{ rdr -> parser.parse(rdr).records*.toList() }
            records.each { receiver.output(new Point(it[2..-1] as double[])) }
        }
    }

    var pointArray2out = new DoFn<Points, String>() {
        @ProcessElement
        void processElement(@Element Points pts, OutputReceiver<String> out) {
            String log = "Centroids:\n${pts.pts()*.toString().join('\n')}"
            out.output(log)
        }
    }

Apache Beam Kmeans

    class MeanDoubleArrayCols implements SerializableFunction<Iterable<Point>, Point> {
        @Override
        Point apply(Iterable<Point> inputs) {
            double[] result = new double[12]
            int count = 0
            for (Point input : inputs) {
                result.indices.each { result[it] += input.pts()[it] }
                count++
            }
            result.indices.each { result[it] /= count }
            new Point(result)
        }
    }

    class Squash extends Combine.CombineFn<KV<Integer, Point>, Accum, Points> {
        int k, dims

        @Override
        Accum createAccumulator() { new Accum() }

        @Override
        Accum addInput(Accum mutableAccumulator, KV<Integer, Point> input) { … }

        @Override
        Accum mergeAccumulators(Iterable<Accum> accumulators) { … }

        @Override
        Points extractOutput(Accum accumulator) { … }

        static class Accum implements Serializable {
            List<Point> pts = []
        }
    }

Apache Beam Kmeans

    var assign = { Point pt, Points centroids ->
        var minDistance = Double.POSITIVE_INFINITY
        var nearestCentroidId = -1
        var idxs = pt.pts().indices
        centroids.pts().eachWithIndex { Point next, int cluster ->
            var distance = sqrt(sumSq(idxs.collect { pt.pts()[it] - next.pts()[it] } as double[]))
            if (distance < minDistance) {
                minDistance = distance
                nearestCentroidId = cluster
            }
        }
        KV.of(nearestCentroidId, pt)
    }

Apache Beam Kmeans

    Points initCentroids = new Points((1..k).collect{ Point.ofRandom(dims) })

    var points = p
        .apply(Create.of(filename))
        .apply('Read points', ParDo.of(readCsv))

    var centroids = p.apply(Create.of(initCentroids))

    iterations.times {
        var centroidsView = centroids
            .apply(View.<Points> asSingleton())

        centroids = points
            .apply('Assign clusters',
                ParDo.of(new AssignClusters(centroidsView, assign)).withSideInputs(centroidsView))
            .apply('Calculate new centroids',
                Combine.<Integer, Point> perKey(new MeanDoubleArrayCols()))
            .apply('As Points',
                Combine.<KV<Integer, Point>, Points> globally(new Squash(k: k, dims: dims)))
    }

    centroids
        .apply('Display centroids', ParDo.of(pointArray2out)).apply(Log.ofElements())

Apache Beam Kmeans

    int k = 5
    int iterations = 10
    int dims = 12

    var pipeline = Pipeline.create()
    def csv = getClass().classLoader.getResource('whiskey.csv').path
    buildPipeline(pipeline, csv, k, iterations, dims)
    pipeline.run().waitUntilFinish()

Output:
    May 29, 2024 5:47:06 PM org.codehaus.groovy.vmplugin.v8.IndyInterface fromCache
    INFO: Centroids:
    Point[1.22. 2.87. 0.78. 0.11. 0.35. 0.90. 1.87. 0.81. 0.94. 1.86. 1.65. 1.67]
    Point[3.67. 1.50. 3.67. 3.33. 0.67. 0.17. 1.67. 0.50. 1.17. 1.33. 1.17. 0.17]
    Point[1.29. 1.62. 1.00. 0.10. 0.02. 1.17. 0.40. 0.31. 1.36. 1.93. 2.00. 2.14]
    Point[2.81. 2.43. 1.52. 0.05. 0.00. 2.00. 1.71. 2.05. 1.95. 2.05. 2.19. 1.71]
    Point[1.86. 2.00. 1.93. 1.07. 0.21. 1.29. 1.29. 1.00. 1.57. 1.86. 1.00. 1.00]

Apache Beam with Groovy metaprogramming for Python-style coding
Points initCentroids = new Points((1..k).collect { Point.ofRandom(dims) })
var points = p
    | Create.of(filename)
    | 'Read points' >> ParDo.of(readCsv)
var centroids = p | Create.of(initCentroids)
iterations.times {
    var centroidsView = centroids | View.asSingleton()
    centroids = points
        | 'Assign clusters' >> ParDo.of(new AssignClusters(centroidsView, assign)).withSideInputs(centroidsView)
        | 'Calculate new centroids' >> Combine.perKey(new MeanDoubleArrayCols())
        | 'As Points' >> Combine.globally(new Squash(k: k, dims: dims))
}
centroids | 'Display centroids' >> ParDo.of(pointArray2out) | Log.ofElements()
INFO: Centroids:
Point[4.00. 1.33. 4.00. 4.00. 0.67. 0.00. 1.00. 1.00. 1.00. 1.33. 0.67. 0.00]
Point[1.56. 2.58. 1.07. 0.02. 0.08. 1.07. 1.00. 0.59. 1.39. 1.60. 1.52. 1.76]
Point[2.18. 1.88. 2.35. 1.59. 0.24. 0.76. 1.76. 0.47. 1.41. 1.47. 1.65. 1.29]
Point[2.42. 2.48. 1.27. 0.08. 0.08. 1.84. 1.73. 1.95. 1.98. 2.15. 2.16. 1.92]
Point[2.24. 2.20. 3.55. 1.85. 1.58. 1.50. 1.97. 2.45. 0.84. 2.02. 0.77. 0.73]
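The Python-style syntax above relies on Groovy operator overloading: a | b calls a.or(b), and a >> b calls a.rightShift(b). The slides do not show how these methods were attached to the Beam types, so the following is only a sketch of the mechanism using toy classes. Flow and Named are hypothetical stand-ins, not Beam classes, and adding rightShift to String through the metaclass is one possible approach, not necessarily the one used in the deck.

```groovy
// Toy stand-ins for illustration only; these are not Beam classes.
class Named { String name; Closure fn }

class Flow {
    List items
    List<String> labels = []
    // flow | closure  ->  Groovy calls flow.or(closure)
    Flow or(Closure fn) { new Flow(items: items.collect(fn), labels: labels) }
    // flow | ('label' >> closure)  ->  records the label, then applies the closure
    Flow or(Named step) { new Flow(items: items.collect(step.fn), labels: labels + step.name) }
}

// 'label' >> closure  ->  Groovy calls 'label'.rightShift(closure)
String.metaClass.rightShift = { Closure fn -> new Named(name: delegate, fn: fn) }

def result = new Flow(items: [1, 2, 3]) | 'Double' >> { it * 2 } | { it + 1 }
println result.items
println result.labels
```

Because >> binds more tightly than |, each label attaches to its transform before the pipe applies it, which is why the slide's chains need no parentheses.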

Apache Flink®
• Distributed processing engine for stateful computations over unbounded and bounded data streams
Image sources: Apache Flink documentation

Apache Flink ML
ML Algorithms
• Clustering
  • Kmeans
  • AgglomerativeClustering
• Classification
• Regression
• Evaluation
• Feature engineering
• Recommendation
• Stats
• Utility functions
Image based on Apache Flink documentation

Flink ML KMeans
var eEnv = StreamExecutionEnvironment.executionEnvironment
var tEnv = StreamTableEnvironment.create(eEnv)
var file = WhiskeyFlink.classLoader.getResource('whiskey.csv').file
var source = FileSource.forRecordStreamFormat(new TextLineInputFormat(), new Path(file)).build()
var stream = eEnv
    .fromSource(source, WatermarkStrategy.noWatermarks(), "csvfile")
    .filter(skipHeader).flatMap(splitAndChop)
var inputTable = tEnv.fromDataStream(stream).as("features")
var kmeans = new KMeans(k: 3, seed: 1L)
var kmeansModel = kmeans.fit(inputTable)
var outputTable = kmeansModel.transform(inputTable)[0]
var clusters = [:].withDefault { [] }
outputTable.execute().collect().each { row ->
    var features = row.getField(kmeans.featuresCol)
    var clusterId = row.getField(kmeans.predictionCol)
    clusters[clusterId] << features
}
clusters.each { k, v ->
    println "Cluster $k has ${v.size()} members:\n${v.join('\n')}"
}
Cluster 2 has 23 members:
[2.0, 2.0, 2.0, 0.0, 0.0, 2.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0]
[3.0, 3.0, 1.0, 0.0, 0.0, 4.0, 3.0, 2.0, 2.0, 3.0, 3.0, 2.0]
[2.0, 3.0, 1.0, 0.0, 0.0, 2.0, 1.0, 2.0, 2.0, 2.0, 2.0, 1.0]
[4.0, 3.0, 2.0, 0.0, 0.0, 2.0, 1.0, 3.0, 3.0, 0.0, 1.0, 2.0]
…
Cluster 0 has 46 members:
[1.0, 3.0, 2.0, 0.0, 0.0, 2.0, 0.0, 0.0, 2.0, 2.0, 3.0, 2.0]
[2.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 3.0, 1.0, 1.0]
[2.0, 3.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 2.0]
[0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 2.0, 2.0, 3.0, 3.0]
…

Flink ML Online KMeans
…
var data = stream.executeAndCollect().collect{ Row.of(it) }
var train = data[0..79]
var predict = data[80..-1]
var trainSource = new PeriodicSourceFunction(1000, train.collate(8))
var trainStream = eEnv.addSource(trainSource, new RowTypeInfo(DenseVectorTypeInfo.INSTANCE))
var trainTable = tEnv.fromDataStream(trainStream).as("features")
var predictSource = new PeriodicSourceFunction(1000, [predict])
var predictStream = eEnv.addSource(predictSource, new RowTypeInfo(DenseVectorTypeInfo.INSTANCE))
var predictTable = tEnv.fromDataStream(predictStream).as("features")
var kmeans = new OnlineKMeans(featuresCol: 'features', predictionCol: 'prediction',
    globalBatchSize: 8, initialModelData: randomInit, k: 3)
var kmeansModel = kmeans.fit(trainTable)
var outputTable = kmeansModel.transform(predictTable)[0]
outputTable.execute().collect().each { row ->
    DenseVector features = (DenseVector) row.getField(kmeans.featuresCol)
    var clusterId = row.getField(kmeans.predictionCol)
    println "Cluster $clusterId: ${features}"
}
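The training split above is cut into mini-batches with collate(8), which matches the globalBatchSize: 8 passed to OnlineKMeans on the same slide. The sketch below shows what the range subscripts and collate produce on a plain list. The 86-element list of integers is a hypothetical stand-in for the collected Row objects.

```groovy
// Hypothetical stand-in for the collected rows: 86 integers instead of Row objects
def data = (1..86).toList()

def train = data[0..79]      // first 80 elements
def predict = data[80..-1]   // everything from index 80 to the end

// collate(8) splits the list into consecutive sublists of up to 8 elements
def batches = train.collate(8)

assert train.size() == 80
assert predict == [81, 82, 83, 84, 85, 86]
assert batches.size() == 10
assert batches.every { it.size() == 8 }
assert batches[0] == [1, 2, 3, 4, 5, 6, 7, 8]
```

With 80 training elements the batches divide evenly. For a length that is not a multiple of 8, collate would return a shorter final batch.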

Flink ML Online KMeans
Cluster 1: [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 2.0, 2.0, 2.0]
Cluster 2: [2.0, 3.0, 0.0, 0.0, 1.0, 0.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0]
Cluster 1: [0.0, 3.0, 1.0, 0.0, 0.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.0, 2.0]
Cluster 2: [2.0, 3.0, 2.0, 0.0, 0.0, 2.0, 2.0, 1.0, 1.0, 2.0, 0.0, 1.0]
Cluster 2: [2.0, 2.0, 2.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 2.0]
Cluster 0: [2.0, 2.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
…
Cluster 1: [0.0, 3.0, 1.0, 0.0, 0.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.0, 2.0]
Cluster 1: [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 2.0, 2.0, 2.0]
Cluster 1: [2.0, 3.0, 0.0, 0.0, 1.0, 0.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0]
Cluster 0: [2.0, 3.0, 2.0, 0.0, 0.0, 2.0, 2.0, 1.0, 1.0, 2.0, 0.0, 1.0]
Cluster 2: [2.0, 2.0, 2.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 2.0]
Cluster 2: [2.0, 2.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
…

Questions?
Twitter/X | Mastodon: @paulk_asert | @[email protected]
Apache Groovy: https://groovy.apache.org/ https://groovy-lang.org/
Repo: https://github.com/paulk-asert/groovy-data-science
Slides: https://speakerdeck.com/paulk/groovy-whiskey
