Machine Learning on Spark @ Strata Conference

Transcript

Machine learning techniques: Classification, Regression, Clustering, Active learning, Collaborative filtering

Implementing Machine Learning
§ Machine learning algorithms are
  - Complex, multi-stage
  - Iterative
§ MapReduce/Hadoop unsuitable
§ Need efficient primitives for data sharing

Machine Learning using Spark
§ Spark RDDs → efficient data sharing
§ In-memory caching accelerates performance
  - Up to 20x faster than Hadoop
§ Easy to use high-level programming interface
  - Express complex algorithms in ~100 lines

K-Means Clustering using Spark
Focus: Implementation and Performance

Clustering
Grouping data according to similarity
E.g. archaeological dig
[Figure: plot with axes Distance East and Distance North]

K-Means Algorithm
Benefits
• Popular
• Fast
• Conceptually straightforward
[Figure: plot with axes Distance East and Distance North; e.g. archaeological dig]

K-Means: preliminaries
Data: Collection of values
[Figure: plot with axes Feature 1 and Feature 2]

data = lines.map(line=> parseVector(line))
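The slide does not show `parseVector`. A minimal sketch, assuming each input line holds whitespace-separated numbers (the input format is an assumption, and a plain Scala `Seq` stands in for the Spark dataset so the snippet runs without a cluster):

```scala
object ParseSketch {
  // Hypothetical helper: turn one line of whitespace-separated numbers
  // into a feature vector. The real input format is an assumption.
  // Blank or non-numeric lines would throw NumberFormatException.
  def parseVector(line: String): Vector[Double] =
    line.trim.split("\\s+").map(_.toDouble).toVector

  def main(args: Array[String]): Unit = {
    // A local Seq stands in for the distributed collection of lines.
    val lines = Seq("1.0 2.0", "3.5 4.5")
    val data = lines.map(line => parseVector(line))
    data.foreach(println)
  }
}
```

With Spark, the same `map` call would be applied to the dataset of lines instead of a local `Seq`.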

K-Means: preliminaries
Dissimilarity: Squared Euclidean distance
[Figure: plot with axes Feature 1 and Feature 2]

dist = p.squaredDist(q)
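Squared Euclidean distance is the sum of squared per-feature differences between two points. As a rough illustration, here is a standalone function doing what `p.squaredDist(q)` computes; the `DistanceSketch` object and the use of `Vector[Double]` are assumptions for this sketch, not the vector class used in the talk:

```scala
object DistanceSketch {
  // Sum over features of (p_i - q_i)^2. No square root is taken,
  // which is enough for comparing which center is closest.
  def squaredDist(p: Vector[Double], q: Vector[Double]): Double = {
    require(p.length == q.length, "vectors must have the same dimension")
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum
  }

  def main(args: Array[String]): Unit = {
    // (3 - 0)^2 + (4 - 0)^2 = 25
    println(squaredDist(Vector(0.0, 0.0), Vector(3.0, 4.0)))
  }
}
```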

K-Means: preliminaries
K = Number of clusters
Data assignments to clusters S1, S2, ..., SK
[Figure: plot with axes Feature 1 and Feature 2]

K-Means Algorithm
• Initialize K cluster centers
• Repeat until convergence:
    Assign each data point to the cluster with the closest center.
    Assign each cluster center to be the mean of its cluster's data points.
[Figure: plot with axes Feature 1 and Feature 2]

K-Means Algorithm (each step with its Spark code)
• Initialize K cluster centers
    centers = data.takeSample(false, K, seed)
• Repeat until convergence:
    Assign each data point to the cluster with the closest center.
        closest = data.map(p => (closestPoint(p, centers), p))
    Assign each cluster center to be the mean of its cluster's data points.
        pointsGroup = closest.groupByKey()
        newCenters = pointsGroup.mapValues(ps => average(ps))
    while (dist(centers, newCenters) > ɛ)

K-Means Source
centers = data.takeSample(false, K, seed)
while (d > ɛ) {
  closest = data.map(p => (closestPoint(p, centers), p))
  pointsGroup = closest.groupByKey()
  newCenters = pointsGroup.mapValues(ps => average(ps))
  d = distance(centers, newCenters)
  centers = newCenters.map(_)
}
[Figure: plot with axes Feature 1 and Feature 2]
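The slide code leaves `closestPoint`, `average`, and `distance` undefined. The loop can be tried end to end with the following sketch, which runs on plain Scala collections rather than Spark. Everything here is an assumption made for illustration: the helper bodies, the `KMeansSketch` object, the local `groupBy` standing in for `groupByKey`, and passing the initial centers in explicitly instead of sampling them with `takeSample`.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  def squaredDist(p: Point, q: Point): Double =
    p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  // Index of the center nearest to p (assumed behaviour of closestPoint).
  def closestPoint(p: Point, centers: IndexedSeq[Point]): Int =
    centers.indices.minBy(i => squaredDist(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def average(ps: Seq[Point]): Point =
    ps.transpose.map(_.sum / ps.size).toVector

  def kMeans(data: Seq[Point],
             initial: IndexedSeq[Point],
             epsilon: Double): IndexedSeq[Point] = {
    var centers = initial
    var d = Double.MaxValue
    while (d > epsilon) {
      // Assignment step: group points by the index of their closest center.
      val pointsGroup = data.groupBy(p => closestPoint(p, centers))
      // Update step: each center moves to the mean of its group;
      // a center with no assigned points stays where it is.
      val newCenters = centers.indices.map { i =>
        pointsGroup.get(i).map(average).getOrElse(centers(i))
      }
      // Total squared movement of the centers in this iteration.
      d = centers.zip(newCenters).map { case (a, b) => squaredDist(a, b) }.sum
      centers = newCenters
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Vector(0.0, 0.0), Vector(0.0, 2.0),
                   Vector(10.0, 0.0), Vector(10.0, 2.0))
    val initial = Vector(Vector(0.0, 0.0), Vector(10.0, 0.0))
    println(kMeans(data, initial, 1e-9))
  }
}
```

In the Spark version, `data` would be a cached RDD, so each iteration reuses the in-memory dataset instead of reloading it.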

Ease of use
§ Interactive shell: Useful for featurization, pre-processing data
§ Lines of code for K-Means
  - Spark ~ 90 lines (Part of hands-on tutorial!)
  - Hadoop/Mahout ~ 4 files, > 300 lines

Performance [Zaharia et al., NSDI'12]

K-Means: iteration time (s)
Number of machines     25    50   100
Hadoop                274   157   106
HadoopBinMem          197   121    87
Spark                 143    61    33

Logistic Regression: iteration time (s)
Number of machines     25    50   100
Hadoop                184   111    76
HadoopBinMem          116    80    62
Spark                  15     6     3
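To make the chart easier to read as speedups, the sketch below divides the Hadoop iteration time by the Spark iteration time at each cluster size. The numbers are the chart's data labels, read as one series per system at 25, 50 and 100 machines; the `SpeedupSketch` object and its names are made up for this illustration. On these figures the ratio is roughly 1.9x to 3.2x for K-Means and roughly 12x to 25x for logistic regression.

```scala
object SpeedupSketch {
  // Iteration times in seconds at 25, 50 and 100 machines (chart labels).
  val machines     = Seq(25, 50, 100)
  val kMeansHadoop = Seq(274.0, 157.0, 106.0)
  val kMeansSpark  = Seq(143.0, 61.0, 33.0)
  val logRegHadoop = Seq(184.0, 111.0, 76.0)
  val logRegSpark  = Seq(15.0, 6.0, 3.0)

  // Speedup = baseline time / Spark time, per cluster size.
  def speedups(baseline: Seq[Double], spark: Seq[Double]): Seq[Double] =
    baseline.zip(spark).map { case (b, s) => b / s }

  def main(args: Array[String]): Unit = {
    val km = speedups(kMeansHadoop, kMeansSpark)
    val lr = speedups(logRegHadoop, logRegSpark)
    machines.indices.foreach { i =>
      println(f"${machines(i)}%3d machines: K-Means ${km(i)}%.1fx, LogReg ${lr(i)}%.1fx")
    }
  }
}
```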

Conclusion
§ Spark: Framework for cluster computing
§ Fast and easy machine learning programs
§ K-Means clustering using Spark
§ Hands-on exercise this afternoon!
Examples and more: www.spark-project.org

Reynold Xin

Machine Learning on Spark
Shivaram Venkataraman
UC Berkeley

Computer Science Machine learning Statistics

Machine learning: Spam filters, Recommendations, Click prediction, Search ranking
