Lightning-fast Machine Learning with Spark
Probst Ludwine
November 11, 2014
Transcript
@nivdul #DV14 #MLwithSpark
Lightning-fast Machine Learning with Spark
Ludwine Probst
Me: data engineer; leader of Duchess France
Machine Learning
Lay of the land: MapReduce
MapReduce
HDFS with iterative algorithms
Spark is a fast and general engine for large-scale data processing
•big data analytics in memory or on disk
•complements Hadoop
•fast and more flexible
•Resilient Distributed Datasets (RDD)
•shared variables
Shared variables
•broadcast variables
    val broadcastVar = sc.broadcast(Array(1, 2, 3))
•accumulators
    val acc = sc.accumulator(0, "MyAccumulator")
    sc.parallelize(Array(1, 2, 3)).foreach(x => acc += x)
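To make the two kinds of shared variables more concrete, here is a minimal sketch that uses a broadcast variable as a read-only lookup table and an accumulator as a counter. It assumes Spark 2.0 or later, where `longAccumulator` is available (the slide uses the older `sc.accumulator` call). The object name, the `resolveTitles` helper and the movie titles are hypothetical and only serve the illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  // Returns the resolved titles and the number of ids missing from the lookup table.
  def resolveTitles(sc: SparkContext, movieIds: Seq[Int]): (Seq[String], Long) = {
    // Broadcast variable: a read-only lookup table sent to each executor once.
    // The titles are made-up sample data.
    val titles = sc.broadcast(Map(1 -> "Alien", 2 -> "Brazil"))

    // Accumulator: tasks may only add to it; the driver reads the total.
    // Assumption: Spark 2.0+ (longAccumulator); older versions use sc.accumulator.
    val missing = sc.longAccumulator("missingTitles")

    val ids = sc.parallelize(movieIds)
    val resolved = ids.map(id => titles.value.getOrElse(id, "unknown")).collect().toSeq

    // Update the accumulator inside an action (foreach), as on the slide.
    ids.foreach(id => if (!titles.value.contains(id)) missing.add(1))

    (resolved, missing.value.longValue())
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shared-variables-sketch").setMaster("local[2]"))
    try println(resolveTitles(sc, Seq(1, 2, 3)))
    finally sc.stop()
  }
}
```

The accumulator is updated inside `foreach` rather than inside `map` because Spark only guarantees exactly-once accumulator updates for actions; updates made inside transformations can be applied again if a task is re-executed.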
RDD (Resilient Distributed Datasets)
•process in parallel
•controllable persistence (memory, disk…)
•higher-level operations (transformations & actions)
•rebuilt automatically using lineage
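The RDD properties listed above can be sketched in a few lines: transformations are lazy, persistence is chosen explicitly, an action triggers the computation, and the lineage Spark would use to rebuild lost partitions can be printed. The object and method names below are hypothetical, and the storage level is just one possible choice.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddSketch {
  def sumOfEvenSquares(sc: SparkContext): Long = {
    // Distribute the numbers 1 to 10 over 4 partitions, processed in parallel.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)

    // Transformations are lazy: nothing is computed yet.
    val evenSquares = numbers.filter(_ % 2 == 0).map(n => n.toLong * n)

    // Controllable persistence: keep in memory, spill to disk if needed.
    evenSquares.persist(StorageLevel.MEMORY_AND_DISK)

    // The lineage (chain of parent RDDs) used to rebuild lost partitions.
    println(evenSquares.toDebugString)

    // An action triggers the actual computation.
    evenSquares.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))
    try println(sumOfEvenSquares(sc))
    finally sc.stop()
  }
}
```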
Data storage: InputFormat, Cassandra
Spark data flow
Languages
•interactive shell (Scala & Python)
•lambdas (Java 8)
WordCount example (Scala)
    import org.apache.spark.{SparkConf, SparkContext}

    // configure an application that runs locally
    val conf = new SparkConf()
      .setAppName("Spark word count")
      .setMaster("local")

    val sc = new SparkContext(conf)
    // load the data
    val data = sc.textFile("filepath/wordcount.txt")

    // map then reduce step
    val wordCounts = data.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // persist the data
    wordCounts.cache()
    // keep words which appear more than 2 times
    val filteredWordCount = wordCounts.filter { case (key, value) => value > 2 }

    filteredWordCount.count()
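The word count steps can be tried without an input file. The sketch below runs the same flatMap / map / reduceByKey / filter pipeline on a few in-memory lines. The object name, the `frequentWords` helper, its `minCount` parameter and the sample sentences are assumptions made for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  // Counts words in the given lines and keeps those seen at least minCount times.
  def frequentWords(sc: SparkContext, lines: Seq[String], minCount: Int): Map[String, Int] = {
    val wordCounts = sc.parallelize(lines)
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)          // drop empty tokens produced by leading whitespace
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts
      .filter { case (_, count) => count >= minCount }
      .collectAsMap()              // bring the (small) result back to the driver
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("word-count-sketch").setMaster("local[2]"))
    // Made-up sample input.
    val lines = Seq("spark is fast", "spark is flexible", "spark runs in memory")
    try println(frequentWords(sc, lines, minCount = 3))
    finally sc.stop()
  }
}
```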
Spark ecosystem
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications
Spark SQL unifies access to structured data
GraphX is Apache Spark's API for graphs and graph-parallel computation
MLlib is Apache Spark's scalable machine learning library
Machine learning with Spark / MLlib
Machine learning libraries: scikit
Example: build a movie recommender system
Collaborative filtering with Alternating Least Squares (ALS)
userID  movieID  rating
1       3        5
1       28       4
2       18       3
2       5        5
    import org.apache.spark.mllib.recommendation.Rating

    // load and parse the data
    val data = sc.textFile("movies.txt")

    // create an RDD[Rating] from whitespace-separated userID, movieID, rating
    val ratings = data.map(_.split("\\s+") match {
      case Array(user, movie, rate) => Rating(user.toInt, movie.toInt, rate.toDouble)
    })
    // split the data into training set and test set
    val splits = ratings.randomSplit(Array(0.8, 0.2))

    // persist the training set
    val training = splits(0).cache()
    val test = splits(1)
    import org.apache.spark.mllib.recommendation.ALS

    // build the recommendation model using ALS
    // (the last argument is the regularization parameter lambda)
    val model = ALS.train(training, rank = 10, iterations = 20, lambda = 1.0)
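For orientation, the rank and the regularization argument of the call above can be related to the usual textbook form of the matrix-factorization objective that ALS minimizes. The symbols are notation introduced for this sketch, not taken from the slides: $r_{ui}$ is an observed rating, $K$ the set of observed (user, movie) pairs, and $x_u$, $y_i$ are user and movie factor vectors whose dimension is the rank. MLlib's implementation may scale the regularization term differently.

```latex
\min_{X,\,Y} \sum_{(u,i) \in K} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```

"Alternating" refers to the solving strategy: with the movie factors $Y$ held fixed, the problem is an ordinary least squares problem in $X$, and vice versa, so the two are solved in turn for the requested number of iterations.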
    // evaluate the model
    val userMovies = test.map { case Rating(user, movie, rate) => (user, movie) }
    val predictions = model.predict(userMovies).map {
      case Rating(user, movie, rate) => ((user, movie), rate)
    }

    val ratesAndPreds = test.map {
      case Rating(user, movie, rate) => ((user, movie), rate)
    }.join(predictions)

    // measure the Mean Squared Error of the rating predictions
    val MSE = ratesAndPreds.map { case ((user, movie), (r1, r2)) =>
      val err = r1 - r2
      err * err
    }.mean()
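As a small worked illustration of the Mean Squared Error used in the evaluation step, the sketch below computes it on plain Scala collections, without Spark. The object name and the sample (actual, predicted) pairs are made up for the example.

```scala
object MseSketch {
  // Mean of the squared differences between actual and predicted ratings.
  def mse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "at least one (actual, predicted) pair is needed")
    pairs.map { case (actual, predicted) =>
      val err = actual - predicted
      err * err
    }.sum / pairs.size
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical (actual, predicted) ratings.
    // Errors: 0.5, -0.5, 2.0; squared: 0.25, 0.25, 4.0; mean: 4.5 / 3 = 1.5
    val pairs = Seq((5.0, 4.5), (3.0, 3.5), (4.0, 2.0))
    println(mse(pairs))
    // The root of the MSE (RMSE) is in the same unit as the ratings.
    println(math.sqrt(mse(pairs)))
  }
}
```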
    // recommend 10 movies to user 2, highest predicted rating first
    val recommendations = model.recommendProducts(2, 10)
      .sortBy(- _.rating)

    var i = 1
    recommendations.foreach { r =>
      println(r.product + " with rating " + r.rating)
      i += 1
    }
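Put together, the recommender steps can be run end to end on the four sample ratings from the userID / movieID / rating slide. This is only a sketch: the object and method names are hypothetical, the rank, iteration count and lambda are arbitrary small values chosen for a toy dataset, and with so little data the predicted ratings carry no real meaning.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object RecommenderSketch {
  // Trains ALS on the sample ratings and returns recommendations for one user.
  def topMovies(sc: SparkContext, userId: Int, howMany: Int): Array[Rating] = {
    // The four (userID, movieID, rating) rows shown on the slide.
    val ratings = sc.parallelize(Seq(
      Rating(1, 3, 5.0),
      Rating(1, 28, 4.0),
      Rating(2, 18, 3.0),
      Rating(2, 5, 5.0)))

    // Assumed toy settings: rank = 2, iterations = 5, lambda = 0.1.
    val model = ALS.train(ratings, 2, 5, 0.1)

    // Same call as on the slide, sorted by descending predicted rating.
    model.recommendProducts(userId, howMany).sortBy(-_.rating)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("recommender-sketch").setMaster("local[2]"))
    try {
      topMovies(sc, userId = 2, howMany = 2).foreach { r =>
        println(r.product + " with rating " + r.rating)
      }
    } finally sc.stop()
  }
}
```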
Performance: Spark core vs Hadoop MapReduce
How fast can a system sort 100 TB of data on disk?
http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html
Performance: Spark / MLlib
Collaborative filtering with MLlib vs Mahout
https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
Why should I care?
•fast and easy Machine Learning with MLlib
•fast & flexible
•in-memory / on-disk
•SQL, Streaming, MLlib