Spark Machine Learning 101 @HadoopCon
Chu-Yu Hsu
September 19, 2015
Transcript
Spark Machine Learning 101 Chu-Yu Hsu @ HadoopCon 2015
About Me
Chu-Yu Hsu, 許儲羽
• Software Engineer
• Machine Learning Practitioner
• Uses Spark ML and Python in daily work and in Kaggle competitions
• http://blog.chuyuhsu.ml
Outline
• Introduction to Spark ML
• Alternating Least Squares (ALS)
• Hands-on example
Apache Spark MLlib
• Makes practical machine learning easy and scalable
• spark.mllib - the primary API
• spark.ml - a higher-level API for constructing ML workflows
[Diagram: Apache Spark, spark.mllib, spark.ml]
What’s in MLlib
• Utilities, data types, basic statistics
• Classification and regression: SVM, logistic regression, linear regression, naive Bayes, decision trees, ensembles of trees, isotonic regression
• Collaborative filtering: alternating least squares (ALS)
• Clustering: k-means, Gaussian mixture, power iteration clustering, latent Dirichlet allocation, streaming k-means
• Dimensionality reduction: SVD, PCA
• Frequent pattern mining: FP-growth
• Optimization: stochastic gradient descent, limited-memory BFGS
https://spark.apache.org/docs/latest/mllib-guide.html
ML Workflow can be VERY complex
Types of Recommenders
• Editorial and hand curated
• Simple aggregates
• Tailored to individual users
Who Uses Recommenders
Approaches
• Content-based method
• Item-based method
• Model-based method
Collaborative Filtering
• One of the best-known recommendation algorithms
• Widely used in e-commerce applications
• The data size can be enormous
• Recommendations need to be delivered as quickly as possible
Collaborative Filtering
Main idea: for a user X, find a set N of other users whose ratings are “similar” to X’s ratings
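The "find similar users" idea can be sketched in a few lines of plain Python. This is only an illustration, not the method used in the talk: the choice of cosine similarity, the function names `cosine_similarity` and `nearest_neighbors`, and the tiny ratings dictionary are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dicts (missing ratings count as 0)."""
    common = set(a) & set(b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if not common or norm_a == 0 or norm_b == 0:
        return 0.0
    dot = sum(a[item] * b[item] for item in common)
    return dot / (norm_a * norm_b)

def nearest_neighbors(ratings, x, n=2):
    """Return the n users whose ratings are most similar to user x's ratings."""
    scores = [
        (cosine_similarity(ratings[x], other), user)
        for user, other in ratings.items()
        if user != x
    ]
    scores.sort(reverse=True)  # highest similarity first
    return [user for _, user in scores[:n]]

# Hypothetical toy data: user -> {movie: rating}
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m1": 5, "m2": 5, "m3": 1},
    "carol": {"m1": 1, "m2": 1, "m3": 5},
    "dave":  {"m1": 4, "m3": 2},
}

neighbors = nearest_neighbors(ratings, "alice", n=2)
```

Once the set N is known, a rating for X can be estimated from what the users in N gave the same item. At the data sizes mentioned in these slides, comparing every pair of users gets expensive, which is one motivation for the model-based approach.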
User Preferences
• This is a baby example
• Users: > 2M
• Items: > 30M
• Sparsity: > 2%
Low Rank Assumption
• The rating matrix can be approximated by the product of low-rank matrices
• The low-rank dimensions are also understood as “latent factors”
• We assume that a small number of factors can represent the hidden factors we do not know
[Figure labels: Action, Romance, Thriller]
Low Rank Assumption
[Figure labels: Action, Romance, Thriller]
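A small sketch of what the low rank assumption means in practice: each user and each movie gets a short vector over the latent factors, and a rating is the dot product of the two. The factor values, the user and movie IDs, and the helper names `predict` and `parameter_counts` are invented for illustration. In a real model the factors are learned and do not come with labels such as "Action".

```python
# Hypothetical latent factors, in the order Action, Romance, Thriller.
user_factors = {
    "u1": [0.8, 0.0, 0.2],   # mostly likes action
    "u2": [0.0, 1.0, 0.0],   # only cares about romance
}
item_factors = {
    "m1": [5.0, 1.0, 3.0],   # an action movie with some thriller elements
    "m2": [1.0, 5.0, 2.0],   # a romance movie
}

def predict(user, item):
    """Predicted rating = dot product of the user's and the item's factor vectors."""
    return sum(p * q for p, q in zip(user_factors[user], item_factors[item]))

def parameter_counts(n_users, n_items, rank):
    """Compare the size of the full rating matrix with the two factor matrices."""
    full = n_users * n_items
    factored = rank * (n_users + n_items)
    return full, factored

full, factored = parameter_counts(1000, 2000, 3)
```

With 1,000 users, 2,000 items and 3 factors, the full matrix has 2,000,000 entries while the two factor matrices together hold 9,000 numbers.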
Matrix Factorization
• Our goal is to find P and Q that minimize the sum of squared errors (SSE)
• Root mean square error (RMSE)
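The formulas on this slide did not survive in the transcript. A common way to write them, given here as an assumption and not as the slide's exact notation, uses r_{ui} for the observed rating of item i by user u, p_u and q_i for the corresponding rows of P and Q, and K for the set of observed (u, i) pairs:

```latex
\min_{P,\,Q} \; \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2
\qquad\qquad
\mathrm{RMSE} = \sqrt{ \frac{1}{|K|} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2 }
```

The sum runs only over observed ratings, so the missing entries of the rating matrix do not contribute to the error. Practical implementations usually add a regularization term on P and Q to this objective.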
Alternating Least Squares
• Because p and q are both unknown, the objective function is not convex
• If we fix one of the unknowns, solving for the other becomes a least squares problem
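The alternation can be sketched in plain Python. This is a toy version and not Spark's implementation: the function names, the small ratings list, and the L2 regularization weight `reg` are assumptions made for the example. Each half-step fixes one factor matrix and solves a small regularized least squares problem for every row of the other.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[r][:] + [b[r]] for r in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - known) / M[r][r]
    return x

def solve_side(fixed, observed, rank, reg):
    """With one factor matrix fixed, solve the normal equations for each row of the other."""
    solved = {}
    for key, pairs in observed.items():
        # A = sum(f f^T) + reg * I and rhs = sum(rating * f) over this row's observed ratings
        A = [[reg if a == b else 0.0 for b in range(rank)] for a in range(rank)]
        rhs = [0.0] * rank
        for other, rating in pairs:
            f = fixed[other]
            for a in range(rank):
                rhs[a] += rating * f[a]
                for b in range(rank):
                    A[a][b] += f[a] * f[b]
        solved[key] = solve(A, rhs)
    return solved

def objective(ratings, P, Q, reg):
    """Sum of squared errors on observed ratings plus the L2 penalty on all factors."""
    sse = sum((r - sum(p * q for p, q in zip(P[u], Q[i]))) ** 2 for u, i, r in ratings)
    penalty = sum(v * v for vec in list(P.values()) + list(Q.values()) for v in vec)
    return sse + reg * penalty

def als(ratings, rank=2, reg=0.1, iterations=10, seed=0):
    """Alternate between solving for P (Q fixed) and for Q (P fixed)."""
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    Q = {i: [rng.random() for _ in range(rank)] for i in by_item}  # random start
    history = []
    for _ in range(iterations):
        P = solve_side(Q, by_user, rank, reg)  # fix Q, solve for P
        Q = solve_side(P, by_item, rank, reg)  # fix P, solve for Q
        history.append(objective(ratings, P, Q, reg))
    return P, Q, history

# Hypothetical toy ratings: (user, item, rating)
ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 4.0),
    ("u2", "m1", 4.0), ("u2", "m2", 5.0), ("u2", "m3", 1.0),
    ("u3", "m2", 1.0), ("u3", "m3", 5.0),
]
P, Q, history = als(ratings)
```

Because each half-step exactly minimizes the regularized objective over one factor matrix, the values in `history` never increase from one sweep to the next, even though the joint problem is not convex. The per-row solves are independent of one another, so they can run in parallel.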
Amazon Reviews Dataset
• 35 million ratings, 6.6 million users, 2.4 million products
• Run on 16 nodes (m3.2xlarge)
https://github.com/apache/spark/pull/3720
Resources
And More Resources
• Source code examples https://github.com/apache/spark/tree/master/examples
• Apache Spark JIRA https://issues.apache.org/jira/browse/spark
Dataset
• MovieLens Dataset http://grouplens.org/datasets/movielens/
• “ratings.dat” UserID::MovieID::Rating::Timestamp
• “movies.dat” MovieID::Title::Genres
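A minimal sketch of parsing the two file formats listed above, using only the Python standard library. The `Rating` and `Movie` types and the parser names are made up for this example, and the sample lines are illustrative. Splitting genres on "|" assumes the layout used by the MovieLens 1M files.

```python
from typing import List, NamedTuple

class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    timestamp: int

class Movie(NamedTuple):
    movie_id: int
    title: str
    genres: List[str]

def parse_rating(line: str) -> Rating:
    """Parse one "UserID::MovieID::Rating::Timestamp" line from ratings.dat."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))

def parse_movie(line: str) -> Movie:
    """Parse one "MovieID::Title::Genres" line from movies.dat."""
    movie_id, title, genres = line.strip().split("::")
    # Assumption: multiple genres are separated by "|"
    return Movie(int(movie_id), title, genres.split("|"))

# Illustrative sample lines in the documented format
rating = parse_rating("1::1193::5::978300760\n")
movie = parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy\n")
```

In a Spark job the same per-line functions could be applied with a map over the lines of each file, producing (user, movie, rating) triples for ALS.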
Conclusion
• Spark MLlib is growing fast, but it still needs some time to mature
• Spark MLlib is a strong tool if you use it right
• Sharpening your ML skills is the first priority
Q&A
Visit me on: http://blog.chuyuhsu.ml
GitHub: http://github.com/ChuyuHsu
Thanks
References
• https://spark.apache.org/docs/latest/mllib-guide.html
• http://www.slideshare.net/jeykottalam/mllib
• http://www.slideshare.net/PetrZapletal1/mllib-and-machine-learning-on-spark
• https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
• https://github.com/apache/spark/pull/3720
• https://www.hakkalabs.co/articles/spark-mllib-making-practical-machine-learning-easy-and-scalable
• http://www.slideshare.net/databricks/practical-machine-learning-pipelines-with-mllib