Spark Machine Learning 101 @HadoopCon
Chu-Yu Hsu
September 19, 2015
Transcript
Spark Machine Learning 101 Chu-Yu Hsu @ HadoopCon 2015
About Me
Chu-Yu Hsu, 許儲羽
• Software Engineer
• Machine Learning Practitioner
• Uses Spark ML and Python in daily work and Kaggle competitions
• http://blog.chuyuhsu.ml
Outline
• Introduction to Spark ML
• Alternating Least Squares (ALS)
• Hands-on example
Apache Spark MLlib
• Makes practical machine learning easy and scalable
• spark.mllib - the primary API
• spark.ml - a higher-level API for constructing ML workflows
[Diagram labels: Apache Spark, spark.mllib, spark.ml]
What’s in MLlib
• Utilities
• Data types
• Basic statistics
• Classification and regression: SVM, logistic regression, linear regression, naive Bayes, decision trees, ensembles of trees, isotonic regression
• Collaborative filtering: alternating least squares (ALS)
• Clustering: k-means, Gaussian mixture, power iteration clustering, latent Dirichlet allocation, streaming k-means
• Dimensionality reduction: SVD, PCA
• Frequent pattern mining: FP-growth
• Optimization: stochastic gradient descent, limited-memory BFGS
https://spark.apache.org/docs/latest/mllib-guide.html
ML Workflow can be VERY complex
Types of Recommenders
• Editorial and hand-curated
• Simple aggregates
• Tailored to individual users
Who Uses Recommenders
Approaches
• Content-based method
• Item-based method
• Model-based method
Collaborative Filtering
• One of the best-known recommendation algorithms
• Widely used in e-commerce applications
• The data size can be enormous
• Recommendations need to be delivered as soon as possible
Collaborative Filtering
Main idea: find a set N of other users whose ratings are “similar” to X’s ratings
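To make the main idea concrete, here is a minimal sketch of finding the users most similar to a target user X. The toy ratings, the helper names `cosine_similarity` and `nearest_users`, and the choice of cosine similarity over co-rated items are all illustrative assumptions; the slides do not say which similarity measure is used.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating}
ratings = {
    "X": {"A": 5, "B": 4, "C": 1},
    "u1": {"A": 5, "B": 5, "C": 1, "D": 4},
    "u2": {"A": 1, "B": 2, "C": 5, "D": 1},
    "u3": {"B": 4, "C": 2, "D": 5},
}


def cosine_similarity(a, b):
    """Cosine similarity over the items both users rated (0.0 if none)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def nearest_users(target, ratings, n):
    """Return the n users most similar to `target`, best match first."""
    scores = [
        (other, cosine_similarity(ratings[target], other_ratings))
        for other, other_ratings in ratings.items()
        if other != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# The set N for user X with |N| = 2
neighbors = nearest_users("X", ratings, 2)
print(neighbors)
```

Items that the neighbors rated highly and X has not seen yet (item "D" in this toy data) would then be the candidates to recommend.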
User Preferences
• This is a baby example
• Users: > 2M
• Items: > 30M
• Sparsity: > 2%
Low Rank Assumption
• The matrix can be reduced to the product of low-rank matrices
• The reduced dimensions are also understood as “latent factors”
• We assume that these low-rank factors can represent the hidden factors we do not know
[Figure labels: Action, Romance, Thriller]
Low Rank Assumption
[Figure labels: Action, Romance, Thriller]
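The low-rank assumption can be illustrated with a small sketch in which each user and each movie is described by three latent factors, and a predicted rating is the dot product of the two factor vectors. The factor values and the helper name `predict` are hypothetical; the Action/Romance/Thriller labels only echo the figure, since real latent factors are learned and not named.

```python
# Hypothetical latent factors; columns are Action, Romance, Thriller.
user_factors = [
    [1.0, 0.0, 0.5],  # user 0: likes action, some thriller
    [0.0, 1.0, 0.0],  # user 1: likes romance
]
item_factors = [
    [4.0, 0.0, 2.0],  # movie 0: mostly action
    [0.0, 4.0, 0.0],  # movie 1: romance
    [1.0, 0.0, 4.0],  # movie 2: mostly thriller
]


def predict(user_factors, item_factors):
    """Rebuild the full user x item matrix as the product P * Q^T."""
    return [
        [sum(p * q for p, q in zip(user_row, item_row)) for item_row in item_factors]
        for user_row in user_factors
    ]


predicted = predict(user_factors, item_factors)
for row in predicted:
    print(row)
```

A 2 x 3 rating matrix is stored here as 2 x 3 plus 3 x 3 factor values, which is no saving at this size; the benefit appears when the number of users and items is far larger than the number of factors.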
Matrix Factorization
• Our goal is to find P and Q that minimize the sum of squared errors
• The fit is measured by the root mean square error (RMSE)
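The formulas from this slide did not survive in the transcript. A common way to write the two quantities is sketched below; the notation is an assumption, with K the set of observed (user, item) pairs, r_{ui} an observed rating, and p_u, q_i the rows of P and Q for user u and item i. A regularization term is often added to the first expression and is omitted here.

```latex
% Assumed notation: K = observed (user, item) pairs,
% r_{ui} = observed rating, p_u and q_i = rows of P and Q.
\[
\min_{P,\,Q} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2
\]
\[
\mathrm{RMSE} = \sqrt{ \frac{1}{|K|} \sum_{(u,i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2 }
\]
```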
Alternating Least Squares
• Because p and q are both unknown, the objective function is not convex
• If we fix one of the unknowns, the other can be solved as a least squares problem
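The alternating idea can be sketched in plain Python: hold the item factors fixed and solve a small least squares problem for each user, then swap roles. This is an illustration under stated assumptions, not Spark MLlib's implementation; the toy ratings, the helper names (`solve`, `update`, `rmse`, `als`), and the simple ridge term `lam` are all hypothetical.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def update(fixed, observed, k, lam):
    """Regularized least squares for one factor vector, other side fixed.

    observed is a list of (index into fixed, rating) pairs.
    Solves (sum f f^T + lam * I) x = sum rating * f.
    """
    A = [[lam if a == b else 0.0 for b in range(k)] for a in range(k)]
    rhs = [0.0] * k
    for j, rating in observed:
        f = fixed[j]
        for a in range(k):
            rhs[a] += rating * f[a]
            for b in range(k):
                A[a][b] += f[a] * f[b]
    return solve(A, rhs)


def rmse(R, P, Q):
    """Root mean square error over the observed ratings only."""
    errors = [
        (r - sum(p * q for p, q in zip(P[u], Q[i]))) ** 2
        for (u, i), r in R.items()
    ]
    return (sum(errors) / len(errors)) ** 0.5


def als(R, n_users, n_items, k=2, lam=0.1, iterations=10, seed=0):
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    by_user = {u: [] for u in range(n_users)}
    by_item = {i: [] for i in range(n_items)}
    for (u, i), r in R.items():
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    history = [rmse(R, P, Q)]
    for _ in range(iterations):
        P = [update(Q, by_user[u], k, lam) for u in range(n_users)]  # fix Q
        Q = [update(P, by_item[i], k, lam) for i in range(n_items)]  # fix P
        history.append(rmse(R, P, Q))
    return P, Q, history


# Hypothetical toy data: (user, item) -> rating; missing pairs are unobserved.
R = {
    (0, 0): 5, (0, 1): 4, (0, 3): 1,
    (1, 0): 4, (1, 1): 5, (1, 2): 1,
    (2, 2): 5, (2, 3): 4,
    (3, 0): 1, (3, 2): 4, (3, 3): 5,
}
P, Q, history = als(R, n_users=4, n_items=4)
print("RMSE before:", history[0], "after:", history[-1])
```

Each half-step is an ordinary linear solve of size k, which is why the per-user and per-item updates are independent of one another and easy to run in parallel.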
Amazon Reviews Dataset
35 million ratings, 6.6 million users, 2.4 million products on a 16-node (m3.2xlarge) cluster
https://github.com/apache/spark/pull/3720
Resources
And More Resources
• Source code examples: https://github.com/apache/spark/tree/master/examples
• Apache Spark JIRA: https://issues.apache.org/jira/browse/spark
Dataset
• MovieLens Dataset: http://grouplens.org/datasets/movielens/
• “ratings.dat”: UserID::MovieID::Rating::Timestamp
• “movies.dat”: MovieID::Title::Genres
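The two file layouts above can be read with a few lines of plain Python. This is a minimal sketch: the names `Rating`, `Movie`, `parse_rating`, and `parse_movie` are hypothetical, the sample lines are illustrative, and splitting genres on "|" is an assumption about the dataset's format that the slide does not state.

```python
from collections import namedtuple

Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])
Movie = namedtuple("Movie", ["movie_id", "title", "genres"])


def parse_rating(line):
    """Parse one ratings.dat line: UserID::MovieID::Rating::Timestamp."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


def parse_movie(line):
    """Parse one movies.dat line: MovieID::Title::Genres."""
    movie_id, title, genres = line.strip().split("::")
    # Assumption: multiple genres are separated by "|".
    return Movie(int(movie_id), title, genres.split("|"))


# Illustrative sample lines in the two formats
sample_rating = parse_rating("1::1193::5::978300760\n")
sample_movie = parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy\n")
print(sample_rating)
print(sample_movie)
```

The (user_id, movie_id, rating) triples produced this way are the input shape that a collaborative filtering model such as ALS expects.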
Conclusion
• Spark MLlib is growing fast, but it still needs some time to mature
• Spark MLlib is a strong tool if you use it right
• Sharpening your ML skills is the first priority
Q&A
Visit me on: http://blog.chuyuhsu.ml
GitHub: http://github.com/ChuyuHsu
Thanks
References
• https://spark.apache.org/docs/latest/mllib-guide.html
• http://www.slideshare.net/jeykottalam/mllib
• http://www.slideshare.net/PetrZapletal1/mllib-and-machine-learning-on-spark
• https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html
• https://github.com/apache/spark/pull/3720
• https://www.hakkalabs.co/articles/spark-mllib-making-practical-machine-learning-easy-and-scalable
• http://www.slideshare.net/databricks/practical-machine-learning-pipelines-with-mllib