Takuya Kitazawa
October 04, 2018

Demo - Query-Based Simple and Scalable Recommender Systems with Apache Hivemall


Transcript

  1. Query-Based Simple and Scalable Recommendation with Apache Hivemall

    Easy-to-use
    ‣ ML in SQL
    ‣ No expertise
    ‣ Sharable

        SELECT
          train_classifier( -- train_regressor(
            features,
            label,
            '-loss logloss -optimizer AdaGrad -reg L1'
          ) as (feature, weight)
        FROM
          training

    Options:
    ‣ Loss function
    ‣ Optimizer
    ‣ Regularization
    ‣ Learning rate
    ‣ Mini-batch

    Scalable
    ‣ Runs in parallel
    ‣ Hadoop ecosystem

    Multi-platform
    ‣ Flexible selection of each layer:
      - HiveQL
      - Pig
      - Spark
        - DataFrame
        - HiveContext

    Versatile
    ‣ Regression
    ‣ Classification
    ‣ Feature engineering
    ‣ Evaluation
    ‣ Topic modeling
    ‣ Anomaly detection
    ‣ NLP
    ‣ Generic array/map operations
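To make the option string on this slide concrete, here is a minimal plain-Python sketch of a linear classifier trained with logistic loss, AdaGrad step sizes, and an L1 subgradient penalty. It is not Hivemall's implementation: the function name `train_logloss_adagrad`, its parameters, the 0/1 label encoding, and the dict-based sparse features are all assumptions made for illustration. Like the query, it returns one weight per feature.

```python
import math
from collections import defaultdict


def train_logloss_adagrad(rows, eta=0.1, l1=0.0, epochs=20, eps=1e-6):
    """Train a sparse linear classifier; returns {feature: weight}.

    rows: list of (features, label), where features is a dict
    {feature_name: value} and label is 0 or 1 (assumed encoding).
    """
    weights = defaultdict(float)
    grad_sq = defaultdict(float)  # AdaGrad: running sum of squared gradients
    for _ in range(epochs):
        for features, label in rows:
            margin = sum(weights[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-margin))  # predicted P(label = 1)
            for f, v in features.items():
                w = weights[f]
                sign = (w > 0) - (w < 0)
                # Gradient of log loss plus an L1 subgradient term.
                g = (p - label) * v + l1 * sign
                grad_sq[f] += g * g
                # Per-feature step size shrinks as gradients accumulate.
                weights[f] -= eta * g / (math.sqrt(grad_sq[f]) + eps)
    return dict(weights)


model = train_logloss_adagrad([({"a": 1.0}, 1), ({"b": 1.0}, 0)])
for feature, weight in sorted(model.items()):
    print(feature, weight)
```

The loop processes one row at a time, so learning rate (`eta`) and regularization strength (`l1`) are the only knobs shown here; mini-batching is left out to keep the sketch short.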
  2. Item-Based Collaborative Filtering in Query Language

    Count # of transactions for each user-item pair

    history:

        userid  itemid  purchased_at
        1       31231   2015-04-09 00:29:02
        1       13212   2016-05-24 16:29:02
        2       312     2016-06-03 23:29:02
        3       2312    2016-06-04 19:29:02
        …       …       …

        CREATE TABLE user_purchased as
        SELECT
          userid,
          itemid,
          count(1) as purchase_count
        FROM
          history
        GROUP BY
          userid, itemid

    Compute item-item co-count

        CREATE TABLE cooccurrence as
        SELECT
          u1.itemid,
          u2.itemid as other,
          count(1) as cnt
        FROM
          user_purchased u1
        JOIN
          user_purchased u2 ON (u1.userid = u2.userid)
        WHERE
          u1.itemid != u2.itemid
        GROUP BY
          u1.itemid, u2.itemid

    cooccurrence:

        itemid  other   cnt
        583266  621056  231
        583266  583266  923
        31231   13212   129
        31231   31231   542
        …       …       …

    What's next?
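The self-join on this slide can be read as "for every user, emit every ordered pair of distinct items they bought, then count the pairs". The sketch below mirrors that logic in plain Python; the function name `cooccurrence` and the tuple-based input are assumptions for illustration, and the input is taken to contain one row per distinct user-item pair, as `user_purchased` does.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence(user_purchased):
    """Count, for each ordered item pair, the users who bought both.

    user_purchased: iterable of (userid, itemid), one row per distinct pair.
    Returns a Counter keyed by (itemid, other).
    """
    items_by_user = defaultdict(set)
    for userid, itemid in user_purchased:
        items_by_user[userid].add(itemid)

    counts = Counter()
    for items in items_by_user.values():
        # Ordered pairs of distinct items: the equivalent of the self-join
        # with the filter u1.itemid != u2.itemid.
        for itemid, other in permutations(items, 2):
            counts[(itemid, other)] += 1
    return counts


pairs = [(1, "A"), (1, "B"), (2, "A"), (2, "B"), (2, "C")]
cnt = cooccurrence(pairs)
print(cnt[("A", "B")])  # users 1 and 2 bought both A and B -> 2
```

Grouping by user first keeps the work proportional to the sum of squared basket sizes, which is also what the join produces.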
  3. Matrix Factorization in Query Language

        CREATE TABLE sgd_model as
        SELECT
          idx,
          array_avg(u_rank) as Pu,
          array_avg(i_rank) as Qi,
          avg(u_bias) as Bu,
          avg(i_bias) as Bi
        FROM (
          SELECT
            train_mf_sgd(
              user_id, item_id, rating,
              '-factor ${factor} -mu ${mu} -iter ${iters}'
            ) as (idx, u_rank, i_rank, u_bias, i_bias)
          FROM
            training
        ) t
        GROUP BY idx

        SELECT
          mf_predict(t2.Pu, m2.Qi, t2.Bu, m2.Bi, ${mu}) as predicted
        FROM (
          SELECT
            t1.user_id, t1.item_id, m1.Pu, m1.Bu
          FROM
            target t1
          LEFT OUTER JOIN
            sgd_model m1 ON (t1.user_id = m1.idx)
        ) t2
        LEFT OUTER JOIN
          sgd_model m2 ON (t2.item_id = m2.idx)
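The two queries on this slide average the factors produced by parallel trainers and then score user-item pairs. The sketch below shows both steps in plain Python, assuming the prediction follows the usual biased matrix factorization form `mu + Bu + Bi + Pu·Qi`. The Python functions only borrow the names `array_avg` and `mf_predict` for readability; they are illustrative stand-ins, and the handling of missing users or items is an assumption of this sketch rather than documented Hivemall behavior.

```python
def array_avg(arrays):
    """Element-wise mean of equal-length vectors, one per parallel trainer."""
    n = len(arrays)
    return [sum(column) / n for column in zip(*arrays)]


def mf_predict(pu, qi, bu=0.0, bi=0.0, mu=0.0):
    """Biased MF prediction: mu + bu + bi + dot(pu, qi)."""
    # Assumption of this sketch: a user or item missing from the model
    # (NULL after the LEFT OUTER JOIN) simply contributes nothing.
    bu = 0.0 if bu is None else bu
    bi = 0.0 if bi is None else bi
    if pu is None or qi is None:
        dot = 0.0
    else:
        dot = sum(p * q for p, q in zip(pu, qi))
    return mu + bu + bi + dot


# Two trainers produced slightly different user factors; average them.
pu = array_avg([[1.0, 2.0], [3.0, 4.0]])
print(pu)
print(mf_predict(pu, [0.5, 0.5], bu=0.1, bi=-0.2, mu=3.5))
```

Averaging per `idx` is what lets each trainer work on its own shard of the ratings without coordination.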
  4. List of Recommender Related Capabilities

    k-nearest-neighbor
    ‣ MinHash and b-Bit MinHash (LSH)
    ‣ Similarities
      - Euclid
      - Cosine
      - Jaccard
      - Angular

    Efficient item-based CF techniques
    ‣ Sparse Linear Method (SLIM)
    ‣ Approximated all-pair similarities (DIMSUM)

    Matrix completion
    ‣ Matrix Factorization
    ‣ (Field-Aware) Factorization Machines

    Efficient top-k retrieval
    ‣ List top-2 items per user:

        item  user  score
        1     B     70
        2     A     80
        3     A     90
        4     B     60
        5     A     70
        …     …     …

    With a window function. Does NOT finish in 24 hrs. for 20M users and ~1k items in each:

        SELECT
          item, user, score, rank
        FROM (
          SELECT
            item, user, score,
            rank() over (PARTITION BY user ORDER BY score DESC) as rank
          FROM
            table
        ) t
        WHERE rank <= 2

    With each_top_k. Completes in 2 hrs.:

        SELECT
          each_top_k(
            2, user, score,
            user, item -- output columns
          ) as (rank, score, user, item)
        FROM (
          SELECT * FROM table
          CLUSTER BY user
        ) t
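One way to see why a dedicated top-k function can beat a full per-user ranking: it only needs a bounded heap of size k per group instead of sorting every row. The sketch below illustrates that idea in plain Python on the slide's sample rows. It is not Hivemall's implementation; the generator only borrows the name `each_top_k`, and its tie-breaking rule (earlier row wins) is an assumption. Like the query, it expects rows already clustered by user.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def each_top_k(k, rows):
    """Yield (rank, score, user, item) for the k best-scored rows per user.

    rows: iterable of (item, user, score), already grouped by user
    (the role CLUSTER BY plays in the query).
    """
    for user, group in groupby(rows, key=itemgetter(1)):
        heap = []  # min-heap holding at most k entries for this user
        for seq, (item, _, score) in enumerate(group):
            entry = (score, -seq, item)  # -seq: earlier rows win ties
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)  # evict the current worst
        ranked = sorted(heap, reverse=True)
        for rank, (score, _, item) in enumerate(ranked, start=1):
            yield rank, score, user, item


table = [(1, "B", 70), (2, "A", 80), (3, "A", 90), (4, "B", 60), (5, "A", 70)]
clustered = sorted(table, key=itemgetter(1))  # stand-in for CLUSTER BY user
for row in each_top_k(2, clustered):
    print(row)
```

Each user's rows cost O(n log k) with constant memory per group, whereas ranking every row first sorts all n rows before discarding most of them.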