• Online evaluation (A/B testing) is the most reliable way to measure the results of our experiments, but it is a slow process.
• The offline evaluation process is faster, but it is critical to make it reliable, as it informs our decision to roll out new improvements in production.
• Which offline metrics should we monitor to anticipate an impact in production?
• What level of confidence can we have in the offline results?
• How should we decide whether or not to push a new model to production?
Funnel: Impression → Click → Apply → Interview → Get a Job
We focus on the first half of the funnel because the later part of the funnel is very sparse.

apply-rate@10 = (# applies up to rank 10) / (# impressions up to rank 10)
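As an illustration of how this metric can be computed, here is a minimal sketch in Python; the log format and field names (`rank`, `applied`) are assumptions made for the example, not taken from the production system.

```python
# Sketch: computing apply-rate@k from a ranked impression log.
# Each impression record is assumed to carry the rank at which the item was
# shown (1-based) and whether it led to an apply; field names are illustrative.

def apply_rate_at_k(impressions, k=10):
    """impressions: iterable of dicts with 'rank' and 'applied' keys."""
    shown = [imp for imp in impressions if imp["rank"] <= k]
    if not shown:
        return 0.0
    applies = sum(1 for imp in shown if imp["applied"])
    return applies / len(shown)

# Example: three impressions in the top 10, one of which led to an apply.
log = [
    {"rank": 1, "applied": False},
    {"rank": 4, "applied": True},
    {"rank": 9, "applied": False},
    {"rank": 15, "applied": True},  # below rank 10, ignored by apply-rate@10
]
print(apply_rate_at_k(log, k=10))  # 1 apply / 3 impressions ≈ 0.33
```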
• Word2vec (w2v): word2vec with negative sampling, where the model captures the sequence of actions
• Word2vec (w2vhs): a variant of word2vec using hierarchical softmax
• Knn: an item-based collaborative filtering technique
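Below is a minimal sketch of an item-based collaborative filtering scorer of the kind used for the knn baseline, with cosine similarity over a binary user-item matrix; the matrix construction, similarity choice, and scoring are illustrative assumptions, not the exact implementation compared in the experiments.

```python
import numpy as np

# Sketch: item-based collaborative filtering (knn baseline).
# R is a binary user x item interaction matrix (1 = the user clicked/applied).
R = np.array([
    [1, 1, 1, 0, 0],   # User1
    [1, 0, 0, 1, 1],   # User2
], dtype=float)

# Cosine similarity between item column vectors.
norms = np.linalg.norm(R, axis=0, keepdims=True)
norms[norms == 0] = 1.0
item_sim = (R / norms).T @ (R / norms)
np.fill_diagonal(item_sim, 0.0)  # ignore self-similarity

def recommend(user_row, top_n=2):
    """Score unseen items by their similarity to the user's past items."""
    scores = item_sim @ user_row
    scores[user_row > 0] = -np.inf  # do not re-recommend already-seen items
    return np.argsort(-scores)[:top_n]

print(recommend(R[0]))  # item indices ranked for User1
```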
Click log (UserID, ItemID, date):
User1, Item6, 2016/02/17
User1, Item7, 2016/02/19
User2, Item2, 2016/02/12
User2, Item9, 2016/02/17
User2, Item10, 2016/02/19
User2, Item12, 2016/02/20

Per-user click sequences:
User1 → [Item2, Item6, Item7]
User2 → [Item2, Item9, Item10, Item12]

We consider an ItemID as a word and the items a user clicked as a document, so we can apply word2vec.
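A minimal sketch of this idea with gensim follows; the sessions reuse the toy log above, while the hyperparameter values and the gensim 4.x parameter names are assumptions for illustration, not the settings used in the experiments.

```python
from gensim.models import Word2Vec

# Each user's clicked items, in chronological order, form one "sentence".
sessions = [
    ["Item2", "Item6", "Item7"],              # User1
    ["Item2", "Item9", "Item10", "Item12"],   # User2
]

# w2v: word2vec with negative sampling (hs=0, negative>0).
w2v = Word2Vec(sessions, vector_size=64, window=5, min_count=1,
               sg=1, hs=0, negative=5, epochs=10)

# w2vhs: the same model trained with hierarchical softmax instead.
w2vhs = Word2Vec(sessions, vector_size=64, window=5, min_count=1,
                 sg=1, hs=1, negative=0, epochs=10)

# Recommend items whose embeddings are closest to an item the user clicked.
print(w2v.wv.most_similar("Item2", topn=3))
```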
• NDCG@k
• Recall@k (r@k)
with k in (3, 10, 20, 30, 40)

Process
• During two weeks, we run an A/B test with one bucket for each model.
• Daily, we generate new recommendations based on the past data and compare the apply-rate observed in production with the offline performance (p@k, etc.).
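For reference, here is a sketch of these offline metrics computed on a single ranked list, using the standard definitions of p@k, r@k, and binary-relevance NDCG@k; the example data is made up and nothing here is specific to our pipeline.

```python
import math

def precision_at_k(ranked, relevant, k):
    """p@k: fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """r@k: fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary relevance."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

ranked = ["Item9", "Item3", "Item12", "Item5"]   # model's ranked list
relevant = {"Item9", "Item12"}                   # items the user applied to
for k in (3, 10, 20, 30, 40):
    print(k, precision_at_k(ranked, relevant, k),
          recall_at_k(ranked, relevant, k),
          ndcg_at_k(ranked, relevant, k))
```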
         …      …      …      …      …      …@10   r@10   r@100
w2v      -      -      -      -      -      -      -      -
knn      +17%   +11%   +9.3%  -54%   +11%   -47%   -38%   +3.2%
w2vhs    +48%   +46%   +90%   +51%   -5.1%  +60%   +70%   +65%

Cross-model comparison, averaged over days; word2vec (w2v) is used as the baseline. The metrics in bold do not have the expected sign, e.g. online performance increased, but the offline evaluation metric decreased.
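As a sketch of the comparison behind this table: each cell is the relative change of a candidate model over the baseline, and a cell is flagged (bold above) when an offline metric moves in the opposite direction from the online apply-rate. The numbers below are placeholders, not the experimental values.

```python
def relative_change(model_value, baseline_value):
    """Relative change of a candidate model vs. the baseline (word2vec)."""
    return (model_value - baseline_value) / baseline_value

def wrong_sign(online_delta, offline_delta):
    """True when the offline metric moves against the online apply-rate."""
    return (online_delta > 0) != (offline_delta > 0)

# Illustrative values only (not the experimental numbers).
online_delta = relative_change(0.035, 0.030)   # apply-rate@10 vs. baseline
offline_delta = relative_change(0.21, 0.46)    # e.g. r@10 vs. baseline
print(online_delta, offline_delta, wrong_sign(online_delta, offline_delta))
```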
• The offline evaluation is reliable enough to decide not to deploy new models when the offline performance is significantly negative, and to deploy new models when there is a positive impact on the offline metrics.
• We recommend p@k, which showed consistent predictive power, when the recommendation task is focused on precision.