• Online evaluation (A/B testing) is the most reliable way to measure the results of our experiments, but it is a slow process.
• The offline evaluation process is faster, but it is critical to make it reliable, as it informs our decision to roll out new improvements in production.
• Which offline metrics should we monitor to anticipate an impact in production?
• What level of confidence can we have in the offline results?
• How should we decide whether or not to push a new model to production?
Funnel: Impression → Click → Apply → Interview → Get a Job
We focus on the first half of the funnel because the later part of the funnel is very sparse.

apply-rate@10 = (# applies up to rank 10) / (# impressions up to rank 10)
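As an illustration of how this metric can be computed, here is a minimal sketch in Python; the log format and field names (`rank`, `applied`) are assumptions made for the example, not taken from the production system.

```python
# Sketch: computing apply-rate@k from a ranked impression log.
# Each impression record is assumed to carry the rank at which the item was
# shown (1-based) and whether it led to an apply; field names are illustrative.

def apply_rate_at_k(impressions, k=10):
    """impressions: iterable of dicts with 'rank' and 'applied' keys."""
    shown = [imp for imp in impressions if imp["rank"] <= k]
    if not shown:
        return 0.0
    applies = sum(1 for imp in shown if imp["applied"])
    return applies / len(shown)

# Example: three impressions in the top 10, one of which led to an apply.
log = [
    {"rank": 1, "applied": False},
    {"rank": 4, "applied": True},
    {"rank": 9, "applied": False},
    {"rank": 15, "applied": True},  # below rank 10, ignored by apply-rate@10
]
print(apply_rate_at_k(log, k=10))  # 1 apply / 3 impressions ≈ 0.33
```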
• Word2vec (w2v): word2vec with negative sampling, where the model captures the sequence of actions
• Word2vec (w2vhs): a variant of word2vec using hierarchical softmax
• Knn: an item-based collaborative filtering technique
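Below is a minimal sketch of an item-based collaborative filtering scorer of the kind used for the knn baseline, with cosine similarity over a binary user-item matrix; the matrix construction, similarity choice, and scoring are illustrative assumptions, not the exact implementation compared in the experiments.

```python
import numpy as np

# Sketch: item-based collaborative filtering (knn baseline).
# R is a binary user x item interaction matrix (1 = the user clicked/applied).
R = np.array([
    [1, 1, 1, 0, 0],   # User1
    [1, 0, 0, 1, 1],   # User2
], dtype=float)

# Cosine similarity between item column vectors.
norms = np.linalg.norm(R, axis=0, keepdims=True)
norms[norms == 0] = 1.0
item_sim = (R / norms).T @ (R / norms)
np.fill_diagonal(item_sim, 0.0)  # ignore self-similarity

def recommend(user_row, top_n=2):
    """Score unseen items by their similarity to the user's past items."""
    scores = item_sim @ user_row
    scores[user_row > 0] = -np.inf  # do not re-recommend already-seen items
    return np.argsort(-scores)[:top_n]

print(recommend(R[0]))  # item indices ranked for User1
```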
Click log (UserID, ItemID, date):
User1, Item6, 2016/02/17
User1, Item7, 2016/02/19
User2, Item2, 2016/02/12
User2, Item9, 2016/02/17
User2, Item10, 2016/02/19
User2, Item12, 2016/02/20

Per-user click sequences:
User1 → [Item2, Item6, Item7]
User2 → [Item2, Item9, Item10, Item12]

We consider an ItemID as a word and the items a user clicked as a document, so we can apply word2vec.
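A minimal sketch of this idea with gensim follows; the sessions reuse the toy log above, while the hyperparameter values and the gensim 4.x parameter names are assumptions for illustration, not the settings used in the experiments.

```python
from gensim.models import Word2Vec

# Each user's clicked items, in chronological order, form one "sentence".
sessions = [
    ["Item2", "Item6", "Item7"],              # User1
    ["Item2", "Item9", "Item10", "Item12"],   # User2
]

# w2v: word2vec with negative sampling (hs=0, negative>0).
w2v = Word2Vec(sessions, vector_size=64, window=5, min_count=1,
               sg=1, hs=0, negative=5, epochs=10)

# w2vhs: the same model trained with hierarchical softmax instead.
w2vhs = Word2Vec(sessions, vector_size=64, window=5, min_count=1,
                 sg=1, hs=1, negative=0, epochs=10)

# Recommend items whose embeddings are closest to an item the user clicked.
print(w2v.wv.most_similar("Item2", topn=3))
```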
• NDCG@k
• Recall@k (r@k)
with k in (3, 10, 20, 30, 40)

Process
• During two weeks, we run an A/B test with one bucket for each model.
• Daily, we generate new recommendations based on the past data and compare the apply-rate observed in production with the offline performance (p@k, etc.).
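For reference, here is a sketch of these offline metrics computed on a single ranked list, using the standard definitions of p@k, r@k, and binary-relevance NDCG@k; the example data is made up and nothing here is specific to our pipeline.

```python
import math

def precision_at_k(ranked, relevant, k):
    """p@k: fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """r@k: fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary relevance."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

ranked = ["Item9", "Item3", "Item12", "Item5"]   # model's ranked list
relevant = {"Item9", "Item12"}                   # items the user applied to
for k in (3, 10, 20, 30, 40):
    print(k, precision_at_k(ranked, relevant, k),
          recall_at_k(ranked, relevant, k),
          ndcg_at_k(ranked, relevant, k))
```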
         …      …      …      …      …      …@10   r@10   r@100
w2v      -      -      -      -      -      -      -      -
knn      +17%   +11%   +9.3%  -54%   +11%   -47%   -38%   +3.2%
w2vhs    +48%   +46%   +90%   +51%   -5.1%  +60%   +70%   +65%

Cross-model comparison, averaged over days; word2vec (w2v) is used as the baseline. The metrics in bold do not have the expected sign, e.g. online performance increased, but the offline evaluation metric decreased.
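As a sketch of the comparison behind this table: each cell is the relative change of a candidate model over the baseline, and a cell is flagged (bold above) when an offline metric moves in the opposite direction from the online apply-rate. The numbers below are placeholders, not the experimental values.

```python
def relative_change(model_value, baseline_value):
    """Relative change of a candidate model vs. the baseline (word2vec)."""
    return (model_value - baseline_value) / baseline_value

def wrong_sign(online_delta, offline_delta):
    """True when the offline metric moves against the online apply-rate."""
    return (online_delta > 0) != (offline_delta > 0)

# Illustrative values only (not the experimental numbers).
online_delta = relative_change(0.035, 0.030)   # apply-rate@10 vs. baseline
offline_delta = relative_change(0.21, 0.46)    # e.g. r@10 vs. baseline
print(online_delta, offline_delta, wrong_sign(online_delta, offline_delta))
```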
• The offline evaluation is reliable enough to decide not to deploy new models when the offline performance is significantly negative, and to deploy new models when there is a positive impact on the offline metrics.
• We recommend p@k, which showed consistent predictive power, when the recommendation task is focused on precision.