

Enhancing News Recommendation with Transformers and Ensemble Learning

1st place solution for the RecSys Challenge 2024 competition

Kazuki Fujikawa

March 15, 2025


Transcript

  1. © DeNA Co., Ltd. 1 Enhancing News Recommendation with Transformers

    and Ensemble Learning Team “:D” 1st place solution Kazuki Fujikawa, Naoki Murakami, and Yuki Sugawara. DeNA Co., Ltd. RecSys Challenge 2024 Workshop.
  2. © DeNA Co., Ltd. 2 About our team “:D” •

    DeNA is an IT company that aims to entertain and to serve through our businesses. ◦ Our AI team tackles different business challenges using a wide range of AI technologies. ◦ One of our focus areas is matching technologies (including recommendation). ◦ Our team “:D” is made up of data scientists (Kaggle GM/Master) from DeNA. Business portfolio AI Team Reinforcement Learning Matching Computer Vision Multimodal Generative AI
  3. © DeNA Co., Ltd. 3 Table of contents 1 Introduction

    2 Data splitting / Feature engineering 3 Models 4 Experiments 5 Conclusion
  4. © DeNA Co., Ltd. 4 Table of contents 1 Introduction

    2 Data splitting / Feature engineering 3 Models 4 Experiments 5 Conclusion
  5. © DeNA Co., Ltd. 5 • Task: predict which article

    a user will click on from a given list of articles. ◦ We can use the user's click history, session details, and personal metadata. ◦ The primary metric: the area under the ROC curve (AUC). • Our team “:D” achieved 1st place on the leaderboard. Introduction: Competition overview https://recsys.eb.dk/assets/pdf/recsys_workshop_sponsor.pdf https://recsys.eb.dk/
  6. © DeNA Co., Ltd. 6 • EB-NeRD dataset: news articles

    and their associated behavioral logs ◦ The behavior dataset contains a list of articles to be shown to users in the upcoming week. ◦ The history dataset includes each user's clicked articles from the past three weeks. ◦ Participants predict which articles in the behavior dataset will be clicked. Introduction: Dataset description Figure: Train/Validation/Test split (provided by host). Train: behavior 5/18~5/24, history 4/27~5/17; Validation: behavior 5/25~6/1, history 5/4~5/24; Test: behavior 6/2~6/8, history 5/11~6/1. History spans 3 weeks, behavior 1 week.
  7. © DeNA Co., Ltd. 7 Table of contents 1 Introduction

    Data splitting / Feature engineering Models Experiments 4 3 2 Conclusion 5
  8. © DeNA Co., Ltd. 8 • Separate hyperparameter optimization and

    final model training ◦ Challenge: news articles tend to rapidly lose popularity after publication. ▪ A simple approach to this problem: train on more recent data. ◦ Our data splitting strategy: ▪ Train the model with the standard split to obtain a validation score and the best hyperparameters. ▪ Re-train on the validation dataset using the same hyperparameters. ◦ This achieves two goals simultaneously: evaluating our models and training on recent data. Data splitting Figure: three phases. (1) Training for validation: train on behavior 5/18~5/24 (history 4/27~5/17), validate on behavior 5/25~6/1 (history 5/4~5/24) to obtain the validation score and best hyperparameters. (2) Re-training using validation: train on 5/18~6/1 with the best hyperparameters. (3) Inference for submission with the re-trained model.
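As a sketch, the two-phase schedule above can be expressed as plain date arithmetic. The helper below is illustrative only: the function name and the window-length parameters are our assumptions, and only the dates follow the deck.

```python
from datetime import date, timedelta

def retraining_schedule(train_start: date, behavior_days: int = 7, history_days: int = 21):
    """Return history/behavior windows for (1) training with validation for
    hyperparameter search and (2) re-training on train + validation weeks."""
    def window(start: date, days: int):
        return {
            # clicked-article history precedes the behavior window
            "history": (start - timedelta(days=history_days), start - timedelta(days=1)),
            "behavior": (start, start + timedelta(days=days - 1)),
        }
    validation_start = train_start + timedelta(days=behavior_days)
    return {
        # phase 1: fit on the train week, score on the validation week
        "train": window(train_start, behavior_days),
        "validation": window(validation_start, behavior_days),
        # phase 2: re-fit on both weeks using the best hyperparameters
        "retrain": window(train_start, 2 * behavior_days),
    }

sched = retraining_schedule(date(2023, 5, 18))
```

With the deck's dates, the train behavior window is 5/18~5/24 with history 4/27~5/17, and the re-training window covers both weeks.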
  9. © DeNA Co., Ltd. 9 • Feature engineering that accounts

    for the impact of changing trends ◦ We found static article features (e.g. article ID) tend to overfit the training dataset. ▪ They capture article popularity during the training period. ◦ We focused on features less influenced by article content. ▪ e.g. time-related features and similarity to the user's history. Feature engineering
    Table: Example of feature engineering
    | Feature name | Description |
    | article_published_ts_diff | The elapsed time from article publication to impression. |
    | elapsed_ts_from_history | The elapsed time from the user's last click to impression. |
    | history_matched_topics_normed_counts_rank | Overlap rate of topics with articles in the user's history. |
    (Slide example: a candidate "Sport" article against a history of Sport, Sport, and Politic articles gives a normed count of 0.6666.)
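The three features in the table can be sketched as follows; the exact units and normalization are our assumptions, not the team's actual definitions.

```python
def article_published_ts_diff(impression_ts: int, published_ts: int) -> int:
    """Seconds elapsed from article publication to the impression."""
    return impression_ts - published_ts

def elapsed_ts_from_history(impression_ts: int, last_click_ts: int) -> int:
    """Seconds elapsed from the user's last click to the impression."""
    return impression_ts - last_click_ts

def history_matched_topics_normed_count(candidate_topics, history_topic_lists):
    """Share of history articles whose topics overlap the candidate's topics."""
    if not history_topic_lists:
        return 0.0
    matched = sum(
        1 for topics in history_topic_lists if set(topics) & set(candidate_topics)
    )
    return matched / len(history_topic_lists)

# Mirrors the slide example: 2 of 3 history articles share the "Sport" topic.
rate = history_matched_topics_normed_count(["Sport"], [["Sport"], ["Sport"], ["Politic"]])
```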
  10. © DeNA Co., Ltd. 10 • Three types of data

    leakage that occurred in this competition: ◦ L1: Article statistical features that contain future information. ◦ L2: Test set impression information from after the prediction time. ◦ L3: Test set impression information from before the prediction time. Feature engineering with data leakage Figure: within the behavior period, L1 corresponds to a statistical aggregation window extending past the impression; L2 to impressions after, and L3 to impressions before, the prediction time.
  11. © DeNA Co., Ltd. 11 • L1: Article statistical features

    that contain future information. ◦ The article dataset already contained statistical features (“Total Inviews”, “Total Pageviews”, and “Total Read Time”). ◦ These statistics are calculated including user behavior beyond the test period (6/2-6/8). ◦ Because they contain future information, they are a form of data leakage. Feature engineering with data leakage
    Table: Article statistical features
    | Feature name | Description |
    | Total Inviews | The total number of times an article has been inview (registered as seen) by users within the first 7 days after it was published. Only applies to articles published after February 16, 2023. |
    | Total Pageviews | The total number of times an article has been clicked by users within the first 7 days after it was published. Only applies to articles published after February 16, 2023. |
    | Total Read Time | The accumulated read time of an article within the first 7 days after it was published. Only applies to articles published after February 16, 2023. |
    (Slide annotations: “Obvious leakage” / “Weak leakage”; figure shows the statistical aggregation window spanning history and behavior periods around the impression.)
  12. © DeNA Co., Ltd. 12 • L2: Test set inview

    information from *after* the prediction time. ◦ (Again) Task: predict which article a user will click on from a given list of articles. ▪ We could use the inview article IDs in the test set. ◦ Using this information from after the prediction time is future data leakage. ▪ In fact, this information was useful for explaining short-term article popularity. Feature engineering with data leakage
    Table: Example of feature engineering with L2 data leakage
    | Feature name | Description |
    | future_article_log_counts | The logarithm of the frequency of the same article appearing in future impressions for the same user. |
    | global_article_-5m_5m_normed_counts | The frequency of the article's appearances in all users' behavior data within 5 minutes of the target impression, normalized within the inview set. |
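A plausible implementation of global_article_-5m_5m_normed_counts might look like this; the input layout (flat (timestamp, inview-IDs) pairs) and the normalization details are assumptions.

```python
from collections import Counter

def global_normed_counts(behaviors, target_ts, target_inview, window_sec=300):
    """Count appearances of each target inview article in any user's impressions
    within +/- window_sec of the target, then normalize within the inview set."""
    counts = Counter()
    target_set = set(target_inview)
    for ts, inview in behaviors:
        if abs(ts - target_ts) <= window_sec:
            for aid in inview:
                if aid in target_set:
                    counts[aid] += 1
    total = sum(counts.values())
    return {aid: (counts[aid] / total if total else 0.0) for aid in target_inview}

# Two impressions fall inside the +/- 5 minute window, one is far outside it.
feats = global_normed_counts(
    [(0, ["001", "002"]), (100, ["001"]), (10_000, ["002"])],
    target_ts=50, target_inview=["001", "002"],
)
```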
  13. © DeNA Co., Ltd. 13 Feature engineering with data leakage

    • L3: Test set inview information from *before* the prediction time. ◦ This information is theoretically available at prediction time. ▪ It is not strictly future data leakage. ◦ Using it, however, would require real-time processing and was likely not intended by the competition. ▪ We therefore decided to treat it as a form of data leakage.
    Table: Example of feature engineering with L3 data leakage
    | Feature name | Description |
    | past_article_log_counts | The logarithm of the frequency of the same article appearing in past impressions for the same user. |
    | global_article_-11m_-1m_normed_counts | The frequency of the article's appearances in all users' behavior data from 11 minutes to 1 minute before the target impression, normalized within the inview set. |
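The per-user past/future log-count features (L3 and L2 respectively) could be sketched like so, assuming a log1p form for "the logarithm of the frequency":

```python
import math

def article_log_counts(user_impressions, target_ts, article_id, future=False):
    """log(1 + number of the user's impressions strictly before (or after)
    target_ts whose inview set contains article_id)."""
    if future:
        n = sum(1 for ts, inview in user_impressions
                if ts > target_ts and article_id in inview)
    else:
        n = sum(1 for ts, inview in user_impressions
                if ts < target_ts and article_id in inview)
    return math.log1p(n)

imps = [(10, ["001"]), (20, ["001", "002"]), (40, ["001"])]
past = article_log_counts(imps, target_ts=30, article_id="001")                  # 2 past hits
future = article_log_counts(imps, target_ts=30, article_id="001", future=True)   # 1 future hit
```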
  14. © DeNA Co., Ltd. 14 Table of contents 1 Introduction

    2 Data splitting / Feature engineering 3 Models 4 Experiments 5 Conclusion
  15. © DeNA Co., Ltd. 15 • Stage 1: ◦ Train

    Transformer, LightGBM, and CatBoost using the features previously described. • Stage 2: ◦ Employ LightGBM and Optuna to integrate the predictions from Stage 1. • Stage 3: ◦ Obtain our final prediction by taking a simple average of the Stage 2 outputs. Models: Our final solution overview
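Only Stage 3 is simple enough to show in a few lines; Stage 2 (LightGBM + Optuna stacking) is omitted here. A minimal sketch of the final blending step, where each Stage 2 model emits one score per candidate:

```python
def stage3_average(stage2_preds):
    """Element-wise mean over the stage-2 prediction lists (simple blend)."""
    n_models = len(stage2_preds)
    return [sum(scores) / n_models for scores in zip(*stage2_preds)]

# Two hypothetical stage-2 models, two candidate articles.
final = stage3_average([[0.9, 0.1], [0.7, 0.3]])
```

Since AUC only depends on the ordering of scores, an unweighted average is a reasonable default when the base predictions are similarly calibrated.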
  16. © DeNA Co., Ltd. 16 • Predict the click probabilities

    for each inview article as a token classification task. ◦ Input impression and inview features into the Transformer simultaneously, and output click probabilities. ◦ This approach allows consideration of other articles and the overall impression context when predicting each inview article. ◦ Loss: binary cross-entropy on whether the article was clicked or not. ◦ Sampling: 3 impressions per user per epoch (approximately 25% of all impressions). Models: Transformer-based
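To illustrate the token-classification framing, a toy sketch (not the team's actual network): one impression token plus one token per inview article pass through self-attention, so each inview token sees the whole impression context before a shared head produces its click probability.

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention (projection weights omitted)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def predict_clicks(impression_emb, inview_embs, w_out):
    """Score every inview token in one forward pass; sigmoid per token."""
    tokens = np.vstack([impression_emb, inview_embs])  # (1 + n_inview, d)
    h = self_attention(tokens)
    logits = h[1:] @ w_out  # drop the impression token, score each article
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
d = 8
probs = predict_clicks(rng.normal(size=d), rng.normal(size=(4, d)), rng.normal(size=d))
```

During training, each token's probability would be compared against its click label with binary cross-entropy, matching the loss described above.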
  17. © DeNA Co., Ltd. 17 • Impression embedding: ◦ Aim

    to encode the contextual features of an impression into a single node. ◦ Encode categorical and numerical features separately, then combine them before feeding into the next layers. ◦ Use sinusoidal embedding to help neural networks learn from diverse numerical features. Models: Transformer-based
    Table: Description of impression features
    | Feature name | Feature type | L1 | L2 | L3 | Description |
    | elapsed_ts_from_history | Numerical | - | - | - | Time difference between the target impression and the most recent click in the “history”. |
    | elapsed_ts_from_future | Numerical | - | ✓ | - | Time difference to the closest future impression within the “behavior”. |
    | elapsed_ts_from_past | Numerical | - | - | ✓ | Time difference to the closest past impression within the “behavior”. |
    | elapsed_ts_from_day_start | Numerical | - | - | - | Time since the day start (00:00) in Denmark. |
    | num_history_articles | Numerical | - | - | - | Number of unique clicked articles within the “history”. |
    | num_future_articles | Numerical | - | ✓ | - | Number of unique inview articles appearing in the “behavior” after the target impression. |
    | num_past_articles | Numerical | - | - | ✓ | Number of unique inview articles appearing in the “behavior” before the target impression. |
    | read_time, scroll_percentage | Numerical | - | - | - | Numerical features related to the given target impression. |
    | device_type | Categorical | - | - | - | Categorical feature related to the given target impression. |
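A common form of sinusoidal embedding for a scalar feature, analogous to Transformer positional encodings, is shown below; the deck does not specify the exact variant used, so this formulation is an assumption.

```python
import numpy as np

def sinusoidal_embedding(value, dim=8, max_period=10_000.0):
    """Map a scalar to sin/cos features at geometrically spaced frequencies,
    giving the network a multi-scale view of one numerical input."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = sinusoidal_embedding(123456.0, dim=8)
```

Because every output lies in [-1, 1] regardless of the raw magnitude, features spanning seconds to weeks can share one encoder without manual scaling.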
  18. © DeNA Co., Ltd. 18 • Inview embedding: ◦ Aim

    to encode the inview article features into their corresponding nodes. ◦ Features are transformed using different methods according to their feature types, then combined before feeding into the next layers. Models: Transformer-based
    Table: Description of inview features (selected)
    | Feature name | Feature type | L1 | L2 | L3 | Description |
    | article_published_ts_diff | Numerical | - | - | - | Difference between the target impression time and the published time of the inview article. |
    | inview_article_ts_ranks_desc | Numerical | - | - | - | Rank of the published time of the inview article within the inview set. |
    | history_article_log_counts | Numerical | - | - | - | Logarithm of the article's occurrence count in the history. |
    | history_article_log_ranks | Numerical | - | - | - | Rank within the inview set based on this count. |
    | article_history_topics_embedding_similarity | Numerical | - | - | - | Cosine similarity between the topic embedding of the target inview article and the average of topic embeddings from articles in the history. |
    | global_article_-5m_5m_normed_counts | Numerical | - | ✓ | ✓ | Frequency of the article's appearances in all users' behavior data within 5 minutes of the target impression, normalized within the inview set. |
    | article_total_inviews, article_total_pageviews | Numerical | ✓ | - | - | Total number of times an article has been inview / clicked by users within the first 7 days after it was published. |
  19. © DeNA Co., Ltd. 19 Models: GBDT-based • GBDT-based models

    cannot predict for multiple inview articles simultaneously. ◦ Create features for each inview article and make predictions independently. ◦ Optimize the ranking within the same impression group using LambdaRank during training. ◦ Sampling: 20% of user impression data. Figure: each inview article gets its own row of impression + inview features; the GBDT model scores each row, and the scores within an impression group are trained with a LambdaRank loss.
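The row-and-group construction described above can be sketched as follows; the input dictionary layout is hypothetical.

```python
def build_ranking_dataset(impressions):
    """Flatten impressions into one row per (impression, inview article),
    plus a group-size array telling the ranker which rows share an impression."""
    rows, labels, groups = [], [], []
    for imp in impressions:
        for article in imp["inview"]:
            # impression-level and article-level features share one row
            rows.append({**imp["features"], **article["features"]})
            labels.append(1 if article["clicked"] else 0)
        groups.append(len(imp["inview"]))
    return rows, labels, groups

rows, labels, groups = build_ranking_dataset([
    {"features": {"device_type": 1}, "inview": [
        {"features": {"age_sec": 120}, "clicked": True},
        {"features": {"age_sec": 9000}, "clicked": False},
    ]},
    {"features": {"device_type": 2}, "inview": [
        {"features": {"age_sec": 60}, "clicked": False},
        {"features": {"age_sec": 300}, "clicked": True},
        {"features": {"age_sec": 600}, "clicked": False},
    ]},
])
```

With LightGBM, such a dataset would typically be fed to `LGBMRanker(objective="lambdarank")` via `fit(X, labels, group=groups)`, so the pairwise ranking loss is only computed among candidates of the same impression.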
  20. © DeNA Co., Ltd. 20 Table of contents 1 Introduction

    2 Data splitting / Feature engineering 3 Models 4 Experiments 5 Conclusion
  21. © DeNA Co., Ltd. 21 Experiment 1: Model performance •

    Key findings: ◦ The proposed data-splitting strategy (re-training using validation) is better than the standard data-splitting strategy. ◦ The Transformer-based model achieved the highest scores among single models, while ensembling with GBDT models further improved performance.
    Table: AUC scores of each model under different data splitting methods.
    | Model | Validation | Test (Training: 5/18~5/24) | Test (Training: 5/18~6/1) |
    | Transformer | 0.8734 | 0.8764 | 0.8864 |
    | LightGBM | 0.8668 | 0.8694 | 0.8817 |
    | CatBoost | 0.8652 | 0.8691 | 0.8805 |
    | Ensemble | 0.8802 | 0.8827 | 0.8924 |
  22. © DeNA Co., Ltd. 22 Experiment 1: Model performance •

    The proposed data-splitting strategy (re-training with validation) is better than the standard data-splitting strategy. ◦ The proposed method outperformed the baseline in all experiments, improving AUC by 0.0097-0.0123. ◦ This result suggests that by using validation data from periods close to the test set as training data, we could effectively mitigate the impact of trend changes. Table: AUC scores of each model under different data splitting methods (same table as the previous slide).
  23. © DeNA Co., Ltd. 23 Experiment 1: Model performance

    • Transformer-based models achieved the highest scores among single models, while ensembling with GBDT models further improved performance. ◦ The Transformer-based model outperformed the best GBDT model by 0.0047-0.0070 in AUC. ◦ This result confirms the effectiveness of the Transformer for modeling in-view article interactions and of ensemble methods for combining model predictions. Table: AUC scores of each model under different data splitting methods (same table as the previous slide).
  24. © DeNA Co., Ltd. 24 Experiment 2: Evaluation of the

    impact of data leakage • Key findings: ◦ A significant performance difference was observed depending on whether test impressions were used or not. ◦ By utilizing past impressions close to the prediction time, we could achieve performance close to that of the full feature set.
    Table: Impact of data leakage on Transformer-based model performance
    | Model | L1 | L2 | L3 | Validation | Test |
    | Full features | ✓ | ✓ | ✓ | 0.8734 | 0.8864 |
    | w/o future impressions | - | - | ✓ | 0.8495 | 0.8680 |
    | w/o test impressions | - | - | - | 0.7483 | 0.7699 |
  25. © DeNA Co., Ltd. 25 Experiment 2: Evaluation of the

    impact of data leakage • A significant performance difference was observed depending on whether test impressions were used or not. ◦ Features about short-term article popularity, derived from test impressions, had a dominant impact on model performance. ◦ This result highlights the complexity of data handling in recommendation systems and underscores the need for careful consideration in experimental design. Table: Impact of data leakage on Transformer-based model performance (same table as the previous slide).
  26. © DeNA Co., Ltd. 26 Experiment 2: Evaluation of the

    impact of data leakage • By utilizing past impressions close to the prediction time, we could achieve performance close to that of the full feature set. ◦ Using very recent data (impressions from just a minute before prediction) to capture short-term trends can greatly boost model accuracy. ◦ This result suggests two potential strategies for improvement: ▪ Developing a system that can process and use data from the last few minutes in real time. ▪ Updating ML models more frequently, perhaps as often as every hour. Table: Impact of data leakage on Transformer-based model performance (same table as the previous slide).
  27. © DeNA Co., Ltd. 27 Table of contents 1 Introduction

    2 Data splitting / Feature engineering 3 Models 4 Experiments 5 Conclusion
  28. © DeNA Co., Ltd. 28 Conclusion • We described team

    “:D”'s solution for the RecSys Challenge 2024, which won 1st place. ◦ We introduced our feature engineering methods and data-splitting strategies to address the temporal nature of news articles and improve model generalization. ◦ We proposed an efficient method using a Transformer-based model to process inview and impression features simultaneously. ◦ Our experiments on data leakage provided insights into the usefulness of statistical information from recent impressions.
  29. © DeNA Co., Ltd. 30 • Using ebnerd_small dataset for

    local validation ◦ Calculating AUC for the whole EB-NeRD dataset is computationally very expensive (validation impressions: 12M; AUC calculation on a single process: 4-5 hours). ◦ We used the ebnerd_small dataset (validation impressions: 244K), which is sampled on a per-user basis (provided by the host). ◦ The AUC for the small dataset strongly correlated with the AUC for the whole (large) dataset. Data splitting Figure: Comparison of AUC for ebnerd_large vs. ebnerd_small (axes: AUC for ebnerd_small against AUC for ebnerd_large).
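For reference, AUC can be computed directly with the rank-sum (Mann-Whitney) formulation; a minimal pure-Python version, the kind of sanity check useful when comparing ebnerd_small scores against the full dataset:

```python
def auc(labels, scores):
    """AUC via the rank-sum formulation, with average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):  # assign average 1-based ranks over tie groups
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

score = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

This O(n log n) formulation avoids the O(n^2) pairwise comparison, though for 12M impressions a vectorized or distributed implementation would still be preferable.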