
KASYS at the NTCIR-15 WWW-3 Task
Achieved the best performances in terms of nDCG, Q and iRBU among all the participants in the WWW-3 Task
paper: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings15/pdf/ntcir/02-NTCIR15-WWW-ShindenK.pdf

Kohei Shinden

December 10, 2020

Transcript

  1. Background
     • NTCIR-15 WWW-3 Task
       ‒ An ad-hoc document retrieval task for web documents
     • Proposed search model using BERT (Birch)
       ‒ Yilmaz et al.: Cross-Domain Modeling of Sentence-level Evidence for Document Retrieval, EMNLP 2019
       ‒ BERT has been successfully applied to a broad range of NLP tasks, including document ranking.
  2. Birch (Yilmaz et al., 2019)
     • Applies a sentence-level relevance estimator, learned from QA and microblog search datasets, to ad-hoc document retrieval
       1. The sentence-level relevance estimator is obtained by fine-tuning a pre-trained BERT model with QA and microblog search data.
       2. BM25 scores and sentence-level BERT scores are computed for the query and the sentences of each document.
       3. The document score is a weighted sum of the BM25 score and the score of the highest BERT-scoring sentence in the document (a small scoring sketch follows this slide).
     [Diagram: a pre-trained BERT model is fine-tuned on sentence-level relevance judgements; for the query "Halloween Pictures", the sentences of a document ("Trick or Treat...", "Children get candy...", "Pumpkin sweets...") receive BERT scores (0.7, 0.3, 0.1), and the top score is combined with the document's BM25 score (0.4) into the final score (0.6).]
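
The scoring step described above can be written in a few lines. The sketch below is an illustration of that step, not the authors' code; the interpolation weight alpha and the per-sentence weights are placeholder values, not the tuned ones from the paper.

```python
from typing import Sequence

def birch_score(
    bm25: float,
    sentence_bert_scores: Sequence[float],
    alpha: float = 0.5,
    weights: Sequence[float] = (1.0, 0.5, 0.25),
) -> float:
    """Combine a document's BM25 score with its top-k BERT sentence scores.

    alpha and weights are hyper-parameters tuned on a validation set;
    len(weights) plays the role of k (the "top k sentences" setting).
    """
    top_k = sorted(sentence_bert_scores, reverse=True)[: len(weights)]
    bert_part = sum(w * s for w, s in zip(weights, top_k))
    return alpha * bm25 + (1 - alpha) * bert_part

# Example usage with made-up scores
print(birch_score(bm25=0.4, sentence_bert_scores=[0.7, 0.3, 0.1]))
```
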
  3. Details of Birch
     • The document score is a weighted sum of the BM25 score and the scores of the highest BERT-scoring sentences in the document (a reconstructed formula follows this slide)
       ‒ Assumes that the most relevant sentences in a document are good indicators of document-level relevance [1]
     • 𝑓BM25(𝑑): the BM25 score of document 𝑑
     • 𝑓BERT(𝑝𝑖): the sentence-level relevance of the top 𝑖-th sentence, obtained by BERT
     • 𝑤𝑖: a hyper-parameter to be tuned on a validation set
     [1] Yilmaz et al.: Cross-Domain Modeling of Sentence-level Evidence for Document Retrieval, EMNLP 2019
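
The slide lists the ingredients of the score but not the combined formula. A plausible reconstruction, following the interpolation used by Yilmaz et al.; the interpolation weight \alpha is notation assumed here rather than taken from the slide.

```latex
% Birch-style document score: interpolate BM25 with the top-k BERT sentence scores.
% \alpha and the w_i are hyper-parameters tuned on the validation set; k is at most 3 in the submitted runs.
f(d) = \alpha \cdot f_{\mathrm{BM25}}(d) + (1 - \alpha) \sum_{i=1}^{k} w_i \cdot f_{\mathrm{BERT}}(p_i)
```
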
  4. Preliminary Experiment Details
     • Preliminary experiments to select the datasets and hyper-parameters suitable for ranking web documents
     • Datasets: MS MARCO, TREC CAR, TREC MB, and Robust04 for training; the NTCIR-14 WWW-2 test collection (with its original qrels) for validation
     • Models compared (the checkmarks in the slide mark the datasets used for training):
       ‒ Model MB: fine-tuned on TREC MB
       ‒ Model CAR: fine-tuned on TREC CAR
       ‒ Model MS MARCO: fine-tuned on MS MARCO
       ‒ Model CAR → MB: fine-tuned on TREC CAR, then on TREC MB
       ‒ Model MS MARCO → MB: fine-tuned on MS MARCO, then on TREC MB
  5. Preliminary Experiment Results & Discussion
     • Evaluated the prediction results of the Birch models
       ‒ Top k sentences: uses the k sentences with the highest BERT scores for ranking
     • MS MARCO → MB is the best. Thus, we submitted runs based on MS MARCO → MB and CAR → MB.
     [Bar chart: nDCG@10 of the BM25 baseline and of the MB, CAR, MS MARCO, CAR → MB, and MS MARCO → MB models with top 1/2/3 sentences; labelled values: 0.3098 (BM25), 0.3112 (MB), 0.3103 (CAR), 0.3266 (MS MARCO), 0.3312 (CAR → MB), 0.3318 (MS MARCO → MB).]
  6. Official Evaluation Results & Discussion
     • Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants.
     • MS MARCO → MB is the best. The CAR → MB model also achieved similar scores.
     • MS MARCO and TREC CAR probably give better results because they are web document retrieval datasets and contain a large amount of data.
     • BERT is also effective for web document retrieval.
     • Submitted runs:
       ‒ KASYS-E-CO-NEW-1: MS MARCO → MB, top 3 sentences
       ‒ KASYS-E-CO-NEW-4: MS MARCO → MB, top 2 sentences
       ‒ KASYS-E-CO-NEW-5: CAR → MB, top 3 sentences
     [Bar chart: nDCG, Q, ERR, and iRBU of the baseline and of KASYS-E-CO-NEW-1/4/5; labelled values: 0.6935 (nDCG), 0.7123 (Q), 0.7959 (ERR), 0.9389 (iRBU).]
  7. Summary of NEW Runs
     • Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants.
     • The effectiveness of BERT in ad-hoc web document retrieval tasks was verified.
     • MS MARCO → MB is the best. The CAR → MB model also achieved similar scores.
     • BERT is also effective for web document retrieval.
     • Submitted runs:
       ‒ KASYS-E-CO-NEW-1: MS MARCO → MB, top 3 sentences
       ‒ KASYS-E-CO-NEW-4: MS MARCO → MB, top 2 sentences
       ‒ KASYS-E-CO-NEW-5: CAR → MB, top 3 sentences
     [Bar chart (same as the previous slide): nDCG, Q, ERR, and iRBU of the baseline and of KASYS-E-CO-NEW-1/4/5.]
  8. Abstract of REP Runs
     • Replicating and reproducing the THUIR runs at the NTCIR-14 WWW-2 Task
     • We check whether the relative order of the models is consistent with the original result:
       ‒ THUIR: BM25 < LambdaMART (learning-to-rank model)
       ‒ KASYS (ours): BM25 < LambdaMART ❓
  9. Replication Procedure 1
     • Input: the WWW-2 and WWW-3 topics (e.g. "disney", "switch", "Canon", "honda", "Pokemon", "ice age", ‥‥) and the ClueWeb collection
     • The BM25 algorithm ranks the web documents for each topic (e.g. "Disney shop", "Tokyo Disney resort", "Disney official", ‥‥)
     • A feature-extraction program takes the ranked web documents as input and extracts eight features: tf, idf, document length, BM25, and LMIR scores
     • The pipeline up to this point is the BM25 run; LambdaMART starts from here (a feature-extraction sketch follows this slide)
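
A rough sketch of the kind of query-document features named above (tf, idf, tf-idf, document length, BM25); this is an illustration, not the THUIR or KASYS feature-extraction program, and the language-model (LMIR) features are omitted for brevity.

```python
import math
from collections import Counter

def extract_features(query_terms, doc_terms, all_docs, k1=0.9, b=0.4):
    """Return [tf, idf, tf-idf, document length, BM25] for one query-document pair."""
    N = len(all_docs)
    avg_len = sum(len(d) for d in all_docs) / N
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)

    def idf(term):  # BM25-style inverse document frequency
        df = sum(term in d for d in all_docs)
        return math.log((N - df + 0.5) / (df + 0.5) + 1)

    tf = sum(counts[t] for t in query_terms)
    idf_sum = sum(idf(t) for t in query_terms)
    bm25 = sum(
        idf(t) * counts[t] * (k1 + 1)
        / (counts[t] + k1 * (1 - b + b * doc_len / avg_len))
        for t in query_terms
    )
    return [tf, idf_sum, tf * idf_sum, doc_len, bm25]

# Example with a toy two-document collection
docs = [["disney", "shop", "tokyo"], ["canon", "camera", "review"]]
print(extract_features(["disney"], docs[0], docs))
```
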
  10. Replication Procedure 2
     • MQ Track: a dataset of topic-document relevance judgements
     • The feature-extraction program outputs one feature vector per document (e.g. "qid:001 1:0.2 ‥", "qid:001 1:0.5 ‥", "qid:001 1:0.9 ‥")
     • LambdaMART is trained on the MQ Track data and validated on the WWW-1 test collection
     • LambdaMART takes the extracted features as input and outputs the re-ranked web documents (e.g. 1st "Disney official", 2nd "Disney shop", 3rd "Tokyo Disney resort"); a training sketch follows this slide
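
A minimal sketch of training a LambdaMART-style re-ranker on such feature vectors. It uses LightGBM's lambdarank objective as a stand-in (the toolkit actually used in the runs is not stated here), and the data is synthetic rather than the MQ Track / WWW-1 collections.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X_train = rng.random((200, 8))       # 8 features per query-document pair
y_train = rng.integers(0, 3, 200)    # graded relevance labels
groups = [20] * 10                   # 10 topics with 20 candidate documents each

# LambdaMART = gradient-boosted trees trained with the lambdarank objective
ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X_train, y_train, group=groups)

# Re-rank the candidate documents of one unseen topic by predicted score
X_test = rng.random((20, 8))
ranking = np.argsort(-ranker.predict(X_test))
print(ranking)
```
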
  11. Implementation Details
     • Features for learning to rank
       ‒ TF, IDF, TF-IDF, document length, BM25 score, and three language-model-based IR scores
     • Differences from the original paper
       ‒ Although THUIR extracted the features from four fields (whole document, anchor text, title, and URL), we extracted the features from the whole document only
       ‒ Features were normalized with their maximum and minimum values (min-max normalization, sketched below), because the feature normalization was not described in the original paper
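
A minimal sketch of per-feature min-max normalization as described above; an illustration of the idea, not the exact code used for the runs.

```python
import numpy as np

def min_max_normalize(features: np.ndarray) -> np.ndarray:
    """Scale each feature column to [0, 1] using its minimum and maximum values."""
    mins = features.min(axis=0)
    maxs = features.max(axis=0)
    spans = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return (features - mins) / spans

# Example: rows are documents, columns are features
print(min_max_normalize(np.array([[3.0, 10.0], [1.0, 40.0], [2.0, 25.0]])))
```
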
  12. Preliminary Evaluation Results with Original WWW-2 qrels
     • Our results are lower than the original results
     • LambdaMART results were above BM25 for all evaluation metrics
     • Succeeded in reproducing the run
     [Bar charts: nDCG@10, Q@10, and nERR@10 of LambdaMART and BM25, comparing our runs ("Ours") with the original THUIR runs ("Original").]
  13. Official Evaluation Results
     • BM25 results were above LambdaMART for all evaluation metrics
     • Failed to reproduce the run
     [Bar charts: nDCG, Q, ERR, and iRBU of LambdaMART and BM25 in the WWW-2 and WWW-3 official results.]
  14. Conclusion
     • In the original paper, LambdaMART gave better results than BM25; on the contrary, our BM25 result was better than LambdaMART
     • We failed to replicate and reproduce the original paper
     Suggestions
     • In web search tasks, it is more effective to extract features from all fields
     • The method of feature normalization should be clarified in the paper
  15. Summary of All Runs
     NEW runs
     • Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants
     • The effectiveness of BERT in ad-hoc web document retrieval tasks was verified.
     • MS MARCO → MB is the best. The CAR → MB model also achieved similar scores.
     • BERT is also effective for web document retrieval.
     REP runs
     • In the original paper, LambdaMART gave better results than BM25; on the contrary, our BM25 result was better than LambdaMART
     • We failed to replicate and reproduce the original paper