
Improving the Accuracy of Hatena Bookmark Full-Text Search


Slides presented at Hatena Engineer Seminar #5

Takuya Asano

June 16, 2015

Transcript

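  1. id:takuya-a

    Platform & Ad Tech team; joined the company in April 2015.
    Interests: • information retrieval • natural language processing • machine learning
    OSS work: develops JavaScript libraries such as kuromoji.js.
    (My icon has changed.)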
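  2. What "accuracy" means

    Two evaluation measures are fundamental across many tasks:
    • Precision = how little noise the search results contain
    • Recall = how few relevant results the search misses
    The two trade off against each other:
    • Raising precision tends to lower recall
    • Raising recall tends to lower precision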
  3. What "accuracy" means: extreme examples

    • Return only the single entry you are most confident about
      → precision is 100%, but recall is low
    • Return every entry
      → recall is 100%, but precision is low
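The two extreme cases above can be checked numerically. The following is a minimal sketch; the entry IDs and the set of relevant entries are made up for illustration and are not from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved entry IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: 10 entries, 4 of which are relevant.
all_entries = [f"e{i}" for i in range(1, 11)]
relevant = {"e1", "e2", "e3", "e4"}

# Extreme 1: return only one confident entry.
print(precision_recall(["e1"], relevant))        # (1.0, 0.25)
# Extreme 2: return everything.
print(precision_recall(all_entries, relevant))   # (0.4, 1.0)
```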
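  4. Concept search

    Search engines that use concept search:
    • Autonomy (Hewlett-Packard)
    • GETA (NII)
    • ConceptBase (JustSystems)
    • Others, such as patent search engines

    "Concept search (also called conceptual search, natural-sentence search, natural-language search, similar-document search, or associative search) is an automated information retrieval method used to search stored unstructured data (electronic archives, email, scientific literature, etc.) for information that is conceptually similar to the search query."
    Quoted from http://ja.wikipedia.org/wiki/概念検索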
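  5. How concept search works

    Three key algorithms:
    1. Related-keyword extraction: extract keywords that are related to (co-occur with) the original query
    2. Query expansion: add the related keywords to the query
    3. Concept search: run a full-text search with the expanded query
    Once the related keywords can be extracted, all that remains is to add them to the query.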
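  6. Related keywords for 京都 (Kyoto)

    1 祇園 (Gion) 2 寺 (temple) 3 075 4 神社 (shrine) 5 桜 (cherry blossoms)
    6 京 (Kyo) 7 近鉄 (Kintetsu) 8 小路 (alley) 9 紅葉 (autumn leaves) 10 美術館 (art museum)
    11 上京 (Kamigyo) 12 伝統 (tradition) 13 烏丸 (Karasuma) 14 通 (street) 15 町家 (machiya townhouse)
    16 エリア (area) 17 旅館 (ryokan inn) 18 まち (town) 19 博物館 (museum) 20 展 (exhibition)
    21 観光 (sightseeing) 22 河原町 (Kawaramachi) 23 kyoto 24 カフェ (cafe) 25 中京 (Nakagyo)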
  7. Synonyms of 京都 (Kyoto)

    (Same top-25 list as slide 6; the slide's highlighting is not preserved in the transcript.)
  8. Place names in 京都 (hyponyms)

    (Same top-25 list as slide 6; the slide's highlighting is not preserved in the transcript.)
  9. Words associated with 京都

    (Same top-25 list as slide 6; the slide's highlighting is not preserved in the transcript.)
    Annotation on the slide: "Kyoto's area code" (075 is the area code for Kyoto).
  10. History and sights of 京都

    (Same top-25 list as slide 6; the slide's highlighting is not preserved in the transcript.)
  11. Sightseeing, lodging, and transport in 京都

    (Same top-25 list as slide 6; the slide's highlighting is not preserved in the transcript.)
  12. Food and dining information for 京都

    (Same top-25 list as slide 6; the slide's highlighting is not preserved in the transcript.)
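  13. Related keywords for パナソニック (Panasonic)

    1 lumix 2 dmc 3 viera 4 津賀 (Tsuga) 5 diga
    6 プラズマ (plasma) 7 pdp 8 パナソニック (Panasonic) 9 三洋 (Sanyo) 10 btob
    11 有機 (organic) 12 三洋電機 (Sanyo Electric) 13 松下 (Matsushita) 14 大坪 (Otsubo) 15 社名 (company name)
    16 el 17 電工 (Denko) 18 電池 (battery) 19 ces 20 電機 (electric)
    21 avc 22 sanyo 23 モバイルコミュニケーションズ (Mobile Communications) 24 パネル (panel) 25 液晶 (LCD)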
  14. Effect of query expansion: synonyms are acquired automatically

    (Same top-25 list as slide 13; the slide's highlighting is not preserved in the transcript.)
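  15. Related keywords for スプラトゥーン (Splatoon)

    1 ブキ (weapon) 2 スプラトゥーン (Splatoon) 3 splatoon 4 シュータ (shooter) 5 amiibo
    6 wiiu 7 ナワバリバトル (Turf War) 8 インク (ink) 9 わかば (Wakaba) 10 tps
    11 試射 (test firing) 12 塗る (to paint) 13 シオカラーズ (Squid Sisters) 14 チャージャ (charger) 15 fps
    16 ローラ (roller) 17 pic 18 立ち回り (positioning) 19 イカ (squid) 20 ランク (rank)
    21 マップ (map) 22 対戦 (match) 23 エイム (aim) 24 マリオ (Mario) 25 敵 (enemy)

    シューター -> シュータ and ローラー -> ローラ appear in these shortened forms because of the stemming performed by the Elasticsearch analyzer.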
  16. Effect of query expansion: synonyms are acquired even for newly coined words

    (Same top-25 list as slide 15; the slide's highlighting is not preserved in the transcript.)
    Conversely, extracting the related keywords for the alphabetic "splatoon" brings up the katakana "スプラトゥーン"!
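  17. Related keywords for ユニクロ (Uniqlo)

    1 しまむら (Shimamura) 2 柳井 (Yanai) 3 ユニ 4 ヒートテック (Heattech) 5 クロ
    6 uniqlock 7 uniqlo 8 無印 (Muji) 9 良品 (Ryohin) 10 衣料 (clothing)
    11 旗艦 (flagship) 12 チノ (chino) 13 ダサい (uncool) 14 ファーストリテイリング (Fast Retailing) 15 残業 (overtime)
    16 セータ (sweater) 17 ジーンズ (jeans) 18 tシャツ (T-shirt) 19 店長 (store manager) 20 アウタ (outerwear)
    21 離職 (resignation) 22 服 (clothes) 23 品質 (quality) 24 百貨店 (department store) 25 フリース (fleece)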
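  18. Effect of query expansion: compensating for morphological-analysis errors

    (Same top-25 list as slide 17; the slide's highlighting is not preserved in the transcript.)
    • Morphological-analysis errors can be compensated for (ユニクロ is not in the dictionary).
    • Morphological analysis gives inconsistent results: even if the word was split as "ユニ/クロ" when it was indexed and the query is analyzed as the single token "ユニクロ", the search still finds it.
    • Patterns that the morphological analyzer handles poorly are learned and compensated for automatically.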
  19. Not every search need can be covered

    (Same top-25 list for 京都 as slide 6.)
    • The expansion is useless to someone who wants to know "the weather in Kyoto".
    • Because of time and space costs, the query cannot be expanded indefinitely.
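  20. Policy for feature-term extraction

    Machine learning (e.g., learning to rank) is not used:
    • Explaining the system to stakeholders and interpreting its results becomes harder
    • Hand-written rules and heuristics are hard to accommodate
    • Little data is available, so the risk of overfitting is high
    Instead, rely on classical information retrieval (IR) methods:
    • Use the term statistics that can be obtained from the Elasticsearch Term Vectors API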
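  21. Elasticsearch Term Vectors API

    Term statistics available from Term Vectors.
    Term = a unit produced by the Elasticsearch analyzer (note that it does not always coincide with what we would call a word).
    term_freq   number of occurrences (frequency) of the term in the entry
    doc_freq    number of entries in the whole index in which the term appears
    ttf         total number of occurrences (frequency) of the term across the whole index
    doc_count   total number of entries in the index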
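As a rough illustration of where these four values sit, the sketch below builds a Term Vectors request body with term and field statistics enabled and flattens a response into one record per term. The field name `body` and all numbers in the sample response are made up; the response layout follows the documented Term Vectors format, but it is worth checking against the Elasticsearch version in use.

```python
def term_vectors_request(field):
    """Request body asking for per-term and per-field statistics."""
    return {
        "fields": [field],
        "term_statistics": True,    # adds doc_freq and ttf per term
        "field_statistics": True,   # adds doc_count for the field
        "positions": False,
        "offsets": False,
        "payloads": False,
    }


def extract_stats(response, field):
    """Flatten a Term Vectors response into {term: stats}."""
    vec = response["term_vectors"][field]
    doc_count = vec["field_statistics"]["doc_count"]
    return {
        term: {
            "term_freq": s["term_freq"],
            "doc_freq": s["doc_freq"],
            "ttf": s["ttf"],
            "doc_count": doc_count,
        }
        for term, s in vec["terms"].items()
    }


# Hand-written sample response (illustrative values only).
sample = {
    "term_vectors": {
        "body": {
            "field_statistics": {"doc_count": 1000, "sum_doc_freq": 50000, "sum_ttf": 80000},
            "terms": {
                "京都": {"term_freq": 3, "doc_freq": 40, "ttf": 95},
                "寺": {"term_freq": 1, "doc_freq": 12, "ttf": 30},
            },
        }
    }
}

stats = extract_stats(sample, "body")
print(stats["京都"])
```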
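  22. Related-keyword extraction flow in Elasticsearch

    1. (Tag search) Run a narrowing search with a Filtered Query against the tag field, using the query the user entered
    2. (Fetch statistics) Fetch term statistics with the Term Vectors API (strictly speaking, the Multi Term Vectors API)
    3. (Feature-term extraction) For each document, compute term importance on the server and extract the Top-25 feature terms
    4. (Related-keyword extraction) Drop terms that appear in only a few documents, then extract the Top-25 terms with the highest scores
    (Once the related keywords are extracted, all that remains is to add them to the query.)
    Open question: how should term importance be computed?
    Tip: use a binary heap to speed up the Top-K computation.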
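Steps 3 and 4 and the binary-heap tip can be sketched as follows. This is only an illustration: the talk does not say how per-document scores are combined across documents, so summing them is an assumption, as are the function names and the `min_docs` cutoff.

```python
import heapq
from collections import defaultdict


def top_k(scores, k=25):
    """Top-k (term, score) pairs using a size-k min-heap: O(n log k)."""
    heap = []  # the smallest kept score is always at heap[0]
    for term, score in scores.items():
        if len(heap) < k:
            heapq.heappush(heap, (score, term))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, term))
    return [(term, score) for score, term in sorted(heap, reverse=True)]


def related_keywords(per_doc_features, min_docs=2, k=25):
    """Merge per-document feature terms into one related-keyword list.

    per_doc_features: one list of (term, score) pairs per document.
    Assumption: scores are summed across documents.
    """
    total = defaultdict(float)
    seen_in = defaultdict(int)
    for features in per_doc_features:
        for term, score in features:
            total[term] += score
            seen_in[term] += 1
    # Step 4: drop terms that appear in only a few documents.
    kept = {t: s for t, s in total.items() if seen_in[t] >= min_docs}
    return top_k(kept, k)


docs = [
    [("寺", 2.0), ("桜", 1.0)],
    [("寺", 1.5), ("typo", 9.0)],
]
print(related_keywords(docs, min_docs=2))  # [('寺', 3.5)]
```

Python's `heapq.nlargest` does the same job in one call; the explicit loop is shown only to make the heap invariant visible.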
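  23. Algorithms for scoring feature terms

    In information retrieval, two measures are the de facto standards for term importance (weighting):
    • TF-IDF (the product of TF, term frequency, and IDF, inverse document frequency)
    • BM25 (based on a probabilistic model; also takes document length into account)
    TF-IDF was adopted. BM25 is more expensive to compute and requires tuning two parameters.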
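  24. TF-IDF

    The TF-IDF formula (several variants exist):
    TF  : term frequency f_{i,j} → higher for terms that occur many times in the document
    IDF : inverse of the document frequency n_i (N is the number of documents) → higher for rare terms
    f_{i,j} : number of occurrences (frequency) of the term in entry j = term_freq
    N   : total number of entries in the index = doc_count
    n_i : number of entries in the whole index in which the term appears = doc_freq

    TF-IDF alone does not work well:
    • Numbers that carry little meaning, such as "1"
    • English stop words, such as "to"
      (Japanese stop words are already removed by an Elasticsearch filter)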
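The formula itself was an image and is not in the transcript. One common variant, assumed here, is f_{i,j} · log(N / n_i) with a natural logarithm; the talk may have used a different variant or log base, so the sketch below is illustrative only.

```python
import math


def idf(doc_count, doc_freq):
    """IDF_i = log(N / n_i); natural log is an assumption."""
    return math.log(doc_count / doc_freq)


def tf_idf(term_freq, doc_count, doc_freq):
    """TF-IDF_{i,j} = f_{i,j} * log(N / n_i) (one common variant)."""
    return term_freq * idf(doc_count, doc_freq)


# A term occurring 3 times in an entry and present in 10 of 1000 entries.
print(tf_idf(term_freq=3, doc_count=1000, doc_freq=10))
# A term present in every entry gets IDF 0 regardless of its frequency.
print(tf_idf(term_freq=50, doc_count=1000, doc_freq=1000))
```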
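  25. RIDF

    Residual IDF.
    Church, K. W. and Gale, W. A. (1995a). "Inverse Document Frequency (IDF): A Measure of Deviation from Poisson." In Proc. of the 3rd Workshop on Very Large Corpora, pp. 121–130.
    N   : total number of entries in the index = doc_count
    n_i : number of entries in the whole index in which the term appears = doc_freq
    F_i : total number of occurrences of the term across the whole index = ttf

    The number of occurrences of a term in a single document is modeled with a Poisson distribution.
    • Function words: scattered across many documents (uniformly distributed)
    • Content words: concentrated in a few documents (unevenly distributed)
    RIDF = the difference between the estimated IDF and the actual IDF.
    • The parameter (= expected value) λ_i of the Poisson distribution P(k; λ_i) is estimated as the term's total occurrence count (F_i) / number of documents (N), assuming the term is uniformly distributed.
    • P(0; λ_i) is the probability that the term does not occur even once, so 1 - P(0; λ_i) is the probability that it occurs at least once.
      → The denominator N (1 - P(0; λ_i)) is an estimate of the document frequency n_i
      → The right-hand side is an estimate of IDF_i
    • Low RIDF: small gap from the estimate. High RIDF: large gap from the estimate.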
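The slide's formula image is not in the transcript, but the description above pins it down to RIDF_i = log(N / n_i) - log(N / (N (1 - P(0; λ_i)))), where P(0; λ_i) = e^{-λ_i} and λ_i = F_i / N. A minimal sketch under that reading follows; the natural log and the sample counts are assumptions.

```python
import math


def ridf(doc_count, doc_freq, ttf):
    """Residual IDF: observed IDF minus the IDF a Poisson model predicts."""
    lam = ttf / doc_count                       # lambda_i = F_i / N
    p_zero = math.exp(-lam)                     # P(0; lambda_i)
    expected_df = doc_count * (1.0 - p_zero)    # estimate of n_i
    observed_idf = math.log(doc_count / doc_freq)
    expected_idf = math.log(doc_count / expected_df)
    return observed_idf - expected_idf


# Same total count (100 occurrences in 1000 entries), different spread:
print(ridf(1000, doc_freq=95, ttf=100))  # spread out, like a function word: near 0
print(ridf(1000, doc_freq=10, ttf=100))  # concentrated, like a content word: clearly > 0
```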
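  26. Gain

    The score is low for extremely high-frequency terms (appearing in almost every entry) and for extremely low-frequency terms (appearing in only a handful of entries).
    Result: it had no effect, because high-frequency terms are already filtered out by Elasticsearch.
    N   : total number of entries in the index = doc_count
    n_i : number of entries in the whole index in which the term appears = doc_freq
    Papineni, K. (2001). "Why Inverse Document Frequency?" In Proc. of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001), pp. 25–32.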
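The Gain formula was also an image. The form usually attributed to Papineni (2001) is gain_i = (n_i / N)(n_i / N - 1 - log(n_i / N)); it is assumed here rather than confirmed by the transcript. It does show the behavior described above: the score approaches zero at both frequency extremes.

```python
import math


def gain(doc_count, doc_freq):
    """Gain as commonly attributed to Papineni (2001); assumed form."""
    p = doc_freq / doc_count                 # n_i / N
    return p * (p - 1.0 - math.log(p))


N = 1000
print(gain(N, 1))      # extremely rare term: close to 0
print(gain(N, 200))    # mid-frequency term: highest of the three
print(gain(N, 1000))   # term in every entry: exactly 0
```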
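  27. Evaluating and optimizing feature-term selection

    • Evaluate on the Top-50
    • Treat it as binary classification: relevant (+1) or not relevant (-1)
    • Prepare labeled data (about 500 items)
    Example, for an entry about art:
    いく (go) -1
    ゴッホ (Gogh) +1
    あんまり (not very) -1
    ...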
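  28. Evaluation measures P@n / AP / MAP, and optimization by MAP

    This time, MAP was used for both evaluation and optimization. A MAP above 90% was achieved (on the training data).
    • P@n (Precision at n): precision over the top n ranks
    • AP (Average Precision): P@n averaged over ranks up to n
    • MAP (Mean Average Precision): AP averaged over all entries

    Max MAP : 0.9173
    -----------------------
    IDF threshold  : 6.0
    RIDF threshold : 0.55
    Gain threshold : 0.0

    Do the right thing and cross-validate.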
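The three measures can be sketched as below, using the +1 / -1 labels from the labeled data. AP here follows the standard textbook definition (P@k averaged over the ranks k at which a relevant term appears), which may differ in detail from the exact variant used in the talk; the sample rankings are made up.

```python
def precision_at(labels, n):
    """P@n: fraction of the top-n items labeled relevant (+1)."""
    return sum(1 for y in labels[:n] if y > 0) / n


def average_precision(labels):
    """AP: mean of P@k over the ranks k that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, y in enumerate(labels, start=1):
        if y > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: AP averaged over all entries (one ranking per entry)."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two hypothetical entries with their ranked feature-term labels.
rankings = [[+1, -1, +1], [-1, +1]]
print(precision_at(rankings[0], 2))
print(average_precision(rankings[0]))
print(mean_average_precision(rankings))
```

Tuning the IDF, RIDF, and Gain thresholds would then amount to searching for the combination that maximizes `mean_average_precision` on the labeled data, ideally scored on held-out folds.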
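  29. Summary: evaluation and optimization

    • Introduced three evaluation measures (P@n / AP / MAP)
    • Optimized the parameters so as to maximize the evaluation measure (MAP)
    • Achieved high accuracy by combining three score functions (TF-IDF, IDF, RIDF)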
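  30. Query expansion & concept search with Elasticsearch

    • The original keyword "must be contained" (the must clause of a Bool Query)
    • Related keywords only add to the score (the should clause of a Bool Query)
    • Apply a threshold to the score (specify min_score in the query)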
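A search body along these lines might look like the sketch below. The field name `body`, the use of `match` queries, and the `min_score` value are assumptions for illustration; the talk does not show the actual query.

```python
import json


def expanded_query(field, keyword, related, min_score):
    """Search body: original keyword required, related keywords optional."""
    return {
        "min_score": min_score,  # discard hits scoring below the threshold
        "query": {
            "bool": {
                # The original keyword must match.
                "must": [{"match": {field: keyword}}],
                # Related keywords only raise the score when they match.
                "should": [{"match": {field: term}} for term in related],
            }
        },
    }


body = expanded_query("body", "京都", ["祇園", "寺", "神社"], min_score=1.0)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Because a `bool` query with a `must` clause treats `should` clauses as optional, entries that contain only the original keyword are still returned, provided their score clears `min_score`.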
  31. References

    TF-IDF variants, BM25, feature selection, information gain, pseudo-relevance feedback, learning to rank, etc.:
    Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval. ACM Press, New York, 1999.

    Recall, precision, P@n, AP, MAP, Top-K with a binary heap, etc.:
    Büttcher, S., Clarke, C., and Cormack, G. V. Information Retrieval: Implementing and Evaluating Search Engines. The MIT Press, 2010.

    Poisson distribution model, IDF, RIDF, etc.:
    Manning, C. D. and Schütze, H. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.

    Dynamic stop word lists using IDF and RIDF:
    Amati, G., Carpineto, C., and Romano, G. (Eds.). Advances in Information Retrieval, 29th European Conference on IR Research, ECIR 2007, Rome, Italy, April 2-5, 2007, Proceedings. Lecture Notes in Computer Science, Volume 4425. Springer, 2007.

    Index-term weighting using IDF and RIDF:
    北, 津田, 獅々堀. 情報検索アルゴリズム. 共立出版, 2002.