Upgrade to Pro — share decks privately, control downloads, hide ads and more …

検索エンジン自作入門 Go Conference 2021 Spring

検索エンジン自作入門 Go Conference 2021 Spring

Go Conference 2021 Springの登壇資料です

アウトライン
1. 検索エンジンとは ~ 一般的な検索エンジンの仕組みと構成要素
2. 自作した検索エンジンの紹介 ~ 具体的に自作した検索エンジンの構成要素と動作例
3. 自作した検索エンジンの実装 ~ アルゴリズムとデータ構造、ライブラリ
4. おわりに ~ 検索エンジンを自作した感想

kotaroooo0

April 24, 2021
Tweet

More Decks by kotaroooo0

Other Decks in Programming

Transcript

  1. ݕࡧΤϯδϯͱ͸ υΩϡϝϯτͷू߹͔ΒΫΤϦʹద͢ΔυΩϡϝϯτΛฦ͢ ݕࡧΤϯδϯ ΫΤϦೖྗ υΩϡϝϯτฦ٫ υΩϡϝϯτొ࿥ 🤖 👦 Ϣʔβʔଆ ఏڙଆ

    ಛʹෳ਺υΩϡϝϯτ͔ΒಛఆͷจࣈྻΛݕࡧ͢Δ͜ͱΛશจݕࡧ 7 (GoogleݕࡧͰ͸Webϖʔδ) (GoogleݕࡧͰ͸ΫϩʔϦϯά)
  2. શจݕࡧͷछྨ G O R U B Y J S J

    S J S J S J S J S G O R U B Y G O R U B Y J S Grepܕ ΠϯσοΫεܕ ྫ: υΩϡϝϯτʮGO RUBY JSʯʹจࣈྻʮJSʯؚ͕·Ε͍ͯΔ͔ݕࡧ 8 ࣄલʹݕࡧର৅ͷυΩϡϝϯτͷࡧҾσʔλΛ࡞੒
  3. ΠϯσοΫεܕݕࡧΤϯδϯͷߏ੒ ΠϯσοΫε ΫΤϦೖྗ υΩϡϝϯτొ࿥ Analyzer Analyzer Indexer Searcher υΩϡϝϯτฦ٫ 9

    Sorter ҎԼͷཁૉͰߏ੒͞Ε͍ͯ·͢(ৄࡉ͸͜ͷޙઆ໌͠·͢) ※Sorter͸࣮૷͍ͯ͠ͳ͍ͨΊɺҎޙͷઆ໌Ͱ͸লུ
  4. Indexing “I have pens.” “We have a desk.” Indexer Analyzer

    จࣈྻΛ୯ޠ΁෼ղ సஔΠϯσοΫε࡞੒ 11 1 2 have pen we desk 1 1 2 2 2 సஔΠϯσοΫε υΩϡϝϯτΛసஔΠϯσοΫεʹొ࿥͢Δ
  5. Search ΫΤϦ “PENS” υΩϡϝϯτ1͕ώοτ Searcher Analzyer จࣈྻΛ୯ޠ΁෼ղ “PENS” → “pen”

    సஔΠϯσοΫεΛ୳ࡧ 12 have pen we desk 1 1 2 2 2 సஔΠϯσοΫεΛ୳ࡧͯ͠ద߹͢ΔυΩϡϝϯτΛ୳͢
  6. Analyzer Char Filter จࣈྻΛ୯ޠ΁෼ׂ͢Δલͷจࣈͷௐ੔ ྫ: & → & 13 Token

    Filter ෼ׂͨ͠୯ޠΛௐ੔ ྫ: খจࣈ΁౷Ұɺετοϓϫʔυআڈ Tokenizer จࣈྻΛ୯ޠʹ෼ׂɺදه༳Εٵऩ ྫ: I have pens → I, have, pens Char Filter Tokenizer Token Filter 3ͭ߹ΘͤͯAnalyzer จࣈྻ ෳ਺୯ޠ දه༳ΕΛٵऩ &จࣈྻΛ୯ޠʹ෼ׂ Indexing, Search྆ํͰར༻ ※྆ํͰಉ͡AnalyzerΛ࢖Θͳ͍ͱޮՌ͸ബ͍
  7. ಈ࡞ྫ Analyzer Analyzeલ MappingCharFilter StandardTokenizer LowercaseFilter StemmerFilter StopWordFilter I feel

    TIRED :( I feel TIRED sad I, feel, TIRED, sad i, feel, tired, sad i, feel, tire, sad feel, tire, sad ॲཧͷྲྀΕ 17
  8. ಈ࡞ྫ Indexing & Search 19 ΠϯσοΫε΁௥Ճ 1: “Go Ruby PHP”

    2: “Go PHP Python” ϑϨʔζݕࡧ ”go php” ANDݕࡧ ”go php” 2ͷΈώοτ 1,2͕ώοτ 1⃣ 2⃣ 3⃣ 1⃣ 2⃣ 3⃣ 3: “Go Python Ruby”
  9. ಈ࡞ྫ Indexing • సஔΠϯσοΫεΛ࡞੒͢ΔIndexerͷॳظԽ • ҎԼͷυΩϡϝϯτΛIndexing(సஔΠϯσοΫεΛੜ੒͠MySQL΁อଘ) 1: “Go Ruby PHP”

    • Analyzeޙ: go, ruby, php 2: “Go PHP Python” • Analyzeޙ: go, php, python 3: “Go Python Ruby” • Analyzeޙ: go, python, ruby 21 go 1 1:1 ruby php 2:1 3:1 1:2 1:3 3:3 2:2 ɾ ɾ ɾ υΩϡϝϯτID:ग़ݱҐஔ
  10. Analyzerͷ࣮૷ 25 Analyzerߏ଄ମ • CharFilter InterfaceͷSlice • Tokenizer Interface(1ͭͷΈ) •

    ToknenFilter InterfaceͷSlice Analyzeϝιου • string → TokenStream(TokenͷSlice) • CharFilter, Tokenizer, TokenFilterͷॱʹ దԠ
  11. Token Filterͷ࣮૷ 28 TokenFilter Interface • Filterϝιου ࣮૷ྫ: LowercaseFilter •

    strings.ToLowerͰখจࣈԽ ࣮૷ྫ: StopWordFilter • ର৅ͷจࣈΛTokenStream͔Βল͘
  12. Token Filterͷ࣮૷ ReadingformFilter • ׽ࣈ → ϩʔϚࣈ΁ม׵͢ΔFilter • ׽ࣈ →

    ฏԾ໊ → ϩʔϚࣈ • ׽ࣈ → ฏԾ໊(ܗଶૉղੳث͕ߦ͏) • ฏԾ໊ → ϩʔϚࣈ(ࣗ࡞) ࣮૷ • ۪௚ͳϚοϐϯά+ ϔϘϯࣜͷϧʔϧ 29
  13. సஔΠϯσοΫεͷ࣮૷ సஔΠϯσοΫε(ࣙॻ+ϙεςΟϯάϦετ) • Ϛοϓ: τʔΫϯID → ϙεςΟϯάϦετ ϙεςΟϯάϦετ • ϙεςΟϯάͷϦϯΫτϦετ

    ϙεςΟϯά • υΩϡϝϯτID • υΩϡϝϯτதͷҐஔ৘ใ(ϑϨʔζݕࡧ༻) • ࣍ͷϙεςΟϯά΁ͷϙΠϯλ 30
  14. సஔΠϯσοΫεͷ࣮૷ 31 0 TokenID(mapͷkey) 3 PostingListͷPostings Postings 1 6 17

    90 198 5 8 14 Postings Postings nil Next Next Next Next ※ऴ୺͸nilΛࢦ͢ TokenID 0ͷτʔΫϯ͕ υΩϡϝϯτID3,5,8,14ʹؚ·Ε͍ͯΔ͜ͱΛࣔ͢
  15. సஔΠϯσοΫεͷ࣮૷ 32 0 TokenID 3 PostingListͷPostings Postings 1 3 7

    90 198 5 8 14 Postings Postings nil Next Next Next Next ※ऴ୺͸nilΛࢦ͢ 7 Postings Next ϦϯΫτϦετ: ͋Δϊʔυ௚ޙͷσʔλͷ௥Ճ࡟আO(1) ϙεςΟϯάϦετͰ͸υΩϡϝϯτIDͷঢॱΛΩʔϓ͢ΔͨΊ ͜ͷૢ࡞͕ଟ͘ϦϯΫτϦετ͕޲͍͍ͯΔ ϙΠϯλͷ෇͚ସ͑ͷΈ
  16. Indexerͷ࣮૷ ( ͷ۩ମྫ) ϝϞϦ্ͷసஔΠϯσοΫε ௥Ճ͍ͨ͠υΩϡϝϯτ go php go php MySQL্ͷసஔΠϯσοΫε

    సஔΠϯσοΫε΁௥Ճ ϝϞϦͱMySQLͷసஔΠϯσοΫε Ϛʔδ ӬଓԽ ”Go PHP” go, php Analyze 33 Analyzer Indexer MySQL͔ΒRead 1⃣ 2⃣ 3⃣ ϝϞϦ্ͷసஔΠϯσοΫεͷ αΠζ͕ᮢ஋Λ௒͑Δͱ… 4⃣ 3⃣ MySQL΁Write (ᮢ஋͸ετϨʔδ΁ͷΞΫηεճ਺ͱϝϞϦ࢖༻ྔͷτϨʔυΦϑ)
  17. Searcherͷ࣮૷ 35 go php go php MySQL্ͷసஔΠϯσοΫε ݕࡧΫΤϦ ”Go PHP”

    go, php Analyzer Analyze Searcher ద߹ͨ͠υΩϡϝϯτ τʔΫϯʹରԠ͢ΔϙεςΟϯάϦετΛ MySQL͔ΒϝϞϦ΁Read ϙεςΟϯάϦετ ૸ࠪ
  18. ద߹͢ΔυΩϡϝϯτΛ୳ͨ͢ΊʹͲ͏ϙεςΟϯάϦετΛͲ͏૸ࠪ͢Δ͔? DAAT(Document At A Time) • ΫΤϦʹؚ·ΕΔτʔΫϯͷϙεςΟϯάϦετͷΧʔιϧΛ͢΂ͯ։͖ಉ ࣌ʹॲཧ • υΩϡϝϯτ͝ͱʹ૸ࠪ͢Δ

    • Elasticsearch(Lucene)Ͱ΋࠾༻ ※DAATʹରͯ͠TAAT(Team At A Time)͕͋Δ͕ࠓճ͸ϊʔλον ࢀߟࢿྉΛఴ෇ https://www.slideshare.net/tsubosaka/top-kquery Searcherͷ࣮૷ 36
  19. DAATͰͷORݕࡧ ΧʔιϧΛಈ͔ͯ͠ɺશͯͷIDΛॏෳͳ͘௥Ճɹ ✌︎ (‘ω' ✌︎ ) ݪ࢝త ( ✌︎ ’ω')

    ✌︎ ※࣮ࡍͷݕࡧΤϯδϯͰ͸ORݕࡧ͸࠷దԽ͕ߦΘΕ͍ͯΔ(͜ͷτʔΫͰ͸ϊʔλον) 41 ྫ “PHP” OR “Ruby” ݕࡧ ruby php 1 2 5 5 2 4
  20. ҎԼೋͭͷυΩϡϝϯτ΁“Amazon Prime”ͱݕࡧ • D1: “Amazon Prime movies” • D2: “a

    prime concern of Amazon” ϙεςΟϯάʹυΩϡϝϯτதͷҐஔ৘ใ ୯ޠͷॱংΛߟྀͰ͖Δ D1ͷΈΛώοτͤ͞ΒΕΔ 42 ϑϨʔζݕࡧ
  21. ΫΤϦ“Go PHP”Ͱݕࡧ͢Δ৔߹ 43 ϑϨʔζݕࡧ go 1:1 php 2:1 1:3 2:2

    υΩϡϝϯτID:ग़ݱҐஔ go 1:1-0 php 2:1-0 1:3-1 2:2-1 υΩϡϝϯτID:૬ରग़ݱҐஔ go 1:1 php 2:1 1:2 2:1 υΩϡϝϯτID:૬ରग़ݱҐஔ ૬ରग़ݱҐஔ͕౳͍͠ → ϑϨʔζΛؚΜͰ͍Δ 1: “Go Ruby PHP” 2: “Go PHP Python” ΫΤϦ্ͰGo͸0൪໨ɺPHP͸1൪໨ͳͷͰ૬ରग़ݱҐஔΛܭࢉ͢Δ సஔΠϯσοΫεʹొ࿥ͨ͠υΩϡϝϯτ
  22. సஔΠϯσοΫεͷѹॖ go υΩϡϝϯτID ̍ 4 21 241 412 500 560

    600 888 1324 2000 2321 10940 23131 29898 3001 8090 9012 3001 8090 9012 901 1201 ϙεςΟϯάϦετʹυΩϡϝϯτID͕Կेສͱ͋Δ৔߹΋͋Γ༰ྔ͕େ͖͘ͳΔ → Մม௕ූ߸Խͱࠩ෼ྻΛ૊Έ߹Θͤͯѹॖ 47
  23. సஔΠϯσοΫεͷѹॖ ࠩ෼ྻ go ̍ 5 50 250 500 800 1000

    2500 2800 υΩϡϝϯτID͸ঢॱʹιʔτ͞Ε͍ͯΔ ࠩ෼ΛͱΔ → Մม௕ූ߸Խʹ͓͍ͯ৘ใྔ͕গͳ͘ͳΔ go ̍ 4 45 200 250 300 200 1500 300 υΩϡϝϯτIDࠩ෼ 49
  24. సஔΠϯσοΫεͷѹॖ Մม௕ූ߸Խ • Մม௕ූ߸Խ͸ࣗ࡞ͤͣʹඪ४ύοέʔδ”encoding/gob"Λར༻ • JSON΍XML, Protocol buffersͳͲ͋Δ͕Goʹดͨ͡؀ڥͰ͸gobͷํ͕ศར • json:”hoge"ͷΑ͏ͳλά͕ෆཁ

    • ςΩετϕʔεͰͳ͘όΠφϦʔͰແବ͕ͳ͍ • ࣗݾهड़త • Goͷม਺ͳΒstructͰ΋nilͰ΋ΤϯίʔυͰ͖Δ(ϦϯΫτϦετ΋Ͱ͖ͨ) • gobͷunsigned intͷΤϯίʔυ • 0 → 00000000(1byte) • 7 → 00000111(1byte) • 256 → 11111110 000000001 00000000(3byte) https://blog.golang.org/gob https://golang.org/pkg/encoding/gob/ 50
  25. ͓ΘΓʹ 53 • ंྠͷ࠶ൃ໌ͰΠϯϓοτͱΞ΢τϓοτͷαΠΫϧΛճͤͨ • ࣗ࡞ָ͕͍͠(Ϟνϕʔγϣϯ) • ৄ͘͠ͳ͍ٕज़ʹ৮ΕΔඞཁ(Πϯϓοτ) • (ྫ)ΤϯίʔσΟϯά

    • encoding/gob • encoding/binary • protobuf • धཁ͕͋Δ(Ξ΢τϓοτ) • ϒϩά౤ߘ→ొஃ • Ξ΢τϓοτۦಈ։ൃ INPUT OUTPUT
  26. TAATͰͷORݕࡧ 1. ͲͷΩʔͰ΋ྑ͍ͷͰAccumlatorΛ࡞੒ [1,2,5] (php) 2. ॏෳ͠ͳ͍શͯͷIDΛAccumlatorʹ௥Ճ ✌︎ (‘ω' ✌︎

    ) ݪ࢝త ( ✌︎ ’ω') ✌︎ ※࣮ࡍͷݕࡧΤϯδϯͰ͸ORݕࡧ͸࠷దԽ͕ߦΘΕ͍ͯΔ(͜ͷτʔΫͰ͸৮Ε·ͤΜ) 58 “PHP” OR “Ruby” ݕࡧ ruby php 1 2 5 5 2