Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Elasticsearch で部内 Wiki 検索高速化

Avatar for nonylene nonylene
June 05, 2017

Elasticsearch で部内 Wiki 検索高速化

KMC 例会講座 資料

Avatar for nonylene

nonylene

June 05, 2017
Tweet

More Decks by nonylene

Other Decks in Technology

Transcript

  1. ໨࣍ 1. PukiWiki ͷ࿩ 2. ߴ଎Խ͢Δʹ͸ 3. Elasticsearch ಋೖ 4.

    Elasticsearch ݕࡧ 5. Heineken ( React ) 6. ੒Ռɾײ૝
  2. • όʔδϣϯ͕ݹ͍ • 1.4.8_alpha2 ( 2006 ೥ ) • Slack

    ͷ౤ߘͳͲඍົʹվ଄͍ͯ͠Δ PukiWiki at KMC ͷͭΒ͍ͱ͜
  3. • PukiWiki ͷσʔλ͸શͯςΩετϑΝΠϧ $ ls /…/pukiwiki/wiki/ 28A1A6A1FEA1A629.txt 31B6A6A5D0A5EAA5B1A1BCA5C9A5BFA5EFA1BCA5C7A5A3A5D5A5A7A5F 3A5B9A5B2A1BCA5E0.txt 323034384149A5B3A5F3A5C6A5B9A5C8.txt

    3332A5ADA5C3A5C1A5F3C0B0C8F7B7D7B2E8.txt … PukiWiki ͷσʔλ ※ λΠτϧΛ euc-jp ͰΤϯίʔυͨ͠όΠτྻͷ Hex ͕ϑΝΠϧ໊ʹͳΔ
  4. • PukiWiki ͷσʔλ͸શͯςΩετϑΝΠϧ $ nkf /…/pukiwiki/wiki/BFB7B4BFA5B3A5F3A5D132303137.txt [[৽׻ίϯύ]] ~&size(25){ͲΜͲΜࢀՃ͍ͯ͜͠͏ͳ}; *໨࣍ [#jfaa7b62]

    #contents ~৽ೖੜͷ͔ͨ͸΋ͪΖΜɺ্ճੜ΍OBͷํʑ΋Ұॹʹָ͠Έ·͠ΐ͏ʂ … ※ euc-jp ͳͷͰ nkf Ͱม׵͍ͯ͠Δ PukiWiki ͷσʔλ
  5. PukiWiki ͷݕࡧ͸ͳͥ஗͍ • PukiWiki ͕ݕࡧ͢Δ࣌… • PHP Ͱຖճશͯͷ Wiki ϑΝΠϧͷ಺༰Λऔಘ

    • lib/func.php ͷ do_search ࢀর • औಘͨ͠จࣈྻʹରͯ͠શจݕࡧ
  6. Elasticseach Ͱղੳ • Analyzer ͷߏ੒ • Char Filter Ͱਖ਼نԽ౳ •

    Tokenizer Ͱ෼ׂʢ͕͜͜େࣄʣ • Token Filter Ͱਖ਼نԽ౳
  7. Char Filter • จࣈؒͷҧ͍Λٵऩ͢Δ • ͂ʢ̴͉̺̰̺̽̈́ʣ → s (hankaku) •

    ᶨ → ϔΫλʔϧ • Tokenizer ʹೖΕΔલʹෆཁͳจࣈΛআ͘
  8. Char Filter • ICU Analysis Plugin • ެࣜϓϥάΠϯ • ྑ͍ײ͡ʹਖ਼نԽͯ͘͠ΕΔ

    • શ֯ˠ൒֯ɺه߸෼ղɺେจࣈˠখจࣈ౳ • ౉ᬒˠ౉ลͱ͔͸΍Βͳ͍
  9. Tokenizer • จষΛ͍͍ײ͡ʹ۠੾Δ • ۠੾ͬͨޠ۟ͷҐஔΛه࿥ ( Index ) ͢Δ →

    ݕࡧޠ͕۟͋Δ৔ॴ͕͙͢෼͔Δ → શจݕࡧΑΓ΋଎͍ʂ
  10. • Index: [(‘͜Μʹͪ͸’, 0), (‘KMC’, 
 6), (‘Hello!’, 10)]
 •

    ‘KMC’ Ͱݕࡧ → ʮ 6 ൪໨ʹ͋Δʯͱ͙͢෼͔Δ Tokenizer
  11. • Index: [(‘͜Μʹͪ͸’, 0), (‘KMC’, 
 6), (‘Hello!’, 10)]
 •

    ‘KM’ Ͱݕࡧ → ʮͦΜͳ΋ͷ͸ͳ͍ʯ Tokenizer
  12. • ํ๏ᶃ n จࣈ͝ͱʹ۠੾Δ ( N-Gram )
 • “͜Μʹͪ͸ɺࠓ೔΋͍͍ఱؾͰ͢Ͷ” →

    Index: [(‘͜Μ’, 0), (‘Μʹ’, 1), (‘ʹͪ’, 
 2), (‘ͪ͸’,3), … , (‘͢Ͷ’, 14)] Tokenizer
  13. Tokenizer • Index: [(‘͜Μ’, 0), (‘Μʹ’, 1), (‘ʹͪ’, 2), (‘ͪ͸’,3),

    … , (‘͢Ͷ’, 14)]
 • ‘͜Μʹͪ’ Ͱݕࡧ → ‘͜Μ’ ͕ 0 ൪໨ʹώοτ → ͦͷޙ΋ਖ਼ͦ͠͏ → ʮ 0 ൪໨ʹ͋Δʯͱ͙͢෼͔Δ
  14. • ํ๏ᶃ n จࣈ͝ͱʹه࿥͢Δ ( N-Gram ) • ར఺ •

    ඞͣώοτ͢ΔʢऔΓ͜΅͕͠ͳ͍ʣ • ۠੾ͬͨจࣈҎ্ͷ৔߹ʹݶΔ • ؆୯ Tokenizer
  15. • ํ๏ᶃ n จࣈ͝ͱʹه࿥͢Δ ( N-Gram ) • ܽ఺ •

    Index ͕ංେԽ͠΍͍͢ • ෆཁͳ΋ͷʹϚον͠΍͍͢ • ྫ: ’͍ఱ’ → ‘͍͍ఱؾ’ Tokenizer
  16. Token Filter • Tokenize ޙͷޠ۟ʹର͔͚ͯ͠ΔϑΟϧλ • ྨޠɾ-ed / -s ͷ౷ҰͳͲ

    • ͜͜ͰେจࣈখจࣈΛἧ͑Δ৔߹΋
 • Heineken Ͱ͸࢖͍ͬͯͳ͍
  17. Elasticseach ͷσʔλߏ଄ Cluster: KMC Index: hoge Index: piyo Type: page

    Type: relation Type: item Typ Field: title Field: body Field: modified Field: from Field: to Field: price Field: name Field: desc Field: available Fiel Fiel
  18. • Index ͱ Type Λఆٛ • Type ʹ Field Λઃఆ

    • Analyzer / Datatype ͳͲ • Index ʹ Type Λઃఆ
 (mapping) Elasticseach ͷσʔλߏ଄ Index: pukiwiki Type: page Field: title Field: body Field: modified
  19. • Analyzer ఆٛ { "settings": { "analysis": { "analyzer": {

    "jp_analyzer": { "tokenizer": "jp_tokenizer", "char_filter":
 [ "html_strip", “icu_normalizer" ], … } }, "tokenizer": { "jp_tokenizer": { "type": “ngram", … , "token_chars":
 [ "letter", "digit", "symbol", "punctuation" ] }}}}, … }
  20. • Mapping ఆٛ { … , "mappings": { "page": {

    … "properties": { "title": { … }, "title_url_encoded": { … }, "body": { "type": "text", "analyzer": "jp_analyzer", "term_vector" : "with_positions_offsets" }, "modified": { "type": "date", "format": "strict_date_optional_time||epoch_millis" } }}}} Index: pukiwiki Type: page Field: title Field: body Field: modified Field: title_url_encoded
  21. • Elasticsearch is RESTful • جຊతʹશͯ JSON Ͱ΍ΓऔΓ͢Δ • Elasticsearch

    Λىಈ͢Δͱ Web αʔόʔཱ͕ͭ • ͦ͜ʹ Python ౳Ͱ JSON ͷ
 Index ఆٛΛ౤͛ͯઃఆ Elasticseach ͷઃఆ
  22. Elasticseach Clusterʢࢀߟʣ Cluster: KMC Node: foo Node: bar Replica: hoge3

    Shard: hoge1 Replica: piyo1 Replica: piyo2 Shard: hoge3 Replica: hoge1 Sha Rep
  23. Heineken-crawler • Python3 Ͱॻ͔Εͨ PukiWiki ͷΫϩʔϥ • จࣈίʔυΛ UTF-8 ʹม׵

    • λΠτϧऔಘɾม׵ • PukiWiki σʔλΛ Elasticsearch ʹ౤͛Δ
  24. Heineken ͰͷείΞ • ᶄ ߋ৽೔࣌ʹॏΈΛஔ͘ • ௚ײͰௐ੔ • origin ->

    ݱࡏ • offset -> 150೔ • scale -> 500೔ • decay -> 0.75 https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
  25. Heineken ͰͷείΞ • ᶅ λΠτϧͷ୹͞ʹॏΈΛஔ͘ • + 1 ʹ͍ͯ͠Δͷ͸ 1

    จࣈͷ࣌ରࡦ • log ͸ͳΜͱͳ͘ • sqrt ͸ͳΜͱͳ͘ score ⇤ 1 p log ( title.length + 1)
  26. ଞͷػೳ • ݕࡧޠ͔۟Β͍͍ײ͡ͷ৔ॴΛநग़Ͱ͖Δ • ϋΠϥΠτ༻ͷ HTML λάૠೖ΋Ͱ͖Δ "fields": { "body":

    { "pre_tags": ["<mark>"], "post_tags": ["</mark>"], "fragment_size": 220, …, } }
  27. Heineken ΞϓϦ࣮૷ • Elasticsearch ͸ RESTful • શ෦ JSON Ͱฦͬͯ͘Δ

    → Elasticsearch Ҏ֎ʹσʔλ͸ෆཁ → શ෦ JavaScript Ͱ΍Ε͹ྑ͍
  28. React ུ֓ • UI Λߏ੒͢ΔͨΊͷ JS ϥΠϒϥϦ • Facebook ੡

    • ֤ॴͰ࢖ΘΕ͍ͯΔ • UI ͷ֤෦඼Λ Component ͱͯ͠ߏ੒͍ͯ͘͠
  29. React ུ֓ - Virtual DOM • Virtual DOM ͰԾ૝తʹ DOM

    Λอ࣋ • σʔλͷมߋ࣌ʹ͸ Virtual DOM Λมߋ • ͦͷޙ࣮ࡍͷ DOM ͱͷࠩ෼Λ൓ө • DOM ͷมߋΛ཈͑ΒΕͯޮ཰త ৄࡉ: http://qiita.com/mizchi/items/4d25bc26def1719d52e6
  30. React ུ֓ - JSX • JSX Ͱ JS ্ʹ HTML

    Λॻ͚Δ • ςϯϓϨʔτΤϯδϯͬΆ͘ॻ͚ͯศར const element = ( <h1> Hello, {username}! </h1> );
  31. React ։ൃ؀ڥ • create-react-app • Facebook ੡ͷ؆୯ React ߏ੒πʔϧ •

    ։ൃ؀ڥ • Ϗϧυ؀ڥ • ςετ؀ڥ https://github.com/facebookincubator/create-react-app
  32. React ։ൃ؀ڥ • ES6 ( ECMA Script 6 ) •

    JavaScript ͷ৽͍͠ඪ४ • class / Arrow function / const / Promise etc.. ৄࡉ: https://www.slideshare.net/1000ch/begin-ecmascript6
  33. ͦͷଞͷύʔπ • Bootstrap • Twitter ࣾ੡ CSS / JS 


    ϑϨʔϜϫʔΫ • ෦һ໊฽ɾ
 ਆֆΞοϓϩʔμ ͳͲ
  34. Heineken ։ൃ 1. create-react-app Ͱͻͳܗ࡞੒ 2. ྑ͍ײ͡ʹ Component ࡞Δ •

    Elasticsearch ͷ API Λୟ͍ͯ൓өͤ͞Δ 3. BabelɾWebpack ͰίϯύΠϧ 4. αʔόʔʹஔ͘
  35. ౰ࣾൺ 1 / 500 • 25000 ms -> 50 ms

    0 7500 15000 22500 30000 PukiWiki Heineken 50 25,000 
  36. ײ૝ • Elasticsearch ͍͢͝ • ͱʹ͔͍͍͘ײ͡ʹͳΔ • React ศརɾES6 ָ͍͠

    • αʔόʔͰԿ΋ಈ͔͞ͳ͍ͷ͸ָ • ΍Γ͔ͨͬͨ͜ͱ͕ग़དྷͨͷͰྑ͔ͬͨ