Supporting multilingual search with Elasticsearch (Elasticsearchで多言語検索対応してみた話.pdf)
motsat
July 19, 2018
Programming
Transcript
A story about adding multilingual search support with Elasticsearch
Self-introduction
- Name: Sato (motsat)
- At MedPeer since 2017/9, web software engineer
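What I will talk about
- Searching across different languages, such as searching English documents in Japanese
- What to do when translating the documents is not an option
- One way of doing this with Elasticsearch, and its advantages and disadvantages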
"MedPeer", a website exclusively for physicians
Developing a new service
"Search PubMed papers efficiently in Japanese"
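What are PubMed papers?
A database of international medical literature.
- Written in English (very rarely in another language)
- Data can be retrieved through an API or as files over FTP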
What we want to achieve
- Search English documents in Japanese
- Search targets are the title and the body
Translating everything into Japanese in advance should be enough
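APIs for translation
- Google Translation API: 20 dollars per 1 million characters
  → Few unnatural translations of PubMed text
- Microsoft Translator API: 10 dollars (1,120 yen) per 1 million characters
  → Cheaper, but PubMed translations contain a fair number of unnatural passages
- Amazon Translate
  → Japanese not supported (as of around autumn 2017)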
Prioritizing translation quality, we chose the Google Translation API because it produced few unnatural PubMed translations.
Translation cost
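- Number of PubMed papers: 17 million (the set imported into MedPeer, growing daily)
- Average length: 1,300 characters (title 100 characters, body 1,200 characters)
  ↓
- Total: 22.1 billion characters = (100 + 1,200) * 17 million papers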
Google Translation API: 20 dollars per 1 million characters
- Cost: (22.1 billion / 1 million characters) * 20 dollars = 442,000 dollars
- In Japanese yen: 48.89 million yen (as of 2017/07/07)
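The arithmetic on these two slides can be reproduced in a few lines. This is only a check of the quoted figures; the constant names are made up for the example, and the yen/dollar rate is not stated in the deck, so it is derived here from the deck's own totals.

```python
# Figures quoted on the slides.
TITLE_CHARS = 100              # average title length
BODY_CHARS = 1_200             # average body length
DOCUMENTS = 17_000_000         # PubMed papers imported into MedPeer
USD_PER_MILLION_CHARS = 20     # Google Translation API price quoted in the deck
QUOTED_TOTAL_YEN = 48_890_000  # yen total quoted in the deck (2017/07/07)

total_chars = (TITLE_CHARS + BODY_CHARS) * DOCUMENTS
cost_usd = total_chars // 1_000_000 * USD_PER_MILLION_CHARS

# Assumption: the rate is inferred from the quoted totals, not given in the deck.
implied_yen_per_usd = QUOTED_TOTAL_YEN / cost_usd

print(f"total characters: {total_chars:,}")
print(f"cost in USD: {cost_usd:,}")
print(f"implied yen per USD: {implied_yen_per_usd:.2f}")
```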
48.89 million yen is expensive (by our company's standards)
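48.89 million yen is expensive
- Translating everything costs too much
- Even so, we do not want to reduce the number of papers
→ So how do we make "searching English documents in Japanese" work?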
1. Make documents searchable even when the originals are not translated
→ Use a dictionary for Japanese search: the Elasticsearch "Synonym Token Filter"
Elasticsearch Synonym Token Filter
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
Synonym Token Filter
A feature for configuring synonyms and near-synonyms.
Example: given the definition
    i-pod, i pod => ipod
- searching for "i-pod" also hits "i pod" and "ipod"
- searching for "i pod" also hits "i-pod" and "ipod"
    高血圧 => hypertension
    インフルエンザ => influenza
We use rules like these to link Japanese and English terms.
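As a rough illustration of how explicit rules of this shape tie terms together, the sketch below mimics the rewriting in plain Python. It is not how Elasticsearch implements the filter: it treats each phrase as a single unit and skips tokenization entirely, and the function names are invented for this example.

```python
def parse_rule(rule):
    """Parse one explicit rule 'a, b => c' into {'a': 'c', 'b': 'c'}."""
    left, right = rule.split("=>")
    target = right.strip()
    return {term.strip(): target for term in left.split(",")}


def build_table(rules):
    """Merge all rules into one lookup table."""
    table = {}
    for rule in rules:
        table.update(parse_rule(rule))
    return table


def normalize(term, table):
    """Rewrite a term to its rule target; unknown terms pass through unchanged."""
    return table.get(term, term)


RULES = [
    "i-pod, i pod => ipod",
    "高血圧 => hypertension",
    "インフルエンザ => influenza",
]
TABLE = build_table(RULES)

# Indexed text and query text are rewritten the same way, so "i-pod" and
# "i pod" end up as the same term, and a Japanese query term ends up as
# its English counterpart.
print(normalize("i-pod", TABLE), normalize("i pod", TABLE), normalize("高血圧", TABLE))
```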
Elasticsearch settings: index properties (mapping)
- Prepare Japanese fields (title_ja / body_ja) and English fields (title_en / body_en), one each for the title and the body
- Use different analyzers for the English fields and the Japanese fields

    pubmed: {
      properties: {
        title_en: { type: "text", analyzer: "english_analyzer" },
        title_ja: { type: "text", analyzer: "ja_analyzer" },
        body_en: { type: "text", analyzer: "english_analyzer" },
        body_ja: { type: "text", analyzer: "ja_analyzer" },
      },
    },
Elasticsearch settings: index analysis settings
- Add a filter entry with type: "synonym"
- Configure the filter list of "english_analyzer" (used by the English fields) to include the synonym filter; everything else is based on the official English analyzer settings

    {
      "index": {
        "analysis": {
          "filter": {
            "synonym": {
              "type": "synonym",
              "synonyms": ["高血圧 => hypertension", …]
            }
          },
          "analyzer": {
            "english_analyzer": {
              "tokenizer": "standard",
              "filter": ["synonym", "english_possessive_stemmer", "lowercase", …]
            },
            …
          }
        }
      }
    }
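A settings body like this is easy to get subtly wrong, for example by misspelling a filter name in the analyzer's filter list. The sketch below builds the settings as a Python dict and checks that every filter an analyzer refers to is actually defined. The helper names are hypothetical, only the filters visible on the slide are included (the elided ones are left out), and the `english_possessive_stemmer` definition is taken from the Elasticsearch documentation's rebuild of the `english` analyzer rather than from the deck.

```python
import json

# Filters that Elasticsearch provides without a definition in the settings.
# Only the one used on the slide is listed here.
BUILT_IN_FILTERS = {"lowercase"}


def build_analysis_settings(synonym_rules):
    """Build index settings with a synonym filter in front of the English chain."""
    return {
        "index": {
            "analysis": {
                "filter": {
                    "synonym": {
                        "type": "synonym",
                        "synonyms": list(synonym_rules),
                    },
                    # Assumption: definition copied from the documented
                    # rebuild of the built-in english analyzer.
                    "english_possessive_stemmer": {
                        "type": "stemmer",
                        "language": "possessive_english",
                    },
                },
                "analyzer": {
                    "english_analyzer": {
                        "tokenizer": "standard",
                        "filter": [
                            "synonym",
                            "english_possessive_stemmer",
                            "lowercase",
                        ],
                    }
                },
            }
        }
    }


def undefined_filters(settings):
    """Return (analyzer, filter) pairs that refer to a filter defined nowhere."""
    analysis = settings["index"]["analysis"]
    defined = set(analysis["filter"]) | BUILT_IN_FILTERS
    return [
        (name, filter_name)
        for name, analyzer in analysis["analyzer"].items()
        for filter_name in analyzer["filter"]
        if filter_name not in defined
    ]


settings = build_analysis_settings(["高血圧 => hypertension"])
print(json.dumps(settings, ensure_ascii=False, indent=2))
print(undefined_filters(settings))
```

Running a check like `undefined_filters` before creating the index catches a misspelled filter name locally instead of through a rejected request.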
Analyze result (Kibana)
"高血圧" becomes the token "hypertension"
(In practice, the token is further processed by the other filters in the earlier sample.)
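The same check can be made outside Kibana by calling the `_analyze` API on the index. The sketch below only builds the request. The index name `pubmed_articles` and the helper name are placeholders, and no output is shown because the resulting tokens depend on the full filter chain.

```python
def analyze_request(index, text, analyzer="english_analyzer"):
    """Return (method, path, body) for an Elasticsearch _analyze call."""
    return "GET", f"/{index}/_analyze", {"analyzer": analyzer, "text": text}


# "pubmed_articles" is a hypothetical index name.
method, path, body = analyze_request("pubmed_articles", "高血圧")
print(method, path, body)
```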
Searching with the keyword "高血圧" now works for both Japanese and English
- "高血圧" in Japanese text
- "hypertension" in English text
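The deck does not show the query itself. One plausible shape, sketched here as an assumption, is a `multi_match` query over the four fields from the mapping: each field analyzes the keyword with its own analyzer, so the English fields see the synonym-rewritten term. The function name is made up for this example.

```python
SEARCH_FIELDS = ["title_ja", "body_ja", "title_en", "body_en"]


def build_search_body(keyword, fields=SEARCH_FIELDS):
    """Build a search body that matches the keyword in any of the given fields."""
    return {
        "query": {
            "multi_match": {
                "query": keyword,
                "fields": list(fields),
            }
        }
    }


search_body = build_search_body("高血圧")
print(search_body)
```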
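Where to get a dictionary
A well-known dictionary:
- JMdict (Japanese-Multilingual Dictionary)
  http://www.edrdg.org/jmdict/j_jmdict.html
  It appears to include languages other than English, so it may also be usable for setting up multiple languages.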
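Whatever the source, dictionary entries have to be turned into synonym rules. The sketch below assumes the dictionary has already been reduced to (Japanese term, English term) pairs; the actual JMdict file format is not handled, and the function name is hypothetical.

```python
def to_synonym_rules(pairs):
    """Turn (japanese, english) pairs into 'ja1, ja2 => en' rule strings."""
    grouped = {}
    for japanese, english in pairs:
        ja = japanese.strip()
        en = english.strip().lower()
        # Skip entries that would break the rule syntax.
        if not ja or not en or "," in ja or "=>" in ja or "=>" in en:
            continue
        terms = grouped.setdefault(en, [])
        if ja not in terms:
            terms.append(ja)
    return [f"{', '.join(terms)} => {en}" for en, terms in grouped.items()]


# Illustrative entries only, not taken from a real dictionary file.
example_pairs = [
    ("高血圧", "Hypertension"),
    ("高血圧症", "hypertension"),
    ("インフルエンザ", "influenza"),
]
rules = to_synonym_rules(example_pairs)
print(rules)
```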
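Will any dictionary do?
- Technical terms are hard
  Without a dictionary dedicated to the field (here, medicine), technical terms are often not covered.
  → Look for a dictionary in the specialist domain, update the dictionary from search logs, and so on
MedPeer signed a contract with Rozetta Corp., which provides us with a dictionary specialized in medicine.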
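For the "update the dictionary from search logs" idea, a first step could be listing frequent queries that the dictionary does not cover. This is a minimal sketch under the assumption that each logged query is a single term; the log format and the function name are invented for illustration.

```python
from collections import Counter


def uncovered_terms(search_log, dictionary_terms, top_n=10):
    """Return the most frequent queries that have no dictionary entry."""
    counts = Counter(query.strip() for query in search_log if query.strip())
    missing = [
        (term, count)
        for term, count in counts.most_common()
        if term not in dictionary_terms
    ]
    return missing[:top_n]


# Illustrative data only.
log = ["高血圧", "糖尿病", "糖尿病", "高血圧", "心不全"]
candidates = uncovered_terms(log, {"高血圧"})
print(candidates)
```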
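Multilingual search with a dictionary
Advantages
- Documents become searchable without translating everything
Disadvantages
- A dictionary has to be prepared
- Search accuracy depends heavily on the quality of the dictionary
  → There is a good chance that not every word in the text is covered
- Strong at word-based search, weak at sentence search
- Search works, but the results are displayed in English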
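2. Translate part in advance, the rest at display time
It depends on the user interface, but:
- Titles are translated in advance
  Waiting for the translation API is painful here, since users only skim titles in a list.
  Titles have fewer characters than bodies, so translating all of them is still fairly cheap.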
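2. Translate part in advance, the rest at display time
- Bodies are translated when the detail page is displayed
  The page shows the translation result, which is processed asynchronously.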
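The split between pre-translated titles and on-demand bodies can be sketched as follows. The deck only says that bodies are translated when the detail page is shown. Keeping the result so a paper is translated once is an assumption added here, the sketch is synchronous for brevity, and `translate` stands in for whichever translation API is used.

```python
class OnDemandBodyTranslator:
    """Translate a paper's body the first time its detail page is requested."""

    def __init__(self, translate):
        self._translate = translate  # callable: English text -> Japanese text
        self._cache = {}             # paper id -> translated body (assumed cache)

    def body_ja(self, paper_id, body_en):
        # Pay for translation only for papers that are actually displayed.
        if paper_id not in self._cache:
            self._cache[paper_id] = self._translate(body_en)
        return self._cache[paper_id]


# A fake translator that records calls instead of contacting a real API.
calls = []

def fake_translate(text):
    calls.append(text)
    return f"[ja] {text}"

translator = OnDemandBodyTranslator(fake_translate)
first = translator.body_ja(1, "Hypertension is common.")
second = translator.body_ja(1, "Hypertension is common.")
print(first, len(calls))
```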
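Cost
- Translation fee (fixed)
  We decided to translate only the titles in advance.
  17 million papers * title only (100 characters) = 3.76 million yen (Google Translate API)
- Translation fee (variable)
  We decided to translate bodies at display time = α × 10,000 yen
  (We cannot share the figure because page views are not public, but only a fraction of pages are ever searched for and displayed, so it is considerably cheaper.)
- Dictionary
  Rozetta Corp. (English-Japanese dictionary specialized in medicine) = b × 10,000 yen
  (This figure is not public either, but it is provided to us at a very low price.)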
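The fixed title-translation fee follows from the same price list as the full-translation estimate. The sketch below recomputes it. The yen/dollar rate is not stated in the deck, so it is inferred here from the deck's earlier totals (48.89 million yen for 442,000 dollars), and the variable names are made up for the example.

```python
DOCUMENTS = 17_000_000
TITLE_CHARS = 100
USD_PER_MILLION_CHARS = 20

# Assumption: rate inferred from the deck's own totals for full translation.
YEN_PER_USD = 48_890_000 / 442_000

title_chars_total = DOCUMENTS * TITLE_CHARS
title_cost_usd = title_chars_total // 1_000_000 * USD_PER_MILLION_CHARS
title_cost_yen = title_cost_usd * YEN_PER_USD

print(f"title characters: {title_chars_total:,}")
print(f"title cost in USD: {title_cost_usd:,}")
print(f"title cost in yen: {title_cost_yen:,.0f}")
```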
Final cost
48.89 million yen → (376 + α + b) × 10,000 yen
Acceptable (by our company's standards)
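Supplement
For papers beyond the first 17 million (PubMed content grows daily), we want translation that is more specialized in medicine, so in addition to Google Translate we have contracted with the partners below and use their APIs.
- Share Medical Inc.
  Translation API specialized in medicine
  https://www.ikotoba.jp/
- Rozetta Corp.
  Translation API specialized in PubMed
  https://www.rozetta.jp/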
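Summary
- We used the Elasticsearch Synonym Token Filter to search documents written in a different language
- It cuts translation costs, but it has both advantages and disadvantages
- It is easiest to apply when searches are mainly single words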
Thank you for your attention.