Elasticsearchで多言語検索対応してみた話.pdf (Trying Out Multilingual Search with Elasticsearch)
motsat
July 19, 2018
Programming
Transcript
Trying Out Multilingual Search with Elasticsearch
Self-introduction
- Name: 佐藤 元紀
- At MedPeer since 2017/9, as a web software engineer
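What I'll cover
- Cross-language search, such as searching English documents in Japanese
- What to do when translating the documents is not an option
- One way to do this with Elasticsearch, and its advantages and disadvantages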
"MedPeer", a website exclusively for physicians
→ Developing a new service
"Search PubMed papers efficiently in Japanese"
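What are PubMed papers?
A database of international medical literature.
- In English (very rarely in another language)
- Retrievable through an API or as files over FTP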
What we want to achieve
- Search English documents in Japanese
- Search targets: title and body
Translating everything into Japanese beforehand should do it.
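Translation APIs
- Google Translation API: 20 dollars per 1 million characters
  → Few unnatural passages when translating PubMed text
- Microsoft Translator API: 10 dollars (1,120 yen) per 1 million characters
  → Cheaper, but a fair number of unnatural passages when translating PubMed text
- Amazon Translate
  Japanese not supported (as of around autumn 2017)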
We prioritized translation quality. The Google Translation API produced fewer unnatural passages on PubMed text, so we chose it.
Translation cost
- PubMed papers: 17 million (the set MedPeer imports; growing daily)
- Average length: 1,300 characters (title 100, body 1,200)
  ↓
- Total: 22.1 billion characters = (100 + 1,200) × 17 million
Google Translation API: 20 dollars per 1 million characters
- Cost: (22.1 billion / 1 million characters) × 20 dollars = 442,000 dollars
  In Japanese yen: 48.89 million yen (as of 2017/07/07)
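The estimate above can be re-derived in a few lines. This is a minimal sketch using only the figures quoted on the slides; the yen/dollar rate is not stated in the deck, so the value below is an assumption backed out from the two quoted totals.

```python
# Figures quoted on the slides.
DOCS = 17_000_000            # PubMed papers to import
TITLE_CHARS = 100            # average title length
BODY_CHARS = 1_200           # average body length
USD_PER_MILLION_CHARS = 20   # quoted Google Translation API price


def translation_cost_usd(docs: int, chars_per_doc: int) -> float:
    """Cost in USD of translating `docs` documents of `chars_per_doc` characters each."""
    total_chars = docs * chars_per_doc
    return total_chars / 1_000_000 * USD_PER_MILLION_CHARS


total_chars = DOCS * (TITLE_CHARS + BODY_CHARS)
cost_usd = translation_cost_usd(DOCS, TITLE_CHARS + BODY_CHARS)

# Assumption: the exchange rate is not on the slide. It is derived here
# from the quoted totals (48.89 million yen / 442,000 dollars).
implied_jpy_per_usd = 48_890_000 / 442_000

print(f"{total_chars:,} characters")
print(f"{cost_usd:,.0f} dollars")
print(f"about {implied_jpy_per_usd:.1f} yen per dollar")
```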
48.89 million yen is expensive (by our company's standards)
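48.89 million yen is expensive
- Translating everything costs too much
- Even so, we don't want to reduce the number of papers
→ So how do we "search English documents in Japanese"?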
1. Make documents searchable without translating the originals
→ Use a dictionary for Japanese-language search:
  Elasticsearch's "Synonym Token Filter"
Elasticsearch Synonym Token Filter
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
Synonym Token Filter
A feature for configuring synonyms and near-synonyms.
Example: given the definition
  i-pod, i pod => ipod
- a search for "i-pod" hits "i pod" and "ipod"
- a search for "i pod" hits "i-pod" and "ipod"
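To make the rule format concrete, here is a toy model of an explicit mapping such as `i-pod, i pod => ipod`. It is only an illustration, not how Elasticsearch implements the filter: the real filter works on token streams inside an analyzer, while this sketch maps whole terms. The names `parse_rule` and `apply_synonyms` are hypothetical.

```python
from __future__ import annotations


def parse_rule(rule: str) -> dict[str, list[str]]:
    """Parse one explicit mapping of the form 'a, b => c' into a lookup table."""
    lhs, rhs = rule.split("=>")
    targets = [t.strip() for t in rhs.split(",")]
    return {source.strip(): targets for source in lhs.split(",")}


def apply_synonyms(term: str, table: dict[str, list[str]]) -> list[str]:
    """Return the replacement terms for `term`, or the term itself if no rule matches."""
    return table.get(term, [term])


table = parse_rule("i-pod, i pod => ipod")

# Both spellings are rewritten to the same term, which is why either
# query finds documents indexed under any of the three spellings.
print(apply_synonyms("i-pod", table))
print(apply_synonyms("i pod", table))
print(apply_synonyms("ipod", table))
```

Because the same rewrite is applied to the indexed text and to the query, all three spellings end up as the single term `ipod` on both sides.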
高血圧 => hypertension
インフルエンザ => influenza
We use this to link Japanese and English terms.
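Rules of this shape can be generated from a bilingual word list. The sketch below assumes the dictionary is available as (Japanese, English) pairs; the deck does not describe the dictionary's actual format, so the input shape and the function name `build_synonym_rules` are assumptions.

```python
from __future__ import annotations

from typing import Iterable


def build_synonym_rules(pairs: Iterable[tuple[str, str]]) -> list[str]:
    """Turn (japanese, english) pairs into 'japanese => english' rule strings.

    Several English translations of one Japanese term are merged into a
    single rule. Terms containing ',' or '=>' would break the rule syntax
    and are not handled by this sketch.
    """
    grouped: dict[str, list[str]] = {}
    for ja, en in pairs:
        ja, en = ja.strip(), en.strip()
        if not ja or not en:
            continue  # skip incomplete entries
        targets = grouped.setdefault(ja, [])
        if en not in targets:
            targets.append(en)
    return [f"{ja} => {', '.join(ens)}" for ja, ens in grouped.items()]


rules = build_synonym_rules([
    ("高血圧", "hypertension"),
    ("インフルエンザ", "influenza"),
    ("高血圧", "hypertension"),  # duplicate entries collapse into one rule
])
print(rules)
```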
Elasticsearch settings: index properties (mapping)
- Prepare Japanese fields (title_ja/body_ja) and English fields (title_en/body_en) for the title and the body respectively
- Use separate analyzers for the English fields and the Japanese fields

  pubmed: {
    properties: {
      title_en: { type: "text", analyzer: "english_analyzer" },
      title_ja: { type: "text", analyzer: "ja_analyzer" },
      body_en: { type: "text", analyzer: "english_analyzer" },
      body_ja: { type: "text", analyzer: "ja_analyzer" },
    },
  },
Elasticsearch settings: index analysis settings
- Add a filter definition with type: "synonym"
- Configure the filter list of "english_analyzer" (used for the English fields) to include the synonym filter; the rest is based on the official English analyzer settings

  {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": ["高血圧 => hypertension", …]
          }
        },
        "analyzer": {
          "english_analyzer": {
            "tokenizer": "standard",
            "filter": ["synonym", "english_possessive_stemmer", "lowercase", …]
          },
          …
        }
      }
    }
  }
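The analysis settings can also be assembled programmatically, so that the synonym list comes from the dictionary instead of being written by hand. The sketch below is an assumption-laden illustration: the slide elides part of the filter chain, and the `english_stop` and `english_stemmer` definitions here follow the English analyzer rebuild in the Elasticsearch documentation, not the deck. The function name `build_analysis_settings` is hypothetical, and the `ja_analyzer` definition is left out because the slides do not show it.

```python
import json


def build_analysis_settings(synonym_rules):
    """Build index analysis settings with a synonym filter in english_analyzer."""
    return {
        "index": {
            "analysis": {
                "filter": {
                    # From the slide: dictionary entries become synonym rules.
                    "synonym": {"type": "synonym", "synonyms": list(synonym_rules)},
                    # Assumption: standard definitions from the documented
                    # English analyzer rebuild.
                    "english_possessive_stemmer": {
                        "type": "stemmer",
                        "language": "possessive_english",
                    },
                    "english_stop": {"type": "stop", "stopwords": "_english_"},
                    "english_stemmer": {"type": "stemmer", "language": "english"},
                },
                "analyzer": {
                    "english_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # The first three entries keep the slide's order;
                        # the last two are assumed.
                        "filter": [
                            "synonym",
                            "english_possessive_stemmer",
                            "lowercase",
                            "english_stop",
                            "english_stemmer",
                        ],
                    }
                },
            }
        }
    }


settings = build_analysis_settings(["高血圧 => hypertension"])
print(json.dumps(settings, ensure_ascii=False, indent=2))
```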
Analyze result (Kibana)
"高血圧" becomes the token "hypertension"
(In practice, the token is further processed by the other filters in the sample above.)
The keyword "高血圧" can now search both Japanese and English text
- "高血圧" in Japanese text
- "hypertension" in English text
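A query that takes advantage of this could look like the sketch below. The deck does not show the actual query, so this is an assumption: it uses the standard `multi_match` query over the four fields from the mapping slide, and the function name `build_search_body` is hypothetical. With a per-field analyzer, the same keyword is analyzed by `english_analyzer` for the `*_en` fields (where the synonym filter applies) and by `ja_analyzer` for the `*_ja` fields.

```python
import json

# Field names taken from the mapping slide.
SEARCH_FIELDS = ["title_ja", "body_ja", "title_en", "body_en"]


def build_search_body(keyword: str) -> dict:
    """Build a search request body matching `keyword` against all four fields."""
    return {
        "query": {
            "multi_match": {
                "query": keyword,
                "fields": SEARCH_FIELDS,
            }
        }
    }


body = build_search_body("高血圧")
print(json.dumps(body, ensure_ascii=False, indent=2))
```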
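Where to get a dictionary
A well-known dictionary:
- JMdict (Japanese-Multilingual Dictionary)
  http://www.edrdg.org/jmdict/j_jmdict.html
  It appears to include languages other than English, so it may also be possible to configure several languages.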
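Will any dictionary do?
- Technical terms are a problem
  For technical (medical) terms, a general dictionary often fails to cover the vocabulary; a specialized one is needed.
  → Look for a dictionary for the specialist field, update the dictionary from search logs, and so on
MedPeer has a contract with 株式会社ロゼッタ (Rozetta), which provided us with a dictionary specialized in medicine.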
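Multilingual search with a dictionary
Advantages
- Search works without translating everything
Disadvantages
- A dictionary has to be prepared
- Search accuracy depends heavily on the quality of the dictionary
  → There is a real chance the dictionary will not cover every word in the text
- Strong at word-based search, weak at searching by sentence
- Search works, but the results are still displayed in English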
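2. Translate part in advance, the rest at display time
It depends on the user interface, but:
- Titles: translate in advance
  Waiting on the translation API is painful here (titles are only skimmed in a list)
  Titles have far fewer characters than bodies, so translating all of them keeps the cost low
- Bodies: translate when the detail page is displayed
  The page shows the translation result, which is processed asynchronously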
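Translating bodies at display time might be structured as in the sketch below. This is not the implementation described in the talk. The in-memory cache is an added assumption, so that each body is sent to the paid API at most once. `LazyBodyTranslator` and `fake_translate` are hypothetical names, and the real translation API call is replaced by a stand-in coroutine.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Any coroutine function that takes English text and returns Japanese text.
Translator = Callable[[str], Awaitable[str]]


class LazyBodyTranslator:
    """Translate a paper's body on first display and reuse the result afterwards."""

    def __init__(self, translate: Translator) -> None:
        self._translate = translate
        self._cache: Dict[str, str] = {}  # doc_id -> translated body

    async def body_ja(self, doc_id: str, body_en: str) -> str:
        # Call the translation API only when no cached result exists.
        if doc_id not in self._cache:
            self._cache[doc_id] = await self._translate(body_en)
        return self._cache[doc_id]


async def fake_translate(text: str) -> str:
    """Stand-in for a real translation API call (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[ja] {text}"


translator = LazyBodyTranslator(fake_translate)
print(asyncio.run(translator.body_ja("pmid-1", "Hypertension is common.")))
```

In a real service the cache would live in a database or key-value store, and the page would render first and fill in the translation once the coroutine completes.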
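Cost
- Translation (fixed)
  We decided to translate only the titles in advance
  17 million papers × title only (100 characters) = 3.76 million yen (Google Translate API)
- Translation (variable)
  We decided to translate bodies at display time = α × 10,000 yen (we cannot give a figure because our page views are not public, but only a fraction of the pages are ever searched for and displayed, so this is quite cheap)
- Dictionary
  株式会社ロゼッタ (Rozetta), a Japanese/English dictionary specialized in medicine = b × 10,000 yen (this figure is not public, but it is provided to us at a very low price)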
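The fixed cost for titles can be checked against the full-translation estimate. The sketch below uses only figures from the slides. The exchange rate is again an assumption derived from the quoted totals (48.89 million yen / 442,000 dollars), since the deck does not state it.

```python
DOCS = 17_000_000
TITLE_CHARS = 100
FULL_CHARS = 1_300            # title + body
USD_PER_MILLION_CHARS = 20    # quoted Google Translation API price

# Assumption: rate backed out from the deck's own totals, not stated on a slide.
IMPLIED_JPY_PER_USD = 48_890_000 / 442_000


def cost_jpy(docs: int, chars_per_doc: int) -> float:
    """Translation cost in yen for `docs` documents of `chars_per_doc` characters."""
    usd = docs * chars_per_doc / 1_000_000 * USD_PER_MILLION_CHARS
    return usd * IMPLIED_JPY_PER_USD


title_only_jpy = cost_jpy(DOCS, TITLE_CHARS)
full_jpy = cost_jpy(DOCS, FULL_CHARS)

print(round(title_only_jpy / 10_000))             # 376 (units of 10,000 yen)
print(f"{title_only_jpy / full_jpy:.1%} of the full-translation cost")
```

The variable cost α and the dictionary cost b are not public, so they stay symbolic here as they do on the slide.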
Final cost
48.89 million yen → (376 + α + b) × 10,000 yen
Acceptable (by our company's standards)
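Supplement
For papers beyond the first 17 million (PubMed content grows daily), we want translation that is more specialized in medicine. So in addition to Google Translate we have contracts with the partners below and use their APIs.
- 株式会社シェアメディカル (Share Medical)
  Translation API specialized in medicine
  https://www.ikotoba.jp/
- 株式会社ロゼッタ (Rozetta)
  Translation API specialized in PubMed
  https://www.rozetta.jp/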
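Summary
- To search documents written in a different language, we used Elasticsearch's Synonym Token Filter
- It reduces translation cost, but it has both advantages and disadvantages
- It is easiest to apply when search is mainly by single words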
Thank you for listening.