How Search Works with an Inverted Index
kotaroooo0
November 19, 2020
Transcript
2020/11/19 @kotaroooo0 How Search Works with an Inverted Index
Self-introduction
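Assumed audience
"The inverted index? Of course I know it. Scanning sequentially like grep takes forever, so to make search fast you build something like a book's table of contents ahead of time, right? But exactly how the search itself is carried out, well…"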
Getting to Know the Inverted Index
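Inverted index
- Inverted index = dictionary + inverted list
[Figure: a dictionary holding the terms netflix, prime, and amazon, each pointing to a posting list of document IDs (1 to 5); the posting lists together make up the inverted list]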
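The dictionary-plus-posting-lists structure can be sketched in a few lines. This is a minimal illustration, not the deck's implementation: the function name `build_inverted_index` and the five documents are assumptions, with the documents chosen only so that `prime` maps to [2, 5] and `amazon` to [1, 2, 5], the lists used later in the deck.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term (the dictionary) to a sorted list of doc IDs (its posting list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents; only the resulting posting lists matter here.
docs = {
    1: "netflix amazon",
    2: "amazon prime",
    3: "netflix",
    4: "netflix",
    5: "amazon prime",
}
index = build_inverted_index(docs)
print(index["prime"], index["amazon"])
```

Keeping each posting list sorted by document ID is what lets the traversal strategies discussed later walk the lists with simple cursors.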
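Inverted index
- The index may also store where each term occurs in a document (which word it is); this form is called a word-level inverted list
- Format: DocID;offset1,offset2…
[Figure: posting entries 1;2 1;3 2;3 3;5 4;1 5;2 5;3 2;5 spread across the netflix, prime, and amazon lists]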
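What is term position information used for?
- Finding phrases
- Example: searching for "Amazon Prime"
  - D1: "a prime concern of Amazon"
  - D2: "Amazon Prime movies"
- With position information, the order of the terms can be taken into account, so the engine can match only D2, or give D2 a higher score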
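The D1/D2 example can be reproduced with a small positional index. This is a sketch under assumptions: the helper names are hypothetical, tokenization is a plain whitespace split, and positions are 0-based (the slides do not say which base their offsets use).

```python
from collections import defaultdict


def build_positional_index(docs):
    """term -> {doc_id: [positions of the term in that document]}"""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return the docs in which the phrase's terms occur consecutively, in order."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # The phrase matches if the i-th term sits at position start + i.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


docs = {"D1": "a prime concern of Amazon", "D2": "Amazon Prime movies"}
pos_index = build_positional_index(docs)
print(phrase_search(pos_index, "Amazon Prime"))  # only D2 has the terms adjacent and in order
```

Both documents contain both terms, so a plain AND search would return D1 and D2; the position check is what excludes D1.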
Searching
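AND search and OR search
- Query: "pink orange blue"
- AND search: 3
- OR search: 1,2,3,4,5,6
[Figure: documents 1 to 6 distributed over the pink, orange, and blue sets; only document 3 is in all three]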
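Naive search strategies
- Ways of traversing the posting lists
- TAAT (Term At A Time)
  - Processes the posting lists one at a time. Only one posting-list cursor is open at any moment.
  - Traverses term by term
- DAAT (Document At A Time)
  - Processes the posting lists of all terms simultaneously. Opens a cursor on the posting list of every term in the query and advances them together.
  - Traverses document by document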
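AND search with TAAT
1. Select the term with the smallest posting list (prime) and build the accumulator from it: [2, 5]
2. Scan the amazon list
   - It is enough to check only whether 2 is present and whether 5 is present
[Figure: posting lists for netflix, prime, and amazon]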
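The two steps above can be sketched as follows. The function name `taat_and` and the `netflix` posting list are assumptions made for illustration; the `prime` and `amazon` lists are the ones given on the slide.

```python
def taat_and(index, terms):
    """AND search, one posting list at a time, starting from the shortest list."""
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    # Step 1: the shortest list seeds the accumulator.
    accumulator = list(lists[0])
    # Step 2: each remaining list can only remove candidates.
    for postings in lists[1:]:
        members = set(postings)
        accumulator = [doc_id for doc_id in accumulator if doc_id in members]
    return accumulator


index = {
    "netflix": [1, 3, 4],  # hypothetical; not stated on the slide
    "prime": [2, 5],
    "amazon": [1, 2, 5],
}
print(taat_and(index, ["amazon", "prime"]))
```

Starting from the shortest list keeps the accumulator as small as possible, so every later list is probed only for the few candidates that remain.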
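OR search with TAAT
1. Any term will do as a starting point; build the accumulator from it: [1,2,5] (amazon)
2. Visit every other term and merge its documents in without duplicates
Primitive!
[Figure: posting lists for netflix, prime, and amazon]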
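A minimal sketch of the merge described above, assuming the same toy posting lists as before (the `netflix` list is hypothetical) and a hypothetical function name `taat_or`:

```python
def taat_or(index, terms):
    """OR search, one posting list at a time: merge every list into the accumulator."""
    accumulator = set()
    for term in terms:
        # A set drops duplicates as each list is merged in.
        accumulator.update(index.get(term, []))
    return sorted(accumulator)


index = {
    "netflix": [1, 3, 4],  # hypothetical; not stated on the slide
    "prime": [2, 5],
    "amazon": [1, 2, 5],
}
print(taat_or(index, ["amazon", "prime", "netflix"]))
```

Unlike the AND case, the accumulator here can only grow, so every posting of every query term is visited.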
AND search with DAAT
- amazon AND prime
- The accumulator grows as the cursors advance: [], then [2], then [2, 5]
[Figure: posting lists for netflix, prime, and amazon, with one cursor per queried list]
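The cursor movement shown across these slides can be sketched as below. The function name `daat_and` is an assumption; the posting lists for `prime` and `amazon` come from the earlier TAAT slides.

```python
def daat_and(index, terms):
    """AND search with one cursor per posting list, advanced together."""
    lists = [index.get(t, []) for t in terms]
    if not lists or any(not postings for postings in lists):
        return []
    cursors = [0] * len(lists)
    accumulator = []
    while all(c < len(p) for c, p in zip(cursors, lists)):
        current = [p[c] for c, p in zip(cursors, lists)]
        target = max(current)
        if all(doc_id == target for doc_id in current):
            # Every cursor points at the same document: it is a hit.
            accumulator.append(target)
            cursors = [c + 1 for c in cursors]
        else:
            # Cursors that are behind the largest doc ID move forward.
            cursors = [c + 1 if p[c] < target else c for c, p in zip(cursors, lists)]
    return accumulator


index = {"prime": [2, 5], "amazon": [1, 2, 5]}
print(daat_and(index, ["amazon", "prime"]))
```

With these lists the accumulator passes through [], [2], and [2, 5], matching the three states on the slides.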
OR search with DAAT
- Move the cursors and add every element to the accumulator without duplicates
Primitive!
[Figure: posting lists for netflix, prime, and amazon]
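A sketch of the same idea in code, assuming sorted posting lists; `daat_or` is a hypothetical name and the `netflix` list is made up for illustration.

```python
def daat_or(index, terms):
    """OR search with one cursor per posting list, visiting doc IDs in ascending order."""
    lists = [index.get(t, []) for t in terms]
    cursors = [0] * len(lists)
    accumulator = []
    while True:
        current = [p[c] for c, p in zip(cursors, lists) if c < len(p)]
        if not current:
            break  # every cursor has reached the end of its list
        doc_id = min(current)
        accumulator.append(doc_id)
        # Advance every cursor that points at this document, so it is added only once.
        cursors = [
            c + 1 if c < len(p) and p[c] == doc_id else c
            for c, p in zip(cursors, lists)
        ]
    return accumulator


index = {
    "netflix": [1, 3, 4],  # hypothetical; not stated on the slide
    "prime": [2, 5],
    "amazon": [1, 2, 5],
}
print(daat_or(index, ["netflix", "prime", "amazon"]))
```

Because each document is fully handled before the cursors move on, its complete score could be computed at that moment, which is what the later top-k optimizations build on.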
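DAAT and TAAT
- Advantages of DAAT
  - DAAT needs less memory (e.g., a top-10 search)
  - DAAT makes it easy to tell whether the query terms satisfy a particular condition within a document (e.g., phrase search, filtering)
- Lucene, which Elasticsearch uses, takes the DAAT approach
- OR search is a primitive mechanism and traverses more documents than AND search, so it is expensive
- Search engines are optimized for OR search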
Preparing to Optimize OR Search
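How to speed it up
- Improve on DAAT as the base
- When only the top few results are needed, processing can be sped up by skipping the evaluation of documents that have no chance of reaching the top
- Retrieving only the top k results for a query is called top-k query processing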
Ranking documents
- To output the top k results, we need to decide which documents rank highest for the query
- TF-IDF, Okapi BM25
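As one concrete ranking function, Okapi BM25 can be sketched as below. The deck only names the function, so everything here is an assumption: the parameter defaults `k1=1.2` and `b=0.75` are common choices, the IDF form is the non-negative variant ln(1 + (N - n + 0.5) / (n + 0.5)), and the three-document corpus is made up.

```python
import math


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(term)  # term frequency in this document
        length_norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / (tf + length_norm)
    return score


# Hypothetical corpus of tokenized documents.
corpus = [
    "amazon prime movies".split(),
    "a prime concern of amazon".split(),
    "netflix movies".split(),
]
ranked = sorted(
    range(len(corpus)),
    key=lambda i: bm25_score(["prime"], corpus[i], corpus),
    reverse=True,
)
print(ranked)
```

Both of the first two documents contain "prime" once, but the shorter one ranks higher because BM25 normalizes by document length.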
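Extending the inverted index
- Attach to the inverted index the maximum score of each term and the score of the term in each document
- Format: DocID;Score
[Figure: posting lists with DocID;Score entries; max_score=5 for netflix, max_score=3 for prime, max_score=2 for amazon]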
Top-k query
- max-score
- interval-based pruning
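max-score
- Suppose a score of 5 is needed to make the top k. The max_score values of "amazon" and "prime" are each below 5, so documents that contain only "amazon" or only "prime" drop out of the candidates
[Figure: scored posting lists; max_score=5 for netflix, max_score=3 for prime, max_score=2 for amazon]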
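max-score
- We want the single top result
- Document 1: score 6
- Any document that could beat it must contain "netflix" (prime and amazon together can contribute at most 3 + 2 = 5)
[Figure: scored posting lists; max_score=5 for netflix, max_score=3 for prime, max_score=2 for amazon]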
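max-score
- Skip the cursors ahead to document 5, which contains "netflix"
- Document 5 has a score of 4
- The search finishes after evaluating only documents 1 and 5
[Figure: scored posting lists; max_score=5 for netflix, max_score=3 for prime, max_score=2 for amazon]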
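The walkthrough above can be reproduced with a simplified max-score sketch. It is not Lucene's implementation, and the per-document scores below are assumptions: the figure's values did not survive extraction, so they were chosen only to agree with what the slides state (max_score of 5, 3, and 2; document 1 scoring 6; document 5 scoring 4). The function name `max_score_top_k` is also hypothetical.

```python
import heapq


def max_score_top_k(index, terms, k):
    """Top-k OR search; returns (top hits, doc IDs that were actually scored)."""
    # Sort lists by ascending upper bound so low-impact terms come first.
    lists = sorted(
        (index[t] for t in terms if index.get(t)),
        key=lambda postings: max(score for _, score in postings),
    )
    bounds = [max(score for _, score in postings) for postings in lists]
    cursors = [0] * len(lists)
    heap = []  # min-heap of (score, doc_id), at most k entries
    evaluated = []
    first_essential = 0  # lists before this index cannot produce a new hit alone
    while first_essential < len(lists):
        live = [
            lists[i][cursors[i]][0]
            for i in range(first_essential, len(lists))
            if cursors[i] < len(lists[i])
        ]
        if not live:
            break
        doc_id = min(live)  # candidates come only from essential lists
        score = 0
        for i, postings in enumerate(lists):
            # Skip every cursor forward to the candidate document.
            while cursors[i] < len(postings) and postings[cursors[i]][0] < doc_id:
                cursors[i] += 1
            if cursors[i] < len(postings) and postings[cursors[i]][0] == doc_id:
                score += postings[cursors[i]][1]
                cursors[i] += 1
        evaluated.append(doc_id)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
        if len(heap) == k:
            # Lists whose summed upper bounds cannot beat the threshold become non-essential.
            threshold, prefix, first_essential = heap[0][0], 0, 0
            while first_essential < len(lists) and prefix + bounds[first_essential] <= threshold:
                prefix += bounds[first_essential]
                first_essential += 1
    return sorted(heap, reverse=True), evaluated


# Hypothetical (doc_id, score) postings consistent with the slide's stated values.
index = {
    "netflix": [(1, 5), (5, 1)],
    "prime": [(2, 3), (5, 2)],
    "amazon": [(1, 1), (2, 1), (3, 2), (4, 1), (5, 1)],
}
top, evaluated = max_score_top_k(index, ["netflix", "prime", "amazon"], k=1)
print(top, evaluated)
```

After document 1 is scored, the threshold is 6 and the combined bound of amazon and prime is 5, so only the netflix list can introduce new candidates; documents 2, 3, and 4 are never scored.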
Interval-based
- Can skip even more than max_score
- I ran out of steam around here, and the timing is probably about right anyway…
In Lucene
- Lucene 8 uses WAND, an evolution of max-score
  - Ding, Shuai, and Torsten Suel. "Faster top-k document retrieval using block-max indexes." Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011.
- Lucene 8, released on 2019/3/14, implemented the algorithm above, and research into efficient algorithms is still ongoing
Implementing the Inverted Index
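Summary
- The inverted index is extended by attaching various kinds of metadata (term offsets, scores, pointers)
- Naive ways to run AND and OR searches against an inverted index
  - TAAT: traverse term by term
  - DAAT: traverse document by document
- Optimizing OR search on top of DAAT
  - max-score: prune by keeping the maximum score of each term and the current maximum score, and skip parts of the traversal
- Lucene adopts DAAT, and its algorithms are still being improved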