I Started Building My Own Full-Text Search Engine (Written in Go)
kotaroooo0
November 11, 2020
Transcript
2020/11/11 @kotaroooo0
I started building my own full-text search engine (in Go)
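After listening to this LT, you will...
1. Have a rough understanding of how a full-text search engine works
2. Be able to start building a full-text search engine in Go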
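Why build a full-text search engine
- To learn how full-text search engines work
- To build something with a spec similar to Apache Lucene, which Elasticsearch uses
- I am learning Go and want to build something with it
- I am building a Twitter bot, but there is no full-text search engine that fits it well
  - It should be lightweight and able to compute weighted Levenshtein distance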
Full-text search engines in 3 minutes
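How full-text search works: INDEXING
Each document passes through the Analyzer (Char Filter -> Tokenizer -> Token Filter), and the resulting terms are stored in an inverted index.
Document 1: "I have a pen."
Document 2: "We have desk."
Term    Documents
have    1, 2
pen     1
we      2
desk    2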
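How full-text search works: SEARCH
The search word passes through the same Analyzer (Char Filter -> Tokenizer -> Token Filter) and is then looked up in the inverted index.
Search word: "pen"
Term    Documents
have    1, 2
pen     1
we      2
desk    2
Result: document 1 is a hit.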
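Why is an ANALYZER needed?
- To split text into tokens
  - "I have a pen." cannot be put into an inverted index as it is, so we want to split it into the tokens I, have, a, pen
- To absorb spelling variations in queries
  - We want to normalize to lowercase so that a document containing the word "GOD" is hit by the query "god"
- To filter out useless tokens
  - Words such as "the" are not worth indexing
Before analysis: "I have a BIG pen!"
After analysis: have, big, pen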
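Concrete ANALYZER examples
The Analyzer runs Char Filter -> Tokenizer -> Token Filter.
- Char Filter (runs before the Tokenizer): zero or more
  - Mapping: for example, converts emoticons into words
  - HTMLstrip: parses HTML
- Tokenizer (splits text into tokens): exactly one
  - Standard: splits according to rules such as whitespace
  - Kuromoji: splits by morphological analysis
  - Ngram: splits every N characters
- Token Filter (runs after the Tokenizer): zero or more
  - Lowercase: converts tokens to lowercase
  - Stopword: removes stop words
  - Stemming: normalizes variations in word form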
Implementation
How the ANALYZER runs
Processing flow:
Before analysis:    I have a lot of TASKs. I am very sad :(
MappingCharFilter:  I have a lot of TASKs. I am very sad _sad_
StandardTokenizer:  I, have, a, lot, of, TASKs, I, am, very, sad, sad
LowercaseFilter:    i, have, a, lot, of, tasks, i, am, very, sad, sad
StopWordFilter:     lot, tasks, am, very, sad, sad
StemmerFilter:      lot, task, am, very, sad, sad
ANALYZER implementation
CHAR FILTER implementation
TOKENIZER implementation
TOKEN FILTER implementation
INDEX
- The inverted index is a map[string][]int
- One inverted index is kept per field name
- A document has an ID and fields
- Text passes through the Analyzer when it is indexed
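SEARCH
- AND search and OR search
- Query: "pink orange blue"
- AND search: 3
- OR search: 1, 2, 3, 4, 5, 6
(Venn diagram: the pink, orange, and blue sets cover documents 1 to 6, and only document 3 lies in all three.)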
Trying it out
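Search examples
- Note: the Analyzer is the same as before
- Both indexing and searching go through the Analyzer
  - "foxes" matches documents containing "fox"
  - "happy" matches documents containing ":)"
- AND search
  - Documents 1 and 2, which contain all of fine, FaX, foxes, and happy, are hits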
Next steps
- Implement fuzzy search
  - Fuzzy Query, Suggesters
- Implement score calculation
  - TF-IDF, BM25
References
- https://github.com/kotaroooo0/stalefish
- https://artem.krylysov.com/blog/2020/07/28/lets-build-a-full-text-search-engine/