Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
俺の全文検索エンジン(Go製)を作り始めた
Search
kotaroooo0
November 11, 2020
Programming
130
0
Share
俺の全文検索エンジン(Go製)を作り始めた
kotaroooo0
November 11, 2020
More Decks by kotaroooo0
See All by kotaroooo0
データ鮮度を落とさずに安全にReindexしたい
kotaroooo0
0
110
検索エンジン自作入門 Go Conference 2021 Spring
kotaroooo0
17
7.6k
転置インデックスでどう検索しているか
kotaroooo0
0
370
ぼくのかんがえたさいきょうのDocker Build
kotaroooo0
0
100
Other Decks in Programming
See All in Programming
Agent Skills を社内で育てる仕組み作り
jackchuka
1
2.1k
開発とはなにか、Essenceカーネルで見えるもの
ukin0k0
0
190
20260514_its_the_context_window_stupid.pdf
heita
0
1k
AgentCore Optimizationを始めよう!
licux
3
250
ふにゃっとしない名前の付け方 〜哲学で茹で上げる、コシのあるソフトウェア設計〜
shimomura
0
120
The Past, Present, and Future of Enterprise Java
ivargrimstad
0
390
Skillは並べた。動かなかった。契約で繋いだ。— 65個のSkillから、自走する開発サイクルへ
junholee
0
620
My daily life on Ruby
a_matsuda
3
410
柔軟なPDFレイアウトエディタを支える型システム設計 — Discriminated UnionとConditional Typeの実践
minako__ph
1
120
AWSはOSSをどのように 考えているのか?
akihisaikeda
0
120
Augmenting AI with the Power of Jakarta EE
ivargrimstad
0
500
運転動画を検索可能にする〜Cosmos-Embed1とDatabricks Vector Searchで〜/cosmos-embed1-databricks-vector-search
studio_graph
3
960
Featured
See All Featured
Bioeconomy Workshop: Dr. Julius Ecuru, Opportunities for a Bioeconomy in West Africa
akademiya2063
PRO
1
110
Stop Working from a Prison Cell
hatefulcrawdad
274
21k
Jamie Indigo - Trashchat’s Guide to Black Boxes: Technical SEO Tactics for LLMs
techseoconnect
PRO
0
140
Building Applications with DynamoDB
mza
96
7k
Heart Work Chapter 1 - Part 1
lfama
PRO
7
35k
Lightning talk: Run Django tests with GitHub Actions
sabderemane
0
180
Being A Developer After 40
akosma
91
590k
Groundhog Day: Seeking Process in Gaming for Health
codingconduct
0
180
Lessons Learnt from Crawling 1000+ Websites
charlesmeaden
PRO
1
1.2k
Building an army of robots
kneath
306
46k
What Being in a Rock Band Can Teach Us About Real World SEO
427marketing
0
230
Optimizing for Happiness
mojombo
378
71k
Transcript
2020/11/11 @kotaroooo0 ԶͷશจݕࡧΤϯδϯ(Go) Λ࡞Γ࢝Ίͨ
͜ͷLTΛฉ͘ͱ… 1. ͳΜͱͳ͘શจݕࡧΤϯδϯͷΈ͕͔ Δ 2. GoͰશจݕࡧΤϯδϯΛ࡞Γ࢝ΊΒΕΔ
-શจݕࡧΤϯδϯͷΈΛΔ -ElasticsearchͰΘΕ͍ͯΔApache LuceneΆ͍༷ͷͷΛ࡞Δ -GoΛֶΜͰ͍ΔͷͰԿ͔࡞Γ͍ͨ -Twitter BotΛ࡞͍ͬͯΔ͕ɺͪΐ͏Ͳ͍͍શจݕࡧΤϯδϯ͕ͳ͍ -ܰྔɺ͔ͭॏΈ͖ϨʔϕϯγϡλΠϯڑΛܭࢉͰ͖Δͭ શจݕࡧΤϯδϯΛ࡞Δཧ༝
3Ͱ͔ΔશจݕࡧΤϯδϯ
શจݕࡧͷΈ INDEXING ୯ޠ จॻ have 1,2 pen 1 we 2
Desk 2 จॻ1 “I have a pen.” จॻ2 “We have desk.” CHAR FILTER TOKENIZER TOKEN FILTER Analyzer
શจݕࡧͷΈ SEARCH ୯ޠ จॻ have 1,2 pen 1 we 2
Desk 2 ݕࡧϫʔυ: “pen” จॻ1͕ώοτ CHAR FILTER TOKENIZER TOKEN FILTER Analyzer
ANALYZERͳͥඞཁ? - τʔΫϯׂͯ͘͠ΕΔͨΊ - “I have a pen.” ͜ͷ··ͰసஔΠϯσοΫε Λ࡞Ͱ͖ͳ͍ͷͰɺI,
have, a, penͱτʔΫ ϯׂ͍ͨ͠ - ΫΤϦͷදه༳ΕΛٵऩͨ͠Γ͢ΔͨΊ - “GOD”ͱ͍͏୯ޠΛؚΉυΩϡϝϯτ ɺ”god”Ͱώοτ͢ΔΑ͏ʹখจࣈʹ౷Ұ ͍ͤͨ͞ - ແବͳτʔΫϯͷϑΟϧλϦϯά - theͳͲΠϯσΩγϯάͯ͠ແବ Analyzeલ Analyzeޙ “I have a BIG pen!” have, big, pen
۩ମతͳANALYZERྫ - Char Filter(Tokenizerͷલʹॲཧ͢Δ) 0ݸҎ্ - Mapping: إจࣈΛ୯ޠʹมͳͲ - HTMLstrip:
HTMLΛύʔε - Tokenizer(τʔΫϯׂ͢Δ) 1ݸ - Standard: εϖʔεͳͲϧʔϧʹैׂͬͯ - Kuromoji: ܗଶૉղੳͰׂ - Ngram: Nจࣈ͝ͱʹׂ - Token Filter(Tokenizerͷޙʹॲཧ͢Δ) 0ݸҎ্ - Lowercase: খจࣈ - Stopword: ετοϓϫʔυআڈ - Stemming: දه༳Ε CHAR FILTER TOKENIZER TOKEN FILTER Analyzer
࣮
ANALYZERͷ࣮ߦΠϝʔδ Analyzeલ MappingCharFilter StandardTokenizer LowercaseFilter StopWordFilter StemmerFilter I have a
lot of TASKs. I am very sad :( I have a lot of TASKs. I am very sad _sad_ I, have, a, lot, of, TASKs, I, am, very, sad, sad I, have, a, lot, of, tasks, I, am, very, sad, sad lot, tasks, am, very, sad, sad lot, task, am, very, sad, sad ॲཧͷྲྀΕ
ANALYZERͷ࣮
CHAR FILTERͷ࣮
TOKENIZERͷ࣮
TOKEN FILTERͷ࣮
INDEX - సஔΠϯσοΫε map[string][]int - సஔΠϯσοΫεΛϑΟʔϧυ ໊͝ͱʹ࣋ͭ - υΩϡϝϯτɺIDͱϑΟʔϧ υΛ࣋ͭ
- Indexing͢Δͱ͖AnalyzerΛ ௨͢
SEARCH -ANDݕࡧͱORݕࡧ -ΫΤϦ: “pink orange blue” -ANDݕࡧ: 3 -OR: ݕࡧ1,2,3,4,5,6
pink Orange blue 6 3 4 5 2 1
ಈ͔ͯ͠ΈΔ
ݕࡧྫ - ※Analyzer͖ͬ͞ͱಉ༷ - IndexSearchAnalyzerΛ௨͢ - ”foxes”Ͱ”fox”ΛؚΉυΩϡϝϯτ͕Ϛον - ”happy”Ͱ”:)”ΛؚΉυΩϡϝϯτ͕Ϛον -
ANDݕࡧ - fine,FaX,foxes,happy͕શؚͯ·Ε͍ͯΔυΩϡϝ ϯτ1,2͕ώοτ͍ͯ͠Δ
ࠓޙ -͍͋·͍ݕࡧΛ࣮͢Δ -Fuzzy QueryɺSuggesters -είΞܭࢉΛ࣮͢Δ -IFIDF, BM25
ࢀߟ -https://github.com/kotaroooo0/stalefish -https://artem.krylysov.com/blog/2020/07/28/lets-build-a-full-text-search-engine/