Hiroyuki Deguchi
February 26, 2025
20250226 NLP colloquium: "SoftMatcha: A Soft yet Fast Pattern Matcher for Billion-Word-Scale Corpus Search"
Transcript
[Slide text not recovered; the slide cites (Radovanovic+, JMLR2010).]
Wang+, arXiv 2022, “Text Embeddings by Weakly-Supervised Contrastive Pre-training”.
Radovanovic+, JMLR 2010, “Hubs in space: Popular nearest neighbors in high-dimensional data”.
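◼ Notation
  ⚫ Pattern: $\mathbf{p} = (p_1, \ldots, p_M) \in \Sigma^*$
  ⚫ Text: $\mathbf{t} = (t_1, \ldots, t_N) \in \Sigma^*$
    ▶ $\Sigma^*$: the set of all finite token sequences over $\Sigma$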
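◼ Word embeddings
  ⚫ Word $w \in \mathcal{V}$, embedding dimension $D$
  ⚫ $\mathbf{v}_w \in \mathbb{R}^D$ := the embedding of $w$
    ▶ $\cos(\mathbf{v}_{\text{person}}, \mathbf{v}_{\text{people}}) > \cos(\mathbf{v}_{\text{person}}, \mathbf{v}_{\text{bird}})$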
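The inequality above can be checked on toy data. The following is a minimal sketch, assuming made-up 3-dimensional vectors; the `EMBEDDINGS` table and the `cosine` helper are hypothetical and are not taken from the talk or from any trained model.

```python
import math

# Hypothetical 3-dimensional embeddings (D = 3), invented for illustration.
EMBEDDINGS = {
    "person": [0.9, 0.1, 0.2],
    "people": [0.8, 0.2, 0.3],
    "bird": [0.1, 0.9, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_people = cosine(EMBEDDINGS["person"], EMBEDDINGS["people"])
sim_bird = cosine(EMBEDDINGS["person"], EMBEDDINGS["bird"])

# With these toy vectors, "people" is closer to "person" than "bird" is.
print(sim_people > sim_bird)
```

With real embeddings the vectors would come from a trained model; only the comparison itself is illustrated here.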
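◼ Soft matching of tokens
  ⚫ $t_i = p_j$ is relaxed to $\cos(\mathbf{v}_{t_i}, \mathbf{v}_{p_j}) \ge \alpha$
    ▶ $\alpha = 1.0$: exact matching
  ※ $\alpha = 0.7$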
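The thresholded comparison can be written as a small predicate. This is a sketch under stated assumptions: the vectors in `TOY_VECTORS` are invented, and `cos_sim` and `soft_match` are hypothetical helper names, not functions from the SoftMatcha code base.

```python
import math

# Hypothetical toy embeddings; the values are invented for illustration only.
TOY_VECTORS = {
    "person": [0.9, 0.1, 0.2],
    "people": [0.8, 0.2, 0.3],
    "bird": [0.1, 0.9, 0.3],
}


def cos_sim(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def soft_match(t_i, p_j, alpha, vectors=TOY_VECTORS):
    """Return True if text token t_i softly matches pattern token p_j."""
    if t_i == p_j:
        # Identical tokens always match; checking this first also avoids
        # floating-point rounding when alpha is exactly 1.0.
        return True
    return cos_sim(vectors[t_i], vectors[p_j]) >= alpha


print(soft_match("people", "person", alpha=0.7))
print(soft_match("bird", "person", alpha=0.7))
print(soft_match("people", "person", alpha=1.0))
```

With these toy vectors, "people" matches "person" at a threshold of 0.7 while "bird" does not, and raising the threshold to 1.0 leaves only identical tokens matching.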
[Figure: text tokens $t_1, \ldots, t_{13}$]
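◼ $\mathcal{S}_w := \{\, v \in \mathcal{V} \mid \cos(\mathbf{v}_v, \mathbf{v}_w) \ge \alpha \,\}$
  ⚫ The soft set of a word $w$, e.g. $\mathcal{S}_{\text{we}}$, $\mathcal{S}_{\text{talk}}$, $\mathcal{S}_{\text{about}}$
◼ The text softly matches "we talk about"
  ⇔ the text contains the sequence $\mathcal{S}_{\text{we}}, \mathcal{S}_{\text{talk}}, \mathcal{S}_{\text{about}}$
  ⇔ members of $\mathcal{S}_{\text{we}}, \mathcal{S}_{\text{talk}}, \mathcal{S}_{\text{about}}$ occur at positions $i, i + 1, i + 2$ for some $i$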
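A soft set can be computed by thresholding the similarity of every vocabulary word against $w$. The sketch below assumes a tiny invented vocabulary; `TOY_EMBEDDINGS`, `cosine_similarity`, and `soft_set` are hypothetical names, and a linear scan over the vocabulary is only one simple way to do this.

```python
import math

# Hypothetical toy vocabulary with invented 3-dimensional embeddings.
TOY_EMBEDDINGS = {
    "we": [1.0, 0.1, 0.0],
    "us": [0.9, 0.2, 0.1],
    "talk": [0.1, 1.0, 0.0],
    "speak": [0.2, 0.9, 0.1],
    "about": [0.0, 0.1, 1.0],
    "bird": [0.6, 0.6, 0.6],
}


def cosine_similarity(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def soft_set(w, alpha, embeddings=TOY_EMBEDDINGS):
    """Words whose embedding has cosine similarity >= alpha with w."""
    target = embeddings[w]
    return {
        v
        for v, vec in embeddings.items()
        # w itself is always included, regardless of rounding in the cosine.
        if v == w or cosine_similarity(vec, target) >= alpha
    }


for word in ("we", "talk", "about"):
    print(word, sorted(soft_set(word, alpha=0.7)))
```

With these toy vectors and a threshold of 0.7, "us" joins the soft set of "we" and "speak" joins that of "talk", while "about" stays alone.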
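◼ Worked example for $\mathcal{S}_{\text{we}}\ \mathcal{S}_{\text{talk}}\ \mathcal{S}_{\text{about}}$
  ⚫ $\mathcal{S}_{\text{we}}$: $\mathcal{M} \leftarrow \{1, 10, 6\}$
  ⚫ $\mathcal{S}_{\text{talk}}$:
    ▶ $\mathcal{M}' \leftarrow \{2 - 1, 11 - 1, 7 - 1\} = \{1, 10, 6\}$
    ▶ $\mathcal{M} \leftarrow \mathcal{M} \cap \mathcal{M}' = \{1, 10, 6\}$
  ⚫ $\mathcal{S}_{\text{about}}$:
    ▶ $\mathcal{M}' \leftarrow \{8 - 2, 12 - 2\} = \{6, 10\}$
    ▶ $\mathcal{M} \leftarrow \mathcal{M} \cap \mathcal{M}' = \{6, 10\}$
  ⚫ Result: $\mathcal{M} = \{6, 10\}$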
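The walk-through above can be reproduced in a few lines. This is a minimal sketch: the position sets are the ones from the example ({1, 10, 6} for we, {2, 11, 7} for talk, {8, 12} for about), while the function name `shift_and_intersect` is a hypothetical label for the procedure.

```python
def shift_and_intersect(position_sets):
    """Intersect position sets after shifting the k-th set back by k - 1.

    position_sets[0] holds the positions for the first pattern word,
    position_sets[1] those for the second word, and so on.
    """
    matches = set(position_sets[0])
    for offset, positions in enumerate(position_sets[1:], start=1):
        # Align every later word with the start position of the pattern.
        shifted = {i - offset for i in positions}
        matches &= shifted
    return matches


positions_we = {1, 10, 6}
positions_talk = {2, 11, 7}
positions_about = {8, 12}

result = shift_and_intersect([positions_we, positions_talk, positions_about])
print(sorted(result))  # [6, 10]
```

Position 1 drops out at the last step because no member of the "about" set occurs at position 3.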
[Figure: occurrences of $\mathcal{S}_{\text{we}}$, $\mathcal{S}_{\text{talk}}$, $\mathcal{S}_{\text{about}}$ over the text tokens $t_1, \ldots, t_{13}$]
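◼ Algorithm for a pattern $\mathbf{p} = (p_1, \ldots, p_M)$
  ⚫ For $\mathcal{S}_{p_1}, \ldots, \mathcal{S}_{p_M}$, look up the position sets $I_{\mathcal{S}_{p_1}}, \ldots, I_{\mathcal{S}_{p_M}}$
  ⚫ $\mathcal{M} \leftarrow I_{\mathcal{S}_{p_1}}$
  ⚫ For $k = 2, \ldots, M$:
    ▶ $\mathcal{M}' \leftarrow \{\, i - k + 1 \mid i \in I_{\mathcal{S}_{p_k}} \,\}$
    ▶ $\mathcal{M} \leftarrow \mathcal{M} \cap \mathcal{M}'$
  ⚫ $\mathcal{M}$: the start positions at which $\mathbf{p}$ matches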
◼
  ⚫ Multilingual E5 (Wang+, 2024)
  ⚫ Faiss (Douze+, 2024), HNSW (Malkov & Yashunin, IEEE TPAMI, 2018)
◼
  ⚫
    ▶ $\alpha = 0.55$: GloVe (Pennington+, EMNLP2014)
    ▶ $\alpha = 0.50$: (Grave+, arXiv:1802.06893)
Wang+, arXiv:2402.05672, “Multilingual E5 Text Embeddings: A Technical Report”.
Douze+, arXiv:2401.08281, “The Faiss library”.
Malkov & Yashunin, IEEE TPAMI, 2018, “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs”.
Pennington+, EMNLP2014, “GloVe: Global Vectors for Word Representation”.
Grave+, arXiv:1802.06893, “Learning Word Vectors for 157 Languages”.
◼
  ⚫ Perseus Digital Library (Crane, IJDL 2023)
  ⚫ Rhetorical parallelism detection (Bothwell+, EMNLP2023)
Crane, IJDL 2023, “The Perseus Digital Library and the future of libraries”.
Bothwell+, EMNLP2023, “Introducing Rhetorical Parallelism Detection: A New Task with Datasets, Metrics, and Baselines”.
[Slide text not recovered; the surviving symbols are $I_{\mathcal{S}_{p_k}}$, $O(1)$, and $O(\log |B|)$.]