Hiroyuki Deguchi
February 26, 2025
20250226 NLP colloquium: "SoftMatcha: A Soft yet Fast Pattern Matcher for Billion-Word-Scale Corpus Search"
Transcript
References:
Wang+, arXiv 2022, "Text Embeddings by Weakly-Supervised Contrastive Pre-training".
Radovanovic+, JMLR 2010, "Hubs in space: Popular nearest neighbors in high-dimensional data".
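Pattern: $\mathbf{p} = (p_1, \ldots, p_M) \in \Sigma^*$
Text: $\mathbf{t} = (t_1, \ldots, t_N) \in \Sigma^*$
Word embedding: $\mathbf{v}_w \in \mathbb{R}^D$, the $D$-dimensional vector of a word $w \in \mathcal{V}$
▶ $\cos(\mathbf{v}_{\text{person}}, \mathbf{v}_{\text{people}}) > \cos(\mathbf{v}_{\text{person}}, \mathbf{v}_{\text{bird}})$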
Exact match: $t_i = p_j$
Soft match: $\cos(\mathbf{v}_{t_i}, \mathbf{v}_{p_j}) \ge \alpha$
▶ $\alpha = 1.0$
※ $\alpha = 0.7$
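The threshold condition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-dimensional vectors in `toy_vectors` and the helper names `cosine` and `soft_match` are hypothetical and do not come from the slides.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_match(v_t, v_p, alpha):
    """True when a text-token vector and a pattern-token vector satisfy cos >= alpha."""
    return cosine(v_t, v_p) >= alpha


# Hypothetical toy vectors, chosen so that "people" is closer to "person" than "bird" is.
toy_vectors = {
    "person": [1.0, 0.0],
    "people": [0.8, 0.6],
    "bird": [0.0, 1.0],
}

people_matches_person = soft_match(toy_vectors["people"], toy_vectors["person"], alpha=0.7)
bird_matches_person = soft_match(toy_vectors["bird"], toy_vectors["person"], alpha=0.7)
```

With these toy vectors, raising `alpha` to 1.0 leaves only the identical vector as a match, which is how the threshold interpolates between soft and exact matching.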
[Figure: example text $t_1, \ldots, t_{13}$]
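Soft set of a word $w$: $\mathcal{S}_w := \{ v \in \mathcal{V} \mid \cos(\mathbf{v}_v, \mathbf{v}_w) \ge \alpha \}$
Example: $\mathcal{S}_{\text{we}}$, $\mathcal{S}_{\text{talk}}$, $\mathcal{S}_{\text{about}}$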
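A minimal sketch of how such a set could be computed: if every embedding is normalized to unit length once, the cosine reduces to the dot product $\mathbf{v}_v^\top \mathbf{v}_w$, so the set is a single thresholded scan over the vocabulary. The vocabulary, the vectors in `toy_embeddings`, and the function names are assumptions made for illustration only.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def soft_set(word, unit_vectors, alpha):
    """All vocabulary words whose dot product with `word` is at least alpha.

    Assumes every vector in `unit_vectors` has unit length, so the dot
    product equals the cosine similarity.
    """
    v_w = unit_vectors[word]
    return {
        other
        for other, vec in unit_vectors.items()
        if sum(a * b for a, b in zip(vec, v_w)) >= alpha
    }


# Hypothetical toy embeddings (not real word vectors).
toy_embeddings = {
    "we": [1.0, 0.1, 0.0],
    "us": [0.9, 0.3, 0.0],
    "talk": [0.0, 1.0, 0.2],
    "speak": [0.1, 0.9, 0.3],
    "about": [0.0, 0.1, 1.0],
}
unit_vectors = {w: normalize(v) for w, v in toy_embeddings.items()}
soft_sets = {w: soft_set(w, unit_vectors, alpha=0.7) for w in ("we", "talk", "about")}
```

A linear scan is fine for a toy vocabulary; for a large vocabulary the same thresholded lookup is usually delegated to a vector search library.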
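Pattern ⇔ $(\mathcal{S}_{\text{we}}, \mathcal{S}_{\text{talk}}, \mathcal{S}_{\text{about}})$
⇔ members of $\mathcal{S}_{\text{we}}$, $\mathcal{S}_{\text{talk}}$, $\mathcal{S}_{\text{about}}$ occur at positions $i$, $i + 1$, $i + 2$ for some $i$

$\mathcal{S}_{\text{we}}$: $\mathcal{M} \leftarrow \{1, 10, 6\}$
$\mathcal{S}_{\text{talk}}$: $\mathcal{M}' \leftarrow \{2 - 1, 11 - 1, 7 - 1\} = \{1, 10, 6\}$, then $\mathcal{M} \leftarrow \mathcal{M} \cap \mathcal{M}' = \{1, 10, 6\}$
$\mathcal{S}_{\text{about}}$: $\mathcal{M}' \leftarrow \{8 - 2, 12 - 2\} = \{6, 10\}$, then $\mathcal{M} \leftarrow \mathcal{M} \cap \mathcal{M}' = \{6, 10\}$
Result: $\mathcal{M} = \{6, 10\}$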
[Figure: $\mathcal{S}_{\text{we}}$, $\mathcal{S}_{\text{talk}}$, $\mathcal{S}_{\text{about}}$ over the text $t_1, \ldots, t_{13}$]
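Input: pattern $\mathbf{p} = (p_1, \ldots, p_M)$; soft sets $\mathcal{S}_{p_1}, \ldots, \mathcal{S}_{p_M}$ with their position sets $I_{\mathcal{S}_{p_1}}, \ldots, I_{\mathcal{S}_{p_M}}$
$\mathcal{M} \leftarrow I_{\mathcal{S}_{p_1}}$
for $k = 2, \ldots, M$:
    $\mathcal{M}' \leftarrow \{ i - k + 1 \mid i \in I_{\mathcal{S}_{p_k}} \}$
    $\mathcal{M} \leftarrow \mathcal{M} \cap \mathcal{M}'$
Output: $\mathcal{M}$, the match positions of $\mathbf{p}$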
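The shift-and-intersect loop can be sketched as follows. This is an illustrative reading of the procedure, not the author's implementation: the 13-token `example_text`, the hand-written soft sets, and all function names are hypothetical, chosen so that the position sets are $\{1, 6, 10\}$, $\{2, 7, 11\}$ and $\{8, 12\}$.

```python
def build_position_index(text_tokens):
    """Map each word to the list of its 1-based positions in the text."""
    index = {}
    for position, token in enumerate(text_tokens, start=1):
        index.setdefault(token, []).append(position)
    return index


def soft_positions(soft_set, index):
    """Union of the positions of every word in the soft set."""
    positions = set()
    for word in soft_set:
        positions.update(index.get(word, ()))
    return positions


def match_starts(pattern_soft_sets, index):
    """Start positions where the soft sets occur consecutively (non-empty pattern assumed)."""
    candidates = soft_positions(pattern_soft_sets[0], index)
    for k, soft_set in enumerate(pattern_soft_sets[1:], start=2):
        # Shift positions of the k-th pattern token back to the pattern start.
        shifted = {i - k + 1 for i in soft_positions(soft_set, index)}
        candidates &= shifted
        if not candidates:
            break  # no start position can survive further intersections
    return sorted(candidates)


# Hypothetical 13-token text and hand-written soft sets.
example_text = "we talk to them and us speak about it we talk about that".split()
example_soft_sets = [{"we", "us"}, {"talk", "speak"}, {"about"}]

index = build_position_index(example_text)
starts = match_starts(example_soft_sets, index)
```

Here `starts` holds the 1-based positions at which a soft match of the three-token pattern begins. Each step only touches the position lists of the current pattern token, so the cost depends on how often the soft-set members occur rather than on the length of the text.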
(Wang+, 2024)
(Douze+, 2024) (Malkov & Yashunin, IEEE TPAMI, 2018)
▶ $\alpha = 0.55$ (Pennington+, EMNLP 2014)
▶ $\alpha = 0.50$ (Grave+, arXiv:1802.06893)

References:
Wang+, arXiv:2402.05672, "Multilingual E5 Text Embeddings: A Technical Report".
Douze+, arXiv:2401.08281, "The Faiss library".
Malkov & Yashunin, IEEE TPAMI, 2018, "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs".
Pennington+, EMNLP 2014, "GloVe: Global Vectors for Word Representation".
Grave+, arXiv:1802.06893, "Learning Word Vectors for 157 Languages".
(Crane, IJDL 2023)
(Bothwell+, EMNLP 2023)

References:
Crane, IJDL 2023, "The Perseus Digital Library and the future of libraries".
Bothwell+, EMNLP 2023, "Introducing Rhetorical Parallelism Detection: A New Task with Datasets, Metrics, and Baselines".
$I_{\mathcal{S}_{p_k}}$
$O(1)$
$O(\log |B|)$