Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Going The Distance
Search
Richard Schneeman
September 22, 2014
Programming
3
14k
Going The Distance
Levenshtein Distance and the beauty of algorithms
Richard Schneeman
September 22, 2014
Tweet
Share
More Decks by Richard Schneeman
See All by Richard Schneeman
[RubyConf] Beware the Dreaded Dead End
schneems
1
340
[Kaigi] Beware the Dead End
schneems
0
150
Threads Aren't Evil
schneems
0
550
Bayes is BAE
schneems
0
3.5k
Testing the Untestable
schneems
1
760
SLOMO
schneems
2
1k
Saving Sprockets
schneems
8
17k
Memory Leaks, Tweaks, and Techniques
schneems
1
190
Speed Science
schneems
20
36k
Other Decks in Programming
See All in Programming
AI コーディングエージェントの時代へ:JetBrains が描く開発の未来
masaruhr
2
220
SQLアンチパターン第2版 データベースプログラミングで陥りがちな失敗とその対策 / Intro to SQL Antipatterns 2nd
twada
PRO
34
10k
Strands Agents で実現する名刺解析アーキテクチャ
omiya0555
1
110
変化を楽しむエンジニアリング ~ いままでとこれから ~
murajun1978
0
520
Workers を定期実行する方法は一つじゃない
rokuosan
0
130
オンコール⼊⾨〜ページャーが鳴る前に、あなたが備えられること〜 / Before The Pager Rings
yktakaha4
2
1.2k
型で語るカタ
irof
1
850
No Install CMS戦略 〜 5年先を見据えたフロントエンド開発を考える / no_install_cms
rdlabo
0
370
ソフトウェア設計とAI技術の活用
masuda220
PRO
25
6.9k
Understanding Kotlin Multiplatform
l2hyunwoo
0
230
AWS Summit Japan 2024と2025の比較/はじめてのKiro、今あなたは岐路に立つ
satoshi256kbyte
1
250
効率的な開発手段として VRTを活用する
ishkawa
1
180
Featured
See All Featured
Product Roadmaps are Hard
iamctodd
PRO
54
11k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
29
1.8k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
138
34k
Fashionably flexible responsive web design (full day workshop)
malarkey
407
66k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
15
1.6k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
357
30k
Fireside Chat
paigeccino
37
3.5k
Why You Should Never Use an ORM
jnunemaker
PRO
58
9.5k
Java REST API Framework Comparison - PWX 2021
mraible
31
8.7k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
130
19k
VelocityConf: Rendering Performance Case Studies
addyosmani
332
24k
Faster Mobile Websites
deanohume
308
31k
Transcript
Going The Distance @schneems
They Call me @Schneems
Ruby Schneems
Ruby Python
None
Code Triage .com
Challenge: Comment on an issue
Docs Doctor .org
None
Top 50 Rails Contrib
Cats
None
Can you keep Ruby weird?
%CP%CP
'XGT[QPG%CP%CP
Thank You!
Algorithms:
I went to Georgia Tech for…
Mechanical Engineering
Self taught Programmer ~8 years
CS is boring to me
Programming is interesting
Building programs that accomplish tasks
But…
Those CS students are on to something
Algorithms
Are
Beautiful
<3
Algorithms solve problems
Introducting A Problem…
Spelling
When you are tired, or distracted spelling becomes harder
$ git commmit -m first WARNING: You called a Git
command named 'commmit', which does not exist. Continuing under the assumption that you meant 'commit' in 0.1 seconds automatically...
How does Git know?
Introducing: Edge Distance
The “cost” to change one word to another
Less “cost” means less more likely match
Cost of? zat => bat
Cost of? zat => bat 1
Cost of? zzat => bat
Cost of? bat 2 zzat =>
How do we code it though?
My First Attempt
def distance(str1, str2) cost = 0 str1.each_char.with_index do |char, index|
cost += 1 if str2[index] != char end cost end zat bat =>
def distance(str1, str2) cost = 0 str1.each_char.with_index do |char, index|
cost += 1 if str2[index] != char end cost end zat bat => Cost = 1
Perfect?
saturday sunday Cost = ?
saturday sunday Cost = 7
Wat?
Turns out I almost recreated
Hamming Distance
Hamming AKA: Signal Distance
Measures: the errors introduced in a string Hamming
Only valid for strings of same length Hamming
Good for: Detecting and correcting errors in binary and telecommunications
Hamming
Bad for: mis- spelled words Hamming
Does not include: Insertion Deletion Hamming
Introducing: An Algorithm!
Introducing: Levenshtein Distance
How do we calculate deletion?
distance("schneems", "zschneems")
distance("schneems", "zschneems") Match! deletion
str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] deletion
How do we calculate insertion?
distance("schneems", "chneems")
distance("schneems", "chneems") Match! insertion
str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 insertion
substitution?
distance("zchneems", "schneems") Substitution
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] Substitution
How do we calculate Distance?
Pretend we have a distance() method
Strings of different lengths
distance(“”, “foo”) # => 3 distance(“foo”, “”) # => 3
return str2.length if str1.empty? return str1.length if str2.empty? Different Lengths
Calculate distance between every substring
l1 = distance(str1, str2[1..-1]) # deletion l2 = distance(str1[1..-1], str2)
# insertion l3 = distance(str1[1..-1], str2[1..-1]) # substitution Calculate costs
l1 = distance(str1, str2[1..-1]) # deletion l2 = distance(str1[1..-1], str2)
# insertion l3 = distance(str1[1..-1], str2[1..-1]) # substitution cost = 1 + [l1,l2,l3].min Take minimum
l1 = distance(str1, str2[1..-1]) # deletion l2 = distance(str1[1..-1], str2)
# insertion l3 = distance(str1[1..-1], str2[1..-1]) # substitution cost = 1 + [l1,l2,l3].min Why did we add one?
Pick the cheapest operation, then add one
What about when characters match?
distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] Match str1 = “saturday”
str2 = “sunday”
Our powers combined, we form!
None
`
Recursive Levenshtein Distance!
def distance(str1, str2) # Different lengths return str2.length if str1.empty?
return str1.length if str2.empty? ! return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] # match l1 = distance(str1, str2[1..-1]) # deletion l2 = distance(str1[1..-1], str2) # insertion l3 = distance(str1[1..-1], str2[1..-1]) # substitution return 1 + [l1,l2,l3].min # increment cost end Recursive
distance(“saturday”, “sunday”) # => 3 Recursive Levenshtein
Much better
What does that look like?
None
github.com/ schneems/ going_the_distance
Hmm…
“Dirty Distance” took 8 comparisons
“Recursive” took 1647 comparisons
No trophy, no flowers, no flashbulbs, no wine,
Ouch
Can we do better?
If you watch the recursive algorithm closely, you notice repeats
Maybe we can store substring distance and use to calculate
total distance
I want you to join the club
A members only club
Matrix: Levenshtein Distance
Matrix: Levenshtein Distance
“” => “saturday” Cost?
+---+---+ | | S | +---+---+ | | 1 |
+---+---+ Matrix
+---+---+---+ | | S | A | +---+---+---+ | |
1 | 2 | +---+---+---+ Matrix
+---+---+---+---+ | | S | A | T | +---+---+---+---+
| | 1 | 2 | 3 | +---+---+---+---+ Matrix
+---+---+---+---+---+ | | S | A | T | U
| +---+---+---+---+---+ | | 1 | 2 | 3 | 4 | +---+---+---+---+---+ Matrix
+---+---+---+---+---+---+ | | S | A | T | U
| R | +---+---+---+---+---+---+ | | 1 | 2 | 3 | 4 | 5 | +---+---+---+---+---+---+ Matrix
+---+---+---+---+---+---+---+ | | S | A | T | U
| R | D | +---+---+---+---+---+---+---+ | | 1 | 2 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+ Matrix
+---+---+---+---+---+---+---+---+ | | S | A | T | U
| R | D | A | +---+---+---+---+---+---+---+---+ | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---+---+---+---+---+---+---+---+ Matrix
+---+---+---+---+---+---+---+---+---+ | | S | A | T | U
| R | D | A | Y | +---+---+---+---+---+---+---+---+---+ | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+ Matrix
“sunday” => “” Cost?
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ | D | 4 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ | D | 4 | +---+---+ | A | 5 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ | D | 4 | +---+---+ | A | 5 | +---+---+ | Y | 6 | +---+---+ Matrix
Now, fill in the matrix
+---+---+---+---+---+---+---+---+---+---+ | | | S | A | T |
U | R | D | A | Y | +---+---+---+---+---+---+---+---+---+---+ | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+---+ | S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ | D | 4 | +---+---+ | A | 5 | +---+---+ | Y | 6 | +---+---+ Matrix
Break it down
How much does it cost to change “s” into “s”?
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ Match! Cost = 0
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! Cost = ?
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ Insertion! Cost = 1
How do we calculate insertion programmatically?
str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 Insertion!
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1] Insertion!
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1]
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1]
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1]
+ cost of change (+1)
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1] + 1
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ Insertion! Cost = 1 str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1] + 1
Keep Going
+---+---+---+---+---+---+---+---+---+---+ | | | S | A | T |
U | R | D | A | Y | +---+---+---+---+---+---+---+---+---+---+ | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+---+ | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---+---+---+---+---+---+---+---+---+---+ Insertion(s)
Next Char “su” => “s”
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Action?
Change “su” to “s” is a deletion. How do we
calculate?
Deletion
str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] deletion
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index]
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index]
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index]
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index] + 1
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | 1 | +---+---+---+ Deletion Cost = 1 str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index] + 1
- Insertion - Deletion - Substitution
- Insertion - Deletion - Substitution
Substitution
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] Substitution!
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1]
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1]
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index
- 1][column_index- 1] +---+---+---+---+ | | | S | A | +---+---+---+---+ | | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index
- 1][column_index- 1] +---+---+---+---+ | | | S | A | +---+---+---+---+ | | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index
- 1][column_index- 1] +---+---+---+---+ | | | S | A | +---+---+---+---+ | | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | 1 | +---+---+---+---+ Substitution Cost = 1 str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1] + 1
- Insertion - Deletion - Substitution
What about match?
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ Match! str1 = “schneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1]
Why?
If the current character matches, cost is to change previous
character to previous sub string
i.e. changing “” to “” +---+---+---+ | | | S
| +---+---+---+ | | 0 | 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+
Now What?
Algorithm str2.each_char.each_with_index do |char1,i| str1.each_char.each_with_index do |char2, j| if char1
== char2 puts [:skip, matrix[i][j]].inspect matrix[i + 1 ][j + 1 ] = matrix[i][j] else actions = { deletion: matrix[i][j + 1] + 1, insert: matrix[i + 1][j] + 1, substitution: matrix[i][j] + 1 } action = actions.sort {|(k,v), (k2, v2)| v <=> v2 }.first puts action.inspect matrix[i + 1 ][j + 1 ] = action.last end each_step.call(matrix) if each_step end end
Iterate!
None
Final Cost +---+---+---+---+---+---+---+---+---+---+ | | | S | A |
T | U | R | D | A | Y | +---+---+---+---+---+---+---+---+---+---+ | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+---+ | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---+---+---+---+---+---+---+---+---+---+ | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+---+---+---+ | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+---+---+---+ | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 | +---+---+---+---+---+---+---+---+---+---+ | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 | +---+---+---+---+---+---+---+---+---+---+ | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 | +---+---+---+---+---+---+---+---+---+---+ 3
We can also get cost of sub strings
“sun” => “sat”
Final Cost +---+---+---+---+---+---+---+---+---+---+ | | | S | A |
T | U | R | D | A | Y | +---+---+---+---+---+---+---+---+---+---+ | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+---+ | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---+---+---+---+---+---+---+---+---+---+ | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+---+---+---+ | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+---+---+---+ | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 | +---+---+---+---+---+---+---+---+---+---+ | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 | +---+---+---+---+---+---+---+---+---+---+ | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 | +---+---+---+---+---+---+---+---+---+---+ 2
Better than Recursive?
As they speed through the finish, the flags go down.
48 iterations
Wayyyyyy better than 1647
bit.ly/ going_the_distance
My Problem
I am human
I get tired
Machines don’t understand tired
One day I tried typing
$ rails generate migration
But accidentally typed
$ rails generate migratoon
ERROR
None
Stress is increased when we fail at simple tasks
Why? It’s not hard
Why can’t my software be more like git?
We know what you’re trying to accomplish. Let’s help you
out
None
When you have ERROR
Compare given command to possible commands
Recommend smallest distance.
Google:
Google:
Read A lot of words from real books
~1+ million words
Count Each word
Higher count, higher probability
Get edit distance between input and dictionary
Lower edit, higher probability
Show Suggestion
None
Cache correct spelling suggestions
did_you_mean gem
None
More distance measurements
Levenshtein Distance
Hamming Distance
longest common subsequence Distance
Manhattan Distance
Tree Distance
Jaro-Winkler Distance
Many Many More
Algorithms are Awesome
Where to go next?
I want to learn more about algorithms?
Wikipedia!
Rosetta code
Give an algorithm talk
Everyone suggests a new Algorithm!
Algorithms are a way of sharing knowledge
Expand your knowledge Explore Algorithms
Antepenultimate Slide
YAY!
Questions @schneems
Going The Distance