Leland McInnes
July 16, 2021
PyNNDescent: Fast Approximate Nearest Neighbors with Numba
A PDF version of slides for my SciPy 2021 talk on PyNNDescent.
Transcript
Fast Approximate Nearest Neighbour Search with Numba
What are Nearest Neighbours?
Given a set of points with a distance measure between them…
… and a new “query point” …
Find the closest points to the query point
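As a point of reference, the exact version of this problem can be sketched as a brute-force scan. This is an illustration only, not code from the talk; the names `euclidean` and `brute_force_knn` and the sample points are assumptions made for the example.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def brute_force_knn(points, query, k, dist=euclidean):
    """Return the indices of the k points closest to query, nearest first."""
    order = sorted(range(len(points)), key=lambda i: dist(points[i], query))
    return order[:k]

# Hypothetical toy data: four points in the plane and one query point.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.2, 0.1)]
nearest = brute_force_knn(points, (0.1, 0.0), k=2)
```

Every query computes a distance to every point, which is the cost that tree and graph indexes try to avoid.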
Why Nearest Neighbors?
Nearest neighbour computations are at the heart of many machine learning algorithms
KNN-Classifiers, KNN-Regressors
Clustering: HDBSCAN, DBSCAN, Single Linkage Clustering, Spectral Clustering
(Image credits: https://commons.wikimedia.org/wiki/File:DBSCAN-Illustration.svg by Chire and https://www.flickr.com/photos/trevorpatt/41875889652/in/photostream/ by Trevor Patt)
Dimension Reduction: t-SNE, Isomap, Spectral Embedding, UMAP
(Image sources: http://lvdmaaten.github.io/tsne/ and http://www-clmc.usc.edu/publications/T/tenenbaum-Science2000.pdf)
Recommender Systems, Query Expansion
Why Approximate Nearest Neighbours?
Finding exact nearest neighbours is hard
Approximate nearest neighbour search trades accuracy for performance
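One common way to put a number on the accuracy side of that trade is recall: the fraction of the true nearest neighbours that the approximate search actually returned. A minimal sketch, assuming the exact neighbours are available for comparison (the function name and the sample index lists are made up for illustration):

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true neighbours that the approximate search returned."""
    exact = set(exact_ids)
    return len(exact.intersection(approx_ids)) / len(exact)

# Hypothetical result lists: the approximate search found 3 of the 4 true neighbours.
score = recall_at_k([3, 7, 9, 12], [3, 7, 8, 9])
```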
How Do You Find Nearest Neighbors?
Using Trees
Hierarchically divide up the space into a tree
Bound the search using the tree structure (and the triangle inequality)
KD-Tree
Ball Tree
Random Projection Tree
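The bounding step can be sketched for a ball-shaped region: by the triangle inequality, no point inside a ball can be closer to the query than the distance to the ball's centre minus its radius. The helper below is a hypothetical illustration of that test, not the code of any particular tree library.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def can_skip_ball(query, centre, radius, best_so_far):
    """True if no point in the ball can beat the best distance found so far.

    For every point p in the ball, d(query, p) >= d(query, centre) - radius,
    so the whole ball can be skipped when that lower bound is not an improvement.
    """
    return euclidean(query, centre) - radius >= best_so_far

# Lower bound 10 - 2 = 8 is worse than the current best of 3: skip the ball.
skip_far = can_skip_ball((0.0, 0.0), (10.0, 0.0), 2.0, best_so_far=3.0)
# Lower bound 4 - 2 = 2 could still beat 3: the ball must be searched.
skip_near = can_skip_ball((0.0, 0.0), (4.0, 0.0), 2.0, best_so_far=3.0)
```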
Using Graphs
How do you search for nearest neighbours of a query using a graph?
(Malkov and Yashunin, 2018; Dong, Moses and Li, 2011; Iwasaki and Miyazaki, 2018)
Start with a nearest neighbour graph of the training data
Assume we now want to find neighbours of a query point
Choose a starting node in the graph (potentially randomly) as a candidate node
Look at all nodes connected by an edge to the best untried candidate node in the graph
Add all these nodes to our potential candidate pool
Sort the candidate pool by closeness to the query point
Truncate the pool to the k best candidates
Return to the Expansion step unless we have already tried all the candidates in the pool
Stop when there are no untried candidates in the pool
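The search steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not PyNNDescent's implementation: the graph is a list of adjacency lists, the pool is re-sorted on every expansion rather than kept in a heap, and the names `graph_search`, `seen` and `tried` are invented for the example.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def graph_search(graph, points, query, k, start=0, dist=euclidean):
    """Greedy search over a neighbour graph; graph[i] lists the nodes adjacent to i."""
    pool = [(dist(points[start], query), start)]  # (distance, node), kept sorted
    seen = {start}   # nodes whose distance to the query has been computed
    tried = set()    # nodes that have already been expanded
    while True:
        untried = [node for _, node in pool if node not in tried]
        if not untried:
            break                      # no untried candidates left: stop
        best = untried[0]              # pool is sorted, so this is the closest
        tried.add(best)
        for nb in graph[best]:         # expansion: add the neighbours of `best`
            if nb not in seen:
                seen.add(nb)
                pool.append((dist(points[nb], query), nb))
        pool.sort()                    # sort by closeness to the query
        del pool[k:]                   # truncate to the k best candidates
    return [node for _, node in pool]

# Hypothetical toy data: ten points on a line, each linked to its two neighbours.
points = [(float(i),) for i in range(10)]
graph = [[j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)]
result = graph_search(graph, points, (6.2,), k=3)
```

On this chain graph the search walks from node 0 towards the query and ends with the three nodes nearest to 6.2.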
Looks inefficient; scales up well
Graph adapts to intrinsic dimension of the data
But how do we build the graph?!
The algorithm works (badly) even on a bad graph
Run one iteration of search for every node
Update the graph with new, better neighbours
Search is better on the improved graph
Perfect accuracy of neighbours is not assured
We can get an approximate knn-graph quickly
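The construction loop can be sketched as follows. This is a heavily simplified illustration of the idea (start from a bad graph, look at neighbours of neighbours, keep the best k, repeat), not the algorithm as implemented in PyNNDescent; the function name, the iteration cap and the random test data are assumptions.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def refine_knn_graph(points, k, n_iters=10, seed=0, dist=euclidean):
    """Build an approximate k-neighbour graph by repeated local improvement."""
    rng = random.Random(seed)
    n = len(points)
    # Start from a bad graph: k random neighbours for every node.
    graph = [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    for _ in range(n_iters):
        changed = False
        for i in range(n):
            # Candidates: current neighbours plus the neighbours of those neighbours.
            candidates = set(graph[i])
            for nb in graph[i]:
                candidates.update(graph[nb])
            candidates.discard(i)
            best = sorted(candidates, key=lambda j: (dist(points[i], points[j]), j))[:k]
            if best != graph[i]:
                graph[i] = best        # update with new, better neighbours
                changed = True
        if not changed:
            break                      # no node improved: the graph has settled
    return graph

# Hypothetical data: 50 random points in the unit square.
rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(50)]
graph = refine_knn_graph(points, k=5)
```

Because the current neighbours are always among the candidates, a node's neighbour list can only stay the same or get closer on each pass, but nothing guarantees it reaches the exact nearest neighbours.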
How Do You Make it Fast?
Algorithm tricks
[Figure labels: query node, expansion node, current neighbour; neighbour A, neighbour B, common node]
Hubs have a lot of neighbours!
Sample neighbours when constructing the graph
Prune away edges before performing searches
Necessary to find green’s nearest neighbour
Necessary to find blue’s nearest neighbour
Not required since we can traverse through blue
For search, remove the longest edges of any triangles in the graph
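A literal reading of that pruning rule can be sketched on a small undirected graph. This is an assumption-laden illustration rather than PyNNDescent's pruning code: edges are stored as a set of two-element frozensets, every triangle is found by brute force, and ties between equal-length edges are not handled specially.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def prune_long_edges(points, edges, dist=euclidean):
    """Return a copy of `edges` without the longest edge of each triangle."""
    adjacency = {}
    for edge in edges:
        a, b = tuple(edge)
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    to_remove = set()
    for a, nbrs in adjacency.items():
        for b, c in combinations(sorted(nbrs), 2):
            if c in adjacency[b]:      # a, b and c form a triangle
                triangle = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
                longest = max(
                    triangle,
                    key=lambda e: dist(*[points[i] for i in sorted(e)]),
                )
                to_remove.add(longest)
    return edges - to_remove

# Hypothetical toy graph: nodes 0, 1, 2 form a triangle; node 3 hangs off node 2.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
edges = {frozenset(e) for e in [(0, 1), (1, 2), (0, 2), (2, 3)]}
pruned = prune_long_edges(points, edges)
```

Here the edge between nodes 0 and 2 is dropped because a search can still get from 0 to 2 through node 1.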
Initialize with Random Projection Trees
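One way to picture this initialization: recursively split the data with hyperplanes placed between randomly chosen pairs of points, then treat the points that share a leaf as each other's initial neighbour candidates. The sketch below shows only the splitting step, with invented names and random test data; it is not PyNNDescent's tree code.

```python
import random

def rp_tree_leaves(points, indices, leaf_size, rng):
    """Recursively split `indices` with random hyperplanes; return the leaves."""
    if len(indices) <= leaf_size:
        return [indices]
    a, b = rng.sample(indices, 2)
    pa, pb = points[a], points[b]
    # Hyperplane through the midpoint of pa and pb, perpendicular to pa - pb.
    normal = [x - y for x, y in zip(pa, pb)]
    offset = sum(n * (x + y) / 2 for n, x, y in zip(normal, pa, pb))
    left, right = [], []
    for i in indices:
        side = sum(n * x for n, x in zip(normal, points[i])) - offset
        (left if side > 0 else right).append(i)
    if not left or not right:          # degenerate split (e.g. duplicate points)
        return [indices]
    return (rp_tree_leaves(points, left, leaf_size, rng)
            + rp_tree_leaves(points, right, leaf_size, rng))

# Hypothetical data: 40 random points in the unit square, leaves of at most 5.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(40)]
leaves = rp_tree_leaves(points, list(range(40)), leaf_size=5, rng=rng)
```

Points in the same leaf tend to be close together, so a graph seeded from the leaves starts out much better than a random one.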
Implementation tricks
Profile and inspect LLVM code for innermost functions
Type declarations and code choices can help the compiler a lot!
import numba
import numpy as np

@numba.jit
def euclidean(x, y):
    return np.sqrt(np.sum((x - y)**2))

Query benchmark took 12s

@numba.jit(fastmath=True)
def euclidean(x, y):
    result = 0.0
    for i in range(x.shape[0]):
        result += (x[i] - y[i])**2
    return np.sqrt(result)

Query benchmark took 8.5s

@numba.njit(
    numba.types.float32(
        numba.types.Array(numba.types.float32, 1, "C", readonly=True),
        numba.types.Array(numba.types.float32, 1, "C", readonly=True),
    ),
    fastmath=True,
    locals={
        "result": numba.types.float32,
        "diff": numba.types.float32,
        "i": numba.types.uint16,
    },
)
def squared_euclidean(x, y):
    result = 0.0
    dim = x.shape[0]
    for i in range(dim):
        diff = x[i] - y[i]
        result += diff * diff
    return result

Query benchmark took 7.6s
Custom data structure implementations to help numba for often-called code
@numba.njit(
    "i4(f4[::1],i4[::1],f4,i4)",
)
def simple_heap_push(priorities, indices, p, n):
    ...
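The slide elides the function body. As a guess at what such a function might do, here is a pure-Python sketch of a fixed-size max-heap stored in two parallel lists, which keeps the smallest priorities seen so far. The behaviour, the return convention (1 if inserted, 0 otherwise) and the use of lists instead of typed arrays are all assumptions, not the library's code.

```python
def simple_heap_push(priorities, indices, p, n):
    """Push (p, n) onto a fixed-size max-heap held in two parallel lists.

    The root holds the worst (largest) priority kept so far. Returns 1 if the
    new item was inserted, 0 if it was no better than the current worst.
    """
    if p >= priorities[0]:
        return 0
    size = len(priorities)
    i = 0
    while True:                        # sift the new item down from the root
        left, right = 2 * i + 1, 2 * i + 2
        if left >= size:
            break
        if right >= size or priorities[left] >= priorities[right]:
            child = left
        else:
            child = right
        if priorities[child] <= p:
            break
        priorities[i] = priorities[child]
        indices[i] = indices[child]
        i = child
    priorities[i] = p
    indices[i] = n
    return 1

# Hypothetical usage: keep the 3 smallest of five pushed priorities.
priorities = [float("inf")] * 3
indices = [-1] * 3
results = [simple_heap_push(priorities, indices, p, n)
           for p, n in [(5.0, 10), (2.0, 11), (7.0, 12), (9.0, 13), (1.0, 14)]]
```

Writing the heap by hand over flat arrays, rather than using a generic container, is the kind of choice that lets a compiler see fixed types in the hot loop.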
Numba has significant function call overhead with large parameters
Use closures over static data instead

@numba.njit()
def frequently_called_function(param, large_readonly_data):
    ...
    val = access(large_readonly_data, param)
    ...

def create_frequently_called_function(large_readonly_data):
    @numba.njit()
    def closure(param):
        ...
        val = access(large_readonly_data, param)
        ...
    return closure
How Does it Compare?
Performance
We can test query performance using ann-benchmarks: https://github.com/erikbern/ann-benchmarks
Consider the whole accuracy / performance trade-off space
Caveats:
• Newer algorithms and implementations
• Hardware can make a big difference
• No GPU support for pynndescent
Features
Out-of-the-box support for a wide variety of distance measures: Euclidean, Cosine, Hamming, Manhattan, Minkowski, Chebyshev, Jaccard, Haversine, Dice, Wasserstein, Hellinger, Spearman, Correlation, Mahalanobis, Canberra, Bray-Curtis, Angular, TSSS, +20 more measures
(Image: https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa by Maarten Grootendorst)
Custom metrics in Python (using numba)
Support for sparse data
Drop-in replacement for sklearn KNeighborsTransformer
Summary
pip install pynndescent
conda install pynndescent
https://github.com/lmcinnes/pynndescent
[email protected]
@leland_mcinnes
Questions?