Randomly map vectors in a metric space into sketches in the Hamming space
[Figure: hashing maps a high-dimensional vector (~10^3 to ~10^6 dimensions, e.g., 0.2 0.7 0.1 0.5 ...) in a metric space (e.g., cosine or Jaccard) to a low-dimensional sketch (32 or 64 dimensions, e.g., 0 1 1 ... 0) of discrete symbols in the Hamming space]
Many similarity search problems can be solved as Hamming distance problems

algorithms produce binary sketches (e.g., 001101001001)
▹ Modern hashing algorithms produce integer sketches (e.g., 236301499231)
– Such as b-bit minhash [Li+, WWW10], 0-bit CWS [Li, KDD15], and GCWS [Li, KDD17]
▹ But most search methods are designed for binary sketches
• Dynamics
▹ Modern real-world datasets are dynamic (i.e., updated over time)
– Such as Web pages and time series data
▹ But most search methods are limited to static datasets or are inefficient for dynamic datasets
[Figure: a sketch x is inserted into the dataset]
Our challenge: develop an efficient dynamic search method for both binary and integer sketches

m-dimensional vector of non-negative integers
• We have a dataset X = {x1, x2, …, xn}, which is a dynamic set of n sketches
• Given a sketch y and a Hamming radius r as a query, we want to quickly find the similar sketches {xi : H(xi, y) ≤ r}
▹ H(∙, ∙) is the Hamming distance (i.e., the number of dimensions in which the two sketches differ)
Example: query y = 111021 with r = 1
x1 = 111020: H(x1, y) = 1 ≤ r (similar)
x2 = 001020: H(x2, y) = 3
x3 = 032021: H(x3, y) = 3
x4 = 113021: H(x4, y) = 1 ≤ r (similar)
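The problem setting above can be made concrete with a brute-force baseline. The following is a minimal sketch, not the DyFT method: the names `Sketch`, `hamming`, and `linear_search` are hypothetical, and storing one symbol per byte is an assumption made for readability.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One integer symbol per dimension (assumes every symbol fits in a byte).
using Sketch = std::vector<std::uint8_t>;

// Hamming distance: the number of dimensions in which x and y differ.
std::size_t hamming(const Sketch& x, const Sketch& y) {
    assert(x.size() == y.size());  // both sketches must have m dimensions
    std::size_t errors = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (x[i] != y[i]) ++errors;
    }
    return errors;
}

// Brute-force baseline: scan all n sketches and report the ids within radius r.
std::vector<std::size_t> linear_search(const std::vector<Sketch>& X,
                                       const Sketch& y, std::size_t r) {
    std::vector<std::size_t> ids;
    for (std::size_t i = 0; i < X.size(); ++i) {
        if (hamming(X[i], y) <= r) ids.push_back(i);
    }
    return ids;
}
```

With the four sketches from the slide and y = 111021, r = 1, the scan reports ids 0 and 3 (x1 and x4). Its cost grows linearly with n, which is what an index is meant to avoid.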

but they are inefficient for dynamic datasets
• Recently, Eghbali et al. [IEEE TPAMI19] addressed this issue by using a search tree, but it is not applicable to integer sketches
• We propose new methods, DyFTs, for dynamic datasets of integer sketches, which leverage a trie data structure

by merging common prefixes of sketches
• The downward path from the root to a leaf represents the associated sketch
[Figure: a trie storing sketches x1 to x8, with one symbol per edge]
• Similarity search is performed by traversing nodes while counting the number of errors against the query sketch
• If the number of errors exceeds the radius, we stop and skip all descendants of that node
• The time complexity is O(m^(r+2)), which does not depend on the dataset size n
Example: searching for y = 111020 with r = 1 finds that x1 and x7 are similar
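The traversal described above can be sketched as follows. This is an illustrative plain trie under stated assumptions: `SketchTrie` and `TrieNode` are hypothetical names, children are kept in a `std::map` for simplicity rather than efficiency, and all sketches are assumed to have the same length m.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

using Sketch = std::vector<std::uint8_t>;

struct TrieNode {
    std::map<std::uint8_t, std::unique_ptr<TrieNode>> children;
    std::vector<std::size_t> ids;  // ids of the sketches that end at this leaf
};

class SketchTrie {
public:
    // Follow (and create) one node per symbol, then record the id at the leaf.
    void insert(const Sketch& x, std::size_t id) {
        TrieNode* node = &root_;
        for (std::uint8_t c : x) {
            auto& child = node->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->ids.push_back(id);
    }

    // Report the ids of all stored sketches within Hamming radius r of y.
    std::vector<std::size_t> search(const Sketch& y, std::size_t r) const {
        std::vector<std::size_t> out;
        visit(root_, y, 0, 0, r, out);
        return out;
    }

private:
    static void visit(const TrieNode& node, const Sketch& y, std::size_t depth,
                      std::size_t errors, std::size_t r,
                      std::vector<std::size_t>& out) {
        if (depth == y.size()) {  // reached a leaf with at most r errors
            out.insert(out.end(), node.ids.begin(), node.ids.end());
            return;
        }
        for (const auto& kv : node.children) {
            const std::size_t e = errors + (kv.first != y[depth] ? 1 : 0);
            if (e <= r) visit(*kv.second, y, depth + 1, e, r, out);  // else prune
        }
    }

    TrieNode root_;
};
```

Inserting x1 to x8 from the slides with ids 0 to 7 and searching for y = 111020 with r = 1 returns ids 0 and 6 (x1 and x7). Subtrees such as the one starting with 0 3 are abandoned as soon as the second error is counted.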

large database maintains many pointers and consumes huge memory
▹ Reducing redundant nodes is an often-used solution
▹ But there is no reduction technique for similarity searches
• Generality
▹ Sketches consist of integers from {0, 1, …, σ–1}
▹ σ is a given parameter depending on the hashing technique
– σ ≤ 4 is recommended in MinHash
– σ ≥ 16 is recommended in CWS
▹ But existing trie implementations have been designed for byte sketches, i.e., σ = 256
Our DyFT is a new similarity search method that solves these issues

and integer sketches
▹ Store only some of the trie nodes around the root for memory efficiency
▹ Exploit the trie search algorithm for filtering out dissimilar sketches
Database X: x1 = 111020, x2 = 001020, x3 = 032021, x4 = 113021, x5 = 333110, x6 = 330110, x7 = 311020, x8 = 030120
Example: searching for y = 111020 with r = 1 yields the candidate solutions x1, x4, and x7
Verification:
H(x1, y) = 0 ≤ r (similar)
H(x4, y) = 2 > r (dissimilar)
H(x7, y) = 1 ≤ r (similar)

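new sketch xi
• Append xi to the posting list Lv of leaf node v
• If the length of Lv (or |Lv|) exceeds threshold τ, split Lv and create new leaf nodes
Example: insert x9 = 030110
1. Append: x9 is appended to the posting list Lv of the leaf v reached by the path 0 3, which already holds x3 and x8
2. Split (if |Lv| > τ): Lv is divided among new leaf nodes below v according to the next symbol (here 0 and 2)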
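The append-and-split insertion, together with the filter-then-verify search it supports, can be sketched as below. This is a simplified illustration with a fixed threshold τ, not the authors' implementation: `ThresholdTrie`, `Node`, and `hamming` are hypothetical names, and children are stored in a `std::map` purely for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

using Sketch = std::vector<std::uint8_t>;

std::size_t hamming(const Sketch& x, const Sketch& y) {
    assert(x.size() == y.size());
    std::size_t errors = 0;
    for (std::size_t i = 0; i < x.size(); ++i) errors += (x[i] != y[i]) ? 1 : 0;
    return errors;
}

struct Node {
    bool is_leaf = true;
    std::map<std::uint8_t, std::unique_ptr<Node>> children;  // inner nodes only
    std::vector<std::size_t> postings;                       // leaf nodes only (Lv)
};

class ThresholdTrie {
public:
    explicit ThresholdTrie(std::size_t tau) : tau_(tau) {}

    // Append the new sketch to the posting list of the leaf on its path,
    // then split that leaf once if the list became longer than tau.
    std::size_t insert(const Sketch& x) {
        const std::size_t id = sketches_.size();
        sketches_.push_back(x);
        Node* v = &root_;
        std::size_t depth = 0;
        while (!v->is_leaf) {
            auto& child = v->children[x[depth]];
            if (!child) child = std::make_unique<Node>();
            v = child.get();
            ++depth;
        }
        v->postings.push_back(id);
        if (v->postings.size() > tau_ && depth < x.size()) split(*v, depth);
        return id;
    }

    // Traverse the stored prefix nodes with error counting (filtering),
    // then verify every candidate in the reached posting lists.
    std::vector<std::size_t> search(const Sketch& y, std::size_t r) const {
        std::vector<std::size_t> out;
        visit(root_, y, 0, 0, r, out);
        return out;
    }

private:
    void split(Node& v, std::size_t depth) {
        v.is_leaf = false;
        for (std::size_t id : v.postings) {
            auto& child = v.children[sketches_[id][depth]];
            if (!child) child = std::make_unique<Node>();
            child->postings.push_back(id);
        }
        v.postings.clear();
    }

    void visit(const Node& v, const Sketch& y, std::size_t depth,
               std::size_t errors, std::size_t r,
               std::vector<std::size_t>& out) const {
        if (v.is_leaf) {
            for (std::size_t id : v.postings) {
                if (hamming(sketches_[id], y) <= r) out.push_back(id);
            }
            return;
        }
        for (const auto& kv : v.children) {
            const std::size_t e = errors + (kv.first != y[depth] ? 1 : 0);
            if (e <= r) visit(*kv.second, y, depth + 1, e, r, out);
        }
    }

    Node root_;
    std::vector<Sketch> sketches_;
    std::size_t tau_;
};
```

With τ = 2 and x1 to x8 inserted in order, only a few nodes near the root are materialized. A search for y = 111020 with r = 1 reaches the posting lists holding x1, x4, and x7, and verification then discards x4.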

reasonable value of τ can be determined depending on the configuration of the dataset and the given parameters of the hashing technique
• But it is impossible to search for such a reasonable value for dynamic datasets
▹ If τ is large: large verification time
▹ If τ is small: large traversal time
[Figure: search time for different values of τ; annotations: "The best values are reversed", "One order of magnitude!"]

• Then, determine an optimal threshold τ* minimizing the search cost
▹ Keep Lv (if |Lv| ≤ τ*) or split Lv (if |Lv| > τ*)?
▹ τ* selects the case that maintains the smaller cost
Can always achieve the fastest search

node v is defined by SC(v) = RP(v) × CC(v)
▹ RP(v) is the Reach Probability, defined for a random sketch from a uniform distribution
▹ CC(v) is the Computational Cost, defined for inner and leaf nodes separately
Inner node v: CCin(v) is the cost of checking the children, and SCin(v) = RP(v) × CCin(v)
Leaf node v: CCleaf(v) is the cost of verifying the sketches in Lv, and SCleaf(v) = RP(v) × CCleaf(v)

costs in the two cases:
▹ If keeping leaf v, then the search cost is SCleaf(v)
▹ If splitting leaf v into new leaves u1, u2, …, uk, then the new search cost is SCin(v) + ∑ SCleaf(ui)
• Can derive the condition |Lv| > τ*(r, ℓ, σ) under which the splitting case maintains a smaller search cost
▹ τ*(r, ℓ, σ) is precomputable
DyFT can grow while maintaining fast similarity searches with few node pointers
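To see why such a condition depends only on r, ℓ, and σ, the comparison can be worked through under explicit assumptions; the authors' exact definitions may differ. Assume that v sits at depth ℓ, that a uniformly random query mismatches each symbol on the path with probability 1 − 1/σ (so RP depends only on the depth), that CCin(v) equals the number of children k, and that CCleaf(v) equals |Lv|. Splitting redistributes Lv among the children, so the child list sizes sum to |Lv|.

```latex
\begin{align*}
\mathrm{RP}(\ell) &= \sum_{e=0}^{\min(r,\ell)} \binom{\ell}{e}
    \left(1-\tfrac{1}{\sigma}\right)^{e} \left(\tfrac{1}{\sigma}\right)^{\ell-e} \\
\text{keep:}\quad \mathrm{SC}_{\mathrm{leaf}}(v) &= \mathrm{RP}(\ell)\,|L_v| \\
\text{split:}\quad \mathrm{SC}_{\mathrm{in}}(v) + \sum_{i=1}^{k} \mathrm{SC}_{\mathrm{leaf}}(u_i)
    &= \mathrm{RP}(\ell)\,k + \mathrm{RP}(\ell+1)\,|L_v| \\
\text{split is cheaper} &\iff |L_v| > \frac{k\,\mathrm{RP}(\ell)}{\mathrm{RP}(\ell)-\mathrm{RP}(\ell+1)}
\end{align*}
```

Under these assumptions, replacing k by a bound such as σ gives a right-hand side that involves only r, ℓ, and σ, so it can be tabulated before any sketch is inserted.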

There are many trie implementations
• Bad point :(
▹ They are designed for byte strings
▹ But sketches consist of general integers
• Our approach
▹ Reconstruct integer sketches into byte ones
▹ Represent them using an adaptive radix tree (a space-efficient trie implementation)
Example: x = 2 3 6 3 0 1 2 11 2… is reconstructed into x' = 0xF2 0xAE 0x53…
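One way to reconstruct an integer sketch into bytes is to pack several symbols into each byte. The slides do not specify the layout, so the following is only a sketch under assumptions: `bits_per_symbol` and `pack` are hypothetical names, each symbol takes ⌈log2 σ⌉ bits, and the first symbol goes into the highest bits of its byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bits needed to store one symbol from {0, ..., sigma - 1}.
// Assumes 2 <= sigma <= 256 so that a symbol never spans two bytes.
unsigned bits_per_symbol(unsigned sigma) {
    assert(sigma >= 2 && sigma <= 256);
    unsigned b = 1;
    while ((1u << b) < sigma) ++b;
    return b;
}

// Pack floor(8 / b) symbols into each byte (assumed layout, see lead-in).
std::vector<std::uint8_t> pack(const std::vector<std::uint8_t>& x, unsigned sigma) {
    const unsigned b = bits_per_symbol(sigma);
    const unsigned per_byte = 8 / b;
    std::vector<std::uint8_t> bytes((x.size() + per_byte - 1) / per_byte, 0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        assert(x[i] < sigma);
        const unsigned slot = static_cast<unsigned>(i % per_byte);
        const unsigned shift = b * (per_byte - 1 - slot);  // first symbol in the high bits
        bytes[i / per_byte] |= static_cast<std::uint8_t>(x[i] << shift);
    }
    return bytes;
}
```

For σ = 16, the symbols 2 3 6 3 become the two bytes 0x23 0x63 under this layout. A search over packed bytes would still have to count mismatches per original symbol rather than per byte, since one byte can hide several errors.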

node implementation depending on the number of children
• The data structure is modified for node traversal in similarity search
▹ For a node with few children, use a list-based data structure
▹ For a node with a moderate number of children, use a hybrid data structure of a list and an array
▹ For a node with many children, use an array-based data structure

modern similarity search?
▹ There is no efficient dynamic data structure for integer sketches
• What are the issues with trie-based similarity search?
▹ Scalability: There is no node reduction technique for similarity search
▹ Generality: There is no node implementation technique for integer sketches
We developed DyFT based on a trie data structure
▹ We constructed a search cost model, defined an optimal threshold, and reduced DyFT nodes while maintaining fast similarity searches
▹ We reconstructed integer sketches into byte sketches to leverage an existing trie implementation technique

Each pair is represented as a 3.6-million-dimensional binary fingerprint
▹ We converted the fingerprints into binary and integer sketches using Li's minhash algorithm for Jaccard similarity [Li+, WWW10]
▹ We constructed an index by inserting sketches in random order
• Queryset
▹ We randomly sampled 1000 sketches from the dataset
• Code
▹ We implemented all data structures using C++17
▹ Source code is available at https://github.com/kampersanda/dyft
[Figure: example compounds (Aspirin, Caffeic Acid)]

[Figure: search time (ms/query) for each threshold as the dataset size n grows]
• The optimal threshold τ* is the fastest in most cases
• The search times with the fixed thresholds τ = 1, 10, 100 are reversed according to the dataset size n

[Figure panels: search time (ms/query), update time (sec), memory usage (GB)]
DyFT was
• four orders of magnitude faster on the search time
• competitive on the update time
• one order of magnitude smaller on the memory

[Figure panels: search time (ms/query), update time (sec), memory usage (GB)]
DyFT was
• always faster on the search time
• competitive on the update time
• always smaller on the memory

modern similarity search?
▹ There is no efficient dynamic data structure for integer sketches
• What are the issues with trie-based similarity search?
▹ Scalability: There is no node reduction technique for similarity search
▹ Generality: There is no node implementation technique for integer sketches
We developed DyFT based on a trie data structure
▹ We reconstructed integer sketches into byte sketches to leverage an existing trie implementation technique
▹ We constructed a search cost model, defined an optimal threshold, and reduced DyFT nodes while maintaining fast similarity searches