AIP Open Seminar #6

Shunsuke Kanda
February 04, 2021

Presentation slide on AIP Open Seminar #6


Transcript

  1. Dynamic Similarity Search on Integer Sketches

    Shunsuke Kanda and Yasuo Tabei, Succinct Information Processing Unit (presented at ICDM20)
  2. Contents

    1. Background & Contribution
    2. Preliminary: Trie-based Similarity Search
    3. New method: Dynamic Filter Trie (DyFT)
       i. Node reduction technique
       ii. Node implementation technique
    4. Experiments
  3. Contents

    1. Background & Contribution
    2. Preliminary: Trie-based Similarity Search
    3. New method: Dynamic Filter Trie (DyFT)
       i. Node reduction technique
       ii. Node implementation technique
    4. Experiments
  4. Similarity-preserving Hashing

    • Core technique for fast similarity searches
      ▹ Randomly map vectors in a metric space into sketches in the Hamming space
    • Metric space (e.g., cosine or Jaccard): high dimension :( (~10^3 to ~10^6), e.g., (0.2, 0.7, 0.1, 0.5, 0.2, 0.3, 0.3, 0.8, …, 0.1)
    • Hamming space: low dimension :) (32 or 64), e.g., (0, 1, 1, …, 0)
    • Many similarity search problems can be solved as Hamming distance problems on discrete strings
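As a rough illustration of mapping a set into a short integer sketch, the following is a minimal b-bit minhash-style sketch in Python. It is not the hashing code used in the talk: the helper names (`_hash`, `bbit_minhash`), the use of BLAKE2b as a stand-in for random permutations, and the defaults `m = 32`, `b = 2` are all assumptions made for this example.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Seeded 64-bit hash; stands in for one random permutation (assumption)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bbit_minhash(items, m=32, b=2):
    """Return a sketch of m integers in {0, ..., 2^b - 1} for a non-empty set."""
    if not items:
        raise ValueError("items must be non-empty")
    mask = (1 << b) - 1
    # For each of the m hash functions, keep the lowest b bits of the minimum hash.
    return [min(_hash(seed, it) for it in items) & mask for seed in range(m)]


def hamming(x, y):
    """Number of positions at which two equal-length sketches differ."""
    return sum(a != c for a, c in zip(x, y))


doc_a = {"dynamic", "similarity", "search", "integer", "sketches"}
doc_b = {"dynamic", "similarity", "search", "binary", "sketches"}
sketch_a = bbit_minhash(doc_a)
sketch_b = bbit_minhash(doc_b)
print(hamming(sketch_a, sketch_b))  # smaller values suggest higher Jaccard similarity
```

With `b = 1` the sketch is binary; larger `b` yields integer sketches over a larger alphabet.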
  5. Modern Issues on Similarity Search

    • Generality
      ▹ Traditional hashing algorithms produce binary sketches (e.g., 001101001001)
      ▹ Modern hashing algorithms produce integer sketches (e.g., 236301499231)
        – Such as b-bit minhash [Li+, WWW10], 0-bit CWS [Li, KDD15], and GCWS [Li, KDD17]
      ▹ But most search methods are designed for binary sketches
    • Dynamics
      ▹ Modern real-world datasets are dynamic (i.e., updated over time)
        – Such as Web pages and time series data
      ▹ But most search methods are limited to static datasets or are inefficient for dynamic datasets
    • Our challenge: develop an efficient dynamic search method for both binary and integer sketches
  6. Problem Statement

    • A sketch x of length m is an m-dimensional vector of non-negative integers (generality)
    • We have a dataset X = {x1, x2, …, xn}, which is a dynamic set of n sketches (dynamics)
    • Given a sketch y and a Hamming radius r as a query, we want to quickly find the similar sketches {xi : H(xi, y) ≤ r}
      ▹ H(∙, ∙) is the Hamming distance (i.e., the number of dimensions in which the two sketches differ)
    • Example: query y = 111021 with r = 1
        x1 = 111020   H(x1, y) = 1 ≤ r   similar
        x2 = 001020   H(x2, y) = 3
        x3 = 032021   H(x3, y) = 3
        x4 = 113021   H(x4, y) = 1 ≤ r   similar
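The problem statement can be checked directly with a brute-force scan. The sketch below is only a baseline for illustration, not the method proposed in the talk; it assumes each sketch is stored as a string with one character per dimension, and the names `hamming` and `range_search` are made up for this example.

```python
def hamming(x: str, y: str) -> int:
    """Hamming distance: number of dimensions in which x and y differ."""
    if len(x) != len(y):
        raise ValueError("sketches must have the same length")
    return sum(a != b for a, b in zip(x, y))


def range_search(dataset: dict, y: str, r: int) -> list:
    """Return the names of all sketches within Hamming radius r of y (O(n * m) scan)."""
    return [name for name, x in dataset.items() if hamming(x, y) <= r]


# Dataset and query from the slide.
X = {"x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021"}
print(range_search(X, "111021", 1))  # ['x1', 'x4']
```

The scan touches every sketch, so its cost grows with n; the trie-based methods in the rest of the talk aim to avoid that.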
  7. State-of-the-art Similarity Search Methods

    • Most methods use hash tables, but they are inefficient for dynamic datasets
    • Recently, Eghbali et al. [IEEE TPAMI19] addressed this issue by using a search tree, but it is not applicable to integer sketches
    • We propose new methods, DyFTs, for dynamic datasets of integer sketches, which leverage a trie data structure
  8. Contents

    1. Background & Contribution
    2. Preliminary: Trie-based Similarity Search
    3. New method: Dynamic Filter Trie (DyFT)
       i. Node reduction technique
       ii. Node implementation technique
    4. Experiments
  9. Trie-based Similarity Search

    • A trie is a labeled tree built by merging common prefixes of sketches
    • The downward path from the root to a leaf represents the associated sketch

    [Figure: trie over sketches x1 to x8 with edge labels from {0, 1, 2, 3}; e.g., the path to x3 spells 032021]
  10. Trie-based Similarity Search

    • A trie is a labeled tree built by merging common prefixes of sketches
    • The downward path from the root to a leaf represents the associated sketch
    • Similarity search is performed by traversing nodes while counting the number of errors against the query sketch
    • If the number of errors exceeds the radius, we stop traversing and skip all descendants of that node
    • The time complexity is O(m^(r+2)), which does not depend on the dataset size n

    [Figure: search for y = 111020 with r = 1 on the trie over x1 to x8; x1 and x7 are similar]
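The traversal with error counting can be sketched as follows. This is a plain dictionary-based trie written for illustration, not the authors' implementation; sketches are assumed to be strings with one character per dimension, and `TrieNode`, `build_trie`, and `trie_search` are hypothetical names. The dataset is the eight-sketch database listed on the DyFT slide.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # symbol -> TrieNode
        self.names = []     # names of sketches that end at this node


def build_trie(dataset: dict) -> TrieNode:
    """Merge common prefixes of all sketches into one labeled tree."""
    root = TrieNode()
    for name, sketch in dataset.items():
        node = root
        for symbol in sketch:
            node = node.children.setdefault(symbol, TrieNode())
        node.names.append(name)
    return root


def trie_search(root: TrieNode, y: str, r: int) -> list:
    """Collect sketches within Hamming radius r of y, pruning on the error count."""
    results = []
    stack = [(root, 0, 0)]  # (node, depth, errors so far)
    while stack:
        node, depth, errors = stack.pop()
        if depth == len(y):
            results.extend(node.names)
            continue
        for symbol, child in node.children.items():
            new_errors = errors + (symbol != y[depth])
            if new_errors <= r:  # otherwise skip this child and all its descendants
                stack.append((child, depth + 1, new_errors))
    return sorted(results)


X = {"x1": "111020", "x2": "001020", "x3": "032021", "x4": "113021",
     "x5": "333110", "x6": "330110", "x7": "311020", "x8": "030120"}
root = build_trie(X)
print(trie_search(root, "111020", 1))  # ['x1', 'x7']
```

Subtrees are abandoned as soon as the running error count exceeds r, which is why the work done depends on m and r rather than on how many sketches are stored.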
  11. Two Issues on Trie Implementation

    • Scalability
      ▹ A trie for a large database maintains many pointers and consumes huge memory
      ▹ Reducing redundant nodes is an often-used solution
      ▹ But there is no reduction technique for similarity searches
    • Generality
      ▹ Sketches consist of integers from {0, 1, …, σ–1}
      ▹ σ is a given parameter depending on the hashing technique
        – σ ≤ 4 is recommended in MinHash
        – σ ≥ 16 is recommended in CWS
      ▹ But existing trie implementations have been designed for byte sketches, i.e., σ = 256
    • Our DyFT is a new similarity search method that solves both issues
  12. Contents

    1. Background & Contribution
    2. Preliminary: Trie-based Similarity Search
    3. New method: Dynamic Filter Trie (DyFT)
       i. Node reduction technique (for the scalability issue)
       ii. Node implementation technique
    4. Experiments
  13. Dynamic Filter Trie (DyFT)

    • Trie-based similarity search for binary and integer sketches
      ▹ Store only some of the trie nodes around the root for memory efficiency
      ▹ Exploit the trie search algorithm for filtering out dissimilar sketches
    • Database X: x1 = 111020, x2 = 001020, x3 = 032021, x4 = 113021, x5 = 333110, x6 = 330110, x7 = 311020, x8 = 030120
    • Example: search for y = 111020 with r = 1
      ▹ Candidate solutions from the trie: x1, x4, x7
      ▹ Verification: H(x1, y) = 0 ≤ r (similar), H(x4, y) = 2 > r (dissimilar), H(x7, y) = 1 ≤ r (similar)

    [Figure: partial trie near the root with edge labels 0, 1, 3, whose leaves hold x1 to x8]
  14. Update Procedure

    • Visit the deepest reachable leaf node v using the new sketch xi
    • Append xi to the posting list Lv of leaf node v
    • If the length of Lv (i.e., |Lv|) exceeds the threshold τ, split Lv and create new leaf nodes

    [Figure: inserting x9 = 030110 reaches the leaf v under path 0, 3 that holds x3 and x8; x9 is appended to Lv, and if |Lv| > τ, v is split into new leaves labeled 0 and 2]
  15. What is a Reasonable Splitting Threshold τ?

    • A reasonable value of τ can be determined depending on the configuration of the dataset and the given parameters of the hashing technique
    • But it is impossible to search for such a reasonable value for dynamic datasets
    • If τ is large: large verification time
    • If τ is small: large traversal time

    [Figure: search time for different fixed values of τ; annotations: the best values are reversed :(, one order of magnitude difference]
  16. Optimal Threshold τ*

    • First, construct a search cost model
    • Then, determine an optimal threshold τ* minimizing the search cost
    • For a leaf v with posting list Lv: keep it if |Lv| ≤ τ*, or split it if |Lv| > τ*
    • τ* selects whichever case maintains the smaller cost
    • Can always achieve the fastest search
  17. Definition of Search Cost SC(v)

    • The search cost for node v is defined by SC(v) = RP(v) × CC(v)
      ▹ RP(v) is the Reach Probability, defined for a random sketch from a uniform distribution
      ▹ CC(v) is the Computational Cost, defined for inner and leaf nodes separately
    • Inner node v, given a random sketch: SC_in(v) = RP(v) × CC_in(v), where CC_in(v) is the cost to check children
    • Leaf node v with posting list Lv, given a random sketch: SC_leaf(v) = RP(v) × CC_leaf(v), where CC_leaf(v) is the cost to verify sketches
  18. Optimal Threshold τ*

    • Given leaf v, compare the search costs in the two cases:
      ▹ If keeping leaf v with |Lv| sketches, the search cost is SC_leaf(v)
      ▹ If splitting leaf v into new leaves u1, u2, …, uk, the new search cost is SC_in(v) + ∑ SC_leaf(ui)
    • Can derive the condition |Lv| > τ*(r, ℓ, σ) under which splitting maintains a smaller search cost
      ▹ τ*(r, ℓ, σ) is precomputable :)
    • DyFT can grow while maintaining fast similarity searches with few node pointers
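To make the keep-or-split comparison concrete, here is a small numerical sketch. The slides do not give the formulas for RP and CC, so everything specific below is an assumption for illustration: RP is taken as the probability that a uniformly random query has at most r mismatches on a path of length ℓ, CC_in is taken as proportional to σ, CC_leaf as proportional to |Lv|, and a split is assumed to spread the postings evenly over σ children. The function names and the constants `c_in` and `c_leaf` are hypothetical.

```python
from math import comb, inf


def reach_probability(depth: int, r: int, sigma: int) -> float:
    """Assumed RP: P(at most r mismatches among `depth` uniform symbols)."""
    if depth <= r:
        return 1.0  # a path no longer than r can never exceed r errors
    p_hit, p_miss = 1 / sigma, (sigma - 1) / sigma
    return sum(comb(depth, k) * p_miss**k * p_hit**(depth - k) for k in range(r + 1))


def keep_cost(n: int, depth: int, r: int, sigma: int, c_leaf: float = 1.0) -> float:
    """SC_leaf(v) = RP(v) * CC_leaf(v), with CC_leaf assumed to be c_leaf * |Lv|."""
    return reach_probability(depth, r, sigma) * c_leaf * n


def split_cost(n, depth, r, sigma, c_in=1.0, c_leaf=1.0) -> float:
    """SC_in(v) + sum of SC_leaf(u_i), assuming sigma children with n / sigma postings each."""
    inner = reach_probability(depth, r, sigma) * c_in * sigma
    leaves = sigma * reach_probability(depth + 1, r, sigma) * c_leaf * (n / sigma)
    return inner + leaves


def optimal_threshold(depth, r, sigma, c_in=1.0, c_leaf=1.0) -> float:
    """Smallest list length above which splitting is cheaper under the assumed model."""
    rp = reach_probability(depth, r, sigma)
    gain = c_leaf * (rp - reach_probability(depth + 1, r, sigma))
    return inf if gain <= 0 else c_in * sigma * rp / gain


print(optimal_threshold(depth=2, r=1, sigma=2))  # 6.0
```

Under these assumptions the threshold depends only on r, ℓ, and σ, so it can be tabulated ahead of time. A side effect of this simplified model is that a leaf at depth ℓ < r is never worth splitting, because one more level cannot filter anything; the cost model in the paper may treat this case differently.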
  19. Contents

    1. Background & Contribution
    2. Preliminary: Trie-based Similarity Search
    3. New method: Dynamic Filter Trie (DyFT)
       i. Node reduction technique
       ii. Node implementation technique (for the generality issue)
    4. Experiments
  20. How to implement DyFT efficiently?

    • Good point :)
      ▹ There are many trie implementations
    • Bad point :(
      ▹ They are designed for byte strings
      ▹ But sketches consist of general integers
    • Our approach
      ▹ Reconstruct integer sketches into byte ones
      ▹ Represent them using an adaptive radix tree (a space-efficient trie implementation)
    • Example: x = 2 3 6 3 0 1 2 11 2… is reconstructed into x’ = 0xF2 0xAE 0x53…
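One way to turn an integer sketch into a byte string is to pack several small symbols into each byte. The sketch below shows that idea only; the slides do not spell out the encoding, so the bit layout (symbols packed high bits first, `bits` per symbol, zero padding at the end) and the names `pack` and `symbol_errors` are assumptions, and the output is not expected to reproduce the byte values shown on the slide.

```python
def pack(sketch: list, bits: int) -> bytes:
    """Pack symbols of `bits` bits each into bytes, high bits first (assumed layout)."""
    if bits not in (1, 2, 4, 8):
        raise ValueError("bits must divide 8")
    per_byte = 8 // bits
    out = bytearray()
    for i in range(0, len(sketch), per_byte):
        chunk = sketch[i:i + per_byte]
        byte = 0
        for symbol in chunk:
            if not 0 <= symbol < (1 << bits):
                raise ValueError("symbol does not fit in the given number of bits")
            byte = (byte << bits) | symbol
        byte <<= bits * (per_byte - len(chunk))  # zero-pad a short final chunk
        out.append(byte)
    return bytes(out)


def symbol_errors(a: int, b: int, bits: int) -> int:
    """Count mismatching symbols (not mismatching bits) between two packed bytes."""
    diff, mask, errors = a ^ b, (1 << bits) - 1, 0
    while diff:
        errors += (diff & mask) != 0
        diff >>= bits
    return errors


packed = pack([2, 3, 6, 3, 0, 1, 2, 11], bits=4)
print(packed.hex())  # 2363012b
```

Because one byte then carries several dimensions, a traversal over packed bytes has to add `symbol_errors` per edge rather than 0 or 1, so that the Hamming distance is still counted per original symbol.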
  21. Adaptive Radix Tree [Leis+, ICDE13]

    • Adaptively select a space-efficient node implementation depending on the number of children
      ▹ For a node with few children, use a list-based data structure
      ▹ For a node with a moderate number of children, use a hybrid data structure of a list and an array
      ▹ For a node with many children, use an array-based data structure
    • The data structure is modified for node traversal in similarity search
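The three node layouts can be sketched as a single child map that changes representation as it grows. This is a teaching sketch rather than the adaptive radix tree code used by DyFT: the class name `AdaptiveNode` and the switch points (16 and 48 children, loosely following the node sizes of the original adaptive radix tree) are assumptions, and Python lists stand in for the fixed-size arrays a C++ implementation would use.

```python
class AdaptiveNode:
    LIST_MAX = 16    # assumed limit for the list-based layout
    HYBRID_MAX = 48  # assumed limit for the hybrid layout

    def __init__(self):
        self.kind = "list"
        self.keys = []      # list layout: key bytes, parallel to self.children
        self.children = []  # list and hybrid layouts: child slots
        self.table = None   # hybrid: byte -> slot index; array: byte -> child

    def find(self, byte: int):
        if self.kind == "list":
            for key, child in zip(self.keys, self.children):
                if key == byte:
                    return child
            return None
        if self.kind == "hybrid":
            slot = self.table[byte]
            return None if slot is None else self.children[slot]
        return self.table[byte]

    def insert(self, byte: int, child) -> None:
        """Add a new (byte, child) pair; child must not be None."""
        if self.find(byte) is not None:
            raise KeyError(f"byte {byte} already present")
        if self.kind == "list" and len(self.children) == self.LIST_MAX:
            self._grow_to_hybrid()
        elif self.kind == "hybrid" and len(self.children) == self.HYBRID_MAX:
            self._grow_to_array()
        if self.kind == "list":
            self.keys.append(byte)
            self.children.append(child)
        elif self.kind == "hybrid":
            self.table[byte] = len(self.children)
            self.children.append(child)
        else:
            self.table[byte] = child

    def _grow_to_hybrid(self) -> None:
        self.table = [None] * 256
        for slot, key in enumerate(self.keys):
            self.table[key] = slot
        self.keys, self.kind = [], "hybrid"

    def _grow_to_array(self) -> None:
        table = [None] * 256
        for byte, slot in enumerate(self.table):
            if slot is not None:
                table[byte] = self.children[slot]
        self.table, self.children, self.kind = table, [], "array"

    def items(self):
        """Yield every (byte, child) pair, as a similarity search visiting all children needs."""
        if self.kind == "list":
            yield from zip(self.keys, self.children)
        else:
            for byte, entry in enumerate(self.table):
                if entry is not None:
                    yield byte, (self.children[entry] if self.kind == "hybrid" else entry)


node = AdaptiveNode()
for b in range(60):
    node.insert(b, f"child-{b}")
print(node.kind, node.find(42))  # array child-42
```

An exact-match lookup only needs `find`, whereas a similarity search has to enumerate children whose labels differ from the query, which is the role of `items` here.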
  22. Summary of Our Method

    • What is an issue in modern similarity search?
      ▹ There is no efficient dynamic data structure for integer sketches
      ▹ We developed DyFT based on a trie data structure
    • What are the issues in trie-based similarity search?
      ▹ Scalability: there is no node reduction technique for similarity search
        – We constructed a search cost model, defined an optimal threshold, and reduced DyFT nodes while maintaining fast similarity searches
      ▹ Generality: there is no node implementation technique for integer sketches
        – We reconstructed integer sketches into byte sketches to leverage an existing trie implementation technique
  23. Contents

    1. Background & Contribution
    2. Preliminary: Trie-based Similarity Search
    3. New method: Dynamic Filter Trie (DyFT)
       i. Node reduction technique
       ii. Node implementation technique
    4. Experiments
  24. Experimental Setup

    • Dataset
      ▹ 216 million compound-protein pairs
        – Each pair is represented as a 3.6 million dimensional binary fingerprint
      ▹ We converted the fingerprints into binary and integer sketches using Li’s minhash algorithm for Jaccard similarity [Li+, WWW10]
      ▹ We constructed an index by inserting sketches in random order
    • Queryset
      ▹ We randomly sampled 1000 sketches from the dataset
    • Code
      ▹ We implemented all data structures using C++17
      ▹ Source code is available at https://github.com/kampersanda/dyft

    [Figure: example compounds, Aspirin and Caffeic Acid]
  25. Analysis for Optimal Threshold τ*

    • The optimal threshold τ* is the fastest in most cases
    • The search times with fixed thresholds τ = 1, 10, 100 are reversed according to the dataset size n

    [Figure: search time (ms/query) for binary sketches and integer sketches]
  26. Comparison with State-of-the-Arts (Binary Sketches)

    DyFT was
    • four orders of magnitude faster in search time
    • competitive in update time
    • one order of magnitude smaller in memory usage

    [Figure: search time (ms/query), update time (sec), and memory usage (GB); annotations: 1600x faster, competitive, 13x smaller]
  27. Comparison with State-of-the-Arts (Integer Sketches)

    DyFT was
    • always faster in search time
    • competitive in update time
    • always smaller in memory usage

    [Figure: search time (ms/query), update time (sec), and memory usage (GB); annotations: always faster, competitive, always smaller]
  28. Summary of Our Method

    • What is an issue in modern similarity search?
      ▹ There is no efficient dynamic data structure for integer sketches
      ▹ We developed DyFT based on a trie data structure
    • What are the issues in trie-based similarity search?
      ▹ Scalability: there is no node reduction technique for similarity search
        – We constructed a search cost model, defined an optimal threshold, and reduced DyFT nodes while maintaining fast similarity searches
      ▹ Generality: there is no node implementation technique for integer sketches
        – We reconstructed integer sketches into byte sketches to leverage an existing trie implementation technique