
Making Sense of Vector Databases


Balkrishna Rawool

December 06, 2024

Transcript

  2. Vector
     Ø In mathematics, a vector is a quantity that has magnitude and direction.
     Ø In computer science, a vector is a group or array of numbers.
     Ø Examples:
       Ø [2.1, 0.34, 4.678]
       Ø [4]
       Ø [0.13483997249264842, 0.26967994498529685, −0.40451991747794525, 0.5393598899705937, 0.674199862463242]
     Ø The two views are equivalent: a vector v⃗ given by magnitude m and angle θ corresponds to a point V(x, y, z) in space.
     Ø The elements of a vector are also called the dimensions of the vector.
  3. Vector Embeddings
     Ø Typically a vector is calculated from some data by applying an embedding algorithm/model.
     Ø Here, the vector is also known as an embedding or vector embedding.
     Ø Flow: Data/Tokens → Embedding model/algorithm → Vector DB
     Ø An embedding model puts data with similar meaning close to each other.
     Ø The measurement of this closeness is called the similarity or distance between the two vectors/embeddings.
  4. Vector Databases
     Ø Vector databases store vectors.
     Ø Operations:
       Ø Inserting
       Ø Updating
       Ø Deleting
       Ø Indexing
       Ø Searching
     Ø Examples:
       Ø PostgreSQL with pgvector
       Ø Elasticsearch
       Ø Redis
       Ø Milvus
       Ø MongoDB
       Ø Neo4j
       Ø Pinecone
       Ø Chroma
       Ø Weaviate
  5. Simple VectorDB using Java Vector API
     Ø Java Vector API
       Ø Takes advantage of CPU architectures for faster processing
       Ø Single Instruction Multiple Data (SIMD)
       Ø Current status: Ninth Incubator – JEP 489
     Ø Simple VectorDB implementation
       Ø Stores vectors in memory
       Ø Insertion and update
       Ø Selection
       Ø Searching with Cosine Similarity
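The simple VectorDB outlined on this slide can be sketched in plain Java. This is a minimal sketch with hypothetical names (`SimpleVectorDB`, `upsert`, `search`); the talk's implementation additionally accelerates the arithmetic with the incubating Java Vector API, which is omitted here in favour of scalar loops:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal in-memory vector store: insert/update by id, search by cosine similarity.
class SimpleVectorDB {
    record Entry(String id, float[] vector) {}

    private final List<Entry> entries = new ArrayList<>();

    // Insert a new vector, or replace it if the id already exists.
    void upsert(String id, float[] vector) {
        entries.removeIf(e -> e.id().equals(id));
        entries.add(new Entry(id, vector));
    }

    // Brute-force search: rank all stored vectors by cosine similarity to the query.
    List<String> search(float[] query, int k) {
        return entries.stream()
                .sorted(Comparator.comparingDouble(
                        (Entry e) -> cosineSimilarity(query, e.vector())).reversed())
                .limit(k)
                .map(Entry::id)
                .toList();
    }

    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Brute-force scanning is fine for a demo; the indexing algorithms covered later exist precisely to avoid comparing the query against every stored vector.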
  6. Example 1: Simple Vectors
     Ø A world that only consists of gray colors.
     Ø All embeddings are one-dimensional: [0], [1], [2], …, [9].
     Ø Flow: (1) map data to vectors, (2) store them in the vector DB, (3) search with a query, (4) return the nearest neighbours.
  7. Similarity/Distance
     Ø Two vectors which have similar meaning are close together.
     Ø Similarity/distance is a measurement of how close they are.
     Ø Ways of calculating distance:
       Ø Manhattan Distance
       Ø Dot Product Similarity
       Ø Cosine Similarity
       Ø Euclidean Distance
     Ø Manhattan distance between V1(x1, y1, z1) and V2(x2, y2, z2):
       |x1 − x2| + |y1 − y2| + |z1 − z2|
  8. Similarity/Distance
     Ø Dot Product Similarity of V1(x1, y1, z1) and V2(x2, y2, z2):
       (x1 · x2) + (y1 · y2) + (z1 · z2)
     Ø Euclidean Distance
       Ø Straight-line distance between two vectors.
       Ø Euclidean distance between V1 and V2:
         √((x1 − x2)² + (y1 − y2)² + (z1 − z2)²)
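The three measures defined on these two slides translate directly into Java. A small sketch (class and method names are my own), written for vectors of any dimension:

```java
// Straightforward implementations of the distance/similarity measures above.
class DistanceMetrics {
    // Manhattan distance: sum of absolute coordinate differences.
    static double manhattan(double[] v1, double[] v2) {
        double sum = 0;
        for (int i = 0; i < v1.length; i++) sum += Math.abs(v1[i] - v2[i]);
        return sum;
    }

    // Dot product similarity: sum of element-wise products.
    static double dotProduct(double[] v1, double[] v2) {
        double sum = 0;
        for (int i = 0; i < v1.length; i++) sum += v1[i] * v2[i];
        return sum;
    }

    // Euclidean distance: straight-line distance between the two points.
    static double euclidean(double[] v1, double[] v2) {
        double sum = 0;
        for (int i = 0; i < v1.length; i++) {
            double d = v1[i] - v2[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

For example, between (1, 2, 3) and (4, 6, 3) the Manhattan distance is 3 + 4 + 0 = 7, the dot product is 4 + 12 + 9 = 25, and the Euclidean distance is √(9 + 16 + 0) = 5.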
  9. Magnitude and Normalization of a Vector
     Ø The magnitude or length of a vector V(x, y, z) is its distance from the origin:
       |V| = √(x² + y² + z²)
     Ø A normalized vector is the same vector but without its magnitude: a unit vector in the same direction as the original vector.
       Normalized V = V / |V| = (x/|V|, y/|V|, z/|V|)
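The two formulas above can be sketched as a pair of helper methods (hypothetical names, any dimension):

```java
class VectorNorm {
    // Magnitude (length) of a vector: sqrt(x^2 + y^2 + z^2 + ...).
    static double magnitude(double[] v) {
        double sum = 0;
        for (double x : v) sum += x * x;
        return Math.sqrt(sum);
    }

    // Normalization: divide each component by the magnitude,
    // yielding a unit vector in the same direction.
    static double[] normalize(double[] v) {
        double m = magnitude(v);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / m;
        return out;
    }
}
```

For example, (3, 4) has magnitude 5 and normalizes to (0.6, 0.8), whose magnitude is 1.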
  10. Cosine Similarity
     Ø Cosine similarity of two vectors is the cosine of the angle θ between them.
     Ø Cosine similarity between V1 and V2:
       cos θ = (V1 · V2) / (|V1| · |V2|)
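A direct sketch of the formula above (hypothetical names). It ranges from 1 (same direction) through 0 (perpendicular) to −1 (opposite direction); note that for vectors that are already normalized, it reduces to the plain dot product:

```java
class CosineSimilarity {
    // cos(theta) = (V1 . V2) / (|V1| * |V2|)
    static double cosine(double[] v1, double[] v2) {
        double dot = 0, n1 = 0, n2 = 0;
        for (int i = 0; i < v1.length; i++) {
            dot += v1[i] * v2[i];
            n1 += v1[i] * v1[i];
            n2 += v2[i] * v2[i];
        }
        return dot / (Math.sqrt(n1) * Math.sqrt(n2));
    }
}
```

Because the magnitudes are divided out, only direction matters: (2, 0) and (5, 0) have cosine similarity 1 despite different lengths.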
  11. Example 2: Cosine Similarity
     Ø The world becomes colorful, ranging from RGB(1,1,1) to RGB(255,255,255).
     Ø Embedding algorithm:
       Ø Map each color to a vector of its 3 RGB values
       Ø Normalize the vector
     Ø Flow: RGB colors → Embedding algorithm → Vector DB; Query → Search → Neighbours
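The two-step embedding algorithm from this example can be sketched in a few lines (hypothetical class name; the slide doesn't show the code itself):

```java
class RgbEmbedding {
    // Map a color to its three RGB components and normalize the result,
    // so that cosine-similarity search compares color mix, not brightness.
    static double[] embed(int r, int g, int b) {
        double m = Math.sqrt((double) r * r + (double) g * g + (double) b * b);
        return new double[]{r / m, g / m, b / m};
    }
}
```

A consequence of normalizing: RGB(100,100,100) and RGB(200,200,200) embed to the same vector, so all grays are maximally similar to each other — the search is about hue proportions, not intensity.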
  12. Embedding Models
     Ø Embedding models convert input data/tokens into vectors.
     Ø These vectors capture the meaning of and relationships between tokens.
     Ø Examples:
       Ø word2vec – Text/words
       Ø GloVe (Global Vectors) – Text/words
       Ø BERT (Bidirectional Encoder Representations from Transformers) – Text
       Ø CNN (Convolutional Neural Network) – Images/Videos
       Ø img2vec – Images
       Ø GPT (Generative Pre-trained Transformer) – Multimodal
       Ø CLIP (Contrastive Language-Image Pre-training) – Multimodal (Text and Images)
  13. Example 3: word2vec Embeddings
     Ø Convert words to vectors.
     Ø Embedding algorithm: word2vec
       Ø Deeplearning4j word2vec implementation, applied to 4 Wikipedia pages
     Ø The vectors contain the meaning of and relationships between tokens.
     Ø Flow: 4 wiki pages → word2vec → PostgreSQL + pgvector; Word → Search → Neighbours
  14. Indexing Algorithms
     Ø k-d Tree (k-Dimensional Tree)
       Ø A binary search tree with each layer using a different dimension.
     Ø R-tree (Rectangle Tree)
       Ø Vectors are grouped into rectangles.
     Ø LSH (Locality-Sensitive Hashing)
       Ø Vectors are placed in buckets of hashes.
     Ø Annoy (Approximate Nearest Neighbours, Oh Yeah)
       Ø A forest of multiple binary search trees
     Ø HNSW (Hierarchical Navigable Small World)
     Ø ScaNN (Scalable Nearest Neighbours)
       Ø Hybrid approach with several techniques combined
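Of the algorithms listed, LSH is the easiest to sketch briefly. The slide doesn't name a variant, so this sketch assumes random-hyperplane hashing: each vector's bucket key is a string of sign bits, one per random hyperplane, and search only examines vectors that landed in the same bucket:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Locality-sensitive hashing via random hyperplanes (a common LSH scheme
// for cosine similarity): nearby vectors tend to fall on the same side of
// each hyperplane, so they tend to share a bucket.
class RandomHyperplaneLSH {
    private final double[][] planes;                          // random hyperplane normals
    private final Map<String, List<double[]>> buckets = new HashMap<>();

    RandomHyperplaneLSH(int numPlanes, int dims, long seed) {
        Random rnd = new Random(seed);
        planes = new double[numPlanes][dims];
        for (double[] p : planes)
            for (int i = 0; i < dims; i++) p[i] = rnd.nextGaussian();
    }

    // Hash = one bit per hyperplane: which side of the plane the vector lies on.
    String hash(double[] v) {
        StringBuilder key = new StringBuilder();
        for (double[] p : planes) {
            double dot = 0;
            for (int i = 0; i < v.length; i++) dot += p[i] * v[i];
            key.append(dot >= 0 ? '1' : '0');
        }
        return key.toString();
    }

    void insert(double[] v) {
        buckets.computeIfAbsent(hash(v), k -> new ArrayList<>()).add(v);
    }

    // Candidate neighbours: only the vectors in the query's own bucket.
    List<double[]> candidates(double[] query) {
        return buckets.getOrDefault(hash(query), List.of());
    }
}
```

Real LSH deployments use several hash tables and multi-probe lookups so near neighbours that straddle a hyperplane aren't missed; this single-table version only shows the bucketing idea.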
  15. HNSW (Hierarchical Navigable Small World)
     Ø Vectors are plotted onto a graph, with each vector being a node; “close” vectors are connected by edges.
     Ø The graph is divided into layers with varying granularity.
     Ø To search for the nearest neighbour:
       Ø Start at a fixed entry point on the highest layer.
       Ø Find a vector with the least distance from the query vector.
       Ø Use this vector as the entry point to the next layer.
       Ø Repeat until you reach the lowest layer, which has maximum granularity.
     Ø (Diagram: entry point on Layer 2, descending through Layer 1 to Layer 0 toward the query vector.)
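The layered descent described above can be sketched as follows. This is a deliberately simplified illustration of the *search* walk only (hypothetical names, graph construction omitted): each layer is a plain adjacency map, searched greedily before descending. Real HNSW keeps a beam of candidates (the `ef` parameter) rather than a single current point:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified HNSW-style search: greedy walk per layer, then descend.
class TinyHnswSearch {
    // layers.get(0) is the bottom (most granular) layer; higher index = sparser layer.
    final List<Map<Integer, List<Integer>>> layers = new ArrayList<>();
    final Map<Integer, double[]> vectors = new HashMap<>();

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // Greedy walk on one layer: keep moving to a closer neighbour until stuck.
    int greedy(int entry, double[] query, Map<Integer, List<Integer>> layer) {
        int current = entry;
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int n : layer.getOrDefault(current, List.of())) {
                if (dist(vectors.get(n), query) < dist(vectors.get(current), query)) {
                    current = n;
                    improved = true;
                }
            }
        }
        return current;
    }

    // Start at the fixed entry point on the top layer, descend layer by layer.
    int search(int entryPoint, double[] query) {
        int current = entryPoint;
        for (int l = layers.size() - 1; l >= 0; l--)
            current = greedy(current, query, layers.get(l));
        return current;
    }
}
```

The sparse upper layers let the walk cross the graph in a few long hops; the dense bottom layer then refines the result locally.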
  16. Example 4: HNSW Indexing
     Ø HNSW indexing of vectors of a relatively large wiki dataset
     Ø Embedding algorithm: llama3 embedding, applied to 10 Wikipedia pages
     Ø Search for the 5 nearest neighbours
     Ø Flow: 10 wiki pages → llama3 embedding → PostgreSQL + pgvector; Prompt → Search → Neighbours
  17. Retrieval Augmented Generation (RAG)
     Ø Augment the prompt to an LLM with specific knowledge that is relevant to the prompt.
     Ø Flow: Data → Embedding model → Vector DB; Prompt → Search → Neighbours → LLM
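The RAG flow above reduces to a few lines of glue code. In this sketch the vector search and the LLM call are stand-ins behind hypothetical interfaces (not any real client API); the point is only the shape of the pipeline — retrieve neighbours, prepend them as context, then call the model:

```java
import java.util.List;

// Sketch of a RAG pipeline: search the vector DB for context, augment the prompt.
class RagPipeline {
    interface VectorSearch { List<String> nearest(String query, int k); }
    interface Llm { String complete(String prompt); }

    static String answer(String userPrompt, VectorSearch db, Llm llm) {
        // 1.-2. Embed the prompt and search for its nearest stored snippets
        //       (both hidden behind the VectorSearch stand-in here).
        List<String> context = db.nearest(userPrompt, 3);
        // 3. Augment the prompt with the retrieved knowledge.
        String augmented = "Answer using only this context:\n"
                + String.join("\n", context)
                + "\n\nQuestion: " + userPrompt;
        // 4. Send the augmented prompt to the LLM.
        return llm.complete(augmented);
    }
}
```

Because both dependencies are interfaces, the same pipeline works whether the store is pgvector, Weaviate, or an in-memory map, and whichever LLM sits behind `Llm`.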
  18. Example 5: RAG
     Ø Use RAG to ask an LLM questions about the Epic Comic Co book store.
     Ø Embedding model: llama3; Vector DB: PostgreSQL with pgvector
     Ø Flow: FAQ → llama3 embedding → PostgreSQL + pgvector; Prompt → Search → Neighbours → llama3
  19. Example 6: img2vec Embedding
     Ø Vectorize a large set of images and search for similar images.
     Ø Embedding model: img2vec; Vector DB: Weaviate
     Ø Flow: Lots of images → img2vec embedding → Weaviate; New image → Search → Neighbours