
Making Sense of Vector Databases

Balkrishna Rawool

December 06, 2024

Transcript

  2. Vector
     Ø In mathematics, a vector is a quantity that has magnitude and direction: ⃗𝑣 with magnitude 𝑚 and angle 𝜃, or equivalently coordinates V(x, y, z).
     Ø In computer science, a vector is a group or array of numbers.
     Ø Examples:
       Ø [2.1, 0.34, 4.678]
       Ø [4]
       Ø [0.13483997249264842, 0.26967994498529685, −0.40451991747794525, 0.5393598899705937, 0.674199862463242]
     Ø The two views are equivalent: (𝑚, 𝜃) = V(x, y, z).
     Ø The elements of a vector are also called the dimensions of the vector.
  3. Vector Embeddings
     Ø Typically a vector is calculated from some data by applying an embedding algorithm/model.
     Ø Here, the vector is also known as an embedding or vector embedding.
     Ø An embedding model puts data with similar meaning close to each other.
     Ø The measurement of this closeness is called the similarity or distance between the two vectors/embeddings.
     [Diagram: Data/Tokens → Embedding model/algorithm → Vector DB]
  4. Vector Databases
     Ø Vector databases store vectors.
     Ø Operations:
       Ø Inserting
       Ø Updating
       Ø Deleting
       Ø Indexing
       Ø Searching
     Ø Examples:
       Ø PostgreSQL with pgvector
       Ø Elasticsearch
       Ø Redis
       Ø Milvus
       Ø MongoDB
       Ø Neo4j
       Ø Pinecone
       Ø Chroma
       Ø Weaviate
  5. Simple VectorDB using Java Vector API
     Ø Java Vector API
       Ø Takes advantage of CPU architectures for faster processing
       Ø Single Instruction Multiple Data (SIMD)
       Ø Current status: Ninth Incubator – JEP 489
     Ø Simple VectorDB implementation
       Ø Stores vectors in memory
       Ø Insertion and update
       Ø Selection
       Ø Searching with Cosine Similarity
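The SIMD idea above can be sketched with the incubating Vector API: one loop iteration multiplies a whole lane-width of elements at once. This is an illustrative sketch, not the talk's implementation; it assumes a recent JDK run with `--add-modules jdk.incubator.vector`, and the names `SimdDot` and `dot` are invented here.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class SimdDot {
    // The "preferred" species picks the widest lane count the CPU supports.
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Dot product: SPECIES.length() elements per iteration, then a scalar tail.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            sum += va.mul(vb).reduceLanes(VectorOperators.ADD);
        }
        for (; i < a.length; i++) sum += a[i] * b[i]; // leftover elements
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 9f};
        float[] b = {9f, 8f, 7f, 6f, 5f, 4f, 3f, 2f, 1f};
        System.out.println(dot(a, b)); // same result as a plain scalar loop
    }
}
```

The same `dot` is the building block for the cosine-similarity search the slide mentions.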
  6. Example 1: Simple Vectors
     Ø A world that only consists of gray colors.
     Ø All embeddings are one-dimensional: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
     [Diagram: Data → (1) Mapping → (2) Store in Vector DB; Query → (3) Search → (4) Neighbours]
  7. Similarity/Distance
     Ø Two vectors which have similar meaning are close together.
     Ø Similarity/distance is a measurement of how close they are.
     Ø Ways of calculating distance:
       Ø Manhattan Distance
       Ø Dot Product Similarity
       Ø Cosine Similarity
       Ø Euclidean Distance
     Ø Manhattan distance between V1(𝑥1, 𝑦1, 𝑧1) and V2(𝑥2, 𝑦2, 𝑧2):
       |𝑥1 − 𝑥2| + |𝑦1 − 𝑦2| + |𝑧1 − 𝑧2|
  8. Similarity/Distance
     Ø Dot Product Similarity: dot product of V1(𝑥1, 𝑦1, 𝑧1) and V2(𝑥2, 𝑦2, 𝑧2):
       (𝑥1 ∗ 𝑥2) + (𝑦1 ∗ 𝑦2) + (𝑧1 ∗ 𝑧2)
     Ø Euclidean Distance
       Ø Straight-line distance between two vectors.
       Ø Euclidean distance between V1 and V2:
         √((𝑥1 − 𝑥2)² + (𝑦1 − 𝑦2)² + (𝑧1 − 𝑧2)²)
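The three measures defined on the last two slides can be written down directly; this is a minimal sketch for 3-dimensional (or any-dimensional) vectors, with illustrative class and method names.

```java
public class Distances {
    // Manhattan distance: sum of per-coordinate absolute differences.
    static double manhattan(double[] v1, double[] v2) {
        double d = 0;
        for (int i = 0; i < v1.length; i++) d += Math.abs(v1[i] - v2[i]);
        return d;
    }

    // Dot product similarity: sum of per-coordinate products.
    static double dot(double[] v1, double[] v2) {
        double d = 0;
        for (int i = 0; i < v1.length; i++) d += v1[i] * v2[i];
        return d;
    }

    // Euclidean distance: straight-line distance between the two points.
    static double euclidean(double[] v1, double[] v2) {
        double d = 0;
        for (int i = 0; i < v1.length; i++) d += (v1[i] - v2[i]) * (v1[i] - v2[i]);
        return Math.sqrt(d);
    }

    public static void main(String[] args) {
        double[] v1 = {1, 2, 3}, v2 = {4, 5, 6};
        System.out.println(manhattan(v1, v2)); // 9.0
        System.out.println(dot(v1, v2));       // 32.0
        System.out.println(euclidean(v1, v2)); // sqrt(27) ≈ 5.196
    }
}
```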
  9. Magnitude and Normalization of a Vector
     Ø The magnitude (or length) of a vector V(x, y, z) is its distance from the origin:
       |V| = √(x² + y² + z²)
     Ø The normalized vector of a vector is the same vector but without magnitude; so, a unit vector in the same direction as the original vector:
       Normalized V = V / |V| = (x/|V|, y/|V|, z/|V|)
  10. Cosine Similarity
     Ø The cosine similarity of two vectors V1 and V2 is the cosine of the angle 𝜃 between them:
       cos 𝜃 = (V1 · V2) / (|V1| |V2|)
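Magnitude, normalization, and cosine similarity fit together in a few lines; a sketch with illustrative names, using the formulas from the slides:

```java
public class Cosine {
    // Dot product of two vectors.
    static double dot(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += a[i] * b[i];
        return d;
    }

    // Magnitude: |V| = sqrt(x² + y² + z²) = sqrt(V · V).
    static double magnitude(double[] v) {
        return Math.sqrt(dot(v, v));
    }

    // Normalized V = V / |V| — a unit vector in the same direction.
    static double[] normalize(double[] v) {
        double m = magnitude(v);
        double[] u = new double[v.length];
        for (int i = 0; i < v.length; i++) u[i] = v[i] / m;
        return u;
    }

    // cos θ = (V1 · V2) / (|V1| |V2|)
    static double cosineSimilarity(double[] v1, double[] v2) {
        return dot(v1, v2) / (magnitude(v1) * magnitude(v2));
    }

    public static void main(String[] args) {
        // Orthogonal vectors → similarity 0; parallel vectors → similarity 1.
        System.out.println(cosineSimilarity(new double[]{1, 0, 0}, new double[]{0, 1, 0}));
        System.out.println(cosineSimilarity(new double[]{1, 2, 3}, new double[]{2, 4, 6}));
        // A normalized vector always has magnitude 1.
        System.out.println(magnitude(normalize(new double[]{3, 4, 0})));
    }
}
```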
  11. Example 2: Cosine Similarity
     Ø The world becomes colorful, ranging from RGB(1,1,1) to RGB(255,255,255).
     Ø Embedding algorithm:
       Ø Map a color to a vector with its 3 RGB values
       Ø Normalize the vector
     [Diagram: RGB Colors → (1) Embedding Algorithm → (2) Store in Vector DB; Query → (3) Search → (4) Neighbours]
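A sketch of this RGB example, assuming a plain in-memory store and a linear scan; names are illustrative, and the talk's own code may differ:

```java
import java.util.List;

public class RgbSearch {
    // Embedding: map a colour to its (r, g, b) vector, then normalize it.
    static double[] embed(int r, int g, int b) {
        double m = Math.sqrt((double) r * r + (double) g * g + (double) b * b);
        return new double[]{r / m, g / m, b / m};
    }

    // Both vectors are already normalized, so the dot product IS the cosine similarity.
    static double cosine(double[] a, double[] b) {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    // Linear scan: return the index of the stored vector most similar to the query.
    static int nearest(List<double[]> store, double[] query) {
        int best = 0;
        for (int i = 1; i < store.size(); i++)
            if (cosine(store.get(i), query) > cosine(store.get(best), query)) best = i;
        return best;
    }

    public static void main(String[] args) {
        List<double[]> store = List.of(
            embed(255, 1, 1),   // red-ish
            embed(1, 255, 1),   // green-ish
            embed(1, 1, 255));  // blue-ish
        System.out.println(nearest(store, embed(200, 30, 30))); // 0 — closest to red
    }
}
```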
  12. Embedding Models
     Ø Embedding models convert input data/tokens into vectors.
     Ø These vectors capture the meaning of and relationships between tokens.
     Ø Examples:
       Ø word2vec – Text/words
       Ø GloVe (Global Vectors) – Text/words
       Ø BERT (Bidirectional Encoder Representations from Transformers) – Text
       Ø CNN (Convolutional Neural Network) – Images/Videos
       Ø img2vec – Images
       Ø GPT (Generative Pre-trained Transformer) – Multimodal
       Ø CLIP (Contrastive Language-Image Pre-training) – Multimodal (Text and Images)
  13. Example 3: word2vec Embeddings
     Ø Convert words to vectors.
     Ø Embedding algorithm:
       Ø word2vec
       Ø Deeplearning4j word2vec implementation, applied to 4 Wikipedia pages
     Ø Vectors contain the meaning of and relationships between tokens.
     [Diagram: 4 wiki pages → (1) word2vec → (2) Store in PostgreSQL + pgvector; Word → (3) Search → (4) Neighbours]
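With pgvector, the store-and-search steps of this example could look roughly like the following SQL. This is a hypothetical sketch: the table and column names, the dimensionality, and the query vector are illustrative, not taken from the talk.

```sql
-- Requires the extension: CREATE EXTENSION vector;
CREATE TABLE word_embeddings (
    word      TEXT PRIMARY KEY,
    embedding vector(100)   -- word2vec dimensionality is configurable
);

-- <=> is pgvector's cosine-distance operator; smaller means more similar.
SELECT word
FROM word_embeddings
ORDER BY embedding <=> '[0.12, -0.03, ...]'   -- query vector literal (elided)
LIMIT 4;
```

The same pattern (an `ORDER BY ... <=> ... LIMIT k` query) is what a nearest-neighbour search against pgvector boils down to.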
  14. Indexing Algorithms
     Ø k-d Tree (k-Dimensional Tree)
       Ø A binary search tree with each layer using a different dimension.
     Ø R-tree (Rectangle Tree)
       Ø Vectors are grouped into rectangles.
     Ø LSH (Locality-Sensitive Hashing)
       Ø Vectors are placed in buckets of hashes.
     Ø Annoy (Approximate Nearest Neighbours Oh Yeah)
       Ø A forest of multiple binary search trees
     Ø HNSW (Hierarchical Navigable Small World)
     Ø ScaNN (Scalable Nearest Neighbours)
       Ø Hybrid approach with several techniques combined
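To make "buckets of hashes" concrete, here is a minimal sketch of one common LSH scheme, random hyperplanes: each hash bit is the sign of the dot product with one random vector, so nearby vectors tend to land in the same bucket. All names are illustrative; real LSH implementations use many hash tables, not one.

```java
import java.util.Random;

public class LshSketch {
    final double[][] hyperplanes;

    // One random hyperplane per hash bit.
    LshSketch(int bits, int dims, long seed) {
        Random rnd = new Random(seed);
        hyperplanes = new double[bits][dims];
        for (double[] h : hyperplanes)
            for (int i = 0; i < dims; i++) h[i] = rnd.nextGaussian();
    }

    // Bucket id: one sign bit per hyperplane.
    int bucket(double[] v) {
        int hash = 0;
        for (double[] h : hyperplanes) {
            double dot = 0;
            for (int i = 0; i < v.length; i++) dot += h[i] * v[i];
            hash = (hash << 1) | (dot >= 0 ? 1 : 0);
        }
        return hash;
    }

    public static void main(String[] args) {
        LshSketch lsh = new LshSketch(8, 3, 42L);
        // Scaling a vector preserves the sign of every dot product,
        // so a vector and its scaled copy always share a bucket.
        System.out.println(lsh.bucket(new double[]{1, 2, 3})
                == lsh.bucket(new double[]{2, 4, 6})); // true
    }
}
```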
  15. HNSW (Hierarchical Navigable Small World)
     Ø Vectors are plotted onto a graph with each vector being a node, and “close” vectors are connected by edges.
     Ø The graph is divided into layers with varying granularity.
     Ø To search for the nearest neighbour:
       Ø Start with a fixed entry point on the highest layer.
       Ø Find a vector with the least distance from the query vector.
       Ø Use this vector as the entry point to the next layer.
       Ø Repeat this until you reach the lowest layer, which has maximum granularity.
     [Diagram: entry point on Layer 2, descending through Layer 1 to the query vector on Layer 0]
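The descent described above can be sketched as a greedy walk through the layers. This is a deliberately simplified toy (1-D vectors, hand-built layers, no candidate beam), so it illustrates only the layer-by-layer greedy step of HNSW, not a full implementation; all names and the example graph are invented here.

```java
import java.util.List;
import java.util.Map;

public class HnswSearch {
    // layers.get(0) is the lowest, densest layer; higher index = coarser layer.
    static int search(double[] vectors, List<Map<Integer, List<Integer>>> layers,
                      int entryPoint, double query) {
        int current = entryPoint;
        for (int l = layers.size() - 1; l >= 0; l--) {
            boolean improved = true;
            while (improved) {          // greedy: move to any closer neighbour
                improved = false;
                for (int n : layers.get(l).getOrDefault(current, List.of())) {
                    if (Math.abs(vectors[n] - query) < Math.abs(vectors[current] - query)) {
                        current = n;
                        improved = true;
                    }
                }
            }
            // current becomes the entry point for the next (finer) layer
        }
        return current; // nearest node found on the lowest layer
    }

    public static void main(String[] args) {
        double[] vectors = {0.0, 2.0, 4.0, 6.0, 8.0};
        List<Map<Integer, List<Integer>>> layers = List.of(
            Map.of(0, List.of(1), 1, List.of(0, 2), 2, List.of(1, 3),
                   3, List.of(2, 4), 4, List.of(3)),          // layer 0: full chain
            Map.of(0, List.of(2), 2, List.of(0, 4), 4, List.of(2))); // layer 1: coarse
        System.out.println(search(vectors, layers, 0, 5.5)); // 3 — the node holding 6.0
    }
}
```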
  16. Example 4: HNSW Indexing
     Ø HNSW indexing of vectors of a relatively large wiki dataset.
     Ø Embedding algorithm:
       Ø llama3 embedding, applied to 10 Wikipedia pages
     Ø Search for the 5 nearest neighbours.
     [Diagram: 10 wiki pages → (1) llama3 embedding → (2) Store in PostgreSQL + pgvector; Prompt → (3) Search → (4) Neighbours]
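In pgvector, switching the example's search from an exact scan to HNSW is a single index definition; the table and column names below are hypothetical, and the tuning parameters shown are pgvector's documented options, not values from the talk.

```sql
-- Build an HNSW index over the embedding column, using cosine distance.
-- m = max edges per node; ef_construction = candidate list size while building.
CREATE INDEX ON wiki_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

After this, the same `ORDER BY embedding <=> ... LIMIT 5` query runs as an approximate nearest-neighbour search through the index.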
  17. Retrieval Augmented Generation (RAG)
     Ø Augment the prompt to an LLM with specific knowledge that is relevant to the prompt.
     [Diagram: Data → (1) Embedding Model → (2) Store in Vector DB; Prompt → (3) Search → (4) Neighbours → LLM]
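The "augment" step in the flow above is just prompt assembly: the neighbours returned by the vector search are stitched into the prompt before it goes to the LLM. A minimal sketch — the retrieval and LLM calls are assumed to exist elsewhere, and all names here are illustrative:

```java
import java.util.List;

public class RagPrompt {
    // Build the augmented prompt from the user question and the retrieved chunks.
    static String augment(String question, List<String> retrievedChunks) {
        StringBuilder sb = new StringBuilder("Answer using only the context below.\n\nContext:\n");
        for (String chunk : retrievedChunks) sb.append("- ").append(chunk).append('\n');
        return sb.append("\nQuestion: ").append(question).toString();
    }

    public static void main(String[] args) {
        String prompt = augment("What are the opening hours?",
                List.of("The store opens at 9am on weekdays.",
                        "The store is closed on Sundays."));
        // The retrieved knowledge now travels inside the prompt to the LLM.
        System.out.println(prompt.contains("9am")); // true
    }
}
```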
  18. Example 5: RAG
     Ø Use RAG to ask an LLM questions about the Epic Comic Co book-store.
     Ø Embedding model: llama3; Vector DB: PostgreSQL with pgvector.
     [Diagram: FAQ → (1) llama3 embedding → (2) Store in PostgreSQL + pgvector; Prompt → (3) Search → (4) Neighbours → llama3]
  19. Example 6: img2vec Embedding
     Ø Vectorize a lot of images and search for similar images.
     Ø Embedding model: img2vec
     Ø Vector DB: Weaviate
     [Diagram: Lots of images → (1) img2vec embedding → (2) Store in Weaviate; New image → (3) Search → (4) Neighbours]