Upgrade to Pro — share decks privately, control downloads, hide ads and more …

GeoEmbeddings with PostGIS & pgVector

GeoEmbeddings with PostGIS & pgVector

Slides from my talk from PostGIS Day 2025
Will add video when ready

Avatar for Shoaib Burq

Shoaib Burq

November 20, 2025
Tweet

More Decks by Shoaib Burq

Other Decks in Technology

Transcript

  1. Geospatial Embeddings with PostGIS and PGVector Lessons learned and future

    directions Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  2. Who am I and what am I working on? •

    Real-time multi-user, Auth, REST & GraphQL, Raw DB • MobilityDB, pgPointCloud, H3, A5, pgvector(scale) • Vector & Raster Tile Servers • Data Importers (OSM) Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  3. Why now? • Classic RAG: AI assistants answering questions about

    places • Similarity searches: Finding semantically similar geographic features • Downstream task training: Segmentation & object detection • Damaged building detection • Diseased crop identification Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  4. pgvector -- Create table with vector column CREATE TABLE documents

    ( id SERIAL PRIMARY KEY, content TEXT, embedding VECTOR(1536) ); -- Create HNSW index CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops); Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  5. Filtered Search: pgvector -- Post-filtering: Search first, filter after --

    Problem: May not return enough results SELECT * FROM documents WHERE category = 'electronics' -- Post-filter ORDER BY embedding <=> $1 LIMIT 10; Challenge: May return < 10 results if filter is selective Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  6. pgvectorscale -- Create table with labels CREATE TABLE documents (

    id SERIAL PRIMARY KEY, embedding VECTOR(1536), labels SMALLINT[] -- Category labels ); -- Index includes labels for pre-filtering CREATE INDEX ON documents USING diskann (embedding vector_cosine_ops, labels); Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  7. Filtered Search: pgvectorscale -- Efficient label-filtered search SELECT * FROM

    documents WHERE labels && ARRAY[1, 3] -- Pre-filtered! ORDER BY embedding <=> $1 LIMIT 10; Benefit: Always returns 10 results, faster execution Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  8. We came here for the geospatial embeddings, not graph theory!

    Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  9. Generating Geo Embeddings Vision VIT Models Shoaib Burq & M.

    Hassan | geobase.app | PostGIS Day 2025
  10. Geospatial Filtering Problem SELECT * FROM places WHERE geom &&

    ST_MakeEnvelope(-122.5, 37.7, -122.3, 37.8) ORDER BY embedding <=> query_vector LIMIT 10; Post-filtering with geometry leads to: - Inefficient queries, Poor recall - Vector index ignored Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  11. Benchmark Setup • Dataset: 100,000 geometry points • Indexes: Default

    PostGIS GiST index on geometry, HNSW on vectors • Hardware: DigitalOcean Basic Droplet (2 vCPU, 4 GB RAM, 25 GB SSD) • Config: Default PostgreSQL settings Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  12. Baseline Queries • Query 1: Spatial Filter Only (GiST) •

    Query 2: Vector Similarity Only (HNSW) • Query 6: Sequential Scan (No Indexes) Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  13. Combined Queries • Query 3: High Selectivity (633 rows filtered)

    • Query 4: Very High Selectivity (169 rows) • Query 5: Low Selectivity (5,882 rows) • Query 7: Moderate Selectivity (2,578 rows) Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  14. CTE Variants • Query 8: CTE Version - High Selectivity

    • Query 9: CTE Version - Low Selectivity Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  15. Benchmark Results Summary Query Index Used Execution Time Rows Filtered

    Key Issue Query 1 (Spatial only) GiST 2.7 ms 633 Baseline - works fine Query 2 (Vector only) HNSW 79.4 ms N/A Baseline - works fine Query 3 (Combined, High) GiST only 1,364 ms 633 ❌ Vector index NOT used Query 4 (Very High) GiST only 5.0 ms 169 Vector index NOT used Query 5 (Low) GiST only 10,850 ms 5,882 ❌❌ CRITICAL: 10.85 seconds! Query 6 (No indexes) Sequential 95.0 ms 633 Baseline - worst case Query 7 (Moderate) GiST only 57.9 ms 2,578 ❌ Vector index NOT used Query 8 (CTE, High) GiST only 21.3 ms 633 Same plan, cache effects Query 9 (CTE, Low) GiST only 117.1 ms 5,882 Same plan, cache effects Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  16. Query 3: Spatial Region Shoaib Burq & M. Hassan |

    geobase.app | PostGIS Day 2025
  17. Benchmark Query 3: High Selectivity SELECT id, name, category, embedding

    <=> (SELECT embedding FROM test_places LIMIT 1) AS distance, ST_AsText(geom) FROM test_places WHERE geom && ( SELECT query_geometry FROM benchmark_query_regions WHERE label_name = 'query_high_selectivity' ) ORDER BY embedding <=> (SELECT embedding FROM test_places LIMIT 1) LIMIT 10; Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  18. Query 3 Results • Execution Time: 1,364 ms (1.36 seconds)

    • Rows Filtered: 633 (0.63% of dataset) • Vector Index: NOT USED ❌ Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  19. Query 5: Spatial Region Shoaib Burq & M. Hassan |

    geobase.app | PostGIS Day 2025
  20. Benchmark Query 5: Low Selectivity ! SELECT id, name, category,

    embedding <=> (SELECT embedding FROM test_places LIMIT 1) AS distance FROM test_places WHERE geom && ( SELECT query_geometry FROM benchmark_query_regions WHERE label_name = 'query_low_selectivity' ) ORDER BY embedding <=> (SELECT embedding FROM test_places LIMIT 1) LIMIT 10; Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  21. Query 5 Results ! • Execution Time: 10,850 ms (10.85

    seconds!) • Rows Filtered: 5,882 (5.88% of dataset) • Vector Index: NOT USED ❌ Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  22. Query 7: Spatial Region Shoaib Burq & M. Hassan |

    geobase.app | PostGIS Day 2025
  23. Benchmark Query 7: Moderate Selectivity SELECT id, name, category, embedding

    <=> (SELECT embedding FROM test_places LIMIT 1) AS distance, ST_AsText(geom) FROM test_places WHERE geom && ( SELECT query_geometry FROM benchmark_query_regions WHERE label_name = 'query_moderate_selectivity' ) ORDER BY embedding <=> (SELECT embedding FROM test_places LIMIT 1) LIMIT 10; Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  24. Query 7 Results • Execution Time: 57.87 ms • Rows

    Filtered: 2,578 (2.58% of dataset) • Vector Index: NOT USED ❌ Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  25. Why Query 7 Appears Faster Buffer Access Analysis Query 3:

    - Bitmap Heap Scan: ~632 ms for 633 rows → 1 ms per row - Buffers: shared hit=5,620 - Cold cache, random block access Query 7: - Bitmap Heap Scan: ~48 ms for 2,578 rows → 0.018 ms per row Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  26. The Problem: Separate Indexes Don't Combine • ✅ GiST (spatial)

    works great for spatial queries • ✅ HNSW (vector) works great for vector similarity • ❌ But combined queries fail: Vector index ignored, 10-100x slower Key Issue: PostgreSQL planner can't efficiently combine separate indexes Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  27. The Solution: Composite Index At idea stage - seeking collaborators!

    CREATE INDEX idx_places_diskann_composite ON places USING diskann ( embedding vector_cosine_ops, geom vector_geometry_ops ); Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  28. How it works: - Spatial filter prunes vector search space

    BEFORE graph traversal - GeometryKey stored in each node enables filtering during search - Maintains graph structure optimization Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025
  29. ┌────────────────────────────────────┐ │ 1. Query Geometry + Query Vector │ └───────────────┬────────────────────┘

    │ ▼ ┌───────────────────────────────┐ │ 2. Coarse Spatial Grid │ │ (detect intersecting bins) │ └───────────┬───────────────────┘ │ ▼ ┌───────────────────────────────────────────────┐ │ 3. Get ANN Entry Nodes From Each Bin │ │ e.g., Bin 5 → H, J | Bin 6 → K | Bin 9 → Q │ └──────────────┬────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────────┐ │ 4. ANN Graph Traversal (with GeometryKey Spatial Pruning) | │ | │ Candidate node → check bbox overlap → prune or compute vector dist | │ Expand neighbors → apply bbox prune → continue best-first search | │ | └─────────────────────────────────────────────────────────────────────────┘ │ ▼ ┌──────────────────────────────────┐ │ 5. Final exact PostGIS check │ └──────────────────────────────────┘ │ ▼ ┌───────────────────────────────────┐ │ Top-K Intersection │ │ (vector similarity + spatial bbox)│ └───────────────────────────────────┘ Shoaib Burq & M. Hassan | geobase.app | PostGIS Day 2025