Simplifying Real-Time Vector Store Ingestion with Apache Flink @ Current 2025 London UK

Abstract

Retrieval-Augmented Generation (RAG) has become a foundational paradigm that augments the capabilities of language models—small or large—by attaching information stored in vector databases to provide grounding data. While the concept is straightforward, maintaining up-to-date embeddings as data constantly evolves across various source systems remains a persistent challenge. This lightning talk explores how to build a real-time vector ingestion pipeline on top of Apache Flink and its extensive connector ecosystem to seamlessly keep vector stores fresh at all times. To eliminate the need for custom code while still preserving a reasonable level of configurability, a handful of composable user-defined functions (UDFs) are discussed that address the loading, parsing, chunking, and embedding of data directly from within Flink's Table API or Flink SQL jobs. Easy-to-follow examples demonstrate how this approach significantly lowers the entry barrier for RAG adoption, ensuring that retrieval remains consistent with your latest knowledge.

Recording: pending...

Hans-Peter Grahsl

May 21, 2025

Transcript

  1. Vector Embeddings ... encode the meaning of raw, unstructured, or otherwise
     complex data objects to derive relationships based on semantic similarity.

     @hpgrahsl (.bsky.social) — decodable.co | Current 2025 | May 21st | London UK
  2. Embedding Models ... are (pre-)trained for specific purposes, allowing us to
     transform text, image, audio, video, ... inputs into vector embeddings.
  3. Quality context needs relevant knowledge based on fresh data.
  4. Function #1: EMBED_TEXT (sync | async)

     public class EmbedTextAsyncUdf extends AsyncScalarFunction {

         transient EmbeddingModelAsync asyncEmbeddingModel;

         public void eval(
                 @DataTypeHint(TypeDefinitions.EMBEDDING_VECTOR)
                 CompletableFuture<float[]> result,
                 @Nullable String text,
                 String modelType,
                 String providerType,
                 String modelName,
                 String udfConfigName) {
             /* actual implementation using e.g. langchain4j */
         }

         // more overloads of eval(...) for different embedding behaviour
     }
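To make the async contract above concrete, here is a minimal, self-contained sketch of the completion pattern an AsyncScalarFunction relies on: eval(...) completes a future instead of returning a value. The ToyEmbedder class and its character-sum "embedding" are purely illustrative stand-ins for a real model call (the talk delegates to e.g. langchain4j); only the CompletableFuture handling mirrors the UDF above.

```java
import java.util.concurrent.CompletableFuture;

// Toy stand-in for an async embedding call: deterministic fake features,
// no model involved. Class name, method shape, and dimension 4 are
// illustrative assumptions, not the talk's actual code.
public class ToyEmbedder {
    public static void eval(CompletableFuture<float[]> result, String text) {
        CompletableFuture.supplyAsync(() -> {
            float[] vector = new float[4];
            for (int i = 0; i < text.length(); i++) {
                vector[i % 4] += text.charAt(i); // fake, deterministic "features"
            }
            return vector;
        }).whenComplete((v, err) -> {
            if (err != null) {
                result.completeExceptionally(err); // propagate failures to Flink
            } else {
                result.complete(v);
            }
        });
    }
}
```

The key design point carried over from the slide is that the UDF never blocks: the embedding work runs off-thread and the framework is notified through the future.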
  5. Function #2: CHUNK_TEXT (sync)

     public class ChunkTextUdf extends ScalarFunction {

         transient DocumentSplitter documentSplitter;

         public @Nullable List<String> eval(
                 @Nullable String data,
                 @Nullable String splitterType,
                 int maxSegmentSizeChars,
                 int maxOverlapSizeChars) {
             /* actual implementation using e.g. langchain4j */
         }

         // more overloads of eval(...) for different call param sets
     }
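The signature above leaves the splitting strategy to the configured splitterType. As a rough illustration of what the simplest strategy does, here is a fixed-size character splitter with overlap; the TextChunker class and the sliding-window logic are assumptions for this sketch, whereas the talk's UDF delegates to langchain4j's document splitters.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal fixed-size character chunker with overlap: each window advances
// by (maxChars - overlapChars), so consecutive chunks share overlapChars
// characters of context. Illustrative only.
public class TextChunker {
    public static List<String> chunk(String text, int maxChars, int overlapChars) {
        List<String> chunks = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            return chunks;
        }
        int step = maxChars - overlapChars;
        if (step <= 0) {
            throw new IllegalArgumentException("overlap must be smaller than chunk size");
        }
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // final (possibly shorter) chunk emitted
            }
        }
        return chunks;
    }
}
```

For example, chunk("abcdefghij", 4, 1) yields ["abcd", "defg", "ghij"], with one character of overlap between neighbours.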
  6. Function #3: LOAD_TEXT (async)

     public class LoadTextUdf extends AsyncScalarFunction {

         public void eval(
                 CompletableFuture<String> result,
                 @Nullable String location,
                 String loaderType) {
             /* actual implementation using e.g. langchain4j */
         }

         // more overloads of eval(...) for different call param sets
     }
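LOAD_TEXT's job, resolving a location into raw text, can be approximated for the simplest case of a local file as follows. The ToyTextLoader name and the file-only behaviour are assumptions made for this sketch; the talk's UDF supports pluggable loaderType values (again via langchain4j) rather than a single hard-coded reader.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;

// Async loader sketch covering only a local-file "loader type":
// the read happens off-thread and failures surface through the future.
public class ToyTextLoader {
    public static CompletableFuture<String> load(String location) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return Files.readString(Path.of(location));
            } catch (Exception e) {
                throw new RuntimeException("failed to load " + location, e);
            }
        });
    }
}
```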
  7. Function #4: BATCH_CHUNKS (sync)

     public class BatchChunksUdtf extends TableFunction<String[]> {

         public void eval(List<String> texts, int batchSize) {
             /* actual implementation to create sub-arrays */
         }
     }
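The "sub-arrays" the table function above emits can be sketched without any Flink dependency as a plain batching helper; the ChunkBatcher name and return type are illustrative, since the real UDTF would call collect(...) once per batch instead of returning a list.

```java
import java.util.ArrayList;
import java.util.List;

// Splits a list of text chunks into fixed-size sub-arrays; the last batch
// may be smaller. This mirrors what BATCH_CHUNKS emits row by row.
public class ChunkBatcher {
    public static List<String[]> batch(List<String> texts, int batchSize) {
        List<String[]> batches = new ArrayList<>();
        for (int i = 0; i < texts.size(); i += batchSize) {
            int end = Math.min(i + batchSize, texts.size());
            batches.add(texts.subList(i, end).toArray(new String[0]));
        }
        return batches;
    }
}
```

Batching like this lets a downstream EMBED_TEXT call send several chunks per request to an embedding provider instead of one network round trip per chunk.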