Simplifying Real-Time Vector Store Ingestion with Apache Flink @ Current 2025 London UK

Abstract

Retrieval-Augmented Generation (RAG) has become a foundational paradigm that augments the capabilities of language models—small or large—by attaching information stored in vector databases to provide grounding data. While the concept is straightforward, maintaining up-to-date embeddings as data constantly evolves across various source systems remains a persistent challenge. This lightning talk explores how to build a real-time vector ingestion pipeline on top of Apache Flink and its extensive connector ecosystem to seamlessly keep vector stores fresh at all times. To eliminate the need for custom code while still preserving a reasonable level of configurability, a handful of composable user-defined functions (UDFs) are discussed that address the loading, parsing, chunking, and embedding of data directly from within Flink's Table API or Flink SQL jobs. Easy-to-follow examples demonstrate how this approach significantly lowers the entry barrier for RAG adoption, ensuring that retrieval remains consistent with your latest knowledge.

Recording: pending...

Hans-Peter Grahsl

May 21, 2025

Transcript

  1. Vector Embeddings ... encode the meaning of raw, unstructured, or otherwise
     complex data objects to derive relationships based on semantic similarity.

     @hpgrahsl (.bsky.social) — decodable.co | Current 2025 | May 21st | London UK
  2. Embedding Models ... are (pre-)trained for specific purposes, allowing us to
     transform text, image, audio, video, ... inputs into vector embeddings.
  3. Quality context needs relevant knowledge based on fresh data.
  4. Function #1: EMBED_TEXT (sync | async)

     public class EmbedTextAsyncUdf extends AsyncScalarFunction {

         transient EmbeddingModelAsync asyncEmbeddingModel;

         public void eval(
                 @DataTypeHint(TypeDefinitions.EMBEDDING_VECTOR)
                 CompletableFuture<float[]> result,
                 @Nullable String text,
                 String modelType,
                 String providerType,
                 String modelName,
                 String udfConfigName) {
             /* actual implementation using e.g. langchain4j */
         }

         // more overloads of eval(...) for different embedding behaviour
     }
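To make the async contract above concrete, here is a minimal, self-contained sketch of the completion pattern an AsyncScalarFunction relies on: eval(...) completes a future instead of returning a value. The ToyEmbedder class and its character-sum "embedding" are purely illustrative stand-ins for a real model call (the talk delegates to e.g. langchain4j); only the CompletableFuture handling mirrors the UDF above.

```java
import java.util.concurrent.CompletableFuture;

// Toy stand-in for an async embedding call: deterministic fake features,
// no model involved. Class name, method shape, and dimension 4 are
// illustrative assumptions, not the talk's actual code.
public class ToyEmbedder {
    public static void eval(CompletableFuture<float[]> result, String text) {
        CompletableFuture.supplyAsync(() -> {
            float[] vector = new float[4];
            for (int i = 0; i < text.length(); i++) {
                vector[i % 4] += text.charAt(i); // fake, deterministic "features"
            }
            return vector;
        }).whenComplete((v, err) -> {
            if (err != null) {
                result.completeExceptionally(err); // propagate failures to Flink
            } else {
                result.complete(v);
            }
        });
    }
}
```

The key design point carried over from the slide is that the UDF never blocks: the embedding work runs off-thread and the framework is notified through the future.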
  5. Function #2: CHUNK_TEXT (sync)

     public class ChunkTextUdf extends ScalarFunction {

         transient DocumentSplitter documentSplitter;

         public @Nullable List<String> eval(
                 @Nullable String data,
                 @Nullable String splitterType,
                 int maxSegmentSizeChars,
                 int maxOverlapSizeChars) {
             /* actual implementation using e.g. langchain4j */
         }

         // more overloads of eval(...) for different call param sets
     }
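The signature above leaves the splitting strategy to the configured splitterType. As a rough illustration of what the simplest strategy does, here is a fixed-size character splitter with overlap; the TextChunker class and the sliding-window logic are assumptions for this sketch, whereas the talk's UDF delegates to langchain4j's document splitters.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal fixed-size character chunker with overlap: each window advances
// by (maxChars - overlapChars), so consecutive chunks share overlapChars
// characters of context. Illustrative only.
public class TextChunker {
    public static List<String> chunk(String text, int maxChars, int overlapChars) {
        List<String> chunks = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            return chunks;
        }
        int step = maxChars - overlapChars;
        if (step <= 0) {
            throw new IllegalArgumentException("overlap must be smaller than chunk size");
        }
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // final (possibly shorter) chunk emitted
            }
        }
        return chunks;
    }
}
```

For example, chunk("abcdefghij", 4, 1) yields ["abcd", "defg", "ghij"], with one character of overlap between neighbours.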
  6. Function #3: LOAD_TEXT (async)

     public class LoadTextUdf extends AsyncScalarFunction {

         public void eval(
                 CompletableFuture<String> result,
                 @Nullable String location,
                 String loaderType) {
             /* actual implementation using e.g. langchain4j */
         }

         // more overloads of eval(...) for different call param sets
     }
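LOAD_TEXT's job, resolving a location into raw text, can be approximated for the simplest case of a local file as follows. The ToyTextLoader name and the file-only behaviour are assumptions made for this sketch; the talk's UDF supports pluggable loaderType values (again via langchain4j) rather than a single hard-coded reader.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;

// Async loader sketch covering only a local-file "loader type":
// the read happens off-thread and failures surface through the future.
public class ToyTextLoader {
    public static CompletableFuture<String> load(String location) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return Files.readString(Path.of(location));
            } catch (Exception e) {
                throw new RuntimeException("failed to load " + location, e);
            }
        });
    }
}
```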
  7. Function #4: BATCH_CHUNKS (sync)

     public class BatchChunksUdtf extends TableFunction<String[]> {

         public void eval(List<String> texts, int batchSize) {
             /* actual implementation to create sub-arrays */
         }
     }
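The "sub-arrays" the table function above emits can be sketched without any Flink dependency as a plain batching helper; the ChunkBatcher name and return type are illustrative, since the real UDTF would call collect(...) once per batch instead of returning a list.

```java
import java.util.ArrayList;
import java.util.List;

// Splits a list of text chunks into fixed-size sub-arrays; the last batch
// may be smaller. This mirrors what BATCH_CHUNKS emits row by row.
public class ChunkBatcher {
    public static List<String[]> batch(List<String> texts, int batchSize) {
        List<String[]> batches = new ArrayList<>();
        for (int i = 0; i < texts.size(); i += batchSize) {
            int end = Math.min(i + batchSize, texts.size());
            batches.add(texts.subList(i, end).toArray(new String[0]));
        }
        return batches;
    }
}
```

Batching like this lets a downstream EMBED_TEXT call send several chunks per request to an embedding provider instead of one network round trip per chunk.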