Gently Down the Stream with Apache Pinot (Navina Ramesh, StarTree) | RTA Summit 2023

Streaming systems like Apache Kafka, Amazon Kinesis, and Google Pub/Sub have become the de facto standard for capturing real-time events and CDC events, which are then ingested into an OLAP system like Apache Pinot to derive real-time insights from data.

However, most of the time, the data in the stream needs to undergo transformations before entering an OLAP system in order to be useful for a user-facing application. Such pre-processing is typically achieved with stream processing pipelines running on systems like Apache Flink, Apache Samza, Kafka Streams, etc. While this approach works, it brings operational overhead that is expensive and tedious to maintain.

In this talk, we will explore some powerful real-time ingestion features in Apache Pinot that almost eliminate the need for stream processing pipelines. From ingestion operations like filtering and column transformations to handling CDC data from Debezium-supported sources, Apache Pinot reduces the effort needed to build a user-facing analytical application.

StarTree

May 23, 2023
Transcript

  1. Why Realtime Analytics Matters • Analytics helps make better decisions by providing insights into data. (Diagram: Events → Insight → Action)
  2. Why Realtime Analytics Matters • Realtime analytics reduces the time to go from data to insights and actions. (Chart: value of data vs. time, with the realtime window highlighted)
  3. Examples of Realtime Analytics — Total users: 700 million · QPS: 10,000+ · Latency SLA: < 100 ms at p99 · Freshness: seconds
  4. Examples of Real-Time Analytics — Use cases: missed orders, inaccurate orders, downtime, top-selling items, menu item feedback. Total users: 500,000+ · QPS: 100s · Latency SLA: < 100 ms at p99 · Freshness: seconds to minutes
  5. Characteristics of a Realtime Analytics Platform • Query latency: ~sub-second to seconds • Data freshness: ~seconds • QPS: on the order of thousands. (Diagram: 1000s of QPS, sub-second latency, seconds freshness, highly available, scalable, cost-effective, realtime ingestion, high velocity, high dimensionality)
  6. From Stream to Pinot — challenges: data semantics, data partitioning, multiple data sources, data transformation, derived data, complex types / nested fields, data granularity
  7. Realtime Ingestion in Apache Pinot • Partition-level, replicated consumption model • Periodic segment creation based on size/time thresholds • Data is ready to query as soon as it is ingested
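     Both the consumption source and the segment-completion thresholds are configured per table in the streamConfigs block. A minimal sketch for a Kafka topic — the property names follow current Pinot releases, and the topic, broker, and threshold values here are illustrative:

     "streamConfigs": {
       "streamType": "kafka",
       "stream.kafka.topic.name": "ads-events",
       "stream.kafka.broker.list": "localhost:9092",
       "stream.kafka.consumer.type": "lowlevel",
       "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
       "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
       "realtime.segment.flush.threshold.rows": "1000000",
       "realtime.segment.flush.threshold.time": "6h"
     }

     A consuming segment is sealed when it crosses either the row threshold or the time threshold, whichever comes first.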
  8. Realtime Ingestion (diagram): replicated and partitioned consumption across Server-1, Server-2, and Server-3, coordinated by the Controller and Zookeeper; brokers can query the consuming segments. Segment assignment: seg0_0 -> Server-1, Server-2 · seg1_0 -> Server-2, Server-3 · seg2_0 -> Server-3, Server-1
  9. Realtime Ingestion (diagram): when a completion threshold is reached, each consuming segment (seg0_0, seg1_0, seg2_0) is sealed (made immutable) and persisted to the deep store.
  10. Realtime Ingestion (diagram): after sealing, new consuming segments (seg0_1, seg1_1, seg2_1) are created with the same replicated assignment, so servers keep consuming, indexing, and serving. Segment assignment: seg0_1 -> Server-1, Server-2 · seg1_1 -> Server-2, Server-3 · seg2_1 -> Server-3, Server-1
  11. Apache Pinot Realtime Engine • Pluggable design • Components: ◦ Stream Consumer ◦ Decoder, Transform Pipeline, Indexer ◦ Upsert Metadata Manager ◦ Segment Completion Handler • These components allow Pinot to: ◦ fetch data from the source ◦ parse the data and apply transforms and aggregations ◦ apply transforms based on data semantics ◦ build and store segments
  12. Realtime Engine Component - Stream Consumer: fetches data from a source. (Pipeline diagram: Stream Consumer → Decode → Transform → Index → Segment Completion Handler, alongside the Upsert Manager, Controller, and Zookeeper)
  13. Realtime Engine Component - Decoder: deserializes data into a Pinot row. Available decoders include CSV, JSON, Avro, Protobuf, ... and a Debezium decoder.
  14. Realtime Engine Component - Transform Pipeline: applies transformation functions such as filtering, flattening, and aggregation (see the filtering sketch below).
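     For example, filtering can be expressed directly in the table config, so matching records are dropped before indexing. A minimal sketch, with an illustrative Groovy predicate over a views column:

     "ingestionConfig": {
       "filterConfig": {
         "filterFunction": "Groovy({views < 0}, views)"
       }
     }

     Records for which the filter function evaluates to true are skipped during ingestion.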
  15. Realtime Engine Component - Indexer: indexes data in memory, making it ready to query.
  16. Realtime Engine Component - Upsert Manager: maintains metadata for upsert semantics.
  17. Realtime Engine Components (diagram): the Stream Consumer feeds the Message Processor (Decode → Transform → Index → Segment Completion Handler, with the Upsert Manager, Controller, and Zookeeper).
  18. From Stream to Pinot — challenges: data semantics, data partitioning, multiple data sources, data transformation, derived data, complex types / nested fields, data granularity
  19. Transform Functions in Apache Pinot • Large library of built-in transform functions • Define custom transforms: ◦ using UDFs ◦ using Groovy ◦ using the @Scalar annotation on Java transform methods (see the sketch below)
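     As a sketch of the annotation route — assuming the annotation is @ScalarFunction from org.apache.pinot.spi.annotations (its name in recent Pinot releases), with an illustrative class and method:

     import org.apache.pinot.spi.annotations.ScalarFunction;

     public class AdTransformFunctions {

       // Registered via Pinot's annotation scanning; once registered, it can be
       // called from queries and ingestion transformConfigs as conversionRate(clicks, views).
       @ScalarFunction
       public static double conversionRate(long clicks, long views) {
         return views == 0 ? 0.0 : (double) clicks / views;
       }
     }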
  20. Transform Functions in Apache Pinot — number of clicks per hour on a given day:

     SELECT SUM(clicks) FROM ads
     WHERE toEpochDays(timestamp) = '2023-03-30 00:00:00'
     GROUP BY toEpochHours(timestamp)

     ad_id | views | clicks | timestamp
     1     | 7     | 4      | 2023-03-30 00:00:00
     2     | 10    | 2      | 2023-03-30 00:00:00
     3     | 1     | 2      | 2023-03-30 00:00:01
     1     | 4     | 1      | 2023-03-30 00:00:01
     4     | 2     | 1      | 2023-03-30 00:00:01
  21. The same query uses three transform functions: toEpochDays, toEpochHours, SUM.
  22. Average conversion rate since a given day (over the same ads table):

     SELECT AVG(clicks/views) FROM ads
     WHERE toEpochDays(timestamp) >= '2023-03-01 00:00:00'
  23. The same query uses the transform functions toEpochDays and AVG.
  24. Transform Functions in Apache Pinot — transform functions are convenient, but they come at a cost in query performance: they need extra calculations per record, and indexes cannot be applied to transformation results. Use ingestion transforms in Pinot instead!
  25. Ingestion Transforms in Apache Pinot • Pre-materialize the transformation result as a dedicated Pinot column • Indexes can be applied to the pre-materialized column!

     Table Config:
     "ingestionConfig": {
       "transformConfigs": [
         { "columnName": "daysTs", "transformFunction": "toEpochDays(timestamp)" },
         { "columnName": "hoursTs", "transformFunction": "toEpochHours(timestamp)" }
       ],
       ...
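     Each transform's destination column must also be declared in the table schema; a minimal sketch, with LONG assumed as the data type for the epoch-based values:

     "dimensionFieldSpecs": [
       { "name": "daysTs", "dataType": "LONG" },
       { "name": "hoursTs", "dataType": "LONG" }
     ]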
  26. Ingestion Transforms in Apache Pinot — number of clicks per hour on a given day, before and after (same ads table as above):

     SELECT SUM(clicks) FROM ads
     WHERE toEpochDays(timestamp) = '2023-03-30 00:00:00'
     GROUP BY toEpochHours(timestamp)

     SELECT SUM(clicks) FROM ads
     WHERE daysTs = '2023-03-30 00:00:00'
     GROUP BY hoursTs
  27. Ingestion Transforms in Apache Pinot — average conversion rate since a given day, before and after:

     SELECT AVG(clicks/views) FROM ads
     WHERE toEpochDays(timestamp) >= '2023-03-01 00:00:00'

     SELECT AVG(clicks/views) FROM ads
     WHERE daysTs >= '2023-03-01 00:00:00'

     (Table config as on slide 25.)
  29. Derived Columns in Apache Pinot • A column whose value is calculated from other columns • Generated on the fly during ingestion • Existing segments can be backfilled by reloading the segments on the server
  30. Derived Columns in Apache Pinot — average conversion rate since a given day:

     Table Config:
     "ingestionConfig": {
       "transformConfigs": [
         { "columnName": "daysTs", "transformFunction": "toEpochDays(timestamp)" },
         { "columnName": "hoursTs", "transformFunction": "toEpochHours(timestamp)" },
         { "columnName": "conversionRate", "transformFunction": "clicks/views" }
       ],
       ...

     SELECT AVG(clicks/views) FROM ads WHERE daysTs >= '2023-03-30 00:00:00'

     SELECT AVG(conversionRate) FROM ads WHERE daysTs >= '2023-03-30 00:00:00'
  31. From Stream to Pinot — challenges: data semantics, data partitioning, multiple data sources, data transformation, derived data, complex types / nested fields, data granularity
  33. Complex Types / Nested Fields — example event with nested fields and an array:

     {
       "id": "17244207357",
       "type": "PullRequestEvent",
       "repo": { "id": 326604100, "name": "jthegedus/svelte-adapter-firebase" },
       "payload": {
         "number": 120,
         "pull_request": {
           "id": 694073641,
           "user": { "login": "github-actions[bot]", "id": 41898282, "type": "Bot" },
           "body": ":robot: I have created a release ...",
           "created_at": "2021-07-21T05:59:14Z",
           "labels": [
             { "id": 3184575526, "name": "autorelease: pending" },
             { "id": 2412457281, "name": "autorelease: testing" }
           ]
         }
       }
     }
  35. Complex Types / Nested Fields • Store top-level fields as JSON strings and extract at query time:

     SELECT id, type,
       json_extract_scalar("repo", '$.name', 'STRING', 'null') AS repo_name,
       json_extract_scalar("payload", '$.number', 'INT', 'null') AS payload_number,
       json_extract_scalar(json_extract_scalar("payload", '$.pull_request', 'STRING', 'null'), '$.labels', 'STRING', 'null') AS payload_pullrequest_labels,
       json_extract_scalar("payload", '$.created_at', 'STRING', 'null') AS payload_created_at
     FROM PullRequestEvent
  36. The query-time approach is not optimized for querying: ◦ transform functions are applied at query time ◦ for every record, a JSON object is constructed from the stored JSON string ◦ nested JSON fields cannot be indexed ◦ storing the entire JSON string adds storage overhead
  37. Complex Types / Nested Fields • Use ingestion transforms! ◦ Specify complexTypeConfig with a delimiter to flatten the maps recursively.

     Schema:
     "dimensionFieldSpecs": [
       { "name": "id", "dataType": "LONG" },
       { "name": "type", "dataType": "STRING" },
       { "name": "repo.name", "dataType": "STRING" },
       { "name": "payload.pull_request.id", "dataType": "INT" },
       { "name": "payload.pull_request.labels", "dataType": "STRING" }
     ],
     "dateTimeFieldSpecs": [
       { "name": "payload.pull_request.created_at", "dataType": "STRING",
         "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss'Z'" }
     ]

     Table Config:
     "ingestionConfig": {
       "complexTypeConfig": { "delimiter": "." },
       ...
     }
  38. The flattened result (schema and table config as on slide 37):

     id          | type             | repo.name                         | payload.pull_request.number | payload.pull_request.labels                                                                                 | payload.pull_request.created_at
     17244207357 | PullRequestEvent | jthegedus/svelte-adapter-firebase | 120                         | [{ "id": 3184575526, "name": "autorelease: pending"}, { "id": 2412457281, "name": "autorelease: testing"}] | 2021-07-21T05:59:14Z
  40. Complex Types / Nested Fields • Use ingestion transforms! ◦ Specify fieldsToUnnest within complexTypeConfig to unnest the array.

     Schema: as on slide 37, except the labels column becomes
     { "name": "payload.pull_request.labels.name", "dataType": "JSON" }

     Table Config:
     "ingestionConfig": {
       "complexTypeConfig": {
         "delimiter": ".",
         "fieldsToUnnest": ["payload.pull_request.labels"]
       },
       ...
     }

     id          | type             | repo.name                         | payload.pull_request.number | payload.pull_request.labels.name | payload.pull_request.created_at
     17244207357 | PullRequestEvent | jthegedus/svelte-adapter-firebase | 120                         | autorelease: pending             | 2021-07-21T05:59:14Z
     17244207357 | PullRequestEvent | jthegedus/svelte-adapter-firebase | 120                         | autorelease: testing             | 2021-07-21T05:59:14Z
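     Once the array is unnested, each label lands on its own row, so plain SQL aggregations apply. A hypothetical query over the flattened table (the dotted column name must be double-quoted):

     SELECT "payload.pull_request.labels.name", COUNT(*) AS label_count
     FROM PullRequestEvent
     GROUP BY "payload.pull_request.labels.name"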
  41. From Stream to Pinot — challenges: data semantics, data partitioning, multiple data sources, data transformation, derived data, complex types / nested fields, data granularity
  43. Dealing with Data Granularity — the same sales data at millisecond vs. daily granularity:

     Millisecond granularity:
     customer_id | product_id | price | timestamp
     205 | car   | 1500 | 1571900400000
     206 | truck | 2200 | 1571900400000
     207 | car   | 1300 | 1571900400000
     208 | truck | 700  | 1572418800000
     209 | car   | 1100 | 1572505200000
     210 | car   | 2100 | 1572505200000

     Daily granularity:
     product_id | price | timestamp
     car   | 2800 | 10-24-2019 00:00:00
     truck | 2200 | 10-24-2019 00:00:00
     truck | 700  | 10-30-2019 00:00:00
     car   | 3200 | 10-31-2019 00:00:00
  44. Dealing with Data Granularity — daily sales value per product, computed and aggregated at query time:

     SELECT COUNT(*), SUM(price) FROM sales
     GROUP BY product_id, toEpochDays(timestamp)

     customer_id | product_id | price | timestamp
     205 | car   | 1500 | 1571900400000
     206 | truck | 2200 | 1571900400000
     207 | car   | 1300 | 1571900400000
     208 | truck | 700  | 1572418800000
     209 | car   | 1100 | 1572505200000
     210 | car   | 2100 | 1572505200000
     211 | truck | 800  | 1572678000000
     212 | car   | 800  | 1572678000000
     213 | car   | 1900 | 1572678000000
     214 | car   | 1000 | 1572678000000
  45. Ingestion Aggregation • Aggregate data at ingestion time:

     Table Config:
     "ingestionConfig": {
       "transformConfigs": [
         { "columnName": "daysSinceEpoch", "transformFunction": "toEpochDays(timestamp)" }
       ],
       "aggregationConfigs": [
         { "columnName": "total_sales", "aggregationFunction": "SUM(price)" },
         { "columnName": "sales_count", "aggregationFunction": "COUNT(*)" }
       ]
     },
     "tableIndexConfig": {
       "noDictionaryColumns": [ "sales_count", "total_sales" ]
     }

     Daily sales value per product:
     SELECT sales_count, total_sales FROM sales GROUP BY product_id, day

     product_id | sales_count | total_sales | day
     car   | 2 | 2800 | 10-24-2019 00:00:00
     truck | 1 | 2200 | 10-24-2019 00:00:00
     truck | 1 | 700  | 10-30-2019 00:00:00
     car   | 2 | 3300 | 10-31-2019 00:00:00
     truck | 1 | 800  | 11-02-2019 00:00:00
     car   | 3 | 3700 | 11-02-2019 00:00:00
  46. Ingestion Aggregation • Use ingestion-time aggregation when only the aggregated data is required, not the entire raw data • Allowed aggregation functions: SUM, MIN, MAX, COUNT, DISTINCTCOUNTHLL • Caveat: all metric columns must have aggregation functions defined.
  47. From Stream to Pinot — challenges: data semantics, data partitioning, multiple data sources, data transformation, derived data, complex types / nested fields, data granularity
  49. Data Semantics • Change Data Capture events: READ, CREATE, INSERT, UPDATE, DELETE • Multiple events for the same record key. (Diagram: mutable records in MySQL → Debezium → immutable records in the stream)
  50. Data Semantics • Change Data Capture events: READ, CREATE, INSERT, UPDATE, DELETE • Multiple events for the same record key. (Diagram: immutable records in a stream carrying changes to the same record)
  51. Dealing with Data Semantics — update record by primary key. A Debezium change event:

     [
       { "name": "test.movies.movies.Key",
         "payload": { "id": "{\"$oid\": \"1002\"}" } },
       { "payload": {
           "before": null,
           "after": "{\"_id\": {\"$oid\": \"1002\"},\"title\": \"Black Panther\",\"rating\": \"PG-13\",\"imdb\": 7.3}",
           "source": { "connector": "mysql", "name": "test", "ts_ms": 1670357925000, "snapshot": "false", "db": "movies", ... },
           "op": "c", "ts_ms": 1670357926132, "transaction": null } }
     ]

     should become the row:

     movie_id | movie_title   | rating | imdb_score | timestamp
     1002     | Black Panther | PG-13  | 7.3        | 2022-12-06 20:18:46
  52. Dealing with Data Semantics — CREATE record (MySQL → Debezium):

     [
       { "schema": { "type": "struct", "fields": [ ... ], "optional": false, "name": "test.movies.movies.Key" },
         "payload": { "id": "{\"$oid\": \"1002\"}" } },
       { "schema": { "type": "struct", "fields": [ ... ], "optional": false, "name": "test.movies.movies.Envelope" },
         "payload": {
           "before": null,
           "after": "{\"_id\": {\"$oid\": \"1002\"},\"title\": \"Black Panther\",\"rating\": \"PG-13\",\"imdb\": 7.3}",
           "source": { "connector": "mysql", "name": "test", "ts_ms": 1670357925000, "snapshot": "false", "db": "movies", ... },
           "op": "c", "ts_ms": 1670357926132, "transaction": null } }
     ]

     Resulting row: ID 1002 | Black Panther | PG-13 | 7.3
  53. Dealing with Data Semantics — UPDATE record (MySQL):

     [
       { "schema": { "type": "struct", "fields": [ ... ], "optional": false, "name": "test.movies.movies.Key" },
         "payload": { "id": "{\"$oid\": \"1002\"}" } },
       { "schema": { "type": "struct", "fields": [ ... ], "optional": false, "name": "test.movies.movies.Envelope" },
         "payload": {
           "before": "{\"_id\": {\"$oid\": \"1002\"},\"title\": \"Black Panther\",\"rating\": \"PG-13\",\"imdb\": 7.3}",
           "after": "{\"_id\": {\"$oid\": \"1002\"},\"title\": \"Black Panther\",\"rating\": \"PG-13\",\"imdb\": 8.5}",
           "source": { "connector": "mysql", "name": "test", "ts_ms": 16703579961000, "snapshot": "false", "db": "movies", ... },
           "op": "u", "ts_ms": 1670357986189, "transaction": null } }
     ]

     Resulting row: ID 1002 | Black Panther | PG-13 | 8.5
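     To get this update-by-primary-key behavior, the Pinot table is declared as an upsert table. A minimal sketch of the two pieces involved, with the key column name assumed from this example:

     Schema:
     "primaryKeyColumns": ["movie_id"]

     Table Config:
     "upsertConfig": { "mode": "FULL" }

     Pinot's upsert mode also expects the input stream to be partitioned by the primary key — one reason re-partitioning (slide 59) can still matter.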
  54. Dealing with Data Semantics - CDC CREATE (diagram): the DebeziumDecoder feeds the pipeline (Stream Consumer → Decode → Transform → Index); the Upsert Manager maintains the primary key → doc id mapping (upsert metadata) while the record is stored in the segment.
  55. Dealing with Data Semantics - CDC UPDATE (diagram): after the CREATE, the upsert metadata maps primary key 1002 → doc id 0, and the segment holds one record: 1002 | Black Panther | PG-13 | 7.3.
  56. Dealing with Data Semantics - CDC (diagram): after the UPDATE, the upsert metadata maps primary key 1002 → doc id 1; the segment holds both records (1002 | Black Panther | PG-13 | 7.3 and 1002 | Black Panther | PG-13 | 8.5), and queries are served from the latest one.
  57. From Stream to Pinot — challenges: data semantics, data partitioning, multiple data sources, data transformation, derived data, complex types / nested fields, data granularity
  58. From Stream to Pinot — remaining challenges: data partitioning, multiple data sources, data transformation, derived data, complex types / nested fields, data granularity, data semantics
  59. When is Stream Processing Needed? Yes, it still is — for: • Multiple data sources: complex processing of data from multiple sources, such as ▪ stream-table joins ▪ stream-stream joins • Data that requires re-partitioning
  60. Realtime Ingestion in Pinot • Pluggable architecture • Ingestion-time data processing: ◦ ingestion transforms ◦ derived columns ◦ complex-type handling (filtering / flattening) ◦ ingestion aggregation ◦ upsert semantics • Use these ingestion features to get data into an optimal form for your use case
  61. Enhancements to the Pinot Realtime Engine • Support record deletion in upsert tables (in progress - GH#10452) • Segment compaction in upsert tables (in progress - GH#6912) • Support for data enrichment • Support for ingesting from multiple sources
  62. Stream Consumer in Apache Pinot • A streaming data system should: ◦ support or allow a partitioned view of the data for consumption ◦ support or allow deriving a notion of an "offset" or a stream "marker" ◦ support some mechanism to "seek" within a stream ◦ provide message ordering at the partition level
  63. Stream Decoders in Apache Pinot • Stream message decoders accept the input from a StreamConsumer and convert it into Pinot's data model, represented as a GenericRow • Custom-build your decoder by implementing a simple interface (see the sketch below):

     interface StreamMessageDecoder<T> {
       GenericRow decode(T payload, GenericRow destination);
     }
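     A minimal sketch of a custom decoder against the simplified interface above — note that the real Pinot interface also carries init and offset/length overloads, and the CSV layout and column names here are illustrative:

     import java.nio.charset.StandardCharsets;
     import org.apache.pinot.spi.data.readers.GenericRow;

     // Decodes a two-field "id,name" CSV payload into a Pinot row.
     public class SimpleCsvMessageDecoder implements StreamMessageDecoder<byte[]> {

       @Override
       public GenericRow decode(byte[] payload, GenericRow destination) {
         String[] fields = new String(payload, StandardCharsets.UTF_8).split(",", 2);
         destination.putValue("id", Long.parseLong(fields[0].trim()));
         destination.putValue("name", fields[1].trim());
         return destination;
       }
     }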