Sungmin Kim
April 20, 2023

Streaming Data Processing in Real-time: Amazon Kinesis Data Streams vs. MSK

Agenda

- Key Components of Real-time Analytics
- Anatomy of Amazon Kinesis Data Streams & MSK
- Comparing Amazon Kinesis Data Streams to MSK
- Monitoring Metrics
- Common Architecture Patterns
- Key Takeaways

Transcript

  1. © 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.
    Choose the right Stream Storage: Amazon Kinesis Data Streams vs. MSK
    Sungmin Kim, Sr. Solutions Architect, AWS
  2. Agenda
    • Key Components of Real-time Analytics
    • Anatomy of Amazon Kinesis Data Streams & MSK
    • Comparing Amazon Kinesis Data Streams to MSK
    • Monitoring Metrics
    • Common Architecture Patterns
    • Key Takeaways
  3. Key Components of Real-time Analytics
  4. From Batch to Real-time: Lambda Architecture
    [Diagram labels: Data Source; Streaming Data; Stream Ingestion; Stream Storage; Stream Delivery; Speed Layer (Stream Process, Real-time View); Batch Layer (Raw Data Storage, Batch Process, Batch View); Service Layer (Query & Merge Results); Consumer]
  5. Lambda Architecture
    [Diagram labels: Streaming Data; Batch Layer (Raw Data, Batch Process, Batch View); Speed Layer (Stream Process, Real-time View); Serving Layer (Query against the Batch View and the Real-time View)]
  6. Key Components of Real-time Analytics
    • Data Source: devices and/or applications that produce real-time data at high velocity
    • Stream Ingestion: data from tens of thousands of data sources can be written to a single stream
    • Stream Storage: data are stored in the order they were received for a set duration of time and can be replayed any number of times during that period
    • Stream Process: records are read in the order they are produced, enabling real-time analytics or streaming ETL
    • Data Sink: data lake (most common), database (least common)
  7. Stream Storage
    [Diagram labels: Data Source; Stream Ingestion; Stream Storage; Stream Process; Data Sink]
    • Amazon Kinesis Data Streams
    • Amazon Managed Streaming for Apache Kafka (Amazon MSK)
  8. Anatomy of Amazon Kinesis Data Streams & MSK
  9. Key Features of Kinesis Data Streams and MSK
    • Distributed Queue
    • Stream Storage
    #Queue #Distributed #Storage
    [Diagram: a queue with Push at the rear and Pop at the front]
  10. #Queue: FIFO, Scale-Up vs Scale-Out
    [Diagram: Producers append numbered records to queues ordered from oldest data to newest data; a Consumer reads them in that order]
  11. #Distributed: Scale-Out
    [Diagram: Producers send records with partition keys (PK) through a Hash Function to shard/partition-1, shard/partition-2 and shard/partition-3; each shard/partition holds numbered records ordered from oldest data to newest data; the Consumers of a Consumer Group read them]
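The hash-function step in this slide can be sketched in a few lines. Kinesis Data Streams hashes each partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value; Kafka's default partitioner instead hashes the record key (murmur2) modulo the partition count. The sketch below follows the Kinesis style. The even split of the key space and the names `split_hash_key_space` and `shard_for_key` are assumptions made for illustration, not part of any AWS SDK.

```python
import hashlib


def split_hash_key_space(shard_count):
    """Split the 128-bit hash key space into contiguous, near-equal ranges (assumed layout)."""
    space = 2 ** 128
    size = space // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last range absorbs the remainder so the whole space is covered.
        end = space - 1 if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the range that contains MD5(partition_key)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key outside all ranges")


ranges = split_hash_key_space(3)
keys = ["user-1", "user-2", "user-3", "user-1"]
placement = [shard_for_key(pk, ranges) for pk in keys]
```

Records that share a partition key always land in the same shard, which is what preserves per-key ordering while the stream scales out.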
  12. #Storage: Stream Buffer
    [Diagram: same layout as the previous slide (Producers, Hash Function, shard/partition-1 to shard/partition-3, Consumer Group), with a marker on each shard/partition showing the next consumer offset]
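A minimal in-memory sketch of the stream-buffer idea in this slide: reading does not remove records, and each consumer group only moves its own "next offset" marker. The class `PartitionLog` and its methods are hypothetical names for illustration. The word "offset" follows the slide; in Kinesis the equivalent position is a sequence number reached through a shard iterator.

```python
class PartitionLog:
    """Append-only log for one shard/partition; reads never delete records."""

    def __init__(self):
        self.records = []
        self.next_offset = {}  # consumer group name -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset assigned to the new record

    def poll(self, group, max_records=10):
        start = self.next_offset.get(group, 0)
        batch = self.records[start:start + max_records]
        self.next_offset[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Move a group's marker, for example back to 0 to replay."""
        self.next_offset[group] = offset


log = PartitionLog()
for i in range(5):
    log.append(f"record-{i}")

first = log.poll("analytics", max_records=3)  # records 0..2
archive = log.poll("archive")                 # all five; groups advance independently
rest = log.poll("analytics")                  # records 3..4
log.seek("analytics", 0)                      # rewind to the oldest record
replayed = log.poll("analytics")
```

A real stream also expires records after the retention period, which this sketch leaves out.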
  13. Anatomy of Amazon Kinesis Data Streams and Amazon Managed Streaming for Kafka
    [Diagram: Producers send records with partition keys (PK) through a Hash Function to shard/partition-1, shard/partition-2 and shard/partition-3; records are ordered from oldest data to newest data; a Consumer Group reads them, with a marker for each next consumer offset]
  14. Benefits of Stream Storage
    • Decouple producers & consumers
    • Persistent Buffer
    • Collect multiple streams
    • Preserve client ordering
    • Parallel consumption
    • Streaming MapReduce
  15. Comparing Amazon Kinesis Data Streams to MSK
  16. Kinesis Data Streams vs. MSK
    • Amazon Kinesis Data Streams: PutRecord {Data, StreamName, PartitionKey}; a Hash Function maps each partition key (PK) to one of the Stream's shards (Shard1, Shard2, Shard3), which hold the Records.
    • Amazon Managed Streaming for Kafka: a Kafka Cluster of broker1, broker2 and broker3 plus ZooKeeper, with a producer and a consumer on the Topic “zerg.hydra”. Brokers, producers and consumers use ZooKeeper to manage and share state.
    • Legend (two marker styles): Px Ry = active replica (id y) of partition x for topic “zerg.hydra”; Px Ry = active replica (id y) of partition x, this broker is leader for that partition.
    • Replicas shown: P0 R1, P2 R1, P0 R2, P1 R2, P1 R3, P2 R3
  17. Amazon Kinesis Data Streams vs. Amazon Managed Streaming for Kafka
    Amazon Kinesis Data Streams
    • Operational Perspective
      § Number of streams?
      § Number of shards per stream?
    • Throughput provisioning model
    • Increase/Decrease number of shards
    • Easy to use, cloud-native
    Amazon Managed Streaming for Kafka
    • Operational Perspective
      § A single cluster vs. multiple clusters?
      § Number of brokers per cluster?
      § Number of topics per broker?
      § Number of partitions per topic?
    • Cluster provisioning model
    • Only increase number of partitions; can’t decrease
    • Fully managed with native Apache Kafka: easy lift-and-shift migration
  18. Monitoring Metrics
  19. MSK (Apache Kafka) Monitoring
    • RequestQueue: Length, WaitTime
    • ResponseQueue: Length, WaitTime
    • Network: Packet Drop?
    • Produce/Consume Rate Unbalance
    • Who is Leader?
    • Disk Full?
    • Too many topics?
    [Diagram: the Kafka Cluster from the comparison slide (broker1, broker2, broker3, ZooKeeper, producer and consumer on “zerg.hydra”, replicas P0 R1, P2 R1, P0 R2, P1 R2, P1 R3, P2 R3)]
  20. CloudWatch Metrics for MSK (Apache Kafka)

    | Metric | Level | Description |
    |---|---|---|
    | ActiveControllerCount | DEFAULT | Only one controller per cluster should be active at any given time. |
    | OfflinePartitionsCount | DEFAULT | Total number of partitions that are offline in the cluster. |
    | GlobalPartitionCount | DEFAULT | Total number of partitions across all brokers in the cluster. |
    | GlobalTopicCount | DEFAULT | Total number of topics across all brokers in the cluster. |
    | KafkaAppLogsDiskUsed | DEFAULT | The percentage of disk space used for application logs. |
    | KafkaDataLogsDiskUsed | DEFAULT | The percentage of disk space used for data logs. |
    | RootDiskUsed | DEFAULT | The percentage of the root disk used by the broker. |
    | PartitionCount | PER_BROKER | The number of partitions for the broker. |
    | LeaderCount | PER_BROKER | The number of leader replicas. |
    | UnderMinIsrPartitionCount | PER_BROKER | The number of under minIsr partitions for the broker. |
    | UnderReplicatedPartitions | PER_BROKER | The number of under-replicated partitions for the broker. |
    | FetchConsumerTotalTimeMsMean | PER_BROKER | The mean total time in milliseconds that consumers spend on fetching data from the broker. |
    | ProduceTotalTimeMsMean | PER_BROKER | The mean produce time in milliseconds. |
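As a rough illustration of how a few of these metrics might be turned into checks, the sketch below evaluates metric values that have already been fetched. The function name `check_msk_health`, the input layout, and the 80% disk threshold are all assumptions; only the metric names and the "exactly one active controller" rule come from the table.

```python
def check_msk_health(cluster_metrics, broker_metrics, disk_threshold_pct=80.0):
    """Return human-readable findings; an empty list means no check failed."""
    findings = []
    if cluster_metrics.get("ActiveControllerCount") != 1:
        findings.append("ActiveControllerCount should be exactly 1")
    if cluster_metrics.get("OfflinePartitionsCount", 0) > 0:
        findings.append("cluster has offline partitions")
    for broker_id, metrics in sorted(broker_metrics.items()):
        if metrics.get("UnderReplicatedPartitions", 0) > 0:
            findings.append(f"broker {broker_id}: under-replicated partitions")
        if metrics.get("UnderMinIsrPartitionCount", 0) > 0:
            findings.append(f"broker {broker_id}: partitions below min ISR")
        # The threshold is an assumed value, not an AWS recommendation.
        if metrics.get("KafkaDataLogsDiskUsed", 0.0) >= disk_threshold_pct:
            findings.append(f"broker {broker_id}: data log disk nearly full")
    return findings


healthy = check_msk_health(
    {"ActiveControllerCount": 1, "OfflinePartitionsCount": 0},
    {1: {"UnderReplicatedPartitions": 0, "KafkaDataLogsDiskUsed": 41.5}},
)
degraded = check_msk_health(
    {"ActiveControllerCount": 0, "OfflinePartitionsCount": 2},
    {1: {"UnderReplicatedPartitions": 3, "KafkaDataLogsDiskUsed": 93.0}},
)
```

In practice the same conditions would more likely be expressed as CloudWatch alarms than as polling code.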
  21. Kinesis Data Streams Monitoring
    • Writes (PutRecord): up to 1 MB or 1,000 records per second, per shard
    • Reads (GetRecords): 5 transactions per second, per shard; with only one consumer application, records can be retrieved every 200 ms
      § up to 10 MB per call
      § up to 10,000 records per call
    • How long does a record stay in a shard?
    [Diagram: PutRecord {Data, StreamName, PartitionKey}; a Hash Function maps partition keys (PK) to Shard1, Shard2 and Shard3 of the Stream; a Consumer Application calls GetRecords() to read the Data]
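The per-shard limits translate into a back-of-the-envelope shard count. The sketch below uses the write limits from this slide (1 MB or 1,000 records per second, per shard) and, as an added assumption not shown on the slide, the documented 2 MB per second per shard read limit for consumers that share throughput. The function name `estimate_shards` and its parameters are hypothetical.

```python
import math

WRITE_MB_PER_SEC = 1.0        # per shard, from the slide
WRITE_RECORDS_PER_SEC = 1000  # per shard, from the slide
READ_MB_PER_SEC = 2.0         # per shard, shared by all polling consumers (assumption)


def estimate_shards(records_per_sec, avg_record_kb, consumer_apps=1):
    """Smallest shard count that satisfies all three per-shard limits."""
    write_mb = records_per_sec * avg_record_kb / 1024
    read_mb = write_mb * consumer_apps  # every app reads the full stream
    return max(
        1,
        math.ceil(write_mb / WRITE_MB_PER_SEC),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC),
        math.ceil(read_mb / READ_MB_PER_SEC),
    )


small_records = estimate_shards(5000, 0.5)               # record-rate bound: 5
many_readers = estimate_shards(1000, 4, consumer_apps=3)  # read bound: 6
```

The estimate assumes partition keys spread traffic evenly; a hot key can throttle a single shard well before the stream-wide totals are reached.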
  22. CloudWatch Metrics for Kinesis Data Streams

    | Metric | Description |
    |---|---|
    | GetRecords.IteratorAgeMilliseconds | Age of the last record in all GetRecords calls |
    | ReadProvisionedThroughputExceeded | Number of GetRecords calls throttled |
    | WriteProvisionedThroughputExceeded | Number of PutRecord(s) calls throttled |
    | PutRecord.Success, PutRecords.Success | Number of successful PutRecord(s) operations |
    | GetRecords.Success | Number of successful GetRecords operations |
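One way to act on `GetRecords.IteratorAgeMilliseconds` is an alarm that fires when consumers fall behind. The sketch below only builds the parameters for CloudWatch's `put_metric_alarm` call so that it runs without AWS access. The helper name `iterator_age_alarm`, the alarm name, the one-minute threshold, and the evaluation settings are assumptions chosen for illustration.

```python
def iterator_age_alarm(stream_name, max_age_ms=60_000):
    """Build keyword arguments for CloudWatch put_metric_alarm (values are illustrative)."""
    return {
        "AlarmName": f"{stream_name}-iterator-age",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": 60,             # seconds per datapoint
        "EvaluationPeriods": 5,   # five consecutive breaching datapoints
        "Threshold": float(max_age_ms),
        "ComparisonOperator": "GreaterThanThreshold",
    }


params = iterator_age_alarm("clickstream")

# With boto3 installed and AWS credentials configured, this could be applied as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

A steadily growing iterator age means records are being written faster than they are processed, which usually points to a slow consumer or too few shards.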
  23. Choosing the Right Metrics
    Too Much = Useless = Too Little
  24. Kafka vs MSK vs Kinesis Data Streams
    [Chart: Operational Excellence plotted against Degree of Freedom ≈ Complexity; points shown: Kinesis Data Streams, Amazon MSK Serverless, Amazon MSK, Apache Kafka]
  25. Comparison Summary

    | Attribute | Apache Kafka | Kinesis Data Streams | Managed Streaming for Kafka |
    |---|---|---|---|
    | Cost | $$$ | $ (pay for what you use) | $$ (pay for infrastructure) |
    | Ease of use | Advanced setup required | Get started in minutes | Get started in minutes |
    | Management Overhead | High | Low | Low |
    | Scalability | Difficult to scale | Scale in seconds with one click | Scale in minutes with one click |
    | Throughput | Infinite | Scales with shards, supports up to 1 MB payloads | Infinite |
    | Durability | Configurable | 3x by default | Configurable |
    | Infrastructure | You manage | AWS manages | AWS manages |
    | Write-to-Read Latency | <100 ms is achievable | <100 ms (with HTTP/2) | <100 ms is achievable |
    | Open Sourced? | Yes | No | Yes |
  26. Common Architecture Patterns
  27. Near-real-time analytics with Data Lake (Lambda Architecture)
    [Diagram labels: Data sources; Capture; Transformation; Amazon Kinesis Data Streams; Amazon Kinesis Data Firehose; Amazon S3; Amazon Athena; Amazon OpenSearch Service; Amazon QuickSight; Dashboard; Visualization]
  28. Near-real-time analytics with Data Warehouse (Kappa Architecture)
    [Diagram labels: Data sources; Amazon Kinesis Data Streams; Amazon MSK; Amazon Redshift / Redshift Serverless (Streaming Table, Real-time Materialized View with Auto Refresh, Permanent Tables); Amazon QuickSight]
  29. Platform modernization with change data capture (CDC) with AWS DMS
    [Diagram labels: Amazon RDS; CDC; AWS DMS; Amazon Kinesis Data Streams; Amazon Kinesis Data Firehose; AWS Lambda; Amazon S3 (Data lake); Amazon Athena; Amazon OpenSearch Service; Amazon QuickSight; Dashboard; Visualization]
  30. Platform modernization with change data capture (CDC) with Amazon MSK Connect
    [Diagram labels: Amazon RDS; CDC; AWS DMS; MSK Connect (shown twice); Amazon MSK; AWS Lambda; Amazon S3 (Data lake); Amazon Athena; Amazon OpenSearch Service; Amazon QuickSight; Dashboard; Visualization]
  31. AWS Lambda + Kinesis Data Streams & MSK: Batch Size / Batch Window
    [Diagram labels: Amazon Kinesis Data Streams; Amazon Managed Streaming for Kafka; AWS Lambda]
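Batch size and batch window are settings on the Lambda event source mapping (`BatchSize` and `MaximumBatchingWindowInSeconds`); they decide how many records arrive in one invocation. The handlers below are a minimal sketch of what a function receives from each source, based on the documented event shapes: Kinesis delivers a `Records` list with base64 data under `kinesis.data`, and MSK delivers a `records` map keyed by topic-partition with base64 values. The handler names and the sample events are made up for illustration.

```python
import base64


def handle_kinesis(event, context=None):
    """Decode every record in one Kinesis batch."""
    payloads = []
    for record in event["Records"]:
        data = base64.b64decode(record["kinesis"]["data"])
        payloads.append(data.decode("utf-8"))
    return payloads


def handle_msk(event, context=None):
    """Decode every record in one MSK batch, across all topic-partitions."""
    payloads = []
    for topic_partition, records in event["records"].items():
        for record in records:
            payloads.append(base64.b64decode(record["value"]).decode("utf-8"))
    return payloads


def _b64(text):
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Hand-built sample events containing only the fields the handlers read.
kinesis_event = {"Records": [{"kinesis": {"partitionKey": "pk-1", "data": _b64("hello")}}]}
msk_event = {"records": {"mytopic-0": [{"offset": 15, "value": _b64("world")}]}}

kinesis_out = handle_kinesis(kinesis_event)
msk_out = handle_msk(msk_event)
```

A larger batch size or a longer window means fewer invocations with more records each, traded against added latency before processing starts.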
  32. AWS Lambda + MSK: Consumer Group ID
    [Diagram labels: Amazon Managed Streaming for Kafka; AWS Lambda]
  33. Kafka Consumer Group: Fan-out
    [Diagram: one Topic with Partition 0, Partition 1 and Partition 2, read by Consumer Group 1 (Consumer 0, Consumer 1, Consumer 2) and by Consumer Group 2 (Consumer 0, Consumer 1, Consumer 2)]
  34. Kafka Consumer Group
    [Diagram: a Topic with Partition 0, Partition 1 and Partition 2 read by a Consumer Group with only Consumer 0, next to the same Topic read by a Consumer Group with Consumer 0, Consumer 1 and Consumer 2]
  35. Kafka Consumer Group: Mismatched Partitions
    [Diagram: three cases for a Topic with Partition 0, Partition 1 and Partition 2]
    • GOOD: Consumer Group with Consumer 0, Consumer 1 and Consumer 2
    • BAD: Consumer Group with Consumer 0 and Consumer 1
    • BAD: Consumer Group with Consumer 0, Consumer 1, Consumer 2 and Consumer 3 (Consumer 3 is marked X)
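The three cases can be reproduced with a simplified round-robin assignment. Within one consumer group each partition is read by exactly one consumer, so extra consumers get nothing and too few consumers get uneven load. The function `assign_partitions` is a hypothetical stand-in; Kafka's real assignors (range, round-robin, sticky) differ in detail.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer, cycling through the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2]

good = assign_partitions(partitions, ["c0", "c1", "c2"])
# one partition per consumer

too_few = assign_partitions(partitions, ["c0", "c1"])
# c0 reads two partitions, c1 reads one: uneven load

too_many = assign_partitions(partitions, ["c0", "c1", "c2", "c3"])
# c3 has no partition and sits idle
```

This is why the consumer count in a group is usually kept at or below the partition count, and ideally at a divisor of it.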
  36. Key Takeaways
    • Distributed Queue as Stream Storage
      § Preserve Ordering
      § Parallel Consumption
      § Persistent Buffer
      § Decouple producers & consumers
    • Trade-off: Operational Excellence vs. Degree of Freedom
      § MUST keep an eye on the right monitoring metrics
  37. Resources
    • Streaming Data Solution for Amazon Kinesis
      § https://aws.amazon.com/ko/solutions/implementations/streaming-data-solution-for-amazon-kinesis/
    • Amazon MSK Labs
      § https://catalog.workshops.aws/msk-labs/en-US
    • Real Time Streaming with Amazon Kinesis
      § https://catalog.workshops.aws/real-time-streaming-with-kinesis/en-US
    • Building Serverless Business Intelligent System from Scratch
      § https://serverless-bi-system-from-scratch.workshop.aws/
    • Data Pipeline using AWS DMS and Kinesis
      § https://catalog.us-east-1.prod.workshops.aws/workshops/4da54890-23fc-4b9a-80cd-3a0ca3279b3f/en-US
    • Amazon Redshift Streaming Ingestion Patterns
      § https://github.com/aws-samples/redshift-streaming-ingestion-patterns