Streaming Data Processing in Real-time: Amazon Kinesis Data Streams vs. MSK

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Choose the Right Stream Storage: Amazon Kinesis Data Streams vs. MSK
Sungmin Kim, Sr. Solutions Architect, AWS

Agenda
• Key Components of Real-time Analytics
• Anatomy of Amazon Kinesis Data Streams & MSK
• Comparing Amazon Kinesis Data Streams to MSK
• Monitoring Metrics
• Common Architecture Patterns
• Key Takeaways

Key Components of Real-time Analytics

From Batch to Real-time: Lambda Architecture
[Diagram: Streaming Data flows from the Data Source through Stream Ingestion into Stream Storage. Batch Layer: Stream Delivery, Raw Data Storage, Batch Process, Batch View. Speed Layer: Stream Process, Real-time View. Service Layer: the Consumer queries both views and merges the results.]

Lambda Architecture
[Diagram: Streaming Data feeds two paths. Batch Layer: Raw Data, Batch Process, Batch View. Speed Layer: Stream Process, Real-time View. Serving Layer: queries read the Batch View and the Real-time View.]
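The "query and merge" step of the serving layer can be sketched as follows. This is a minimal illustration, assuming both views hold per-key counts; the function name `query` and the sample keys are hypothetical and do not come from the slides.

```python
from collections import Counter


def query(batch_view, realtime_view):
    """Merge per-key counts from the batch view and the real-time view.

    The batch view covers data up to the last batch run; the real-time
    view covers only the records that arrived after that run.
    """
    merged = Counter(batch_view)
    merged.update(realtime_view)  # adds counts key by key
    return dict(merged)


if __name__ == "__main__":
    batch_view = {"page_a": 10, "page_b": 5}
    realtime_view = {"page_a": 2, "page_c": 1}
    print(query(batch_view, realtime_view))
```

The merge only stays this simple when the two views never cover the same records, which is why the speed layer is usually reset after each batch run.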

Key Components of Real-time Analytics
• Data Source: Devices and/or applications that produce real-time data at high velocity
• Stream Ingestion: Data from tens of thousands of data sources can be written to a single stream
• Stream Storage: Data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time
• Stream Process: Records are read in the order they are produced, enabling real-time analytics or streaming ETL
• Data Sink: Data lake (most common), database (least common)

Stream Storage
[Pipeline: Data Source, Stream Ingestion, Stream Storage, Stream Process, Data Sink]
• Amazon Kinesis Data Streams
• Amazon Managed Streaming for Apache Kafka (MSK)

Anatomy of Amazon Kinesis Data Streams & MSK

Key Features of Kinesis Data Streams and MSK
• Distributed Queue
• Stream Storage
#Queue #Distributed #Storage
[Diagram: a queue with Push at the rear and Pop at the front]

#Queue: FIFO, Scale-Up vs Scale-Out
[Diagram: Producers append records to FIFO queues of different lengths, ordered from oldest data to newest data, and a Consumer reads from the oldest end.]

#Distributed: Scale-Out
[Diagram: Producers send records with partition keys (PK) through a Hash Function to shard/partition-1, shard/partition-2, and shard/partition-3, each ordered from oldest data to newest data. The Consumers in a Consumer Group read the shards/partitions in parallel.]
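The hash-function routing in this diagram can be sketched as below. Kinesis Data Streams maps each partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value. The even split of the key space and the helper names `shard_ranges` and `shard_for_key` are assumptions for illustration; Kafka's default partitioner uses a different hash (murmur2 modulo the partition count).

```python
import hashlib

MAX_HASH = 2 ** 128  # MD5 produces a 128-bit value


def shard_ranges(num_shards):
    """Split the 128-bit hash key space into equal, contiguous ranges."""
    step = MAX_HASH // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last shard absorbs any remainder of the key space.
        end = MAX_HASH - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    raise ValueError("hash value outside all shard ranges")


if __name__ == "__main__":
    ranges = shard_ranges(3)
    for key in ("user-1", "user-2", "user-3", "user-1"):
        print(key, "-> shard index", shard_for_key(key, ranges))
```

Records with the same partition key always land on the same shard, which is what preserves per-key ordering.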

#Storage: Stream Buffer
[Diagram: Producers send records with partition keys (PK) through a Hash Function to shard/partition-1, shard/partition-2, and shard/partition-3. Records stay in each shard/partition from oldest data to newest data, and a marker shows the next consumer offset for each Consumer in the Consumer Group.]
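The "stream buffer" idea, where reading does not remove records and each consumer group only advances its own offset, can be sketched as follows. The `Partition` class and its method names are hypothetical; this is a toy model, not the Kinesis or Kafka client API.

```python
class Partition:
    """An append-only log; reading a record does not delete it."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer group name -> next offset to read

    def append(self, record):
        """Store a record at the newest end and return its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def poll(self, group, max_records=10):
        """Return the next records for a group and advance its offset."""
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Move a group's offset, for example to replay older records."""
        self.offsets[group] = offset


if __name__ == "__main__":
    partition = Partition()
    for value in range(5):
        partition.append(value)
    print(partition.poll("analytics", 3))  # first three records
    print(partition.poll("archiver", 2))   # an independent group starts at 0
    partition.seek("analytics", 0)         # replay from the oldest record
    print(partition.poll("analytics", 1))
```

Because each group keeps its own offset, a slow consumer does not hold back a fast one, and any consumer can replay records that are still within the retention period.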

Anatomy of Amazon Kinesis Data Streams & Amazon Managed Streaming for Kafka
[Diagram: Producers send records with partition keys (PK) through a Hash Function to shard/partition-1, shard/partition-2, and shard/partition-3, each ordered from oldest data to newest data. A marker shows the next consumer offset for each Consumer in the Consumer Group.]

Benefits of Stream Storage
• Decouple producers & consumers
• Persistent Buffer
• Collect multiple streams
• Preserve client ordering
• Parallel consumption
• Streaming MapReduce

Comparing Amazon Kinesis Data Streams to MSK

Kinesis Data Streams vs. MSK
Amazon Kinesis Data Streams
[Diagram: PutRecord {Data, StreamName, PartitionKey} passes each partition key (PK) through a Hash Function into a Stream made of Shard1, Shard2, and Shard3 records.]
Amazon Managed Streaming for Kafka
[Diagram: a Kafka Cluster with broker1, broker2, broker3, and ZooKeeper. The topic “zerg.hydra” has partitions P0, P1, and P2, each replicated on two brokers. "Px Ry" marks the active replica (id y) of partition x; a highlighted "Px Ry" means this broker is the leader for that partition. A producer (“zerg.hydra”) and a consumer (“zerg.hydra”) connect to the cluster. Brokers, producers, and consumers use ZooKeeper to manage and share state.]
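The `PutRecord {Data, StreamName, PartitionKey}` call from the diagram can be sketched like this. The helper `put_event` and the `RecordingClient` stand-in are hypothetical and exist only so the example runs without AWS access; the stand-in's response values are placeholders, not real service output.

```python
import json


def put_event(client, stream_name, event, key_field):
    """Send one event with PutRecord, using one of its fields as the partition key."""
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event[key_field]),
    )


class RecordingClient:
    """Stand-in for a Kinesis client; it only records the requests it receives."""

    def __init__(self):
        self.calls = []

    def put_record(self, **kwargs):
        self.calls.append(kwargs)
        # Placeholder response shape, for illustration only.
        return {"ShardId": "shardId-000000000000", "SequenceNumber": str(len(self.calls))}


if __name__ == "__main__":
    client = RecordingClient()
    put_event(client, "example-stream", {"user_id": 7, "action": "click"}, "user_id")
    print(client.calls[0]["PartitionKey"])
```

With real credentials, a client created by `boto3.client("kinesis")` could be passed in place of the stand-in, since its `put_record` method takes the same three keyword arguments.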

Amazon Kinesis Data Streams
• Operational Perspective
  § Number of streams?
  § Number of shards per stream?
• Throughput provisioning model
• Increase/Decrease number of shards
• Easy to use, cloud-native

Amazon Managed Streaming for Kafka
• Operational Perspective
  § A single cluster vs. multiple clusters?
  § Number of brokers per cluster?
  § Number of topics per broker?
  § Number of partitions per topic?
• Cluster provisioning model
• Only increase number of partitions; can’t decrease
• Fully managed with native Apache Kafka: easy lift-and-shift migration

Monitoring Metrics

MSK (Apache Kafka) Monitoring
• RequestQueue: Length, WaitTime
• ResponseQueue: Length, WaitTime
• Network: Packet Drop?
• Produce/Consume Rate Unbalance
• Who is Leader?
• Disk Full?
• Too many topics?
[Diagram: a Kafka Cluster with broker1, broker2, broker3, and ZooKeeper, hosting replicas of partitions P0, P1, and P2 of the topic “zerg.hydra”, with a producer and a consumer attached. "Px Ry" marks the active replica (id y) of partition x; a highlighted "Px Ry" means this broker is the leader for that partition.]

CloudWatch Metrics for MSK (Apache Kafka)

| Metric | Level | Description |
| --- | --- | --- |
| ActiveControllerCount | DEFAULT | Only one controller per cluster should be active at any given time. |
| OfflinePartitionsCount | DEFAULT | Total number of partitions that are offline in the cluster. |
| GlobalPartitionCount | DEFAULT | Total number of partitions across all brokers in the cluster. |
| GlobalTopicCount | DEFAULT | Total number of topics across all brokers in the cluster. |
| KafkaAppLogsDiskUsed | DEFAULT | The percentage of disk space used for application logs. |
| KafkaDataLogsDiskUsed | DEFAULT | The percentage of disk space used for data logs. |
| RootDiskUsed | DEFAULT | The percentage of the root disk used by the broker. |
| PartitionCount | PER_BROKER | The number of partitions for the broker. |
| LeaderCount | PER_BROKER | The number of leader replicas. |
| UnderMinIsrPartitionCount | PER_BROKER | The number of under minIsr partitions for the broker. |
| UnderReplicatedPartitions | PER_BROKER | The number of under-replicated partitions for the broker. |
| FetchConsumerTotalTimeMsMean | PER_BROKER | The mean total time in milliseconds that consumers spend on fetching data from the broker. |
| ProduceTotalTimeMsMean | PER_BROKER | The mean produce time in milliseconds. |
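A few of these metrics translate directly into alert rules. The sketch below assumes the latest metric values have already been fetched into plain dictionaries; the function name `msk_alerts`, the input shape, and the 80% disk threshold are assumptions for illustration, not AWS recommendations.

```python
def msk_alerts(cluster, brokers, disk_threshold_pct=80.0):
    """Return human-readable alerts from the latest metric values.

    cluster: dict of cluster-wide metric values.
    brokers: dict mapping broker id to that broker's metric values.
    """
    alerts = []
    if cluster.get("ActiveControllerCount") != 1:
        alerts.append("cluster: ActiveControllerCount should be exactly 1")
    if cluster.get("OfflinePartitionsCount", 0) > 0:
        alerts.append("cluster: offline partitions present")
    for broker_id in sorted(brokers):
        metrics = brokers[broker_id]
        if metrics.get("KafkaDataLogsDiskUsed", 0.0) >= disk_threshold_pct:
            alerts.append(f"broker {broker_id}: data log disk nearly full")
        if metrics.get("UnderReplicatedPartitions", 0) > 0:
            alerts.append(f"broker {broker_id}: under-replicated partitions")
        if metrics.get("UnderMinIsrPartitionCount", 0) > 0:
            alerts.append(f"broker {broker_id}: partitions under min ISR")
    return alerts


if __name__ == "__main__":
    cluster = {"ActiveControllerCount": 1, "OfflinePartitionsCount": 0}
    brokers = {
        1: {"KafkaDataLogsDiskUsed": 42.0, "UnderReplicatedPartitions": 0},
        2: {"KafkaDataLogsDiskUsed": 91.5, "UnderReplicatedPartitions": 3},
    }
    for alert in msk_alerts(cluster, brokers):
        print(alert)
```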

Kinesis Data Streams Monitoring
Writes: PutRecord {Data, StreamName, PartitionKey}
• up to 1 MB or 1,000 records per second, per shard
Reads: Consumer Application, GetRecords()
• 5 transactions per second, per shard. With only one consumer application, records can be retrieved every 200 ms.
• up to 10 MB per call (sustained reads are capped at 2 MB per second, per shard)
• up to 10,000 records per call
How long does a record stay in a shard?
[Diagram: PutRecord passes each partition key (PK) through a Hash Function into a Stream made of Shard1, Shard2, and Shard3 records; a Consumer Application calls GetRecords() to read the Data.]
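These per-shard limits give a rough way to size a provisioned stream. The sketch below uses the write limits from this slide plus the 2 MB per second shared read limit from the Kinesis quotas documentation; the function name `required_shards` and the assumption that every consumer reads the full stream through shared GetRecords throughput are illustrative.

```python
import math

WRITE_MB_PER_SEC = 1.0        # per-shard write limit, by size
WRITE_RECORDS_PER_SEC = 1000  # per-shard write limit, by record count
READ_MB_PER_SEC = 2.0         # per-shard shared read limit (assumed, from AWS quotas)


def required_shards(write_mb_per_sec, write_records_per_sec, consumers=1):
    """Estimate the shard count needed to stay under all per-shard limits."""
    by_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC)
    by_records = math.ceil(write_records_per_sec / WRITE_RECORDS_PER_SEC)
    # Every consumer application re-reads everything that was written.
    by_reads = math.ceil(write_mb_per_sec * consumers / READ_MB_PER_SEC)
    return max(1, by_bytes, by_records, by_reads)


if __name__ == "__main__":
    print(required_shards(5, 2000))               # limited by write size
    print(required_shards(1, 4500))               # limited by record count
    print(required_shards(4, 1000, consumers=3))  # limited by read throughput
```

The estimate ignores hot partition keys: a skewed key distribution can throttle one shard even when the stream as a whole is under its limits.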

CloudWatch Metrics for Kinesis Data Streams

| Metric | Description |
| --- | --- |
| GetRecords.IteratorAgeMilliseconds | Age of the last record in all GetRecords calls |
| ReadProvisionedThroughputExceeded | Number of GetRecords calls throttled |
| WriteProvisionedThroughputExceeded | Number of PutRecord(s) calls throttled |
| PutRecord.Success, PutRecords.Success | Number of successful PutRecord(s) operations |
| GetRecords.Success | Number of successful GetRecords operations |
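`GetRecords.IteratorAgeMilliseconds` is most useful when compared with the stream's retention period: once a consumer falls further behind than the retention period, unread records expire. The sketch below assumes the default 24-hour retention; the function name `iterator_age_status`, the status labels, and the 50% warning ratio are hypothetical choices.

```python
def iterator_age_status(iterator_age_ms, retention_hours=24, warn_ratio=0.5):
    """Classify consumer lag relative to the stream's retention period."""
    retention_ms = retention_hours * 60 * 60 * 1000
    if iterator_age_ms >= retention_ms:
        return "DATA_LOSS_RISK"  # unread records may already have expired
    if iterator_age_ms >= warn_ratio * retention_ms:
        return "WARN"            # the consumer is falling well behind
    return "OK"


if __name__ == "__main__":
    one_hour_ms = 60 * 60 * 1000
    for hours in (0, 13, 24):
        print(hours, "h behind ->", iterator_age_status(hours * one_hour_ms))
```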

Choosing the Right Metrics
Too Much = Useless = Too Little

Kafka vs MSK vs Kinesis Data Streams
[Chart: Operational Excellence plotted against Degree of Freedom ≈ Complexity, positioning Kinesis Data Streams, Apache Kafka, Amazon MSK, and Amazon MSK Serverless.]

Comparison Summary

| Attribute | Apache Kafka | Kinesis Data Streams | Managed Streaming for Kafka |
| --- | --- | --- | --- |
| Cost | $$$ | $ (pay for what you use) | $$ (pay for infrastructure) |
| Ease of use | Advanced setup required | Get started in minutes | Get started in minutes |
| Management Overhead | High | Low | Low |
| Scalability | Difficult to scale | Scale in seconds with one click | Scale in minutes with one click |
| Throughput | Infinite | Scales with shards, supports up to 1 MB payloads | Infinite |
| Durability | Configurable | 3x by default | Configurable |
| Infrastructure | You manage | AWS manages | AWS manages |
| Write-to-Read Latency | <100 ms is achievable | <100 ms (with HTTP/2) | <100 ms is achievable |
| Open Sourced? | Yes | No | Yes |

Common Architecture Patterns

Near-real-time analytics with Data Lake (Lambda Architecture)
[Diagram components: Data sources, Capture, Transformation, Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon S3, Amazon Athena, Amazon QuickSight (Visualization), Amazon OpenSearch Service Dashboard (Visualization)]

Near-real-time analytics with Data Warehouse (Kappa Architecture)
[Diagram components: Data sources; Amazon Kinesis Data Streams and Amazon MSK; Amazon Redshift / Redshift Serverless with a Streaming Table, a Real-time Materialized View (Auto Refresh), and Permanent Tables; Amazon QuickSight]

Platform modernization with change data capture (CDC) with AWS DMS
[Diagram components: Amazon RDS, CDC, AWS DMS, Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, AWS Lambda, Amazon S3 (Data lake), Amazon Athena, Amazon QuickSight (Visualization), Amazon OpenSearch Service Dashboard (Visualization)]

Platform modernization with change data capture (CDC) with Amazon MSK Connect
[Diagram components: Amazon RDS, CDC, AWS DMS, MSK Connect, Amazon MSK, MSK Connect, AWS Lambda, Amazon S3 (Data lake), Amazon Athena, Amazon QuickSight (Visualization), Amazon OpenSearch Service Dashboard (Visualization)]

Tips

AWS Lambda + Kinesis Data Streams & MSK: Batch Size / Batch Window
[Diagram: Amazon Kinesis Data Streams and Amazon Managed Streaming for Kafka as event sources for AWS Lambda]
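Batch size and batch window together decide when buffered records are handed to the function: a batch is delivered when it is full or when its window has elapsed, whichever comes first. The simulation below is a simplified sketch of that rule only; the function name `make_batches` is hypothetical, and the real service applies further conditions (such as a payload size limit) that are not modeled here. In a Lambda event source mapping, the corresponding settings are `BatchSize` and `MaximumBatchingWindowInSeconds`.

```python
def make_batches(arrivals, batch_size, window_seconds):
    """Group records into batches by size or by elapsed window time.

    arrivals: list of (timestamp_seconds, record), sorted by timestamp.
    """
    batches = []
    current = []
    window_start = None
    for timestamp, record in arrivals:
        # Flush the open batch if its window has already elapsed.
        if current and timestamp - window_start >= window_seconds:
            batches.append(current)
            current = []
        if not current:
            window_start = timestamp  # a new window opens with the first record
        current.append(record)
        # Flush as soon as the batch is full.
        if len(current) >= batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    arrivals = [(0, "a"), (1, "b"), (2, "c"), (10, "d"), (11, "e")]
    print(make_batches(arrivals, batch_size=2, window_seconds=100))
    print(make_batches(arrivals, batch_size=10, window_seconds=5))
```

A small batch size with no window keeps latency low but increases the number of invocations; a larger window trades latency for fewer, fuller batches.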

AWS Lambda + MSK: Consumer Group ID
[Diagram: Amazon Managed Streaming for Kafka as an event source for AWS Lambda]

Kafka Consumer Group: Fan-out
[Diagram: a Topic with Partition 0, Partition 1, and Partition 2 is read by Consumer Group 1 and by Consumer Group 2, each with Consumer 0, Consumer 1, and Consumer 2.]

Kafka Consumer Group
[Diagram: two cases for a Topic with Partition 0, Partition 1, and Partition 2. In the first, the Consumer Group has only Consumer 0. In the second, the Consumer Group has Consumer 0, Consumer 1, and Consumer 2.]

Kafka Consumer Group: Mismatched Partitions
[Diagram: three cases for a Topic with Partition 0, Partition 1, and Partition 2.
• GOOD: a Consumer Group with Consumer 0, Consumer 1, and Consumer 2
• BAD: a Consumer Group with Consumer 0 and Consumer 1
• BAD: a Consumer Group with Consumer 0, Consumer 1, Consumer 2, and Consumer 3 (marked X)]
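The three cases can be reproduced with a simple round-robin assignment. This is a sketch under the assumption of a plain round-robin strategy; Kafka's actual assignors (range, round-robin, sticky, cooperative) differ in detail, and the function name `assign_partitions` is hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Assign each partition to exactly one consumer in the group, round-robin."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    partitions = [0, 1, 2]
    # Matched: one partition per consumer.
    print(assign_partitions(partitions, ["c0", "c1", "c2"]))
    # Fewer consumers than partitions: the load is uneven.
    print(assign_partitions(partitions, ["c0", "c1"]))
    # More consumers than partitions: one consumer stays idle.
    print(assign_partitions(partitions, ["c0", "c1", "c2", "c3"]))
```

Within one consumer group a partition has a single owner, so adding consumers beyond the partition count adds no read parallelism.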

Key Takeaways
• Distributed Queue as Stream Storage
  § Preserve Ordering
  § Parallel Consumption
  § Persistent Buffer
  § Decouple producers & consumers
• Trade-off: Operational Excellence vs. Degree of Freedom
  § MUST keep an eye on the right monitoring metrics

Resources
• Streaming Data Solution for Amazon Kinesis
  § https://aws.amazon.com/ko/solutions/implementations/streaming-data-solution-for-amazon-kinesis/
• Amazon MSK Labs
  § https://catalog.workshops.aws/msk-labs/en-US
• Real Time Streaming with Amazon Kinesis
  § https://catalog.workshops.aws/real-time-streaming-with-kinesis/en-US
• Building Serverless Business Intelligent System from Scratch
  § https://serverless-bi-system-from-scratch.workshop.aws/
• Data Pipeline using AWS DMS and Kinesis
  § https://catalog.us-east-1.prod.workshops.aws/workshops/4da54890-23fc-4b9a-80cd-3a0ca3279b3f/en-US
• Amazon Redshift Streaming Ingestion Patterns
  § https://github.com/aws-samples/redshift-streaming-ingestion-patterns
