rights reserved.
Agenda
• Key Components of Real-time Analytics
• Anatomy of Amazon Kinesis Data Streams & MSK
• Comparing Amazon Kinesis Data Streams to MSK
• Monitoring Metrics
• Common Architecture Patterns
• Key Takeaways

From Batch to Real-time: Lambda Architecture
[Diagram: Data Source, Stream Ingestion and Stream Storage feed two layers whose results are merged for the Consumer]
• Speed Layer: Streaming Data, Stream Process, Real-time View
• Batch Layer: Stream Delivery, Raw Data Storage, Batch Process, Batch View
• Serving Layer: Query & Merge Results for the Consumer

Lambda Architecture
[Diagram: three layers]
• Batch Layer: Raw Data, Batch Process, Batch View
• Speed Layer: Streaming Data, Stream Process, Real-time View
• Serving Layer: Query over the Batch View and the Real-time View

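The serving layer's "query and merge" step can be illustrated with a minimal sketch. It assumes both views are simple per-key counts; the names `merge_views`, `batch_view` and `realtime_view` and the sample data are hypothetical and not taken from the slides.

```python
from collections import Counter


def merge_views(batch_view, realtime_view):
    """Answer a query by adding recent real-time counts to precomputed batch counts."""
    merged = Counter(batch_view)    # counts computed by the batch layer
    merged.update(realtime_view)    # Counter.update adds counts per key
    return dict(merged)


# Hypothetical page-view counts: the batch view covers history,
# the real-time view covers events since the last batch run.
batch_view = {"page_a": 100, "page_b": 40}
realtime_view = {"page_a": 3, "page_c": 1}
print(merge_views(batch_view, realtime_view))
```

The sketch only works because counts are additive; views that hold non-additive aggregates (for example distinct counts) would need a different merge rule.
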
Key Components of Real-time Analytics
[Pipeline: Data Source, Stream Ingestion, Stream Storage, Stream Process, Data Sink]
• Data Source: Devices and/or applications that produce real-time data at high velocity
• Stream Ingestion: Data from tens of thousands of data sources can be written to a single stream
• Stream Storage: Data is stored in the order it was received for a set duration of time and can be replayed any number of times during that time
• Stream Process: Records are read in the order they are produced, enabling real-time analytics or streaming ETL
• Data Sink: Data lake (most common), Database (least common)

Stream Storage
[Pipeline: Data Source, Stream Ingestion, Stream Storage, Stream Process, Data Sink]
• Amazon Kinesis Data Streams
• Amazon Managed Streaming for Apache Kafka (Amazon MSK)

Kinesis Data Streams vs. MSK
• Amazon Kinesis Data Streams
§ Producers call PutRecord {Data, StreamName, PartitionKey}
§ A Hash Function maps each partition key (PK) to a shard
§ Stream: Shard1 Records, Shard2 Records, Shard3 Records
• Amazon Managed Streaming for Apache Kafka (Amazon MSK)
§ Kafka Cluster: broker1, broker2, broker3, ZooKeeper
§ Topic “zerg.hydra” with partition replicas P0 R1, P2 R1, P0 R2, P1 R2, P1 R3, P2 R3
§ Px Ry: active replica (id y) of partition x for topic “zerg.hydra”
§ Px Ry (leader marking): active replica (id y) of partition x; this broker is leader for that partition
§ producer (“zerg.hydra”), consumer (“zerg.hydra”)
§ Brokers, producers and consumers use ZooKeeper to manage and share state.

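The "Hash Function" box on the Kinesis side can be sketched as follows. Kinesis Data Streams uses an MD5 hash to turn a partition key into a 128-bit integer and routes the record to the shard whose hash key range contains it. The sketch assumes a stream whose shards split the hash space evenly; the helper names `hash_key`, `even_shard_ranges` and `shard_for` are hypothetical, and shard indexes here are 0-based.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest is a 128-bit value


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer via MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_shard_ranges(shard_count):
    """Split the hash key space into inclusive ranges (assumption: evenly split shards)."""
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose hash key range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash key ranges do not cover the hash space")


ranges = even_shard_ranges(3)
for key in ("device-1", "device-2", "device-3"):
    print(key, "-> shard index", shard_for(key, ranges))
```

Because the mapping depends only on the key, all records with the same partition key land on the same shard, which is what preserves per-key ordering.
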
Amazon Kinesis Data Streams
• Operational Perspective
§ Number of streams?
§ Number of shards per stream?
• Throughput provisioning model
• Increase/Decrease number of shards
• Easy to use, cloud-native.
Amazon Managed Streaming for Apache Kafka (Amazon MSK)
• Operational Perspective
§ A single cluster vs. multiple clusters?
§ Number of brokers per cluster?
§ Number of topics per broker?
§ Number of partitions per topic?
• Cluster provisioning model
• Only increase number of partitions; can’t decrease
• Fully managed with native Apache Kafka: easy lift-and-shift migration

MSK (Apache Kafka) Monitoring
• RequestQueue: Length, WaitTime
• ResponseQueue: Length, WaitTime
• Network: Packet Drop?
• Produce/Consume Rate Imbalance
• Who is Leader?
• Disk Full?
• Too many topics?
[Diagram: Kafka Cluster with broker1, broker2, broker3 and ZooKeeper; producer (“zerg.hydra”) and consumer (“zerg.hydra”); partition replicas Px Ry, where Px Ry is the active replica (id y) of partition x and the leader replica is marked per broker]

CloudWatch Metrics for MSK (Apache Kafka)
• Level: DEFAULT
§ ActiveControllerCount: Only one controller per cluster should be active at any given time.
§ OfflinePartitionsCount: Total number of partitions that are offline in the cluster.
§ GlobalPartitionCount: Total number of partitions across all brokers in the cluster.
§ GlobalTopicCount: Total number of topics across all brokers in the cluster.
§ KafkaAppLogsDiskUsed: The percentage of disk space used for application logs.
§ KafkaDataLogsDiskUsed: The percentage of disk space used for data logs.
§ RootDiskUsed: The percentage of the root disk used by the broker.
• Level: PER_BROKER
§ PartitionCount: The number of partitions for the broker.
§ LeaderCount: The number of leader replicas.
§ UnderMinIsrPartitionCount: The number of under minIsr partitions for the broker.
§ UnderReplicatedPartitions: The number of under-replicated partitions for the broker.
§ FetchConsumerTotalTimeMsMean: The mean total time in milliseconds that consumers spend on fetching data from the broker.
§ ProduceTotalTimeMsMean: The mean produce time in milliseconds.

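One way to act on these metrics is a simple rule check over a snapshot of their values. The sketch below is illustrative only: the function name `msk_findings`, the flat `sample` dictionary, and the 80% disk threshold are assumptions, not AWS recommendations, and fetching the values from CloudWatch is left out.

```python
def msk_findings(sample, disk_threshold_pct=80.0):
    """Return human-readable findings for a snapshot of MSK metric values.

    `sample` maps metric names from the list above to their latest values
    (hypothetical input shape). Metrics missing from the snapshot are skipped.
    """
    findings = []

    controllers = sample.get("ActiveControllerCount")
    if controllers is not None and controllers != 1:
        findings.append(f"ActiveControllerCount is {controllers}, expected 1")

    # These counts should stay at zero in a healthy cluster.
    for name in ("OfflinePartitionsCount", "UnderReplicatedPartitions",
                 "UnderMinIsrPartitionCount"):
        value = sample.get(name)
        if value is not None and value > 0:
            findings.append(f"{name} is {value}, expected 0")

    # Disk metrics are percentages; the threshold is an assumed alert level.
    for name in ("KafkaDataLogsDiskUsed", "KafkaAppLogsDiskUsed", "RootDiskUsed"):
        value = sample.get(name)
        if value is not None and value >= disk_threshold_pct:
            findings.append(f"{name} is {value}%, at or above {disk_threshold_pct}%")

    return findings


snapshot = {
    "ActiveControllerCount": 1,
    "OfflinePartitionsCount": 0,
    "UnderReplicatedPartitions": 2,
    "KafkaDataLogsDiskUsed": 91.5,
}
for finding in msk_findings(snapshot):
    print(finding)
```
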
Kinesis Data Streams Monitoring
• Writes: PutRecord {Data, StreamName, PartitionKey}
§ Up to 1 MB or 1,000 records per second, per shard
§ A Hash Function maps each partition key (PK) to a shard (Shard1 Records, Shard2 Records, Shard3 Records)
• Reads: the Consumer Application calls GetRecords()
§ 5 transactions per second, per shard
§ With only one consumer application, records can be retrieved every 200 ms
§ Up to 2 MB per second, per shard
§ Up to 10 MB and up to 10,000 records per call
• How long does a record stay in a shard?

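The per-shard limits above translate into simple sizing arithmetic. The following sketch uses only the write limits (1 MB or 1,000 records per second, per shard) and the read limit of 5 GetRecords transactions per second, per shard; the function names and the sample workload are hypothetical.

```python
import math

WRITE_MB_PER_SEC_PER_SHARD = 1.0        # write limit per shard
WRITE_RECORDS_PER_SEC_PER_SHARD = 1000  # write limit per shard
READ_TPS_PER_SHARD = 5                  # GetRecords transactions per second, per shard


def shards_needed(write_mb_per_sec, write_records_per_sec):
    """Smallest shard count that satisfies both write limits."""
    by_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC_PER_SHARD)
    by_records = math.ceil(write_records_per_sec / WRITE_RECORDS_PER_SEC_PER_SHARD)
    return max(1, by_bytes, by_records)


def min_poll_interval_ms(consumer_apps):
    """Fastest polling interval per app when apps share the 5 TPS read limit."""
    return 1000 * consumer_apps / READ_TPS_PER_SHARD


print(shards_needed(4.5, 2500))    # 5: the 4.5 MB/s rate is the binding limit
print(min_poll_interval_ms(1))     # 200.0, matching the 200 ms figure above
print(min_poll_interval_ms(2))     # 400.0: two apps sharing a shard poll half as often
```

The second function shows why adding consumer applications that poll the same shard raises read latency: the 5 TPS budget is shared among them.
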
CloudWatch Metrics for Kinesis Data Streams
• GetRecords.IteratorAgeMilliseconds: Age of the last record in all GetRecords calls
• ReadProvisionedThroughputExceeded: Number of GetRecords calls throttled
• WriteProvisionedThroughputExceeded: Number of PutRecord(s) calls throttled
• PutRecord.Success, PutRecords.Success: Number of successful PutRecord(s) operations
• GetRecords.Success: Number of successful GetRecords operations

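GetRecords.IteratorAgeMilliseconds is most useful when compared with the stream's retention period, because a consumer that falls behind by more than the retention period loses records. The sketch below assumes the default 24-hour retention; the function names, the status labels and the 50% warning level are hypothetical choices for illustration.

```python
MS_PER_HOUR = 3600 * 1000


def retention_used(iterator_age_ms, retention_hours=24):
    """Fraction of the retention period consumed by consumer lag."""
    return iterator_age_ms / (retention_hours * MS_PER_HOUR)


def lag_status(iterator_age_ms, retention_hours=24, warn_at=0.5):
    """Classify consumer lag; `warn_at` is an assumed alert level."""
    used = retention_used(iterator_age_ms, retention_hours)
    if used >= 1.0:
        return "critical"   # records may expire before they are read
    if used >= warn_at:
        return "warning"
    return "ok"


print(lag_status(5 * 60 * 1000))       # five minutes behind
print(lag_status(12 * MS_PER_HOUR))    # half of a 24-hour retention period
print(lag_status(24 * MS_PER_HOUR))    # lag equals the retention period
```
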
Kafka vs MSK vs Kinesis Data Streams
[Chart: Operational Excellence against Degree of Freedom ≈ Complexity, positioning Kinesis Data Streams, Apache Kafka, Amazon MSK and Amazon MSK Serverless]

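Comparison Summary
• Cost
§ Apache Kafka: $$$
§ Kinesis Data Streams: $ (pay for what you use)
§ Managed Streaming for Apache Kafka: $$ (pay for infrastructure)
• Ease of use
§ Apache Kafka: Advanced setup required
§ Kinesis Data Streams: Get started in minutes
§ Managed Streaming for Apache Kafka: Get started in minutes
• Management Overhead
§ Apache Kafka: High
§ Kinesis Data Streams: Low
§ Managed Streaming for Apache Kafka: Low
• Scalability
§ Apache Kafka: Difficult to scale
§ Kinesis Data Streams: Scale in seconds with one click
§ Managed Streaming for Apache Kafka: Scale in minutes with one click
• Throughput
§ Apache Kafka: Infinite
§ Kinesis Data Streams: Scales with shards, supports up to 1 MB payloads
§ Managed Streaming for Apache Kafka: Infinite
• Durability
§ Apache Kafka: Configurable
§ Kinesis Data Streams: 3x by default
§ Managed Streaming for Apache Kafka: Configurable
• Infrastructure
§ Apache Kafka: You manage
§ Kinesis Data Streams: AWS manages
§ Managed Streaming for Apache Kafka: AWS manages
• Write-to-Read Latency
§ Apache Kafka: <100 ms is achievable
§ Kinesis Data Streams: <100 ms (with HTTP/2)
§ Managed Streaming for Apache Kafka: <100 ms is achievable
• Open Sourced?
§ Apache Kafka: Yes
§ Kinesis Data Streams: No
§ Managed Streaming for Apache Kafka: Yes
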
Near-real-time analytics with Data Lake (Lambda Architecture)
[Diagram: Data sources, Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon S3, Amazon Athena, Amazon OpenSearch Service (Dashboard), Amazon QuickSight; stages labeled Capture, Transformation and Visualization]

Near-real-time analytics with Data Warehouse (Kappa Architecture)
[Diagram: Data sources, Amazon Kinesis Data Streams, Amazon MSK, Amazon Redshift / Redshift Serverless, Amazon QuickSight]
• Amazon Redshift / Redshift Serverless: Streaming Table, Real-time Materialized View (Auto Refresh), Permanent Tables

Platform modernization with change data capture (CDC) with AWS DMS
[Diagram: Amazon RDS, AWS DMS (CDC), Amazon Kinesis Data Streams, AWS Lambda, Amazon Kinesis Data Firehose, Amazon S3 (Data lake), Amazon Athena, Amazon OpenSearch Service (Dashboard), Amazon QuickSight; Visualization stage]

AWS Lambda + Kinesis Data Streams & MSK
[Diagram: Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK) feeding AWS Lambda; tuning knobs: Batch Size / Batch Window]

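With Kinesis Data Streams as the event source, Lambda invokes the function with a batch of records; the batch size and batch window (the `BatchSize` and `MaximumBatchingWindowInSeconds` settings of the event source mapping) decide how many records arrive per invocation. A minimal handler sketch follows. It assumes producers write JSON payloads; the returned dictionary and the locally built sample event are for demonstration only.

```python
import base64
import json


def handler(event, context):
    """Decode one batch of Kinesis records delivered to Lambda."""
    payloads = []
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded in the Lambda event.
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw))  # assumption: payloads are JSON
    # Returned only so the sketch can be inspected locally.
    return {"batchSize": len(payloads), "payloads": payloads}


def make_record(payload, partition_key):
    """Build a record shaped like a Kinesis event entry (local testing helper)."""
    data = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    return {"kinesis": {"partitionKey": partition_key, "data": data}}


sample_event = {"Records": [make_record({"temp": 21}, "device-1"),
                            make_record({"temp": 22}, "device-2")]}
print(handler(sample_event, None)["batchSize"])
```

A larger batch size or longer batch window means fewer invocations with more records each, at the cost of added latency before a record is processed.
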
Key Takeaways
• Distributed Queue as Stream Storage
§ Preserve Ordering
§ Parallel Consumption
§ Persistent Buffer
§ Decouple producers & consumers
• Trade-off: Operational Excellence vs. Degree of Freedom
§ MUST keep an eye on the right monitoring metrics

Resources
• Streaming Data Solution for Amazon Kinesis
§ https://aws.amazon.com/ko/solutions/implementations/streaming-data-solution-for-amazon-kinesis/
• Amazon MSK Labs
§ https://catalog.workshops.aws/msk-labs/en-US
• Real Time Streaming with Amazon Kinesis
§ https://catalog.workshops.aws/real-time-streaming-with-kinesis/en-US
• Building Serverless Business Intelligent System from Scratch
§ https://serverless-bi-system-from-scratch.workshop.aws/
• Data Pipeline using AWS DMS and Kinesis
§ https://catalog.us-east-1.prod.workshops.aws/workshops/4da54890-23fc-4b9a-80cd-3a0ca3279b3f/en-US
• Amazon Redshift Streaming Ingestion Patterns
§ https://github.com/aws-samples/redshift-streaming-ingestion-patterns