
Lee Dongjin
October 25, 2018

Kafka Streams vs. Spark Structured Streaming (extended)

Kafka Streams vs. Spark Structured Streaming: how to use them, how they work under the hood, their advantages and disadvantages, and when to use each.
Presented at an SK Telecom internal seminar, October 2018.

Slides: English. Presentation: Korean.

Transcript

  1. Introduction
     • So many streaming frameworks / libraries
       ◦ RxJava, Spring Reactor, Akka Streams, Flink, Samza, Storm, …
       ◦ What to use?!
     • Spark Structured Streaming vs. Kafka Streams
       ◦ Advantages, disadvantages, and trade-offs
       ◦ When to use, or not to use
  2. Spark Structured Streaming: Overview (1)
     • Stream processing engine based on Spark SQL (since Spark 2.0)
       ◦ API: RDD → DataFrame
       ◦ Execution: batch → streaming (micro-batch, continuous)
     [Diagram: Spark Core (1.x) / Spark Streaming → Spark SQL (2.x) / Spark Structured Streaming]
  3. Spark Structured Streaming: Overview (2)
     • Describes the processing logic with Spark SQL operations
       ◦ Easy to learn: almost identical to normal batch SQL
         ▪ Except: Source, Sink, Trigger, Output Mode, Watermark, etc. (see the sketch below)
       ◦ Provides various data sources and functions
       ◦ Optimized by the Catalyst optimizer
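The WordCount example on the next two slides shows the streaming-specific Source, Sink, and Output Mode. As a minimal sketch of the remaining pieces (watermark and trigger), assuming a hypothetical `events` DataFrame with `timestamp` and `word` columns:

     import org.apache.spark.sql.functions.{col, window}
     import org.apache.spark.sql.streaming.Trigger

     // Windowed running counts; state for events arriving more than
     // 10 minutes late is dropped thanks to the watermark.
     val windowedCounts = events
       .withWatermark("timestamp", "10 minutes")
       .groupBy(window(col("timestamp"), "5 minutes"), col("word"))
       .count()

     val query = windowedCounts.writeStream
       .outputMode("update")                        // emit only rows updated in this trigger
       .trigger(Trigger.ProcessingTime("1 minute")) // start a micro-batch every minute
       .format("console")
       .start()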
  4. Spark Structured Streaming: WordCount (1)

     // Create DataFrame representing the stream of input lines
     // from connection to localhost:9999
     val lines = spark.readStream
       .format("socket")
       .option("host", "localhost")
       .option("port", 9999)
       .load()

     // Split the lines into words & generate running word count
     val words = lines.as[String].flatMap(_.split(" "))
     val wordCounts = words.groupBy("value").count()
  5. Spark Structured Streaming: WordCount (2)

     // Start running the query that prints
     // the running counts to the console
     val query = wordCounts.writeStream
       .outputMode("complete")
       .format("console")
       .start()

     query.awaitTermination()
  6. Kafka Streams: Overview (1)
     • A stream processing library that directly integrates with Kafka
       ◦ Since Kafka 0.10.0
       ◦ Doesn't need a special runtime like YARN: runs in a normal Java process
       ◦ Masterless: runs on top of Kafka's consumer group
  7. Kafka Streams: Overview (2)
     • Describes the processing logic as a graph of processors
       ◦ A 'processing topology'
       ◦ Source, Sink: subclasses of Processor
     • With…
       ◦ High-level DSL (a.k.a. KStream API)
         ▪ Recommended
       ◦ Low-level API
  8. Kafka Streams: WordCount (1)

     // Build Topology with StreamsBuilder
     final StreamsBuilder builder = new StreamsBuilder();

     // KStream: unbounded series of records
     final KStream<String, String> source = builder.stream(inputTopic);

     // Transform input records into a stream of words with the `flatMapValues` method
     final KStream<String, String> tokenized = source
         .flatMapValues(value -> Arrays.asList(
             value.toLowerCase(Locale.getDefault()).split(" ")));
  9. Kafka Streams: WordCount (2)

     // KTable: stateful abstraction of an aggregated stream
     // Build a KTable from the KStream by group and aggregate operations
     final KTable<String, Long> counts = tokenized
         .groupBy((key, value) -> value)
         .count();

     // Convert the KTable to a KStream
     final KStream<String, Long> changeLog = counts.toStream();

     // Write back to the output Kafka topic
     changeLog.to(outputTopic, Produced.with(Serdes.String(), Serdes.Long()));

     // Build the Topology instance
     return builder.build();
  10. Kafka Streams: WordCount (3)

      Properties props = ...   // Configuration properties
      Topology topology = ...  // Topology object

      final KafkaStreams streams = new KafkaStreams(topology, props);

      /* Omit some boilerplate code... */

      // Start the Kafka Streams application
      streams.start();
  11. Kafka Streams: Low-Level API
      • You can define Processor classes manually
        ◦ Example: WordCountProcessor (sketched below)
        ◦ In fact, the DSL builds Processor instances internally
      • Achieves efficiency & fault tolerance with…
        ◦ Intermediate topics
        ◦ StateStore
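The deck only names WordCountProcessor; a minimal sketch of what such a processor could look like against the classic (pre-2.8) Processor API, written here in Scala with an illustrative store name "counts":

      import org.apache.kafka.streams.processor.{Processor, ProcessorContext}
      import org.apache.kafka.streams.state.KeyValueStore

      class WordCountProcessor extends Processor[String, String] {
        private var counts: KeyValueStore[String, java.lang.Long] = _

        override def init(context: ProcessorContext): Unit =
          // The store must be attached to this processor node in the Topology.
          counts = context.getStateStore("counts")
            .asInstanceOf[KeyValueStore[String, java.lang.Long]]

        override def process(key: String, line: String): Unit =
          line.toLowerCase.split(" ").foreach { word =>
            val previous = Option(counts.get(word)).fold(0L)(_.longValue)
            counts.put(word, java.lang.Long.valueOf(previous + 1))
          }

        override def close(): Unit = ()
      }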
  12. Kafka Streams: StateStore
      • Local key-value store, either in-memory or persistent
        ◦ Persistent stores are implemented with RocksDB: fast!
        ◦ Backed by a changelog topic: easy to restore! (see the wiring sketch below)
      • Under the hood
        ◦ e.g., KTable
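As a sketch of how such a store gets wired in, here is the hypothetical WordCountProcessor from above attached to a persistent (RocksDB-backed) store; all names are illustrative:

      import org.apache.kafka.common.serialization.Serdes
      import org.apache.kafka.streams.Topology
      import org.apache.kafka.streams.processor.ProcessorSupplier
      import org.apache.kafka.streams.state.Stores

      val supplier: ProcessorSupplier[String, String] = () => new WordCountProcessor

      val topology = new Topology()
      topology.addSource("Source", "input-topic")
      topology.addProcessor("WordCount", supplier, "Source")
      // Persistent RocksDB store; Kafka Streams backs it with a changelog
      // topic by default, so the state can be restored after a failure.
      topology.addStateStore(
        Stores.keyValueStoreBuilder(
          Stores.persistentKeyValueStore("counts"),
          Serdes.String(),
          Serdes.Long()),
        "WordCount")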
  13. How Spark Structured Streaming works (1)
      • DataFrame: a container of a QueryExecution
        ◦ Transformation methods: return a new DataFrame object with an updated LogicalPlan (see the example below)
          ▪ map, filter, select, ...
        ◦ Action methods: trigger computation and return results
          ▪ count, show, ...
      [Diagram: DataFrame (provides the API) → QueryExecution (handles the primary workflow for executing the LogicalPlan) → LogicalPlan (describes the logical operation)]
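A small illustration of the transformation/action split, assuming the usual `spark` session:

      import org.apache.spark.sql.functions.col

      // Transformations only build up the LogicalPlan; nothing executes yet.
      val df = spark.range(1000)                 // new Dataset with its QueryExecution
      val evens = df.filter(col("id") % 2 === 0) // new Dataset, updated LogicalPlan

      // The action forces analysis, optimization, and execution of the plan.
      val n = evens.count()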
  14. How Spark Structured Streaming works (2)
      • When an action method is called, the LogicalPlan is translated into RDD operations
        ◦ And finally, into Tasks (these phases can be inspected; see below)
      [Diagram: (Unresolved) LogicalPlan → resolve variables with the Catalog → (Resolved) LogicalPlan → logical optimization (e.g., predicate pushdown) → (Optimized) LogicalPlan → convert into RDD operations → SparkPlan]
      * RDD operations are divided into Tasks and run by Executors.
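These phases can be inspected directly: `explain(true)` prints the parsed (unresolved), analyzed (resolved), and optimized logical plans, followed by the physical plan. A minimal sketch, assuming the usual `spark` session:

      import org.apache.spark.sql.functions.col

      val evens = spark.range(1000).filter(col("id") % 2 === 0)
      evens.explain(true) // Parsed -> Analyzed -> Optimized Logical Plan -> Physical Plan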
  15. How Spark Structured Streaming works (3)
      • Then, what happens with streaming?
        ◦ StreamExecution holds:
          ▪ An (almost) resolved LogicalPlan
          ▪ Trigger
          ▪ Output Mode
          ▪ Output Sink
      [Diagram: StreamExecution wrapping a LogicalPlan]
  16. How Spark Structured Streaming works (4)
      • The driver triggers the StreamExecution periodically
        ◦ The driver checks for newly arrived records (e.g., the latest offsets of a Kafka topic; see the source sketch below)
        ◦ It clones the LogicalPlan, fills it with the arrived records, and runs it through the normal workflow
          ▪ In other words, from then on the whole workflow is identical to a batch computation
        ◦ For each Task, an Executor requests the records of the given offset range from the Kafka brokers
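For reference, a sketch of the Kafka source this slide describes; broker address and topic name are illustrative:

      // On each trigger, the driver asks the brokers for the latest offsets;
      // each task then fetches its assigned offset range from Kafka.
      val kafkaLines = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "input-topic")
        .load()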
  17. How Spark Structured Streaming works (5)

      // Creates the DataFrame, along with the contained LogicalPlan.
      val lines = spark.readStream
        ...
        .load()

      ...

      val query = wordCounts.writeStream
        .outputMode("complete")
        .format("console")
        // Resolves the LogicalPlan, creates a StreamExecution instance,
        // and has the StreamingQueryManager start it.
        .start()

      query.awaitTermination()
  18. How Kafka Streams works (1)
      • Built on top of Kafka's consumer group feature
        ◦ Automatically divides the records into disjoint sets
      • StreamTask
        ◦ Created per input partition
        ◦ e.g., a streams topology with input topics A (2 partitions) and B (3 partitions): 3 StreamTasks!
        ◦ Run by a thread pool (see the config sketch below)
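A sketch of sizing the thread pool for that example (3 StreamTasks); the values are illustrative:

      import java.util.Properties
      import org.apache.kafka.streams.StreamsConfig

      val props = new Properties()
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount")
      props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
      // Up to 3 threads can each own one StreamTask here; with fewer threads,
      // tasks are multiplexed over them, and extra threads would sit idle.
      props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "3")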
  19. Comparison
      • Deployment
        ◦ Kafka Streams: standalone Java application
        ◦ Spark Structured Streaming: Spark Executors (mostly on a YARN cluster)
      • Streaming sources
        ◦ Kafka Streams: Kafka only
        ◦ Spark Structured Streaming: Kafka, file systems, Kinesis, ...
      • Fault tolerance
        ◦ Kafka Streams: StateStore, backed by a changelog topic
        ◦ Spark Structured Streaming: RDD cache
      • Syntax
        ◦ Kafka Streams: low-level Processor API / high-level DSL
        ◦ Spark Structured Streaming: Spark SQL
      • Semantics
        ◦ Kafka Streams: simple
        ◦ Spark Structured Streaming: rich (with query optimization)
  20. Conclusion
      • Spark Structured Streaming for rich semantics
        ◦ ETL tasks
        ◦ e.g., joining records with an RDBMS, running an ML pipeline, ...
      • Kafka Streams for lightweight manipulation of Kafka topics
        ◦ Preprocessing Kafka topics
        ◦ Microservices running on Kafka topics
        ◦ Event-based prediction (e.g., Kafka Streams with TensorFlow)
  21. Questions?
      • Slides
        ◦ https://speakerdeck.com/dongjin
      • Korea Spark User Group
        ◦ https://www.facebook.com/groups/sparkkoreauser/
      • Kafka Korea User Group
        ◦ https://www.facebook.com/groups/kafkakorea/