Robin Moffatt
October 31, 2018

LISA18: Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again!

Apache Kafka is a distributed, scalable, and fault-tolerant streaming platform, providing low-latency pub-sub messaging coupled with native storage and stream processing capabilities. Integrating Kafka with RDBMS, NoSQL, and object stores is simple with the Kafka Connect API, which is part of Apache Kafka. KSQL is the open-source SQL streaming engine for Apache Kafka, and makes it possible to build stream processing applications at scale, written using a familiar SQL interface.

In this talk we’ll explain the architectural reasoning for Apache Kafka and the benefits of real-time integration, and we’ll build a streaming data pipeline using nothing but our bare hands, the Kafka Connect API, and KSQL.

Gasp as we filter events in real time! Be amazed at how we can enrich streams of data with data from RDBMS! Be astonished at the power of streaming aggregates for anomaly detection!

This will be a practical talk, after which attendees will have a clear idea of the power of stream processing, and how to get started with it using the open-source Apache Kafka and KSQL projects.

Transcript

  1. Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! @rmoff [email protected] confluent.io/ksql USENIX Large Installation System Administration Conference (LISA) October 31, 2018
  2. $ whoami • Developer Advocate @ Confluent • Working in data & analytics since 2001 • Oracle Developer Champion • Blogging: http://rmoff.net & http://cnfl.io/rmoff • Twitter: @rmoff • Geek stuff • Beer & Fried Breakfasts • https://speakerdeck.com/rmoff/ (@rmoff / http://cnfl.io/ksql)
  3. Kafka
  4. Kafka is a Streaming Platform [Diagram labels: KAFKA; DWH; Hadoop; App (×8); request-response; messaging OR stream processing; streaming data pipelines; changelogs]
  5. Streaming is not just for realtime
  6. Streaming is for everyone
  7. All data is events
  8. A Dumb Pipeline [Diagram: Logs → HDFS / S3 / BigQuery etc]
  9. A Dumb Pipeline [Diagram: Logs, Logs → HDFS / S3 / BigQuery etc]
  10. Stream Processing with Apache Kafka and KSQL [Diagram labels: Logs; Stream Processing; HDFS / S3 / BigQuery etc; All logs; Errors]
  11. Real-time Event Stream Enrichment [Diagram labels: order events; customer; Stream Processing; customer orders; RDBMS; <y>; CDC]
  12. Transform Once, Use Many [Diagram labels: order events; customer; Stream Processing; customer orders; RDBMS; <y>; New App <x>; CDC]
  13. Transform Once, Use Many [Diagram labels: order events; customer; Stream Processing; customer orders; RDBMS; <y>; HDFS / S3 / etc; New App <x>; CDC]
  14. Let’s Build It! [Diagram labels: Rating events; Join events to users, and filter; Push notification; Operational Dashboard; Data Lake; User data]
  15. Let’s Build It! [Diagram labels: Rating events; Join events to users, and filter; Push notification; Operational Dashboard; Data Lake; User data; RDBMS; S3/HDFS/SnowflakeDB etc; Elasticsearch; App; App; Producer API; Consumer API]
  16. [Diagram labels: Rating events; Join events to users, and filter; Push notification; Operational Dashboard; Data Lake; User data; RDBMS; S3/HDFS/SnowflakeDB etc; Elasticsearch; App; App; Producer API; Consumer API; Kafka Connect (×4)]
  17. Kafka Connect: An API of Apache Kafka, providing reliable and scalable integration of Kafka with other systems – no coding required. { "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector", "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo", "table.whitelist": "sales,orders,customers" } https://docs.confluent.io/current/connect/
  18. Streaming Integration with Kafka Connect [Diagram labels: Kafka Brokers; Kafka Connect; Tasks; Workers; Sources: syslog, flat file, CSV, JSON, MQTT]
  19. Streaming Integration with Kafka Connect [Diagram labels: Kafka Brokers; Kafka Connect; Tasks; Workers; Sinks: Amazon S3, MQT]
  20. Streaming Integration with Kafka Connect [Diagram labels: Kafka Brokers; Kafka Connect; Tasks; Workers; Sources: syslog, flat file, CSV, JSON, MQTT; Sinks: Amazon S3, MQT]
  21. Confluent Hub (hub.confluent.io) • One-stop place to discover and download: Connectors • Transformations • Converters
  22. Kafka Connect + Schema Registry = WIN [Diagram labels: RDBMS; Avro Message; Elasticsearch; Schema Registry; Avro Schema; Kafka Connect; Kafka Connect]
  23. Kafka Connect + Schema Registry = WIN [Diagram labels: RDBMS; Elasticsearch; Schema Registry; Avro Schema; Kafka Connect; Kafka Connect; Avro Message]
  24. Kafka → Elasticsearch
        curl -X "POST" "http://kafka-connect-cp:18083/connectors/" \
             -H "Content-Type: application/json" \
             -d '{
               "name": "es_sink_lisa18",
               "config": {
                 "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
                 "key.converter": "org.apache.kafka.connect.storage.StringConverter",
                 "value.converter": "org.apache.kafka.connect.json.JsonConverter",
                 "value.converter.schemas.enable": false,
                 "topics": "lisa18",
                 "key.ignore": "true",
                 "schema.ignore": "true",
                 "type.name": "kafkaconnect",
                 "connection.url": "http://elasticsearch:9200"
               }
             }'
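The curl call on slide 24 registers a connector by POSTing a JSON document to the Kafka Connect REST endpoint. As a rough sketch only, the same request could be assembled in Python with the standard library. The host, port, connector name, and config values are copied from the slide; the helper name `build_connector_request` is made up for illustration, and the config is abbreviated. The request is built but deliberately not sent, so the snippet runs without a Connect worker.

```python
import json
import urllib.request

# Endpoint as shown on the slide; adjust host and port for your own environment.
CONNECT_URL = "http://kafka-connect-cp:18083/connectors/"


def build_connector_request(name, config, url=CONNECT_URL):
    """Build (but do not send) a POST request that registers a connector.

    Hypothetical helper for illustration, not part of any Kafka client library.
    """
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Abbreviated version of the sink configuration from the slide.
es_sink_config = {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "lisa18",
    "key.ignore": "true",
    "schema.ignore": "true",
    "connection.url": "http://elasticsearch:9200",
}

request = build_connector_request("es_sink_lisa18", es_sink_config)

# To submit it against a running Connect worker, one would call:
#   urllib.request.urlopen(request)
```

Keeping the connector definition as a plain dictionary makes it easy to store in version control and to reuse across environments by changing only the URL.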
  25. Demo Time! [Diagram labels: MySQL; Debezium; Kafka Connect; Producer API]
  26. Let’s Build It! [Diagram labels: Rating events; Join events to users, and filter; Push notification; Operational Dashboard; Data Lake; User data; RDBMS; S3/HDFS/SnowflakeDB etc; Elasticsearch; App; App; Producer API; Consumer API; Kafka Connect (×3)]
  27. [Diagram labels: Rating events; Join events to users, and filter; Push notification; Operational Dashboard; Data Lake; User data; RDBMS; S3/HDFS/SnowflakeDB etc; Elasticsearch; App; App; Producer API; Consumer API; Kafka Connect (×3); KSQL (×2)]
  28. KSQL is a Declarative Stream Processing Language
  29. KSQL is the Streaming SQL Engine for Apache Kafka
  30. KSQL for Real-Time Monitoring • Log data monitoring, tracking and alerting • syslog data • Sensor / IoT data. CREATE STREAM SYSLOG_INVALID_USERS AS SELECT HOST, MESSAGE FROM SYSLOG WHERE MESSAGE LIKE '%Invalid user%'; http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting
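To make the monitoring query on slide 30 concrete, here is a small Python emulation of what `WHERE MESSAGE LIKE '%Invalid user%'` does to a stream of syslog events: it keeps only events whose message contains that substring and projects the host and message fields. This is only an illustration of the filter's semantics, not how KSQL executes it; the function name and the sample events are invented.

```python
def invalid_user_events(syslog_events):
    """Yield host/message pairs for events whose message contains 'Invalid user'."""
    for event in syslog_events:
        if "Invalid user" in event["message"]:
            yield {"host": event["host"], "message": event["message"]}


# Hypothetical sample events, made up for this sketch.
sample_events = [
    {"host": "web01", "message": "Invalid user admin from 203.0.113.7", "facility": 4},
    {"host": "web01", "message": "Accepted publickey for deploy", "facility": 4},
    {"host": "db02", "message": "session opened for user root", "facility": 10},
]

matches = list(invalid_user_events(sample_events))
for match in matches:
    print(match["host"], match["message"])
```

Because the function is a generator, it handles events one at a time, which loosely mirrors how a streaming filter emits matching records as they arrive.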
  31. KSQL for Streaming ETL: Joining, filtering, and aggregating streams of event data. CREATE STREAM vip_actions AS SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id WHERE u.level = 'Platinum';
  32. KSQL for Anomaly Detection: Identifying patterns or anomalies in real-time data, surfaced in milliseconds. CREATE TABLE possible_fraud AS SELECT card_number, count(*) FROM authorization_attempts WINDOW TUMBLING (SIZE 5 SECONDS) GROUP BY card_number HAVING count(*) > 3;
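The anomaly-detection query on slide 32 counts authorization attempts per card in fixed, non-overlapping 5-second windows and keeps the windows with more than three attempts. The following Python sketch emulates that logic over an in-memory list. It is a batch approximation for illustration (KSQL computes the aggregate continuously as events arrive), and the function name, timestamps, and card numbers are invented.

```python
from collections import Counter

WINDOW_SIZE_MS = 5_000  # mirrors WINDOW TUMBLING (SIZE 5 SECONDS)
THRESHOLD = 3           # mirrors HAVING count(*) > 3


def possible_fraud(attempts, window_size_ms=WINDOW_SIZE_MS, threshold=THRESHOLD):
    """Return {(card_number, window_start_ms): count} for counts above the threshold.

    `attempts` is an iterable of (timestamp_ms, card_number) pairs.
    """
    counts = Counter()
    for timestamp_ms, card_number in attempts:
        # A tumbling window is identified by the start of its fixed-size bucket.
        window_start = timestamp_ms - (timestamp_ms % window_size_ms)
        counts[(card_number, window_start)] += 1
    return {key: n for key, n in counts.items() if n > threshold}


# Hypothetical attempts: card 4111 is used four times inside the first window.
attempts = [
    (1_000, "4111"), (1_500, "4111"), (2_000, "4111"), (4_900, "4111"),
    (5_100, "4111"),                   # falls into the next window
    (1_200, "5500"), (3_300, "5500"),  # only two attempts, below the threshold
]

flagged = possible_fraud(attempts)
print(flagged)  # {('4111', 0): 4}
```

The fifth attempt for card 4111 lands in the window starting at 5000 ms, so it does not add to the first window's count; that is the defining behaviour of tumbling windows as opposed to sliding ones.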
  33. KSQL for Data Transformation: Make simple derivations of existing topics from the command line. CREATE STREAM pageviews WITH (PARTITIONS=4, VALUE_FORMAT='AVRO') AS SELECT * FROM pageviews_json;
  34. KSQL in Development and Production • Interactive KSQL for development and testing (REST): “Hmm, let me try out this idea...” • Headless KSQL for production: desired KSQL queries have been identified
  35. Demo Time! [Diagram labels: MySQL; Debezium; Kafka Connect; Producer API; Elasticsearch; Kafka Connect]
  36. Producer API: { "rating_id": 5313, "user_id": 3, "stars": 4, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "worst. flight. ever. #neveragain" } POOR_RATINGS: Filter all ratings where STARS < 3. CREATE STREAM POOR_RATINGS AS SELECT * FROM ratings WHERE STARS < 3;
  37. Do you think that’s a table you are querying?
  38. The Stream Table Duality
        Stream (Account ID, Amount), in time order: 12345 +€50; 12345 +€25; 12345 -€60
        Table (Account ID, Balance), over time: 12345 €50, then 12345 €75, then 12345 €15
        Read more: https://cnfl.io/stream-table-duality
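The stream-table duality on slide 38 can be worked through in a few lines: the stream is the sequence of changes, and the table is what you get by applying those changes in order. The Python sketch below replays the slide's three transactions for account 12345 and records the table after each one. The function name is invented, and amounts are plain integers standing in for euros.

```python
def apply_stream(transactions):
    """Fold a stream of (account_id, amount) events into successive table states."""
    table = {}
    snapshots = []
    for account_id, amount in transactions:
        table[account_id] = table.get(account_id, 0) + amount
        snapshots.append(dict(table))  # copy: the table as of this event
    return snapshots


# The three events from the slide: +50, +25, -60.
stream = [("12345", 50), ("12345", 25), ("12345", -60)]
snapshots = apply_stream(stream)

for snapshot in snapshots:
    print(snapshot)
# {'12345': 50}
# {'12345': 75}
# {'12345': 15}
```

Replaying the same stream always rebuilds the same table, which is why the log can be treated as the source of truth and the table as a derived view of it.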
  39. “The truth is the log. The database is a cache of a subset of the log.” Pat Helland, Immutability Changes Everything, http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf (Photo by Bobby Burch on Unsplash)
  40. Producer API: { "rating_id": 5313, "user_id": 3, "stars": 4, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "worst. flight. ever. #neveragain" } Kafka Connect: { "id": 3, "first_name": "Merilyn", "last_name": "Doughartie", "email": "[email protected]", "gender": "Female", "club_status": "platinum", "comments": "none" } RATINGS_WITH_CUSTOMER_DATA: Join each rating to customer data. CREATE STREAM RATINGS_WITH_CUSTOMER_DATA AS SELECT * FROM RATINGS R LEFT JOIN CUSTOMERS C ON R.USER_ID = C.ID;
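The enrichment on slide 40 is a stream-table join: each rating event looks up the current customer row by key and carries the matching fields along. As a hedged illustration of that lookup (not of how KSQL implements it), the Python sketch below joins ratings to an in-memory dictionary of customers keyed by `id`. The function name is made up, and the sample records are trimmed versions of the ones on the slide.

```python
def enrich_ratings(ratings, customers_by_id):
    """Left-join each rating to the customer whose id equals rating['user_id'].

    Ratings with no matching customer are kept, with 'customer' set to None,
    which is what makes this a LEFT join rather than an inner join.
    """
    for rating in ratings:
        enriched = dict(rating)
        enriched["customer"] = customers_by_id.get(rating["user_id"])
        yield enriched


# Trimmed sample data based on the slide; user 99 is invented to show a miss.
customers_by_id = {
    3: {"id": 3, "first_name": "Merilyn", "last_name": "Doughartie", "club_status": "platinum"},
}
ratings = [
    {"rating_id": 5313, "user_id": 3, "stars": 4, "channel": "web"},
    {"rating_id": 5314, "user_id": 99, "stars": 1, "channel": "web"},
]

enriched = list(enrich_ratings(ratings, customers_by_id))
for row in enriched:
    print(row["rating_id"], row["customer"])
```

In the real pipeline the customer side is kept current by change data capture from the database, so a rating is joined against whatever the customer record looks like when the rating arrives.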
  41. Producer API: { "rating_id": 5313, "user_id": 3, "stars": 4, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "worst. flight. ever. #neveragain" } Kafka Connect: { "id": 3, "first_name": "Merilyn", "last_name": "Doughartie", "email": "[email protected]", "gender": "Female", "club_status": "platinum", "comments": "none" } RATINGS_WITH_CUSTOMER_DATA: Join each rating to customer data. UNHAPPY_PLATINUM_CUSTOMERS: Filter for just PLATINUM customers with poor ratings. CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS SELECT * FROM RATINGS_WITH_CUSTOMER_DATA WHERE STARS < 3 AND CLUB_STATUS = 'platinum';
  42. Confluent Open Source : Apache Kafka with a bunch of cool stuff! For free! [Diagram labels: Database Changes; Log Events; IoT Data; Web Events; …; CRM; Data Warehouse; Database; Hadoop; Data Integration; …; Monitoring; Analytics; Custom Apps; Transformations; Real-time Applications; …] Confluent Platform: Apache Kafka® (Core | Connect API | Streams API) • Data Compatibility: Schema Registry • Monitoring & Administration: Confluent Control Center | Security • Operations: Replicator | Auto Data Balancing • Development and Connectivity: Clients | Connectors | REST Proxy | CLI • SQL Stream Processing: KSQL. Customer self-managed: Datacenter, Public Cloud. Confluent fully-managed: Confluent Cloud.
  43. If you remember one thing… (or three) • Kafka Connect: Integration between Kafka and other data stores • Kafka: Provides stream processing natively • KSQL: Build stream processing apps with just SQL
  44. Free Books! https://www.confluent.io/apache-kafka-stream-processing-book-bundle
  45. Try it out! https://cnfl.io/kafka-ksql-elastic
  46. Useful links • Embrace the Anarchy : Apache Kafka's Role in Modern Data Architectures (Recording & Slides) • Look Ma, no Code! Building Streaming Data Pipelines with Apache Kafka and KSQL • Steps to Building a Streaming ETL Pipeline with Apache Kafka and KSQL (Recording & Slides) • https://www.confluent.io/blog/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data • https://github.com/confluentinc/ksql/
  47. Resources • CDC Spreadsheet • Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC • #partner-engineering on Slack for questions • BD team (#partners / [email protected]) can help with introductions on a given sales op #EOF