Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Streaming ETL in Practice with PostgreSQL, Apac...

Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL

Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again!

Companies new and old are all recognising the importance of a low-latency, scalable, fault-tolerant data backbone, in the form of the Apache Kafka® streaming platform. With Kafka, developers can integrate multiple sources and systems, which enables low latency analytics, event driven architectures and the population of multiple downstream systems. These data pipelines can be built using configuration alone.

In this talk, we’ll see how easy it is to stream data from a database such as PostgreSQL into Kafka using CDC and Kafka Connect. In addition, we’ll use KSQL to filter, aggregate and join it to other data, and then stream this from Kafka out into multiple targets such as Elasticsearch and S3. All of this can be accomplished without a single line of code!

Why should Java geeks have all the fun?

Robin Moffatt

July 03, 2018
Tweet

More Decks by Robin Moffatt

Other Decks in Technology

Transcript

  1. Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL

    PGConf UK 2018 3 Jul 2018 / Robin Moffatt @rmoff [email protected] https://speakerdeck.com/rmoff/
  2. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 2 • Developer Advocate @ Confluent • Working in data & analytics since 2001 • Oracle ACE Director & Dev Champion • Blogging : http://rmoff.net & http://cnfl.io/rmoff • Twitter: @rmoff • Geek stuff • Beer & Fried Breakfasts $ whoami https://speakerdeck.com/rmoff/
  3. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 3 App App App App search Hadoop DWH monitoring security MQ MQ cache cache A bit of a mess…
  4. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 4 The Streaming Platform KAFKA DWH Hadoop App App App App App App App App request-response messaging OR stream processing streaming data pipelines changelogs
  5. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 5 Database offload → Analytics HDFS / S3 / BigQuery etc RDBMS CDC
  6. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 6 Streaming ETL with Apache Kafka and KSQL order events customer customer orders Stream Processing RDBMS CDC
  7. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 7 Real-time Event Stream Enrichment order events customer Stream Processing customer orders RDBMS <y> CDC
  8. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 8 Transform Once, Use Many order events customer Stream Processing customer orders RDBMS <y> New App <x> CDC
  9. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 9 Transform Once, Use Many order events customer Stream Processing customer orders RDBMS <y> HDFS / S3 / etc New App <x> CDC
  10. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 10 KSQL Streaming ETL with Apache Kafka
  11. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 11 Streaming Integration with Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sources Sinks Amazon S3 syslog flat file CSV JSON
  12. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 12 The Connect API of Apache Kafka® ✓ Fault tolerant and automatically load balanced ✓ Extensible API ✓ Single Message Transforms ✓ Part of Apache Kafka, included in
 Confluent Open Source Reliable and scalable integration of Kafka with other systems – no coding required. { "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector", "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo", "table.whitelist": "sales,orders,customers" } https://docs.confluent.io/current/connect/ ✓ Centralized management and configuration ✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3 ✓ Supports CDC ingest of events from RDBMS ✓ Preserves data schema
  13. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 13 Integrating Postgres with Kafka Kafka Connect & Debezium Kafka Connect & JDBC Sink
  14. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 14 Confluent Hub hub.confluent.io • Launched June 2018 • One-stop place to discover and download : • Connectors • Transformations • Converters
  15. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 15 KSQL Streaming ETL with Apache Kafka
  16. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 Declarative Stream Language Processing KSQL is a
  17. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 KSQL is the Streaming SQL Engine for Apache Kafka
  18. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 KSQL for Streaming ETL CREATE STREAM vip_actions AS 
 SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id 
 WHERE u.level = 'Platinum'; Joining, filtering, and aggregating streams of event data
  19. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 KSQL for Anomaly Detection CREATE TABLE possible_fraud AS
 SELECT card_number, count(*)
 FROM authorization_attempts 
 WINDOW TUMBLING (SIZE 5 SECONDS)
 GROUP BY card_number
 HAVING count(*) > 3; Identifying patterns or anomalies in real-time data, surfaced in milliseconds
  20. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 KSQL for Real-Time Monitoring • Log data monitoring, tracking and alerting • syslog data • Sensor / IoT data CREATE STREAM SYSLOG_INVALID_USERS AS SELECT HOST, MESSAGE FROM SYSLOG WHERE MESSAGE LIKE '%Invalid user%'; http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting
  21. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 21 KSQL Streaming ETL with Apache Kafka
  22. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 22 Kafka Connect Producer API Elasticsearch Kafka Connect { "rating_id": 5313, "user_id": 3, "stars": 4, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "worst. flight. ever. #neveragain" } { "id": 3, "first_name": "Merilyn", "last_name": "Doughartie", "email": "[email protected]", "gender": "Female", "club_status": "platinum", "comments": "none" } Postgres Demo Time! Kafka Connect Postgres
  23. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 23 Producer API { "rating_id": 5313, "user_id": 3, "stars": 4, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "worst. flight. ever. #neveragain" } POOR_RATINGS Filter all ratings where STARS<3 CREATE STREAM POOR_RATINGS AS SELECT * FROM ratings WHERE STARS <3
  24. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 24 Kafka Connect Producer API { "rating_id": 5313, "user_id": 3, "stars": 4, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "worst. flight. ever. #neveragain" } { "id": 3, "first_name": "Merilyn", "last_name": "Doughartie", "email": "[email protected]", "gender": "Female", "club_status": "platinum", "comments": "none" } RATINGS_WITH_CUSTOMER_DATA Join each rating to customer data UNHAPPY_PLATINUM_CUSTOMERS Filter for just PLATINUM customers CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS SELECT * FROM RATINGS_WITH_CUSTOMER_DATA WHERE STARS < 3
  25. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 25 Kafka Connect Producer API { "rating_id": 5313, "user_id": 3, "stars": 4, "route_id": 6975, "rating_time": 1519304105213, "channel": "web", "message": "worst. flight. ever. #neveragain" } { "id": 3, "first_name": "Merilyn", "last_name": "Doughartie", "email": "[email protected]", "gender": "Female", "club_status": "platinum", "comments": "none" } RATINGS_WITH_CUSTOMER_DATA Join each rating to customer data RATINGS_BY_CLUB_STATUS_1MIN Aggregate per-minute by CLUB_STATUS CREATE TABLE RATINGS_BY_CLUB_STATUS AS SELECT CLUB_STATUS, COUNT(*) FROM RATINGS_WITH_CUSTOMER_DATA WINDOW TUMBLING (SIZE 1 MINUTES) GROUP BY CLUB_STATUS;
  26. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 26 Confluent Open Source : Apache Kafka with a bunch of cool stuff! For free! Database Changes Log Events loT Data Web Events … CRM Data Warehouse Database Hadoop Data
 Integration … Monitoring Analytics Custom Apps Transformations Real-time Applications … Apache Open Source Confluent Open Source Confluent Enterprise Confluent Platform Confluent Platform Apache Kafka® Core | Connect API | Streams API Data Compatibility Schema Registry Monitoring & Administration Confluent Control Center | Security Operations Replicator | Auto Data Balancing Development and Connectivity Clients | Connectors | REST Proxy | CLI Apache Open Source Confluent Open Source Confluent Enterprise SQL Stream Processing KSQL
  27. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 27 Free Books! https://www.confluent.io/apache-kafka-stream-processing-book-bundle
  28. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 29 • Postgres integration into Kafka • http://debezium.io/docs/connectors/postgresql/ • https://www.simple.com/engineering/a-change-data-capture-pipeline-from-postgresql-to-kafka • https://www.slideshare.net/JeffKlukas/postgresql-kafka-the-delight-of-change-data-capture • https://blog.insightdatascience.com/from-postgresql-to-redshift-with-kafka-connect-111c44954a6a • Streaming ETL • Embrace the Anarchy : Apache Kafka's Role in Modern Data Architectures Recording & Slides • Look Ma, no Code! Building Streaming Data Pipelines with Apache Kafka and KSQL • Steps to Building a Streaming ETL Pipeline with Apache Kafka and KSQL Recording & Slides • https://www.confluent.io/blog/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data • https://github.com/confluentinc/ksql/ Useful links
  29. @rmoff / Streaming ETL in Practice with PostgreSQL, Apache Kafka,

    and KSQL - PGConf UK 2018 30 • CDC Spreadsheet • Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC • #partner-engineering on Slack for questions • BD team (#partners / [email protected]) can help with introductions on a given sales op Resources #EOF