Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Oracle Developer Meetup - Streaming ETL in Prac...

Oracle Developer Meetup - Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Robin Moffatt

June 25, 2018
Tweet

More Decks by Robin Moffatt

Other Decks in Technology

Transcript

  1. Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

    Oracle Developer Meetup 25 Jun 2018 / Robin Moffatt @rmoff [email protected] https://speakerdeck.com/rmoff/
  2. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 2 $ whoami • Developer Advocate @ Confluent • Working in data & analytics since 2001 • Oracle ACE Director & Dev Champion • Blogging : http://rmoff.net & http://cnfl.io/rmoff • Twitter: @rmoff • Geek stuff • Beer & Fried Breakfasts https://speakerdeck.com/rmoff/
  3. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 3 Analytics—In the beginning… Sales DWH
  4. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 4 And then there were more data sources… Sales DWH Inventory
  5. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 5 Batch Transformations … (ETL / ELT) Sales DWH Inventory
  6. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 6 Add a Data Lake… Sales DWH Inventory Data Lake
  7. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 7 …or Replace the Data Warehouse Sales Inventory Data Lake
  8. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 8 Still need to do Batch transformations… Sales Inventory Data Lake
  9. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 9 Want your data anytime ? Batch is Latency built in by Design
  10. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 Photo by Denys Nevozhai on Unsplash Microservices Mobile Machine 
 Learning Internet of 
 Things The World has Changed
  11. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 11 Photo by Rosie Fraser on Unsplash Lots of new technologies (whether you like it or not)
  12. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 App App App App search Hadoop DWH monitoring security MQ MQ cache cache
  13. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 KAFKA DWH Hadoop App App App App App App App App request-response messaging OR stream processing streaming data pipelines changelogs
  14. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 16 A Customer Experience
  15. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 18 A Sensor Reading
  16. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 19 An Application Log Entry
  17. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 21 Do you think that’s a table you are querying?
  18. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 22 The Table Stream Duality Account ID Balance 12345 €50
  19. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 22 The Table Stream Duality Account ID Balance 12345 €50 Account ID Amount 12345 + €50 Time
  20. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 22 The Table Stream Duality Account ID Balance 12345 €50 Account ID Amount 12345 + €50 12345 + €25 Account ID Balance 12345 €75 Time
  21. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 22 The Table Stream Duality Account ID Balance 12345 €50 Account ID Amount 12345 + €50 12345 + €25 12345 -€60 Account ID Balance 12345 €75 Account ID Balance 12345 €15 Time
  22. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 22 The Table Stream Duality Account ID Balance 12345 €50 Account ID Amount 12345 + €50 12345 + €25 12345 -€60 Account ID Balance 12345 €75 Account ID Balance 12345 €15 Time Stream Table
  23. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 23 The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by Bobby Burch on Unsplash
  24. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 Streaming Platform Vision
  25. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 Streaming Platform Vision Event-Driven Scalable Decoupled
  26. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 Streaming Platform Vision Event-Driven Scalable Decoupled
  27. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 26 Database offload→Hadoop/Object Storage/Cloud DW for Analytics HDFS / S3 / BigQuery etc RDBMS CDC
  28. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 27 Streaming ETL with Apache Kafka and KSQL order items customer customer orders Stream Processing RDBMS CDC
  29. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 28 Real-time Event Stream Enrichment with Apache Kafka and KSQL order events customer Stream Processing customer orders RDBMS <y> CDC
  30. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 29 Transform Once, Use Many order events customer Stream Processing customer orders RDBMS <y> New App <x> CDC
  31. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 30 Transform Once, Use Many order events customer Stream Processing customer orders RDBMS <y> HDFS / S3 / etc New App <x> CDC
  32. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 31 Evolve processing from old systems to new RDBMS Existing App CDC
  33. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 31 Evolve processing from old systems to new Stream Processing RDBMS Existing App CDC New App <x>
  34. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 32 Evolve processing from old systems to new Stream Processing RDBMS Existing App New App <x> New App <y> CDC
  35. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 33 Streaming ETL, powered by Apache Kafka and Confluent Platform KSQL
  36. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 Elasticsearch Oracle Orders Logins Txn log Kafka Streams API KSQL Kafka Kafka
 Connect Kafka Connect Swingbench Realtime Analytics with Kafka and KSQL
  37. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 Elasticsearch Oracle Orders Logins Txn log Kafka Streams API KSQL Kafka Kafka
 Connect Kafka Connect Swingbench Realtime Analytics with Kafka and KSQL
  38. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 Kafka Connect Kafka Brokers Kafka Connect Tasks Workers Sources Sinks Amazon S3 syslog flat file CSV JSON
  39. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 The Connect API of Apache Kafka® ✓ Fault tolerant and automatically load balanced ✓ Extensible API ✓ Single Message Transforms ✓ Part of Apache Kafka, included in
 Confluent Open Source Reliable and scalable integration of Kafka with other systems – no coding required. { "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector", "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo", "table.whitelist": "sales,orders,customers" } https://docs.confluent.io/current/connect/ ✓ Centralized management and configuration ✓ Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3 ✓ Supports CDC ingest of events from RDBMS ✓ Preserves data schema
  40. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 Confluent Hub • Launched June 2018 • One-stop place to discover and download : • Connectors • Transformations • Converters hub.confluent.io
  41. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 38 Kafka Connect + Schema Registry = WIN RDBMS Avro Message Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect
  42. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 39 Kafka Connect + Schema Registry = WIN RDBMS Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect Avro Message
  43. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 39 Kafka Connect + Schema Registry = WIN RDBMS Elasticsearch Schema Registry Avro Schema Kafka Connect Kafka Connect Avro Message
  44. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 40 Change-Data-Capture (CDC)
  45. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 40 Change-Data-Capture (CDC) • CDC is a generic term referring to capturing changing data typically from a RDBMS. • Two general approaches:
  46. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 40 Change-Data-Capture (CDC) • CDC is a generic term referring to capturing changing data typically from a RDBMS. • Two general approaches: • Query-based CDC
  47. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 40 Change-Data-Capture (CDC) • CDC is a generic term referring to capturing changing data typically from a RDBMS. • Two general approaches: • Query-based CDC • Log-based CDC There are other options including hacks with Triggers, Flashback etc but these are system and/or technology-specific.
  48. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 • Use a database query to try and identify new & changed rows
 
 
 • Implemented with the open source Kafka Connect JDBC connector • Can import based on table names, schema, or bespoke SQL query •Incremental ingest driven through incrementing ID column and/or timestamp column 41 Query-based CDC SELECT * FROM my_table 
 WHERE col > <value of col last time we polled>
  49. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 42 Log-based CDC • Use the database's transaction log to identify every single change event • Various CDC tools available that integrate with Apache Kafka (more of this later…)
  50. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 43 Query-based vs Log-based CDC Photo by Matese Fields on Unsplash Read more: http://cnfl.io/kafka-cdc • Query-based +Usually easier to setup, and requires fewer permissions - Needs specific columns in source schema - Impact of polling the DB (or higher latencies tradeoff) - Can't track deletes, or multiple events between polling interval
  51. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 44 Query-based vs Log-based CDC Photo by Sebastian Pociecha on Unsplash Read more: http://cnfl.io/kafka-cdc • Log-based +Greater data fidelity +Lower latency +Lower impact on source - More setup steps - Higher system privileges required - For propriatory databases, usually $$$
  52. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 45 Which Log-Based CDC Tool? • Proprietory RDBMS, 
 e.g. Oracle, MS SQL • Attunity • IBM InfoSphere Data Replication (IIDR) • Oracle GoldenGate • SQData • HVR • Open Source RDBMS, 
 e.g. MySQL, PostgreSQL • Debezium • (+ paid options) • Mainframe
 e.g. VSAM, IMS • Attunity • SQData All these options integrate with Apache Kafka and Confluent Platform, including support for the Schema Registry ⓘ For query-based CDC, use the Confluent Kafka Connect JDBC connector
  53. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 Elasticsearch Oracle Orders Logins Txn log Kafka Streams API KSQL Kafka Kafka
 Connect Kafka Connect Swingbench Realtime Analytics with Kafka and KSQL
  54. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 Elasticsearch Oracle Orders Logins Txn log Kafka Streams API KSQL Kafka Kafka
 Connect Kafka Connect Swingbench Realtime Analytics with Kafka and KSQL
  55. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 Declarative Stream Language Processing KSQL is a
  56. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 KSQL is the Streaming SQL Engine for Apache Kafka
  57. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 KSQL in Development and Production Interactive KSQL
 for development and testing REST “Hmm, let me try
 out this idea...”
  58. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 KSQL in Development and Production Interactive KSQL
 for development and testing Headless KSQL
 for Production Desired KSQL queries have been identified REST “Hmm, let me try
 out this idea...”
  59. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 KSQL for Streaming ETL CREATE STREAM vip_actions AS 
 SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id 
 WHERE u.level = 'Platinum'; Joining, filtering, and aggregating streams of event data
  60. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 KSQL for Anomaly Detection CREATE TABLE possible_fraud AS
 SELECT card_number, count(*)
 FROM authorization_attempts 
 WINDOW TUMBLING (SIZE 5 SECONDS)
 GROUP BY card_number
 HAVING count(*) > 3; Identifying patterns or anomalies in real-time data, surfaced in milliseconds
  61. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 KSQL for Real-Time Monitoring • Log data monitoring, tracking and alerting • syslog data • Sensor / IoT data CREATE STREAM SYSLOG_INVALID_USERS AS SELECT HOST, MESSAGE FROM SYSLOG WHERE MESSAGE LIKE '%Invalid user%'; http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting
  62. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 CREATE STREAM views_by_userid WITH (PARTITIONS=6, REPLICAS=5, VALUE_FORMAT='AVRO', TIMESTAMP='view_time') AS 
 SELECT * FROM clickstream PARTITION BY user_id; KSQL for Data Transformation Make simple derivations of existing topics from the command line
  63. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 54 Streaming ETL, powered by Apache Kafka and Confluent Platform KSQL
  64. 56

  65. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 57 Customers Logons Logon_Enriched Join, and Derive additional columns
  66. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 58 Orders Orders_agg_hourly Aggregate order metrics by hour
  67. 59

  68. 59

  69. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 Confluent Open Source : Apache Kafka with a bunch of cool stuff! For free! Database Changes Log Events loT Data Web Events … CRM Data Warehouse Database Hadoop Data
 Integration … Monitoring Analytics Custom Apps Transformations Real-time Applications … Apache Open Source Confluent Open Source Confluent Enterprise Confluent Platform Confluent Platform Apache Kafka® Core | Connect API | Streams API Data Compatibility Schema Registry Monitoring & Administration Confluent Control Center | Security Operations Replicator | Auto Data Balancing Development and Connectivity Clients | Connectors | REST Proxy | CLI Apache Open Source Confluent Open Source Confluent Enterprise SQL Stream Processing KSQL
  70. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 Free Books! https://www.confluent.io/apache-kafka-stream-processing-book-bundle
  71. @rmoff / Streaming ETL in Practice with Oracle, Apache Kafka,

    and KSQL / June 2018 @rmoff [email protected] https://slackpass.io/confluentcommunity https://www.confluent.io/download/