Real-time Data Integration at Scale with Kafka Connect - Dublin Apache Kafka Meetup 04 Jul 2017

Apache Kafka is a streaming data platform. It enables integration of data across the enterprise, and ships with its own stream processing capabilities. But how do we get data in and out of Kafka in an easy, scalable, and standardised manner? Enter Kafka Connect.
Part of Apache Kafka since 0.9, Kafka Connect defines an API that enables the integration of data from multiple sources, including MQTT, common NoSQL stores, and CDC from relational databases such as Oracle. By "turning the database inside out" we can enable an event-driven architecture in our business that reacts to changes made by applications writing to a database, without having to modify those applications themselves. As well as ingest, Kafka Connect has connectors with support for numerous targets, including HDFS, S3, and Elasticsearch.
This presentation will briefly recap the purpose of Kafka, and then dive into Kafka Connect, with practical examples of data pipelines that can be built with it and are already in production at companies around the world. We'll also look at the Single Message Transform (SMT) capabilities introduced with Kafka 0.10.2 and how they can make Kafka Connect even more flexible and powerful.

Robin Moffatt

July 04, 2017

Transcript

  1. Real-time Data Integration at Scale with Kafka Connect • Robin Moffatt, Partner Technology Evangelist, EMEA @ Confluent • @rmoff • [email protected]

  2. Single Message Transform (SMT) – Extract, TRANSFORM, Load… • Modify events before storing in Kafka: mask/drop sensitive information, set the partitioning key, store lineage • Modify events going out of Kafka: route high-priority events to faster data stores, direct events to different Elasticsearch indexes, cast data types to match the destination
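
One way to picture the sink-side use case from the slide above (directing events to different Elasticsearch indexes) is the hedged sketch below. It assumes a Connect worker at localhost:8083 and the Confluent Elasticsearch sink connector; the connector name, topic, and Elasticsearch URL are placeholders, and TimestampRouter is one of the SMTs that ship with Kafka 0.10.2+.

```python
import json
import requests

# A minimal sketch: an Elasticsearch sink connector with a TimestampRouter SMT that
# rewrites the topic name per record, so the sink writes each event to a daily index
# (e.g. orders-20170704) instead of a single one. Names and URLs are hypothetical.
config = {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "orders",
    "connection.url": "http://elasticsearch:9200",
    "type.name": "kafka-connect",
    "key.ignore": "true",
    # SMT chain: route each event by topic name plus event date
    "transforms": "routeByDay",
    "transforms.routeByDay.type": "org.apache.kafka.connect.transforms.TimestampRouter",
    "transforms.routeByDay.topic.format": "${topic}-${timestamp}",
    "transforms.routeByDay.timestamp.format": "yyyyMMdd",
}

# PUT /connectors/<name>/config creates the connector if it does not yet exist
resp = requests.put(
    "http://localhost:8083/connectors/orders-to-elasticsearch/config",
    headers={"Content-Type": "application/json"},
    data=json.dumps(config),
)
resp.raise_for_status()
```

The same transforms.* mechanism applies on the source side, e.g. chaining MaskField to blank out sensitive fields and ValueToKey to set the partitioning key before records are written to Kafka.
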
  3. Kafka Connect API – a library of connectors covering databases, analytics, applications/other, and datastores/file stores: https://www.confluent.io/product/connectors/
  4. Streaming Application Data to Kafka • Applications are a rich source of events • Modifying applications is not always possible or desirable • And what if the data gets changed within the database or by other apps? • JDBC is one option for extracting data
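
As a hedged illustration of the JDBC option mentioned above, the sketch below creates a Confluent JDBC source connector through the Connect REST API. The worker URL, database coordinates, table, and column names are all assumptions.

```python
import json
import requests

# A minimal sketch of a JDBC source connector that polls a MySQL table and publishes
# new and updated rows to a Kafka topic. All connection details are placeholders.
connector = {
    "name": "jdbc-orders-source",                        # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://db.example.com:3306/shop?user=connect&password=connect-secret",
        "table.whitelist": "orders",                      # table(s) to poll
        "mode": "timestamp+incrementing",                 # detect both new and updated rows
        "timestamp.column.name": "updated_at",
        "incrementing.column.name": "order_id",
        "topic.prefix": "mysql-",                         # topic becomes mysql-orders
        "poll.interval.ms": "10000",
        "tasks.max": "1",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```

A polling JDBC source only sees the current state of each row, so it can miss deletes and intermediate updates, which is one reason the next slide turns to log-based CDC.
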
  5. Liberate Application Data into Kafka with CDC • Relational databases use transaction logs to ensure Durability of data • Change-Data-Capture (CDC) mines the log to get raw events from the database • CDC tools that integrate with Kafka Connect include Debezium, DBVisit, GoldenGate, Attunity, + more
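
As a sketch of one of the CDC options listed above, the following hedged example registers the Debezium MySQL connector, which reads the database's binlog rather than polling tables. Hostnames, credentials, and database names are placeholders.

```python
import json
import requests

# A hedged sketch of log-based CDC with Debezium's MySQL connector: raw change events
# from the binlog are streamed into Kafka topics prefixed with the logical server name.
connector = {
    "name": "debezium-shop-source",                       # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "db.example.com",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "debezium-secret",
        "database.server.id": "184054",                   # unique ID for the binlog client
        "database.server.name": "shop",                   # logical name, used as the topic prefix
        "database.whitelist": "shop",                     # capture changes from this database
        # Debezium keeps the DDL history of the captured tables in its own Kafka topic
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "dbhistory.shop",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```
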
  6. Kafka Connect Common Patterns – Data Integration into a Data Lake for batch analytics [diagram: Kafka Connect between sources and targets including Oracle, DB2, MS SQL, Postgres, MySQL, mainframe (e.g. VSAM), CRM, ERP, WebApp, Twitter, IRC, Bloomberg, …, Cassandra, MongoDB, Couchbase, HBase, S3/Athena, HDFS, BigQuery, Elasticsearch, Solr]
  7. Common Patterns – Event-Driven Microservices [diagram: an Orders Service and a Stock Service built around Kafka, with Kafka Connect integrating the same sources and targets as above]
  8. Common Patterns – Event-Driven Microservices & Audit/Search/Storage [diagram: the previous pattern, with Kafka Connect also landing the event streams in stores for audit, search, and storage]
  9. The Numerous Benefits of Kafka Connect • Restart capabilities (offset management) • Distributed workers • Parallelism (for throughput) • Load balancing • Fault tolerance • Schema preservation • Data serialisation • Centralised management and configuration
  10. Kafka Connect – under the covers • Each Kafka Connect node is a worker • Each worker executes one or more tasks • Tasks do the actual work of pulling data from sources / landing it to sinks • Kafka Connect manages the distribution and execution of tasks • Parallelism, fault-tolerance, load balancing all handled automatically
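
A small sketch of how this surfaces in practice: the status endpoint reports, for each task of a connector, the worker it is currently running on. It assumes a worker at localhost:8083 and the hypothetical "debezium-shop-source" connector from the earlier example.

```python
import requests

# Show how a connector's tasks are spread across the workers of a distributed cluster:
# each task entry in the status payload includes the worker_id it is executing on.
connect_url = "http://localhost:8083"
status = requests.get(f"{connect_url}/connectors/debezium-shop-source/status").json()

print(f"connector state: {status['connector']['state']}")
for task in status["tasks"]:
    print(f"task {task['id']}: {task['state']} on worker {task['worker_id']}")
```
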
  11. Kafka Connect – Standalone vs Distributed • Kafka Connect has two modes: standalone or distributed • Distributed: scale-out and fault tolerance are easy – just add more workers • Can run on one node! • Standalone: useful where the data source is machine-specific (e.g. single-node log files)
  12. Kafka Connect – Converters • Data from the source system is in its own format (e.g. a RecordSet from JDBC) • Kafka Connect's Converters provide reusable functionality to serialise data into JSON or Avro • The Confluent Schema Registry is used to store the schemas of ingested data • http://docs.confluent.io/current/connect/concepts.html#converters
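
For illustration, the fragment below shows typical converter settings – Avro with the Confluent Schema Registry, or plain JSON – as they might appear in a worker or connector configuration; the Schema Registry URL is an assumption.

```python
# A hedged illustration of converter settings. Set these at the worker level to apply
# to every connector, or on an individual connector to override the worker defaults.
avro_converters = {
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",  # assumed URL
}

# Alternatively, serialise values as JSON; schemas.enable controls whether the schema
# is embedded alongside each payload.
json_converters = {
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
}
```
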
  13. Configuring Kafka Connect – REST API • Configure & control Kafka Connect through the REST API • Validate connector configuration • Create connectors • List available plugins • Query connector & task state • Pause, resume, restart connectors + tasks • Configuration is persisted in a Kafka topic • Reference: http://docs.confluent.io/current/connect/restapi.html
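
A brief, hedged tour of those REST operations, assuming a worker at localhost:8083 and an existing connector named "orders-to-elasticsearch":

```python
import json
import requests

# Walk through the REST operations listed on the slide above.
base = "http://localhost:8083"
name = "orders-to-elasticsearch"   # assumed connector name

# List installed connector plugins and the connectors currently defined
print(requests.get(f"{base}/connector-plugins").json())
print(requests.get(f"{base}/connectors").json())

# Validate a (deliberately incomplete) candidate configuration without creating anything
candidate = {"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector"}
validation = requests.put(
    f"{base}/connector-plugins/JdbcSourceConnector/config/validate",
    headers={"Content-Type": "application/json"},
    data=json.dumps(candidate),
).json()
print("validation errors:", validation["error_count"])

# Query connector & task state, then pause, resume, and restart
print(requests.get(f"{base}/connectors/{name}/status").json())
requests.put(f"{base}/connectors/{name}/pause")
requests.put(f"{base}/connectors/{name}/resume")
requests.post(f"{base}/connectors/{name}/restart")
```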

  14. Confluent: a Streaming Platform based on Apache Kafka™ [diagram: data sources (database changes, log events, IoT data, web events, …) and destinations/uses (CRM, data warehouse, database, Hadoop, data integration, monitoring, analytics, custom apps, transformations, real-time applications, …) around the Confluent Platform – Apache Kafka™ (Core | Connect | Streams), Data Compatibility (Schema Registry), Monitoring & Administration (Confluent Control Center), Operations (Replicator | Auto Data Balancing), Development and Connectivity (Clients | Connectors | REST Proxy), spanning Apache open source, Confluent Open Source, and Confluent Enterprise]
  15. Kafka Connect – Getting Started • Docs: http://docs.confluent.io/current/connect/ – includes a Quickstart and full Connect documentation, including Architecture + Internals • Official Confluent Platform Docker images available: http://docs.confluent.io/current/cp-docker-images/docs/quickstart.html#kafka-connect • List of connectors: https://www.confluent.io/product/connectors/ – also search on GitHub: https://github.com/search?q=kafka-connect • https://www.confluent.io/download/ • @rmoff • [email protected]