to unify data from multiple systems, so create conformed dimensions and batch processes to federate our data. This is all batch driven, so latency is built in by design.
as our data warehouse, we want to use our transactional data to populate search replicas, Graph databases, noSQL stores…all introducing more point-to-point dependencies in our system
end up with a spaghetti architecture. It can't scale easily, it's tightly coupled, it's generally batch-driven and we can't get data when we want it where we want it.
source of events • Modifying applications is not always possible or desirable • And what if the data gets changed within the database or by other apps? • JDBC is one option for extracting data • Confluent Open Source includes JDBC source & sink connectors
databases use transaction logs to ensure Durability of data • Change-Data-Capture (CDC) mines the log to get raw events from the database • CDC tools that integrate with Kafka Connect include: • Debezium • DBVisit • GoldenGate • Attunity • + more
Modify events before storing in Kafka: • Mask/drop sensitive information • Set partitioning key • Store lineage • Modify events going out of Kafka: • Route high priority events to faster data stores • Direct events to different Elasticsearch indexes • Cast data types to match destination
Confluent • Enables stream processing with zero coding required • The simplest way to process streams of data in real-time • Powered by Kafka: scalable, distributed, battle-tested • All you need is Kafka–No complex deployments of bespoke systems for stream processing Ksql>
• Interpretations of topic content • STREAM - data in motion • TABLE - collected state of a stream • One record per key (per window) • Current values (compacted topic) • STREAM – TABLE Joins
TUMBLING: Fixed-size, non-overlapping, gap-less windows • SELECT ip, count(*) AS hits FROM clickstream WINDOW TUMBLING (size 1 minute) GROUP BY ip; • HOPPING: Fixed-size, overlapping windows • SELECT ip, SUM(bytes) AS bytes_per_ip_and_bucket FROM clickstream WINDOW HOPPING ( size 20 second, advance by 5 second) GROUP BY ip; • SESSION: Dynamically-sized, non-overlapping, data-driven window • SELECT ip, SUM(bytes) AS bytes_per_ip FROM clickstream WINDOW SESSION (20 second) GROUP BY ip; More: http://docs.confluent.io/current/streams/developer-guide.html#windowing
rental_date INT, inventory_id INT, customer_id INT, return_date INT, staff_id INT, last_update INT ) WITH (kafka_topic = 'sakila-rental', value_format = 'json'); Message ---------------- Stream created * Command formatted for clarity here. Linebreaks need to be denoted by \ in KSQL
Changes Log Events loT Data Web Events … CRM Data Warehouse Database Hadoop Data Integration … Monitoring Analytics Custom Apps Transformations Real-time Applications … Apache Open Source Confluent Open Source Confluent Enterprise Confluent Platform Confluent Platform Apache Kafka™ Core | Connect API | Streams API Data Compatibility Schema Registry Monitoring & Administration Confluent Control Center | Security Operations Replicator | Auto Data Balancing Development and Connectivity Clients | Connectors | REST Proxy | KSQL | CLI