Upgrade to Pro — share decks privately, control downloads, hide ads and more …

ITT 2018 - Tim Berglund - Processing Streaming ...

ITT 2018 - Tim Berglund - Processing Streaming Data with KSQL

Apache Kafka is a de facto standard streaming data processing platform, being widely deployed as a messaging system, and having a robust data integration framework (Kafka Connect) and stream processing API (Kafka Streams) to meet the needs that common attend real-time message processing. But there’s more!

Kafka now offers KSQL, a declarative, SQL-like stream processing language that lets you define powerful stream-processing applications easily. What once took some moderately sophisticated Java code can now be done at the command line with a familiar and eminently approachable syntax. Come to this talk for an overview of KSQL with live coding on live streaming data.

Istanbul Tech Talks

April 17, 2018
Tweet

More Decks by Istanbul Tech Talks

Other Decks in Programming

Transcript

  1. Scalable Consumption consumer group A producer consumer group A consumer

    group B consumer group B … … … partition 1 partition 2 partition 3 Partitioned Topic
  2. Logs and Pub/Sub consumer A producer consumer B 8 7

    6 4 3 2 1 5 first record latest record
  3. Stream Processing by Analogy Kafka Cluster Connect API Stream Processing

    Connect API $ cat < in.txt | grep “ksql” | tr a-z A-Z > out.txt
  4. KSQL for Data Exploration SELECT status, bytes FROM clickstream WHERE

    user_agent = ‘Mozilla/5.0 (compatible; MSIE 6.0)’; An easy way to inspect data in a running cluster
  5. KSQL for Streaming ETL • Kafka is popular for data

    pipelines. • KSQL enables easy transformations of data within the pipe. • Transforming data while moving from Kafka to another system. CREATE STREAM vip_actions AS 
 SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id 
 WHERE u.level = 'Platinum';
  6. KSQL for Anomaly Detection CREATE TABLE possible_fraud AS
 SELECT card_number,

    count(*)
 FROM authorization_attempts 
 WINDOW TUMBLING (SIZE 5 SECONDS)
 GROUP BY card_number
 HAVING count(*) > 3; Identifying patterns or anomalies in real-time data, surfaced in milliseconds
  7. KSQL for Real-Time Monitoring • Log data monitoring, tracking and

    alerting • Sensor / IoT data CREATE TABLE error_counts AS 
 SELECT error_code, count(*) 
 FROM monitoring_stream 
 WINDOW TUMBLING (SIZE 1 MINUTE) 
 WHERE type = 'ERROR' 
 GROUP BY error_code;
  8. KSQL for Data Transformation CREATE STREAM views_by_userid WITH (PARTITIONS=6, VALUE_FORMAT=‘JSON’,

    TIMESTAMP=‘view_time’) AS 
 SELECT * FROM clickstream PARTITION BY user_id; Make simple derivations of existing topics from the command line
  9. Where is KSQL not such a great fit? BI reports

    (Tableau etc.) • No indexes • No JDBC (most BI tools are not good with continuous results!) Ad-hoc queries • Limited span of time usually retained in Kafka • No indexes
  10. CREATE STREAM clickstream ( time BIGINT, url VARCHAR, status INTEGER,

    bytes INTEGER, userid VARCHAR, agent VARCHAR) WITH ( value_format = ‘JSON’, kafka_topic=‘my_clickstream_topic’ ); Creating a Stream
  11. CREATE TABLE users ( user_id INTEGER, registered_at LONG, username VARCHAR,

    name VARCHAR, city VARCHAR, level VARCHAR) WITH ( key = ‘user_id', kafka_topic=‘clickstream_users’, value_format=‘JSON'); Creating a Table
  12. CREATE STREAM vip_actions AS SELECT userid, fullname, url, status 


    FROM clickstream c 
 LEFT JOIN users u ON c.userid = u.user_id WHERE u.level = 'Platinum'; Joins for Enrichment
  13. • Starts a CLI and a server in the same

    JVM • Ideal for developing on your laptop bin/ksql-cli local • Or with customized settings bin/ksql-cli local --properties-file ksql.properties #1 STAND-ALONE AKA ‘LOCAL MODE’ How to run KSQL
  14. How to run KSQL JVM KSQL Server KSQL CLI JVM

    KSQL Server JVM KSQL Server Kafka Cluster #2 CLIENT-SERVER
  15. • Start any number of server nodes bin/ksql-server-start • Start

    one or more CLIs and point them to a server bin/ksql-cli remote https://myksqlserver:8090 • All servers share the processing load Technically, instances of the same Kafka Streams Applications Scale up/down without restart How to run KSQL #2 CLIENT-SERVER
  16. How to run KSQL Kafka Cluster JVM KSQL Server JVM

    KSQL Server JVM KSQL Server #3 AS A STANDALONE APPLICATION
  17. • Start any number of server nodes Pass a file

    of KSQL statement to execute bin/ksql-node query-file=foo/bar.sql • Ideal for streaming ETL application deployment Version-control your queries and transformations as code • All running engines share the processing load Technically, instances of the same Kafka Streams Applications Scale up/down without restart How to run KSQL #3 AS A STANDALONE APPLICATION
  18. How to run KSQL Kafka Cluster #4 EMBEDDED IN AN

    APPLICATION JVM App Instance KSQL Engine Application Code JVM App Instance KSQL Engine Application Code JVM App Instance KSQL Engine Application Code
  19. • Embed directly in your Java application • Generate and

    execute KSQL queries through the Java API Version-control your queries and transformations as code • All running application instances share the processing load Technically, instances of the same Kafka Streams Applications Scale up/down without restart How to run KSQL #4 EMBEDDED IN AN APPLICATION