Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Yugabyte & Databricks Spark workshop at Distrib...

AMEY BANARSE
September 22, 2021

Yugabyte & Databricks Spark workshop at Distributed SQL Summit(DSS) 2021

Session Youtube video link:https://www.youtube.com/watch?v=2NpwlFyMhtE

session blog: https://www.yugabyte.com/blog/tutorial-building-applications-with-yugabytedb-and-spark/

At Distributed SQL Summit 2021, I presented a workshop on how to build an application using the YugabyteDB Spark Connector and Databricks Spark to deliver business outcomes for our customers.

The YugabyteDB Spark Connector combines the best-of-breed technologies of Databricks Spark — an industry-leading, distributed computing engine — with YugabyteDB, a modern, cloud-native distributed SQL database. This connector allows customers to seamlessly and natively read from, perform complex ETL, and write to YugabyteDB.

AMEY BANARSE

September 22, 2021
Tweet

More Decks by AMEY BANARSE

Other Decks in Technology

Transcript

  1. © 2021 - All Rights Reserved 1 Building Applications with

    YugabyteDB and Databricks Spark Amey Banarse Wei Wang
  2. © 2021 - All Rights Reserved Who we are Wei

    Wang VP of Field Engineering, Yugabyte, Inc. Pivotal • FINRA • NYSE University of Pennsylvania (UPenn) @ameybanarse about.me/amey Amey Banarse Principal Pre Sales Architect Rackspace • Accenture • HP Enterprise University of Oklahoma @wwang0825 2
  3. © 2021 - All Rights Reserved Workshop Agenda 3 •

    Overview of Yugabyte Architecture • Yugabyte’s YCQL API • YugabyteDB Spark Connector • Hands-on Workshop with Databricks Spark
  4. © 2021 - All Rights Reserved Transactional, distributed SQL database

    designed for resilience and scale 100% open source, PostgreSQL compatible, enterprise-grade RDBMS …..built to run across all your cloud environments 4
  5. © 2021 - All Rights Reserved Designed for cloud native

    microservices. 5 Sharding & Load Balancing Raft Consensus Replication Distributed Transaction Manager & MVCC Document Storage Layer Custom RocksDB Storage Engine DocDB Distributed Document Store Yugabyte Query Layer YSQL YCQL PostgreSQL Google Spanner YugabyteDB SQL Ecosystem ✓ Massively adopted ✘ New SQL flavor ✓ Reuse PostgreSQL RDBMS Features ✓ Advanced Complex ✘ Basic cloud-native ✓ Advanced Complex and cloud-native Highly Available ✘ ✓ ✓ Horizontal Scale ✘ ✓ ✓ Distributed Txns ✘ ✓ ✓ Data Replication Async Sync Sync + Async
  6. © 2021 - All Rights Reserved 6 ◦ Strongly-consistent secondary

    indexes ◦ Native JSON support ◦ Geographic location hints to optimize cost and latency ◦ Distributed ACID transactions ◦ Geo-distributed ◦ Supports Cassandra CQL 3.9.x and 4.x Features Cassandra YCQL Clustering and sharding ✓ ✓ Keyspaces and tables ✓ ✓ Indexes & unique constraints ✓ ✓ Transactions ✓ ✓ Strongly consistent secondary indexes 𐄂 ✓ Native JSON 𐄂 ✓ Geographic location hints 𐄂 ✓ In addition to Cassandra CQL API support, YCQL adds the following enhancements: Yugabyte Query Layer - YCQL
  7. © 2021 - All Rights Reserved 7 Secondary Indexes -

    YCQL CREATE TABLE product_rankings ( asin text, category text, sales_rank int, ... PRIMARY KEY (asin, category) ); CREATE INDEX top_products_in_category ON product_rankings (category, sales_rank, asin); Partition by product ID for efficient lookups Secondary index to list top products in a category
  8. © 2021 - All Rights Reserved 8 Secondary Indexes -

    YCQL SELECT * FROM product_rankings WHERE asin = '0684841363'; SELECT * FROM product_rankings WHERE category = 'Books' LIMIT 10 OFFSET 20; Query by ID for product summary data Query by index to list top products in a category by sales rank (best sellers)
  9. © 2021 - All Rights Reserved Global Transactions Multi-Row/Multi-Shard Operations

    At Scale 9 CREATE TABLE orders ( order_id text PRIMARY KEY, user_id uuid, order_details jsonb, ... ) WITH transactions = {'enabled': 'true'}; Update inventory and orders tables atomically. Simple to enable global transactions on any table Use JSON type for flexible schema objects.
  10. © 2021 - All Rights Reserved 10 Global Transactions Multi-Row/Multi-Shard

    Operations At Scale BEGIN TRANSACTION UPDATE product_inventory SET quantity = quantity - 2 WHERE asin = '0684841363'; INSERT INTO orders (order_id, user_id, order_details, … ) VALUES ('<order-id>', '<user-id>', '{id: "0684841363", quantity: 2, … }', … ); END TRANSACTION; Decrease inventory count for products fulfilled. Query and index by JSON attributes if needed! Add to orders table for products purchased.
  11. © 2021 - All Rights Reserved 12 YugabyteDB Spark Connector

    Source Tables Derived Tables Enrichment / pre-aggregation Batch Aggregates Key features: • Native JSONB support - ex. Column pruning • Performance optimizations with predicate pushdowns • Cluster, topology and partition awareness
  12. © 2021 - All Rights Reserved What we will be

    building in this workshop? Integrating Apache Spark with Yugabyte Cloud A Scala Application Yugabyte Cloud Read From Write To YugabyteDB Spark Connector 3.0-yb-8 - Compatible with Spark 3.0 and Scala 2.12 - Compatible with Yugabyte YCQL 3.7+ - Exposes YCQL tables as Spark RDDs and Datasets/DataFrames - Saves RDDs /DataFrames back to Cassandra by implicit saveToCassandra call - Allows for execution of arbitrary CQL statements Yugabyte Cloud - Data source and target for Spark application - Namespace: test - Table: Employees_json Spark application: Native support of JSON - Read from and write to Yugabyte cloud - Perform sample ETL operation: Window function - Process JSON data type Apache Spark 3.x YugabyteDB Spark Connector Spark APIs Perform ETL 13
  13. 19 Yugabyte Confidential © 2019 All rights reserved. YCQL Example

    CREATE KEYSPACE example WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '3'};; CREATE TABLE user_actions(user_id INT, ts TIMESTAMP, action TEXT, PRIMARY KEY((user_id), ts)) WITH CLUSTERING ORDER BY (ts DESC); INSERT INTO user_actions(user_id, ts, action) VALUES (1, '2000-12-2 12:30:15', 'log in'); INSERT INTO user_actions(user_id, ts, action) VALUES (1, '2000-12-2 12:30:25', 'change password'); INSERT INTO user_actions(user_id, ts, action) VALUES (1, '2000-12-2 12:30:35', 'log out');
  14. 20 Yugabyte Confidential © 2019 All rights reserved. Databricks Scala

    Notebook (ycql) import org.apache.spark.SparkConf import org.apache.spark.SparkContext import com.datastax.spark.connector._ import com.datastax.spark.connector.cql._ val keyspace = "example" val table = "user_actions" val tableDf = spark.read .format("org.apache.spark.sql.cassandra") .options(Map( "table" -> table, "keyspace" -> keyspace)) .load() tableDf.show() +-------+-------------------+---------------+ |user_id| ts| action| +-------+-------------------+---------------+ | 1|2000-12-02 12:30:35| log out| | 1|2000-12-02 12:30:25|change password| | 1|2000-12-02 12:30:15| log in| +-------+-------------------+---------------+
  15. 21 Yugabyte Confidential © 2019 All rights reserved. YSQL Example

    CREATE SCHEMA cycling; DROP TABLE IF EXISTS cycling.cyclist_name; CREATE TABLE cycling.cyclist_name ( id bigint PRIMARY KEY, lastname character varying (50), firstname character varying (50) ); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (1, 'VOS','Marianne'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (2, 'VAN DER BREGGEN','Anna'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (3, 'FRAME','Alex'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (4, 'TIRALONGO','Paolo'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (5, 'KRUIKSWIJK','Steven'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (6, 'MATTHEWS', 'Michael');
  16. 22 Yugabyte Confidential © 2019 All rights reserved. Databricks Scala

    Notebook (ysql) val jdbcDF = spark.read .format("jdbc") .option("url", "jdbc:postgresql://xxx.xxx.xxx.xxx:5433/test") .option("dbtable", "cycling.cyclist_name") .option("user", "yugabyte") .option("password", "<password>") .load() jdbcDF.show() +---+---------------+---------+ | id| lastname|firstname| +---+---------------+---------+ | 3| FRAME| Alex| | 5| KRUIKSWIJK| Steven| | 4| TIRALONGO| Paolo| | 2|VAN DER BREGGEN| Anna| | 6| MATTHEWS| Michael| | 1| VOS| Marianne| +---+---------------+---------+
  17. © 2021 - All Rights Reserved Thank You Join us

    on Slack: yugabyte.com/slack Star us on Github: github.com/yugabyte/yugabyte-db