Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Wvlet: A New Flow-Style Query Language For Func...

Taro L. Saito
December 12, 2024

Wvlet: A New Flow-Style Query Language For Functional Data Modeling and Interactive Data Analysis - Trino Summit 2024

Wvlet, pronounced as weave-let, is a new flow-style query language for SQL-based database engines such as Trino, DuckDB, and Hive. Wvlet queries, saved as .wv files, offer an intuitive way to describe data processing pipelines compiled into a series of SQL queries. This session will demonstrate how the flow-style query syntax enhances existing SQL engines and facilitates functional data modeling. This approach enables reusable and composable methods for constructing complex data pipelines. Wvlet is open-source and freely available at https://wvlet.org/.

Taro L. Saito

December 12, 2024
Tweet

More Decks by Taro L. Saito

Other Decks in Technology

Transcript

  1. Wvlet: A New Flow-Style Query Language For Functional Data Modeling

    and Interactive Data Analysis December 12, 2024 Trino Summit 2024 Taro L. Saito (Leo) Senior Principal Engineer at Treasure Data wvlet
  2. 2024: 50th Anniversary of SQL • SQL was born in

    1974 ◦ SQL’s early design choices paved the way for its standardization and commercial adoption • Since then, applications of SQL have widespread ◦ From RDBMS to Analytical queries in Trino, Hive, Spark, etc. ◦ From Embedded SQLite (OLTP) to DuckDB (OLAP with Parquet/Iceberg support) ◦ SQL’s user base has grown beyond database administrator (DBA) to include non-engineers 2
  3. Trino at Treasure Data (2024) • Treasure Data has been

    operating Trino (formerly Presto) as a service since 2014 (10th anniversary!) ◦ 3+ million Trino SQL queries processed / day ◦ 400+ trillion rows processed / day ◦ 3+ billion S3 GET requests / day ▪ Reduced from 10 billion requests / day (2023) by partition optimization • In 2024, we completed the customer traffic migration from Presto (350) to Trino ◦ See our migration/test methods in DBTest 2022 paper • A lot of challenges in managing SQL and helping users (or LLM) write efficient SQL queries 3
  4. What’s Wrong with SQL? • The syntactic order of SQL

    doesn’t match the actual data flow ◦ Even for SQL experts it’s difficult to debug if a query becomes complex and deeply nested ◦ A Critique of Modern SQL And A Proposal Towards A Simple and Expressive Query Language (CIDR 24) • SQL standard covers too small area and lacks essential software engineering features for managing multiple queries ◦ No built-in support for reusing and generating queries ◦ No extension point for multi-query optimization ▪ e.g., incremental processing and pipeline execution like dbt ◦ No built-in debugging or testing capabilities 4
  5. Wvlet: Modernizing 50-Year-Old SQL • SQL has proven to be

    effective and useful ◦ Its natural-language based syntax has been widely adopted even for data analytics • Redesigned SQL to be more intuitive and functional • Wvlet: a new flow-style query language for weaving data ◦ Pronounced as weave-let ◦ Queries start with “from” for intuitive data flow • Functional ◦ Wvlet queries are reusable and composable like functions • Cross SQL engine support ◦ Generates SQL for Trino, Hive, DuckDB, etc. • Try Wvlet on your web browser https://wvlet.org/ ◦ No installation is required 5 Command-line editor of Wvlet (wv)
  6. Wvlet: Architecture • From query (.wv) files, Wvlet compiler produces

    logical plans, execution plans, and SQL statements • Logical Plans ◦ Tree-representations of relational operators (e.g., scan, filter, projection) ▪ Sort(Join(Filter(TableScan(...)), TableScan(...))) • Execution Plans ◦ A sequence of steps to execute SQL and other programs ▪ ExecutePlan(ExecuteSQL(query), ExecuteTest(expr), ExecuteCommand(expr), …) 7 Query Rewrite Framework Subquery Materialization Query Fusion Differentially Private SQL Incremental Processing Generate SQL Non SQL-Extension Query Optimizer Query Compiler Logical Plans Flow-Style Query Execution Plans Query Runner Table Functions Table Data Import/Export Query Library Composable Data Models Query.wv Static Type Analysis Library Code Assembly wvlet S3
  7. Wvlet: Analyze As You Write • Flow-style queries ◦ Each

    line is a single operation • Peek the data and schema at any point ◦ A subquery becomes a range of lines in the query text • Easy to extend ◦ Queries can be reused for further analysis 8
  8. Related Work: GoogleSQL Pipe Syntax (2024) • SQL Has Problems.

    We Can Fix Them: Pipe Syntax In SQL (VLDB 2024) • Extended SQL syntax with pipe operator (|>) ◦ Available in ZetaSQL, an open-source SQL parser, used in Google products, e.g., BigQuery, F1, etc. • Other flow-style languages: ◦ PRQL, Microsoft Kusto, DryadLINQ, etc. 9 SQL SQL + Pipe Syntax
  9. Wvlet: Relational Operators (online reference) • In Wvlet, all keywords

    must be lower-case letters ◦ For consistency and reducing implementation efforts • Same operators with SQL: ◦ from, select, join, where, order by, limit, etc. • Aggregation in Wvlet ◦ group by k1, k2, … agg … ◦ agg (aggregation expr), .. ▪ Report group-by keys and aggregation expressions: k1, k2, …, expr1, expr2, … ◦ group by k1, k2, … where … ▪ equivalent to group by … having … 10 Basic Flow of Wvlet Queries Aggregation Query
  10. Wvlet: Column At A Time Operators • For minimizing your

    typing effort • add (expr) as … ◦ Add a new column • rename (column) as (new name) • exclude (column), … • shift (to left/right)? column, …. ◦ Reorder columns for readability • Changing a single column or its order in SQL is not easy as we need to enumerate all columns in SELECT 11 Wvlet SQL
  11. Wvlet: Additional Relational Operators • Refined from SQL ◦ concat

    = UNION ALL in SQL ▪ Most of SQL users wrongly use UNION (duplicate elimination) where UNION ALL (concatenate two relations) is appropriate. ◦ dedup = select distinct * (= duplicate elimination) • Utility operators ◦ transform: Update only subset of columns ◦ sample n ◦ pivot ▪ Transform column values into individual columns ▪ trino#1206 (Remains open since 2019) ▪ Wvlet can run multiple SQL queries, so it’s relatively easy to implement static/dynamic pivot 12 Wvlet SQL
  12. Wvlet: Update Statements • save as ◦ A shorthand notation

    for: ▪ DROP TABLE IF EXISTS tbl; CREATE TABLE tbl AS SELECT … • Most frequently used pattern in Treasure Data ▪ CREATE OR REPLACE TABLE AS SELECT • Available in DuckDB, Trino 431 (Since Oct 2023) ◦ For DuckDB connector, you can save query results as Parquet/JSON files • append to ◦ Almost same with INSERT INTO, but it clarifies append-only semantics (no overwrites) ▪ Useful when using versioned tables like Iceberg/Delta Lake • delete ◦ Delete selected rows from the table 13 Wvlet SQL
  13. Function Chaining via Dot Operator • In SQL, every function

    is global ◦ SUM(x), AVG(x), COUNT(*), ROUND(x, 1), … ◦ A lot of cursor movements are necessary to apply functions in SQL • Wvlet supports chaining functions with dot operator as in modern programming languages: ◦ e.g., x.sum, x.avg.round(2), _.count ◦ _ (underscore) refers to the output from the previous operator 14 Wvlet SQL
  14. Testing Queries • Wvlet Test Syntax ◦ Useful for verifying

    resulting schema and results • Wvlet is tested with Wvlet ◦ Wvlet spec queries ◦ Covers all 22 TPC-H queries 15
  15. Debugging Queries • Debug operator can be used for checking

    intermediate query results • ExecutionPlanner generates multiple execution paths for debug and regular query evaluation ◦ Debug path: Query before debug statement -> debug query ◦ Regular path: Query without debug statement 16 Regular Query Result Debug Query Result Test Result
  16. Wvlet Internals: Compiler • Compiler phases: Parser -> SymbolLabeler ->

    TypeResolver -> Rewriter -> ExecutionPlanner -> GenSQL ◦ CompilationUnit holds the source text, untyped logical plan, typed logical plan, execution plan, etc. • With AirSpec testing library, you can adjust the log levels of individual components ◦ -L (class name pattern)=(log level) option 17 Tokens LogicalPlan (After typing) Generated SQL Query Result Query (.wv)
  17. Wvlet Development Roadmap • Roadmap is maintained at GitHub Project

    page. ◦ Planning milestones for about every 3 months ◦ Versions will be YYYY.(milestone month).(patch) (See #170 for the versioning scheme) • Release 2024.9 ◦ ✅ Flow-style query language design and compiler ▪ Including scanner, parser, typer, tree rewrite framework, execution planner, SQL generator, etc. ◦ ✅ DBMS Connector (DuckDB, Trino) ◦ ✅ wv: Interactive command-line editor (REPL) ◦ ✅ Installer (Homebrew) ◦ ✅ Web UI, Playground (Monaco Editor with DuckDB-Wasm) • Release 2025.1 ◦ Model management ▪ Generate dependent model materialization plan, like dbt ▪ GitHub integration ◦ Compiler plugins for advance optimization ▪ Incremental processing, query fusion, etc. ◦ Language SDKs ▪ Python, Rust, C/C++, Java, etc. ◦ SQL to Wvlet converter 18
  18. Summary • Designed Wvlet, a new open-source flow-style query language

    • Addresses challenges in 50-year-old design of SQL ◦ Leverage good sides of SQL ▪ natural-language like syntax ◦ Intuitive syntax for data flow ◦ Reusability ◦ Extensibility • Website: https://wvlet.org/ • GitHub: https://github.com/wvlet/wvlet 19
  19. 20

  20. Wvlet Internals: Standard Library • Wvlet defines standard data types

    and functions, which define how to generate SQL ◦ example: x.to_int.round(1) => cast(x as bigint).round(1) => round(cast(x as bigint),1) 22
  21. Wvlet Internals: DBMS Specific Functions • Function can have different

    implementations depending on the target databases (Trino/Hive/DuckDB, etc.) ◦ x.count_approx_distinct ▪ Trino: approx_distinct(x) ▪ DuckDB: approx_count_distinct(x) • Works for consuming the differences between SQL dialects and UDFs 23
  22. Extending Wvlet • Compiler plugins (to be designed #185) ◦

    Add a custom rule set to optimize logical plans and execution plans. ▪ Optimization rules: Query fusion optimization, subquery materialization, incremental processing, etc. ◦ Security rules (e.g., forbidding local file access operators at the cloud environment) • Table functions receive table-value data and output table value data ◦ Not limited to SQL, we will be able to invoke ML algorithms, calling Embulk, or issuing SQL queries to different query engines by generating such execution plans. 24 Query Rewrite Framework Subquery Materialization Query Fusion Differentially Private SQL Incremental Processing Generate SQL Non SQL-Extension Query Optimizer Query Compiler Logical Plans Flow-Style Query Execution Plans Query Runner Table Functions Table Data Import/Export Query Library Composable Data Models Query.wv Static Type Analysis Library Code Assembly wvlet S3
  23. Writing A Compiler From Scratch? Leverage LLMs • Developing a

    compiler is a challenge with known difficulties and known solutions ◦ = solved problem • By leveraging LLM-based tools like GitHub Copilot, we can accelerate the development • LLMs excel at utilizing well-known solutions from open-source code. 25 GitHub Copilot suggests code in a second Indicate what you want to code
  24. Wvlet is easy to learn, even for Large Language Models

    (LLMs) • An LLM Agent, created from Wvlet documentation, can successfully convert TPC-H SQL queries into Wvlet syntax ◦ Wvlet: Query Syntax documentation (link) 26 TPC-H Q1 in Wvlet Original TPC-H Q1 SQL
  25. wv: Interactive Query Editor • Available with brew install wvlet/wvlet/wvlet

    ◦ Supports Trino, DuckDB via profile settings • Shortcut keys for checking intermediate schema and results ◦ ctrl-j, ctrl-d (describe the schema at the line) ◦ ctrl-j, ctrl-t (test run the subquery upto the line) ◦ ctrl-j, ctrl-r (run the whole query) 27 Intermediate Table Schema Subquery Result
  26. Design Philosophy of Wvlet • Link: https://wvlet.org/wvlet/docs/internal/design • Matching syntax

    order with data flow ◦ Typing from left-to-right, top-to-bottom order • Use only lowercase keywords ◦ Forbid mixing SELECT and select • One operation at a time ◦ SQL’s SELECT operator is too powerful ◦ Breaking down SELECT statement into simple operators (add, transform, exclude, shift, etc) • Human and machine friendly-syntax ◦ Query lines form subqueries ▪ Trailing comma support is essential to correctly extract subqueries in an executable form ◦ Enable in-query data/schema inspection 28 SQL Wvlet
  27. Related Work: SaneQL (Thomas Neumann, et al. 2024) • The

    syntactic order of SQL doesn’t match with the semantic order of data processing • Redesigned SQL by using a function chain syntax ◦ A Critique of Modern SQL And A Proposal Towards A Simple and Expressive Query Language (CIDR 24) 29
  28. Related Work: PRQL (Prequel) • Pipelined Relational Query Language (Since

    2022) https://prql-lang.org/ • Written in Rust, compiling PRQL to SQL ◦ Syntax is a bit far from SQL, more like DataFrame API or programming language syntax 30