Apache Druid 2030 (Darin Briskman, Imply) | RTA Summit 2023

First created in 2011, Apache® Druid is now in its second decade of empowering real-time analytics. How does Druid provide subsecond queries at Petabyte scale, high concurrency, and combined stream and batch analytics? What have we learned from the first decade of Druid, and what is changing? What will Druid look like in 2030?

StarTree

May 23, 2023

Transcript

  1. A few things before we start (©2023, Imply)

    I’m a member of the Druid Community, giving my own opinions and thoughts. I do not speak for the Community, nor for Imply Data. Thanks to the RTA organizers for inviting me and Imply to participate in today’s events. Pinot and Druid are open source cousins, and we’re all working together to advance real-time analytics for everyone.
  2. About me and Imply

    Imply was founded by the creators of Apache Druid® to help developers get to successful projects faster.
  3. In the Beginning

    • Needed fast processing and rollups for AdTech
    • Requirement: ingest a billion rows in under 1 minute & query them in under 1 second
    • Tried and failed with RDBMS: Greenplum, Postgres, MySQL, and InfoBright
    • Tried and failed with NoSQL
    • Decided to create a new database. How hard could it be?
    • Announced Druid on 30 Apr 2011
  4. Evolution and Open Source

    Rapid iteration as Druid drove the AdTech business. After getting interest from Netflix and others, released Druid as Open Source in 2012. Moved to Apache license in 2015. Promoted to Apache Software Foundation top-level in 2017. Used by over 1900 organizations.
  5. Global and vibrant Community

    • Companies using Druid: 1,900+
    • YoY Increase in Community Activity: 150%
    • Community Members: 14,000+
    • Active Contributors: 500+
  6. Druid Today: a Real-Time Analytics Database

    For analytics applications that require:
    1. Sub-second queries at any scale: interactive analytics on TB-PBs of data
    2. High concurrency at the lowest cost: 100s to 1000s QPS via a highly efficient engine
    3. Real-time and historical insights: true stream ingestion for Kafka and Kinesis
    Plus, non-stop reliability with automated fault tolerance and continuous backup.
  7. The right database for analytics apps makes a difference

    Without Druid OR With Druid
  8. Druid is built for the intersection of analytics and applications.

    Diagram labels: Applications; Analytics; Apache Druid.
  9. Real-Time Analytics Requires a Real-Time Database

    Diagram labels: Analytics (Data Warehouses); Applications (Transactional Databases); Real-Time Analytics Applications (Real-time Analytics Database).
    Checked (✓) requirements: Read-optimized; TB-PBs of Data; High Cardinality; Sub-Sec Response; High Concurrency; Real-time Data.
    Other labels: BI Reporting, Monthly Reporting, Static Dashboards; ACID Compliance, Small Data, Write-optimized.
  10. Examples of real-time analytics applications

    Categories: Operational Visibility at Scale; Rapid Data Exploration; Customer-facing Analytics; Real-time Decisioning.
    Examples: ICE Security Ops Platform; Citrix Analytics Service; Salesforce Edge Intelligence; Reddit Real-Time Ads.
    Powered by
  11. And many more! Focus on community success

    Retail; Financial; Gaming; Networking/Energy; Technology; Security; Ad Tech; Media
  12. Coming in May: Druid 26.0

    • Schema auto-discovery: get both high performance & flexibility
    • More ANSI SQL Compatibility: Unnest & Arrays
    • Shuffle Joins: simpler data ingestion
    Plus additional features including Sessionization, Interpolation, and Advanced Dictionary Compression.
  13. With Druid you now get both high performance & flexibility

    • Strong Data Types: High Performance
    • Data Type Discovery (New): Ease of Use like schemaless databases
  14. IoT Use Case

    Voice Assistant (sends data about each request plus periodic status updates) → Streaming Pipeline → Analytics Database → Data Analyst
  15. IoT Use Case: Adding Humidity & Temperature Data

    Voice Assistant (temperature & humidity sensors enabled) → Temp & Humidity Data → Streaming Pipeline → Analytics Database → Data Analyst
    No Broken Schemas: new columns for temperature & humidity are automatically discovered and added with the right data type to the Druid table.
    Auto-discover and add New Data Field & Type.
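The auto-discovery idea on this slide can be illustrated with a minimal Python sketch. It is not Druid's implementation: the function names, the sample records, and the widening rule (LONG and DOUBLE widen to DOUBLE, anything else falls back to STRING) are assumptions made for illustration. Only the type names are borrowed from Druid's column types.

```python
def discover_type(value):
    """Map a Python value to a coarse column type (illustrative names only)."""
    if isinstance(value, (bool, int)):
        return "LONG"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "STRING"
    return "JSON"


def widen(current, incoming):
    """Hypothetical widening rule for a column seen with two different types."""
    if current == incoming:
        return current
    if {current, incoming} == {"LONG", "DOUBLE"}:
        return "DOUBLE"
    return "STRING"


def discover_schema(rows):
    """Build a column -> type mapping, adding new columns as they first appear."""
    schema = {}
    for row in rows:
        for name, value in row.items():
            if value is None:
                continue  # a null carries no type information
            found = discover_type(value)
            schema[name] = widen(schema[name], found) if name in schema else found
    return schema


# Hypothetical events: the sensor fields only appear in later rows.
rows = [
    {"device": "va-1", "request": "play music"},
    {"device": "va-1", "request": "weather", "temperature": 21.5, "humidity": 40},
    {"device": "va-2", "temperature": 22, "humidity": 41},
]
schema = discover_schema(rows)
print(schema)
```

The first row establishes only `device` and `request`; the `temperature` and `humidity` columns are added when they first show up, without any change to the earlier rows.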
  16. Joins Prepare Incoming Data for Fast Analytics in Druid

    Diagram labels: Table 1, Table 2, Table 3; Partition 1, Partition 2, ..., Partition n.
  17. Pre-joining Data was Necessary to Bring Datasets Together

    Diagram labels: Fact Table A; Ingest, Store, Query; Third Party Tools; Real-time Analytics.
  18. Ingestion Becomes Simpler, Easier, and Less Expensive

    Druid Can Now Join Datasets at Ingestion. Diagram labels: Fact Table A; Ingest, Store, Query.
  19. Extend Standard SQL Features: Keeping up with Evolving ANSI SQL Standards

    • Unnest: “I want to UNNEST this repeated record into its own little temporary table.”
    • Array: “I wish I could do a SQL join without getting duplicate rows back.”
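What UNNEST does to a repeated field can be shown with a small Python sketch. It models only the row-expansion semantics, not Druid's SQL syntax or engine. The `unnest` helper, the `orders` data, and the choice to drop rows whose array is empty are assumptions for illustration.

```python
def unnest(rows, column, alias):
    """Expand each element of an array column into its own row.

    Rows whose array is empty or missing produce no output rows
    (an assumption of this sketch).
    """
    flattened = []
    for row in rows:
        for element in row.get(column) or []:
            flat = {key: value for key, value in row.items() if key != column}
            flat[alias] = element
            flattened.append(flat)
    return flattened


# Hypothetical orders, each with a repeated "items" field.
orders = [
    {"order_id": 1, "items": ["milk", "bread"]},
    {"order_id": 2, "items": ["eggs"]},
    {"order_id": 3, "items": []},
]
flat = unnest(orders, "items", "item")
for row in flat:
    print(row)
```

Each array element becomes one row of the "little temporary table" the slide describes, which can then be grouped or filtered like any other column.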
  20. What is the average basket size with and without groceries? And how is it trending over the last 18 months?

    Billions of transactions/month. A standard query takes too long and isn’t necessary.
  21. Statistics teaches us, when there is too much data, sample

    Sample GROUP BY: query a subset of data; Druid ensures it is statistically valid.
    Billions of transactions/month. What is the average basket size with and without groceries? And how is it trending over the last 18 months?
  22. What was the impact on sales from an approaching hurricane?

    • Automatically figure out time boundaries of a data set
    • Find day parts & start/end of impactful events
    • Time weighted averages and interpolation
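The last bullet names two standard techniques that a short Python sketch can make concrete. This is a generic illustration using linear interpolation and a trapezoidal time-weighted mean, not Druid's functions. The sample data and function names are hypothetical, and timestamps are assumed to be strictly increasing.

```python
def interpolate(samples, t):
    """Linearly interpolate a value at time t from sorted (time, value) samples."""
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the sampled range")


def time_weighted_average(samples):
    """Trapezoidal mean: each interval counts in proportion to its duration."""
    area = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        area += (v0 + v1) / 2 * (t1 - t0)
    return area / (samples[-1][0] - samples[0][0])


# Hypothetical hourly sales readings as (hour, sales) pairs, unevenly spaced.
samples = [(0, 100.0), (2, 200.0), (10, 200.0)]

print(interpolate(samples, 1))         # value estimated between two readings
print(time_weighted_average(samples))  # weights the long 2..10 interval more
```

A plain mean of the three readings would be about 166.7, while the time-weighted average is 190.0, because the value stayed at 200 for eight of the ten hours.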
  23. Advanced String Dictionary Compression

    100 TBs → 70 TBs: saving up to 30% on string storage w/ ZERO impact on performance. Efficiency of querying numbers while retaining all the flexibility of the human language.
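The "querying numbers" idea rests on dictionary encoding, which a minimal Python sketch can show. The slide does not describe the compression scheme itself, so this covers only plain dictionary encoding of a string column and makes no claim about how Druid compresses the dictionary. The function names and sample column are hypothetical.

```python
def dictionary_encode(values):
    """Replace each string with an integer code into a sorted dictionary."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    codes = [code_of[value] for value in values]
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Recover the original strings from their integer codes."""
    return [dictionary[code] for code in codes]


# Hypothetical low-cardinality string column.
column = ["us-east", "eu-west", "us-east", "us-east", "eu-west", "ap-south"]
dictionary, codes = dictionary_encode(column)

print(dictionary)  # each distinct string is stored once
print(codes)       # the column itself becomes small integers

# A filter such as region = 'us-east' compares integers instead of strings.
target = dictionary.index("us-east")
matches = [i for i, code in enumerate(codes) if code == target]
print(matches)
```

Because each distinct string is stored once, shrinking the dictionary reduces storage without changing the integer codes that queries scan.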
  24. Next Generation Infrastructure

    • 2023-24: Memory Pooling. Effective 50%+ reduction in costs of database infrastructure
    • 2025-26: Data Processing Accelerators. Co-processors that improve database performance per CPU by 3x - 20x
  25. Cloud Maturity

    • Today: ~45% of global IT infrastructure on the Cloud
    • End of 2027: ~85% of global IT infrastructure on the Cloud
    • Streaming Data and Real-Time Analytics Applications are the default on the Cloud
    • Managed services become even more dominant as a delivery model
  26. RTA and Machine Learning

    • AI/ML becomes just another piece of the analytic toolset, like regressions and sketches.
    • By 2025, all data-center class CPUs include packaged GPUs for machine learning inference acceleration.
    • Active GANs optimize segmentation and query optimization.
    • LLMs automate data gathering and (ironically) data quality
  27. Real-Time Analytics Applications in 2030

    • EB scale and beyond
    • Still subsecond response
    • Ubiquitous streaming
    • Most queries still SQL
  28. Real-Time Analytics Applications in 2030

    • Open Source wins
    • Druid is one of the projects that lead Real-Time Analytics databases (no single winner)
    • Streaming and Analytics delivered as managed services
    • Mix of edge and centralized computing
    • + ???
  29. Darin Briskman, Director of Technology, [email protected]

    • Join the Druid Community! https://druid.apache.org
    • Druid Architecture and Concepts: https://imply.io/druid-architecture-concepts/
    • Building Real-Time Analytics Applications: https://bit.ly/40AIlB6
    • Try Polaris, the Druid DBaaS: https://imply.io/polaris