Irina Nazarova
July 21, 2024
Liz Heym Catching Waves With Time-Series Data, SF Bay Area Ruby Meetup July 18 2024

Transcript

  1. We’ll cover:
      - How to select a tool for managing time-series data
      - How to organize, query, and aggregate time-series data
      - How to translate your design to API constraints
  2. But first! What is time-series data?
      Time-series data is a collection of observations recorded over consistent intervals of time.
  3. A surfer’s goal
      Liz has just taken her first surf lesson, and she’s keen on learning how her surfing will improve over time. She’s decided to record this data in a time-series database and to access it via an API endpoint. But where does she start?
  4. Selecting the right board for the conditions
      1. Surf a board you already have: use a time-series DB already in your tech stack
      2. Use the old board, but add a new set of fins: use an extension for a DB you already use
      3. Buy a new board: adopt a new DB technology
      4. Shape your own board: design your own DB
  5. 1. Surf a board you already have
      If you already have a database that’s well-suited for time-series data, why change? Maybe you just need to adjust your techniques!
  6. 2. Keep the old board, but add a new set of fins
      • Old board = Postgres
      • New fins = Postgres extension
      • A few options: pg_timeseries or TimescaleDB
  7. 3. Buy a new board
      Sometimes, your existing tools don’t cut it, and you need to invest in something entirely new. ClickHouse is a fast, open-source analytical database, designed around time-series data.
  8. 4. Shape your own board
      Sometimes, no available database seems suited to your highly specific needs. In 2008, the engineers at Meraki found themselves in this position, and LittleTable was born.
  9. 4. Shape your own board: LittleTable
      • Relational database
      • Optimized for time-series data
      • Data clustered for continuous disk access
      • SQL interface for querying
      LT White Paper
  10. The Perfect Technique
      • Now that you have a board, you need to learn how to surf it!
      • Much like in surfing, there are tried-and-true techniques for best handling time-series data.
      • We’ll cover:
        1. Data arranged by time
        2. Hierarchically-delineated key
        3. Querying by index
        4. Aggregation and Compression
  11. 1. Data arranged by time
      • Key feature of a time-series DB
      • ClickHouse automatically generates an index on the ts column
      • Performant when accessing a range of time
      • LittleTable is append-only
  12. 2. Hierarchically-delineated key
      • In addition to being grouped by time, data is organized according to this composite key.
      • Crucial to understand how this data is going to be accessed: not every query will be efficient
  13. 2. Hierarchically-delineated key
      • Organize by increasing specificity
      • Cisco Meraki’s example from the previous slide: Network, Device
      • For Liz’s surfing application: Surfer, Region, Break
  14. 3. Querying by index: LittleTable
      • LittleTable is organized across two axes: composite key and time
        ◦ Only need a prefix of the key
      • Performant query for LittleTable:
        ◦ Surfer, Region, Timestamp
  15. 3. Querying by index: ClickHouse
      • ClickHouse includes the timestamp at the end of the composite index
        ◦ So you must query with the full key
      • Non-performant query
        ◦ Surfer, Timestamp
      • Performant query
        ◦ Surfer, Region, Break, Timestamp
      Examples: Liz, LA, Malibu, over the past month; Liz, Humboldt, Moonstone, over the past month
      [Diagram labels: Two weeks … Two weeks]
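The difference between the two ClickHouse queries above can be illustrated with a small sketch. It is not how ClickHouse or LittleTable is implemented; it only models a sorted key of (surfer, region, break, timestamp) as a Python list, and every name and data value in it is hypothetical. With the full key, binary search finds one contiguous slice. With only the surfer, the search can narrow to that surfer's rows, but the timestamp has to be filtered row by row.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

# Hypothetical rows: (surfer, region, break, timestamp, distance_m).
# Sorting puts rows that share a key prefix next to each other.
ROWS = sorted([
    ("liz", "humboldt", "moonstone", datetime(2024, 7, 2), 40.0),
    ("liz", "la", "malibu", datetime(2024, 6, 20), 55.0),
    ("liz", "la", "malibu", datetime(2024, 7, 10), 80.0),
    ("liz", "la", "venice", datetime(2024, 7, 11), 30.0),
    ("sam", "la", "malibu", datetime(2024, 7, 10), 95.0),
])
KEYS = [row[:4] for row in ROWS]  # the composite key, timestamp last


def scan_full_key(surfer, region, brk, start, end):
    """Full key plus time range: a single contiguous slice."""
    lo = bisect_left(KEYS, (surfer, region, brk, start))
    hi = bisect_right(KEYS, (surfer, region, brk, end))
    return ROWS[lo:hi]


def scan_surfer_only(surfer, start, end):
    """Only the first key column is known: narrow to the surfer's rows,
    then check the timestamp of each of those rows one at a time."""
    lo = bisect_left(KEYS, (surfer,))
    hi = bisect_left(KEYS, (surfer + "\x00",))  # just past this surfer
    return [row for row in ROWS[lo:hi] if start <= row[3] <= end]
```

In this toy model the second function still returns correct results; it simply touches every row for the surfer, which is the cost the slide warns about.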
  16. 4. Aggregation and Compression
      • Time-series data can pile up fast
      • Two needs:
        ◦ Don’t have infinite storage
        ◦ Also want to show as much data as possible
  17. 4. Aggregation and Compression
      • Don’t have infinite storage
        ◦ Data retention
        ◦ Time-to-live
      • Also want to show as much data as possible
        ◦ Compression
        ◦ Aggregation
  18. 4. Aggregation: LittleTable
      • Base table and aggregate table
      • Base table (data per wave):
        ◦ Distance, Duration
      • Aggregate table (data per interval of time):
        ◦ Total distance, total duration, max speed, wave count
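A rough sketch of rolling base rows up into a one-day aggregate row might look like the following. The `Wave` type, the field names, and the units are hypothetical. The slide does not say how max speed is derived, so this sketch assumes it is the highest per-wave average speed (distance divided by duration).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Wave:
    """One base-table row: a single wave (hypothetical field names)."""
    ridden_at: datetime
    distance_m: float
    duration_s: float


def aggregate_by_day(waves):
    """Roll per-wave rows up into one aggregate row per calendar day."""
    days = defaultdict(lambda: {
        "total_distance_m": 0.0,
        "total_duration_s": 0.0,
        "max_speed_mps": 0.0,
        "wave_count": 0,
    })
    for wave in waves:
        agg = days[wave.ridden_at.date()]
        agg["total_distance_m"] += wave.distance_m
        agg["total_duration_s"] += wave.duration_s
        agg["wave_count"] += 1
        if wave.duration_s > 0:
            # Assumption: speed is the wave's average, distance / duration.
            speed = wave.distance_m / wave.duration_s
            agg["max_speed_mps"] = max(agg["max_speed_mps"], speed)
    return dict(days)
```

The same loop could bucket by week or month by changing the grouping key.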
  19. 4. Aggregation: LittleTable
      • We can aggregate the data over the following intervals:
        ◦ Base table (TTL of 1 month)
        ◦ One day (TTL of 6 months)
        ◦ One week (TTL of 1 year)
        ◦ One month (TTL of 5 years)
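The retention tiers above can be written down as a small lookup, sketched here under stated assumptions: the table names are hypothetical, and months and years are approximated as fixed day counts (30, 182, 365, and 5 × 365 days). The helper reports which tables would still hold data recorded at a given time.

```python
from datetime import datetime, timedelta

# TTL per table, from the slide; day counts are approximations.
RETENTION = {
    "base": timedelta(days=30),          # 1 month
    "daily": timedelta(days=182),        # 6 months
    "weekly": timedelta(days=365),       # 1 year
    "monthly": timedelta(days=5 * 365),  # 5 years
}


def is_expired(table, recorded_at, now):
    """True if a row recorded at `recorded_at` has outlived the table's TTL."""
    return now - recorded_at > RETENTION[table]


def tables_holding(recorded_at, now):
    """Tables, finest first, that still retain data from `recorded_at`."""
    return [t for t in RETENTION if not is_expired(t, recorded_at, now)]
```

Older data survives only in the coarser tables, which is how the scheme trades detail for storage.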
  20. Getting out there
      We have our data:
      • Stored
      • Aggregated
      • Easily accessible
      Now we design an API endpoint that Liz can use to easily query her surf data.
  21. Getting out there: Query params
      • Required
        ◦ Surfer
        ◦ Timespan
      • Optional
        ◦ Region
        ◦ Break
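One way to enforce these parameters is sketched below. The parameter names, the `SurfQuery` type, and the choice of days as the timespan unit are all hypothetical. The rule that a break requires a region is also an assumption: it follows from the key order Surfer, Region, Break, where a query needs a prefix of the key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SurfQuery:
    surfer: str
    timespan_days: int                # assumption: timespan given in days
    region: Optional[str] = None
    surf_break: Optional[str] = None


def parse_query(params: dict) -> SurfQuery:
    """Validate raw query params and return a typed query object."""
    missing = [name for name in ("surfer", "timespan") if not params.get(name)]
    if missing:
        raise ValueError(f"missing required params: {', '.join(missing)}")
    # Assumption: the key is Surfer, Region, Break, so a break without
    # its region would skip a level of the key prefix.
    if params.get("break") and not params.get("region"):
        raise ValueError("break requires region")
    timespan_days = int(params["timespan"])
    if timespan_days <= 0:
        raise ValueError("timespan must be positive")
    return SurfQuery(
        surfer=params["surfer"],
        timespan_days=timespan_days,
        region=params.get("region"),
        surf_break=params.get("break"),
    )
```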
  22. Getting out there: Timespan and interval
      • timespan = the full period of time over which we want data.
        ◦ Our longest TTL is 5 years: that’s the max timespan
      • interval = the grain at which the data is aggregated
        ◦ Calculated based on the timespan
      • The interval options are:
        ◦ One day (TTL 6 months)
        ◦ One week (TTL 1 year)
        ◦ One month (TTL 5 years)
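The slide says the interval is calculated from the timespan but does not give the rule. One plausible rule, sketched here as an assumption, is to pick the finest interval whose TTL still covers the whole timespan and to reject anything longer than 5 years. Months and years are approximated as fixed day counts, and the function name is hypothetical.

```python
from datetime import timedelta

# (interval name, TTL of its aggregate table), finest grain first.
INTERVALS = [
    ("day", timedelta(days=182)),        # 6 months, approximated
    ("week", timedelta(days=365)),       # 1 year
    ("month", timedelta(days=5 * 365)),  # 5 years, the max timespan
]


def choose_interval(timespan: timedelta) -> str:
    """Return the finest interval whose table retains the full timespan."""
    if timespan <= timedelta(0):
        raise ValueError("timespan must be positive")
    for name, ttl in INTERVALS:
        if timespan <= ttl:
            return name
    raise ValueError("timespan exceeds the 5-year maximum")
```

Under this rule a request for the past month is served at one-day grain, while a three-year request falls back to one-month grain.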