Charity Majors on Scuba: Diving into Data at Facebook

Facebook takes performance monitoring seriously. Performance issues can impact over one billion users, so we track thousands of servers, hundreds of PB of daily network traffic, hundreds of daily code changes, and many other metrics. We require latencies of under a minute from events occurring (a client request on a phone, a bug report filed, a code change checked in) to graphs showing those events on developers’ monitors.

Scuba is the data management system Facebook uses for most real-time analysis. Scuba is a fast, scalable, distributed, in-memory database built at Facebook. It currently ingests millions of rows (events) per second and expires data at the same rate. Scuba stores data completely in memory on hundreds of servers each with 144 GB RAM. To process each query, Scuba aggregates data from all servers. Scuba processes almost a million queries per day. Scuba is used extensively for interactive, ad hoc, analysis queries that run in under a second over live data. In addition, Scuba is the workhorse behind Facebook’s code regression analysis, bug report monitoring, ads revenue monitoring, and performance debugging.

Papers_We_Love

June 26, 2017

Transcript

  1. Me

  2. Scuba at Facebook (Requirements): It must be *fast*. It must be (ridiculously) flexible. Fast and nearly-right is infinitely better than slow and perfectly-right. Should be usable by non-engineers. Results must be live in a few seconds.
  3. Scuba at Facebook (Implementation, Examples): Started sometime in 2011, to improve dire MySQL visibility. Hack(athon) upon hack(athon). Less than 10k lines of C++. Designed? Evolved. Changed my life (and not just me).
  4. Paper overview (Introduction): “We used to rely on metrics and pre-aggregated time series, but our databases are exploding (and we don’t know why)” — Facebook, 2011
  5. Embedded in this whitepaper are lots of clues about how to do event-driven or distsys debugging at scale.
  6. Paper overview (Use Cases): Site Reliability; Bug Monitoring; Performance Tuning; Trend Analysis; Feature Adoption; Ad Impressions, Clicks, Revenue; A/B Testing; Egress/Ingress; Bug Report Monitoring; Real-Time Post Content Monitoring; Pattern Mining; … etc.
  7. Paper overview (Ingestion/Distribution): Accept structured events, timestamped by the client, with a sample_rate per row. No schema, no indexes. Wide and sparse ‘tables’. Pick two leaves for each incoming write and send the batch to the leaf with more free memory; load variance goes from O(log N) to O(1) (see the sketch after the transcript). Delete old data at the same rate as new data is written.
  8. Paper overview (Query Execution): Root/Intermediate/Leaf aggregators & Leaf servers. Root: parse and validate the query, fan out to self+4. Intermediate: fan out to self+4 (until only talking to a local Leaf Aggregator, based on the sum of records to tally), then consolidate records and pass them back up to the Root. Leaf server: full table scan lol. (A sketch of this fanout tree also follows the transcript.)
  9. Paper overview (Performance Model & Client Experiments): “We wanted to write a whitepaper, so we slapped some sorta formal-looking stuff on here lol”
  10. Paper overview (Performance Model & Client Experiments): “We wanted to write a whitepaper, so we slapped some sorta formal-looking stuff on here lol”
  11. Why you should care: In the future, every system will be a distributed system. You don’t know what you don’t know. You can’t predict what data you will need. You NEED high-cardinality tooling. You need exploratory, ad hoc analysis for unknown unknowns. Everything is a tradeoff, but these are better tradeoffs for the future.
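
The “pick two leaves” rule on slide 7 is the classic power-of-two-choices placement: sample two leaf servers at random for each incoming batch and write to whichever has more free memory, which is what drops the expected load variance from O(log N) to O(1). A minimal C++ sketch, assuming hypothetical names (LeafNode, free_bytes, pick_leaf) rather than Scuba’s actual internals:

    #include <cstdint>
    #include <random>
    #include <vector>

    // Illustrative stand-in for a Scuba leaf server (not the real API).
    struct LeafNode {
        uint64_t free_bytes = 0;  // memory currently available for new rows
    };

    // Power-of-two-choices placement: sample two random leaves and return the
    // index of the one with more free memory. Writing each batch to that leaf
    // keeps servers far more evenly loaded than a single random choice would.
    // Assumes a non-empty leaf set.
    size_t pick_leaf(const std::vector<LeafNode>& leaves, std::mt19937& rng) {
        std::uniform_int_distribution<size_t> dist(0, leaves.size() - 1);
        size_t a = dist(rng);
        size_t b = dist(rng);
        return (leaves[a].free_bytes >= leaves[b].free_bytes) ? a : b;
    }

Two random choices capture most of the benefit of checking every server, without any of the coordination cost.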
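
Slide 8’s query path — a root aggregator fanning out to itself plus four children, intermediates doing the same down to the leaf level, leaves answering with a full in-memory scan, and partial results consolidated on the way back up — can be sketched roughly as below. This is a toy under simplifying assumptions: a count-style aggregate and hand-rolled Aggregator/Row/Query shapes; real Scuba queries carry grouping keys, aggregate functions, and time ranges.

    #include <cstdint>
    #include <vector>

    // Toy query and row: count rows at or after a minimum timestamp.
    struct Query { int64_t min_timestamp; };
    struct Row   { int64_t timestamp; };

    // A leaf server answers a query with a full scan of its in-memory rows.
    uint64_t leaf_scan(const std::vector<Row>& rows, const Query& q) {
        uint64_t count = 0;
        for (const Row& r : rows)
            if (r.timestamp >= q.min_timestamp) ++count;
        return count;
    }

    // An aggregator holds one co-located leaf ("self") and up to four child
    // aggregators ("+4"). Each level consolidates its children's partial
    // results before passing a single partial back up toward the root.
    struct Aggregator {
        std::vector<Row> local_rows;        // the leaf co-located with this node
        std::vector<Aggregator*> children;  // up to 4 child aggregators

        uint64_t execute(const Query& q) const {
            uint64_t total = leaf_scan(local_rows, q);  // self
            for (const Aggregator* child : children)    // + 4
                total += child->execute(q);             // consolidate partials
            return total;
        }
    };

Because the partial result here is a single number, consolidation is just a sum; with group-bys it becomes a merge of per-key partial aggregates at every tier of the tree.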