The Architecture of a Distributed Analytics and Storage Engine for Massive Time-Series Data

The numerical analysis of time-series data isn't new. The scale of today's problems is. With millions of concurrent data streams, some of which run at 1MM samples per second, storing the data and keeping it continuously available for analysis is a daunting challenge.

Theo Schlossnagle

February 26, 2015

Transcript

  1. A fly-by tour of the design of Snowth A distributed

    database for storage and analysis of time-series telemetry http://l42.org/FQE
  2. Problem Space • System Availability • Significant Retention (10 years)

• > 10^7 different metrics • Frequency Range [1mHz - 1GHz] • ~1ms for time range retrieval • Support tomorrow’s “data scientist” https://www.flickr.com/photos/design-dog/4358548056
  3. A rather epic data storage problem. What we are scribing

    to disk: 1@1min : 525,000/yr; 10MM@1min : 5.25×10^12/yr; 1@1kHz : 31.5×10^9/yr; 10MM@1kHz : 3.15×10^18/yr. Photo by: Nicolas Buffler (CC BY 2.0) (modified)
  4. Storing data requires a Data Format (stats) In: some number

    of samples. Out: number of samples, average, stddev, counter, counter stddev, derivative, derivative stddev (in 32 bytes)
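    The slide gives only the seven derived values and the 32-byte total; one plausible packing (the field widths and ordering below are assumptions, not Snowth's actual format) is:

      /* hypothetical layout of a 32-byte per-period stats record */
      #include <stdint.h>

      struct stats_rollup {
        uint64_t count;              /* number of samples in the period */
        float    avg;                /* average                         */
        float    stddev;             /* standard deviation              */
        float    counter;            /* counter value                   */
        float    counter_stddev;     /* counter stddev                  */
        float    derivative;         /* derivative value                */
        float    derivative_stddev;  /* derivative stddev               */
      };                             /* 8 + 6*4 = 32 bytes              */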
  5. Storing data requires a Data Format (histogram) In: lots of

    measurements Out: a set of buckets representing two significant digits of precision in base ten and a count of samples seen in that bucket.
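    A minimal sketch of the two-significant-digit, base-10 bucketing described above (the names are illustrative, not Snowth's):

      #include <math.h>

      struct hist_bucket { int mantissa; int exponent; };  /* bucket identity */

      /* map a sample to the bucket that keeps two significant decimal digits */
      static struct hist_bucket bucket_of(double v) {
        struct hist_bucket b = { 0, 0 };
        if (v == 0.0) return b;
        double a = fabs(v);
        b.exponent = (int)floor(log10(a)) - 1;          /* power of the last kept digit */
        b.mantissa = (int)(a / pow(10.0, b.exponent));  /* 10 .. 99 */
        if (v < 0) b.mantissa = -b.mantissa;
        return b;
      }
      /* e.g. 4.73e-3 lands in { mantissa 47, exponent -4 }, i.e. [4.7e-3, 4.8e-3);
         the histogram then stores only a count of samples per bucket. */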
  6. Managing the economics Histograms We solve this problem by supporting

    “histogram” as a first-class datatype within Snowth. Introduce some controlled time error. Introduce some controlled value error.
  7. I didn’t come to talk about this [Illustration 14: ZFS Meta Object Set layout (uberblock_phys_t, dnode_phys_t, object directory)]

    On-disk format: We use a combination of a fork of leveldb and proprietary on-disk formats… it has also changed a bit over time and stands to change a bit going forward… but that would be a different talk.
  8. [Screenshot: graph dashboard listing 91 graphs, 21 per page, sorted by last updated; titles include snowth6 IO latency, anomaly examples, beacon request rate, MQ volume (fq), API request rate, metric velocity, Snowth DC3 space, Snowth NNT aggregate put calls, Snowth cluster peer lag, metrics seen by broker, public broker noitd memory, metrics / second]
  9. Understanding the Data: science + big data. This is not

    a new world, but we felt our constraints made the solution space new.
  10. Quick Recap ❖ Multi-petabyte scale ❖ Zero downtime ❖

    Fast retrieval ❖ Fast data-local math
  11. High-level architecture Consistent Hashing 2^256 buckets, not v-buckets K-V, but

    V are append-only http://www.flickr.com/photos/colinzhu/312559485/
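    A minimal sketch of placement on that 2^256-point ring, assuming keys and node slots are both positioned by a 256-bit hash (SHA-256 here purely for illustration; the slide states the ring size, not the hash) and that each node owns several slots, as in the diagrams that follow:

      #include <openssl/sha.h>
      #include <string.h>

      struct slot { unsigned char pos[SHA256_DIGEST_LENGTH]; int node; };

      /* slots[] is sorted ascending by pos; a key's owner is the first slot
         at or past the key's own position, wrapping around the ring. */
      static int owner_of(const char *key, size_t keylen,
                          const struct slot *slots, int nslots) {
        unsigned char pos[SHA256_DIGEST_LENGTH];
        SHA256((const unsigned char *)key, keylen, pos);
        for (int i = 0; i < nslots; i++)
          if (memcmp(slots[i].pos, pos, sizeof(pos)) >= 0)
            return slots[i].node;
        return slots[0].node;   /* wrapped past the top of the ring */
      }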
  12. [Diagram: consistent-hash ring of 24 slots, four per node (n1-1 … n6-4)]
  13. [Diagram: the ring with an object o1 placed on it]
  14. [Diagram: the ring with o1 (continued)]
  15. [Diagram: the ring of 24 slots]
  16. [Diagram: the ring of 24 slots]
  17. [Diagram: the ring with its slots split across Availability Zone 1 and Availability Zone 2]
  18. [Diagram: the zoned ring with o1 placed on it]
  19. [Diagram: Availability Zone 1 and Availability Zone 2 with o1 placed]
  20. [Diagram: a full set of 24 slots in each availability zone]
  21. Problems ❖ It turns out we spend a ton of

    time writing logs. ❖ So we wrote a log subsystem, optionally asynchronous ❖ non-blocking MPSC FIFO between publishers and the log writer (see the sketch below) ❖ one thread dedicated per log sink (usually a file) ❖ support for POSIX files, jlogs, and pluggable log writers (modules) ❖ We also have a synchronous in-memory ring-buffer log (w/ debugger support) ❖ DTrace instrumentation of logging calls (this is life-alteringly useful)
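    A minimal sketch (not libmtev's actual implementation) of that non-blocking MPSC hand-off: producers push formatted lines with a compare-and-swap, and the dedicated per-sink writer thread grabs the whole batch at once and reverses it back into FIFO order:

      #include <stdatomic.h>
      #include <stddef.h>

      struct logmsg {
        struct logmsg *next;
        char line[256];                 /* one formatted log line */
      };

      static _Atomic(struct logmsg *) log_head = NULL;   /* shared push point */

      /* producer side: callable from any thread, never blocks on a lock */
      static void log_publish(struct logmsg *m) {
        struct logmsg *old = atomic_load(&log_head);
        do { m->next = old; }
        while (!atomic_compare_exchange_weak(&log_head, &old, m));
      }

      /* consumer side: the per-sink writer thread drains everything at once */
      static struct logmsg *log_drain_fifo(void) {
        struct logmsg *batch = atomic_exchange(&log_head, NULL);  /* take all */
        struct logmsg *fifo = NULL;
        while (batch) {                 /* reverse the LIFO batch into FIFO */
          struct logmsg *next = batch->next;
          batch->next = fifo;
          fifo = batch;
          batch = next;
        }
        return fifo;                    /* oldest message first */
      }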
  22. Write Path Architecture 1.5 [Diagram: Data Submission enters the event loop, which hands the job to a single I/O worker; access and error logs are written along the way]
  23. Write Path Architecture 2.0 [Diagram: Data Submission enters the event loop, which dispatches the job across workers WL1, WL2, and WL3; access and error logs are written along the way]
  24. Problems ❖ The subtasks have contention based on key

    locality: ❖ updating different metrics vs. different times of one metric* (*only for some backends)
  25. Write Path Architecture 3.0 [Diagram: Data Submission enters the event loop; the job flows through WL1, WL2, and WL3.1 / WL3.2 / … / WL3.n, with the WL3 lane chosen by hashing on the resource; access and error logs are written along the way]
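    A minimal sketch of the "hash on resource" step: every write for a given metric is routed to the same WL3 lane, so different metrics spread across lanes while one metric's updates stay serialized. The hash (FNV-1a) and the lane count are illustrative choices, not necessarily Snowth's:

      #include <stdint.h>
      #include <string.h>

      #define WL3_LANES 8   /* WL3.1 .. WL3.n */

      static uint64_t fnv1a64(const void *data, size_t len) {
        const unsigned char *p = data;
        uint64_t h = UINT64_C(0xcbf29ce484222325);
        for (size_t i = 0; i < len; i++) {
          h ^= p[i];
          h *= UINT64_C(0x100000001b3);
        }
        return h;
      }

      /* pick the WL3 worker queue for a write to this metric */
      static int wl3_lane_for(const char *metric_key) {
        return (int)(fnv1a64(metric_key, strlen(metric_key)) % WL3_LANES);
      }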
  26. Problems ❖ plockstat showed we had significant contention ❖ writing

    replication journals ❖ we have several operations in each subtask ❖ operations that can be performed asynchronously to the subtask
  27. Write Path Architecture [Diagram: Data Submission enters the event loop; the job flows through WL1, WL2, and WL3.1 / WL3.2 / … / WL3.n (lane chosen by hashing on the resource); journal writes are dispatched as separate jobs to Journal node 1, Journal node 2, … Journal node n; access and error logs are written along the way]
  28. Job Queues [EVENTLOOP THREAD:X] ❖ while true ❖ while try

    jobJ <- queue:BX ❖ jobJ do “asynch cleanup” ❖ eventloop sleep for activity ❖ some event -> callback ❖ jobJ -> queue:W1 & sem_post()
    [JOBQ:W1 THREAD:Y] ❖ while true ❖ wakes up from sem_wait() ❖ jobJ <- queue:W1 ❖ jobJ do “asynch work” ❖ insert queue:BX ❖ wakeup eventloop on thr:X
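    A minimal sketch of that hand-off (names are illustrative, not libmtev's API): thread X enqueues the job on W1 and sem_post()s; thread Y sem_wait()s, does the asynch work, returns the job to X's backlog queue BX, and wakes the event loop so X can run the asynch cleanup:

      #include <pthread.h>
      #include <semaphore.h>
      #include <stddef.h>

      struct job {
        void (*asynch_work)(struct job *);     /* runs on the jobq thread      */
        void (*asynch_cleanup)(struct job *);  /* runs later on the event loop */
        struct job *next;
      };
      struct queue { struct job *head; pthread_mutex_t lock; };

      static struct queue queue_W1 = { NULL, PTHREAD_MUTEX_INITIALIZER };
      static struct queue queue_BX = { NULL, PTHREAD_MUTEX_INITIALIZER };  /* drained by thread X */
      static sem_t sem_W1;             /* sem_init(&sem_W1, 0, 0) at startup */

      static void enqueue(struct queue *q, struct job *j) {
        pthread_mutex_lock(&q->lock); j->next = q->head; q->head = j;
        pthread_mutex_unlock(&q->lock);
      }
      static struct job *dequeue(struct queue *q) {
        pthread_mutex_lock(&q->lock);
        struct job *j = q->head;
        if (j) q->head = j->next;
        pthread_mutex_unlock(&q->lock);
        return j;
      }

      /* EVENTLOOP THREAD:X -- an event fires and its callback offloads work */
      void eventloop_offload(struct job *j) {
        enqueue(&queue_W1, j);
        sem_post(&sem_W1);             /* wake jobq thread Y */
      }

      /* JOBQ:W1 THREAD:Y */
      void *jobq_thread(void *unused) {
        (void)unused;
        for (;;) {
          sem_wait(&sem_W1);           /* wakes up when a job arrives       */
          struct job *j = dequeue(&queue_W1);
          if (j == NULL) continue;
          j->asynch_work(j);           /* "asynch work" off the event loop  */
          enqueue(&queue_BX, j);       /* back to X for "asynch cleanup"    */
          /* then wake the event loop on thread X
             (port_send / kevent / eventfd, see the next slide) */
        }
        return NULL;
      }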
  29. Job Queues: implementation ❖ online thread concurrency is mutable ❖

    smoothed mean wait time and run time ❖ will return a job to origin thread for synchronous completion ❖ BFM job abortion using signals with sigsetjmp/siglongjmp [DRAGONS] ❖ we don’t use this feature in Snowth ❖ eventloop wakeup using: port_send/kevent/eventfd. Photograph by Annie Mole
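    A minimal sketch of the eventfd flavour of that wakeup (the Linux variant; port_send and kevent are the illumos and BSD mechanisms the slide lists):

      #include <stdint.h>
      #include <sys/eventfd.h>
      #include <unistd.h>

      static int wakeup_fd;   /* created once: wakeup_fd = eventfd(0, EFD_NONBLOCK); */

      /* called by a jobq thread after it hands a job back to the event loop */
      static void wakeup_eventloop(void) {
        uint64_t one = 1;
        (void)write(wakeup_fd, &one, sizeof(one));
      }

      /* called by the event loop when wakeup_fd polls readable */
      static void drain_wakeups(void) {
        uint64_t count;
        (void)read(wakeup_fd, &count, sizeof(count));   /* resets the counter */
      }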
  30. Job Completion - simple refcnt ❖ begin ❖ refcnt ->

    1 ❖ add initial jobs… ❖ dec(refcnt) ->? 0 : complete ❖ add job: ❖ inc(refcnt) ❖ complete job: ❖ dec(refcnt) ->? 0 : complete
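    A minimal sketch of that refcount pattern with C11 atomics (illustrative, not Snowth's code):

      #include <stdatomic.h>

      struct batch {
        atomic_int refcnt;
        void (*complete)(struct batch *);
      };

      static void batch_begin(struct batch *b, void (*complete)(struct batch *)) {
        atomic_init(&b->refcnt, 1);                 /* begin: refcnt -> 1 */
        b->complete = complete;
      }
      static void batch_add_job(struct batch *b) {  /* add job: inc(refcnt) */
        atomic_fetch_add(&b->refcnt, 1);
      }
      static void batch_job_done(struct batch *b) { /* complete job: dec(refcnt) */
        if (atomic_fetch_sub(&b->refcnt, 1) == 1)
          b->complete(b);                           /* ->? 0 : complete */
      }
      static void batch_end(struct batch *b) {      /* after adding the initial jobs, */
        batch_job_done(b);                          /* drop the begin reference       */
      }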

  31. So, how does all this play out? What’s the performance

    look like? A telemetry store has benefits: highly different workloads, mostly uni-modal
  32. Visualizing all I/O latency the slice: 3.2⨉10^6

    samples the graph: 300⨉10^6 samples retrieval pipeline is simple
  33. Nothing is ever as simple as it seems. Retrieval seems

    easy… but it must be accessible to data scientists, who want to run math near the data; make it safe and make it fast
  34. Computation is cheap Movement is expensive* It’s like packing a

    truck, driving it to another state to have the inventory counted vs. just packing a truck and counting. https://www.flickr.com/photos/kafka4prez/ *usually
  35. Allowing data-local analysis Enabling Data Scientists Code in C? (no)

    Must be fast. Must be process-local. LuaJIT.
  36. Problems ❖ Lua (and LuaJIT) ❖ are not multi-thread safe

    ❖ garbage collection can wreak havoc in high performance systems ❖ lua’s math support is somewhat limited
  37. Leveraging multiple cores for computation Threads ❖ Separate lua state

    per OS thread: NPT ❖ Shared state requires lua/C crossover ❖ lua is very good at this, but… still presents significant impedance.
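    A minimal sketch of one lua_State per NPT via a pthread key, so no Lua state is ever shared across OS threads (the wrapper names are illustrative; the Lua C API calls are stock Lua/LuaJIT):

      #include <lua.h>
      #include <lualib.h>
      #include <lauxlib.h>
      #include <pthread.h>

      static pthread_key_t lua_key;

      static void lua_state_destroy(void *L) { lua_close((lua_State *)L); }

      void npt_lua_init_once(void) {          /* call once at process startup */
        pthread_key_create(&lua_key, lua_state_destroy);
      }

      lua_State *npt_lua_state(void) {        /* call from any NPT */
        lua_State *L = pthread_getspecific(lua_key);
        if (L == NULL) {
          L = luaL_newstate();                /* this thread's private state */
          luaL_openlibs(L);                   /* standard libraries          */
          pthread_setspecific(lua_key, L);
        }
        return L;
      }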
  38. Tail collection Garbage Collection woes ❖ NPTs compete for work:

    ❖ wait for work (consume) ❖ disable GC ❖ do work -> report completion ❖ enable GC ❖ force full GC run https://www.flickr.com/photos/neate_photos/6160275942
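    A minimal sketch of that GC gating using the stock lua_gc() API (the work-loop hooks are hypothetical stand-ins):

      #include <lua.h>

      /* hypothetical hooks for this sketch */
      void *wait_for_work(void);
      void run_lua_analysis(lua_State *L, void *work);
      void report_completion(void *work);

      void npt_work_loop(lua_State *L) {
        for (;;) {
          void *work = wait_for_work();      /* wait for work (consume)       */
          if (work == NULL) break;
          lua_gc(L, LUA_GCSTOP, 0);          /* disable GC                    */
          run_lua_analysis(L, work);         /* do work                       */
          report_completion(work);           /* -> report completion          */
          lua_gc(L, LUA_GCRESTART, 0);       /* enable GC                     */
          lua_gc(L, LUA_GCCOLLECT, 0);       /* force full GC run at the tail */
        }
      }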
  39. Tail collection Maths, Math, and LuaJIT ❖ We use (a

    forked) numlua: ❖ FFTW*, BLAS, LAPACK, CDFs ❖ It turns out that LuaJIT is: wicked fast for our use-case. ❖ Memory management is an issue.
  40. Overall (simplified) Architecture [Diagram: Data Access enters the event loop; jobs flow through WL1, WL2, and WL3.1 … WL3.n (hash on resource); journal jobs go to Journal node 1, Journal node 2, … Journal node n; an NPT job handles the data-local math; access and error logs are written along the way]
  41. The birth of mtev - https://github.com/circonus-labs/libmtev Heavy lifting: libmtev mtev

    was a project to make the eventer itself multi-core and make it all a library https://www.flickr.com/photos/kartlasarn/6477880613
  42. Mount Everest Framework [Diagram of libmtev components: Log Subsystem (log access, log error), Config management, Multi-core Eventloop, Dynamic Job Queues, Online Console (# show mem, # write mem, # shutdown), POSIX/TLS HTTP Protocol Listener, Hook Framework, DSO Modules, LuaJIT Integration; https:// listener marked COMING SOON]
  43. References ❖ Circonus - http://www.circonus.com ❖ libmtev - https://github.com/circonus-labs/libmtev ❖

    Concurrency Kit - http://concurrencykit.org ❖ LuaJIT - http://luajit.org ❖ More on Snowth - http://l42.org/EwE ❖ plockstat - https://www.illumos.org/man/1M/plockstat