Berlin 2013 - Big Graphite Workshop - Devdas Bhagat

Monitorama
September 20, 2013
Transcript

  1. Graphite basics
     • Graphite is a web-based graphing program for plotting time series data
     • Written in Python
     • Consists of multiple separate daemons
     • Has its own storage backend
       – Like RRD, but with more features
  2. Moving parts
     • Whisper/Ceres
       – The storage backend
     • Webapp
       – Web frontend and API provider
     • Relaying daemons
       – Event-based daemons
       – Match input based on name
       – Relay to one or more destinations based on rules or hashing
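The relay's routing decision described above can be sketched as a small Python loop: try ordered regex rules first, then fall back to hashing the metric name onto a destination. The rule patterns, destination names, and simple modulo hash here are invented for illustration; they are not Graphite's actual configuration or hashing scheme.

```python
import re
from hashlib import md5

# Invented destinations; a real carbon-relay reads these from its config.
DESTINATIONS = ["store-a:2004", "store-b:2004", "store-c:2004"]

# Ordered regex rules, checked first (rule-based relaying).
RULES = [
    (re.compile(r"^sys\."), "store-a:2004"),
    (re.compile(r"^user\."), "store-b:2004"),
]

def route(metric_name):
    """Return the destination for one incoming metric line."""
    for pattern, dest in RULES:
        if pattern.search(metric_name):
            return dest
    # No rule matched: hash the metric name onto a destination so the
    # same metric always lands on the same store.
    bucket = int(md5(metric_name.encode()).hexdigest(), 16) % len(DESTINATIONS)
    return DESTINATIONS[bucket]
```

Because the hash depends only on the metric name, every relay instance makes the same decision without shared state.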
  3. Original production setup
     • A small cluster
       – We were planning to grow slowly
     • RAID 1+0 spinning disk setup
       – It works for our databases
     • Ran into the IO wall
       – Spinning rust sucks at IO
       – Whisper updates force crazy seek patterns
  4. Scaling problems
     • We started with hosts in a /24 feeding one box
     • We ran into IO issues when we added the second /24
       – On the second day
  5. Sharding
     • Added more backends
     • Manual rules to split traffic coming into the Graphite setup across storage nodes
     • This becomes hard to maintain and balance
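A manual split of this kind is typically expressed in carbon's relay-rules.conf; a sketch might look like the following (the hostnames and patterns are invented, not the ones from the talk):

```ini
# relay-rules.conf (sketch; hostnames and patterns are invented)
[systems]
pattern = ^sys\.
destinations = storage01:2004

[users]
pattern = ^user\.
destinations = storage02:2004

[default]
default = true
destinations = storage03:2004
```

Every new prefix or storage node means editing rules like these by hand, which is exactly the maintenance burden the slide describes.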
  6. Speeding up IO
     • Moved to 400 GB SSDs from HP in RAID 1
     • We got performance
       – Not as much as we needed
     • Losing an SSD meant the host crashed
       – Negating the whole RAID 1 setup
     • SSDs aren't as reliable as spinning rust in high-update scenarios
  7. Naming conventions
     • None in the beginning
     • We adopted
       – sys.* for systems metrics
       – user.* for user testing metrics
       – Anything else that made sense
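A convention like this is easy to check mechanically; a minimal sketch, assuming the two prefixes named above (the function and example metric names are invented):

```python
# The two prefixes the convention above names explicitly; other
# sensible prefixes were also allowed.
STANDARD_PREFIXES = ("sys.", "user.")

def has_standard_prefix(name):
    """True if a metric name uses one of the two named prefixes."""
    return name.startswith(STANDARD_PREFIXES)

# e.g. sys.web01.cpu.idle, user.checkout.latency_ms
```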
  8. Metrics collectors
     • Collectd ran into memory problems
       – Used too much RAM
     • Switch to Diamond
       – Python application
       – Base framework + metric collection scripts
       – Added custom patches for internal metrics
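The "base framework + metric collection scripts" pattern can be sketched roughly as follows; this is a generic imitation of the shape, not Diamond's actual API or class names:

```python
import time

class Collector:
    """Sketch of a collector base framework (not Diamond's real API)."""

    def publish(self, name, value):
        # A real framework would ship this to Graphite's line protocol
        # ("name value timestamp"); here we just print that line.
        print("%s %s %d" % (name, value, int(time.time())))

    def collect(self):
        # Each collection script overrides this.
        raise NotImplementedError

class LoadAvgCollector(Collector):
    """One collection script: publish the 1-minute load average (Linux)."""

    def collect(self):
        with open("/proc/loadavg") as f:
            one_min = f.read().split()[0]
        self.publish("sys.loadavg.01", float(one_min))
```

Custom internal metrics then become one more small subclass rather than a patch to a monolithic agent.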
  9. Relaying
     • We started with relays only on the cluster
       – Relaying was done based on regex matching
     • Ran into CPU bottlenecks as we added nodes
       – Spun up relay nodes in each datacenter
     • Did not account for organisational growth
       – CPU was still a bottleneck
     • Ran multiple relays on each host
       – HAProxy used as a load balancer
       – Pacemaker used for cluster failover
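Balancing several relay processes on one host behind HAProxy might look like this haproxy.cfg fragment (the ports and server names are invented):

```
# haproxy.cfg fragment (sketch; ports and server names are invented)
frontend graphite_in
    mode tcp
    bind *:2003
    default_backend carbon_relays

backend carbon_relays
    mode tcp
    balance roundrobin
    server relay1 127.0.0.1:2013 check
    server relay2 127.0.0.1:2014 check
    server relay3 127.0.0.1:2015 check
```

Since carbon-relay is effectively single-threaded, running one process per core and spreading the TCP streams across them is what relieves the CPU bottleneck.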
  10. statsd
     • We added statsd early on
     • We didn't use it for quite some time
       – Found that our PCI vulnerability scanner reliably crashed it
       – Patched it to handle errors, log and throw away bad input
     • The first major use was for throttling external provider input
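The "handle errors and throw away bad input" patch amounts to defensive parsing of the statsd line format (`name:value|type`); a sketch of the idea in Python, not the actual statsd (Node.js) code:

```python
def parse_statsd_line(line):
    """Parse one 'name:value|type' statsd line.

    Returns (name, value, type), or None for malformed input (e.g.
    garbage bytes from a vulnerability scanner) instead of crashing.
    """
    try:
        name, rest = line.split(":", 1)
        value, mtype = rest.split("|", 1)
        if mtype not in ("c", "ms", "g", "s"):
            raise ValueError("unknown metric type %r" % mtype)
        return (name, float(value), mtype)
    except ValueError:
        # Bad input: log in a real daemon, then drop it.
        return None
```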
  11. Business metrics
     • Turns out, our developers like Graphite
     • They didn't understand RRD/Whisper semantics though
       – Treat Graphite queries as if they were SQL
     • Create a very large number of named metrics
       – Not much data in each metric, but the request was for 5.3 TiB of space
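The reason sparse metrics are so expensive is that whisper preallocates every file at creation; rough back-of-the-envelope arithmetic (whisper stores roughly a 16-byte header, 12 bytes per archive descriptor, and 12 bytes per point; the retention policy below is an invented example, not the one from the talk):

```python
# Rough whisper file-size arithmetic.
HEADER = 16          # file metadata
ARCHIVE_HEADER = 12  # per-archive descriptor
POINT = 12           # 4-byte timestamp + 8-byte double per point

def whisper_file_size(archives):
    """archives: list of (seconds_per_point, points) tuples."""
    points = sum(n for _, n in archives)
    return HEADER + ARCHIVE_HEADER * len(archives) + POINT * points

# Example retention: 10s for a day, 1min for a month, 1h for a year.
archives = [(10, 8640), (60, 43200), (3600, 8760)]
size = whisper_file_size(archives)  # ~710 KiB per metric, fully preallocated

# Even a metric receiving one data point a day costs the full file,
# so lots of named metrics add up fast:
metrics_for_5_3_tib = (5.3 * 2**40) / size  # on the order of millions
```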
  12. Sharding – take 2
     • Manually maintaining regexes became painful
       – Two datacenters
       – 10 backend servers
     • Keeping disk usage balanced was even harder
       – We didn't know who would create metrics and when (this is a feature, not a bug)
  13. Sharding – take 2
     • Introduce hashing
     • Switch from RAID 1 to RAID 0
     • Store data in two locations in a ring
     • Mirror rings between datacenters
     • Move metrics around so we don't lose data
     • Ugly shell scripts to synchronise data between datacenters
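"Store data in two locations in a ring" can be sketched as a small consistent-hash ring that picks two distinct nodes per metric; the node names, virtual-node count, and hash choice are invented for illustration, not carbon's actual implementation:

```python
import bisect
from hashlib import md5

NODES = ["store01", "store02", "store03", "store04"]  # invented names
REPLICAS = 2   # each metric lives in two locations in the ring
VNODES = 64    # virtual nodes per server, to smooth the distribution

def _hash(key):
    return int(md5(key.encode()).hexdigest(), 16)

# Sorted ring of (hash, node) points.
ring = sorted((_hash("%s:%d" % (n, v)), n) for n in NODES for v in range(VNODES))
points = [h for h, _ in ring]

def nodes_for(metric):
    """Walk clockwise from the metric's hash, collecting REPLICAS distinct nodes."""
    found, i = [], bisect.bisect(points, _hash(metric))
    while len(found) < REPLICAS:
        node = ring[i % len(ring)][1]
        if node not in found:
            found.append(node)
        i += 1
    return found
```

Adding a shard only reclaims the ring segments adjacent to its points, which is why metrics then have to be physically moved to their new homes to avoid losing history.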
  14. Using Graphite
     • Graphs
       – Time series data
     • Dashboards
       – Developers create their own
       – Overhead displays
     • Additional charting libraries
       – D3.js
     • Nagios
       – Trend-based alerting
  15. Current problems
     • Hardware
       – CPU usage
         • Too easy to saturate
       – Disk IO
         • We saturate disks
         • Reading can get a bit … slow
       – Disks
         • SSDs die under update load
  16. More interesting problems
     • Software breaking on updates
       – We have had problems recording data after upgrading whisper
     • Horizontal scalability
       – Adding shards is hard
       – Replacing SSDs is getting a bit expensive
     • People
       – Want a graph, throw the data at Graphite
       – Even if it isn't time series data, or is only one record a day
  17. Things we are looking at
     • Second-order rate of change alerting
       – Not just the trend, the rate at which it changes
     • OpenTSDB storage
     • Anomaly detection
       – Skyline, etc.
     • Tracking even more business metrics
     • Hiring people to work on such fun problems
       – Developers, Sysadmins ...
       – http://www.booking.com/jobs
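"Second-order rate of change" is simply the series differenced twice; a toy illustration of the idea (not an alerting product):

```python
def second_order_change(series):
    """Rate of change of the rate of change: diff the series twice."""
    first = [b - a for a, b in zip(series, series[1:])]
    return [b - a for a, b in zip(first, first[1:])]
```

A metric growing linearly has a constant first difference and a zero second difference, so accelerating growth shows up as positive second differences long before the absolute value looks alarming.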
  18. ?