PDX 2017 - Pedro Andrade

Monitorama

May 23, 2017

Transcript

  1. CERN
     • Founded in 1954 by 12 European states
     • Associate members: India, Pakistan, Turkey, Ukraine
     • Staff members: 2,300
     • Other personnel: 1,400
     • Scientific users: 12,500
  2. LHC data rates
     • 40M events per second → HW trigger → 100k per second → SW trigger → ~1k per second recorded
     • From ~1 PB per second down to ~4 GB per second
  3. CERN Data Centres
     • Main site: Geneva, Switzerland, built in the 70s
     • Extension site: Budapest, Hungary, 3x100 Gb links
     • Commodity HW
  4. Worldwide LHC Computing Grid
     • WLCG provides global computing resources to store, distribute and analyse the LHC data
     • The CERN data centre (Tier-0) distributes LHC data to the other WLCG sites (Tier-1, Tier-2, Tier-3)
     • Global collaboration of more than 170 data centres in 42 countries around the world
  5. About WLCG
     • A community of 10,000 physicists
     • ~250,000 jobs running concurrently
     • 600,000 processing cores
     • 15% of the resources are at CERN
     • 700 PB of storage available worldwide
     • 20-40 Gbit/s links connect CERN to the Tier-1s
     Roles per tier:
     • Tier-0 (CERN): initial data reconstruction, data distribution, data recording & archiving
     • Tier-1s (13 centres): initial data reconstruction, permanent storage, re-processing, analysis
     • Tier-2s (>150 centres): simulation, end-user analysis
  6. Monitoring Team
     • Provide a common infrastructure to measure, collect, transport, visualize, process and alarm on monitoring data from the CERN Data Centres and the WLCG collaboration
  7. Monitoring Data
     • Wide variety of data, as metrics or logs
       • Data centre HW and OS
       • Data centre services
       • WLCG site/service monitoring
       • WLCG job monitoring and data management
     • Spiky workload with an average of 500 GB/day
     • When things go bad there is more data and more users
  8. Monitoring Architecture
     • Common solution / open source tools
     • Scalable infrastructure / empower users
     • Pipeline layers: Sources, Transport, Processing, Access, Storage
  9. Sources
     • Internal data sources
       • Data centre OS and HW
       • Based on Collectd: 30k nodes, 1-minute samples
     • External data sources
       • Data centre services and WLCG
       • A mix of push and pull models (see the push sketch below)
       • Different technologies and protocols
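The slides do not show what a pushed event looks like on the wire. As one possibility, a producer can POST to a Flume HTTP source, whose stock JSONHandler accepts a JSON array of {headers, body} events. A minimal Scala sketch, assuming a hypothetical endpoint and illustrative header/field names (not the actual CERN schema):

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object PushMetric {
  def main(args: Array[String]): Unit = {
    // Hypothetical Flume HTTP source endpoint; host and port are placeholders.
    val url = new URL("http://flume-http.example.cern.ch:8080/")

    // Flume's default JSONHandler expects a JSON array of {headers, body} events.
    val payload =
      """[{"headers": {"producer": "fts", "type": "metric"},
        |   "body": "{\"host\": \"node42\", \"metric\": \"transfer_rate\", \"value\": 42.0}"}]""".stripMargin

    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))
    println(s"Flume HTTP source answered ${conn.getResponseCode}")
    conn.disconnect()
  }
}
```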
  10. Transport
     [Architecture diagram: metrics, logs, DC, DB, HTTP and AMQ data sources enter through dedicated Flume agents (Metrics, Logs, HTTP, JDBC, JMS) that publish into Kafka; Flume agents consume from Kafka on the way out.]
  11. Transport
     • 10 different types of Flume agents, e.g. JMS/AVRO, AVRO/KAFKA, KAFKA/ESSINK
     • File-based channels (small capacity)
     • Several interceptors and morphlines for validation and transformation
       • Check mandatory fields, apply the common schema (see the sketch below)
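The deck does not include the actual interceptor or morphline configurations. As a rough illustration of the "check mandatory fields" step, here is a minimal Flume interceptor written in Scala against Flume's public Interceptor API; the header names are assumptions, not the real common schema:

```scala
import java.util.{List => JList}
import scala.collection.JavaConverters._

import org.apache.flume.{Context, Event}
import org.apache.flume.interceptor.Interceptor

/** Drops events that lack the mandatory header fields of the (assumed) common schema. */
class MandatoryFieldsInterceptor extends Interceptor {
  private val mandatory = Seq("producer", "type", "timestamp")  // illustrative field names

  override def initialize(): Unit = ()

  override def intercept(event: Event): Event = {
    val headers = event.getHeaders
    // Returning null tells Flume to drop the event.
    if (mandatory.forall(headers.containsKey)) event else null
  }

  override def intercept(events: JList[Event]): JList[Event] =
    events.asScala.map(intercept).filter(_ != null).asJava

  override def close(): Unit = ()
}

object MandatoryFieldsInterceptor {
  /** Builder referenced from the Flume agent configuration. */
  class Builder extends Interceptor.Builder {
    override def build(): Interceptor = new MandatoryFieldsInterceptor
    override def configure(context: Context): Unit = ()
  }
}
```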
  12. Transport
     • Kafka is the rock-solid core of the transport layer
     • Data buffered for 12h (target is 72h)
     • Each data source in a separate topic
     • Each topic divided into 20 partitions
     • Two replicas per partition (see the sketch below)
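A topic laid out this way can be created programmatically. The sketch below uses Kafka's AdminClient API (which became available shortly after this talk) to create one topic per data source with 20 partitions, 2 replicas and roughly 12 hours of retention; the broker address and topic name are placeholders:

```scala
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._

import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import org.apache.kafka.common.config.TopicConfig

object CreateMonitoringTopic {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.example.cern.ch:9092")  // placeholder
    val admin = AdminClient.create(props)

    // One topic per data source: 20 partitions, 2 replicas, ~12h of buffering.
    val topic = new NewTopic("fts_metrics", 20, 2.toShort)
      .configs(Map(TopicConfig.RETENTION_MS_CONFIG -> (12 * 3600 * 1000L).toString).asJava)

    admin.createTopics(Collections.singleton(topic)).all().get()
    admin.close()
  }
}
```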
  13. Processing
     [Diagram: Flume feeds Kafka, which provides buffering; enrichment and aggregation stages read from and write back to Kafka; Flume consumes the results.]
  14. Processing
     • Stream processing jobs (Scala)
       • Enrichment jobs: e.g. topology metadata (see the sketch below)
       • Aggregation jobs: mostly over time
       • Correlation jobs: e.g. CPU load vs service activity
     • Batch processing jobs (Scala)
       • Compaction and reprocessing jobs
       • Monthly or yearly reports for management
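The deck does not include job code, but an enrichment job of this kind can be sketched with Spark Structured Streaming in Scala: read a topic from Kafka, parse the JSON payload into a DataFrame, join it with static topology metadata, and write the result out with checkpoints on HDFS. The schema, paths, topic and broker names below are assumptions, and for simplicity the result is written to HDFS rather than back to Kafka as in the diagram:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object EnrichMetrics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("enrich-metrics").getOrCreate()
    import spark.implicits._

    // Illustrative schema; the real common schema is not shown in the slides.
    val schema = new StructType()
      .add("host", StringType)
      .add("metric", StringType)
      .add("value", DoubleType)
      .add("timestamp", TimestampType)

    // Static topology metadata (host -> service/site), assumed to live on HDFS with a "host" column.
    val topology = spark.read.json("hdfs:///project/monitoring/topology/")

    val metrics = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka.example.cern.ch:9092")  // placeholder broker
      .option("subscribe", "fts_metrics")
      .load()
      .select(from_json($"value".cast("string"), schema).as("m"))
      .select("m.*")

    // Enrichment: attach service/site information to every metric.
    val enriched = metrics.join(topology, Seq("host"), "left_outer")

    enriched.writeStream
      .format("json")
      .option("path", "hdfs:///project/monitoring/enriched/fts/metrics/")
      .option("checkpointLocation", "hdfs:///project/monitoring/checkpoints/fts/")  // checkpoints on HDFS
      .start()
      .awaitTermination()
  }
}
```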
  15. Processing
     • Jobs built and packaged as Docker images
       • Automated on every push with GitLab CI
     • Jobs orchestrated as processes on Mesos
       • Marathon for long-lived processes (e.g. streaming)
       • Chronos for recurring executions (e.g. batch)
     • Jobs executed on Mesos or YARN clusters
  16. Storage
     • HDFS for long-term archive and data recovery
       • Data kept forever (limited by available resources)
     • Elasticsearch for searches and data discovery
       • Data kept for 1 month
     • InfluxDB for plots and dashboards
       • Raw data kept for 7 days; 5-minute bins kept for 1 month, 1-hour bins kept for 5 years
  17. Storage
     • All data written to HDFS by default
       • e.g. /project/monitoring/archive/fts/metrics/2017/05/22/
       • Data aggregated once per day into ~1 GB files (see the sketch below)
     • Selected data sets stored in ES and/or InfluxDB
       • ES: two generic instances (metrics and logs)
       • ES: each data producer in a separate index
       • InfluxDB: one instance per data producer
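A daily compaction pass of the kind described above can be sketched as a small Spark batch job in Scala that rewrites one day's partition into a handful of larger files; the output path and file count are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object CompactDay {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-archive").getOrCreate()

    // Day partition to compact; the path layout follows the slide's example.
    val day = "2017/05/22"
    val in  = s"hdfs:///project/monitoring/archive/fts/metrics/$day/"
    val out = s"hdfs:///project/monitoring/archive/fts/metrics/$day.compacted/"  // illustrative target

    // Rewrite many small files into a few large ones; in practice the file count
    // would be derived from the input size so each output file is ~1 GB.
    spark.read.json(in)
      .coalesce(4)
      .write
      .json(out)
  }
}
```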
  18. Alarms
     • Based on an existing in-house toolset
       • Old implementation, more than 5 years old
       • Local alarms from metric thresholds
     • Moving towards a multi-scope strategy
       • Local alarms from Collectd, to enable local actuators
       • Basic alarms from Grafana, easy integration
       • Advanced alarms from Spark jobs, user specific (see the sketch below)
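As a rough sketch of an "advanced alarm from a Spark job", the Scala snippet below scans enriched metrics for hosts whose average CPU load exceeded a threshold; the input path, field names and threshold are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CpuLoadAlarm {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cpu-load-alarm").getOrCreate()
    import spark.implicits._

    // Enriched metrics written by the streaming jobs; path and columns are illustrative.
    val metrics = spark.read.json("hdfs:///project/monitoring/enriched/collectd/metrics/2017/05/22/")

    // Raise an alarm for hosts whose average CPU load stayed above a threshold.
    val alarms = metrics
      .filter($"metric" === "cpu_load")
      .groupBy($"host", $"service")
      .agg(avg($"value").as("avg_load"))
      .filter($"avg_load" > 0.9)

    alarms.write.json("hdfs:///project/monitoring/alarms/cpu_load/2017/05/22/")
  }
}
```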
  19. Monitoring Infrastructure
     • OpenStack VMs
       • 60 Flume, 20 Kafka, 10 Spark, 50 Elasticsearch
       • Some nodes with Ceph volumes attached
     • Physical nodes
       • Only used for InfluxDB and HDFS
     • All configuration done via Puppet
  20. Lessons Learned
     • Protocol-based Flume agents
       • Extremely handy, very easy to add new sources
     • Kafka is the core of the "business"
       • Careful resource planning (topic/partition split)
       • Controlling consumer groups and offsets is key (see the sketch below)
     • Securing the infrastructure takes time
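One way to keep an eye on consumer groups and offsets is to compare each partition's committed offset with its end offset. A minimal Scala sketch using the plain Kafka consumer API, with placeholder broker, topic and group names:

```scala
import java.util.Properties
import scala.collection.JavaConverters._

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object ConsumerLag {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka.example.cern.ch:9092")   // placeholder broker
    props.put("group.id", "enrichment-job")                        // illustrative consumer group
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    val partitions = consumer.partitionsFor("fts_metrics").asScala
      .map(p => new TopicPartition(p.topic(), p.partition()))

    // Lag per partition = end offset - committed offset of the group.
    val endOffsets = consumer.endOffsets(partitions.asJava).asScala
    partitions.foreach { tp =>
      val committed = Option(consumer.committed(tp)).map(_.offset()).getOrElse(0L)
      println(s"$tp lag=${endOffsets(tp).longValue() - committed}")
    }
    consumer.close()
  }
}
```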
  21. Lessons Learned
     • Scala was the right choice for Spark
       • Keep batch and streaming code as close as possible (see the sketch below)
       • Use DataFrames to decouple from JSON (un)marshalling
       • Store checkpoints on HDFS
     • Very positive experience with Marathon
     • InfluxDB and Elasticsearch
       • Tried to make them as complementary as possible
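A common way to keep batch and streaming code close, as advocated above, is to express the logic as a single DataFrame-to-DataFrame function and apply it to both a batch read and a Kafka stream. A sketch under assumed schema, paths and topic names:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object SharedAggregation {
  // Illustrative common schema; the real schema used at CERN is not shown in the slides.
  val schema: StructType = new StructType()
    .add("service", StringType)
    .add("value", DoubleType)
    .add("timestamp", TimestampType)

  // One DataFrame -> DataFrame transformation shared by batch and streaming,
  // so the two code paths cannot drift apart.
  def aggregateByService(metrics: DataFrame): DataFrame =
    metrics
      .groupBy(col("service"), window(col("timestamp"), "5 minutes"))
      .agg(avg(col("value")).as("avg_value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shared-aggregation").getOrCreate()

    // Batch: reprocess one day from the HDFS archive.
    val archived = spark.read.schema(schema)
      .json("hdfs:///project/monitoring/archive/fts/metrics/2017/05/22/")
    aggregateByService(archived)
      .write.json("hdfs:///project/monitoring/aggregated/fts/2017/05/22/")

    // Streaming: the same function applied to the live Kafka feed.
    val live = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "kafka.example.cern.ch:9092")  // placeholder broker
      .option("subscribe", "fts_metrics")
      .load()
      .select(from_json(col("value").cast("string"), schema).as("m"))
      .select("m.*")

    aggregateByService(live).writeStream
      .outputMode("complete")   // streaming aggregation without a watermark
      .format("console")
      .option("checkpointLocation", "hdfs:///project/monitoring/checkpoints/aggregation/")  // checkpoints on HDFS
      .start()
      .awaitTermination()
  }
}
```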