[2020.02 Meetup] [Talk #1] Ricardo Rocha - Computing and Operations at CERN : From Physical HW to Virtualization and Containers

In this talk, we cover the challenges of running the infrastructure required to store and analyse hundreds of petabytes of data, and how we manage thousands of servers totalling more than 300k cores and offering over 400 PB of storage. We cover the compute and networking infrastructure running on OpenStack, as well as the configuration management services required for automation. We finish with the current move towards a containerized infrastructure, where Docker and Kubernetes play a key role.

Ricardo is a software engineer on the CERN cloud team, focusing primarily on networking and container-based deployments. Previously he helped develop and deploy several components of the Worldwide LHC Computing Grid, a network of ~200 collaborating sites around the world that helps analyze the Large Hadron Collider data. He has a computing degree from FEUP (Faculdade de Engenharia da Universidade do Porto) and joined CERN as part of his final project, which focused on grid computing. Ricardo has presented his team's work at several international conferences: Computing for High Energy Physics (CHEP), IEEE NSS/MIC, IEEE MSST, DockerCon, KubeCon and multiple OpenStack summits.

DevOps Lisbon

February 10, 2020

Transcript

  1. Founded in 1954

    Fundamental Science: What is 96% of the universe made of? Why isn’t there anti-matter in the universe? What was the state of matter just after the Big Bang?
  2. 7

  3. 9

  4. Provisioning, Deployment, Update, Utilization, Maintenance

    Physical Infrastructure: provisioning in days or weeks; deployment in minutes or hours; update in minutes or hours; utilization poor; maintenance highly intrusive.
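  5. Provisioning, Deployment, Update, Utilization, Maintenance

    Physical Infrastructure: provisioning in days or weeks; deployment in minutes or hours; update in minutes or hours; utilization poor; maintenance highly intrusive.
    Cloud API Virtualization: provisioning in minutes; deployment in minutes or hours; update in minutes or hours; utilization good; maintenance potentially less intrusive.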
  6. OpenStack Private Cloud

    3 separate regions (Main, Batch, Point 8): scalability, rolling upgrades.
    Regions split in multiple cells: often matching hardware deliveries; different configurations and capabilities.
    Single hypervisor type (KVM, used to have Hyper-V as well).
    Diagram labels: Cell 1, Main, Cell 2, Cell N; Compute, GPU Compute; Nova Network, Neutron, Neutron.
  9. OpenStack Private Cloud

    Automate everything!
    Puppet-based deployment of all components, including the control plane running on VMs. The same is true for most CERN services.
    Workflows for all sorts of tasks: onboarding new users, project creation, quota updates, special capabilities.
    Overcommit, pre-emptible instances, backfilling workloads.
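  10. Provisioning, Deployment, Update, Utilization, Maintenance

    Physical Infrastructure: provisioning in days or weeks; deployment in minutes or hours; update in minutes or hours; utilization poor; maintenance highly intrusive.
    Cloud API Virtualization: provisioning in minutes; deployment in minutes or hours; update in minutes or hours; utilization good; maintenance potentially less intrusive.
    Containers: provisioning in seconds; deployment in seconds; update in seconds; utilization very good; maintenance less intrusive.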
  11. Kubernetes

    Lingua franca of the cloud.
    Managed services offered by all major public clouds.
    Multiple options for on-premise or self-managed deployments.
    Common declarative API for basic infrastructure: compute, storage, networking.
    Healthy ecosystem of tools offering extended functionality.
  13. GitOps for Automation

    We were already doing similar things with Puppet.
    Git as the source of truth for configuration data.
    Allowing multiple choices of deployment models:
    1 ⇢ 1: currently the most popular: one application, one cluster.
    1 ⇢ *: one application, multiple clusters (HA, blast radius, rolling upgrades).
    * ⇢ *: separation of roles, improved resource usage.
    Diagram labels: Meta Chart, git push, FluxCD, git pull, Helm Release CRD, Helm Operator.
  14. Kubernetes

    More than just infrastructure management.
    Potential to ease scaling out data analysis on demand.
    Challenge: re-processing the Higgs analysis in under 10 min, processing a dataset of ~70 TB split in ~25000 files.
    Timeline: Cluster Creation 5 min; Image Pre-Pull 4 min; Data Stage-In 4 min; Process 90 sec.
  15. Cluster on GKE

    Max 25000 cores; single region, 3 zones.
    Diagram labels: 70 TB Dataset, 25000 Kubernetes Jobs, Job Results, Aggregation, Interactive Visualization.
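
The last two slides describe fanning a ~70 TB dataset of ~25000 files out over 25000 Kubernetes Jobs. The following is a minimal sketch of how such a fan-out could be generated, not the setup used in the talk. It assumes one input file per Job and one CPU per Job, which the slides do not state explicitly. The image name, file names, and the `analysis-` job prefix are hypothetical placeholders.

```python
import json

DATASET_BYTES = 70 * 10**12  # ~70 TB from the slide (decimal terabytes assumed)
NUM_FILES = 25_000           # ~25000 files from the slide


def average_file_size_gb(total_bytes: int, num_files: int) -> float:
    """Average size of one input file in decimal gigabytes."""
    return total_bytes / num_files / 10**9


def job_manifest(index: int, input_file: str, image: str) -> dict:
    """Build a batch/v1 Job manifest that processes a single input file."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"analysis-{index:05d}"},  # hypothetical naming
        "spec": {
            "backoffLimit": 2,  # retry a failed pod at most twice
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "analysis",
                            "image": image,
                            "args": [input_file],
                            # Assumption: one core per Job.
                            "resources": {"requests": {"cpu": "1"}},
                        }
                    ],
                }
            },
        },
    }


def build_jobs(files: list, image: str) -> list:
    """Return one Job manifest per input file."""
    return [job_manifest(i, f, image) for i, f in enumerate(files)]


if __name__ == "__main__":
    # Hypothetical file names and image, for illustration only.
    files = [f"dataset/file-{i:05d}.root" for i in range(NUM_FILES)]
    jobs = build_jobs(files, image="example.org/higgs-analysis:latest")
    size = average_file_size_gb(DATASET_BYTES, NUM_FILES)
    print(f"{len(jobs)} jobs, about {size:.1f} GB per file")
    print(json.dumps(jobs[0], indent=2))
```

Under these assumptions each Job handles roughly 2.8 GB of input, and the manifests could be written out as JSON and submitted with `kubectl apply -f`. In practice, batching several files per Job is a common alternative when per-Job scheduling overhead matters.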