[2020.02 Meetup] [Talk #1] Ricardo Rocha - Computing and Operations at CERN : From Physical HW to Virtualization and Containers

Infrastructure at CERN Scale
Ricardo Rocha - CERN Cloud Team
@ahcorporto [email protected]

Founded in 1954
Fundamental Science
What is 96% of the universe made of?
Why isn’t there anti-matter in the universe?
What was the state of matter just after the Big Bang?

~70 PB/year
700 000 Cores
~400 000 Jobs
~30 GiB/s
200+ Sites

Computing at CERN
Increased numbers, increased automation
[Figure labels: 1970s, 2007]

                          Provisioning   Deployment        Update            Utilization  Maintenance
Physical Infrastructure   Days or Weeks  Minutes or Hours  Minutes or Hours  Poor         Highly Intrusive
Cloud API Virtualization  Minutes        Minutes or Hours  Minutes or Hours  Good         Potentially Less Intrusive

OpenStack Private Cloud
3 Separate Regions (Main, Batch, Point 8)
  Scalability, Rolling Upgrades
Regions split in multiple Cells
  Often matching hardware deliveries
  Different configurations and capabilities
Single hypervisor type (KVM, used to have Hyper-V as well)
[Diagram labels: MAIN, CELL 1, CELL 2, CELL N, Compute, GPU Compute, Nova Network, Neutron]

OpenStack Private Cloud
Automate everything!
Puppet-based deployment of all components
  Including control plane running on VMs
  Same is true for most CERN services
Workflows for all sorts of tasks
  Onboarding new users, project creation, quota updates, special capabilities
Overcommit, Pre-emptible instances, Backfilling workloads

                          Provisioning   Deployment        Update            Utilization  Maintenance
Physical Infrastructure   Days or Weeks  Minutes or Hours  Minutes or Hours  Poor         Highly Intrusive
Cloud API Virtualization  Minutes        Minutes or Hours  Minutes or Hours  Good         Potentially Less Intrusive
Containers                Seconds        Seconds           Seconds           Very Good    Less Intrusive

Kubernetes
Lingua franca of the cloud
Managed services offered by all major public clouds
Multiple options for on-premise or self-managed deployments
Common declarative API for basic infrastructure: compute, storage, networking
Healthy ecosystem of tools offering extended functionality

GitOps for Automation
We were already doing similar things with Puppet
Git as the source of truth for configuration data
Allowing multiple choices of deployment models
  1 ⇢ 1: Currently the most popular: one application, one cluster
  1 ⇢ *: One application, multiple clusters (HA, Blast Radius, Rolling Upgrades)
  * ⇢ *: Separation of roles, improve resource usage
[Diagram labels: Meta Chart, git push, FluxCD, git pull, Helm Release CRD, Helm Operator]
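The idea of Git as the source of truth can be sketched as a comparison between what the repository declares and what is running on each cluster. This is only an illustration of the reconcile idea for the 1 ⇢ * model, not how FluxCD or the Helm Operator are implemented; the cluster names, release names and versions are hypothetical.

```python
from typing import Dict, List, Tuple

# cluster name -> {release name -> chart version}; all values are hypothetical.
State = Dict[str, Dict[str, str]]


def plan(desired: State, live: State) -> List[Tuple[str, str, str]]:
    """Return (cluster, action, release) steps that move live towards desired."""
    actions = []
    for cluster in sorted(set(desired) | set(live)):
        want = desired.get(cluster, {})
        have = live.get(cluster, {})
        for release in sorted(set(want) | set(have)):
            if release not in have:
                actions.append((cluster, "install", release))
            elif release not in want:
                actions.append((cluster, "delete", release))
            elif want[release] != have[release]:
                actions.append((cluster, "upgrade", release))
    return actions


# Desired state as it might be read from a Git checkout (assumed layout).
desired = {
    "cluster-a": {"analysis": "1.2.0"},
    "cluster-b": {"analysis": "1.2.0"},
}
# State reported by the clusters themselves (assumed).
live = {
    "cluster-a": {"analysis": "1.1.0"},
    "cluster-b": {"legacy": "0.9.0"},
}

for step in plan(desired, live):
    print(step)
```

Once every step has been applied, calling `plan` again with the updated live state returns an empty list, which is the steady state a GitOps loop aims for.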

Kubernetes
More than just infrastructure management
Potential to ease scaling out data analysis on-demand
Challenge: Re-processing the Higgs analysis in under 10 min
Processing a dataset of ~70 TB of data split in ~25000 files
  Cluster Creation: 5 min
  Image Pre-Pull: 4 min
  Data Stage-In: 4 min
  Process: 90 sec
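To get a feel for the scale of these figures, the short calculation below derives an average file size and the aggregate read rate implied by the stage-in time. It assumes decimal units (1 TB = 10^12 bytes) and that the whole dataset is staged in during the 4-minute phase; neither assumption is stated on the slide.

```python
# Figures quoted on the slide (approximate).
dataset_bytes = 70e12      # ~70 TB, assuming decimal terabytes
n_files = 25_000           # ~25000 files
stage_in_sec = 4 * 60      # Data Stage-In: 4 min

# Average file size in GB, assuming files are roughly equal in size.
avg_file_gb = dataset_bytes / n_files / 1e9

# Aggregate rate in GB/s, assuming the full dataset is staged in that phase.
stage_in_gb_per_s = dataset_bytes / stage_in_sec / 1e9

print(f"average file size: {avg_file_gb:.1f} GB")
print(f"aggregate stage-in rate: {stage_in_gb_per_s:.1f} GB/s")
```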

OpenStack Magnum
[Diagram labels: 70 TB Dataset, 25000 Kubernetes Jobs, Job Results, Aggregation, Interactive Visualization]
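A fan-out of one Kubernetes Job per input file could be generated along these lines. This is a sketch and not the setup used in the talk: the image name, bucket URL, job naming scheme, retry limit and the one-core request are all hypothetical, and only the `batch/v1` Job fields come from the Kubernetes API.

```python
import json
from typing import Dict, List


def make_job(index: int, file_url: str, image: str) -> Dict:
    """Build a batch/v1 Job manifest that processes a single input file."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"higgs-{index:05d}",          # hypothetical naming scheme
            "labels": {"app": "higgs-analysis"},   # hypothetical label
        },
        "spec": {
            "backoffLimit": 2,  # assumed retry budget
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "analysis",
                        "image": image,
                        "args": [file_url],
                        # Assumption: one core per job.
                        "resources": {"requests": {"cpu": "1"}},
                    }],
                }
            },
        },
    }


# Three hypothetical input files stand in for the ~25000 in the dataset.
files: List[str] = [
    f"s3://example-bucket/dataset/file-{i:05d}.root" for i in range(3)
]
jobs = [
    make_job(i, url, "example.org/higgs-analysis:latest")
    for i, url in enumerate(files)
]

# JSON is accepted wherever Kubernetes expects a manifest.
print(json.dumps(jobs[0], indent=2))
```

Each manifest could then be written to a file and submitted with `kubectl apply -f`, or posted through a Kubernetes client library.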

Cluster on GKE
Max 25000 Cores
Single Region, 3 Zones
[Diagram labels: 70 TB Dataset, 25000 Kubernetes Jobs, Job Results, Aggregation, Interactive Visualization]

Questions?
@ahcorporto [email protected]
http://visits.cern/

DevOps Lisbon
