containers, Kubernetes and networking
Accelerators and ML
Previous work in storage and the WLCG (Worldwide LHC Computing Grid)
@ahcorporto
Hours, Minutes or Hours; Utilization: Poor; Maintenance: Highly Intrusive
Cloud API Virtualization: Minutes, Minutes or Hours, Minutes or Hours; Utilization: Good; Maintenance: Potentially Less Intrusive
Containers: Seconds, Seconds, Seconds; Utilization: Very Good; Maintenance: Less Intrusive
split into Hardware and Software Filters (this might change too)
40 million particle interactions / second
~3000 multi-core nodes
~30,000 applications to supervise
Critical system, sustained failure means data loss
Can it be improved for Run 4?
Study 2017, Mattia Cadeddu, Giuseppe Avolio
Kubernetes 1.5.x
A new evaluation phase to be tried this year
ATLAS Event Filter
hierarchical filesystem
In production for several years, battle tested, solved problem
Now with containers? Can they carry all required software?
> 200 sites in our computing grid
~400 000 concurrent jobs
Frequent software releases
100s of GBs
as described earlier
Deep Learning for Fast Simulation
Can we easily distribute to reduce training time?
Sofia Vallecorsa, CERN OpenLab
Konstantinos Samaras-Tsakiris
We have > 200 of them
Multiple components for Storage and Compute
Lots of history in the software
Fernando Barreiro Megino, Fahui-Lin, Mandy Yang
ATLAS Distributed Computing
Can a Kubernetes endpoint be a Grid site?
VM Master killed (OOM) on Saturday
Test Cluster with 2000 cores
Good: Initial results show error rates comparable to any other site
Improvements: scheduler defaults are causing inefficiencies
Pack vs Spread
Affinity Predicates, Weights
Custom Scheduler?
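The pack vs spread trade-off can be illustrated with a toy scoring function. This is only a sketch under stated assumptions: the node names, core counts, and the `fits`, `score_node` and `pick_node` helpers are hypothetical and are not the Kubernetes scheduler API. It only mirrors the general shape of filtering nodes first (predicates) and then ranking the survivors (weights).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str      # hypothetical node name
    capacity: int  # total cores on the node
    used: int      # cores already requested by running pods


def fits(node: Node, request: int) -> bool:
    """Predicate (filter) step: the node must have enough free cores."""
    return node.capacity - node.used >= request


def score_node(node: Node, request: int, policy: str) -> float:
    """Scoring step: rank a feasible node by its utilization after placement."""
    utilization = (node.used + request) / node.capacity
    if policy == "pack":
        return utilization        # prefer the fullest node
    if policy == "spread":
        return 1.0 - utilization  # prefer the emptiest node
    raise ValueError(f"unknown policy: {policy}")


def pick_node(nodes: List[Node], request: int, policy: str) -> Optional[Node]:
    """Return the best-scoring feasible node, or None if nothing fits."""
    candidates = [n for n in nodes if fits(n, request)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score_node(n, request, policy))


# Illustrative cluster state (made-up numbers).
nodes = [Node("node-a", 16, 12), Node("node-b", 16, 2), Node("node-c", 16, 15)]

pack_choice = pick_node(nodes, 4, "pack").name
spread_choice = pick_node(nodes, 4, "spread").name
print("pack:", pack_choice, "spread:", spread_choice)
```

For batch-style grid jobs, packing tends to leave whole nodes free for large requests, while spreading leaves many partially filled nodes; which one is preferable depends on the job mix.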
runc as container runtime - relying on CRI
Kubernetes as the container orchestrator
Fluentd for log collection and aggregation
Prometheus for monitoring and metric collection
The DNS challenge is not yet an option
No API available to update TXT records
We rely on the HTTP-01 challenge
This requires a firewall opening to get a certificate, which is not ideal
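To make the difference between the two challenge types concrete, here is a minimal sketch of what each one asks the cluster to publish, following the ACME specification (RFC 8555) and JWK thumbprints (RFC 7638). The domain, token and account key values below are placeholders, and the helper names are hypothetical. HTTP-01 needs the validation server to reach a URL on port 80 of the host (hence the firewall opening), while DNS-01 needs a TXT record to be created (hence the need for a DNS update API).

```python
import base64
import hashlib
import json
from typing import Tuple


def b64url(data: bytes) -> str:
    """Base64url encoding without padding, as used by ACME."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638 thumbprint: SHA-256 over the required members only,
    serialized with sorted keys and no whitespace."""
    required = {"EC": ("crv", "kty", "x", "y"), "RSA": ("e", "kty", "n")}[jwk["kty"]]
    canonical = json.dumps(
        {k: jwk[k] for k in required}, sort_keys=True, separators=(",", ":")
    )
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, jwk: dict) -> str:
    """Key authorization = token, a dot, and the account key thumbprint."""
    return f"{token}.{jwk_thumbprint(jwk)}"


def http01_response(domain: str, token: str, jwk: dict) -> Tuple[str, str]:
    """URL the validation server fetches, and the body it expects there."""
    url = f"http://{domain}/.well-known/acme-challenge/{token}"
    return url, key_authorization(token, jwk)


def dns01_record(domain: str, token: str, jwk: dict) -> Tuple[str, str]:
    """TXT record name and value the validation server looks up."""
    name = f"_acme-challenge.{domain}"
    digest = hashlib.sha256(key_authorization(token, jwk).encode("ascii")).digest()
    return name, b64url(digest)


# Placeholder values for illustration only (not a real key or token).
account_jwk = {"kty": "EC", "crv": "P-256", "x": "placeholder-x", "y": "placeholder-y"}
token = "example-token"
domain = "myapp.example.org"

url, body = http01_response(domain, token, account_jwk)
txt_name, txt_value = dns01_record(domain, token, account_jwk)
print(url)
print(txt_name)
```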
our deployments
We need to move Traefik to 2.0 (yes, we’re still on 1.x)
Most used Ingress controller by far: almost 400 clusters use it at CERN
Integrate Ingress with our external LBs, using VIPs, no DNS
Monitor developments of the new Service APIs: https://github.com/kubernetes-sigs/service-apis