How CERN serves 1 EB of data via FUSE

CERN, the European Organization for Nuclear Research, generates vast amounts of data from experiments at the Large Hadron Collider (LHC). CERN’s Storage and Data Management Group in the IT department is responsible for managing this data, including its long-term archival on tape and its distribution across the Worldwide LHC Computing Grid (WLCG), as well as for providing secure and convenient data access to the more than 30,000 users who need it.

In this talk we will give an overview of the open source projects leveraged by CERN to satisfy the storage needs of the organization, such as CERNBox and EOS, and discuss some of the unique challenges faced by the high-energy physics community in data storage and management. We will also discuss how FUSE is used to let users access data securely from anywhere in the world.

Guilherme AMADIO

Kernel Recipes

September 30, 2024

Transcript

  1. How CERN serves 1 EB of data via FUSE
     G. Amadio, CERN IT Storage and Data Management, 2024
  2. CERN, the European Organisation for Nuclear Research, is widely recognised as one of the world’s leading laboratories for fundamental research. CERN is an intergovernmental organisation with 24 Member States and 10 Associate Member States. It is situated on the French-Swiss border, near Geneva. CERN’s mission is to gain understanding of the most fundamental particles and laws of the universe.
  3. LHC detectors are analogous to 3D cameras: they measure the position, momentum, and charge of particles. These measurements are performed about 40 million times per second. The LHC experiments were built by international collaborations spanning most of the globe.
  4. ALICE Experiment Data Flow (Heavy Ion, Run 3) [diagram]: data from ALICE Point 2 passes through online processing (250 nodes, 2000 GPUs), a 1 PB realtime SSD storage buffer (3 h) and a 14 PB fallback HDD storage buffer (48 h), into a 150 PB HDD store (EC 10+2); further elements in the diagram: a 300 TB SSD tier, network links of 5 Tbps and 28 Tbps, transfer rates from 10 GB/s up to 280 GB/s, and 50–500 GB/s reads for offline processing and analysis; downstream consumers are the Worldwide LHC Computing Grid, the CERN Cloud (OpenStack), offline processing, and physics analysis.
  5. EOS ALICE O² Installation at CERN
     • 180 PB raw, 150 PB usable space
     • Layout: 10 + 2 erasure coding (see the capacity sketch below)
     • 12000 HDDs in total (about 1/3 shown in the photo: 3840 HDDs)
     • 1.3 TB/s peak read throughput
     • 380 GB/s peak write throughput
     • 96 x 18 TB HDDs per node
     • 100 Gbps networking
  6. Peak LHC Detector Data Rates for the proton-proton Run of 2024 [diagram]: ATLAS, ALICE, LHCb, and CMS, with peak rates of 20 GB/s, 12 GB/s, 13.5 GB/s, and 45 GB/s, and ~15–60 GB/s arriving at storage; the data-taking throughput plot for the last 2 weeks shows a peak of 1.3 TB/s and an average of ~450 GB/s; data flows on to the CERN Cloud (OpenStack), offline processing, and physics analysis.
  7. • XRootD is a system for scalable cluster data access
     • Initially developed for the BaBar experiment at SLAC (~2002)
       ◦ “The Next Generation ROOT File Server”, A. B. Hanushevsky, A. Dorigo, F. Furano
     • Written in C++, open source (LGPL + GPL)
     • Available in EPEL and most Linux distributions
     • You can think of XRootD as curl + nginx + varnish (a minimal client sketch follows this slide)
       ◦ Supports XRootD’s own root:// protocol as well as HTTP
       ◦ root:// is a stateful, POSIX-like protocol for remote data access
       ◦ TLS (roots:// and https://) supported since XRootD 5.0
         ▪ TLS for the control channel only, or for control + data channels
     • Can also be configured as a proxy / caching server (XCache)
     • Authentication via Kerberos, X509, shared secret, tokens
     • Not a file system & not just for file systems
  8. • XRootD clustering has many uses
       ◦ Creating a uniform namespace, even though it is distributed
       ◦ Load balancing and scaling in situations where all servers are similar
         ▪ Proxy servers and caching servers (XCache)
         ▪ Serving data from distributed filesystems (e.g. Lustre, Ceph)
         ▪ Ceph + XCache can avoid bad performance from fragmented reads
     • Wide deployment across high energy physics
       ◦ CMS AAA Data Federation, Open Science Grid (OSG), WLCG
     • Highly adaptable plug-in architecture
       ◦ If you can write a plug-in, you can cluster it
       ◦ Used by LSST (Vera C. Rubin Observatory)
         ▪ LSST Qserv for clustered MySQL
         ▪ https://inspirehep.net/literature/716175
     • Extensive support for monitoring
     [Diagram: “Data Access Clustering” — every node runs xrootd + cmsd; a manager (root node) and supervisors (interior nodes) each cluster up to 64 nodes, so the tree reaches 64 data servers at one level, 64² = 4096 at two levels, and 64³ = 262144 at three levels, with data servers as leaf nodes (see the arithmetic sketch below).]
  9. • Distributed filesystem built using the XRootD framework
       ◦ XRootD provides the protocol, request redirections, third-party copy support
       ◦ POSIX-like API, many authentication methods (Kerberos, X509, shared secrets, JWT tokens)
     • Started in 2010 with the rationale to solve the physics analysis use case for the LHC
     • Extremely cost-effective storage system (less than 1 CHF / TB / month for the EC 10+2 layout)
     • Efficient sharing of resources with thousands of users across the grid
       ◦ Quotas, ACLs for sharing, QoS features (data and metadata)
     • Remotely accessible storage infrastructure
       ◦ High energy physics distributed computing infrastructure with over 150 sites
       ◦ FUSE mount provides a convenient, familiar interface for users (batch, but also laptop)
     • Storage system suitable for physics analysis use cases and data formats
       ◦ ROOT columnar data format, heavy usage of vector reads for physics analysis
     • Integration with the OwnCloud web interface, sync & share, Samba
  10. EOS Architecture
      • Namespace server (MGM)
        ◦ Scale up
        ◦ Typically 3 nodes (1 active, 2 standby)
        ◦ Namespace persistency with a key-value store (QuarkDB)
          ▪ Developed at CERN, REDIS protocol, based on RocksDB
          ▪ Separate process, usually runs alongside the MGM
        ◦ Namespace cache in RAM to reduce latency
      • File storage servers (FST)
        ◦ Scale out
        ◦ Typically ~100–150 nodes
        ◦ Based on cheap JBODs, RAIN layout
        ◦ Multiple replica or erasure coding layouts supported
      • Multiple instances with independent namespaces
        ◦ /eos/{alice,atlas,cms,lhcb,public}, /eos/user, /eos/project, etc.
      [Diagram: clients send meta-data requests to the MGM and read/write data directly from/to the FSTs (a client-side sketch follows this slide).]
  11. EOS Features Needed by LHC Experiments
      • Throttling (an illustrative sketch follows this slide)
        ◦ Limit the rate of meta-data operations
        ◦ Limit bandwidth per user, group, or application
        ◦ Limit the number of threads used by clients in the meta-data service
      • Global limit on thread pool size and file descriptors
        ◦ Run with at most N threads for all users, use at most N descriptors
      • File placement policies
        ◦ Layout (replica or erasure coding)
        ◦ Physical hardware to use (HDD or SSD)
        ◦ Geographic location (put EC pieces on separate nodes or groups)
      • Conversion and cleanup policies
        ◦ Automatically convert large files from replica to EC layout, or move them to HDD
        ◦ Periodic cleanup of files and/or empty directories from the namespace
  12. EOS and FUSE: history of the EOS FUSE client
      • Origins with XRootD’s xrootdfs
      • Mostly the libfuse2 client is used in production
        ◦ Roughly 30k active clients on average
      • libfuse3 used by SAMBA gateways only
      • Issues in the past due to FUSE notification mechanisms
        ◦ Invalidation, lookup, forget (see the sketch after this slide)
        ◦ Processes ending up in D state due to locking
      • Uses the XRootD protocol underneath
      • Operations broadcast to clients
        ◦ With a limit so as to not overload the server
      • Local journal for writes
      • First 256k of each file cached on open
  13. EOS and FUSE (2)
      • Some problems we’ve encountered
        ◦ Inode invalidation is suppressed when the write-back cache is enabled
          ▪ Causes file sizes to become stale when another client changes a file (more information)
        ◦ Problems seen on AlmaLinux 9 due to mounting the same filesystem more than once
      • Areas of interest, suggestions for new features
        ◦ Physics analysis makes heavy use of vector reads, which FUSE breaks up into multiple requests
          ▪ Implemented a hook in ROOT to bypass FUSE and read directly via XRootD when possible (see the sketch after this slide)
        ◦ An architecture in which meta-data caches could be shared between kernel and userspace would be desirable
      • FUSE performance is not a problem (bottlenecks are addressable on the server side in EOS)
      • Exploring ideas with FUSE passthrough (eoscfsd, adding authentication to other backends)
  14. • CERN Virtual Machine File System (CVMFS)
      • FUSE HTTP-based network read-only filesystem
      • Originally developed to decouple experiment software from the base OS
      • A solution to distribute 1 TB per night of integration builds to the computing grid
      • Origins as a FUSE frontend to GrowFS, using Parrot system call interception
      • Namespace stored in SQLite, content-addressed storage for data deduplication (illustrative sketch after this slide)
      • More recently being used to distribute “unpacked” container images
        ◦ Use files directly from CVMFS, download and cache only what you need
      • Software distribution on HPCs
        ◦ Gentoo Prefix, ComputeCanada, EESSI
  15. Storage Overview at CERN [diagram]: storage backends and access methods (S3, RBD, CephFS, AFS, FUSE, SAMBA) serving use cases such as backup (restic), sync & share, physics analysis, web applications, cloud, CDS, and CERN OpenData.
  16. References and Further Information
      • XRootD: https://xroot.slac.stanford.edu, https://github.com/xrootd/xrootd
        ◦ XRootD/FTS Workshop (9–13 Sep 2024): https://indico.cern.ch/event/1386888
      • EOS: https://eos.web.cern.ch, https://github.com/cern-eos/eos
        ◦ EOS Workshop (14–15 Mar 2024): https://indico.cern.ch/event/1353101
      • CVMFS: http://cernvm.cern.ch, https://github.com/cvmfs/cvmfs
        ◦ CVMFS Workshop (16–18 Sep 2024): https://indico.cern.ch/event/1347727