

James Taylor
September 24, 2019

Galaxy... from genomic data science gateway to global community

Keynote presentation on Galaxy (https://galaxyproject.org) origins and the Galaxy community presented at Gateways 2019, the annual meeting hosted by the Science Gateways Community Institute (https://sciencegateways.org/)



Transcript

  1. ...from genomic data science gateway to global community James Taylor

    (@jxtx), Johns Hopkins, http://speakerdeck.com/jxtx
  2. Mammalian comparative genomics — the beginning 2001: Initial sequence of

    the human genome 2002: Initial sequence of the mouse genome 2004: Initial sequence of the rat genome
  3. Mammalian comparative genomics — the beginning 2001: Initial sequence of

    the human genome 2002: Initial sequence of the mouse genome 2004: Initial sequence of the rat genome Our story begins somewhere around here!
  4. Coding regions (genes) – deeply conserved across evolution, ~1.5% of

    the human genome Regulatory regions – much less conserved, 5-10% of the human genome
  5. Whole genome scale alignments can potentially help us to understand

    biological function What is aligned to what and does it overlap with anything interesting? Can we see specific signals in alignments that inform us about specific functions? Answering these questions requires computational approaches
  6. Can we make it easier and more efficient for experimental
    and computational researchers to collaborate?
  7. GALA enabled querying annotation information from the human genome, alongside
    alignments with the mouse genome, integrated with the UCSC browser, and allowed building up set queries using the results of previous queries
  8. To enable collaboration, can we make it easy for computational

    researchers to integrate new tools, and for experimental researchers to use them?
  9. And then everything changed… again. Illumina NovaSeq 6000: 20 billion
    300bp DNA fragments per run, ~6 terabytes, every 2 days…
  10. ...and applicable across (nearly) all of Biology! - How is

    the production of the right protein at the right time controlled? - How are cells organized in 3D? - How are cell types decided in development? - How are different species related? - What genome variants lead to different phenotypes or disease risk?
  11. Modern biology has rapidly transformed into a data intensive discipline

    - Large scale data acquisition has become easy, e.g. high-throughput sequencing and imaging - Experiments are increasingly complex - Making sense of results often requires mining and making connections across multiple databases - Nearly all high-profile research involves some quantitative methods How does this affect traditional research practices and outputs?
  12. Data pipeline (inspired by Leek and Peng, Nature 2015): Idea → Experiment →
    Raw Data → Tidy Data → Summarized Data → Results, via the stages Experimental design → Data collection → Data cleaning → Data analysis → Inference. (Figure annotations: "The part we are considering here"; "The part that ends up in the publication.")
  13. Three major concerns. Accessibility: making use of large-scale data requires
    complex computational resources and methods. Can all researchers access these approaches? How can we make these methods available to everyone? Transparency: is it possible to communicate analyses and results in ways that are both easy to understand and provide all of the essential details? Reproducibility: can analyses be precisely reproduced, to facilitate rigorous validation and peer review, and ease reuse?
  14. Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details Workflow system for complex analysis, constructed explicitly or automatically
  15. Describe analysis tool behavior abstractly Pervasive sharing, and publication of

    documents with integrated analysis Analysis environment automatically and transparently tracks details Workflow system for complex analysis, constructed explicitly or automatically
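The "automatically and transparently tracks details" idea can be illustrated with a minimal provenance-tracking sketch. All names here (run_tool, PROVENANCE, the stub tool behavior) are hypothetical; Galaxy's real implementation records far more, including tool versions and execution environments:

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical provenance log, illustrating how an analysis environment
# can record every tool invocation without any effort from the user.
PROVENANCE = []

def _digest(data: bytes) -> str:
    """Content hash so outputs can be verified and reused later."""
    return hashlib.sha256(data).hexdigest()[:12]

def run_tool(tool_name: str, params: dict, inputs: dict) -> dict:
    """Run a (stub) tool and transparently log what was done."""
    # Stand-in for real execution: derive output from tool, params, inputs.
    output = (tool_name + json.dumps(params, sort_keys=True)).encode()
    for content in inputs.values():
        output += content
    record = {
        "tool": tool_name,
        "params": params,
        "inputs": {name: _digest(c) for name, c in inputs.items()},
        "output": _digest(output),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    PROVENANCE.append(record)
    return {"data": output, "digest": record["output"]}

reads = b">read1\nACGT\n"
step1 = run_tool("quality_filter", {"min_qual": 20}, {"reads": reads})
step2 = run_tool("map_reads", {"genome": "hg38"}, {"reads": step1["data"]})
# PROVENANCE now holds a complete, machine-readable history of the analysis,
# linking each step's inputs to earlier steps' outputs by content hash.
```

Because each record stores content hashes rather than file names, the chain of steps can be validated or replayed later, which is the basis for both transparency and reproducibility.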
  16. Galaxy is available as... A free (for everyone) web service

    integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage Open source software that makes integrating your own tools and data and customizing for your own site simple An open extensible platform for sharing tools, datatypes, workflows, ...
  17. usegalaxy.org - We provided Galaxy as a free public website

    from the very beginning - Fortunately nobody knew about it at first, and in 2005 the data wasn’t all that big anyway - However, the demand for easy-to-use tools in the research community was even more than we anticipated… and we didn’t have much funding - For eight years Galaxy was run largely on surplus hardware decommissioned by other groups, borrowed storage, whatever we could find
  18. 125,000 registered users 2PB user data 19M jobs run 100

    training events (2017 & 2018) Stats for Galaxy Main (usegalaxy.org) in May 2018
  19. usegalaxy.org compute resources (Nate Coraor). Dedicated resources (TACC,
    Austin): Galaxy Cluster (Rodeo), 256 cores, 2 TB memory; Corral/Stockyard, 20 PB disk. Shared XSEDE resources: Stampede, 462,462 cores, 205 TB memory; Bridges (PSC, Pittsburgh); PTI, IU Bloomington.
  20. [Diagram: usegalaxy.org compute architecture, June 2018 (Nate Coraor).
    Web and job handlers run on TACC VMware and bare-metal clusters with a PostgreSQL database and Swarm instances on Jetstream (TACC and IU); Slurm and Pulsar (over AMQP and HTTP) dispatch jobs to Stampede2 (TACC), Bridges (PSC), and SmartOS (PSU); CVMFS stratum 0 and stratum 1 servers distribute reference data; Corral (TACC) provides 2.3 PB of dataset storage.]
  21. This approach provides both scalability and flexibility - A set

    of dedicated compute resources (deployed on TACC’s internal cloud) provide basic services and first line job execution - The bulk of Galaxy jobs run on Jetstream, an OpenStack cloud which allows us to leverage elasticity to efficiently adjust to changing user demands - Unique resources like Bridges and Stampede2 allow us to serve jobs that have extremely large memory demands (e.g. genome and transcriptome assembly), or are highly parallel with long runtimes (e.g. large-scale read mapping jobs)
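The routing logic described above can be sketched as a simple dispatcher. Destination names and thresholds are illustrative only, not Galaxy's actual job configuration:

```python
# Illustrative job router: pick an execution resource from a job's
# requirements, mirroring the dedicated / elastic / specialized split.
def route_job(cores: int, memory_gb: int, runtime_hours: float) -> str:
    if memory_gb > 256:
        # e.g. genome or transcriptome assembly needs very large memory nodes
        return "bridges"
    if cores > 32 or runtime_hours > 24:
        # highly parallel or long-running work, e.g. large-scale read mapping
        return "stampede2"
    if cores <= 4 and runtime_hours <= 1:
        # small, quick jobs stay on the dedicated first-line cluster
        return "rodeo"
    # everything else goes to the elastic OpenStack cloud
    return "jetstream"

# Example: a small text-manipulation job vs. a genome assembly job
print(route_job(cores=1, memory_gb=4, runtime_hours=0.5))
print(route_job(cores=16, memory_gb=1024, runtime_hours=48))
```

The point of the sketch is the design choice: default traffic lands on the elastic resource, while only jobs that genuinely need unique hardware are routed to it, keeping the scarce resources available.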
  22. Not just more jobs, different types of jobs Can now

    run larger jobs, and more jobs: 325,000 jobs run on behalf of 12,000 users Can run new types of jobs: Galaxy Interactive Environments: Jupyter, RStudio (Enis Afgan)
  23. - Galaxy makes it easy to integrate new tools -
    The Galaxy Toolshed (2011) makes it easy to share those tools - However, new tools are published far faster than we can integrate them - We need help if this is going to scale at all!
  24. • Maintains a set of high quality Galaxy tools in

    the GitHub repository. This repo serves as an excellent example and inspiration to all Galaxy tool developers. • Cultivates and shares the Galaxy tool development best practices document. • Provides support to tool developers on a public Gitter channel.
  25. 2015: CONTRIBUTING.md - In 2015 we established an official open
    governance policy for core Galaxy code - We established the committers group, consisting of experienced Galaxy developers with the responsibility of managing contributions, as well as adding additional committers - All committers have equal power – we gave up control over the code in order to share ownership with the community!
  26. CVMFS server distribution. Stratum 0 servers: cvmfs0-tacc0
    (test.galaxyproject.org, main.galaxyproject.org) at XSEDE & CyVerse, TACC, Austin; cvmfs0-psu0 (singularity.galaxyproject.org, data.galaxyproject.org) at Penn State. Stratum 1 servers: cvmfs1-tacc0 (TACC), cvmfs1-iu0 (XSEDE, Indiana University), cvmfs1-psu0 (Penn State), cvmfs1-ufr0.usegalaxy.eu (de.NBI, RZ Freiburg), cvmfs1-mel0.gvl.org.au (Galaxy Australia, Melbourne), and galaxy.jrc.ec.europa.eu (EU JRC, Ispra).
  27. Achieving usegalaxy.✱ coherence - Common reference and index data -
    These are already distributed by CVMFS, but organized in an ad hoc manner due to the history of Galaxy - Currently building an automated approach where metadata defining the complete set of reference and index data will live in GitHub, builds will be automated based on GitHub state, and successful builds deployed through CVMFS for replication to all sites - Intergalactic Data Commission: https://github.com/usegalaxy-eu/idc - Common tools - A common set of tools and a common tool menu organization is currently being defined. Tools and tool configuration will also be replicated through CVMFS - This will ensure both that users will have the same user experience across different usegalaxy.✱ instances, and that workflows can be moved between instances and still execute correctly and reproducibly - Local custom tools will still be supported but clearly identified
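The GitHub-driven build flow described above can be sketched in a few lines. The metadata format and function names here are hypothetical, not the actual IDC schema:

```python
# Sketch of deriving pending reference-data builds from declarative
# metadata: the desired state lives in a Git repository; anything not
# yet published to CVMFS gets built and then deployed for replication.
DESIRED = {  # hypothetical metadata, e.g. parsed from YAML in GitHub
    ("hg38", "bwa_index"),
    ("hg38", "hisat2_index"),
    ("mm10", "bwa_index"),
}

PUBLISHED = {  # what the CVMFS repository currently serves
    ("hg38", "bwa_index"),
}

def pending_builds(desired, published):
    """Builds still required so every site sees the same reference data."""
    return sorted(desired - published)

for genome, index in pending_builds(DESIRED, PUBLISHED):
    print(f"build {index} for {genome}, then publish via CVMFS")
```

The design choice is that the Git repository, not any one server, is the source of truth: a successful build converges the published state toward the desired state, and every replica site inherits the result automatically.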
  28. Challenges for human genomic (+) data sharing The value of

    data is greatly increased by integration across datasets - e.g. in human genomics, power to detect relationships between individual variants and disease depends on the number of individuals measured Moving/copying data is wasteful: transfer costs, redundant storage costs Human genomic data comes with privacy concerns, need to ensure security and detect threats
  29. AnVIL: Inverting the model of genomic data sharing Traditional: Bring

    data to the researcher - Copying/moving data is costly - Harder to enforce security - Redundant infrastructure - Siloed compute Goal: Bring researcher to the data - Reduced redundancy and costs - Active threat detection and auditing - Greater accessibility - Elastic, shared, compute
  30. What is the AnVIL? - Scalable and interoperable resource for

    the genomic scientific community - Cloud-based infrastructure - Shared analysis and computing environment - Support genomic data access, sharing and computing across large genomic, and genomic related, data sets - Genomic datasets, phenotypes and metadata - Large datasets generated by NHGRI programs, as well as other initiatives / agencies - Data access controls and data security - Collaborative environment for datasets and analysis workflows - ...for both users with limited computational expertise and sophisticated data scientist users
  31. Goals of the AnVIL 1. Create open source software Storage,

    scalable analytics, data visualization 2. Organize and host key NHGRI datasets CCDG, CMG, eMERGE, and more 3. Operate services for the world Security, training & outreach, new models of data access
  32. AnVIL / Terra: analysis workspaces and batch workflows AnVIL /

    Gen3: Data models, indexing, querying AnVIL / Dockstore: sharing containerized tools and workflows AnVIL / Analysis Environments: Jupyter Notebooks, RStudio, Galaxy, ...
  33. AnVIL / Terra: analysis workspaces and batch workflows. AnVIL /
    Gen3: data models, indexing, querying. AnVIL / Analysis Environments: Jupyter Notebooks, RStudio, Galaxy, ... Compliance: FISMA Moderate (2 ATOs), pursuing FedRAMP; all data use and analysis occurs in a FISMA Moderate environment. Implemented on Google Cloud. Primary data storage costs are covered by AnVIL; user private data and compute are billed directly through Google.
  34. [Diagram: proposed system architecture. The AnVIL portal starts
    per-user analysis environments (Galaxy via CloudMan, RStudio/Bioconductor via Leo, ...), one instance per user, launched on Kubernetes + Helm with the ability to scale; each environment has API and workspace persistence, with CVMFS providing reference data.]
  35. [Diagram: isolated Galaxy instances with a single interface. A Galaxy
    multiplexer routes each user across a security boundary to their own isolated resources: a per-user Galaxy instance, user data and database, and user compute containers. Anonymous users get an unprivileged Galaxy instance backed by a shared database holding no protected data.]
  36. [Diagram: future Kubernetes remote-execution data flow (@jmchilton,
    @natefoo). Galaxy creates a job (tool: HISAT2; inputs: datasets 1 and 2; output: dataset 3) as a Kubernetes Job pod; the executor (BioContainer) container fetches datasets 1 and 2 from the data storage volume over NFS, executes the job, writes dataset 3 back, and reports job completion via control messages.]
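That flow can be sketched as building a Kubernetes batch/v1 Job manifest. This is a minimal illustration with hypothetical mount paths and claim names, not the actual Galaxy/Pulsar implementation; the container image tag is an example of the BioContainers naming scheme:

```python
# Sketch: a Kubernetes Job manifest for one Galaxy tool execution,
# mounting the shared dataset volume into a BioContainers image.
def make_job_manifest(job_id: str, tool_image: str, command: list) -> dict:
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"galaxy-job-{job_id}"},
        "spec": {
            "backoffLimit": 0,  # let Galaxy, not Kubernetes, handle retries
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "executor",
                        "image": tool_image,
                        "command": command,
                        "volumeMounts": [{
                            "name": "datasets",
                            "mountPath": "/galaxy/data",  # hypothetical path
                        }],
                    }],
                    "volumes": [{
                        "name": "datasets",
                        # shared NFS-backed dataset storage (illustrative name)
                        "persistentVolumeClaim": {"claimName": "galaxy-data"},
                    }],
                },
            },
        },
    }

manifest = make_job_manifest(
    "42",
    "quay.io/biocontainers/hisat2:2.1.0--py36h2d50403_1",  # example tag
    ["hisat2", "-x", "/galaxy/data/hg38", "-U", "/galaxy/data/dataset_1.fastq"],
)
```

Submitting such a manifest to the cluster API then yields exactly the lifecycle in the diagram: the pod mounts the dataset volume, runs the containerized tool, and its terminal status signals job completion back to Galaxy.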
  37. Challenges for (health) science gateways - Human genomic, health, and

    other protected data will only be available from a small set of analysis platforms - For the foreseeable future this is motivated by policy, compliance, and political questions rather than technical concerns - Moving data requires meeting substantial compliance requirements - Making gateway software more modular and flexible, along with standards for deployment can mitigate this - Kubernetes could be a lowest common denominator, but more standardization is needed - We need to renew emphasis on interoperability at the platform, tool, and workflow level
  38. ACK

  39. Acknowledgements: Galaxy Contributors - Core Code: contributors to galaxyproject/galaxy: -

    ~315 (~39 new since last year) - Tools: contributors to galaxyproject/tools-iuc: - ~195 (~38 new since last year) - ...and the ever vigilant Intergalactic Utilities Commission for handling these contributions and maintaining the quality of essential Galaxy tools - ...and everyone else who has contributed a tool to the ToolShed - Training: contributors to galaxyproject/training-material - ~140 (~34 new since last year) - ...and everyone who has conducted or attended Galaxy Training - Everyone who has contributed to Galaxy in other ways: - users, supporters, … - Funding: NSF and NIH (to our team), and all of the funders of the Global Galaxy Community
  40. Acknowledgements Galaxy: Enis Afgan, Dannon Baker, Daniel Blankenberg, Dave Bouvier,

    Martin Čech, John Chilton, Dave Clements, Nate Coraor, Jeremy Goecks, Sergey Golitsynskiy, Qiang Gu, Juleen Graham, Björn Grüning, Sam Guerler, Mo Heydarian, Will Holden, Jennifer Hillman-Jackson, Vahid Jalili, Delphine Lariviere, Alexandru Mahmoud, Anton Nekrutenko, Alex Ostrovsky, Helena Rasche, Luke Sargent, Nicola Soranzo, Marius van den Beek The rest of the Taylor Lab at JHU: Boris Brenerman, Min Hyung Cho, Peter DeFord, Max Echterling, Nathan Roach, Michael E. G. Sauria, German Uritskiy Funding: NHGRI U41 HG006620 (Galaxy), NHGRI U24 HG010263 (AnVIL), NCI U24 CA231877 (Galaxy Federation), NSF DBI 0543285 and DBI 0850103 (Galaxy on US cyberinfrastructure) +Collaborators: Dave Hancock and the Jetstream group, Ross Hardison and the VISION group, Victor Corces, Karen Reddy, Johnston, Kim, Hilser, and DiRuggiero labs (JHU Biology), Battle, Goff, Langmead, Leek, Schatz, Timp labs (JHU Genomics)
  41. Acknowledgements: AnVIL Team. Broad Institute: Anthony Philippakis, Daniel MacArthur, Alex Bauman, Adrian Sharma,
    Andrew Rula, Dave Bernick, Jonathan Lawson, Kristian Cibulskis, Namrata Gupta, Rob Title, Eric Banks, Rich Silva. University of Chicago: Robert Grossman, Abby George, Garrett Rupp, Zac Flamig. University of California Santa Cruz: Benedict Paten, Denis Yuen, Brian O’Connor, Charles Overbeck, Kevin Osborn, Louise Cabansay, Natalie Perez, Stefan Kuhn, Walt Shands. Vanderbilt: Robert Carroll, Lakhan Swamy, Kristin Wuichet. Washington University: Ira Hall, Adam Coffman, Allison Regier, Haley Abel, Jason Walker. Johns Hopkins: James Taylor, Jeff Leek, Kasper Hansen, Enis Afgan, Alexandru Mahmoud, Sergey Golitsynskiy, Jenn Vessio, John Muschelli, Mo Heydarian. Penn State University: Anton Nekrutenko, John Chilton, Nate Coraor, Martin Čech. Oregon Health & Science University: Jeremy Goecks, Kyle Ellrott, Brian Walsh, Luke Sargent, Vahid Jalili. Roswell Park Cancer Institute: Martin Morgan, Nitesh Turaga. Harvard: Vincent Carey, BJ Stubbs, Shweta Gopaulakrishnan. City University of New York: Levi Waldron, Sehyun Oh, Ludwig Geistlinger.