Galaxy Community Conference 2017: Evolution of Galaxy.

A somewhat accurate look back on the Galaxy project (with Anton Nekrutenko).

James Taylor

June 29, 2017

Transcript

  1. Dave Clements Marilyne Summo Patricia Laplagne Olivier Inizan Christophe Caron

    Gildas Le Corguillé Jean François Duffayard Nicole Vaslievsky Gautier Sarah Frédéric de Lamotte Virginie Rossard
  2. The legend goes like this: in 1980 Webb Miller and

    Eugene Myers asked Ross Hardison if there was anything interesting to do in biology... Ross Hardison Webb Miller Eugene Myers
  3. ...by early 2000’s the big data in biology was genomic

    sequences and alignments. Penn State was central in developing alignment tools Webb Miller Ross Hardison
  4. Webb Miller Ross Hardison ...by early 2000’s the big data

    in biology was genomic sequences and alignments. Penn State was central in developing alignment tools
  5. The basic question in the early 2000’s was: What is

    aligned to what and does it overlap with anything interesting?
  6. GALA enabled querying of annotation information from the human genome, alongside alignments with the mouse genome, integrated with the UCSC browser, and allowed building up set queries using the results of previous queries (the birth of the History system).
  7. We threw the first one away (quickly) and rewrote from scratch in Python.
    At this point we made several key design decisions that (in hindsight) determined whether we would succeed or fail. (We got very lucky.)
  8. 1. No longer store data in a database, but in flat files in various common formats.
    This meant existing tools could be integrated easily, because they did not need to change the data formats they work with or interact with a database.
    It also meant that when high-throughput sequence data suddenly came along (2005), we were prepared to deal with data at that scale easily.
  9. 2. Rather than build new analysis tools in the system, build an abstract, configuration-driven interface to command-line tools.
    We did this to make our lives easier: we had many analysis tools lying around that we didn't want to rewrite for Galaxy.
    But this was equally appealing to other developers, who could now easily make their tools available to biologists.
  10. 3. Make the entire stack self-contained, allowing a complete Galaxy to be set up on most systems in minutes.
    We primarily did this to engage tool developers, making it as easy as possible to develop new tool wrappers for contribution.
    We envisioned those tools would all be made available through the main Galaxy service.
    But it also provided a scaling strategy, making it easy for sites to run their own Galaxy.
  11. 4. Open-source and openly developed from the first commit.
    Provide everything we do under a liberal open-source license (no copyleft), and only support open-source tools on the main instance.
    Our primary development repository is exposed to the public, initially hosted by us but later moved out to third parties (bitbucket.org, and then github.com).
    The software is distributed only through version control, with a rapid release cycle (at least monthly).
  12. The basic question in the late 2000’s becomes: What would

    happen if I sequence the s****t out of anything?* *For metagenomic studies this was, in fact, precisely the question asked.
  13. Best thing about the introduction of the ToolShed: Birth of

    the Intergalactic Utilities Commission
  14. The community established itself and the evolutionary timeline accelerated!
    The only bad thing about it is that it is hard to put things in chronological order from memory without using git log.
  15. This included many things covered today and tomorrow, including:
    - Visualizations beyond Trackster
    - Expansion beyond genomics
    - Massive tool suite contributions and updates
    - Interactive environments
    - Training & Tours
    - … uhhh … so much more
  16. The problem is that PubMed indexes Nature and Science, which are scientific journals with broad subject coverage.
  17. Being in the sciences, let's invent an index: G_i = true Galaxy pubs / false Galaxy pubs.
    2005: G_i = 1/15
    2017: G_i = 11/14
  18. “Just so you know, you've got a lot of really

    rare specimens preserved here”
  19. Björn Grüning is connected to the Internet directly (probably born with 802.11 circuitry; his particular hardware version lacks sleep functionality).
  20. We need better ways to look at, think about, and manage datasets at the 100k scale.
    At some point users no longer care about seeing the individual history or workflow, just specific results.
    New: many-workflow view, for monitoring the execution of many workflows in parallel.
    New: reports, to generate summaries of executing workflows, or of multiple workflows, from user templates with continuous updates.
  21. Analysis Scale Analysis Process Phase (exploratory) (batch) 10s, batch 100s,

    batch 100k, batch ? Interactive Environments: 10s of datasets, ad hoc analyses
  22. Analysis Scale Analysis Process Phase (exploratory) (batch) 10s, batch 100s,

    batch 100k, batch ? ad hoc, more flexible Visualization and analytics 10s of datasets, highly interactive
  23. Analysis Scale Analysis Process Phase (exploratory) (batch) 10s, batch 100s,

    batch 100k, batch ? ad hoc, more flexible visual exploration ?
  24. We need to support exploratory data analysis even more than we do now.
    Dataset complexity, heterogeneity, and dimensionality are all only increasing.
    The analysis decision process requires more support for data exploration, both visual and interactive data manipulation.
  25. The future Galaxy needs to scale seamlessly across the data

    analysis process… …supporting analysts as they transition from exploratory, to batch, to high-throughput
  26. At either end of the spectrum, there are common themes.
    The future Galaxy embraces real time and continuous communication. From exploratory analysis to batch job tracking to automatic reports, Galaxy needs to be responsive and informative.
    The future Galaxy is increasingly interactive.
    The future Galaxy better supports transitions between analysis modes.
  27. The future of this project depends solely on the community,

    its openness, and continuing outreach!
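
The second design decision on slide 9 (a configuration-driven interface to command-line tools, operating on flat files) can be illustrated with a minimal Python sketch. This is only an illustration of the general idea, not Galaxy's actual tool format or code: the dictionary layout, the `{input}` placeholder convention, and the names `TOOL_CONFIG`, `build_command`, and `run_tool` are all assumptions made up for this example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical tool description (NOT Galaxy's real config format).
# The wrapped "tool" is an ordinary command-line program that reads a flat file;
# here it is a Python one-liner that counts lines, so the sketch runs anywhere.
TOOL_CONFIG = {
    "id": "line_count",
    "command": [
        sys.executable,
        "-c",
        "import sys; print(sum(1 for _ in open(sys.argv[1])))",
        "{input}",
    ],
}


def build_command(config, params):
    """Turn a tool description plus user parameters into an argv list."""
    argv = []
    for part in config["command"]:
        # A part written exactly as "{name}" is a placeholder for a parameter.
        if part.startswith("{") and part.endswith("}"):
            name = part[1:-1]
            if name not in params:
                raise KeyError(f"missing parameter: {name}")
            argv.append(str(params[name]))
        else:
            argv.append(part)
    return argv


def run_tool(config, params):
    """Run the described tool unchanged and return what it printed."""
    result = subprocess.run(
        build_command(config, params),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # The data stays in a plain tab-separated file, not a database.
        regions = Path(workdir) / "regions.bed"
        regions.write_text("chr1\t0\t10\nchr1\t20\t30\nchr2\t5\t15\n")
        print(run_tool(TOOL_CONFIG, {"input": regions}))
```

In this sketch the wrapped program never needs to know it is being driven by a framework: it receives a file path on its command line and writes to standard output, which mirrors the reasoning on slides 8 and 9 about why existing tools could be integrated without changing them.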