Multicore World 2013
February 19, 2013

Multicore scores and resource optimisation within the Galaxy Project


  1. Motivation
     • Significant levels of parallelism are here to stay
     • However, many users of scientific computing:
       – have little interest in why parallelism is important to speed increases
       – do not have time or support to redevelop all their legacy code
     • Our desire: increase multicore use in Otago Biochemistry
       – promoting container platform changes to better support multicore
       – presenting simple metrics to non-technologists to focus development efforts where the most effective difference can be made
     • (... and to get a degree)
  2. The Galaxy Project
     • Galaxy provides web-based access to bioinformatics tools
       – http://galaxyproject.org/
       – Users at Otago Biochemistry seem to find it highly useful
     • Galaxy provides a consistent, accessible interface that wraps stand-alone analysis software
       – scientists can focus on their actual work
       – no need for skilling up in computing methods
     • Experiments are built as ‘pipelines’ of ‘tools’
       – The tool invocations provide parameter pages
       – Pipelines can be reused easily
  3. (no transcribed text; the slide appears to be image-only)

  4. Downsides of Galaxy
     • User-friendly abstraction hides problems as well as detail
       – As experiment datasets grow, so too will the difficulties caused
     • Gross computing inefficiencies are hidden from end users
       – Many tools make poor use of the underlying resources: cores, etc.
     • Separates tool developers from end users
       – Users may not understand whether to blame Galaxy or the tools
       – Tool developers may miss out on feedback
       – Users may not realise they should be expecting better performance
  5. Keeping an eye on the Galaxy
     • We began developing a monitoring framework for Galaxy
       – Galaxy does little monitoring of pipeline/tool optimisation itself
     • System monitoring
       – Overall performance information
       – Prevents the system from blindly exhausting all resources
     • Pipeline monitoring
       – Check configuration of tools in a pipeline before execution
       – Detect excessive projected resource use
       – Suggest optimisations (e.g. minimising intermediate data size through reordering)
  6. Resource monitoring specifics
     • Users are warned if RAM usage stays above a given threshold for a sustained period.
     • A warning is presented to the user if a large number of processes are persistently blocking for I/O.
     • The total percentage complete is now displayed for some commonly used Galaxy tools.
     • When known to be effective, the percentage of work complete is extrapolated to an estimated time to completion.
     • Tools with pre-classified RAM consumption patterns, based on key input parameters, will provide estimated RAM use.
     • Before executing tools, a history is consulted:
       – it can flag invocations that appear to be unreasonable
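The "sustained" RAM warning and the time-to-completion extrapolation described above can be sketched as follows. This is a minimal illustration, not the framework's actual code: the class and function names, the sliding-window definition of "sustained", and the linear extrapolation are all assumptions made for the example.

```python
from collections import deque


class SustainedThresholdMonitor:
    """Hypothetical helper: warn only when every sample in a sliding
    window exceeds the threshold, so brief spikes are ignored."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # keeps only the newest `window` readings

    def add_sample(self, value):
        """Record one reading; return True if usage has stayed above the
        threshold for the whole window."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold for s in self.samples)


def estimate_seconds_remaining(elapsed_seconds, fraction_complete):
    """Linear extrapolation (an assumption): the remaining work is taken
    to proceed at the average rate observed so far."""
    if not 0.0 < fraction_complete <= 1.0:
        return None  # no progress yet, or invalid input: no estimate
    return elapsed_seconds * (1.0 - fraction_complete) / fraction_complete


# RAM readings as a fraction of total memory (illustrative values only).
monitor = SustainedThresholdMonitor(threshold=0.9, window=3)
ram_warnings = [monitor.add_sample(r) for r in [0.95, 0.5, 0.92, 0.93, 0.97]]
eta = estimate_seconds_remaining(elapsed_seconds=30.0, fraction_complete=0.25)
```

Here only the last reading triggers a warning, because it is the first point at which three consecutive samples exceed 0.9. A linear estimate is only reasonable for tools whose progress is roughly uniform, which matches the slide's caveat "when known to be effective".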
  7. Case study: Beagle phasing
     • Beagle is a Markov-Chain Monte-Carlo (MCMC) algorithm
       – (Phasing ‘determines’ which alleles, i.e. alternative forms of a particular gene, come from which chromosome in a parent)
     • beagle (implemented in Java) was using one core!
       – However, the data could be split, reducing precision but increasing speed
     • RAM use has a two-phase pattern
       – Used to provide warnings if all resources will be consumed
     [Figure: beagle processing data on an 8 GiB system]
  8. History-based reporting
     • Provide warnings & information about likely tool behaviour
     • User can ignore all information
     • Can catch cases that crashed the system
       – Also have the system monitor working in parallel
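One way a history lookup of this kind could work is sketched below. Everything here is hypothetical: the `PastRun` record, the `history_warnings` function, and the crude "worst observed RAM per input byte" projection are assumptions made for illustration, not the talk's implementation. The warnings are advisory only, in line with the point that the user can ignore them.

```python
from dataclasses import dataclass


@dataclass
class PastRun:
    """Hypothetical record of one earlier tool invocation."""
    tool: str
    input_bytes: int
    peak_ram_bytes: int
    crashed: bool


def history_warnings(history, tool, input_bytes, available_ram_bytes):
    """Return advisory messages for a planned invocation, based on past runs."""
    warnings = []
    runs = [r for r in history if r.tool == tool]
    if not runs:
        return warnings  # nothing known about this tool yet

    # A crash on an input no larger than this one is a strong hint of trouble.
    if any(r.crashed and r.input_bytes <= input_bytes for r in runs):
        warnings.append("a previous run with an input of this size or smaller crashed")

    # Crude projection (assumption): scale the worst observed RAM-per-input ratio.
    ratios = [r.peak_ram_bytes / r.input_bytes for r in runs if r.input_bytes > 0]
    if ratios and max(ratios) * input_bytes > available_ram_bytes:
        warnings.append("projected peak RAM exceeds available RAM")
    return warnings


# Illustrative history with made-up numbers.
history = [
    PastRun("beagle", input_bytes=100, peak_ram_bytes=400, crashed=False),
    PastRun("beagle", input_bytes=200, peak_ram_bytes=900, crashed=True),
]
large_job = history_warnings(history, "beagle", input_bytes=300, available_ram_bytes=1000)
small_job = history_warnings(history, "beagle", input_bytes=50, available_ram_bytes=1000)
```

With this history the larger job collects both warnings and the smaller one collects none. A real monitor would presumably key on tool parameters as well as input size.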
  9. Case study: Ensembl
     • The Ensembl Variant Effect Predictor is a tool developed by EMBL-EBI and the Wellcome Trust Sanger Institute
       – Implemented in Perl
       – Also only used one core by default!
     • Cannot simply partition input, due to windowing function
       – ... but can turn off the windowing function
       – then get an extra 80% in throughput per core (up to 16) ...
       – ... even though computation was slowed down for each instance
     • The tool’s developers added a process forking feature
       – (usually matched to core count)
  10. Multicore scores
     • The multicore score aims to focus efforts on increasing the efficiency of bioinformatics tools contained within Galaxy
       – Score is easy to compute over a tool or a pipeline
       – Provides a direct measure of relative efficiency (usually)
       – Easy to explain to scientists:
         • they can focus and prioritise developers’ future efforts
     • The multicore score is the CPU utilisation of all cores over the course of a workflow or tool execution, normalised to the number of cores (C) and the total time taken (T)
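The definition above can be read as "busy core-seconds divided by C × T". The sketch below computes it from evenly spaced per-core utilisation samples. The function name, the sampling scheme, and the use of fractions in [0, 1] are assumptions for illustration; the slides do not specify how utilisation was measured.

```python
def multicore_score(samples, interval_seconds):
    """Compute a multicore score from per-core utilisation samples.

    samples: list of readings, each a list with one utilisation value
             in [0, 1] per core (hypothetical input format).
    interval_seconds: time between consecutive readings.
    """
    if not samples:
        raise ValueError("need at least one sample")
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    cores = len(samples[0])  # C
    if cores == 0 or any(len(s) != cores for s in samples):
        raise ValueError("every sample must cover the same, non-zero number of cores")

    total_time = len(samples) * interval_seconds  # T
    busy_core_seconds = sum(sum(s) for s in samples) * interval_seconds
    return busy_core_seconds / (cores * total_time)


# A single-threaded tool on a 4-core machine, sampled 10 times.
single_threaded = multicore_score([[1.0, 0.0, 0.0, 0.0]] * 10, interval_seconds=2.0)
# A tool that keeps all 4 cores busy for the whole run.
fully_parallel = multicore_score([[1.0, 1.0, 1.0, 1.0]] * 10, interval_seconds=2.0)
```

Under this reading a single-threaded tool on four cores scores 0.25 and a perfectly parallel one scores 1.0, which is the kind of simple comparison the slide describes. Per-core readings could come from a system monitor such as `psutil.cpu_percent(percpu=True)`, whose values are percentages and would need dividing by 100.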
  11. Conclusions and Future Work
     • Galaxy’s abstractions are greatly appreciated by scientists, but risk hiding performance problems
       – We developed a resource monitoring framework in response
     • Simple aggregate metrics can give a good estimate of whether everything is “going OK” in Galaxy
       – Many Galaxy tools currently make poor use of multicore
     • Ideally, a resource utilisation protocol between Galaxy and its tools would allow the tools in workflows to be scheduled for the most efficient CPU use