• However, many users of scientific computing:
  – have little interest in why parallelism is important to speed increases
  – do not have time or support to redevelop all their legacy code
• Our desire: increase multicore use in Otago Biochemistry
  – promoting container platform changes to better support multicore
  – presenting simple metrics to non-technologists to focus development efforts where the most effective difference can be made
• (... and to get a degree)
tools
  – http://galaxyproject.org/
  – Users at Otago Biochemistry seem to find it highly useful
• Galaxy provides a consistent, accessible interface that wraps stand-alone analysis software
  – scientists can focus on their actual work
  – no need for skilling up in computing methods
• Experiments are built as ‘pipelines’ of ‘tools’
  – The tool invocations provide parameter pages
  – Pipelines can be reused easily
well as detail
  – As experiment datasets grow, so too will the difficulties caused
• Gross computing inefficiencies hidden from end users
  – Many tools make poor use of the underlying resources: cores, etc.
• Separates tool developers from end users
  – Users may not understand whether to blame Galaxy or tools:
    – Tool developers may miss out on feedback
    – Users may not realise they should be expecting better performance
a monitoring framework for Galaxy
  – Galaxy does little monitoring of pipeline/tool optimisation itself
• System monitoring
  – Overall performance information
  – Prevents the system blindly exhausting all resources
• Pipeline monitoring
  – Check configuration of tools in a pipeline before execution
  – Detect excessive projected resource use
  – Suggest optimisations (e.g. minimising intermediate data size through reordering)
if there is sustained RAM usage above a given threshold.
• A warning is presented to the user if a large number of processes are persistently blocking for I/O.
• The total percentage complete is now displayed for some of the common Galaxy tools used.
• When known to be effective, the percentage of work complete is extrapolated to an estimated time to completion.
• Tools with pre-classified RAM consumption patterns, based on key input parameters, will provide estimated RAM use.
• Before executing tools, a history is consulted:
  – can suggest if invocations appear to be unreasonable
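The two checks described above (a sustained-threshold warning and extrapolating percentage complete to a time estimate) can be sketched roughly as follows. This is an illustration only, not the framework's actual code: the function names `sustained_above` and `estimate_seconds_remaining`, the sample format, and the linear-progress assumption are all hypothetical.

```python
from typing import Optional, Sequence


def sustained_above(samples: Sequence[float], threshold: float,
                    min_consecutive: int) -> bool:
    """Return True if at least `min_consecutive` samples in a row exceed
    `threshold`. Samples are assumed to be taken at a fixed interval,
    e.g. the fraction of RAM in use at each poll (hypothetical format)."""
    run = 0
    for value in samples:
        # Extend the current run while above the threshold, otherwise reset.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


def estimate_seconds_remaining(elapsed_seconds: float,
                               fraction_complete: float) -> Optional[float]:
    """Linearly extrapolate the time left from the fraction of work done.
    Returns None when no estimate is possible (no progress reported yet).
    Assumes progress is roughly linear in time, which only holds for
    some tools."""
    if not 0.0 < fraction_complete <= 1.0:
        return None
    return elapsed_seconds * (1.0 - fraction_complete) / fraction_complete


if __name__ == "__main__":
    ram_fraction = [0.50, 0.95, 0.96, 0.97, 0.40]
    if sustained_above(ram_fraction, threshold=0.90, min_consecutive=3):
        print("Warning: sustained high RAM usage")
    print(estimate_seconds_remaining(elapsed_seconds=60.0, fraction_complete=0.25))
```

Requiring several consecutive samples rather than a single reading is one simple way to keep short spikes from triggering a warning.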
(MCMC) algorithm
  – (Phasing ‘determines’ which alleles, i.e. alternative forms of a particular gene, come from which chromosome in a parent)
• beagle (implemented in Java) was using one core!
  – However, could split data, reduce precision, and increase speed
• RAM has a two-phase pattern
  – Used to provide warnings if all resources will be consumed

[Figure: beagle processing data on an 8GiB system]
tool developed by EMBL-EBI and the Wellcome Trust Sanger Institute
  – Implemented in Perl
  – Also only used one core by default!
• Cannot simply partition input, due to windowing function
  – ... but can turn off the windowing function
  – then get an extra 80% in throughput per core (up to 16) ...
  – ... even though computation was slowed down for each instance
• The tool’s developers added a process forking feature
  – (usually matched to core count)
efficiency of bioinformatics tools contained within Galaxy
  – Score is easy to compute: over tool or pipeline
  – Provides a direct measure of relative efficiency (usually)
  – Easy to explain to scientists:
    • they can focus and prioritise developers’ future efforts
• The multicore score is the CPU utilisation of all cores over the course of a workflow or tool execution, normalised by the number of cores (C) and the total time taken (T)
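One way to read the definition above is as busy core-seconds divided by C × T, giving a value between 0 and 1. The sketch below works under that reading. The sample format (an interval length paired with per-core utilisation fractions) and the name `multicore_score` are assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence, Tuple

# One sample: (interval length in seconds, per-core utilisation in [0, 1]).
# This layout is a hypothetical choice for the example.
Sample = Tuple[float, Sequence[float]]


def multicore_score(samples: Sequence[Sample], num_cores: int) -> float:
    """Busy core-seconds summed over the run, normalised by the number of
    cores (C) and the total time taken (T)."""
    total_time = sum(dt for dt, _ in samples)
    if num_cores <= 0 or total_time <= 0:
        raise ValueError("need at least one core and a positive total time")
    busy_core_seconds = sum(dt * sum(per_core) for dt, per_core in samples)
    return busy_core_seconds / (num_cores * total_time)


if __name__ == "__main__":
    # A single-threaded tool keeping one of eight cores busy for 10 seconds.
    single_threaded = [(10.0, [1.0] + [0.0] * 7)]
    print(multicore_score(single_threaded, num_cores=8))
```

Under this reading, a tool that keeps only one of eight cores busy scores 1/8 regardless of how long it runs, which is what makes the number easy to compare across tools and pipelines.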
by scientists, but risk hiding performance problems
  – We developed a resource monitoring framework in response
• Simple aggregate metrics can give a good estimate of whether everything is “going OK” in Galaxy
  – Many Galaxy tools are making poor use of multicore currently
• Ideally a resource utilisation protocol between Galaxy and its tools would allow scheduling of the tools in workflows for the most efficient CPU use