Next Generation Sequencing (NGS) is a fundamental practice in bioinformatics. NGS pipelines are composed of complex, multi-step processes involving many different tools and intermediate data formats. With easy access to cloud infrastructure and containerized applications that are portable across clouds, users are increasingly extending their pipelines to the cloud.
In this first of two articles, we’ll discuss Nextflow, a leading tool for managing bioinformatics workflows, and show how it can be used with Univa Grid Engine (UGE) and Navops Launch to realize a framework-agnostic and cloud-agnostic hybrid cloud infrastructure.
On a cluster, a single bioinformatics pipeline can translate into hundreds or even thousands of discrete jobs. To make matters worse, many users run different pipelines simultaneously against different datasets, and pipelines change constantly as new tools and more effective analysis techniques are identified.
In the past, genomic pipelines were managed using custom scripts written in Bash or Python. While functional, these custom solutions tend to be “brittle” and hard to maintain. Scripted workflows are often complex because challenges like synchronizing multi-step flows, managing data, and handling run-time exceptions are left to the author of the script. Small changes to data formats, tools, or the environment can cause scripts to fail. Given this complexity, the original author of the script is often the only person able to troubleshoot issues and resolve problems efficiently. A better practice is to use a tool purpose-built for distributed, collaborative genomic workflows.