
CLOUDFLOW - Enabling Faster Biomedical Pipelines with Mapreduce and Spark

Forer, Lukas; Afgan, Enis; Weissenteiner, Hansi; Davidović, Davor; Specht, Guenther; Kronenberg, Florian; Schoenherr, Sebastian (2016) CLOUDFLOW - Enabling Faster Biomedical Pipelines with Mapreduce and Spark. Scalable Computing: Practice and Experience, 17 (2). pp. 103-114. ISSN 1895-1767

PDF - Published Version - article
Available under License Creative Commons Attribution.



For many years, Apache Hadoop has been used as a synonym for processing data in the MapReduce fashion. However, due to the complexity of developing MapReduce applications, adoption of this paradigm in genetics has been limited. To alleviate some of these issues, we previously developed Cloudflow, a high-level pipeline framework that allows users to create sophisticated biomedical pipelines from predefined code blocks, while the framework automatically translates them into the MapReduce execution model. With the introduction of the YARN resource management layer, new computational processing models such as Apache Spark are now pluggable into the Hadoop ecosystem. In this paper we describe the extension of Cloudflow to support Apache Spark without any adaptations to already implemented pipelines. The performance evaluation described in the paper demonstrates that Spark can bring an additional boost to the analysis of next-generation sequencing (NGS) data in the field of genetics. The Cloudflow framework is open source and freely available at
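The idea of defining a pipeline once and running it on different execution models can be illustrated with a small, self-contained sketch. This is not Cloudflow's API and uses neither Hadoop nor Spark: the names `Pipeline`, `run_mapreduce_style`, `run_spark_style` and `count_bases` are hypothetical. It only assumes that a pipeline is an ordered list of predefined steps, and that one backend materializes every intermediate result (as chained MapReduce jobs would), while the other chains map steps lazily (roughly as Spark does).

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


class Pipeline:
    """Backend-agnostic pipeline description: an ordered list of steps (hypothetical)."""

    def __init__(self):
        self.steps = []

    def flat_map(self, fn):
        # fn maps one record to a list of (key, value) pairs
        self.steps.append(("flat_map", fn))
        return self

    def reduce_by_key(self, fn):
        # fn combines two values that share the same key
        self.steps.append(("reduce_by_key", fn))
        return self


def run_mapreduce_style(pipeline, records):
    """Eager backend: each step materializes its complete output before the next runs."""
    data = list(records)
    for kind, fn in pipeline.steps:
        if kind == "flat_map":
            data = [out for rec in data for out in fn(rec)]
        else:
            # Group values by key (the "shuffle"), then fold each group.
            groups = defaultdict(list)
            for key, value in data:
                groups[key].append(value)
            data = [(key, reduce(fn, values)) for key, values in groups.items()]
    return sorted(data)


def run_spark_style(pipeline, records):
    """Lazy backend: map steps are chained as iterators; only the reduce holds state."""
    data = iter(records)
    for kind, fn in pipeline.steps:
        if kind == "flat_map":
            data = chain.from_iterable(map(fn, data))
        else:
            acc = {}
            for key, value in data:
                acc[key] = fn(acc[key], value) if key in acc else value
            data = iter(acc.items())
    return sorted(data)


def count_bases(read):
    """Toy 'predefined block': emit (base, 1) for every base in a read."""
    return [(base, 1) for base in read]


# The pipeline is written once and handed unchanged to either backend.
pipeline = Pipeline().flat_map(count_bases).reduce_by_key(lambda a, b: a + b)
reads = ["ACGT", "AACG", "TTGA"]

result_mr = run_mapreduce_style(pipeline, reads)
result_spark = run_spark_style(pipeline, reads)
print(result_mr == result_spark, result_mr)
```

Because the pipeline object only records what to do, switching the execution model means switching the runner function, which mirrors the paper's claim that existing pipelines need no changes. The sketch says nothing about real performance differences between the two models.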

Item Type: Article
Uncontrolled Keywords: Apache YARN ; Pipeline Framework ; Spark ; Cloud Computing
Subjects: TECHNICAL SCIENCES > Computing
Divisions: Center for Informatics and Computing
Depositing User: Davor Davidović
Date Deposited: 22 Mar 2017 15:28
DOI: 10.12694/scpe.v17i2.1159
