Algorithms for de novo genome assembly from third generation sequencing data

Sović, Ivan (2016) Algorithms for de novo genome assembly from third generation sequencing data. Doctoral thesis, Sveučilište u Zagrebu, Fakultet elektrotehnike i računarstva.

Abstract

Over the past ten years, genome sequencing has been an extremely active research topic, and it is gaining particular momentum right now. New, exciting and more affordable technologies have been released, requiring the rapid development of new algorithmic methods to cope with the data. Affordable commercial availability of sequencing technology, together with algorithmic methods that can leverage the data, could open the door to a vast number of very important applications, such as diagnosis and treatment of chronic diseases through personalized medicine, or identification of pathogenic microorganisms from soil, water, food or tissue samples. Sequencing the entire genome of an organism is a difficult problem, because all sequencing technologies to date are limited in the length of the molecule they can read (much shorter than the genomes of the vast majority of organisms). To obtain the sequence of an entire genome, reads need to be either stitched together (assembled) in a de novo fashion when the genome of the organism is not known in advance, or mapped and aligned to a reference genome if one exists (reference assembly or mapping). The main problem in both approaches stems from repeating regions in genomes which, if longer than the reads, prevent complete assembly of the genome. The need for technologies that produce longer reads, which could solve the problem of repeating regions, has led to the advent of new sequencing approaches: the so-called third generation sequencing technologies, which currently include two representatives, Pacific Biosciences (PacBio) and Oxford Nanopore. Aside from long reads, both technologies are characterized by high error rates, which the assembly algorithms existing at the time were not capable of handling. This led to the development of time-consuming read error correction methods, which were applied as a preprocessing step prior to assembly.
In contrast, the work conducted in the scope of this thesis focuses on developing novel methods for de novo DNA assembly from third generation sequencing data that provide enough sensitivity and precision to omit the error-correction phase completely. Strong focus is put on nanopore data. In the scope of this thesis, four new methods were developed: (I) NanoMark, an evaluation framework for comparison of assembly methods from nanopore sequencing data; (II) GraphMap, a fast and sensitive mapper for long error-prone reads; (III) Owler, a sensitive overlapper for third generation sequencing; and (IV) Racon, a rapid consensus module for correcting raw assemblies. Owler and Racon were used as modules in the development of a novel de novo genome assembler, Aracon. The results show that Aracon reduces the overall assembly time by at least 3x, and by up to an order of magnitude, compared to the state-of-the-art methods, while retaining comparable or better assembly quality.

Item Type:	Thesis (Doctoral thesis)
Uncontrolled Keywords:	de novo; assembly; PacBio; nanopore; NanoMark; GraphMap; Racon; Aracon
Subjects:	TECHNICAL SCIENCES > Computing; TECHNICAL SCIENCES > Computing > Data Processing; BIOTECHNICAL SCIENCES > Biotechnology; BIOTECHNICAL SCIENCES > Biotechnology > Bioinformatics
Divisions:	Center for Informatics and Computing
Depositing User:	Ivan Sović
Date Deposited:	23 Feb 2017 13:58
URI:	https://fulir.irb.hr:/id/eprint/3390
