A stochastic de novo assembly algorithm for viral-sized genomes obtains correct genomes and builds consensus

Research output: Contribution to journal › Article › Academic › peer-review

1 Citation (Scopus)

Abstract

A genetic algorithm with stochastic macro mutation operators which merge, split, move, reverse and align DNA contigs on a scaffold is shown to accurately and consistently assemble raw DNA reads from an accurately sequenced single-read library into a contiguous genome. A candidate solution is a permutation of DNA reads, segmented into contigs. An interleaved merge operator for contigs allows for the quick minimization of a fitness function measuring the string length of a candidate solution. This study assembles read libraries for three genomic fragments from different organisms, five complete virus genomes, and one complete bacterial genome, with the largest genome length of 159 kbp. To evaluate the accuracy of any assembled genome, test libraries of DNA reads are generated from reference genomes, and the assembly is compared to the reference. The method has very high assembly accuracy: over repeated assemblies for each input genome, the original genome was constructed optimally in over 85% of the runs. Given the consistency of the algorithm, the method is suitable to determine the consensus genome in de novo assembly problems. There are two limitations to the method: genomes with long repeats may be overcompressed, and the computational complexity is high.
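
The representation and fitness function described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the regular-step read sampling, the greedy suffix-prefix overlap used to spell out a contig, the simple concatenating merge (which stands in for, and is much simpler than, the interleaved merge operator), and all function names are assumptions made for illustration only.

```python
def make_read_library(reference, read_len, step):
    """Cut error-free reads of fixed length from a reference genome.

    Hypothetical sampling scheme: one read every `step` bases.
    """
    last_start = len(reference) - read_len
    return [reference[s:s + read_len] for s in range(0, last_start + 1, step)]


def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def contig_string(contig, reads):
    """Spell out a contig (a list of read indices) by overlapping consecutive reads."""
    text = reads[contig[0]]
    for idx in contig[1:]:
        read = reads[idx]
        text += read[overlap(text, read):]
    return text


def fitness(candidate, reads):
    """Total string length of a candidate solution; lower is better."""
    return sum(len(contig_string(c, reads)) for c in candidate)


def merge_contigs(candidate, i, j):
    """Assumed simple merge: append contig j to contig i (not the interleaved merge)."""
    merged = candidate[i] + candidate[j]
    rest = [c for k, c in enumerate(candidate) if k not in (i, j)]
    return [merged] + rest


# A candidate is a permutation of read indices, segmented into contigs.
reference = "ATGGCGTACCTTGAGCAT"
reads = make_read_library(reference, read_len=8, step=5)

fragmented = [[0], [1], [2]]                   # every read is its own contig
partly_merged = merge_contigs(fragmented, 0, 1)
assembled = [[0, 1, 2]]                        # one contig, reads in genome order

print(fitness(fragmented, reads))              # 24
print(fitness(partly_merged, reads))           # 21
print(fitness(assembled, reads))               # 18
print(contig_string(assembled[0], reads) == reference)  # True
```

In this toy setting, comparing the spelled-out contig with the reference mirrors the evaluation described above. Because the sketch always takes the longest overlap, a repeated region could produce a spurious overlap and a string shorter than the true genome, which is one way to picture the overcompression limitation the abstract mentions.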
Original language: English
Pages (from-to): 184-199
Journal: Information sciences
Volume: 420
DOIs: 10.1016/j.ins.2017.07.039
Publication status: Published - 2017

Fingerprint

Genome
Genes
DNA
Mathematical operators
Scaffold
Fitness Function
Operator
Viruses
Scaffolds
Virus
Genomics
Macros
Reverse
Computational complexity
Fragment
Permutation
Mutation
Strings
Genetic algorithms

Keywords

  • De novo, DNA, Assembly, Genetic algorithm, Consensus genome

Cite this

@article{b8376d39946143f680c225d0701aab85,
title = "A stochastic de novo assembly algorithm for viral-sized genomes obtains correct genomes and builds consensus",
abstract = "A genetic algorithm with stochastic macro mutation operators which merge, split, move, reverse and align DNA contigs on a scaffold is shown to accurately and consistently assemble raw DNA reads from an accurately sequenced single-read library into a contiguous genome. A candidate solution is a permutation of DNA reads, segmented into contigs. An interleaved merge operator for contigs allows for the quick minimization of a fitness function measuring the string length of a candidate solution. This study assembles read libraries for three genomic fragments from different organisms, five complete virus genomes, and one complete bacterial genome, with the largest genome length of 159 kbp. To evaluate the accuracy of any assembled genome, test libraries of DNA reads are generated from reference genomes, and the assembly is compared to the reference. The method has very high assembly accuracy: over repeated assemblies for each input genome, the original genome was constructed optimally in over 85{\%} of the runs. Given the consistency of the algorithm, the method is suitable to determine the consensus genome in de novo assembly problems. There are two limitations to the method: genomes with long repeats may be overcompressed, and the computational complexity is high.",
keywords = "De novo, DNA, Assembly, Genetic algorithm, Consensus genome",
author = "Doina Bucur",
year = "2017",
doi = "10.1016/j.ins.2017.07.039",
language = "English",
volume = "420",
pages = "184--199",
journal = "Information sciences",
issn = "0020-0255",
publisher = "Elsevier",
}

A stochastic de novo assembly algorithm for viral-sized genomes obtains correct genomes and builds consensus. / Bucur, Doina.

In: Information sciences, Vol. 420, 2017, p. 184-199.

TY - JOUR

T1 - A stochastic de novo assembly algorithm for viral-sized genomes obtains correct genomes and builds consensus

AU - Bucur, Doina

PY - 2017

Y1 - 2017

AB - A genetic algorithm with stochastic macro mutation operators which merge, split, move, reverse and align DNA contigs on a scaffold is shown to accurately and consistently assemble raw DNA reads from an accurately sequenced single-read library into a contiguous genome. A candidate solution is a permutation of DNA reads, segmented into contigs. An interleaved merge operator for contigs allows for the quick minimization of a fitness function measuring the string length of a candidate solution. This study assembles read libraries for three genomic fragments from different organisms, five complete virus genomes, and one complete bacterial genome, with the largest genome length of 159 kbp. To evaluate the accuracy of any assembled genome, test libraries of DNA reads are generated from reference genomes, and the assembly is compared to the reference. The method has very high assembly accuracy: over repeated assemblies for each input genome, the original genome was constructed optimally in over 85% of the runs. Given the consistency of the algorithm, the method is suitable to determine the consensus genome in de novo assembly problems. There are two limitations to the method: genomes with long repeats may be overcompressed, and the computational complexity is high.

KW - De novo, DNA, Assembly, Genetic algorithm, Consensus genome

U2 - 10.1016/j.ins.2017.07.039

DO - 10.1016/j.ins.2017.07.039

M3 - Article

VL - 420

SP - 184

EP - 199

JO - Information sciences

JF - Information sciences

SN - 0020-0255

ER -