Towards accurate de novo assembly for genomes with repeats

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

    Abstract

    De novo genome assemblers designed for short k-mer lengths or short raw reads are unlikely to recover complex features of the underlying genome, such as repeats hundreds of bases long. We implement a stochastic machine-learning method which obtains accurate assemblies with repeats and self-validates assemblies via consensus. For this, a prior assembler is extended with the ability to (a) assemble variable-length raw reads, which may span and unambiguously recover interspersed repeats in the genome, and (b) recognize long, direct terminal repeats during the assembly, then report an unambiguous circular assembly. Consensus is obtained via stochastically independent runs of the assembler on the same read library. We experiment on viral and mitochondrial genomes of up to 41 kbp, with synthetic raw-read libraries, so that each assembly can be evaluated against a reference. We show the prerequisites for obtaining accurate assemblies. For genomes with interspersed repeats, using raw reads of average length comparable to the repeat length is likely to give an accurate assembly. Genomes with long direct terminal repeats can also be assembled accurately with reads shorter than the repeat length. In both cases, a simple majority forms consensus, since over 70% of independent runs on this set of genomes yield a correct assembly.
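
The two ideas in the abstract, reporting a circular assembly when a direct terminal repeat is present and forming consensus by simple majority over independent runs, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract says the repeat is recognized during assembly, whereas this sketch post-processes finished assemblies. The function names, the `min_len` threshold, and the toy sequences are all hypothetical, and strand orientation (reverse complements) is ignored.

```python
from collections import Counter
from typing import List, Optional


def trim_terminal_repeat(seq: str, min_len: int) -> Optional[str]:
    """Return seq without its trailing copy of a direct terminal repeat.

    A direct terminal repeat is taken here (an assumption) to mean that the
    first k bases equal the last k bases, with k >= min_len. Returns None
    when no such repeat exists, i.e. the assembly is treated as linear.
    """
    for k in range(len(seq) // 2, min_len - 1, -1):  # longest repeat first
        if seq[:k] == seq[-k:]:
            return seq[:-k]
    return None


def canonical_rotation(seq: str) -> str:
    """Lexicographically smallest rotation of a circular sequence.

    Quadratic in len(seq); adequate for a toy example only.
    """
    if not seq:
        return seq
    return min(seq[i:] + seq[:i] for i in range(len(seq)))


def normalize(seq: str, min_len: int) -> str:
    """Circular assemblies are compared up to rotation, linear ones as-is."""
    trimmed = trim_terminal_repeat(seq, min_len)
    return seq if trimmed is None else canonical_rotation(trimmed)


def majority_consensus(assemblies: List[str], min_len: int = 20) -> Optional[str]:
    """Return the assembly produced by more than half of the runs, else None."""
    if not assemblies:
        return None
    counts = Counter(normalize(a, min_len) for a in assemblies)
    seq, votes = counts.most_common(1)[0]
    return seq if 2 * votes > len(assemblies) else None
```

For example, two runs that assemble the same circular genome from different starting positions count as the same vote, while a divergent third run is outvoted; three mutually different assemblies give no consensus.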
    Original language: English
    Title of host publication: 2017 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB)
    Number of pages: 8
    ISBN (Electronic): 978-1-4673-8988-4
    Publication status: Published - 1 Aug 2017
    Event: 2017 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology - INNSIDE Hotel, Manchester, United Kingdom
    Duration: 23 Aug 2017 – 25 Aug 2017
    http://cibcb2017.org/

    Conference

    Conference: 2017 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology
    Abbreviated title: CIBCB 2017
    Country: United Kingdom
    City: Manchester
    Period: 23/08/17 – 25/08/17

    Keywords

    • biology computing
    • genomics
    • learning (artificial intelligence)
    • molecular biophysics
    • mitochondrial genomes
    • stochastic machine-learning method
    • synthetic raw-read libraries
    • variable-length raw read
    • viral genomes
    • Bioinformatics
    • DNA
    • Genetic algorithms
    • Genomics
    • Layout
    • Libraries
    • Sequential analysis
