16th May 2015
The accuracy of sequencing single DNA molecules with nanopores is continually improving, but de novo genome sequencing and assembly using only nanopore data remain challenging. Here we describe PoreSeq, an algorithm that identifies and corrects errors in nanopore sequencing data and improves the accuracy of de novo genome assembly with increasing coverage depth. The approach relies on modelling the possible sources of uncertainty that occur as DNA transits through the nanopore and finds the sequence that best explains multiple reads of the same region. PoreSeq increases nanopore sequencing read accuracy of M13 bacteriophage DNA from 85% to 99% at 100× coverage. We also use the algorithm to assemble Escherichia coli with 30× coverage and the λ genome at a range of coverages from 3× to 50×. Additionally, we classify sequence variants at an order of magnitude lower coverage than is possible with existing methods.