HERRO: haplotype-aware error correction of ultra-long nanopore reads


Abstract

Phased genome assembly traditionally requires at least two distinct long-read technologies: ultra-long nanopore simplex reads, and either PacBio HiFi or nanopore duplex reads. However, our research indicates that self-corrected ultra-long nanopore reads alone are adequate for this goal. We have developed a tool called HERRO, which corrects errors in ultra-long reads while preserving the nucleotide variations characteristic of each haplotype. HERRO is a two-stage technique that integrates read overlapping with artificial intelligence (AI)-based error correction. Its architecture combines convolutional neural networks with self-attention mechanisms. On R10.4.1 ultra-long reads, HERRO reduces the error rate by more than an order of magnitude. Using only HERRO-corrected ultra-long reads, de novo assembly of the human HG002 genome yields results that are on par with, or surpass, those obtained by methods that combine uncorrected ultra-long reads with either PacBio HiFi or duplex nanopore reads.

Biography

Professor Mile Sikic is a group leader at the Genome Institute of Singapore, A*STAR. He is also a professor of computer science at the University of Zagreb, where he obtained his PhD in 2008. Early in his career, he worked as a system integrator, consultant, and project manager on more than 70 industry projects in the fields of computer and mobile networks. His research interests include the development of classical algorithms and AI methods for genome sequence analysis and RNA structure prediction. Prof. Sikic has started several companies and one hedge fund.

Authors: Mile Sikic