GALA: gap-free chromosome-scale assembly with long reads

High-quality genome assembly has wide applications in genetics and medical studies. However, achieving gap-free chromosome-scale assemblies with current long-read assembly workflows remains very challenging.

Here we propose a chromosome-by-chromosome assembly strategy implemented through a multi-layer computer graph, which identifies mis-assemblies within preliminary assemblies or chimeric raw reads and partitions the data into chromosome-scale linkage groups. The subsequent independent assembly of each linkage group generates a gap-free assembly free from the mis-assembly errors that usually plague existing workflows. This flexible framework also allows us to integrate data from various technologies, such as PacBio, Nanopore, Hi-C, and genetic maps, to generate gap-free chromosome-scale assemblies.
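
As a rough illustration of the partitioning idea only, the following Python sketch treats each preliminary assembly as one layer, links contigs from any layer that share a read, and reports the connected components as linkage groups. It is a toy under stated assumptions, not GALA's implementation: the input layout (assembly name to contig name to set of read IDs), the function name `linkage_groups`, and the shared-read linking rule are all hypothetical, and the sketch assumes that mis-assembled contigs and chimeric reads have already been cut or removed, since otherwise they would merge distinct chromosomes into one group.

```python
from collections import defaultdict


def linkage_groups(layers):
    """Partition reads into linkage groups.

    layers: dict mapping assembly name -> dict mapping contig name -> set of read IDs.
    Hypothetical input layout chosen for illustration; not GALA's data model.
    Returns a sorted list of sorted read-ID lists, one per linkage group.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Each graph node is an (assembly, contig) pair; index nodes by the reads they contain.
    read_to_nodes = defaultdict(list)
    for assembly, contigs in layers.items():
        for contig, reads in contigs.items():
            node = (assembly, contig)
            parent[node] = node
            for read in reads:
                read_to_nodes[read].append(node)

    # Contigs that share a read are assumed to come from the same chromosome.
    for nodes in read_to_nodes.values():
        for other in nodes[1:]:
            union(nodes[0], other)

    # Collect the reads of every contig under its component root.
    groups = defaultdict(set)
    for assembly, contigs in layers.items():
        for contig, reads in contigs.items():
            groups[find((assembly, contig))] |= reads

    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    toy_layers = {
        "asm_a": {"a1": {"r1", "r2"}, "a2": {"r3", "r4"}},
        "asm_b": {"b1": {"r2", "r5"}, "b2": {"r4", "r6"}},
    }
    for group in linkage_groups(toy_layers):
        print(group)
```

In this toy input, contigs `a1` and `b1` share read `r2` and contigs `a2` and `b2` share read `r4`, so the reads fall into two groups, each of which could then be assembled independently.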

We de novo assembled the C. elegans and A. thaliana genomes using GALA with combined PacBio and Nanopore sequencing data from publicly available datasets. We also demonstrated its applicability with a gap-free assembly of two chromosomes of the human genome. In addition, GALA showed promising performance with PacBio high-fidelity long reads.

Our method enables straightforward assembly of genomes with multiple data sources and multiple computational tools, overcoming barriers that at present restrict the application of de novo genome assembly technology.

Authors: Mohamed Awad, Xiangchao Gan