TGS-GapCloser: fast and accurately passing through the Bermuda in large genome using error-prone third-generation long reads

The completeness and accuracy of genome assemblies determine the quality of subsequent bioinformatics analysis. Although current gap-closing tools benefit from the medium- and long-range information of third-generation sequencing, they suffer from multiple alignments and high error rates, which makes improving an assembly costly in both time and money.

We developed a software tool, TGS-GapCloser, that uses low-depth (>=10X) single-molecule long sequencing reads, without any error correction, to close gaps. The algorithm identifies gap regions from the alignments of long reads against the original scaffolds, corrects only the candidate regions, and assigns the best sequence to each gap.
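As a rough illustration of what "gap regions" means at the scaffold level, the sketch below locates gaps in a scaffold sequence, assuming they are encoded as runs of N characters (the usual FASTA convention, not something stated here). The function names `find_gaps` and `split_contigs` are hypothetical and are not part of TGS-GapCloser; the tool itself works from long-read alignments, which this sketch does not attempt to reproduce.

```python
import re

# Assumption: gaps in a scaffold are written as runs of 'N' (or 'n').
GAP_PATTERN = re.compile(r"[Nn]+")


def find_gaps(scaffold, min_len=1):
    """Return half-open (start, end) intervals of N-runs of at least min_len bases."""
    return [
        (m.start(), m.end())
        for m in GAP_PATTERN.finditer(scaffold)
        if m.end() - m.start() >= min_len
    ]


def split_contigs(scaffold):
    """Return the contigs obtained by cutting the scaffold at every gap."""
    return [piece for piece in GAP_PATTERN.split(scaffold) if piece]


if __name__ == "__main__":
    example = "ACGTNNNNACGTNNAC"
    print(find_gaps(example))
    print(split_contigs(example))
```

Each interval returned by `find_gaps` is a candidate region that a gap closer would try to replace with sequence from a long read spanning both flanks.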

We demonstrate that TGS-GapCloser improves the contig N50 value of draft assemblies by 25-fold on average, updating over 90% of gaps with a 93.96% positive predictive value. Despite the high error rate of raw long reads, the improved assemblies achieve Q50 (99.999%) single-base accuracy, with only an 11.8% decrement relative to the inputs. The tool closes more gaps than mainstream gap-closing tools and is also ~29-fold faster. BUSCO analysis revealed that 3.4%-13.1% more expected genes were complete. TGS-GapCloser also shows its power to fill gaps in the ultra-large genome assembly of ginkgo (~12 Gb), with 71.6% of gaps closed. Inserted and merged gap sequences were validated with NGS reads and reference genomes, respectively.
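For readers less familiar with the metrics quoted above, the following minimal sketch computes a contig N50 and converts between Phred-scaled quality and per-base accuracy, using the standard definitions. The function names are illustrative only, and the sketch does not reproduce the evaluation pipeline used in this work.

```python
import math


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def phred_to_accuracy(q):
    """Per-base accuracy implied by a Phred quality value: 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy):
    """Phred quality value implied by a per-base accuracy below 1."""
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
    print(phred_to_accuracy(50))
```

Under this definition, Q50 corresponds to one expected error per 100,000 bases, which matches the 99.999% figure quoted above.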

The updated genome assemblies may improve gene annotation and structural variant calling, and thus the downstream analysis of ontogeny, phylogeny, and evolution.

TGS-GapCloser is available on GitHub: https://github.com/BGI-Qingdao/TGSGapFiller.

Authors: Mengyang Xu, Lidong Guo, Shengqiang Gu, Ou Wang, Rui Zhang, Guangyi Fan, Xun Xu, Li Deng, Xin Liu