Generating high-quality reference human genomes using PromethION nanopore sequencing

Miten from the University of California, Santa Cruz, opened the Assembly and Scaffolding breakout with a talk on a collaborative project to build a pipeline that produces reference-quality human genomes in 7 days using nanopore sequencing and Hi-C data. The aim of the project was to provide a framework that allows more high-quality reference genomes to be generated, by increasing sequencing speed, reducing cost, and keeping the pipeline scalable.

The first part of Miten’s talk focused on the nanopore sequencing process itself. The data for 11 genomes were produced in 9 days on the PromethION platform. Using the Short Read Eliminator Kit from Circulomics, they achieved 7-fold enrichment for reads >100 kb, with an average depth of coverage per genome of over 60X and an average read N50 of 42 kb. Basecalling with the latest basecaller (Flip-flop) and aligning against the GRCh38 reference genome gave a modal alignment identity of 93% and a median identity of 90%.
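For readers less familiar with the read metrics quoted above, the following is a minimal Python sketch of how a read N50 and a mean depth of coverage are commonly computed from a list of read lengths. It is illustrative only and is not the team's code; the function names `n50` and `mean_depth` and the toy numbers are assumptions made for this example.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    The N50 is the length L such that sequences of length >= L
    together contain at least half of all bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def mean_depth(lengths, genome_size):
    """Return the mean depth of coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(lengths) / genome_size


# Toy example (hypothetical read lengths in bases, not data from the talk).
toy_reads = [10, 20, 30, 40]
print(n50(toy_reads))             # 30: the reads of length 40 and 30 hold 70 of 100 bases
print(mean_depth(toy_reads, 50))  # 2.0
```

Because N50 is weighted by bases rather than by read count, a library enriched for very long reads can show a high N50 even when many short reads remain.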

The second part of Miten's talk focused on the assembly, polishing and scaffolding pipeline, which was run in the cloud. For assembly, the Shasta tool was used. Shasta is a new tool developed for de novo assembly of nanopore long reads that can be run on a single compute node. Miten showed that Shasta can produce complete human genomes in around 6 hours, with comparable contig NG50s and fewer misassemblies than Flye, Canu and wtdbg2, at a fraction of the time and cost. Miten then discussed two new tools for two-step polishing of the assemblies: MarginPolish, a graph-based assembly polisher, and HELEN, an RNN-based consensus sequence polisher. Miten showed data comparing the consensus accuracies after two-step polishing with MarginPolish and HELEN against the Racon (4x) and Medaka pipeline; the MarginPolish / HELEN approach came out on top in his tests. MarginPolish and HELEN were also quicker and cheaper to run. After assembly and polishing, the team added Hi-C long-range data to generate chromosome-level scaffolds.
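The contig NG50 used in the assembler comparison differs from N50 in one respect: the halfway point is taken against an assumed genome size rather than against the total assembly length, which makes assemblies of different total size comparable. A minimal Python sketch follows; the function name `ng50` and the toy contig lengths are assumptions for illustration and are not taken from the talk or from any of the tools mentioned.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly, or None if it is undefined.

    The NG50 is the length L such that contigs of length >= L
    together cover at least half of the (estimated) genome size.
    It is undefined when the assembly covers less than half the genome.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    half_genome = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_genome:
            return length
    return None


# Toy example (hypothetical contig lengths, not results from the talk).
toy_contigs = [50, 30, 20]
print(ng50(toy_contigs, 100))  # 50: the largest contig alone covers half of 100
print(ng50(toy_contigs, 200))  # 20: all three contigs are needed to reach 100
print(ng50(toy_contigs, 300))  # None: the assembly covers less than half of 300
```

Returning `None` for an assembly that never reaches half the genome size is a design choice of this sketch; other implementations may report zero or raise an error instead.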

Wrapping up his talk, Miten stated that with the PromethION and their analysis pipeline, the team has been able to generate long-read, high-quality, high-coverage genomes in less than 7 days, for less than $10k per genome. Miten also announced that all three tools (Shasta, MarginPolish and HELEN) have been publicly released today on GitHub.

Authors: Miten Jain