Miten Jain - Generating high-quality reference human genomes using PromethION nanopore sequencing
London Calling 2019
To catalogue and associate all forms of human genetic variation to health and disease, a new generation of genome sequencing and assembly technologies is required. However, current workflows for producing high-quality human genome assemblies have overall cost and production time bottlenecks that prohibit scaling to hundreds of individuals. We designed and evaluated an optimized PromethION-based workflow to produce near reference-quality genome assemblies for the offspring from ten parent-offspring trios. We demonstrate the production of long-read, high-quality, and high-coverage genomes with a less than one-week total turnaround time from sample extraction to complete assembly, and a total projected cost of less than $10k per genome. To lower costs and improve quality we have developed three new tools: 1) Shasta - a nanopore de novo long-read assembler that on a single compute node can produce complete human genomes in around 6 hours; 2) MarginPolish - a new graphical model-based assembly polisher that improves on earlier methods in both cost and accuracy; and 3) HELEN - an RNN-based multi-task learning model that further refines the base and run-length prediction for each genomic position and produces state-of-the-art results. We evaluate the performance based on assembly accuracy, throughput/timing, and cost and demonstrate improvements relative to current best-of-breed in all areas. Recognizing that even 100 kb reads are insufficient to scaffold through the most repetitive regions of the human genome, we augment this sequencing with a Hi-C long-range library to facilitate scaffolding and haplotype phasing.
Miten, from the University of California, Santa Cruz, kicked off the Assembly and Scaffolding breakout by describing a collaborative project to build a pipeline that produces reference-quality human genomes in 7 days using nanopore sequencing and Hi-C data. The aim of the project was to provide a framework for generating more high-quality reference genomes by increasing sequencing speed, reducing cost, and making the pipeline scalable.
The first part of Miten’s talk focused on the nanopore sequencing process itself. The data for 11 genomes were produced in 9 days on the PromethION platform. Using the Short Read Eliminator Kit from Circulomics, they achieved 7-fold enrichment for reads >100 kb, with an average depth of coverage per genome of over 60X and an average read N50 of 42 kb. Basecalling with the latest basecaller (Flip-flop) and aligning against the GRCh38 reference genome gave a modal alignment identity of 93% and a median identity of 90%.
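For readers unfamiliar with the metrics quoted above, the following is a minimal Python sketch of how read N50 and mean depth of coverage are conventionally computed. The function names, the toy read lengths, and the toy genome size are illustrative assumptions, not values or code from the project.

```python
def n50(lengths):
    """Return the N50 of a collection of lengths.

    N50 is the length L such that reads (or contigs) of length >= L
    together contain at least half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


def mean_depth(lengths, genome_size):
    """Return mean depth of coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(lengths) / genome_size


# Toy example (hypothetical numbers, in bases).
toy_reads = [2, 3, 4, 5, 6]
print(n50(toy_reads))                         # length at which half the bases are reached
print(mean_depth(toy_reads, genome_size=10))  # total bases divided by genome size
```

Because N50 is weighted by bases rather than by read count, a small number of very long reads can raise it substantially, which is why it is often reported alongside the fraction of reads above a length threshold.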
The second part of Miten's talk focused on the assembly, polishing, and scaffolding pipeline, which was run in the cloud. For assembly, the Shasta tool was used. Shasta is a new tool developed for nanopore de novo long-read assembly that can be run on a single compute node. Miten showed that Shasta can produce complete human genomes in around 6 hours, with comparable contig NG50s and fewer misassemblies than Flye, Canu, and Wtdbg2, at a fraction of the time and cost. Miten then discussed two new tools for two-step polishing of the assemblies: MarginPolish - a graphical model-based assembly polisher, and HELEN - an RNN-based consensus sequence polisher. Miten showed data comparing the consensus accuracies after two-step polishing with MarginPolish and HELEN against the Racon (4x) and Medaka pipeline; the MarginPolish / HELEN approach came out on top in his tests. MarginPolish and HELEN were also quicker and cheaper to run. Finally, after assembly and polishing, the team added Hi-C long-range data to generate chromosome-level scaffolds.
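The abstract describes HELEN as refining the base and run-length prediction at each genomic position. As a rough illustration of what a run-length representation of a sequence looks like, the Python sketch below collapses homopolymer runs into a string of bases plus a list of run lengths. It is a generic example with hypothetical function names, not code taken from Shasta, MarginPolish, or HELEN.

```python
from itertools import groupby


def run_length_encode(sequence):
    """Collapse consecutive identical bases into (bases, run_lengths)."""
    bases = []
    runs = []
    for base, group in groupby(sequence):
        bases.append(base)
        runs.append(sum(1 for _ in group))
    return "".join(bases), runs


def run_length_decode(bases, runs):
    """Expand (bases, run_lengths) back into the original sequence."""
    if len(bases) != len(runs):
        raise ValueError("bases and runs must have the same length")
    return "".join(base * count for base, count in zip(bases, runs))


# Toy example: the homopolymer runs "AAA" and "TT" are collapsed.
bases, runs = run_length_encode("GAAATTC")
print(bases, runs)
print(run_length_decode(bases, runs))
```

In a representation like this, the identity of each base and the length of its run are stored separately, so the two can be predicted or corrected independently.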
Wrapping up his talk, Miten stated that with the PromethION and their analysis pipeline, they have been able to generate long-read, high-quality, and high-coverage genomes in less than 7 days, for less than $10k per genome. Miten also announced that all three tools have been publicly released today on GitHub.