NCM 2021: Accuracy improvements in crop genome assembly using the Q20+ chemistry
Alexander explained that KeyGene are developing a comprehensive computational toolset to perform crop genome analysis ‘on an unprecedented scale’, including de novo assembly, variant detection, and data visualisation. Alexander laid out the four key aspects of reference genomes (correctness, completeness, contiguity, and cost) and explained the benefits of using nanopore sequencing technology for producing them. At KeyGene, the team have been evaluating the performance of the new Q20+ (‘Kit12’) chemistry and the R10 pore series, using samples from different plant species. They have also been investigating the impact of plant-trained basecalling models. Alexander stated that they had obtained significant improvements in raw read accuracy by using the new Q20+ chemistry and applying basecalling models trained on plant sequence data. For example, for maize whole-genome sequencing data, they obtained a 2.5 percentage point increase in raw read accuracy (from 96.9% to 99.4%).
Alexander presented lettuce genome sequence data (192 Gb data yield; library prepared with the Q20+ Ligation Sequencing Kit and sequenced on R10.3 PromethION Flow Cells). The data were basecalled using a Guppy Q20 model trained on plant data, and assembly was performed with both Flye v2.9 and the KeyGene STL assembler. Alexander pointed out that their lettuce genome assembly, presented at London Calling in 2018, had been ‘already impressive’ compared to the published short-read-based reference. Here, compared to that 2018 genome, and compared to the lettuce genome assembled using data from an alternative long-read sequencing technology, ‘the KeyGene STL assembler using the Q20 data from Oxford Nanopore shows the best assembly’ in terms of contiguity and accuracy. Alexander added that, with the STL assembler, they could ‘finish the assembly within 30 hours’.
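To put the maize raw read accuracy figures quoted above in context, they can be converted to Phred-scaled quality scores, where Q20 corresponds to 99% accuracy. The following is a minimal sketch using the standard Phred definition; the function name is illustrative and is not part of any KeyGene or Oxford Nanopore tool.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy (between 0 and 1) to a Phred quality score."""
    error_rate = 1.0 - accuracy
    if not 0.0 < error_rate < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    return -10.0 * math.log10(error_rate)


# Raw read accuracies reported in the talk for maize whole-genome data.
accuracy_before = 0.969
accuracy_after = 0.994

q_before = accuracy_to_phred(accuracy_before)
q_after = accuracy_to_phred(accuracy_after)

# Ratio of the per-base error rates before and after the change.
error_fold_reduction = (1.0 - accuracy_before) / (1.0 - accuracy_after)

print(f"before: Q{q_before:.1f}, after: Q{q_after:.1f}")
print(f"error rate reduced about {error_fold_reduction:.1f}-fold")
```

Expressed this way, the 2.5 percentage point gain moves the reads from roughly Q15 to roughly Q22, which corresponds to about a five-fold reduction in the per-base error rate.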
Alexander then moved on to the melon genome (library prepared with Q20+ chemistry and sequenced on R10.3 and R10.4 PromethION Flow Cells); the data were again assembled with Flye and the STL assembler. Duplex data were also generated for the R10.4 run (16 Gb of the 169 Gb R10.4 run yield). Compared to the melon genome assembly produced in parallel via the alternative long-read technology, there were significantly fewer contigs in the Oxford Nanopore-based assembly. When considering the R10.4 data, basecalled with a model trained on plant data and assembled with Flye v2.9, the quality score ‘became on par’ with that of the same genome assembled from data obtained using the alternative technology. Lastly, Alexander discussed their analysis of the duplex data alone (which had comprised ~10% of the R10.4 run yield): comparing ~35x genome coverage of duplex data with 35x depth data from the alternative technology, the ‘consensus accuracies of the ONT data [are] actually significantly higher’. ‘We think that this is a breakthrough in the technology’. Alexander concluded that ‘the duplex reads are really amazing’.
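The duplex figures for the melon R10.4 run can be cross-checked with simple arithmetic. The sketch below assumes that 1 Gb means 10^9 bases and that all 16 Gb of duplex data contributed to the ~35x coverage; the summary does not state either point explicitly, so the implied genome size is only a rough consistency check and not a reported result.

```python
# Yields reported in the talk for the melon R10.4 run, in gigabases.
duplex_yield_gb = 16.0
run_yield_gb = 169.0

# Approximate duplex coverage quoted in the talk.
duplex_depth = 35.0

# Share of the run yield that was duplex data.
duplex_fraction = duplex_yield_gb / run_yield_gb

# Assumption: depth = yield / genome size, with all duplex bases counted.
# Rearranged to give the genome size implied by the two reported numbers.
implied_genome_size_mb = duplex_yield_gb * 1000.0 / duplex_depth

print(f"duplex share of run yield: {duplex_fraction:.1%}")
print(f"implied genome size: about {implied_genome_size_mb:.0f} Mb")
```

The duplex share comes out at about 9.5%, in line with the ‘~10%’ quoted in the talk.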