Accurate variant calling and de novo assembly with nanopore reads
Kishwar (University of California, Santa Cruz, USA) outlined his talk, focusing on two key topics: the haplotype-aware small variant calling pipeline PEPPER-Margin-DeepVariant and updates to the Shasta de novo genome assembler.
Haplotype-aware small variant calling
Kishwar provided an overview of the general variant calling pipeline: first, primary sample collection and DNA sequencing; second, data analysis, in which reads are aligned to a known reference and the differences between the sample and the reference are derived; and third, use of the resulting variant set to inform clinical decisions, ancestry analysis, and studies of population differences.
Kishwar introduced the precisionFDA Truth Challenge V2, held in 2020, which aimed to assess and benchmark variant calling in difficult-to-map regions across three sequencing platforms. He explained that his team’s submissions for variant calling in Oxford Nanopore sequence data using PEPPER-DeepVariant won two awards: best performance in all benchmark regions, and best performance in difficult-to-map regions.
In the haplotype-aware PEPPER-Margin-DeepVariant pipeline, SNP calling is performed in two steps: first, SNP-based haplotyping, and second, haplotype-aware variant calling. The pipeline outputs a VCF/gVCF file. Kishwar pointed to the pre-print describing this pipeline, now available on bioRxiv (doi: https://doi.org/10.1101/2021.03.04.433952), and noted that the work is a collaboration between Google Health and the Santa Cruz Genomics Institute.
Benchmarked against existing variant callers, PEPPER-Margin-DeepVariant showed higher accuracy in both SNP and INDEL calling in all benchmarking regions. SNP calling in nanopore sequence data also achieved a higher F1 score than SNP calling in short-read sequencing data using DeepVariant. Kishwar explained that the longer reads produced by nanopore sequencing can be aligned with higher confidence: ‘This is the first time we have seen Oxford Nanopore outperform short-read based variant calling in identifying single nucleotide polymorphisms’.
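For readers less familiar with the metric, the F1 score used in these comparisons is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation; the function name and the counts in the usage line are illustrative assumptions, not figures from the talk.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall for a set of variant calls."""
    if true_pos == 0:
        # No correct calls: define F1 as 0 rather than dividing by zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)  # fraction of calls that are real
    recall = true_pos / (true_pos + false_neg)     # fraction of real variants found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
example_f1 = f1_score(true_pos=90, false_pos=10, false_neg=10)
```

Because F1 penalises both missed variants and false calls, a caller cannot score well by being only sensitive or only conservative.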
In difficult-to-map regions, including segmental duplications and low-mappability regions, long nanopore reads had better mapping rates, providing better resolution and outperforming short-read variant calling. Nanopore reads could also be phased more effectively: using Margin to phase variants, Kishwar showed that phase blocks were much larger than those derived from alternative long-read data.
Kishwar explained that the team are also improving their methods in three ways. First, they are looking into ALT alignment in DeepVariant to improve genotyping of the alternative allele, and have found that this improves both SNP and INDEL calling performance. Second, they have looked at stratified variant calls, assessing performance in non-repeat and non-homopolymer regions (comprising 87.12% of the genome, as opposed to the 91.03% of the genome comprising all benchmarked regions), and observed significantly higher variant calling performance. Third, they have demonstrated improved variant calling with the recent Oxford Nanopore R10.3 Q20 chemistry. This data type ‘produces extremely high-quality reads’, with a median read identity of 0.98 for R10.3 Q20 data basecalled with Bonito. Training PEPPER-DeepVariant with this data (R10.3 Q20, Bonito, 60X HG002 Chr20), they found significant improvement in SNP identification and INDEL calling, particularly in stratified regions (non-homopolymer and non-repeat regions), where the INDEL calling F1 score reached 0.987.
Shasta de novo assembler
Kishwar next focused on his team’s Shasta v0.7.0 update, with improved contiguity and accuracy. Updates have included: experimental iterative assembly for partial haplotype phasing and improved resolution of segmental duplications; adaptive selection of alignment criteria; and improved detangling.
Kishwar explained how adaptive selection of alignment criteria, as opposed to preconfigured parameters for a specific data type and coverage, has contributed to a more accurate and highly contiguous genome assembly. This adaptive selection is on by default in Shasta v0.7.0.
Kishwar then discussed how iterative assembly was found to significantly improve contiguity in the CHM13 human genome assembly, increasing NG50 from 65.3 Mb (R9.4.1, Guppy v3.4.5 basecaller, Shasta v0.4.0) to 90.6 Mb (R9.4.1, Bonito v0.3.1 basecaller, Shasta v0.7.0). When the updated assembly was aligned back to CHM13 version 1, 36 of the chromosome arms were assembled in one piece. They also reassembled the HG002 R10.3 Q20 Bonito basecalled data using Shasta, first having trio-binned the HG002 data to derive maternal and paternal reads; after assembly with Shasta v0.7.0, both the maternal and paternal assemblies had estimated Q scores of >39, without polishing.
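The two quality measures quoted above can be made concrete with a short sketch. NG50 is the contig length at which contigs of that length or longer cover at least half of the estimated genome size, and a Phred-scaled Q score is -10·log10 of the per-base error rate. The function names and the toy contig lengths below are illustrative assumptions, not Shasta output.

```python
import math


def ng50(contig_lengths, genome_size):
    """Length of the contig at which the running total of contigs,
    taken from longest to shortest, first reaches half the genome size."""
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # the assembly covers less than half of the genome


def phred_q(error_rate):
    """Phred-scaled quality for a per-base error rate (0 < error_rate <= 1)."""
    return -10 * math.log10(error_rate)


# Toy example: four contigs against a genome of 120 units.
toy_ng50 = ng50([50, 30, 20, 10], genome_size=120)
```

Under this definition, a Q score above 39 corresponds to fewer than roughly one error per 8,000 bases.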
Centromere assembly with Shasta
Kishwar introduced ‘one of the most exciting works that’s ongoing’: centromere assembly using Shasta. To demonstrate that centromeres can be assembled with Shasta, all reads of length ≥50 kbp from the CHM13 Bonito basecalled data were aligned to the CHM13 v1.0 assembly using Winnowmap. Reads were then filtered based on mapping quality and region coverage, resulting in 540 Mb of sequence, corresponding to 63X depth of coverage for the centromere ‘test’ region on chromosome 8. When the assembly was aligned back to CHM13 v1.0 centromere 8, the Shasta assembly covered the whole centromere with an average sequence identity of 99.93%, demonstrating a high-quality reconstruction of the centromere sequence.
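The read selection described above can be sketched as a simple filter followed by a depth calculation. This is an illustration only: the `Read` type, the function names, and the mapping-quality cutoff are hypothetical (the talk states the 50 kbp length cutoff but gives no mapping-quality value), and the example reads are invented.

```python
from typing import NamedTuple


class Read(NamedTuple):
    length: int  # read length in bases
    mapq: int    # mapping quality reported by the aligner


MIN_LENGTH = 50_000  # length cutoff stated in the talk (>= 50 kbp)
MIN_MAPQ = 20        # hypothetical cutoff; the talk does not give a value


def filter_reads(reads):
    """Keep reads that pass both the length and mapping-quality cutoffs."""
    return [r for r in reads if r.length >= MIN_LENGTH and r.mapq >= MIN_MAPQ]


def depth_of_coverage(total_bases, region_length):
    """Average depth: total sequenced bases divided by region length."""
    return total_bases / region_length


# Invented reads, for illustration only.
reads = [Read(80_000, 60), Read(40_000, 60), Read(120_000, 5), Read(50_000, 20)]
kept = filter_reads(reads)
total_bases = sum(r.length for r in kept)
```

By the same relation, and assuming depth is taken as total bases divided by region length, 540 Mb of sequence at 63X would correspond to a region of roughly 8.6 Mb.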
Diploid polishing of de novo assemblies
The PEPPER-Margin-DeepVariant pipeline can also polish de novo assemblies in a diploid manner, polishing each haplotype independently using small variants. With this approach, most of the SNP and INDEL variants identified in a collapsed assembly could be recalled in a diploid manner, along with an overall improvement in genome quality. The pipeline cannot yet handle structural variants, because DeepVariant calls only small variants. To address this, the team are looking at incorporating the Flye polisher into their PEPPER-Margin pipeline, in collaboration with Mikhail Kolmogorov. Kishwar shared how, with this approach, they could recall most of the heterozygous structural variants in the assembly (HG002, 40X R9.4.1, Guppy 3.6.0 basecalled).
Kishwar stated that their aim is to merge their two pipelines together in the near future: iterative assembly with Shasta, and haplotype-aware polishing that can restore both small and structural variants.
Another project in the near future is integrating Guppy methylation calls into their PEPPER-Margin-DeepVariant small variant calling workflow.
Kishwar encouraged the audience to check out all these tools on GitHub, and highlighted that case studies are available covering every step: installation, data download, running the commands, and benchmarking the output. Everything is available for users to go and test the pipeline!