Blog: cuteSV - a powerful tool to uncover the full spectrum of genomic structural variants
Mon 19th October 2020
In this blog, Tao Jiang describes his work on the development of the structural variant (SV) caller cuteSV. Read on to find out how the cuteSV bioinformatics pipeline works, how it performs relative to other SV callers, and what Tao’s plans are to further develop his analysis tool.
Structural variations (SVs) are genomic rearrangements such as deletions, insertions, inversions, duplications, and translocations, and they are closely linked to human disease, evolution, gene regulation, and other phenotypes [1]. Recently, long-read sequencing technologies, such as those from Oxford Nanopore Technologies (ONT), have offered huge potential for genomic SV discovery, with ever-increasing resolution, often revealing twice as many SVs as short-read sequencing efforts [2,3]. However, lower accuracy, imperfect sensitivity, and the cost of sequencing (most SV callers still depend heavily on deep-coverage sequencing data) mean there is still a long way to go before SV detection is adopted across the wide range of related fields.
In our recent publication [4], we describe a new SV caller called cuteSV, which was used to resolve SVs in the Genome in a Bottle (GIAB) HG002 human reference sample, sequenced with the ONT PromethION platform. CuteSV is a rapid approach that achieves outstanding performance in terms of both accuracy and sensitivity. It also shows a significant advantage in identifying genomic SVs at very low sequencing depth (see Figure 2 below). These promising results suggest that the field of SV discovery is entering an era of high performance and low cost.
Overview of cuteSV
CuteSV is a versatile, read-alignment-based SV detection workflow containing three major steps (Figure 1):
- Discovering SV signatures: cuteSV uses multiple signature extraction methods to comprehensively collect the signatures of various types of SVs from inter- and intra-alignments. Furthermore, the insertions and deletions are heuristically combined to recover real SVs from fragile alignments.
- Clustering of SV signatures: cuteSV uses a specifically designed clustering-and-refinement approach to cluster the chimerically aligned reads in local regions, and further refines these clusters to precisely identify the SV signatures from heterozygous SVs.
- SV calling and genotyping: cuteSV uses several tailored rules to perform SV calling and genotyping, based on the refined clusters of SV signatures. Moreover, it can generate several genotype likelihoods and quality scores for further quality control and generation of higher accuracy callsets.
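The clustering step above can be illustrated with a deliberately simplified sketch. This is not cuteSV's actual clustering-and-refinement algorithm: it is a generic single-linkage grouping of signatures by position, and the `Signature` type, the `max_gap` and `min_support` parameters, and their default values are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Signature(NamedTuple):
    """One SV signature taken from a single read (hypothetical structure)."""
    chrom: str
    pos: int      # reference position of the signature
    length: int   # size of the event in bp
    svtype: str   # e.g. "INS" or "DEL"
    read: str     # name of the supporting read


def _emit(cluster, min_support):
    """Turn one cluster into a call if enough distinct reads support it."""
    reads = {s.read for s in cluster}
    if len(reads) < min_support:
        return []
    positions = sorted(s.pos for s in cluster)
    lengths = sorted(s.length for s in cluster)
    return [{
        "chrom": cluster[0].chrom,
        "svtype": cluster[0].svtype,
        "pos": positions[len(positions) // 2],    # middle element as summary
        "length": lengths[len(lengths) // 2],
        "support": len(reads),
    }]


def cluster_signatures(signatures, max_gap=200, min_support=3):
    """Group signatures of the same type that lie close together.

    Assumed illustrative rule: a signature joins the current cluster when it
    starts within `max_gap` bp of the previous signature in that cluster.
    """
    groups = defaultdict(list)
    for sig in signatures:
        groups[(sig.chrom, sig.svtype)].append(sig)

    calls = []
    for _, sigs in sorted(groups.items()):
        sigs.sort(key=lambda s: s.pos)
        cluster = [sigs[0]]
        for sig in sigs[1:]:
            if sig.pos - cluster[-1].pos <= max_gap:
                cluster.append(sig)
            else:
                calls.extend(_emit(cluster, min_support))
                cluster = [sig]
        calls.extend(_emit(cluster, min_support))
    return calls
```

Given three insertion signatures from different reads near the same position and one isolated deletion signature, this sketch would report a single insertion call and discard the deletion for lack of support. A real caller additionally has to refine clusters by SV length, which is what allows heterozygous SVs to be separated from noise.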
We benchmarked the performance of cuteSV on the latest published ONT PromethION dataset of the HG002 genomic sample [5] (mean read length: 17,335 bp, coverage: 47×, available at: ftp://ftp.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/UCSC_Ultralong_OxfordNanopore_Promethion/). Sniffles [6], PBSV, and SVIM [7] were used for comparison. CuteSV demonstrates three main benefits: (1) it detects more SVs than other state-of-the-art SV callers (Figure 2A-2C), and in particular it has higher sensitivity on low-coverage datasets without a decrease in accuracy; (2) it performs strongly in genotype calling, finding more heterozygous and homozygous SVs; (3) it is faster than, or has comparable runtime to, state-of-the-art approaches, while requiring less memory (Figure 2H). More importantly, it scales very well: its speed increases almost linearly with the number of CPU threads (Figure 2G).
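To make the idea of genotype calling from read support more concrete, here is a minimal sketch of a generic binomial genotype-likelihood model. It is an assumption for illustration rather than cuteSV's published genotyping rules; the function name, the `error_rate` parameter, and its default value are hypothetical.

```python
import math


def genotype_likelihoods(alt_reads, ref_reads, error_rate=0.05):
    """Return the most likely genotype and log10 likelihoods per genotype.

    Assumed model: each read spanning the site supports the SV with
    probability `error_rate` (0/0), 0.5 (0/1), or 1 - `error_rate` (1/1).
    The binomial coefficient is omitted because it is identical for all
    three genotypes and does not change their ranking.
    """
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    log_lik = {
        gt: alt_reads * math.log10(p) + ref_reads * math.log10(1.0 - p)
        for gt, p in p_alt.items()
    }
    best = max(log_lik, key=log_lik.get)
    return best, log_lik


# Example: 10 reads support the SV and 10 support the reference allele.
genotype, scores = genotype_likelihoods(alt_reads=10, ref_reads=10)
```

With balanced support the heterozygous genotype has the highest likelihood, while a site where nearly all reads support the SV would be called homozygous. The difference between the best and second-best likelihoods is one simple way to derive a quality score for filtering.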
With the ultra-long reads produced by nanopore sequencing technology, it is now possible to resolve large structural variants. An example is shown in Figure 3: here, a 6481-bp insertion (breakpoint at chr1:9683994) was detected only in the ONT reads. A possible explanation is that the reads from the alternative long-read technology (mean read length: 7938 bp) were shorter in this region, so the aligners could not align them across such a large insertion, whereas a large proportion of the ONT reads carried clear insertion signals in their CIGARs.
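As a rough illustration of what an insertion signal in a CIGAR looks like, the sketch below walks a CIGAR string and reports large insertions and deletions with their reference positions. It is a simplified stand-in for cuteSV's signature extraction, not its implementation; the function name, the `min_size` threshold, and the example CIGAR are made up for illustration, and `ref_start` is assumed to be the 0-based reference position where the alignment begins.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def indel_signatures(ref_start, cigar, min_size=30):
    """Collect (type, reference position, length) for large I/D operations.

    `min_size` is an assumed illustrative cut-off that skips the small
    indels typical of sequencing noise.
    """
    signatures = []
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I" and length >= min_size:
            signatures.append(("INS", ref_pos, length))
        elif op == "D" and length >= min_size:
            signatures.append(("DEL", ref_pos, length))
        if op in REF_CONSUMING:
            ref_pos += length
    return signatures


# Hypothetical alignment: a large insertion after 100 matched bases,
# a 5-bp deletion below the threshold, and a 60-bp deletion.
found = indel_signatures(1000, "100M6481I50M5D40M60D10M")
# found == [("INS", 1100, 6481), ("DEL", 1195, 60)]
```

A read that is too short to span the inserted sequence plus flanking anchors cannot produce such an `I` operation at all, which is consistent with the explanation given for Figure 3.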
We have created a GitHub repository that documents each step needed to reproduce the benchmark (https://github.com/tjiangHIT/sv-benchmark). Please browse the repository for more details.
Although the mappability of long reads is much higher than that of short reads, their alignments are still heterogeneous and can contain errors caused by sequencing errors, SV complexity, repetitive sequences, and other factors. De novo assembly-based approaches are nearly free of such alignment artifacts and provide the opportunity to unravel the haplotype configuration of SVs. However, they have bottlenecks of their own, such as assembly mistakes, high computational cost, and the need for multiple types of data. Given these advantages and shortcomings, we consider alignment- and assembly-based approaches to be complementary. Integrating the two could produce SV callsets of higher quality.
With the growth of diverse large-scale population genomics projects, the demand for SV resolution in cohorts has steadily increased, driven by the need for comprehensive and precise analysis of population genetic structures. Such tasks are still challenging for cuteSV, since the current version only supports resolving SVs in a single individual’s genome. To meet this growing demand, I am developing an additional module for force calling SVs based on a specific population background. In addition, our team is working to establish a novel long-read-based variant calling technique that detects SNPs, indels, and SVs simultaneously, with both sensitivity and efficiency.
1. Sedlazeck FJ, Lee H, Darby CA, Schatz MC: Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat Rev Genet 2018, 19:329-346.
2. Ho SS, Urban AE, Mills RE: Structural variation in the sequencing era. Nat Rev Genet 2019.
3. Mahmoud M, Gobet N, Cruz-Davalos DI, Mounier N, Dessimoz C, Sedlazeck FJ: Structural variant calling: the long and the short of it. Genome Biol 2019, 20:246.
4. Jiang T, Liu Y, Jiang Y, Li J, Gao Y, Cui Z, Liu Y, Liu B, Wang Y: Long-read-based human genomic structural variation detection with cuteSV. Genome Biol 2020, 21:189.
5. Zook JM, Hansen NF, Olson ND, Chapman L, Mullikin JC, Xiao C, Sherry S, Koren S, Phillippy AM, Boutros PC, et al: A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol 2020.
6. Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, Schatz MC: Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 2018, 15:461-468.
7. Heller D, Vingron M: SVIM: Structural Variant Identification using Mapped Long Reads. Bioinformatics 2019.