Blog: cuteSV - a powerful tool to uncover the full spectrum of genomic structural variants

In this blog, Tao Jiang describes his work on the development of the structural variant (SV) caller cuteSV. Read on to find out how the cuteSV bioinformatics pipeline works, how it performs relative to other SV callers, and what Tao’s plans are to further develop his analysis tool.

Tao Jiang

I am a lecturer at the center for bioinformatics at Harbin Institute of Technology in China. My research focuses on developing methods for detecting the full spectrum of genomic structural variations and integrating these genetic resources into relevant, cutting-edge research. With the aid of long-read sequencing technologies, I have developed several tools for characterizing structural variation in the human genome: 1) rMETL (PMID: 30759188) – uncovering mobile element insertions in complicated genomic loci or irregular compound structures, 2) rCANID (PMID: 30946672) – resolving novel sequence insertions through a light-weight clustering and local assembly method, 3) rMFilter (PMID: 28482046) – a raw reads preprocessing module which accelerates a structural variation analysis pipeline. I recently participated in the 100,000 Genomes project in China. I designed the overall technical route, including sample whole-genome sequencing and alignment, cohort-based joint-variant calling, large-scale functional annotation, and population genetic analysis.

Structural variations (SVs) are genomic rearrangements such as deletions, insertions, inversions, duplications, and translocations, and they are closely related to human diseases, evolution, gene regulation, and other phenotypes [1]. Recently, long-read sequencing technologies, like Oxford Nanopore Technologies (ONT), have offered huge potential in genomic SV discovery, with ever-increasing resolution, often revealing twice as many SVs as short-read sequencing efforts [2,3]. However, lower accuracy, imperfect sensitivity, and the cost of sequencing (most current SV callers still depend heavily on deep-coverage sequencing data) mean there is still a long way to go before SV detection is widely adopted in related fields.

In our recent publication [4], we describe a new SV caller called cuteSV, which was used to resolve SVs in the Genome in a Bottle (GIAB) HG002 human reference sample, sequenced with the ONT PromethION platform. CuteSV is a rapid approach that achieves outstanding performance in terms of both accuracy and sensitivity. It also shows a significant advantage in identifying genomic SVs at very low sequencing depth (see Figure 2 below). These promising results suggest that the field of SV discovery is entering an era of high performance and low cost.

Overview of cuteSV

CuteSV is a versatile, read-alignment-based SV detection workflow containing three major steps (Figure 1):

  • Discovering SV signatures: cuteSV uses multiple signature extraction methods to comprehensively collect the signatures of various types of SVs from inter- and intra-alignments. Furthermore, the insertions and deletions are heuristically combined to recover real SVs from fragile alignments.
  • Clustering of SV signatures: cuteSV uses a specifically designed clustering-and-refinement approach to cluster the chimerically aligned reads in local regions, and further refines these clusters to precisely identify the SV signatures from heterozygous SVs.
  • SV calling and genotyping: cuteSV uses several tailored rules to perform SV calling and genotyping, based on the refined clusters of SV signatures. Moreover, it can generate several genotype likelihoods and quality scores for further quality control and generation of higher accuracy callsets.
Figure 1: Schematic illustration of the cuteSV approach.
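The three steps above can be illustrated with a deliberately simplified Python sketch. It is not cuteSV's code and does not reproduce its heuristics: the function names, the 30 bp minimum size, the 200 bp clustering gap, and the allele-fraction cutoffs are all hypothetical values chosen for illustration. It only considers insertions and deletions encoded inside a single alignment's CIGAR string, whereas cuteSV also uses signatures from split (inter-alignment) reads.

```python
import re
from collections import namedtuple

# One SV signature observed in one read (all names here are hypothetical).
Signature = namedtuple("Signature", "read svtype pos length")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_signatures(read, ref_start, cigar, min_size=30):
    """Step 1: collect insertions/deletions of at least min_size bp."""
    pos, found = ref_start, []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I" and length >= min_size:
            found.append(Signature(read, "INS", pos, length))
        elif op == "D" and length >= min_size:
            found.append(Signature(read, "DEL", pos, length))
        if op in "MDN=X":  # operations that consume reference bases
            pos += length
    return found

def cluster(signatures, max_gap=200):
    """Step 2: group same-type signatures lying within max_gap bp of the previous one."""
    clusters = []
    for sig in sorted(signatures, key=lambda s: (s.svtype, s.pos)):
        last = clusters[-1] if clusters else None
        if last and last[-1].svtype == sig.svtype and sig.pos - last[-1].pos <= max_gap:
            last.append(sig)
        else:
            clusters.append([sig])
    return clusters

def genotype(n_support, n_total, het_min=0.2, hom_min=0.8):
    """Step 3: naive genotype from the fraction of reads supporting the SV."""
    frac = n_support / n_total if n_total else 0.0
    if frac >= hom_min:
        return "1/1"
    if frac >= het_min:
        return "0/1"
    return "0/0"

# Toy alignments: (read name, reference start, CIGAR). Invented for illustration.
reads = [
    ("r1", 1000, "500M60I400M"),
    ("r2", 1100, "410M58I300M"),
    ("r3", 900, "700M"),
    ("r4", 1200, "300M5D200M"),  # 5 bp deletion is below min_size and ignored
]
signatures = [s for name, start, cig in reads for s in cigar_signatures(name, start, cig)]
for group in cluster(signatures):
    print(group[0].svtype, group[0].pos, len(group), genotype(len(group), len(reads)))
```

On the toy data, the two reads carrying a roughly 60 bp insertion near position 1500 fall into one cluster, and with two of four reads supporting it the naive rule reports a heterozygous call. A real caller would additionally refine clusters by SV length and compute genotype likelihoods rather than apply fixed cutoffs.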

We benchmarked the performance of cuteSV on the latest published ONT PromethION dataset of the HG002 genomic sample [5] (mean read length: 17,335 bp, coverage: 47×, available at: ftp://ftp.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/UCSC_Ultralong_OxfordNanopore_Promethion/). Sniffles [6], PBSV, and SVIM [7] were used for comparison. CuteSV demonstrates three main beneficial features: 1) cuteSV detects more SVs than other state-of-the-art SV callers (Figure 2A-2C). In particular, it has a higher sensitivity on low-coverage datasets, without a decrease in accuracy. 2) cuteSV performs strongly in genotype calling, and finds more heterozygous and homozygous SVs. 3) cuteSV is faster than, or has comparable runtime to, state-of-the-art approaches, while requiring less memory (Figure 2H). More importantly, it scales well: its speed increases almost linearly with the number of CPU threads (Figure 2G).

Figure 2: Benchmarking results for SV callers, on the HG002 genome sample, sequenced with the PromethION. Comparison of (A) F1 scores, (B) precision, (C) recall, (D) F1 scores with genotyping, (E) precision with genotyping, and (F) recall with genotyping, on the ONT PromethION data, with various depths of coverage. The (G) runtimes and (H) memory use with different numbers of CPU threads on the 47× ONT PromethION data. “GT” and “Skip GT” indicate runs with and without genotyping, respectively. SVIM is benchmarked with a single CPU thread only, since it does not support multithreading. PBSV is not included, since it crashed on the 47× dataset.
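For readers less familiar with the metrics in panels A to F, the following short Python sketch shows how precision, recall, and F1 are derived from counts of true-positive, false-positive, and false-negative calls. The function name and the example counts are invented for illustration and are not taken from the benchmark.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from call counts, guarding against empty sets."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical counts: 90 calls match the truth set, 10 do not, 30 truth SVs are missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a caller cannot score well by maximizing one at the expense of the other, which is why it is used as the summary metric in panels A and D.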

With the ultra-long reads produced by nanopore sequencing, it is now possible to resolve large structural variants. An example is shown in Figure 3: a 6481-bp insertion (breakpoint at chr1:9683994) was detected only in the ONT reads. A possible reason is that the reads from the alternative long-read technology (mean read length: 7938 bp) were shorter in this region, so the aligners could not align them across such a large insertion, whereas a large proportion of the ONT reads carried clear insertion signals in their CIGARs.

We have created a GitHub repository that provides each step on how to reproduce the benchmark (https://github.com/tjiangHIT/sv-benchmark). Please browse the repository for more details.

Figure 3: An example of an insertion only detected with ONT PromethION data. (A) The Integrated Genomics Viewer (IGV) snapshot of the alternative long-read technology read alignments. (B) The IGV snapshot of the ONT read alignments. With superior read lengths, 20 reads were aligned with a >6000 bp insertion in their CIGARs. CuteSV captured these signatures successfully, and identified a 6192 bp insertion at chr1:9684036, agreeing with the ground truth.
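To make "agreeing with the ground truth" concrete, here is a small Python sketch of one common way to compare a call with a truth record: require the same SV type, a breakpoint within some distance, and a similar size. It assumes the 6481-bp insertion at chr1:9683994 mentioned in the text is the truth-set record. The function name and the thresholds (1000 bp distance, 0.7 size ratio) are hypothetical and are not necessarily the settings used in the benchmark.

```python
def matches_truth(call, truth, max_dist=1000, min_size_ratio=0.7):
    """Return True if a call and a truth SV agree in type, position, and size.

    Each SV is a (chrom, pos, svtype, length) tuple. Thresholds are
    illustrative assumptions, not the benchmark's actual parameters.
    """
    c_chrom, c_pos, c_type, c_len = call
    t_chrom, t_pos, t_type, t_len = truth
    if c_chrom != t_chrom or c_type != t_type:
        return False
    if abs(c_pos - t_pos) > max_dist:
        return False
    size_ratio = min(c_len, t_len) / max(c_len, t_len)
    return size_ratio >= min_size_ratio

call = ("chr1", 9684036, "INS", 6192)   # cuteSV call from Figure 3
truth = ("chr1", 9683994, "INS", 6481)  # insertion described in the text
print(abs(call[1] - truth[1]))          # breakpoint distance in bp
print(round(call[3] / truth[3], 3))     # size ratio
print(matches_truth(call, truth))
```

For this example the breakpoints are 42 bp apart and the called size is about 96% of the expected size, so the call would count as a match under these illustrative thresholds.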

Future work

Although the mappability of long reads is much higher than that of short reads, their alignments are still heterogeneous and can contain errors caused by sequencing errors, SV complexity, repetitive sequences, etc. De novo assembly-based approaches are nearly free of such alignment artifacts and provide the opportunity to unravel the haplotype configuration of SVs. However, they have bottlenecks of their own, such as assembly mistakes, high computational cost, and the need for multiple types of data. Given these advantages and shortcomings, we consider alignment- and assembly-based approaches to be complementary. It could be useful to integrate the two to produce higher-quality SV callsets.

With the development of diverse large-scale population genomics projects, the demand for resolving SVs in cohorts has steadily increased, to support comprehensive and precise analysis of population genetic structures. Such tasks are still challenging for cuteSV, since the current version only supports resolving SVs in a single individual’s genome. To meet this growing demand, I am developing an additional module for force calling SVs based on a specific population background. In addition, our team is working to establish a novel long-read-based variant calling technique that detects SNPs, indels, and SVs simultaneously, with high sensitivity and efficiency.

Tao Jiang with his team

References

  1. Sedlazeck FJ, Lee H, Darby CA, Schatz MC: Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat Rev Genet 2018, 19:329-346.
  2. Ho SS, Urban AE, Mills RE: Structural variation in the sequencing era. Nat Rev Genet 2019.
  3. Mahmoud M, Gobet N, Cruz-Davalos DI, Mounier N, Dessimoz C, Sedlazeck FJ: Structural variant calling: the long and the short of it. Genome Biol 2019, 20:246.
  4. Jiang T, Liu Y, Jiang Y, Li J, Gao Y, Cui Z, Liu Y, Liu B, Wang Y: Long-read-based human genomic structural variation detection with cuteSV. Genome Biol 2020, 21:189.
  5. Zook JM, Hansen NF, Olson ND, Chapman L, Mullikin JC, Xiao C, Sherry S, Koren S, Phillippy AM, Boutros PC, et al: A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol 2020.
  6. Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, Schatz MC: Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 2018, 15:461-468.
  7. Heller D, Vingron M: SVIM: Structural Variant Identification using Mapped Long Reads. Bioinformatics 2019.