Structural variant detection from long-read sequencing data with cuteSV
Tao (Harbin Institute of Technology, China) opened his talk by introducing what structural variation (SV) is and how it can be discovered. SVs range from 50 bp to several megabases in length and can be classified into balanced rearrangements (inversions and translocations), imbalanced copy number variation (deletions, insertions, and duplications), and complex rearrangements. SVs have been detected throughout the human genome and have widespread impact: they have noticeable effects at the molecular and cellular level and are associated with complex phenotypes and diseases. Because of this impact, there is high value in comprehensive detection and understanding of SVs and their evolution.
Long sequencing reads from Oxford Nanopore technology enable ‘high resolution’ and long-range SV analysis. Tao noted that, compared to short-read sequencing technologies, nanopore sequencing provides benefits including ultra-long reads, direct sequencing, and lack of GC bias. These benefits aid SV detection, particularly in repetitive regions. Challenges remain in SV calling, including the demand for high coverage to resolve complex variants, unsuitable clustering rules for SV prediction, and high compute requirements for some analyses, which may be beyond the reach of small computing facilities. To address these bottlenecks, Tao has developed cuteSV.
Structural variation detection with cuteSV
Tao’s SV analysis workflow, cuteSV, contains three steps: extraction of SV signatures and recovery of evidence of SVs from fragile alignments; integration and classification of signature clusters via a stepwise refinement approach; and application of several tailored rules for final SV calling and genotyping. Tao noted three typical applications of cuteSV: long-read alignment-based SV calling; discovery of SVs using diploid-assembly alignments; and SV calling on a large scale in populations.
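The three steps can be pictured with a deliberately simplified sketch. It is not cuteSV's implementation: it only reads large insertions and deletions from CIGAR strings, clusters them by position alone, and applies a single read-support rule. The function names, the 200 bp clustering gap, and the example reads are all hypothetical.

```python
import re
from collections import namedtuple

# One SV signature: type, reference position, length, and the supporting read.
Signature = namedtuple("Signature", "svtype pos length read")

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def signatures_from_cigar(read, ref_start, cigar, min_size=50):
    """Step 1 (toy): collect large insertions/deletions inside one alignment."""
    sigs, ref_pos = [], ref_start
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op == "I" and n >= min_size:
            sigs.append(Signature("INS", ref_pos, n, read))
        elif op == "D" and n >= min_size:
            sigs.append(Signature("DEL", ref_pos, n, read))
        # Only these operations advance the reference coordinate.
        if op in "MDN=X":
            ref_pos += n
    return sigs


def cluster(sigs, max_gap=200):
    """Step 2 (toy): group same-type signatures lying within max_gap of the previous one."""
    clusters = []
    for sig in sorted(sigs, key=lambda s: (s.svtype, s.pos)):
        last = clusters[-1] if clusters else None
        if last and last[-1].svtype == sig.svtype and sig.pos - last[-1].pos <= max_gap:
            last.append(sig)
        else:
            clusters.append([sig])
    return clusters


def call(clusters, min_support=2):
    """Step 3 (toy): keep well-supported clusters, report a middle position and length."""
    calls = []
    for c in clusters:
        support = len({s.read for s in c})
        if support >= min_support:
            pos = sorted(s.pos for s in c)[len(c) // 2]
            length = sorted(s.length for s in c)[len(c) // 2]
            calls.append((c[0].svtype, pos, length, support))
    return calls


# Hypothetical alignments: (read name, reference start, CIGAR).
reads = [("r1", 1000, "500M120D400M"),
         ("r2", 1100, "410M118D300M"),
         ("r3", 900, "600M10D300M")]
sigs = [s for name, start, cigar in reads
        for s in signatures_from_cigar(name, start, cigar)]
calls = call(cluster(sigs))  # [("DEL", 1510, 120, 2)]
```

The 10 bp deletion in the third read falls below the 50 bp size cut-off and is ignored, so the call rests on two reads. A real caller would also use split alignments and refine clusters by length, which this sketch omits.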
Next, Tao demonstrated high precision and recall of SV calling and genotyping in 47X HG002 PromethION sequencing data. The cuteSV workflow improved the identification of variant breakpoints, alternative allele sequences, and variant genotypes compared to two other ‘state-of-the-art’ SV callers, SVIM and Sniffles. Discussing these results further, Tao showed that the size distribution of the detected SVs contained the expected Alu and LINE1 element peaks. Tao also stated that cuteSV performance was largely independent of SV size, and that the workflow uncovered more insertions than deletions in low-complexity regions.
Tao displayed an example of a large homozygous insertion – cuteSV extracted the signatures from the split alignments and reported the homozygous 40 kb insertion. As a second example, Tao demonstrated how cuteSV successfully reported a heterozygous 32 kb deletion covering two genes (LCE3B and LCE3C) known to be associated with susceptibility to psoriasis.
SV calling in a haplotype-resolved human genome assembly
Tao next presented a robust evaluation of SV calling with cuteSV in the HG002 haplotype-resolved genome assembly. Compared with the tool SVIM-asm, cuteSV demonstrated higher performance on calling phased SVs, discovering ~95% of SVs with their correct genotypes. Tao described the workflow as a ‘new powerful method that allows the pairwise comparison of genomes and enables SV calling even in the absence of a suitable reference genome’.
Another benefit of the cuteSV workflow is the ‘low demand on the coverage of sequencing data’. Evaluation on 50X HG002 nanopore sequencing data, using the new Oxford Nanopore EPI2ME SV calling pipeline (which incorporates cuteSV), showed that high precision did not require high sequence coverage, whereas recall improved as coverage increased. Overall, cuteSV identified over 90% of variants with 15X depth of coverage, and 95% of variants with 30X depth of coverage.
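For readers less familiar with the two metrics, the sketch below shows how precision and recall are conventionally computed from benchmark counts, and why they can diverge at low coverage: missed variants (false negatives) lower recall without affecting precision. The function name and the counts are hypothetical and are not results from the talk.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from counts against a truth set."""
    called = true_pos + false_pos   # everything the caller reported
    truth = true_pos + false_neg    # everything in the truth set
    precision = true_pos / called if called else 0.0
    recall = true_pos / truth if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: few false calls, but a quarter of true SVs missed.
p, r, f1 = precision_recall(true_pos=90, false_pos=10, false_neg=30)
```

With these made-up counts precision is 0.90 while recall is 0.75, the pattern expected when coverage is too low to support every true variant.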
Furthermore, the cuteSV workflow demonstrated ‘great scalability and low memory footprint’. In terms of scalability, compared to the SV callers SVIM and Sniffles, cuteSV showed a near-linear speed-up with multiple threads, suggesting it is suitable for large-scale genomics studies. It also achieved stable and low memory consumption, suitable for a typical modern desktop computer.
Typical applications of cuteSV
On the basis of its specific design modules, Tao stated that there are three applications of cuteSV. The general use is long-read alignment-based SV calling, based on read mapping with tools such as minimap2, LRA, and NGMLR. A wide range of SV signatures can be derived from the alignment files using cuteSV to generate SV callsets and predicted genotypes. He stated that the minimum read support threshold should be no less than two, and recommended setting it to one-sixth of the sequence coverage. He also advised setting both --merge_ins_threshold and --merge_del_threshold to 500, in order to discover those SVs that might have been divided into several parts during read mapping.
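The recommended settings can be turned into a small helper, sketched below. The two merge thresholds come from the talk. The positional argument order and the --min_support, --genotype, and --threads flags are assumptions based on the cuteSV README and should be checked against `cuteSV --help`. Rounding one-sixth of the coverage to the nearest integer is also an assumption, as are the function names and file paths.

```python
def min_support_for(coverage):
    """One-sixth of the coverage, never below two reads (rounding is an assumption)."""
    return max(2, round(coverage / 6))


def cutesv_command(bam, reference, out_vcf, work_dir, coverage, threads=8):
    """Build, but do not run, a cuteSV command line as an argument list."""
    return [
        "cuteSV", bam, reference, out_vcf, work_dir,  # assumed positional order
        "--min_support", str(min_support_for(coverage)),
        "--merge_ins_threshold", "500",  # value advised in the talk
        "--merge_del_threshold", "500",  # value advised in the talk
        "--genotype",                    # assumed flag for genotype output
        "--threads", str(threads),       # assumed flag for thread count
    ]


# Hypothetical file names for a 47X sample.
cmd = cutesv_command("HG002.sorted.bam", "ref.fa", "HG002.vcf", "work", coverage=47)
```

For 47X data this gives a minimum support of 8 reads, and for 30X data 5 reads. The list can be passed to `subprocess.run(cmd, check=True)` once the flags have been confirmed for the installed version.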
Another common application is the discovery of haplotype-specific SVs using diploid assembly alignments. This involves three main steps: marking haplotype-unique tags from maternal and paternal assemblies; aligning the assemblies and post-processing the alignments with samtools; and performing diploid-assembly-based SV calling with cuteSV.
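The first of these steps, tagging contigs with their haplotype of origin, might look like the sketch below. The talk did not specify a tag format, so the `hap1`/`hap2` prefixes and the function names are hypothetical. The alignment, samtools, and cuteSV steps are not shown.

```python
def tag_haplotype(fasta_lines, tag):
    """Prefix each FASTA header with a haplotype tag; sequence lines pass through."""
    for line in fasta_lines:
        if line.startswith(">"):
            yield ">" + tag + "_" + line[1:]
        else:
            yield line


def haplotype_of(contig_name):
    """Recover the tag from a contig name produced by tag_haplotype."""
    return contig_name.split("_", 1)[0]


# Hypothetical two-line assemblies, one per parental haplotype.
maternal = list(tag_haplotype([">ctg1 len=5", "ACGTA"], "hap1"))
paternal = list(tag_haplotype([">ctg1 len=5", "ACGTT"], "hap2"))
```

Because the tag travels with the contig name into the alignment file, each aligned contig can later be attributed to one haplotype even when both assemblies use the same original contig names.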
Thirdly, cuteSV also enables SV calling in large-scale populations. This involves four main steps: generating an SV callset for each sample using cuteSV; merging the sample-level SV callsets using SURVIVOR; force calling the cohort-level SVs across all samples using cuteSV; and lastly merging the force-called SVs to generate the final cohort callset, again using SURVIVOR.
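The four steps can be laid out as a list of commands, sketched below without executing anything. The `-Ivcf` force-calling option and the `SURVIVOR merge` argument order are assumptions drawn from each tool's README and should be verified against the installed versions. The merge parameter values, file names, and function name are placeholders rather than recommendations from the talk.

```python
def cohort_commands(samples, reference, merged="cohort.merged.vcf",
                    final="cohort.final.vcf"):
    """Return the four workflow stages as lists of argument lists.

    samples maps a sample name to the path of its sorted BAM file.
    """
    # Assumed SURVIVOR merge arguments: max breakpoint distance, min supporting
    # callsets, match type, match strand, estimate distance, min SV size.
    merge_params = ["1000", "1", "1", "1", "0", "50"]

    discover = [["cuteSV", bam, reference, f"{name}.vcf", f"work_{name}", "--genotype"]
                for name, bam in samples.items()]
    # calls.txt is expected to list the per-sample VCFs, one path per line.
    merge = [["SURVIVOR", "merge", "calls.txt", *merge_params, merged]]
    # -Ivcf is assumed to switch cuteSV into force-calling mode.
    force = [["cuteSV", bam, reference, f"{name}.force.vcf", f"work_{name}",
              "-Ivcf", merged, "--genotype"]
             for name, bam in samples.items()]
    # forced.txt is expected to list the force-called VCFs, one path per line.
    remerge = [["SURVIVOR", "merge", "forced.txt", *merge_params, final]]
    return {"discover": discover, "merge": merge,
            "force_call": force, "final_merge": remerge}


# Hypothetical two-sample cohort.
stages = cohort_commands({"s1": "s1.bam", "s2": "s2.bam"}, "ref.fa")
```

Force calling re-examines every sample at every cohort-level site, so a variant found in only one sample still receives a genotype in all the others before the final merge.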
Outlining future work, Tao described planned improvements to the SV calling pipeline. These include producing consensus sequences for each alternative allele, which should help correct errors from sequencing and alignment and further improve the accuracy of breakpoint and size estimates. A second aim is to develop a new joint SV calling method to improve SV calling performance at cohort level.