NCM 2021: Towards comprehensive genetic diagnosis of repeat expansion disorders with targeted nanopore sequencing
Ira Deveson (Garvan Institute of Medical Research, Australia) began his plenary talk by describing the mechanics of adaptive sampling, a process that enables users to programme a nanopore sequencing device to recognise particular DNA sequences as they pass through a nanopore, and then dynamically decide whether or not to sequence them. If a strand is one of interest, it can continue to pass through the nanopore to be sequenced in full; if not, the unwanted strand is ejected from the nanopore by reversing the voltage across that pore. Ira explained that, because the process requires no special library prep steps, adaptive sampling allows targeted sequencing to be performed ‘right there on the sequencer’, saving time and energy.
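The accept-or-eject decision described above can be sketched in a few lines of Python. This is a simplified illustration only, not the ReadFish or Read Until implementation: the `Target` type, the `decide` function, the action names, and the coordinates are all hypothetical, and a real run would obtain the mapping position by aligning the first part of each read in real time.

```python
from typing import NamedTuple, Optional


class Target(NamedTuple):
    """A genomic interval to enrich (hypothetical structure)."""
    contig: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def decide(hit: Optional[tuple[str, int]], targets: list[Target]) -> str:
    """Choose an action from where the start of a strand mapped.

    hit is (contig, position), or None if the strand has not mapped yet.
    """
    if hit is None:
        return "wait"  # not enough sequence yet to make a call
    contig, pos = hit
    for t in targets:
        if t.contig == contig and t.start <= pos < t.end:
            return "sequence"  # on target: let the strand pass through in full
    return "unblock"  # off target: eject the strand from the pore


# Illustrative interval only; not a real gene annotation.
panel = [Target("chr4", 3_000_000, 3_300_000)]
print(decide(("chr4", 3_100_000), panel))
print(decide(("chr1", 500_000), panel))
```

Because off-target strands are ejected shortly after the decision is made, they contribute only short reads, while on-target strands are read in full.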
Ira described how he and his team, the Genomic Technologies Group, set about testing the potential of adaptive sampling for the genotyping of short tandem repeat (STR) expansions, ‘an application for which nanopore sequencing almost seems like the ideal tool’. STRs are ~2–12 bp sequence motifs that are repeated multiple times in succession. They make up ~7% of the human genome and are highly polymorphic, varying in length between individuals. This can represent healthy variation, but some expanded repeats act as pathogenic mutations. Many different repeats exist; they act through different mechanisms depending on the gene affected and the location, size, and sequence of the repeat, and they result in different diseases. Many of these diseases are neurological, including Huntington’s disease, Fragile X Syndrome, and ALS. There are currently >40 known neurological diseases caused by repeat expansion mutations in at least 37 different genes.
Ira explained that expanded STRs are typically longer than the read lengths generated by short-read sequencing technology. Short-read sequencing of a repeat expansion therefore results in multiple similar reads, making it difficult to correctly infer its full length and sequence via assembly, like constructing a jigsaw puzzle from many small, identical pieces. Clinical diagnosis of repeat expansion disorders currently relies on labour-intensive techniques such as Southern blots and repeat-primed PCR. Each assay tests for a single gene; given the number of disorders associated with repeat expansions across many genes, often with overlapping symptoms, Ira explained that clinicians currently have to make their best guess at the correct test to order, which can result in a long wait for a diagnosis. Ira described the future potential for nanopore sequencing to ‘solve the problem in one step’ by reading STRs end-to-end in single long reads, whilst targeted sequencing via adaptive sampling could remove the need for more costly whole-genome sequencing. The team decided to investigate the potential of adaptive sampling for future ‘parallel testing of all disease-associated STR genes in a single nanopore assay’. For these experiments, they conducted adaptive sampling using the ReadFish software (Payne et al., Nature Biotech, 2020) together with the Oxford Nanopore ReadUntil API.
First, they created a catalogue of all known neurological disease-implicated STR genes, including some only recently discovered. These genes, together with flanking regions for each, formed the targets for enrichment via adaptive sampling on a MinION device. Ira showed an example of adaptive sampling-enriched nanopore data mapping to HTT, the gene in which repeat expansions cause Huntington’s disease, displaying clear enrichment across the gene and lower read depth in the adjacent regions. He noted that reads from off-target regions were shorter than those within the target region, because off-target strands are ejected from the nanopore before being read in full. They observed a ~5-fold coverage enrichment across the panel, producing a depth of coverage of ~25–35x across their targets from a single MinION Flow Cell, similar to that produced by whole-genome sequencing on a high-throughput PromethION Flow Cell.
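The relationship between the reported figures can be checked with simple arithmetic. The sketch below assumes that fold enrichment is defined as on-target depth divided by the depth expected without adaptive sampling; the talk summary does not state which baseline was used, and the function names are illustrative.

```python
def fold_enrichment(target_depth: float, baseline_depth: float) -> float:
    """On-target depth relative to a baseline depth (assumed definition)."""
    if baseline_depth <= 0:
        raise ValueError("baseline_depth must be positive")
    return target_depth / baseline_depth


def implied_baseline(target_depth: float, fold: float) -> float:
    """Baseline depth implied by an on-target depth and a fold enrichment."""
    if fold <= 0:
        raise ValueError("fold must be positive")
    return target_depth / fold


# Reported range: ~25-35x on target at ~5-fold enrichment.
for depth in (25.0, 35.0):
    print(f"{depth:.0f}x on target implies ~{implied_baseline(depth, 5.0):.0f}x baseline")
```

Under that assumption, the reported ~25–35x on-target depth corresponds to roughly 5–7x from the same flow cell without enrichment.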
In the HTT gene, given ~30x coverage, Ira and his team typically obtained ~20–25 reads that spanned the repeat region. They could then phase the data into the maternal and paternal alleles and assemble the consensus repeat sequence for each. Ira showed the consensus sequences generated for a clinical research sample from one individual: one healthy copy of the repeat, comprising 18 CAG copies, and one allele with an expanded repeat of 64 copies, a length indicating a pathogenic mutation. As Huntington’s disease is a dominant genetic condition, this would represent a positive result for Huntington’s for this clinical research sample. He showed the sequencing results for 12 more clinical research samples with known HTT repeat statuses, noting the diverse range of repeat lengths across the group. Of the 12, five samples belonged to individuals who had been clinically diagnosed with Huntington’s; in each case, the adaptive sampling data showed a pathogenic repeat expansion. The remaining seven samples showed repeat lengths within the healthy range, demonstrating the capacity to distinguish healthy from affected clinical research samples.
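Once a consensus sequence exists for each allele, sizing the repeat reduces to counting motif copies. The following is a minimal sketch, not the team's pipeline: it takes the longest uninterrupted CAG run, and it classifies the count using HTT ranges commonly cited in the clinical literature (up to 26 normal, 27–35 intermediate, 36–39 reduced penetrance, 40 or more full penetrance), which are an assumption here rather than something stated in the talk. The flanking sequences are made up for illustration.

```python
import re


def longest_repeat_run(seq: str, motif: str = "CAG") -> int:
    """Number of copies in the longest uninterrupted run of motif."""
    runs = re.findall(f"(?:{re.escape(motif)})+", seq.upper())
    return max((len(run) // len(motif) for run in runs), default=0)


def classify_htt(cag_copies: int) -> str:
    """Classify an HTT CAG count (ranges are an assumption, see lead-in)."""
    if cag_copies <= 26:
        return "normal"
    if cag_copies <= 35:
        return "intermediate"
    if cag_copies <= 39:
        return "reduced penetrance"
    return "full penetrance"


# Synthetic alleles with the copy numbers from the example above.
left_flank, right_flank = "ACCTTG", "CAACAGCCGCCA"
alleles = {
    "allele_1": left_flank + "CAG" * 18 + right_flank,
    "allele_2": left_flank + "CAG" * 64 + right_flank,
}
for name, seq in alleles.items():
    copies = longest_repeat_run(seq)
    print(name, copies, classify_htt(copies))
```

Counting only the longest uninterrupted run is a deliberate simplification; real alleles can contain interruptions that a production caller would need to handle.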
Ira and his team then compared the results from the adaptive sampling of the clinical research samples with those of the clinically established tests for those individuals. Showing the data for the genes HTT, FMR1, and RFC1, Ira stressed that the STR copy numbers identified by the clinical tests were ‘almost identical’ to those from nanopore sequencing, demonstrating that sequencing was ‘at least as precise as the current established methodologies’. The identification of pathogenic variants in FMR1 and RFC1, however, is more complex than simply measuring repeat length.
With nanopore sequencing of native DNA, it is possible to detect epigenetic modifications alongside nucleotide sequence. Ira showed sequencing data across FMR1 for two clinical research samples, one of which had an expanded repeat. Analysis of methylation in both samples showed an unmethylated region, characteristic of the gene promoter, in the healthy clinical research sample, whilst the other showed partial methylation. For this second sample, they performed phasing to analyse allele-specific methylation, showing one haplotype to be largely unmethylated and the other — the allele harbouring an STR expansion — to be largely methylated, indicating that it is likely to be silenced. This silencing is considered to be the main mechanism of pathogenicity in Fragile X Syndrome. Ira stressed that characterising this feature could not be done easily with any other technology.
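Allele-specific methylation of this kind can be summarised by grouping per-CpG calls by haplotype. The sketch below assumes the calls have already been extracted as (haplotype tag, methylated or not) pairs from phased reads; the input format, the function name, and the example values are hypothetical and do not come from the talk.

```python
from collections import defaultdict
from typing import Iterable


def methylation_by_haplotype(calls: Iterable[tuple[int, bool]]) -> dict[int, float]:
    """Fraction of methylated CpG observations for each haplotype tag."""
    counts: dict[int, list[int]] = defaultdict(lambda: [0, 0])  # [methylated, total]
    for haplotype, is_methylated in calls:
        counts[haplotype][0] += int(is_methylated)
        counts[haplotype][1] += 1
    return {hap: meth / total for hap, (meth, total) in counts.items()}


# Made-up calls: haplotype 1 mostly unmethylated, haplotype 2 mostly methylated.
example_calls = (
    [(1, False)] * 9 + [(1, True)] * 1
    + [(2, True)] * 8 + [(2, False)] * 2
)
fractions = methylation_by_haplotype(example_calls)
for hap in sorted(fractions):
    print(f"haplotype {hap}: {fractions[hap]:.0%} methylated")
```

Pooling both haplotypes in this example would give 45% methylation, the kind of 'partial' signal that only phasing resolves into one unmethylated and one methylated allele.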
Ira then moved on to the gene RFC1, in which STR expansions are the leading cause of the condition CANVAS, a debilitating ataxia. The five-nucleotide repeat is relatively common and can reach up to 5 kb in length. Non-pathogenic alleles typically comprise one of two repeat motifs and are considered healthy regardless of their length. A third, rarer motif is considered pathogenic, and the presence of two alleles comprising this motif would indicate a positive result for CANVAS. Ira explained that genetic diagnosis of CANVAS therefore requires identification of the repeat length, sequence motif, and whether the repeat is present in one or both alleles. Currently, this requires multiple molecular tests. Ira and his team investigated the future potential to identify all three of these features from a single nanopore sequencing dataset.
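The three features named above (repeat length, motif, and zygosity) can be read from a pair of haplotype-resolved repeat sequences, as the following sketch illustrates. The motif sequences are an assumption taken from the published RFC1 literature (AAAAG and AAAGG as non-pathogenic, AAGGG as pathogenic) and are not named in the talk summary; the function names, result labels, and synthetic alleles are hypothetical.

```python
PATHOGENIC_MOTIF = "AAGGG"  # assumption from the literature, not from the talk
KNOWN_MOTIFS = ("AAAAG", "AAAGG", PATHOGENIC_MOTIF)


def dominant_motif(repeat_seq: str) -> str:
    """Return the candidate motif with the most non-overlapping copies."""
    seq = repeat_seq.upper()
    return max(KNOWN_MOTIFS, key=seq.count)


def summarise_rfc1(allele_a: str, allele_b: str) -> dict:
    """Report length and motif per allele, plus a zygosity-based call."""
    motifs = [dominant_motif(allele_a), dominant_motif(allele_b)]
    n_pathogenic = motifs.count(PATHOGENIC_MOTIF)
    call = {
        2: "pathogenic motif on both alleles",
        1: "pathogenic motif on one allele",
        0: "no pathogenic motif",
    }[n_pathogenic]
    return {
        "lengths_bp": [len(allele_a), len(allele_b)],
        "motifs": motifs,
        "call": call,
    }


# Synthetic alleles with illustrative lengths.
unaffected = summarise_rfc1("AAAAG" * 120, "AAAAG" * 120)
affected = summarise_rfc1("AAGGG" * 500, "AAGGG" * 700)
print(unaffected)
print(affected)
```

The sketch deliberately applies no length threshold, since none is given in the summary; it only reports lengths alongside the motif-based call.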
Ira showed data from two clinical research samples — one from a healthy individual, the other from an individual with CANVAS — for which they used adaptive sampling-enriched data to generate haplotype-resolved assemblies of RFC1. In the healthy sample, both repeats were found to be ~600 bp in length and comprised of the same non-pathogenic repeat sequence. For the CANVAS sample, both repeats were made up of the pathogenic motif, with one ~2.5 kb and the other ~3.5 kb in length, closely concordant with Southern blot results. Together, this demonstrated identification of all the features that would be required to detect CANVAS. Finally, he showed the data for a third clinical research sample from a different individual with CANVAS. In clinical testing, the two ~5 kb repeats had presented as one band, whilst PCR testing identified the pathogenic motif. However, nanopore sequencing of the clinical research sample showed one repeat comprised of the pathogenic motif and the other made up of what is currently considered a non-pathogenic motif. Ira explained that this study suggests that the latter motif could become pathogenic when above a certain length, and that the assumption that it is non-pathogenic could be the result of ambiguity from the currently used clinical tests. He stressed how this example demonstrates the future potential for long reads to ‘eventually transform the genetic diagnosis of these diseases, and really enhance our understanding of how they work’.
Ira highlighted how, in this research study, they simultaneously sequenced 37 genes, whereas currently used clinical diagnostic methods require the clinician to choose one test for one gene at a time. He described how the method could allow for a greater understanding of the true genetic diversity at these repeat sites which, though among the most diverse regions in the human genome, remain poorly characterised by traditional sequencing technology. Understanding the full picture of this diversity is critical: in their study of 30 samples, they have already observed ‘different repeat sizes, different motifs, and other interesting features that haven’t been previously described’. Because the limited resolution currently achievable for these genes means that the boundary between pathogenic and non-pathogenic variants may not be clearly defined, Ira emphasised the future potential of adaptive sampling and nanopore sequencing to untangle this complex picture.