Detection of differential isoform expression and usage during cellular differentiation using long-read RNA sequencing

Wilfried (Earlham Institute, UK; in collaboration with researchers from Oxford) opened by noting that, according to the annotation currently in the public domain, the human genome encodes 19,951 protein-coding genes and 86,054 protein-coding transcripts, although this is likely to be a substantial underestimate. Changes in gene expression are fundamental for cellular differentiation; Wilfried’s team are interested in the regulation of neuronal differentiation (from the neuronal stem cell to the mature neuron) and of neuronal function.

Alternative splicing and the need for long reads

Alternative splicing is a highly regulated process, involving an extensive yet precise array of regulatory factors acting both in cis and in trans. Aberrant splicing is known to be associated with several diseases, such as Duchenne muscular dystrophy, early-onset Parkinson’s disease, and amyotrophic lateral sclerosis; it is also a hallmark of several cancers. It is therefore important to identify alternatively spliced transcripts. However, short-read RNA sequencing involves fragmentation of cDNA, which loses long-range exon connectivity information: sequence fragments are computationally assembled by inference into potential transcripts, rather than each transcript being sequenced directly. As a result, it is common for exons to be missing from inferred isoforms. Wilfried explained that the literature suggests that, even when all exons are detected, ‘up to 50%’ of transcripts identified via short-read sequencing are not correctly assembled. This leads to ‘a very confusing picture’ when trying to assess splicing at the gene level rather than the junction level.

To study splicing of Drosophila Dscam1, for example, long reads are required: based on the genomic complexity of this gene, there could be anywhere between 19,008 and 38,016 possible exon combinations. Using long-read technology, one team has published research identifying 18,496 isoforms.
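The talk did not spell out how the 19,008 to 38,016 range arises. One commonly cited decomposition, given here as an assumption rather than as the speaker's own derivation, is that Dscam1 has four clusters of mutually exclusive exons (12, 48, 33, and 2 alternatives), with exactly one exon chosen from each cluster. Under that assumption, a minimal sketch of the arithmetic is:

```python
from math import prod

# Assumed numbers of mutually exclusive alternatives per Dscam1 exon cluster.
# These values are commonly cited but were not stated in the talk.
DSCAM1_CLUSTERS = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}


def count_combinations(clusters):
    """Number of isoforms if exactly one exon is chosen from each cluster."""
    return prod(clusters.values())


# Lower bound: ignore the two-way choice at the exon 17 cluster.
without_exon_17 = {k: v for k, v in DSCAM1_CLUSTERS.items() if k != "exon 17"}
lower = count_combinations(without_exon_17)

# Upper bound: include all four clusters.
upper = count_combinations(DSCAM1_CLUSTERS)

print(f"{lower:,} to {upper:,} possible exon combinations")
```

Because each combination spans several clusters that lie far apart on the transcript, a single read has to cover all of them to assign the isoform unambiguously, which is the case for long reads made in the text.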

Providing another example, Wilfried’s team have been investigating expression of CACNA1C, a major candidate risk gene in neuropsychiatric and cardiovascular disorders. The gene is incredibly complex: it comprises over 50 exons and, using nanopore sequencing, the team have identified 241 novel transcripts beyond those already annotated (Clark et al. Mol. Psychiatry. 2020). Variation in its expression across tissues was also greater than that seen between different individuals. Wilfried further demonstrated how this complexity could be revealed at the protein level, showing an image of the structure of the CACNA1C transmembrane protein, with its numerous domains. Many variants, including deletions, insertions, and truncations, have been identified across these different domains.

Splicing in neuronal differentiation

Wilfried’s research, along with the Oxford team, involved investigating splicing regulation during neuronal cell differentiation, with a focus on the following questions: can we detect expression variation during cellular differentiation; what is the sensitivity of long reads for expression quantification; and can we investigate differential gene and transcript expression? In addition, could they capture differential transcript usage depending on cell state: undifferentiated vs. differentiated?

To perform this research, Wilfried used the SH-SY5Y neuroblastoma cell line, which has a well-characterised, stable phenotype and is commonly used to study neurodegenerative diseases and neural cell development and function. Wilfried prepared cDNA from both differentiated (after retinoic acid treatment) and undifferentiated SH-SY5Y cells. The cDNA was then used for both short-read sequencing and nanopore sequencing on GridION (generating ~10.6 million reads per sample), alongside Sequin cDNA spike-in standards. Nanopore sequence data showed very good correlation with short-read data in terms of Sequin quantification, suggesting high sensitivity, and very good correlation (0.91) between observed and expected log fold-change in expression. For the analysis, a custom transcriptome annotation workflow was used, which relied on TALON; HTSeq and Salmon were also used for quantification.
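To illustrate what a correlation between observed and expected log fold-change measures, the following is a minimal sketch using made-up spike-in counts. The counts, the pseudocount, and the function names are all hypothetical; this is not the team's pipeline, and the talk did not state which correlation coefficient was used (Pearson is assumed here).

```python
import math


def log2_fold_change(count_a, count_b, pseudocount=1.0):
    """Observed log2 fold-change from two counts (pseudocount avoids log of 0)."""
    return math.log2((count_b + pseudocount) / (count_a + pseudocount))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical spike-ins: (expected log2 fold-change, count in mix A, count in mix B).
spike_ins = [
    (-2.0, 400, 110),
    (-1.0, 250, 120),
    (0.0, 300, 310),
    (1.0, 150, 280),
    (2.0, 90, 350),
]

expected = [exp for exp, _, _ in spike_ins]
observed = [log2_fold_change(a, b) for _, a, b in spike_ins]

r = pearson(expected, observed)
print(f"Pearson r between expected and observed log2FC: {r:.2f}")
```

Because the true fold-change of each spike-in is known by design, a high correlation indicates that the sequencing and quantification steps recover relative abundance changes faithfully.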

‘As expected,’ the team identified many novel transcripts in the dataset: 3,274 novel transcripts were found to be expressed in the SH-SY5Y cells; novel events (exons, splice sites, and junctions) were then validated using short-read data, reducing this to a more conservative count of 2,567 transcripts. These included both noncoding and coding transcripts. Of particular interest to the team, they identified a novel truncated transcript of the voltage-gated calcium channel gene CACNA2D2; missense mutations in this gene are known to be associated with intellectual disability, cerebellar atrophy, seizures, and epilepsy. The truncation leads to the loss of 217 amino acids, and the team are currently investigating the potential function of this isoform.

Regarding differential gene expression analysis, 4,239/32,977 genes assessed were differentially expressed, split almost equally into those that were upregulated and those that were downregulated. Interestingly, there was a very large disparity in differential expression of CACNA2D2 in differentiated vs. undifferentiated cells.

At the transcript level, 5,456/99,067 transcripts assessed were differentially expressed between undifferentiated and differentiated cells. Of note, the novel CACNA2D2 transcript was one such differentially expressed transcript, being more highly expressed in undifferentiated cells. Wilfried noted that tools designed to measure expression at the gene level by counting reads (e.g. HTSeq) lost substantial power in the analysis, because reads supporting multiple isoforms were rejected. The read-counting approach therefore had a massive effect on downstream analysis, and he stated that ‘we need to be extremely careful when we look at the method’ that is applied for transcript quantification.
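The loss of power from rejecting ambiguous reads can be seen in a toy example. The sketch below contrasts a strict counter that discards any read compatible with more than one transcript against a simple expectation-maximisation (EM) scheme that shares such reads out in proportion to estimated abundance. Both functions and the read data are hypothetical illustrations: this is not the actual algorithm of HTSeq or Salmon, and it ignores transcript length and alignment quality.

```python
def count_unique_only(reads):
    """Count only reads compatible with exactly one transcript; discard the rest."""
    counts = {}
    for compatible in reads:
        if len(compatible) == 1:
            (transcript,) = compatible
            counts[transcript] = counts.get(transcript, 0) + 1
    return counts


def em_counts(reads, transcripts, iterations=50):
    """Toy EM: split each ambiguous read by the current abundance estimates."""
    abundance = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(iterations):
        expected = {t: 0.0 for t in transcripts}
        for compatible in reads:
            total = sum(abundance[t] for t in compatible)
            for t in compatible:
                expected[t] += abundance[t] / total
        n = sum(expected.values())
        abundance = {t: expected[t] / n for t in transcripts}
    # Scale relative abundances back to estimated read counts.
    return {t: abundance[t] * len(reads) for t in transcripts}


# Hypothetical gene with two isoforms, A and B, sharing most of their exons.
# Each read is represented by the tuple of transcripts it is compatible with.
reads = [("A", "B")] * 6 + [("A",)] * 3 + [("B",)] * 1

strict = count_unique_only(reads)
shared = em_counts(reads, ["A", "B"])

print("unique-only counts:", strict, "reads used:", sum(strict.values()))
print("EM-estimated counts:", {t: round(c, 2) for t, c in shared.items()})
```

In this toy case the strict counter uses only 4 of the 10 reads, whereas the EM scheme keeps all 10 and assigns the shared reads according to the evidence from the unambiguous ones.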

Of the genes found not to be differentially expressed, 1,276 were found to have at least one differentially expressed transcript. ‘This was very surprising’ but also ‘highlighted the fact that we really, really need to go to the transcript level to be able to capture things that we would never have been able to do if we were staying at the gene level’. Differential transcript expression without differential gene expression can be explained by differential transcript usage at different stages of differentiation. To investigate this further, Wilfried used IsoformSwitchAnalyzeR, detecting 104 cases of differential transcript usage during differentiation. For example, an isoform switch in splicing factor RBM5 was observed, switching from a noncoding to coding transcript upon cell differentiation. ‘This was very striking’; and interestingly, the team also noted enrichment of the RBM5 binding site among genes displaying differential transcript usage.
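How a gene can show differential transcript usage without differential gene expression can be illustrated with isoform fractions (each transcript's share of its gene's total expression). The counts below are invented for a hypothetical gene with one noncoding and one coding isoform, and the function names are illustrative; the sketch computes only the change in isoform fraction and omits the statistical testing that dedicated tools perform.

```python
def isoform_fractions(counts):
    """Fraction of a gene's total expression contributed by each isoform."""
    total = sum(counts.values())
    return {isoform: count / total for isoform, count in counts.items()}


def usage_change(condition_a, condition_b):
    """Difference in isoform fraction (condition B minus condition A)."""
    fractions_a = isoform_fractions(condition_a)
    fractions_b = isoform_fractions(condition_b)
    return {iso: fractions_b[iso] - fractions_a[iso] for iso in fractions_a}


# Hypothetical counts for one gene in each cell state.
undifferentiated = {"noncoding": 80, "coding": 20}
differentiated = {"noncoding": 25, "coding": 75}

gene_total_undiff = sum(undifferentiated.values())
gene_total_diff = sum(differentiated.values())
delta = usage_change(undifferentiated, differentiated)

print("gene-level totals:", gene_total_undiff, "vs", gene_total_diff)
print("change in isoform fraction:", {k: round(v, 2) for k, v in delta.items()})
```

Here the gene-level total is identical in both states, so a gene-level test would see no change, yet the dominant isoform switches from noncoding to coding. Each individual transcript is differentially expressed even though the gene is not.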

Conclusions

In summary, Wilfried stated that long nanopore reads can be used to quantify gene and transcript expression, ‘with very good sensitivity’. Using the SH-SY5Y cell line, they identified 4,239 differentially expressed genes, and 5,456 differentially expressed transcripts in total. The identification of differential transcript expression without differential gene expression provides evidence for differential transcript usage with potential functional consequences.

Their ongoing work includes development of novel annotation pipelines for transcript reconstruction; improvement of single-cell approaches, such as increasing per-cell read depth; integrating splicing information into regulatory network reconstruction; identifying tissue-specific isoforms as potential drug targets; and validating novel splicing events.

Authors: Wilfried Haerty