Dissecting RNA biology, one molecule at a time
Martin Smith from the Kinghorn Centre for Clinical Genomics at the Garvan Institute of Medical Research gave a talk outlining the current state of transcriptomics using long-read sequencing. To start, Martin spoke about challenging features of mammalian transcriptomes, specifically cell type-specific expression patterns and the large dynamic range of expression in mammalian cells. He then showed a Circos plot of splice junctions from RNA capture sequencing, illustrating the complex transcriptional activity inherent to these systems, much of which escapes classical RNA sequencing strategies. To give some background on the complexity of mammalian transcriptomes and alternative splicing, Martin referenced Deveson et al., where short-read sequence capture of RNAs from chromosome 21 suggested that almost every non-coding exon undergoes alternative splicing, indicating a seemingly limitless variety of isoforms. Furthermore, when these patterns are compared with mouse models, humans show more alternative splicing events.
Next, Martin spoke about why long reads help in the study of exon connectivity and alternative transcript isoforms. Reads that span multiple exons make it possible to resolve distant mutually associated exon pairs (dMAPs): spatially separate regions of a transcript that occur together in one isoform but not in others. To compare how different sequencing technologies deal with this type of problem, Martin showed a schematic of transcript isoforms assembled from short reads with StringTie, in which dMAPs were incorrectly assigned and spurious isoforms were generated. In contrast, the consensus sequence from Oxford Nanopore cDNA sequencing resolved the correct isoforms, with each dMAP correctly located on the transcript of origin.
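To make the dMAP idea concrete, here is a minimal sketch of how such pairs could be picked out once every read covers the whole transcript. It is not the method used in the talk: the function name `find_dmaps`, the integer exon indices, and the `min_gap` parameter are all assumptions made for illustration, and the sketch assumes error-free, full-length reads.

```python
from itertools import combinations


def find_dmaps(reads, min_gap=2):
    """Find distant exon pairs that are always included or skipped together.

    reads:   list of tuples of exon indices (ordered 5'->3'), one per
             full-length read (hypothetical input format).
    min_gap: minimum difference in exon index for a pair to count as
             "distant" (an illustrative threshold, not a published one).
    """
    exons = sorted({e for r in reads for e in r})
    read_sets = [set(r) for r in reads]
    pairs = []
    for a, b in combinations(exons, 2):
        if b - a < min_gap:
            continue  # skip neighbouring exons; only distant pairs matter
        with_a = [a in s for s in read_sets]
        with_b = [b in s for s in read_sets]
        # Mutually associated: present in exactly the same reads,
        # and alternative (absent from at least one read).
        if with_a == with_b and not all(with_a):
            pairs.append((a, b))
    return pairs


# Two isoforms: exons 2 and 5 travel together, as do exons 3 and 6.
example_reads = [
    (1, 2, 4, 5, 7),
    (1, 3, 4, 6, 7),
    (1, 2, 4, 5, 7),
    (1, 3, 4, 6, 7),
]
print(find_dmaps(example_reads))  # [(2, 5), (3, 6)]
```

Short reads rarely span both members of a distant pair, so an assembler has to guess which combinations co-occur; in this toy input that guess could produce isoforms containing exons 2 and 6 together, which no read supports.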
Martin then discussed a paper by Hardwick and Basset et al. in which sequence capture of brain RNA was combined with Oxford Nanopore long-read cDNA sequencing to profile non-coding RNAs that contain GWAS-determined SNPs associated with neuropsychiatric traits. Many of the targeted exonic regions, and a number of the intronic ones, formed novel, multi-exonic transcripts not present in the GENCODE annotations. The novel splice junctions and functions were backed up with a number of orthogonal methods, providing a preliminary transcript atlas of non-coding RNAs that can connect neurological phenotypes with gene expression.
Martin moved on to discuss why hybrid sequencing of transcript isoforms could help to resolve exon boundaries. Here, Martin displayed coverage of specific synthetic exons as generated by short-read sequencing, showing high coverage and confidence at the exon boundaries but highly uneven coverage across each exon, making it “…difficult to use short read sequencing to quantify isoforms”. Long reads, on the other hand, spanned multiple exons but showed some “wobble” in the exact exon boundaries. As an alternative to hybrid sequencing, Martin spoke about tools for genome-guided de novo assembly of transcript isoforms. He described two that are currently available: Oxford Nanopore's Pinfish pipeline and FLAIR, made by the Brooks lab. Both essentially correct the fuzzy splice junctions of the raw reads by aligning the reads to a reference genome in a splice-aware fashion, correcting junction misalignments and then clustering by junction position. Each cluster is then collapsed through a polishing step, and multiple passes result in a high-confidence transcript reference.
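The junction correction and clustering steps described above can be sketched roughly as follows. This is a simplified illustration and not either tool's actual implementation: the function names, the input format (intron coordinates already extracted from splice-aware alignments), the lists of trusted splice sites, and the default `wobble` window are all assumptions.

```python
from bisect import bisect_left
from collections import defaultdict


def snap(pos, known_sites, wobble):
    """Move `pos` to the nearest trusted splice site within `wobble` bases.

    known_sites must be sorted. If no site is close enough, `pos` is
    returned unchanged.
    """
    i = bisect_left(known_sites, pos)
    # Only the sites just before and just after `pos` can be nearest.
    candidates = known_sites[max(i - 1, 0):i + 1]
    best = min(candidates, key=lambda s: abs(s - pos), default=None)
    if best is not None and abs(best - pos) <= wobble:
        return best
    return pos


def cluster_by_junctions(reads, known_donors, known_acceptors, wobble=10):
    """Group reads that share the same corrected chain of introns.

    reads: dict of read_id -> list of (donor, acceptor) coordinates
           taken from a splice-aware alignment (hypothetical format).
    """
    donors = sorted(known_donors)
    acceptors = sorted(known_acceptors)
    clusters = defaultdict(list)
    for read_id, introns in reads.items():
        chain = tuple(
            (snap(d, donors, wobble), snap(a, acceptors, wobble))
            for d, a in introns
        )
        clusters[chain].append(read_id)
    return dict(clusters)


example = {
    "r1": [(100, 200), (300, 402)],  # second acceptor is off by 2
    "r2": [(103, 198), (300, 400)],  # first intron is off by 3 and 2
    "r3": [(100, 200)],              # a shorter isoform
}
for chain, ids in cluster_by_junctions(example, [100, 300], [200, 400], wobble=5).items():
    print(chain, ids)
```

After snapping, `r1` and `r2` share one junction chain and fall into the same cluster, while `r3` forms its own. The collapsing and polishing of each cluster into a consensus transcript is left out of this sketch.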
Long-read cDNA sequencing of the products of PVT1, an important gene in triple-negative breast cancer, revealed huge numbers of novel isoforms compared with short-read sequencing. It was clear that many novel isoforms can be found in even well-studied genes, which led Martin to suggest that “current reference transcriptomes do not exactly instil confidence”.
Towards the end of his talk, Martin spoke about some of the droplet-based, full-length single-cell transcriptome work that Ghamdan ‘Gammy’ Al-Eryani had been doing in their lab. Using Repertoire and Gene Expression sequencing (RAGE-seq), full-length T-cell and B-cell receptor sequences and transcriptional profiles for thousands of lymphocytes from primary breast cancer tumors were generated using nanopore sequencing in combination with short read-aided barcode demultiplexing. Typically, in single-cell cDNA sequencing, molecular barcodes are added to each cDNA molecule in order to determine the cell of origin. Martin closed his talk by discussing how tools developed by James Ferguson and himself aim to demultiplex custom barcodes using raw nanopore squiggles. The toolkit, called DeePLEXIcon, uses a neural network to do this and gave a demultiplex rate of 80% with a barcode assignment accuracy of 98.8%. It will be expanded upon later in the data analysis session by James himself.
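The two figures quoted for the toolkit measure different things, and a small sketch may help to separate them. It assumes a generic classifier that outputs one score per barcode for each read and a confidence cutoff below which reads are left unassigned. The function names, the cutoff value, and the toy scores are illustrative assumptions and do not reflect the toolkit's real interface or data.

```python
def demultiplex(scores_by_read, cutoff=0.9):
    """Assign each read to its best-scoring barcode, or None if unsure.

    scores_by_read: dict of read_id -> list of per-barcode scores
                    (hypothetical classifier output).
    """
    calls = {}
    for read_id, scores in scores_by_read.items():
        best = max(range(len(scores)), key=scores.__getitem__)
        calls[read_id] = best if scores[best] >= cutoff else None
    return calls


def summarise(calls, truth):
    """Return (demultiplex rate, accuracy among assigned reads)."""
    assigned = {r: b for r, b in calls.items() if b is not None}
    rate = len(assigned) / len(calls)
    if not assigned:
        return rate, float("nan")
    correct = sum(truth[r] == b for r, b in assigned.items())
    return rate, correct / len(assigned)


toy_scores = {
    "a": [0.95, 0.05],
    "b": [0.60, 0.40],  # below the cutoff, so left unassigned
    "c": [0.02, 0.98],
    "d": [0.91, 0.09],  # confident but wrong
    "e": [0.10, 0.90],
}
toy_truth = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}

rate, accuracy = summarise(demultiplex(toy_scores), toy_truth)
print(rate, accuracy)  # 0.8 0.75
```

The demultiplex rate is the share of reads that receive any barcode call, while accuracy is measured only over those assigned reads. Raising the cutoff generally trades a lower rate for higher accuracy.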