Case study: Systematic evaluation of RNA sequencing kits using cancer cell lines

In transcriptomics, the "one gene, one product" paradigm rarely holds: predictions suggest that each gene has an average of 6.3 transcript isoforms [1], and 95% of multiexonic genes in the human genome are alternatively spliced [2]. The generation of multiple isoforms significantly increases the complexity of gene expression and, therefore, of its analysis.

Short-read RNA sequencing is subject to systematic biases, such as GC bias and PCR bias, which may be avoided using nanopore RNA sequencing methods, particularly those that are amplification-free. As part of the Singapore Nanopore-Expression Consortium, Dr. Jonathan Göke generated nanopore data from six cell lines and evaluated nanopore methods for producing full-length RNA sequences, either via direct RNA sequencing or via a cDNA intermediate, with or without PCR amplification. In addition, Jonathan wanted to establish QC and bioinformatic workflows for the analysis of RNA sequencing data and to study sample variation.

Input requirements and general information for each of the three Oxford Nanopore RNA sequencing kits are shown in Table 1. Jonathan obtained an average read length of over 1,000 bp for all kits, with the direct cDNA kit producing the longest reads. High technical reproducibility was observed across the kits and samples tested.
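As an illustration of how read-length figures like those above are typically derived, the following is a minimal sketch in Python that computes the mean read length and N50 from a FASTQ stream. It is not the workflow used in the study: the function names (`read_lengths`, `n50`) and the toy records are hypothetical, and the parser assumes simple four-line FASTQ records with no line wrapping.

```python
import io
from statistics import mean


def read_lengths(fastq_handle):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ stream."""
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the second line of every record holds the bases
            yield len(line.strip())


def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_of_bases = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_of_bases:
            return length
    raise ValueError("no reads supplied")


# Toy, made-up records standing in for a real FASTQ file (assumption for illustration).
toy_fastq = io.StringIO(
    "@read1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@read2\nACGT\n+\nIIII\n"
    "@read3\nACGTAC\n+\nIIIIII\n"
)

lengths = list(read_lengths(toy_fastq))
mean_length = mean(lengths)
print(f"reads={len(lengths)} mean={mean_length:.1f} N50={n50(lengths)}")
```

For a real run, the `io.StringIO` object would be replaced by an open file handle; reporting N50 alongside the mean is a common choice because a few very long reads can skew the mean.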

Table 1: RNA sequencing kits available from Oxford Nanopore Technologies.

Long-read nanopore RNA sequencing was then compared directly with short-read RNA sequencing approaches, showing very high correlation between short- and long-read data at the gene level. This suggested that nanopore long-read data are “backward compatible” with data from previous short-read sequencing experiments. However, the long nanopore sequencing reads additionally provided a less biased, richer data set. Firstly, the 3’ bias present in short-read data was significantly reduced using long nanopore sequencing reads. Secondly, very few short reads spanned one or more splice sites, whereas the long reads, with their broader read-length distribution, spanned multiple splice sites. Finally, unlike short-read data, most of the sequences generated with nanopore methods mapped to single reference transcripts.
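A gene-level comparison between platforms of the kind described above can be sketched as a rank correlation over the genes quantified in both data sets. The example below is an assumption-laden illustration, not the analysis performed in the study: the source does not state which correlation measure was used, so Spearman's rank correlation is chosen here, and the gene names and counts are invented.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over any run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def gene_level_spearman(short_counts, long_counts):
    """Spearman correlation over the genes present in both count tables."""
    shared = sorted(set(short_counts) & set(long_counts))
    x = [short_counts[g] for g in shared]
    y = [long_counts[g] for g in shared]
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical gene-level counts (made up for illustration only).
short_read = {"GENE_A": 120, "GENE_B": 15, "GENE_C": 980, "GENE_D": 0, "GENE_E": 45}
long_read = {"GENE_A": 60, "GENE_B": 9, "GENE_C": 410, "GENE_D": 1, "GENE_E": 30}

rho = gene_level_spearman(short_read, long_read)
print(f"Spearman rho across shared genes: {rho:.3f}")
```

A rank-based measure is used in this sketch because absolute counts differ between platforms with very different read numbers and lengths, whereas the ordering of genes by expression is directly comparable.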

Overall, high consistency was found across replicates, protocols and platforms, with long-read RNA sequencing data showing significantly reduced biases compared with short-read data.

Find out more about this work by watching Jonathan's recent webinar: 'The SG-NEx project: nanopore long-read RNA-sequencing of human cancer cell lines'.

1. Pan et al. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature Genetics 40, 1413-1415 (2008).

2. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74 (2012).