Reference-free reconstruction and quantification of transcriptomes from nanopore long-read sequencing


Ivan de la Rubia (PhD student, Universitat Pompeu Fabra, Spain) started his talk by saying that nanopore sequencing presents the opportunity to measure the transcriptome of any sample. Current long-read methods for transcriptome analysis rely on comparison with a reference genome or transcriptome and often require multiple sequencing technologies, making it difficult to cost-effectively study the transcriptome of organisms that lack a reference. Ivan also highlighted that even where a reference genome is available, disease-specific transcripts are not directly identifiable from it, and that methods for DNA assembly cannot be directly transferred to transcriptome assembly because their consensus sequences cannot be interpreted to determine transcript isoforms.

For these reasons, Ivan wanted to build a method that clusters similar reads at the transcript level so that each cluster represents a single transcript, then performs an error correction step, and finally quantifies isoform expression to produce a transcriptome with the expression level of each transcript. He set out to complete all of these steps without the need for a reference genome or transcriptome. To do this, Ivan introduced RATTLE. RATTLE takes long nanopore reads and first builds clusters in which each cluster represents a specific gene. Two reads are placed in the same gene cluster if they are similar enough; similarity is assessed with a two-step algorithm to improve the run time.

In the first step, RATTLE computes the k-mer content of each read and then counts the k-mers the two reads have in common to create a similarity score. The similarity score is calculated as the number of unique k-mers in common divided by the maximum number of unique k-mers in either of the two reads. If the similarity score reaches a predefined threshold, a slower second comparison is conducted between the two reads. The second step sorts all of the k-mers together with their positions, and then compares the positions of the first read’s k-mers with those of the second read’s k-mers using a dynamic programming algorithm, producing a set of shared k-mers that appear in the same linear order in both reads. The RATTLE score is defined as the number of positions covered by these k-mers divided by the length of the shortest read. If the reads are similar enough according to the RATTLE score, they are placed in the same cluster.
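The two-step comparison described above can be illustrated with a minimal Python sketch. It is not RATTLE's implementation: the function names, the value of k, the quadratic-time chaining, and the choice to count covered positions on the shorter read are all assumptions made for illustration.

```python
from collections import defaultdict


def similarity_score(a: str, b: str, k: int) -> float:
    """Step 1 (fast screen): shared unique k-mers / larger unique k-mer count."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not kmers_a or not kmers_b:
        return 0.0
    return len(kmers_a & kmers_b) / max(len(kmers_a), len(kmers_b))


def collinear_chain(a: str, b: str, k: int) -> list[tuple[int, int]]:
    """Step 2: longest chain of shared k-mers whose positions increase in both reads.

    Simple O(m^2) dynamic programme over all matching position pairs (an
    assumption; it maximises the number of chained k-mers, not covered bases).
    """
    positions_in_b = defaultdict(list)
    for j in range(len(b) - k + 1):
        positions_in_b[b[j:j + k]].append(j)
    # Pairs (i, j) with a[i:i+k] == b[j:j+k], already sorted by i, then j.
    matches = [(i, j)
               for i in range(len(a) - k + 1)
               for j in positions_in_b.get(a[i:i + k], ())]
    if not matches:
        return []
    best = [1] * len(matches)    # best chain length ending at each match
    prev = [-1] * len(matches)   # back-pointers for reconstruction
    for t, (i, j) in enumerate(matches):
        for s in range(t):
            pi, pj = matches[s]
            if pi < i and pj < j and best[s] + 1 > best[t]:
                best[t] = best[s] + 1
                prev[t] = s
    t = max(range(len(matches)), key=best.__getitem__)
    chain = []
    while t != -1:
        chain.append(matches[t])
        t = prev[t]
    return chain[::-1]


def chain_score(a: str, b: str, k: int) -> float:
    """Positions covered by the chained k-mers / length of the shorter read."""
    chain = collinear_chain(a, b, k)
    shorter_len = min(len(a), len(b))
    if not chain or shorter_len == 0:
        return 0.0
    # Assumption: coverage is counted on the shorter read.
    starts = [i for i, _ in chain] if len(a) <= len(b) else [j for _, j in chain]
    covered, end = 0, 0
    for p in starts:  # strictly increasing, so overlapping k-mers merge cleanly
        covered += p + k - max(p, end)
        end = p + k
    return covered / shorter_len


read_1 = "ACGTTGCAAT"
read_2 = "ACGTGCAAT"   # read_1 with a single-base deletion
print(similarity_score(read_1, read_2, k=3))
print(chain_score(read_1, read_2, k=3))
```

In this toy example the two reads differ by a single deletion: the first score drops because some k-mers are lost, while the chained k-mers still cover every position of the shorter read. Either score would then be compared against a user-chosen threshold.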

Once reads have been clustered by gene, Ivan explained how RATTLE differentiates gene isoforms. Reads from different isoforms share many of the same k-mers but differ in length because of their different exon content, whereas reads from the same isoform differ only by small base differences, such as insertions or deletions, that result from basecalling errors. The variance between two reads from different transcripts is expected to be larger than the variance between reads from the same transcript, so a predetermined threshold is used to decide whether two reads come from the same transcript. This results in the production of transcript clusters. The reads in each transcript cluster are then aligned using a partial order alignment algorithm and corrected using the consensus from each column, taking the quality of each base into account. A final round of clustering and error correction is then performed before the final transcriptome is produced and the isoforms are quantified.
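The column-wise correction step can be sketched as follows. This is an illustrative simplification, not RATTLE's code: it assumes the reads have already been aligned into equal-length rows with `-` for gaps, weights each base by the probability implied by its Phred quality, and lets gaps vote with a fixed, hypothetical `gap_weight`.

```python
from collections import defaultdict


def quality_weighted_consensus(aligned_reads, qualities, gap="-", gap_weight=0.5):
    """Pick, for every alignment column, the symbol with the largest summed weight.

    aligned_reads: equal-length strings (rows of a multiple alignment).
    qualities:     per-row lists of Phred scores, same shape as aligned_reads
                   (the value stored at a gap position is ignored).
    gap_weight:    hypothetical fixed vote given to each gap.
    Columns where the gap wins are dropped from the consensus.
    """
    consensus = []
    for col in range(len(aligned_reads[0])):
        weight = defaultdict(float)
        for read, qual in zip(aligned_reads, qualities):
            symbol = read[col]
            if symbol == gap:
                weight[gap] += gap_weight
            else:
                # Phred score -> probability that the base call is correct.
                weight[symbol] += 1.0 - 10 ** (-qual[col] / 10)
        winner = max(weight, key=weight.get)
        if winner != gap:
            consensus.append(winner)
    return "".join(consensus)


reads = ["ACGT-A",
         "ACCT-A",
         "ACGTTA"]
quals = [[30, 30, 30, 30, 0, 30],
         [30, 30, 5, 30, 0, 30],
         [30, 30, 30, 30, 3, 30]]
print(quality_weighted_consensus(reads, quals))  # prints ACGTA
```

Here the low-quality `C` in the second read and the low-quality inserted `T` in the third read are both outvoted, so the consensus keeps the sequence supported by the higher-quality bases.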

For the next part of his talk, Ivan presented a benchmarking analysis of RATTLE for clustering, error correction, and quantification. When compared with isONclust and CARNAC-LR using the adjusted Rand index, RATTLE outperformed both tools at the gene level; the comparison was made at the gene level because the other tools cannot differentiate reads at the transcript level. Benchmarking using murine cDNA and RNA data demonstrated even better clustering results. Ivan then explained how RATTLE improved the percentage of identified transcripts and reduced the error rate to <0.1%. Finally, using a data set in which the real transcript abundances are approximately known, Ivan demonstrated RATTLE’s ability to predict the abundance of the isoform transcripts present. RATTLE was comparable to other transcript quantification methods, and in some cases better, but importantly it was the only method that does not require a reference.
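For readers unfamiliar with the adjusted Rand index, the sketch below computes it from two labelings of the same reads using the standard pair-counting formula. The function name and the toy labels are illustrative and are not taken from the talk.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two clusterings of the same items.

    1.0 means identical partitions (label names do not matter); values near
    0 are what random assignment would give; negative values are possible.
    """
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))   # contingency table
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions agree trivially
    expected = sum_true * sum_pred / total_pairs
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both clusterings are a single cluster
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical example: true gene of origin vs. cluster assigned by a tool.
true_genes = ["g1", "g1", "g2", "g2"]
print(adjusted_rand_index(true_genes, ["c7", "c7", "c3", "c3"]))  # prints 1.0
print(adjusted_rand_index(true_genes, ["c1", "c2", "c1", "c2"]))  # prints -0.5
```

Because the index is corrected for chance, a tool that splits or merges genes at random scores close to zero regardless of how many clusters it produces, which makes it suitable for comparing clustering tools on the same reads.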

For the final part of his talk, Ivan discussed applying RATTLE to human datasets to assess its predicted transcript quantification in comparison with other methods. As with the murine dataset, RATTLE showed a high correlation with the approximate real quantification data and produced a higher number of transcripts than the other comparable methods.

Ivan encourages anyone to try out RATTLE here: https://github.com/comprna/RATTLE

Authors: Ivan De La Rubia