Two-pass alignment using machine-learning-filtered splice junctions increases the accuracy of intron detection in long-read RNA sequencing

Transcription of eukaryotic genomes involves complex alternative processing of RNAs. Sequencing of full-length RNAs using long reads reveals the true complexity of processing. However, the relatively high error rates of long-read technologies can reduce the accuracy of intron identification.

Here we present a two-pass approach that combines alignment metrics and machine-learning-derived sequence information to filter spurious splice junctions from those identified in long-read alignments. The remaining junctions are then used to guide realignment. This method, available in the software package 2passtools (https://github.com/bartongroup/2passtools), improves the accuracy of spliced alignment and transcriptome annotation without requiring orthogonal information from short-read RNA-seq or existing annotations.
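The filtering step between the two passes can be illustrated with a minimal sketch. This is not the 2passtools implementation or its API: the function names, the toy genome, and the thresholds are hypothetical, and a simple read-support count plus a canonical GT/AG motif check stands in for the alignment metrics and machine-learning sequence models described above.

```python
from collections import Counter

def count_junctions(alignments):
    """Count how many reads support each (start, end) intron from a first-pass alignment.

    `alignments` is a hypothetical toy structure: one list of introns per read,
    with 0-based, end-exclusive coordinates.
    """
    counts = Counter()
    for introns in alignments:
        counts.update(set(introns))  # each read counts once per junction
    return counts

def filter_junctions(counts, genome, min_support=2):
    """Keep junctions with enough read support and a canonical GT/AG motif.

    Assumption: this simple rule is a stand-in for the metric-based and
    machine-learning filters; it is not the published method.
    """
    kept = set()
    for (start, end), n_reads in counts.items():
        donor = genome[start:start + 2]      # first two intron bases
        acceptor = genome[end - 2:end]       # last two intron bases
        if n_reads >= min_support and donor == "GT" and acceptor == "AG":
            kept.add((start, end))
    return kept

# Toy example: one real intron at (4, 12) and one misaligned junction at (5, 12).
genome = "AAAA" + "GT" + "CCCC" + "AG" + "AAAA"
alignments = [[(4, 12)], [(4, 12)], [(5, 12)]]

counts = count_junctions(alignments)
kept = filter_junctions(counts, genome)
```

In a real workflow the retained junction set would be supplied to the aligner as guidance for the second pass, so that reads supporting the filtered-out junction can be realigned to the retained one.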

Authors: Matthew T. Parker, Geoffrey J. Barton, Gordon G. Simpson