LRA: a long read aligner for sequences and assembly contigs
Jingwen Ren, from Mark Chaisson’s lab at the University of Southern California, joined the conference to discuss the development of their new long-read aligner, LRA. As long-read sequencing has become more routine, Jingwen explained, it has become possible to detect structural variants (SVs) from alignments as well as assembly contigs, opening up a need for better dedicated software.
One of the biggest challenges in alignment is choosing an appropriate gap penalty, because different gap penalty functions produce different alignments that either reflect or obscure biological variants. Using an example with three exact matches, Jingwen compared a convex gap penalty with an affine one: the convex function gave one large insertion, whereas the affine function gave two smaller ones. One large insertion, Jingwen explained, is preferable to several smaller events.
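The effect of the penalty function on such a case can be sketched in a few lines of Python. This is an illustration only: the match lengths, the penalty parameters and the use of a logarithm as a stand-in for a convex gap penalty are all assumptions made for the example, not values taken from the talk or from LRA.

```python
import math

def affine_gap(length, gap_open=4.0, gap_extend=1.0):
    """Affine penalty: a fixed opening cost plus a constant cost per gapped base."""
    return gap_open + gap_extend * length

def log_gap(length, gap_open=4.0, scale=4.0):
    """Logarithmic penalty (hypothetical): each extra gapped base costs less than the last."""
    return gap_open + scale * math.log(length)

def chain_score(match_lengths, gap_lengths, gap_penalty):
    """Matched bases minus the penalty of every gap between consecutive matches."""
    return sum(match_lengths) - sum(gap_penalty(g) for g in gap_lengths)

def preferred_chain(gap_penalty):
    """Decide which of two candidate alignments scores higher under gap_penalty."""
    # Three exact matches of 50, 10 and 50 bases (made-up lengths).
    # Keeping the short middle match splits a 100-base insertion into two
    # 50-base gaps; skipping it leaves a single 100-base gap.
    two_gaps = chain_score([50, 10, 50], [50, 50], gap_penalty)
    one_gap = chain_score([50, 50], [100], gap_penalty)
    return "one large gap" if one_gap > two_gaps else "two smaller gaps"

print("affine:", preferred_chain(affine_gap))
print("logarithmic:", preferred_chain(log_gap))
```

With these made-up numbers, the affine function gains more from the short middle match than it pays for opening a second gap, whereas the logarithmic function makes two medium gaps noticeably more expensive than one long gap.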
The use of convex gap penalty functions is not new, and Jingwen detailed their use in commonly used nanopore aligners such as minimap2 and ngmlr. However, these aligners have their downsides, so Jingwen and team aimed to produce something with the best of both worlds: an aligner with an approach similar to that of ngmlr, but more efficient, with speed closer to that of minimap2.
Jingwen moved on to talk in detail about the methods and pipeline that underpin LRA, beginning with the use of seeding and chaining to map long reads back to a reference. This seeding and chaining step has three clear substeps: finding exact matches, applying sparse dynamic programming with convex gap penalties to find an optimal chain, then filling in the space between adjacent anchors with banded alignment. LRA departs from standard approaches in several places. As an example, Jingwen highlighted its use of minimisers in seeding, which is modified to avoid over-sampling in unique regions and to increase sampling in areas that distinguish repetitive regions.
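For readers unfamiliar with minimiser seeding, the standard scheme can be sketched as follows. This is the textbook version with lexicographic ordering, not LRA's modified sampling, which the talk describes only at a high level; the function names `minimisers` and `exact_match_anchors` are invented for this example.

```python
def minimisers(sequence, k, w):
    """Return sorted (position, k-mer) pairs: the smallest k-mer in each
    window of w consecutive k-mers, using plain lexicographic order."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        # min() returns the first minimum, so ties go to the leftmost k-mer.
        best = min(range(start, start + w), key=lambda i: kmers[i])
        selected.add((best, kmers[best]))
    return sorted(selected)

def exact_match_anchors(reference, read, k, w):
    """Pair up identical minimisers as (ref_pos, read_pos, length) anchors."""
    index = {}
    for pos, kmer in minimisers(reference, k, w):
        index.setdefault(kmer, []).append(pos)
    return [(ref_pos, read_pos, k)
            for read_pos, kmer in minimisers(read, k, w)
            for ref_pos in index.get(kmer, [])]

print(minimisers("ACGTACGT", k=3, w=2))
```

Because only one k-mer per window is kept, the index is smaller than a full k-mer index, while two sequences sharing a long enough exact match are still guaranteed to share a minimiser.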
Equipped with the anchors from the seeding step, the next stage is to find the best aligned path through the approximate aligned intervals by clustering, cluster splitting, and finally sparse dynamic programming on super-fragments. The anchors are clustered into fine clusters, which are then split on overlapping boundaries to improve identification of SV breakpoints with repetitive boundaries. These split fine clusters can then be represented as a super-fragment, which is fed into sparse dynamic programming with convex gap penalty to find the optimal chain through the super-fragments. Several steps are then performed to get pairwise alignment details based on the path found, including refining the anchors by replacing those too far apart with denser and smaller anchors.
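The idea of grouping anchors and summarising each group as a super-fragment can be illustrated with a deliberately simple sketch. The greedy rule, the thresholds `max_diagonal_shift` and `max_read_gap`, and the bounding-box representation are assumptions made for this example; the cluster-splitting step and LRA's actual clustering criteria are not reproduced here.

```python
def cluster_anchors(anchors, max_diagonal_shift, max_read_gap):
    """Greedily group (ref_pos, read_pos, length) anchors that lie on
    nearby diagonals and are close together in the read."""
    clusters = []
    for anchor in sorted(anchors, key=lambda a: (a[1], a[0])):
        ref_pos, read_pos, _ = anchor
        for cluster in clusters:
            last_ref, last_read, last_len = cluster[-1]
            diagonal_shift = abs((ref_pos - read_pos) - (last_ref - last_read))
            read_gap = read_pos - (last_read + last_len)
            if diagonal_shift <= max_diagonal_shift and read_gap <= max_read_gap:
                cluster.append(anchor)
                break
        else:
            # No existing cluster accepted the anchor, so start a new one.
            clusters.append([anchor])
    return clusters

def super_fragment(cluster):
    """Summarise a cluster as (ref_start, ref_end, read_start, read_end)."""
    return (min(a[0] for a in cluster), max(a[0] + a[2] for a in cluster),
            min(a[1] for a in cluster), max(a[1] + a[2] for a in cluster))

# Made-up anchors: the reference coordinate jumps ahead between the third
# and fourth anchor, as it would across a deletion in the read.
anchors = [(100, 0, 10), (112, 12, 10), (125, 25, 10), (300, 40, 10), (312, 52, 10)]
clusters = cluster_anchors(anchors, max_diagonal_shift=5, max_read_gap=20)
print([super_fragment(c) for c in clusters])
```

Chaining a handful of super-fragments rather than every individual anchor keeps the input to the dynamic programming step small.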
Sparse dynamic programming is mentioned a lot in this talk, Jingwen pointed out, but why use it? Showing a comparison table, Jingwen illustrated how this approach is more efficient, as the computational time is dependent on the number of anchors rather than sequence length, and anchor number is rarely larger than sequence length. Jingwen also referenced a 1992 paper that describes an exact solution to sparse dynamic programming with a convex gap function, but highlighted that it involved complex asynchronous processing and as such has never been applied to bioinformatics. To tackle this, the team developed a much simpler implementation with synchronous computation, as well as an extension that allows for inversions.
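To make the dependence on anchor count concrete, here is a minimal chaining sketch. It is a straightforward quadratic-time dynamic programme over anchors, not the faster algorithm from the 1992 paper or LRA's implementation, and the logarithmic `gap_cost` with its constants is a hypothetical choice. Inversions are not handled.

```python
import math

def gap_cost(gap):
    """Hypothetical logarithmic gap cost; free when anchors share a diagonal."""
    return 0.0 if gap == 0 else 2.0 + 2.0 * math.log(gap)

def best_chain(anchors):
    """Return (score, chain) for (ref_pos, read_pos, length) anchors.
    The work grows with the number of anchors, not with sequence length."""
    if not anchors:
        return 0.0, []
    anchors = sorted(anchors, key=lambda a: (a[1], a[0]))
    score = [float(a[2]) for a in anchors]
    previous = [None] * len(anchors)
    for j, (ref_j, read_j, len_j) in enumerate(anchors):
        for i in range(j):
            ref_i, read_i, len_i = anchors[i]
            # Anchor i must end before anchor j starts in both sequences.
            if ref_i + len_i > ref_j or read_i + len_i > read_j:
                continue
            gap = abs((ref_j - read_j) - (ref_i - read_i))
            candidate = score[i] + len_j - gap_cost(gap)
            if candidate > score[j]:
                score[j], previous[j] = candidate, i
    end = max(range(len(anchors)), key=lambda j: score[j])
    best_score, chain = score[end], []
    while end is not None:
        chain.append(anchors[end])
        end = previous[end]
    return best_score, chain[::-1]

# Made-up anchors: three collinear matches around a 100-base gap,
# plus one spurious match far away in the reference.
print(best_chain([(0, 0, 20), (30, 30, 20), (500, 40, 5), (160, 60, 20)]))
```

The two nested loops visit pairs of anchors only, so two megabase-scale sequences with a few thousand anchors cost far less than filling a full alignment matrix.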
Moving on to the results section of the talk, Jingwen gave data for the runtimes of different aligners on nanopore reads. On the same dataset, LRA took 130% of the alignment time of minimap2, but only 15% of the time taken by ngmlr. When applied to assembly contigs, the runtimes of minimap2 and LRA were approximately the same.
Jingwen made clear that the purpose of LRA is to generate better alignments across SV events, and so applied a Truvari analysis to compare SV call sets across different aligners on HG002. The precision, recall, and F1 score of LRA were significantly better than those of minimap2 and ngmlr, with particularly strong performance over large insertion sites, implying that LRA may be better at aligning across SVs. Jingwen also noted that there is room for improvement in SV calling software, as a large number of “false negatives” were identified as having over 20% supporting reads at that position.
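For reference, the three metrics follow the standard definitions and can be computed from the counts of true positives, false positives and false negatives. The sketch below uses invented counts purely to show the arithmetic; they are not results from the talk, and the function name `sv_benchmark_metrics` is hypothetical.

```python
def sv_benchmark_metrics(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) from SV call comparison counts."""
    called = true_positives + false_positives   # all SVs the caller reported
    truth = true_positives + false_negatives    # all SVs in the truth set
    precision = true_positives / called if called else 0.0
    recall = true_positives / truth if truth else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented counts for illustration only.
precision, recall, f1 = sv_benchmark_metrics(90, 10, 30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A call counted as a false negative lowers recall even when reads support it, which is why false negatives with substantial read support point to the SV caller rather than the aligner.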
With supporting IGV visualisations, Jingwen pulled out several specific examples of where LRA gave more reasonable alignment results than those of minimap2 – particularly where LRA results in an insertion call but minimap2 instead shows a deletion and clustered indels. Similar outcomes were seen when aligning assembly contigs, with minimap2 again displaying clustered indels at the boundary of a deletion, indicative of a less convincing alignment.
Concluding, Jingwen summarised their findings, namely that the approach used in LRA gives more accurate and sensitive alignment across SV sites than alternative state-of-the-art methods. LRA is available through Bioconda and is simple to install and run, and its use by a broader audience is encouraged.