Flye and metaFlye: algorithms for long-read de novo assembly using repeat graphs

Mikhail Kolmogorov from the University of California, San Diego, and one of the team behind popular assembly tools Flye and metaFlye, began his talk by reiterating the core goal of assembly: reconstructing a genome using a given set of sequencing reads. The typical approach to assembly is to look for reads that share overlaps, suggesting they come from related regions of the genome and allowing them to be stitched together like a puzzle. But this means assemblies, particularly of relatively large genomes, are typically fragmented and incomplete due to the presence of repeat stretches that make precise determination of overlap difficult.

Using an example, Mikhail demonstrated how using sequencing reads shorter than the repeat unit length makes the exact length of the repeats indistinguishable, causing assemblies to fail at these points. Instead, assembly can only be performed around the repeats, generating contigs with gaps between them. This highlights the need for sequencing reads that are long enough to span even the biggest repeat sections, meaning the technological development in long-read sequencing is driving genomic discoveries.

Mikhail was quick to point out though, that it is not only the sequencing technology that is pushing the dramatic improvement in genomic capability seen in recent years. Algorithmic development, by the nature of its role in analysing sequence data, defines what the genomics community is capable of; computational advances are critical for improving interpretation.

If the audience weren’t convinced, however, Mikhail presented a single, compelling example of the significance of algorithms when it comes to assembly performance. Recalling 2018, Mikhail displayed a digital karyotype representative of the first human genome assembly from nanopore data using ultra-long sequencing reads, where each coloured band represented a different section of contiguous sequence. Mikhail and team developed their assembly algorithm Flye some time later, and using the same dataset across different assembler versions, managed to achieve an impressive 3-fold improvement in contiguity – without making any changes to the original data. Crucially, Mikhail noted, we haven’t reached the full potential yet, and there is much more to come.

Having introduced the core concepts, Mikhail moved on to introduce assembly graphs, the data structures commonly used in assembly. Different algorithms build different graphs with different properties, but one of the most popular is a de Bruijn graph. In a de Bruijn graph built from short sequencing reads, for each short k-mer, there is a node on the graph, but if k-mers are sampled from a repetitive region they will correspond to the same node, collapsing repeats onto the same path. Moving from short to long sequencing reads circumvents some of this problem, but de Bruijn graphs rely on exact overlaps, meaning errors in long sequencing reads can also cause assembly failures and tangled graphs.

In order to overcome this in Flye, Mikhail explained, the team makes use of repeat graphs instead of de Bruijn graphs. Repeat graphs can be viewed as a generalisation of de Bruijn graphs, using approximate matches instead of exact matches to tolerate errors, and aiming to reveal genomic repeat structure identified by long reads.

To produce repeat graphs, Mikhail illustrated the process in detail, showing how local alignment of a genome against itself produces multiple alignments of the repeat sections, to which nodes can be assigned at the end points. Clustering the nodes and joining those with the same end point produces a looped graph, and by collapsing identical paths with the same sequence, the final repeat structure is revealed.

The genome traverses this graph somehow, and reads represent sections of paths, so aligning reads rapidly to the graph produces a “walk” along it, termed by the team a “disjointig”. An individual disjointig may not necessarily correspond accurately to the genome, Mikhail explained, but a complete set of disjointigs can be used to assemble a repeat graph, as any misassemblies in the disjointigs will be collapsed in the final graph. The very last step in the process is simplification; looping repeat graphs can make the directionality of repetitive sections unclear, so long reads spanning junctions between units are mapped to the graph to establish the direction in which the genome traverses the repeats. Mikhail showed the simplification process in Klebsiella pneumoniae, with a knotty graph transforming into a straightforward one with just one or two loops for which there were no reads long enough to span the full repeat section.

But, Mikhail questioned, how can this be applied to human genomes? The goal in this setting would be to complete assemblies of both parental haplotypes. Using 120X nanopore data, of which 60X was ultra-long, the team binned reads from HG002 into haplotypes using TrioCanu, before assembling with Flye and polishing with Medaka. This produced very strong results, yielding an NG50 of 38 Mb for the paternal haplotype and 45 Mb for the maternal.

Mikhail noted, though, that quality is still very important alongside contiguity. By updating the basecalling software used through each iteration, Mikhail was able to improve the QV scores from an initial QV28/29 for paternal/maternal haplotypes, up to QV38/39. With the R10.3 pore, this improved even further to QV40/42 when estimated using Merqury. These results, Mikhail described, are very exciting, and still don’t represent the most recent in basecalling improvements, so further gains are easily expected. In addition, near perfect phasing accuracy can also be achieved.

For the final section of the talk, Mikhail moved briefly over to the problem of metagenomic assembly. This problem, he outlined, is more challenging than single genome assembly, as species composition may be imbalanced, and repeats may be shared across distinct genomes. Short-read metagenomes are usually highly fragmented as a result, but long reads may prove in part a solution to some of the issues, and for this the team wrote metaFlye.

When showing data from a mock community sequenced on PromethION, Mikhail explained that many bacteria can be assembled completely, with only two retaining unresolved repeats – a promising and exciting first step. However, he continued, this did not apply directly to real samples, instead metaFlye produced very different and highly tangled assembly graphs. Nevertheless, when zooming into the paths present in these graphs, an interesting “bubbling” phenomenon can be observed, where the path splits into two before rejoining itself. Mikhail indicated that what they propose to be happening here is the graphing of significant heterogeneity within a mixed sample, and perhaps instead of incorrect assembly, true variation is being captured.

When benchmarking on simulated data, metaFlye showed better performance on a wide range of bacteria, particularly those that were more complex, and when applied to real data a single run generated lots of complete and near-complete bacteria with a high degree of novelty.

Overall, Flye and metaFlye make use of repeat graphs to generate high quality and highly contiguous assemblies, from single bacteria, to human haplotypes, to mixed metagenomic communities.

Authors: Mikhail Kolmogorov