Benchmarking metagenomic classification tools for long-read sequencing data

In recent years, both long-read sequencing and metagenomic analysis have advanced significantly. Although long-read sequencing technologies have been used primarily for de novo genome assembly, they are rapidly maturing for use in other applications. In particular, long reads could lead to more precise taxonomic identification, which has sparked interest in using them for metagenomic analysis.

Here we present a benchmark of several state-of-the-art tools for metagenomic taxonomic classification, tested on in silico datasets constructed from real long reads obtained by isolate sequencing. We compare tools that were either newly developed for long reads or modified to work with them, including the k-mer-based tools Kraken2, Centrifuge and CLARK, and the mapping-based tools MetaMaps and MEGAN-LR. The test datasets were constructed with varying numbers of bacterial and eukaryotic genomes to simulate different real-life metagenomic applications. The tools were evaluated on their ability to detect species accurately and to estimate species abundances in the samples precisely.
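
To make the two evaluation goals concrete, the following is a minimal sketch of how species detection and abundance estimation could be scored against a known in silico composition. The function names, the choice of precision/recall and L1 distance, and the example abundances are illustrative assumptions; they are not taken from the benchmark itself, which may use different metrics.

```python
# Illustrative sketch only: the metric choices and all names below are assumptions,
# not the scoring procedure used in the benchmark.

def detection_metrics(true_species, predicted_species):
    """Precision and recall of species detection from two collections of species names."""
    true_set, pred_set = set(true_species), set(predicted_species)
    tp = len(true_set & pred_set)   # species correctly reported
    fp = len(pred_set - true_set)   # species reported but not in the sample
    fn = len(true_set - pred_set)   # species in the sample but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


def abundance_l1_error(true_abundance, estimated_abundance):
    """Sum of absolute differences between true and estimated relative abundances."""
    species = set(true_abundance) | set(estimated_abundance)
    return sum(
        abs(true_abundance.get(s, 0.0) - estimated_abundance.get(s, 0.0))
        for s in species
    )


# Hypothetical sample composition and classifier output (made-up values).
truth = {
    "Escherichia coli": 0.5,
    "Bacillus subtilis": 0.3,
    "Saccharomyces cerevisiae": 0.2,
}
estimate = {
    "Escherichia coli": 0.45,
    "Bacillus subtilis": 0.25,
    "Saccharomyces cerevisiae": 0.2,
    "Shigella flexneri": 0.1,  # a false-positive detection
}

metrics = detection_metrics(truth, estimate)
l1_error = abundance_l1_error(truth, estimate)
```

In this made-up example every true species is found, so recall is perfect, while the single spurious species lowers precision and also contributes to the abundance error.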

Our analysis shows that all tested classifiers provide useful results and that the composition of the database used strongly influences performance. With the same database, the tested tools achieve comparable results, except for MetaMaps, which slightly outperforms the others in most metrics but is significantly slower than the k-mer-based tools.

We believe there is significant room for improvement in all tested tools, especially in reducing the number of false-positive detections.

Authors: Josip Marić, Krešimir Križanović, Sylvain Riondet, Niranjan Nagarajan, Mile Šikić