Detection of plasmid contigs in draft genome assemblies using customized Kraken databases

Plasmids play an important role in bacterial evolution and mediate horizontal transfer of genes, including virulence and antimicrobial resistance genes. Although short-read sequencing technologies have enabled large-scale bacterial genomics, the resulting draft genome assemblies are often fragmented into hundreds of discrete contigs, which makes detailed characterization of plasmids difficult. Several tools and approaches have been developed to identify plasmid sequences in such assemblies, but they require a trade-off between sensitivity and specificity.

Here we propose using the Kraken classifier, together with a custom Kraken database comprising known chromosomal and plasmid sequences of the Klebsiella pneumoniae species complex (KpSC), to identify plasmid-derived contigs in draft assemblies. We assessed performance using Illumina-based draft genome assemblies for 82 KpSC isolates, for which complete genomes were available to supply ground truth. When benchmarked against five other classifiers (Centrifuge, RFPlasmid, mlplasmids, PlaScope, and Platon), Kraken showed balanced performance in terms of overall sensitivity and specificity (90.8% and 99.4%, respectively, for contig count; 96.5% and >99.9%, respectively, for cumulative contig length), as well as the highest accuracy (96.8% vs 91.8%-96.6% for contig count; 99.8% vs 99.0%-99.7% for cumulative contig length) and F1 score (94.5% vs 84.5%-94.1% for contig count; 98.0% vs 88.9%-96.7% for cumulative contig length). Kraken was also among the most consistent performers at the individual genome level.
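The two ways of scoring reported above (by contig count and by cumulative contig length) can be illustrated with a minimal sketch. It assumes that "plasmid" is the positive class and that each contig carries a ground-truth label and a predicted label; the `Contig` record, the function names, and the toy data are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    length: int      # contig length in bp
    truth: str       # "plasmid" or "chromosome" (ground truth)
    predicted: str   # "plasmid" or "chromosome" (classifier call)


def confusion(contigs, weight_by_length=False):
    """Return (TP, FP, TN, FN), counting each contig once or weighting by its length."""
    tp = fp = tn = fn = 0
    for c in contigs:
        w = c.length if weight_by_length else 1
        if c.truth == "plasmid":
            if c.predicted == "plasmid":
                tp += w
            else:
                fn += w
        else:
            if c.predicted == "plasmid":
                fp += w
            else:
                tn += w
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and F1, with plasmid as the positive class."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }


# Toy data (hypothetical): one contig in each cell of the confusion matrix.
toy = [
    Contig(5000, "plasmid", "plasmid"),
    Contig(1000, "plasmid", "chromosome"),
    Contig(200000, "chromosome", "chromosome"),
    Contig(2000, "chromosome", "plasmid"),
]

by_count = metrics(*confusion(toy))
by_length = metrics(*confusion(toy, weight_by_length=True))
```

Weighting by length lets long, correctly classified contigs dominate the score, so short misclassified contigs lower the count-based metrics more than the length-based ones.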

Furthermore, we demonstrate that expanding the Kraken database with additional known chromosomal and plasmid sequences (a simple procedure that, unlike other methods, does not require any model training) can further improve classification performance. Although we have focused here on the KpSC, this methodology could easily be applied to other species with a sufficient number of completed genomes.

IMPACT STATEMENT

The assembly of bacterial genomes using short-read data often results in hundreds of discrete contigs due to the presence of repeat sequences in those genomes. Separating plasmid contigs from chromosomal contigs in such assemblies is required, e.g., to assess the mobility of antimicrobial resistance genes. Although several tools have been developed for that purpose, they often suffer from low sensitivity or specificity.

Here, we propose that the Kraken classifier coupled with a custom Kraken database comprising plasmid-free chromosomal sequences and complete plasmid sequences can be used for detection of plasmid contigs in draft genome assemblies.
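As a rough illustration of how per-contig calls might be read from the classifier, the sketch below parses Kraken's standard tab-delimited output (classification status, sequence ID, taxonomy ID, sequence length, LCA mapping) and maps each contig to a label. The taxonomy IDs, the mapping dictionary, and the "ambiguous" fallback are assumptions made for illustration; the abstract does not specify how the custom database encodes chromosomal versus plasmid sequences.

```python
# Hypothetical taxonomy IDs for the two classes in a custom database.
TAXID_TO_LABEL = {
    "1000001": "chromosome",
    "1000002": "plasmid",
}


def label_contigs(kraken_output, taxid_to_label=TAXID_TO_LABEL):
    """Map contig IDs to 'plasmid', 'chromosome', 'ambiguous' or 'unclassified'."""
    labels = {}
    for line in kraken_output.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        status, contig_id, taxid = fields[0], fields[1], fields[2]
        if status == "U":
            # Kraken marks sequences with no database hit as unclassified.
            labels[contig_id] = "unclassified"
        else:
            # Any taxon outside the two known classes (e.g. a shared
            # ancestor node) is treated as ambiguous in this sketch.
            labels[contig_id] = taxid_to_label.get(taxid, "ambiguous")
    return labels


# Toy output lines (hypothetical contig names and k-mer mappings).
example = (
    "C\tcontig_1\t1000001\t250000\t1000001:240\n"
    "C\tcontig_2\t1000002\t4800\t1000002:95\n"
    "C\tcontig_3\t1\t900\t1:12\n"
    "U\tcontig_4\t0\t600\t0:20\n"
)

calls = label_contigs(example)
```

How ambiguous and unclassified contigs are handled is a design choice left open here; a real pipeline would need an explicit rule for them before computing performance metrics.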

We showed that Kraken achieved more balanced and higher overall performance than other methods (Centrifuge, RFPlasmid, mlplasmids, PlaScope, and Platon). We therefore suggest that the Kraken classifier may be the best option for predicting the origin of contigs for species with a suitable number of completed chromosomal and plasmid sequences.

Authors: Ryota Gomi, Kelly L. Wyres, Kathryn E. Holt