High quality 3C de novo assembly and annotation of a multidrug resistant ST-111 Pseudomonas aeruginosa genome: Benchmark of hybrid and non-hybrid assemblers


Genotyping methods and genome sequencing are indispensable for revealing the genomic structure of bacterial species displaying high levels of genome plasticity. However, genome reconstruction (assembly) is not straightforward because of data complexity, including repeats and the mobile and accessory genetic elements found in bacterial genomes. Moreover, since the outcome is strongly influenced by sequencing technology, bioinformatics pipelines, and the selection criteria used to assess assemblers, there is no systematic way to select the optimal assembler and parameter settings a priori.

To assemble the genome of Pseudomonas aeruginosa strain AG1 (PaeAG1), short-read (Illumina) and long-read (Oxford Nanopore) sequencing data were used in 13 different non-hybrid and hybrid approaches. PaeAG1 is a multiresistant, high-risk sequence type 111 (ST-111) clone isolated from a Costa Rican hospital, and it was the first reported P. aeruginosa isolate carrying both the blaVIM-2 and blaIMP-18 genes, which encode metallo-β-lactamase (MBL) enzymes.

To assess the assemblies, multiple metrics covering contiguity, correctness and completeness (the 3C criterion, as we define it here) were used to benchmark the 13 approaches and select a definitive assembly. In addition, annotation was performed to identify genes (coding sequences and RNA regions) and to describe the genomic content of PaeAG1. Long-read-only and hybrid approaches performed better, yielding greater contiguity, higher correctness and higher completeness.
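As a minimal illustration of the contiguity part of such a benchmark, the sketch below computes two widely used summary statistics from a list of contig sequences: the contig count and N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly). It also computes GC content. This is only an assumption about what a contiguity check could look like; the abstract does not state which specific metrics or tools were used, and the function names `n50`, `gc_content` and `contiguity_summary` are hypothetical.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the running sum to >= 50% of the total.
        if running * 2 >= total:
            return length
    return 0


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def contiguity_summary(contigs: List[str]) -> Dict[str, float]:
    """Summarize an assembly given its contig sequences (hypothetical helper)."""
    lengths = [len(c) for c in contigs]
    return {
        "num_contigs": len(contigs),
        "total_length": sum(lengths),
        "n50": n50(lengths),
        "gc_content": gc_content("".join(contigs)),
    }


if __name__ == "__main__":
    # Toy contigs for illustration only; not data from the study.
    toy_contigs = ["ATGCGC" * 10, "GGCCAT" * 5, "ATAT" * 3]
    print(contiguity_summary(toy_contigs))
```

For a single circular sequence, as in a fully closed bacterial genome, the contig count is 1 and N50 equals the genome length, which is why contiguity alone cannot rank closed assemblies and has to be read together with correctness and completeness.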

A manually curated and polished hybrid assembly yielded a single circular sequence in which 100% of core genes and known regions were identified, >98% of reads mapped back to the genome, and coverage was uniform with no gaps. The strategy followed to obtain this high-quality 3C assembly is detailed in the manuscript, and we provide readers with an all-in-one script to replicate our results or to apply the approach to other troublesome cases.

The final 3C assembly revealed that the PaeAG1 genome comprises 7,190,208 bp, with a GC content of 65.7% and 6,709 genes (6,620 coding sequences), many of which lie within mobile genomic elements, including 57 genomic islands, six prophages, and two complete integrons carrying the blaVIM-2 and blaIMP-18 MBL genes. Up to 250 of the predicted genes are anticipated to play a role in virulence (adherence, quorum sensing and secretion), and 60 are anticipated to be involved in antibiotic resistance (β-lactamases, efflux pumps, etc.).

Altogether, the assembly and annotation of the PaeAG1 genome provide new perspectives for continuing to investigate the genomic diversity and gene content of this important human pathogen.

Authors: José Arturo Molina-Mora, Rebeca Campos-Sánchez, César Rodríguez, Leming Shi, Fernando García