Nucleotide-resolution bacterial pan-genomics with reference graphs

Bacterial genomes follow a U-shaped frequency distribution in which most genomic loci are either rare (accessory) or common (core); the alignable fraction of two genomes from a single species might be only 50%. Standard tools therefore analyse mutations only in the core genome, ignoring accessory mutations.
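
To make the U-shaped distribution concrete, the following minimal sketch counts how often each locus occurs across a toy set of genomes and bins loci by frequency. It is illustrative only and is not part of pandora: the function names, the thresholds `rare_max` and `core_min`, and the toy data are all assumptions, and the shared-locus fraction is only a crude stand-in for the alignable fraction of two genomes.

```python
from collections import Counter


def locus_frequencies(genomes):
    """Return the fraction of genomes in which each locus is present."""
    counts = Counter(
        locus for loci in genomes.values() for locus in set(loci)
    )
    n_genomes = len(genomes)
    return {locus: count / n_genomes for locus, count in counts.items()}


def classify(freq, rare_max=0.25, core_min=0.95):
    """Bin a locus by frequency; both thresholds are hypothetical."""
    if freq >= core_min:
        return "core"
    if freq <= rare_max:
        return "rare"
    return "intermediate"


def shared_fraction(loci_a, loci_b):
    """Fraction of loci shared by two genomes (Jaccard index)."""
    a, b = set(loci_a), set(loci_b)
    return len(a & b) / len(a | b)


# Toy presence lists: loci A and B are in every genome, the rest are sparse.
genomes = {
    "g1": ["A", "B", "C", "D"],
    "g2": ["A", "B", "E"],
    "g3": ["A", "B", "F", "C"],
    "g4": ["A", "B", "G"],
}

freqs = locus_frequencies(genomes)
class_counts = Counter(classify(f) for f in freqs.values())
print(dict(class_counts))
print(shared_fraction(genomes["g1"], genomes["g2"]))
```

In this toy set most loci fall into the rare or core bins and few sit in between, which is the U shape described above; an analysis restricted to the core bin would discard every variant in the remaining loci.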

We present a novel pan-genome graph structure and algorithms implemented in the software pandora, which approximates a sequenced genome as a recombinant of reference genomes, detects novel variation and then pan-genotypes multiple samples.

Constructing a reference graph from 578 E. coli genomes, we analyse a diverse set of 20 E. coli isolates. We show that, for rare variants, pandora recovers at least 13k more SNPs than single-reference-based tools, achieves error rates with Nanopore data equal to or better than those with Illumina data, and provides a stable framework for analysing diverse samples without reference bias. This is a significant step towards comprehensive analyses of bacterial genetic variation.

Authors: Rachel M Colquhoun, Michael B Hall, Leandro Lima, Leah W Roberts, Kerri M Malone, Martin Hunt, Brice Letcher, Jane Hawkey, Sophie George, Louise Pankhurst, Zamin Iqbal