Capturing global genomic diversity in the human pangenome with long nanopore reads

While the human reference genome has been a remarkable enabling resource for the scientific community for the past 20 years, it has always been fundamentally limited because it represents only a small proportion of human genetic diversity. For applications that require comparing an individual’s genome sequence or genetic variants to the reference, the interpretation process can falter for anyone whose ancestry differs significantly from that of the small number of samples used to build the reference genome. Indeed, while the current reference includes data from about 20 people, the majority of it is based on a single person’s genome.

To overcome this issue, the Human Pangenome Reference Consortium (HPRC) recently published the first draft of a new, graph-based approach to a reference: the pangenome, which represents a broader range of genetic diversity through the inclusion of sequence data from people of many different ancestries¹.

While the consortium’s work continues — they envision expanding this effort to cover 350 individuals in time — the first draft of the pangenome contains phased diploid assemblies from 47 people with diverse backgrounds (Figure 1). The project incorporated nanopore data and added 119 Mb of sequence data, including 90 Mb of structural variation, to the GRCh38 human reference genome.

'The human pangenome reference will enable us to represent tens of thousands of novel genomic variants in regions of the genome that were previously inaccessible.'²

The consortium deployed a number of technologies, including short-read sequencing, long-read sequencing, and optical maps. Together, these data produced assemblies representing more than 99% of the genome sequence with greater than 99% accuracy, measured at both the base-pair and structural levels.

The team used the PromethION sequencing device from Oxford Nanopore Technologies to generate ultra-long reads from each of the 47 samples. For the 29 samples sequenced entirely by the consortium, the average read length was 28.4 kb; for 18 additional samples with prior sequencing data from other sources, the nanopore read length N50 value was approximately 44 kb.
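Because the paragraph above quotes both an average read length and a read length N50, a short illustration of how the two statistics differ may help. The following is a minimal sketch only: the function name and the toy read lengths are hypothetical and are not taken from the consortium's data or pipeline. It uses the common definition of N50 as the length L such that reads of length L or longer contain at least half of all sequenced bases.

```python
def read_length_n50(read_lengths):
    """Return the read length N50 for a list of read lengths.

    N50 is taken here as the length L such that reads of length >= L
    together contain at least half of all sequenced bases.
    """
    if not read_lengths:
        raise ValueError("need at least one read length")
    total_bases = sum(read_lengths)
    running_bases = 0
    # Walk from the longest read downwards until half the bases are covered.
    for length in sorted(read_lengths, reverse=True):
        running_bases += length
        if 2 * running_bases >= total_bases:
            return length


# Toy read lengths in kb (illustrative values only, not real data).
toy_lengths_kb = [5, 5, 10, 20, 20, 40]
mean_kb = sum(toy_lengths_kb) / len(toy_lengths_kb)
n50_kb = read_length_n50(toy_lengths_kb)
print(f"mean = {mean_kb:.1f} kb, N50 = {n50_kb} kb")
```

In this toy set the N50 exceeds the mean because a few long reads hold most of the bases, which is why the two figures are not directly interchangeable.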

The team compared the results of nanopore sequencing data to data from another long-read sequencing platform. For the other platform, they generated about 40x genome coverage with a read length N50 of 19.6 kb. The same samples were sequenced with a nanopore device to coverage ranging from 10.5-fold to 43-fold. Based on an assessment of quality and mapping accuracy, the team found that while both approaches led to reliable data, the nanopore sequencing data required less coverage to achieve the same level of accuracy.
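The fold-coverage figures quoted above can be related to sequencing yield with simple arithmetic: coverage is the total number of sequenced bases divided by the genome size. The sketch below is illustrative only. The 3.1 Gb genome size is an assumed round figure for a haploid human genome, and the function names are hypothetical rather than part of any tool the consortium used.

```python
# Assumed approximate haploid human genome size in bases (not from the article).
ASSUMED_GENOME_SIZE_BP = 3_100_000_000


def fold_coverage(total_bases, genome_size=ASSUMED_GENOME_SIZE_BP):
    """Average fold coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def bases_for_coverage(target_fold, genome_size=ASSUMED_GENOME_SIZE_BP):
    """Sequenced bases needed to reach a target average fold coverage."""
    return target_fold * genome_size


# Yield needed for the lowest and highest coverages mentioned in the text.
for fold in (10.5, 43):
    gigabases = bases_for_coverage(fold) / 1e9
    print(f"{fold}-fold coverage needs roughly {gigabases:.1f} Gb of sequence")
```

Under that assumption, the quoted range of 10.5-fold to 43-fold corresponds to very different sequencing yields per sample, which is why the coverage needed to reach a given accuracy matters in practice.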

One of the goals in generating the pangenome was to improve genome interpretation, including variant calls, based on comparing new sequence data to this novel reference. To that end, the scientists assessed the comparison process and evaluated any computational challenges involved in dealing with the more complex graph-based assembly. They reported that ‘making the switch to using pangenome mapping is not significantly more computationally expensive and resulted in an average 34% reduction in false positive and false-negative [small variant] errors compared with using the standard reference methods’. They further highlighted how ‘pangenomes not only improve variant calling but also improve transcript mapping accuracy and detection of ChIP-seq peaks’.

'highly accurate haplotype-resolved assemblies enabled us to access previously inaccessible regions, highlighting new forms of genetic variation and providing new insights into mutational processes such as interlocus gene conversion'¹

The pangenome also facilitates analysis of structural variants (SVs), elements that are typically too long or too complex to be accurately represented in short-read sequence data. Sequencing much longer stretches of DNA in each read, such as through nanopore sequencing, better captures full SVs for downstream analysis, with recent sequencing efforts routinely discovering around 25,000 SVs per human genome³. Using a pangenomic method known as PanGenie⁴ to analyse short-read data against the nanopore-enabled pangenome, scientists demonstrated that an average of 18,500 SVs could be genotyped in each sample, revealing how long nanopore sequencing reads can further enhance the utility of existing short-read datasets.

Critically, the pangenome is an important resource for addressing the lack of representation of global diversity in clinical research, enabling the characterisation of potentially significant variants in previously underrepresented populations. Already, the resource has been shown to improve read mapping and the calling of small variants. At the SV level, those improvements could be more significant still. The authors noted how ‘the pangenome might improve SV genotyping differently across individuals owing to the stronger divergence of the alleles from the reference’ and that ‘in the future, the combination of the pangenome and low-cost long-read sequencing should prove to be a potent combination for comprehensive SV genotyping’. As well as expanding the pangenome to a larger, more diverse cohort in the near future, the group also plans to move towards telomere-to-telomere genomic assemblies — ‘to properly represent the entire genome in almost all individuals’.

1. Liao, WW. et al. Nature 617, 312–324 (2023). DOI: https://doi.org/10.1038/s41586-023-05896-x

2. NIH. Scientists release a new human ‘pangenome’ reference. https://www.nih.gov/news-events/news-releases/scientists-release-new-human-pangenome-reference (2023) [Accessed: 13 September 2023]

3. Ebbert, M.T.W. et al. Genome Biol. 20, 97 (2019). DOI: https://doi.org/10.1186/s13059-019-1707-2

4. Ebler, J. et al. Nat. Genet. 54, 518–525 (2022). DOI: https://doi.org/10.1038/s41588-022-01043-w