The most complete human genome ever assembled with a single technology
Although short-read sequencing technologies have improved considerably over the last decade in terms of yield and turnaround times, according to Jain and coworkers: ‘assembling human genomes with high accuracy and completeness remains challenging’ 1.
At ~3.1 Gb, the human genome is not only large, it also contains regions of uneven nucleotide composition, high levels of repetitive content (up to 69% 2) and large segmental duplications. As a result, most human genome assemblies are highly fragmented and contain gaps that limit both their structural integrity and their subsequent biological interpretation. Furthermore, short sequencing reads prevent the assignment of alleles or variants to their original chromosome. Such ‘phasing’ information provides significantly more insight into gene expression and function and is of particular importance when studying genetic disease.
To assess the potential of long nanopore sequencing reads to overcome these issues and deliver more contiguous, complete genomes, a team of researchers from the UK, USA and Canada used the MinION to sequence the well-characterised human reference genome NA12878. The team deployed a standard kit-based DNA extraction method together with the Ligation Sequencing Kit to generate long sequence reads. In total, 91.2 Gb of data was generated, equivalent to approximately 30x genome coverage.
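The coverage figure follows from dividing the total sequenced bases by the genome size. A minimal Python sketch of that arithmetic, using the ~3.1 Gb genome size and 91.2 Gb yield quoted above; the function and variable names are illustrative and are not taken from the study:

```python
def fold_coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Figures quoted in the text (approximate).
HUMAN_GENOME_BP = 3.1e9   # ~3.1 Gb human genome
standard_run_bp = 91.2e9  # 91.2 Gb of nanopore data

print(f"{fold_coverage(standard_run_bp, HUMAN_GENOME_BP):.1f}x")  # prints "29.4x"
```

The result of roughly 29.4 is consistent with the rounded 30x quoted above.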
Using the assembly tool Canu, a highly contiguous assembly was produced, comprising 2,886 contigs with an NG50* contig size of ~3 Mb. The superior genome contiguity offered by nanopore sequencing was exemplified by the inclusion of the highly repetitive — and thereby notoriously difficult to assemble — HLA class I region in a single contig.
At over 2 Mb, the longest sequencing read set a new record for a single contiguous DNA sequence.
To investigate the impact of increasing read length on assembly contiguity, the team also used a modified phenol:chloroform extraction technique together with the streamlined Rapid Sequencing Kit to generate ultra-long reads. Approximately 18 Gb of ultra-long read data was obtained (equivalent to 5x genome coverage), with the longest mapped read being 882 kb. More recently, researchers from the University of Nottingham have obtained a human ultra-long read in excess of 2 Mb, a new record for a single contiguous DNA sequence 3.
Long nanopore sequencing reads allowed phasing of the entire 4 Mb MHC region.
The additional ultra-long reads not only doubled the assembly contiguity (NG50 ~6.4 Mb) but also significantly improved the ability to phase alleles. For example, it was possible to phase the entire 4 Mb major histocompatibility complex (MHC), which was contained within a single 16 Mb contig (Figure 9). As stated by the researchers: ‘The increased single-molecule read length that we report here, obtained using a MinION nanopore sequencer, enabled us to analyse regions of the human genome that were previously intractable with state-of-the-art sequencing methods’ 1. This was borne out when the nanopore data were used to close 12 large (>50 kb) gaps in the GRCh38 reference genome, which corresponded to 83,980 bp of previously unknown euchromatic sequence.
Unlike short-read technology, nanopore sequencing also allows the direct detection of DNA modifications alongside the nucleotide sequence. In this study, the levels of 5-methylcytosine (5mC) detected were highly concordant with results obtained using alternative methylation analysis techniques.
Data from this study is available at: github.com/nanopore-wgs-consortium/NA12878
* The NG50 value represents the longest contig such that contigs of this length or greater sum to at least half of the haploid genome size.
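The NG50 definition above can be sketched in a few lines of Python. This is an illustrative implementation of the stated definition, not the code used by the study or by any assembly tool; the function name and the toy contig lengths are assumptions made for the example:

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the running total of contigs,
    taken from longest to shortest, first reaches half the genome size.
    Returns None if the contigs together cover less than half the genome."""
    half_genome = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half_genome:
            return length
    return None


# Hypothetical toy assembly (arbitrary units), not data from the study.
toy_contigs = [50, 40, 30, 20, 10]
print(ng50(toy_contigs, genome_size=200))  # prints 30
```

In the toy case, half the genome is 100; the two longest contigs sum to 90 and adding the third reaches 120, so the NG50 is 30. Unlike N50, which is measured against the total assembly length, NG50 is measured against the genome size.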
This case study is taken from the human white paper.
1. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol. 36(4):338-345 (2018).
2. de Koning, A.P., Gu, W., Castoe, T.A., Batzer, M.A., and Pollock, D.D. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 7(12):e1002384 (2011).
3. Payne, A., Holmes, N., Rakyan, V. and Loose, M. Whale watching with BulkVis: A graphical viewer for Oxford Nanopore bulk fast5 files. bioRxiv 312256 (2018).