De novo assembly of large eukaryotic genomes with long nanopore reads, and scaffolding using Pore-C


Date: 3rd December 2020

Accurate, complete and contiguous genome assemblies are essential for identifying important structural and functional elements of genomes and for identifying genetic variation in an unbiased manner.

Fig. 1 Assembly of Hg002 a) assembly statistics b) contigs across the genome c) Chr 20

Extremely contiguous assembly of human genome Hg002 using ultra-long reads

We used 60x of Nanopore ultra-long reads (read N50 >100 kb) to produce a highly contiguous assembly of Hg002. The final assembly has a contig N50 of 59 Mb (Fig. 1a). The largest contig was 140 Mb and 90% of the genome was contained in contigs larger than 7.4 Mb. Furthermore, the assembly showed high accuracy yielding a BUSCO score of 96.1% (complete genes). Fig. 1b illustrates contig sizes along the human chromosomes. Colour changes between light and dark grey show contig or alignment breaks. Zooming in on chr20, it can be seen that >98% of the chromosome is captured by only four contigs (Fig. 1c). The efficiency of long-read assembly and polishing tools means that the overall runtime was <20 hours on a single AWS instance.
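The contiguity figures quoted above (contig N50, and the contig size above which 90% of the genome is contained) follow one definition applied at different fractions. The sketch below is a minimal illustration of that definition, not the tool used to produce Fig. 1a; the function name `nx` and the toy contig lengths are assumptions made for the example.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx contig length.

    Contigs are sorted from longest to shortest and summed until the
    running total reaches `fraction` of the assembly size; the length of
    the contig that crosses that threshold is the Nx value
    (fraction=0.5 gives N50, fraction=0.9 gives N90).
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Toy contig lengths in Mb, invented for illustration (not Hg002 data).
contigs = [140, 90, 59, 40, 30, 20, 10, 7, 4]
n50 = nx(contigs, 0.5)  # half of the 400 Mb total is reached at the 90 Mb contig
n90 = nx(contigs, 0.9)  # 90% of the total is reached at the 20 Mb contig
print(f"N50 = {n50} Mb, N90 = {n90} Mb")
```

Because a single very long contig can dominate N50, it is usually read alongside the largest contig and the N90 value, as in the paragraph above.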

Fig. 2 Assembly workflow a) pipeline b) hybrid performance c) comparison of assembly tools

Comparing performance of assembly tools at different coverage and read N50

A combination of Flye or Shasta followed by a round of Racon and a round of Medaka gives the best results in terms of contiguity, accuracy and runtime (Fig. 2a). Overall, the pipeline assembles a human genome in 16 to 40 hours and can easily be extended to include an optional polishing step using 30x of short-read data. Assembler choice mainly depends on raw-read N50 and depth. To illustrate the trade-off, we assembled a wide range of different Hg002 datasets and plotted the resulting contig N50 on a read length vs read depth grid (Fig. 2c). Shasta works best with high coverage (>50x) and long reads (read N50 >40 kb). Flye is less sensitive to differences in input datasets. Overall, Shasta produced higher contig N50s, while Flye produced fewer misassemblies.
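The selection rule described above can be written down as a small decision helper. This is only a sketch of the stated rule of thumb (Shasta for >50x coverage and read N50 >40 kb, Flye otherwise); the function names and the returned step labels are hypothetical and do not correspond to any tool's command-line interface.

```python
def choose_assembler(coverage, read_n50_kb):
    """Pick an assembler from read depth (x) and read N50 (kb).

    Encodes the rule of thumb from the text: Shasta is preferred for
    deep, long-read datasets; Flye is the less input-sensitive default.
    """
    if coverage > 50 and read_n50_kb > 40:
        return "shasta"
    return "flye"


def pipeline_steps(coverage, read_n50_kb, short_reads=False):
    """List the pipeline stages in order: assembly, one round of Racon,
    one round of Medaka, and an optional short-read polishing step."""
    steps = [choose_assembler(coverage, read_n50_kb), "racon", "medaka"]
    if short_reads:
        steps.append("short-read polish")
    return steps


print(pipeline_steps(60, 100))       # deep, ultra-long dataset
print(pipeline_steps(30, 25, True))  # shallower dataset with short reads
```

The thresholds are strict inequalities here because the text gives them as ">50x" and ">40 kb"; datasets close to either boundary are probably worth assembling with both tools and comparing.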

Fig. 3 Comparison of Hg002 assemblies using reads basecalled by bonito and guppy

Assembly performance of bonito’s higher single-molecule-accuracy reads

In addition to read length and depth, raw-read accuracy can influence assembly results. We have recently released a new research basecaller, bonito, which can produce a median raw-read accuracy of >98%. When comparing assemblies generated from bonito basecalls to those from guppy’s for Oxford Nanopore Technologies' open-data Hg002 datasets, we found that bonito produces a higher contig N50 (Fig. 3a and 3b). The assembled fraction of the genome is slightly higher for the bonito reads (Fig. 3c), and reads basecalled by bonito give rise to eight times fewer misassemblies than those from guppy (Fig. 3d). Finally, when the completeness of each assembly is assessed by calculating Benchmarking Universal Single Copy Ortholog (BUSCO) scores, bonito basecalls perform slightly better than those from guppy (Fig. 3d).

Fig. 4 Scaffolding genomes with Pore-C a) NA12878 b) C. elegans c) Drosophila d) Arabidopsis

Using Pore-C contact information to improve assembly contiguity of several genomes

To demonstrate the effectiveness of scaffolding assemblies with Pore-C data, we filtered gDNA reads from NA12878, C. elegans, Drosophila and Arabidopsis to remove data with qscore less than 8 and read length less than 5-8 kb. We then performed de novo assembly using Flye for all genomes apart from human (NA12878), which we assembled with Shasta. For each genome, Pore-C data was processed to create virtual paired contacts, which were used for scaffolding the assemblies. The results show that scaffolding with approximately 10x Pore-C data can increase assembly contiguity substantially, even when the initial draft assembly is highly fragmented (Figs. 4 a-d). Where the scaffolded assembly N50 is greater than that of the reference, it indicates sequence that is missing from the current reference assemblies.
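The read-filtering step at the start of this workflow can be sketched as follows. This is an illustrative stand-in, not the filter used for Fig. 4: it assumes Phred+33 quality strings, assumes the read qscore is the mean of per-base error probabilities converted back to a Phred value (one common convention), and uses 5 kb as an example length cutoff from the 5-8 kb range quoted above. The function names and the `(name, sequence, quality)` record layout are hypothetical.

```python
import math


def mean_qscore(quality):
    """Mean read quality from a FASTQ quality string (Phred+33 assumed).

    Error probabilities are averaged and converted back to a Phred
    score, rather than averaging the per-base Q values directly.
    """
    if not quality:
        return 0.0
    probs = [10 ** (-(ord(char) - 33) / 10) for char in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_reads(records, min_qscore=8.0, min_length=5000):
    """Yield (name, sequence, quality) records that pass both cutoffs."""
    for name, seq, qual in records:
        if len(seq) >= min_length and mean_qscore(qual) >= min_qscore:
            yield name, seq, qual


# Synthetic reads, invented for illustration.
reads = [
    ("long_good", "A" * 6000, "5" * 6000),  # Q20, 6 kb: kept
    ("long_poor", "A" * 6000, "#" * 6000),  # Q2, 6 kb: dropped on quality
    ("short_good", "A" * 1000, "5" * 1000),  # Q20, 1 kb: dropped on length
]
kept = [name for name, _, _ in filter_reads(reads)]
print(kept)
```

Averaging error probabilities penalises reads with stretches of very low-quality bases more heavily than a plain mean of Q values would.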
