
De novo assembly of large eukaryotic genomes with long nanopore reads, and scaffolding using Pore-C


Date: 1st October 2021

Accurate, complete and contiguous genome assemblies are essential for identifying important structural and functional elements of genomes and for identifying genetic variation in an unbiased manner.


Fig. 1 Assembly of Hg002 a) assembly statistics b) contigs across the genome c) Chr 20

Extremely contiguous assembly of human genome Hg002 using ultra-long reads

We used 60x of Nanopore ultra-long reads (read N50 >100 kb) to produce a highly contiguous assembly of Hg002. The final assembly has a contig N50 of 54 Mb (Fig. 1a). The largest contig was 130 Mb and 90% of the genome was contained in contigs larger than 15 Mb. Furthermore, the assembly showed high accuracy yielding a BUSCO score of 96.9% (complete genes). Fig. 1b illustrates contig sizes along the human chromosomes. Colour changes between light and dark grey show contig or alignment breaks. Zooming in on chr20, it can be seen that >99% of the chromosome is captured by only four contigs (Fig. 1c). The efficiency of long-read assembly and polishing tools means that the overall runtime was <20 hours on a single AWS instance.
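For readers unfamiliar with the contiguity statistics quoted above, the following is a minimal sketch of how an Nx value such as the contig N50 can be computed from a list of contig lengths. The function name `nx` and the toy lengths are hypothetical and are not taken from the HG002 assembly. The statement that 90% of the genome lies in contigs larger than 15 Mb corresponds to the same calculation with a fraction of 0.9.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx length: the largest length L such that contigs of
    length >= L together cover at least `fraction` of the total assembly."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length


# Toy contig lengths in Mb (hypothetical values for illustration only).
toy_contigs = [130, 90, 54, 40, 30, 15, 10, 5]
n50 = nx(toy_contigs, 0.5)  # half of the assembly is in contigs >= n50
n90 = nx(toy_contigs, 0.9)  # 90% of the assembly is in contigs >= n90
print(n50, n90)
```

Sorting once and accumulating from the longest contig keeps the calculation simple and makes it easy to report several Nx values for the same assembly.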

Fig. 2 Scaffolding genomes with Pore-C a) NA12878 b) C. elegans c) Drosophila d) Arabidopsis

Using Pore-C contact information to improve assembly contiguity of several genomes

To demonstrate the effectiveness of scaffolding assemblies with Pore-C data, we performed de novo assembly using Flye for all genomes apart from human (NA12878), which we assembled with Shasta. For each genome, Pore-C data was processed to create virtual paired contacts, which were used for scaffolding the assemblies. The results show that scaffolding with approximately 10x Pore-C data can increase assembly contiguity substantially, even when the initial draft assembly is highly fragmented (Figs. 2 a-d). Where the scaffolded assembly N50 is greater than that of the reference, this indicates that the scaffolds contain sequence that is missing from the current reference assemblies.
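The text does not specify how the virtual paired contacts are derived. One common approach, assumed here, is to decompose each multi-way Pore-C read into all pairwise combinations of its aligned fragments. The sketch below illustrates that idea; the function names and the `(contig, position)` representation are hypothetical and do not reflect the actual Pore-C tooling.

```python
from collections import Counter
from itertools import combinations


def virtual_pairs(fragments):
    """Expand one multi-way read into all unordered fragment pairs.

    `fragments` is a list of (contig, position) tuples, one per aligned
    segment of a single Pore-C read (assumed representation).
    """
    return list(combinations(fragments, 2))


def contig_link_counts(reads):
    """Count virtual pairs that join two different contigs.

    Inter-contig counts of this kind are the sort of evidence a
    scaffolder can use to order and orient contigs.
    """
    counts = Counter()
    for fragments in reads:
        for (contig_a, _), (contig_b, _) in virtual_pairs(fragments):
            if contig_a != contig_b:
                counts[tuple(sorted((contig_a, contig_b)))] += 1
    return counts


# Two toy reads: a three-way contact and a two-way contact.
toy_reads = [
    [("ctg1", 100), ("ctg2", 5000), ("ctg1", 900)],
    [("ctg2", 10), ("ctg3", 20)],
]
links = contig_link_counts(toy_reads)
print(links)
```

A read with n aligned fragments yields n(n-1)/2 virtual pairs, which is why a modest depth of multi-way reads can provide a large number of pairwise contacts.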

Fig. 3 Haplotype-resolved assemblies a) concept b) and c) collapsed and trio-binned ONT assemblies, respectively d) pipeline e) phasing f) and g) resolving haplotypes h) and i) final assemblies

New assembly pipeline enables chromosome-scale haplotype-resolved assemblies of large diploid genomes using a combination of long nanopore reads and Pore-C data

Many assemblers collapse diploid genomes into a haploid assembly, mixing variants from both haplotypes randomly (Fig. 3a). Each contig or scaffold of a collapsed assembly therefore contains k-mers from both parents (Fig. 3b). It is preferable to have a separate assembly for each haplotype, which is often achieved by trio binning. In trio binning, k-mers unique to each parent are extracted from that parent’s data and used to separate the reads into paternal and maternal sets. The two sets are then assembled separately, giving one assembly per haplotype in which each contig or scaffold has k-mers from only one parent (Fig. 3c). However, parent data is not always available.

Here we present an alternative in which phasing, based on long reads and Pore-C, enables reads to be separated into haplotypes without the need for parent data. The pipeline, based on DipASM, first assembles ONT reads into a collapsed assembly, then aligns the long reads back to it and calls variants (Fig. 3d). These variants are then phased into chromosome-scale phaseblocks. We obtain a single phaseblock for each chromosome, containing virtually all variants with correct phasing (Fig. 3e). Next, reads are tagged using the phased variants and separated into haplotypes. The vast majority of base pairs can be phased in this way. A final assembly step yields a chromosome-scale assembly for each haplotype. The resulting scaffolds each stem from either the paternal or the maternal haplotype and have human-reference scale N50s (Figs. 3f and 3g). Without trio information it is difficult to tell which scaffolds are maternal and which are paternal, so each of the two assemblies is a mix of paternal and maternal scaffolds. Finally, Figs. 3h and 3i show dot plots for both assembled haplotypes compared to the T2T CHM13 assembly.
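The read-tagging step can be illustrated with a simple voting scheme: each read is compared against the phased heterozygous variants it covers and assigned to the haplotype whose alleles it matches more often. This is only a sketch of the general idea under assumed data structures; it is not the pipeline's actual implementation, and the function name, the `min_margin` parameter and the toy variants are hypothetical.

```python
def tag_read(read_alleles, phased_variants, min_margin=1):
    """Assign a read to haplotype 1 or 2, or None if undecided.

    read_alleles:    dict mapping position -> base observed in the read
    phased_variants: dict mapping position -> (hap1_allele, hap2_allele)
    min_margin:      minimum vote difference required to assign a read
    """
    votes = [0, 0]
    for position, base in read_alleles.items():
        alleles = phased_variants.get(position)
        if alleles is None:
            continue  # position is not a phased heterozygous site
        if base == alleles[0]:
            votes[0] += 1
        elif base == alleles[1]:
            votes[1] += 1
        # A base matching neither allele is treated as an error and ignored.
    if votes[0] - votes[1] >= min_margin:
        return 1
    if votes[1] - votes[0] >= min_margin:
        return 2
    return None


# Toy phased variants within one phaseblock (hypothetical values).
phased = {100: ("A", "G"), 250: ("C", "T"), 400: ("G", "A")}

read_a = {100: "A", 250: "C", 400: "G"}  # agrees with haplotype 1
read_b = {100: "G", 250: "T"}            # agrees with haplotype 2
read_c = {100: "A", 250: "T"}            # conflicting evidence
read_d = {999: "A"}                      # covers no phased variant

tags = [tag_read(r, phased) for r in (read_a, read_b, read_c, read_d)]
print(tags)
```

Reads that cover no phased variant, or that carry conflicting evidence, remain untagged in this sketch. How such reads are handled before the final per-haplotype assembly is not described in the text above.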
