Highly contiguous assembly of the Azucena basmati rice genome using long reads and Pore-C scaffolding


Date: 3rd December 2020

Long nanopore reads simplify de novo assembly of eukaryotic genomes, resulting in increased contiguity and accuracy. Pore-C can be used to correct and scaffold the assembled contigs.

Fig. 1 Pore-C a) overview of laboratory workflow for plants b) multi-contact reads

Long multi-contact reads from chromatin conformation capture of isolated plant nuclei

Nanopore reads can reach hundreds of kilobases in length, which greatly assists the process of reconstructing genomes, resulting in highly contiguous de novo assemblies. Assembly contiguity can be increased further by the addition of data from Pore-C, a technique which investigates the folded state of a genome. Here we exploit the fact that genomic regions which are close in 3D space also tend to be close in the primary sequence. In the laboratory protocol, genomic DNA is first cross-linked to nuclear proteins, preserving the spatial proximity of loci. Restriction digestion and proximity ligation are used to join cross-linked fragments, which are then sequenced (Fig. 1a). The long nanopore Pore-C reads are derived from multiple genomic loci (Fig. 1b).
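One common way to work with multi-contact reads downstream is to expand each read into all unordered pairs of the loci it touches. The sketch below illustrates that idea only; the `(contig, position)` tuple representation and the function name `pairwise_contacts` are assumptions made for illustration, not part of the Pore-C tooling described here.

```python
from itertools import combinations


def pairwise_contacts(fragments):
    """Expand one multi-contact read into all unordered pairs of its fragments.

    `fragments` is a list of (contig, position) tuples: a hypothetical,
    simplified stand-in for the loci that a single Pore-C read aligns to.
    """
    return list(combinations(fragments, 2))


# A made-up read touching three loci, two on the same contig.
read = [("contig_1", 10_000), ("contig_1", 250_000), ("contig_7", 42_000)]
contacts = pairwise_contacts(read)

for left, right in contacts:
    print(left, "<->", right)
```

A read with n aligned fragments yields n(n-1)/2 pairs, so longer multi-contact reads contribute disproportionately more proximity information.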

Fig. 2 Bioinformatics workflows for assembly and scaffolding

Bioinformatics pipeline for de novo genome assembly and scaffolding with Pore-C reads

Chromosome-scale assemblies can be generated by first creating a draft assembly and then using the proximity information encoded in Pore-C reads to correct and merge the resulting contigs into larger scaffolds. Recent innovations in long-read genome assembly have made it possible to generate a high-quality draft assembly quickly. The scaffolding and correction process starts by creating a contact map from the Pore-C reads (Fig. 2 inset), which is then used by the 3D-DNA tool to merge contigs into scaffolds in a way that is consistent with the proximity information. A manual curation step then corrects any remaining inconsistencies between the contact map and the scaffolds, before the final step of removing allelic contigs.
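To make the idea of a contact map concrete, the sketch below tallies how often pairs of contigs are linked by pairwise contacts. It is a deliberately simplified, contig-level illustration under assumed inputs (hypothetical `(contig, position)` tuples and made-up data); it is not how 3D-DNA or the pipeline in Fig. 2 is implemented, and real contact maps are usually binned by genomic position rather than by whole contig.

```python
from collections import Counter


def contig_contact_counts(contacts):
    """Count pairwise contacts between contigs, ignoring pair order.

    `contacts` is an iterable of ((contig, pos), (contig, pos)) pairs.
    Positions are ignored in this contig-level sketch.
    """
    counts = Counter()
    for (contig_a, _pos_a), (contig_b, _pos_b) in contacts:
        # Sort the names so (a, b) and (b, a) land on the same key.
        key = tuple(sorted((contig_a, contig_b)))
        counts[key] += 1
    return counts


# Made-up contacts for illustration only.
example_contacts = [
    (("contig_1", 10_000), ("contig_2", 5_000)),
    (("contig_2", 80_000), ("contig_1", 900_000)),
    (("contig_1", 20_000), ("contig_1", 30_000)),
    (("contig_3", 1_000), ("contig_2", 2_000)),
]
counts = contig_contact_counts(example_contacts)

for pair, n in counts.most_common():
    print(pair, n)
```

Contig pairs with many contacts between them are candidates for being adjacent on the same chromosome, which is the signal a scaffolder uses when ordering and orienting contigs.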

Fig. 3 Azucena assembly a) assembly statistics before and after scaffolding, b) and c) dot-plot and contact map of the initial draft assembly, d) and e) dot-plot and contact map after scaffolding

Highly contiguous assemblies of the Azucena variety of rice using a plant Pore-C laboratory protocol and bioinformatics pipeline, which increases assembly contiguity eight-fold

Rice is the third-highest agricultural commodity worldwide and is a staple food for around one third of the world’s population. Efforts are underway to characterize the existing genetic diversity of cultivated rice strains (Zhou et al. 2019), with the aim of creating strains better suited to meeting the demand of a growing world population and responding to the challenges posed by climate change. This requires generating reference-quality genomes cheaply and at scale. The rice genome is diploid, with 12 pairs of chromosomes, and is just under 400 Mb in size. We generated approximately 60x coverage of basecalled nanopore reads for the variety Azucena and assembled the genome using Flye, yielding a draft assembly with 527 contigs and an N50 of 3.7 Mb (Fig. 3a). Fig. 3b shows a dot-plot of this assembly. This complete sample-to-answer, nanopore-only workflow took around one week, compared with the many months and several technologies used to create the reference assembly. To increase assembly contiguity further, we generated 30 Gb (~75x) of Pore-C reads, using DpnII for the restriction digestion step. These reads were incorporated into the assembly using the bioinformatics pipeline shown above (Fig. 2). Fig. 3c shows a contact map of the 527 contigs before scaffolding with Pore-C data. The resulting combined assembly had a substantially greater N50 of 29.6 Mb (Fig. 3a), an eight-fold increase. The largest 12 scaffolds are close to the length of entire chromosomes in the reference assembly (Fig. 3d). Fig. 3e shows the optimised contact map obtained after inclusion of Pore-C data. These methods are readily adaptable to other organisms and tissues, and will be further developed and released to the Nanopore community.
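The headline figures above rest on two simple calculations: N50 and sequencing depth. The sketch below shows both using the standard definitions; the function names are illustrative, and the 400 Mb genome size is a rounded assumption (the text says "just under 400 Mb").

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


def coverage(total_bases, genome_size):
    """Approximate sequencing depth as total bases divided by genome size."""
    return total_bases / genome_size


# Toy contig lengths, not the Azucena assembly.
toy_n50 = n50([2, 3, 4, 5, 6])

# 30 Gb of Pore-C reads over an assumed ~400 Mb genome.
pore_c_depth = coverage(30e9, 400e6)

# Ratio of the reported N50 values after and before scaffolding (in Mb).
fold_increase = 29.6 / 3.7

print(toy_n50, pore_c_depth, round(fold_increase, 2))
```

With the rounded 400 Mb genome size, 30 Gb corresponds to about 75x, matching the depth quoted in the text, and the ratio of the two reported N50 values gives the eight-fold improvement.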
