From raw reads to an accessible pangenome: a path strewn with pitfalls and large stones
Francois (French National Research Institute for Sustainable Development, France) stated that his talk would focus on the application of long nanopore reads to rice pangenome analysis. He explained that the concept of the pangenome is based on the idea that there is a core genome, which is shared between all individuals of a population, and a dispensable or accessory genome, which varies between individuals and is responsible for local adaptation; the pangenome is the combination of both, and the ratio of core genome to pangenome is a metric for the adaptability of a species. The pangenome can be open, whereby adding more individuals to the sequenced dataset keeps adding variation, or closed, meaning that adding new individuals to the dataset provides no additional variation.
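The core/pangenome ratio and the open-versus-closed distinction can be illustrated with a minimal sketch. It assumes each individual is represented as a set of gene (or gene-family) identifiers; the identifiers and function names below are hypothetical and are not taken from the talk.

```python
def core_and_pan(gene_sets):
    """Return (core, pan) for a list of per-individual gene sets."""
    core = set.intersection(*gene_sets)  # genes present in every individual
    pan = set.union(*gene_sets)          # genes present in at least one individual
    return core, pan


def pan_growth(gene_sets):
    """Pangenome size after adding each individual in turn.

    A curve that keeps rising suggests an open pangenome;
    a curve that plateaus suggests a closed one.
    """
    seen = set()
    sizes = []
    for genes in gene_sets:
        seen |= genes
        sizes.append(len(seen))
    return sizes


# Hypothetical toy data: three individuals with made-up gene identifiers.
individuals = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g2", "g3", "g5"},
]

core, pan = core_and_pan(individuals)
ratio = len(core) / len(pan)
growth = pan_growth(individuals)
print(f"core={len(core)} pan={len(pan)} core/pan={ratio:.2f} growth={growth}")
```

In practice the growth curve depends on the order in which individuals are added, so it is usually averaged over many random orderings; that step is omitted here for brevity.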
Francois’s team are working on rice as it is the ‘main human food’, eaten by 20% of the human population every day. There are twenty species of rice, of which two are domesticated: Oryza sativa (Asian rice) and Oryza glaberrima (African rice). The two species are separated by approximately one million years of divergence, although they share the same overall genomic structure: 12 chromosomes, a genome length of 350-380 Mb, and 99.5% homozygosity. There are at least 13 reference genomes for Asian rice, and one chromosome-level reference genome for African rice, plus five or six drafts. Francois explained that plenty of short-read data is also available, and that his team have used it to produce the African rice pangenome; however, this reference was missing a large amount of sequence, and therefore much of the population diversity.
Francois outlined the experimental plan of the 12 + 12 rice project: high-molecular-weight genomic DNA (gDNA) was extracted from 12 Asian and 12 African rice samples; the gDNA was prepared using the Ligation Sequencing Kit (LSK109), plus the Short Read Eliminator kit from Circulomics; and sequencing was then performed on the MinION Mk1C, with one flow cell per sample. They obtained 23-41X depth per sample, with a read N50 of 18-29 kb. For assembly, they used the CulebrONT pipeline, testing different assemblers and polishing levels in parallel; scaffolding and variation detection were performed with RaGOO and MUMmer.
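As a rough illustration of what these depth figures mean, mean sequencing depth is simply the total number of sequenced bases divided by the genome size. The sketch below assumes a 380 Mb genome (the upper end of the range quoted for rice); the function names are hypothetical.

```python
def mean_depth(read_lengths, genome_size):
    """Mean sequencing depth: total sequenced bases / genome size."""
    return sum(read_lengths) / genome_size


def bases_for_depth(target_depth, genome_size):
    """Total bases needed to reach a target mean depth."""
    return target_depth * genome_size


# Assumed genome size of 380 Mb; 30X is an arbitrary point within
# the 23-41X range reported in the talk.
GENOME_SIZE = 380_000_000
needed = bases_for_depth(30, GENOME_SIZE)
print(f"30X of a 380 Mb genome needs about {needed / 1e9:.1f} Gb of reads")
```

Under these assumptions, the reported 23-41X range corresponds to very roughly 9-16 Gb of sequence per flow cell.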
Their results suggested that the best assembler for the rice genome was Flye, polished with three rounds of Racon followed by Medaka. Francois stated that this produced a very high-quality assembly, with a ‘very very high’ BUSCO score of 97.7%, an assembly N50 of 12-17 Mb, and some chromosome-scale contigs. Almost 100% of the short-read data could be remapped to their long-read assemblies. They have also identified structural variants in the genomes.
Francois stated that the take-home messages from their work are, firstly, that longer reads mean better contiguity, with the Short Read Eliminator kit a ‘really powerful’ tool for increasing read length. He stated that, although the read N50 is important, the median read length also matters for assembly. It is also important to test multiple combinations of assembler and polishing; he recommended using two assemblers, to compare assemblies and clarify whether any anomalies are due to the assembler or represent true biological variation. Francois stated that, with nanopore technology, high-quality assemblies can be obtained within days: from basecalling to contigs in three days. Lastly, long reads are ‘much, much better than short reads’ for identifying genetic variation.
In terms of current and future work, Francois explained that they have been applying the Liftoff tool to transfer genome annotation from Asian rice. They are also integrating all the genome data into the Rice Genome Hub, which they are developing in their lab. They will also create a pangenome graph using minigraph, with BioGraph for linearisation. Lastly, they will visualise the data using Panache, which is also under development in their lab.