Tackling the complexities of assembling plant genomes


The assembly of plant genomes is particularly challenging owing to several factors, including extreme genome sizes (e.g. at 152 Gb, the Paris japonica genome is almost 50 times larger than the human genome³) and a high proportion of repetitive sequence, a problem compounded by the polyploid nature of many plant species. Long nanopore sequencing reads improve the resolution of repetitive regions, simplifying the assembly of large and complex plant genomes.

Of the ~400,000 known plant species, only ~400 (0.1%) have had their genomes sequenced⁴.

Cannabis has been cultivated for millennia, with different cultivars bred for the production of fibre and grain, or of tetrahydrocannabinol (THC). With recent demand favouring cannabidiol (CBD) over THC, a complete, annotated Cannabis genome would enable a comprehensive understanding of the evolution and inheritance of drug potency, as well as selective breeding for desired traits. Yet despite several draft assemblies, a complete Cannabis genome had not been achieved. This was due in part to the repetitive nature of the cannabinoid synthase loci, which have been implicated in the ratio of THC to CBD production and are therefore critical to resolve. Grassa et al. described the first chromosome-scale assembly of the Cannabis genome⁵. The assemblies enabled complete resolution of the cannabinoid synthase loci on chromosome 9, which were found in three linked regions containing a total of 13 synthase gene copies (Figure 1). The synthases were part of transposable element cassettes, which required ultra-long nanopore reads to resolve. Through QTL analysis, the team also discovered that cannabinoid content (potency) was not associated with the cannabinoid synthase gene clusters, and that additional loci were involved in controlling potency. This study demonstrates the ability of long nanopore sequencing reads to improve the assembly of problematic repetitive genomic regions.

Figure 1: The cannabinoid synthase genes are located in tandemly repeated cassettes. Genes (blue) are clustered among long terminal repeats (LTRs), coloured as follows: LTR ends (red), LTR body (grey), unclassified LTR (yellow), LTR01 remnants (purple), and an unclassified LTR fragment (green). A) The synthase gene cluster at 26 Mbp includes seven copies of a cassette ranging from 38 to 84 kb in length and flanked by two LTR01 LTRs. B) Synthase genes at 29 Mbp are located in a different cassette, ranging from 28 to 57 kb in length, with a single LTR08 upstream. C) The entire 29 Mbp synthase gene cluster is flanked by LTR01 LTRs, and the third cassette is interrupted by an LTR01 remnant. Taken from Grassa et al. (2018)⁵.
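The cassette sizes in Figure 1 give a rough sense of why read length matters here. The following is a minimal sketch, not the method used by Grassa et al.: it assumes that a read can place a repeat copy unambiguously only if it covers the whole copy plus some unique flanking sequence on each side. The function name, the 1 kb anchor length, and the example read lengths are illustrative assumptions; only the 38 kb and 84 kb cassette lengths come from the caption.

```python
def read_spans_repeat(read_len_kb, repeat_len_kb, anchor_kb=1.0):
    """Return True if a read could cover one repeat copy end to end.

    Assumption (not from the source): the read needs `anchor_kb` of
    unique sequence on both sides of the repeat to be placed reliably.
    """
    return read_len_kb >= repeat_len_kb + 2 * anchor_kb


# Cassette lengths taken from the Figure 1A caption (26 Mbp cluster).
SMALLEST_CASSETTE_KB = 38
LARGEST_CASSETTE_KB = 84

# Illustrative read lengths in kb; these values are assumptions.
example_reads_kb = {
    "short read": 0.15,
    "long read": 50.0,
    "ultra-long read": 100.0,
}

for label, length_kb in example_reads_kb.items():
    covers_smallest = read_spans_repeat(length_kb, SMALLEST_CASSETTE_KB)
    covers_largest = read_spans_repeat(length_kb, LARGEST_CASSETTE_KB)
    print(f"{label} ({length_kb} kb): "
          f"smallest cassette={covers_smallest}, largest cassette={covers_largest}")
```

Under these assumptions, the 50 kb read covers the smallest cassette but not the largest, and only the 100 kb read covers both, which is consistent with the observation that ultra-long reads were needed to resolve the cassettes.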

Due to its short generation time, small size, large number of offspring, and relatively small genome (~150 Mb), Arabidopsis thaliana has long been a popular model organism in plant biology⁶, and in 2000 it became the first plant to have its genome sequenced⁷. However, as with most current reference genomes, the short-read technology used to construct the genome precludes the analysis of large structural variants and repetitive regions, such as transposons². To address this challenge, an international team of researchers led by Professor Todd Michael at the J. Craig Venter Institute used nanopore long-read sequencing to create an additional, highly contiguous reference genome for A. thaliana². In only 4 days, using a single MinION Flow Cell, the researchers achieved a more contiguous genome assembly than the existing ‘gold standard’ TAIR10³,⁵ assembly. The initial assembly step took just 1 hour on a standard laptop. Furthermore, base quality was deemed to be on a par with that of the current gold-standard reference assembly. The team routinely generated reads between 200 kb and 800 kb in length using nanopore sequencing, enabling them to identify nested transposable elements and repeat regions that had previously been inaccessible to short-read sequencing technologies (Figure 2). Prof. Michael explained: ‘We found that de novo assembly with long reads revealed micro-variation that we did not see with short-read technologies. Even the small Arabidopsis genome is riddled with transposable element fragments that have dragged micro-duplications across the genome’. In addition to improving the genome of this key model organism, Prof. Michael stated that this study highlights that: ‘researchers no longer have to send out their samples to core labs or service providers, they can now do it [whole-genome sequencing] on their own bench, and within a week have an answer to their question. This is something that people can do right now in their own labs – they should do it’¹.
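Claims that one assembly is ‘more contiguous’ than another are commonly backed by a summary statistic such as contig N50. The text above does not say which metric was used, so the following is only a sketch of how N50 is usually computed; the contig lengths are made-up toy values, not data from the study.

```python
def n50(contig_lengths):
    """Return the contig N50: the largest length L such that contigs of
    length >= L together make up at least half of the total assembly."""
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths in Mb): both assemblies total 50 Mb,
# but one is split into many small contigs and the other into two large ones.
fragmented_assembly = [5] * 10
contiguous_assembly = [30, 20]

print("fragmented N50:", n50(fragmented_assembly))
print("contiguous N50:", n50(contiguous_assembly))
```

Two assemblies of the same total size can therefore have very different N50 values, and the one with fewer, longer contigs scores higher.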

Figure 2: Long-read nanopore-based sequencing resolved a highly repetitive region of nested transposable element fragments. An additional copy of the transposable element (TE) AT4g30720 was identified within a 39 kb expansion (comprising numerous TE fragments) that could not be resolved by short-read sequencing approaches. Image adapted from Michael et al. (2018)².
  1. Michael, T.P. (2017). Personal communication with Oxford Nanopore Technologies on 29 August 2017.
  2. Michael, T.P. et al. (2018). High contiguity Arabidopsis thaliana genome assembly with a single nanopore flow cell. Nature Communications 9:541.
  3. Pellicer, J. et al. (2010). The largest eukaryotic genome of them all? Botanical Journal of the Linnean Society 164(1):10–15.
  4. Michael, T.P. (2019). Personal communication with Oxford Nanopore Technologies on 10 October 2019.
  5. Grassa, C.J. et al. (2018). A complete Cannabis chromosome assembly and adaptive admixture for elevated cannabidiol (CBD) content. bioRxiv doi: 10.1101/458083.
  6. Arabidopsis Genome Initiative (2000). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408(6814):796–815.
  7. TAIR (2017). Genome assembly [online]. Available at: http://www.arabidopsis.org/portals/genAnnotation/gene_structural_annotation/agicomplete.jsp [Accessed: 01 Oct 19].