Using long-read nanopore sequencing to unravel structural genomic variations in plants

Marie-Christine Carpentier (Genome and Plant Development Laboratory, Perpignan) discussed the use of long nanopore reads to study transposable elements (TEs) in plants; TEs are mobile DNA sequences that can multiply within genomes. Marie-Christine described how TEs are a major contributor to structural variation within plant genomes, comprising 10% of the genome of Arabidopsis thaliana and 85% of the much larger maize genome; TE content is seen to increase with genome size. Marie-Christine asked: how do these TEs contribute to genomic diversity at the species level? Her team focuses on the Oryza sativa (cultivated rice) genome, of which 40% comprises TEs. She displayed a rice genome phylogeny from the 2014 3,000 genomes project, in which 3,000 cultivated rice genomes were sequenced via short-read sequencing to an average coverage of 14X. These genomes provide a good representation of the geographical distribution of cultivated rice, but because short reads cannot span most structural variants, assemblies remain incomplete. To detect TE insertion polymorphisms (TIPs) in this very large dataset ("it's very challenging to analyse 3000 genomes"), the team developed the TRACKSPOSON pipeline and used it to detect all the polymorphic insertions of 31 TE families in the 3,000 rice genomes - more than 50,000 insertions in total. In the pipeline, reads are first mapped against the reference transposable element; unmapped reads are then extracted and the TE insertion points identified.

Long-read nanopore sequencing was then used to validate the TIPs detected by the TRACKSPOSON pipeline. DNA from one of the rice genomes from the 3,000 genomes project was sequenced on the MinION device in one run, generating 5 Gb of data. The long-read output identified unambiguous positions for TE insertions in the genome. Good concordance was seen between the TIPs detected with nanopore reads and those detected by TRACKSPOSON in the short-read datasets; TRACKSPOSON was found to give a sensitivity of 81% and a specificity of 94.5%. Marie-Christine concluded that nanopore long-read sequencing "unambiguously detects structural variation within the rice genome." She then asked: what about TE transcription? For a TE to be active, it needs to be transcribed. With short-read sequencing, only the global expression of TE families could be identified, because copies within a TE family are very similar to one another. However, long-read cDNA sequencing enables the sequencing of entire transcripts in single reads, allowing distinction between TE copies and identification of the active copies within a TE family. Nipponbare rice cDNA libraries were prepared and sequenced; the long reads were then mapped against the reference genome and transcriptome via GraphMap. In Poprice, an insertion in chromosome 3 was seen to be inactive, but transcript sequencing revealed that an insertion in chromosome 1 was active and expressed. The resulting data enabled "unambiguous detection of TE transcription within a TE family" in rice.
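As an illustration of how sensitivity and specificity figures like those quoted for TRACKSPOSON can be derived, the following is a minimal sketch that compares a set of short-read TIP calls against a set of long-read validated insertions. It is not the team's actual validation code: the function names, the site labels and the idea of a fixed list of candidate sites are all assumptions made for this example, and the data are invented.

```python
from typing import Set, Tuple


def confusion_counts(
    candidates: Set[str],
    short_read_calls: Set[str],
    long_read_calls: Set[str],
) -> Tuple[int, int, int, int]:
    """Count TP, FP, TN, FN, treating the long-read calls as the truth set.

    `candidates` is a hypothetical set of all sites that were examined;
    it is needed so that true negatives can be counted.
    """
    tp = len(short_read_calls & long_read_calls)   # called and validated
    fp = len(short_read_calls - long_read_calls)   # called but not validated
    fn = len(long_read_calls - short_read_calls)   # validated but not called
    tn = len(candidates - short_read_calls - long_read_calls)  # absent in both
    return tp, fp, tn, fn


def sensitivity_specificity(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true positive site and one true negative site")
    return tp / (tp + fn), tn / (tn + fp)


# Invented example data: ten candidate sites, labelled site1..site10.
candidates = {f"site{i}" for i in range(1, 11)}
short_read_calls = {"site1", "site2", "site3", "site4"}
long_read_calls = {"site1", "site2", "site3", "site5"}

counts = confusion_counts(candidates, short_read_calls, long_read_calls)
sens, spec = sensitivity_specificity(*counts)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

In practice, insertion sites from the two technologies would need to be matched with some positional tolerance rather than by exact label; that matching step is left out of this sketch.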

Marie-Christine concluded that the use of long-read nanopore sequencing allowed for the detection of structural variation within a non-assembled genome, and the detection of transcription of active TE copies via cDNA sequencing. The team plans to use these data to improve genome annotation and, next, to detect long non-coding RNAs.

Authors: Marie-Christine Carpentier