Assembly of the SARS-CoV-2 transcriptome through nanopore ReCappable sequencing


To start her breakout presentation, Camilla Ugolini (Italian Institute of Technology, Italy) described the biology of the SARS-CoV-2 genome: a positive-sense, single-stranded RNA genome of approximately 30 kb. She noted that every transcript it produces has a 5’ cap structure and a poly(A) tail, and outlined how these transcripts are made. When the RNA-dependent RNA polymerase crosses a transcription regulatory sequence (TRS), it switches template to the TRS in the leader. This results in leader-to-body fusion, creating a negative-sense intermediate that acts as a template for the synthesis of sub-genomic RNAs (sgRNAs). Camilla noted that, in direct RNA sequencing, RNA degradation or incomplete sequencing can leave some short reads in the dataset, which can affect quantification of sgRNAs. Displaying depth of coverage across the viral genome, she showed a peak at the 3’ end of the genome, then a step decrease in coverage and a smaller peak at the 5’ end.
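The coverage pattern described above can be illustrated with a minimal Python sketch. It is not taken from the talk: the function name `depth_of_coverage`, the 100-base toy genome and the read intervals are all assumptions made for illustration. Because direct RNA reads begin at the 3’ end of the molecule, truncated reads still cover the 3’ end but stop short of the 5’ end, which inflates 3’ depth.

```python
from typing import List, Tuple


def depth_of_coverage(reads: List[Tuple[int, int]], genome_length: int) -> List[int]:
    """Return per-base depth from half-open [start, end) alignment intervals."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the depth at each position.
    depth = []
    running = 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth


# Hypothetical toy data: one full-length read and three truncated reads,
# all ending at the 3' end (position 100) of a 100-base "genome".
toy_reads = [(0, 100), (40, 100), (70, 100), (90, 100)]
depth = depth_of_coverage(toy_reads, 100)

print("depth near 5' end:", depth[0])
print("depth mid-genome: ", depth[50])
print("depth near 3' end:", depth[95])
```

In this toy case only the full-length read reaches the 5’ end, so counting every read that overlaps a region would overestimate how many complete transcripts are present.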

To address this, Camilla and her team used Nanopore ReCappable Sequencing. The method enables identification of full-length transcripts and sequencing of canonical and non-canonical sgRNAs. The workflow begins with decapping of the polyadenylated transcripts, followed by recapping using a cap analog. Finally, an adapter is attached to this new cap, and the transcripts are sequenced on a nanopore device. This ensures that full-length transcripts can be identified in the sequencing data by the presence of this 5’ sequence. Camilla and her team sequenced the direct RNA libraries on a GridION. She showed a graph of depth of coverage across the genome, with clearly defined steps at each TRS of each sgRNA and a 5’ peak with the same depth of coverage as at the 3’ end, indicating the presence of mostly full-length, capped transcripts.
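As a rough illustration of identifying full-length reads by their 5’ sequence, the sketch below scans the start of each read for an adapter while tolerating a few mismatches. The adapter sequence, the window size, the mismatch threshold and the function name `has_five_prime_adapter` are hypothetical and are not the ones used by the team. The sketch also assumes reads are reported in 5’ to 3’ orientation. It only counts substitutions, whereas nanopore reads also contain insertions and deletions, so a real pipeline would more likely use an alignment-based search.

```python
def has_five_prime_adapter(
    read: str,
    adapter: str,
    window: int = 40,
    max_mismatches: int = 2,
) -> bool:
    """Return True if `adapter` occurs within the first `window` bases of
    `read`, allowing up to `max_mismatches` substitutions (no indels)."""
    region = read[:window]
    for offset in range(len(region) - len(adapter) + 1):
        candidate = region[offset:offset + len(adapter)]
        mismatches = sum(a != b for a, b in zip(adapter, candidate))
        if mismatches <= max_mismatches:
            return True
    return False


# Hypothetical adapter and toy reads, for illustration only.
ADAPTER = "ACGTTGCAGGTC"
toy_reads = {
    "full_length": "TT" + "ACGTTGCAGGTC" + "ATTAAAGGTTTATACC",
    "one_error": "ACGTTGCTGGTC" + "ATTAAAGGTTTATACC",
    "truncated": "TTTTTTTTTTTTTTTTTTTTTTTT",
}

full_length_ids = [
    name for name, seq in toy_reads.items()
    if has_five_prime_adapter(seq, ADAPTER)
]
print(full_length_ids)
```

Reads that pass such a filter would be treated as capped and full-length. Reads that fail may be degraded or incompletely sequenced molecules.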

Camilla and her team mapped two Nanopore ReCappable Sequencing datasets to the SARS-CoV-2 genome using minimap2 and processed the alignments with pinfish, producing a transcriptome assembly consisting of 21 transcripts. Using this assembly, they then mapped and quantified datasets from in-house sequencing, from their collaborators, and from the literature. Transcript quantification was performed using NanoCount. Camilla described how the assembly recovered the structure of most sgRNAs. Beyond validation of the main canonical sgRNAs, they identified possible non-canonical sgRNAs, which they are investigating further, and some possible deletions. The data enabled both recovery and annotation of ORFs and their quantification, with transcript abundance counts from the nanopore sequencing data agreeing well with existing northern blot data.
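To make the quantification step more concrete, the sketch below converts per-transcript read counts into transcripts per million (TPM). It is a simplified stand-in, not NanoCount's method: it assumes each read has already been assigned to exactly one transcript, and it ignores reads that map to several transcripts. The transcript names and counts are invented. Because each direct RNA read corresponds to one RNA molecule, no length normalisation is applied here.

```python
from typing import Dict


def counts_to_tpm(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale per-transcript read counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        # No assigned reads: report zero abundance for every transcript.
        return {name: 0.0 for name in counts}
    return {name: count * 1_000_000 / total for name, count in counts.items()}


# Hypothetical read counts for three transcripts (not data from the talk).
toy_counts = {"sgRNA_A": 600, "sgRNA_B": 300, "sgRNA_C": 100}
tpm = counts_to_tpm(toy_counts)

for name, value in tpm.items():
    print(f"{name}\t{value:.1f}")
```

Expressing abundance as a proportion of assigned reads is what allows datasets of very different depth, such as those from different laboratories, to be compared on the same scale.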

Camilla then displayed the results of mapping a selection of datasets to the new transcriptome assembly. She noted that the difference between the numbers of basecalled reads and mapped reads was due to infection efficiency and to the fact that the ReCappable protocol enables attachment of sequencing adapters to only ~20% of the sample. To perform quantification, they selected only full-length reads. Camilla highlighted that non-canonical sgRNAs had much lower expression levels than canonical sgRNAs. In the final quantified data, they found that, despite the use of data from different cell lines and isolates, the expression level of sgRNAs was conserved. Next, to investigate the potential non-canonical sgRNAs, the team reviewed the reads that did not contain a TRS. Manual inspection showed one to be a misaligned canonical transcript, whilst the others could represent either real non-canonical sgRNAs or mapping artefacts. Applying additional criteria for their classification produced <50 reads that they suggested were real non-canonical sgRNAs, leading the team to hypothesise that capped non-canonical sgRNAs are expressed at very low levels.
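One way to picture the distinction between canonical and non-canonical sgRNAs is to classify each read by where its leader-to-body junction falls relative to known TRS positions. The sketch below is an assumption about how such a check could look, not the team's classification criteria, which the talk summary does not specify. The coordinates, the tolerance and the names `classify_junction`, `LEADER_TRS` and `BODY_TRS` are hypothetical toy values.

```python
from typing import Dict

# Hypothetical toy coordinates, not real SARS-CoV-2 positions.
LEADER_TRS = 70
BODY_TRS: Dict[str, int] = {"geneA": 2000, "geneB": 5000}


def classify_junction(
    donor: int,
    acceptor: int,
    leader_trs: int = LEADER_TRS,
    body_trs: Dict[str, int] = BODY_TRS,
    tolerance: int = 10,
) -> str:
    """Label a leader-to-body junction as canonical if both ends lie within
    `tolerance` bases of a known TRS, otherwise as non-canonical."""
    if abs(donor - leader_trs) > tolerance:
        return "non-canonical"
    for name, position in body_trs.items():
        if abs(acceptor - position) <= tolerance:
            return f"canonical:{name}"
    return "non-canonical"


# Toy junctions given as (donor, acceptor) positions on the genome.
print(classify_junction(72, 1995))   # both ends close to a TRS
print(classify_junction(72, 3500))   # acceptor far from any body TRS
print(classify_junction(300, 5000))  # donor far from the leader TRS
```

A coordinate check like this cannot tell a genuine non-canonical junction from a mapping artefact, which is why reads flagged this way would still need manual inspection or further criteria.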

Finally, Camilla and her team assessed the presence of RNA modifications in the direct RNA sequencing data, noting that these can affect transcript abundance and expression. Analysing the modifications with Nanocompore against a control sample revealed the sgRNAs to be heavily modified and identified modified sites that were conserved across cell lines and isolates. The team are now further exploring the roles of these modified bases.

Authors: Camilla Ugolini