Bioinformatics workflows for SARS-CoV-2: from raw Nanopore reads to consensus genomes using the ARTIC coronavirus protocol

The ARTIC network has published a bioinformatics protocol for the analysis of Nanopore sequenced SARS-CoV-2 genomes.

In this session we will first explore how the RAMPART software can be used to track the performance of a sequencing run. The ARTIC Field Bioinformatics protocol will then be followed to prepare a consensus sequence and lists of SNPs from publicly available SARS-CoV-2 sequence collections. Stephen Rudd will introduce two Docker containers that package the ARTIC software and will demonstrate how these containers can be used to streamline a Nanopore-based sequence analysis workflow.

Answers from the Oxford Nanopore team to your questions during the webinar:

1. How do I access the Jupyter notebook featured in the presentation?

You can find the notebooks on our Docker Hub site here: https://hub.docker.com/u/ontresearch. The Docker image for RAMPART can be found here: https://hub.docker.com/r/ontresearch/artic_rampart and the Docker image for the "field bioinformatics" workflow can be found here: https://hub.docker.com/r/ontresearch/artic_bioinformatics.
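
As a minimal sketch of getting started with the RAMPART image (the image name is taken from the links above; the mount path is a placeholder and the port numbers follow question 22 below; check the Docker Hub page for the exact entrypoint arguments):

```bash
# Pull the RAMPART image from Docker Hub
docker pull ontresearch/artic_rampart

# Run it, publishing RAMPART's web ports and mounting the run's FASTQ directory
docker run --rm -it \
  -p 3000:3000 -p 3001:3001 \
  -v /path/to/fastq_pass:/data \
  ontresearch/artic_rampart
```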

2. Can the .bam file be visualised in IGV?

If you are comfortable with IGV then by all means look at the alignment in IGV; alternatively there is Tablet, which is perhaps a little easier for a non-bioinformatician to install.
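
If your BAM file is not yet sorted and indexed, a quick preparatory step (standard samtools usage; file names are illustrative) is:

```bash
# Sort and index the alignment so IGV or Tablet can load it
samtools sort -o sample.sorted.bam sample.bam
samtools index sample.sorted.bam
```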

3. Why is the negative control needed for this protocol? What does it tell you about the reads which map to the SARS-CoV-2 genome?

The reason for a negative control here is that this protocol relies on very sensitive PCR, and in any laboratory you can end up with contamination. If you have a contaminant in your PCR mastermix then you will end up with a massive amount of amplification, and reads from the contaminated negative control will map to the SARS-CoV-2 genome just as reads from a genuine sample would. If you see signal in your negative control, it puts the rest of your experiment into doubt. So please include a negative control, because it is best practice; if you see signal then you know that something is not entirely correct.

4. Where can I learn the basics of bioinformatics?

There are a number of third-party and community tutorials and resources available to help with getting started in bioinformatics. A good introduction to the command line can be found here: https://www.katacoda.com/amblina. We have prepared several introductory tutorials, which are available in the Knowledge section of the Nanopore Community: https://community.nanoporetech.com/knowledge/bioinformatics. We also host introductory workshops, including for data analysis; you can find out more here: https://store.nanoporetech.com/uk/services/introduction-workshop.html/.

5. Which software do you recommend for phylogeny?

Nextstrain hosts several resources for generating phylogenetic trees for coronavirus here: https://nextstrain.org

6. Is MinKNOW included in the Docker image?

The Docker image contains only the workflow used for post-sequencing analysis; you will need to use MinKNOW itself to perform sequencing and basecalling in real time.

7. Is it possible to access the first webinar, 'Nanopore sequencing the SARS-CoV-2 genome: introduction to the protocol'?

Yes, the webinar is available to watch on demand here.

8. Can RAMPART be run on GridION with MinKNOW? In that case, do we need to install Docker and Conda?

RAMPART can be run on the GridION device alongside MinKNOW. If you are familiar with the command line, you can follow the RAMPART installation instructions to install it on the GridION. If not, Docker provides an easier route to the analysis, though a little command-line work is still required to install Docker itself.

9. Does the software running in Windows/Docker use CPU or GPU?

The software uses the CPU.

10. How can we access the MinIT? How can we log in to MinIT?

You'll need a laptop or smartphone to connect to MinIT. You can find instructions on connecting to MinIT here.

11. How can we generate a phylogenetic tree to compare with other coronavirus sequences?

Nextstrain hosts several resources for generating phylogenetic trees for coronavirus (https://nextstrain.org/).

12. How can we consult or interact after the webinar, for more specific questions or challenges when adapting the workflow/pipeline?

Our team of experts are happy to help; please contact support@nanoporetech.com.

13. What is the strength of this technology in comparison with other NGS technologies?

This method, using real-time nanopore sequencing and analysis, enables a quick turnaround time of ~7 hours from RNA to SARS-CoV-2 consensus sequence. It also allows the flexibility to sequence in multiplex (with 1-24 barcodes), on demand, as and when needed. The MinION device is portable, with low start-up costs. It is also highly versatile, with many other library prep options and applications possible.

14. How do you upload the reference genome in this workflow?

The genome of the original strain that was sequenced in Wuhan is already included in the workflow, and this reference genome is used for much of the analysis. If desired, you could certainly place your own reference genome in the directory in which the current Wuhan reference is stored. You would then be able to point the workflow to that reference, or alternatively rename it to match the reference that the workflow expects. To do this, when you launch the Docker container, you can mount a directory from your host machine, and you would store your sequences in the directories that you mount into the container. The ARTIC team provide the reference, however, so I wouldn't necessarily recommend replacing it. If you would like a different reference genome included in the Docker image for either RAMPART or the Field Bioinformatics workflow, please put a question up on our Community - we have a couple of channels on the bioinformatics of SARS-CoV-2. We can either share the Dockerfiles so you can build your own containers, or we can certainly help you with this process.
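
As an illustrative sketch of that mounting step (the host and container paths here are placeholders, not the image's actual layout):

```bash
# Mount a host directory containing your own reference FASTA into the container
docker run --rm -it \
  -v /host/references:/data/references \
  ontresearch/artic_bioinformatics
```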

15. How much coverage do you recommend for each SARS-CoV-2 genome?

With the latest V3 ARTIC primers, which address some coverage dropout issues observed in V1 and V2, we now officially recommend 1000x total depth of coverage against the SARS-CoV-2 genome - around 30 Mb of data, since the genome is ~30 kb long. This is sufficient to achieve 100x coverage across the entire genome. To put that into perspective, when you are multiplexing up to 24 samples on a MinION, you will generate 30 Mb per barcode - and therefore that depth of coverage - within about 1 hour of sequencing on a standard MinION Flow Cell.
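
To check the depth you have actually achieved, one standard approach (the BAM file name is illustrative) is:

```bash
# Report the mean depth across the genome from a sorted, primer-trimmed BAM
samtools depth -a sample.primertrimmed.sorted.bam \
  | awk '{sum += $3} END {if (NR > 0) print "mean depth:", sum / NR}'
```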

16. Are there particular regions of the genome that tend to be undersequenced?

In V1 and V2 of the ARTIC protocol there were some regions with coverage dropout; the V3 primers, however, do generate sufficient coverage across the entire genome, as long as you aim for the 1000x total depth of coverage.

17. In our SARS-CoV-2-positive samples, we see a similar pattern of amplicons being underrepresented or not represented in the sample when following the ARTIC protocol. Are there any primer optimisations or workarounds being developed right now?

The V3 version of the protocol has included optimisations to reduce primer dropout compared to previous versions and improve overall coverage. The process is also under continual development and optimisation.

18. Your analysis workflows have been demonstrated on GridION, but what if I don’t have a GridION? Can you clarify what kind of compute power is needed to run the ARTIC analysis?

You can use any computer for the analysis. We do recommend the GridION because its built-in compute has the power and flexibility to run multiple samples at the same time; however, any computer that meets our minimum specifications can carry out the workflow.

19. Can the analysis workflow be run on the PromethION?

The PromethION's compute tower has the ability to run all of these workflows, so it can be done. You would need to check with your local account manager what the process is for installing any of the Docker software on the PromethION. However, the PromethION has been tuned for producing a substantial amount of data, and we would suggest moving that data off the box and somewhere else for downstream or secondary analysis.

20. I noticed that RAMPART v1.1.0 was not processing all FASTQ files if other FASTQ files existed with identical names in different subdirectories - which is the case when using the Guppy barcoder. Do you know if this bug is fixed now?

First, we would point out that you can perform demultiplexing within MinKNOW, either during live basecalling or as a post-run step. Doing this will generate an individual directory of files for each barcode, which avoids any same-name conflicts. Similar functions for demultiplexing using Guppy as a command-line tool are in development. The Field Bioinformatics workflow that you would use after RAMPART can be run on the individual demultiplexed directories, so even if files from different barcodes share the same names, running and analysing the data separately won't have any impact.

21. Can RAMPART also be used to analyse or visualise data after the run has ended?

Absolutely. The analysis works perfectly offline on a Raspberry Pi, GridION, MinIT... you name it!

22. RAMPART appears to consume ports 3000 and 3001 - what if we need to process more than one flow cell at a time (on a single computer)?

It is certainly possible to run multiple flow cells and access them. It does require some additional parameter setting when initialising the Docker run command, and we are actively looking at developing Community resources to provide advice for users wishing to run more than one flow cell.
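
As a sketch of what that parameter setting looks like (host ports 3100/3101 are arbitrary free ports and the paths are placeholders), the second container simply maps RAMPART's container ports to different host ports:

```bash
# First flow cell: RAMPART on the default host ports
docker run --rm -d -p 3000:3000 -p 3001:3001 \
  -v /runs/flowcell_A:/data ontresearch/artic_rampart

# Second flow cell: same container ports, different host ports
docker run --rm -d -p 3100:3000 -p 3101:3001 \
  -v /runs/flowcell_B:/data ontresearch/artic_rampart
```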

23. Does this workflow work with the SISPA protocol?

Absolutely. The graphs seen both within RAMPART and within the Field Bioinformatics workflow will look fine regardless of which protocol you are using. Because of the way the Field Bioinformatics workflow is designed and implemented, it will take any long reads and chop them into something shorter. This does mean that you lose a lot of the value that the SISPA protocol provides, but it will give you the analysis output that you are looking for.

24. Is there a difference whether we install RAMPART with Docker or with Conda (as described on the ARTIC website)?

There may be very subtle differences. The version installed in Docker has been deployed without using any of the Conda workflow, so it is a lighter-weight installation. The versions in the Docker container and in Conda are both current at the moment; there may be a little asynchronicity at any particular time, but right now the code is equivalent.

25. Given that Conda also provides an isolated virtual environment, with user-specified dependencies, what is the advantage that Docker offers over Conda?

Firstly, the Conda installation requires a bit more command-line experience and expertise; Docker, on the other hand, can be deployed across multiple computing environments. And as demonstrated in the webinar, there are some really nice Jupyter notebooks and workflows already included, which will help you perform the analysis and also give you context for the results as you're getting them. If you understand Conda then feel free to opt for the Conda workflow; there is no specific advantage either way.
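
For reference, the Conda route looks roughly like this (based on the ARTIC fieldbioinformatics repository; check its README for the current steps):

```bash
# Clone the ARTIC pipeline and create its Conda environment
git clone https://github.com/artic-network/fieldbioinformatics.git
cd fieldbioinformatics
conda env create -f environment.yml
conda activate artic
```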

26. Are the primer sequences of the amplicons trimmed in the ARTIC pipeline and if so, how?

As discussed in the previous webinar, the ARTIC primers are designed as an overlapping set of tiled amplicons. Within the ARTIC analysis workflow, sequences from each of the pairs of primers are separated into their exclusive pools, and they are trimmed and clipped to remove any primer sequence. Therefore, during mapping, only biologically meaningful sequences are mapped against the genome, not synthetic sequence. This means that we are not masking any nucleotides with synthetic primer sequence.
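
In practice, this trimming happens inside the artic minion step of the Field Bioinformatics workflow; a representative invocation, following the ARTIC SOP (the scheme directory, file names and sample name are placeholders), is:

```bash
# Alignment, primer trimming and variant calling in one step
artic minion --normalise 200 --threads 4 \
  --scheme-directory ~/artic-ncov2019/primer_schemes \
  --read-file barcode01.fastq \
  --fast5-directory fast5_pass/ \
  --sequencing-summary sequencing_summary.txt \
  nCoV-2019/V3 sample01
```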

27. When running on the GridION or MinIT, how can we indicate that we want to use the GPU (if we can make use of it)?

For basecalling and demultiplexing, if you are using Guppy there are parameters - specifically the --device parameter - which indicate that a GPU should be used on either of those devices. For the subsequent analysis, GPU support inside Docker is something we are looking into internally at the moment, but there are no components within the ARTIC workflows that can utilise the GPU. So, at this moment, using a GPU offers no advantage for these applications.
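
For the basecalling step, a representative Guppy command (the config file name is illustrative; use the one matching your flow cell and kit) is:

```bash
# --device (or -x) selects the GPU; cuda:0 is the first GPU on the device
guppy_basecaller -i fast5/ -s basecalled/ \
  -c dna_r9.4.1_450bps_hac.cfg \
  --device cuda:0
```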

28. ARTIC recommends barcodes to be present on both ends of the reads. As far as I know, this can only be set globally for the Guppy-for-GridION basecaller, by changing a config file and restarting the service. This is unfortunate if I want to use different settings for other sequencing experiments running in parallel. Are there any plans on offering this option on experiment startup in MinKNOW?

This likely refers to the earlier implementations of the ARTIC workflow, which utilised Porechop for demultiplexing. Porechop's parameters required barcodes at both ends, which led to a lot of unclassified data; we have consequently shifted over to the Guppy barcoder. The demultiplexing protocols in Guppy use a similar approach, looking for barcodes at both ends, but a lot more data is retained. So, in terms of requiring barcodes at both ends, if you are using the Guppy barcoder/demultiplexer you will still be able to demultiplex accurately.
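
A representative guppy_barcoder invocation that enforces barcodes at both ends (the barcode kit name is an example; use the kit from your own library prep) is:

```bash
# Demultiplex, keeping only reads with a barcode detected at both ends
guppy_barcoder -i basecalled/ -s demultiplexed/ \
  --barcode_kits "EXP-NBD104" \
  --require_barcodes_both_ends
```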

29. Do we need to assemble the raw reads? Or is mapping enough?

The ARTIC group recommend this mapping approach. While a genome can be assembled, the amount of variance between SARS-CoV-2 genomes is thus far low, and PCR-based tiling plus mapping is giving appropriate results.
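
A minimal sketch of that mapping step (MN908947.3 is the Wuhan-Hu-1 reference accession; file names are illustrative):

```bash
# Map Nanopore reads against the reference and produce a sorted, indexed BAM
minimap2 -ax map-ont MN908947.3.fasta reads.fastq \
  | samtools sort -o aligned.sorted.bam -
samtools index aligned.sorted.bam
```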

30. You described that the Field Bioinformatics workflow runs in 45 minutes; how many samples does this cover?

The 45 minutes stated here relates to using half of the CPU capacity on a GridION to analyse 12-16 hours' worth of sequencing performed on the Twist dataset. Typically, a couple of hours would probably be needed, although this will depend on the number of multiplexed samples and the volume of data generated.

31. Can we run two MinIONs at the same time and use RAMPART?

The RAMPART software is designed to process a single dataset at a time. However, the great thing about Docker is that you can run many Docker instances at the same time. So yes, you can run two MinIONs at the same time on a single computer, and yes, you can run two Docker containers at the same time, processing each of these data outputs. The ports need to be configured manually to do this; we will soon share the relevant documentation on the Community on how to make this work.

32. Does RAMPART work on Windows?

You can use RAMPART through Docker, which can be installed on Windows, although this will depend on how up-to-date your Windows machine is and on its settings.

33. For MinKNOW, should we be using 'fast' basecalling or the 'high accuracy' basecalling algorithm?

If you want to look at SNPs, we strongly recommend you use 'high accuracy' basecalling; if you really need a faster result, you can use 'fast' basecalling, which is what the ARTIC workflow uses. The 'high accuracy' mode will provide a more accurate result, with better single-molecule accuracy, so which mode is best for you depends on your requirements.

34. Is it possible to detect quasispecies with this approach?

The ARTIC workflows are built upon the assumption that we are looking at a largely homogeneous viral population. Quasispecies would require slightly different workflows; perhaps this would be a great question to ask on the Nanopore Community. I am not sure how we would best address this at the moment; let us come back with a more informed view on how we could handle this, and some recommendations, in the near future.

35. In most of the examples given, the RAMPART workflow has been performed on data generated from samples prepared with the Native Barcoding protocol. Will the workflow be the same for data derived from samples prepared with the PCR Barcoding Kit?

The RAMPART software makes no assumptions about your starting data. It uses the set of barcodes that it can find within the data; if you are running the Porechop part of the process, it will be just fine with whichever data you provide. We recommend that, if you are not completely following the ARTIC protocol, you perform the demultiplexing step yourself. If you run the demultiplexing step with the Guppy barcoder, the RAMPART software will use your summary file.

36. How can I compare the RAMPART output with the Field Bioinformatics output? The demultiplexing is performed by Porechop and Guppy, respectively.

This is a really challenging question, because the two have such different objectives and expectations. While RAMPART and the Field Bioinformatics workflow fundamentally use exactly the same sequences, they are intended to do very different things. The objective of RAMPART is to give you a qualitative insight into what is happening during your sequencing run: is it fit for purpose, do you have sufficient depth of coverage, are your barcodes behaving properly, do you see any regions of dropout? You are going to use RAMPART to make a decision: 'should I continue with this run or not?' If not, press stop, wash the flow cell and try again with a different set of barcodes. The point of the Field Bioinformatics workflow is to prepare the genome consensus sequence and identify single nucleotide variants. In terms of mapping, both use minimap2 and both use the same reference genome. However, the processing of the data is so fundamentally different that the outputs are pretty much incomparable.

Author: Stephen Rudd, Product Manager - Bioinformatics, Oxford Nanopore Technologies Ltd.