Data analysis
Data analysis VDATD_5000_v1_revT_22Aug2016
FOR RESEARCH USE ONLY
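Contents
Overview
- 1. Basecalling overview
- 2. Basecalling algorithms
Basecalling options
- 3. Live analysis
- 4. On-demand basecalling using the Dorado software
- 5. Basecall accuracy
- 6. Barcoding options
- 7. File formats
Further analysis
- 8. Data analysis in EPI2ME
- 9. Oxford Nanopore Technologies tools and pipelines
- 10. Custom analysis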
1. Basecalling overview
Introduction to basecalling
Basecalling is the process of converting the electrical signals generated by a DNA or RNA strand passing through the nanopore into the corresponding base sequence of the strand. The general data flow in a nanopore sequencing experiment is shown below.
Raw data – a direct measurement of the changes in ionic current as a DNA/RNA strand passes through the pore, which are recorded by the MinKNOW software. MinKNOW also processes the signal into "reads", each read corresponding to a single strand of DNA/RNA. These reads are written out as POD5 files: a custom Oxford Nanopore file type.
Basecalling – the basecalling algorithm uses signal processing techniques based on machine learning to transform the raw signal of the reads into basecalls. The software writes the results into BAM files (unaligned, or containing modified base information and/or alignment information), with a default of 4000 reads per file. FASTQ files are also produced, again with a default of 4000 reads per file.
Oxford Nanopore Technologies provides several platforms to allow users to carry out basecalling in real-time, as well as executables for users' local infrastructure. You can carry out basecalling live during the experiment, as post-processing after an experiment has finished, or a combination of these.
Basecalling with neural networks
The production versions of Oxford Nanopore basecallers convert raw signal data to basecalls using algorithms that incorporate bi-directional Recurrent Neural Networks (RNNs).
A neural network is loosely modelled on processes that occur inside the human brain. The network contains nodes arranged in layers, which carry out computations. Neural networks receive and process data, but crucially, they have been trained to have exceptional performance for particular signal processing tasks. They have been successfully used for diverse applications such as pattern recognition (for example, handwritten characters or speech) and predicting trends over time.
A recurrent neural network is a class of neural networks in which the output is dependent on past computations. An RNN keeps an internal memory of previously-seen data, so each new computation can use information from several preceding computations. A bi-directional RNN can set data in the context of what comes both before and after in the signal.
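To make the idea of forward and backward context concrete, here is a minimal sketch of a one-unit bi-directional recurrent pass. It is a toy illustration only, not Oxford Nanopore's network: the weights `w_in` and `w_rec`, the function names, and the example signal are all assumptions chosen for readability.

```python
import math

def rnn_pass(signal, w_in, w_rec):
    """Run a one-unit recurrent layer over the signal; return the state at each step."""
    state = 0.0
    states = []
    for sample in signal:
        # The new state mixes the current sample with the memory of earlier samples.
        state = math.tanh(w_in * sample + w_rec * state)
        states.append(state)
    return states

def bidirectional_pass(signal, w_in=0.5, w_rec=0.8):
    """Pair each position's forward (past) context with its backward (future) context."""
    forward = rnn_pass(signal, w_in, w_rec)
    # Process the reversed signal, then flip the result back into signal order.
    backward = rnn_pass(signal[::-1], w_in, w_rec)[::-1]
    return list(zip(forward, backward))

signal = [0.1, 0.9, -0.4, 0.3]          # hypothetical current measurements
context = bidirectional_pass(signal)
altered = bidirectional_pass([0.1, 0.9, -0.4, 2.0])  # only the last sample differs
```

In this sketch, changing the last sample leaves the forward state at the first position untouched but changes its backward state, which is the sense in which a bi-directional network sees what comes both before and after.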
Oxford Nanopore's basecallers use neural networks that have been trained on a range of example DNA sequences (described in more detail in the Basecaller training section of Basecalling algorithms). The network learns how to translate the series of measurements into the sequence.
Oxford Nanopore basecallers
Basecaller | Algorithm | Availability |
---|---|---|
MinKNOW basecaller | Production basecaller on the device software. This is identical to the algorithm used by our stand-alone basecaller, but may be a version behind. | Available as a free download (further details in the MinKNOW protocol). Select the basecalling option when starting the sequencing experiment, and MinKNOW will display the experiment progress via the user interface. A Dorado-powered basecall server installed with MinKNOW is also available as a package for advanced users. ont-dorado-server is available as a free download, and is also included in MinKNOW installations. You can find the changelog (changelog.md) and the documentation file (DOCUMENTATION.md) in the root of the archives. |
Dorado basecaller | Dorado is the production basecaller that is also available in MinKNOW. | Available as a free download. You can run the executable version of the software on the host computer via the command line. Dorado is heavily optimised for NVIDIA A100 and H100 GPUs and will deliver maximal performance on systems with these GPUs. It is also expected to work on other NVIDIA GPUs with at least 8 GB of GPU memory and an architecture from Volta onwards. |
Research algorithms | Varied | Research algorithms are available through GitHub. The releases are varied, and often include features that will be included in future versions of the production basecaller. |
2. Basecalling algorithms
Fast, High Accuracy and Super Accurate models and compatibilities
The MinKNOW basecallers offer three different basecalling models: a Fast model, a High accuracy (HAC) model, and a Super accurate (SUP) model.
The Fast model is designed to keep up with data generation on Oxford Nanopore devices (MinION Mk1C, GridION, PromethION). The HAC model provides a higher raw read accuracy than the Fast model and is more computationally-intensive. The Super accurate model has an even higher raw read accuracy, and is even more intensive than the HAC model.
For more information about basecalling accuracy, see the Accuracy page on the Oxford Nanopore website.
A comparison of the speed of the models is provided in the table below:
The number of keep-up flow cells assumes a 30 Gbase flow cell output in 72 hours for MinION and GridION, and 100 Gbase output in 72 hours for PromethION.
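The output assumptions above translate directly into the sustained basecalling rate a model must reach to keep up with one flow cell. The short calculation below is a sketch of that arithmetic only; the function name is illustrative, and real throughput varies over the course of a run rather than being constant.

```python
SECONDS_PER_HOUR = 3600

def required_bases_per_second(output_gbases, run_hours):
    """Average basecalling rate needed to keep up with one flow cell."""
    return output_gbases * 1e9 / (run_hours * SECONDS_PER_HOUR)

# Assumptions taken from the text: 30 Gbase or 100 Gbase over 72 hours.
minion_rate = required_bases_per_second(30, 72)
promethion_rate = required_bases_per_second(100, 72)

print(f"MinION/GridION flow cell: {minion_rate:,.0f} bases/s")
print(f"PromethION flow cell:     {promethion_rate:,.0f} bases/s")
```

Under these assumptions a basecaller needs to average roughly 116 thousand bases per second per MinION or GridION flow cell, and roughly 386 thousand bases per second per PromethION flow cell.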
MinKNOW basecalling: keep-up vs catch-up
Basecalling with the Fast basecalling model can keep up with the speed of data acquisition on most nanopore platforms. High Accuracy basecalling keeps up on GridION, and with 18 flow cells on PromethION A-Series. When using the more computationally-intensive models, basecalling continues after the sequencing experiment has run to completion; any reads that have not been basecalled during the experiment will be queued and processed afterwards. This is known as “Catch-up mode”.
You therefore have two options: either to allow MinKNOW to continue in catch-up mode, or to stop the analysis and basecall the remaining reads at a later time, e.g. using stand-alone Dorado.
Calling modified bases
Base modifications, including 5mC, 5hmC, and 6mA for DNA and m6A for RNA, can be called from nanopore signal data. This requires the use of a designated basecalling model that is trained to identify base modifications. The simplest way to access these models is via MinKNOW on the device, or the standalone Dorado basecaller from GitHub. MinKNOW currently has models for 5mC + 5hmC (CG-context and all-context) and 6mA (all-context) for DNA, and an m6A model for RNA operating in a DRACH context. Standalone Dorado includes these models alongside other models, including 4mC + 5mC for DNA and pseudouridine for RNA. The basecalling software outputs modified base information in BAM files.
Several advanced options are also available for calling and analysing modified bases. Remora is a tool available on GitHub that provides the tools to prepare datasets, train modified base models and run simple inference. Another option is to use modkit (also available on GitHub) for post-processing base modifications after basecalling. Modkit creates summary counts of modified and unmodified bases in an extended bedMethyl format. bedMethyl files tabulate the counts of base modifications from every sequencing read over each reference genomic position.
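The kind of per-position tabulation that a bedMethyl file holds can be sketched as follows. This is a simplified illustration, not modkit's implementation or the actual bedMethyl column layout: the input tuples, the function name, and the output shape are assumptions.

```python
from collections import defaultdict

def summarise_calls(calls):
    """Tabulate per-read calls into per-position counts.

    `calls` is a hypothetical iterable of (contig, position, is_modified) tuples,
    one per read covering that reference position.
    """
    counts = defaultdict(lambda: {"modified": 0, "unmodified": 0})
    for contig, position, is_modified in calls:
        key = "modified" if is_modified else "unmodified"
        counts[(contig, position)][key] += 1

    summary = {}
    for site, c in sorted(counts.items()):
        total = c["modified"] + c["unmodified"]
        percent_modified = 100.0 * c["modified"] / total
        summary[site] = (c["modified"], c["unmodified"], percent_modified)
    return summary

calls = [
    ("chr1", 100, True),
    ("chr1", 100, False),
    ("chr1", 100, True),
    ("chr1", 250, False),
]
summary = summarise_calls(calls)
```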
If you wish to train your own all-context modified base calling models, we are now offering a limited developer release of the software tool Betta for the processing of “randomer” datasets. A randomer is a chemically synthesized oligonucleotide with a specific construct including a fixed width section of randomly inserted canonical bases. Betta provides a chemistry protocol and easy-to-use commands for generation and analysis of data from this construct design. The primary target of these pipelines is a Remora dataset for input into training a Remora modified base detection model. If you would like access to this tool, register your interest here.
Basecaller, consensus and variant caller model training
When developing basecalling, consensus, and variant-calling models using machine learning, Oxford Nanopore Technologies uses data from sequencing experiments. This data can be synthetic or derived from genomic sources. Model development is broken down into two broad categories: training (creating a model) and validation (showing that it works). The rest of this section will focus on training basecall models, although similar strategies apply to the other model types.
To train a DNA or RNA basecall model, sequencing experiments using a range of genomes are run to generate raw signal data (.pod5). This data is then prepared for training by selecting a representative subset of reads, basecalling them with Dorado, and aligning them to a ‘ground truth’ reference. Once the data is prepared, the new basecall model is trained using the Bonito software, which applies machine-learning methods to fit a model to the training dataset. Various additional parameters can be set to configure the training appropriately for the sequencing condition; these are described in further detail in the Bonito documentation.
Typically, the training dataset contains raw signal data from pod5 files including samples of human, C. elegans, and ZymoBIOMICS Microbial Community Standard sequencing experiments. The data includes both PCR-amplified reads and native reads that can contain base modifications. A portion of the reads and/or genomic locations are reserved for validating the model and not included in the training dataset.
Once trained, the quality of the model is validated using reads covering genomic regions that were not included in the training dataset. Validation assesses the following parameters:
- Alignment accuracy
- % of strands that align to the reference
- Identifying strand edges and barcodes
- Specific test cases such as low complexity and homopolymer sequences
- Basecalling in and around methylation motifs
- De novo genome assembly quality
- Consensus accuracy (with and without trained polishing models)
- Short variant calling (SNPs and indels, with and without trained polishing models)
- Structural variants
If the validation meets the minimum criteria and the new model is an improvement on the currently-released models, it is then included in Oxford Nanopore's production software.
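Reserving genomic locations for validation, as described in this section, can be pictured with a small sketch. The read identifiers, contig names, and the choice to hold out whole contigs are assumptions for illustration; they are not a description of Oxford Nanopore's actual pipeline.

```python
def split_by_region(reads, held_out_contigs):
    """Separate reads into training and validation sets by the contig they align to.

    `reads` is a hypothetical list of (read_id, contig) pairs.
    """
    training, validation = [], []
    for read_id, contig in reads:
        # Reads from held-out contigs are never shown to the model during training.
        if contig in held_out_contigs:
            validation.append(read_id)
        else:
            training.append(read_id)
    return training, validation

reads = [("read_a", "chr1"), ("read_b", "chr20"), ("read_c", "chr2")]
training, validation = split_by_region(reads, held_out_contigs={"chr20"})
```

Holding out by location rather than at random means validation reads cover sequence the model has not seen, which is what the validation step above relies on.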
3. Live analysis
Introduction to live basecalling in MinKNOW
For MinION Mk1B, Flongle on MinION Mk1B, and PromethION 2 Solo, the MinKNOW software presents an option to basecall reads on the local computer. The basecalling is carried out live, as the read files are generated during a sequencing experiment.
Basecalling results are displayed in real-time in the MinKNOW user interface, and data is written out in the BAM or FASTQ file format.
Live alignment in MinKNOW
Basecalled reads can be aligned to a reference during the sequencing run. To do this, you will need to upload a reference FASTA or .mmi file during run set-up, and optionally a BED file when there is a specific interest in a particular region of the reference (e.g. specific gene in a chromosome).
A reference file can contain multiple entries in the same file (e.g. multiple chromosomes), and alignment hits from these files are used to populate the alignment graphs which can be viewed on the MinKNOW UI. Alignment hits from BED files will appear in the sequencing .txt file generated in the data folder.
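As an illustration of how a BED file restricts attention to regions of interest, the sketch below parses the first three BED columns and checks whether an alignment hit overlaps a region. BED coordinates are 0-based with an exclusive end; the function names, region, and hit coordinates are assumptions, and this is not how MinKNOW itself is implemented.

```python
def parse_bed(text):
    """Parse the first three columns (contig, start, end) of BED-formatted text."""
    regions = []
    for line in text.splitlines():
        # Skip blank lines and optional header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split()[:3]
        regions.append((contig, int(start), int(end)))
    return regions

def overlaps(hit, region):
    """True if two half-open intervals on the same contig share at least one base."""
    hit_contig, hit_start, hit_end = hit
    region_contig, region_start, region_end = region
    return (hit_contig == region_contig
            and hit_start < region_end
            and region_start < hit_end)

regions = parse_bed("chr7\t1000\t2000\tgene_of_interest\n")
on_target = overlaps(("chr7", 1500, 9000), regions[0])
off_target = overlaps(("chr7", 2000, 9000), regions[0])
```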
4. On-demand basecalling using the Dorado software
Dorado basecalling software
Dorado is a data processing toolkit that contains Oxford Nanopore Technologies' basecalling algorithms and several bioinformatic post-processing features. It is run from the command line on Windows, macOS, and multiple Linux platforms. A selection of configuration files allows basecalling of DNA and RNA libraries made with Oxford Nanopore Technologies' current sequencing kits, on a varied range of flow cells.
The Dorado toolkit contains:
- The basecaller: The Dorado basecaller implements a neural network algorithm that allows raw data to be transformed into canonical bases of DNA or RNA, and several types of modified bases.
- Alignment: The user can provide a reference file in FASTA or minimap2 index format. If so, the reads are aligned against this reference via the integrated minimap2 aligner using the standard Oxford Nanopore Technologies preset parameters.
- Modified basecalling: It is possible to use Dorado to identify certain types of modified bases: currently 5mC, 5hmC, 4mC + 5mC and 6mA for DNA and m6A and pseudouridine for RNA. This requires the use of a specific basecalling model which is trained to identify both modified and unmodified bases.
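The three components above map onto a single Dorado invocation. The helper below only assembles a command line; it does not run Dorado. The `basecaller` subcommand and the `--reference` and `--modified-bases` options follow the usage shown in the Dorado GitHub README at the time of writing, but flags change between releases, so check `dorado basecaller --help` for your installed version. The helper name, the `pod5s/` directory, `ref.fasta`, and `calls.bam` are placeholders.

```python
import shlex

def dorado_basecall_command(model, pod5_dir, reference=None, modified_bases=None):
    """Assemble (but do not run) a Dorado basecalling command as an argument list."""
    cmd = ["dorado", "basecaller", model, pod5_dir]
    if reference:
        # Align reads to a FASTA or minimap2 index during basecalling.
        cmd += ["--reference", reference]
    if modified_bases:
        # Request one or more modified base models, e.g. "5mCG_5hmCG".
        cmd += ["--modified-bases", *modified_bases]
    return cmd

cmd = dorado_basecall_command("hac", "pod5s/", reference="ref.fasta")
# Dorado writes BAM to standard output, so the result is usually redirected to a file.
print(shlex.join(cmd) + " > calls.bam")
```

To actually run the command, the list could be passed to `subprocess.run` with standard output directed to a file opened in binary mode.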
GPU basecalling
Dorado is heavily optimised for NVIDIA A100 and H100 GPUs and will deliver maximal performance on systems with these GPUs.
Dorado has been tested extensively and is supported on the following systems:
Platform | GPU/CPU |
---|---|
Windows | (G)V100, A100, H100 |
Apple | M1, M1 Pro, M1 Max, M1 Ultra |
Linux | (G)V100, A100, H100 |
Systems not listed above but which have NVIDIA GPUs with ≥8 GB VRAM and architecture from Volta onwards have not been widely tested but are expected to work. AWS Benchmarks on NVIDIA GPUs are available here.
Dorado availability
The Dorado basecalling software is available free of charge to the Nanopore Community and on GitHub. More details on installing and running the software are found in the Dorado GitHub repository.
5. Basecall accuracy
Introduction to nanopore sequencing accuracy
Oxford Nanopore's sequencing accuracy is shown as one of several metrics:
- per-base quality score, denoted by the Phred Q-score
- raw read quality. This is calculated as an average from the Q-scores, and this average quality is calibrated against accuracy.
- raw read accuracy. Accuracy is calculated from an alignment to a reference sequence and counts insertions, deletions and substitutions as errors: accuracy = bases correct / (insertions + deletions + bases aligned). For example, 99% accuracy can be interpreted as 99 out of 100 bases in a read being called correctly. Note that Oxford Nanopore Technologies represents the average raw read accuracy as the modal per-read accuracy from a sequencing run.
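The metrics in this list can be worked through numerically. The sketch below uses the standard Phred definition (Q = -10 log10 of the error probability) and the accuracy formula given above. Averaging Q-scores via their error probabilities is an assumption about how the read-level average is formed, and the example counts are invented.

```python
import math

def read_accuracy(correct, aligned, insertions, deletions):
    """Raw read accuracy: bases correct / (insertions + deletions + bases aligned)."""
    return correct / (insertions + deletions + aligned)

def phred_to_error_probability(q):
    """Convert a Phred Q-score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Convert an error probability back to a Phred Q-score."""
    return -10 * math.log10(p)

def mean_read_quality(q_scores):
    """Average quality via mean error probability (an assumed averaging method)."""
    mean_p = sum(phred_to_error_probability(q) for q in q_scores) / len(q_scores)
    return error_probability_to_phred(mean_p)

# Hypothetical read: 995 aligned bases, 990 of them correct, 3 insertions, 2 deletions.
accuracy = read_accuracy(correct=990, aligned=995, insertions=3, deletions=2)
quality = mean_read_quality([10, 30])
```

Note that averaging a Q10 and a Q30 base this way gives about Q13 rather than Q20, because the low-quality base dominates the mean error probability.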
Sequencing accuracy is determined by multiple factors, such as pore chemistry and basecalling algorithms. Improvements in these areas over the last several years have led to a steady increase in both single-molecule and consensus accuracy.
For more information about our latest accuracy data, please see the Accuracy page on the Oxford Nanopore website.
6. Barcoding options
Barcode design
The Oxford Nanopore Technologies barcoding kits can place barcodes at the beginning and, for some kits, also at the end of the strands, so that several different samples can be multiplexed in one sequencing experiment. The barcodes reside in a kit-specific context sequence, and different kits have different lengths of sequence before and after the barcode. However, the sequences of the barcodes themselves are the same across almost all kits.
The regions of a barcode
A complete barcode arrangement comprises three sections:
- The upstream flanking region, which comes between the barcode and the sequencing adapter.
- The barcode sequence.
- The downstream flanking region, which comes between the barcode and the sample sequence.
The barcode sequences remain constant across almost all of Oxford Nanopore Technologies' kits. For example, the flanking regions for barcode 10 in the Rapid Barcoding Kit (SQK-RBK114.24) are different from the flanking regions for barcode 10 in the Rapid PCR Barcoding Kit (SQK-RPB114.24), but the barcode sequence itself is the same. The exception is Native Barcoding kits, where the barcodes are the reverse complement of the standard barcodes.
Barcode and barcode flanking sequences can be found in the Chemistry technical document in the Nanopore Community.
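The three-part arrangement, and the reverse-complement relationship for Native Barcoding kits, can be sketched as follows. All sequences here are invented placeholders, not real Oxford Nanopore barcode or flanking sequences; those are listed in the Chemistry technical document.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]

def barcode_arrangement(upstream_flank, barcode, downstream_flank):
    """Join the adapter-side flank, the barcode, and the sample-side flank."""
    return upstream_flank + barcode + downstream_flank

# Placeholder sequences for illustration only.
standard_barcode = "AAGC"
kit_a = barcode_arrangement("GG", standard_barcode, "TT")
kit_b = barcode_arrangement("CCC", standard_barcode, "A")
native_barcode = reverse_complement(standard_barcode)
```

The two hypothetical kits differ only in their flanks, while the barcode itself is shared; the native form is derived from the same barcode by reverse complementing it.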
Barcode demultiplexing options
After a barcoded sequencing run has completed, the reads can be split into folders by barcode, using one of Oxford Nanopore's demultiplexing tools:
- Real-time barcode demultiplexing in MinKNOW
- Post-run barcode demultiplexing in MinKNOW
- Barcode demultiplexing in the Dorado basecall server
A brief description of the options is provided below.
Barcode demultiplexing in MinKNOW
MinKNOW currently uses Dorado for both basecalling and barcode demultiplexing. It performs barcode demultiplexing in real-time, as the sequencing run progresses. MinKNOW demultiplexing is also available as a post-run analysis option.
Barcode demultiplexing in Dorado
The barcoding algorithm in the Dorado basecall server uses a modified Needleman-Wunsch method. Each barcode is aligned to a section of the basecall, with a score assigned to each base in the sequence depending on whether the base was a match, mismatch or a gap. The combined scores for each barcode alignment are compared, and the barcode with the highest score is chosen as long as the score is above the defined threshold. The barcode sequences can be trimmed from the reads, as a command-line option.
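The scoring idea can be illustrated with a plain textbook Needleman-Wunsch global alignment score. This is not Dorado's modified method: the match, mismatch, and gap values, the threshold, the barcode names, and the sequences are all assumptions chosen to keep the example small.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Textbook Needleman-Wunsch global alignment score, keeping one row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two bases, gap in b, or gap in a.
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def classify(read_section, barcodes, threshold):
    """Return the best-scoring barcode name, or None if no score exceeds the threshold."""
    scores = {name: nw_score(seq, read_section) for name, seq in barcodes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None

barcodes = {"barcode01": "ACGTAC", "barcode02": "TTGGCC"}  # placeholder sequences
assigned = classify("ACGTAC", barcodes, threshold=3)
unassigned = classify("GGGGGG", barcodes, threshold=3)
```

A read section that matches no barcode well enough falls below the threshold and is left unclassified, which mirrors the thresholding step described above.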
7. File formats
POD5 output
POD5 is an Oxford Nanopore-developed file format which stores nanopore data in an accessible way and replaces the legacy .fast5 format. POD5 data is also read and written faster, uses less compute, and has a smaller raw data file size than .fast5.
For more information about the POD5 schema and contents, refer to POD5 file format.
.fast5 output
.fast5 is a legacy file type that is used to write out nanopore sequencing data, and can still be selected as an output type in MinKNOW. .fast5 is a type of HDF5 file, which is designed to contain all information needed for analysing nanopore sequencing data and tracking it back to its source. Read .fast5 files contain raw sequencing data for each read, with a default of 4000 reads per file.
For more information about the .fast5 schema and contents, refer to the Oxford Nanopore Technologies .fast5 API.
Default read file location
Windows
C:\data\
Mac OS X
/Library/MinKNOW/
Linux
/var/lib/minknow/
Intermediate folder
The files in the intermediate folder store unprocessed raw signal data. Once raw signal processing is complete, POD5 or .fast5 files are generated and stored in the tmp folder, where local basecalling can proceed. These files are removed as processing proceeds or at the end of the run.
If the system encounters an issue, such as running out of space, the unprocessed data will not be cleared and will remain in the intermediate folder. Due to the real-time streaming nature of the system, this data cannot be processed after the run is stopped.
FASTQ output
FASTQ files are text files that contain sequence data for each read, and associated per-base quality scores. FASTQ files can be generated in MinKNOW, Dorado, and Guppy. The default is to write out 4000 reads per FASTQ file, although this number is configurable.
A single read sequence in a FASTQ file is described in four lines:
- Line 1 begins with a '@' and is followed by a header containing information about the sequencing run.
- Line 2 is the basecalled sequence (using A, C, T, G and N).
- Line 3 contains a '+'.
- Line 4 encodes the per-base quality scores for the sequence in Line 2.
An example of a FASTQ file is shown below:
@75be78f7-bd62-4972-92d2-aba16f465b0d runid=ff83cfafb0cb3bfc28ac370b841f59798ab3d63a sampleid=RB02_lambda_ovn1 read=19343 ch=53 start_time=2019-12-23T13:44:31Z
CGGTATTACTTCGTTCAGTTTCGGACAGGTGTTTTAACC[...]TCGTACCTAT
+
'%+-($&&&&'(':+7)-%(&$$.%##))868;;87/9;[...]68(*(2)/%$
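The four-line structure described above can be read with a few lines of code. The sketch below assumes the common Phred+33 quality encoding (Q-score = ASCII code minus 33) and a header made of a read ID followed by `key=value` fields, as in the example; the function name and the short test record are illustrative.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (read_id, tags, sequence, q_scores)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")

    # Line 1: read ID, then space-separated key=value fields about the run.
    read_id, *fields = header[1:].split()
    tags = dict(field.split("=", 1) for field in fields if "=" in field)

    # Line 4: one character per base, assuming Phred+33 encoding.
    q_scores = [ord(character) - 33 for character in quality]
    return read_id, tags, sequence, q_scores

record = ["@read1 ch=53 read=19343\n", "ACGT\n", "+\n", "#%+5\n"]
read_id, tags, sequence, q_scores = parse_fastq_record(record)
```

For routine work, an established parser such as those in pysam or Biopython is usually preferable to hand-written parsing.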
BAM output
BAM files are output by MinKNOW and stand-alone Guppy software if alignment has been performed on the basecalled dataset. BAM files are also output when using the modified base models in MinKNOW and Dorado.
8. Data analysis in EPI2ME
The EPI2ME platform
EPI2ME is a platform that encourages the development of bioinformatics skills with a focus on the analysis and exploration of nanopore sequence data. The platform provides simplified access to a collection of best-practice bioinformatics workflows that demonstrate solutions to common data analysis problems.
EPI2ME maintains a collection of Nextflow bioinformatics workflows tailored to Oxford Nanopore Technologies long-read sequencing data. They are curated and actively maintained by experts in long-read sequence analysis. Examples include alignment, variant calling, and metagenomics. Other more targeted applications include workflows for single cell transcriptomics, Pore-C chromatin conformation and tumour-normal paired sequencing for identification of somatic mutations.
The Nextflow bioinformatics workflows can all be run from the command line on Linux computers, servers, clusters and cloud resources. The EPI2ME software provides a graphical user interface for users who prefer to avoid the command line and is supported on Windows, macOS and Linux.
9. Oxford Nanopore Technologies tools and pipelines
Oxford Nanopore Technologies tools and pipelines
Oxford Nanopore Technologies' GitHub repository contains a number of data analysis tools created by our R&D division. Most of the tools require some bioinformatics knowledge and use of the command line. Examples of software available through this GitHub resource include experimental basecallers (Dorado, Remora), modkit for the refinement of base modification results, and medaka for the polishing of consensus sequences and calling of haploid variants.
10. Custom analysis
Third-party tools
The Oxford Nanopore Resource Centre collates all Community-developed data analysis tools (available under the "Tools" tab). Most tools are available on GitHub, and require some knowledge of bioinformatics and use of the command line.