How basecalling works
Nanopore sequencing is based on the principle that when a single molecule passes through a nanopore carrying an ionic current, the molecule disrupts that current, producing a characteristic electrical signal. In nucleic acid sequencing, this information-rich signal is decoded by basecalling algorithms to determine the DNA or RNA sequence in real time.
Capturing the signal
When sequencing DNA or RNA with nanopores, the changes in current caused by the strand as it passes through the pore are recorded by the MinKNOW™ software, which runs all of Oxford Nanopore's sequencing devices. The processive movement of bases through the pore leads to a continual change in current, known as the “squiggle”. MinKNOW processes the squiggle into reads in real time, each read corresponding to a single strand of DNA/RNA. These reads are written out into FAST5 files. This raw data contains information on both canonical bases and base modifications, such as methylation.
The nanopore determines the raw signal
The structure of the nanopore determines the raw signal and the information contained within the squiggle. Different nanopores contain different “readers” – the part of the nanopore which contributes most heavily to the blockage of current. The R9 nanopore has a single reader in the middle of the barrel. The R10 nanopore has two readers spaced along its length, meaning more bases within the DNA/RNA strand contribute to the squiggle at any one time. This improves signal capture around homopolymer regions, where the same nucleotide is repeated several times in a row on the DNA/RNA strand. Oxford Nanopore continually designs and tests new nanopores for improved signal characteristics.
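The effect of having more bases contribute to the signal can be illustrated with a deliberately simplified toy model. It assumes the current level depends only on the bases inside a fixed-width window, so an unchanged window means an unchanged signal. The window widths, the example strand and the function names are all hypothetical; this is not Oxford Nanopore's pore model, and it does not represent the actual geometry of R9 or R10.

```python
def kmers(sequence, k):
    """Return the successive k-base windows seen as the strand steps through the pore."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def longest_unchanged_run(sequence, k):
    """Length of the longest run of consecutive identical windows.

    In this toy model an identical window means an identical current level,
    so a long run is a stretch where the signal gives no cue that the strand moved.
    Assumes len(sequence) >= k.
    """
    windows = kmers(sequence, k)
    longest = current = 1
    for previous, window in zip(windows, windows[1:]):
        current = current + 1 if window == previous else 1
        longest = max(longest, current)
    return longest


strand = "ACGAAAATC"  # hypothetical strand containing a 4-base homopolymer (AAAA)
print(longest_unchanged_run(strand, k=1))  # 4: one base in the window, four identical steps
print(longest_unchanged_run(strand, k=3))  # 2: three bases in the window, fewer identical steps
```

With a wider window, the bases flanking the homopolymer keep the signal changing for longer, which leaves fewer steps where the length of the repeat has to be inferred from timing alone.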
The signal stored within FAST5 files is processed by basecalling algorithms, which decode it into a sequence of bases written to FASTQ files. Guppy, the production basecaller integrated within MinKNOW, can carry out basecalling live during the run, after a run has finished, or as a combination of the two.
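To make the output format concrete, the sketch below parses a single FASTQ record. It relies only on the standard FASTQ layout (an `@` header line, the sequence, a `+` separator, and a quality string encoded as Phred+33). The read identifier, sequence and quality string are invented for illustration, and the function name is hypothetical rather than part of any Oxford Nanopore tool.

```python
def parse_fastq_record(lines):
    """Parse the four lines of one FASTQ record into (read_id, sequence, quality_scores)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Phred+33: each quality character is the score plus 33, stored as ASCII.
    scores = [ord(char) - 33 for char in quality]
    return header[1:], sequence, scores


def error_probability(score):
    """Convert a Phred score Q into an error probability, 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


record = ["@read_001\n", "ACGT\n", "+\n", "I5+!\n"]  # invented example record
read_id, sequence, scores = parse_fastq_record(record)
print(read_id, sequence, scores)  # read_001 ACGT [40, 20, 10, 0]
print(error_probability(20))      # 0.01
```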
Oxford Nanopore also has a range of open-source Research Release basecallers, which are developed to implement new algorithms for improvements in accuracy, as well as alternative applications such as modified base detection. These Research Releases may not be optimised for speed or ease of use. There are also a number of open-source basecallers developed by the Nanopore Community.
The basecalling algorithms currently deployed by Oxford Nanopore are based on neural networks. These are loosely modelled on biological neural networks within the human brain, with connected “nodes” (equivalent to neurons) passing data between themselves to arrive at a predicted base sequence. Crucially, just like a human brain, these neural networks have the ability to learn and improve their predictions over time.
Guppy currently utilises a bi-directional recurrent neural network, where information can be passed back and forth between nodes. Alternative algorithms are continuously assessed for their suitability for basecalling, such as convolutional neural networks, which are typically used for image and video processing because the connections between their nodes mimic the organisation of the brain’s visual cortex.
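The general idea of a bi-directional recurrent network can be sketched in a few lines: one recurrent pass reads the signal from start to end, a second reads it from end to start, and the two hidden states are paired at each position so that every prediction can draw on context from both sides. This is a minimal illustration with a single hidden unit per direction and arbitrary, untrained weights. It is not Guppy's architecture, and all names and values are assumptions.

```python
import math


def run_rnn(signal, w_in, w_rec):
    """Simple recurrent pass with one hidden unit: h_t = tanh(w_in * x_t + w_rec * h_(t-1))."""
    hidden = 0.0
    states = []
    for sample in signal:
        hidden = math.tanh(w_in * sample + w_rec * hidden)
        states.append(hidden)
    return states


def bidirectional_states(signal, forward_weights, backward_weights):
    """Pair a left-to-right state with a right-to-left state at every position."""
    forward = run_rnn(signal, *forward_weights)
    # Run over the reversed signal, then flip back so index t lines up with sample t.
    backward = run_rnn(signal[::-1], *backward_weights)[::-1]
    return list(zip(forward, backward))


signal = [0.2, -0.5, 0.9, 0.1]  # hypothetical normalised current samples
for forward_state, backward_state in bidirectional_states(signal, (0.8, 0.5), (0.6, 0.4)):
    print(round(forward_state, 3), round(backward_state, 3))
```

In a real basecaller the paired states would feed further layers that output base probabilities. Here they only show that the state at the first sample already reflects later samples through the backward pass.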
The basecalling algorithms provided by Oxford Nanopore learn how to determine nucleotide sequence via machine learning. They are trained using data of a known sequence, which guides the algorithm to make correct predictions without human input. Once trained they are validated with a subset of reads not included in the training dataset.
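The separation between training and validation reads can be sketched as a simple hold-out split. The read identifiers, the 20% validation fraction and the function name below are hypothetical; the source does not describe how Oxford Nanopore partitions its datasets.

```python
import random


def split_reads(read_ids, validation_fraction=0.1, seed=0):
    """Shuffle read IDs reproducibly and hold out a fraction for validation."""
    shuffled = sorted(read_ids)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    # Validation reads are never shown to the model during training.
    return shuffled[n_validation:], shuffled[:n_validation]


read_ids = [f"read_{i:03d}" for i in range(20)]  # invented read identifiers
training, validation = split_reads(read_ids, validation_fraction=0.2)
print(len(training), len(validation))  # 16 4
```

Keeping the two sets disjoint means that accuracy measured on the validation reads reflects how the model handles data it has not seen, rather than how well it has memorised the training set.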
The training data strongly influences the performance of the basecaller, so the choice of training data can be used to create generic models (using genomes from multiple different organism types), species-specific models (trained on the genome of just one species) or even modification-aware basecallers (trained using native nucleic acid with base modifications still present). The default models within Guppy are trained on a mixture of native and amplified DNA/RNA from multiple organisms, including plant, animal, bacterial and viral genomes.
Machine learning: 'A type of artificial intelligence in which computers use huge amounts of data to learn how to do tasks rather than being programmed to do them'
Sequence once, keep seeing data improvements
The information-rich signal from the nanopore captures the full context of the DNA/RNA passing through it. Users therefore have the opportunity for greater insight and improved analysis of their existing data simply by basecalling with a newer algorithm, without the need to re-sequence their sample.
Models released within Guppy have shown consistent improvements in their results over time, from “raw-read” single molecule accuracy, to improved consensus accuracies and completeness of genome assemblies.
As MinKNOW streams signal data in real time, basecalling can begin even before the DNA/RNA strand has finished passing through the nanopore. Whilst basecalling on CPU is possible, deploying the basecallers on Graphics Processing Units (GPUs) is favoured because GPUs can calculate many values in parallel.
The Mk1C, GridION and PromethION devices all feature on-board GPUs for basecalling in real time, or for a faster time to answer than would be achievable with CPU when more intensive but more accurate basecalling models are used. Where data is generated at a higher rate than it can be basecalled, reads are queued and processed once the experiment completes. This is known as “catch-up” basecalling.
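The queueing behaviour behind catch-up basecalling can be approximated with simple arithmetic. The sketch below assumes constant rates for read generation and basecalling, which real runs will not have. The rates, the run length and the function name are invented for illustration and are not device specifications.

```python
import math


def simulate_backlog(reads_per_minute, basecalled_per_minute, run_minutes):
    """Return (reads still queued when the run ends, minutes needed to clear them)."""
    backlog = 0
    for _ in range(run_minutes):
        backlog += reads_per_minute                       # new reads arrive
        backlog -= min(backlog, basecalled_per_minute)    # basecaller drains what it can
    catch_up_minutes = math.ceil(backlog / basecalled_per_minute)
    return backlog, catch_up_minutes


# Hypothetical run: reads arrive faster than the chosen model can basecall them.
print(simulate_backlog(reads_per_minute=1000, basecalled_per_minute=800, run_minutes=60))
# (12000, 15)

# If the basecaller keeps up, nothing is left to process after the run.
print(simulate_backlog(reads_per_minute=800, basecalled_per_minute=1000, run_minutes=60))
# (0, 0)
```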
Cutting-edge performance and future analysis
Our technology continues to improve through updates to nanopores, algorithms and sequencing chemistries. By providing open access to the newest advances in research tools, we give researchers the ability to help shape the progress of nanopore technology whilst benefitting early from performance improvements. Progress in these research projects demonstrates the rapid trajectory of nanopore sequencing technology, with a number of avenues to pursue for further development.