Transforming Basecalling in Genomic Sequencing
By Sam Davis, Principal Scientist, Machine Learning, Oxford Nanopore Technologies
Genomic sequencing is fundamental to understanding biological systems and human genetics. At Oxford Nanopore, we have integrated the transformer architecture into our basecalling system, significantly improving sequencing accuracy.
Basecalling Fundamentals
Basecalling is the process of translating the raw electrical signals captured by our sequencing devices into a sequence of nucleotides (DNA or RNA). It is analogous to speech recognition, where spoken audio is transcribed into written text. In the context of genomic sequencing, our devices detect changes in electrical current caused by DNA or RNA strands passing through a nanopore.
Basecalling is challenging because our sequencing devices produce data at an enormous rate. Take our PromethION P48 sequencer as an example: it has 48 flow cells, each with up to 3,000 channels active at any one time. Each channel samples at 5,000 Hz, bringing the maximum total P48 output to 720 million samples per second!
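That figure follows directly from the device parameters; a quick back-of-the-envelope check:

```python
flow_cells = 48
channels_per_flow_cell = 3_000    # maximum active at any one time
sample_rate_hz = 5_000            # samples per channel per second

samples_per_second = flow_cells * channels_per_flow_cell * sample_rate_hz
print(f"{samples_per_second:,}")  # 720,000,000
```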
Figure 1: Diagram of internal LSTM operations
From LSTMs to Transformers
Oxford Nanopore’s basecallers have used recurrent neural networks since 2016. The current models, as of early 2024, use deep Long Short-Term Memory (LSTM) networks. LSTMs are a form of recurrent neural network that excel in settings where historical data points significantly influence future predictions.
However, LSTMs process data sequentially, which can limit training and inference speed, and they must compress temporal information into a fixed-size state vector, which can limit scalability. This becomes a critical factor when aiming for the highest basecalling accuracy in a compute-constrained inference environment.
Figure 2: Block diagram of the transformer basecaller
The shift to transformer models is driven by an architecture that allows parallel processing of all data points within a sequence and direct communication between any two of them via a mechanism known as attention. Transformer models have driven advances in numerous areas, including language modelling via, for example, OpenAI’s GPT and Meta’s Llama families of models, as well as protein structure prediction and protein language modelling through Google DeepMind’s AlphaFold and Meta’s Evolutionary Scale Modeling (ESM) models.
Our super-accuracy model, SUP, is our largest and most accurate configuration. Switching it from LSTM to the transformer architecture has advanced basecalling by improving a wide range of accuracy metrics at the same inference-compute budget. Internal research also shows accuracy metrics continue to improve as the training and inference-compute budgets are increased.
Figure 3: History of major nanopore basecaller releases
It is important to note that we are not doing away with LSTMs: they still form the foundation of our smaller fast and hac models, and at these inference-compute budgets LSTMs remain competitive. Their GPU implementations can process raw signal at an incredible rate, with hac basecalling able to keep up with a fully loaded PromethION P48 over a 72-hour run.
Transformer Basecaller Details
The transformer basecaller model is a deep neural network that can be decomposed into three groups of layers, as shown in Figure 2, each serving a distinct purpose (minimal code sketches follow the list):
- Preprocessing: The value of the raw signal at any given time is determined largely by the relatively short section of nucleotide sequence currently inside the nanopore. The signal value changes as the strand translocates through the pore, often with distinct jumps between levels. The raw signal is first processed by a small stack of 1D convolutional layers, which reduce the scalar signal to a shorter sequence of feature vectors containing local information, preparing the input for more complex operations.
- Transformer: A stack of transformer encoder blocks processes the output of the preprocessing layers into a form suitable for decoding. The encoder blocks exchange information between positions using the self-attention mechanism and independently apply a high-dimensional transformation to the feature vector at each position using the feedforward component. The encoder blocks use rotary position embeddings (RoPE), post-RMSNorm, SwiGLU, and DeepNet residual branch scaling.
- Decoding: The transformer output at each position is projected to a vector of scores for transitioning between states. The states correspond to k-mers – groups of k nucleotide bases – and the scores form a connectionist temporal classification conditional random field (CTC-CRF) output head. The CTC part allows the output sequence of k-mers to be shorter than the input sequence, and the CRF part introduces a dependence between the scores at neighbouring positions. The output head is trained to maximise the probability of an unaligned target sequence of k-mers given a piece of signal. At inference time, a beam search finds a good approximation to the most probable sequence of k-mers, which is then converted to a nucleotide sequence.
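To make the three-stage decomposition concrete, here is a minimal PyTorch sketch of the overall structure. All dimensions, layer counts, and kernel sizes are illustrative rather than our production configuration; the stock encoder layers below omit the RoPE, post-RMSNorm, SwiGLU, and DeepNet details described above, and the CTC-CRF head is reduced to a plain linear projection, with beam-search decoding not shown.

```python
import torch
import torch.nn as nn

class BasecallerSketch(nn.Module):
    def __init__(self, d_model=512, n_layers=4, n_states=4096):
        super().__init__()
        # Preprocessing: strided 1D convolutions turn the scalar signal
        # into a shorter sequence of feature vectors (6x downsampling here).
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=5, stride=1, padding=2),
            nn.SiLU(),
            nn.Conv1d(64, d_model, kernel_size=9, stride=6, padding=4),
            nn.SiLU(),
        )
        # Transformer: encoder blocks mix information across positions via
        # self-attention (stock layers; no RoPE/SwiGLU/RMSNorm variants here).
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=4 * d_model,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Decoding: project each position to transition scores over k-mer
        # states (stand-in for the CTC-CRF head; beam search not shown).
        self.head = nn.Linear(d_model, n_states)

    def forward(self, signal):                  # signal: (batch, time)
        x = self.conv(signal.unsqueeze(1))      # (batch, d_model, time / 6)
        x = self.encoder(x.transpose(1, 2))     # (batch, time / 6, d_model)
        return self.head(x)                     # per-position state scores

scores = BasecallerSketch()(torch.randn(2, 6000))
print(scores.shape)                             # torch.Size([2, 1000, 4096])
```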
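Of the encoder-block components listed above, rotary position embeddings are perhaps the least standard. A sketch of the general RoPE technique (not our exact implementation) rotates pairs of channels in the query and key vectors by position-dependent angles, so that attention scores depend on relative position:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (..., seq, dim) by position-dependent angles."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    pos = torch.arange(seq_len, dtype=x.dtype)
    freqs = base ** (-torch.arange(0, dim, 2, dtype=x.dtype) / dim)
    angles = pos[:, None] * freqs[None, :]           # (seq, dim / 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin             # 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 16, 64)       # (batch, seq, head_dim)
print(apply_rope(q).shape)       # torch.Size([1, 16, 64])
```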
Innovative Features:
- Sliding Window Multi-Head Attention: The attention mechanism within the transformer encoder blocks is limited to attending to a small window of neighbouring positions (see the mask sketch after this list). This has no impact on accuracy, as the information required to predict the base at a given position is contained within a relatively small region of the input signal. It also makes the model agnostic to input sequence length and, crucially, makes the computational complexity scale linearly, rather than quadratically, with sequence length.
- Internal Sequence Compression: The preprocessing layers reduce the raw input signal to a shorter sequence of feature vectors through strided convolutions. The sequence fed to the transformer stack is half the length of that given to the LSTM stack in the current SUP architecture, and the decoding layers include an additional linear upsampling layer that restores the sequence length by a factor of two. This extra downsampling and upsampling around the transformer substantially increases throughput, since the compute cost of the transformer stack is roughly halved (see the second sketch below).
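A minimal sketch of the sliding-window idea, assuming a boolean mask passed to PyTorch's scaled-dot-product attention (window size and shapes are illustrative; a production implementation would use a fused kernel that never materialises the full mask):

```python
import torch
import torch.nn.functional as F

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: True where position i may attend to j, i.e. |i - j| <= window."""
    idx = torch.arange(seq_len)
    return (idx[:, None] - idx[None, :]).abs() <= window

q = k = v = torch.randn(1, 8, 16, 64)            # (batch, heads, seq, head_dim)
mask = sliding_window_mask(seq_len=16, window=3)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)                                  # torch.Size([1, 8, 16, 64])
# Each query attends to at most 2 * window + 1 keys, so the attention cost
# grows linearly with sequence length instead of quadratically.
```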
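And a sketch of the compression idea, with made-up dimensions: halve the sequence length before the encoder, then restore it with a learned linear upsampling layer that emits two output vectors per position:

```python
import torch
import torch.nn as nn

d_model, seq_len = 512, 1000
x = torch.randn(2, seq_len, d_model)

down = nn.Conv1d(d_model, d_model, kernel_size=2, stride=2)  # halves length
up = nn.Linear(d_model, 2 * d_model)     # each position emits two vectors

h = down(x.transpose(1, 2)).transpose(1, 2)      # (2, 500, d_model)
# ... the transformer stack would run here on the half-length sequence,
# at roughly half the compute cost of running at full length ...
y = up(h).reshape(2, seq_len, d_model)           # (2, 1000, d_model)
print(h.shape, y.shape)
```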
Enhanced Performance and Capabilities
The integration of the transformer architecture into our SUP model has led to a dramatic increase in the accuracy of our basecaller.
To accommodate the intensive computational demands of transformer models, especially in a high-throughput setting like genomic sequencing, we have employed custom optimisations, including hand-tuned CUDA and Metal kernels. These give us precise control over NVIDIA and Apple GPU operations, maximising the throughput and efficiency of our basecalling pipeline.
The LSTM models use NVIDIA’s INT8 Tensor Cores, which speed up computation by approximating floating-point values with 8-bit integers, a process known as quantisation. This technique, while reducing numerical precision, maintains basecalling accuracy.
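As an illustration of the general idea (not our exact scheme), a symmetric per-tensor INT8 quantisation maps each weight to the nearest of 255 integer steps spanning the tensor's range:

```python
import torch

def quantise_int8(w: torch.Tensor):
    """Symmetric per-tensor quantisation: map the largest |weight| to 127."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4, 4)
q, scale = quantise_int8(w)
print((w - dequantise(q, scale)).abs().max())    # small quantisation error
```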
The LSTM model implementations also fully utilise the GPU devices on which they are run. This includes techniques such as persistent kernels that cache the recurrent LSTM weights in the GPU register files to maximise memory bandwidth.
We will continue to deploy model and implementation optimisations to both the LSTM and transformer models in the future.
Figure 4: PromethION 48
Prospects and Applications
We hope that the accuracy gained by integrating the transformer architecture into the basecaller will support our community in exploring new avenues of research and application in genomics, including detailed genomic studies, personalised medicine, and complex disease research.
Our technology is particularly poised to impact areas requiring the analysis of large-scale genomic data, such as cancer genomics, where understanding the full spectrum of genetic variations is vital.
Moreover, the adaptability of transformers to different computational platforms opens the door to alternative hardware acceleration. This has the potential to broaden the scope of genomic research and applications and deliver on Oxford Nanopore’s vision of enabling the analysis of anything, by anyone, anywhere.
Specific areas of future research of interest to our machine learning team include model compression techniques, which have the potential to reduce the computational cost of our models. We currently quantise our models so that operations execute using 8-bit integers, rather than 16- or 32-bit floats, making computation faster.
However, recent research suggests that even single-bit quantisation is possible. We are also looking to expand into other model compression techniques, such as knowledge distillation and various forms of model sparsity. Finally, we wish to explore alternative decoder heads, such as autoregressive next-base prediction, to understand the benefits and trade-offs they bring.
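For instance, a common form of knowledge distillation trains a small student model to match the softened output distribution of a larger teacher; a minimal sketch of that loss (hypothetical shapes, not a description of our training setup):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

student = torch.randn(8, 4096)   # per-position scores from a small model
teacher = torch.randn(8, 4096)   # scores from a large, accurate model
print(distillation_loss(student, teacher))
```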
In our commitment to openness and collaboration, we are releasing the transformer model’s source code for training and inference, as well as our model weights. This initiative invites the global research community to engage with, enhance, and adapt our technology, which we believe will accelerate innovations in genomic research and application.