Protocol
Guppy protocol V GPB_2003_v1_revAX_14Dec2018
Guppy is a bioinformatics toolkit that enables real-time basecalling and several post-processing features that work on Oxford Nanopore Technologies™ sequencing platforms.
For Research Use Only
This is a Legacy product. This software is no longer supported, and we recommend that all customers upgrade to the latest basecaller, Dorado, which is available in MinKNOW.
A Dorado-powered basecall server installed with MinKNOW is also available as a package for advanced users. ont-dorado-server is available as a free download from the Software Downloads page. You can find the changelog.md file containing the documentation file (DOCUMENTATION.md) in the root of the archives.
If you require further support for any ongoing critical experiments using a Legacy product, please contact Customer Support via email: support@nanoporetech.com.
Contents
Introduction
Downloads of Guppy
Quick Start
Guppy features, settings and analysis
- 8. Setting up a run: configurations and parameters
- 9. Input and output files
- 10. Guppy basecall server
- 11. Expert settings
- 12. Configuring a GPU version of Guppy with MinKNOW for MinION or PromethION 2 Solo on Windows
- 13. Configuring a GPU version of Guppy with MinKNOW for MinION or PromethION 2 Solo on Linux
Guppy toolkit
- 14. Barcoding/demultiplexing
- 15. Alignment
- 16. Calibration Strand detection
- 17. Modified base calling
- 18. Duplex basecalling
FAQ and troubleshooting
Overview
1. Guppy software overview
IMPORTANT
Guppy license
Guppy is available under the Oxford Nanopore Technologies Terms and Conditions.
Guppy basecalling software
Guppy is a data processing toolkit that contains Oxford Nanopore Technologies' production basecalling algorithms and several bioinformatic post-processing features. It is run from the command line on Windows, macOS, and multiple Linux platforms. Guppy is also integrated with our sequencing instrument software, MinKNOW, and a subset of Guppy features is available via the MinKNOW UI. A selection of configuration files allows basecalling of DNA and RNA libraries made with Oxford Nanopore Technologies’ current sequencing kits, in a range of flow cells.
The Guppy software contains many configurable parameters that can be used to specify exactly how the data analysis is performed. Adjusting some of these parameters requires a deep knowledge of nanopore data, and as such, Guppy is aimed at more advanced users. For those who are new to sequencing or have limited knowledge of sequencing data analysis, we recommend using the options presented in the MinKNOW software UI for basecalling.
Introduction to basecalling
Basecalling is the process of converting the electrical signals generated by a DNA or RNA strand passing through the nanopore into the corresponding base sequence of the strand. The general data flow in a nanopore sequencing experiment is shown below.
Raw data - a direct measurement of the changes in ionic current as a DNA/RNA strand passes through the pore, which are recorded by the MinKNOW software. MinKNOW also processes the signal into "reads", each read corresponding to a single strand of DNA/RNA. These reads are optionally written out as .fast5 files.
These .fast5 files use the HDF5 format to store data (http://www.hdfgroup.org/HDF5/); libraries exist to read and write these in many popular computer languages (e.g. R, Python, Perl, C, C++, Java).
Guppy also supports loading reads from the POD5 file format.
Basecalling - the raw signal is further processed by the basecalling algorithm to generate the base sequence of the read.
Basecalling is made up of a series of steps that are executed one by one. Ionic current measurements from the sequencing device are collected by the MinKNOW software and processed into a read. The reads are transformed into basecalls using mathematical models. The results of these analyses are written into FASTQ or BAM files, with a default of 4000 reads per file. Note that .fast5 file writing support, which has been deprecated for some time now, has been officially removed and is no longer available.
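As a rough illustration of the default batching described above, the number of output files for a run can be estimated from the read count. This is a minimal sketch only: the function name `estimate_output_files` is hypothetical, and it assumes each file is filled to the 4000-read default before a new one is started, ignoring any further splitting (for example by pass/fail status or barcode).

```shell
# Estimate how many output files a basecalling run produces.
# Assumption: each file is filled to the per-file limit before a new one starts.
estimate_output_files() {
    reads=$1
    per_file=${2:-4000}   # default of 4000 reads per file, as stated above
    # Integer ceiling division, so a partially filled final file is counted.
    echo $(( (reads + per_file - 1) / per_file ))
}

estimate_output_files 10000   # prints 3
```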
Guppy basecalling models are based on a Recurrent Neural Network (RNN)
The Guppy basecalling models are based on RNNs. For more information about RNNs, as well as other basecalling options and algorithms, please refer to the Data Analysis document in the Nanopore Community.
The Guppy toolkit contains:
The basecaller
The Guppy basecaller implements a neural network algorithm that allows raw data to be transformed into canonical bases of DNA or RNA, and several types of modified bases.
- Calibration strand detection: The basecaller is also capable of detecting calibration strands by aligning calibration sequences. Reads are aligned against a calibration reference using the basecalled data from an internally present DNA molecule in the flow cell. Calibration strands serve as a quality control for the pore and experimental processing. If the current read is identified as a calibration strand, no barcoding or alignment steps are performed.
- Adapter trimming: This is the processing and removal of the sequencing adapter (e.g. AMX, BAM, AMII) signal in the basecalled data:
  - For DNA adapters it will exclude the non-sequence adapter region up to a characteristic signal in the adapter that is recognised by the basecaller.
  - For (m)RNA, where the strands are sequenced in the 3' to 5' direction, it will attempt to exclude all data up to the polyA tail.
Barcoding/demultiplexing
The beginning and the end of each strand are aligned against the barcodes currently provided by Oxford Nanopore Technologies. Demultiplexing occurs directly from the basecalled results.
Alignment
The user can provide a reference file in FASTA or minimap2 index format. If so, the reads are aligned against this reference via the integrated minimap2 aligner using the standard Oxford Nanopore Technologies preset parameters.
Modified basecalling
It is possible to use Guppy to identify certain types of modified bases: currently 5mC. This requires the use of a specific basecalling model which is trained to identify both modified and unmodified bases.
Note that 1D^2 basecalling is no longer included in current versions of the Guppy software.
Current assumptions and limitations of Guppy
The Guppy basecalling software currently provides basecalling for 1D and duplex chemistry.
Read .fast5/.pod5 files, used as input to the basecalling software, must contain raw data. Raw data has been included by default in read files generated by the MinKNOW software for the last several years, so it should not be necessary to update them. Oxford Nanopore Technologies offers two sets of tools for working with .fast5 files that users may find helpful:
- ont_fast5_api: Provides a simple interface to the .fast5 format, including tools for converting between single- and multi-read formats.
- ont_h5_validator: Provides a tool for validating .fast5 file structures against official Oxford Nanopore Technologies file schemas.
Tools are also available for manipulating POD5 files in the pod5-file-format repository at https://github.com/nanoporetech/pod5-file-format .
Guppy provides configurations for currently-available chemistries and also provides a model compatible with data generated using older PromethION firmware.
Both the alignment and barcoding pipelines accept compressed and uncompressed FASTQ files as input. These can be generated either by the Guppy basecallers, or by the MinKNOW software.
General system requirements for running Guppy
These system requirements are guidelines - the actual amount of memory and disk space required to run Guppy tools will heavily depend on options and input data.
- 4 GB RAM plus 1 GB per thread for basecalling (more RAM may be required for duplex basecalling)
- Administrator access for .deb or .msi installers
- ~2 GB of drive space is required for installation. A minimum of 512 GB storage space for basecalled read files is recommended.
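The memory guideline above can be turned into a quick calculation when choosing a thread count. The sketch below is illustrative only: `estimate_ram_gb` is a hypothetical helper, and the result is the guideline figure, not a measured requirement (duplex basecalling and other options may need more).

```shell
# Guideline RAM for basecalling: 4 GB base plus 1 GB per thread.
estimate_ram_gb() {
    threads=$1
    echo $(( 4 + threads ))
}

estimate_ram_gb 8   # prints 12
```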
CPU and GPU basecalling with Guppy
Oxford Nanopore Technologies provides Guppy executables that can be run on Central Processing Units (CPUs) on Windows, Mac OS and Linux, or on Graphics Processing Units (GPUs) on Windows and certain Linux platforms:
Windows: ont-guppy-cpu .msi installer (CPU) or ont-guppy .msi installer (GPU)
macOS: ont-guppy-cpu .dmg installer (CPU only)
Linux CPU:
- ont-guppy-cpu .deb for Ubuntu 16
- ont-guppy-cpu .deb for Ubuntu 18
- ont-guppy-cpu .deb for Ubuntu 20
- ont-guppy-cpu .rpm for Centos 7
- ont-guppy-cpu .rpm for Centos 8
- ont-guppy-cpu .tar.gz – general Linux archives with pre-built binaries (compatible with most Linux versions)
Linux GPU:
- ont-guppy .deb for Ubuntu 16
- ont-guppy .deb for Ubuntu 18
- ont-guppy .deb for Ubuntu 20
- ont-guppy .rpm for Centos 7
- ont-guppy .rpm for Centos 8
- ont-guppy .tar.gz – general Linux archives with pre-built binaries (compatible with most Linux versions). On the ARM platform these archives are split into CUDA 9 and CUDA 10 versions (for use with Linux 4 Tegra running Ubuntu 16 and Ubuntu 18, respectively).
Note that GPU basecalling is only supported on Linux and Windows systems. Mac OS systems do not currently have NVIDIA GPUs or CUDA support.
GPU basecalling requires NVIDIA drivers which support a minimum CUDA version of:
- CUDA 10 for Linux 4 Tegra running Ubuntu 18
- CUDA 11.1 for Linux x86 systems
- CUDA 11.4 for Windows systems
Please note that Guppy is not currently compatible with CUDA 12.0 onwards. The last compatible version of the CUDA toolkit is 11.8. CUDA 11.8 can be downloaded from the NVIDIA download archive.
In general it is recommended to install the latest GPU drivers available for your system and graphics card. See the NVIDIA driver download page for details.
Using external GPUs can dramatically increase basecalling speed. Guppy works with only NVIDIA GPUs, and has been tested using the following specific models:
- NVIDIA Tesla V100
- NVIDIA Quadro GV100
- NVIDIA GTX1080Ti
- NVIDIA Jetson TX2
- NVIDIA Jetson Xavier
If working with a different model of NVIDIA GPU than those listed above, the Guppy software requires CUDA Compute Capability 6.1 or higher (for more information about CUDA-enabled GPUs, see the NVIDIA website).
It is possible to use other NVIDIA GPUs for basecalling, however Oxford Nanopore Technologies develops and tests software on the models stated above, so support for other models is limited.
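To check an untested GPU against the compute capability requirement, the value published by NVIDIA for that card can be compared with the minimum. The following is a minimal sketch, assuming the minimum is inclusive (6.1 or higher, as the Linux installation notes state) and that the capability is written as `<major>.<minor>`; `compute_capability_ok` is a hypothetical helper, not part of Guppy.

```shell
# Return success (0) if a CUDA compute capability such as "6.1" or "7.0"
# meets the 6.1 minimum.
# Assumption: the value is given as "<major>.<minor>".
compute_capability_ok() {
    major=${1%%.*}   # text before the first dot
    minor=${1#*.}    # text after the first dot
    [ "$major" -gt 6 ] || { [ "$major" -eq 6 ] && [ "$minor" -ge 1 ]; }
}

if compute_capability_ok 7.0; then
    echo "supported"
else
    echo "not supported"
fi
```

Comparing the major and minor parts as integers avoids relying on floating-point comparison in the shell.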
Fast, High Accuracy and Super Accurate models and compatibilities
The MinKNOW basecallers offer three different basecalling models: a Fast model, a High accuracy (HAC) model, and a Super accurate (SUP) model.
The Fast model is designed to keep up with data generation on Oxford Nanopore devices (MinION Mk1C, GridION, PromethION). The HAC model provides a higher raw read accuracy than the Fast model and is more computationally-intensive. The Super accurate model has an even higher raw read accuracy, and is even more intensive than the HAC model.
For more information about basecalling accuracy, see the Accuracy page on the Oxford Nanopore website.
A comparison of the speed of the models is provided in the table below:
The number of keep-up flow cells assumes a 30 Gbase flow cell output in 72 hours for MinION and GridION, and 100 Gbase output in 72 hours for PromethION.
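The keep-up assumptions above imply an average basecalling throughput per flow cell, which can be worked out directly. This is a sketch under the simplifying assumption that data is generated evenly over the whole run (in practice output varies over time); `keep_up_bases_per_second` is a hypothetical helper name.

```shell
# Average throughput (bases per second) needed to keep up with one flow cell,
# given its output in Gbases and the run length in hours.
# Note: requires a shell with 64-bit integer arithmetic.
keep_up_bases_per_second() {
    gbases=$1
    hours=$2
    echo $(( gbases * 1000000000 / (hours * 3600) ))
}

keep_up_bases_per_second 30 72    # MinION/GridION assumption: prints 115740
keep_up_bases_per_second 100 72   # PromethION assumption: prints 385802
```

Under these assumptions, keeping up with a single flow cell needs roughly 116 kbases/s for MinION or GridION and roughly 386 kbases/s for PromethION.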
Basecalling speed for Guppy
Aside from the basecalling model, the time taken to basecall a folder of reads depends on the specifications of the computer, the number of threads assigned, the options which Guppy is invoked with, and the number of reads analysed. Guppy is optimised for NVIDIA GPUs using CUDA, and can perform several orders of magnitude faster running on a modern GPU compared to a standard desktop CPU.
2. Windows
IMPORTANT
Guppy can only be installed from the Administrator account.
Supported platforms for Guppy
- 64-bit Windows 10
- 64-bit Windows 7 SP1 and the Windows 10 Universal C Runtime
IMPORTANT
Additional requirements for Windows 7 to install Guppy:
Guppy is optimised for Windows 10. If you are using earlier versions of Windows, please download Windows 10 Universal C Runtime in order for Guppy to work.
If Universal C Runtime is not installed, the following error message may appear:
The procedure entry point ucrtbase.terminate could not be located in the dynamic link library api-ms-win-crt-runtime-l1-1-10.dll
Download the .msi installer for CPU or GPU.
The installer can be found on the Software Downloads page of the Community.
Double-click on the installer.
Follow the prompts from the executable to install Guppy.
The default install location is:
C:\Program Files\OxfordNanopore\ont-guppy-cpu
3. Linux
IMPORTANT
Guppy can only be installed from the Administrator account.
Choosing a Guppy package
Depending on whether you are basecalling on a CPU or a GPU, you will need to choose an appropriate Guppy package. These packages are available as separate installers on the Software Downloads page:
- GPU basecalling is available in the `ont-guppy` package and will require a GPU driver for basecalling to work. It is also possible to perform CPU basecalling using this package. You will need to specify the GPU device(s) you want to use when setting up your experiment, otherwise Guppy will default to CPU calling.
- CPU-only basecalling is available in the `ont-guppy-cpu` package.
Supported platforms for Guppy
Debian packages:
- 64-bit Ubuntu 16 amd64 (for either GPU-enabled Guppy or CPU-only Guppy)
- 64-bit Ubuntu 16 arm64v8 (for GPU-enabled Guppy only)
RPM packages:
- Centos 7 (for either GPU-enabled Guppy or CPU-only Guppy)
- Centos 8 (for either GPU-enabled Guppy or CPU-only Guppy)
Archive package:
- Most 64-bit amd64 Linux platforms (for either GPU-enabled Guppy or CPU-only Guppy – the package was built on Centos 7)
- 64-bit Ubuntu 16 arm64v8 (for GPU-enabled Guppy only - the package was built on a Jetson TX2)
GPU devices:
- A supported NVIDIA GPU for the ont-guppy packages
From version 6.1.0 onwards, Guppy no longer officially supports Ubuntu 16 or earlier, as that version of Ubuntu has reached its End of Life (EOL) date. It should still be possible to run the Guppy archive releases on that platform, although it is not explicitly supported.
Use this installation process if you are installing from .deb for Guppy:
1. Add Oxford Nanopore's deb repository to your system (this is to install Oxford Nanopore Technologies-specific dependency packages):
sudo apt update
sudo apt install wget lsb-release
export PLATFORM=$(lsb_release -cs)
wget -O- https://cdn.oxfordnanoportal.com/apt/ont-repo.pub | sudo apt-key add -
echo "deb http://cdn.oxfordnanoportal.com/apt ${PLATFORM}-stable non-free" | sudo tee /etc/apt/sources.list.d/nanoporetech.sources.list
sudo apt update
2. To install the .deb for Guppy, use the following command:
sudo apt update
sudo apt install ont-guppy
This will install the GPU version of Guppy.
or:
sudo apt update
sudo apt install ont-guppy-cpu
This will install the CPU-only version of Guppy.
Use this installation process if you are installing from .tar.gz:
- Download the .tar.gz archive file. This can be found on the Software Downloads page of the Nanopore Community.
- Unpack the archive:
tar -xf ont-guppy_xxx_linux<64 or aarch64>.tar.gz
or to unpack the CPU-only version of Guppy:
tar -xf ont-guppy-cpu_xxx_linux<64 or aarch64>.tar.gz
Note: 'xxx' in the command denotes the version number e.g. 3.0.3.
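The archive name pattern above can be filled in from the version number and the machine architecture. The sketch below is illustrative: `guppy_archive_name` is a hypothetical helper, and it assumes that `uname -m` reports `x86_64` for the `linux64` archives and `aarch64` for the ARM archives. The actual file names on the Software Downloads page may carry additional suffixes (for example, the CUDA-specific ARM archives), so check the downloaded file name before relying on it.

```shell
# Build an archive file name from the pattern shown above.
# Assumption: "linux64" is used on x86_64 and "linuxaarch64" on ARM64.
guppy_archive_name() {
    package=$1                  # ont-guppy or ont-guppy-cpu
    version=$2                  # e.g. 3.0.3
    machine=${3:-$(uname -m)}   # defaults to the current machine
    case "$machine" in
        x86_64)  suffix=64 ;;
        aarch64) suffix=aarch64 ;;
        *) echo "unsupported architecture: $machine" >&2; return 1 ;;
    esac
    echo "${package}_${version}_linux${suffix}.tar.gz"
}

guppy_archive_name ont-guppy-cpu 3.0.3 x86_64   # prints ont-guppy-cpu_3.0.3_linux64.tar.gz
```

The result can then be passed to the unpack step, for example `tar -xf "$(guppy_archive_name ont-guppy 3.0.3)"`.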
Use this installation process if you are installing from .rpm:
1. Download the .rpm file. This can be found on the Software Downloads page of the Nanopore Community.
2. Install the epel-release repository:
yum install epel-release
If using Centos 8, it will also be necessary to enable the powertools repository. Depending on the version of Centos 8 you have, the name of the repository may be slightly different:
yum install dnf-plugins-core
yum config-manager --set-enabled PowerTools || yum config-manager --set-enabled powertools
3. Install the rpm:
yum install [path]/[to]/ont-guppy_xxx.rpm
This command installs the GPU version of Guppy.
or
yum install [path]/[to]/ont-guppy-cpu_xxx.rpm
This command installs the CPU-only version of Guppy.
The GPU-enabled ont-guppy package will not install a GPU driver by default – it will be necessary to handle installing this yourself, e.g. by visiting NVIDIA's website. Guppy requires an NVIDIA driver of at least version 455.
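Since the driver has to be installed separately, it can be useful to confirm that the installed version meets the minimum of 455. The following is a minimal sketch: `driver_version_ok` is a hypothetical helper that compares only the major version number, and the `nvidia-smi` query in the comment is one common way to obtain the version string on a system where the NVIDIA tools are installed.

```shell
# Return success (0) if an NVIDIA driver version string (e.g. "470.57.02")
# has a major version of at least 455.
driver_version_ok() {
    major=${1%%.*}   # keep only the text before the first dot
    [ "$major" -ge 455 ]
}

# On a machine with the NVIDIA driver installed, the version could be read with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
if driver_version_ok "470.57.02"; then
    echo "driver OK"
else
    echo "driver too old"
fi
```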
(Ubuntu 16 only) Using unsupported NVIDIA GPUs
Starting with Ubuntu 18.04, NVIDIA driver versioning was handled differently, and Guppy debs will automatically attempt to install the latest driver version that they can find. This should mean that Guppy will work with any recently-released GPU that has the correct driver installed without any additional work required. On Ubuntu 16.04, newer GPUs may require more up-to-date versions of NVIDIA drivers and CUDA APIs than what the ont-guppy deb recommends. It is possible to install the ont-guppy deb without installing any GPU drivers by not installing the recommended ones, leaving it to the user to ensure that their system has the required minimum drivers:
apt-get install ont-guppy --no-install-recommends
In general NVIDIA APIs and drivers are backwards-compatible, so it will usually be safe to install a later/higher version than Guppy requires. Check for the minimum driver version that Guppy requires using `apt-cache show`:
$ apt-cache show ont-guppy
Package: ont-guppy
Version: 0.0.0-1~xenial
Architecture: amd64
Depends: libc6 (>= 2.23), libcurl4-openssl-dev, libssl-dev, libhdf5-cpp-11, libzmq5, libboost-atomic1.58.0, libboost-chrono1.58.0, libboost-date-time1.58.0, libboost-filesystem1.58.0, libboost-program-options1.58.0, libboost-regex1.58.0, libboost-system1.58.0, libboost-log1.58.0
Recommends: nvidia-384, libcuda1-384
[...]
In this example the driver version (384) can be seen as part of the `nvidia` and `libcuda1` package names in the "Recommends" section. Driver versions higher than 384 will likely be compatible with this version of Guppy.
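Reading the driver version off the "Recommends" line can also be scripted. This is a sketch only, assuming the line names a package of the form `nvidia-<version>` as in the example output above; `recommended_driver_version` is a hypothetical helper.

```shell
# Extract the recommended NVIDIA driver version from `apt-cache show` output
# read on standard input.
# Assumption: the Recommends line names a package of the form nvidia-<version>.
recommended_driver_version() {
    sed -n 's/^Recommends:.*nvidia-\([0-9][0-9]*\).*/\1/p'
}

# Demonstration using the two relevant lines from the example output above.
printf 'Package: ont-guppy\nRecommends: nvidia-384, libcuda1-384\n' |
    recommended_driver_version   # prints 384
```

On a system with the repository configured, the same function could be used as `apt-cache show ont-guppy | recommended_driver_version`.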
Note that Guppy only supports GPUs with an NVIDIA compute version of 6.1 or higher.
4. macOS
CPU basecalling for Guppy on macOS
Only CPU basecalling and post-processing features are available in Guppy for macOS in the `ont-guppy-cpu` toolkit.
Supported platforms:
- 64-bit OSX 10.11 (El Capitan) or higher
Download the .zip archive.
This can be found on the Software Downloads page in the Community.
Unzip the archive to a location of your choice.
Note: You may require Administrator access depending on the location into which you unzip the archive.
5. Windows
Command prompt
Launch a command prompt. To do this, click on Start and type "Command prompt" in the search box, then click the link. You will be running all Guppy operations from here.
Required parameters
To start a basecalling run with Guppy, you will need to type a command that specifies, as a minimum, the following parameters:
- The configuration file to use*
- The full path to the directory where the raw read files are located
- The full path to the directory where the basecalled files will be saved
* Rather than explicitly specifying a configuration file, Guppy can automatically select the appropriate configuration when the user specifies the flow cell and kit that have been used for the experiment.
Default parameters
If you only specify the parameters listed above, Guppy will run with a number of other parameters set at their default values, for example:
- Q-score filtering will be set to ON
- File compression will be set to OFF
- Sequence reversal and U to T substitution for RNA will be set to OFF
For a full list of all the optional parameters and their default values, refer to the “Setting up a run: configurations and parameters” section of the protocol.
Command-line entries for basecalling
To basecall reads with Guppy, you will need to use the following command and arguments:
"C:\Program Files\OxfordNanopore\ont-guppy-cpu\bin\guppy_basecaller.exe"
- `--input_path`: Full or relative path to the directory where the raw read files are located. The path can be absolute (e.g. `C:\data\my_reads`) or relative to the current working directory (e.g. `..\my_reads`).
- `--save_path`: Full or relative path to the directory where the basecall results will be saved. The path can be absolute or relative to the current working directory. If this folder does not exist, it will be created at the path you provide (a relative path is resolved against the current working directory).
Then either:
- `--config`: configuration file containing Guppy parameters
or
- `--flowcell`: flow cell version
- `--kit`: sequencing kit version
IMPORTANT
Guppy executable in Windows:
By default, the Guppy executables (such as `guppy_basecaller.exe`) will not be on the Windows path. As a result, you will need to type the full path to an executable in order to use it.
Below is an example of the full file path of a Guppy executable from the Guppy toolkit:
C:\Program Files\OxfordNanopore\ont-guppy-cpu\bin\guppy_basecaller.exe
General help command-line options:
For general help, the command-line options `-h` or `--help` are provided within the Guppy toolkit:
"C:\Program Files\OxfordNanopore\ont-guppy-cpu\bin\guppy_basecaller.exe" -h
or
"C:\Program Files\OxfordNanopore\ont-guppy-cpu\bin\guppy_basecaller.exe" --help
To list the available flow cell and kit combinations, and their associated config files, use the command:
"C:\Program Files\OxfordNanopore\ont-guppy-cpu\bin\guppy_basecaller.exe" --print_workflows
The `--version` option shows the version of Guppy that is installed.
6. Linux
Command prompt:
Launch a terminal on your system. For example, when using Ubuntu, you can press Ctrl + Alt + T. You will run all Guppy operations from here.
Required parameters
To start a basecalling run with Guppy, you will need to type a command that specifies, as a minimum, the following parameters:
- The configuration file to use*
- The full path to the directory where the raw read files are located
- The full path to the directory where the basecalled files will be saved
* Rather than explicitly specifying a configuration file, Guppy can automatically select the appropriate configuration when the user specifies the flow cell and kit that have been used for the experiment.
Default parameters
If you only specify the parameters listed above, Guppy will run with a number of other parameters set at their default values, for example:
- Q-score filtering will be set to ON
- File compression will be set to OFF
- Sequence reversal and U to T substitution for RNA will be set to OFF
For a full list of all the optional parameters and their default values, refer to the “Setting up a run: configurations and parameters” section of the protocol.
Command-line entries for basecalling
To basecall reads with Guppy, you will need to use the following command and arguments:
guppy_basecaller (or the fully-qualified path if using the archive installer)
- `--input_path`: Full or relative path to the directory where the raw read files are located. The path can be absolute (e.g. `/data/my_reads`) or relative to the current working directory (e.g. `../my_reads`).
- `--save_path`: Full or relative path to the directory where the basecall results will be saved. The path can be absolute or relative to the current working directory. If this folder does not exist, it will be created at the path you provide (a relative path is resolved against the current working directory).
Then either:
- `--config`: configuration file containing Guppy parameters
or
- `--flowcell`: flow cell version
- `--kit`: sequencing kit version
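The two ways of selecting a configuration can be seen side by side by assembling the command line from its parts. The sketch below only prints the command rather than running it; `build_basecall_cmd` is a hypothetical helper, the folder names are placeholders, and the flow cell, kit and config names are examples taken from elsewhere in this protocol.

```shell
# Print (rather than run) a guppy_basecaller command line.
# Usage: build_basecall_cmd <input_dir> <save_dir> <config>
#    or: build_basecall_cmd <input_dir> <save_dir> <flowcell> <kit>
# Note: paths containing spaces would need quoting before the command is run.
build_basecall_cmd() {
    input=$1
    save=$2
    if [ "$#" -eq 3 ]; then
        selector="--config $3"
    elif [ "$#" -eq 4 ]; then
        selector="--flowcell $3 --kit $4"
    else
        echo "expected 3 or 4 arguments" >&2
        return 1
    fi
    echo "guppy_basecaller --input_path $input --save_path $save $selector"
}

build_basecall_cmd input_folder output_folder/basecall dna_r9.4.1_450bps_fast.cfg
build_basecall_cmd input_folder output_folder/basecall FLO-MIN106 SQK-LSK109
```

Printing the command first makes it easy to check the paths and the chosen configuration before starting a long basecalling run.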
TIP
Alternative commands:
On Linux-based platforms, it is also possible to pipe files into Guppy, as follows:
ls input_folder/*.fast5 | guppy_basecaller --save_path output_folder/basecall --config dna_r9.4.1_450bps_fast.cfg
Additional options can be specified to enable different basecalling features and output formats and these are discussed further in later sections.
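When piping files into Guppy, the file list can come from any command that prints one path per line. The sketch below uses `find` instead of `ls`, which avoids shell argument-length limits in folders with very many files; `list_fast5` is a hypothetical helper, and the folder names in the comment are placeholders.

```shell
# List the .fast5 files directly inside a directory, one per line, sorted.
list_fast5() {
    find "$1" -maxdepth 1 -type f -name '*.fast5' | sort
}

# Example of piping a subset (the first 100 files) into Guppy:
#   list_fast5 input_folder | head -n 100 |
#       guppy_basecaller --save_path output_folder/basecall --config dna_r9.4.1_450bps_fast.cfg
```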
General help command-line options:
For general help, the command-line options `-h` or `--help` are provided within the Guppy toolkit:
guppy_basecaller -h
or
guppy_basecaller --help
To list the available flow cell and kit combinations, and their associated config files, use the command:
guppy_basecaller --print_workflows
Or if the model versions are not required, for faster display use:
guppy_basecaller --print_workflows --skip_model_versions
The `--version` option shows the version of Guppy that is installed.
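The `--print_workflows` listing can be filtered to find the config that would be selected for a particular flow cell and kit. This is a sketch under an assumption about the layout: each row is whitespace-separated, starts with the flow cell and kit, and ends with the config name followed by the model version, as in the sample output shown in the "Config files - selecting kit and flow cell" section. `workflow_config` is a hypothetical helper, and it is not intended for output produced with `--skip_model_versions`, where the last column would differ.

```shell
# Read a --print_workflows listing on standard input and print the config
# name for one flow cell and kit.
# Assumption: the config name is the second-to-last column of each row.
workflow_config() {
    awk -v fc="$1" -v kit="$2" '$1 == fc && $2 == kit { print $(NF - 1) }'
}

# Demonstration using two rows copied from the sample output in this protocol.
printf '%s\n' \
    'FLO-MIN114 SQK-LSK114 dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2' \
    'FLO-MIN114 SQK-NBD114-24 included dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2' |
    workflow_config FLO-MIN114 SQK-NBD114-24
```

Counting columns from the end of the row keeps the lookup correct whether or not the optional barcoding column is filled in. With Guppy installed, the same function could be used as `guppy_basecaller --print_workflows | workflow_config FLO-MIN114 SQK-LSK114`.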
7. macOS
Command prompt:
Open a command-line terminal. Open your Applications folder, then open the Utilities folder. Click on the Terminal application to open it.
IMPORTANT
Before starting:
You must find where the unzipped 'Guppy archive' is located – this will give you the path you will need to enter in order to run the Guppy executables.
For example, if you extracted the .zip archive to `/Users/myuser/ont-guppy-cpu`, then you can run the 1D basecaller using this:
/Users/myuser/ont-guppy-cpu/bin/guppy_basecaller
Required parameters
To start a basecalling run with Guppy, you will need to type a command that specifies, as a minimum, the following parameters:
- The configuration file to use*
- The full path to the directory where the raw read files are located
- The full path to the directory where the basecalled files will be saved
* Rather than explicitly specifying a configuration file, Guppy can automatically select the appropriate configuration when the user specifies the flow cell and kit that have been used for the experiment.
Default parameters
If you only specify the parameters listed above, Guppy will run with a number of other parameters set at their default values, for example:
- Q-score filtering will be set to ON
- File compression will be set to OFF
- Sequence reversal and U to T substitution for RNA will be set to OFF
For a full list of all the optional parameters and their default values, refer to the “Setting up a run: configurations and parameters” section of the protocol.
Command-line entries for basecalling
To basecall reads with Guppy, you will need to use the following command and arguments:
guppy_basecaller
- `--input_path`: Full or relative path to the directory where the raw read files are located. The path can be absolute (e.g. `/Users/myuser/my_reads`) or relative to the current working directory (e.g. `../my_reads`).
- `--save_path`: Full or relative path to the directory where the basecall results will be saved. The path can be absolute or relative to the current working directory. If this folder does not exist, it will be created at the path you provide (a relative path is resolved against the current working directory).
Then either:
- `--config`: configuration file containing Guppy parameters
or
- `--flowcell`: flow cell version
- `--kit`: sequencing kit version
General help command-line options:
For general help, the command-line options `-h` or `--help` are provided within the Guppy toolkit:
/Users/myuser/ont-guppy-cpu/bin/guppy_basecaller -h
or
/Users/myuser/ont-guppy-cpu/bin/guppy_basecaller --help
To list the available flow cell and kit combinations, and their associated config files, use the command:
/Users/myuser/ont-guppy-cpu/bin/guppy_basecaller --print_workflows
The `--version` option shows the version of Guppy that is installed.
8. Setting up a run: configurations and parameters
Config files - variable parameters
In addition to the input and output paths, Guppy must know which basecalling configuration to use. This can be provided in one of two ways:
- By selecting a config file:
  - Config (`-c` or `--config`): either the name of the config file to use, or a full path to a config file (see the section below). If the argument is only the name of a config file, then it must correspond to one of the standard configuration files provided by the package.
- Or by selecting a flow cell and a kit:
  - Flow cell (`-f` or `--flowcell`): the name of the flow cell used for sequencing (e.g. FLO-MIN106).
  - Kit (`-k` or `--kit`): the name of the kit used for sequencing (e.g. SQK-LSK109).
Note: If you use the `--config` argument, then the `--flowcell` and `--kit` arguments are not needed and will be ignored.
Choosing a config file for Guppy
Guppy contains several types of basecalling configurations, many of which are not available by using the flow cell and kit selector. These models will usually have their own config file, and they may then be used with the `--config` argument.
Generally speaking, the configuration file names are structured as follows:
<strand_type>_<pore_type>_<enzyme_type>_[modbases_specifier]_<model_type>_[instrument_type].cfg
- `strand_type`: This will be either the string "dna" or "rna", depending on the type of sequencing being performed.
- `pore_type`: The pore the basecalling model was trained for, indicated by the letter "r" followed by a version number. For example: "r9.4.1" or "r10.4".
- `enzyme_type`: The enzyme motor the model was trained for. This will either be the letter "e" followed by a version number, or a number indicating the enzyme speed, followed by "bps". For example: "e8.1" or "450bps".
- `modbases_specifier`: Optional. If specified, indicates that modified base detection will be performed. This will be the string "modbases_" followed by an indicator of the modification supported, such as "5mc_cg" or "5hmc_5mc_cg".
- `model_type`: The type of basecalling model to use, depending on whether you want optimal basecalling speed or accuracy. See below.
- `instrument_type`: Optional. If this is not specified, then the configuration is targeted to a GridION device or a PC. The strings "mk1c" or "prom" are used to indicate that the configuration parameters and model are optimised for the MinION Mk1C or PromethION devices, respectively. Note that if the kit and flow cell are specified on the command-line instead of a specific config file, then the config file chosen will be one without an instrument type specified.
The model types are:
- `sup`: Super-accurate basecalling.
- `hac`: High accuracy basecalling. These are the configurations that will be selected when a kit and flow cell are specified on the command-line instead of a specific config file.
- `fast`: Fast basecalling.
- `sketch`: Sketch basecalling. This is primarily for use with adaptive sampling on the MinION Mk1C device to minimise latency.
For example, to basecall data generated with the R10.4 pore and the E8.1 enzyme, using the Fast CRF model:
guppy_basecaller -c dna_r10.4_e8.1_fast.cfg [...]
If you were running this on a MinION Mk1C device, you would use:
guppy_basecaller -c dna_r10.4_e8.1_fast_mk1c.cfg [...]
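The naming structure described above can be expressed as a small helper that assembles a config file name from its components. This is an illustrative sketch: `guppy_config_name` is a hypothetical function, and a composed name is only useful if a config file with that name is actually shipped with the installed package (the `--print_workflows` option and the package's data folder remain the reference for what exists).

```shell
# Compose a config file name from the components described above.
# Usage: guppy_config_name <strand> <pore> <enzyme> <model> [instrument] [modbases]
guppy_config_name() {
    name="${1}_${2}_${3}"
    # Optional modified-base specifier, e.g. 5mc_cg
    if [ -n "${6:-}" ]; then
        name="${name}_modbases_${6}"
    fi
    name="${name}_${4}"
    # Optional instrument type: mk1c or prom
    if [ -n "${5:-}" ]; then
        name="${name}_${5}"
    fi
    echo "${name}.cfg"
}

guppy_config_name dna r10.4 e8.1 fast        # prints dna_r10.4_e8.1_fast.cfg
guppy_config_name dna r10.4 e8.1 fast mk1c   # prints dna_r10.4_e8.1_fast_mk1c.cfg
```

The two calls reproduce the config names used in the examples above.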
Config files - selecting kit and flow cell
The flow cell and kit names should be clearly labelled on the corresponding boxes. Flow cells almost always start with "FLO" and kits almost always start with "SQK" or "VSK".
To see the supported flow cells and kits, run Guppy with the `--print_workflows` option:
guppy_basecaller --print_workflows
...which will produce output like this:
Available flowcell + kit combinations are:
flowcell    kit           barcoding config_name                 model version
FLO-MIN114  SQK-LSK114              dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-LSK114-XL           dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-ULK114              dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-RAD114              dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-NBD114-24 included  dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-NBD114-96 included  dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-RBK114-24 included  dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-MIN114  SQK-RBK114-96 included  dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2
FLO-PRO002  SQK-LSK112              dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-LSK112-XL           dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-RAD112              dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-NBD112-24 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-NBD112-96 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-RBK112-24 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002  SQK-RBK112-96 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-LSK112              dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-LSK112-XL           dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-RAD112              dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-NBD112-24 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-NBD112-96 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-RBK112-24 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO002M SQK-RBK112-96 included  dna_r9.4.1_e8.1_hac_prom    2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-LSK112              dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-LSK112-XL           dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-RAD112              dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-NBD112-24 included  dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-NBD112-96 included  dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-RBK112-24 included  dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-MIN106  SQK-RBK112-96 included  dna_r9.4.1_e8.1_hac         2021-09-13_dna_r9.4.1_minion_promethion_384_ca963bcb
FLO-PRO111  SQK-CS9109              dna_r10.3_450bps_hac_prom   2021-04-20_dna_r10.3_minion_promethion_384_72309afc
FLO-PRO111  SQK-DCS108              dna_r10.3_450bps_hac_prom   2021-04-20_dna_r10.3_minion_promethion_384_72309afc
FLO-PRO111  SQK-DCS109              dna_r10.3_450bps_hac_prom   2021-04-20_dna_r10.3_minion_promethion_384_72309afc
[...]
In the case of kits which come with their own barcodes included, the barcoding column will specify "included". Reads which have been prepared with these kits will be able to be demultiplexed using `guppy_barcoder` (see below).
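If you want to look up a flow cell and kit combination programmatically, the listing can be parsed line by line. The sketch below assumes each row is whitespace-separated with either four fields (no barcoding) or five fields (barcoding "included"), as in the sample output above; the function name is hypothetical and the real output may contain rows this simple parser does not handle.

```python
# Illustrative sketch: parse one data row of the --print_workflows listing.
# Assumes whitespace-separated rows of four or five fields, as in the sample.
def parse_workflow_line(line):
    fields = line.split()
    if len(fields) == 5:
        flowcell, kit, barcoding, config_name, model_version = fields
    elif len(fields) == 4:
        flowcell, kit, config_name, model_version = fields
        barcoding = ""             # kit does not include its own barcodes
    else:
        raise ValueError(f"unexpected number of fields: {line!r}")
    return {
        "flowcell": flowcell,
        "kit": kit,
        "barcoding": barcoding,
        "config_name": config_name,
        "model_version": model_version,
    }


row = parse_workflow_line(
    "FLO-MIN114 SQK-NBD114-24 included "
    "dna_r10.4.1_e8.2_400bps_hac dna_r10.4.1_e8.2_400bps_hac@v3.5.2"
)
print(row["config_name"], row["barcoding"])
```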
Optional parameters
In addition to the required parameters described in the Quick Start section, Guppy has many optional parameters. You can use them if they are applicable to your experiment. The following optional parameters are commonly used:
Data features:
- Q-score filtering (--disable_qscore_filtering): Flag to disable filtering of reads into pass/fail folders inside the output folder, based on their strand q-score. See --min_qscore.
- Alignment filtering (--alignment_filtering): Flag for filtering of reads into pass/fail folders inside the output folder, based on their number of alignments. Can be set to none (default) or fail to disable or enable this feature.
- Minimum q-score (--min_qscore): The minimum q-score a read must attain to pass q-score filtering. The default value for this varies by configuration, ranging from 7.0 for the lower-accuracy models up to 10.0 for the "Sup" models. This should have a minimal impact on output.
- Calibration strand detection (--calib_detect): Flag to enable calibration strand detection and filtering. If enabled, any reads which align to the calibration strand reference will be filtered into a separate output folder to simplify downstream processing. Off by default.
- Alignment reference file (-a or --align_ref): Optional reference genome file name. If an align_ref is provided, Guppy will perform alignment against the reference for called strands, using the minimap2 library. Providing an align_ref will automatically enable BAM output (see --bam_out). See the Alignment section for more information on alignment in Guppy.
- Reverse RNA sequence (--reverse_sequence): Reverse the called sequence (used for RNA sequencing, as RNA strands translocate through the pore in the 3’ to 5’ direction). The default value is FALSE for DNA sequencing and TRUE for RNA sequencing.
- Perform T to U substitution (--u_substitution): Replace all 'T's in the called sequence with 'U's for RNA sequencing. The default value is FALSE for DNA sequencing and TRUE for RNA sequencing.
- Read splitting (--do_read_splitting): Split potentially concatenated input reads into separate outputs, based on the score obtained from mid-strand adapter detection. See --min_score_read_splitting. If enabled, reads which exceed this threshold will be split into two.
- Read splitting depth (--max_read_split_depth): Limit the number of times a read will be passed into the read splitter. For example, --max_read_split_depth 2 would permit the read to be split, and then each resulting read to be split a second time, resulting in up to four reads. The default value is 2.
- Minimum read splitting score (--min_score_read_splitting): The minimum score a read must generate from mid-strand adapter detection for the read to be considered a concatamer and to be split into two reads for subsequent processing and output. The default is 58.
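The effect of the read splitting depth can be shown with a line of arithmetic. This is a sketch that assumes every pass through the splitter divides a read into exactly two, which gives the upper bound quoted above; in practice most reads will not be split at every pass. The function name is made up for this example.

```python
# Illustrative sketch: upper bound on reads produced from one input read,
# assuming each pass through the splitter yields at most two reads.
def max_reads_after_splitting(max_read_split_depth):
    if max_read_split_depth < 0:
        raise ValueError("depth must be zero or greater")
    return 2 ** max_read_split_depth


print(max_reads_after_splitting(2))  # default depth of 2 gives up to 4 reads
```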
Input/output:
- Quiet mode (-z or --quiet): This option prevents the Guppy basecaller from outputting anything to stdout. Stdout is short for “standard output” and is the default location to which a running program sends its output. For a command line executable, stdout will typically be sent to the terminal window from which the program was run.
- Verbose logging (--verbose_logs): Flag to enable verbose logging (outputting a verbose log file, in addition to the standard log files, which contains detailed information about the application). Off by default.
- Reads per FASTQ file (-q or --records_per_fastq): The number of reads to put in a single FASTQ file (see output format below). Set this to zero to output all reads into one file (per run id, per caller). The default value is 4000.
- Perform FASTQ compression (--compress_fastq): Flag to enable gzip compression of output FASTQ files; this reduces file size to about 50% of the original.
- Recursive (-r or --recursive): Flag to require searching through all subfolders contained in the --input_path value, and basecall any .fast5 files found in them.
- .bam file output (--bam_out): Flag to enable output of .bam files containing basecall result sequence. If a modified base model was used, the modified base locations and probabilities will be emitted. If alignment was performed, the results will also be emitted. Off by default.
- .bam file indexing (--index): Flag to enable the generation of the .bai index file for .bam file output. Requires --bam_out. BAM file output will be implicitly enabled if --align_ref is populated or a modbase model is selected. Off by default.
- Emit move tables (--moves_out): Return move table in output BAM file.
- Methylation probability cutoff (--bam_methylation_threshold): The value below which a predicted methylation probability will not be emitted into a BAM file, expressed as a percentage. Default is 5.0(%). Note that if the configuration being used specifies a context to look for base modifications within, then this parameter will not be applied. Instead, any instances of the base which match the context will be emitted in the BAM file, even if the predicted methylation probability is zero.
- Override default data path (-d or --data_path): Option to explicitly specify the path to use for loading any data files the application requires (for example, if you have created your own model files or config files).
- Input File List (--input_file_list): Optional file containing list of input read files (.fast5/POD5) to process from the input_path.
- Nested output folder structure (--nested_output_folder): Optional flag, which if set will cause FASTQ files to be output to a nested folder structure similar to that used by MinKNOW.
- Progress stats reporting frequency (--progress_stats_frequency): Frequency, in seconds, at which to report progress statistics. If supplied, this replaces the default progress display.
- Maximum queue size (--max_queued_reads): Maximum number of reads "in flight"; defaults to 2000. Helps to limit the amount of memory used when basecalling cannot keep up with the speed at which reads are loaded.
Optimisation:
- Chunks per caller (--chunks_per_caller): A soft limit on the number of chunks in each basecaller's chunk queue. When a read is sent to the basecaller, it is broken up into “chunks” of signal, and each chunk is basecalled in isolation. Once all the chunks for a read have been basecalled, they are combined to produce a full basecall. --chunks_per_caller sets a limit on how many chunks will be collected before they are dispatched for basecalling. On GPU platforms this is an important parameter to obtain good performance, as it directly influences how much computation can be done in parallel by a single basecaller.
- Number of parallel callers (--num_callers): Number of parallel basecallers to create. A thread will be spawned for each basecaller to use. Increasing this number will allow Guppy to make better use of multi-core CPU systems, but may impact overall system performance.
- GPU device (-x or --device): Specify a GPU device to use in order to accelerate basecalling. If this option is not selected, Guppy will default to CPU usage. You can specify one or more devices as well as optionally limiting the amount of GPU memory used (to leave space for other tasks to run on GPUs). GPUs are counted from zero, and the memory limit can be specified as percentage of total GPU memory or as size in bytes. Examples:
  device                result
  cuda:0                Use the first GPU in the system, no memory limit
  cuda:0,1              Use the first two GPUs in the system, no memory limit
  "cuda:0 cuda:1"       Same as cuda:0,1
  cuda:all:100%         Use all GPUs in the system, no memory limit
  cuda:1,2:50%          Use the second and third GPU in the system, and use only up to half of the GPU memory of each GPU
  "cuda:0 cuda:1,2:8G"  Use the first three GPUs in the system. Use a maximum of 8 GiB on each of GPUs 1 and 2.
  auto                  Same as cuda:0
  Note: Spaces are only allowed between multiple cuda: specifications. In this case it is necessary to put the entire device specification in quotes. It is strongly recommended to use a supported GPU if one is available, as basecalling will typically perform orders of magnitude faster.
- Resume previous run (--resume): Flag to enable resuming a previous basecalling run. This option can be used to resume a partially completed basecall if it was interrupted for some reason, or to re-basecall an input directory if more reads were added.
CPU/GPU basecalling usage
There are two parameters that govern how many CPU threads Guppy uses: callers and CPU threads per caller.
When performing GPU basecalling, there is always one CPU support thread per GPU caller, so the number of callers (--num_callers) dictates the maximum number of CPU threads used. Modifying the number of CPU threads per caller (--num_cpu_threads_per_caller) will have no effect.
When performing CPU basecalling both callers and threads per caller may be set, making the maximum number of CPU threads used equal to num_callers * cpu_threads_per_caller.
The number of CPU threads used should generally not exceed either of these two values:
- The number of logical CPU cores your machine has (as there will probably not be sufficient computational power available for Guppy to run any faster than this).
- When performing CPU basecalling, the number of CPU threads your machine's RAM can support. 1D basecalling needs 4 GB of RAM plus 1 GB per CPU thread.
So if your machine has 8 GB of RAM, you can support a maximum of 4 CPU threads for 1D basecalling.
This assumes your machine is not performing any other computationally-intensive tasks except for using Guppy (e.g. it assumes you are not running MinKNOW).
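The two limits can be combined into a quick estimate. The following sketch simply applies the rule of thumb above (4 GB base plus 1 GB per CPU thread, capped at the number of logical cores); the function names are made up for this example, and the result is a starting point for tuning rather than a guarantee.

```python
import os

BASE_RAM_GB = 4          # fixed RAM overhead quoted above for 1D basecalling
RAM_PER_THREAD_GB = 1    # additional RAM per CPU thread quoted above


def max_cpu_threads(ram_gb, logical_cores=None):
    """Largest thread count allowed by both the core count and the RAM rule."""
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1
    ram_limited = max(0, int((ram_gb - BASE_RAM_GB) // RAM_PER_THREAD_GB))
    return min(logical_cores, ram_limited)


def cpu_threads_used(num_callers, cpu_threads_per_caller):
    """Maximum CPU threads used when basecalling on the CPU."""
    return num_callers * cpu_threads_per_caller


# An 8 GB machine with 16 logical cores is limited by RAM to 4 threads,
# which could be split as 2 callers with 2 threads each.
print(max_cpu_threads(8, logical_cores=16))
print(cpu_threads_used(num_callers=2, cpu_threads_per_caller=2))
```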
Resuming runs
If a run of the Guppy basecaller is interrupted for some reason, it is possible to use the --resume option to attempt to re-start the basecall from where it was halted. This is useful if basecalling fails while processing particularly large batches of files. Resume should be used with exactly the same parameters as the previous run, or undefined behaviour may occur. If the --resume option is specified, the following steps occur:
- The basecaller checks the output directory to find log files from any previous runs
- The log files are interrogated to discover any successfully completed reads (and their source files) from previous runs
- Any files in the output directory, which do not belong to successfully completed reads, are removed (i.e. reads which were partially completed)
- The data for previously completed reads is extracted from the summary file for the previous run
The basecaller then proceeds as normal, filtering out any input reads which were previously processed.
After resumption of a basecall run, a single summary file will have been produced with all reads from the input folder in it, as if the run was completed normally.
Note: It is permissible to chain resume operations together, and it is permissible to resume from a successfully completed operation. This allows the resume functionality to be used to re-basecall an input folder in order to basecall just the read files which have appeared in that folder since the last basecall operation was invoked on it.
The resume system works by batching reads internally, and recording to the logfile when those batches have been completed and written to disk. The --read_batch_size argument can be set to control the size of these batches, and controls the granularity at which resume operations can occur. Increasing the batch size will reduce the fragmentation of output FASTQ files but can increase the amount of time a resume operation takes, as more previously basecalled reads may be re-called, because their batch was not completed.
9. Input and output files
Input files
Read .fast5 files, used as input to the basecalling software, must contain raw data. Raw data is included by default in .fast5 files generated by the MinKNOW software. Make sure you are using recent .fast5 files from the latest version of MinKNOW, as older files may not basecall properly with the set-out models and parameters provided in stand-alone Guppy.
POD5 files are also supported as input.
Both the alignment and barcoding software accept FASTQ files as input. These can be generated either by the Guppy basecallers or by the MinKNOW software.
Output file size
If you start with a .fast5 file that only has raw data in it and .fast5 output is enabled, file size increases to roughly 2X original size for 1D basecalling.
Folder structure
If using a version of MinKNOW which outputs reads in separate subfolders, it is necessary to use the --recursive option listed above to search through them to find input read files.
For example, if MinKNOW's output folder structure looks like this:
minknow_output_folder/
--- 0/
| --- file1.fast5
| --- file2.fast5
| [...]
--- 1/
| --- file10.fast5
| --- file11.fast5
| [...]
Then calling Guppy as follows will search through the numbered subfolders for input read files:
guppy_basecaller --input_path minknow_output_folder --recursive [...]
Output formats
Guppy supports outputting FASTQ files, and optionally BAM, via the --bam_out argument. By default, FASTQ or BAM files will contain 4000 reads per file, according to the --records_per_fastq argument.
Multiple input files from the same run_id will be grouped into batches, where the number of reads in a batch is less than or equal to --read_batch_size. Individual input files will not be split across batches, even if this means a batch is larger than --read_batch_size. Output files for a batch will be split when --records_per_fastq reads have been recorded. In the case where --records_per_fastq is set to 0, all reads from a batch will be written into a single file (per run_id).
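The batching rules can be modelled roughly as follows. This is one plausible reading of the description above, not Guppy's actual implementation: files are grouped by run_id, a batch is closed before a file that would push it over the batch size, and a single file larger than the batch size forms its own oversized batch. The function names and the input tuple layout are assumptions for this example.

```python
from collections import defaultdict


def group_into_batches(files, read_batch_size):
    """Group (filename, run_id, read_count) tuples into per-run batches."""
    by_run = defaultdict(list)
    for name, run_id, count in files:
        by_run[run_id].append((name, count))

    batches = []
    for run_id, run_files in by_run.items():
        current, current_reads = [], 0
        for name, count in run_files:
            # Close the batch rather than split a file across two batches.
            if current and current_reads + count > read_batch_size:
                batches.append((run_id, current, current_reads))
                current, current_reads = [], 0
            current.append(name)
            current_reads += count
        if current:
            batches.append((run_id, current, current_reads))
    return batches


def output_file_count(reads_in_batch, records_per_fastq):
    """Number of output files for one batch; 0 means a single file."""
    if records_per_fastq == 0:
        return 1
    return -(-reads_in_batch // records_per_fastq)  # ceiling division


example_files = [
    ("a.fast5", "abc", 3000),
    ("b.fast5", "abc", 3000),
    ("c.fast5", "abc", 9000),   # larger than the batch size on its own
    ("d.fast5", "777", 100),
]
example_batches = group_into_batches(example_files, read_batch_size=6000)
for run_id, names, reads in example_batches:
    print(run_id, names, reads, output_file_count(reads, 4000))
```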
The default FASTQ header is:
{read_id} runid={run_id} read={read_number} ch={channel_id} start_time={start_time_utc}
- read_id is the unique ID for the read.
- sample_id is the user-specified sample ID which the read belongs to (read from tracking_id/sample_id in the source read file).
- read_number is the sequential read number for the channel (read from the read's read_number in the source read file).
- channel_id is the source channel within the flow cell for the read (read from the read's channel_number in the source read file).
- start_time_utc is the read's start time (calculated from tracking_id/exp_start_time and the read's start_time in the source read file).
If barcoding was performed, the FASTQ header will also include a barcodeid={barcode} field, where barcode is the normalised ID of the detected barcode arrangement.
If read splitting was performed, the FASTQ header will also include a parent_read_id={parent_read_id} field, where parent_read_id is the read_id of the original read from which this read was split.
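A header in this format can be split into its fields with a few lines of code. The sketch below assumes that the read ID is the first token and that every other token is a key=value pair containing no spaces; the sample header values are invented for illustration.

```python
# Illustrative sketch: split a FASTQ header line into a dict of its fields.
# Assumes space-separated key=value tokens after the read ID.
def parse_fastq_header(header_line):
    text = header_line.strip()
    if text.startswith("@"):       # FASTQ header lines start with '@'
        text = text[1:]
    read_id, _, rest = text.partition(" ")
    fields = {"read_id": read_id}
    for token in rest.split():
        key, sep, value = token.partition("=")
        if sep:                    # ignore tokens that are not key=value
            fields[key] = value
    return fields


# Hypothetical header, values invented for this example.
header = "@read-0001 runid=abc read=12 ch=45 start_time=2021-09-13T10:00:00Z barcodeid=barcode01"
parsed = parse_fastq_header(header)
print(parsed["read_id"], parsed["ch"], parsed.get("barcodeid"))
```

Using `dict.get` for optional fields such as barcodeid or parent_read_id avoids errors when barcoding or read splitting was not performed.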
Contents of the output folder
The save path will have the following structure once Guppy has finished running:
guppy_basecaller_<time_and_date>.log
A log file of what Guppy did during this basecall session.
sequencing_summary.txt
A tab-delimited text file containing useful information for each read analysed during this Guppy basecall.
fastq_runid_<run_id>_<batch_id>_<file_number>.fastq
A collection of FASTQ files will be emitted containing the basecall results. Each FASTQ file may contain many reads. A set of FASTQ files will be generated for each run ID in the input file set. Additionally, depending on the --read_batch_size and --records_per_fastq settings, a single run ID may generate multiple FASTQ files.
Note: The FASTQ files in the output folder may be separated into "pass", "fail", and "calibration_strands" folders, depending on whether they pass or fail the filtering conditions or whether they have been identified as a calibration strand. This behaviour may be controlled with the --disable_qscore_filtering and --calib_detect options. For example, if q-score filtering and calibration strand detection are both enabled, the output folder structure would look like this:
guppy_output_folder/
--- pass/
| fastq_runid_777_0.fastq
| fastq_runid_abc_0.fastq
| fastq_runid_abc_1.fastq
--- fail/
| fastq_runid_777_0.fastq
--- calibration_strands/
| fastq_runid_777_0.fastq
Whereas turning both options off would produce a folder layout like this:
guppy_output_folder/
| fastq_runid_777_0.fastq
| fastq_runid_abc_0.fastq
| fastq_runid_abc_1.fastq
If barcode detection was performed, Guppy will demultiplex the reads into separate subfolders (within the 'pass' and 'fail' and 'calibration_strands' folders if applicable), like this example:
guppy_output_folder/
--- pass/
| ---barcode01/
| | fastq_runid_abc_0_0.fastq
| ---unclassified/
| | fastq_runid_abc_0_0.fastq
--- fail/
| ---unclassified
| | fastq_runid_abc_0_0.fastq
Guppy will not empty the save path before writing the output, but it will overwrite existing FASTQ files.
Nested output folders
Guppy also supports an alternative output folder structure, designed to match that produced by MinKNOW. This can be enabled using the command line switch `--nested_output_folder`. When enabled, Guppy will further organise the output subfolders as follows:
guppy_output_folder/
--- {protocol_group_id} (if it exists in source fast5 files)/
| ---{sample_id}/
| | ---{experiment_start_time}_{device_id}_{flow_cell_id}_{protocol_run_id}/
| | | ---fastq_pass or fastq_fail/
| | | | ---{barcode classification} (if it exists for the read, otherwise this folder is absent)
An alternative nested folder output is available which is very similar to the above, but places the barcode classification directly under the protocol group id. This scheme can be enabled using the command line switch `--barcode_nested_output_folder`. The folders are organised as follows:
guppy_output_folder/
--- {protocol_group_id} (if it exists in source fast5 files)/
| ---{barcode classification}/ (if it exists for the read, otherwise this folder is absent)
| | ---{sample_id}/
| | | ---{experiment_start_time}_{device_id}_{flow_cell_id}_{protocol_run_id}/
| | | | ---fastq_pass or fastq_fail
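To locate files in either layout from a script, the folder path can be assembled from the same fields. This is a minimal sketch of the two schemes as laid out above; the function name, the dictionary keys, and the sample values are all hypothetical, and the exact formatting Guppy applies to each field (for example the start time) is not specified here.

```python
from pathlib import PurePosixPath


def nested_output_dir(root, run, passed, barcode=None, barcode_first=False):
    """Build the nested folder path for one read (illustrative only)."""
    run_folder = "{experiment_start_time}_{device_id}_{flow_cell_id}_{protocol_run_id}".format(**run)
    parts = [root]
    if run.get("protocol_group_id"):        # level is absent if not in the source file
        parts.append(run["protocol_group_id"])
    if barcode and barcode_first:           # --barcode_nested_output_folder layout
        parts.append(barcode)
    parts += [run["sample_id"], run_folder, "fastq_pass" if passed else "fastq_fail"]
    if barcode and not barcode_first:       # --nested_output_folder layout
        parts.append(barcode)
    return PurePosixPath(*parts)


# Hypothetical run metadata, values invented for this example.
example_run = {
    "protocol_group_id": "group1",
    "sample_id": "sample1",
    "experiment_start_time": "20210913_1000",
    "device_id": "MN00000",
    "flow_cell_id": "FAK00000",
    "protocol_run_id": "0a1b2c3d",
}
print(nested_output_dir("guppy_output_folder", example_run, True, "barcode01"))
print(nested_output_dir("guppy_output_folder", example_run, True, "barcode01", barcode_first=True))
```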
Ping information
Guppy collects high-level summary information when it is used, and by default this information is sent over your internet connection to Oxford Nanopore Technologies. This is important information that allows us to analyse the performance of Guppy and identify areas where we need to improve. Nothing specific about the genomic content of individual reads is included - only generic information is logged, such as sequence length and q-score, aggregated over all the reads processed by Guppy. The sending of this summary information can be turned off if desired by providing the --disable_pings option to Guppy.
Guppy collects this high-level summary information as follows:
- Individual reads are added to an aggregator as they are basecalled
- The summary ping(s) are written out to a file (.js)
- If not disabled, the summary ping(s) are sent to Oxford Nanopore
This type of information is collected:
- General information about the configuration of Guppy and the run(s) that the data came from:
  - the options provided to Guppy
  - the total number of reads seen, and those seen per channel
- Basecalling information:
  - the numbers of reads which passed or failed basecalling
  - the average sequence length
  - the distribution of mean q-scores
  - the distribution of basecalling speeds
Users are encouraged to browse the summary_telemetry.js file if they wish to see exactly what information Guppy is aggregating for telemetry.
Summary file contents
Guppy produces a summary file named sequencing_summary.txt during basecalling, which contains high-level information on every read analysed by the basecaller. This file is a tab-delimited text file which can be imported into common spreadsheet applications such as Excel or LibreOffice Calc, or read by software libraries such as NumPy or Pandas. Every read that is sent to the basecaller will have an entry in the summary file, regardless of whether or not that read was successfully basecalled.
When enabling extra functionality such as barcoding or alignment, additional columns will be added to the summary file. For this reason, and because the columns may occasionally be re-ordered, it is recommended that specific columns are accessed by their name (e.g. the read_id column) instead of the order in which they occur in the file.
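Accessing columns by name might look like the sketch below, which uses Python's standard csv module so that it runs without extra dependencies (Pandas would work equally well). The sample rows are invented for this example, and the columns are deliberately listed in an arbitrary order to show that name-based access does not depend on position.

```python
import csv
import io


def load_summary_rows(handle):
    """Read a tab-delimited summary file into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Invented sample data; a real file has many more columns.
sample = (
    "filename\tsequence_length_template\tread_id\tmean_qscore_template\n"
    "file1.fast5\t1200\tread-0001\t11.5\n"
    "file1.fast5\t800\tread-0002\t8.2\n"
)
rows = load_summary_rows(io.StringIO(sample))
lengths = [int(row["sequence_length_template"]) for row in rows]
read_ids = [row["read_id"] for row in rows]
print(read_ids, lengths)
```

For a real run, replace the `io.StringIO(sample)` handle with `open("sequencing_summary.txt", newline="")`.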
Below is a list of summary file columns with a description of their contents. Very occasionally new columns may be added to the file without being described here; these columns should be considered unreliable and subject to change or removal.
- filename The name of the input read file the read came from.
- read_id The uuid that uniquely identifies this read.
- parent_read_id The uuid that uniquely identifies the original input read from which this read was generated. This column will only be present if --do_read_splitting is enabled. For unsplit reads, this value will be identical to read_id.
- run_id The uuid that uniquely identifies the sequencing run that this read came from.
- batch_id Integer identifier of the batch that Guppy put this read in. See the --read_batch_size parameter and the --resume option.
- channel The channel on the flow cell that the read came from.
- mux The mux in the channel that the read came from.
- start_time Start time of the read, in seconds since the beginning of the run.
- duration Duration of the read, in seconds.
- minknow_events The number of events detected by MinKNOW. Defaults to zero if unknown, or if the value cannot be determined due to read-splitting.
- passes_filtering Whether or not the read passed the qscore and alignment filters (the value is not affected by the --disable_qscore_filtering and --alignment_filtering flags). See the --min_qscore parameter.
- template_start Start time of the portion of the read that was sent to the basecaller after adapter trimming, in seconds since the beginning of the run. See the --trim_threshold, --trim_min_events, --max_search_len, --trim_strategy, and --dmean_win_size parameters.
- num_events_template Legacy field -- template_duration should be used instead.
- template_duration Duration of the portion of the read that was sent to the basecaller after adapter trimming, in seconds.
- sequence_length_template Number of bases in the output sequence, taking into account any sequence trimming. See "Barcode trimming".
- mean_qscore_template The qscore corresponding to the mean error rate of the sequence.
- strand_score_template Legacy field - no longer populated reliably.
- median_template The median current of the read, in pA.
- scaling_median_template The "median_template" value used by the basecaller to scale incoming data. May be different than median_template if adapter scaling or scaling overrides are used. See the --scaling_med parameter.
- scaling_mad_template The "mad_template" value used by the basecaller to scale incoming data. May be different than mad_template if adapter scaling or scaling overrides are used. See the --scaling_mad parameter.
If barcoding/demultiplexing is enabled via the --barcode_kits argument, then the following columns are added to the sequencing summary file:
- barcode_arrangement The normalized name of the barcode classification, without a kit (e.g. "barcode01"), or "unclassified" if no classification could be made.
- barcode_full_arrangement The full name for the highest-scoring barcode match, including kit, variation, and direction (e.g. "RAB19_var2").
- barcode_kit The kit name belonging to the highest-scoring barcode match (e.g. "RAB").
- barcode_variant Which of the forward / reverse variants the highest-scoring barcode matched (e.g. "var1"), or "n/a" if no variants are available.
- barcode_score The score for either the front or rear barcode, whichever is higher. The maximum score is 100, with no minimum.
- barcode_front_id The full name for the barcode at the front of the strand, including direction (forward/reverse) and variant (1st/2nd) (e.g. "RAB19_2nd_FWD").
- barcode_front_score The score for the barcode at the front of the strand.
- barcode_front_refseq The reference sequence the barcode at the front of the strand was matched against.
- barcode_front_foundseq The sequence of the barcode at the front of the strand that matched barcode_front_refseq.
- barcode_front_foundseq_length The length of barcode_front_foundseq.
- barcode_front_begin_index The position in the called sequence, counting from the beginning, that barcode_front_foundseq begins at.
- barcode_rear_score The score for the barcode at the rear of the strand.
- barcode_rear_refseq The reference sequence the barcode at the rear of the strand was matched against.
- barcode_rear_foundseq The sequence of the barcode at the rear of the strand that matched barcode_rear_refseq.
- barcode_rear_foundseq_length The length of barcode_rear_foundseq.
- barcode_rear_end_index The position in the called sequence, counting backwards from the end, that barcode_rear_foundseq ends at.
If dual barcoding is used the following additional columns will be present:
- barcode_front_id_inner
- barcode_front_score_inner
- barcode_rear_id_inner
- barcode_rear_score_inner
These columns have the same meaning as the standard "id" and "score" columns above, but apply only to the inner front and rear barcodes. The standard "id" and "score" columns now apply to the outer barcodes.
For further details on how barcoding works see the "Barcoding/demultiplexing" section.
If LamPORE detection is enabled via the --lamp_detect argument, the following additional columns will be present:
- lamp_barcode_id The normalized name of the LAMP FIP barcode classification (e.g. "FIP01"), or "unclassified" if no classification could be made.
- lamp_barcode_score The alignment score for the best-scoring LAMP FIP barcode. Note that if the best score is below the threshold specified by --min_score_lamp, the score will still be reported here, although the classification will be "unclassified".
- lamp_target_id The target name of the LAMP target classification (e.g. "ACTB"), or "unclassified" if no classification could be made.
- lamp_target_score The alignment score for the best-scoring LAMP target. Note that if the best score is below the threshold specified by --min_score_lamp_target, the score will still be reported here, although the classification will be "unclassified".
If adapter detection is enabled via the --detect_adapter argument, the following additional columns will be present:
- adapter_front_id The name of the adapter (if any) found at the front of the strand. This will be "unclassified" if no adapter was found.
- adapter_front_score The alignment score of the adapter at the front of the strand. If unclassified this will be the score that was highest among the rejected sequences.
- adapter_front_begin_index The position in the called sequence of the beginning of the adapter, counting from the beginning of the strand.
- adapter_front_foundseq_length The length of the portion of the strand that aligned to adapter.
- adapter_rear_id The name of the adapter (if any) found at the rear of the strand. This will be "unclassified" if no adapter was found.
- adapter_rear_score The alignment score of the adapter at the rear of the strand. If unclassified this will be the score that was highest among the rejected sequences.
- adapter_rear_end_index The position in the called sequence of the end of the adapter, counting from the end of the strand.
- adapter_rear_foundseq_length The length of the portion of the strand that aligned to adapter.
If primer detection is enabled via the --detect_primer argument, the following additional columns will be present:
- primer_front_id The name of the primer (if any) found at the front of the strand. This will be "unclassified" if no primer was found.
- primer_front_score The alignment score of the primer at the front of the strand. If unclassified, this will be the score that was highest among the rejected sequences.
- primer_front_begin_index The position in the called sequence of the beginning of the primer, counting from the beginning of the strand.
- primer_front_foundseq_length The length of the portion of the strand that aligned to primer.
- primer_rear_id The name of the primer (if any) found at the rear of the strand. This will be "unclassified" if no primer was found.
- primer_rear_score The alignment score of the primer at the rear of the strand. If unclassified, this will be the score that was highest among the rejected sequences.
- primer_rear_end_index The position in the called sequence of the end of the primer, counting from the end of the strand.
- primer_rear_foundseq_length The length of the portion of the strand that aligned to primer.
If barcode trimming is enabled via --enable_trim_barcodes, or adapter or primer trimming is enabled via the --trim_adapters or --trim_primers arguments, the following additional columns will also be present:
- front_total_trimmed The number of bases removed from the front of the sequence as part of trimming.
- rear_total_trimmed The number of bases removed from the rear of the sequence as part of trimming.
If alignment is enabled via the --align_ref argument, then the following columns are added to the sequencing summary file:
- alignment_genome The name of the reference which the read aligned to, or "*" if no alignment was found.
- alignment_genome_start The position in the reference where the alignment started, or 0 if no alignment was found.
- alignment_genome_end The position in the reference where the alignment ended, or 0 if no alignment was found.
- alignment_strand_start The position in the called sequence where the alignment started, or 0 if no alignment was found.
- alignment_strand_end The position in the called sequence where the alignment ended, or 0 if no alignment was found.
- alignment_num_insertions The number of insertions in the alignment, or -1 if no alignment was found.
- alignment_num_deletions The number of deletions in the alignment, or -1 if no alignment was found.
- alignment_num_aligned The number of bases in the called sequence which aligned to bases in the reference, or -1 if no alignment was found.
- alignment_num_correct The number of aligned bases in the called sequence which match their corresponding reference base, or -1 if no alignment was found.
- alignment_identity The percentage of aligned bases which correctly match their corresponding reference base (alignment_num_correct/alignment_num_aligned), or -1 if no alignment was found.
- alignment_accuracy The percentage of all bases in the alignment which are correct (alignment_num_correct/(alignment_num_aligned + alignment_num_insertions + alignment_num_deletions)), or -1 if no alignment was found.
- alignment_score The score returned by minimap2, or -1 if no alignment was found.
- alignment_coverage The percentage of either the called sequence or the reference (whichever is shorter) that aligns (e.g. (alignment_strand_end - alignment_strand_start + 1)/sequence_length_template), or -1 if no alignment was found.
- alignment_direction The direction of the alignment, either forwards (+) or reverse (-), or "*" if no alignment was found. Note that genome positions (e.g. alignment_genome_start) are always given in the forwards direction.
- alignment_mapping_quality The mapping quality of the alignment. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. A value of 255 indicates that the mapping quality is not available.
- alignment_num_alignments The total number of alignments found. This will be zero if no alignment was found.
- alignment_num_secondary_alignments The number of alignments that were flagged by minimap2 as secondary alignments.
- alignment_num_supplementary_alignments The number of alignments that were flagged by minimap2 as supplementary alignments.
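The derived alignment columns can be recomputed from the count columns when checking a summary file. The following is a minimal Python sketch of the formulas given in the column descriptions above. The function names and the example counts are illustrative and not part of Guppy, and the results are returned as fractions (multiply by 100 for a percentage); the scale used in the summary file itself should be checked against real output.

```python
def alignment_metrics(num_aligned, num_correct, num_insertions, num_deletions):
    """Recompute (alignment_identity, alignment_accuracy) as fractions.

    Returns (-1, -1) when the counts indicate that no alignment was found,
    mirroring the sentinel value described in the column list.
    """
    if num_aligned <= 0:
        return -1, -1
    identity = num_correct / num_aligned
    accuracy = num_correct / (num_aligned + num_insertions + num_deletions)
    return identity, accuracy


def mapq_to_error_probability(mapq):
    """Invert MAPQ = -10 * log10(Pr{mapping position is wrong}).

    A value of 255 means the mapping quality is not available.
    """
    if mapq == 255:
        return None
    return 10 ** (-mapq / 10)


# Illustrative counts only, not taken from a real run.
identity, accuracy = alignment_metrics(950, 931, 30, 20)
print(f"identity={identity:.3f} accuracy={accuracy:.3f}")
```

Because insertions and deletions only appear in the denominator of alignment_accuracy, it can never exceed alignment_identity for the same read.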
10. Guppy basecall server
Guppy basecall server
Guppy includes an additional executable called guppy_basecall_server which provides basecalling as a network-enabled service. The basecall server may be useful in situations where a set of compute resources such as GPUs need to be shared between several concurrently-running basecalling clients. It enables client applications to perform basecalling by communicating with the server via the ZMQ socket interface. ONT products which support multiple flow cells typically use Guppy in a server configuration in order to share the embedded GPUs between all flow cells.
The server is launched as follows:
guppy_basecall_server --config <config file> --log_path <log file folder> --port 5555 [--allow_non_local] [--use_tcp]
The basecall server requires a basecalling config file, like the stand-alone basecaller. It also requires a --log_path to be specified, which will be used to output the server execution log. The final required parameter is --port, which specifies the path to a local Unix socket file (on supported systems) or the socket port number on which the server will listen for connections. The --port parameter may also be set to auto, in which case the server will generate a path in the system temporary folder for socket file connections or provide an available port number for TCP connections.
guppy_basecall_server --config <config file> --log_path <log file folder> --port auto
To force the use of a TCP connection, pass the optional flag --use_tcp on the command line (this flag has no effect on platforms that do not support socket files, e.g. Windows). By default, the server only listens for TCP connections on the localhost interface. The optional flag --allow_non_local is used to permit connections to the server from addresses other than localhost; this flag also implies --use_tcp on supported systems.
On startup the server will output something similar to the following:
ONT Guppy basecall server software version 3.0.3+7e7b7d0
config file: /opt/ont/guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /opt/ont/guppy/data/template_r9.4.1_450bps_fast.jsn
log path: /tmp
chunk size: 1000
chunks per runner: 48
max queued reads: 2000
num basecallers: 1
num socket threads: 1
gpu device: cuda:0
kernel path:
runners per device: 2
Starting server on port: 5555
or:
ONT Guppy basecall server software version 3.0.3+7e7b7d0
config file: /opt/ont/guppy/data/dna_r9.4.1_450bps_fast.cfg
model file: /opt/ont/guppy/data/template_r9.4.1_450bps_fast.jsn
log path: /tmp
chunk size: 1000
chunks per runner: 48
max queued reads: 2000
num basecallers: 1
num socket threads: 1
gpu device: cuda:0
kernel path:
runners per device: 2
Starting server on port: ipc:///tmp/ae89-314a-54a4-be67
The server may take several seconds to fully launch, but once the "Starting server" line is output, the server is ready for connections.
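Automation that launches the server usually needs to know when it is ready and which address it chose, particularly with --port auto. The following Python sketch scans captured server output for the "Starting server on port:" line shown in the examples above. The function name is hypothetical, and the exact wording of the line is assumed from the sample output, so it may differ between Guppy versions.

```python
import re

# Pattern based on the final line of the example startup output above.
_START_RE = re.compile(r"^Starting server on port:\s*(\S+)\s*$")


def parse_server_address(output_lines):
    """Return the port number or socket address the server reports, or None."""
    for line in output_lines:
        match = _START_RE.match(line.strip())
        if match:
            return match.group(1)
    return None


tcp_output = ["num basecallers:    1", "Starting server on port: 5555"]
ipc_output = ["Starting server on port: ipc:///tmp/ae89-314a-54a4-be67"]
print(parse_server_address(tcp_output))
print(parse_server_address(ipc_output))
```

A result of None means the line has not been seen yet, so the caller should keep reading output before connecting a client.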
If the server fails to start due to being improperly configured, it will exit with exit code 2, and details about what went wrong will be output to the log file. Some examples of things that could go wrong include:
- Required command line parameters were not provided.
- The configuration file does not exist, or it specifies a model file that does not exist.
- The CUDA device specified does not exist, or is unavailable.
- The CUDA device does not have enough memory to support the requested configuration.
- The path that the log files should be written to cannot be accessed for writing.
In general, any automated software that is responsible for starting the server should check for a return code of 2. This code means that subsequent attempts to start the server with the same input parameters will also fail unless the problem is addressed.
If the server crashes due to an exception being thrown within the software, details of the error will appear in the logs. In this case, the return code will be 1. Any other return codes (other than 0, which indicates normal shutdown) will indicate that the server has crashed in a way that may have prevented any information about the nature of the error being logged properly.
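The return codes described above can be turned into a simple decision for a launcher script. This is a minimal Python sketch: the outcome labels, the function names and the retry policy are assumptions made for illustration, and only the meaning of codes 0, 1 and 2 comes from the text.

```python
import subprocess


def classify_exit_code(code):
    """Map a guppy_basecall_server return code to a coarse outcome label."""
    if code == 0:
        return "normal_shutdown"
    if code == 2:
        # Configuration problem: the same arguments will fail again.
        return "bad_configuration"
    if code == 1:
        # Exception inside the software: details should be in the log.
        return "crash_logged"
    # Any other code: the error may not have been logged properly.
    return "crash_unlogged"


def should_retry(code):
    """Assumed policy: retry after a crash, never after a configuration error."""
    return classify_exit_code(code) in ("crash_logged", "crash_unlogged")


def run_server(command):
    """Run the server command (a list of arguments) and classify the result."""
    completed = subprocess.run(command)
    return classify_exit_code(completed.returncode)
```

A supervising script could call run_server with the launch arguments shown earlier and stop restarting as soon as "bad_configuration" is returned.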
Once the server is running, it can be used to basecall by running the Guppy basecall client application. This is exactly the same as launching the Guppy basecaller application locally, except a socket file path or connection port is specified:
guppy_basecall_client --input_path reads --save_path output_folder/basecall --config dna_r9.4.1_450bps_fast.cfg --port ~/my_socket_files/socket1
Note that socket files only permit local connections.
To use a TCP socket connection, add the --use_tcp flag in the same way as when launching the server:
guppy_basecall_client --input_path reads --save_path output_folder/basecall --config dna_r9.4.1_450bps_fast.cfg --port 5555 --use_tcp
If only a port is specified to the Guppy basecall client as above, Guppy will assume the server is running on the local host. However, it is also possible to specify an address or hostname:
guppy_basecall_client --input_path reads --save_path output_folder/basecall --config dna_r9.4.1_450bps_fast.cfg --port 192.168.0.64:5555 --use_tcp
or
guppy_basecall_client --input_path reads --save_path output_folder/basecall --config dna_r9.4.1_450bps_fast.cfg --port my_basecall_server:5555 --use_tcp
In this case, the connection can be made to a remote server. Note that to allow connections from clients specified in this way, the server must be launched with the `--allow_non_local` command-line flag. If the server was launched with `--allow_non_local`, the client must use `--use_tcp`, even if this flag was not passed to the server.
Note: Basecalling performance may be compromised by network bandwidth when using a remote server. It is possible for multiple clients to connect to a basecall server simultaneously and the server will distribute processing resource between them using a fair queuing system.
Note: Read trimming and file output will be performed on the client, so any parameters to control those steps must be specified when launching the client, not the server.
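The TCP forms of the client --port value shown above (a bare port, or a host followed by a port) can be told apart with a few lines of Python. This sketch only illustrates the documented convention that a bare port implies the local host. The function name is hypothetical, socket file paths are deliberately not handled, and this is not how Guppy itself parses the argument.

```python
def split_port_argument(value):
    """Split a TCP-style --port value into (host, port).

    A bare port such as "5555" is treated as localhost. A socket file path
    is not a valid input here and raises ValueError from int().
    """
    host, separator, port = value.rpartition(":")
    if not separator:
        return "localhost", int(value)
    return host, int(port)


print(split_port_argument("5555"))
print(split_port_argument("192.168.0.64:5555"))
print(split_port_argument("my_basecall_server:5555"))
```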
Basecall server-specific parameters
To start the basecall server you will need to specify the path for logging.
- Logging path (--log_path): The path to the folder to save a basecall log. The logs contain all the messages that are output to the terminal, plus additional informational messages. For example, the log will contain a record for each input file which is loaded and each file which is written out. Any error or warning messages generated during the run will also go in the log, which can be used for diagnosing problems. If the user specifies the --verbose flag, an additional verbose log file is written out. The log_path is only set for the server (as it has no other output files), but the guppy_basecaller app also emits logs, which go into the save_path.
- Maximum queue size (--max_queued_reads): Maximum number of reads to queue per client. When running in client/server mode, the client will load files from disk and send them immediately to the server for basecalling. If the client can load and send reads faster than the basecaller can process them, queued reads will pile up on the basecall server, increasing memory consumption. To avoid this problem, --max_queued_reads specifies a maximum number of reads that an individual client can have in flight on the server at once. This has a default value of 2000, which is sufficient for MinION Mk1B and GridION setups with a single client attached. When running multiple clients, the number should be reduced to prevent excessive memory usage.
- Allow non-local connections (--allow_non_local): By default the server will only accept connections from clients on localhost. Pass this flag to allow incoming connections on other interfaces.
- High-priority read threshold (--high_priority_threshold): Number of high-priority chunks to process for each medium-priority chunk. The default is 10.
- Medium-priority read threshold (--medium_priority_threshold): Number of medium-priority chunks to process for each low-priority chunk. The default is 4.
- Maximum IPC message block size (--max_block_size): Maximum block size (in samples) of messages. Reads over the maximum size will be sent in multiple parts. The default is 256,000.
- Number of threads for IPC message handling (--ipc_threads): Number of threads to use for inter-process communication. The default is 2.
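To get a feel for the effect of --max_block_size, the number of parts a long read is sent in can be estimated with simple arithmetic. The Python sketch below assumes a read is split into consecutive full-size blocks followed by one remainder block. That layout is an assumption for illustration (the text only states that oversized reads are sent in multiple parts), and the function name is hypothetical.

```python
DEFAULT_MAX_BLOCK_SIZE = 256_000  # default quoted for --max_block_size


def message_parts(read_length_samples, max_block_size=DEFAULT_MAX_BLOCK_SIZE):
    """Return the assumed sizes (in samples) of the parts a read is sent in."""
    if read_length_samples <= 0:
        return []
    # Integer ceiling division avoids floating-point rounding.
    n_parts = (read_length_samples + max_block_size - 1) // max_block_size
    sizes = [max_block_size] * (n_parts - 1)
    sizes.append(read_length_samples - max_block_size * (n_parts - 1))
    return sizes


print(message_parts(600_000))  # [256000, 256000, 88000]
```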
Basecall client-specific parameters
- Server connection hostname and port (-p or --port): Specify a hostname and port for connecting to the basecall service (e.g. 'myserver:5555'), or a port only (e.g. '5555'), in which case localhost is assumed. This is the port used to communicate between the basecall client and server. The client and server both need to use the same port number or they will not be able to connect to each other.
- Client ID (--client_id): An identifier for the Guppy client instance. If supplied, this identifier will be included in the names of any files output by the client. When multiple Guppy clients process reads at the same time and write to the same output folder, giving each one a unique client ID makes its output filenames unique and prevents the clients from overwriting each other's files.
- Client connection timeout (--conn_timeout_ms): Connection timeout in milliseconds before the server considers the client as disconnected. Set to zero to disable the server auto-disconnecting the client.
- Max server read failure count (--max_server_read_failures): Maximum number of times to try resending in-flight reads when the server repeatedly crashes.
- Server file loading timeout (--server_file_load_timeout): Timeout in seconds to wait for the server to load a requested data file (e.g. a basecalling model or alignment index). This may need increasing if very large alignment references are being requested. The default is 180s.
Guppy basecall supervisor
Guppy includes an executable called guppy_basecaller_supervisor.
A single guppy_basecall_client will struggle to read files fast enough to supply a guppy_basecall_server, especially in a multiple-GPU system. To improve GPU utilisation, it is necessary to have multiple clients connecting to the basecall server. The guppy_basecaller_supervisor application is provided to simplify the process of connecting multiple clients to a server while all reading from the same input location and writing to the same save path.
This supervisor application ensures that:
- All files from the input location are distributed amongst the child basecaller clients, and
- Each client is launched with a unique client_id guaranteeing all files written to the save folder will be uniquely named.
Once all basecaller clients have completed, the supervisor exits. A return code of zero indicates success.
Usage
The basecaller supervisor is launched with exactly the same parameters as the Guppy basecaller running in client mode, but with the addition of a --num_clients parameter.
For example, to launch three Guppy basecallers running in client mode all processing the same input location and writing to the same save location:
(The following assumes that the basecall server has already been launched and is listening on TCP port 5555)
guppy_basecaller_supervisor --num_clients 3 --input_path reads --save_path ./save_folder/ --config dna_r9.4.1_450bps_fast.cfg --port 5555 --use_tcp
Note: the output will be written by each client individually and is not merged. In particular, it is worth noting that there will be one sequencing summary per client.
Depending on your requirements some further processing may be necessary in order to merge the sequencing summary files.
Example output files:
--- /save_folder/
| fastq_runid_6dce0a5_client0_0_0.fastq
| fastq_runid_6dce0a5_client1_0_0.fastq
| fastq_runid_6dce0a5_client2_0_0.fastq
| guppy_basecaller_0_log-2019-11-25_15-11-53.log
| guppy_basecaller_1_log-2019-11-25_15-11-53.log
| guppy_basecaller_2_log-2019-11-25_15-11-53.log
| guppy_basecaller_supervisor_log-2019-11-25_15-11-53.log
| sequencing_summary_0.txt
| sequencing_summary_1.txt
| sequencing_summary_2.txt
| sequencing_telemetry_0.js
| sequencing_telemetry_1.js
| sequencing_telemetry_2.js
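Because each client writes its own sequencing summary, a small post-processing step is often enough to combine them. The following Python sketch concatenates the per-client files, keeping a single header row. It assumes the files match the sequencing_summary_*.txt naming shown in the example listing and share identical column headers. The function name and the merged filename are hypothetical.

```python
import glob
import os


def merge_sequencing_summaries(save_folder, merged_name="merged_sequencing_summary.txt"):
    """Concatenate per-client summary files, writing the header row only once.

    Returns the path of the merged file and the number of data rows written.
    """
    pattern = os.path.join(save_folder, "sequencing_summary_*.txt")
    paths = sorted(glob.glob(pattern))
    merged_path = os.path.join(save_folder, merged_name)
    header = None
    rows_written = 0
    with open(merged_path, "w") as out_handle:
        for path in paths:
            with open(path) as in_handle:
                file_header = in_handle.readline().rstrip("\n")
                if not file_header:
                    continue  # empty file, nothing to merge
                if header is None:
                    header = file_header
                    out_handle.write(header + "\n")
                elif file_header != header:
                    raise ValueError(f"Column mismatch in {path}")
                for line in in_handle:
                    line = line.rstrip("\n")
                    if line:
                        out_handle.write(line + "\n")
                        rows_written += 1
    return merged_path, rows_written
```

The merged filename is chosen so that it does not match the input pattern, which means the function can be re-run on the same folder without reading its own output.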
Command-line configuration arguments
Any configuration parameters currently passed to guppy_basecaller (e.g. --num_callers, --ipc_threads, --gpu_runners_per_device, --chunks_per_runner) should also be suitable for guppy_basecaller_supervisor, as these will be directly forwarded to the clients.
To choose an optimum value for the --num_clients parameter, some trial and error is necessary: for example, start with --num_clients 1 and increase it until no further benefit is noticed. The output from the supervisor is useful in determining this, as it reports the samples/second, e.g.:
Caller time: 5405 ms, Samples called: 186589921, samples/s: 3.45217e+07
For more detailed metrics, the `--progress_stats_frequency` argument can be used, although this reports bases called/second as opposed to samples. Below is some sample output with `--progress_stats_frequency 5`:
Found 38 input read files to process.
Processing ...
[PROG_STAT_HDR] time elapsed(secs), time remaining (estimate), total reads processed, total reads (estimate), interval(secs), interval reads processed, interval bases processed, bases/sec
[PROG_STAT] 5.00439, 10.8428, 12, 38, 5.00439, 12, 66073, 13203.0
[PROG_STAT] 10.0091, 8.10263, 21, 38, 5.00466, 9, 61161, 12220.8
[PROG_STAT] 15.0133, 1.76627, 34, 38, 5.0041, 13, 71410, 14270.3
[PROG_STAT] 17.1152, 0, 38, 38, 2.10173, 4, 35785, 17026.5
Caller time: 17530 ms, Samples called: 2157249, samples/s: 123060
All instances of guppy_basecaller completed successfully.
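When comparing runs with different --num_clients values, it can help to extract the throughput from the "Caller time" line automatically. The Python sketch below parses that line using the format shown in the sample output above. The function names are hypothetical, and the regular expression is an assumption based on those samples.

```python
import re

_CALLER_RE = re.compile(
    r"Caller time:\s*(\d+)\s*ms,\s*Samples called:\s*(\d+),\s*samples/s:\s*([0-9.eE+-]+)"
)


def parse_caller_summary(line):
    """Extract (caller_time_ms, samples_called, samples_per_second), or None."""
    match = _CALLER_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def best_num_clients(results):
    """Pick the num_clients value with the highest reported samples/s.

    `results` maps a num_clients value to the summary line printed by that run.
    """
    rates = {n: parse_caller_summary(line)[2] for n, line in results.items()}
    return max(rates, key=rates.get)


line = "Caller time: 17530 ms, Samples called: 2157249, samples/s: 123060"
time_ms, samples, rate = parse_caller_summary(line)
# The reported rate should agree with samples divided by seconds.
print(rate, samples / (time_ms / 1000))
```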
Notes
- The intended usage is that the supervisor will be running clients that connect to a server, therefore it is necessary to supply the --port argument.
- If the Guppy basecall server was launched with the --use_tcp and/or --allow_non_local options then --use_tcp should also be supplied when launching the supervisor.
- Since the child Guppy basecall clients are using a server for the actual basecalling, the --device argument should NOT be supplied.
11. Expert settings
Parameters for expert users
There are additional advanced options for expert users. Experimenting with these parameters may significantly impact the performance or accuracy of the basecaller:
Data features
- Calibration strand reference file (--calib_reference): Provide a FASTA file to override the reference calibration strand.
- Calibration strand candidate minimum sequence length (--calib_min_sequence_length): Minimum sequence length for reads to be considered candidate calibration strands.
- Calibration strand candidate maximum sequence length (--calib_max_sequence_length): Maximum sequence length for reads to be considered candidate calibration strands.
- Calibration strand minimum coverage (--calib_min_coverage): Minimum reference coverage of candidate strand required for a read to pass calibration strand detection.
- DNA Adapter trimming threshold (--trim_threshold): Threshold above which data will be trimmed (in standard deviations of current level distribution).
- DNA Adapter trimming minimum events (--trim_min_events): Adapter trimmer minimum stride intervals after stall that must be seen.
- DNA Adapter trimming maximum search length (--max_search_len): Maximum number of samples from the beginning of the read to search through for the stall.
- Override automatic read scaling (--override_scaling): Flag to manually provide scaling parameters rather than estimating them from each read. See the --scaling_med and --scaling_mad options below. Note that if --ignore_scaling_from_read_files is not set, scaling overrides will only apply to reads which did not have scaling information stored in the source file.
- Manual read scaling median (--scaling_med): Median current value to use for manual scaling.
- Manual read scaling median absolute deviation (--scaling_mad): Median absolute deviation to use for manual scaling.
- Adapter Trimming strategy (--trim_strategy): Trimming strategy to apply to the raw signal before basecalling (must be one of dna, rna or none). The adapter looks different in the signal depending on whether DNA or RNA is being basecalled, so the two cases require a different adapter trimming algorithm. This should be set automatically by the config file, and usually it is not required to set this at the command line.
- RNA Adapter Trimming Window size (--dmean_win_size): Window size for coarse stall event detection. This parameter, --dmean_threshold and --jump_threshold are used to override how the RNA adapter trimming code operates. Generally, users should not need to change these unless they are familiar with how RNA adapter trimming works.
- RNA Adapter Trimming threshold (--dmean_threshold): Threshold for coarse stall event detection.
- RNA Adapter Trimming jump threshold (--jump_threshold): Threshold level for RNA stall detection.
- Disable event table transmission (--disable_events): Flag to disable the transmission of event tables when receiving reads back from the basecall server. If the event tables are not required for downstream processing (e.g. for 1D^2) then it is more efficient to disable them.
- Enable poly-T/non-sequence adapter-based read scaling (--pt_scaling): Flag to enable polyT/adapter max detection for read scaling. This will be used in preference to read median/median absolute deviation to perform read scaling if the poly-T to non-sequence adapter current level change can be detected.
- Poly-T scaling median offset (--pt_median_offset): Set polyT median offset for setting read scaling median (default 2.5).
- Poly-T scaling range scale (--adapter_pt_range_scale): Set polyT/adapter range scale for setting read scaling median absolute deviation (default 5.2).
- Poly-T scaling minimum adapter drop (--pt_required_adapter_drop): Set minimum required current drop from adapter max to polyT detection (default 30.0).
- Poly-T scaling minimum read start index (--pt_minimum_read_start_index): Set minimum index for read start sample required to attempt polyT scaling (default 30).
- Noisiest-section scaling maximum read size (--noisiest_section_scaling_max_size): Set the maximum size of a read (in samples) for which noisiest-section signal scaling is performed. For short reads, greater accuracy can be achieved by only using the noisiest section of the signal to calculate the signal median and median absolute deviation. These values are then used when scaling the read signal. Defaults to 0.
- Read ID whitelist (--read_id_list): A filename for a text file containing a whitelist of read IDs (one per line, no whitespace). If this option is specified, Guppy will only basecall reads from the input which have read IDs that are in the read whitelist.
- Barcoding configuration file (--barcoding_config_file): A filename from which to load the barcoding configuration, allowing users to override all barcoding parameters without specifying them at the command line. Defaults to 'configuration.cfg'.
- Sample sheet (--sample_sheet): A filename for a MinKNOW-compatible CSV format sample sheet, containing flow_cell_id, experiment_id and optionally barcode, or internal_barcode and external_barcode, or rapid_barcode and fip_barcode, used to identify a particular classification of read. The alias column will then be used by Guppy to rename the output files and folders based on the other classification values. Note that MinKNOW sample sheets can omit the flow_cell_id as long as they contain a position_id, but to be used with Guppy, the sample sheet must contain a flow_cell_id.
- Load scaling from read files (--load_scaling_info_from_read_files): Flag to enable loading scaling offset and scale information from source read files, if it exists. If this flag is set, Guppy will use the stored values in the input files, instead of computing scaling values for reads.
- Use quantile scaling (--use_quantile_scaling): When enabled, Guppy will calculate scaling values from the raw signal using quantile scaling instead of the default (med-mad).
- Beam cut (--beam_cut): Beam score cutoff for beam search decoding.
- Beam width (--beam_width): Beam width to use in beam search decode.
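Several of the options above refer to scaling a read by its median and median absolute deviation (med-mad). The Python sketch below shows the textbook form of that normalisation, with optional manual values standing in for --scaling_med and --scaling_mad. It is an illustration of the general technique only: Guppy's internal implementation may apply additional constants or steps, and the function names are hypothetical.

```python
import statistics


def med_mad(signal):
    """Return the median and median absolute deviation of a raw signal."""
    med = statistics.median(signal)
    mad = statistics.median(abs(x - med) for x in signal)
    return med, mad


def scale_signal(signal, med=None, mad=None):
    """Scale a signal as (x - med) / mad.

    Passing med and mad explicitly mimics a manual override; otherwise both
    values are estimated from the read itself.
    """
    if med is None or mad is None:
        med, mad = med_mad(signal)
    if mad == 0:
        raise ValueError("MAD is zero; signal cannot be scaled")
    return [(x - med) / mad for x in signal]


# Illustrative values only: the outlier (100) barely affects med or mad.
print(med_mad([1, 2, 3, 4, 100]))
print(scale_signal([1, 2, 3, 4, 100]))
```

The example shows why median-based statistics are attractive for raw signal: a single extreme sample shifts neither the median nor the MAD, whereas it would dominate a mean and standard deviation.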
Optimisation
- Model file (-m or --model_file): A path to a JSON RNN model file to use instead of the model specified in the configuration file.
- Adapter scaling model file (--as_model_file): Path to JSON model file for adapter scaling.
- Chunk size (--chunk_size): Set the size of the chunks of data which are sent to the basecaller for analysis. Chunk size is specified in signal blocks, so the total chunk size in samples will be chunk_size * event_stride.
- Chunk overlap (--overlap): The overlap between adjacent chunks, specified in signal blocks. An overlap is required for chunks to be stitched back into a continuous read.
- Max chunks per runner (--chunks_per_runner): The maximum number of chunks which can be submitted to a single neural network runner before it starts computation. Increasing this figure will increase GPU basecalling performance when GPU basecalling is enabled.
- Number of GPU runners per device (--gpu_runners_per_device): The number of neural network runners to create per CUDA device. Increasing this number may improve performance on GPUs with a large number of compute cores, but will increase GPU memory use. This option only affects GPU calling.
- CPU threads per caller (--cpu_threads_per_caller): The number of CPU threads to create for each caller to use. Increasing this number may improve performance on CPUs with a large number of cores, but will increase system load. This option only affects CPU calling.
- Stay penalty (--stay_penalty): Scaling factor to apply to stay probability calculation during transducer decode.
- Q-score offset (--qscore_offset): Override the q-score offset to apply when calibrating output q-scores for the read. There is an offset and scale (see --qscore_scale below) that are applied to the output base probabilities in the FASTQ for a basecall, to make the q-scores as close as possible to the Phred quality scores. Once a basecall model has been trained, these scores are calculated and added to the config files.
- Q-score scale (--qscore_scale): Override the q-score scale to apply when calibrating output q-scores for the read.
- Use built-in GPU kernels (--builtin_scripts): Set this flag to false to disable built-in GPU kernels, allowing custom kernels to be used (see --kernel_path).
- GPU Kernel source path (--kernel_path): Path to GPU kernel files, which will be used if --builtin_scripts is set to false.
- Number of adapter scalers (--as_num_scalers): Number of parallel scalers for adapter scaling.
- Reads per scaler (--as_reads_per_runner): Maximum reads per runner for adapter scaling.
- CPU threads per adapter scaler (--as_cpu_threads_per_scaler): Number of CPU worker threads per adapter scaler.
- GPU adapter scaling runners per device (--as_gpu_runners_per_device): Number of runners per GPU device for adapter scaling.
- Num alignment threads (--num_alignment_threads): Number of worker threads to use for alignment.
- Num barcoding threads (--num_barcoding_threads): Number of worker threads to use for barcoding.
- Num modified base basecaller threads (--num_base_mod_threads): The number of threads to use for Remora modified base detection in GPU basecalling mode.
- Num read splitting threads (--num_read_splitting_threads): Number of worker threads to use for read splitting.
- Num read splitting buffers (--num_read_splitting_buffers): Number of GPU memory buffers to allocate to perform read splitting. Controls level of parallelism on GPU for read splitting using mid adapter detection.
- Disable pings (--disable_pings): Flag to disable sending any telemetry information to Oxford Nanopore Technologies. See the "Ping information" section for a summary of what is included in the Guppy telemetry.
- Telemetry URL (--ping_url): Override the default URL for sending telemetry pings.
- Ping segment duration (--ping_segment_duration): Duration in minutes of each ping segment.
- Read batch size (--read_batch_size): The maximum batch size, in reads, for grouping input files. This controls the granularity at which resume can operate. Note that this value may be exceeded if individual input files contain more than this many reads. Output files for each batch will contain a maximum of --records_per_fastq entries.
- Int8 inference mode (--int8_mode): Enable quantised int8 mode for kernels which support it.
- Log speed frequency (--log_speed_frequency): How often to print out basecalling speed.
Overriding configuration parameters from the command-line
Guppy configuration files specify many of the optional parameters discussed previously. For example, the basecalling section of a configuration file could look like this:
# Basic configuration file for ONT Guppy basecaller software.
# Basecalling.
model_file = template_r9.5_450bps_5mer_raw.jsn
chunk_size = 1000
runners = 20
chunks_per_runner = 20
overlap = 50
qscore_offset = -0.06
qscore_scale = 1.16
builtin_scripts = 1
The parameters specified in the configuration file can be overridden from the command-line by arguments of the form --parameter value, e.g.:
guppy_basecaller --config dna_r9.5_450bps.cfg --runners 40 [other options]
Command-line parameters always take priority over config file parameters, so running Guppy with these arguments would override the runners setting from the config file, forcing it to 40. This facilitates small changes to parameters. Please note that an argument value must not contain unquoted spaces; a value that contains spaces must be wrapped in quotes. For example, to run Guppy with two GPU devices, you would set the devices like so:
guppy_basecaller --device "cuda:0 cuda:1" [other options]
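The precedence rule described above (command-line values override configuration file values) can be illustrated with a short Python sketch. It parses simple "key = value" lines like those in the example configuration and then applies overrides on top. The function names are hypothetical, and the parser handles only this simple format, not necessarily every construct a real Guppy configuration file may contain.

```python
def parse_config(text):
    """Parse 'key = value' lines, ignoring blanks, comments and other text."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


def effective_parameters(config_text, cli_overrides):
    """Apply command-line overrides on top of config file values."""
    params = parse_config(config_text)
    params.update({key: str(value) for key, value in cli_overrides.items()})
    return params


config_text = """
model_file = template_r9.5_450bps_5mer_raw.jsn
chunk_size = 1000
runners = 20
"""
# Equivalent in spirit to passing "--runners 40" on the command line.
print(effective_parameters(config_text, {"runners": 40})["runners"])  # 40
```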
12. Configuring a GPU version of Guppy with MinKNOW for MinION or PromethION 2 Solo on Windows
IMPORTANT
In some cases, you may wish to reconfigure how GPU-enabled MinKNOW installations are set up, to get the best out of the available hardware on the host device.
GPU basecalling is supported on NVIDIA GPUs only, and only on Linux and Windows. The installation of the GPU is done at the user's own risk.
Modifying the installation of GPU versions of Guppy and MinKNOW is done at your own risk. Misconfiguration of the GPU may result in slow basecalling and/or a large number of skipped reads if the basecall server crashes due to misparameterisation.
A GPU with at least 12 GB of memory is recommended. GPUs with less than 8 GB of memory may not work, especially with HAC or SUP models. For full GPU recommendations, refer to the 'Computer requirements' section of the protocol.
Please note that CUDA Toolkit 11.8 needs to be installed to run GPU basecalling on Windows. CUDA 11.8 can be downloaded from the NVIDIA download archive.
It is also recommended to install the latest GPU drivers available for your system and graphics card. See the NVIDIA driver download page for details.
The following commands need to be entered into a Windows command prompt which has been run as an administrator. They also assume a standard MinKNOW installation, where the location of MinKNOW is C:\Program Files\OxfordNanopore\MinKNOW.
1. Modify MinKNOW's application configuration to enable GPU basecalling and set the appropriate settings.
For example, to change the used CUDA devices to just use device 1, you would run:
"C:\Program Files\OxfordNanopore\MinKNOW\bin\config_editor.exe" ^
--conf application --filename "C:\Program Files\OxfordNanopore\MinKNOW\conf\app_conf" ^
--set guppy.server_config.gpu_devices="cuda:1"
Information about Guppy settings can be found in the appropriate section of the Guppy protocol.
2. Restart the MinKNOW service:
- Open the Windows menu.
- Type "services" and select the Services app that is displayed.
- Scroll down and find the MinKNOW service.
- Right-click on it and select Restart.
3. Confirm the guppy_basecall_server is using the GPU:
nvidia-smi
Note that in some instances, nvidia-smi will not report running processes if launched through CMD. Running nvidia-smi in PowerShell will show the correct processes if this occurs.
4. Monitor your first sequencing run using the MinKNOW GUI to make sure basecalling is working as expected.
Troubleshooting
If step 3 above does not show guppy_basecall_server using the GPU, or if Guppy crashes frequently, then it is recommended to check the Guppy log files. These files are normally found in C:\data\guppy_logs, and a new file will be created every time the basecall server is launched.
- If there is no server log with a timestamp that roughly matches step 2 ("Restart the MinKNOW service") above, then a new basecall server has not been launched. In this case, restart the MinKNOW service. If there is still no new log file created, restart your computer.
- If there is a new server log file but it does not contain the parameters that were set as part of step 5 ("Modify MinKNOW's application configuration") above, then repeat steps 1 and 2.
- If, during step 1, you see the following error message:
Failed to open C:\Program Files\OxfordNanopore\MinKNOW\conf\app_conf for writing.
, then your terminal has not been run as an administrator.
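The log checks above can be partly automated. The sketch below is illustrative Python, not an Oxford Nanopore tool: it finds the most recently modified file in a log folder and reports whether it mentions a given piece of text, such as a parameter you have set. The function names are hypothetical, and the log folder must be supplied by the caller.

```python
from datetime import datetime
from pathlib import Path


def newest_log(log_dir):
    """Return (path, modification time) of the most recently modified file, or None."""
    files = [path for path in Path(log_dir).iterdir() if path.is_file()]
    if not files:
        return None
    latest = max(files, key=lambda path: path.stat().st_mtime)
    return latest, datetime.fromtimestamp(latest.stat().st_mtime)


def log_contains(path, text):
    """Return True if the log file mentions the given text, e.g. a parameter you set."""
    return text in Path(path).read_text(errors="replace")
```

Compare the returned modification time with the time at which the MinKNOW service was restarted. If the newest log is older than the restart, a new basecall server has probably not been launched.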
Reconfiguration case: changing the set of GPUs used.
By default, MinKNOW will configure the guppy_basecall_server with --device cuda:all, which tells Guppy to use all the GPUs on the host machine. If this is not desired, the --device parameter can be changed to select specific devices (e.g. --device cuda:0). See the Guppy protocol for more information on the --device argument.
This setting can be changed by specifying a new guppy.server_config.gpu_devices:
"C:\Program Files\OxfordNanopore\MinKNOW\bin\config_editor.exe" ^
--conf application --filename "C:\Program Files\OxfordNanopore\MinKNOW\conf\app_conf" ^
--set guppy.server_config.gpu_devices="cuda:1"
Reconfiguration case: setting GPU parameters for lower-memory graphics cards (8 GB or less).
When performing GPU basecalling, Guppy divides data up into chunks and then combines together a certain number of chunks which are basecalled at the same time by a basecall "runner". The GPU memory use of one of these runners is governed by two things:
- The number of chunks in it (or "chunks per runner").
- The complexity of the basecall model being used (higher-accuracy basecall models use more memory).
For basecalling to occur, enough GPU memory needs to be allocated for at least one runner, and each basecalling configuration file has a default number of chunks per runner. This means that by default, each basecall model has a minimum amount of memory required to perform basecalling. When using graphics cards with lower amounts of memory, larger basecall models (such as HAC and Sup) may not run, instead returning an "out of memory" error. This can be addressed by lowering the number of chunks per runner.
Changing the Guppy "chunks per runner" value
- When using the standalone basecaller (guppy_basecaller), set the --chunks_per_runner command-line parameter.
- When using the basecall server with MinKNOW: edit MinKNOW's application configuration to add the --chunks_per_runner option to Guppy's "extra arguments" section:
"C:\Program Files\OxfordNanopore\MinKNOW\bin\config_editor" ^
--conf application --filename "C:\Program Files\OxfordNanopore\MinKNOW\conf\app_conf" ^
--set guppy.server_config.extra_arguments="--chunks_per_runner <value>"
To undo this change, run the same command with --set guppy.server_config.extra_arguments="" instead.
The following settings are recommended for 8 GB graphics cards. For cards with less GPU memory, or if the GPU is being used by other processes, these numbers may need to be lowered.
- For HAC, use --chunks_per_runner 160
- For Sup basecalling models, use --chunks_per_runner 10
After changing these settings, restart the MinKNOW service so that the changes take effect.
Configure Guppy to use TCP and allow remote connections
For security purposes, by default the Guppy basecall server will only allow connections from other processes running on the same computer. In some cases, you may need to connect from other PCs to perform basecalling, while at the same time allowing MinKNOW to run as normal and also make use of the Guppy basecall server.
Edit MinKNOW's application configuration to add --allow_non_local to Guppy's "extra arguments" section:
"C:\Program Files\OxfordNanopore\MinKNOW\bin\config_editor" ^
--conf application --filename "C:\Program Files\OxfordNanopore\MinKNOW\conf\app_conf" ^
--set guppy.server_config.extra_arguments="--allow_non_local"
Then restart the MinKNOW service from the Windows Services app, as described in the "Restart the MinKNOW service" step above, so that the change takes effect.
Recommended procedure
To find the best configuration values for GPU basecalling, it is recommended to have a small number of input files available that you can use to measure basecalling speed. These files should take at least a minute or so to basecall, to minimise the overhead that comes from loading the files into Guppy.
It is also recommended to only adjust the --chunks_per_runner value and not any other parameters. See the next section for further details.
- Begin by setting --chunks_per_runner to 1, and confirm that GPU basecalling works with the configuration file you have chosen. Basecalling will likely be extremely slow at this point. Note the basecalling speed, which is listed in "samples per second" once basecalling has completed (e.g. samples/s: 3.22e+06).
- Increase the value of --chunks_per_runner until basecalling stops working due to a memory error, or you stop seeing an increase in basecalling speed. The final value of --chunks_per_runner will likely be less than 1000, and for very high accuracy models it could be significantly less: use this range as a guide to choose the values of --chunks_per_runner to test.
Note that it will likely be necessary to choose different values of --chunks_per_runner for different types of basecalling configuration files. For example, HAC configuration files will likely support higher values of --chunks_per_runner than Sup configuration files.
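The procedure above amounts to a simple search, which can be sketched as follows. This is one possible way to organise the test, not a feature of Guppy. The measure callable is hypothetical and must be supplied by you: it should basecall your test files with the given --chunks_per_runner value and return the reported samples per second, raising MemoryError when Guppy reports an out-of-memory error. Doubling the value at each step and the 5% minimum gain are arbitrary assumptions.

```python
def tune_chunks_per_runner(measure, start=1, limit=1000, min_gain=1.05):
    """Double chunks_per_runner until memory runs out or speed stops improving.

    Returns (best_value, best_speed); best_value is None if even `start` fails.
    """
    best_value, best_speed = None, 0.0
    value = start
    while value <= limit:
        try:
            speed = measure(value)
        except MemoryError:
            break  # the previous value is the largest tested value that fits in GPU memory
        if best_value is not None and speed < best_speed * min_gain:
            break  # no meaningful speed-up any more
        best_value, best_speed = value, speed
        value *= 2
    return best_value, best_speed
```

Because the best value depends on the basecall model, it is worth repeating the search separately for each configuration file you intend to use.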
Other parameters
There are two other parameters that can affect basecalling speed and memory use:
- --gpu_runners_per_device (where "device" is a GPU)
- --chunk_size
Once Guppy can allocate enough GPU memory for a single runner, it will dynamically create additional runners as needed, up to the maximum number listed by --gpu_runners_per_device. However, Guppy will only create these additional runners if there is enough GPU memory available to do so. This means that you can leave --gpu_runners_per_device at a high value (e.g. 20), and you are unlikely to see performance changes by adjusting this downwards.
It is not recommended to change the value of --chunk_size. This is because adjustments to this value will change your basecall results.
Reconfiguration case: Setting GPU parameters for high-power graphics cards.
The default parameters for Guppy are designed to work reliably for all cards from the minimum specification upwards. However, the settings may not be optimal for newer or higher-power GPUs. It may be possible to get higher basecalling performance by experimenting with increasing the --chunks_per_runner setting, following the instructions above.
13. Configuring a GPU version of Guppy with MinKNOW for MinION or PromethION 2 Solo on Linux
IMPORTANT
In some cases, you may wish to reconfigure how GPU-enabled MinKNOW installations are set up, to get the best out of the available hardware on the host device.
GPU basecalling is supported on NVIDIA GPUs only, and only on Linux and Windows. The installation of the GPU is done at the user's own risk.
Modifying the installation of GPU versions of Guppy and MinKNOW is done at your own risk. Misconfiguration of the GPU may result in slow basecalling and/or a large number of skipped reads if the basecall server crashes due to misparameterisation.
A GPU with at least 12 GB of memory is recommended. GPUs with less than 8 GB of memory may not work, especially with HAC or SUP models. For full GPU recommendations, refer to the 'Computer requirements' section of the protocol.
The following commands need to be entered into a terminal. Note that some of them will require superuser privileges:
Use systemctl to edit the existing guppyd service (this will open a text editor with a copy of the existing service file):
sudo systemctl edit guppyd.service --full
If a prompt appears asking about overwriting the existing guppyd.service file, enter "y" to continue.
Edit that new service file to contain the required GPU configuration for the guppy_basecall_server. You can change any other server arguments at the same time.
To do so, change this line in the service file:
ExecStart=/opt/ont/guppy/bin/guppy_basecall_server <arguments>
Save the file and exit the text editor (the filename may look odd, but systemctl should change it to the correct name later).
Stop the MinKNOW service:
sudo service minknow stop
Stop the guppyd service:
sudo service guppyd stop
Confirm the guppy_basecall_server process is not running:
ps -A | grep guppy_basecall_
If the result of the above command is not blank, manually kill the process:
sudo killall guppy_basecall_server
Start the guppyd service:
sudo service guppyd start
Confirm the guppy_basecall_server is running and is using the GPU:
nvidia-smi
If the Guppy basecall server is not launching correctly, check its log output using journalctl ("-n 100" shows the last 100 entries in the journal) to see what is going wrong:
sudo journalctl -u guppyd.service -n 100
Confirm that the newly updated settings are being used by the guppyd service:
sudo service guppyd status
The output should include a line starting "CGroup" which will contain the arguments used by the basecall server. There should also be a line starting "Active: active (running)".
Start the MinKNOW service:
sudo service minknow start
Monitor your first sequencing run using the MinKNOW GUI to make sure basecalling is working as expected.
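The status check in the "Confirm that the newly updated settings are being used" step can be scripted along the following lines. This is a hedged Python sketch rather than an official tool: it takes the text printed by the status command and looks for the "Active: active (running)" line and for the arguments you expect. It assumes that the server command line appears on or after the line starting "CGroup"; the function name is illustrative.

```python
def summarise_status(status_text, expected_args=()):
    """Report whether the service is running and which expected arguments are missing."""
    lines = [line.strip() for line in status_text.splitlines()]
    running = any(line.startswith("Active: active (running)") for line in lines)
    # Collect the CGroup line and everything after it, where the process
    # command line (and therefore its arguments) is expected to appear.
    cgroup_text = ""
    for index, line in enumerate(lines):
        if line.startswith("CGroup"):
            cgroup_text = " ".join(lines[index:])
            break
    missing = [arg for arg in expected_args if arg not in cgroup_text]
    return {"running": running, "missing_args": missing}
```

An empty "missing_args" list together with "running" set to True suggests that the edited service file is in use.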
Troubleshooting
If some part of the above process does not work, then it is possible the guppyd service may end up misconfigured, and may be automatically disabled by the system. There are a few diagnostic checks that can be performed.
- Look at the Guppy basecall server logs and check for error messages that indicate that the GPU has not been configured or has not been found. Guppy log files are stored in /var/log/guppy.
- Use journalctl to directly read the log entries produced by Guppy and systemctl, and check for any error messages:
sudo journalctl -u guppyd.service -n 100
- Check whether the service is enabled:
systemctl list-unit-files | grep guppyd.service
If the service is not listed as "enabled", then it will either be marked as "disabled" or "masked". You can reset those statuses as described below.
a. If the service is marked as "disabled":
sudo systemctl enable guppyd.service
b. If the service is marked as "masked":
sudo systemctl unmask guppyd.service
Then enable the service:
sudo systemctl enable guppyd.service
- Reinstall the service:
sudo apt install --reinstall ont-guppyd-for-minion
sudo systemctl revert guppyd.service
sudo service guppyd restart
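The "enabled", "disabled" and "masked" cases in the service check above can be summarised as a small lookup. The Python sketch below is illustrative only: it takes one line of systemctl list-unit-files output for guppyd.service and returns the commands from this section that apply. It assumes the state is the second whitespace-separated field on that line.

```python
def recovery_commands(list_unit_files_line):
    """Map one 'systemctl list-unit-files' line for guppyd.service to fix-up commands."""
    fields = list_unit_files_line.split()
    if len(fields) < 2 or fields[0] != "guppyd.service":
        raise ValueError("expected a line starting with 'guppyd.service'")
    state = fields[1]
    enable = "sudo systemctl enable guppyd.service"
    if state == "enabled":
        return []  # nothing to do
    if state == "disabled":
        return [enable]
    if state == "masked":
        return ["sudo systemctl unmask guppyd.service", enable]
    raise ValueError("state not covered by this section: " + state)
```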
Reconfiguration case: changing the set of GPUs used.
By default, MinKNOW will configure the guppy_basecall_server with --device cuda:all, which tells Guppy to use all the GPUs on the host machine. If this is not desired, edit the guppyd service file and change the --device parameter to select specific devices (e.g. --device cuda:0). See the Guppy protocol for more information on the --device argument.
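Editing the ExecStart line by hand is error-prone, so the change can be prototyped first. The following Python sketch is an assumption-laden illustration, not part of Guppy: it takes the text of an ExecStart line and sets one argument, replacing the existing value if the flag is already present. It uses shell-style splitting and quoting, which is close to, but not identical to, systemd's own quoting rules, so check the result before saving the service file.

```python
import shlex


def set_exec_start_argument(exec_start_line, flag, value):
    """Return the ExecStart line with `flag` set to `value`, replacing any existing value."""
    prefix, _, command = exec_start_line.partition("=")
    tokens = shlex.split(command)
    if flag in tokens:
        index = tokens.index(flag)
        # Replace the token following the flag, or insert the value if the
        # flag is last or is directly followed by another option.
        if index + 1 < len(tokens) and not tokens[index + 1].startswith("--"):
            tokens[index + 1] = value
        else:
            tokens.insert(index + 1, value)
    else:
        tokens += [flag, value]
    return prefix + "=" + " ".join(shlex.quote(token) for token in tokens)
```

The same helper can be used for other arguments that are added to the ExecStart line, such as --chunks_per_runner.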
Reconfiguration case: setting GPU parameters for lower-memory graphics cards (8 GB or less).
When performing GPU basecalling, Guppy divides data up into chunks and then combines together a certain number of chunks which are basecalled at the same time by a basecall "runner". The GPU memory use of one of these runners is governed by two things:
- The number of chunks in it (or "chunks per runner").
- The complexity of the basecall model being used (higher-accuracy basecall models use more memory).
For basecalling to occur, enough GPU memory needs to be allocated for at least one runner, and each basecalling configuration file has a default number of chunks per runner. This means that by default, each basecall model has a minimum amount of memory required to perform basecalling. When using graphics cards with lower amounts of memory, larger basecall models (such as HAC and Sup) may not run, instead returning an "out of memory" error. This can be addressed by lowering the number of chunks per runner.
Changing the Guppy "chunks per runner" value
- When using the standalone basecaller (guppy_basecaller), set the --chunks_per_runner command-line parameter.
- When using the basecall server with MinKNOW: edit the guppyd service file and add --chunks_per_runner <value> to the ExecStart line, before restarting the service.
The following settings are recommended for 8 GB graphics cards. For cards with less GPU memory, or if the GPU is being used by other processes, these numbers may need to be lowered.
- For HAC, use --chunks_per_runner 160
- For Sup basecalling models, use --chunks_per_runner 10
After changing these settings, restart the guppyd service so that the changes take effect.
Configure Guppy to use TCP and allow remote connections
For security purposes, by default the Guppy basecall server will only allow connections from other processes running on the same computer. In some cases, you may need to connect from other PCs to perform basecalling, while at the same time allowing MinKNOW to run as normal and also make use of the Guppy basecall server.
Edit the guppyd service file and add --use_tcp and --allow_non_local to the ExecStart line, before restarting the service.
Edit the file /opt/ont/minknow/conf/app_conf and add the following lines:
- name: server_port
type: int
default: 5555
- name: server_ipc_file
type: std::string
default: '"/tmp/.guppy/5555"'
- name: use_tcp
type: bool
default: defaults::guppy_use_tcp()
Then restart MinKNOW with the command: `systemctl restart minknow`
Recommended procedure
To find the best configuration values for GPU basecalling, it is recommended to have a small number of input files available that you can use to measure basecalling speed. These files should take at least a minute or so to basecall, to minimise the overhead that comes from loading the files into Guppy.
It is also recommended to only adjust the --chunks_per_runner value and not any other parameters. See the next section for further details.
- Begin by setting --chunks_per_runner to 1, and confirm that GPU basecalling works with the configuration file you have chosen. Basecalling will likely be extremely slow at this point. Note the basecalling speed, which is listed in "samples per second" once basecalling has completed (e.g. samples/s: 3.22e+06).
- Increase the value of --chunks_per_runner until basecalling stops working due to a memory error, or you stop seeing an increase in basecalling speed. The final value of --chunks_per_runner will likely be less than 1000, and for very high accuracy models it could be significantly less: use this range as a guide to choose the values of --chunks_per_runner to test.
Note that it will likely be necessary to choose different values of --chunks_per_runner for different types of basecalling configuration files. For example, HAC configuration files will likely support higher values of --chunks_per_runner than Sup configuration files.
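When comparing several test runs, it can help to extract the reported speed from the basecaller output automatically. The Python sketch below assumes only that the output contains text of the form "samples/s: 3.22e+06", as in the example above; the function name is illustrative.

```python
import re

# Matches e.g. "samples/s: 3.22e+06" and captures the number.
SAMPLES_PATTERN = re.compile(r"samples/s:\s*([0-9.]+(?:[eE][+-]?[0-9]+)?)")


def samples_per_second(output_text):
    """Return the last 'samples/s' value found in basecaller output, or None."""
    matches = SAMPLES_PATTERN.findall(output_text)
    return float(matches[-1]) if matches else None
```

Recording this value for each --chunks_per_runner setting you test makes it easier to see where the speed stops increasing.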
Other parameters
There are two other parameters that can affect basecalling speed and memory use:
- --gpu_runners_per_device (where "device" is a GPU)
- --chunk_size
Once Guppy can allocate enough GPU memory for a single runner, it will dynamically create additional runners as needed, up to the maximum number listed by --gpu_runners_per_device. However, Guppy will only create these additional runners if there is enough GPU memory available to do so. This means that you can leave --gpu_runners_per_device at a high value (e.g. 20), and you are unlikely to see performance changes by adjusting this downwards.
It is not recommended to change the value of --chunk_size. This is because adjustments to this value will change your basecall results.
Reconfiguration case: Setting GPU parameters for high-power graphics cards.
The default parameters for Guppy are designed to work reliably for all cards from the minimum specification upwards. However, the settings may not be optimal for newer or higher-power GPUs. It may be possible to get higher basecalling performance by experimenting with increasing the --chunks_per_runner setting, following the instructions above.
14. Barcoding/demultiplexing
Barcoding/demultiplexing overview
In the Guppy suite, barcoding can be performed by a separate executable. This allows barcoding to be performed as an offline analysis step without having to re-basecall the source reads. To perform barcoding in this way, invoke the barcoder with the minimum required parameters:
guppy_barcoder --input_path <folder containing FASTQ and/or FASTA files> --save_path <output folder> --config configuration.cfg
When performing barcode detection, Guppy will create a barcoding_summary.txt file in the output folder, which contains information about the best-matching barcodes for each read in the FASTQ/FASTA files in the input folder (see "Summary file contents" in the "Input and output files" section for details). The output FASTQ/FASTA files will be written into barcode-specific subdirectories for the barcode detected. A log file is also emitted with information about the execution run.
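As an illustration of how the summary file might be used, the following Python sketch counts reads per detected barcode. It rests on two assumptions that should be checked against "Summary file contents": that the file is tab-separated with a header row, and that the classification is held in a column named barcode_arrangement. The function name is hypothetical.

```python
import csv
from collections import Counter


def reads_per_barcode(summary_path, column="barcode_arrangement"):
    """Count reads per value of `column` in a tab-separated summary file."""
    with open(summary_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not found in {summary_path}")
        return Counter(row[column] for row in reader)
```

The result behaves like a dictionary, so reads_per_barcode(path).most_common() lists the barcodes from most to least abundant.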
The Guppy barcoder supports the following optional parameters:
- Version (-v or --version): Prints the version of Guppy barcoder.
- Help (-h or --help): Prints a help message describing usage and all the available parameters.
Data features
- Require a barcode on both ends of the read (--require_barcode_both_ends): Option to only classify reads where a barcode has been detected at both the front and rear of the read. This can significantly reduce the number of reads that are classified, and is also not a valid argument for the Rapid kits (which do not have a rear barcode).
- Allow inferior barcodes to be used in arrangements (--allow_inferior_barcodes): Option to still classify reads when the barcode selected at each end of the read was not the highest-scoring barcode detected (assuming one was detected above the minimum score). This can slightly increase the number of reads that are classified but can increase the false-positive rate in classifications.
- Front window size (--front_window_size): Specify the maximum window of the start of the read (in bases) to search for the front barcode in. The default is 150 bases.
- Rear window size (--rear_window_size): Specify the maximum window of the end of the read (in bases) to search for the rear barcode in. The default is 150 bases.
- Detect mid-strand barcodes (--detect_mid_strand_barcodes): Flag option to enable detection of barcodes within the strand. This option can be used to detect abnormal reads such as chimeras. If a mid-strand barcode is detected, the read will be classified as "unclassified".
- Detect mid-strand adapters (--detect_mid_strand_adapter): Flag option to enable detection of adapter sequences within the strand. This option can be used to detect abnormal reads such as chimeras.
- Minimum score for barcode detection (--min_score_barcode_front): Specify the minimum score for barcode detection. Unless a minimum score is also set for rear barcodes, this score will be used for both front and rear barcodes. Default is 60.
- Minimum score for rear barcodes (--min_score_barcode_rear): Specify the minimum score for rear barcodes. Use this if you want to set a different minimum score for rear barcodes than for front barcodes. Default is to use the front barcode minimum.
- Minimum score for detection of barcode contexts (--min_score_barcode_mask): Specify the minimum score to consider a barcode context to be a valid location to search for a barcode. If set to -1.0, this option is ignored and barcode scoring is performed on a weighted average of the barcode and context score. Default is -1.0.
- Minimum score for detection of mid-strand barcodes (--min_score_barcode_mid): Minimum score for a barcode detected mid-strand to be considered a valid alignment. Mid-strand barcodes below this threshold will be ignored. The default is 40.0.
- LamPORE kit (--lamp_kit): Specify the LamPORE kit to use for detection. Note that unlike --barcode_kits, analysing reads against multiple LamPORE kits simultaneously is not supported.
- Minimum score for detection of LAMP FIP barcodes (--min_score_lamp): Specify the minimum score to consider a LAMP FIP barcode to be classified. Default is 80.0.
- Minimum score for detection of LAMP FIP barcode masks (--min_score_lamp_mask): Specify the minimum score to consider a LAMP FIP barcode context to be a valid location to search for a FIP barcode. Default is 50.0.
- Minimum score for detection of LAMP targets (--min_score_lamp_target): Specify the minimum score to consider a LAMP target sequence alignment to be classified. Default is 75.
- Minimum score for detection of adapters (--min_score_adapter): Minimum score for an adapter to be considered a valid alignment. Default is 60.
- Minimum score for detection of mid-strand adapters (--min_score_adapter_mid): Minimum score for a mid-strand adapter to be considered a valid alignment. Default is 50.
- Minimum score for detection of primers (--min_score_primer): Minimum score for a primer to be considered a valid alignment. Default is 60.
- Minimum length for detection of LAMP FIP barcode masks (--min_length_lamp_context): Specify the minimum length to consider a LAMP FIP barcode context to be a valid location to search for a FIP barcode. Default is 40.
- Minimum length for detection of LAMP targets (--min_length_lamp_target): Specify the minimum length to consider a LAMP target sequence alignment to be classified. Default is 80.
- Additional LAMP barcode context bases (--additional_lamp_context_bases): Number of bases from a LAMP FIP barcode context to append to the front and rear of the FIP barcode before performing matching. Default is 2.
- Detect adapter sequences at front and rear of the read (--detect_adapter): Enables adapter detection. Disabled by default.
- Detect primer sequences at front and rear of the read (--detect_primer): Enables primer detection. Disabled by default.
- Enable trimming barcodes (--enable_trim_barcodes): Flag to enable trimming of barcodes from the sequences in the output files. If present, detected barcodes will be trimmed from the sequence. See "Barcode trimming" for more details and related options.
Input/output
- Quiet mode (-z or --quiet): This option prevents the Guppy barcoder from outputting anything to stdout. Stdout is short for "standard output" and is the default location to which a running program sends its output. For a command line executable, stdout will typically be sent to the terminal window from which the program was run.
- Verbose logging (--verbose_logs): Flag to enable verbose logging (outputting a verbose log file, in addition to the standard log files, which contains detailed information about the application). Off by default.
- Recursive (-r or --recursive): Search through all subfolders contained in the --input_path value, and perform barcode detection on any FASTQ or FASTA files found in them.
- Configuration file (-c or --config): This option allows you to specify a configuration file, which contains details of the parameters used during barcode detection. The default configuration file supplied with Guppy should be sufficient for most users. An additional configuration file, configuration_dual.cfg, contains settings for dual-barcode preparations.
- Override default data path (-d or --data_path): Option to explicitly specify the path to use for loading any data files the application requires (for example, if you have created your own model files or config files).
- Records per FASTQ (-q or --records_per_fastq): The maximum number of reads to put in a single FASTQ or FASTA file. Set this to zero to output all reads into one file (per run id, per batch). The default value is 4000.
- Perform FASTQ compression (--compress_fastq): Flag to enable gzip compression of output FASTQ/FASTA files; this reduces file size to about 50% of the original. See also --read_batch_size.
- BAM file output (--bam_out): This flag enables BAM file output. Default is for BAM file output to be disabled.
- BAM file indexing (--index): This flag enables BAM file indexing. If the flag is present, guppy_barcoder sorts the BAM file output and generates the BAI index file. This flag requires that --bam_out is also set. Disabled by default.
- FASTQ file output (--fastq_out): This flag enables FASTQ file output. If neither --bam_out nor --fastq_out is enabled, FASTQ output is enabled by default.
- Input valid extensions (--ext_in): Only files with the specified extensions are processed (comma-separated list). If this is not set, all files with a supported extension are processed. Supported extensions are: .fastq, .fq, .fasta, .fa, .sam, .bam. Sequences from a .sam or .bam file that have been stored as the reverse complement will be reverse-complemented before barcoding.
Optimisation
- Worker thread count (-t or --worker_threads): The number of worker threads to spawn for the barcoder to use. Increasing this number will allow Guppy barcoder to make better use of multi-core CPU systems, but may impact overall system performance.
- GPU device (-x or --device): Specify the CUDA-enabled GPU to use to perform barcode alignment. Parameters are specified the same way as in the basecaller application.
- Limit the kits to detect against (--barcode_kits): List of barcoding kit(s) or expansion kit(s) used to limit the number of barcodes to be detected against. This speeds up barcoding. Multiple kits must be a space-separated list in double quotes.
- Number of parallel GPU barcoding buffers (--num_barcoding_buffers): Number of parallel memory buffers to supply to the GPU for barcode strand detection. Greater numbers will increase parallelism on the GPU at an increased memory cost. The default is 24.
- Number of reads to process in parallel in each GPU barcoding buffer (--num_reads_per_barcoding_buffer): The number of reads to process in parallel in each GPU barcoding buffer. Greater numbers will increase parallelism on the GPU at an increased memory cost. The default is 4.
- Number of parallel GPU mid-barcode detection buffers (--num_mid_barcoding_buffers): Number of parallel memory buffers to supply to the GPU for barcode mid-strand detection. Greater numbers will increase parallelism on the GPU at an increased memory cost. The default is 96.
- Limit the barcodes to a subset of the kits (--barcode_list): Only the barcodes in this space-separated list will be considered when barcoding.
- Progress stats reporting frequency (--progress_stats_frequency): Frequency in seconds at which to report progress statistics. If supplied, this replaces the default progress display.
- Trace category logs (--trace_category_logs): Enable trace logs. Takes a list of strings with the desired names.
- Trace domains config (--trace_domains_config): Configuration file containing the list of trace domains to include in verbose logging (if enabled).
- Disable pings (--disable_pings): Flag to disable sending any telemetry information to Oxford Nanopore Technologies. See the "Ping information" section for a summary of what is included in the Guppy telemetry.
- Telemetry URL (--ping_url): Override the default URL for sending telemetry pings.
- Ping segment duration (--ping_segment_duration): Duration in minutes of each ping segment.
To see the supported barcoding kits, run the barcoder with the --print_kits argument:
guppy_barcoder --print_kits
To limit the kits to detect against:
guppy_barcoder --input_path <folder containing FASTQ and/or FASTA files> --save_path <output folder> --config configuration.cfg --barcode_kits SQK-RPB004
Or for multiple kits add a space-separated list in double quotes:
guppy_barcoder --input_path <folder containing FASTQ and/or FASTA files> --save_path <output folder> --config configuration.cfg --barcode_kits "EXP-NBD104 EXP-NBD114"
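When the barcoder is launched from a script rather than typed at a prompt, the quoting of multiple kits needs some care. The Python sketch below builds the argument list for the commands shown above. It is a minimal illustration using only the parameters already described in this section; the function name and default values are assumptions.

```python
import shlex


def barcoder_command(input_path, save_path, config="configuration.cfg", barcode_kits=()):
    """Build a guppy_barcoder argument list suitable for subprocess.run()."""
    args = [
        "guppy_barcoder",
        "--input_path", str(input_path),
        "--save_path", str(save_path),
        "--config", config,
    ]
    if barcode_kits:
        # Several kits are passed as ONE argument containing a space-separated
        # list, which is what the double quotes achieve on the command line.
        args += ["--barcode_kits", " ".join(barcode_kits)]
    return args


def as_shell_text(args):
    """Render the argument list as a single, safely quoted command line."""
    return shlex.join(args)
```

Passing the list directly to subprocess.run() avoids shell quoting altogether. as_shell_text() is only for display or logging, and it uses POSIX-style quoting.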
Barcoding of dual-barcode arrangements is also supported. To use dual-barcode arrangements, the correct configuration file must be specified:
guppy_barcoder --input_path <folder containing FASTQ and/or FASTA files> --save_path <output folder> --config configuration_dual.cfg --barcode_kits "EXP-DUAL00"
Note that running barcode detection on dual- and single-barcode kits at the same time is not currently supported. New columns will be emitted into the barcoding_summary.txt or sequencing_summary.txt when performing demultiplexing of dual barcode kits: barcode_front_id_inner, barcode_front_score_inner, barcode_rear_id_inner and barcode_rear_score_inner.
Barcoding during basecalling
It is also possible to perform barcode detection during the basecalling process. When invoking the guppy_basecaller executable, simply provide a valid set of kits to the --barcode_kits argument to enable barcoding, for example:
guppy_basecaller --input_path <folder containing .fast5 or .pod5 files> --save_path <output folder> --config dna_r9.4.1_450bps_fast.cfg --barcode_kits SQK-RBK001
Note that options such as barcode trimming and demultiplexing output FASTQ/FASTA files are all supported by the guppy_basecaller executable as well as guppy_barcoder. Guppy also supports barcode demultiplexing during basecalling when using the guppy_basecall_server. If a barcoding configuration file other than the default configuration.cfg is required, the basecaller executable supports selecting a barcode config using the --barcoding_config_file command-line option.
Barcode FASTQ output
The barcoding executable will output FASTQ/FASTA files into barcode-specific subdirectories in the output folder depending on the barcode that was detected. The FASTQ naming follows the same rules as for basecalling (see "Guppy features, settings and analysis"). A barcode directory will only exist if the barcode was detected. The output structure will look like this:
guppy_output_folder/
| barcoding_summary.txt
--- barcode01/
| fastq_runid_777_0.fastq
| fastq_runid_abc_0.fastq
| fastq_runid_abc_1.fastq
--- barcode03/
| fastq_runid_777_0.fastq
| fasta_runid_xyz_0.fasta
--- unclassified/
| fastq_runid_777_0.fastq
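To inspect an output folder with the structure shown above, something like the following Python sketch can be used. It is illustrative only: it lists the FASTQ/FASTA files found in each barcode subdirectory. It assumes that compressed output simply has ".gz" appended to the file name.

```python
from pathlib import Path


def files_by_barcode(output_dir):
    """Map each subdirectory name to the sorted FASTQ/FASTA file names inside it."""
    result = {}
    for sub in sorted(Path(output_dir).iterdir()):
        if not sub.is_dir():
            continue  # skip barcoding_summary.txt and other top-level files
        result[sub.name] = sorted(
            path.name
            for path in sub.iterdir()
            if path.suffix in {".fastq", ".fasta"}
            or path.name.endswith((".fastq.gz", ".fasta.gz"))
        )
    return result
```

Barcodes that were not detected have no directory, so they are simply absent from the returned dictionary.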
Barcode trimming
The barcoding executable can automatically trim the detected barcodes from the sequence before it is output to the FASTQ/FASTA file. This is off by default. To enable barcode trimming, add the --enable_trim_barcodes argument:
guppy_barcoder --input_path <folder containing FASTQ and/or FASTA files> --save_path <output folder> --config configuration.cfg --enable_trim_barcodes
Two extra columns will then be written into the barcoding_summary.txt output: barcode_front_total_trimmed and barcode_rear_total_trimmed. A barcode will only be trimmed if it is above the min_score threshold (default 60), and the aligned sequence that matches to the barcode will be removed from the front and/or rear of the sequence that is then written to the FASTQ/FASTA.
For more severe trimming, there is a --num_extra_bases_trim argument, which defaults to 0. Setting this to, for example, 2 would trim the detected barcode sequence plus an extra 2 bases. To be more cautious, give this argument a negative number; for example, -3 would trim 3 fewer bases than were detected as the barcode sequence.
guppy_barcoder --input_path <folder containing FASTQ and/or FASTA files> --save_path <output folder> --config configuration.cfg --num_extra_bases_trim 2
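The arithmetic behind --num_extra_bases_trim can be illustrated with a short Python sketch. This is a simplified model and not Guppy's implementation: it assumes the detected barcode occupies a known number of bases at each end of the read, and all names other than num_extra_bases_trim are hypothetical.

```python
def trim_barcodes(sequence, front_len=0, rear_len=0, num_extra_bases_trim=0):
    """Remove the detected barcode lengths, adjusted by the extra bases, from both ends."""

    def amount(detected):
        # Only trim ends where a barcode was detected; never trim a negative amount.
        return max(detected + num_extra_bases_trim, 0) if detected > 0 else 0

    front, rear = amount(front_len), amount(rear_len)
    end = len(sequence) - rear
    return sequence[front:max(end, front)]
```

For a read "AAAACCCCGGGGTTTT" with a 4-base barcode detected at each end, a value of 0 leaves "CCCCGGGG", a value of 2 leaves "CCGG", and a value of -3 leaves "AAACCCCGGGGTTT".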
Expert users - adjusting barcode classification thresholds
The classification threshold has been chosen to produce a low number of incorrect classifications while retaining an acceptable classification rate. The user may override this, but note that small changes can have a significant effect on the false-positive rate, so it is important to always test any changes before using them.
To change the threshold used for both the front and rear barcode, modify the --min_score argument. The following would increase the threshold for barcodes to be classified to 70, so that if either the front or rear barcode has a score of 70 or more the read will be classified:
guppy_barcoder --input_path <folder containing FASTQ and/or FASTA files> --save_path <output folder> --config configuration.cfg --min_score 70
The user may also set different front and rear thresholds by also supplying the --min_score_rear_override argument. If this is specified, then --min_score is used for the front barcode and --min_score_rear_override is used for the rear barcode. For example, in the following, a read will be classified if either the front barcode scores above the default (currently 60) or the rear barcode scores 55 or more:
guppy_barcoder --input_path <folder containing FASTQ and/or FASTA files> --save_path <output folder> --config configuration.cfg --min_score_rear_override 55
How barcode demultiplexing works in Guppy
This is a general outline of how the Guppy barcoder works and how you can adjust its classification thresholds.
The regions of a barcode
A complete barcode arrangement comprises three sections:
- The upstream flanking region, which comes between the barcode and the sequencing adapter
- The barcode sequence
- The downstream flanking region, which comes between the barcode and the sample sequence
A complete dual-barcode arrangement comprises five sections:
- The upstream flanking region, which comes between the outer barcode and the sequencing adapter
- The outer barcode sequence
- The mid flanking region, which comes between the outer barcode and the inner barcode
- The inner barcode sequence
- The downstream flanking region, which comes between the inner barcode and the sample sequence
The barcode sequences remain constant across almost all of Oxford Nanopore Technologies' kits. For example, the flanking regions for barcode 10 in the Rapid Barcoding Kit (SQK-RBK004) are different from the flanking regions for barcode 10 in the native barcoding expansion kit (EXP-NBD114), but the barcode sequence itself is the same.
While native kits use the same barcode sequences as other kits, barcodes 1-12 in the native kit are the reverse complement of the standard barcodes 1-12.
There is one other exception to this: barcode 12a in the Rapid PCR Barcoding Kit SQK-RPB004 has a different barcode sequence to barcode 12 in other kits. For this reason, the oligonucleotide of this sequence is referred to as "barcode 12a".
Different barcoding chemistries
While each barcoding chemistry type (e.g. native, rapid, or PCR) will produce barcodes with the pattern described in "The regions of a barcode", there can be variations in the flanking regions within a particular kit. These are referred to as either "forward" and "reverse" variations or "variation 1" and "variation 2" depending on the configuration. When these variations are present the full double-stranded sequence can look like this:
<barcodeXX_var1---><sample sequence top strand---><barcodeXX_var2_rc>
<barcodeXX_var1_rc><sample sequence bottom strand><barcodeXX_var2--->
The PCR Barcoding Expansion kit (EXP-PBC001) produces barcodes like the example directly above.
Or like this:
<barcodeXX_var1><sample sequence top strand---><barcodeXX_var2>
<barcodeXX_var2><sample sequence bottom strand><barcodeXX_var1>
The Native Barcoding Expansion kit (EXP-NBD114) produces barcodes like the second example, directly above.
The barcoding algorithm
The barcoding algorithm uses a modified Needleman-Wunsch method. We modify the Needleman-Wunsch algorithm by adding "gap open" and "gap extension" penalties, as well as separate "start gap" and "end gap" penalties. These penalties and the match/mismatch scores for aligning a barcode to a sequence are detailed in two places:
- Generic gap penalties are in the barcoding configuration file configuration.cfg, or configuration_dual.cfg for dual-barcode arrangements.
- DNA-specific match/mismatch scores are stored in the file 4x4_mismatch_matrix.txt. Note that these scores are shifted such that the highest score is 100, which means that the final barcode score will share the same maximum. There is also a 5x5_mismatch_matrix.txt file which includes the ability to match any cardinal base to a mask base 'N'.
Each barcode is aligned to a section of the basecall, usually the first and / or last 150 bases. This generates a grid of size 150 * < barcode_length >.
The barcoding score for a particular grid is calculated in a two-step process:
- The score for only the section of the grid that corresponds to the barcode itself is considered. This corresponds to removing the initial gap row and discarding all scores past the alignment of the last base of the barcode, or removing those sections where the "start gap" and "end gap" penalties are applied.
- The score is normalized by the total length of the barcode sequence. This ensures the final score is no more than the highest score in the mismatch table (which should be 100). Note that this potentially allows for negative scores when there are a relatively high number of gaps and/or mismatches.
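The scoring idea can be sketched in Python as a simplified alignment of a barcode against a window of the read. This is not Guppy's implementation: it uses a single linear gap penalty instead of separate "gap open" and "gap extension" penalties, treats start and end gaps as free, and the values match=100, mismatch=-100 and gap=-100 are assumptions chosen for the example rather than values from the mismatch matrix files.

```python
def barcode_score(barcode: str, window: str,
                  match: int = 100, mismatch: int = -100, gap: int = -100) -> float:
    """Best score for aligning the whole barcode somewhere in the window,
    normalized by barcode length (so a perfect match scores 100)."""
    n, m = len(barcode), len(window)
    # Row 0 is all zeros: skipping bases before the barcode is free.
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        cur = [prev[0] + gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if barcode[i - 1] == window[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,  # align two bases
                         prev[j] + gap,      # barcode base against a gap
                         cur[j - 1] + gap)   # window base against a gap
        prev = cur
    # Taking the best cell of the last row ignores bases after the barcode.
    return max(prev) / n


print(barcode_score("ACGT", "TTACGTTT"))  # perfect match
print(barcode_score("ACGT", "TTACCTTT"))  # one mismatch
print(barcode_score("AAAA", "CCCC"))      # nothing matches: negative score
```

Under these assumed penalties the three calls give 100.0, 50.0 and -100.0, which illustrates both the shared maximum of 100 and how heavy mismatching can drive the normalized score below zero.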
Measuring classification
The classification for a particular barcode is determined by comparing the barcoding score to a fixed classification threshold – scores that exceed the threshold are considered (successful) classifications. The current threshold is set to 60 for single barcode arrangements and 50 for dual barcode arrangements.
Classification for a read is determined by taking the single highest-scoring (successful) barcode classification. This includes both classifications made at the beginning of the sequence and (where applicable) the end. If no classification exists then the read is considered "unclassified".
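The selection rule can be sketched as follows. This is an illustration only: the function name classify_read and the dictionaries of per-barcode scores are hypothetical, and the sketch assumes a score equal to the threshold counts as a successful classification.

```python
from typing import Dict, Optional


def classify_read(front_scores: Dict[str, float],
                  rear_scores: Dict[str, float],
                  min_score: float = 60.0,
                  min_score_rear_override: Optional[float] = None) -> str:
    """Return the name of the single highest-scoring successful barcode,
    or "unclassified" if no score reaches its threshold."""
    rear_threshold = (min_score if min_score_rear_override is None
                      else min_score_rear_override)
    # Collect every (score, name) pair that passes its own threshold.
    passed = [(s, name) for name, s in front_scores.items() if s >= min_score]
    passed += [(s, name) for name, s in rear_scores.items() if s >= rear_threshold]
    if not passed:
        return "unclassified"
    return max(passed)[1]


print(classify_read({"barcode01": 90, "barcode03": 40}, {"barcode01": 10}))
print(classify_read({"barcode01": 58}, {"barcode01": 56}))
print(classify_read({"barcode01": 58}, {"barcode01": 56},
                    min_score_rear_override=55))
```

The third call shows how lowering only the rear threshold can turn an unclassified read into a classified one.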
Classification threshold criteria
The classification threshold has been chosen to produce a low number of incorrect classifications while retaining an acceptable classification rate. This means that when a read has been classified as having a particular barcode, that classification will be incorrect a low number of times. Ideally this false-positive rate is around 1 in 1000, though this can be dependent on how well individually-barcoded samples are purified before they are pooled together. Classification rates should be 90% or above for samples with barcodes on both ends.
It is important to note that the above evaluation criteria assume that only reads which pass Guppy's quality filters are used. This corresponds to reads which are placed in the "pass" folder after basecalling; generally these will be reads with a mean q-score value greater than 7.
Modifying classification thresholds
It is possible to increase the number of classifications at the cost of the false-positive rate. Small changes to this can have a significant effect on the false-positive rate, so it is important to test any changes to the thresholds before using them.
For example, here is a graph of the number of reads classified for particular binned values of the (best) barcoding score. The data set is a collection of around 200,000 reads barcoded with the Native Expansion kit (EXP-NBD114):
This graph shows that, for example, reads where the best barcode score is around 30 will have about 95% incorrect classifications. In contrast, for those reads where the highest barcode score is around 95, there will be near 0% incorrect classifications, and around 22,000 reads are correctly classified.
By reducing the threshold by a few points additional correct classifications may be obtained, but the cost in false positive percentage can go up significantly.
The threshold may be changed by modifying the --min_score argument, which applies the threshold to both the front and rear barcode. To have different thresholds for the front and rear barcode, use the --min_score_rear_override argument to change the rear barcode threshold. In that case the --min_score argument will apply to only the front barcode.
How classifications are reported
When barcodes are loaded into Guppy for classification, they are loaded in arrangements. An arrangement consists of either:
- One barcode, when searching for barcodes only at the front of a read.
- A front barcode and a rear barcode, when searching for barcodes at both ends of a read.
Once the classification for a particular read has been determined (by choosing the single highest-scoring barcode alignment), there may be another barcode in the arrangement corresponding to the other end of the read. The score for this barcode is also retrieved and reported, regardless of its classification – this means the entire arrangement is always reported.
For example, if a barcode arrangement is loaded containing barcode01_FWD + barcode01_REV, with barcode01_FWD matching the front of the read with a score of 90 and barcode01_REV matching the rear of the read with a score of 10, then the final reported result will be:
front_barcode: barcode01_FWD
front_score: 90
rear_barcode: barcode01_REV
rear_score: 10
Adding your own barcodes
Guppy fully supports the use of custom barcode sequences. It is recommended that this is accomplished by copying Guppy's existing configuration files and modifying them elsewhere with a text editor.
Barcoding data files
Barcoding data files are contained in Guppy's data folder in the "barcoding" subfolder. You can find this folder in the following locations:
On Linux:
- In /opt/ont/guppy/data if installing from deb or RPM.
- In the data folder in the main Guppy directory if installing from archive.
On OS X/macOS:
- In the data folder in the main Guppy directory.
On Windows:
- In C:\Program Files\Oxford Nanopore\ont-guppy-cpu\data
This folder contains the following subfolders and types of files:
4x4_mismatch_matrix.txt the DNA mismatch matrix for aligning barcodes to sequences
5x5_mismatch_matrix.txt the DNA mismatch matrix for aligning barcodes to sequences including a
'N' mask base
5x5_mismatch_matrix_simple.txt the DNA mismatch matrix for use with dual barcodes.
barcodes_masked.fasta the full list of all barcode and the flanking region mask sequences
lamp_targets.fasta the full list of all LamPORE kit target sequences
configuration.cfg the configuration file containing parameters used in barcode detection
barcoding_arrangements/
barcode_arrs_XXX.toml the arrangement files for specific barcodes
barcoding_dual_arrangements/
barcode_arrs_dual_XXX.toml the arrangement files for specific dual barcodes
lamp_arrangements/
barcode_arrs_lampXXX.toml the arrangement files for specific LamPORE kit configurations
- 4x4_mismatch_matrix.txt: A tab-delimited file containing the mismatch penalties for DNA.
- 5x5_mismatch_matrix.txt: A tab-delimited file containing the mismatch penalties for DNA plus a masking base 'N', which matches against all bases with a score of 90.
- 5x5_mismatch_matrix_simple.txt: A tab-delimited file containing the mismatch penalties for DNA plus a masking base 'N', which matches against all bases with a score of 90. This version of the 5x5 mismatch matrix has been optimised for dual-barcoding arrangements.
- barcoding_arrangements folder: Folder containing barcoding arrangement files.
- barcoding_dual_arrangements folder: Folder containing dual barcoding arrangement files.
- lamp_arrangements folder: Folder containing arrangement files for LamPORE kits.
- example_barcode_arrs_XXX.toml (and example_barcode_arrs_dual_XXX.toml): A .toml formatted arrangement file describing how a particular set of barcode arrangements is configured. It contains the following fields:
[loading_options]
barcodes_filename = [filename]
double_variants_frontrear = [true / false]
[arrangement]
name = [name of the barcoding arrangement]
id_pattern = [barcode id pattern]
compatible_kits = [array of kits]
first_index = [first barcode number to load]
last_index = [last barcode number to load]
kit = [kit name]
normalised_id_pattern = ["barcode_arrangement" summary file pattern]
scoring_function = "MAX"
barcode1_pattern = [pattern to look up front barcode in [barcodes_filename]]
barcode2_pattern = [pattern to look up rear barcode in [barcodes_filename]]
mask1 = [mask name to look up front barcode masking region in [barcodes_filename] (optional)]
mask2 = [mask name to look up rear barcode masking region in [barcodes_filename] (optional)]
barcode_inner1_pattern = [pattern to look up front inner barcode in [barcodes_filename] when dual barcoding (optional)]
barcode_inner2_pattern = [pattern to look up rear inner barcode in [barcodes_filename] when dual barcoding (optional)]
- barcode_arrs_lampXXX.toml: A .toml formatted arrangement file describing how a LamPORE arrangement is configured. These are similar to barcoding arrangement files, but support a slightly different set of fields:
[loading_options]
barcodes_filename = [filename]
lamp_targets_filename = [filename containing sequences which should be used as LAMP targets]
double_variants_frontrear = [true / false]
[arrangement]
name = [name of the lamp arrangement]
id_pattern = [barcode id pattern]
compatible_kits = [array of kits]
first_index = [first barcode number to load]
last_index = [last barcode number to load]
kit = [kit name]
normalised_id_pattern = ["barcode_arrangement" summary file pattern]
scoring_function = "MAX"
barcode1_pattern = [pattern to look up front barcode in [barcodes_filename]]
lamp_masks = [an array of mask names to look up barcode masking regions in [barcodes_filename]]
These sections are dealt with in reverse order.
arrangement
- id_pattern: The barcode ID pattern is used as the base name for each barcode arrangement. It may be modified later depending on what is present in the loading_options section (see below). This pattern should have %0Ni present somewhere in the name (where N is the number of digits to use in the barcode number), as that will be replaced with the barcode number for the arrangement. For example, the pattern NB%03i will be formatted to produce barcode arrangement names such as NB001, NB384, etc. The final arrangement name based on this pattern will be reported in the barcode_full_arrangement field in the barcoding_summary.txt file.
- compatible_kits: A list of kits this set of arrangements is compatible with. These may be selected from the command line to restrict the arrangements that barcodes are matched against.
- first_index: The first integer used when loading barcodes from barcodes_filename (see loading options below). These integers are used to populate the %0Ni parts of the barcode name, normalised_id, barcode1, and barcode2 patterns.
- last_index: The last index used (inclusively) when loading barcodes from barcodes_filename.
- kit: The name reported in the "kit" column of the barcoding summary file.
- normalised_id_pattern: The name reported in the "barcode_arrangement" column of the barcoding summary file. This should contain the %0Ni pattern within it so that the barcode number can be added. This is normally used to report the barcode number without the kit designation.
- scoring_function: The function used to score a barcode arrangement. There are two choices for this, though only "MAX" is currently used:
  - MAX: The barcode arrangement score is the larger of the front and rear scores.
  - ADD: The barcode arrangement score is the sum of the front and rear scores.
- barcode1_pattern: Optional pattern used to look up the front barcode sequences in barcodes_filename. If this field is not present then no front barcodes will be added during the initial barcode loading step (although it is still possible to obtain front barcodes depending on loading options below). Note that the suffix _FWD will be added to this barcode name in the arrangement.
- barcode2_pattern: Optional pattern used to look up the rear barcode sequences in barcodes_filename. Note that the suffix _REV will be added to this barcode name in the arrangement, and the barcode inserted into the arrangement will be the reverse complement of the named barcode specified in barcodes_filename.
- mask1: Optional. Used to look up the front barcode flanking region in barcodes_filename. If this field is used, this masking region will be aligned first, then the barcode1 sequence for each arrangement will be aligned to the section of the read which corresponds to the masked-off region of this sequence (i.e. the section of 'N' bases).
- mask2: Optional. Used to look up the rear barcode flanking region in barcodes_filename. If this field is used, this masking region will be aligned first, then the barcode2 sequence for each arrangement will be aligned to the section of the read which corresponds to the masked-off region of this sequence (i.e. the section of 'N' bases).
- barcode_inner1_pattern: Optional pattern used to look up the front inner barcode sequences in barcodes_filename. Note that the suffix _FWD will be added to this barcode name in the arrangement.
- barcode_inner2_pattern: Optional pattern used to look up the rear inner barcode sequences in barcodes_filename. Note that the suffix _REV will be added to this barcode name in the arrangement, and the barcode inserted into the arrangement will be the reverse complement of the named barcode specified in barcodes_filename.
- lamp_masks: A comma-separated list of patterns to look up barcode masking regions in barcodes_filename. Note that there can be several of these masks, as they may be different for each target. The mask will be used to find a context in the sequence to inspect for FIP barcodes.
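How an index range expands a %0Ni pattern can be checked with a short Python sketch. Python's % operator happens to accept the same printf-style placeholder, so it is used here purely as an illustration; the helper name expand_pattern is hypothetical and this is not how Guppy itself is implemented.

```python
from typing import List


def expand_pattern(pattern: str, first_index: int, last_index: int) -> List[str]:
    """Format the pattern once for every index from first_index to
    last_index, inclusive at both ends."""
    return [pattern % i for i in range(first_index, last_index + 1)]


print(expand_pattern("NB%03i", 1, 3))
print(expand_pattern("barcode%02i", 9, 11))
print("NB%03i" % 384)
```

This gives NB001, NB002, NB003 for the first call and barcode09, barcode10, barcode11 for the second. A two-digit pattern still formats numbers above 99, but the names then no longer have a uniform width, which is why patterns for 384 barcodes use %03i.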
loading options
- barcodes_filename: The name of the FASTA file to load barcodes from. It should be in the data/barcoding folder, or the filename should include a relative path from the data/barcoding folder.
- lamp_targets_filename: The name of the FASTA file to load LamPORE targets from. It should be in the data/barcoding folder, or the filename should include a relative path from the data/barcoding folder. The targets should be named [target_id]:[specific_sequence_id]. This allows multiple sequences to map to the same target ID. Just the target_id will be reported by the detector.
- double_variants_frontrear: For each barcode arrangement, create _var1 and _var2 variants. The _var1 variant will be barcode1 at the front and barcode2 at the rear, and _var2 will be barcode2 at the front and barcode1 at the rear. This effectively adds a complement for each arrangement.
For example:
loading options | barcode names expected in barcodes_filename (assuming both barcode1 and barcode2 patterns are present) | barcode arrangements added to list to test against ("rc" denotes reverse complement) [front_barcode] + [rear_barcode] |
---|---|---|
double_variants_frontrear : false | [barcode1] [barcode2] | [barcode1]_FWD + [barcode2rc]_REV |
double_variants_frontrear : true | [barcode1] [barcode2] | [barcode1]_var1_FWD + [barcode2rc]_var2_REV [barcode2]_var2_FWD + [barcode1rc]_var1_REV |
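The table can be restated as a small Python sketch that builds the (front, rear) sequence pairs for one arrangement. The function names and the example sequences are hypothetical; only the front/rear and reverse-complement logic is taken from the table.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_arrangements(barcode1: str, barcode2: str,
                       double_variants_frontrear: bool) -> List[Tuple[str, str]]:
    """Return (front_sequence, rear_sequence) pairs to test against."""
    # Base arrangement: barcode1 at the front, reverse complement of barcode2 at the rear.
    pairs = [(barcode1, reverse_complement(barcode2))]
    if double_variants_frontrear:
        # Second variant: the two barcodes swap ends.
        pairs.append((barcode2, reverse_complement(barcode1)))
    return pairs


print(build_arrangements("AACG", "TTGA", False))
print(build_arrangements("AACG", "TTGA", True))
```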
Quick start
Assume we have created a set of custom barcodes structured like this:
[barcodeXX---][sample sequence top strand---][barcodeXX_rc]
[barcodeXX_rc][sample sequence bottom strand][barcodeXX---]
There is one type of barcode, it can be attached on both the top and bottom strand, and it has a reverse complement present on the opposite strand.
Furthermore, assume we have two different barcodes to add. We will call these two barcodes CUST01 and CUST02.
**Step 1: Copy the Guppy data folder to a different location** See "Barcoding data files" above. Assuming a Linux deb installation, this could look something like this:
cp -r /opt/ont/guppy/data ~/mydata
**Step 2: Create a new arrangements file and a new FASTA file to store our custom barcodes.** Copy one of the arrangements in the barcoding data folder to store the new arrangement:
cp ~/mydata/barcoding/barcoding_arrangements/barcode_arrs_nb24.toml ~/mydata/barcoding/barcoding_arrangements/barcode_arrs_cust2.toml
And create a new FASTA file for the custom barcodes:
touch ~/mydata/barcoding/custom_barcodes.fasta
**Step 3: Edit the new arrangement file to include information on the new barcode** We have one type of barcode, and arrangements of that barcode will include the barcode at the front and the reverse complement of the barcode at the rear. We do not want to set `double_variants_frontrear` because we have only one variant of our barcode.
The configuration file barcode_arrs_cust2.toml will look like this:
[loading_options]
barcodes_filename = "custom_barcodes.fasta"
double_variants_frontrear = false
[arrangement]
name = "barcode_arrs_cust2"
id_pattern = "CUST%02i"
compatible_kits = ["MY-CUSTOM-BARCODES"]
first_index = 1
last_index = 2
kit = "CUST"
normalised_id_pattern = "barcode%02i"
scoring_function = "MAX"
barcode1_pattern = "CUST%02i"
barcode2_pattern = "CUST%02i"
Note that we set both `barcode1_pattern` and `barcode2_pattern` to the same value. This means:
- We are going to search for barcodes at the rear of the strand (because barcode2 is set).
- The rear barcodes are based on the same barcodes used in the front.
Note: If the range of barcode indices includes values greater than 99, ensure that sufficient digits are specified in each of the pattern fields. For example, if you have barcodes from 1 to 384, the arrangements section would contain pattern fields containing %03i, like this:
[arrangement]
name = "barcode_arrs_cust2"
id_pattern = "CUST%03i"
compatible_kits = ["MY-CUSTOM-BARCODES"]
first_index = 1
last_index = 384
kit = "CUST"
normalised_id_pattern = "barcode%03i"
scoring_function = "MAX"
barcode1_pattern = "CUST%03i"
barcode2_pattern = "CUST%03i"
**Step 4: Add the new barcodes to the FASTA file** Open the `custom_barcodes.fasta` file created during step 2 and add your barcode sequences in with names matching the `CUST%02i` pattern used in the arrangement file:
>CUST01
AAAAAAAGCTCGCTCGCTCGAGATTTTTTT
>CUST02
AAAAAAACGGTAAATTGGCATTATTTTTTT
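Before running the barcoder, it can be worth checking that every name implied by the arrangement file is actually present in the FASTA file. The following Python sketch does this for FASTA text held in a string; the helper names are hypothetical and the check is a convenience for this walkthrough, not something Guppy provides.

```python
from typing import List, Set


def fasta_names(fasta_text: str) -> Set[str]:
    """Collect the record names (first word after '>') from FASTA text."""
    return {line[1:].split()[0]
            for line in fasta_text.splitlines()
            if line.startswith(">") and line[1:].strip()}


def missing_barcodes(fasta_text: str, pattern: str,
                     first_index: int, last_index: int) -> List[str]:
    """List the expected barcode names that the FASTA text does not define."""
    names = fasta_names(fasta_text)
    expected = [pattern % i for i in range(first_index, last_index + 1)]
    return [name for name in expected if name not in names]


custom_fasta = """>CUST01
AAAAAAAGCTCGCTCGCTCGAGATTTTTTT
>CUST02
AAAAAAACGGTAAATTGGCATTATTTTTTT
"""
print(missing_barcodes(custom_fasta, "CUST%02i", 1, 2))  # nothing missing
print(missing_barcodes(custom_fasta, "CUST%02i", 1, 3))  # CUST03 not defined
```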
**Step 5: Run `guppy_barcoder` with the new barcodes**
guppy_barcoder \
--input_path [path_to_input_fastq_files] \
--save_path [path_to_output_directory] \
--data_path ~/mydata/barcoding \
--barcode_kits MY-CUSTOM-BARCODES
Context-specific barcode specification
When specifying a LAMP arrangement, context-specific barcodes are supported. For example, consider a LAMP configuration file as follows:
[loading_options]
barcodes_filename = "barcodes_masked.fasta"
lamp_targets_filename = "lamp_targets.fasta"
double_variants_frontrear = true
[arrangement]
name = "barcode_arrs_lamp_example"
id_pattern = "LAMP%02i"
compatible_kits = ["MY_CUSTOM_LAMP_KIT"]
first_index = 1
last_index = 8
kit = "LAMP"
normalised_id_pattern = "FIP%02i"
barcode1_pattern = "LM%02i"
lamp_masks = ["CONTEXT1","CONTEXT2","CONTEXT3"]
Normally, this kit would be expanded to use barcodes LM01 to LM08 from the barcodes_masked.fasta file, no matter which context is being used. However, there are situations where having a context-specific barcode may be desirable. If a specific barcode is required to replace LM01 for CONTEXT2, it can be added to barcodes_masked.fasta as follows:
>LM01
ACGTATCTCA
15. Alignment
Alignment overview
The Guppy toolchain provides the guppy_aligner executable to allow users to perform reference genome alignment on basecalled reads. Alignment is performed against the supplied reference via an integrated minimap2 aligner, full details of which can be found at https://github.com/lh3/minimap2. To perform alignment, invoke the Guppy aligner with the minimum required parameters:
guppy_aligner --input_path <folder containing input files> --save_path <output folder> --align_ref <reference FASTA>
The input path will be searched for input FASTQ, FASTA, SAM and BAM files to perform alignment on. The --align_ref argument is used to specify the reference genome. Sequences from a SAM or BAM file that have been stored as the reverse complement will be reverse-complemented before alignment in order to ensure the same results are produced when realigning the output file with the same options. When performing alignment, the Guppy aligner creates the following files in the output folder:
- alignment_summary.txt: Contains information about the best-quality alignment result for each read, such as alignment start, end, accuracy, etc. See "Summary file contents" in the "Input and output files" section for details.
- read_processor_log-<date and time>.log: A log file with information about the execution run.
- .sam or .bam: A SAM or BAM file is produced for each corresponding input file located in the input folder. If a successful alignment is found which passes the coverage filter, the SAM/BAM file will contain a CIGAR string representing the alignment. The default alignment coverage required to consider a result successful is 60%. If BAM file output is enabled, BAM files will be sorted by reference ID and then the leftmost coordinate.
Guppy aligner supports the following optional parameters:
- Version (--version): Prints the version of Guppy aligner.
- Help (-h or --help): Print a help message describing usage and all the available parameters.
- Quiet mode (-z or --quiet): This option prevents the Guppy aligner from outputting anything to stdout. Stdout is short for “standard output” and is the default location to which a running program sends its output. For a command line executable, stdout will typically be sent to the terminal window from which the program was run.
- Verbose logging (--verbose_logs): Flag to enable verbose logging (outputting a verbose log file, in addition to the standard log files, which contains detailed information about the application). Off by default.
- Worker thread count (-t or --worker_threads): The number of worker threads to spawn for the aligner to use. Increasing this number will allow Guppy aligner to make better use of multi-core CPU systems, but may impact overall system performance.
- Recursive (-r or --recursive): Search through all subfolders contained in the --input_path value, and perform alignment on any .fastq, .fq, .fasta or .fa files found in them.
- BAM file output (--bam_out): This flag enables BAM file output. If the flag is not present, guppy_aligner defaults to SAM output.
- BAM file indexing (--index): This flag enables BAM file indexing. If the flag is present, guppy_aligner sorts the BAM file output and generates the BAI index file. This flag requires that --bam_out is also set. Disabled by default.
- Minimap options (--minimap_opt_string): This flag allows you to specify alignment options for the inner minimap2 alignment algorithm, using the same flags and format supported by the minimap2 program. See "Supported minimap2 options" for the list of supported flags.
- Max records per output file (-q or --records_per_file): The maximum number of records to put in a single SAM or BAM file. Set this to zero to allow unlimited records per file. Note: setting to zero will have a performance impact due to holding all the records in memory until writing to disk. The default value is 4000.
- Perform read filtering based on alignment (--alignment_filtering): This flag allows reads to be filtered based on their alignment status. Reads with alignment results will be written to the pass folder, and unaligned reads to the fail folder.
- BED file (--bed_file): Path to a .bed file containing areas of interest in the reference genome. The emitted alignment_summary file will contain a column of alignment_bed_hits for the regions of interest.
- Alignment type (--align_type): Specify whether you want full or coarse alignment. Valid values are auto, full and coarse.
- Progress stats reporting frequency (--progress_stats_frequency): Frequency in seconds at which to report progress statistics. If supplied, this replaces the default progress display.
- Trace category logs (--trace_category_logs): Enable trace logs; takes a list of strings with the desired names.
- Trace domains config (--trace_domains_config): Configuration file containing the list of trace domains to include in verbose logging (if enabled).
- Disable pings (--disable_pings): Flag to disable sending any telemetry information to Oxford Nanopore Technologies. See the "Ping information" section for a summary of what is included in the Guppy telemetry.
- Telemetry URL (--ping_url): Override the default URL for sending telemetry pings.
- Ping segment duration (--ping_segment_duration): Duration in minutes of each ping segment.
If the aligner reports more than one possible alignment, only the best one is output. An alignment that covers less than 60% of the read or of the reference will be rejected.
Index files produced by the bwa aligner should also work as an align_ref but are not explicitly supported.
By default, the integrated minimap2 aligner is run with no additional arguments supplied to it, so the default values are used for all alignments unless options are passed via --minimap_opt_string.
The minimap library integration Oxford Nanopore uses is available on our GitHub page here: http://github.com/nanoporetech/ont_minimap2
For more explanation of alignment-related columns output in the sequencing summary file, please refer to the Input and output files section of this protocol.
Supported minimap2 options
To list the flags currently supported by --minimap_opt_string, run:
<guppy_executable> --minimap_opt_string --help
In the list of flags below, NUM represents an integer in human-readable format, e.g. 4000 can be specified as 4k.
The default value for each option is reported in square brackets after its description.
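The NUM shorthand can be illustrated with a small Python parser. This is a sketch only: the function name parse_num is hypothetical, and it assumes the suffixes k, M and G (in either case) stand for decimal multiples of one thousand, one million and one billion.

```python
def parse_num(text: str) -> int:
    """Convert a human-readable count such as '4k' or '500M' to an integer."""
    multipliers = {"k": 10**3, "m": 10**6, "g": 10**9}
    if text and text[-1].lower() in multipliers:
        # Multiply the numeric part by the factor for its suffix.
        return int(float(text[:-1]) * multipliers[text[-1].lower()])
    return int(text)


for value in ("4k", "200k", "500M", "4G", "5000"):
    print(value, "->", parse_num(value))
```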
Indexing flags:
-H [ --hpc ]: use homopolymer-compressed k-mer
-k [ --kmer-size ] INT: k-mer size (no larger than 28) [15]
-w [ --window-size ] INT: minimiser window size [10]
-I [ --batch-size ] NUM: split index for every ~NUM input bases [4G]
Mapping flags:
-f [ --mid-occ-frac ] FLOAT: filter out top FLOAT fraction of repetitive minimisers [0.0002]
-g [ --max-gap ] NUM: stop chain enlongation if there are no minimisers in INT-bp [5000]
-G [ --max-intron-len ] NUM: max intron length (effective with -xsplice + changing -r) [200k]
-F [ --max-frag-len ] NUM: max fragment length (effective with -xsr or in the fragment mode) [0]
-r [ --bandwidth ] NUM[,NUM]: chaining/alignment bandwidth and long-join bandwidth [500,20000]
-n [ --min-count ] INT: minimal number of minimisers on a chain [3]
-m [ --min-chain-score ] INT: minimal chaining score (matching bases minus log gap penalty) [40]
-X [ --skip-self-dual ]: skip self and dual mappings (for the all-vs-all mode)
-p [ --pri-ratio ] FLOAT: min secondary-to-primary score ratio [0.8]
-N [ --best-n ] INT: retain at most INT secondary alignments [5]
Alignment flags:
-A [ --match ] INT: matching score [2]
-B [ --mismatch ] INT: mismatch penalty (larger value for lower divergence) [4]
-O [ --gap-open ] INT[,INT]: gap open penalty [4,24]
-E [ --gap-extension ] INT[,INT]: gap extension penalty; a k-long gap costs min{O1+kE1,O2+kE2} [2,1]
-z [ --z-drop ] INT[,INT]: Z-drop score and inversion Z-drop score [400,200]
-s [ --min-dp-score ] INT: minimal peak DP alignment score [80]
-u [ --gt-ag ] CHAR: how to find GT-AG. f:transcript strand, b:both strands, n:do not match GT-AG [n]
Input/Output flags:
-L [ --long-cigar ]: write CIGAR with >65535 ops at the CG tag
-c [ --cg ]: output CIGAR in PAF
--cs arg: output the cs tag; STR is 'short' (if absent) or 'long' [none]
--MD: output the MD tag
--eqx: write =/X CIGAR operators
-Y [ --softclip ]: use soft clipping for supplementary alignments
-t [ --threads ] INT: number of threads [1]
-K [ --mb-size ] NUM: minibatch size for mapping [500M]
-V [ --version ]: show version number
Preset flags:
-x [ --preset ] STR: preset (always applied before other options; see man minimap2.1 for details) []
- map-pb/map-ont - PacBio CLR/Nanopore vs reference mapping
- map-hifi - PacBio HiFi reads vs reference mapping
- ava-pb/ava-ont - PacBio/Nanopore read overlap
- asm5/asm10/asm20 - asm-to-ref mapping, for ~0.1/1/5% sequence divergence
- splice/splice:hq - long-read/Pacbio-CCS spliced alignment
- sr - genomic short-read mapping
Unsupported flags:
-d [ --dump-index ] FILE
dump index to FILE []
-a [ --sam ]
output in the SAM format (PAF by default)
-o [ --output ] FILE
output alignments to FILE [stdout]
-R [ --rg ] STR
SAM read group line in a format like '@RG\tID:foo\tSM:bar' []
See man minimap2.1 for a detailed description of these and other advanced command-line options.
Alignment index files
When aligning to large references (≥100 Mb) it is recommended to prepare an index file in advance for performance (to avoid generating the index during each run).
To create a minimap2 index file:
- Download and install the minimap2 tool from: https://github.com/lh3/minimap2
- Run the command:
minimap2 -I 32G -d <output.idx> <input.fasta>
-I 32G sets the number of reference bases that are loaded before the index is split into shards. Set it to a value larger than your reference length.
Sharding: By default, minimap2 splits references greater than 4 Gb into 'shards' within the index in order to reduce RAM usage. The strand is then aligned separately against each reference shard, which can lead to Guppy returning an incorrect alignment if the strand aligns to a reference that is not within the first shard.
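As a rough illustration of the sizing rule above, the following Python sketch counts the bases in a FASTA reference and compares the total with a chosen -I value. The function names are hypothetical, and treating the K/M/G suffixes as decimal multipliers is an assumption; check the minimap2 manual for the exact interpretation.

```python
import io


def reference_length(fasta_handle):
    """Count sequence bases in a FASTA stream, ignoring headers and blank lines."""
    total = 0
    for line in fasta_handle:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def parse_size(text):
    """Convert a size string such as '32G' or '500M' into a number of bases.

    Assumption: K, M and G are decimal multipliers (1e3, 1e6, 1e9).
    """
    units = {"K": 10**3, "M": 10**6, "G": 10**9}
    suffix = text[-1].upper()
    if suffix in units:
        return int(float(text[:-1]) * units[suffix])
    return int(text)


def index_will_shard(fasta_handle, batch_size="32G"):
    """Return True if the reference is longer than the -I batch size."""
    return reference_length(fasta_handle) > parse_size(batch_size)


if __name__ == "__main__":
    demo = io.StringIO(">chr1\nACGTACGT\n>chr2\nTTGA\n")
    print(index_will_shard(demo, "32G"))
```

If the check reports that the reference would be sharded, a larger -I value is the first thing to try when building the index.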
File conversion
The Guppy toolchain provides the bam_convert executable to convert files between the SAM and BAM formats. To convert a file, invoke bam_convert with the minimum required parameters:
bam_convert --input [input file name] --save [output filename]
`bam_convert` can also be used to merge multiple files into one by specifying a directory containing one or more BAM files and using the `--merge` flag:
bam_convert --input [input file path] --save [output filename] --merge
The input directory will be searched for BAM files, and the contents merged into a single output file.
bam_convert also supports the following optional parameters:
- Help (-h or --help): Print a help message describing usage and all the available parameters.
- Sort (--sort): Sort the records in the exported file by reference ID and then the leftmost coordinate.
- Recursive (-r or --recursive): When performing a merge, search the input directory recursively for input files.
- Index (--index): Generate an index file for the output BAM file.
- Merge header (--merge_headers): Regenerate IDs for program group and read group tags to prevent clashes. If this option is omitted, bam_convert will use only the headers from the first file to be merged. This option is only valid when --merge is also present.
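When bam_convert is driven from a script, the rule that --merge_headers requires --merge is easy to overlook. The Python sketch below assembles an argument list from the flags described above and rejects that invalid combination before anything is launched. The helper name bam_convert_args is hypothetical and is not part of the Guppy toolkit.

```python
def bam_convert_args(input_path, output_file, merge=False, sort=False,
                     recursive=False, index=False, merge_headers=False):
    """Build an argument list for bam_convert from the documented flags."""
    if merge_headers and not merge:
        # The documentation states that --merge_headers is only valid with --merge.
        raise ValueError("--merge_headers is only valid together with --merge")
    args = ["bam_convert", "--input", input_path, "--save", output_file]
    if merge:
        args.append("--merge")
    if recursive:
        args.append("--recursive")
    if sort:
        args.append("--sort")
    if index:
        args.append("--index")
    if merge_headers:
        args.append("--merge_headers")
    return args


if __name__ == "__main__":
    print(" ".join(bam_convert_args("bam_dir", "merged.bam",
                                    merge=True, merge_headers=True)))
```

The returned list can be passed to subprocess.run once bam_convert is on the path.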
16. Calibration Strand detection
Calibration Strand detection
The DNA calibration strand (DCS) is a 3.6 kb amplicon of the Lambda phage genome. The RNA calibration strand (RCS) is the 1.4 kb enolase 2 (ENO2) gene transcript. Calibration strands are added to DNA/RNA samples during library preparation, and are processed and included in the library.
Detection of calibration strands by the basecaller can be used to assess how well basecalling has worked, and to confirm that the sample preparation was successful.
If --calib_detect is enabled, Guppy will attempt to identify and analyse any calibration strands which have been basecalled. Specifically, Guppy detects the calibration strand that is associated with the basecall configuration being used (e.g. in experiments running an RNA configuration it looks for the RNA CS, and for DNA configurations it looks for DNA CS). It does this by first checking to see if a basecalled strand is approximately the correct length (controlled by --calib_min_sequence_length and --calib_max_sequence_length), and it then aligns the basecalled strand to the calibration strand reference. Successfully aligned reads are placed in the calibration_strands folder, and alignment accuracy and identity metrics are added to the sequencing_summary.txt file. This can be a useful way of evaluating the quality of a run.
17. Modified base calling
Modified base calling
It is now possible to use Guppy to identify certain types of modified bases. This requires the use of a specific basecalling model which is trained to identify one or more types of modification. Configuration files for these new models can generally be identified by the inclusion of "modbases" in their name (e.g. dna_r9.4.1_450bps_modbases_5mc_hac.cfg). The tokens following "modbases" will generally provide information about the type of modifications that will be looked for. For example, "5mc_cg" indicates that it will look for 5mC modifications in a CG context.
Modified base call results can currently be stored in BAM files. BAM file output (--bam_out) will automatically be enabled if a modified basecall model is detected in the configuration. It is also possible to extract the raw modified base information from a called read via the Guppy client API in C++ or Python. Note that to get back modified base information via the Guppy client API, move and trace data must be enabled (see the API documentation for more details).
Raw modified base table format
The raw modified table (as available via the client API) is a two-dimensional array, where each row of the table relates to the corresponding base in the associated canonical sequence. For example, the first row of the table (row 0) will correspond to the first base in the canonical basecall sequence.
Each row contains a number of columns equal to the number of canonical bases (four) plus the number of modifications present in the model. The columns list the bases in alphabetical order (ACGT for DNA, ACGU for RNA), and each base is immediately followed by columns corresponding to the modifications that apply to that particular base. For example, with a model that identified modifications for 6mA and 5mC, the column ordering would be A 6mA C 5mC G T.
Each table row describes the likelihood that the base called at that position is either a canonical one (i.e. a base that the model considers to be "unmodified") or one of the modifications contained within the model. The contents of the table are integers in the range of 0-255, which represent likelihoods in the range of 0-100% (storing these values as integers allows us to reduce .fast5 file size). For example, a likelihood of 100% corresponds to a table entry of 255. Within a given row the table entries for a particular base will sum to 100%.
Following from our previous example with 6mA and 5mC, you might see a table with row entries like these:
[63, 192, 0, 0, 0, 0],
[0, 0, 255, 0, 0, 0],
[0, 0, 0, 0, 255, 0],
[0, 0, 0, 0, 0, 255],
This would mean that:
- An A was called for the first base, and the likelihood that it is a canonical A is ~25% (63 / 255), and the likelihood that it is 6mA is ~75% (192 / 255).
- A C was called for the second base, and the likelihood that it is a canonical C is 100% (255 / 255), with no chance (0 / 255) of it being a 5mC.
- A G and then a T were called for the third and fourth bases. The likelihood that they are canonical bases is 100% (255 / 255 -- this should always be the case, as the model does not include any modification states for G or T).
Note that for the current modified base models in Guppy, the likelihoods will all be 0 for the bases that were not called, since the modification detection is performed after determining the called sequence. This was not the case for previous versions of the software, which used a different method to determine the probabilities.
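To make the table layout concrete, here is a minimal Python sketch that builds the column order for the 6mA/5mC example and converts one row of 0-255 integers into probabilities. The function names and the dictionary used to describe the modifications are illustrative assumptions; they are not part of the Guppy client API.

```python
def column_labels(canonical="ACGT", mods=None):
    """Build the column order: each canonical base followed by its modifications."""
    mods = mods or {}
    labels = []
    for base in canonical:
        labels.append(base)
        labels.extend(mods.get(base, []))
    return labels


def decode_row(row, labels):
    """Return {label: probability} for the non-zero entries of one table row."""
    return {label: value / 255 for label, value in zip(labels, row) if value}


if __name__ == "__main__":
    # Column order for a model with 6mA and 5mC: A 6mA C 5mC G T
    labels = column_labels(mods={"A": ["6mA"], "C": ["5mC"]})
    table = [
        [63, 192, 0, 0, 0, 0],
        [0, 0, 255, 0, 0, 0],
    ]
    for position, row in enumerate(table):
        print(position, decode_row(row, labels))
```

Dividing by 255 follows the convention stated above, where a table entry of 255 corresponds to a likelihood of 100%.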
BAM file modified base format
If a modified base call model is selected, Guppy will emit BAM files as if the --bam_out flag had been set. Modified bases will be encoded into the BAM modified base format in the metadata tags MM and ML. For configurations that only look for modifications within a specific context (which is currently the case for all of our supported modified base configurations), a ? will be used in the MM tag to indicate that the modification probability is unknown for any bases of the specified type that were skipped, and results will only be output for bases that match the context. If any context-free modification configurations are used, then the ? will not appear in the tag, and only instances of the base that exceed the specified threshold will be output. For more information on the BAM modified base format, see the "Base Modifications" section of the SAM optional fields specification here: https://samtools.github.io/hts-specs/SAMtags.pdf
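The following Python sketch shows one way to map MM and ML values back onto read positions, following the layout described in the SAM optional fields specification. It is a simplified illustration under several assumptions: the read is in its original sequenced orientation, only '+' strand entries are present, and each MM entry carries a single modification code. Reporting the midpoint of each ML probability bin is also a choice made here, and the function name parse_mm_ml is hypothetical.

```python
def parse_mm_ml(seq, mm, ml):
    """Map MM/ML tag values onto 0-based positions in seq.

    seq: read sequence in its original orientation
    mm:  MM tag string, e.g. "C+m?,1,0;"
    ml:  list of ML integers (0-255), in the same order as the MM entries
    Returns a list of (position, base, modification code, probability).
    """
    calls = []
    ml_iter = iter(ml)
    for entry in filter(None, mm.split(";")):
        fields = entry.split(",")
        header, deltas = fields[0], [int(x) for x in fields[1:]]
        base, strand = header[0], header[1]
        code = header[2:].rstrip("?.")  # drop the optional '?' or '.' flag
        if strand != "+":
            raise ValueError("only '+' strand entries are handled in this sketch")
        # Positions in seq of every occurrence of the fundamental base.
        occurrences = [i for i, b in enumerate(seq) if base == "N" or b == base]
        index = -1
        for delta in deltas:
            # Each number is how many occurrences of the base to skip.
            index += delta + 1
            # ML value v covers the range v/256 to (v+1)/256; use the midpoint.
            probability = (next(ml_iter) + 0.5) / 256
            calls.append((occurrences[index], base, code, probability))
    return calls


if __name__ == "__main__":
    for call in parse_mm_ml("ACGTCCGA", "C+m?,1,0;", [128, 255]):
        print(call)
```

For reads aligned to the reverse strand, or for entries listing several modification codes, the specification describes additional handling that this sketch does not attempt.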
18. Duplex basecalling
Duplex basecalling
Note: We recommend using our Dorado basecaller to perform duplex basecalling. For more information, please see the Dorado page on Github or the "Basecalling Kit 14 duplex data" in the Kit 14 Sequencing and Duplex Basecalling info sheet.
The Guppy toolkit now supports performing duplex basecalling, where the template and complement strands of a read can have their basecall data combined to provide a more accurate sequence. To perform duplex basecalling, the template and complement read pairs must first be identified.
The guppy_basecaller_duplex tool currently provides two options for pairing reads:
- from_pair_list will instruct the Guppy duplex basecaller to use a text file containing read information indicating the source reads to be paired. This file may contain either two or eight whitespace-separated columns per line. The first two columns are the read ids of the reads to be duplexed. If the two reads are the results of a read being split, additional pairs of columns should be present specifying the parent read id, start time and duration (in seconds) of the read segments.
- from_1d_summary will instruct the Guppy duplex basecaller to read in a previously-generated Guppy 1D basecall summary file. This file will be used to identify pairs of reads which were sequenced through the same flow cell and channel in rapid succession, marking them as potential pairs. Split reads are handled automatically.
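To illustrate the idea behind pairing reads from the same channel in rapid succession, the Python sketch below scans summary rows and writes a two-column pair list. It is not Guppy's pairing algorithm. The column names (read_id, channel, start_time, duration), the max_gap threshold and both function names are assumptions made for this example, and all rows are assumed to come from a single flow cell.

```python
import io


def candidate_pairs(rows, max_gap=2.0):
    """Pair consecutive reads on a channel when the second starts soon after the first ends.

    rows: iterable of dicts with read_id, channel, start_time and duration
          (times in seconds), e.g. from csv.DictReader(handle, delimiter="\t").
    max_gap: hypothetical threshold in seconds between the two reads.
    """
    by_channel = {}
    for row in rows:
        by_channel.setdefault(row["channel"], []).append(row)
    pairs = []
    for reads in by_channel.values():
        reads.sort(key=lambda r: float(r["start_time"]))
        for first, second in zip(reads, reads[1:]):
            end_of_first = float(first["start_time"]) + float(first["duration"])
            gap = float(second["start_time"]) - end_of_first
            if 0 <= gap <= max_gap:
                pairs.append((first["read_id"], second["read_id"]))
    return pairs


def write_pair_list(pairs, handle):
    """Write one whitespace-separated pair of read ids per line."""
    for template_id, complement_id in pairs:
        handle.write(f"{template_id} {complement_id}\n")


if __name__ == "__main__":
    rows = [
        {"read_id": "a", "channel": "1", "start_time": "0.0", "duration": "10.0"},
        {"read_id": "b", "channel": "1", "start_time": "10.5", "duration": "9.0"},
    ]
    out = io.StringIO()
    write_pair_list(candidate_pairs(rows), out)
    print(out.getvalue(), end="")
```

This simple scan can place the same read in two consecutive candidate pairs, so a real workflow would need a further filtering step before basecalling.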
The guppy_basecaller_duplex executable can be launched as follows:
guppy_basecaller_duplex --input_path <path to reads> --save_path <output folder> -x "cuda:0" --config dna_r10.4.1_e8.2_400bps_sup.cfg --duplex_pairing_mode from_pair_list --duplex_pairing_file <text pair file>
Note that duplex basecalling is very resource-intensive (especially when using the highest accuracy models), so it is strongly recommended to use GPU mode basecalling if possible.
For further information on duplex basecalling, please see our Duplex Tools page on Github.
Additional arguments
In addition to the arguments supported by the 1D basecaller, guppy_basecaller_duplex supports these additional arguments:
- duplex_pairing_mode: The read pairing mode to use for duplex basecalling. Must be one of 'from_1d_summary' or 'from_pair_list'. This argument must be specified.
- duplex_pairing_file: The input filename to use for duplex pairing. Must be a list of pairs of read ids for from_pair_list pairing, or a Guppy sequencing_summary file for from_1d_summary pairing. This argument must be specified.
Note that when performing duplex basecalling, the sequencing summary will have different columns available. The following columns will have different information to their meaning in 1D basecalls:
- read_id: The uuid that uniquely identifies the template source strand of this duplex read.
- filename: The name of the input read file which the template read came from.
The following column will be added:
- duplex_pair_read_id: The uuid that uniquely identifies the complement source strand of this duplex read.
Duplex basecalling is still in prototype support in Guppy, and there are some limitations to be aware of:
There is a maximum read size which is supported for duplex calling on GPU, based on the available device memory. It can be controlled by setting the --chunks_per_runner option for the duplex basecaller. To obtain best runtime performance, it is currently recommended to use the highest possible setting for --chunks_per_runner which the device can support. Here are some recommendations - these will need to be adjusted down by the user if they are doing other work on the GPU (such as using Guppy for barcoding):
Available GPU memory | --chunks_per_runner setting | Approximate maximum duplex read length |
---|---|---|
40 GB (e.g. A100) | 1200 | 400 kb |
32 GB (e.g. GV100) | 900 | 300 kb |
16 GB (e.g. V100) | 450 | 150 kb |
12 GB (e.g. GTX 1080 Ti) | 320 | 106 kb |
This limit on read length will be removed in future releases.
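For scripted runs, the recommendations in the table can be encoded as a small lookup. The Python sketch below returns the row for the largest listed memory size that fits the available GPU memory. The function name and the choice to return nothing below 12 GB are assumptions; the table gives no guidance for smaller devices.

```python
# (GPU memory in GB, --chunks_per_runner, approximate maximum duplex read length in kb)
DUPLEX_GPU_TABLE = [
    (40, 1200, 400),
    (32, 900, 300),
    (16, 450, 150),
    (12, 320, 106),
]


def recommend_chunks_per_runner(gpu_memory_gb):
    """Return (chunks_per_runner, max_read_kb) for the largest table row that fits.

    Returns None when the device has less memory than the smallest listed row.
    """
    for memory_gb, chunks, max_read_kb in DUPLEX_GPU_TABLE:
        if gpu_memory_gb >= memory_gb:
            return chunks, max_read_kb
    return None


if __name__ == "__main__":
    print(recommend_chunks_per_runner(24))
```

As noted above, the returned value is an upper starting point and should be lowered if other work shares the GPU.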
Duplex pipeline
ont_guppy_duplex_pipeline is a Python module that performs the sequence of processes required for duplex basecalling with Guppy. It is available on PyPI and can be installed via pip:
pip install ont-guppy-duplex-pipeline
The duplex pipeline comprises the following steps:
- (Optional) simplex (1D) basecalling.
- Identification of duplex pairs in the simplex basecall results.
- (Optional) duplex basecalling of those pairs.
- (Optional) simplex basecalling of all reads that were not part of a duplex pair, using the same configuration as the duplex basecalling.
To process reads with the duplex pipeline, call guppy_duplex with the required parameters:
guppy_duplex -i <read_folder> -s <output_folder>
The guppy_duplex pipeline also supports a number of optional parameters in addition to those required above.
Executables
- Path to the basecaller executable (--basecaller_exe): If this is not set, the pipeline assumes that guppy_basecaller is already in the path.
- Path to the duplex basecaller executable (--duplex_basecaller_exe): If this is not set, the pipeline assumes that guppy_basecaller_duplex is already in the path.
Input/output
- Recursive (-r or --recursive): Search through all subfolders contained in the --input_path value, and basecall any files found in them.
Processing steps
- Skip simplex (--skip_simplex): Skip the initial simplex basecall step. The pipeline will assume that the sequencing_summary.txt file exists.
- Skip duplex (--skip_duplex): Skip the duplex basecall step.
- Non-duplex reads (--call_non_duplex_reads): When specified, this flag runs an additional basecall (using the same configuration as the duplex basecall) on reads that did not participate in a duplex pair. As the duplex step is typically performed using a higher-accuracy configuration than the simplex step, this provides high-accuracy basecalls of the non-duplex reads as part of the pipeline.
- Read splitting (--do_read_splitting): Perform read splitting during the initial simplex basecall to separate potentially concatenated reads.
Configurations
- Simplex configuration (--simplex_config): Sets the configuration to use for the initial simplex basecall. This will typically be a "fast" model. Defaults to dna_r10.4.1_e8.2_400bps_fast.cfg.
- Duplex configuration (--duplex_config): Sets the configuration to use for the duplex basecall. This will typically be a "hac" or "sup" model. This configuration will also be used to call the non-duplex reads if --call_non_duplex_reads is set. Defaults to dna_r10.4.1_e8.2_400bps_sup.cfg.
Optimisation
- GPU device (-d or --device): Specify the CUDA-enabled GPU to use to perform basecalling. See the Optimisation section under "Guppy features, settings and analysis" for more details.
- Duplex chunks per runner (--duplex_chunks_per_runner): Passed to the Guppy duplex basecaller when performing duplex basecalling. Decrease this value in case of out-of-memory errors.
Other
- Disable logging (--disable_logging): Turns off logging.
19. FAQ
FAQ and common issues
My basecalling speed is a lot slower than it was the last time I ran Guppy.
This is most likely because the model you are using is different. The default configuration for Guppy is to use High Accuracy models, and these will be quite a lot slower than the "fast" models that Guppy used to have as a default. See the above section Setting up a run: configurations and parameters.
When I run the GPU version of Guppy I get the message error while loading shared libraries: libcuda.so.1: cannot open shared object file.
This will happen when NVIDIA GPU drivers for Guppy have not been installed, or have been installed incorrectly. (Re)installing GPU driver packages should solve this.
I get slightly different results from Guppy when I run on different platforms or with different GPUs.
Guppy should provide completely deterministic and repeatable results for a given platform, operating system, GPU, and GPU driver, but its output may be slightly different if any of those things change. The overall basecall results should be very similar, and in most cases completely identical.
20. Troubleshooting
Memory usage for Guppy
If the memory requirements for Guppy exceed available memory on the host machine, or if other computationally-intensive work is performed while Guppy is running, then Guppy may run out of memory and crash. If Guppy fails for this reason the cause may not be directly apparent - the application may either simply crash or it may return a "segmentation fault" or "killed" error. In such cases, either the number of threads should be reduced or other computationally-intensive tasks should be stopped.
CUDA_ERROR_OUT_OF_MEMORY, CUBLAS_STATUS_ALLOC_FAILED, or CUBLAS_STATUS_EXECUTION_FAILED
If GPU basecalling is enabled, and available memory on the GPU device is exceeded, Guppy may exit reporting an error of type CUDA_ERROR_OUT_OF_MEMORY, CUBLAS_STATUS_ALLOC_FAILED, or CUBLAS_STATUS_EXECUTION_FAILED. If this happens, there are two possible fixes:
- Limit the amount of GPU memory that Guppy uses by appending a colon followed by a percentage to the device string (-x or --device) passed to Guppy. Start with 50%, e.g. using -x cuda:0:50% instead of -x cuda:0, and, if this succeeds, increase the percentage until Guppy fails to run.
- Reduce the number of runners or the number of chunks per runner to allow Guppy to run. See the section CPU/GPU basecalling usage above about how to estimate the amount of memory to use. When in doubt, setting:
--num_callers 1 --gpu_runners_per_device 1 --chunks_per_runner 1
should result in very low GPU memory use (and correspondingly low basecalling speed), and you can experiment with increasing those numbers until running out of GPU memory again.
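The first fix, raising the memory percentage step by step, can be sketched in Python as follows. The device string format follows the cuda:0:50% example above. The runs_ok callable is a hypothetical stand-in for launching Guppy with the given device string and reporting whether it completed, and the 10% step size is an arbitrary choice.

```python
def device_string(gpu_index=0, memory_percent=None):
    """Build a device string such as 'cuda:0' or 'cuda:0:50%'."""
    if memory_percent is None:
        return f"cuda:{gpu_index}"
    if not 0 < memory_percent <= 100:
        raise ValueError("memory_percent must be between 1 and 100")
    return f"cuda:{gpu_index}:{memory_percent}%"


def find_max_percent(runs_ok, start=50, step=10):
    """Raise the memory limit in steps while runs_ok(device) succeeds.

    runs_ok is a hypothetical callable supplied by the caller.
    Returns the last percentage that worked, or None if even `start` failed.
    """
    best = None
    percent = start
    while percent <= 100 and runs_ok(device_string(0, percent)):
        best = percent
        percent += step
    return best


if __name__ == "__main__":
    # Stand-in for a real Guppy launch: pretend anything above 70% runs out of memory.
    def fake_run(device):
        return int(device.split(":")[2].rstrip("%")) <= 70

    print(find_max_percent(fake_run))
```

In practice each trial is a full Guppy run, so a short test dataset keeps the search quick.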
Advanced troubleshooting for Guppy crashes
Despite our best efforts, Guppy will occasionally crash due to bugs, and being able to diagnose and fix these is very important. This generally involves a two-step process:
Provide general configuration information from Guppy
This is just information about how you ran Guppy. It should include:
- Your Guppy version (find this by running guppy_basecaller with --version).
- The full command used to call Guppy.
- Any relevant files, such as your configuration file (if a custom one was used), or read files if available.
- Any other information you think is relevant, such as the hardware Guppy was running on (GPU model, etc).
- If possible, reproduce the crash with tracing enabled and provide these trace files along with any other log files. To do this follow these steps:
Create a file called trace_config.txt containing the following line:
* trace_file
Pass the path to this file to the relevant guppy application either using a command-line parameter, i.e.
--trace_domains_config /path/to/trace_config.txt
or using an environment variable, i.e.
ONT_GUPPY_TRACE_CONFIG=/path/to/trace_config.txt
If you are using the client library rather than a Guppy application, set the save folder for the output trace files using an environment variable (this step can be ignored when using a Guppy application), i.e.
ONT_GUPPY_TRACE_FOLDER=/folder/to/output/trace/files/
Rerun the application to reproduce the error.
Any generated trace and logging files will be written to the save folder.
Note: Once you have reproduced the error, remove the trace_config.txt file, the command-line parameter and any environment variables. This will prevent the application from continuing to fill up the hard disk with unnecessary trace files.
Attempt to isolate the read which caused the crash in Guppy
Frequently, Guppy crashes will be due to a single read, and figuring out which read caused the crash will make it much easier to reproduce it. Do that like this:
- If you are outputting reads to subfolders, check your guppy_basecaller_<timestamp>.log file to see which subfolder Guppy was writing to when it crashed. Any problem reads are likely in this folder.
- Run the same Guppy command again (on just the affected subfolder if possible), enabling verbose logging.
- Wait for Guppy to crash again.
- Look in your log file for the last read which was sent to Guppy (it will probably not have a "finished processing" message present in the log).
- Run Guppy again with only that one read, and see if it crashes. If it does, you've found the problem read and can send it to customer services.
- If Guppy did not crash, try again with the last few reads listed in the log file, as Guppy may be processing several reads in parallel (especially while running on GPU).