Q system data analysis


1. Basecalling overview

Introduction to basecalling

Basecalling is the process of converting the electrical signals generated by a DNA or RNA strand passing through the nanopore into the corresponding base sequence of the strand. The general data flow in a nanopore sequencing experiment is shown below.

Basecalling documentation

Raw data – a direct measurement of the changes in ionic current as a DNA/RNA strand passes through the pore, which are recorded by the MinKNOW software. MinKNOW also processes the signal into "reads", each read corresponding to a single strand of DNA/RNA. These reads are written out as POD5 files: a custom Oxford Nanopore file type.

Basecalling – the basecalling algorithm uses signal processing techniques based on machine learning to transform the raw signal of the reads into basecalls. The software writes out the results of these analyses into BAM files (unaligned, or containing modified base information and/or alignment information), with a default of 4000 reads per file. FASTQ files are also produced, again with a default of 4000 reads per file.

Oxford Nanopore Technologies provides several platforms to allow users to carry out basecalling in real-time, as well as executables for users' local infrastructure. You can carry out basecalling live during the experiment, as post-processing after an experiment has finished, or a combination of these.

Basecalling with neural networks

The production versions of Oxford Nanopore basecallers convert raw signal data to basecalls using algorithms that incorporate bi-directional Recurrent Neural Networks (RNNs).

A neural network models processes that occur inside the human brain. The network contains nodes arranged in layers, which carry out computations. Neural networks receive and process data, but crucially, they have been trained to have exceptional performance for particular signal processing tasks. They have been successfully used for diverse applications like pattern recognition (such as handwritten characters, speech recognition), or predicting trends over time.

A recurrent neural network is a class of neural networks in which the output is dependent on past computations. An RNN keeps an internal memory of previously-seen data, so each new computation can use information from several preceding computations. A bi-directional RNN can set data in the context of what comes both before and after in the signal.

Basecalling

Oxford Nanopore's basecallers use neural networks that have been trained on a range of example DNA sequences (described in more detail in the Basecaller training section of Basecalling algorithms). The network learns how to translate the series of measurements into the sequence.

Oxford Nanopore basecallers

Basecaller | Algorithm | Availability
MinKNOW basecaller | Production basecaller on the device software; uses the "Flip-flop" algorithm. This is identical to the algorithm used by stand-alone Guppy, but may be a version behind | Available as a free download (further details in the Q system MinKNOW user guide). The basecalling option is selected when the sequencing experiment is started, and the experiment progress is monitored via the MinKNOW GUI
Guppy | Production basecaller available as an executable; uses the "Flip-flop" algorithm. This can be run on CPUs, but is optimised for real-time basecalling on GPUs. Guppy is the default basecaller on all Oxford Nanopore sequencing devices | Available as a free download. The executable version of the software can be run on the host computer via the command line
Research algorithms | Varied | Algorithms written by Oxford Nanopore's Research division are available through a GitHub repository: https://github.com/nanoporetech. The releases are varied, and include features that will be included in future versions of the production basecaller

2. Basecalling algorithms

The "Flip-flop" basecalling algorithm

The Flip-flop basecalling algorithm uses the raw signal to write out single-base transition probabilities, along with confidence levels for each called base, and the alternative possibilities. Flip-flop basecalling provides higher accuracy basecalls and better homopolymer resolution than previous algorithms from Oxford Nanopore Technologies.

Flip flop algorithm

The neural network model in the Flip-flop basecaller performs label-free basecalling: instead of labelling raw data with a short sequence of bases like previous "transducer" basecalling algorithms, the model produces likelihoods of transitions between consecutive bases. The network considers two states for each base: "flip", which is a transition from one base to the next, and "flop", which is staying on the same base for two time-steps. The model uses Viterbi decoding to assign the likelihood of the "flip" and "flop" state for each base transition. This way, the network can distinguish between long runs of the same base (homopolymers) and multiple “stays” on the same base.
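
The difference between a homopolymer and a run of "stays" can be pictured with a toy example. The sketch below is not the production decoder: it assumes a hypothetical per-time-step path string in which every base has two alternating labels (written here as upper and lower case), so that a change of label means a new base and a repeated label means a stay. The function name and the path encoding are illustrative assumptions.

```python
def collapse_state_path(path: str) -> str:
    """Collapse a toy two-label-per-base state path into a base sequence.

    Assumed encoding (illustrative only): each base has two labels, e.g.
    'A' and 'a'. Repeating the same label is a stay on the current base;
    switching label, even to the other label of the same base, is a new base.
    """
    bases = []
    previous = None
    for state in path:
        if state != previous:          # label changed: a new base was read
            bases.append(state.upper())
        previous = state               # same label again: just a stay
    return "".join(bases)


# Six time-steps on one label of A, then C: a single A followed by C.
print(collapse_state_path("AAAAAACC"))  # AC
# Alternating labels of A: three separate A bases (a homopolymer), then C.
print(collapse_state_path("AAaaAACC"))  # AAAC
```

Both paths spend six time-steps on the base A, but only the second one is read as a homopolymer of three bases.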

Fast vs High Accuracy models

Please note that only the High Accuracy (HAC) model is supported for use with the Q system. Do not select the Fast basecalling or Modified basecalling models when setting up your experiment.

The Guppy basecaller, which is also integrated in MinKNOW, offers two different Flip-flop models: a High-accuracy (HAC) model and a Fast model. The HAC model provides a higher consensus/raw read accuracy than the Fast model. It contains a more computationally-intense Flip-flop architecture that can deliver higher accuracy using the same data produced by nanopore sequencing.

The Fast Flip-flop model includes a simplified version of the Flip-flop algorithm and delivers the same level of accuracy and basecalling speed as the older transducer basecalling algorithm. Both models have been trained on the same datasets.

A comparison of the speed and accuracy of the two models is provided in the table and graphs below. Please note that these numbers represent the theoretical best speeds achievable with the basecallers, and a real biological sample may be basecalled more slowly.

Sample type | Model name | Modal single-molecule accuracy | Basecalling speed on GridION (1X Quadro GV100 GPU)
DNA | Fast | 92.1% | 40 Gbases/hour
DNA | HAC | 95.0% | 5 Gbases/hour

Accuracy Fast HAC GridION

Calling modified bases

Certain base modifications, specifically 6mA dam/5mC dcm and CpG, can be called using the MinKNOW or stand-alone Guppy software. This requires the use of a designated basecalling model which uses the Flip-flop algorithm and is trained to identify base modifications.

Modified base output consists of two parts:

  1. A normal FASTQ record, available either as part of FASTQ files or as FASTQ entries embedded in .fast5 files.
  2. A supplementary table provided as part of .fast5 output, which contains estimated probabilities that a particular base in the FASTQ entry is a modified one.

For more information about this model, please refer to the Guppy basecaller and Guppy basecaller server section in the Guppy protocol: https://community.nanoporetech.com/protocols/Guppy-protocol/.

3. Live basecalling

Introduction to live basecalling in MinKNOW

For MinION Mk1B/Mk1D, Flongle on MinION Mk1B/Mk1D, and PromethION 2 Solo, the MinKNOW software presents an option to basecall reads on the local computer. The basecalling is carried out live, as the read files are generated during a sequencing experiment.

Basecalling results are displayed in real-time in the MinKNOW user interface, and data is written out in the BAM or FASTQ file format.

MinKNOW 3

MinKNOW basecalling: keep-up vs catch-up

Basecalling with the Fast basecalling model can keep up with the speed of data acquisition on most nanopore platforms. High Accuracy basecalling keeps up on GridION, and with 18 flow cells on PromethION A-Series. When using the more computationally-intensive models, basecalling continues after the sequencing experiment has run to completion; any reads that have not been basecalled during the experiment will be queued and processed afterwards. This is known as “Catch-up mode”.

You therefore have two options: either to allow MinKNOW to continue in catch-up mode, or to stop the analysis and basecall the remaining reads at a later time, e.g. using stand-alone Dorado.

Catch-up 2

4. On-demand basecalling using the Guppy software

Guppy basecalling software

Guppy is a legacy basecaller that contains Oxford Nanopore Technologies' basecalling algorithms. It is run from the command line in Windows, and on multiple Linux platforms. Guppy is optimised for real-time basecalling on GPUs.

Calling modified bases in Guppy

Guppy includes a configuration file that enables the identification of:

  • 6mA dam methylation (trained on E. coli data)
  • 5mC dcm methylation (trained on E. coli data)
  • 5mC CpG methylation (trained on human data)

This modification detection is done directly from the raw signal of native DNA without the need for any new preparation steps and with almost no additional analysis time.

Modified base output consists of two parts:

  1. A normal FASTQ record, the same as from normal basecalling, available either as part of FASTQ files or as FASTQ entries embedded in .fast5 files
  2. A supplementary table provided as part of .fast5 output, which contains estimated probabilities that a particular base in the FASTQ entry is a modified one

More information about running the modified base caller and interpreting the results can be found in the Guppy features, settings and analysis section of the Guppy protocol: https://community.nanoporetech.com/protocols/Guppy-protocol/.

Guppy availability

The Guppy basecalling software is available free of charge to the Nanopore Community. More details on installing and running the software are found in the Guppy protocol: https://community.nanoporetech.com/protocols/Guppy-protocol/.

The Guppy source code is available through the Developer channel (/document/developer-channel) to users who have signed the Developer terms and conditions.

5. Basecall accuracy

Introduction to nanopore sequencing accuracy

Oxford Nanopore's sequencing accuracy is shown as one of several metrics:

  • per-base quality score, denoted by the Phred Q-score
  • raw read quality. This is calculated as an average from the q-scores, and this average quality is calibrated vs accuracy.
  • raw read accuracy. Accuracy is calculated from an alignment to a reference sequence and counts insertions, deletions and substitutions as errors (bases correct / (insertions + deletions + bases aligned)), e.g. 99% accuracy can be interpreted as 99 out of 100 bases in a read were called correctly. Note that Oxford Nanopore Technologies represents the average raw read accuracy as the modal per-read accuracy from a sequencing run
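
The raw read accuracy formula in the last bullet can be written out as a short calculation. This is a minimal sketch: it assumes that "bases aligned" means matches plus mismatches (read bases placed against a reference base), and the function name and the counts in the example are hypothetical.

```python
def read_accuracy(matches: int, mismatches: int,
                  insertions: int, deletions: int) -> float:
    """Return raw read accuracy as a fraction between 0 and 1.

    accuracy = bases correct / (insertions + deletions + bases aligned)
    Assumption: bases aligned = matches + mismatches.
    """
    aligned = matches + mismatches
    denominator = insertions + deletions + aligned
    if denominator == 0:
        raise ValueError("alignment contains no bases")
    return matches / denominator


# Hypothetical alignment counts for a single read.
accuracy = read_accuracy(matches=950, mismatches=20, insertions=10, deletions=20)
print(f"{accuracy:.1%}")  # 95.0%
```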

Sequencing accuracy is determined by multiple factors, such as pore chemistry and basecalling algorithms. Improvements in these areas over the last several years have led to a steady increase in both single-molecule and consensus accuracy.

For more information about our latest accuracy data, please see the Accuracy page on the Oxford Nanopore website.

Phred quality scores

The Phred quality score defines the quality of each base in the sequence, with values from 0 to 93. The score is calculated as:

Quality score = -10 x log10(Pe)

where Pe is the estimated error probability for each base. For example, an error of 1 in 100 will give a q-score of 20. The q-scores are then encoded in the Sanger format using ASCII, with values of 33 to 126. The quality is then shown as a single character per base.

Per-base quality scores are stored together with the base sequence in FASTQ files output by the basecalling algorithms.
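
The relationship between error probability, q-score and the ASCII character stored in a FASTQ file can be sketched in a few lines. The function names below are illustrative and not part of any Oxford Nanopore tool; the offset of 33 follows the Sanger encoding described above.

```python
import math


def phred_from_error(p_error: float) -> float:
    """Convert an estimated error probability into a Phred q-score."""
    return -10 * math.log10(p_error)


def error_from_phred(q_score: float) -> float:
    """Convert a Phred q-score back into an error probability."""
    return 10 ** (-q_score / 10)


def encode_quality(q_scores: list[int]) -> str:
    """Encode integer q-scores (0 to 93) as Sanger-format ASCII (33 to 126)."""
    return "".join(chr(q + 33) for q in q_scores)


def decode_quality(quality: str) -> list[int]:
    """Decode a FASTQ quality string into integer q-scores."""
    return [ord(char) - 33 for char in quality]


print(round(phred_from_error(0.01)))   # an error of 1 in 100 gives 20
print(encode_quality([0, 20, 93]))     # !5~
print(decode_quality("!5~"))           # [0, 20, 93]
```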

Single-molecule accuracy

Single-molecule accuracy has been steadily increasing over the years, following multiple improvements in basecalling algorithms and pore chemistry. Below is a timeline of the increase in modal accuracy (thick blue lines), where the boxes show the accuracy distribution for 90% of reads in a sequencing run. The results are typically generated on a range of samples including E. coli, S. aureus, and human.

Single molecule accuracy timetable

In particular, the switch from a "transducer" basecalling algorithm to the Flip-flop algorithm, which produces a per-base output and can better resolve homopolymeric regions, has improved the modal accuracy from 90% to ~95%.

Raw read accuracy

Consensus accuracy

Consensus accuracy represents the accuracy of a consensus sequence made from a pile-up of reads. It is denoted by Q-scores, which have a logarithmic relationship to the basecalling error probability:

Q-score | Number of errors
Q40 | 1 error in 10,000 bases
Q50 | 1 error in 100,000 bases
Q60 | 1 error in 1,000,000 bases

As with single-molecule accuracy, modal consensus accuracy has been increasing over the years with improvements in pore chemistry, basecallers and polishing tools.

Consensus accuracy timeline

The current best consensus accuracy is produced using an Oxford Nanopore Research base-space polishing tool called medaka, which is optimised for the latest 'R10' pore chemistry (https://github.com/nanoporetech/medaka).

Q44 has been achieved with the R9.4.1 pore, the Flip-flop basecaller and medaka polishing, while Q54 can be reached using the R10 pore.

Pore chemistry and its impact on accuracy

The structure of the nanopore itself determines the raw signal that is produced when the DNA or RNA passes through the pore, and therefore how accurate the basecalling will be.

The nanopores used by Oxford Nanopore Technologies have one or more "readers", i.e. constrictions within the pore barrel that interact with the polynucleotides and produce the changes in current levels.

The R9.4.1 version of the pore has a single reader in the middle of the barrel. The new R10 version of the pore has two readers, which means that more bases within the DNA/RNA strand can interact with the pore, and longer homopolymeric regions are better "seen" by the pore (see diagram below).

R9 R10 pore structure

While the R9.4.1 and R10 modal single-molecule accuracies are similar (top graph below), R10 retains higher single-molecule accuracy when reading homopolymers (shown in the bottom graph below), while the accuracy with R9.4.1 drops off.

R9 R10 raw read accuracy

R9 R10 homopolymer calling

Consensus accuracy is also higher for R10 than R9.4.1. A recent sequencing analysis of the ZymoBIOMICS Microbial Community Standards with the R10 pore has given a consensus accuracy above Q45 for various bacterial species in the sample at 50X coverage, and Q54 for S. aureus at 150X coverage.

R10 Zymo consensus accuracy

6. Barcoding options

Barcode design

The Oxford Nanopore Technologies barcoding kits can place barcodes at the beginning and, for some kits, also at the end of the strands, so that several different samples can be multiplexed in one sequencing experiment. The barcodes reside in a kit-specific context sequence, and different kits have different lengths of sequence before and after the barcode. However, the sequences of the barcodes themselves are identical, regardless of kit.

The regions of a barcode

A complete barcode arrangement comprises three sections:

  1. The upstream flanking region, which comes between the barcode and the sequencing adapter.
  2. The barcode sequence.
  3. The downstream flanking region, which comes between the barcode and the sample sequence.

A complete dual-barcode arrangement comprises five sections:

  1. The upstream flanking region, which comes between the outer barcode and the sequencing adapter.
  2. The outer barcode sequence.
  3. The mid flanking region, which comes between the outer barcode and the inner barcode.
  4. The inner barcode sequence.
  5. The downstream flanking region, which comes between the inner barcode and the sample sequence.

The barcode sequences remain constant across almost all of Oxford Nanopore Technologies' kits. For example, the flanking regions for barcode 10 in the Rapid Barcoding Kit (SQK-RBK004) are different from the flanking regions for barcode 10 in the native barcoding expansion kit (EXP-NBD114), but the barcode sequence itself is the same.

While native kits use the same barcode sequences as other kits, barcodes 1-12 in the native kit are the reverse complement of the standard barcodes 1-12.

There is one other exception to this: barcode 12a in the Rapid Barcoding Kit SQK-RBK004 has a different barcode sequence to barcode 12 in other kits. For this reason, the oligonucleotide of this sequence is referred to as "barcode 12a".

Barcode and barcode flanking sequences can be found in the "Barcoding kits" section of the Chemistry technical document in the Nanopore Community.
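
The reverse-complement relationship noted above for barcodes 1-12 in the native kit can be sketched as follows. The sequence used here is a short placeholder, not a real barcode; the actual sequences are listed in the Chemistry technical document.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


placeholder_barcode = "AAGGTTAC"  # hypothetical sequence for illustration
print(reverse_complement(placeholder_barcode))  # GTAACCTT
```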

Barcode demultiplexing options

After a barcoded sequencing run has completed, the reads can be split into folders by barcode, using one of Oxford Nanopore's demultiplexing tools:

  • Real-time barcode demultiplexing in MinKNOW
  • Post-run barcode demultiplexing in MinKNOW
  • Barcode demultiplexing in the Dorado basecall server

A brief description of the options is provided below.

FASTQ Barcoding workflow in EPI2ME

The EPI2ME platform provides a workflow for analysing sequencing data with barcodes. When this is used, the returned FASTQ files will contain information about the barcodes found. The reads will be returned to a data location entered in the Desktop Agent.

As the basecalled reads are uploaded to the cloud, the first step in the workflow is to check that the read files are of a supported filetype (currently FASTQ), and that the file headers are in the correct format.

The demultiplexing algorithm consists of three main steps:

  1. Identify the sequencing kit that was used for a given read
  2. Identify the correct barcode
  3. Filter the results

Barcoding methodology

1. Identify the sequencing kit that was used for a given read

The algorithm has one or more 'templates' for each sequencing kit. These templates consist of the fixed regions that flank the barcodes, and a placeholder for where the barcode is located. The fixed regions differ between kits and therefore allow the kit that was used to be determined.

For example:

Template for the Rapid Barcoding Kit (SQK-RBK004): GCTTGGGTGTTTAACCNNNNNNNNNNNNNNNNNNNNNNNNGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA

Template for the Native Barcoding Expansion (EXP-NBD104): ATTGCTAAGGTTAANNNNNNNNNNNNNNNNNNNNNNNNCAGCACC

The algorithm aligns the templates from all supported kits to a given read using semi-global alignment and computes a length-normalised score. The template with the highest score is most likely to originate from the kit that was used to sequence the given read.

Example:

  • The template for SQK-RBK004 has a score of 46
  • The template for EXP-NBD104 gets a score of 94

Therefore, it is assumed that the library preparation used EXP-NBD104. EPI2ME also allows the user to specify the kit that was used for sequencing. In this case, step 1 can be skipped.
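
The template-scoring idea in step 1 can be illustrated with a small semi-global alignment. This is a sketch under stated assumptions, not the EPI2ME implementation: the scoring values (match +1, mismatch -1, gap -1), the treatment of N as matching any base, and the normalisation to a 0-100 scale by template length are all assumptions, and the read is synthetic. The flanking sequences are taken from the two templates shown above.

```python
def semi_global_score(template: str, read: str,
                      match: int = 1, mismatch: int = -1, gap: int = -1) -> float:
    """Align the whole template anywhere inside the read.

    Read bases before and after the template are free (semi-global), 'N' in
    the template matches any base, and the best score is normalised by the
    template length onto a 0-100 scale (an assumed normalisation).
    """
    previous = [0] * (len(read) + 1)           # row 0: leading read bases cost nothing
    for t_base in template:
        current = [previous[0] + gap]          # template base with no read base yet
        for j, r_base in enumerate(read, start=1):
            hit = match if t_base in ("N", r_base) else mismatch
            current.append(max(previous[j - 1] + hit,   # align the two bases
                               previous[j] + gap,       # gap in the read
                               current[j - 1] + gap))   # gap in the template
        previous = current
    best = max(previous)                       # trailing read bases cost nothing
    return 100.0 * best / (match * len(template))


# Templates built from the flanking regions quoted in the text.
rbk004 = ("GCTTGGGTGTTTAACC" + "N" * 24 +
          "GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA")
nbd104 = "ATTGCTAAGGTTAA" + "N" * 24 + "CAGCACC"

# Synthetic read: native flanks around a placeholder 24-base barcode.
read = "TTTT" + "ATTGCTAAGGTTAA" + "ACGT" * 6 + "CAGCACC" + "GGGG"

rbk_score = semi_global_score(rbk004, read)
nbd_score = semi_global_score(nbd104, read)
best_kit = "EXP-NBD104" if nbd_score > rbk_score else "SQK-RBK004"
print(best_kit)  # EXP-NBD104
```

The kit whose template scores highest is taken as the kit used for that read, mirroring the example scores above.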

2. Identify the correct barcode

All the barcodes are aligned separately to the region of the read where the barcode is expected to be located (the stretch of Ns in the template), and the alignment scores are calculated again. Including ~10 bp up- and downstream of the barcode in the alignment has been shown to improve results. The barcode that gives the highest score is most likely to be the correct barcode.

For example:

Barcode01: score 31
Barcode02: score 24
Barcode03: score 100
....
Barcode12: score 10

In this case, barcode03 is the correct barcode. The algorithm uses a minimum score cutoff of 58: if there are no alignments above this score, the read is reported as unclassified.

3. Filter the results

It has been shown that reads with different barcodes attached to their 5' and 3' ends result in incorrect classification. To reduce the effect of those reads, the algorithm searches for barcodes independently at the 5' and 3' ends. If it finds barcodes at both ends that are above the score cutoff, it checks whether the calls are the same. If the two barcodes are not the same, the read is reported as unclassified.
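
Steps 2 and 3 can be summarised as a small decision routine. This is a minimal sketch, assuming the alignment scores have already been computed and are supplied as dictionaries; the function names are hypothetical, and treating a score equal to the cutoff as passing is an assumption.

```python
from typing import Optional

MIN_SCORE = 58  # minimum score cutoff quoted in the text


def best_barcode(scores: dict[str, float],
                 cutoff: float = MIN_SCORE) -> Optional[str]:
    """Return the highest-scoring barcode, or None if nothing reaches the cutoff."""
    if not scores:
        return None
    name, score = max(scores.items(), key=lambda item: item[1])
    return name if score >= cutoff else None


def classify_read(front_scores: dict[str, float],
                  rear_scores: Optional[dict[str, float]] = None) -> str:
    """Classify a read from barcode scores at its 5' and (optionally) 3' end."""
    front_call = best_barcode(front_scores)
    rear_call = best_barcode(rear_scores) if rear_scores else None
    # Two confident but different calls indicate a mixed read: reject it.
    if front_call and rear_call and front_call != rear_call:
        return "unclassified"
    return front_call or rear_call or "unclassified"


front = {"barcode01": 31, "barcode02": 24, "barcode03": 100, "barcode12": 10}
print(classify_read(front))                                      # barcode03
print(classify_read(front, {"barcode03": 20, "barcode05": 90}))  # unclassified
```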

For more information about the barcoding workflow, please see the "Workflows" section of the EPI2ME Technical Document: https://community.nanoporetech.com/technical_documents/epi2me-tech-doc/.

Barcode demultiplexing in Guppy

The barcoding algorithm in Guppy uses a modified Needleman-Wunsch method. Each barcode is aligned to a section of the basecall, with a score assigned to each base in the sequence depending on whether the base was a match, mis-match or a gap. The combined scores for each barcode alignment are compared, and the barcode with the highest score is chosen as long as the score is above the defined threshold. The barcode sequences can be trimmed from the reads, as a command-line option.

For more details, please see Appendix A of the Guppy protocol: https://community.nanoporetech.com/protocols/Guppy-protocol/.

Barcode demultiplexing in MinKNOW

MinKNOW currently uses Dorado for both basecalling and barcode demultiplexing. It performs barcode demultiplexing in real-time, as the sequencing run progresses. MinKNOW demultiplexing is also available as a post-run analysis option.

7. FASTQ and BAM files

FASTQ output

FASTQ files are text files that contain sequence data for each read, and associated per-base quality scores. FASTQ files can be generated in MinKNOW, Dorado, and Guppy. The default is to write out 4000 reads per FASTQ file, although this number is configurable.

A single read sequence in a FASTQ file is described in four lines:

  1. Line 1 begins with a '@' and is followed by a header containing information about the sequencing run.
  2. Line 2 is the basecalled sequence (using A, C, T, G and N).
  3. Line 3 contains a '+'.
  4. Line 4 encodes the per-base quality scores for the sequence in Line 2.

An example of a FASTQ file is shown below:

@75be78f7-bd62-4972-92d2-aba16f465b0d runid=ff83cfafb0cb3bfc28ac370b841f59798ab3d63a sampleid=RB02_lambda_ovn1 flow_cell_id=PBA53900 protocol_group_id=r9_read_length ch=1375 start_time=2024-10-31T09:34:35Z basecall_model_version_id=dna_r9.4.1_e8_hac@v3.3 basecall_gpu=Quadro_GV100
CGGTATTACTTCGTTCAGTTTCGGACAGGTGTTTTAACC[...]TCGTACCTAT
+
'%+-($&&&&'(':+7)-%(&$$.%##))868;;87/9;[...]68(*(2)/%$
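
The four-line record structure can be read with a few lines of code. This is a minimal sketch, assuming every record occupies exactly four lines with no blank lines in between and that header fields after the read ID are space-separated key=value pairs as in the example above. The function name and the shortened record are illustrative.

```python
import io
from typing import Iterator, TextIO


def read_fastq(handle: TextIO) -> Iterator[tuple[str, dict[str, str], str, str]]:
    """Yield (read_id, header_fields, sequence, quality) for each record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:                      # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        read_id, *fields = header[1:].split()
        tags = dict(field.split("=", 1) for field in fields if "=" in field)
        yield read_id, tags, sequence, quality


# Shortened, hypothetical record in the same layout as the example above.
example = "@read-1 ch=1375 start_time=2024-10-31T09:34:35Z\nCGGTATTAC\n+\n'%+-($&&&\n"
for read_id, tags, sequence, quality in read_fastq(io.StringIO(example)):
    print(read_id, tags["ch"], len(sequence))  # read-1 1375 9
```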

BAM output

BAM files are output by MinKNOW and stand-alone Guppy software if alignment has been performed on the basecalled dataset. BAM files are also output when using the modified base models in MinKNOW and Dorado.

8. Read .fast5 files from the instrument

Read .fast5 files

A .fast5 file is a type of HDF5 file, which is designed to contain all information needed for analysing nanopore sequencing data and tracking it back to its source. Read .fast5 files contain raw sequencing data for each read, with a default of 4000 reads per file.

Default read file location

Windows

C:\data\

Mac OS X

/Library/MinKNOW/

Linux

/var/lib/minknow/

File directory and name

All files output directly from an experiment are located in the same directory. This directory has the structure:

{output_dir}/{experiment_id}/{sample_id}/{start_time}_{device_id}_{flow_cell_id}_{short_protocol_run_id}/

output_dir is the configured output directory
experiment_id is the user-entered identifier for a group of runs
sample_id is the user-entered value for a specific sample or run. Multiple sample_ids may exist beneath an individual experiment_id
start_time is the time at which the protocol started, in YYYYMMDD_HHMM format
device_id is the serial ID of the MinION Mk1B or device position for GridION/PromethION
flow_cell_id is the flow cell ID (e.g. FAH12345), either programmed on the ASIC or entered by the user
short_protocol_run_id is a unique identifier of 7 characters from the protocol ID

Examples of the above naming convention:

MinION Mk1B

C:\data\MyExperiment\Sample1\20181011_1759_MN12345_FAH12345_0ffe109

GridION

/data/MyExperiment/Sample1/20181011_1759_GA2000_FAH12345_0ffe109

Individual read files are split into .fast5 "pass" and "fail" folders, as well as FASTQ "pass" and "fail" folders within this directory:

{ext}_{status}/{flow_cell_id}_{run_id}_{batch_number}.{ext}

Examples of the above file naming:

fast5_pass/FAK12345_9bf81741599b097c42ba9a6ff6e4d00f02b6534b_0.fast5
fast5_fail/FAK12345_9bf81741599b097c42ba9a6ff6e4d00f02b6534b_0.fast5
fastq_pass/FAK12345_9bf81741599b097c42ba9a6ff6e4d00f02b6534b_0.fastq
fastq_fail/FAK12345_9bf81741599b097c42ba9a6ff6e4d00f02b6534b_0.fastq
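
The directory and file naming conventions above can be unpacked programmatically, for example when collecting files for custom analysis. This is a minimal sketch: the function names are hypothetical, and it assumes that none of the individual fields contains an underscore.

```python
from pathlib import PurePosixPath


def parse_run_directory(name: str) -> dict[str, str]:
    """Split a run directory name into the fields described above."""
    date, time, device_id, flow_cell_id, short_run_id = name.split("_")
    return {
        "start_time": f"{date}_{time}",          # YYYYMMDD_HHMM
        "device_id": device_id,
        "flow_cell_id": flow_cell_id,
        "short_protocol_run_id": short_run_id,
    }


def parse_read_file(relative_path: str) -> dict[str, object]:
    """Split '{ext}_{status}/{flow_cell_id}_{run_id}_{batch_number}.{ext}'."""
    path = PurePosixPath(relative_path)
    ext, status = path.parent.name.split("_")
    flow_cell_id, run_id, batch_number = path.stem.split("_")
    return {
        "ext": ext,
        "status": status,                        # "pass" or "fail"
        "flow_cell_id": flow_cell_id,
        "run_id": run_id,
        "batch_number": int(batch_number),
    }


run = parse_run_directory("20181011_1759_MN12345_FAH12345_0ffe109")
print(run["device_id"], run["flow_cell_id"])     # MN12345 FAH12345

read_file = parse_read_file(
    "fast5_pass/FAK12345_9bf81741599b097c42ba9a6ff6e4d00f02b6534b_0.fast5")
print(read_file["status"], read_file["batch_number"])  # pass 0
```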

File content

This information is provided for users who want to be able to look directly at the read files being produced by MinKNOW, which may be of particular interest to those wishing to carry out custom analysis using their own tools. The following represents the current standard for multi-read .fast5 files.

file:
| attributes:
| | # The version of the file. We strive to use semantic
| | # versioning, so if two files share the same major version number
| | # (MAJOR.MINOR.PATCH) then the file with the higher overall
| | # version should be compatible with tools designed to work
| | # with the lower one.
| | file_version: "3.1"
| | # file_type can be "single-read" or "multi-read". Note that
| | # single-read files are no longer officially supported.
| | file_type: "multi-read"
| groups:
| | 'read_[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}':
| | | name_type: regex
| | | attributes:
| | | | # The ID for a specific data acquisition period.
| | | | run_id: S
| | | | # The pore type used, or "not_set".
| | | | pore_type: S
| | | count:
| | | | minimum_count: 0
| | | groups:
| | | | tracking_id:
| | | | | attributes:
| | | | | | # The ID of the ASIC eeprom
| | | | | | "asic_id_eeprom":
| | | | | | datatype: S
| | | | | | # The device ID. On PromethION and GridION
| | | | | | # this will be a position on the device.
| | | | | | "device_id":
| | | | | | datatype: S
| | | | | | # The name of the experiment script that was
| | | | | | # run, including any optional parameters
| | | | | | # passed to it.
| | | | | | "exp_script_name":
| | | | | | datatype: S
| | | | | | # The "purpose" of the experiment script,
| | | | | | # e.g. "sequencing".
| | | | | | "exp_script_purpose":
| | | | | | datatype: S
| | | | | | # ISO8601 time the experiment started.
| | | | | | "exp_start_time":
| | | | | | datatype: S
| | | | | | # The id of the flow cell used.
| | | | | | "flow_cell_id":
| | | | | | datatype: S
| | | | | | # The product code associated with the flow
| | | | | | # cell and its pore(s), e.g. FLO-MIN107.
| | | | | | "flow_cell_product_code":
| | | | | | datatype: S
| | | | | | # The ID of the machine running MinKNOW.
| | | | | | "hostname":
| | | | | | datatype: S
| | | | | | # The ID of the full set of data acquisition
| | | | | | # periods performed by an experiment script.
| | | | | | "protocol_run_id":
| | | | | | datatype: S
| | | | | | # The start time of the full set of data 
| | | | | | # acquisition periods performed by an 
| | | | | | # experiment script.
| | | | | | "protocol_start_time":
| | | | | | # rfc-3339 date
| | | | | | datatype: S
| | | | | | # The version of the software toolkit used by
| | | | | | # the experiment scripts.
| | | | | | "protocols_version":
| | | | | | datatype: S
| | | | | | # The version of the configuration system,
| | | | | | # including the experiment scripts.
| | | | | | "configuration_version":
| | | | | | datatype: S
| | | | | | # The ID of the data acquisition period during
| | | | | | # which this read was obtained. See also
| | | | | | # "protocol_run_id".
| | | | | | "run_id":
| | | | | | datatype: S
| | | | | | # The customer-supplied ID of the sample being
| | | | | | # sequenced.
| | | | | | "sample_id":
| | | | | | datatype: S
| | | | | | # The version of MinKNOW used to acquire this
| | | | | | # read.
| | | | | | "version":
| | | | | | datatype: S
| | | | | | '[a-zA-Z0-9]+':
| | | | | | name_type: regex
| | | | | | count:
| | | | | | minimum_count: 0
| | | | | | datatype: S
| | | | context_tags:
| | | | | attributes:
| | | | | | # Whether or not basecalling during sequencing
| | | | | | # was enabled. Set to either 0 or 1
| | | | | | "local_basecalling":
| | | | | | datatype: S
| | | | | | # The sequencing kit selected by the user in
| | | | | | # the GUI.
| | | | | | "sequencing_kit":
| | | | | | datatype: S
| | | | | | '[a-zA-Z0-9]+':
| | | | | | name_type: regex
| | | | | | count:
| | | | | | minimum_count: 0
| | | | | | datatype: S
| | | | channel_id:
| | | | | attributes:
| | | | | | # These three parameters are used to convert
| | | | | | # integer ADC values from the "Signal" dataset
| | | | | | # into picoamps, using the formula:
| | | | | | # pA = (range / digitisation) * (adc_val + offset)
| | | | | | digitisation: f8
| | | | | | offset: f8
| | | | | | range: f8
| | | | | | # The rate at which ADC data was acquired, in
| | | | | | # hertz.
| | | | | | sampling_rate: f8
| | | | | | # The number of the channel from which the
| | | | | | # read was acquired
| | | | | | channel_number: S
| | | | Raw:
| | | | | count:
| | | | | | minimum_count: 0
| | | | | attributes:
| | | | | | # The time the read began, in seconds since
| | | | | | # the start of the acquisition period.
| | | | | | start_time: u8
| | | | | | # The duration of the read, in seconds.
| | | | | | duration: u4
| | | | | | # The number, counted upwards from zero,
| | | | | | # corresponding to this read. Note that not
| | | | | | # all reads are "strand" reads written to
| | | | | | # fast5 files, so read numbers may be
| | | | | | # missing.
| | | | | | read_number: i4
| | | | | | # The mux this read came from.
| | | | | | start_mux: u1
| | | | | | # A UUID4 identifier for this read.
| | | | | | read_id: S
| | | | | | # An estimation of the median channel current
| | | | | | # immediately before the read was acquired.
| | | | | | median_before: f8
| | | | | | # Enum of `unknown` = 0, `partial` = 1, `mux_change` = 2, `unblock_mux_change` = 3, `signal_positive` = 4, `signal_negative` = 5
| | | | | | end_reason: u1
| | | | | | # Number of minknow events that the read contains
| | | | | | num_minknow_events: i8
| | | | | datasets:
| | | | | | # Dataset containing the raw ADC values for
| | | | | | # the read. See "channel_id" above for how to
| | | | | | # convert these values to pA.
| | | | | | Signal:
| | | | | | datatype: i2

The file structure described here allows the files to be produced incrementally by MinKNOW, which means that analyses can be done whilst the experiment is still in progress.
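
The comment on the channel_id attributes in the schema above gives the formula for converting raw ADC values to picoamps. A minimal sketch of that conversion is shown below. The function name is illustrative and the calibration values in the example are hypothetical; in practice they would be read from the channel_id attributes of each read, and the ADC values from its Signal dataset.

```python
def adc_to_picoamps(adc_values: list[int], digitisation: float,
                    offset: float, range_: float) -> list[float]:
    """Apply pA = (range / digitisation) * (adc_val + offset) to each sample."""
    scale = range_ / digitisation
    return [scale * (value + offset) for value in adc_values]


# Hypothetical calibration values and raw samples, for illustration only.
signal_pa = adc_to_picoamps([0, 502, 1014],
                            digitisation=8192.0, offset=10.0, range_=1024.0)
print(signal_pa)  # [1.25, 64.0, 128.0]
```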

Intermediate folder

The files in the intermediate folder store unprocessed raw signal data. Once raw signal processing is complete, POD5 or .fast5 files are generated and stored in the tmp folder, where local basecalling can proceed. These files are removed as processing proceeds or at the end of the run.

If the system encounters an issue, such as running out of space, the unprocessed data will not be cleared and will remain in the intermediate folder. Due to the real-time streaming nature of the system, this data cannot be processed after the run is stopped.

9. Basecalled .fast5 files

Multi-read .fast5 file structure for local basecalling

file:
| attributes:
| | # The version of the file. We strive to use semantic versioning, so if two
| | # files share the same major version number (MAJOR.MINOR.PATCH), then the
| | # file with the higher overall version should be compatible with tools
| | # designed to work with the lower one.
| | file_version: "2.3"
| | # file_type can be "single-read" or "multi-read". Note that single-read files
| | # are no longer officially supported.
| | file_type: "multi-read"
| groups:
| | 'read_[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}':
| | | name_type: regex
| | | attributes:
| | | | # The ID for a specific data acquisition period.
| | | | run_id: S
| | | | # The pore type used, or "not_set".
| | | | pore_type: S
| | | count:
| | | | minimum_count: 0
| | | groups:
| | | | tracking_id:
| | | | | attributes:
| | | | | | # The ID of the ASIC eeprom
| | | | | | "asic_id_eeprom":
| | | | | | datatype: S
| | | | | | # The device ID. On PromethION and GridION this will be a position on
| | | | | | # the device.
| | | | | | "device_id":
| | | | | | datatype: S
| | | | | | # The name of the experiment script that was run, including any
| | | | | | # optional parameters passed to it.
| | | | | | "exp_script_name":
| | | | | | datatype: S
| | | | | | # The "purpose" of the experiment script, e.g. "sequencing".
| | | | | | "exp_script_purpose":
| | | | | | datatype: S
| | | | | | # ISO8601 time the experiment started.
| | | | | | "exp_start_time":
| | | | | | datatype: S
| | | | | | # The id of the flow cell used.
| | | | | | "flow_cell_id":
| | | | | | datatype: S
| | | | | | # The product code associated with the flow cell and its pore(s),
| | | | | | # e.g. FLO-MIN106.
| | | | | | "flow_cell_product_code":
| | | | | | datatype: S
| | | | | | # The ID of the machine running MinKNOW.
| | | | | | "hostname":
| | | | | | datatype: S
| | | | | | # The ID of the full set of data acquisition periods performed by
| | | | | | # an experiment script.
| | | | | | "protocol_run_id":
| | | | | | datatype: S
| | | | | | # The start time of the full set of data acquisition periods
| | | | | | # performed by an experiment script.
| | | | | | "protocol_start_time":
| | | | | | # rfc-3339 date
| | | | | | datatype: S
| | | | | | # The version of the software toolkit used by the experiment scripts.
| | | | | | "protocols_version":
| | | | | | datatype: S
| | | | | | # The version of the configuration system, including the experiment
| | | | | | # scripts.
| | | | | | "configuration_version":
| | | | | | datatype: S
| | | | | | # The ID of the data acquisition period during which this read was
| | | | | | # obtained. See also "protocol_run_id".
| | | | | | "run_id":
| | | | | | datatype: S
| | | | | | # The customer-supplied ID of the sample being sequenced.
| | | | | | "sample_id":
| | | | | | datatype: S
| | | | | | # The version of MinKNOW used to acquire this read.
| | | | | | "version":
| | | | | | datatype: S
| | | | | | '[a-zA-Z0-9]+':
| | | | | | datatype: S
| | | | context_tags:
| | | | | attributes:
| | | | | | # Whether or not basecalling during sequencing was enabled. Set to
| | | | | | # either 0 or 1
| | | | | | "local_basecalling":
| | | | | | datatype: S
| | | | | | # The sequencing kit selected by the user in the GUI.
| | | | | | "sequencing_kit":
| | | | | | datatype: S
| | | | | | '[a-zA-Z0-9]+':
| | | | | | datatype: S
| | | | channel_id:
| | | | | attributes:
| | | | | | # These three parameters are used to convert integer ADC values from
| | | | | | # the "Signal" dataset into picoamps, using the formula:
| | | | | | # pA = (range / digitisation) * (adc_val + offset)
| | | | | | digitisation: f8
| | | | | | offset: f8
| | | | | | range: f8
| | | | | | # The rate at which ADC data was acquired, in Hertz.
| | | | | | sampling_rate: f8
| | | | | | # The number of the channel from which the read was acquired.
| | | | | | channel_number: S
| | | | Raw:
| | | | | count:
| | | | | | minimum_count: 0
| | | | | attributes:
| | | | | | # The time the read began, in seconds since the start of the
| | | | | | # acquisition period.
| | | | | | start_time: u8
| | | | | | # The duration of the read, in seconds.
| | | | | | duration: u4
| | | | | | # The number, counted upwards from zero, corresponding to this read.
| | | | | | # Note that not all reads are "strand" reads written to fast5 files,
| | | | | | # so read numbers may be missing.
| | | | | | read_number: i4
| | | | | | # The mux this read came from.
| | | | | | start_mux: u1
| | | | | | # A UUID4 identifier for this read.
| | | | | | read_id: S
| | | | | | # An estimation of the median channel current immediately before the
| | | | | | # read was acquired.
| | | | | | median_before: f8
| | | | | | # Enum of `unknown` = 0, `partial` = 1, `mux_change` = 2, `unblock_mux_change` = 3, `signal_positive` = 4, `signal_negative` = 5
| | | | | | end_reason: u1
| | | | | datasets:
| | | | | | # Dataset containing the raw ADC values for the read. See
| | | | | | # "channel_id" above for how to convert these values to pA.
| | | | | | Signal:
| | | | | | datatype: i2
| | | | # The Analysis group is where output from the basecaller is stored. Other
| | | | # tools, such as Tombo, will sometimes add their own groups here as well.
| | | | Analyses:
| | | | | count:
| | | | | | minimum_count: 0
| | | | | groups:
| | | | | | # The Segmentation group stores information about pre-processing
| | | | | | # that occurs prior to basecalling.
| | | | | | 'Segmentation_[0-9]+':
| | | | | | | count:
| | | | | | | | minimum_count: 0
| | | | | | | name_type: regex
| | | | | | | attributes:
| | | | | | | | # The name of the software used to generate this Analysis
| | | | | | | | # group. This will usually be "MinKNOW-Live-Basecalling" or "ONT
| | | | | | | | # Guppy Basecall software".
| | | | | | | | name: S
| | | | | | | | # The version of the software used to generate this Analysis
| | | | | | | | # group.
| | | | | | | | version: S
| | | | | | | | # The ISO8601 time this group was generated.
| | | | | | | | time_stamp: S
| | | | | | | groups:
| | | | | | | | Summary:
| | | | | | | | | attributes:
| | | | | | | | | | # Legacy field - now always "Workflow Successful".
| | | | | | | | | | return_status: S
| | | | | | | | | groups:
| | | | | | | | | | segmentation:
| | | | | | | | | | | count:
| | | | | | | | | | | | minimum_count: 0
| | | | | | | | | | | attributes:
| | | | | | | | | | | | # Whether or not this read has a template section. This
| | | | | | | | | | | | # will almost always be true.
| | | | | | | | | | | | has_template: u1
| | | | | | | | | | | | # The first sample, from the start of the read, which was
| | | | | | | | | | | | # sent to the basecaller.
| | | | | | | | | | | | first_sample_template: u8
| | | | | | | | | | | | # The duration, in samples, of the data sent to the
| | | | | | | | | | | | # basecaller.
| | | | | | | | | | | | duration_template: u8
| | | | | | 'Basecall_1D_[0-9]+':
| | | | | | | count:
| | | | | | | | minimum_count: 0
| | | | | | | name_type: regex
| | | | | | | attributes:
| | | | | | | | # The name of the software used to generate this Analysis group.
| | | | | | | | # This will usually be "MinKNOW-Live-Basecalling" or "ONT Guppy
| | | | | | | | # Basecall software".
| | | | | | | | name: S
| | | | | | | | # The version of the software used to generate this Analysis
| | | | | | | | # group.
| | | | | | | | version: S
| | | | | | | | # The ISO8601 time this group was generated.
| | | | | | | | time_stamp: S
| | | | | | | | # The type of basecalling model used
| | | | | | | | model_type: S
| | | | | | | groups:
| | | | | | | | Summary:
| | | | | | | | | attributes:
| | | | | | | | | | # Legacy field - now always "Workflow Successful".
| | | | | | | | | | return_status: S
| | | | | | | | | groups:
| | | | | | | | | | basecall_1d_template:
| | | | | | | | | | | count:
| | | | | | | | | | | | minimum_count: 0
| | | | | | | | | | | | maximum_count: 1
| | | | | | | | | | | attributes:
| | | | | | | | | | | | # Legacy field
| | | | | | | | | | | | num_events: u8
| | | | | | | | | | | | # The duration, in samples, between units of information
| | | | | | | | | | | | # sent to the basecaller.
| | | | | | | | | | | | block_stride: u8
| | | | | | | | | | | | # Mean qscore for this read.
| | | | | | | | | | | | mean_qscore: f4
| | | | | | | | | | | | # Legacy field.
| | | | | | | | | | | | strand_score: f4
| | | | | | | | | | | | # Sequence length, in bases, for this read.
| | | | | | | | | | | | sequence_length: u8
| | | | | | | | | | | | # Estimated transition probabilities for the basecall.
| | | | | | | | | | | | stay_prob: f4
| | | | | | | | | | | | step_prob: f4
| | | | | | | | | | | | skip_prob: f4
| | | | | | | | | | | | # Scaling parameters which describe how raw data is
| | | | | | | | | | | | # normalised before being basecalled.
| | | | | | | | | | | | basecall_location: f4
| | | | | | | | | | | | basecall_scale: f4
| | | | | | | # The BaseCalled_template analysis group holds information about 1D
| | | | | | | # basecalling. It is called "template" for historical reasons.
| | | | | | | BaseCalled_template:
| | | | | | | | count:
| | | | | | | | | minimum_count: 0
| | | | | | | | | maximum_count: 1
| | | | | | | | datasets:
| | | | | | | | | # The Trace dataset is a flip-flop model-specific
| | | | | | | | | # representation of the basecaller's confidence in particular
| | | | | | | | | # states. See https://community.nanoporetech.com/posts/pre-release-of-stand-alone
| | | | | | | | | Trace:
| | | | | | | | | | count:
| | | | | | | | | | | minimum_count: 0
| | | | | | | | | | | table_version: "flipflop_trace_table_v0.1"
| | | | | | | | | | | offset: f4
| | | | | | | | | | | scale: f4
| | | | | | | | | # The Move dataset describes how the basecaller "moves" through
| | | | | | | | | # the called sequence, and allows for a mapping from basecall
| | | | | | | | | # to raw data. See https://community.nanoporetech.com/posts/mapping-of-signal-to-basec
| | | | | | | | | # and https://gist.github.com/fbrennen/257130d54fd2325c6ac1a7cfb708286d
| | | | | | | | | Move:
| | | | | | | | | | count:
| | | | | | | | | | | minimum_count: 0
| | | | | | | | | | | '': u1
| | | | | | | | | | attributes:
| | | | | | | | | | | table_version: "flipflop_move_table_v0.1"
| | | | | | | | | # The ModBaseProbs table is generated during modified
| | | | | | | | | # basecalling, and lists the likelihood that each base in the
| | | | | | | | | # called sequence contains a particular modification.
| | | | | | | | | # See https://community.nanoporetech.com/posts/guppy-release-v3-2
| | | | | | | | | ModBaseProbs:
| | | | | | | | | | count: base
| | | | | | | | | | | minimum_count: 0
| | | | | | | | | | | table_version: "flipflop_modified_base_probs_v0.1"
| | | | | | | | | | | output_alphabet: S
| | | | | | | | | | | modified_base_long_names: S
| | | | | | | | | # An embedded representation of the FASTQ string for this
| | | | | | | | | # basecall.
| | | | | | | | | Fastq:
| | | | | | | | | | datatype: S
| | | | | | | | | | size: []
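
The `channel_id` comment in the structure above gives the conversion from raw ADC values to picoamps as pA = (range / digitisation) * (adc_val + offset). The sketch below applies that formula to a list of integers. The function name and the numeric values are hypothetical and chosen only to keep the arithmetic easy to follow. Reading the attributes from an actual file is not shown.

```python
def adc_to_picoamps(adc_values, digitisation, offset, range_):
    """Convert raw ADC integers to picoamps.

    Implements pA = (range / digitisation) * (adc_val + offset), using the
    `digitisation`, `offset` and `range` attributes of `channel_id`.
    The parameter is named `range_` only to avoid shadowing Python's
    built-in `range`.
    """
    scale = range_ / digitisation
    return [scale * (value + offset) for value in adc_values]


if __name__ == "__main__":
    # Hypothetical calibration values, for illustration only.
    signal = [90, -10, 190]
    print(adc_to_picoamps(signal, digitisation=2048.0, offset=10.0, range_=1024.0))
    # [50.0, 0.0, 100.0]
```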

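The Move dataset and the `block_stride` and `first_sample_template` attributes described above can be combined to estimate where each called base starts in the raw signal. The following is a sketch under stated assumptions: the move table is taken to hold one 0/1 entry per block of `block_stride` samples, and a 1 is taken to mean that a new base begins in that block. The function name is hypothetical, and the linked posts should be consulted for the authoritative definition.

```python
def base_start_samples(move, block_stride, first_sample_template=0):
    """Estimate the raw-signal sample index at which each base starts.

    Assumptions of this sketch (not confirmed by the schema itself):
    - `move` holds one 0/1 entry per block of `block_stride` samples;
    - a value of 1 marks the block in which a new base begins;
    - block 0 starts at `first_sample_template`.
    """
    starts = []
    for block_index, moved in enumerate(move):
        if moved:
            starts.append(first_sample_template + block_index * block_stride)
    return starts


if __name__ == "__main__":
    # Hypothetical move table: three bases emitted over six blocks.
    print(base_start_samples([1, 0, 0, 1, 1, 0], block_stride=5,
                             first_sample_template=100))
    # [100, 115, 120]
```
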
10. Data analysis in EPI2ME

The EPI2ME desktop application

The EPI2ME desktop application simplifies genomic data analysis for scientists without the need for bioinformatics expertise. With its intuitive interface, users can navigate through a collection of preconfigured workflows for best-practice genomic analyses of nanopore data.

EPI2ME is compatible with Windows, macOS or Linux and can be installed on a laptop, desktop computer, cluster or cloud service, as well as directly on Oxford Nanopore devices with computing capability (PromethION or GridION). It can run locally or in the cloud and includes workflows running in real-time for time-critical applications.

The EPI2ME platform uses the latest, internally validated, open-source analysis pipelines to deliver a growing range of streamlined, best-practice analysis workflows. Available EPI2ME workflows include:

  • Human genomics. All-in-one variant detection, including SNPs, SVs, CNVs, STRs, and methylation.
  • Cancer genomics. Somatic variation detection from paired tumour/normal data.
  • Single cell and transcriptomics. Comprehensive analysis of full-length transcripts.
  • Microbiology and infectious disease. Real-time metagenomic species identification, and pathogen analysis workflows.
  • Genome assembly. Plasmid and bacterial genome assembly and annotation.
  • Targeted sequencing. Variant calling in amplicon sequences.

The workflows deliver intuitive and interactive reports and standard output files, including variants and methylation analysis.

In addition, EPI2ME provides industry-standard output files (e.g. VCF) and integrates directly with selected tertiary analysis tools, enabling more comprehensive downstream data analysis, including the interpretation of human genomic variants.

EPI2ME availability

You can log into the EPI2ME website (https://epi2me.nanoporetech.com) using the Single Sign-On credentials used for your Nanopore Community account.

Instructions for installing and running the EPI2ME Agent can be found in the EPI2ME protocol: https://community.nanoporetech.com/protocols/epi2me/.

11. Oxford Nanopore Technologies tools and pipelines

Oxford Nanopore Technologies tools and pipelines

Oxford Nanopore Technologies' GitHub repository (https://github.com/nanoporetech) contains a number of data analysis tools created by our R&D division. Most of the tools require some bioinformatics knowledge and use of the command line. Several pipelines, for example pinfish (a tool used for annotating genomes using long-read transcriptomics data), are described in more detail in our Bioinformatics tutorials: https://community.nanoporetech.com/knowledge/bioinformatics. These tutorials contain step-by-step instructions for answering a range of biological questions using Oxford Nanopore- or Community-developed analysis tools. Example datasets are provided for practice.

12. Custom analysis

Third-party tools

The Oxford Nanopore Resource Centre (https://nanoporetech.com/resource-centre) collates all Community-developed data analysis tools (available under the "Tools" tab). Most tools are available on GitHub, and require some knowledge of bioinformatics and use of the command line.

Last updated: 12/24/2019
