Analysing Oxford Nanopore data: a beginner’s guide to file formats

Whether you are using Oxford Nanopore sequencing to assemble a genome from telomere to telomere, identify microbes, or detect epigenetic modifications, all the information you gather along the way is packaged neatly into files. During data analysis, this information is repackaged from one file type to another, often several times.

At first glance, the variety of file formats involved in bioinformatics may feel overwhelming: what exactly is the difference between a FASTQ and a FASTA? How is a BED file different from a bedMethyl file? But you will find that just a few formats will get you a long way — and the same formats tend to crop up again and again.

In this Nanopore Know-How blog, we will introduce some of the file formats you can expect to encounter as you move from raw sequencing data to basecalling and deeper analysis.

Why are different file formats needed in bioinformatics?

As Oxford Nanopore expert Bryant Catano (Senior Product Support Scientist) put it in his masterclass introducing Oxford Nanopore data analysis, ‘bioinformatics is really all about converting data from one file format to another that is more suitable for answering your biological questions’. Each format meets the needs of a particular step in an analysis pipeline, including the types of data required, compatibility with the upstream and downstream tools being used, and suitability for short- or long-term storage.

In Oxford Nanopore sequencing, you begin the data analysis workflow with your raw sequencing data and initial steps like basecalling in MinKNOW, the software that controls the sequencing devices. From there, you can move into further analysis with the easy-to-use EPI2ME Desktop Application, which requires no prior bioinformatics experience. If you are new to EPI2ME, take a look at the blog EPI2ME: bioinformatics made simple.

With that context in mind, let’s start from the beginning, with the data produced by Oxford Nanopore sequencing itself.

From squiggles to basecalled data

The first file format produced is the POD5 file, which captures the raw data generated when DNA or RNA molecules pass through nanopores during sequencing. For a detailed walkthrough of this process, check out the blog How Oxford Nanopore sequencing works. In brief, a strand of a nucleic acid passes through a nanopore, causing changes in electrical current. The device captures these changes as an electrical signal known as a ‘squiggle’. Packed with information that can be extracted in analysis, these squiggles are recorded in POD5 files.

Next comes the calling of canonical and modified bases from your raw data. Thanks to real-time data streaming, you can perform this as soon as your raw data is produced. If you are only looking at canonical bases, FASTQ is the format you will need. The standard format across most sequencing data types, a FASTQ file is much smaller than the corresponding POD5 because it does not retain the raw signal recorded during sequencing. Instead, FASTQ files contain your canonical sequence, a quality score for each base, and run, sample, and read identification information.

Image introducing the features recorded in a FASTQ file.

FASTQ is a standard sequencing file format compatible with a wide range of analysis solutions.
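
To make the layout of a FASTQ record concrete, here is a minimal Python sketch that reads the four lines of each record and converts the quality string into numeric Phred scores. It assumes the common Phred+33 encoding and exactly four lines per record; the read name `read1`, the `sample=demo` header text, and the function name `read_fastq` are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no data we need
        quality = handle.readline().rstrip("\n")
        read_id = header[1:].split()[0]  # drop the leading '@', keep the first word
        # Assumption: Phred+33 encoding, i.e. score = ASCII code minus 33.
        phred_scores = [ord(character) - 33 for character in quality]
        yield read_id, sequence, phred_scores


# Hypothetical single-record FASTQ held in memory instead of a file on disk.
example = io.StringIO("@read1 sample=demo\nACGT\n+\nII5+\n")
records = list(read_fastq(example))
print(records)
```

Each base in the sequence line has exactly one character in the quality line, which is why the two lines always have the same length.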

If you need to align your data to a reference genome, your output will be a BAM file. BAM files are binary and compressed, so they typically take up less space than POD5 files or uncompressed FASTQ files. For alignment, you will need to provide a reference genome in FASTA format and an index file. You will need these same files to view your aligned BAM file in a genome browser such as Integrative Genomics Viewer. Unlike BAM, both FASTQ and FASTA are human-readable text formats.
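
Because FASTA is plain text, you can inspect a small reference with only a few lines of Python. The sketch below assumes a standard layout (a `>` header line followed by one or more sequence lines) and uses a made-up contig called `chr_demo`; the function name `read_fasta` is illustrative and does not come from any particular tool.

```python
import io


def read_fasta(handle):
    """Return a dict mapping each sequence name to its full sequence."""
    parts_by_name = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the name is the first word after '>'.
            name = line[1:].split()[0]
            parts_by_name[name] = []
        elif name is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            parts_by_name[name].append(line)
    return {name: "".join(parts) for name, parts in parts_by_name.items()}


# Hypothetical reference with one contig wrapped over two lines.
reference = read_fasta(io.StringIO(">chr_demo example contig\nACGTAC\nGTTA\n"))
lengths = {name: len(sequence) for name, sequence in reference.items()}
print(lengths)
```

Unlike FASTQ, a FASTA record holds no quality scores, only a name and a sequence.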

While the BAM format was initially created to store alignment data, it can also store unaligned data, and a growing number of downstream analysis tools now accept it. Furthermore, unlike FASTQ files, BAM files can store methylation data. To capture methylation information in your native DNA or RNA, with no need for special library prep, simply switch on modified basecalling in MinKNOW and select BAM as an output format. Oxford Nanopore sequencing is unique in its ability to directly detect epigenetic modifications: not only DNA modifications, such as 5mC, 5hmC, and 6mA, but also RNA modifications, including m⁶A and pseU.

Image showing the choice of BAM and/or FASTQ basecalled output files in MinKNOW

Output file formats are easy to specify in MinKNOW.
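
For readers curious about how a BAM record can carry methylation calls: the SAM/BAM specification defines optional `MM` and `ML` tags for base modifications. The following is a simplified sketch of how such tags can be decoded, not a full parser. It assumes a forward-strand read, a single modification type, and tag values already extracted as plain Python objects; the sequence, tag values, and function name are invented for illustration.

```python
def modified_base_calls(sequence, mm_entry, ml_values):
    """Decode one MM entry (e.g. 'C+m,1,0') into (read_position, probability) pairs.

    Each number in the MM entry says how many occurrences of the canonical
    base to skip before the next modified one. ML holds one 0-255 value per call.
    """
    code, *skips = mm_entry.rstrip(";").split(",")
    canonical_base = code[0]  # 'C' in 'C+m'
    candidates = [i for i, base in enumerate(sequence) if base == canonical_base]
    calls = []
    cursor = -1
    for skip, ml in zip(skips, ml_values):
        cursor += int(skip) + 1  # step past the skipped bases to the called one
        # An ML value N covers probabilities from N/256 to (N+1)/256;
        # the lower bound is reported here for simplicity.
        calls.append((candidates[cursor], ml / 256))
    return calls


# Hypothetical read: the C bases sit at 0-based positions 1, 3, 4, and 7.
calls = modified_base_calls("ACGCCGTC", "C+m,1,0", [230, 13])
print(calls)
```

In practice you would rely on an established tool or library to read these tags, particularly for reverse-strand alignments, where the bookkeeping is more involved.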

Store what matters

We recommend only holding on to the files that will serve the needs of your experimental goals in the short and long term. For most applications, we recommend storing only your compact FASTQ or BAM files, which require significantly less space than raw POD5 files. All Oxford Nanopore sequencing devices that feature integrated compute (GridION, PromethION 2 Integrated, and PromethION 24) can produce these file formats in real time, so that they are ready by the end of your sequencing run.

If you also choose to store your POD5 files, you can revisit your raw data later, for example to re-basecall it with the latest algorithms as they are developed, or to analyse methylation at a later stage.

From basecalls to variant calls and beyond

Once you have your basecalled and, if required, aligned data to hand, you can dive into further analysis using EPI2ME or third-party tools. This may take place when your sequencing run has finished and you have generated sufficient data to run your analysis, or it may proceed in real time during sequencing itself. EPI2ME not only produces the relevant file formats for your workflow, but also intuitive HTML reports through which you can easily see the results of your analysis.

If you are assembling your data, your consensus sequence will be output as a FASTA file. For example, to assemble whole bacterial genomes from isolate samples using the EPI2ME bacterial genomes workflow, you can input either the FASTQ or BAM files from the previous step. The workflow outputs your assembly as a FASTA file, as well as other files holding information such as annotations, sequence typing, and more.

If you are looking to identify variants in your basecalled data, you can expect to see specific file types for this process too. For example, the EPI2ME human variation workflow provides all-in-one calling of small and large variants, plus methylation data and haplotype phasing, from a single human whole-genome dataset. To use the workflow, you will need your BAM file of basecalled data aligned to a human reference genome. Your aligned data is output in the CRAM format, a more compact alternative to BAM that stores the same alignment information. The genomic variants identified in your data are output as variant call format (VCF) files, with separate files for different variant types, such as single nucleotide variants, structural variants, and short tandem repeats. Per-base methylation data, meanwhile, is output in the bedMethyl format, which allows you to view DNA methylation sites in a genome browser.
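
VCF files are tab-separated text, with eight fixed columns at the start of every data line. As a rough illustration, the Python sketch below splits one made-up line into those columns and checks whether it describes a single nucleotide variant. The example values (`chr1`, position 12345, the `DEMO_FLAG` entry) are invented, and a real analysis would normally use a dedicated VCF library rather than hand-written parsing.

```python
# The eight fixed columns defined for every VCF data line.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one VCF data line into a dict keyed by the fixed column names."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])  # positions in VCF are 1-based
    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    record["INFO"] = dict(
        item.split("=", 1) if "=" in item else (item, True)
        for item in record["INFO"].split(";")
    )
    return record


# Hypothetical data line; header lines starting with '#' are not handled here.
record = parse_vcf_line("chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30;DEMO_FLAG")
# A simple check: one reference base replaced by one alternative base.
is_snv = len(record["REF"]) == 1 and len(record["ALT"]) == 1
print(record, is_snv)
```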

The bedMethyl format is, in turn, an extension of the BED format. You can use the BED format to specify genomic locations, such as when you wish to focus data analysis on the targets in a sequence capture panel. In Oxford Nanopore sequencing, you will also encounter this file format when you use adaptive sampling: a bioinformatics-based target enrichment or depletion method that takes place during sequencing, without the need for special library preparation steps. To use adaptive sampling, simply provide MinKNOW with a BED file specifying the coordinates of the targets you plan to enrich or deplete. These targets could be complex gene panels, entire chromosomes, or even whole microbial genomes in a metagenomic sample. MinKNOW will then take care of the rest during real-time sequencing. With no need for primers or probes, updating your targets is as simple as editing your BED file.
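
One detail worth knowing when you write a BED file by hand is that BED coordinates are zero-based and half-open: the start column counts from 0 and the end column is not included in the region. The short Python sketch below converts one-based, inclusive coordinates (as you might copy them from a genome browser) into BED lines. The target names and coordinates are hypothetical, and you should check the MinKNOW documentation for the exact columns your adaptive sampling setup expects.

```python
def to_bed_line(chrom, start_1based, end_1based, name):
    """Convert 1-based inclusive coordinates into a tab-separated BED line."""
    # BED start is 0-based, so subtract 1; BED end is exclusive, so it stays as-is.
    return f"{chrom}\t{start_1based - 1}\t{end_1based}\t{name}"


# Hypothetical targets: (chromosome, first base, last base, label).
targets = [
    ("chr1", 1001, 2000, "geneA_demo"),
    ("chr2", 500, 599, "geneB_demo"),
]
bed_lines = [to_bed_line(*target) for target in targets]
print("\n".join(bed_lines))

# With this convention, region length is simply end minus start.
first_fields = bed_lines[0].split("\t")
first_length = int(first_fields[2]) - int(first_fields[1])
```

Updating your targets then comes down to editing the list and regenerating the file.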

Wherever your analysis takes you, we hope that this introduction helps as you get started.

For detailed information on every EPI2ME workflow, including breakdowns of input and output file options, take a look at the EPI2ME documentation.

Oxford Nanopore Technologies products are not intended for use for health assessment or to diagnose, treat, mitigate, cure, or prevent any disease or condition.
