NanoOK – Flexible, multi-reference software for pre- and post-alignment analysis of nanopore sequencing data, quality and error profiles

The recent launch of the Oxford Nanopore Technologies MinION Access Program (MAP) resulted in the rapid development of a number of open-source tools aimed at extracting reads and yield information from the HDF5 format files produced by the platform. However, the tools published so far are largely limited to producing FASTA/Q files, with few providing assistance with alignment. In particular, none of them provides alignment-based analysis and error profiling of nanopore reads, something that is critical for understanding the applicability of the technology to a new problem area and is often performed ad hoc. NanoOK has been written to address this gap.

NanoOK processes the raw HDF5 files output by the MinION® basecaller, extracts FASTA/Q format files, aligns to references, calculates a wide range of QC and error metrics, and finally consolidates information into a PDF report. Crucially, it is designed to support multiple concurrent references, enabling analysis of metagenomic samples and pooled libraries. NanoOK produces in-depth data on a variety of key metrics including: number of reads aligning; quality of alignment; coverage and perfect kmer plots for template, complement and 2D reads; analysis of longest perfect sequence; statistics on types of error (substitutions, indels); analysis of over- and under-represented kmers; location of error; error motifs, such as the n-mers preceding observed errors.
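
As an illustration of the kind of alignment-based metrics listed above, the following is a minimal Python sketch and not NanoOK's own code. It assumes a single pairwise alignment supplied as two equal-length gapped strings (reference and read, with "-" marking gaps); the function name profile_alignment, its parameters and the returned dictionary keys are hypothetical choices made for this example only.

```python
from collections import Counter


def profile_alignment(ref_aln: str, read_aln: str, n: int = 3) -> dict:
    """Summarise errors in one pairwise alignment given as two gapped strings.

    Hypothetical helper for illustration; it does not reproduce NanoOK's method.
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")

    counts = Counter()   # matches, substitutions, insertions, deletions
    motifs = Counter()   # reference n-mers immediately preceding an error
    longest = run = 0    # longest run of consecutive matching columns
    ref_seen = ""        # ungapped reference bases consumed so far

    for r, q in zip(ref_aln, read_aln):
        if r == "-" and q == "-":
            continue  # an all-gap column carries no information
        if r == q:
            kind = "match"
        elif r == "-":
            kind = "insertion"    # base present in the read only
        elif q == "-":
            kind = "deletion"     # base present in the reference only
        else:
            kind = "substitution"
        counts[kind] += 1

        if kind == "match":
            run += 1
            longest = max(longest, run)
        else:
            run = 0
            if len(ref_seen) >= n:
                motifs[ref_seen[-n:]] += 1

        if r != "-":
            ref_seen += r

    aligned = sum(counts.values())
    return {
        "counts": dict(counts),
        "identity": counts["match"] / aligned if aligned else 0.0,
        "longest_perfect": longest,
        "error_motifs": dict(motifs),
    }


if __name__ == "__main__":
    print(profile_alignment("ACGTAC-GTTA", "ACGAACCGT-A"))
```

For the toy alignment in the example, the sketch counts one substitution, one insertion and one deletion, and reports a longest perfect stretch of three bases. A real analysis would derive these columns from aligner output across many reads and references rather than from hand-written strings.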

Full source code is available from the TGAC GitHub site (https://github.com/TGAC/NanoOK) and documentation is at https://documentation.tgac.ac.uk/display/NANOOK/NanoOK. Here can be found a description of the tool and associated sequence data, as well as some analysis of real-world samples.

Authors: TGAC (The Genome Analysis Centre), Richard Leggett, Robert Davey