Edward DeLong - No assembly required: Single nanopore reads yield complete virus genome sequences from naturally occurring microbial community DNA


Edward DeLong, from the University of Hawaii, gave a talk expanding upon the work referenced earlier in the day by Eoghan Harrington of the Oxford Nanopore Applications department. Edward began by describing Station ALOHA, the research station where the samples analysed later in the talk were taken. To give an insight into the types of research this station facilitates, Edward spoke about an ongoing 30-year time-series experiment that aims to track changes in ocean microbial populations. Using the example of ocean acidification, Edward explained that one interest of microbial ecologists is to determine how microbial populations change over time in response to environmental variables. Ocean acidification is directly linked to climate change: CO2 dissolves in sea water, reducing the pH and potentially affecting microorganisms such as coccolithophores, whose outer protective plates are made of calcium carbonate. Segueing into a description of ocean microbes as the “forests of the sea”, Edward mentioned that ocean microbial populations carry out many essential biogeochemical processes, and that organisms such as the cyanobacteria act as primary producers. However, the focus of this talk was ocean phages, as they affect, maintain, and alter prokaryotic populations through, among other things, infection and the release of nutrients through cellular lysis. Because phage and prokaryotic ocean populations are so interrelated, understanding not only which phage taxa are present in sea water, but also how these populations change over environmental and temporal gradients, is important for understanding how changes in environmental conditions may alter ecosystem function. Doing this poses a number of challenges, not least of which is a lack of good viral reference genomes.

Moving on to describe his study of viruses in the oceans, Edward described how sea water from the Station ALOHA study site was taken at depths of 15 m, 117 m and 250 m. As an overview of the methods used, water was put through filtration systems in order to select for virions, and nucleic acids were then extracted from each sample using a Qiagen Genomic-tip 20/G to produce 2-5 µg of DNA. This was used as the input material for the standard LSK-109 ligation sequencing kit by David Dai at the New York branch of Oxford Nanopore's Applications department. Samples were sequenced on a GridION using 9.4.1 flow cells, and the resultant data was put through an analysis pipeline constructed by Oxford Nanopore's Applications bioinformatician, John Beaulaurier. First, the raw 1D reads were put through Kaiju to generate taxonomic bins of known viruses. The remaining un-binned reads were filtered for known cellular fractions to remove sequences belonging to known non-viral organisms. K-mer clustering was then performed in order to generate k-mer bins alongside the viral taxonomic bins. The k-mer bins were further processed by Canu for error correction and, using a read length filter, reads that spanned whole potential viral genomes were isolated. Next, these reads were clustered based on average nucleotide identity using PyANI and polished using Racon and Nanopolish to produce draft phage genomes.
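To make the k-mer clustering step more concrete, the sketch below computes a normalised k-mer frequency profile for a single read. It is only an illustration: the talk did not state the k-mer size or the tool used, so the function name `kmer_profile` and the default `k=4` are assumptions. Profiles like this, one per read, are the kind of input that a dimensionality-reduction and clustering step could work on.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Return the normalised frequency of every ACGT k-mer in seq.

    The vector is ordered lexicographically (AAAA, AAAC, ...).
    k=4 is an assumed value, not one given in the talk.
    K-mers containing other characters (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        # Read shorter than k, or no valid k-mers at all.
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


# Example: dinucleotide profile of a short toy sequence.
profile = kmer_profile("ACGTACGT", k=2)
print(len(profile))  # one entry per possible 2-mer
```

Reads with similar profiles tend to come from similar genomes, which is why composition-based binning can work without a reference database.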

Expanding upon each section of the data analysis, Edward explained in more detail how t-distributed stochastic neighbour embedding (t-SNE) plots were constructed and how the density of points on the plot was used to delineate taxonomic bins in a reference-independent fashion. Then, using Kaiju, each taxonomic bin was filtered for cellular DNA “noise” and screened for unique properties of phage genomes to determine which bins could be used for correction and draft genome construction. Edward said that initially they had difficulty assembling the reads from each bin but, upon examining the read length distribution, he proposed that this might be because whole viral genomes had been captured in single reads. The next task was to convince himself this was true. To do this, the read length distribution of each bin was compared with the overall distribution of sequence lengths in the sample, and the comparison suggested that, indeed, whole viral genomes were covered by single reads. Furthermore, using pairwise comparisons of average nucleotide identity within each bin, Edward showed that many bins contained a single viral genome while others contained multiple, closely related viral genomes. In the former case, one read was picked and polished using the rest of the reads within the cluster, while in the latter, a read from each average nucleotide identity sub-cluster was chosen as a reference and polished.
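The read length comparison described above can be sketched as follows. This is a minimal illustration rather than the pipeline's actual method: the function name `modal_length_fraction`, the 1 kb histogram bin width, and the toy read lengths are all assumptions. The idea is that a bin of reads spanning one whole genome should pile up around a single length, whereas the sample as a whole should not.

```python
from collections import Counter


def modal_length_fraction(lengths, bin_width=1000):
    """Return (start of the modal length bin, fraction of reads in it).

    lengths   -- read lengths in bp
    bin_width -- histogram bin width in bp (assumed value)
    Ties between equally full bins go to the shorter length bin.
    """
    if not lengths:
        raise ValueError("no read lengths given")
    hist = Counter(length // bin_width for length in lengths)
    modal_bin, count = max(hist.items(), key=lambda kv: (kv[1], -kv[0]))
    return modal_bin * bin_width, count / len(lengths)


# Toy data: a cluster dominated by ~35 kb reads versus a mixed sample.
cluster = [35200, 35450, 35900, 35100, 12000]
sample = [1000, 5000, 12000, 35200, 60000, 8000, 22000, 41000]
print(modal_length_fraction(cluster))
print(modal_length_fraction(sample))
```

A high modal fraction in a cluster, relative to the whole sample, is consistent with (but does not prove) the reads each covering a complete genome of that length.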

In order to validate these draft viral genomes, a number of checks were performed. A program called VirSorter was used to determine how many of the reads within each cluster were predicted to be of viral origin, and this resulted in 100 % concordance. Furthermore, 95 % of the polished reads had 200 – 2000 bp direct terminal repeats, a feature very common in viral genomes. Pulsed-field gel electrophoresis results for known oceanic phages were compared with the read length distributions of the suspected phage bins, and the polished read lengths matched closely, suggesting that whole viral genomes had indeed been captured in single nanopore reads. Next, homology between the polished reads and a number of environmental viral genome databases was calculated, showing that the majority of proposed genomes had significant homology to known marine viral genomes. However, Edward pointed out that a number of novel viral genomes were detected in just these three samples and that, although their homology with known viral genomes was low, they had many of the characteristics of a complete phage genome.
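The direct terminal repeat check lends itself to a small example. The sketch below looks for the longest stretch at the start of a sequence that is repeated exactly at its end, within the 200 to 2000 bp range mentioned above. It is a deliberate simplification: the function name `find_direct_terminal_repeat` is hypothetical, and exact string matching is assumed for clarity, whereas real nanopore reads, even after polishing, would normally need an alignment that tolerates errors.

```python
def find_direct_terminal_repeat(seq, min_len=200, max_len=2000):
    """Return the length of the longest exact direct terminal repeat.

    A direct terminal repeat here means seq[:n] == seq[-n:] for some
    n between min_len and max_len (and at most half the sequence).
    Returns 0 if no such repeat exists. Exact matching is an
    assumption made for illustration only.
    """
    seq = seq.upper()
    upper = min(max_len, len(seq) // 2)
    for length in range(upper, min_len - 1, -1):
        if seq[:length] == seq[-length:]:
            return length
    return 0


# Toy genome: a 240 bp repeat flanking 1000 bp of unrelated sequence.
repeat = "ACGTTGCA" * 30
genome = repeat + "G" * 1000 + repeat
print(find_direct_terminal_repeat(genome))
```

Searching from the longest candidate downwards means the first hit is the full repeat rather than a shorter sub-repeat contained within it.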

Discussing the results of the depth study, Edward showed that as depth increased, the proportion of phage genomes of unknown origin increased, whereas many at the 15 m depth were reasonably well characterised taxonomically. Furthermore, many of the marker genes expected to be seen in viral populations could be detected in these draft genomes.

Edward then showed some short-read phage data and how these populations change across time and depth. This highlighted both dynamic and predictable patterns in ocean phage communities, with some taxa appearing and disappearing sporadically while the ecological patterns of others were relatively predictable. Finishing this section, Edward compared short-read sequencing with nanopore sequencing on the exact same samples and showed that nanopore reads appear to recover rare phage types very efficiently.

In his closing remarks, Edward said that nanopore sequencing of viral metagenomic samples required no assembly, with many reads spanning whole viral genomes. In addition, nanopore sequencing efficiently recovers whole virus sequences from complex environmental samples and, compared with standard short-read assembly methods, novel viral types seem to be recovered using a nanopore-based approach.