Eoghan Harrington: From mixed bags to magic bullets: Solving challenging problems in genomics

Eoghan Harrington, a Senior Applications Bioinformatician, took to the stage to give an update on the work being carried out by the Applications department at Oxford Nanopore Technologies. He started by explaining the role of both the U.K. and U.S. contingents, describing the former as the sample technology specialists and the latter as the customer-like users that showcase the technology by using it to answer biological research questions. Giving an overview of the theme of his presentation, Eoghan stated that the majority of his talk would revolve around applications that were once difficult but have now become possible through the rapid platform improvements seen over the last few years (video: http://bit.ly/2DGSENr). Furthermore, Eoghan stressed that the majority of the technology showcased in his talk is currently available to customers.

Eoghan cited the assembly of large vertebrate genomes with long nanopore reads as a procedure that was, until recently, considered a significant challenge but has now, with the increases in throughput and single-molecule accuracy of Oxford Nanopore sequencing, become “almost routine”. He described the overarching goals of large genome sequencing as obtaining resolved haplotypes and telomere-to-telomere contigs with high accuracy in order to generate accurate gene models. Nanopore long reads bring scientists closer to these goals by addressing a number of key challenges: large genome size, repeat regions, heterozygosity and sample heterogeneity. As an example of sample heterogeneity, Eoghan described some “unfinished business” raised by Clive Brown in his London Calling 2018 talk: the extensive terminal blocking of pores caused by DNA libraries prepared from chicken blood. The solution to this problem was a platform fix involving a nuclease wash that would “refresh” the blocked pores. Eoghan mentioned that “depending upon who you talk to, this solution could be viewed as either a scalpel or sledgehammer”.

Showing how effective this was on a PromethION, Eoghan described how multiple nuclease washes took the overall sequence yield from a respectable 50 Gb to just under 100 Gb from a single flow cell. He then discussed how these data were used to create a genome assembly totalling 960 contigs, with a contig N50 of 18.5 Mb, in under a week. He noted that recent algorithmic improvements have since reduced this computation time from days to hours.
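For readers unfamiliar with the contig N50 statistic quoted above, the following is a minimal sketch of how it is usually computed: contigs are sorted from longest to shortest, and the N50 is the length of the contig at which the running total first reaches half of the assembly size. The function name and the toy lengths are illustrative assumptions and are not taken from the talk.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running total to at least
        # half of the assembly size defines the N50.
        if running >= half_total:
            return length


# Toy contig lengths, invented for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(contig_n50(toy_lengths))  # prints 70
```

A higher N50 for the same total assembly size means that more of the genome sits in long contigs, which is why the statistic is commonly used as a measure of contiguity.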

Putting this into context, Eoghan showed that previous attempts to assemble this genome began in 2004 and only approached this level of completeness once long reads were used, across multiple sequencing runs on other long- and short-read technologies.

To demonstrate how these advances in large genome assembly can be easily transferred to Oxford Nanopore users, Eoghan described the new protocol builder (poster: http://bit.ly/2FHi4ND, protocol builder: http://bit.ly/2QfVHmx), which guides users through bespoke protocols and suggests data analysis routes one might use to answer the proposed research question. He then went on to discuss nanopore chromatin conformation capture, named "Pore-C" (poster: http://bit.ly/2QgcgyE). Pore-C is a way of obtaining information on the spatial localisation of specific regions of DNA, and also of generating long-range information that helps improve scaffold contiguity when assembling large genomes. Moving on, Eoghan described new updates to the cDNA and direct RNA kits that allow accurate detection of strand-specific isoforms (poster: http://bit.ly/2R80J1s), in combination with a number of bioinformatic tools written by Oxford Nanopore to help users do this (pinfish software link: http://bit.ly/2RdSQYq). Generating full-length reads spanning whole transcripts provides further information to help generate accurate gene models, especially in non-model organisms. Finishing the first section of his talk, Eoghan touched on the new sequence reader head developed by Oxford Nanopore, R10, and its ability to generate higher-accuracy consensus assemblies by dealing with homopolymeric regions. This will be expanded upon in Clive’s talk later.

The next section revolved around the “holy grail” of microbial genomics: assembling full genomes from metagenomic samples. One of the common aims of metagenomics is to understand the interplay between different microbial taxa in a given ecological niche. To achieve this, genomes must be as complete as possible so that the metabolic pathways involved in this interplay can be accurately inferred. Along with the long-read nature of nanopore sequencing data, drastic improvements in throughput over the last few years have allowed numerous researchers to attempt to assemble full genomes from these complex systems. Eoghan explained that there are four main ways to partition out sequences belonging to different taxa in a mixed sample in order to aid assembly. These methods are: reference-based approaches, where sequences are aligned to known genomes; binning by sequence composition; exploiting changes in abundance patterns; and using sequence length variability (poster: http://bit.ly/2E0vKBA). Eoghan touched on a project with Ed DeLong in which three sea water samples were depleted of non-phage sequences by a reference-based approach prior to clustering based upon sequence composition. Read length variability was then used to help identify reads spanning full-length phage genomes (poster: http://bit.ly/2P4Q7ih). This showed how read lengths could be used as a taxonomic marker in themselves.
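To illustrate what binning by sequence composition can look like in practice, the sketch below computes a normalised tetranucleotide (4-mer) frequency profile for a sequence and the Euclidean distance between two profiles; sequences from the same genome are generally expected to have more similar profiles than sequences from different genomes. This is a simplified illustration under stated assumptions, not the method used in the talk: the function names are hypothetical, k-mers containing ambiguous bases are ignored, and reverse complements are not merged as many binning tools would do.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Normalised k-mer frequency vector for one sequence (A/C/G/T only)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers made purely of A, C, G and T contribute to the total.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def profile_distance(a, b):
    """Euclidean distance between two k-mer profiles of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Toy sequences, invented for illustration only.
at_rich = kmer_profile("ATATATTTAAATATATAT")
gc_rich = kmer_profile("GCGCGGGCCCGCGCGCGC")
print(profile_distance(at_rich, at_rich))  # identical profiles give 0.0
print(profile_distance(at_rich, gc_rich) > 0)
```

In a real workflow the profiles of many reads or contigs would be passed to a clustering step; that step is omitted here.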

While whole phage genomes may be captured in single reads, a suggestion from Mads Albertsen of Aalborg University was to exploit the fact that using different DNA extraction methods on the same sample can cause differential lysis of organisms (poster: http://bit.ly/2E0vKBA). The differences in abundance of specific sequences between the extraction methods can be used to infer bins of taxonomic significance. This expands upon a previously suggested method that uses changes in abundance across a time series, but it can be performed on a single sample, removing the need for complex time series experimental designs. Finishing this section, Eoghan described something “hot off the pores”: a new version of Pore-C, called metaPoreC, in which the Pore-C method was applied to a model system of two bacterial species. Results showed that individual genomes, and their associated plasmids, could be differentiated and resolved, suggesting that this type of interaction measurement could further aid metagenomic assembly.
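The differential-abundance idea can be sketched as follows: for each contig, take the log2 ratio of its coverage under two extraction methods, then group contigs whose ratios are close, on the assumption that sequences from the same organism respond to lysis in the same way. Everything in this sketch is an illustrative assumption rather than the published method: the function names, the pseudocount, the gap threshold and the toy coverage values are all hypothetical.

```python
import math


def log2_coverage_ratio(cov_a, cov_b, pseudocount=1.0):
    """log2 ratio of coverage under extraction A versus extraction B."""
    return math.log2((cov_a + pseudocount) / (cov_b + pseudocount))


def bin_by_ratio(coverages, max_gap=0.5):
    """Group contigs whose log2 coverage ratios lie close together.

    coverages maps contig name -> (coverage_a, coverage_b).
    A new bin starts whenever the sorted ratios jump by more than max_gap.
    """
    ratios = sorted(
        (log2_coverage_ratio(a, b), name) for name, (a, b) in coverages.items()
    )
    bins = []
    previous = None
    for ratio, name in ratios:
        if previous is None or ratio - previous > max_gap:
            bins.append([])
        bins[-1].append(name)
        previous = ratio
    return bins


# Toy coverages, invented for illustration only.
toy = {"c1": (31, 7), "c2": (63, 15), "c3": (7, 31), "c4": (15, 63)}
print(bin_by_ratio(toy))  # prints [['c3', 'c4'], ['c1', 'c2']]
```

With more than two extraction methods, each contig would have a vector of coverages rather than a single ratio, and a general clustering method would replace the simple gap rule used here.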

The final section of the talk involved the titular “magic bullet”, a reference to the film JFK, in which a single bullet takes a convoluted path resulting in much more destruction than one would think possible. Here it was used as a metaphor for a complex structural variation involving a duplication-inverted triplication-duplication with loss of heterozygosity. Detecting long, complex structural variations is very difficult, and they can only really be resolved using long-read technology. The mechanism behind this structural variation involves fork stalling and template switching with microhomology-mediated break-induced replication (poster: http://bit.ly/2RdnVeR). The whole structural variant spans approximately 2 Mb, making it a prime example of this kind of variant. A PromethION was used to generate 16x coverage of the 2 Mb region, which was enough to resolve this rearrangement. Eoghan explained, step by step, how each section of the duplication-inverted triplication-duplication event could be distinguished using the sequencing data. Across the triplicated section of the structural variant, the copy number could be seen to increase from 1 to 4, representing counts from both the triplicated allele and the wild type allele. When individual reads were interrogated, those spanning the duplicated sections were shown to align to the reference in a strand-specific manner, highlighting the predicted breakpoints and thus providing further evidence that the hypothesised mechanism was correct. Finally, Eoghan showed how loss of heterozygosity could be detected from that point in the chromosome onwards, due to a switch to the homologous chromosome, and he explained how analysis of absence of heterozygosity (AOH) could reveal recessive mutations through loss of a dominant allele. In addition, chromosome 14 contains imprinted loci that overlap the region with AOH, so the parent of origin for this region is important for the phenotype. As a future step, Eoghan suggested investigating the methylation patterns present in the same set of nanopore data to confirm the proposed mechanism of variation.
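As a rough illustration of how copy number changes can be read from sequencing depth, the sketch below scales the mean depth of each segment against a baseline region of known copy number and rounds to the nearest integer. It is a simplified sketch under stated assumptions, not the analysis presented in the talk: the function name is hypothetical, the flanking region is assumed to be present in two copies, and the depth values are invented.

```python
def estimate_copy_number(segment_depths, baseline_depth, baseline_copies=2):
    """Estimate integer copy number per segment from mean read depth.

    baseline_depth is the mean depth of a region assumed to carry
    baseline_copies copies (two for an unaffected diploid region).
    """
    if baseline_depth <= 0:
        raise ValueError("baseline_depth must be positive")
    return [
        round(baseline_copies * depth / baseline_depth)
        for depth in segment_depths
    ]


# Invented mean depths for five consecutive segments:
# flank, duplicated, triplicated, duplicated, flank.
toy_depths = [16.0, 24.5, 31.0, 23.0, 16.5]
print(estimate_copy_number(toy_depths, baseline_depth=16.0))
# prints [2, 3, 4, 3, 2]
```

Depth alone gives only the number of copies; the orientation and order of the segments come from the alignments of individual long reads across the breakpoints, as described above.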

To see all of the apps posters demonstrating these platform improvements and how they have helped answer specific research questions, see this link: http://bit.ly/2AoieUb.