Webinar - Using ultra-long reads to fully characterise genomes
The ability to obtain reads of any length with Oxford Nanopore sequencing technology provides a unique opportunity to access the human genome in a way that has not been previously possible. In collaboration with Circulomics, Oxford Nanopore Technologies has released an Ultra-Long DNA Sequencing Kit, optimised for the production of ultra-long sequencing reads (>50 kbp). In this webinar, Rachel Rubinstein, Field Based Sequencing Product Manager (Oxford Nanopore Technologies), was joined by Miten Jain, Research Scientist (UCSC), and Duncan Kilburn, Principal Scientist (Circulomics), to explore how ultra-long sequencing reads can be used to fully characterise genomes: from accessing highly repetitive regions, to resolving large-scale structural aberrations, with modification data provided in the same sequencing run.
Answers from Miten Jain and Duncan Kilburn to your questions during the webinar:
1. What read lengths can you expect with pure preparations of DNA from genomic regions in bacterial artificial chromosome (BAC) clones in ultra-long sequencing?
Duncan Kilburn: The read length would depend on how long the BACs are, but we would expect it to be possible to generate reads that span the entire length of the BAC. As a related example, we have seen 50x full-length coverage (i.e., the transposase cuts once) of a naturally occurring 200 kb plasmid. The key here would be setting the ratio of FRA to DNA so that the transposase cuts once per BAC.
2. Is there data on CRISPR-Cas9 target enrichment techniques in combination with ultra-long nanopore reads?
Miten Jain: There is some work that I’m aware of, including work we are helping graduate students with at Santa Cruz, where we are combining Circulomics-derived, ultra-high-molecular-weight DNA with CRISPR-Cas9 enrichment to address targeted questions. But that kind of application is still under development, and I’ve not seen significant amounts of data. What we, and the Community, are more excited about is doing ultra-long sequencing with adaptive sampling, because that's where other interesting applications kick in and it’s a bit easier to try. At the moment, the CRISPR-Cas9-based chemistry also requires downstream steps with clean-ups and ligation-based chemistries, which make it a little challenging to preserve the read lengths by the time the DNA is delivered to the nanopore.
3. What depth of ultra-long data and read N50s do you need to get a gapless genome assembly?
Miten Jain: I put this question to Sergey Koren, who along with Adam Phillippy is one of the leading experts on genome assembly in general and on nanopore assembly in particular, and he said that if you have reads over 128 kb long, then you can pretty much assemble the whole genome in a gapless manner, except for ribosomal DNAs. rDNAs are still a bit of a challenge, and that's something that Adam and Sergey are taking on at the moment. So, I'd say, once you're past 100-150 kb, it is possible to achieve a relatively complete gapless genome. Of course, there's always a trade-off between accuracy, length, and coverage, and that trade-off will get better over time.
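For readers less familiar with the read N50 asked about in this question, here is a minimal Python sketch of how that statistic is commonly computed from a list of read lengths. The function name `read_n50` and the toy read lengths are illustrative assumptions and do not come from the webinar.

```python
def read_n50(lengths):
    """Return the read N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    total_bases = sum(lengths)
    running = 0
    # Walk from the longest read down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total_bases:
            return length


# Toy read lengths in bases (made up for illustration only).
toy_reads = [200_000, 150_000, 120_000, 80_000, 50_000, 20_000]
print(read_n50(toy_reads))
```

In practice the lengths would be read from a sequencing summary or FASTQ file rather than typed in by hand.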
4. We'd really love to hear more about using Shasta with ultra-long datasets. Are there any differences in the workflow, the commands, compute time, etc. for ultra-long versus standard? Do you have any recommendations or tips for other people?
Miten Jain: With Shasta, we have configuration files for different types of datasets. For ultra-long, there is a configuration file that has the letters ‘UL’ in the name. For example, the most recent one is the September 2020 UL Shasta config. You use that in combination with a minimum read length cutoff, which sets the length below which reads are discarded and which you choose depending on how long the reads in your dataset are. Typically, in conventional settings, Shasta will remove everything below 10 kb. With ultra-long read sets, we routinely change that minimum cutoff to 35 to 50 kb. As long as you can recover a 40 to 60x coverage dataset above those read lengths, that still gives you a pretty good assembly. In terms of runtime, it's a little bit slower with the ultra-long reads, which is no surprise because you have very long reads to compute, but not significantly. An ultra-long read assembly would still be in the neighbourhood of two to three hours per human genome with Shasta using these configurations and modified read length settings.
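The interplay between the minimum read length cutoff and the coverage that remains can be sketched in a few lines of Python. This is not Shasta code; the function names, the toy genome size, and the toy read set are assumptions made purely for illustration. The sketch computes the coverage retained above a cutoff and picks the highest candidate cutoff that still meets a target coverage.

```python
def coverage_above(lengths, min_length, genome_size):
    """Coverage contributed by reads at least min_length bases long."""
    kept_bases = sum(length for length in lengths if length >= min_length)
    return kept_bases / genome_size


def highest_cutoff_meeting(lengths, cutoffs, genome_size, target_coverage):
    """Return the largest candidate cutoff that still retains the target
    coverage, or None if no candidate does."""
    for cutoff in sorted(cutoffs, reverse=True):
        if coverage_above(lengths, cutoff, genome_size) >= target_coverage:
            return cutoff
    return None


# Toy example: a 1 Mb "genome" and a made-up mix of read lengths.
genome_size = 1_000_000
reads = [100_000] * 300 + [40_000] * 400 + [8_000] * 1000

for cutoff in (10_000, 35_000, 50_000):
    print(cutoff, coverage_above(reads, cutoff, genome_size))

# Highest cutoff that keeps at least 40x coverage in this toy data set.
best = highest_cutoff_meeting(reads, [10_000, 35_000, 50_000], genome_size, 40)
print(best)
```

With real data, the same calculation on the actual read length distribution indicates whether a 35 kb or 50 kb cutoff would still leave coverage in the 40 to 60x range mentioned above.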
5. Would you recommend re-basecalling nanopore data whenever basecaller improvements come out to supplement the kit and prep improvements?
Miten Jain: Almost always. The updates tend to be pretty significant, and even if you don't adopt every version, because some of those are just bug fixes, every major model release has significantly improved the assemblies that we have seen. Most recently, with Guppy 5, the error profile has changed significantly, with much of the non-systematic error removed. The results have also truly improved, both in the assembly itself and in downstream phasing and variant calling. So, as much as possible, work with the newest model. At this point the baseline should be Guppy 5 or higher, because Guppy 5 is the biggest game changer in basecaller models thus far.
6. Any advice for barcoding samples: how many samples and how many micrograms of DNA?
We have barcoded/multiplexed 12 bacterial ultra-long sequencing samples on a GridION, which we presented as a poster at London Calling 2021 (Kilburn et al.). The input per cell type was 167 µL of 1 OD culture, i.e., 1/12th of the standard input from the Circulomics UHMW extraction protocols for Gram-negative and Gram-positive bacteria. If you would like more details of that protocol, please contact us at firstname.lastname@example.org and we'd be happy to help.
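The input arithmetic in this answer can be checked with a short Python sketch. The standard input volume is not stated in the answer; the value below is only what the quoted figures imply (167 µL being 1/12th of it), and the function name is a hypothetical helper.

```python
def per_sample_input_ul(standard_input_ul, n_samples):
    """Split a standard input volume evenly across multiplexed samples."""
    return standard_input_ul / n_samples


n_samples = 12
per_sample_ul = 167  # µL of 1 OD culture per sample, as quoted above

# Standard input implied by the quoted figures (an inference, not a
# value stated in the webinar): roughly 2 mL of 1 OD culture.
implied_standard_ul = per_sample_ul * n_samples
print(implied_standard_ul)

# Going the other way: a 2000 µL input split across 12 samples.
print(round(per_sample_input_ul(2000, n_samples)))
```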
7. There have been a couple of questions about other sample types for the Circulomics kit. What sample types are in development for the Circulomics kit and what sample types are available now?
Duncan Kilburn: At the moment, the more mature protocols we have are for cultured cells, blood, and bacteria, and we also have protocols for people wanting to do plant and tissue sequencing. The plant and tissue protocols don’t yet give quite the same N50 and maximum read lengths as those for cultured cells, blood, and bacteria, but they do give ultra-long reads and the protocol works. We’re currently working to optimise the plant and tissue protocols, and we're also hoping to move towards more metagenomic, fungal, and insect samples. Hopefully, we'll have protocols for those soon.
8. When loading the library to the flow cell, must you also use a wide-bore tip?
I recommend wide-bore tips to preserve the Mb+ reads, but using standard tips will not impact the read N50 or reads in the 100-500 kb range.
9. There have been some questions about other kits and their efficacy for ultra-long DNA extraction. Do you want to comment on what you've seen in your experiments?
Miten Jain: If we think about it from an evolutionary perspective, ultra-long sequencing began with Josh Quick and Nick Loman’s work on phenol/chloroform-based extraction. Since then, there's been a lot of work with kits from different companies, such as QIAGEN Puregene and NEB Monarch, leading up to Circulomics themselves, who did a version of the long-read kit based on the Short Read Eliminator, and then the ultra-long kit now. What we've seen is that they all work to varying degrees. The phenol/chloroform method was the original, which gave very long reads; then there were the intermediate ones that give you long reads, where getting 50 kb is no problem. And then, the highest throughput and read length that we've seen thus far have been with this newest combination of the Oxford Nanopore Ultra-Long Kit and the Circulomics ultra-long preparation. So, they all work to some degree, and it's not that any one of them is terrible in terms of throughput or quality, but the newest combination really outperforms everything that we've seen thus far.
Duncan Kilburn: The extraction protocol in our kit was developed in tandem with the Oxford Nanopore Ultra-Long Kit and it has been specifically optimised for this protocol. So, that may be where you see the performance benefits, because they've been developed side by side.
Miten Jain: One key point to add to what Duncan said is that this in-tandem development has translated very, very well onto the PromethION. So, you're not limited to a MinION or a GridION, but can get truly high-throughput, long-read sequencing. A PromethION giving you 150 gigabases with these kinds of reads is essentially what ligation chemistry, more often than not, delivers in the field.
Rachel Rubinstein: I will add, PromethION of course is amazing, but you can get really good data with MinION and GridION as well.
10. Do you have any experience with generating ultra-long reads from yeasts? And what about fungi?
Duncan Kilburn: Not yet, but they are both on the list of samples we're developing ultra-long extraction protocols for.
11. Do you think you can get a fully haplotype-resolved human genome assembly yet?
Miten Jain: We do. CHM13, which was recently published in a pre-print as the first truly complete human genome by the Telomere-to-Telomere (T2T) Consortium, is now available. It's the first of its kind because you have all complete chromosomes and the sex chromosomes, with the centromeres and telomeres. The hope is that this will pave the way for more complete genomes, and the T2T Consortium, as well as the wider community, is now working on completing a truly diploid genome, which is next in line.