Interview: Making telomere-to-telomere genomic assemblies accessible

Alexander Wittenberg is a Genomics Scientist at KeyGene, where he is currently responsible for scouting new genomics technologies and is involved in the development of innovative sequence-based technologies within KeyGene’s Genome Insights crop innovation platform. He also works closely with R&D and business development departments within KeyGene to translate these technologies to the market for KeyGene’s partners.

Alexander recently shared some of his work during our Knowledge Exchange ‘Making telomere-to-telomere genomic assemblies accessible: examples from human and plant genomes’. We caught up with Alexander to ask him some of the questions that came from online viewers who watched his presentation.

What are the advantages of duplex reads longer than 20 kb?

Duplex reads at ~20–25x coverage per haplotype, with lengths of ~15–25 kb, already give optimal duplex yield and lead to a very contiguous assembly for most crop genomes. Depending on the complexity of the genome, these assemblies can already contain a number of chromosomes assembled telomere to telomere (T2T). Duplex reads >25 kb can be of value for spanning complex, repetitive regions and can therefore help reduce the number of remaining gaps in an assembly. Longer duplex reads can also help in phasing the maternal and paternal chromosomes in heterozygous and polyploid genomes.
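
As a rough sketch of what this coverage target means in data volume, the snippet below multiplies the per-haplotype coverage by the haploid genome size and the ploidy. The function name and the 1 Gb diploid genome are assumptions chosen for illustration only; they do not come from the interview.

```python
def required_duplex_bases(haploid_genome_size_bp, ploidy, coverage_per_haplotype):
    """Total duplex bases needed to reach the given coverage on every haplotype."""
    return haploid_genome_size_bp * ploidy * coverage_per_haplotype


# Hypothetical example: a diploid genome with a 1 Gb haploid size (assumption).
HAPLOID_SIZE_BP = 1_000_000_000
low = required_duplex_bases(HAPLOID_SIZE_BP, ploidy=2, coverage_per_haplotype=20)
high = required_duplex_bases(HAPLOID_SIZE_BP, ploidy=2, coverage_per_haplotype=25)
print(f"Duplex data needed: {low / 1e9:.0f}-{high / 1e9:.0f} Gb")
```

For a polyploid genome, the same function applies with a higher `ploidy` value, which is why the duplex data requirement grows quickly for such genomes.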

Would chromatin conformation data contribute to the quality of a crop genome assembly? When would you recommend it, and which approach would you use?

Long-range contact information provided by higher-order chromatin structures arises from physical interactions between many genomic loci. These data can be used to scaffold contig-level assemblies to chromosome level, as well as in the phasing process to separate the parental chromosomes. Pore-C has a number of advantages over short-read-based approaches:

  • It captures multi-way contacts between genomic loci, in contrast to capturing only the interaction between a pair of loci, providing a higher resolution.
  • The PCR‐free strategy of Pore‐C enables direct detection of DNA methylation and higher‐order chromatin interaction.
  • The data can be generated in house, on the same platform, and is cost effective.

What other crops are you and your team looking into for T2T assemblies?

Besides tomato and maize, we are working together with the International Lettuce Genomics Consortium to generate a T2T lettuce genome. Furthermore, we are in the process of generating a number of phased T2T reference genomes for diploid and triploid banana varieties.

What are the names of the software tools for automatic discovery of structural variations based on long reads? And can these tools combine with primer design for screening?

There are many tools (e.g. cuteSV, pbsv, Sniffles, NanoVar, NanoSV, SVIM and DeBreak) to detect structural variation in a genome, and these tools keep on improving. They do not perform primer design, but there is a range of other tools that can do this. For tomato specifically, a number of papers have been published on the detection and biological impacts of structural variants [1–3].
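
To illustrate how the output of a structural variant caller could be handed to a separate primer design tool, here is a minimal sketch that derives flanking windows around one variant. It assumes the caller writes VCF records with an `END` tag in the INFO column; the function name, the 500 bp flank size and the example record are hypothetical and were invented for this illustration.

```python
def sv_flanks(vcf_line, flank=500):
    """Return (chrom, left_window, right_window) around one structural variant.

    Windows are (start, end) tuples in 1-based VCF coordinates.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, info = fields[0], int(fields[1]), fields[7]
    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    tags = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    # Fall back to POS when no END tag is present (e.g. for insertions).
    end = int(tags.get("END", pos))
    left = (max(1, pos - flank), pos)
    right = (end, end + flank)
    return chrom, left, right


# Invented example record: a 2.5 kb deletion on a hypothetical contig "chr1".
record = "chr1\t10000\tsv1\tN\t<DEL>\t60\tPASS\tSVTYPE=DEL;END=12500;SVLEN=-2500"
print(sv_flanks(record))
```

The sequences of the two windows could then be extracted from the assembly and passed to a primer design tool, so that a PCR product spanning the variant distinguishes the two alleles by size.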

Is accurate T2T assembly possible without a reference genome?

Although more difficult to validate, it certainly is possible to generate T2T crop genome assemblies without a reference genome in place. No matter what data sets are used, it is very important to realise that any assembly is just a hypothesis. Orthogonal data sets, as well as closely related genomes, can help in validating the results. Fortunately, more and more tools are becoming available to validate assemblies [4].

How do you maximize duplex reads?

The percentage of duplex reads can be maximised by starting with high-quality DNA with as low a percentage of nicked DNA as possible. In addition, adaptors need to be ligated on both ends of the fragment. We have obtained the highest duplex yields when fragmenting the DNA to ~15–30 kb and removing fragments <10 kb. Flushing and re-loading cells with additional library prep material can also help to maximise yields.

What is the realistic expected ratio of F reads and R read output in high duplex sequencing? Are there any technical recommendations for optimal duplex output on a P2 chip?

In the case of native DNA, we have obtained ~45–65% of data assigned to duplex reads during the developer release. This translates to duplex yields of ~23–32 Gb for a cell that generated 100 Gb of data in total.
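
The arithmetic behind these figures can be sketched as follows. The sketch assumes that each duplex read is a consensus of a template and a complement strand, so the bases assigned to duplex collapse to roughly half as many duplex bases; this assumption reproduces the ~23–32 Gb quoted above, and the function name is hypothetical.

```python
def duplex_yield_gb(total_yield_gb, fraction_assigned_to_duplex):
    """Approximate duplex yield in Gb for one flow cell.

    Assumption: the template and complement strands assigned to duplex
    are merged into a single duplex read, halving the base count.
    """
    return total_yield_gb * fraction_assigned_to_duplex / 2


for fraction in (0.45, 0.65):
    gb = duplex_yield_gb(100, fraction)
    print(f"{fraction:.0%} of 100 Gb assigned to duplex -> ~{gb:.1f} Gb duplex")
```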

Can you comment on or suggest a good assembler for polyploid genomes?

Scientists usually make a distinction between polyploids that arise within a species (autopolyploids) and those that arise due to the hybridisation of two distinct species (allopolyploids). The genome composition determines the challenges in the assembly process. Recently, a review paper [5] was published that outlines the current state of the art. At KeyGene, we have developed a proprietary genome assembler that we are currently expanding to accommodate the assembly of polyploid genomes.

What is the percentage of duplex reads among the total reads in your lab? Apart from the library loading amount, are there any other recommendations for increasing the ratio of the duplex reads?

With the developer release of the high duplex cells, we have seen that for native DNA ~45–65% of the data can be assigned to duplex. For whole-genome amplified DNA this percentage reached ~71%. With the current commercial release, the duplex percentages are lower.

Can ultra-long reads be obtained after whole-genome amplification?

No, this is not recommended. The average size of the DNA fragments after whole-genome amplification varies between methods. PCR-based amplification methods generally yield fragment lengths up to a few kb. Isothermal genome amplification methods, such as Multiple Displacement Amplification (MDA), can yield significantly longer fragments (average product length ≥10 kb), with part of the products potentially reaching ~80–100 kb. For the generation of ultra-long reads, the enzymatic transposase library preparation kit is mostly used. This kit requires significantly longer DNA as starting material.

What is the duplex read percentage in your data sets? How can we improve it?

With the developer release of the high duplex cells, we have seen that for native DNA ~45–65% of the data can be assigned to duplex. High-quality DNA and a good ligation are important for obtaining good results. Fragmenting the DNA to ~15–30 kb can also help in optimising duplex yields.

What is the minimum sequencing coverage required to assemble a human genome using only nanopore data?

The recommendation is to use ~40x duplex data in combination with ~40x ultra-long (≥80 kb) data for a human genome, in order to assemble a large portion of the chromosomes from telomere to telomere. Parental or Pore-C data can be used for phasing the chromosomes.

Which assembler can be used for T2T genome assembly using only nanopore long reads, ultra long reads, and duplex reads?

For eukaryotic genomes, the currently recommended assemblers are Verkko and hifiasm. Other assemblers can be used with these data types, but they do not currently generate gapless chromosomes automatically.

Can these assemblers perform well with highly heterozygous genomes without Pore-C or Hi-C information?

In general, the higher the heterozygosity in a genome, the easier it is to separate the parental phases. Genomes, or regions in a genome, with relatively low heterozygosity are commonly the more difficult to resolve. In these cases, long-range, multi-contact information from Pore-C is very valuable for separating these regions. We are currently exploring this approach in a number of heterozygous genomes.

References

  1. Alonge, M. et al. Major impacts of widespread structural variation on gene expression and crop improvement in tomato. Cell 182(1):145–161.e23 (2020).
  2. Jobson, E. and Roberts, R. Genomic structural variation in tomato and its role in plant immunity. Molecular Horticulture 2(7) (2022).
  3. Li, N. et al. Super-pangenome analyses highlight genomic diversity and structural variation across wild and cultivated tomato species. Nature Genetics 55:852–860 (2023).
  4. McCartney, A.M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nature Methods 19:687–695 (2022).
  5. Wang, Y. et al. Sequencing and assembly of polyploid genomes. Methods Mol. Biol. (2023). DOI: https://doi.org/10.1007/978-1-0716-2561-3_23