Michael Schatz: 100 genomes in 100 days: The structural variant landscape in tomato genomes

Opening day 2 of the Nanopore Community Meeting 2018, Michael Schatz, Bloomberg Distinguished Associate Professor of Computer Science and Biology at Johns Hopkins University, described his group’s ground-breaking research characterising the structural variant landscape in tomato genomes. With an annual production of over 175 million tonnes and a value of $85B, the tomato is one of the most valuable crops in the world. It is also an important model plant system, exhibiting extensive variation across over 15,000 known varieties – providing a model for studying fruiting, taste and the important Solanaceae family, which includes potato and pepper.

Published in May 2012, the tomato reference genome was the product of an international collaboration that required years of effort and cost millions of dollars. Michael noted that it has proved an invaluable resource for thousands of studies (including his own) and has delivered candidate SNPs for many traits.

According to Michael: ‘Short-read sequencing has proven valuable for single nucleotide polymorphism (SNP) discovery, but lacks power for more complex structural variants (SV)’. Recent research has highlighted that SVs play a major role in phenotypic variation, making their study increasingly important. Michael commented that long-read sequencing has the power to uncover this previously hidden variation; however, until recently, it had been too expensive to apply at scale. This all changed with the launch of the PromethION.

Taking advantage of the rapid and affordable long-read SV analysis that the platform makes possible, Michael outlined his team’s involvement in a multicentre project to characterise SV landscapes in 100 diverse tomato genomes in just 100 days. The aim of the study is to elucidate the role of SV in natural variation, domestication and crop improvement. Over 900 tomato varieties have been sequenced to date with short reads, most at approximately 20-40x coverage. Michael described how the team wanted to select the most diverse varieties in order to capture as much structural variation as possible. Simply selecting varieties at random underrepresented the diversity, so the team wrote an algorithm that uses the SNP information gleaned from the existing genome sequences to maximise the genomic diversity of the study samples.
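The idea of choosing samples by SNP diversity can be illustrated with a minimal sketch. The talk summary does not describe the team's actual algorithm, so this is an assumed greedy farthest-point heuristic, not their method. The function names `snp_distance` and `select_diverse`, and the genotype encoding (one integer per SNP site), are hypothetical.

```python
from typing import Dict, List


def snp_distance(a: List[int], b: List[int]) -> int:
    """Number of SNP sites at which two genotype vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def select_diverse(genotypes: Dict[str, List[int]], k: int) -> List[str]:
    """Greedy farthest-point selection of up to k varieties (hypothetical helper).

    Starts from the variety with the largest total distance to all others,
    then repeatedly adds the variety whose minimum distance to the
    already-selected set is largest.
    """
    names = sorted(genotypes)
    if k <= 0 or not names:
        return []
    first = max(
        names,
        key=lambda n: sum(snp_distance(genotypes[n], genotypes[m]) for m in names),
    )
    selected = [first]
    while len(selected) < min(k, len(names)):
        remaining = [n for n in names if n not in selected]
        best = max(
            remaining,
            key=lambda n: min(snp_distance(genotypes[n], genotypes[s]) for s in selected),
        )
        selected.append(best)
    return selected
```

Compared with random sampling, a heuristic of this kind avoids picking several near-identical varieties, which is the underrepresentation problem described above.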

Their initial intention was to mix long- and short-read sequencing; however, this plan changed after successful test runs on the PromethION, which, in their hands, is currently yielding up to 109 Gb per flow cell. They are now running 12-16 samples per week, allowing them to easily hit their target of 100 genomes in 100 days. The team use the Ligation Sequencing Kit (LSK109), which offers both long reads and high yield. The current mean read length across all of the genomes sequenced to date is 10-20 kb; however, lengths up to 1.5 Mb have been obtained. They are now generating approximately 1 Tb of sequencing data per week, which Michael described as a ‘world changing phenomenon’. He also noted that such high-throughput sequencing capability creates new data management challenges, which they have addressed by increasing their data storage capacity and upgrading their data network.
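The arithmetic behind the 100-day target can be checked with a short sketch. It assumes one genome per sample and a steady weekly rate, neither of which is stated in the talk summary, and the function names are hypothetical.

```python
def days_to_target(target_genomes: int, samples_per_week: int) -> int:
    """Whole days needed to sequence target_genomes at a steady weekly rate."""
    total = target_genomes * 7
    # Integer ceiling division avoids floating-point rounding.
    return (total + samples_per_week - 1) // samples_per_week


def min_weekly_rate(target_genomes: int, days: int) -> int:
    """Smallest whole number of samples per week that meets the deadline."""
    total = target_genomes * 7
    return (total + days - 1) // days
```

Under these assumptions, 12 samples per week reaches 100 genomes in 59 days and 16 per week in 44 days, while the minimum rate needed for 100 genomes in 100 days is 7 samples per week.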

Next, Michael discussed the two major strategies for SV analysis: alignment-based detection and assembly-based detection. Both offer unique advantages and challenges. Alignment-based analysis using their own aligner, NGMLR, provided significantly more accurate SV detection than alignment with BWA-MEM. Even small structural variants are enough to cause systematic alignment errors in short-read data; long reads, however, can span the entire structural variant. In addition, SVs often co-localise with repetitive sequence, which precludes their discovery using short reads.
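To show why a read that spans a variant makes alignment-based detection straightforward, here is a minimal sketch that reads large insertions and deletions directly from a SAM CIGAR string. It is an illustration only, not how NGMLR or any specific SV caller works; real callers also use split alignments and cluster evidence across reads. The function name `indels_from_cigar` and the 50 bp size threshold are assumptions.

```python
import re
from typing import List, Tuple

# SAM CIGAR operations: a length followed by one of MIDNSHP=X.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(
    cigar: str, ref_start: int, min_size: int = 50
) -> List[Tuple[str, int, int]]:
    """Return (type, reference_position, length) for large insertions and deletions."""
    calls = []
    ref_pos = ref_start
    for length_str, op in CIGAR_OP.findall(cigar):
        length = int(length_str)
        if op == "I" and length >= min_size:
            calls.append(("INS", ref_pos, length))
        elif op == "D" and length >= min_size:
            calls.append(("DEL", ref_pos, length))
        # M, D, N, = and X consume reference bases; I, S, H and P do not.
        if op in "MDN=X":
            ref_pos += length
    return calls
```

For example, `indels_from_cigar("100M60D200M5I50M75I10M", 1000)` reports a 60 bp deletion and a 75 bp insertion, and ignores the 5 bp insertion because it falls below the size threshold.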

For de novo assembly-based approaches, Canu worked well, delivering contig N50 sizes 10-fold larger than those of the reference, but at the expense of speed – taking two weeks per assembly using 320 cores. The team are now exploring alternative, faster assembly options, including miniasm, wtdbg2 and cloud-enabled pipelines. Michael commented that these are ‘reference quality assemblies with contig sizes up to 30 Mb’.
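For readers unfamiliar with the contig N50 statistic, it is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the standard calculation follows; the function name `n50` is illustrative.

```python
from typing import List


def n50(contig_lengths: List[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Six contigs totalling 290 units: the two largest (80 + 70) already
# cover more than half, so the N50 is 70.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

A larger N50 means that more of the genome sits in long, contiguous pieces, which is why it is used here to compare the new assemblies with the reference.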

Michael also presented a novel method for fast and accurate reference-guided scaffolding called RaGOO. Using this tool, it is possible to generate almost complete chromosomes significantly faster and more accurately than with the popular SALSA algorithm. Subsequent structural variation identification is achieved using their Assemblytics pipeline.

Initial results from the first 12 genomes sequenced in this project showed substantial variation between samples, with between 25,000 and 45,000 SVs each. Most of these SVs were insertions and deletions and, while the majority of variants are specific to each sample, a number of variants are shared by multiple samples – including some found in all 12 samples.
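The split between sample-specific and shared variants can be summarised as a sharing spectrum: for each possible number of carrier samples, how many SVs are carried by exactly that many samples. The sketch below is a simplified illustration that matches SVs by an exact (chromosome, position, type) key. That key is an assumption; real comparisons need tolerant breakpoint matching, and the talk summary does not say how the team merged calls.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Simplified exact-match key: (chromosome, position, type).
SV = Tuple[str, int, str]


def sharing_spectrum(calls: Dict[str, Set[SV]]) -> Dict[int, int]:
    """Map 'number of samples carrying an SV' to 'number of such SVs'."""
    carriers = Counter(sv for sample_svs in calls.values() for sv in sample_svs)
    return dict(Counter(carriers.values()))
```

In the resulting dictionary, the entry for 1 counts sample-specific variants and the entry equal to the number of samples counts variants present in every sample.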

Michael also shared data validating an 83 kb duplication spanning the ej2 gene that counteracts a negative epistatic interaction commonly found in crossed tomato plant lines. This has allowed the team to use CRISPR/Cas9 approaches to overcome such negative epistatic interactions and improve fruit yields. According to Michael, ‘this is just first of many stories’ that he expects to come out of this study.

Summarising his talk, Michael reiterated that long-read nanopore sequencing on the PromethION has allowed the identification of thousands of variants previously missed using short-read sequencing. He also highlighted that nanopore sequencing allows the generation of genome assemblies ten times better than the original tomato reference genome within a couple of days and at a fraction of the cost. The high throughput provided by PromethION has allowed the team to do this at a scale that has never been seen before, and they have already generated more reference-quality assemblies for tomato than exist for any other plant or animal species.

Michael has kindly made his presentation slides available to the research community.