Sequencing and assembling mega-genomes of mega-trees: the giant sequoia and coast redwood genomes

Opening day two of the Nanopore Community Meeting in New York, we were delighted to welcome back Steven Salzberg, Professor and Director of the Center for Computational Biology at Johns Hopkins University. At the Nanopore Community Meeting in 2017, Professor Salzberg detailed his involvement in an ambitious project to sequence the genomes of the giant sequoia (Sequoiadendron giganteum) and coast redwood (Sequoia sempervirens). Now, just two years later, he returned to share the results.

Not only are the sequoia and redwood trees two of the largest living organisms on the planet, they also possess extremely large genomes. At 8.2 Gb for sequoia and 26.5 Gb for redwood, the genomes of these organisms are, respectively, approximately 2.6 and 8.3 times larger than that of humans — making their assembly a truly monumental undertaking. To tackle this challenge, the team deployed a ‘hybrid’ genome assembly strategy, utilising both short-read sequencing technology and long-read nanopore sequencing. Professor Salzberg described how, at over 10 kb in length, nanopore sequencing reads could span nearly all common repeats, simplifying the assembly process.

The team deployed the MaSuRCA hybrid assembler, an open source tool developed in Professor Salzberg’s lab. Briefly, this uses a k-mer lookup to extend each short sequencing read base by base, at both the 5’ and 3’ ends, for as long as the extension is unique, forming much longer ‘super-reads’. The combination of super-reads and long nanopore sequencing reads then enables the generation of even larger ‘mega-reads’.
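The unique-extension idea behind super-reads can be illustrated with a toy sketch. This is not MaSuRCA's implementation: the function names (`count_kmers`, `extend_right`, `extend_left`, `super_read`) are hypothetical, and the sketch assumes error-free reads from a single strand, ignoring reverse complements, sequencing errors, and k-mer count thresholds that a real assembler must handle.

```python
from collections import Counter


def count_kmers(reads, k):
    """Build a lookup table of every k-mer seen in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_right(seq, kmers, k, max_steps=100_000):
    """Append bases at the 3' end while exactly one next base is supported."""
    for _ in range(max_steps):  # cap guards against cycling forever on repeats
        suffix = seq[-(k - 1):]
        candidates = [b for b in "ACGT" if suffix + b in kmers]
        if len(candidates) != 1:  # dead end (0) or branch (>1): stop
            break
        seq += candidates[0]
    return seq


def extend_left(seq, kmers, k, max_steps=100_000):
    """Prepend bases at the 5' end while exactly one previous base is supported."""
    for _ in range(max_steps):
        prefix = seq[:k - 1]
        candidates = [b for b in "ACGT" if b + prefix in kmers]
        if len(candidates) != 1:
            break
        seq = candidates[0] + seq
    return seq


def super_read(read, kmers, k):
    """Extend a read in both directions for as long as the extension is unique."""
    return extend_left(extend_right(read, kmers, k), kmers, k)


# Toy usage: overlapping 8-base reads drawn from a short, repeat-free sequence.
toy_genome = "ATGGCGTACCTTAGGCA"
toy_reads = [toy_genome[i:i + 8] for i in range(len(toy_genome) - 7)]
toy_kmers = count_kmers(toy_reads, k=5)
print(super_read(toy_reads[5], toy_kmers, k=5))
```

Because every 4-base overlap in the toy sequence is unique, any read extends to the full sequence. Where two different k-mers share the same overlap, as happens at a repeat boundary, the extension stops, which is where the long reads become useful.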

Starting with the sequoia, Professor Salzberg described how the short-read sequencing was performed on DNA obtained from a single seed (or pine nut) taken from the tallest known sequoia in the world, which stands 93.3 metres high. Importantly, the seed is haploid, making the subsequent assembly much easier. In total, the team generated 135x genome coverage using short-read sequencing technology. For the long-read nanopore sequencing component, they obtained DNA from needle tissue taken from the same tree, a task that was itself fraught with challenges, not least the need to climb 100 feet into the air to reach the tree’s first branch. Using 13 MinION Flow Cells, Winston Timp’s lab at Johns Hopkins University, which undertook all of the sequencing work, generated over 182 Gb of data across approximately 24 million reads (equating to 22x genome coverage), with a read N50 of 9.5 kb.
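The coverage figure follows directly from the numbers quoted: total sequenced bases divided by genome size. The short sketch below reproduces that arithmetic. The helper name `coverage` is illustrative, and the inputs are the rounded values from the talk (8.2 Gb genome, 182 Gb of nanopore data, roughly 24 million reads), so the results are approximate.

```python
def coverage(total_bases, genome_size):
    """Average sequencing depth: total bases sequenced per base of genome."""
    return total_bases / genome_size


SEQUOIA_GENOME_BP = 8.2e9   # 8.2 Gb, as quoted in the talk
NANOPORE_BASES = 182e9      # "over 182 Gb" of nanopore data
NANOPORE_READS = 24e6       # "approximately 24 million reads"

depth = coverage(NANOPORE_BASES, SEQUOIA_GENOME_BP)
mean_read_length = NANOPORE_BASES / NANOPORE_READS

print(f"Nanopore coverage: {depth:.1f}x")
print(f"Mean read length: {mean_read_length / 1000:.1f} kb")
```

The mean read length computed this way comes out below the quoted read N50 of 9.5 kb. That is expected, since N50 is weighted by bases rather than by read count, so longer reads contribute more.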

Assembly using the short-read data alone provided a contig N50 of 12 kb across 2,507,175 contigs; however, addition of the long-read nanopore data increased the contig N50 to 360 kb, while reducing the number of contigs to fewer than 50,000 (a 30-fold reduction). Professor Salzberg referred to this genome assembly as Sequoia v1.0 and went on to explain how the team has now integrated Hi-C chromosome conformation capture data, using the HiRise assembly algorithm, to generate the Sequoia v2.0 assembly. The team at Johns Hopkins University recently applied this technique to generate a chromosome-level assembly of the walnut (Juglans regia L.) genome. Comparing the walnut and sequoia nanopore sequencing reads, Professor Salzberg noted that the more recent sequoia reads were significantly longer, reflecting the rapid development of both the nanopore technology and their laboratory’s sequencing workflow.
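For readers unfamiliar with the N50 metric used throughout, it is the length L such that contigs (or reads) of length L or longer together contain at least half of the total bases. A minimal sketch of the standard calculation follows. The function name `n50` is illustrative, and assemblers may report variants such as NG50, which is computed against an estimated genome size instead.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and accumulate them; the N50 is
    the length at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("n50() requires at least one length")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy example: total is 20 bases, and the two longest contigs (6 + 5 = 11)
# already cover half of it, so the N50 is 5.
print(n50([2, 3, 4, 5, 6]))  # prints 5
```

Because the metric is weighted by bases, merging many short contigs into fewer long ones raises the N50 sharply, which is why it is a common headline figure for assembly contiguity.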

Assembly using the HiRise algorithm generated 11 ‘enormous’ chromosome-size scaffolds ranging from 171 Mb to 985 Mb in size. Describing such large scaffolds as ‘spectacular’ and ‘transformative’, Professor Salzberg noted that these are the largest scaffolds ever assembled for any genome. He went on to joke that ‘if your genome doesn’t have a chromosome larger than 1 Gb you can’t break this record’. Recent gene annotation of the Sequoia v2.0 genome has identified 37,963 protein-coding genes.

Moving on to the coast redwood, Professor Salzberg explained how, at 27 Gb, this organism not only possesses a much larger genome, but is also hexaploid (i.e. it carries six copies of each chromosome), providing an even sterner computational challenge. In total, the team generated 3.2 trillion bases of short-read data and 582 billion bases of nanopore sequencing data, representing 122x and 21x genome coverage respectively. Sequencing of this giant genome was completed in October 2018, and the subsequent assembly took 5-6 months (or approximately 700,000 CPU hours post error correction). Sharing some of the metrics for this redwood v1.0 assembly, Professor Salzberg stated that the largest contig is 2.4 Mb and the N50 contig size is 110 kb. The Hi-C assembly is still ongoing; however, in closing his presentation, Professor Salzberg suggested that, using the resulting data, it may be possible to split the redwood genome into its three sub-genomes, and welcomed any help with this challenge!

The redwood sequencing also resulted in the identification of a novel fungal genome (Pestalotiopsis sempervirens), which was covered to 26x depth; however, with the sheer volume of data being generated in the lab, the team have not yet had time to write this up.

It is anticipated that this ground-breaking research will significantly enhance our knowledge of these astounding organisms and support conservation and breeding efforts, preserving them for future generations to enjoy.

Authors: Steven Salzberg