Karen Miga - Telomere-to-telomere assembly of a complete human X chromosome
London Calling 2019
Release of the first human genome assembly was a landmark achievement, and after nearly two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no one chromosome has yet been finished end to end, and hundreds of gaps persist across the genome. These unresolved regions include segmental duplications, ribosomal RNA gene arrays, and satellite arrays that harbor unexplored variation of unknown consequence. We aim to finish these remaining regions and generate the first truly complete assembly of a human genome.
Here we announce a whole-genome de novo assembly that surpasses the continuity of GRCh38, along with the first complete, telomere-to-telomere assembly of a human X chromosome. In total, we collected 40X coverage of ultra-long Oxford Nanopore sequencing for the CHM13hTERT cell line, including 44 Gb of sequence in reads >100 kb and a maximum read length exceeding 1 Mb. This unprecedented coverage of ultra-long reads enabled the resolution of most repeats in the genome, including large fractions of the centromeric satellite arrays and short arms of the acrocentrics. A de novo assembly combining this nanopore data with 70X of existing PacBio data achieved an NG50 contig size of 75 Mb (compared to 56 Mb for GRCh38), with some chromosomes broken only at the centromere. Using this assembly as a basis, we chose to manually finish the X chromosome. The few unresolved segmental duplications were assembled using ultra-long reads spanning the individual copies, and the ~2.8 Mbp X centromere was assembled by identifying unique variants within the array and using these to anchor overlapping ultra-long reads. These results demonstrate that it is now possible to finish entire human chromosomes without gaps, and our future work will focus on completing and validating the remainder of the genome.
Karen opened her plenary talk by stating that we are "entering into a new era" in genetics and genomics, one that demands complete, high-quality assemblies. The current human reference genome (GRCh38) is the most accurate and complete vertebrate genome to date. However, it is incomplete - there are still 368 unresolved issues and 102 gaps. Karen said that it "really drives it home when we look at chromosome 21", which has ~30 Mb of assembled sequence but ~20 Mb of missing sequence - unexplored regions that could be linked to disease. These problem regions are associated with segmental duplications, gene families, satellite arrays, centromeres, and rDNAs, as well as uncharacterised sequence variation in the human population. The major challenge is generating complete assemblies across repetitive regions that can span up to hundreds of kilobases, or even megabases at centromeres. Karen asked: can high-coverage, ultra-long read sequencing be used to resolve these regions and complete assemblies of the human genome? She stated that this question motivated the establishment of the Telomere-to-Telomere (T2T) consortium, of which she is a member: an open, community-based effort to generate the first complete assembly of a human genome. The aim of this consortium is to "shift the standards in genomics" to the highest quality.
Karen and her colleagues have sequenced CHM13hTERT, a karyotypically stable haploid cell line, using long-read nanopore sequencing. From the start of May 2018 to January 2019, 94 MinION/GridION flow cells were used for CHM13 sequencing, obtaining 50X depth of coverage from ultra-long nanopore reads. The maximum mapped read length was 1.04 Mb. These ultra-long nanopore reads were used for contig building, along with long-read datasets from other sequencing platforms for polishing and structural validation. The assembler Canu was used for sequence assembly; the final assembly was 2.94 Gbp with an NG50 contig size of 75 Mbp, which exceeds the continuity of GRCh38 (NG50 contig size of 56 Mbp). Moreover, a subset of chromosome assemblies remained broken only at the centromere.
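For readers unfamiliar with the NG50 metric quoted above, the following is a minimal sketch of how it is commonly computed: contigs are sorted from longest to shortest, and the NG50 is the length of the contig at which the running total first reaches half of an assumed genome size. The function name, the toy contig lengths, and the toy genome size are illustrative assumptions, not values from the talk.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly.

    NG50 is the length of the contig at which the cumulative length of
    contigs, taken longest first, reaches half of the assumed genome size.
    """
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    # The contigs together cover less than half of the assumed genome size.
    return 0


# Hypothetical toy example: five contigs (in Mbp) and an assumed 100 Mbp genome.
toy_contigs = [40, 25, 15, 10, 5]
print(ng50(toy_contigs, genome_size=100))  # prints 25
```

Because NG50 is normalised by genome size rather than by total assembly length, it allows two assemblies of the same genome (here, the new assembly and GRCh38) to be compared on the same footing.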
Karen stated that the next step was to use this hybrid de novo assembly to assemble a complete human X chromosome. The X chromosome seemed a "natural place to invest time", as it is associated with many Mendelian diseases. The biggest challenge in assembling this chromosome was the centromere, which required ultra-long nanopore reads spanning 100 kbp repeat-rich regions. However, she stated that an assembly is only a hypothesis, and the manually finished assembly needed to be validated using other methods such as droplet digital PCR, restriction enzyme pulsed-field gels, and structural validation techniques.
Karen demonstrated how difficult it is to assemble centromeric regions, especially the centromere of the X chromosome where, for example, only 37 structural variants are present to guide assembly, and the majority of these SVs are very small. She stated that the next challenge is determining how to polish the assembly and bring it to high accuracy. How can we create new strategies to deal with tandem repeats? Karen described how they created a polishing strategy using unique k-mers; this first involves identifying all unique, single-copy k-mers throughout the genome. These k-mers are used to create a scaffold for anchoring high-confidence long-read alignments; only those long reads aligning with unique k-mers are retained. Karen described how the spacing of single-copy k-mers can be irregular in repeat-dense regions, such as centromeres. For example, the longest distance observed between two unique k-mers on the X chromosome was 53 kbp, which means that reads of ≥53 kbp are required to span this section of the chromosome.
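The idea of anchoring reads on single-copy k-mers can be illustrated with a small sketch. This is not the consortium's pipeline: it is a toy, forward-strand-only illustration on a short made-up sequence, and all function names are hypothetical. It finds k-mers that occur exactly once, reports the largest gap between consecutive unique k-mers (the distance a read must span), and checks whether a read contains at least one unique k-mer.

```python
from collections import Counter


def unique_kmer_positions(seq, k):
    """Return start positions of k-mers that occur exactly once in seq."""
    n = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n))
    return [i for i in range(n) if counts[seq[i:i + k]] == 1]


def max_gap(positions):
    """Return the largest distance between consecutive sorted positions."""
    return max((b - a for a, b in zip(positions, positions[1:])), default=0)


def read_is_anchored(read, unique_kmers, k):
    """Return True if the read contains at least one unique k-mer."""
    return any(read[i:i + k] in unique_kmers
               for i in range(len(read) - k + 1))


# Toy "chromosome": unique flanks around a short tandem repeat (made-up data).
k = 3
seq = "CCG" + "AT" * 5 + "GGC"
positions = unique_kmer_positions(seq, k)
unique_kmers = {seq[i:i + k] for i in positions}

print(positions)           # prints [0, 1, 2, 11, 12, 13]
print(max_gap(positions))  # prints 9
print(read_is_anchored("ATATAT", unique_kmers, k))  # prints False
print(read_is_anchored("GATATA", unique_kmers, k))  # prints True
```

In this toy case the repeat leaves a gap of 9 bases between unique k-mers, so a read that falls entirely inside the repeat cannot be anchored; the same reasoning, at a much larger scale, underlies the requirement for reads of ≥53 kbp on the X chromosome. A real analysis would also need to handle reverse complements and sequencing errors, which this sketch ignores.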
Two rounds of Nanopolish were used for k-mer-based polishing with nanopore reads, along with long-read polishing from other sequencing platforms, and HiFi alignments were then used to evaluate the success of polishing. Karen concluded this section by stating that the finished T2T X chromosome had a structurally validated assembly, from telomere to telomere, including a problematic 2.8 Mb tandem repeat at the X centromere. The novel k-mer-based polishing strategy they used improved the assembly quality of large repeat-rich regions. She stated that this demonstration is "really bringing the point home that we are achieving high quality and high continuity".
In the final section of her talk, Karen asked "how do we start to finish the human genome?" Focusing on chromosomes 6 and 8, at the D6Z1 and D8Z2 centromeric sites within predicted satellite array regions, Karen explained how we can see the difference in sequence diversity compared to the X chromosome centromere with its 2.8 Mb tandem repeat. At the centromeres of these autosomes there is far greater sequence diversity, which makes their assembly significantly easier - there is "a lot more information to guide mapping, polishing and assembly". For example, the maximum spacing between unique k-mers is only 3 kb. Using the k-mer polishing approach greatly improved the assembly.
Karen concluded by stating that the goal of the next two years is to obtain a complete human genome. The remaining challenges include acrocentric regions, large segmental duplications, and classical human satellites, and repeat assembly will need to be automated. "We keep setting the bar higher and higher" for the genetics community in terms of assembly quality and completeness. Looking to 2020 and beyond, we need to start thinking about human populations, as opposed to a single human genome. This will require increasingly high-throughput long-read sequencing on the PromethION, and they are now starting to "ramp up the process". It will also require cloud-based assembly and processing; Karen announced that the SHASTA cloud-based assembler is about to be released by Santa Cruz, and that it has achieved assembly of 2.8 Gbp of sequence data in only 5.6 hours.
"So I guess that my take home message is...keep calm because everything is awesome"!
Please note that all the CHM13 data is openly available at github.com/nanopore-wgs-consortium/chm13.