NCM 2021: The annotation of novel genes in a complete human genome


To begin the presentation, Alaina Shumate from Johns Hopkins University discussed the recent efforts to assemble the complete human genome. Up until recently, 8% of the genome was missing from the human reference, mostly comprising long genomic repeats, which have been refractory to assembly using short-read sequencing technologies. However, earlier this year the Telomere-to-Telomere (T2T) Consortium completed the sequence of a human genome, and Alaina mentioned how ultra-long nanopore reads were instrumental in achieving this feat. For the CHM13 cell line, Alaina obtained 126x coverage of nanopore reads, with an N50 of 58 kbp and a ‘particularly impressive’ maximum read length of 1.3 Mbp.

Since the assembly of the T2T-CHM13 genome was more peripheral to the talk’s focus, Alaina only summarised the main novel sequence features, including 238 Mbp of added or corrected sequence. Unsurprisingly, many of these sequences were repetitive, e.g., centromeric satellites, rDNAs, and segmental duplications. Finally, Alaina and her colleagues found 1,956 novel genes, including 99 protein-coding genes exclusive to the T2T assembly; how they discovered these genes was the focus of the talk.

Next, Alaina began addressing the computational methods used to identify these novel genes from the T2T assembly. Due to the advancements in DNA sequencing and the accompanying genome assemblies, Alaina saw a growing need to have a tool that can map or ‘lift over’ gene annotations from a reference genome to an improved assembly of the same or closely related species. To that end, Alaina has been developing a tool called Liftoff; ‘one of its unique features is its ability to use the reference annotation to find additional paralogs in the new target assembly’. This is especially important, since more contiguous assemblies tend to capture paralogs more efficiently, ‘which is exactly what we saw in the T2T assembly’. Alaina then discussed some of the existing ‘lift over’ tools, which convert single coordinates of genomic features between assemblies; however, she pointed out that with gene annotations you’re mapping genomic intervals rather than single coordinates. Liftoff was designed to address some of the challenges associated with lifting over gene annotations.

In the first step, Liftoff uses minimap2 to align all of the complete gene sequences from the reference annotation, introns included, to the new target assembly. The reason for doing this rather than using spliced alignment is that spliced aligners rely on working out the precise intron-exon structure of the gene; in this case, the intron-exon structure is already annotated in the reference genome. Moreover, spliced aligners can have difficulties aligning short exons.
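
To make the distinction concrete, the following is a minimal sketch (not Liftoff's actual code) of the difference between a full gene-span query, introns included, and a spliced query built from the exons alone. The `Exon` class, the function names, and the toy sequence are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exon:
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def gene_span_sequence(chrom_seq: str, exons: List[Exon]) -> str:
    """Return the whole gene span, introns included, as a single query."""
    gene_start = min(e.start for e in exons)
    gene_end = max(e.end for e in exons)
    return chrom_seq[gene_start:gene_end]


def spliced_sequence(chrom_seq: str, exons: List[Exon]) -> str:
    """Return the exons joined together, as a spliced aligner would see them."""
    ordered = sorted(exons, key=lambda e: e.start)
    return "".join(chrom_seq[e.start:e.end] for e in ordered)


# Toy chromosome: flank, exon 1, a 10 bp intron, exon 2, flank.
chrom = "TT" + "ACGT" + "AC" + "G" * 8 + "TTTAA" + "ACC"
exons = [Exon(2, 6), Exon(16, 21)]

unspliced = gene_span_sequence(chrom, exons)  # what gets aligned in this scheme
spliced = spliced_sequence(chrom, exons)      # exon-only alternative
```

The unspliced query keeps the intron as alignable context, so the aligner does not need to rediscover the exon boundaries; they are read back from the reference annotation afterwards.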

Alaina next discussed some of the challenges faced when lifting over annotations, including genes that align in fragments due to differences between the two genomes and genes that have more than one alignment. Alaina gave an example of a gene in GRCh37 which has a 50 kb gap in the middle of an intron as well as a paralog close by. In an attempt to lift over the annotation to another human genome, six different alignments were generated on chromosome 7, none of which contained all exons. The task is then to work out which combination of alignments contains the exons in the correct order. To do this, the alignments are broken up into gapless blocks, and blocks lacking any part of an exon are removed. Alaina added that intron alignments are not important here; they were important in the alignment step but are not useful for the coordinate conversion.
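
As a rough illustration of this step, the sketch below splits an alignment into gapless blocks by walking a SAM-style CIGAR string, then keeps only the blocks that overlap an exon. It assumes the gene sequence is the aligner's query and the new assembly is the target, that the caller passes the coordinates of the first aligned base on both sides (so clipping operations are ignored), and that exons are given as intervals in gene coordinates. The `Block` type and function names are hypothetical, not taken from Liftoff.

```python
import re
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    gene_start: int  # 0-based, inclusive, in reference-gene coordinates
    gene_end: int    # exclusive
    tgt_start: int   # 0-based, inclusive, in target-assembly coordinates
    tgt_end: int     # exclusive


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def gapless_blocks(cigar: str, gene_start: int, tgt_start: int) -> List[Block]:
    """Split one alignment into blocks that contain no insertions or deletions."""
    blocks: List[Block] = []
    g, t = gene_start, tgt_start
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":  # aligned columns: consume both sequences
            if blocks and blocks[-1].gene_end == g and blocks[-1].tgt_end == t:
                last = blocks.pop()  # contiguous with the previous block: extend it
                blocks.append(Block(last.gene_start, g + n, last.tgt_start, t + n))
            else:
                blocks.append(Block(g, g + n, t, t + n))
            g += n
            t += n
        elif op == "I":   # bases present only in the gene sequence
            g += n
        elif op in "DN":  # bases present only in the target assembly
            t += n
        # S, H and P are skipped: coordinates start at the first aligned base
    return blocks


def exon_blocks(blocks: List[Block], exons: List[Tuple[int, int]]) -> List[Block]:
    """Drop blocks that do not overlap any exon (purely intronic blocks)."""
    return [
        b for b in blocks
        if any(b.gene_start < e_end and e_start < b.gene_end for e_start, e_end in exons)
    ]


blocks = gapless_blocks("50M10D30M5I20M", gene_start=0, tgt_start=1000)
kept = exon_blocks(blocks, exons=[(0, 40), (90, 105)])
```

In this toy case the middle block covers only intronic sequence, so it is discarded before any coordinate conversion takes place.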

Next, the blocks are represented as a directed acyclic graph. The nodes represent alignment blocks and are weighted according to the number of mismatches in the exon alignment; the edges represent gaps within exons and are weighted according to the length of those gaps. Each path through the graph represents one possible combination of alignments that forms a feasible gene model. Alaina clarified that two nodes are only connected by an edge if they are on the same strand and chromosome, and in the same 5’ to 3’ order in both genomes. Lastly, the distance from the start of one node to the end of a connected node is limited by a tuneable parameter, to avoid generating transcripts that aren’t biologically feasible. Alaina emphasised that the graph structure captures the gene structural information, while the node and edge weights account for sequence similarity. The shortest path through the graph is then identified, and it contains the alignments to use for exon coordinate conversion: those that adhere to the reference gene structure with the highest sequence identity, that is, the fewest mismatches and the fewest gaps. Alaina then pointed out that any other path chosen gives a truncated gene.
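
The graph search described above can be sketched as a shortest-path computation over blocks sorted by their position in the gene. This is an illustrative reconstruction under several assumptions, not Liftoff's implementation: block coordinates are expressed over the exonic sequence only, target coordinates are already oriented so that they increase 5’ to 3’, and implicit start and end nodes charge for any exonic bases left uncovered, which is how truncated paths end up more expensive here. The names `Block`, `can_follow`, `best_path`, and `max_span` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    chrom: str
    strand: str
    gene_start: int  # exonic coordinates in the reference gene, 0-based
    gene_end: int    # exclusive
    tgt_start: int   # coordinates in the target assembly, 0-based
    tgt_end: int     # exclusive
    mismatches: int  # node weight


def can_follow(a: Block, b: Block, max_span: int) -> bool:
    """An edge a -> b exists only if the blocks are collinear and close enough."""
    return (
        a.chrom == b.chrom
        and a.strand == b.strand
        and a.gene_end <= b.gene_start          # same order in the reference gene
        and a.tgt_end <= b.tgt_start            # same order in the target
        and b.tgt_end - a.tgt_start <= max_span  # tuneable distance limit
    )


def best_path(blocks: List[Block], gene_length: int, max_span: int) -> Tuple[List[Block], int]:
    """Return the cheapest chain of blocks and its cost (mismatches + uncovered bases)."""
    if not blocks:
        return [], gene_length
    # Sorting by gene_start is a valid topological order, because every edge
    # goes from a block that ends before the next one starts.
    blocks = sorted(blocks, key=lambda b: b.gene_start)
    cost = [0] * len(blocks)
    prev: List[Optional[int]] = [None] * len(blocks)
    for i, b in enumerate(blocks):
        cost[i] = b.gene_start + b.mismatches  # edge from the implicit start node
        for j in range(i):
            a = blocks[j]
            if can_follow(a, b, max_span):
                c = cost[j] + (b.gene_start - a.gene_end) + b.mismatches
                if c < cost[i]:
                    cost[i], prev[i] = c, j
    # Edge to the implicit end node: exonic bases after the last block.
    end = min(range(len(blocks)), key=lambda i: cost[i] + gene_length - blocks[i].gene_end)
    total = cost[end] + gene_length - blocks[end].gene_end
    path: List[Block] = []
    i: Optional[int] = end
    while i is not None:
        path.append(blocks[i])
        i = prev[i]
    return path[::-1], total


candidates = [
    Block("chr7", "+", 0, 100, 1000, 1100, 1),
    Block("chr7", "+", 100, 200, 5000, 5100, 0),
    Block("chr7", "+", 200, 300, 9000, 9100, 2),
    Block("chr7", "+", 100, 300, 900000, 900200, 0),  # distant paralogous copy
]
path, total = best_path(candidates, gene_length=300, max_span=100_000)
```

In this toy example the three nearby blocks together cover the whole gene for a cost of three mismatches, whereas the distant copy on its own would leave the first 100 exonic bases uncovered and so scores far worse.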

Zooming out, Alaina provided an overview of the process for identifying paralogs. After mapping the reference annotations onto the new genome, they go back to minimap2 to look for additional or secondary alignments of the reference genes. Alignments overlapping other genes that were already annotated in the first lift over are removed. The lift over process is then repeated, and copies above a sequence identity threshold of 95% are labelled as paralogs. One of the limitations of Liftoff is that it relies on a reference annotation and can only identify new genes that are paralogs of those in the reference. As such, it can’t find entirely novel genes. Some of Alaina’s collaborators have developed the Comparative Annotation Toolkit, which uses RNA-Seq and Iso-Seq data to identify novel transcript isoforms and genes. Using this tool, the group identified eight entirely novel genes in the new T2T assembly, although their medical relevance is yet to be determined.
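
The filtering logic in the paralog search might look something like the sketch below. It is a simplified illustration rather than Liftoff's code: the `Hit` record, the function names, and the gene names are hypothetical, and treating the 95% threshold as inclusive is an assumption.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    gene: str        # reference gene the alignment came from
    chrom: str
    start: int       # 0-based, inclusive, in the target assembly
    end: int         # exclusive
    identity: float  # sequence identity, between 0 and 1


def overlaps(a: Hit, b: Hit) -> bool:
    """True if two hits share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def candidate_paralogs(
    secondary_hits: List[Hit],
    annotated: List[Hit],
    min_identity: float = 0.95,
) -> List[Hit]:
    """Keep extra copies that are similar enough and sit in unannotated sequence."""
    return [
        h for h in secondary_hits
        if h.identity >= min_identity
        and not any(overlaps(h, g) for g in annotated)
    ]


# Genes placed by the first lift over (hypothetical names and coordinates).
annotated = [Hit("GENE_A", "chr1", 100, 200, 1.0)]
secondary = [
    Hit("GENE_A", "chr1", 150, 250, 0.99),    # overlaps an annotated gene: removed
    Hit("GENE_A", "chr1", 5000, 5100, 0.97),  # new locus, high identity: kept
    Hit("GENE_A", "chr2", 100, 200, 0.90),    # below the identity threshold: removed
]
paralogs = candidate_paralogs(secondary, annotated)
```

Only the second hit survives both filters, so it would be the one reported as an additional copy of the reference gene.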

This is not the case for paralogs, as Alaina proceeded to explain. Using the list of 5,175 medically relevant genes compiled by the Genome in a Bottle Consortium, Alaina identified clinically relevant paralogs from the T2T assembly, including CCL3L1, a potent ligand for the HIV-1 coreceptor CCR5. A higher CCL3L1 copy number is correlated with a lower susceptibility to HIV-1. Finally, Alaina and her team found a paralog of KCNJ18, which encodes a potassium channel. Mutations in this gene are implicated in thyrotoxic paralysis. Interestingly, they reported an error from the update of GRCh37 to GRCh38 that led to a deletion of the KCNJ17 paralog. Alaina said that what they found is likely not a new gene, but rather a correction in the annotation of previous assemblies. Alaina finished the talk by reiterating that quality annotation is dependent on quality assemblies, and this has been made possible with the advent of long-read sequencing.

Authors: Alaina Shumate