Interview: Assembly and annotation of reference-quality human genomes

Alaina Shumate is a Ph.D. candidate in Biomedical Engineering with Dr. Steven Salzberg at Johns Hopkins University. Her research is focused on the development of computational methods for genome annotation, including the creation of Liftoff, a tool for mapping gene annotations from one genome assembly onto another. Here, Alaina discusses how long-read sequencing is helping with the generation of high-quality assemblies as an alternative to a single reference genome, and the importance of complete and correct annotation for studying genetic variation.

Alaina Shumate will be presenting a webinar on ‘Assembly and Annotation of Reference-Quality Human Genomes’ with Technology Networks on Wednesday 8th September at 3pm UK time.

What are your current research interests?

My current research interests are the development and application of software methods for gene annotation. Because of the major improvements to sequencing technology and genome assembly methods, it is now possible to generate multiple high-quality assemblies for a given species which will allow us to move away from the reliance on a single reference genome. My goal is to make mapping annotations onto these assemblies a simple and routine process.

What first ignited your interest in genomics and bioinformatics?

My first research experience was as an undergraduate at Stanford University in a synthetic biology lab where sequencing and synthesizing DNA was a routine part of our research. I was fascinated by the ability to read and write genetic code like this, and at the same time, I was learning to code in my computer science courses at Stanford. After I graduated, I was fortunate enough to get a job as a software test engineer in the biotech industry where I could combine these two interests. Here I saw cutting-edge work led by research scientists and bioinformaticians which motivated me to pursue a Ph.D. in bioinformatics.

How is long-read sequencing changing the quality of genomic data for the assembly of human genomes? How has it benefitted your work?

The high quality of the human genome assemblies we have created simply would not be possible without long-read sequencing: long reads are essential for correctly assembling the repeats throughout human genomes. As someone focused on annotation, I rely on an accurate assembly as the foundation for my work, because a correct and complete annotation isn’t possible without a correct and complete assembly.

Accurate annotation is crucial for a human genome to function as an effective reference. What influence could your work have on the creation of reference-quality genomes and how might this impact our understanding of human biology?

Reference genomes are an essential component of the variant calling process, and it’s not enough just to know the locations of variants. To understand how these variants impact human biology, we need to know if these variants lie within genes, and how the variant(s) may impact the function of the gene. To do this, we first need an accurate annotation of the reference genome, which is what I strive to achieve with my work.
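As a rough illustration of the first step described here, checking whether a variant lies within an annotated gene, the sketch below uses a simple coordinate comparison. The `Gene` record, the gene names, and the variant positions are hypothetical toy values, and the linear scan is chosen for readability; it is not how Liftoff or any particular annotation tool works.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """A gene span from an annotation (hypothetical minimal record)."""
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def genes_overlapping(genes, chrom, pos):
    """Return the names of genes whose span contains the 1-based position."""
    return [g.name for g in genes if g.chrom == chrom and g.start <= pos <= g.end]


# Toy annotation and variant positions, invented for illustration only.
toy_genes = [
    Gene("geneA", "chr1", 100, 500),
    Gene("geneB", "chr1", 450, 900),
    Gene("geneC", "chr2", 100, 300),
]
toy_variants = [("chr1", 475), ("chr1", 950), ("chr2", 100)]

# Map each variant to the genes it falls within (an empty list means intergenic).
hits = {variant: genes_overlapping(toy_genes, *variant) for variant in toy_variants}
```

A real analysis would go further and ask which exon or coding position is affected, which is why the accuracy of the underlying annotation matters.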

What have been the main challenges in your work and how have you approached them?

I believe a challenge throughout much of genomics research is distinguishing true biological variation from errors or artifacts. For example, if a gene annotated with our software has a frameshift mutation, that individual may have a frameshift in that gene – or it could be a result of a sequencing error, an assembly error, or an error in the annotation process. Approaching this challenge requires looking at the data and metrics at each step of the process. We can use the original sequencing reads to see if the variant in question is supported by the reads and if the quality scores suggest sequencing errors. A careful look at the alignments used in the annotation process can also identify annotation errors. And when in doubt, seeking guidance from my exceptional mentors and colleagues has always proven to be an effective approach to solving many of the challenges I encounter in my work.
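The read-level check described above can be sketched in a few lines. This is a simplified illustration, not the workflow used in the lab: the function names, the toy observations, and the Phred quality cutoff of 20 are all assumptions made for the example.

```python
def is_frameshift(ref_allele: str, alt_allele: str) -> bool:
    """An indel shifts the reading frame when its length change is not a multiple of 3."""
    return (len(alt_allele) - len(ref_allele)) % 3 != 0


def supporting_fraction(observations, alt_allele, min_quality=20):
    """Fraction of confident read observations that carry the alternate allele.

    observations: list of (allele, phred_quality) pairs, one per read covering
    the site. The min_quality cutoff of 20 is an assumed, illustrative threshold.
    """
    confident = [allele for allele, quality in observations if quality >= min_quality]
    if not confident:
        return 0.0
    return sum(allele == alt_allele for allele in confident) / len(confident)


# Toy data: a 1-bp insertion (A -> AT) seen in some of the reads at one site.
toy_reads = [("AT", 35), ("AT", 30), ("A", 32), ("AT", 10), ("A", 38)]

shifts_frame = is_frameshift("A", "AT")
support = supporting_fraction(toy_reads, "AT")
```

A frameshift with strong, high-quality read support is more likely to be real variation in that individual, while weak support points toward a sequencing or assembly error. Errors introduced during annotation still have to be checked separately against the alignments.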

What’s next for your research?

After we map genes from one assembly to another, the next question frequently asked is how are the genes different between the two assemblies? Some things we have looked at in past projects are the number of genes that contain variants and what the consequences of these variants are, the conservation of gene synteny between the two assemblies, and the collapse or expansion of gene families. So next for us is building a software tool that automates all this analysis. My hope is that this will be a useful tool for evaluating genetic variation as we begin to sequence and assemble more reference-quality human genomes.
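One of the comparisons mentioned here, the collapse or expansion of gene families, comes down to comparing copy counts between two annotations. The sketch below assumes each annotated gene has already been assigned a family label; the function name and the family labels are hypothetical, and this is not a description of the planned tool.

```python
from collections import Counter


def copy_number_changes(reference_families, target_families):
    """Return {family: change in copy number} for families whose counts differ.

    Each argument is a list with one family label per annotated gene.
    Positive values mean expansion in the target assembly, negative values
    mean collapse.
    """
    reference_counts = Counter(reference_families)
    target_counts = Counter(target_families)
    changes = {}
    for family in sorted(set(reference_counts) | set(target_counts)):
        delta = target_counts[family] - reference_counts[family]
        if delta != 0:
            changes[family] = delta
    return changes


# Toy family labels, invented for illustration only.
reference = ["famA", "famA", "famB", "famC"]
target = ["famA", "famB", "famB", "famB", "famC"]

changes = copy_number_changes(reference, target)
```

As with the frameshift example, an apparent collapse can reflect a misassembled repeat instead of true biology, so differences like these are a starting point for inspection, not a final answer.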

Are you interested in hearing more about Alaina's research? Join her webinar on 8th September, where she will discuss reference-quality assemblies from Ashkenazi and Puerto Rican individuals, both more contiguous than the GRCh38 reference genome. Register here.