Long-read sequencing technologies resolve most dark and camouflaged gene regions


The human genome contains many "dark" regions that short-read sequencing cannot resolve. These dark regions include protein-coding regions, meaning that potentially disease-relevant variants remain undiscovered. Mark detailed how his team have systematically analysed these dark regions, which can be classified as either "dark by depth" (a low number of mappable reads) or "camouflaged" (ambiguous alignment; also referred to as "dark by mapping quality"). For example, the gene HSPA1A is 53% camouflaged, with >90% of the reads having a mapping quality score of <10. This gene lies upstream of a similar gene, HSPA1B, which is also dark. Because any read could have come from either gene, neither gene in this region is visible in sequencing data: "you will not find them in your vcf file". Mark emphasised that "this is a read length problem". In comparison, HLA-DRB5 is dark by depth, with many "gaps" across which few mappable reads are present. Mark explained that this phenomenon was already known, but "we were surprised by how big the problem is".
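The two categories above can be illustrated with a minimal Python sketch that labels a single genomic position from the mapping qualities (MAPQ) of the reads aligned there. The MAPQ cutoff of 10 and the 90% fraction come from the HSPA1A example; the depth cutoff `MIN_DEPTH`, the function name `classify_position` and the label strings are assumptions made for illustration and are not stated in the talk.

```python
from typing import Sequence

MAPQ_CUTOFF = 10          # from the text: reads with MAPQ < 10 are ambiguous
LOW_MAPQ_FRACTION = 0.9   # from the text: more than 90% of reads
MIN_DEPTH = 5             # ASSUMPTION: depth cutoff is not given in the talk


def classify_position(
    mapqs: Sequence[int],
    min_depth: int = MIN_DEPTH,
    mapq_cutoff: int = MAPQ_CUTOFF,
    low_fraction: float = LOW_MAPQ_FRACTION,
) -> str:
    """Label one position from the MAPQ values of the reads covering it."""
    depth = len(mapqs)
    # Too few aligned reads: "dark by depth".
    if depth <= min_depth:
        return "dark_by_depth"
    # Enough reads, but almost all align ambiguously: "camouflaged".
    low_quality = sum(1 for q in mapqs if q < mapq_cutoff)
    if low_quality / depth > low_fraction:
        return "camouflaged"
    return "visible"


if __name__ == "__main__":
    print(classify_position([60, 60, 60]))       # only three reads
    print(classify_position([0] * 19 + [60]))    # 19 of 20 reads ambiguous
    print(classify_position([60] * 30))          # well-covered, unique
```

In practice the per-position MAPQ values would be taken from an alignment file, and the percentage of a gene that is dark would be the share of its positions carrying either dark label.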

More than 100 protein-coding genes are 100% camouflaged by short-read sequencing approaches, and approximately 6,054 gene bodies are partially dark (hg38). Mark stated that the next logical question is therefore "what kind of gene bodies are we talking about?" He explained that ~4,000 of the 6,054 genes are protein-coding genes and ~1,000 are pseudogenes. The dark part of the gene body is predominantly intronic. Mark pointed out that introns are often considered unimportant or irrelevant, but he does not agree with this view, noting that many introns drive disease.

Mark stated that he would focus on dark regions in the coding sequence (CDS) for the rest of his talk. Approximately 2,855 CDSs across 748 genes are dark, of which 117 are 100% dark. Furthermore, 76 of the genes that are >5% dark in the CDS are known to drive disease, being associated with 326 unique human diseases. Mark then provided a few interesting examples of disease-relevant genes that are either dark or camouflaged. Firstly, he gave the example of SMN1, which is 95% camouflaged in its CDS by its pseudogene SMN2. SMN1 is required for motor neuron health and is important for dendrite and axon development. This gene is associated with spinal muscular atrophy and implicated in amyotrophic lateral sclerosis (ALS).

Another example of a camouflaged disease-relevant gene is CR1, which is 26% camouflaged in its CDS. CR1 is a top Alzheimer's disease gene, and it is camouflaged by itself: 3 of the gene's exons are repeated. Mark stated that the most important region of this gene is dark, as the camouflaged region is also the C3b/C4b binding domain; C3b and C4b are two proteins involved in the complement cascade, a pathway that is implicated in Alzheimer's disease.

Mark next asked how these regions can be resolved. He stated that in the "long term, the answer is long-read sequencing". Mark described how he has used long-read sequencing to resolve these dark regions, comparing how well different sequencing technologies resolve them. Overall, nanopore long reads appeared to be the most successful at resolving gene body dark regions, displaying the highest mapping quality. Specifically, nanopore sequencing was able to resolve over 90% of dark gene regions, which Mark said is "quite impressive". For example, nanopore long reads were best able to resolve the camouflaged region in the gene CR1.

Authors: Mark T. W. Ebbert