Julie Karl: De novo assembly of an entire Mauritian macaque MHC haplotype with ultra-long reads
Julie Karl, Senior Research Specialist at the University of Wisconsin-Madison, opened the Assembly breakout session by describing her recent work applying long-read nanopore sequencing to create a complete de novo assembly of the Mauritian macaque MHC haplotype. She explained that the 5 Mb macaque MHC genomic region is currently unexplored, despite its importance in transplantation and immunological research. The MHC region is gene-dense, highly complex and highly repetitive, owing to the presence of multicopy genes and pseudogenes. Furthermore, many copy number variants are evident, arising from segmental duplication events. Julie stated that this level of genome complexity precludes the accurate assembly and analysis of the MHC region from current macaque reference genomes, which were created using short-read sequencing technology.
The long reads provided by nanopore sequencing are able to span complete regions of repetitive DNA, allowing complete genomic resolution. To generate ultra-long nanopore sequencing reads, Julie adopted a phenol:chloroform sample preparation approach, previously demonstrated by Nick Loman, Matt Loose and co-workers, to deliver high-molecular-weight DNA. To simplify analysis, the team obtained their macaque DNA from an isolated population from which animals with homozygous MHC can be obtained.
Julie revealed that this approach provided approximately 53x genome coverage, with the longest read in excess of 1.4 Mb. The reads were mapped to the human hg38 reference genome, and sequences mapping to the MHC region were extracted and de novo assembled using Canu. The longest read in the MHC region spanned 613 kb. Julie showed how the long reads provided by nanopore sequencing allowed the entire MHC region to be resolved in a single contig.
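The read-extraction step described above can be sketched in a few lines of Python. This is only an illustration, not the team's actual pipeline: it assumes the alignments are available in the tab-separated PAF format (the talk summary does not name the aligner or file format), the function name `reads_in_region` is hypothetical, and the chr6 window used in the example is a rough placeholder rather than the coordinates the team used.

```python
from typing import Iterable, Set


def reads_in_region(
    paf_lines: Iterable[str],
    chrom: str,
    start: int,
    end: int,
    min_mapq: int = 0,
) -> Set[str]:
    """Return names of reads with an alignment overlapping [start, end) on chrom.

    Assumes PAF input: column 1 is the read name, column 6 the target name,
    columns 8 and 9 the target start/end, column 12 the mapping quality.
    """
    selected: Set[str] = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated records
        read_name, target = fields[0], fields[5]
        t_start, t_end, mapq = int(fields[7]), int(fields[8]), int(fields[11])
        overlaps = t_start < end and t_end > start
        if target == chrom and mapq >= min_mapq and overlaps:
            selected.add(read_name)
    return selected


# Toy alignments (invented for illustration only).
example = [
    "readA\t300000\t0\t300000\t+\tchr6\t170805979\t29000000\t29300000\t280000\t300000\t60",
    "readB\t50000\t0\t50000\t-\tchr1\t248956422\t1000\t51000\t48000\t50000\t60",
    "readC\t80000\t0\t80000\t+\tchr6\t170805979\t27900000\t27980000\t76000\t80000\t60",
]

# Placeholder window on chr6; the real MHC coordinates should be taken from
# the reference annotation, not from this sketch.
mhc_reads = reads_in_region(example, "chr6", 28_000_000, 34_000_000)
print(sorted(mhc_reads))
```

The selected read names would then be used to pull the corresponding sequences out of the full read set before passing them to the assembler.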
The team took an initial look at where the longest reads (>250 kb) were localised in the MHC region and, as expected, the majority of reads spanned multiple genes, which, Julie stated, ‘provided confidence that resultant assembly would be highly contiguous’. Polishing with short sequencing reads was used to further improve the consensus sequence. The resulting, highly accurate MHC contig spanned 5 Mb, with genomic regions showing 100% sequence identity to cDNA sequences. Comparing this contig with existing BAC-based MHC contigs revealed many similarities but also identified a large insertion in the new data set, which, based on initial analysis, the team believe to be caused by a duplication event.
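The check on the longest reads, asking which genes each read longer than 250 kb spans, could be approximated as follows. This is a minimal sketch under stated assumptions: reads and genes are represented as simple (start, end) intervals on the same reference sequence, a gene counts as spanned only if it lies entirely within the read's aligned interval, and all names and coordinates below are invented for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end) on the same reference sequence


def genes_spanned(read: Interval, genes: Dict[str, Interval]) -> List[str]:
    """Return the names of genes that lie entirely within the read's interval."""
    r_start, r_end = read
    return sorted(
        name
        for name, (g_start, g_end) in genes.items()
        if r_start <= g_start and g_end <= r_end
    )


def summarise_long_reads(
    reads: Dict[str, Interval],
    genes: Dict[str, Interval],
    min_length: int = 250_000,
) -> Dict[str, List[str]]:
    """For each read longer than min_length, list the genes it fully spans."""
    return {
        name: genes_spanned(interval, genes)
        for name, interval in reads.items()
        if interval[1] - interval[0] > min_length
    }


# Hypothetical gene and read coordinates (not real annotation).
toy_genes = {
    "geneA": (100_000, 110_000),
    "geneB": (200_000, 215_000),
    "geneC": (400_000, 420_000),
}
toy_reads = {
    "read1": (50_000, 350_000),   # 300 kb, covers geneA and geneB
    "read2": (190_000, 260_000),  # 70 kb, below the length threshold
    "read3": (150_000, 450_000),  # 300 kb, covers geneB and geneC
}

summary = summarise_long_reads(toy_reads, toy_genes)
for read_name, spanned in summary.items():
    print(read_name, len(spanned), spanned)
```

Reads that each cover several neighbouring genes overlap one another across the repeats between those genes, which is the property that supports a contiguous assembly.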
Closing her presentation, Julie listed a number of planned research activities, including analysing the dataset for other immune gene regions of interest and assembling the full macaque genome. Further work will include characterising additional macaque genomes/MHC regions and exploring the feasibility of using target capture technologies for ultra-long reads.