Structural variants in the French-Canadian population

Biography

Sarah Reiling joined the Advanced Genomic Technologies group, led by Jiannis Ragoussis, at the McGill Genome Centre in 2019, where she works with nanopore sequencing technologies to better understand human diseases, human pathogens, and transplantation immunology. She has a multidisciplinary background in genetics, immunology, virology, parasitology, molecular biology, medical microbiology, and animal models and has worked on multiple international projects in Asia, Africa, Europe, and North America.

Sarah obtained her PhD from the McGill Institute of Parasitology for her work on drug resistance in malaria parasites. She then worked at the Health Canada Food Directorate, Bureau of Microbial Hazards (BMH), on food- and waterborne parasites that are a common source of infection among Canadians. Sarah set up and performed the first nanopore sequencing runs at the BMH.

Abstract

French Canadians of the province of Quebec are a classic example of a recent founder population, established four centuries ago. Population-specific structural variant (SV) detection in the Quebec population of French ancestry may aid in understanding SV-associated inherited genetic disorders. While short-read sequencing technologies have difficulties with SV detection, long-read sequencing technologies have the potential to resolve complex genomic structures, including SVs. Sequencing long and ultra-long reads of human genomes has become easier with the launch of the Oxford Nanopore Technologies PromethION platform. However, there is a lack of well-characterized bioinformatics pipelines for processing Nanopore reads.

Within the framework of generating the genetic “blueprint” of the Quebec population of French ancestry, we analyzed 14 mother and child pairs of French-Canadian origin using three technologies. We performed Nanopore sequencing using the PromethION instrument (R9.4 flow cells), Illumina whole genome sequencing (WGS), and 10x Genomics linked-read sequencing (10X). On average, we obtained 17x coverage after alignment (75 Gb of basecalled reads with an N50 of 24 kb) using PromethION, 44x coverage using WGS, and 36x coverage using 10X. Phased SVs were identified using a custom Python script that links SVIM supporting reads to haplotype tags, and SURVIVOR was then used to merge the phased SVIM calls within each pair. The intersection of the 10X linked-read and WGS-derived SV and phasing data was used to perform haplotype assembly with WhatsHap. This haplotype information was integrated with the Nanopore read-derived data.
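The custom script that links SV supporting reads to haplotype tags is not described in this abstract. As a rough illustration of the idea only, the sketch below assigns each SV to a haplotype by majority vote over its tagged supporting reads. The function name `phase_svs`, the input dictionaries, and the thresholds `min_tagged` and `min_fraction` are all hypothetical and are not taken from the authors' pipeline.

```python
from collections import Counter


def phase_svs(sv_reads, read_haplotypes, min_tagged=2, min_fraction=0.75):
    """Assign each SV a haplotype by majority vote over its tagged supporting reads.

    sv_reads:        SV id -> list of supporting read names (hypothetical input)
    read_haplotypes: read name -> haplotype (1 or 2); untagged reads are absent
    Returns SV id -> haplotype, or None when the SV cannot be phased.
    """
    phased = {}
    for sv_id, reads in sv_reads.items():
        # Keep only the supporting reads that carry a haplotype tag.
        tags = [read_haplotypes[r] for r in reads if r in read_haplotypes]
        if len(tags) < min_tagged:
            phased[sv_id] = None  # too few tagged reads to decide
            continue
        hap, count = Counter(tags).most_common(1)[0]
        # Require a clear majority; conflicting tags leave the SV unphased.
        phased[sv_id] = hap if count / len(tags) >= min_fraction else None
    return phased


# Toy data, invented for illustration.
sv_reads = {
    "sv1": ["r1", "r2", "r3"],
    "sv2": ["r4", "r5"],
    "sv3": ["r6", "r7", "r8", "r9"],
}
read_haplotypes = {"r1": 1, "r2": 1, "r4": 2, "r6": 1, "r7": 2, "r8": 1, "r9": 2}
result = phase_svs(sv_reads, read_haplotypes)
print(result)
```

In this toy example only `sv1` is phased: `sv2` has a single tagged read, and the tags supporting `sv3` are split evenly between the two haplotypes.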

Per individual, we detected ~30,000 filtered SVs, about half of which were shared between mother and child. Despite variability in genome coverage between samples, we were able to phase 40-50% of the aligned reads in each sample. We identified ~2,500-3,500 shared phased SVs in each mother and child pair.
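To make the notion of a "shared" SV concrete, the following minimal sketch counts child SVs that have a mother SV of the same type on the same chromosome within a breakpoint tolerance. This is an assumption-laden illustration, not the merging procedure used in the study: the tuple layout, the function `shared_svs`, and the 500 bp `max_dist` are hypothetical.

```python
def shared_svs(mother, child, max_dist=500):
    """Return child SVs matched by a mother SV of the same type and chromosome.

    Each SV is a (chromosome, position, sv_type) tuple. Two SVs match when
    their positions differ by at most max_dist bp (hypothetical tolerance).
    """
    # Index mother positions by (chromosome, type) for quick lookup.
    by_key = {}
    for chrom, pos, svtype in mother:
        by_key.setdefault((chrom, svtype), []).append(pos)

    shared = []
    for chrom, pos, svtype in child:
        candidates = by_key.get((chrom, svtype), [])
        if any(abs(pos - m) <= max_dist for m in candidates):
            shared.append((chrom, pos, svtype))
    return shared


# Toy calls, invented for illustration.
mother = [("chr1", 10_000, "DEL"), ("chr2", 50_000, "INS"), ("chr3", 7_000, "DEL")]
child = [("chr1", 10_120, "DEL"), ("chr2", 50_000, "DEL"), ("chr4", 900, "INS")]
common = shared_svs(mother, child)
fraction = len(common) / len(child)
print(common, round(fraction, 2))
```

Here only the chr1 deletion is counted as shared; the chr2 calls sit at the same position but differ in type, so they are not matched.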

In conclusion, we were able to call SVs using Nanopore data, phase these SVs, and integrate these data with those obtained from the 10X linked-read and Illumina WGS technologies. As a result, we produced, for the first time, high-quality genome sequences representative of the Quebec population of French ancestry and identified genome-wide SVs. This will allow annotation of these genomes in terms of SVs and their possible functional consequences through detailed gene annotation integration. The phased SV information can then be used to impute SVs from WGS data derived from Gen3G, a large Quebec birth cohort, in order to establish a reference (blueprint) genome for Canadians of French ancestry living in the province of Quebec.

Authors: Sarah Reiling