Comprehensive and cost-effective characterisation of the Y chromosome

Mammalian Y chromosomes are often omitted from genomic analyses because they are inherently difficult to assemble, owing to their high content of repetitive DNA and palindromes. To date, just a single reference-quality human Y chromosome, of European ancestry, is available, which increases the potential for reference bias and means that significant genomic variation present in other populations may be overlooked1.

Isolating the Y chromosome using flow cytometry can simplify the assembly challenge by reducing the overlap of its repetitive DNA with that on other chromosomes. However, traditional short-read sequencing technologies offer an imperfect solution, due to their limited read lengths, amplification bias and removal of epigenetic modifications.

To overcome the limitations of short-read technology and address the lack of reference-quality Y chromosomes, researchers from Spain developed a novel strategy to sequence native, unamplified, flow-sorted DNA of African ancestry using the MinION. Approximately 9 million Y chromosomes were sorted from a lymphoblastoid cell line (HG02982), whose haplogroup (A0) represents one of the earliest known human lineages.

Nanopore sequencing of the flow-sorted chromosomes generated 2.3 Gb of data with an average read N50* of ~18 kb. De novo assembly and sequence polishing were carried out using the Canu2 and nanopolish3 tools respectively, with further polishing against short-read data performed using Pilon4. The final assembly totalled 21.5 Mb in length and comprised 35 contigs with a contig N50 of 1.46 Mb.

The researchers commented that this technique: ‘…constitutes a significant improvement over comparable previous methods, increasing continuity by more than 800%’ (Figure 1)1.

Figure 1: Assembly contiguity comparison between the human HG02982 and gorilla Y chromosomes. The size of each rectangle corresponds to the size of a contig within each assembly. The HG02982 assembly, which combined long nanopore reads with short-read sequencing, displayed significantly higher contiguity than the recently published gorilla assembly, which utilised an alternative ‘long’-read sequencing technology combined with short-read sequencing. Both sequencing data sets were derived from flow-sorted Y chromosomes.

Comparing the assembly against the GRCh38 reference, the team were able to identify extensive genic copy number variation, with expansions in 5 of the 9 multi-copy genes, four of which are implicated in male infertility. They also identified 347 structural variants of over 50 bp in size between the two assemblies.

The team were also able to detect the epigenetic modification 5-methylcytosine alongside the nucleotide sequence, demonstrating good correlation with data obtained using whole genome bisulfite sequencing.
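One common way to quantify agreement between two methylation data sets is a Pearson correlation of per-site methylation frequencies. The sketch below is illustrative only: the function name `pearson` and the example frequencies are hypothetical, and it is not the analysis performed by the authors.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-site methylation frequencies (fraction of reads methylated)
# at the same five CpG sites, as called by two different methods.
nanopore_freq = [0.05, 0.20, 0.55, 0.80, 0.95]
bisulfite_freq = [0.10, 0.25, 0.50, 0.85, 0.90]

print(f"Pearson r = {pearson(nanopore_freq, bisulfite_freq):.3f}")
```

A value close to 1 would indicate that sites called as highly methylated by one method are also called as highly methylated by the other.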

This study highlights how long nanopore sequencing reads can be used to deliver new insights into complex genomic regions that have previously proven challenging to analyse using traditional sequencing technology. Commenting on this research, the team suggest that:

‘Given the current developments in sequencing throughput, a single MinION flowcell should now be sufficient to assemble a whole human Y chromosome. Furthermore, it is becoming clear that the upper read length boundary is only delimited by the integrity of the DNA, suggesting the possibility that complete Y chromosome assemblies, including full resolution of amplicons, might be possible in the near future’1.

Expanding on this possibility, recent research led by Dr. Karen Miga at the University of California, Santa Cruz demonstrated the use of nanopore technology to deliver the first complete and accurate sequence of a human centromere5. Human centromeres are composed of long tracts of near-identical tandem repeats, making them intractable to assembly using short-read sequencing technology. Using the long reads generated by nanopore technology, the team sequenced eight BAC clones that together spanned the Y chromosome centromere.

In total, the team generated over 3,500 reads that were greater than 150 kb in length. Consensus sequence polishing was performed using the BLASR tool, and variants were validated against short-read sequencing data. These informative markers, together with structural variants, allowed alignment of the BAC consensus sequences, revealing the centromere to be 365 kb in length (Figure 2). According to the researchers, their assembly: ‘enables the precise number of repeats in an array to be robustly measured and resolves the order, orientation, and density of both repeat-length variants across the full extent of the array. This work could potentially advance studies of centromere evolution and function and may aid ongoing efforts to complete the human genome’5. The team are now optimising the methodology to sequence centromeres directly from whole genomic DNA without the requirement for BACs.

Figure 2: Assembly of the human Y chromosome centromere. Eight BAC clones covering the entire centromere were ordered using sequence variants. The centromere is dominated by 5.8 kb higher order repeats (HOR) (light blue boxes) interspersed by HOR variants (purple boxes). Highly divergent monomeric alpha satellite is indicated in dark blue. Figure adapted from Jain et al.5

* The N50 value is the fragment length at which half of the total data are contained in fragments of that length or greater.
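To make the N50 definition concrete, the following minimal sketch computes it from a list of read or contig lengths. The function name `n50` and the example lengths are hypothetical and chosen only for illustration; they are not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of read or contig lengths.

    The N50 is the length L such that fragments of length L or greater
    together contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)  # longest fragments first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first fragment that takes the running total to half of the
        # data (or beyond) defines the N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths in Mb: the total is 20, so half is 10.
# The two longest contigs (6 + 5 = 11) reach that threshold.
contigs_mb = [2, 3, 4, 5, 6]
print(n50(contigs_mb))  # prints 5
```

Because the value is weighted by length, a few long contigs raise the N50 far more than many short ones, which is why it is widely used as a measure of assembly contiguity.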

This case study is taken from the human white paper.

1. Kuderna, L.F.K. et al. Selective single molecule sequencing and assembly of a human Y chromosome of African origin. bioRxiv 342667 (2018).

2. GitHub. CANU. Available at: [Accessed: 20 August 2018]

3. GitHub. Nanopolish. Available at: [Accessed: 20 August 2018]

4. GitHub. Pilon. Available at: [Accessed: 20 August 2018]

5. Jain, M. et al. Linear assembly of a human centromere on the Y chromosome. Nat Biotechnol. 36(4):321-323 (2018).