High-quality Arabidopsis thaliana genome assembly with Nanopore and HiFi long reads

Here, we report a high-quality (HQ), almost complete genome assembly of the model plant Arabidopsis thaliana ecotype Columbia (Col-0), with a single gap and a quality value (QV) larger than 60. The assembly was generated using a combination of Oxford Nanopore Technologies (ONT) ultra-long reads, high-fidelity (HiFi) reads and Hi-C data. The total genome assembly size is 133,877,291 bp (chr1: 32,659,241 bp, chr2: 22,712,559 bp, chr3: 26,161,332 bp, chr4: 22,250,686 bp and chr5: 30,093,473 bp), and it introduces 14.73 Mb of novel sequence (96% of which belongs to centromeres) compared with the TAIR10.1 reference genome.
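As a quick consistency check on the figures above, the following minimal Python sketch sums the reported chromosome lengths and estimates how much of the novel sequence is centromeric. The variable names are illustrative only, and the 96% figure is taken at face value from the text.

```python
# Chromosome lengths (bp) exactly as reported in the text.
chrom_lengths = {
    "chr1": 32_659_241,
    "chr2": 22_712_559,
    "chr3": 26_161_332,
    "chr4": 22_250_686,
    "chr5": 30_093_473,
}

# Total assembly size: sum of the five chromosome lengths.
total_bp = sum(chrom_lengths.values())

# Novel sequence relative to TAIR10.1 (Mb) and the reported centromeric share.
novel_mb = 14.73
centromeric_fraction = 0.96
centromeric_novel_mb = novel_mb * centromeric_fraction

print(f"Total assembly size: {total_bp:,} bp")
print(f"Novel centromeric sequence: ~{centromeric_novel_mb:.2f} Mb")
```

The sum reproduces the stated total of 133,877,291 bp, and the centromeric share of the novel sequence works out to roughly 14.14 Mb.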

All five chromosomes of our HQ assembly are highly accurate, with QV values ranging from QV62 to QV68, which is substantially higher than those of the TAIR10.1 reference (44-51) and a recently published genome (41-43). We have completely resolved chr3 and chr5 from telomere to telomere. Chr2 and chr4 are completely resolved apart from the nucleolar organizing regions, which are composed of long, highly repetitive DNA fragments. Centromere 1 has been reported to be about 9 Mb long and is hard to assemble because it contains tens of thousands of CEN180 satellite repeats. Using this cutting-edge sequencing data, we assembled about 4 Mb of continuous sequence of centromere 1.
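To put the QV figures in perspective, the sketch below converts a quality value into a per-base error probability. It assumes the usual Phred-scaled convention, QV = -10 * log10(P_error); the text does not state how QV was computed, so this convention and the function names are assumptions made for illustration.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability for a Phred-scaled quality value (assumed convention)."""
    return 10 ** (-qv / 10)


def expected_errors(qv: float, length_bp: int) -> float:
    """Expected number of erroneous bases over a sequence of the given length."""
    return qv_to_error_rate(qv) * length_bp


assembly_bp = 133_877_291  # total assembly size reported in the text

# Compare the QV60 threshold with the lower bound reported for TAIR10.1 (QV44).
for qv in (60, 44):
    rate = qv_to_error_rate(qv)
    errors = expected_errors(qv, assembly_bp)
    print(f"QV{qv}: error rate {rate:.2e}, ~{errors:,.0f} expected errors genome-wide")
```

Under this convention, QV60 corresponds to about one error per million bases, so every 10-point increase in QV represents a tenfold reduction in the expected error rate.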

We found different sequence identity patterns across the five centromeres, and all centromeres were significantly enriched with CENH3 ChIP-seq signals, supporting the accuracy of the assembly. We obtained four clusters of CEN180 repeats and found that CENH3 showed a strong preference for cluster 3. Moreover, we observed hypomethylation patterns in CENH3-enriched regions. This high-quality genome assembly will be a valuable reference for understanding global patterns of centromeric polymorphism, both genetic and epigenetic, in naturally inbred lines of Arabidopsis thaliana.

Authors: Bo Wang, Yanyan Jia, Peng Jia, Quanbin Dong, Xiaofei Yang, Kai Ye