Fig. 3 Haplotype-resolved assemblies. a) Concept. b, c) Collapsed and trio-binned ONT assemblies, respectively. d) Pipeline. e) Phasing. f, g) Resolving haplotypes. h, i) Final assemblies.
Many assemblers collapse a diploid genome into a single haploid assembly in which variants from the two haplotypes are mixed arbitrarily (Fig. 3a). As a result, each contig or scaffold of a collapsed assembly contains k-mers from both parents (Fig. 3b). A separate assembly for each haplotype is preferable, and this is often achieved by trio binning: k-mers unique to each parent are extracted from the parental sequencing data and used to separate the offspring's reads into a paternal and a maternal set, which are then assembled separately. This yields one assembly per haplotype in which each contig or scaffold contains k-mers from only one parent (Fig. 3c). However, parental data are not always available. Here we present an alternative in which phasing based on long reads and Pore-C allows reads to be separated into haplotypes without parental data. The pipeline, which is based on DipASM, first assembles the ONT reads into a collapsed assembly, then aligns the long reads back to that assembly and calls variants (Fig. 3d). These variants are then phased into chromosome-scale phase blocks. We obtain a single phase block per chromosome that contains virtually all variants with correct phasing (Fig. 3e). Next, reads are tagged using the phased variants and separated by haplotype; the vast majority of base pairs can be phased in this way. A final assembly step yields a chromosome-scale assembly for each haplotype. Each resulting scaffold stems from either the paternal or the maternal haplotype, and the scaffold N50s are on the scale of the human reference (Figs. 3f and 3g). Without trio information, however, it is difficult to determine which scaffolds are maternal and which are paternal, so each of the two assemblies is a mix of paternal and maternal scaffolds. Finally, Figs. 3h and 3i show dot plots comparing each assembled haplotype with the T2T CHM13 assembly.
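The trio-binning idea described above can be illustrated with a small toy sketch. It is not the implementation of any particular tool: the function names (`kmers`, `parent_specific_kmers`, `bin_read`), the very short k, and the simple majority vote are assumptions made for illustration. Real pipelines typically use much larger k, canonical (strand-independent) k-mers, and filtering of error k-mers, all of which are omitted here.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(paternal_reads, maternal_reads, k):
    """Return (paternal-only, maternal-only) k-mer sets.

    A k-mer is parent-specific if it occurs in one parent's reads
    and never in the other's.
    """
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    return pat - mat, mat - pat


def bin_read(read, pat_only, mat_only, k):
    """Assign an offspring read to a parent by counting parent-specific k-mers.

    Hypothetical rule: simple majority; ties and reads without any
    parent-specific k-mer are left unassigned.
    """
    read_kmers = kmers(read, k)
    n_pat = len(read_kmers & pat_only)
    n_mat = len(read_kmers & mat_only)
    if n_pat > n_mat:
        return "paternal"
    if n_mat > n_pat:
        return "maternal"
    return "unassigned"


# Toy data: the parents differ only at the 3' end of the sequence.
K = 4
pat_only, mat_only = parent_specific_kmers(["ACGTACGTTT"], ["ACGTACGAAA"], K)

for read in ["TACGTTT", "TACGAAA", "ACGTAC"]:
    print(read, bin_read(read, pat_only, mat_only, K))
```

In this toy example the first read carries only paternal-specific k-mers, the second only maternal-specific ones, and the third lies entirely in sequence shared by both parents, so it stays unassigned. Reads of the third kind are why binning cannot separate regions where the parents are identical.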