Main menu

Hybrid clustering of long and short-read for improved metagenome assembly


Next-generation sequencing has enabled metagenomics, the study of the genomes of microorganisms sampled directly from the environment without cultivation. We previously developed a proof-of-concept, scalable metagenome clustering algorithm based on Apache Spark to cluster sequence reads according to their species of origin. To overcome its under-clustering problem on short-read sequences, in this study we developed a new, two-step Label Propagation Algorithm (LPA) that first forms clusters of long reads and then recruits short reads to these clusters.

Compared to alternative label propagation strategies, this hybrid clustering algorithm (hybrid-LPA) yields significantly larger read clusters without compromising cluster purity. We show that adding an extra clustering step before assembly leads to improved metagenome assemblies, predicting more complete genomes or gene clusters from a synthetic metagenome dataset and a real-world metagenome dataset, respectively.

These results suggest that hybrid-LPA is a good alternative to current metagenome assembly practice by providing benefits in both scalability and accuracy on large metagenome datasets.

Authors: Yakang Lu, Lizhen Shi, Marc W. Van Goethem, Volkan Sevim, Michael Mascagni, Li Deng, Zhong Wang

Getting started

Buy a MinION starter pack Nanopore store Sequencing service providers Channel partners

Nanopore technology

Subscribe to Nanopore updates Resources and publications What is the Nanopore Community

About Oxford Nanopore

News Company timeline Sustainability Leadership team Media resources & contacts For investors For partners Working at Oxford Nanopore Current vacancies Commercial information BSI 27001 accreditationBSI 90001 accreditationBSI mark of trust
English flag