Inferring protein sequence-function relationships with large-scale positive-unlabeled learning

Published on: August 20, 2020
Source: bioRxiv

Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. However, it is challenging to apply existing supervised learning frameworks to the large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and missing data.

Importantly, most DMS data do not contain examples of negative sequences, which makes it difficult to estimate directly how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function data sets, representing proteins of different folds, functions, and library types.

The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.
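
To make the positive-unlabeled setting concrete, the following is a minimal sketch and not the authors' estimator, which this abstract does not spell out. It swaps in a well-known alternative, the Elkan and Noto classifier correction, and assumes that labeled positives are selected completely at random from all functional sequences. The toy sequences, the helper names (`one_hot`, `fit_logistic`), and the hyperparameters are all hypothetical and chosen only for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seqs):
    """Encode equal-length sequences as a flat (n, length * 20) one-hot matrix."""
    n, length = len(seqs), len(seqs[0])
    X = np.zeros((n, length, len(AMINO_ACIDS)))
    for i, seq in enumerate(seqs):
        for j, aa in enumerate(seq):
            X[i, j, AA_INDEX[aa]] = 1.0
    return X.reshape(n, -1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, s, lr=0.5, steps=2000, l2=1e-3):
    """Gradient-descent logistic regression of s (1 = labeled positive) on X."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        grad = sigmoid(X @ w + b) - s          # gradient of the log loss
        w -= lr * (X.T @ grad / len(s) + l2 * w)
        b -= lr * grad.mean()
    return w, b

# Hypothetical toy data: a selected (positive) pool and an unselected
# (unlabeled) pool that mixes functional and nonfunctional variants.
positives = ["ACD", "ACE", "ACD", "AKD"]
unlabeled = ["ACD", "WCD", "WKE", "AKE", "WCE", "ACE"]
seqs = positives + unlabeled
s = np.array([1.0] * len(positives) + [0.0] * len(unlabeled))

X = one_hot(seqs)
w, b = fit_logistic(X, s)

# g(x) estimates P(labeled | x), not P(functional | x).
g = sigmoid(X @ w + b)

# Elkan-Noto correction: c estimates P(labeled | functional) as the mean of
# g over labeled positives. A held-out set should be used in practice; the
# training set is reused here only to keep the sketch short.
c = g[s == 1].mean()
p_functional = np.clip(g / c, 0.0, 1.0)

for seq, p in zip(seqs, p_functional):
    print(seq, round(float(p), 3))
```

The key design point is that a classifier trained to separate labeled from unlabeled sequences underestimates the probability of function, because the unlabeled pool still contains functional variants. Dividing by `c` rescales the scores under the stated sampling assumption.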

Authors: Hyebin Song, Bennett J. Bremer, Emily C. Hinds, Garvesh Raskutti, Philip A. Romero
