Language Model-Based Deep Neural Network Protein Fitness and Annotation Prediction

Background & Question 
Advancements in DNA sequencing technologies, particularly next-generation sequencing, have accelerated the discovery of numerous genes from a wide variety of species. The resulting growth in known protein sequences creates an opportunity to expand protein engineering, but it also presents a challenge because the molecular functions of many gene products are poorly annotated. Protein engineering holds great promise for a wide range of human endeavors, such as the development of therapeutic drugs and gene-editing tools, by producing protein variants that enhance the original function or perform entirely novel ones [1]. Machine learning (ML) has been increasingly adopted for protein engineering alongside computational, physics-based rational design. A critical component of employing ML for protein engineering is the development of a model that predicts the fitness of a protein given its sequence. This approach has already yielded several algorithms that successfully predict the effect of a mutation on function given evolutionary information from homologous sequences [2][3]. Protein language models (PLMs) have been shown to generate state-of-the-art representations of biological properties and to achieve impressive performance on protein prediction tasks [4].
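For context, a per-residue PLM representation of the kind referenced above can be obtained from a publicly available model. The sketch below uses the ESM-2 model from the fair-esm package purely as an illustration; the specific model, sequence, and residue index are placeholders, not choices made in this proposal.

```python
import torch
import esm  # fair-esm package; one example PLM implementation

# Load a pretrained ESM-2 model and its tokenizer (any comparable PLM would work similarly).
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# Placeholder sequence; in practice this would be the protein of interest.
data = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, batch_tokens = batch_converter(data)

# Extract per-residue representations from the final transformer layer.
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33])
token_representations = results["representations"][33]

# Embedding of the residue at position i (index i + 1 skips the beginning-of-sequence token).
i = 10
residue_embedding = token_representations[0, i + 1]
```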

During the spring of 2023, I found that a multilayer perceptron that takes as input a PLM embedding of a mutated amino acid within its protein sequence can accurately predict protein function (Fig. 1). However, the deep mutational scanning (DMS) assay data used to train this multilayer perceptron were produced in vitro and are therefore subject to the errors and noise that affect all in vitro experiments [5]. To perform data valuation, cull harmful data points, and ultimately improve the quality of the training data, I will integrate a novel data-quality valuation method into the previously developed language model-based deep neural network for protein fitness prediction.
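The supervised component described above can be sketched as a small regression head: a PLM embedding of the mutated residue is fed to a multilayer perceptron that predicts a DMS fitness score. The code below is a minimal PyTorch sketch with placeholder tensors (`embeddings`, `fitness`) standing in for real PLM embeddings and assay labels; it illustrates the general approach rather than the exact architecture from Fig. 1.

```python
import torch
import torch.nn as nn

class FitnessMLP(nn.Module):
    """Small MLP head mapping a PLM embedding to a scalar fitness score."""
    def __init__(self, embed_dim: int = 1280, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

# Placeholder training data: per-variant PLM embeddings and DMS fitness labels.
embeddings = torch.randn(1000, 1280)
fitness = torch.randn(1000)

model = FitnessMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(embeddings), fitness)
    loss.backward()
    optimizer.step()
```

A data valuation step (for instance, influence- or Shapley-style scoring of training points) could then flag variants whose removal improves held-out performance before retraining on the cleaned set; the specific valuation method to be integrated is the novel contribution proposed here and is not fixed by this sketch.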

Student Name
Boysen, Joanne
Faculty Mentor
Yunan Luo