Identifying Structural Features in Prokaryotic Short Gene Prediction using Protein Language Models

ABSTRACT
Prokaryotic gene prediction is widely misconceived as a solved problem in bioinformatics. Two challenges stand out: 1) existing tools are highly sensitive in finding long genes, but their sensitivity decreases noticeably for short genes (<180 nucleotides), which they can hardly detect; and 2) false positive predictions from these gene annotation methods give rise to hypothetical proteins, which may or may not be accurate, so it is crucial to explore the reasons behind these false positives. This proposal investigates whether the predicted 3D structure of potentially encoded proteins can enhance the accuracy of ProtiGeno, a transformer-based deep learning tool for prokaryotic gene prediction developed in our prior project. The objective is to obtain predicted 3D structures of the proteins potentially encoded by both coding and noncoding regions in prokaryotes, using state-of-the-art techniques such as AlphaFold2 and ESMFold. With these structures, we seek to explore whether secondary structure patterns correlate with coding potential and can serve as distinguishing features in the prediction of prokaryotic genes. This study has the potential to advance our understanding of gene prediction methodologies and to contribute to the refinement of ProtiGeno's predictive capabilities.

Student Name
Sankar Ramalaxmi, Gautham
Faculty Mentor
Amirali Aghazadeh