Foundation Models for Proteins
Independent research, University of Colorado Denver
Protein modeling is central to computational drug discovery, yet how sequences are broken into tokens determines which patterns machine learning models can capture. Standard approaches, such as single-amino-acid tokens or generic subword tokenizers, often fragment true biological motifs and weaken performance in small-data, target-specific settings. This project aims to design a simple, reproducible tokenizer that yields stronger sequence representations under limited data. I am systematically comparing strategies, including fixed k-mers, variable-length subwords, and boundary-aware slicing around conserved motifs, then feeding each into lightweight encoders for self-supervised pretraining and downstream tasks such as enzyme classification. I am implementing these tokenization variants, reproducing baselines, running controlled ablations, and logging results for reproducibility. Early stages are complete, with tokenizers under development and baseline runs in progress. The expected impact is a validated framework showing how tokenization choices shape model performance in drug discovery pipelines where labeled data are scarce.
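The fixed k-mer baseline mentioned above can be sketched in a few lines. The function name, defaults, and example sequence here are illustrative, not taken from the project's codebase:

```python
def kmer_tokenize(seq, k=3, stride=1):
    """Split a protein sequence into overlapping k-mer tokens.

    With stride=1 this produces every window of length k, so a motif
    spanning a window boundary is still captured by some token.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

print(kmer_tokenize("MKTAYIAK", k=3))
# ['MKT', 'KTA', 'TAY', 'AYI', 'YIA', 'IAK']
```

Setting stride=k instead yields non-overlapping k-mers, which shrinks the token count but can split a conserved motif across two tokens; comparing these variants is one of the controlled ablations described above.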
Building on this foundation, the project has progressed to a structure- and energy-guided tokenizer, ProteKenz, which uses Q-learning and PyRosetta energy evaluations to learn biologically stable token boundaries. Rather than relying solely on sequence statistics, the model rewards token splits that minimize van der Waals clashes and solvation energy while preserving secondary-structure units such as helices and sheets. Across 807 AlphaFold protein structures and controlled BioBERT evaluations, structure-aware tokens achieved higher classification accuracy (93.11% vs. 91.22%), a higher Matthews correlation coefficient (MCC), and lower test loss under identical training conditions. These results demonstrate that integrating protein physics into tokenization yields more informative representations, particularly in low-data and target-specific modeling settings.
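A minimal, hypothetical sketch of the Q-learning loop described above. Here `energy_fn` stands in for the PyRosetta energy evaluation (clash and solvation terms); the state/action encoding, hyperparameters, and all names are illustrative assumptions, not the ProteKenz implementation:

```python
import random
from collections import defaultdict

def segments(seq, boundaries):
    """Cut a sequence at the given boundary indices."""
    cuts = [0] + sorted(boundaries) + [len(seq)]
    return [seq[a:b] for a, b in zip(cuts, cuts[1:])]

def learn_boundaries(seq, energy_fn, episodes=200, alpha=0.5, gamma=0.9, eps=0.2):
    """Tabular Q-learning over split/keep decisions at each position.

    energy_fn(segment) is a stand-in for a structure-based score
    (lower is better); the reward favors boundary sets whose segments
    have low total energy.
    """
    Q = defaultdict(float)  # (position, action) -> estimated value
    for _ in range(episodes):
        boundaries = []
        for pos in range(1, len(seq)):
            # epsilon-greedy choice: 1 = place a boundary, 0 = keep going
            if random.random() < eps:
                action = random.choice((0, 1))
            else:
                action = max((0, 1), key=lambda a: Q[(pos, a)])
            if action:
                boundaries.append(pos)
            reward = -sum(energy_fn(s) for s in segments(seq, boundaries))
            # standard Q-learning update toward the bootstrapped target
            target = reward + gamma * max(Q[(pos + 1, 0)], Q[(pos + 1, 1)])
            Q[(pos, action)] += alpha * (target - Q[(pos, action)])
    # greedy policy after training: split wherever splitting looks better
    return [p for p in range(1, len(seq)) if Q[(p, 1)] > Q[(p, 0)]]
```

In the full system the reward would come from structure-aware terms (van der Waals repulsion, solvation) evaluated on AlphaFold models, and the state would encode local secondary structure rather than bare position.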