Foundation Models for Proteins
Independent research, University of Colorado Denver
Protein modeling is central to computational drug discovery, and how sequences are broken into tokens determines what patterns machine learning models can capture. Standard approaches, such as single amino acids or generic subword tokenizers, often fragment true biological motifs and weaken performance in small-data, target-specific settings. This project aims to design a simple, reproducible tokenizer that yields stronger sequence representations under limited data. I am systematically comparing strategies including fixed k-mers, variable-length subwords, and boundary-aware slicing around conserved motifs, feeding each into lightweight encoders for self-supervised pretraining and downstream tasks such as enzyme classification. I am implementing these tokenization variants, reproducing baselines, running controlled ablations, and logging results for reproducibility. Early stages are complete: the tokenizers are under development and baseline runs are in progress. The expected impact is a validated framework showing how tokenization choices shape model performance in drug discovery pipelines where labeled data are scarce.
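As a minimal sketch of the simplest strategy compared above, fixed k-mer tokenization can be illustrated as follows. The function name, parameters, and example sequence are hypothetical and not taken from the project's actual implementation:

```python
# Hypothetical sketch of fixed k-mer tokenization for protein sequences.
def kmer_tokenize(sequence: str, k: int = 3, stride: int = 3) -> list[str]:
    """Split a protein sequence into fixed-length k-mer tokens.

    stride == k gives non-overlapping tokens; stride < k gives
    overlapping windows (a common alternative). Any trailing residues
    shorter than k are dropped in this simple version.
    """
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

# Example: tokenize a short peptide into non-overlapping 3-mers.
print(kmer_tokenize("MKTAYIAKQR", k=3, stride=3))  # ['MKT', 'AYI', 'AKQ']
```

The variable-length subword and boundary-aware variants would replace the fixed window with data-driven merges or motif-boundary cuts, but share the same interface: a sequence in, a list of string tokens out.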