Here, we introduce PTMGPT2, a suite of models capable of generating tokens that signify modified protein sequences, crucial for identifying PTM sites. At the core of this platform is PROTGPT2, an autoregressive transformer model. We adapted PROTGPT2 as a pre-trained base and fine-tuned it for the specific task of generating classification labels for a given PTM type. Uniquely, PTMGPT2 uses a decoder-only architecture, which eliminates the need for a task-specific classification head during training. Instead, the final layer of the decoder functions as a projection back to the vocabulary space, generating the next most probable token based on the patterns learned among the tokens in the input prompt.
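To make this concrete, the short sketch below is illustrative only: it assumes the publicly available nferruz/ProtGPT2 checkpoint on the Hugging Face Hub and a hypothetical sequence-window prompt, not the exact PTMGPT2 artifacts. It shows how a decoder-only model predicts without a classification head: the language-model head projects the final hidden state back onto the vocabulary, and the most probable next token is read off as the prediction.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Pre-trained base model; a fine-tuned PTMGPT2 checkpoint would be used in practice.
tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")
model.eval()

prompt = "MKVLAAGICSTPK"  # hypothetical window around a candidate site
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (batch, seq_len, vocab_size)
next_token_id = logits[0, -1].argmax().item()  # projection back to vocabulary space
print(tokenizer.decode(next_token_id))         # after fine-tuning, this would be a label token

Because prediction is just next-token generation, fine-tuning only has to teach the model to emit the correct label token after the prompt; no additional output layer is introduced.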
Link - (https://nsclbio.jbnu.ac.kr/GPT_model/)
Contact us directly at [email protected] for bulk predictions and trained models
Link - (https://nsclbio.jbnu.ac.kr/tools/ptmgpt2/)
Link - (https://doi.org/10.5281/zenodo.11371883)
Link - (https://zenodo.org/records/11362322)
Link - (https://doi.org/10.5281/zenodo.11377398)
python 3.11.3
transformers 4.29.2
scikit-learn 1.2.2
pytorch 2.0.1
pytorch-cuda 11.7
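The following snippet is a minimal check, assuming the standard PyPI distributions of these packages, to confirm that an installed environment matches the versions listed above.

import sys
import torch
import transformers
import sklearn

print("python      ", sys.version.split()[0])
print("transformers", transformers.__version__)
print("scikit-learn", sklearn.__version__)
print("pytorch     ", torch.__version__)
print("cuda        ", torch.version.cuda, "available:", torch.cuda.is_available())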
• Model: This folder hosts a sample model designed to predict PTM sites from given
protein sequences, illustrating PTMGPT2’s application.
• Tokenizer: This folder contains a sample tokenizer responsible for tokenizing
protein sequences, including handcrafted tokens for specific amino acids or motifs.
• Inference.ipynb: This file provides executable code for applying the PTMGPT2 model
  and tokenizer to predict PTM sites, serving as a practical guide for users applying
  the model to their own datasets (a minimal usage sketch follows this list).
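As a rough guide to the kind of workflow Inference.ipynb walks through, the sketch below loads the sample model and tokenizer from the Model and Tokenizer folders listed above and generates a label token for a candidate-site sequence window. The prompt layout, window length, and label vocabulary here are assumptions for illustration, not the exact notebook contents.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the sample artifacts from the repository folders described above.
tokenizer = AutoTokenizer.from_pretrained("Tokenizer")
model = AutoModelForCausalLM.from_pretrained("Model")
model.eval()

def predict_site(window: str) -> str:
    """Generate a label token for a sequence window around a candidate residue."""
    inputs = tokenizer(window, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=1,                      # only the label token is needed
            do_sample=False,                       # greedy decoding
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated token after the input prompt.
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:]).strip()

# Example call with an illustrative 21-residue window centred on a candidate serine.
print(predict_site("LKRATSVEGPRSQSPLSAHSR"))

Greedy decoding with max_new_tokens=1 mirrors the decoder-only setup described above, where a single generated label token per prompt constitutes the prediction.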