OTMTD

Directory Structure

├── example.ipynb                  # Example of OTMTD
├── otmtd_cal.ipynb                # OTMTD metric calculation
├── emprical_relate.ipynb          # Correlation analysis between OTMTD and empirical results
├── ot_baselines_cal.ipynb         # OT-based baseline metrics
├── non_ot_baselines_cal.ipynb     # Non-OT-based baseline metrics
├── metrics_perf_comp.ipynb        # Comparison of different metrics
├── requirements.txt               # Environment dependencies
├── README.md                      # Readme file
├── otmtd/                         # Definition files for OTMTD
├── otdd/                          # Definition files for OTDD
├── otce/                          # Definition files for OTCE
├── represent/                     # Represent protein tasks using MASSA
├── processed_data                 # Processed protein downstream task text data
├── protein_embeddings_MultiTasks  # Pre-trained or downstream task embeddings
├── cv_emb                         # Represent CV tasks as embeddings

Requirements

Python >= 3.9.1, torch 1.8.1

conda create -n otmtd python==3.9.1
conda activate otmtd
cd OTMTD
pip install -r requirements.txt

Example

example.ipynb demonstrates the transferability calculation from pre-training to the Fluorescence task. Note that the embeddings used in the example must first be downloaded from
https://drive.google.com/drive/folders/1RTphom46oGlJlnw52NSABMNQurWldhJi?usp=sharing.

1. Data Processing

Raw data for the protein downstream tasks can be downloaded from https://drive.google.com/drive/folders/1BYzf2RJFcMnT_8Cf_F0Gu_ZWGvM7Z0eY?usp=sharing. The data formats and dataset sizes are as follows:

  • Without UniProt ID

    task                 seq          label
    Stability            DQSVRKLV...  -0.2099
    Fluorescence         SKGEELFT...  3.7107
    Remote Homology      PKKVLTGV...  51
    Secondary Structure  MNDKRLQF...  22222000...
    Signal Peptide       MLGMIRNS...  0
    Fold Classes         MSPFTGSA...  c
  • With UniProt ID

    • PDBBind

      uniprot_id  seq          smiles         rdkit_smiles   label  dataset_type
      11gs        PYTVVYFP...  OC(=O)c1cc...  O=C(O)c1cc...  4.62   train
    • Kinase

      molecule              uniprot_id  seq          label
      COC1C(N(C)C(C)=O)...  P05129      MAGLGPGV...  1
  • Size of datasets

    Split  Stability  Fluorescence  Remote Homology  Secondary Structure  Signal Peptide  Fold Classes  PDBBind  Kinase
    Train  53614      21446         12312            8678                 16606           15680         11906    91552
    Valid  2512       5362          736              2170                 /               /             1000     /
    Test   12851      27217         718              513                  4152            3921          290      19685

Datasets without UniProt IDs are then manually assigned an identifier in the uniprot_id field, following the template <task>_<dataset_type>_<number>, e.g., fluo_train_17878.
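
The identifier template can be sketched in a few lines of Python. This is only an illustration, not code from the repository: the function name make_ids is hypothetical, and numbering from 0 is an assumption inferred from identifiers such as stab_train_0 in the processed data.

```python
def make_ids(task_prefix, dataset_type, count, start=0):
    """Build identifiers of the form <task>_<dataset_type>_<number>.

    Assumption: numbering starts at 0 within each task and split.
    """
    return [f"{task_prefix}_{dataset_type}_{i}" for i in range(start, start + count)]


# Short task prefixes as they appear in this README: stab, fluo, remo, secstruc, sign, fold.
print(make_ids("fluo", "train", 3))
```

Keeping the task and split in the identifier means a row can be traced back to its source dataset even after the tasks are merged into one file.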

Next, the corresponding Gene Ontology (GO) terms are retrieved from idmapping_selected.tab by UniProt ID; if no GO term is found, No goterm is recorded. The command is as follows:

grep -w <uniprot_id> idmapping_selected.tab -m 1
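
The same lookup can be sketched in Python for batch use. This is a minimal sketch, not the repository's code: the names lookup_go, GO_COLUMN, and NO_GO are hypothetical, and the GO column index is an assumption based on UniProt's published layout for idmapping_selected.tab (GO as the seventh tab-separated field), so check it against your copy of the file.

```python
import re

GO_COLUMN = 6        # assumption: zero-based index of the GO field (7th column)
NO_GO = "No goterm"  # value recorded when nothing is found


def lookup_go(uniprot_id, lines, go_column=GO_COLUMN):
    """Return the GO terms on the first line that contains uniprot_id as a whole word."""
    # Emulates `grep -w <uniprot_id> -m 1`: whole-word match, first hit only.
    pattern = re.compile(r"(?<!\w)" + re.escape(uniprot_id) + r"(?!\w)")
    for line in lines:
        if not pattern.search(line):
            continue
        fields = line.rstrip("\n").split("\t")
        raw = fields[go_column] if len(fields) > go_column else ""
        terms = [t.strip() for t in raw.split(";") if t.strip()]
        return ";".join(terms) if terms else NO_GO
    return NO_GO


# Made-up, shortened line for illustration; real lines carry more columns.
sample = ["P05129\tname\t\t\t\t\tGO:0004672; GO:0004674\n"]
print(lookup_go("P05129", sample))
print(lookup_go("fluo_train_17878", sample))
```

Because the function only iterates over lines, an open file handle for idmapping_selected.tab can be passed in place of the sample list. Like the grep command, each call scans from the start of the file, so for many identifiers it may be worth building a dictionary in a single pass instead.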

Reference code for data processing, including GO retrieval and label processing, is in processing_data.ipynb.

Processed data of protein pretraining and downstream tasks should be placed in the processed_data directory. The data format is as follows:

task                 uniprot_id        seq          GO                         label
Stability            stab_train_0      DQSVRKLV...  No goterm                  2
Fluorescence         fluo_train_17878  SKGEELFT...  No goterm                  0
Remote Homology      remo_train_0      PKKVLTGV...  No goterm                  0
Secondary Structure  secstruc_train_0  MNDKRLQF...  No goterm                  1
Signal Peptide       sign_train_0      MLGMIRNS...  No goterm                  0
Fold Classes         fold_train_0      MSPFTGSA...  No goterm                  2
PDBBind              11gs              PYTVVYFP...  GO:0005737;GO:0005829;...  2
Kinase               P05129            MAGLGPGV...  GO:0004672;GO:0004674;...  1

2. Embedding Generation

Use represent/model_interpreter_multi.py to represent protein tasks, and modify represent/config.yaml to configure the downstream task paths. For example,

python model_interpreter_multi.py --batch_size=32 --gpu=0 --ft=multi

The generated embeddings have the following format:

   pro_id        pro_seq      pro_emb
0  fluo_train_0  SKGEELFT...  [-0.5087447166442871, -2.313387870788574, -0.1...

Additionally, the pretrained and fine-tuned weights used to generate the embeddings come from our previous work, MASSA. The experimental hyperparameters are as follows:

                       Pretrain           Stability  Fluorescence  Remote Homology       Secondary Structure  PDBBind  Kinase             Skempi
epoch                  150                150        150           150                   150                  150      150                150
batch size             4                  8          32            4                     8                    8        4                  8
lr (learning rate)     1e-4               1e-4       1e-4          1e-4                  1e-4                 1e-4     1e-4               1e-4
weight decay           1e-4               1e-4       1e-4          1e-4                  1e-4                 1e-4     1e-4               1e-4
gradient accumulation  8                  8          8             8                     8                    8        8                  8
optimizer              RAdam              RAdam      RAdam         RAdam                 RAdam                RAdam    RAdam              RAdam
loss                   CrossEntropy Loss  MSELoss    MSELoss       Equalized Focal Loss  CrossEntropy Loss    MSELoss  CrossEntropy Loss  MSELoss
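
With gradient accumulation, the number of samples contributing to each optimizer update is the batch size multiplied by the accumulation steps. The sketch below computes that figure from the table, assuming that "batch size" is the per-step batch size on a single device; the source states neither, so treat the results as indicative only.

```python
# Per-step batch sizes copied from the hyperparameter table above.
batch_size = {
    "Pretrain": 4, "Stability": 8, "Fluorescence": 32, "Remote Homology": 4,
    "Secondary Structure": 8, "PDBBind": 8, "Kinase": 4, "Skempi": 8,
}
GRAD_ACCUM_STEPS = 8  # the same for every run in the table

# Assumption: single device, so the effective batch is batch size x accumulation steps.
effective_batch = {task: size * GRAD_ACCUM_STEPS for task, size in batch_size.items()}

for task, size in effective_batch.items():
    print(f"{task}: {size}")
```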

3. OTMTD Calculation

Run otmtd_cal.ipynb to calculate the transferability metrics from multi-modal multi-task pre-training to downstream tasks.
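
For readers new to optimal transport, the sketch below shows the general kind of computation an OT-based distance between two sets of embeddings involves. It is not the OTMTD metric and is not taken from otmtd_cal.ipynb: it is a plain entropic-regularized (Sinkhorn) transport cost with uniform weights and a squared Euclidean ground cost, and the function name sinkhorn_cost and its parameters reg and n_iters are illustrative choices.

```python
import math


def sinkhorn_cost(xs, ys, reg=0.1, n_iters=200):
    """Entropic-regularized OT cost between two point sets with uniform weights."""
    n, m = len(xs), len(ys)
    # Squared Euclidean ground cost between every pair of points.
    cost = [[sum((p - q) ** 2 for p, q in zip(x, y)) for y in ys] for x in xs]
    # Rescale by the largest cost so the exponentials below stay well-behaved.
    scale = max(max(row) for row in cost) or 1.0
    kernel = [[math.exp(-c / (scale * reg)) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale columns and rows to match the uniform marginals.
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
    # Transport cost of the resulting plan, measured with the unscaled ground cost.
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j] for i in range(n) for j in range(m))


source = [[0.0, 0.0], [1.0, 0.0]]
near = [[0.0, 0.1], [1.0, 0.1]]
far = [[5.0, 5.0], [6.0, 5.0]]
print(sinkhorn_cost(source, near) < sinkhorn_cost(source, far))
```

The pure-Python loops keep the sketch dependency-free but scale poorly; for embedding sets of realistic size, use the notebook and the otmtd/ package instead.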

Acknowledgement

The SOFTWARE will be used for teaching or not-for-profit research purposes only. Permission is required for any commercial use of the Software.
