This repository contains the code and datasets for the manuscript "Predicting Protein and Pathway Associations for Understudied Dark Kinases using Pattern-constrained Knowledge Graph Embedding ".
-
DownloadLinkForData.txt
: file (the onedrive link in the README.md is the same in this text file )-
file contains link for download
-
Download contains KG, embeddings, uniprot2genename, and reactome2pathwayname files
-
This also contains files that goes with the example provided below
-
-
data/GP2VEC_emb_walk_file_40_40_0_0_bwV4_s256_w7_n5_mc5.txt
: embedding KG file -
data/GP2VEC_emb_walk_file_40_40_0_0_bwV4_s256_w7_n5_mc5
: embedding KG file -
data/df_edges.csv
: main KG edge file -
data/df_node.csv
: KG node file -
data/df_relation.csv
: KG relation type file -
data/df_type.csv
: KG node type file -
data/original/GeneNameMapping.txt
: UniProtID to name mapping file. One of the input files for GeneratingKinaseGraph -
data/original/ReactomePathways.txt
: ReactomeID to name mapping file. One of the input files for GeneratingKianseGraph -
benchmarkCofigFiles
: Has 3 files that can be used to set the paramaters for the supplementary benchmarks models -
2 Jupyter lab notebooks files
-
Notebook for measuring the accuracy
graphpattern2vec_process-multithread-Edited2022.ipynb
-
Notebook for prediction generation
graphpattern2vec_process-multithread-PREDICTION-Edited2022.ipynb
-
-
gp2v.yml
: file- file can be used to create the environment
-
graphpattern2vec
: folders- folder holds the objects/functions of RegPattern2Vec
-
Readme.md
: file- You're reading it right now
-
model
: folder- folder holds temporary files
Python == 3.10.8
From pip:
git clone https://github.com/gravelCompBio/RegPattern2Vec.git
cd RegPattern2vec/
ALSO DOWNLOAD THIS data/ FOLDER AND PUT IT IN THE RegPattern2Vec-main folder (unzip and make sure the name of the unzipped folder is still "data")
or (same data differnt download locations)
DOI:
10.5281/zenodo.7541827
the yml file is in RegPatter2Vec file and is named "gp2v.yml"
conda env create -f gp2v.yml -n gp2v
conda activate gp2v
-
Confirm prerequisite data has been downloaded (see download repository section)
-
Navigate to RegPattern2vec-main folder in the terminal and run jupyter lab
jupyter lab
-
for measuring accuracy of the model (AUC-ROC)
-
has split funtion for benchmarking so only will not give the best predictions
or
-
for generating predictions
-
ROC in this file does not represent the overall accuracy
Please Read this Section before running the link prediction section in either notebook! Section for generating user embeddings from outside data!
-
Confirm prerequisite data has been downloaded (see download repository section)
-
After performing the random walk sections of the code in either jupyter notebook, either the embedding files provided can be used or embedding files can be generated from the user’s own data. If you wish to generate your own embedding files, please see the section below for how to do so with metapath2vec.
A) If the user creates their own embedding please note the current version of the code assumes that you are just following the example provided and the use of our provided embedding (generated with our data and provided Knowledge Graph). Users will need to change file name in code (both notebooks).
- After embedding file generation, run the link prediction sections of the code in either notebook
Basic guide for running metapath2vec to generate new embedding files from user generated Random Walk file
This is only if you want to run this pipeline on custom data. If you are running this pipline on our data/KG provided, you do not have to run metapath2vec because the we provide the embedding files (output of metapath2vec) in the DownloadLinkForData.txt
Also the version of metapath2vec we use (https://github.com/Change2vec/change2vec/tree/master/metapath2vec/code_metapath2vec ) is computationally expensive and our KG took 200Gb of ram with over 2 days of runtime with 64 cores so if you can find a more effencint version of metapath2vec that you trust, I would suggest you use others versions (one example is embiggen(grape) https://github.com/AnacletoLAB/grape)
We utilize files produced by our model containing nodes collected through RegPattern2Vec as input into a portion of the pipeline for metapath2vec used vectorize the KG nodes and generate embeddings
If the user is generating their own data and using their own embedding files, all cells in the notebook should be run until the Link Prediction section then you should create the embeddings with metapath2vec.
The code can be accessed from:
The following parameters were used to produce our embeddings:
-
Size 256
-
Window 7
-
Negative 5
-
MinCount 5
-
Threads 32
-
PlusPlus 1
Example code of how we put it in the terminal
make metapath2vec -B
./metapath2vec -train data/walk_GP2vec_test_REL_simplified_sep_probs_40_40_0_0_bwV4.csv -output data/GP2VEC_emb_walk_file_40_40_0_0_bwV4_s256_w7_n5_mc5 -threads 32 -size 256 -window 7 -negative 5 -min-count 5 -pp 1
Run the rest of the cells after the Link Prediction section, after inputting embedding file generated outside of notebook.
see following paper for a more complete explanation
Dong, Y., Chawla, N. V., & Swami, A. (2017, August). metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 135-144).
After running the Link Prediction Notebook (data/results/df_prediction_name_emb_walk_file_40_40_0_0_bwV4_s256_w7_n5_mc5.csv
) , the model will output a CSV file:
Example here with first 5 rows:
uniprot,ProteinName,Pathway_id,Pathway_Name,Score
Q9H8X2,IPPK,1855191,Synthesis of IPs in the nucleus,0.999999998802044
Q96GX5,MASTL,2465910,MASTL Facilitates Mitotic Progression,0.9999999956283079
Q9H8X2,IPPK,1855167,Synthesis of pyrophosphates in the cytosol,0.9999999950700171
Q13237,PRKG2,8978934,Metabolism of cofactors,0.9999999891668171
Q13237,PRKG2,1474151,"Tetrahydrobiopterin (BH4) synthesis, recycling, salvage and regulation",0.9999999819057269
Row Names:
Uniprot
Gene Name
Pathway ID
Pathway_Name
Score