Collection of scripts to create and use SMILES corrector. The SMILES corrector is a transformer model that is trained to translate invalid SMILES sequences to valid SMILES sequences.
-
RDKit (version >= 2020.03) $ conda install -c conda-forge rdkit
-
Numpy (version >= 1.19) $ conda install numpy
-
Scikit-Learn (version >= 0.23) $ conda install scikit-learn
-
Pandas (version >= 1.2.2) $ conda install pandas
-
PyTorch (version >= 1.7) $ conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
-
Matplotlib (version >= 2.0) $ conda install matplotlib
-
TQMD $ pip install tqdm
-
Torchtext $ pip install torchtext==0.11.2
-
Modin $ pip install modin[ray]
-
Chembl structure pipeline $ conda install -c conda-forge chembl_structure_pipeline
-
Seaborn $ conda install seaborn
-
tueplots $ pip install tueplots
-
jupyer $ pip install jupyter
There are two main options for using the SMILES corrector; use a previously trained model (step 4) or train a new model.
The files for reuse can be found on: https://zenodo.org/record/7157412#.Y1edr3ZBxD8
To train a new model follow example in test.py to perform the following steps
-
Prepare the molecules that will be used for training standardize the molecules using standardize from preprocess.py (this for example removes salts & canonicalizes the SMILES) remove molecules with SMILES longer than a set threshold (this is necessary to reduce the number of model parameters)
-
Introduce ‘artificial’ errors into training molecules, done in order to create invalid-valid pairs (for training & evaluating the model) indicate which type of error to introduce using invalid_type = (could for example be all errors (‘all’) or only syntax, ring, parentheses, valence, aromaticity, bond exists & random permutations) indicate how many errors to introduce per sequence using num_errors introduce the errors using get_invalid_smiles from invalidSMILES.py the resulting invalid-valid pairs are saved as training and validation set (& optional test set) (are later used by torchtext dataloader)
-
Train & evaluate model indicate where to find real-world error set using error source = (apart from training & evaluating the model on artificial invalid SMILES it will also be evaluated on actual invalid SMILES created by de novo generation) use model_training from modelling.py to train the model (this function will initialize a transformer model and train, evaluate and save it) the transformer architecture is defined in transformer.py if a trained model already exists the weights from the last epoch (modelname_last) are loaded & existing model is trained further evaluation is done using functions from metric.py performance & model are saved in Data/performance/
-
Use model to correct invalid SMILES done using correct_SMILES from modelling.py (best model is saved in Data/performance/[modelname].pkg (!= the modelname ending with _last)) for an example of how to use the SMILES corrector for exploration see testexplore.py
Send a message to [email protected]