PolymerGasMembraneML (pgmML)

A machine-learning implementation that learns generalizable, interpretable models between polymer chemistry and membrane gas permeability, which can be used for polymer discovery. Please refer to our work "Machine learning enables interpretable discovery of innovative polymers for gas separation membranes" for additional details.

Installation

To download, clone this repository

git clone https://github.com/jsunn-y/PolymerGasMembraneML

To run most code in this repository, the relevant anaconda environment can be installed from pgmML.yml. To build this environment, run

cd ./PolymerGasMembraneML
conda env create -f pgmML.yml
conda activate pgmML

However, for calculating chemical descriptors and fingerprints (optional task), a separate rdkit environment is required. Installation instructions can be found in rdkit documentation.

Datasets

In addition to the training dataset, Dataset A, we use 3 screening datasets in this work: Dataset B, C, and D. Dataset A has both SMILES strings (chemistry) and permeabilities. Datasets B,C, and D only provide SMILES strings. Due to large size of Datasets B and C, we do not include any calculated features in this github in /datasets, and for Dataset C, we only include the first 1 million SMILES strings. Note that Dataset C is split across 9 individual files due to its large memory requirements.

However all the datasets used in this work, including SMILES and calculated fingerprints, can be downloaded here.

Dataset	Description	Included in Github	Additional Download Available
Dataset A	Training Set	`datasetA_imputed_all.csv` `datasetAX_desc.csv` `datasetAX_fing.csv`	None
Dataset B	1 million screening	`datasetB.csv`	`datasetBX_fing.csv`
Dataset C	8 million screening	`datasetC_0.csv`	`datasetC_1-8.csv` `datasetCX_fing_0-8.csv`
Dataset D	1 thousand screening	`datasetD.csv` `datasetDX_fing.csv`	None

General Use

Referring to Figure 1 in our paper, there are 5 steps in our ML training and discovery workflow.

We have curated a dataset of SMILES strings and permeabilities, and the results are contained in the /datasets folder. Imputed permeabilities are included in the datasets, based on this code. More details can be found in the study by Yuan et al./home/jyang4/repos/DeCOIL/optimization/configs/examples/simple_exp25_mix2_GB1.json
The chemical features of the training set have been computed and uploaded to the /datasets folder. All remaining fingerprints and datasets have been generated via rdkit and uploaded at the same link as above. To use our code, download all the data and unzip them into the '/datasets' folder. Running step2_generateXfeatures.py to calculate chemical features is not required, but we have included the code for those who might find it beneficial.
We recommend training our top-performing ML model, a DNN ensemble trained using Morgan fingerprints as inputs with permeabilities imputed using Bayesian Linear Regression:

python step3_train.py --features 'fing' --imputation 'BLR' --model 'DNN'

Alternatively, one can also train on descriptors, use extremely randomized trees for imputation, and train a random forest. Options for training are summarized below:

Features	Imputation	Model
`'fing'` `'desc'`	`'BLR'` `'ERT'`	`'RF'` `'DNN'`

We also include several pretrained models in /pretrained_models that reproduce the results demonstrated in our paper. To directly use our pretrained models, rename /pretrained_models to /models. The thousands of candidate polymers with promising performance (beyond 2008 Robeson upper bounds, or with high permeability) identified in our study are included in /pretrained_models/DNN_BLR_fing/top_polymers.csv. We encourage computational and experimental researchers to explore these polymers further for gas separations. A summary of the columns in this dataset:

Column Name	Description
Smiles	SMILES string for the polymer
Smi_A	SMILES string for the diamine/diisocyanante component of the polymer
Smi_B	SMILES string for the dianhydride component of the polymer
SA_A	Synthetic accessibility score for component A
SA_B	Synthetic accessibility score for component B
He, H2, O2, N2, CO2, CH4	log_10 of the gas permeability measured in Barrer
Ondist, CCdist, CNdist, HCdist	log_10 of the distance above the 2008 Robeson upper bound for O2/N2, CO2/CH4, CO2/N2, and H2/CO2 separations

Using the saved models, it is possible to extract the model's SHAP values to interpret trained ML models in step 3.5. For example, run:

python step3.5_SHAP.py --modelname 'DNN_BLR_fing'

As part of our discovery framework, we also screen large datasets of polymers to predict their permeabilities and identify promising candidates. For example, to screen Dataset D on the model that we trained and analyzed above, run:

python step4_screen.py --modelname 'DNN_BLR_fing' --dataset 'datasetDX_fing.csv'

Note that the input features of the screening dataset must match that of the trained model.

Please refer to our paper for details on the implementation of MD simulations for validation.

Visualization of Results

visualizations.ipynb offers several useful functions for visualizing the results of the ML models. The performance of a trained model can be verified via plotR2(), while SHAP values for physical insights into a trained model can be visualized using plotSHAP(). Futhermore, plotRobeson() visualizes permeabilities in the context of O2/N2, CO2/CH4, CO2/N2, and H2/CO2 separations.

License

Please refer to our paper for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PolymerGasMembraneML (pgmML)

Installation

Datasets

General Use

Visualization of Results

License

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 135 Commits
__pycache__		__pycache__
datasets		datasets
models		models
pretrained_models		pretrained_models
.DS_Store		.DS_Store
DNN_functions.py		DNN_functions.py
LICENSE		LICENSE
README.md		README.md
pgmML.yml		pgmML.yml
step2_generate_Xfeatures.py		step2_generate_Xfeatures.py
step3.5_SHAP.py		step3.5_SHAP.py
step3_train.py		step3_train.py
step4_screen.py		step4_screen.py
visualizations.ipynb		visualizations.ipynb
workflow.jpg		workflow.jpg

License

jsunn-y/PolymerGasMembraneML

Folders and files

Latest commit

History

Repository files navigation

PolymerGasMembraneML (pgmML)

Installation

Datasets

General Use

Visualization of Results

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages