This repository has been archived by the owner on Nov 13, 2024. It is now read-only.

GitHub repository for the APP project aimed at identifying hits for the SARS-CoV-2 PLpro protein. It holds the KNIME workflows, the datasets used for model building, and a file containing the selected hit molecules. This repository will support future publication in the ACS Journal of Chemical Information and Modeling.

ncats/ML-HitDiscovery-PLpro

Applications of Machine Learning Approaches for Discovery of SARS-CoV-2 PLpro Inhibitors

This repository, hosted under the NCATS organization, documents the computational workflows, datasets, and results from a study focused on discovering potential inhibitors of the SARS-CoV-2 PLpro protein. The study uses machine learning (ML) models and high-throughput screening data to identify promising hit compounds.

Project Overview
The primary goal of this work is to identify and validate compounds with potential inhibitory activity against the SARS-CoV-2 papain-like protease (PLpro), a viral protein essential for replication. The work was conducted as part of the Antiviral Program for Pandemics (APP) project at NCATS.

Key Objectives

  • Develop ML models based on high-throughput screening data to predict activity against PLpro.
  • Perform virtual screening using optimized ML models to identify promising hit compounds from the in-house Genesis molecular library.
  • Perform similarity-based screening to find analogs of the hit compounds.

Repository Contents
This repository contains the following:

KNIME Workflows

  • Hyperparameter Optimization Workflows:

    • random_forest_hyperoptimization.knwf: A workflow for hyperparameter optimization of a Random Forest model, using different combinations of Avalon, Morgan, and Atom-pair fingerprints along with RDKit descriptors.
    • gradient_boost_hyperoptimization.knwf: A workflow for hyperparameter optimization of a Gradient Boosting model, with the same variations of fingerprints and descriptors.
    • xgboost_hyperoptimization.knwf: A workflow for hyperparameter optimization of an XGBoost model, using the fingerprint and descriptor combinations mentioned above.
  • Virtual Screening Workflow:

    • virtual_screening_workflow.knwf: The virtual screening workflow run against the Genesis molecular library, using the best optimized ML model for hit discovery.
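
The search space covered by the hyperparameter optimization workflows can be pictured as a cross product of feature sets and model settings. The Python sketch below is only an illustration and is not the KNIME implementation. The fingerprint and descriptor names come from this README; the hyperparameter names and values (`n_estimators`, `max_depth`), and the choice to try every fingerprint combination with and without descriptors, are assumptions.

```python
from itertools import combinations, product

# Feature blocks named in this README.
FINGERPRINTS = ["Avalon", "Morgan", "AtomPair"]
DESCRIPTORS = "RDKit descriptors"

# Hypothetical hyperparameter grid; these values are assumptions for
# illustration and are not taken from the KNIME workflows.
PARAM_GRID = {"n_estimators": [100, 500], "max_depth": [10, 20]}


def feature_sets(fingerprints, descriptors):
    """Return every non-empty fingerprint combination, with and without descriptors."""
    sets = []
    for size in range(1, len(fingerprints) + 1):
        for combo in combinations(fingerprints, size):
            sets.append(combo)
            sets.append(combo + (descriptors,))
    return sets


def search_space(fingerprints, descriptors, param_grid):
    """Yield (feature_set, params) pairs for an exhaustive grid search."""
    names = sorted(param_grid)
    for features in feature_sets(fingerprints, descriptors):
        for values in product(*(param_grid[name] for name in names)):
            yield features, dict(zip(names, values))


space = list(search_space(FINGERPRINTS, DESCRIPTORS, PARAM_GRID))
print(len(space), "configurations to evaluate")
```

Each `(features, params)` pair would correspond to one model fit and validation run; the best-scoring pair is the kind of configuration the virtual screening workflow would then reuse.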

Screening Data

  • Genesis_library_HTS_data/: Contains high-throughput screening data for a small subset of the Genesis library.
    • primary_assay_data.csv: Data from the initial primary assay screening.
    • confirmed_active_compounds_follow-up_assay.csv: List of active compounds confirmed by follow-up assays.
  • NPACT_library_HTS_data/: Contains high-throughput screening data for a subset of the NPACT library.
    • primary_assay_data.csv: Data from the initial primary assay screening.
    • confirmed_active_compounds_follow-up_assay.csv: List of active compounds confirmed by follow-up assays.
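
One plausible way to turn each pair of files into labeled data is to mark a primary-assay compound as active only if it also appears in the confirmed-actives list. The sketch below uses only the Python standard library. The column name `compound_id`, the `activity` column, and the toy rows are hypothetical and are not copied from the repository; whether the study derived its labels this way is also an assumption.

```python
import csv
import io

# Toy stand-ins for primary_assay_data.csv and
# confirmed_active_compounds_follow-up_assay.csv. Column names and rows
# are hypothetical placeholders.
PRIMARY_CSV = "compound_id,activity\nCPD-1,12.5\nCPD-2,88.0\nCPD-3,91.2\n"
CONFIRMED_CSV = "compound_id\nCPD-3\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def label_actives(primary_rows, confirmed_rows, id_column="compound_id"):
    """Attach a binary label: 1 if the compound was confirmed active, else 0."""
    confirmed_ids = {row[id_column] for row in confirmed_rows}
    return [
        {**row, "label": int(row[id_column] in confirmed_ids)}
        for row in primary_rows
    ]


labeled = label_actives(read_rows(PRIMARY_CSV), read_rows(CONFIRMED_CSV))
for row in labeled:
    print(row["compound_id"], row["label"])
```

To run this against the real files, replace the inline strings with the file contents and set `id_column` to whatever identifier column the CSVs actually use.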

Training and Test Sets

  • training_set.sdf: Structure data file (SDF) containing the training set compounds used for model building.
  • test_set.sdf: Structure data file (SDF) containing the test set compounds used for model validation.
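
SDF files store one record per compound, each terminated by a `$$$$` line, with optional data fields introduced by `> <FIELD_NAME>`. The following minimal, standard-library reader is a sketch for quickly inspecting such files; it does not interpret the chemistry, and a cheminformatics toolkit such as RDKit would be the usual choice for real work. The toy records and the field name `ACTIVITY` are hypothetical, not taken from `training_set.sdf` or `test_set.sdf`.

```python
def _parse_record(lines):
    """Extract the title line and the data fields from one SDF record."""
    fields = {}
    i = 0
    while i < len(lines):
        line = lines[i]
        # Data headers look like "> <FIELD_NAME>".
        if line.startswith(">") and "<" in line and line.rindex(">") > line.index("<"):
            name = line[line.index("<") + 1 : line.rindex(">")]
            values = []
            i += 1
            # Field values run until the next blank line.
            while i < len(lines) and lines[i].strip():
                values.append(lines[i])
                i += 1
            fields[name] = "\n".join(values)
        i += 1
    return {"title": lines[0].strip(), "fields": fields}


def parse_sdf(text):
    """Split SDF text on '$$$$' delimiter lines and parse each record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if current:
                records.append(_parse_record(current))
            current = []
        else:
            current.append(line)
    return records


def toy_record(title, activity):
    """Build a one-atom toy record (hypothetical data, for illustration only)."""
    return "\n".join([
        title,
        "  sketch",
        "",
        "  1  0  0  0  0  0  0  0  0  0999 V2000",
        "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
        "M  END",
        "> <ACTIVITY>",
        activity,
        "",
        "$$$$",
    ])


SDF_TEXT = "\n".join([toy_record("toy-1", "1"), toy_record("toy-2", "0")]) + "\n"
records = parse_sdf(SDF_TEXT)
print(len(records), [r["fields"]["ACTIVITY"] for r in records])
```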

Selected Hits From Virtual Screening

  • selected_screening_hits.xlsx: A file listing the top hit compounds identified from the virtual screening workflow.

Analog Compounds Similarity

  • analog_compounds.xlsx: A file listing analog compounds screened in this study, along with their Tanimoto similarity values.
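
For binary fingerprints, the Tanimoto similarity is the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below computes it on plain Python sets of "on" bit positions and ranks a small library against a query hit. The bit sets and analog names are toy values, not data from `analog_compounds.xlsx`, and the fingerprint type used for the reported similarities is not stated in this README. Returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention for two empty fingerprints
    return len(a & b) / union


def rank_analogs(query_bits, library):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query_bits, bits)) for name, bits in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy fingerprints (hypothetical bit positions).
HIT_BITS = {1, 4, 7, 9}
LIBRARY = {
    "analog_a": {1, 4, 7, 9},
    "analog_b": {1, 4, 8},
    "analog_c": {2, 3},
}

ranking = rank_analogs(HIT_BITS, LIBRARY)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In a similarity-based screen, compounds above a chosen similarity cutoff would be kept as analogs of the hit; the cutoff itself is a study-specific choice.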
