Early Warning Tool for prioritizing individuals for screening based on risk of MASLD-related liver complications
Documentation of steps and code
For every patient
- who has had at least one clinical visit in the past 3 years
- who is 18+ years of age (and alive)
- who has not yet been diagnosed with MASLD-related liver complications, hepatitis, or alcohol-related or other liver complications
Predict the top k individuals (based on intervention capacity) who are at risk of having MASLD-related liver complications in the next 3 years
- Predict risk of having MASLD-related liver complications in the next 3 years
- Define Baselines:
- age, most recent FIB-4, co-morbidities
- clinical guidelines - T2DM, obesity, high TG, glucose, HDL, BP, AST or ALT
- Metric(s):
- Primary: Precision (PPV) or Recall (sensitivity) at top k (:warning: need to determine k based on capacity)
- AUC (if capacity is TBD)
- Fairness metric: TPR disparity by Race, Gender
- Define cohort based on formulation: all patients aged 18 or older with at least one outpatient visit in the past 3 years and no previous diagnosis of liver-related complications or other excluded liver-related diagnoses. See the SQL file used in the config.
- Define Outcome/Label based on formulation (will get diagnosed with X in the next z months): liver complications (defined as development of cirrhosis or liver-related complications) that develop in the 3 years following the prediction date. See the SQL file used in the config.
- Define Training and Validation sets over time
- Define and generate predictors: All features defined in [these config files](triage_config_files/feature_groups/)
- Train Models on each training set and score all patients in the corresponding validation set
- Evaluate all models for each validation time according to each metric (PPV at top k)
- Select "Best" model based on results over time
- Explore the high performing models to understand who they rank high, how they compare to the cohort, and important predictors
- Check and/or correct for bias issues
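The metrics listed above (precision and recall at top k, and TPR disparity across groups) can be sketched in plain Python as follows. This is an illustrative sketch only, not the project's evaluation code (Triage computes its own metrics). The function names, the toy scores, and the group labels "A" and "B" are all assumptions made for the example, and disparity is expressed here as a ratio to a chosen reference group.

```python
from collections import defaultdict


def top_k(scores, k):
    """Return indices of the k highest-scoring patients (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def precision_recall_at_k(scores, labels, k):
    """Precision (PPV) and recall (sensitivity) when flagging the top k patients."""
    selected = top_k(scores, k)
    true_positives = sum(labels[i] for i in selected)
    total_positives = sum(labels)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


def tpr_by_group(scores, labels, groups, k):
    """True positive rate within each group when flagging the top k patients."""
    selected = set(top_k(scores, k))
    hits = defaultdict(int)
    positives = defaultdict(int)
    for i, (label, group) in enumerate(zip(labels, groups)):
        if label == 1:
            positives[group] += 1
            if i in selected:
                hits[group] += 1
    return {group: hits[group] / positives[group] for group in positives}


def tpr_disparity(tpr, reference_group):
    """Ratio of each group's TPR to the reference group's TPR."""
    reference = tpr[reference_group]
    return {g: (v / reference if reference else float("nan")) for g, v in tpr.items()}


# Toy data (made up for illustration): six patients, capacity k = 3.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 1, 0, 0]
groups = ["A", "A", "B", "B", "A", "B"]

precision, recall = precision_recall_at_k(scores, labels, k=3)
tpr = tpr_by_group(scores, labels, groups, k=3)
disparity = tpr_disparity(tpr, reference_group="A")
print(precision, recall, tpr, disparity)
```

On this toy data, two of the three flagged patients are true positives, so precision and recall at k = 3 are both 2/3, and group "B" has half the TPR of group "A".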
We are using Triage to build and select models. Some background and tutorials on Triage:
- Tutorial on Google Colab - Are you completely new to Triage? Run through a quick tutorial hosted on google colab (no setup necessary) to see what triage can do!
- Dirty Duck Tutorial - Want a more in-depth walk through of triage's functionality and concepts? Go through the dirty duck tutorial here with sample data
- QuickStart Guide - Try Triage out with your own project and data
- Suggested workflow
- Understanding the configuration file
- Installation: install triage in a python virtual environment
Assuming Triage is installed and the data is in a PostgreSQL database, run:
- activate the virtual environment: `source env/bin/activate`
- `python run.py -c configfilename`
- if running on the sample database, add the `--sampledb` flag
Running Triage - choices to make
- replace flag (set to false until we want to nuke everything)
- save predictions (not at the beginning)
- number of processors to use
- The current one is here
- File with design choices: google doc
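As an illustration of how the run-time choices above could be exposed on the command line, here is a minimal sketch using `argparse`. It is not the project's actual `run.py`. Only `-c` and `--sampledb` come from the instructions above; the `--replace`, `--save-predictions`, and `--n-processes` flags and both function names are hypothetical. The option names in the returned dictionary are assumed to line up with the keyword arguments accepted by Triage's experiment classes; check the Triage documentation before relying on that.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for a run script (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Run a Triage experiment (sketch).")
    # Flags taken from the run instructions.
    parser.add_argument("-c", dest="config", required=True,
                        help="name of the experiment config file")
    parser.add_argument("--sampledb", action="store_true",
                        help="run against the sample database")
    # Hypothetical flags for the run-time choices listed above.
    parser.add_argument("--replace", action="store_true",
                        help="overwrite existing matrices and models (off by default)")
    parser.add_argument("--save-predictions", action="store_true",
                        help="store predictions in the database (off by default)")
    parser.add_argument("--n-processes", type=int, default=1,
                        help="number of processors to use")
    return parser.parse_args(argv)


def experiment_options(args):
    """Collect the run-time choices into a dict of keyword arguments."""
    return {
        "replace": args.replace,
        "save_predictions": args.save_predictions,
        "n_processes": args.n_processes,
    }


# Example: the invocation from the instructions, on the sample database.
args = parse_args(["-c", "configfilename", "--sampledb"])
options = experiment_options(args)
print(args.config, args.sampledb, options)
```

Defaulting `replace` and `save_predictions` to off matches the choices noted above: nothing is overwritten and no predictions are stored unless explicitly requested.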
Config file choices to make: example config file
- cohort and label query: write a query that takes two parameters, {as_of_date} and {label_timespan}, and returns two columns (entity_id, outcome) covering all the patients in the cohort as of {as_of_date}. The outcome can be 1 (diagnosed with NASH/NAFLD-related liver complications within {label_timespan} of {as_of_date}), 0 (not diagnosed), or null (unknown or uncertain). We can later turn nulls into 0s or ignore them.
- temporal config parameters
- features and imputation
- subsets to analyze: patients with a prior NASH or NAFLD diagnosis
- attributes to do bias audits for: sex, race
- Of all the patients who have FIB-4 components, how many should be in our cohort (i.e., do not yet have NASH/NAFLD-related complications), and what % of them develop those complications in the next 3 years?
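To make the cohort and label query requirement more concrete, here is a minimal sketch of a query template with the two placeholders, wrapped in Python so the substitution can be seen. The table and column names (`cohort_patients`, `liver_complication_diagnoses`, `diagnosis_date`) are hypothetical and do not come from the project's SQL files. Triage fills in the placeholders itself when it runs the config; the `render_label_query` helper below exists only to show what the filled-in query would look like. This sketch returns 0 for every patient without a diagnosis in the window and never produces nulls.

```python
from datetime import date

# Hypothetical schema: cohort_patients(entity_id, as_of_date) and
# liver_complication_diagnoses(entity_id, diagnosis_date).
LABEL_QUERY = """
select
    c.entity_id,
    max(case when d.diagnosis_date is not null then 1 else 0 end) as outcome
from cohort_patients c
left join liver_complication_diagnoses d
    on d.entity_id = c.entity_id
    and d.diagnosis_date > '{as_of_date}'::date
    and d.diagnosis_date <= '{as_of_date}'::date + interval '{label_timespan}'
where c.as_of_date = '{as_of_date}'::date
group by c.entity_id
"""


def render_label_query(as_of_date, label_timespan):
    """Fill in the two placeholders, for inspection only."""
    return LABEL_QUERY.format(
        as_of_date=as_of_date.isoformat(),
        label_timespan=label_timespan,
    )


# Example: labels for a prediction date of 2020-01-01 with a 3-year window.
rendered = render_label_query(date(2020, 1, 1), "3 years")
print(rendered)
```

The left join keeps every cohort member in the result, so patients with no qualifying diagnosis get an outcome of 0 rather than being dropped.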
The following feature sets were tested using manual_modeling.py, which takes 4 parameters:
- training matrix
- test matrix
- model to build and test
- feature set