Code Documentation
This page documents the pipeline and directory structure of the scripts/processes involved in the collaborative ensemble repository.
As a high level overview, the process looks like this:
- Component model files (up to the last season) are collected in `./model-forecasts/component-models`.
- Weights are generated for different ensemble types and are kept in `./weights`.
- During the prediction season (the current season, with real time data), the real time submission files (from the same set of components as used in weight estimation) are used to generate ensemble submissions each week.
Note that the first two steps are run at the beginning of the season, providing fixed weights for the components for the rest of the prediction season. The rest of this page describes the implementation.
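
To make the combination step concrete, here is a minimal sketch of how fixed weights could be applied to component forecasts to produce an ensemble forecast. The weights and forecasts shown are made-up stand-ins for the contents of `./weights` and the real time component CSVs; this is an illustration of the idea, not the repository's R implementation.

```js
// Minimal sketch (not the actual pipeline code): combine the bin probabilities
// of several component models for one target using fixed, pre-estimated weights.
// `weights` and `componentForecasts` are hypothetical stand-ins for the files
// under ./weights and ./model-forecasts/real-time-component-models.
const weights = { modelA: 0.5, modelB: 0.3, modelC: 0.2 }; // assumed to sum to 1

const componentForecasts = { // probability assigned to each bin of a target
  modelA: [0.1, 0.3, 0.4, 0.2],
  modelB: [0.2, 0.2, 0.3, 0.3],
  modelC: [0.25, 0.25, 0.25, 0.25]
};

function combine(weights, forecasts) {
  const models = Object.keys(weights);
  const nBins = forecasts[models[0]].length;
  const ensemble = new Array(nBins).fill(0);
  for (const model of models) {
    forecasts[model].forEach((p, i) => {
      ensemble[i] += weights[model] * p; // weighted average, bin by bin
    });
  }
  return ensemble;
}

console.log(combine(weights, componentForecasts)); // ensemble bin probabilities
```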
The repository on GitHub has submission files in CDC-style CSV format, kept in `./model-forecasts/` and categorized into the following subdirectories:
- `./model-forecasts/component-models`: Component model submissions for past years, used for weight estimation.
- `./model-forecasts/cv-ensemble-models`: Ensemble files generated after performing leave-one-season-out cross validation on the component files from past years.
- `./model-forecasts/real-time-component-models`: Real time component submissions for the current season.
- `./model-forecasts/real-time-ensemble-models`: Real time ensemble files created from the real time component files.
- `./model-forecasts/submissions`: Files to be submitted to CDC from the best ensemble model (fixed at the beginning of the season).
Scripts for working with the data are written in R and JavaScript. For JavaScript, the dependencies are managed using directory-local package.json files. Dependencies for the R scripts are listed in `./pkrfile`.

The repository contains the following sets of scripts:
- Data files test

  This is a single script, `./test-data.js`, which does superficial checks (without actually reading the content of the files) on the CSVs inside `./model-forecasts`. It is written in JavaScript (needs Node.js) and can be invoked using the following commands in the repository's root directory:

  ```
  npm install   # Install JS dependencies for tests
  npm run test
  ```
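
  For intuition, a check of this kind might look like the sketch below: it only inspects file names and locations, never parsing the CSV contents. This is an illustrative sketch (the naming pattern is an assumption), not the actual contents of `./test-data.js`.

  ```js
  // Illustrative sketch only (not ./test-data.js): check that every CSV in a
  // component model directory follows an assumed EW<week>-...csv style name,
  // without reading file contents.
  const fs = require('fs');
  const path = require('path');

  const root = './model-forecasts/component-models';
  const namePattern = /^EW\d{1,2}-.+\.csv$/;  // assumed naming convention

  for (const model of fs.readdirSync(root)) {
    const modelDir = path.join(root, model);
    if (!fs.statSync(modelDir).isDirectory()) continue;
    for (const file of fs.readdirSync(modelDir)) {
      if (file.endsWith('.csv') && !namePattern.test(file)) {
        console.error(`Unexpected file name: ${path.join(modelDir, file)}`);
        process.exitCode = 1;
      }
    }
  }
  ```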
- Score generation

  The CSVs are scored against true values from `./scores/target-multivals.csv` using the script `./scripts/generate-scores.js`. The script can be used by running the following in the repository's root directory:

  ```
  npm install               # Install JS dependencies, if not already done
  npm run generate-scores   # Generate ./scores/scores.csv

  # Also run the following to generate extra metadata to be used in further
  # processing
  npm run generate-id-mappings
  npm run format-metadata
  ```
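
  As a rough picture of what scoring involves, the sketch below computes a log score for one forecast by summing the probabilities a model assigned to the set of acceptable true bins (as listed in `target-multivals.csv`) and taking the log. This is only a hedged illustration of the idea, not the logic of `./scripts/generate-scores.js` itself.

  ```js
  // Hedged sketch of the scoring idea (not the actual generate-scores.js):
  // the score for a target is the log of the total probability the model
  // assigned to the bins considered correct for that target.
  function logScore(binProbabilities, trueBins) {
    // binProbabilities: { binStart: probability } taken from a submission CSV
    // trueBins: acceptable bin starts for this target from target-multivals.csv
    const assigned = trueBins.reduce(
      (total, bin) => total + (binProbabilities[bin] || 0),
      0
    );
    return Math.log(assigned); // -Infinity if no probability was assigned
  }

  // Example: the model put 0.3 + 0.2 on the two acceptable bins 2.0 and 2.1
  console.log(logScore({ '1.9': 0.1, '2.0': 0.3, '2.1': 0.2 }, ['2.0', '2.1']));
  ```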
- Real time ensemble file generation

  This step involves the script `./scripts/make-real-time-ensemble-forecast-file.R`, which accepts a week number (like 52, 1, 2, etc.) as an argument and generates ensemble CSVs and plots for submission. Its dependencies are listed in `./pkrfile` and can be installed from CRAN in an R session or using the CLI helper pkr. An example use case is given below:

  ```
  # Generate ensemble files for week 52
  Rscript ./scripts/make-real-time-ensemble-forecast-file.R 52
  ```
  There is a helper script, `./scripts/get-current-week.js`, for automatically detecting the correct week to generate the ensemble files for (based on either a commit message from Travis or the list of already generated ensemble files). Using the helper script, the ensemble invocation above becomes:

  ```
  # In repository's root
  npm install   # Install JS dependencies, if not already done
  Rscript ./scripts/make-real-time-ensemble-forecast-file.R $(node ./scripts/get-current-week.js)
  ```
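
  The detection itself can be pictured along these lines (an illustrative sketch, not the real `./scripts/get-current-week.js`): first look for a week number in the Travis commit message, otherwise fall back to the latest week found among the already generated ensemble files.

  ```js
  // Illustrative sketch of week detection (not the actual helper script).
  // Assumes at least one ensemble file already exists for the fallback case.
  const fs = require('fs');

  function currentWeek() {
    // 1. A commit message like "[TRAVIS] Generate ensemble week 52" wins.
    const message = process.env.TRAVIS_COMMIT_MESSAGE || '';
    const match = message.match(/week\s+(\d{1,2})/i);
    if (match) return parseInt(match[1], 10);

    // 2. Otherwise, take the latest week already present among the generated
    //    ensemble files and move to the next one (wrapping 52 -> 1).
    const files = fs.readdirSync('./model-forecasts/real-time-ensemble-models');
    const weeks = files
      .map(f => (f.match(/EW(\d{1,2})/) || [])[1])
      .filter(Boolean)
      .map(Number);
    const latest = Math.max(...weeks);
    return latest === 52 ? 1 : latest + 1;
  }

  console.log(currentWeek());
  ```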
- Deploying visualization

  Visualizations (hosted on http://flusightnetworks.io) are generated using scripts from `./flusight-deploy`. The following bash commands take care of parsing the data, setting up dependencies, and building the visualizer in the repository's root:

  ```
  # In repository's root
  cd ./flusight-deploy
  bash ./0-init-flusight.sh
  bash ./1-build-flusight.sh
  ```
  Note that this dumps all the built files in the repository's root directory so that the deploy steps are easy. If you do not want to clutter the root, remove the last few lines from `./flusight-deploy/1-build-flusight.sh`, which copy the built visualizer files to the root directory.
We use Travis to automate scripts for testing data, generating ensemble files weekly, building visualizations, etc. This section documents the Travis build process and the related triggers used in this repo.

Every time commits are pushed to the repo, a Travis build is triggered which does the following:
- Check if the commit message is `[TRAVIS] Autogenerated files from travis`. If yes, exit the build.
- Run tests on the CSV files. If not on the `master` branch, end the build here.
- If the commit message is `[TRAVIS] Generate scores`, run the score generation scripts and push the scores and other autogenerated files to GitHub.
- If the commit message is `[TRAVIS] Generate ensemble week xx` (or the build is triggered via a cron task), run the real time ensemble generation script and push the generated files to master.
- Build the flusight visualizer and push it to the `gh-pages` branch.
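
The gating above can be pictured with the hedged sketch below, where the commit message exposed by Travis (via the `TRAVIS_COMMIT_MESSAGE` environment variable) decides which step runs. It is only an illustration of the control flow, not the repository's actual CI configuration.

```js
// Hedged sketch of the commit-message gating (not the real Travis config).
const message = process.env.TRAVIS_COMMIT_MESSAGE || '';

if (message.startsWith('[TRAVIS] Autogenerated files from travis')) {
  // Skip builds triggered by files that Travis itself pushed.
  process.exit(0);
}

if (message.startsWith('[TRAVIS] Generate scores')) {
  console.log('would run: npm run generate-scores');
} else if (/^\[TRAVIS\] Generate ensemble week \d+/.test(message)) {
  console.log('would run: Rscript ./scripts/make-real-time-ensemble-forecast-file.R <week>');
}
```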
Other than user commits, there are two more triggers for the build process:
- A cron job scheduled at 18:00 on Mondays. This triggers the ensemble file creation process.
- Pull requests from reichlab/2017-2018-cdc-flu-contest. These trigger a process which pulls recent submissions for the component models from Reich Lab.