Code Documentation

This page documents the pipeline and directory structure of the scripts/processes involved in the collaborative ensemble repository.

At a high level, the process looks like this:

  1. Component model files (up to the last season) are collected in ./model-forecasts/component-models.
  2. Weights are generated for the different ensemble types and are kept in ./weights.
  3. During the prediction season (the current season, with real-time data), real-time submission files from the same set of components used in weight estimation are used to generate ensemble submissions each week.

Note that the first two steps are run at the beginning of the season, fixing the component weights for the rest of the prediction season. Read on for details of the implementation.

Implementation

The repository on GitHub keeps submission files, as CDC-style CSVs, in ./model-forecasts/, categorized into the following subdirectories (a combined layout sketch follows the list):

  • ./model-forecasts/component-models

    Component model submissions for past years. Used for weight estimation.

  • ./model-forecasts/cv-ensemble-models

    Ensemble files generated after performing leave-one-season-out cross validation on the component files from past years.

  • ./model-forecasts/real-time-component-models

    Real time component submissions for the current season.

  • ./model-forecasts/real-time-ensemble-models

    Real time ensemble files created from real time component files.

  • ./model-forecasts/submissions

    Files to be submitted to CDC from the best ensemble model (fixed at the beginning of the season).
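
In summary, the layout of ./model-forecasts/ looks like this:

    model-forecasts/
    ├── component-models/            # past seasons' component submissions
    ├── cv-ensemble-models/          # cross-validated ensembles over past seasons
    ├── real-time-component-models/  # current season component submissions
    ├── real-time-ensemble-models/   # current season ensemble files
    └── submissions/                 # files submitted to CDC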

Scripts

Scripts for working with the data are written in R and JavaScript. For JavaScript, the dependencies are managed using directory-local package.json files. Dependencies for the R scripts are listed in the file ./pkrfile.
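
For example, both sets of dependencies can be set up from the repository's root as sketched below. The R one-liner assumes that ./pkrfile lists one CRAN package name per line; check the file's actual format before relying on it:

    # Install JS dependencies (uses the package.json in the current directory)
    npm install

    # Install R dependencies from CRAN; assumes ./pkrfile lists one package
    # name per line
    Rscript -e 'install.packages(readLines("./pkrfile"), repos = "https://cloud.r-project.org")'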

The repository contains the following sets of scripts (a combined usage sketch follows the list):

  1. Data files test

    This is a single script, ./test-data.js, which performs superficial checks (without actually reading the contents of the files) on the CSVs inside ./model-forecasts.

    It is written in JavaScript (needs Node.js) and can be invoked using the following commands in the repository's root directory:

    npm install # Install JS dependencies for tests
    npm run test
  2. Score generation

    The CSVs are scored against true values from ./scores/target-multivals.csv using the script ./scripts/generate-scores.js. The script can be used by running the following in the repository's root directory:

    npm install # Install JS dependencies, if not already done
    npm run generate-scores # Generate ./scores/scores.csv
    
    # Also run the following to generate extra metadata to be used in further
    # processing
    npm run generate-id-mappings
    npm run format-metadata
  3. Real time ensemble file generation

    This step involves the script ./scripts/make-real-time-ensemble-forecast-file.R, which accepts a week number (like 52, 1, 2, etc.) as an argument and generates ensemble CSVs and plots for submission. Its dependencies are listed in ./pkrfile and can be installed from CRAN in an R session or using the CLI helper pkr. An example use case is given below:

    # Generate ensemble files for week 52
    Rscript ./scripts/make-real-time-ensemble-forecast-file.R 52

    There is a helper script, ./scripts/get-current-week.js, for automatically detecting the correct week to generate the ensemble files for (based on either a commit message from Travis or the list of already generated ensemble files). Using the helper script, the ensemble invocation looks like the following:

    # In repository's root
    npm install # Install JS dependencies if not already done
    Rscript ./scripts/make-real-time-ensemble-forecast-file.R $(node ./scripts/get-current-week.js)
  4. Deploying visualization

    Visualizations (hosted on http://flusightnetworks.io) are generated using scripts from ./flusight-deploy. The following bash commands take care of parsing the data, setting up dependencies and building the visualizer in the repo root.

    # In repository's root
    cd ./flusight-deploy
    bash ./0-init-flusight.sh
    bash ./1-build-flusight.sh

    Note that this dumps all the built files into the repository's root directory so that the deploy steps are easy. If you do not want to clutter the root, remove the last few lines of ./flusight-deploy/1-build-flusight.sh, which copy the built visualizer files to the root directory.
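
Putting the scripts above together, one possible end-to-end run from the repository's root is sketched below. This is only an illustration of how the pieces compose (normally Travis runs these steps, as described in the next section); check the scripts themselves for the authoritative order and options:

    # Sketch only: compose the scripts above into a single run
    npm install                     # JS dependencies
    npm run test                    # superficial checks on the CSVs
    npm run generate-scores         # refresh ./scores/scores.csv
    npm run generate-id-mappings    # extra metadata for further processing
    npm run format-metadata
    Rscript ./scripts/make-real-time-ensemble-forecast-file.R \
      "$(node ./scripts/get-current-week.js)"
    (cd ./flusight-deploy && bash ./0-init-flusight.sh && bash ./1-build-flusight.sh)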

Travis

We use Travis CI to automate scripts for testing data, generating ensemble files weekly, building visualizations, etc. This section documents the Travis build process and the related triggers used in this repo.

Every time commits are pushed to the repo, a Travis build is triggered which does the following (a shell sketch of this branching logic follows the list):

  1. Check whether the commit message is [TRAVIS] Autogenerated files from travis. If it is, exit the build.
  2. Run tests on the CSV files. If not on the master branch, end the build here.
  3. If the commit message is [TRAVIS] Generate scores, run the score generation scripts and push the scores and other autogenerated files to GitHub.
  4. If the commit message is [TRAVIS] Generate ensemble week xx (or the build is triggered via a cron task), run the real-time ensemble generation script and push the generated files to master.
  5. Build the flusight visualizer and push it to the gh-pages branch.
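
The sketch below illustrates this branching. It is not the repository's actual CI configuration; TRAVIS_COMMIT_MESSAGE, TRAVIS_BRANCH, and TRAVIS_EVENT_TYPE are standard environment variables that Travis sets during a build, and pushing the generated files back to GitHub is omitted:

    # Illustrative sketch only; the real logic lives in the Travis configuration
    case "$TRAVIS_COMMIT_MESSAGE" in
      "[TRAVIS] Autogenerated files from travis"*) exit 0 ;;  # step 1
    esac

    npm run test                                 # step 2: data checks
    [ "$TRAVIS_BRANCH" = "master" ] || exit 0    # only continue on master

    case "$TRAVIS_COMMIT_MESSAGE" in
      "[TRAVIS] Generate scores"*) npm run generate-scores ;;  # step 3
    esac

    run_ensemble=no                              # step 4: ensemble generation
    case "$TRAVIS_COMMIT_MESSAGE" in
      "[TRAVIS] Generate ensemble week"*) run_ensemble=yes ;;
    esac
    [ "$TRAVIS_EVENT_TYPE" = "cron" ] && run_ensemble=yes
    if [ "$run_ensemble" = "yes" ]; then
      Rscript ./scripts/make-real-time-ensemble-forecast-file.R \
        "$(node ./scripts/get-current-week.js)"
    fi

    # step 5: build the visualizer (deployed to the gh-pages branch)
    (cd ./flusight-deploy && bash ./0-init-flusight.sh && bash ./1-build-flusight.sh)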

Other than user commits (which can also be used to trigger specific build steps manually; see the example after this list), there are two more triggers for the build process:

  1. A cron job scheduled at 18:00 on Mondays. This triggers the ensemble file creation process.
  2. Pull requests from reichlab/2017-2018-cdc-flu-contest. These trigger a process which pulls the recent component model submissions from the Reich Lab.
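
For instance, the commit-message trigger from the build-step list above could be exercised manually with an empty commit (hypothetical usage; requires push access to master):

    # Manually request ensemble generation for, say, week 03
    git commit --allow-empty -m "[TRAVIS] Generate ensemble week 03"
    git push origin master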