Merge pull request #1990 from AllenInstitute/rc/2.10.0

eventual merge of 2.10.1
AllenInstitute · Mar 23, 2021 · ea8cdc7 · ea8cdc7
2 parents 9474baf + 3afd41c
commit ea8cdc7
Showing 104 changed files with 5,209 additions and 1,175 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,7 +1,18 @@
 # Change Log
 All notable changes to this project will be documented in this file.
 
+## [2.10.1] = 2021-03-23
+- Renames BehaviorProjectCache to VisualBehaviorOphysProjectCache
+- Renames VisualBehaviorOphysProjectCache method get_session_table() to get_ophys_session_table()
+- Renames VisualBehaviorOphysProjectCache method get_experiment_table() to get_ophys_experiment_table()
+- VisualBehaviorOphysProjectCache can now be instantiated with from_s3_cache() and from_local_cache()
+- Improvements to BehaviorProjectCache
+- Adds project metadata writer
+
 ## [2.9.0] = 2021-03-08
+- Improvements to BehaviorProjectCache 
+
+## [2.9.0] = 2021-03-08
 - Updates to Session metadata; refactors implementation to use class rather than dict internally
 - Fixes a bug that was preventing Allen Institute Windows users from accessing gratings images
 

diff --git a/allensdk/__init__.py b/allensdk/__init__.py
@@ -35,7 +35,7 @@
 #
 import logging
 
-__version__ = '2.9.0'
+__version__ = '2.10.1'
 
 
 try:

diff --git a/allensdk/api/cloud_cache/README.md b/allensdk/api/cloud_cache/README.md
@@ -0,0 +1,130 @@
+Cloud Cache
+===========
+
+## High level summary
+
+The classes defined in this directory are designed to provide programmatic
+access to version-controlled, cloud-hosted datasets. Users download these
+datasets using sub-classes of the `CloudCacheBase` class defined in
+`cloud_cache.py`. The datasets accessed by the cloud cache generally
+consist of three parts:
+
+- Some arbitrary number of metadata files. These will be csv files suitable for
+reading with pandas.
+- Some arbitrary number of data files. These can be of any form.
+- A manifest.json file defining the contents of the dataset.
+
+For each version of the dataset, there will be a distinct manifest file loaded
+into the cloud service behind the cloud cache. All other files are
+version-controlled using the cloud service's native functionality. To load a
+dataset, the user instantiates a sub-class of `CloudCacheBase` and runs
+`cache.load_manifest('name_of_manifest.json')`. Valid manifest file names can
+be accessed through `cache.manifest_file_names`. Loading the manifest
+essentially configures the cloud cache to access the corresponding version of
+the dataset.
+
+`cache.download_data(file_id)` will download a data file to the local
+system and return the path to where that file has been downloaded. If the file
+has already been downloaded, `cache.download_data(file_id)` will just
+return the path to the local copy of the file without downloading it again.
+In this call `file_id` is a unique identifier for each data file corresponding
+to a column in the metadata files. The name of that column can be found with
+`cache.file_id_column`.
+
+`cache.download_metadata(metadata_fname)` will download a metadata
+file to the local system and return the path where the file has been stored.
+The list of valid values for `metadata_fname` can be found with
+`cache.metadata_file_names`. If users wish to directly access a
+pandas DataFrame of a given metadata file, they can use
+`cache.get_metadata(metadata_fname)`.
+
+## Structure of `manifest.json`
+
+The `manifest.json` files are structured like so:
+```
+
+{
+ "project_name" : my-project-name-string,
+ "dataset_version" : dataset_version_string,
+ "file_id_column": name_of_column_uniquely_identifying_files,
+ "metadata_files":{
+     metadata_file_name_1: {"url": "full/url/to/file",
+                            "version_id": version_id_string,
+                            "file_hash": file_hash_of_metadata_file},
+     metadata_file_name_2: {"url": "full/url/to/file",
+                            "version_id": version_id_string,
+                            "file_hash": file_hash_of_metadata_file},
+  ...
+ },
+ "data_files": {
+     file_id_1: {"url": "full/url/to/imaging_plane.nwb",
+                 "version_id": version_id_string,
+                 "file_hash": file_hash_of_file},
+     file_id_2: {"url": "full/url/to/behavior_only_session.nwb",
+                 "version_id": version_id_string,
+                 "file_hash": file_hash_of_file},
+    ...
+    }
+}
+```
+The entries under `metadata_files` and `data_files` provide the information
+necessary for the cloud cache to
+
+- locate the online resource
+- determine where it should be stored locally
+- determine if the copy that is stored locally is valid
+
+When a user asks to download a file, `cache._manifest` (an
+instance of the `Manifest` class defined in `manifest.py`) constructs
+a candidate local path for the resource like
+```
+cache_dir/file_hash/relative_path_to_resource
+```
+where `cache_dir` is a parent directory for all local data storage specified by
+the user upon instantiating the cloud cache. If a file already exists at that
+location, the cloud cache compares its `file_hash` to the `file_hash` reported
+in the manifest. If they match, the file does not need to be downloaded.
+If either
+
+- a file does not exist at the candidate local path or
+- the `file_hash` of the file at the candidate local path does not match the
+`file_hash` reported in the manifest
+
+then the cloud cache downloads the online resource to the candidate local path.
+By including `file_hash` in the local path, we ensure that, if `data_file_1`
+did not change between versions 1 and 2 of the dataset, it will not be
+needlessly downloaded again when the user switches between those versions of
+the dataset. Furthermore, when the user switches to version 3 of the dataset,
+they will not lose the old version of `data_file_1` that they previously
+downloaded; the cloud cache will merely redirect them to the newer
+version of the data file.
+
+The `version_id` entry in the `manifest.json` description of resources is
+necessary to disambiguate different versions of the same file when downloading
+the resources from the cloud service.
+
+## Implementation of `CloudCacheBase`
+
+`CloudCacheBase` is a base class that is meant to be cloud-provider
+agnostic. To access a dataset, users need a sub-class of `CloudCacheBase`
+that knows how to access the specific cloud service hosting the data
+(see, for instance, `S3CloudCache`, also defined in `cloud_cache.py`).
+Sub-classes of `CloudCacheBase` must implement the following methods.
+
+### `_list_all_manifests`
+
+Takes no arguments beyond `self`. Returns a list of all `manifest.json` files
+in the dataset (with the `manifest/` prefix removed from the path).
+
+### `_download_manifest`
+
+Takes the name of a `manifest.json` file and an `io.BytesIO` stream. Downloads the
+contents of the `manifest.json`, loads it into the stream, and resets the
+stream to the beginning (i.e. `stream.seek(0)`). Returns nothing.
+
+### `_download_file`
+
+Takes a `CacheFileAttributes` (defined in `file_attributes.py`) describing a
+file. Checks to see if the local file exists in a valid state. If not,
+downloads the file.
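Reviewer note: the local-path and hash-validation rule described in the README above can be sketched roughly as follows. This is a minimal illustration, not the code in `manifest.py` or `cloud_cache.py`. The function names (`file_hash_from_path`, `candidate_local_path`, `needs_download`, `demo`) and the choice of blake2b as the hash are assumptions made for this sketch.

```python
import hashlib
import pathlib
import tempfile


def file_hash_from_path(path):
    """Return the hex digest of a file's contents.

    blake2b is an assumption made for this sketch.
    """
    hasher = hashlib.blake2b()
    with open(path, "rb") as in_file:
        # Read in chunks so large data files are not held in memory at once.
        for chunk in iter(lambda: in_file.read(1_000_000), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def candidate_local_path(cache_dir, file_hash, relative_path):
    """Build cache_dir/file_hash/relative_path_to_resource."""
    return pathlib.Path(cache_dir) / file_hash / relative_path


def needs_download(local_path, expected_hash):
    """True if the file is missing or its hash differs from the manifest's."""
    if not local_path.is_file():
        return True
    return file_hash_from_path(local_path) != expected_hash


def demo():
    """Walk through the missing -> downloaded -> valid sequence locally."""
    content = b"pretend this is an NWB file"
    expected_hash = hashlib.blake2b(content).hexdigest()
    with tempfile.TemporaryDirectory() as cache_dir:
        path = candidate_local_path(cache_dir, expected_hash, "data/session.nwb")
        before = needs_download(path, expected_hash)  # nothing on disk yet
        path.parent.mkdir(parents=True)
        path.write_bytes(content)                     # stands in for the download
        after = needs_download(path, expected_hash)   # local copy now validates
    return before, after
```

Because the hash is part of the directory name, two dataset versions that list the same `file_hash` for a file resolve to the same local path, so the second version reuses the first download.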
diff --git a/allensdk/api/cloud_cache/__init__.py b/allensdk/api/cloud_cache/__init__.py
@@ -0,0 +1 @@
+# empty file
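Reviewer note: the sub-class contract described in the cloud cache README (`_list_all_manifests` and `_download_manifest`) can be illustrated with a toy, in-memory stand-in. `ToyCloudCacheBase` and `InMemoryCache` are hypothetical names invented for this sketch; they are not the allensdk classes, and `_download_file` is left out to keep the example short.

```python
import io
import json
from abc import ABC, abstractmethod


class ToyCloudCacheBase(ABC):
    """Hypothetical stand-in for CloudCacheBase; not the allensdk class."""

    def __init__(self):
        self.manifest = None

    @property
    def manifest_file_names(self):
        return sorted(self._list_all_manifests())

    def load_manifest(self, manifest_name):
        if manifest_name not in self.manifest_file_names:
            raise ValueError(f"{manifest_name} is not a valid manifest")
        with io.BytesIO() as stream:
            # The sub-class fills the stream and rewinds it to the start.
            self._download_manifest(manifest_name, stream)
            self.manifest = json.load(stream)

    @abstractmethod
    def _list_all_manifests(self):
        """Return manifest names with the 'manifest/' prefix removed."""

    @abstractmethod
    def _download_manifest(self, manifest_name, stream):
        """Write the manifest bytes into stream, then stream.seek(0)."""


class InMemoryCache(ToyCloudCacheBase):
    """Serves 'remote' files from a dict instead of a cloud service."""

    def __init__(self, remote):
        super().__init__()
        self._remote = remote  # maps "manifest/<name>" -> bytes

    def _list_all_manifests(self):
        prefix = "manifest/"
        return [key[len(prefix):] for key in self._remote
                if key.startswith(prefix)]

    def _download_manifest(self, manifest_name, stream):
        stream.write(self._remote["manifest/" + manifest_name])
        stream.seek(0)
```

A cloud-specific sub-class would replace the dict lookups with calls to its provider's client; the base class logic would stay the same.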