Skip to content

Commit

Permalink
Merge pull request #1990 from AllenInstitute/rc/2.10.0
Browse files Browse the repository at this point in the history
eventual merge of 2.10.1
  • Loading branch information
djkapner authored Mar 23, 2021
2 parents 9474baf + 3afd41c commit ea8cdc7
Show file tree
Hide file tree
Showing 104 changed files with 5,209 additions and 1,175 deletions.
11 changes: 11 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,18 @@
# Change Log
All notable changes to this project will be documented in this file.

## [2.10.1] = 2021-03-23
- changes name of BehaviorProjectCache to VisualBehaviorOphysProjectCache
- changes VisualBehaviorOphysProjectCache method get_session_table() to get_ophys_session_table()
- changes VisualBehaviorOphysProjectCache method get_experiment_table() to get_ophys_experiment_table()
- VisualBehaviorOphysProjectCache is enabled to instantiate from_s3_cache() and from_local_cache()
- Improvements to BehaviorProjectCache
- Adds project metadata writer

## [2.9.0] = 2021-03-08
- Improvements to BehaviorProjectCache

## [2.9.0] = 20201-03-08
- Updates to Session metadata; refactors implementation to use class rather than dict internally
- Fixes a bug that was preventing Allen Institute Windows users from accessing gratings images

Expand Down
2 changes: 1 addition & 1 deletion allensdk/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@
#
import logging

__version__ = '2.9.0'
__version__ = '2.10.1'


try:
Expand Down
130 changes: 130 additions & 0 deletions allensdk/api/cloud_cache/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,130 @@
Cloud Cache
===========

## High level summary

The classes defined in this directory are designed to provide programmatic
access to version-controlled, cloud-hosted datasets. Users download these
datasets using sub-classes of the `CloudCacheBase` class defined in
`cloud_cache.py`. The datasets accessed by the cloud cache generally
consist of three parts

- Some arbitrary number of metadata files. These will be csv files suitable for
reading with pandas.
- Some arbitrary number of data files. These can be of any form.
- A manifest.json file defining the contents of the dataset.

For each version of the dataset, there will be a distinct manifest file loaded
into the cloud service behind the cloud cache. All other files are
version-controlled using the cloud service's native functionality. To load a
dataset, the user instantiates a sub-class of `CloudCacheBase` and runs
`cache.load_manifest('name_of_manifest.json')`. Valid manifest file names can
be accessed through `cache.manifest_file_names`. Loading the manifest
essentially configures the cloud cache to access the corresponding version of
the dataset.

`cache.download_data(file_id)` will download a data file to the local
sytem and return the path to where that file has been downloaded. If the file
has already been downloaded, `cache.download_data(file_id)` will just
return the path to the local copy of the file without downloading it again.
In this call `file_id` is a unique identifier for each data file corresponding
to a column in the metadata files. The name of that column can be found with
`cache.file_id_column`.

`cache.download_metadata(metadata_fname)` will download a metadata
file to the local system and return the path where the file has been stored.
The list of valid values for `metadata_fname` can be found with
`cache.metadata_file_names`. If users wish to directly access a
pandas DataFrame of a given metadata file, they can use
`cache.get_metadata(metadata_fname)`.

## Structure of `manifest.json`

The `manifest.json` files are structured like so
```
{
"project_name" : my-project-name-string,
"dataset_version" : dataset_version_string,
"file_id_column": name_of_column_uniquely_identifying_files,
"metadata_files":{
metadata_file_name_1: {"url": "full/url/to/file",
"version_id": version_id_string,
"file_hash": file_hash_of_metadata_file},
metadata_file_name_2: {"url": "full/url/to/file",
"version_id": version_id_string,
"file_hash": file_hash_of_metadata_file},
...
},
"data_files": {
file_id_1: {"url": "full/url/to/imaging_plane.nwb",
"version_id": version_id_string,
"file_hash": file_hash_of_file},
file_id_2: {"url": "full/url/to/behavior_only_session.nwb",
"version_id": version_id_string,
"file_hash": file_hash_of_file},
...
}
}
```
The entries under `metadata_files` and `data_files` provide the information
necessary for the cloud cache to

- locate the online resoure
- determine where it should be stored locally
- determine if the copy that is stored locally is valid

When a user asks to download a file, `cache._manifest` (an
instantiation of the `Manifest` class defined in `manifest.py`) constructs
a candidate local path for the resource like
```
cache_dir/file_hash/relative_path_to_resource
```
where `cache_dir` is a parent directory for all local data storage specified by
the user upon instantiating the cloud cache. If a file already exists at that
location, the cloud cache compares its `file_hash` to the `file_hash` reported
in the manifest. If they match, the file does not need to be downloaded.
If either

- a file does not exist at the candidate local path or
- the `file_hash` of the file at the candidate local path does not match the
`file_hash` reported in the manifest

then the cloud cache downloads the online resource to the candidate local path.
By including `file_hash` in the local path, we ensure that, if `data_file_1`
did not change between versions 1 and 2 of the dataset, it will not be
needlessly downloaded again when the user switches between those versions of
the dataset. Furthermore, when the user switches to version 3 of the dataset,
they will not lose the old version of `data_file_1` that they previously
downloaded, the cloud cache will merely redirect them to using the newer
version of the data file.

The `version_id` entry in the `manifest.json` description of resources is
necessary to disambiguate different versions of the same file when downloading
the resources from the cloud service.

## Implementation of `CloudCacheBase`

`CloudCacheBase` is actually just a base class that is meant to be
cloud-provider agnostic. In order to actually access a dataset, a sub-class
of `CloudCacheBase` must be implemented which knows how to access the
specific cloud service hosting the data (see, for instance `S3CloudCache`,
also defined in `cloud_cache.py`). Sub-classes of `CloudCacheBase` must
implement

### `_list_all_manifests`

Takes no arguments beyond `self`. Returns a list of all `manifest.json` files
in the dataset (with the `manifest/` prefix removed from the path).

### `_download_manifest`

Takes the name of a `manifest.json` file an `io.BytesIO` stream. Downloads the
contents of the `manifest.json`, loads it into the stream, and resets the
stream to the beginning (i.e. `stream.seek(0)`). Returns nothing.

### `_download_file`

Takes a `CacheFileAttributes` (defined in `file_attributes.py`) describing a
file. Checks to see if the local file exists in a valid state. If not,
downloads the file.
1 change: 1 addition & 0 deletions allensdk/api/cloud_cache/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# empty file
Loading

0 comments on commit ea8cdc7

Please sign in to comment.