-
Notifications
You must be signed in to change notification settings - Fork 151
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #1990 from AllenInstitute/rc/2.10.0
eventual merge of 2.10.1
- Loading branch information
Showing
104 changed files
with
5,209 additions
and
1,175 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -35,7 +35,7 @@ | |
# | ||
import logging | ||
|
||
__version__ = '2.9.0' | ||
__version__ = '2.10.1' | ||
|
||
|
||
try: | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,130 @@ | ||
Cloud Cache | ||
=========== | ||
|
||
## High level summary | ||
|
||
The classes defined in this directory are designed to provide programmatic | ||
access to version-controlled, cloud-hosted datasets. Users download these | ||
datasets using sub-classes of the `CloudCacheBase` class defined in | ||
`cloud_cache.py`. The datasets accessed by the cloud cache generally | ||
consist of three parts | ||
|
||
- Some arbitrary number of metadata files. These will be csv files suitable for | ||
reading with pandas. | ||
- Some arbitrary number of data files. These can be of any form. | ||
- A manifest.json file defining the contents of the dataset. | ||
|
||
For each version of the dataset, there will be a distinct manifest file loaded | ||
into the cloud service behind the cloud cache. All other files are | ||
version-controlled using the cloud service's native functionality. To load a | ||
dataset, the user instantiates a sub-class of `CloudCacheBase` and runs | ||
`cache.load_manifest('name_of_manifest.json')`. Valid manifest file names can | ||
be accessed through `cache.manifest_file_names`. Loading the manifest | ||
essentially configures the cloud cache to access the corresponding version of | ||
the dataset. | ||
|
||
`cache.download_data(file_id)` will download a data file to the local | ||
sytem and return the path to where that file has been downloaded. If the file | ||
has already been downloaded, `cache.download_data(file_id)` will just | ||
return the path to the local copy of the file without downloading it again. | ||
In this call `file_id` is a unique identifier for each data file corresponding | ||
to a column in the metadata files. The name of that column can be found with | ||
`cache.file_id_column`. | ||
|
||
`cache.download_metadata(metadata_fname)` will download a metadata | ||
file to the local system and return the path where the file has been stored. | ||
The list of valid values for `metadata_fname` can be found with | ||
`cache.metadata_file_names`. If users wish to directly access a | ||
pandas DataFrame of a given metadata file, they can use | ||
`cache.get_metadata(metadata_fname)`. | ||
|
||
## Structure of `manifest.json` | ||
|
||
The `manifest.json` files are structured like so | ||
``` | ||
{ | ||
"project_name" : my-project-name-string, | ||
"dataset_version" : dataset_version_string, | ||
"file_id_column": name_of_column_uniquely_identifying_files, | ||
"metadata_files":{ | ||
metadata_file_name_1: {"url": "full/url/to/file", | ||
"version_id": version_id_string, | ||
"file_hash": file_hash_of_metadata_file}, | ||
metadata_file_name_2: {"url": "full/url/to/file", | ||
"version_id": version_id_string, | ||
"file_hash": file_hash_of_metadata_file}, | ||
... | ||
}, | ||
"data_files": { | ||
file_id_1: {"url": "full/url/to/imaging_plane.nwb", | ||
"version_id": version_id_string, | ||
"file_hash": file_hash_of_file}, | ||
file_id_2: {"url": "full/url/to/behavior_only_session.nwb", | ||
"version_id": version_id_string, | ||
"file_hash": file_hash_of_file}, | ||
... | ||
} | ||
} | ||
``` | ||
The entries under `metadata_files` and `data_files` provide the information | ||
necessary for the cloud cache to | ||
|
||
- locate the online resoure | ||
- determine where it should be stored locally | ||
- determine if the copy that is stored locally is valid | ||
|
||
When a user asks to download a file, `cache._manifest` (an | ||
instantiation of the `Manifest` class defined in `manifest.py`) constructs | ||
a candidate local path for the resource like | ||
``` | ||
cache_dir/file_hash/relative_path_to_resource | ||
``` | ||
where `cache_dir` is a parent directory for all local data storage specified by | ||
the user upon instantiating the cloud cache. If a file already exists at that | ||
location, the cloud cache compares its `file_hash` to the `file_hash` reported | ||
in the manifest. If they match, the file does not need to be downloaded. | ||
If either | ||
|
||
- a file does not exist at the candidate local path or | ||
- the `file_hash` of the file at the candidate local path does not match the | ||
`file_hash` reported in the manifest | ||
|
||
then the cloud cache downloads the online resource to the candidate local path. | ||
By including `file_hash` in the local path, we ensure that, if `data_file_1` | ||
did not change between versions 1 and 2 of the dataset, it will not be | ||
needlessly downloaded again when the user switches between those versions of | ||
the dataset. Furthermore, when the user switches to version 3 of the dataset, | ||
they will not lose the old version of `data_file_1` that they previously | ||
downloaded, the cloud cache will merely redirect them to using the newer | ||
version of the data file. | ||
|
||
The `version_id` entry in the `manifest.json` description of resources is | ||
necessary to disambiguate different versions of the same file when downloading | ||
the resources from the cloud service. | ||
|
||
## Implementation of `CloudCacheBase` | ||
|
||
`CloudCacheBase` is actually just a base class that is meant to be | ||
cloud-provider agnostic. In order to actually access a dataset, a sub-class | ||
of `CloudCacheBase` must be implemented which knows how to access the | ||
specific cloud service hosting the data (see, for instance `S3CloudCache`, | ||
also defined in `cloud_cache.py`). Sub-classes of `CloudCacheBase` must | ||
implement | ||
|
||
### `_list_all_manifests` | ||
|
||
Takes no arguments beyond `self`. Returns a list of all `manifest.json` files | ||
in the dataset (with the `manifest/` prefix removed from the path). | ||
|
||
### `_download_manifest` | ||
|
||
Takes the name of a `manifest.json` file an `io.BytesIO` stream. Downloads the | ||
contents of the `manifest.json`, loads it into the stream, and resets the | ||
stream to the beginning (i.e. `stream.seek(0)`). Returns nothing. | ||
|
||
### `_download_file` | ||
|
||
Takes a `CacheFileAttributes` (defined in `file_attributes.py`) describing a | ||
file. Checks to see if the local file exists in a valid state. If not, | ||
downloads the file. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# empty file |
Oops, something went wrong.