diff --git a/CHANGELOG.md b/CHANGELOG.md
index b904cbb3d..7960793b3 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,7 +1,15 @@
 # Change Log
 All notable changes to this project will be documented in this file.
 
+## [2.10.1] = 2021-03-23
+- changes name of BehaviorProjectCache to VisualBehaviorOphysProjectCache
+- changes VisualBehaviorOphysProjectCache method get_session_table() to get_ophys_session_table()
+- changes VisualBehaviorOphysProjectCache method get_experiment_table() to get_ophys_experiment_table()
+- VisualBehaviorOphysProjectCache is enabled to instantiate from_s3_cache() and from_local_cache()
+- Improvements to BehaviorProjectCache
+- Adds project metadata writer
+
 ## [2.9.0] = 2021-03-08
 - Updates to Session metadata; refactors implementation to use class rather than dict internally
 - Fixes a bug that was preventing Allen Institute Windows users from accessing gratings images
diff --git a/allensdk/__init__.py b/allensdk/__init__.py
index 24ca10b8b..f0ee7651e 100644
--- a/allensdk/__init__.py
+++ b/allensdk/__init__.py
@@ -35,7 +35,7 @@
 #
 import logging
 
-__version__ = '2.9.0'
+__version__ = '2.10.1'
 
 
 try:
diff --git a/allensdk/api/cloud_cache/README.md b/allensdk/api/cloud_cache/README.md
new file mode 100644
index 000000000..b726348b1
--- /dev/null
+++ b/allensdk/api/cloud_cache/README.md
@@ -0,0 +1,130 @@
+Cloud Cache
+===========
+
+## High level summary
+
+The classes defined in this directory are designed to provide programmatic
+access to version-controlled, cloud-hosted datasets. Users download these
+datasets using sub-classes of the `CloudCacheBase` class defined in
+`cloud_cache.py`. The datasets accessed by the cloud cache generally
+consist of three parts:
+
+- Some arbitrary number of metadata files. These will be csv files suitable
+for reading with pandas.
+- Some arbitrary number of data files. These can be of any form.
+- A manifest.json file defining the contents of the dataset.
+
+For each version of the dataset, there will be a distinct manifest file loaded
+into the cloud service behind the cloud cache. All other files are
+version-controlled using the cloud service's native functionality. To load a
+dataset, the user instantiates a sub-class of `CloudCacheBase` and runs
+`cache.load_manifest('name_of_manifest.json')`. Valid manifest file names can
+be accessed through `cache.manifest_file_names`. Loading the manifest
+essentially configures the cloud cache to access the corresponding version of
+the dataset.
+
+`cache.download_data(file_id)` will download a data file to the local
+system and return the path to where that file has been downloaded. If the file
+has already been downloaded, `cache.download_data(file_id)` will just
+return the path to the local copy of the file without downloading it again.
+In this call `file_id` is a unique identifier for each data file corresponding
+to a column in the metadata files. The name of that column can be found with
+`cache.file_id_column`.
+
+`cache.download_metadata(metadata_fname)` will download a metadata
+file to the local system and return the path where the file has been stored.
+The list of valid values for `metadata_fname` can be found with
+`cache.metadata_file_names`. If users wish to directly access a
+pandas DataFrame of a given metadata file, they can use
+`cache.get_metadata(metadata_fname)`.
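+
+For example, a typical workflow against an S3-hosted dataset might look
+like the sketch below. The cache directory, bucket name, and project name
+are hypothetical placeholders; the methods are the ones documented above.
+
+```
+from allensdk.api.cloud_cache.cloud_cache import S3CloudCache
+
+cache = S3CloudCache(cache_dir='/path/to/local/cache',   # hypothetical
+                     bucket_name='my-example-bucket',    # hypothetical
+                     project_name='my-project-name')     # hypothetical
+
+print(cache.manifest_file_names)  # every dataset version available
+cache.load_latest_manifest()      # or cache.load_manifest('...')
+
+# read a metadata file into a pandas DataFrame and use the
+# file_id column to request a corresponding data file
+metadata_df = cache.get_metadata(cache.metadata_file_names[0])
+file_id = metadata_df[cache.file_id_column].values[0]
+local_path = cache.download_data(file_id)
+```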
+
+## Structure of `manifest.json`
+
+The `manifest.json` files are structured like so:
+```
+{
+ "project_name" : my-project-name-string,
+ "manifest_version" : manifest_version_string,
+ "metadata_file_id_column_name": name_of_column_uniquely_identifying_files,
+ "data_pipeline": data_pipeline_metadata,
+ "metadata_files":{
+   metadata_file_name_1: {"url": "full/url/to/file",
+                          "version_id": version_id_string,
+                          "file_hash": file_hash_of_metadata_file},
+   metadata_file_name_2: {"url": "full/url/to/file",
+                          "version_id": version_id_string,
+                          "file_hash": file_hash_of_metadata_file},
+   ...
+ },
+ "data_files": {
+   file_id_1: {"url": "full/url/to/imaging_plane.nwb",
+               "version_id": version_id_string,
+               "file_hash": file_hash_of_file},
+   file_id_2: {"url": "full/url/to/behavior_only_session.nwb",
+               "version_id": version_id_string,
+               "file_hash": file_hash_of_file},
+   ...
+ }
+}
+```
+The entries under `metadata_files` and `data_files` provide the information
+necessary for the cloud cache to
+
+- locate the online resource
+- determine where it should be stored locally
+- determine if the copy that is stored locally is valid
+
+When a user asks to download a file, `cache._manifest` (an
+instantiation of the `Manifest` class defined in `manifest.py`) constructs
+a candidate local path for the resource like
+```
+cache_dir/project_name-manifest_version/relative_path_to_resource
+```
+where `cache_dir` is a parent directory for all local data storage specified
+by the user upon instantiating the cloud cache. If a file already exists at
+that location, the cloud cache compares its `file_hash` to the `file_hash`
+reported in the manifest. If they match, the file does not need to be
+downloaded. If either
+
+- a file does not exist at the candidate local path or
+- the `file_hash` of the file at the candidate local path does not match the
+`file_hash` reported in the manifest
+
+then the cloud cache downloads the online resource to the candidate local
+path. Because each version of the dataset is stored under its own
+`project_name-manifest_version` directory, when the user switches to version
+3 of the dataset, they will not lose the version of `data_file_1` that they
+previously downloaded; the cloud cache will merely redirect them to using
+the copy of the data file that belongs to the newer version of the dataset.
+
+The `version_id` entry in the `manifest.json` description of resources is
+necessary to disambiguate different versions of the same file when
+downloading the resources from the cloud service.
+
+## Implementation of `CloudCacheBase`
+
+`CloudCacheBase` is just a base class that is meant to be
+cloud-provider agnostic. In order to actually access a dataset, a sub-class
+of `CloudCacheBase` must be implemented which knows how to access the
+specific cloud service hosting the data (see, for instance, `S3CloudCache`,
+also defined in `cloud_cache.py`). Sub-classes of `CloudCacheBase` must
+implement
+
+### `_list_all_manifests`
+
+Takes no arguments beyond `self`. Returns a list of all `manifest.json` files
+in the dataset (with the `manifests/` prefix removed from the path).
+
+### `_download_manifest`
+
+Takes the name of a `manifest.json` file. Downloads the contents of that
+manifest and writes them to the corresponding file in the local cache
+directory. Returns nothing.
+
+### `_download_file`
+
+Takes a `CacheFileAttributes` (defined in `file_attributes.py`) describing a
+file. Checks to see if the local file exists in a valid state. If not,
+downloads the file.
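+
+A rough sketch of such a sub-class is shown below. The method bodies are
+elided, since they depend entirely on the particular cloud service being
+wrapped; the comments describe the contract each method must satisfy.
+
+```
+class MyCloudCache(CloudCacheBase):
+
+    def _list_all_manifests(self):
+        # ask the cloud service for every object stored under
+        # self.manifest_prefix and return the bare file names
+        ...
+
+    def _download_manifest(self, manifest_name):
+        # fetch the named manifest from the cloud service and write
+        # it to the corresponding path in self._cache_dir
+        ...
+
+    def _download_file(self, file_attributes):
+        # if self._file_exists(file_attributes) is False, fetch
+        # file_attributes.url (at file_attributes.version_id) and
+        # write it to file_attributes.local_path
+        ...
+```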
diff --git a/allensdk/api/cloud_cache/__init__.py b/allensdk/api/cloud_cache/__init__.py
new file mode 100644
index 000000000..fa81adaff
--- /dev/null
+++ b/allensdk/api/cloud_cache/__init__.py
@@ -0,0 +1 @@
+# empty file
diff --git a/allensdk/api/cloud_cache/cloud_cache.py b/allensdk/api/cloud_cache/cloud_cache.py
new file mode 100644
index 000000000..93af0ddb5
--- /dev/null
+++ b/allensdk/api/cloud_cache/cloud_cache.py
@@ -0,0 +1,570 @@
+from abc import ABC, abstractmethod
+import os
+import copy
+import pathlib
+import pandas as pd
+import boto3
+import semver
+import tqdm
+import re
+from botocore import UNSIGNED
+from botocore.client import Config
+from allensdk.internal.core.lims_utilities import safe_system_path
+from allensdk.api.cloud_cache.manifest import Manifest
+from allensdk.api.cloud_cache.file_attributes import CacheFileAttributes  # noqa: E501
+from allensdk.api.cloud_cache.utils import file_hash_from_path  # noqa: E501
+from allensdk.api.cloud_cache.utils import bucket_name_from_url  # noqa: E501
+from allensdk.api.cloud_cache.utils import relative_path_from_url  # noqa: E501
+
+
+class CloudCacheBase(ABC):
+    """
+    A class to handle the downloading and accessing of data served from a cloud
+    storage system
+
+    Parameters
+    ----------
+    cache_dir: str or pathlib.Path
+        Path to the directory where data will be stored on the local system
+
+    project_name: str
+        the name of the project this cache is supposed to access. This will
+        be the root directory for all files stored in the bucket.
+    """
+
+    _bucket_name = None
+
+    def __init__(self, cache_dir, project_name):
+        os.makedirs(cache_dir, exist_ok=True)
+
+        self._manifest = None
+        self._cache_dir = cache_dir
+        self._project_name = project_name
+        self._manifest_file_names = self._list_all_manifests()
+
+    @abstractmethod
+    def _list_all_manifests(self) -> list:
+        """
+        Return a list of all of the file names of the manifests associated
+        with this dataset
+        """
+        raise NotImplementedError()
+
+    @property
+    def latest_manifest_file(self) -> str:
+        """parses available manifest files for semver string
+        and returns the latest one
+        self.manifest_file_names are assumed to be of the form
+        '<anything>_v<semver_str>.json'
+
+        Returns
+        -------
+        str
+            the filename whose semver string is the latest one
+        """
+        vstrs = [s.split(".json")[0].split("_v")[-1]
+                 for s in self.manifest_file_names]
+        versions = [semver.VersionInfo.parse(v) for v in vstrs]
+        imax = versions.index(max(versions))
+        return self.manifest_file_names[imax]
+
+    def load_latest_manifest(self):
+        self.load_manifest(self.latest_manifest_file)
+
+    @abstractmethod
+    def _download_manifest(self,
+                           manifest_name: str):
+        """
+        Download a manifest from the dataset
+
+        Parameters
+        ----------
+        manifest_name: str
+            The name of the manifest to load. Must be an element in
+            self.manifest_file_names
+        """
+        raise NotImplementedError()
+
+    @abstractmethod
+    def _download_file(self, file_attributes: CacheFileAttributes) -> bool:
+        """
+        Check if a file exists and is in the expected state.
+
+        If it is, return True.
+
+        If it is not, download the file, creating the directory
+        where the file is to be stored if necessary.
+
+        If the download is successful, return True.
+
+        If the download repeatedly fails (file hash does not match
+        expectation), raise a RuntimeError.
+
+        Parameters
+        ----------
+        file_attributes: CacheFileAttributes
+            Describes the file to download
+
+        Returns
+        -------
+        bool
+            True if the file exists locally in a valid state (either
+            already or after a successful download)
+
+        Raises
+        ------
+        RuntimeError
+            If the path to the directory where the file is to be saved
+            points to something that is not a directory.
+
+        RuntimeError
+            If it is not able to successfully download the file after
+            10 iterations
+        """
+        raise NotImplementedError()
+
+    @property
+    def project_name(self) -> str:
+        """
+        The name of the project that this cache is accessing
+        """
+        return self._project_name
+
+    @property
+    def manifest_prefix(self) -> str:
+        """
+        On-line prefix for manifest files
+        """
+        return f'{self.project_name}/manifests/'
+
+    @property
+    def file_id_column(self) -> str:
+        """
+        The column in the metadata files used to uniquely
+        identify data files
+        """
+        return self._manifest.file_id_column
+
+    @property
+    def version(self) -> str:
+        """
+        The version of the dataset currently loaded
+        """
+        return self._manifest.version
+
+    @property
+    def metadata_file_names(self) -> list:
+        """
+        List of metadata file names associated with this dataset
+        """
+        return self._manifest.metadata_file_names
+
+    @property
+    def manifest_file_names(self) -> list:
+        """
+        Sorted list of manifest file names associated with this
+        dataset
+        """
+        return copy.deepcopy(self._manifest_file_names)
+
+    def load_manifest(self, manifest_name: str):
+        """
+        Load a manifest from this dataset.
+
+        Parameters
+        ----------
+        manifest_name: str
+            The name of the manifest to load. Must be an element in
+            self.manifest_file_names
+        """
+        if manifest_name not in self.manifest_file_names:
+            raise ValueError(f"manifest: {manifest_name}\n"
+                             "is not one of the valid manifest names "
+                             "for this dataset:\n"
+                             f"{self.manifest_file_names}")
+
+        filepath = os.path.join(self._cache_dir, manifest_name)
+        if not os.path.exists(filepath):
+            self._download_manifest(manifest_name)
+
+        with open(filepath) as f:
+            self._manifest = Manifest(
+                cache_dir=self._cache_dir,
+                json_input=f
+            )
+
+    def _file_exists(self, file_attributes: CacheFileAttributes) -> bool:
+        """
+        Given a CacheFileAttributes describing a file, assess whether or
+        not that file exists locally and is valid (i.e. has the expected
+        file hash)
+
+        Parameters
+        ----------
+        file_attributes: CacheFileAttributes
+            Description of the file to look for
+
+        Returns
+        -------
+        bool
+            True if the file exists and is valid; False otherwise
+
+        Raises
+        ------
+        RuntimeError
+            If file_attributes.local_path exists but is not a file.
+            It would be unclear how the cache should proceed in this case.
+ """ + + if not file_attributes.local_path.exists(): + return False + if not file_attributes.local_path.is_file(): + raise RuntimeError(f"{file_attributes.local_path}\n" + "exists, but is not a file;\n" + "unsure how to proceed") + + full_path = file_attributes.local_path.resolve() + test_checksum = file_hash_from_path(full_path) + if test_checksum != file_attributes.file_hash: + return False + + return True + + def data_path(self, file_id) -> dict: + """ + Return the local path to a data file, and test for the + file's existence/validity + + Parameters + ---------- + file_id: + The unique identifier of the file to be accessed + + Returns + ------- + dict + + 'path' will be a pathlib.Path pointing to the file's location + + 'exists' will be a boolean indicating if the file + exists in a valid state + + 'file_attributes' is a CacheFileAttributes describing the file + in more detail + + Raises + ------ + RuntimeError + If the file cannot be downloaded + """ + file_attributes = self._manifest.data_file_attributes(file_id) + exists = self._file_exists(file_attributes) + local_path = file_attributes.local_path + output = {'local_path': local_path, + 'exists': exists, + 'file_attributes': file_attributes} + + return output + + def download_data(self, file_id) -> pathlib.Path: + """ + Return the local path to a data file, downloading the file + if necessary + + Parameters + ---------- + file_id: + The unique identifier of the file to be accessed + + Returns + ------- + pathlib.Path + The path indicating where the file is stored on the + local system + + Raises + ------ + RuntimeError + If the file cannot be downloaded + """ + super_attributes = self.data_path(file_id) + file_attributes = super_attributes['file_attributes'] + self._download_file(file_attributes) + return file_attributes.local_path + + def metadata_path(self, fname: str) -> dict: + """ + Return the local path to a metadata file, and test for the + file's existence/validity + + Parameters + ---------- + fname: str + The name of the metadata file to be accessed + + Returns + ------- + dict + + 'path' will be a pathlib.Path pointing to the file's location + + 'exists' will be a boolean indicating if the file + exists in a valid state + + 'file_attributes' is a CacheFileAttributes describing the file + in more detail + + Raises + ------ + RuntimeError + If the file cannot be downloaded + """ + file_attributes = self._manifest.metadata_file_attributes(fname) + exists = self._file_exists(file_attributes) + local_path = file_attributes.local_path + output = {'local_path': local_path, + 'exists': exists, + 'file_attributes': file_attributes} + + return output + + def download_metadata(self, fname: str) -> pathlib.Path: + """ + Return the local path to a metadata file, downloading the + file if necessary + + Parameters + ---------- + fname: str + The name of the metadata file to be accessed + + Returns + ------- + pathlib.Path + The path indicating where the file is stored on the + local system + + Raises + ------ + RuntimeError + If the file cannot be downloaded + """ + super_attributes = self.metadata_path(fname) + file_attributes = super_attributes['file_attributes'] + self._download_file(file_attributes) + return file_attributes.local_path + + def get_metadata(self, fname: str) -> pd.DataFrame: + """ + Return a pandas DataFrame of metadata + + Parameters + ---------- + fname: str + The name of the metadata file to load + + Returns + ------- + pd.DataFrame + + Notes + ----- + This method will check to see if the specified metadata file 
+        exists locally. If it does not, the method will download the file.
+        Use self.metadata_path() to find where the file is stored
+        """
+        local_path = self.download_metadata(fname)
+        return pd.read_csv(local_path)
+
+
+class S3CloudCache(CloudCacheBase):
+    """
+    A class to handle the downloading and accessing of data served from
+    an S3-based storage system
+
+    Parameters
+    ----------
+    cache_dir: str or pathlib.Path
+        Path to the directory where data will be stored on the local system
+
+    bucket_name: str
+        for example, if bucket URI is 's3://mybucket' this value should be
+        'mybucket'
+
+    project_name: str
+        the name of the project this cache is supposed to access. This will
+        be the root directory for all files stored in the bucket.
+    """
+
+    def __init__(self, cache_dir, bucket_name, project_name):
+        self._manifest = None
+        self._bucket_name = bucket_name
+
+        super().__init__(cache_dir=cache_dir, project_name=project_name)
+
+    _s3_client = None
+
+    @property
+    def s3_client(self):
+        if self._s3_client is None:
+            s3_config = Config(signature_version=UNSIGNED)
+            self._s3_client = boto3.client('s3',
+                                           config=s3_config)
+        return self._s3_client
+
+    def _list_all_manifests(self) -> list:
+        """
+        Return a list of all of the file names of the manifests associated
+        with this dataset
+        """
+        paginator = self.s3_client.get_paginator('list_objects_v2')
+        subset_iterator = paginator.paginate(
+            Bucket=self._bucket_name,
+            Prefix=self.manifest_prefix
+        )
+
+        output = []
+        for subset in subset_iterator:
+            if 'Contents' in subset:
+                for obj in subset['Contents']:
+                    output.append(pathlib.Path(obj['Key']).name)
+
+        output.sort()
+        return output
+
+    def _download_manifest(self,
+                           manifest_name: str):
+        """
+        Download a manifest from the dataset
+
+        Parameters
+        ----------
+        manifest_name: str
+            The name of the manifest to load. Must be an element in
+            self.manifest_file_names
+        """
+
+        manifest_key = self.manifest_prefix + manifest_name
+        response = self.s3_client.get_object(Bucket=self._bucket_name,
+                                             Key=manifest_key)
+
+        filepath = os.path.join(self._cache_dir, manifest_name)
+
+        with open(filepath, 'wb') as f:
+            for chunk in response['Body'].iter_chunks():
+                f.write(chunk)
+
+    def _download_file(self, file_attributes: CacheFileAttributes) -> bool:
+        """
+        Check if a file exists and is in the expected state.
+
+        If it is, return True.
+
+        If it is not, download the file, creating the directory
+        where the file is to be stored if necessary.
+
+        If the download is successful, return True.
+
+        If the download repeatedly fails (file hash does not match
+        expectation), raise a RuntimeError.
+
+        Parameters
+        ----------
+        file_attributes: CacheFileAttributes
+            Describes the file to download
+
+        Returns
+        -------
+        bool
+            True if the file exists locally in a valid state (either
+            already or after a successful download)
+
+        Raises
+        ------
+        RuntimeError
+            If the path to the directory where the file is to be saved
+            points to something that is not a directory.
+
+        RuntimeError
+            If it is not able to successfully download the file after
+            10 iterations
+        """
+
+        local_path = file_attributes.local_path
+
+        local_dir = pathlib.Path(safe_system_path(str(local_path.parents[0])))
+
+        # make sure Windows references to Allen Institute
+        # local networked file system get handled correctly
+        local_path = pathlib.Path(safe_system_path(str(local_path)))
+
+        # using os here rather than pathlib because safe_system_path
+        # returns a str
+        os.makedirs(local_dir, exist_ok=True)
+        if not os.path.isdir(local_dir):
+            raise RuntimeError(f"{local_dir}\n"
+                               "is not a directory")
+
+        bucket_name = bucket_name_from_url(file_attributes.url)
+        obj_key = relative_path_from_url(file_attributes.url)
+
+        n_iter = 0
+        max_iter = 10  # maximum number of times to try download
+
+        version_id = file_attributes.version_id
+
+        pbar = None
+        if not self._file_exists(file_attributes):
+            response = self.s3_client.list_object_versions(Bucket=bucket_name,
+                                                           Prefix=str(obj_key))
+            object_info = [i for i in response["Versions"]
+                           if i["VersionId"] == version_id][0]
+            pbar = tqdm.tqdm(desc=object_info["Key"].split("/")[-1],
+                             total=object_info["Size"],
+                             unit_scale=True,
+                             unit_divisor=1000.,
+                             unit="MB")
+
+        while not self._file_exists(file_attributes):
+            response = self.s3_client.get_object(Bucket=bucket_name,
+                                                 Key=str(obj_key),
+                                                 VersionId=version_id)
+
+            if 'Body' in response:
+                with open(local_path, 'wb') as out_file:
+                    for chunk in response['Body'].iter_chunks():
+                        out_file.write(chunk)
+                        pbar.update(len(chunk))
+
+            n_iter += 1
+            if n_iter > max_iter:
+                pbar.close()
+                raise RuntimeError("Could not download\n"
+                                   f"{file_attributes}\n"
+                                   f"in {max_iter} iterations")
+        if pbar is not None:
+            pbar.close()
+        return True
+
+
+class LocalCache(CloudCacheBase):
+    """A class to handle accessing of data that has already been downloaded
+    locally
+
+    Parameters
+    ----------
+    cache_dir: str or pathlib.Path
+        Path to the directory where data will be stored on the local system
+
+    project_name: str
+        the name of the project this cache is supposed to access. This will
+        be the root directory for all files in the cache.
+ """ + def __init__(self, cache_dir, project_name): + super().__init__(cache_dir=cache_dir, project_name=project_name) + + def _list_all_manifests(self) -> list: + return [x for x in os.listdir(self._cache_dir) + if re.fullmatch(".*_manifest_v.*.json", x)] + + def _download_manifest(self, manifest_name: str): + raise NotImplementedError() + + def _download_file(self, file_attributes: CacheFileAttributes) -> bool: + raise NotImplementedError() diff --git a/allensdk/api/cloud_cache/file_attributes.py b/allensdk/api/cloud_cache/file_attributes.py new file mode 100644 index 000000000..26a6941a2 --- /dev/null +++ b/allensdk/api/cloud_cache/file_attributes.py @@ -0,0 +1,69 @@ +import json +import pathlib + + +class CacheFileAttributes(object): + """ + This class will contain the attributes of a remotely stored file + so that they can easily and consistently be passed around between + the methods making up the remote file cache and manifest classes + + Parameters + ---------- + url: str + The full URL of the remote file + version_id: str + A string specifying the version of the file (probably calculated + by S3) + file_hash: str + The (hexadecimal) file hash of the file + local_path: pathlib.Path + The path to the location where the file's local copy should be stored + (probably computed by the Manifest class) + """ + + def __init__(self, + url: str, + version_id: str, + file_hash: str, + local_path: str): + + if not isinstance(url, str): + raise ValueError(f"url must be str; got {type(url)}") + if not isinstance(version_id, str): + raise ValueError(f"version_id must be str; got {type(version_id)}") + if not isinstance(file_hash, str): + raise ValueError(f"file_hash must be str; " + f"got {type(file_hash)}") + if not isinstance(local_path, pathlib.Path): + raise ValueError(f"local_path must be pathlib.Path; " + f"got {type(local_path)}") + + self._url = url + self._version_id = version_id + self._file_hash = file_hash + self._local_path = local_path + + @property + def url(self) -> str: + return self._url + + @property + def version_id(self) -> str: + return self._version_id + + @property + def file_hash(self) -> str: + return self._file_hash + + @property + def local_path(self) -> pathlib.Path: + return self._local_path + + def __str__(self): + output = {'url': self.url, + 'version_id': self.version_id, + 'file_hash': self.file_hash, + 'local_path': str(self.local_path)} + output = json.dumps(output, indent=2, sort_keys=True) + return f'CacheFileParameters{output}' diff --git a/allensdk/api/cloud_cache/manifest.py b/allensdk/api/cloud_cache/manifest.py new file mode 100644 index 000000000..658c1da5a --- /dev/null +++ b/allensdk/api/cloud_cache/manifest.py @@ -0,0 +1,203 @@ +from typing import Dict, List, Any +import json +import pathlib +import copy +from typing import Union +from allensdk.api.cloud_cache.utils import relative_path_from_url # noqa: E501 +from allensdk.api.cloud_cache.file_attributes import CacheFileAttributes # noqa: E501 + + +class Manifest(object): + """ + A class for loading and manipulating the online manifest.json associated + with a dataset release + + Each Manifest instance should represent the data for 1 and only 1 + manifest.json file. + + Parameters + ---------- + cache_dir: str or pathlib.Path + The path to the directory where local copies of files will be stored + json_input: + A ''.read()''-supporting file-like object containing + a JSON document to be deserialized (i.e. 
+    """
+
+    def __init__(self,
+                 cache_dir: Union[str, pathlib.Path],
+                 json_input):
+        if isinstance(cache_dir, str):
+            self._cache_dir = pathlib.Path(cache_dir).resolve()
+        elif isinstance(cache_dir, pathlib.Path):
+            self._cache_dir = cache_dir.resolve()
+        else:
+            raise ValueError("cache_dir must be either a str "
+                             "or a pathlib.Path; "
+                             f"got {type(cache_dir)}")
+
+        self._data: Dict[str, Any] = json.load(json_input)
+        if not isinstance(self._data, dict):
+            raise ValueError("Expected to deserialize manifest into a dict; "
+                             f"instead got {type(self._data)}")
+        self._project_name: str = self._data["project_name"]
+        self._version: str = self._data['manifest_version']
+        self._file_id_column: str = self._data['metadata_file_id_column_name']
+        self._data_pipeline: str = self._data["data_pipeline"]
+
+        self._metadata_file_names: List[str] = [
+            file_name for file_name in self._data['metadata_files']
+        ]
+        self._metadata_file_names.sort()
+
+    @property
+    def project_name(self):
+        """
+        The name of the project whose data and metadata files this
+        manifest tracks.
+        """
+        return self._project_name
+
+    @property
+    def version(self):
+        """
+        The version of the dataset currently loaded
+        """
+        return self._version
+
+    @property
+    def file_id_column(self):
+        """
+        The column in the metadata files used to uniquely
+        identify data files
+        """
+        return self._file_id_column
+
+    @property
+    def metadata_file_names(self):
+        """
+        List of metadata file names associated with this dataset
+        """
+        return self._metadata_file_names
+
+    def _create_file_attributes(self,
+                                remote_path: str,
+                                version_id: str,
+                                file_hash: str) -> CacheFileAttributes:
+        """
+        Create the cache_file_attributes describing a file.
+        This method does the work of assigning a local_path to a remote file.
+
+        Parameters
+        ----------
+        remote_path: str
+            The full URL to a file
+        version_id: str
+            The string specifying the version of the file
+        file_hash: str
+            The (hexadecimal) file hash of the file
+
+        Returns
+        -------
+        CacheFileAttributes
+        """
+
+        # Paths should be built like:
+        # {cache_dir} / {project_name}-{manifest_version} / relative_path
+        # Ex: my_cache_dir/visual-behavior-ophys-1.0.0/behavior_sessions/etc...
+
+        project_dir_name = f"{self._project_name}-{self._version}"
+        project_dir = self._cache_dir / project_dir_name
+
+        # The convention of the data release tool is to have all
+        # relative_paths from remote start with the project name,
+        # which we want to remove since we already specified a
+        # project directory
+        relative_path = relative_path_from_url(remote_path)
+        shaved_rel_path = "/".join(relative_path.split("/")[1:])
+
+        local_path = project_dir / shaved_rel_path
+
+        obj = CacheFileAttributes(remote_path,
+                                  version_id,
+                                  file_hash,
+                                  local_path)
+
+        return obj
+
+    def metadata_file_attributes(self,
+                                 metadata_file_name: str) -> CacheFileAttributes:  # noqa: E501
+        """
+        Return the CacheFileAttributes associated with a metadata file
+
+        Parameters
+        ----------
+        metadata_file_name: str
+            Name of the metadata file. Must be in self.metadata_file_names
+
+        Returns
+        -------
+        CacheFileAttributes
+
+        Raises
+        ------
+        RuntimeError
+            If you try to run this method when self._data is None (meaning
+            you haven't yet loaded a manifest.json)
+
+        ValueError
+            If the metadata_file_name is not a valid option
+        """
+        if self._data is None:
+            raise RuntimeError("You cannot retrieve "
+                               "metadata_file_attributes;\n"
+                               "you have not yet loaded a manifest.json file")
+
+        if metadata_file_name not in self._metadata_file_names:
+            raise ValueError(f"{metadata_file_name}\n"
+                             "is not in self.metadata_file_names:\n"
+                             f"{self._metadata_file_names}")
+
+        file_data = self._data['metadata_files'][metadata_file_name]
+        return self._create_file_attributes(file_data['url'],
+                                            file_data['version_id'],
+                                            file_data['file_hash'])
+
+    def data_file_attributes(self, file_id) -> CacheFileAttributes:
+        """
+        Return the CacheFileAttributes associated with a data file
+
+        Parameters
+        ----------
+        file_id:
+            The identifier of the data file whose attributes are to be
+            returned. Must be a key in self._data['data_files']
+
+        Returns
+        -------
+        CacheFileAttributes
+
+        Raises
+        ------
+        RuntimeError
+            If you try to run this method when self._data is None (meaning
+            you haven't yet loaded a manifest.json file)
+
+        ValueError
+            If the file_id is not a valid option
+        """
+        if self._data is None:
+            raise RuntimeError("You cannot retrieve data_file_attributes;\n"
+                               "you have not yet loaded a manifest.json file")
+
+        if file_id not in self._data['data_files']:
+            valid_keys = list(self._data['data_files'].keys())
+            valid_keys.sort()
+            raise ValueError(f"file_id: {file_id}\n"
+                             "is not a data file listed in manifest:\n"
+                             f"{valid_keys}")
+
+        file_data = self._data['data_files'][file_id]
+        return self._create_file_attributes(file_data['url'],
+                                            file_data['version_id'],
+                                            file_data['file_hash'])
diff --git a/allensdk/api/cloud_cache/utils.py b/allensdk/api/cloud_cache/utils.py
new file mode 100644
index 000000000..d7b91c00c
--- /dev/null
+++ b/allensdk/api/cloud_cache/utils.py
@@ -0,0 +1,88 @@
+from typing import Optional
+import warnings
+import re
+import urllib.parse as url_parse
+import hashlib
+
+
+def bucket_name_from_url(url: str) -> Optional[str]:
+    """
+    Read in a URL and return the name of the AWS S3 bucket it points towards.
+
+    Parameters
+    ----------
+    url: str
+        A generic URL, suitable for retrieving an S3 object via an
+        HTTP GET request.
+
+    Returns
+    -------
+    str
+        An AWS S3 bucket name. Note: if 's3.amazonaws.com' does not occur in
+        the URL, this method will return None and emit a warning.
+
+    Notes
+    -----
+    URLs passed to this method should conform to the "new" scheme as described
+    here
+    https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-plan-the-rest-of-the-story/
+    """
+    s3_pattern = re.compile(r'\.s3[.a-z0-9\-]*\.amazonaws\.com')
+    url_params = url_parse.urlparse(url)
+    raw_location = url_params.netloc
+    s3_match = s3_pattern.search(raw_location)
+
+    if s3_match is None:
+        warnings.warn(f"{s3_pattern} does not occur in url {url}")
+        return None
+
+    s3_match = raw_location[s3_match.start():s3_match.end()]
+    return url_params.netloc.replace(s3_match, '')
+
+
+def relative_path_from_url(url: str) -> str:
+    """
+    Read in a url and return the relative path of the object
+
+    Parameters
+    ----------
+    url: str
+        The url of the object whose path you want
+
+    Returns
+    -------
+    str:
+        Relative path of the object
+
+    Notes
+    -----
+    This method returns a str rather than a pathlib.Path because
+    it is used to get the S3 object Key from a URL. If using
+    pathlib.Path on a Windows system, the '/' will get transformed
+    into '\', confusing S3.
+    """
+    url_params = url_parse.urlparse(url)
+    return url_params.path[1:]
+
+
+def file_hash_from_path(file_path: str) -> str:
+    """
+    Return the hexadecimal file hash for a file
+
+    Parameters
+    ----------
+    file_path: str
+        path to a file
+
+    Returns
+    -------
+    str:
+        The file hash (Blake2b; hexadecimal) of the file
+    """
+    hasher = hashlib.blake2b()
+    with open(file_path, 'rb') as in_file:
+        chunk = in_file.read(1000000)
+        while len(chunk) > 0:
+            hasher.update(chunk)
+            chunk = in_file.read(1000000)
+    return hasher.hexdigest()
diff --git a/allensdk/api/queries/annotated_section_data_sets_api.py b/allensdk/api/queries/annotated_section_data_sets_api.py
index 359dcb05d..72cefbf87 100644
--- a/allensdk/api/queries/annotated_section_data_sets_api.py
+++ b/allensdk/api/queries/annotated_section_data_sets_api.py
@@ -34,7 +34,7 @@
 # POSSIBILITY OF SUCH DAMAGE.
 #
 from .rma_api import RmaApi
-from ..cache import cacheable
+from allensdk.api.warehouse_cache.cache import cacheable
 
 
 class AnnotatedSectionDataSetsApi(RmaApi):
diff --git a/allensdk/api/queries/biophysical_api.py b/allensdk/api/queries/biophysical_api.py
index 45376ceff..70fd7a4fb 100644
--- a/allensdk/api/queries/biophysical_api.py
+++ b/allensdk/api/queries/biophysical_api.py
@@ -34,7 +34,7 @@
 # POSSIBILITY OF SUCH DAMAGE.
 #
 from allensdk.api.queries.rma_template import RmaTemplate
-from allensdk.api.cache import cacheable
+from allensdk.api.warehouse_cache.cache import cacheable
 import os
 import simplejson as json
 from collections import OrderedDict
diff --git a/allensdk/api/queries/brain_observatory_api.py b/allensdk/api/queries/brain_observatory_api.py
index 351fe73cc..e6ce48bc3 100644
--- a/allensdk/api/queries/brain_observatory_api.py
+++ b/allensdk/api/queries/brain_observatory_api.py
@@ -43,7 +43,7 @@
 import allensdk.brain_observatory.stimulus_info as stimulus_info
 
 from .rma_template import RmaTemplate
-from ..cache import cacheable, Cache
+from allensdk.api.warehouse_cache.cache import cacheable, Cache
 from .rma_pager import pageable
 
 from dateutil.parser import parse as parse_date
diff --git a/allensdk/api/queries/cell_types_api.py b/allensdk/api/queries/cell_types_api.py
index 41b89904b..de9562757 100644
--- a/allensdk/api/queries/cell_types_api.py
+++ b/allensdk/api/queries/cell_types_api.py
@@ -34,9 +34,9 @@
 # POSSIBILITY OF SUCH DAMAGE.
# from .rma_api import RmaApi -from ..cache import cacheable +from allensdk.api.warehouse_cache.cache import cacheable from allensdk.config.manifest import Manifest -from allensdk.api.cache import Cache +from allensdk.api.warehouse_cache.cache import Cache from allensdk.deprecated import deprecated diff --git a/allensdk/api/queries/grid_data_api.py b/allensdk/api/queries/grid_data_api.py index 96604173d..9702326a7 100644 --- a/allensdk/api/queries/grid_data_api.py +++ b/allensdk/api/queries/grid_data_api.py @@ -34,7 +34,7 @@ # POSSIBILITY OF SUCH DAMAGE. # -from allensdk.api.cache import cacheable +from allensdk.api.warehouse_cache.cache import cacheable from allensdk.deprecated import deprecated from .rma_api import RmaApi @@ -249,4 +249,4 @@ def download_alignment3d(self, section_data_set_id, num_rows='all', count=False, elif len(results) > 1: raise ValueError('found multiple SectionDataSets with attached alignment3ds for id {}: {}'.format(section_data_set_id, results)) - return results[0]['alignment3d'] \ No newline at end of file + return results[0]['alignment3d'] diff --git a/allensdk/api/queries/image_download_api.py b/allensdk/api/queries/image_download_api.py index 65312fa6a..d4685e3b6 100644 --- a/allensdk/api/queries/image_download_api.py +++ b/allensdk/api/queries/image_download_api.py @@ -34,7 +34,7 @@ # POSSIBILITY OF SUCH DAMAGE. # from .rma_template import RmaTemplate -from ..cache import cacheable +from allensdk.api.warehouse_cache.cache import cacheable from six import string_types diff --git a/allensdk/api/queries/mouse_atlas_api.py b/allensdk/api/queries/mouse_atlas_api.py index c0f1b15f4..b02dada34 100644 --- a/allensdk/api/queries/mouse_atlas_api.py +++ b/allensdk/api/queries/mouse_atlas_api.py @@ -35,7 +35,7 @@ # from allensdk.core import sitk_utilities -from allensdk.api.cache import Cache, cacheable +from allensdk.api.warehouse_cache.cache import Cache, cacheable from .reference_space_api import ReferenceSpaceApi from .grid_data_api import GridDataApi diff --git a/allensdk/api/queries/mouse_connectivity_api.py b/allensdk/api/queries/mouse_connectivity_api.py index 25644c12e..5fb6a8c0e 100644 --- a/allensdk/api/queries/mouse_connectivity_api.py +++ b/allensdk/api/queries/mouse_connectivity_api.py @@ -35,7 +35,7 @@ # from .reference_space_api import ReferenceSpaceApi from .grid_data_api import GridDataApi -from ..cache import cacheable, Cache +from allensdk.api.warehouse_cache.cache import cacheable, Cache import numpy as np import nrrd import six diff --git a/allensdk/api/queries/ontologies_api.py b/allensdk/api/queries/ontologies_api.py index 13aa0c928..7a5cb2723 100644 --- a/allensdk/api/queries/ontologies_api.py +++ b/allensdk/api/queries/ontologies_api.py @@ -34,7 +34,7 @@ # POSSIBILITY OF SUCH DAMAGE. # from .rma_template import RmaTemplate -from ..cache import cacheable +from allensdk.api.warehouse_cache.cache import cacheable from allensdk.core.structure_tree import StructureTree diff --git a/allensdk/api/queries/reference_space_api.py b/allensdk/api/queries/reference_space_api.py index ea01404c8..1a5d5811c 100644 --- a/allensdk/api/queries/reference_space_api.py +++ b/allensdk/api/queries/reference_space_api.py @@ -34,7 +34,7 @@ # POSSIBILITY OF SUCH DAMAGE. 
# from .rma_api import RmaApi -from allensdk.api.cache import cacheable, Cache +from allensdk.api.warehouse_cache.cache import cacheable, Cache from allensdk.core.obj_utilities import read_obj import allensdk.core.sitk_utilities as sitk_utilities import numpy as np diff --git a/allensdk/api/warehouse_cache/__init__.py b/allensdk/api/warehouse_cache/__init__.py new file mode 100644 index 000000000..1bb8bf6d7 --- /dev/null +++ b/allensdk/api/warehouse_cache/__init__.py @@ -0,0 +1 @@ +# empty diff --git a/allensdk/api/cache.py b/allensdk/api/warehouse_cache/cache.py similarity index 100% rename from allensdk/api/cache.py rename to allensdk/api/warehouse_cache/cache.py diff --git a/allensdk/api/caching_utilities.py b/allensdk/api/warehouse_cache/caching_utilities.py similarity index 100% rename from allensdk/api/caching_utilities.py rename to allensdk/api/warehouse_cache/caching_utilities.py diff --git a/allensdk/brain_observatory/behavior/behavior_ophys_analysis.py b/allensdk/brain_observatory/behavior/behavior_ophys_analysis.py index 1e27700f2..016a665a6 100644 --- a/allensdk/brain_observatory/behavior/behavior_ophys_analysis.py +++ b/allensdk/brain_observatory/behavior/behavior_ophys_analysis.py @@ -3,7 +3,8 @@ import seaborn as sns from allensdk.core.lazy_property import LazyProperty, LazyPropertyMixin -from allensdk.brain_observatory.behavior.behavior_ophys_session import BehaviorOphysSession +from allensdk.brain_observatory.behavior.behavior_ophys_experiment import \ + BehaviorOphysExperiment def plot_trace(timestamps, trace, ax=None, xlabel='time (seconds)', ylabel='fluorescence', title='roi'): if ax is None: @@ -108,7 +109,7 @@ def plot_example_traces_and_behavior(self, N=10): if __name__ == "__main__": - session = BehaviorOphysSession(789359614) + session = BehaviorOphysExperiment(789359614) analysis = BehaviorOphysAnalysis(session) analysis.plot_example_traces_and_behavior() - \ No newline at end of file + diff --git a/allensdk/brain_observatory/behavior/behavior_ophys_experiment.py b/allensdk/brain_observatory/behavior/behavior_ophys_experiment.py new file mode 100644 index 000000000..c5286c8f0 --- /dev/null +++ b/allensdk/brain_observatory/behavior/behavior_ophys_experiment.py @@ -0,0 +1,355 @@ +import numpy as np +import pandas as pd +from typing import Any + +from allensdk.brain_observatory.behavior.behavior_session import ( + BehaviorSession) +from allensdk.brain_observatory.session_api_utils import ParamsMixin +from allensdk.brain_observatory.behavior.session_apis.data_io import ( + BehaviorOphysNwbApi, BehaviorOphysLimsApi) +from allensdk.deprecated import legacy +from allensdk.brain_observatory.behavior.image_api import Image, ImageApi + + +class BehaviorOphysExperiment(BehaviorSession, ParamsMixin): + """Represents data from a single Visual Behavior Ophys imaging session. + Can be initialized with an api that fetches data, or by using class methods + `from_lims` and `from_nwb_path`. + """ + + def __init__(self, api=None, + eye_tracking_z_threshold: float = 3.0, + eye_tracking_dilation_frames: int = 2, + events_filter_scale: float = 2.0, + events_filter_n_time_steps: int = 20): + """ + Parameters + ---------- + api : object, optional + The backend api used by the session object to get behavior ophys + data, by default None. + eye_tracking_z_threshold : float, optional + The z-threshold when determining which frames likely contain + outliers for eye or pupil areas. Influences which frames + are considered 'likely blinks'. 
By default 3.0 + eye_tracking_dilation_frames : int, optional + Determines the number of adjacent frames that will be marked + as 'likely_blink' when performing blink detection for + `eye_tracking` data, by default 2 + events_filter_scale : float, optional + Stdev of halfnorm distribution used to convolve ophys events with + a 1d causal half-gaussian filter to smooth it for visualization, + by default 2.0 + events_filter_n_time_steps : int, optional + Number of time steps to use for convolution of ophys events + """ + + BehaviorSession.__init__(self, api=api) + ParamsMixin.__init__(self, ignore={'api'}) + + # eye_tracking processing params + self._eye_tracking_z_threshold = eye_tracking_z_threshold + self._eye_tracking_dilation_frames = eye_tracking_dilation_frames + + # events processing params + self._events_filter_scale = events_filter_scale + self._events_filter_n_time_steps = events_filter_n_time_steps + + # LazyProperty constructor provided by LazyPropertyMixin + LazyProperty = self.LazyProperty + + # Initialize attributes to be lazily evaluated + self._ophys_session_id = LazyProperty( + self.api.get_ophys_session_id) + self._ophys_experiment_id = LazyProperty( + self.api.get_ophys_experiment_id) + self._max_projection = LazyProperty(self.api.get_max_projection, + wrappers=[ImageApi.deserialize]) + self._average_projection = LazyProperty( + self.api.get_average_projection, wrappers=[ImageApi.deserialize]) + self._ophys_timestamps = LazyProperty(self.api.get_ophys_timestamps, + settable=True) + self._dff_traces = LazyProperty(self.api.get_dff_traces, settable=True) + self._events = LazyProperty(self.api.get_events, settable=True) + self._cell_specimen_table = LazyProperty( + self.api.get_cell_specimen_table, settable=True) + self._corrected_fluorescence_traces = LazyProperty( + self.api.get_corrected_fluorescence_traces, settable=True) + self._motion_correction = LazyProperty(self.api.get_motion_correction, + settable=True) + self._segmentation_mask_image = LazyProperty( + self.get_segmentation_mask_image) + self._eye_tracking = LazyProperty( + self.api.get_eye_tracking, settable=True, + z_threshold=self._eye_tracking_z_threshold, + dilation_frames=self._eye_tracking_dilation_frames) + self._eye_tracking_rig_geometry = LazyProperty( + self.api.get_eye_tracking_rig_geometry) + + # ==================== class and utility methods ====================== + + @classmethod + def from_lims(cls, ophys_experiment_id: int, + eye_tracking_z_threshold: float = 3.0, + eye_tracking_dilation_frames: int = 2 + ) -> "BehaviorOphysExperiment": + return cls(api=BehaviorOphysLimsApi(ophys_experiment_id), + eye_tracking_z_threshold=eye_tracking_z_threshold, + eye_tracking_dilation_frames=eye_tracking_dilation_frames) + + @classmethod + def from_nwb_path( + cls, nwb_path: str, **api_kwargs: Any) -> "BehaviorOphysExperiment": + api_kwargs["filter_invalid_rois"] = api_kwargs.get( + "filter_invalid_rois", True) + return cls(api=BehaviorOphysNwbApi.from_path( + path=nwb_path, **api_kwargs)) + + # ========================= 'get' methods ========================== + + def get_segmentation_mask_image(self): + """ Returns an image with value 1 if the pixel was included + in an ROI, and 0 otherwise + + Returns + ---------- + allensdk.brain_observatory.behavior.image_api.Image: + array-like interface to segmentation_mask image data and metadata + """ + mask_data = np.sum(self.roi_masks['roi_mask'].values).astype(int) + + max_projection_image = self.max_projection + + mask_image = Image( + data=mask_data, + 
spacing=max_projection_image.spacing, + unit=max_projection_image.unit + ) + return mask_image + + @legacy('Consider using "dff_traces" instead.') + def get_dff_traces(self, cell_specimen_ids=None): + + if cell_specimen_ids is None: + cell_specimen_ids = self.get_cell_specimen_ids() + + csid_table = \ + self.cell_specimen_table.reset_index()[['cell_specimen_id']] + csid_subtable = csid_table[csid_table['cell_specimen_id'].isin( + cell_specimen_ids)].set_index('cell_specimen_id') + dff_table = csid_subtable.join(self.dff_traces, how='left') + dff_traces = np.vstack(dff_table['dff'].values) + timestamps = self.ophys_timestamps + + assert (len(cell_specimen_ids), len(timestamps)) == dff_traces.shape + return timestamps, dff_traces + + @legacy() + def get_cell_specimen_indices(self, cell_specimen_ids): + return [self.cell_specimen_table.index.get_loc(csid) + for csid in cell_specimen_ids] + + @legacy("Consider using cell_specimen_table['cell_specimen_id'] instead.") + def get_cell_specimen_ids(self): + cell_specimen_ids = self.cell_specimen_table.index.values + + if np.isnan(cell_specimen_ids.astype(float)).sum() == \ + len(self.cell_specimen_table): + raise ValueError("cell_specimen_id values not assigned " + f"for {self.ophys_experiment_id}") + return cell_specimen_ids + + # ====================== properties and setters ======================== + + @property + def ophys_experiment_id(self) -> int: + """Unique identifier for this experimental session. + :rtype: int + """ + return self._ophys_experiment_id + + @property + def ophys_session_id(self) -> int: + """Unique identifier for this ophys session. + :rtype: int + """ + return self._ophys_session_id + + @property + def max_projection(self) -> Image: + """2D max projection image. + :rtype: allensdk.brain_observatory.behavior.image_api.Image + """ + return self._max_projection + + @property + def average_projection(self) -> pd.DataFrame: + """2D image of the microscope field of view, averaged across the + experiment + :rtype: pandas.DataFrame + """ + return self._average_projection + + @property + def ophys_timestamps(self) -> np.ndarray: + """Timestamps associated with frames captured by the microscope + :rtype: numpy.ndarray + """ + return self._ophys_timestamps + + @ophys_timestamps.setter + def ophys_timestamps(self, value): + self._ophys_timestamps = value + + @property + def dff_traces(self) -> pd.DataFrame: + """Traces of dff organized into a dataframe; index is the cell roi ids. + :rtype: pandas.DataFrame + """ + return self._dff_traces + + @dff_traces.setter + def dff_traces(self, value): + self._dff_traces = value + + @property + def events(self) -> pd.DataFrame: + """Get event detection data + + Returns + ------- + pd.DataFrame + index: + cell_specimen_id: int + cell_roi_id: int + events: np.array + filtered_events: np.array + Events, convolved with filter to smooth it for visualization + lambdas: float64 + noise_stds: float64 + """ + params = {'events_filter_scale', 'events_filter_n_time_steps'} + + if self.needs_data_refresh(params): + self._events = self.LazyProperty( + self.api.get_events, + filter_scale=self._events_filter_scale, + filter_n_time_steps=self._events_filter_n_time_steps) + self.clear_updated_params(params) + + return self._events + + @events.setter + def events(self, value): + self._events = value + + @property + def cell_specimen_table(self) -> pd.DataFrame: + """Cell roi information organized into a dataframe; index is the cell + roi ids. 
+ :rtype: pandas.DataFrame + """ + return self._cell_specimen_table + + @cell_specimen_table.setter + def cell_specimen_table(self, value): + self._cell_specimen_table = value + + @property + def corrected_fluorescence_traces(self) -> pd.DataFrame: + """The motion-corrected fluorescence traces organized into a dataframe; + index is the cell roi ids. + :rtype: pandas.DataFrame + """ + return self._corrected_fluorescence_traces + + @corrected_fluorescence_traces.setter + def corrected_fluorescence_traces(self, value): + self._corrected_fluorescence_traces = value + + @property + def motion_correction(self) -> pd.DataFrame: + """A dataframe containing trace data used during motion correction + computation + :rtype: pandas.DataFrame + """ + return self._motion_correction + + @motion_correction.setter + def motion_correction(self, value): + self._motion_correction = value + + @property + def segmentation_mask_image(self) -> Image: + """An image with pixel value 1 if that pixel was included in an ROI, + and 0 otherwise + :rtype: allensdk.brain_observatory.behavior.image_api.Image + """ + if self._segmentation_mask_image is None: + self._segmentation_mask_image = self.get_segmentation_mask_image() + return self._segmentation_mask_image + + @segmentation_mask_image.setter + def segmentation_mask_image(self, value): + self._segmentation_mask_image = value + + @property + def eye_tracking(self) -> pd.DataFrame: + """A dataframe containing ellipse fit parameters for the eye, pupil + and corneal reflection (cr). Fits are derived from tracking points + from a DeepLabCut model applied to video frames of a subject's + right eye. Raw tracking points and raw video frames are not exposed + by the SDK. + + Notes: + - All columns starting with 'pupil_' represent ellipse fit parameters + relating to the pupil. + - All columns starting with 'eye_' represent ellipse fit parameters + relating to the eyelid. + - All columns starting with 'cr_' represent ellipse fit parameters + relating to the corneal reflection, which is caused by an infrared + LED positioned near the eye tracking camera. + - All positions are in units of pixels. + - All areas are in units of pixels^2 + - All values are in the coordinate space of the eye tracking camera, + NOT the coordinate space of the stimulus display (i.e. this is not + gaze location), with (0, 0) being the upper-left corner of the + eye-tracking image. + - The 'likely_blink' column is True for any row (frame) where the pupil + fit failed OR eye fit failed OR an outlier fit was identified on the + pupil or eye fit. + - The pupil_area, cr_area, eye_area columns are set to NaN wherever + 'likely_blink' == True. + - The pupil_area_raw, cr_area_raw, eye_area_raw columns contains all + pupil fit values (including where 'likely_blink' == True). + - All ellipse fits are derived from tracking points that were output by + a DeepLabCut model that was trained on hand-annotated data from a + subset of imaging sessions on optical physiology rigs. + - Raw DeepLabCut tracking points are not publicly available. 
+ + :rtype: pandas.DataFrame + """ + params = {'eye_tracking_dilation_frames', 'eye_tracking_z_threshold'} + + if self.needs_data_refresh(params): + self._eye_tracking = self.LazyProperty( + self.api.get_eye_tracking, + z_threshold=self._eye_tracking_z_threshold, + dilation_frames=self._eye_tracking_dilation_frames) + self.clear_updated_params(params) + + return self._eye_tracking + + @eye_tracking.setter + def eye_tracking(self, value): + self._eye_tracking = value + + @property + def eye_tracking_rig_geometry(self) -> dict: + """Get the eye tracking rig geometry + associated with an ophys experiment""" + return self.api.get_eye_tracking_rig_geometry() + + @property + def roi_masks(self) -> pd.DataFrame: + return self.cell_specimen_table[['cell_roi_id', 'roi_mask']] diff --git a/allensdk/brain_observatory/behavior/behavior_ophys_session.py b/allensdk/brain_observatory/behavior/behavior_ophys_session.py index 0d98d3c71..30186cfb4 100644 --- a/allensdk/brain_observatory/behavior/behavior_ophys_session.py +++ b/allensdk/brain_observatory/behavior/behavior_ophys_session.py @@ -1,362 +1,18 @@ -import numpy as np -import pandas as pd -from typing import Any - -from allensdk.brain_observatory.behavior.behavior_session import ( - BehaviorSession) -from allensdk.brain_observatory.session_api_utils import ParamsMixin -from allensdk.brain_observatory.behavior.session_apis.data_io import ( - BehaviorOphysNwbApi, BehaviorOphysLimsApi) -from allensdk.deprecated import legacy -from allensdk.brain_observatory.behavior.image_api import Image, ImageApi - - -class BehaviorOphysSession(BehaviorSession, ParamsMixin): - """Represents data from a single Visual Behavior Ophys imaging session. - Can be initialized with an api that fetches data, or by using class methods - `from_lims` and `from_nwb_path`. - """ - - def __init__(self, api=None, - eye_tracking_z_threshold: float = 3.0, - eye_tracking_dilation_frames: int = 2, - events_filter_scale: float = 2.0, - events_filter_n_time_steps: int = 20): - """ - Parameters - ---------- - api : object, optional - The backend api used by the session object to get behavior ophys - data, by default None. - eye_tracking_z_threshold : float, optional - The z-threshold when determining which frames likely contain - outliers for eye or pupil areas. Influences which frames - are considered 'likely blinks'. 
By default 3.0 - eye_tracking_dilation_frames : int, optional - Determines the number of adjacent frames that will be marked - as 'likely_blink' when performing blink detection for - `eye_tracking` data, by default 2 - events_filter_scale : float, optional - Stdev of halfnorm distribution used to convolve ophys events with - a 1d causal half-gaussian filter to smooth it for visualization, - by default 2.0 - events_filter_n_time_steps : int, optional - Number of time steps to use for convolution of ophys events - """ - - BehaviorSession.__init__(self, api=api) - ParamsMixin.__init__(self, ignore={'api'}) - - # eye_tracking processing params - self._eye_tracking_z_threshold = eye_tracking_z_threshold - self._eye_tracking_dilation_frames = eye_tracking_dilation_frames - - # events processing params - self._events_filter_scale = events_filter_scale - self._events_filter_n_time_steps = events_filter_n_time_steps - - # LazyProperty constructor provided by LazyPropertyMixin - LazyProperty = self.LazyProperty - - # Initialize attributes to be lazily evaluated - self._ophys_session_id = LazyProperty( - self.api.get_ophys_session_id) - self._ophys_experiment_id = LazyProperty( - self.api.get_ophys_experiment_id) - self._max_projection = LazyProperty(self.api.get_max_projection, - wrappers=[ImageApi.deserialize]) - self._average_projection = LazyProperty( - self.api.get_average_projection, wrappers=[ImageApi.deserialize]) - self._ophys_timestamps = LazyProperty(self.api.get_ophys_timestamps, - settable=True) - self._dff_traces = LazyProperty(self.api.get_dff_traces, settable=True) - self._events = LazyProperty(self.api.get_events, settable=True) - self._cell_specimen_table = LazyProperty( - self.api.get_cell_specimen_table, settable=True) - self._corrected_fluorescence_traces = LazyProperty( - self.api.get_corrected_fluorescence_traces, settable=True) - self._motion_correction = LazyProperty(self.api.get_motion_correction, - settable=True) - self._segmentation_mask_image = LazyProperty( - self.get_segmentation_mask_image) - self._eye_tracking = LazyProperty( - self.api.get_eye_tracking, settable=True, - z_threshold=self._eye_tracking_z_threshold, - dilation_frames=self._eye_tracking_dilation_frames) - self._eye_tracking_rig_geometry = LazyProperty( - self.api.get_eye_tracking_rig_geometry) - - # ==================== class and utility methods ====================== - - @classmethod - def from_lims(cls, ophys_experiment_id: int, - eye_tracking_z_threshold: float = 3.0, - eye_tracking_dilation_frames: int = 2 - ) -> "BehaviorOphysSession": - return cls(api=BehaviorOphysLimsApi(ophys_experiment_id), - eye_tracking_z_threshold=eye_tracking_z_threshold, - eye_tracking_dilation_frames=eye_tracking_dilation_frames) - - @classmethod - def from_nwb_path( - cls, nwb_path: str, **api_kwargs: Any) -> "BehaviorOphysSession": - api_kwargs["filter_invalid_rois"] = api_kwargs.get( - "filter_invalid_rois", True) - return cls(api=BehaviorOphysNwbApi.from_path( - path=nwb_path, **api_kwargs)) - - # ========================= 'get' methods ========================== - - def get_segmentation_mask_image(self): - """ Returns an image with value 1 if the pixel was included - in an ROI, and 0 otherwise - - Returns - ---------- - allensdk.brain_observatory.behavior.image_api.Image: - array-like interface to segmentation_mask image data and metadata - """ - mask_data = np.sum(self.roi_masks['roi_mask'].values).astype(int) - - max_projection_image = self.max_projection - - mask_image = Image( - data=mask_data, - 
spacing=max_projection_image.spacing, - unit=max_projection_image.unit - ) - return mask_image - - @legacy('Consider using "dff_traces" instead.') - def get_dff_traces(self, cell_specimen_ids=None): - - if cell_specimen_ids is None: - cell_specimen_ids = self.get_cell_specimen_ids() - - csid_table = \ - self.cell_specimen_table.reset_index()[['cell_specimen_id']] - csid_subtable = csid_table[csid_table['cell_specimen_id'].isin( - cell_specimen_ids)].set_index('cell_specimen_id') - dff_table = csid_subtable.join(self.dff_traces, how='left') - dff_traces = np.vstack(dff_table['dff'].values) - timestamps = self.ophys_timestamps - - assert (len(cell_specimen_ids), len(timestamps)) == dff_traces.shape - return timestamps, dff_traces - - @legacy() - def get_cell_specimen_indices(self, cell_specimen_ids): - return [self.cell_specimen_table.index.get_loc(csid) - for csid in cell_specimen_ids] - - @legacy("Consider using cell_specimen_table['cell_specimen_id'] instead.") - def get_cell_specimen_ids(self): - cell_specimen_ids = self.cell_specimen_table.index.values - - if np.isnan(cell_specimen_ids.astype(float)).sum() == \ - len(self.cell_specimen_table): - raise ValueError("cell_specimen_id values not assigned " - f"for {self.ophys_experiment_id}") - return cell_specimen_ids - - # ====================== properties and setters ======================== - - @property - def ophys_experiment_id(self) -> int: - """Unique identifier for this experimental session. - :rtype: int - """ - return self._ophys_experiment_id - - @property - def ophys_session_id(self) -> int: - """Unique identifier for this ophys session. - :rtype: int - """ - return self._ophys_session_id - - @property - def max_projection(self) -> Image: - """2D max projection image. - :rtype: allensdk.brain_observatory.behavior.image_api.Image - """ - return self._max_projection - - @property - def average_projection(self) -> pd.DataFrame: - """2D image of the microscope field of view, averaged across the - experiment - :rtype: pandas.DataFrame - """ - return self._average_projection - - @property - def ophys_timestamps(self) -> np.ndarray: - """Timestamps associated with frames captured by the microscope - :rtype: numpy.ndarray - """ - return self._ophys_timestamps - - @ophys_timestamps.setter - def ophys_timestamps(self, value): - self._ophys_timestamps = value - - @property - def dff_traces(self) -> pd.DataFrame: - """Traces of dff organized into a dataframe; index is the cell roi ids. - :rtype: pandas.DataFrame - """ - return self._dff_traces - - @dff_traces.setter - def dff_traces(self, value): - self._dff_traces = value - - @property - def events(self) -> pd.DataFrame: - """Get event detection data - - Returns - ------- - pd.DataFrame - index: - cell_specimen_id: int - cell_roi_id: int - events: np.array - filtered_events: np.array - Events, convolved with filter to smooth it for visualization - lambdas: float64 - noise_stds: float64 - """ - params = {'events_filter_scale', 'events_filter_n_time_steps'} - - if self.needs_data_refresh(params): - self._events = self.LazyProperty( - self.api.get_events, - filter_scale=self._events_filter_scale, - filter_n_time_steps=self._events_filter_n_time_steps) - self.clear_updated_params(params) - - return self._events - - @events.setter - def events(self, value): - self._events = value - - @property - def cell_specimen_table(self) -> pd.DataFrame: - """Cell roi information organized into a dataframe; index is the cell - roi ids. 
- :rtype: pandas.DataFrame - """ - return self._cell_specimen_table - - @cell_specimen_table.setter - def cell_specimen_table(self, value): - self._cell_specimen_table = value - - @property - def corrected_fluorescence_traces(self) -> pd.DataFrame: - """The motion-corrected fluorescence traces organized into a dataframe; - index is the cell roi ids. - :rtype: pandas.DataFrame - """ - return self._corrected_fluorescence_traces - - @corrected_fluorescence_traces.setter - def corrected_fluorescence_traces(self, value): - self._corrected_fluorescence_traces = value - - @property - def motion_correction(self) -> pd.DataFrame: - """A dataframe containing trace data used during motion correction - computation - :rtype: pandas.DataFrame - """ - return self._motion_correction - - @motion_correction.setter - def motion_correction(self, value): - self._motion_correction = value - - @property - def segmentation_mask_image(self) -> Image: - """An image with pixel value 1 if that pixel was included in an ROI, - and 0 otherwise - :rtype: allensdk.brain_observatory.behavior.image_api.Image - """ - if self._segmentation_mask_image is None: - self._segmentation_mask_image = self.get_segmentation_mask_image() - return self._segmentation_mask_image - - @segmentation_mask_image.setter - def segmentation_mask_image(self, value): - self._segmentation_mask_image = value - - @property - def eye_tracking(self) -> pd.DataFrame: - """A dataframe containing ellipse fit parameters for the eye, pupil - and corneal reflection (cr). Fits are derived from tracking points - from a DeepLabCut model applied to video frames of a subject's - right eye. Raw tracking points and raw video frames are not exposed - by the SDK. - - Notes: - - All columns starting with 'pupil_' represent ellipse fit parameters - relating to the pupil. - - All columns starting with 'eye_' represent ellipse fit parameters - relating to the eyelid. - - All columns starting with 'cr_' represent ellipse fit parameters - relating to the corneal reflection, which is caused by an infrared - LED positioned near the eye tracking camera. - - All positions are in units of pixels. - - All areas are in units of pixels^2 - - All values are in the coordinate space of the eye tracking camera, - NOT the coordinate space of the stimulus display (i.e. this is not - gaze location), with (0, 0) being the upper-left corner of the - eye-tracking image. - - The 'likely_blink' column is True for any row (frame) where the pupil - fit failed OR eye fit failed OR an outlier fit was identified on the - pupil or eye fit. - - The pupil_area, cr_area, eye_area columns are set to NaN wherever - 'likely_blink' == True. - - The pupil_area_raw, cr_area_raw, eye_area_raw columns contains all - pupil fit values (including where 'likely_blink' == True). - - All ellipse fits are derived from tracking points that were output by - a DeepLabCut model that was trained on hand-annotated data from a - subset of imaging sessions on optical physiology rigs. - - Raw DeepLabCut tracking points are not publicly available. 
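Given the blink handling documented above, a minimal sketch of filtering the eye-tracking table (`session` as constructed in the earlier sketch; column names follow the notes above):
```
# pupil_area is NaN wherever likely_blink is True, so either filter on
# the flag or use the *_raw columns to keep every frame.
eye = session.eye_tracking
valid = eye[~eye['likely_blink']]
mean_pupil_area = valid['pupil_area'].mean()
```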
- - :rtype: pandas.DataFrame - """ - params = {'eye_tracking_dilation_frames', 'eye_tracking_z_threshold'} - - if self.needs_data_refresh(params): - self._eye_tracking = self.LazyProperty( - self.api.get_eye_tracking, - z_threshold=self._eye_tracking_z_threshold, - dilation_frames=self._eye_tracking_dilation_frames) - self.clear_updated_params(params) - - return self._eye_tracking - - @eye_tracking.setter - def eye_tracking(self, value): - self._eye_tracking = value - - @property - def eye_tracking_rig_geometry(self) -> dict: - """Get the eye tracking rig geometry - associated with an ophys experiment""" - return self.api.get_eye_tracking_rig_geometry() - - @property - def roi_masks(self) -> pd.DataFrame: - return self.cell_specimen_table[['cell_roi_id', 'roi_mask']] - - -if __name__ == "__main__": - - ophys_experiment_id = 789359614 - session = BehaviorOphysSession.from_lims(ophys_experiment_id) - print(session.trials['reward_time']) +import warnings + +from allensdk.brain_observatory.behavior.behavior_ophys_experiment import \ + BehaviorOphysExperiment as BOE + +# alias as BOE prevents someone becoming comfortable with +# import BehaviorOphysExperiment from this to-be-deprecated module + +class BehaviorOphysSession(BOE): + def __init__(self, **kwargs): + warnings.warn( + "allensdk.brain_observatory.behavior.behavior_ophys_session." + "BehaviorOphysSession is deprecated. use " + "allensdk.brain_observatory.behavior.behavior_ophys_experiment." + "BehaviorOphysExperiment.", + DeprecationWarning, + stacklevel=3) + super().__init__(**kwargs) diff --git a/allensdk/brain_observatory/behavior/behavior_project_cache/__init__.py b/allensdk/brain_observatory/behavior/behavior_project_cache/__init__.py new file mode 100644 index 000000000..ff862b37c --- /dev/null +++ b/allensdk/brain_observatory/behavior/behavior_project_cache/__init__.py @@ -0,0 +1,2 @@ +from allensdk.brain_observatory.behavior.behavior_project_cache.\ + behavior_project_cache import VisualBehaviorOphysProjectCache # noqa F401 diff --git a/allensdk/brain_observatory/behavior/behavior_project_cache.py b/allensdk/brain_observatory/behavior/behavior_project_cache/behavior_project_cache.py similarity index 52% rename from allensdk/brain_observatory/behavior/behavior_project_cache.py rename to allensdk/brain_observatory/behavior/behavior_project_cache/behavior_project_cache.py index 9d2a860ec..0d03412e9 100644 --- a/allensdk/brain_observatory/behavior/behavior_project_cache.py +++ b/allensdk/brain_observatory/behavior/behavior_project_cache/behavior_project_cache.py @@ -1,23 +1,27 @@ from functools import partial -from typing import Type, Optional, List, Union +from typing import Optional, List, Union from pathlib import Path import pandas as pd import logging -from allensdk.api.cache import Cache - +from allensdk.api.warehouse_cache.cache import Cache +from allensdk.brain_observatory.behavior.behavior_project_cache.tables \ + .experiments_table import \ + ExperimentsTable +from allensdk.brain_observatory.behavior.behavior_project_cache.tables \ + .sessions_table import \ + SessionsTable from allensdk.brain_observatory.behavior.project_apis.data_io import ( - BehaviorProjectLimsApi) -from allensdk.brain_observatory.behavior.project_apis.abcs import ( - BehaviorProjectBase) -from allensdk.api.caching_utilities import one_file_call_caching, call_caching + BehaviorProjectLimsApi, BehaviorProjectCloudApi) +from allensdk.api.warehouse_cache.caching_utilities import \ + one_file_call_caching, call_caching +from 
allensdk.brain_observatory.behavior.behavior_project_cache.tables \ + .ophys_sessions_table import \ + BehaviorOphysSessionsTable from allensdk.core.authentication import DbCredentials -BehaviorProjectApi = Type[BehaviorProjectBase] - - -class BehaviorProjectCache(Cache): +class VisualBehaviorOphysProjectCache(Cache): MANIFEST_VERSION = "0.0.1-alpha.3" OPHYS_SESSIONS_KEY = "ophys_sessions" BEHAVIOR_SESSIONS_KEY = "behavior_sessions" @@ -28,12 +32,12 @@ class BehaviorProjectCache(Cache): "spec": f"{OPHYS_SESSIONS_KEY}.csv", "parent_key": "BASEDIR", "typename": "file" - }, + }, BEHAVIOR_SESSIONS_KEY: { "spec": f"{BEHAVIOR_SESSIONS_KEY}.csv", "parent_key": "BASEDIR", "typename": "file" - }, + }, OPHYS_EXPERIMENTS_KEY: { "spec": f"{OPHYS_EXPERIMENTS_KEY}.csv", "parent_key": "BASEDIR", @@ -43,7 +47,8 @@ class BehaviorProjectCache(Cache): def __init__( self, - fetch_api: Optional[BehaviorProjectApi] = None, + fetch_api: Optional[Union[BehaviorProjectLimsApi, + BehaviorProjectCloudApi]] = None, fetch_tries: int = 2, manifest: Optional[Union[str, Path]] = None, version: Optional[str] = None, @@ -53,8 +58,8 @@ def __init__( downloading detailed session data (such as dff traces). Likely you will want to use a class constructor, such as `from_lims`, - to initialize a BehaviorProjectCache, rather than calling this - directly. + to initialize a VisualBehaviorOphysProjectCache, rather than calling + this directly. --- NOTE --- Because NWB files are not currently supported for this project (as of @@ -102,6 +107,62 @@ def __init__( self.fetch_tries = fetch_tries self.logger = logging.getLogger(self.__class__.__name__) + @classmethod + def from_s3_cache(cls, cache_dir: Union[str, Path], + bucket_name: str = "visual-behavior-ophys-data", + project_name: str = "visual-behavior-ophys" + ) -> "VisualBehaviorOphysProjectCache": + """instantiates this object with a connection to an s3 bucket and/or + a local cache related to that bucket. + + Parameters + ---------- + cache_dir: str or pathlib.Path + Path to the directory where data will be stored on the local system + + bucket_name: str + for example, if bucket URI is 's3://mybucket' this value should be + 'mybucket' + + project_name: str + the name of the project this cache is supposed to access. This + project name is the first part of the prefix of the release data + objects. I.e. s3://// + + Returns + ------- + VisualBehaviorOphysProjectCache instance + + """ + fetch_api = BehaviorProjectCloudApi.from_s3_cache( + cache_dir, bucket_name, project_name) + return cls(fetch_api=fetch_api) + + @classmethod + def from_local_cache(cls, cache_dir: Union[str, Path], + project_name: str = "visual-behavior-ophys" + ) -> "VisualBehaviorOphysProjectCache": + """instantiates this object with a local cache. + + Parameters + ---------- + cache_dir: str or pathlib.Path + Path to the directory where data will be stored on the local system + + project_name: str + the name of the project this cache is supposed to access. This + project name is the first part of the prefix of the release data + objects. I.e. 
s3://// + + Returns + ------- + VisualBehaviorOphysProjectCache instance + + """ + fetch_api = BehaviorProjectCloudApi.from_local_cache( + cache_dir, project_name) + return cls(fetch_api=fetch_api) + @classmethod def from_lims(cls, manifest: Optional[Union[str, Path]] = None, version: Optional[str] = None, @@ -111,11 +172,13 @@ def from_lims(cls, manifest: Optional[Union[str, Path]] = None, mtrain_credentials: Optional[DbCredentials] = None, host: Optional[str] = None, scheme: Optional[str] = None, - asynchronous: bool = True) -> "BehaviorProjectCache": + asynchronous: bool = True, + data_release_date: Optional[str] = None + ) -> "VisualBehaviorOphysProjectCache": """ - Construct a BehaviorProjectCache with a lims api. Use this method - to create a BehaviorProjectCache instance rather than calling - BehaviorProjectCache directly. + Construct a VisualBehaviorOphysProjectCache with a lims api. Use this + method to create a VisualBehaviorOphysProjectCache instance rather + than calling VisualBehaviorOphysProjectCache directly. Parameters ========== @@ -143,10 +206,13 @@ def from_lims(cls, manifest: Optional[Union[str, Path]] = None, included for consistency with EcephysProjectCache.from_lims. asynchronous : bool Whether to fetch from web asynchronously. Currently unused. + data_release_date: str + Use to filter tables to only include data released on date + ie 2021-03-25 Returns ======= - BehaviorProjectCache - BehaviorProjectCache instance with a LIMS fetch API + VisualBehaviorOphysProjectCache + VisualBehaviorOphysProjectCache instance with a LIMS fetch API """ if host and scheme: app_kwargs = {"host": host, "scheme": scheme, @@ -154,53 +220,63 @@ def from_lims(cls, manifest: Optional[Union[str, Path]] = None, else: app_kwargs = None fetch_api = BehaviorProjectLimsApi.default( - lims_credentials=lims_credentials, - mtrain_credentials=mtrain_credentials, - app_kwargs=app_kwargs) + lims_credentials=lims_credentials, + mtrain_credentials=mtrain_credentials, + data_release_date=data_release_date, + app_kwargs=app_kwargs) return cls(fetch_api=fetch_api, manifest=manifest, version=version, cache=cache, fetch_tries=fetch_tries) - def get_session_table( + def get_ophys_session_table( self, suppress: Optional[List[str]] = None, - by: str = "ophys_session_id") -> pd.DataFrame: + index_column: str = "ophys_session_id", + as_df=True, + include_behavior_data=True) -> \ + Union[pd.DataFrame, BehaviorOphysSessionsTable]: """ Return summary table of all ophys_session_ids in the database. :param suppress: optional list of columns to drop from the resulting dataframe. :type suppress: list of str - :param by: (default="ophys_session_id"). Column to index on, either + :param index_column: (default="ophys_session_id"). Column to index + on, either "ophys_session_id" or "ophys_experiment_id". - If by="ophys_experiment_id", then each row will only have one - experiment id, of type int (vs. an array of 1>more). - :type by: str + If index_column="ophys_experiment_id", then each row will only have + one experiment id, of type int (vs. an array of 1>more). 
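A minimal sketch of the S3 entry point defined above (bucket and project names are the defaults from the signature; the first call downloads the manifest into `cache_dir`):
```
from allensdk.brain_observatory.behavior.behavior_project_cache import \
    VisualBehaviorOphysProjectCache

cache = VisualBehaviorOphysProjectCache.from_s3_cache(
    cache_dir='/tmp/vbo_cache')          # placeholder path
ophys_sessions = cache.get_ophys_session_table()
```
Note that when the cache is backed by `BehaviorProjectCloudApi`, the table methods return the pre-built table early, so `index_column`, `suppress`, and `as_df` only take effect for the LIMS-backed cache.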
+ :type index_column: str + :param as_df: whether to return as df or as BehaviorOphysSessionsTable + :param include_behavior_data + Whether to include behavior data :rtype: pd.DataFrame """ + if isinstance(self.fetch_api, BehaviorProjectCloudApi): + return self.fetch_api.get_ophys_session_table() if self.cache: path = self.get_cache_path(None, self.OPHYS_SESSIONS_KEY) - sessions = one_file_call_caching( + ophys_sessions = one_file_call_caching( path, - self.fetch_api.get_session_table, - _write_json, _read_json) - sessions.set_index("ophys_session_id") - else: - sessions = self.fetch_api.get_session_table() - if suppress: - sessions.drop(columns=suppress, inplace=True, errors="ignore") - - # Possibly explode and reindex - if by == "ophys_session_id": - pass - elif by == "ophys_experiment_id": - sessions = (sessions.reset_index() - .explode("ophys_experiment_id") - .set_index("ophys_experiment_id")) + self.fetch_api.get_ophys_session_table, + _write_json, + lambda path: _read_json(path, index_name='ophys_session_id')) else: - self.logger.warning( - f"Invalid value for `by`, '{by}', passed to get_session_table." - " Valid choices for `by` are 'ophys_experiment_id' and " - "'ophys_session_id'.") - return sessions + ophys_sessions = self.fetch_api.get_ophys_session_table() + + if include_behavior_data: + # Merge behavior data in + behavior_sessions_table = self.get_behavior_session_table( + suppress=suppress, as_df=True, include_ophys_data=False) + ophys_sessions = behavior_sessions_table.merge( + ophys_sessions, + left_index=True, + right_on='behavior_session_id', + suffixes=('_behavior', '_ophys')) + + sessions = BehaviorOphysSessionsTable(df=ophys_sessions, + suppress=suppress, + index_column=index_column) + + return sessions.table if as_df else sessions def add_manifest_paths(self, manifest_builder): manifest_builder = super().add_manifest_paths(manifest_builder) @@ -208,55 +284,84 @@ def add_manifest_paths(self, manifest_builder): manifest_builder.add_path(key, **config) return manifest_builder - def get_experiment_table( + def get_ophys_experiment_table( self, - suppress: Optional[List[str]] = None) -> pd.DataFrame: + suppress: Optional[List[str]] = None, + as_df=True) -> Union[pd.DataFrame, SessionsTable]: """ Return summary table of all ophys_experiment_ids in the database. :param suppress: optional list of columns to drop from the resulting dataframe. 
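For a LIMS-backed cache, the tables can be trimmed and re-shaped as the parameters above describe; a sketch (column name chosen for illustration):
```
# Sketch: LIMS-backed usage, since the cloud path ignores these kwargs.
lims_cache = VisualBehaviorOphysProjectCache.from_lims()
experiments = lims_cache.get_ophys_experiment_table(
    suppress=['behavior_session_uuid'])   # drop an internal column
by_experiment = lims_cache.get_ophys_session_table(
    index_column='ophys_experiment_id')   # one experiment id per row
```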
:type suppress: list of str + :param as_df: whether to return as df or as SessionsTable :rtype: pd.DataFrame """ + if isinstance(self.fetch_api, BehaviorProjectCloudApi): + return self.fetch_api.get_ophys_experiment_table() if self.cache: path = self.get_cache_path(None, self.OPHYS_EXPERIMENTS_KEY) experiments = one_file_call_caching( path, - self.fetch_api.get_experiment_table, - _write_json, _read_json) - experiments.set_index("ophys_experiment_id") + self.fetch_api.get_ophys_experiment_table, + _write_json, + lambda path: _read_json(path, + index_name='ophys_experiment_id')) else: - experiments = self.fetch_api.get_experiment_table() - if suppress: - experiments.drop(columns=suppress, inplace=True, errors="ignore") - return experiments + experiments = self.fetch_api.get_ophys_experiment_table() + + # Merge behavior data in + behavior_sessions_table = self.get_behavior_session_table( + suppress=suppress, as_df=True, include_ophys_data=False) + experiments = behavior_sessions_table.merge( + experiments, left_index=True, right_on='behavior_session_id', + suffixes=('_behavior', '_ophys')) + experiments = ExperimentsTable(df=experiments, + suppress=suppress) + return experiments.table if as_df else experiments def get_behavior_session_table( self, - suppress: Optional[List[str]] = None) -> pd.DataFrame: + suppress: Optional[List[str]] = None, + as_df=True, + include_ophys_data=True) -> Union[pd.DataFrame, SessionsTable]: """ Return summary table of all behavior_session_ids in the database. :param suppress: optional list of columns to drop from the resulting dataframe. + :param as_df: whether to return as df or as SessionsTable + :param include_ophys_data + Whether to include ophys data :type suppress: list of str :rtype: pd.DataFrame """ - + if isinstance(self.fetch_api, BehaviorProjectCloudApi): + return self.fetch_api.get_behavior_session_table() if self.cache: path = self.get_cache_path(None, self.BEHAVIOR_SESSIONS_KEY) sessions = one_file_call_caching( path, - self.fetch_api.get_behavior_only_session_table, - _write_json, _read_json) - sessions.set_index("behavior_session_id") + self.fetch_api.get_behavior_session_table, + _write_json, + lambda path: _read_json(path, + index_name='behavior_session_id')) + else: + sessions = self.fetch_api.get_behavior_session_table() + + if include_ophys_data: + ophys_session_table = self.get_ophys_session_table( + suppress=suppress, + as_df=False, + include_behavior_data=False) else: - sessions = self.fetch_api.get_behavior_only_session_table() - sessions = sessions.rename(columns={"genotype": "full_genotype"}) - if suppress: - sessions.drop(columns=suppress, inplace=True, errors="ignore") - return sessions + ophys_session_table = None + sessions = SessionsTable(df=sessions, suppress=suppress, + fetch_api=self.fetch_api, + ophys_session_table=ophys_session_table) + + return sessions.table if as_df else sessions - def get_session_data(self, ophys_experiment_id: int, fixed: bool = False): + def get_behavior_ophys_experiment(self, ophys_experiment_id: int, + fixed: bool = False): """ Note -- This method mocks the behavior of a cache. 
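A sketch of the behavior-session side (`lims_cache` as in the previous sketch); `include_ophys_data` merges the ophys session table into the behavior table, mirroring how the ophys tables merge behavior data in:
```
behavior_sessions = lims_cache.get_behavior_session_table(
    include_ophys_data=True)
one_session = lims_cache.get_behavior_session(
    behavior_session_id=behavior_sessions.index[0])
```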
Future development will include an NWB reader to read from @@ -266,17 +371,17 @@ def get_session_data(self, ophys_experiment_id: int, fixed: bool = False): """ if fixed: raise NotImplementedError - fetch_session = partial(self.fetch_api.get_session_data, + fetch_session = partial(self.fetch_api.get_behavior_ophys_experiment, ophys_experiment_id) return call_caching( fetch_session, - lambda x: x, # not writing anything - lazy=False, # can't actually read from file cache + lambda x: x, # not writing anything + lazy=False, # can't actually read from file cache read=fetch_session ) - def get_behavior_session_data(self, behavior_session_id: int, - fixed: bool = False): + def get_behavior_session(self, behavior_session_id: int, + fixed: bool = False): """ Note -- This method mocks the behavior of a cache. Future development will include an NWB reader to read from @@ -287,12 +392,12 @@ def get_behavior_session_data(self, behavior_session_id: int, if fixed: raise NotImplementedError - fetch_session = partial(self.fetch_api.get_behavior_only_session_data, + fetch_session = partial(self.fetch_api.get_behavior_session, behavior_session_id) return call_caching( fetch_session, - lambda x: x, # not writing anything - lazy=False, # can't actually read from file cache + lambda x: x, # not writing anything + lazy=False, # can't actually read from file cache read=fetch_session ) @@ -311,12 +416,13 @@ def _write_json(path, df): them back to the expected format by adding them to `convert_dates`. In the future we could schematize this data using marshmallow or something similar.""" - df.reset_index(inplace=True) df.to_json(path, orient="split", date_unit="s", date_format="epoch") -def _read_json(path): +def _read_json(path, index_name: Optional[str] = None): """Reads a dataframe file written to the cache by _write_json.""" df = pd.read_json(path, date_unit="s", orient="split", convert_dates=["date_of_acquisition"]) + if index_name: + df = df.rename_axis(index=index_name) return df diff --git a/allensdk/brain_observatory/behavior/behavior_project_cache/external/__init__.py b/allensdk/brain_observatory/behavior/behavior_project_cache/external/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/allensdk/brain_observatory/behavior/behavior_project_cache/external/behavior_project_metadata_writer.py b/allensdk/brain_observatory/behavior/behavior_project_cache/external/behavior_project_metadata_writer.py new file mode 100644 index 000000000..f374cde4e --- /dev/null +++ b/allensdk/brain_observatory/behavior/behavior_project_cache/external/behavior_project_metadata_writer.py @@ -0,0 +1,200 @@ +import argparse +import json +import logging +import os +import warnings + +import pandas as pd + +import allensdk +from allensdk.brain_observatory.behavior.behavior_project_cache import \ + VisualBehaviorOphysProjectCache + +######### +# These columns should be dropped from external-facing metadata +######### +SESSION_SUPPRESS = ( + 'donor_id', + 'foraging_id', + 'session_name', + 'specimen_id' +) +OPHYS_EXPERIMENTS_SUPPRESS = SESSION_SUPPRESS + ( + 'container_workflow_state', + 'behavior_session_uuid', + 'experiment_workflow_state', + 'published_at', + 'isi_experiment_id' +) +######### + +OUTPUT_METADATA_FILENAMES = { + 'behavior_session_table': 'behavior_session_table.csv', + 'ophys_session_table': 'ophys_session_table.csv', + 'ophys_experiment_table': 'ophys_experiment_table.csv' +} + + +class BehaviorProjectMetadataWriter: + """Class to write project-level metadata to csv""" + + def __init__(self, 
behavior_project_cache: VisualBehaviorOphysProjectCache, + out_dir: str, project_name: str, data_release_date: str, + overwrite_ok=False): + + self._behavior_project_cache = behavior_project_cache + self._out_dir = out_dir + self._project_name = project_name + self._data_release_date = data_release_date + self._overwrite_ok = overwrite_ok + self._logger = logging.getLogger(self.__class__.__name__) + + self._release_behavior_only_nwb = self._behavior_project_cache \ + .fetch_api.get_release_files(file_type='BehaviorNwb') + self._release_behavior_with_ophys_nwb = self._behavior_project_cache \ + .fetch_api.get_release_files(file_type='BehaviorOphysNwb') + + def write_metadata(self): + """Writes metadata to csv""" + os.makedirs(self._out_dir, exist_ok=True) + + self._write_behavior_sessions() + self._write_ophys_sessions() + self._write_ophys_experiments() + + self._write_manifest() + + def _write_behavior_sessions(self, suppress=SESSION_SUPPRESS, + output_filename=OUTPUT_METADATA_FILENAMES[ + 'behavior_session_table']): + behavior_sessions = self._behavior_project_cache. \ + get_behavior_session_table(suppress=suppress, + as_df=True) + + # Add release files + behavior_sessions = behavior_sessions \ + .merge(self._release_behavior_only_nwb, + left_index=True, + right_index=True, + how='left') + if "file_id" in behavior_sessions.columns: + if behavior_sessions["file_id"].isnull().values.any(): + msg = (f"{output_filename} field `file_id` contains missing " + "values and pandas.to_csv() converts it to float") + warnings.warn(msg) + self._write_metadata_table(df=behavior_sessions, + filename=output_filename) + + def _write_ophys_sessions(self, suppress=SESSION_SUPPRESS, + output_filename=OUTPUT_METADATA_FILENAMES[ + 'ophys_session_table' + ]): + ophys_sessions = self._behavior_project_cache. 
\ + get_ophys_session_table(suppress=suppress, as_df=True) + self._write_metadata_table(df=ophys_sessions, + filename=output_filename) + + def _write_ophys_experiments(self, suppress=OPHYS_EXPERIMENTS_SUPPRESS, + output_filename=OUTPUT_METADATA_FILENAMES[ + 'ophys_experiment_table' + ]): + ophys_experiments = \ + self._behavior_project_cache.get_ophys_experiment_table( + suppress=suppress, as_df=True) + + # Add release files + ophys_experiments = ophys_experiments.merge( + self._release_behavior_with_ophys_nwb + .drop('behavior_session_id', axis=1), + left_index=True, + right_index=True, + how='left') + + self._write_metadata_table(df=ophys_experiments, + filename=output_filename) + + def _write_metadata_table(self, df: pd.DataFrame, filename: str): + """ + Writes file to csv + + Parameters + ---------- + df + The dataframe to write + filename + Filename to save as + """ + filepath = os.path.join(self._out_dir, filename) + self._pre_file_write(filepath=filepath) + + self._logger.info(f'Writing {filepath}') + + df = df.reset_index() + df.to_csv(filepath, index=False) + + self._logger.info('Writing successful') + + def _write_manifest(self): + def get_abs_path(filename): + return os.path.abspath(os.path.join(self._out_dir, filename)) + + metadata_filenames = OUTPUT_METADATA_FILENAMES.values() + metadata_files = [get_abs_path(f) for f in metadata_filenames] + data_pipeline = [{ + 'name': 'AllenSDK', + 'version': allensdk.__version__, + 'comment': 'AllenSDK version used to produce data NWB and ' + 'metadata CSV files for this release' + }] + + manifest = { + 'metadata_files': metadata_files, + 'data_pipeline_metadata': data_pipeline, + 'project_name': self._project_name, + } + + save_path = os.path.join(self._out_dir, 'manifest.json') + self._pre_file_write(filepath=save_path) + + with open(save_path, 'w') as f: + f.write(json.dumps(manifest, indent=4)) + + def _pre_file_write(self, filepath: str): + """Checks if file exists at filepath. If so, and overwrite_ok is False, + raises an exception""" + if os.path.exists(filepath): + if self._overwrite_ok: + pass + else: + raise RuntimeError(f'{filepath} already exists. In order ' + f'to overwrite this file, pass the ' + f'--overwrite_ok flag') + + +def main(): + parser = argparse.ArgumentParser(description='Write project metadata to ' + 'csvs') + parser.add_argument('--out_dir', help='directory to save csvs', + required=True) + parser.add_argument('--project_name', help='project name', required=True) + parser.add_argument('--data_release_date', help='Project release date. 
' + 'Ie 2021-03-25', + required=True) + parser.add_argument('--overwrite_ok', help='Whether to allow overwriting ' + 'existing output files', + dest='overwrite_ok', action='store_true') + args = parser.parse_args() + + bpc = VisualBehaviorOphysProjectCache.from_lims( + data_release_date=args.data_release_date) + bpmw = BehaviorProjectMetadataWriter( + behavior_project_cache=bpc, + out_dir=args.out_dir, + project_name=args.project_name, + data_release_date=args.data_release_date, + overwrite_ok=args.overwrite_ok) + bpmw.write_metadata() + + +if __name__ == '__main__': + main() diff --git a/allensdk/brain_observatory/behavior/behavior_project_cache/tables/__init__.py b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/allensdk/brain_observatory/behavior/behavior_project_cache/tables/experiments_table.py b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/experiments_table.py new file mode 100644 index 000000000..e27f41e13 --- /dev/null +++ b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/experiments_table.py @@ -0,0 +1,31 @@ +from typing import Optional, List + +import pandas as pd + +from allensdk.brain_observatory.behavior.behavior_project_cache.tables\ + .ophys_mixin import \ + OphysMixin +from allensdk.brain_observatory.behavior.behavior_project_cache.tables\ + .project_table import \ + ProjectTable + + +class ExperimentsTable(ProjectTable, OphysMixin): + """Class for storing and manipulating project-level data + at the behavior-ophys experiment level""" + def __init__(self, df: pd.DataFrame, + suppress: Optional[List[str]] = None): + """ + Parameters + ---------- + df + The behavior-ophys experiment-level data + suppress + columns to drop from table + """ + + ProjectTable.__init__(self, df=df, suppress=suppress) + OphysMixin.__init__(self) + + def postprocess_additional(self): + pass diff --git a/allensdk/brain_observatory/behavior/behavior_project_cache/tables/ophys_mixin.py b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/ophys_mixin.py new file mode 100644 index 000000000..980c2de6f --- /dev/null +++ b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/ophys_mixin.py @@ -0,0 +1,13 @@ +class OphysMixin: + """A mixin class for ophys project data""" + def __init__(self): + # If we're in the state of combining behavior and ophys data + if 'date_of_acquisition_behavior' in self._df and \ + 'date_of_acquisition_ophys' in self._df: + + # Prioritize ophys_date_of_acquisition + self._df['date_of_acquisition'] = \ + self._df['date_of_acquisition_ophys'] + self._df = self._df.drop( + ['date_of_acquisition_behavior', + 'date_of_acquisition_ophys'], axis=1) diff --git a/allensdk/brain_observatory/behavior/behavior_project_cache/tables/ophys_sessions_table.py b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/ophys_sessions_table.py new file mode 100644 index 000000000..dc5760f6d --- /dev/null +++ b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/ophys_sessions_table.py @@ -0,0 +1,51 @@ +import logging +from typing import Optional, List + +import pandas as pd + +from allensdk.brain_observatory.behavior.behavior_project_cache.tables\ + .ophys_mixin import \ + OphysMixin +from allensdk.brain_observatory.behavior.behavior_project_cache.tables\ + .project_table import ProjectTable + + +class BehaviorOphysSessionsTable(ProjectTable, OphysMixin): + """Class for storing and manipulating project-level data + at the 
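The writer's `main()` above wires these pieces together; a programmatic equivalent, sketched with placeholder paths and dates:
```
# Programmatic equivalent of main() above; out_dir is a placeholder.
from allensdk.brain_observatory.behavior.behavior_project_cache import \
    VisualBehaviorOphysProjectCache
from allensdk.brain_observatory.behavior.behavior_project_cache.external.\
    behavior_project_metadata_writer import BehaviorProjectMetadataWriter

cache = VisualBehaviorOphysProjectCache.from_lims(
    data_release_date='2021-03-25')
writer = BehaviorProjectMetadataWriter(
    behavior_project_cache=cache,
    out_dir='/tmp/release_metadata',
    project_name='visual-behavior-ophys',
    data_release_date='2021-03-25',
    overwrite_ok=True)
writer.write_metadata()
```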
behavior-ophys session level""" + def __init__(self, df: pd.DataFrame, + suppress: Optional[List[str]] = None, + index_column: str = 'ophys_session_id'): + """ + Parameters + ---------- + df + The behavior-ophys session-level data + suppress + columns to drop from table + index_column + See description in BehaviorProjectCache.get_session_table + """ + + self._logger = logging.getLogger(self.__class__.__name__) + self._index_column = index_column + ProjectTable.__init__(self, df=df, suppress=suppress) + OphysMixin.__init__(self) + + def postprocess_additional(self): + # Possibly explode and reindex + self.__explode() + + def __explode(self): + if self._index_column == "ophys_session_id": + pass + elif self._index_column == "ophys_experiment_id": + self._df = (self._df.reset_index() + .explode("ophys_experiment_id") + .set_index("ophys_experiment_id")) + else: + self._logger.warning( + f"Invalid value for `by`, '{self._index_column}', passed to " + f"BehaviorOphysSessionsCacheTable." + " Valid choices for `by` are 'ophys_experiment_id' and " + "'ophys_session_id'.") diff --git a/allensdk/brain_observatory/behavior/behavior_project_cache/tables/project_table.py b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/project_table.py new file mode 100644 index 000000000..c14380590 --- /dev/null +++ b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/project_table.py @@ -0,0 +1,49 @@ +from abc import abstractmethod, ABC +from typing import Optional, Iterable + +import pandas as pd + + +class ProjectTable(ABC): + """Class for storing and manipulating project-level data""" + def __init__(self, df: pd.DataFrame, + suppress: Optional[Iterable[str]] = None): + """ + Parameters + ---------- + df + The project-level data + suppress + columns to drop from table + + """ + self._df = df + + if suppress is not None: + suppress = list(suppress) + self._suppress = suppress + + self.postprocess() + + @property + def table(self): + return self._df + + def postprocess_base(self): + """Postprocessing to apply to all project-level data""" + # Make sure the index is not duplicated (it is rare) + self._df = self._df[~self._df.index.duplicated()].copy() + + def postprocess(self): + """Postprocess loop""" + self.postprocess_base() + self.postprocess_additional() + + if self._suppress: + self._df.drop(columns=self._suppress, inplace=True, + errors="ignore") + + @abstractmethod + def postprocess_additional(self): + """Additional postprocessing should be overridden by subclassess""" + raise NotImplementedError() diff --git a/allensdk/brain_observatory/behavior/behavior_project_cache/tables/sessions_table.py b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/sessions_table.py new file mode 100644 index 000000000..d44a8f327 --- /dev/null +++ b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/sessions_table.py @@ -0,0 +1,95 @@ +import re +from typing import Optional, List + +import pandas as pd + +from allensdk.brain_observatory.behavior.behavior_project_cache.tables \ + .ophys_sessions_table import \ + BehaviorOphysSessionsTable +from allensdk.brain_observatory.behavior.behavior_project_cache.tables \ + .util.prior_exposure_processing import \ + get_prior_exposures_to_session_type, get_prior_exposures_to_image_set, \ + get_prior_exposures_to_omissions +from allensdk.brain_observatory.behavior.behavior_project_cache.tables \ + .project_table import \ + ProjectTable +from allensdk.brain_observatory.behavior.metadata.behavior_metadata import \ + BehaviorMetadata 
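The `ProjectTable` base above fixes the postprocessing pipeline (deduplicate the index, run the subclass hook, then suppress columns); a hypothetical minimal subclass to illustrate the contract:
```
# Hypothetical subclass, illustration only; ProjectTable is the ABC
# defined in project_table.py above.
import pandas as pd

class ToyTable(ProjectTable):
    def postprocess_additional(self):
        # subclasses hook extra column handling in here
        self._df['n_rows'] = len(self._df)

toy = ToyTable(df=pd.DataFrame({'a': [1, 2]}), suppress=['a'])
print(toy.table.columns.tolist())   # ['n_rows']; 'a' was suppressed
```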
+from allensdk.brain_observatory.behavior.project_apis.data_io import \
+    BehaviorProjectLimsApi
+
+
+class SessionsTable(ProjectTable):
+    """Class for storing and manipulating project-level data
+    at the session level"""
+
+    def __init__(
+            self, df: pd.DataFrame,
+            fetch_api: BehaviorProjectLimsApi,
+            suppress: Optional[List[str]] = None,
+            ophys_session_table: Optional[BehaviorOphysSessionsTable] = None):
+        """
+        Parameters
+        ----------
+        df
+            The session-level data
+        fetch_api
+            The api needed to call mtrain db
+        suppress
+            columns to drop from table
+        ophys_session_table
+            BehaviorOphysSessionsTable, to optionally merge in ophys data
+        """
+        self._fetch_api = fetch_api
+        self._ophys_session_table = ophys_session_table
+        super().__init__(df=df, suppress=suppress)
+
+    def postprocess_additional(self):
+        self._df['reporter_line'] = self._df['reporter_line'].apply(
+            BehaviorMetadata.parse_reporter_line)
+        self._df['cre_line'] = self._df['full_genotype'].apply(
+            BehaviorMetadata.parse_cre_line)
+        self._df['indicator'] = self._df['reporter_line'].apply(
+            BehaviorMetadata.parse_indicator)
+
+        self.__add_session_number()
+
+        self._df['prior_exposures_to_session_type'] = \
+            get_prior_exposures_to_session_type(df=self._df)
+        self._df['prior_exposures_to_image_set'] = \
+            get_prior_exposures_to_image_set(df=self._df)
+        self._df['prior_exposures_to_omissions'] = \
+            get_prior_exposures_to_omissions(df=self._df,
+                                             fetch_api=self._fetch_api)
+
+        if self._ophys_session_table is not None:
+            # Merge in ophys data
+            self._df = self._df.reset_index() \
+                .merge(self._ophys_session_table.table.reset_index(),
+                       on='behavior_session_id',
+                       how='left',
+                       suffixes=('_behavior', '_ophys'))
+            self._df = self._df.set_index('behavior_session_id')
+
+            # Prioritize behavior date_of_acquisition
+            self._df['date_of_acquisition'] = \
+                self._df['date_of_acquisition_behavior']
+            self._df = self._df.drop(['date_of_acquisition_behavior',
+                                      'date_of_acquisition_ophys'], axis=1)
+
+    def __add_session_number(self):
+        """Parses session number from session type and adds it to the
+        dataframe"""
+
+        def parse_session_number(session_type: str):
+            """Parse the session number from session type"""
+            match = re.match(r'OPHYS_(?P<session_number>\d+)',
+                             session_type)
+            if match is None:
+                return None
+            return int(match.group('session_number'))
+
+        session_type = self._df['session_type']
+        session_type = session_type[session_type.notnull()]
+
+        self._df.loc[session_type.index, 'session_number'] = \
+            session_type.apply(parse_session_number)
diff --git a/allensdk/brain_observatory/behavior/behavior_project_cache/tables/util/__init__.py b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/util/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/allensdk/brain_observatory/behavior/behavior_project_cache/tables/util/prior_exposure_processing.py b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/util/prior_exposure_processing.py
new file mode 100644
index 000000000..56f5abf09
--- /dev/null
+++ b/allensdk/brain_observatory/behavior/behavior_project_cache/tables/util/prior_exposure_processing.py
@@ -0,0 +1,182 @@
+import re
+from typing import Optional
+
+import pandas as pd
+
+from allensdk.brain_observatory.behavior.project_apis.data_io import \
+    BehaviorProjectLimsApi
+
+
+def get_prior_exposures_to_session_type(df: pd.DataFrame) -> pd.Series:
+    """Get prior exposures to session type
+
+    Parameters
+    ----------
+    df
+        The sessions df
+
+    Returns
+    --------
+    Series with index same as df and values prior exposure counts to
+    session type
+    """
+    return __get_prior_exposure_count(df=df, to=df['session_type'])
+
+
+def get_prior_exposures_to_image_set(df: pd.DataFrame) -> pd.Series:
+    """Get prior exposures to image set
+
+    The image set here is the letter part of the session type,
+    i.e. for session type OPHYS_1_images_B, it would be "B"
+
+    Some session types don't have an image set name, such as
+    gratings, which will be set to null
+
+    Parameters
+    ----------
+    df
+        The session df
+
+    Returns
+    --------
+    Series with index same as df and values prior exposure counts to image set
+    """
+
+    def __get_image_set_name(session_type: Optional[str]):
+        match = re.match(r'OPHYS_\d+_images_(?P<image_set>\w)',
+                         session_type)
+        if match is None:
+            return None
+        return match.group('image_set')
+
+    session_type = df['session_type'][
+        df['session_type'].notnull()]
+    image_set = session_type.apply(__get_image_set_name)
+    return __get_prior_exposure_count(df=df, to=image_set)
+
+
+def get_prior_exposures_to_omissions(df: pd.DataFrame,
+                                     fetch_api: BehaviorProjectLimsApi) -> \
+        pd.Series:
+    """Get prior exposures to omissions
+
+    Parameters
+    ----------
+    df
+        The session df
+    fetch_api
+        API needed to query mtrain
+
+    Returns
+    --------
+    Series with index same as df and values prior exposure counts to omissions
+    """
+    df = df[df['session_type'].notnull()]
+
+    contains_omissions = pd.Series(False, index=df.index)
+
+    def __get_habituation_sessions(df: pd.DataFrame):
+        """Returns all habituation sessions"""
+        return df[
+            df['session_type'].str.lower().str.contains('habituation')]
+
+    def __get_habituation_sessions_contain_omissions(
+            habituation_sessions: pd.DataFrame,
+            fetch_api: BehaviorProjectLimsApi) -> pd.Series:
+        """Habituation sessions are not supposed to include omissions but
+        because of a mistake omissions were included for some habituation
+        sessions.
+
+        This queries mtrain to figure out if omissions were included
+        for any of the habituation sessions
+
+        Parameters
+        ----------
+        habituation_sessions
+            the habituation sessions
+
+        Returns
+        --------
+        series where index is same as habituation sessions and values
+        indicate whether omissions were included
+        """
+
+        def __session_contains_omissions(
+                mtrain_stage_parameters: dict) -> bool:
+            return 'flash_omit_probability' in mtrain_stage_parameters \
+                   and \
+                   mtrain_stage_parameters['flash_omit_probability'] > 0
+
+        foraging_ids = habituation_sessions['foraging_id'].tolist()
+        foraging_ids = [f'\'{x}\'' for x in foraging_ids]
+        mtrain_stage_parameters = fetch_api. 
\ + get_behavior_stage_parameters(foraging_ids=foraging_ids) + return habituation_sessions.apply( + lambda session: __session_contains_omissions( + mtrain_stage_parameters=mtrain_stage_parameters[ + session['foraging_id']]), axis=1) + + habituation_sessions = __get_habituation_sessions(df=df) + if not habituation_sessions.empty: + contains_omissions.loc[habituation_sessions.index] = \ + __get_habituation_sessions_contain_omissions( + habituation_sessions=habituation_sessions, + fetch_api=fetch_api) + + contains_omissions.loc[ + (df['session_type'].str.lower().str.contains('ophys')) & + (~df.index.isin(habituation_sessions.index)) + ] = True + return __get_prior_exposure_count(df=df, to=contains_omissions, + agg_method='cumsum') + + +def __get_prior_exposure_count(df: pd.DataFrame, to: pd.Series, + agg_method='cumcount') -> pd.Series: + """Returns prior exposures a subject had to something + i.e can be prior exposures to a stimulus type, a image_set or + omission + + Parameters + ---------- + df + The sessions df + to + The array to calculate prior exposures to + Needs to have the same index as self._df + agg_method + The aggregation method to apply on the groups (cumcount or cumsum) + + Returns + --------- + Series with index same as self._df and with values of prior + exposure counts + """ + index = df.index + df = df.sort_values('date_of_acquisition') + df = df[df['session_type'].notnull()] + + # reindex "to" to df + to = to.loc[df.index] + + # exclude missing values from cumcount + to = to[to.notnull()] + + # reindex df to match "to" index with missing values removed + df = df.loc[to.index] + + if agg_method == 'cumcount': + counts = df.groupby(['mouse_id', to]).cumcount() + elif agg_method == 'cumsum': + df['to'] = to + + def cumsum(x): + return x.cumsum().shift(fill_value=0).astype('int64') + + counts = df.groupby(['mouse_id'])['to'].apply(cumsum) + counts.name = None + else: + raise ValueError(f'agg method {agg_method} not supported') + + # reindex to original index + return counts.reindex(index) diff --git a/allensdk/brain_observatory/behavior/behavior_session.py b/allensdk/brain_observatory/behavior/behavior_session.py index 9e8b6582d..e7836fee7 100644 --- a/allensdk/brain_observatory/behavior/behavior_session.py +++ b/allensdk/brain_observatory/behavior/behavior_session.py @@ -66,7 +66,7 @@ def cache_clear(self) -> None: try: self.api.cache_clear() except AttributeError: - logging.getLogger("BehaviorOphysSession").warning( + logging.getLogger("BehaviorSession").warning( "Attempted to clear API cache, but method `cache_clear`" f" does not exist on {self.api.__class__.__name__}") @@ -220,7 +220,7 @@ def licks(self) -> pd.DataFrame: NOTE: For BehaviorSessions, returned timestamps are not aligned to external 'synchronization' reference timestamps. - Synchronized timestamps are only available for BehaviorOphysSessions. + Synchronized timestamps are only available for BehaviorOphysExperiments. Returns ------- @@ -239,7 +239,7 @@ def rewards(self) -> pd.DataFrame: NOTE: For BehaviorSessions, returned timestamps are not aligned to external 'synchronization' reference timestamps. - Synchronized timestamps are only available for BehaviorOphysSessions. + Synchronized timestamps are only available for BehaviorOphysExperiments. Returns ------- @@ -260,7 +260,7 @@ def running_speed(self) -> pd.DataFrame: NOTE: For BehaviorSessions, returned timestamps are not aligned to external 'synchronization' reference timestamps. - Synchronized timestamps are only available for BehaviorOphysSessions. 
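The counting helper at the heart of these functions groups by mouse and by the "to" values, ordered by acquisition date; a toy illustration with hypothetical data:
```
import pandas as pd

df = pd.DataFrame({
    'mouse_id': [1, 1, 1],
    'session_type': ['A', 'A', 'B'],
    'date_of_acquisition': pd.to_datetime(
        ['2021-01-01', '2021-01-02', '2021-01-03'])})
counts = df.sort_values('date_of_acquisition') \
    .groupby(['mouse_id', 'session_type']).cumcount()
print(counts.tolist())   # [0, 1, 0]: the second 'A' had one prior exposure
```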
+ Synchronized timestamps are only available for BehaviorOphysExperiments. Returns ------- @@ -280,7 +280,7 @@ def raw_running_speed(self) -> pd.DataFrame: NOTE: For BehaviorSessions, returned timestamps are not aligned to external 'synchronization' reference timestamps. - Synchronized timestamps are only available for BehaviorOphysSessions. + Synchronized timestamps are only available for BehaviorOphysExperiments. Returns ------- @@ -336,7 +336,7 @@ def stimulus_timestamps(self) -> np.ndarray: NOTE: For BehaviorSessions, returned timestamps are not aligned to external 'synchronization' reference timestamps. - Synchronized timestamps are only available for BehaviorOphysSessions. + Synchronized timestamps are only available for BehaviorOphysExperiments. Returns ------- diff --git a/allensdk/brain_observatory/behavior/metadata/behavior_metadata.py b/allensdk/brain_observatory/behavior/metadata/behavior_metadata.py index ada47815c..a057e01db 100644 --- a/allensdk/brain_observatory/behavior/metadata/behavior_metadata.py +++ b/allensdk/brain_observatory/behavior/metadata/behavior_metadata.py @@ -7,8 +7,6 @@ import numpy as np import pytz -from allensdk.brain_observatory.behavior.metadata.util import \ - parse_cre_line, parse_age_in_days from allensdk.brain_observatory.behavior.session_apis.abcs.\ data_extractor_base.behavior_data_extractor_base import \ BehaviorDataExtractorBase @@ -177,7 +175,7 @@ def age_in_days(self) -> Optional[int]: """Converts the age cod into a numeric days representation""" age = self._extractor.get_age() - return parse_age_in_days(age=age) + return self.parse_age_in_days(age=age, warn=True) @property def stimulus_frame_rate(self) -> float: @@ -237,34 +235,20 @@ def date_of_acquisition(self) -> datetime: @property def reporter_line(self) -> Optional[str]: - """There can be multiple reporter lines, so it is returned from LIMS - as a list. But there shouldn't be more than 1 for behavior. This - tries to convert to str - - Returns - --------- - single reporter line, or None if not possible - """ reporter_line = self._extractor.get_reporter_line() + return self.parse_reporter_line(reporter_line=reporter_line, warn=True) - if isinstance(reporter_line, str): - return reporter_line - - if len(reporter_line) == 0: - warnings.warn('No reporter line') - return None - - if len(reporter_line) > 1: - warnings.warn('More than 1 reporter line. Returning the first one') - - return reporter_line[0] + @property + def indicator(self) -> Optional[str]: + """Parses indicator from reporter""" + reporter_line = self.reporter_line + return self.parse_indicator(reporter_line=reporter_line, warn=True) @property def cre_line(self) -> Optional[str]: """Parses cre_line from full_genotype""" - cre_line = parse_cre_line(full_genotype=self.full_genotype) - if cre_line is None: - warnings.warn('Unable to parse cre_line from full_genotype') + cre_line = self.parse_cre_line(full_genotype=self.full_genotype, + warn=True) return cre_line @property @@ -322,6 +306,96 @@ def to_dict(self) -> dict: def _get_frame_rate(timestamps: np.ndarray): return np.round(1 / np.mean(np.diff(timestamps)), 0) + @staticmethod + def parse_cre_line(full_genotype: str, warn=False) -> Optional[str]: + """ + Parameters + ---------- + full_genotype + formatted from LIMS, e.g. + Vip-IRES-Cre/wt;Ai148(TIT2L-GC6f-ICL-tTA2)/wt + warn + Whether to output warning if parsing fails + + Returns + ---------- + cre_line + just the Cre line, e.g. 
Vip-IRES-Cre, or None if not possible to + parse + """ + if ';' not in full_genotype: + if warn: + warnings.warn('Unable to parse cre_line from full_genotype') + return None + return full_genotype.split(';')[0].replace('/wt', '') + + @staticmethod + def parse_age_in_days(age: str, warn=False) -> Optional[int]: + """Converts the age code into a numeric days representation + + Parameters + ---------- + age + age code, ie P123 + warn + Whether to output warning if parsing fails + """ + if not age.startswith('P'): + if warn: + warnings.warn('Could not parse numeric age from age code ' + '(age code does not start with "P")') + return None + + match = re.search(r'\d+', age) + + if match is None: + if warn: + warnings.warn('Could not parse numeric age from age code ' + '(no numeric values found in age code)') + return None + + start, end = match.span() + return int(age[start:end]) + + @staticmethod + def parse_reporter_line(reporter_line: Optional[List[str]], + warn=False) -> Optional[str]: + """There can be multiple reporter lines, so it is returned from LIMS + as a list. But there shouldn't be more than 1 for behavior. This + tries to convert to str + + Parameters + ---------- + reporter_line + List of reporter line + warn + Whether to output warnings if parsing fails + + Returns + --------- + single reporter line, or None if not possible + """ + if reporter_line is None: + if warn: + warnings.warn('Error parsing reporter line. It is null.') + return None + + if len(reporter_line) == 0: + if warn: + warnings.warn('Error parsing reporter line. ' + 'The array is empty') + return None + + if isinstance(reporter_line, str): + return reporter_line + + if len(reporter_line) > 1: + if warn: + warnings.warn('More than 1 reporter line. Returning the first ' + 'one') + + return reporter_line[0] + def _get_properties(self, vars_: dict): """Returns all property names and values""" return {name: getattr(self, name) for name, value in vars_.items() @@ -351,3 +425,29 @@ def __eq__(self, other): except AssertionError: return False return True + + @staticmethod + def parse_indicator(reporter_line: Optional[str], warn=False) -> Optional[ + str]: + """Parses indicator from reporter""" + reporter_substring_indicator_map = { + 'GCaMP6f': 'GCaMP6f', + 'GC6f': 'GCaMP6f', + 'GCaMP6s': 'GCaMP6s' + } + if reporter_line is None: + if warn: + warnings.warn( + 'Could not parse indicator from reporter because ' + 'there is no reporter') + return None + + for substr, indicator in reporter_substring_indicator_map.items(): + if substr in reporter_line: + return indicator + + if warn: + warnings.warn( + 'Could not parse indicator from reporter because none' + 'of the expected substrings were found in the reporter') + return None diff --git a/allensdk/brain_observatory/behavior/metadata/behavior_ophys_metadata.py b/allensdk/brain_observatory/behavior/metadata/behavior_ophys_metadata.py index 61fd0bef0..b2c597cad 100644 --- a/allensdk/brain_observatory/behavior/metadata/behavior_ophys_metadata.py +++ b/allensdk/brain_observatory/behavior/metadata/behavior_ophys_metadata.py @@ -1,5 +1,3 @@ -import warnings - import numpy as np from typing import Optional @@ -35,9 +33,10 @@ def emission_lambda(self) -> float: def excitation_lambda(self) -> float: return 910.0 + # TODO rename to ophys_container_id @property def experiment_container_id(self) -> int: - return self._extractor.get_experiment_container_id() + return self._extractor.get_ophys_container_id() @property def field_of_view_height(self) -> int: @@ -59,27 +58,6 @@ def 
imaging_plane_group(self) -> Optional[int]: def imaging_plane_group_count(self) -> int: return self._extractor.get_plane_group_count() - @property - def indicator(self) -> Optional[str]: - """Parses indicator from reporter""" - reporter_substring_indicator_map = { - 'GCaMP6f': 'GCaMP6f', - 'GC6f': 'GCaMP6f', - 'GCaMP6s': 'GCaMP6s' - } - if self.reporter_line is None: - warnings.warn('Could not parse indicator from reporter because ' - 'there is no reporter') - return None - - for substr, indicator in reporter_substring_indicator_map.items(): - if substr in self.reporter_line: - return indicator - - warnings.warn('Could not parse indicator from reporter because none' - 'of the expected substrings were found in the reporter') - return None - @property def ophys_experiment_id(self) -> int: return self._extractor.get_ophys_experiment_id() diff --git a/allensdk/brain_observatory/behavior/metadata/util.py b/allensdk/brain_observatory/behavior/metadata/util.py deleted file mode 100644 index fd70188fe..000000000 --- a/allensdk/brain_observatory/behavior/metadata/util.py +++ /dev/null @@ -1,43 +0,0 @@ -import re -import warnings -from typing import Optional - - -def parse_cre_line(full_genotype: str) -> Optional[str]: - """ - Parameters - ---------- - full_genotype - formatted from LIMS, e.g. - Vip-IRES-Cre/wt;Ai148(TIT2L-GC6f-ICL-tTA2)/wt - - Returns - ---------- - cre_line - just the Cre line, e.g. Vip-IRES-Cre, or None if not possible to parse - """ - if ';' not in full_genotype: - return None - return full_genotype.split(';')[0].replace('/wt', '') - - -def parse_age_in_days(age: str) -> Optional[int]: - """Converts the age code into a numeric days representation - - Parameters - ---------- - age - age code, ie P123 - """ - if not age.startswith('P'): - warnings.warn('Could not parse numeric age from age code') - return None - - match = re.search(r'\d+', age) - - if match is None: - warnings.warn('Could not parse numeric age from age code') - return None - - start, end = match.span() - return int(age[start:end]) diff --git a/allensdk/brain_observatory/behavior/project_apis/abcs/behavior_project_base.py b/allensdk/brain_observatory/behavior/project_apis/abcs/behavior_project_base.py index 380e68862..7f701aa27 100644 --- a/allensdk/brain_observatory/behavior/project_apis/abcs/behavior_project_base.py +++ b/allensdk/brain_observatory/behavior/project_apis/abcs/behavior_project_base.py @@ -1,8 +1,8 @@ from abc import ABC, abstractmethod from typing import Iterable -from allensdk.brain_observatory.behavior.behavior_ophys_session import ( - BehaviorOphysSession) +from allensdk.brain_observatory.behavior.behavior_ophys_experiment import ( + BehaviorOphysExperiment) from allensdk.brain_observatory.behavior.behavior_session import ( BehaviorSession) import pandas as pd @@ -10,23 +10,24 @@ class BehaviorProjectBase(ABC): @abstractmethod - def get_session_data(self, ophys_session_id: int) -> BehaviorOphysSession: - """Returns a BehaviorOphysSession object that contains methods + def get_behavior_ophys_experiment(self, ophys_experiment_id: int + ) -> BehaviorOphysExperiment: + """Returns a BehaviorOphysExperiment object that contains methods to analyze a single behavior+ophys session. 
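The static parsers consolidated onto `BehaviorMetadata` earlier in this diff behave as their docstrings describe; expected values, for illustration:
```
from allensdk.brain_observatory.behavior.metadata.behavior_metadata import \
    BehaviorMetadata

BehaviorMetadata.parse_cre_line(
    'Vip-IRES-Cre/wt;Ai148(TIT2L-GC6f-ICL-tTA2)/wt')    # 'Vip-IRES-Cre'
BehaviorMetadata.parse_age_in_days('P123')              # 123
BehaviorMetadata.parse_indicator('Ai148(TIT2L-GC6f-ICL-tTA2)')  # 'GCaMP6f'
```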
- :param ophys_session_id: id that corresponds to a behavior session - :type ophys_session_id: int - :rtype: BehaviorOphysSession + :param ophys_experiment_id: id that corresponds to an ophys experiment + :type ophys_experiment_id: int + :rtype: BehaviorOphysExperiment """ pass @abstractmethod - def get_session_table(self) -> pd.DataFrame: + def get_ophys_session_table(self) -> pd.DataFrame: """Return a pd.Dataframe table with all ophys_session_ids and relevant metadata.""" pass @abstractmethod - def get_behavior_only_session_data( + def get_behavior_session( self, behavior_session_id: int) -> BehaviorSession: """Returns a BehaviorSession object that contains methods to analyze a single behavior session. @@ -37,7 +38,7 @@ def get_behavior_only_session_data( pass @abstractmethod - def get_behavior_only_session_table(self) -> pd.DataFrame: + def get_behavior_session_table(self) -> pd.DataFrame: """Returns a pd.DataFrame table with all behavior session_ids to the user with additional metadata. :rtype: pd.DataFrame @@ -46,21 +47,21 @@ def get_behavior_only_session_table(self) -> pd.DataFrame: @abstractmethod def get_natural_movie_template(self, number: int) -> Iterable[bytes]: - """Download a template for the natural scene stimulus. This is the - actual image that was shown during the recording session. - :param number: idenfifier for this movie (note that this is an int, - so to get the template for natural_movie_three should pass 3) + """ Download a template for the natural movie stimulus. This is the + actual movie that was shown during the recording session. + :param number: identifier for this scene :type number: int - :returns: iterable yielding a tiff file as bytes + :returns: An iterable yielding an npy file as bytes """ pass @abstractmethod def get_natural_scene_template(self, number: int) -> Iterable[bytes]: - """ Download a template for the natural movie stimulus. This is the - actual movie that was shown during the recording session. - :param number: identifier for this scene + """Download a template for the natural scene stimulus. This is the + actual image that was shown during the recording session. 
+ :param number: idenfifier for this movie (note that this is an int, + so to get the template for natural_movie_three should pass 3) :type number: int - :returns: An iterable yielding an npy file as bytes + :returns: iterable yielding a tiff file as bytes """ pass diff --git a/allensdk/brain_observatory/behavior/project_apis/data_io/__init__.py b/allensdk/brain_observatory/behavior/project_apis/data_io/__init__.py index 91b3aa8fc..6dcade699 100644 --- a/allensdk/brain_observatory/behavior/project_apis/data_io/__init__.py +++ b/allensdk/brain_observatory/behavior/project_apis/data_io/__init__.py @@ -1 +1,2 @@ from allensdk.brain_observatory.behavior.project_apis.data_io.behavior_project_lims_api import BehaviorProjectLimsApi # noqa: F401, E501 +from allensdk.brain_observatory.behavior.project_apis.data_io.behavior_project_cloud_api import BehaviorProjectCloudApi # noqa: F401, E501 diff --git a/allensdk/brain_observatory/behavior/project_apis/data_io/behavior_project_cloud_api.py b/allensdk/brain_observatory/behavior/project_apis/data_io/behavior_project_cloud_api.py new file mode 100644 index 000000000..ddee6ff3e --- /dev/null +++ b/allensdk/brain_observatory/behavior/project_apis/data_io/behavior_project_cloud_api.py @@ -0,0 +1,378 @@ +import pandas as pd +from typing import Iterable, Union, Dict, List, Optional +from pathlib import Path +import logging +import ast +import semver + +from allensdk.brain_observatory.behavior.project_apis.abcs import ( + BehaviorProjectBase) +from allensdk.brain_observatory.behavior.behavior_session import ( + BehaviorSession) +from allensdk.brain_observatory.behavior.behavior_ophys_experiment import ( + BehaviorOphysExperiment) +from allensdk.api.cloud_cache.cloud_cache import S3CloudCache, LocalCache +from allensdk import __version__ as sdk_version + + +# [min inclusive, max exclusive) +COMPATIBILITY = { + "pipeline_versions": { + "2.9.0": {"AllenSDK": ["2.9.0", "3.0.0"]}, + "2.10.0": {"AllenSDK": ["2.10.0", "3.0.0"]} + } +} + + +class BehaviorCloudCacheVersionException(Exception): + pass + + +def version_check(pipeline_versions: List[Dict[str, str]], + sdk_version: str = sdk_version, + compatibility: Dict[str, Dict] = COMPATIBILITY): + """given a pipeline_versions list (from manifest) determine + the pipeline version of AllenSDK used to write the data. Lookup + the compatibility limits, and check the the running version of + AllenSDK meets those limits. + + Parameters + ---------- + pipeline_versions: List[Dict[str, str]]: + each element has keys name, version, (and comment - not used here) + sdk_version: str + typically the current return value for allensdk.__version__ + compatibility_dict: Dict + keys (under 'pipeline_versions' key) are specific version numbers to + match a pipeline version for AllenSDK from the manifest. values + specify the min (inclusive) and max (exclusive) limits for + interoperability + + Raises + ------ + BehaviorCloudCacheVersionException + + """ + pipeline_version = [i for i in pipeline_versions + if "AllenSDK" == i["name"]] + if len(pipeline_version) != 1: + raise BehaviorCloudCacheVersionException( + "expected to find 1 and only 1 entry for `AllenSDK` " + "in the manifest.data_pipeline metadata. 
" + f"found {len(pipeline_version)}") + pipeline_version = pipeline_version[0]["version"] + if pipeline_version not in compatibility["pipeline_versions"]: + raise BehaviorCloudCacheVersionException( + f"no version compatibility listed for {pipeline_version}") + version_limits = compatibility["pipeline_versions"][pipeline_version] + smin = semver.VersionInfo.parse(version_limits["AllenSDK"][0]) + smax = semver.VersionInfo.parse(version_limits["AllenSDK"][1]) + if (sdk_version < smin) | (sdk_version >= smax): + raise BehaviorCloudCacheVersionException( + f""" + The version of the visual-behavior-ophys data files (specified + in path_to_users_current_release_manifest) requires that your + AllenSDK version be >={smin} and <{smax}. + Your version of AllenSDK is: {sdk_version}. + If you want to use the specified manifest to retrieve data, please + upgrade or downgrade AllenSDK to the range specified. + If you just want to get the latest version of visual-behavior-ophys + data please upgrade to the latest AllenSDK version and try this + process again.""") + + +def literal_col_eval(df: pd.DataFrame, + columns: List[str] = ["ophys_experiment_id", + "ophys_container_id", + "driver_line"]) -> pd.DataFrame: + def converter(x): + if isinstance(x, str): + x = ast.literal_eval(x) + return x + + for column in columns: + if column in df.columns: + df.loc[df[column].notnull(), column] = \ + df[column][df[column].notnull()].apply(converter) + return df + + +class BehaviorProjectCloudApi(BehaviorProjectBase): + """API for downloading data released on S3 and returning tables. + + Parameters + ---------- + cache: S3CloudCache + an instantiated S3CloudCache object, which has already run + `self.load_manifest()` which populates the columns: + - metadata_file_names + - file_id_column + skip_version_check: bool + whether to skip the version checking of pipeline SDK version + vs. running SDK version, which may raise Exceptions. (default=False) + local: bool + Whether to operate in local mode, where no data will be downloaded + and instead will be loaded from local + """ + def __init__(self, cache: Union[S3CloudCache, LocalCache], + skip_version_check: bool = False, + local: bool = False): + expected_metadata = set(["behavior_session_table", + "ophys_session_table", + "ophys_experiment_table"]) + self.cache = cache + if cache._manifest.metadata_file_names is None: + raise RuntimeError("S3CloudCache object has no metadata " + "file names. BehaviorProjectCloudApi " + "expects a S3CloudCache passed which " + "has already run load_manifest()") + cache_metadata = set(cache._manifest.metadata_file_names) + if cache_metadata != expected_metadata: + raise RuntimeError("expected S3CloudCache object to have " + f"metadata file names: {expected_metadata} " + f"but it has {cache_metadata}") + if not skip_version_check: + version_check(self.cache._manifest._data_pipeline) + self.logger = logging.getLogger("BehaviorProjectCloudApi") + self._local = local + self._get_ophys_session_table() + self._get_behavior_session_table() + self._get_ophys_experiment_table() + + @staticmethod + def from_s3_cache(cache_dir: Union[str, Path], + bucket_name: str, + project_name: str) -> "BehaviorProjectCloudApi": + """instantiates this object with a connection to an s3 bucket and/or + a local cache related to that bucket. 
+
+
+class BehaviorProjectCloudApi(BehaviorProjectBase):
+    """API for downloading data released on S3 and returning tables.
+
+    Parameters
+    ----------
+    cache: S3CloudCache
+        an instantiated S3CloudCache object which has already run
+        `self.load_manifest()`, which populates:
+        - metadata_file_names
+        - file_id_column
+    skip_version_check: bool
+        whether to skip checking the pipeline SDK version against the
+        running SDK version, which may raise Exceptions. (default=False)
+    local: bool
+        whether to operate in local mode, where no data will be downloaded
+        and files will instead be loaded from the local cache
+    """
+    def __init__(self, cache: Union[S3CloudCache, LocalCache],
+                 skip_version_check: bool = False,
+                 local: bool = False):
+        expected_metadata = set(["behavior_session_table",
+                                 "ophys_session_table",
+                                 "ophys_experiment_table"])
+        self.cache = cache
+        if cache._manifest.metadata_file_names is None:
+            raise RuntimeError("S3CloudCache object has no metadata "
+                               "file names. BehaviorProjectCloudApi "
+                               "expects an S3CloudCache that has "
+                               "already run load_manifest()")
+        cache_metadata = set(cache._manifest.metadata_file_names)
+        if cache_metadata != expected_metadata:
+            raise RuntimeError("expected S3CloudCache object to have "
+                               f"metadata file names: {expected_metadata} "
+                               f"but it has {cache_metadata}")
+        if not skip_version_check:
+            version_check(self.cache._manifest._data_pipeline)
+        self.logger = logging.getLogger("BehaviorProjectCloudApi")
+        self._local = local
+        self._get_ophys_session_table()
+        self._get_behavior_session_table()
+        self._get_ophys_experiment_table()
+
+    @staticmethod
+    def from_s3_cache(cache_dir: Union[str, Path],
+                      bucket_name: str,
+                      project_name: str) -> "BehaviorProjectCloudApi":
+        """instantiates this object with a connection to an s3 bucket and/or
+        a local cache related to that bucket.
+
+        Parameters
+        ----------
+        cache_dir: str or pathlib.Path
+            Path to the directory where data will be stored on the local
+            system
+
+        bucket_name: str
+            for example, if bucket URI is 's3://mybucket' this value should be
+            'mybucket'
+
+        project_name: str
+            the name of the project this cache is supposed to access. This
+            project name is the first part of the prefix of the release data
+            objects, i.e. s3://<bucket_name>/<project_name>/...
+
+        Returns
+        -------
+        BehaviorProjectCloudApi instance
+
+        """
+        cache = S3CloudCache(cache_dir, bucket_name, project_name)
+        cache.load_latest_manifest()
+        return BehaviorProjectCloudApi(cache)
+
+    @staticmethod
+    def from_local_cache(cache_dir: Union[str, Path],
+                         project_name: str) -> "BehaviorProjectCloudApi":
+        """instantiates this object with a local cache.
+
+        Parameters
+        ----------
+        cache_dir: str or pathlib.Path
+            Path to the directory where data will be stored on the local
+            system
+
+        project_name: str
+            the name of the project this cache is supposed to access. This
+            project name is the first part of the prefix of the release data
+            objects, i.e. s3://<bucket_name>/<project_name>/...
+
+        Returns
+        -------
+        BehaviorProjectCloudApi instance
+
+        """
+        cache = LocalCache(cache_dir, project_name)
+        cache.load_latest_manifest()
+        return BehaviorProjectCloudApi(cache, local=True)
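
A usage sketch for the two constructors; the cache directory, bucket, and project names below are placeholders, not real release endpoints:

```python
from pathlib import Path

from allensdk.brain_observatory.behavior.project_apis.data_io import (
    BehaviorProjectCloudApi)

# First run: fetch the latest manifest (and, later, files) from S3
api = BehaviorProjectCloudApi.from_s3_cache(
    cache_dir=Path("/tmp/visual_behavior_cache"),
    bucket_name="my-bucket",       # placeholder
    project_name="my-project")     # placeholder

# Subsequent offline runs can reuse the same directory
offline_api = BehaviorProjectCloudApi.from_local_cache(
    cache_dir=Path("/tmp/visual_behavior_cache"),
    project_name="my-project")
```
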
For " + f"{ophys_experiment_id} " + f" there are {row.shape[0]} entries.") + file_id = str(int(row[self.cache.file_id_column])) + data_path = self._get_data_path(file_id=file_id) + return BehaviorOphysExperiment.from_nwb_path(str(data_path)) + + def _get_ophys_session_table(self): + session_table_path = self._get_metadata_path( + fname="ophys_session_table") + df = literal_col_eval(pd.read_csv(session_table_path)) + self._ophys_session_table = df.set_index("ophys_session_id") + + def get_ophys_session_table(self) -> pd.DataFrame: + """Return a pd.Dataframe table summarizing ophys_sessions + and associated metadata. + + Notes + ----- + - Each entry in this table represents the metadata of an ophys_session. + Link to nwb-hosted files in the cache is had via the + 'ophys_experiment_id' column (can be a list) + and experiment_table + """ + return self._ophys_session_table + + def _get_behavior_session_table(self): + session_table_path = self._get_metadata_path( + fname='behavior_session_table') + df = literal_col_eval(pd.read_csv(session_table_path)) + self._behavior_session_table = df.set_index("behavior_session_id") + + def get_behavior_session_table(self) -> pd.DataFrame: + """Return a pd.Dataframe table with both behavior-only + (BehaviorSession) and with-ophys (BehaviorOphysExperiment) + sessions as entries. + + Notes + ----- + - In the first case, provides a critical mapping of + behavior_session_id to file_id, which the cache uses to find the + nwb path in cache. + - In the second case, provides a critical mapping of + behavior_session_id to a list of ophys_experiment_id(s) + which can be used to find file_id mappings in ophys_experiment_table + see method get_behavior_session() + """ + return self._behavior_session_table + + def _get_ophys_experiment_table(self): + experiment_table_path = self._get_metadata_path( + fname="ophys_experiment_table") + df = literal_col_eval(pd.read_csv(experiment_table_path)) + self._ophys_experiment_table = df.set_index("ophys_experiment_id") + + def get_ophys_experiment_table(self): + """returns a pd.DataFrame where each entry has a 1-to-1 + relation with an ophys experiment (i.e. imaging plane) + + Notes + ----- + - the file_id column allows the underlying cache to link + this table to a cache-hosted NWB file. There is a 1-to-1 + relation between nwb files and ophy experiments. See method + get_behavior_ophys_experiment() + """ + return self._ophys_experiment_table + + def get_natural_movie_template(self, number: int) -> Iterable[bytes]: + """ Download a template for the natural movie stimulus. This is the + actual movie that was shown during the recording session. + :param number: identifier for this scene + :type number: int + :returns: An iterable yielding an npy file as bytes + """ + raise NotImplementedError() + + def get_natural_scene_template(self, number: int) -> Iterable[bytes]: + """Download a template for the natural scene stimulus. This is the + actual image that was shown during the recording session. 
+        :param number: identifier for this scene
+        :type number: int
+        :returns: An iterable yielding a tiff file as bytes
+        """
+        raise NotImplementedError()
+
+    def _get_metadata_path(self, fname: str):
+        if self._local:
+            path = self._get_local_path(fname=fname)
+        else:
+            path = self.cache.download_metadata(fname=fname)
+        return path
+
+    def _get_data_path(self, file_id: str):
+        if self._local:
+            data_path = self._get_local_path(file_id=file_id)
+        else:
+            data_path = self.cache.download_data(file_id=file_id)
+        return data_path
+
+    def _get_local_path(self, fname: Optional[str] = None,
+                        file_id: Optional[str] = None):
+        if fname is None and file_id is None:
+            raise ValueError('Must pass either fname or file_id')
+
+        if fname is not None and file_id is not None:
+            raise ValueError('Must pass only one of fname or file_id')
+
+        if fname is not None:
+            path = self.cache.metadata_path(fname=fname)
+        else:
+            path = self.cache.data_path(file_id=file_id)
+
+        exists = path['exists']
+        local_path = path['local_path']
+        if not exists:
+            raise FileNotFoundError('You started a cache without a '
+                                    f'connection to s3 and {local_path} is '
+                                    'not already on your system')
+        return local_path
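
Taken together, the intended read path is table -> id -> NWB-backed object. A sketch continuing from the constructor example above:

```python
# The tables are plain DataFrames indexed by their id columns
experiments = api.get_ophys_experiment_table()
sessions = api.get_behavior_session_table()

# Fetching an object downloads (or reuses) the NWB file behind file_id
experiment = api.get_behavior_ophys_experiment(experiments.index[0])
behavior_session = api.get_behavior_session(sessions.index[0])
```
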
diff --git a/allensdk/brain_observatory/behavior/project_apis/data_io/behavior_project_lims_api.py b/allensdk/brain_observatory/behavior/project_apis/data_io/behavior_project_lims_api.py
index ae2c230b1..bf870971d 100644
--- a/allensdk/brain_observatory/behavior/project_apis/data_io/behavior_project_lims_api.py
+++ b/allensdk/brain_observatory/behavior/project_apis/data_io/behavior_project_lims_api.py
@@ -6,13 +6,13 @@
     BehaviorProjectBase)
 from allensdk.brain_observatory.behavior.behavior_session import (
     BehaviorSession)
-from allensdk.brain_observatory.behavior.behavior_ophys_session import (
-    BehaviorOphysSession)
+from allensdk.brain_observatory.behavior.behavior_ophys_experiment import (
+    BehaviorOphysExperiment)
 from allensdk.brain_observatory.behavior.session_apis.data_io import (
     BehaviorLimsApi, BehaviorOphysLimsApi)
 from allensdk.internal.api import db_connection_creator
-from allensdk.brain_observatory.ecephys.ecephys_project_api.http_engine import (
-    HttpEngine)
+from allensdk.brain_observatory.ecephys.ecephys_project_api.http_engine \
+    import (HttpEngine)
 from allensdk.core.typing import SupportsStr
 from allensdk.core.authentication import DbCredentials
 from allensdk.core.auth_config import (
@@ -20,7 +20,8 @@

 class BehaviorProjectLimsApi(BehaviorProjectBase):
-    def __init__(self, lims_engine, mtrain_engine, app_engine):
+    def __init__(self, lims_engine, mtrain_engine, app_engine,
+                 data_release_date: Optional[str] = None):
         """ Downloads visual behavior data from the Allen Institute's
         internal Laboratory Information Management System (LIMS). Only
         functional if connected to the Allen Institute Network. Used to load
@@ -59,18 +60,23 @@ def __init__(self, lims_engine, mtrain_engine, app_engine):
             implement:
                 stream : takes a url as a string. Returns an iterable yielding
             the response body as bytes.
+        data_release_date: Optional[str]
+            Used to filter tables to only include data released on this
+            date, e.g. '2021-03-25'
         """
         self.lims_engine = lims_engine
         self.mtrain_engine = mtrain_engine
         self.app_engine = app_engine
+        self.data_release_date = data_release_date
         self.logger = logging.getLogger("BehaviorProjectLimsApi")

     @classmethod
     def default(
-            cls,
-            lims_credentials: Optional[DbCredentials] = None,
-            mtrain_credentials: Optional[DbCredentials] = None,
-            app_kwargs: Optional[Dict[str, Any]] = None) -> \
+            cls,
+            lims_credentials: Optional[DbCredentials] = None,
+            mtrain_credentials: Optional[DbCredentials] = None,
+            app_kwargs: Optional[Dict[str, Any]] = None,
+            data_release_date: Optional[str] = None) -> \
             "BehaviorProjectLimsApi":
         """Construct a BehaviorProjectLimsApi instance with default
         postgres and app engines.
@@ -85,6 +91,9 @@ def default(
             Credentials to pass to the postgres connector to the mtrain
             database. If left unspecified, will check environment variables
             for the appropriate values.
+        data_release_date: Optional[str]
+            Filters tables to include only data released on this date,
+            e.g. '2021-03-25'
         app_kwargs: Dict
             Dict of arguments to pass to the app engine. Currently unused.

@@ -105,7 +114,8 @@ def default(
             fallback_credentials=MTRAIN_DB_CREDENTIAL_MAP)

         app_engine = HttpEngine(**_app_kwargs)
-        return cls(lims_engine, mtrain_engine, app_engine)
+        return cls(lims_engine, mtrain_engine, app_engine,
+                   data_release_date=data_release_date)

     @staticmethod
     def _build_in_list_selector_query(
@@ -136,21 +146,49 @@ def _build_in_list_selector_query(
                 sorted(set(map(str, valid_list))))})""")
         return session_query

-    @staticmethod
-    def _build_experiment_from_session_query() -> str:
+    def _build_experiment_from_session_query(self) -> str:
         """Aggregate sql sub-query to get all ophys_experiment_ids associated
         with a single ophys_session_id."""
+        if self.data_release_date:
+            release_filter = self._get_ophys_experiment_release_filter()
+        else:
+            release_filter = ''
         query = f"""
             -- -- begin getting all ophys_experiment_ids -- --
             SELECT
                 (ARRAY_AGG(DISTINCT(oe.id))) AS experiment_ids, os.id
             FROM ophys_sessions os
             RIGHT JOIN ophys_experiments oe ON oe.ophys_session_id = os.id
+            {release_filter}
             GROUP BY os.id
             -- -- end getting all ophys_experiment_ids -- --
         """
         return query

+    def _build_container_from_session_query(self) -> str:
+        """Aggregate sql sub-query to get all ophys_container_ids associated
+        with a single ophys_session_id."""
+        if self.data_release_date:
+            release_filter = self._get_ophys_experiment_release_filter()
+        else:
+            release_filter = ''
+        query = f"""
+            -- -- begin getting all ophys_container_ids -- --
+            SELECT
+                (ARRAY_AGG(
+                    DISTINCT(oec.visual_behavior_experiment_container_id))
+                ) AS container_ids, os.id
+            FROM ophys_experiments_visual_behavior_experiment_containers oec
+            JOIN visual_behavior_experiment_containers vbc
+                ON oec.visual_behavior_experiment_container_id = vbc.id
+            JOIN ophys_experiments oe ON oe.id = oec.ophys_experiment_id
+            JOIN ophys_sessions os ON os.id = oe.ophys_session_id
+            {release_filter}
+            GROUP BY os.id
+            -- -- end getting all ophys_container_ids -- --
+        """
+        return query
+
     @staticmethod
     def _build_line_from_donor_query(line="driver") -> str:
         """Sub-query to get a line from a donor.
@@ -169,21 +207,16 @@ def _build_line_from_donor_query(line="driver") -> str: """ return query - def _get_behavior_summary_table(self, - session_sub_query: str) -> pd.DataFrame: + def _get_behavior_summary_table(self) -> pd.DataFrame: """Build and execute query to retrieve summary data for all data, or a subset of session_ids (via the session_sub_query). Should pass an empty string to `session_sub_query` if want to get all data in the database. - :param session_sub_query: additional filtering logic to get a - subset of sessions. - :type session_sub_query: str :rtype: pd.DataFrame """ query = f""" SELECT bs.id AS behavior_session_id, - bs.ophys_session_id, equipment.name as equipment_name, bs.date_of_acquisition, d.id as donor_id, @@ -205,8 +238,11 @@ def _get_behavior_summary_table(self, {self._build_line_from_donor_query("driver")} ) driver on driver.donor_id = d.id LEFT OUTER JOIN equipment ON equipment.id = bs.equipment_id - {session_sub_query} """ + + if self.data_release_date is not None: + query += self._get_behavior_session_release_filter() + self.logger.debug(f"get_behavior_session_table query: \n{query}") return self.lims_engine.select(query) @@ -255,18 +291,46 @@ def _get_behavior_stage_table( self.logger.debug(f"_get_behavior_stage_table query: \n {query}") return self.mtrain_engine.select(query) - def get_session_data(self, ophys_session_id: int) -> BehaviorOphysSession: - """Returns a BehaviorOphysSession object that contains methods + def get_behavior_stage_parameters(self, + foraging_ids: List[str]) -> pd.Series: + """Gets the stage parameters for each foraging id from mtrain + + Parameters + ---------- + foraging_ids + List of foraging ids + + + Returns + --------- + Series with index of foraging id and values stage parameters + """ + foraging_ids_query = self._build_in_list_selector_query( + "bs.id", foraging_ids) + + query = f""" + SELECT + bs.id AS foraging_id, + stages.parameters as stage_parameters + FROM behavior_sessions bs + JOIN stages ON stages.id = bs.state_id + {foraging_ids_query}; + """ + df = self.mtrain_engine.select(query) + df = df.set_index('foraging_id') + return df['stage_parameters'] + + def get_behavior_ophys_experiment(self, ophys_experiment_id: int + ) -> BehaviorOphysExperiment: + """Returns a BehaviorOphysExperiment object that contains methods to analyze a single behavior+ophys session. - :param ophys_session_id: id that corresponds to a behavior session - :type ophys_session_id: int - :rtype: BehaviorOphysSession + :param ophys_experiment_id: id that corresponds to an ophys experiment + :type ophys_experiment_id: int + :rtype: BehaviorOphysExperiment """ - return BehaviorOphysSession(BehaviorOphysLimsApi(ophys_session_id)) + return BehaviorOphysExperiment(BehaviorOphysLimsApi(ophys_experiment_id)) - def _get_experiment_table( - self, - ophys_experiment_ids: Optional[List[int]] = None) -> pd.DataFrame: + def _get_ophys_experiment_table(self) -> pd.DataFrame: """ Helper function for easier testing. Return a pd.Dataframe table with all ophys_experiment_ids and relevant @@ -277,38 +341,21 @@ def _get_experiment_table( specimen_id, full_genotype, sex, age_in_days, reporter_line, driver_line, mouse_id - :param ophys_experiment_ids: optional list of ophys_experiment_ids - to include :rtype: pd.DataFrame """ - if not ophys_experiment_ids: - self.logger.warning("Getting all ophys sessions." 
- " This might take a while.") - experiment_query = self._build_in_list_selector_query( - "oe.id", ophys_experiment_ids) query = f""" SELECT oe.id as ophys_experiment_id, os.id as ophys_session_id, bs.id as behavior_session_id, - oec.visual_behavior_experiment_container_id as container_id, + oec.visual_behavior_experiment_container_id as + ophys_container_id, pr.code as project_code, vbc.workflow_state as container_workflow_state, oe.workflow_state as experiment_workflow_state, os.name as session_name, - os.stimulus_name as session_type, - equipment.name as equipment_name, os.date_of_acquisition, os.isi_experiment_id, - os.specimen_id, - d.id as donor_id, - g.name as sex, - DATE_PART('day', os.date_of_acquisition - d.date_of_birth) - AS age_in_days, - d.full_genotype, - d.external_donor_name AS mouse_id, - reporter.reporter_line, - driver.driver_line, id.depth as imaging_depth, st.acronym as targeted_structure, vbc.published_at @@ -317,27 +364,19 @@ def _get_experiment_table( ON oec.visual_behavior_experiment_container_id = vbc.id JOIN ophys_experiments oe ON oe.id = oec.ophys_experiment_id JOIN ophys_sessions os ON os.id = oe.ophys_session_id - LEFT OUTER JOIN behavior_sessions bs ON os.id = bs.ophys_session_id + JOIN behavior_sessions bs ON os.id = bs.ophys_session_id LEFT OUTER JOIN projects pr ON pr.id = os.project_id - JOIN donors d ON d.id = bs.donor_id - JOIN genders g ON g.id = d.gender_id - LEFT OUTER JOIN ( - {self._build_line_from_donor_query(line="reporter")} - ) reporter on reporter.donor_id = d.id - LEFT OUTER JOIN ( - {self._build_line_from_donor_query(line="driver")} - ) driver on driver.donor_id = d.id LEFT JOIN imaging_depths id ON id.id = oe.imaging_depth_id JOIN structures st ON st.id = oe.targeted_structure_id - LEFT OUTER JOIN equipment ON equipment.id = os.equipment_id - {experiment_query}; """ - self.logger.debug(f"get_experiment_table query: \n{query}") + + if self.data_release_date is not None: + query += self._get_ophys_experiment_release_filter() + + self.logger.debug(f"get_ophys_experiment_table query: \n{query}") return self.lims_engine.select(query) - def _get_session_table( - self, - ophys_session_ids: Optional[List[int]] = None) -> pd.DataFrame: + def _get_ophys_session_table(self) -> pd.DataFrame: """Helper function for easier testing. Return a pd.Dataframe table with all ophys_session_ids and relevant metadata. @@ -347,56 +386,35 @@ def _get_session_table( specimen_id, full_genotype, sex, age_in_days, reporter_line, driver_line, mouse_id - :param ophys_session_ids: optional list of ophys_session_ids to include :rtype: pd.DataFrame """ - if not ophys_session_ids: - self.logger.warning("Getting all ophys sessions." 
- " This might take a while.") - session_query = self._build_in_list_selector_query("os.id", - ophys_session_ids) query = f""" SELECT os.id as ophys_session_id, bs.id as behavior_session_id, - experiment_ids as ophys_experiment_id, + exp_ids.experiment_ids as ophys_experiment_id, + cntr_ids.container_ids as ophys_container_id, pr.code as project_code, os.name as session_name, - os.stimulus_name as session_type, - equipment.name as equipment_name, os.date_of_acquisition, - os.specimen_id, - d.id as donor_id, - g.name as sex, - DATE_PART('day', os.date_of_acquisition - d.date_of_birth) - AS age_in_days, - d.full_genotype, - d.external_donor_name AS mouse_id, - reporter.reporter_line, - driver.driver_line + os.specimen_id FROM ophys_sessions os - LEFT OUTER JOIN behavior_sessions bs ON os.id = bs.ophys_session_id + JOIN behavior_sessions bs ON os.id = bs.ophys_session_id LEFT OUTER JOIN projects pr ON pr.id = os.project_id - JOIN donors d ON d.id = bs.donor_id - JOIN genders g ON g.id = d.gender_id JOIN ( {self._build_experiment_from_session_query()} ) exp_ids ON os.id = exp_ids.id - LEFT OUTER JOIN ( - {self._build_line_from_donor_query(line="reporter")} - ) reporter on reporter.donor_id = d.id - LEFT OUTER JOIN ( - {self._build_line_from_donor_query(line="driver")} - ) driver on driver.donor_id = d.id - LEFT OUTER JOIN equipment ON equipment.id = os.equipment_id - {session_query}; + JOIN ( + {self._build_container_from_session_query()} + ) cntr_ids ON os.id = cntr_ids.id """ - self.logger.debug(f"get_session_table query: \n{query}") + + if self.data_release_date is not None: + query += self._get_ophys_session_release_filter() + self.logger.debug(f"get_ophys_session_table query: \n{query}") return self.lims_engine.select(query) - def get_session_table( - self, - ophys_session_ids: Optional[List[int]] = None) -> pd.DataFrame: + def get_ophys_session_table(self) -> pd.DataFrame: """Return a pd.Dataframe table with all ophys_session_ids and relevant metadata. Return columns: ophys_session_id, behavior_session_id, @@ -404,18 +422,16 @@ def get_session_table( session_type, equipment_name, date_of_acquisition, specimen_id, full_genotype, sex, age_in_days, reporter_line, driver_line - - :param ophys_session_ids: optional list of ophys_session_ids to include :rtype: pd.DataFrame """ # There is one ophys_session_id from 2018 that has multiple behavior # ids, causing duplicates -- drop all dupes for now; # TODO - table = (self._get_session_table(ophys_session_ids) + table = (self._get_ophys_session_table() .drop_duplicates(subset=["ophys_session_id"], keep=False) .set_index("ophys_session_id")) return table - def get_behavior_only_session_data( + def get_behavior_session( self, behavior_session_id: int) -> BehaviorSession: """Returns a BehaviorSession object that contains methods to analyze a single behavior session. @@ -425,7 +441,7 @@ def get_behavior_only_session_data( """ return BehaviorSession(BehaviorLimsApi(behavior_session_id)) - def get_experiment_table( + def get_ophys_experiment_table( self, ophys_experiment_ids: Optional[List[int]] = None) -> pd.DataFrame: """Return a pd.Dataframe table with all ophys_experiment_ids and @@ -433,7 +449,7 @@ def get_experiment_table( level to examine the data. 
Return columns: ophys_experiment_id, ophys_session_id, behavior_session_id, - container_id, project_code, container_workflow_state, + ophys_container_id, project_code, container_workflow_state, experiment_workflow_state, session_name, session_type, equipment_name, date_of_acquisition, isi_experiment_id, specimen_id, sex, age_in_days, full_genotype, reporter_line, @@ -442,11 +458,9 @@ def get_experiment_table( to include :rtype: pd.DataFrame """ - return self._get_experiment_table().set_index("ophys_experiment_id") + return self._get_ophys_experiment_table().set_index("ophys_experiment_id") - def get_behavior_only_session_table( - self, - behavior_session_ids: Optional[List[int]] = None) -> pd.DataFrame: + def get_behavior_session_table(self) -> pd.DataFrame: """Returns a pd.DataFrame table with all behavior session_ids to the user with additional metadata. @@ -454,27 +468,105 @@ def get_behavior_only_session_table( acquisition date for behavior sessions (only in the stimulus pkl file) :rtype: pd.DataFrame """ - self.logger.warning("Getting behavior-only session data. " - "This might take a while...") - session_query = self._build_in_list_selector_query( - "bs.id", behavior_session_ids) - summary_tbl = self._get_behavior_summary_table(session_query) - stimulus_names = self._get_behavior_stage_table(behavior_session_ids) + summary_tbl = self._get_behavior_summary_table() + stimulus_names = self._get_behavior_stage_table( + behavior_session_ids=summary_tbl.index.tolist()) return (summary_tbl.merge(stimulus_names, on=["foraging_id"], how="left") .set_index("behavior_session_id")) - def get_natural_movie_template(self, number: int) -> Iterable[bytes]: - """Download a template for the natural scene stimulus. This is the - actual image that was shown during the recording session. - :param number: idenfifier for this movie (note that this is an int, - so to get the template for natural_movie_three should pass 3) - :type number: int - :returns: iterable yielding a tiff file as bytes + def get_release_files(self, file_type='BehaviorNwb') -> pd.DataFrame: + """Gets the release nwb files. 
+
+        Parameters
+        ----------
+        file_type
+            NWB files to return ('BehaviorNwb', 'BehaviorOphysNwb')
+
+        Returns
+        -------
+        DataFrame of release files and file metadata
+            - index: behavior_session_id or ophys_experiment_id
+            - columns: file_id and isilon_filepath
         """
-        raise NotImplementedError()
+        if self.data_release_date is None:
+            raise RuntimeError('data_release_date must be set in the '
+                               'constructor')
+
+        if file_type not in ('BehaviorNwb', 'BehaviorOphysNwb'):
+            raise ValueError(f'cannot retrieve file type {file_type}')
+
+        if file_type == 'BehaviorNwb':
+            attachable_id_alias = 'behavior_session_id'
+            select_clause = f'''
+                SELECT attachable_id as {attachable_id_alias}, id as file_id,
+                filename, storage_directory
+            '''
+            join_clause = ''
+        else:
+            attachable_id_alias = 'ophys_experiment_id'
+            select_clause = f'''
+                SELECT attachable_id as {attachable_id_alias},
+                bs.id as behavior_session_id, wkf.id as file_id,
+                filename, wkf.storage_directory
+            '''
+            join_clause = f'''
+                JOIN ophys_experiments oe ON oe.id = attachable_id
+                JOIN ophys_sessions os ON os.id = oe.ophys_session_id
+                JOIN behavior_sessions bs on bs.ophys_session_id = os.id
+            '''
+
+        query = f'''
+            {select_clause}
+            FROM well_known_files wkf
+            {join_clause}
+            WHERE published_at = '{self.data_release_date}' AND
+                well_known_file_type_id IN (
+                    SELECT id
+                    FROM well_known_file_types
+                    WHERE name = '{file_type}'
+                );
+        '''
+
+        res = self.lims_engine.select(query)
+        res['isilon_filepath'] = res['storage_directory'] \
+            .str.cat(res['filename'])
+        res = res.drop(['filename', 'storage_directory'], axis=1)
+        return res.set_index(attachable_id_alias)
+
+    def _get_behavior_session_release_filter(self):
+        # 1) Get release behavior only session ids
+        behavior_only_release_files = self.get_release_files(
+            file_type='BehaviorNwb')
+        release_behavior_only_session_ids = \
+            behavior_only_release_files.index.tolist()
+
+        # 2) Get release behavior with ophys session ids
+        ophys_release_files = self.get_release_files(
+            file_type='BehaviorOphysNwb')
+        release_behavior_with_ophys_session_ids = \
+            ophys_release_files['behavior_session_id'].tolist()
+
+        # 3) the release behavior session ids are the combination of the two
+        release_behavior_session_ids = \
+            release_behavior_only_session_ids + \
+            release_behavior_with_ophys_session_ids
+
+        return self._build_in_list_selector_query(
+            "bs.id", release_behavior_session_ids)
+
+    def _get_ophys_session_release_filter(self):
+        release_files = self.get_release_files(
+            file_type='BehaviorOphysNwb')
+        return self._build_in_list_selector_query(
+            "bs.id", release_files['behavior_session_id'].tolist())
+
+    def _get_ophys_experiment_release_filter(self):
+        release_files = self.get_release_files(
+            file_type='BehaviorOphysNwb')
+        return self._build_in_list_selector_query(
+            "oe.id", release_files.index.tolist())
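
To make the id bookkeeping above concrete, here is a sketch of how the behavior-session release filter is assembled; the ids are invented and the rendered SQL is abbreviated:

```python
# index of the BehaviorNwb release files -> behavior-only session ids
release_behavior_only_session_ids = [1001, 1002]

# 'behavior_session_id' column of the BehaviorOphysNwb release files
release_behavior_with_ophys_session_ids = [2001, 2002]

# their concatenation feeds _build_in_list_selector_query, which renders an
# IN-list selector along the lines of: WHERE bs.id IN (1001,1002,2001,2002)
release_behavior_session_ids = (
    release_behavior_only_session_ids
    + release_behavior_with_ophys_session_ids)
```
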

-    def get_natural_scene_template(self, number: int) -> Iterable[bytes]:
+    def get_natural_movie_template(self, number: int) -> Iterable[bytes]:
         """ Download a template for the natural movie stimulus. This is the
         actual movie that was shown during the recording session.
         :param number: identifier for this scene
@@ -482,3 +574,13 @@ def get_natural_scene_template(self, number: int) -> Iterable[bytes]:
         :returns: An iterable yielding an npy file as bytes
         """
         raise NotImplementedError()
+
+    def get_natural_scene_template(self, number: int) -> Iterable[bytes]:
+        """Download a template for the natural scene stimulus. This is the
+        actual image that was shown during the recording session.
+        :param number: identifier for this scene
+        :type number: int
+        :returns: An iterable yielding a tiff file as bytes
+        """
+        raise NotImplementedError()
diff --git a/allensdk/brain_observatory/behavior/session_apis/abcs/data_extractor_base/behavior_ophys_data_extractor_base.py b/allensdk/brain_observatory/behavior/session_apis/abcs/data_extractor_base/behavior_ophys_data_extractor_base.py
index afa4a6bfe..cf63a111d 100644
--- a/allensdk/brain_observatory/behavior/session_apis/abcs/data_extractor_base/behavior_ophys_data_extractor_base.py
+++ b/allensdk/brain_observatory/behavior/session_apis/abcs/data_extractor_base/behavior_ophys_data_extractor_base.py
@@ -42,7 +42,7 @@ def get_field_of_view_shape(self) -> Dict[str, int]:
         raise NotImplementedError()

     @abc.abstractmethod
-    def get_experiment_container_id(self) -> int:
+    def get_ophys_container_id(self) -> int:
         """Get the experiment container id associated with an ophys
         experiment"""
         raise NotImplementedError()
diff --git a/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_lims_api.py b/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_lims_api.py
index 12471cd99..80062557d 100644
--- a/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_lims_api.py
+++ b/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_lims_api.py
@@ -6,7 +6,7 @@

 import pandas as pd

-from allensdk.api.cache import memoize
+from allensdk.api.warehouse_cache.cache import memoize
 from allensdk.brain_observatory.behavior.session_apis.abcs.\
     data_extractor_base.behavior_data_extractor_base import \
     BehaviorDataExtractorBase
diff --git a/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_nwb_api.py b/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_nwb_api.py
index 229d16872..8adcc3f6c 100644
--- a/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_nwb_api.py
+++ b/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_nwb_api.py
@@ -12,8 +12,6 @@
 from allensdk.brain_observatory.behavior.metadata.behavior_metadata import (
     get_expt_description, BehaviorMetadata
 )
-from allensdk.brain_observatory.behavior.metadata.util import parse_cre_line, \
-    parse_age_in_days
 from allensdk.brain_observatory.behavior.session_apis.abcs.\
     session_base.behavior_base import BehaviorBase
 from allensdk.brain_observatory.behavior.schemas import (
@@ -34,11 +32,12 @@
 class BehaviorNwbApi(NwbApi, BehaviorBase):
     """A data fetching class that serves as an API for fetching 'raw'
     data from an NWB file that is both necessary and sufficient for filling
-    a 'BehaviorOphysSession'.
+    a 'BehaviorOphysExperiment'.
     """

     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
+        self._behavior_session_id = None

     def save(self, session_object):

@@ -122,7 +121,9 @@ def save(self, session_object):
         return nwbfile

     def get_behavior_session_id(self) -> int:
-        return int(self.nwbfile.identifier)
+        if self._behavior_session_id is None:
+            self.get_metadata()
+        return self._behavior_session_id

     def get_running_acquisition_df(self) -> pd.DataFrame:
         """Get running speed acquisition data.
@@ -253,16 +254,19 @@ def get_metadata(self) -> dict: metadata_nwb_obj = self.nwbfile.lab_meta_data['metadata'] data = OphysBehaviorMetadataSchema( exclude=['date_of_acquisition']).dump(metadata_nwb_obj) + self._behavior_session_id = data["behavior_session_id"] # Add pyNWB Subject metadata to behavior session metadata nwb_subject = self.nwbfile.subject data['mouse_id'] = int(nwb_subject.subject_id) data['sex'] = nwb_subject.sex - data['age_in_days'] = parse_age_in_days(age=nwb_subject.age) + data['age_in_days'] = BehaviorMetadata.parse_age_in_days( + age=nwb_subject.age) data['full_genotype'] = nwb_subject.genotype data['reporter_line'] = nwb_subject.reporter_line data['driver_line'] = sorted(list(nwb_subject.driver_line)) - data['cre_line'] = parse_cre_line(full_genotype=nwb_subject.genotype) + data['cre_line'] = BehaviorMetadata.parse_cre_line( + full_genotype=nwb_subject.genotype) # Add other metadata stored in nwb file to behavior session meta data['date_of_acquisition'] = self.nwbfile.session_start_time diff --git a/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_ophys_json_api.py b/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_ophys_json_api.py index be9721e9a..8f295ab31 100644 --- a/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_ophys_json_api.py +++ b/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_ophys_json_api.py @@ -13,7 +13,7 @@ class BehaviorOphysJsonApi(BehaviorOphysDataTransforms): """A data fetching and processing class that serves processed data from a specified raw data source (extractor). Contains all methods - needed to fill a BehaviorOphysSession.""" + needed to fill a BehaviorOphysExperiment.""" def __init__(self, data: dict, skip_eye_tracking: bool = False): extractor = BehaviorOphysJsonExtractor(data=data) @@ -24,11 +24,11 @@ def __init__(self, data: dict, skip_eye_tracking: bool = False): class BehaviorOphysJsonExtractor(BehaviorJsonExtractor, BehaviorOphysDataExtractorBase): """A class which 'extracts' data from a json file. The extracted data - is necessary (but not sufficient) for populating a 'BehaviorOphysSession'. + is necessary (but not sufficient) for populating a 'BehaviorOphysExperiment'. Most data provided by this extractor needs to be processed by BehaviorOphysDataTransforms methods in order to usable by - 'BehaviorOphysSession's. + 'BehaviorOphysExperiment's. This class is used by the write_nwb module for behavior ophys sessions. """ @@ -64,7 +64,7 @@ def get_field_of_view_shape(self) -> dict: return {'height': self.data['movie_height'], 'width': self.data['movie_width']} - def get_experiment_container_id(self) -> int: + def get_ophys_container_id(self) -> int: """Get the experiment container id associated with an ophys experiment""" return self.data['container_id'] diff --git a/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_ophys_lims_api.py b/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_ophys_lims_api.py index 334f71fad..c26f4525b 100644 --- a/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_ophys_lims_api.py +++ b/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_ophys_lims_api.py @@ -2,7 +2,7 @@ from typing import List, Optional import pandas as pd -from allensdk.api.cache import memoize +from allensdk.api.warehouse_cache.cache import memoize from allensdk.brain_observatory.behavior.session_apis.abcs. 
\ data_extractor_base.behavior_ophys_data_extractor_base import \ BehaviorOphysDataExtractorBase @@ -22,7 +22,7 @@ class BehaviorOphysLimsApi(BehaviorOphysDataTransforms, CachedInstanceMethodMixin): """A data fetching and processing class that serves processed data from a specified data source (extractor). Contains all methods - needed to populate a BehaviorOphysSession.""" + needed to populate a BehaviorOphysExperiment.""" def __init__(self, ophys_experiment_id: Optional[int] = None, @@ -50,11 +50,11 @@ class BehaviorOphysLimsExtractor(OphysLimsExtractor, BehaviorLimsExtractor, BehaviorOphysDataExtractorBase): """A data fetching class that serves as an API for fetching 'raw' data from LIMS necessary (but not sufficient) for filling - a 'BehaviorOphysSession'. + a 'BehaviorOphysExperiment'. Most 'raw' data provided by this API needs to be processed by BehaviorOphysDataTransforms methods in order to usable by - 'BehaviorOphysSession's. + 'BehaviorOphysExperiment's. """ def __init__(self, ophys_experiment_id: int, @@ -100,7 +100,7 @@ def get_project_code(self) -> str: return self.lims_db.fetchone(query, strict=True) @memoize - def get_experiment_container_id(self) -> int: + def get_ophys_container_id(self) -> int: """Get the experiment container id associated with the ophys experiment id used to initialize the API""" query = """ diff --git a/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_ophys_nwb_api.py b/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_ophys_nwb_api.py index 78f862974..84789ffb1 100644 --- a/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_ophys_nwb_api.py +++ b/allensdk/brain_observatory/behavior/session_apis/data_io/behavior_ophys_nwb_api.py @@ -46,7 +46,7 @@ class BehaviorOphysNwbApi(BehaviorNwbApi, BehaviorOphysBase): """A data fetching class that serves as an API for fetching 'raw' data from an NWB file that is both necessary and sufficient for filling - a 'BehaviorOphysSession'. + a 'BehaviorOphysExperiment'. """ def __init__(self, *args, **kwargs): @@ -270,9 +270,9 @@ def get_eye_tracking(self, dilation_frames=dilation_frames) eye_tracking_data["likely_blink"] = likely_blinks - eye_tracking_data["eye_area"][likely_blinks] = np.nan - eye_tracking_data["pupil_area"][likely_blinks] = np.nan - eye_tracking_data["cr_area"][likely_blinks] = np.nan + eye_tracking_data.at[likely_blinks, "eye_area"] = np.nan + eye_tracking_data.at[likely_blinks, "pupil_area"] = np.nan + eye_tracking_data.at[likely_blinks, "cr_area"] = np.nan return eye_tracking_data diff --git a/allensdk/brain_observatory/behavior/session_apis/data_io/ophys_lims_api.py b/allensdk/brain_observatory/behavior/session_apis/data_io/ophys_lims_api.py index 2a0566311..7025587cf 100644 --- a/allensdk/brain_observatory/behavior/session_apis/data_io/ophys_lims_api.py +++ b/allensdk/brain_observatory/behavior/session_apis/data_io/ophys_lims_api.py @@ -5,7 +5,7 @@ from allensdk.internal.api import ( OneOrMoreResultExpectedError, db_connection_creator) -from allensdk.api.cache import memoize +from allensdk.api.warehouse_cache.cache import memoize from allensdk.internal.core.lims_utilities import safe_system_path from allensdk.core.cache_method_utilities import CachedInstanceMethodMixin from allensdk.core.authentication import DbCredentials @@ -16,11 +16,11 @@ class OphysLimsExtractor(CachedInstanceMethodMixin): """A data fetching class that serves as an API for fetching 'raw' data from LIMS for filling optical physiology data. 
This data is is necessary (but not sufficient) to fill the 'Ophys' portion of a - BehaviorOphysSession. + BehaviorOphysExperiment. This class needs to be inherited by the BehaviorOphysLimsApi and also have methods from BehaviorOphysDataTransforms in order to be usable by a - BehaviorOphysSession. + BehaviorOphysExperiment. """ def __init__(self, ophys_experiment_id: int, diff --git a/allensdk/brain_observatory/behavior/session_apis/data_transforms/behavior_data_transforms.py b/allensdk/brain_observatory/behavior/session_apis/data_transforms/behavior_data_transforms.py index e088e0ebf..afb188a94 100644 --- a/allensdk/brain_observatory/behavior/session_apis/data_transforms/behavior_data_transforms.py +++ b/allensdk/brain_observatory/behavior/session_apis/data_transforms/behavior_data_transforms.py @@ -7,7 +7,7 @@ import os from allensdk.brain_observatory.behavior.metadata.behavior_metadata import \ get_task_parameters, BehaviorMetadata -from allensdk.api.cache import memoize +from allensdk.api.warehouse_cache.cache import memoize from allensdk.internal.core.lims_utilities import safe_system_path from allensdk.brain_observatory.behavior.rewards_processing import get_rewards from allensdk.brain_observatory.behavior.running_processing import \ diff --git a/allensdk/brain_observatory/behavior/session_apis/data_transforms/behavior_ophys_data_transforms.py b/allensdk/brain_observatory/behavior/session_apis/data_transforms/behavior_ophys_data_transforms.py index d92466f3b..23c7885dc 100644 --- a/allensdk/brain_observatory/behavior/session_apis/data_transforms/behavior_ophys_data_transforms.py +++ b/allensdk/brain_observatory/behavior/session_apis/data_transforms/behavior_ophys_data_transforms.py @@ -10,7 +10,7 @@ import warnings -from allensdk.api.cache import memoize +from allensdk.api.warehouse_cache.cache import memoize from allensdk.brain_observatory.behavior.metadata.behavior_ophys_metadata \ import BehaviorOphysMetadata from allensdk.brain_observatory.behavior.event_detection import \ @@ -39,7 +39,7 @@ class BehaviorOphysDataTransforms(BehaviorDataTransforms, BehaviorOphysBase): """This class provides methods that transform data extracted from LIMS or JSON data sources into final data products necessary for - populating a BehaviorOphysSession. 
+ populating a BehaviorOphysExperiment """ def __init__(self, @@ -438,7 +438,7 @@ def get_events(self, filter_scale: float = 2, filter_n_time_steps: int See filter_events_array for description - See behavior_ophys_session.events for return type + See behavior_ophys_experiment.events for return type """ events_file = self.extractor.get_event_detection_filepath() with h5py.File(events_file, 'r') as f: diff --git a/allensdk/brain_observatory/behavior/swdb/behavior_project_cache.py b/allensdk/brain_observatory/behavior/swdb/behavior_project_cache.py index 1c7a42b09..e34e9b214 100644 --- a/allensdk/brain_observatory/behavior/swdb/behavior_project_cache.py +++ b/allensdk/brain_observatory/behavior/swdb/behavior_project_cache.py @@ -5,12 +5,12 @@ import re from allensdk import one -from allensdk.brain_observatory.behavior.metadata.util import \ - parse_cre_line +from allensdk.brain_observatory.behavior.metadata.behavior_metadata import \ + BehaviorMetadata from allensdk.brain_observatory.behavior.session_apis.data_io import ( BehaviorOphysNwbApi) -from allensdk.brain_observatory.behavior.behavior_ophys_session import \ - BehaviorOphysSession +from allensdk.brain_observatory.behavior.behavior_ophys_experiment import \ + BehaviorOphysExperiment from allensdk.core.lazy_property import LazyProperty from allensdk.brain_observatory.behavior.trials_processing import \ calculate_reward_rate @@ -49,7 +49,7 @@ def __init__(self, cache_base): Methods: get_session(ophys_experiment_id): - Returns an extended BehaviorOphysSession object, including + Returns an extended BehaviorOphysExperiment object, including trial_response_df and flash_response_df get_container_sessions(container_id): @@ -72,7 +72,7 @@ def __init__(self, cache_base): self.cache_paths['manifest_path']) self.experiment_table['cre_line'] = self.experiment_table[ - 'full_genotype'].apply(parse_cre_line) + 'full_genotype'].apply(BehaviorMetadata.parse_cre_line) self.experiment_table['passive_session'] = self.experiment_table[ 'stage_name'].apply(parse_passive) self.experiment_table['image_set'] = self.experiment_table[ @@ -132,7 +132,7 @@ def get_extended_stimulus_presentations_df(self, experiment_id): def get_session(self, experiment_id): ''' - Return a BehaviorOphysSession object given an ophys_experiment_id. + Return a BehaviorOphysExperiment object given an ophys_experiment_id. ''' nwb_path = self.get_nwb_filepath(experiment_id) trial_response_df_path = self.get_trial_response_df_path(experiment_id) @@ -145,7 +145,7 @@ def get_session(self, experiment_id): flash_response_df_path, extended_stim_df_path ) - session = ExtendedBehaviorSession(api) + session = ExtendedBehaviorOphysExperiment(api) return session def get_container_sessions(self, container_id): @@ -457,7 +457,7 @@ def get_image_index_names(self): return image_index_names -class ExtendedBehaviorSession(BehaviorOphysSession): +class ExtendedBehaviorOphysExperiment(BehaviorOphysExperiment): """Represents data from a single Visual Behavior Ophys imaging session. LazyProperty attributes access the data only on the first demand, and then memoize the result for reuse. 
@@ -521,7 +521,7 @@ class ExtendedBehaviorSession(BehaviorOphysSession): """ def __init__(self, api): - super(ExtendedBehaviorSession, self).__init__(api) + super(ExtendedBehaviorOphysExperiment, self).__init__(api) self.api = api self.trial_response_df = LazyProperty(self.api.get_trial_response_df) @@ -530,7 +530,7 @@ def __init__(self, api): self.roi_masks = LazyProperty(self.get_roi_masks) def get_roi_masks(self): - masks = super(ExtendedBehaviorSession, self).get_roi_masks() + masks = super(ExtendedBehaviorOphysExperiment, self).get_roi_masks() return { cell_specimen_id: masks.loc[ {"cell_specimen_id": cell_specimen_id}].data diff --git a/allensdk/brain_observatory/behavior/swdb/save_extended_stimulus_presentations_df.py b/allensdk/brain_observatory/behavior/swdb/save_extended_stimulus_presentations_df.py index 9011abf05..574372f34 100644 --- a/allensdk/brain_observatory/behavior/swdb/save_extended_stimulus_presentations_df.py +++ b/allensdk/brain_observatory/behavior/swdb/save_extended_stimulus_presentations_df.py @@ -3,9 +3,8 @@ import numpy as np import pandas as pd -from allensdk.brain_observatory.behavior.behavior_ophys_session import ( - BehaviorOphysSession, -) +from allensdk.brain_observatory.behavior.behavior_ophys_experiment import ( + BehaviorOphysExperiment) from allensdk.brain_observatory.behavior.session_apis.data_io import ( BehaviorOphysNwbApi, BehaviorOphysLimsApi) @@ -207,7 +206,7 @@ def get_extended_stimulus_presentations(session): # experiment_id = cache.manifest.iloc[5]['ophys_experiment_id'] nwb_path = cache.get_nwb_filepath(experiment_id) api = BehaviorOphysNwbApi(nwb_path) - session = BehaviorOphysSession(api) + session = BehaviorOphysExperiment(api) # output_path = "/allen/programs/braintv/workgroups/nc-ophys/visual_behavior/SWDB_2019/extra_files_final" output_path = "/allen/programs/braintv/workgroups/nc-ophys/visual_behavior/SWDB_2019/corrected_extended_stim" @@ -228,7 +227,7 @@ def get_extended_stimulus_presentations(session): # nwb_path = cache.get_nwb_filepath(success_oeid) nwb_path = cache.get_nwb_filepath(failed_oeid) api = BehaviorOphysNwbApi(nwb_path, filter_invalid_rois = True) - session = BehaviorOphysSession(api) + session = BehaviorOphysExperiment(api) extended_stimulus_presentations_df = get_extended_stimulus_presentations(session) diff --git a/allensdk/brain_observatory/behavior/swdb/save_flash_response_df.py b/allensdk/brain_observatory/behavior/swdb/save_flash_response_df.py index 3b0c708f1..634276a69 100644 --- a/allensdk/brain_observatory/behavior/swdb/save_flash_response_df.py +++ b/allensdk/brain_observatory/behavior/swdb/save_flash_response_df.py @@ -4,7 +4,8 @@ import pandas as pd import itertools -from allensdk.brain_observatory.behavior.behavior_ophys_session import BehaviorOphysSession +from allensdk.brain_observatory.behavior.behavior_ophys_experiment import \ + BehaviorOphysExperiment from allensdk.brain_observatory.behavior.session_apis.data_io import ( BehaviorOphysNwbApi) from allensdk.brain_observatory.behavior.session_apis.data_io import ( @@ -13,7 +14,7 @@ from allensdk.brain_observatory.behavior.swdb.analysis_tools import get_nearest_frame, get_trace_around_timepoint, get_mean_in_window ''' - This script computes the flash_response_df for a BehaviorOphysSession object + This script computes the flash_response_df for a BehaviorOphysExperiment object ''' @@ -22,7 +23,7 @@ def get_flash_response_df(session, response_analysis_params): Builds the flash response dataframe for INPUTS: - BehaviorOphysSession to build the flash 
response dataframe for + BehaviorOphysExperiment to build the flash response dataframe for A dictionary with the following keys 'window_around_timepoint_seconds' is the time window to save out the dff_trace around the flash onset. 'response_window_duration_seconds' is the length of time after the flash onset to compute the mean_response @@ -83,7 +84,7 @@ def get_p_values_from_shuffled_spontaneous(session, flash_response_df, response_ magnitude in the spontaneous window. The algorithm is copied from VBA INPUTS: - a BehaviorOphysSession object + a BehaviorOphysExperiment object the flash_response_df for this session is the duration of the response_window that was used to compute the mean_response in the flash_response_df. This is used here to extract an equivalent duration df/f trace from the spontaneous timepoint the number of shuffles of spontaneous activity used to compute the pvalue @@ -144,7 +145,7 @@ def get_spontaneous_frames(session): Returns a list of the frames that occur during the before and after spontaneous windows. This is copied from VBA. Does not use the full spontaneous period because that is what VBA did. It only uses 4 minutes of the before and after spontaneous period. INPUTS: - a BehaviorOphysSession object to get all the spontaneous frames + a BehaviorOphysExperiment object to get all the spontaneous frames OUTPUTS: a list of the frames during the spontaneous period ''' @@ -189,7 +190,7 @@ def add_image_name(session,fdf): Slow to run, could probably be improved with some more intelligent use of pandas INPUTS: - a BehaviorOphysSession object + a BehaviorOphysExperiment object a flash_response_df for this session OUTPUTS: @@ -289,7 +290,7 @@ def get_mean_sem(group): cache = bpc.BehaviorProjectCache(cache_json) nwb_path = cache.get_nwb_filepath(experiment_id) api = BehaviorOphysNwbApi(nwb_path, filter_invalid_rois = True) - session = BehaviorOphysSession(api) + session = BehaviorOphysExperiment(api) # Where to save the results output_path = '/allen/programs/braintv/workgroups/nc-ophys/visual_behavior/SWDB_2019/flash_response_500msec_response' @@ -320,7 +321,7 @@ def get_mean_sem(group): # This case is just for debugging. It computes the flash_response_df on a truncated portion of the data. 
nwb_path = '/allen/programs/braintv/workgroups/nc-ophys/visual_behavior/SWDB_2019/nwb_files/behavior_ophys_session_880961028.nwb' api = BehaviorOphysNwbApi(nwb_path, filter_invalid_rois=True) - session = BehaviorOphysSession(api) + session = BehaviorOphysExperiment(api) #Small data for testing session.__dict__['dff_traces'].value = session.dff_traces.iloc[:5] diff --git a/allensdk/brain_observatory/behavior/swdb/save_trial_response_df.py b/allensdk/brain_observatory/behavior/swdb/save_trial_response_df.py index f3048b77b..42cdef385 100644 --- a/allensdk/brain_observatory/behavior/swdb/save_trial_response_df.py +++ b/allensdk/brain_observatory/behavior/swdb/save_trial_response_df.py @@ -5,7 +5,8 @@ from scipy import stats import itertools -from allensdk.brain_observatory.behavior.behavior_ophys_session import BehaviorOphysSession +from allensdk.brain_observatory.behavior.behavior_ophys_experiment import \ + BehaviorOphysExperiment from allensdk.brain_observatory.behavior.session_apis.data_io import ( BehaviorOphysNwbApi, BehaviorOphysLimsApi) from allensdk.brain_observatory.behavior.swdb import behavior_project_cache as bpc @@ -215,7 +216,7 @@ def get_trial_response_df(session, response_analysis_params): # experiment_id = cache.manifest.iloc[5]['ophys_experiment_id'] # nwb_path = cache.get_nwb_filepath(experiment_id) # api = BehaviorOphysNwbApi(nwb_path, filter_invalid_rois=True) - # session = BehaviorOphysSession(api) + # session = BehaviorOphysExperiment(api) # Get the session using the cache so that the change time fix is applied session = cache.get_session(experiment_id) @@ -246,10 +247,10 @@ def get_trial_response_df(session, response_analysis_params): experiment_id = 846487947 # api = BehaviorOphysLimsApi(experiment_id) - # session = BehaviorOphysSession(api) + # session = BehaviorOphysExperiment(api) # nwb_path = cache.get_nwb_filepath(experiment_id) # api = BehaviorOphysNwbApi(nwb_path) - # session = BehaviorOphysSession(api) + # session = BehaviorOphysExperiment(api) session = cache.get_session(experiment_id) diff --git a/allensdk/brain_observatory/behavior/trials_processing.py b/allensdk/brain_observatory/behavior/trials_processing.py index 9053e9677..8294d420e 100644 --- a/allensdk/brain_observatory/behavior/trials_processing.py +++ b/allensdk/brain_observatory/behavior/trials_processing.py @@ -269,7 +269,7 @@ def get_trial_timing( Dictionary of trial events in the well-known `pkl` file licks: List[float] list of lick timestamps, from the `get_licks` response for - the BehaviorOphysSession.api. + the BehaviorOphysExperiment.api. go: bool True if "go" trial, False otherwise. Mutually exclusive with `catch`. 
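
The scripts patched above all share one loading pattern: wrap an NWB file in an API object, then hand that API to the session class. Isolated, it looks like this sketch (the path is a placeholder):

```python
from allensdk.brain_observatory.behavior.behavior_ophys_experiment import (
    BehaviorOphysExperiment)
from allensdk.brain_observatory.behavior.session_apis.data_io import (
    BehaviorOphysNwbApi)

nwb_path = "/path/to/behavior_ophys_experiment.nwb"  # placeholder
api = BehaviorOphysNwbApi(nwb_path, filter_invalid_rois=True)
experiment = BehaviorOphysExperiment(api)
```
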
diff --git a/allensdk/brain_observatory/behavior/validation.py b/allensdk/brain_observatory/behavior/validation.py index 528e0a6a6..acb4045d0 100644 --- a/allensdk/brain_observatory/behavior/validation.py +++ b/allensdk/brain_observatory/behavior/validation.py @@ -5,7 +5,8 @@ BehaviorOphysLimsApi) from allensdk.brain_observatory.behavior.session_apis.data_io.ophys_lims_api \ import OphysLimsApi -from allensdk.brain_observatory.behavior.behavior_ophys_session import BehaviorOphysSession +from allensdk.brain_observatory.behavior.behavior_ophys_experiment import \ + BehaviorOphysExperiment class ValidationError(AssertionError): pass @@ -52,7 +53,7 @@ def validate_last_trial_ends_adjacent_to_flash(ophys_experiment_id, api=None, ve # the second carrot represents the time at which another flash should have started, after accounting for the possibility of the session ending on an omitted flash api = BehaviorOphysLimsApi() if api is None else api - session = BehaviorOphysSession(api) + session = BehaviorOphysExperiment(api) # get the flash/blank parameters max_flash_duration = session.stimulus_presentations['duration'].max() @@ -106,4 +107,4 @@ def validate_last_trial_ends_adjacent_to_flash(ophys_experiment_id, api=None, ve try: validation_function(ophys_experiment_id, api=api) except ValidationError as e: - print(ophys_experiment_id, e) \ No newline at end of file + print(ophys_experiment_id, e) diff --git a/allensdk/brain_observatory/behavior/write_nwb/__main__.py b/allensdk/brain_observatory/behavior/write_nwb/__main__.py index 67effe5c6..24eee7e76 100644 --- a/allensdk/brain_observatory/behavior/write_nwb/__main__.py +++ b/allensdk/brain_observatory/behavior/write_nwb/__main__.py @@ -4,8 +4,8 @@ import argschema import marshmallow -from allensdk.brain_observatory.behavior.behavior_ophys_session import ( - BehaviorOphysSession) +from allensdk.brain_observatory.behavior.behavior_ophys_experiment import ( + BehaviorOphysExperiment) from allensdk.brain_observatory.behavior.session_apis.data_io import ( BehaviorOphysNwbApi, BehaviorOphysJsonApi, BehaviorOphysLimsApi) from allensdk.brain_observatory.behavior.write_nwb._schemas import ( @@ -32,22 +32,22 @@ def write_behavior_ophys_nwb(session_data: dict, try: json_api = BehaviorOphysJsonApi(data=session_data, skip_eye_tracking=skip_eye_tracking) - json_session = BehaviorOphysSession(api=json_api) + json_session = BehaviorOphysExperiment(api=json_api) lims_api = BehaviorOphysLimsApi( ophys_experiment_id=session_data['ophys_experiment_id'], skip_eye_tracking=skip_eye_tracking) - lims_session = BehaviorOphysSession(api=lims_api) + lims_session = BehaviorOphysExperiment(api=lims_api) - logging.info("Comparing a BehaviorOphysSession created from JSON " - "with a BehaviorOphysSession created from LIMS") + logging.info("Comparing a BehaviorOphysExperiment created from JSON " + "with a BehaviorOphysExperiment created from LIMS") assert sessions_are_equal(json_session, lims_session, reraise=True) BehaviorOphysNwbApi(nwb_filepath_inprogress).save(json_session) - logging.info("Comparing a BehaviorOphysSession created from JSON " - "with a BehaviorOphysSession created from NWB") + logging.info("Comparing a BehaviorOphysExperiment created from JSON " + "with a BehaviorOphysExperiment created from NWB") nwb_api = BehaviorOphysNwbApi(nwb_filepath_inprogress) - nwb_session = BehaviorOphysSession(api=nwb_api) + nwb_session = BehaviorOphysExperiment(api=nwb_api) assert sessions_are_equal(json_session, nwb_session, reraise=True) os.rename(nwb_filepath_inprogress, 
nwb_filepath) diff --git a/allensdk/brain_observatory/ecephys/ecephys_project_cache.py b/allensdk/brain_observatory/ecephys/ecephys_project_cache.py index 7cfb90167..89cb62210 100644 --- a/allensdk/brain_observatory/ecephys/ecephys_project_cache.py +++ b/allensdk/brain_observatory/ecephys/ecephys_project_cache.py @@ -8,7 +8,7 @@ import numpy as np import pynwb -from allensdk.api.cache import Cache +from allensdk.api.warehouse_cache.cache import Cache from allensdk.core.authentication import DbCredentials from allensdk.brain_observatory.ecephys.ecephys_project_api import ( EcephysProjectApi, EcephysProjectLimsApi, EcephysProjectWarehouseApi, @@ -22,7 +22,7 @@ ) from allensdk.brain_observatory.ecephys.ecephys_session import EcephysSession from allensdk.brain_observatory.ecephys import get_unit_filter_value -from allensdk.api.caching_utilities import one_file_call_caching +from allensdk.api.warehouse_cache.caching_utilities import one_file_call_caching class EcephysProjectCache(Cache): diff --git a/allensdk/brain_observatory/receptive_field_analysis/utilities.py b/allensdk/brain_observatory/receptive_field_analysis/utilities.py index fb0c29403..bbef432f3 100644 --- a/allensdk/brain_observatory/receptive_field_analysis/utilities.py +++ b/allensdk/brain_observatory/receptive_field_analysis/utilities.py @@ -37,7 +37,7 @@ import numpy as np import scipy.interpolate as spinterp from .tools import dict_generator -from allensdk.api.cache import memoize +from allensdk.api.warehouse_cache.cache import memoize import os import warnings from skimage.measure import block_reduce diff --git a/allensdk/brain_observatory/stimulus_info.py b/allensdk/brain_observatory/stimulus_info.py index 9ad23871f..595510c56 100755 --- a/allensdk/brain_observatory/stimulus_info.py +++ b/allensdk/brain_observatory/stimulus_info.py @@ -37,7 +37,7 @@ import numpy as np import scipy.ndimage.interpolation as spndi from PIL import Image -from allensdk.api.cache import memoize +from allensdk.api.warehouse_cache.cache import memoize import itertools # some handles for stimulus types diff --git a/allensdk/core/brain_observatory_cache.py b/allensdk/core/brain_observatory_cache.py index 829159946..bdc6b67b9 100644 --- a/allensdk/core/brain_observatory_cache.py +++ b/allensdk/core/brain_observatory_cache.py @@ -40,7 +40,7 @@ from pathlib import Path -from allensdk.api.cache import Cache, get_default_manifest_file +from allensdk.api.warehouse_cache.cache import Cache, get_default_manifest_file from allensdk.api.queries.brain_observatory_api import BrainObservatoryApi from allensdk.config.manifest_builder import ManifestBuilder from .brain_observatory_nwb_data_set import BrainObservatoryNwbDataSet diff --git a/allensdk/core/brain_observatory_nwb_data_set.py b/allensdk/core/brain_observatory_nwb_data_set.py index 69d437c4a..69729cfc8 100755 --- a/allensdk/core/brain_observatory_nwb_data_set.py +++ b/allensdk/core/brain_observatory_nwb_data_set.py @@ -52,7 +52,7 @@ from allensdk.brain_observatory.brain_observatory_exceptions import (MissingStimulusException, NoEyeTrackingException) -from allensdk.api.cache import memoize +from allensdk.api.warehouse_cache.cache import memoize from allensdk.core import h5_utilities from allensdk.brain_observatory.stimulus_info import mask_stimulus_template as si_mask_stimulus_template diff --git a/allensdk/core/cell_types_cache.py b/allensdk/core/cell_types_cache.py index 2e5426563..706277156 100644 --- a/allensdk/core/cell_types_cache.py +++ b/allensdk/core/cell_types_cache.py @@ -37,7 +37,7 @@ from six 
import string_types from allensdk.config.manifest_builder import ManifestBuilder -from allensdk.api.cache import Cache, get_default_manifest_file +from allensdk.api.warehouse_cache.cache import Cache, get_default_manifest_file from allensdk.api.queries.cell_types_api import CellTypesApi from . import json_utilities as json_utilities diff --git a/allensdk/core/mouse_connectivity_cache.py b/allensdk/core/mouse_connectivity_cache.py index 88e3538ba..041db7724 100644 --- a/allensdk/core/mouse_connectivity_cache.py +++ b/allensdk/core/mouse_connectivity_cache.py @@ -34,7 +34,7 @@ # POSSIBILITY OF SUCH DAMAGE. # from allensdk.config.manifest_builder import ManifestBuilder -from allensdk.api.cache import Cache, get_default_manifest_file +from allensdk.api.warehouse_cache.cache import Cache, get_default_manifest_file from allensdk.api.queries.mouse_connectivity_api import MouseConnectivityApi from allensdk.deprecated import deprecated diff --git a/allensdk/core/reference_space_cache.py b/allensdk/core/reference_space_cache.py index 2ac2c7dbc..256e02b21 100644 --- a/allensdk/core/reference_space_cache.py +++ b/allensdk/core/reference_space_cache.py @@ -34,7 +34,7 @@ # POSSIBILITY OF SUCH DAMAGE. # from allensdk.config.manifest_builder import ManifestBuilder -from allensdk.api.cache import Cache +from allensdk.api.warehouse_cache.cache import Cache from allensdk.api.queries.reference_space_api import ReferenceSpaceApi from allensdk.api.queries.ontologies_api import OntologiesApi from allensdk.deprecated import deprecated diff --git a/allensdk/internal/api/queries/grid_data_api_prerelease.py b/allensdk/internal/api/queries/grid_data_api_prerelease.py index f6dea5104..14cacad5f 100644 --- a/allensdk/internal/api/queries/grid_data_api_prerelease.py +++ b/allensdk/internal/api/queries/grid_data_api_prerelease.py @@ -2,7 +2,7 @@ import six from allensdk.config.manifest import Manifest -from allensdk.api.cache import Cache, cacheable +from allensdk.api.warehouse_cache.cache import Cache, cacheable from allensdk.api.queries.grid_data_api import GridDataApi from allensdk.core import json_utilities diff --git a/allensdk/internal/api/queries/mouse_connectivity_api_prerelease.py b/allensdk/internal/api/queries/mouse_connectivity_api_prerelease.py index 0b7267052..86e18be2d 100644 --- a/allensdk/internal/api/queries/mouse_connectivity_api_prerelease.py +++ b/allensdk/internal/api/queries/mouse_connectivity_api_prerelease.py @@ -1,4 +1,4 @@ -from allensdk.api.cache import Cache, cacheable +from allensdk.api.warehouse_cache.cache import Cache, cacheable from allensdk.api.queries.grid_data_api import GridDataApi from allensdk.api.queries.mouse_connectivity_api import MouseConnectivityApi diff --git a/allensdk/internal/api/queries/pre_release.py b/allensdk/internal/api/queries/pre_release.py index 1e5552b77..07d3fe13e 100644 --- a/allensdk/internal/api/queries/pre_release.py +++ b/allensdk/internal/api/queries/pre_release.py @@ -1,5 +1,5 @@ from allensdk.api.queries.brain_observatory_api import BrainObservatoryApi -from allensdk.api.cache import cacheable +from allensdk.api.warehouse_cache.cache import cacheable from allensdk.core.brain_observatory_cache import BrainObservatoryCache import allensdk.internal.core.lims_utilities as lu import os @@ -167,4 +167,4 @@ def get_cell_metrics(self): cell_list.append(c) - return cell_list \ No newline at end of file + return cell_list diff --git a/allensdk/test/api/cloud_cache/__init__.py b/allensdk/test/api/cloud_cache/__init__.py new file mode 100644 index 
000000000..1bb8bf6d7
--- /dev/null
+++ b/allensdk/test/api/cloud_cache/__init__.py
@@ -0,0 +1 @@
+# empty
diff --git a/allensdk/test/api/cloud_cache/test_cache.py b/allensdk/test/api/cloud_cache/test_cache.py
new file mode 100644
index 000000000..43acc94bf
--- /dev/null
+++ b/allensdk/test/api/cloud_cache/test_cache.py
@@ -0,0 +1,616 @@
+import pytest
+import json
+import hashlib
+import pathlib
+import pandas as pd
+import io
+import boto3
+from moto import mock_s3
+from allensdk.api.cloud_cache.cloud_cache import S3CloudCache  # noqa: E501
+from allensdk.api.cloud_cache.file_attributes import CacheFileAttributes  # noqa: E501
+
+
+@mock_s3
+def test_list_all_manifests(tmpdir):
+    """
+    Test that S3CloudCache.list_all_manifests() returns the correct result
+    """
+
+    test_bucket_name = 'list_manifest_bucket'
+
+    conn = boto3.resource('s3', region_name='us-east-1')
+    conn.create_bucket(Bucket=test_bucket_name)
+
+    client = boto3.client('s3', region_name='us-east-1')
+    client.put_object(Bucket=test_bucket_name,
+                      Key='proj/manifests/manifest_1.json',
+                      Body=b'123456')
+    client.put_object(Bucket=test_bucket_name,
+                      Key='proj/manifests/manifest_2.json',
+                      Body=b'123456')
+    client.put_object(Bucket=test_bucket_name,
+                      Key='junk.txt',
+                      Body=b'123456')
+
+    cache = S3CloudCache(tmpdir, test_bucket_name, 'proj')
+
+    assert cache.manifest_file_names == ['manifest_1.json', 'manifest_2.json']
+
+
+@mock_s3
+def test_list_all_manifests_many(tmpdir):
+    """
+    Test the extreme case when there are more manifests than list_objects_v2
+    can return at a time
+    """
+
+    test_bucket_name = 'list_manifest_bucket'
+
+    conn = boto3.resource('s3', region_name='us-east-1')
+    conn.create_bucket(Bucket=test_bucket_name)
+
+    client = boto3.client('s3', region_name='us-east-1')
+    for ii in range(2000):
+        client.put_object(Bucket=test_bucket_name,
+                          Key=f'proj/manifests/manifest_{ii}.json',
+                          Body=b'123456')
+
+    client.put_object(Bucket=test_bucket_name,
+                      Key='junk.txt',
+                      Body=b'123456')
+
+    cache = S3CloudCache(tmpdir, test_bucket_name, 'proj')
+
+    expected = list([f'manifest_{ii}.json' for ii in range(2000)])
+    expected.sort()
+    assert cache.manifest_file_names == expected
+
+
+@mock_s3
+def test_loading_manifest(tmpdir):
+    """
+    Test loading manifests with S3CloudCache
+    """
+
+    test_bucket_name = 'list_manifest_bucket'
+
+    conn = boto3.resource('s3', region_name='us-east-1')
+    conn.create_bucket(Bucket=test_bucket_name, ACL='public-read')
+
+    client = boto3.client('s3', region_name='us-east-1')
+
+    manifest_1 = {'manifest_version': '1',
+                  'metadata_file_id_column_name': 'file_id',
+                  'data_pipeline': 'placeholder',
+                  'project_name': 'sam-beckett',
+                  'metadata_files': {'a.csv': {'url': 'http://www.junk.com',
+                                               'version_id': '1111',
+                                               'file_hash': 'abcde'},
+                                     'b.csv': {'url': 'http://silly.com',
+                                               'version_id': '2222',
+                                               'file_hash': 'fghijk'}}}
+
+    manifest_2 = {'manifest_version': '2',
+                  'metadata_file_id_column_name': 'file_id',
+                  'data_pipeline': 'placeholder',
+                  'project_name': 'al',
+                  'metadata_files': {'c.csv': {'url': 'http://www.absurd.com',
+                                               'version_id': '3333',
+                                               'file_hash': 'lmnop'},
+                                     'd.csv': {'url': 'http://nonsense.com',
+                                               'version_id': '4444',
+                                               'file_hash': 'qrstuv'}}}
+
+    client.put_object(Bucket=test_bucket_name,
+                      Key='proj/manifests/manifest_1.csv',
+                      Body=bytes(json.dumps(manifest_1), 'utf-8'))
+
+    client.put_object(Bucket=test_bucket_name,
+                      Key='proj/manifests/manifest_2.csv',
+                      Body=bytes(json.dumps(manifest_2), 'utf-8'))
+
+    cache = S3CloudCache(pathlib.Path(tmpdir), test_bucket_name, 'proj')
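The cloud cache tests added here all lean on two ingredients: moto's `@mock_s3` decorator, which redirects every boto3 call to an in-memory S3 fake, and `hashlib.blake2b`, which the tests use to compute the checksums they later compare downloads against. A minimal, self-contained sketch of that pattern (the bucket and key names are arbitrary placeholders, not part of the AllenSDK API):

```python
import hashlib

import boto3
from moto import mock_s3


@mock_s3
def test_upload_and_verify_hash():
    # the fake bucket lives in memory; no AWS credentials or network needed
    conn = boto3.resource('s3', region_name='us-east-1')
    conn.create_bucket(Bucket='example-bucket')

    data = b'bytes standing in for a data file'
    hasher = hashlib.blake2b()
    hasher.update(data)
    expected_hash = hasher.hexdigest()

    client = boto3.client('s3', region_name='us-east-1')
    client.put_object(Bucket='example-bucket',
                      Key='proj/data/data_file.txt',
                      Body=data)

    # read the object back through the same in-memory fake and re-hash it
    obj = client.get_object(Bucket='example-bucket',
                            Key='proj/data/data_file.txt')
    hasher = hashlib.blake2b()
    hasher.update(obj['Body'].read())
    assert hasher.hexdigest() == expected_hash
```

The versioned-download tests layer `conn.BucketVersioning(bucket).enable()` on top of this same pattern so that successive uploads of the same key can be told apart by `VersionId`.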
+ cache.load_manifest('manifest_1.csv') + assert cache._manifest._data == manifest_1 + assert cache.version == '1' + assert cache.file_id_column == 'file_id' + assert cache.metadata_file_names == ['a.csv', 'b.csv'] + + cache.load_manifest('manifest_2.csv') + assert cache._manifest._data == manifest_2 + assert cache.version == '2' + assert cache.file_id_column == 'file_id' + assert cache.metadata_file_names == ['c.csv', 'd.csv'] + + with pytest.raises(ValueError) as context: + cache.load_manifest('manifest_3.csv') + msg = 'is not one of the valid manifest names' + assert msg in context.value.args[0] + + +@mock_s3 +def test_file_exists(tmpdir): + """ + Test that cache._file_exists behaves correctly + """ + + data = b'aakderasjklsafetss77123523asf' + hasher = hashlib.blake2b() + hasher.update(data) + true_checksum = hasher.hexdigest() + test_file_path = pathlib.Path(tmpdir)/'junk.txt' + with open(test_file_path, 'wb') as out_file: + out_file.write(data) + + # need to populate a bucket in order for + # S3CloudCache to be instantiated + test_bucket_name = 'silly_bucket' + conn = boto3.resource('s3', region_name='us-east-1') + conn.create_bucket(Bucket=test_bucket_name, ACL='public-read') + + cache = S3CloudCache(tmpdir, test_bucket_name, 'proj') + + # should be true + good_attribute = CacheFileAttributes('http://silly.url.com', + '12345', + true_checksum, + test_file_path) + assert cache._file_exists(good_attribute) + + # test when checksum is wrong + bad_attribute = CacheFileAttributes('http://silly.url.com', + '12345', + 'probably_not_the_checksum', + test_file_path) + assert not cache._file_exists(bad_attribute) + + # test when file path is wrong + bad_path = pathlib.Path('definitely/not/a/file.txt') + bad_attribute = CacheFileAttributes('http://silly.url.com', + '12345', + true_checksum, + bad_path) + + assert not cache._file_exists(bad_attribute) + + # test when path exists but is not a file + bad_attribute = CacheFileAttributes('http://silly.url.com', + '12345', + true_checksum, + pathlib.Path(tmpdir)) + with pytest.raises(RuntimeError) as context: + cache._file_exists(bad_attribute) + assert 'but is not a file' in context.value.args[0] + + +@mock_s3 +def test_download_file(tmpdir): + """ + Test that S3CloudCache._download_file behaves as expected + """ + + hasher = hashlib.blake2b() + data = b'11235813kjlssergwesvsdd' + hasher.update(data) + true_checksum = hasher.hexdigest() + + test_bucket_name = 'bucket_for_download' + conn = boto3.resource('s3', region_name='us-east-1') + conn.create_bucket(Bucket=test_bucket_name, ACL='public-read') + + # turn on bucket versioning + # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#bucketversioning + bucket_versioning = conn.BucketVersioning(test_bucket_name) + bucket_versioning.enable() + + client = boto3.client('s3', region_name='us-east-1') + client.put_object(Bucket=test_bucket_name, + Key='data/data_file.txt', + Body=data) + + response = client.list_object_versions(Bucket=test_bucket_name) + version_id = response['Versions'][0]['VersionId'] + + cache_dir = pathlib.Path(tmpdir) / 'download/test/cache' + cache = S3CloudCache(cache_dir, test_bucket_name, 'proj') + + expected_path = cache_dir / true_checksum / 'data/data_file.txt' + + url = f'http://{test_bucket_name}.s3.amazonaws.com/data/data_file.txt' + good_attributes = CacheFileAttributes(url, + version_id, + true_checksum, + expected_path) + + assert not expected_path.exists() + cache._download_file(good_attributes) + assert expected_path.exists() + hasher = 
hashlib.blake2b() + with open(expected_path, 'rb') as in_file: + hasher.update(in_file.read()) + assert hasher.hexdigest() == true_checksum + + +@mock_s3 +def test_download_file_multiple_versions(tmpdir): + """ + Test that S3CloudCache._download_file behaves as expected + when there are multiple versions of the same file in the + bucket + + (This is really just testing that S3's versioning behaves the + way we think it does) + """ + + hasher = hashlib.blake2b() + data_1 = b'11235813kjlssergwesvsdd' + hasher.update(data_1) + true_checksum_1 = hasher.hexdigest() + + hasher = hashlib.blake2b() + data_2 = b'zzzzxxxxyyyywwwwjjjj' + hasher.update(data_2) + true_checksum_2 = hasher.hexdigest() + + assert true_checksum_2 != true_checksum_1 + + test_bucket_name = 'bucket_for_download_versions' + conn = boto3.resource('s3', region_name='us-east-1') + conn.create_bucket(Bucket=test_bucket_name, ACL='public-read') + + # turn on bucket versioning + # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#bucketversioning + bucket_versioning = conn.BucketVersioning(test_bucket_name) + bucket_versioning.enable() + + client = boto3.client('s3', region_name='us-east-1') + client.put_object(Bucket=test_bucket_name, + Key='data/data_file.txt', + Body=data_1) + + response = client.list_object_versions(Bucket=test_bucket_name) + version_id_1 = response['Versions'][0]['VersionId'] + + client = boto3.client('s3', region_name='us-east-1') + client.put_object(Bucket=test_bucket_name, + Key='data/data_file.txt', + Body=data_2) + + response = client.list_object_versions(Bucket=test_bucket_name) + version_id_2 = None + for v in response['Versions']: + if v['IsLatest']: + version_id_2 = v['VersionId'] + assert version_id_2 is not None + assert version_id_2 != version_id_1 + + cache_dir = pathlib.Path(tmpdir) / 'download/test/cache' + cache = S3CloudCache(cache_dir, test_bucket_name, 'proj') + + url = f'http://{test_bucket_name}.s3.amazonaws.com/data/data_file.txt' + + # download first version of file + expected_path = cache_dir / true_checksum_1 / 'data/data_file.txt' + + good_attributes = CacheFileAttributes(url, + version_id_1, + true_checksum_1, + expected_path) + + assert not expected_path.exists() + cache._download_file(good_attributes) + assert expected_path.exists() + hasher = hashlib.blake2b() + with open(expected_path, 'rb') as in_file: + hasher.update(in_file.read()) + assert hasher.hexdigest() == true_checksum_1 + + # download second version of file + expected_path = cache_dir / true_checksum_2 / 'data/data_file.txt' + + good_attributes = CacheFileAttributes(url, + version_id_2, + true_checksum_2, + expected_path) + + assert not expected_path.exists() + cache._download_file(good_attributes) + assert expected_path.exists() + hasher = hashlib.blake2b() + with open(expected_path, 'rb') as in_file: + hasher.update(in_file.read()) + assert hasher.hexdigest() == true_checksum_2 + + +@mock_s3 +def test_re_download_file(tmpdir): + """ + Test that S3CloudCache._download_file will re-download a file + when it has been altered locally + """ + + hasher = hashlib.blake2b() + data = b'11235813kjlssergwesvsdd' + hasher.update(data) + true_checksum = hasher.hexdigest() + + test_bucket_name = 'bucket_for_re_download' + conn = boto3.resource('s3', region_name='us-east-1') + conn.create_bucket(Bucket=test_bucket_name, ACL='public-read') + + # turn on bucket versioning + # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#bucketversioning + bucket_versioning = 
conn.BucketVersioning(test_bucket_name) + bucket_versioning.enable() + + client = boto3.client('s3', region_name='us-east-1') + client.put_object(Bucket=test_bucket_name, + Key='data/data_file.txt', + Body=data) + + response = client.list_object_versions(Bucket=test_bucket_name) + version_id = response['Versions'][0]['VersionId'] + + cache_dir = pathlib.Path(tmpdir) / 'download/test/cache' + cache = S3CloudCache(cache_dir, test_bucket_name, 'proj') + + expected_path = cache_dir / true_checksum / 'data/data_file.txt' + + url = f'http://{test_bucket_name}.s3.amazonaws.com/data/data_file.txt' + good_attributes = CacheFileAttributes(url, + version_id, + true_checksum, + expected_path) + + assert not expected_path.exists() + cache._download_file(good_attributes) + assert expected_path.exists() + hasher = hashlib.blake2b() + with open(expected_path, 'rb') as in_file: + hasher.update(in_file.read()) + assert hasher.hexdigest() == true_checksum + + # now, remove the file, and see if it gets re-downloaded + expected_path.unlink() + assert not expected_path.exists() + + cache._download_file(good_attributes) + assert expected_path.exists() + hasher = hashlib.blake2b() + with open(expected_path, 'rb') as in_file: + hasher.update(in_file.read()) + assert hasher.hexdigest() == true_checksum + + # now, alter the file, and see if it gets re-downloaded + with open(expected_path, 'wb') as out_file: + out_file.write(b'778899') + hasher = hashlib.blake2b() + with open(expected_path, 'rb') as in_file: + hasher.update(in_file.read()) + assert hasher.hexdigest() != true_checksum + + cache._download_file(good_attributes) + assert expected_path.exists() + hasher = hashlib.blake2b() + with open(expected_path, 'rb') as in_file: + hasher.update(in_file.read()) + assert hasher.hexdigest() == true_checksum + + +@mock_s3 +def test_download_data(tmpdir): + """ + Test that S3CloudCache.download_data() correctly downloads files from S3 + """ + + hasher = hashlib.blake2b() + data = b'11235813kjlssergwesvsdd' + hasher.update(data) + true_checksum = hasher.hexdigest() + + test_bucket_name = 'bucket_for_download_data' + conn = boto3.resource('s3', region_name='us-east-1') + conn.create_bucket(Bucket=test_bucket_name, ACL='public-read') + + # turn on bucket versioning + # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#bucketversioning + bucket_versioning = conn.BucketVersioning(test_bucket_name) + bucket_versioning.enable() + + client = boto3.client('s3', region_name='us-east-1') + client.put_object(Bucket=test_bucket_name, + Key='data/data_file.txt', + Body=data) + + response = client.list_object_versions(Bucket=test_bucket_name) + version_id = response['Versions'][0]['VersionId'] + + manifest = {} + manifest['manifest_version'] = '1' + manifest['project_name'] = "project-z" + manifest['metadata_file_id_column_name'] = 'file_id' + manifest['metadata_files'] = {} + url = f'http://{test_bucket_name}.s3.amazonaws.com/project-z/data/data_file.txt' # noqa: E501 + data_file = {'url': url, + 'version_id': version_id, + 'file_hash': true_checksum} + + manifest['data_files'] = {'only_data_file': data_file} + manifest['data_pipeline'] = 'placeholder' + + client.put_object(Bucket=test_bucket_name, + Key='proj/manifests/manifest_1.json', + Body=bytes(json.dumps(manifest), 'utf-8')) + + cache_dir = pathlib.Path(tmpdir) / "data/path/cache" + cache = S3CloudCache(cache_dir, test_bucket_name, 'proj') + + cache.load_manifest('manifest_1.json') + + expected_path = cache_dir / 'project-z-1' / 
'data/data_file.txt' + assert not expected_path.exists() + + # test data_path + attr = cache.data_path('only_data_file') + assert attr['local_path'] == expected_path + assert not attr['exists'] + + # NOTE: commenting out because moto does not support + # list_object_versions and this is becoming difficult + + # result_path = cache.download_data('only_data_file') + # assert result_path == expected_path + # assert expected_path.exists() + # hasher = hashlib.blake2b() + # with open(expected_path, 'rb') as in_file: + # hasher.update(in_file.read()) + # assert hasher.hexdigest() == true_checksum + + # test that data_path detects that the file now exists + # attr = cache.data_path('only_data_file') + # assert attr['local_path'] == expected_path + # assert attr['exists'] + + +@mock_s3 +def test_download_metadata(tmpdir): + """ + Test that S3CloudCache.download_metadata() correctly + downloads files from S3 + """ + + hasher = hashlib.blake2b() + data = b'11235813kjlssergwesvsdd' + hasher.update(data) + true_checksum = hasher.hexdigest() + + test_bucket_name = 'bucket_for_download_metadata' + conn = boto3.resource('s3', region_name='us-east-1') + conn.create_bucket(Bucket=test_bucket_name, ACL='public-read') + + # turn on bucket versioning + # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#bucketversioning + bucket_versioning = conn.BucketVersioning(test_bucket_name) + bucket_versioning.enable() + + client = boto3.client('s3', region_name='us-east-1') + meta_version = client.put_object(Bucket=test_bucket_name, + Key='metadata_file.csv', + Body=data)["VersionId"] + + response = client.list_object_versions(Bucket=test_bucket_name) + version_id = response['Versions'][0]['VersionId'] + + manifest = {} + manifest['manifest_version'] = '1' + manifest['project_name'] = "project4" + manifest['metadata_file_id_column_name'] = 'file_id' + url = f'http://{test_bucket_name}.s3.amazonaws.com/project4/metadata_file.csv' # noqa: E501 + metadata_file = {'url': url, + 'version_id': version_id, + 'file_hash': true_checksum} + + manifest['metadata_files'] = {'metadata_file.csv': metadata_file} + manifest['data_pipeline'] = 'placeholder' + + client.put_object(Bucket=test_bucket_name, + Key='proj/manifests/manifest_1.json', + Body=bytes(json.dumps(manifest), 'utf-8')) + + cache_dir = pathlib.Path(tmpdir) / "metadata/path/cache" + cache = S3CloudCache(cache_dir, test_bucket_name, 'proj') + + cache.load_manifest('manifest_1.json') + + expected_path = cache_dir / "project4-1" / 'metadata_file.csv' + assert not expected_path.exists() + + # test that metadata_path also works + attr = cache.metadata_path('metadata_file.csv') + assert attr['local_path'] == expected_path + assert not attr['exists'] + + def response_fun(Bucket, Prefix): + # moto doesn't cover list_object_versions + return {"Versions": [{ + "VersionId": meta_version, + "Key": "metadata_file.csv", + "Size": 12}]} + # cache.s3_client.list_object_versions = response_fun + + # NOTE: commenting out because moto does not support + # list_object_versions and this is becoming difficult + + # result_path = cache.download_metadata('metadata_file.csv') + # assert result_path == expected_path + # assert expected_path.exists() + # hasher = hashlib.blake2b() + # with open(expected_path, 'rb') as in_file: + # hasher.update(in_file.read()) + # assert hasher.hexdigest() == true_checksum + + # # test that metadata_path detects that the file now exists + # attr = cache.metadata_path('metadata_file.csv') + # assert attr['local_path'] == 
expected_path + # assert attr['exists'] + + +@mock_s3 +def test_metadata(tmpdir): + """ + Test that S3CloudCache.metadata() returns the expected pandas DataFrame + """ + data = {} + data['mouse_id'] = [1, 4, 6, 8] + data['sex'] = ['F', 'F', 'M', 'M'] + data['age'] = ['P50', 'P46', 'P23', 'P40'] + true_df = pd.DataFrame(data) + + with io.StringIO() as stream: + true_df.to_csv(stream, index=False) + stream.seek(0) + data = bytes(stream.read(), 'utf-8') + + hasher = hashlib.blake2b() + hasher.update(data) + true_checksum = hasher.hexdigest() + + test_bucket_name = 'bucket_for_metadata' + conn = boto3.resource('s3', region_name='us-east-1') + conn.create_bucket(Bucket=test_bucket_name, ACL='public-read') + + # turn on bucket versioning + # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#bucketversioning + bucket_versioning = conn.BucketVersioning(test_bucket_name) + bucket_versioning.enable() + + client = boto3.client('s3', region_name='us-east-1') + client.put_object(Bucket=test_bucket_name, + Key='metadata_file.csv', + Body=data) + + response = client.list_object_versions(Bucket=test_bucket_name) + version_id = response['Versions'][0]['VersionId'] + + manifest = {} + manifest['manifest_version'] = '1' + manifest['project_name'] = "project-X" + manifest['metadata_file_id_column_name'] = 'file_id' + url = f'http://{test_bucket_name}.s3.amazonaws.com/metadata_file.csv' + metadata_file = {'url': url, + 'version_id': version_id, + 'file_hash': true_checksum} + + manifest['metadata_files'] = {'metadata_file.csv': metadata_file} + manifest['data_pipeline'] = 'placeholder' + + client.put_object(Bucket=test_bucket_name, + Key='proj/manifests/manifest_1.json', + Body=bytes(json.dumps(manifest), 'utf-8')) + + cache_dir = pathlib.Path(tmpdir) / "metadata/cache" + cache = S3CloudCache(cache_dir, test_bucket_name, 'proj') + cache.load_manifest('manifest_1.json') + + metadata_df = cache.get_metadata('metadata_file.csv') + assert true_df.equals(metadata_df) diff --git a/allensdk/test/api/cloud_cache/test_file_attributes.py b/allensdk/test/api/cloud_cache/test_file_attributes.py new file mode 100644 index 000000000..f3af08221 --- /dev/null +++ b/allensdk/test/api/cloud_cache/test_file_attributes.py @@ -0,0 +1,73 @@ +import platform +import pytest +import pathlib +from allensdk.api.cloud_cache.file_attributes import CacheFileAttributes # noqa: E501 + + +def test_cache_file_attributes(): + attr = CacheFileAttributes(url='http://my/url', + version_id='aaabbb', + file_hash='12345', + local_path=pathlib.Path('/my/local/path')) + + assert attr.url == 'http://my/url' + assert attr.version_id == 'aaabbb' + assert attr.file_hash == '12345' + assert attr.local_path == pathlib.Path('/my/local/path') + + # test that the correct ValueErrors are raised + # when you pass invalid arguments + + with pytest.raises(ValueError) as context: + attr = CacheFileAttributes(url=5.0, + version_id='aaabbb', + file_hash='12345', + local_path=pathlib.Path('/my/local/path')) + + msg = "url must be str; got " + assert context.value.args[0] == msg + + with pytest.raises(ValueError) as context: + attr = CacheFileAttributes(url='http://my/url/', + version_id=5.0, + file_hash='12345', + local_path=pathlib.Path('/my/local/path')) + + msg = "version_id must be str; got " + assert context.value.args[0] == msg + + with pytest.raises(ValueError) as context: + attr = CacheFileAttributes(url='http://my/url/', + version_id='aaabbb', + file_hash=5.0, + local_path=pathlib.Path('/my/local/path')) + + msg = "file_hash 
must be str; got " + assert context.value.args[0] == msg + + with pytest.raises(ValueError) as context: + attr = CacheFileAttributes(url='http://my/url/', + version_id='aaabbb', + file_hash='12345', + local_path='/my/local/path') + + msg = "local_path must be pathlib.Path; got " + assert context.value.args[0] == msg + + +def test_str(): + """ + Test the string representation of CacheFileParameters + """ + attr = CacheFileAttributes(url='http://my/url', + version_id='aaabbb', + file_hash='12345', + local_path=pathlib.Path('/my/local/path')) + + s = f'{attr}' + assert "CacheFileParameters{" in s + assert '"file_hash": "12345"' in s + assert '"url": "http://my/url"' in s + assert '"version_id": "aaabbb"' in s + if platform.system().lower() != 'windows': + assert '"local_path": "/my/local/path"' in s diff --git a/allensdk/test/api/cloud_cache/test_full_process.py b/allensdk/test/api/cloud_cache/test_full_process.py new file mode 100644 index 000000000..8caff5dd1 --- /dev/null +++ b/allensdk/test/api/cloud_cache/test_full_process.py @@ -0,0 +1,261 @@ +import pytest +import json +import pathlib +import hashlib +import pandas as pd +import io +import boto3 +from moto import mock_s3 +from allensdk.api.cloud_cache.cloud_cache import S3CloudCache + + +@mock_s3 +def test_full_cache_system(tmpdir): + """ + Test the process of loading different versions of the same dataset, + each of which involve different versions of files + """ + + test_bucket_name = 'full_cache_bucket' + + conn = boto3.resource('s3', region_name='us-east-1') + conn.create_bucket(Bucket=test_bucket_name, ACL='public-read') + + # turn on bucket versioning + # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#bucketversioning + bucket_versioning = conn.BucketVersioning(test_bucket_name) + bucket_versioning.enable() + + s3_client = boto3.client('s3', region_name='us-east-1') + + # generate data and expected hashes + + true_hashes = {} + version_id_lookup = {} + + data1_v1 = b'12345678' + data1_v2 = b'45678901' + data2_v1 = b'abcdefghijk' + data2_v2 = b'lmnopqrstuv' + data3_v1 = b'jklmnopqrst' + + metadata1_v1 = pd.DataFrame({'mouse': [1, 2, 3], + 'sex': ['F', 'F', 'M']}) + + metadata2_v1 = pd.DataFrame({'experiment': [5, 6, 7], + 'file_id': ['data1', 'data2', 'data3']}) + + metadata1_v2 = pd.DataFrame({'mouse': [8, 9, 0], + 'sex': ['M', 'F', 'M']}) + + v1_hashes = {} + for data, key in zip((data1_v1, data2_v1, data3_v1), + ('data1', 'data2', 'data3')): + + hasher = hashlib.blake2b() + hasher.update(data) + v1_hashes[key] = hasher.hexdigest() + s3_client.put_object(Bucket=test_bucket_name, + Key=f'proj/data/{key}', + Body=data) + + for df, key in zip((metadata1_v1, metadata2_v1), + ('proj/metadata1.csv', 'proj/metadata2.csv')): + + with io.StringIO() as stream: + df.to_csv(stream, index=False) + stream.seek(0) + data = bytes(stream.read(), 'utf-8') + + hasher = hashlib.blake2b() + hasher.update(data) + v1_hashes[key.replace('proj/', '')] = hasher.hexdigest() + s3_client.put_object(Bucket=test_bucket_name, + Key=key, + Body=data) + + true_hashes['v1'] = v1_hashes + v1_version_id = {} + response = s3_client.list_object_versions(Bucket=test_bucket_name) + for v in response['Versions']: + vkey = v['Key'].replace('proj/', '').replace('data/', '') + v1_version_id[vkey] = v['VersionId'] + + version_id_lookup['v1'] = v1_version_id + + v2_hashes = {} + v2_version_id = {} + for data, key in zip((data1_v2, data2_v2), + ('data1', 'data2')): + + hasher = hashlib.blake2b() + hasher.update(data) + v2_hashes[key] = 
hasher.hexdigest()
+        s3_client.put_object(Bucket=test_bucket_name,
+                             Key=f'proj/data/{key}',
+                             Body=data)
+
+    s3_client.delete_object(Bucket=test_bucket_name,
+                            Key='proj/data/data3')
+
+    with io.StringIO() as stream:
+        metadata1_v2.to_csv(stream, index=False)
+        stream.seek(0)
+        data = bytes(stream.read(), 'utf-8')
+
+    hasher = hashlib.blake2b()
+    hasher.update(data)
+    v2_hashes['metadata1.csv'] = hasher.hexdigest()
+    s3_client.put_object(Bucket=test_bucket_name,
+                         Key='proj/metadata1.csv',
+                         Body=data)
+
+    s3_client.delete_object(Bucket=test_bucket_name,
+                            Key='proj/metadata2.csv')
+
+    true_hashes['v2'] = v2_hashes
+    v2_version_id = {}
+    response = s3_client.list_object_versions(Bucket=test_bucket_name)
+    for v in response['Versions']:
+        if not v['IsLatest']:
+            continue
+        vkey = v['Key'].replace('proj/', '').replace('data/', '')
+        v2_version_id[vkey] = v['VersionId']
+    version_id_lookup['v2'] = v2_version_id
+
+    # check that data3 and metadata2.csv do not occur in v2 of
+    # the dataset, but other data/metadata files do
+
+    assert 'data3' in version_id_lookup['v1']
+    assert 'data3' not in version_id_lookup['v2']
+    assert 'data1' in version_id_lookup['v1']
+    assert 'data2' in version_id_lookup['v1']
+    assert 'data1' in version_id_lookup['v2']
+    assert 'data2' in version_id_lookup['v2']
+    assert 'metadata1.csv' in version_id_lookup['v1']
+    assert 'metadata2.csv' in version_id_lookup['v1']
+    assert 'metadata1.csv' in version_id_lookup['v2']
+    assert 'metadata2.csv' not in version_id_lookup['v2']
+
+    # build manifests
+
+    manifest_1 = {}
+    manifest_1['manifest_version'] = 'A'
+    manifest_1['project_name'] = "project-A1"
+    manifest_1['metadata_file_id_column_name'] = 'file_id'
+    manifest_1['data_pipeline'] = 'placeholder'
+    data_files_1 = {}
+    for k in ('data1', 'data2', 'data3'):
+        obj = {}
+        obj['url'] = f'http://{test_bucket_name}.s3.amazonaws.com/proj/data/{k}'  # noqa: E501
+        obj['file_hash'] = true_hashes['v1'][k]
+        obj['version_id'] = version_id_lookup['v1'][k]
+        data_files_1[k] = obj
+    manifest_1['data_files'] = data_files_1
+    metadata_files_1 = {}
+    for k in ('metadata1.csv', 'metadata2.csv'):
+        obj = {}
+        obj['url'] = f'http://{test_bucket_name}.s3.amazonaws.com/proj/{k}'
+        obj['file_hash'] = true_hashes['v1'][k]
+        obj['version_id'] = version_id_lookup['v1'][k]
+        metadata_files_1[k] = obj
+    manifest_1['metadata_files'] = metadata_files_1
+
+    manifest_2 = {}
+    manifest_2['manifest_version'] = 'B'
+    manifest_2['project_name'] = "project-B2"
+    manifest_2['metadata_file_id_column_name'] = 'file_id'
+    manifest_2['data_pipeline'] = 'placeholder'
+    data_files_2 = {}
+    for k in ('data1', 'data2'):
+        obj = {}
+        obj['url'] = f'http://{test_bucket_name}.s3.amazonaws.com/proj/data/{k}'  # noqa: E501
+        obj['file_hash'] = true_hashes['v2'][k]
+        obj['version_id'] = version_id_lookup['v2'][k]
+        data_files_2[k] = obj
+    manifest_2['data_files'] = data_files_2
+    metadata_files_2 = {}
+    for k in ['metadata1.csv']:
+        obj = {}
+        obj['url'] = f'http://{test_bucket_name}.s3.amazonaws.com/proj/{k}'
+        obj['file_hash'] = true_hashes['v2'][k]
+        obj['version_id'] = version_id_lookup['v2'][k]
+        metadata_files_2[k] = obj
+    manifest_2['metadata_files'] = metadata_files_2
+
+    s3_client.put_object(Bucket=test_bucket_name,
+                         Key='proj/manifests/manifest_1.json',
+                         Body=bytes(json.dumps(manifest_1), 'utf-8'))
+
+    s3_client.put_object(Bucket=test_bucket_name,
+                         Key='proj/manifests/manifest_2.json',
+                         Body=bytes(json.dumps(manifest_2), 'utf-8'))
+
+    # Use S3CloudCache to interact with the dataset
+    cache_dir =
pathlib.Path(tmpdir) / 'my/test/cache' + cache = S3CloudCache(cache_dir, test_bucket_name, 'proj') + + # load the first version of the dataset + + cache.load_manifest('manifest_1.json') + assert cache.version == 'A' + + # check that metadata dataframes have expected contents + m1 = cache.get_metadata('metadata1.csv') + assert metadata1_v1.equals(m1) + m2 = cache.get_metadata('metadata2.csv') + assert metadata2_v1.equals(m2) + + # check that data files have expected hashes + for k in ('data1', 'data2', 'data3'): + + attr = cache.data_path(k) + assert not attr['exists'] + + local_path = cache.download_data(k) + assert local_path.exists() + hasher = hashlib.blake2b() + with open(local_path, 'rb') as in_file: + hasher.update(in_file.read()) + assert hasher.hexdigest() == true_hashes['v1'][k] + + attr = cache.data_path(k) + assert attr['exists'] + + # now load the second version of the dataset + + cache.load_manifest('manifest_2.json') + assert cache.version == 'B' + + # metadata2.csv should not exist in this version of the dataset + with pytest.raises(ValueError) as context: + cache.get_metadata('metadata2.csv') + assert 'is not in self.metadata_file_names' in context.value.args[0] + + # check that metadata1 has expected contents + m1 = cache.get_metadata('metadata1.csv') + assert metadata1_v2.equals(m1) + + # data3 should not exist in this version of the dataset + with pytest.raises(ValueError) as context: + _ = cache.download_data('data3') + assert 'not a data file listed' in context.value.args[0] + + with pytest.raises(ValueError) as context: + _ = cache.data_path('data3') + assert 'not a data file listed' in context.value.args[0] + + # check that data1, data2 have expected hashes + for k in ('data1', 'data2'): + attr = cache.data_path(k) + assert not attr['exists'] + + local_path = cache.download_data(k) + assert local_path.exists() + hasher = hashlib.blake2b() + with open(local_path, 'rb') as in_file: + hasher.update(in_file.read()) + assert hasher.hexdigest() == true_hashes['v2'][k] + + attr = cache.data_path(k) + assert attr['exists'] diff --git a/allensdk/test/api/cloud_cache/test_manifest.py b/allensdk/test/api/cloud_cache/test_manifest.py new file mode 100644 index 000000000..daa2f523f --- /dev/null +++ b/allensdk/test/api/cloud_cache/test_manifest.py @@ -0,0 +1,172 @@ +import pytest +import json +import pathlib +from allensdk.internal.core.lims_utilities import safe_system_path +from allensdk.api.cloud_cache.manifest import Manifest +from allensdk.api.cloud_cache.file_attributes import CacheFileAttributes # noqa: E501 + + +@pytest.fixture +def meta_json_path(tmpdir): + jpath = tmpdir / "somejson.json" + d = { + "project_name": "X", + "manifest_version": "Y", + "metadata_file_id_column_name": "Z", + "data_pipeline": "ZA", + "metadata_files": ["ZB", "ZC", "ZD"], + "data_files": {"AB": "ab", "BC": "bc", "CD": "cd"}} + with open(jpath, "w") as f: + json.dump(d, f) + yield jpath + + +def test_constructor(meta_json_path): + """ + Make sure that the Manifest class __init__ runs and + raises an error if you give it an unexpected cache_dir + """ + Manifest('my/cache/dir', meta_json_path) + Manifest(pathlib.Path('my/other/cache/dir'), meta_json_path) + with pytest.raises(ValueError, match=r"cache_dir must be either a str.*"): + Manifest(1234.2, meta_json_path) + + +def test_create_file_attributes(meta_json_path): + """ + Test that Manifest._create_file_attributes correctly + handles input parameters (this is mostly a test of + local_path generation) + """ + mfest = Manifest('/my/cache/dir', 
meta_json_path) + attr = mfest._create_file_attributes('http://my.url.com/path/to/file.txt', + '12345', + 'aaabbbcccddd') + + assert isinstance(attr, CacheFileAttributes) + assert attr.url == 'http://my.url.com/path/to/file.txt' + assert attr.version_id == '12345' + assert attr.file_hash == 'aaabbbcccddd' + expected_path = '/my/cache/dir/X-Y/to/file.txt' + assert attr.local_path == pathlib.Path(expected_path).resolve() + + +@pytest.fixture +def manifest_for_metadata(tmpdir): + jpath = tmpdir / "a_manifest.json" + manifest = {} + metadata_files = {} + metadata_files['a.txt'] = {'url': 'http://my.url.com/path/to/a.txt', + 'version_id': '12345', + 'file_hash': 'abcde'} + metadata_files['b.txt'] = {'url': 'http://my.other.url.com/different/path/to/b.txt', # noqa: E501 + 'version_id': '67890', + 'file_hash': 'fghijk'} + + manifest['metadata_files'] = metadata_files + manifest['project_name'] = "some-project" + manifest['manifest_version'] = '000' + manifest['metadata_file_id_column_name'] = 'file_id' + manifest['data_pipeline'] = 'placeholder' + with open(jpath, "w") as f: + json.dump(manifest, f) + yield jpath + + +def test_metadata_file_attributes(manifest_for_metadata): + """ + Test that Manifest.metadata_file_attributes returns the + correct CacheFileAttributes object and raises the correct + error when you ask for a metadata file that does not exist + """ + + mfest = Manifest('/my/cache/dir/', manifest_for_metadata) + + a_obj = mfest.metadata_file_attributes('a.txt') + assert a_obj.url == 'http://my.url.com/path/to/a.txt' + assert a_obj.version_id == '12345' + assert a_obj.file_hash == 'abcde' + expected = safe_system_path('/my/cache/dir/some-project-000/to/a.txt') + expected = pathlib.Path(expected).resolve() + assert a_obj.local_path == expected + + b_obj = mfest.metadata_file_attributes('b.txt') + assert b_obj.url == 'http://my.other.url.com/different/path/to/b.txt' + assert b_obj.version_id == '67890' + assert b_obj.file_hash == 'fghijk' + expected = safe_system_path('/my/cache/dir/some-project-000/path/to/b.txt') + expected = pathlib.Path(expected).resolve() + assert b_obj.local_path == expected + + # test that the correct error is raised when you ask + # for a metadata file that does not exist + + with pytest.raises(ValueError) as context: + _ = mfest.metadata_file_attributes('c.txt') + msg = "c.txt\nis not in self.metadata_file_names" + assert msg in context.value.args[0] + + +@pytest.fixture +def manifest_with_data(tmpdir): + jpath = tmpdir / "manifest_with files.json" + manifest = {} + manifest['metadata_files'] = {} + manifest['manifest_version'] = '0' + manifest['project_name'] = "myproject" + manifest['metadata_file_id_column_name'] = 'file_id' + manifest['data_pipeline'] = 'placeholder' + data_files = {} + data_files['a'] = {'url': 'http://my.url.com/myproject/path/to/a.nwb', + 'version_id': '12345', + 'file_hash': 'abcde'} + data_files['b'] = {'url': 'http://my.other.url.com/different/path/b.nwb', + 'version_id': '67890', + 'file_hash': 'fghijk'} + manifest['data_files'] = data_files + with open(jpath, "w") as f: + json.dump(manifest, f) + yield jpath + + +def test_data_file_attributes(manifest_with_data): + """ + Test that Manifest.data_file_attributes returns the correct + CacheFileAttributes object and raises the correct error when + you ask for a data file that does not exist + """ + mfest = Manifest('/my/cache/dir', manifest_with_data) + + a_obj = mfest.data_file_attributes('a') + assert a_obj.url == 'http://my.url.com/myproject/path/to/a.nwb' + assert a_obj.version_id 
== '12345'
+    assert a_obj.file_hash == 'abcde'
+    expected = safe_system_path('/my/cache/dir/myproject-0/path/to/a.nwb')
+    assert a_obj.local_path == pathlib.Path(expected).resolve()
+
+    b_obj = mfest.data_file_attributes('b')
+    assert b_obj.url == 'http://my.other.url.com/different/path/b.nwb'
+    assert b_obj.version_id == '67890'
+    assert b_obj.file_hash == 'fghijk'
+    expected = safe_system_path('/my/cache/dir/myproject-0/path/b.nwb')
+    assert b_obj.local_path == pathlib.Path(expected).resolve()
+
+    with pytest.raises(ValueError) as context:
+        _ = mfest.data_file_attributes('c')
+    msg = "file_id: c\nIs not a data file listed in manifest:"
+    assert msg in context.value.args[0]
+
+
+def test_file_attribute_errors(meta_json_path):
+    """
+    Test that Manifest raises the correct error if you try to get file
+    attributes before loading a manifest.json
+    """
+    mfest = Manifest("/my/cache/dir", meta_json_path)
+    with pytest.raises(ValueError,
+                       match=r".* not in self.metadata_file_names"):
+        mfest.metadata_file_attributes('some_file.txt')
+
+    with pytest.raises(ValueError,
+                       match=r".* not a data file listed in manifest"):
+        mfest.data_file_attributes('other_file.txt')
diff --git a/allensdk/test/api/cloud_cache/test_utils.py b/allensdk/test/api/cloud_cache/test_utils.py
new file mode 100644
index 000000000..e0a6f8608
--- /dev/null
+++ b/allensdk/test/api/cloud_cache/test_utils.py
@@ -0,0 +1,53 @@
+import pytest
+import hashlib
+import numpy as np
+import allensdk.api.cloud_cache.utils as utils
+
+
+def test_bucket_name_from_url():
+
+    url = 'https://dummy_bucket.s3.amazonaws.com/txt_file.txt?versionId="jklaafdaerew"'  # noqa: E501
+    bucket_name = utils.bucket_name_from_url(url)
+    assert bucket_name == "dummy_bucket"
+
+    url = 'https://dummy_bucket2.s3-us-west-3.amazonaws.com/txt_file.txt?versionId="jklaafdaerew"'  # noqa: E501
+    bucket_name = utils.bucket_name_from_url(url)
+    assert bucket_name == "dummy_bucket2"
+
+    url = 'https://dummy_bucket/txt_file.txt?versionId="jklaafdaerew"'
+    with pytest.warns(UserWarning):
+        bucket_name = utils.bucket_name_from_url(url)
+    assert bucket_name is None
+
+    # make sure we are actually detecting '.'
in .amazonaws.com + url = 'https://dummy_bucket2.s3-us-west-3XamazonawsYcom/txt_file.txt?versionId="jklaafdaerew"' # noqa: E501 + with pytest.warns(UserWarning): + bucket_name = utils.bucket_name_from_url(url) + assert bucket_name is None + + +def test_relative_path_from_url(): + url = 'https://dummy_bucket.s3.amazonaws.com/my/dir/txt_file.txt?versionId="jklaafdaerew"' # noqa: E501 + relative_path = utils.relative_path_from_url(url) + assert relative_path == 'my/dir/txt_file.txt' + + +def test_file_hash_from_path(tmpdir): + + rng = np.random.RandomState(881) + alphabet = list('abcdefghijklmnopqrstuvwxyz') + fname = tmpdir / 'hash_dummy.txt' + with open(fname, 'w') as out_file: + for ii in range(10): + out_file.write(''.join(rng.choice(alphabet, size=10))) + out_file.write('\n') + + hasher = hashlib.blake2b() + with open(fname, 'rb') as in_file: + chunk = in_file.read(7) + while len(chunk) > 0: + hasher.update(chunk) + chunk = in_file.read(7) + + ans = utils.file_hash_from_path(fname) + assert ans == hasher.hexdigest() diff --git a/allensdk/test/api/cloud_cache/test_windows_isilon_paths.py b/allensdk/test/api/cloud_cache/test_windows_isilon_paths.py new file mode 100644 index 000000000..ac656a052 --- /dev/null +++ b/allensdk/test/api/cloud_cache/test_windows_isilon_paths.py @@ -0,0 +1,70 @@ +import re +import json +from pathlib import Path + +from allensdk.api.cloud_cache.cloud_cache import CloudCacheBase +from allensdk.api.cloud_cache.manifest import Manifest + + +def test_windows_path_to_isilon(monkeypatch, tmpdir): + """ + This test is just meant to verify on Windows CI instances + that, if a path to the `/allen/` shared file store is used as + cache_dir, the path to files will come out useful (i.e. without any + spurious C:/ prepended as in AllenSDK issue #1964 + """ + + cache_dir = Path(tmpdir) + + manifest_1 = {'manifest_version': '1', + 'metadata_file_id_column_name': 'file_id', + 'data_pipeline': 'placeholder', + 'project_name': 'my-project', + 'metadata_files': {'a.csv': {'url': 'http://www.junk.com/path/to/a.csv', # noqa: E501 + 'version_id': '1111', + 'file_hash': 'abcde'}, + 'b.csv': {'url': 'http://silly.com/path/to/b.csv', # noqa: E501 + 'version_id': '2222', + 'file_hash': 'fghijk'}}, + 'data_files': {'data_1': {'url': 'http://www.junk.com/data/path/data.csv', # noqa: E501 + 'version_id': '1111', + 'file_hash': 'lmnopqrst'}} + } + manifest_path = tmpdir / "manifest.json" + with open(manifest_path, "w") as f: + json.dump(manifest_1, f) + + def dummy_file_exists(self, m): + return True + + # we do not want paths to `/allen` to be resolved to + # a local drive on the user's machine + bad_windows_pattern = re.compile('^[A-Z]\:') # noqa: W605 + + # make sure pattern is correctly formulated + m = bad_windows_pattern.search('C:\\a\windows\path') # noqa: W605 + assert m is not None + + with monkeypatch.context() as ctx: + class TestCloudCache(CloudCacheBase): + + def _download_file(self, m, o): + pass + + def _download_manifest(self, m, o): + pass + + def _list_all_manifests(self): + pass + + ctx.setattr(TestCloudCache, + '_file_exists', + dummy_file_exists) + + cache = TestCloudCache(cache_dir, 'proj') + cache._manifest = Manifest(cache_dir, json_input=manifest_path) + + m_path = cache.metadata_path('a.csv') + assert bad_windows_pattern.match(str(m_path)) is None + d_path = cache.data_path('data_1') + assert bad_windows_pattern.match(str(d_path)) is None diff --git a/allensdk/test/api/test_cache.py b/allensdk/test/api/test_cache.py index 78e1e5e35..902f92c46 100755 --- 
a/allensdk/test/api/test_cache.py +++ b/allensdk/test/api/test_cache.py @@ -43,7 +43,7 @@ import pytest from mock import MagicMock, mock_open, patch -from allensdk.api.cache import Cache, memoize, get_default_manifest_file +from allensdk.api.warehouse_cache.cache import Cache, memoize, get_default_manifest_file from allensdk.api.queries.rma_api import RmaApi import allensdk.core.json_utilities as ju from allensdk.config.manifest import ManifestVersionError diff --git a/allensdk/test/api/test_cacheable.py b/allensdk/test/api/test_cacheable.py index 0e5a17d1b..8afb1f590 100644 --- a/allensdk/test/api/test_cacheable.py +++ b/allensdk/test/api/test_cacheable.py @@ -35,7 +35,7 @@ # import pytest from mock import MagicMock, patch, mock_open -from allensdk.api.cache import Cache, cacheable +from allensdk.api.warehouse_cache.cache import Cache, cacheable from allensdk.api.queries.rma_api import RmaApi import pandas as pd from six.moves import builtins @@ -314,4 +314,4 @@ def get_hemispheres(): assert not ju_read_url_get.called read_csv.assert_called_once_with('/xyz/abc/example.csv', parse_dates=True) assert not ju_write.called, 'json write should not have been called' - assert not ju_read.called, 'json read should not have been called' \ No newline at end of file + assert not ju_read.called, 'json read should not have been called' diff --git a/allensdk/test/api/test_caching_utilities.py b/allensdk/test/api/test_caching_utilities.py index ed95bb00f..96fc5bdf7 100644 --- a/allensdk/test/api/test_caching_utilities.py +++ b/allensdk/test/api/test_caching_utilities.py @@ -5,7 +5,7 @@ import pytest import pandas as pd -from allensdk.api import caching_utilities as cu +from allensdk.api.warehouse_cache import caching_utilities as cu def get_data(): diff --git a/allensdk/test/api/test_file_download.py b/allensdk/test/api/test_file_download.py index 780e6d244..3241bba1b 100644 --- a/allensdk/test/api/test_file_download.py +++ b/allensdk/test/api/test_file_download.py @@ -35,7 +35,7 @@ # import pytest from mock import Mock, patch -from allensdk.api.cache import cacheable, Cache +from allensdk.api.warehouse_cache.cache import cacheable, Cache from allensdk.config.manifest import Manifest import allensdk.core.json_utilities as ju import pandas.io.json as pj diff --git a/allensdk/test/api/test_pager.py b/allensdk/test/api/test_pager.py index ac6eb9a6d..68399e014 100644 --- a/allensdk/test/api/test_pager.py +++ b/allensdk/test/api/test_pager.py @@ -44,7 +44,7 @@ import os import simplejson as json from allensdk.api.queries.rma_template import RmaTemplate -from allensdk.api.cache import cacheable, Cache +from allensdk.api.warehouse_cache.cache import cacheable, Cache try: import StringIO except: diff --git a/allensdk/test/brain_observatory/behavior/resources/project_metadata/expected/behavior_session_table.pkl b/allensdk/test/brain_observatory/behavior/resources/project_metadata/expected/behavior_session_table.pkl new file mode 100644 index 000000000..f8e18e967 Binary files /dev/null and b/allensdk/test/brain_observatory/behavior/resources/project_metadata/expected/behavior_session_table.pkl differ diff --git a/allensdk/test/brain_observatory/behavior/resources/project_metadata/expected/ophys_experiment_table.pkl b/allensdk/test/brain_observatory/behavior/resources/project_metadata/expected/ophys_experiment_table.pkl new file mode 100644 index 000000000..f9610db63 Binary files /dev/null and b/allensdk/test/brain_observatory/behavior/resources/project_metadata/expected/ophys_experiment_table.pkl differ diff --git 
a/allensdk/test/brain_observatory/behavior/resources/project_metadata/expected/ophys_session_table.pkl b/allensdk/test/brain_observatory/behavior/resources/project_metadata/expected/ophys_session_table.pkl new file mode 100644 index 000000000..b4957ad48 Binary files /dev/null and b/allensdk/test/brain_observatory/behavior/resources/project_metadata/expected/ophys_session_table.pkl differ diff --git a/allensdk/test/brain_observatory/behavior/resources/project_metadata_writer/expected/behavior_session_table.pkl b/allensdk/test/brain_observatory/behavior/resources/project_metadata_writer/expected/behavior_session_table.pkl new file mode 100644 index 000000000..10c68924b Binary files /dev/null and b/allensdk/test/brain_observatory/behavior/resources/project_metadata_writer/expected/behavior_session_table.pkl differ diff --git a/allensdk/test/brain_observatory/behavior/resources/project_metadata_writer/expected/ophys_experiment_table.pkl b/allensdk/test/brain_observatory/behavior/resources/project_metadata_writer/expected/ophys_experiment_table.pkl new file mode 100644 index 000000000..c9895c479 Binary files /dev/null and b/allensdk/test/brain_observatory/behavior/resources/project_metadata_writer/expected/ophys_experiment_table.pkl differ diff --git a/allensdk/test/brain_observatory/behavior/resources/project_metadata_writer/expected/ophys_session_table.pkl b/allensdk/test/brain_observatory/behavior/resources/project_metadata_writer/expected/ophys_session_table.pkl new file mode 100644 index 000000000..02517b9b8 Binary files /dev/null and b/allensdk/test/brain_observatory/behavior/resources/project_metadata_writer/expected/ophys_session_table.pkl differ diff --git a/allensdk/test/brain_observatory/behavior/test_behavior_lims_api.py b/allensdk/test/brain_observatory/behavior/test_behavior_lims_api.py index 85459783d..8ea3e45ab 100644 --- a/allensdk/test/brain_observatory/behavior/test_behavior_lims_api.py +++ b/allensdk/test/brain_observatory/behavior/test_behavior_lims_api.py @@ -307,7 +307,7 @@ def test_behavior_uuid_regression(self): def test_container_id_regression(self): assert (self.bd.extractor.ophys_container_id - == self.od.extractor.get_experiment_container_id()) + == self.od.extractor.get_ophys_container_id()) def test_stimulus_frame_rate_regression(self): assert (self.bd.get_stimulus_frame_rate() diff --git a/allensdk/test/brain_observatory/behavior/test_behavior_metadata.py b/allensdk/test/brain_observatory/behavior/test_behavior_metadata.py index b0862da6b..59c5b841e 100644 --- a/allensdk/test/brain_observatory/behavior/test_behavior_metadata.py +++ b/allensdk/test/brain_observatory/behavior/test_behavior_metadata.py @@ -12,7 +12,211 @@ @pytest.mark.parametrize("data, expected", - [pytest.param({ # noqa: E128 + [pytest.param({ # noqa: E128 + "items": { + "behavior": { + "config": { + "DoC": { + "blank_duration_range": ( + 0.5, 0.6), + "response_window": [0.15, 0.75], + "change_time_dist": "geometric", + "auto_reward_volume": 0.002, + }, + "reward": { + "reward_volume": 0.007, + }, + "behavior": { + "task_id": "DoC_untranslated", + }, + }, + "params": { + "stage": "TRAINING_3_images_A", + "flash_omit_probability": 0.05 + }, + "stimuli": { + "images": {"draw_log": [1] * 10, + "flash_interval_sec": [ + 0.32, -1.0]} + }, + } + } + }, + { + "blank_duration_sec": [0.5, 0.6], + "stimulus_duration_sec": 0.32, + "omitted_flash_fraction": 0.05, + "response_window_sec": [0.15, 0.75], + "reward_volume": 0.007, + "session_type": "TRAINING_3_images_A", + "stimulus": "images", + 
"stimulus_distribution": "geometric", + "task": "change detection", + "n_stimulus_frames": 10, + "auto_reward_volume": 0.002 + }, id='basic'), + pytest.param({ + "items": { + "behavior": { + "config": { + "DoC": { + "blank_duration_range": ( + 0.5, 0.5), + "response_window": [0.15, + 0.75], + "change_time_dist": + "geometric", + "auto_reward_volume": 0.002 + }, + "reward": { + "reward_volume": 0.007, + }, + "behavior": { + "task_id": "DoC_untranslated", + }, + }, + "params": { + "stage": "TRAINING_3_images_A", + "flash_omit_probability": 0.05 + }, + "stimuli": { + "images": {"draw_log": [1] * 10, + "flash_interval_sec": [ + 0.32, -1.0]} + }, + } + } + }, + { + "blank_duration_sec": [0.5, 0.5], + "stimulus_duration_sec": 0.32, + "omitted_flash_fraction": 0.05, + "response_window_sec": [0.15, 0.75], + "reward_volume": 0.007, + "session_type": "TRAINING_3_images_A", + "stimulus": "images", + "stimulus_distribution": "geometric", + "task": "change detection", + "n_stimulus_frames": 10, + "auto_reward_volume": 0.002 + }, id='single_value_blank_duration'), + pytest.param({ + "items": { + "behavior": { + "config": { + "DoC": { + "blank_duration_range": ( + 0.5, 0.5), + "response_window": [0.15, + 0.75], + "change_time_dist": + "geometric", + "auto_reward_volume": 0.002 + }, + "reward": { + "reward_volume": 0.007, + }, + "behavior": { + "task_id": "DoC_untranslated", + }, + }, + "params": { + "stage": "TRAINING_3_images_A", + "flash_omit_probability": 0.05 + }, + "stimuli": { + "grating": {"draw_log": [1] * 10, + "flash_interval_sec": + [0.34, -1.0]} + }, + } + } + }, + { + "blank_duration_sec": [0.5, 0.5], + "stimulus_duration_sec": 0.34, + "omitted_flash_fraction": 0.05, + "response_window_sec": [0.15, 0.75], + "reward_volume": 0.007, + "session_type": "TRAINING_3_images_A", + "stimulus": "grating", + "stimulus_distribution": "geometric", + "task": "change detection", + "n_stimulus_frames": 10, + "auto_reward_volume": 0.002 + }, id='stimulus_duration_from_grating'), + pytest.param({ + "items": { + "behavior": { + "config": { + "DoC": { + "blank_duration_range": ( + 0.5, 0.5), + "response_window": [0.15, + 0.75], + "change_time_dist": + "geometric", + "auto_reward_volume": 0.002 + }, + "reward": { + "reward_volume": 0.007, + }, + "behavior": { + "task_id": "DoC_untranslated", + }, + }, + "params": { + "stage": "TRAINING_3_images_A", + "flash_omit_probability": 0.05 + }, + "stimuli": { + "grating": { + "draw_log": [1] * 10, + "flash_interval_sec": None} + }, + } + } + }, + { + "blank_duration_sec": [0.5, 0.5], + "stimulus_duration_sec": np.NaN, + "omitted_flash_fraction": 0.05, + "response_window_sec": [0.15, 0.75], + "reward_volume": 0.007, + "session_type": "TRAINING_3_images_A", + "stimulus": "grating", + "stimulus_distribution": "geometric", + "task": "change detection", + "n_stimulus_frames": 10, + "auto_reward_volume": 0.002 + }, id='stimulus_duration_none') + ] + ) +def test_get_task_parameters(data, expected): + actual = get_task_parameters(data) + for k, v in actual.items(): + # Special nan checking since pytest doesn't do it well + try: + if np.isnan(v): + assert np.isnan(expected[k]) + else: + assert expected[k] == v + except (TypeError, ValueError): + assert expected[k] == v + + actual_keys = list(actual.keys()) + actual_keys.sort() + expected_keys = list(expected.keys()) + expected_keys.sort() + assert actual_keys == expected_keys + + +def test_get_task_parameters_task_id_exception(): + """ + Test that, when task_id has an unexpected value, + get_task_parameters throws the correct 
exception + """ + input_data = { "items": { "behavior": { "config": { @@ -20,54 +224,13 @@ "blank_duration_range": (0.5, 0.6), "response_window": [0.15, 0.75], "change_time_dist": "geometric", - "auto_reward_volume": 0.002, - }, - "reward": { - "reward_volume": 0.007, - }, - "behavior": { - "task_id": "DoC_untranslated", - }, - }, - "params": { - "stage": "TRAINING_3_images_A", - "flash_omit_probability": 0.05 - }, - "stimuli": { - "images": {"draw_log": [1]*10, - "flash_interval_sec": [0.32, -1.0]} - }, - } - } - }, - { - "blank_duration_sec": [0.5, 0.6], - "stimulus_duration_sec": 0.32, - "omitted_flash_fraction": 0.05, - "response_window_sec": [0.15, 0.75], - "reward_volume": 0.007, - "session_type": "TRAINING_3_images_A", - "stimulus": "images", - "stimulus_distribution": "geometric", - "task": "change detection", - "n_stimulus_frames": 10, - "auto_reward_volume": 0.002 - }, id='basic'), - pytest.param({ - "items": { - "behavior": { - "config": { - "DoC": { - "blank_duration_range": (0.5, 0.5), - "response_window": [0.15, 0.75], - "change_time_dist": "geometric", "auto_reward_volume": 0.002 }, "reward": { "reward_volume": 0.007, }, "behavior": { - "task_id": "DoC_untranslated", + "task_id": "junk", }, }, "params": { @@ -75,72 +238,29 @@ "flash_omit_probability": 0.05 }, "stimuli": { - "images": {"draw_log": [1]*10, + "images": {"draw_log": [1] * 10, "flash_interval_sec": [0.32, -1.0]} }, } } - }, - { - "blank_duration_sec": [0.5, 0.5], - "stimulus_duration_sec": 0.32, - "omitted_flash_fraction": 0.05, - "response_window_sec": [0.15, 0.75], - "reward_volume": 0.007, - "session_type": "TRAINING_3_images_A", - "stimulus": "images", - "stimulus_distribution": "geometric", - "task": "change detection", - "n_stimulus_frames": 10, - "auto_reward_volume": 0.002 - }, id='single_value_blank_duration'), - pytest.param({ - "items": { - "behavior": { - "config": { - "DoC": { - "blank_duration_range": (0.5, 0.5), - "response_window": [0.15, 0.75], - "change_time_dist": "geometric", - "auto_reward_volume": 0.002 - }, - "reward": { - "reward_volume": 0.007, - }, - "behavior": { - "task_id": "DoC_untranslated", - }, - }, - "params": { - "stage": "TRAINING_3_images_A", - "flash_omit_probability": 0.05 - }, - "stimuli": { - "grating": {"draw_log": [1]*10, - "flash_interval_sec": [0.34, -1.0]} - }, - } - } - }, - { - "blank_duration_sec": [0.5, 0.5], - "stimulus_duration_sec": 0.34, - "omitted_flash_fraction": 0.05, - "response_window_sec": [0.15, 0.75], - "reward_volume": 0.007, - "session_type": "TRAINING_3_images_A", - "stimulus": "grating", - "stimulus_distribution": "geometric", - "task": "change detection", - "n_stimulus_frames": 10, - "auto_reward_volume": 0.002 - }, id='stimulus_duration_from_grating'), - pytest.param({ + } + + with pytest.raises(RuntimeError) as error: + _ = get_task_parameters(input_data) + assert "does not know how to parse 'task_id'" in error.value.args[0] + + +def test_get_task_parameters_flash_duration_exception(): + """ + Test that, when 'images' or 'grating' not present in 'stimuli', + get_task_parameters throws the correct exception + """ + input_data = { "items": { "behavior": { "config": { "DoC": { - "blank_duration_range": (0.5, 0.5), + "blank_duration_range": (0.5, 0.6), "response_window": [0.15, 0.75], "change_time_dist": "geometric", "auto_reward_volume": 0.002 @@ -149,7 +269,7 @@ "reward_volume": 0.007, }, "behavior": { - "task_id": "DoC_untranslated", + "task_id": "DoC", }, }, "params": { @@ -157,118 +277,12 @@ "flash_omit_probability": 0.05 }, "stimuli": { - 
"grating": {"draw_log": [1]*10, - "flash_interval_sec": None} + "junk": {"draw_log": [1] * 10, + "flash_interval_sec": [0.32, -1.0]} }, } } - }, - { - "blank_duration_sec": [0.5, 0.5], - "stimulus_duration_sec": np.NaN, - "omitted_flash_fraction": 0.05, - "response_window_sec": [0.15, 0.75], - "reward_volume": 0.007, - "session_type": "TRAINING_3_images_A", - "stimulus": "grating", - "stimulus_distribution": "geometric", - "task": "change detection", - "n_stimulus_frames": 10, - "auto_reward_volume": 0.002 - }, id='stimulus_duration_none') - ] -) -def test_get_task_parameters(data, expected): - actual = get_task_parameters(data) - for k, v in actual.items(): - # Special nan checking since pytest doesn't do it well - try: - if np.isnan(v): - assert np.isnan(expected[k]) - else: - assert expected[k] == v - except (TypeError, ValueError): - assert expected[k] == v - - actual_keys = list(actual.keys()) - actual_keys.sort() - expected_keys = list(expected.keys()) - expected_keys.sort() - assert actual_keys == expected_keys - - -def test_get_task_parameters_task_id_exception(): - """ - Test that, when task_id has an unexpected value, - get_task_parameters throws the correct exception - """ - input_data = { - "items": { - "behavior": { - "config": { - "DoC": { - "blank_duration_range": (0.5, 0.6), - "response_window": [0.15, 0.75], - "change_time_dist": "geometric", - "auto_reward_volume": 0.002 - }, - "reward": { - "reward_volume": 0.007, - }, - "behavior": { - "task_id": "junk", - }, - }, - "params": { - "stage": "TRAINING_3_images_A", - "flash_omit_probability": 0.05 - }, - "stimuli": { - "images": {"draw_log": [1]*10, - "flash_interval_sec": [0.32, -1.0]} - }, - } - } - } - - with pytest.raises(RuntimeError) as error: - _ = get_task_parameters(input_data) - assert "does not know how to parse 'task_id'" in error.value.args[0] - - -def test_get_task_parameters_flash_duration_exception(): - """ - Test that, when 'images' or 'grating' not present in 'stimuli', - get_task_parameters throws the correct exception - """ - input_data = { - "items": { - "behavior": { - "config": { - "DoC": { - "blank_duration_range": (0.5, 0.6), - "response_window": [0.15, 0.75], - "change_time_dist": "geometric", - "auto_reward_volume": 0.002 - }, - "reward": { - "reward_volume": 0.007, - }, - "behavior": { - "task_id": "DoC", - }, - }, - "params": { - "stage": "TRAINING_3_images_A", - "flash_omit_probability": 0.05 - }, - "stimuli": { - "junk": {"draw_log": [1]*10, - "flash_interval_sec": [0.32, -1.0]} - }, - } - } - } + } with pytest.raises(RuntimeError) as error: _ = get_task_parameters(input_data) @@ -349,14 +363,20 @@ def full_genotype(self): metadata = BehaviorMetadata() - assert metadata.cre_line is None + with pytest.warns(UserWarning) as record: + cre_line = metadata.cre_line + assert cre_line is None + assert str(record[0].message) == 'Unable to parse cre_line from ' \ + 'full_genotype' def test_reporter_line(monkeypatch): """Test that reporter line properly parsed from list""" + class MockExtractor: def get_reporter_line(self): return ['foo'] + extractor = MockExtractor() with monkeypatch.context() as ctx: @@ -374,9 +394,11 @@ def dummy_init(self): def test_reporter_line_str(monkeypatch): """Test that reporter line returns itself if str""" + class MockExtractor: def get_reporter_line(self): return 'foo' + extractor = MockExtractor() with monkeypatch.context() as ctx: @@ -392,11 +414,21 @@ def dummy_init(self): assert metadata.reporter_line == 'foo' -def test_reporter_line_multiple(monkeypatch): - """Test 
that if multiple reporter lines, the first is returned""" +@pytest.mark.parametrize("input_reporter_line, warning_msg, expected", ( + (('foo', 'bar'), 'More than 1 reporter line. ' + 'Returning the first one', 'foo'), + (None, 'Error parsing reporter line. It is null.', None), + ([], 'Error parsing reporter line. The array is empty', None) +) + ) +def test_reporter_edge_cases(monkeypatch, input_reporter_line, warning_msg, + expected): + """Test reporter line edge cases""" + class MockExtractor: def get_reporter_line(self): - return ['foo', 'bar'] + return input_reporter_line + extractor = MockExtractor() with monkeypatch.context() as ctx: @@ -406,17 +438,22 @@ def dummy_init(self): ctx.setattr(BehaviorMetadata, '__init__', dummy_init) - metadata = BehaviorMetadata() - assert metadata.reporter_line == 'foo' + with pytest.warns(UserWarning) as record: + reporter_line = metadata.reporter_line + + assert reporter_line == expected + assert str(record[0].message) == warning_msg def test_age_in_days(monkeypatch): """Test that age_in_days properly parsed from age""" + class MockExtractor: def get_age(self): return 'P123' + extractor = MockExtractor() with monkeypatch.context() as ctx: @@ -432,51 +469,21 @@ def dummy_init(self): assert metadata.age_in_days == 123 -def test_age_in_days_unkown_age(monkeypatch): - """Test age in days is None if age is unknown""" - class MockExtractor: - def get_age(self): - return 'unkown' - extractor = MockExtractor() - - with monkeypatch.context() as ctx: - def dummy_init(self): - self._extractor = extractor - - ctx.setattr(BehaviorMetadata, - '__init__', - dummy_init) - - metadata = BehaviorMetadata() - - assert metadata.age_in_days is None - +@pytest.mark.parametrize("input_age, warning_msg, expected", ( + ('unkown', 'Could not parse numeric age from age code ' + '(age code does not start with "P")', None), + ('P', 'Could not parse numeric age from age code ' + '(no numeric values found in age code)', None) +) + ) +def test_age_in_days_edge_cases(monkeypatch, input_age, warning_msg, + expected): + """Test age in days edge cases""" -def test_age_in_days_invalid_age(monkeypatch): - """Test that age_in_days is None if age not prefixed with P""" class MockExtractor: def get_age(self): - return 'Q123' - extractor = MockExtractor() - - with monkeypatch.context() as ctx: - def dummy_init(self): - self._extractor = extractor - - ctx.setattr(BehaviorMetadata, - '__init__', - dummy_init) - - metadata = BehaviorMetadata() + return input_age - assert metadata.age_in_days is None - - -def test_reporter_line_no_reporter_line(monkeypatch): - """Test that if no reporter line, returns None""" - class MockExtractor: - def get_reporter_line(self): - return [] extractor = MockExtractor() with monkeypatch.context() as ctx: @@ -489,55 +496,52 @@ def dummy_init(self): metadata = BehaviorMetadata() - assert metadata.reporter_line is None + with pytest.warns(UserWarning) as record: + age_in_days = metadata.age_in_days + + assert age_in_days is None + assert str(record[0].message) == warning_msg @pytest.mark.parametrize("test_params, expected_warn_msg", [ # Vanilla test case ({ - "extractor_expt_date": datetime.strptime("2021-03-14 03:14:15", - "%Y-%m-%d %H:%M:%S"), - "pkl_expt_date": datetime.strptime("2021-03-14 03:14:15", - "%Y-%m-%d %H:%M:%S"), - "behavior_session_id": 1 - }, - None - ), + "extractor_expt_date": datetime.strptime("2021-03-14 03:14:15", + "%Y-%m-%d %H:%M:%S"), + "pkl_expt_date": datetime.strptime("2021-03-14 03:14:15", + "%Y-%m-%d %H:%M:%S"), + "behavior_session_id": 1 
+ }, None), # pkl expt date stored in unix format ({ - "extractor_expt_date": datetime.strptime("2021-03-14 03:14:15", - "%Y-%m-%d %H:%M:%S"), - "pkl_expt_date": 1615716855.0, - "behavior_session_id": 2 - }, - None - ), + "extractor_expt_date": datetime.strptime("2021-03-14 03:14:15", + "%Y-%m-%d %H:%M:%S"), + "pkl_expt_date": 1615716855.0, + "behavior_session_id": 2 + }, None), # Extractor and pkl dates differ significantly ({ - "extractor_expt_date": datetime.strptime("2021-03-14 03:14:15", - "%Y-%m-%d %H:%M:%S"), - "pkl_expt_date": datetime.strptime("2021-03-14 20:14:15", - "%Y-%m-%d %H:%M:%S"), - "behavior_session_id": 3 + "extractor_expt_date": datetime.strptime("2021-03-14 03:14:15", + "%Y-%m-%d %H:%M:%S"), + "pkl_expt_date": datetime.strptime("2021-03-14 20:14:15", + "%Y-%m-%d %H:%M:%S"), + "behavior_session_id": 3 }, - "The `date_of_acquisition` field in LIMS *" - ), + "The `date_of_acquisition` field in LIMS *"), # pkl file contains an unparseable datetime ({ - "extractor_expt_date": datetime.strptime("2021-03-14 03:14:15", - "%Y-%m-%d %H:%M:%S"), - "pkl_expt_date": None, - "behavior_session_id": 4 + "extractor_expt_date": datetime.strptime("2021-03-14 03:14:15", + "%Y-%m-%d %H:%M:%S"), + "pkl_expt_date": None, + "behavior_session_id": 4 }, - "Could not parse the acquisition datetime *" - ), + "Could not parse the acquisition datetime *"), ]) def test_get_date_of_acquisition(monkeypatch, tmp_path, test_params, expected_warn_msg): - mock_session_id = test_params["behavior_session_id"] pkl_save_path = tmp_path / f"mock_pkl_{mock_session_id}.pkl" @@ -581,3 +585,57 @@ def dummy_init(self, extractor, behavior_stimulus_file): obt_date = metadata.date_of_acquisition assert obt_date == extractor_expt_date + + +def test_indicator(monkeypatch): + """Test that indicator is parsed from full_genotype""" + + class MockExtractor: + def get_reporter_line(self): + return 'Ai148(TIT2L-GC6f-ICL-tTA2)' + + extractor = MockExtractor() + + with monkeypatch.context() as ctx: + def dummy_init(self): + self._extractor = extractor + + ctx.setattr(BehaviorMetadata, + '__init__', + dummy_init) + + metadata = BehaviorMetadata() + + assert metadata.indicator == 'GCaMP6f' + + +@pytest.mark.parametrize("input_reporter_line, warning_msg, expected", ( + (None, 'Error parsing reporter line. 
It is null.', None), + ('foo', 'Could not parse indicator from reporter because none' + 'of the expected substrings were found in the reporter', None) +) + ) +def test_indicator_edge_cases(monkeypatch, input_reporter_line, warning_msg, + expected): + """Test indicator parsing edge cases""" + + class MockExtractor: + def get_reporter_line(self): + return input_reporter_line + + extractor = MockExtractor() + + with monkeypatch.context() as ctx: + def dummy_init(self): + self._extractor = extractor + + ctx.setattr(BehaviorMetadata, + '__init__', + dummy_init) + + metadata = BehaviorMetadata() + + with pytest.warns(UserWarning) as record: + indicator = metadata.indicator + assert indicator is expected + assert str(record[0].message) == warning_msg diff --git a/allensdk/test/brain_observatory/behavior/test_behavior_ophys_session.py b/allensdk/test/brain_observatory/behavior/test_behavior_ophys_experiment.py similarity index 92% rename from allensdk/test/brain_observatory/behavior/test_behavior_ophys_session.py rename to allensdk/test/brain_observatory/behavior/test_behavior_ophys_experiment.py index 8fac18d77..8583d3bd5 100644 --- a/allensdk/test/brain_observatory/behavior/test_behavior_ophys_session.py +++ b/allensdk/test/brain_observatory/behavior/test_behavior_ophys_experiment.py @@ -8,8 +8,8 @@ from imageio import imread from unittest.mock import MagicMock -from allensdk.brain_observatory.behavior.behavior_ophys_session import \ - BehaviorOphysSession +from allensdk.brain_observatory.behavior.behavior_ophys_experiment import \ + BehaviorOphysExperiment from allensdk.brain_observatory.behavior.write_nwb.__main__ import \ BehaviorOphysJsonApi from allensdk.brain_observatory.behavior.session_apis.data_io import ( @@ -32,7 +32,7 @@ ]) def test_session_from_json(tmpdir_factory, session_data, get_expected, get_from_session): - session = BehaviorOphysSession(api=BehaviorOphysJsonApi(session_data)) + session = BehaviorOphysExperiment(api=BehaviorOphysJsonApi(session_data)) expected = get_expected(session_data) obtained = get_from_session(session) @@ -51,10 +51,10 @@ def test_nwb_end_to_end(tmpdir_factory): nwb_filepath = os.path.join(str(tmpdir_factory.mktemp(tmpdir)), 'nwbfile.nwb') - d1 = BehaviorOphysSession.from_lims(oeid) + d1 = BehaviorOphysExperiment.from_lims(oeid) BehaviorOphysNwbApi(nwb_filepath).save(d1) - d2 = BehaviorOphysSession(api=BehaviorOphysNwbApi(nwb_filepath)) + d2 = BehaviorOphysExperiment(api=BehaviorOphysNwbApi(nwb_filepath)) assert sessions_are_equal(d1, d2, reraise=True) @@ -62,7 +62,7 @@ def test_nwb_end_to_end(tmpdir_factory): @pytest.mark.nightly def test_visbeh_ophys_data_set(): ophys_experiment_id = 789359614 - data_set = BehaviorOphysSession.from_lims(ophys_experiment_id) + data_set = BehaviorOphysExperiment.from_lims(ophys_experiment_id) # TODO: need to improve testing here: # for _, row in data_set.roi_metrics.iterrows(): @@ -145,7 +145,7 @@ def test_visbeh_ophys_data_set(): def test_legacy_dff_api(): ophys_experiment_id = 792813858 api = BehaviorOphysLimsApi(ophys_experiment_id) - session = BehaviorOphysSession(api) + session = BehaviorOphysExperiment(api) _, dff_array = session.get_dff_traces() for csid in session.dff_traces.index.values: @@ -162,7 +162,7 @@ def test_legacy_dff_api(): pytest.param(792813858, 129) ]) def test_stimulus_presentations_omitted(ophys_experiment_id, number_omitted): - session = BehaviorOphysSession.from_lims(ophys_experiment_id) + session = BehaviorOphysExperiment.from_lims(ophys_experiment_id) df = session.stimulus_presentations assert 
df['omitted'].sum() == number_omitted @@ -175,7 +175,7 @@ def test_stimulus_presentations_omitted(ophys_experiment_id, number_omitted): ]) def test_trial_response_window_bounds_reward(ophys_experiment_id): api = BehaviorOphysLimsApi(ophys_experiment_id) - session = BehaviorOphysSession(api) + session = BehaviorOphysExperiment(api) response_window = session.task_parameters['response_window_sec'] for _, row in session.trials.iterrows(): @@ -202,7 +202,7 @@ def test_trial_response_window_bounds_reward(ophys_experiment_id): def test_eye_tracking(dilation_frames, z_threshold, eye_tracking_start_value): mock = MagicMock() mock.get_eye_tracking.return_value = pd.DataFrame([1, 2, 3]) - session = BehaviorOphysSession( + session = BehaviorOphysExperiment( api=mock, eye_tracking_z_threshold=z_threshold, eye_tracking_dilation_frames=dilation_frames) @@ -223,7 +223,7 @@ def test_eye_tracking(dilation_frames, z_threshold, eye_tracking_start_value): @pytest.mark.requires_bamboo def test_event_detection(): ophys_experiment_id = 789359614 - session = BehaviorOphysSession.from_lims( + session = BehaviorOphysExperiment.from_lims( ophys_experiment_id=ophys_experiment_id) events = session.events @@ -244,9 +244,9 @@ def test_event_detection(): @pytest.mark.requires_bamboo -def test_BehaviorOphysSession_property_data(): +def test_BehaviorOphysExperiment_property_data(): ophys_experiment_id = 960410026 - dataset = BehaviorOphysSession.from_lims(ophys_experiment_id) + dataset = BehaviorOphysExperiment.from_lims(ophys_experiment_id) assert dataset.ophys_session_id == 959458018 assert dataset.ophys_experiment_id == 960410026 diff --git a/allensdk/test/brain_observatory/behavior/test_behavior_ophys_metadata.py b/allensdk/test/brain_observatory/behavior/test_behavior_ophys_metadata.py deleted file mode 100644 index 4795c1de6..000000000 --- a/allensdk/test/brain_observatory/behavior/test_behavior_ophys_metadata.py +++ /dev/null @@ -1,42 +0,0 @@ -from allensdk.brain_observatory.behavior.metadata.behavior_ophys_metadata \ - import BehaviorOphysMetadata - - -def test_indicator(monkeypatch): - """Test that indicator is parsed from full_genotype""" - class MockExtractor: - def get_reporter_line(self): - return 'Ai148(TIT2L-GC6f-ICL-tTA2)' - extractor = MockExtractor() - - with monkeypatch.context() as ctx: - def dummy_init(self): - self._extractor = extractor - - ctx.setattr(BehaviorOphysMetadata, - '__init__', - dummy_init) - - metadata = BehaviorOphysMetadata() - - assert metadata.indicator == 'GCaMP6f' - - -def test_indicator_invalid_reporter_line(monkeypatch): - """Test that indicator is None if it can't be parsed from reporter line""" - class MockExtractor: - def get_reporter_line(self): - return 'foo' - extractor = MockExtractor() - - with monkeypatch.context() as ctx: - def dummy_init(self): - self._extractor = extractor - - ctx.setattr(BehaviorOphysMetadata, - '__init__', - dummy_init) - - metadata = BehaviorOphysMetadata() - - assert metadata.indicator is None diff --git a/allensdk/test/brain_observatory/behavior/test_behavior_project_cache.py b/allensdk/test/brain_observatory/behavior/test_behavior_project_cache.py index 83cf06a5c..242cb5571 100644 --- a/allensdk/test/brain_observatory/behavior/test_behavior_project_cache.py +++ b/allensdk/test/brain_observatory/behavior/test_behavior_project_cache.py @@ -4,43 +4,81 @@ import pandas as pd import tempfile import logging -from allensdk.brain_observatory.behavior.behavior_project_cache import ( - BehaviorProjectCache) + +from 
allensdk.brain_observatory.behavior.behavior_project_cache \ + import VisualBehaviorOphysProjectCache +from allensdk.test.brain_observatory.behavior.conftest import get_resources_dir @pytest.fixture def session_table(): - return (pd.DataFrame({"ophys_session_id": [1, 2, 3], - "ophys_experiment_id": [[4], [5, 6], [7]], - "date_of_acquisition": np.datetime64('2020-02-20'), - "reporter_line": [["aa"], ["aa", "bb"], ["cc"]], - "driver_line": [["aa"], ["aa", "bb"], ["cc"]]}) - .set_index("ophys_session_id")) + return (pd.DataFrame({"behavior_session_id": [3], + "ophys_experiment_id": [[5, 6]], + "date_of_acquisition": np.datetime64('2020-02-20') + }, index=pd.Index([1], name='ophys_session_id')) + ) @pytest.fixture def behavior_table(): return (pd.DataFrame({"behavior_session_id": [1, 2, 3], - "date_of_acquisition": np.datetime64("NAT"), - "reporter_line": [["aa"], ["aa", "bb"], ["cc"]], - "driver_line": [["aa"], ["aa", "bb"], ["cc"]]}) + "foraging_id": [1, 2, 3], + "date_of_acquisition": [ + np.datetime64('2020-02-20'), + np.datetime64('2020-02-21'), + np.datetime64('2020-02-22') + ], + "reporter_line": ["Ai93(TITL-GCaMP6f)", + "Ai93(TITL-GCaMP6f)", + "Ai93(TITL-GCaMP6f)"], + "driver_line": [["aa"], ["aa", "bb"], ["cc"]], + 'full_genotype': [ + 'foo-SlcCre', + 'Vip-IRES-Cre/wt;Ai148(TIT2L-GC6f-ICL-tTA2)/wt', + 'bar'], + 'cre_line': [None, 'Vip-IRES-Cre', None], + 'session_type': ['TRAINING_1_gratings', + 'TRAINING_1_gratings', + 'OPHYS_1_images_A'], + 'mouse_id': [1, 1, 1] + }) .set_index("behavior_session_id")) @pytest.fixture -def mock_api(session_table, behavior_table): +def experiments_table(): + return (pd.DataFrame({"ophys_session_id": [1, 2, 3], + "behavior_session_id": [1, 2, 3], + "ophys_experiment_id": [1, 2, 3], + "date_of_acquisition": [ + np.datetime64('2020-02-20'), + np.datetime64('2020-02-21'), + np.datetime64('2020-02-22') + ], + 'imaging_depth': [75, 75, 75], + 'targeted_structure': ['VISp', 'VISp', 'VISp'] + }) + .set_index("ophys_experiment_id")) + + +@pytest.fixture +def mock_api(session_table, behavior_table, experiments_table): class MockApi: - def get_session_table(self): + def get_ophys_session_table(self): return session_table - def get_behavior_only_session_table(self): + def get_behavior_session_table(self): return behavior_table + def get_ophys_experiment_table(self): + return experiments_table + def get_session_data(self, ophys_session_id): return ophys_session_id - def get_behavior_only_session_data(self, behavior_session_id): - return behavior_session_id + def get_behavior_stage_parameters(self, foraging_ids): + return {x: {} for x in foraging_ids} + return MockApi @@ -48,30 +86,58 @@ def get_behavior_only_session_data(self, behavior_session_id): def TempdirBehaviorCache(mock_api, request): temp_dir = tempfile.TemporaryDirectory() manifest = os.path.join(temp_dir.name, "manifest.json") - yield BehaviorProjectCache(fetch_api=mock_api(), - cache=request.param, - manifest=manifest) + yield VisualBehaviorOphysProjectCache(fetch_api=mock_api(), + cache=request.param, + manifest=manifest) temp_dir.cleanup() @pytest.mark.parametrize("TempdirBehaviorCache", [True, False], indirect=True) -def test_get_session_table(TempdirBehaviorCache, session_table): +def test_get_ophys_session_table(TempdirBehaviorCache, session_table): cache = TempdirBehaviorCache - actual = cache.get_session_table() + obtained = cache.get_ophys_session_table() if cache.cache: path = cache.manifest.path_info.get("ophys_sessions").get("spec") assert os.path.exists(path) - 
pd.testing.assert_frame_equal(session_table, actual) + + expected_path = os.path.join(get_resources_dir(), 'project_metadata', + 'expected') + expected = pd.read_pickle(os.path.join(expected_path, + 'ophys_session_table.pkl')) + + pd.testing.assert_frame_equal(expected, obtained) @pytest.mark.parametrize("TempdirBehaviorCache", [True, False], indirect=True) def test_get_behavior_table(TempdirBehaviorCache, behavior_table): cache = TempdirBehaviorCache - actual = cache.get_behavior_session_table() + obtained = cache.get_behavior_session_table() if cache.cache: path = cache.manifest.path_info.get("behavior_sessions").get("spec") assert os.path.exists(path) - pd.testing.assert_frame_equal(behavior_table, actual) + + expected_path = os.path.join(get_resources_dir(), 'project_metadata', + 'expected') + expected = pd.read_pickle(os.path.join(expected_path, + 'behavior_session_table.pkl')) + + pd.testing.assert_frame_equal(expected, obtained) + + +@pytest.mark.parametrize("TempdirBehaviorCache", [True, False], indirect=True) +def test_get_experiments_table(TempdirBehaviorCache, experiments_table): + cache = TempdirBehaviorCache + obtained = cache.get_ophys_experiment_table() + if cache.cache: + path = cache.manifest.path_info.get("ophys_experiments").get("spec") + assert os.path.exists(path) + + expected_path = os.path.join(get_resources_dir(), 'project_metadata', + 'expected') + expected = pd.read_pickle(os.path.join(expected_path, + 'ophys_experiment_table.pkl')) + + pd.testing.assert_frame_equal(expected, obtained) @pytest.mark.parametrize("TempdirBehaviorCache", [True], indirect=True) @@ -79,17 +145,22 @@ def test_session_table_reads_from_cache(TempdirBehaviorCache, session_table, caplog): caplog.set_level(logging.INFO, logger="call_caching") cache = TempdirBehaviorCache - cache.get_session_table() + cache.get_ophys_session_table() expected_first = [ - ("call_caching", logging.INFO, "Reading data from cache"), - ("call_caching", logging.INFO, "No cache file found."), - ("call_caching", logging.INFO, "Fetching data from remote"), - ("call_caching", logging.INFO, "Writing data to cache"), - ("call_caching", logging.INFO, "Reading data from cache")] + ('call_caching', logging.INFO, 'Reading data from cache'), + ('call_caching', logging.INFO, 'No cache file found.'), + ('call_caching', logging.INFO, 'Fetching data from remote'), + ('call_caching', logging.INFO, 'Writing data to cache'), + ('call_caching', logging.INFO, 'Reading data from cache'), + ('call_caching', logging.INFO, 'Reading data from cache'), + ('call_caching', logging.INFO, 'No cache file found.'), + ('call_caching', logging.INFO, 'Fetching data from remote'), + ('call_caching', logging.INFO, 'Writing data to cache'), + ('call_caching', logging.INFO, 'Reading data from cache')] assert expected_first == caplog.record_tuples caplog.clear() - cache.get_session_table() - assert [expected_first[0]] == caplog.record_tuples + cache.get_ophys_session_table() + assert [expected_first[0], expected_first[-1]] == caplog.record_tuples @pytest.mark.parametrize("TempdirBehaviorCache", [True], indirect=True) @@ -99,22 +170,28 @@ def test_behavior_table_reads_from_cache(TempdirBehaviorCache, behavior_table, cache = TempdirBehaviorCache cache.get_behavior_session_table() expected_first = [ - ("call_caching", logging.INFO, "Reading data from cache"), - ("call_caching", logging.INFO, "No cache file found."), - ("call_caching", logging.INFO, "Fetching data from remote"), - ("call_caching", logging.INFO, "Writing data to cache"), - ("call_caching", 
logging.INFO, "Reading data from cache")] + ('call_caching', logging.INFO, 'Reading data from cache'), + ('call_caching', logging.INFO, 'No cache file found.'), + ('call_caching', logging.INFO, 'Fetching data from remote'), + ('call_caching', logging.INFO, 'Writing data to cache'), + ('call_caching', logging.INFO, 'Reading data from cache'), + ('call_caching', logging.INFO, 'Reading data from cache'), + ('call_caching', logging.INFO, 'No cache file found.'), + ('call_caching', logging.INFO, 'Fetching data from remote'), + ('call_caching', logging.INFO, 'Writing data to cache'), + ('call_caching', logging.INFO, 'Reading data from cache')] assert expected_first == caplog.record_tuples caplog.clear() cache.get_behavior_session_table() - assert [expected_first[0]] == caplog.record_tuples + assert [expected_first[0], expected_first[-1]] == caplog.record_tuples @pytest.mark.parametrize("TempdirBehaviorCache", [True, False], indirect=True) -def test_get_session_table_by_experiment(TempdirBehaviorCache): - expected = (pd.DataFrame({"ophys_session_id": [1, 2, 2, 3], - "ophys_experiment_id": [4, 5, 6, 7]}) +def test_get_ophys_session_table_by_experiment(TempdirBehaviorCache): + expected = (pd.DataFrame({"ophys_session_id": [1, 1], + "ophys_experiment_id": [5, 6]}) .set_index("ophys_experiment_id")) - actual = TempdirBehaviorCache.get_session_table(by="ophys_experiment_id")[ + actual = TempdirBehaviorCache.get_ophys_session_table( + index_column="ophys_experiment_id")[ ["ophys_session_id"]] pd.testing.assert_frame_equal(expected, actual) diff --git a/allensdk/test/brain_observatory/behavior/test_behavior_project_cloud_api.py b/allensdk/test/brain_observatory/behavior/test_behavior_project_cloud_api.py new file mode 100644 index 000000000..a4bf6632a --- /dev/null +++ b/allensdk/test/brain_observatory/behavior/test_behavior_project_cloud_api.py @@ -0,0 +1,201 @@ +import pytest +import pandas as pd +from pathlib import Path +from unittest.mock import MagicMock + +from allensdk.brain_observatory.behavior.project_apis.data_io import \ + behavior_project_cloud_api as cloudapi + + +class MockCache(): + def __init__(self, + behavior_session_table, + ophys_session_table, + ophys_experiment_table, + cachedir): + self.file_id_column = "file_id" + self.session_table_path = cachedir / "session.csv" + self.behavior_session_table_path = cachedir / "behavior_session.csv" + self.ophys_experiment_table_path = cachedir / "ophys_experiment.csv" + + ophys_session_table.to_csv(self.session_table_path, index=False) + behavior_session_table.to_csv(self.behavior_session_table_path, + index=False) + ophys_experiment_table.to_csv(self.ophys_experiment_table_path, + index=False) + + self._manifest = MagicMock() + self._manifest.metadata_file_names = ["behavior_session_table", + "ophys_session_table", + "ophys_experiment_table"] + self._metadata_name_path_map = { + "behavior_session_table": self.behavior_session_table_path, + "ophys_session_table": self.session_table_path, + "ophys_experiment_table": self.ophys_experiment_table_path} + + def download_metadata(self, fname): + return self._metadata_name_path_map[fname] + + def download_data(self, file_id): + return file_id + + def metadata_path(self, fname): + local_path = self._metadata_name_path_map[fname] + return { + 'local_path': local_path, + 'exists': Path(local_path).exists() + } + + def data_path(self, file_id): + return { + 'local_path': file_id, + 'exists': True + } + + +@pytest.fixture +def mock_cache(request, tmpdir): + bst = 
request.param.get("behavior_session_table") + ost = request.param.get("ophys_session_table") + oet = request.param.get("ophys_experiment_table") + + # round-trip the tables through csv to pick up + # pandas mods to lists + fname = tmpdir / "my.csv" + bst.to_csv(fname, index=False) + bst = pd.read_csv(fname) + ost.to_csv(fname, index=False) + ost = pd.read_csv(fname) + oet.to_csv(fname, index=False) + oet = pd.read_csv(fname) + yield (MockCache(bst, ost, oet, tmpdir), request.param) + + +@pytest.mark.parametrize( + "mock_cache", + [ + { + "behavior_session_table": pd.DataFrame({ + "behavior_session_id": [1, 2, 3, 4], + "ophys_experiment_id": [4, 5, 6, [7, 8, 9]], + "file_id": [4, 5, 6, None]}), + "ophys_session_table": pd.DataFrame({ + "ophys_session_id": [10, 11, 12, 13], + "ophys_experiment_id": [4, 5, 6, [7, 8, 9]]}), + "ophys_experiment_table": pd.DataFrame({ + "ophys_experiment_id": [4, 5, 6, 7, 8, 9], + "file_id": [4, 5, 6, 7, 8, 9]})}, + ], + indirect=["mock_cache"]) +@pytest.mark.parametrize("local", [True, False]) +def test_BehaviorProjectCloudApi(mock_cache, monkeypatch, local): + mocked_cache, expected = mock_cache + api = cloudapi.BehaviorProjectCloudApi(mocked_cache, + skip_version_check=True, + local=False) + if local: + api = cloudapi.BehaviorProjectCloudApi(mocked_cache, + skip_version_check=True, + local=True) + + # behavior session table as expected + bost = api.get_behavior_session_table() + assert bost.index.name == "behavior_session_id" + bost = bost.reset_index() + ebost = expected["behavior_session_table"] + for k in ["behavior_session_id", "file_id"]: + pd.testing.assert_series_equal(bost[k], ebost[k]) + for k in ["ophys_experiment_id"]: + assert all([i == j + for i, j in zip(bost[k].values, ebost[k].values)]) + + # ophys session table as expected + ost = api.get_ophys_session_table() + assert ost.index.name == "ophys_session_id" + ost = ost.reset_index() + eost = expected["ophys_session_table"] + for k in ["ophys_session_id"]: + pd.testing.assert_series_equal(ost[k], eost[k]) + for k in ["ophys_experiment_id"]: + assert all([i == j + for i, j in zip(ost[k].values, eost[k].values)]) + + # experiment table as expected + et = api.get_ophys_experiment_table() + assert et.index.name == "ophys_experiment_id" + et = et.reset_index() + pd.testing.assert_frame_equal(et, expected["ophys_experiment_table"]) + + # get_behavior_session returns expected value + # both directly and via experiment table + def mock_nwb(nwb_path): + return nwb_path + monkeypatch.setattr(cloudapi.BehaviorSession, "from_nwb_path", mock_nwb) + assert api.get_behavior_session(2) == "5" + assert api.get_behavior_session(4) == "7" + + # direct check only for ophys experiment + monkeypatch.setattr(cloudapi.BehaviorOphysExperiment, + "from_nwb_path", mock_nwb) + assert api.get_behavior_ophys_experiment(8) == "8" + + +@pytest.mark.parametrize( + "pipeline_versions, sdk_version, lookup, exception, match", + [ + ( + [{ + "name": "AllenSDK", + "version": "2.9.0"}], + "2.9.0", + {"pipeline_versions": { + "2.9.0": {"AllenSDK": ["2.9.0", "3.0.0"]}}}, + None, + ""), + ( + [{ + "name": "AllenSDK", + "version": "2.9.0"}], + "2.9.0", + {"pipeline_versions": { + "2.9.0": {"AllenSDK": ["2.9.1", "3.0.0"]}}}, + cloudapi.BehaviorCloudCacheVersionException, + r".*version be >=2.9.1 and <3.0.0.*"), + ( + [{ + "name": "AllenSDK", + "version": "2.9.0"}], + "2.9.0", + {"pipeline_versions": { + "2.9.0": {"AllenSDK": ["2.8.0", "2.9.0"]}}}, + cloudapi.BehaviorCloudCacheVersionException, + r".*version be >=2.8.0 and <2.9.0.*"), + ( 
+ [{ + "name": "AllenSDK", + "version": "2.10.0"}], + "2.9.0", + {"pipeline_versions": { + "2.9.0": {"AllenSDK": ["2.8.0", "2.9.0"]}}}, + cloudapi.BehaviorCloudCacheVersionException, + r"no version compatibility .*"), + ( + [{ + "name": "AllenSDK", + "version": "2.10.0"}, + { + "name": "AllenSDK", + "version": "2.10.1"}], + "2.9.0", + {"pipeline_versions": { + "2.9.0": {"AllenSDK": ["2.8.0", "2.9.0"]}}}, + cloudapi.BehaviorCloudCacheVersionException, + r"expected to find 1 and only 1 .*"), + ]) +def test_compatibility(pipeline_versions, sdk_version, lookup, + exception, match): + if exception is None: + cloudapi.version_check(pipeline_versions, sdk_version, lookup) + return + with pytest.raises(exception, match=match): + cloudapi.version_check(pipeline_versions, sdk_version, lookup) diff --git a/allensdk/test/brain_observatory/behavior/test_behavior_project_metadata_writer.py b/allensdk/test/brain_observatory/behavior/test_behavior_project_metadata_writer.py new file mode 100644 index 000000000..41fd73b5e --- /dev/null +++ b/allensdk/test/brain_observatory/behavior/test_behavior_project_metadata_writer.py @@ -0,0 +1,79 @@ +import os +import tempfile +from ast import literal_eval + +import pandas as pd +import pytest + +from allensdk.brain_observatory.behavior.behavior_project_cache import \ + VisualBehaviorOphysProjectCache +from allensdk.brain_observatory.behavior.behavior_project_cache.external \ + .behavior_project_metadata_writer import \ + BehaviorProjectMetadataWriter +from allensdk.test.brain_observatory.behavior.conftest import get_resources_dir + + +def convert_strings_to_lists(df, is_session=True): + """Lists when inside dataframe and written using .to_csv + get written as string literals. Need to parse out lists""" + df.loc[df['driver_line'].notnull(), 'driver_line'] = \ + df['driver_line'][df['driver_line'].notnull()] \ + .apply(lambda x: literal_eval(x)) + + if is_session: + df.loc[df['ophys_experiment_id'].notnull(), 'ophys_experiment_id'] = \ + df['ophys_experiment_id'][df['ophys_experiment_id'].notnull()] \ + .apply(lambda x: literal_eval(x)) + df.loc[df['ophys_container_id'].notnull(), 'ophys_container_id'] = \ + df['ophys_container_id'][df['ophys_container_id'].notnull()] \ + .apply(lambda x: literal_eval(x)) + + +@pytest.mark.requires_bamboo +def test_metadata(): + release_date = '2021-03-25' + with tempfile.TemporaryDirectory() as tmp_dir: + bpc = VisualBehaviorOphysProjectCache.from_lims( + data_release_date=release_date) + bpmw = BehaviorProjectMetadataWriter( + behavior_project_cache=bpc, + out_dir=tmp_dir, + project_name='visual-behavior-ophys', + data_release_date=release_date) + bpmw.write_metadata() + + expected_path = os.path.join(get_resources_dir(), + 'project_metadata_writer', + 'expected') + # test behavior + expected = pd.read_pickle(os.path.join(expected_path, + 'behavior_session_table.pkl')) + obtained = pd.read_csv(os.path.join(tmp_dir, + 'behavior_session_table.csv'), + dtype={'mouse_id': str}, + parse_dates=['date_of_acquisition']) + convert_strings_to_lists(df=obtained) + pd.testing.assert_frame_equal(expected, + obtained) + + # test ophys session + expected = pd.read_pickle(os.path.join(expected_path, + 'ophys_session_table.pkl')) + obtained = pd.read_csv(os.path.join(tmp_dir, + 'ophys_session_table.csv'), + dtype={'mouse_id': str}, + parse_dates=['date_of_acquisition']) + convert_strings_to_lists(df=obtained) + pd.testing.assert_frame_equal(expected, + obtained) + + # test ophys experiment + expected = pd.read_pickle(os.path.join(expected_path, + 
'ophys_experiment_table.pkl')) + obtained = pd.read_csv(os.path.join(tmp_dir, + 'ophys_experiment_table.csv'), + dtype={'mouse_id': str}, + parse_dates=['date_of_acquisition']) + convert_strings_to_lists(df=obtained, is_session=False) + pd.testing.assert_frame_equal(expected, + obtained) diff --git a/allensdk/test/brain_observatory/behavior/test_behavior_session.py b/allensdk/test/brain_observatory/behavior/test_behavior_session.py index 9df831e84..fdec71d91 100644 --- a/allensdk/test/brain_observatory/behavior/test_behavior_session.py +++ b/allensdk/test/brain_observatory/behavior/test_behavior_session.py @@ -59,7 +59,7 @@ def test_cache_clear_raises_warning(self, caplog): " `cache_clear` does not exist on DummyApi") self.behavior_session.cache_clear() assert caplog.record_tuples == [ - ("BehaviorOphysSession", logging.WARNING, expected_msg)] + ("BehaviorSession", logging.WARNING, expected_msg)] def test_cache_clear_no_warning(self, caplog): caplog.clear() diff --git a/allensdk/test/brain_observatory/behavior/test_prior_exposure_count_processing.py b/allensdk/test/brain_observatory/behavior/test_prior_exposure_count_processing.py new file mode 100644 index 000000000..57d18c5fb --- /dev/null +++ b/allensdk/test/brain_observatory/behavior/test_prior_exposure_count_processing.py @@ -0,0 +1,65 @@ +import numpy as np +import pandas as pd + +from allensdk.brain_observatory.behavior.behavior_project_cache.tables.util \ + .prior_exposure_processing import \ + get_prior_exposures_to_session_type, get_prior_exposures_to_image_set, \ + get_prior_exposures_to_omissions + + +def test_prior_exposure_to_session_type(): + """Tests normal behavior as well as case where session type is missing""" + df = pd.DataFrame({ + 'session_type': ['A', 'A', None, 'A', 'B'], + 'mouse_id': [0, 0, 0, 0, 1], + 'date_of_acquisition': [0, 1, 2, 3, 0] + }, index=pd.Series([0, 1, 2, 3, 4], name='behavior_session_id')) + expected = pd.Series([0, 1, np.nan, 2, 0], + index=pd.Series([0, 1, 2, 3, 4], + name='behavior_session_id')) + obtained = get_prior_exposures_to_session_type(df=df) + pd.testing.assert_series_equal(expected, obtained) + + +def test_prior_exposure_to_image_set(): + """Tests normal behavior as well as case where session type is not an + image set type""" + df = pd.DataFrame({ + 'session_type': ['OPHYS_1_images_A', 'OPHYS_2_images_A_passive', + 'foo', 'OPHYS_3_images_A', 'B'], + 'mouse_id': [0, 0, 0, 0, 1], + 'date_of_acquisition': [0, 1, 2, 3, 0] + }, index=pd.Index([0, 1, 2, 3, 4], name='behavior_session_id')) + expected = pd.Series([0, 1, np.nan, 2, np.nan], + index=pd.Series([0, 1, 2, 3, 4], + name='behavior_session_id')) + obtained = get_prior_exposures_to_image_set(df=df) + pd.testing.assert_series_equal(expected, obtained) + + +def test_prior_exposure_to_omissions(): + """Tests normal behavior and tests case where flash_omit_probability + needs to be looked up for habituation session. 
Only 1 of the habituation + sessions has omissions""" + df = pd.DataFrame({ + 'session_type': ['OPHYS_1_images_A', 'OPHYS_2_images_A_passive', + 'OPHYS_1_habituation', 'OPHYS_2_habituation', + 'OPHYS_3_habituation'], + 'mouse_id': [0, 0, 1, 1, 1], + 'foraging_id': [1, 2, 3, 4, 5], + 'date_of_acquisition': [0, 1, 0, 1, 2] + }, index=pd.Index([0, 1, 2, 3, 4], name='behavior_session_id')) + expected = pd.Series([0, 1, 0, 0, 1], + index=pd.Index([0, 1, 2, 3, 4], + name='behavior_session_id')) + + class MockFetchApi: + def get_behavior_stage_parameters(self, foraging_ids): + return { + 3: {}, + 4: {'flash_omit_probability': 0.05}, + 5: {} + } + fetch_api = MockFetchApi() + obtained = get_prior_exposures_to_omissions(df=df, fetch_api=fetch_api) + pd.testing.assert_series_equal(expected, obtained) diff --git a/allensdk/test/brain_observatory/behavior/test_swdb_behavior_project_cache.py b/allensdk/test/brain_observatory/behavior/test_swdb_behavior_project_cache.py index bf2aaca0f..16a529418 100644 --- a/allensdk/test/brain_observatory/behavior/test_swdb_behavior_project_cache.py +++ b/allensdk/test/brain_observatory/behavior/test_swdb_behavior_project_cache.py @@ -106,7 +106,7 @@ def test_get_container_sessions(cache): container_id = cache.experiment_table['container_id'].unique()[0] container_sessions = cache.get_container_sessions(container_id) session = container_sessions['OPHYS_1_images_A'] - assert isinstance(session, bpc.ExtendedBehaviorSession) + assert isinstance(session, bpc.ExtendedBehaviorOphysExperiment) np.testing.assert_almost_equal(session.dff_traces.loc[817103993]['dff'][0], 0.3538657529565) diff --git a/doc_template/data_api_client.rst b/doc_template/data_api_client.rst index ad80f1770..62a037419 100644 --- a/doc_template/data_api_client.rst +++ b/doc_template/data_api_client.rst @@ -113,7 +113,7 @@ The .itertuples method is one way to do it. Caching Queries on Disk ----------------------- -:py:meth:`~allensdk.api.cache.Cache.wrap` has several parameters for querying the API, +:py:meth:`~allensdk.api.warehouse_cache.cache.Cache.wrap` has several parameters for querying the API, saving the results as CSV or JSON and reading the results as a pandas dataframe. .. literalinclude:: examples_root/examples/data_api_client_ex.py diff --git a/doc_template/examples_root/examples/data_api_client_ex.py b/doc_template/examples_root/examples/data_api_client_ex.py index 071606b92..f33eb2151 100644 --- a/doc_template/examples_root/examples/data_api_client_ex.py +++ b/doc_template/examples_root/examples/data_api_client_ex.py @@ -119,7 +119,7 @@ # example 11 #=============================================================================== -from allensdk.api.cache import Cache +from allensdk.api.warehouse_cache.cache import Cache cache_writer = Cache() do_cache=True diff --git a/doc_template/index.rst b/doc_template/index.rst index 8d3d90a53..6ae91b964 100644 --- a/doc_template/index.rst +++ b/doc_template/index.rst @@ -91,6 +91,16 @@ The Allen SDK provides Python code for accessing experimental metadata along wit See the `mouse connectivity section `_ for more details. 
+What's New - 2.10.1 +----------------------------------------------------------------------- +- changes name of BehaviorProjectCache to VisualBehaviorOphysProjectCache +- changes VisualBehaviorOphysProjectCache method get_session_table() to get_ophys_session_table() +- changes VisualBehaviorOphysProjectCache method get_experiment_table() to get_ophys_experiment_table() +- VisualBehaviorOphysProjectCache is enabled to instantiate from_s3_cache() and from_local_cache() +- Improvements to BehaviorProjectCache +- Adds project metadata writer + + What's New - 2.9.0 ----------------------------------------------------------------------- - Updates to Session metadata; refactors implementation to use class rather than dict internally diff --git a/requirements.txt b/requirements.txt index d8c37f5f3..c4b947eb4 100644 --- a/requirements.txt +++ b/requirements.txt @@ -26,3 +26,5 @@ aiohttp==3.7.4 nest_asyncio==1.2.0 tqdm>=4.27 ndx-events<=0.2.0 +boto3==1.17.21 +semver diff --git a/test_requirements.txt b/test_requirements.txt index e841e82cf..525a86d0f 100644 --- a/test_requirements.txt +++ b/test_requirements.txt @@ -7,6 +7,7 @@ pytest-mock>=1.5.0,<3.0.0 mock>=1.0.1,<5.0.0 coverage>=3.7.1,<6.0.0 scikit-learn<1.0.0 +moto==2.0.1 # these overlap with requirements specified in doc_requirements. As long as they are needed, these specifications must be kept in sync # TODO: see if we can avoid duplicating these requirements - this will involved surveying CI pep8==1.7.0,<2.0.0
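
For readers migrating call sites after the rename described in the "What's New - 2.10.1" notes above, here is a minimal usage sketch of the new entry points. The class, method, and module names are taken from this changeset itself; the `cache_dir` keyword and the `"/my/cache/dir"` path are assumptions for illustration, not confirmed by the diff.

```python
# Minimal sketch of the renamed project-cache API (names per this changeset).
# "/my/cache/dir" is a placeholder; the cache_dir keyword is an assumption.
from allensdk.brain_observatory.behavior.behavior_project_cache import (
    VisualBehaviorOphysProjectCache)

# New in 2.10.1: instantiate against the cloud-hosted data release ...
cache = VisualBehaviorOphysProjectCache.from_s3_cache(
    cache_dir="/my/cache/dir")

# ... or against a previously downloaded copy, without network access:
# cache = VisualBehaviorOphysProjectCache.from_local_cache(
#     cache_dir="/my/cache/dir")

# Renamed table accessors (formerly get_session_table() and
# get_experiment_table()):
ophys_sessions = cache.get_ophys_session_table()
ophys_experiments = cache.get_ophys_experiment_table()
behavior_sessions = cache.get_behavior_session_table()
```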
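The `test_compatibility` cases in the new `test_behavior_project_cloud_api.py` above also pin down the contract of the cloud API's version check, which is what motivates the new `semver` requirement. The sketch below is a reconstruction consistent with those test cases only, not the SDK's actual implementation: it requires exactly one AllenSDK entry in the pipeline metadata, looks up the allowed `[min, max)` SDK range for that data version, and raises otherwise.

```python
# Hedged reconstruction of the behavior exercised by test_compatibility;
# names and messages mirror the test expectations, not the real source.
import semver


class BehaviorCloudCacheVersionException(Exception):
    pass


def version_check(pipeline_versions, sdk_version, version_lookup):
    # Exactly one AllenSDK entry is expected in the pipeline metadata.
    entries = [p for p in pipeline_versions if p["name"] == "AllenSDK"]
    if len(entries) != 1:
        raise BehaviorCloudCacheVersionException(
            "expected to find 1 and only 1 AllenSDK entry in the pipeline "
            f"metadata; found {len(entries)}")
    data_version = entries[0]["version"]

    # The lookup is keyed by the data (pipeline) version and constrains
    # the installed SDK version to a [min, max) interval.
    compat = (version_lookup["pipeline_versions"]
              .get(data_version, {})
              .get("AllenSDK"))
    if compat is None:
        raise BehaviorCloudCacheVersionException(
            f"no version compatibility entry for data version {data_version}")

    vmin, vmax = compat
    if not (semver.compare(sdk_version, vmin) >= 0
            and semver.compare(sdk_version, vmax) < 0):
        raise BehaviorCloudCacheVersionException(
            f"the loaded data expects that the AllenSDK version be "
            f">={vmin} and <{vmax}; found {sdk_version}")
```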
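Similarly, the new prior-exposure tests fix the semantics of the count helpers in `prior_exposure_processing`: for each behavior session, count the earlier sessions of the same mouse with the same attribute, returning NaN where the attribute is missing. A minimal pandas sketch of that rule, for the session-type variant only and under the assumption that chronological ordering plus a grouped cumulative count is all the real helper does:

```python
# Minimal sketch of the prior-exposure counting rule checked by
# test_prior_exposure_to_session_type; not the SDK's implementation.
import pandas as pd


def prior_exposures_to_session_type(df: pd.DataFrame) -> pd.Series:
    # Order sessions chronologically so cumcount() counts prior sessions.
    df = df.sort_values('date_of_acquisition')
    valid = df['session_type'].notnull()
    counts = (df[valid]
              .groupby(['mouse_id', 'session_type'])
              .cumcount())
    # Sessions with a null session_type get NaN via reindexing.
    return counts.reindex(df.index).sort_index()
```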