This is a testing dataset of dicom files, actually images of cookies (and candy tumors) provided in Dicom Format with various metadata, the intention being that you can use this dataset to test your DICOM applications. The dataset is provided in the Experiment Factory standard, using the JSON API specification to make it servable from this static Github folder The general goal of this simple framework is to make it possible for anyone, with little programming experience, to put images and associated in metadata, push them to Github, and have them programatically accessible. The entire thing is rendered statically using Jekyll on Github Pages.
Each dataset is a separate folder within _datasets
, and this means that you can add a new dataset by simply adding a folder with the appropriate subfolders and metadata.
You can easily create your own programatically accessible data by doing the following:
- use this repo as a template, and clear out the
_datasets
folder, replacing with your data with the structure (outlined below). - push to the
gh-pages
branch of a repo. If your github username ispydicom
, and the repo is calleddicom-cookies
, this means that the Github Pages branch is deployed at [https://pydicom.github.io/dicom-cookies].- The "human friendly" version of will be at https://pydicom.github.io/dicom-cookies.
- The general datasets endpoint can be found at https://pydicom.github.io/dicom-cookies/datasets.
- A filtered version for each of images and (if defined) texts is also available.
A complete example of how you might programatically access your data is provided in get_datasets.py, and outlined here. The general steps are as follows:
- Give the repository unique resource identifier (uri) to the application.
- optionally specify images, texts
In the example provided, the tool defaults to this repo, so you don't need to specify anything at all, other than your preference for downloading datasets, or images. But also for this application, the only datasets that we have are images, so the two functionalities return the same thing. Let's take a look at how this might work, or just watch here.
wget https://raw.githubusercontent.com/pydicom/dicom-cookies/master/scripts/get_datasets.py
python get_datasets.py
usage: get_datasets.py [-h] [--output OUTPUT] [--images] [--show] [--uri URI]
download and query data from your json dataset API.
optional arguments:
-h, --help show this help message and exit
--output OUTPUT a folder to save output files to.
--images download images
--show print the json data structure of the endpoint to the screen
--uri URI the github repo username/reponame
Specify to download data with --output or just print to screen with --show
Cool! So the tool wants an output directory, or for us to specify it to just print json to the screen. The first thing we could do is print to the screen, like this:
python get_datasets.py --show
or we could pipe it into a file:
python get_datasets.py --show >> dicom-cookies.json
But we probably want to get the data! Here is how to do that.
# Here is an output directory
mkdir /home/vanessa/Desktop/cookies
python get_datasets.py --output /home/vanessa/Desktop/cookies
And our datasets are downloaded!
Generation of the dataset was done via the create_dataset.py script. The only dependency is pydicom.
A dataset folder includes top level files for metadata, images, and subfolders with the actual images
. Texts are optional but not included in the cookies dataset. The organization might look like the following:
_datasets
cookie-1
images.txt
images
image1.dcm
image2.dcm
In the above example, we have an entity named "cookie-1" with a metadata.txt file that will be rendered at the url /datasets/cookie-1/metadata
as json, and this metadata file will have an includes
section that will indicate if we have images and/or text, or neither, and then link to /datasets/cookie-1/images
and/or /datasets/cookie-1/texts
. Details about the metadata file, images and text files, are below. For the above, we should note that the folder name cookie-1
is going to coincide with the dataset-id
.
metadata.txt
should be a text file located at the top level of the subject folder. Note that the dataset-id
coincides with the folder name for the dataset. THe metadata.txt
includes the fields specified in meta.yml, organized according to being required or not. We can look at a minimal example:
---
type: entity
dataset-id: cookie-1
includes:
- images
---
Features about the dataset can be put in the list of attributes
, although in our case, we are including them with the dicom files:
attributes:
- color: red
- flavor: chocolate
The includes
section indicates that the entity has subfolders "images"," and an images.txt is also present to describe the contents.
Each of the images.txt and texts.txt file in a dataset folder simply need to have a list of the files that you want published, with type "images" for images, and "texts" for texts:
---
type: images
dataset-id: cookie-1
images:
- image1.dcm
- image2.dcm
---
As a reminder, in the example above, we have a folder that looks like this, and we are viewing the images.txt file:
cookie-1
images.txt
images
image1.dcm
image2.dcm
While these variables could be sniffied programmatically, it is important that you are able to include a data object in a repository, but turn it's "published" status on or off. If an image is not included in the list above, it will not be rendered in the json data structure for the API.
You can easily download images using the example above (with get_datasets.py in the scripts folder, and then we've provided a simple example of how to generate a cookie manifest using the script get_cookie_manifest.py, which is basically a lookup dictionary (by cookie id) with a list of cookies and metadata that (likely) a user would search for like cookie ID (PatientID), gender (PatientSex), and name (PatientName) along with referring and reading physician names, and study operator. We have provided this file for you, in case you just want to quickly use it for the dataset provided here.