Script to automate the process of adding potentially thousands of FITS files to a single DOI (dataset) within a Dataverse server. The MIME type of each file is automatically detected, and the appropriate metadata is extracted and added to the description of the DOI. If the MIME type is not recognized, the script assigns the file a generic binary type (application/octet-stream) to prevent upload failure. Several shapefile MIME types are included in the script (application/x-esri-shape & application/x-qgis), and more can be added as needed.
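As an illustration, the detect-and-fall-back logic works roughly like the sketch below. The function name and the extension-to-type table are illustrative, not the script's actual code:

```python
import mimetypes

# Illustrative extension-to-type table; application/x-esri-shape and
# application/x-qgis are the shapefile types mentioned above.
CUSTOM_TYPES = {
    ".fits": "image/fits",
    ".shp": "application/x-esri-shape",
    ".qgs": "application/x-qgis",
}

def detect_mime_type(filename: str) -> str:
    """Guess the MIME type, falling back to generic binary data."""
    for ext, mime in CUSTOM_TYPES.items():
        if filename.lower().endswith(ext):
            return mime
    guessed, _ = mimetypes.guess_type(filename)
    # Unrecognized types become application/octet-stream so the
    # upload does not fail.
    return guessed or "application/octet-stream"
```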
- py_add_fits_files_to_dio.py: Python rewrite of the original Bash script's behavior; see below.
- fits_extract.py: Extracts FITS metadata; see below.
- FITS_Description.md: Describes FITS files and what can be extracted from them; see below.
- generate_test_files.md: How to generate test files; see below.
- grouped_files.py: Groups large numbers of files into zip archives; see below.
Using pyDataverse's API added options that would have been difficult to replicate in a Bash script.
- Python 3 (tested on 3.10.12)
- Python libraries "dvuploader" & "pyDataverse" installed.
- Dataverse API token (https://archive.data.jhu.edu/dataverseuser.xhtml): click on the "API Token" tab after logging in
- The DOI of the dataset to run against
- FITS files to process
- All FITS files need to be together in a single directory (no subdirectories).
- Python's virtual environment (see the pipenv setup below), or
- A local setup: utilizing 'pip install' to configure the script's dependencies has been the conventional method for setting up Python scripts. However, this approach is becoming less favorable over time.
Before running the scripts, you need to obtain your API token from Dataverse. Setting it as an environment variable is optional; it means you don't have to paste the token into the terminal each time and can instead pass '$API_KEY' to the scripts.
- Navigate to [Site_URL]/dataverseuser.xhtml?selectTab=dataRelatedToMe in your web browser.
- Click on the "API Token" tab.
- Copy the displayed token string.
Next, set the 'API_KEY' environment variable in your terminal:
Open your terminal and execute the following command, replacing 'xxxxxxxxxxxxxxxxxxxxxxxxxx' with your actual API token string:
export API_KEY='xxxxxxxxxxxxxxxxxxxxxxxxxx'
To make the 'API_KEY' persist across terminal sessions, you can add the above line to your '~/.bashrc', '~/.bash_profile', or '~/.zshrc' file, depending on your shell and operating system.
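For example, to append the export line to '~/.bashrc' and reload it:

```bash
echo "export API_KEY='xxxxxxxxxxxxxxxxxxxxxxxxxx'" >> ~/.bashrc
source ~/.bashrc
```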
Open Command Prompt or PowerShell and execute the following command, replacing xxxxxxxxxxxxxxxxxxxxxxxxxx with your actual API token string:
Command Prompt:
set API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxx
PowerShell:
$env:API_KEY='xxxxxxxxxxxxxxxxxxxxxxxxxx'
To make the 'API_KEY' persist across sessions in Windows, you can set it as a user or system environment variable through the System Properties. This can be accessed by searching for "Edit the system environment variables" in the Start menu. In the System Properties window, click on the "Environment Variables" button, and then you can add or edit the 'API_KEY' variable under either User or System variables as needed.
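Once the variable is set (on any platform), a script can read the token from the environment instead of prompting for it. A minimal sketch of that pattern (the error message is illustrative):

```python
import os
import sys

# Read the token set earlier via 'export API_KEY=...' (or the Windows equivalent).
api_key = os.environ.get("API_KEY")
if api_key is None:
    sys.exit("API_KEY is not set; see the token setup instructions above.")
```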
Install pipenv to simplify dependency management and provide consistent environments across different installations; it also avoids version conflicts with libraries that are already installed.
# Install pipenv (Linux & Mac)
python -m pip install pipenv
# Install pyenv (Mac)
brew install pyenv
# Install pyenv (Linux)
git clone https://github.com/pyenv/pyenv.git $(python -m site --user-base)/.pyenv
echo 'export PYENV_ROOT="$(python -m site --user-base)/.pyenv"' >> ~/.bashrc
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init --path)"' >> ~/.bashrc
source ~/.bashrc
# Install Python 3.10.12 using pyenv
pyenv install 3.10.12
# Create a virtual environment at a specific Python version
pipenv --python 3.10.12
# Two ways to install packages into the virtual environment.
# Either manually install packages into the virtual environment
# (note: shutil is part of Python's standard library and must not be pip-installed).
pipenv install dvuploader pyDataverse mimetype-description astropy grequests requests
# OR use the Pipfile (preferred); a sample Pipfile is sketched below.
# This is useful for ensuring consistent environments across different installations.
pipenv install
# Optional: to run the following commands as 'python ...' instead of 'pipenv run python ...'
# Run a shell within the virtual environment
pipenv shell
# To exit the shell
exit
# To remove the virtual environment
pipenv --rm
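For reference, a Pipfile for this project would look roughly like the sketch below. This is an assumption based on the package list in the manual install command above; the repository's actual Pipfile may differ or pin specific versions.

```toml
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
dvuploader = "*"
pyDataverse = "*"
mimetype-description = "*"
astropy = "*"
grequests = "*"
requests = "*"

[requires]
python_version = "3.10"
```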
python -m pip install dvuploader pyDataverse mimetype-description grequests requests
# Optional packages (for the fits_extract.py and grouped_files.py scripts)
python -m pip install astropy pandas
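For a sense of what astropy provides here, reading the primary header of a FITS file looks roughly like this (a minimal sketch, not the actual fits_extract.py code; 'example.fits' is a placeholder filename):

```python
from astropy.io import fits

def read_primary_header(path: str) -> dict:
    """Return the primary HDU header as a dict of keyword/value pairs."""
    with fits.open(path) as hdul:
        # Keywords such as TELESCOP, INSTRUME, and DATE-OBS typically
        # live in the primary header.
        return {key: hdul[0].header[key] for key in hdul[0].header}

for key, value in read_primary_header("example.fits").items():
    print(f"{key} = {value}")
```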
Note: "./" is a shorthand notation used by the computer to specify the execution of a file, especially when the file itself indicates that it's a Python script. In simpler terms, "python foo.py" and "./foo.py" essentially perform the same action.
To elaborate further, when running a script with pipenv instead of the local Python installation, you can simply replace the "./" notation with "pipenv run python" This allows you to execute the script within the virtual environment managed by pipenv.
For example:
# Run using the "Locally" installed
./py_add_fits_files_to_dio.py --help
# Run using pipenv
pipenv run python py_add_fits_files_to_dio.py --help
These files can be executed either independently or as dependencies within other scripts. As of this writing, there are no instances where they are called as dependencies.
Run with --help to see the available options:
# Run using the "Locally" installed
./py_add_fits_files_to_dio.py --help
# Run using pipenv
pipenv run python py_add_fits_files_to_dio.py --help
These files can be used independently of the main script.
See fits_extract.md for details.
See generate_test_files.md for details.
See mimetype.md for details.
See FITS_Description.md for details.
See grouped_files.md for details.
- Processing order: the order in which the system reads in the files is not guaranteed. The script sorts filenames alphabetically, but that does not mean Dataverse will process them in that order. This is important to know because file order matters to users.
- Subdirectories with FITS files: if the directory contains subdirectories, the expected behavior needs to be discussed and this code modified accordingly.
- Large number of files: a large batch will cause the script to take a long time to run. Ingestion of data in Dataverse is currently handled natively within Java, using a single-threaded process that reads each cell, row by row, column by column, so ingestion performance depends mainly on the clock speed of a single core. Grouping files into zip archives helps here; see the sketch after this list.
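Bundling many small files into a few zip archives is one way to mitigate this, which is what grouped_files.py is for. A minimal sketch of the grouping idea (the function name, chunk size, and archive names are illustrative, not the script's actual interface):

```python
import zipfile
from pathlib import Path

def zip_in_groups(src_dir: str, out_dir: str, group_size: int = 500) -> None:
    """Bundle FITS files into zip archives of at most group_size files each."""
    files = sorted(Path(src_dir).glob("*.fits"))
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i in range(0, len(files), group_size):
        archive = Path(out_dir) / f"group_{i // group_size:03d}.zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in files[i:i + group_size]:
                # Store by base name only, since all files live in one directory.
                zf.write(f, arcname=f.name)
```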
...
- Error:
SystemError: (libev) error creating signal/async pipe: Too many open files
- Solution (for Mac & Linux):
ulimit -n 4096
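Note that this raised limit applies only to the current shell session; you can check the current limit before raising it:

```bash
# Show the current open-file limit for this shell
ulimit -n
# Raise it for the current session
ulimit -n 4096
```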