Reproduction Materials for "Forgotten but not gone: A multi-state analysis of modern-day debt imprisonment."
Install Python 3 if you do not have it. Then, run the following commands from the command line:
git clone [email protected]:stanford-policylab/debt
cd debt
# Example uses:
./download.py --all # Download all data
./download.py --state wi --subgeo milwaukee_county # Download data for a specific county
./download.py --all --download-dir /tmp/debt # Download the data to /tmp/debt
./download.py --state wi --subgeo milwaukee_county --type csv # Download as CSV instead of RDS
To reproduce our analysis, download all of the data into the data/repro subdirectory:
./download.py --all --download-dir data/repro
Once the data have downloaded, ensure that you have groundhog installed and
are using R version 4.3.1. Then, simply knit the main.Rmd file in the root
of this directory.
Rscript -e "knitr::knit('main.Rmd')"
Data processing is carried out in a number of steps. A large number of helper
functions for the seven major steps (rectangularization, classification,
parsing, deduplication, imputation, standardization, and sanitization) can be
found in the lib directory under the corresponding R script. Higher-level
functions for managing data processing are contained in lib/debt.R.
When a new county is initialized with the init script, a number of
subdirectories within the data directory are created for it:
- raw: This directory holds the raw data, as originally given to us by the county jail.
- raw_txt: This directory holds text extracted from the raw data, usually using tools such as pdftotext or tabula, or the to_txt.py utility.
- raw_csv: This directory holds data in CSV format that can be ingested by R, either because it was directly extracted in a usable format, e.g., using our to_csv.py utility, or one of the scripts in lib/extract/, or, in some cases, location-specific cleaning scripts.
- clean: This directory holds data in CSV and RDS formats that have been cleaned and standardized.
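The layout above can be sketched in a few lines of Python. This is only an illustration, not the repository's init script: the function name `init_county_dirs` and the `data_root/state/county` nesting are assumptions; only the four subdirectory names come from the list above.

```python
from pathlib import Path

# Subdirectory names taken from the list above.
SUBDIRS = ("raw", "raw_txt", "raw_csv", "clean")


def init_county_dirs(data_root, state, county):
    """Create the four per-county subdirectories and return their paths.

    Hypothetical helper: the data_root/state/county nesting is an
    assumption made for illustration, not the layout used by init.
    """
    base = Path(data_root) / state / county
    paths = {}
    for name in SUBDIRS:
        path = base / name
        # exist_ok=True makes the call safe to repeat for the same county.
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths
```

For example, `init_county_dirs("data", "wi", "milwaukee_county")` would create `raw`, `raw_txt`, `raw_csv`, and `clean` under `data/wi/milwaukee_county` in this sketch.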
Converting raw data into CSV files that can be manipulated in R is the first
step of the pipeline (i.e., going from the raw directory to the raw_csv
directory), which we term "rectangularization":
- Rectangularization: First, raw data must be converted to CSV format.
  This is accomplished using one of the conversion utilities in lib/utils or
  lib/pdf.R, or with tabula. In some cases, e.g., due to special visual
  formatting, additional conversion is required. Scripts for converting
  specific report types can be found in lib/xtract. Where the formatting or
  structure is especially unusual, additional location-specific processing is
  performed to rectangularize the data.
Data processing (i.e., going from the raw_csv directory to the clean
directory) is undertaken at the location level, according to the following
steps:
- Classification: Once data have been rectangularized, the columns must be
  classified according to the overall data schema. First, an automatic pass
  using cosine similarity is attempted by calling debt_classify() from
  lib/debt.R. This classification is confirmed (and modified, as appropriate)
  using the shiny app in lib/audit/app.R.
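To make the automatic pass concrete, here is a minimal Python sketch of matching column names to schema fields by cosine similarity. It is not the implementation of debt_classify(): the choice of character trigrams over column names as features, the function names, and the example field names are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_columns(columns, schema):
    """Map each raw column name to the most similar schema field.

    Hypothetical helper: the features (trigrams of the column name) are an
    assumption; the real pass may use other signals, such as cell values.
    """
    schema_vecs = {field: ngrams(field) for field in schema}
    result = {}
    for col in columns:
        vec = ngrams(col)
        result[col] = max(schema_vecs, key=lambda f: cosine(vec, schema_vecs[f]))
    return result
```

A column with no overlap at all still receives some label in this sketch (the first schema field), which is one reason a manual confirmation step like the audit app is useful.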
The remaining steps are carried out automatically by calling debt_process()
or debt_process_all() after sourcing debt.R.
- Parsing: The raw contents of the classified columns are parsed according
  to a standard schema (e.g., "WF" is converted to "white" in the race column
  and "female" in the gender column). (See lib/parse.R.)
- Deduplication: Each charge present at the time of booking is separated
  into a distinct row, and rows representing the same charge at the time of
  booking are combined. (See lib/dedupe.R.)
- Imputation: Features not included in the data but which can be
  reasonably inferred from the data (most notably, whether the booking
  represents a failure-to-pay booking, but also ages, length of stay,
  ethnicity, etc.) are imputed. (See lib/impute.R.)
- Standardization: Parsed values are coerced to the correct types, and
  values are checked to ensure that incorrect values resulting from clerical
  errors (such as ages greater than 100 or less than 15) are removed. (See
  lib/standardize.R and lib/standards.R.)
- Sanitization: Unparsed personally identifying information is removed
  from free text fields. (See lib/sanitize.R.)
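Three of the steps above (parsing, deduplication, and standardization) can be illustrated with a short Python sketch. The actual pipeline is written in R and is not reproduced here; the code tables, the `;` charge separator, the record field names, and all function names below are hypothetical. Only the "WF" example and the 15 to 100 age bounds come from the descriptions above.

```python
RACE_CODES = {"W": "white", "B": "black"}    # hypothetical code table
GENDER_CODES = {"F": "female", "M": "male"}  # hypothetical code table


def parse_race_gender(code):
    """Split a combined code such as "WF" into (race, gender)."""
    code = code.strip().upper()
    if len(code) != 2:
        return (None, None)
    return (RACE_CODES.get(code[0]), GENDER_CODES.get(code[1]))


def split_charges(bookings, sep=";"):
    """Give each charge its own row and drop repeats within a booking.

    Assumes each booking is a dict with "booking_id" and a delimited
    "charges" string; both the keys and the separator are illustrative.
    """
    seen = set()
    rows = []
    for booking in bookings:
        for charge in booking["charges"].split(sep):
            charge = charge.strip()
            key = (booking["booking_id"], charge.lower())
            if charge and key not in seen:
                seen.add(key)
                rows.append({"booking_id": booking["booking_id"], "charge": charge})
    return rows


def standardize_age(value, low=15, high=100):
    """Coerce to int; return None for unparseable or out-of-range ages."""
    try:
        age = int(value)
    except (TypeError, ValueError, OverflowError):
        return None
    return age if low <= age <= high else None
```

Returning `None` rather than raising keeps a single bad cell from halting a whole location's processing, which is the design choice this sketch assumes.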
Due to the dependency on the poster R library, which in turn has a dependency
on the libpostal C library, the data cleaning code cannot be run in a virtual
environment with renv or groundhog. We are working on a solution using Spack,
and will update this repository with instructions for recreating the data
cleaning environment once that process is complete.
Raw data and location-specific processing scripts are available only by request.