If my_address_file.csv is a file in the current working directory with an address column named address, then the DeGAUSS command:
docker run --rm -v $PWD:/tmp ghcr.io/degauss-org/postal:0.1.4 my_address_file.csv
will produce my_address_file_postal_0.1.4.csv with added columns:
- cleaned_address: address with non-alphanumeric characters and excess whitespace removed (with dht::clean_address())
- parsed.{address_component}: multiple columns, one for each parsed address component (e.g., parsed.road, parsed.state, parsed.house_number)
- parsed_address: a "parsed" address created by pasting together the available parsed.house_number, parsed.road, parsed.city, parsed.state, and the first five digits of the parsed.postcode address components
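As a rough illustration of how the parsed_address column described above could be assembled, here is a minimal Python sketch. It is not the container's actual code: the function name build_parsed_address, the dictionary layout, and the use of a single space as the separator are all assumptions made for this example.

```python
def build_parsed_address(components):
    """Join the available parsed components into one address string.

    `components` is a hypothetical dict keyed by component name
    (without the "parsed." prefix); missing or empty values are skipped.
    """
    parts = []
    # Order follows the parsed_address column description.
    for key in ["house_number", "road", "city", "state"]:
        value = (components.get(key) or "").strip()
        if value:
            parts.append(value)
    postcode = (components.get("postcode") or "").strip()
    if postcode:
        # Keep only the first five characters, e.g. the ZIP from a ZIP+4 code.
        parts.append(postcode[:5])
    # Assumption: components are separated by a single space.
    return " ".join(parts)


example = {
    "house_number": "3333",
    "road": "burnet ave",
    "city": "cincinnati",
    "state": "oh",
    "postcode": "45229-3026",
}
print(build_parsed_address(example))  # 3333 burnet ave cincinnati oh 45229
```

Components that are missing are simply left out, which mirrors the "available" wording in the column description.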
After parsing, the parsed addresses can be expanded into several possible normalized addresses using libpostal. This can be useful for matching these addresses against other messy, real-world addresses.
If any value is provided as an argument (e.g., "expand"), then the DeGAUSS command:
docker run --rm -v $PWD:/tmp ghcr.io/degauss-org/postal:0.1.4 my_address_file.csv expand
will produce my_address_file_postal_0.1.4_expand.csv with the above columns plus:
- expanded_addresses: the expanded addresses for parsed_address
Because each parsed_address will likely result in more than one expanded address, each input row is duplicated to accommodate its several expanded_addresses values. This means that expanding addresses also "expands" the input CSV file by duplicating its rows.
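The row duplication can be pictured with the following Python sketch. It only illustrates the shape of the output: the expand_rows helper, the id column, and the example expansions are hypothetical and are not actual libpostal output. Keeping a row with an empty value when no expansion exists is also an assumption, not documented behavior.

```python
def expand_rows(rows, expansions):
    """Duplicate each input row once per expanded address.

    `expansions` maps a parsed_address to a list of expanded addresses;
    it stands in for the libpostal expansion step.
    """
    out = []
    for row in rows:
        candidates = expansions.get(row["parsed_address"], [])
        if not candidates:
            # Assumption: keep the row once with an empty expansion.
            out.append({**row, "expanded_addresses": None})
            continue
        for expanded in candidates:
            # Every other column is copied unchanged into each duplicate.
            out.append({**row, "expanded_addresses": expanded})
    return out


rows = [{"id": 1, "parsed_address": "3333 burnet ave cincinnati oh 45229"}]
# Illustrative expansions only, not real libpostal results.
expansions = {
    "3333 burnet ave cincinnati oh 45229": [
        "3333 burnet avenue cincinnati oh 45229",
        "3333 burnet avenue cincinnati ohio 45229",
    ]
}
for row in expand_rows(rows, expansions):
    print(row["id"], row["expanded_addresses"])
```

Here one input row becomes two output rows that share the same id, so an identifier column in the input makes it easy to trace each expanded address back to its original row.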
Input addresses are parsed/normalized using libpostal by:
- removing non-alphanumeric characters (except -) and excess whitespace (with dht::clean_address())
- parsing addresses into components using libpostal/src/address_parser (a machine learning model trained on OpenStreetMap and OpenAddresses)
- (with an optional argument) expanding the parsed address into several possible normalized addresses
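The cleaning step in the list above can be approximated in Python as follows. This is a sketch of the described behavior only, not the implementation of dht::clean_address(): the function name clean_address_sketch is made up, and treating only ASCII letters and digits as alphanumeric, and replacing removed characters with a space, are assumptions.

```python
import re


def clean_address_sketch(address):
    """Approximate the cleaning step: drop punctuation, squeeze whitespace."""
    # Replace anything that is not an ASCII letter, digit, hyphen,
    # or whitespace with a space (assumption: replace, not delete).
    kept = re.sub(r"[^A-Za-z0-9\-\s]", " ", address)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", kept).strip()


print(clean_address_sketch("3333  Burnet Ave., Cincinnati, OH 45229-3026"))
# 3333 Burnet Ave Cincinnati OH 45229-3026
```

The hyphen is kept so that values such as ZIP+4 codes and hyphenated house numbers survive cleaning.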
For detailed documentation on DeGAUSS, including general usage and installation, please see the DeGAUSS homepage.