This repository has been archived by the owner on May 9, 2023. It is now read-only.
forked from degauss-org/postal

address normalization and parsing with libpostal

degauss-org/postal_parser

postal

Using

If my_address_file.csv is a file in the current working directory with an address column named address, then the DeGAUSS command:

docker run --rm -v $PWD:/tmp ghcr.io/degauss-org/postal:0.1.4 my_address_file.csv

will produce my_address_file_postal_0.1.4.csv with added columns:

  • cleaned_address: address with non-alphanumeric characters (except -) and excess whitespace removed (with dht::clean_address())
  • parsed.{address_component}: multiple columns, one for each parsed address component (e.g., parsed.road, parsed.state, parsed.house_number)
  • parsed_address: a "parsed" address created by pasting together available parsed.house_number, parsed.road, parsed.city, parsed.state, and the first five digits of the parsed.postcode address components
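To make the parsed_address description concrete, here is a minimal Python sketch of how the available components might be pasted together. It is an illustration only, not the container's R code: the helper name build_parsed_address, the single-space separator, and the example address are all assumptions.

```python
# Component order taken from the parsed_address description above.
COMPONENT_ORDER = ["house_number", "road", "city", "state", "postcode"]


def build_parsed_address(components):
    """Join the available components into one string (hypothetical helper)."""
    parts = []
    for name in COMPONENT_ORDER:
        value = components.get(name)
        if not value:
            continue  # skip components the parser did not return
        if name == "postcode":
            # keep only the first five digits of the postcode
            value = "".join(ch for ch in value if ch.isdigit())[:5]
        parts.append(value)
    # Assumption: components are separated by a single space.
    return " ".join(parts)


# Made-up example components, keyed like the parsed.{address_component} columns.
example = {
    "house_number": "123",
    "road": "main st",
    "city": "springfield",
    "state": "oh",
    "postcode": "45501-1234",
}
print(build_parsed_address(example))
```

Components that are missing for a given row are simply left out, which matches the "available" wording in the column description.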

Optional Argument

After parsing, the parsed addresses can be expanded into several possible normalized addresses using libpostal. This can be useful for matching these addresses against other messy, real-world addresses.

If any value is provided as an argument (e.g., "expand"), then the DeGAUSS command:

docker run --rm -v $PWD:/tmp ghcr.io/degauss-org/postal:0.1.4 my_address_file.csv expand

will produce my_address_file_postal_0.1.4_expand.csv with the above columns plus:

  • expanded_addresses: the expanded addresses for parsed_address

Because each parsed_address will likely yield more than one expanded address, each input row is duplicated to accommodate them, with one output row per value of expanded_addresses. The output CSV file can therefore have more rows than the input CSV file.
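The row duplication can be pictured with a short Python sketch. It is not the container's implementation: expand_rows is a hypothetical helper, and fake_expand is a toy stand-in for libpostal's expansion that only exists to make the example runnable.

```python
def expand_rows(rows, expand):
    """Return one output row per expanded address (hypothetical helper)."""
    out = []
    for row in rows:
        for expanded in expand(row["parsed_address"]):
            new_row = dict(row)  # copy, so the other input columns are duplicated
            new_row["expanded_addresses"] = expanded
            out.append(new_row)
    return out


def fake_expand(address):
    """Toy stand-in for libpostal expansion; NOT real libpostal output."""
    if address.endswith(" st"):
        stem = address[: -len(" st")]
        return [stem + " street", stem + " saint"]
    return [address]


rows = [
    {"id": 1, "parsed_address": "123 main st"},
    {"id": 2, "parsed_address": "5 elm ave"},
]
expanded = expand_rows(rows, fake_expand)
for row in expanded:
    print(row["id"], row["expanded_addresses"])
```

In this sketch the first input row appears twice in the output and the second appears once, so any join back to the original data should be done on an identifier column rather than on row position.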

Geomarker Methods

Input addresses are parsed/normalized using libpostal by:

  1. removing non-alphanumeric characters (except -) and excess whitespace (with dht::clean_address())
  2. parsing addresses into components using libpostal/src/address_parser (a machine learning model trained on OpenStreetMap and OpenAddresses)
  3. (with an optional argument) expanding the parsed address into several possible normalized addresses
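Step 1 above can be approximated in a few lines of Python. This sketch follows only the description given here (drop non-alphanumeric characters except -, then collapse excess whitespace); it is not the source of dht::clean_address(), which may differ in details such as case handling.

```python
import re


def clean_address(address):
    """Approximate the cleaning step as described above (not the dht code)."""
    # Drop every character that is not a letter, digit, hyphen, or whitespace.
    kept = re.sub(r"[^A-Za-z0-9\s-]", "", address)
    # Collapse runs of whitespace to one space and trim both ends.
    return re.sub(r"\s+", " ", kept).strip()


print(clean_address("  123  Main St.,   Apt #4-B "))
```

Removing punctuation before parsing keeps stray characters such as "#" or "," out of the parsed components, while the hyphen is kept because it can carry meaning in unit numbers and ZIP+4 codes.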

DeGAUSS Details

For detailed documentation on DeGAUSS, including general usage and installation, please see the DeGAUSS homepage.
