This repository has been archived by the owner on Feb 22, 2020. It is now read-only.

Geocoding with DeGAUSS

Cole Brokamp edited this page Feb 5, 2019 · 5 revisions

Input File

The input file must be a CSV file with a column containing an address string. Other columns may be present and will be returned in the output file, but should be kept to a minimum to reduce file size.

An example input CSV file (called my_address_file.csv) might look like:

id,address
001,3333 Burnet Ave Cincinnati OH 45229
002,660 Lincoln Avenue Cincinnati OH 45229
003,2800 Winslow Avenue Cincinnati OH 45206

Address String Formatting

If your address components are in different columns, you will need to paste them together into a single string. Below are some tips that will help optimize geocoding accuracy and precision:

  • separate the different address components with a space
  • do not include apartment numbers or "second address line" (but it's okay if you can't remove them)
  • spelling should be as accurate as possible, but the program performs "fuzzy matching", so an exact match is not necessary
  • capitalization does not affect results
  • abbreviations may be used (e.g. St. instead of Street or OH instead of Ohio)
  • use Arabic numerals instead of written numbers (e.g. 13 instead of thirteen)
  • do not try to geocode "P.O. box" addresses; these are not based on a physical location and the geocoder will likely return incorrect matches
  • do not try to geocode addresses without a valid 5-digit ZIP code; the geocoder uses the ZIP code for its initial searches, and without one it will likely return incorrect matches
  • plus4 ZIP codes are ignored, but if they must be included, make sure to separate them with a dash (e.g. 37209-0000 instead of 372090000)
  • address strings with out-of-order items could return NA (e.g. 3333 Burnet Ave Cincinnati 45229 OH)
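
If your address components live in separate columns, the tips above can be applied before geocoding with a small helper. The following is a minimal sketch, not part of DeGAUSS: the function name build_address and the component names (street, city, state, zip_code) are hypothetical and should be adapted to your own data.

```python
import re

# Hypothetical helper for illustration; names are assumptions, not DeGAUSS API.
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*BOX\b", re.IGNORECASE)
ZIP9 = re.compile(r"^(\d{5})(\d{4})$")


def build_address(street, city, state, zip_code):
    """Join components with single spaces in the order street, city, state, ZIP.

    Returns None for inputs the tips above advise against geocoding:
    P.O. box addresses and addresses without a valid 5-digit ZIP code.
    """
    zip_code = zip_code.strip()
    # Separate a run-together ZIP+4 with a dash: 372090000 -> 37209-0000
    zip_code = ZIP9.sub(r"\1-\2", zip_code)
    if not re.fullmatch(r"\d{5}(-\d{4})?", zip_code):
        return None
    if PO_BOX.search(street):
        return None
    parts = [street, city, state, zip_code]
    # Collapse repeated whitespace inside each component and skip empty ones
    return " ".join(" ".join(p.split()) for p in parts if p.strip())
```

For example, build_address("3333  Burnet Ave", "Cincinnati", "OH", "45229") returns the string used in the example input file above. Rows for which the helper returns None are probably best left out of the file that is sent to the geocoder.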

Geocoding

After opening a shell, navigate to the directory where the CSV file to be geocoded is located. See here for help on navigating a filesystem using the command line.

For those unfamiliar with the command line, the simplest approach might be to put the file to be geocoded on the desktop, start the Docker Quickstart Terminal, and then navigate to your desktop folder with cd Desktop.

Run:

docker run --rm=TRUE -v "$PWD":/tmp degauss/geocoder <name-of-file> <address-column-name>

replacing <name-of-file> with the name of the CSV file to be geocoded and <address-column-name> with the name of the column in the CSV file that contains the address strings.

Continuing on our example address file above, we can use:

docker run --rm=TRUE -v "$PWD":/tmp degauss/geocoder my_address_file.csv address

To avoid headaches, don't use a file with spaces in the filename or in the address column name. When issuing the geocoding docker command, make sure to include the .csv filename extension even if it doesn't show up in your system file browser.

If the command runs successfully, the shell shows a progress bar while geocoding, and the geocoded file is written to the current working directory with the same name as the input file but with _geocoded appended.
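
If you prefer to launch the geocoder from a script, the checks and the command described above could be wrapped as follows. This is only a sketch under assumptions: the function geocoder_command and its workdir parameter are hypothetical, and the output name is inferred from the example in this page (my_address_file.csv becomes my_address_file_geocoded.csv) rather than taken from the geocoder itself.

```python
import os
from pathlib import Path


def geocoder_command(csv_name, address_column, workdir=None):
    """Build the docker argument list shown above and the expected output name.

    Hypothetical wrapper for illustration; it only assembles the command
    and does not run Docker itself.
    """
    if " " in csv_name or " " in address_column:
        raise ValueError("avoid spaces in the file name and column name")
    if not csv_name.lower().endswith(".csv"):
        raise ValueError("include the .csv extension in the file name")
    workdir = workdir or os.getcwd()  # plays the role of "$PWD" in the shell
    cmd = [
        "docker", "run", "--rm=TRUE",
        "-v", f"{workdir}:/tmp",
        "degauss/geocoder", csv_name, address_column,
    ]
    # Assumed naming rule, based on the example output file name in this page
    expected_output = Path(csv_name).stem + "_geocoded.csv"
    return cmd, expected_output
```

The returned list can be passed to subprocess.run(cmd, check=True) from the directory that holds the CSV file. Afterwards, checking that the expected output file exists is a simple way to confirm that the run finished.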

Don't forget that if calling this image for the first time, Docker will have to download the image before starting the geocoding process. Although it is quite a large download (~ 6 GB), this only has to happen one time.

Output File

The output file is written to the same directory and, in our example, will be called my_address_file_geocoded.csv:

"address","id","street","zip","city","state","lat","lon","score","prenum","number","precision"
"2800 Winslow Avenue Cincinnati OH 45206","003","Winslow Ave","45206","Cincinnati","OH",39.130586,-84.49631,0.941,"","2800","range"
"3333 Burnet Ave Cincinnati OH 45229","001","Burnet Ave","45229","Cincinnati","OH",39.14089,-84.500402,0.949,"","3333","range"
"660 Lincoln Avenue Cincinnati OH 45229","002","Lincoln Ave","45206",NA,NA,39.13282,-84.494724,0.805,"","660","range"

This output file contains diagnostic information on the precision and method used for geocoding each address.

The following fields are included in the geocoded output.

  • address: address string input to the geocoder

  • lat: estimated latitude coordinate

  • lon: estimated longitude coordinate

  • number: The building number of the address. When a building number is not included in a range stored in the address database, the nearest known building number will be returned in its place.

  • street: The name of the street found in the database that matches the address, given in a normalized form.

  • street1,street2: When an address is parsed as an intersection, the intersecting streets are returned as street1 and street2 in place of the number and street fields.

  • city: The city matching the given address. In the US, this is typically determined from the matching ZIP code, so, for ZIP codes that cover more than one named place, the results may be different from what you expect, but will still be suitable for postal addressing.

  • state: The two letter postal abbreviation of the state containing the matching address.

  • zip: The five digit ZIP code of the matching address.

  • plus4: The ZIP+4 extension parsed from the address, if any. This extension is not actually used in the geocoding process, but is returned for convenience.

  • fips_county: The FIPS 6-4 code of the county containing the address.

  • prenum / sufnum: If the building number has a non-numeric prefix or suffix, it will be returned here.

  • precision: The qualitative precision of the geocode. The value will be one of range, street, intersection, zip, or city. See the next section for more details.

  • score: The percentage of text match between the given address and the geocoded result, expressed as a number between 0 and 1. A higher score indicates a closer match. Note that each score is relative within a precision method (e.g. a score of 0.8 with a precision of range is not the same as a score of 0.8 with a precision of street).

Interpreting Results Usable for Further Analysis

We recommend that only addresses geocoded with a precision of range or street and a score greater than 0.5 be used for further analysis. All geocoding results are included, including estimated latitude and longitude coordinates for the less precise methods, but these should likely not be used because they may be wildly inaccurate or imprecise. In order of decreasing accuracy, the following are the possible values for precision in the output file:

  • range: interpolated based on address ranges from street segments
  • street: center of the matched street
  • intersection: intersection of two streets
  • zip: centroid of the matched zip code
  • city: centroid of the matched city
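
The recommendation above (precision of range or street, score greater than 0.5) can be applied to the geocoded file with a short filter. The sketch below assumes the column names shown in the example output (precision, score); the function name usable_rows and the min_score parameter are hypothetical.

```python
import csv

# Precision values recommended above for further analysis
USABLE_PRECISION = {"range", "street"}


def usable_rows(path, min_score=0.5):
    """Yield rows whose precision is range or street and whose score exceeds min_score."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                score = float(row["score"])
            except (KeyError, TypeError, ValueError):
                continue  # score column missing, empty, or NA
            if row.get("precision") in USABLE_PRECISION and score > min_score:
                yield row
```

Calling list(usable_rows("my_address_file_geocoded.csv")) would keep only the rows that meet the recommended threshold. Because scores are relative within a precision method, a stricter min_score may be worth considering for some analyses.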

These geocodes can be used to create maps of subject locations or can be passed on to other DeGAUSS containers for geomarker assessment.