Adding an NHGIS Shapefile Parser to IPUMS.jl #32

Draft
wants to merge 6 commits into
base: main
2 changes: 2 additions & 0 deletions Project.toml
@@ -7,6 +7,8 @@ version = "0.0.1"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Downloads = "f43a241f-c20a-4ad4-852c-f6b1247861c6"
EzXML = "8f5d6c58-4d21-5cfd-889c-e3ad7ee6a615"
GeoDataFrames = "62cb38b5-d8d2-4862-a48e-6a340996859f"
GeoFormatTypes = "68eda718-8dee-11e9-39e7-89f7f65f511f"
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"
OpenAPI = "d5e62ea6-ddf3-4d43-8e4c-ad5e6c8bfd7d"
7 changes: 5 additions & 2 deletions src/IPUMS.jl
@@ -6,7 +6,9 @@ module IPUMS
download as dl
import OpenAPI.Clients:
Client

import GeoDataFrames:
read
import GeoFormatTypes
using DataFrames:
DataFrames,
DataFrame,
@@ -54,7 +56,7 @@ module IPUMS
=#

include("parsers/ddi_parser.jl")

include("parsers/nhgis_parser.jl")
#=

Exports
@@ -65,5 +67,6 @@ module IPUMS
export parse_ddi
export extract_download
export load_ipums_extract
export load_ipums_nhgis

end
63 changes: 63 additions & 0 deletions src/parsers/nhgis_parser.jl
@@ -0,0 +1,63 @@


"""
    load_ipums_nhgis(filepath::String)

Read an IPUMS NHGIS shapefile and return an `NHGISInfo` object that contains
a GeoDataFrame with the file data, as well as the coordinate reference system
(CRS) read from the shapefile.

### Arguments

- `filepath::String` - The path to the `.shp` file of an extracted IPUMS NHGIS shapefile.

### Returns

An `NHGISInfo` object. Its `geodataframe` field contains all of the data from
the IPUMS NHGIS shapefile, its `geom_projection` field contains the CRS, and
its `data_summary` field is currently an empty `DataFrame`.

# Examples

Assume we have an extracted NHGIS file named `US_state_1790.shp` in a folder
that also contains the other shapefile component files. The user can open this
shapefile using the following code.

```julia-repl
julia> datafile = "US_state_1790.shp"
julia> load_ipums_nhgis(datafile)
IPUMS.NHGISInfo("US_state_1790.shp", "NHGIS",
0×0 DataFrame,
GeoFormatTypes.WellKnownText{GeoFormatTypes.CRS}(GeoFormatTypes.CRS(),
"PROJCS[\"USA_Contiguous_Albers_Equal_Area_Conic\",GEOGCS[\"NAD83\",
DATUM[\"North_American_Datum_1983\",SPHEROID[\"GRS 1980\",6378137,
298.257222101,AUTHORITY[\"EPSG\",\"7019\"]],AUTHORITY[\"EPSG\",
\"6269\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],
UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],
AUTHORITY[\"EPSG\",\"4269\"]],PROJECTION[\"Albers_Conic_Equal_Area\"],
PARAMETER[\"latitude_of_center\",37.5],PARAMETER[\"longitude_of_center\"
,-96],PARAMETER[\"standard_parallel_1\",29.5],
PARAMETER[\"standard_parallel_2\",45.5],PARAMETER[\"false_easting\"
,0],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,
AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],
AXIS[\"Northing\",NORTH],AUTHORITY[\"ESRI\",\"102003\"]]"),
16×8 DataFrame ....
```

"""
function load_ipums_nhgis(filepath::String)
    # Read the shapefile; `read` is the function imported from GeoDataFrames.
    gdf = read(filepath)

    # The CRS is taken from the table-level metadata of the GeoDataFrame.
    crs = metadata(gdf)["crs"]

    # `data_summary` is left as an empty DataFrame.
    return NHGISInfo(filepath, "NHGIS", DataFrame(), crs, gdf)
end






73 changes: 73 additions & 0 deletions src/structs.jl
@@ -230,3 +230,76 @@ https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/field_level_doc
_ns::String = ""
data_summary::DataFrame = DataFrame()
end



"""
```julia
NHGISInfo(
    filepath::String,
    ipums_project::String = "NHGIS",
    data_summary::DataFrame = DataFrame(),
    geom_projection::GeoFormatTypes.WellKnownText{GeoFormatTypes.CRS},
    geodataframe::DataFrame = DataFrame()
)
```

A struct representing the metadata and data taken from an IPUMS NHGIS extract. An IPUMS
NHGIS extract contains file-level metadata (such as the date of export) and
variable-level metadata (such as the name and data type of a variable).

This type is not accessed by users directly, but instead is constructed by
the `load_ipums_nhgis()` function.

# Keyword Arguments

- `filepath::String` - File system path to the NHGIS shapefile (`.shp`).
- `ipums_project::String` - Identifier for the IPUMS source of the extract
    data. Defaults to `"NHGIS"`.
- `data_summary::DataFrame` - Contains a dataframe that holds summary information
    about the variables in the dataset, including variable names,
    data types, variable descriptions, and categorical information.
- `geom_projection::GeoFormatTypes.WellKnownText{GeoFormatTypes.CRS}` - The GIS
    spatial or geometric projection (CRS) for the IPUMS NHGIS extract.
- `geodataframe::DataFrame` - A GeoDataFrame that contains the data from the
    shapefile.

# Returns

- `NHGISInfo` object that contains both file-level and variable-level metadata
extracted from an IPUMS NHGIS extract.

# Example

```julia-repl
julia> datafile = "US_state_1790.shp"
julia> load_ipums_nhgis(datafile)
IPUMS.NHGISInfo("US_state_1790.shp", "NHGIS",
0×0 DataFrame,
GeoFormatTypes.WellKnownText{GeoFormatTypes.CRS}(GeoFormatTypes.CRS(),
"PROJCS[\"USA_Contiguous_Albers_Equal_Area_Conic\",GEOGCS[\"NAD83\",
DATUM[\"North_American_Datum_1983\",SPHEROID[\"GRS 1980\",6378137,
298.257222101,AUTHORITY[\"EPSG\",\"7019\"]],AUTHORITY[\"EPSG\",
\"6269\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],
UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],
AUTHORITY[\"EPSG\",\"4269\"]],PROJECTION[\"Albers_Conic_Equal_Area\"],
PARAMETER[\"latitude_of_center\",37.5],PARAMETER[\"longitude_of_center\"
,-96],PARAMETER[\"standard_parallel_1\",29.5],
PARAMETER[\"standard_parallel_2\",45.5],PARAMETER[\"false_easting\"
,0],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,
AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],
AXIS[\"Northing\",NORTH],AUTHORITY[\"ESRI\",\"102003\"]]"),
16×8 DataFrame ....
```
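
The keyword constructor generated by `Base.@kwdef` can be tried without a
shapefile. The sketch below uses a hypothetical stand-in type, `DemoInfo`, which
is not part of IPUMS.jl and replaces the CRS field with a plain `String` purely
for illustration. It only shows that fields without defaults are required
keywords, while `ipums_project` falls back to `"NHGIS"`.

```julia
# Hypothetical stand-in for NHGISInfo that uses only Base types.
# `DemoInfo` and its String-typed `geom_projection` are assumptions
# made for this sketch; the real struct stores a GeoFormatTypes CRS.
Base.@kwdef mutable struct DemoInfo
    filepath::String                  # required keyword (no default)
    ipums_project::String = "NHGIS"   # optional keyword with a default
    geom_projection::String           # required keyword (no default)
end

# Only the fields without defaults have to be supplied.
info = DemoInfo(filepath = "US_state_1790.shp", geom_projection = "PROJCS[...]")

# The default is filled in for the omitted keyword.
println(info.ipums_project)
```

The positional constructor remains available alongside the keyword form, which
is how `load_ipums_nhgis()` builds its result.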

"""
Base.@kwdef mutable struct NHGISInfo
    filepath::String
    ipums_project::String = "NHGIS"
    data_summary::DataFrame = DataFrame()
    geom_projection::GeoFormatTypes.WellKnownText{GeoFormatTypes.CRS}
    geodataframe::DataFrame = DataFrame()
end

7 changes: 7 additions & 0 deletions test/runtests.jl
@@ -22,3 +22,10 @@ end
@test isa(metadata(df), Dict)
@test isa(colmetadata(df, :YEAR), Dict)
end

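@testset "NHGIS Parser" begin
    datafile = "testdata/nhgis0001_shapefile/US_state_1790.shp"
    # The parser returns an NHGISInfo object, not a bare DataFrame.
    info = load_ipums_nhgis(datafile)
    @test info.ipums_project == "NHGIS"
    @test size(info.geodataframe) == (16, 8)
end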
Binary file not shown.
1 change: 1 addition & 0 deletions test/testdata/nhgis0001_shapefile/US_state_1790.prj
@@ -0,0 +1 @@
PROJCS["USA_Contiguous_Albers_Equal_Area_Conic",GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137.0,298.257222101]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Albers"],PARAMETER["False_Easting",0.0],PARAMETER["False_Northing",0.0],PARAMETER["Central_Meridian",-96.0],PARAMETER["Standard_Parallel_1",29.5],PARAMETER["Standard_Parallel_2",45.5],PARAMETER["Latitude_Of_Origin",37.5],UNIT["Meter",1.0]]
Binary file added test/testdata/nhgis0001_shapefile/US_state_1790.sbn
Binary file not shown.
Binary file added test/testdata/nhgis0001_shapefile/US_state_1790.sbx
Binary file not shown.
Binary file not shown.