Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scan metadata updates #357

Merged
merged 19 commits into from
Sep 25, 2024
Merged

Scan metadata updates #357

merged 19 commits into from
Sep 25, 2024

Conversation

dylanbeaudette
Copy link
Member

This PR addresses several TODO items for fetchSCAN() and related functions:

  • SCAN/SNOTEL metadata #61 -- new site metadata, derived from authoritative sources that have been updated / corrected over the last 2 years
  • fetchSCAN upgrades #184 -- timezone is now available in the station metadata, and a new datetime column initialized as timestamp with timezone

The latest metadata creation code and intermediate files are at data-raw/scan-snotel-current. This code and resulting data replace all previous efforts to combine WCIS (station information) and LDM (user pedon ID and pedlabsampnum) sources.

@brownag brownag linked an issue Sep 18, 2024 that may be closed by this pull request
@brownag
Copy link
Member

brownag commented Sep 18, 2024

Looks good to merge, with one suggestion for a comment above.

I think we can now close #61 as the new data-raw script should allow us to keep up with any new updates made in linkages between WCIS and LDM. I have linked that issue to this PR per the comment in script, unlink it if you don't think issue is fully resolved.

Not sure whether there will be efforts to characterize sites that lack pedons, or if there is more detective work needed to link more pedons, but if so, from {soilDB} perspective, looks like @jskovlin and @dylanbeaudette have "solved" it and we can now just re-run update script periodically.

I was curious how many sites still lack linked pedon data, see below.

library(soilDB)
x <- SCAN_site_metadata()
aggregate(
  x$upedonid,
  by = list(Network = x$Network),
  FUN = \(i) c(no_pedon = sum(is.na(i)), n = length(i))
)
#>   Network x.no_pedon x.n
#> 1   CSCAN         20  21
#> 2    SCAN         45 210
#> 3    SNTL        689 910
#> 4   SNTLT         42  45

Also probably pretty close to closing #184 with timezone and SNOWLITE updates. Not sure if there is much we want to do with the remaining TODOs in that issue.

@brownag
Copy link
Member

brownag commented Sep 18, 2024

Also, no issue with this that I can see, but just a thought: when combining hourly data for stations in multiple time zones the datetime column in resulting sensor data uses the time zone from the first station.

Dissimilar timezones appear to be correctly converted when the data are combined--but curious if we want to give the user the ability to specify a target time zone explicitly (rather than implicitly through the choice of their first station ID).

@dylanbeaudette
Copy link
Member Author

Looks good to merge, with one suggestion for a comment above.

I think we can now close #61 as the new data-raw script should allow us to keep up with any new updates made in linkages between WCIS and LDM. I have linked that issue to this PR per the comment in script, unlink it if you don't think issue is fully resolved.

Thanks for looking over the PR and commentary. I agree that we can close #61, with the understanding that there is still work to be done fixing station information in WCIS and in NASIS. Jay is working on a batch of those right now.

Not sure whether there will be efforts to characterize sites that lack pedons, or if there is more detective work needed to link more pedons, but if so, from {soilDB} perspective, looks like @jskovlin and @dylanbeaudette have "solved" it and we can now just re-run update script periodically.

Yes, it will get progressively "better".

Also probably pretty close to closing #184 with timezone and SNOWLITE updates. Not sure if there is much we want to do with the remaining TODOs in that issue.

I'll close it, and open a new ticket for the question about multiple timezones.

@dylanbeaudette
Copy link
Member Author

Also, no issue with this that I can see, but just a thought: when combining hourly data for stations in multiple time zones the datetime column in resulting sensor data uses the time zone from the first station.

Dissimilar timezones appear to be correctly converted when the data are combined--but curious if we want to give the user the ability to specify a target time zone explicitly (rather than implicitly through the choice of their first station ID).

Good points, I hadn't thought about that carefully and there are no tests in place to ensure that data are reasonable. I also hadn't considered what "happens" when combining multiple timezones. I had to convince myself that it was reasonable, and it is, just as you describe:

x <- c(
  as.POSIXct('2024-01-01 12:00:00', tz = 'etc/gmt+8'),
  as.POSIXct('2024-01-01 12:00:00', tz = 'etc/gmt+10')
)

# correctly converted to single TZ
x

# new TZ is the first one
table(format(x, format = '%Z'))

I'll open a new issue to implement selection of a timezone. I'll also add a note in the manual + tutorial.

@dylanbeaudette
Copy link
Member Author

Multiple TZ is slightly more complicated than I thought, as until now, the ordering of stations in the results was the numerically-sorted site.code. Just fixed this.

@brownag
Copy link
Member

brownag commented Sep 18, 2024

Multiple TZ is slightly more complicated than I thought, as until now, the ordering of stations in the results was the numerically-sorted site.code. Just fixed this.

See above review comment on what you just deleted, I think that you need to check for NULL site code before preserving user input. EDIT: sorry did not complete my review, see below

R/fetchSCAN.R Outdated Show resolved Hide resolved
This was referenced Sep 18, 2024
@brownag
Copy link
Member

brownag commented Sep 18, 2024

After discussion on above test added in 7822889#diff-129da79a9437e382a28369436cd7042f113ce828f5a421f7132d59f2cfbded0d which failed on Linux I implemented standardization of the timezone based on tz argument to fetchSCAN() in f762612. This change closes #358.

The default behavior ( tz="US/Central") converts all times to central time, to which is a logical choice for a network that is US based. However, this default behavior will result in a mixture of times relative to UTC: CDT (-0500) and CST (-0600) depending on whether dates in the time series occur during daylight savings time. While the offsets differ, times are correct and the results are tested and consistent across platforms.

To ensure that a common offset is returned, users need to specify tz="UTC" or use a timezone that does not use daylight savings (e.g. tz="US/Arizona" which is always -0700 relative to UTC)

EDIT: UTC may be a better default due to above caveats

@brownag brownag linked an issue Sep 18, 2024 that may be closed by this pull request
@brownag
Copy link
Member

brownag commented Sep 18, 2024

Just pushed a small change in d96cf6f due to an oversight I made, the issue was only evident for timestep="Hourly"

@dylanbeaudette
Copy link
Member Author

Thanks for the more nuanced solution to the TZ and multiple above ground sensor problems. Are we good to merge ?

@brownag
Copy link
Member

brownag commented Sep 20, 2024

Thanks for the more nuanced solution to the TZ and multiple above ground sensor problems. Are we good to merge ?

Yep, I think this is good to go in that it does what is expected. The only remaining "issue" is perhaps the choice of default timezone.

I noticed that with the default "US/Central" timezone (or any timezone that observes DST) you will get 2 records for the same hour slot when timeseries="Hourly" and the time "falls back" in November. This I think is "expected" behavior if we are allowing for conversion to any local timezone (be it user specified or specific to a group of stations).

To get stability in times and DST need to use "UTC" or a specific offset as you had originally set up.

The caveat to specific offset being that the exact output will differ based on platform. It seems that on the Linux systems tested (recent Ubuntu and Manjaro) that the offset gets applied to the time (converted to UTC), whereas on Windows and Mac the offset is preserved with the local time. Either is a "correct" representation of time, but final output varies with platform.

Choosing a consistent timezone default at least prevents multiple queries across different extents or different groups of stations from returning different results silently. I suppose this is not a huge problem because if combining POSIXct results from multiple queries the first timezone is used. We probably want to make sure the tutorials cover this in some regard. The fetchSCAN() docs currently do mention it effect of DST and describe how to use tz argument for datetime column result. I just updated the example output to show the new behavior and an example of correcting using a specific stations metadata.

@brownag
Copy link
Member

brownag commented Sep 20, 2024

Do we want to add a message that indicates to users requests that span multiple dataTimeZone and/or if the dataTimeZone is outside the defaults of US/Central (i.e. not -5 or -6)? This might help alert to the need to specify tz argument to a more locally relevant value

@dylanbeaudette
Copy link
Member Author

Do we want to add a message that indicates to users requests that span multiple dataTimeZone and/or if the dataTimeZone is outside the defaults of US/Central (i.e. not -5 or -6)? This might help alert to the need to specify tz argument to a more locally relevant value

Yes, a message when >1 TZ result from the selected stations is a good idea. I can do that this morning if you are busy.

@brownag
Copy link
Member

brownag commented Sep 24, 2024

Do we want to add a message that indicates to users requests that span multiple dataTimeZone and/or if the dataTimeZone is outside the defaults of US/Central (i.e. not -5 or -6)? This might help alert to the need to specify tz argument to a more locally relevant value

Yes, a message when >1 TZ result from the selected stations is a good idea. I can do that this morning if you are busy.

Did it yesterday, just didnt push. See 309e357

@dylanbeaudette dylanbeaudette merged commit 1755770 into master Sep 25, 2024
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add user-defined time zone argument to fetchSCAN() SCAN/SNOTEL metadata
2 participants