-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: search platforms separately to keep under 10,000 results #48
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
ceholden
force-pushed
the
ceh/search-platforms-separately
branch
from
November 27, 2024 02:43
9f1228e
to
738d59a
Compare
chuckwondo
approved these changes
Dec 10, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fabulous!
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What I am changing
This PR addresses #45 and is a more complete fix versus the bandage we applied in #46. As of November 12th the search API from ESA no longer allows limit/offset pagination where the offset is above 10,000. This has caused problems for us because our query regularly returns between ~9,000 and ~11,000 results. The work in #46 fixed the issue of not fetching 100% of the available links, but it had two lingering issues that this PR resolves that had caused our StepFunction to fail,
hls-sentinel2-downloader-serverless/lambdas/link_fetcher/handler.py
Line 101 in 5d81de9
This PR should permanently resolve the issue by searching for links for each satellite platform separately (currently only S2A and S2B). This might also be useful for the upcoming release of data for S2C if we need to make adjustments downstream before we begin processing (e.g., bandpass coefficients for S2C, see NASA-IMPACT/hls-sentinel#163)
How I did it
The strategy for querying by
(date, platform)
involved changes to the part that determines what we search for (date_generator
), to the link fetcher handler, and to the state tracking database table for the link fetcher (GranuleCount
).date_generator
to return a sequence of[(date, platform), ...]
query_dates
~>query_dates_platforms
)platform
to theGranuleCount
table and update the primary key constraint to be(date, platform)
.platform
can't be null because it's a primary key I set a server default ofS2A+S2B
. This value isn't used going forward and is intended to isolate rows that had been created by the link fetcher before this PR is merged.granule_count
table so this migration won't be difficult. I've already tried it in the "event driven link fetcher" PR that I've deployed withIDENTIFIER=event-subs
.granules
table. If we ran this today we would have a few rows withplatform = {S2A, S2B, S2A+S2B}
SearchResult
we persist ingranule
table)mock_scihub_search_results
to work reply per-platformHow you can test it
I updated the unit tests for the link fetcher, including updating the saved search query to include
platform=S2A
.I've been running this code in my deployment of the "event driven link fetcher" (
IDENTIFIER=event-subs
). Here's a screenshot of theGranuleCount
table that shows it working ✅Specifically things to note,
platform=S2A+S2B
were created without the changes in this PR. Whenavailable_links < 10,000
you'll see that theavailable_links == fetched_links
, but otherwise we'll be missing some links.S2A+S2B
did not always fetch all of the links. Reminder - this is just for my test environment for the event driven downloader, NOT "prod""S2A" + "S2B" == "S2A+S2B"
available link counts, e.g.,4326 + 6337 == 10663