
fix: search platforms separately to keep under 10,000 results #48

Merged: 10 commits merged into main from ceh/search-platforms-separately on Dec 10, 2024

Conversation

@ceholden (Collaborator) commented Nov 26, 2024

What I am changing

This PR addresses #45 and is a more complete fix than the band-aid we applied in #46. As of November 12th, ESA's search API no longer allows limit/offset pagination where the offset exceeds 10,000. This has caused problems for us because our query regularly returns between ~9,000 and ~11,000 results. The work in #46 fixed the issue of not fetching 100% of the available links, but it left two lingering issues, resolved by this PR, that had caused our StepFunction to fail:

  1. After fetching all of the links we still run one more query, which would put us above the 10,000 offset restriction
  2. Our "link fetcher" has a lookback window to help ensure we catch all granules that might have been published after our first attempt at fetching links. Our Lambda function will fail for these lookback dates because the link fetcher will resume where it left off and thus send a query with an offset greater than 10,000

This PR should permanently resolve the issue by searching for links for each satellite platform separately (currently only S2A and S2B). This might also be useful for the upcoming release of S2C data if we need to make adjustments downstream before we begin processing it (e.g., bandpass coefficients for S2C; see NASA-IMPACT/hls-sentinel#163).
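To illustrate why splitting by platform helps, here is a minimal sketch of per-platform pagination under the 10,000-result cap. The `search_page` helper, its parameters, and the page size are hypothetical stand-ins for illustration, not the link fetcher's actual API.

```python
from typing import Iterator

MAX_OFFSET = 10_000  # ESA's pagination cap on limit/offset queries
PAGE_SIZE = 100      # assumed page size, for illustration only

def fetch_links_for_day(day: str, search_page) -> Iterator[dict]:
    """Fetch all links for one day, one platform at a time.

    `search_page(day, platform, limit, offset)` is a hypothetical helper that
    returns one page of results. Because each platform contributes well under
    10,000 results per day, no single query ever needs an offset above the cap.
    """
    for platform in ("S2A", "S2B"):
        offset = 0
        while offset < MAX_OFFSET:
            page = search_page(day, platform, limit=PAGE_SIZE, offset=offset)
            if not page:
                break
            yield from page
            offset += PAGE_SIZE
```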

How I did it

The strategy of querying by (date, platform) involved changes to the part that determines what we search for (date_generator), to the link fetcher handler, and to the link fetcher's state tracking database table (GranuleCount).

  1. Update date_generator to return a sequence of [(date, platform), ...] (a sketch of this follows the list below)
  2. Update the StepFunction to account for the renamed payload key (query_dates ~> query_dates_platforms)
  3. Add a DB migration that introduces a platform column on the GranuleCount table and updates the primary key constraint to be (date, platform).
    • Since platform can't be null (it's part of the primary key), I set a server default of S2A+S2B. This value isn't used going forward and is intended to isolate rows that were created by the link fetcher before this PR is merged.
    • We only have ~400 rows in the granule_count table, so this migration won't be difficult. I've already tried it in the "event driven link fetcher" PR that I've deployed with IDENTIFIER=event-subs.
    • New link fetcher runs will ignore the old results, meaning that the lookback window will re-run the link fetching. This should be a no-op (assuming there's no new data) because of the deduplication provided by the granules table. If we ran this today we would have a few rows with platform = {S2A, S2B, S2A+S2B}
  4. Update the link fetcher to include "platform" in the query and in query tracking. After fetching links we don't care about splitting up by platform (e.g., no change to the downloader or to the SearchResult we persist in the granule table)
  5. Updated unit tests and integration tests:
    • I updated mock_scihub_search_results to reply per platform
    • I re-ran the queries for "index=1", "index=101", and "index=10000" (larger than the result count, to cover the "no results" case) for S2A and S2B. The existing data had the search query URL in the "links" section
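As a rough illustration of step 1, here is a minimal sketch of a per-platform date_generator. The signature, return type, and platform tuple are assumptions for illustration, not the exact implementation in this PR.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple

PLATFORMS = ("S2A", "S2B")  # S2C could be appended here once ESA starts publishing it

def date_generator(start: date, end: date) -> Iterator[Tuple[str, str]]:
    """Yield (date, platform) pairs so each search is scoped to one platform.

    Scoping each query to a single platform keeps every result set well below
    the 10,000-result pagination cap.
    """
    day = start
    while day <= end:
        for platform in PLATFORMS:
            yield day.isoformat(), platform
        day += timedelta(days=1)
```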

How you can test it

I updated the unit tests for the link fetcher, including updating the saved search query to include platform=S2A.
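For a sense of what per-platform mocking can look like, here is a minimal sketch using the responses library. The search URL, the "platform" query parameter, and the response shape are placeholders; the project's actual fixture is mock_scihub_search_results and its query format may differ.

```python
import requests
import responses
from responses import matchers

SEARCH_URL = "https://example.com/search"  # placeholder, not the real search endpoint

@responses.activate
def test_each_platform_gets_its_own_reply():
    # Register one canned reply per platform, keyed off a hypothetical
    # "platform" query parameter, so the test can verify that the link
    # fetcher issues a separate query for S2A and S2B.
    for platform, total in (("S2A", 4326), ("S2B", 6337)):
        responses.add(
            responses.GET,
            SEARCH_URL,
            json={"totalResults": total, "features": []},
            match=[matchers.query_param_matcher({"platform": platform}, strict_match=False)],
        )

    s2a = requests.get(SEARCH_URL, params={"platform": "S2A", "index": 1}).json()
    s2b = requests.get(SEARCH_URL, params={"platform": "S2B", "index": 1}).json()
    assert s2a["totalResults"] + s2b["totalResults"] == 10663  # 4326 + 6337, from the screenshot
```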

I've been running this code in my deployment of the "event driven link fetcher" (IDENTIFIER=event-subs). Here's a screenshot of the GranuleCount table that shows it working ✅

[screenshot: GranuleCount table showing per-platform rows]

Specific things to note:

  • Rows with platform=S2A+S2B were created without the changes in this PR. When available_links < 10,000 you'll see that available_links == fetched_links, but otherwise we'll be missing some links.
  • This deployment did NOT have the temporary band-aid fix from fix: Update search limit to maximum to encompass <12k results without exce… #46, so the rows with S2A+S2B did not always fetch all of the links. Reminder: this is just my test environment for the event driven downloader, NOT "prod"
  • The "S2A" and "S2B" available link counts add up to the "S2A+S2B" count, e.g., 4326 + 6337 == 10663

@ceholden force-pushed the ceh/search-platforms-separately branch from 9f1228e to 738d59a on November 27, 2024 02:43
@ceholden marked this pull request as ready for review on November 27, 2024 14:00
@ceholden requested a review from chuckwondo on December 9, 2024 15:57
@chuckwondo (Collaborator) left a comment

Fabulous!

@ceholden merged commit 7a62b00 into main on Dec 10, 2024
3 checks passed
@ceholden deleted the ceh/search-platforms-separately branch on December 10, 2024 18:07