
fix: search platforms separately to keep under 10,000 results #48

Merged: 10 commits merged into main from ceh/search-platforms-separately on Dec 10, 2024

Conversation

@ceholden (Collaborator) commented Nov 26, 2024

What I am changing

This PR addresses #45 and is a more complete fix than the band-aid we applied in #46. As of November 12th, ESA's search API no longer allows limit/offset pagination where the offset exceeds 10,000. This has caused problems for us because our query regularly returns between ~9,000 and ~11,000 results. The work in #46 fixed the issue of not fetching 100% of the available links, but it left two lingering issues, resolved by this PR, that had caused our StepFunction to fail:

  1. After fetching all of the links we still run one more query, which would put us above the 10,000 offset restriction
  2. Our "link fetcher" has a lookback window to help ensure we catch all granules that might have been published after our first attempt at fetching links. Our Lambda function will fail for these lookback dates because the link fetcher will resume where it left off and thus send a query with an offset greater than 10,000

This PR should permanently resolve the issue by searching for links for each satellite platform separately (currently only S2A and S2B). This might also be useful for the upcoming release of S2C data if we need to make adjustments downstream before we begin processing it (e.g., bandpass coefficients for S2C; see NASA-IMPACT/hls-sentinel#163).
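To illustrate why splitting by platform helps, here is a minimal sketch of per-platform pagination under the 10,000-result cap. The `search_page` helper, its parameters, and the page size are hypothetical stand-ins for illustration, not the link fetcher's actual API.

```python
from typing import Iterator

MAX_OFFSET = 10_000  # ESA's pagination cap on limit/offset queries
PAGE_SIZE = 100      # assumed page size, for illustration only

def fetch_links_for_day(day: str, search_page) -> Iterator[dict]:
    """Fetch all links for one day, one platform at a time.

    `search_page(day, platform, limit, offset)` is a hypothetical helper that
    returns one page of results. Because each platform contributes well under
    10,000 results per day, no single query ever needs an offset above the cap.
    """
    for platform in ("S2A", "S2B"):
        offset = 0
        while offset < MAX_OFFSET:
            page = search_page(day, platform, limit=PAGE_SIZE, offset=offset)
            if not page:
                break
            yield from page
            offset += PAGE_SIZE
```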

How I did it

The strategy of querying by (date, platform) involved changes to the part that determines what we search for (date_generator), to the link fetcher handler, and to the link fetcher's state tracking database table (GranuleCount).

  1. Update date_generator to return a sequence of [(date, platform), ...] (a sketch of this follows the list below)
  2. Update the StepFunction to account for the renamed payload key (query_dates ~> query_dates_platforms)
  3. Add a DB migration that introduces a platform column on the GranuleCount table and updates the primary key constraint to be (date, platform).
    • Since platform can't be null (it's part of the primary key), I set a server default of S2A+S2B. This value isn't used going forward and is intended to isolate rows that were created by the link fetcher before this PR is merged.
    • We only have ~400 rows in the granule_count table, so this migration won't be difficult. I've already tried it in the "event driven link fetcher" PR that I've deployed with IDENTIFIER=event-subs.
    • New link fetcher runs will ignore the old results, meaning that the lookback window will re-run the link fetching. This should be a no-op (assuming there's no new data) because of the deduplication provided by the granules table. If we ran this today we would have a few rows with platform = {S2A, S2B, S2A+S2B}
  4. Update the link fetcher to include "platform" in the query and in query tracking. After fetching links we don't care about splitting up by platform (e.g., no change to the downloader or to the SearchResult we persist in the granule table)
  5. Updated unit tests and integration tests:
    • I updated mock_scihub_search_results to reply per platform
    • I re-ran the queries for "index=1", "index=101", and "index=10000" (larger than the result count, to cover the "no results" case) for S2A and S2B. The existing data had the search query URL in the "links" section
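As a rough illustration of step 1, here is a minimal sketch of a per-platform date_generator. The signature, return type, and platform tuple are assumptions for illustration, not the exact implementation in this PR.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple

PLATFORMS = ("S2A", "S2B")  # S2C could be appended here once ESA starts publishing it

def date_generator(start: date, end: date) -> Iterator[Tuple[str, str]]:
    """Yield (date, platform) pairs so each search is scoped to one platform.

    Scoping each query to a single platform keeps every result set well below
    the 10,000-result pagination cap.
    """
    day = start
    while day <= end:
        for platform in PLATFORMS:
            yield day.isoformat(), platform
        day += timedelta(days=1)
```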

How you can test it

I updated the unit tests for the link fetcher, including updating the saved search query to include platform=S2A.
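For a sense of what per-platform mocking can look like, here is a minimal sketch using the responses library. The search URL, the "platform" query parameter, and the response shape are placeholders; the project's actual fixture is mock_scihub_search_results and its query format may differ.

```python
import requests
import responses
from responses import matchers

SEARCH_URL = "https://example.com/search"  # placeholder, not the real search endpoint

@responses.activate
def test_each_platform_gets_its_own_reply():
    # Register one canned reply per platform, keyed off a hypothetical
    # "platform" query parameter, so the test can verify that the link
    # fetcher issues a separate query for S2A and S2B.
    for platform, total in (("S2A", 4326), ("S2B", 6337)):
        responses.add(
            responses.GET,
            SEARCH_URL,
            json={"totalResults": total, "features": []},
            match=[matchers.query_param_matcher({"platform": platform}, strict_match=False)],
        )

    s2a = requests.get(SEARCH_URL, params={"platform": "S2A", "index": 1}).json()
    s2b = requests.get(SEARCH_URL, params={"platform": "S2B", "index": 1}).json()
    assert s2a["totalResults"] + s2b["totalResults"] == 10663  # 4326 + 6337, from the screenshot
```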

I've been running this code in my deployment of the "event driven link fetcher" (IDENTIFIER=event-subs). Here's a screenshot of the GranuleCount table that shows it working ✅

[screenshot: GranuleCount table showing per-platform rows]

Specific things to note:

  • Rows with platform=S2A+S2B were created without the changes in this PR. When available_links < 10,000 you'll see that available_links == fetched_links, but otherwise we'll be missing some links.
  • This deployment did NOT have the temporary band-aid fix from fix: Update search limit to maximum to encompass <12k results without exce… #46, so the rows with S2A+S2B did not always fetch all of the links. Reminder: this is just my test environment for the event driven downloader, NOT "prod"
  • The "S2A" and "S2B" available link counts add up to the "S2A+S2B" count, e.g., 4326 + 6337 == 10663

@ceholden force-pushed the ceh/search-platforms-separately branch from 9f1228e to 738d59a on November 27, 2024 02:43
@ceholden marked this pull request as ready for review on November 27, 2024 14:00
@ceholden requested a review from chuckwondo on December 9, 2024 15:57
@chuckwondo (Collaborator) left a comment

Fabulous!

@ceholden merged commit 7a62b00 into main on Dec 10, 2024
3 checks passed
@ceholden deleted the ceh/search-platforms-separately branch on December 10, 2024 18:07