
Add failed-import batch archiving to aid debugging #24

Merged: 6 commits into main, Nov 11, 2024

Conversation

jessemortenson
Contributor

My goal here is to make it easier to debug failed imports that occur under realtime processing. This code zips up the jurisdiction data directory that was being processed when an import fails, and puts it in the archive/ path of the realtime processing S3 bucket. It includes the name of the archive zip in the log message, so that a debugger can download and examine the data that was being imported at the time.

This is kind of yet another hack on top of hacks, but hopefully at least just a logging thing and not something that further complicates the data flow here.
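To make the archiving step above concrete, here is a minimal sketch of what "zip the jurisdiction data directory and log an archive key" could look like. The `ARCHIVE_PREFIX`, bucket name, and function/parameter names are assumptions for illustration, not the PR's actual identifiers; the S3 upload is shown as a comment rather than executed.

```python
import datetime
import shutil
from pathlib import Path

ARCHIVE_PREFIX = "archive/"  # hypothetical key prefix in the realtime bucket


def archive_failed_import(datadir: str, jurisdiction_id: str) -> str:
    """Zip the jurisdiction data directory and return the S3 key to log."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    base = f"{jurisdiction_id}_{stamp}"
    # Create <base>.zip next to the data directory being archived
    zip_path = shutil.make_archive(str(Path(datadir).parent / base), "zip", datadir)
    key = f"{ARCHIVE_PREFIX}{Path(zip_path).name}"
    # The upload itself would happen here, e.g. with boto3 (not run in this sketch):
    # boto3.client("s3").upload_file(zip_path, REALTIME_BUCKET, key)
    return key
```

Logging the returned key alongside the import error is what lets a debugger fetch the exact batch that failed.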

Thinking through this, a more reasonable overall process might be something like the following. I didn't implement it because it would be more work, and work spanning into openstates-core. But consider this a little mini-EP tagged onto this PR, for your feedback to inform future work.

  1. Scraper is yielding scraped entities
  2. The process that receives those and saves them to JSON accumulates them up to a certain increment (15 minutes? 200 entities?)
  3. Once that increment is met, that process consolidates the data in that increment into a couple of parquet files (one per entity type); then uploads those to S3 along with an SQS message identifying them
  4. The realtime lambda receives the message and processes the full increment as one batch
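The increment logic in steps 2-3 above could be sketched roughly as below. The class name, thresholds, and the grouping-by-entity-type are illustrative assumptions; a real version would write one parquet file per entity type (e.g. with pyarrow), upload them to S3, and send one SQS message naming them, where this sketch only accumulates and groups in memory.

```python
import time


class IncrementBatcher:
    """Accumulate scraped entities until a count or time threshold is met."""

    def __init__(self, max_entities=200, max_seconds=15 * 60):
        self.max_entities = max_entities
        self.max_seconds = max_seconds
        self.batches = []            # flushed increments, grouped by entity type
        self._pending = []
        self._started = time.monotonic()

    def add(self, entity: dict):
        self._pending.append(entity)
        if (len(self._pending) >= self.max_entities
                or time.monotonic() - self._started >= self.max_seconds):
            self.flush()

    def flush(self):
        if not self._pending:
            return
        # A real implementation would write one parquet file per entity type
        # here, upload them to S3, and send an SQS message identifying them.
        by_type = {}
        for entity in self._pending:
            by_type.setdefault(entity.get("type", "unknown"), []).append(entity)
        self.batches.append(by_type)
        self._pending = []
        self._started = time.monotonic()
```

Because an increment is flushed as a unit, a failed lambda run could move the same small set of files to an archive location instead of thousands of individual JSON objects.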

I think that would help a few things:

  1. Reduce S3 API costs, because we're uploading 1-3 parquet files per batch instead of thousands of JSON files all the time
  2. An error in processing in the lambda could result in the same parquet files being simply moved to an archive location
  3. Reduce the odds that multiple lambda executions are processing files from the same jurisdiction at the same time, and increase the odds that concurrent executions are instead each working on their own jurisdiction. I don't know for sure but I suspect this will reduce a few errors.

Now of course I think our original idea of the big SQL Mesh transformation engine is even better than the above, but the above is less work than that and probably still a significant improvement.

Contributor

@alexobaseki left a comment


LGTM! The ideas for improvement also look solid. I have a question around the conditional logic, but just for clarification and to test that it works as expected.

app.py (review thread, resolved)
@jessemortenson jessemortenson merged commit bf13ab4 into main Nov 11, 2024
2 checks passed