Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WaveBank duplicates entries in index if file mtime is updated #268

Open
shawnboltz opened this issue Sep 5, 2023 · 2 comments
Open

WaveBank duplicates entries in index if file mtime is updated #268

shawnboltz opened this issue Sep 5, 2023 · 2 comments
Labels
bug Something isn't working waveforms anything having to do with waveforms

Comments

@shawnboltz
Copy link
Collaborator

Description
This is a pretty low priority, but it I can see it having some obscure effects on, ex., identifying gaps/availability in a bank.

When a waveform file gets updated (in this specific case, everything about the files were identical, just the modified time changed), a duplicate entry gets put into the WaveBank.

To Reproduce
I admittedly haven't deliberately reproduced this, but I'm fairly certain this is what happened:

  1. A bank got created for a directory of files
import obsplus
bank = obsplus.WaveBank("path/to/bank")
bank.update_index()

df = bank.read_index(station="sta1", channel="ELZ", location="02")
assert df.duplicated().any() == False
  1. A set of files got copied into the bank, some of which were files that already in the bank (and the originals were overwritten)
  2. The index was updated and duplicated entries got created
bank.update_index()

df = bank.read_index(station="sta1", channel="ELZ", location="02")
assert df.duplicated.any() == True

Expected behavior
The duplicate entries shouldn't get created if all else is equal about the files. (This raises a bigger issue about what happens if the file's contents change substantially, but that's a problem for another day.)

I'm not entirely sure if this would work with the structure of update_index, but simply dropping duplicates should resolve the issue.

Versions (please complete the following information):

  • OS: Ubuntu 22.04
  • ObsPlus Version: 0.2.3.dev10+g8e231f4
  • Python Version: 3.8.12
@shawnboltz shawnboltz added bug Something isn't working waveforms anything having to do with waveforms labels Sep 5, 2023
@d-chambers
Copy link
Member

Ya, I have seen this before. I guess it begs the question: why is the mtime changing, and is that easier to avoid than implementing a solution?

One advantage of the bank is that we can update/query the index without having to load all the contents into memory. If we did check for duplicates, we would have to load the entire index into memory and check for duplicates (could be kinda costly if the index is large) before updates. I am not sure how much slower it would make file indexing, but at least a bit.

@shawnboltz
Copy link
Collaborator Author

shawnboltz commented Sep 5, 2023

In this case I wanted to make sure that the existing files were overwritten because there was a change in how obspy reads/writes Kinemetrics EVT files and I wanted to make sure that any data in the bank that originated from an EVT file was consistent (number of samples/time window was the same, it was a matter of storing the data with/without the instrument response removed).

An alternative could have been to just delete and recreate the index, but the index is huge and it would have taken 24 hours+ (though I will probably end up doing that now, anyways).

I know you don't really like adding additional kwargs, but maybe it could be an optional step if it's a case where we know it will be necessary?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working waveforms anything having to do with waveforms
Projects
None yet
Development

No branches or pull requests

2 participants