Buffer manager allocation failures causing corruption #4338

Open
benjaminwinger opened this issue Oct 3, 2024 · 0 comments · May be fixed by #4467, #4423 or #4559

For example, the test `CopyTest.OutOfMemoryRecovery` (renamed to `CopyTest.DISABLED_OutOfMemoryRecovery` in #4188) runs out of memory when committing (as of #4188) and then corrupts the database, because the node group is constructed in two parts.
More generally, I think we need to be more careful about potential allocation failures and try to keep newly allocated objects isolated until we have finished creating them, to minimize the side effects of a failure.

The issue in that test in particular is that we add a new node group object to the node group collection but don't allocate its `persistentChunkGroup` field until later. If that allocation fails with an exception, we handle the exception, and then when closing the database we serialize the node groups, including the incomplete one (which currently serializes fine the first time, but has issues after that).
It's worth noting that this particular situation is not actually unrecoverable at the moment; the other test in that file still succeeds by dropping the table.
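To illustrate the ordering problem described above, here is a minimal sketch contrasting the two-part construction with a build-then-publish ordering. All types here (`ChunkGroup`, `NodeGroup`, `NodeGroupCollection`, `allocateChunkGroup`) are hypothetical stand-ins, not the real classes, and the `failAllocation` flag only simulates a buffer manager allocation failure.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; not the project's real classes.
struct ChunkGroup {
    std::vector<int> data;
};

struct NodeGroup {
    // Must be non-null before the group is visible to anyone else.
    std::unique_ptr<ChunkGroup> persistentChunkGroup;
};

// Simulates a buffer manager allocation that may fail (assumption: failure
// is reported by throwing std::bad_alloc).
inline std::unique_ptr<ChunkGroup> allocateChunkGroup(bool failAllocation) {
    if (failAllocation) {
        throw std::bad_alloc();
    }
    return std::make_unique<ChunkGroup>();
}

struct NodeGroupCollection {
    std::vector<std::unique_ptr<NodeGroup>> groups;

    // Fragile ordering: the group is published first and completed later.
    // If the allocation throws, an incomplete group stays in the collection.
    void appendTwoPhase(bool failAllocation) {
        groups.push_back(std::make_unique<NodeGroup>());
        groups.back()->persistentChunkGroup = allocateChunkGroup(failAllocation);
    }

    // Safer ordering: build the group in a local, then publish it.
    // If the allocation throws, the local is destroyed and the collection
    // is left exactly as it was.
    void appendIsolated(bool failAllocation) {
        auto group = std::make_unique<NodeGroup>();
        group->persistentChunkGroup = allocateChunkGroup(failAllocation);
        assert(group->persistentChunkGroup != nullptr);
        groups.push_back(std::move(group));
    }

    // Number of published groups that were never fully constructed.
    std::size_t numIncomplete() const {
        std::size_t count = 0;
        for (const auto& group : groups) {
            if (group->persistentChunkGroup == nullptr) {
                ++count;
            }
        }
        return count;
    }
};
```

With `appendIsolated`, a later serialization pass over `groups` never sees a half-built entry, so the exception handler has nothing to clean up. The final `push_back` can itself throw, but `std::vector::push_back` leaves the vector unchanged in that case.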

One option for testing this problem more generally would be to repeatedly run some operation, e.g. a reasonably sized copy, with the buffer manager set up to fail randomly (or to fail after n allocations, if we want to be more exhaustive) as if it were out of memory, and then check that we can recover.
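The exhaustive "fail after n allocations" variant could look roughly like the sketch below. `FailingAllocator` and `countUnrecoverableFailures` are hypothetical names made up for this example; a real version would hook into the buffer manager rather than a standalone counter, and the consistency check would be something like reopening the database and querying it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <new>

// Hypothetical fault injector: stands in for a buffer manager that reports
// out-of-memory on one chosen allocation.
class FailingAllocator {
public:
    explicit FailingAllocator(std::size_t failAt) : failAt{failAt} {}

    // Called wherever the code under test would allocate. Throws on the
    // failAt-th call (0-based) and succeeds on every other call.
    void allocate() {
        if (count++ == failAt) {
            throw std::bad_alloc();
        }
    }

private:
    std::size_t failAt;
    std::size_t count = 0;
};

// Runs `operation` once per allocation index n = 0, 1, 2, ..., injecting a
// failure at allocation n. After each injected failure, `isConsistent` is
// called to check that the state is still usable. The loop stops at the
// first run that completes without hitting the injected failure, or after
// `maxAllocations` runs. Returns how many injected failures left the state
// inconsistent.
inline std::size_t countUnrecoverableFailures(
    const std::function<void(FailingAllocator&)>& operation,
    const std::function<bool()>& isConsistent, std::size_t maxAllocations) {
    assert(operation && isConsistent);
    std::size_t unrecoverable = 0;
    for (std::size_t n = 0; n < maxAllocations; ++n) {
        FailingAllocator allocator{n};
        try {
            operation(allocator);
            break; // the operation needed fewer than n + 1 allocations
        } catch (const std::bad_alloc&) {
            if (!isConsistent()) {
                ++unrecoverable;
            }
        }
    }
    return unrecoverable;
}
```

A test would pass the copy as `operation` and assert that the returned count is zero. The random variant mentioned above would pick the failing index from a seeded generator instead of sweeping every index.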

@benjaminwinger benjaminwinger added the bug Something isn't working label Oct 3, 2024
@ray6080 ray6080 mentioned this issue Oct 21, 2024
5 tasks
@royi-luo royi-luo self-assigned this Oct 23, 2024
@royi-luo royi-luo linked a pull request Oct 25, 2024 that will close this issue
1 task
@royi-luo royi-luo linked a pull request Nov 4, 2024 that will close this issue
1 task
@royi-luo royi-luo linked a pull request Nov 22, 2024 that will close this issue
1 task