Add fallback-encoding per-repo option for non-utf8 text files #388
Hey, thank you for contributing! Encoding stuff can be pretty scary — can you add some tests to your changes? We're putting some eyes on this PR now otherwise, but we'd want some test coverage before we merge.
index/index.go
 )

 const (
 	matchLimit = 5000
 	manifestFilename = "metadata.gob"
 	excludedFileJsonFilename = "excluded_files.json"
-	filePeekSize = 2048
+	filePeekSize = 1 << 20
Hello! Why this big jump in filePeekSize?
The pesky comment containing á that breaks UTF-8 detection may be at the end of the file.
This is not a full solution, just a bet that source files are shorter than a megabyte.
A full solution would be to retry the whole indexing on a decoding error.
That felt more involved, though I could try it if you confirm that'd be better.
Hey! We're meeting now and generally like this PR. We have two asks before we merge it:
Thank you again!
The big peek buffer was needed to ensure that a non-UTF-8 rune at the end of the file gets noticed and the file is retried with the fallback encoding. A more robust approach is to read the file as-is, validate the encoding, and try the fallback encoding if validation fails. Also add a test case.
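The validate-then-retry approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `toUTF8`, `decodeLatin1`, and `errNoFallback` are hypothetical names, and ISO-8859-1 stands in for whatever fallback encoding a repository configures because it can be decoded with the standard library alone.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errNoFallback is a hypothetical error for content that is not valid
// UTF-8 when no fallback decoder was supplied.
var errNoFallback = errors.New("content is not valid UTF-8 and no fallback encoding is set")

// decodeLatin1 converts ISO-8859-1 bytes to a UTF-8 string. Every Latin-1
// byte maps to the Unicode code point with the same value.
func decodeLatin1(data []byte) string {
	runes := make([]rune, len(data))
	for i, b := range data {
		runes[i] = rune(b)
	}
	return string(runes)
}

// toUTF8 validates the whole content first and only uses the fallback
// decoder when validation fails, so the position of the offending byte
// in the file no longer matters.
func toUTF8(data []byte, fallback func([]byte) string) (string, error) {
	if utf8.Valid(data) {
		return string(data), nil
	}
	if fallback == nil {
		return "", errNoFallback
	}
	return fallback(data), nil
}

func main() {
	// "café" encoded as Latin-1: 0xE9 alone is not valid UTF-8.
	s, err := toUTF8([]byte("// caf\xe9\n"), decodeLatin1)
	fmt.Printf("%q %v\n", s, err)
}
```

Validating the full content costs one extra pass over the bytes, but it removes the dependency on a peek size.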
Sorry for my slow reply, but now I've added a test case and implemented the more robust retrying logic.
May I assist in reviewing this PR?
This allows specifying an alternate charset (fallback-encoding) per repository.