Add fallback-encoding per-repo option for non-utf8 text files #388

tgulacsi · 2021-03-11T11:33:32Z

This allows specifying an alternate charset (fallback-encoding) per repository.
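As a rough illustration of what a per-repository option could look like, the sketch below parses a small JSON config in Go. The `fallback-encoding` key name is taken from the PR title; the `repoConfig` type, the `parseRepos` helper, the example URL, and the `"iso8859-2"` value are assumptions made for illustration and are not confirmed by this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// repoConfig is a hypothetical per-repository config entry.
type repoConfig struct {
	URL string `json:"url"`
	// Assumed key name, taken from the PR title. Empty means "UTF-8 only".
	FallbackEncoding string `json:"fallback-encoding"`
}

// parseRepos decodes the "repos" section of a JSON config document.
func parseRepos(raw []byte) (map[string]repoConfig, error) {
	var cfg struct {
		Repos map[string]repoConfig `json:"repos"`
	}
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	return cfg.Repos, nil
}

func main() {
	raw := []byte(`{
		"repos": {
			"legacy": {
				"url": "https://example.com/legacy.git",
				"fallback-encoding": "iso8859-2"
			},
			"modern": {"url": "https://example.com/modern.git"}
		}
	}`)
	repos, err := parseRepos(raw)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	for name, r := range repos {
		fmt.Printf("%s: fallback=%q\n", name, r.FallbackEncoding)
	}
}
```

A repository that omits the key simply gets an empty string here, which a caller could treat as "no fallback configured".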

tgulacsi · 2021-03-11T11:36:05Z

May help #78, #243 and #368, and my ISO8859-2 case :)

salemhilal · 2021-03-17

Hey, thank you for contributing! Encoding stuff can be pretty scary — can you add some tests to your changes? We're putting some eyes on this PR now otherwise, but we'd want some test coverage before we merge.

jacobrose · 2021-03-17T15:41:01Z

index/index.go

 )

 const (
 	matchLimit               = 5000
 	manifestFilename         = "metadata.gob"
 	excludedFileJsonFilename = "excluded_files.json"
-	filePeekSize             = 2048
+	filePeekSize             = 1 << 20


Hello! Why this big jump in filePeekSize?

tgulacsi · 2021-03-19

The pesky comment containing the á that breaks UTF-8 detection may be at the end of the file.
This is not a full solution, just a bet that source files are shorter than a megabyte.

A full solution could be to retry the whole indexing on a decoding error.
But that felt more involved, though I could try it if you confirm that'd be better.
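To make the concern about a late non-UTF-8 byte concrete, here is a minimal sketch. It is not Hound's actual detection code: `peekLooksUTF8` is a hypothetical helper that validates only the first `peekSize` bytes, the way a peek-based check would.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// peekLooksUTF8 reports whether the first peekSize bytes of data are valid
// UTF-8. Hypothetical helper for illustration only. Note that a cut at
// peekSize can split a multi-byte rune and reject valid input; the ASCII
// sample below avoids that case.
func peekLooksUTF8(data []byte, peekSize int) bool {
	if len(data) > peekSize {
		data = data[:peekSize]
	}
	return utf8.Valid(data)
}

func main() {
	// 4096 ASCII bytes followed by "á" as the single ISO8859-2 byte 0xE1.
	data := append(bytes.Repeat([]byte("x"), 4096), 0xE1)

	fmt.Println("2048-byte peek:", peekLooksUTF8(data, 2048))
	fmt.Println("1 MiB peek:    ", peekLooksUTF8(data, 1<<20))
}
```

With the small peek the invalid byte lies outside the inspected window, so the file passes as UTF-8; the larger peek covers this file, but any file longer than the window has the same blind spot.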

salemhilal · 2021-03-30T18:38:23Z

Hey! We're meeting now and generally like this PR. We have two asks before we merge it:

1. Can you add some tests?
2. Increasing filePeekSize might impact startup time. Can you give us any details on how much of an impact you are seeing, and if possible, a rough estimate of the size of your codebase?

Thank you again!

The big peek buffer was needed to ensure that a non-UTF-8 rune at the end of the file is noticed and the file is retried with the fallback encoding. A more robust approach is to read the file as-is, validate the encoding, and retry with the fallback encoding if validation fails. Also add a test case.
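The validate-then-retry approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `readWithFallback` and `decodeLatin1` are hypothetical names, and ISO8859-1 stands in for ISO8859-2 only because it can be decoded with the standard library alone (both encodings place á at byte 0xE1). Real code would more likely use a decoder such as `charmap.ISO8859_2` from `golang.org/x/text/encoding/charmap`.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// decodeLatin1 converts ISO8859-1 bytes to a UTF-8 string. Every ISO8859-1
// byte maps to the Unicode code point with the same numeric value.
func decodeLatin1(data []byte) string {
	runes := make([]rune, len(data))
	for i, b := range data {
		runes[i] = rune(b)
	}
	return string(runes)
}

// readWithFallback returns data unchanged if the whole input is valid UTF-8.
// Otherwise it decodes the whole input with the fallback decoder, or returns
// an error if no fallback is configured. Hypothetical helper.
func readWithFallback(data []byte, fallback func([]byte) string) (string, error) {
	if utf8.Valid(data) {
		return string(data), nil
	}
	if fallback == nil {
		return "", errors.New("input is not valid UTF-8 and no fallback encoding is set")
	}
	return fallback(data), nil
}

func main() {
	for _, src := range [][]byte{
		[]byte("// \u00e1 as UTF-8"),
		[]byte("// \xe1 as a single legacy byte"),
	} {
		text, err := readWithFallback(src, decodeLatin1)
		fmt.Printf("%q %v\n", text, err)
	}
}
```

Because validation covers the whole input, the position of the offending byte no longer matters, which removes the need for a large peek buffer.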

tgulacsi · 2021-04-12T21:20:27Z

Sorry for my slow reply, but now I've added a test case, and implemented the more robust retrying logic.
PTAL.

tgulacsi · 2021-05-10T12:22:05Z

Can I help with reviewing this PR in any way?

salemhilal reviewed Mar 17, 2021


jacobrose reviewed Mar 17, 2021


Add fallback-encoding per-repo option for non-utf8 text files

c210eb0

tgulacsi force-pushed the main branch from 9cddc7a to c210eb0 Compare April 12, 2021 20:12

pru-mike pushed a commit to pru-mike/hound that referenced this pull request Nov 21, 2023

WIP: merge fallback encoding feature from hound-search#388

93c90e3
