Do not allow for a word to start or end with punctuation symbols #3588

Open
wants to merge 3 commits into
base: main
from


yarikoptic
Contributor

@yarikoptic yarikoptic commented Nov 22, 2024

The use case that inspired me to look into this

And then I found the issue this

Although maybe I am missing the use cases/problems that @DimitriPapadopoulos and @mdeweerd discussed back then.

Edits:

  • I had to partially go back and change it so that there are two alternative captures for quoted words.
  • Allow only trailing, but not leading, quotes when words were not in quotes to start with.

After I pushed, I realized that there is a use case we are not covering: the ``LaTeX'' way of quoting. So we were, and keep, missing those. Do you think I should add a regex for them too?
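To make the question concrete, here is a minimal sketch of how a stricter word pattern would behave on such inputs. Both patterns are assumptions for illustration: `PERMISSIVE` is only in the spirit of a word regex that treats `'` as a word character, and `STRICT` is a hypothetical alternative that requires a word to start and end with a word character. Neither is claimed to be the regex used in this PR, and `STRICT` does not implement the "trailing quote allowed" case from the edits above.

```python
import re

# Assumption: a permissive pattern where apostrophes count as word characters.
PERMISSIVE = re.compile(r"[\w\-'’]+")

# Hypothetical stricter pattern: first and last characters must be \w,
# while apostrophes and hyphens are still allowed in the middle.
STRICT = re.compile(r"\w(?:[\w\-'’]*\w)?")


def words(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return pattern.findall(text)


text = "stay as you were' and ``LaTeX'' don't"
permissive_words = words(PERMISSIVE, text)
strict_words = words(STRICT, text)
```

With the stricter pattern, `were'` is seen as `were` and ``` ``LaTeX'' ``` as `LaTeX`, while `don't` keeps its inner apostrophe.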

some gory details on me discovering were' and other "typos" in dictionaries

ok -- tests fail due to the typo:

codespell_lib/data/dictionary_code.txt:were'->we're

and apparently it is not the only one like that:

codespell_lib/data/dictionary_code.txt:were'->we're
codespell_lib/data/dictionary.txt:aircrafts'->aircraft's
codespell_lib/data/dictionary.txt:arent'->aren't
codespell_lib/data/dictionary.txt:cant'->can't
codespell_lib/data/dictionary.txt:cnat'->can't
codespell_lib/data/dictionary.txt:couldnt'->couldn't
codespell_lib/data/dictionary.txt:didnt'->didn't
codespell_lib/data/dictionary.txt:doesent'->doesn't
codespell_lib/data/dictionary.txt:doesn'->doesn't
codespell_lib/data/dictionary.txt:doesnt'->doesn't
codespell_lib/data/dictionary.txt:dont'->don't
codespell_lib/data/dictionary.txt:dosent'->doesn't
codespell_lib/data/dictionary.txt:hasnt'->hasn't
codespell_lib/data/dictionary.txt:havent'->haven't
codespell_lib/data/dictionary.txt:isnt'->isn't
codespell_lib/data/dictionary.txt:packges'->packages'
codespell_lib/data/dictionary.txt:shouldnt'->shouldn't
codespell_lib/data/dictionary.txt:thats'->that's
codespell_lib/data/dictionary.txt:wasnt'->wasn't
codespell_lib/data/dictionary.txt:wouldnt'->wouldn't

But for some of those it IMHO makes no sense to list the ' if the correction also ends with ', which AFAIK is not a part of the word. I.e., I think the following should simply be removed (replaced with ones without '):

codespell_lib/data/dictionary.txt:gaus'->Gauss'
codespell_lib/data/dictionary.txt:guas'->Gauss'
codespell_lib/data/dictionary.txt:guass'->Gauss'
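A listing like the ones above could be produced with a small helper along these lines. This is only a sketch: it assumes each dictionary line has the `typo->correction` shape seen in the output above, the function name is made up for illustration, and the `teh->the` entry in the sample is an illustrative placeholder rather than a quoted dictionary line.

```python
def entries_with_trailing_apostrophe(lines):
    """Return (typo, correction) pairs whose typo ends with an apostrophe.

    Assumes one 'typo->correction' entry per line; other lines are skipped.
    """
    found = []
    for line in lines:
        typo, sep, correction = line.strip().partition("->")
        if sep and typo.endswith("'"):
            found.append((typo, correction))
    return found


# Sample entries: three taken from this thread, one illustrative placeholder.
sample = ["were'->we're", "teh->the", "gaus'->Gauss'", "doesn'->doesn't"]
flagged = entries_with_trailing_apostrophe(sample)

# Entries where the correction also ends with ' are the ones questioned above.
both_trailing = [(t, c) for t, c in flagged if c.endswith("'")]
```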

At first I wondered whether this case is worth fixing at all: since were is a legit word, the trailing ' could also be a leftover whose opening ' was forgotten somewhere long before, e.g. in

var = stay as you were'

which would be a programming language gotcha, not a typo.

FWIW were' was added originally in

In the remaining cases it boils down to

"' is a part of the word, and thus could be present in the typo in an alternative location"

(I would still argue to exclude were')..

@yarikoptic yarikoptic changed the title Do not allow for a word to start with punctuation symbols Do not allow for a word to start or end with punctuation symbols Nov 22, 2024
yarikoptic added a commit to yarikoptic/python-sdk that referenced this pull request Nov 22, 2024
codespell from codespell-project/codespell#3588

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "codespell -w ./tests/unit/test_schema_invalids.py",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
@larsoner
Member

I haven't looked deeply but am I right that this single-line comment:

# blah blah 'blah blah' blah

would get handled differently from this multi-line one

# blah blah 'blah
# blah' blah

? To me these are essentially the same code comment so it's weird that codespell would treat them differently.

It's even a bit weird to try to keep track of the quotation level in any way, as it seems brittle, especially given that people can forget start and end quotes from time to time.

@yarikoptic
Contributor Author

Since the "word regex" should not contain a space, we should be robust to such examples -- those should be separate words (on one line or not).
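A quick way to check that claim, sketched under two assumptions: text is tokenized line by line, and the word pattern contains no whitespace. The `WORD` pattern below is only a stand-in for illustration, not the regex from this PR.

```python
import re

# Assumption: a whitespace-free word pattern (stand-in, not the PR's regex).
WORD = re.compile(r"[\w\-'’]+")


def tokens(text):
    """Tokenize text line by line and return the words in order."""
    return [word for line in text.splitlines() for word in WORD.findall(line)]


single_line = "# blah blah 'blah blah' blah"
multi_line = "# blah blah 'blah\n# blah' blah"

single_tokens = tokens(single_line)
multi_tokens = tokens(multi_line)
```

Because no match can span whitespace, both comments yield the same sequence of words, so any per-word rule about leading or trailing quotes sees identical input in the two cases.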

Lance-Drane pushed a commit to INTERSECT-SDK/python-sdk that referenced this pull request Nov 25, 2024
codespell from codespell-project/codespell#3588

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "codespell -w ./tests/unit/test_schema_invalids.py",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
Development

Successfully merging this pull request may close these issues.

Detection of string delimiters
2 participants