@jima80525

I'm not sure if this is a feature you're interested in, but it's something I needed for a project I'm working on, and I thought I'd let you decide if you wanted it.

This allows checking two-word phrases in the dictionary. For example, you can flag is not -> isn't (which is the type of checking I am after).

I intentionally didn't add this to the section checking filenames, as that seemed a less-than-likely use case.

I'm happy to clean up if there are changes you'd like to see. Also completely understand if this is not functionality in which you are interested.

THANK YOU for this project. As I said, this will give me a great jumping off point for some tooling I need to develop.

@jima80525
Author

Refactored to minimize the change and fix the "missing last word" bug. :)

@peternewman
Collaborator

Hi @jima80525,

Thanks for this, it sounds quite exciting to me. As you can probably imagine, there are a lot of typos where essentially the space is in the wrong place (e.g. "spellin gcheck" type things) which could be potentially caught with code like this. Certainly various stuff with spaces in the typo has been removed from the dictionary previously as Codespell couldn't find it.

Some tests would be great for starters.

Also, a few corner cases for you to consider: what about snake_case and camelCase? Being able to find and fix typos in those cases too would be good. It looks like this is just naively taking every pair of words, which would have some interesting edge cases too if the boundary between words has a full stop. For example:
That's just the way it is. Not everything is simple

I think would be corrected by your code to:
That's just the way it isn't everything is simple
Which is probably not quite the intention!
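
To make the edge case concrete, here is a minimal sketch of naive pair extraction. It is not the code from this PR; the function name and the word pattern are assumptions chosen for illustration. Because punctuation is discarded before pairing, the phrase "is not" appears even though a full stop separates the two words:

```python
import re


def naive_pairs(line):
    """Hypothetical helper: lowercase the line and pair every adjacent word.

    Punctuation between words is simply thrown away, which is the
    behaviour being questioned above.
    """
    words = re.findall(r"[\w'-]+", line.lower())
    return [a + " " + b for a, b in zip(words, words[1:])]


line = "That's just the way it is. Not everything is simple"
pairs = naive_pairs(line)
# The sentence boundary is invisible to the pairing step.
print("is not" in pairs)
```

Any fix needs to look at what sits between the two words, not only at the words themselves.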

Your specific suggestion would also trip up Python x is not y, but that could be solved with some thought about dictionaries hopefully.

I guess there are a few classes of two-word typos: the ones where you want to remove the separator (e.g. your informal isn't, or "any one"-type things where you can guarantee they should be one word), the ones where the separator has gone in the wrong place (as I mentioned above), and possibly ones where the two words can be used to inform each other about the typo.

@larsoner any thoughts?

@sebweb3r
Contributor

I think it is awesome :-)

@jima80525
Author

jima80525 commented Aug 30, 2020

@peternewman and @sebweb3r -
Thanks for the encouragement! It honestly didn't occur to me that this would be useful outside of my particular use case, so I threw this up without tests to see if there was interest.

Tests: definitely.

Corner cases:

  • Punctuation between words: this is going to require some messing around with the parser. IIRC this is done via a regex. I'm not 100% sure that a regex will be fully functional in capturing the punctuation issue. I suspect at a minimum that this will not play nicely with the "bring your own regex" feature. At worst, it's a full lexer/parser addition, which is possible, but sort of changes the nature of the beast. Is that a change you're willing to accept?
  • camelCase: I'm not seeing an issue with this one? It will just get treated as a single word and lowercased (is that a word?)
  • snake_case: This can be caught by modifying delimiters between words, essentially by including _ as a whitespace character. I think this is slightly orthogonal to the 2-word problem, however, but fully admit I'm probably missing something. :)
  • Python's x is not y: this is a tough issue. In my use case (markdown documents containing code), I can simply skip the code blocks. That's likely not acceptable for your project, however. I suspect that the only real solution for this is "don't do that conversion on Python code", which is what I think you're saying above.
  • (a bonus corner case!) Cross-line substitutions: Again, for my use case this is not an issue, but it is in general. What if line n ends with is and line n+1 starts with not? We can build a solution that catches this (should we?), but if we do, where does the substitution go? Line n or line n+1? This is solvable, likely just by picking an answer and going with it. The real question is if we want to do this.

So, the big question for the maintainers is: are you willing to go to a full parser to get this? Even if it breaks the "bring your own regex" feature? And, if that's the easiest way to go, do you want this whole mess under a separate option, or would you drop the regex stuff and go with the parser? (I don't think dropping the regex is a great idea, mind you. I'm just trying to get a feel for the solution space. :) )

Anyway, I can take a look at this and start working on it in earnest. My progress will be slow due to other commitments, however.
Thanks again for the feedback, though! This is a fun design problem to think through.

Last minute edit: technically I should be using the word "lexer" or "tokenizer" instead of "parser" above, I believe. My CS courses were a long, long time ago. :)

@lurch
Contributor

lurch commented Aug 30, 2020

Strictly speaking, is not -> isn't is not a typo, IMHO it's more of a "recommendation". Does codespell have any existing mechanisms in place for differentiating between "things that are definitely wrong" and "things which are suggestions"?
Depending on the formality of the document you're spell-checking, is not->isn't may not always be a desired replacement (so IMHO this should be toggle-able with a command-line flag); and I guess in some circumstances you may want to do isn't->is not (but I guess that wouldn't work until codespell properly deals with apostrophes in any case).

@lurch
Contributor

lurch commented Aug 30, 2020

And regarding snake_case and camelCase, I think @peternewman was suggesting that it would be cool if codespell could spell-check things_liek_this or thingsLikeTihs or ThinsgLikeThis ? (if he wasn't suggesting that, I think it'd be a very useful feature to have!!)
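
As a rough illustration of what checking such identifiers would involve, the sketch below splits snake_case and camelCase names into their component words so that each part could be looked up on its own. The helper name and the pattern are assumptions for this example, not anything codespell ships:

```python
import re

# Hypothetical pattern: an optional capital followed by lowercase letters,
# or a run of capitals not followed by a lowercase letter, or a run of digits.
_PART = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")


def split_identifier(name):
    """Split a snake_case or camelCase identifier into candidate words."""
    return _PART.findall(name)


for name in ("things_liek_this", "thingsLikeTihs", "ThinsgLikeThis"):
    print(name, "->", split_identifier(name))
```

Each resulting part ("liek", "Tihs", "Thinsg") can then be lowercased and checked against the dictionary like any other word.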

@jima80525
Author

> Strictly speaking, is not -> isn't is not a typo, IMHO it's more of a "recommendation". Does codespell have any existing mechanisms in place for differentiating between "things that are definitely wrong" and "things which are suggestions"?
> Depending on the formality of the document you're spell-checking, is not->isn't may not always be a desired replacement (so IMHO this should be toggle-able with a command-line flag); and I guess in some circumstances you may want to do isn't->is not (but I guess that wouldn't work until codespell properly deals with apostrophes in any case).

I wouldn't imagine my proposed change modifying the dictionaries at all. The is not -> isn't is for an odd use case I have, but I'll use a private dictionary for that. It's just an example.

@jima80525
Author

Never mind the previous comment about python 3.5. Figured it out immediately after posting. :)

@larsoner
Member

larsoner commented Sep 1, 2020

I'm also okay with getting rid of 3.5 support, EOL is two weeks from today. So if it's substantially simpler in 3.6 let's just kill 3.5

@jima80525
Author

Nah - I just had a brain fart. Allowing 3.5 was a matter of changing match[1] -> match.group(1).
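
For context on that change: indexing a match object (match[1]) was added in Python 3.6, whereas match.group(1) also works on 3.5. A tiny illustration with a made-up pattern (not the one used in this PR):

```python
import re

match = re.search(r"(\w+) (\w+)", "is not")

# Works on every Python 3 release, including 3.5.
first = match.group(1)

# Equivalent shorthand, but only available from Python 3.6 onwards.
first_indexed = match[1]

print(first, first_indexed)
```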

Side question: Is there any documentation on how to run the tests? Doesn't look like the makefile will do it....

@peternewman
Collaborator

> Side question: Is there any documentation on how to run the tests?

You just need to do pytest codespell_lib

> Doesn't look like the makefile will do it....

Well volunteered! 😆

@larsoner
Member

larsoner commented Sep 1, 2020

Generally speaking, you can always get some sense of how a project with CI runs its tests by looking at the CI config. Personally I just do pytest codespell_lib and sometimes flake8 codespell_lib before pushing commits.

@jima80525
Author

OK. I'm still missing something. Getting 2 failed tests. I'm on sha: b8f8b9a
which looks like it's the latest. Is there a requirements.txt file I should be installing? I installed the stuff in appveyor.yml (pytest pytest-cov pytest-dependency setuptools flake8 coverage chardet codecov).
Sorry to be asking dumb questions, I'm just not used to this project layout.

@larsoner
Member

larsoner commented Sep 1, 2020

@jima80525
Author

jima80525 commented Sep 1, 2020

Yes! Did I just miss that? I can't find it on master. Or is it just not merged yet :)

Yep - says it right there in the link. Guess it's my night for not-so-clever questions.

I'm off and running. Will likely close this PR and start a new one when I get my prototype merged in, new tests added and the old ones modified if need be.

Thanks for all the help this evening!

@larsoner
Member

larsoner commented Sep 1, 2020

> Yep - says it right there in the link. Guess it's my night for not-so-clever questions.

No problem, glad you figured it out 👍

> Will likely close this PR and start a new one when I get my prototype merged in, new tests added and the old ones modified if need be.

No need to close, just push commits to this branch (force-push if you want) and ping for a review once you're happy or need help. But if you want to close for now and open one later, you can.

@peternewman
Collaborator

> But if you want to close for now and open one later, you can.

You can also convert to and from draft if it's still a work in progress!

@jima80525
Author

jima80525 commented Sep 3, 2020

Looks like there are a couple of options for installing aspell-python. Is this the correct one?

https://github.com/WojciechMula/aspell-python

Never mind. Found it in the .travis.yml

@jima80525
Author

OK. Got aspell installed and those tests running. Now I'm getting a series of errors: "some word" should be in aspell dictionary.... Seeing this on unmodified code from master.

Any hints?

@lurch
Contributor

lurch commented Sep 3, 2020

Maybe one of the comments in #1650 will help?

@jima80525
Author

No joy. I read through the suggested issue. Here's my status from it:

Have you got, OS packages:
    libaspell-dev
    aspell-en
 $ sudo apt install libaspell-dev aspell-en
Reading package lists... Done
Building dependency tree       
Reading state information... Done
aspell-en is already the newest version (7.1-0-1.1).
libaspell-dev is already the newest version (0.60.7~20110707-3ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

python:

$ pip list
Package            Version   Location
------------------ --------- ---------------------------
aspell-python-py3  1.15
pytest             6.0.1
pytest-cov         2.10.1
pytest-dependency  0.5.1

I also show this, which I'm not sure of, but I suspect is correct:

codespell          2.0.dev0  /home/jima/coding/codespell

It looked like the problem in the other issue was getting aspell to run. I've got that running, it's just the tests that use aspell are now failing instead of being skipped.

I'm running:

$ pytest codespell_lib/tests/test_dictionary.py
======================================= test session starts ========================================
platform linux -- Python 3.7.8, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
rootdir: /home/jima/coding/codespell, configfile: setup.cfg
plugins: cov-2.10.1, dependency-0.5.1
collected 45 items                                                                                 

codespell_lib/tests/test_dictionary.py .F.F...F.....................................         [100%]

============================================= FAILURES =============================================
_ test_dictionary_formatting[/home/jima/coding/codespell/codespell_lib/tests/../data/dictionary.txt-in_aspell0] _

fname = '/home/jima/coding/codespell/codespell_lib/tests/../data/dictionary.txt'
in_aspell = (False, None)

    @fname_params
    def test_dictionary_formatting(fname, in_aspell):
        """Test that all dictionary entries are valid."""
        errors = list()
        with open(fname, 'rb') as fid:
            for line in fid:
                err, rep = line.decode('utf-8').split('->')
                err = err.lower()
                rep = rep.rstrip('\n')
                try:
                    _check_err_rep(err, rep, in_aspell, fname)
                except AssertionError as exp:
                    errors.append(str(exp).split('\n')[0])
        if len(errors):
>           raise AssertionError('\n' + '\n'.join(errors))
E           AssertionError: 
E           error 'calender' should not be in aspell for dictionary /home/jima/coding/codespell/codespell_lib/tests/../data/dictionary.txt
E           error 'calenders' should not be in aspell for dictionary /home/jima/coding/codespell/codespell_lib/tests/../data/dictionary.txt
E           error "get's" should not be in aspell for dictionary /home/jima/coding/codespell/codespell_lib/tests/../data/dictionary.txt

[plus two more failing tests with similar output]

This would indicate that it doesn't like that calenders is in the aspell dictionary, for example. I tested this out using aspell on the command line and proved that it is in the dictionary. Using this content for input.txt:

annnotate
calender
calenders
get's

when I run:

$ aspell -c input.txt 

It only finds annnotate to be incorrect, meaning that the other three are, indeed, in aspell's dictionary.

I'm out of ideas. Can you help me understand what error it is you're trying to capture here?

@jima80525
Author

Just for completeness:

$ aspell dump dicts
en
en-variant_0
en-variant_1
en-variant_2
en-w_accents
en-wo_accents
en_CA
en_CA-variant_0
en_CA-variant_1
en_CA-w_accents
en_CA-wo_accents
en_GB
en_GB-ise
en_GB-ise-w_accents
en_GB-ise-wo_accents
en_GB-ize
en_GB-ize-w_accents
en_GB-ize-wo_accents
en_GB-variant_0
en_GB-variant_1
en_GB-w_accents
en_GB-wo_accents
en_US
en_US-variant_0
en_US-variant_1
en_US-w_accents
en_US-wo_accents

This is on Linux Mint 19.

@sebweb3r
Contributor

sebweb3r commented Sep 4, 2020

I'll have to test this with a Mint VM. It looks like Mint is not using the same dictionary as Ubuntu/Debian.

If I look up your words in aspell:
http://app.aspell.net/lookup?dict=en_US&words=annnotate%0D%0Acalender%0D%0Acalenders%0D%0Aget%27s%0D%0A
it says that calender(s) are in US_large, but not in the regular aspell.

@sebweb3r
Contributor

sebweb3r commented Sep 4, 2020

Another option: you accidentally added it to your aspell.
Have a look in ~/.aspell.en.pws

@lurch
Contributor

lurch commented Sep 4, 2020

In codespell's dictionary.txt there's

calender->calendar
calenders->calendars

Calender is a real word, but I suspect that 99% of the time it'll be a typo for calendar.

@peternewman
Collaborator

> In codespell's dictionary.txt there's
>
> calender->calendar
> calenders->calendars
>
> Calender is a real word, but I suspect that 99% of the time it'll be a typo for calendar.

In which case it should be moved to the rare dictionary. Do you want to do the honours, @lurch?

@peternewman
Collaborator

> * **Punctuation between words**: this is going to require some messing around with the parser. IIRC this is done via a regex. I'm not 100% sure that a regex will be fully functional in capturing the punctuation issue. I suspect at a minimum that this will not play nicely with the "bring your own regex" feature. At worst, it's a full lexer/parser addition, which is possible, but sort of changes the nature of the beast. Is that a change you're willing to accept?

Perhaps get the easy bit in for now, i.e. just chuck in some code which only looks at space word boundaries or something.

> * **camelCase**: I'm not seeing an issue with this one? It will just get treated as a single word and lowercased (is that a word?)
>
> * **snake_case**: This can be caught by modifying delimiters between words, essentially by including `_` as a whitespace character. I think this is slightly orthogonal to the 2-word problem, however, but fully admit I'm probably missing something. :)

> And regarding snake_case and camelCase, I think @peternewman was suggesting that it would be cool if codespell could spell-check things_liek_this or thingsLikeTihs or ThinsgLikeThis ? (if he wasn't suggesting that, I think it'd be a very useful feature to have!!)

@lurch it can do those already if you tweak the regex; there are a few floating around. I use one for snake case. I think there is a PR for one that does camel.

@jima80525 I was actually thinking more that, with your hypothetical is not -> isn't, that apostrophe is going to really ruin those variable names. Those sorts of issues. The simple thing for now may be multiple dictionaries, a code-safe one and a code-unsafe one, but possibly something cleverer in future so it can fix the comments but not the code.

> * **Python's `x is not y`**: this is a tough issue. In my use case (markdown documents containing code), I can simply skip the code blocks. That's likely not acceptable for your project, however. I suspect that the only real solution for this is "don't do that conversion on Python code", which is what I think you're saying above.

Yeah, again probably separate dictionaries for now I'd imagine. Again you could skip .py files, but then you don't fix the documentation.

> * (a bonus corner case!) **Cross-line substitutions**: Again, for my use case this is not an issue, but it is in general. What if line `n` ends with `is` and line `n+1` starts with `not`? We can build a solution that catches this (should we?), but if we do, where does the substitution go? Line `n` or line `n+1`? This is solvable, likely just by picking an answer and going with it. The real question is if we want to do this.

You could make it configurable, but either option is at risk of hitting a line length thing, so I'd imagine add it to the first and let people run their linter afterwards to fix that.

> So, the big question for the maintainers is: are you willing to go to a full parser to get this? Even if it breaks the "bring your own regex" feature? And, if that's the easiest way to go, do you want this whole mess under a separate option, or would you drop the regex stuff and go with the parser? (I don't think dropping the regex is a great idea, mind you. I'm just trying to get a feel for the solution space. :) )

Probably one for @larsoner. The obvious end game would be dealing with code blocks differently from documentation in the same file, but the best bet there is probably to rely on something someone else has written, as such tools will already exist. Hopefully codespell could hook into one of those and just process each token or pair of tokens.

Can you not fix it trivially, if inefficiently, by matching pairs of words and then checking the divider in the middle?

E.g. My name is Peter. That is my name.

Would look at:

My name
name is
is Peter
Peter. That
That is
is my
my name

Not perfect, but possibly good enough for most cases?

Dictionary chaining has also been mentioned before and confirmed not to work currently; that, and the order of dictionaries, becomes even more important here, so that iis not becomes is not becomes isn't.

> Strictly speaking, is not -> isn't is not a typo, IMHO it's more of a "recommendation". Does codespell have any existing mechanisms in place for differentiating between "things that are definitely wrong" and "things which are suggestions"?
> Depending on the formality of the document you're spell-checking, is not->isn't may not always be a desired replacement (so IMHO this should be toggle-able with a command-line flag); and I guess in some circumstances you may want to do isn't->is not (but I guess that wouldn't work until codespell properly deals with apostrophes in any case).

We already have this, @lurch. There is https://github.com/codespell-project/codespell/blob/master/codespell_lib/data/dictionary_informal.txt . Logically it could auto-generate the reverse like the US/GB if desired.

@lurch
Contributor

lurch commented Sep 4, 2020

> Do you want to do the honours @lurch ?

I don't have time today, and I'm off on holiday all next week... 😎

@peternewman
Collaborator

> Just for completeness:
>
> This is on Linux Mint 19.

That matches my Ubuntu install, but perhaps they're later dictionaries.

> I have to test this with a mint vm. It looks like mint is not using the same dictionary as ubuntu/debian.
>
> If I look up your words in aspell:
> http://app.aspell.net/lookup?dict=en_US&words=annnotate%0D%0Acalender%0D%0Acalenders%0D%0Aget%27s%0D%0A
> it says, that calender(s) are in US_large. But not in the regular aspell.

I think the variant files may be equivalent to large.

@peternewman
Collaborator

> > Do you want to do the honours @lurch ?
>
> I don't have time today, and I'm off on holiday all next week... 😎

Okay. I've done it myself in #1636 so we don't forget.

@jima80525
Author

> @jima80525 I was actually thinking more that with your hypothetical is not -> isn't that apostrophe is going to really ruin those variable names. Those sort of issues. The simple thing for now may be multiple dictionaries a code safe one and a code unsafe one, but possibly something cleverer in future so it can fix the comments but not the code.
>
> > * **Python's `x is not y`**: is a tough issue. In my use case (markdown documents containing code), I can simply skip the code blocks. That's likely not acceptable for your project, however. I suspect that the only real solution for this is "don't do that conversion on Python code", which is what I think you're saying above.
>
> Yeah, again probably separate dictionaries for now I'd imagine. Again you could skip .py files, but then you don't fix the documentation.

Yeah - this is what I'm thinking as well. Comment vs. code is a different can of worms, even if you restrict it to Python 👍

> > * (a bonus corner case!) **Cross-line substitutions**: Again, for my use case this is not an issue, but it is in general. What if line `n` ends with `is` and line `n+1` starts with `not`. We can build a solution that catches this (should we?), but if we do, where does the substitution go? Line `n` or line `n+1`? This is solvable, likely just by picking an answer going with it. The real question is if we want to do this.
>
> You could make it configurable, but either option is at risk of hitting a line length thing, so I'd imagine add it to the first and let people run their linter afterwards to fix that.

I'll look into adding that config option.

> Can you not fix it trivially if inefficiently by matching pairs of words and then checking on the divider in the middle.
>
> E.g. My name is Peter. That is my name.
>
> Would look at:
>
> My name
> name is
> is Peter
> Peter. That
> That is
> is my
> my name
>
> Not perfect, but possibly good enough for most cases?

I managed to get a regex and generator solution that comes up with

My
My name
name
name is
is
is Peter
Peter
That
That is
is
is my
my
my name
name

from the above string (including the period). Note that it does not test Peter That. :)
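
For readers following along, here is one way such a generator could look. This is only a sketch that reproduces the list above; the function name is made up, and the word pattern is borrowed from the existing word_regex_def, so the actual code in the PR may well differ. A pair is emitted only when nothing but whitespace separates the two words:

```python
import re

# Word pattern borrowed from the existing word_regex_def; everything else
# in this sketch is hypothetical.
WORD = re.compile(r"[\w\-'’`]+")


def words_and_pairs(line):
    """Yield each word, followed by the pair it forms with the next word.

    The pair is skipped when the divider contains anything other than
    whitespace (for example a full stop).
    """
    matches = list(WORD.finditer(line))
    for cur, nxt in zip(matches, matches[1:] + [None]):
        yield cur.group()
        if nxt is not None and line[cur.end():nxt.start()].isspace():
            yield cur.group() + " " + nxt.group()


for item in words_and_pairs("My name is Peter. That is my name."):
    print(item)
```

Joining the pair with a single space keeps the lookup key independent of how much whitespace the source line used.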

As far as the dictionary unit test issues, I'm just going to ignore those for now and try to get some unit tests in to test the new code. I've fixed the code to pass the non-dictionary based tests already. I'll try to confirm that the dictionary tests are failing on my machine in the same manner as master before I submit a new commit :)


word_regex_def = u"[\\w\\-'’`]+"
# NOTE: flake8 suppression due to it not liking \] escape sequence
word_regex_def = u"([\\w\\-'’`]+)([.,?!-:;><@#$%^&*()_+=/\]\\[])?" # noqa W605
Author

I'm not sure I got all of the punctuation marks here. Oddly, flake8 doesn't really like \]
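
Two observations that may explain both points, offered as a sketch and not as the fix adopted in this PR. First, flake8's W605 fires because \] is not a recognised escape sequence in an ordinary (non-raw) string literal; a raw string avoids the warning. Second, inside a character class the sequence !-: is read as a range from "!" to ":", which also covers digits and quote characters. The cleaned-up class below is an assumption about the intended punctuation set:

```python
import re

# "!-:" inside [...] is a range (0x21 to 0x3A), so it matches far more
# than the three characters "!", "-" and ":".
accidental_range = re.compile(r"[!-:]")
print(accidental_range.fullmatch("7") is not None)

# Hypothetical alternative: a raw string (no W605) with "-" placed last
# so that it is taken literally, and both brackets escaped.
punctuation = re.compile(r"[.,?!:;><@#$%^&*()_+=/\]\[-]")
print(punctuation.fullmatch("7") is None)
print(all(punctuation.fullmatch(ch) for ch in "-][!:"))
```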


code, stdout, stderr = cs.main(filename, "-D%s" % dictname, std=True)
assert code == 0 # no changes found
assert "won't" not in stdout
Author

I ended up not doing matches across line breaks. This is possible, but the code for the two approaches is quite different: for matching across line breaks, the extract_pairs object gets constructed outside of the loop and then an extract method is called on each line.
The way it is currently, it resets the object on each line.
This is fixable, but I don't want to make too much of a mess here.
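
A rough sketch of the cross-line variant described above, assuming an object that lives outside the per-line loop and remembers the trailing word of the previous line. The class and method names are hypothetical and this is not the PR's implementation. A pair that spans a line break is reported against the first of the two lines:

```python
import re

WORD = re.compile(r"[\w\-'’`]+")


class PairExtractor:
    """Hypothetical stateful extractor; create once, call extract() per line."""

    def __init__(self):
        self.prev_word = None    # trailing word of the previous line, if any
        self.prev_lineno = None

    def extract(self, lineno, line):
        """Yield (line number, pair). Consume fully so the state is updated."""
        matches = list(WORD.finditer(line))
        # Pair carried over from the previous line, only if this line
        # starts with a word (ignoring leading whitespace).
        if (self.prev_word is not None and matches
                and line[:matches[0].start()].strip() == ""):
            yield self.prev_lineno, self.prev_word + " " + matches[0].group()
        for cur, nxt in zip(matches, matches[1:]):
            if line[cur.end():nxt.start()].isspace():
                yield lineno, cur.group() + " " + nxt.group()
        # Remember the last word only if nothing but whitespace follows it.
        if matches and line[matches[-1].end():].strip() == "":
            self.prev_word, self.prev_lineno = matches[-1].group(), lineno
        else:
            self.prev_word = self.prev_lineno = None


extractor = PairExtractor()
found = []
for number, text in enumerate(["the answer is\n", "not simple.\n"], start=1):
    found.extend(extractor.extract(number, text))
print(found)
```

Resetting the state on a blank line or on trailing punctuation keeps paragraph and sentence boundaries from producing pairs.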

@sebweb3r
Contributor

sebweb3r commented Nov 3, 2020

@larsoner this feature would be awesome for the release too.

@larsoner
Member

larsoner commented Nov 3, 2020

@sebweb3r I prefer not to merge big functional changes like this right before a release, but rather right after. That way expert users who use the master version can have time to be the ones who report bugs rather than everyone else (who would prefer a stable experience).

@sebweb3r
Contributor

sebweb3r commented Nov 3, 2020

Sounds reasonable.

@hadess
Contributor

hadess commented Nov 24, 2020

I've tested the changes in #1607 on top of this (rebased on top of the last release too), and I ran into a single problem.

test_dictionary_formatting will throw an error if any of the shipped dictionaries have whitespace:

E           error 'dummy value' has whitespace
E           error 'man hours' has whitespace
E           error 'sanity check' has whitespace

Maybe it should just check for non-space whitespace?
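
One possible shape for such a relaxed check, as a sketch only: the helper name is made up and this is not how the existing test is written. It accepts single spaces between words but still rejects tabs, newlines, doubled spaces, and leading or trailing spaces:

```python
import re

# Matches any whitespace character other than a plain space.
NON_SPACE_WHITESPACE = re.compile(r"[^\S ]")


def has_disallowed_whitespace(entry):
    """Hypothetical replacement for a blanket 'has whitespace' check."""
    return (
        NON_SPACE_WHITESPACE.search(entry) is not None
        or entry != entry.strip(" ")
        or "  " in entry
    )


for entry in ("man hours", "sanity check", "man\thours", "man  hours"):
    print(repr(entry), has_disallowed_whitespace(entry))
```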

@jima80525
Author

jima80525 commented Nov 24, 2020 via email

@hadess
Contributor

hadess commented Oct 21, 2021

@jima80525 As we're past the one-year anniversary for this MR, did you want to give it another go?

@jima80525
Author

@hadess - As you might have guessed, life has taken some turns for me. I've gotten sucked into a small startup and won't have time to dig into this in the near future. I'm terribly sorry for raising this and then ghosting.

@jima80525 jima80525 closed this by deleting the head repository May 4, 2025