-
Notifications
You must be signed in to change notification settings - Fork 497
Added checking on two word sequences in addition to word by word #1657
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Refactored to minimize the change and fix the "missing last word" bug. :) |
|
Hi @jima80525 , Thanks for this, it sounds quite exciting to me. As you can probably imagine, there are a lot of typos where essentially the space is in the wrong place (e.g. "spellin gcheck" type things) which could be potentially caught with code like this. Certainly various stuff with spaces in the typo has been removed from the dictionary previously as Codespell couldn't find it. Some tests would be great for starters. Also a few corner cases for you to consider, what about snake_case and camelCase? Being able to find and fix typos in those cases too would be good. It looks like this is just naively taking every pair of words, this would have some interesting edge cases too, if the boundary between words has a full stop. For example: I think would be corrected by your code to: Your specific suggestion would also trip up Python I guess there are a few classes of two word typos, the ones where you want to remove the separator (e.g. your informal @larsoner any thoughts? |
|
I think it is awesome :-) |
|
@peternewman and @sebweb3r - Tests: definitely. Corner cases:
So, the big question for the maintainers is: are you willing to go to a full parser to get this? Even if it breaks the "bring your own regex" feature? And, if that's the easiest way to go, do you want this whole mess under a separate option, or would you drop the regex stuff and go with the parser? (I don't think dropping the regex is a great idea, mind you. I'm just trying to get a feel for the solution space. :) ) Anyway, I can take a look at this and start working on it in earnest. My progress will be slow due to other commitments, however. Last minute edit: technically I should be using the word "lexer" or "tokenizer" instead of "parser" above, I believe. My CS courses were a long, long time ago. :) |
|
Strictly speaking, |
|
And regarding |
I wouldn't imagine my proposed change modifying the dictionaries at all. The |
|
Never mind the previous comment about python 3.5. Figured it out immediately after posting. :) |
|
I'm also okay with getting rid of 3.5 support, EOL is two weeks from today. So if it's substantially simpler in 3.6 let's just kill 3.5 |
|
Nah - I just had a brain fart. Allowing 3.5 was a matter of changing Side question: Is there any documentation on how to run the tests? Doesn't look like the makefile will do it.... |
You just need to do
Well volunteered! 😆 |
|
Generally speaking you can always get some sense of the tests a project with CIs by looking at the config. Personally I just do |
…allow-two-word-fixes
|
OK. I'm still missing something. Getting 2 failed tests. I'm on sha: b8f8b9a |
|
Yes! Did I just miss that? I can't find it on master. Or is it just not merged yet :) Yep - says it right there in the link. Guess it's my night for not-so-clever questions. I'm off and running. Will likely close this PR and start a new one when I get my prototype merged in, new tests added and the old ones modified if need be. Thanks for all the help this evening! |
No problem, glad you figured it out 👍
No need to close, just push commits to this branch (force-push if you want) and ping for a review once you're happy or need help. But if you want to close for now and open one later, you can. |
You can also convert to and from draft if it's still a work in progress! |
|
Looks like there are a couple of options for installing aspell-python. is this the correct one? https://github.com/WojciechMula/aspell-python Nvr mind. Found it in the .travis.yml |
|
OK. Got aspell installed and those tests running. Now I'm getting a series of Any hints? |
|
Maybe one of the comments in #1650 will help? |
|
No joy. I read through the suggested error. Here's my status from it: python: I also show this, which I"m not sure of, but I suspect is correct: It looked like the problem in the other issue was getting aspell to run. I've got that running, it's just the tests that use aspell are now failing instead of being skipped. I'm running: This would indicate that it doesn't like that when I run: It only finds I'm out of ideas. Can you help me understand what error it is you're trying to capture here? |
|
Just for completeness: This is on LInuxMint 19. |
|
I have to test this with a mint vm. It looks like mint is not using the same dictionary as ubuntu/debian. If I look up your words in aspell: |
|
Other option is: you accidentally added it to your aspell. |
|
In codespell's dictionary.txt there's Calender is a real word, but I suspect that 99% of the time it'll be a typo for calendar. |
Perhaps get the easy bit in for now, i.e. just chuck in some code which only looks at space word boundaries or something.
@lurch it can do those already if you tweak the regex, there are a few floating around. I use one for snake case. I think there is a PR for one that does camel. @jima80525 I was actually thinking more that with your hypothetical
Yeah, again probably separate dictionaries for now I'd imagine. Again you could skip .py files, but then you don't fix the documentation.
You could make it configurable, but either option is at risk of hitting a line length thing, so I'd imagine add it to the first and let people run their linter afterwards to fix that.
Probably one for @larsoner . The obvious end game would be dealing with code blocks different from documentation in the same file, but the best bet there is probably to get someone else to write that, as they'll already exist. Hopefully possibly codespell could hook into one of those and just process each token or pairs of tokens. Can you not fix it trivially if inefficiently by matching pairs of words and then checking on the divider in the middle. E.g. Would look at: Not perfect, but possibly good enough for most cases? Dictionary chaining has also been mentioned before, and confirmed to not currently work, and that and the order of dictionaries becomes even more important here, so that
We already have this @lurch . There is https://github.com/codespell-project/codespell/blob/master/codespell_lib/data/dictionary_informal.txt . Logically it could auto-generate the reverse like the US/GB if desired. |
I don't have time today, and I'm off on holiday all next week... 😎 |
That matches my Ubuntu install, but perhaps they're later dictionaries.
I think the variant files may be equivalent to large. |
Yeah - this is where I"m thinking as well. Comment vs. code is a different can of worms, even if you restrict it to Python 👍
I'll look into adding that config option.
I managed to get a regex and generator solution that comes up with from the above string (including the period). Note that it does not test As far as the dictionary unit test issues, I'm just going to ignore those for now and try to get some unit tests in to test the new code. I've fixed the code to pass the non-dictionary based tests already. I'll try to confirm that the dictionary tests are failing on my machine in the same manner as master before I submit a new commit :) |
|
|
||
| word_regex_def = u"[\\w\\-'’`]+" | ||
| # NOTE: flake8 suppression due to it not liking \] escape sequence | ||
| word_regex_def = u"([\\w\\-'’`]+)([.,?!-:;><@#$%^&*()_+=/\]\\[])?" # noqa W605 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I got all of the punctuation marks here. Oddly, flake8 doesn't really like \]
|
|
||
| code, stdout, stderr = cs.main(filename, "-D%s" % dictname, std=True) | ||
| assert code == 0 # no changes found | ||
| assert "won't" not in stdout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up not doing matches across line breaks. This is possible, but the code to switch between two of them is quite different. (for matching across linebreaks, the extract_pairs object gets constructed outside of the loop and then a extract method is called on each line.
The way it is currently it resets the object on each line.
This is fixable, but I don't want to make too much of a mess here.
|
@larsoner this feature would be awesome for the release too. |
|
@sebweb3r I prefer not to merge big functional changes like this right before a release, but rather right after. That way expert users who use the |
|
Sounds reasonable. |
|
I've tested the changes in #1607 on top of this (rebased on top of the last release too), and I ran into a single problem.
Maybe it should just check for non-space whitespaces? |
|
Thanks for looking at this. I could never get the dictionary checks to run
(even without the new code) and thus never saw this. Not my decision as to
whether your proposed solution is acceptable, however. :)
…On Tue, Nov 24, 2020 at 8:00 AM hadess ***@***.***> wrote:
I've tested the changes in #1607
<#1607> on top of this
(rebased on top of the last release too), and I ran into a single problem.
test_dictionary_formatting will throw an error if any of the shipped
dictionaries have whitespace:
E error 'dummy value' has whitespace
E error 'man hours' has whitespace
E error 'sanity check' has whitespace
Maybe it should just check for non-space whitespaces?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1657 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACNXQO5JAE7C5VN7PQ7ZPULSRPC75ANCNFSM4QPHRD6A>
.
|
|
@jima80525 As we're past the one-year anniversary for this MR, did you want to give it another go? |
|
@hadess - As you might have guessed, life has taken some turns for me. I've gotten sucked into a small startup and won't have time to dig into this in the near future. I'm terribly sorry for raising this and then ghosting. |
I'm not sure if this is a feature you're interested in, but it's something I needed for a project I'm working on and thought I'd like you decide if you wanted it.
This allows checking two-word phrases in the dictionary. For example, you can flag
is not->isn't(which is the type of checking I am after).I intentionally didn't add this to the section checking filenames, as that seemed a less-than-likely use case.
I'm happy to clean up if there are changes you'd like to see. Also completely understand if this is not functionality in which you are interested.
THANK YOU for this project. As I said, this will give me a great jumping off point for some tooling I need to develop.