toke.c: S_intuit_more: Add more commentary #23708

khwilliamson · 2025-09-14T17:48:19Z

This function is described in its comments as 'terrifying', and by its original author, Larry Wall, as "truly awful". As a result, it has been mostly untouched since its introduction in 1993. That means it has not been updated as new language features have been added.

As an example, it does not know about lexical variables, so the code it has for globals just doesn't work with the vast majority of modern-day coding practices.

Another example is that it knows nothing of UTF-8; as a result, simply changing the input encoding from Latin-1 to UTF-8 can flip its outcome to the opposite result.
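To make the encoding point concrete, here is a minimal C sketch. It is not the code in S_intuit_more; `count_repeated_bytes` and the two sample buffers are hypothetical names invented for illustration. It assumes a heuristic that inspects raw bytes and treats repeated bytes as evidence, and shows how the same three distinct code points present no repeats in Latin-1 but two repeats in UTF-8, because every UTF-8 sequence for U+00C0..U+00FF starts with the byte 0xC3.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from toke.c: count how many bytes in
 * s[0..len) have already appeared earlier in the same buffer.  This is
 * the number a purely byte-oriented "repeated character" heuristic
 * would see. */
static size_t
count_repeated_bytes(const unsigned char *s, size_t len)
{
    unsigned char seen[256];
    size_t repeats = 0;

    memset(seen, 0, sizeof seen);
    for (size_t i = 0; i < len; i++) {
        if (seen[s[i]]) {
            repeats++;          /* this byte value occurred before */
        }
        seen[s[i]] = 1;
    }
    return repeats;
}

/* The same three distinct code points, U+00E0, U+00E9 and U+00F1,
 * first as Latin-1 bytes and then as UTF-8 bytes. */
static const unsigned char latin1_text[] = { 0xE0, 0xE9, 0xF1 };
static const unsigned char utf8_text[]   = { 0xC3, 0xA0,
                                             0xC3, 0xA9,
                                             0xC3, 0xB1 };
```

Calling `count_repeated_bytes(latin1_text, sizeof latin1_text)` gives 0, while `count_repeated_bytes(utf8_text, sizeof utf8_text)` gives 2, even though both buffers spell the same text with no repeated character. Any decision weighted on that count can therefore tip the other way purely because of the encoding.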

And it is buggy.

A few years ago, I set out to try to understand it. I added commentary and simplified some overly complicated expressions, but left its behavior unchanged.

Now, I set out to make some changes, and found many more issues than I had earlier. This commit adds commentary about those. Hopefully this will lead to some discussion and a consensus on the way forward.

This set of changes does not require a perldelta entry.

jkeenan

Reviewed solely for spelling and grammar issues.

jkeenan · 2025-10-02T00:11:46Z

toke.c

    /* Examine each character in the construct */
+    /* That this knows nothing of UTF-8 can lead to opposite results if the
+     * text is encoded in UTF-8 or not; another relic of the Unicode Bug.
+     * Suppose a string consistis of various un-repeated code points between


s/consistis/consists/ in above

jkeenan · 2025-10-02T00:12:27Z

toke.c

+                     * (Some do it though as a mnemonic that is meaningful to
+                     * them.)  But generally, repeated characters makes things
+                     * more likely to be a charclass.  But we have here that
+                     * this an identifier so likely a subscript.  Its spelling


Sentence beginning with But lacks a verb, is confusing.

This function is described in its comments as 'terrifying', and by its original author, Larry Wall, as "truly awful". As a result, it has been mostly untouched since its introduction in 1993. That means it has not been updated as new language features have been added.

As an example, it does not know about lexical variables, so the code it has for globals just doesn't work with the vast majority of modern-day coding practices.

Another example is that it knows nothing of UTF-8; as a result, simply changing the input encoding from Latin-1 to UTF-8 can flip its outcome to the opposite result.

And it is buggy.

An example of how hard this can be to get right is this fairly common use in our test suite:

    [$A-Z]

That looks like a character class matching 27 characters. But wait: what if there exists a $A and a parameterless subroutine 'Z'? Then this could instead be an expression for a subscript.

A few years ago, I set out to try to understand it. I added commentary and simplified some overly complicated expressions, but left its behavior unchanged.

Now, I set out to make some changes, and found many more issues than I had earlier. This commit adds commentary about those. Hopefully this will lead to some discussion and a consensus on the way forward.

khwilliamson force-pushed the intuit_more_commentary branch from 4c36ccf to f0a5a44 · September 15, 2025 12:30

khwilliamson referenced this pull request Sep 15, 2025

intuit_more: no need to copy before keyword check

56f81af

That also avoids crashing on overrun.

khwilliamson force-pushed the intuit_more_commentary branch 2 times, most recently from d1fffd6 to 3407c5f · September 22, 2025 22:03

khwilliamson force-pushed the intuit_more_commentary branch from 3407c5f to c77f0b2 · September 24, 2025 13:33

khwilliamson mentioned this pull request Sep 24, 2025

Initial overhaul of S_intuit_more #23764

Open

jkeenan reviewed Oct 2, 2025

github-actions bot added the hasConflicts label Oct 7, 2025

khwilliamson force-pushed the intuit_more_commentary branch from c77f0b2 to 6ef8978 · October 8, 2025 00:59

khwilliamson removed the hasConflicts label Oct 8, 2025
