Skip to content

Conversation

linusg
Copy link
Contributor

@linusg linusg commented Sep 14, 2024

Please review this very carefully, I have no idea what I'm doing :)

Closes #77.
Closes #530.

break;
case RUN_TYPE_U_EXT2:
case RUN_TYPE_U_EXT3:
// TODO
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone needs to tell me how to implement these, my brain is not capable of writing this kind of obfuscated C code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caveat emptor, I'm not 100% sure but I suspect you can merge the RUN_TYPE_U_EXT2 case with RUN_TYPE_LF_EXT2 and update the guard to:

if (is_lower != (type - RUN_TYPE_U_EXT2))
    break;

RUN_TYPE_U_EXT2 should immediately precede RUN_TYPE_LF_EXT2 in the enum declaration for that to work (not the case right now.)

For RUN_TYPE_U_EXT3, the code should probably look like this:

case RUN_TYPE_U_EXT3:
    res[0] = c - code + case_conv_ext[data >> 8];
    res[1] = case_conv_ext[(data >> 4) & 0x0f];
    res[2] = case_conv_ext[data & 0x0f];
    return 3;

Comment on lines +1026 to +1046
if (ci->u_len == 2 && ci->u_data[1] == 0x399 &&
ci->l_len == 1) {
len = 1;
while (code + len <= CHARCODE_MAX) {
ci1 = &tab[code + len];
if (!(ci1->u_len == 2 &&
ci1->u_data[1] == 0x399 &&
ci1->u_data[0] == ci->u_data[0] + len &&
ci1->l_len == 1 &&
ci1->l_data[0] == ci->l_data[0] + len))
break;
len++;
}
te->len = len;
te->type = RUN_TYPE_U2_399_EXT2;
te->ext_data[0] = ci->u_data[0];
te->ext_data[1] = ci->l_data[0];
te->ext_len = 2;
return;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy and pasted from the one below except there is no F component

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means the if statement below is dead code now, right? Less specific subsumes more specific.

It's probably okay if you swap them. The code itself looks correct at a superficial glance.

break;
case RUN_TYPE_U_EXT2:
case RUN_TYPE_U_EXT3:
// TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caveat emptor, I'm not 100% sure but I suspect you can merge the RUN_TYPE_U_EXT2 case with RUN_TYPE_LF_EXT2 and update the guard to:

if (is_lower != (type - RUN_TYPE_U_EXT2))
    break;

RUN_TYPE_U_EXT2 should immediately precede RUN_TYPE_LF_EXT2 in the enum declaration for that to work (not the case right now.)

For RUN_TYPE_U_EXT3, the code should probably look like this:

case RUN_TYPE_U_EXT3:
    res[0] = c - code + case_conv_ext[data >> 8];
    res[1] = case_conv_ext[(data >> 4) & 0x0f];
    res[2] = case_conv_ext[data & 0x0f];
    return 3;

Comment on lines +1026 to +1046
if (ci->u_len == 2 && ci->u_data[1] == 0x399 &&
ci->l_len == 1) {
len = 1;
while (code + len <= CHARCODE_MAX) {
ci1 = &tab[code + len];
if (!(ci1->u_len == 2 &&
ci1->u_data[1] == 0x399 &&
ci1->u_data[0] == ci->u_data[0] + len &&
ci1->l_len == 1 &&
ci1->l_data[0] == ci->l_data[0] + len))
break;
len++;
}
te->len = len;
te->type = RUN_TYPE_U2_399_EXT2;
te->ext_data[0] = ci->u_data[0];
te->ext_data[1] = ci->l_data[0];
te->ext_len = 2;
return;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means the if statement below is dead code now, right? Less specific subsumes more specific.

It's probably okay if you swap them. The code itself looks correct at a superficial glance.

te->type = RUN_TYPE_L_EXT;
te->ext_len = 1;
te->ext_data[0] = te->data;
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With some clever rearranging of the enum values, you can avoid code duplication by folding the three cases into one:

case RUN_TYPE_L:
case RUN_TYPE_LF:
case RUN_TYPE_U:
    te->type += RUN_TYPE_L_EXT - RUN_TYPE_L;
    te->ext_len = 1;
    te->ext_data[0] = te->data;
    break;

@bnoordhuis
Copy link
Contributor

@linusg do you plan on revisiting this?

FWIW, I don't think this fully closes out #77 until the regex engine is taught about unicode emoji (tr51). Ex. /\p{Basic_Emoji}/ with the challenge that emoji are variable-length encodings.

I did a proof of concept where I pregenerate the regex bytecode for Basic_Emoji, Emoji_Keycap_Sequence, etc. and memcpy it when compiling regular expressions but that results in really bloated bytecode.

@linusg
Copy link
Contributor Author

linusg commented Sep 27, 2024

I won't get around to it until sometime next week at the earliest, if someone else wants to pick this up before then that's fine by me :)

@bnoordhuis
Copy link
Contributor

No rush. I was just wondering if I should adopt it.

@linusg
Copy link
Contributor Author

linusg commented Oct 7, 2024

I didn't find the time for this yet and likely won't in the coming weeks either - if you or someone else wants to finish it please feel free, I'd love to have this merged!

@bnoordhuis
Copy link
Contributor

I took a slightly different tack because I wanted to understand why 15.1 behaved differently from 16.0 but the update to 15.1 was done in 416ab66, followed immediately afterwards by an update to 16.0 in 243b968. Thanks anyway for the pull request!

@bnoordhuis bnoordhuis closed this Nov 9, 2024
@linusg linusg deleted the unicode-16 branch November 9, 2024 22:34
@linusg
Copy link
Contributor Author

linusg commented Nov 9, 2024

Thank you, much appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unicode needs to be updated to 16.0.0 Update unicode

2 participants