Add utf8proc_isequal_normalized #298
base: master
Conversation
A thread-local global variable... I see how this is attractive, and this would work in plain C/C++ code. I'm not quite sure this will work in Julia, or with other run-time systems that may set up their own threading mechanism. I say this because Julia has its own calling convention (its own ABI) and I am not sure that Julia's |
I'm not sure what you mean; the thread-local is just my test/example code, and it's not required at all for use of the new function I added in the commit. |
Right, my bad. |
I'm not a huge fan of the complication of a re-entrant API. I really think the algorithm can be re-worked to avoid allocations entirely by using a fixed length (stack) array of the most recent character for each combining class. |
A version that doesn't require scratch buffers would be nice, but with my current understanding of the problem I don't see how that's possible without sacrificing performance. Valid Unicode sequences can contain an unbounded number of combining characters in any order, and when comparing two un-normalized UTF-8 strings you need some way to detect that they would normalize to the same thing even if the un-normalized versions are in different orders. The Julia algorithm did this by decomposing into the scratch buffers and sorting the combining characters there so they'd be ready for simple direct comparison; when I ported it to C, I copied the normalization sorting code that already existed elsewhere in the same source file.

The worst-case scenario is that both strings are entirely one long sequence of combining characters in different orders, meaning the two scratch buffers used for sorting each need to be able to hold the entire source string decomposed to 32-bit code points. Any solution that does not require scratch buffers will necessarily have terrible performance past some threshold for the number of combining characters in a row, which can open the door to denial-of-service attacks. At least with the scratch-buffer approach, the maximum required buffer size can be computed in advance and rejected if it's undesirable, and then the only cost is sorting the combining sequences for comparison.

The scratch-buffer approach also lends itself well to using local stack buffers as a fast path for the vast majority of "normal" input strings that don't have too many combining characters in a row, and the choice of how to handle the less usual strings falls to the user, if they even want to bother with it at all. Hence my suggestion to possibly add a simple wrapper that makes that decision internally, so users who aren't picky at the moment can use the simplified wrapper until they need to optimize. It's possible I'm overestimating the problem, though; I don't have a full grasp of Unicode combining classes and combining characters, so if you think it's simpler than this, please do tell me. |
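For concreteness, here is a minimal, non-re-entrant sketch of that baseline scratch-buffer idea, written against the existing utf8proc_decompose API. The helper name nfd_equal and the two-pass sizing pattern are illustration only, not code from this PR; the sketch also relies on utf8proc_decompose canonically reordering combining marks once the output buffer is large enough.

```cpp
// Sketch only: compare two UTF-8 strings for canonical equivalence by
// decomposing both into scratch buffers of code points and comparing them.
#include <utf8proc.h>
#include <vector>

static bool nfd_equal(const char *a, utf8proc_ssize_t alen,
                      const char *b, utf8proc_ssize_t blen)
{
    const utf8proc_option_t opts =
        (utf8proc_option_t)(UTF8PROC_DECOMPOSE | UTF8PROC_STABLE);

    auto decompose = [&](const char *s, utf8proc_ssize_t len,
                         std::vector<utf8proc_int32_t> &buf) -> bool {
        // First call reports the required buffer size; the second call fills
        // the buffer and (with a large-enough buffer) also performs the
        // canonical reordering of combining marks.
        utf8proc_ssize_t n = utf8proc_decompose(
            (const utf8proc_uint8_t *)s, len, nullptr, 0, opts);
        if (n < 0) return false;                // invalid input
        buf.resize((size_t)n);
        n = utf8proc_decompose((const utf8proc_uint8_t *)s, len,
                               buf.data(), n, opts);
        if (n < 0) return false;
        return true;
    };

    std::vector<utf8proc_int32_t> da, db;       // the two scratch buffers
    if (!decompose(a, alen, da) || !decompose(b, blen, db)) return false;
    return da == db;                            // direct comparison of NFD
}
```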
This is not obvious to me, because you don't have to sort within each combining class, and there are only a small number of possible combining classes. For example, the simplest allocation-free algorithm would be to scan through the combining characters, one combining class at a time, searching/comparing only the characters in that combining class before going back and searching the next combining class. This is a linear-time algorithm, with cost at worst proportional to the number of combining characters multiplied by the number of combining classes. And I suspect that you could be more efficient than this by storing a fixed amount of state, proportional to the number of combining classes. (Furthermore, in a typical string the number of combining characters per character is almost always going to be small, rarely more than two or three.) Getting this right is tricky, but will be worth it, and I don't want to lock us into a re-entrant API in the meantime. |
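To illustrate just the class-by-class comparison being described (not the full allocation-free algorithm, which would also need to decompose characters on the fly), here is a rough sketch that compares two runs of already-decomposed combining marks without sorting or allocating. runs_equivalent is a hypothetical helper, not part of the utf8proc API.

```cpp
// Sketch only: two runs of combining marks (code points with nonzero
// combining class) are canonically equivalent iff, for every combining
// class, the subsequence of marks with that class is identical in order.
// Cost is O(number of marks * number of combining classes), no allocation.
#include <utf8proc.h>
#include <cstddef>

static bool runs_equivalent(const utf8proc_int32_t *a, size_t alen,
                            const utf8proc_int32_t *b, size_t blen)
{
    if (alen != blen) return false;             // same number of marks needed
    for (int cc = 1; cc <= 255; ++cc) {         // canonical combining classes
        size_t i = 0, j = 0;
        for (;;) {
            // advance to the next mark of class cc in each run
            while (i < alen &&
                   utf8proc_get_property(a[i])->combining_class != cc) ++i;
            while (j < blen &&
                   utf8proc_get_property(b[j])->combining_class != cc) ++j;
            if (i == alen || j == blen) break;
            if (a[i] != b[j]) return false;     // same class, different mark
            ++i; ++j;
        }
        if ((i == alen) != (j == blen)) return false; // counts for cc differ
    }
    return true;
}
```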
If you think it's possible then I'll make another attempt at this, after researching a bit more about the combining class stuff. Probably won't get around to it until this weekend at the earliest, though. |
Based on the conversation in #101 I have ported the isequal_normalized function from Julia to C. This is my first attempt and I haven't updated tests/fuzzing yet; at this stage I mainly want feedback on the API and code style. Once that's settled I'll figure out the other bells and whistles.

Despite my earlier comment, I did actually manage to implement the algorithm without requiring a callback or any memory allocation in the middle of it; instead it's a re-entrant algorithm that updates the string pointers to avoid re-testing parts that have already been deemed equivalent. This also makes it easy to detect and fix or handle invalid UTF-8, since the string pointers will be pointing right at the problem on error. I also managed to make the algorithm able to run ahead and try out the rest of the string to determine how big the scratch buffers need to be when they aren't large enough, though it does complicate the implementation slightly.
My example/test code is currently written in C++; it looks similar to this:
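(The original snippet is not reproduced here; the following is only a hedged sketch of what such C++ test code might look like. The utf8proc_isequal_normalized signature, its return-value convention, and the thread-local scratch buffers below are assumptions for illustration, not necessarily what this PR implements.)

```cpp
#include <utf8proc.h>
#include <cstddef>
#include <vector>

// Assumed prototype (hypothetical): re-entrant, advances the string pointers
// past any prefix already found equivalent, and reports the scratch size it
// needs when the buffers are too small.
extern "C" utf8proc_ssize_t utf8proc_isequal_normalized(
    const utf8proc_uint8_t **str1, utf8proc_ssize_t *len1,
    const utf8proc_uint8_t **str2, utf8proc_ssize_t *len2,
    utf8proc_int32_t *scratch1, utf8proc_int32_t *scratch2,
    utf8proc_ssize_t scratch_size, utf8proc_option_t options,
    utf8proc_custom_func custom_func, void *custom_data);

// Thread-local scratch buffers as a fast path for typical strings; this is
// test/example convenience only, not something the API would require.
static thread_local std::vector<utf8proc_int32_t> scratch1(64), scratch2(64);

static bool isequal_normalized(const char *a, utf8proc_ssize_t alen,
                               const char *b, utf8proc_ssize_t blen)
{
    const utf8proc_uint8_t *s1 = reinterpret_cast<const utf8proc_uint8_t *>(a);
    const utf8proc_uint8_t *s2 = reinterpret_cast<const utf8proc_uint8_t *>(b);
    for (;;) {
        utf8proc_ssize_t ret = utf8proc_isequal_normalized(
            &s1, &alen, &s2, &blen,
            scratch1.data(), scratch2.data(),
            static_cast<utf8proc_ssize_t>(scratch1.size()),
            UTF8PROC_STABLE, nullptr, nullptr);
        if (ret == 1) return true;   // strings are canonically equivalent
        if (ret == 0) return false;  // strings differ
        if (ret < 0) return false;   // error, e.g. invalid UTF-8 at *s1/*s2
        // Assumed convention: ret > 1 means "call again with at least ret
        // code points of scratch"; grow the buffers and resume where the
        // updated pointers left off.
        scratch1.resize(static_cast<std::size_t>(ret));
        scratch2.resize(static_cast<std::size_t>(ret));
    }
}
```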
After other details are sorted out I can add a plain C example. Not sure if there should also be a utf8proc_isequal_normalized_simple that does all that and returns a simpler result. Also not sure if there should be a version that doesn't take the custom_func / custom_data.
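For discussion, one possible shape for such a wrapper; the name utf8proc_isequal_normalized_simple comes from the comment above, while the parameters and return convention here are assumptions rather than anything decided in this PR.

```cpp
#include <utf8proc.h>

// Hypothetical convenience wrapper, sketched for discussion only: it would
// hide the scratch-buffer management, the pointer-advancing re-entrant
// protocol, and the custom_func/custom_data hooks, returning 1 (equivalent),
// 0 (not equivalent), or a negative utf8proc error code.
extern "C" int utf8proc_isequal_normalized_simple(
    const utf8proc_uint8_t *str1, utf8proc_ssize_t len1,
    const utf8proc_uint8_t *str2, utf8proc_ssize_t len2,
    utf8proc_option_t options);
```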