
Add utf8proc_isequal_normalized #298

Draft: wants to merge 1 commit into master

Conversation

@LB-- LB-- commented Jul 20, 2025

Based on the conversation in #101, I have ported the isequal_normalized function from Julia to C. This is my first attempt, and I haven't updated the tests or fuzzing yet; at this stage I mainly want feedback on the API and code style. Once that's settled I'll figure out the other bells and whistles.

Despite my earlier comment, I did manage to implement the algorithm without requiring a callback or any memory allocation in the middle of it. Instead, it's a re-entrant algorithm that updates the string pointers to avoid re-testing parts that have already been deemed equivalent. This also makes it easy to detect and then fix or handle invalid UTF-8, since on error the string pointers will be pointing right at the problem. I also managed to make the algorithm run ahead and try out the rest of the string to determine how big the scratch buffers need to be when they aren't large enough, though that complicates the implementation slightly.

My example/test code is currently written in C++; it looks similar to this:

#include <algorithm>
#include <string_view>
#include <vector>
#include <gsl/gsl>
#include "utf8proc.h"

thread_local std::vector<utf8proc_int32_t> scratchSpace{};
[[nodiscard]] bool isEqualNormalized(std::string_view const a, std::string_view const b)
{
	if(std::empty(scratchSpace))
	{
		scratchSpace.resize(32);
	}
	utf8proc_processing_state_t sa{};
	utf8proc_processing_state_t sb{};
	sa.str = {reinterpret_cast<utf8proc_uint8_t const*>(std::data(a)), gsl::narrow<utf8proc_ssize_t>(std::size(a))};
	sb.str = {reinterpret_cast<utf8proc_uint8_t const*>(std::data(b)), gsl::narrow<utf8proc_ssize_t>(std::size(b))};
	while(true)
	{
		gsl::span const bufA(gsl::span(scratchSpace).subspan(0, std::size(scratchSpace)/2));
		gsl::span const bufB(gsl::span(scratchSpace).subspan(std::size(bufA), std::size(bufA)));
		sa.buf = {std::data(bufA), 0, gsl::narrow<utf8proc_ssize_t>(std::size(bufA))};
		sb.buf = {std::data(bufB), 0, gsl::narrow<utf8proc_ssize_t>(std::size(bufB))};
		utf8proc_isequal_normalized(&sa, &sb, utf8proc_option_t{}, nullptr, nullptr);
		if(!sa.buf.ptr || !sb.buf.ptr) // scratch buffers were too small
		{
			// len_used reports how much space each string needed; grow and retry
			scratchSpace.resize(gsl::narrow<size_t>(std::max(sa.buf.len_used, sb.buf.len_used))*2);
			continue;
		}
		if(sa.str.len == 0 && sb.str.len == 0) // both inputs fully consumed: equivalent
		{
			return true;
		}
		return false;
	}
}

After other details are sorted out I can add a plain C example. I'm not sure whether there should also be a utf8proc_isequal_normalized_simple that does all of that and returns a simpler result, or whether there should be a version that doesn't take the custom_func/custom_data.
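
For a rough idea in the meantime, here is an untested plain-C sketch of the same retry loop. The buf field names (ptr, len_used, len) are the ones visible in the C++ aggregate initializers above; I'm assuming the string pointer field is also called ptr, mirroring buf.

#include <stdbool.h>
#include <stdlib.h>
#include "utf8proc.h"

/* Sketch only: mirrors the C++ loop above, growing a heap scratch
   allocation whenever the re-entrant call reports it was too small. */
bool is_equal_normalized(const char *a, size_t alen, const char *b, size_t blen)
{
	utf8proc_ssize_t cap = 16; /* initial capacity per string, in code points */
	utf8proc_int32_t *scratch = malloc((size_t)(2 * cap) * sizeof *scratch);
	if(!scratch) return false;
	utf8proc_processing_state_t sa = {0}, sb = {0};
	sa.str.ptr = (const utf8proc_uint8_t *)a; sa.str.len = (utf8proc_ssize_t)alen;
	sb.str.ptr = (const utf8proc_uint8_t *)b; sb.str.len = (utf8proc_ssize_t)blen;
	bool equal = false;
	for(;;)
	{
		/* split the scratch allocation between the two strings */
		sa.buf.ptr = scratch;       sa.buf.len_used = 0; sa.buf.len = cap;
		sb.buf.ptr = scratch + cap; sb.buf.len_used = 0; sb.buf.len = cap;
		utf8proc_isequal_normalized(&sa, &sb, (utf8proc_option_t)0, NULL, NULL);
		if(!sa.buf.ptr || !sb.buf.ptr) /* scratch too small: grow and resume */
		{
			cap = sa.buf.len_used > sb.buf.len_used ? sa.buf.len_used : sb.buf.len_used;
			utf8proc_int32_t *grown = realloc(scratch, (size_t)(2 * cap) * sizeof *scratch);
			if(!grown) break;
			scratch = grown;
			continue;
		}
		equal = (sa.str.len == 0 && sb.str.len == 0); /* both fully consumed */
		break;
	}
	free(scratch);
	return equal;
}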

@eschnett (Collaborator)

A thread-local global variable... I see how this is attractive, and it would work in plain C/C++ code. I'm not quite sure it will work in Julia, or with other run-time systems that set up their own threading mechanisms. I say this because Julia has its own calling convention (its own ABI), and I am not sure that Julia's ccall ensures that thread-local variables in C are set up correctly. I assume they are (too much would break in glibc otherwise), but that's a point that needs to be evaluated carefully.

@LB-- (Author) commented Jul 20, 2025

I'm not sure what you mean; the thread-local is just my test/example code. It's not required at all to use the new function I added in the commit.

@eschnett (Collaborator)

Right, my bad.

@stevengj (Member)

I'm not a huge fan of the complication of a re-entrant API. I really think the algorithm can be re-worked to avoid allocations entirely by using a fixed-length (stack) array of the most recent character for each combining class.

@LB-- (Author) commented Jul 21, 2025

A version that doesn't require scratch buffers would be nice, but with my current understanding of the problem I don't see how that's possible without sacrificing performance. Valid Unicode sequences can contain arbitrarily many combining characters in any order, and when comparing two un-normalized UTF-8 strings you need some way to detect when they would normalize to the same thing even if the un-normalized versions are in different orders.

The Julia algorithm did this by decomposing into the scratch buffers and sorting the combining characters there so they'd be ready for simple direct comparison; when I ported it to C I copied the normalization sorting code that already existed elsewhere in the same source file. The worst case is that both strings are entirely one long sequence of combining characters in different orders, in which case the two scratch buffers used for sorting each need to be able to hold the entire source string decomposed to 32-bit code points.
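
For context, the sorting step in question is a stable adjacent-swap sort keyed on canonical combining class. This is a simplified sketch of that kind of reordering (not the actual code from the diff), using the public utf8proc_get_property API:

#include "utf8proc.h"

/* Simplified sketch of canonical reordering: stable-sort each run of
   combining marks (combining class > 0) by class via adjacent swaps,
   so marks sharing a class keep their relative order. */
static void canonical_reorder(utf8proc_int32_t *buf, utf8proc_ssize_t len)
{
	utf8proc_ssize_t pos = 0;
	while(pos + 1 < len)
	{
		utf8proc_propval_t c1 = utf8proc_get_property(buf[pos])->combining_class;
		utf8proc_propval_t c2 = utf8proc_get_property(buf[pos + 1])->combining_class;
		if(c1 > c2 && c2 > 0) /* pair is out of canonical order */
		{
			utf8proc_int32_t tmp = buf[pos];
			buf[pos] = buf[pos + 1];
			buf[pos + 1] = tmp;
			if(pos > 0) --pos; /* the swap may have disordered the previous pair */
		}
		else
		{
			++pos;
		}
	}
}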

Any solution that does not require scratch buffers will necessarily have terrible performance past some threshold for the number of combining characters in a row, which can lead to denial-of-service attacks. At least with the scratch-buffer approach, the maximum required buffer size can be computed in advance and rejected if it's undesirable, and then the only cost is sorting the combining sequences for comparison.

The scratch-buffer approach already lends itself well to using local stack buffers as a fast path for the vast majority of "normal" input strings that don't have too many combining characters in a row; the choice of how to handle the more unusual strings falls to the user, if they even want to bother with it at all. Hence my suggestion to possibly add a simple wrapper that makes that decision internally, so users who aren't picky can use the simplified wrapper until they need to optimize.
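
Concretely, the kind of wrapper I have in mind would look something like this (the name and stack size are placeholders, and it reuses the heap-growing loop I sketched in an earlier comment):

#include <stdbool.h>
#include "utf8proc.h"

/* Hypothetical convenience wrapper: a stack buffer covers typical
   strings; combining-heavy inputs fall back to the heap-growing
   is_equal_normalized sketch above, re-running from the start for
   simplicity. */
bool isequal_normalized_simple(const char *a, size_t alen, const char *b, size_t blen)
{
	utf8proc_int32_t stackbuf[128]; /* 64 code points of scratch per string */
	utf8proc_processing_state_t sa = {0}, sb = {0};
	sa.str.ptr = (const utf8proc_uint8_t *)a; sa.str.len = (utf8proc_ssize_t)alen;
	sb.str.ptr = (const utf8proc_uint8_t *)b; sb.str.len = (utf8proc_ssize_t)blen;
	sa.buf.ptr = stackbuf;      sa.buf.len_used = 0; sa.buf.len = 64;
	sb.buf.ptr = stackbuf + 64; sb.buf.len_used = 0; sb.buf.len = 64;
	utf8proc_isequal_normalized(&sa, &sb, (utf8proc_option_t)0, NULL, NULL);
	if(!sa.buf.ptr || !sb.buf.ptr) /* stack scratch too small: slow path */
		return is_equal_normalized(a, alen, b, blen);
	return sa.str.len == 0 && sb.str.len == 0;
}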

It's possible I'm overestimating the problem, though; I don't have a full grasp of Unicode combining classes and combining characters, so if you think it's simpler than this please do tell me.

@stevengj (Member) commented Jul 22, 2025

> Any solution that does not require scratch buffers will necessarily have terrible performance

This is not obvious to me, because you don't have to sort within each combining class (canonical reordering is a stable sort, so characters sharing a class keep their relative order), and there are only a small number of possible combining classes.

For example, the simplest allocation-free algorithm would be to scan through the combining characters, one combining class at a time, searching/comparing only characters in that combining class before going back and searching the next combining class. This is a linear-time algorithm, with cost at worst proportional to the number of combining characters multiplied by the number of combining classes.
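
Roughly like this (a sketch operating on already-decoded runs of combining marks, ignoring decomposition for brevity):

#include <stdbool.h>
#include "utf8proc.h"

/* Sketch of the class-at-a-time comparison: a and b are two runs of
   combining marks (combining class > 0) as decoded code points. No
   scratch buffers; cost is O(run length x number of classes). */
static bool equal_combining_runs(const utf8proc_int32_t *a, utf8proc_ssize_t alen,
                                 const utf8proc_int32_t *b, utf8proc_ssize_t blen)
{
	if(alen != blen) return false; /* canonical reordering preserves length */
	for(int ccc = 1; ccc <= 255; ++ccc) /* combining classes fit in one byte */
	{
		utf8proc_ssize_t i = 0, j = 0;
		for(;;)
		{
			/* advance each side to its next mark of class ccc */
			while(i < alen && utf8proc_get_property(a[i])->combining_class != ccc) ++i;
			while(j < blen && utf8proc_get_property(b[j])->combining_class != ccc) ++j;
			if(i == alen && j == blen) break;        /* class exhausted in both */
			if(i == alen || j == blen) return false; /* counts differ */
			if(a[i] != b[j]) return false;           /* same class, different mark */
			++i; ++j;
		}
	}
	return true;
}

(A decomposition-aware version would iterate the source strings directly, but the shape of the loop is the same.)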

And I suspect that you could be more efficient than this by storing a fixed amount of state, proportional to the number of combining classes, to reduce the number of passes over the string (basically, just caching where the first character in each combining class appears, if at all).

(Furthermore, in a typical string the number of combining characters per character is almost always going to be small, rarely more than two or three.)

Getting this right is tricky, but will be worth it, and I don't want to lock us into a re-entrant API in the meantime.

@LB-- (Author) commented Jul 23, 2025

If you think it's possible then I'll work on another attempt at this, after researching the combining-class machinery a bit more. I probably won't get around to it until this weekend at the earliest, though.
