
Add utf8proc_isequal_normalized #298

Draft: wants to merge 1 commit into master

Conversation

@LB-- LB-- commented Jul 20, 2025

Based on the conversation in #101, I have ported the isequal_normalized function from Julia to C. This is my first attempt, and I haven't updated the tests or fuzzing yet; at this stage I mainly want feedback on the API and code style. Once that's settled I'll figure out the other bells and whistles.

Despite my earlier comment, I did manage to implement the algorithm without requiring a callback or any memory allocation in the middle of it. Instead, it's a re-entrant algorithm that updates the string pointers to avoid re-testing parts that have already been deemed equivalent. This also makes it easy to detect and then fix or handle invalid UTF-8, since on error the string pointers will be pointing right at the problem. I also managed to make the algorithm run ahead and try out the rest of the string to determine how big the scratch buffers need to be when they aren't large enough, though that complicates the implementation slightly.

My example/test code is currently written in C++; it looks similar to this:

#include <algorithm>
#include <string_view>
#include <vector>
#include <gsl/gsl>
#include "utf8proc.h"

thread_local std::vector<utf8proc_int32_t> scratchSpace{};
[[nodiscard]] bool isEqualNormalized(std::string_view const a, std::string_view const b)
{
	if(std::empty(scratchSpace))
	{
		scratchSpace.resize(32);
	}
	utf8proc_processing_state_t sa{};
	utf8proc_processing_state_t sb{};
	sa.str = {reinterpret_cast<utf8proc_uint8_t const*>(std::data(a)), gsl::narrow<utf8proc_ssize_t>(std::size(a))};
	sb.str = {reinterpret_cast<utf8proc_uint8_t const*>(std::data(b)), gsl::narrow<utf8proc_ssize_t>(std::size(b))};
	while(true)
	{
		gsl::span const bufA(gsl::span(scratchSpace).subspan(0, std::size(scratchSpace)/2));
		gsl::span const bufB(gsl::span(scratchSpace).subspan(std::size(bufA), std::size(bufA)));
		sa.buf = {std::data(bufA), 0, gsl::narrow<utf8proc_ssize_t>(std::size(bufA))};
		sb.buf = {std::data(bufB), 0, gsl::narrow<utf8proc_ssize_t>(std::size(bufB))};
		utf8proc_isequal_normalized(&sa, &sb, utf8proc_option_t{}, nullptr, nullptr);
		if(!sa.buf.ptr || !sb.buf.ptr) // scratch buffers were too small
		{
			// len_used reports how much space each string needed; grow and retry
			scratchSpace.resize(gsl::narrow<size_t>(std::max(sa.buf.len_used, sb.buf.len_used))*2);
			continue;
		}
		if(sa.str.len == 0 && sb.str.len == 0) // both inputs fully consumed: equivalent
		{
			return true;
		}
		return false;
	}
}

After other details are sorted out I can add a plain C example. I'm not sure whether there should also be a utf8proc_isequal_normalized_simple that does all of that and returns a simpler result, or whether there should be a version that doesn't take the custom_func/custom_data.
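
For a rough idea in the meantime, here is an untested plain-C sketch of the same retry loop. The buf field names (ptr, len_used, len) are the ones visible in the C++ aggregate initializers above; I'm assuming the string pointer field is also called ptr, mirroring buf.

#include <stdbool.h>
#include <stdlib.h>
#include "utf8proc.h"

/* Sketch only: mirrors the C++ loop above, growing a heap scratch
   allocation whenever the re-entrant call reports it was too small. */
bool is_equal_normalized(const char *a, size_t alen, const char *b, size_t blen)
{
	utf8proc_ssize_t cap = 16; /* initial capacity per string, in code points */
	utf8proc_int32_t *scratch = malloc((size_t)(2 * cap) * sizeof *scratch);
	if(!scratch) return false;
	utf8proc_processing_state_t sa = {0}, sb = {0};
	sa.str.ptr = (const utf8proc_uint8_t *)a; sa.str.len = (utf8proc_ssize_t)alen;
	sb.str.ptr = (const utf8proc_uint8_t *)b; sb.str.len = (utf8proc_ssize_t)blen;
	bool equal = false;
	for(;;)
	{
		/* split the scratch allocation between the two strings */
		sa.buf.ptr = scratch;       sa.buf.len_used = 0; sa.buf.len = cap;
		sb.buf.ptr = scratch + cap; sb.buf.len_used = 0; sb.buf.len = cap;
		utf8proc_isequal_normalized(&sa, &sb, (utf8proc_option_t)0, NULL, NULL);
		if(!sa.buf.ptr || !sb.buf.ptr) /* scratch too small: grow and resume */
		{
			cap = sa.buf.len_used > sb.buf.len_used ? sa.buf.len_used : sb.buf.len_used;
			utf8proc_int32_t *grown = realloc(scratch, (size_t)(2 * cap) * sizeof *scratch);
			if(!grown) break;
			scratch = grown;
			continue;
		}
		equal = (sa.str.len == 0 && sb.str.len == 0); /* both fully consumed */
		break;
	}
	free(scratch);
	return equal;
}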

@eschnett (Collaborator)

A thread-local global variable... I see how this is attractive, and it would work in plain C/C++ code. I'm not quite sure it will work in Julia, or with other run-time systems that set up their own threading mechanisms. I say this because Julia has its own calling convention (its own ABI), and I am not sure that Julia's ccall ensures that thread-local variables in C are set up correctly. I assume they are (too much would break in glibc otherwise), but that's a point that needs to be evaluated carefully.

@LB-- (Author) commented Jul 20, 2025

I'm not sure what you mean; the thread-local is just my test/example code. It's not required at all to use the new function I added in the commit.

@eschnett (Collaborator)

Right, my bad.

@stevengj (Member)

I'm not a huge fan of the complication of a re-entrant API. I really think the algorithm can be re-worked to avoid allocations entirely by using a fixed-length (stack) array of the most recent character for each combining class.

@LB-- (Author) commented Jul 21, 2025

A version that doesn't require scratch buffers would be nice, but with my current understanding of the problem I don't see how that's possible without sacrificing performance. Valid Unicode sequences can contain arbitrarily many combining characters in any order, and when comparing two un-normalized UTF-8 strings you need some way to detect when they would normalize to the same thing even if the un-normalized versions are in different orders.

The Julia algorithm did this by decomposing into the scratch buffers and sorting the combining characters there so they'd be ready for simple direct comparison; when I ported it to C I copied the normalization sorting code that already existed elsewhere in the same source file. The worst case is that both strings are entirely one long sequence of combining characters in different orders, in which case the two scratch buffers used for sorting each need to be able to hold the entire source string decomposed to 32-bit code points.
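
For context, the sorting step in question is a stable adjacent-swap sort keyed on canonical combining class. This is a simplified sketch of that kind of reordering (not the actual code from the diff), using the public utf8proc_get_property API:

#include "utf8proc.h"

/* Simplified sketch of canonical reordering: stable-sort each run of
   combining marks (combining class > 0) by class via adjacent swaps,
   so marks sharing a class keep their relative order. */
static void canonical_reorder(utf8proc_int32_t *buf, utf8proc_ssize_t len)
{
	utf8proc_ssize_t pos = 0;
	while(pos + 1 < len)
	{
		utf8proc_propval_t c1 = utf8proc_get_property(buf[pos])->combining_class;
		utf8proc_propval_t c2 = utf8proc_get_property(buf[pos + 1])->combining_class;
		if(c1 > c2 && c2 > 0) /* pair is out of canonical order */
		{
			utf8proc_int32_t tmp = buf[pos];
			buf[pos] = buf[pos + 1];
			buf[pos + 1] = tmp;
			if(pos > 0) --pos; /* the swap may have disordered the previous pair */
		}
		else
		{
			++pos;
		}
	}
}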

Any solution that does not require scratch buffers will necessarily have terrible performance past some threshold for the number of combining characters in a row, which can lead to denial-of-service attacks. At least with the scratch-buffer approach, the maximum required buffer size can be computed in advance and rejected if it's undesirable, and then the only cost is sorting the combining sequences for comparison.

The scratch-buffer approach already lends itself well to using local stack buffers as a fast path for the vast majority of "normal" input strings that don't have too many combining characters in a row; the choice of how to handle the more unusual strings falls to the user, if they even want to bother with it at all. Hence my suggestion to possibly add a simple wrapper that makes that decision internally, so users who aren't picky can use the simplified wrapper until they need to optimize.
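
Concretely, the kind of wrapper I have in mind would look something like this (the name and stack size are placeholders, and it reuses the heap-growing loop I sketched in an earlier comment):

#include <stdbool.h>
#include "utf8proc.h"

/* Hypothetical convenience wrapper: a stack buffer covers typical
   strings; combining-heavy inputs fall back to the heap-growing
   is_equal_normalized sketch above, re-running from the start for
   simplicity. */
bool isequal_normalized_simple(const char *a, size_t alen, const char *b, size_t blen)
{
	utf8proc_int32_t stackbuf[128]; /* 64 code points of scratch per string */
	utf8proc_processing_state_t sa = {0}, sb = {0};
	sa.str.ptr = (const utf8proc_uint8_t *)a; sa.str.len = (utf8proc_ssize_t)alen;
	sb.str.ptr = (const utf8proc_uint8_t *)b; sb.str.len = (utf8proc_ssize_t)blen;
	sa.buf.ptr = stackbuf;      sa.buf.len_used = 0; sa.buf.len = 64;
	sb.buf.ptr = stackbuf + 64; sb.buf.len_used = 0; sb.buf.len = 64;
	utf8proc_isequal_normalized(&sa, &sb, (utf8proc_option_t)0, NULL, NULL);
	if(!sa.buf.ptr || !sb.buf.ptr) /* stack scratch too small: slow path */
		return is_equal_normalized(a, alen, b, blen);
	return sa.str.len == 0 && sb.str.len == 0;
}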

It's possible I'm overestimating the problem, though; I don't have a full grasp of Unicode combining classes and combining characters, so if you think it's simpler than this please do tell me.

@stevengj (Member) commented Jul 22, 2025

> Any solution that does not require scratch buffers will necessarily have terrible performance

This is not obvious to me, because you don't have to sort within each combining class (canonical reordering is a stable sort, so characters sharing a class keep their relative order), and there are only a small number of possible combining classes.

For example, the simplest allocation-free algorithm would be to scan through the combining characters, one combining class at a time, searching/comparing only characters in that combining class before going back and searching the next combining class. This is a linear-time algorithm, with cost at worst proportional to the number of combining characters multiplied by the number of combining classes.
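
Roughly like this (a sketch operating on already-decoded runs of combining marks, ignoring decomposition for brevity):

#include <stdbool.h>
#include "utf8proc.h"

/* Sketch of the class-at-a-time comparison: a and b are two runs of
   combining marks (combining class > 0) as decoded code points. No
   scratch buffers; cost is O(run length x number of classes). */
static bool equal_combining_runs(const utf8proc_int32_t *a, utf8proc_ssize_t alen,
                                 const utf8proc_int32_t *b, utf8proc_ssize_t blen)
{
	if(alen != blen) return false; /* canonical reordering preserves length */
	for(int ccc = 1; ccc <= 255; ++ccc) /* combining classes fit in one byte */
	{
		utf8proc_ssize_t i = 0, j = 0;
		for(;;)
		{
			/* advance each side to its next mark of class ccc */
			while(i < alen && utf8proc_get_property(a[i])->combining_class != ccc) ++i;
			while(j < blen && utf8proc_get_property(b[j])->combining_class != ccc) ++j;
			if(i == alen && j == blen) break;        /* class exhausted in both */
			if(i == alen || j == blen) return false; /* counts differ */
			if(a[i] != b[j]) return false;           /* same class, different mark */
			++i; ++j;
		}
	}
	return true;
}

(A decomposition-aware version would iterate the source strings directly, but the shape of the loop is the same.)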

And I suspect that you could be more efficient than this by storing a fixed amount of state, proportional to the number of combining classes, to reduce the number of passes over the string (basically, just caching where the first character in each combining class appears, if at all).

(Furthermore, in a typical string the number of combining characters per character is almost always going to be small, rarely more than two or three.)

Getting this right is tricky, but will be worth it, and I don't want to lock us into a re-entrant API in the meantime.

@LB-- (Author) commented Jul 23, 2025

If you think it's possible then I'll work on another attempt at this, after researching the combining-class machinery a bit more. I probably won't get around to it until this weekend at the earliest, though.
