
Commit 9c9b6f6

Fix use with malformed Char
The previous implementation assumed that every `Char` is well-formed, which is of course not guaranteed to be the case (and which is also correctly handled by the existing implementation). On top of that, the new version is even faster, since counting the number of trailing zeros has hardware support on a wide range of architectures.
1 parent 3e9520a commit 9c9b6f6
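
The trailing-zeros idea from the commit message can be tried in isolation. The following is a minimal sketch, not the code in Base: `ncodeunits_tz` and `count_nonzero_bytes` are hypothetical names chosen here to avoid clashing with `Base.ncodeunits`, and the malformed `Char` is built by hand with `reinterpret` purely for illustration. It relies on a `Char` storing its bytes left-aligned in a `UInt32`, so unused bytes show up as trailing zeros.

```julia
# Hypothetical standalone copy of the new logic (not Base.ncodeunits itself).
function ncodeunits_tz(c::Char)
    u = reinterpret(UInt32, c)
    # Unused bytes sit at the low end, so whole trailing zero bytes are unused.
    n_nonzero_bytes = sizeof(UInt32) - div(trailing_zeros(u), 0x8)
    # '\0' is all zeros but still occupies one code unit.
    n_nonzero_bytes + iszero(u)
end

# Hypothetical helper: counts non-zero bytes anywhere in the Char,
# which is the assumption the removed code relied on.
count_nonzero_bytes(c::Char) =
    count(i -> !iszero((reinterpret(UInt32, c) >> (8i)) & 0xff), 0:3)

# A hand-built malformed Char with a zero byte between two non-zero bytes.
malformed = reinterpret(Char, 0xe2008200)

for c in ('a', 'é', '€', '😀', '\0', malformed)
    println(repr(reinterpret(UInt32, c)), " => ",
            ncodeunits_tz(c), " (non-zero bytes: ", count_nonzero_bytes(c), ")")
end
```

For well-formed characters other than `'\0'` the two counts agree. For the hand-built value they differ, because the interior zero byte is skipped when only non-zero bytes are counted.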

1 file changed: +7 -12 lines changed
base/char.jl

Lines changed: 7 additions & 12 deletions
@@ -63,18 +63,13 @@ to an output stream, or `ncodeunits(string(c))` but computed efficiently.
     using `ncodeunits(string(c))`.
 """
 function ncodeunits(c::Char)
-    # All Char are 4 byte wide, and since unicode encoding
-    # doesn't have null bytes (except for \0), we can just
-    # count non-zero bytes
-    char_data = reinterpret(UInt32, c)
-    mask = 0xff % UInt32
-    nbytes = !iszero(char_data & mask)
-    Base.Cartesian.@nexprs 3 i -> begin
-        m <<= 0x8
-        nbytes += !iszero(char_data & mask)
-    end
-    # We have to account for `\0`, which is encoded as all zeros
-    nbytes + iszero(uc)
+    u = reinterpret(UInt32, c)
+
+    # We care about how many trailing bytes are all zero
+    n_nonzero_bytes = sizeof(UInt32) - div(trailing_zeros(u), 0x8)
+
+    # Take care of '\0', which has an all-zero bitpattern
+    n_nonzero_bytes + iszero(u)
 end
 
 """
