Commit 0487d78

uros-db authored and yaooqinn committed
[SPARK-48748][SQL] Cache numChars in UTF8String
### What changes were proposed in this pull request?
Cache `numChars` value in a thread-safe way.

### Why are the changes needed?
Faster access to `numChars()` method, which currently requires entire UTF8String scan every time.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47142 from uros-db/cache-numchars.

Authored-by: Uros Bojanic <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
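The caching idea described in the commit message can be illustrated with a minimal, standalone sketch. This is not the Spark class: `CachedCodePointCount`, its `scans` counter, and the byte-array backing store are hypothetical names introduced here for illustration. The sketch assumes valid UTF-8 input and counts code points by skipping continuation bytes, which differs from the helper Spark uses. It also reads the volatile field once into a local variable, a small variation on the committed code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of lazily caching a code point count.
final class CachedCodePointCount {
    private final byte[] bytes;

    // -1 means "not computed yet", mirroring the sentinel used in the commit.
    private volatile int numChars = -1;

    // Demo-only counter of full scans; not thread-safe, not part of the technique.
    int scans = 0;

    CachedCodePointCount(String s) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
    }

    int numChars() {
        int cached = numChars;          // single volatile read
        if (cached == -1) {
            cached = countCodePoints(); // linear scan, done on first access
            numChars = cached;          // publish the result to other threads
        }
        return cached;
    }

    // Assumes valid UTF-8: every byte that is not a continuation byte
    // (bit pattern 10xxxxxx) starts a new code point.
    private int countCodePoints() {
        scans++;
        int len = 0;
        for (byte b : bytes) {
            if ((b & 0xC0) != 0x80) len++;
        }
        return len;
    }

    public static void main(String[] args) {
        CachedCodePointCount s = new CachedCodePointCount("h\u00e9llo");
        System.out.println(s.numChars()); // 5
        System.out.println(s.numChars()); // 5, served from the cache
        System.out.println(s.scans);      // 1
    }
}
```

No lock is taken. If two threads race on the first call, both may run the scan, but they compute the same value, so the duplicate write is harmless. The `volatile` modifier only ensures the cached value becomes visible to other threads once written.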
1 parent f49418b commit 0487d78

1 file changed, +11 -0 lines changed

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

Lines changed: 11 additions & 0 deletions
@@ -59,6 +59,7 @@ public final class UTF8String implements Comparable<UTF8String>, Externalizable,
   private Object base;
   private long offset;
   private int numBytes;
+  private volatile int numChars = -1;

   public Object getBaseObject() { return base; }
   public long getBaseOffset() { return offset; }
@@ -254,6 +255,16 @@ public int numBytes() {
    * Returns the number of code points in it.
    */
   public int numChars() {
+    if (numChars == -1) numChars = getNumChars();
+    return numChars;
+  }
+
+  /**
+   * Private helper method to calculate the number of code points in the UTF-8 string. Counting
+   * the code points is a linear time operation, as we need to scan the entire UTF-8 string.
+   * Hence, this method should generally only be called once for non-empty UTF-8 strings.
+   */
+  private int getNumChars() {
     int len = 0;
     for (int i = 0; i < numBytes; i += numBytesForFirstByte(getByte(i))) {
       len += 1;
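The loop in `getNumChars()` advances by the encoded length of each code point, which is why the count is linear in `numBytes`. A standalone sketch of that scan follows. `LeadByteScan` and `bytesForLeadByte` are hypothetical stand-ins written for this illustration, not Spark's `numBytesForFirstByte`, whose implementation is not shown in this diff. The sketch assumes the input is valid UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of counting code points by stepping over lead bytes.
final class LeadByteScan {

    // Encoded length implied by a UTF-8 lead byte (assumes valid input).
    static int bytesForLeadByte(byte b) {
        int v = b & 0xFF;
        if (v < 0x80) return 1;   // 0xxxxxxx: ASCII
        if (v >= 0xF0) return 4;  // 11110xxx
        if (v >= 0xE0) return 3;  // 1110xxxx
        if (v >= 0xC0) return 2;  // 110xxxxx
        return 1;                 // stray continuation byte: step one byte so the loop advances
    }

    // One loop iteration per code point; i jumps to the next lead byte.
    static int countCodePoints(byte[] utf8) {
        int len = 0;
        for (int i = 0; i < utf8.length; i += bytesForLeadByte(utf8[i])) {
            len += 1;
        }
        return len;
    }

    public static void main(String[] args) {
        // 1-, 2-, 3-, and 4-byte code points: 10 bytes in total.
        byte[] utf8 = "a\u00e9\u20ac\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);           // 10
        System.out.println(countCodePoints(utf8)); // 4
    }
}
```

Because the result depends only on the bytes, repeating the scan always yields the same value, which is what makes caching it in a field safe.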
