There are a number of operations on the built-in string slices and `String` that are specified in terms of length or a number of otherwise unspecified collection items. Notable examples are the `len` method of the `Collection` implementation and `String`'s `truncate` method. The documentation of these methods does not say whether the length is in bytes or Unicode code points; in practice it's bytes. This is hinted at, but not said explicitly, in the general description of `str`.
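To make the mismatch concrete, here is a minimal sketch comparing the two notions of length. The helper names `byte_len` and `char_len` are made up for illustration and are not part of the standard library; the sketch assumes a Rust version where `str::len` and `str::chars` are available.

```rust
/// Length of the UTF-8 encoding in bytes; this is what `len` reports.
/// (`byte_len` is a hypothetical helper name used only for this example.)
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Number of Unicode scalar values (`char`s) in the string.
/// (`char_len` is likewise a hypothetical helper name.)
fn char_len(s: &str) -> usize {
    s.chars().count()
}
```

For an ASCII-only string the two functions agree, which is why code built on the wrong assumption can pass every test until it meets non-ASCII input. For `"na\u{ef}ve"` (with a precomposed U+00EF), `byte_len` gives 6 while `char_len` gives 5.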
By contrast, many popular programming environments, such as Java, C#/CLI, and Qt, define similar string operations in terms of UTF-16 code units (UTF-16 is not free of the same issues, but it still generally works as wide-char Unicode for the masses, unless you are into dead or obscure scripts or emoji chat software). These operations are so familiar that many people would use them without looking up their precise definition, and in the case of Rust, they may end up being wrong even if the documentation gave every possible warning. Their code will compile and work until it meets its first non-ASCII string, which in some sad cases might not happen until after shipping. Double grief if a misinterpreted value passes into unsafe code and causes hard-to-debug trouble, putting a stain on the image of Rust as a safe language for the developers concerned.
I think this problem could best be mitigated by careful API design. I've got the following suggestions:
- Deprecate `truncate` in favor of `truncate_bytes`, and add `truncate_chars` alongside.
- Move `len` out of the `Collection` trait into a new subtrait `SizedCollection`, which standard strings will not implement. The byte length for strings will always be one method call away behind `as_bytes()`, which makes the intent explicit. This trait split will also allow implementations of linked lists without the explicitly maintained size counter, and maybe enable some clever lock-free concurrent collections in the future.
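A rough sketch of how these two suggestions might look follows. Everything here is an assumption made for illustration: `truncate_chars` is written as a free function rather than a `String` method, and the `Collection` / `SizedCollection` traits are simplified stand-ins (with only `is_empty` and `len`), not the actual library definitions. Note that in current Rust, `String::truncate` takes a byte length and panics if it does not fall on a `char` boundary.

```rust
/// Hypothetical `truncate_chars`: keep at most the first `n` chars of `s`.
fn truncate_chars(s: &mut String, n: usize) {
    // `char_indices().nth(n)` yields the byte offset at which the char
    // following the first `n` chars begins, if there is one.
    if let Some((byte_idx, _)) = s.char_indices().nth(n) {
        // The offset is a char boundary by construction, so this is safe.
        s.truncate(byte_idx);
    }
    // Otherwise the string has `n` chars or fewer and is left unchanged.
}

/// Simplified stand-in for the base trait: no notion of length.
trait Collection {
    fn is_empty(&self) -> bool;
}

/// Hypothetical subtrait for collections with a well-defined item count.
trait SizedCollection: Collection {
    fn len(&self) -> usize;
}

impl<T> Collection for Vec<T> {
    fn is_empty(&self) -> bool {
        Vec::is_empty(self)
    }
}

impl<T> SizedCollection for Vec<T> {
    fn len(&self) -> usize {
        Vec::len(self)
    }
}

// Strings are collections, but deliberately do NOT implement
// `SizedCollection`; callers ask for `as_bytes().len()` explicitly.
impl Collection for String {
    fn is_empty(&self) -> bool {
        String::is_empty(self)
    }
}
```

Under this sketch, generic code that needs a count must bound on `SizedCollection`, so passing a `String` to it fails at compile time, while the byte length remains available through `s.as_bytes().len()`.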