Dangerously vague meaning of .len() and .truncate() on strings #350

@mzabaluev

A number of operations on the built-in string slices and String are specified in terms of a length, or a count of otherwise unspecified collection items.
Notable examples are the implementation of the Collection method len and String's method truncate. The documentation of these methods does not say whether the length is measured in bytes or in Unicode code points; in practice it is bytes. This is hinted at, but not stated explicitly, in the general description of str.
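
To make the ambiguity concrete, here is a minimal sketch of three different "lengths" of the same string. It assumes a recent stable Rust toolchain (where the Collection trait no longer exists and len is an inherent method on str); the helper names byte_len, char_len and utf16_len are made up for illustration and are not part of the standard library.

```rust
/// Length in bytes of the UTF-8 encoding. This is what `str::len` reports.
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Number of Unicode scalar values (`char`s) in the string.
fn char_len(s: &str) -> usize {
    s.chars().count()
}

/// Number of UTF-16 code units, which is what a UTF-16 based
/// string type would typically report as its length.
fn utf16_len(s: &str) -> usize {
    s.encode_utf16().count()
}

/// Hypothetical helper that bundles the three counts for comparison.
fn lengths(s: &str) -> (usize, usize, usize) {
    (byte_len(s), char_len(s), utf16_len(s))
}
```

For "h\u{e9}llo" (an "e" with an acute accent in second position), lengths returns (6, 5, 5); for the single emoji "\u{1F600}" it returns (4, 1, 2). Code that assumes any one of these counts where another is meant works on ASCII input and breaks on the first string where they differ.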

By contrast, many popular programming environments, such as Java, C#/CLI, and Qt, define similar string operations in terms of UTF-16 code units (UTF-16 is not free of the same issues, but it still generally works as wide-char Unicode for the masses, unless you are into dead or obscure scripts or emoji chat software). These operations are so familiar that many people will use them without looking up their precise definition, and in the case of Rust they may end up being wrong even if the documentation gave every possible warning. Their code will compile and work until it meets its first non-ASCII string, which in some sad cases might not happen until after shipping. The grief doubles if a misinterpreted value passes into unsafe code and causes hard-to-debug trouble, putting a stain on the image of Rust as a safe language for the developers concerned.

I think this problem could best be mitigated by careful API design. I've got the following suggestions:

  • Deprecate truncate in favor of truncate_bytes, and add truncate_chars alongside.
  • Move len out of the Collection trait into a new subtrait SizedCollection, which standard strings will not implement. The byte length of a string will always be one method call away, behind as_bytes(), which makes the intent explicit. This trait split will also allow implementations of linked lists without an explicitly maintained size counter, and may enable some clever lock-free concurrent collections in the future.
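
The first suggestion could be sketched roughly as follows. This is only an illustration under assumptions: it targets a recent stable Rust toolchain, and TruncateExt is a hypothetical extension trait invented here so the example compiles on its own. The method names truncate_bytes and truncate_chars come from the proposal above; they are not part of std.

```rust
/// Hypothetical extension trait carrying the two proposed methods.
trait TruncateExt {
    /// Shorten the string to `new_len` bytes.
    fn truncate_bytes(&mut self, new_len: usize);
    /// Shorten the string to at most `max_chars` Unicode scalar values.
    fn truncate_chars(&mut self, max_chars: usize);
}

impl TruncateExt for String {
    fn truncate_bytes(&mut self, new_len: usize) {
        // Delegates to `String::truncate`, which in current Rust counts bytes,
        // does nothing if `new_len` exceeds the current length, and panics
        // if `new_len` does not fall on a char boundary.
        self.truncate(new_len);
    }

    fn truncate_chars(&mut self, max_chars: usize) {
        // Byte offset at which the char with index `max_chars` starts,
        // or `None` if the string has `max_chars` chars or fewer.
        let cut = self.char_indices().nth(max_chars).map(|(idx, _)| idx);
        if let Some(idx) = cut {
            // `idx` comes from `char_indices`, so it is always a char boundary.
            self.truncate(idx);
        }
    }
}
```

With the value "h\u{e9}llo", truncate_chars(2) and truncate_bytes(3) both leave "h\u{e9}", while truncate_bytes(2) would panic because byte offset 2 falls in the middle of the two-byte character. The explicit names make it visible at the call site which unit is meant. The byte length itself is already available as s.as_bytes().len(), as the second suggestion notes.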
