Commit 2bec76d

RFC: Statically enforce Unicode in std::fmt
Statically enforce that the `std::fmt` module can only create valid UTF-8 data by removing the arbitrary `write` method in favor of a `write_str` method.
1 parent 44e3043 commit 2bec76d

1 file changed

+105
-0
lines changed

text/0000-fmt-text-writer.md

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
- Start Date: (fill me in with today's date, YYYY-MM-DD)
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary

Statically enforce that the `std::fmt` module can only create valid UTF-8 data
by removing the arbitrary `write` method in favor of a `write_str` method.

# Motivation

Today it is conventionally true that the output from macros like `format!`, as
well as from implementations of `Show`, is valid UTF-8 data. This is not
statically enforced, however. As a consequence, the `.to_string()` method must
perform a `str::is_utf8` check before returning a `String`.

This `str::is_utf8` check is currently [one of the most costly parts][bench1]
of the formatting subsystem, even though it is normally redundant.

[bench1]: https://gist.github.com/alexcrichton/162a5f8f93062800c914

Additionally, it is possible to statically enforce the convention that `Show`
only deals with valid Unicode, and as such the possibility of doing so should be
explored.

# Detailed design

The `std::fmt::FormatWriter` trait will be redefined as:

```rust
pub trait Writer {
    fn write_str(&mut self, data: &str) -> Result;
    fn write_char(&mut self, ch: char) -> Result {
        // default method calling write_str
    }
    fn write_fmt(&mut self, f: &Arguments) -> Result {
        // default method calling fmt::write
    }
}
```

There are a few major differences with today's trait:

* The name has changed to `Writer` in accordance with [RFC 356][rfc356].
* The `write` method, which took `&[u8]`, has been replaced by `write_str`,
  which takes `&str`.
* A `write_char` method has been added.

[rfc356]: https://github.com/rust-lang/rfcs/blob/master/text/0356-no-module-prefixes.md

The corresponding methods on the `Formatter` structure will also be altered to
respect these signatures.

The key idea behind this API is that the `Writer` trait only operates on Unicode
data. The `write_str` method is a static enforcement of UTF-8-ness, and
`write_char` follows suit, as a `char` is always a valid Unicode scalar value.

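As a rough illustration of this shape, the sketch below implements a text-only sink in current Rust syntax. It assumes today's `std::fmt::Write` trait as a stand-in for the proposed `fmt::Writer` (it has the same three methods); `ChunkLog` and `demo` are hypothetical names used only for this example.

```rust
use std::fmt::{self, Write};

/// Hypothetical sink that records every chunk of text it receives.
pub struct ChunkLog {
    pub chunks: Vec<String>,
}

impl Write for ChunkLog {
    // Only `write_str` is required; `write_char` and `write_fmt` are
    // default methods that end up calling it.
    fn write_str(&mut self, data: &str) -> fmt::Result {
        self.chunks.push(data.to_owned());
        Ok(())
    }
}

/// Writes through all three methods and returns the concatenated text.
pub fn demo() -> Result<String, fmt::Error> {
    let mut log = ChunkLog { chunks: Vec::new() };
    log.write_str("id=")?;
    log.write_char('é')?; // default method, forwards to `write_str`
    write!(log, "{}", 42)?; // default `write_fmt`
    Ok(log.chunks.concat())
}
```

Because the sink only ever receives `&str` and `char` values, nothing it stores needs to be validated afterwards.
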
With this trait definition, the implementation of `Writer` for `Vec<u8>` will be
removed (note this is *not* the `io::Writer` implementation) in favor of an
implementation directly on `String`. The `.to_string()` method will change
accordingly (as well as `format!`) to write directly into a `String`, bypassing
all UTF-8 validity checks afterwards.

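To make the difference concrete, the sketch below contrasts the two paths in current Rust syntax. It assumes today's `std::io::Write` and `std::fmt::Write` as stand-ins for `io::Writer` and the proposed `fmt::Writer`; `via_bytes` and `via_string` are hypothetical names and are not taken from the implementation branch.

```rust
use std::fmt::Display;

/// Byte-oriented path: format into a `Vec<u8>`, then validate the bytes
/// before they can become a `String`.
pub fn via_bytes(value: &dyn Display) -> String {
    use std::io::Write;
    let mut bytes: Vec<u8> = Vec::new();
    write!(bytes, "{}", value).expect("writing to a Vec<u8> failed");
    // This is the UTF-8 check that the text-only design makes unnecessary.
    String::from_utf8(bytes).expect("formatting produced invalid UTF-8")
}

/// Text-oriented path: format straight into a `String`, with no
/// validation pass afterwards.
pub fn via_string(value: &dyn Display) -> String {
    use std::fmt::Write;
    let mut out = String::new();
    write!(out, "{}", value).expect("a formatting impl returned an error");
    out
}
```

Both functions produce the same text; the second simply never has bytes of unknown validity to check.
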
This change [has been implemented][branch] in a branch of mine, and as expected
the [benchmark numbers have improved][bench2] for the much larger texts.

[branch]: https://github.com/alexcrichton/rust/tree/fmt-text
[bench2]: https://gist.github.com/alexcrichton/182ccef5d8c2583a2423

Note that a key point of the changes implemented is that a call to `write!` into
an arbitrary `io::Writer` is *still valid* as it's still just a sink for bytes.
The changes outlined in this RFC will only affect `Show` and other formatting
trait implementations. As can be seen from the sample implementation, the
fallout is quite minimal with respect to the rest of the standard library.

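A small sketch of that split, in current Rust syntax: the formatting impl only ever emits text, while the same value can still be written into a byte sink with `write!`. It assumes today's `fmt::Display` standing in for `Show` and `std::io::Write` standing in for `io::Writer`; `Point` and `to_bytes` are hypothetical.

```rust
use std::fmt;

/// Hypothetical type with a text-only formatting impl.
pub struct Point {
    pub x: i32,
    pub y: i32,
}

impl fmt::Display for Point {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The formatter accepts `&str` data and format arguments only.
        f.write_str("(")?;
        write!(f, "{}, {}", self.x, self.y)?;
        f.write_str(")")
    }
}

/// Writing the formatted value into a byte sink still works unchanged.
pub fn to_bytes(p: &Point) -> std::io::Result<Vec<u8>> {
    use std::io::Write;
    let mut sink: Vec<u8> = Vec::new();
    write!(sink, "{}", p)?;
    Ok(sink)
}
```
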
# Drawbacks

A version of this RFC has been [previously postponed][rfc57], but this variant
is much less ambitious in terms of generic `TextWriter` support. At this time
the design of `fmt::Writer` is purposely conservative.

[rfc57]: https://github.com/rust-lang/rfcs/pull/57

There are some use cases today where a `&mut Formatter` is interpreted as a
`&mut Writer`, e.g. for the `Show` impl of `Json`. This pattern is undoubtedly
used outside this repository as well, and this change would break all users
relying on the binary functionality of the old `FormatWriter`.

# Alternatives

Another possible solution, specifically to the performance problem, is to have
an `unsafe` flag on a `Formatter` indicating that only valid UTF-8 data was
written. If all sub-parts of formatting set this flag, then the data can be
assumed to be UTF-8. In general, relying on `unsafe` APIs is less "pure" than
relying on the type system instead.

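One possible reading of this alternative is sketched below in current Rust syntax. Everything here is hypothetical (`FlaggedBuffer`, its methods, and the way the flag is tracked); the RFC does not specify how such a flag would be set, and this sketch keeps the flag internal rather than exposing an `unsafe` setter.

```rust
/// Hypothetical buffer that tracks whether every write so far was text.
pub struct FlaggedBuffer {
    bytes: Vec<u8>,
    // The "flag": stays true only while no raw bytes have been written.
    only_text: bool,
}

impl FlaggedBuffer {
    pub fn new() -> Self {
        FlaggedBuffer { bytes: Vec::new(), only_text: true }
    }

    /// Arbitrary bytes clear the flag.
    pub fn write(&mut self, data: &[u8]) {
        self.only_text = false;
        self.bytes.extend_from_slice(data);
    }

    /// Text leaves the flag as it was.
    pub fn write_text(&mut self, data: &str) {
        self.bytes.extend_from_slice(data.as_bytes());
    }

    /// Skips validation only when the flag is still set.
    pub fn into_string(self) -> Result<String, std::string::FromUtf8Error> {
        if self.only_text {
            // SAFETY: every byte came from a `&str`, and concatenating
            // valid UTF-8 strings yields valid UTF-8.
            Ok(unsafe { String::from_utf8_unchecked(self.bytes) })
        } else {
            String::from_utf8(self.bytes)
        }
    }
}
```

The correctness of the fast path rests on a runtime invariant and an `unsafe` block rather than on the method signatures, which is the trade-off the paragraph above describes.
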
The `fmt::Writer` trait could also be located at `io::TextWriter` instead, to
emphasize its possible future connection with I/O, although there are no
concrete plans today to develop these connections.

# Unresolved questions

* It is unclear to what degree a `fmt::Writer` needs to interact with
  `io::Writer` and the various adaptors/buffers. For example one would have to
  implement their own `BufferedWriter` for a `fmt::Writer`.

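To give a sense of what such a hand-written adaptor might involve, here is a minimal sketch in current Rust syntax. It assumes today's `std::fmt::Write` as a stand-in for `fmt::Writer`; `BufferedTextWriter`, its methods, and `demo` are hypothetical and not part of any proposed API.

```rust
use std::fmt::{self, Write};

/// Hypothetical buffering adaptor over any text sink.
pub struct BufferedTextWriter<W: Write> {
    inner: W,
    buf: String,
    capacity: usize,
}

impl<W: Write> BufferedTextWriter<W> {
    pub fn with_capacity(capacity: usize, inner: W) -> Self {
        BufferedTextWriter { inner, buf: String::with_capacity(capacity), capacity }
    }

    /// Forwards any buffered text to the inner sink.
    pub fn flush(&mut self) -> fmt::Result {
        if !self.buf.is_empty() {
            self.inner.write_str(&self.buf)?;
            self.buf.clear();
        }
        Ok(())
    }

    /// Flushes and returns the inner sink.
    pub fn into_inner(mut self) -> Result<W, fmt::Error> {
        self.flush()?;
        Ok(self.inner)
    }
}

impl<W: Write> Write for BufferedTextWriter<W> {
    fn write_str(&mut self, data: &str) -> fmt::Result {
        // Make room first so that output order is preserved.
        if self.buf.len() + data.len() > self.capacity {
            self.flush()?;
        }
        if data.len() > self.capacity {
            // Oversized chunks bypass the buffer entirely.
            self.inner.write_str(data)
        } else {
            self.buf.push_str(data);
            Ok(())
        }
    }
}

/// Buffers a few small writes, then one oversized write, into a `String`.
pub fn demo() -> Result<String, fmt::Error> {
    let mut w = BufferedTextWriter::with_capacity(8, String::new());
    write!(w, "{}-{}", "abc", 123)?;
    w.write_str("a long chunk over capacity")?;
    w.into_inner()
}
```
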
