RUST-1992 Introduce the `&CStr` and `CString` types for keys and regular expressions #563

abr-egn · 2025-06-27T18:05:46Z

RUST-1992

This introduces the &CStr and CString types; these are zero-overhead equivalents to &str and String that witness that the text contains no zero bytes. These types enforce that zero-byte checking for regular expressions and value keys happens at construction time (i.e. on load or user input) rather than at encoding time, which means (a) errors surface closer to their root cause and (b) the encoding machinery can be simplified.

The new types are made fairly easy to work with by implementing a swath of standard library traits and by a cstr! macro that checks at compile time whether a given string literal is valid and, if not, errors with a friendly message.
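To make the construction-time idea concrete, here is a minimal sketch of the pattern, not the crate's actual implementation. The names `Key`, `has_no_zero_bytes`, `from_literal`, and `key!` are hypothetical stand-ins for the PR's `&CStr` and `cstr!`; the only assumption is that the check can be written as a `const fn` and forced to run inside a `const` item.

```rust
/// Hypothetical stand-in for the PR's `&CStr`: a borrowed string that is known
/// to contain no zero bytes. This is an illustration, not the bson type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Key<'a>(&'a str);

/// Returns true if `text` contains no zero bytes. Written with a `while` loop
/// so that it is usable in `const` contexts as well as at run time.
pub const fn has_no_zero_bytes(text: &str) -> bool {
    let bytes = text.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == 0 {
            return false;
        }
        i += 1;
    }
    true
}

impl<'a> Key<'a> {
    /// Run-time construction: this is the only place where validation can fail.
    pub fn new(text: &'a str) -> Result<Self, String> {
        if has_no_zero_bytes(text) {
            Ok(Key(text))
        } else {
            Err(format!("key {text:?} contains a zero byte"))
        }
    }

    /// Construction helper for literals; panics on a zero byte. When called
    /// from a `const` item, that panic becomes a compile error.
    pub const fn from_literal(text: &'a str) -> Self {
        assert!(has_no_zero_bytes(text), "key literal contains a zero byte");
        Key(text)
    }

    pub fn as_str(&self) -> &'a str {
        self.0
    }
}

/// Hypothetical analogue of `cstr!`: evaluating the check inside a `const`
/// item forces it to run during compilation.
macro_rules! key {
    ($text:literal) => {{
        const KEY: Key<'static> = Key::from_literal($text);
        KEY
    }};
}
```

With this sketch, `key!("number")` yields a validated key with no run-time check, `key!("a\0b")` would fail to compile, and `Key::new(user_input)` returns a `Result` at the point where the string enters the program.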

abr-egn · 2025-06-27T23:42:46Z

serde-tests/json.rs

-    doc_buf.append("number", 12).unwrap();
-    doc_buf.append("bool", false).unwrap();
-    doc_buf.append("nu", RawBson::Null).unwrap();
+    doc_buf.append(cstr!("a"), "key");


This really is the whole story in this one line - the failure point has been shifted from writing bytes to the document buffer (append and everything that used it) to constructing the string, and that in turn can now be done either at run-time or at compile-time if it's just a literal.

Would it be possible to provide support for the c"string" syntax here, and more generally interop with the equivalent std::ffi types? I think that would be slightly more ergonomic for string literals since it wouldn't require a macro import.

I'm also wondering how users can construct strings with interior null bytes - is that ever doable with a &'static str?

Unfortunately std::ffi::CStr and the c"string" syntax don't require that the text be valid UTF-8, so there'd still have to be a validation step at encode time if we used those :/. I added documentation to the bson CStr types to call out the differences (and make usage more clear).

Zero bytes are actually completely valid UTF-8 under normal circumstances (although why they'd be in there is anyone's guess), and can be constructed in string literals via \0:

    let embedded_zero = "foo\0bar";
    println!("{embedded_zero}\n{:?}", embedded_zero.as_bytes());

prints

    foobar
    [102, 111, 111, 0, 98, 97, 114]
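As a small standard-library-only illustration of the same point, the sketch below locates the zero byte in a `&str` and shows that `std::ffi::CString::new` rejects such input at construction. The helper names `first_zero_byte` and `to_std_cstring` are made up for this example.

```rust
use std::ffi::{CString, NulError};

/// Returns the index of the first zero byte in `text`, if there is one.
/// A `&str` may legally contain such bytes: U+0000 is a valid code point.
pub fn first_zero_byte(text: &str) -> Option<usize> {
    text.bytes().position(|b| b == 0)
}

/// Tries to build a standard-library C string, which fails on interior
/// zero bytes and reports where the first one was found.
pub fn to_std_cstring(text: &str) -> Result<CString, NulError> {
    CString::new(text)
}
```

For `"foo\0bar"`, `first_zero_byte` reports index 3 and `to_std_cstring` returns an error whose `nul_position()` is also 3, while the same bytes pass `std::str::from_utf8` without complaint.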

std::ffi::CStr and the c"string" syntax don't require that the text be valid UTF-8

For my understanding because my knowledge of these string details is pretty fuzzy - is there somewhere we're enforcing valid UTF-8 for CStr/CString upon construction? I see that we're checking for null bytes in validate_cstr, but don't understand how that's different from the constraints of a std::ffi::CStr. I do get a compiler error for something like:

let string = c"abc\0def";

So I'm wondering what would be valid input for c"..." but not cstr!("...").

(I also missed the difference in inclusion/exclusion of the terminating null byte, thanks for adding that to the docs!)

We don't need to explicitly validate UTF-8 for these types because they only accept &str / String, which are already required to be UTF-8 :). In the code where we're parsing things from a bytestream we've already got the validation code, and likewise any user constructing keys or regexes from strings will already have had to validate the data.

I added some examples to the rustdoc to make it more clear where the validation for bson::raw::CStr is more strict than either str or std::ffi::CStr - for the latter, the tl;dr is that out-of-range byte sequences like c"\xc3\x28" are perfectly valid, but cstr!("\xc3\x28") won't compile because the string literal "\xc3\x28" itself is invalid.
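The std side of that difference can be checked without the bson crate. The sketch below uses `CStr::from_bytes_with_nul` rather than the c"..." literal so that it does not depend on a recent compiler; the constant and helper names are hypothetical.

```rust
use std::ffi::CStr;

/// 0xC3 opens a two-byte UTF-8 sequence, but 0x28 is not a continuation
/// byte, so these bytes are not valid UTF-8. The trailing 0 is the
/// terminator that `std::ffi::CStr` requires.
pub const INVALID_UTF8_WITH_NUL: &[u8] = b"\xc3\x28\0";

/// `std::ffi::CStr` only requires a single zero byte, at the very end.
pub fn as_std_cstr(bytes_with_nul: &[u8]) -> Option<&CStr> {
    CStr::from_bytes_with_nul(bytes_with_nul).ok()
}

/// Viewing a `CStr` as `&str` is a separate, fallible UTF-8 check.
pub fn as_utf8(c: &CStr) -> Option<&str> {
    c.to_str().ok()
}
```

Here `as_std_cstr(INVALID_UTF8_WITH_NUL)` succeeds while `as_utf8` on the result returns `None`, which is the extra validation step a `std::ffi`-based key would still need before encoding.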

excellent, that all makes sense. thank you for the explanations and docs!

abr-egn · 2025-06-27T23:44:47Z

serde-tests/test.rs

-            ("a key", RawBson::String("a value".to_string())),
-            ("an objectid", RawBson::ObjectId(oid)),
-            ("a date", RawBson::DateTime(dt)),
+            (cstr!("a key"), RawBson::String("a value".to_string())),


Building from a key-value list is one place where the repeated cstr! is more of a hassle; I could potentially see providing a convenience try_from_iter method that accepts &str keys and returns a Result (i.e. like this used to be).

makes sense, seems like something we can add if there's user demand for it
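For illustration only, a convenience along those lines might look like the sketch below. Nothing here exists in the crate: `Key` is a hypothetical owned stand-in for `CString`, and `try_from_iter` is written as a free function returning a plain `Vec` rather than a document type.

```rust
/// Hypothetical owned key type: a `String` known to contain no zero bytes.
#[derive(Debug, PartialEq, Eq)]
pub struct Key(String);

impl Key {
    /// Validates once, at construction.
    pub fn new(text: &str) -> Result<Self, String> {
        if text.as_bytes().contains(&0) {
            Err(format!("key {text:?} contains a zero byte"))
        } else {
            Ok(Key(text.to_owned()))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Accepts plain `&str` keys and validates each one, stopping at the first
/// invalid key. Collecting into `Result<Vec<_>, _>` provides the early exit.
pub fn try_from_iter<'a, V>(
    pairs: impl IntoIterator<Item = (&'a str, V)>,
) -> Result<Vec<(Key, V)>, String> {
    pairs
        .into_iter()
        .map(|(key, value)| Key::new(key).map(|key| (key, value)))
        .collect()
}
```

The trade-off is the one described above: callers get to write bare string keys again, but the failure moves back to run time and the whole call becomes fallible.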

abr-egn · 2025-06-27T23:46:16Z

src/macros.rs

-    // Insert the current entry followed by trailing comma.
-    (@object $object:ident [$($key:tt)+] ($value:expr) , $($rest:tt)*) => {
-        $object.append(($($key)+), $value).expect("invalid bson value");
+    // Insert the current entry with followed by trailing comma, with a key literal.


I tweaked the behavior of rawdoc! a bit here - now if the key is a literal it'll be implicitly wrapped in cstr! so it gets compile-time validated, otherwise it'll be assumed to be an expression that evaluates to a valid key and passed on to append (as before). The main difference here is that now this macro can no longer panic :)
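The literal-versus-expression dispatch can be sketched with two `macro_rules!` arms, relying on the fact that a `$key:literal` arm simply fails to match a non-literal token so the next arm is tried. This is a simplified illustration with hypothetical names (`Key`, `entries!`), not the actual `rawdoc!` implementation, and it builds a `Vec` of pairs instead of a document.

```rust
/// Hypothetical stand-in for the PR's `&CStr`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Key<'a>(&'a str);

impl<'a> Key<'a> {
    /// Panics on a zero byte; inside a `const` item that is a compile error.
    pub const fn from_literal(text: &'a str) -> Self {
        let bytes = text.as_bytes();
        let mut i = 0;
        while i < bytes.len() {
            assert!(bytes[i] != 0, "key literal contains a zero byte");
            i += 1;
        }
        Key(text)
    }

    /// Run-time validation for keys that are not literals.
    pub fn new(text: &'a str) -> Option<Self> {
        if text.as_bytes().contains(&0) {
            None
        } else {
            Some(Key(text))
        }
    }

    pub fn as_str(&self) -> &'a str {
        self.0
    }
}

macro_rules! entries {
    // Literal key: validate during compilation by evaluating in a `const`.
    (@key $key:literal) => {{
        const KEY: Key<'static> = Key::from_literal($key);
        KEY
    }};
    // Anything else: assume it already evaluates to a validated key.
    (@key $key:expr) => { $key };
    // Entry point: a comma-separated list of `key => value` pairs.
    ($($key:tt => $value:expr),* $(,)?) => {
        vec![$((entries!(@key $key), $value)),*]
    };
}
```

A call such as `entries! { "number" => 12, dynamic => 7 }`, where `dynamic` is an existing `Key`, goes through both arms, and neither path contains a run-time failure point.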

abr-egn · 2025-06-27T23:47:01Z

src/raw/array_buf.rs

+    }
+}
+
+impl<B: BindRawBsonRef> FromIterator<B> for RawArrayBuf {


This gets to be a real impl again 🎉

abr-egn · 2025-06-27T23:48:45Z

src/raw/bson_ref.rs

            RawBsonRef::Null => RawBson::Null,
            RawBsonRef::RegularExpression(re) => {
-                RawBson::RegularExpression(Regex::new(re.pattern, re.options))
+                let mut chars: Vec<_> = re.options.as_str().chars().collect();


This doesn't use Regex::from_strings because it's coming from already-validated data, so it can skip the extra validation step that would add.

abr-egn · 2025-07-01T13:53:57Z

src/raw/cstr.rs

+    }
+
+    const fn from_str_unchecked(value: &str) -> &Self {
+        // Safety: the conversion is safe because CStr is repr(transparent), and the deref is safe


I should note that this is the same way we construct &RawDocument / &RawArray.
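For readers unfamiliar with the pattern, here is a minimal sketch of a `repr(transparent)` wrapper around `str` with an unchecked reference cast, the same shape the standard library uses for `Path`. `KeyStr` is a hypothetical name, and the body is an approximation of the idea rather than a copy of the PR's code.

```rust
/// Hypothetical stand-in for the PR's `CStr`: an unsized wrapper around `str`.
/// `repr(transparent)` guarantees it has the same layout as `str`, so a
/// `&KeyStr` is the same fat pointer (address and length) as a `&str`.
#[repr(transparent)]
pub struct KeyStr {
    data: str,
}

impl KeyStr {
    /// Reinterprets a `&str` as a `&KeyStr` without checking its contents.
    /// Kept private so that only validated text can reach it.
    const fn from_str_unchecked(value: &str) -> &Self {
        // Safety: `KeyStr` is `repr(transparent)` over `str`, so the pointer
        // cast preserves layout and metadata, and the returned reference
        // borrows from `value` for the same lifetime.
        unsafe { &*(value as *const str as *const Self) }
    }

    /// Checked constructor: returns `None` if `value` contains a zero byte.
    pub fn new(value: &str) -> Option<&Self> {
        if value.as_bytes().contains(&0) {
            None
        } else {
            Some(Self::from_str_unchecked(value))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.data
    }
}
```

Because the wrapper adds no fields, the conversion is free at run time; the type exists only to carry the no-zero-bytes guarantee through the API.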

isabelatkinson · 2025-07-01T16:27:11Z

src/raw/document_buf.rs

-        value.bind(|value_ref| {
-            raw_writer::RawWriter::new(&mut self.data).append(key.as_ref(), value_ref)
-        })
+    pub fn append(&mut self, key: impl AsRef<CStr>, value: impl BindRawBsonRef) {


The doc for this method still says it will panic upon bad input - can we update it with an explanation of the CStr type?

Updated the comment here to point to those types - I put the majority of the explanation on those types themselves so hopefully a pointer is sufficient.

isabelatkinson · 2025-07-02T16:48:45Z

serde-tests/json.rs

Would it be possible to provide support for the c"string" syntax here, and more generally interop with the equivalent std::ffi types? I think that would be slightly more ergonomic for string literals since it wouldn't require a macro import.

I'm also wondering how users can construct strings with interior null bytes - is that ever doable with a &'static str?

abr-egn · 2025-07-03T14:21:36Z

src/raw/cstr.rs

 use crate::error::{Error, Result};

-// A BSON-spec cstring: Zero or more UTF-8 encoded characters, excluding the null byte.
+#[allow(rustdoc::invalid_rust_codeblocks)]


This was needed because otherwise rustdoc would also error out when parsing the invalid string literal in the example.

isabelatkinson

lgtm!

isabelatkinson · 2025-07-03T15:13:42Z

serde-tests/json.rs

excellent, that all makes sense. thank you for the explanations and docs!

isabelatkinson · 2025-07-03T15:14:26Z

serde-tests/test.rs

makes sense, seems like something we can add if there's user demand for it

abr-egn · 2025-07-03T16:15:09Z

One more review, I'm afraid; I had to resolve a minor merge conflict.

abr-egn changed the title ~~RUST-1992 Introduce the &CStr and CString types for regular expressions~~ RUST-1992 Introduce the &CStr and CString types for keys and regular expressions Jun 27, 2025

abr-egn commented Jun 28, 2025

abr-egn marked this pull request as ready for review June 28, 2025 00:27

abr-egn requested a review from a team as a code owner June 28, 2025 00:27

abr-egn requested a review from isabelatkinson June 28, 2025 00:27

abr-egn force-pushed the RUST-1992/cstr branch from aad5212 to 0e8aa79 Compare June 30, 2025 15:52

abr-egn commented Jul 1, 2025

isabelatkinson reviewed Jul 2, 2025

abr-egn requested a review from isabelatkinson July 2, 2025 17:52

abr-egn commented Jul 3, 2025

isabelatkinson previously approved these changes Jul 3, 2025

abr-egn added 19 commits July 3, 2025 12:03

non-serde conversion

718073e

serde conversion

d32ccc1

propagate infallibility

1c33eb5

cstr append key

1aa5d08

fix doctests

33b69c5

clippy fixes

c7564b9

fix fuzzer

7944192

AsRef<str> for CStr

a6e60e5

require cstr to be a literal

b63cd15

better rawdoc

fe810e9

fix serde-tests

2ac2cb7

cleanup

6eeaec4

fix cfg

a4eb15c

fix clippy again

df0729a

add documentation

b5882bf

more doc examples

1d8f3ad

fix comment

14e97c7

more doc updates

a4b4d3a

one final doc tweak

e6e57ef

abr-egn dismissed isabelatkinson’s stale review via e6e57ef July 3, 2025 16:08

abr-egn force-pushed the RUST-1992/cstr branch from 8369a5d to e6e57ef Compare July 3, 2025 16:08

abr-egn requested a review from isabelatkinson July 3, 2025 16:14

isabelatkinson approved these changes Jul 3, 2025

abr-egn merged commit 25ac200 into mongodb:main Jul 3, 2025
9 of 11 checks passed
