Skip to content

Conversation

abr-egn
Copy link
Contributor

@abr-egn abr-egn commented Jun 27, 2025

RUST-1992

This introduces the &CStr and CString types; these are zero-overhead equivalents to &str and String that witness that the text contain no zero bytes. These types are used to enforce that zero-byte checking is done for regular expressions and value keys at construction time (i.e. load or user input) rather than at encoding, which means (a) errors will happen closer to the root cause and (b) the encoding machinery can be simplified.

The new types are made fairly easy to work with via implementation of a swath of standard library traits and a cstr! macro that checks at compile-time if a given string literal is valid and errors with a friendly message if not.

@abr-egn abr-egn changed the title RUST-1992 Introduce the &CStr and CString types for regular expressions RUST-1992 Introduce the &CStr and CString types for keys and regular expressions Jun 27, 2025
doc_buf.append("number", 12).unwrap();
doc_buf.append("bool", false).unwrap();
doc_buf.append("nu", RawBson::Null).unwrap();
doc_buf.append(cstr!("a"), "key");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really is the whole story in this one line - the failure point has been shifted from writing bytes to the document buffer (append and everything that used it) to constructing the string, and that in turn can now be done either at run-time or at compile-time if it's just a literal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to provide support for the c"string" syntax here, and more generally interop with the equivalent std::ffi types? I think that would be slightly more ergonomic for string literals since it wouldn't require a macro import.

I'm also wondering how users can construct strings with interior null bytes - is that ever doable with a &static str?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately std::ffi::CStr and the c"string" syntax don't don't require that the text be valid UTF8, so there'd still have to be a validation step at encode time if we used those :/. I added documentation to the bson CStr types to call out the differences (and make usage more clear).

Zero bytes are actually completely valid UTF8 under normal circumstances (although why they'd be in there is anyone's guess), and can be constructed in string literals via \0:

let embedded_zero = "foo\0bar";
println!("{embedded_zero}\n{:?}", embedded_zero.as_bytes());

prints

foobar
[102, 111, 111, 0, 98, 97, 114]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::ffi::CStr and the c"string" syntax don't don't require that the text be valid UTF8

For my understanding because my knowledge of these string details is pretty fuzzy - is there somewhere we're enforcing valid UTF-8 for CStr/CString upon construction? I see that we're checking for null bytes in validate_cstr, but don't understand how that's different from the constraints of a std::ffi::CStr. I do get a compiler error for something like:

let string = c"abc\0def";

So I'm wondering what would be valid input for c"..." but not cstr!("...").

(I also missed the difference in inclusion/exclusion of the terminating null byte, thanks for adding that to the docs!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to explicitly validate UTF-8 for these types because they only accept &str / String, which are already required to be UTF-8 :). In the code where we're parsing things from a bytestream we've already got the validation code, and likewise any user constructing keys or regexes from strings will already have had to validate the data.

I added some examples to the rustdoc to make it more clear where the validation for bson::raw::CStr is more strict than either str or std::ffi::CStr - for the latter tl;dr is that out of range byte sequences like c"\xc3\x28" are perfectly valid but cstr!("\xc3\x28") won't compile because the string literal "\xc3\x28" itself is invalid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent, that all makes sense. thank you for the explanations and docs!

("a key", RawBson::String("a value".to_string())),
("an objectid", RawBson::ObjectId(oid)),
("a date", RawBson::DateTime(dt)),
(cstr!("a key"), RawBson::String("a value".to_string())),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building from a key-value list is one place where the repeated cstr! is more of a hassle; I could potentially see providing a convenience try_from_iter method that accepts &str keys and returns a Result (i.e. like this used to be).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, seems like something we can add if there's user demand for it

// Insert the current entry followed by trailing comma.
(@object $object:ident [$($key:tt)+] ($value:expr) , $($rest:tt)*) => {
$object.append(($($key)+), $value).expect("invalid bson value");
// Insert the current entry with followed by trailing comma, with a key literal.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tweaked the behavior of rawdoc! a bit here - now if the key is a literal it'll be implicitly wrapped in cstr! so it gets compile-time validated, otherwise it'll be assumed to be an expression that evaluates to a valid key and passed on to append (as before). The main difference here is that now this macro can no longer panic :)

}
}

impl<B: BindRawBsonRef> FromIterator<B> for RawArrayBuf {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets to be a real impl again 🎉

RawBsonRef::Null => RawBson::Null,
RawBsonRef::RegularExpression(re) => {
RawBson::RegularExpression(Regex::new(re.pattern, re.options))
let mut chars: Vec<_> = re.options.as_str().chars().collect();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't use Regex::from_strings because it's coming from already-validated data, so it can skip the extra validation step that would add.

@abr-egn abr-egn marked this pull request as ready for review June 28, 2025 00:27
@abr-egn abr-egn requested a review from a team as a code owner June 28, 2025 00:27
@abr-egn abr-egn requested a review from isabelatkinson June 28, 2025 00:27
}

const fn from_str_unchecked(value: &str) -> &Self {
// Safety: the conversion is safe because CStr is repr(transparent), and the deref is safe
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should note that this is the same way we construct &RawDocument / &RawArray.

value.bind(|value_ref| {
raw_writer::RawWriter::new(&mut self.data).append(key.as_ref(), value_ref)
})
pub fn append(&mut self, key: impl AsRef<CStr>, value: impl BindRawBsonRef) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc for this method still says this method will panic upon bad input - can we update it with an explanation of the Cstr type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment here to point to those types - I put the majority of the explanation on those types themselves so hopefully a pointer is sufficient.

doc_buf.append("number", 12).unwrap();
doc_buf.append("bool", false).unwrap();
doc_buf.append("nu", RawBson::Null).unwrap();
doc_buf.append(cstr!("a"), "key");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to provide support for the c"string" syntax here, and more generally interop with the equivalent std::ffi types? I think that would be slightly more ergonomic for string literals since it wouldn't require a macro import.

I'm also wondering how users can construct strings with interior null bytes - is that ever doable with a &static str?

@abr-egn abr-egn requested a review from isabelatkinson July 2, 2025 17:52
use crate::error::{Error, Result};

// A BSON-spec cstring: Zero or more UTF-8 encoded characters, excluding the null byte.
#[allow(rustdoc::invalid_rust_codeblocks)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was needed because otherwise rustdoc would also error out when parsing the invalid string literal in the example.

isabelatkinson
isabelatkinson previously approved these changes Jul 3, 2025
Copy link
Contributor

@isabelatkinson isabelatkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

doc_buf.append("number", 12).unwrap();
doc_buf.append("bool", false).unwrap();
doc_buf.append("nu", RawBson::Null).unwrap();
doc_buf.append(cstr!("a"), "key");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent, that all makes sense. thank you for the explanations and docs!

("a key", RawBson::String("a value".to_string())),
("an objectid", RawBson::ObjectId(oid)),
("a date", RawBson::DateTime(dt)),
(cstr!("a key"), RawBson::String("a value".to_string())),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, seems like something we can add if there's user demand for it

@abr-egn
Copy link
Contributor Author

abr-egn commented Jul 3, 2025

One more review, I'm afraid, had to resolve a minor merge conflict.

@abr-egn abr-egn merged commit 25ac200 into mongodb:main Jul 3, 2025
9 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants