-
Notifications
You must be signed in to change notification settings - Fork 150
RUST-1992 Introduce the &CStr
and CString
types for keys and regular expressions
#563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
&CStr
and CString
types for regular expressions&CStr
and CString
types for keys and regular expressions
doc_buf.append("number", 12).unwrap(); | ||
doc_buf.append("bool", false).unwrap(); | ||
doc_buf.append("nu", RawBson::Null).unwrap(); | ||
doc_buf.append(cstr!("a"), "key"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really is the whole story in this one line - the failure point has been shifted from writing bytes to the document buffer (append
and everything that used it) to constructing the string, and that in turn can now be done either at run-time or at compile-time if it's just a literal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to provide support for the c"string"
syntax here, and more generally interop with the equivalent std::ffi
types? I think that would be slightly more ergonomic for string literals since it wouldn't require a macro import.
I'm also wondering how users can construct strings with interior null bytes - is that ever doable with a &static str
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately std::ffi::CStr
and the c"string"
syntax don't don't require that the text be valid UTF8, so there'd still have to be a validation step at encode time if we used those :/. I added documentation to the bson CStr
types to call out the differences (and make usage more clear).
Zero bytes are actually completely valid UTF8 under normal circumstances (although why they'd be in there is anyone's guess), and can be constructed in string literals via \0
:
let embedded_zero = "foo\0bar";
println!("{embedded_zero}\n{:?}", embedded_zero.as_bytes());
prints
foobar
[102, 111, 111, 0, 98, 97, 114]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::ffi::CStr
and thec"string"
syntax don't don't require that the text be valid UTF8
For my understanding because my knowledge of these string details is pretty fuzzy - is there somewhere we're enforcing valid UTF-8 for CStr
/CString
upon construction? I see that we're checking for null bytes in validate_cstr
, but don't understand how that's different from the constraints of a std::ffi::CStr
. I do get a compiler error for something like:
let string = c"abc\0def";
So I'm wondering what would be valid input for c"..."
but not cstr!("...")
.
(I also missed the difference in inclusion/exclusion of the terminating null byte, thanks for adding that to the docs!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to explicitly validate UTF-8 for these types because they only accept &str
/ String
, which are already required to be UTF-8 :). In the code where we're parsing things from a bytestream we've already got the validation code, and likewise any user constructing keys or regexes from strings will already have had to validate the data.
I added some examples to the rustdoc to make it more clear where the validation for bson::raw::CStr
is more strict than either str
or std::ffi::CStr
- for the latter tl;dr is that out of range byte sequences like c"\xc3\x28"
are perfectly valid but cstr!("\xc3\x28")
won't compile because the string literal "\xc3\x28"
itself is invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
excellent, that all makes sense. thank you for the explanations and docs!
("a key", RawBson::String("a value".to_string())), | ||
("an objectid", RawBson::ObjectId(oid)), | ||
("a date", RawBson::DateTime(dt)), | ||
(cstr!("a key"), RawBson::String("a value".to_string())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Building from a key-value list is one place where the repeated cstr!
is more of a hassle; I could potentially see providing a convenience try_from_iter
method that accepts &str
keys and returns a Result
(i.e. like this used to be).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, seems like something we can add if there's user demand for it
// Insert the current entry followed by trailing comma. | ||
(@object $object:ident [$($key:tt)+] ($value:expr) , $($rest:tt)*) => { | ||
$object.append(($($key)+), $value).expect("invalid bson value"); | ||
// Insert the current entry with followed by trailing comma, with a key literal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tweaked the behavior of rawdoc!
a bit here - now if the key is a literal it'll be implicitly wrapped in cstr!
so it gets compile-time validated, otherwise it'll be assumed to be an expression that evaluates to a valid key and passed on to append (as before). The main difference here is that now this macro can no longer panic :)
} | ||
} | ||
|
||
impl<B: BindRawBsonRef> FromIterator<B> for RawArrayBuf { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gets to be a real impl
again 🎉
RawBsonRef::Null => RawBson::Null, | ||
RawBsonRef::RegularExpression(re) => { | ||
RawBson::RegularExpression(Regex::new(re.pattern, re.options)) | ||
let mut chars: Vec<_> = re.options.as_str().chars().collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't use Regex::from_strings
because it's coming from already-validated data, so it can skip the extra validation step that would add.
} | ||
|
||
const fn from_str_unchecked(value: &str) -> &Self { | ||
// Safety: the conversion is safe because CStr is repr(transparent), and the deref is safe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should note that this is the same way we construct &RawDocument
/ &RawArray
.
value.bind(|value_ref| { | ||
raw_writer::RawWriter::new(&mut self.data).append(key.as_ref(), value_ref) | ||
}) | ||
pub fn append(&mut self, key: impl AsRef<CStr>, value: impl BindRawBsonRef) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc for this method still says this method will panic upon bad input - can we update it with an explanation of the Cstr
type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comment here to point to those types - I put the majority of the explanation on those types themselves so hopefully a pointer is sufficient.
doc_buf.append("number", 12).unwrap(); | ||
doc_buf.append("bool", false).unwrap(); | ||
doc_buf.append("nu", RawBson::Null).unwrap(); | ||
doc_buf.append(cstr!("a"), "key"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to provide support for the c"string"
syntax here, and more generally interop with the equivalent std::ffi
types? I think that would be slightly more ergonomic for string literals since it wouldn't require a macro import.
I'm also wondering how users can construct strings with interior null bytes - is that ever doable with a &static str
?
use crate::error::{Error, Result}; | ||
|
||
// A BSON-spec cstring: Zero or more UTF-8 encoded characters, excluding the null byte. | ||
#[allow(rustdoc::invalid_rust_codeblocks)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was needed because otherwise rustdoc would also error out when parsing the invalid string literal in the example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
doc_buf.append("number", 12).unwrap(); | ||
doc_buf.append("bool", false).unwrap(); | ||
doc_buf.append("nu", RawBson::Null).unwrap(); | ||
doc_buf.append(cstr!("a"), "key"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
excellent, that all makes sense. thank you for the explanations and docs!
("a key", RawBson::String("a value".to_string())), | ||
("an objectid", RawBson::ObjectId(oid)), | ||
("a date", RawBson::DateTime(dt)), | ||
(cstr!("a key"), RawBson::String("a value".to_string())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, seems like something we can add if there's user demand for it
One more review, I'm afraid, had to resolve a minor merge conflict. |
RUST-1992
This introduces the
&CStr
andCString
types; these are zero-overhead equivalents to&str
andString
that witness that the text contain no zero bytes. These types are used to enforce that zero-byte checking is done for regular expressions and value keys at construction time (i.e. load or user input) rather than at encoding, which means (a) errors will happen closer to the root cause and (b) the encoding machinery can be simplified.The new types are made fairly easy to work with via implementation of a swath of standard library traits and a
cstr!
macro that checks at compile-time if a given string literal is valid and errors with a friendly message if not.