Conversation

@patrickfreed patrickfreed commented Jul 2, 2021

RUST-870

This PR introduces two new functions to the public API which allow for deserializing a T directly from raw BSON bytes instead of going through Document first: from_reader and from_slice. The primary goal of these functions is to enable performance improvements in the driver, though they should be generally useful on their own. This PR also enables borrowed deserialization via from_slice (#231, RUST-688).

The implementation involved introducing a new deserializer that reads directly from raw BSON. The code for this lives in src/de/raw.rs and is modeled on the Implementing a Deserializer example and serde_json's Deserializer. I encourage reading through the example and having both handy while reviewing this PR.

This also fixes RUST-880, RUST-884, and partially RUST-882.

  values:
    - id: "min"
-     display_name: "1.43 (minimum supported version)"
+     display_name: "1.48 (minimum supported version)"
patrickfreed (author):

The driver is currently on 1.47, but in order to ergonomically convert between a Vec<u8> and an array of bytes for Decimal128, I needed Rust 1.48. I'm sort of thinking we should consider bumping this all the way to the latest stable right before the 2.0.0 release to give us the newest possible version to work with for a while. Thoughts?
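The 1.48 feature in question is presumably the stabilization of `TryFrom<Vec<T>> for [T; N]`, which makes the `Vec<u8>` to fixed-size-array conversion a one-liner. A std-only sketch of what that buys for the 16-byte Decimal128 representation:

```rust
use std::convert::TryInto;

fn main() {
    // Pretend these 16 bytes were read out of a raw BSON buffer.
    let bytes: Vec<u8> = vec![0u8; 16];

    // Rust 1.48 stabilized `TryFrom<Vec<T>> for [T; N]`, so the
    // conversion to the fixed-size array is ergonomic and checked:
    // it fails cleanly if the length is not exactly 16.
    let array: [u8; 16] = bytes.try_into().expect("expected exactly 16 bytes");
    assert_eq!(array.len(), 16);

    // A wrong-length buffer is rejected rather than silently truncated.
    let short: Vec<u8> = vec![0u8; 5];
    let result: Result<[u8; 16], _> = short.try_into();
    assert!(result.is_err());
}
```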

Reviewer:

SGTM.

Reviewer:

sgtm as well


use bson::{Bson, Deserializer, Serializer};

macro_rules! bson {
patrickfreed (author):

These tests used separate macros, I think because they were written before the public doc! and bson! macros existed. I updated them to use the crate's actual macros and to test both deserialization techniques.

}

impl From<u32> for Bson {
fn from(a: u32) -> Bson {
patrickfreed (author):

This was leading to integers that were too big wrapping around. See RUST-882 for more info.
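A std-only illustration of the wraparound being described (the widening fix shown is a sketch along the lines of RUST-882, not the PR's exact code):

```rust
fn main() {
    // Casting with `as` silently wraps values that do not fit, so a
    // u32 above i32::MAX stored in a BSON int32 came out negative.
    let big: u32 = 3_000_000_000;
    assert_eq!(big as i32, -1_294_967_296); // wrapped, not preserved

    // Sketch of the fix: keep values that fit in an int32, and widen
    // anything larger to an int64 instead of wrapping.
    let fixed: i64 = if big <= i32::MAX as u32 {
        i64::from(big as i32)
    } else {
        i64::from(big)
    };
    assert_eq!(fixed, 3_000_000_000);
}
```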

}
}
-            Bson::DateTime(v) if v.timestamp_millis() >= 0 && v.to_chrono().year() <= 99999 => {
+            Bson::DateTime(v) if v.timestamp_millis() >= 0 && v.to_chrono().year() <= 9999 => {
patrickfreed (author):

See RUST-884

Ok((key, val))
}

impl Binary {
patrickfreed (author):

These impls were moved mostly as-is from the match above so that they could be reused in the raw deserializer.

V: MapAccess<'de>,
{
let values = DocumentVisitor::new().visit_map(visitor)?;
Ok(Bson::from_extended_document(values))
patrickfreed (author):

The logic here is much the same, but it was updated to use serde's MapAccess API for the extended JSON format parts to achieve better performance. Previously, a Document would be allocated and deserialized each time, and then its format would be parsed. Instead, we deserialize the extended format directly, allowing us to go straight from raw BSON to our BSON types where possible. If the input data doesn't match an extended JSON format, it's simply deserialized to a Document.

}
}

pub(crate) struct DocumentVisitor {
patrickfreed (author):

This logic is now implemented directly in BsonVisitor, so this separate visitor wasn't needed anymore.

.expect(&description);

let mut native_to_bson_bson_to_native_cv = Vec::new();
let canonical_bson = hex::decode(&valid.canonical_bson).expect(&description);
patrickfreed (author):

I expanded the corpus tests to cover both deserializers, as they weren't doing so before. This uncovered a few bugs.

let bson = hex::decode(&decode_error.bson).expect("should decode from hex");
Document::from_reader(bson.as_slice()).expect_err(decode_error.description.as_str());

// the from_reader implementation supports deserializing from lossy UTF-8
patrickfreed (author):

See RUST-886

}

// native_to_bson( bson_to_native(dB) ) = cB

patrickfreed (author):

These were moved earlier in the file so that they run even without the decimal128 feature flag.

@patrickfreed patrickfreed marked this pull request as ready for review July 2, 2021 20:00
  values:
    - id: "min"
-     display_name: "1.43 (minimum supported version)"
+     display_name: "1.48 (minimum supported version)"
Reviewer:

SGTM.

src/de/raw.rs Outdated
/// Read the next element type and update the root deserializer with it.
///
/// Returns `Ok(None)` if the document has been fully read and has no more elements.
fn read_next_tag(&mut self) -> Result<Option<ElementType>> {
Reviewer:

I think this might be easier to follow if Deserializer had a fn that read, parsed the tag, and updated current_type, and this one called that and checked/updated length_remaining.

let subtype = BinarySubtype::from(read_u8(&mut self.bytes)?);
match subtype {
BinarySubtype::Generic => {
visitor.visit_borrowed_bytes(self.bytes.read_slice(len as usize)?)
Reviewer:

Making sure I'm following: this means that generic Binary values will deserialize to &[u8], whereas all other subtypes will be a {"$binary": {"subType": ..., "base64": ...}} map structure, right? Which I guess means a data structure intended to be deserialized from raw BSON must conform to either the generic shape or the everything-else shape, and can't accept both.

src/de/raw.rs Outdated
Ok(())
}

fn read_cstr(&mut self) -> Result<&'a str> {
Reviewer:

Why are utf8 errors here a hard failure rather than the lossy behavior of read_str (or the other way around)?

patrickfreed (author):

Since the common case for invalid UTF-8 was server error message values rather than keys, I opted to only do the lossy handling for values. That said, it probably makes more sense to be consistent throughout. Once we decide on a way forward for the lossy UTF-8 situation, I'll unify this with read_str.
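A std-only sketch of the two behaviors being contrasted here, using `std::str::from_utf8` for the strict path and `String::from_utf8_lossy` for the lossy one:

```rust
fn main() {
    // Invalid UTF-8: a truncated multi-byte sequence, like the
    // invalidly truncated strings described in SERVER-24007.
    let bytes: &[u8] = b"hello \xe2\x28";

    // Strict decoding (the read_cstr behavior): a hard failure.
    assert!(std::str::from_utf8(bytes).is_err());

    // Lossy decoding (the read_str behavior): invalid sequences are
    // replaced with U+FFFD (the replacement character) instead of
    // failing the whole deserialization.
    let lossy = String::from_utf8_lossy(bytes);
    assert!(lossy.contains('\u{FFFD}'));
    assert!(lossy.starts_with("hello "));
}
```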

/// - deserializing from the serialized document produces `expected_value`
/// - round trip through raw BSON:
/// - deserializing a `T` from the raw BSON version of `expected_doc` produces `expected_value`
/// - desierializing a `Document` from the raw BSON version of `expected_doc` produces
Reviewer:

Typo: "deserializing"

patrickfreed (author):

fixed

  values:
    - id: "min"
-     display_name: "1.43 (minimum supported version)"
+     display_name: "1.48 (minimum supported version)"
Reviewer:

sgtm as well

"a": 1,
"b": 2,
};
bson::from_document::<Foo>(doc.clone()).expect_err("extra filds should cause failure");
Reviewer:

typo here ("filds" -> "fields")

patrickfreed (author):

fixed

impl From<f32> for Bson {
    fn from(a: f32) -> Bson {
-        Bson::Double(a as f64)
+        Bson::Double(a.into())
Reviewer:

is this just a stylistic change?

patrickfreed (author):

Yep, to signify that this is in fact lossless, whereas `as` can be lossy depending on the conversion.
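A std-only illustration of the distinction: `From`/`Into` is only implemented for conversions that cannot lose information, while `as` compiles for lossy ones too.

```rust
fn main() {
    // f32 -> f64 is lossless, which is why `Into` exists for it.
    let x: f32 = 0.1;
    let y: f64 = x.into();
    assert_eq!(y, f64::from(x));

    // By contrast, `as` also compiles for conversions that can lose
    // information: f64 -> f32 saturates out-of-range values.
    let z = 1e300_f64 as f32;
    assert!(z.is_infinite());
}
```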

/// This is mainly useful when reading raw BSON returned from a MongoDB server, which
/// in rare cases can contain invalidly truncated strings (https://jira.mongodb.org/browse/SERVER-24007).
/// For most use cases, `bson::from_slice` can be used instead.
pub fn from_reader_utf8_lossy<R, T>(reader: R) -> Result<T>
patrickfreed (author):

per our discussion in slack, I added new functions for the lossy UTF-8 approach.

cc @abr-egn @isabelatkinson

@patrickfreed patrickfreed merged commit 7ccf82b into mongodb:master Jul 8, 2021