Conversation

@patrickfreed patrickfreed commented Jul 2, 2021

RUST-870

This PR introduces two new functions to the public API which allow for deserializing a T directly from raw BSON bytes instead of going through Document first: from_reader and from_slice. The primary goal of these functions is to enable performance improvements in the driver, though they should be generally useful on their own. This PR also enables borrowed deserialization via from_slice (#231, RUST-688).

The implementation involved introducing a new deserializer that reads directly from raw BSON. The code for this lives in src/de/raw.rs and is modeled on the Implementing a Deserializer example and serde_json's Deserializer. I encourage reading through the example and having both handy while reviewing this PR.

This also fixes RUST-880, RUST-884, and partially RUST-882.

  values:
    - id: "min"
-     display_name: "1.43 (minimum supported version)"
+     display_name: "1.48 (minimum supported version)"
patrickfreed (author):

The driver is currently on 1.47, but in order to ergonomically convert between a Vec<u8> and an array of bytes for Decimal128, I needed Rust 1.48. I'm sort of thinking we should consider bumping this all the way to the latest stable right before the 2.0.0 release to give us the newest possible version to work with for a while. Thoughts?
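The 1.48 feature in question is presumably the stabilization of `TryFrom<Vec<T>> for [T; N]`, which makes the `Vec<u8>` to fixed-size-array conversion a one-liner. A std-only sketch of what that buys for the 16-byte Decimal128 representation:

```rust
use std::convert::TryInto;

fn main() {
    // Pretend these 16 bytes were read out of a raw BSON buffer.
    let bytes: Vec<u8> = vec![0u8; 16];

    // Rust 1.48 stabilized `TryFrom<Vec<T>> for [T; N]`, so the
    // conversion to the fixed-size array is ergonomic and checked:
    // it fails cleanly if the length is not exactly 16.
    let array: [u8; 16] = bytes.try_into().expect("expected exactly 16 bytes");
    assert_eq!(array.len(), 16);

    // A wrong-length buffer is rejected rather than silently truncated.
    let short: Vec<u8> = vec![0u8; 5];
    let result: Result<[u8; 16], _> = short.try_into();
    assert!(result.is_err());
}
```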

Reviewer:

SGTM.

Reviewer:

sgtm as well


use bson::{Bson, Deserializer, Serializer};

macro_rules! bson {
patrickfreed (author):

These tests used separate macros, I think because they were written before the public doc! and bson! macros existed. I updated them to use the crate's actual macros and to test both deserialization techniques.

}

impl From<u32> for Bson {
fn from(a: u32) -> Bson {
patrickfreed (author):

This was leading to integers that were too big wrapping around. See RUST-882 for more info.
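A std-only illustration of the wraparound being described (the widening fix shown is a sketch along the lines of RUST-882, not the PR's exact code):

```rust
fn main() {
    // Casting with `as` silently wraps values that do not fit, so a
    // u32 above i32::MAX stored in a BSON int32 came out negative.
    let big: u32 = 3_000_000_000;
    assert_eq!(big as i32, -1_294_967_296); // wrapped, not preserved

    // Sketch of the fix: keep values that fit in an int32, and widen
    // anything larger to an int64 instead of wrapping.
    let fixed: i64 = if big <= i32::MAX as u32 {
        i64::from(big as i32)
    } else {
        i64::from(big)
    };
    assert_eq!(fixed, 3_000_000_000);
}
```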

}
}
-            Bson::DateTime(v) if v.timestamp_millis() >= 0 && v.to_chrono().year() <= 99999 => {
+            Bson::DateTime(v) if v.timestamp_millis() >= 0 && v.to_chrono().year() <= 9999 => {
patrickfreed (author):

See RUST-884

Ok((key, val))
}

impl Binary {
patrickfreed (author):

These impls were moved mostly as-is from the match above so that they could be reused in the raw deserializer.

V: MapAccess<'de>,
{
let values = DocumentVisitor::new().visit_map(visitor)?;
Ok(Bson::from_extended_document(values))
patrickfreed (author):

The logic here is much the same, but it was updated to use serde's MapAccess API for the extended JSON format parts to achieve better performance. Previously, a Document would be allocated and deserialized each time, and then its format would be parsed. Instead, we deserialize the extended format directly, allowing us to go straight from raw BSON to our BSON types where possible. If the input data doesn't match an extended JSON format, it's simply deserialized to a Document.

}
}

pub(crate) struct DocumentVisitor {
patrickfreed (author):

This logic is now implemented directly in BsonVisitor, so this separate visitor wasn't needed anymore.

.expect(&description);

let mut native_to_bson_bson_to_native_cv = Vec::new();
let canonical_bson = hex::decode(&valid.canonical_bson).expect(&description);
patrickfreed (author):

I expanded the corpus tests to cover both deserializers, as they weren't doing so before. This uncovered a few bugs.

let bson = hex::decode(&decode_error.bson).expect("should decode from hex");
Document::from_reader(bson.as_slice()).expect_err(decode_error.description.as_str());

// the from_reader implementation supports deserializing from lossy UTF-8
patrickfreed (author):

See RUST-886

}

// native_to_bson( bson_to_native(dB) ) = cB

patrickfreed (author):

These were moved earlier in the file so that they run even without the decimal128 feature flag.

@patrickfreed patrickfreed marked this pull request as ready for review July 2, 2021 20:00
  values:
    - id: "min"
-     display_name: "1.43 (minimum supported version)"
+     display_name: "1.48 (minimum supported version)"
Reviewer:

SGTM.

src/de/raw.rs Outdated
/// Read the next element type and update the root deserializer with it.
///
/// Returns `Ok(None)` if the document has been fully read and has no more elements.
fn read_next_tag(&mut self) -> Result<Option<ElementType>> {
Reviewer:

I think this might be easier to follow if Deserializer had a fn that read, parsed the tag, and updated current_type, and this one called that and checked/updated length_remaining.

let subtype = BinarySubtype::from(read_u8(&mut self.bytes)?);
match subtype {
BinarySubtype::Generic => {
visitor.visit_borrowed_bytes(self.bytes.read_slice(len as usize)?)
Reviewer:

Making sure I'm following: this means that generic Binary values will deserialize to &[u8], whereas all other subtypes will be a {"$binary": {"subType": ..., "base64": ...}} map structure, right? Which I guess means a data structure intended to be deserialized from raw BSON must conform to either the generic shape or the everything-else shape, and can't accept both.

src/de/raw.rs Outdated
Ok(())
}

fn read_cstr(&mut self) -> Result<&'a str> {
Reviewer:

Why are utf8 errors here a hard failure rather than the lossy behavior of read_str (or the other way around)?

patrickfreed (author):

Since the common case for invalid UTF-8 was server error message values rather than keys, I opted to only do the lossy handling for values. That said, it probably makes more sense to be consistent throughout. Once we decide on a way forward for the lossy UTF-8 situation, I'll unify this with read_str.
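A std-only sketch of the two behaviors being contrasted here, using `std::str::from_utf8` for the strict path and `String::from_utf8_lossy` for the lossy one:

```rust
fn main() {
    // Invalid UTF-8: a truncated multi-byte sequence, like the
    // invalidly truncated strings described in SERVER-24007.
    let bytes: &[u8] = b"hello \xe2\x28";

    // Strict decoding (the read_cstr behavior): a hard failure.
    assert!(std::str::from_utf8(bytes).is_err());

    // Lossy decoding (the read_str behavior): invalid sequences are
    // replaced with U+FFFD (the replacement character) instead of
    // failing the whole deserialization.
    let lossy = String::from_utf8_lossy(bytes);
    assert!(lossy.contains('\u{FFFD}'));
    assert!(lossy.starts_with("hello "));
}
```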

/// - deserializing from the serialized document produces `expected_value`
/// - round trip through raw BSON:
/// - deserializing a `T` from the raw BSON version of `expected_doc` produces `expected_value`
/// - desierializing a `Document` from the raw BSON version of `expected_doc` produces
Reviewer:

Typo: "deserializing"

patrickfreed (author):

fixed

  values:
    - id: "min"
-     display_name: "1.43 (minimum supported version)"
+     display_name: "1.48 (minimum supported version)"
Reviewer:

sgtm as well

"a": 1,
"b": 2,
};
bson::from_document::<Foo>(doc.clone()).expect_err("extra filds should cause failure");
Reviewer:

typo here ("filds" -> "fields")

patrickfreed (author):

fixed

impl From<f32> for Bson {
    fn from(a: f32) -> Bson {
-        Bson::Double(a as f64)
+        Bson::Double(a.into())
Reviewer:

is this just a stylistic change?

patrickfreed (author):

Yep, to signify that this is in fact lossless, whereas `as` can be lossy depending on the conversion.
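A std-only illustration of the distinction: `From`/`Into` is only implemented for conversions that cannot lose information, while `as` compiles for lossy ones too.

```rust
fn main() {
    // f32 -> f64 is lossless, which is why `Into` exists for it.
    let x: f32 = 0.1;
    let y: f64 = x.into();
    assert_eq!(y, f64::from(x));

    // By contrast, `as` also compiles for conversions that can lose
    // information: f64 -> f32 saturates out-of-range values.
    let z = 1e300_f64 as f32;
    assert!(z.is_infinite());
}
```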

/// This is mainly useful when reading raw BSON returned from a MongoDB server, which
/// in rare cases can contain invalidly truncated strings (https://jira.mongodb.org/browse/SERVER-24007).
/// For most use cases, `bson::from_slice` can be used instead.
pub fn from_reader_utf8_lossy<R, T>(reader: R) -> Result<T>
patrickfreed (author):

per our discussion in slack, I added new functions for the lossy UTF-8 approach.

cc @abr-egn @isabelatkinson

@patrickfreed patrickfreed merged commit 7ccf82b into mongodb:master Jul 8, 2021