Skip to content

Expensive eager __repr__ calls when serializing untagged unions #1649

@JoshKarpel

Description

@JoshKarpel

Hi!

I was doing some profiling recently and noticed that our application was spending a lot of time running __repr__()s inside model_dump_json():

Image

(snippet of a flamegraph at the bottom of the call stack, sorry, can't show the full graph :( )

It turns out that the model we're serializing has an untagged union in it. Here's a reproduction where I've got an example model that does have the discriminator, and one that doesn't (yes, we really should have a discriminator here in our app code, but I think the performance issue is not really about whether we have a discriminator, it's purely about the order of operations in the non-tagged-union serialization code).

from typing import Union, Annotated, Literal

from pydantic import BaseModel, Field

class A(BaseModel):
    type: Literal["a"] = "a"

    def __repr__(self):
        print("A.__repr__ called")
        super().__repr__()


class B(BaseModel):
    type: Literal["b"] = "b"

    def __repr__(self):
        print("B.__repr__ called")
        super().__repr__()


AB_NO_DISCRIMINATOR = Annotated[Union[A, B], Field(...)]
AB_WITH_DISCRIMINATOR = Annotated[Union[A, B], Field(..., discriminator="type")]


class NotDiscriminated(BaseModel):
    field: list[AB_NO_DISCRIMINATOR]


class Discriminated(BaseModel):
    field: list[AB_WITH_DISCRIMINATOR]


a = A()
b = B()

# When it's not a discriminated union,
# the reprs of the models that aren't first in the union get evaluated during serialization!

not_discriminated = NotDiscriminated(field=[a, b])
discriminated = Discriminated(field=[a, b])

print("##################")
print(f"{not_discriminated.model_dump_json()=}")

print("##################")
print(f"{discriminated.model_dump_json()=}")

If you run this, you'll see:

##################
B.__repr__ called
not_discriminated.model_dump_json()='{"field":[{"type":"a"},{"type":"b"}]}'
##################
discriminated.model_dump_json()='{"field":[{"type":"a"},{"type":"b"}]}'

This is surprising because I would only expect to need B's repr during serialization if something went wrong and an error message was generated, but there are no errors or warnings generated by this example. Note that A appears first in the Union and its repr is not called.

I started tracing through the code and it looks like what's happening is that because we have an untagged union, we end up in this loop trying to find the serializer that we should use:

for comb_serializer in choices {
match selector(comb_serializer, &new_extra) {
Ok(v) => return Ok(Some(v)),
Err(err) => errors.push(err),
}
}

For the first item in the field, which is an A, that loop first tries to serialize it as an A since that's the first serializer in the union. This eventually gets to

} else if self.allow_value(value, &model_extra)? {
let inner_value = self.get_inner_value(value, &model_extra)?;
self.serializer.to_python(&inner_value, include, exclude, &model_extra)
, which is allowed via
SerCheck::Strict => Ok(value.get_type().is(class)),
because the first item is in fact an A.

However, for the second item in the list, which is a B, that first loop iteration still tries to serialize it using A's serializer first (since A is first in the union). It gets to the same point, but allow_value fails, putting us in the else case here

} else {
extra.warnings.on_fallback_py(self.get_name(), value, &model_extra)?;
infer_to_python(value, include, exclude, &model_extra)
}
. It seems like extra.check is enabled at this point, because
} else if extra.check.enabled() {
let type_name = value
.get_type()
.qualname()
.unwrap_or_else(|_| PyString::new(value.py(), "<unknown python object>"));
let value_str = truncate_safe_repr(value, None);
Err(PydanticSerializationUnexpectedValue::new_err(Some(format!(
"Expected `{field_type}` but got `{type_name}` with value `{value_str}` - serialized value may not be as expected"
))))
runs and generates an error message about unexpected types, which includes the repr of the object - this is where the repr call happens!

However, that Err bounces back up to the serialization choice loop at

for comb_serializer in choices {
match selector(comb_serializer, &new_extra) {
Ok(v) => return Ok(Some(v)),
Err(err) => errors.push(err),
}
}
, which stashes it away and advances to the next potential serializer, which is for B and therefore succeeds on the B instance. If all serializers had failed we would need that error message, but since one succeeded, it ended up being irrelevant. The real union in our app has many options, each of which is a somewhat-nested model, so that's why we ended up spending so much time evaluating reprs.

If there's a discriminator, this all gets bypassed because there's no loop to discover the right serializer. Obviously it would be preferable to have a tagged union here, but it seems like the work of generating the reprs for errors that won't be used is still unnecessary in the case where you don't have one for whatever reason. No reason for untagged unions to be slower than they can be!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions