-
Notifications
You must be signed in to change notification settings - Fork 300
Description
Hi!
I was doing some profiling recently and noticed that our application was spending a lot of time running __repr__()
s inside model_dump_json()
:
(snippet of a flamegraph at the bottom of the call stack, sorry, can't show the full graph :( )
It turns out that the model we're serializing has an untagged union in it. Here's a reproduction where I've got an example model that does have the discriminator, and one that doesn't (yes, we really should have a discriminator here in our app code, but I think the performance issue is not really about whether we have a discriminator, it's purely about the order of operations in the non-tagged-union serialization code).
from typing import Union, Annotated, Literal
from pydantic import BaseModel, Field
class A(BaseModel):
type: Literal["a"] = "a"
def __repr__(self):
print("A.__repr__ called")
super().__repr__()
class B(BaseModel):
type: Literal["b"] = "b"
def __repr__(self):
print("B.__repr__ called")
super().__repr__()
AB_NO_DISCRIMINATOR = Annotated[Union[A, B], Field(...)]
AB_WITH_DISCRIMINATOR = Annotated[Union[A, B], Field(..., discriminator="type")]
class NotDiscriminated(BaseModel):
field: list[AB_NO_DISCRIMINATOR]
class Discriminated(BaseModel):
field: list[AB_WITH_DISCRIMINATOR]
a = A()
b = B()
# When it's not a discriminated union,
# the reprs of the models that aren't first in the union get evaluated during serialization!
not_discriminated = NotDiscriminated(field=[a, b])
discriminated = Discriminated(field=[a, b])
print("##################")
print(f"{not_discriminated.model_dump_json()=}")
print("##################")
print(f"{discriminated.model_dump_json()=}")
If you run this, you'll see:
##################
B.__repr__ called
not_discriminated.model_dump_json()='{"field":[{"type":"a"},{"type":"b"}]}'
##################
discriminated.model_dump_json()='{"field":[{"type":"a"},{"type":"b"}]}'
This is surprising because I would only expect to need B
's repr during serialization if something went wrong and an error message was generated, but there are no errors or warnings generated by this example. Note that A
appears first in the Union
and its repr is not called.
I started tracing through the code and it looks like what's happening is that because we have an untagged union, we end up in this loop trying to find the serializer that we should use:
pydantic-core/src/serializers/type_serializers/union.rs
Lines 87 to 92 in 298e4ea
for comb_serializer in choices { | |
match selector(comb_serializer, &new_extra) { | |
Ok(v) => return Ok(Some(v)), | |
Err(err) => errors.push(err), | |
} | |
} |
For the first item in the field
, which is an A
, that loop first tries to serialize it as an A
since that's the first serializer in the union. This eventually gets to
pydantic-core/src/serializers/type_serializers/model.rs
Lines 199 to 201 in 8c59fa2
} else if self.allow_value(value, &model_extra)? { | |
let inner_value = self.get_inner_value(value, &model_extra)?; | |
self.serializer.to_python(&inner_value, include, exclude, &model_extra) |
SerCheck::Strict => Ok(value.get_type().is(class)), |
A
.
However, for the second item in the list, which is a B
, that first loop iteration still tries to serialize it using A
's serializer first (since A
is first in the union). It gets to the same point, but allow_value
fails, putting us in the else
case here
pydantic-core/src/serializers/type_serializers/model.rs
Lines 202 to 205 in 8c59fa2
} else { | |
extra.warnings.on_fallback_py(self.get_name(), value, &model_extra)?; | |
infer_to_python(value, include, exclude, &model_extra) | |
} |
extra.check
is enabled at this point, because pydantic-core/src/serializers/extra.rs
Lines 417 to 426 in 8c59fa2
} else if extra.check.enabled() { | |
let type_name = value | |
.get_type() | |
.qualname() | |
.unwrap_or_else(|_| PyString::new(value.py(), "<unknown python object>")); | |
let value_str = truncate_safe_repr(value, None); | |
Err(PydanticSerializationUnexpectedValue::new_err(Some(format!( | |
"Expected `{field_type}` but got `{type_name}` with value `{value_str}` - serialized value may not be as expected" | |
)))) |
However, that Err
bounces back up to the serialization choice loop at
pydantic-core/src/serializers/type_serializers/union.rs
Lines 87 to 92 in 298e4ea
for comb_serializer in choices { | |
match selector(comb_serializer, &new_extra) { | |
Ok(v) => return Ok(Some(v)), | |
Err(err) => errors.push(err), | |
} | |
} |
B
and therefore succeeds on the B
instance. If all serializers had failed we would need that error message, but since one succeeded, it ended up being irrelevant. The real union in our app has many options, each of which is a somewhat-nested model, so that's why we ended up spending so much time evaluating reprs.
If there's a discriminator, this all gets bypassed because there's no loop to discover the right serializer. Obviously it would be preferable to have a tagged union here, but it seems like the work of generating the reprs for errors that won't be used is still unnecessary in the case where you don't have one for whatever reason. No reason for untagged unions to be slower than they can be!