Potentially silently wrong results from what vector allows to the users #660

ikrommyd · 2025-11-03T19:07:31Z

ikrommyd
Nov 3, 2025

We all know that you need four coordinates to define a 4-vector. For example, you can write

In [3]: vector.obj(pt=1.1, phi=2.2, eta=3.3, mass=4.4)
Out[3]: MomentumObject4D(pt=1.1, phi=2.2, eta=3.3, mass=4.4)

and that's good. Some names map to the same coordinate, for example rho maps to pt, and vector lets you know about that.

In [4]: vector.obj(pt=1.1, phi=2.2, eta=3.3, mass=4.4, rho=5.5)
...
TypeError: duplicate coordinates (through momentum-aliases): 'rho'

The same happens if you try to specify both mass and energy, for example.

In [5]: vector.obj(pt=1.1, phi=2.2, eta=3.3, mass=4.4, energy=5.5)
...
TypeError: specify t= or tau=, but not more than one

That's all good, but it does not always work properly. For example, vector lets you do this:

In [7]: vector.obj(pt=1.1, phi=2.2, eta=3.3, e=4.4, E=5.5)
Out[7]: MomentumObject4D(pt=1.1, phi=2.2, eta=3.3, E=4.4)

Was e or E used here as the energy? This should have raised an error.
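A minimal sketch of the kind of check that would catch this case, in plain Python. The `SLOT_OF` table and the `check_coordinates` function are hypothetical names made up for illustration; the table is a small subset chosen for this example and is not vector's actual alias list or implementation.

```python
from collections import defaultdict

# Hypothetical slot table (illustrative subset, NOT vector's real alias list).
# Every name in the same slot describes the same degree of freedom.
SLOT_OF = {
    "rho": "transverse", "pt": "transverse",
    "phi": "azimuth",
    "eta": "longitudinal", "theta": "longitudinal",
    "t": "temporal", "e": "temporal", "E": "temporal", "energy": "temporal",
    "tau": "temporal", "m": "temporal", "M": "temporal", "mass": "temporal",
}


def check_coordinates(names):
    """Raise TypeError if two names compete for the same coordinate slot."""
    by_slot = defaultdict(list)
    for name in names:
        if name in SLOT_OF:  # unknown names are ordinary fields; ignore them
            by_slot[SLOT_OF[name]].append(name)
    clashes = {slot: found for slot, found in by_slot.items() if len(found) > 1}
    if clashes:
        raise TypeError(f"ambiguous coordinates: {clashes}")
    return dict(by_slot)


check_coordinates(["pt", "phi", "eta", "mass"])  # passes
try:
    check_coordinates(["pt", "phi", "eta", "e", "E"])  # e and E both mean energy
except TypeError as err:
    print(err)
```

Grouping by slot rather than by exact alias means the same check covers both "two spellings of one coordinate" (e and E) and "two coordinates for one degree of freedom" (mass and energy).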

So far we have only talked about vector objects, so let's talk about arrays. Vector lets you construct vectors out of NumPy arrays using vector.array, for example. But the following does not raise an error at all:

In [12]: vector.array(
    ...:     {
    ...:         "pt": np.random.exponential(5, 10000),
    ...:         "phi": np.random.uniform(-np.pi, np.pi, 10000),
    ...:         "eta": np.arccos(np.random.uniform(-1, 1, 10000)),
    ...:         "mass": np.full(10000, 0.000511),
    ...:         "energy": np.full(10000, -999),
    ...:     }
    ...: )
Out[12]:
MomentumNumpy4D([( 5.78627272, -0.31474093, 1.39315   , -999, 0.000511),
                 ( 2.33387069, -1.96905435, 2.27688939, -999, 0.000511),
                 ( 0.57113301,  1.49470835, 1.99019391, -999, 0.000511), ...,
                 (23.70853118,  1.4990086 , 1.4878906 , -999, 0.000511),
                 ( 4.77509547, -3.01717108, 1.16470938, -999, 0.000511),
                 ( 4.22299058, -2.13848499, 1.30976863, -999, 0.000511)],
                shape=(10000,), dtype=[('rho', '<f8'), ('phi', '<f8'), ('theta', '<f8'), ('t', '<i8'), ('tau', '<f8')])

Let's see what happens in the Awkward Array case, though.

In [18]: vector.Array(
    ...:     ak.Array(
    ...:         {
    ...:             "pt": np.random.exponential(5, 10000),
    ...:             "phi": np.random.uniform(-np.pi, np.pi, 10000),
    ...:             "eta": np.arccos(np.random.uniform(-1, 1, 10000)),
    ...:             "mass": np.full(10000, 0.000511),
    ...:             "energy": np.full(10000, -999),
    ...:         }
    ...:     )
    ...: )
...
TypeError: duplicate coordinates (through momentum-aliases): 'pt', 'phi', 'eta', 'mass', 'energy'

This one errors properly. The same goes for vector.zip:

In [19]: vector.zip(
    ...:     {
    ...:         "pt": np.random.exponential(5, 10000),
    ...:         "phi": np.random.uniform(-np.pi, np.pi, 10000),
    ...:         "eta": np.arccos(np.random.uniform(-1, 1, 10000)),
    ...:         "mass": np.full(10000, 0.000511),
    ...:         "energy": np.full(10000, -999),
    ...:     }
    ...: )
...
TypeError: duplicate coordinates (through momentum-aliases): 'pt', 'phi', 'eta', 'mass', 'energy'

So issue number one is that errors are not raised consistently by the vector constructor methods defined within the package. I'm already attempting to solve that in #659, and that's not the major problem here.

The major problem is what happens when users do not use the vector constructor methods at all, for example when they call ak.zip(..., with_name="Momentum4D") and attach the vector behavior that way.
In that case, no checks happen. Awkward blindly assigns the behavior because it knows nothing about vector or what its behaviors imply. It only knows to attach the string "Momentum4D".
So people can do things like the following without understanding the implications, simply because they can't know all of vector's aliases.

In [39]: v = ak.zip(
    ...:     {
    ...:         "pt": np.random.exponential(5, 10000),
    ...:         "phi": np.random.uniform(-np.pi, np.pi, 10000),
    ...:         "eta": np.arccos(np.random.uniform(-1, 1, 10000)),
    ...:         "mass": np.full(10000, 0.000511),
    ...:         "energy": np.full(10000, 999),
    ...:     },
    ...:     with_name="Momentum4D",
    ...: )

In [40]: v.pt
Out[40]: <Array [2.86, 6.3, 16.7, 0.389, ..., 1.16, 0.523, 8.03] type='10000 * float64'>

In [41]: v.mass
Out[41]: <Array [999, 999, 998, 999, ..., 999, 999, 999, 997] type='10000 * float64'>

In [42]: v = ak.zip(
    ...:     {
    ...:         "pt": np.random.exponential(5, 10000),
    ...:         "phi": np.random.uniform(-np.pi, np.pi, 10000),
    ...:         "eta": np.arccos(np.random.uniform(-1, 1, 10000)),
    ...:         "mass": np.full(10000, 0.000511),
    ...:         "rho": np.full(10000, 999),
    ...:     },
    ...:     with_name="Momentum4D",
    ...: )

In [43]: v.pt
Out[43]: <Array [999, 999, 999, 999, 999, ..., 999, 999, 999, 999] type='10000 * int64'>

In [44]: v.mass
Out[44]: <Array [0.000511, 0.000511, ..., 0.000511, 0.000511] type='10000 * float64'>

Furthermore, people often use the __setitem__ syntax to add new fields. The underlying layout of an awkward array is immutable, so a new one gets created, but users get a feeling of mutability even though it's fake. In CMS, people may often do
jets["rho"] = events.Rho.fixedGridRhoAll, and that messes up your jet.pt simply because they have no way of knowing that rho maps to pt. If your answer is "go read the vector docs", sure, I get that, but it gets trickier when the experiments define such names in the ROOT files.
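To make the shadowing concrete, here is a toy model in plain Python. `ToyJet` and its lookup order are assumptions made up for illustration (the order is chosen to mirror the output above, where rho won over pt); this is not how awkward or vector are implemented.

```python
from types import MappingProxyType


class ToyJet:
    """Toy immutable record; NOT awkward's or vector's implementation."""

    def __init__(self, fields):
        # Read-only view, so fields cannot be changed after construction.
        self._fields = MappingProxyType(dict(fields))

    def with_field(self, name, value):
        # "Mutation" returns a new record, like __setitem__ on an awkward array.
        return ToyJet({**self._fields, name: value})

    @property
    def pt(self):
        # Assumed lookup order: the generic name "rho" wins over the alias "pt".
        for key in ("rho", "pt"):
            if key in self._fields:
                return self._fields[key]
        raise AttributeError("pt")


jet = ToyJet({"pt": 45.0, "eta": 1.2, "phi": 0.3, "mass": 8.0})
shadowed = jet.with_field("rho", 21.7)  # e.g. an event-level pileup density

print(jet.pt)       # 45.0
print(shadowed.pt)  # 21.7, the unrelated "rho" value silently replaced pt
```

Nothing in the toy raises or warns, which is exactly the failure mode: the new field is perfectly valid as a field, and only the property lookup gives it a second meaning.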

For example, when people open ROOT files with coffea, coffea assigns 4-vector behaviors to the objects.
Let's say we have Electron_pt/phi/eta/mass in the ROOT file. Coffea will make the electron collection a 4-vector.
However, the experiment may decide one day to add Electron_energy to the ROOT file. This has actually happened for a custom NanoAOD format. The energy in this case was not the energy implied by the rest of the 4-vector coordinates; it was the energy measured by the ECAL alone. What happened was that coffea/awkward blindly assigned the behavior, and the 4-vector silently became a pt/eta/phi/energy 4-vector with the wrong energy. So people would get silently wrong numbers.
These things can happen in any experiment on any day, and we just can't keep track of everything.
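Until the tools raise an error themselves, one stopgap is to scan branch names for overdetermined collections before building any 4-vectors. The sketch below is plain Python; the `TEMPORAL` set and the `overdetermined_collections` helper are hypothetical, and the Collection_field naming convention is assumed from the Electron_pt example above.

```python
from collections import defaultdict

# Hypothetical set of names that all describe the temporal coordinate
# (illustrative only, not vector's actual alias list).
TEMPORAL = {"t", "e", "E", "energy", "tau", "m", "M", "mass"}


def overdetermined_collections(branch_names):
    """Return {collection: temporal fields} for collections with more than one."""
    temporal_by_collection = defaultdict(set)
    for branch in branch_names:
        # Split "Electron_energy" into ("Electron", "_", "energy").
        collection, sep, field = branch.partition("_")
        if sep and field in TEMPORAL:
            temporal_by_collection[collection].add(field)
    return {
        collection: sorted(fields)
        for collection, fields in temporal_by_collection.items()
        if len(fields) > 1
    }


branches = [
    "Electron_pt", "Electron_eta", "Electron_phi", "Electron_mass",
    "Electron_energy",  # the extra branch that changes the 4-vector
    "Muon_pt", "Muon_eta", "Muon_phi", "Muon_mass",
]
print(overdetermined_collections(branches))  # {'Electron': ['energy', 'mass']}
```

A scan like this only flags names it already knows about, so it does not replace validation inside the libraries; it just makes a newly added branch visible before it changes any numbers.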

We need a way to raise a proper error when behaviors are assigned without going through the vector constructor methods.
Awkward Arrays are immutable. Even when you use the __setitem__ syntax, a new RecordArray layout gets created underneath. So if we could check the existing fields and raise an error at every single RecordArray creation (inside its __init__ method), I believe that would solve the problem. I don't immediately know how to do it, though. It would be weird for awkward arrays to contain checks for vector when vector is not a dependency; that mixes the packages in a strange way.
Furthermore, even if we did that, we could only check for the "known" vector behaviors. But what about custom behaviors? Behaviors can be subclassed, and any code can define its own. Coffea does that all the time. You can do ak.zip(..., with_name="PtEtaPhiMCandidate") where PtEtaPhiMCandidate is something coffea creates as a subclass of Momentum4D. How to do that is explained here: https://vector.readthedocs.io/en/latest/src/awkward.html#Advanced:-subclassing-Awkward-Vector-behaviors
So what should we check and raise errors for? Everything? We probably don't even know what "everything" is in this case, but that's the ideal solution, to me at least.

I'm just writing this up to start a discussion. This is my current understanding of the situation. I do think it is a fairly serious problem, though, because A) if users don't understand vector aliases, they will get silently wrong numbers, and we can't control what they understand, and B) it's much worse if branches with a conflicting name are already present in the ROOT file, because we don't control that and would need to hunt it down.
I will say that in CMS, it's likely that someone got some scale factors slightly wrong only because Electron_energy was present in the ROOT file, and IMO that's not something our tools should allow by construction.

We need fewer footguns 😃.

pfackeldey · 2025-11-04T11:29:15Z

pfackeldey
Nov 4, 2025
Maintainer

After some research, I reached the following conclusions regarding my previously written 3-point solution for vector + awkward-array:

1. Awkward Arrays are fully immutable. Anything that adds or deletes a field creates a new high-level Array instance, so that's good! (I don't know why I forgot this...)
2. The distinction between getattr and getitem access does not solve this issue, but it would make it pretty clear whether one is accessing a field or a behavior property. There is definitely value in thinking about a way to clearly differentiate between behavior property access and record-array field access.
3. Behaviors currently can't run __post_init__-like checks, see:

In [1]: class MyBehavior(ak.Array):
    ...:     @property
    ...:     def foo(self):
    ...:         return "It's me hi, I'm the behavior, it's me"
    ...:
    ...:     def __init__(self, *args, **kwargs):
    ...:         super().__init__(*args, **kwargs)
    ...:         print("Constructor was called!")
    ...:

In [2]: ak.behavior["mybehavior"] = MyBehavior

In [3]: arr = ak.with_parameter(ak.Array([1,2,3]), "__list__", "mybehavior")

In [4]: arr
Out[4]: <MyBehavior [1, 2, 3] type='3 * int64[parameters={"__list__": "mybehavior"}]'>

In [5]: arr.foo
Out[5]: "It's me hi, I'm the behavior, it's me"

In [6]: # no print happened for the __init__ :(

I can add an __awkward_post_init__ hook (similar to the __post_init__ of Python's dataclasses) to awkward-array's behavior system that runs after attaching a behavior and before awkward's own validity check. That should be pretty safe, and since ak.Arrays are immutable, one can't really modify self in-place to break it (unless you forcefully want to, but then the validity check is still there).

The idea would then be that vector could implement this hook per behavior and ensure there are no broken states. If someone uses ak.with_field/__setitem__/... to modify the fields, awkward-array automatically runs the __awkward_post_init__ hook again.

$\rightarrow$ that should prohibit ever getting a 'broken' state for vector behaviors + awkward-array.

The idea of ak.Array validation has also been discussed by @agoose77 and @jpivarski before here: scikit-hep/awkward#1483 (in fact in the scope of the vector development). As far as I can understand from the issue, @agoose77 also arrived at the conclusion that this validation should become part of the behavior system.

I am pretty sure that my proposed solution is safe and also not very invasive in awkward. We can discuss this in today's AS meeting. If everyone is happy with this solution, I'll open a draft PR in awkward-array.

Example behavior-level ak.Array validation:

class MyBehavior(ak.Array):
    def __awkward_post_init__(self):
        print("I can validate myself here")
    
ak.behavior["mybehavior"] = MyBehavior

arr = ak.with_parameter(ak.Array([1,2,3]), "__list__", "mybehavior")
# -> "I can validate myself here"

__awkward_post_init__ would run on every ak.Array instance creation (also on ak.with_field/__setitem__/...)
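A toy model of the hook idea in plain Python, to show why immutability makes it sufficient. `ToyArray`, `ToyMomentum`, and `__toy_post_init__` are hypothetical names made up for this sketch; it does not use awkward at all and makes no claim about how the real hook is wired in.

```python
class ToyArray:
    """Toy immutable container; NOT awkward's implementation."""

    def __init__(self, fields):
        self._fields = dict(fields)
        # The hook runs on every construction, so no instance skips validation.
        self.__toy_post_init__()

    def __toy_post_init__(self):
        # Default: nothing to validate.
        pass

    def with_field(self, name, value):
        # Adding a field builds a new instance of the same class,
        # which re-triggers the hook on the new set of fields.
        return type(self)({**self._fields, name: value})


class ToyMomentum(ToyArray):
    def __toy_post_init__(self):
        if "mass" in self._fields and "energy" in self._fields:
            raise TypeError("specify mass or energy, but not both")


electrons = ToyMomentum(
    {"pt": [30.0], "eta": [0.1], "phi": [1.0], "mass": [0.000511]}
)
try:
    electrons.with_field("energy", [29.5])  # would overdetermine the 4-vector
except TypeError as err:
    print(err)  # specify mass or energy, but not both
```

Because every change goes through a constructor, a single hook covers direct construction and field additions alike; there is no separate mutation path to guard.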

0 replies

nsmith- · 2025-11-04T16:02:09Z

nsmith-
Nov 4, 2025
Maintainer

Peter's proposal definitely takes care of the validation, but then there is the matter of what to do when a RecordArray has a set of fields that is ambiguous as far as the definition of the vector is concerned. Do we have a mechanism to resolve which combination of record fields should become the canonical coordinates for a vector object?
A reasonable answer here is that vector should have a clearly documented precedence for selecting the coordinate type, and it is up to the record array builder to ensure they define the minimal set of fields that guarantees the desired coordinates. From the coffea point of view, this likely means we should not impart momentum behavior to collections like coffea.nanoevents.methods.nanoaod.Electron, but rather only expose a .p4 object as a sub-record.
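As a rough illustration of what a documented precedence could look like, here is a plain-Python sketch. The `TEMPORAL_PRECEDENCE` order and the `resolve_temporal` helper are invented for this example; they are not vector's actual rules.

```python
# Hypothetical precedence, highest first; a real library would document its own.
TEMPORAL_PRECEDENCE = ("t", "E", "e", "energy", "tau", "M", "m", "mass")


def resolve_temporal(fields):
    """Return the temporal field that wins, plus the ones that are ignored."""
    present = [name for name in TEMPORAL_PRECEDENCE if name in fields]
    if not present:
        raise ValueError("no temporal coordinate among the fields")
    return present[0], present[1:]


winner, ignored = resolve_temporal(["pt", "eta", "phi", "mass", "energy"])
print(winner)   # energy
print(ignored)  # ['mass']
```

Returning the ignored names as well gives the caller the option to warn or raise instead of dropping them silently.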

6 replies

pfackeldey Nov 4, 2025
Maintainer

Personally, I'm a fan of explicit scoping with e.g. .p4. Then one would not access a behavior property when doing events.Jet.pt, but only when doing events.Jet.p4.pt. This is fine until there's a field called p4 in the original dataset 😅

Also, it's not yet clear to me how one would choose the coordinate system with the scoped .p4 accessor. I guess coffea would hardcode just one?

ikrommyd Nov 4, 2025
Author

Yeah I guess it would be tied to PtEtaPhiMCandidate for example to have a pt-eta-phi-m p4. To be fair, it's a good idea. I am only negative because of how much code this would break. The idea is pretty solid.

nsmith- Nov 4, 2025
Maintainer

Well you can leave fields that are in the NanoAOD file in the top-level object, so events.Jet.pt would work, but say, events.Jet.energy would not (since it is derived) and one must do events.Jet.p4.energy.
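A small sketch of that scoping in plain Python, using dataclasses rather than awkward records. The `Jet` and `P4` classes are hypothetical stand-ins, and the choice of pt/eta/phi/mass as the stored coordinates is an assumption for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class P4:
    """Derived kinematics live here, built from exactly four coordinates."""

    pt: float
    eta: float
    phi: float
    mass: float

    @property
    def energy(self):
        # E^2 = |p|^2 + m^2, with |p| = pt * cosh(eta)
        return math.hypot(self.pt * math.cosh(self.eta), self.mass)


@dataclass(frozen=True)
class Jet:
    """Stored fields stay on the top-level record."""

    pt: float
    eta: float
    phi: float
    mass: float

    @property
    def p4(self):
        # The only place where fields are interpreted as a 4-vector.
        return P4(self.pt, self.eta, self.phi, self.mass)


jet = Jet(pt=3.0, eta=0.0, phi=0.5, mass=4.0)
print(jet.pt)                  # 3.0, a stored field
print(jet.p4.energy)           # 5.0, a derived quantity
print(hasattr(jet, "energy"))  # False, derived names are not on the top level
```

With this layout, an extra stored field named energy on the top-level record could not change what p4.energy returns, because p4 is built from an explicit, fixed set of coordinates.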

pfackeldey Nov 4, 2025
Maintainer

> Well you can leave fields that are in the NanoAOD file in the top-level object, so events.Jet.pt would work, but say, events.Jet.energy would not (since it is derived) and one must do events.Jet.p4.energy.

Yes, good point, and probably people were intending to access the field with events.Jet.pt anyway in the first place.

ikrommyd Nov 4, 2025
Author

Ah yeah, that's more fair. To scope behaviors that are not present as fields. Although we do have a bit of .energy in HiggsDNA let's say but yeah that wouldn't break a lot of people. It would still need a deprecation cycle but it's not extremely invasive.

pfackeldey · 2025-11-05T11:21:58Z

pfackeldey
Nov 5, 2025
Maintainer

Behavior validation is now possible, see: scikit-hep/awkward#3710 for more details.
