Replies: 3 comments 6 replies
-
|
After some research I ended up with the following conclusion regarding my previously written 3-point solution for vector + awkward-array:
In [1]: class MyBehavior(ak.Array):
...: @property
...: def foo(self):
...: return "It's me hi, I'm the behavior, it's me"
...:
...: def __init__(self, *args, **kwargs):
...: super().__init__(*args, **kwargs)
...: print("Constructor was called!")
...:
In [2]: ak.behavior["mybehavior"] = MyBehavior
In [3]: arr = ak.with_parameter(ak.Array([1,2,3]), "__list__", "mybehavior")
In [4]: arr
Out[4]: <MyBehavior [1, 2, 3] type='3 * int64[parameters={"__list__": "mybehavior"}]'>
In [5]: arr.foo
Out[5]: "It's me hi, I'm the behavior, it's me"
In [6]: # no print happened for the __init__ :(I can add a The idea would then be that vector could implement this hook per behavior and ensure there are no broken states. If someone uses The idea of ak.Array validation has also been discussed by @agoose77 and @jpivarski before here: scikit-hep/awkward#1483 (in fact in the scope of the vector development). As far as I can understand from the issue, @agoose77 also arrived at the conclusion that this validation should become part of the behavior system. I am pretty sure that my proposed solution is safe and also not very invasive in awkward. We can discuss this in today's AS meeting, if everyone is happy with this solution, I'll open a draft PR in awkward-array. Example behavior-level ak.Array validation: class MyBehavior(ak.Array):
def __awkward_post_init__(self):
print("I can validate myself here")
ak.behavior["mybehavior"] = MyBehavior
arr = ak.with_parameter(ak.Array([1,2,3]), "__list__", "mybehavior")
# -> "I can validate myself here"
|
Beta Was this translation helpful? Give feedback.
-
|
Peter's proposal definitely takes care of the validation, but then there is the matter of when a RecordArray has an ambiguous set of fields for what concerns the definition of the vector. Do we have a mechanism to resolve which combination of record fields should become the canonical coordinates for a vector object? |
Beta Was this translation helpful? Give feedback.
-
|
Behavior validation is now possible, see: scikit-hep/awkward#3710 for more details. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
We all know for example that you need 4 coordinates to define a 4-vector. You can say for example
and that's good. Some names map to the same coordinate for example like rho maps to pt and vector let's you know of that.
Or if you try to define mass and energy for example together.
That's all good but does not always work properly. For example vector lets you do this
was
eorEused here as the energy? This should have raised an error.We have so far been talking about vector objects only so let's talk about arrays. Vector let's you construct vectors out of numpy arrays using
vector.arrayfor example. But the following does not error at all:Let's see what happens in the awkward array case though.
This one properly errors. Same for
vector.zipSo issue number one is that errors are not raised properly for vector constructor methods defined withhin the package. I'm attempting to solve that in #659 already and that's not the major problem here.
The major problem is what happens when users do not use the vector constructor methods. What happens when they do
ak.zip(..., with_name="Momentum4D")and just attach the vector behavior like that.In that case, there are no checks happening. Awkward just blindly assigns a behavior because it does not know anything about vector and what behaviors it carries over. It just knows to attach a string
"Momentum4D"without any knowledge of what that implies.So people can do things like these without knowing the implications just because they can't know all these vector aliases.
Furthermore, people often use the
__setitem__syntax to add new fields. The underlying layout of an awkward array is immutable so a new one gets created but the users get a feeling of mutability even though it's fake. In CMS, people may often dojets["rho"] = events.Rho.fixedGridRhoAlland that messes up yourjet.ptjust because they can't blindly know thatrhomaps topt. If your answer is "go read the vector docs" sure I can get that but it gets more tricky when the experiments define such things in the root files.For example, when people open root files with coffea, coffea assigns 4-vector behaviors on the objects.
Let's say we have
Elecron_pt/phi/eta/masson the root file. Coffea will make the electron collection a 4-vector.However, the experiment may decide one day to add
Electron_energyto the root file. This has actually happened for a custom NanoAOD format.energyin this case was not the energy measured by the rest of the 4-vector coordinates but it was the energy measured by just the ECAL. What happened in this case, coffea/awkward blindly assigned the behavior and the 4-vector silently became a pt/eta/phi/energy 4-vector with the wrong energy. So people would get silently wrong numbers.And these things can happen by any experiment any day and we just can't keep track of everything.
We need a way to properly error when behaviors are assigned and the vector constructor methods are not used.
Awkward Arrays are immutable. Even when you use a
__setitem__syntax, a newRecordArraylayout gets created under-neath. So...if we could have checks of the existing fields and properly error at every singleRecordArraycreation (inside its__init__method), I believe that would solve the problem. I don't know how to immediately do it though. It's weird if awkward arrays have checks forvectoreven thoughvectoris not a dependency. You are mixing packages in a weird way there.Furthermore, even if that was the case, we can check for the "known" vector behaviors. But what about custom behaviors? Behaviors can be subclassed and any code can define its own behaviors. Coffea does that all the time. You can do
ak.zip(..., with_name="PtEtaPhiMCandidate")wherePtEtaPhiMCandidateis something coffea creates as a subclass of Momentum4D. How to do that is explained here: https://vector.readthedocs.io/en/latest/src/awkward.html#Advanced:-subclassing-Awkward-Vector-behaviorsSo what should we check and error for? Everything? We probably don't even know what everything is in this case but that's the ideal solution to me at least.
I'm just writing this up to start a discussion. This is my current understanding of the situation. I would like to say that it is a bit of a serious problem though because A) if users don't understand vector aliases, they will get silently wrong numbers and we can't control what they understand and B) it's way worse if branches with the wrong name are already present in the root file because we don't control that and we'd need to hunt that down.
I will say that in CMS, it's likely that someone got some scale factors slightly wrong only because
Electron_energywas present in the root file and that's not something that should be allowed just by the construction of our tools IMO.We need less footguns 😃.
Beta Was this translation helpful? Give feedback.
All reactions