-
Notifications
You must be signed in to change notification settings - Fork 25
feat: PlottableProtocol support #476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
It looks like you first add properties directly on interface layer of the axis class and then wrap that data in a Option object on the Python layer on top of the interface. Wouldn't it be easier to make the Option object directly wrap the integer returned by |
|
The point of the changes in the second commit (other than trying to simplify the C++ class sitting on top of a single unsigned value that had to have a pickle function, copy/deepcopy, etc.) was to make "discrete" available in the .options structure (which is now probably closer to A few ideas (referring to the Python Histogram, since the rest is internal detail that we can change):
I'm pretty open to any direction, just started playing with it last night. I originally liked 4, though I like 1 better than I did originally in the Protocol discussion. @jpivarski didn't really like the name "options", since he thought that sounded like input options, rather than "properties" that you query. (In Boost.Histogram, of course, they are something you can input, while they are preset in boost-histogram). Thoughts? |
|
If we're voting, I'd vote for We can also drop |
|
There are two things - the Protocol, and boost-histogram. I think for the Protocol @HDembinski was neutral on the nesting, I was for but have changed to be against, and you are against - so let's remove .options from the Protocol. Then Hans can pick whatever direction he'd like to go with boost-histogram. Probably option 1 or 2 as listed above? |
|
I was neutral on options, but they have been growing on me. The options all logically belong together, so it makes sense to have them grouped. It also makes tab-completion on the axis less cluttered. Only library developers will query options. Normal users who write scripts won't have the need to query them, they know what options they have set. Since they are not queried frequently, they can and should sit deeper in the interface hierarchy. |
In C++,
This is my preferred compromise between exactness and simplicity. I think we wouldn't even worry about the distinction between options and traits if these were not technically implemented differently in C++. For Python, the distinction seems irrelevant, so grouping them all in .options, which describes what the axis can do, makes sense. |
|
I would prefer to not to discuss things here and in the uhi repo simultaneously. |
|
Okay, that makes sense - we'll go with option 4 then. Thanks! Don't worry about answering twice, I can update UHI's proposal from the discussion here. |
1c278bc to
e9b0e6f
Compare
e9b0e6f to
df3823d
Compare
|
The Reminder for anyone watching this - there are a few more items to work on before the next release, and I want to help get mplhep and histoprint ready for this before or around the time of a new release. |
|
@HDembinski feel free to check it over! Check the docs changes, for example. In the current state, the only thing that invalidates a simple storage's variance is a weighed fill. Specifically, there are three ways someone can set values: h.fill(1, weight=0.5) # 1. Should invalidate .variances()
h[...] = array # 2. ?
h.view()[...] = array # 3. Cannot invalidate .variances()We can't control option 3, but that's not an issue - if someone messes with the view, they are playing with raw access to our memory anyway. The question is what to do for line 2 - I think for consistency, we should follow line 3; but it's a close call, very recently I would have said any set operation should invalidate variances as well. Happy to go either way if you have a preference on line 2. For a library that produces histograms like uproot/coffea/etc (which is the most common place I would expect to see memory set), then you can always do a single weight 0 fill to invalidate the variances if you know the data you are loading has had lossy weighed fills on it. PS: In place operations like scaling also can change the values, didn't include that above, currently they do not invalidate the |
I agree with you here generally. I also find it difficult to decide on 2), but agree with your assessment. We don't know what the user assigns. They might assign counts, so this operation could leave one with a histogram that has perfectly valid variances. Since we don't want to offer a public method to "restore" the variances once they are gone, we should err on the side of providing them while they are invalid rather than removing them when they are actually valid.
Ideally, (in-place) scaling should switch the storage from Double() to Weight() before doing the scaling operation, then the variance afterwards would be correct. If that's not possible .variances() should be invalided by that operation. |
|
|
||
| * ``h.kind``: The bh.Kind of the histogram (COUNT or MEAN) | ||
| * ``h.values()``: The value (as given by the kind) | ||
| * ``h.variances()``: The variance in the value (None if an unweighed histogram was filled with weights) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wording is not correct, because .variances() does not give you the variance of the mean if kind == MEAN it gives the variances of the samples. You need to divide that by .counts() to get the variance of the mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we decided in the PlottableProtocol that .variances() is always the variance of .value(), so that a "dumb" plotter could just plot values and variances and ignore the .kind and get something reasonable. It is also simpler if ".values" is the value of the Kind, and the ".variances" is the variances of the Kind. So in this case, the wording is correct, but the value is incorrect, it should be calculated as .view()["variance"] / .view()["count"]? And a user wishing to get the sample variance would multiply by the .counts()?
I could easily be wrong here, though, either/both about what we decided and what is best. @jpivarski what does Uproot return for .variances() here for a TProfile?
63299c9 to
e796c0d
Compare
|
@HDembinski I've fixed Happy to hear arguments in either direction, though. Maybe implementing this in mplhep will help clear up this confusion? I'm also preparing "UHI" so we don't have to duplicate documentation and the protocol. Edit: Inplace validation still needed; I also still need to compare to uproot4. I finished teaching a Python course, so should have a bit more time in the next few days. |
a3e8f9d to
dbd2e5b
Compare
|
Sorry for the late reply, I am drowning in work right now. I am not sure whether we understand the same under "always return the variance". The rationale in Boost.Histogram was to always provide the variance that requires least computation. In case of the mean accumulator, the variance of the samples is computed. Analog for the weighted_mean accumulator. Since variance() here is a secondary interface, you don't have to follow that lead. It is ok to return the variance of the mean instead of the variance of the samples. I am not sure what the average user wants. For statistical analysis you want the variance of the mean, not the variance of the samples. Returning the variance of the mean is more consistent wrt other storages. I don't feel strongly either way, I am leaning slightly toward returning variance of the mean, but as long as it is correctly documented what .variance() returns both options are ok. |
…mple Don't return values if non valid
dbd2e5b to
6a70bbf
Compare
|
Invalidating the variance on an inplace operation was a one line fix. :) |
|
Let's get this in, then iterate a bit as needed. It is the variance of the mean at the moment. I'd like to compare with ROOT in the tests, but converting is not completely trivial. This PR is big enough. :) |
Adds PlottableProtocol support to boost-histogram.
See the Uhi repo, such as this PR: scikit-hep/uhi#2 and later updates.
TODO:
h[...] =..values()and such instead of.view().We might want to loosen the protocol slightly to specify a Sequence is returned for
__getitem__on non-discrete axis; it does not have to be exactly a Tuple. Specifically, something like uproot that does not support non-discrete axis could make this a NumPy 2D array with a few extra properties, and that should still pass the Protocol. (With a mix of axis types, this would be still representable by an AwkwardArray, technically (edit: except for needing to flatten the 1 element return).We should also ideally do a static check against typing/plottable.py, but that likely will wait till we have more types and the UHI repo gets a few other useful bits and moves to scikit-hep. I've vendored in the Protocol for the docs for now.
Closes #423.
TODO 2: