feat: PlottableProtocol support #476

henryiii · 2020-11-25T23:43:28Z

Adds PlottableProtocol support to boost-histogram.

See the Uhi repo, such as this PR: scikit-hep/uhi#2 and later updates.

TODO:

Return None for variance if a simple storage is set with weights or manually changed via h[...] = .
Change tests to use traits instead of options
Add to changelog, with upgrade warning (options -> traits)
Add to docs
Check existing docs to see if there are some places to show .values() and such instead of .view().

We might want to loosen the protocol slightly to specify a Sequence is returned for __getitem__ on non-discrete axis; it does not have to be exactly a Tuple. Specifically, something like uproot that does not support non-discrete axis could make this a NumPy 2D array with a few extra properties, and that should still pass the Protocol. (With a mix of axis types, this would be still representable by an AwkwardArray, technically (edit: except for needing to flatten the 1 element return).

We should also ideally do a static check against typing/plottable.py, but that likely will wait till we have more types and the UHI repo gets a few other useful bits and moves to scikit-hep. I've vendored in the Protocol for the docs for now.

Closes #423.

TODO 2:

check and match uproot for mean variance
Add docs cross links
Invalidate weights if inplace operations are used

HDembinski · 2020-11-26T10:50:03Z

It looks like you first add properties directly on interface layer of the axis class and then wrap that data in a Option object on the Python layer on top of the interface. Wouldn't it be easier to make the Option object directly wrap the integer returned by get_options?

henryiii · 2020-11-26T15:16:29Z

The point of the changes in the second commit (other than trying to simplify the C++ class sitting on top of a single unsigned value that had to have a pickle function, copy/deepcopy, etc.) was to make "discrete" available in the .options structure (which is now probably closer to traits than options?). Do you like these traits bundled in with "options" here? In Python, you can't set any of these, so they all are really just output "traits" and options is a way to organize them.

A few ideas (referring to the Python Histogram, since the rest is internal detail that we can change):

Remove .options and just make these properties on .axis (and change the Protocol, we don't have much more time to make changes to the Protocol, as Uproot is released Monday, but this is small and can be done easily). Reduces the available namespace on the axis a bit, but do you really want a user to be able to add ax.growth = ...?
Keep .options but add traits directly on the axis. Provide .circular also on the axis, because only having .options.circular in the protocol seems really strange.
Inject just .discrete into .options - this was hard to do in the current state, as is messed with pickling. Also it is nice to be able to ask all the Boost.Histogram trait questions, even if they are not part of the protocol?
Bundle all traits into .options (the current state of the PR's second commit). I included all three of the Boost.Histogram traits.
Put them in .traits. Or .properties.
Put them in .traits and .traits.options. This is the closest to Boost.Histogram, but is probalby too structured for an ideal Protocol, so probably needs to be mixed with one of the choices above for the Protocol.

I'm pretty open to any direction, just started playing with it last night. I originally liked 4, though I like 1 better than I did originally in the Protocol discussion. @jpivarski didn't really like the name "options", since he thought that sounded like input options, rather than "properties" that you query. (In Boost.Histogram, of course, they are something you can input, while they are preset in boost-histogram).

Thoughts?

jpivarski · 2020-11-26T15:32:06Z

If we're voting, I'd vote for circular and discrete to be properties directly on the axis, without an intermediate object. I don't think the number of these is going to grow so much that we regret not putting it in a separate namespace, and the separate namespace is cumbersome and makes it just a little harder to find these properties.

We can also drop "median". It was just an indication of where this might grow, but it doesn't need to be defined into there's a need for it. But as you pointed out, these need to be strings or some very basic type. An implementation could return enumerations that are equal to the strings they represent.

henryiii · 2020-11-27T15:03:29Z

There are two things - the Protocol, and boost-histogram. I think for the Protocol @HDembinski was neutral on the nesting, I was for but have changed to be against, and you are against - so let's remove .options from the Protocol. Then Hans can pick whatever direction he'd like to go with boost-histogram. Probably option 1 or 2 as listed above?

HDembinski · 2020-11-27T18:56:16Z

I was neutral on options, but they have been growing on me. The options all logically belong together, so it makes sense to have them grouped. It also makes tab-completion on the axis less cluttered. Only library developers will query options. Normal users who write scripts won't have the need to query them, they know what options they have set. Since they are not queried frequently, they can and should sit deeper in the interface hierarchy.

HDembinski · 2020-11-27T19:04:17Z

The point of the changes in the second commit (other than trying to simplify the C++ class sitting on top of a single unsigned value that had to have a pickle function, copy/deepcopy, etc.) was to make "discrete" available in the .options structure (which is now probably closer to traits than options?). Do you like these traits bundled in with "options" here? In Python, you can't set any of these, so they all are really just output "traits" and options is a way to organize them.

In C++, discrete is a trait that the compiler can figure out at compile-time by inspecting the axis interface. In Python it makes sense to have it .options for the lack of a better place.

4. Bundle all traits into `.options` (the current state of the PR's second commit). I included all three of the Boost.Histogram traits.

This is my preferred compromise between exactness and simplicity. I think we wouldn't even worry about the distinction between options and traits if these were not technically implemented differently in C++. For Python, the distinction seems irrelevant, so grouping them all in .options, which describes what the axis can do, makes sense.

HDembinski · 2020-11-27T19:06:35Z

I would prefer to not to discuss things here and in the uhi repo simultaneously.

henryiii · 2020-11-27T20:07:25Z

Okay, that makes sense - we'll go with option 4 then. Thanks! Don't worry about answering twice, I can update UHI's proposal from the discussion here.

henryiii · 2020-12-22T22:32:43Z

The __dict__ in __slots__ trick is perfect for wrapping a C++ class. You can keep the wrapped item in __slots__, so it's not in your __dict__. Used that to simplify the Histogram implementation now too. Finished the rest of the items, I think this is ready!

Reminder for anyone watching this - there are a few more items to work on before the next release, and I want to help get mplhep and histoprint ready for this before or around the time of a new release.

henryiii · 2020-12-22T22:41:10Z

@HDembinski feel free to check it over! Check the docs changes, for example.

In the current state, the only thing that invalidates a simple storage's variance is a weighed fill. Specifically, there are three ways someone can set values:

h.fill(1, weight=0.5)    # 1. Should invalidate .variances()
h[...] = array           # 2. ?
h.view()[...] = array    # 3. Cannot invalidate .variances()

We can't control option 3, but that's not an issue - if someone messes with the view, they are playing with raw access to our memory anyway. The question is what to do for line 2 - I think for consistency, we should follow line 3; but it's a close call, very recently I would have said any set operation should invalidate variances as well. Happy to go either way if you have a preference on line 2.

For a library that produces histograms like uproot/coffea/etc (which is the most common place I would expect to see memory set), then you can always do a single weight 0 fill to invalidate the variances if you know the data you are loading has had lossy weighed fills on it.

PS: In place operations like scaling also can change the values, didn't include that above, currently they do not invalidate the .variances() either.

HDembinski · 2020-12-26T15:23:28Z

@HDembinski feel free to check it over! Check the docs changes, for example.

In the current state, the only thing that invalidates a simple storage's variance is a weighed fill. Specifically, there are three ways someone can set values:
h.fill(1, weight=0.5)    # 1. Should invalidate .variances()
h[...] = array           # 2. ?
h.view()[...] = array    # 3. Cannot invalidate .variances()
We can't control option 3, but that's not an issue - if someone messes with the view, they are playing with raw access to our memory anyway. The question is what to do for line 2 - I think for consistency, we should follow line 3; but it's a close call, very recently I would have said any set operation should invalidate variances as well. Happy to go either way if you have a preference on line 2.

I agree with you here generally. I also find it difficult to decide on 2), but agree with your assessment. We don't know what the user assigns. They might assign counts, so this operation could leave one with a histogram that has perfectly valid variances. Since we don't want to offer a public method to "restore" the variances once they are gone, we should err on the side of providing them while they are invalid rather than removing them when they are actually valid.

PS: In place operations like scaling also can change the values, didn't include that above, currently they do not invalidate the .variances() either.

Ideally, (in-place) scaling should switch the storage from Double() to Weight() before doing the scaling operation, then the variance afterwards would be correct. If that's not possible .variances() should be invalided by that operation.

docs/usage/histogram.rst

docs/usage/plotting.rst

HDembinski · 2020-12-26T15:34:41Z

docs/usage/plotting.rst

+
+* ``h.kind``: The bh.Kind of the histogram (COUNT or MEAN)
+* ``h.values()``: The value (as given by the kind)
+* ``h.variances()``: The variance in the value (None if an unweighed histogram was filled with weights)


This wording is not correct, because .variances() does not give you the variance of the mean if kind == MEAN it gives the variances of the samples. You need to divide that by .counts() to get the variance of the mean.

I thought we decided in the PlottableProtocol that .variances() is always the variance of .value(), so that a "dumb" plotter could just plot values and variances and ignore the .kind and get something reasonable. It is also simpler if ".values" is the value of the Kind, and the ".variances" is the variances of the Kind. So in this case, the wording is correct, but the value is incorrect, it should be calculated as .view()["variance"] / .view()["count"]? And a user wishing to get the sample variance would multiply by the .counts()?

I could easily be wrong here, though, either/both about what we decided and what is best. @jpivarski what does Uproot return for .variances() here for a TProfile?

docs/usage/plotting.rst

docs/usage/quickstart.rst

include/bh_python/register_axis.hpp

src/boost_histogram/_internal/axis.py

src/boost_histogram/_internal/hist.py

henryiii · 2021-01-19T23:03:26Z

@HDembinski I've fixed ~~everything~~, I hope to merge soon, we can iterate before release. I went with matching the description we agreed on in UHI: variance is the variance of the value, regardless of the Kind. I don't know if that's best, that's just how I understood our agreement, and how I would expect this to be most useful for plotting; "dumb" plotters should get as close as possible by just plotting value and variance; that's what we worked on for the return value from variance when the storage is unweighed, as well.

Happy to hear arguments in either direction, though.

Maybe implementing this in mplhep will help clear up this confusion?

I'm also preparing "UHI" so we don't have to duplicate documentation and the protocol.

Edit: Inplace validation still needed; I also still need to compare to uproot4. I finished teaching a Python course, so should have a bit more time in the next few days.

HDembinski · 2021-01-21T18:22:15Z

Sorry for the late reply, I am drowning in work right now. I am not sure whether we understand the same under "always return the variance". The rationale in Boost.Histogram was to always provide the variance that requires least computation. In case of the mean accumulator, the variance of the samples is computed. Analog for the weighted_mean accumulator. Since variance() here is a secondary interface, you don't have to follow that lead. It is ok to return the variance of the mean instead of the variance of the samples. I am not sure what the average user wants. For statistical analysis you want the variance of the mean, not the variance of the samples. Returning the variance of the mean is more consistent wrt other storages.

I don't feel strongly either way, I am leaning slightly toward returning variance of the mean, but as long as it is correctly documented what .variance() returns both options are ok.

…mple Don't return values if non valid

henryiii · 2021-01-25T22:16:32Z

Invalidating the variance on an inplace operation was a one line fix. :)

henryiii · 2021-01-26T03:17:01Z

Let's get this in, then iterate a bit as needed. It is the variance of the mean at the moment. I'd like to compare with ROOT in the tests, but converting is not completely trivial. This PR is big enough. :)

github-actions bot added the needs changelog Might need a changelog entry label Nov 25, 2020

henryiii added this to the 1.0.0 milestone Dec 7, 2020

henryiii mentioned this pull request Dec 9, 2020

DRAFT: Plotting (and later fitting) Protocol #423

Closed

henryiii force-pushed the feat/protocol branch from 1c278bc to e9b0e6f Compare December 18, 2020 04:25

henryiii mentioned this pull request Dec 21, 2020

feat: __dict__ major simplification #477

Merged

henryiii force-pushed the feat/protocol branch from e9b0e6f to df3823d Compare December 21, 2020 18:06

github-actions bot removed the needs changelog Might need a changelog entry label Dec 21, 2020

henryiii marked this pull request as ready for review December 22, 2020 22:32

henryiii changed the title ~~feat: basic protocol support~~ feat: PlottableProtocol support Dec 23, 2020