Skip to content

Conversation

@henryiii
Copy link
Member

@henryiii henryiii commented Nov 25, 2020

Adds PlottableProtocol support to boost-histogram.

See the Uhi repo, such as this PR: scikit-hep/uhi#2 and later updates.

TODO:

  • Return None for variance if a simple storage is set with weights or manually changed via h[...] = .
  • Change tests to use traits instead of options
  • Add to changelog, with upgrade warning (options -> traits)
  • Add to docs
  • Check existing docs to see if there are some places to show .values() and such instead of .view().

We might want to loosen the protocol slightly to specify a Sequence is returned for __getitem__ on non-discrete axis; it does not have to be exactly a Tuple. Specifically, something like uproot that does not support non-discrete axis could make this a NumPy 2D array with a few extra properties, and that should still pass the Protocol. (With a mix of axis types, this would be still representable by an AwkwardArray, technically (edit: except for needing to flatten the 1 element return).

We should also ideally do a static check against typing/plottable.py, but that likely will wait till we have more types and the UHI repo gets a few other useful bits and moves to scikit-hep. I've vendored in the Protocol for the docs for now.

Closes #423.

TODO 2:

  • check and match uproot for mean variance
  • Add docs cross links
  • Invalidate weights if inplace operations are used

@github-actions github-actions bot added the needs changelog Might need a changelog entry label Nov 25, 2020
@HDembinski
Copy link
Member

It looks like you first add properties directly on interface layer of the axis class and then wrap that data in a Option object on the Python layer on top of the interface. Wouldn't it be easier to make the Option object directly wrap the integer returned by get_options?

@henryiii
Copy link
Member Author

henryiii commented Nov 26, 2020

The point of the changes in the second commit (other than trying to simplify the C++ class sitting on top of a single unsigned value that had to have a pickle function, copy/deepcopy, etc.) was to make "discrete" available in the .options structure (which is now probably closer to traits than options?). Do you like these traits bundled in with "options" here? In Python, you can't set any of these, so they all are really just output "traits" and options is a way to organize them.

A few ideas (referring to the Python Histogram, since the rest is internal detail that we can change):

  1. Remove .options and just make these properties on .axis (and change the Protocol, we don't have much more time to make changes to the Protocol, as Uproot is released Monday, but this is small and can be done easily). Reduces the available namespace on the axis a bit, but do you really want a user to be able to add ax.growth = ...?
  2. Keep .options but add traits directly on the axis. Provide .circular also on the axis, because only having .options.circular in the protocol seems really strange.
  3. Inject just .discrete into .options - this was hard to do in the current state, as is messed with pickling. Also it is nice to be able to ask all the Boost.Histogram trait questions, even if they are not part of the protocol?
  4. Bundle all traits into .options (the current state of the PR's second commit). I included all three of the Boost.Histogram traits.
  5. Put them in .traits. Or .properties.
  6. Put them in .traits and .traits.options. This is the closest to Boost.Histogram, but is probalby too structured for an ideal Protocol, so probably needs to be mixed with one of the choices above for the Protocol.

I'm pretty open to any direction, just started playing with it last night. I originally liked 4, though I like 1 better than I did originally in the Protocol discussion. @jpivarski didn't really like the name "options", since he thought that sounded like input options, rather than "properties" that you query. (In Boost.Histogram, of course, they are something you can input, while they are preset in boost-histogram).

Thoughts?

@jpivarski
Copy link
Member

If we're voting, I'd vote for circular and discrete to be properties directly on the axis, without an intermediate object. I don't think the number of these is going to grow so much that we regret not putting it in a separate namespace, and the separate namespace is cumbersome and makes it just a little harder to find these properties.

We can also drop "median". It was just an indication of where this might grow, but it doesn't need to be defined into there's a need for it. But as you pointed out, these need to be strings or some very basic type. An implementation could return enumerations that are equal to the strings they represent.

@henryiii
Copy link
Member Author

There are two things - the Protocol, and boost-histogram. I think for the Protocol @HDembinski was neutral on the nesting, I was for but have changed to be against, and you are against - so let's remove .options from the Protocol. Then Hans can pick whatever direction he'd like to go with boost-histogram. Probably option 1 or 2 as listed above?

@HDembinski
Copy link
Member

I was neutral on options, but they have been growing on me. The options all logically belong together, so it makes sense to have them grouped. It also makes tab-completion on the axis less cluttered. Only library developers will query options. Normal users who write scripts won't have the need to query them, they know what options they have set. Since they are not queried frequently, they can and should sit deeper in the interface hierarchy.

@HDembinski
Copy link
Member

HDembinski commented Nov 27, 2020

The point of the changes in the second commit (other than trying to simplify the C++ class sitting on top of a single unsigned value that had to have a pickle function, copy/deepcopy, etc.) was to make "discrete" available in the .options structure (which is now probably closer to traits than options?). Do you like these traits bundled in with "options" here? In Python, you can't set any of these, so they all are really just output "traits" and options is a way to organize them.

In C++, discrete is a trait that the compiler can figure out at compile-time by inspecting the axis interface. In Python it makes sense to have it .options for the lack of a better place.

4. Bundle all traits into `.options` (the current state of the PR's second commit). I included all three of the Boost.Histogram traits.

This is my preferred compromise between exactness and simplicity. I think we wouldn't even worry about the distinction between options and traits if these were not technically implemented differently in C++. For Python, the distinction seems irrelevant, so grouping them all in .options, which describes what the axis can do, makes sense.

@HDembinski
Copy link
Member

I would prefer to not to discuss things here and in the uhi repo simultaneously.

@henryiii
Copy link
Member Author

Okay, that makes sense - we'll go with option 4 then. Thanks! Don't worry about answering twice, I can update UHI's proposal from the discussion here.

@henryiii henryiii added this to the 1.0.0 milestone Dec 7, 2020
@github-actions github-actions bot removed the needs changelog Might need a changelog entry label Dec 21, 2020
@henryiii henryiii marked this pull request as ready for review December 22, 2020 22:32
@henryiii
Copy link
Member Author

henryiii commented Dec 22, 2020

The __dict__ in __slots__ trick is perfect for wrapping a C++ class. You can keep the wrapped item in __slots__, so it's not in your __dict__. Used that to simplify the Histogram implementation now too. Finished the rest of the items, I think this is ready!

Reminder for anyone watching this - there are a few more items to work on before the next release, and I want to help get mplhep and histoprint ready for this before or around the time of a new release.

@henryiii
Copy link
Member Author

henryiii commented Dec 22, 2020

@HDembinski feel free to check it over! Check the docs changes, for example.

In the current state, the only thing that invalidates a simple storage's variance is a weighed fill. Specifically, there are three ways someone can set values:

h.fill(1, weight=0.5)    # 1. Should invalidate .variances()
h[...] = array           # 2. ?
h.view()[...] = array    # 3. Cannot invalidate .variances()

We can't control option 3, but that's not an issue - if someone messes with the view, they are playing with raw access to our memory anyway. The question is what to do for line 2 - I think for consistency, we should follow line 3; but it's a close call, very recently I would have said any set operation should invalidate variances as well. Happy to go either way if you have a preference on line 2.

For a library that produces histograms like uproot/coffea/etc (which is the most common place I would expect to see memory set), then you can always do a single weight 0 fill to invalidate the variances if you know the data you are loading has had lossy weighed fills on it.

PS: In place operations like scaling also can change the values, didn't include that above, currently they do not invalidate the .variances() either.

@henryiii henryiii changed the title feat: basic protocol support feat: PlottableProtocol support Dec 23, 2020
@HDembinski
Copy link
Member

HDembinski commented Dec 26, 2020

@HDembinski feel free to check it over! Check the docs changes, for example.

In the current state, the only thing that invalidates a simple storage's variance is a weighed fill. Specifically, there are three ways someone can set values:

h.fill(1, weight=0.5)    # 1. Should invalidate .variances()
h[...] = array           # 2. ?
h.view()[...] = array    # 3. Cannot invalidate .variances()

We can't control option 3, but that's not an issue - if someone messes with the view, they are playing with raw access to our memory anyway. The question is what to do for line 2 - I think for consistency, we should follow line 3; but it's a close call, very recently I would have said any set operation should invalidate variances as well. Happy to go either way if you have a preference on line 2.

I agree with you here generally. I also find it difficult to decide on 2), but agree with your assessment. We don't know what the user assigns. They might assign counts, so this operation could leave one with a histogram that has perfectly valid variances. Since we don't want to offer a public method to "restore" the variances once they are gone, we should err on the side of providing them while they are invalid rather than removing them when they are actually valid.

PS: In place operations like scaling also can change the values, didn't include that above, currently they do not invalidate the .variances() either.

Ideally, (in-place) scaling should switch the storage from Double() to Weight() before doing the scaling operation, then the variance afterwards would be correct. If that's not possible .variances() should be invalided by that operation.


* ``h.kind``: The bh.Kind of the histogram (COUNT or MEAN)
* ``h.values()``: The value (as given by the kind)
* ``h.variances()``: The variance in the value (None if an unweighed histogram was filled with weights)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wording is not correct, because .variances() does not give you the variance of the mean if kind == MEAN it gives the variances of the samples. You need to divide that by .counts() to get the variance of the mean.

Copy link
Member Author

@henryiii henryiii Dec 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we decided in the PlottableProtocol that .variances() is always the variance of .value(), so that a "dumb" plotter could just plot values and variances and ignore the .kind and get something reasonable. It is also simpler if ".values" is the value of the Kind, and the ".variances" is the variances of the Kind. So in this case, the wording is correct, but the value is incorrect, it should be calculated as .view()["variance"] / .view()["count"]? And a user wishing to get the sample variance would multiply by the .counts()?

I could easily be wrong here, though, either/both about what we decided and what is best. @jpivarski what does Uproot return for .variances() here for a TProfile?

@henryiii henryiii force-pushed the feat/protocol branch 2 times, most recently from 63299c9 to e796c0d Compare January 19, 2021 21:01
@henryiii
Copy link
Member Author

henryiii commented Jan 19, 2021

@HDembinski I've fixed everything, I hope to merge soon, we can iterate before release. I went with matching the description we agreed on in UHI: variance is the variance of the value, regardless of the Kind. I don't know if that's best, that's just how I understood our agreement, and how I would expect this to be most useful for plotting; "dumb" plotters should get as close as possible by just plotting value and variance; that's what we worked on for the return value from variance when the storage is unweighed, as well.

Happy to hear arguments in either direction, though.

Maybe implementing this in mplhep will help clear up this confusion?

I'm also preparing "UHI" so we don't have to duplicate documentation and the protocol.

Edit: Inplace validation still needed; I also still need to compare to uproot4. I finished teaching a Python course, so should have a bit more time in the next few days.

@henryiii henryiii force-pushed the feat/protocol branch 3 times, most recently from a3e8f9d to dbd2e5b Compare January 21, 2021 03:54
@HDembinski
Copy link
Member

Sorry for the late reply, I am drowning in work right now. I am not sure whether we understand the same under "always return the variance". The rationale in Boost.Histogram was to always provide the variance that requires least computation. In case of the mean accumulator, the variance of the samples is computed. Analog for the weighted_mean accumulator. Since variance() here is a secondary interface, you don't have to follow that lead. It is ok to return the variance of the mean instead of the variance of the samples. I am not sure what the average user wants. For statistical analysis you want the variance of the mean, not the variance of the samples. Returning the variance of the mean is more consistent wrt other storages.

I don't feel strongly either way, I am leaning slightly toward returning variance of the mean, but as long as it is correctly documented what .variance() returns both options are ok.

@henryiii
Copy link
Member Author

Invalidating the variance on an inplace operation was a one line fix. :)

@henryiii
Copy link
Member Author

Let's get this in, then iterate a bit as needed. It is the variance of the mean at the moment. I'd like to compare with ROOT in the tests, but converting is not completely trivial. This PR is big enough. :)

@henryiii henryiii merged commit bb0a24c into develop Jan 26, 2021
@henryiii henryiii deleted the feat/protocol branch January 26, 2021 03:18
@henryiii henryiii modified the milestones: 1.0.0, 0.13.0 Feb 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DRAFT: Plotting (and later fitting) Protocol

4 participants