-
Notifications
You must be signed in to change notification settings - Fork 297
IEP 1 #1988
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IEP 1 #1988
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,193 @@ | ||
| # IEP 1 - Enhanced indexing | ||
|
|
||
| ## Background | ||
|
|
||
| Currently, to select a subset of a Cube based on coordinate values we use something like: | ||
| [source,python] | ||
| ---- | ||
| cube.extract(iris.Constraint(realization=3, | ||
| model_level_number=[1, 5], | ||
| latitude=lambda cell: 40 <= cell <= 60)) | ||
| ---- | ||
| On the plus side, this works irrespective of the dimension order of the data, but the drawbacks with this form of indexing include: | ||
|
|
||
| * It uses a completely different syntax to position-based indexing, e.g. `cube[4, 0:6]`. | ||
| * It uses a completely different syntax to pandas and xarray value-based indexing, e.g. `df.loc[4, 0:6]`. | ||
| * It is long-winded and requires the use of an additional class. | ||
| * It requires the use of lambda functions even when just selecting a range. | ||
|
|
||
| Arguably, the situation when subsetting using positional indices but where the dimension order is unknown is even worse - it has no standard syntax _at all_! Instead it requires code akin to: | ||
| [source,python] | ||
| ---- | ||
| key = [slice(None)] * cube.ndim | ||
| key[cube.coord_dims('model_level_number')[0]] = slice(3, 9, 2) | ||
| cube[tuple(key)] | ||
| ---- | ||
|
|
||
| The only form of indexing that is well supported is indexing by position where the dimension order is known: | ||
| [source,python] | ||
| ---- | ||
| cube[4, 0:6, 30:] | ||
| ---- | ||
|
|
||
| ## Proposal | ||
|
|
||
| Provide indexing helpers on the Cube to extend explicit support to all permutations of: | ||
|
|
||
| * implicit dimension vs. named coordinate, | ||
| * and positional vs. coordinate-value based selection. | ||
|
|
||
| ### Helper syntax options | ||
|
|
||
| Commonly, the names of coordinates are also valid Python identifiers. | ||
| For names where this is not true, the names can expressed through either the `helper[...]` or `helper(...)` syntax by constructing an explicit dict. | ||
| For example: `cube.loc[{'12': 0}]` or `cube.loc(**{'12': 0})`. | ||
|
|
||
| #### Extended pandas style | ||
|
|
||
| Use a single helper for index by position, and a single helper for index by value. Helper names taken from pandas, but their behaviour is extended by making them callable to support named coordinates. | ||
|
|
||
| |=== | ||
| .2+| 2+h|Index by | ||
| h|Position h|Value | ||
|
|
||
| h|Implicit dimension | ||
|
|
||
| a|[source,python] | ||
| ---- | ||
| cube[:, 2] # No change | ||
| cube.iloc[:, 2] | ||
| ---- | ||
|
|
||
| a|[source,python] | ||
| ---- | ||
| cube.loc[:, 1.5] | ||
| ---- | ||
|
|
||
| h|Coordinate name | ||
|
|
||
| a|[source,python] | ||
| ---- | ||
| cube[dict(height=2)] | ||
| cube.iloc[dict(height=2)] | ||
| cube.iloc(height=2) | ||
| ---- | ||
|
|
||
| a|[source,python] | ||
| ---- | ||
| cube.loc[dict(height=1.5)] | ||
| cube.loc(height=1.5) | ||
| ---- | ||
| |=== | ||
|
|
||
| #### xarray style | ||
|
|
||
| xarray introduces a second set of helpers for accessing named dimensions that provide the callable syntax `(foo=...)`. | ||
|
|
||
| |=== | ||
| .2+| 2+h|Index by | ||
| h|Position h|Value | ||
|
|
||
| h|Implicit dimension | ||
|
|
||
| a|[source,python] | ||
| ---- | ||
| cube[:, 2] # No change | ||
| ---- | ||
|
|
||
| a|[source,python] | ||
| ---- | ||
| cube.loc[:, 1.5] | ||
| ---- | ||
|
|
||
| h|Coordinate name | ||
|
|
||
| a|[source,python] | ||
| ---- | ||
| cube[dict(height=2)] | ||
| cube.isel(height=2) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i find the term other than the name of the function, the syntax looks the same as before There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll take that as a second vote for the pandas-style naming. |
||
| ---- | ||
|
|
||
| a|[source,python] | ||
| ---- | ||
| cube.loc[dict(height=1.5)] | ||
| cube.sel(height=1.5) | ||
| ---- | ||
| |=== | ||
|
|
||
| ### Slices | ||
|
|
||
| The semantics of position-based slices will continue to match that of normal Python slices. The start position is included, the end position is excluded. | ||
|
|
||
| Value-based slices will be stricly inclusive, with both the start and end values included. This behaviour differs from normal Python slices but is in common with pandas. | ||
|
|
||
| Just as for normal Python slices, we do not need to provide the ability to control the include/exclude behaviour for slicing. | ||
|
|
||
| ### Value-based indexing | ||
|
|
||
| #### Equality | ||
|
|
||
| Should the behaviour of value-based equality depend on the data type of the coordinate? | ||
|
|
||
| * integer: exact match | ||
| * float: tolerance match, tolerance determined by bit-width | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this always makes me a little nervous. I can very much understand the approach tho for interest, if I want to specify an inequality, how might i do such a thing? does this mean that I have to deal with the situation that it may match multiple values? |
||
| * string: exact match | ||
|
|
||
| #### Scalar/category | ||
|
|
||
| If/how to deal with category selection `cube.loc(season='JJA')`? Defer to `groupby()`? | ||
|
|
||
| `cube.loc[12]` - must always match a single value or raise KeyError, corresponding dimension will be removed | ||
| `cube.loc[[12]]` - may match any number of values? (incl. zero?), dimension will be retained | ||
|
|
||
| ### Out of scope | ||
|
|
||
| * Deliberately enhancing the performance. | ||
| This is a very valuable topic and should be addressed by subsequent efforts. | ||
|
|
||
| * Time/date values as strings. | ||
| Providing pandas-style string representations for convenient representation of partial date/times should be addressed in a subsequent effort - perhaps in conjunction with an explicit performance test suite. | ||
| There is a risk that this topic could bog down when dealing with non-standard calendars and climatological date ranges. | ||
|
|
||
| ## Work required | ||
|
|
||
| * Implementations for each of the new helper objects. | ||
| * An update to the documentation to demonstrate best practice. Known impacted areas include: | ||
| ** The "Subsetting a Cube" chapter of the user guide. | ||
|
|
||
| ### TODO | ||
| * Multi-dimensional coordinates | ||
| * Non-orthogonal coordinates | ||
| * Bounds | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is an interesting facet. I'd like to think about this more. find sample point within bounds may be an interesting way to handle float matching in some cases. opportunities and dragons, I suspect |
||
| * Boolean array indexing | ||
| * Lambdas? | ||
| * What to do about constrained loading? | ||
| * Relationship to http://scitools.org.uk/iris/docs/v1.9.2/iris/iris/cube.html#iris.cube.Cube.intersection[iris.cube.Cube.intersection]? | ||
| * Relationship to interpolation (especially nearest-neighbour)? | ||
| ** e.g. What to do about values that don't exist? | ||
| *** pandas throws a KeyError | ||
| *** xarray supports (several) nearest-neighbour schemes via http://xarray.pydata.org/en/stable/indexing.html#nearest-neighbor-lookups[`data.sel()`] | ||
| *** Apparently http://holoviews.org/[holoviews] does nearest-neighbour interpolation. | ||
| * multi-dimensional coordinate => unroll? | ||
| * var_name only selection? `cube.vloc(t0=12)` | ||
| * Orthogonal only? Or also independent? `cube.loc_points(lon=[1, 1, 5], lat=[31, 33, 32])` | ||
| ** This seems quite closely linked to interpolation. Is the interpolation scheme orthogonal to cross-product vs. independent? | ||
| + | ||
| [source,python] | ||
| ---- | ||
| cube.interpolate( | ||
| scheme='nearest', | ||
| mesh=dict(lon=[5, 10, 15], lat=[40, 50])) | ||
| cube.interpolate( | ||
| scheme=Nearest(mode='spherical'), | ||
| locations=Ortho(lon=[5, 10, 15], lat=[40, 50])) | ||
| ---- | ||
|
|
||
| ## References | ||
| . Iris | ||
| * http://scitools.org.uk/iris/docs/v1.9.2/iris/iris.html#iris.Constraint[iris.Constraint] | ||
| * http://scitools.org.uk/iris/docs/v1.9.2/userguide/subsetting_a_cube.html[Subsetting a cube] | ||
| . http://pandas.pydata.org/pandas-docs/stable/indexing.html[pandas indexing] | ||
| . http://xarray.pydata.org/en/stable/indexing.html[xarray indexing] | ||
| . http://legacy.python.org/dev/peps/pep-0472/[PEP 472 - Support for indexing with keyword arguments] | ||
| . http://nbviewer.jupyter.org/gist/rsignell-usgs/13d7ce9d95fddb4983d4cbf98be6c71d[Time slicing NetCDF or OPeNDAP datasets] - Rich Signell's xarray/iris comparison focussing on time handling and performance | ||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is
iloca string which could be used in aIf so, what coord names are allowed that would not be able to support this?
coord.name()could return a long_name, with space in, for exampleoh, I see more now,
locandilocare special functions (catching up slowly)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, my comments are related to the string 'height'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the names are the same as for
cube.coord(...).A name that is not a valid Python identifier can be expressed with a dict literal, e.g.:
This shouldn't be an issue for general purpose code, which is probably doing the equivalent of: