From ce05a10dc8c4aeed10ebf6b7c19720dea644a0be Mon Sep 17 00:00:00 2001 From: Richard Hattersley Date: Wed, 27 Apr 2016 15:29:53 +0100 Subject: [PATCH 1/4] First draft of IEP 1. --- docs/iris/src/IEP/IEP001.adoc | 138 ++++++++++++++++++++++++++++++++++ 1 file changed, 138 insertions(+) create mode 100644 docs/iris/src/IEP/IEP001.adoc diff --git a/docs/iris/src/IEP/IEP001.adoc b/docs/iris/src/IEP/IEP001.adoc new file mode 100644 index 0000000000..18493bc2ab --- /dev/null +++ b/docs/iris/src/IEP/IEP001.adoc @@ -0,0 +1,138 @@ +# IEP 1 - Enhanced indexing + +## Background + +Currently, to select a subset of a Cube based on coordinate values we use something like: +[source,python] +---- +cube.extract(iris.Constraint(realization=3, + model_level_number=[1, 5], + latitude=lambda cell: 40 <= cell <= 60)) +---- +On the plus side, this works irrespective of the dimension order of the data, but the drawbacks with this form of indexing include: + +* It uses a completely different syntax to position-based indexing, e.g. `cube[4, 0:6]`. +* It uses a completely different syntax to pandas and xarray value-based indexing, e.g. `df[4, 0:6]`. +* It is long-winded. + +Similarly, to select a subset of a Cube using positional indices but where the dimension is unknown has no standard syntax _at all_! Instead it requires code akin to: +[source,python] +---- +key = [slice(None)] * cube.ndim +key[cube.coord_dims('model_level_number')[0]] = slice(3, 9, 2) +cube[tuple(key)] +---- + +The only form of indexing that is well supported is indexing by position where the dimension order is known: +[source,python] +---- +cube[4, 0:6, 30:] +---- + +## Proposal + +Provide indexing helpers on the Cube to extend support to all permutations of positional vs. named dimensions and positional vs. coordinate-value based selection. + +### Extended pandas style + +Use a single helper for index by position, and a single helper for index by value. 
Helper names taken from pandas, but their behaviour is extended by making them callable to support named dimensions. + +|=== +2.2+| 2+h|Index by +h|Position h|Value + +.2+h|Dimension +h|Position + +a|[source,python] +---- +cube[:, 2] # No change +cube.iloc[:, 2] +---- + +a|[source,python] +---- +cube.loc[:, 1.5] +---- + +h|Name + +a|[source,python] +---- +cube[dict(height=2)] +cube.iloc[dict(height=2)] +cube.iloc(height=2) +---- + +a|[source,python] +---- +cube.loc[dict(height=1.5)] +cube.loc(height=1.5) +---- +|=== + +### xarray style + +xarray introduces a second set of helpers for accessing named dimensions that provide the callable syntax `(foo=...)`. + +|=== +2.2+| 2+h|Index by +h|Position h|Value + +.2+h|Dimension +h|Position + +a|[source,python] +---- +cube[:, 2] # No change +---- + +a|[source,python] +---- +cube.loc[:, 1.5] +---- + +h|Name + +a|[source,python] +---- + cube[dict(height=2)] + cube.isel(height=2) +---- + +a|[source,python] +---- +cube.loc[dict(height=1.5)] +cube.sel(height=1.5) +---- +|=== + +### TODO +* Consistent terminology +* `coord.name()` vs. `var_name` vs. "dimension name"? +* Names that aren't valid Python identifiers +* Inclusive vs. exclusive +** Default: Inclusive? (as for pandas & xarray) +** Use boolean otherwise. +* Multi-dimensional coordinates +* Non-orthogonal coordinates +* Bounds +* Boolean array indexing +* Lambdas? +* What to do about constrained loading? +* Relationship to http://scitools.org.uk/iris/docs/v1.9.2/iris/iris/cube.html#iris.cube.Cube.intersection[iris.cube.Cube.intersection]? +* Relationship to interpolation (especially nearest-neighbour)? +** e.g. What to do about values that don't exist? +*** pandas throws a KeyError +*** xarray supports (several) nearest-neighbour schemes via http://xarray.pydata.org/en/stable/indexing.html#nearest-neighbor-lookups[`data.sel()`] +*** Apparently http://holoviews.org/[holoviews] does nearest-neighbour interpolation. +* Time handling +** e.g. 
Rich Signell's http://nbviewer.jupyter.org/gist/rsignell-usgs/13d7ce9d95fddb4983d4cbf98be6c71d[xarray/iris comparison] + +## References +. Iris + * http://scitools.org.uk/iris/docs/v1.9.2/iris/iris.html#iris.Constraint[iris.Constraint] + * http://scitools.org.uk/iris/docs/v1.9.2/userguide/subsetting_a_cube.html[Subsetting a cube] +. http://pandas.pydata.org/pandas-docs/stable/indexing.html[pandas indexing] +. http://xarray.pydata.org/en/stable/indexing.html[xarray indexing] +. http://legacy.python.org/dev/peps/pep-0472/[PEP 472 - Support for indexing with keyword arguments] From e080ea047efc59094890627f0e7e2bf69e95c1d6 Mon Sep 17 00:00:00 2001 From: Richard Hattersley Date: Wed, 27 Apr 2016 16:43:02 +0100 Subject: [PATCH 2/4] Add "Out of scope" and "Work required" --- docs/iris/src/IEP/IEP001.adoc | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/docs/iris/src/IEP/IEP001.adoc b/docs/iris/src/IEP/IEP001.adoc index 18493bc2ab..430cd995f3 100644 --- a/docs/iris/src/IEP/IEP001.adoc +++ b/docs/iris/src/IEP/IEP001.adoc @@ -33,6 +33,15 @@ cube[4, 0:6, 30:] Provide indexing helpers on the Cube to extend support to all permutations of positional vs. named dimensions and positional vs. coordinate-value based selection. +### Out of scope + +* Deliberately enhancing the performance. +This is a very valuable topic and should be addressed by subsequent efforts. + +* Time/date values as strings. +Providing pandas-style string representations for convenient representation of partial date/times should be addressed in a subsequent effort. +There is a risk that this topic could bog down when dealing with non-standard calendars and climatological date ranges. + ### Extended pandas style Use a single helper for index by position, and a single helper for index by value. Helper names taken from pandas, but their behaviour is extended by making them callable to support named dimensions. 
@@ -107,6 +116,12 @@ cube.sel(height=1.5) ---- |=== +## Work required + +* Implementations for each of the new helper objects. +* An update to the documentation to demonstrate best practice. Known impacted areas include: +** The "Subsetting a Cube" chapter of the user guide. + ### TODO * Consistent terminology * `coord.name()` vs. `var_name` vs. "dimension name"? @@ -126,8 +141,6 @@ cube.sel(height=1.5) *** pandas throws a KeyError *** xarray supports (several) nearest-neighbour schemes via http://xarray.pydata.org/en/stable/indexing.html#nearest-neighbor-lookups[`data.sel()`] *** Apparently http://holoviews.org/[holoviews] does nearest-neighbour interpolation. -* Time handling -** e.g. Rich Signell's http://nbviewer.jupyter.org/gist/rsignell-usgs/13d7ce9d95fddb4983d4cbf98be6c71d[xarray/iris comparison] ## References . Iris @@ -136,3 +149,4 @@ cube.sel(height=1.5) . http://pandas.pydata.org/pandas-docs/stable/indexing.html[pandas indexing] . http://xarray.pydata.org/en/stable/indexing.html[xarray indexing] . http://legacy.python.org/dev/peps/pep-0472/[PEP 472 - Support for indexing with keyword arguments] +. http://nbviewer.jupyter.org/gist/rsignell-usgs/13d7ce9d95fddb4983d4cbf98be6c71d[Time slicing NetCDF or OPeNDAP datasets] - Rich Signell's xarray/iris comparison focussing on time handling and performance From 8f6b53675742a083f53cd804a144128ff5c2f9d9 Mon Sep 17 00:00:00 2001 From: Richard Hattersley Date: Wed, 27 Apr 2016 17:01:32 +0100 Subject: [PATCH 3/4] Names that aren't valid Python identifiers --- docs/iris/src/IEP/IEP001.adoc | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/iris/src/IEP/IEP001.adoc b/docs/iris/src/IEP/IEP001.adoc index 430cd995f3..764536974b 100644 --- a/docs/iris/src/IEP/IEP001.adoc +++ b/docs/iris/src/IEP/IEP001.adoc @@ -33,6 +33,10 @@ cube[4, 0:6, 30:] Provide indexing helpers on the Cube to extend support to all permutations of positional vs. named dimensions and positional vs. 
coordinate-value based selection. +Commonly, the names of dimensions are also valid Python identifiers. +For names where this is not true, the names can expressed through either the `helper[...]` or `helper(...)` syntax by constructing an explicit dict. +For example: `cube.loc[{'12': 0}]` or `cube.loc(**{'12': 0})`. + ### Out of scope * Deliberately enhancing the performance. @@ -125,7 +129,6 @@ cube.sel(height=1.5) ### TODO * Consistent terminology * `coord.name()` vs. `var_name` vs. "dimension name"? -* Names that aren't valid Python identifiers * Inclusive vs. exclusive ** Default: Inclusive? (as for pandas & xarray) ** Use boolean otherwise. From 993339549f702e10143883a994db00cbb7032421 Mon Sep 17 00:00:00 2001 From: Richard Hattersley Date: Thu, 28 Apr 2016 11:10:35 +0100 Subject: [PATCH 4/4] Slice behaviour and misc clarifications --- docs/iris/src/IEP/IEP001.adoc | 98 ++++++++++++++++++++++++----------- 1 file changed, 68 insertions(+), 30 deletions(-) diff --git a/docs/iris/src/IEP/IEP001.adoc b/docs/iris/src/IEP/IEP001.adoc index 764536974b..d38b2e8478 100644 --- a/docs/iris/src/IEP/IEP001.adoc +++ b/docs/iris/src/IEP/IEP001.adoc @@ -12,10 +12,11 @@ cube.extract(iris.Constraint(realization=3, On the plus side, this works irrespective of the dimension order of the data, but the drawbacks with this form of indexing include: * It uses a completely different syntax to position-based indexing, e.g. `cube[4, 0:6]`. -* It uses a completely different syntax to pandas and xarray value-based indexing, e.g. `df[4, 0:6]`. -* It is long-winded. +* It uses a completely different syntax to pandas and xarray value-based indexing, e.g. `df.loc[4, 0:6]`. +* It is long-winded and requires the use of an additional class. +* It requires the use of lambda functions even when just selecting a range. -Similarly, to select a subset of a Cube using positional indices but where the dimension is unknown has no standard syntax _at all_! 
Instead it requires code akin to: [source,python] ---- key = [slice(None)] * cube.ndim key[cube.coord_dims('model_level_number')[0]] = slice(3, 9, 2) @@ -31,31 +32,26 @@ cube[4, 0:6, 30:] ## Proposal -Provide indexing helpers on the Cube to extend support to all permutations of positional vs. named dimensions and positional vs. coordinate-value based selection. +Provide indexing helpers on the Cube to extend explicit support to all permutations of: -Commonly, the names of dimensions are also valid Python identifiers. -For names where this is not true, the names can expressed through either the `helper[...]` or `helper(...)` syntax by constructing an explicit dict. -For example: `cube.loc[{'12': 0}]` or `cube.loc(**{'12': 0})`. - -### Out of scope +* implicit dimension vs. named coordinate, +* and positional vs. coordinate-value based selection. -* Deliberately enhancing the performance. -This is a very valuable topic and should be addressed by subsequent efforts. +### Helper syntax options -* Time/date values as strings. -Providing pandas-style string representations for convenient representation of partial date/times should be addressed in a subsequent effort. -There is a risk that this topic could bog down when dealing with non-standard calendars and climatological date ranges. +Commonly, the names of coordinates are also valid Python identifiers. +For names where this is not true, the names can be expressed through either the `helper[...]` or `helper(...)` syntax by constructing an explicit dict. +For example: `cube.loc[{'12': 0}]` or `cube.loc(**{'12': 0})`. -### Extended pandas style +#### Extended pandas style -Use a single helper for index by position, and a single helper for index by value. Helper names taken from pandas, but their behaviour is extended by making them callable to support named dimensions. 
+Use a single helper for index by position, and a single helper for index by value. Helper names taken from pandas, but their behaviour is extended by making them callable to support named coordinates. |=== -2.2+| 2+h|Index by +.2+| 2+h|Index by h|Position h|Value -.2+h|Dimension -h|Position +h|Implicit dimension a|[source,python] ---- @@ -68,7 +64,7 @@ a|[source,python] cube.loc[:, 1.5] ---- -h|Name +h|Coordinate name a|[source,python] ---- @@ -84,16 +80,15 @@ cube.loc(height=1.5) ---- |=== -### xarray style +#### xarray style xarray introduces a second set of helpers for accessing named dimensions that provide the callable syntax `(foo=...)`. |=== -2.2+| 2+h|Index by +.2+| 2+h|Index by h|Position h|Value -.2+h|Dimension -h|Position +h|Implicit dimension a|[source,python] ---- @@ -105,7 +100,7 @@ a|[source,python] cube.loc[:, 1.5] ---- -h|Name +h|Coordinate name a|[source,python] ---- @@ -120,6 +115,40 @@ cube.sel(height=1.5) ---- |=== +### Slices + +The semantics of position-based slices will continue to match those of normal Python slices. The start position is included, the end position is excluded. + +Value-based slices will be strictly inclusive, with both the start and end values included. This behaviour differs from normal Python slices but is in common with pandas. + +Just as for normal Python slices, we do not need to provide the ability to control the include/exclude behaviour for slicing. + +### Value-based indexing + +#### Equality + +Should the behaviour of value-based equality depend on the data type of the coordinate? + +* integer: exact match +* float: tolerance match, tolerance determined by bit-width +* string: exact match + +#### Scalar/category + +If/how to deal with category selection `cube.loc(season='JJA')`? Defer to `groupby()`? + +`cube.loc[12]` - must always match a single value or raise KeyError, corresponding dimension will be removed +`cube.loc[[12]]` - may match any number of values? (incl. 
zero?), dimension will be retained + +### Out of scope + +* Deliberately enhancing the performance. +This is a very valuable topic and should be addressed by subsequent efforts. + +* Time/date values as strings. +Providing pandas-style string representations for convenient representation of partial date/times should be addressed in a subsequent effort - perhaps in conjunction with an explicit performance test suite. +There is a risk that this topic could bog down when dealing with non-standard calendars and climatological date ranges. + ## Work required * Implementations for each of the new helper objects. @@ -127,11 +156,6 @@ cube.sel(height=1.5) ** The "Subsetting a Cube" chapter of the user guide. ### TODO -* Consistent terminology -* `coord.name()` vs. `var_name` vs. "dimension name"? -* Inclusive vs. exclusive -** Default: Inclusive? (as for pandas & xarray) -** Use boolean otherwise. * Multi-dimensional coordinates * Non-orthogonal coordinates * Bounds @@ -144,6 +168,20 @@ cube.sel(height=1.5) *** pandas throws a KeyError *** xarray supports (several) nearest-neighbour schemes via http://xarray.pydata.org/en/stable/indexing.html#nearest-neighbor-lookups[`data.sel()`] *** Apparently http://holoviews.org/[holoviews] does nearest-neighbour interpolation. +* multi-dimensional coordinate => unroll? +* var_name only selection? `cube.vloc(t0=12)` +* Orthogonal only? Or also independent? `cube.loc_points(lon=[1, 1, 5], lat=[31, 33, 32])` + ** This seems quite closely linked to interpolation. Is the interpolation scheme orthogonal to cross-product vs. independent? ++ +[source,python] +---- +cube.interpolate( + scheme='nearest', + mesh=dict(lon=[5, 10, 15], lat=[40, 50])) +cube.interpolate( + scheme=Nearest(mode='spherical'), + locations=Ortho(lon=[5, 10, 15], lat=[40, 50])) +---- ## References . Iris
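
The two mechanisms the proposal relies on can be sketched in plain Python, without Iris. This is only an illustration under stated assumptions, not the proposed implementation: `positional_key`, `inclusive_value_slice` and `NamedIndexer` are hypothetical names, the list of dimension names stands in for the lookup that `cube.coord_dims()` performs on a real Cube, and the coordinate points are assumed to be sorted in ascending order. Tolerance-based float matching, bounds and multi-dimensional coordinates are deliberately left out.

```python
from bisect import bisect_left, bisect_right


def positional_key(dim_names, selection):
    """Build a full positional key from a {name: index-or-slice} mapping.

    `dim_names` is a hypothetical stand-in for the name-to-dimension
    lookup that `cube.coord_dims(name)` performs on a real Cube.
    """
    key = [slice(None)] * len(dim_names)
    for name, index in selection.items():
        key[dim_names.index(name)] = index
    return tuple(key)


def inclusive_value_slice(points, start=None, stop=None):
    """Convert an inclusive value range into a half-open positional slice.

    Assumes `points` is sorted ascending (a monotonic 1-D coordinate).
    Both `start` and `stop` are included when they are present in `points`.
    """
    lo = 0 if start is None else bisect_left(points, start)
    hi = len(points) if stop is None else bisect_right(points, stop)
    return slice(lo, hi)


class NamedIndexer:
    """Hypothetical helper supporting helper[dict(...)] and helper(...)."""

    def __init__(self, dim_names):
        self._dim_names = list(dim_names)

    def __getitem__(self, selection):
        # helper[dict(height=2)] or helper[{'12': 0}]
        return positional_key(self._dim_names, selection)

    def __call__(self, **selection):
        # helper(height=2) or helper(**{'12': 0})
        return positional_key(self._dim_names, selection)


dims = ["realization", "model_level_number", "latitude"]
iloc = NamedIndexer(dims)

# Both spellings produce the same positional key.
key = iloc(model_level_number=slice(3, 9, 2))
same_key = iloc[dict(model_level_number=slice(3, 9, 2))]

# Names that are not valid Python identifiers go through an explicit dict.
odd_key = NamedIndexer(["12", "x"])(**{"12": 0})

# Inclusive value-based slice over latitude points.
latitudes = [30.0, 40.0, 50.0, 60.0, 70.0]
lat_slice = inclusive_value_slice(latitudes, 40, 60)
selected = latitudes[lat_slice]
```

Translating the inclusive value range into an ordinary half-open slice keeps the positional machinery unchanged, which is one possible way to reconcile the two slice semantics described under "Slices".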