Commit 6bd26b7

Merge pull request #2597 from bjlittle/dask-merge-back
Dask merge back
2 parents 2b9c2aa + e2eeea3 commit 6bd26b7

File tree

776 files changed

+12220
-5695
lines changed

.travis.yml

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -22,7 +22,7 @@ git:
2222
depth: 10000
2323

2424
install:
25-
- export IRIS_TEST_DATA_REF="7c0e32c8812b464e467a9555bdc25dc1e0c5be0c"
25+
- export IRIS_TEST_DATA_REF="2f3a6bcf25f81bd152b3d66223394074c9069a96"
2626
- export IRIS_TEST_DATA_SUFFIX=$(echo "${IRIS_TEST_DATA_REF}" | sed "s/^v//")
2727

2828
# Install miniconda
@@ -54,7 +54,7 @@ install:
5454
conda install --quiet --file minimal-conda-requirements.txt;
5555
else
5656
if [[ "$TRAVIS_PYTHON_VERSION" == 3* ]]; then
57-
sed -e '/ecmwf_grib/d' -e '/esmpy/d' -e '/iris-grib/d' -e 's/#.\+$//' conda-requirements.txt | xargs conda install --quiet;
57+
sed -e '/ecmwf_grib/d' -e '/esmpy/d' -e 's/#.\+$//' conda-requirements.txt | xargs conda install --quiet;
5858
else
5959
conda install --quiet --file conda-requirements.txt;
6060
fi

INSTALL

Lines changed: 0 additions & 7 deletions
@@ -80,9 +80,6 @@ numpy 1.9 or later (http://numpy.scipy.org/)
8080
Python package for scientific computing including a powerful N-dimensional
8181
array object.
8282

83-
biggus 0.14 or later (https://github.com/SciTools/biggus)
84-
Virtual large arrays and lazy evaluation.
85-
8683
scipy 0.10 or later (http://www.scipy.org/)
8784
Python package for scientific computing.
8885

@@ -128,10 +125,6 @@ grib-api 1.9.16 or later
128125
edition 2 messages. A compression library such as Jasper is required
129126
to read JPEG2000 compressed GRIB2 files.
130127

131-
iris-grib 0.9 or later
132-
(https://github.com/scitools/iris-grib)
133-
Iris interface to ECMWF's GRIB API
134-
135128
matplotlib 1.2.0 (http://matplotlib.sourceforge.net/)
136129
Python package for 2D plotting.
137130

conda-requirements.txt

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -2,14 +2,14 @@
22
# conda create -n <name> --file conda-requirements.txt
33

44
# Mandatory dependencies
5-
biggus
65
cartopy
76
matplotlib<1.9
87
netcdf4
98
numpy
109
pyke
1110
udunits2
1211
cf_units
12+
dask
1313

1414
# Iris build dependencies
1515
setuptools
@@ -25,12 +25,12 @@ imagehash
2525
requests
2626

2727
# Optional iris dependencies
28-
nc_time_axis
29-
iris-grib
28+
ecmwf_grib
3029
esmpy>=7.0
3130
gdal
3231
libmo_unpack
33-
pandas
34-
pyugrid
3532
mo_pack
33+
nc_time_axis
34+
pandas
3635
python-stratify
36+
pyugrid

docs/iris/src/conf.py

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -158,8 +158,6 @@
158158
'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
159159
'matplotlib': ('http://matplotlib.org/', None),
160160
'cartopy': ('http://scitools.org.uk/cartopy/docs/latest/', None),
161-
'biggus': ('https://biggus.readthedocs.io/en/latest/', None),
162-
'iris-grib': ('http://iris-grib.readthedocs.io/en/latest/', None),
163161
}
164162

165163

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
Iris Dask Interface
2+
*******************
3+
4+
Iris uses `dask <http://dask.pydata.org>`_ to manage lazy data interfaces and processing graphs.
5+
The key principles that define this interface are:
6+
7+
* A call to :attr:`cube.data` will always load all of the data.
8+
9+
* Once this has happened:
10+
11+
* :attr:`cube.data` is a mutable NumPy masked array or ``ndarray``, and
12+
* ``cube._numpy_array`` is a private NumPy masked array, accessible via :attr:`cube.data`, which may strip off the mask and return a reference to the bare ``ndarray``.
13+
14+
* You can use :attr:`cube.data` to set the data. This accepts:
15+
16+
* a NumPy array (including masked array), which is assigned to ``cube._numpy_array``, or
17+
* a dask array, which is assigned to ``cube._dask_array``, while ``cube._numpy_array`` is set to None.
18+
19+
* ``cube._dask_array`` may be None, otherwise it is expected to be a dask array:
20+
21+
* this may wrap a proxy to a file collection, or
22+
* this may wrap the NumPy array in ``cube._numpy_array``.
23+
24+
* All dask arrays wrap array-like objects where missing data are represented by ``nan`` values:
25+
26+
* Masked arrays derived from these dask arrays create their mask using the locations of ``nan`` values.
27+
* Where dask-wrapped arrays of ``int`` require masks, these arrays will first be cast to ``float``.
28+
29+
* In order to support this mask conversion, cubes have a ``fill_value`` defined as part of their metadata, which may be ``None``.
30+
31+
* Array copying is kept to an absolute minimum:
32+
33+
* array references should always be passed, not new arrays created, unless an explicit copy operation is requested.
34+
35+
* To test for the presence of a dask array of any sort, we use :func:`iris._lazy_data.is_lazy_data`. This is implemented as ``hasattr(data, 'compute')``.
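
The duck-typing check described in the last bullet can be sketched in plain Python. This is an illustration only: ``DeferredArray`` is a hypothetical stand-in for a dask array, and the ``is_lazy_data`` function below mirrors the ``hasattr(data, 'compute')`` test quoted above rather than reproducing the Iris source.

```python
def is_lazy_data(data):
    """Return True if ``data`` looks lazy, i.e. exposes a ``compute`` attribute."""
    return hasattr(data, 'compute')


class DeferredArray:
    """Hypothetical stand-in for a dask array: holds a loader, not the values."""

    def __init__(self, loader):
        self._loader = loader

    def compute(self):
        # The values are only produced when compute() is called.
        return self._loader()


real = [1.0, 2.0, 3.0]                          # already in memory
lazy = DeferredArray(lambda: [1.0, 2.0, 3.0])   # values not yet produced

print(is_lazy_data(real))   # a plain list has no ``compute`` attribute
print(is_lazy_data(lazy))
```

Because the check relies only on the presence of an attribute, any object exposing ``compute`` is treated as lazy, so the test is cheap but deliberately permissive.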

docs/iris/src/developers_guide/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -38,3 +38,4 @@
3838
tests.rst
3939
deprecations.rst
4040
release.rst
41+
dask_interface.rst

docs/iris/src/userguide/index.rst

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -6,11 +6,11 @@ Iris user guide
66

77
How to use the user guide
88
---------------------------
9-
If you are reading this user guide for the first time it is strongly recommended that you read the user guide
10-
fully before experimenting with your own data files.
9+
If you are reading this user guide for the first time it is strongly recommended that you read the user guide
10+
fully before experimenting with your own data files.
1111

1212

13-
Much of the content has supplementary links to the reference documentation; you will not need to follow these
13+
Much of the content has supplementary links to the reference documentation; you will not need to follow these
1414
links in order to understand the guide but they may serve as a useful reference for future exploration.
1515

1616
.. htmlonly::
@@ -30,6 +30,7 @@ User guide table of contents
3030
saving_iris_cubes.rst
3131
navigating_a_cube.rst
3232
subsetting_a_cube.rst
33+
real_and_lazy_data.rst
3334
plotting_a_cube.rst
3435
interpolation_and_regridding.rst
3536
merge_and_concat.rst

docs/iris/src/userguide/interpolation_and_regridding.rst

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -176,8 +176,8 @@ For example, to mask values that lie beyond the range of the original data:
176176
>>> scheme = iris.analysis.Linear(extrapolation_mode='mask')
177177
>>> new_column = column.interpolate(sample_points, scheme)
178178
>>> print(new_column.coord('altitude').points)
179-
[ nan 494.44451904 588.88891602 683.33325195 777.77783203
180-
872.222229 966.66674805 1061.11108398 1155.55541992 nan]
179+
[-- 494.44451904296875 588.888916015625 683.333251953125 777.77783203125
180+
872.2222290039062 966.666748046875 1061.111083984375 1155.555419921875 --]
181181

182182

183183
.. _caching_an_interpolator:
Lines changed: 230 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,230 @@
1+
.. _real_and_lazy_data:
2+
3+
4+
.. testsetup:: *
5+
6+
import dask.array as da
7+
import iris
8+
import numpy as np
9+
10+
11+
==================
12+
Real and Lazy Data
13+
==================
14+
15+
We have seen in the :doc:`user_guide_introduction` section of the user guide that
16+
Iris cubes contain data and metadata about a phenomenon. The data element of a cube
17+
is always an array, but the array may be either "real" or "lazy".
18+
19+
In this section of the user guide we will look specifically at the concepts of
20+
real and lazy data as they apply to the cube and other data structures in Iris.
21+
22+
23+
What is real and lazy data?
24+
---------------------------
25+
26+
In Iris, we use the term **real data** to describe data arrays that are loaded
27+
into memory. Real data is typically provided as a
28+
`NumPy array <https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html>`_,
29+
which has a shape and data type that are used to describe the array's data points.
30+
Each data point takes up a small amount of memory, which means large NumPy arrays can
31+
take up a large amount of memory.
32+
33+
Conversely, we use the term **lazy data** to describe data that is not loaded into memory.
34+
(This is sometimes also referred to as **deferred data**.)
35+
In Iris, lazy data is provided as a
36+
`dask array <http://dask.pydata.org/en/latest/array-overview.html>`_.
37+
A dask array also has a shape and data type
38+
but typically the dask array's data points are not loaded into memory.
39+
Instead the data points are stored on disk and only loaded into memory in
40+
small chunks when absolutely necessary (see the section :ref:`when_real_data`
41+
for examples of when this might happen).
42+
43+
The primary advantage of using lazy data is that it enables
44+
`out-of-core processing <https://en.wikipedia.org/wiki/Out-of-core_algorithm>`_;
45+
that is, the loading and manipulating of datasets that otherwise would not fit into memory.
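
As a rough illustration of the out-of-core idea (not of how dask or Iris implement it), the sketch below computes a mean over values that are produced one chunk at a time, so only a single chunk is held in memory at once. The ``iter_chunks`` helper and the chunk size are assumptions made for the example.

```python
def iter_chunks(n_values, chunk_size):
    """Yield successive chunks of the sequence 0, 1, ..., n_values - 1.

    Hypothetical stand-in for reading blocks of a large file from disk.
    """
    for start in range(0, n_values, chunk_size):
        stop = min(start + chunk_size, n_values)
        yield [float(i) for i in range(start, stop)]


def chunked_mean(chunks):
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    return total / count


mean = chunked_mean(iter_chunks(n_values=1000, chunk_size=100))
print(mean)
```

The running total and count are the only state kept between chunks, which is why the full dataset never needs to fit in memory.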
46+
47+
You can check whether a cube has real data or lazy data by using the method
48+
:meth:`~iris.cube.Cube.has_lazy_data`. For example::
49+
50+
>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
51+
>>> cube.has_lazy_data()
52+
True
53+
>>> # Realise the lazy data.
54+
>>> cube.data
55+
>>> cube.has_lazy_data()
56+
False
57+
58+
59+
.. _when_real_data:
60+
61+
When does my data become real?
62+
------------------------------
63+
64+
When you load a dataset using Iris the data array will almost always initially be
65+
a lazy array. This section details some operations that will realise lazy data
66+
as well as some operations that will maintain lazy data. We use the term **realise**
67+
to mean converting lazy data into real data.
68+
69+
Most operations on data arrays can be run equivalently on both real and lazy data.
70+
If the data array is real then the operation will be run on the data array
71+
immediately. The results of the operation will be available as soon as processing is completed.
72+
If the data array is lazy then the operation will be deferred and the data array will
73+
remain lazy until you request the result (such as when you call ``cube.data``)::
74+
75+
>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
76+
>>> cube.has_lazy_data()
77+
True
78+
>>> cube += 5
79+
>>> cube.has_lazy_data()
80+
True
81+
82+
The process by which the operation is deferred until the result is requested is
83+
referred to as **lazy evaluation**.
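
Lazy evaluation can be illustrated without Iris or dask. In the sketch below, ``LazyValue`` is a hypothetical class that records operations instead of running them; nothing is calculated until ``compute()`` is called. It is a toy model of the idea, not the mechanism dask uses.

```python
class LazyValue:
    """Toy deferred computation: stores a recipe rather than a result."""

    def __init__(self, func):
        self._func = func
        self.evaluations = 0  # counts how often the recipe has been run

    def __add__(self, other):
        # Build a new recipe; the addition itself is not performed yet.
        return LazyValue(lambda: [v + other for v in self.compute()])

    def compute(self):
        self.evaluations += 1
        return self._func()


source = LazyValue(lambda: [1, 2, 3])
shifted = source + 5            # deferred: no values have been touched
print(source.evaluations)       # 0
result = shifted.compute()      # the whole chain runs here
print(result)                   # [6, 7, 8]
```

The addition is only recorded when ``source + 5`` is written; the work happens when the result is requested, which is the same pattern as ``cube += 5`` followed later by ``cube.data``.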
84+
85+
Certain operations, including regridding and plotting, can only be run on real data.
86+
Calling such operations on lazy data will automatically realise your lazy data.
87+
88+
You can also realise (and so load into memory) your cube's lazy data if you 'touch' the data.
89+
To 'touch' the data means directly accessing the data by calling ``cube.data``,
90+
as in the previous example.
91+
92+
Core data
93+
^^^^^^^^^
94+
95+
Cubes have the concept of "core data". This returns the cube's data in its
96+
current state:
97+
98+
* If a cube has lazy data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
99+
will return the cube's lazy dask array. Calling the cube's
100+
:meth:`~iris.cube.Cube.core_data` method **will never realise** the cube's data.
101+
* If a cube has real data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
102+
will return the cube's real NumPy array.
103+
104+
For example::
105+
106+
>>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))
107+
>>> cube.has_lazy_data()
108+
True
109+
110+
>>> the_data = cube.core_data()
111+
>>> type(the_data)
112+
<class 'dask.array.core.Array'>
113+
>>> cube.has_lazy_data()
114+
True
115+
116+
>>> # Realise the lazy data.
117+
>>> cube.data
118+
>>> the_data = cube.core_data()
119+
>>> type(the_data)
120+
<class 'numpy.ndarray'>
121+
>>> cube.has_lazy_data()
122+
False
123+
124+
125+
Coordinates
126+
-----------
127+
128+
In the same way that Iris cubes contain a data array, Iris coordinates contain a
129+
points array and an optional bounds array.
130+
Coordinate points and bounds arrays can also be real or lazy:
131+
132+
* A :class:`~iris.coords.DimCoord` will only ever have **real** points and bounds
133+
arrays because of monotonicity checks that realise lazy arrays.
134+
* An :class:`~iris.coords.AuxCoord` can have **real or lazy** points and bounds.
135+
* An :class:`~iris.aux_factory.AuxCoordFactory` (or derived coordinate)
136+
can have **real or lazy** points and bounds. If all of the
137+
:class:`~iris.coords.AuxCoord` instances used to construct the derived coordinate
138+
have real points and bounds then the derived coordinate will have real points
139+
and bounds, otherwise the derived coordinate will have lazy points and bounds.
140+
141+
Iris cubes and coordinates have very similar interfaces, which extends to accessing
142+
coordinates' lazy points and bounds:
143+
144+
.. doctest::
145+
146+
>>> cube = iris.load_cube(iris.sample_data_path('hybrid_height.nc'))
147+
148+
>>> dim_coord = cube.coord('model_level_number')
149+
>>> print(dim_coord.has_lazy_points())
150+
False
151+
>>> print(dim_coord.has_bounds())
152+
False
153+
>>> print(dim_coord.has_lazy_bounds())
154+
False
155+
156+
>>> aux_coord = cube.coord('sigma')
157+
>>> print(aux_coord.has_lazy_points())
158+
True
159+
>>> print(aux_coord.has_bounds())
160+
True
161+
>>> print(aux_coord.has_lazy_bounds())
162+
True
163+
164+
# Realise the lazy points. This will **not** realise the lazy bounds.
165+
>>> points = aux_coord.points
166+
>>> print(aux_coord.has_lazy_points())
167+
False
168+
>>> print(aux_coord.has_lazy_bounds())
169+
True
170+
171+
>>> derived_coord = cube.coord('altitude')
172+
>>> print(derived_coord.has_lazy_points())
173+
True
174+
>>> print(derived_coord.has_bounds())
175+
True
176+
>>> print(derived_coord.has_lazy_bounds())
177+
True
178+
179+
.. note::
180+
Printing a lazy :class:`~iris.coords.AuxCoord` will realise its points and bounds arrays!
181+
182+
183+
Dask processing options
184+
-----------------------
185+
186+
As stated earlier in this user guide section, Iris uses dask to provide
187+
lazy data arrays for both Iris cubes and coordinates. Iris also uses dask
188+
functionality for processing deferred operations on lazy arrays.
189+
190+
Dask provides processing options to control how deferred operations on lazy arrays
191+
are computed. These options are set via the ``dask.set_options`` interface.
192+
We can make use of this functionality in Iris. This means we can
193+
control how dask arrays in Iris are processed, for example giving us power to
194+
run Iris processing in parallel.
195+
196+
Iris by default applies a single dask processing option. This specifies that
197+
all dask processing in Iris should be run in serial (that is, without any
198+
parallel processing enabled).
199+
200+
The dask processing option applied by Iris can be overridden by manually setting
201+
dask processing options for either or both of:
202+
203+
* the number of parallel workers to use,
204+
* the scheduler to use.
205+
206+
This must be done **before** importing Iris. For example, to specify that dask
207+
processing within Iris should use four workers in a thread pool::
208+
209+
>>> from multiprocessing.pool import ThreadPool
210+
>>> import dask
211+
>>> dask.set_options(get=dask.threaded.get, pool=ThreadPool(4))
212+
213+
>>> import iris
214+
>>> # Iris processing here...
215+
216+
.. note::
217+
These dask processing options will last for the lifetime of the Python session
218+
and must be re-applied in other or subsequent sessions.
219+
220+
Other dask processing options are also available. See the
221+
`dask documentation <http://dask.pydata.org/en/latest/scheduler-overview.html>`_
222+
for more information on setting dask processing options.
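
To give a feel for what a pool of four worker threads does with independent pieces of work, the sketch below uses only the standard library's ``ThreadPool``. The data and the ``chunk_sum`` function are hypothetical, and this is not how dask schedules its task graph; it only shows the general pattern of mapping a function over chunks in parallel.

```python
from multiprocessing.pool import ThreadPool


def chunk_sum(chunk):
    """Work done independently on one chunk of data."""
    return sum(chunk)


# Hypothetical data, already split into four chunks of 250 values each.
chunks = [list(range(start, start + 250)) for start in range(0, 1000, 250)]

pool = ThreadPool(4)  # four workers, as in the example above
try:
    partial_sums = pool.map(chunk_sum, chunks)  # one task per chunk
finally:
    pool.close()
    pool.join()

total = sum(partial_sums)
print(total)
```

``pool.map`` returns results in the same order as the input chunks, so the partial results can be combined without any extra bookkeeping.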
223+
224+
225+
Further reading
226+
---------------
227+
228+
This section of the Iris user guide provides a quick overview of real and lazy
229+
data within Iris. For more details on these and related concepts,
230+
see the whitepaper on lazy data.
