|
| 1 | +.. _real_and_lazy_data: |
| 2 | + |
| 3 | + |
| 4 | +.. testsetup:: * |
| 5 | + |
| 6 | + import dask.array as da |
| 7 | + import iris |
| 8 | + import numpy as np |
| 9 | + |
| 10 | + |
| 11 | +================== |
| 12 | +Real and Lazy Data |
| 13 | +================== |
| 14 | + |
| 15 | +We have seen in the :doc:`user_guide_introduction` section of the user guide that |
| 16 | +Iris cubes contain data and metadata about a phenomenon. The data element of a cube |
| 17 | +is always an array, but the array may be either "real" or "lazy". |
| 18 | + |
| 19 | +In this section of the user guide we will look specifically at the concepts of |
| 20 | +real and lazy data as they apply to the cube and other data structures in Iris. |
| 21 | + |
| 22 | + |
| 23 | +What is real and lazy data? |
| 24 | +--------------------------- |
| 25 | + |
| 26 | +In Iris, we use the term **real data** to describe data arrays that are loaded |
| 27 | +into memory. Real data is typically provided as a |
| 28 | +`NumPy array <https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html>`_, |
| 29 | +which has a shape and data type that are used to describe the array's data points. |
| 30 | +Each data point takes up a small amount of memory, which means large NumPy arrays can |
| 31 | +take up a large amount of memory. |
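As a rough illustration of this point, the sketch below uses plain NumPy (no Iris) to show how an array's memory footprint follows from its shape and data type. The shape is an arbitrary choice made for this example and is not taken from any Iris sample file.

```python
import numpy as np

# Hypothetical shape chosen for illustration: 365 daily fields on a
# 180 x 360 latitude-longitude grid, stored as 32-bit floats.
shape = (365, 180, 360)
array = np.zeros(shape, dtype=np.float32)

# Each float32 data point occupies 4 bytes.
bytes_per_point = array.dtype.itemsize

# The total footprint is the number of points times the size of each point.
expected_bytes = array.size * bytes_per_point
print(array.nbytes == expected_bytes)
print("{:.1f} MB".format(array.nbytes / 1e6))
```

Doubling any one dimension, or switching to a 64-bit data type, doubles the footprint.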
| 32 | + |
| 33 | +Conversely, we use the term **lazy data** to describe data that is not loaded into memory. |
| 34 | +(This is sometimes also referred to as **deferred data**.) |
| 35 | +In Iris, lazy data is provided as a |
| 36 | +`dask array <http://dask.pydata.org/en/latest/array-overview.html>`_. |
| 37 | +A dask array also has a shape and data type |
| 38 | +but typically the dask array's data points are not loaded into memory. |
| 39 | +Instead the data points are stored on disk and only loaded into memory in |
| 40 | +small chunks when absolutely necessary (see the section :ref:`when_real_data` |
| 41 | +for examples of when this might happen). |
| 42 | + |
| 43 | +The primary advantage of using lazy data is that it enables |
| 44 | +`out-of-core processing <https://en.wikipedia.org/wiki/Out-of-core_algorithm>`_; |
| 45 | +that is, the loading and manipulating of datasets that otherwise would not fit into memory. |
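To make the idea concrete, here is a minimal sketch using a bare dask array rather than an Iris cube. The array size and chunk shape are assumptions chosen so that the full array would be far too large for typical memory, while any single chunk is small.

```python
import dask.array as da

# A lazy array describing roughly 40 GB of float32 zeros.
# Nothing of that size is allocated when the array is created.
lazy = da.zeros((100000, 100000), chunks=(1000, 1000), dtype="float32")

# The shape, data type and nominal size are known without loading anything.
print(lazy.shape, lazy.dtype, lazy.nbytes)

# The array is split into a 100 x 100 grid of chunks.
print(lazy.numblocks)

# Computing on a small region only needs the chunk that covers it.
corner_sum = lazy[:1000, :1000].sum().compute()
print(corner_sum)
```

Only the chunks needed for the requested result are ever held in memory, which is what allows the full dataset to exceed the available memory.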
| 46 | + |
| 47 | +You can check whether a cube has real data or lazy data by using the method |
| 48 | +:meth:`~iris.cube.Cube.has_lazy_data`. For example:: |
| 49 | + |
| 50 | + >>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp')) |
| 51 | + >>> cube.has_lazy_data() |
| 52 | + True |
| 53 | + >>> # Realise the lazy data. |
| 54 | + >>> _ = cube.data |
| 55 | + >>> cube.has_lazy_data() |
| 56 | + False |
| 57 | + |
| 58 | + |
| 59 | +.. _when_real_data: |
| 60 | + |
| 61 | +When does my data become real? |
| 62 | +------------------------------ |
| 63 | + |
| 64 | +When you load a dataset using Iris, the data array will almost always initially be |
| 65 | +a lazy array. This section details some operations that will realise lazy data |
| 66 | +as well as some operations that will maintain lazy data. We use the term **realise** |
| 67 | +to mean converting lazy data into real data. |
| 68 | + |
| 69 | +Most operations on data arrays can be run equivalently on both real and lazy data. |
| 70 | +If the data array is real then the operation will be run on the data array |
| 71 | +immediately. The results of the operation will be available as soon as processing is completed. |
| 72 | +If the data array is lazy then the operation will be deferred and the data array will |
| 73 | +remain lazy until you request the result (such as when you access ``cube.data``):: |
| 74 | + |
| 75 | + >>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp')) |
| 76 | + >>> cube.has_lazy_data() |
| 77 | + True |
| 78 | + >>> cube += 5 |
| 79 | + >>> cube.has_lazy_data() |
| 80 | + True |
| 81 | + |
| 82 | +The process by which the operation is deferred until the result is requested is |
| 83 | +referred to as **lazy evaluation**. |
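The same deferral can be observed on a bare dask array, without involving Iris. The following is a minimal sketch using small made-up values. Here ``da.from_array`` wraps an in-memory NumPy array purely for demonstration, whereas Iris would normally wrap data held on disk.

```python
import dask.array as da
import numpy as np

# Small made-up values; the chunk shape is an arbitrary choice.
real = np.arange(12, dtype=np.float64).reshape(3, 4)
lazy = da.from_array(real, chunks=(3, 2))

# The addition is recorded in the dask graph but not yet computed.
lazy_result = lazy + 5
print(isinstance(lazy_result, da.Array))

# Requesting the result triggers the computation and returns a NumPy array.
real_result = lazy_result.compute()
print(isinstance(real_result, np.ndarray))
print(np.array_equal(real_result, real + 5))
```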
| 84 | + |
| 85 | +Certain operations, including regridding and plotting, can only be run on real data. |
| 86 | +Calling such operations on lazy data will automatically realise your lazy data. |
| 87 | + |
| 88 | +You can also realise (and so load into memory) your cube's lazy data if you 'touch' the data. |
| 89 | +To 'touch' the data means accessing it directly through ``cube.data``, |
| 90 | +as in the :meth:`~iris.cube.Cube.has_lazy_data` example above. |
| 91 | + |
| 92 | +Core data |
| 93 | +^^^^^^^^^ |
| 94 | + |
| 95 | +Cubes have the concept of "core data". A cube's core data is its data array in its |
| 96 | +current state, whether real or lazy: |
| 97 | + |
| 98 | + * If a cube has lazy data, calling the cube's :meth:`~iris.cube.Cube.core_data` method |
| 99 | + will return the cube's lazy dask array. Calling the cube's |
| 100 | + :meth:`~iris.cube.Cube.core_data` method **will never realise** the cube's data. |
| 101 | + * If a cube has real data, calling the cube's :meth:`~iris.cube.Cube.core_data` method |
| 102 | + will return the cube's real NumPy array. |
| 103 | + |
| 104 | +For example:: |
| 105 | + |
| 106 | + >>> cube = iris.load_cube(iris.sample_data_path('air_temp.pp')) |
| 107 | + >>> cube.has_lazy_data() |
| 108 | + True |
| 109 | + |
| 110 | + >>> the_data = cube.core_data() |
| 111 | + >>> type(the_data) |
| 112 | + <class 'dask.array.core.Array'> |
| 113 | + >>> cube.has_lazy_data() |
| 114 | + True |
| 115 | + |
| 116 | + >>> # Realise the lazy data. |
| 117 | + >>> _ = cube.data |
| 118 | + >>> the_data = cube.core_data() |
| 119 | + >>> type(the_data) |
| 120 | + <class 'numpy.ndarray'> |
| 121 | + >>> cube.has_lazy_data() |
| 122 | + False |
| 123 | + |
| 124 | + |
| 125 | +Coordinates |
| 126 | +----------- |
| 127 | + |
| 128 | +In the same way that Iris cubes contain a data array, Iris coordinates contain a |
| 129 | +points array and an optional bounds array. |
| 130 | +Coordinate points and bounds arrays can also be real or lazy: |
| 131 | + |
| 132 | + * A :class:`~iris.coords.DimCoord` will only ever have **real** points and bounds |
| 133 | + arrays because of monotonicity checks that realise lazy arrays. |
| 134 | + * An :class:`~iris.coords.AuxCoord` can have **real or lazy** points and bounds. |
| 135 | + * An :class:`~iris.aux_factory.AuxCoordFactory` (or derived coordinate) |
| 136 | + can have **real or lazy** points and bounds. If all of the |
| 137 | + :class:`~iris.coords.AuxCoord` instances used to construct the derived coordinate |
| 138 | + have real points and bounds then the derived coordinate will have real points |
| 139 | + and bounds, otherwise the derived coordinate will have lazy points and bounds. |
| 140 | + |
| 141 | +Iris cubes and coordinates have very similar interfaces, which extends to accessing |
| 142 | +coordinates' lazy points and bounds: |
| 143 | + |
| 144 | +.. doctest:: |
| 145 | + |
| 146 | + >>> cube = iris.load_cube(iris.sample_data_path('hybrid_height.nc')) |
| 147 | + |
| 148 | + >>> dim_coord = cube.coord('model_level_number') |
| 149 | + >>> print(dim_coord.has_lazy_points()) |
| 150 | + False |
| 151 | + >>> print(dim_coord.has_bounds()) |
| 152 | + False |
| 153 | + >>> print(dim_coord.has_lazy_bounds()) |
| 154 | + False |
| 155 | + |
| 156 | + >>> aux_coord = cube.coord('sigma') |
| 157 | + >>> print(aux_coord.has_lazy_points()) |
| 158 | + True |
| 159 | + >>> print(aux_coord.has_bounds()) |
| 160 | + True |
| 161 | + >>> print(aux_coord.has_lazy_bounds()) |
| 162 | + True |
| 163 | + |
| 164 | + >>> # Realise the lazy points. This will **not** realise the lazy bounds. |
| 165 | + >>> points = aux_coord.points |
| 166 | + >>> print(aux_coord.has_lazy_points()) |
| 167 | + False |
| 168 | + >>> print(aux_coord.has_lazy_bounds()) |
| 169 | + True |
| 170 | + |
| 171 | + >>> derived_coord = cube.coord('altitude') |
| 172 | + >>> print(derived_coord.has_lazy_points()) |
| 173 | + True |
| 174 | + >>> print(derived_coord.has_bounds()) |
| 175 | + True |
| 176 | + >>> print(derived_coord.has_lazy_bounds()) |
| 177 | + True |
| 178 | + |
| 179 | +.. note:: |
| 180 | + Printing a lazy :class:`~iris.coords.AuxCoord` will realise its points and bounds arrays! |
| 181 | + |
| 182 | + |
| 183 | +Dask processing options |
| 184 | +----------------------- |
| 185 | + |
| 186 | +As stated earlier in this user guide section, Iris uses dask to provide |
| 187 | +lazy data arrays for both Iris cubes and coordinates. Iris also uses dask |
| 188 | +functionality for processing deferred operations on lazy arrays. |
| 189 | + |
| 190 | +Dask provides processing options to control how deferred operations on lazy arrays |
| 191 | +are computed. These options are set via the ``dask.set_options`` interface. |
| 192 | +We can make use of this functionality in Iris. This means we can |
| 193 | +control how dask arrays in Iris are processed, for example giving us power to |
| 194 | +run Iris processing in parallel. |
| 195 | + |
| 196 | +Iris by default applies a single dask processing option. This specifies that |
| 197 | +all dask processing in Iris should be run in serial (that is, without any |
| 198 | +parallel processing enabled). |
| 199 | + |
| 200 | +The dask processing option applied by Iris can be overridden by manually setting |
| 201 | +dask processing options for either or both of: |
| 202 | + |
| 203 | + * the number of parallel workers to use, |
| 204 | + * the scheduler to use. |
| 205 | + |
| 206 | +This must be done **before** importing Iris. For example, to specify that dask |
| 207 | +processing within Iris should use four workers in a thread pool:: |
| 208 | + |
| 209 | + >>> from multiprocessing.pool import ThreadPool |
| 210 | + >>> import dask |
| 211 | + >>> dask.set_options(get=dask.threaded.get, pool=ThreadPool(4)) |
| 212 | + |
| 213 | + >>> import iris |
| 214 | + >>> # Iris processing here... |
| 215 | + |
| 216 | +.. note:: |
| 217 | + These dask processing options will last for the lifetime of the Python session |
| 218 | + and must be re-applied in other or subsequent sessions. |
| 219 | + |
| 220 | +Other dask processing options are also available. See the |
| 221 | +`dask documentation <http://dask.pydata.org/en/latest/scheduler-overview.html>`_ |
| 222 | +for more information on setting dask processing options. |
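Newer dask releases (0.18 and later) replaced ``dask.set_options`` with ``dask.config.set``. The sketch below shows equivalent scheduler choices on a bare dask array under that assumption. It does not involve Iris, and whether a given Iris release honours these settings is not covered here.

```python
import dask
import dask.array as da

# A small lazy array of ones; the shape and chunking are arbitrary.
lazy = da.ones((4, 4), chunks=(2, 2))
total = lazy.sum()

# Run the computation serially, using the synchronous scheduler.
with dask.config.set(scheduler="synchronous"):
    serial_result = total.compute()

# Run the same computation on a thread pool with four workers.
threaded_result = total.compute(scheduler="threads", num_workers=4)

print(serial_result == threaded_result)
```

Both schedulers produce the same result; only the way the chunks are processed differs.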
| 223 | + |
| 224 | + |
| 225 | +Further reading |
| 226 | +--------------- |
| 227 | + |
| 228 | +This section of the Iris user guide provides a quick overview of real and lazy |
| 229 | +data within Iris. For more details on these and related concepts, |
| 230 | +see the whitepaper on lazy data. |