diff --git a/_toc.yml b/_toc.yml index dabc1b9e7..38874183a 100644 --- a/_toc.yml +++ b/_toc.yml @@ -54,6 +54,8 @@ sections: - file: core/data-formats/netcdf-cf - file: core/xarray + sections: + - file: core/xarray/xarray-intro - part: Appendix chapters: diff --git a/core/xarray.md b/core/xarray.md index 057180b41..709906db3 100644 --- a/core/xarray.md +++ b/core/xarray.md @@ -4,4 +4,18 @@ This content is under construction! ``` -This section will contain tutorials on using [xarray](http://xarray.pydata.org/en/stable/) for analysis of gridded N-dimensional datasets. +This section contains tutorials on using [Xarray][xarray home]. Xarray is used widely in the geosciences and beyond for analysis of gridded N-dimensional datasets. + +--- + +From the [Xarray website][xarray home]: + +> Xarray (formerly Xray) is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun! +> +> Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures. +> +> Xarray is inspired by and borrows heavily from pandas, the popular data analysis package focused on labelled tabular data. It is particularly tailored to working with netCDF files, which were the source of xarray’s data model, and integrates tightly with dask for parallel computing. + +You should have a basic familiarity with [NumPy arrays](numpy) prior to working through the Xarray notebooks presented here. 
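To make the quoted pitch concrete before diving into the notebooks, here is a minimal sketch contrasting positional NumPy indexing with label-based Xarray selection (the station names and temperature values are invented for illustration):

```python
import numpy as np
import xarray as xr

# Three days of invented temperatures (K) at two hypothetical stations
data = np.array([[280.0, 282.0], [281.0, 283.0], [279.0, 284.0]])

# Plain NumPy: you must remember that axis 0 is time and axis 1 is station
plain = data[1, 0]

# Xarray: the dimensions carry names and coordinate labels
temps = xr.DataArray(
    data,
    dims=['time', 'station'],
    coords={
        'time': ['2021-06-01', '2021-06-02', '2021-06-03'],
        'station': ['BOU', 'DEN'],
    },
)
labeled = temps.sel(time='2021-06-02', station='BOU')
```

Both `plain` and `labeled` hold the same number, but the labeled request stays correct even if the array's axes are later reordered.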
+ +[xarray home]: http://xarray.pydata.org/en/stable/ diff --git a/core/xarray/nc-cf.ipynb b/core/xarray/nc-cf.ipynb new file mode 100644 index 000000000..11ed5e248 --- /dev/null +++ b/core/xarray/nc-cf.ipynb @@ -0,0 +1,482 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "f97468c3-b57e-428b-97ec-72a62b8c66e6", + "metadata": {}, + "source": [ + "# Introduction to Climate and Forecasting Metadata Conventions" + ] + }, + { + "cell_type": "markdown", + "id": "9cdf6f4b-63e5-4dc1-9308-9972b96e1982", + "metadata": {}, + "source": [ + "The Climate and Forecasting (CF) metadata conventions were created to standardize the metadata stored in atmospheric data files and thereby make data and research easier to reproduce. In the remainder of this notebook, we will introduce the CF data model and discuss some netCDF implementation details to consider when deciding how to write data with CF and netCDF. We will cover gridded data in this notebook, with more in-depth examples provided in the full [CF notebook](https://github.com/Unidata/python-workshop/blob/master/notebooks/CF%20Conventions/NetCDF%20and%20CF%20-%20The%20Basics.ipynb). Xarray makes the creation of netCDF files with proper metadata simple and straightforward, so we will use it here instead of the netCDF-Python library.\n", + "\n", + "This notebook assumes a basic understanding of netCDF." + ] + }, + { + "cell_type": "markdown", + "id": "4b2c17b9-eb79-4a71-8608-2486dd4d6a26", + "metadata": {}, + "source": [ + "\n", + "## Gridded Data\n", + "Let's say we're working with some numerical weather forecast model output. 
Let's walk through the steps necessary to store this data in netCDF, using the Climate and Forecasting metadata conventions to ensure that our data are available to as many tools as possible.\n", + "\n", + "To start, let's assume the following about our data:\n", + "* It corresponds to forecast three-dimensional temperature at several times\n", + "* The native coordinate system of the model is on a regular grid that represents the Earth on a Lambert conformal projection.\n", + "\n", + "We'll also go ahead and generate some arrays of data below to get started:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8771f5b0-5d88-402d-b22b-85fad2ef9d8e", + "metadata": {}, + "outputs": [], + "source": [ + "# Import some useful Python tools\n", + "from datetime import datetime, timedelta\n", + "\n", + "import numpy as np\n", + "import xarray as xr\n", + "\n", + "# Twelve hours of hourly output starting at 22Z today\n", + "start = datetime.utcnow().replace(hour=22, minute=0, second=0, microsecond=0)\n", + "times = np.array([start + timedelta(hours=h) for h in range(13)])\n", + "\n", + "# 3km spacing in x and y\n", + "x = np.arange(-150, 153, 3)\n", + "y = np.arange(-100, 100, 3)\n", + "\n", + "# Standard pressure levels in hPa\n", + "press = np.array([1000, 925, 850, 700, 500, 300, 250])\n", + "\n", + "temps = np.random.randn(times.size, press.size, y.size, x.size)" + ] + }, + { + "cell_type": "markdown", + "id": "3bcb6d54-94b0-406b-b91a-318f396f7dbb", + "metadata": {}, + "source": [ + "Time coordinates must contain a `units` attribute with a string value of a form similar to `'seconds since 2019-01-06 12:00:00.00'`. 'seconds', 'minutes', 'hours', and 'days' are the most commonly used units for time. Due to the variable length of months and years, they are not recommended.\n", + "\n", + "Before we can write data, we first need to convert our list of Python `datetime` instances to numeric values. We can use the `cftime` library to perform this conversion using the unit string defined above."
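For standard calendars, the conversion that `date2num` performs is plain offset arithmetic from the reference date in the unit string. A dependency-free sketch of the same idea (illustrative only; not the actual `cftime` implementation):

```python
from datetime import datetime, timedelta

def hours_since(times, reference):
    """Express datetimes as float hours since a reference datetime,
    mirroring a CF unit string like 'hours since 2021-06-01 00:00'."""
    return [(t - reference).total_seconds() / 3600 for t in times]

# Thirteen hourly steps starting at 22Z on a fixed date
start = datetime(2021, 6, 1, 22, 0)
steps = [start + timedelta(hours=h) for h in range(13)]

vals = hours_since(steps, datetime(2021, 6, 1))
# vals runs from 22.0 (22Z on the reference day) up to 34.0 in steps of 1.0
```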
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4b497941-206f-48e0-b97d-1ae011f831d2", + "metadata": {}, + "outputs": [], + "source": [ + "from cftime import date2num\n", + "\n", + "time_units = 'hours since {:%Y-%m-%d 00:00}'.format(times[0])\n", + "time_vals = date2num(times, time_units)\n", + "time_vals" + ] + }, + { + "cell_type": "markdown", + "id": "387019b8-08ed-4c20-b95c-322a232d63f9", + "metadata": {}, + "source": [ + "With the time values converted, we are ready to bundle our arrays, with `forecast_time` as a coordinate, into an Xarray `Dataset`." + ] + }, + { + "cell_type": "markdown", + "id": "f7670c04-9937-486a-9afe-543cfc750f40", + "metadata": {}, + "source": [ + "### Convert arrays into Xarray Dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ebbbfe6b-4dd0-4542-bacb-08c05169c1ba", + "metadata": {}, + "outputs": [], + "source": [ + "ds = xr.Dataset(\n", + " {'temperature': (['time', 'z', 'y', 'x'], temps, {'units': 'Kelvin'})},\n", + " coords={\n", + " 'x_dist': (['x'], x, {'units': 'km'}),\n", + " 'y_dist': (['y'], y, {'units': 'km'}),\n", + " 'pressure': (['z'], press, {'units': 'hPa'}),\n", + " 'forecast_time': (['time'], times),\n", + " },\n", + ")\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "2bd41835-ab7e-4631-9001-8d0b9ea64845", + "metadata": {}, + "source": [ + "Due to how Xarray handles time units, we need to set the units in the `forecast_time` coordinate's `encoding`, rather than in its attributes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "779e1a25-ff73-4de2-aecc-cde68f4e743d", + "metadata": {}, + "outputs": [], + "source": [ + "ds.forecast_time.encoding['units'] = time_units" + ] + }, + { + "cell_type": "markdown", + "id": "88601434-6aca-44e0-b8a0-b214b3ed6f96", + "metadata": {}, + "source": [ + "If we look at our data variable, we can see the units printed out, so they were attached properly!"
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9e6d4ed1-3633-45d6-a31e-f1ac8c4184ea", + "metadata": {}, + "outputs": [], + "source": [ + "ds.temperature" + ] + }, + { + "cell_type": "markdown", + "id": "d64fc7cc-5d06-4830-bb37-554a6c4cdfd9", + "metadata": {}, + "source": [ + "We're going to start by adding some global attribute metadata. These are recommendations from the standard (not required), but they're easy to add and help users keep the data straight, so let's go ahead and do it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "193c43fa-8988-4e7a-90ac-f91240c23dd2", + "metadata": {}, + "outputs": [], + "source": [ + "ds.attrs['Conventions'] = 'CF-1.7'\n", + "ds.attrs['title'] = 'Forecast model run'\n", + "ds.attrs['institution'] = 'Unidata'\n", + "ds.attrs['source'] = 'WRF-1.5'\n", + "ds.attrs['history'] = str(datetime.utcnow()) + ' Python'\n", + "ds.attrs['references'] = ''\n", + "ds.attrs['comment'] = ''\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "ffe50ca8-1fd7-472d-9a4d-1e4c0ed8d0a8", + "metadata": {}, + "source": [ + "We can also add attributes to the `temperature` variable itself to define its metadata. The CF conventions require a `units` attribute to be set for all variables that represent a dimensional quantity. The value of this attribute needs to be parsable by the UDUNITS library. Here we have already set it to a value of `'Kelvin'`. We also set the standard (optional) attributes of `long_name` and `standard_name`. The former contains a longer description of the variable, while the latter comes from a controlled vocabulary in the CF conventions. This allows users of data to understand, in a standard fashion, what a variable represents. If we had missing values, we could also set the `missing_value` attribute to an appropriate value."
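One caveat: `missing_value` merely documents the sentinel; nothing is masked automatically in a plain array. As a sketch of how a consumer of the file might honor it (the sample values below are invented):

```python
import numpy as np

missing_value = -9999  # sentinel documented in the variable's attributes
raw = np.array([271.5, -9999.0, 268.2, -9999.0, 275.0])

# Mask the sentinel so downstream statistics ignore it
temps_masked = np.ma.masked_equal(raw, missing_value)
mean_temp = temps_masked.mean()  # averages only the three valid values
```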
+ ] + }, + { + "cell_type": "markdown", + "id": "a60e0ef0-c9a5-411b-a459-4ecf9e32f62d", + "metadata": {}, + "source": [ + "> **NASA Dataset Interoperability Recommendations:**\n", + ">\n", + "> Section 2.2 - Include Basic CF Attributes\n", + ">\n", + "> Include where applicable: `units`, `long_name`, `standard_name`, `valid_min` / `valid_max`, `scale_factor` / `add_offset` and others." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9ee88a8-5364-41aa-b68b-f01c252afa48", + "metadata": {}, + "outputs": [], + "source": [ + "ds.temperature.attrs['standard_name'] = 'air_temperature'\n", + "ds.temperature.attrs['long_name'] = 'Forecast air temperature'\n", + "ds.temperature.attrs['missing_value'] = -9999\n", + "ds.temperature" + ] + }, + { + "cell_type": "markdown", + "id": "9cd79d24-906b-4a37-83ca-410b6f4ca0b3", + "metadata": {}, + "source": [ + "### Coordinate variables" + ] + }, + { + "cell_type": "markdown", + "id": "b9974ccc-6005-4ef0-804a-5a8a532fcaf2", + "metadata": {}, + "source": [ + "To properly orient our data in time and space, we need to go beyond dimensions (which define common sizes and alignment) and include values along these dimensions, which are called \"Coordinate Variables\". Generally, these are defined by creating a one dimensional variable with the same name as the respective dimension.\n", + "\n", + "To start, we define variables which define our `x` and `y` coordinate values. These variables include `standard_name`s which allow associating them with projections (more on this later) as well as an optional `axis` attribute to make clear what standard direction this coordinate refers to." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "24618344-93da-4034-9855-f62270440b65", + "metadata": {}, + "outputs": [], + "source": [ + "ds.x_dist.attrs['axis'] = 'X' # Optional\n", + "ds.x_dist.attrs['standard_name'] = 'projection_x_coordinate'\n", + "ds.x_dist.attrs['long_name'] = 'x-coordinate in projected coordinate system'\n", + "\n", + "ds.y_dist.attrs['axis'] = 'Y' # Optional\n", + "ds.y_dist.attrs['standard_name'] = 'projection_y_coordinate'\n", + "ds.y_dist.attrs['long_name'] = 'y-coordinate in projected coordinate system'" + ] + }, + { + "cell_type": "markdown", + "id": "bce45e18-54dc-4704-b043-41979846866f", + "metadata": {}, + "source": [ + "We also define a coordinate variable `pressure` to reference our data in the vertical dimension. The `standard_name` of `'air_pressure'` is sufficient to identify this coordinate variable as the vertical axis, but let's go ahead and specify the `axis` as well. We also specify the attribute `positive` to indicate whether the variable increases when going up or down. In the case of pressure, this is technically optional."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4e7ebe77-00df-463a-a927-4db5ddb9c278", + "metadata": {}, + "outputs": [], + "source": [ + "ds.pressure.attrs['axis'] = 'Z' # Optional\n", + "ds.pressure.attrs['standard_name'] = 'air_pressure'\n", + "ds.pressure.attrs['positive'] = 'down' # Optional" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "062cb0b6-1f7d-46da-adf4-18cf151728ec", + "metadata": {}, + "outputs": [], + "source": [ + "ds.forecast_time.attrs['axis'] = 'T' # Optional\n", + "ds.forecast_time.attrs['standard_name'] = 'time' # Optional\n", + "ds.forecast_time.attrs['long_name'] = 'time'" + ] + }, + { + "cell_type": "markdown", + "id": "405e0e5e-1e76-4331-ab34-26a17af2edd7", + "metadata": {}, + "source": [ + "### Auxiliary Coordinates" + ] + }, + { + "cell_type": "markdown", + "id": "473b4ea3-1754-4d5d-bc9f-3ce11cc41a4b", + "metadata": {}, + "source": [ + "Our data are still not CF-compliant because they do not contain latitude and longitude information, which is needed to properly locate the data. To solve this, we need to add variables with latitude and longitude. These are called \"auxiliary coordinate variables\", not because they are extra, but because they are not simple one-dimensional variables.\n", + "\n", + "Below, we first generate longitude and latitude values from our projected coordinates using the `pyproj` library." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e4ae6fe-e1d5-467f-bdca-f2becf562a12", + "metadata": {}, + "outputs": [], + "source": [ + "from pyproj import Proj\n", + "\n", + "X, Y = np.meshgrid(x, y)\n", + "lcc = Proj({'proj': 'lcc', 'lon_0': -105, 'lat_0': 40, 'a': 6371000.0, 'lat_1': 25})\n", + "lon, lat = lcc(X * 1000, Y * 1000, inverse=True)" + ] + }, + { + "cell_type": "markdown", + "id": "13a22c15-34e8-4dd3-8c15-bc7aa6b71f6d", + "metadata": {}, + "source": [ + "Now we can create the needed variables. Both are dimensioned on `y` and `x` and are two-dimensional. 
The longitude variable is identified as containing longitude information by its required units of `'degrees_east'`, as well as by the optional `standard_name` attribute of `'longitude'`. The case is the same for latitude, except the units are `'degrees_north'` and the `standard_name` is `'latitude'`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c1789009-cc63-4ae0-9ec7-445fe2c76afb", + "metadata": {}, + "outputs": [], + "source": [ + "ds = ds.assign_coords(lon=(['y', 'x'], lon))\n", + "ds = ds.assign_coords(lat=(['y', 'x'], lat))\n", + "ds" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "09403fc0-9dfa-49de-85e1-ea282dadf85d", + "metadata": {}, + "outputs": [], + "source": [ + "ds.lon.attrs['units'] = 'degrees_east'\n", + "ds.lon.attrs['standard_name'] = 'longitude' # Optional\n", + "ds.lon.attrs['long_name'] = 'longitude'\n", + "\n", + "ds.lat.attrs['units'] = 'degrees_north'\n", + "ds.lat.attrs['standard_name'] = 'latitude' # Optional\n", + "ds.lat.attrs['long_name'] = 'latitude'" + ] + }, + { + "cell_type": "markdown", + "id": "dc13dd93-53e0-4f25-81df-8d9d2fbbd67d", + "metadata": {}, + "source": [ + "With the variables created, CF asks that the `temperature` variable's `coordinates` attribute be set to a space-separated list of the auxiliary coordinate variable names. Because we attached `lon` and `lat` with `assign_coords`, Xarray tracks this relationship for us and writes the `coordinates` attribute automatically when the file is saved:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3ae0ebf1-84d3-48ac-a6e5-fdbe90b3de07", + "metadata": {}, + "outputs": [], + "source": [ + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "707029ae-c3a3-4112-8a37-3081c334764b", + "metadata": {}, + "source": [ + "### Coordinate System Information" + ] + }, + { + "cell_type": "markdown", + "id": "e3cc5520-ff49-47b9-9225-f59b34199d13", + "metadata": {}, + "source": [ + "With our data specified on a Lambert conformal projected grid, it would be good to include this information in our metadata. 
We can do this using a \"grid mapping\" variable. This uses a dummy scalar variable as a namespace for holding all of the required information. Relevant variables then reference the dummy variable with their `grid_mapping` attribute.\n", + "\n", + "Below we create a variable and set it up for a Lambert conformal conic projection on a spherical earth. The `grid_mapping_name` attribute describes which of the CF-supported grid mappings we are specifying. The names of additional attributes vary between the mappings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "99f3a5fb-9eb1-44ef-a3ce-9fdc2426e392", + "metadata": {}, + "outputs": [], + "source": [ + "ds['lambert_projection'] = int()\n", + "ds.lambert_projection.attrs['grid_mapping_name'] = 'lambert_conformal_conic'\n", + "ds.lambert_projection.attrs['standard_parallel'] = 25.0\n", + "ds.lambert_projection.attrs['latitude_of_projection_origin'] = 40.0\n", + "ds.lambert_projection.attrs['longitude_of_central_meridian'] = -105.0\n", + "ds.lambert_projection.attrs['semi_major_axis'] = 6371000.0\n", + "ds.lambert_projection" + ] + }, + { + "cell_type": "markdown", + "id": "31e6165d-7ee0-41cb-87bc-3538b4847fe6", + "metadata": {}, + "source": [ + "Now that we've created the variable, all that's left is to set the `grid_mapping` attribute on our `temperature` variable to the name of our dummy variable:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "11c0ba6e-e21b-4ba1-ac4b-8ecf380803d9", + "metadata": {}, + "outputs": [], + "source": [ + "ds.temperature.attrs['grid_mapping'] = 'lambert_projection'\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "4358f871-3e77-4786-b6d1-b4173af0a35e", + "metadata": {}, + "source": [ + "### Write to NetCDF" + ] + }, + { + "cell_type": "markdown", + "id": "263e3978-d8f5-409c-8d12-2273bbba4df1", + "metadata": {}, + "source": [ + "Xarray has built-in support for a few flavors of netCDF. 
Here we'll write a netCDF4 file from our Dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a37d473a-2ae0-42e8-9707-30fc64c8d794", + "metadata": {}, + "outputs": [], + "source": [ + "ds.to_netcdf('test_netcdf.nc', format='NETCDF4')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ab005c8-9297-4fee-8571-dcfc9c90bd2d", + "metadata": { + "jupyter": { + "source_hidden": true + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!ncdump -h test_netcdf.nc" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/core/xarray/xarray-intro.ipynb b/core/xarray/xarray-intro.ipynb new file mode 100644 index 000000000..5b1d2dd7c --- /dev/null +++ b/core/xarray/xarray-intro.ipynb @@ -0,0 +1,955 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![xarray Logo](http://xarray.pydata.org/en/stable/_static/dataset-diagram-logo.png \"xarray Logo\")\n", + "\n", + "# Introduction to Xarray" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "This notebook will introduce the basics of gridded, labeled data with Xarray. Since Xarray introduces additional abstractions on top of plain arrays of data, our goal is to show why these abstractions are useful and how they frequently lead to simpler, more robust code.\n", + "\n", + "We'll cover these topics:\n", + "\n", + "1. Create a `DataArray`, one of the core object types in Xarray\n", + "1. 
Understand how to use named coordinates and metadata in a `DataArray`\n", + "1. Combine individual `DataArray`s into a `Dataset`, the other core object type in Xarray\n", + "1. Subset, slice, and interpolate the data using named coordinates\n", + "1. Open netCDF data using Xarray\n", + "1. Basic subsetting and aggregation of a `Dataset`\n", + "1. Brief introduction to plotting with Xarray" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "| Concepts | Importance | Notes |\n", + "| --- | --- | --- |\n", + "| [NumPy Basics](../numpy/numpy-basics) | Necessary | |\n", + "| [Intermediate NumPy](../numpy/intermediate-numpy) | Helpful | Familiarity with indexing and slicing arrays |\n", + "| [NumPy Broadcasting](../numpy/numpy-broadcasting) | Helpful | Familiarity with array arithmetic and broadcasting |\n", + "| [Introduction to Pandas](../pandas/pandas_fullNotebook) | Helpful | Familiarity with labeled data |\n", + "| [Datetime](../datetime/datetime) | Helpful | Familiarity with time formats and the `timedelta` object |\n", + "| [Understanding of NetCDF](some-link-to-external-resource) | Helpful | Familiarity with metadata structure |\n", + "\n", + "- **Experience level**: **intermediate**\n", + "- **Time to learn**: **medium**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Imports" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Tip: You will often see the nickname xr used as an abbreviation for xarray in the import statement, just like numpy is often imported as np.
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from datetime import timedelta\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "import xarray as xr\n", + "from pythia_datasets import DATASETS # Project Pythia's custom store of example data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Introducing the `DataArray` and `Dataset`\n", + "\n", + "Xarray expands on the capabilities of NumPy arrays, providing a great deal of streamlined data manipulation. It is similar in that respect to Pandas, but whereas Pandas excels at working with tabular data, Xarray is focused on N-dimensional arrays of data (i.e. grids). Its interface is based largely on the netCDF data model (variables, attributes, and dimensions), but it goes beyond the traditional netCDF interfaces to provide functionality similar to netCDF-java's [Common Data Model (CDM)](https://docs.unidata.ucar.edu/netcdf-java/current/userguide/common_data_model_overview.html). " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creation of a `DataArray` object\n", + "\n", + "The `DataArray` is one of the basic building blocks of Xarray (see docs [here](http://xarray.pydata.org/en/stable/user-guide/data-structures.html#dataarray)). It provides a `numpy.ndarray`-like object that expands to provide two critical pieces of functionality:\n", + "\n", + "1. Coordinate names and values are stored with the data, making slicing and indexing much more powerful\n", + "2. It has a built-in container for attributes\n", + "\n", + "Here we'll initialize a `DataArray` object by wrapping a plain NumPy array, and explore a few of its properties."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Generate a random NumPy array\n", + "\n", + "For our first example, we'll just create a random array of \"temperature\" data in units of Kelvin:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "slideshow": { + "slide_type": "fragment" + } + }, + "outputs": [], + "source": [ + "data = 283 + 5 * np.random.randn(5, 3, 4)\n", + "data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Wrap the array: first attempt\n", + "\n", + "Now we create a basic `DataArray` just by passing our plain `data` as input:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temp = xr.DataArray(data)\n", + "temp" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note two things:\n", + "\n", + "1. Xarray generates some basic dimension names for us (`dim_0`, `dim_1`, `dim_2`). We'll improve this with better names in the next example.\n", + "2. Wrapping the numpy array in a `DataArray` gives us a rich display in the notebook! (Try clicking the array symbol to expand or collapse the view)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Assign dimension names\n", + "\n", + "Much of the power of Xarray comes from making use of named dimensions. So let's add some more useful names! We can do that by passing an ordered list of names using the keyword argument `dims`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temp = xr.DataArray(data, dims=['time', 'lat', 'lon'])\n", + "temp" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is already an improvement over a plain NumPy array, because we have names for each of the dimensions (or axes in NumPy parlance). 
Even better, we can take arrays representing the values for the coordinates for each of these dimensions and associate them with the data when we create the `DataArray`. We'll see this in the next example." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create a `DataArray` with named Coordinates\n", + "\n", + "#### Make time and space coordinates\n", + "\n", + "Here we will use [Pandas](../pandas) to create an array of [datetime data](../datetime), which we will then use to create a `DataArray` with a named coordinate `time`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "times = pd.date_range('2018-01-01', periods=5)\n", + "times" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll also create arrays to represent sample longitude and latitude:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lons = np.linspace(-120, -60, 4)\n", + "lats = np.linspace(25, 55, 3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Initialize the `DataArray` with complete coordinate info\n", + "\n", + "When we create the `DataArray` instance, we pass in the arrays we just created:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temp = xr.DataArray(data, coords=[times, lats, lons], dims=['time', 'lat', 'lon'])\n", + "temp" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Set useful attributes\n", + "\n", + "...and while we're at it, we can also set some attribute metadata:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temp.attrs['units'] = 'kelvin'\n", + "temp.attrs['standard_name'] = 'air_temperature'\n", + "\n", + "temp" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + 
"#### Attributes are not preserved by default!\n", + "\n", + "Notice what happens if we perform a mathematical operation with the `DataArray`: the coordinate values persist, but the attributes are lost. Xarray does this because it is very challenging to know whether the attribute metadata is still correct or appropriate after arbitrary arithmetic operations.\n", + "\n", + "To illustrate this, we'll do a simple unit conversion from Kelvin to Celsius:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temp_in_celsius = temp - 273.15\n", + "temp_in_celsius" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For an in-depth discussion of how Xarray handles metadata, start in the Xarray docs [here](http://xarray.pydata.org/en/stable/getting-started-guide/faq.html#approach-to-metadata)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The `Dataset`: a container for `DataArray`s with shared coordinates\n", + "\n", + "Along with `DataArray`, the other key object type in Xarray is the `Dataset`: a dictionary-like container that holds one or more `DataArray`s, which can also optionally share coordinates (see docs [here](http://xarray.pydata.org/en/stable/user-guide/data-structures.html#dataset)).\n", + "\n", + "The most common way to create a `Dataset` object is to load data from a file (see [below](#Opening-netCDF-data)). Here, instead, we will create another `DataArray` and combine it with our `temp` data.\n", + "\n", + "This will illustrate how the information about common coordinate axes is used." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Create a pressure `DataArray` using the same coordinates\n", + "\n", + "This code mirrors how we created the `temp` object above."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pressure_data = 1000.0 + 5 * np.random.randn(5, 3, 4)\n", + "pressure = xr.DataArray(\n", + " pressure_data, coords=[times, lats, lons], dims=['time', 'lat', 'lon']\n", + ")\n", + "pressure.attrs['units'] = 'hPa'\n", + "pressure.attrs['standard_name'] = 'air_pressure'\n", + "\n", + "pressure" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Create a `Dataset` object\n", + "\n", + "Each `DataArray` in our `Dataset` needs a name! \n", + "\n", + "The most straightforward way to create a `Dataset` with our `temp` and `pressure` arrays is to pass a dictionary using the keyword argument `data_vars`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds = xr.Dataset(data_vars={'Temperature': temp, 'Pressure': pressure})\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that the `Dataset` object `ds` is aware that both data arrays sit on the same coordinate axes." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Access Data variables and Coordinates in a `Dataset`\n", + "\n", + "We can pull out any of the individual `DataArray` objects in a few different ways.\n", + "\n", + "Using the \"dot\" notation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.Pressure" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "... or using dictionary access like this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds['Pressure']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll return to the `Dataset` object when we start loading data from files." 
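One caveat about the two access styles: dot access only works when the variable name is a valid Python identifier that doesn't collide with an existing attribute or method, while dictionary access always works. A small sketch (the variable names here are invented):

```python
import numpy as np
import xarray as xr

ds_demo = xr.Dataset(
    {
        'Pressure': ('x', np.array([1000.0, 995.0])),
        'Temperature 2m': ('x', np.array([283.0, 281.0])),  # space in the name
    }
)

p_dot = ds_demo.Pressure         # dot access works for a simple name
p_key = ds_demo['Pressure']      # dictionary access is equivalent here
t2m = ds_demo['Temperature 2m']  # only dictionary access can reach this one
```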
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Subsetting and selection by coordinate values\n", + "\n", + "Much of the power of labeled coordinates comes from the ability to select data based on coordinate names and values, rather than array indices. We'll explore this briefly here.\n", + "\n", + "### NumPy-like selection\n", + "\n", + "Suppose we want to extract all the spatial data for one single date: January 2, 2018. It's possible to achieve that with NumPy-like index selection:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "indexed_selection = temp[1, :, :] # Index 1 along axis 0 is the time slice we want...\n", + "indexed_selection" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "HOWEVER, notice that this requires us (the user / programmer) to have **detailed knowledge** of the order of the axes and the meaning of the indices along those axes!\n", + "\n", + "_**Named coordinates free us from this burden...**_" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Selecting with `.sel()`\n", + "\n", + "We can instead select data based on coordinate values using the `.sel()` method, which takes one or more named coordinate(s) as keyword argument:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "named_selection = temp.sel(time='2018-01-02')\n", + "named_selection" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We got the same result, but \n", + "- we didn't have to know anything about how the array was created or stored\n", + "- our code is agnostic about how many dimensions we are dealing with\n", + "- the intended meaning of our code is much clearer!" 
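Between pure positional indexing and fully label-based `.sel()`, Xarray also offers `.isel()`: it still takes integer positions, but the dimension is named, so axis order no longer matters. A sketch on a small synthetic array shaped like `temp` above:

```python
import numpy as np
import pandas as pd
import xarray as xr

data = np.arange(60.0).reshape(5, 3, 4)
times = pd.date_range('2018-01-01', periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
temp = xr.DataArray(data, coords=[times, lats, lons], dims=['time', 'lat', 'lon'])

by_position = temp[1, :, :]             # must know that time is axis 0
by_named_position = temp.isel(time=1)   # integer position, named dimension
by_label = temp.sel(time='2018-01-02')  # label-based: clearest of all
# All three select the same 3x4 slice
```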
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Approximate selection and interpolation\n", + "\n", + "With time and space data, we frequently want to sample \"near\" the coordinate points in our dataset. Here are a few simple ways to achieve that.\n", + "\n", + "#### Nearest-neighbor sampling\n", + "\n", + "Suppose we want to sample the nearest datapoint within 2 days of date `2018-01-07`. Since the last day on our `time` axis is `2018-01-05`, this is well-posed.\n", + "\n", + "`.sel` has the flexibility to perform nearest neighbor sampling, taking an optional tolerance:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=2))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "where we see that `.sel` indeed pulled out the data for date `2018-01-05`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Interpolation\n", + "\n", + "Suppose we want to extract a timeseries for Boulder (40°N, 105°W). Since `lon=-105` is _not_ a point on our longitude axis, this requires interpolation between data points.\n", + "\n", + "The `.interp()` method (see the docs [here](http://xarray.pydata.org/en/stable/interpolation.html)) works similarly to `.sel()`. Using `.interp()`, we can interpolate to any latitude/longitude location:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temp.interp(lon=-105, lat=40)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "    Note: Xarray's interpolation functionality requires the SciPy package!\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Slicing along coordinates\n", + "\n", + "Frequently we want to select a range (or _slice_) along one or more coordinate(s). We can achieve this by passing a Python [slice](https://docs.python.org/3/library/functions.html#slice) object to `.sel()`, as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temp.sel(\n", + " time=slice('2018-01-01', '2018-01-03'), lon=slice(-110, -70), lat=slice(25, 45)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " Note: The calling sequence for slice always looks like slice(start, stop[, step]).\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice how the length of each coordinate axis has changed due to our slicing." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### One more selection method: `.loc`\n", + "\n", + "All of these operations can also be done within square brackets on the `.loc` attribute of the `DataArray`:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temp.loc['2018-01-02']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is sort of in between the NumPy-style selection\n", + "```\n", + "temp[1,:,:]\n", + "```\n", + "and the fully label-based selection using `.sel()`\n", + "\n", + "With `.loc`, we make use of the coordinate *values*, but lose the ability to specify the *names* of the various dimensions. Instead, the slicing must be done in the correct order:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temp.loc['2018-01-01':'2018-01-03', 25:45, -110:-70]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One advantage of using `.loc` is that we can use NumPy-style slice notation like `25:45`, rather than the more verbose `slice(25,45)`. 
But of course explicit `slice()` objects also work:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temp.loc['2018-01-01':'2018-01-03', slice(25, 45), -110:-70]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What *doesn't* work is passing the slices in a different order:"
\n", + " Tip: Here we're getting the data from Project Pythia's custom library of example data, which we already imported above with from pythia_datasets import DATASETS. The DATASETS.fetch() method will automatically download and cache our example data file NARR_19930313_0000.nc locally.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "filepath = DATASETS.fetch('NARR_19930313_0000.nc')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once we have a valid path to a data file that Xarray knows how to read, we can open it like this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds = xr.open_dataset(filepath)\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Subsetting the `Dataset`\n", + "\n", + "Our call to `xr.open_dataset()` above returned a `Dataset` object that we've decided to call `ds`. We can then pull out individual fields:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds.isobaric1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "(recall that we can also use dictionary syntax like `ds['isobaric1']` to do the same thing)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`Dataset`s also support much of the same subsetting operations as `DataArray`, but will perform the operation on all data:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds_1000 = ds.sel(isobaric1=1000.0)\n", + "ds_1000" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And further subsetting to a single `DataArray`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ds_1000.Temperature_isobaric" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Aggregation operations\n", + "\n", + "Not only can you use the named dimensions for manual slicing and indexing of data, but you can also use it to control aggregation operations, like `std` (standard deviation):" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "u_winds = ds['u-component_of_wind_isobaric']\n", + "u_winds.std(dim=['x', 'y'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "    Note: Aggregation methods for Xarray objects operate over the named coordinate dimension(s) specified by the keyword argument dim. Compare this to NumPy, where aggregations operate over specified numbered axes.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using the sample dataset, we can calculate the mean temperature profile (temperature as a function of pressure) over Colorado within this dataset. For this exercise, consider the bounds of Colorado to be:\n", + " * x: -182km to 424km\n", + " * y: -1450km to -990km\n", + " \n", + "(37°N to 41°N and 102°W to 109°W projected to Lambert Conformal projection coordinates)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temps = ds.Temperature_isobaric\n", + "co_temps = temps.sel(x=slice(-182, 424), y=slice(-1450, -990))\n", + "prof = co_temps.mean(dim=['x', 'y'])\n", + "prof" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Plotting with Xarray\n", + "\n", + "Another major benefit of using labeled data structures is that they enable automated plotting with sensible axis labels. \n", + "\n", + "### Simple visualization with `.plot()`\n", + "\n", + "Much like we saw in [Pandas](../pandas/pandas_fullNotebook), Xarray includes an interface to [Matplotlib](../matplotlib) that we can access through the `.plot()` method of every `DataArray`.\n", + "\n", + "For quick and easy data exploration, we can just call `.plot()` without any modifiers:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prof.plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here Xarray has generated a line plot of the temperature data against the coordinate variable `isobaric1`. The metadata are also used to auto-generate axis labels and units."
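+    ,
+    "\n",
+    "\n",
+    "The label text comes from the array's metadata: roughly the variable's `long_name` (or its name) plus its `units` attribute. A minimal, self-contained sketch of that behavior with toy profile data (assumes Matplotlib is installed and can run headless):\n",
+    "\n",
+    "```python\n",
+    "import matplotlib\n",
+    "matplotlib.use('Agg')  # non-interactive backend so this runs headless\n",
+    "import numpy as np\n",
+    "import xarray as xr\n",
+    "\n",
+    "# Toy 1D profile (hypothetical values) with a name and units attribute\n",
+    "profile = xr.DataArray(np.linspace(290, 230, 8), dims=['level'],\n",
+    "                       coords={'level': np.arange(8)}, name='temperature')\n",
+    "profile.attrs['units'] = 'kelvin'\n",
+    "lines = profile.plot()  # the y label is built from name + units\n",
+    "ylabel = lines[0].axes.get_ylabel()\n",
+    "```"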
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Customizing the plot\n", + "\n", + "As in Pandas, the `.plot()` method is mostly just a wrapper to Matplotlib, so we can customize our plot in familiar ways.\n", + "\n", + "In this air temperature profile example, we would like to make two changes:\n", + "- swap the axes so that we have isobaric levels on the y (vertical) axis of the figure\n", + "- make pressure decrease upward in the figure, so that up is up\n", + "\n", + "A few keyword arguments to our `.plot()` call will take care of this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prof.plot(y=\"isobaric1\", yincrease=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Plotting 2D data\n", + "\n", + "In the example above, the `.plot()` method produced a line plot.\n", + "\n", + "What if we call `.plot()` on a 2D array?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "temps.sel(isobaric1=1000).plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Xarray has recognized that the `DataArray` object calling the plot method has two coordinate variables, and generates a 2D plot using the `pcolormesh` method from Matplotlib.\n", + "\n", + "In this case, we are looking at air temperatures on the 1000 hPa isobaric surface over North America. We could of course improve this figure by using [Cartopy](../cartopy) to handle the map projection and geographic features!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n", + "\n", + "Xarray brings the joy of Pandas-style labeled data operations to N-dimensional data. 
As such, it has become a central workhorse in the geoscience community for the analysis of gridded datasets. Xarray allows us to open self-describing NetCDF files and make full use of the coordinate axes, labels, units, and other metadata. By making use of labeled coordinates, our code is often easier to write, easier to read, and more robust.\n", + "\n", + "### What's next?\n", + "\n", + "Additional notebooks to appear in this section will go into more detail about \n", + "- arithmetic and broadcasting with Xarray data structures\n", + "- using \"group by\" operations\n", + "- remote data access with OpenDAP\n", + "- more advanced visualization including map integration with Cartopy" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [] + }, + "source": [ + "## Resources and references\n", + "\n", + "This notebook was adapted from material in [Unidata's Python Training](https://unidata.github.io/python-training/workshop/XArray/xarray-and-cf/).\n", + "\n", + "The best resource for Xarray is the [Xarray documentation](http://xarray.pydata.org/en/stable/). See in particular\n", + "- [Why Xarray](http://xarray.pydata.org/en/stable/getting-started-guide/why-xarray.html)\n", + "- [Quick overview](http://xarray.pydata.org/en/stable/getting-started-guide/quick-overview.html#)\n", + "- [Example gallery](http://xarray.pydata.org/en/stable/gallery.html)\n", + "\n", + "Another excellent resource is this [Xarray Tutorial collection](https://xarray-contrib.github.io/xarray-tutorial/)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/environment.yml b/environment.yml index e08c8a6d3..5620f82a2 100644 --- a/environment.yml +++ b/environment.yml @@ -13,6 +13,7 @@ dependencies: - xarray - netcdf4 - pyproj + - scipy - pip - pip: # works for regular pip packages