.. _combining data: Combining data -------------- .. jupyter-execute:: :hide-code: :hide-output: import numpy as np import pandas as pd import xarray as xr np.random.seed(123456) %xmode minimal * For combining datasets or data arrays along a single dimension, see concatenate_. * For combining datasets with different variables, see merge_. * For combining datasets or data arrays with different indexes or missing values, see combine_. * For combining datasets or data arrays along multiple dimensions see combining.multi_. .. _concatenate: Concatenate ~~~~~~~~~~~ To combine :py:class:`~xarray.Dataset` / :py:class:`~xarray.DataArray` objects along an existing or new dimension into a larger object, you can use :py:func:`~xarray.concat`. ``concat`` takes an iterable of ``DataArray`` or ``Dataset`` objects, as well as a dimension name, and concatenates along that dimension: .. jupyter-execute:: da = xr.DataArray( np.arange(6).reshape(2, 3), [("x", ["a", "b"]), ("y", [10, 20, 30])] ) da.isel(y=slice(0, 1)) # same as da[:, :1] .. jupyter-execute:: # This resembles how you would use np.concatenate: xr.concat([da[:, :1], da[:, 1:]], dim="y") .. jupyter-execute:: # For more friendly pandas-like indexing you can use: xr.concat([da.isel(y=slice(0, 1)), da.isel(y=slice(1, None))], dim="y") In addition to combining along an existing dimension, ``concat`` can create a new dimension by stacking lower dimensional arrays together: .. jupyter-execute:: da.sel(x="a") .. jupyter-execute:: xr.concat([da.isel(x=0), da.isel(x=1)], "x") If the second argument to ``concat`` is a new dimension name, the arrays will be concatenated along that new dimension, which is always inserted as the first dimension: .. jupyter-execute:: xr.concat([da.isel(x=0), da.isel(x=1)], "new_dim") The second argument to ``concat`` can also be an :py:class:`~pandas.Index` or :py:class:`~xarray.DataArray` object as well as a string, in which case it is used to label the values along the new dimension: .. jupyter-execute:: xr.concat([da.isel(x=0), da.isel(x=1)], pd.Index([-90, -100], name="new_dim")) Of course, ``concat`` also works on ``Dataset`` objects: .. jupyter-execute:: ds = da.to_dataset(name="foo") xr.concat([ds.sel(x="a"), ds.sel(x="b")], "x") :py:func:`~xarray.concat` has a number of options which provide deeper control over which variables are concatenated and how it handles conflicting variables between datasets. With the default parameters, xarray will load some coordinate variables into memory to compare them between datasets. This may be prohibitively expensive if you are manipulating your dataset lazily using :ref:`dask`. .. _merge: Merge ~~~~~ To combine variables and coordinates between multiple ``DataArray`` and/or ``Dataset`` objects, use :py:func:`~xarray.merge`. It can merge a list of ``Dataset``, ``DataArray`` or dictionaries of objects convertible to ``DataArray`` objects: .. jupyter-execute:: xr.merge([ds, ds.rename({"foo": "bar"})]) .. jupyter-execute:: xr.merge([xr.DataArray(n, name="var%d" % n) for n in range(5)]) If you merge another dataset (or a dictionary including data array objects), by default the resulting dataset will be aligned on the **union** of all index coordinates: .. jupyter-execute:: other = xr.Dataset({"bar": ("x", [1, 2, 3, 4]), "x": list("abcd")}) xr.merge([ds, other]) This ensures that ``merge`` is non-destructive. ``xarray.MergeError`` is raised if you attempt to merge two variables with the same name but different values: .. 
.. jupyter-execute::
    :raises:

    xr.merge([ds, ds + 1])

The same non-destructive merging between ``DataArray`` index coordinates is used in the :py:class:`~xarray.Dataset` constructor:

.. jupyter-execute::

    xr.Dataset({"a": da.isel(x=slice(0, 1)), "b": da.isel(x=slice(1, 2))})

.. _combine:

Combine
~~~~~~~

The instance method :py:meth:`~xarray.DataArray.combine_first` combines two datasets/data arrays and defaults to non-null values in the calling object, using values from the called object to fill holes. The resulting coordinates are the union of coordinate labels. Vacant cells as a result of the outer-join are filled with ``NaN``.

For example:

.. jupyter-execute::

    ar0 = xr.DataArray([[0, 0], [0, 0]], [("x", ["a", "b"]), ("y", [-1, 0])])
    ar1 = xr.DataArray([[1, 1], [1, 1]], [("x", ["b", "c"]), ("y", [0, 1])])
    ar0.combine_first(ar1)

.. jupyter-execute::

    ar1.combine_first(ar0)

For datasets, ``ds0.combine_first(ds1)`` works similarly to ``xr.merge([ds0, ds1])``, except that ``xr.merge`` raises ``MergeError`` when there are conflicting values in variables to be merged, whereas ``.combine_first`` defaults to the calling object's values.

.. _update:

Update
~~~~~~

In contrast to ``merge``, :py:meth:`~xarray.Dataset.update` modifies a dataset in-place without checking for conflicts, and will overwrite any existing variables with new values:

.. jupyter-execute::

    ds.update({"space": ("space", [10.2, 9.4, 3.9])})

However, dimensions are still required to be consistent between different Dataset variables, so you cannot change the size of a dimension unless you replace all dataset variables that use it.

``update`` also performs automatic alignment if necessary. Unlike ``merge``, it maintains the alignment of the original array instead of merging indexes:

.. jupyter-execute::

    ds.update(other)

The exact same alignment logic is used when setting a variable with ``__setitem__`` syntax:

.. jupyter-execute::

    ds["baz"] = xr.DataArray([9, 9, 9, 9, 9], coords=[("x", list("abcde"))])
    ds.baz

Equals and identical
~~~~~~~~~~~~~~~~~~~~

Xarray objects can be compared by using the :py:meth:`~xarray.Dataset.equals`, :py:meth:`~xarray.Dataset.identical` and :py:meth:`~xarray.Dataset.broadcast_equals` methods. These methods are used by the optional ``compat`` argument on ``concat`` and ``merge``.

:py:attr:`~xarray.Dataset.equals` checks dimension names, indexes and array values:

.. jupyter-execute::

    da.equals(da.copy())

:py:attr:`~xarray.Dataset.identical` also checks attributes, and the name of each object:

.. jupyter-execute::

    da.identical(da.rename("bar"))

:py:attr:`~xarray.Dataset.broadcast_equals` does a more relaxed form of equality check that allows variables to have different dimensions, as long as values are constant along those new dimensions:

.. jupyter-execute::

    left = xr.Dataset(coords={"x": 0})
    right = xr.Dataset({"x": [0, 0, 0]})
    left.broadcast_equals(right)

Like pandas objects, two xarray objects are still equal or identical if they have missing values marked by ``NaN`` in the same locations.

In contrast, the ``==`` operation performs element-wise comparison (like numpy):

.. jupyter-execute::

    da == da.copy()

Note that ``NaN`` does not compare equal to ``NaN`` in element-wise comparison; you may need to deal with missing values explicitly.

.. _combining.no_conflicts:

Merging with 'no_conflicts'
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``compat`` argument ``'no_conflicts'`` is only available when combining xarray objects with ``merge``.
In addition to the above comparison methods it allows the merging of xarray objects with locations where *either* have ``NaN`` values. This can be used to combine data with overlapping coordinates as long as any non-missing values agree or are disjoint: .. jupyter-execute:: ds1 = xr.Dataset({"a": ("x", [10, 20, 30, np.nan])}, {"x": [1, 2, 3, 4]}) ds2 = xr.Dataset({"a": ("x", [np.nan, 30, 40, 50])}, {"x": [2, 3, 4, 5]}) xr.merge([ds1, ds2], compat="no_conflicts") Note that due to the underlying representation of missing values as floating point numbers (``NaN``), variable data type is not always preserved when merging in this manner. .. _combining.multi: Combining along multiple dimensions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For combining many objects along multiple dimensions xarray provides :py:func:`~xarray.combine_nested` and :py:func:`~xarray.combine_by_coords`. These functions use a combination of ``concat`` and ``merge`` across different variables to combine many objects into one. :py:func:`~xarray.combine_nested` requires specifying the order in which the objects should be combined, while :py:func:`~xarray.combine_by_coords` attempts to infer this ordering automatically from the coordinates in the data. :py:func:`~xarray.combine_nested` is useful when you know the spatial relationship between each object in advance. The datasets must be provided in the form of a nested list, which specifies their relative position and ordering. A common task is collecting data from a parallelized simulation where each processor wrote out data to a separate file. A domain which was decomposed into 4 parts, 2 each along both the x and y axes, requires organising the datasets into a doubly-nested list, e.g: .. jupyter-execute:: arr = xr.DataArray( name="temperature", data=np.random.randint(5, size=(2, 2)), dims=["x", "y"] ) arr .. jupyter-execute:: ds_grid = [[arr, arr], [arr, arr]] xr.combine_nested(ds_grid, concat_dim=["x", "y"]) :py:func:`~xarray.combine_nested` can also be used to explicitly merge datasets with different variables. For example if we have 4 datasets, which are divided along two times, and contain two different variables, we can pass ``None`` to ``'concat_dim'`` to specify the dimension of the nested list over which we wish to use ``merge`` instead of ``concat``: .. jupyter-execute:: temp = xr.DataArray(name="temperature", data=np.random.randn(2), dims=["t"]) precip = xr.DataArray(name="precipitation", data=np.random.randn(2), dims=["t"]) ds_grid = [[temp, precip], [temp, precip]] xr.combine_nested(ds_grid, concat_dim=["t", None]) :py:func:`~xarray.combine_by_coords` is for combining objects which have dimension coordinates which specify their relationship to and order relative to one another, for example a linearly-increasing 'time' dimension coordinate. Here we combine two datasets using their common dimension coordinates. Notice they are concatenated in order based on the values in their dimension coordinates, not on their position in the list passed to ``combine_by_coords``. .. jupyter-execute:: x1 = xr.DataArray(name="foo", data=np.random.randn(3), coords=[("x", [0, 1, 2])]) x2 = xr.DataArray(name="foo", data=np.random.randn(3), coords=[("x", [3, 4, 5])]) xr.combine_by_coords([x2, x1]) These functions can be used by :py:func:`~xarray.open_mfdataset` to open many files as one dataset. The particular function used is specified by setting the argument ``'combine'`` to ``'by_coords'`` or ``'nested'``. 
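For example, a minimal sketch of both modes (the file names and layout below are hypothetical):

.. code-block:: python

    import xarray as xr

    # Let the dimension coordinates in each file determine the ordering:
    ds = xr.open_mfdataset("data/temperature_*.nc", combine="by_coords")

    # Or state the ordering explicitly with a nested list of files,
    # mirroring the combine_nested examples above:
    ds = xr.open_mfdataset(
        [["x0y0.nc", "x0y1.nc"], ["x1y0.nc", "x1y1.nc"]],
        combine="nested",
        concat_dim=["x", "y"],
    )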
This is useful for situations where your data is split across many files in multiple locations, which have some known relationship between one another.

.. currentmodule:: xarray

.. _complex:

Complex Numbers
===============

.. jupyter-execute::
    :hide-code:

    import numpy as np
    import xarray as xr

Xarray leverages NumPy to seamlessly handle complex numbers in :py:class:`~xarray.DataArray` and :py:class:`~xarray.Dataset` objects.

In the examples below, we are using a DataArray named ``da`` with complex elements (of :math:`\mathbb{C}`):

.. jupyter-execute::

    data = np.array([[1 + 2j, 3 + 4j], [5 + 6j, 7 + 8j]])
    da = xr.DataArray(
        data,
        dims=["x", "y"],
        coords={"x": ["a", "b"], "y": [1, 2]},
        name="complex_nums",
    )

Operations on Complex Data
--------------------------

You can access real and imaginary components using the ``.real`` and ``.imag`` attributes. Most NumPy universal functions (ufuncs) like :py:func:`numpy.abs <numpy.absolute>` or :py:func:`numpy.angle` work directly.

.. jupyter-execute::

    da.real

.. jupyter-execute::

    np.abs(da)

.. note::
    Like NumPy, ``.real`` and ``.imag`` typically return *views*, not copies, of the original data.

Reading and Writing Complex Data
--------------------------------

Writing complex data to NetCDF files (see :ref:`io.netcdf`) is supported via :py:meth:`~xarray.DataArray.to_netcdf` using specific backend engines that handle complex types:

.. tab:: h5netcdf

    This requires the `h5netcdf <https://github.com/h5netcdf/h5netcdf>`_ library to be installed.

    .. jupyter-execute::

        # write the data to disk
        da.to_netcdf("complex_nums_h5.nc", engine="h5netcdf")

        # read the file back into memory
        ds_h5 = xr.open_dataset("complex_nums_h5.nc", engine="h5netcdf")

        # check the dtype
        ds_h5[da.name].dtype

.. tab:: netcdf4

    Requires the `netcdf4-python (>= 1.7.1) <https://github.com/Unidata/netcdf4-python>`_ library, and you have to enable ``auto_complex=True``.

    .. jupyter-execute::

        # write the data to disk
        da.to_netcdf("complex_nums_nc4.nc", engine="netcdf4", auto_complex=True)

        # read the file back into memory
        ds_nc4 = xr.open_dataset(
            "complex_nums_nc4.nc", engine="netcdf4", auto_complex=True
        )

        # check the dtype
        ds_nc4[da.name].dtype

.. warning::
    The ``scipy`` engine only supports NetCDF V3 and does *not* support complex arrays; writing with ``engine="scipy"`` raises a ``TypeError``.

Alternative: Manual Handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If direct writing is not supported (e.g., targeting NetCDF3), you can manually split the complex array into separate real and imaginary variables before saving:

.. jupyter-execute::

    # write the real and imaginary parts as separate variables
    ds_manual = xr.Dataset(
        {
            f"{da.name}_real": da.real,
            f"{da.name}_imag": da.imag,
        }
    )
    ds_manual.to_netcdf("complex_manual.nc", engine="scipy")

    # read them back and reconstruct the complex array
    ds = xr.open_dataset("complex_manual.nc", engine="scipy")
    reconstructed = ds[f"{da.name}_real"] + 1j * ds[f"{da.name}_imag"]

Recommendations
^^^^^^^^^^^^^^^

- Use ``engine="netcdf4"`` with ``auto_complex=True`` for full compliance and ease.
- Use ``h5netcdf`` for HDF5-based storage when interoperability with HDF5 is desired.
- For maximum legacy support (NetCDF3), manually handle real/imaginary components.

.. jupyter-execute::
    :hide-code:

    # Cleanup
    import os

    for f in ["complex_nums_nc4.nc", "complex_nums_h5.nc", "complex_manual.nc"]:
        if os.path.exists(f):
            os.remove(f)

See also
--------

- :ref:`io.netcdf` — full NetCDF I/O guide
- `NumPy complex numbers <https://numpy.org/doc/stable/reference/arrays.scalars.html>`__

.. currentmodule:: xarray

..
_compute: ########### Computation ########### The labels associated with :py:class:`~xarray.DataArray` and :py:class:`~xarray.Dataset` objects enables some powerful shortcuts for computation, notably including aggregation and broadcasting by dimension names. Basic array math ================ Arithmetic operations with a single DataArray automatically vectorize (like numpy) over all array values: .. jupyter-execute:: :hide-code: :hide-output: import numpy as np import pandas as pd import xarray as xr np.random.seed(123456) %xmode minimal .. jupyter-execute:: arr = xr.DataArray( np.random.default_rng(0).random((2, 3)), [("x", ["a", "b"]), ("y", [10, 20, 30])], ) arr - 3 .. jupyter-execute:: abs(arr) You can also use any of numpy's or scipy's many `ufunc`__ functions directly on a DataArray: __ https://numpy.org/doc/stable/reference/ufuncs.html .. jupyter-execute:: np.sin(arr) Use :py:func:`~xarray.where` to conditionally switch between values: .. jupyter-execute:: xr.where(arr > 0, "positive", "negative") Use ``@`` to compute the :py:func:`~xarray.dot` product: .. jupyter-execute:: arr @ arr Data arrays also implement many :py:class:`numpy.ndarray` methods: .. jupyter-execute:: arr.round(2) .. jupyter-execute:: arr.T .. jupyter-execute:: intarr = xr.DataArray([0, 1, 2, 3, 4, 5]) intarr << 2 # only supported for int types .. jupyter-execute:: intarr >> 1 .. _missing_values: Missing values ============== Xarray represents missing values using the "NaN" (Not a Number) value from NumPy, which is a special floating-point value that indicates a value that is undefined or unrepresentable. There are several methods for handling missing values in xarray: Xarray objects borrow the :py:meth:`~xarray.DataArray.isnull`, :py:meth:`~xarray.DataArray.notnull`, :py:meth:`~xarray.DataArray.count`, :py:meth:`~xarray.DataArray.dropna`, :py:meth:`~xarray.DataArray.fillna`, :py:meth:`~xarray.DataArray.ffill`, and :py:meth:`~xarray.DataArray.bfill` methods for working with missing data from pandas: :py:meth:`~xarray.DataArray.isnull` is a method in xarray that can be used to check for missing or null values in an xarray object. It returns a new xarray object with the same dimensions as the original object, but with boolean values indicating where **missing values** are present. .. jupyter-execute:: x = xr.DataArray([0, 1, np.nan, np.nan, 2], dims=["x"]) x.isnull() In this example, the third and fourth elements of 'x' are NaN, so the resulting :py:class:`~xarray.DataArray` object has 'True' values in the third and fourth positions and 'False' values in the other positions. :py:meth:`~xarray.DataArray.notnull` is a method in xarray that can be used to check for non-missing or non-null values in an xarray object. It returns a new xarray object with the same dimensions as the original object, but with boolean values indicating where **non-missing values** are present. .. jupyter-execute:: x = xr.DataArray([0, 1, np.nan, np.nan, 2], dims=["x"]) x.notnull() In this example, the first two and the last elements of x are not NaN, so the resulting :py:class:`~xarray.DataArray` object has 'True' values in these positions, and 'False' values in the third and fourth positions where NaN is located. :py:meth:`~xarray.DataArray.count` is a method in xarray that can be used to count the number of non-missing values along one or more dimensions of an xarray object. 
It returns a new xarray object with the same dimensions as the original object, but with each element replaced by the count of non-missing values along the specified dimensions.

.. jupyter-execute::

    x = xr.DataArray([0, 1, np.nan, np.nan, 2], dims=["x"])
    x.count()

In this example, 'x' has five elements, but two of them are NaN, so the resulting :py:class:`~xarray.DataArray` object has a single element containing the value '3', which represents the number of non-null elements in x.

:py:meth:`~xarray.DataArray.dropna` is a method in xarray that can be used to remove missing or null values from an xarray object. It returns a new xarray object with the same dimensions as the original object, but with missing values removed.

.. jupyter-execute::

    x = xr.DataArray([0, 1, np.nan, np.nan, 2], dims=["x"])
    x.dropna(dim="x")

In this example, calling ``x.dropna(dim="x")`` removes any missing values and returns a new :py:class:`~xarray.DataArray` object with only the non-null elements [0, 1, 2] of 'x', in the original order.

:py:meth:`~xarray.DataArray.fillna` is a method in xarray that can be used to fill missing or null values in an xarray object with a specified value or method. It returns a new xarray object with the same dimensions as the original object, but with missing values filled.

.. jupyter-execute::

    x = xr.DataArray([0, 1, np.nan, np.nan, 2], dims=["x"])
    x.fillna(-1)

In this example, there are two NaN values in 'x', so calling ``x.fillna(-1)`` replaces these values with -1 and returns a new :py:class:`~xarray.DataArray` object with five elements, containing the values [0, 1, -1, -1, 2] in the original order.

:py:meth:`~xarray.DataArray.ffill` is a method in xarray that can be used to forward fill (or fill forward) missing values in an xarray object along one or more dimensions. It returns a new xarray object with the same dimensions as the original object, but with missing values replaced by the last non-missing value along the specified dimensions.

.. jupyter-execute::

    x = xr.DataArray([0, 1, np.nan, np.nan, 2], dims=["x"])
    x.ffill("x")

In this example, there are two NaN values in 'x', so calling ``x.ffill("x")`` fills both of these values with the last preceding non-null value along the dimension, which is 1. The resulting :py:class:`~xarray.DataArray` object has five elements, containing the values [0, 1, 1, 1, 2] in the original order.

:py:meth:`~xarray.DataArray.bfill` is a method in xarray that can be used to backward fill (or fill backward) missing values in an xarray object along one or more dimensions. It returns a new xarray object with the same dimensions as the original object, but with missing values replaced by the next non-missing value along the specified dimensions.

.. jupyter-execute::

    x = xr.DataArray([0, 1, np.nan, np.nan, 2], dims=["x"])
    x.bfill("x")

In this example, there are two NaN values in 'x', so calling ``x.bfill("x")`` fills both of these values with the next non-null value along the dimension, which is 2. The resulting :py:class:`~xarray.DataArray` object has five elements, containing the values [0, 1, 2, 2, 2] in the original order.

Like pandas, xarray uses the float value ``np.nan`` (not-a-number) to represent missing values.

Xarray objects also have an :py:meth:`~xarray.DataArray.interpolate_na` method for filling missing values via 1D interpolation. It returns a new xarray object with the same dimensions as the original object, but with missing values interpolated.

..
jupyter-execute:: x = xr.DataArray( [0, 1, np.nan, np.nan, 2], dims=["x"], coords={"xx": xr.Variable("x", [0, 1, 1.1, 1.9, 3])}, ) x.interpolate_na(dim="x", method="linear", use_coordinate="xx") In this example, there are two NaN values in 'x', so calling x.interpolate_na(dim="x", method="linear", use_coordinate="xx") fills these values with interpolated values along the "x" dimension using linear interpolation based on the values of the xx coordinate. The resulting :py:class:`~xarray.DataArray` object has five elements, containing the values [0., 1., 1.05, 1.45, 2.] in the original order. Note that the interpolated values are calculated based on the values of the 'xx' coordinate, which has non-integer values, resulting in non-integer interpolated values. Note that xarray slightly diverges from the pandas ``interpolate`` syntax by providing the ``use_coordinate`` keyword which facilitates a clear specification of which values to use as the index in the interpolation. Xarray also provides the ``max_gap`` keyword argument to limit the interpolation to data gaps of length ``max_gap`` or smaller. See :py:meth:`~xarray.DataArray.interpolate_na` for more. .. _agg: Aggregation =========== Aggregation methods have been updated to take a ``dim`` argument instead of ``axis``. This allows for very intuitive syntax for aggregation methods that are applied along particular dimension(s): .. jupyter-execute:: arr.sum(dim="x") .. jupyter-execute:: arr.std(["x", "y"]) .. jupyter-execute:: arr.min() If you need to figure out the axis number for a dimension yourself (say, for wrapping code designed to work with numpy arrays), you can use the :py:meth:`~xarray.DataArray.get_axis_num` method: .. jupyter-execute:: arr.get_axis_num("y") These operations automatically skip missing values, like in pandas: .. jupyter-execute:: xr.DataArray([1, 2, np.nan, 3]).mean() If desired, you can disable this behavior by invoking the aggregation method with ``skipna=False``. .. _compute.rolling: Rolling window operations ========================= ``DataArray`` objects include a :py:meth:`~xarray.DataArray.rolling` method. This method supports rolling window aggregation: .. jupyter-execute:: arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5), dims=("x", "y")) arr :py:meth:`~xarray.DataArray.rolling` is applied along one dimension using the name of the dimension as a key (e.g. ``y``) and the window size as the value (e.g. ``3``). We get back a ``Rolling`` object: .. jupyter-execute:: arr.rolling(y=3) Aggregation and summary methods can be applied directly to the ``Rolling`` object: .. jupyter-execute:: r = arr.rolling(y=3) r.reduce(np.std) .. jupyter-execute:: r.mean() Aggregation results are assigned the coordinate at the end of each window by default, but can be centered by passing ``center=True`` when constructing the ``Rolling`` object: .. jupyter-execute:: r = arr.rolling(y=3, center=True) r.mean() As can be seen above, aggregations of windows which overlap the border of the array produce ``nan``\s. Setting ``min_periods`` in the call to ``rolling`` changes the minimum number of observations within the window required to have a value when aggregating: .. jupyter-execute:: r = arr.rolling(y=3, min_periods=2) r.mean() .. jupyter-execute:: r = arr.rolling(y=3, center=True, min_periods=2) r.mean() From version 0.17, xarray supports multidimensional rolling, .. jupyter-execute:: r = arr.rolling(x=2, y=3, min_periods=2) r.mean() .. 
tip:: Note that rolling window aggregations are faster and use less memory when bottleneck_ is installed. This only applies to numpy-backed xarray objects with 1d-rolling. .. _bottleneck: https://github.com/pydata/bottleneck We can also manually iterate through ``Rolling`` objects: .. code:: python for label, arr_window in r: # arr_window is a view of x ... .. _compute.rolling_exp: While ``rolling`` provides a simple moving average, ``DataArray`` also supports an exponential moving average with :py:meth:`~xarray.DataArray.rolling_exp`. This is similar to pandas' ``ewm`` method. numbagg_ is required. .. _numbagg: https://github.com/numbagg/numbagg .. code:: python arr.rolling_exp(y=3).mean() The ``rolling_exp`` method takes a ``window_type`` kwarg, which can be ``'alpha'``, ``'com'`` (for ``center-of-mass``), ``'span'``, and ``'halflife'``. The default is ``span``. Finally, the rolling object has a ``construct`` method which returns a view of the original ``DataArray`` with the windowed dimension in the last position. You can use this for more advanced rolling operations such as strided rolling, windowed rolling, convolution, short-time FFT etc. .. jupyter-execute:: # rolling with 2-point stride rolling_da = r.construct(x="x_win", y="y_win", stride=2) rolling_da .. jupyter-execute:: rolling_da.mean(["x_win", "y_win"], skipna=False) Because the ``DataArray`` given by ``r.construct('window_dim')`` is a view of the original array, it is memory efficient. You can also use ``construct`` to compute a weighted rolling sum: .. jupyter-execute:: weight = xr.DataArray([0.25, 0.5, 0.25], dims=["window"]) arr.rolling(y=3).construct(y="window").dot(weight) .. note:: numpy's Nan-aggregation functions such as ``nansum`` copy the original array. In xarray, we internally use these functions in our aggregation methods (such as ``.sum()``) if ``skipna`` argument is not specified or set to True. This means ``rolling_da.mean('window_dim')`` is memory inefficient. To avoid this, use ``skipna=False`` as the above example. .. _compute.weighted: Weighted array reductions ========================= :py:class:`DataArray` and :py:class:`Dataset` objects include :py:meth:`DataArray.weighted` and :py:meth:`Dataset.weighted` array reduction methods. They currently support weighted ``sum``, ``mean``, ``std``, ``var`` and ``quantile``. .. jupyter-execute:: coords = dict(month=("month", [1, 2, 3])) prec = xr.DataArray([1.1, 1.0, 0.9], dims=("month",), coords=coords) weights = xr.DataArray([31, 28, 31], dims=("month",), coords=coords) Create a weighted object: .. jupyter-execute:: weighted_prec = prec.weighted(weights) weighted_prec Calculate the weighted sum: .. jupyter-execute:: weighted_prec.sum() Calculate the weighted mean: .. jupyter-execute:: weighted_prec.mean(dim="month") Calculate the weighted quantile: .. jupyter-execute:: weighted_prec.quantile(q=0.5, dim="month") The weighted sum corresponds to: .. jupyter-execute:: weighted_sum = (prec * weights).sum() weighted_sum the weighted mean to: .. jupyter-execute:: weighted_mean = weighted_sum / weights.sum() weighted_mean the weighted variance to: .. jupyter-execute:: weighted_var = weighted_prec.sum_of_squares() / weights.sum() weighted_var and the weighted standard deviation to: .. jupyter-execute:: weighted_std = np.sqrt(weighted_var) weighted_std However, the functions also take missing values in the data into account: .. 
.. jupyter-execute::

    data = xr.DataArray([np.nan, 2, 4])
    weights = xr.DataArray([8, 1, 1])

    data.weighted(weights).mean()

Using ``(data * weights).sum() / weights.sum()`` would (incorrectly) result in 0.6.

If the weights add up to 0, ``sum`` returns 0:

.. jupyter-execute::

    data = xr.DataArray([1.0, 1.0])
    weights = xr.DataArray([-1.0, 1.0])

    data.weighted(weights).sum()

and ``mean``, ``std`` and ``var`` return ``nan``:

.. jupyter-execute::

    data.weighted(weights).mean()

.. note::
    ``weights`` must be a :py:class:`DataArray` and cannot contain missing values. Missing values can be replaced manually by ``weights.fillna(0)``.

.. _compute.coarsen:

Coarsen large arrays
====================

:py:class:`DataArray` and :py:class:`Dataset` objects include :py:meth:`~xarray.DataArray.coarsen` and :py:meth:`~xarray.Dataset.coarsen` methods, which support block aggregation along multiple dimensions,

.. jupyter-execute::

    x = np.linspace(0, 10, 300)
    t = pd.date_range("1999-12-15", periods=364)
    da = xr.DataArray(
        np.sin(x) * np.cos(np.linspace(0, 1, 364)[:, np.newaxis]),
        dims=["time", "x"],
        coords={"time": t, "x": x},
    )
    da

In order to take a block mean for every 7 days along the ``time`` dimension and every 2 points along the ``x`` dimension,

.. jupyter-execute::

    da.coarsen(time=7, x=2).mean()

:py:meth:`~xarray.DataArray.coarsen` raises a ``ValueError`` if the data length is not a multiple of the corresponding window size. You can choose the ``boundary='trim'`` or ``boundary='pad'`` options for trimming the excess entries or padding ``nan`` to insufficient entries,

.. jupyter-execute::

    da.coarsen(time=30, x=2, boundary="trim").mean()

If you want to apply a specific function to a coordinate, you can pass the function or method name to the ``coord_func`` option,

.. jupyter-execute::

    da.coarsen(time=7, x=2, coord_func={"time": "min"}).mean()

You can also :ref:`use coarsen to reshape` without applying a computation.

.. _compute.using_coordinates:

Computation using Coordinates
=============================

Xarray objects have some handy methods for computation with their coordinates. :py:meth:`~xarray.DataArray.differentiate` computes derivatives by central finite differences using their coordinates,

.. jupyter-execute::

    a = xr.DataArray([0, 1, 2, 3], dims=["x"], coords=[[0.1, 0.11, 0.2, 0.3]])
    a.differentiate("x")

This method can also be used for multidimensional arrays,

.. jupyter-execute::

    a = xr.DataArray(
        np.arange(8).reshape(4, 2), dims=["x", "y"], coords={"x": [0.1, 0.11, 0.2, 0.3]}
    )
    a.differentiate("x")

:py:meth:`~xarray.DataArray.integrate` computes integration based on the trapezoidal rule using their coordinates,

.. jupyter-execute::

    a.integrate("x")

.. note::
    These methods are limited to simple cartesian geometry. Differentiation and integration along multidimensional coordinates are not supported.

.. _compute.polyfit:

Fitting polynomials
===================

Xarray objects provide an interface for performing linear or polynomial regressions using the least-squares method. :py:meth:`~xarray.DataArray.polyfit` computes the best fitting coefficients along a given dimension and for a given order,

.. jupyter-execute::

    x = xr.DataArray(np.arange(10), dims=["x"], name="x")
    a = xr.DataArray(3 + 4 * x, dims=["x"], coords={"x": x})
    out = a.polyfit(dim="x", deg=1, full=True)
    out

The method outputs a dataset containing the coefficients (and more if ``full=True``). The inverse operation is done with :py:func:`~xarray.polyval`,

.. jupyter-execute::

    xr.polyval(coord=x, coeffs=out.polyfit_coefficients)

..
note:: These methods replicate the behaviour of :py:func:`numpy.polyfit` and :py:func:`numpy.polyval`. .. _compute.curvefit: Fitting arbitrary functions =========================== Xarray objects also provide an interface for fitting more complex functions using :py:func:`scipy.optimize.curve_fit`. :py:meth:`~xarray.DataArray.curvefit` accepts user-defined functions and can fit along multiple coordinates. For example, we can fit a relationship between two ``DataArray`` objects, maintaining a unique fit at each spatial coordinate but aggregating over the time dimension: .. jupyter-execute:: def exponential(x, a, xc): return np.exp((x - xc) / a) x = np.arange(-5, 5, 0.1) t = np.arange(-5, 5, 0.1) X, T = np.meshgrid(x, t) Z1 = np.random.uniform(low=-5, high=5, size=X.shape) Z2 = exponential(Z1, 3, X) Z3 = exponential(Z1, 1, -X) ds = xr.Dataset( data_vars=dict( var1=(["t", "x"], Z1), var2=(["t", "x"], Z2), var3=(["t", "x"], Z3) ), coords={"t": t, "x": x}, ) ds[["var2", "var3"]].curvefit( coords=ds.var1, func=exponential, reduce_dims="t", bounds={"a": (0.5, 5), "xc": (-5, 5)}, ) We can also fit multi-dimensional functions, and even use a wrapper function to simultaneously fit a summation of several functions, such as this field containing two gaussian peaks: .. jupyter-execute:: def gaussian_2d(coords, a, xc, yc, xalpha, yalpha): x, y = coords z = a * np.exp( -np.square(x - xc) / 2 / np.square(xalpha) - np.square(y - yc) / 2 / np.square(yalpha) ) return z def multi_peak(coords, *args): z = np.zeros(coords[0].shape) for i in range(len(args) // 5): z += gaussian_2d(coords, *args[i * 5 : i * 5 + 5]) return z x = np.arange(-5, 5, 0.1) y = np.arange(-5, 5, 0.1) X, Y = np.meshgrid(x, y) n_peaks = 2 names = ["a", "xc", "yc", "xalpha", "yalpha"] names = [f"{name}{i}" for i in range(n_peaks) for name in names] Z = gaussian_2d((X, Y), 3, 1, 1, 2, 1) + gaussian_2d((X, Y), 2, -1, -2, 1, 1) Z += np.random.normal(scale=0.1, size=Z.shape) da = xr.DataArray(Z, dims=["y", "x"], coords={"y": y, "x": x}) da.curvefit( coords=["x", "y"], func=multi_peak, param_names=names, kwargs={"maxfev": 10000}, ) .. note:: This method replicates the behavior of :py:func:`scipy.optimize.curve_fit`. .. _compute.broadcasting: Broadcasting by dimension name ============================== ``DataArray`` objects automatically align themselves ("broadcasting" in the numpy parlance) by dimension name instead of axis order. With xarray, you do not need to transpose arrays or insert dimensions of length 1 to get array operations to work, as commonly done in numpy with :py:func:`numpy.reshape` or :py:data:`numpy.newaxis`. This is best illustrated by a few examples. Consider two one-dimensional arrays with different sizes aligned along different dimensions: .. jupyter-execute:: a = xr.DataArray([1, 2], [("x", ["a", "b"])]) a .. jupyter-execute:: b = xr.DataArray([-1, -2, -3], [("y", [10, 20, 30])]) b With xarray, we can apply binary mathematical operations to these arrays, and their dimensions are expanded automatically: .. jupyter-execute:: a * b Moreover, dimensions are always reordered to the order in which they first appeared: .. jupyter-execute:: c = xr.DataArray(np.arange(6).reshape(3, 2), [b["y"], a["x"]]) c .. jupyter-execute:: a + c This means, for example, that you always subtract an array from its transpose: .. jupyter-execute:: c - c.T You can explicitly broadcast xarray data structures by using the :py:func:`~xarray.broadcast` function: .. jupyter-execute:: a2, b2 = xr.broadcast(a, b) a2 .. jupyter-execute:: b2 .. 
.. _math automatic alignment:

Automatic alignment
===================

Xarray enforces alignment between *index* :ref:`coordinates` (that is, coordinates with the same name as a dimension, marked by ``*``) on objects used in binary operations.

Similarly to pandas, alignment happens automatically for binary arithmetic operations. The default result of a binary operation is given by the *intersection* (not the union) of coordinate labels:

.. jupyter-execute::

    arr = xr.DataArray(np.arange(3), [("x", range(3))])
    arr + arr[:-1]

If coordinate values for a dimension are missing on either argument, all matching dimensions must have the same size:

.. jupyter-execute::
    :raises:

    arr + xr.DataArray([1, 2], dims="x")

However, one can explicitly change this default automatic alignment type ("inner") via :py:func:`~xarray.set_options()` in a context manager:

.. jupyter-execute::

    with xr.set_options(arithmetic_join="outer"):
        arr + arr[:1]

    arr + arr[:1]

Before loops or performance critical code, it's a good idea to align arrays explicitly (e.g., by putting them in the same Dataset or using :py:func:`~xarray.align`) to avoid the overhead of repeated alignment with each operation. See :ref:`align and reindex` for more details.

.. note::
    There is no automatic alignment between arguments when performing in-place arithmetic operations such as ``+=``. You will need to use :ref:`manual alignment`. This ensures in-place arithmetic never needs to modify data types.

.. _coordinates math:

Coordinates
===========

Although index coordinates are aligned, other coordinates are not, and if their values conflict, they will be dropped. This is necessary, for example, because indexing turns 1D coordinates into scalar coordinates:

.. jupyter-execute::

    arr[0]

.. jupyter-execute::

    arr[1]

.. jupyter-execute::

    # notice that the scalar coordinate 'x' is silently dropped
    arr[1] - arr[0]

Still, xarray will persist other coordinates in arithmetic, as long as there are no conflicting values:

.. jupyter-execute::

    # only one argument has the 'x' coordinate
    arr[0] + 1

.. jupyter-execute::

    # both arguments have the same 'x' coordinate
    arr[0] - arr[0]

Math with datasets
==================

Datasets support arithmetic operations by automatically looping over all data variables:

.. jupyter-execute::

    ds = xr.Dataset(
        {
            "x_and_y": (("x", "y"), np.random.randn(3, 5)),
            "x_only": ("x", np.random.randn(3)),
        },
        coords=arr.coords,
    )
    ds > 0

Datasets support most of the same methods found on data arrays:

.. jupyter-execute::

    ds.mean(dim="x")

.. jupyter-execute::

    abs(ds)

Datasets also support NumPy ufuncs (requires NumPy v1.13 or newer), or alternatively you can use :py:meth:`~xarray.Dataset.map` to map a function to each variable in a dataset:

.. jupyter-execute::

    np.sin(ds)  # equivalent to ds.map(np.sin)

Datasets also use looping over variables for *broadcasting* in binary arithmetic. You can do arithmetic between any ``DataArray`` and a dataset:

.. jupyter-execute::

    ds + arr

Arithmetic between two datasets matches data variables of the same name:

.. jupyter-execute::

    ds2 = xr.Dataset({"x_and_y": 0, "x_only": 100})
    ds - ds2

Similarly to index based alignment, the result has the intersection of all matching data variables.

.. _compute.wrapping-custom:

Wrapping custom computation
===========================

It doesn't always make sense to do computation directly with xarray objects:

- In the inner loop of performance limited code, using xarray can add considerable overhead compared to using NumPy or native Python types.
This is particularly true when working with scalars or small arrays (less than ~1e6 elements). Keeping track of labels and ensuring their consistency adds overhead, and xarray's core itself is not especially fast, because it's written in Python rather than a compiled language like C. Also, xarray's high level label-based APIs remove low-level control over how operations are implemented.

- Even if speed doesn't matter, it can be important to wrap existing code, or to support alternative interfaces that don't use xarray objects.

For these reasons, it is often well-advised to write low-level routines that work with NumPy arrays, and to wrap these routines to work with xarray objects. However, adding support for labels on both :py:class:`~xarray.Dataset` and :py:class:`~xarray.DataArray` can be a bit of a chore.

To make this easier, xarray supplies the :py:func:`~xarray.apply_ufunc` helper function, designed for wrapping functions that support broadcasting and vectorization on unlabeled arrays in the style of a NumPy `universal function <https://numpy.org/doc/stable/reference/ufuncs.html>`_ ("ufunc" for short). ``apply_ufunc`` takes care of everything needed for an idiomatic xarray wrapper, including alignment, broadcasting, looping over ``Dataset`` variables (if needed), and merging of coordinates. In fact, many internal xarray functions/methods are written using ``apply_ufunc``.

Simple functions that act independently on each value should work without any additional arguments:

.. jupyter-execute::

    squared_error = lambda x, y: (x - y) ** 2
    arr1 = xr.DataArray([0, 1, 2, 3], dims="x")
    xr.apply_ufunc(squared_error, arr1, 1)

For using more complex operations that consider some array values collectively, it's important to understand the idea of "core dimensions" from NumPy's `generalized ufuncs <https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html>`_. Core dimensions are defined as dimensions that should *not* be broadcast over. Usually, they correspond to the fundamental dimensions over which an operation is defined, e.g., the summed axis in ``np.sum``. A good clue that core dimensions are needed is the presence of an ``axis`` argument on the corresponding NumPy function.

With ``apply_ufunc``, core dimensions are recognized by name, and then moved to the last dimension of any input arguments before applying the given function. This means that for functions that accept an ``axis`` argument, you usually need to set ``axis=-1``.

As an example, here is how we would wrap :py:func:`numpy.linalg.norm` to calculate the vector norm:

.. code-block:: python

    def vector_norm(x, dim, ord=None):
        return xr.apply_ufunc(
            np.linalg.norm, x, input_core_dims=[[dim]], kwargs={"ord": ord, "axis": -1}
        )

.. jupyter-execute::
    :hide-code:

    def vector_norm(x, dim, ord=None):
        return xr.apply_ufunc(
            np.linalg.norm, x, input_core_dims=[[dim]], kwargs={"ord": ord, "axis": -1}
        )

.. jupyter-execute::

    vector_norm(arr1, dim="x")

Because ``apply_ufunc`` follows a standard convention for ufuncs, it plays nicely with tools for building vectorized functions, like :py:func:`numpy.broadcast_arrays` and :py:class:`numpy.vectorize`. For high performance needs, consider using `Numba's vectorize and guvectorize <https://numba.readthedocs.io/en/stable/user/vectorize.html>`__.

In addition to wrapping functions, ``apply_ufunc`` can automatically parallelize many functions when using dask by setting ``dask='parallelized'``. See :ref:`dask.automatic-parallelization` for details.

:py:func:`~xarray.apply_ufunc` also supports some advanced options for controlling alignment of variables and the form of the result. See the docstring for full details and more examples.
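For instance, here is a hedged sketch (not from the xarray docs) of a wrapper whose result keeps its core dimension, which additionally requires ``output_core_dims``; :py:func:`numpy.sort` stands in for any axis-aware function:

.. code-block:: python

    import numpy as np
    import xarray as xr


    def sort_along(obj, dim):
        # "dim" is moved to the last axis on input; because np.sort keeps that
        # axis in its output, it must also be declared as an output core dimension.
        return xr.apply_ufunc(
            np.sort,
            obj,
            input_core_dims=[[dim]],
            output_core_dims=[[dim]],
            kwargs={"axis": -1},
        )


    arr = xr.DataArray([[3, 1, 2], [0, 2, 1]], dims=("y", "x"))
    sort_along(arr, "x")  # sorts each row independently along "x"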
.. currentmodule:: xarray

.. _dask:

Parallel Computing with Dask
============================

.. jupyter-execute::

    # Note that it's not necessary to import dask to use xarray with dask.
    import numpy as np
    import pandas as pd
    import xarray as xr
    import bottleneck

.. jupyter-execute::
    :hide-code:

    import os

    np.random.seed(123456)

    # limit the amount of information printed to screen
    xr.set_options(display_expand_data=False)
    np.set_printoptions(precision=3, linewidth=100, threshold=10, edgeitems=2)

    ds = xr.Dataset(
        {
            "temperature": (
                ("time", "latitude", "longitude"),
                np.random.randn(30, 180, 180),
            ),
            "time": pd.date_range("2015-01-01", periods=30),
            "longitude": np.arange(180),
            "latitude": np.arange(89.5, -90.5, -1),
        }
    )
    ds.to_netcdf("example-data.nc")

Xarray integrates with `Dask <https://www.dask.org>`__, a general purpose library for parallel computing, to handle larger-than-memory computations.

If you've been using Xarray to read in large datasets or split up data across a number of files, you may already be using Dask:

.. code-block:: python

    ds = xr.open_zarr("/path/to/data.zarr")
    timeseries = ds["temp"].mean(dim=["x", "y"]).compute()  # Compute result

Using Dask with Xarray feels similar to working with NumPy arrays, but on much larger datasets. The Dask integration is transparent, so you usually don't need to manage the parallelism directly; Xarray and Dask handle these aspects behind the scenes. This makes it easy to write code that scales from small, in-memory datasets on a single machine to large datasets that are distributed across a cluster, with minimal code changes.

Examples
--------

If you're new to using Xarray with Dask, we recommend the `Xarray + Dask Tutorial <https://tutorial.xarray.dev/intermediate/xarray_and_dask.html>`_.

Here are some examples for using Xarray with Dask at scale:

- `Zonal averaging with the NOAA National Water Model `_
- `CMIP6 Precipitation Frequency Analysis `_
- `Using Dask + Cloud Optimized GeoTIFFs `_

Find more examples at the `Project Pythia cookbook gallery <https://cookbooks.projectpythia.org/>`_.

Using Dask with Xarray
----------------------

.. image:: ../_static/dask-array.svg
    :width: 50 %
    :align: right
    :alt: A Dask array

Dask divides arrays into smaller parts called chunks. These chunks are small, manageable pieces of the larger dataset that Dask is able to process in parallel (see the `Dask Array docs on chunks <https://docs.dask.org/en/stable/array-chunks.html>`_). Commonly chunks are set when reading data, but you can also set the chunksize manually at any point in your workflow using :py:meth:`Dataset.chunk` and :py:meth:`DataArray.chunk`. See :ref:`dask.chunks` for more.

Xarray operations on Dask-backed arrays are lazy. This means computations are not executed immediately, but are instead queued up as tasks in a Dask graph. When a result is requested (e.g., for plotting, writing to disk, or explicitly computing), Dask executes the task graph. The computations are carried out in parallel, with each chunk being processed independently. This parallel execution is key to handling large datasets efficiently.

Nearly all Xarray methods have been extended to work automatically with Dask Arrays. This includes things like indexing, concatenating, rechunking, grouped operations, etc. Common operations are covered in more detail in each of the sections below.

.. _dask.io:

Reading and writing data
~~~~~~~~~~~~~~~~~~~~~~~~

When reading data, Dask divides your dataset into smaller chunks. You can specify the size of chunks with the ``chunks`` argument. Specifying ``chunks="auto"`` will set the dask chunk sizes to be a multiple of the on-disk chunk sizes. This can be a good idea, but usually the appropriate dask chunk size will depend on your workflow.
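As a small sketch (reusing the ``example-data.nc`` file written in the setup above), you can compare chunking strategies and inspect the result without triggering any computation:

.. code-block:: python

    # chunks="auto": dask picks chunk sizes aligned with the on-disk chunks
    ds_auto = xr.open_dataset("example-data.nc", chunks="auto")

    # An explicit mapping requests particular chunk sizes per dimension
    ds_explicit = xr.open_dataset("example-data.nc", chunks={"time": 10})

    # Inspecting chunk sizes is lazy and does not load any data
    print(ds_explicit["temperature"].chunksizes)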
.. tab:: Zarr

    The `Zarr <https://zarr.readthedocs.io>`_ format is ideal for working with large datasets. Each chunk is stored in a separate file, allowing parallel reading and writing with Dask. You can also use Zarr to read/write directly from cloud storage buckets (see the `Dask documentation on connecting to remote data <https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html>`__).

    When you open a Zarr dataset with :py:func:`~xarray.open_zarr`, it is loaded as a Dask array by default (if Dask is installed)::

        ds = xr.open_zarr("path/to/directory.zarr")

    See :ref:`io.zarr` for more details.

.. tab:: NetCDF

    Open a single netCDF file with :py:func:`~xarray.open_dataset` by supplying a ``chunks`` argument::

        ds = xr.open_dataset("example-data.nc", chunks={"time": 10})

    Or open multiple files in parallel with :py:func:`~xarray.open_mfdataset`::

        xr.open_mfdataset('my/files/*.nc', parallel=True)

    .. tip::
        When reading in many netCDF files with :py:func:`~xarray.open_mfdataset`, using ``engine="h5netcdf"`` can be faster than the default which uses the netCDF4 package.

    Save larger-than-memory netCDF files::

        ds.to_netcdf("my-big-file.nc")

    Or set ``compute=False`` to return a dask.delayed object that can be computed later::

        delayed_write = ds.to_netcdf("my-big-file.nc", compute=False)
        delayed_write.compute()

    .. note::
        When using Dask's distributed scheduler to write NETCDF4 files, it may be necessary to set the environment variable ``HDF5_USE_FILE_LOCKING=FALSE`` to avoid competing locks within the HDF5 SWMR file locking scheme. Note that writing netCDF files with Dask's distributed scheduler is only supported for the netcdf4 backend.

    See :ref:`io.netcdf` for more details.

.. tab:: HDF5

    Open HDF5 files with :py:func:`~xarray.open_dataset`::

        xr.open_dataset("/path/to/my/file.h5", chunks='auto')

    See :ref:`io.hdf5` for more details.

.. tab:: GeoTIFF

    Open large geoTIFF files with rioxarray::

        xds = rioxarray.open_rasterio("my-satellite-image.tif", chunks='auto')

    See :ref:`io.rasterio` for more details.

Loading Dask Arrays
~~~~~~~~~~~~~~~~~~~

There are a few common cases where you may want to convert lazy Dask arrays into eager, in-memory Xarray data structures:

- You want to inspect smaller intermediate results when working interactively or debugging
- You've reduced the dataset (by filtering or with a groupby, for example) and now have something much smaller that fits in memory
- You need to compute intermediate results since Dask is unable (or struggles) to perform a certain computation. The canonical example of this is normalizing a dataset, e.g., ``ds - ds.mean()``, when ``ds`` is larger than memory. Typically, you should either save ``ds`` to disk or compute ``ds.mean()`` eagerly.

To do this, you can use :py:meth:`Dataset.compute` or :py:meth:`DataArray.compute`:

.. jupyter-execute::

    ds.compute()

.. note::
    Using :py:meth:`Dataset.compute` is preferred to :py:meth:`Dataset.load`, which changes the results in-place.

You can also access :py:attr:`DataArray.values`, which will always be a NumPy array:

.. jupyter-input::

    ds.temperature.values

.. jupyter-output::

    array([[[ 4.691e-01, -2.829e-01, ..., -5.577e-01,  3.814e-01],
            [ 1.337e+00, -1.531e+00, ...,  8.726e-01, -1.538e+00],
            ...
    # truncated for brevity

NumPy ufuncs like :py:func:`numpy.sin` transparently work on all xarray objects, including those that store lazy Dask arrays:

.. jupyter-execute::

    np.sin(ds)

To access Dask arrays directly, use the :py:attr:`DataArray.data` attribute which exposes the DataArray's underlying array type.
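For instance, a brief sketch (assuming ``ds`` is Dask-backed, e.g. opened with a ``chunks`` argument); inspecting the underlying array triggers no computation:

.. code-block:: python

    dask_arr = ds.temperature.data  # a dask.array.Array, not a numpy.ndarray
    print(type(dask_arr))
    print(dask_arr.chunks)  # chunk sizes along each dimension

    # On a NumPy-backed object, .data would instead return the ndarray itself.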
If you're using a Dask cluster, you can also use :py:meth:`Dataset.persist` for quickly accessing intermediate outputs. This is most helpful after expensive operations like rechunking or setting an index. It's a way of telling the cluster that it should start executing the computations that you have defined so far, and that it should try to keep those results in memory. You will get back a new Dask array that is semantically equivalent to your old array, but now points to running data. .. code-block:: python ds = ds.persist() .. tip:: Remember to save the dataset returned by persist! This is a common mistake. .. _dask.chunks: Chunking and performance ~~~~~~~~~~~~~~~~~~~~~~~~ The way a dataset is chunked can be critical to performance when working with large datasets. You'll want chunk sizes large enough to reduce the number of chunks that Dask has to think about (to reduce overhead from the task graph) but also small enough so that many of them can fit in memory at once. .. tip:: A good rule of thumb is to create arrays with a minimum chunk size of at least one million elements (e.g., a 1000x1000 matrix). With large arrays (10+ GB), you may need larger chunks. See `Choosing good chunk sizes in Dask `_. It can be helpful to choose chunk sizes based on your downstream analyses and to chunk as early as possible. Datasets with smaller chunks along the time axis, for example, can make time domain problems easier to parallelize since Dask can perform the same operation on each time chunk. If you're working with a large dataset with chunks that make downstream analyses challenging, you may need to rechunk your data. This is an expensive operation though, so is only recommended when needed. You can chunk or rechunk a dataset by: - Specifying the ``chunks`` kwarg when reading in your dataset. If you know you'll want to do some spatial subsetting, for example, you could use ``chunks={'latitude': 10, 'longitude': 10}`` to specify small chunks across space. This can avoid loading subsets of data that span multiple chunks, thus reducing the number of file reads. Note that this will only work, though, for chunks that are similar to how the data is chunked on disk. Otherwise, it will be very slow and require a lot of network bandwidth. - Many array file formats are chunked on disk. You can specify ``chunks={}`` to have a single dask chunk map to a single on-disk chunk, and ``chunks="auto"`` to have a single dask chunk be a automatically chosen multiple of the on-disk chunks. - Using :py:meth:`Dataset.chunk` after you've already read in your dataset. For time domain problems, for example, you can use ``ds.chunk(time=TimeResampler())`` to rechunk according to a specified unit of time. ``ds.chunk(time=TimeResampler("MS"))``, for example, will set the chunks so that a month of data is contained in one chunk. For large-scale rechunking tasks (e.g., converting a simulation dataset stored with chunking only along time to a dataset with chunking only across space), consider writing another copy of your data on disk and/or using dedicated tools such as `Rechunker `_. .. _dask.automatic-parallelization: Parallelize custom functions with ``apply_ufunc`` and ``map_blocks`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Almost all of Xarray's built-in operations work on Dask arrays. If you want to use a function that isn't wrapped by Xarray, and have it applied in parallel on each block of your xarray object, you have three options: 1. 
Use :py:func:`~xarray.apply_ufunc` to apply functions that consume and return NumPy arrays. 2. Use :py:func:`~xarray.map_blocks`, :py:meth:`Dataset.map_blocks` or :py:meth:`DataArray.map_blocks` to apply functions that consume and return xarray objects. 3. Extract Dask Arrays from xarray objects with :py:attr:`DataArray.data` and use Dask directly. .. tip:: See the extensive Xarray tutorial on `apply_ufunc `_. ``apply_ufunc`` ############### :py:func:`~xarray.apply_ufunc` automates `embarrassingly parallel `__ "map" type operations where a function written for processing NumPy arrays should be repeatedly applied to Xarray objects containing Dask Arrays. It works similarly to :py:func:`dask.array.map_blocks` and :py:func:`dask.array.blockwise`, but without requiring an intermediate layer of abstraction. See the `Dask documentation `__ for more details. For the best performance when using Dask's multi-threaded scheduler, wrap a function that already releases the global interpreter lock, which fortunately already includes most NumPy and Scipy functions. Here we show an example using NumPy operations and a fast function from `bottleneck `__, which we use to calculate `Spearman's rank-correlation coefficient `__: .. code-block:: python def covariance_gufunc(x, y): return ( (x - x.mean(axis=-1, keepdims=True)) * (y - y.mean(axis=-1, keepdims=True)) ).mean(axis=-1) def pearson_correlation_gufunc(x, y): return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1)) def spearman_correlation_gufunc(x, y): x_ranks = bottleneck.rankdata(x, axis=-1) y_ranks = bottleneck.rankdata(y, axis=-1) return pearson_correlation_gufunc(x_ranks, y_ranks) def spearman_correlation(x, y, dim): return xr.apply_ufunc( spearman_correlation_gufunc, x, y, input_core_dims=[[dim], [dim]], dask="parallelized", output_dtypes=[float], ) The only aspect of this example that is different from standard usage of ``apply_ufunc()`` is that we needed to supply the ``output_dtypes`` arguments. (Read up on :ref:`compute.wrapping-custom` for an explanation of the "core dimensions" listed in ``input_core_dims``.) Our new ``spearman_correlation()`` function achieves near linear speedup when run on large arrays across the four cores on my laptop. It would also work as a streaming operation, when run on arrays loaded from disk: .. jupyter-input:: rs = np.random.default_rng(0) array1 = xr.DataArray(rs.randn(1000, 100000), dims=["place", "time"]) # 800MB array2 = array1 + 0.5 * rs.randn(1000, 100000) # using one core, on NumPy arrays %time _ = spearman_correlation(array1, array2, 'time') # CPU times: user 21.6 s, sys: 2.84 s, total: 24.5 s # Wall time: 24.9 s chunked1 = array1.chunk({"place": 10}) chunked2 = array2.chunk({"place": 10}) # using all my laptop's cores, with Dask r = spearman_correlation(chunked1, chunked2, "time").compute() %time _ = r.compute() # CPU times: user 30.9 s, sys: 1.74 s, total: 32.6 s # Wall time: 4.59 s One limitation of ``apply_ufunc()`` is that it cannot be applied to arrays with multiple chunks along a core dimension: .. jupyter-input:: spearman_correlation(chunked1, chunked2, "place") .. jupyter-output:: ValueError: dimension 'place' on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, rechunk into a single Dask array chunk along this dimension, i.e., ``.rechunk({'place': -1})``, but beware that this may significantly increase memory usage. 
This reflects the nature of core dimensions, in contrast to broadcast (non-core) dimensions that allow operations to be split into arbitrary chunks for application.

.. tip::
    When possible, it's recommended to use pre-existing ``dask.array`` functions, either with existing xarray methods or :py:func:`~xarray.apply_ufunc()` with ``dask='allowed'``. Dask can often have a more efficient implementation that makes use of the specialized structure of a problem, unlike the generic speedups offered by ``dask='parallelized'``.

``map_blocks``
##############

Functions that consume and return Xarray objects can be easily applied in parallel using :py:func:`map_blocks`. Your function will receive an Xarray Dataset or DataArray subset to one chunk along each chunked dimension.

.. jupyter-execute::

    ds.temperature

This DataArray has 3 chunks, each with length 10 along the time dimension. At compute time, a function applied with :py:func:`map_blocks` will receive a DataArray corresponding to a single block of shape 10x180x180 (time x latitude x longitude) with values loaded. The following snippet illustrates how to check the shape of the object received by the applied function.

.. jupyter-execute::

    def func(da):
        print(da.sizes)
        return da.time


    mapped = xr.map_blocks(func, ds.temperature)
    mapped

Notice that the :py:meth:`map_blocks` call printed ``Frozen({'time': 0, 'latitude': 0, 'longitude': 0})`` to screen. ``func`` received 0-sized blocks! :py:meth:`map_blocks` needs to know what the final result looks like in terms of dimensions, shapes etc. It does so by running the provided function on 0-shaped inputs (*automated inference*). This works in many cases, but not all. If automatic inference does not work for your function, provide the ``template`` kwarg (see :ref:`below <template-note>`).

In this case, automatic inference has worked, so let's check that the result is as expected.

.. jupyter-execute::

    mapped.load(scheduler="single-threaded")
    mapped.identical(ds.time)

Note that we use ``.load(scheduler="single-threaded")`` to execute the computation. This executes the Dask graph in serial using a for loop, but allows for printing to screen and other debugging techniques. We can easily see that our function is receiving blocks of shape 10x180x180 and the returned result is identical to ``ds.time`` as expected.

Here is a common example where automated inference will not work.

.. jupyter-execute::
    :raises:

    def func(da):
        print(da.sizes)
        return da.isel(time=[1])


    mapped = xr.map_blocks(func, ds.temperature)

``func`` cannot be run on 0-shaped inputs because it is not possible to extract element 1 along a dimension of size 0. In this case we need to tell :py:func:`map_blocks` what the returned result looks like using the ``template`` kwarg. ``template`` must be an xarray Dataset or DataArray (depending on what the function returns) with dimensions, shapes, chunk sizes, attributes, coordinate variables *and* data variables that look exactly like the expected result. The variables should be dask-backed and hence not incur much memory cost.

.. _template-note:

.. note::
    Note that when ``template`` is provided, ``attrs`` from ``template`` are copied over to the result. Any ``attrs`` set in ``func`` will be ignored.

.. jupyter-execute::

    template = ds.temperature.isel(time=[1, 11, 21])
    mapped = xr.map_blocks(func, ds.temperature, template=template)

Notice that the 0-shaped sizes were not printed to screen. Since ``template`` has been provided, :py:func:`map_blocks` does not need to infer it by running ``func`` on 0-shaped inputs.
.. jupyter-execute::

    mapped.identical(template)

:py:func:`map_blocks` also allows passing ``args`` and ``kwargs`` down to the user function ``func``. ``func`` will be executed as ``func(block_xarray, *args, **kwargs)``, so ``args`` must be a list and ``kwargs`` must be a dictionary.

.. jupyter-execute::

    def func(obj, a, b=0):
        return obj + a + b


    mapped = ds.map_blocks(func, args=[10], kwargs={"b": 10})
    expected = ds + 10 + 10
    mapped.identical(expected)

.. jupyter-execute::
    :hide-code:

    ds.close()  # Closes "example-data.nc".
    os.remove("example-data.nc")

.. tip::

    As :py:func:`map_blocks` loads each block into memory, reduce the size of the objects consumed by user functions as much as possible. For example, drop unneeded variables before calling ``func`` with :py:func:`map_blocks`.

Deploying Dask
--------------

By default, Dask uses the multi-threaded scheduler, which distributes work across multiple cores on a single machine and allows for processing some datasets that do not fit into memory. However, this has two limitations:

- You are limited by the size of your hard drive
- Downloading data can be slow and expensive

Instead, it can be faster and cheaper to run your computations close to where your data is stored, distributed across many machines on a Dask cluster. Often, this means deploying Dask on HPC clusters or on the cloud. See the `Dask deployment documentation `__ for more details.

Best Practices
--------------

Dask is pretty easy to use but there are some gotchas, many of which are under active development. Here are some tips we have found through experience. We also recommend checking out the `Dask best practices `_.

1. Do your spatial and temporal indexing (e.g. ``.sel()`` or ``.isel()``) early, especially before calling ``resample()`` or ``groupby()``. Grouping and resampling trigger some computation on all the blocks, which in theory should commute with indexing, but this optimization hasn't been implemented in Dask yet. (See `Dask issue #746 `_).

2. More generally, ``groupby()`` is a costly operation and will perform a lot better if the ``flox`` package is installed. See the `flox documentation `_ for more. By default Xarray will use ``flox`` if installed.

3. Save intermediate results to disk as netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, if subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting. Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See `Dask issue #874 `_)

4. Use the `Dask dashboard `_ to identify performance bottlenecks.

Here's an example of a simplified workflow putting some of these tips together:

.. code-block:: python

    ds = xr.open_zarr(
        # Since we're doing a spatial reduction, increase chunk size in x, y
        "my-data.zarr",
        chunks={"x": 100, "y": 100},
    )

    time_subset = ds.sea_temperature.sel(
        time=slice("2020-01-01", "2020-12-31")  # Filter early
    )

    # faster resampling when flox is installed
    daily = time_subset.resample(time="D").mean()

    daily.load()  # Pull smaller results into memory after reducing the dataset
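The dashboard mentioned in tip 4 comes from Dask's distributed scheduler. As a minimal sketch (assuming the ``distributed`` package is installed; not part of the workflow above), creating a client before triggering computation gives you a diagnostics dashboard:

.. code-block:: python

    from dask.distributed import Client

    client = Client()  # starts a local cluster and a diagnostics dashboard
    print(client.dashboard_link)  # open this URL to watch tasks execute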
.. _data structures:

Data Structures
===============

.. jupyter-execute::
    :hide-code:
    :hide-output:

    import numpy as np
    import pandas as pd
    import xarray as xr
    import matplotlib.pyplot as plt

    np.random.seed(123456)
    np.set_printoptions(threshold=10)
    %xmode minimal

DataArray
---------

:py:class:`xarray.DataArray` is xarray's implementation of a labeled, multi-dimensional array. It has several key properties:

- ``values``: a :py:class:`numpy.ndarray` or :ref:`numpy-like array ` holding the array's values
- ``dims``: dimension names for each axis (e.g., ``('x', 'y', 'z')``)
- ``coords``: a dict-like container of arrays (*coordinates*) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
- ``attrs``: :py:class:`dict` to hold arbitrary metadata (*attributes*)

Xarray uses ``dims`` and ``coords`` to enable its core metadata-aware operations. Dimensions provide names that xarray uses instead of the ``axis`` argument found in many numpy functions. Coordinates enable fast label-based indexing and alignment, building on the functionality of the ``index`` found on a pandas :py:class:`~pandas.DataFrame` or :py:class:`~pandas.Series`.

DataArray objects also can have a ``name`` and can hold arbitrary metadata in the form of their ``attrs`` property. Names and attributes are strictly for users and user-written code: xarray makes no attempt to interpret them, and propagates them only in unambiguous cases. For reading and writing attributes xarray relies on the capabilities of the supported backends. (see FAQ, :ref:`approach to metadata`).

.. _creating a dataarray:

Creating a DataArray
~~~~~~~~~~~~~~~~~~~~

The :py:class:`~xarray.DataArray` constructor takes:

- ``data``: a multi-dimensional array of values (e.g., a numpy ndarray, a :ref:`numpy-like array `, :py:class:`~pandas.Series`, :py:class:`~pandas.DataFrame` or ``pandas.Panel``)
- ``coords``: a list or dictionary of coordinates. If a list, it should be a list of tuples where the first element is the dimension name and the second element is the corresponding coordinate array_like object.
- ``dims``: a list of dimension names. If omitted and ``coords`` is a list of tuples, dimension names are taken from ``coords``.
- ``attrs``: a dictionary of attributes to add to the instance
- ``name``: a string that names the instance

.. jupyter-execute::

    data = np.random.rand(4, 3)
    locs = ["IA", "IL", "IN"]
    times = pd.date_range("2000-01-01", periods=4)
    foo = xr.DataArray(data, coords=[times, locs], dims=["time", "space"])
    foo

Only ``data`` is required; all of the other arguments will be filled in with default values:

.. jupyter-execute::

    xr.DataArray(data)

As you can see, dimension names are always present in the xarray data model: if you do not provide them, defaults of the form ``dim_N`` will be created. However, coordinates are always optional, and dimensions do not have automatic coordinate labels.

.. note::

    This is different from pandas, where axes always have tick labels, which default to the integers ``[0, ..., n-1]``.

    Prior to xarray v0.9, xarray copied this behavior: default coordinates for each dimension would be created if coordinates were not supplied explicitly. This is no longer the case.

Coordinates can be specified in the following ways:

- A list of values with length equal to the number of dimensions, providing coordinate labels for each dimension.
  Each value must be of one of the following forms:

  * A :py:class:`~xarray.DataArray` or :py:class:`~xarray.Variable`
  * A tuple of the form ``(dims, data[, attrs])``, which is converted into arguments for :py:class:`~xarray.Variable`
  * A pandas object or scalar value, which is converted into a ``DataArray``
  * A 1D array or list, which is interpreted as values for a one dimensional coordinate variable along the same dimension as its name

- A dictionary of ``{coord_name: coord}`` where values are of the same form as the list. Supplying coordinates as a dictionary allows coordinates other than those corresponding to dimensions (more on these later). If you supply ``coords`` as a dictionary, you must explicitly provide ``dims``.

As a list of tuples:

.. jupyter-execute::

    xr.DataArray(data, coords=[("time", times), ("space", locs)])

As a dictionary:

.. jupyter-execute::

    xr.DataArray(
        data,
        coords={
            "time": times,
            "space": locs,
            "const": 42,
            "ranking": ("space", [1, 2, 3]),
        },
        dims=["time", "space"],
    )

As a dictionary with coords across multiple dimensions:

.. jupyter-execute::

    xr.DataArray(
        data,
        coords={
            "time": times,
            "space": locs,
            "const": 42,
            "ranking": (("time", "space"), np.arange(12).reshape(4, 3)),
        },
        dims=["time", "space"],
    )

If you create a ``DataArray`` by supplying a pandas :py:class:`~pandas.Series`, :py:class:`~pandas.DataFrame` or ``pandas.Panel``, any non-specified arguments in the ``DataArray`` constructor will be filled in from the pandas object:

.. jupyter-execute::

    df = pd.DataFrame({"x": [0, 1], "y": [2, 3]}, index=["a", "b"])
    df.index.name = "abc"
    df.columns.name = "xyz"
    df

.. jupyter-execute::

    xr.DataArray(df)

DataArray properties
~~~~~~~~~~~~~~~~~~~~

Let's take a look at the important properties on our array:

.. jupyter-execute::

    foo.values

.. jupyter-execute::

    foo.dims

.. jupyter-execute::

    foo.coords

.. jupyter-execute::

    foo.attrs

.. jupyter-execute::

    print(foo.name)

You can modify ``values`` in-place:

.. jupyter-execute::

    foo.values = 1.0 * foo.values

.. note::

    The array values in a :py:class:`~xarray.DataArray` have a single (homogeneous) data type. To work with heterogeneous or structured data types in xarray, use coordinates, or put separate ``DataArray`` objects in a single :py:class:`~xarray.Dataset` (see below).

Now fill in some of that missing metadata:

.. jupyter-execute::

    foo.name = "foo"
    foo.attrs["units"] = "meters"
    foo

The :py:meth:`~xarray.DataArray.rename` method is another option, returning a new data array:

.. jupyter-execute::

    foo.rename("bar")

DataArray Coordinates
~~~~~~~~~~~~~~~~~~~~~

The ``coords`` property is dict-like. Individual coordinates can be accessed from the coordinates by name, or even by indexing the data array itself:

.. jupyter-execute::

    foo.coords["time"]

.. jupyter-execute::

    foo["time"]

These are also :py:class:`~xarray.DataArray` objects, which contain tick-labels for each dimension.

Coordinates can also be set or removed by using the dictionary-like syntax:

.. jupyter-execute::

    foo["ranking"] = ("space", [1, 2, 3])
    foo.coords

.. jupyter-execute::

    del foo["ranking"]
    foo.coords

For more details, see :ref:`coordinates` below.

Dataset
-------

:py:class:`xarray.Dataset` is xarray's multi-dimensional equivalent of a :py:class:`~pandas.DataFrame`. It is a dict-like container of labeled arrays (:py:class:`~xarray.DataArray` objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the `netCDF`__ file format.
__ https://www.unidata.ucar.edu/software/netcdf/ In addition to the dict-like interface of the dataset itself, which can be used to access any variable in a dataset, datasets have four key properties: - ``dims``: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., ``{'x': 6, 'y': 6, 'time': 8}``) - ``data_vars``: a dict-like container of DataArrays corresponding to variables - ``coords``: another dict-like container of DataArrays intended to label points used in ``data_vars`` (e.g., arrays of numbers, datetime objects or strings) - ``attrs``: :py:class:`dict` to hold arbitrary metadata The distinction between whether a variable falls in data or coordinates (borrowed from `CF conventions`_) is mostly semantic, and you can probably get away with ignoring it if you like: dictionary like access on a dataset will supply variables found in either category. However, xarray does make use of the distinction for indexing and computations. Coordinates indicate constant/fixed/independent quantities, unlike the varying/measured/dependent quantities that belong in data. .. _CF conventions: https://cfconventions.org/ Here is an example of how we might structure a dataset for a weather forecast: .. image:: ../_static/dataset-diagram.png In this example, it would be natural to call ``temperature`` and ``precipitation`` "data variables" and all the other arrays "coordinate variables" because they label the points along the dimensions. (see [1]_ for more background on this example). Creating a Dataset ~~~~~~~~~~~~~~~~~~ To make an :py:class:`~xarray.Dataset` from scratch, supply dictionaries for any variables (``data_vars``), coordinates (``coords``) and attributes (``attrs``). - ``data_vars`` should be a dictionary with each key as the name of the variable and each value as one of: * A :py:class:`~xarray.DataArray` or :py:class:`~xarray.Variable` * A tuple of the form ``(dims, data[, attrs])``, which is converted into arguments for :py:class:`~xarray.Variable` * A pandas object, which is converted into a ``DataArray`` * A 1D array or list, which is interpreted as values for a one dimensional coordinate variable along the same dimension as its name - ``coords`` should be a dictionary of the same form as ``data_vars``. - ``attrs`` should be a dictionary. Let's create some fake data for the example we show above. In this example dataset, we will represent measurements of the temperature and pressure that were made under various conditions: * the measurements were made on four different days; * they were made at two separate locations, which we will represent using their latitude and longitude; and * they were made using instruments by three different manufacturers, which we will refer to as ``'manufac1'``, ``'manufac2'``, and ``'manufac3'``. .. 
jupyter-execute::

    np.random.seed(0)
    temperature = 15 + 8 * np.random.randn(2, 3, 4)
    precipitation = 10 * np.random.rand(2, 3, 4)
    lon = [-99.83, -99.32]
    lat = [42.25, 42.21]
    instruments = ["manufac1", "manufac2", "manufac3"]
    time = pd.date_range("2014-09-06", periods=4)
    reference_time = pd.Timestamp("2014-09-05")

    # for real use cases, it's good practice to supply array attributes such as
    # units, but we won't bother here for the sake of brevity
    ds = xr.Dataset(
        {
            "temperature": (["loc", "instrument", "time"], temperature),
            "precipitation": (["loc", "instrument", "time"], precipitation),
        },
        coords={
            "lon": (["loc"], lon),
            "lat": (["loc"], lat),
            "instrument": instruments,
            "time": time,
            "reference_time": reference_time,
        },
    )
    ds

Here we pass :py:class:`xarray.DataArray` objects or a pandas object as values in the dictionary:

.. jupyter-execute::

    xr.Dataset(dict(bar=foo))

.. jupyter-execute::

    xr.Dataset(dict(bar=foo.to_pandas()))

Where a pandas object is supplied as a value, the names of its indexes are used as dimension names, and its data is aligned to any existing dimensions.

You can also create a dataset from:

- A :py:class:`pandas.DataFrame` or ``pandas.Panel`` along its columns and items respectively, by passing it into the :py:class:`~xarray.Dataset` directly
- A :py:class:`pandas.DataFrame` with :py:meth:`Dataset.from_dataframe `, which will additionally handle MultiIndexes. See :ref:`pandas`
- A netCDF file on disk with :py:func:`~xarray.open_dataset`. See :ref:`io`.

Dataset contents
~~~~~~~~~~~~~~~~

:py:class:`~xarray.Dataset` implements the Python mapping interface, with values given by :py:class:`xarray.DataArray` objects:

.. jupyter-execute::

    print("temperature" in ds)
    ds["temperature"]

Valid keys include each listed coordinate and data variable.

Data and coordinate variables are also contained separately in the :py:attr:`~xarray.Dataset.data_vars` and :py:attr:`~xarray.Dataset.coords` dictionary-like attributes:

.. jupyter-execute::

    ds.data_vars

.. jupyter-execute::

    ds.coords

Finally, like data arrays, datasets also store arbitrary metadata in the form of ``attributes``:

.. jupyter-execute::

    print(ds.attrs)

    ds.attrs["title"] = "example attribute"
    ds

Xarray does not enforce any restrictions on attributes, but serialization to some file formats may fail if you use objects that are not strings, numbers or :py:class:`numpy.ndarray` objects.

As a useful shortcut, you can use attribute style access for reading (but not setting) variables and attributes:

.. jupyter-execute::

    ds.temperature

This is particularly useful in an exploratory context, because you can tab-complete these variable names with tools like IPython.

.. _dictionary_like_methods:

Dictionary like methods
~~~~~~~~~~~~~~~~~~~~~~~

We can update a dataset in-place using Python's standard dictionary syntax. For example, to create this example dataset from scratch, we could have written:

.. jupyter-execute::

    ds = xr.Dataset()
    ds["temperature"] = (("loc", "instrument", "time"), temperature)
    ds["temperature_double"] = (("loc", "instrument", "time"), temperature * 2)
    ds["precipitation"] = (("loc", "instrument", "time"), precipitation)
    ds.coords["lat"] = (("loc",), lat)
    ds.coords["lon"] = (("loc",), lon)
    ds.coords["time"] = pd.date_range("2014-09-06", periods=4)
    ds.coords["reference_time"] = pd.Timestamp("2014-09-05")

To change the variables in a ``Dataset``, you can use all the standard dictionary methods, including ``values``, ``items``, ``__delitem__``, ``get`` and :py:meth:`~xarray.Dataset.update`.
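As a quick illustration, here is a sketch of these dict-style idioms applied to the dataset above (``pressure`` is just an illustrative missing key; this snippet is not part of the executed example):

.. code-block:: python

    print(list(ds.keys()))  # variable and coordinate names
    print("temperature" in ds)  # membership tests work as for a dict
    print(ds.get("pressure", None))  # returns the default instead of raising KeyError
    del ds["temperature_double"]  # remove a variable in-place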
Note that assigning a ``DataArray`` or pandas object to a ``Dataset`` variable using ``__setitem__`` or ``update`` will :ref:`automatically align` the array(s) to the original dataset's indexes.

You can copy a ``Dataset`` by calling the :py:meth:`~xarray.Dataset.copy` method. By default, the copy is shallow, so only the container will be copied: the arrays in the ``Dataset`` will still be stored in the same underlying :py:class:`numpy.ndarray` objects. You can copy all data by calling ``ds.copy(deep=True)``.

.. _transforming datasets:

Transforming datasets
~~~~~~~~~~~~~~~~~~~~~

In addition to dictionary-like methods (described above), xarray has additional methods (like pandas) for transforming datasets into new objects.

For removing variables, you can select and drop an explicit list of variables by indexing with a list of names, or use the :py:meth:`~xarray.Dataset.drop_vars` method to return a new ``Dataset``. These operations keep around coordinates:

.. jupyter-execute::

    ds[["temperature"]]

.. jupyter-execute::

    ds[["temperature", "temperature_double"]]

.. jupyter-execute::

    ds.drop_vars("temperature")

To remove a dimension, you can use the :py:meth:`~xarray.Dataset.drop_dims` method. Any variables using that dimension are dropped:

.. jupyter-execute::

    ds.drop_dims("time")

As an alternative to dictionary-like modifications, you can use :py:meth:`~xarray.Dataset.assign` and :py:meth:`~xarray.Dataset.assign_coords`. These methods return a new dataset with additional (or replaced) values:

.. jupyter-execute::

    ds.assign(temperature2=2 * ds.temperature)

There is also the :py:meth:`~xarray.Dataset.pipe` method that allows you to use a method call with an external function (e.g., ``ds.pipe(func)``) instead of simply calling it (e.g., ``func(ds)``). This allows you to write pipelines for transforming your data (using "method chaining") instead of writing hard-to-follow nested function calls:

.. jupyter-input::

    # these lines are equivalent, but with pipe we can make the logic flow
    # entirely from left to right
    plt.plot((2 * ds.temperature.sel(loc=0)).mean("instrument"))
    (ds.temperature.sel(loc=0).pipe(lambda x: 2 * x).mean("instrument").pipe(plt.plot))

Both ``pipe`` and ``assign`` replicate the pandas methods of the same names (:py:meth:`DataFrame.pipe ` and :py:meth:`DataFrame.assign `).

With xarray, there is no performance penalty for creating new datasets, even if variables are lazily loaded from a file on disk. Creating new objects instead of mutating existing objects often results in easier-to-understand code, so we encourage using this approach.

Renaming variables
~~~~~~~~~~~~~~~~~~

Another useful option is the :py:meth:`~xarray.Dataset.rename` method to rename dataset variables:

.. jupyter-execute::

    ds.rename({"temperature": "temp", "precipitation": "precip"})

The related :py:meth:`~xarray.Dataset.swap_dims` method allows you to swap dimension and non-dimension variables:

.. jupyter-execute::

    ds.coords["day"] = ("time", [6, 7, 8, 9])
    ds.swap_dims({"time": "day"})

DataTree
--------

:py:class:`~xarray.DataTree` is ``xarray``'s highest-level data structure, able to organise heterogeneous data which could not be stored inside a single :py:class:`~xarray.Dataset` object. This includes representing the recursive structure of multiple `groups`_ within a netCDF file or `Zarr Store`_.

.. _groups: https://www.unidata.ucar.edu/software/netcdf/workshops/2011/groups-types/GroupsIntro.html
.. _Zarr Store: https://zarr.readthedocs.io/en/stable/tutorial.html#groups

Each :py:class:`~xarray.DataTree` object (or "node") contains the same data that a single :py:class:`xarray.Dataset` would (i.e. :py:class:`~xarray.DataArray` objects stored under hashable keys), and so has the same key properties:

- ``dims``: a dictionary mapping of dimension names to lengths, for the variables in this node, and this node's ancestors,
- ``data_vars``: a dict-like container of DataArrays corresponding to variables in this node,
- ``coords``: another dict-like container of DataArrays, corresponding to coordinate variables in this node, and this node's ancestors,
- ``attrs``: dict to hold arbitrary metadata relevant to data in this node.

A single :py:class:`~xarray.DataTree` object acts much like a single :py:class:`~xarray.Dataset` object, and has a similar set of dict-like methods defined upon it. However, :py:class:`~xarray.DataTree`\s can also contain other :py:class:`~xarray.DataTree` objects, so they can be thought of as nested dict-like containers of both :py:class:`xarray.DataArray`\s and :py:class:`~xarray.DataTree`\s.

A single datatree object is known as a "node", and its position relative to other nodes is defined by two more key properties:

- ``children``: A dictionary mapping from names to other :py:class:`~xarray.DataTree` objects, known as its "child nodes".
- ``parent``: The single :py:class:`~xarray.DataTree` object whose children this datatree is a member of, known as its "parent node".

Each child automatically knows about its parent node, and a node without a parent is known as a "root" node (represented by the ``parent`` attribute pointing to ``None``). Nodes can have multiple children, but as each child node has at most one parent, there can only ever be one root node in a given tree.

The overall structure is technically a connected acyclic undirected rooted graph, otherwise known as a `"Tree" `_.

:py:class:`~xarray.DataTree` objects can also optionally have a ``name`` as well as ``attrs``, just like a :py:class:`~xarray.DataArray`. Again these are not normally used unless explicitly accessed by the user.

.. _creating a datatree:

Creating a DataTree
~~~~~~~~~~~~~~~~~~~

One way to create a :py:class:`~xarray.DataTree` from scratch is to create each node individually, specifying the nodes' relationship to one another as you create each one.

The :py:class:`~xarray.DataTree` constructor takes:

- ``dataset``: The data that will be stored in this node, represented by a single :py:class:`xarray.Dataset`.
- ``children``: The various child nodes (if there are any), given as a mapping from string keys to :py:class:`~xarray.DataTree` objects.
- ``name``: A string to use as the name of this node.

Let's make a single datatree node with some example data in it:

.. jupyter-execute::

    ds1 = xr.Dataset({"foo": "orange"})
    dt = xr.DataTree(name="root", dataset=ds1)
    dt

At this point we have created a single node datatree with no parent and no children.

.. jupyter-execute::

    print(dt.parent is None)
    dt.children

We can add a second node to this tree, assigning it to the parent node ``dt``:

.. jupyter-execute::

    dataset2 = xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])})
    dt2 = xr.DataTree(name="a", dataset=dataset2)

    # Add the child Datatree to the root node
    dt.children = {"child-node": dt2}
    dt

More idiomatically you can create a tree from a dictionary of ``Datasets`` and ``DataTrees``.
In this case we add a new node under ``dt["child-node"]`` by providing the explicit path under ``"child-node"`` as the dictionary key:

.. jupyter-execute::

    # create a third Dataset
    ds3 = xr.Dataset({"zed": np.nan})

    # create a tree from a dictionary of DataTrees and Datasets
    dt = xr.DataTree.from_dict({"/": dt, "/child-node/new-zed-node": ds3})

We have created a tree with three nodes in it:

.. jupyter-execute::

    dt

Consistency checks are enforced. For instance, if we try to create a cycle, where the root node is also a child of a descendant, the constructor will raise an :py:class:`~xarray.InvalidTreeError`:

.. jupyter-execute::
    :raises:

    dt["child-node"].children = {"new-child": dt}

Alternatively you can also create a :py:class:`~xarray.DataTree` object from:

- A dictionary mapping directory-like paths to either :py:class:`~xarray.DataTree` nodes or data, using :py:meth:`xarray.DataTree.from_dict()`,
- A well-formed netCDF or Zarr file on disk with :py:func:`~xarray.open_datatree()`. See :ref:`reading and writing files `.

For data files with groups that do not align, see :py:func:`xarray.open_groups`, or target each group individually with :py:func:`xarray.open_dataset(group='groupname') `. For more information about coordinate alignment, see :ref:`datatree-inheritance`.

DataTree Contents
~~~~~~~~~~~~~~~~~

Like :py:class:`~xarray.Dataset`, :py:class:`~xarray.DataTree` implements the Python mapping interface, but with values given by either :py:class:`~xarray.DataArray` objects or other :py:class:`~xarray.DataTree` objects.

.. jupyter-execute::

    dt["child-node"]

.. jupyter-execute::

    dt["foo"]

Iterating over keys will iterate over both the names of variables and child nodes.

We can also access all the data in a single node, and its inherited coordinates, through a dataset-like view:

.. jupyter-execute::

    dt["child-node"].dataset

This demonstrates the fact that the data in any one node is equivalent to the contents of a single :py:class:`~xarray.Dataset` object. The :py:attr:`DataTree.dataset ` property returns an immutable view, but we can instead extract the node's data contents as a new and mutable :py:class:`~xarray.Dataset` object via :py:meth:`DataTree.to_dataset() `:

.. jupyter-execute::

    dt["child-node"].to_dataset()

Like with :py:class:`~xarray.Dataset`, you can access the data and coordinate variables of a node separately via the :py:attr:`~xarray.DataTree.data_vars` and :py:attr:`~xarray.DataTree.coords` attributes:

.. jupyter-execute::

    dt["child-node"].data_vars

.. jupyter-execute::

    dt["child-node"].coords

Dictionary-like methods
~~~~~~~~~~~~~~~~~~~~~~~

We can update a datatree in-place using Python's standard dictionary syntax, similar to how we can for Dataset objects. For example, to create this example DataTree from scratch, we could have written:

.. jupyter-execute::

    dt = xr.DataTree(name="root")
    dt["foo"] = "orange"
    dt["child-node"] = xr.DataTree(
        dataset=xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])})
    )
    dt["child-node/new-zed-node/zed"] = np.nan
    dt

To change the variables in a node of a :py:class:`~xarray.DataTree`, you can use all the standard dictionary methods, including ``values``, ``items``, ``__delitem__``, ``get`` and :py:meth:`xarray.DataTree.update`. Note that assigning a :py:class:`~xarray.DataTree` object to a :py:class:`~xarray.DataTree` variable using ``__setitem__`` or :py:meth:`~xarray.DataTree.update` will :ref:`automatically align ` the array(s) to the original node's indexes.
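For instance, here is a small sketch of growing and pruning a node with dict-style syntax (``bar_doubled`` is an illustrative name, not part of the example tree above):

.. code-block:: python

    # Assign a derived variable into the child node using a path-like key
    dt["child-node/bar_doubled"] = dt["child-node"]["bar"] * 2

    # __delitem__ works on a node just as it does on a Dataset
    del dt["child-node"]["bar_doubled"]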
If you copy a :py:class:`~xarray.DataTree` using the :py:func:`copy` function or the :py:meth:`xarray.DataTree.copy` method, it will copy the subtree, meaning that node and all the children below it, but no parents above it. Like for :py:class:`~xarray.Dataset`, this copy is shallow by default, but you can copy all the underlying data arrays by calling ``dt.copy(deep=True)``.

.. _datatree-inheritance:

DataTree Inheritance
~~~~~~~~~~~~~~~~~~~~

DataTree implements a simple inheritance mechanism. Coordinates, dimensions and their associated indices are propagated downward, starting from the root node to all descendent nodes. Coordinate inheritance was inspired by the NetCDF-CF inherited dimensions, but DataTree's inheritance is slightly stricter yet easier to reason about.

The constraint that this puts on a DataTree is that dimensions and indices that are inherited must be aligned with any direct descendant node's existing dimension or index. This allows descendants to use dimensions defined in ancestor nodes, without duplicating that information. But as a consequence, if a dimension-name is defined on a node and that same dimension-name exists in one of its ancestors, they must align (have the same index and size).

Some examples:

.. jupyter-execute::

    # Set up coordinates
    time = xr.DataArray(data=["2022-01", "2023-01"], dims="time")
    stations = xr.DataArray(data=list("abcdef"), dims="station")
    lon = [-100, -80, -60]
    lat = [10, 20, 30]

    # Set up fake data
    wind_speed = xr.DataArray(np.ones((2, 6)) * 2, dims=("time", "station"))
    pressure = xr.DataArray(np.ones((2, 6)) * 3, dims=("time", "station"))
    air_temperature = xr.DataArray(np.ones((2, 6)) * 4, dims=("time", "station"))
    dewpoint = xr.DataArray(np.ones((2, 6)) * 5, dims=("time", "station"))
    infrared = xr.DataArray(np.ones((2, 3, 3)) * 6, dims=("time", "lon", "lat"))
    true_color = xr.DataArray(np.ones((2, 3, 3)) * 7, dims=("time", "lon", "lat"))

    dt2 = xr.DataTree.from_dict(
        {
            "/": xr.Dataset(
                coords={"time": time},
            ),
            "/weather": xr.Dataset(
                coords={"station": stations},
                data_vars={
                    "wind_speed": wind_speed,
                    "pressure": pressure,
                },
            ),
            "/weather/temperature": xr.Dataset(
                data_vars={
                    "air_temperature": air_temperature,
                    "dewpoint": dewpoint,
                },
            ),
            "/satellite": xr.Dataset(
                coords={"lat": lat, "lon": lon},
                data_vars={
                    "infrared": infrared,
                    "true_color": true_color,
                },
            ),
        },
    )
    dt2

Here there are four different coordinate variables, which apply to variables in the DataTree in different ways:

- ``time`` is a shared coordinate used by both ``weather`` and ``satellite`` variables
- ``station`` is used only for ``weather`` variables
- ``lat`` and ``lon`` are only used for ``satellite`` images

Coordinate variables are inherited to descendent nodes, which is only possible because variables at different levels of a hierarchical DataTree are always aligned. Placing the ``time`` variable at the root node automatically indicates that it applies to all descendent nodes. Similarly, ``station`` is in the base ``weather`` node, because it applies to all weather variables, both directly in ``weather`` and in the ``temperature`` sub-tree. Notice the inherited coordinates are explicitly shown in the tree representation under ``Inherited coordinates:``.

.. jupyter-execute::

    dt2["/weather"]
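To see the alignment constraint in action, here is a hedged sketch of a failure case (illustrative, not from the original example): attaching a child whose ``time`` dimension disagrees with the inherited ``time`` index fails.

.. code-block:: python

    # The root defines a time index of length 2; the child uses a time
    # dimension of length 3, so inherited-dimension alignment fails.
    bad = xr.Dataset({"foo": ("time", [1, 2, 3])})
    xr.DataTree.from_dict(
        {"/": xr.Dataset(coords={"time": time}), "/child": bad}
    )  # expected to raise an alignment error (the exact exception may vary)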
Accessing any of the lower level trees through the :py:func:`.dataset ` property automatically includes coordinates from higher levels (e.g., ``time`` and ``station``):

.. jupyter-execute::

    dt2["/weather/temperature"].dataset

Similarly, when you retrieve a Dataset through :py:func:`~xarray.DataTree.to_dataset`, the inherited coordinates are included by default unless you exclude them with the ``inherit`` flag:

.. jupyter-execute::

    dt2["/weather/temperature"].to_dataset()

.. jupyter-execute::

    dt2["/weather/temperature"].to_dataset(inherit=False)

For more examples and further discussion, see :ref:`alignment and coordinate inheritance `.

.. _coordinates:

Coordinates
-----------

Coordinates are ancillary variables stored for ``DataArray`` and ``Dataset`` objects in the ``coords`` attribute:

.. jupyter-execute::

    ds.coords

Unlike attributes, xarray *does* interpret and persist coordinates in operations that transform xarray objects. There are two types of coordinates in xarray:

- **dimension coordinates** are one dimensional coordinates with a name equal to their sole dimension (marked by ``*`` when printing a dataset or data array). They are used for label based indexing and alignment, like the ``index`` found on a pandas :py:class:`~pandas.DataFrame` or :py:class:`~pandas.Series`. Indeed, these "dimension" coordinates use a :py:class:`pandas.Index` internally to store their values.

- **non-dimension coordinates** are variables that contain coordinate data, but are not a dimension coordinate. They can be multidimensional (see :ref:`/examples/multidimensional-coords.ipynb`), and there is no relationship between the name of a non-dimension coordinate and the name(s) of its dimension(s). Non-dimension coordinates can be useful for indexing or plotting; otherwise, xarray does not make any direct use of the values associated with them. They are not used for alignment or automatic indexing, nor are they required to match when doing arithmetic (see :ref:`coordinates math`).

.. note::

    Xarray's terminology differs from the `CF terminology`_, where the "dimension coordinates" are called "coordinate variables", and the "non-dimension coordinates" are called "auxiliary coordinate variables" (see :issue:`1295` for more details).

.. _CF terminology: https://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#terminology

Modifying coordinates
~~~~~~~~~~~~~~~~~~~~~

To entirely add or remove coordinate arrays, you can use dictionary-like syntax, as shown above.

To convert back and forth between data and coordinates, you can use the :py:meth:`~xarray.Dataset.set_coords` and :py:meth:`~xarray.Dataset.reset_coords` methods:

.. jupyter-execute::

    ds.reset_coords()

.. jupyter-execute::

    ds.set_coords(["temperature", "precipitation"])

.. jupyter-execute::

    ds["temperature"].reset_coords(drop=True)

Notice that these operations skip coordinates with names given by dimensions, as used for indexing. This is mostly because we are not entirely sure how to design the interface around the fact that xarray cannot store a coordinate and variable with the same name but different values in the same dictionary. But we do recognize that supporting something like this would be useful.

Coordinates methods
~~~~~~~~~~~~~~~~~~~

``Coordinates`` objects also have a few useful methods, mostly for converting them into dataset objects:

.. jupyter-execute::

    ds.coords.to_dataset()

The merge method is particularly interesting, because it implements the same logic used for merging coordinates in arithmetic operations (see :ref:`compute`):
.. jupyter-execute::

    alt = xr.Dataset(coords={"z": [10], "lat": 0, "lon": 0})
    ds.coords.merge(alt.coords)

The ``coords.merge`` method may be useful if you want to implement your own binary operations that act on xarray objects. In the future, we hope to write more helper functions so that you can easily make your functions act like xarray's built-in arithmetic.

Indexes
~~~~~~~

To convert a coordinate (or any ``DataArray``) into an actual :py:class:`pandas.Index`, use the :py:meth:`~xarray.DataArray.to_index` method:

.. jupyter-execute::

    ds["time"].to_index()

A useful shortcut is the ``indexes`` property (on both ``DataArray`` and ``Dataset``), which lazily constructs a dictionary whose keys are given by each dimension and whose values are ``Index`` objects:

.. jupyter-execute::

    ds.indexes

MultiIndex coordinates
~~~~~~~~~~~~~~~~~~~~~~

Xarray supports labeling coordinate values with a :py:class:`pandas.MultiIndex`:

.. jupyter-execute::

    midx = pd.MultiIndex.from_arrays(
        [["R", "R", "V", "V"], [0.1, 0.2, 0.7, 0.9]], names=("band", "wn")
    )
    mda = xr.DataArray(np.random.rand(4), coords={"spec": midx}, dims="spec")
    mda

For convenience multi-index levels are directly accessible as "virtual" or "derived" coordinates (marked by ``-`` when printing a dataset or data array):

.. jupyter-execute::

    mda["band"]

.. jupyter-execute::

    mda.wn

Indexing with multi-index levels is also possible using the ``sel`` method (see :ref:`multi-level indexing`).

Unlike other coordinates, "virtual" level coordinates are not stored in the ``coords`` attribute of ``DataArray`` and ``Dataset`` objects (although they are shown when printing the ``coords`` attribute). Consequently, most of the coordinates-related methods do not apply to them, and a "virtual" level coordinate cannot be used to replace one particular level.

Because in a ``DataArray`` or ``Dataset`` object each multi-index level is accessible as a "virtual" coordinate, its name must not conflict with the names of the other levels, coordinates and data variables of the same object. Even though xarray sets default names for multi-indexes with unnamed levels, it is recommended that you explicitly set the names of the levels.

.. [1] Latitude and longitude are 2D arrays because the dataset uses `projected coordinates`__. ``reference_time`` refers to the reference time at which the forecast was made, rather than ``time`` which is the valid time for which the forecast applies.

__ https://en.wikipedia.org/wiki/Map_projection

.. currentmodule:: xarray

.. _userguide.duckarrays:

Working with numpy-like arrays
==============================

NumPy-like arrays (often known as :term:`duck array`\s) are drop-in replacements for the :py:class:`numpy.ndarray` class but with different features, such as propagating physical units or a different layout in memory. Xarray can often wrap these array types, allowing you to use labelled dimensions and indexes whilst benefiting from the additional features of these array libraries.

Some numpy-like array types that xarray already has some support for:

* `Cupy `_ - GPU support (see `cupy-xarray `_),
* `Sparse `_ - for performant arrays with many zero elements,
* `Pint `_ - for tracking the physical units of your data (see `pint-xarray `_),
* `Dask `_ - parallel computing on larger-than-memory arrays (see :ref:`using dask with xarray `),
* `Cubed `_ - another parallel computing framework that emphasises reliability (see `cubed-xarray `_).

.. warning:: This feature should be considered somewhat experimental.
Please report any bugs you find on `xarray’s issue tracker `_. .. note:: For information on wrapping dask arrays see :ref:`dask`. Whilst xarray wraps dask arrays in a similar way to that described on this page, chunked array types like :py:class:`dask.array.Array` implement additional methods that require slightly different user code (e.g. calling ``.chunk`` or ``.compute``). See the docs on :ref:`wrapping chunked arrays `. Why "duck"? ----------- Why is it also called a "duck" array? This comes from a common statement of object-oriented programming - "If it walks like a duck, and quacks like a duck, treat it like a duck". In other words, a library like xarray that is capable of using multiple different types of arrays does not have to explicitly check that each one it encounters is permitted (e.g. ``if dask``, ``if numpy``, ``if sparse`` etc.). Instead xarray can take the more permissive approach of simply treating the wrapped array as valid, attempting to call the relevant methods (e.g. ``.mean()``) and only raising an error if a problem occurs (e.g. the method is not found on the wrapped class). This is much more flexible, and allows objects and classes from different libraries to work together more easily. What is a numpy-like array? --------------------------- A "numpy-like array" (also known as a "duck array") is a class that contains array-like data, and implements key numpy-like functionality such as indexing, broadcasting, and computation methods. For example, the `sparse `_ library provides a sparse array type which is useful for representing nD array objects like sparse matrices in a memory-efficient manner. We can create a sparse array object (of the :py:class:`sparse.COO` type) from a numpy array like this: .. jupyter-execute:: from sparse import COO import xarray as xr import numpy as np %xmode minimal .. jupyter-execute:: x = np.eye(4, dtype=np.uint8) # create diagonal identity matrix s = COO.from_numpy(x) s This sparse object does not attempt to explicitly store every element in the array, only the non-zero elements. This approach is much more efficient for large arrays with only a few non-zero elements (such as tri-diagonal matrices). Sparse array objects can be converted back to a "dense" numpy array by calling :py:meth:`sparse.COO.todense`. Just like :py:class:`numpy.ndarray` objects, :py:class:`sparse.COO` arrays support indexing .. jupyter-execute:: s[1, 1] # diagonal elements should be ones .. jupyter-execute:: s[2, 3] # off-diagonal elements should be zero broadcasting, .. jupyter-execute:: x2 = np.zeros( (4, 1), dtype=np.uint8 ) # create second sparse array of different shape s2 = COO.from_numpy(x2) (s * s2) # multiplication requires broadcasting and various computation methods .. jupyter-execute:: s.sum(axis=1) This numpy-like array also supports calling so-called `numpy ufuncs `_ ("universal functions") on it directly: .. jupyter-execute:: np.sum(s, axis=1) Notice that in each case the API for calling the operation on the sparse array is identical to that of calling it on the equivalent numpy array - this is the sense in which the sparse array is "numpy-like". .. note:: For discussion on exactly which methods a class needs to implement to be considered "numpy-like", see :ref:`internals.duckarrays`. Wrapping numpy-like arrays in xarray ------------------------------------ :py:class:`DataArray`, :py:class:`Dataset`, and :py:class:`Variable` objects can wrap these numpy-like arrays. 
Constructing xarray objects which wrap numpy-like arrays ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The primary way to create an xarray object which wraps a numpy-like array is to pass that numpy-like array instance directly to the constructor of the xarray class. The :ref:`page on xarray data structures ` shows how :py:class:`DataArray` and :py:class:`Dataset` both accept data in various forms through their ``data`` argument, but in fact this data can also be any wrappable numpy-like array. For example, we can wrap the sparse array we created earlier inside a new DataArray object: .. jupyter-execute:: s_da = xr.DataArray(s, dims=["i", "j"]) s_da We can see what's inside - the printable representation of our xarray object (the repr) automatically uses the printable representation of the underlying wrapped array. Of course our sparse array object is still there underneath - it's stored under the ``.data`` attribute of the dataarray: .. jupyter-execute:: s_da.data Array methods ~~~~~~~~~~~~~ We saw above that numpy-like arrays provide numpy methods. Xarray automatically uses these when you call the corresponding xarray method: .. jupyter-execute:: s_da.sum(dim="j") Converting wrapped types ~~~~~~~~~~~~~~~~~~~~~~~~ If you want to change the type inside your xarray object you can use :py:meth:`DataArray.as_numpy`: .. jupyter-execute:: s_da.as_numpy() This returns a new :py:class:`DataArray` object, but now wrapping a normal numpy array. If instead you want to convert to numpy and return that numpy array you can use either :py:meth:`DataArray.to_numpy` or :py:meth:`DataArray.values`, where the former is strongly preferred. The difference is in the way they coerce to numpy - :py:meth:`~DataArray.values` always uses :py:func:`numpy.asarray` which will fail for some array types (e.g. ``cupy``), whereas :py:meth:`~DataArray.to_numpy` uses the correct method depending on the array type. .. jupyter-execute:: s_da.to_numpy() .. jupyter-execute:: :raises: s_da.values This illustrates the difference between :py:meth:`~DataArray.data` and :py:meth:`~DataArray.values`, which is sometimes a point of confusion for new xarray users. Explicitly: :py:meth:`DataArray.data` returns the underlying numpy-like array, regardless of type, whereas :py:meth:`DataArray.values` converts the underlying array to a numpy array before returning it. (This is another reason to use :py:meth:`~DataArray.to_numpy` over :py:meth:`~DataArray.values` - the intention is clearer.) Conversion to numpy as a fallback ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If a wrapped array does not implement the corresponding array method then xarray will often attempt to convert the underlying array to a numpy array so that the operation can be performed. You may want to watch out for this behavior, and report any instances in which it causes problems. 
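One practical way to spot such conversions (a minimal sketch using the sparse-backed ``s_da`` from above; the choice of operation is illustrative) is to check the type of ``.data`` after an operation:

.. code-block:: python

    import sparse

    result = s_da * 2  # scalar multiplication is implemented by sparse

    # If the duck array survived the operation, .data is still a sparse.COO;
    # a plain numpy.ndarray here would indicate a silent fallback conversion.
    assert isinstance(result.data, sparse.COO)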
Most of xarray's API does support using :term:`duck array` objects, but there are a few areas where the code will still convert to ``numpy`` arrays:

- Dimension coordinates, and thus all indexing operations:

  * :py:meth:`Dataset.sel` and :py:meth:`DataArray.sel`
  * :py:meth:`Dataset.loc` and :py:meth:`DataArray.loc`
  * :py:meth:`Dataset.drop_sel` and :py:meth:`DataArray.drop_sel`
  * :py:meth:`Dataset.reindex`, :py:meth:`Dataset.reindex_like`, :py:meth:`DataArray.reindex` and :py:meth:`DataArray.reindex_like`: duck arrays in data variables and non-dimension coordinates won't be cast

- Functions and methods that depend on external libraries or features of ``numpy`` not covered by ``__array_function__`` / ``__array_ufunc__``:

  * :py:meth:`Dataset.ffill` and :py:meth:`DataArray.ffill` (uses ``bottleneck``)
  * :py:meth:`Dataset.bfill` and :py:meth:`DataArray.bfill` (uses ``bottleneck``)
  * :py:meth:`Dataset.interp`, :py:meth:`Dataset.interp_like`, :py:meth:`DataArray.interp` and :py:meth:`DataArray.interp_like` (uses ``scipy``): duck arrays in data variables and non-dimension coordinates will be cast, in addition to not supporting duck arrays in dimension coordinates
  * :py:meth:`Dataset.rolling` and :py:meth:`DataArray.rolling` (requires ``numpy>=1.20``)
  * :py:meth:`Dataset.rolling_exp` and :py:meth:`DataArray.rolling_exp` (uses ``numbagg``)
  * :py:meth:`Dataset.interpolate_na` and :py:meth:`DataArray.interpolate_na` (uses :py:class:`numpy.vectorize`)
  * :py:func:`apply_ufunc` with ``vectorize=True`` (uses :py:class:`numpy.vectorize`)

- Incompatibilities between different :term:`duck array` libraries:

  * :py:meth:`Dataset.chunk` and :py:meth:`DataArray.chunk`: this fails if the data was not already chunked and the :term:`duck array` (e.g. a ``pint`` quantity) should wrap the new ``dask`` array; changing the chunk sizes works, however.

Extensions using duck arrays
----------------------------

Whilst the features above allow many numpy-like array libraries to be used pretty seamlessly with xarray, it often also makes sense to use an interfacing package to make certain tasks easier.

For example the `pint-xarray package `_ offers a custom ``.pint`` accessor (see :ref:`internals.accessors`) which provides convenient access to information stored within the wrapped array (e.g. ``.units`` and ``.magnitude``), and makes creating wrapped pint arrays (and especially xarray-wrapping-pint-wrapping-dask arrays) simpler for the user.

We maintain a list of libraries extending ``xarray`` to make working with particular wrapped duck arrays easier. If you know of more that aren't on this list please raise an issue to add them!

- `pint-xarray `_
- `cupy-xarray `_
- `cubed-xarray `_

.. _ecosystem:

Xarray related projects
-----------------------

Below is a list of existing open source projects that build functionality upon xarray. See also section :ref:`internals` for more details on how to build xarray extensions. We also maintain the `xarray-contrib `_ GitHub organization as a place to curate projects that build upon xarray.

Geosciences
~~~~~~~~~~~

- `aospy `_: Automated analysis and management of gridded climate data.
- `argopy `_: xarray-based Argo data access, manipulation and visualisation for standard users as well as Argo experts.
- `cf_xarray `_: Provides an accessor (DataArray.cf or Dataset.cf) that allows you to interpret Climate and Forecast metadata convention attributes present on xarray objects.
- `climpred `_: Analysis of ensemble forecast models for climate prediction.
- `geocube `_: Tool to convert geopandas vector data into rasterized xarray data.
- `GeoWombat `_: Utilities for analysis of remotely sensed and gridded raster data at scale (easily tame Landsat, Sentinel, Quickbird, and PlanetScope).
- `grib2io `_: Utility to work with GRIB2 files including an xarray backend, DASK support for parallel reading in open_mfdataset, lazy loading of data, editing of GRIB2 attributes and GRIB2IO DataArray attrs, and spatial interpolation and reprojection of GRIB2 messages and GRIB2IO Datasets/DataArrays for both grid to grid and grid to stations.
- `gsw-xarray `_: a wrapper around `gsw `_ that adds CF-compliant attributes when possible, units, name.
- `infinite-diff `_: xarray-based finite-differencing, focused on gridded climate/meteorology data
- `marc_analysis `_: Analysis package for CESM/MARC experiments and output.
- `MetPy `_: A collection of tools in Python for reading, visualizing, and performing calculations with weather data.
- `MPAS-Analysis `_: Analysis for simulations produced with Model for Prediction Across Scales (MPAS) components and the Accelerated Climate Model for Energy (ACME).
- `OGGM `_: Open Global Glacier Model
- `Oocgcm `_: Analysis of large gridded geophysical datasets
- `Open Data Cube `_: Analysis toolkit of continental scale Earth Observation data from satellites.
- `Pangaea `_: xarray extension for gridded land surface & weather model output.
- `Pangeo `_: A community effort for big data geoscience in the cloud.
- `PyGDX `_: Python 3 package for accessing data stored in GAMS Data eXchange (GDX) files. Also uses a custom subclass.
- `pyinterp `_: Python 3 package for interpolating geo-referenced data used in the field of geosciences.
- `pyXpcm `_: xarray-based Profile Classification Modelling (PCM), mostly for ocean data.
- `Regionmask `_: plotting and creation of masks of spatial regions
- `rioxarray `_: geospatial xarray extension powered by rasterio
- `salem `_: Adds geolocalised subsetting, masking, and plotting operations to xarray's data structures via accessors.
- `SatPy `_: Library for reading and manipulating meteorological remote sensing data and writing it to various image and data file formats.
- `SARXarray `_: xarray extension for reading and processing large Synthetic Aperture Radar (SAR) data stacks.
- `shxarray `_: Convert, filter, and map geodesy-related spherical harmonic representations of gravity and terrestrial water storage through an xarray extension.
- `Spyfit `_: FTIR spectroscopy of the atmosphere
- `windspharm `_: Spherical harmonic wind analysis in Python.
- `wradlib `_: An Open Source Library for Weather Radar Data Processing.
- `wrf-python `_: A collection of diagnostic and interpolation routines for use with output of the Weather Research and Forecasting (WRF-ARW) Model.
- `xarray-eopf `_: An xarray backend implementation for opening ESA EOPF data products in Zarr format.
- `xarray-regrid `_: xarray extension for regridding rectilinear data.
- `xarray-simlab `_: xarray extension for computer model simulations.
- `xarray-spatial `_: Numba-accelerated raster-based spatial processing tools (NDVI, curvature, zonal-statistics, proximity, hillshading, viewshed, etc.)
- `xarray-topo `_: xarray extension for topographic analysis and modelling.
- `xbpch `_: xarray interface for bpch files.
- `xCDAT `_: An extension of xarray for climate data analysis on structured grids.
- `xclim `_: A library for calculating climate science indices with unit handling built from xarray and dask.
- `xESMF `_: Universal regridder for geospatial data.
- `xgcm `_: Extends the xarray data model to understand finite volume grid cells (common in General Circulation Models) and provides interpolation and difference operations for such grids.
- `xmitgcm `_: a python package for reading `MITgcm `_ binary MDS files into xarray data structures.
- `xnemogcm `_: a package to read `NEMO `_ output files and add attributes to interface with xgcm.

Machine Learning
~~~~~~~~~~~~~~~~

- `ArviZ `_: Exploratory analysis of Bayesian models, built on top of xarray.
- `Darts `_: User-friendly modern machine learning for time series in Python.
- `Elm `_: Parallel machine learning on xarray data structures
- `sklearn-xarray (1) `_: Combines scikit-learn and xarray (1).
- `sklearn-xarray (2) `_: Combines scikit-learn and xarray (2).
- `xbatcher `_: Batch Generation from Xarray Datasets.

Other domains
~~~~~~~~~~~~~

- `ptsa `_: EEG Time Series Analysis
- `pycalphad `_: Computational Thermodynamics in Python
- `pyomeca `_: Python framework for biomechanical analysis
- `movement `_: A Python toolbox for analysing animal body movements

Extend xarray capabilities
~~~~~~~~~~~~~~~~~~~~~~~~~~

- `Collocate `_: Collocate xarray trajectories in arbitrary physical dimensions
- `eofs `_: EOF analysis in Python.
- `hypothesis-gufunc `_: Extension to hypothesis. Makes it easy to write unit tests with xarray objects as input.
- `ntv-pandas `_: A tabular analyzer and a semantic, compact and reversible converter for multidimensional and tabular data
- `nxarray `_: NeXus input/output capability for xarray.
- `xarray-compare `_: xarray extension for data comparison.
- `xarray-dataclasses `_: xarray extension for typed DataArray and Dataset creation.
- `xarray_einstats `_: Statistics, linear algebra and einops for xarray
- `xarray_extras `_: Advanced algorithms for xarray objects (e.g. integrations/interpolations).
- `xeofs `_: PCA/EOF analysis and related techniques, integrated with xarray and Dask for efficient handling of large-scale data.
- `xpublish `_: Publish Xarray Datasets via a Zarr compatible REST API.
- `xrft `_: Fourier transforms for xarray data.
- `xr-scipy `_: A lightweight scipy wrapper for xarray.
- `X-regression `_: Multiple linear regression from Statsmodels library coupled with Xarray library.
- `xskillscore `_: Metrics for verifying forecasts.
- `xyzpy `_: Easily generate high dimensional data, including parallelization.
- `xarray-lmfit `_: xarray extension for curve fitting using `lmfit `_.

Visualization
~~~~~~~~~~~~~

- `datashader `_, `geoviews `_, `holoviews `_: visualization packages for large data.
- `hvplot `_: A high-level plotting API for the PyData ecosystem built on HoloViews.
- `psyplot `_: Interactive data visualization with python.
- `xarray-leaflet `_: An xarray extension for tiled map plotting based on ipyleaflet.
- `xtrude `_: An xarray extension for 3D terrain visualization based on pydeck.
- `pyvista-xarray `_: xarray DataArray accessor for 3D visualization with `PyVista `_ and DataSet engines for reading VTK data formats.

Non-Python projects
~~~~~~~~~~~~~~~~~~~

- `xframe `_: C++ data structures inspired by xarray.
- `AxisArrays `_, `NamedArrays `_ and `YAXArrays.jl `_: similar data structures for Julia.

More projects can be found at the `"xarray" Github topic `_.

.. currentmodule:: xarray

.. _groupby:

GroupBy: Group and Bin Data
---------------------------

Often we want to bin or group data, produce statistics (mean, variance) on the groups, and then return a reduced data set.
To do this, Xarray supports `"group by"`__ operations with the same API as pandas to implement the `split-apply-combine`__ strategy: __ https://pandas.pydata.org/pandas-docs/stable/groupby.html __ https://www.jstatsoft.org/v40/i01/paper - Split your data into multiple independent groups. - Apply some function to each group. - Combine your groups back into a single data object. Group by operations work on both :py:class:`Dataset` and :py:class:`DataArray` objects. Most of the examples focus on grouping by a single one-dimensional variable, although support for grouping over a multi-dimensional variable has recently been implemented. Note that for one-dimensional data, it is usually faster to rely on pandas' implementation of the same pipeline. .. tip:: `Install the flox package `_ to substantially improve the performance of GroupBy operations, particularly with dask. flox `extends Xarray's in-built GroupBy capabilities `_ by allowing grouping by multiple variables, and lazy grouping by dask arrays. If installed, Xarray will automatically use flox by default. Split ~~~~~ Let's create a simple example dataset: .. jupyter-execute:: :hide-code: import numpy as np import pandas as pd import xarray as xr np.random.seed(123456) .. jupyter-execute:: ds = xr.Dataset( {"foo": (("x", "y"), np.random.rand(4, 3))}, coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))}, ) arr = ds["foo"] ds If we groupby the name of a variable or coordinate in a dataset (we can also use a DataArray directly), we get back a ``GroupBy`` object: .. jupyter-execute:: ds.groupby("letters") This object works very similarly to a pandas GroupBy object. You can view the group indices with the ``groups`` attribute: .. jupyter-execute:: ds.groupby("letters").groups You can also iterate over groups in ``(label, group)`` pairs: .. jupyter-execute:: list(ds.groupby("letters")) You can index out a particular group: .. jupyter-execute:: ds.groupby("letters")["b"] To group by multiple variables, see :ref:`this section `. Binning ~~~~~~~ Sometimes you don't want to use all the unique values to determine the groups but instead want to "bin" the data into coarser groups. You could always create a customized coordinate, but xarray facilitates this via the :py:meth:`Dataset.groupby_bins` method. .. jupyter-execute:: x_bins = [0, 25, 50] ds.groupby_bins("x", x_bins).groups The binning is implemented via :func:`pandas.cut`, whose documentation details how the bins are assigned. As seen in the example above, by default, the bins are labeled with strings using set notation to precisely identify the bin limits. To override this behavior, you can specify the bin labels explicitly. Here we choose ``float`` labels which identify the bin centers: .. jupyter-execute:: x_bin_labels = [12.5, 37.5] ds.groupby_bins("x", x_bins, labels=x_bin_labels).groups Apply ~~~~~ To apply a function to each group, you can use the flexible :py:meth:`core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically concatenated back together along the group axis: .. jupyter-execute:: def standardize(x): return (x - x.mean()) / x.std() arr.groupby("letters").map(standardize) GroupBy objects also have a :py:meth:`core.groupby.DatasetGroupBy.reduce` method and methods like :py:meth:`core.groupby.DatasetGroupBy.mean` as shortcuts for applying an aggregation function: .. jupyter-execute:: arr.groupby("letters").mean(dim="x") Using a groupby is thus also a convenient shortcut for aggregating over all dimensions *other than* the provided one: .. 
Split ~~~~~ Let's create a simple example dataset: .. jupyter-execute:: :hide-code: import numpy as np import pandas as pd import xarray as xr np.random.seed(123456) .. jupyter-execute:: ds = xr.Dataset( {"foo": (("x", "y"), np.random.rand(4, 3))}, coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))}, ) arr = ds["foo"] ds If we groupby the name of a variable or coordinate in a dataset (we can also use a DataArray directly), we get back a ``GroupBy`` object: .. jupyter-execute:: ds.groupby("letters") This object works very similarly to a pandas GroupBy object. You can view the group indices with the ``groups`` attribute: .. jupyter-execute:: ds.groupby("letters").groups You can also iterate over groups in ``(label, group)`` pairs: .. jupyter-execute:: list(ds.groupby("letters")) You can index out a particular group: .. jupyter-execute:: ds.groupby("letters")["b"] To group by multiple variables, see :ref:`this section `. Binning ~~~~~~~ Sometimes you don't want to use all the unique values to determine the groups but instead want to "bin" the data into coarser groups. You could always create a customized coordinate, but xarray facilitates this via the :py:meth:`Dataset.groupby_bins` method. .. jupyter-execute:: x_bins = [0, 25, 50] ds.groupby_bins("x", x_bins).groups The binning is implemented via :func:`pandas.cut`, whose documentation details how the bins are assigned. As seen in the example above, by default, the bins are labeled with strings using set notation to precisely identify the bin limits. To override this behavior, you can specify the bin labels explicitly. Here we choose ``float`` labels which identify the bin centers: .. jupyter-execute:: x_bin_labels = [12.5, 37.5] ds.groupby_bins("x", x_bins, labels=x_bin_labels).groups Apply ~~~~~ To apply a function to each group, you can use the flexible :py:meth:`core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically concatenated back together along the group axis: .. jupyter-execute:: def standardize(x): return (x - x.mean()) / x.std() arr.groupby("letters").map(standardize) GroupBy objects also have a :py:meth:`core.groupby.DatasetGroupBy.reduce` method and methods like :py:meth:`core.groupby.DatasetGroupBy.mean` as shortcuts for applying an aggregation function: .. jupyter-execute:: arr.groupby("letters").mean(dim="x") Using a groupby is thus also a convenient shortcut for aggregating over all dimensions *other than* the provided one: .. jupyter-execute:: ds.groupby("x").std(...) .. note:: We use an ellipsis (`...`) here to indicate we want to reduce over all other dimensions. First and last ~~~~~~~~~~~~~~ There are two special aggregation operations that are currently only found on groupby objects: first and last. These return the first or last value for each group along the grouped dimension: .. jupyter-execute:: ds.groupby("letters").first(...) By default, they skip missing values (control this with ``skipna``).
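For instance, with ``skipna=False`` the result is whatever value comes first in each group, even if it is ``NaN``. A minimal sketch:

.. code-block:: python

    # keep a leading NaN instead of skipping to the next valid value
    ds.groupby("letters").first(skipna=False)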
Grouped arithmetic ~~~~~~~~~~~~~~~~~~ GroupBy objects also support a limited set of binary arithmetic operations, as a shortcut for mapping over all unique labels. Binary arithmetic is supported for ``(GroupBy, Dataset)`` and ``(GroupBy, DataArray)`` pairs, as long as the dataset or data array uses the unique grouped values as one of its index coordinates. For example: .. jupyter-execute:: alt = arr.groupby("letters").mean(...) alt .. jupyter-execute:: ds.groupby("letters") - alt This last line is roughly equivalent to the following:: results = [] for label, group in ds.groupby('letters'): results.append(group - alt.sel(letters=label)) xr.concat(results, dim='x') .. _groupby.multidim: Multidimensional Grouping ~~~~~~~~~~~~~~~~~~~~~~~~~ Many datasets have a multidimensional coordinate variable (e.g. longitude) which is different from the logical grid dimensions (e.g. nx, ny). Such variables are valid under the `CF conventions`__. Xarray supports groupby operations over multidimensional coordinate variables: __ https://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_two_dimensional_latitude_longitude_coordinate_variables .. jupyter-execute:: da = xr.DataArray( [[0, 1], [2, 3]], coords={ "lon": (["ny", "nx"], [[30, 40], [40, 50]]), "lat": (["ny", "nx"], [[10, 10], [20, 20]]), }, dims=["ny", "nx"], ) da .. jupyter-execute:: da.groupby("lon").sum(...) .. jupyter-execute:: da.groupby("lon").map(lambda x: x - x.mean(), shortcut=False) Because multidimensional groups have the ability to generate a very large number of bins, coarse-binning via :py:meth:`Dataset.groupby_bins` may be desirable: .. jupyter-execute:: da.groupby_bins("lon", [0, 45, 50]).sum() These methods group by ``lon`` values. It is also possible to groupby each cell in a grid, regardless of value, by stacking multiple dimensions, applying your function, and then unstacking the result: .. jupyter-execute:: stacked = da.stack(gridcell=["ny", "nx"]) stacked.groupby("gridcell").sum(...).unstack("gridcell") Alternatively, you can groupby both ``lat`` and ``lon`` at the :ref:`same time `. .. _groupby.groupers: Grouper Objects ~~~~~~~~~~~~~~~ Both ``groupby_bins`` and ``resample`` are specializations of the core ``groupby`` operation for binning and time resampling, respectively. Many problems demand more complex GroupBy application: for example, grouping by multiple variables with a combination of categorical grouping, binning, and resampling; or more specializations like spatial resampling; or more complex time grouping like special handling of seasons, or the ability to specify custom seasons. To handle these use-cases and more, Xarray is evolving to provide an extension point using ``Grouper`` objects. .. tip:: See the `grouper design`_ doc for more detail on the motivation and design ideas behind Grouper objects. .. _grouper design: https://github.com/pydata/xarray/blob/main/design_notes/grouper_objects.md For now Xarray provides three specialized Grouper objects: 1. :py:class:`groupers.UniqueGrouper` for categorical grouping 2. :py:class:`groupers.BinGrouper` for binned grouping 3. :py:class:`groupers.TimeResampler` for resampling along a datetime coordinate These provide functionality identical to the existing ``groupby``, ``groupby_bins``, and ``resample`` methods. That is, .. code-block:: python ds.groupby("x") is identical to .. code-block:: python from xarray.groupers import UniqueGrouper ds.groupby(x=UniqueGrouper()) Similarly, .. code-block:: python ds.groupby_bins("x", bins=bins) is identical to .. code-block:: python from xarray.groupers import BinGrouper ds.groupby(x=BinGrouper(bins)) and .. code-block:: python ds.resample(time="ME") is identical to .. code-block:: python from xarray.groupers import TimeResampler ds.resample(time=TimeResampler("ME")) The :py:class:`groupers.UniqueGrouper` accepts an optional ``labels`` kwarg that is not present in :py:meth:`DataArray.groupby` or :py:meth:`Dataset.groupby`. Specifying ``labels`` is required when grouping by a lazy array type (e.g. dask or cubed). The ``labels`` are used to construct the output coordinate (say, for a reduction), and aggregations will only be run over the specified labels. You may also use ``labels`` to specify the ordering of groups to be used during iteration. The order will be preserved in the output.
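For example, you might fix both the set of groups and their output order with something like the following sketch (``ds`` and its ``letters`` coordinate are the example dataset defined above):

.. code-block:: python

    from xarray.groupers import UniqueGrouper

    # aggregate only over the listed groups, in exactly this order
    ds.groupby(letters=UniqueGrouper(labels=["b", "a"])).sum()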
.. _groupby.multiple: Grouping by multiple variables ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use grouper objects to group by multiple dimensions: .. jupyter-execute:: from xarray.groupers import UniqueGrouper da.groupby(["lat", "lon"]).sum() The above is sugar for using ``UniqueGrouper`` objects directly: .. jupyter-execute:: da.groupby(lat=UniqueGrouper(), lon=UniqueGrouper()).sum() Different groupers can be combined to construct sophisticated GroupBy operations. .. jupyter-execute:: from xarray.groupers import BinGrouper ds.groupby(x=BinGrouper(bins=[5, 15, 25]), letters=UniqueGrouper()).sum() Time Grouping and Resampling ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. seealso:: See :ref:`resampling`. Shuffling ~~~~~~~~~ Shuffling generalizes sorting a DataArray or Dataset by another DataArray (call it ``label``), and follows from the idea of grouping by ``label``. Shuffling reorders the DataArray or the DataArrays in a Dataset such that all members of a group occur sequentially. For example, shuffle the object using either :py:class:`DatasetGroupBy` or :py:class:`DataArrayGroupBy`, as appropriate: .. jupyter-execute:: da = xr.DataArray( dims="x", data=[1, 2, 3, 4, 5, 6], coords={"label": ("x", "a b c a b c".split(" "))}, ) da.groupby("label").shuffle_to_chunks() For chunked array types (e.g. dask or cubed), shuffle may result in a more optimized communication pattern when compared to direct indexing by the appropriate indexer. Shuffling also makes GroupBy operations on chunked arrays an embarrassingly parallel problem, and may significantly improve workloads that use :py:meth:`DatasetGroupBy.map` or :py:meth:`DataArrayGroupBy.map`. .. _userguide.hierarchical-data: Hierarchical data ================= .. jupyter-execute:: :hide-code: :hide-output: import numpy as np import pandas as pd import xarray as xr np.random.seed(123456) np.set_printoptions(threshold=10) %xmode minimal .. _why: Why Hierarchical Data? ---------------------- Many real-world datasets are composed of multiple differing components, and it can often be useful to think of these in terms of a hierarchy of related groups of data. Examples of data which one might want to organise in a grouped or hierarchical manner include: - Simulation data at multiple resolutions, - Observational data about the same system but from multiple different types of sensors, - Mixed experimental and theoretical data, - A systematic study recording the same experiment but with different parameters, - Heterogeneous data, such as demographic and meteorological data, or even any combination of the above. Often datasets like this cannot easily fit into a single :py:class:`~xarray.Dataset` object, or are more usefully thought of as groups of related :py:class:`~xarray.Dataset` objects. For this purpose we provide the :py:class:`xarray.DataTree` class. This page explains in detail how to understand and use the different features of the :py:class:`~xarray.DataTree` class for your own hierarchical data needs. .. _node relationships: Node Relationships ------------------ .. _creating a family tree: Creating a Family Tree ~~~~~~~~~~~~~~~~~~~~~~ The three main ways of creating a :py:class:`~xarray.DataTree` object are described briefly in :ref:`creating a datatree`. Here we go into more detail about how to create a tree node-by-node, using a famous family tree from the Simpsons cartoon as an example. Let's start by defining nodes representing the two siblings, Bart and Lisa Simpson: .. jupyter-execute:: bart = xr.DataTree(name="Bart") lisa = xr.DataTree(name="Lisa") Each of these node objects knows its own :py:class:`~xarray.DataTree.name`, but they currently have no relationship to one another. We can connect them by creating another node representing a common parent, Homer Simpson: .. jupyter-execute:: homer = xr.DataTree(name="Homer", children={"Bart": bart, "Lisa": lisa}) Here we set the children of Homer in the node's constructor. We now have a small family tree where we can see how these individual Simpson family members are related to one another: .. jupyter-execute:: print(homer) .. note:: We use ``print()`` above to show the compact tree hierarchy. :py:class:`~xarray.DataTree` objects also have an interactive HTML representation that is enabled by default in editors such as JupyterLab and VSCode. The HTML representation is especially helpful for larger trees and exploring new datasets, as it allows you to expand and collapse nodes. If you prefer the text representations you can also set ``xr.set_options(display_style="text")``. .. Comment:: may remove note and print()s after upstream theme changes https://github.com/pydata/pydata-sphinx-theme/pull/2187 The nodes representing Bart and Lisa are now connected - we can confirm their sibling rivalry by examining the :py:class:`~xarray.DataTree.siblings` property: .. jupyter-execute:: list(homer["Bart"].siblings) But oops, we forgot Homer's third daughter, Maggie! Let's add her by updating Homer's :py:class:`~xarray.DataTree.children` property to include her: .. jupyter-execute:: maggie = xr.DataTree(name="Maggie") homer.children = {"Bart": bart, "Lisa": lisa, "Maggie": maggie} print(homer) Let's check that Maggie knows who her Dad is: .. jupyter-execute:: maggie.parent.name That's good - updating the properties of our nodes does not break the internal consistency of our tree, as changes of parentage are automatically reflected on both nodes. These children obviously have another parent, Marge Simpson, but :py:class:`~xarray.DataTree` nodes can only have a maximum of one parent.
Genealogical `family trees are not even technically trees `_ in the mathematical sense - the fact that distant relatives can mate makes them directed acyclic graphs. Trees of :py:class:`~xarray.DataTree` objects cannot represent this. Homer is currently listed as having no parent (the so-called "root node" of this tree), but we can update his :py:class:`~xarray.DataTree.parent` property: .. jupyter-execute:: abe = xr.DataTree(name="Abe") abe.children = {"Homer": homer} Abe is now the "root" of this tree, which we can see by examining the :py:class:`~xarray.DataTree.root` property of any node in the tree: .. jupyter-execute:: maggie.root.name We can see the whole tree by printing Abe's node or just part of the tree by printing Homer's node: .. jupyter-execute:: print(abe) .. jupyter-execute:: print(abe["Homer"]) In episode 28, Abe Simpson reveals that he had another son, Herbert "Herb" Simpson. We can add Herbert to the family tree without displacing Homer by :py:meth:`~xarray.DataTree.assign`-ing another child to Abe: .. jupyter-execute:: herbert = xr.DataTree(name="Herb") abe = abe.assign({"Herbert": herbert}) print(abe) .. jupyter-execute:: print(abe["Herbert"].name) print(herbert.name) .. note:: This example shows a subtlety - the returned tree has Homer's brother listed as ``"Herbert"``, but the original node was named "Herb". Not only are names overridden when stored as keys like this, but the new node is a copy, so that the original node that was referenced is unchanged (i.e. ``herbert.name == "Herb"`` still). In other words, nodes are copied into trees, not inserted into them. This is intentional, and mirrors the behaviour when storing named :py:class:`~xarray.DataArray` objects inside datasets. Certain manipulations of our tree are forbidden if they would create an inconsistent result. In episode 51 of the show Futurama, Philip J. Fry travels back in time and accidentally becomes his own grandfather. If we try similar time-travelling hijinks with Homer, we get a :py:class:`~xarray.InvalidTreeError` raised: .. jupyter-execute:: :raises: abe["Homer"].children = {"Abe": abe} .. _evolutionary tree: Ancestry in an Evolutionary Tree ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's use a different example of a tree to discuss more complex relationships between nodes - the phylogenetic tree, or tree of life. .. jupyter-execute:: vertebrates = xr.DataTree.from_dict( { "/Sharks": None, "/Bony Skeleton/Ray-finned Fish": None, "/Bony Skeleton/Four Limbs/Amphibians": None, "/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Primates": None, "/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Rodents & Rabbits": None, "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Dinosaurs": None, "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Birds": None, }, name="Vertebrae", ) primates = vertebrates["/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Primates"] dinosaurs = vertebrates[ "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Dinosaurs" ] We have used the :py:meth:`~xarray.DataTree.from_dict` constructor method as a preferred way to quickly create a whole tree, and :ref:`filesystem paths` (to be explained shortly) to select two nodes of interest. .. jupyter-execute:: print(vertebrates) This tree shows various families of species, grouped by their common features (making it technically a `"Cladogram" `_, rather than an evolutionary tree). Here both the species and the features used to group them are represented by :py:class:`~xarray.DataTree` node objects - there is no distinction in types of node.
We can, however, get a list of only the nodes we used to represent species, by using the fact that all those nodes have no children - they are "leaf nodes". We can check if a node is a leaf with :py:meth:`~xarray.DataTree.is_leaf`, and get a list of all leaves with the :py:class:`~xarray.DataTree.leaves` property: .. jupyter-execute:: print(primates.is_leaf) [node.name for node in vertebrates.leaves] Pretending that this is a true evolutionary tree for a moment, we can find the features of the evolutionary ancestors (so-called "ancestor" nodes), the distinguishing feature of the common ancestor of all vertebrate life (the root node), and even the distinguishing feature of the common ancestor of any two species (the common ancestor of two nodes): .. jupyter-execute:: print([node.name for node in reversed(primates.parents)]) print(primates.root.name) print(primates.find_common_ancestor(dinosaurs).name) We can only find a common ancestor between two nodes that lie in the same tree. If we try to find the common evolutionary ancestor between primates and an Alien species that has no relationship to Earth's evolutionary tree, an error will be raised. .. jupyter-execute:: :raises: alien = xr.DataTree(name="Xenomorph") primates.find_common_ancestor(alien) .. _navigating trees: Navigating Trees ---------------- There are various ways to access the different nodes in a tree. Properties ~~~~~~~~~~ We can navigate trees using the :py:class:`~xarray.DataTree.parent` and :py:class:`~xarray.DataTree.children` properties of each node, for example: .. jupyter-execute:: lisa.parent.children["Bart"].name but there are also more convenient ways to access nodes. Dictionary-like interface ~~~~~~~~~~~~~~~~~~~~~~~~~ Children are stored on each node as a key-value mapping from name to child node. They can be accessed and altered via the :py:class:`~xarray.DataTree.__getitem__` and :py:class:`~xarray.DataTree.__setitem__` syntax. In general :py:class:`~xarray.DataTree` objects support almost the entire set of dict-like methods, including :py:meth:`~xarray.DataTree.keys`, :py:meth:`~xarray.DataTree.values`, :py:meth:`~xarray.DataTree.items`, :py:meth:`~xarray.DataTree.__delitem__` and :py:meth:`~xarray.DataTree.update`. .. jupyter-execute:: print(vertebrates["Bony Skeleton"]["Ray-finned Fish"])
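The remaining dict-like methods behave much as you would expect from an ordinary dictionary. A minimal sketch, using a fresh throwaway tree so the examples above are left untouched:

.. code-block:: python

    tmp = xr.DataTree(name="root")
    tmp["a"] = xr.DataTree()          # __setitem__ adds a child node
    tmp.update({"b": xr.DataTree()})  # update() can add several at once
    print(list(tmp.keys()))           # ['a', 'b']
    del tmp["a"]                      # __delitem__ removes a child again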
Note that the dict-like interface combines access to child :py:class:`~xarray.DataTree` nodes and stored :py:class:`~xarray.DataArray` objects, so if we have a node that contains both children and data, calling :py:meth:`~xarray.DataTree.keys` will list both names of child nodes and names of data variables: .. jupyter-execute:: dt = xr.DataTree( dataset=xr.Dataset({"foo": 0, "bar": 1}), children={"a": xr.DataTree(), "b": xr.DataTree()}, ) print(dt) list(dt.keys()) This also means that the names of variables and of child nodes must be different from one another. Attribute-like access ~~~~~~~~~~~~~~~~~~~~~ You can also select both variables and child nodes through dot indexing: .. jupyter-execute:: print(dt.foo) print(dt.a) .. _filesystem paths: Filesystem-like Paths ~~~~~~~~~~~~~~~~~~~~~ Hierarchical trees can be thought of as analogous to file systems. Each node is like a directory, and each directory can contain both more sub-directories and data. .. note:: Future development will allow you to make the filesystem analogy concrete by using :py:func:`~xarray.DataTree.open_mfdatatree` or :py:func:`~xarray.DataTree.save_mfdatatree`. (`See related issue in GitHub `_) Datatree objects support a syntax inspired by unix-like filesystems, where the "path" to a node is specified by the keys of each intermediate node in sequence, separated by forward slashes. This is an extension of the conventional dictionary ``__getitem__`` syntax to allow navigation across multiple levels of the tree. Like with filepaths, paths within the tree can either be relative to the current node, e.g. .. jupyter-execute:: print(abe["Homer/Bart"].name) print(abe["./Homer/Bart"].name) # alternative syntax or relative to the root node. A path specified from the root (as opposed to being specified relative to an arbitrary node in the tree) is sometimes also referred to as a `"fully qualified name" `_, or as an "absolute path". The root node is referred to by ``"/"``, so the path from the root node to its grand-child would be ``"/child/grandchild"``, e.g. .. jupyter-execute:: # access lisa's sibling by a relative path. print(lisa["../Bart"]) # or from absolute path print(lisa["/Homer/Bart"]) Relative paths between nodes also support the ``"../"`` syntax to mean the parent of the current node. We can use this with ``__setitem__`` to add a missing entry to our evolutionary tree, but add it relative to a more familiar node of interest: .. jupyter-execute:: primates["../../Two Fenestrae/Crocodiles"] = xr.DataTree() print(vertebrates) Given two nodes in a tree, we can also find their relative path: .. jupyter-execute:: bart.relative_to(lisa) You can use this filepath feature to build a nested tree from a dictionary of filesystem-like paths and corresponding :py:class:`~xarray.Dataset` objects in a single step. If we have a dictionary where each key is a valid path, and each value is either valid data or ``None``, we can construct a complex tree quickly using the alternative constructor :py:meth:`~xarray.DataTree.from_dict()`: .. jupyter-execute:: d = { "/": xr.Dataset({"foo": "orange"}), "/a": xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])}), "/a/b": xr.Dataset({"zed": np.nan}), "a/c/d": None, } dt = xr.DataTree.from_dict(d) print(dt) .. note:: Notice that using the path-like syntax will also create any intermediate empty nodes necessary to reach the end of the specified path (i.e. the node labelled ``"/a/c"`` in this case). This is to help avoid lots of redundant entries when creating deeply-nested trees using :py:meth:`xarray.DataTree.from_dict`. .. _iterating over trees: Iterating over trees ~~~~~~~~~~~~~~~~~~~~ You can iterate over every node in a tree using the :py:class:`~xarray.DataTree.subtree` property. This returns an iterable of nodes, which yields them in depth-first order. .. jupyter-execute:: for node in vertebrates.subtree: print(node.path) Similarly, :py:class:`~xarray.DataTree.subtree_with_keys` returns an iterable of relative paths and corresponding nodes. A very useful pattern is to iterate over :py:class:`~xarray.DataTree.subtree_with_keys` to manipulate nodes however you wish, then rebuild a new tree using :py:meth:`xarray.DataTree.from_dict()`. For example, we could keep only the nodes containing data by looping over all nodes, checking if they contain any data using :py:class:`~xarray.DataTree.has_data`, then rebuilding a new tree using only the paths of those nodes: ..
jupyter-execute:: non_empty_nodes = { path: node.dataset for path, node in dt.subtree_with_keys if node.has_data } print(xr.DataTree.from_dict(non_empty_nodes)) You can see this tree is similar to the ``dt`` object above, except that it is missing the empty nodes ``a/c`` and ``a/c/d``. (If you want to keep the name of the root node, you will need to add the ``name`` kwarg to :py:meth:`~xarray.DataTree.from_dict`, i.e. ``DataTree.from_dict(non_empty_nodes, name=dt.name)``.) .. _manipulating trees: Manipulating Trees ------------------ Subsetting Tree Nodes ~~~~~~~~~~~~~~~~~~~~~ We can subset our tree to select only nodes of interest in various ways. As on a real filesystem, matching nodes by common patterns in their paths is often useful. We can use :py:meth:`xarray.DataTree.match` for this: .. jupyter-execute:: dt = xr.DataTree.from_dict( { "/a/A": None, "/a/B": None, "/b/A": None, "/b/B": None, } ) result = dt.match("*/B") print(result) We can also subset trees by the contents of the nodes. :py:meth:`xarray.DataTree.filter` retains only the nodes of a tree that meet a certain condition. For example, we could recreate the Simpsons family tree with the ages of each individual, then filter for only the adults. First let's recreate the tree, but with an ``age`` data variable in every node: .. jupyter-execute:: simpsons = xr.DataTree.from_dict( { "/": xr.Dataset({"age": 83}), "/Herbert": xr.Dataset({"age": 40}), "/Homer": xr.Dataset({"age": 39}), "/Homer/Bart": xr.Dataset({"age": 10}), "/Homer/Lisa": xr.Dataset({"age": 8}), "/Homer/Maggie": xr.Dataset({"age": 1}), }, name="Abe", ) print(simpsons) Now let's filter out the minors: .. jupyter-execute:: print(simpsons.filter(lambda node: node["age"] > 18)) The result is a new tree, containing only the nodes matching the condition. (Yes, under the hood :py:meth:`~xarray.DataTree.filter` is just syntactic sugar for the pattern we showed you in :ref:`iterating over trees` !) .. _Tree Contents: Tree Contents ------------- Hollow Trees ~~~~~~~~~~~~ A concept that can sometimes be useful is that of a "Hollow Tree", which means a tree with data stored only at the leaf nodes. This is useful because certain tree manipulation operations only make sense for hollow trees. You can check if a tree is a hollow tree by using the :py:class:`~xarray.DataTree.is_hollow` property. We can see that the Simpsons tree is not hollow because the data variable ``"age"`` is present at some nodes which have children (i.e. Abe and Homer). .. jupyter-execute:: simpsons.is_hollow
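A tree that keeps data only at its leaves returns ``True`` instead. A minimal sketch, using a hypothetical two-leaf tree:

.. code-block:: python

    hollow = xr.DataTree.from_dict(
        {
            "/branch/leaf1": xr.Dataset({"x": 1}),
            "/branch/leaf2": xr.Dataset({"x": 2}),
        }
    )
    hollow.is_hollow  # True: data is stored only in the leaf nodes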
.. _tree computation: Computation ----------- :py:class:`~xarray.DataTree` objects are also useful for performing computations, not just for organizing data. Operations and Methods on Trees ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To show how applying operations across a whole tree at once can be useful, let's first create an example scientific dataset. .. jupyter-execute:: def time_stamps(n_samples, T): """Create an array of evenly-spaced time stamps""" return xr.DataArray( data=np.linspace(0, 2 * np.pi * T, n_samples), dims=["time"] ) def signal_generator(t, f, A, phase): """Generate an example electrical-like waveform""" return A * np.sin(f * t.data + phase) time_stamps1 = time_stamps(n_samples=15, T=1.5) time_stamps2 = time_stamps(n_samples=10, T=1.0) voltages = xr.DataTree.from_dict( { "/oscilloscope1": xr.Dataset( { "potential": ( "time", signal_generator(time_stamps1, f=2, A=1.2, phase=0.5), ), "current": ( "time", signal_generator(time_stamps1, f=2, A=1.2, phase=1), ), }, coords={"time": time_stamps1}, ), "/oscilloscope2": xr.Dataset( { "potential": ( "time", signal_generator(time_stamps2, f=1.6, A=1.6, phase=0.2), ), "current": ( "time", signal_generator(time_stamps2, f=1.6, A=1.6, phase=0.7), ), }, coords={"time": time_stamps2}, ), } ) print(voltages) Most xarray computation methods also exist as methods on datatree objects, so you can for example take the mean value of these two timeseries at once: .. jupyter-execute:: print(voltages.mean(dim="time")) This works by mapping the standard :py:meth:`xarray.Dataset.mean()` method over the dataset stored in each node of the tree one-by-one. The arguments passed to the method are used for every node, so the values of the arguments you pass might be valid for one node and invalid for another: .. jupyter-execute:: :raises: voltages.isel(time=12) Notice that the error raised helpfully indicates which node of the tree the operation failed on. Arithmetic Methods on Trees ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Arithmetic methods are also implemented, so you can e.g. add a scalar to every dataset in the tree at once. For example, we can advance the timeline of the Simpsons by a decade just by: .. jupyter-execute:: print(simpsons + 10) See that the same change (fast-forwarding by adding 10 years to the age of each character) has been applied to every node. Mapping Custom Functions Over Trees ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can map custom computation over each node in a tree using :py:meth:`xarray.DataTree.map_over_datasets`. You can map any function, so long as it takes :py:class:`xarray.Dataset` objects as one (or more) of the input arguments, and returns one (or more) xarray datasets. .. note:: Functions passed to :py:func:`~xarray.DataTree.map_over_datasets` cannot alter nodes in-place. Instead they must return new :py:class:`xarray.Dataset` objects. For example, we can define a function to calculate the Root Mean Square of a timeseries: .. jupyter-execute:: def rms(signal): return np.sqrt(np.mean(signal**2)) Then calculate the RMS value of these signals: .. jupyter-execute:: print(voltages.map_over_datasets(rms)) .. _multiple trees: Operating on Multiple Trees --------------------------- The examples so far have involved mapping functions or methods over the nodes of a single tree, but we can generalize this to mapping functions over multiple trees at once. We can also use :py:func:`~xarray.map_over_datasets` to apply a function over the data in multiple trees, by passing the trees as positional arguments. Iterating Over Multiple Trees ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To iterate over the corresponding nodes in multiple trees, use :py:func:`~xarray.group_subtrees` instead of :py:class:`~xarray.DataTree.subtree_with_keys`. This combines well with :py:meth:`xarray.DataTree.from_dict()` to build a new tree: ..
jupyter-execute:: dt1 = xr.DataTree.from_dict({"a": xr.Dataset({"x": 1}), "b": xr.Dataset({"x": 2})}) dt2 = xr.DataTree.from_dict( {"a": xr.Dataset({"x": 10}), "b": xr.Dataset({"x": 20})} ) result = {} for path, (node1, node2) in xr.group_subtrees(dt1, dt2): result[path] = node1.dataset + node2.dataset dt3 = xr.DataTree.from_dict(result) print(dt3) Alternatively, you can apply a function directly to paired datasets at every node using :py:func:`xarray.map_over_datasets`: .. jupyter-execute:: dt3 = xr.map_over_datasets(lambda x, y: x + y, dt1, dt2) print(dt3) Comparing Trees for Isomorphism ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For it to make sense to map a single non-unary function over the nodes of multiple trees at once, each tree needs to have the same structure. Specifically, two trees can only be considered similar, or "isomorphic", if the full paths to all of their descendant nodes are the same. Applying :py:func:`~xarray.group_subtrees` to trees with different structures raises :py:class:`~xarray.TreeIsomorphismError`: .. jupyter-execute:: :raises: tree = xr.DataTree.from_dict({"a": None, "a/b": None, "a/c": None}) simple_tree = xr.DataTree.from_dict({"a": None}) for _ in xr.group_subtrees(tree, simple_tree): ... We can also explicitly check whether any two trees are isomorphic using the :py:meth:`~xarray.DataTree.isomorphic` method: .. jupyter-execute:: tree.isomorphic(simple_tree) Corresponding tree nodes do not need to have the same data in order to be considered isomorphic: .. jupyter-execute:: tree_with_data = xr.DataTree.from_dict({"a": xr.Dataset({"foo": 1})}) simple_tree.isomorphic(tree_with_data) They also do not need to define child nodes in the same order: .. jupyter-execute:: reordered_tree = xr.DataTree.from_dict({"a": None, "a/c": None, "a/b": None}) tree.isomorphic(reordered_tree) Arithmetic Between Multiple Trees ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Arithmetic operations like multiplication are binary operations, so as long as we have two isomorphic trees, we can do arithmetic between them. .. jupyter-execute:: currents = xr.DataTree.from_dict( { "/oscilloscope1": xr.Dataset( { "current": ( "time", signal_generator(time_stamps1, f=2, A=1.2, phase=1), ), }, coords={"time": time_stamps1}, ), "/oscilloscope2": xr.Dataset( { "current": ( "time", signal_generator(time_stamps2, f=1.6, A=1.6, phase=0.7), ), }, coords={"time": time_stamps2}, ), } ) print(currents) .. jupyter-execute:: currents.isomorphic(voltages) We could use this feature to quickly calculate the electrical power in our signal, P = IV. .. jupyter-execute:: power = currents * voltages print(power) .. _hierarchical-data.alignment-and-coordinate-inheritance: Alignment and Coordinate Inheritance ------------------------------------ .. _data-alignment: Data Alignment ~~~~~~~~~~~~~~ The data in different datatree nodes are not totally independent. In particular, dimensions (and indexes) in child nodes must be exactly aligned with those in their parent nodes. Exact alignment means that shared dimensions must be the same length, and indexes along those dimensions must be equal. .. note:: If you were a previous user of the prototype `xarray-contrib/datatree `_ package, this is different from what you're used to! In that package the data model was that the data stored in each node actually was completely unrelated. The data model is now slightly stricter. This allows us to provide features like :ref:`coordinate-inheritance`. To demonstrate, let's first generate some example datasets which are not aligned with one another: ..
jupyter-execute:: # (drop the attributes just to make the printed representation shorter) ds = xr.tutorial.open_dataset("air_temperature").drop_attrs() ds_daily = ds.resample(time="D").mean("time") ds_weekly = ds.resample(time="W").mean("time") ds_monthly = ds.resample(time="ME").mean("time") These datasets have different lengths along the ``time`` dimension, and are therefore not aligned along that dimension. .. jupyter-execute:: print(ds_daily.sizes) print(ds_weekly.sizes) print(ds_monthly.sizes) We cannot store these non-alignable variables on a single :py:class:`~xarray.Dataset` object, because they do not exactly align: .. jupyter-execute:: :raises: xr.align(ds_daily, ds_weekly, ds_monthly, join="exact") But we :ref:`previously said ` that multi-resolution data is a good use case for :py:class:`~xarray.DataTree`, so surely we should be able to store these in a single :py:class:`~xarray.DataTree`? If we first try to create a :py:class:`~xarray.DataTree` with these different-length time dimensions present in both parents and children, we will still get an alignment error: .. jupyter-execute:: :raises: xr.DataTree.from_dict({"daily": ds_daily, "daily/weekly": ds_weekly}) This is because DataTree checks that data in child nodes align exactly with their parents. .. note:: This requirement of aligned dimensions is similar to netCDF's concept of `inherited dimensions `_, as in netCDF-4 files dimensions are `visible to all child groups `_. This alignment check is performed up through the tree, all the way to the root, and so is therefore equivalent to requiring that this :py:func:`~xarray.align` command succeeds: .. code:: python xr.align(child.dataset, *(parent.dataset for parent in child.parents), join="exact") To represent our unalignable data in a single :py:class:`~xarray.DataTree`, we must instead place all variables which are a function of these different-length dimensions into nodes that are not direct descendants of one another, e.g. organize them as siblings. .. jupyter-execute:: dt = xr.DataTree.from_dict( {"daily": ds_daily, "weekly": ds_weekly, "monthly": ds_monthly} ) print(dt) Now we have a valid :py:class:`~xarray.DataTree` structure which contains all the data at each different time frequency, stored in a separate group. This is a useful way to organise our data because we can still operate on all the groups at once. For example, we can extract all three timeseries at a specific lat-lon location: .. jupyter-execute:: dt_sel = dt.sel(lat=75, lon=300) print(dt_sel) or compute the standard deviation of each timeseries to find out how it varies with sampling frequency: .. jupyter-execute:: dt_std = dt.std(dim="time") print(dt_std) .. _coordinate-inheritance: Coordinate Inheritance ~~~~~~~~~~~~~~~~~~~~~~ Notice that in the trees we constructed above there is some redundancy - the ``lat`` and ``lon`` variables appear in each sibling group, but are identical across the groups. .. jupyter-execute:: dt We can use "Coordinate Inheritance" to define them only once in a parent group and remove this redundancy, whilst still being able to access those coordinate variables from the child groups. .. note:: This is also a new feature relative to the prototype `xarray-contrib/datatree `_ package. Let's instead place only the time-dependent variables in the child groups, and put the non-time-dependent ``lat`` and ``lon`` variables in the parent (root) group: ..
jupyter-execute:: dt = xr.DataTree.from_dict( { "/": ds.drop_dims("time"), "daily": ds_daily.drop_vars(["lat", "lon"]), "weekly": ds_weekly.drop_vars(["lat", "lon"]), "monthly": ds_monthly.drop_vars(["lat", "lon"]), } ) dt This is preferred to the previous representation because it now makes it clear that all of these datasets share common spatial grid coordinates. Defining the common coordinates just once also ensures that the spatial coordinates for each group cannot become out of sync with one another during operations. We can still access the coordinates defined in the parent groups from any of the child groups as if they were actually present on the child groups: .. jupyter-execute:: dt.daily.coords .. jupyter-execute:: dt["daily/lat"] As we can still access them, we say that the ``lat`` and ``lon`` coordinates in the child groups have been "inherited" from their common parent group. If we print just one of the child nodes, it will still display inherited coordinates, but explicitly mark them as such: .. jupyter-execute:: dt["/daily"] This helps to differentiate which variables are defined on the datatree node that you are currently looking at, and which were defined somewhere above it. We can also still perform all the same operations on the whole tree: .. jupyter-execute:: dt.sel(lat=[75], lon=[300]) .. jupyter-execute:: dt.std(dim="time") ########### User Guide ########### In this user guide, you will find detailed descriptions and examples that describe many common tasks that you can accomplish with Xarray. .. toctree:: :maxdepth: 2 :caption: Data model terminology data-structures hierarchical-data dask .. toctree:: :maxdepth: 2 :caption: Core operations indexing combining reshaping computation groupby interpolation .. toctree:: :maxdepth: 2 :caption: I/O io complex-numbers .. toctree:: :maxdepth: 2 :caption: Visualization plotting .. toctree:: :maxdepth: 2 :caption: Interoperability pandas duckarrays ecosystem .. toctree:: :maxdepth: 2 :caption: Domain-specific workflows time-series weather-climate .. toctree:: :maxdepth: 2 :caption: Options and Testing options testing .. _indexing: Indexing and selecting data =========================== .. jupyter-execute:: :hide-code: :hide-output: import numpy as np import pandas as pd import xarray as xr np.random.seed(123456) %xmode minimal Xarray offers extremely flexible indexing routines that combine the best features of NumPy and pandas for data selection. The most basic way to access elements of a :py:class:`~xarray.DataArray` object is to use Python's ``[]`` syntax, such as ``array[i, j]``, where ``i`` and ``j`` are both integers. As xarray objects can store coordinates corresponding to each dimension of an array, label-based indexing similar to ``pandas.DataFrame.loc`` is also possible. In label-based indexing, the element position ``i`` is automatically looked up from the coordinate values. Dimensions of xarray objects have names, so you can also look up the dimensions by name, instead of remembering their positional order. Quick overview -------------- In total, xarray supports four different kinds of indexing, as described below and summarized in this table: .. |br| raw:: html <br/>
+------------------+--------------+---------------------------------+--------------------------------+ | Dimension lookup | Index lookup | ``DataArray`` syntax | ``Dataset`` syntax | +==================+==============+=================================+================================+ | Positional | By integer | ``da[:, 0]`` | *not available* | +------------------+--------------+---------------------------------+--------------------------------+ | Positional | By label | ``da.loc[:, 'IA']`` | *not available* | +------------------+--------------+---------------------------------+--------------------------------+ | By name | By integer | ``da.isel(space=0)`` or |br| | ``ds.isel(space=0)`` or |br| | | | | ``da[dict(space=0)]`` | ``ds[dict(space=0)]`` | +------------------+--------------+---------------------------------+--------------------------------+ | By name | By label | ``da.sel(space='IA')`` or |br| | ``ds.sel(space='IA')`` or |br| | | | | ``da.loc[dict(space='IA')]`` | ``ds.loc[dict(space='IA')]`` | +------------------+--------------+---------------------------------+--------------------------------+ More advanced indexing is also possible for all the methods by supplying :py:class:`~xarray.DataArray` objects as indexers. See :ref:`vectorized_indexing` for the details. Positional indexing ------------------- Indexing a :py:class:`~xarray.DataArray` directly works (mostly) just like it does for numpy arrays, except that the returned object is always another DataArray: .. jupyter-execute:: da = xr.DataArray( np.random.rand(4, 3), [ ("time", pd.date_range("2000-01-01", periods=4)), ("space", ["IA", "IL", "IN"]), ], ) da[:2] .. jupyter-execute:: da[0, 0] .. jupyter-execute:: da[:, [2, 1]] Attributes are persisted in all indexing operations. .. warning:: Positional indexing deviates from NumPy when indexing with multiple arrays like ``da[[0, 1], [0, 1]]``, as described in :ref:`vectorized_indexing`. Xarray also supports label-based indexing, just like pandas. Because we use a :py:class:`pandas.Index` under the hood, label based indexing is very fast. To do label based indexing, use the :py:attr:`~xarray.DataArray.loc` attribute: .. jupyter-execute:: da.loc["2000-01-01":"2000-01-02", "IA"] In this example, the selection is the subpart of the array in the range '2000-01-01':'2000-01-02' along the first coordinate ``time``, with the value 'IA' from the second coordinate ``space``. You can perform any of the `label indexing operations supported by pandas`__, including indexing with individual labels, slices, and lists/arrays of labels, as well as indexing with boolean arrays. Like pandas, label based indexing in xarray is *inclusive* of both the start and stop bounds. __ https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label Setting values with label based indexing is also supported: .. jupyter-execute:: da.loc["2000-01-01", ["IL", "IN"]] = -10 da Indexing with dimension names ----------------------------- With the dimension names, we do not have to rely on dimension order and can use them explicitly to slice data. There are two ways to do this: 1. Use the :py:meth:`~xarray.DataArray.sel` and :py:meth:`~xarray.DataArray.isel` convenience methods: .. jupyter-execute:: # index by integer array indices da.isel(space=0, time=slice(None, 2)) .. jupyter-execute:: # index by dimension coordinate labels da.sel(time=slice("2000-01-01", "2000-01-02")) 2. Use a dictionary as the argument for array positional or label based array indexing: ..
jupyter-execute:: # index by integer array indices da[dict(space=0, time=slice(None, 2))] .. jupyter-execute:: # index by dimension coordinate labels da.loc[dict(time=slice("2000-01-01", "2000-01-02"))] The arguments to these methods can be any objects that could index the array along the dimension given by the keyword, e.g., labels for an individual value, Python :py:class:`slice` objects or 1-dimensional arrays. .. note:: We would love to be able to do indexing with labeled dimension names inside brackets, but unfortunately, `Python does not yet support indexing with keyword arguments`__ like ``da[space=0]``. __ https://legacy.python.org/dev/peps/pep-0472/ .. _nearest neighbor lookups: Nearest neighbor lookups ------------------------ The label based selection methods :py:meth:`~xarray.Dataset.sel`, :py:meth:`~xarray.Dataset.reindex` and :py:meth:`~xarray.Dataset.reindex_like` all support ``method`` and ``tolerance`` keyword arguments. The method parameter allows for enabling nearest neighbor (inexact) lookups by use of the methods ``'pad'``, ``'backfill'`` or ``'nearest'``: .. jupyter-execute:: da = xr.DataArray([1, 2, 3], [("x", [0, 1, 2])]) da.sel(x=[1.1, 1.9], method="nearest") .. jupyter-execute:: da.sel(x=0.1, method="backfill") .. jupyter-execute:: da.reindex(x=[0.5, 1, 1.5, 2, 2.5], method="pad") Tolerance limits the maximum distance for valid matches with an inexact lookup: .. jupyter-execute:: da.reindex(x=[1.1, 1.5], method="nearest", tolerance=0.2) The method parameter is not yet supported if any of the arguments to ``.sel()`` is a ``slice`` object: .. jupyter-execute:: :raises: da.sel(x=slice(1, 3), method="nearest") However, you don't need to use ``method`` to do inexact slicing. Slicing already returns all values inside the range (inclusive), as long as the index labels are monotonic increasing: .. jupyter-execute:: da.sel(x=slice(0.9, 3.1)) Indexing axes with monotonic decreasing labels also works, as long as the ``slice`` or ``.loc`` arguments are also decreasing: .. jupyter-execute:: reversed_da = da[::-1] reversed_da.loc[3.1:0.9] .. note:: If you want to interpolate along coordinates rather than looking up the nearest neighbors, use :py:meth:`~xarray.Dataset.interp` and :py:meth:`~xarray.Dataset.interp_like`. See :ref:`interpolation ` for the details. Dataset indexing ---------------- We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset: .. jupyter-execute:: da = xr.DataArray( np.random.rand(4, 3), [ ("time", pd.date_range("2000-01-01", periods=4)), ("space", ["IA", "IL", "IN"]), ], ) ds = da.to_dataset(name="foo") ds.isel(space=[0], time=[0]) .. jupyter-execute:: ds.sel(time="2000-01-01") Positional indexing on a dataset is not supported because the ordering of dimensions in a dataset is somewhat ambiguous (it can vary between different arrays). However, you can do normal indexing with dimension names: .. jupyter-execute:: ds[dict(space=[0], time=[0])] .. jupyter-execute:: ds.loc[dict(time="2000-01-01")] Dropping labels and dimensions ------------------------------ The :py:meth:`~xarray.Dataset.drop_sel` method returns a new object with the listed index labels along a dimension dropped: .. jupyter-execute:: ds.drop_sel(space=["IN", "IL"]) ``drop_sel`` is both a ``Dataset`` and ``DataArray`` method. Use :py:meth:`~xarray.Dataset.drop_dims` to drop a full dimension from a Dataset. Any variables with these dimensions are also dropped: .. jupyter-execute:: ds.drop_dims("time")
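If you want to drop by integer position rather than by label, there is also a positional counterpart, :py:meth:`~xarray.Dataset.drop_isel`; a minimal sketch:

.. code-block:: python

    # drop the first point along ``space`` by position instead of by label
    ds.drop_isel(space=[0])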
.. _masking with where: Masking with ``where`` ---------------------- Indexing methods on xarray objects generally return a subset of the original data. However, it is sometimes useful to select an object with the same shape as the original data, but with some elements masked. To do this type of selection in xarray, use :py:meth:`~xarray.DataArray.where`: .. jupyter-execute:: da = xr.DataArray(np.arange(16).reshape(4, 4), dims=["x", "y"]) da.where(da.x + da.y < 4) This is particularly useful for ragged indexing of multi-dimensional data, e.g., to apply a 2D mask to an image. Note that ``where`` follows all the usual xarray broadcasting and alignment rules for binary operations (e.g., ``+``) between the object being indexed and the condition, as described in :ref:`compute`: .. jupyter-execute:: da.where(da.y < 2) By default ``where`` maintains the original size of the data. For cases where the selected data size is much smaller than the original data, use of the option ``drop=True`` clips coordinate elements that are fully masked: .. jupyter-execute:: da.where(da.y < 2, drop=True) .. _selecting values with isin: Selecting values with ``isin`` ------------------------------ To check whether elements of an xarray object contain a single object, you can compare with the equality operator ``==`` (e.g., ``arr == 3``). To check multiple values, use :py:meth:`~xarray.DataArray.isin`: .. jupyter-execute:: da = xr.DataArray([1, 2, 3, 4, 5], dims=["x"]) da.isin([2, 4]) :py:meth:`~xarray.DataArray.isin` works particularly well with :py:meth:`~xarray.DataArray.where` to support indexing by arrays that are not already labels of an array: .. jupyter-execute:: lookup = xr.DataArray([-1, -2, -3, -4, -5], dims=["x"]) da.where(lookup.isin([-2, -4]), drop=True) However, some caution is in order: when done repeatedly, this type of indexing is significantly slower than using :py:meth:`~xarray.DataArray.sel`. .. _vectorized_indexing: Vectorized Indexing ------------------- Like numpy and pandas, xarray supports indexing many array elements at once in a vectorized manner. If you only provide integers, slices, or unlabeled arrays (arrays without dimension names, such as ``np.ndarray`` or ``list``, but not :py:class:`~xarray.DataArray` or :py:class:`~xarray.Variable`), indexing can be understood as orthogonal. Each indexer component selects independently along the corresponding dimension, similar to how vector indexing works in Fortran or MATLAB, or after using the :py:func:`numpy.ix_` helper: .. jupyter-execute:: da = xr.DataArray( np.arange(12).reshape((3, 4)), dims=["x", "y"], coords={"x": [0, 1, 2], "y": ["a", "b", "c", "d"]}, ) da .. jupyter-execute:: da[[0, 2, 2], [1, 3]] For more flexibility, you can supply :py:class:`~xarray.DataArray` objects as indexers. Dimensions on resultant arrays are given by the ordered union of the indexers' dimensions: .. jupyter-execute:: ind_x = xr.DataArray([0, 1], dims=["x"]) ind_y = xr.DataArray([0, 1], dims=["y"]) da[ind_x, ind_y] # orthogonal indexing Slices or sequences/arrays without named dimensions are treated as if they have the same dimension as the one being indexed along: .. jupyter-execute:: # Because [0, 1] is used to index along dimension 'x', # it is assumed to have dimension 'x' da[[0, 1], ind_x] Furthermore, you can use multi-dimensional :py:class:`~xarray.DataArray` objects as indexers, where the dimensions of the resulting array are also determined by the indexers' dimensions: ..
jupyter-execute:: ind = xr.DataArray([[0, 1], [0, 1]], dims=["a", "b"]) da[ind] Similar to how `NumPy's advanced indexing`_ works, vectorized indexing for xarray is based on our :ref:`broadcasting rules `. See :ref:`indexing.rules` for the complete specification. .. _NumPy's advanced indexing: https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing Vectorized indexing also works with ``isel``, ``loc``, and ``sel``: .. jupyter-execute:: ind = xr.DataArray([[0, 1], [0, 1]], dims=["a", "b"]) da.isel(y=ind) # same as da[:, ind] .. jupyter-execute:: ind = xr.DataArray([["a", "b"], ["b", "a"]], dims=["a", "b"]) da.loc[:, ind] # same as da.sel(y=ind) These methods may also be applied to ``Dataset`` objects: .. jupyter-execute:: ds = da.to_dataset(name="bar") ds.isel(x=xr.DataArray([0, 1, 2], dims=["points"])) Vectorized indexing may be used to extract information from the nearest grid cells of interest, for example, the nearest climate model grid cells to a collection of specified weather station latitudes and longitudes. To trigger vectorized indexing behavior you will need to provide the selection dimensions with a new shared output dimension name. In the example below, the selections of the closest latitude and longitude are renamed to an output dimension named "points": .. jupyter-execute:: ds = xr.tutorial.open_dataset("air_temperature") # Define target latitude and longitude (where weather stations might be) target_lon = xr.DataArray([200, 201, 202, 205], dims="points") target_lat = xr.DataArray([31, 41, 42, 42], dims="points") # Retrieve data at the grid cells nearest to the target latitudes and longitudes da = ds["air"].sel(lon=target_lon, lat=target_lat, method="nearest") da .. tip:: If you are lazily loading your data from disk, not every form of vectorized indexing is supported (or if supported, may not be supported efficiently). You may find increased performance by loading your data into memory first, e.g., with :py:meth:`~xarray.Dataset.load`. .. note:: If an indexer is a :py:class:`~xarray.DataArray`, its coordinates should not conflict with the selected subpart of the target array (except for the explicitly indexed dimensions with ``.loc``/``.sel``). Otherwise, ``IndexError`` will be raised. .. _assigning_values: Assigning values with indexing ------------------------------ To select and assign values to a portion of a :py:class:`~xarray.DataArray`, you can use indexing with ``.loc``: .. jupyter-execute:: ds = xr.tutorial.open_dataset("air_temperature") # add an empty 2D dataarray ds["empty"] = xr.full_like(ds.air.mean("time"), fill_value=0) # modify one grid point using loc() ds["empty"].loc[dict(lon=260, lat=30)] = 100 # modify a 2D region using loc() lc = ds.coords["lon"] la = ds.coords["lat"] ds["empty"].loc[ dict(lon=lc[(lc > 220) & (lc < 260)], lat=la[(la > 20) & (la < 60)]) ] = 100 or :py:func:`~xarray.where`: .. jupyter-execute:: # modify one grid point using xr.where() ds["empty"] = xr.where( (ds.coords["lat"] == 20) & (ds.coords["lon"] == 260), 100, ds["empty"] ) # or modify a 2D region using xr.where() mask = ( (ds.coords["lat"] > 20) & (ds.coords["lat"] < 60) & (ds.coords["lon"] > 220) & (ds.coords["lon"] < 260) ) ds["empty"] = xr.where(mask, 100, ds["empty"]) Vectorized indexing can also be used to assign values to xarray objects. .. jupyter-execute:: da = xr.DataArray( np.arange(12).reshape((3, 4)), dims=["x", "y"], coords={"x": [0, 1, 2], "y": ["a", "b", "c", "d"]}, ) da .. jupyter-execute:: da[0] = -1 # assignment with broadcasting da ..
jupyter-execute:: ind_x = xr.DataArray([0, 1], dims=["x"]) ind_y = xr.DataArray([0, 1], dims=["y"]) da[ind_x, ind_y] = -2 # assign -2 to (ix, iy) = (0, 0) and (1, 1) da .. jupyter-execute:: da[ind_x, ind_y] += 100 # increment is also possible da Like ``numpy.ndarray``, value assignment sometimes works differently from what one may expect. .. jupyter-execute:: da = xr.DataArray([0, 1, 2, 3], dims=["x"]) ind = xr.DataArray([0, 0, 0], dims=["x"]) da[ind] -= 1 da Here the 0th element is decremented only once. This is because ``v[0] = v[0] - 1`` is called three times, rather than ``v[0] = v[0] - 1 - 1 - 1``. See `Assigning values to indexed arrays`__ for the details. __ https://numpy.org/doc/stable/user/basics.indexing.html#assigning-values-to-indexed-arrays .. note:: Dask array does not support value assignment (see :ref:`dask` for the details). .. note:: Coordinates in both the left- and right-hand-side arrays should not conflict with each other. Otherwise, ``IndexError`` will be raised. .. warning:: Do not try to assign values when using any of the indexing methods ``isel`` or ``sel``:: # DO NOT do this da.isel(space=0) = 0 Instead, values can be assigned using dictionary-based indexing:: da[dict(space=0)] = 0 Assigning values with chained indexing using ``.sel`` or ``.isel`` fails silently. .. jupyter-execute:: da = xr.DataArray([0, 1, 2, 3], dims=["x"]) # DO NOT do this da.isel(x=[0, 1, 2])[1] = -1 da You can also assign values to all variables of a :py:class:`Dataset` at once: .. jupyter-execute:: :stderr: ds_org = xr.tutorial.open_dataset("eraint_uvz").isel( latitude=slice(56, 59), longitude=slice(255, 258), level=0 ) # set all values to 0 ds = xr.zeros_like(ds_org) ds .. jupyter-execute:: # by integer ds[dict(latitude=2, longitude=2)] = 1 ds["u"] .. jupyter-execute:: ds["v"] .. jupyter-execute:: # by label ds.loc[dict(latitude=47.25, longitude=[11.25, 12])] = 100 ds["u"] .. jupyter-execute:: # dataset as new values new_dat = ds_org.loc[dict(latitude=48, longitude=[11.25, 12])] new_dat .. jupyter-execute:: ds.loc[dict(latitude=47.25, longitude=[11.25, 12])] = new_dat ds["u"] The dimensions can differ between the variables in the dataset, but all variables need to have at least the dimensions specified in the indexer dictionary. The new values must be either a scalar, a :py:class:`DataArray` or a :py:class:`Dataset` itself that contains all variables that also appear in the dataset to be modified. .. _more_advanced_indexing: More advanced indexing ---------------------- The use of :py:class:`~xarray.DataArray` objects as indexers enables very flexible indexing. The following is an example of the pointwise indexing: .. jupyter-execute:: da = xr.DataArray(np.arange(56).reshape((7, 8)), dims=["x", "y"]) da .. jupyter-execute:: da.isel(x=xr.DataArray([0, 1, 6], dims="z"), y=xr.DataArray([0, 1, 0], dims="z")) where three elements at ``(ix, iy) = ((0, 0), (1, 1), (6, 0))`` are selected and mapped along a new dimension ``z``. If you want to add a coordinate to the new dimension ``z``, you can supply a :py:class:`~xarray.DataArray` with a coordinate, .. jupyter-execute:: da.isel( x=xr.DataArray([0, 1, 6], dims="z", coords={"z": ["a", "b", "c"]}), y=xr.DataArray([0, 1, 0], dims="z"), ) Analogously, label-based pointwise-indexing is also possible with the ``.sel`` method: ..
jupyter-execute:: da = xr.DataArray( np.random.rand(4, 3), [ ("time", pd.date_range("2000-01-01", periods=4)), ("space", ["IA", "IL", "IN"]), ], ) times = xr.DataArray( pd.to_datetime(["2000-01-03", "2000-01-02", "2000-01-01"]), dims="new_time" ) da.sel(space=xr.DataArray(["IA", "IL", "IN"], dims=["new_time"]), time=times) .. _align and reindex: Align and reindex ----------------- Xarray's ``reindex``, ``reindex_like`` and ``align`` impose a ``DataArray`` or ``Dataset`` onto a new set of coordinates corresponding to dimensions. The original values are subset to the index labels still found in the new labels, and values corresponding to new labels not found in the original object are in-filled with ``NaN``. Xarray operations that combine multiple objects generally automatically align their arguments to share the same indexes. However, manual alignment can be useful for greater control and for increased performance. To reindex a particular dimension, use :py:meth:`~xarray.DataArray.reindex`: .. jupyter-execute:: da.reindex(space=["IA", "CA"]) The :py:meth:`~xarray.DataArray.reindex_like` method is a useful shortcut. To demonstrate, we will make a subset DataArray with new values: .. jupyter-execute:: foo = da.rename("foo") baz = (10 * da[:2, :2]).rename("baz") baz Reindexing ``foo`` with ``baz`` selects out the first two values along each dimension: .. jupyter-execute:: foo.reindex_like(baz) The opposite operation asks us to reindex to a larger shape, so we fill in the missing values with ``NaN``: .. jupyter-execute:: baz.reindex_like(foo) The :py:func:`~xarray.align` function lets us perform more flexible database-like ``'inner'``, ``'outer'``, ``'left'`` and ``'right'`` joins: .. jupyter-execute:: xr.align(foo, baz, join="inner") .. jupyter-execute:: xr.align(foo, baz, join="outer") Both ``reindex_like`` and ``align`` work interchangeably between :py:class:`~xarray.DataArray` and :py:class:`~xarray.Dataset` objects, and with any number of matching dimension names: .. jupyter-execute:: ds .. jupyter-execute:: ds.reindex_like(baz) .. jupyter-execute:: other = xr.DataArray(["a", "b", "c"], dims="other") # this is a no-op, because there are no shared dimension names ds.reindex_like(other) .. _indexing.missing_coordinates: Missing coordinate labels ------------------------- Coordinate labels for each dimension are optional (as of xarray v0.9). Label based indexing with ``.sel`` and ``.loc`` uses standard positional, integer-based indexing as a fallback for dimensions without a coordinate label: .. jupyter-execute:: da = xr.DataArray([1, 2, 3], dims="x") da.sel(x=[0, -1]) Alignment between xarray objects where one or both do not have coordinate labels succeeds only if all dimensions of the same name have the same length. Otherwise, it raises an informative error: .. jupyter-execute:: :raises: xr.align(da, da[:2]) Underlying Indexes ------------------ Xarray uses the :py:class:`pandas.Index` internally to perform indexing operations. If you need to access the underlying indexes, they are available through the :py:attr:`~xarray.DataArray.indexes` attribute. .. jupyter-execute:: da = xr.DataArray( np.random.rand(4, 3), [ ("time", pd.date_range("2000-01-01", periods=4)), ("space", ["IA", "IL", "IN"]), ], ) da .. jupyter-execute:: da.indexes .. jupyter-execute:: da.indexes["time"] Use :py:meth:`~xarray.DataArray.get_index` to get an index for a dimension, falling back to a default :py:class:`pandas.RangeIndex` if it has no coordinate labels: .. 
.. jupyter-execute::

    da = xr.DataArray([1, 2, 3], dims="x")
    da

.. jupyter-execute::

    da.get_index("x")

.. _copies_vs_views:

Copies vs. Views
----------------

Whether array indexing returns a view or a copy of the underlying data depends on the nature of the labels.

For positional (integer) indexing, xarray follows the same `rules`_ as NumPy:

* Positional indexing with only integers and slices returns a view.
* Positional indexing with arrays or lists returns a copy.

The rules for label based indexing are more complex:

* Label-based indexing with only slices returns a view.
* Label-based indexing with arrays returns a copy.
* Label-based indexing with scalars returns a view or a copy, depending on whether the corresponding positional indexer can be represented as an integer or a slice object. The exact rules are determined by pandas.

Whether data is a copy or a view is more predictable in xarray than in pandas, so unlike pandas, xarray does not produce `SettingWithCopy warnings`_. However, you should still avoid assignment with chained indexing.

Note that other operations (such as :py:attr:`~xarray.DataArray.values`) may also return views rather than copies.

.. _SettingWithCopy warnings: https://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
.. _rules: https://numpy.org/doc/stable/user/basics.copies.html

.. _multi-level indexing:

Multi-level indexing
--------------------

Just like pandas, advanced indexing on multi-level indexes is possible with ``loc`` and ``sel``. You can slice a multi-index by providing multiple indexers, i.e., a tuple of slices, labels, lists of labels, or any selector allowed by pandas:

.. jupyter-execute::

    midx = pd.MultiIndex.from_product([list("abc"), [0, 1]], names=("one", "two"))
    mda = xr.DataArray(np.random.rand(6, 3), [("x", midx), ("y", range(3))])
    mda

.. jupyter-execute::

    mda.sel(x=(list("ab"), [0]))

You can also select multiple elements by providing a list of labels or tuples, or a slice of tuples:

.. jupyter-execute::

    mda.sel(x=[("a", 0), ("b", 1)])

Additionally, xarray supports dictionaries:

.. jupyter-execute::

    mda.sel(x={"one": "a", "two": 0})

For convenience, ``sel`` also accepts multi-index levels directly as keyword arguments:

.. jupyter-execute::

    mda.sel(one="a", two=0)

Note that when using ``sel`` it is not possible to mix a dimension indexer with level indexers for that dimension (e.g., ``mda.sel(x={'one': 'a'}, two=0)`` will raise a ``ValueError``).

Like pandas, xarray handles partial selection on a multi-index (level drop). As shown below, it also renames the dimension / coordinate when the multi-index is reduced to a single index.

.. jupyter-execute::

    mda.loc[{"one": "a"}, ...]

Unlike pandas, xarray does not guess whether you provide index levels or dimensions when using ``loc`` in some ambiguous cases. For example, for ``mda.loc[{'one': 'a', 'two': 0}]`` and ``mda.loc['a', 0]`` xarray always interprets ('one', 'two') and ('a', 0) as the names and labels of the 1st and 2nd dimension, respectively. You must specify all dimensions or use the ellipsis in the ``loc`` specifier, e.g. in the example above, ``mda.loc[{'one': 'a', 'two': 0}, :]`` or ``mda.loc[('a', 0), ...]``.

.. _indexing.rules:

Indexing rules
--------------

Here we describe the full rules xarray uses for vectorized indexing. Note that this is for the purposes of explanation: for the sake of efficiency and to support various backends, the actual implementation is different.

0. (Only for label based indexing.)
   Look up positional indexes along each dimension from the corresponding :py:class:`pandas.Index`.
1. A full slice object ``:`` is inserted for each dimension without an indexer.
2. ``slice`` objects are converted into arrays, given by ``np.arange(*slice.indices(...))``.
3. Assume dimension names for array indexers without dimensions, such as ``np.ndarray`` and ``list``, from the dimensions to be indexed along. For example, ``v.isel(x=[0, 1])`` is understood as ``v.isel(x=xr.DataArray([0, 1], dims=['x']))``.
4. For each variable in a ``Dataset`` or ``DataArray`` (the array and its coordinates):

   a. Broadcast all relevant indexers based on their dimension names (see :ref:`compute.broadcasting` for full details).
   b. Index the underlying array by the broadcast indexers, using NumPy's advanced indexing rules.

5. If any indexer DataArray has coordinates and no coordinate with the same name exists, attach them to the indexed object.

.. note::

    Only 1-dimensional boolean arrays can be used as indexers.

.. _interp:

Interpolating data
==================

.. jupyter-execute::
    :hide-code:

    import numpy as np
    import pandas as pd
    import xarray as xr
    import matplotlib.pyplot as plt

    np.random.seed(123456)

Xarray offers flexible interpolation routines, which have a similar interface to our :ref:`indexing`.

.. note::

    ``interp`` requires ``scipy`` to be installed.

Scalar and 1-dimensional interpolation
--------------------------------------

Interpolating a :py:class:`~xarray.DataArray` works mostly like labeled indexing of a :py:class:`~xarray.DataArray`,

.. jupyter-execute::

    da = xr.DataArray(
        np.sin(0.3 * np.arange(12).reshape(4, 3)),
        [("time", np.arange(4)), ("space", [0.1, 0.2, 0.3])],
    )

    # label lookup
    da.sel(time=3)

.. jupyter-execute::

    # interpolation
    da.interp(time=2.5)

Similar to the indexing, :py:meth:`~xarray.DataArray.interp` also accepts an array-like, which gives the interpolated result as an array.

.. jupyter-execute::

    # label lookup
    da.sel(time=[2, 3])

.. jupyter-execute::

    # interpolation
    da.interp(time=[2.5, 3.5])

To interpolate data with a :py:class:`numpy.datetime64` coordinate you can pass a string.

.. jupyter-execute::

    da_dt64 = xr.DataArray(
        [1, 3], [("time", pd.date_range("1/1/2000", "1/3/2000", periods=2))]
    )
    da_dt64.interp(time="2000-01-02")

The interpolated data can be merged into the original :py:class:`~xarray.DataArray` by specifying the time periods required.

.. jupyter-execute::

    da_dt64.interp(time=pd.date_range("1/1/2000", "1/3/2000", periods=3))

Interpolation of data indexed by a :py:class:`~xarray.CFTimeIndex` is also allowed. See :ref:`CFTimeIndex` for examples.

.. note::

    Currently, our interpolation only works for regular grids. Therefore, similarly to :py:meth:`~xarray.DataArray.sel`, only 1D coordinates along a dimension can be used as the original coordinate to be interpolated.

Multi-dimensional Interpolation
-------------------------------

Like :py:meth:`~xarray.DataArray.sel`, :py:meth:`~xarray.DataArray.interp` accepts multiple coordinates. In this case, multidimensional interpolation is carried out.

.. jupyter-execute::

    # label lookup
    da.sel(time=2, space=0.1)

.. jupyter-execute::

    # interpolation
    da.interp(time=2.5, space=0.15)

Array-like coordinates are also accepted:

.. jupyter-execute::

    # label lookup
    da.sel(time=[2, 3], space=[0.1, 0.2])

.. jupyter-execute::

    # interpolation
    da.interp(time=[1.5, 2.5], space=[0.15, 0.25])

The :py:meth:`~xarray.DataArray.interp_like` method is a useful shortcut.
This method interpolates an xarray object onto the coordinates of another xarray object. For example, if we want to compute the difference between two :py:class:`~xarray.DataArray` objects (``da`` and ``other``) which are on slightly different coordinates,

.. jupyter-execute::

    other = xr.DataArray(
        np.sin(0.4 * np.arange(9).reshape(3, 3)),
        [("time", [0.9, 1.9, 2.9]), ("space", [0.15, 0.25, 0.35])],
    )

it might be a good idea to first interpolate ``da`` so that it lies on the same coordinates as ``other``, and then subtract it. :py:meth:`~xarray.DataArray.interp_like` can be used for such a case,

.. jupyter-execute::

    # interpolate da along other's coordinates
    interpolated = da.interp_like(other)
    interpolated

It is now possible to safely compute the difference ``other - interpolated``.

Interpolation methods
---------------------

We use either :py:class:`scipy.interpolate.interp1d` or special interpolants from :py:mod:`scipy.interpolate` for 1-dimensional interpolation (see :py:meth:`~xarray.Dataset.interp`). For multi-dimensional interpolation, an attempt is first made to decompose the interpolation into a series of 1-dimensional interpolations, in which case the relevant 1-dimensional interpolator is used. If a decomposition cannot be made (e.g. with advanced interpolation), :py:func:`scipy.interpolate.interpn` is used.

The interpolation method can be specified by the optional ``method`` argument.

.. jupyter-execute::

    da = xr.DataArray(
        np.sin(np.linspace(0, 2 * np.pi, 10)),
        dims="x",
        coords={"x": np.linspace(0, 1, 10)},
    )

    da.plot.line("o", label="original")
    da.interp(x=np.linspace(0, 1, 100)).plot.line(label="linear (default)")
    da.interp(x=np.linspace(0, 1, 100), method="cubic").plot.line(label="cubic")
    plt.legend();

Additional keyword arguments can be passed to scipy's functions.

.. jupyter-execute::

    # fill 0 for the outside of the original coordinates.
    da.interp(x=np.linspace(-0.5, 1.5, 10), kwargs={"fill_value": 0.0})

.. jupyter-execute::

    # 1-dimensional extrapolation
    da.interp(x=np.linspace(-0.5, 1.5, 10), kwargs={"fill_value": "extrapolate"})

.. jupyter-execute::

    # multi-dimensional extrapolation
    da = xr.DataArray(
        np.sin(0.3 * np.arange(12).reshape(4, 3)),
        [("time", np.arange(4)), ("space", [0.1, 0.2, 0.3])],
    )

    da.interp(
        time=4, space=np.linspace(-0.1, 0.5, 10), kwargs={"fill_value": "extrapolate"}
    )

Advanced Interpolation
----------------------

:py:meth:`~xarray.DataArray.interp` accepts :py:class:`~xarray.DataArray` objects similar to :py:meth:`~xarray.DataArray.sel`, which enables more advanced interpolation. Based on the dimension of the new coordinate passed to :py:meth:`~xarray.DataArray.interp`, the dimensions of the result are determined. For example, if you want to interpolate a two dimensional array along a particular dimension, as illustrated below, you can pass two 1-dimensional :py:class:`~xarray.DataArray` objects with a common dimension as new coordinates.

.. image:: ../_static/advanced_selection_interpolation.svg
    :height: 200px
    :width: 400 px
    :alt: advanced indexing and interpolation
    :align: center

For example:

.. jupyter-execute::

    da = xr.DataArray(
        np.sin(0.3 * np.arange(20).reshape(5, 4)),
        [("x", np.arange(5)), ("y", [0.1, 0.2, 0.3, 0.4])],
    )

    # advanced indexing
    x = xr.DataArray([0, 2, 4], dims="z")
    y = xr.DataArray([0.1, 0.2, 0.3], dims="z")
    da.sel(x=x, y=y)
.. jupyter-execute::

    # advanced interpolation, without extrapolation
    x = xr.DataArray([0.5, 1.5, 2.5, 3.5], dims="z")
    y = xr.DataArray([0.15, 0.25, 0.35, 0.45], dims="z")
    da.interp(x=x, y=y)

where values on the original coordinates ``(x, y) = ((0.5, 0.15), (1.5, 0.25), (2.5, 0.35), (3.5, 0.45))`` are obtained by the 2-dimensional interpolation and mapped along a new dimension ``z``. Since no keyword arguments are passed to the interpolation routine, no extrapolation is performed, resulting in a ``nan`` value.

If you want to add a coordinate to the new dimension ``z``, you can supply :py:class:`~xarray.DataArray` objects with a coordinate. Extrapolation can be achieved by passing additional arguments to SciPy's ``interpn`` function,

.. jupyter-execute::

    x = xr.DataArray([0.5, 1.5, 2.5, 3.5], dims="z", coords={"z": ["a", "b", "c", "d"]})
    y = xr.DataArray(
        [0.15, 0.25, 0.35, 0.45], dims="z", coords={"z": ["a", "b", "c", "d"]}
    )
    da.interp(x=x, y=y, kwargs={"fill_value": None})

For the details of the advanced indexing, see :ref:`more advanced indexing <more_advanced_indexing>`.

Interpolating arrays with NaN
-----------------------------

Our :py:meth:`~xarray.DataArray.interp` works with arrays with NaN the same way that :py:class:`scipy.interpolate.interp1d` and :py:func:`scipy.interpolate.interpn` do. ``linear`` and ``nearest`` methods return arrays including NaN, while other methods such as ``cubic`` or ``quadratic`` return all NaN arrays.

.. jupyter-execute::

    da = xr.DataArray([0, 2, np.nan, 3, 3.25], dims="x", coords={"x": range(5)})
    da.interp(x=[0.5, 1.5, 2.5])

.. jupyter-execute::

    da.interp(x=[0.5, 1.5, 2.5], method="cubic")

To avoid this, you can drop the NaN values with :py:meth:`~xarray.DataArray.dropna`, and then make the interpolation

.. jupyter-execute::

    dropped = da.dropna("x")
    dropped

.. jupyter-execute::

    dropped.interp(x=[0.5, 1.5, 2.5], method="cubic")

If NaNs are distributed randomly in your multidimensional array, dropping all the columns containing more than one NaN by :py:meth:`~xarray.DataArray.dropna` may lose a significant amount of information. In such a case, you can fill the NaNs with :py:meth:`~xarray.DataArray.interpolate_na`, which is similar to :py:meth:`pandas.Series.interpolate`.

.. jupyter-execute::

    filled = da.interpolate_na(dim="x")
    filled

This fills NaN by interpolating along the specified dimension. After filling NaNs, you can interpolate:

.. jupyter-execute::

    filled.interp(x=[0.5, 1.5, 2.5], method="cubic")

For the details of :py:meth:`~xarray.DataArray.interpolate_na`, see :ref:`Missing values <missing_values>`.

Example
-------

Let's see how :py:meth:`~xarray.DataArray.interp` works on real data.

.. jupyter-execute::

    # Raw data
    ds = xr.tutorial.open_dataset("air_temperature").isel(time=0)
    fig, axes = plt.subplots(ncols=2, figsize=(10, 4))
    ds.air.plot(ax=axes[0])
    axes[0].set_title("Raw data")

    # Interpolated data
    new_lon = np.linspace(ds.lon[0].item(), ds.lon[-1].item(), ds.sizes["lon"] * 4)
    new_lat = np.linspace(ds.lat[0].item(), ds.lat[-1].item(), ds.sizes["lat"] * 4)
    dsi = ds.interp(lat=new_lat, lon=new_lon)
    dsi.air.plot(ax=axes[1])
    axes[1].set_title("Interpolated data");

Our advanced interpolation can be used to remap the data to the new coordinate. Consider the new coordinates ``x`` and ``z`` on the two dimensional plane. The remapping can be done as follows:
.. jupyter-execute::

    # new coordinate
    x = np.linspace(240, 300, 100)
    z = np.linspace(20, 70, 100)

    # relation between new and original coordinates
    lat = xr.DataArray(z, dims=["z"], coords={"z": z})
    lon = xr.DataArray(
        (x[:, np.newaxis] - 270) / np.cos(z * np.pi / 180) + 270,
        dims=["x", "z"],
        coords={"x": x, "z": z},
    )

    fig, axes = plt.subplots(ncols=2, figsize=(10, 4))
    ds.air.plot(ax=axes[0])

    # draw the new coordinate on the original coordinates.
    for idx in [0, 33, 66, 99]:
        axes[0].plot(lon.isel(x=idx), lat, "--k")
    for idx in [0, 33, 66, 99]:
        axes[0].plot(*xr.broadcast(lon.isel(z=idx), lat.isel(z=idx)), "--k")
    axes[0].set_title("Raw data")

    dsi = ds.interp(lon=lon, lat=lat)
    dsi.air.plot(ax=axes[1])
    axes[1].set_title("Remapped data");

.. currentmodule:: xarray

.. _io:

Reading and writing files
=========================

Xarray supports direct serialization and IO to several file formats, from simple :ref:`io.pickle` files to the more flexible :ref:`io.netcdf` format (recommended).

.. jupyter-execute::
    :hide-code:

    import os

    import iris
    import ncdata.iris_xarray
    import numpy as np
    import pandas as pd
    import xarray as xr

    np.random.seed(123456)

You can read different types of files in ``xr.open_dataset`` by specifying the engine to be used:

.. code:: python

    xr.open_dataset("example.nc", engine="netcdf4")

The "engine" provides a set of instructions that tells xarray how to read the data and pack them into a ``Dataset`` (or ``DataArray``). These instructions are stored in an underlying "backend". Xarray comes with several backends that cover many common data formats. Many more backends are available via external libraries, or you can `write your own <https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html>`_.

This diagram aims to help you determine - based on the format of the file you'd like to read - which type of backend you're using and how to use it. Text and boxes are clickable for more information. Following the diagram is detailed information on many popular backends. You can learn more about using and developing backends in the `Xarray tutorial JupyterBook <https://tutorial.xarray.dev>`_.

.. _comment: mermaid Flowchart "link" text gets secondary color background, SVG icon fill gets primary color

.. mermaid::
    :config: {"theme":"base","themeVariables":{"fontSize":"20px","primaryColor":"#fff","primaryTextColor":"#fff","primaryBorderColor":"#59c7d6","lineColor":"#e28126","secondaryColor":"#767985"}}
    :alt: Flowchart illustrating how to choose the right backend engine to read your data

    flowchart LR
        built-in-eng["`**Is your data stored in one of these formats?**
        - netCDF4
        - netCDF3
        - Zarr
        - DODS/OPeNDAP
        - HDF5
        `"]
        built-in("`**You're in luck!** Xarray bundles a backend to automatically read these formats. Open data using xr.open_dataset(). We recommend explicitly setting engine='xxxx' for faster loading.`")
        installed-eng["""One of these formats?
        - GRIB
        - TileDB
        - GeoTIFF, JPEG-2000, etc. (via GDAL)
        - Sentinel-1 SAFE
        """]
        installed("""Install the linked backend library and use it with xr.open_dataset(file, engine='xxxx').""")
        other["`**Options:**
        - Look around to see if someone has created an Xarray backend for your format!
        - Create your own backend
        - Convert your data to a supported format
        `"]

        built-in-eng -->|Yes| built-in
        built-in-eng -->|No| installed-eng
        installed-eng -->|Yes| installed
        installed-eng -->|No| other

        click built-in-eng "https://docs.xarray.dev/en/stable/get-help/faq.html#how-do-i-open-format-x-file-as-an-xarray-dataset"

        classDef quesNodefmt font-size:12pt,fill:#0e4666,stroke:#59c7d6,stroke-width:3
        class built-in-eng,installed-eng quesNodefmt

        classDef ansNodefmt font-size:12pt,fill:#4a4a4a,stroke:#17afb4,stroke-width:3
        class built-in,installed,other ansNodefmt

        linkStyle default font-size:18pt,stroke-width:4

.. _io.netcdf:

netCDF
------

The recommended way to store xarray data structures is `netCDF`__, which is a binary file format for self-describing datasets that originated in the geosciences. Xarray is based on the netCDF data model, so netCDF files on disk directly correspond to :py:class:`Dataset` objects (more accurately, a group in a netCDF file directly corresponds to a :py:class:`Dataset` object; see :ref:`io.netcdf_groups` for more).

NetCDF is supported on almost all platforms, and parsers exist for the vast majority of scientific programming languages. Recent versions of netCDF are based on the even more widely used HDF5 file-format.

__ https://www.unidata.ucar.edu/software/netcdf/

.. tip::

    If you aren't familiar with this data format, the `netCDF FAQ`_ is a good place to start.

.. _netCDF FAQ: https://www.unidata.ucar.edu/software/netcdf/docs/faq.html#What-Is-netCDF

Reading and writing netCDF files with xarray requires scipy, h5netcdf, or the `netCDF4-Python`__ library to be installed. SciPy only supports reading and writing of netCDF V3 files.

__ https://github.com/Unidata/netcdf4-python

We can save a Dataset to disk using the :py:meth:`Dataset.to_netcdf` method:

.. jupyter-execute::

    ds = xr.Dataset(
        {"foo": (("x", "y"), np.random.rand(4, 5))},
        coords={
            "x": [10, 20, 30, 40],
            "y": pd.date_range("2000-01-01", periods=5),
            "z": ("x", list("abcd")),
        },
    )

    ds.to_netcdf("saved_on_disk.nc")

By default, the file is saved as netCDF4 (assuming netCDF4-Python is installed). You can control the format and engine used to write the file with the ``format`` and ``engine`` arguments.

.. tip::

    Using the `h5netcdf <https://github.com/h5netcdf/h5netcdf>`_ package by passing ``engine='h5netcdf'`` to :py:meth:`open_dataset` can sometimes be quicker than the default ``engine='netcdf4'`` that uses the `netCDF4 <https://github.com/Unidata/netcdf4-python>`_ package.

We can load netCDF files to create a new Dataset using :py:func:`open_dataset`:

.. jupyter-execute::

    ds_disk = xr.open_dataset("saved_on_disk.nc")
    ds_disk

.. jupyter-execute::
    :hide-code:

    # Close "saved_on_disk.nc", but retain the file until after closing or deleting other
    # datasets that will refer to it.
    ds_disk.close()

Similarly, a DataArray can be saved to disk using the :py:meth:`DataArray.to_netcdf` method, and loaded from disk using the :py:func:`open_dataarray` function. As netCDF files correspond to :py:class:`Dataset` objects, these functions internally convert the ``DataArray`` to a ``Dataset`` before saving, and then convert back when loading, ensuring that the ``DataArray`` that is loaded is always exactly the same as the one that was saved.

A dataset can also be loaded or written to a specific group within a netCDF file. To load from a group, pass a ``group`` keyword argument to the ``open_dataset`` function. The group can be specified as a path-like string, e.g., to access subgroup 'bar' within group 'foo' pass '/foo/bar' as the ``group`` argument.
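For example, a minimal sketch (the file name and group path here are hypothetical, assuming a file that contains a subgroup ``/foo/bar``):

.. code:: python

    # hypothetical file and group path, for illustration only
    ds_bar = xr.open_dataset("saved_groups.nc", group="/foo/bar")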
When writing multiple groups in one file, pass ``mode='a'`` to ``to_netcdf`` to ensure that each call does not delete the file.

.. tip::

    It is recommended to use :py:class:`~xarray.DataTree` to represent hierarchical data, and to use the :py:meth:`xarray.DataTree.to_netcdf` method when writing hierarchical data to a netCDF file.

Data is *always* loaded lazily from netCDF files. You can manipulate, slice and subset Dataset and DataArray objects, and no array values are loaded into memory until you try to perform some sort of actual computation. For an example of how these lazy arrays work, see the OPeNDAP section below.

There may be minor differences in the :py:class:`Dataset` object returned when reading a NetCDF file with different engines.

It is important to note that when you modify values of a Dataset, even one linked to files on disk, only the in-memory copy you are manipulating in xarray is modified: the original file on disk is never touched.

.. tip::

    Xarray's lazy loading of remote or on-disk datasets is often but not always desirable. Before performing computationally intense operations, it is often a good idea to load a Dataset (or DataArray) entirely into memory by invoking the :py:meth:`Dataset.load` method.

Datasets have a :py:meth:`Dataset.close` method to close the associated netCDF file. However, it's often cleaner to use a ``with`` statement:

.. jupyter-execute::

    # this automatically closes the dataset after use
    with xr.open_dataset("saved_on_disk.nc") as ds:
        print(ds.keys())

Although xarray provides reasonable support for incremental reads of files on disk, it does not support incremental writes, which can be a useful strategy for dealing with datasets too big to fit into memory. Instead, xarray integrates with dask.array (see :ref:`dask`), which provides a fully featured engine for streaming computation.

It is possible to append or overwrite netCDF variables using the ``mode='a'`` argument. When using this option, all variables in the dataset will be written to the original netCDF file, regardless of whether they already exist in the original dataset.

.. _io.netcdf_groups:

Groups
~~~~~~

Whilst netCDF groups can only be loaded individually as ``Dataset`` objects, a whole file of many nested groups can be loaded as a single :py:class:`xarray.DataTree` object. To open a whole netCDF file as a tree of groups use the :py:func:`xarray.open_datatree` function. To save a DataTree object as a netCDF file containing many groups, use the :py:meth:`xarray.DataTree.to_netcdf` method.

.. _netcdf.root_group.note:

.. note::

    Due to file format specifications the on-disk root group name is always ``"/"``, overriding any given ``DataTree`` root node name.

.. _netcdf.group.warning:

.. warning::

    ``DataTree`` objects do not follow the exact same data model as netCDF files, which means that perfect round-tripping is not always possible.

    In particular in the netCDF data model dimensions are entities that can exist regardless of whether any variable possesses them. This is in contrast to xarray's data model (and hence DataTree's data model), in which the dimensions of a (Dataset/Tree) object are simply the set of dimensions present across all variables in that dataset. This means that if a netCDF file contains dimensions but no variables which possess those dimensions, these dimensions will not be present when that file is opened as a DataTree object. Saving this DataTree object to file will therefore not preserve these "unused" dimensions.
.. _io.encoding:

Reading encoded data
~~~~~~~~~~~~~~~~~~~~

NetCDF files follow some conventions for encoding datetime arrays (as numbers with a "units" attribute) and for packing and unpacking data (as described by the "scale_factor" and "add_offset" attributes). If the argument ``decode_cf=True`` (default) is given to :py:func:`open_dataset`, xarray will attempt to automatically decode the values in the netCDF objects according to `CF conventions`_. Sometimes this will fail, for example, if a variable has an invalid "units" or "calendar" attribute. For these cases, you can turn this decoding off manually.

.. _CF conventions: https://cfconventions.org/

You can view this encoding information (among others) in the :py:attr:`DataArray.encoding` and :py:attr:`Dataset.encoding` attributes:

.. jupyter-execute::

    ds_disk["y"].encoding

.. jupyter-execute::

    ds_disk.encoding

Note that all operations that manipulate variables other than indexing will remove encoding information.

In some cases it is useful to intentionally reset a dataset's original encoding values. This can be done with either the :py:meth:`Dataset.drop_encoding` or :py:meth:`DataArray.drop_encoding` methods.

.. jupyter-execute::

    ds_no_encoding = ds_disk.drop_encoding()
    ds_no_encoding.encoding

.. _combining multiple files:

Reading multi-file datasets
...........................

NetCDF files are often encountered in collections, e.g., with different files corresponding to different model runs or one file per timestamp. Xarray can straightforwardly combine such files into a single Dataset by making use of :py:func:`concat`, :py:func:`merge`, :py:func:`combine_nested` and :py:func:`combine_by_coords`. For details on the difference between these functions see :ref:`combining data`.

Xarray includes support for manipulating datasets that don't fit into memory with dask_. If you have dask installed, you can open multiple files simultaneously in parallel using :py:func:`open_mfdataset`::

    xr.open_mfdataset('my/files/*.nc', parallel=True)

This function automatically concatenates and merges multiple files into a single xarray dataset. It is the recommended way to open multiple files with xarray. For more details on parallel reading, see :ref:`combining.multi`, :ref:`dask.io` and a `blog post`_ by Stephan Hoyer. :py:func:`open_mfdataset` takes many kwargs that allow you to control its behaviour (e.g. ``parallel``, ``combine``, ``compat``, ``join``, ``concat_dim``). See its docstring for more details.

.. note::

    A common use-case involves a dataset distributed across a large number of files with each file containing a large number of variables. Commonly, a few of these variables need to be concatenated along a dimension (say ``"time"``), while the rest are equal across the datasets (ignoring floating point differences). The following command with suitable modifications (such as ``parallel=True``) works well with such datasets::

        xr.open_mfdataset('my/files/*.nc', concat_dim="time", combine="nested",
                          data_vars='minimal', coords='minimal', compat='override')

    This command concatenates variables along the ``"time"`` dimension, but only those that already contain the ``"time"`` dimension (``data_vars='minimal', coords='minimal'``). Variables that lack the ``"time"`` dimension are taken from the first dataset (``compat='override'``).

.. _dask: https://www.dask.org
.. _blog post: https://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/

Sometimes multi-file datasets are not conveniently organized for easy use of :py:func:`open_mfdataset`. One can use the ``preprocess`` argument to provide a function that takes a dataset and returns a modified Dataset. :py:func:`open_mfdataset` will call ``preprocess`` on every dataset (corresponding to each file) prior to combining them.

If :py:func:`open_mfdataset` does not meet your needs, other approaches are possible. The general pattern for parallel reading of multiple files using dask, modifying those datasets and then combining into a single ``Dataset`` is::

    import dask

    import xarray as xr

    def modify(ds):
        # modify ds here
        return ds

    # this is basically what open_mfdataset does
    open_kwargs = dict(decode_cf=True, decode_times=False)
    open_tasks = [dask.delayed(xr.open_dataset)(f, **open_kwargs) for f in file_names]
    tasks = [dask.delayed(modify)(task) for task in open_tasks]
    # dask.compute returns a one-element tuple; unpack the list of xarray.Datasets
    datasets = dask.compute(tasks)[0]
    combined = xr.combine_nested(datasets)  # or some combination of concat, merge

As an example, here's how we could approximate ``MFDataset`` from the netCDF4 library::

    from glob import glob

    import xarray as xr

    def read_netcdfs(files, dim):
        # glob expands paths with * to a list of files, like the unix shell
        paths = sorted(glob(files))
        datasets = [xr.open_dataset(p) for p in paths]
        combined = xr.concat(datasets, dim)
        return combined

    combined = read_netcdfs('/all/my/files/*.nc', dim='time')

This function will work in many cases, but it's not very robust. First, it never closes files, which means it will fail if you need to load more than a few thousand files. Second, it assumes that you want all the data from each file and that it can all fit into memory. In many situations, you only need a small subset or an aggregated summary of the data from each file.

Here's a slightly more sophisticated example of how to remedy these deficiencies::

    def read_netcdfs(files, dim, transform_func=None):
        def process_one_path(path):
            # use a context manager, to ensure the file gets closed after use
            with xr.open_dataset(path) as ds:
                # transform_func should do some sort of selection or
                # aggregation
                if transform_func is not None:
                    ds = transform_func(ds)
                # load all data from the transformed dataset, to ensure we can
                # use it after closing each original file
                ds.load()
                return ds

        paths = sorted(glob(files))
        datasets = [process_one_path(p) for p in paths]
        combined = xr.concat(datasets, dim)
        return combined

    # here we suppose we only care about the combined mean of each file;
    # you might also use indexing operations like .sel to subset datasets
    combined = read_netcdfs('/all/my/files/*.nc', dim='time',
                            transform_func=lambda ds: ds.mean())

This pattern works well and is very robust. We've used similar code to process tens of thousands of files constituting 100s of GB of data.

.. _io.netcdf.writing_encoded:

Writing encoded data
~~~~~~~~~~~~~~~~~~~~

Conversely, you can customize how xarray writes netCDF files on disk by providing explicit encodings for each dataset variable. The ``encoding`` argument takes a dictionary with variable names as keys and variable specific encodings as values. These encodings are saved as attributes on the netCDF variables on disk, which allows xarray to faithfully read encoded data back into memory.
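As a minimal sketch (the file name is illustrative; ``zlib`` and ``complevel`` are netCDF4 encoding options discussed under chunk based compression below):

.. code:: python

    # illustrative only: write "foo" as compressed 32-bit floats
    ds.to_netcdf(
        "encoded.nc",
        encoding={"foo": {"dtype": "float32", "zlib": True, "complevel": 4}},
    )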
It is important to note that using encodings is entirely optional: if you do not supply any of these encoding options, xarray will write data to disk using a default encoding, or the options in the ``encoding`` attribute, if set. This works perfectly fine in most cases, but encoding can be useful for additional control, especially for enabling compression.

In the file on disk, these encodings are saved as attributes on each variable, which allow xarray and other CF-compliant tools for working with netCDF files to correctly read the data.

Scaling and type conversions
............................

These encoding options (based on `CF Conventions on packed data`_) work on any version of the netCDF file format:

- ``dtype``: Any valid NumPy dtype or string convertible to a dtype, e.g., ``'int16'`` or ``'float32'``. This controls the type of the data written on disk.
- ``_FillValue``: Values of ``NaN`` in xarray variables are remapped to this value when saved on disk. This is important when converting floating point with missing values to integers on disk, because ``NaN`` is not a valid value for integer dtypes. By default, variables with float types are attributed a ``_FillValue`` of ``NaN`` in the output file, unless explicitly disabled with an encoding ``{'_FillValue': None}``.
- ``scale_factor`` and ``add_offset``: Used to convert from encoded data on disk to the decoded data in memory, according to the formula ``decoded = scale_factor * encoded + add_offset``. Please note that ``scale_factor`` and ``add_offset`` must be of the same type and determine the type of the decoded data.

These parameters can be fruitfully combined to compress discretized data on disk. For example, to save the variable ``foo`` with a precision of 0.1 in 16-bit integers while converting ``NaN`` to ``-9999``, we would use ``encoding={'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue': -9999}}``. Compression and decompression with such discretization is extremely fast.

.. _CF Conventions on packed data: https://cfconventions.org/cf-conventions/cf-conventions.html#packed-data

.. _io.string-encoding:

String encoding
...............

Xarray can write unicode strings to netCDF files in two ways:

- As variable length strings. This is only supported on netCDF4 (HDF5) files.
- By encoding strings into bytes, and writing encoded bytes as a character array. The default encoding is UTF-8.

By default, we use variable length strings for compatible files and fall back to using encoded character arrays. Character arrays can be selected even for netCDF4 files by setting the ``dtype`` field in ``encoding`` to ``S1`` (corresponding to NumPy's single-character bytes dtype).

If character arrays are used:

- The string encoding that was used is stored on disk in the ``_Encoding`` attribute, which matches an ad-hoc convention adopted by the netCDF4-Python library. At the time of this writing (October 2017), a standard convention for indicating string encoding for character arrays in netCDF files was still under discussion. Technically, you can use `any string encoding recognized by Python <https://docs.python.org/3/library/codecs.html#standard-encodings>`_ if you feel the need to deviate from UTF-8, by setting the ``_Encoding`` field in ``encoding``. But `we don't recommend it <http://utf8everywhere.org/>`_.
- The character dimension name can be specified by the ``char_dim_name`` field of a variable's ``encoding``. If the name of the character dimension is not specified, the default is ``f'string{data.shape[-1]}'``.
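For instance, a minimal sketch (reusing the ``ds`` saved above, whose coordinate ``z`` holds strings; the file name is illustrative):

.. code:: python

    # illustrative only: force the string coordinate "z" to be written
    # as a fixed-width character array rather than variable length strings
    ds.to_netcdf("chars.nc", encoding={"z": {"dtype": "S1"}})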
When decoding character arrays from existing files, the ``char_dim_name`` is added to the variable's ``encoding`` so that the original dimension name is preserved if the variable is encoded again; the field can be edited by the user.

.. warning::

    Missing values in bytes or unicode string arrays (represented by ``NaN`` in xarray) are currently written to disk as empty strings ``''``. This means missing values will not be restored when data is loaded from disk. This behavior is likely to change in the future (:issue:`1647`). Unfortunately, explicitly setting a ``_FillValue`` for string arrays to handle missing values doesn't work yet either, though we also hope to fix this in the future.

Chunk based compression
.......................

``zlib``, ``complevel``, ``fletcher32``, ``contiguous`` and ``chunksizes`` can be used for enabling netCDF4/HDF5's chunk based compression, as described in the `documentation for createVariable`_ for netCDF4-Python. This only works for netCDF4 files and thus requires using ``format='netCDF4'`` and either ``engine='netcdf4'`` or ``engine='h5netcdf'``.

.. _documentation for createVariable: https://unidata.github.io/netcdf4-python/#netCDF4.Dataset.createVariable

Chunk based gzip compression can yield impressive space savings, especially for sparse data, but it comes with significant performance overhead. HDF5 libraries can only read complete chunks back into memory, and maximum decompression speed is in the range of 50-100 MB/s. Worse, HDF5's compression and decompression currently cannot be parallelized with dask. For these reasons, we recommend trying discretization based compression (described above) first.

Time units
..........

The ``units`` and ``calendar`` attributes control how xarray serializes ``datetime64`` and ``timedelta64`` arrays to datasets on disk as numeric values. The ``units`` encoding should be a string like ``'days since 1900-01-01'`` for ``datetime64`` data or a string like ``'days'`` for ``timedelta64`` data. ``calendar`` should be one of the calendar types supported by netCDF4-python: ``'standard'``, ``'gregorian'``, ``'proleptic_gregorian'``, ``'noleap'``, ``'365_day'``, ``'360_day'``, ``'julian'``, ``'all_leap'``, ``'366_day'``.

By default, xarray uses the ``'proleptic_gregorian'`` calendar and units of the smallest time difference between values, with a reference time of the first time value.

.. _io.coordinates:

Coordinates
...........

You can control the ``coordinates`` attribute written to disk by specifying ``DataArray.encoding["coordinates"]``. If not specified, xarray automatically sets ``DataArray.encoding["coordinates"]`` to a space-delimited list of names of coordinate variables that share dimensions with the ``DataArray`` being written. This allows perfect roundtripping of xarray datasets but may not be desirable. When an xarray ``Dataset`` contains non-dimensional coordinates that do not share dimensions with any of the variables, these coordinate variable names are saved under a "global" ``"coordinates"`` attribute. This is not CF-compliant but again facilitates roundtripping of xarray datasets.

Invalid netCDF files
~~~~~~~~~~~~~~~~~~~~

The library ``h5netcdf`` allows writing some dtypes that aren't allowed in netCDF4 (see the `h5netcdf <https://github.com/h5netcdf/h5netcdf>`_ documentation). This feature is available through :py:meth:`DataArray.to_netcdf` and :py:meth:`Dataset.to_netcdf` when used with ``engine="h5netcdf"`` and currently raises a warning unless ``invalid_netcdf=True`` is set.

.. warning::

    Note that this produces a file that is likely to be not readable by other netCDF libraries!
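A minimal sketch (assuming ``h5netcdf`` is installed; the complex-valued variable and file name are illustrative):

.. code:: python

    # complex values are not valid netCDF4, but h5netcdf can write them
    ds_cplx = xr.Dataset({"psi": ("x", np.exp(1j * np.linspace(0, np.pi, 4)))})
    ds_cplx.to_netcdf("complex.nc", engine="h5netcdf", invalid_netcdf=True)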
.. _io.hdf5:

HDF5
----

`HDF5`_ is both a file format and a data model for storing information. HDF5 stores data hierarchically, using groups to create a nested structure. HDF5 is a more general version of the netCDF4 data model, so the nested structure is one of many similarities between the two data formats.

Reading HDF5 files in xarray requires the ``h5netcdf`` engine, which can be installed with ``conda install h5netcdf``. Once installed we can use xarray to open HDF5 files:

.. code:: python

    xr.open_dataset("/path/to/my/file.h5")

The similarities between HDF5 and netCDF4 mean that HDF5 data can be written with the same :py:meth:`Dataset.to_netcdf` method as used for netCDF4 data:

.. jupyter-execute::

    ds = xr.Dataset(
        {"foo": (("x", "y"), np.random.rand(4, 5))},
        coords={
            "x": [10, 20, 30, 40],
            "y": pd.date_range("2000-01-01", periods=5),
            "z": ("x", list("abcd")),
        },
    )

    ds.to_netcdf("saved_on_disk.h5")

Groups
~~~~~~

If you have multiple or highly nested groups, xarray by default may not read the group that you want. A particular group of an HDF5 file can be specified using the ``group`` argument:

.. code:: python

    xr.open_dataset("/path/to/my/file.h5", group="/my/group")

While xarray cannot interrogate an HDF5 file to determine which groups are available, the HDF5 Python reader `h5py`_ can be used instead.

Natively the xarray data structures can only handle one level of nesting, organized as DataArrays inside of Datasets. If your HDF5 file has additional levels of hierarchy you can only access one group at a time and will need to specify group names.

.. _HDF5: https://hdfgroup.github.io/hdf5/index.html
.. _h5py: https://www.h5py.org/

.. _io.zarr:

Zarr
----

`Zarr`_ is a Python package that provides an implementation of chunked, compressed, N-dimensional arrays. Zarr has the ability to store arrays in a range of ways, including in memory, in files, and in cloud-based object storage such as `Amazon S3`_ and `Google Cloud Storage`_. Xarray's Zarr backend allows xarray to leverage these capabilities, including the ability to store and analyze datasets far too large to fit onto disk (particularly :ref:`in combination with dask <dask>`).

Xarray can't open just any zarr dataset, because xarray requires special metadata (attributes) describing the dataset dimensions and coordinates. At this time, xarray can only open zarr datasets with these special attributes, such as zarr datasets written by xarray, netCDF, or GDAL. For implementation details, see :ref:`zarr_encoding`.

To write a dataset with zarr, we use the :py:meth:`Dataset.to_zarr` method.

To write to a local directory, we pass a path to a directory:

.. jupyter-execute::
    :hide-code:

    ! rm -rf path/to/directory.zarr

.. jupyter-execute::
    :stderr:

    ds = xr.Dataset(
        {"foo": (("x", "y"), np.random.rand(4, 5))},
        coords={
            "x": [10, 20, 30, 40],
            "y": pd.date_range("2000-01-01", periods=5),
            "z": ("x", list("abcd")),
        },
    )

    ds.to_zarr("path/to/directory.zarr", zarr_format=2, consolidated=False)

(The suffix ``.zarr`` is optional--just a reminder that a zarr store lives there.) If the directory does not exist, it will be created. If a zarr store is already present at that path, an error will be raised, preventing it from being overwritten. To override this behavior and overwrite an existing store, add ``mode='w'`` when invoking :py:meth:`~Dataset.to_zarr`.

DataArrays can also be saved to disk using the :py:meth:`DataArray.to_zarr` method, and loaded from disk using the :py:func:`open_dataarray` function with ``engine='zarr'``.
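A minimal sketch (the store path is illustrative):

.. code:: python

    # illustrative only: round-trip a single DataArray through zarr
    ds.foo.to_zarr("path/to/foo.zarr", mode="w", consolidated=False)
    foo_disk = xr.open_dataarray("path/to/foo.zarr", engine="zarr", consolidated=False)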
Similar to :py:meth:`DataArray.to_netcdf`, :py:meth:`DataArray.to_zarr` will convert the ``DataArray`` to a ``Dataset`` before saving, and then convert back when loading, ensuring that the ``DataArray`` that is loaded is always exactly the same as the one that was saved.

.. note::

    xarray does not write NCZarr attributes. Therefore, NCZarr data must be opened in read-only mode.

To store variable length strings, convert them to object arrays first with ``dtype=object``.

To read back a zarr dataset that has been created this way, we use the :py:func:`open_zarr` method:

.. jupyter-execute::

    ds_zarr = xr.open_zarr("path/to/directory.zarr", consolidated=False)
    ds_zarr

Cloud Storage Buckets
~~~~~~~~~~~~~~~~~~~~~

It is possible to read and write xarray datasets directly from / to cloud storage buckets using zarr. This example uses the `gcsfs`_ package to provide an interface to `Google Cloud Storage`_.

General `fsspec`_ URLs, those that begin with ``s3://`` or ``gcs://`` for example, are parsed and the store set up for you automatically when reading. You should include any arguments to the storage backend as the key ``storage_options``, part of ``backend_kwargs``.

.. code:: python

    ds_gcs = xr.open_dataset(
        "gcs://<bucket-name>/path.zarr",
        backend_kwargs={
            "storage_options": {"project": "<project-name>", "token": None}
        },
        engine="zarr",
    )

This also works with ``open_mfdataset``, allowing you to pass a list of paths or a URL to be interpreted as a glob string.

For writing, you may either specify a bucket URL or explicitly set up a ``zarr.abc.store.Store`` instance, as follows:

.. tab:: URL

    .. code:: python

        # write to the bucket via GCS URL
        ds.to_zarr("gs://<bucket/path/to/directory.zarr>")
        # read it back
        ds_gcs = xr.open_zarr("gs://<bucket/path/to/directory.zarr>")

.. tab:: fsspec

    .. code:: python

        import gcsfs
        import zarr

        # manually manage the cloud filesystem connection -- useful, for example,
        # when you need to manage permissions to cloud resources
        fs = gcsfs.GCSFileSystem(project="<project-name>", token=None)
        zstore = zarr.storage.FsspecStore(fs, path="<bucket-name>")

        # write to the bucket
        ds.to_zarr(store=zstore)
        # read it back
        ds_gcs = xr.open_zarr(zstore)

.. tab:: obstore

    .. code:: python

        import obstore
        import zarr

        # alternatively, obstore offers a modern, performant interface for
        # cloud buckets
        gcsstore = obstore.store.GCSStore(
            "<bucket-name>", prefix="<path-to-store>", skip_signature=True
        )
        zstore = zarr.storage.ObjectStore(gcsstore)

        # write to the bucket
        ds.to_zarr(store=zstore)
        # read it back
        ds_gcs = xr.open_zarr(zstore)

.. _fsspec: https://filesystem-spec.readthedocs.io/en/latest/
.. _obstore: https://developmentseed.org/obstore/latest/
.. _Zarr: https://zarr.readthedocs.io/
.. _Amazon S3: https://aws.amazon.com/s3/
.. _Google Cloud Storage: https://cloud.google.com/storage/
.. _gcsfs: https://github.com/fsspec/gcsfs

.. _io.zarr.distributed_writes:

Distributed writes
~~~~~~~~~~~~~~~~~~

Xarray will natively use dask to write in parallel to a zarr store, which should satisfy most moderately sized datasets. For more flexible parallelization, we can use ``region`` to write to limited regions of arrays in an existing Zarr store.

To scale this up to writing large datasets, first create an initial Zarr store without writing all of its array data. This can be done by first creating a ``Dataset`` with dummy values stored in :ref:`dask`, and then calling ``to_zarr`` with ``compute=False`` to write only metadata (including ``attrs``) to Zarr:

.. jupyter-execute::
    :hide-code:

    ! rm -rf path/to/directory.zarr
.. jupyter-execute::

    import dask.array

    # The values of this dask array are entirely irrelevant; only the dtype,
    # shape and chunks are used
    dummies = dask.array.zeros(30, chunks=10)
    ds = xr.Dataset({"foo": ("x", dummies)}, coords={"x": np.arange(30)})
    path = "path/to/directory.zarr"

    # Now we write the metadata without computing any array values
    ds.to_zarr(path, compute=False, consolidated=False)

Now, a Zarr store with the correct variable shapes and attributes exists that can be filled out by subsequent calls to ``to_zarr``. ``region`` can be set to ``"auto"``, which opens the existing store and determines the correct alignment of the new data with the existing dimensions, or to an explicit mapping from dimension names to Python ``slice`` objects indicating where the data should be written (in index space, not label space), e.g.,

.. jupyter-execute::

    # For convenience, we'll slice a single dataset, but in the real use-case
    # we would create them separately possibly even from separate processes.
    ds = xr.Dataset({"foo": ("x", np.arange(30))}, coords={"x": np.arange(30)})
    # Any of the following region specifications are valid
    ds.isel(x=slice(0, 10)).to_zarr(path, region="auto", consolidated=False)
    ds.isel(x=slice(10, 20)).to_zarr(path, region={"x": "auto"}, consolidated=False)
    ds.isel(x=slice(20, 30)).to_zarr(path, region={"x": slice(20, 30)}, consolidated=False)

Concurrent writes with ``region`` are safe as long as they modify distinct chunks in the underlying Zarr arrays (or use an appropriate ``lock``).

As a safety check to make it harder to inadvertently override existing values, if you set ``region`` then *all* variables included in a Dataset must have dimensions included in ``region``. Other variables (typically coordinates) need to be explicitly dropped and/or written in separate calls to ``to_zarr`` with ``mode='a'``.

Zarr Compressors and Filters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are many different options for compression and filtering possible with zarr; see the `Zarr`_ documentation for details.

These options can be passed to the ``to_zarr`` method as variable encoding. For example:

.. jupyter-execute::
    :hide-code:

    ! rm -rf foo.zarr

.. jupyter-execute::

    import zarr
    from zarr.codecs import BloscCodec

    compressor = BloscCodec(cname="zstd", clevel=3, shuffle="shuffle")
    ds.to_zarr("foo.zarr", consolidated=False, encoding={"foo": {"compressors": [compressor]}})

.. note::

    Not all native zarr compression and filtering options have been tested with xarray.

.. _io.zarr.appending:

Modifying existing Zarr stores
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Xarray supports several ways of incrementally writing variables to a Zarr store. These options are useful for scenarios when it is infeasible or undesirable to write your entire dataset at once.

1. Use ``mode='a'`` to add or overwrite entire variables,
2. Use ``append_dim`` to resize and append to existing variables, and
3. Use ``region`` to write to limited regions of existing arrays.

.. tip::

    For ``Dataset`` objects containing dask arrays, a single call to ``to_zarr()`` will write all of your data in parallel.

.. warning::

    Alignment of coordinates is currently not checked when modifying an existing Zarr store. It is up to the user to ensure that coordinates are consistent.

To add or overwrite entire variables, simply call :py:meth:`~Dataset.to_zarr` with ``mode='a'`` on a Dataset containing the new variables, passing in an existing Zarr store or path to a Zarr store.

To resize and then append values along an existing dimension in a store, set ``append_dim``.
This is a good option if data always arrives in a particular order, e.g., for time-stepping a simulation:

.. jupyter-execute::
    :hide-code:

    ! rm -rf path/to/directory.zarr

.. jupyter-execute::

    ds1 = xr.Dataset(
        {"foo": (("x", "y", "t"), np.random.rand(4, 5, 2))},
        coords={
            "x": [10, 20, 30, 40],
            "y": [1, 2, 3, 4, 5],
            "t": pd.date_range("2001-01-01", periods=2),
        },
    )
    ds1.to_zarr("path/to/directory.zarr", consolidated=False)

.. jupyter-execute::

    ds2 = xr.Dataset(
        {"foo": (("x", "y", "t"), np.random.rand(4, 5, 2))},
        coords={
            "x": [10, 20, 30, 40],
            "y": [1, 2, 3, 4, 5],
            "t": pd.date_range("2001-01-03", periods=2),
        },
    )
    ds2.to_zarr("path/to/directory.zarr", append_dim="t", consolidated=False)

.. _io.zarr.writing_chunks:

Specifying chunks in a zarr store
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Chunk sizes may be specified in one of three ways when writing to a zarr store:

1. Manual chunk sizing through the use of the ``encoding`` argument in :py:meth:`Dataset.to_zarr`.
2. Automatic chunking based on chunks in dask arrays.
3. Default chunk behavior determined by the zarr library.

The resulting chunks will be determined based on the order of the above list; dask chunks will be overridden by manually-specified chunks in the encoding argument, and the presence of either dask chunks or chunks in the ``encoding`` attribute will supersede the default chunking heuristics in zarr.

Importantly, this logic applies to every array in the zarr store individually, including coordinate arrays. Therefore, if a dataset contains one or more dask arrays, it may still be desirable to specify a chunk size for the coordinate arrays (for example, with a chunk size of ``-1`` to include the full coordinate).

To specify chunks manually using the ``encoding`` argument, provide a nested dictionary with the structure ``{'variable_or_coord_name': {'chunks': chunks_tuple}}``.

.. note::

    The positional ordering of the chunks in the encoding argument must match the positional ordering of the dimensions in each array. Watch out for arrays with differently-ordered dimensions within a single Dataset.

For example, let's say we're working with a dataset with dimensions ``('time', 'x', 'y')``, a variable ``Tair`` which is chunked in ``x`` and ``y``, and two multi-dimensional coordinates ``xc`` and ``yc``:

.. jupyter-execute::

    ds = xr.tutorial.open_dataset("rasm")

    ds["Tair"] = ds["Tair"].chunk({"x": 100, "y": 100})

    ds

These multi-dimensional coordinates are only two-dimensional and take up very little space on disk or in memory, yet when writing to disk the default zarr behavior is to split them into chunks:

.. jupyter-execute::

    ds.to_zarr("path/to/directory.zarr", consolidated=False, mode="w")
    !tree -I zarr.json path/to/directory.zarr

This may cause unwanted overhead on some systems, such as when reading from a cloud storage provider. To disable this chunking, we can specify a chunk size equal to the shape of each coordinate array in the ``encoding`` argument:

.. jupyter-execute::

    ds.to_zarr(
        "path/to/directory.zarr",
        encoding={"xc": {"chunks": ds.xc.shape}, "yc": {"chunks": ds.yc.shape}},
        consolidated=False,
        mode="w",
    )
    !tree -I zarr.json path/to/directory.zarr

The number of chunks on Tair matches our dask chunks, while there is now only a single chunk in the directory stores of each coordinate.

Groups
~~~~~~

Nested groups in zarr stores can be represented by loading the store as a :py:class:`xarray.DataTree` object, similarly to netCDF. To open a whole zarr store as a tree of groups use the :py:func:`open_datatree` function.
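A minimal sketch (the store path is illustrative, assuming a store with nested groups):

.. code:: python

    # illustrative only: open every group of a nested zarr store at once
    dt = xr.open_datatree("path/to/nested.zarr", engine="zarr")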
To save a ``DataTree`` object as a zarr store containing many groups, use the :py:meth:`xarray.DataTree.to_zarr()` method.

.. note::

    Note that perfect round-tripping should always be possible with a zarr store (:ref:`unlike for netCDF files <netcdf.group.warning>`), as zarr does not support "unused" dimensions.

    For the root group the same restrictions (:ref:`as for netCDF files <netcdf.root_group.note>`) apply. Due to file format specifications the on-disk root group name is always ``"/"``, overriding any given ``DataTree`` root node name.

.. _io.zarr.consolidated_metadata:

Consolidated Metadata
~~~~~~~~~~~~~~~~~~~~~

Xarray needs to read all of the zarr metadata when it opens a dataset. In some storage mediums, such as with cloud object storage (e.g. `Amazon S3`_), this can introduce significant overhead, because two separate HTTP calls to the object store must be made for each variable in the dataset.

By default Xarray uses a feature called *consolidated metadata*, storing all metadata for the entire dataset with a single key (by default called ``.zmetadata``). This typically drastically speeds up opening the store. (For more information on this feature, consult the `Zarr`_ documentation on consolidated metadata.)

By default, xarray writes consolidated metadata and attempts to read stores with consolidated metadata, falling back to non-consolidated metadata for reads. Because this fall-back option is so much slower, xarray issues a ``RuntimeWarning`` with guidance when reading with consolidated metadata fails:

    Failed to open Zarr store with consolidated metadata, falling back to try reading non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:

    1. Consolidating metadata in this existing store with :py:func:`zarr.consolidate_metadata`.
    2. Explicitly setting ``consolidated=False``, to avoid trying to read consolidated metadata.
    3. Explicitly setting ``consolidated=True``, to raise an error in this case instead of falling back to try reading non-consolidated metadata.

Fill Values
~~~~~~~~~~~

Zarr arrays have a ``fill_value`` that is used for chunks that were never written to disk. For the Zarr version 2 format, Xarray will set ``fill_value`` to be equal to the CF/NetCDF ``"_FillValue"``. This is ``np.nan`` by default for floats, and unset otherwise. Note that the Zarr library will set a default ``fill_value`` if not specified (usually ``0``).

For the Zarr version 3 format, ``_FillValue`` and ``fill_value`` are decoupled. So you can set ``fill_value`` in ``encoding`` as usual.

Note that at read-time, you can control whether ``_FillValue`` is masked using the ``mask_and_scale`` kwarg; and whether Zarr's ``fill_value`` is treated as synonymous with ``_FillValue`` using the ``use_zarr_fill_value_as_mask`` kwarg to :py:func:`xarray.open_zarr`.

.. _io.kerchunk:

Kerchunk
--------

`Kerchunk <https://fsspec.github.io/kerchunk/>`_ is a Python library that allows you to access chunked and compressed data formats (such as NetCDF3, NetCDF4, HDF5, GRIB2, TIFF & FITS), many of which are primary data formats for many data archives, by viewing the whole archive as an ephemeral `Zarr`_ dataset which allows for parallel, chunk-specific access.

Instead of creating a new copy of the dataset in the Zarr spec/format or downloading the files locally, Kerchunk reads through the data archive and extracts the byte range and compression information of each chunk and saves it as a ``reference``. These references are then saved as ``json`` files or ``parquet`` (more efficient) for later use.
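As a rough sketch of how such a reference file might be generated (this assumes ``kerchunk`` is installed; consult the kerchunk documentation for the authoritative API):

.. code:: python

    import json

    from kerchunk.hdf import SingleHdf5ToZarr

    # scan a local HDF5 file and save its chunk references as JSON
    refs = SingleHdf5ToZarr("saved_on_disk.h5").translate()
    with open("combined.json", "w") as f:
        json.dump(refs, f)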
Examples of these stored references can be found in the kerchunk documentation and the cookbooks linked below.

.. note::

    These references follow this `specification <https://fsspec.github.io/kerchunk/spec.html>`_. Packages like `kerchunk`_ and `virtualizarr <https://github.com/zarr-developers/VirtualiZarr>`_ help in creating and reading these references.

Reading these data archives becomes really easy with ``kerchunk`` in combination with ``xarray``, especially when these archives are large in size. A single combined reference can refer to thousands of the original data files present in these archives. You can view the whole dataset from this combined reference using the above packages.

The following example shows opening a single ``json`` reference to the ``saved_on_disk.h5`` file created above. If the file were instead stored remotely (e.g. ``s3://saved_on_disk.h5``) you can use ``storage_options`` that are used to configure `fsspec`_:

.. jupyter-execute::

    ds_kerchunked = xr.open_dataset(
        "./combined.json",
        engine="kerchunk",
        storage_options={},
    )

    ds_kerchunked

.. note::

    You can refer to the `project pythia kerchunk cookbook <https://projectpythia.org/kerchunk-cookbook/>`_ and the pangeo guide on kerchunk for more information.

.. _io.iris:

Iris
----

The Iris_ tool allows easy reading of common meteorological and climate model formats (including GRIB and UK MetOffice PP files) into ``Cube`` objects which are in many ways very similar to ``DataArray`` objects, while enforcing a CF-compliant data model.

DataArray ``to_iris`` and ``from_iris``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If iris is installed, xarray can convert a ``DataArray`` into a ``Cube`` using :py:meth:`DataArray.to_iris`:

.. jupyter-execute::

    da = xr.DataArray(
        np.random.rand(4, 5),
        dims=["x", "y"],
        coords=dict(x=[10, 20, 30, 40], y=pd.date_range("2000-01-01", periods=5)),
    )

    cube = da.to_iris()
    print(cube)

Conversely, we can create a new ``DataArray`` object from a ``Cube`` using :py:meth:`DataArray.from_iris`:

.. jupyter-execute::

    da_cube = xr.DataArray.from_iris(cube)
    da_cube

Ncdata
~~~~~~

Ncdata_ provides more sophisticated means of transferring data, including entire datasets. It uses the file saving and loading functions in both projects to provide a more "correct" translation between them, but still with very low overhead and not using actual disk files.

Here we load an xarray dataset and convert it to Iris cubes:

.. jupyter-execute::
    :stderr:

    ds = xr.tutorial.open_dataset("air_temperature_gradient")
    cubes = ncdata.iris_xarray.cubes_from_xarray(ds)
    print(cubes)

.. jupyter-execute::

    print(cubes[1])

And we can convert the cubes back to an xarray dataset:

.. jupyter-execute::

    # ensure dataset-level and variable-level attributes loaded correctly
    iris.FUTURE.save_split_attrs = True

    ds = ncdata.iris_xarray.cubes_to_xarray(cubes)
    ds

Ncdata can also adjust file data within load and save operations, to fix data loading problems or provide exact save formatting without needing to modify files on disk. See, for example, the `ncdata usage examples`_.

.. _Iris: https://scitools.org.uk/iris
.. _Ncdata: https://ncdata.readthedocs.io/en/latest/index.html
.. _ncdata usage examples: https://github.com/pp-mo/ncdata/tree/v0.1.2?tab=readme-ov-file#correct-a-miscoded-attribute-in-iris-input

OPeNDAP
-------

Xarray includes support for `OPeNDAP`__ (via the netCDF4 library or Pydap), which lets us access large datasets over HTTP.

__ https://www.opendap.org/

For example, we can open a connection to GBs of weather data produced by the `PRISM`__ project, and hosted by `IRI`__ at Columbia:

__ https://www.prism.oregonstate.edu/
__ https://iri.columbia.edu/
jupyter-input:: remote_data = xr.open_dataset( "http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods", decode_times=False, ) remote_data .. jupyter-output:: Dimensions: (T: 1422, X: 1405, Y: 621) Coordinates: * X (X) float32 -125.0 -124.958 -124.917 -124.875 -124.833 -124.792 -124.75 ... * T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 -772.5 -771.5 ... * Y (Y) float32 49.9167 49.875 49.8333 49.7917 49.75 49.7083 49.6667 49.625 ... Data variables: ppt (T, Y, X) float64 ... tdmean (T, Y, X) float64 ... tmax (T, Y, X) float64 ... tmin (T, Y, X) float64 ... Attributes: Conventions: IRIDL expires: 1375315200 .. TODO: update this example to show off decode_cf? .. note:: Like many real-world datasets, this dataset does not entirely follow `CF conventions`_. Unexpected formats will usually cause xarray's automatic decoding to fail. The way to work around this is either to set ``decode_cf=False`` in ``open_dataset`` to turn off all use of CF conventions, or to disable only the troublesome parser. In this case, we set ``decode_times=False`` because the time axis here provides the calendar attribute in a format that xarray does not expect (the integer ``360`` instead of a string like ``'360_day'``). We can select and slice this data any number of times, and nothing is loaded over the network until we look at particular values: .. jupyter-input:: tmax = remote_data["tmax"][:500, ::3, ::3] tmax .. jupyter-output:: [48541500 values with dtype=float64] Coordinates: * Y (Y) float32 49.9167 49.7917 49.6667 49.5417 49.4167 49.2917 ... * X (X) float32 -125.0 -124.875 -124.75 -124.625 -124.5 -124.375 ... * T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 ... Attributes: pointwidth: 120 standard_name: air_temperature units: Celsius_scale expires: 1443657600 .. jupyter-input:: # the data is downloaded automatically when we make the plot tmax[0].plot() .. image:: ../_static/opendap-prism-tmax.png Some servers require authentication before we can access the data. Pydap uses a `Requests`__ session object (which the user can pre-define), and this session object can recover `authentication`__ credentials from a locally stored ``.netrc`` file. For example, to connect to a server that requires NASA's URS authentication, with the username/password credentials stored in a locally accessible ``.netrc``, access to OPeNDAP data should be as simple as this:: import xarray as xr import requests my_session = requests.Session() ds_url = 'https://gpm1.gesdisc.eosdis.nasa.gov/opendap/hyrax/example.nc' ds = xr.open_dataset(ds_url, session=my_session, engine="pydap") Moreover, a bearer token header can be included in a ``requests`` session object, allowing for token-based authentication, which OPeNDAP servers can use to avoid some redirects. Lastly, OPeNDAP servers may provide endpoint URLs for different OPeNDAP protocols, DAP2 and DAP4. To specify which of the two protocols to use, replace the scheme of the URL with the name of the protocol. For example:: # dap2 url ds_url = 'dap2://gpm1.gesdisc.eosdis.nasa.gov/opendap/hyrax/example.nc' # dap4 url ds_url = 'dap4://gpm1.gesdisc.eosdis.nasa.gov/opendap/hyrax/example.nc' While most OPeNDAP servers implement DAP2, not all servers implement DAP4. It is recommended to check whether the URL you are using `supports DAP4`__ by opening it in a browser. __ https://docs.python-requests.org __ https://pydap.github.io/pydap/en/notebooks/Authentication.html __ https://pydap.github.io/pydap/en/faqs/dap2_or_dap4_url.html ..
_io.pickle: Pickle ------ The simplest way to serialize an xarray object is to use Python's built-in pickle module: .. jupyter-execute:: import pickle # use the highest protocol (-1) because it is much faster than the default # text-based pickle format pkl = pickle.dumps(ds, protocol=-1) pickle.loads(pkl) Pickling is important because it doesn't require any external libraries and lets you use xarray objects with Python modules like :py:mod:`multiprocessing` or :ref:`Dask `. However, pickling is **not recommended for long-term storage**. Restoring a pickle requires that the internal structure of the types for the pickled data remain unchanged. Because the internal design of xarray is still being refined, we make no guarantees (at this point) that objects pickled with this version of xarray will work in future versions. .. note:: When pickling an object opened from a NetCDF file, the pickle file will contain a reference to the file on disk. If you want to store the actual array values, load it into memory first with :py:meth:`Dataset.load` or :py:meth:`Dataset.compute`. .. _dictionary io: Dictionary ---------- We can convert a ``Dataset`` (or a ``DataArray``) to a dict using :py:meth:`Dataset.to_dict`: .. jupyter-execute:: ds = xr.Dataset({"foo": ("x", np.arange(30))}) d = ds.to_dict() d We can create a new xarray object from a dict using :py:meth:`Dataset.from_dict`: .. jupyter-execute:: ds_dict = xr.Dataset.from_dict(d) ds_dict Dictionary support allows for flexible use of xarray objects. It doesn't require external libraries and dicts can easily be pickled or converted to JSON or GeoJSON. All the values are converted to lists, so dicts might be quite large. To export just the dataset schema without the data itself, use the ``data=False`` option: .. jupyter-execute:: ds.to_dict(data=False) This can be useful for generating indices of dataset contents to expose to search indices or other automated data discovery tools. .. jupyter-execute:: :hide-code: import os # We're now done with the dataset named `ds`. Although the `with` statement closed # the dataset, displaying the unpickled pickle of `ds` re-opened "saved_on_disk.nc". # However, `ds` (rather than the unpickled dataset) refers to the open file. Delete # `ds` to close the file. del ds for f in ["saved_on_disk.nc", "saved_on_disk.h5"]: if os.path.exists(f): os.remove(f) .. _io.rasterio: Rasterio -------- GDAL-readable raster data, such as GeoTIFFs, can be opened using `rasterio`_ via the `rioxarray`_ extension. `rioxarray`_ can also handle geospatial-related tasks such as re-projecting and clipping. .. jupyter-input:: import rioxarray rds = rioxarray.open_rasterio("RGB.byte.tif") rds .. jupyter-output:: [1703814 values with dtype=uint8] Coordinates: * band (band) int64 1 2 3 * y (y) float64 2.827e+06 2.826e+06 ... 2.612e+06 2.612e+06 * x (x) float64 1.021e+05 1.024e+05 ... 3.389e+05 3.392e+05 spatial_ref int64 0 Attributes: STATISTICS_MAXIMUM: 255 STATISTICS_MEAN: 29.947726688477 STATISTICS_MINIMUM: 0 STATISTICS_STDDEV: 52.340921626611 transform: (300.0379266750948, 0.0, 101985.0, 0.0, -300.0417827... _FillValue: 0.0 scale_factor: 1.0 add_offset: 0.0 grid_mapping: spatial_ref .. jupyter-input:: rds.rio.crs # CRS.from_epsg(32618) rds4326 = rds.rio.reproject("epsg:4326") rds4326.rio.crs # CRS.from_epsg(4326) rds4326.rio.to_raster("RGB.byte.4326.tif") .. _rasterio: https://rasterio.readthedocs.io/en/latest/ .. _rioxarray: https://corteva.github.io/rioxarray/stable/ ..
_io.cfgrib: .. jupyter-execute:: :hide-code: import shutil shutil.rmtree("foo.zarr") shutil.rmtree("path/to/directory.zarr") GRIB format via cfgrib ---------------------- Xarray supports reading GRIB files via the ECMWF cfgrib_ Python driver, if it is installed. To open a GRIB file, supply ``engine='cfgrib'`` to :py:func:`open_dataset` after installing cfgrib_: .. jupyter-input:: ds_grib = xr.open_dataset("example.grib", engine="cfgrib") We recommend installing cfgrib via conda:: conda install -c conda-forge cfgrib .. _cfgrib: https://github.com/ecmwf/cfgrib CSV and other formats supported by pandas ----------------------------------------- For more options (tabular formats and CSV files in particular), consider exporting your objects to pandas and using its broad range of `IO tools`_. For CSV files, one might also consider `xarray_extras`_. .. _xarray_extras: https://xarray-extras.readthedocs.io/en/latest/api/csv.html .. _IO tools: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html Third party libraries --------------------- More formats are supported by extension libraries: - `xarray-mongodb `_: Store xarray objects on MongoDB .. currentmodule:: xarray .. _options: Configuration ============= Xarray offers a small number of configuration options through :py:func:`set_options`. With these, you can 1. Control the ``repr``: - ``display_expand_attrs`` - ``display_expand_coords`` - ``display_expand_data`` - ``display_expand_data_vars`` - ``display_max_rows`` - ``display_style`` 2. Control behaviour during operations: ``arithmetic_join``, ``keep_attrs``, ``use_bottleneck``. 3. Control colormaps for plots: ``cmap_divergent``, ``cmap_sequential``. 4. Control aspects of file reading: ``file_cache_maxsize``, ``warn_on_unclosed_files``. You can set these options either globally :: xr.set_options(arithmetic_join="exact") or locally as a context manager: :: with xr.set_options(arithmetic_join="exact"): # do operation here pass .. currentmodule:: xarray .. _pandas: =================== Working with pandas =================== One of the most important features of xarray is the ability to convert to and from :py:mod:`pandas` objects to interact with the rest of the PyData ecosystem. For example, for plotting labeled data, we highly recommend using the `visualization built in to pandas itself`__ or provided by pandas-aware libraries such as `Seaborn`__. __ https://pandas.pydata.org/pandas-docs/stable/visualization.html __ https://seaborn.pydata.org/ .. jupyter-execute:: :hide-code: import numpy as np import pandas as pd import xarray as xr np.random.seed(123456) Hierarchical and tidy data ~~~~~~~~~~~~~~~~~~~~~~~~~~ Tabular data is easiest to work with when it meets the criteria for `tidy data`__: * Each column holds a different variable. * Each row holds a different observation. __ https://www.jstatsoft.org/v59/i10/ In this "tidy data" format, we can represent any :py:class:`Dataset` and :py:class:`DataArray` in terms of :py:class:`~pandas.DataFrame` and :py:class:`~pandas.Series`, respectively (and vice-versa). The representation works by flattening non-coordinates to 1D, and turning the tensor product of coordinate indexes into a :py:class:`pandas.MultiIndex`. Dataset and DataFrame --------------------- To convert any dataset to a ``DataFrame`` in tidy form, use the :py:meth:`Dataset.to_dataframe()` method: ..
jupyter-execute:: ds = xr.Dataset( {"foo": (("x", "y"), np.random.randn(2, 3))}, coords={ "x": [10, 20], "y": ["a", "b", "c"], "along_x": ("x", np.random.randn(2)), "scalar": 123, }, ) ds .. jupyter-execute:: df = ds.to_dataframe() df We see that each variable and coordinate in the Dataset is now a column in the DataFrame, with the exception of indexes, which are in the index. To convert the ``DataFrame`` to any other convenient representation, use ``DataFrame`` methods like :py:meth:`~pandas.DataFrame.reset_index`, :py:meth:`~pandas.DataFrame.stack` and :py:meth:`~pandas.DataFrame.unstack`. For datasets containing dask arrays where the data should be lazily loaded, see the :py:meth:`Dataset.to_dask_dataframe()` method. To create a ``Dataset`` from a ``DataFrame``, use the :py:meth:`Dataset.from_dataframe` class method or the equivalent :py:meth:`pandas.DataFrame.to_xarray` method: .. jupyter-execute:: xr.Dataset.from_dataframe(df) Notice that the dimensions of variables in the ``Dataset`` have now expanded after the round-trip conversion to a ``DataFrame``. This is because every object in a ``DataFrame`` must have the same indices, so we need to broadcast the data of each array to the full size of the new ``MultiIndex``. Likewise, all the coordinates (other than indexes) ended up as variables, because pandas does not distinguish non-index coordinates. DataArray and Series -------------------- ``DataArray`` objects have a complementary representation in terms of a :py:class:`~pandas.Series`. Using a Series preserves the ``Dataset`` to ``DataArray`` relationship, because ``DataFrames`` are dict-like containers of ``Series``. The methods are very similar to those for working with DataFrames: .. jupyter-execute:: s = ds["foo"].to_series() s .. jupyter-execute:: # or equivalently, with Series.to_xarray() xr.DataArray.from_series(s) Both the ``from_series`` and ``from_dataframe`` methods use reindexing, so they work even if the hierarchical index is not a full tensor product: .. jupyter-execute:: s[::2] .. jupyter-execute:: s[::2].to_xarray() Lossless and reversible conversion ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The previous ``Dataset`` example shows that the conversion is not reversible (lossy roundtrip) and that the size of the ``Dataset`` increases. In particular, the following deviations are noted after a roundtrip: - a non-dimension Dataset ``coordinate`` is converted into a ``variable`` - a non-dimension DataArray ``coordinate`` is not converted - ``dtype`` is not always the same (e.g. "str" is converted to "object") - ``attrs`` metadata is not conserved To avoid these problems, the third-party `ntv-pandas `__ library offers lossless and reversible conversions between ``Dataset``/ ``DataArray`` and pandas ``DataFrame`` objects. This solution is particularly interesting for converting any ``DataFrame`` into a ``Dataset`` (the converter finds the multidimensional structure hidden by the tabular structure). The `ntv-pandas examples `__ show how to improve the conversion for the previous ``Dataset`` example and for more complex examples. Multi-dimensional data ~~~~~~~~~~~~~~~~~~~~~~ Tidy data is great, but sometimes you want to preserve dimensions instead of automatically stacking them into a ``MultiIndex``. :py:meth:`DataArray.to_pandas()` is a shortcut that lets you convert a DataArray directly into a pandas object with the same dimensionality, if available in pandas (i.e., a 1D array is converted to a :py:class:`~pandas.Series` and 2D to :py:class:`~pandas.DataFrame`): ..
jupyter-execute:: arr = xr.DataArray( np.random.randn(2, 3), coords=[("x", [10, 20]), ("y", ["a", "b", "c"])] ) df = arr.to_pandas() df To perform the inverse operation, converting a pandas object into a data array with the same shape, simply use the :py:class:`DataArray` constructor: .. jupyter-execute:: xr.DataArray(df) Both the ``DataArray`` and ``Dataset`` constructors directly convert pandas objects into xarray objects with the same shape. This means that they preserve all use of multi-indexes: .. jupyter-execute:: index = pd.MultiIndex.from_arrays( [["a", "a", "b"], [0, 1, 2]], names=["one", "two"] ) df = pd.DataFrame({"x": 1, "y": 2}, index=index) ds = xr.Dataset(df) ds However, you will need to set dimension names explicitly, either with the ``dims`` argument in the ``DataArray`` constructor or by calling :py:meth:`~Dataset.rename` on the new object. .. _panel transition: Transitioning from pandas.Panel to xarray ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``Panel``, pandas' data structure for 3D arrays, was always a second-class data structure compared to the Series and DataFrame. To allow pandas developers to focus more on its core functionality built around the DataFrame, pandas removed ``Panel`` in favor of directing users who use multi-dimensional arrays to xarray. Xarray has most of ``Panel``'s features, a more explicit API (particularly around indexing), and the ability to scale to >3 dimensions with the same interface. As discussed in the :ref:`data structures section of the docs `, there are two primary data structures in xarray: ``DataArray`` and ``Dataset``. You can imagine a ``DataArray`` as an n-dimensional pandas ``Series`` (i.e. a single typed array), and a ``Dataset`` as the ``DataFrame`` equivalent (i.e. a dict of aligned ``DataArray`` objects). So you can represent a Panel in two ways: - As a 3-dimensional ``DataArray``, - Or as a ``Dataset`` containing a number of 2-dimensional DataArray objects. Let's take a look: .. jupyter-execute:: data = np.random.default_rng(0).random((2, 3, 4)) items = list("ab") major_axis = list("mno") minor_axis = pd.date_range(start="2000", periods=4, name="date") With old versions of pandas (prior to 0.25), this could be stored in a ``Panel``: .. jupyter-input:: pd.Panel(data, items, major_axis, minor_axis) .. jupyter-output:: Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis) Items axis: a to b Major_axis axis: m to o Minor_axis axis: 2000-01-01 00:00:00 to 2000-01-04 00:00:00 To put this data in a ``DataArray``, write: .. jupyter-execute:: array = xr.DataArray(data, [items, major_axis, minor_axis]) array As you can see, there are three dimensions (each is also a coordinate). Two of the axes were unnamed, so they have been assigned ``dim_0`` and ``dim_1`` respectively, while the third retains its name ``date``. You can also easily convert this data into a ``Dataset``: .. jupyter-execute:: array.to_dataset(dim="dim_0") Here, there are two data variables, each representing a DataFrame on panel's ``items`` axis, and labeled as such. Each variable is a 2D array of the respective values along the ``items`` dimension. While the xarray docs are relatively complete, a few items stand out for Panel users: - A DataArray's data is stored as a numpy array, and so can only contain a single type. As a result, a Panel that contains :py:class:`~pandas.DataFrame` objects with multiple types will be converted to ``dtype=object``. A ``Dataset`` of multiple ``DataArray`` objects each with its own dtype will allow original types to be preserved.
- :ref:`Indexing ` is similar to pandas, but more explicit and leverages xarray's naming of dimensions. - Because of those features, working with much higher-dimensional data is very practical. - Variables in ``Dataset`` objects can use a subset of the dataset's dimensions. For example, you can have one dataset with Person x Score x Time, and another with Person x Score. - Coordinates can be used for both dimensions and for variables which *label* the data variables, so you could have a coordinate ``Age`` that labels the ``Person`` dimension of a Dataset of Person x Score x Time. While xarray may take some getting used to, it's worth it! If anything is unclear, please `post an issue on GitHub `__ or `StackOverflow `__, and we'll endeavor to respond to the specific case or improve the general docs. .. currentmodule:: xarray .. _plotting: Plotting ======== Introduction ------------ Labeled data enables expressive computations. These same labels can also be used to easily create informative plots. Xarray's plotting capabilities are centered around :py:class:`DataArray` objects. To plot :py:class:`Dataset` objects, simply access the relevant DataArrays, i.e. ``dset['var1']``. Dataset-specific plotting routines are also available (see :ref:`plot-dataset`). Here we focus mostly on arrays 2d or larger. If your data fits nicely into a pandas DataFrame, then you're better off using one of the more developed tools there. Xarray plotting functionality is a thin wrapper around the popular `matplotlib `_ library. Matplotlib syntax and function names were copied as much as possible, which makes for an easy transition between the two. Matplotlib must be installed before xarray can plot. To use xarray's plotting capabilities with time coordinates containing ``cftime.datetime`` objects, `nc-time-axis `_ v1.3.0 or later needs to be installed. For more extensive plotting applications consider the following projects: - `Seaborn `_: "provides a high-level interface for drawing attractive statistical graphics." Integrates well with pandas. - `HoloViews `_ and `GeoViews `_: "Composable, declarative data structures for building even complex visualizations easily." Includes native support for xarray objects. - `hvplot `_: ``hvplot`` makes it very easy to produce dynamic plots (backed by ``Holoviews`` or ``Geoviews``) by adding a ``hvplot`` accessor to DataArrays. - `Cartopy `_: Provides cartographic tools. Imports ~~~~~~~ .. jupyter-execute:: :hide-code: # Use defaults so we don't get gridlines in generated docs import matplotlib as mpl mpl.rcdefaults() The following imports are necessary for all of the examples. .. jupyter-execute:: import cartopy.crs as ccrs import matplotlib.pyplot as plt import numpy as np import pandas as pd import xarray as xr For these examples we'll use the North American air temperature dataset. .. jupyter-execute:: airtemps = xr.tutorial.open_dataset("air_temperature") airtemps .. jupyter-execute:: # Convert to Celsius air = airtemps.air - 273.15 # copy attributes to get nice figure labels and change Kelvin to Celsius air.attrs = airtemps.air.attrs air.attrs["units"] = "deg C" .. note:: Until :issue:`1614` is solved, you might need to copy over the metadata in ``attrs`` to get informative figure labels (as was done above). DataArrays ---------- One Dimension ~~~~~~~~~~~~~ ================ Simple Example ================ The simplest way to make a plot is to call the :py:func:`DataArray.plot()` method. ..
jupyter-execute:: air1d = air.isel(lat=10, lon=10) air1d.plot(); Xarray uses the coordinate name along with metadata ``attrs.long_name``, ``attrs.standard_name``, ``DataArray.name`` and ``attrs.units`` (if available) to label the axes. The names ``long_name``, ``standard_name`` and ``units`` are copied from the `CF-conventions spec `_. When choosing names, the order of precedence is ``long_name``, ``standard_name`` and finally ``DataArray.name``. The y-axis label in the above plot was constructed from the ``long_name`` and ``units`` attributes of ``air1d``. .. jupyter-execute:: air1d.attrs ====================== Additional Arguments ====================== Additional arguments are passed directly to the matplotlib function which does the work. For example, :py:func:`xarray.plot.line` calls matplotlib.pyplot.plot_, passing in the index and the array values as x and y, respectively. So to make a line plot with blue triangles, a matplotlib format string can be used: .. _matplotlib.pyplot.plot: https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot .. jupyter-execute:: air1d[:200].plot.line("b-^"); .. note:: Not all xarray plotting methods support passing positional arguments to the wrapped matplotlib functions, but they do all support keyword arguments. Keyword arguments work the same way, and are more explicit. .. jupyter-execute:: air1d[:200].plot.line(color="purple", marker="o"); ========================= Adding to Existing Axis ========================= To add the plot to an existing axis, pass the axis in as the keyword argument ``ax``. This works for all xarray plotting methods. In this example, ``axs`` is an array consisting of the left and right axes created by ``plt.subplots``. .. jupyter-execute:: fig, axs = plt.subplots(ncols=2) print(axs) air1d.plot(ax=axs[0]) air1d.plot.hist(ax=axs[1]); On the right is a histogram created by :py:func:`xarray.plot.hist`. .. _plotting.figsize: ============================= Controlling the figure size ============================= You can pass a ``figsize`` argument to all xarray's plotting methods to control the figure size. For convenience, xarray's plotting methods also support the ``aspect`` and ``size`` arguments which control the size of the resulting image via the formula ``figsize = (aspect * size, size)``: .. jupyter-execute:: air1d.plot(aspect=2, size=3); This feature also works with :ref:`plotting.faceting`. For facet plots, ``size`` and ``aspect`` refer to a single panel (so that ``aspect * size`` gives the width of each facet in inches), while ``figsize`` refers to the entire figure (as for matplotlib's ``figsize`` argument). .. note:: If ``figsize`` or ``size`` are used, a new figure is created, so this is mutually exclusive with the ``ax`` argument. .. note:: The convention used by xarray (``figsize = (aspect * size, size)``) is borrowed from seaborn: it is therefore `not equivalent to matplotlib's`_. .. _not equivalent to matplotlib's: https://github.com/mwaskom/seaborn/issues/746 .. _plotting.multiplelines: ========================= Determine x-axis values ========================= By default, dimension coordinates are used for the x-axis (here the time coordinates). However, you can also use non-dimension coordinates, MultiIndex levels, and dimensions without coordinates along the x-axis. To illustrate this, let's calculate a 'decimal day' (epoch) from the time and assign it as a non-dimension coordinate: ..
jupyter-execute:: decimal_day = (air1d.time - air1d.time[0]) / pd.Timedelta("1d") air1d_multi = air1d.assign_coords(decimal_day=("time", decimal_day.data)) air1d_multi To use ``'decimal_day'`` as the x coordinate, it must be explicitly specified: .. jupyter-execute:: air1d_multi.plot(x="decimal_day"); After creating a new MultiIndex named ``'date'`` from ``'time'`` and ``'decimal_day'``, it is also possible to use a MultiIndex level as the x-axis: .. jupyter-execute:: air1d_multi = air1d_multi.set_index(date=("time", "decimal_day")) air1d_multi.plot(x="decimal_day"); Finally, if a dataset does not have any coordinates, it enumerates all data points: .. jupyter-execute:: air1d_multi = air1d_multi.drop_vars(["date", "time", "decimal_day"]) air1d_multi.plot(); The same applies to 2D plots below. ==================================================== Multiple lines showing variation along a dimension ==================================================== It is possible to make line plots of two-dimensional data by calling :py:func:`xarray.plot.line` with appropriate arguments. Consider the 3D variable ``air`` defined above. We can use line plots to check the variation of air temperature at three different latitudes along a longitude line: .. jupyter-execute:: air.isel(lon=10, lat=[19, 21, 22]).plot.line(x="time"); It is required to explicitly specify either 1. ``x``: the dimension to be used for the x-axis, or 2. ``hue``: the dimension you want to represent by multiple lines. Thus, we could have made the previous plot by specifying ``hue='lat'`` instead of ``x='time'``. If required, the automatic legend can be turned off using ``add_legend=False``. Alternatively, ``hue`` can be passed directly to :py:func:`xarray.plot.line` as ``air.isel(lon=10, lat=[19,21,22]).plot.line(hue='lat')``. ======================== Dimension along y-axis ======================== It is also possible to make line plots such that the data are on the x-axis and a dimension is on the y-axis. This can be done by specifying the appropriate ``y`` keyword argument. .. jupyter-execute:: air.isel(time=10, lon=[10, 11]).plot(y="lat", hue="lon"); ============ Step plots ============ As an alternative, a step plot similar to matplotlib's ``plt.step`` can also be made using 1D data. .. jupyter-execute:: air1d[:20].plot.step(where="mid"); The argument ``where`` defines where the steps should be placed; options are ``'pre'`` (default), ``'post'``, and ``'mid'``. This is particularly handy when plotting data grouped with :py:meth:`Dataset.groupby_bins`. .. jupyter-execute:: air_grp = air.mean(["time", "lon"]).groupby_bins("lat", [0, 23.5, 66.5, 90]) air_mean = air_grp.mean() air_std = air_grp.std() air_mean.plot.step() (air_mean + air_std).plot.step(ls=":") (air_mean - air_std).plot.step(ls=":") plt.ylim(-20, 30) plt.title("Zonal mean temperature"); In this case, the actual boundaries of the bins are used and the ``where`` argument is ignored. Other axes kwargs ~~~~~~~~~~~~~~~~~ The keyword arguments ``xincrease`` and ``yincrease`` let you control the axes direction. .. jupyter-execute:: air.isel(time=10, lon=[10, 11]).plot.line( y="lat", hue="lon", xincrease=False, yincrease=False ); In addition, one can use ``xscale, yscale`` to set axes scaling; ``xticks, yticks`` to set axes ticks; and ``xlim, ylim`` to set axes limits. These accept the same values as the matplotlib methods ``ax.set_(x,y)scale()``, ``ax.set_(x,y)ticks()``, ``ax.set_(x,y)lim()``, respectively.
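For example, a minimal sketch (reusing ``air1d`` from above; the particular limits and ticks are arbitrary) that sets explicit y-axis limits and ticks through these keyword arguments:

.. jupyter-execute::

    # ylim/yticks are forwarded to ax.set_ylim()/ax.set_yticks()
    air1d[:200].plot(ylim=(-10, 30), yticks=range(-10, 31, 10));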
Two Dimensions ~~~~~~~~~~~~~~ ================ Simple Example ================ The method :py:meth:`DataArray.plot` calls :py:func:`xarray.plot.pcolormesh` by default when the data is two-dimensional. .. jupyter-execute:: air2d = air.isel(time=500) air2d.plot(); All 2d plots in xarray allow the use of the keyword arguments ``yincrease`` and ``xincrease``. .. jupyter-execute:: air2d.plot(yincrease=False); .. note:: We use :py:func:`xarray.plot.pcolormesh` as the default two-dimensional plot method because it is more flexible than :py:func:`xarray.plot.imshow`. However, for large arrays, ``imshow`` can be much faster than ``pcolormesh``. If speed is important to you and you are plotting a regular mesh, consider using ``imshow``. ================ Missing Values ================ Xarray plots data with :ref:`missing_values`. .. jupyter-execute:: bad_air2d = air2d.copy() bad_air2d[dict(lat=slice(0, 10), lon=slice(0, 25))] = np.nan bad_air2d.plot(); ======================== Nonuniform Coordinates ======================== It's not necessary for the coordinates to be evenly spaced. Both :py:func:`xarray.plot.pcolormesh` (default) and :py:func:`xarray.plot.contourf` can produce plots with nonuniform coordinates. .. jupyter-execute:: b = air2d.copy() # Apply a nonlinear transformation to one of the coords b.coords["lat"] = np.log(b.coords["lat"]) b.plot(); ==================== Other types of plot ==================== There are several other options for plotting 2D data. Contour plot using :py:meth:`DataArray.plot.contour()`: .. jupyter-execute:: air2d.plot.contour(); Filled contour plot using :py:meth:`DataArray.plot.contourf()`: .. jupyter-execute:: air2d.plot.contourf(); Surface plot using :py:meth:`DataArray.plot.surface()`: .. jupyter-execute:: # transpose just to make the example look a bit nicer air2d.T.plot.surface(); ==================== Calling Matplotlib ==================== Since this is a thin wrapper around matplotlib, all the functionality of matplotlib is available. .. jupyter-execute:: air2d.plot(cmap=plt.cm.Blues) plt.title("These colors prove North America\nhas fallen in the ocean") plt.ylabel("latitude") plt.xlabel("longitude"); .. note:: Xarray methods update label information and generally play around with the axes. So any kind of updates to the plot should be done *after* the call to xarray's plot function. In the example below, ``plt.xlabel`` effectively does nothing, since ``air2d.plot()`` updates the xlabel. .. jupyter-execute:: plt.xlabel("Never gonna see this.") air2d.plot(); =========== Colormaps =========== Xarray borrows logic from Seaborn to infer what kind of color map to use. For example, consider the original data in Kelvins rather than Celsius: .. jupyter-execute:: airtemps.air.isel(time=0).plot(); The Celsius data contain 0, so a diverging color map was used. The Kelvins do not have 0, so the default color map was used. .. _robust-plotting: ======== Robust ======== Outliers often have an extreme effect on the output of the plot. Here we add two bad data points. This affects the color scale, washing out the plot. .. jupyter-execute:: air_outliers = airtemps.air.isel(time=0).copy() air_outliers[0, 0] = 100 air_outliers[-1, -1] = 400 air_outliers.plot(); This plot shows that we have outliers. The easy way to visualize the data without the outliers is to pass the parameter ``robust=True``. This will use the 2nd and 98th percentiles of the data to compute the color limits. ..
jupyter-execute:: air_outliers.plot(robust=True); Observe that the ranges of the color bar have changed. The arrows on the color bar indicate that the colors include data points outside the bounds. ==================== Discrete Colormaps ==================== It is often useful, when visualizing 2d data, to use a discrete colormap, rather than the default continuous colormaps that matplotlib uses. The ``levels`` keyword argument can be used to generate plots with discrete colormaps. For example, to make a plot with 8 discrete color intervals: .. jupyter-execute:: air2d.plot(levels=8); It is also possible to use a list of levels to specify the boundaries of the discrete colormap: .. jupyter-execute:: air2d.plot(levels=[0, 12, 18, 30]); You can also specify a list of discrete colors through the ``colors`` argument: .. jupyter-execute:: flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"] air2d.plot(levels=[0, 12, 18, 30], colors=flatui); Finally, if you have `Seaborn `_ installed, you can also specify a seaborn color palette to the ``cmap`` argument. Note that ``levels`` *must* be specified with seaborn color palettes if using ``imshow`` or ``pcolormesh`` (but not with ``contour`` or ``contourf``, since levels are chosen automatically). .. jupyter-execute:: air2d.plot(levels=10, cmap="husl"); .. _plotting.faceting: Faceting ~~~~~~~~ Faceting here refers to splitting an array along one or two dimensions and plotting each group. Xarray's basic plotting is useful for plotting two dimensional arrays. What about three or four dimensional arrays? That's where facets become helpful. The general approach to plotting here is called “small multiples”, where the same kind of plot is repeated multiple times, and the specific use of small multiples to display the same relationship conditioned on one or more other variables is often called a “trellis plot”. Consider the temperature data set. There are 4 observations per day for two years, which makes for 2920 values along the time dimension. One way to visualize this data is to make a separate plot for each time period. The faceted dimension should not have too many values; faceting on the time dimension will produce 2920 plots. That's too many to be helpful. To handle this situation, try performing an operation that reduces the size of the data in some way. For example, we could compute the average air temperature for each month and reduce the size of this dimension from 2920 to 12. A simpler way is to just take a slice on that dimension. So let's use a slice to pick 6 times throughout the first year. .. jupyter-execute:: t = air.isel(time=slice(0, 365 * 4, 250)) t.coords ================ Simple Example ================ The easiest way to create faceted plots is to pass in ``row`` or ``col`` arguments to the xarray plotting methods/functions. This returns a :py:class:`xarray.plot.FacetGrid` object. .. jupyter-execute:: g_simple = t.plot(x="lon", y="lat", col="time", col_wrap=3); Faceting also works for line plots. .. jupyter-execute:: g_simple_line = t.isel(lat=slice(0, None, 4)).plot( x="lon", hue="lat", col="time", col_wrap=3 ); =============== 4 dimensional =============== For 4 dimensional arrays we can use the rows and columns of the grids. Here we create a 4 dimensional array by taking the original data and adding a fixed amount. Now we can see how the temperature maps would compare if one were much hotter. ..
jupyter-execute:: t2 = t.isel(time=slice(0, 2)) t4d = xr.concat([t2, t2 + 40], pd.Index(["normal", "hot"], name="fourth_dim")) # This is a 4d array t4d.coords t4d.plot(x="lon", y="lat", col="time", row="fourth_dim"); ================ Other features ================ Faceted plotting supports other arguments common to xarray 2d plots. .. jupyter-execute:: hasoutliers = t.isel(time=slice(0, 5)).copy() hasoutliers[0, 0, 0] = -100 hasoutliers[-1, -1, -1] = 400 g = hasoutliers.plot.pcolormesh( x="lon", y="lat", col="time", col_wrap=3, robust=True, cmap="viridis", cbar_kwargs={"label": "this has outliers"}, ) =================== FacetGrid Objects =================== The object returned, ``g`` in the above examples, is a :py:class:`~xarray.plot.FacetGrid` object that links a :py:class:`DataArray` to a matplotlib figure with a particular structure. This object can be used to control the behavior of the multiple plots. It borrows an API and code from `Seaborn's FacetGrid `_. The structure is contained within the ``axs`` and ``name_dicts`` attributes, both 2d NumPy object arrays. .. jupyter-execute:: g.axs .. jupyter-execute:: g.name_dicts It's possible to select the :py:class:`xarray.DataArray` or :py:class:`xarray.Dataset` corresponding to the FacetGrid through the ``name_dicts``. .. jupyter-execute:: g.data.loc[g.name_dicts[0, 0]] Here is an example of using the lower-level API and then modifying the axes after they have been plotted. .. jupyter-execute:: g = t.plot.imshow(x="lon", y="lat", col="time", col_wrap=3, robust=True) for i, ax in enumerate(g.axs.flat): ax.set_title("Air Temperature %d" % i) bottomright = g.axs[-1, -1] bottomright.annotate("bottom right", (240, 40)); :py:class:`~xarray.plot.FacetGrid` objects have methods that let you customize the automatically generated axis labels, axis ticks and plot titles. See :py:meth:`~xarray.plot.FacetGrid.set_titles`, :py:meth:`~xarray.plot.FacetGrid.set_xlabels`, :py:meth:`~xarray.plot.FacetGrid.set_ylabels` and :py:meth:`~xarray.plot.FacetGrid.set_ticks` for more information. Plotting functions can be applied to each subset of the data by calling :py:meth:`~xarray.plot.FacetGrid.map_dataarray` or to each subplot by calling :py:meth:`~xarray.plot.FacetGrid.map`. .. TODO: add an example of using the ``map`` method to plot dataset variables (e.g., with ``plt.quiver``). .. _plot-dataset: Datasets -------- Xarray has limited support for plotting Dataset variables against each other. Consider this dataset: .. jupyter-execute:: ds = xr.tutorial.scatter_example_dataset(seed=42) ds Scatter ~~~~~~~ Let's plot the ``A`` DataArray as a function of the ``y`` coord: .. jupyter-execute:: with xr.set_options(display_expand_data=False): display(ds.A) .. jupyter-execute:: ds.A.plot.scatter(x="y"); The same plot can be displayed using the dataset: .. jupyter-execute:: ds.plot.scatter(x="y", y="A"); Now suppose we want to scatter the ``A`` DataArray against the ``B`` DataArray: .. jupyter-execute:: ds.plot.scatter(x="A", y="B"); The ``hue`` kwarg lets you vary the color by variable value: .. jupyter-execute:: ds.plot.scatter(x="A", y="B", hue="w"); You can force a legend instead of a colorbar by setting ``add_legend=True, add_colorbar=False``. .. jupyter-execute:: ds.plot.scatter(x="A", y="B", hue="w", add_legend=True, add_colorbar=False); .. jupyter-execute:: ds.plot.scatter(x="A", y="B", hue="w", add_legend=False, add_colorbar=True); The ``markersize`` kwarg lets you vary the point's size by variable value.
You can additionally pass ``size_norm`` to control how the variable's values are mapped to point sizes. .. jupyter-execute:: ds.plot.scatter(x="A", y="B", hue="y", markersize="z"); The ``z`` kwarg lets you plot the data along the z-axis as well. .. jupyter-execute:: ds.plot.scatter(x="A", y="B", z="z", hue="y", markersize="x"); Faceting is also possible: .. jupyter-execute:: ds.plot.scatter(x="A", y="B", hue="y", markersize="x", row="x", col="w"); And adding the z-axis: .. jupyter-execute:: ds.plot.scatter(x="A", y="B", z="z", hue="y", markersize="x", row="x", col="w"); For more advanced scatter plots, we recommend converting the relevant data variables to a pandas DataFrame and using the extensive plotting capabilities of ``seaborn``. Quiver ~~~~~~ Visualizing vector fields is supported with quiver plots: .. jupyter-execute:: ds.isel(w=1, z=1).plot.quiver(x="x", y="y", u="A", v="B"); where ``u`` and ``v`` denote the x and y direction components of the arrow vectors. Again, faceting is also possible: .. jupyter-execute:: ds.plot.quiver(x="x", y="y", u="A", v="B", col="w", row="z", scale=4); ``scale`` is required for faceted quiver plots. The scale determines the number of data units per arrow length unit, i.e. a smaller scale parameter makes the arrow longer. Streamplot ~~~~~~~~~~ Visualizing vector fields is also supported with streamline plots: .. jupyter-execute:: ds.isel(w=1, z=1).plot.streamplot(x="x", y="y", u="A", v="B"); where ``u`` and ``v`` denote the x and y direction components of the vectors tangent to the streamlines. Again, faceting is also possible: .. jupyter-execute:: ds.plot.streamplot(x="x", y="y", u="A", v="B", col="w", row="z"); .. _plot-maps: Maps ---- To follow this section, you'll need to have Cartopy installed and working. This script will plot the air temperature on a map. .. jupyter-execute:: :stderr: air = xr.tutorial.open_dataset("air_temperature").air p = air.isel(time=0).plot( subplot_kws=dict(projection=ccrs.Orthographic(-80, 35), facecolor="gray"), transform=ccrs.PlateCarree(), ) p.axes.set_global() p.axes.coastlines(); When faceting on maps, the projection can be transferred to the ``plot`` function using the ``subplot_kws`` keyword. The axes for the subplots created by faceting are accessible in the object returned by ``plot``: .. jupyter-execute:: p = air.isel(time=[0, 4]).plot( transform=ccrs.PlateCarree(), col="time", subplot_kws={"projection": ccrs.Orthographic(-80, 35)}, ) for ax in p.axs.flat: ax.coastlines() ax.gridlines() Details ------- Ways to Use ~~~~~~~~~~~ There are three ways to use the xarray plotting functionality: 1. Use ``plot`` as a convenience method for a DataArray. 2. Access a specific plotting method from the ``plot`` attribute of a DataArray. 3. Use the xarray plot submodule directly. These are provided for user convenience; they all call the same code. .. jupyter-execute:: da = xr.DataArray(range(5)) fig, axs = plt.subplots(ncols=2, nrows=2) da.plot(ax=axs[0, 0]) da.plot.line(ax=axs[0, 1]) xr.plot.plot(da, ax=axs[1, 0]) xr.plot.line(da, ax=axs[1, 1]); Here the output is the same. Since the data is 1-dimensional, the line plot was used. The convenience method :py:meth:`xarray.DataArray.plot` dispatches to an appropriate plotting function based on the dimensions of the ``DataArray`` and whether the coordinates are sorted and uniformly spaced.
This table describes what gets plotted:

=============== ===========================
Dimensions      Plotting function
--------------- ---------------------------
1               :py:func:`xarray.plot.line`
2               :py:func:`xarray.plot.pcolormesh`
Anything else   :py:func:`xarray.plot.hist`
=============== ===========================

Coordinates ~~~~~~~~~~~ If you'd like to find out what's really going on in the coordinate system, read on. .. jupyter-execute:: a0 = xr.DataArray(np.zeros((4, 3, 2)), dims=("y", "x", "z"), name="temperature") a0[0, 0, 0] = 1 a = a0.isel(z=0) a The plot will produce an image corresponding to the values of the array. Hence the top left pixel will be a different color than the others. Before reading on, you may want to look at the coordinates and think carefully about what the limits, labels, and orientation for each of the axes should be. .. jupyter-execute:: a.plot(); It may seem strange that the values on the y axis are decreasing with -0.5 on the top. This is because the pixels are centered over their coordinates, and the axis labels and ranges correspond to the values of the coordinates. Multidimensional coordinates ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ See also: :ref:`/examples/multidimensional-coords.ipynb`. You can plot irregular grids defined by multidimensional coordinates with xarray, but you'll have to tell the plot function to use these coordinates instead of the default ones: .. jupyter-execute:: lon, lat = np.meshgrid(np.linspace(-20, 20, 5), np.linspace(0, 30, 4)) lon += lat / 10 lat += lon / 10 da = xr.DataArray( np.arange(20).reshape(4, 5), dims=["y", "x"], coords={"lat": (("y", "x"), lat), "lon": (("y", "x"), lon)}, ) da.plot.pcolormesh(x="lon", y="lat"); Note that in this case, xarray still follows the pixel-centered convention. This might be undesirable in some cases, for example when your data is defined on a polar projection (:issue:`781`). This is why the default is to not follow this convention when plotting on a map: .. jupyter-execute:: :stderr: ax = plt.subplot(projection=ccrs.PlateCarree()) da.plot.pcolormesh(x="lon", y="lat", ax=ax) ax.scatter(lon, lat, transform=ccrs.PlateCarree()) ax.coastlines() ax.gridlines(draw_labels=True); You can, however, decide to infer the cell boundaries and use the ``infer_intervals`` keyword: .. jupyter-execute:: ax = plt.subplot(projection=ccrs.PlateCarree()) da.plot.pcolormesh(x="lon", y="lat", ax=ax, infer_intervals=True) ax.scatter(lon, lat, transform=ccrs.PlateCarree()) ax.coastlines() ax.gridlines(draw_labels=True); .. note:: The data model of xarray does not support datasets with `cell boundaries`_ yet. If you want to use these coordinates, you'll have to make the plots outside the xarray framework. .. _cell boundaries: https://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#cell-boundaries One can also make line plots with multidimensional coordinates. In this case, ``hue`` must be a dimension name, not a coordinate name. .. jupyter-execute:: f, ax = plt.subplots(2, 1) da.plot.line(x="lon", hue="y", ax=ax[0]) da.plot.line(x="lon", hue="x", ax=ax[1]); .. _reshape: ############################### Reshaping and reorganizing data ############################### Reshaping and reorganizing data refers to the process of changing the structure or organization of data by modifying dimensions, array shapes, order of values, or indexes. Xarray provides several methods to accomplish these tasks.
These methods are particularly useful for reshaping xarray objects for use in machine learning packages, such as scikit-learn, that usually require two-dimensional numpy arrays as inputs. Reshaping can also be required before passing data to external visualization tools; for example, geospatial tools might expect input organized into a particular format corresponding to stacks of satellite images. Importing the library --------------------- .. jupyter-execute:: :hide-code: import numpy as np import pandas as pd import xarray as xr np.random.seed(123456) # Use defaults so we don't get gridlines in generated docs import matplotlib as mpl mpl.rcdefaults() Reordering dimensions --------------------- To reorder dimensions on a :py:class:`~xarray.DataArray` or across all variables on a :py:class:`~xarray.Dataset`, use :py:meth:`~xarray.DataArray.transpose`. An ellipsis (`...`) can be used to represent all other dimensions: .. jupyter-execute:: ds = xr.Dataset({"foo": (("x", "y", "z"), [[[42]]]), "bar": (("y", "z"), [[24]])}) ds.transpose("y", "z", "x") # equivalent to ds.transpose(..., "x") .. jupyter-execute:: ds.transpose() # reverses all dimensions Expand and squeeze dimensions ----------------------------- To expand a :py:class:`~xarray.DataArray` or all variables on a :py:class:`~xarray.Dataset` along a new dimension, use :py:meth:`~xarray.DataArray.expand_dims` .. jupyter-execute:: expanded = ds.expand_dims("w") expanded This method attaches a new dimension with size 1 to all data variables. To remove such a size-1 dimension from the :py:class:`~xarray.DataArray` or :py:class:`~xarray.Dataset`, use :py:meth:`~xarray.DataArray.squeeze` .. jupyter-execute:: expanded.squeeze("w") Converting between datasets and arrays -------------------------------------- To convert from a Dataset to a DataArray, use :py:meth:`~xarray.Dataset.to_dataarray`: .. jupyter-execute:: arr = ds.to_dataarray() arr This method broadcasts all data variables in the dataset against each other, then concatenates them along a new dimension into a new array while preserving coordinates. To convert back from a DataArray to a Dataset, use :py:meth:`~xarray.DataArray.to_dataset`: .. jupyter-execute:: arr.to_dataset(dim="variable") The broadcasting behavior of ``to_dataarray`` means that the resulting array includes the union of data variable dimensions: .. jupyter-execute:: ds2 = xr.Dataset({"a": 0, "b": ("x", [3, 4, 5])}) # the input dataset has 4 elements ds2 .. jupyter-execute:: # the resulting array has 6 elements ds2.to_dataarray() Otherwise, the result could not be represented as an orthogonal array. If you use ``to_dataset`` without supplying the ``dim`` argument, the DataArray will be converted into a Dataset of one variable: .. jupyter-execute:: arr.to_dataset(name="combined") .. _reshape.stack: Stack and unstack ----------------- As part of xarray's nascent support for :py:class:`pandas.MultiIndex`, we have implemented the :py:meth:`~xarray.DataArray.stack` and :py:meth:`~xarray.DataArray.unstack` methods, for combining or splitting dimensions: .. jupyter-execute:: array = xr.DataArray( np.random.randn(2, 3), coords=[("x", ["a", "b"]), ("y", [0, 1, 2])] ) stacked = array.stack(z=("x", "y")) stacked .. jupyter-execute:: stacked.unstack("z") As elsewhere in xarray, an ellipsis (`...`) can be used to represent all unlisted dimensions: ..
jupyter-execute:: stacked = array.stack(z=[..., "x"]) stacked These methods are modeled on the :py:class:`pandas.DataFrame` methods of the same name, although in xarray they always create new dimensions rather than adding to the existing index or columns. Like :py:meth:`DataFrame.unstack`, xarray's ``unstack`` always succeeds, even if the multi-index being unstacked does not contain all possible levels. Missing levels are filled in with ``NaN`` in the resulting object: .. jupyter-execute:: stacked2 = stacked[::2] stacked2 .. jupyter-execute:: stacked2.unstack("z") However, xarray's ``stack`` has an important difference from pandas: unlike pandas, it does not automatically drop missing values. Compare: .. jupyter-execute:: array = xr.DataArray([[np.nan, 1], [2, 3]], dims=["x", "y"]) array.stack(z=("x", "y")) .. jupyter-execute:: array.to_pandas().stack() We departed from pandas's behavior here because predictable shapes for new array dimensions are necessary for :ref:`dask`. .. _reshape.stacking_different: Stacking different variables together ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ These stacking and unstacking operations are particularly useful for reshaping xarray objects for use in machine learning packages, such as `scikit-learn `_, that usually require two-dimensional numpy arrays as inputs. For datasets with only one variable, we only need ``stack`` and ``unstack``, but combining multiple variables in a :py:class:`xarray.Dataset` is more complicated. If the variables in the dataset have matching numbers of dimensions, we can call :py:meth:`~xarray.Dataset.to_dataarray` and then stack along the new coordinate. But :py:meth:`~xarray.Dataset.to_dataarray` will broadcast the dataarrays together, which will effectively tile the lower dimensional variable along the missing dimensions. The method :py:meth:`xarray.Dataset.to_stacked_array` allows combining variables of differing dimensions without this wasteful copying, while :py:meth:`xarray.DataArray.to_unstacked_dataset` reverses this operation. Just as with :py:meth:`xarray.Dataset.stack`, the stacked coordinate is represented by a :py:class:`pandas.MultiIndex` object. These methods are used like this: .. jupyter-execute:: data = xr.Dataset( data_vars={"a": (("x", "y"), [[0, 1, 2], [3, 4, 5]]), "b": ("x", [6, 7])}, coords={"y": ["u", "v", "w"]}, ) data .. jupyter-execute:: stacked = data.to_stacked_array("z", sample_dims=["x"]) stacked .. jupyter-execute:: unstacked = stacked.to_unstacked_dataset("z") unstacked In this example, ``stacked`` is a two dimensional array that we can easily pass to scikit-learn or another generic numerical method. .. note:: Unlike with ``stack``, in ``to_stacked_array``, the user specifies the dimensions they **do not** want stacked. For a machine learning task, these unstacked dimensions can be interpreted as the dimensions over which samples are drawn, whereas the stacked coordinates are the features. Naturally, all variables should possess these sampling dimensions. .. _reshape.set_index: Set and reset index ------------------- Complementary to stack / unstack, xarray's ``.set_index``, ``.reset_index`` and ``.reorder_levels`` allow easy manipulation of ``DataArray`` or ``Dataset`` multi-indexes without modifying the data and its dimensions. You can create a multi-index from several 1-dimensional variables and/or coordinates using :py:meth:`~xarray.DataArray.set_index`: ..
jupyter-execute:: da = xr.DataArray( np.random.rand(4), coords={ "band": ("x", ["a", "a", "b", "b"]), "wavenumber": ("x", np.linspace(200, 400, 4)), }, dims="x", ) da .. jupyter-execute:: mda = da.set_index(x=["band", "wavenumber"]) mda These coordinates can now be used for indexing, e.g., .. jupyter-execute:: mda.sel(band="a") Conversely, you can use :py:meth:`~xarray.DataArray.reset_index` to extract multi-index levels as coordinates (this is mainly useful for serialization): .. jupyter-execute:: mda.reset_index("x") :py:meth:`~xarray.DataArray.reorder_levels` allows changing the order of multi-index levels: .. jupyter-execute:: mda.reorder_levels(x=["wavenumber", "band"]) As of xarray v0.9, coordinate labels for each dimension are optional. You can also use ``.set_index`` / ``.reset_index`` to add / remove labels for one or several dimensions: .. jupyter-execute:: array = xr.DataArray([1, 2, 3], dims="x") array .. jupyter-execute:: array["c"] = ("x", ["a", "b", "c"]) array.set_index(x="c") .. jupyter-execute:: array = array.set_index(x="c") array = array.reset_index("x", drop=True) .. _reshape.shift_and_roll: Shift and roll -------------- To adjust coordinate labels, you can use the :py:meth:`~xarray.Dataset.shift` and :py:meth:`~xarray.Dataset.roll` methods: .. jupyter-execute:: array = xr.DataArray([1, 2, 3, 4], dims="x") array.shift(x=2) .. jupyter-execute:: array.roll(x=2, roll_coords=True) .. _reshape.sort: Sort ---- One may sort a DataArray/Dataset via :py:meth:`~xarray.DataArray.sortby` and :py:meth:`~xarray.Dataset.sortby`. The input can be an individual 1D ``DataArray`` or a list of them: .. jupyter-execute:: ds = xr.Dataset( { "A": (("x", "y"), [[1, 2], [3, 4]]), "B": (("x", "y"), [[5, 6], [7, 8]]), }, coords={"x": ["b", "a"], "y": [1, 0]}, ) dax = xr.DataArray([100, 99], [("x", [0, 1])]) day = xr.DataArray([90, 80], [("y", [0, 1])]) ds.sortby([day, dax]) As a shortcut, you can refer to existing coordinates by name: .. jupyter-execute:: ds.sortby("x") .. jupyter-execute:: ds.sortby(["y", "x"]) .. jupyter-execute:: ds.sortby(["y", "x"], ascending=False) .. _reshape.coarsen: Reshaping via coarsen --------------------- Whilst :py:class:`~xarray.DataArray.coarsen` is normally used for reducing your data's resolution by applying a reduction function (see the :ref:`page on computation`), it can also be used to reorganise your data without applying a computation via :py:meth:`~xarray.computation.rolling.DataArrayCoarsen.construct`. Taking our example tutorial air temperature dataset over the Northern US, .. jupyter-execute:: air = xr.tutorial.open_dataset("air_temperature")["air"] air.isel(time=0).plot(x="lon", y="lat"); we can split this up into sub-regions of size ``(9, 18)`` points using :py:meth:`~xarray.computation.rolling.DataArrayCoarsen.construct`: .. jupyter-execute:: regions = air.coarsen(lat=9, lon=18, boundary="pad").construct( lon=("x_coarse", "x_fine"), lat=("y_coarse", "y_fine") ) with xr.set_options(display_expand_data=False): regions 9 new regions have been created, each of size 9 by 18 points. The ``boundary="pad"`` kwarg ensured that all regions are the same size, even though the data does not evenly divide into these sizes. By plotting these 9 regions together via :ref:`faceting`, we can see how they relate to the original data. .. jupyter-execute:: regions.isel(time=0).plot( x="x_fine", y="y_fine", col="x_coarse", row="y_coarse", yincrease=False ); We are now free to easily apply any custom computation to each coarsened region of our new dataarray.
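For example, a minimal sketch that averages over each region, reducing over the fine dimensions while keeping the coarse ones:

.. jupyter-execute::

    # one mean value per coarsened region (and per time step)
    region_means = regions.mean(["y_fine", "x_fine"])
    region_means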
As in the sketch above, this involves specifying that applied functions should act over the ``"x_fine"`` and ``"y_fine"`` dimensions, but broadcast over the ``"x_coarse"`` and ``"y_coarse"`` dimensions. .. currentmodule:: xarray .. _terminology: Terminology =========== *Xarray terminology differs slightly from CF, mathematical conventions, and pandas; so we've put together a glossary of its terms. Here,* ``arr`` *refers to an xarray* :py:class:`DataArray` *in the examples. For more complete examples, please consult the relevant documentation.* .. jupyter-execute:: :hide-code: import numpy as np import xarray as xr .. glossary:: DataArray A multi-dimensional array with labeled or named dimensions. ``DataArray`` objects add metadata such as dimension names, coordinates, and attributes (defined below) to underlying "unlabeled" data structures such as numpy and Dask arrays. If its optional ``name`` property is set, it is a *named DataArray*. Dataset A dict-like collection of ``DataArray`` objects with aligned dimensions. Thus, most operations that can be performed on the dimensions of a single ``DataArray`` can be performed on a dataset. Datasets have data variables (see **Variable** below), dimensions, coordinates, and attributes. Variable A `NetCDF-like variable `_ consisting of dimensions, data, and attributes which describe a single array. The main functional difference between variables and numpy arrays is that numerical operations on variables implement array broadcasting by dimension name. Each ``DataArray`` has an underlying variable that can be accessed via ``arr.variable``. However, a variable is not fully described outside of either a ``Dataset`` or a ``DataArray``. .. note:: The :py:class:`Variable` class is a low-level interface and can typically be ignored. However, the word "variable" appears often enough in the code and documentation that it is useful to understand. Dimension In mathematics, the *dimension* of data is loosely the number of degrees of freedom for it. A *dimension axis* is a set of all points in which all but one of these degrees of freedom is fixed. We can think of each dimension axis as having a name, for example the "x dimension". In xarray, a ``DataArray`` object's *dimensions* are its named dimension axes ``da.dims``, and the name of the ``i``-th dimension is ``da.dims[i]``. If an array is created without specifying dimension names, the default dimension names will be ``dim_0``, ``dim_1``, and so forth. Coordinate An array that labels a dimension or set of dimensions of another ``DataArray``. In the usual one-dimensional case, the coordinate array's values can loosely be thought of as tick labels along a dimension. We distinguish :term:`Dimension coordinate` vs. :term:`Non-dimension coordinate` and :term:`Indexed coordinate` vs. :term:`Non-indexed coordinate`. A coordinate named ``x`` can be retrieved from ``arr.coords["x"]``. A ``DataArray`` can have more coordinates than dimensions because a single dimension can be labeled by multiple coordinate arrays. However, only one coordinate array can be assigned as a particular dimension's dimension coordinate array. Dimension coordinate A one-dimensional coordinate array assigned to ``arr`` with both a name and dimension name in ``arr.dims``. Usually (but not always), a dimension coordinate is also an :term:`Indexed coordinate` so that it can be used for label-based indexing and alignment, like the index found on a :py:class:`pandas.DataFrame` or :py:class:`pandas.Series`.
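For example, a minimal sketch showing that a dimension coordinate supports label-based selection:

.. jupyter-execute::

    arr = xr.DataArray([1, 2, 3], dims="x", coords={"x": [10, 20, 30]})
    # "x" is a dimension coordinate, so we can select by label
    arr.sel(x=20)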
    Non-dimension coordinate
        A coordinate array assigned to ``arr`` with a name in ``arr.coords`` but *not* in ``arr.dims``. These coordinate arrays can be one-dimensional or multidimensional, and they are useful for auxiliary labeling. As an example, multidimensional coordinates are often used in geoscience datasets when :doc:`the data's physical coordinates (such as latitude and longitude) differ from their logical coordinates <../examples/multidimensional-coords>`. Printing ``arr.coords`` will print all of ``arr``'s coordinate names, with the corresponding dimension(s) in parentheses. For example, ``coord_name (dim_name) 1 2 3 ...``.

    Indexed coordinate
        A coordinate which has an associated :term:`Index`. Generally this means that the coordinate labels can be used for indexing (selection) and/or alignment. An indexed coordinate may have one or more arbitrary dimensions, although in most cases it is also a :term:`Dimension coordinate`. It may or may not be grouped with other indexed coordinates depending on whether they share the same index. Indexed coordinates are marked by an asterisk ``*`` when printing a ``DataArray`` or ``Dataset``.

    Non-indexed coordinate
        A coordinate which has no associated :term:`Index`. It may still represent fixed labels along one or more dimensions but it cannot be used for label-based indexing and alignment.

    Index
        An *index* is a data structure optimized for efficient data selection and alignment within a discrete or continuous space that is defined by coordinate labels (unless it is a functional index). By default, Xarray creates a :py:class:`~xarray.indexes.PandasIndex` object (i.e., a :py:class:`pandas.Index` wrapper) for each :term:`Dimension coordinate`. For more advanced use cases (e.g., staggered or irregular grids, geospatial indexes), Xarray also accepts any instance of a specialized :py:class:`~xarray.indexes.Index` subclass that is associated to one or more arbitrary coordinates. The index associated with the coordinate ``x`` can be retrieved by ``arr.xindexes["x"]`` (or ``arr.indexes["x"]`` if the index is convertible to a :py:class:`pandas.Index` object). If two coordinates ``x`` and ``y`` share the same index, ``arr.xindexes["x"]`` and ``arr.xindexes["y"]`` both return the same :py:class:`~xarray.indexes.Index` object.

    name
        The names of dimensions, coordinates, DataArray objects and data variables can be anything as long as they are :term:`hashable`. However, it is preferred to use :py:class:`str` typed names.

    scalar
        By definition, a scalar is not an :term:`array` and when converted to one, it has 0 dimensions. That means that, e.g., :py:class:`int`, :py:class:`float`, and :py:class:`str` objects are "scalar" while :py:class:`list` or :py:class:`tuple` are not.

    duck array
        `Duck arrays`__ are array implementations that behave like numpy arrays. They have to define the ``shape``, ``dtype`` and ``ndim`` properties. For integration with ``xarray``, the ``__array__``, ``__array_ufunc__`` and ``__array_function__`` protocols are also required.

        __ https://numpy.org/neps/nep-0022-ndarray-duck-typing-overview.html

    Aligning
        Aligning refers to the process of ensuring that two or more DataArrays or Datasets have the same dimensions and coordinates, so that they can be combined or compared properly.
        .. jupyter-execute::

            x = xr.DataArray(
                [[25, 35], [10, 24]],
                dims=("lat", "lon"),
                coords={"lat": [35.0, 40.0], "lon": [100.0, 120.0]},
            )
            y = xr.DataArray(
                [[20, 5], [7, 13]],
                dims=("lat", "lon"),
                coords={"lat": [35.0, 42.0], "lon": [100.0, 120.0]},
            )
            a, b = xr.align(x, y)

            # By default, an "inner join" is performed
            # so "a" is a copy of "x" where coordinates match "y"
            a

    Broadcasting
        A technique that allows operations to be performed on arrays with different shapes and dimensions. When performing operations on arrays with different shapes and dimensions, xarray will automatically attempt to broadcast the arrays to a common shape before the operation is applied.

        .. jupyter-execute::

            # 'a' has shape (3,) and 'b' has shape (4,)
            a = xr.DataArray(np.array([1, 2, 3]), dims=["x"])
            b = xr.DataArray(np.array([4, 5, 6, 7]), dims=["y"])

            # their sum is a 2D array with shape (3, 4)
            a + b

    Merging
        Merging is used to combine two or more Datasets or DataArrays that have different variables or coordinates along the same dimensions. When merging, xarray aligns the variables and coordinates of the different datasets along the specified dimensions and creates a new ``Dataset`` containing all the variables and coordinates.

        .. jupyter-execute::

            # create two 1D arrays with names
            arr1 = xr.DataArray(
                [1, 2, 3], dims=["x"], coords={"x": [10, 20, 30]}, name="arr1"
            )
            arr2 = xr.DataArray(
                [4, 5, 6], dims=["x"], coords={"x": [20, 30, 40]}, name="arr2"
            )

            # merge the two arrays into a new dataset
            merged_ds = xr.Dataset({"arr1": arr1, "arr2": arr2})
            merged_ds

    Concatenating
        Concatenating is used to combine two or more Datasets or DataArrays along a dimension. When concatenating, xarray arranges the datasets or dataarrays along a new dimension, and the resulting ``Dataset`` or ``DataArray`` will have the same variables and coordinates along the other dimensions.

        .. jupyter-execute::

            a = xr.DataArray([[1, 2], [3, 4]], dims=("x", "y"))
            b = xr.DataArray([[5, 6], [7, 8]], dims=("x", "y"))
            c = xr.concat([a, b], dim="c")
            c

    Combining
        Combining is the process of arranging two or more DataArrays or Datasets into a single ``DataArray`` or ``Dataset`` using some combination of merging and concatenation operations.

        .. jupyter-execute::

            ds1 = xr.Dataset(
                {"data": xr.DataArray([[1, 2], [3, 4]], dims=("x", "y"))},
                coords={"x": [1, 2], "y": [3, 4]},
            )
            ds2 = xr.Dataset(
                {"data": xr.DataArray([[5, 6], [7, 8]], dims=("x", "y"))},
                coords={"x": [2, 3], "y": [4, 5]},
            )

            # combine the datasets
            combined_ds = xr.combine_by_coords([ds1, ds2])
            combined_ds

    lazy
        Lazily-evaluated operations do not load data into memory until necessary. Instead of doing calculations right away, xarray lets you plan what calculations you want to do, like finding the average temperature in a dataset. This planning is called "lazy evaluation." Later, when you're ready to see the final result, you tell xarray, "Okay, go ahead and do those calculations now!" That's when xarray starts working through the steps you planned and gives you the answer you wanted. This lazy approach helps save time and memory because xarray only does the work when you actually need the results.

    labeled
        Labeled data has metadata describing the context of the data, not just the raw data values. This contextual information can be labels for array axes (i.e. dimension names), tick labels along axes (stored as coordinate variables), or unique names for each array. These labels provide context and meaning to the data, making it easier to understand and work with. For example, if you have temperature data for different cities over time, you can use xarray to label the dimensions: one for cities and another for time.
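        A minimal sketch (the city names and values here are purely illustrative):

        .. jupyter-execute::

            temps = xr.DataArray(
                [[20.1, 21.3], [15.2, 16.0]],
                dims=("city", "time"),
                coords={"city": ["NYC", "LA"], "time": ["2000-07-01", "2000-07-02"]},
                name="temperature",
            )
            temps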
    serialization
        Serialization is the process of converting your data into a format that makes it easy to save and share. When you serialize data in xarray, you're taking all those temperature measurements, along with their labels and other information, and turning them into a format that can be stored in a file or sent over the internet. Xarray objects can be serialized into formats which store the labels alongside the data. Some supported serialization formats are files that can then be stored or transferred (e.g. netCDF), whilst others are protocols that allow for data access over a network (e.g. Zarr).

    indexing
        :ref:`Indexing` is how you select subsets of your data which you are interested in.

        - Label-based Indexing: Selecting data by passing a specific label and comparing it to the labels stored in the associated coordinates. You can use labels to specify what you want, like "Give me the temperature for New York on July 15th."
        - Positional Indexing: You can use numbers to refer to positions in the data, like "Give me the third temperature value." This is useful when you know the order of your data but don't need to remember the exact labels.
        - Slicing: You can take a "slice" of your data, like you might want all temperatures from July 1st to July 10th. Xarray supports slicing for both positional and label-based indexing.

    DataTree
        A tree-like collection of ``Dataset`` objects. A *tree* is made up of one or more *nodes*, each of which can store the same information as a single ``Dataset`` (accessed via ``.dataset``). This data is stored in the same way as in a ``Dataset``, i.e. in the form of data :term:`variables`, :term:`dimensions`, :term:`coordinates`, and attributes. The nodes in a tree are linked to one another, and each node is itself an instance of ``DataTree``. Each node can have zero or more *children* (stored in a dictionary-like manner under their corresponding *names*), and those child nodes can themselves have children. If a node is a child of another node, that other node is said to be its *parent*. Nodes can have a maximum of one parent, and if a node has no parent it is said to be the *root* node of that *tree*.

    Subtree
        A section of a *tree*, consisting of a *node* along with all the child nodes below it (and the child nodes below them, i.e. all so-called *descendant* nodes). Excludes the parent node and all nodes above.

    Group
        Another word for a subtree, reflecting how the hierarchical structure of a ``DataTree`` allows for grouping related data together. Analogous to a single netCDF group or Zarr group.

.. _testing:

Testing your code
=================

.. jupyter-execute::
    :hide-code:

    import numpy as np
    import pandas as pd
    import xarray as xr

    np.random.seed(123456)

.. _testing.hypothesis:

Hypothesis testing
------------------

.. note::

    Testing with hypothesis is a fairly advanced topic. Before reading this section it is recommended that you take a look at our guide to xarray's :ref:`data structures`, are familiar with conventional unit testing in `pytest <https://docs.pytest.org/>`_, and have seen the `hypothesis library documentation <https://hypothesis.readthedocs.io/>`_.

`The hypothesis library <https://hypothesis.readthedocs.io/>`_ is a powerful tool for property-based testing. Instead of writing tests for one example at a time, it allows you to write tests parameterized by a source of many dynamically generated examples.
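As a brief sketch of the idea (the property tested here is deliberately trivial and purely illustrative):

.. jupyter-execute::

    from hypothesis import given
    import hypothesis.strategies as st

    @given(st.integers())
    def test_doubling_is_even(n):
        # hypothesis calls this function with many generated integers
        assert (2 * n) % 2 == 0

    # property-based tests can be run directly as well as via pytest
    test_doubling_is_even()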
This sketch is parameterized by the set of all possible integers via :py:func:`hypothesis.strategies.integers`. Property-based testing is extremely powerful, because (unlike more conventional example-based testing) it can find bugs that you did not even think to look for!

Strategies
~~~~~~~~~~

Each source of examples is called a "strategy", and xarray provides a range of custom strategies which produce xarray data structures containing arbitrary data. You can use these to efficiently test downstream code, quickly ensuring that your code can handle xarray objects of all possible structures and contents.

These strategies are accessible in the :py:mod:`xarray.testing.strategies` module, which provides

.. currentmodule:: xarray

.. autosummary::

    testing.strategies.supported_dtypes
    testing.strategies.names
    testing.strategies.dimension_names
    testing.strategies.dimension_sizes
    testing.strategies.attrs
    testing.strategies.variables
    testing.strategies.unique_subset_of

These build upon the numpy and array API strategies offered in :py:mod:`hypothesis.extra.numpy` and :py:mod:`hypothesis.extra.array_api`:

.. jupyter-execute::

    import hypothesis.extra.numpy as npst

Generating Examples
~~~~~~~~~~~~~~~~~~~

To see an example of what each of these strategies might produce, you can call one followed by the ``.example()`` method, which is a general hypothesis method valid for all strategies.

.. jupyter-execute::

    import xarray.testing.strategies as xrst

    xrst.variables().example()

.. jupyter-execute::

    xrst.variables().example()

.. jupyter-execute::

    xrst.variables().example()

You can see that calling ``.example()`` multiple times will generate different examples, giving you an idea of the wide range of data that the xarray strategies can generate.

In your tests, however, you should not use ``.example()``; instead you should parameterize your tests with the :py:func:`hypothesis.given` decorator:

.. jupyter-execute::

    from hypothesis import given

.. jupyter-execute::

    @given(xrst.variables())
    def test_function_that_acts_on_variables(var):
        assert func(var) == ...

Chaining Strategies
~~~~~~~~~~~~~~~~~~~

Xarray's strategies can accept other strategies as arguments, allowing you to customise the contents of the generated examples.

.. jupyter-execute::

    # generate a Variable containing an array with a complex number dtype,
    # but all other details still arbitrary
    from hypothesis.extra.numpy import complex_number_dtypes

    xrst.variables(dtype=complex_number_dtypes()).example()

This also works with custom strategies, or strategies defined in other packages. For example you could imagine creating a ``chunks`` strategy to specify particular chunking patterns for a dask-backed array.

Fixing Arguments
~~~~~~~~~~~~~~~~

If you want to fix one aspect of the data structure, whilst allowing variation in the generated examples over all other aspects, then use :py:func:`hypothesis.strategies.just()`.

.. jupyter-execute::

    import hypothesis.strategies as st

    # Generates only variable objects with dimensions ["x", "y"]
    xrst.variables(dims=st.just(["x", "y"])).example()

(This is technically another example of chaining strategies: :py:func:`hypothesis.strategies.just()` is simply a special strategy that just contains a single example.)

To fix the length of dimensions you can instead pass ``dims`` as a mapping of dimension names to lengths (i.e. following xarray objects' ``.sizes`` property), e.g.
.. jupyter-execute::

    # Generates only variables with dimensions ["x", "y"], of lengths 2 & 3 respectively
    xrst.variables(dims=st.just({"x": 2, "y": 3})).example()

You can also use this to specify that you want examples which are missing some part of the data structure, for instance

.. jupyter-execute::

    # Generates a Variable with no attributes
    xrst.variables(attrs=st.just({})).example()

Through a combination of chaining strategies and fixing arguments, you can specify quite complicated requirements on the objects your chained strategy will generate.

.. jupyter-execute::

    fixed_x_variable_y_maybe_z = st.fixed_dictionaries(
        {"x": st.just(2), "y": st.integers(3, 4)}, optional={"z": st.just(2)}
    )
    fixed_x_variable_y_maybe_z.example()

.. jupyter-execute::

    special_variables = xrst.variables(dims=fixed_x_variable_y_maybe_z)

    special_variables.example()

.. jupyter-execute::

    special_variables.example()

Here we have used one of hypothesis' built-in strategies, :py:func:`hypothesis.strategies.fixed_dictionaries`, to create a strategy which generates mappings of dimension names to lengths (i.e. the ``size`` of the xarray object we want). This particular strategy will always generate an ``x`` dimension of length 2, and a ``y`` dimension of length either 3 or 4, and will sometimes also generate a ``z`` dimension of length 2. By feeding this strategy for dictionaries into the ``dims`` argument of xarray's :py:func:`~xarray.testing.strategies.variables` strategy, we can generate arbitrary :py:class:`~xarray.Variable` objects whose dimensions will always match these specifications.

Generating Duck-type Arrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Xarray objects don't have to wrap numpy arrays; in fact, they can wrap any array type which presents the same API as a numpy array (so-called "duck array wrapping", see :ref:`wrapping numpy-like arrays`).

Imagine we want to write a strategy which generates arbitrary ``Variable`` objects, each of which wraps a :py:class:`sparse.COO` array instead of a ``numpy.ndarray``. How could we do that? There are two ways:

1. Create an xarray object with numpy data and use hypothesis' ``.map()`` method to convert the underlying array to a different type:

.. jupyter-execute::

    import sparse

.. jupyter-execute::

    def convert_to_sparse(var):
        return var.copy(data=sparse.COO.from_numpy(var.to_numpy()))

.. jupyter-execute::

    sparse_variables = xrst.variables(dims=xrst.dimension_names(min_dims=1)).map(
        convert_to_sparse
    )

    sparse_variables.example()

.. jupyter-execute::

    sparse_variables.example()

2. Pass a function which returns a strategy which generates the duck-typed arrays directly to the ``array_strategy_fn`` argument of the xarray strategies:

.. jupyter-execute::

    def sparse_random_arrays(
        shape: tuple[int, ...],
    ) -> st.SearchStrategy[sparse._coo.core.COO]:
        """Strategy which generates random sparse.COO arrays"""
        if shape is None:
            shape = npst.array_shapes()
        else:
            shape = st.just(shape)
        density = st.integers(min_value=0, max_value=1)
        # note sparse.random does not accept a dtype kwarg
        return st.builds(sparse.random, shape=shape, density=density)


    def sparse_random_arrays_fn(
        *, shape: tuple[int, ...], dtype: np.dtype
    ) -> st.SearchStrategy[sparse._coo.core.COO]:
        return sparse_random_arrays(shape=shape)

.. jupyter-execute::

    sparse_random_variables = xrst.variables(
        array_strategy_fn=sparse_random_arrays_fn, dtype=st.just(np.dtype("float64"))
    )
    sparse_random_variables.example()

Either approach is fine, but one may be more convenient than the other depending on the type of the duck array which you want to wrap.
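Whichever you choose, the resulting strategy plugs into :py:func:`hypothesis.given` like any other; here is a minimal sketch using the ``sparse_variables`` strategy defined above:

.. jupyter-execute::

    @given(sparse_variables)
    def test_wraps_sparse(var):
        # every generated Variable should wrap a sparse.COO array
        assert isinstance(var.data, sparse.COO)

    test_wraps_sparse()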
Compatibility with the Python Array API Standard
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Xarray aims to be compatible with any duck-array type that conforms to the `Python Array API Standard <https://data-apis.org/array-api/latest/>`_ (see our :ref:`docs on Array API Standard support`).

.. warning::

    The strategies defined in :py:mod:`testing.strategies` are **not** guaranteed to use array API standard-compliant dtypes by default. For example arrays with the dtype ``np.dtype('float16')`` may be generated by :py:func:`testing.strategies.variables` (assuming the ``dtype`` kwarg was not explicitly passed), despite ``np.dtype('float16')`` not being in the array API standard.

If the array type you want to generate has an array API-compliant top-level namespace (e.g. that which is conventionally imported as ``xp`` or similar), you can use this neat trick:

.. jupyter-execute::

    import numpy as xp  # compatible in numpy 2.0

    # use `import numpy.array_api as xp` in numpy>=1.23,<2.0
    from hypothesis.extra.array_api import make_strategies_namespace

    xps = make_strategies_namespace(xp)

    xp_variables = xrst.variables(
        array_strategy_fn=xps.arrays,
        dtype=xps.scalar_dtypes(),
    )
    xp_variables.example()

Another array API-compliant duck array library would replace the import, e.g. ``import cupy as cp`` instead.

Testing over Subsets of Dimensions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A common task when testing xarray user code is checking that your function works for all valid input dimensions. We can chain strategies to achieve this, for which the helper strategy :py:func:`~testing.strategies.unique_subset_of` is useful.

It works for lists of dimension names

.. jupyter-execute::

    dims = ["x", "y", "z"]
    xrst.unique_subset_of(dims).example()

.. jupyter-execute::

    xrst.unique_subset_of(dims).example()

as well as for mappings of dimension names to sizes

.. jupyter-execute::

    dim_sizes = {"x": 2, "y": 3, "z": 4}
    xrst.unique_subset_of(dim_sizes).example()

.. jupyter-execute::

    xrst.unique_subset_of(dim_sizes).example()

This is useful because operations like reductions can be performed over any subset of the xarray object's dimensions. For example, we can write a pytest test checking that a reduction gives the expected result when applied along any possible valid subset of the Variable's dimensions.

.. code-block:: python

    import numpy.testing as npt


    @given(st.data(), xrst.variables(dims=xrst.dimension_names(min_dims=1)))
    def test_mean(data, var):
        """Test that the mean of an xarray Variable is always equal to the mean of the underlying array."""

        # specify arbitrary reduction along at least one dimension
        reduction_dims = data.draw(xrst.unique_subset_of(var.dims, min_size=1))

        # create expected result (using nanmean because arrays with NaNs will be generated)
        reduction_axes = tuple(var.get_axis_num(dim) for dim in reduction_dims)
        expected = np.nanmean(var.data, axis=reduction_axes)

        # assert property is always satisfied
        result = var.mean(dim=reduction_dims).data
        npt.assert_equal(expected, result)

.. currentmodule:: xarray

.. _time-series:

================
Time series data
================

A major use case for xarray is multi-dimensional time-series data. Accordingly, we've copied many of the features that make working with time-series data in pandas such a joy to xarray. In most cases, we rely on pandas for the core functionality.
.. jupyter-execute::
    :hide-code:

    import numpy as np
    import pandas as pd
    import xarray as xr

    np.random.seed(123456)

Creating datetime64 data
------------------------

Xarray uses the numpy dtypes :py:class:`numpy.datetime64` and :py:class:`numpy.timedelta64` with specified units (one of ``"s"``, ``"ms"``, ``"us"`` and ``"ns"``) to represent datetime data, which offer vectorized operations with numpy and smooth integration with pandas.

To convert to or create regular arrays of :py:class:`numpy.datetime64` data, we recommend using :py:func:`pandas.to_datetime`, :py:class:`pandas.DatetimeIndex`, or :py:func:`xarray.date_range`:

.. jupyter-execute::

    pd.to_datetime(["2000-01-01", "2000-02-02"])

.. jupyter-execute::

    pd.DatetimeIndex(
        ["2000-01-01 00:00:00", "2000-02-02 00:00:00"], dtype="datetime64[s]"
    )

.. jupyter-execute::

    xr.date_range("2000-01-01", periods=365)

.. jupyter-execute::

    xr.date_range("2000-01-01", periods=365, unit="s")

.. note::

    Care has to be taken to create the output with the desired resolution. For :py:func:`pandas.date_range` the ``unit`` kwarg has to be specified, and :py:func:`pandas.to_datetime` does not allow selecting the resolution at all; in that case :py:class:`pandas.DatetimeIndex` can be used directly. There is more in-depth information in section :ref:`internals.timecoding`.

Alternatively, you can supply arrays of Python ``datetime`` objects. These get converted automatically when used as arguments in xarray objects (with microsecond resolution):

.. jupyter-execute::

    import datetime

    xr.Dataset({"time": datetime.datetime(2000, 1, 1)})

When reading or writing netCDF files, xarray automatically decodes datetime and timedelta arrays using `CF conventions`_ (that is, by using a ``units`` attribute like ``'days since 2000-01-01'``).

.. _CF conventions: https://cfconventions.org

.. note::

    When decoding/encoding datetimes for non-standard calendars or for dates before `1582-10-15`_, xarray uses the `cftime`_ library by default. It was previously packaged with the ``netcdf4-python`` package under the name ``netcdftime`` but is now distributed separately. ``cftime`` is an :ref:`optional dependency` of xarray.

.. _cftime: https://unidata.github.io/cftime

.. _1582-10-15: https://en.wikipedia.org/wiki/Gregorian_calendar

You can manually decode arrays in this form by passing a dataset to :py:func:`decode_cf`:

.. jupyter-execute::

    attrs = {"units": "hours since 2000-01-01"}
    ds = xr.Dataset({"time": ("time", [0, 1, 2, 3], attrs)})

    # Default decoding to 'ns'-resolution
    xr.decode_cf(ds)

.. jupyter-execute::

    # Decoding to 's'-resolution
    coder = xr.coders.CFDatetimeCoder(time_unit="s")
    xr.decode_cf(ds, decode_times=coder)

From xarray 2025.01.2 the resolution of the dates can be one of ``"s"``, ``"ms"``, ``"us"`` or ``"ns"``. One limitation of using ``datetime64[ns]`` is that it limits the native representation of dates to those that fall between the years 1678 and 2262; this range widens significantly at lower resolutions. When a store contains dates outside of these bounds (or dates < `1582-10-15`_ with a Gregorian, also known as standard, calendar), dates will be returned as arrays of :py:class:`cftime.datetime` objects and a :py:class:`CFTimeIndex` will be used for indexing. :py:class:`CFTimeIndex` enables most of the indexing functionality of a :py:class:`pandas.DatetimeIndex`. See :ref:`CFTimeIndex` for more information.

Datetime indexing
-----------------

Xarray borrows powerful indexing machinery from pandas (see :ref:`indexing`).
This allows for several useful and succinct forms of indexing, particularly for ``datetime64`` data. For example, we support indexing with strings for single items and with the ``slice`` object:

.. jupyter-execute::

    time = pd.date_range("2000-01-01", freq="h", periods=365 * 24)
    ds = xr.Dataset({"foo": ("time", np.arange(365 * 24)), "time": time})
    ds.sel(time="2000-01")

.. jupyter-execute::

    ds.sel(time=slice("2000-06-01", "2000-06-10"))

You can also select a particular time by indexing with a :py:class:`datetime.time` object:

.. jupyter-execute::

    ds.sel(time=datetime.time(12))

For more details, read the pandas documentation and the section on :ref:`datetime_component_indexing` (i.e. using the ``.dt`` accessor).

.. _dt_accessor:

Datetime components
-------------------

Similar to `pandas accessors`_, the components of datetime objects contained in a given ``DataArray`` can be quickly computed using a special ``.dt`` accessor.

.. _pandas accessors: https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dt-accessors

.. jupyter-execute::

    time = pd.date_range("2000-01-01", freq="6h", periods=365 * 4)
    ds = xr.Dataset({"foo": ("time", np.arange(365 * 4)), "time": time})
    ds.time.dt.hour

.. jupyter-execute::

    ds.time.dt.dayofweek

The ``.dt`` accessor works on both coordinate dimensions and multi-dimensional data.

Xarray also supports a notion of "virtual" or "derived" coordinates for `datetime components`__ implemented by pandas, including "year", "month", "day", "hour", "minute", "second", "dayofyear", "week", "dayofweek", "weekday" and "quarter":

__ https://pandas.pydata.org/pandas-docs/stable/api.html#time-date-components

.. jupyter-execute::

    ds["time.month"]

.. jupyter-execute::

    ds["time.dayofyear"]

For use as a derived coordinate, xarray adds ``'season'`` to the list of datetime components supported by pandas:

.. jupyter-execute::

    ds["time.season"]

.. jupyter-execute::

    ds["time"].dt.season

The set of valid seasons consists of 'DJF', 'MAM', 'JJA' and 'SON', labeled by the first letters of the corresponding months.

You can use these shortcuts with both Datasets and DataArray coordinates.

In addition, xarray supports rounding operations ``floor``, ``ceil``, and ``round``. These operations require that you supply a `rounding frequency as a string argument.`__

__ https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

.. jupyter-execute::

    ds["time"].dt.floor("D")

The ``.dt`` accessor can also be used to generate formatted datetime strings for arrays, utilising the same formatting as the standard `datetime.strftime`_.

.. _datetime.strftime: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior

.. jupyter-execute::

    ds["time"].dt.strftime("%a, %b %d %H:%M")

.. _datetime_component_indexing:

Indexing Using Datetime Components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can use the ``.dt`` accessor when subsetting your data as well. For example, we can subset for the month of January using the following:

.. jupyter-execute::

    ds.isel(time=(ds.time.dt.month == 1))

You can also search for multiple months (in this case January through March), using ``isin``:

.. jupyter-execute::

    ds.isel(time=ds.time.dt.month.isin([1, 2, 3]))

.. _resampling:

Resampling and grouped operations
---------------------------------

.. seealso::

    For more generic documentation on grouping, see :ref:`groupby`.

Datetime components couple particularly well with grouped operations for analyzing features that repeat over time. Here's how to calculate the mean by time of day:
.. jupyter-execute::

    ds.groupby("time.hour").mean()

For upsampling or downsampling temporal resolutions, xarray offers a :py:meth:`Dataset.resample` method building on the core functionality offered by the pandas method of the same name. Resample uses essentially the same API as :py:meth:`pandas.DataFrame.resample` `in pandas`_.

.. _in pandas: https://pandas.pydata.org/pandas-docs/stable/timeseries.html#up-and-downsampling

For example, we can downsample our dataset from hourly to 6-hourly:

.. jupyter-execute::

    ds.resample(time="6h")

This will create a specialized :py:class:`~xarray.core.resample.DatasetResample` or :py:class:`~xarray.core.resample.DataArrayResample` object which saves information necessary for resampling. All of the reduction methods which work with :py:class:`Dataset` or :py:class:`DataArray` objects can also be used for resampling:

.. jupyter-execute::

    ds.resample(time="6h").mean()

You can also supply an arbitrary reduction function to aggregate over each resampling group:

.. jupyter-execute::

    ds.resample(time="6h").reduce(np.mean)

You can also resample on the time dimension while also reducing along other dimensions by specifying the ``dim`` keyword argument

.. code-block:: python

    ds.resample(time="6h").mean(dim=["time", "latitude", "longitude"])

For upsampling, xarray provides six methods: ``asfreq``, ``ffill``, ``bfill``, ``pad``, ``nearest`` and ``interpolate``. ``interpolate`` extends :py:func:`scipy.interpolate.interp1d` and supports all of its schemes. All of these resampling operations work on both Dataset and DataArray objects with an arbitrary number of dimensions.

In order to limit the scope of the methods ``ffill``, ``bfill``, ``pad`` and ``nearest``, the ``tolerance`` argument can be set in coordinate units. Data with indices outside of the given ``tolerance`` is set to ``NaN``.

.. jupyter-execute::

    ds.resample(time="1h").nearest(tolerance="1h")

It is often desirable to center the time values after a resampling operation. That can be accomplished by updating the resampled dataset time coordinate values using time offset arithmetic via the :py:func:`pandas.tseries.frequencies.to_offset` function.

.. jupyter-execute::

    resampled_ds = ds.resample(time="6h").mean()
    offset = pd.tseries.frequencies.to_offset("6h") / 2
    resampled_ds["time"] = resampled_ds.get_index("time") + offset
    resampled_ds

.. seealso::

    For more examples of using grouped operations on a time dimension, see :doc:`../examples/weather-data`.

.. _seasonal_grouping:

Handling Seasons
~~~~~~~~~~~~~~~~

Two extremely common time series operations are to group by seasons, and to resample to a seasonal frequency. Xarray has historically supported some simple versions of these computations, for example ``.groupby("time.season")`` (where the seasons are DJF, MAM, JJA, SON) and resampling to a seasonal frequency using pandas syntax: ``.resample(time="QS-DEC")``.

Quite commonly one wants more flexibility in defining seasons. For these use-cases, Xarray provides :py:class:`groupers.SeasonGrouper` and :py:class:`groupers.SeasonResampler`.

.. currentmodule:: xarray.groupers

.. jupyter-execute::

    from xarray.groupers import SeasonGrouper

    ds.groupby(time=SeasonGrouper(["DJF", "MAM", "JJA", "SON"])).mean()

Note how the seasons are in the specified order, unlike ``.groupby("time.season")`` where the seasons are sorted alphabetically.

.. jupyter-execute::

    ds.groupby("time.season").mean()

:py:class:`SeasonGrouper` supports overlapping seasons:
.. jupyter-execute::

    ds.groupby(time=SeasonGrouper(["DJFM", "MAMJ", "JJAS", "SOND"])).mean()

Skipping months is allowed:

.. jupyter-execute::

    ds.groupby(time=SeasonGrouper(["JJAS"])).mean()

Use :py:class:`SeasonResampler` to resample to custom seasons.

.. jupyter-execute::

    from xarray.groupers import SeasonResampler

    ds.resample(time=SeasonResampler(["DJF", "MAM", "JJA", "SON"])).mean()

:py:class:`SeasonResampler` is smart enough to correctly handle years for seasons that span the end of the year (e.g. DJF). By default :py:class:`SeasonResampler` will skip any season that is incomplete (e.g. the first DJF season for a time series that starts in Jan). Pass the ``drop_incomplete=False`` kwarg to :py:class:`SeasonResampler` to disable this behaviour.

.. jupyter-execute::

    from xarray.groupers import SeasonResampler

    ds.resample(
        time=SeasonResampler(["DJF", "MAM", "JJA", "SON"], drop_incomplete=False)
    ).mean()

Seasons need not be of the same length:

.. jupyter-execute::

    ds.resample(time=SeasonResampler(["JF", "MAM", "JJAS", "OND"])).mean()

.. currentmodule:: xarray

.. _weather-climate:

Weather and climate data
========================

.. jupyter-execute::
    :hide-code:

    import xarray as xr
    import numpy as np

Xarray can leverage metadata that follows the `Climate and Forecast (CF) conventions`_ if present. Examples include :ref:`automatic labelling of plots` with descriptive names and units, and support for non-standard calendars used in climate science through the ``cftime`` module (explained in the :ref:`CFTimeIndex` section). There are also a number of :ref:`geosciences-focused projects that build on xarray`.

.. _Climate and Forecast (CF) conventions: https://cfconventions.org

.. _cf_variables:

Related Variables
-----------------

Several CF variable attributes contain lists of other variables associated with the variable carrying the attribute. A few of these are now parsed by xarray, with the attribute value popped to encoding on read and the variables in that value interpreted as non-dimension coordinates:

- ``coordinates``
- ``bounds``
- ``grid_mapping``
- ``climatology``
- ``geometry``
- ``node_coordinates``
- ``node_count``
- ``part_node_count``
- ``interior_ring``
- ``cell_measures``
- ``formula_terms``

This decoding is controlled by the ``decode_coords`` kwarg to :py:func:`open_dataset` and :py:func:`open_mfdataset`.

The CF attribute ``ancillary_variables`` was not included in the list because the variables listed there are associated primarily with the variable carrying the attribute, rather than with the dimensions.

.. _metpy_accessor:

CF-compliant coordinate variables
---------------------------------

`MetPy`_ adds a ``metpy`` accessor that allows accessing coordinates with appropriate CF metadata using generic names ``x``, ``y``, ``vertical`` and ``time``. There is also a ``cartopy_crs`` attribute that provides projection information, parsed from the appropriate CF metadata, as a `Cartopy`_ projection object. See the `metpy documentation`_ for more information.

.. _`MetPy`: https://unidata.github.io/MetPy/dev/index.html
.. _`metpy documentation`: https://unidata.github.io/MetPy/dev/tutorials/xarray_tutorial.html#coordinates
.. _`Cartopy`: https://scitools.org.uk/cartopy/docs/latest/reference/crs.html
.. _CFTimeIndex:

Non-standard calendars and dates outside the precision range
------------------------------------------------------------

Through the standalone ``cftime`` library and a custom subclass of :py:class:`pandas.Index`, xarray supports a subset of the indexing functionality enabled through the standard :py:class:`pandas.DatetimeIndex` for dates from non-standard calendars commonly used in climate science, as well as for dates from a standard calendar that fall outside the `precision range`_ or prior to `1582-10-15`_.

.. note::

    As of xarray version 0.11, by default, :py:class:`cftime.datetime` objects will be used to represent times (either in indexes, as a :py:class:`~xarray.CFTimeIndex`, or in data arrays with dtype object) if any of the following are true:

    - The dates are from a non-standard calendar
    - Any dates are outside the nanosecond-precision range (prior to xarray version 2025.01.2)
    - Any dates are outside the time span limited by the resolution (from xarray version 2025.01.2)

    Otherwise pandas-compatible dates from a standard calendar will be represented with the ``np.datetime64[unit]`` data type (where unit can be one of ``"s"``, ``"ms"``, ``"us"``, ``"ns"``), enabling the use of a :py:class:`pandas.DatetimeIndex` or arrays with dtype ``np.datetime64[unit]`` and their full set of associated features.

    As of pandas version 2.0.0, pandas supports non-nanosecond precision datetime values. From xarray version 2025.01.2 on, non-nanosecond precision datetime values are also supported in xarray (this can be parameterized via :py:class:`~xarray.coders.CFDatetimeCoder` and the ``decode_times`` kwarg). See also :ref:`internals.timecoding`.

For example, you can create a DataArray indexed by a time coordinate with dates from a no-leap calendar and a :py:class:`~xarray.CFTimeIndex` will automatically be used:

.. jupyter-execute::

    from itertools import product
    from cftime import DatetimeNoLeap

    dates = [
        DatetimeNoLeap(year, month, 1)
        for year, month in product(range(1, 3), range(1, 13))
    ]
    da = xr.DataArray(np.arange(24), coords=[dates], dims=["time"], name="foo")

Xarray also includes a :py:func:`~xarray.date_range` function, which enables creating a :py:class:`~xarray.CFTimeIndex` with regularly-spaced dates. For instance, we can create the same dates and DataArray we created above using (note that ``use_cftime=True`` is not mandatory to return a :py:class:`~xarray.CFTimeIndex` for non-standard calendars, but can be nice to use to be explicit):

.. jupyter-execute::

    dates = xr.date_range(
        start="0001", periods=24, freq="MS", calendar="noleap", use_cftime=True
    )
    da = xr.DataArray(np.arange(24), coords=[dates], dims=["time"], name="foo")

Mirroring pandas' method with the same name, :py:meth:`~xarray.infer_freq` allows one to infer the sampling frequency of a :py:class:`~xarray.CFTimeIndex` or a 1-D :py:class:`~xarray.DataArray` containing cftime objects. It also works transparently with ``np.datetime64`` and ``np.timedelta64`` data (with "s", "ms", "us" or "ns" resolution).

.. jupyter-execute::

    xr.infer_freq(dates)

With :py:meth:`~xarray.CFTimeIndex.strftime` we can also easily generate formatted strings from the datetime values of a :py:class:`~xarray.CFTimeIndex` directly, or through the ``dt`` accessor for a :py:class:`~xarray.DataArray`, using the same formatting as the standard `datetime.strftime`_ convention.

.. _datetime.strftime: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior

.. jupyter-execute::

    dates.strftime("%c")
.. jupyter-execute::

    da["time"].dt.strftime("%Y%m%d")

Conversion between non-standard calendars, and to/from pandas DatetimeIndexes, is facilitated with the :py:meth:`xarray.Dataset.convert_calendar` method (also available as :py:meth:`xarray.DataArray.convert_calendar`). Here, like elsewhere in xarray, the ``use_cftime`` argument controls which datetime backend is used in the output. The default (``None``) is to use ``pandas`` when possible, i.e. when the calendar is ``standard``/``gregorian`` and dates start on or after `1582-10-15`_. There is no such restriction when converting to a ``proleptic_gregorian`` calendar.

.. _1582-10-15: https://en.wikipedia.org/wiki/Gregorian_calendar

.. jupyter-execute::

    dates = xr.date_range(
        start="2001", periods=24, freq="MS", calendar="noleap", use_cftime=True
    )
    da_nl = xr.DataArray(np.arange(24), coords=[dates], dims=["time"], name="foo")
    da_std = da_nl.convert_calendar("standard", use_cftime=True)

The data is unchanged, only the timestamps are modified. Further options are implemented for the special ``"360_day"`` calendar and for handling missing dates. There is also :py:meth:`xarray.Dataset.interp_calendar` (and :py:meth:`xarray.DataArray.interp_calendar`) for interpolating data between calendars.

For data indexed by a :py:class:`~xarray.CFTimeIndex`, xarray currently supports:

- `Partial datetime string indexing`_:

  .. jupyter-execute::

      da.sel(time="0001")

  .. jupyter-execute::

      da.sel(time=slice("0001-05", "0002-02"))

  .. note::

      For specifying full or partial datetime strings in cftime indexing, xarray supports two versions of the `ISO 8601 standard`_, the basic pattern (YYYYMMDDhhmmss) or the extended pattern (YYYY-MM-DDThh:mm:ss), as well as the default cftime string format (YYYY-MM-DD hh:mm:ss). This is somewhat more restrictive than pandas; in other words, some datetime strings that would be valid for a :py:class:`pandas.DatetimeIndex` are not valid for an :py:class:`~xarray.CFTimeIndex`.

- Access of basic datetime components via the ``dt`` accessor (in this case just "year", "month", "day", "hour", "minute", "second", "microsecond", "season", "dayofyear", "dayofweek", and "days_in_month") with the addition of "calendar", absent from pandas:

  .. jupyter-execute::

      da.time.dt.year

  .. jupyter-execute::

      da.time.dt.month

  .. jupyter-execute::

      da.time.dt.season

  .. jupyter-execute::

      da.time.dt.dayofyear

  .. jupyter-execute::

      da.time.dt.dayofweek

  .. jupyter-execute::

      da.time.dt.days_in_month

  .. jupyter-execute::

      da.time.dt.calendar

- Rounding of datetimes to fixed frequencies via the ``dt`` accessor:

  .. jupyter-execute::

      da.time.dt.ceil("3D").head()

  .. jupyter-execute::

      da.time.dt.floor("5D").head()

  .. jupyter-execute::

      da.time.dt.round("2D").head()

- Group-by operations based on datetime accessor attributes (e.g. by month of the year):

  .. jupyter-execute::

      da.groupby("time.month").sum()

- Interpolation using :py:class:`cftime.datetime` objects:

  .. jupyter-execute::

      da.interp(time=[DatetimeNoLeap(1, 1, 15), DatetimeNoLeap(1, 2, 15)])

- Interpolation using datetime strings:

  .. jupyter-execute::

      da.interp(time=["0001-01-15", "0001-02-15"])

- Differentiation:

  .. jupyter-execute::

      da.differentiate("time")

- Serialization:

  .. jupyter-execute::

      da.to_netcdf("example-no-leap.nc")
      reopened = xr.open_dataset("example-no-leap.nc")
      reopened

  .. jupyter-execute::
      :hide-code:

      import os

      reopened.close()
      os.remove("example-no-leap.nc")

- And resampling along the time dimension for data indexed by a :py:class:`~xarray.CFTimeIndex`:
  .. jupyter-execute::

      da.resample(time="81min", closed="right", label="right", offset="3min").mean()

.. _precision range: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamp-limitations
.. _ISO 8601 standard: https://en.wikipedia.org/wiki/ISO_8601
.. _partial datetime string indexing: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#partial-string-indexing

API reference
=============

This page provides an auto-generated summary of xarray's API. For more details and examples, refer to the relevant chapters in the main part of the documentation. See also: "What parts of xarray are considered public API?" and "How stable is Xarray's API?".

Top-level functions
-------------------

- ``apply_ufunc(func, *args[, input_core_dims, ...])``: Apply a vectorized function for unlabeled arrays on xarray objects.
- ``align(*objects[, join, copy, indexes, ...])``: Given any number of Dataset and/or DataArray objects, returns new objects with aligned indexes and dimension sizes.
- ``broadcast(*args[, exclude])``: Explicitly broadcast any number of DataArray or Dataset objects against one another.
- ``concat(objs, dim[, data_vars, coords, ...])``: Concatenate xarray objects along a new or existing dimension.
- ``merge(objects[, compat, join, fill_value, ...])``: Merge any number of xarray objects into a single Dataset as variables.
- ``combine_by_coords([data_objects, compat, ...])``: Attempt to auto-magically combine the given datasets (or data arrays) into one by using dimension coordinates.
- ``combine_nested(datasets, concat_dim[, ...])``: Explicitly combine an N-dimensional grid of datasets into one by using a succession of concat and merge operations along each dimension of the grid.
- ``where(cond, x, y[, keep_attrs])``: Return elements from x or y depending on cond.
- ``infer_freq(index)``: Infer the most likely frequency given the input index.
- ``full_like(other, fill_value[, dtype, ...])``: Return a new object with the same shape and type as a given object.
- ``zeros_like(other[, dtype, chunks, ...])``: Return a new object of zeros with the same shape and type as a given dataarray or dataset.
- ``ones_like(other[, dtype, chunks, ...])``: Return a new object of ones with the same shape and type as a given dataarray or dataset.
- ``cov(da_a, da_b[, dim, ddof, weights])``: Compute covariance between two DataArray objects along a shared dimension.
- ``corr(da_a, da_b[, dim, weights])``: Compute the Pearson correlation coefficient between two DataArray objects along a shared dimension.
- ``cross(a, b, *, dim)``: Compute the cross product of two (arrays of) vectors.
- ``dot(*arrays[, dim])``: Generalized dot product for xarray objects.
- ``polyval(coord, coeffs[, degree_dim])``: Evaluate a polynomial at specific values.
- ``map_blocks(func, obj[, args, kwargs, template])``: Apply a function to each block of a DataArray or Dataset.
- ``show_versions([file])``: Print the versions of xarray and its dependencies.
- ``set_options(**kwargs)``: Set options for xarray in a controlled context.
- ``get_options()``: Get options for xarray.
- ``unify_chunks(*objects)``: Given any number of Dataset and/or DataArray objects, returns new objects with unified chunk size along all chunked dimensions.

Dataset
-------

Creating a dataset
~~~~~~~~~~~~~~~~~~

- ``Dataset([data_vars, coords, attrs])``: A multi-dimensional, in memory, array database.
- ``decode_cf(obj[, concat_characters, ...])``: Decode the given Dataset or Datastore according to CF conventions into a new Dataset.

Attributes
~~~~~~~~~~

- ``Dataset.dims``: Mapping from dimension names to lengths.
- ``Dataset.sizes``: Mapping from dimension names to lengths.
- ``Dataset.dtypes``: Mapping from data variable names to dtypes.
- ``Dataset.data_vars``: Dictionary of DataArray objects corresponding to data variables.
- ``Dataset.coords``: Mapping of ``DataArray`` objects corresponding to coordinate variables.
- ``Dataset.attrs``: Dictionary of global attributes on this dataset.
- ``Dataset.encoding``: Dictionary of global encoding attributes on this dataset.
- ``Dataset.indexes``: Mapping of pandas.Index objects used for label based indexing.
- ``Dataset.xindexes``: Mapping of ``Index`` objects used for label based indexing.
- ``Dataset.chunks``: Mapping from dimension names to block lengths for this dataset's data.
- ``Dataset.chunksizes``: Mapping from dimension names to block lengths for this dataset's data.
- ``Dataset.nbytes``: Total bytes consumed by the data arrays of all variables in this dataset.

Dictionary interface
~~~~~~~~~~~~~~~~~~~~

Datasets implement the mapping interface with keys given by variable names and values given by ``DataArray`` objects.

- ``Dataset.__getitem__(key)``: Access variables or coordinates of this dataset as a ``DataArray``, or a subset of variables, or an indexed dataset.
- ``Dataset.__setitem__(key, value)``: Add an array to this dataset.
- ``Dataset.__delitem__(key)``: Remove a variable from this dataset.
- ``Dataset.update(other)``: Update this dataset's variables with those from another dataset.
- ``Dataset.get(k[, d])``
- ``Dataset.items()``
- ``Dataset.keys()``
- ``Dataset.values()``

Dataset contents
~~~~~~~~~~~~~~~~

- ``Dataset.copy([deep, data])``: Returns a copy of this dataset.
- ``Dataset.assign([variables])``: Assign new data variables to a Dataset, returning a new object with all the original variables in addition to the new ones.
- ``Dataset.assign_coords([coords])``: Assign new coordinates to this object.
- ``Dataset.assign_attrs(*args, **kwargs)``: Assign new attrs to this object.
- ``Dataset.pipe(func, *args, **kwargs)``: Apply ``func(self, *args, **kwargs)``.
- ``Dataset.merge(other[, overwrite_vars, ...])``: Merge the arrays of two datasets into a single dataset.
- ``Dataset.rename([name_dict])``: Returns a new object with renamed variables, coordinates and dimensions.
- ``Dataset.rename_vars([name_dict])``: Returns a new object with renamed variables including coordinates.
- ``Dataset.rename_dims([dims_dict])``: Returns a new object with renamed dimensions only.
- ``Dataset.swap_dims([dims_dict])``: Returns a new object with swapped dimensions.
- ``Dataset.expand_dims([dim, axis, ...])``: Return a new object with an additional axis (or axes) inserted at the corresponding position in the array shape.
- ``Dataset.drop_vars(names, *[, errors])``: Drop variables from this dataset.
- ``Dataset.drop_indexes(coord_names, *[, errors])``: Drop the indexes assigned to the given coordinates.
- ``Dataset.drop_duplicates(dim, *[, keep])``: Returns a new Dataset with duplicate dimension values removed.
- ``Dataset.drop_dims(drop_dims, *[, errors])``: Drop dimensions and associated variables from this dataset.
- ``Dataset.drop_encoding()``: Return a new Dataset without encoding on the dataset or any of its variables/coords.
- ``Dataset.drop_attrs(*[, deep])``: Removes all attributes from the Dataset and its variables.
- ``Dataset.set_coords(names)``: Given names of one or more variables, set them as coordinates.
- ``Dataset.reset_coords([names, drop])``: Given names of coordinates, reset them to become variables.
- ``Dataset.convert_calendar(calendar[, dim, ...])``: Convert the Dataset to another calendar.
- ``Dataset.interp_calendar(target[, dim])``: Interpolates the Dataset to another calendar based on decimal year measure.
- ``Dataset.get_index(key)``: Get an index for a dimension, with fall-back to a default RangeIndex.

Comparisons
~~~~~~~~~~~

- ``Dataset.equals(other)``: Two Datasets are equal if they have matching variables and coordinates, all of which are equal.
- ``Dataset.identical(other)``: Like equals, but also checks all dataset attributes and the attributes on all variables and coordinates.
- ``Dataset.broadcast_equals(other)``: Two Datasets are broadcast equal if they are equal after broadcasting all variables against each other.

Indexing
~~~~~~~~

- ``Dataset.loc``: Attribute for location based indexing.
- ``Dataset.isel([indexers, drop, missing_dims])``: Returns a new dataset with each array indexed along the specified dimension(s).
- ``Dataset.sel([indexers, method, tolerance, drop])``: Returns a new dataset with each array indexed by tick labels along the specified dimension(s).
- ``Dataset.drop_sel([labels, errors])``: Drop index labels from this dataset.
- ``Dataset.drop_isel([indexers])``: Drop index positions from this Dataset.
- ``Dataset.head([indexers])``: Returns a new dataset with the first n values of each array for the specified dimension(s).
- ``Dataset.tail([indexers])``: Returns a new dataset with the last n values of each array for the specified dimension(s).
- ``Dataset.thin([indexers])``: Returns a new dataset with each array indexed along every n-th value for the specified dimension(s).
- ``Dataset.squeeze([dim, drop, axis])``: Return a new object with squeezed data.
- ``Dataset.interp([coords, method, ...])``: Interpolate a Dataset onto new coordinates.
- ``Dataset.interp_like(other[, method, ...])``: Interpolate this object onto the coordinates of another object.
- ``Dataset.reindex([indexers, method, ...])``: Conform this object onto a new set of indexes, filling in missing values with ``fill_value``.
- ``Dataset.reindex_like(other[, method, ...])``: Conform this object onto the indexes of another object, for indexes which the objects share.
- ``Dataset.set_index([indexes, append])``: Set Dataset (multi-)indexes using one or more existing coordinates or variables.
- ``Dataset.reset_index(dims_or_levels, *[, drop])``: Reset the specified index(es) or multi-index level(s).
- ``Dataset.set_xindex(coord_names[, index_cls])``: Set a new, Xarray-compatible index from one or more existing coordinate(s).
- ``Dataset.reorder_levels([dim_order])``: Rearrange index levels using input order.
- ``Dataset.query([queries, parser, engine, ...])``: Return a new dataset with each array indexed along the specified dimension(s), where the indexers are given as strings containing Python expressions to be evaluated against the data variables in the dataset.

Missing value handling
~~~~~~~~~~~~~~~~~~~~~~

- ``Dataset.isnull([keep_attrs])``: Test each value in the array for whether it is a missing value.
- ``Dataset.notnull([keep_attrs])``: Test each value in the array for whether it is not a missing value.
- ``Dataset.combine_first(other)``: Combine two Datasets, default to data_vars of self.
- ``Dataset.count([dim, keep_attrs])``: Reduce this Dataset's data by applying ``count`` along some dimension(s).
- ``Dataset.dropna(dim, *[, how, thresh, subset])``: Returns a new dataset with dropped labels for missing values along the provided dimension.
- ``Dataset.fillna(value)``: Fill missing values in this object.
- ``Dataset.ffill(dim[, limit])``: Fill NaN values by propagating values forward.
- ``Dataset.bfill(dim[, limit])``: Fill NaN values by propagating values backward.
- ``Dataset.interpolate_na([dim, method, limit, ...])``: Fill in NaNs by interpolating according to different methods.
- ``Dataset.where(cond[, other, drop])``: Filter elements from this object according to a condition.
- ``Dataset.isin(test_elements)``: Tests each value in the array for whether it is in test elements.

Computation
~~~~~~~~~~~

- ``Dataset.map(func[, keep_attrs, args])``: Apply a function to each data variable in this dataset.
- ``Dataset.reduce(func[, dim, keep_attrs, ...])``: Reduce this dataset by applying func along some dimension(s).
- ``Dataset.groupby([group, squeeze, ...])``: Returns a DatasetGroupBy object for performing grouped operations.
- ``Dataset.groupby_bins(group, bins[, right, ...])``: Returns a DatasetGroupBy object for performing grouped operations.
- ``Dataset.rolling([dim, min_periods, center])``: Rolling window object for Datasets.
- ``Dataset.rolling_exp([window, window_type])``: Exponentially-weighted moving window.
- ``Dataset.cumulative(dim[, min_periods])``: Accumulating object for Datasets.
- ``Dataset.weighted(weights)``: Weighted Dataset operations.
- ``Dataset.coarsen([dim, boundary, side, ...])``: Coarsen object for Datasets.
- ``Dataset.resample([indexer, skipna, closed, ...])``: Returns a Resample object for performing resampling operations.
- ``Dataset.diff(dim[, n, label])``: Calculate the n-th order discrete difference along given axis.
- ``Dataset.quantile(q[, dim, method, ...])``: Compute the qth quantile of the data along the specified dimension.
- ``Dataset.differentiate(coord[, edge_order, ...])``: Differentiate with the second order accurate central differences.
- ``Dataset.integrate(coord[, datetime_unit])``: Integrate along the given coordinate using the trapezoidal rule.
- ``Dataset.map_blocks(func[, args, kwargs, ...])``: Apply a function to each block of this Dataset.
- ``Dataset.polyfit(dim, deg[, skipna, rcond, ...])``: Least squares polynomial fit.
- ``Dataset.curvefit(coords, func[, ...])``: Curve fitting optimization for arbitrary functions.
- ``Dataset.eval(statement, *[, parser])``: Calculate an expression supplied as a string in the context of the dataset.

Aggregation
~~~~~~~~~~~

- ``Dataset.all([dim, keep_attrs])``: Reduce this Dataset's data by applying ``all`` along some dimension(s).
- ``Dataset.any([dim, keep_attrs])``: Reduce this Dataset's data by applying ``any`` along some dimension(s).
- ``Dataset.argmax([dim])``: Indices of the maxima of the member variables.
- ``Dataset.argmin([dim])``: Indices of the minima of the member variables.
- ``Dataset.count([dim, keep_attrs])``: Reduce this Dataset's data by applying ``count`` along some dimension(s).
- ``Dataset.idxmax([dim, skipna, fill_value, ...])``: Return the coordinate label of the maximum value along a dimension.
- ``Dataset.idxmin([dim, skipna, fill_value, ...])``: Return the coordinate label of the minimum value along a dimension.
- ``Dataset.max([dim, skipna, keep_attrs])``: Reduce this Dataset's data by applying ``max`` along some dimension(s).
- ``Dataset.min([dim, skipna, keep_attrs])``: Reduce this Dataset's data by applying ``min`` along some dimension(s).
- ``Dataset.mean([dim, skipna, keep_attrs])``: Reduce this Dataset's data by applying ``mean`` along some dimension(s).
- ``Dataset.median([dim, skipna, keep_attrs])``: Reduce this Dataset's data by applying ``median`` along some dimension(s).
- ``Dataset.prod([dim, skipna, min_count, ...])``: Reduce this Dataset's data by applying ``prod`` along some dimension(s).
- ``Dataset.sum([dim, skipna, min_count, keep_attrs])``: Reduce this Dataset's data by applying ``sum`` along some dimension(s).
- ``Dataset.std([dim, skipna, ddof, keep_attrs])``: Reduce this Dataset's data by applying ``std`` along some dimension(s).
- ``Dataset.var([dim, skipna, ddof, keep_attrs])``: Reduce this Dataset's data by applying ``var`` along some dimension(s).
- ``Dataset.cumsum([dim, skipna, keep_attrs])``: Reduce this Dataset's data by applying ``cumsum`` along some dimension(s).
- ``Dataset.cumprod([dim, skipna, keep_attrs])``: Reduce this Dataset's data by applying ``cumprod`` along some dimension(s).

ndarray methods
~~~~~~~~~~~~~~~

- ``Dataset.argsort([axis, kind, order])``: Returns the indices that would sort this array.
- ``Dataset.astype(dtype, *[, order, casting, ...])``: Copy of the xarray object, with data cast to a specified type.
- ``Dataset.clip([min, max, keep_attrs])``: Return an array whose values are limited to ``[min, max]``.
- ``Dataset.conj()``: Complex-conjugate all elements.
- ``Dataset.conjugate(*args, **kwargs)``: ``a.conj()``
- ``Dataset.imag``: The imaginary part of each data variable.
- ``Dataset.round(*args, **kwargs)``
- ``Dataset.real``: The real part of each data variable.
- ``Dataset.rank(dim, *[, pct, keep_attrs])``: Ranks the data.

Reshaping and reorganizing
~~~~~~~~~~~~~~~~~~~~~~~~~~

- ``Dataset.transpose(*dim[, missing_dims])``: Return a new Dataset object with all array dimensions transposed.
- ``Dataset.stack([dim, create_index, index_cls])``: Stack any number of existing dimensions into a single new dimension.
- ``Dataset.unstack([dim, fill_value, sparse])``: Unstack existing dimensions corresponding to MultiIndexes into multiple new dimensions.
- ``Dataset.to_stacked_array(new_dim, sample_dims)``: Combine variables of differing dimensionality into a DataArray without broadcasting.
- ``Dataset.shift([shifts, fill_value])``: Shift this dataset by an offset along one or more dimensions.
- ``Dataset.roll([shifts, roll_coords])``: Roll this dataset by an offset along one or more dimensions.
- ``Dataset.pad([pad_width, mode, stat_length, ...])``: Pad this dataset along one or more dimensions.
- ``Dataset.sortby(variables[, ascending])``: Sort object by labels or values (along an axis).
- ``Dataset.broadcast_like(other[, exclude])``: Broadcast this DataArray against another Dataset or DataArray.

DataArray
---------

- ``DataArray([data, coords, dims, name, attrs, ...])``: N-dimensional array with labeled coordinates and dimensions.

Attributes
~~~~~~~~~~

- ``DataArray.values``: The array's data converted to numpy.ndarray.
- ``DataArray.data``: The DataArray's data as an array.
- ``DataArray.coords``: Mapping of ``DataArray`` objects corresponding to coordinate variables.
- ``DataArray.dims``: Tuple of dimension names associated with this array.
- ``DataArray.sizes``: Ordered mapping from dimension names to lengths.
- ``DataArray.name``: The name of this array.
- ``DataArray.attrs``: Dictionary storing arbitrary metadata with this array.
- ``DataArray.encoding``: Dictionary of format-specific settings for how this array should be serialized.
- ``DataArray.indexes``: Mapping of pandas.Index objects used for label based indexing.
- ``DataArray.xindexes``: Mapping of ``Index`` objects used for label based indexing.
- ``DataArray.chunksizes``: Mapping from dimension names to block lengths for this dataarray's data.

ndarray attributes
~~~~~~~~~~~~~~~~~~

- ``DataArray.ndim``: Number of array dimensions.
---|---
`DataArray.nbytes` | Total bytes consumed by the elements of this DataArray's data.
`DataArray.shape` | Tuple of array dimensions.
`DataArray.size` | Number of elements in the array.
`DataArray.dtype` | Data-type of the array's elements.
`DataArray.chunks` | Tuple of block lengths for this dataarray's data, in order of dimensions, or None if the underlying data is not a dask array.

### DataArray contents

`DataArray.assign_coords`([coords]) | Assign new coordinates to this object.
---|---
`DataArray.assign_attrs`(*args, **kwargs) | Assign new attrs to this object.
`DataArray.pipe`(func, *args, **kwargs) | Apply `func(self, *args, **kwargs)`
`DataArray.rename`([new_name_or_name_dict]) | Returns a new DataArray with renamed coordinates, dimensions or a new name.
`DataArray.swap_dims`([dims_dict]) | Returns a new DataArray with swapped dimensions.
`DataArray.expand_dims`([dim, axis, ...]) | Return a new object with an additional axis (or axes) inserted at the corresponding position in the array shape.
`DataArray.drop_vars`(names, *[, errors]) | Returns an array with dropped variables.
`DataArray.drop_indexes`(coord_names, *[, errors]) | Drop the indexes assigned to the given coordinates.
`DataArray.drop_duplicates`(dim, *[, keep]) | Returns a new DataArray with duplicate dimension values removed.
`DataArray.drop_encoding`() | Return a new DataArray without encoding on the array or any attached coords.
`DataArray.drop_attrs`(*[, deep]) | Removes all attributes from the DataArray.
`DataArray.reset_coords`([names, drop]) | Given names of coordinates, reset them to become variables.
`DataArray.copy`([deep, data]) | Returns a copy of this array.
`DataArray.convert_calendar`(calendar[, dim, ...]) | Convert the DataArray to another calendar.
`DataArray.interp_calendar`(target[, dim]) | Interpolates the DataArray to another calendar based on decimal year measure.
`DataArray.get_index`(key) | Get an index for a dimension, with fall-back to a default RangeIndex.
`DataArray.astype`(dtype, *[, order, casting, ...]) | Copy of the xarray object, with data cast to a specified type.
`DataArray.item`(*args) | Copy an element of an array to a standard Python scalar and return it.

### Indexing

`DataArray.__getitem__`(key) |
---|---
`DataArray.__setitem__`(key, value) |
`DataArray.loc` | Attribute for location based indexing like pandas.
`DataArray.isel`([indexers, drop, missing_dims]) | Return a new DataArray whose data is given by selecting indexes along the specified dimension(s).
`DataArray.sel`([indexers, method, tolerance, ...]) | Return a new DataArray whose data is given by selecting index labels along the specified dimension(s).
`DataArray.drop_sel`([labels, errors]) | Drop index labels from this DataArray.
`DataArray.drop_isel`([indexers]) | Drop index positions from this DataArray.
`DataArray.head`([indexers]) | Return a new DataArray whose data is given by the first n values along the specified dimension(s).
`DataArray.tail`([indexers]) | Return a new DataArray whose data is given by the last n values along the specified dimension(s).
`DataArray.thin`([indexers]) | Return a new DataArray whose data is given by each n-th value along the specified dimension(s).
`DataArray.squeeze`([dim, drop, axis]) | Return a new object with squeezed data.
`DataArray.interp`([coords, method, ...]) | Interpolate a DataArray onto new coordinates.
`DataArray.interp_like`(other[, method, ...]) | Interpolate this object onto the coordinates of another object, filling out of range values with NaN.
`DataArray.reindex`([indexers, method, ...]) | Conform this object onto the indexes of another object, filling in missing values with `fill_value`.
`DataArray.reindex_like`(other, *[, method, ...]) | Conform this object onto the indexes of another object, for indexes which the objects share.
`DataArray.set_index`([indexes, append]) | Set DataArray (multi-)indexes using one or more existing coordinates.
`DataArray.reset_index`(dims_or_levels[, drop]) | Reset the specified index(es) or multi-index level(s).
`DataArray.set_xindex`(coord_names[, index_cls]) | Set a new, Xarray-compatible index from one or more existing coordinate(s).
`DataArray.reorder_levels`([dim_order]) | Rearrange index levels using input order.
`DataArray.query`([queries, parser, engine, ...]) | Return a new data array indexed along the specified dimension(s), where the indexers are given as strings containing Python expressions to be evaluated against the values in the array.
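A minimal sketch of the positional, label-based, and pandas-style selection listed above (the array contents are invented for illustration):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords={"x": ["a", "b"], "y": [10, 20, 30]},
    dims=("x", "y"),
)

da.isel(x=0)                    # positional: first row
da.sel(x="a", y=20)             # label-based selection
da.loc["a", 10:20]              # pandas-style label slicing (inclusive)
da.sel(y=22, method="nearest")  # inexact label lookup with a fill rule
```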
### Missing value handling

`DataArray.isnull`([keep_attrs]) | Test each value in the array for whether it is a missing value.
---|---
`DataArray.notnull`([keep_attrs]) | Test each value in the array for whether it is not a missing value.
`DataArray.combine_first`(other) | Combine two DataArray objects, with union of coordinates.
`DataArray.count`([dim, keep_attrs]) | Reduce this DataArray's data by applying `count` along some dimension(s).
`DataArray.dropna`(dim, *[, how, thresh]) | Returns a new array with dropped labels for missing values along the provided dimension.
`DataArray.fillna`(value) | Fill missing values in this object.
`DataArray.ffill`(dim[, limit]) | Fill NaN values by propagating values forward
`DataArray.bfill`(dim[, limit]) | Fill NaN values by propagating values backward
`DataArray.interpolate_na`([dim, method, ...]) | Fill in NaNs by interpolating according to different methods.
`DataArray.where`(cond[, other, drop]) | Filter elements from this object according to a condition.
`DataArray.isin`(test_elements) | Tests each value in the array for whether it is in test elements.
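A short sketch of the missing-value helpers above on a one-dimensional array (the data is invented for illustration):

```python
import numpy as np
import xarray as xr

da = xr.DataArray([1.0, np.nan, 3.0, np.nan], dims="x", coords={"x": np.arange(4)})

da.isnull()                 # boolean mask of missing values
da.fillna(0.0)              # replace NaN with a constant
da.ffill(dim="x")           # propagate the last valid value forward
da.interpolate_na(dim="x")  # fill NaN by interpolation (linear by default)
da.where(da > 1)            # mask values failing a condition with NaN
```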
### Comparisons

`DataArray.equals`(other) | True if two DataArrays have the same dimensions, coordinates and values; otherwise False.
---|---
`DataArray.identical`(other) | Like equals, but also checks the array name and attributes, and attributes on all coordinates.
`DataArray.broadcast_equals`(other) | Two DataArrays are broadcast equal if they are equal after broadcasting them against each other such that they have the same dimensions.

### Computation

`DataArray.reduce`(func[, dim, axis, ...]) | Reduce this array by applying func along some dimension(s).
---|---
`DataArray.groupby`([group, squeeze, ...]) | Returns a DataArrayGroupBy object for performing grouped operations.
`DataArray.groupby_bins`(group, bins[, right, ...]) | Returns a DataArrayGroupBy object for performing grouped operations.
`DataArray.rolling`([dim, min_periods, center]) | Rolling window object for DataArrays.
`DataArray.rolling_exp`([window, window_type]) | Exponentially-weighted moving window.
`DataArray.cumulative`(dim[, min_periods]) | Accumulating object for DataArrays.
`DataArray.weighted`(weights) | Weighted DataArray operations.
`DataArray.coarsen`([dim, boundary, side, ...]) | Coarsen object for DataArrays.
`DataArray.resample`([indexer, skipna, ...]) | Returns a Resample object for performing resampling operations.
`DataArray.get_axis_num`(dim) | Return axis number(s) corresponding to dimension(s) in this array.
`DataArray.diff`(dim[, n, label]) | Calculate the n-th order discrete difference along given axis.
`DataArray.dot`(other[, dim]) | Perform dot product of two DataArrays along their shared dims.
`DataArray.quantile`(q[, dim, method, ...]) | Compute the qth quantile of the data along the specified dimension.
`DataArray.differentiate`(coord[, edge_order, ...]) | Differentiate the array with the second order accurate central differences.
`DataArray.integrate`([coord, datetime_unit]) | Integrate along the given coordinate using the trapezoidal rule.
`DataArray.polyfit`(dim, deg[, skipna, rcond, ...]) | Least squares polynomial fit.
`DataArray.map_blocks`(func[, args, kwargs, ...]) | Apply a function to each block of this DataArray.
`DataArray.curvefit`(coords, func[, ...]) | Curve fitting optimization for arbitrary functions.

### Aggregation

`DataArray.all`([dim, keep_attrs]) | Reduce this DataArray's data by applying `all` along some dimension(s).
---|---
`DataArray.any`([dim, keep_attrs]) | Reduce this DataArray's data by applying `any` along some dimension(s).
`DataArray.argmax`([dim, axis, keep_attrs, skipna]) | Index or indices of the maximum of the DataArray over one or more dimensions.
`DataArray.argmin`([dim, axis, keep_attrs, skipna]) | Index or indices of the minimum of the DataArray over one or more dimensions.
`DataArray.count`([dim, keep_attrs]) | Reduce this DataArray's data by applying `count` along some dimension(s).
`DataArray.idxmax`([dim, skipna, fill_value, ...]) | Return the coordinate label of the maximum value along a dimension.
`DataArray.idxmin`([dim, skipna, fill_value, ...]) | Return the coordinate label of the minimum value along a dimension.
`DataArray.max`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by applying `max` along some dimension(s).
`DataArray.min`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by applying `min` along some dimension(s).
`DataArray.mean`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by applying `mean` along some dimension(s).
`DataArray.median`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by applying `median` along some dimension(s).
`DataArray.prod`([dim, skipna, min_count, ...]) | Reduce this DataArray's data by applying `prod` along some dimension(s).
`DataArray.sum`([dim, skipna, min_count, ...]) | Reduce this DataArray's data by applying `sum` along some dimension(s).
`DataArray.std`([dim, skipna, ddof, keep_attrs]) | Reduce this DataArray's data by applying `std` along some dimension(s).
`DataArray.var`([dim, skipna, ddof, keep_attrs]) | Reduce this DataArray's data by applying `var` along some dimension(s).
`DataArray.cumsum`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by applying `cumsum` along some dimension(s).
`DataArray.cumprod`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by applying `cumprod` along some dimension(s).

### ndarray methods

`DataArray.argsort`([axis, kind, order]) | Returns the indices that would sort this array.
---|---
`DataArray.clip`([min, max, keep_attrs]) | Return an array whose values are limited to `[min, max]`.
`DataArray.conj`() | Complex-conjugate all elements.
`DataArray.conjugate`(*args, **kwargs) | a.conj()
`DataArray.imag` | The imaginary part of the array.
`DataArray.searchsorted`(v[, side, sorter]) | Find indices where elements of v should be inserted in a to maintain order.
`DataArray.round`(*args, **kwargs) |
`DataArray.real` | The real part of the array.
`DataArray.T` |
`DataArray.rank`(dim, *[, pct, keep_attrs]) | Ranks the data.

### String manipulation

`DataArray.str` |
---|---

`DataArray.str.capitalize`() | Convert strings in the array to be capitalized.
---|---
`DataArray.str.casefold`() | Convert strings in the array to be casefolded.
`DataArray.str.cat`(*others[, sep]) | Concatenate strings elementwise in the DataArray with other strings.
`DataArray.str.center`(width[, fillchar]) | Pad left and right side of each string in the array.
`DataArray.str.contains`(pat[, case, flags, regex]) | Test if pattern or regex is contained within each string of the array.
`DataArray.str.count`(pat[, flags, case]) | Count occurrences of pattern in each string of the array.
`DataArray.str.decode`(encoding[, errors]) | Decode character string in the array using indicated encoding.
`DataArray.str.encode`(encoding[, errors]) | Encode character string in the array using indicated encoding.
`DataArray.str.endswith`(pat) | Test if the end of each string in the array matches a pattern.
`DataArray.str.extract`(pat, dim[, case, flags]) | Extract the first match of capture groups in the regex pat as a new dimension in a DataArray.
`DataArray.str.extractall`(pat, group_dim, ...) | Extract all matches of capture groups in the regex pat as new dimensions in a DataArray.
`DataArray.str.find`(sub[, start, end, side]) | Return lowest or highest indexes in each string in the array where the substring is fully contained between [start:end].
`DataArray.str.findall`(pat[, case, flags]) | Find all occurrences of pattern or regular expression in the DataArray.
`DataArray.str.format`(*args, **kwargs) | Perform python string formatting on each element of the DataArray.
`DataArray.str.get`(i[, default]) | Extract character number i from each string in the array.
`DataArray.str.get_dummies`(dim[, sep]) | Return DataArray of dummy/indicator variables.
`DataArray.str.index`(sub[, start, end, side]) | Return lowest or highest indexes in each string where the substring is fully contained between [start:end].
`DataArray.str.isalnum`() | Check whether all characters in each string are alphanumeric.
`DataArray.str.isalpha`() | Check whether all characters in each string are alphabetic.
`DataArray.str.isdecimal`() | Check whether all characters in each string are decimal.
`DataArray.str.isdigit`() | Check whether all characters in each string are digits.
`DataArray.str.islower`() | Check whether all characters in each string are lowercase.
`DataArray.str.isnumeric`() | Check whether all characters in each string are numeric.
`DataArray.str.isspace`() | Check whether all characters in each string are spaces.
`DataArray.str.istitle`() | Check whether all characters in each string are titlecase.
`DataArray.str.isupper`() | Check whether all characters in each string are uppercase.
`DataArray.str.join`([dim, sep]) | Concatenate strings in a DataArray along a particular dimension.
`DataArray.str.len`() | Compute the length of each string in the array.
`DataArray.str.ljust`(width[, fillchar]) | Pad right side of each string in the array.
`DataArray.str.lower`() | Convert strings in the array to lowercase.
`DataArray.str.lstrip`([to_strip]) | Remove leading characters.
`DataArray.str.match`(pat[, case, flags]) | Determine if each string in the array matches a regular expression.
`DataArray.str.normalize`(form) | Return the Unicode normal form for the strings in the dataarray.
`DataArray.str.pad`(width[, side, fillchar]) | Pad strings in the array up to width.
`DataArray.str.partition`(dim[, sep]) | Split the strings in the DataArray at the first occurrence of separator sep.
`DataArray.str.repeat`(repeats) | Repeat each string in the array.
`DataArray.str.replace`(pat, repl[, n, case, ...]) | Replace occurrences of pattern/regex in the array with some string.
`DataArray.str.rfind`(sub[, start, end]) | Return highest indexes in each string in the array where the substring is fully contained between [start:end].
`DataArray.str.rindex`(sub[, start, end]) | Return highest indexes in each string where the substring is fully contained between [start:end].
`DataArray.str.rjust`(width[, fillchar]) | Pad left side of each string in the array.
`DataArray.str.rpartition`(dim[, sep]) | Split the strings in the DataArray at the last occurrence of separator sep.
`DataArray.str.rsplit`(dim[, sep, maxsplit]) | Split strings in a DataArray around the given separator/delimiter sep.
`DataArray.str.rstrip`([to_strip]) | Remove trailing characters.
`DataArray.str.slice`([start, stop, step]) | Slice substrings from each string in the array.
`DataArray.str.slice_replace`([start, stop, repl]) | Replace a positional slice of a string with another value.
`DataArray.str.split`(dim[, sep, maxsplit]) | Split strings in a DataArray around the given separator/delimiter sep.
`DataArray.str.startswith`(pat) | Test if the start of each string in the array matches a pattern.
`DataArray.str.strip`([to_strip, side]) | Remove leading and trailing characters.
`DataArray.str.swapcase`() | Convert strings in the array to be swapcased.
`DataArray.str.title`() | Convert strings in the array to titlecase.
`DataArray.str.translate`(table) | Map characters of each string through the given mapping table.
`DataArray.str.upper`() | Convert strings in the array to uppercase.
`DataArray.str.wrap`(width, **kwargs) | Wrap long strings in the array in paragraphs with length less than width.
`DataArray.str.zfill`(width) | Pad each string in the array by prepending '0' characters.
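The `.str` accessor applies these operations elementwise; a minimal sketch (the strings are invented for illustration):

```python
import xarray as xr

names = xr.DataArray(["ice_cream", "Pizza", "tacos"], dims="food")

names.str.upper()            # elementwise uppercase
names.str.contains("a")      # boolean mask from a pattern
names.str.replace("_", " ")  # substring replacement
names.str.len()              # length of each string
```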
### Datetimelike properties

**Datetime properties**:

`DataArray.dt.year` | The year of the datetime
---|---
`DataArray.dt.month` | The month as January=1, December=12
`DataArray.dt.day` | The days of the datetime
`DataArray.dt.hour` | The hours of the datetime
`DataArray.dt.minute` | The minutes of the datetime
`DataArray.dt.second` | The seconds of the datetime
`DataArray.dt.microsecond` | The microseconds of the datetime
`DataArray.dt.nanosecond` | The nanoseconds of the datetime
`DataArray.dt.dayofweek` | The day of the week with Monday=0, Sunday=6
`DataArray.dt.weekday` | The day of the week with Monday=0, Sunday=6
`DataArray.dt.dayofyear` | The ordinal day of the year
`DataArray.dt.quarter` | The quarter of the date
`DataArray.dt.days_in_month` | The number of days in the month
`DataArray.dt.daysinmonth` | The number of days in the month
`DataArray.dt.days_in_year` | The number of days in the year for each datetime
`DataArray.dt.season` | Season of the year
`DataArray.dt.time` | Timestamps corresponding to datetimes
`DataArray.dt.date` | Date corresponding to datetimes
`DataArray.dt.decimal_year` | Each datetime as the year plus the fraction of the year elapsed.
`DataArray.dt.calendar` | The name of the calendar of the dates.
`DataArray.dt.is_month_start` | Indicate whether the date is the first day of the month
`DataArray.dt.is_month_end` | Indicate whether the date is the last day of the month
`DataArray.dt.is_quarter_end` | Indicate whether the date is the last day of a quarter
`DataArray.dt.is_year_start` | Indicate whether the date is the first day of a year
`DataArray.dt.is_leap_year` | Indicate if the date belongs to a leap year
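The `.dt` accessor exposes these per element; a minimal sketch (the dates are invented for illustration):

```python
import pandas as pd
import xarray as xr

times = xr.DataArray(pd.date_range("2024-01-01", periods=4, freq="7D"), dims="time")

times.dt.year       # the year of each element
times.dt.dayofweek  # Monday=0 ... Sunday=6
times.dt.season     # "DJF", "MAM", "JJA" or "SON"
```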
**Datetime methods**:

`DataArray.dt.floor`(freq) | Round timestamps downward to specified frequency resolution.
---|---
`DataArray.dt.ceil`(freq) | Round timestamps upward to specified frequency resolution.
`DataArray.dt.isocalendar`() | Dataset containing ISO year, week number, and weekday.
`DataArray.dt.round`(freq) | Round timestamps to specified frequency resolution.
`DataArray.dt.strftime`(date_format) | Return an array of formatted strings specified by date_format, which supports the same string format as the python standard library.

**Timedelta properties**:

`DataArray.dt.days` | Number of days for each element
---|---
`DataArray.dt.seconds` | Number of seconds (>= 0 and less than 1 day) for each element
`DataArray.dt.microseconds` | Number of microseconds (>= 0 and less than 1 second) for each element
`DataArray.dt.nanoseconds` | Number of nanoseconds (>= 0 and less than 1 microsecond) for each element
`DataArray.dt.total_seconds` |

**Timedelta methods**:

`DataArray.dt.floor`(freq) | Round timestamps downward to specified frequency resolution.
---|---
`DataArray.dt.ceil`(freq) | Round timestamps upward to specified frequency resolution.
`DataArray.dt.round`(freq) | Round timestamps to specified frequency resolution.

### Reshaping and reorganizing

`DataArray.transpose`(*dim[, ...]) | Return a new DataArray object with transposed dimensions.
---|---
`DataArray.stack`([dim, create_index, index_cls]) | Stack any number of existing dimensions into a single new dimension.
`DataArray.unstack`([dim, fill_value, sparse]) | Unstack existing dimensions corresponding to MultiIndexes into multiple new dimensions.
`DataArray.to_unstacked_dataset`(dim[, level]) | Unstack DataArray expanding to Dataset along a given level of a stacked coordinate.
`DataArray.shift`([shifts, fill_value]) | Shift this DataArray by an offset along one or more dimensions.
`DataArray.roll`([shifts, roll_coords]) | Roll this array by an offset along one or more dimensions.
`DataArray.pad`([pad_width, mode, ...]) | Pad this array along one or more dimensions.
`DataArray.sortby`(variables[, ascending]) | Sort object by labels or values (along an axis).
`DataArray.broadcast_like`(other, *[, exclude]) | Broadcast this DataArray against another Dataset or DataArray.

## DataTree

### Creating a DataTree

Methods of creating a `DataTree`.

`DataTree`([dataset, children, name]) | A tree-like hierarchical collection of xarray objects.
---|---
`DataTree.from_dict`(d, /[, name]) | Create a datatree from a dictionary of data objects, organised by paths into the tree.
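A minimal sketch of building a tree from a dictionary of datasets (the group names and variables here are invented for illustration):

```python
import xarray as xr

tree = xr.DataTree.from_dict(
    {
        "/": xr.Dataset(attrs={"title": "root node"}),
        "/daily": xr.Dataset({"temp": ("time", [10.0, 11.0])}),
        "/monthly": xr.Dataset({"temp": ("time", [10.5])}),
    }
)

tree["daily"]  # a child node, itself a DataTree
tree.groups    # ('/', '/daily', '/monthly')
```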
### Tree Attributes

Attributes relating to the recursive tree-like structure of a `DataTree`.

`DataTree.parent` | Parent of this node.
---|---
`DataTree.children` | Child nodes of this node, stored under a mapping via their names.
`DataTree.name` | The name of this node.
`DataTree.path` | Return the file-like path from the root to this node.
`DataTree.root` | Root node of the tree
`DataTree.is_root` | Whether this node is the tree root.
`DataTree.is_leaf` | Whether this node is a leaf node.
`DataTree.leaves` | All leaf nodes.
`DataTree.level` | Level of this node.
`DataTree.depth` | Maximum level of this tree.
`DataTree.width` | Number of nodes at this level in the tree.
`DataTree.subtree` | Iterate over all nodes in this tree, including both self and all descendants.
`DataTree.subtree_with_keys` | Iterate over relative paths and node pairs for all nodes in this tree.
`DataTree.descendants` | Child nodes and all their child nodes.
`DataTree.siblings` | Nodes with the same parent as this node.
`DataTree.lineage` | All parent nodes and their parent nodes, starting with the closest.
`DataTree.parents` | All parent nodes and their parent nodes, starting with the closest.
`DataTree.ancestors` | All parent nodes and their parent nodes, starting with the most distant.
`DataTree.groups` | Return all groups in the tree, given as a tuple of path-like strings.
`DataTree.xindexes` | Mapping of xarray Index objects used for label based indexing.

### Data Contents

Interface to the data objects (optionally) stored inside a single `DataTree` node. This interface echoes that of `xarray.Dataset`.

`DataTree.dims` | Mapping from dimension names to lengths.
---|---
`DataTree.sizes` | Mapping from dimension names to lengths.
`DataTree.data_vars` | Dictionary of DataArray objects corresponding to data variables
`DataTree.ds` | An immutable Dataset-like view onto the data in this node.
`DataTree.coords` | Dictionary of xarray.DataArray objects corresponding to coordinate variables
`DataTree.attrs` | Dictionary of global attributes on this node object.
`DataTree.encoding` | Dictionary of global encoding attributes on this node object.
`DataTree.indexes` | Mapping of pandas.Index objects used for label based indexing.
`DataTree.nbytes` |
`DataTree.dataset` | An immutable Dataset-like view onto the data in this node.
`DataTree.to_dataset`([inherit]) | Return the data in this node as a new xarray.Dataset object.
`DataTree.has_data` | Whether or not there are any variables in this node.
`DataTree.has_attrs` | Whether or not there are any metadata attributes in this node.
`DataTree.is_empty` | False if node contains any data or attrs.
`DataTree.is_hollow` | True if only leaf nodes contain data.
`DataTree.chunksizes` | Mapping from group paths to a mapping of chunksizes.

### Dictionary Interface

`DataTree` objects also have a dict-like interface mapping keys to either `xarray.DataArray`s or to child `DataTree` nodes.

`DataTree.__getitem__`(key) | Access child nodes, variables, or coordinates stored anywhere in this tree.
---|---
`DataTree.__setitem__`(key, value) | Add either a child node or an array to the tree, at any position.
`DataTree.__delitem__`(key) | Remove a variable or child node from this datatree node.
`DataTree.update`(other) | Update this node's children and / or variables.
`DataTree.get`(key[, default]) | Access child nodes, variables, or coordinates stored in this node.
`DataTree.items`() |
`DataTree.keys`() |
`DataTree.values`() |

### Tree Manipulation

For manipulating, traversing, navigating, or mapping over the tree structure.

`DataTree.orphan`() | Detach this node from its parent.
---|---
`DataTree.same_tree`(other) | True if other node is in the same tree as this node.
`DataTree.relative_to`(other) | Compute the relative path from this node to node other.
`DataTree.iter_lineage`() | Iterate up the tree, starting from the current node.
`DataTree.find_common_ancestor`(other) | Find the first common ancestor of two nodes in the same tree.
`DataTree.map_over_datasets`(func, *args[, kwargs]) | Apply a function to every dataset in this subtree, returning a new tree which stores the results.
`DataTree.pipe`(func, *args, **kwargs) | Apply `func(self, *args, **kwargs)`
`DataTree.match`(pattern) | Return nodes with paths matching pattern.
`DataTree.filter`(filterfunc) | Filter nodes according to a specified condition.
`DataTree.filter_like`(other) | Filter a datatree like another datatree.

### Pathlib-like Interface

`DataTree` objects deliberately echo some of the API of `pathlib.PurePath`.

`DataTree.name` | The name of this node.
---|---
`DataTree.parent` | Parent of this node.
`DataTree.parents` | All parent nodes and their parent nodes, starting with the closest.
`DataTree.relative_to`(other) | Compute the relative path from this node to node other.

### DataTree Contents

Manipulate the contents of all nodes in a `DataTree` simultaneously.

`DataTree.copy`(*[, inherit, deep]) | Returns a copy of this subtree.
---|---

### DataTree Node Contents

Manipulate the contents of a single `DataTree` node.

`DataTree.assign`([items]) | Assign new data variables or child nodes to a DataTree, returning a new object with all the original items in addition to the new ones.
---|---
`DataTree.drop_nodes`(names, *[, errors]) | Drop child nodes from this node.

### DataTree Operations

Apply operations over multiple `DataTree` objects.

`map_over_datasets`(func, *args[, kwargs]) | Applies a function to every dataset in one or more DataTree objects with the same structure (i.e., that are isomorphic), returning new trees which store the results.
---|---
`group_subtrees`(*trees) | Iterate over subtrees grouped by relative paths in breadth-first order.

### Comparisons

Compare one `DataTree` object to another.

`DataTree.isomorphic`(other) | Two DataTrees are considered isomorphic if the set of paths to their descendent nodes are the same.
---|---
`DataTree.equals`(other) | Two DataTrees are equal if they have isomorphic node structures, with matching node names, and if they have matching variables and coordinates, all of which are equal.
`DataTree.identical`(other) | Like equals, but also checks attributes on all datasets, variables and coordinates, and requires that any inherited coordinates at the tree root are also inherited on the other tree.

### Indexing

Index into all nodes in the subtree simultaneously.

`DataTree.isel`([indexers, drop, missing_dims]) | Returns a new data tree with each array indexed along the specified dimension(s).
---|---
`DataTree.sel`([indexers, method, tolerance, drop]) | Returns a new data tree with each array indexed by tick labels along the specified dimension(s).

### Aggregation

Aggregate data in all nodes in the subtree simultaneously.

`DataTree.all`([dim, keep_attrs]) | Reduce this DataTree's data by applying `all` along some dimension(s).
---|---
`DataTree.any`([dim, keep_attrs]) | Reduce this DataTree's data by applying `any` along some dimension(s).
`DataTree.max`([dim, skipna, keep_attrs]) | Reduce this DataTree's data by applying `max` along some dimension(s).
`DataTree.min`([dim, skipna, keep_attrs]) | Reduce this DataTree's data by applying `min` along some dimension(s).
`DataTree.mean`([dim, skipna, keep_attrs]) | Reduce this DataTree's data by applying `mean` along some dimension(s).
`DataTree.median`([dim, skipna, keep_attrs]) | Reduce this DataTree's data by applying `median` along some dimension(s).
`DataTree.prod`([dim, skipna, min_count, ...]) | Reduce this DataTree's data by applying `prod` along some dimension(s).
`DataTree.sum`([dim, skipna, min_count, ...]) | Reduce this DataTree's data by applying `sum` along some dimension(s).
`DataTree.std`([dim, skipna, ddof, keep_attrs]) | Reduce this DataTree's data by applying `std` along some dimension(s).
`DataTree.var`([dim, skipna, ddof, keep_attrs]) | Reduce this DataTree's data by applying `var` along some dimension(s).
`DataTree.cumsum`([dim, skipna, keep_attrs]) | Reduce this DataTree's data by applying `cumsum` along some dimension(s).
`DataTree.cumprod`([dim, skipna, keep_attrs]) | Reduce this DataTree's data by applying `cumprod` along some dimension(s).

### ndarray methods

Methods copied from `numpy.ndarray` objects, here applying to the data in all nodes in the subtree.

`DataTree.argsort`([axis, kind, order]) | Returns the indices that would sort this array.
---|---
`DataTree.conj`() | Complex-conjugate all elements.
`DataTree.conjugate`(*args, **kwargs) | a.conj()
`DataTree.round`(*args, **kwargs) |

## Coordinates

### Creating coordinates

`Coordinates`([coords, indexes]) | Dictionary like container for Xarray coordinates (variables + indexes).
---|---
`Coordinates.from_xindex`(index) | Create Xarray coordinates from an existing Xarray index.
`Coordinates.from_pandas_multiindex`(midx, dim) | Wrap a pandas multi-index as Xarray coordinates (dimension + levels).

### Attributes

`Coordinates.dims` | Mapping from dimension names to lengths or tuple of dimension names.
---|---
`Coordinates.sizes` | Mapping from dimension names to lengths.
`Coordinates.dtypes` | Mapping from coordinate names to dtypes.
`Coordinates.variables` | Low level interface to Coordinates contents as dict of Variable objects.
`Coordinates.indexes` | Mapping of pandas.Index objects used for label based indexing.
`Coordinates.xindexes` | Mapping of `Index` objects used for label based indexing.

### Dictionary Interface

Coordinates implement the mapping interface with keys given by variable names and values given by `DataArray` objects.

`Coordinates.__getitem__`(key) |
---|---
`Coordinates.__setitem__`(key, value) |
`Coordinates.__delitem__`(key) |
`Coordinates.update`(other) | Update this Coordinates variables with other coordinate variables.
`Coordinates.get`(k[, d]) |
`Coordinates.items`() |
`Coordinates.keys`() |
`Coordinates.values`() |

### Coordinates contents

`Coordinates.to_dataset`() | Convert these coordinates into a new Dataset.
---|---
`Coordinates.to_index`([ordered_dims]) | Convert all index coordinates into a `pandas.Index`.
`Coordinates.assign`([coords]) | Assign new coordinates (and indexes) to a Coordinates object, returning a new object with all the original coordinates in addition to the new ones.
`Coordinates.merge`(other) | Merge two sets of coordinates to create a new Dataset
`Coordinates.copy`([deep, memo]) | Return a copy of this Coordinates object.

### Comparisons

`Coordinates.equals`(other) | Two Coordinates objects are equal if they have matching variables, all of which are equal.
---|---
`Coordinates.identical`(other) | Like equals, but also checks all variable attributes.
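A minimal sketch of constructing `Coordinates` explicitly, here wrapping a pandas MultiIndex (the names are invented for illustration):

```python
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=("letter", "num"))
coords = xr.Coordinates.from_pandas_multiindex(midx, "x")

ds = xr.Dataset(coords=coords)  # "x", "letter" and "num" become coordinates
ds.coords["letter"]
```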
### Proxies

Coordinates that are accessed from the `coords` property of Dataset, DataArray and DataTree objects, respectively.

`core.coordinates.DatasetCoordinates`(dataset) | Dictionary like container for Dataset coordinates (variables + indexes).
---|---
`core.coordinates.DataArrayCoordinates`(dataarray) | Dictionary like container for DataArray coordinates (variables + indexes).
`core.coordinates.DataTreeCoordinates`(datatree) | Dictionary like container for coordinates of a DataTree node (variables + indexes).

## Indexes

Default, pandas-backed indexes built-in to Xarray:

`indexes.PandasIndex`(array, dim[, ...]) | Wrap a pandas.Index as an xarray compatible index.
---|---
`indexes.PandasMultiIndex`(array, dim[, ...]) | Wrap a pandas.MultiIndex as an xarray compatible index.

More complex indexes built-in to Xarray:

`CFTimeIndex`(data[, name]) | Custom Index for working with CF calendars and dates.
---|---
`indexes.RangeIndex`(transform) | Xarray index implementing a simple bounded 1-dimensional interval with evenly spaced, monotonic floating-point values.
`indexes.NDPointIndex`(tree_obj, *, ...) | Xarray index for irregular, n-dimensional data.

### Creating indexes

`cftime_range`([start, end, periods, freq, ...]) | Return a fixed frequency CFTimeIndex.
---|---
`date_range`([start, end, periods, freq, tz, ...]) | Return a fixed frequency datetime index.
`date_range_like`(source, calendar[, use_cftime]) | Generate a datetime array with the same frequency, start and end as another one, but in a different calendar.
`indexes.RangeIndex.arange`([start, stop, ...]) | Create a new RangeIndex from given start, stop and step values.
`indexes.RangeIndex.linspace`(start, stop[, ...]) | Create a new RangeIndex from given start / stop values and number of values.
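A minimal sketch of creating time indexes with these helpers:

```python
import xarray as xr

# Standard, pandas-backed datetime index:
times = xr.date_range("2000-01-01", periods=12, freq="MS")

# The same range on a non-standard calendar, backed by a CFTimeIndex:
noleap = xr.date_range("2000-01-01", periods=12, freq="MS", calendar="noleap")
```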
## Universal functions

These functions are equivalent to their NumPy versions, but for xarray objects backed by non-NumPy array types (e.g. `cupy`, `sparse`, or `jax`), they will ensure that the computation is dispatched to the appropriate backend. You can find them in the `xarray.ufuncs` module:

`ufuncs.abs` | xarray specific variant of `numpy.abs()`.
---|---
`ufuncs.absolute` | xarray specific variant of `numpy.absolute()`.
`ufuncs.acos` | xarray specific variant of `numpy.acos()`.
`ufuncs.acosh` | xarray specific variant of `numpy.acosh()`.
`ufuncs.arccos` | xarray specific variant of `numpy.arccos()`.
`ufuncs.arccosh` | xarray specific variant of `numpy.arccosh()`.
`ufuncs.arcsin` | xarray specific variant of `numpy.arcsin()`.
`ufuncs.arcsinh` | xarray specific variant of `numpy.arcsinh()`.
`ufuncs.arctan` | xarray specific variant of `numpy.arctan()`.
`ufuncs.arctanh` | xarray specific variant of `numpy.arctanh()`.
`ufuncs.asin` | xarray specific variant of `numpy.asin()`.
`ufuncs.asinh` | xarray specific variant of `numpy.asinh()`.
`ufuncs.atan` | xarray specific variant of `numpy.atan()`.
`ufuncs.atanh` | xarray specific variant of `numpy.atanh()`.
`ufuncs.bitwise_count` | xarray specific variant of `numpy.bitwise_count()`.
`ufuncs.bitwise_invert` | xarray specific variant of `numpy.bitwise_invert()`.
`ufuncs.bitwise_not` | xarray specific variant of `numpy.bitwise_not()`.
`ufuncs.cbrt` | xarray specific variant of `numpy.cbrt()`.
`ufuncs.ceil` | xarray specific variant of `numpy.ceil()`.
`ufuncs.conj` | xarray specific variant of `numpy.conj()`.
`ufuncs.conjugate` | xarray specific variant of `numpy.conjugate()`.
`ufuncs.cos` | xarray specific variant of `numpy.cos()`.
`ufuncs.cosh` | xarray specific variant of `numpy.cosh()`.
`ufuncs.deg2rad` | xarray specific variant of `numpy.deg2rad()`.
`ufuncs.degrees` | xarray specific variant of `numpy.degrees()`.
`ufuncs.exp` | xarray specific variant of `numpy.exp()`.
`ufuncs.exp2` | xarray specific variant of `numpy.exp2()`.
`ufuncs.expm1` | xarray specific variant of `numpy.expm1()`.
`ufuncs.fabs` | xarray specific variant of `numpy.fabs()`.
`ufuncs.floor` | xarray specific variant of `numpy.floor()`.
`ufuncs.invert` | xarray specific variant of `numpy.invert()`.
`ufuncs.isfinite` | xarray specific variant of `numpy.isfinite()`.
`ufuncs.isinf` | xarray specific variant of `numpy.isinf()`.
`ufuncs.isnan` | xarray specific variant of `numpy.isnan()`.
`ufuncs.isnat` | xarray specific variant of `numpy.isnat()`.
`ufuncs.log` | xarray specific variant of `numpy.log()`.
`ufuncs.log10` | xarray specific variant of `numpy.log10()`.
`ufuncs.log1p` | xarray specific variant of `numpy.log1p()`.
`ufuncs.log2` | xarray specific variant of `numpy.log2()`.
`ufuncs.logical_not` | xarray specific variant of `numpy.logical_not()`.
`ufuncs.negative` | xarray specific variant of `numpy.negative()`.
`ufuncs.positive` | xarray specific variant of `numpy.positive()`.
`ufuncs.rad2deg` | xarray specific variant of `numpy.rad2deg()`.
`ufuncs.radians` | xarray specific variant of `numpy.radians()`.
`ufuncs.reciprocal` | xarray specific variant of `numpy.reciprocal()`.
`ufuncs.rint` | xarray specific variant of `numpy.rint()`.
`ufuncs.sign` | xarray specific variant of `numpy.sign()`.
`ufuncs.signbit` | xarray specific variant of `numpy.signbit()`.
`ufuncs.sin` | xarray specific variant of `numpy.sin()`.
`ufuncs.sinh` | xarray specific variant of `numpy.sinh()`.
`ufuncs.spacing` | xarray specific variant of `numpy.spacing()`.
`ufuncs.sqrt` | xarray specific variant of `numpy.sqrt()`.
`ufuncs.square` | xarray specific variant of `numpy.square()`.
`ufuncs.tan` | xarray specific variant of `numpy.tan()`.
`ufuncs.tanh` | xarray specific variant of `numpy.tanh()`.
`ufuncs.trunc` | xarray specific variant of `numpy.trunc()`.
`ufuncs.add` | xarray specific variant of `numpy.add()`.
`ufuncs.arctan2` | xarray specific variant of `numpy.arctan2()`.
`ufuncs.atan2` | xarray specific variant of `numpy.atan2()`.
`ufuncs.bitwise_and` | xarray specific variant of `numpy.bitwise_and()`.
`ufuncs.bitwise_left_shift` | xarray specific variant of `numpy.bitwise_left_shift()`.
`ufuncs.bitwise_or` | xarray specific variant of `numpy.bitwise_or()`.
`ufuncs.bitwise_right_shift` | xarray specific variant of `numpy.bitwise_right_shift()`.
`ufuncs.bitwise_xor` | xarray specific variant of `numpy.bitwise_xor()`.
`ufuncs.copysign` | xarray specific variant of `numpy.copysign()`.
`ufuncs.divide` | xarray specific variant of `numpy.divide()`.
`ufuncs.equal` | xarray specific variant of `numpy.equal()`.
`ufuncs.float_power` | xarray specific variant of `numpy.float_power()`.
`ufuncs.floor_divide` | xarray specific variant of `numpy.floor_divide()`.
`ufuncs.fmax` | xarray specific variant of `numpy.fmax()`.
`ufuncs.fmin` | xarray specific variant of `numpy.fmin()`.
`ufuncs.fmod` | xarray specific variant of `numpy.fmod()`.
`ufuncs.gcd` | xarray specific variant of `numpy.gcd()`.
`ufuncs.greater` | xarray specific variant of `numpy.greater()`.
`ufuncs.greater_equal` | xarray specific variant of `numpy.greater_equal()`.
`ufuncs.heaviside` | xarray specific variant of `numpy.heaviside()`.
`ufuncs.hypot` | xarray specific variant of `numpy.hypot()`.
`ufuncs.lcm` | xarray specific variant of `numpy.lcm()`.
`ufuncs.ldexp` | xarray specific variant of `numpy.ldexp()`.
`ufuncs.left_shift` | xarray specific variant of `numpy.left_shift()`.
`ufuncs.less` | xarray specific variant of `numpy.less()`.
`ufuncs.less_equal` | xarray specific variant of `numpy.less_equal()`.
`ufuncs.logaddexp` | xarray specific variant of `numpy.logaddexp()`.
`ufuncs.logaddexp2` | xarray specific variant of `numpy.logaddexp2()`.
`ufuncs.logical_and` | xarray specific variant of `numpy.logical_and()`.
`ufuncs.logical_or` | xarray specific variant of `numpy.logical_or()`.
`ufuncs.logical_xor` | xarray specific variant of `numpy.logical_xor()`.
`ufuncs.maximum` | xarray specific variant of `numpy.maximum()`.
`ufuncs.minimum` | xarray specific variant of `numpy.minimum()`.
`ufuncs.mod` | xarray specific variant of `numpy.mod()`.
`ufuncs.multiply` | xarray specific variant of `numpy.multiply()`.
`ufuncs.nextafter` | xarray specific variant of `numpy.nextafter()`.
`ufuncs.not_equal` | xarray specific variant of `numpy.not_equal()`.
`ufuncs.pow` | xarray specific variant of `numpy.pow()`.
`ufuncs.power` | xarray specific variant of `numpy.power()`.
`ufuncs.remainder` | xarray specific variant of `numpy.remainder()`.
`ufuncs.right_shift` | xarray specific variant of `numpy.right_shift()`.
`ufuncs.subtract` | xarray specific variant of `numpy.subtract()`.
`ufuncs.true_divide` | xarray specific variant of `numpy.true_divide()`.
`ufuncs.angle` | xarray specific variant of `numpy.angle()`.
`ufuncs.isreal` | xarray specific variant of `numpy.isreal()`.
`ufuncs.iscomplex` | xarray specific variant of `numpy.iscomplex()`.
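For NumPy-backed arrays, plain `numpy` functions already work on xarray objects; the `xarray.ufuncs` variants matter when the underlying data lives in another backend. A minimal sketch:

```python
import numpy as np
import xarray as xr
import xarray.ufuncs as xu

da = xr.DataArray(np.linspace(0, np.pi, 5), dims="x")

xu.sin(da)    # like np.sin, but dispatches to the array's own backend
xu.isnan(da)  # elementwise, preserving dims, coords and attrs
```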
## IO / Conversion

### Dataset methods

`load_dataset`(filename_or_obj, **kwargs) | Open, load into memory, and close a Dataset from a file or file-like object.
---|---
`open_dataset`(filename_or_obj, *[, engine, ...]) | Open and decode a dataset from a file or file-like object.
`open_mfdataset`(paths[, chunks, concat_dim, ...]) | Open multiple files as a single dataset.
`open_zarr`(store[, group, synchronizer, ...]) | Load and decode a dataset from a Zarr store.
`save_mfdataset`(datasets, paths[, mode, ...]) | Write multiple datasets to disk as netCDF files simultaneously.
`Dataset.as_numpy`() | Coerces wrapped data and coordinates into numpy arrays, returning a Dataset.
`Dataset.from_dataframe`(dataframe[, sparse]) | Convert a pandas.DataFrame into an xarray.Dataset.
`Dataset.from_dict`(d) | Convert a dictionary into an xarray.Dataset.
`Dataset.to_dataarray`([dim, name]) | Convert this dataset into an xarray.DataArray.
`Dataset.to_dataframe`([dim_order]) | Convert this dataset into a pandas.DataFrame.
`Dataset.to_dask_dataframe`([dim_order, set_index]) | Convert this dataset into a dask.dataframe.DataFrame.
`Dataset.to_dict`([data, encoding]) | Convert this dataset to a dictionary following xarray naming conventions.
`Dataset.to_netcdf`([path, mode, format, ...]) | Write dataset contents to a netCDF file.
`Dataset.to_pandas`() | Convert this dataset into a pandas object without changing the number of dimensions.
`Dataset.to_zarr`([store, chunk_store, mode, ...]) | Write dataset contents to a zarr group.
`Dataset.chunk`([chunks, name_prefix, token, ...]) | Coerce all arrays in this dataset into dask arrays with the given chunks.
`Dataset.close`() | Release any resources linked to this object.
`Dataset.compute`(**kwargs) | Manually trigger loading and/or computation of this dataset's data from disk or a remote source into memory and return a new dataset.
`Dataset.filter_by_attrs`(**kwargs) | Returns a `Dataset` with variables that match specific conditions.
`Dataset.info`([buf]) | Concise summary of a Dataset variables and attributes.
`Dataset.load`(**kwargs) | Manually trigger loading and/or computation of this dataset's data from disk or a remote source into memory and return this dataset.
`Dataset.persist`(**kwargs) | Trigger computation, keeping data as chunked arrays.
`Dataset.unify_chunks`() | Unify chunk size along all chunked dimensions of this Dataset.

### DataArray methods

`load_dataarray`(filename_or_obj, **kwargs) | Open, load into memory, and close a DataArray from a file or file-like object containing a single data variable.
---|---
`open_dataarray`(filename_or_obj, *[, engine, ...]) | Open a DataArray from a file or file-like object containing a single data variable.
`DataArray.as_numpy`() | Coerces wrapped data and coordinates into numpy arrays, returning a DataArray.
`DataArray.from_dict`(d) | Convert a dictionary into an xarray.DataArray.
`DataArray.from_iris`(cube) | Convert an iris.cube.Cube into an xarray.DataArray.
`DataArray.from_series`(series[, sparse]) | Convert a pandas.Series into an xarray.DataArray.
`DataArray.to_dask_dataframe`([dim_order, ...]) | Convert this array into a dask.dataframe.DataFrame.
`DataArray.to_dataframe`([name, dim_order]) | Convert this array and its coordinates into a tidy pandas.DataFrame.
`DataArray.to_dataset`([dim, name, promote_attrs]) | Convert a DataArray to a Dataset.
`DataArray.to_dict`([data, encoding]) | Convert this xarray.DataArray into a dictionary following xarray naming conventions.
`DataArray.to_index`() | Convert this variable to a pandas.Index.
`DataArray.to_iris`() | Convert this array into an iris.cube.Cube.
`DataArray.to_masked_array`([copy]) | Convert this array into a numpy.ma.MaskedArray.
`DataArray.to_netcdf`([path, mode, format, ...]) | Write DataArray contents to a netCDF file.
`DataArray.to_numpy`() | Coerces wrapped data to numpy and returns a numpy.ndarray.
`DataArray.to_pandas`() | Convert this array into a pandas object with the same shape.
`DataArray.to_series`() | Convert this array into a pandas.Series.
`DataArray.to_zarr`([store, chunk_store, ...]) | Write DataArray contents to a Zarr store.
`DataArray.chunk`([chunks, name_prefix, ...]) | Coerce this array's data into a dask array with the given chunks.
`DataArray.close`() | Release any resources linked to this object.
`DataArray.compute`(**kwargs) | Manually trigger loading of this array's data from disk or a remote source into memory and return a new array.
`DataArray.persist`(**kwargs) | Trigger computation in constituent dask arrays.
`DataArray.load`(**kwargs) | Manually trigger loading of this array's data from disk or a remote source into memory and return this array.
`DataArray.unify_chunks`() | Unify chunk size along all chunked dimensions of this DataArray.
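A minimal round-trip sketch of the conversion and serialization methods above; the filename is a throwaway placeholder, and writing netCDF assumes a backend such as `netcdf4` or `h5netcdf` is installed:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"foo": ("x", np.arange(4))})

# On-disk round trip (requires a netCDF backend):
ds.to_netcdf("example.nc")
reopened = xr.open_dataset("example.nc")

# In-memory conversions:
df = ds.to_dataframe()               # pandas.DataFrame
ds2 = xr.Dataset.from_dataframe(df)  # and back again
```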
### DataTree methods

`open_datatree`(filename_or_obj, *[, engine, ...]) | Open and decode a DataTree from a file or file-like object, creating one tree node for each group in the file.
---|---
`open_groups`(filename_or_obj, *[, engine, ...]) | Open and decode a file or file-like object, creating a dictionary containing one xarray Dataset for each group in the file.
`DataTree.to_dict`([relative]) | Create a dictionary mapping of paths to the data contained in those nodes.
`DataTree.to_netcdf`(filepath[, mode, ...]) | Write datatree contents to a netCDF file.
`DataTree.to_zarr`(store[, mode, encoding, ...]) | Write datatree contents to a Zarr store.
`DataTree.chunk`([chunks, name_prefix, token, ...]) | Coerce all arrays in all groups in this tree into dask arrays with the given chunks.
`DataTree.load`(**kwargs) | Manually trigger loading and/or computation of this datatree's data from disk or a remote source into memory and return this datatree.
`DataTree.compute`(**kwargs) | Manually trigger loading and/or computation of this datatree's data from disk or a remote source into memory and return a new datatree.
`DataTree.persist`(**kwargs) | Trigger computation, keeping data as chunked arrays.

## Encoding/Decoding

### Coder objects

`coders.CFDatetimeCoder`([use_cftime, time_unit]) | Coder for CF Datetime coding.
---|---

## Plotting

### Dataset

`Dataset.plot.scatter`(*args[, x, y, z, hue, ...]) | Scatter variables against each other.
---|---
`Dataset.plot.quiver`(*args[, x, y, u, v, ...]) | Quiver plot of Dataset variables.
`Dataset.plot.streamplot`(*args[, x, y, u, v, ...]) | Plot streamlines of Dataset variables.

### DataArray

`DataArray.plot`(*[, row, col, col_wrap, ax, ...]) | Default plot of DataArray using `matplotlib.pyplot`.
---|---

`DataArray.plot.contourf`(*args[, x, y, ...]) | Filled contour plot of 2D DataArray.
---|---
`DataArray.plot.contour`(*args[, x, y, ...]) | Contour plot of 2D DataArray.
`DataArray.plot.hist`(*args[, figsize, size, ...]) | Histogram of DataArray.
`DataArray.plot.imshow`(*args[, x, y, ...]) | Image plot of 2D DataArray.
`DataArray.plot.line`(*args[, row, col, ...]) | Line plot of DataArray values.
`DataArray.plot.pcolormesh`(*args[, x, y, ...]) | Pseudocolor plot of 2D DataArray.
`DataArray.plot.step`(*args[, where, ...]) | Step plot of DataArray values.
`DataArray.plot.scatter`(*args[, x, y, z, ...]) | Scatter variables against each other.
`DataArray.plot.surface`(*args[, x, y, ...]) | Surface plot of 2D DataArray.

### Faceting

`plot.FacetGrid`(data[, col, row, col_wrap, ...]) | Initialize the Matplotlib figure and FacetGrid object.
---|---
`plot.FacetGrid.add_colorbar`(**kwargs) | Draw a colorbar.
`plot.FacetGrid.add_legend`(*[, label, ...]) |
`plot.FacetGrid.add_quiverkey`(u, v, **kwargs) |
`plot.FacetGrid.map`(func, *args, **kwargs) | Apply a plotting function to each facet's subset of the data.
`plot.FacetGrid.map_dataarray`(func, x, y, ...) | Apply a plotting function to a 2d facet's subset of the data.
`plot.FacetGrid.map_dataarray_line`(func, x, ...) |
`plot.FacetGrid.map_dataset`(func[, x, y, ...]) |
`plot.FacetGrid.map_plot1d`(func, x, y, *[, ...]) | Apply a plotting function to a 1d facet's subset of the data.
`plot.FacetGrid.set_axis_labels`(*axlabels) | Set axis labels on the left column and bottom row of the grid.
`plot.FacetGrid.set_ticks`([max_xticks, ...]) | Set and control tick behavior.
`plot.FacetGrid.set_titles`([template, ...]) | Draw titles either above each facet or on the grid margins.
`plot.FacetGrid.set_xlabels`([label]) | Label the x axis on the bottom row of the grid.
`plot.FacetGrid.set_ylabels`([label]) | Label the y axis on the left column of the grid.
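A minimal plotting sketch (requires matplotlib; the data is invented for illustration):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.random.randn(3, 4),
    coords={"lat": [10, 20, 30], "lon": [0, 90, 180, 270]},
    dims=("lat", "lon"),
)

da.plot()                   # 2D data defaults to a pcolormesh
da.plot.line(x="lon")       # one line per value of the other dimension
da.isel(lat=0).plot.hist()  # histogram of a 1D slice
```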
## GroupBy objects

### Dataset

`DatasetGroupBy`(obj, groupers[, ...]) |
---|---
`DatasetGroupBy.map`(func[, args, shortcut]) | Apply a function to each Dataset in the group and concatenate them together into a new Dataset.
`DatasetGroupBy.reduce`(func[, dim, axis, ...]) | Reduce the items in this group by applying func along some dimension(s).
`DatasetGroupBy.assign`(**kwargs) | Assign data variables by group.
`DatasetGroupBy.assign_coords`([coords]) | Assign coordinates by group.
`DatasetGroupBy.first`([skipna, keep_attrs]) | Return the first element of each group along the group dimension
`DatasetGroupBy.last`([skipna, keep_attrs]) | Return the last element of each group along the group dimension
`DatasetGroupBy.fillna`(value) | Fill missing values in this object by group.
`DatasetGroupBy.quantile`(q[, dim, method, ...]) | Compute the qth quantile over each array in the groups and concatenate them together into a new array.
`DatasetGroupBy.where`(cond[, other]) | Return elements from self or other depending on cond.
`DatasetGroupBy.all`([dim, keep_attrs]) | Reduce this Dataset's data by applying `all` along some dimension(s).
`DatasetGroupBy.any`([dim, keep_attrs]) | Reduce this Dataset's data by applying `any` along some dimension(s).
`DatasetGroupBy.count`([dim, keep_attrs]) | Reduce this Dataset's data by applying `count` along some dimension(s).
`DatasetGroupBy.cumsum`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by applying `cumsum` along some dimension(s).
`DatasetGroupBy.cumprod`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by applying `cumprod` along some dimension(s).
`DatasetGroupBy.max`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by applying `max` along some dimension(s).
`DatasetGroupBy.mean`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by applying `mean` along some dimension(s).
`DatasetGroupBy.median`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by applying `median` along some dimension(s).
`DatasetGroupBy.min`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by applying `min` along some dimension(s).
`DatasetGroupBy.prod`([dim, skipna, ...]) | Reduce this Dataset's data by applying `prod` along some dimension(s).
`DatasetGroupBy.std`([dim, skipna, ddof, ...]) | Reduce this Dataset's data by applying `std` along some dimension(s).
`DatasetGroupBy.sum`([dim, skipna, min_count, ...]) | Reduce this Dataset's data by applying `sum` along some dimension(s).
`DatasetGroupBy.var`([dim, skipna, ddof, ...]) | Reduce this Dataset's data by applying `var` along some dimension(s).
`DatasetGroupBy.dims` |
`DatasetGroupBy.groups` | Mapping from group labels to indices.
`DatasetGroupBy.shuffle_to_chunks`([chunks]) | Sort or "shuffle" the underlying object.
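A minimal groupby sketch (the data and grouping are invented for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {"tmin": ("time", np.random.randn(365))},
    coords={"time": pd.date_range("2000-01-01", periods=365)},
)

ds.groupby("time.season").mean()                      # one value per season
ds.groupby("time.month").map(lambda g: g - g.mean())  # per-group anomalies
```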
### DataArray

`DataArrayGroupBy`(obj, groupers[, ...]) |
---|---
`DataArrayGroupBy.map`(func[, args, shortcut]) | Apply a function to each array in the group and concatenate them together into a new array.
`DataArrayGroupBy.reduce`(func[, dim, axis, ...]) | Reduce the items in this group by applying func along some dimension(s).
`DataArrayGroupBy.assign_coords`([coords]) | Assign coordinates by group.
`DataArrayGroupBy.first`([skipna, keep_attrs]) | Return the first element of each group along the group dimension
`DataArrayGroupBy.last`([skipna, keep_attrs]) | Return the last element of each group along the group dimension
`DataArrayGroupBy.fillna`(value) | Fill missing values in this object by group.
`DataArrayGroupBy.quantile`(q[, dim, method, ...]) | Compute the qth quantile over each array in the groups and concatenate them together into a new array.
`DataArrayGroupBy.where`(cond[, other]) | Return elements from self or other depending on cond.
`DataArrayGroupBy.all`([dim, keep_attrs]) | Reduce this DataArray's data by applying `all` along some dimension(s).
`DataArrayGroupBy.any`([dim, keep_attrs]) | Reduce this DataArray's data by applying `any` along some dimension(s).
`DataArrayGroupBy.count`([dim, keep_attrs]) | Reduce this DataArray's data by applying `count` along some dimension(s).
`DataArrayGroupBy.cumsum`([dim, skipna, ...]) | Reduce this DataArray's data by applying `cumsum` along some dimension(s).
`DataArrayGroupBy.cumprod`([dim, skipna, ...]) | Reduce this DataArray's data by applying `cumprod` along some dimension(s).
`DataArrayGroupBy.max`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by applying `max` along some dimension(s).
`DataArrayGroupBy.mean`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by applying `mean` along some dimension(s).
`DataArrayGroupBy.median`([dim, skipna, ...]) | Reduce this DataArray's data by applying `median` along some dimension(s).
`DataArrayGroupBy.min`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by applying `min` along some dimension(s).
`DataArrayGroupBy.prod`([dim, skipna, ...]) | Reduce this DataArray's data by applying `prod` along some dimension(s).
`DataArrayGroupBy.std`([dim, skipna, ddof, ...]) | Reduce this DataArray's data by applying `std` along some dimension(s).
`DataArrayGroupBy.sum`([dim, skipna, ...]) | Reduce this DataArray's data by applying `sum` along some dimension(s).
`DataArrayGroupBy.var`([dim, skipna, ddof, ...]) | Reduce this DataArray's data by applying `var` along some dimension(s).
`DataArrayGroupBy.dims` |
`DataArrayGroupBy.groups` | Mapping from group labels to indices.
`DataArrayGroupBy.shuffle_to_chunks`([chunks]) | Sort or "shuffle" the underlying object.

### Grouper Objects

`groupers.BinGrouper`(bins[, right, labels, ...]) | Grouper object for binning numeric data.
---|---
`groupers.UniqueGrouper`([labels]) | Grouper object for grouping by a categorical variable.
`groupers.TimeResampler`(freq[, closed, ...]) | Grouper object specialized to resampling the time coordinate.
`groupers.SeasonGrouper`(seasons) | Allows grouping using a custom definition of seasons.
`groupers.SeasonResampler`(seasons, *[, ...]) | Allows grouping using a custom definition of seasons.

## Rolling objects

### Dataset

`DatasetRolling`(obj, windows[, min_periods, ...]) |
---|---
`DatasetRolling.construct`([window_dim, ...]) | Convert this rolling object to xr.Dataset, where the window dimension is stacked as a new dimension.
`DatasetRolling.reduce`(func[, keep_attrs, ...]) | Reduce the items in this group by applying func along some dimension(s).
`DatasetRolling.argmax`([keep_attrs]) | Reduce this object's data windows by applying argmax along its dimension.
`DatasetRolling.argmin`([keep_attrs]) | Reduce this object's data windows by applying argmin along its dimension.
`DatasetRolling.count`([keep_attrs]) | Reduce this object's data windows by applying count along its dimension.
`DatasetRolling.max`([keep_attrs]) | Reduce this object's data windows by applying max along its dimension.
`DatasetRolling.mean`([keep_attrs]) | Reduce this object's data windows by applying mean along its dimension.
`DatasetRolling.median`([keep_attrs]) | Reduce this object's data windows by applying median along its dimension.
`DatasetRolling.min`([keep_attrs]) | Reduce this object's data windows by applying min along its dimension.
`DatasetRolling.prod`([keep_attrs]) | Reduce this object's data windows by applying prod along its dimension.
`DatasetRolling.std`([keep_attrs]) | Reduce this object's data windows by applying std along its dimension.
`DatasetRolling.sum`([keep_attrs]) | Reduce this object's data windows by applying sum along its dimension.
`DatasetRolling.var`([keep_attrs]) | Reduce this object's data windows by applying var along its dimension.

### DataArray

`DataArrayRolling`(obj, windows[, ...]) |
---|---
`DataArrayRolling.__iter__`() |
`DataArrayRolling.construct`([window_dim, ...]) | Convert this rolling object to xr.DataArray, where the window dimension is stacked as a new dimension.
`DataArrayRolling.reduce`(func[, keep_attrs, ...]) | Reduce each window by applying func.
`DataArrayRolling.argmax`([keep_attrs]) | Reduce this object's data windows by applying argmax along its dimension.
`DataArrayRolling.argmin`([keep_attrs]) | Reduce this object's data windows by applying argmin along its dimension.
`DataArrayRolling.count`([keep_attrs]) | Reduce this object's data windows by applying count along its dimension.
`DataArrayRolling.max`([keep_attrs]) | Reduce this object's data windows by applying max along its dimension.
`DataArrayRolling.mean`([keep_attrs]) | Reduce this object's data windows by applying mean along its dimension.
`DataArrayRolling.median`([keep_attrs]) | Reduce this object's data windows by applying median along its dimension.
`DataArrayRolling.min`([keep_attrs]) | Reduce this object's data windows by applying min along its dimension.
`DataArrayRolling.prod`([keep_attrs]) | Reduce this object's data windows by applying prod along its dimension.
`DataArrayRolling.std`([keep_attrs]) | Reduce this object's data windows by applying std along its dimension.
`DataArrayRolling.sum`([keep_attrs]) | Reduce this object's data windows by applying sum along its dimension.
`DataArrayRolling.var`([keep_attrs]) | Reduce this object's data windows by applying var along its dimension.
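A minimal rolling-window sketch (the data is invented for illustration):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(10.0), dims="time")

da.rolling(time=3).mean()                 # NaN-padded at the window edges
da.rolling(time=3, center=True).sum()     # centered windows
da.rolling(time=3, min_periods=1).mean()  # emit values from the first point on
```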
## Coarsen objects

### Dataset

| Method | Description |
|---|---|
| `DatasetCoarsen`(obj, windows, boundary, side, ...) | |
| `DatasetCoarsen.all`([keep_attrs]) | Reduce this DatasetCoarsen's data by applying all along some dimension(s). |
| `DatasetCoarsen.any`([keep_attrs]) | Reduce this DatasetCoarsen's data by applying any along some dimension(s). |
| `DatasetCoarsen.construct`([window_dim, ...]) | Convert this Coarsen object to a DataArray or Dataset, where the coarsening dimension is split or reshaped to two new dimensions. |
| `DatasetCoarsen.count`([keep_attrs]) | Reduce this DatasetCoarsen's data by applying count along some dimension(s). |
| `DatasetCoarsen.max`([keep_attrs]) | Reduce this DatasetCoarsen's data by applying max along some dimension(s). |
| `DatasetCoarsen.mean`([keep_attrs]) | Reduce this DatasetCoarsen's data by applying mean along some dimension(s). |
| `DatasetCoarsen.median`([keep_attrs]) | Reduce this DatasetCoarsen's data by applying median along some dimension(s). |
| `DatasetCoarsen.min`([keep_attrs]) | Reduce this DatasetCoarsen's data by applying min along some dimension(s). |
| `DatasetCoarsen.prod`([keep_attrs]) | Reduce this DatasetCoarsen's data by applying prod along some dimension(s). |
| `DatasetCoarsen.reduce`(func[, keep_attrs]) | Reduce the items in this group by applying func along some dimension(s). |
| `DatasetCoarsen.std`([keep_attrs]) | Reduce this DatasetCoarsen's data by applying std along some dimension(s). |
| `DatasetCoarsen.sum`([keep_attrs]) | Reduce this DatasetCoarsen's data by applying sum along some dimension(s). |
| `DatasetCoarsen.var`([keep_attrs]) | Reduce this DatasetCoarsen's data by applying var along some dimension(s). |

### DataArray

| Method | Description |
|---|---|
| `DataArrayCoarsen`(obj, windows, boundary, ...) | |
| `DataArrayCoarsen.all`([keep_attrs]) | Reduce this DataArrayCoarsen's data by applying all along some dimension(s). |
| `DataArrayCoarsen.any`([keep_attrs]) | Reduce this DataArrayCoarsen's data by applying any along some dimension(s). |
| `DataArrayCoarsen.construct`([window_dim, ...]) | Convert this Coarsen object to a DataArray or Dataset, where the coarsening dimension is split or reshaped to two new dimensions. |
| `DataArrayCoarsen.count`([keep_attrs]) | Reduce this DataArrayCoarsen's data by applying count along some dimension(s). |
| `DataArrayCoarsen.max`([keep_attrs]) | Reduce this DataArrayCoarsen's data by applying max along some dimension(s). |
| `DataArrayCoarsen.mean`([keep_attrs]) | Reduce this DataArrayCoarsen's data by applying mean along some dimension(s). |
| `DataArrayCoarsen.median`([keep_attrs]) | Reduce this DataArrayCoarsen's data by applying median along some dimension(s). |
| `DataArrayCoarsen.min`([keep_attrs]) | Reduce this DataArrayCoarsen's data by applying min along some dimension(s). |
| `DataArrayCoarsen.prod`([keep_attrs]) | Reduce this DataArrayCoarsen's data by applying prod along some dimension(s). |
| `DataArrayCoarsen.reduce`(func[, keep_attrs]) | Reduce the items in this group by applying func along some dimension(s). |
| `DataArrayCoarsen.std`([keep_attrs]) | Reduce this DataArrayCoarsen's data by applying std along some dimension(s). |
| `DataArrayCoarsen.sum`([keep_attrs]) | Reduce this DataArrayCoarsen's data by applying sum along some dimension(s). |
| `DataArrayCoarsen.var`([keep_attrs]) | Reduce this DataArrayCoarsen's data by applying var along some dimension(s). |
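Coarsen objects are created with `Dataset.coarsen` / `DataArray.coarsen` and reduce non-overlapping blocks. A minimal sketch, with an arbitrarily chosen window size:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(12.0), dims="x")

# Downsample by averaging non-overlapping blocks of 3 along "x".
da.coarsen(x=3).mean()

# boundary="trim" drops the ragged edge when the length is not a multiple of 3.
da.isel(x=slice(0, 11)).coarsen(x=3, boundary="trim").sum()
```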
## Exponential rolling objects

| Method | Description |
|---|---|
| `RollingExp`(obj, windows[, window_type, ...]) | Exponentially-weighted moving window object. |
| `RollingExp.mean`([keep_attrs]) | Exponentially weighted moving average. |
| `RollingExp.sum`([keep_attrs]) | Exponentially weighted moving sum. |

## Weighted objects

### Dataset

| Method | Description |
|---|---|
| `DatasetWeighted`(obj, weights) | |
| `DatasetWeighted.mean`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by a weighted `mean` along some dimension(s). |
| `DatasetWeighted.quantile`(q, *[, dim, ...]) | Apply a weighted `quantile` to this Dataset's data along some dimension(s). |
| `DatasetWeighted.sum`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by a weighted `sum` along some dimension(s). |
| `DatasetWeighted.std`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by a weighted `std` along some dimension(s). |
| `DatasetWeighted.var`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by a weighted `var` along some dimension(s). |
| `DatasetWeighted.sum_of_weights`([dim, keep_attrs]) | Calculate the sum of weights, accounting for missing values in the data. |
| `DatasetWeighted.sum_of_squares`([dim, ...]) | Reduce this Dataset's data by a weighted `sum_of_squares` along some dimension(s). |

### DataArray

| Method | Description |
|---|---|
| `DataArrayWeighted`(obj, weights) | |
| `DataArrayWeighted.mean`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by a weighted `mean` along some dimension(s). |
| `DataArrayWeighted.quantile`(q, *[, dim, ...]) | Apply a weighted `quantile` to this DataArray's data along some dimension(s). |
| `DataArrayWeighted.sum`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by a weighted `sum` along some dimension(s). |
| `DataArrayWeighted.std`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by a weighted `std` along some dimension(s). |
| `DataArrayWeighted.var`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by a weighted `var` along some dimension(s). |
| `DataArrayWeighted.sum_of_weights`([dim, ...]) | Calculate the sum of weights, accounting for missing values in the data. |
| `DataArrayWeighted.sum_of_squares`([dim, ...]) | Reduce this DataArray's data by a weighted `sum_of_squares` along some dimension(s). |
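Weighted objects are created with `.weighted(weights)`. A minimal sketch, with made-up weights:

```python
import numpy as np
import xarray as xr

da = xr.DataArray([1.0, 2.0, 3.0, np.nan], dims="x")
weights = xr.DataArray([8, 1, 1, 10], dims="x")

weighted = da.weighted(weights)
weighted.mean()            # weighted mean, skipping the NaN by default
weighted.sum_of_weights()  # weights are only counted where data is non-missing
```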
## Resample objects

### Dataset

| Method | Description |
|---|---|
| `DatasetResample`(*args[, dim, resample_dim]) | DatasetGroupBy object specialized to resampling a specified dimension. |
| `DatasetResample.asfreq`() | Return values of original object at the new up-sampling frequency; essentially a re-index with new times set to NaN. |
| `DatasetResample.backfill`([tolerance]) | Backward fill new values at up-sampled frequency. |
| `DatasetResample.interpolate`([kind]) | Interpolate up-sampled data using the original data as knots. |
| `DatasetResample.nearest`([tolerance]) | Take new values from nearest original coordinate to up-sampled frequency coordinates. |
| `DatasetResample.pad`([tolerance]) | Forward fill new values at up-sampled frequency. |
| `DatasetResample.all`([dim, keep_attrs]) | Reduce this Dataset's data by applying `all` along some dimension(s). |
| `DatasetResample.any`([dim, keep_attrs]) | Reduce this Dataset's data by applying `any` along some dimension(s). |
| `DatasetResample.apply`(func[, args, shortcut]) | Backward compatible implementation of `map`. |
| `DatasetResample.assign`(**kwargs) | Assign data variables by group. |
| `DatasetResample.assign_coords`([coords]) | Assign coordinates by group. |
| `DatasetResample.bfill`([tolerance]) | Backward fill new values at up-sampled frequency. |
| `DatasetResample.count`([dim, keep_attrs]) | Reduce this Dataset's data by applying `count` along some dimension(s). |
| `DatasetResample.ffill`([tolerance]) | Forward fill new values at up-sampled frequency. |
| `DatasetResample.fillna`(value) | Fill missing values in this object by group. |
| `DatasetResample.first`([skipna, keep_attrs]) | Return the first element of each group along the group dimension. |
| `DatasetResample.last`([skipna, keep_attrs]) | Return the last element of each group along the group dimension. |
| `DatasetResample.map`(func[, args, shortcut]) | Apply a function over each Dataset in the groups generated for resampling and concatenate them together into a new Dataset. |
| `DatasetResample.max`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by applying `max` along some dimension(s). |
| `DatasetResample.mean`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by applying `mean` along some dimension(s). |
| `DatasetResample.median`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by applying `median` along some dimension(s). |
| `DatasetResample.min`([dim, skipna, keep_attrs]) | Reduce this Dataset's data by applying `min` along some dimension(s). |
| `DatasetResample.prod`([dim, skipna, ...]) | Reduce this Dataset's data by applying `prod` along some dimension(s). |
| `DatasetResample.quantile`(q[, dim, method, ...]) | Compute the qth quantile over each array in the groups and concatenate them together into a new array. |
| `DatasetResample.reduce`(func[, dim, axis, ...]) | Reduce the items in this group by applying func along the pre-defined resampling dimension. |
| `DatasetResample.std`([dim, skipna, ddof, ...]) | Reduce this Dataset's data by applying `std` along some dimension(s). |
| `DatasetResample.sum`([dim, skipna, ...]) | Reduce this Dataset's data by applying `sum` along some dimension(s). |
| `DatasetResample.var`([dim, skipna, ddof, ...]) | Reduce this Dataset's data by applying `var` along some dimension(s). |
| `DatasetResample.where`(cond[, other]) | Return elements from self or other depending on cond. |
| `DatasetResample.dims` | |
| `DatasetResample.groups` | Mapping from group labels to indices. |

### DataArray

| Method | Description |
|---|---|
| `DataArrayResample`(*args[, dim, resample_dim]) | DataArrayGroupBy object specialized to time resampling operations over a specified dimension. |
| `DataArrayResample.asfreq`() | Return values of original object at the new up-sampling frequency; essentially a re-index with new times set to NaN. |
| `DataArrayResample.backfill`([tolerance]) | Backward fill new values at up-sampled frequency. |
| `DataArrayResample.interpolate`([kind]) | Interpolate up-sampled data using the original data as knots. |
| `DataArrayResample.nearest`([tolerance]) | Take new values from nearest original coordinate to up-sampled frequency coordinates. |
| `DataArrayResample.pad`([tolerance]) | Forward fill new values at up-sampled frequency. |
| `DataArrayResample.all`([dim, keep_attrs]) | Reduce this DataArray's data by applying `all` along some dimension(s). |
| `DataArrayResample.any`([dim, keep_attrs]) | Reduce this DataArray's data by applying `any` along some dimension(s). |
| `DataArrayResample.apply`(func[, args, shortcut]) | Backward compatible implementation of `map`. |
| `DataArrayResample.assign_coords`([coords]) | Assign coordinates by group. |
| `DataArrayResample.bfill`([tolerance]) | Backward fill new values at up-sampled frequency. |
| `DataArrayResample.count`([dim, keep_attrs]) | Reduce this DataArray's data by applying `count` along some dimension(s). |
| `DataArrayResample.ffill`([tolerance]) | Forward fill new values at up-sampled frequency. |
| `DataArrayResample.fillna`(value) | Fill missing values in this object by group. |
| `DataArrayResample.first`([skipna, keep_attrs]) | Return the first element of each group along the group dimension. |
| `DataArrayResample.last`([skipna, keep_attrs]) | Return the last element of each group along the group dimension. |
| `DataArrayResample.map`(func[, args, shortcut]) | Apply a function to each array in the group and concatenate them together into a new array. |
| `DataArrayResample.max`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by applying `max` along some dimension(s). |
| `DataArrayResample.mean`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by applying `mean` along some dimension(s). |
| `DataArrayResample.median`([dim, skipna, ...]) | Reduce this DataArray's data by applying `median` along some dimension(s). |
| `DataArrayResample.min`([dim, skipna, keep_attrs]) | Reduce this DataArray's data by applying `min` along some dimension(s). |
| `DataArrayResample.prod`([dim, skipna, ...]) | Reduce this DataArray's data by applying `prod` along some dimension(s). |
| `DataArrayResample.quantile`(q[, dim, method, ...]) | Compute the qth quantile over each array in the groups and concatenate them together into a new array. |
| `DataArrayResample.reduce`(func[, dim, axis, ...]) | Reduce the items in this group by applying func along the pre-defined resampling dimension. |
| `DataArrayResample.std`([dim, skipna, ddof, ...]) | Reduce this DataArray's data by applying `std` along some dimension(s). |
| `DataArrayResample.sum`([dim, skipna, ...]) | Reduce this DataArray's data by applying `sum` along some dimension(s). |
| `DataArrayResample.var`([dim, skipna, ddof, ...]) | Reduce this DataArray's data by applying `var` along some dimension(s). |
| `DataArrayResample.where`(cond[, other]) | Return elements from self or other depending on cond. |
| `DataArrayResample.dims` | |
| `DataArrayResample.groups` | Mapping from group labels to indices. |
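Resample objects are created with `.resample()` on a datetime dimension. A minimal sketch with synthetic daily data (frequency strings follow pandas conventions and may vary across versions):

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2024-01-01", periods=60, freq="D")
da = xr.DataArray(np.random.rand(60), coords={"time": times}, dims="time")

# Downsample daily values to month-start means ...
da.resample(time="MS").mean()

# ... or upsample to 12-hourly steps, forward-filling the new times.
da.resample(time="12h").ffill()
```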
## Accessors

| Class | Description |
|---|---|
| `accessor_dt.DatetimeAccessor`(obj) | Access datetime fields for DataArrays with datetime-like dtypes. |
| `accessor_dt.TimedeltaAccessor`(obj) | Access Timedelta fields for DataArrays with Timedelta-like dtypes. |
| `accessor_str.StringAccessor`(obj) | Vectorized string functions for string-like arrays. |
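These accessors are normally reached through the `.dt` and `.str` properties rather than instantiated directly. A minimal sketch, with illustrative data:

```python
import pandas as pd
import xarray as xr

times = xr.DataArray(pd.date_range("2024-01-01", periods=4, freq="D"), dims="time")
times.dt.dayofweek             # datetime fields via the dt accessor
times.dt.strftime("%Y-%m-%d")  # formatted timestamps

names = xr.DataArray(["north", "south"], dims="station")
names.str.upper()              # vectorized string methods via the str accessor
```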
## Custom Indexes

### Building custom indexes

These classes are building blocks for more complex Indexes:

| Class | Description |
|---|---|
| `indexes.CoordinateTransform`(coord_names, ...) | Abstract coordinate transform with dimension & coordinate names. |
| `indexes.CoordinateTransformIndex`(transform) | Helper class for creating Xarray indexes based on coordinate transforms. |
| `indexes.NDPointIndex`(tree_obj, *, ...) | Xarray index for irregular, n-dimensional data. |
| `indexes.TreeAdapter`(points, *, options) | Lightweight adapter abstract class for plugging in 3rd-party structures like `scipy.spatial.KDTree` or `sklearn.neighbors.KDTree` into `NDPointIndex`. |

The Index base class for building custom indexes:

| Method | Description |
|---|---|
| `Index.from_variables`(variables, *, options) | Create a new index object from one or more coordinate variables. |
| `Index.concat`(indexes, dim[, positions]) | Create a new index by concatenating one or more indexes of the same type. |
| `Index.stack`(variables, dim) | Create a new index by stacking coordinate variables into a single new dimension. |
| `Index.unstack`() | Unstack a (multi-)index into multiple (single) indexes. |
| `Index.create_variables`([variables]) | Maybe create new coordinate variables from this index. |
| `Index.should_add_coord_to_array`(name, var, dims) | Define whether or not an index coordinate variable should be added to a new DataArray. |
| `Index.to_pandas_index`() | Cast this xarray index to a pandas.Index object or raise a `TypeError` if this is not supported. |
| `Index.isel`(indexers) | Maybe returns a new index from the current index itself indexed by positional indexers. |
| `Index.sel`(labels) | Query the index with arbitrary coordinate label indexers. |
| `Index.join`(other[, how]) | Return a new index from the combination of this index with another index of the same type. |
| `Index.reindex_like`(other) | Query the index with another index of the same type. |
| `Index.equals`(other, **kwargs) | Compare this index with another index of the same type. |
| `Index.roll`(shifts) | Roll this index by an offset along one or more dimensions. |
| `Index.rename`(name_dict, dims_dict) | Maybe update the index with new coordinate and dimension names. |
| `Index.copy`([deep]) | Return a (deep) copy of this index. |

## Tutorial

| Function | Description |
|---|---|
| `tutorial.open_dataset`(name[, cache, ...]) | Open a dataset from the online repository (requires internet). |
| `tutorial.load_dataset`(*args, **kwargs) | Open, load into memory, and close a dataset from the online repository (requires internet). |
| `tutorial.open_datatree`(name[, cache, ...]) | Open a dataset as a DataTree from the online repository (requires internet). |
| `tutorial.load_datatree`(*args, **kwargs) | Open, load into memory (as a DataTree), and close a dataset from the online repository (requires internet). |

## Testing

| Function | Description |
|---|---|
| `testing.assert_equal`(a, b[, check_dim_order]) | Like `numpy.testing.assert_array_equal()`, but for xarray objects. |
| `testing.assert_identical`(a, b) | Like `xarray.testing.assert_equal()`, but also matches the objects' names and attributes. |
| `testing.assert_allclose`(a, b[, rtol, atol, ...]) | Like `numpy.testing.assert_allclose()`, but for xarray objects. |
| `testing.assert_chunks_equal`(a, b) | Assert that chunksizes along chunked dimensions are equal. |

Test that two `DataTree` objects are similar:

| Function | Description |
|---|---|
| `testing.assert_isomorphic`(a, b) | Two DataTrees are considered isomorphic if the set of paths to their descendent nodes are the same. |
| `testing.assert_equal`(a, b[, check_dim_order]) | Like `numpy.testing.assert_array_equal()`, but for xarray objects. |
| `testing.assert_identical`(a, b) | Like `xarray.testing.assert_equal()`, but also matches the objects' names and attributes. |

## Hypothesis Testing Strategies

See the documentation page on testing for a guide on how to use these strategies.

> **Warning:** These strategies should be considered highly experimental, and liable to change at any time.

| Strategy | Description |
|---|---|
| `testing.strategies.supported_dtypes`() | Generates only those numpy dtypes which xarray can handle. |
| `testing.strategies.names`() | Generates arbitrary string names for dimensions / variables. |
| `testing.strategies.dimension_names`(*[, ...]) | Generates an arbitrary list of valid dimension names. |
| `testing.strategies.dimension_sizes`(*[, ...]) | Generates an arbitrary mapping from dimension names to lengths. |
| `testing.strategies.attrs`() | Generates arbitrary valid attributes dictionaries for xarray objects. |
| `testing.strategies.variables`(*[, ...]) | Generates arbitrary xarray.Variable objects. |
| `testing.strategies.unique_subset_of`(objs, *) | Return a strategy which generates a unique subset of the given objects. |

## Exceptions

| Exception | Description |
|---|---|
| `AlignmentError` | Error class for alignment failures due to incompatible arguments. |
| `CoordinateValidationError` | Error class for Xarray coordinate validation failures. |
| `MergeError` | Error class for merge failures due to incompatible arguments. |
| `SerializationWarning` | Warnings about encoding/decoding issues in serialization. |

### DataTree

Exceptions raised when manipulating trees:

| Exception | Description |
|---|---|
| `TreeIsomorphismError` | Error raised if two tree objects do not share the same node structure. |
| `InvalidTreeError` | Raised when user attempts to create an invalid tree in some way. |
| `NotFoundInTreeError` | Raised when operation can't be completed because one node is not part of the expected tree. |
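These exceptions can be caught like any other Python exception; for example, a conflicting merge raises `MergeError`. A minimal sketch, with illustrative data:

```python
import xarray as xr

ds = xr.Dataset({"foo": ("x", [1, 2, 3])})
try:
    # Conflicting values for "foo" make merge raise rather than overwrite.
    xr.merge([ds, ds + 1])
except xr.MergeError as err:
    print(f"merge failed: {err}")
```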
## Advanced API

| Name | Description |
|---|---|
| `Coordinates`([coords, indexes]) | Dictionary-like container for Xarray coordinates (variables + indexes). |
| `Dataset.variables` | Low level interface to Dataset contents as dict of Variable objects. |
| `DataArray.variable` | Low level interface to the Variable object for this DataArray. |
| `DataTree.variables` | Low level interface to node contents as dict of Variable objects. |
| `Variable`(dims, data[, attrs, encoding, fastpath]) | A netcdf-like variable consisting of dimensions, data and attributes which describe a single Array. |
| `IndexVariable`(dims, data[, attrs, encoding, ...]) | Wrapper for accommodating a pandas.Index in an xarray.Variable. |
| `as_variable`(obj[, name, auto_convert]) | Convert an object into a Variable. |
| `Index`() | Base class inherited by all xarray-compatible indexes. |
| `IndexSelResult`(dim_indexers[, indexes, ...]) | Index query results. |
| `Context`(func) | Object carrying the information of a call. |
| `register_dataset_accessor`(name) | Register a custom property on xarray.Dataset objects. |
| `register_dataarray_accessor`(name) | Register a custom accessor on xarray.DataArray objects. |
| `register_datatree_accessor`(name) | Register a custom accessor on DataTree objects. |
| `Dataset.set_close`(close) | Register the function that releases any resources linked to this object. |
| `backends.BackendArray`() | |
| `backends.BackendEntrypoint`() | `BackendEntrypoint` is a class container and it is the main interface for the backend plugins, see BackendEntrypoint subclassing. |
| `backends.list_engines`() | Return a dictionary of available engines and their BackendEntrypoint objects. |
| `backends.refresh_engines`() | Refreshes the backend engines based on installed packages. |

These backends provide a low-level interface for lazily loading data from external file-formats or protocols, and can be manually invoked to create arguments for the `load_store` and `dump_to_store` Dataset methods:

| Class | Description |
|---|---|
| `backends.NetCDF4DataStore`(manager[, group, ...]) | Store for reading and writing data via the Python-NetCDF4 library. |
| `backends.H5NetCDFStore`(manager[, group, ...]) | Store for reading and writing data via h5netcdf. |
| `backends.PydapDataStore`(dataset[, group]) | Store for accessing OpenDAP datasets with pydap. |
| `backends.ScipyDataStore`(filename_or_obj[, ...]) | Store for reading and writing data via scipy.io.netcdf. |
| `backends.ZarrStore`(zarr_group[, mode, ...]) | Store for reading and writing data via zarr. |
| `backends.FileManager`() | Manager for acquiring and closing a file object. |
| `backends.CachingFileManager`(opener, *args[, ...]) | Wrapper for automatically opening and closing file objects. |
| `backends.DummyFileManager`(value) | FileManager that simply wraps an open file in the FileManager interface. |

These BackendEntrypoints provide a basic interface to the most commonly used filetypes in the xarray universe:

| Class | Description |
|---|---|
| `backends.NetCDF4BackendEntrypoint`() | Backend for netCDF files based on the netCDF4 package. |
| `backends.H5netcdfBackendEntrypoint`() | Backend for netCDF files based on the h5netcdf package. |
| `backends.PydapBackendEntrypoint`() | Backend for streaming datasets over the internet using the Data Access Protocol, also known as DODS or OPeNDAP, based on the pydap package. |
| `backends.ScipyBackendEntrypoint`() | Backend for netCDF files based on the scipy package. |
| `backends.StoreBackendEntrypoint`() | |
| `backends.ZarrBackendEntrypoint`() | Backend for ".zarr" files based on the zarr package. |

## Deprecated / Pending Deprecation

| Method | Description |
|---|---|
| `Dataset.drop`([labels, dim, errors]) | Backward compatible method based on drop_vars and drop_sel. |
| `DataArray.drop`([labels, dim, errors]) | Backward compatible method based on drop_vars and drop_sel. |
| `Dataset.apply`(func[, keep_attrs, args]) | Backward compatible implementation of `map`. |
| `core.groupby.DataArrayGroupBy.apply`(func[, ...]) | Backward compatible implementation of `map`. |
| `core.groupby.DatasetGroupBy.apply`(func[, ...]) | Backward compatible implementation of `map`. |

| Attribute | Description |
|---|---|
| `DataArray.dt.weekofyear` | The week ordinal of the year. |
| `DataArray.dt.week` | The week ordinal of the year. |
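Since `apply` is kept only for backward compatibility, new code should call `map` instead. A minimal sketch, with illustrative data:

```python
import xarray as xr

ds = xr.Dataset({"foo": ("x", [1.0, -2.0, 3.0])})

# Preferred: Dataset.map applies a function over each data variable.
ds.map(abs)

# Deprecated spelling of the same operation:
# ds.apply(abs)
```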