most recent 30 from stackoverflow.com (feed: https://stackoverflow.com/feeds/tag?tagnames=python-xarray, retrieved 2026-05-01, license: https://creativecommons.org/licenses/by-sa/4.0/rdf)

Question: https://stackoverflow.com/q/79934251 (Ramiro chalar, https://stackoverflow.com/users/17151636, asked 2026-04-30)
<p>I'm trying to load some .nc files this way:</p>
<pre><code>ds = xr.open_dataset('path/to/file.nc')
</code></pre>
<p>At first I get no error message, but when I try to operate on the data or simply visualize it, I get this really long error message:</p>
<pre><code>ds
---------------------------------------------------------------------------
InvalidVersion Traceback (most recent call last)
File ~\anaconda3\Lib\site-packages\IPython\core\formatters.py:406, in BaseFormatter.__call__(self, obj)
404 method = get_real_method(obj, self.print_method)
405 if method is not None:
--> 406 return method()
407 return None
408 else:
File ~\anaconda3\Lib\site-packages\xarray\core\dataset.py:2436, in Dataset._repr_html_(self)
2434 if OPTIONS["display_style"] == "text":
2435 return f"<pre>{escape(repr(self))}</pre>"
-> 2436 return formatting_html.dataset_repr(self)
File ~\anaconda3\Lib\site-packages\xarray\core\formatting_html.py:377, in dataset_repr(ds)
374 sections.append(dim_section(ds))
376 if ds.coords:
--> 377 sections.append(coord_section(ds.coords))
379 sections.append(datavar_section(ds.data_vars))
381 display_default_indexes = _get_boolean_with_default(
382 "display_default_indexes", False
383 )
File ~\anaconda3\Lib\site-packages\xarray\core\formatting_html.py:225, in _mapping_section(mapping, name, details_func, max_items_collapse, expand_option_name, enabled, **kwargs)
218 expanded = max_items_collapse is None or _get_boolean_with_default(
219 expand_option_name, n_items < max_items_collapse
220 )
221 collapsed = not expanded
223 return collapsible_section(
224 f"{name}:",
--> 225 details=details_func(mapping, **kwargs),
226 n_items=n_items,
227 enabled=enabled,
228 collapsed=collapsed,
229 )
File ~\anaconda3\Lib\site-packages\xarray\core\formatting_html.py:127, in summarize_coords(variables)
123 dim_ordered_coords = sorted(
124 variables.items(), key=partial(_coord_sort_key, dims=dims)
125 )
126 for k, v in dim_ordered_coords:
--> 127 li_content = summarize_variable(k, v, is_index=k in variables.xindexes)
128 li_items.append(f"<li class='xr-var-item'>{li_content}</li>")
130 vars_li = "".join(li_items)
File ~\anaconda3\Lib\site-packages\xarray\core\formatting_html.py:96, in summarize_variable(name, var, is_index, dtype)
93 data_id = "data-" + str(uuid.uuid4())
94 disabled = "" if len(var.attrs) else "disabled"
---> 96 preview = escape(inline_variable_array_repr(variable, 35))
97 attrs_ul = summarize_attrs(var.attrs)
98 data_repr = short_data_repr_html(variable)
File ~\anaconda3\Lib\site-packages\xarray\core\formatting.py:316, in inline_variable_array_repr(var, max_width)
314 if getattr(var, "_in_memory", False):
315 return format_array_flat(var, max_width)
--> 316 dask_array_type = array_type("dask")
317 if isinstance(var._data, dask_array_type):
318 return inline_dask_repr(var.data)
File ~\anaconda3\Lib\site-packages\xarray\namedarray\pycompat.py:83, in array_type(mod)
81 def array_type(mod: ModType) -> DuckArrayTypes:
82 """Quick wrapper to get the array class of the module."""
---> 83 return _get_cached_duck_array_module(mod).type
File ~\anaconda3\Lib\site-packages\xarray\namedarray\pycompat.py:74, in _get_cached_duck_array_module(mod)
72 def _get_cached_duck_array_module(mod: ModType) -> DuckArrayModule:
73 if mod not in _cached_duck_array_modules:
---> 74 duckmod = DuckArrayModule(mod)
75 _cached_duck_array_modules[mod] = duckmod
76 return duckmod
File ~\anaconda3\Lib\site-packages\xarray\namedarray\pycompat.py:40, in DuckArrayModule.__init__(self, mod)
38 try:
39 duck_array_module = import_module(mod)
---> 40 duck_array_version = Version(duck_array_module.__version__)
42 if mod == "dask":
43 duck_array_type = (import_module("dask.array").Array,)
File ~\anaconda3\Lib\site-packages\packaging\version.py:202, in Version.__init__(self, version)
200 match = self._regex.search(version)
201 if not match:
--> 202 raise InvalidVersion(f"Invalid version: {version!r}")
204 # Store the parsed out pieces of the version
205 self._version = _Version(
206 epoch=int(match.group("epoch")) if match.group("epoch") else 0,
207 release=tuple(int(i) for i in match.group("release").split(".")),
(...) 213 local=_parse_local_version(match.group("local")),
214 )
InvalidVersion: Invalid version: 'unknown'
</code></pre>
<p>(this isn't actually the full message, but it wouldn't let me submit with such a long code section)</p>
<p>I tried looking for answers but couldn't find anyone with the same error message.</p>
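<p>In case it helps, this is a small check I can run (a sketch; the list of modules is an assumption based on the optional duck-array backends xarray probes, such as dask) to see which installed package reports a version string that <code>packaging</code> cannot parse:</p>
<pre><code>from importlib import import_module
from packaging.version import Version, InvalidVersion

# Modules xarray may probe for duck-array support; adjust the list as needed.
for name in ["dask", "cupy", "sparse", "pint"]:
    try:
        mod = import_module(name)
    except ImportError:
        continue
    try:
        Version(getattr(mod, "__version__", "unknown"))
    except InvalidVersion:
        print(f"{name} reports an invalid __version__: {mod.__version__!r}")
</code></pre>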
Question: https://stackoverflow.com/q/79931486 (Ramiro chalar, https://stackoverflow.com/users/17151636, asked 2026-04-24)
<p>I have an xarray <code>DataArray</code> that contains daily data for every day from 1970 to 2023 for the months October through March, and I want to calculate the average value of the variable for each day (e.g. the average value for all January 1sts).</p>
<p>My data is like this:</p>
<pre><code><xarray.DataArray 'u' (valid_time: 9990, latitude: 321, longitude: 361)> Size: 5GB
dask.array<getitem, shape=(9990, 321, 361), dtype=float32, chunksize=(260, 161, 181), chunktype=numpy.ndarray>
Coordinates:
number int64 8B 0
pressure_level float64 8B 500.0
* latitude (latitude) float64 3kB 0.0 -0.25 -0.5 ... -79.5 -79.75 -80.0
* longitude (longitude) float64 3kB -100.0 -99.75 -99.5 ... -10.25 -10.0
* valid_time (valid_time) datetime64[ns] 80kB 1970-01-01 ... 2023-12-31
</code></pre>
<p>I've tried with <code>groupby</code> but only managed to group the data either by month or by day (i.e. all 1sts of the month regardless of month), not both. I can regroup each of these groups, but I bet there's a more elegant solution. Regrouping the groups looks like this:</p>
<pre class="lang-py prettyprint-override"><code>da.groupby('valid_time.month')[1].groupby('valid_time.day')[1]
</code></pre>
<p>That would result in all the January 1sts, and I could technically do it for all the days.</p>
<p>I also tried something like in <a href="https://stackoverflow.com/questions/43145920/calculate-maximum-value-by-day-of-the-year-over-certain-period">this post</a>:</p>
<pre class="lang-py prettyprint-override"><code>da.groupby(da.index.strftime('%m-%d')).mean()
</code></pre>
<p>but I get the message:</p>
<pre><code>AttributeError: 'DataArray' object has no attribute 'index'
</code></pre>
<p>The desired result would be something of shape (D, latitude, longitude), where D is the number of days between October 1st and March 31st (without all the February 29ths, which were removed).</p>
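<p>For reference, this is the kind of grouping I was aiming for (an untested sketch, assuming the time coordinate is named <code>valid_time</code>): building a month-day label from the time coordinate itself rather than from a pandas index:</p>
<pre><code># Group by a "MM-DD" label so all January 1sts, January 2nds, ... are averaged together
monthday = da.valid_time.dt.strftime('%m-%d').rename('monthday')
daily_clim = da.groupby(monthday).mean()
</code></pre>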
Question: https://stackoverflow.com/q/79821864 (helpmeplease, https://stackoverflow.com/users/31889448, asked 2025-11-17, last activity 2026-03-17)
<p>I am trying to analyze the 30-day standardized precipitation index (SPI) for a multi-state range of the southeastern US for the year 2016. I'm using xclim to process a direct pull of gridded daily precipitation data from nClimGrid-Daily, which involves pulling about 40 years of data and taking a 30-day rolling mean of the daily summed precipitation. Resolution is daily and 48.3 km. What I expected was for the code to give me the 30-day SPI for the entire range, which I could then subset to the year 2016. Instead, when I run this on my school's HPC via ssh I get kicked off the compute node halfway through (the compute node has about 70 GB of RAM). I can run it on my local machine, but it gets too hot to complete.</p>
<p>I've downloaded the range I need (8.8 GB). After subsetting geographically and temporally, it's around 7 GB. I can lazy-load everything, but when I persist/compute with dask it just keeps stalling after running for a certain amount of time - a lot of processing power is going towards rechunking by xclim, even though I rechunked beforehand to avoid it. As a note, when I use a coarser spatial resolution (county-level vs gridded) I can run this in less than an hour. With the gridded data, it breaks after about 3 hours. Please let me know if you have any suggestions!</p>
<pre><code>import glob
import xarray as xr
import xclim as xc
from dask.distributed import Client, LocalCluster
import os
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

#### bulk download files
base_url = "https://www.ncei.noaa.gov/data/nclimgrid-daily/access/grids"
years = range(1976, 2017)
months = range(1, 13)
out_dir = Path("./nclimgrid_prcp")  # adjust as needed
out_dir.mkdir(parents=True, exist_ok=True)
max_workers = 16
timeout = 60  # seconds

def dl(year, month):
    fname = f"ncdd-{year}{month:02}-grd-scaled.nc"
    url = f"{base_url}/{year}/{fname}"
    path = out_dir / fname
    if path.exists():
        return path, "exists"
    try:
        r = requests.get(url, timeout=timeout, stream=True)
        r.raise_for_status()
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=2**20):
                if chunk:
                    f.write(chunk)
        return path, "downloaded"
    except Exception as e:
        return path, f"error: {e}"

futures = []
with ThreadPoolExecutor(max_workers=max_workers) as ex:
    for y in years:
        for m in months:
            futures.append(ex.submit(dl, y, m))

files = []
for fut in as_completed(futures):
    p, status = fut.result()
    print(p.name, status)
    if status in ("exists", "downloaded"):
        files.append(str(p))
####### End download files

files = sorted(glob.glob('nclimgrid_prcp/*.nc'))  # access local files

def subset(ds):
    return ds.sel(
        lat=slice(30, 38),
        lon=slice(-92, -78),
    )

ds = xr.open_mfdataset(files, combine="by_coords", preprocess=subset, engine='netcdf4')
pr = ds['prcp']
pr = pr.assign_attrs(units="mm/day")
pr = pr.chunk(
    {
        "time": -1,
        "lat": 64,
        "lon": 84,
    }
)
cluster = LocalCluster()
client = Client(cluster)
client.cluster.scale(10)
spi_30 = xc.indices.standardized_precipitation_index(
    pr,
    freq="D",
    window=30)
</code></pre>
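<p>For context, this is how I would try to keep the workers within the node's memory (a sketch; the worker count and per-worker limit are placeholder numbers I would still need to tune):</p>
<pre><code>from dask.distributed import Client, LocalCluster

# Placeholder numbers: 8 workers x 8 GB stays well under a ~70 GB node
cluster = LocalCluster(n_workers=8, threads_per_worker=2, memory_limit="8GB")
client = Client(cluster)
</code></pre>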
Question: https://stackoverflow.com/q/79595427 (Ryan Connelly, https://stackoverflow.com/users/8540864, asked 2025-04-27, last activity 2026-03-14)
<p>An example HRRR file I'm trying to open with xarray: <a href="https://storage.googleapis.com/high-resolution-rapid-refresh/hrrr.20250427/conus/hrrr.t18z.wrfprsf06.grib2" rel="nofollow noreferrer">https://storage.googleapis.com/high-resolution-rapid-refresh/hrrr.20250427/conus/hrrr.t18z.wrfprsf06.grib2</a></p>
<p>When trying to open the dataset with xarray:</p>
<pre><code>data = xr.open_dataset(HRRR_data_path+model_file, engine="cfgrib", backend_kwargs=
{'filter_by_keys': {'typeOfLevel': 'isobaricInhPa'}})
</code></pre>
<p>I encounter this error:</p>
<pre><code>Traceback (most recent call last):
File
"C:\Users\rconn\Documents\Python_Scripts\HRRR_four_level_stdev_omega_both_forecast_new.py",
line 387, in <module>
main(sys.argv[1:])
File
"C:\Users\rconn\Documents\Python_Scripts\HRRR_four_level_stdev_omega_both_forecast_new.py",
line 216, in main
data = xr.open_dataset(HRRR_data_path+model_file, engine="cfgrib", backend_kwargs=
{'filter_by_keys': {'typeOfLevel': 'isobaricInhPa'}})
File "C:\Users\rconn\anaconda3\Lib\site-packages\xarray\backends\api.py", line 552, in
open_dataset
backend = plugins.get_backend(engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\rconn\anaconda3\Lib\site-packages\xarray\backends\plugins.py", line 205, in
get_backend
raise ValueError(
ValueError: unrecognized engine cfgrib must be one of: ['netcdf4', 'h5netcdf', 'scipy',
'gini', 'store', 'zarr']
</code></pre>
<p>I already installed eccodes (through pip) and cfgrib (through conda). If I run</p>
<pre><code>python -m cfgrib selfcheck
</code></pre>
<p>that returns</p>
<pre><code>Found: ecCodes v2.24.2.
Your system is ready.
</code></pre>
<p>yet the error about cfgrib being an unrecognized engine persists.</p>
<p>I'm using xarray 2023.6.0 in python 3.12.7 on a Win64 PC.</p>
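<p>For what it's worth, a quick check I can run (a sketch) to confirm whether cfgrib is importable in the same environment that xarray sees, and which engines xarray has actually registered:</p>
<pre><code>import xarray as xr
import cfgrib  # raises ImportError if the package is missing from this environment

print(cfgrib.__version__)
print(list(xr.backends.list_engines()))
</code></pre>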
Question: https://stackoverflow.com/q/79862585 (Sebastian Engelstaedter, https://stackoverflow.com/users/32157927, asked 2026-01-07, last activity 2026-02-26)
<p>I am trying to read multiple ERA5 GRIB files using xarray's <code>xr.open_mfdataset()</code> method. My ERA5 input files are on a server in a location for which I only have read permission. Therefore, I would like to save the index files generated by the cfgrib engine in a local custom directory. This works fine for a single GRIB file using the code below, where I can specify the index filename explicitly.</p>
<pre><code>import xarray as xr
from pathlib import Path
# define input file
ifile = Path('/soge-home/data/analysis/era5/0.28125x0.28125/hourly/total_precipitation/grb/era5_hourly_total_precipitation_202505.grb')
# create index directory
workdir = Path.cwd()
idx_dir = Path(f'{workdir}/idx_files')
idx_dir.mkdir(exist_ok=True, parents=True)
# read dataset
ds = xr.open_dataset(ifile,
chunks='auto',
engine='cfgrib',
backend_kwargs={
"indexpath": str(idx_dir / ifile.name) + ".{short_hash}.idx",
'time_dims': ('valid_time',) # use "valid_time" as main time dim
},
).rename({'valid_time': 'time'}) # rename "valid_time" to "time"
</code></pre>
<p>Now I try to employ the same method reading in multiple GRIB files using the code below.</p>
<pre><code>import xarray as xr
from pathlib import Path
# create index directory
workdir = Path.cwd()
idx_dir = Path(f'{workdir}/idx_files')
idx_dir.mkdir(exist_ok=True, parents=True)
# create sorted list of input files
var_dir = Path('/soge-home/data/analysis/era5/0.28125x0.28125/hourly/total_precipitation/grb/')
ifiles = sorted(var_dir.glob(f'*_2025*.grb'))
# read data with custom index files
ds = xr.open_mfdataset(ifiles,
chunks='auto',
engine='cfgrib',
combine='nested',
concat_dim='valid_time',
parallel=True,
backend_kwargs={
'indexpath': str(idx_dir) + "/test.idx", # custom index file location
'time_dims': ('valid_time',) # use valid_time as main time dim
},
).rename({'valid_time': 'time'}) # rename valid_time to time
</code></pre>
<p>The files are read in fine by xr.open_mfdataset(), but no index files are being created. I am getting the following error messages:</p>
<pre><code>Can't read index file '/hn01-home/worc1870/research/python/progs/grb_test/idx_files/test.idx'
Traceback (most recent call last):
File "/hn01-home/worc1870/micromamba/envs/mm313/lib/python3.13/site-packages/cfgrib/messages.py", line 551, in from_indexpath_or_filestream
self = cls.from_indexpath(indexpath)
File "/hn01-home/worc1870/micromamba/envs/mm313/lib/python3.13/site-packages/cfgrib/messages.py", line 430, in from_indexpath
index = pickle.load(file)
EOFError: Ran out of input
Ignoring index file '/hn01-home/worc1870/research/python/progs/grb_test/idx_files/test.idx' incompatible with GRIB file
Ignoring index file '/hn01-home/worc1870/research/python/progs/grb_test/idx_files/test.idx' incompatible with GRIB file
Ignoring index file '/hn01-home/worc1870/research/python/progs/grb_test/idx_files/test.idx' incompatible with GRIB file
</code></pre>
<p>I tried all kinds of things but did not get it to work. Am I doing something wrong, or is this just not supported at the moment? Any suggestions are much appreciated. The index files are especially useful when reading in larger data volumes.</p>
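<p>One workaround I have been considering (an untested sketch, reusing the same backend_kwargs as in my single-file case) is to open the files one by one, so each gets its own index file in the local directory, and concatenate afterwards:</p>
<pre><code>datasets = []
for f in ifiles:
    datasets.append(
        xr.open_dataset(
            f,
            chunks='auto',
            engine='cfgrib',
            backend_kwargs={
                'indexpath': str(idx_dir / f.name) + '.{short_hash}.idx',
                'time_dims': ('valid_time',),
            },
        )
    )
ds = xr.concat(datasets, dim='valid_time').rename({'valid_time': 'time'})
</code></pre>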
<p>Using Python 3.13.7, Xarray 2025.10.1 and cfgrib 0.9.15.1.</p>
<p>Many thanks,<br />
Sebastian</p>
Question: https://stackoverflow.com/q/79873181 (arctic_climate_science, https://stackoverflow.com/users/1580516, asked 2026-01-21, last activity 2026-01-22)
<p>I have 2 elevation datasets -- one is a low resolution dataset (~0.1 degree resolution), and another is a high resolution dataset (~0.001 degree resolution). Let's assume that the datasets look something like this (in reality they are much larger, so I will need to find a way to do this in a memory efficient manner, such as through chunking):</p>
<pre><code>import numpy as np
import pandas as pd
import geopandas as gpd
import xarray as xr
lr_elev = np.random.uniform(low=0, high=4000, size=(11, 11))
hr_elev = np.random.uniform(low=0, high=4000, size=(1001, 1001))
lr_lats = np.linspace(44.0, 45.0, 11)
lr_lons = np.linspace(-115.0, -114.0, 11)
hr_lats = np.linspace(44.0, 45.0, 1001)
hr_lons = np.linspace(-115.0, -114.0, 1001)
lr_ds = xr.Dataset(
data_vars=dict(elevation=(["lat", "lon"], lr_elev)),
coords=dict(lon=("lon", lr_lons), lat=("lat", lr_lats)),
)
hr_ds = xr.Dataset(
data_vars=dict(elevation=(["lat", "lon"], hr_elev)),
coords=dict(lon=("lon", hr_lons), lat=("lat", hr_lats)),
)
print(lr_ds)
print(hr_ds)
</code></pre>
<p>I know how to find the nearest elevation value from the low resolution dataset for each of the high resolution lat/lon pairs:</p>
<pre><code># ### Treat midpoint of HR grid cells as a point -- then find the nearest LR grid cell as if they are stations
nearest_lr_point = lr_ds.sel(lat=hr_ds.lat,lon=hr_ds.lon, method='nearest')
print(nearest_lr_point)
</code></pre>
<p>However, I haven't been able to figure out how to return the <strong>index</strong> of the nearest value rather than the value itself. The reason is that I would like to match each high-res lat/lon pair with its corresponding low-resolution grid cell (for later calculations). Is there a way to return the index instead for each lat/lon pair?</p>
<p>I am unable to convert this into a dataframe easily, since the real high resolution dataset is about 700 GB in size.</p>
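<p>For reference, this is a sketch of the kind of result I am after, using the pandas indexes that back the coordinates (it assumes the coordinates are monotonic, as in the example above):</p>
<pre><code># Integer positions (not values) of the nearest low-res cell for each high-res coordinate
lat_idx = lr_ds.indexes['lat'].get_indexer(hr_ds['lat'].values, method='nearest')
lon_idx = lr_ds.indexes['lon'].get_indexer(hr_ds['lon'].values, method='nearest')
# lat_idx[i] is the row of lr_ds nearest to hr_ds.lat[i]; lon_idx[j] the column nearest to hr_ds.lon[j]
</code></pre>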
Question: https://stackoverflow.com/q/79862953 (imantha, https://stackoverflow.com/users/10639382, asked 2026-01-08)
<p>I am wondering how to add a CRS to an xarray dataset. For example, if we have a min and a max dataset and compute their mean, how can we set the mean dataset to have the same CRS as the min or max datasets?</p>
<pre><code>max_ds = xr.open_dataset("data/climate_processed/1958_2025_nsw_max_temp.nc")
min_ds = xr.open_dataset("data/climate_processed/1958_2025_nsw_min_temp.nc")
mean_temp = (max_ds["max_temp"] + min_ds["min_temp"]) / 2
mean_ds = mean_temp.to_dataset(name = "temp_mean")
</code></pre>
<p>I have copied the outputs of max_ds and mean_ds below, just in case they are useful.<br />
Here is the printed output of max_ds:</p>
<pre><code>print(max_ds.variables)
Frozen({'max_temp': <xarray.Variable (time: 24412, lat: 187, lon: 261)> Size: 10GB
[1191476484 values with dtype=float64]
Attributes:
long_name: Maximum temperature
units: Celsius
valid_min: -90
valid_max: 540, 'crs': <xarray.Variable (time: 24412)> Size: 24kB
[24412 values with dtype=|S1]
Attributes:
long_name: Coordinate reference system
grid_mapping_name: latitude_longitude
longitude_of_prime_meridian: 0.0
semi_major_axis: 6378137.0
inverse_flattening: 298.257223563, 'lat': <xarray.IndexVariable 'lat' (lat: 187)> Size: 1kB
array([-37.5 , -37.45, -37.4 , -37.35, -37.3 , -37.25, -37.2 , -37.15, -37.1 ,
-37.05, -37. , -36.95, -36.9 , -36.85, -36.8 , -36.75, -36.7 , -36.65,
...
-28.95, -28.9 , -28.85, -28.8 , -28.75, -28.7 , -28.65, -28.6 , -28.55,
-28.5 , -28.45, -28.4 , -28.35, -28.3 , -28.25, -28.2 ])
Attributes:
long_name: latitude
standard_name: latitude
units: degrees_north
axis: Y, 'lon': <xarray.IndexVariable 'lon' (lon: 261)> Size: 2kB
array([141. , 141.05, 141.1 , ..., 153.9 , 153.95, 154. ], shape=(261,))
Attributes:
long_name: longitude
standard_name: longitude
units: degrees_east
axis: X, 'time': <xarray.IndexVariable 'time' (time: 24412)> Size: 195kB
array(['1958-01-01T00:00:00.000000000', '1958-01-02T00:00:00.000000000',
'1958-01-03T00:00:00.000000000', ..., '2025-10-31T00:00:00.000000000',
'2025-11-01T00:00:00.000000000', '2025-11-02T00:00:00.000000000'],
shape=(24412,), dtype='datetime64[ns]')
Attributes:
axis: T})
</code></pre>
<p>Printed output of mean_ds:</p>
<pre><code>print(mean_ds.variables)
Frozen({'lat': <xarray.IndexVariable 'lat' (lat: 187)> Size: 1kB
array([-37.5 , -37.45, -37.4 , -37.35, -37.3 , -37.25, -37.2 , -37.15, -37.1 ,
-37.05, -37. , -36.95, -36.9 , -36.85, -36.8 , -36.75, -36.7 , -36.65,
-36.6 , -36.55, -36.5 , -36.45, -36.4 , -36.35, -36.3 , -36.25, -36.2 ,
-36.15, -36.1 , -36.05, -36. , -35.95, -35.9 , -35.85, -35.8 , -35.75,
-35.7 , -35.65, -35.6 , -35.55, -35.5 , -35.45, -35.4 , -35.35, -35.3 ,
-35.25, -35.2 , -35.15, -35.1 , -35.05, -35. , -34.95, -34.9 , -34.85,
-34.8 , -34.75, -34.7 , -34.65, -34.6 , -34.55, -34.5 , -34.45, -34.4 ,
-34.35, -34.3 , -34.25, -34.2 , -34.15, -34.1 , -34.05, -34. , -33.95,
-33.9 , -33.85, -33.8 , -33.75, -33.7 , -33.65, -33.6 , -33.55, -33.5 ,
-33.45, -33.4 , -33.35, -33.3 , -33.25, -33.2 , -33.15, -33.1 , -33.05,
-33. , -32.95, -32.9 , -32.85, -32.8 , -32.75, -32.7 , -32.65, -32.6 ,
-32.55, -32.5 , -32.45, -32.4 , -32.35, -32.3 , -32.25, -32.2 , -32.15,
-32.1 , -32.05, -32. , -31.95, -31.9 , -31.85, -31.8 , -31.75, -31.7 ,
-31.65, -31.6 , -31.55, -31.5 , -31.45, -31.4 , -31.35, -31.3 , -31.25,
-31.2 , -31.15, -31.1 , -31.05, -31. , -30.95, -30.9 , -30.85, -30.8 ,
-30.75, -30.7 , -30.65, -30.6 , -30.55, -30.5 , -30.45, -30.4 , -30.35,
-30.3 , -30.25, -30.2 , -30.15, -30.1 , -30.05, -30. , -29.95, -29.9 ,
-29.85, -29.8 , -29.75, -29.7 , -29.65, -29.6 , -29.55, -29.5 , -29.45,
-29.4 , -29.35, -29.3 , -29.25, -29.2 , -29.15, -29.1 , -29.05, -29. ,
-28.95, -28.9 , -28.85, -28.8 , -28.75, -28.7 , -28.65, -28.6 , -28.55,
-28.5 , -28.45, -28.4 , -28.35, -28.3 , -28.25, -28.2 ])
Attributes:
long_name: latitude
standard_name: latitude
units: degrees_north
axis: Y, 'lon': <xarray.IndexVariable 'lon' (lon: 261)> Size: 2kB
array([141. , 141.05, 141.1 , ..., 153.9 , 153.95, 154. ], shape=(261,))
Attributes:
long_name: longitude
standard_name: longitude
units: degrees_east
axis: X, 'time': <xarray.IndexVariable 'time' (time: 24412)> Size: 195kB
array(['1958-01-01T00:00:00.000000000', '1958-01-02T00:00:00.000000000',
'1958-01-03T00:00:00.000000000', ..., '2025-10-31T00:00:00.000000000',
'2025-11-01T00:00:00.000000000', '2025-11-02T00:00:00.000000000'],
shape=(24412,), dtype='datetime64[ns]')
Attributes:
axis: T, 'temp_mean': <xarray.Variable (time: 24412, lat: 187, lon: 261)> Size: 10GB
array([[[14.2 , 14.3 , 14.35, ..., nan, nan, nan],
[14.3 , 14.4 , 14.45, ..., nan, nan, nan],
[14.3 , 14.45, 14.5 , ..., nan, nan, nan],
...,
[23.1 , 23.1 , 23.05, ..., nan, nan, nan],
[23.2 , 23.15, 23.1 , ..., nan, nan, nan],
[23.25, 23.2 , 23.1 , ..., nan, nan, nan]],
[[13.25, 13.25, 13.2 , ..., nan, nan, nan],
[13.35, 13.3 , 13.25, ..., nan, nan, nan],
[13.4 , 13.35, 13.3 , ..., nan, nan, nan],
...,
[25.5 , 25.5 , 25.45, ..., nan, nan, nan],
[25.55, 25.55, 25.5 , ..., nan, nan, nan],
[25.55, 25.55, 25.5 , ..., nan, nan, nan]],
[[14.9 , 15. , 15. , ..., nan, nan, nan],
[15.05, 15.05, 15.1 , ..., nan, nan, nan],
[15.1 , 15.15, 15.15, ..., nan, nan, nan],
...,
...
[30.3 , 30.35, 30.35, ..., nan, nan, nan],
[30.4 , 30.4 , 30.35, ..., nan, nan, nan],
[30.45, 30.45, 30.45, ..., nan, nan, nan]],
[[15.4 , 15.4 , 15.4 , ..., nan, nan, nan],
[15.5 , 15.5 , 15.5 , ..., nan, nan, nan],
[15.6 , 15.6 , 15.6 , ..., nan, nan, nan],
...,
[27.2 , 27.2 , 27.2 , ..., nan, nan, nan],
[27.25, 27.25, 27.2 , ..., nan, nan, nan],
[27.3 , 27.25, 27.25, ..., nan, nan, nan]],
[[17.55, 17.55, 17.55, ..., nan, nan, nan],
[17.5 , 17.55, 17.55, ..., nan, nan, nan],
[17.5 , 17.55, 17.55, ..., nan, nan, nan],
...,
[30.05, 30. , 29.95, ..., nan, nan, nan],
[30.1 , 30.05, 29.95, ..., nan, nan, nan],
[30.15, 30.05, 30.05, ..., nan, nan, nan]]],
shape=(24412, 187, 261))
Attributes:
units: Celsius})
</code></pre>
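<p>To be clear, what I had in mind is something like this (a sketch, assuming the CF grid-mapping variable <code>crs</code> from max_ds is what should carry over):</p>
<pre><code>mean_ds['crs'] = max_ds['crs']                       # copy the grid-mapping variable
mean_ds['temp_mean'].attrs['grid_mapping'] = 'crs'   # point the data variable at it (CF convention)
</code></pre>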
Question: https://stackoverflow.com/q/79821451 (derM, https://stackoverflow.com/users/2056452, asked 2025-11-16)
<p>There has <a href="https://stackoverflow.com/questions/76838097/new-dimension-with-coordinates-in-xarray-apply-ufunc">been at least one other question</a> regarding the introduction of new dimensions in the output of <code>xarray.apply_ufunc</code>; I have two problems with this answer: First, I feel like the answer avoids the core problem, and secondly, there is little to no explanation on the (mis-)conceptions that lead to the problem.</p>
<p>Let's consider this (abstract and not working) example:</p>
<pre><code>import numpy as np
import xarray as xr

a1 = np.array([[11,12,13,14],[15,16,17,18],[19,110,111,112],[113,114,115,116]], dtype=np.int16)
a2 = np.array([[21,22,23,24],[25,26,27,28],[29,210,211,212],[213,214,215,216]], dtype=np.int16)

ds = xr.Dataset(
    data_vars={
        "d1": (("y", "x"), a1),
        "d2": (("y", "x"), a2)
    }
).chunk({"y": 2, "x": 2})

def stackuf(xa1, xa2):
    return np.stack((xa1, xa2), axis=-1)

out = xr.apply_ufunc(
    stackuf,
    ds.d1,
    ds.d2,
    input_core_dims=[[], []],                    # what is supposed to be here?
    output_core_dims=[[]],                       # and what is supposed to be here?
    output_sizes=dict(ds.sizes, **{"xy": 2}),    # and what is needed there?
    dask="parallelized",
    vectorize=False,
    output_dtypes=[np.int16],
)
out.compute()
</code></pre>
<p>What it should do: <em>From two variables with two dimensions (y, x), create a new variable with three dimensions (y, x, xy)</em></p>
<p>I suspect the arguments <code>input_core_dims, output_core_dims, output_sizes</code> to be the key arguments to make this working. <strong>But I do not really understand their semantics.</strong></p>
<p>Please do not:</p>
<ul>
<li>just make the code working</li>
<li>provide workarounds, that circumvent the need to introduce a new dimensions</li>
</ul>
<p>instead, explain exactly, what needs to be understood about this - and similar - problems.</p>
Question: https://stackoverflow.com/q/71689979 (David_G, https://stackoverflow.com/users/1687736, asked 2022-03-31, last activity 2025-11-10)
<p>I am looking for a way to return the interpolated coordinate value of a variable at a query variable value. I think it is the opposite use to a normal interpolation/table-look up.</p>
<p>For example, here is a simple, normal interpolation, taken from this very good <a href="https://earth-env-data-science.github.io/lectures/xarray/xarray-part2.html" rel="nofollow noreferrer">tutorial</a>:</p>
<pre><code>x_data = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
f = xr.DataArray(x_data**2, dims=['x'], coords={'x': x_data})
f.plot(marker='o')
</code></pre>
<p>Which produces this plot:
<a href="https://i.sstatic.net/WIez6.png" rel="nofollow noreferrer"><img src="https://i.sstatic.net/WIez6.png" alt="simple plot of data" /></a></p>
<p>So a standard lookup (e.g. <code>f.interp(x=4.5)</code>) would look for the value of the variable at some coordinate value of <code>x</code>. But I want to look for the value of the <code>x</code> where the variable is equal to some value. For example in the above plot, what is the value of <code>x</code> where the variable equals <code>20</code>.</p>
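<p>For the 1-D example above, the closest I have come to expressing what I want (a sketch that assumes the variable is monotonic in <code>x</code>) is to swap the roles of coordinate and data and then interpolate that:</p>
<pre><code># Treat f's values as the coordinate and x as the data, then ask for the x at f=20
x_of_f = xr.DataArray(f['x'].values, dims=['f'], coords={'f': f.values})
x_at_20 = x_of_f.interp(f=20)   # linear interpolation between x=4 (f=16) and x=5 (f=25)
</code></pre>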
<p>I am ultimately trying to do this for a 3D field of temperature (longitude x latitude x depth). So, how do I find the value of depth, where the temperature field is closest to a value that I specify.</p>
<p>Many thanks in advance.</p>
Question: https://stackoverflow.com/q/79814554 (Alexis, https://stackoverflow.com/users/31842400, asked 2025-11-09)
<p>Very new to Python! I am trying to model bottom water temperatures over time and need to reduce the resolution of my model from 1/20º to 1º. My ultimate goal is to map this and select specific grid cells to calculate temperature for each year. I have tried regridding but ran into errors when trying to import xesmf. I've looked into using groupby_bins or the coarsen method but haven't been able to figure out how, and whether, they would actually work.</p>
<p>This is the error i get when trying to import xesmf:</p>
<pre><code>ModuleNotFoundError: No module named 'xarray.core.arithmetic'
</code></pre>
<p>this is the work that I have done with coarsen so far:</p>
<pre><code>coarsen_factor = 20
v20x = v20x.rename({"nav_lon": "latitude", "nav_lat": "longitude"})
coarsened_ds = v20x.coarsen(
latitude=coarsen_factor,
longitude=coarsen_factor,
boundary='exact'
).mean()
</code></pre>
<p>But I am getting this error:</p>
<pre><code>ValueError: Window dimensions ('latitude', 'longitude') not found in Dataset dimensions ('time_counter', 'deptht', 'axis_nbounds', 'y', 'x')
</code></pre>
<p>I'm not sure if this is the best approach or if groupby_bins would be better and how to execute this. Any advice would be so appreciated!</p>
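<p>For reference, a sketch of the coarsen call applied to the actual dimension names reported in the error (<code>y</code> and <code>x</code>) rather than to the 2-D <code>nav_lat</code>/<code>nav_lon</code> coordinates; <code>boundary='trim'</code> is just a placeholder choice for grids that do not divide evenly:</p>
<pre><code>coarsened_ds = v20x.coarsen(y=coarsen_factor, x=coarsen_factor, boundary='trim').mean()
</code></pre>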
Question: https://stackoverflow.com/q/79792858 (Gary Frewin, https://stackoverflow.com/users/11571390, asked 2025-10-17, last activity 2025-10-30)
<p>I’m trying to subset a large <code>xarray.Dataset</code> backed by <code>Dask</code> and save it back to Zarr, but I’m running into a major memory problem when attempting to drop rows with a boolean mask.</p>
<p>Here’s a minimal working example that matches my real-world setup:</p>
<pre><code>import numpy as np
import xarray as xr
import dask.array as da
import zarr
import zipfile
# Simulate a large dataset
pos_len = 100_000_000 # rows
sample_len = 100 # samples
chunks = (100_000, 100)
data = da.random.random((pos_len, sample_len), chunks=chunks)
xds = xr.Dataset(
{"some_var": (("pos", "sample_id"), data)},
coords={"pos": np.arange(pos_len), "sample_id": np.arange(sample_len)}
)
# Build a boolean mask based on mean coverage
coverage_array = "some_var"
min_coverage = 0.5
mask_1d = xds[coverage_array].mean(dim="sample_id", skipna=True) >= min_coverage
# Attempt to drop rows where mask is False
cds_masked = xds.where(mask_1d.compute(), other=np.nan, drop=True) # <--- memory explodes here
# Without .compute() I get:
# ValueError: This will result in a dask array of unknown shape. Such arrays are unsupported by Xarray. Please compute the indexer first using .compute()
</code></pre>
<ul>
<li>If I call <code>.compute()</code> on the mask, memory usage explodes and the process crashes.</li>
<li>If I skip <code>.compute()</code>, <code>Xarray</code> errors out because it doesn’t know the shape of the result.</li>
<li>If I use <code>drop=False</code>, I avoid the error — but then downstream operations have to handle the nans.</li>
</ul>
<p>My question:
Is there any memory-safe way to drop rows with a boolean mask in <code>Xarray</code>/<code>Dask</code> without fully computing the mask first? Or at least, without computing it in a way that explodes into ram and crashes the whole computer.</p>
<p>Or is this a fundamental limitation of <code>Xarray</code>/<code>Dask</code> due to unknown chunk shapes after boolean indexing?</p>
<p>Are there known workarounds or idioms for this pattern?</p>
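<p>One idiom I have been experimenting with (a sketch; I am not sure it is the recommended pattern) is to compute only the small 1-D mask, turn it into integer positions, and use positional indexing so the result has known chunk sizes:</p>
<pre><code>keep = np.flatnonzero(mask_1d.compute().values)  # 1-D integer positions, small in memory
cds_masked = xds.isel(pos=keep)                  # known output shape, data stays lazy
</code></pre>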
Question: https://stackoverflow.com/q/52108417 (Robbi Bishop-Taylor, https://stackoverflow.com/users/2510900, asked 2018-08-31, last activity 2025-10-28)
<p>I have a 1D array of independent variable values (<code>x_array</code>) that match the timesteps in a 3D numpy array of spatial data with multiple time-steps (<code>y_array</code>). My actual data is much larger: 300+ timesteps and up to 3000 * 3000 pixels:</p>
<pre><code>import numpy as np
from scipy.stats import linregress
# Independent variable: four time-steps of 1-dimensional data
x_array = np.array([0.5, 0.2, 0.4, 0.4])
# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2, -0.2, -0.3],
[-0.3, -0.2, -0.3],
[-0.3, -0.4, -0.4]],
[[-0.2, -0.2, -0.4],
[-0.3, np.nan, -0.3],
[-0.3, -0.3, -0.4]],
[[np.nan, np.nan, -0.3],
[-0.2, -0.3, -0.7],
[-0.3, -0.3, -0.3]],
[[-0.1, -0.3, np.nan],
[-0.2, -0.3, np.nan],
[-0.1, np.nan, np.nan]]])
</code></pre>
<p>I want to compute a per-pixel linear regression and obtain R-squared, P-values, intercepts and slopes for each <code>xy</code> pixel in <code>y_array</code>, with values for each timestep in <code>x_array</code> as my independent variable. </p>
<p>I can reshape to get the data in a form to input it into <code>np.polyfit</code> which is vectorised and fast:</p>
<pre><code># Reshape so rows = number of time-steps and columns = pixels:
y_array_reshaped = y_array.reshape(len(y_array), -1)
# Do a first-degree polyfit
np.polyfit(x_array, y_array_reshaped, 1)
</code></pre>
<p>However, this ignores pixels that contain any <code>NaN</code> values (<code>np.polyfit</code> does not support <code>NaN</code> values), and does not calculate the statistics I require (R-squared, P-values, intercepts and slopes). </p>
<p>The <a href="https://stackoverflow.com/questions/13643363/linear-regression-of-arrays-containing-nans-in-python-numpy">answer here</a> uses <code>scipy.stats import linregress</code> which does calculate the statistics I need, and suggests avoiding <code>NaN</code> issues by masking out these <code>NaN</code> values. However, this example is for two 1D arrays, and I can't work out how to apply a similar masking approach to my case where each column in <code>y_array_reshaped</code> will have a different set of <code>NaN</code> values.</p>
<p><strong>My question:</strong> How can I calculate regression statistics for each pixel in a large multidimensional array (300 x 3000 x 3000) containing many <code>NaN</code> values in a reasonably fast, vectorised way?</p>
<p><strong>Desired result:</strong> A 3 x 3 array of regression statistic values (e.g. R-squared) for each pixel in <code>y_array</code>, even if that pixel contains <code>NaN</code> values at some point in the time series</p>
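<p>To illustrate the kind of vectorised approach I am hoping for, here is a sketch of per-pixel slope, intercept and R-squared using closed-form least-squares sums with the <code>NaN</code>s masked out per column (it does not yet compute p-values, which would additionally need scipy's t-distribution, and it assumes each pixel has at least two valid timesteps):</p>
<pre><code>Y = y_array.reshape(len(y_array), -1)          # (time, pixels)
X = x_array[:, np.newaxis]                     # (time, 1), broadcast across pixels
valid = np.isfinite(Y)
n = valid.sum(axis=0)

# Zero out NaN timesteps so they drop out of the sums
Xv = np.where(valid, X, 0.0)
Yv = np.where(valid, Y, 0.0)
sx, sy = Xv.sum(axis=0), Yv.sum(axis=0)
sxx, sxy, syy = (Xv * Xv).sum(axis=0), (Xv * Yv).sum(axis=0), (Yv * Yv).sum(axis=0)

denom = n * sxx - sx ** 2
slope = (n * sxy - sx * sy) / denom
intercept = (sy - slope * sx) / n
r_squared = (n * sxy - sx * sy) ** 2 / (denom * (n * syy - sy ** 2))

slope_map = slope.reshape(y_array.shape[1:])         # back to (3, 3)
r_squared_map = r_squared.reshape(y_array.shape[1:])
</code></pre>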
Question: https://stackoverflow.com/q/79801475 (thefrollickingnerd, https://stackoverflow.com/users/10339757, asked 2025-10-27)
<p>I have a dataset that I need to convert into a rioxarray dataset so I can use regridding features, but I can't work out how to convert an existing xarray object to rioxarray.</p>
<p>Unfortunately I can't just load the object as a rioxarray object, since the file is a CSV that I convert from a pandas DataFrame to an xarray Dataset.</p>
<pre><code>df = pd.read_csv('file.csv')
df=df.set_index(["date", "latitude", "longitude"])
ds=df.to_xarray()
</code></pre>
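<p>What I have been trying is roughly this (a sketch; the CRS string is an assumption, since the CSV only contains plain lat/lon columns):</p>
<pre><code>import rioxarray  # noqa: F401  (importing registers the .rio accessor)

ds = ds.rio.set_spatial_dims(x_dim='longitude', y_dim='latitude')
ds = ds.rio.write_crs('EPSG:4326')
</code></pre>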
Question: https://stackoverflow.com/q/79793507 (Kas Knicely, https://stackoverflow.com/users/31617914, asked 2025-10-18, last activity 2025-10-20)
<p>I'm working in a Jupyter Notebook using Python 3.12. I have a 2D xarray (in reality, it's 3D, but we can treat it as 2D). I want to pull out the values based on indices I acquire elsewhere, and then store those values in a list (or numpy array; haven't decided yet, but the particular container doesn't matter, just that I have the values stored in a container that I can easily access).</p>
<p>I have code that works, but it is <em>painfully</em> slow. See below.</p>
<pre><code>myValues = []
for idx_pair in myIndices:
    myValues.append( da[ idx_pair[0], idx_pair[1] ].item() )
</code></pre>
<p>In the above, <code>myIndices</code> is a two column array, with each row being the x and y index of <code>da</code>. The amount of index pairs in <code>myIndices</code> can reach as high as 100,000, and I need to loop through several different sets of <code>myIndices</code>. <code>da</code> is a 3D xarray ( <code>da.shape ~ (20, 2000, 2000)</code> which correspond to time, x direction, y direction ), though it can be treated as a 2D array that's about 2000x2000 (just x and y). Getting the values for a 2D <code>da</code> takes about 35 seconds. So this code works (it's the only method I've found that does), but it is far too slow to be useful.</p>
<p><strong>How can I more efficiently and rapidly access values from an xarray using index locations?</strong></p>
<p>I have tried <code>da.load()</code>. Supposedly, that takes care of the lazy loading issue inherent with a lot of xarrays, but it does nothing to reduce the run time in my case.</p>
<p>I've tried different ways of accessing the values from the xarray (e.g., <code>.isel(x=myXs, y=myYs)</code> or <code>da[ myXs, myYs ]</code>), but I get weird and insanely large matrices (e.g., ~100,000x100,000 with most of the values being zero). I feel like the solution to my problem is in how I'm accessing the values, but I can't figure out any other method to do it.</p>
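<p>For context, this is the vectorised selection I was attempting; my (possibly wrong) understanding is that wrapping the index arrays as DataArrays that share a common dimension gives point-wise selection instead of the huge outer-product result I was seeing. The dimension names <code>x</code> and <code>y</code> are assumptions here:</p>
<pre><code>import xarray as xr

xs = xr.DataArray(myIndices[:, 0], dims='points')
ys = xr.DataArray(myIndices[:, 1], dims='points')
myValues = da.isel(x=xs, y=ys).values   # one value per (x, y) pair along 'points'; other dims (e.g. time) are kept
</code></pre>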
Question: https://stackoverflow.com/q/47041512 (Ted Habermann, https://stackoverflow.com/users/2726133, asked 2017-10-31, last activity 2025-10-01)
<p>The default backend engine for xarray is set to netcdf4 in this <a href="https://github.com/pydata/xarray/blob/120e039ddb80510b5fe8be9c20c10362b00c7261/xarray/backends/api.py#L23" rel="nofollow noreferrer">function</a> </p>
<p>What is the best way to make h5netcdf the default engine in xarray?</p>
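<p>One lightweight option (a sketch, not a supported xarray setting as far as I know) is to wrap <code>open_dataset</code> so the engine is filled in by default:</p>
<pre><code>import functools
import xarray as xr

open_h5 = functools.partial(xr.open_dataset, engine='h5netcdf')
ds = open_h5('file.nc')  # uses h5netcdf unless overridden
</code></pre>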
Question: https://stackoverflow.com/q/77607224 (Thomas, https://stackoverflow.com/users/14637, asked 2023-12-05, last activity 2025-09-30)
<p>I have a bunch of NetCDF (<code>.nc</code>) files (ERA5 dataset) that I'm reading in Python through <code>xarray</code> and <code>rioxarray</code>. They end up as arrays of <code>float32</code> (4 bytes) in memory.</p>
<p>However, on disk they are stored as <code>short</code> (2 bytes):</p>
<pre><code>$ ncdump -h file.nc
...
short u100(time, latitude, longitude) ;
u100:scale_factor = 0.000895262699529722 ;
u100:add_offset = 2.29252111865024 ;
u100:_FillValue = -32767s ;
u100:missing_value = -32767s ;
...
</code></pre>
<p>Apparently xarray automatically applies the offset and scale factor to convert these integers back into floats while reading the NetCDF file.</p>
<p>Now I'm rechunking these and storing them as zarr, so I can efficiently access entire time series at a single geographical location. However, the zarr files end up at almost twice the size of the original NetCDFs, because the data remain stored as floats. Because it's about a terabyte in its original form, bandwidth and storage considerations are important, so I'd like to make this smaller. And we're not gaining anything by this additional storage size; the incoming data only had 16 bits of precision to begin with.</p>
<p>I know I could just manually convert the data back to shorts on the way into zarr, and back to floats on the way out of zarr, but that's tedious and error-prone (<a href="https://stackoverflow.com/questions/75755441/why-does-saving-to-netcdf-without-encoding-change-some-values-to-nan?r=SearchResults&s=25%7C11.9080">even when it happens automatically</a>).</p>
<p>Is there a way to do this transparently, the way it seems to happen with NetCDF?</p>
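<p>What I would like to avoid writing by hand is essentially this (a sketch; the scale/offset values are copied from the ncdump above, and my understanding is that xarray applies the same CF packing on the way into zarr when it is given in <code>encoding</code>):</p>
<pre><code>encoding = {
    'u100': {
        'dtype': 'int16',
        'scale_factor': 0.000895262699529722,
        'add_offset': 2.29252111865024,
        '_FillValue': -32767,
    }
}
ds.to_zarr('out.zarr', encoding=encoding)
</code></pre>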
Question: https://stackoverflow.com/q/79778444 (Kurt, https://stackoverflow.com/users/25214755, asked 2025-09-29)
<p>I have netCDF files of oceanographic data processed in Python, and I'd like to update their global attributes (i.e., add the same attributes to a bunch of files). I tried doing it in xarray per their instructions, but nothing seems to have been updated. I'm not sure what I'm doing wrong.</p>
<p>With these metadata:</p>
<pre><code>metadata = {'title': 'Time-series data of currents, temperature and waves off Lahaina, Maui, 2023',
'institution':'U.S. Geological Survey'}
</code></pre>
<p>I'm trying to update files in a folder with this:</p>
<pre><code>for file in filelist:
    with xr.open_dataset(file, mode='a') as ds:
        ds = ds.attrs.update(metadata)
</code></pre>
<p>It runs just fine, and according to the file attributes, they have been updated. But when I open a file in Panoply, nothing has been added. The xarray documentation states that this is a safer way to update a file, as it closes the file automatically.</p>
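<p>For comparison, the only approach I know that writes the attributes back into the files uses the netCDF4 library directly in append mode (a sketch, assuming netCDF4 is installed):</p>
<pre><code>import netCDF4

for file in filelist:
    with netCDF4.Dataset(file, mode='a') as nc:
        nc.setncatts(metadata)   # writes the global attributes into the file on disk
</code></pre>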
Question: https://stackoverflow.com/q/79771953 (Adriano Matos, https://stackoverflow.com/users/10902944, asked 2025-09-22, last activity 2025-09-29)
<pre><code>from joblib import load
ntrees_16_model = load(r"ntrees_quantile_16_model_watermask.joblib")
ntrees_50_model = load(r"ntrees_quantile_50_model_watermask.joblib")
ntrees_84_model = load(r"ntrees_quantile_84_model_watermask.joblib")
</code></pre>
<p>I am using xarray's <code>xr.map_blocks()</code> to run a function on my data frame, using these joblib files above. When I look at the Dask dashboard, I am seeing that my function is only being processed on one worker, despite having multiple workers that can do processing. I can supply my function, although no one would be able to replicate my issue with this code without supplying all of the input data needed.</p>
<pre><code>import numpy as np
import pandas as pd

def generate_treelist(pixel_df, ntrees_50_model, ntrees_16_model, ntrees_84_model, pixel_resolution):
    try:
        pixel_df = pixel_df.rename(columns={'X': 'pixel_x'})
        pixel_df = pixel_df.rename(columns={'Y': 'pixel_y'})
        # Drop rows containing NaN values
        pixel_df = pixel_df.dropna()
        if pixel_df.isnull().values.any():
            print(f"[ERROR] NaNs detected in pixel_df after dropping NaN rows.")
            return None

        # Define the sampleDraws function
        pixel_df = pixel_df.reset_index(drop=True)

        def sampleDraws(df, num_pulls):
            try:
                pull_cols = ['pull_' + str(i+1) for i in range(num_pulls)]
                random_samples = np.random.normal(
                    df['mu'].values[:, np.newaxis], df['sd'].values[:, np.newaxis], (len(df), num_pulls))
                samples_df = pd.DataFrame(
                    random_samples, columns=pull_cols, index=df.index)
                out_df = pd.concat([df, samples_df], axis=1)
                return out_df
            except Exception as e:
                print(f"[ERROR] Exception occurred in sampleDraws: {e}")
                return None

        # Main treelist generation code starts here
        pixel_coords = pixel_df[["pixel_x", "pixel_y"]]
        pixel_df = pixel_df.drop(
            columns=['pixel_x', 'pixel_y', 'time'])
        desired_columns = [
            'tmmx_1_mean', 'tmmn_1_mean', 'pr_1_mean', 'vs_1_mean', 'rmin_1_mean',
            'rmax_1_mean', 'etr_1_mean', 'tmmx_5_mean', 'tmmn_5_mean', 'pr_5_mean',
            'vs_5_mean', 'rmin_5_mean', 'rmax_5_mean', 'etr_5_mean', 'tmmx_10_mean',
            'tmmn_10_mean', 'pr_10_mean', 'vs_10_mean', 'rmin_10_mean', 'rmax_10_mean',
            'etr_10_mean', 'tmmx_20_mean', 'tmmn_20_mean', 'pr_20_mean', 'vs_20_mean',
            'rmin_20_mean', 'rmax_20_mean', 'etr_20_mean', 'tmmx_30_mean', 'tmmn_30_mean',
            'pr_30_mean', 'vs_30_mean', 'rmin_30_mean', 'rmax_30_mean', 'etr_30_mean', 'water_mask',
            'ls_blue', 'ls_green', 'ls_red', 'ls_nir', 'ls_swir1', 'ls_swir2',
            'mean_num_of_fires', 'mode_years_since_fire'
        ]
        pixel_df = pixel_df[desired_columns]

        ntrees_50_pred = ntrees_50_model.predict(pixel_df)
        ntrees_16_pred = ntrees_16_model.predict(pixel_df)
        ntrees_84_pred = ntrees_84_model.predict(pixel_df)
        if np.isnan(ntrees_50_pred).any() or np.isnan(ntrees_16_pred).any() or np.isnan(ntrees_84_pred).any():
            print(f"[ERROR] NaNs detected in number of trees prediction.")
            return None

        # Store number of trees' mean and standard deviation for uncertainty
        ntrees_mu = ntrees_50_pred
        ntrees_sd = abs(ntrees_84_pred - ntrees_16_pred) / 2
        df = pd.DataFrame({'mu': ntrees_mu, 'sd': ntrees_sd})
        print(f"[INFO] Sampling number of trees...")
        random_ntrees = sampleDraws(df, num_pulls=1)
        if random_ntrees is None:
            return None
        random_ntrees_clean = random_ntrees.drop(['mu', 'sd'], axis=1)
        unscaled_round_ntrees = np.round(random_ntrees_clean).astype(int)
        random_ntrees_clean[random_ntrees_clean < 0] = 0
        current_ntrees = unscaled_round_ntrees.iloc[:, 0]
        current_ntrees[current_ntrees < 0] = 0
        pixel_df = pd.concat([pixel_df, current_ntrees], axis=1)
        random_coords = np.random.uniform(-pixel_resolution / 2,
                                          pixel_resolution / 2, (len(random_ntrees_clean), 2))
        repeated_pixel_coords = pixel_coords.values.repeat(
            current_ntrees, axis=0)
        tree_coords = repeated_pixel_coords + random_coords
        # Repeat pixel-level variables to match the length of treelist
        ntrees_mu_repeated = np.repeat(ntrees_mu, current_ntrees)
        ntrees_sd_repeated = np.repeat(ntrees_sd, current_ntrees)
        # Construct final treelist DataFrame with uncertainty columns
        treelist = pd.DataFrame({
            'pixel_x': repeated_pixel_coords[:, 0],
            'pixel_y': repeated_pixel_coords[:, 1],
            'tree_x': tree_coords[:, 0],
            'tree_y': tree_coords[:, 1],
            'n_trees_mu': ntrees_mu_repeated,
            'n_trees_sd': ntrees_sd_repeated
        })
        print(treelist)
        return treelist
    except Exception as e:
        print(f"[ERROR] Exception occurred during treelist generation: {e}")
        return None
</code></pre>
<p>The only thing I can think of is that there is a "proper" way to load the joblib files onto the Dask cluster, as they are currently being loaded just in my client. Am I on the right track with my thinking?</p>
<p>IMPORTANT EDIT: I realize now that I didn't provide context on the data frame part. I have a wrapper function that converts the xarray into a data frame, then it runs <code>generate_treelist()</code> within this wrapper, and then converts the resulting data frame back into an xarray</p>
Question: https://stackoverflow.com/q/79772975 (Yotam Ben Saadon, https://stackoverflow.com/users/23213308, asked 2025-09-23, last activity 2025-09-25)
<p>I am trying to concatenate 10 netCDF files that are output files from a software named Ichthyop. Each file is the result of a Lagrangian simulation of particles drifting in the Eastern Mediterranean. The files are from consecutive simulations for the years 2013-2022. They all have the same lon and lat coordinates.</p>
<p>Here is an example of their structure using ncdump -h:</p>
<pre class="lang-none prettyprint-override"><code>netcdf mercator3d_ichthyop-run202508261903 {
dimensions:
time = UNLIMITED ; // (831 currently)
drifter = 9900 ;
type_zone = 3 ;
edge = 1224 ;
latlon = 2 ;
zone0 = 10 ;
geozone0 = 5 ;
zone1 = 30 ;
geozone1 = 5 ;
zone2 = 12 ;
geozone2 = 6 ;
variables:
double time(time) ;
time:calendar = "gregorian" ;
time:units = "seconds since 1900-01-01 00:00" ;
float lon(time, drifter) ;
lon:long_name = "particle longitude" ;
lon:unit = "degree east" ;
float lat(time, drifter) ;
lat:long_name = "particle latitude" ;
lat:unit = "degree north" ;
int mortality(time, drifter) ;
mortality:long_name = "particle cause of mortality" ;
mortality:unit = "death cause" ;
mortality:alive = 0 ;
mortality:dead_cold = 1 ;
mortality:out_of_domain = 2 ;
mortality:old = 3 ;
mortality:beached = 4 ;
mortality:starvation = 5 ;
mortality:dead_hot = 6 ;
mortality:dead_fresh = 7 ;
mortality:dead_saline = 8 ;
float depth(time, drifter) ;
depth:long_name = "particle depth" ;
depth:unit = "meter" ;
int zone(time, drifter, type_zone) ;
zone:long_name = "In which zones is the particle ?" ;
zone:unit = "the index of the zone" ;
zone:type_zone_0 = "release" ;
zone:release_zone_0 = "zone1" ;
zone:release_zone_1 = "zone2" ;
zone:release_zone_2 = "zone3" ;
zone:type_zone_1 = "recruitment" ;
zone:recruitment_zone = "none for this run" ;
zone:type_zone_2 = "target" ;
zone:target_zone = "none for this run" ;
zone:not_released_yet = -99 ;
zone:out_of_zone = -1 ;
int release_zone(drifter) ;
release_zone:long_name = "In which zone has been released the particle ?" ;
release_zone:unit = "the index of the release zone" ;
release_zone:release_zone_0 = "zone1" ;
release_zone:release_zone_1 = "zone2" ;
release_zone:release_zone_2 = "zone3" ;
release_zone:not_released_yet = -99 ;
float region_edge(edge, latlon) ;
region_edge:long_name = "geoposition of region edge" ;
region_edge:unit = "lat degree north lon degree east" ;
int drifter(drifter) ;
drifter:long_name = "drifter index" ;
drifter:unit = "" ;
float coord_zone0(zone0, latlon) ;
coord_zone0:long_name = "zone1" ;
coord_zone0:unit = "x and y coordinates of the center of the cells in the zone" ;
coord_zone0:type = "release" ;
coord_zone0:color = "[r=255,g=0,b=255]" ;
float coord_geo_zone0(geozone0, latlon) ;
float coord_zone1(zone1, latlon) ;
coord_zone1:long_name = "zone2" ;
coord_zone1:unit = "x and y coordinates of the center of the cells in the zone" ;
coord_zone1:type = "release" ;
coord_zone1:color = "[r=255,g=0,b=255]" ;
float coord_geo_zone1(geozone1, latlon) ;
float coord_zone2(zone2, latlon) ;
coord_zone2:long_name = "zone3" ;
coord_zone2:unit = "x and y coordinates of the center of the cells in the zone" ;
coord_zone2:type = "release" ;
coord_zone2:color = "[r=255,g=0,b=255]" ;
float coord_geo_zone2(geozone2, latlon) ;
// global attributes:
:transport_dimension = "3d" ;
:release.schedule.is_enabled = "true" ;
:release.schedule.events = "\"year 2013 month 06 day 01 at 12:00\" \"year 2013 month 06 day 03 at 12:00\" \"year 2013 month 06 day 06 at 12:00\" \"year 2013 month 06 day 09 at 12:00\" \"year 2013 month 06 day 12 at 12:00\" \"year 2013 month 06 day 15 at 12:00\" \"year 2013 month 06 day 18 at 12:00\" \"year 2013 month 06 day 21 at 12:00\" \"year 2013 month 06 day 24 at 12:00\" \"year 2013 month 06 day 27 at 12:00\" \"year 2013 month 06 day 30 at 12:00\" \"year 2013 month 07 day 03 at 12:00\" \"year 2013 month 07 day 06 at 12:00\" \"year 2013 month 07 day 09 at 12:00\" \"year 2013 month 07 day 12 at 12:00\" \"year 2013 month 07 day 15 at 12:00\" \"year 2013 month 07 day 18 at 12:00\" \"year 2013 month 07 day 21 at 12:00\" \"year 2013 month 07 day 24 at 12:00\" \"year 2013 month 07 day 27 at 12:00\" \"year 2013 month 07 day 30 at 12:00\" \"year 2013 month 08 day 02 at 12:00\" \"year 2013 month 08 day 15 at 12:00\" \"year 2013 month 08 day 30 at 12:00\" \"year 2013 month 09 day 01 at 12:00\" \"year 2013 month 09 day 15 at 12:00\" \"year 2013 month 09 day 30 at 12:00\" \"year 2013 month 10 day 01 at 12:00\" \"year 2013 month 10 day 15 at 12:00\" \"year 2013 month 10 day 31 at 12:00\" \"year 2013 month 11 day 01 at 12:00\" \"year 2013 month 11 day 15 at 12:00\" \"year 2013 month 11 day 30 at 12:00\"" ;
:app.time.initial_time = "year 2013 month 01 day 01 at 12:00" ;
:app.time.transport_duration = "0030 day(s) 00 hour(s) 00 minute(s)" ;
:app.time.time_step = "3150" ;
:app.output.netcdf_output_format = "netcdf3" ;
:app.output.file_prefix = "mercator3d" ;
:app.output.output_path = "/home/yotam/Documents/oceanographic_processes_biological_implications/lionfish_dispersal/output/" ;
:app.output.record_frequency = "12" ;
:app.particle_length.initial_length = "0.89" ;
:app.particle_length.hatch_length = "2.09" ;
:app.particle_length.yolk2feeding_length = "2.6" ;
:app.transport.coastline_behavior = "Bouncing" ;
:action.recruitment.stain.enabled = "false" ;
:action.lethal_temp.enabled = "false" ;
:action.recruitment.zone.enabled = "false" ;
:action.advection.enabled = "true" ;
:action.advection.scheme = "Runge Kutta 4" ;
:action.swimming.enabled = "false" ;
:action.growth.enabled = "false" ;
:action.hdisp.enabled = "true" ;
:action.hdisp.epsilon = "1E-9" ;
:action.wind.enabled = "false" ;
:release.txtfile.enabled = "false" ;
:release.ncfile.enabled = "false" ;
:release.zone.enabled = "true" ;
:release.zone.number_particles = "300" ;
:release.zone.zone_file = "/home/yotam/Documents/oceanographic_processes_biological_implications/ichthyop_data/cfg/release_east_med_lev.xml" ;
:release.patches.enabled = "false" ;
:release.stain.enabled = "false" ;
:dataset.mercator3d.enabled = "true" ;
:dataset.mercator3d.input_path = "/home/yotam/Documents/oceanographic_processes_biological_implications/lionfish_dispersal/bashscripts/bashscripts_all/copernicusdata/" ;
:dataset.mercator3d.hgr_pattern = "MED-MFC_006_004_coordinates_cropped.nc" ;
:dataset.mercator3d.zgr_pattern = "MED-MFC_006_004_coordinates_cropped.nc" ;
:dataset.mercator3d.byte_mask_pattern = "MED-MFC_006_004_mask_bathy_cropped.nc" ;
:dataset.mercator3d.gridu_pattern = "*stks_data_*.nc" ;
:dataset.mercator3d.gridv_pattern = "*stks_data_*.nc" ;
:dataset.mercator3d.gridt_pattern = "*stks_data_*.nc" ;
:dataset.mercator3d.field_dim_x = "longitude" ;
:dataset.mercator3d.field_dim_y = "latitude" ;
:dataset.mercator3d.field_dim_z = "depth" ;
:dataset.mercator3d.field_dim_time = "time" ;
:dataset.mercator3d.field_var_lon = "longitude" ;
:dataset.mercator3d.field_var_lat = "latitude" ;
:dataset.mercator3d.field_var_mask = "mask" ;
:dataset.mercator3d.field_var_u = "uo" ;
:dataset.mercator3d.field_var_v = "vo" ;
:dataset.mercator3d.field_var_time = "time" ;
:dataset.mercator3d.field_var_e3t = "e3t" ;
:dataset.mercator3d.field_var_e2t = "e2t" ;
:dataset.mercator3d.field_var_e1t = "e1t" ;
:xml_file = "/home/yotam/Documents/oceanographic_processes_biological_implications/ichthyop_data/cfg/cfg-mercator3d-phd_1.xml" ;
:nb_zones = 3 ;
}
</code></pre>
<p>All 10 files are identical, with the exception of the attributes <code>app.time.initial.time</code> which changes according to the year of the simulation and the <code>release.schedule.events</code>, which also change according to the year, but otherwise stay the same, e.g for 2013 <code>"\"year 2013 month 06 day 01 at 12:00\" \"year 2013 month 06 day 03 at 12:00\" \"year 2013 month 06 day 06 at 12:00\"</code> and etc., for 2014 <code>"\"year 2014 month 06 day 01 at 12:00\" \"year 2014 month 06 day 03 at 12:00\" \"year 2014 month 06 day 06 at 12:00\"</code> and so on.</p>
<p>What I need, but haven't found a way to do yet (since my xarray and Python netCDF knowledge is limited), is the following: when concatenating the files, not only keep and concatenate the <code>release.schedule.events</code> attribute, but also add to the drifters an attribute giving their year of release, in order to differentiate the drifters between the different files, since they are all indexed 0-9899 in every file (drifter index).</p>
<p>Here is an example of how they appear:</p>
<pre class="lang-none prettyprint-override"><code>Out[22]:
<xarray.DataArray 'drifter' (drifter: 9900)> Size: 40kB
array([ 0, 1, 2, ..., 9897, 9898, 9899], shape=(9900,), dtype=int32)
Coordinates:
* drifter (drifter) int32 40kB 0 1 2 3 4 5 ... 9894 9895 9896 9897 9898 9899
Attributes:
long_name: drifter index
unit:
</code></pre>
<p>Besides using Python, I have also tried CDO and NCO (ncrcat and mergetime), but those didn't work either.
Any help on how to do this, by whatever method, would be much appreciated, although ideally I would prefer to solve it using xarray in order to improve my xarray knowledge.</p>
<p>Thanks in advance for any help provided.</p>
<p><strong>EDIT</strong></p>
<p>Here is a link to the 10 files I am trying to concatenate, while also keeping the attributes of each file and re-indexing the drifter dimension: <a href="https://wetransfer.com/downloads/52f893c68d89214d756e2bb0ce7e9ae020250925141818/0cca35?t_exp=1759069098&t_lsid=3e703ea7-c1fc-469d-921c-2bf0c2b12e87&t_network=link&t_rid=ZW1haWx8YWRyb2l0fGRmMjlkNWM4LTdhODEtNDJiNC05MmVhLWYxZGEyYWU4NmQ0NA==&t_s=download_link&t_ts=1758809898" rel="nofollow noreferrer">link_to_files</a></p>
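<p>For reference, this is the direction I have been trying to go in xarray (an untested sketch; the glob pattern and the year range are placeholders for my ten files, and I am unsure whether concatenating along <code>drifter</code> is the right choice given that the time coordinates differ between years):</p>
<pre><code>from pathlib import Path
import xarray as xr

file_paths = sorted(Path('.').glob('mercator3d_ichthyop-run*.nc'))  # placeholder pattern
years = range(2013, 2023)

datasets = []
for year, path in zip(years, file_paths):
    ds = xr.open_dataset(path)
    # tag every drifter with its release year so drifters 0-9899 from different runs stay distinguishable
    ds = ds.assign_coords(release_year=('drifter', [year] * ds.sizes['drifter']))
    # optionally make the drifter index globally unique across files
    ds['drifter'] = ds['drifter'] + (year - 2013) * ds.sizes['drifter']
    datasets.append(ds)

# note: the differing per-year 'time' coordinates get outer-joined (NaN-padded) by default
combined = xr.concat(datasets, dim='drifter', combine_attrs='drop_conflicts')
</code></pre>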
Question: https://stackoverflow.com/q/79765956 (ibrahimin, https://stackoverflow.com/users/29140944, asked 2025-09-16, last activity 2025-09-17)
<p>I calculated the Growing Season Length (GSL) index for the 1950–2023 period in Turkey using ERA5-Land daily mean temperature data. According to the definition:</p>
<ul>
<li>First, find the first occurrence of at least 6 consecutive days with daily mean temperature > 5 °C.</li>
<li>Then, after July 1st (NH), find the first occurrence of at least 6 consecutive days with daily mean temperature < 5 °C.</li>
<li>The number of days between these two dates is recorded as GSL.</li>
</ul>
<p>The issue:</p>
<ul>
<li>In coastal areas or warmer regions, the second condition (6 consecutive days < 5 °C) may never occur in some years, so GSL cannot be calculated and remains NaN.</li>
<li>For example, between 2000–2023, in some pixels GSL is missing for years like 2004, 2007, 2009, and 2012.</li>
<li>When I compute trends (e.g., Mann-Kendall), these missing years result in artificially high positive or negative trends in some pixels, which are not realistic.</li>
</ul>
<p>My questions:</p>
<ul>
<li>How should I handle such missing years before performing trend analysis?</li>
</ul>
<p>What I tried:</p>
<ul>
<li>Implemented GSL following the ETCCDI definition with ERA5-Land daily mean temperatures (1950–2023).</li>
<li>Calculated GSL pixel-wise using xarray + numpy.</li>
<li>Applied Mann-Kendall trend test on the yearly GSL time series.</li>
</ul>
<p>What I expected:</p>
<ul>
<li>A realistic spatial distribution of GSL trends, e.g., gradual lengthening or shortening in certain regions.</li>
<li>Missing years to have minimal influence on long-term trends.</li>
</ul>
<p>What happened instead:</p>
<ul>
<li>Pixels with missing years (due to no 6-day <5 °C period) show very large artificial trends (sometimes >±10 days/decade), which are not climatologically reasonable.</li>
</ul>
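<p>One pragmatic option I am considering (a sketch; <code>gsl</code> is my yearly GSL DataArray with a <code>year</code> dimension, and the threshold is arbitrary) is to exclude pixels whose record has too many missing years before running Mann-Kendall:</p>
<pre><code>max_missing = 5  # arbitrary threshold
n_missing = gsl.isnull().sum(dim='year')
gsl_for_trend = gsl.where(n_missing <= max_missing)  # pixels above the threshold become all-NaN and are skipped
</code></pre>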
Question: https://stackoverflow.com/q/79753007 (Nanoputian, https://stackoverflow.com/users/10355850, asked 2025-09-02, last activity 2025-09-04)
<p>I have a dask-backed xarray which is about 150k x 90k with a chunk size of 8192 x 8192. I am working on a Windows virtual machine which has 100 GB of RAM and 16 cores.</p>
<p>I want to plot it using the <code>Datashader</code> package. From reading the <a href="https://datashader.org/user_guide/Performance.html" rel="nofollow noreferrer">docs</a>, <code>Datashader</code> does support dask-backed xarrays. I am a beginner with <code>Dask</code>, though my (limited) understanding is that since the xarray is chunked, <code>Datashader</code> should only need to read one chunk at a time (or something to that effect) and do the required processing to produce the plot, so it doesn't need to read the entire xarray into memory. Moreover, it should be able to leverage parallelism to speed up this process.</p>
<p>However, when I try to create a plot, after about 15 seconds the RAM starts increasing steadily (and CPU > 95%) until I eventually interrupt the code manually when it reaches about 90 GB of RAM. After interrupting, below is part of the error message.</p>
<pre><code>KeyboardInterrupt                         Traceback (most recent call last)
Cell In[4], line 1
----> 1 tf.shade(ds.Canvas(plot_height=300, plot_width=300).raster(dask_xarray))

File <PATH>\lib\site-packages\datashader\transfer_functions\__init__.py:717, in shade(agg, cmap, color_key, how, alpha, min_alpha, span, name, color_baseline, rescale_discrete_levels)
    713     return _apply_discrete_colorkey(
    714         agg, color_key, alpha, name, color_baseline
    715     )
    716 else:
--> 717     return _interpolate(agg, cmap, how, alpha, span, min_alpha, name,
    718                         rescale_discrete_levels)
    719 elif agg.ndim == 3:
    720     return _colorize(agg, color_key, how, alpha, span, min_alpha, name, color_baseline,
    721                      rescale_discrete_levels)

File <PATH>\lib\site-packages\datashader\transfer_functions\__init__.py:256, in _interpolate(agg, cmap, how, alpha, span, min_alpha, name, rescale_discrete_levels)
    254 data = agg.data
    255 if isinstance(data, da.Array):
--> 256     data = data.compute()
    257 else:
    258     data = data.copy()
</code></pre>
<p>From the above error message, it seems like <code>compute</code> is being run on the xarray, which would perhaps explain why the RAM is increasing so much. If that is the case, I am confused why <code>compute</code> is being run, since I thought the whole point of using Dask-backed xarrays is to avoid reading the entire xarray into memory. I would have thought it should be possible to create this plot even on a regular computer with 16 GB of RAM, with the main limiting factor being the chunk size plus some extra space for dask/datashader to work in the background.</p>
<p>I have almost certainly misunderstood some things. It would be much appreciated if anyone could explain what is happening here and how I can go about plotting the array that I have.</p>
<p><strong>Minimal reproducible example</strong></p>
<pre><code># import libraries
import numpy as np
import dask.array as da
import datashader as ds
from datashader import transfer_functions as tf
import xarray as xr
# create large dask array
dask_array = da.random.random((100000, 100000), chunks=(1000, 1000))
# convert to dasked xarray
dask_xarray = xr.DataArray(
    dask_array,
    dims=["x", "y"],
    coords={"x": np.arange(100000), "y": np.arange(100000)},
    name="example_data",  # name of the data variable
)
# create plot using Datashader
tf.shade(ds.Canvas(plot_height=300, plot_width=300).raster(dask_xarray))
</code></pre>
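<p>For reference, here is a sketch of a possible workaround (untested at full scale): pre-aggregate with dask to a small in-memory array first, and only then hand it to <code>Datashader</code>. The window size 333 is chosen only to land near a 300 x 300 result; I would still like to understand why the direct <code>raster</code> call needs so much memory.</p>
<pre><code># reduce with dask first (streamed chunk by chunk), then shade the small result
preview = dask_xarray.coarsen(x=333, y=333, boundary="trim").mean().compute()
img = tf.shade(ds.Canvas(plot_height=300, plot_width=300).raster(preview))
</code></pre>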
https://stackoverflow.com/q/797556151dtm34https://stackoverflow.com/users/314179542025-09-04T10:44:57Z2025-09-04T17:58:05Z
<p>I'm currently trying to resample a large GeoTIFF file to a coarser resolution. The file contains classes of tree species (indicated by integer values) at each pixel, so I want each resampled block (3x3 or 2x2) to contain the most frequently occurring integer/tree species (ignoring the nan/fill value).</p>
<p>My first approach was using <code>scipy.stats.mode</code>, which does exactly that, but it made RAM usage explode no matter whether I used raw numpy or dask. (See my issue on GitHub: <a href="https://github.com/pydata/xarray/issues/10636" rel="nofollow noreferrer">link</a>.)</p>
<p>The actual data is just 20 % of my RAM (roughly 5.5 GB, shape (86145, 64074) of <code>uint8</code>), so it should be possible (I think). Dask also cannot run this lazily and computes immediately, which is also confusing to me (and the reason for my initial question in the issue). It just keeps allocating RAM until my RAM and swap are full and my laptop collapses.</p>
<p>In my head this task is actually quite simple:</p>
<ol>
<li>Reshape the array to contain 2 new dimensions which represent the reduced window at each new location (which is exactly what <code>xr.coarsen</code> or <code>skimage.measure.block_reduce</code> does)</li>
<li>For each new location apply a function to all the values in the window, in my case mode.</li>
<li>The result is the same dataset at a coarser resolution</li>
</ol>
<p>I'm probably a bit naive and it's not that easy, but I can't wrap my head around why this is so hard to do; it doesn't seem to be that hard of a computational task.
I tried many approaches, from <code>np.bincount</code> and <code>np.unique</code> to implementing something myself in Python (with <code>numba</code>) and reshaping everything into 2-dimensional data, but nothing works. Is numpy/Python simply not the right tool? (A strip-wise numpy sketch of the reshape-and-mode idea is shown after the example code below.)</p>
<p>So my question is:</p>
<ol>
<li>How can I solve this efficiently using numpy?</li>
</ol>
<p>Example code to reproduce:</p>
<pre class="lang-py prettyprint-override"><code>import xarray as xr
import dask.array as da
import numpy as np
from scipy import stats
# Define dimensions
nx, ny, nt = 3000, 300, 100 # size of each dimension
chunks = (300, 30, 10) # chunk sizes for Dask
# Create Dask arrays
data1 = da.random.random((nx, ny, nt), chunks=chunks)
data2 = da.random.random((nx, ny, nt), chunks=chunks)
# Create coordinates
x = np.linspace(0, 10, nx)
y = np.linspace(0, 5, ny)
time = np.arange(nt)
# Build the xarray Dataset
ds = xr.Dataset(
    {
        "temperature": (("x", "y", "time"), data1),
        "precipitation": (("x", "y", "time"), data2),
    },
    coords={
        "x": x,
        "y": y,
        "time": time,
    }
)
# custom function for computing the mode
def find_mode(arr, axis):
    m, _ = stats.mode(arr, axis=axis)
    return m
# this is lazy!
coarse_mean_ds = ds.coarsen(x=3, y=3, boundary='pad').reduce(np.mean)
# this computes on the spot and allocates far too much RAM in a single worker!
maj_vote_coarse = ds.coarsen(x=3, y=3, boundary='pad').reduce(find_mode)
</code></pre>
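<p>To make steps 1–2 above concrete, here is a rough strip-wise numpy/scipy sketch of the reshape-and-mode idea (the function name, strip size and edge trimming are my own simplifications; it assumes scipy >= 1.9 for the <code>keepdims</code> argument, and it does not yet skip the nan/fill value):</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np
from scipy import stats

def block_mode(a, bx=3, by=3, strip=1024):
    """Majority vote over non-overlapping bx-by blocks, processed in strips of output rows."""
    h, w = a.shape
    h2, w2 = h - h % bx, w - w % by                       # trim ragged edges
    out = np.empty((h2 // bx, w2 // by), dtype=a.dtype)
    for r0 in range(0, h2, strip * bx):                   # bound memory: `strip` output rows at a time
        r1 = min(r0 + strip * bx, h2)
        blk = (a[r0:r1, :w2]
               .reshape((r1 - r0) // bx, bx, w2 // by, by)
               .transpose(0, 2, 1, 3)
               .reshape((r1 - r0) // bx, w2 // by, bx * by))
        out[r0 // bx:r1 // bx] = stats.mode(blk, axis=-1, keepdims=False).mode
    return out

classes = np.random.randint(0, 20, size=(900, 600)).astype("uint8")
print(block_mode(classes).shape)                          # (300, 200)
</code></pre>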
https://stackoverflow.com/q/4993332715Rich Signellhttps://stackoverflow.com/users/20058692018-04-20T03:09:43Z2025-08-27T17:05:42Z
<p>I'm trying to rechunk a NetCDF file collection and create a Zarr dataset on AWS S3. I have 168 original NetCDF4 classic files with arrays of dimension <code>time: 1, y: 3840, x: 4608</code> chunked as <code>chunks={'time':1, 'y':768, 'x':922}</code>.</p>
<p>I want to write this output to Zarr, and I want to optimize for time series extraction, so include more time records in my chunks. I figured I would use xarray to help get the job done, since I have a lot of processors available to take advantage of Dask, and xarray has both <code>xr.open_mfdataset</code> and <code>ds.to_zarr</code>.</p>
<p>I first tried rechunking to <code>chunks={'time':24, 'y':768, 'x':922}</code> to match the input NetCDF4 chunking in <code>x</code> and <code>y</code>, but when I tried to write to Zarr it complained because it needed uniform chunk sizes in both <code>x</code> and <code>y</code>, allowing a non-uniform size only in the last chunk along the <code>time</code> dimension (and unfortunately, in the <code>x</code> dimension the total size 4608 is not a multiple of the chunk size 922).</p>
<p>So then I tried <code>chunks={'time':168, 'y':384, 'x':288}</code> and that started working, and proceeded very quickly for a few minutes, then got slower and slower. Eventually after 50 minutes, the cluster died with:</p>
<pre><code>4072 distributed.core - INFO - Event loop was unresponsive in Worker for 1.41s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
4073 slurmstepd: error: Step 3294889.0 exceeded memory limit (25346188 > 25165824), being killed
</code></pre>
<p>Here's the code I'm using:</p>
<pre><code>from dask.distributed import Client
import pandas as pd
import xarray as xr
import s3fs
import zarr
client = Client(scheduler_file='/home/rsignell/scheduler.json')
client
</code></pre>
<p><a href="https://i.sstatic.net/F4zyZ.png" rel="noreferrer"><img src="https://i.sstatic.net/F4zyZ.png" alt="enter image description here"></a></p>
<pre><code>root = '/lustre/projects/hazards/cmgp/woodshole/rsignell/nwm/forcing_short_range/'
bucket_endpoint='https://s3.us-west-1.amazonaws.com/'
f_zarr = 'rsignell/nwm/test_week4'
dates = pd.date_range(start='2018-04-01T00:00', end='2018-04-07T23:00', freq='H')
urls = ['{}{}/nwm.t{}z.short_range.forcing.f001.conus.nc'.format(root,a.strftime('%Y%m%d'),a.strftime('%H')) for a in dates]
ds = xr.open_mfdataset(urls, concat_dim='time', chunks={'time':1, 'y':768, 'x':922})
ds = ds.drop(['ProjectionCoordinateSystem','time_bounds'])
ds = ds.chunk(chunks={'time':168, 'y':384, 'x':288}).persist()
ds
</code></pre>
<p>producing</p>
<pre><code><xarray.Dataset>
Dimensions: (reference_time: 168, time: 168, x: 4608, y: 3840)
Coordinates:
* reference_time (reference_time) datetime64[ns] 2018-04-01 ...
* x (x) float64 -2.304e+06 -2.303e+06 -2.302e+06 -2.301e+06 ...
* y (y) float64 -1.92e+06 -1.919e+06 -1.918e+06 -1.917e+06 ...
* time (time) datetime64[ns] 2018-04-01T01:00:00 ...
Data variables:
T2D (time, y, x) float64 dask.array<shape=(168, 3840, 4608), chunksize=(168, 384, 288)>
LWDOWN (time, y, x) float64 dask.array<shape=(168, 3840, 4608), chunksize=(168, 384, 288)>
Q2D (time, y, x) float64 dask.array<shape=(168, 3840, 4608), chunksize=(168, 384, 288)>
U2D (time, y, x) float64 dask.array<shape=(168, 3840, 4608), chunksize=(168, 384, 288)>
V2D (time, y, x) float64 dask.array<shape=(168, 3840, 4608), chunksize=(168, 384, 288)>
PSFC (time, y, x) float64 dask.array<shape=(168, 3840, 4608), chunksize=(168, 384, 288)>
RAINRATE (time, y, x) float32 dask.array<shape=(168, 3840, 4608), chunksize=(168, 384, 288)>
SWDOWN (time, y, x) float64 dask.array<shape=(168, 3840, 4608), chunksize=(168, 384, 288)>
</code></pre>
<p>Then I call</p>
<pre><code>fs = s3fs.S3FileSystem(anon=False, client_kwargs=dict(endpoint_url=bucket_endpoint))
d = s3fs.S3Map(f_zarr, s3=fs)
compressor = zarr.Blosc(cname='zstd', clevel=3, shuffle=2)
encoding = {vname: {'compressor': compressor} for vname in ds.data_vars}
delayed_store = ds.to_zarr(store=d, mode='w', encoding=encoding, compute=False)
persist_store = delayed_store.persist(retries=100)
</code></pre>
<p>and just before it dies, the Dask Dashboard looks like this:</p>
<p><a href="https://i.sstatic.net/Q6EDn.png" rel="noreferrer"><img src="https://i.sstatic.net/Q6EDn.png" alt="enter image description here"></a>
<a href="https://i.sstatic.net/D1DQS.png" rel="noreferrer"><img src="https://i.sstatic.net/D1DQS.png" alt="enter image description here"></a></p>
<p>The total size of the NetCDF4 files is 20 GB, so it seems a bit crazy that I've got over 500 GB showing in the Dask Dashboard, and that 30 processors, each with 60 GB of RAM, are not sufficient for the job.</p>
<p>What am I doing wrong, or what would be a better solution?</p>
https://stackoverflow.com/q/547776383alextchttps://stackoverflow.com/users/10213062019-02-20T01:50:05Z2025-08-22T12:16:50Z
<p>I need to set a title for multiple plots each of which is drawn with Basemap.</p>
<p>The source data is an xarray DataArray object. The code uses DataArray.plot() to draw multiple plots on a single figure.</p>
<pre><code>p = da_monthly_slice.plot(levels=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], cmap=plt.cm.Reds, x='longitude', y='latitude', col='time', col_wrap=3)
for i, ax in enumerate(p.axes.flatten()):
    ax.set_title('Month: %d' % i)
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    map = Basemap(llcrnrlat=-39.2, urcrnrlat=-33.9, llcrnrlon=140.8, urcrnrlon=150.0, resolution='i', ax=ax)
    map.readshapefile(shapefile1, 'CFA_DISTRICT_BODY', linewidth=0.5)
fig = plt.figure()
st = fig.suptitle("Number of days for each month meeting the criteria in 2017", fontsize="large")
plt.show()
</code></pre>
<p>This causes two figures to be drawn separately: one holds the multiple plots, while the other contains only the title.</p>
<p>How can I make the title appear at the top of the multiple plots, on the same figure?</p>
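<p>A minimal sketch of the layout I am after, assuming the <code>FacetGrid</code> returned by <code>.plot()</code> exposes its figure as <code>p.fig</code>, so that no new figure is created just for the title:</p>
<pre><code>import matplotlib.pyplot as plt

p = da_monthly_slice.plot(levels=list(range(32)), cmap=plt.cm.Reds,
                          x='longitude', y='latitude', col='time', col_wrap=3)
for i, ax in enumerate(p.axes.flatten()):
    ax.set_title('Month: %d' % i)
# attach the title to the facet figure itself instead of calling plt.figure()
p.fig.suptitle("Number of days for each month meeting the criteria in 2017", fontsize="large")
plt.show()
</code></pre>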
https://stackoverflow.com/q/797418120rock0789https://stackoverflow.com/users/193491862025-08-21T04:13:09Z2025-08-21T04:48:19Z
<p>I am using Python 3.9.18 to read wrfout files (with names like <code>wrfout_d02_2020-01-01_00:00:00</code>), extract the T2, Q2, PSFC, U10, V10 and ACSWDNB variables, and combine all days of a month into one output NetCDF file (with a name like WRF_PGW_2C_re-202001-d02-hourly.nc).</p>
<p>Here is the code:</p>
<pre><code>import xarray as xr
import glob
from metpy.units import units
import numpy as np
import pandas as pd
import calendar
from tqdm import tqdm

def get_file_list(input_dir: str, year: int, month: int) -> list:
    """
    Get list of WRF output files for a given year and month.

    Args:
        input_dir (str): Directory containing WRF output files.
        year (int): Year of the data.
        month (int): Month of the data.

    Returns:
        list: Sorted list of file paths.
    """
    file_pattern = f'{input_dir}wrfout_d02_{year}-{month:02}-*_00:00:00'
    return sorted(glob.glob(file_pattern))

def process_single_file(file: str) -> xr.Dataset:
    """
    Process a single WRF output file and extract specified variables.

    Args:
        file (str): Path to the WRF output file.

    Returns:
        xr.Dataset: Dataset with extracted variables and units.
    """
    ds = xr.open_dataset(file, engine='netcdf4')
    # Extract variables with units
    data_vars = {
        'Times': ds['Times'],
        'T2': ds['T2'] * units.kelvin,
        'Q2': ds['Q2'] * units('kg/kg'),
        'PSFC': ds['PSFC'] * units.pascal,
        'U10': ds['U10'] * units('m/s'),
        'V10': ds['V10'] * units('m/s'),
        'ACSWDNB': ds['ACSWDNB'] * units('J/m^2'),
    }
    # Create new dataset
    single_ds = xr.Dataset(data_vars)
    # Copy attributes
    for var in data_vars:
        single_ds[var].attrs = ds[var].attrs
    ds.close()
    return single_ds

def merge_datasets(datasets: list, year: int, month: int) -> xr.Dataset:
    """
    Merge datasets along the 'ymd' dimension.

    Args:
        datasets (list): List of xarray Datasets.
        year (int): Year of the data.
        month (int): Month of the data.

    Returns:
        xr.Dataset: Merged dataset.
    """
    # merged_ds = xr.concat(datasets, dim='ymd', data_vars='minimal', coords='minimal')
    merged_ds = xr.concat(datasets, dim='ymd')
    return merged_ds.transpose('ymd', 'Time', 'south_north', 'west_east')

def save_to_netcdf(merged_ds: xr.Dataset, output_path: str, year: int, month: int) -> None:
    """
    Save merged dataset to a NetCDF file.

    Args:
        merged_ds (xr.Dataset): Merged dataset to save.
        output_path (str): Directory to save the NetCDF file.
        year (int): Year of the data.
        month (int): Month of the data.
    """
    output_file = f'WRF_PGW_2C_re-{year}{month:02}-d02-hourly.nc'
    merged_ds.to_netcdf(
        f'{output_path}{output_file}',
        format='NETCDF4',
        engine='netcdf4',
        encoding={
            'Times': {'dtype': 'S1'},
            'T2': {'dtype': 'float32'},
            'Q2': {'dtype': 'float32'},
            'PSFC': {'dtype': 'float32'},
            'U10': {'dtype': 'float32'},
            'V10': {'dtype': 'float32'},
            'ACSWDNB': {'dtype': 'float32'},
        }
    )

def main():
    """
    Main function to process WRF output files, merge daily data into monthly files,
    and save as NetCDF.
    """
    base_input_path = '/home/user/work/share/WRF3type/WRF_PGW/2C_re/'  # for PGW_2C run
    output_path = '/home/user/work/share/WRF3type/hrdata-TCCIP-WRF3type-nc/PGW_2C/'
    years = range(2020, 2021)
    for year in tqdm(years, desc="Processing years"):
        year_folder = f'2C_re-{year}-PGW_wrfout_ERA5_domainv2_L38/'
        input_dir = f'{base_input_path}{year_folder}test3/'
        for month in tqdm(range(1, 2), desc=f"Processing months for {year}", leave=False):
            # Get file list
            file_list = get_file_list(input_dir, year, month)
            # Process each file with progress bar
            datasets = [process_single_file(file) for file in tqdm(file_list, desc=f"Processing files for {year}-{month:02}", leave=False)]
            # Merge datasets
            merged_ds = merge_datasets(datasets, year, month)
            # Save to NetCDF
            save_to_netcdf(merged_ds, output_path, year, month)

if __name__ == "__main__":
    main()
</code></pre>
<p>Then running <code>ncdump -c</code> on the resulting NetCDF file gives:</p>
<pre><code>netcdf WRF_PGW_2C_re-202001-d02-hourly {
dimensions:
ymd = 31 ;
Time = 24 ;
string19 = 19 ;
south_north = 230 ;
west_east = 150 ;
variables:
char Times(ymd, Time, string19) ;
Times:coordinates = "XTIME" ;
float T2(ymd, Time, south_north, west_east) ;
T2:_FillValue = NaNf ;
T2:FieldType = 104 ;
T2:MemoryOrder = "XY " ;
T2:description = "TEMP at 2 M" ;
T2:units = "K" ;
T2:stagger = "" ;
T2:coordinates = "XLAT XLONG XTIME" ;
float Q2(ymd, Time, south_north, west_east) ;
Q2:_FillValue = NaNf ;
Q2:FieldType = 104 ;
Q2:MemoryOrder = "XY " ;
Q2:description = "QV at 2 M" ;
Q2:units = "kg kg-1" ;
Q2:stagger = "" ;
Q2:coordinates = "XLAT XLONG XTIME" ;
float PSFC(ymd, Time, south_north, west_east) ;
PSFC:_FillValue = NaNf ;
PSFC:FieldType = 104 ;
PSFC:MemoryOrder = "XY " ;
PSFC:description = "SFC PRESSURE" ;
PSFC:units = "Pa" ;
PSFC:stagger = "" ;
PSFC:coordinates = "XLAT XLONG XTIME" ;
float U10(ymd, Time, south_north, west_east) ;
U10:_FillValue = NaNf ;
U10:FieldType = 104 ;
U10:MemoryOrder = "XY " ;
U10:description = "U at 10 M" ;
U10:units = "m s-1" ;
U10:stagger = "" ;
U10:coordinates = "XLAT XLONG XTIME" ;
float V10(ymd, Time, south_north, west_east) ;
V10:_FillValue = NaNf ;
V10:FieldType = 104 ;
V10:MemoryOrder = "XY " ;
V10:description = "V at 10 M" ;
V10:units = "m s-1" ;
V10:stagger = "" ;
V10:coordinates = "XLAT XLONG XTIME" ;
float ACSWDNB(ymd, Time, south_north, west_east) ;
ACSWDNB:_FillValue = NaNf ;
ACSWDNB:FieldType = 104 ;
ACSWDNB:MemoryOrder = "XY " ;
ACSWDNB:description = "ACCUMULATED DOWNWELLING SHORTWAVE FLUX AT BOTTOM" ;
ACSWDNB:units = "J m-2" ;
ACSWDNB:stagger = "" ;
ACSWDNB:coordinates = "XLAT XLONG XTIME" ;
float XTIME(ymd, Time) ;
XTIME:_FillValue = NaNf ;
XTIME:FieldType = 104 ;
XTIME:MemoryOrder = "0 " ;
XTIME:description = "minutes since 2020-01-01 00:00:00" ;
XTIME:stagger = "" ;
XTIME:units = "minutes since 2020-01-01" ;
XTIME:calendar = "proleptic_gregorian" ;
float XLAT(Time, south_north, west_east) ;
XLAT:_FillValue = NaNf ;
XLAT:FieldType = 104 ;
XLAT:MemoryOrder = "XY " ;
XLAT:description = "LATITUDE, SOUTH IS NEGATIVE" ;
XLAT:units = "degree_north" ;
XLAT:stagger = "" ;
XLAT:coordinates = "XLONG XLAT" ;
float XLONG(Time, south_north, west_east) ;
XLONG:_FillValue = NaNf ;
XLONG:FieldType = 104 ;
XLONG:MemoryOrder = "XY " ;
XLONG:description = "LONGITUDE, WEST IS NEGATIVE" ;
XLONG:units = "degree_east" ;
XLONG:stagger = "" ;
XLONG:coordinates = "XLONG XLAT" ;
data:
}
</code></pre>
<p>I have two questions in mind:</p>
<ol>
<li><p>In the Python code I never set or extracted XTIME, so I have no idea why it appears in the NetCDF file. In other words, I only extract the 'Times' variable to carry the datetimes as strings. Is there a way to drop XTIME when merging the datasets, if possible?</p>
</li>
<li><p>XLAT and XLONG hold the latitude and longitude information. I would like them to be 2-D (south_north, west_east), without the "Time" or "ymd" dimension, and to appear as the first two variables (see the sketch after this list).</p>
</li>
</ol>
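<p>To illustrate both points, this is roughly what I have in mind, applied to <code>merged_ds</code> before <code>to_netcdf</code> (a sketch only; the names follow the <code>ncdump</code> output above and it is untested against the real files):</p>
<pre><code># 1) drop the XTIME coordinate that rides along with every extracted variable
merged_ds = merged_ds.drop_vars("XTIME", errors="ignore")
# 2) make XLAT/XLONG purely 2-D (south_north, west_east); they do not vary with Time
merged_ds = merged_ds.assign_coords(
    XLAT=merged_ds["XLAT"].isel(Time=0, drop=True),
    XLONG=merged_ds["XLONG"].isel(Time=0, drop=True),
)
</code></pre>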
https://stackoverflow.com/q/679631992yacxhttps://stackoverflow.com/users/93566752021-06-13T22:53:01Z2025-08-16T08:03:10Z
<p>I have a grib file containing monthly precipitation and temperature from 1989 to 2018 (extracted from ERA5-Land).</p>
<p>I need to have those data in a dataset format with six columns: longitude, latitude, ID of the cell/point in the grib file, date, temperature and precipitation.</p>
<p>I first imported the file using cfgrib. Here is what the returned list contains after import:</p>
<pre><code>import cfgrib
grib_data = cfgrib.open_datasets('\era5land_extract.grib')
grib_data
Out[6]:
[<xarray.Dataset>
Dimensions: (latitude: 781, longitude: 761, time: 372)
Coordinates:
number int32 0
* time (time) datetime64[ns] 1989-01-01 1989-02-01 ... 2019-12-01
step timedelta64[ns] 1 days
surface float64 0.0
* latitude (latitude) float64 42.0 41.9 41.8 41.7 ... -35.8 -35.9 -36.0
* longitude (longitude) float64 -21.0 -20.9 -20.8 -20.7 ... 54.8 54.9 55.0
valid_time (time) datetime64[ns] ...
Data variables:
t2m (time, latitude, longitude) float32 ...
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts,
<xarray.Dataset>
Dimensions: (latitude: 781, longitude: 761, time: 156)
Coordinates:
number int32 0
* time (time) datetime64[ns] 1989-01-01 1989-02-01 ... 2001-12-01
step timedelta64[ns] 1 days
surface float64 0.0
* latitude (latitude) float64 42.0 41.9 41.8 41.7 ... -35.8 -35.9 -36.0
* longitude (longitude) float64 -21.0 -20.9 -20.8 -20.7 ... 54.8 54.9 55.0
valid_time (time) datetime64[ns] ...
Data variables:
tp (time, latitude, longitude) float32 ...
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts,
<xarray.Dataset>
Dimensions: (latitude: 781, longitude: 761, time: 216)
Coordinates:
number int32 0
* time (time) datetime64[ns] 2002-01-01 2002-02-01 ... 2019-12-01
step timedelta64[ns] 1 days
surface float64 0.0
* latitude (latitude) float64 42.0 41.9 41.8 41.7 ... -35.8 -35.9 -36.0
* longitude (longitude) float64 -21.0 -20.9 -20.8 -20.7 ... 54.8 54.9 55.0
valid_time (time) datetime64[ns] ...
Data variables:
tp (time, latitude, longitude) float32 ...
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts]
</code></pre>
<p>So the temperature variable is called "t2m" and the precipitation variable "tp".
The precipitation variable is split across two datasets, but I don't understand why.</p>
<p>How can I obtain the needed dataset from this, please?</p>
<p>It's the first time I'm dealing with such data, and I'm really lost on how to proceed.</p>
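<p>To be explicit about the target format, something like this flat table is what I am after (a rough sketch of one possible conversion; the <code>cell_id</code> labelling is my own assumption, and the full 781 x 761 x 372 grid will be very large as a flat table):</p>
<pre><code>import xarray as xr

# grib_data is the list returned by cfgrib.open_datasets(...) above
t2m = grib_data[0]["t2m"]
# precipitation is split across two datasets, so re-join it along time first
tp = xr.concat([grib_data[1]["tp"], grib_data[2]["tp"]], dim="time")
combined = xr.Dataset({"t2m": t2m, "tp": tp})

df = combined.to_dataframe().reset_index()
df = df[["longitude", "latitude", "time", "t2m", "tp"]]
# a simple cell ID derived from the unique lat/lon pairs (illustrative only)
df["cell_id"] = df.groupby(["latitude", "longitude"]).ngroup()
</code></pre>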
https://stackoverflow.com/q/782015130blaylockbkhttps://stackoverflow.com/users/23830702024-03-21T16:33:17Z2025-07-31T13:34:42Z
<p>How can I drop all xarray Dataset coordinates and variables that don't have any dimensions? In this example, I want to drop <code>item</code> and <code>nothing</code> from this Dataset.</p>
<pre class="lang-py prettyprint-override"><code>import xarray as xr
ds = xr.Dataset(
    {"temperature": (["x", "y"], [[1, 2, 3], [1, 2, 3], [1, 2, 3]]), "nothing": 10},
    coords={"item": "hi", "x": [1, 2, 3], "y": [1, 2, 3]},
)
ds
</code></pre>
<p><a href="https://i.sstatic.net/mfVkR.png" rel="nofollow noreferrer"><img src="https://i.sstatic.net/mfVkR.png" alt="enter image description here" /></a></p>
<p>I could of course use <code>ds.drop_vars(["item", "nothing"])</code>, but I was expecting a different method like <code>ds.drop_dims()</code> or <code>ds.squeeze()</code> could do this without me specifying which to drop.</p>
<p>An alternative way to look at this question is: how can I select only the coordinates and variables whose dimensions are <code>(x, y)</code>?</p>
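<p>For completeness, a generic fallback would be a comprehension over <code>ds.variables</code> (a sketch; it drops anything 0-dimensional rather than selecting on <code>(x, y)</code> specifically):</p>
<pre class="lang-py prettyprint-override"><code># drop every variable/coordinate that has no dimensions
scalar_names = [name for name, v in ds.variables.items() if v.ndim == 0]
print(ds.drop_vars(scalar_names))  # keeps temperature, x and y; drops item and nothing
</code></pre>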
https://stackoverflow.com/q/797087780Njoku Okechukwu Valentinehttps://stackoverflow.com/users/43825132025-07-21T08:30:19Z2025-07-22T23:11:53Z
<p>I was given the task of analyzing a .nc file. I have not worked with this format before, so I did some research online and found some example code.
While exploring the dataset, I tried the following:</p>
<p>I am using Debian 12, Python 3 and a Jupyter notebook.</p>
<pre><code># import needed libraries
import numpy as np
import xarray as xr
# import dataset
file_path = 'dataset/air_quality.nc'
xrds = xr.open_dataset(file_path)
xrds
</code></pre>
<p>It printed:</p>
<pre>
Dimensions:
Coordinates: (0)
Data variables: (0)
Indexes: (0)
Attributes: (52)
</pre>
<p>Down the line, I ran the following code:</p>
<pre><code>xrds.data_vars
</code></pre>
<p>Output:</p>
<pre>
Data variables:
*empty*
</pre>
<p>Does this mean the dataset has no variables or values I can explore?</p>
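<p>One thing I do not know how to rule out (an assumption on my part, since I don't know the file's internal structure) is whether the variables might live inside netCDF groups, which <code>xr.open_dataset</code> does not read by default. A quick check would be:</p>
<pre><code>import netCDF4

nc = netCDF4.Dataset(file_path)
print(nc.groups)  # if a group shows up, it can be opened with xr.open_dataset(file_path, group="...")
</code></pre>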
https://stackoverflow.com/q/579249004Philipe Riskalla Lealhttps://stackoverflow.com/users/86780152019-09-13T13:49:53Z2025-07-18T14:52:19Z
<p>I am applying slicing and aggregation operations over NetCDF files in Python. One of the solutions for working with this kind of file is the Xarray library.</p>
<p>I am still new to the library's functionality, so I would like to know whether Xarray objects have some way to check if a sliced Dataset/DataArray is empty or not, just like pandas does (in pandas, one can check whether a DataFrame/Series is empty through the 'empty' attribute).</p>
<p>The only solution I found was to always convert the Xarray Dataset/DataArray into a pandas DataFrame/Series and then check whether it is empty.</p>
<p>Here is code snippet as example:</p>
<pre><code>import xarray as xr
path = 'my_path_to_my_netcdf_file.nc'
Xarray_DataArray = xr.open_dataset(path)
print(Xarray_DataArray)
# this returns something like:
# Dimensions: (lat: 600, lon: 672, time: 37)
# Coordinates:
# * lat (lat) float32 -3.9791672 -3.9375012 ... 20.9375 20.979166
# * lon (lon) float32 -60.979168 -60.9375 ... -33.0625 -33.020832
# * time (time) datetime64[ns] 2010-05-19 2010-05-20 ... 2010-06-24
# Data variables:
# variable_name (time, lat, lon) float32 dask.array<shape=(37, 600, 672),
# chunksize=(37, 600, 672)>
# I normally use the 'sel' method to slice the xarray object, like below:
Sliced_Xarray_DataArray = Xarray_DataArray.sel({'lat': slice(-10, -9),
                                                'lon': slice(-170, -169)})
# but since Xarray does not have a proper way to check the slice, I usually have to do the following:
if Sliced_Xarray_DataArray.to_dataframe().empty:
print('is empty. Nothing to aggregate')
else:
Aggregated_value = Aggregation_function(Sliced_Xarray_DataArray)
print('continuing with the analysis')
# ... continue
</code></pre>
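<p>For comparison, this is the kind of shortcut I am hoping exists, built only from the dimension sizes (a sketch; <code>sizes</code> maps each dimension name to its length, and a slice is empty when any length is zero):</p>
<pre><code># check emptiness from the dimension sizes, without building a DataFrame
is_empty = any(length == 0 for length in Sliced_Xarray_DataArray.sizes.values())
# for a single DataArray, `da.size == 0` would be an equivalent shortcut
if is_empty:
    print('is empty. Nothing to aggregate')
else:
    Aggregated_value = Aggregation_function(Sliced_Xarray_DataArray)
</code></pre>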
<p>I would appreciate any suggestions.</p>
<p>I thank you for your time, and I hope to hear from you soon.</p>
<p>Sincerely yours,</p>
<p>Philipe R. Leal</p>
https://stackoverflow.com/q/797038760Kieran Bartelshttps://stackoverflow.com/users/284379812025-07-16T18:36:57Z2025-07-17T23:44:41Z
<p>My overall goal is to set up a virtual dataset of <a href="https://registry.opendata.aws/nsf-ncar-era5/" rel="nofollow noreferrer">ERA5</a> data using <a href="https://icechunk.io/en/latest/" rel="nofollow noreferrer">Icechunk</a>. As a smaller test example, I'm trying to pull all the data located in the <code>194001</code> <a href="https://nsf-ncar-era5.s3.amazonaws.com/index.html#e5.oper.fc.sfc.minmax/194001/" rel="nofollow noreferrer">ERA5 folder</a>. I've mostly been able to follow the example located on the <a href="https://icechunk.io/en/latest/virtual/" rel="nofollow noreferrer">Virtual Dataset</a> page.</p>
<p>My current issue is that half the ERA5 <code>.nc</code> files have <strong>30</strong> <code>forecast_initial_time</code> values and the other half have <strong>32</strong>. (I've attached an image of when I was successfully able to create the dataset using a different method so you can better understand the dataset's structure.) Below is the code I've been able to produce so far. All my packages and libraries are up to date and the newest versions.</p>
<pre><code># Imports (open_virtual_dataset and manifest_mod are assumed to come from the virtualizarr package)
import pathlib
import fsspec
import xarray as xr

# Access all s3 files in the 194001 folder
fs = fsspec.filesystem('s3', anon=True)
s3_files = fs.glob('nsf-ncar-era5/e5.oper.fc.sfc.minmax/194001/e5.oper.fc.sfc.minmax.*.nc')
s3_files = sorted(['s3://' + f for f in s3_files])

# Create an array of virtual datasets
manifest_mod.PosixPath = pathlib.Path
reader_options = {
    "storage_options": {
        "anon": True,
    }
}
virtual_datasets = [
    open_virtual_dataset(
        filepath=url,
        indexes={},
        reader_options=reader_options,
        filetype='netCDF4',
    )
    for url in s3_files
]

# Concat all the datasets into one
virtual_ds = xr.concat(
    virtual_datasets,
    dim='time',
    coords='minimal',
    compat='override',
    combine_attrs='override',
)
</code></pre>
<p>The <code>xr.concat</code> is where I get the following error:</p>
<pre><code>AlignmentError: cannot reindex or align along dimension 'forecast_initial_time' because of conflicting dimension sizes: {32, 30}
</code></pre>
<p>I understand that the differing dimension sizes of the <code>forecast_initial_time</code> coordinate are causing the issue, but I have no idea how to manipulate the data. Any help would be great!</p>
<p><a href="https://i.sstatic.net/9WlesqKN.png" rel="nofollow noreferrer">Merged Datasets</a></p>