Skip to content

Add weight threshold option for temporal operations#683

Merged
tomvothecoder merged 14 commits intomainfrom
feature/531-temporal-apis
Aug 21, 2025
Merged

Add weight threshold option for temporal operations#683
tomvothecoder merged 14 commits intomainfrom
feature/531-temporal-apis

Conversation

@tomvothecoder
Copy link
Copy Markdown
Collaborator

@tomvothecoder tomvothecoder commented Jul 30, 2024

Description

This PR adds a min_weight threshold parameter to temporal group averaging APIs, allowing users to set a minimum data coverage requirement for temporal averaging. This enhancement ensures temporal averages are only computed when sufficient data is available, with values below the threshold being set to NaN.

  • Adds min_weight parameter to group_average, climatology, and departures methods
  • Implements weight threshold validation and masking in weighted temporal averaging
  • Refactors weighted averaging logic into a dedicated _weighted_group_average method

More Context

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)
@tomvothecoder tomvothecoder added the type: enhancement New enhancement request label Jul 30, 2024
@tomvothecoder tomvothecoder self-assigned this Jul 30, 2024
@codecov
Copy link
Copy Markdown

codecov bot commented Jul 30, 2024

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (683e973) to head (f21b2b6).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #683   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           16        16           
  Lines         1768      1781   +13     
=========================================
+ Hits          1768      1781   +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
xcdat/utils.py Outdated
Comment on lines +143 to +147
This function is useful for cases where the weighting of data might be
skewed based on the availability of data. For example, if one season in a
time series has more significantly more missing data than other seasons, it
can result in inaccurate calculations of climatologies. Masking values that
do not meet the minimum weight threshold ensures more accurate calculations.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an accurate description?

@tomvothecoder
Copy link
Copy Markdown
Collaborator Author

To test this PR against real-world data, we can:

  1. Create a script that loops over model data with different weight threshold options
  2. Generate expected: CDAT temporal averaging with criteriaarg specified
  3. Generate result: Runs xCDAT temporal averaging with min_weight specified
  4. Compare results against expected for closeness and aligned np.nan values
@lee1043
Copy link
Copy Markdown
Collaborator

lee1043 commented Apr 23, 2025

climatology and departures seems not having min_weight parameter added. Can they have it as like group_average for consistency?

@tomvothecoder
Copy link
Copy Markdown
Collaborator Author

climatology and departures seems not having min_weight parameter added. Can they have it as like group_average for consistency?

I am converting this PR back to draft and will tag you once it is ready.

@tomvothecoder tomvothecoder marked this pull request as draft April 23, 2025 22:35
@tomvothecoder tomvothecoder added project: seats-year5 A SEATS goal for year 5. and removed project: seats-year3 A SEATS goals for year 3. labels Jun 30, 2025
@tomvothecoder
Copy link
Copy Markdown
Collaborator Author

@pochedls I actually don't think this is dependent on #735. I will pick this one back up first since it is easier I think.

@tomvothecoder tomvothecoder force-pushed the feature/531-temporal-apis branch 2 times, most recently from 9de2d94 to 059ce37 Compare July 17, 2025 18:16
@pochedls
Copy link
Copy Markdown
Collaborator

@pochedls I actually don't think this is dependent on #735. I will pick this one back up first since it is easier I think.

I can't find where I linked these together, but I agree this is a related-but-separate issue.

@tomvothecoder tomvothecoder force-pushed the feature/531-temporal-apis branch from 059ce37 to 5ef344f Compare July 17, 2025 20:29
@tomvothecoder tomvothecoder force-pushed the feature/531-temporal-apis branch from 3c9e799 to 2cb27ce Compare August 14, 2025 20:41
@tomvothecoder
Copy link
Copy Markdown
Collaborator Author

tomvothecoder commented Aug 14, 2025

It looks like "> min_weight" being used, wondering rather ">= min_weight" could be used. For example, when min_weight=1 (consider grid with no missing time step), I get an empty field (see below, rightmost figure).

Thank you for the review so far @lee1043.

@pochedls can you or someone else grant me read permission to /globa/cfs/projectdirs/m4581? I'm getting PermissionError: [Errno 13] Permission denied: '/global/cfs/projectdirs/m4581/obs4MIPs/obs4MIPs_LLNL/MOHC/HadISST-1-1/mon/ts/gn/v20250415/ts_mon_HadISST-1-1_PCMDI_gn_187001-202501.nc' with Jiwoo's example script.

@lee1043
Copy link
Copy Markdown
Collaborator

lee1043 commented Aug 14, 2025

@tomvothecoder I put the input data here in NERSC, can you check if you can access to it?

/pscratch/sd/l/lee1043/temporary/ts_mon_HadISST-1-1_PCMDI_gn_187001-202501.nc

@tomvothecoder
Copy link
Copy Markdown
Collaborator Author

@tomvothecoder I put the input data here in NERSC, can you check if you can access to it?

/pscratch/sd/l/lee1043/temporary/ts_mon_HadISST-1-1_PCMDI_gn_187001-202501.nc

This works, thanks Jiwoo.

@tomvothecoder tomvothecoder force-pushed the feature/531-temporal-apis branch 2 times, most recently from b3ed292 to d353df8 Compare August 19, 2025 23:50
Comment on lines +1670 to +1680
# Set averaged data to NaN where masked weights are below the
# minimum threshold.
# Mask averaged data where the fraction of weights in each group
# does not meet the minimum weight threshold (fractional).
if self._min_weight > 0.0:
# The sum of all weights in each group (i.e., full coverage)
weight_sum_all = self._group_data(self._weights).sum(skipna=skipna)

# Fraction of weights present in each group.
weight_fraction = masked_weights_group_sum / weight_sum_all

# Mask the averaged data where the weight fraction is below
# the minimum weight threshold.
dv_avg = xr.where(
masked_weights_group_sum >= self._min_weight,
weight_fraction >= self._min_weight,
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @lee1043 and @pochedls, I identified and fixed a bug after reviewing your comments, as shown in the code diff here.

Before

This compared against the absolute sum of valid weights, which could behave inconsistently across groups with different total weights.

# Set averaged data to NaN where masked weights are below the
# minimum threshold.
if self._min_weight > 0.0:
    dv_avg = xr.where(
        masked_weights_group_sum >= self._min_weight,
        ...
    )

After

Now the check is fractional coverage (masked / total), ensuring consistent thresholding across groups of different sizes.

# Mask averaged data where the fraction of weights in each group
# does not meet the minimum weight threshold (fractional).
if self._min_weight > 0.0:
    # The sum of all weights in each group (i.e., full coverage)
    weight_sum_all = self._group_data(self._weights).sum(skipna=skipna)

    # Fraction of weights present in each group
    weight_fraction = masked_weights_group_sum / weight_sum_all

    # Mask the averaged data where the weight fraction is below
    # the minimum weight threshold
    dv_avg = xr.where(
        weight_fraction >= self._min_weight,
        ...
    )

This logic matches the spatial average weight threshold function here:

xcdat/xcdat/spatial.py

Lines 784 to 844 in 15c5ee7

def _mask_var_with_weight_threshold(
self,
dv: xr.DataArray,
dv_mean: xr.DataArray,
dim: list[str],
weights: xr.DataArray,
min_weight: float,
) -> xr.DataArray:
"""Mask values that do not meet the minimum weight threshold with np.nan.
This function is useful for cases where the weighting of data might be
skewed based on the availability of data. For example, if a portion of
cells in a region has significantly more missing data than other other
regions, it can result in inaccurate spatial average calculations.
Masking values that do not meet the minimum weight threshold ensures
more accurate calculations.
Parameters
----------
dv : xr.DataArray
The weighted variable used for getting masked weights.
dv_mean : xr.DataArray
The average of the weighted variable.
dim: list[str]:
List of axis dimensions to average over.
weights : xr.DataArray
A DataArray containing either the regional weights used for weighted
averaging. ``weights`` must include the same axis dimensions and
dimensional sizes as the data variable.
min_weight : float, optional
Minimum threshold of data coverage (weight) required to compute
a spatial average for a grouping window. Must be between 0 and 1.
Useful for ensuring accurate averages in regions with missing data,
by default None (equivalent to 0.0).
The value must be between 0 and 1, where:
- 0/``None`` means no minimum threshold (all data is considered,
even if coverage is minimal).
- 1 means full data coverage is required (no missing data is
allowed).
Returns
-------
xr.DataArray
The average of the weighted variable with the minimum weight
threshold applied.
"""
# Sum all weights, including zero for missing values.
weight_sum_all = weights.sum(dim=dim)
masked_weights = _get_masked_weights(dv, weights)
weight_sum_masked = masked_weights.sum(dim=dim)
# Get fraction of the available weight.
frac = weight_sum_masked / weight_sum_all
# Nan out values that don't meet specified weight threshold.
dv_new = xr.where(frac >= min_weight, dv_mean, np.nan, keep_attrs=True)
dv_new.name = dv_mean.name
return dv_new

With this change, @lee1043’s script (here) now produces plots for all min_weight settings (shown below):

output

Next Steps

@lee1043 and @pochedls — can you review again to confirm this new logic is correct?

Side Note

My unit tests didn’t catch this bug and still pass with the new logic. This highlights the importance of validating with real data alongside unit tests, as we routinely do.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomvothecoder I was able to get the same plot using the latest version of this branch!

This comment was marked as outdated.

Copy link
Copy Markdown
Collaborator

@lee1043 lee1043 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks for the PR, @tomvothecoder!

Comment on lines +1670 to +1680
# Set averaged data to NaN where masked weights are below the
# minimum threshold.
# Mask averaged data where the fraction of weights in each group
# does not meet the minimum weight threshold (fractional).
if self._min_weight > 0.0:
# The sum of all weights in each group (i.e., full coverage)
weight_sum_all = self._group_data(self._weights).sum(skipna=skipna)

# Fraction of weights present in each group.
weight_fraction = masked_weights_group_sum / weight_sum_all

# Mask the averaged data where the weight fraction is below
# the minimum weight threshold.
dv_avg = xr.where(
masked_weights_group_sum >= self._min_weight,
weight_fraction >= self._min_weight,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomvothecoder I was able to get the same plot using the latest version of this branch!

@tomvothecoder tomvothecoder requested a review from Copilot August 21, 2025 16:13
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a min_weight threshold parameter to temporal group averaging APIs, allowing users to set a minimum data coverage requirement for temporal averaging operations. This ensures temporal averages are only computed when sufficient data is available.

  • Adds min_weight parameter to group_average, climatology, and departures methods
  • Implements weight threshold validation and masking in weighted temporal averaging
  • Refactors weighted averaging logic into a dedicated _weighted_group_average method

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
xcdat/temporal.py Adds min_weight parameter to temporal averaging methods and implements weighted averaging logic with threshold validation
tests/test_temporal.py Updates existing tests to include expected min_weight attributes and adds comprehensive test coverage for min_weight threshold functionality

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

tomvothecoder and others added 14 commits August 21, 2025 10:54
Add `_get_masked_weights()`

Update the order of methods

Convert temporal weight threshold methods to general functions
- Add tests for threshold functions

Add tests for utils

Apply suggestions from code review

Apply suggestions from code review

Fix masking of data in `_group_average()`
- This method now masks data using the weights grouped properly, instead of using the `weight_var_with_weight_threshold()` function
- Add `from __future__ import annotations` to `spatial.py`
- Add and update units

Fix docstring

Remove unused `_mask_var_with_weight_threshold()` function

Fix TypeError due to annotaitons
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@tomvothecoder tomvothecoder force-pushed the feature/531-temporal-apis branch from 39975bd to f21b2b6 Compare August 21, 2025 17:54
@tomvothecoder tomvothecoder merged commit 58dac72 into main Aug 21, 2025
10 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in xCDAT Development Aug 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

project: seats-year5 A SEATS goal for year 5. type: enhancement New enhancement request

4 participants