
Fix two flaky multi-DataNode/multi-region ITs from #18046 (remove-DataNode selection + migrate precondition) #18082

Open
CRZbulabula wants to merge 2 commits into master from fix-remove-datanode-it-conflict-free-selector


@CRZbulabula CRZbulabula commented Jul 1, 2026

Description

Two integration tests added by #18046 (Support multiple regions in MIGRATE REGION and multiple DataNodes in REMOVE DATANODE) are flaky on the every-PR Simple/ClusterIT pipeline. Both flakes are test-side only; no production code is touched.

1. IoTDBRemoveDataNodeNormalIT.success1C5DRemoveTwoDataNodesUseSQL

The test removes two DataNodes at once and selects them via IoTDBRemoveDataNodeUtils.selectRemoveDataNodesWithoutRegionConflict, which must avoid picking two DataNodes that host replicas of the same consensus group (otherwise the ConfigNode rejects the removal: "Only one replica of the same consensus group is allowed to be migrated at the same time."). #18075 introduced this selector to replace the previous random pick.

However, the selector was a single-pass greedy scan over a shuffled DataNode list with no backtracking: it unconditionally committed to the first shuffled DataNode and never reconsidered it. When removing 2 DataNodes it therefore threw IllegalStateException whenever the first shuffled DataNode shared a consensus group with every other DataNode (a "hub" in the region-sharing graph), even though a valid conflict-free pair still existed among the other DataNodes. In the 1C5D layout (5 data-region groups at factor 2, balanced to ~2 replicas per DataNode, plus one schema-region group at factor 3) such a hub commonly exists, so a fraction of runs aborted with IllegalStateException before REMOVE DATANODE was ever submitted.
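
The failure mode can be reproduced on a toy model. The sketch below is illustrative only and is not the code in IoTDBRemoveDataNodeUtils: the class name, the method names, and the concrete region layout are assumptions chosen so that DataNode 1 is a hub. It shows a non-backtracking greedy scan succeeding or throwing depending solely on which DataNode comes first in the order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative model only; all names and the layout are hypothetical. */
final class GreedySelectionSketch {

  /** Hypothetical 1C5D layout: DataNode id -> ids of the consensus groups it hosts. */
  static Map<Integer, Set<Integer>> hubLayout() {
    Map<Integer, Set<Integer>> hosted = new LinkedHashMap<>();
    // Group 0 is the schema region (factor 3); groups 1..5 are data regions (factor 2).
    hosted.put(1, new HashSet<>(Arrays.asList(0, 1, 2))); // hub: shares a group with 2, 3, 4 and 5
    hosted.put(2, new HashSet<>(Arrays.asList(0, 3, 4)));
    hosted.put(3, new HashSet<>(Arrays.asList(0, 3, 5)));
    hosted.put(4, new HashSet<>(Arrays.asList(1, 4)));
    hosted.put(5, new HashSet<>(Arrays.asList(2, 5)));
    return hosted;
  }

  /** Single pass, no backtracking: the first DataNode in the order is always kept. */
  static List<Integer> greedySelect(List<Integer> order, Map<Integer, Set<Integer>> hosted, int k) {
    List<Integer> selected = new ArrayList<>();
    for (Integer candidate : order) {
      boolean conflictFree = true;
      for (Integer chosen : selected) {
        // Two DataNodes conflict when they host replicas of the same consensus group.
        if (!Collections.disjoint(hosted.get(candidate), hosted.get(chosen))) {
          conflictFree = false;
          break;
        }
      }
      if (conflictFree) {
        selected.add(candidate);
      }
      if (selected.size() == k) {
        return selected;
      }
    }
    throw new IllegalStateException("no conflict-free selection found by the greedy scan");
  }

  public static void main(String[] args) {
    Map<Integer, Set<Integer>> hosted = hubLayout();
    // Non-hub first: the scan finds a pair.
    System.out.println(greedySelect(Arrays.asList(2, 1, 3, 4, 5), hosted, 2));
    try {
      // Hub first: the scan throws although the pairs (2,5), (3,4) and (4,5) are valid.
      greedySelect(Arrays.asList(1, 2, 3, 4, 5), hosted, 2);
    } catch (IllegalStateException e) {
      System.out.println("hub first: " + e.getMessage());
    }
  }
}
```

In this layout only one of the five DataNodes is a hub, so a uniform shuffle would put it first in roughly one run out of five; the real cluster layout varies from run to run.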

Fix: replace the greedy with an exhaustive depth-first search with backtracking (searchConflictFreeDataNodes) that visits each combination at most once and fails only when no conflict-free set of the requested size actually exists. The initial Collections.shuffle is kept, so when several valid selections exist a random one is still returned. The search space is tiny (a handful of DataNodes), so the cost is negligible, and this also future-proofs the helper for removeDataNodeNum > 2. Also corrected a stale log line that reported "timeout in 2 minutes" while awaitUntilSuccess actually waits 5 minutes.
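
A minimal sketch of such a backtracking search, under stated assumptions: the class, method names, and the hub layout below are hypothetical and only model the idea; the actual searchConflictFreeDataNodes in this PR may differ in signature and details. The search takes an already-ordered list (the caller would shuffle it first) and returns the first conflict-free set of size k in that order, or an empty Optional when none exists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Illustrative model only; all names and the layout are hypothetical. */
final class BacktrackingSelectionSketch {

  /** Hypothetical 1C5D layout in which DataNode 1 shares a group with every other DataNode. */
  static Map<Integer, Set<Integer>> hubLayout() {
    Map<Integer, Set<Integer>> hosted = new LinkedHashMap<>();
    hosted.put(1, new HashSet<>(Arrays.asList(0, 1, 2)));
    hosted.put(2, new HashSet<>(Arrays.asList(0, 3, 4)));
    hosted.put(3, new HashSet<>(Arrays.asList(0, 3, 5)));
    hosted.put(4, new HashSet<>(Arrays.asList(1, 4)));
    hosted.put(5, new HashSet<>(Arrays.asList(2, 5)));
    return hosted;
  }

  /** Returns a conflict-free selection of size k, or empty if no such selection exists. */
  static Optional<List<Integer>> select(
      List<Integer> order, Map<Integer, Set<Integer>> hosted, int k) {
    List<Integer> chosen = new ArrayList<>();
    return search(order, hosted, k, 0, chosen) ? Optional.of(chosen) : Optional.empty();
  }

  /** Depth-first search; only indices >= start are tried, so no combination is visited twice. */
  private static boolean search(
      List<Integer> order, Map<Integer, Set<Integer>> hosted, int k, int start,
      List<Integer> chosen) {
    if (chosen.size() == k) {
      return true;
    }
    for (int i = start; i < order.size(); i++) {
      Integer candidate = order.get(i);
      boolean conflictFree = true;
      for (Integer picked : chosen) {
        if (!Collections.disjoint(hosted.get(candidate), hosted.get(picked))) {
          conflictFree = false;
          break;
        }
      }
      if (!conflictFree) {
        continue;
      }
      chosen.add(candidate);
      if (search(order, hosted, k, i + 1, chosen)) {
        return true;
      }
      chosen.remove(chosen.size() - 1); // backtrack: drop the candidate and try the next one
    }
    return false;
  }

  public static void main(String[] args) {
    Map<Integer, Set<Integer>> hosted = hubLayout();
    // The hub comes first, yet the search recovers by backtracking past it.
    System.out.println(select(Arrays.asList(1, 2, 3, 4, 5), hosted, 2));
    // No three DataNodes are mutually conflict-free in this layout.
    System.out.println(select(Arrays.asList(1, 2, 3, 4, 5), hosted, 3));
  }
}
```

With five DataNodes there are at most 10 pairs to consider for k = 2, which is why the exhaustive search costs nothing measurable in a test helper.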

2. IoTDBMigrateMultiRegionForIoTV1IT.multiRegionMigrateTest

The test runs on 1C5D with replication factor 1 and, as a precondition, requires a source DataNode hosting at least two regions (selectDataNodeHostingMultipleRegions); otherwise it fails with RuntimeException: Cannot find a DataNode hosting at least two regions. Under the default AUTO data-region policy, a single insert created only ~2-3 factor-1 regions. When the balanced allocator spread them across the 5 DataNodes with at most one region each, the precondition was not met and the test errored intermittently (observed on this PR's CI, unrelated to the change itself).

Fix: force the CUSTOM data-region policy with 6 data-region groups per database. With 6 regions over 5 DataNodes at replication factor 1, pigeonhole guarantees at least one DataNode hosts ≥ 2 regions. Since each factor-1 region lives on a single DataNode, another DataNode is always available as a conflict-free migration destination.
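
The pigeonhole argument can be checked by brute force. The sketch below is a standalone illustration with hypothetical names, not part of the test code: it enumerates every placement of single-replica regions onto DataNodes and counts those in which no DataNode hosts two or more regions, that is, the placements that would break the precondition.

```java
/** Illustrative brute-force check; class and method names are hypothetical. */
final class PigeonholeCheck {

  /**
   * Counts the placements of {@code regions} single-replica regions onto {@code dataNodes}
   * DataNodes in which every DataNode hosts at most one region.
   */
  static long placementsWithoutMultiRegionNode(int regions, int dataNodes) {
    long total = 1;
    for (int i = 0; i < regions; i++) {
      total *= dataNodes; // dataNodes^regions possible placements
    }
    long count = 0;
    for (long code = 0; code < total; code++) {
      int[] load = new int[dataNodes];
      long rest = code;
      boolean spread = true;
      for (int r = 0; r < regions; r++) {
        int node = (int) (rest % dataNodes); // decode the DataNode of region r
        rest /= dataNodes;
        if (++load[node] >= 2) {
          spread = false; // some DataNode hosts two regions: precondition satisfied
          break;
        }
      }
      if (spread) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(placementsWithoutMultiRegionNode(6, 5)); // 6 regions, 5 DataNodes
    System.out.println(placementsWithoutMultiRegionNode(3, 5)); // 3 regions, 5 DataNodes
  }
}
```

For 6 regions over 5 DataNodes the count is 0, so the precondition cannot fail. For 3 regions, 60 of the 125 placements leave every DataNode with at most one region, and a balancing allocator favors exactly those placements.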


This PR has:

  • been self-reviewed.
  • added comments explaining the "why" and the intent of the code.
  • modified existing integration tests to remove flakiness.

Key changed/added classes (or packages if there are too many classes) in this PR
  • IoTDBRemoveDataNodeUtils (integration-test) — exhaustive backtracking conflict-free DataNode selection replacing the non-backtracking greedy.
  • IoTDBRemoveDataNodeNormalIT (integration-test) — corrected stale timeout log message.
  • IoTDBMigrateMultiRegionForIoTV1IT (integration-test) — deterministic region layout so the "source DataNode with ≥2 regions" precondition always holds.
…kiness

success1C5DRemoveTwoDataNodesUseSQL removes two DataNodes at once and picks
them via selectRemoveDataNodesWithoutRegionConflict, which must avoid choosing
two DataNodes that host replicas of the same consensus group.

The previous selection was a single-pass greedy over a shuffled DataNode list:
it unconditionally commits to the first shuffled DataNode and never backtracks.
For k=2 this throws IllegalStateException whenever the first shuffled DataNode
shares a consensus group with every other DataNode (a "hub"), even though a
valid conflict-free pair exists among the others. In the 1C5D layout (5 data
regions factor 2 + 1 schema region factor 3) such a hub commonly exists, so the
test could abort with IllegalStateException before REMOVE DATANODE was even
submitted.

Replace the greedy with an exhaustive depth-first search with backtracking, so
the selection fails only when no conflict-free set exists; the shuffle is kept
so a random valid selection is still returned. Also correct a stale log message
that said "timeout in 2 minutes" while awaitUntilSuccess waits 5 minutes.

codecov Bot commented Jul 1, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.46%. Comparing base (383458f) to head (f09445f).

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18082      +/-   ##
============================================
- Coverage     41.57%   41.46%   -0.11%     
  Complexity      318      318              
============================================
  Files          5294     5294              
  Lines        371424   371424              
  Branches      48061    48061              
============================================
- Hits         154410   154029     -381     
- Misses       217014   217395     +381     

multiRegionMigrateTest runs on 1C5D with replication factor 1 and requires a
source DataNode hosting at least two regions (selectDataNodeHostingMultipleRegions),
otherwise it throws "Cannot find a DataNode hosting at least two regions". Under
the default AUTO data-region policy a single insert created only ~2-3 factor-1
regions, and when the balanced allocator spread them across the 5 DataNodes with
at most one region each, the precondition failed and the test errored
intermittently.

Force the CUSTOM data-region policy with 6 data region groups per database; with
6 regions over 5 DataNodes and replication factor 1, pigeonhole guarantees at
least one DataNode hosts >= 2 regions, and (since each factor-1 region lives on a
single DataNode) another DataNode is always available as a conflict-free
migration destination.
@CRZbulabula CRZbulabula changed the title Fix flaky remove-DataNode IT by making conflict-free DataNode selection exhaustive Jul 1, 2026
