Fix two flaky multi-DataNode/multi-region ITs from #18046 (remove-DataNode selection + migrate precondition) #18082
Open
CRZbulabula wants to merge 2 commits into
…kiness

success1C5DRemoveTwoDataNodesUseSQL removes two DataNodes at once and picks them via selectRemoveDataNodesWithoutRegionConflict, which must avoid choosing two DataNodes that host replicas of the same consensus group. The previous selection was a single-pass greedy scan over a shuffled DataNode list: it unconditionally committed to the first shuffled DataNode and never backtracked. For k=2 it threw IllegalStateException whenever the first shuffled DataNode shared a consensus group with every other DataNode (a "hub"), even though a valid conflict-free pair existed among the others. In the 1C5D layout (5 data regions at factor 2 + 1 schema region at factor 3) such a hub commonly exists, so the test could abort with IllegalStateException before REMOVE DATANODE was even submitted.

Replace the greedy scan with an exhaustive depth-first search with backtracking, so the selection fails only when no conflict-free set exists; the shuffle is kept so a random valid selection is still returned.

Also correct a stale log message that said "timeout in 2 minutes" while awaitUntilSuccess waits 5 minutes.
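For reviewers who want to see the idea in isolation, here is a minimal sketch of a shuffled depth-first search with backtracking. It is not the code in IoTDBRemoveDataNodeUtils: the class name ConflictFreeSelection, the Map<Integer, Set<Integer>> representation (DataNode id to hosted consensus-group ids), and the hubLayout example are all assumptions made for illustration. The example layout has DataNode 1 as a hub, so a greedy scan that committed to it first would fail, while the pairs (4,5), (2,5) and (3,4) remain valid.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the test helper; names and data model are assumptions. */
final class ConflictFreeSelection {

  /**
   * Picks k DataNodes such that no two of them host replicas of the same consensus group.
   * groupsByNode maps a DataNode id to the ids of the consensus groups it hosts.
   */
  static List<Integer> select(Map<Integer, Set<Integer>> groupsByNode, int k) {
    List<Integer> candidates = new ArrayList<>(groupsByNode.keySet());
    Collections.shuffle(candidates); // keep the result random among valid selections
    List<Integer> chosen = new ArrayList<>();
    if (!search(candidates, 0, k, groupsByNode, new HashSet<>(), chosen)) {
      throw new IllegalStateException("No conflict-free set of " + k + " DataNodes exists");
    }
    return chosen;
  }

  // Depth-first search: try each remaining candidate and undo the choice if it leads nowhere.
  private static boolean search(
      List<Integer> candidates,
      int start,
      int k,
      Map<Integer, Set<Integer>> groupsByNode,
      Set<Integer> usedGroups,
      List<Integer> chosen) {
    if (chosen.size() == k) {
      return true;
    }
    for (int i = start; i < candidates.size(); i++) {
      Integer node = candidates.get(i);
      Set<Integer> groups = groupsByNode.get(node);
      if (!Collections.disjoint(usedGroups, groups)) {
        continue; // shares a consensus group with an already chosen DataNode
      }
      chosen.add(node);
      usedGroups.addAll(groups);
      if (search(candidates, i + 1, k, groupsByNode, usedGroups, chosen)) {
        return true;
      }
      chosen.remove(chosen.size() - 1); // backtrack: drop the last choice
      usedGroups.removeAll(groups); // safe: groups was disjoint from usedGroups before the add
    }
    return false;
  }

  /** Illustrative 5-DataNode layout: schema group 0 (factor 3), data groups 10..14 (factor 2). */
  static Map<Integer, Set<Integer>> hubLayout() {
    return Map.of(
        1, Set.of(0, 10, 11), // hub: shares a group with every other DataNode
        2, Set.of(0, 12, 13),
        3, Set.of(0, 12, 14),
        4, Set.of(10, 13),
        5, Set.of(11, 14));
  }

  public static void main(String[] args) {
    System.out.println("Selected DataNodes: " + select(hubLayout(), 2));
  }
}
```

Because the search only gives up after every combination has been tried, asking for three DataNodes in this layout throws IllegalStateException legitimately: no conflict-free triple exists there.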
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##             master   #18082      +/-   ##
============================================
- Coverage     41.57%   41.46%   -0.11%
  Complexity      318      318
============================================
  Files          5294     5294
  Lines        371424   371424
  Branches      48061    48061
============================================
- Hits         154410   154029     -381
- Misses       217014   217395     +381
multiRegionMigrateTest runs on 1C5D with replication factor 1 and requires a source DataNode hosting at least two regions (selectDataNodeHostingMultipleRegions), otherwise it throws "Cannot find a DataNode hosting at least two regions". Under the default AUTO data-region policy a single insert created only ~2-3 factor-1 regions, and when the balanced allocator spread them across the 5 DataNodes with at most one region each, the precondition failed and the test errored intermittently. Force the CUSTOM data-region policy with 6 data region groups per database; with 6 regions over 5 DataNodes and replication factor 1, pigeonhole guarantees at least one DataNode hosts >= 2 regions, and (since each factor-1 region lives on a single DataNode) another DataNode is always available as a conflict-free migration destination.
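The pigeonhole claim can be checked by brute force. The sketch below is an illustration only and is not part of the test: the class PigeonholeCheck and its method are hypothetical, and it models a factor-1 region simply as an index into an owner array. It enumerates every possible assignment of regions to DataNodes and reports whether each one leaves some DataNode with at least two regions.

```java
/** Hypothetical brute-force check of the pigeonhole argument; not part of the PR. */
final class PigeonholeCheck {

  /**
   * Returns true if every assignment of `regions` single-replica regions to `nodes`
   * DataNodes leaves at least one DataNode hosting two or more regions.
   */
  static boolean everyLayoutHasMultiRegionNode(int regions, int nodes) {
    int[] owner = new int[regions]; // owner[r] = index of the DataNode hosting region r
    while (true) {
      int[] load = new int[nodes];
      for (int o : owner) {
        load[o]++;
      }
      boolean found = false;
      for (int l : load) {
        if (l >= 2) {
          found = true;
        }
      }
      if (!found) {
        return false; // this layout has at most one region per DataNode
      }
      // Advance `owner` like a base-`nodes` counter to reach the next assignment.
      int i = 0;
      while (i < regions && ++owner[i] == nodes) {
        owner[i] = 0;
        i++;
      }
      if (i == regions) {
        return true; // counter wrapped around: all assignments were checked
      }
    }
  }

  public static void main(String[] args) {
    System.out.println("6 regions on 5 DataNodes: " + everyLayoutHasMultiRegionNode(6, 5));
    System.out.println("3 regions on 5 DataNodes: " + everyLayoutHasMultiRegionNode(3, 5));
  }
}
```

With 6 regions on 5 DataNodes the check holds for all 5^6 assignments, whereas with 3 regions (roughly the AUTO case described above) a layout with one region per DataNode exists, which is the situation that made the precondition fail.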



Description
Two integration tests added by #18046 (Support multiple regions in MIGRATE REGION and multiple DataNodes in REMOVE DATANODE) are flaky on the every-PR Simple/ClusterIT pipeline. Both flakes are test-side only; no production code is touched.

1. IoTDBRemoveDataNodeNormalIT.success1C5DRemoveTwoDataNodesUseSQL

The test removes two DataNodes at once and selects them via IoTDBRemoveDataNodeUtils.selectRemoveDataNodesWithoutRegionConflict, which must avoid picking two DataNodes that host replicas of the same consensus group (otherwise the ConfigNode rejects the removal: "Only one replica of the same consensus group is allowed to be migrated at the same time."). #18075 introduced this selector to replace the previous random pick.

However, the selector was a single-pass greedy scan over a shuffled DataNode list with no backtracking: it unconditionally committed to the first shuffled DataNode and never reconsidered it. When removing 2 DataNodes it therefore threw IllegalStateException whenever the first shuffled DataNode shared a consensus group with every other DataNode (a "hub" in the region-sharing graph), even though a valid conflict-free pair still existed among the other DataNodes. In the 1C5D layout (5 data-region groups at factor 2, balanced to ~2 replicas per DataNode, plus one schema-region group at factor 3) such a hub commonly exists, so a fraction of runs aborted with IllegalStateException before REMOVE DATANODE was ever submitted.

Fix: replace the greedy scan with an exhaustive depth-first search with backtracking (searchConflictFreeDataNodes) that visits each combination at most once and fails only when no conflict-free set of the requested size actually exists. The initial Collections.shuffle is kept, so when several valid selections exist a random one is still returned. The search space is tiny (a handful of DataNodes), so the cost is negligible, and this also future-proofs the helper for removeDataNodeNum > 2. Also corrected a stale log line that reported "timeout in 2 minutes" while awaitUntilSuccess actually waits 5 minutes.

2. IoTDBMigrateMultiRegionForIoTV1IT.multiRegionMigrateTest

The test runs on 1C5D with replication factor 1 and, as a precondition, requires a source DataNode hosting at least two regions (selectDataNodeHostingMultipleRegions); otherwise it fails with RuntimeException: Cannot find a DataNode hosting at least two regions. Under the default AUTO data-region policy a single insert created only ~2-3 factor-1 regions; when the balanced allocator spread them across the 5 DataNodes with at most one region each, the precondition was not met and the test errored intermittently (observed on this PR's CI, unrelated to the change itself).

Fix: force the CUSTOM data-region policy with 6 data-region groups per database. With 6 regions over 5 DataNodes at replication factor 1, pigeonhole guarantees that at least one DataNode hosts ≥ 2 regions. Since each factor-1 region lives on a single DataNode, another DataNode is always available as a conflict-free migration destination.
Key changed/added classes (or packages if there are too many classes) in this PR

IoTDBRemoveDataNodeUtils (integration-test): exhaustive backtracking conflict-free DataNode selection replacing the non-backtracking greedy.
IoTDBRemoveDataNodeNormalIT (integration-test): corrected stale timeout log message.
IoTDBMigrateMultiRegionForIoTV1IT (integration-test): deterministic region layout so the "source DataNode with ≥2 regions" precondition always holds.