Conversation

@rmdmattingly (Contributor) commented Jan 9, 2025

See my design doc here

To sum it up, the current load balancer isn't great for what it's supposed to do now, and it won't support all of the things that we'd like it to do in a perfect world.

Right now: primary replica balancing squashes all other considerations. The default weight for one of the several cost functions that factor into primary replica balancing is 100,000. Meanwhile the default read request cost is 5. The result is that the load balancer, OOTB, basically doesn't care about balancing actual load. To solve this, you can either set primary replica balancing costs to zero, which is fine if you don't use read replicas, or — if you do use read replicas — maybe you can produce a magic incantation of configurations that work just right, until your needs change.

In the future: we'd like a lot more out of the balancer. System table isolation, meta table isolation, colocation of regions based on start key prefix similarity (this is a very rough idea atm, and not touched in the scope of this PR). And to support all of these features with either cost functions or RS groups would be a real burden. I think what I'm proposing here will be a much, much easier path for HBase operators.

New features

This PR introduces some new features:

  1. Balancer conditional based replica distribution
  2. System table isolation (put backups, quotas, etc. on their own RegionServer, with all system tables sharing one server)
  3. Meta table isolation (put meta on its own RegionServer)

These can be controlled via the following properties (an example of setting them follows the list):

  • hbase.master.balancer.stochastic.conditionals.distributeReplicas: set this to true to enable conditional based replica distribution
  • hbase.master.balancer.stochastic.conditionals.isolateSystemTables: set this to true to enable system table isolation
  • hbase.master.balancer.stochastic.conditionals.isolateMetaTable: set this to true to enable meta table isolation
  • hbase.master.balancer.stochastic.additionalConditionals: much like cost functions, you can define your own RegionPlanConditional implementation and install them here
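
For illustration, here's a minimal sketch of setting these programmatically, e.g. in a test or wherever the master Configuration is built. The property names come from this PR; MyCustomConditional and its package are hypothetical placeholders, not something shipped here.

// Sketch only: enable the built-in conditionals and register a hypothetical custom one.
// Assumes org.apache.hadoop.conf.Configuration and org.apache.hadoop.hbase.HBaseConfiguration.
Configuration conf = HBaseConfiguration.create();
conf.setBoolean("hbase.master.balancer.stochastic.conditionals.distributeReplicas", true);
conf.setBoolean("hbase.master.balancer.stochastic.conditionals.isolateSystemTables", true);
conf.setBoolean("hbase.master.balancer.stochastic.conditionals.isolateMetaTable", true);
// Fully qualified class name of a RegionPlanConditional implementation (hypothetical example)
conf.set("hbase.master.balancer.stochastic.additionalConditionals",
  "com.example.MyCustomConditional");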

Testing

I wrote a lot of unit tests to validate the functionality here — both lightweight and some minicluster tests. Even in the most extreme cases (like, system table isolation + meta table isolation enabled on a 3 node cluster, or the number of read replicas == the number of servers) the balancer does what we'd expect.

Replica Distribution Improvements

Perfect primary and secondary replica distribution

Not only does this PR offer an alternative means of distributing replicas, but it's actually a massive improvement on the existing approach.

See the Replica Distribution testing section of my design doc. Cost functions never successfully balance 3 replicas across 3 servers OOTB — but balancer conditionals do so expeditiously.

To summarize the testing, we have replicated_table, a table with 3 region replicas. The 3 regions of a given replica share a color, and there are also 3 RegionServers in the cluster. We expect the balancer to evenly distribute one replica per server across the 3 RegionServers...

Cost functions don't work:
[screenshots: cost-function balancer, iterations 1 and 2]

….omitting the meaningless snapshots between 4 and 27…

[screenshot: cost-function balancer, iteration 28]

At this point, I just exited the test because it was clear that our existing balancer would never achieve true replica distribution.

But balancer conditionals do work:
[screenshots: conditional-based balancer, iterations 1 through 5]

New Features: Table Isolation Working as Designed

See below where we ran a new unit test, TestLargerClusterBalancerConditionals, and tracked the locations of regions for 3 tables across 18 RegionServers:

  1. 180 “product” table regions
  2. 1 meta table region
  3. 1 quotas table region

All regions began on a single RegionServer, and within 4 balancer iterations we had a well balanced cluster, and isolation of key system tables. It achieved this in about 2min on my local machine, where most of that time was spent bootstrapping the mini cluster.

[screenshots: region locations across the 18 RegionServers at each of the 4 balancer iterations]

Table isolation performance testing

Likewise, we created large tests for system table isolation, meta table isolation, multi table isolation, and multi table isolation + replica distribution. These tests reliably find exactly what we're looking for, and do so quickly on my local machine for 100 servers and 10k+ regions: all tests pass within a few minutes.

cc @ndimiduk @charlesconnell @ksravista @aalhour

ASSIGN_REGION,
MOVE_REGION,
SWAP_REGIONS,
MOVE_BATCH,
Contributor Author

Conditional candidate generators might need to make a series of moves, depending on how complex the path to success is. This new action type lets a generator string several moves together into one action.
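
As a rough sketch of the idea (the MoveBatchAction constructor shape and the int-index variables below are assumptions for illustration, not confirmed signatures from this PR):

// Illustrative only: a conditional generator assembles several single-region moves and
// returns them as one MOVE_BATCH action so the balancer evaluates and applies them together.
// The int variables are indices into the BalancerClusterState arrays.
List<MoveRegionAction> moves = new ArrayList<>();
moves.add(new MoveRegionAction(systemRegion, crowdedServer, isolationServer));
moves.add(new MoveRegionAction(displacedRegion, isolationServer, crowdedServer));
BalanceAction batch = new MoveBatchAction(moves); // getType() == MOVE_BATCH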

Comment on lines +132 to +139
private Supplier<List<Integer>> shuffledServerIndicesSupplier =
  Suppliers.memoizeWithExpiration(() -> {
    Collection<Integer> serverIndices = serversToIndex.values();
    List<Integer> shuffledServerIndices = new ArrayList<>(serverIndices);
    Collections.shuffle(shuffledServerIndices);
    return shuffledServerIndices;
  }, 5, TimeUnit.SECONDS);
Contributor Author

I often found that it was nice to let opinionated candidate generators iterate through servers in different orders on subsequent runs, so this saves some cycles across generators that may want that (replica distribution and table isolation both rely on this atm)
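
As a sketch of the consuming side, assuming a simple accessor over the memoized supplier (the accessor and helper names here are hypothetical):

// Walk servers in a shuffled-but-cached order: calls within the 5 second window reuse the
// same ordering, while calls after expiration see a fresh shuffle.
for (int serverIndex : cluster.getShuffledServerIndices()) { // hypothetical accessor
  if (violatesConditionalOn(cluster, serverIndex)) {         // hypothetical helper
    return planMoveOffOf(cluster, serverIndex);              // hypothetical helper
  }
}
return BalanceAction.NULL_ACTION; // nothing for this generator to do right now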

}

public void doAction(BalanceAction action) {
public List<RegionPlan> convertActionToPlans(BalanceAction action) {
Contributor Author

It's easier to reason about what a RegionPlan is, vs what a BalanceAction is. So the new conditionals are intentionally RegionPlanConditionals that expect to work with only RegionPlans as the unified representation for what any BalanceAction might want to do.

As a result, I needed a way to transform BalanceActions to RegionPlans without running the action against the BCS (BalancerClusterState).
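
For a single MOVE_REGION, the translation is roughly the following (a sketch of how I'd expect the indices to resolve against the cluster's regions/servers arrays, not necessarily the literal implementation):

// Sketch: turn a MoveRegionAction into the equivalent RegionPlan without applying the
// action to the BalancerClusterState.
List<RegionPlan> toPlans(MoveRegionAction move, BalancerClusterState cluster) {
  RegionInfo region = cluster.regions[move.getRegion()];
  ServerName source = cluster.servers[move.getFromServer()];
  ServerName dest = cluster.servers[move.getToServer()];
  return Collections.singletonList(new RegionPlan(region, source, dest));
}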

Comment on lines +973 to +1019
int[] removeRegions(int[] regions, Set<Integer> regionIndicesToRemove) {
  // Calculate the size of the new regions array
  int newSize = regions.length - regionIndicesToRemove.size();
  if (newSize < 0) {
    throw new IllegalStateException(
      "Region indices mismatch: more regions to remove than in the regions array");
  }

  int[] newRegions = new int[newSize];
  int newIndex = 0;

  // Copy only the regions not in the removal set
  for (int region : regions) {
    if (!regionIndicesToRemove.contains(region)) {
      newRegions[newIndex++] = region;
    }
  }

  // If the newIndex is smaller than newSize, some regions were missing from the input array
  if (newIndex != newSize) {
    throw new IllegalStateException("Region indices mismatch: some regions in the removal "
      + "set were not found in the regions array");
  }

  return newRegions;
}

int[] addRegions(int[] regions, Set<Integer> regionIndicesToAdd) {
  int[] newRegions = new int[regions.length + regionIndicesToAdd.size()];

  // Copy the existing regions to the new array
  System.arraycopy(regions, 0, newRegions, 0, regions.length);

  // Add the new regions at the end of the array
  int newIndex = regions.length;
  for (int regionIndex : regionIndicesToAdd) {
    newRegions[newIndex++] = regionIndex;
  }

  return newRegions;
}
Contributor Author

These methods make it easier to support the MoveBatch action that might remove or add several regions to a server
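
For example, applying a batch that shifts a set of regions from one server to another might look like this (regionsPerServer is the existing per-server int[] bookkeeping; the surrounding wiring is illustrative):

// Illustrative: rebuild both servers' region arrays when a batch of regions moves.
Set<Integer> movingRegions = Set.of(12, 47, 103); // example region indices
regionsPerServer[sourceServer] = removeRegions(regionsPerServer[sourceServer], movingRegions);
regionsPerServer[destServer] = addRegions(regionsPerServer[destServer], movingRegions);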

* from finding a solution.
*/
@InterfaceAudience.Private
final class BalancerConditionals implements Configurable {
Contributor Author

This class is statically available via the package-protected BalancerConditionals.INSTANCE, and it exists to be a unified interface into all conditionals that are configured for the current balancer run.

Comment on lines +456 to +477
Pair<CandidateGenerator, BalanceAction> nextAction(BalancerClusterState cluster) {
  CandidateGenerator generator = getRandomGenerator(cluster);
  return Pair.newPair(generator, generator.generate(cluster));
Contributor Author

Returning a pair here allows for better logging of generator activity, which can be nice when debugging the balancer's decision making

Comment on lines +467 to +497
// Prefer conditional generators if they have moves to make
if (balancerConditionals.isConditionalBalancingEnabled()) {
  for (RegionPlanConditional conditional : balancerConditionals.getConditionals()) {
    List<RegionPlanConditionalCandidateGenerator> generators =
      conditional.getCandidateGenerators();
    for (RegionPlanConditionalCandidateGenerator generator : generators) {
      if (generator.getWeight(cluster) > 0) {
        return generator;
      }
    }
  }
}
Contributor Author

We will always prefer conditional generator moves if they exist

* selecting a candidate generator is proportional to the share of cost of all cost functions among
* all cost functions that benefit from it.
*/
protected CandidateGenerator getRandomGenerator() {
Contributor Author

I had to rewrite this method, largely because of the refactor to a Map of generators and their weights, rather than lists. Anyway, this also fixes a bug in this method: it modifies the weightsOfGenerators array in place. That is:

  1. It first computes a running sum in weightsOfGenerators[i].
  2. It then divides those values by the total to get a cumulative distribution.
  3. It uses this (mutated) array to select a generator.
  4. But the original weights are overwritten and never restored.

As a result, this method can produce an unfair distribution of generator activity. In practice it may have been okay because we called updateCostsAndWeightsWithAction frequently enough that weights were implicitly reset, but I suspect we often ran into NULL_ACTIONs that manifested as no-ops, and that produced skewed generator weights. Either way, this was fragile and buggy, and it should be better and more fair now.
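
To make the fairness point concrete: computing the cumulative distribution into a local sum, rather than into the stored weights array, avoids the skew entirely. A simplified sketch of a fair weighted pick (not the literal patch):

// Pick a generator with probability proportional to its weight, without mutating the
// stored weights. Assumes the map has at least one entry (e.g. a LinkedHashMap of
// generators to weights).
CandidateGenerator pick(Map<CandidateGenerator, Double> weights, Random rand) {
  double total = 0;
  for (double weight : weights.values()) {
    total += weight;
  }
  if (total <= 0) {
    // No weighted signal: fall back to a uniform random choice
    List<CandidateGenerator> all = new ArrayList<>(weights.keySet());
    return all.get(rand.nextInt(all.size()));
  }
  double roll = rand.nextDouble() * total;
  double cumulative = 0;
  CandidateGenerator last = null;
  for (Map.Entry<CandidateGenerator, Double> entry : weights.entrySet()) {
    last = entry.getKey();
    cumulative += entry.getValue();
    if (roll < cumulative) {
      return last;
    }
  }
  return last; // floating point edge case at the final boundary
}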

* number of replicas. For example, this is useful when isolating system tables.
*/
@InterfaceAudience.Private
public final class TableColocationCandidateGenerator
Contributor Author

System table isolation is one thing, but we also want colocation of any tables that we're isolating. For example, if we have 10 system table regions that we've isolated, and they're on 10 different servers, then it feels wasteful to have 10 servers dedicated to hosting only 1 region each of said system table. This generator will ensure that we smartly find our way to a plan that colocates the given system table on one server (or n servers, if the table has >1 replica)

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class CandidateGeneratorTestUtil {
Contributor Author

The rest of the changes here, probably at least half of this PR, are tests. I've set up minicluster tests to validate that clusters start up and balance as intended with conditionals enabled, and I've set up simpler tests that validate the StochasticLoadBalancer's ability to get out of very hairy, high-scale imbalances via conditionals and their generators. I've set up tests that validate a wide variety of conditional usage, and that all conditionals implemented in this PR play nicely with one another.

@rmdmattingly (Contributor Author) commented Jan 9, 2025

Hey @Apache9, I see your name in a lot of the StochasticLoadBalancer git blame so I wanted to reach out in case you're interested here. I know this is a pretty big changeset, so I'd be happy to break it up if you'd like, but the individual components are pretty nicely divided between:

  1. Some changes to the StochasticLoadBalancer, mostly minor changes or gated behind conditional usage
  2. The replica distribution conditional
  3. The system table isolation conditional
  4. The meta table isolation conditional
  5. System table isolation candidate generators
  6. Table colocation candidate generators
  7. The replica distribution candidate generator
  8. And finally, a robust test suite

I think this will be a meaningful improvement for hbase's load balancer. We envision this being useful in:

  1. Better distributing read replicas, while also keeping the balancer simple to configure based on storefile size. I dig into it above, but the existing stuff doesn't isolate secondaries well and it's difficult to balance replica multipliers well with everything else. In comparison to the replica cost functions, I think this conditionals stuff is quite simple
  2. System table isolation without needing to manage wastefully over-provisioned RegionServer groups
  3. Likewise, simple/cheap meta table isolation
  4. We also have a large project in the early stages which would involve colocating regions based on a prefix in their start/stop keys. This would let us flexibly control logical groupings of our customers' data: for example, we could pull a lever to increase isolation between customers in an otherwise randomly distributed table, so that a single RegionServer outage would only affect a smaller proportion of all customers. This idea isn't in this PR, but I would love to add it in a subsequent one
  5. I mention it in the self review here, and it isn't in this PR either, but this work would enable pluggable candidate generators, which I think could be a nice way to offer balancer customization with minimal complexity for users

*/
@Test
public void testBalanceCluster() throws Exception {
  conf.setLong("hbase.master.balancer.stochastic.maxRunningTime", 10_000);
Contributor Author

I made a lot of changes like this, wherever I could, to speed up our balancer test suite. We made a lot of assumptions that the balancer should run for 30s here, 60s there, and while round numbers and minimal configuration are nice when writing tests, that runtime really adds up when running the whole suite locally.

@rmdmattingly force-pushed the HBASE-28513 branch 2 times, most recently from 3246c57 to 3ce828e on January 10, 2025 18:14
@rmdmattingly requested a review from Apache9 on January 10, 2025 18:24
Comment on lines 352 to 377
  return areSomeRegionReplicasColocatedOnHost(c) || areSomeRegionReplicasColocatedOnRack(c);
}

private boolean areSomeRegionReplicasColocatedOnHost(BalancerClusterState c) {
  if (c.numHosts >= c.maxReplicas) {
    regionReplicaHostCostFunction.prepare(c);
    double hostCost = Math.abs(regionReplicaHostCostFunction.cost());
    boolean colocatedAtHost = hostCost > CostFunction.COST_EPSILON;
    if (colocatedAtHost) {
      return true;
    }
    LOG.trace("No host colocation detected with host cost={}", hostCost);
  }
  return false;
}

private boolean areSomeRegionReplicasColocatedOnRack(BalancerClusterState c) {
Contributor Author

In HBASE-26327 we removed consideration for distribution of replicas across racks. This was, in my opinion, a mistake: if we want to be flexible for environments that have too few racks, then we should just do so by skipping this check when the rack count is < the max replica count.
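
Concretely, the rack-side check can then mirror the host check above, just gated on having enough racks to matter. A sketch of that shape (assuming a regionReplicaRackCostFunction field analogous to the host one; not necessarily the literal patch):

private boolean areSomeRegionReplicasColocatedOnRack(BalancerClusterState c) {
  // Only meaningful when there are at least as many racks as replicas; otherwise some
  // rack-level colocation is unavoidable and we skip the check rather than fight it.
  if (c.numRacks >= c.maxReplicas) {
    regionReplicaRackCostFunction.prepare(c);
    double rackCost = Math.abs(regionReplicaRackCostFunction.cost());
    boolean colocatedAtRack = rackCost > CostFunction.COST_EPSILON;
    if (colocatedAtRack) {
      return true;
    }
    LOG.trace("No rack colocation detected with rack cost={}", rackCost);
  }
  return false;
}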


@rmdmattingly (Contributor Author) commented Jan 11, 2025

Hmm, some of these test failures seem legitimate, but I can't repro on my machine yet. Looking into it

edit: figured it out, see below


@rmdmattingly force-pushed the HBASE-28513 branch 2 times, most recently from 2270f89 to 1aa4aa1 on January 12, 2025 13:17
Comment on lines -28 to +31
public static final double COST_EPSILON = 0.0001;
public static double getCostEpsilon(double cost) {
  return Math.ulp(cost);
}
@rmdmattingly (Contributor Author) Jan 12, 2025

The test failures were difficult to repro because they were caused by the imprecise cost epsilon in our cost functions — the inaccuracy caused our areRegionReplicasColocated method to sometimes return a false negative, so if your balancer run ended at the wrong time then it may not start up again to eliminate your final bad replica placements. This was previously covered up by just having a bizarrely long test run, and by fixing this bug we've moved the expected runtime of TestStochasticLoadBalancerRegionReplicaHighReplication from 2min to 5sec

I've updated the cost epsilon to be dynamically calculated using Math#ulp, which should account for floating point calculation error in a much more precise way.
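
To make the precision issue concrete, here's a standalone illustration (not code from this PR) of the comparison a fixed epsilon gets wrong:

// A genuine, nonzero replica colocation cost that happens to be smaller than the old
// fixed threshold. The fixed epsilon swallows it (false negative); Math.ulp scales the
// tolerance to the magnitude of the value being compared, so the cost is flagged.
double cost = 3.0E-5;
boolean oldCheck = cost > 0.0001;         // false: colocation goes undetected
boolean newCheck = cost > Math.ulp(cost); // true: colocation is detected
System.out.println(oldCheck + " " + newCheck); // prints "false true"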

This PR definitely has a ton of distinct changes in it at this point, so I would be happy to stack these changes in a feature branch. Spitballing the PRs that I would have if I broke this up:

  1. Fix cost epsilon (HBASE-29070: Balancer cost function epsilon is imprecise, #6597)
  2. Get rid of fragile candidate generator enum ordinal setup, fix generator picking fairness (HBASE-29071: StochasticLoadBalancer candidate generators should use a Map, rather than ordinal based indexing, #6598)
  3. Fix awareness of rack colocation in areReplicasColocated
  4. Fix null multiplier possibility
  5. Add region plan conditional framework, plus 1 conditional+generator+tests (probably replica distribution)
  6. Add meta table isolation conditional+generator+tests
  7. Add system table isolation conditional+generator+tests

Contributor Author

I've created Jiras for each of these steps, and will split this PR up into more reviewable chunks

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 33s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ master Compile Tests _
+0 🆗 mvndep 0m 21s Maven dependency ordering for branch
+1 💚 mvninstall 2m 50s master passed
+1 💚 compile 3m 22s master passed
+1 💚 checkstyle 0m 43s master passed
+1 💚 spotbugs 1m 49s master passed
+1 💚 spotless 0m 42s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 11s Maven dependency ordering for patch
+1 💚 mvninstall 2m 49s the patch passed
+1 💚 compile 3m 25s the patch passed
+1 💚 javac 0m 23s hbase-balancer generated 0 new + 66 unchanged - 4 fixed = 66 total (was 70)
+1 💚 javac 3m 2s hbase-server in the patch passed.
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 43s the patch passed
+1 💚 spotbugs 2m 5s the patch passed
+1 💚 hadoopcheck 10m 44s Patch does not cause any errors with Hadoop 3.3.6 3.4.0.
+1 💚 spotless 0m 41s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 17s The patch does not generate ASF License warnings.
38m 21s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6593/6/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #6593
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless
uname Linux d560c94e68da 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / fa5d701
Default Java Eclipse Adoptium-17.0.11+9
Max. process+thread count 86 (vs. ulimit of 30000)
modules C: hbase-balancer hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6593/6/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 28s Docker mode activated.
-0 ⚠️ yetus 0m 2s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+0 🆗 mvndep 0m 21s Maven dependency ordering for branch
+1 💚 mvninstall 2m 45s master passed
+1 💚 compile 1m 4s master passed
+1 💚 javadoc 0m 35s master passed
+1 💚 shadedjars 5m 24s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 12s Maven dependency ordering for patch
+1 💚 mvninstall 2m 44s the patch passed
+1 💚 compile 1m 4s the patch passed
+1 💚 javac 1m 4s the patch passed
+1 💚 javadoc 0m 34s the patch passed
+1 💚 shadedjars 5m 24s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 7m 44s hbase-balancer in the patch passed.
+1 💚 unit 175m 3s hbase-server in the patch passed.
207m 55s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6593/6/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR #6593
Optional Tests javac javadoc unit compile shadedjars
uname Linux 9a1c7161ffe7 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / fa5d701
Default Java Eclipse Adoptium-17.0.11+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6593/6/testReport/
Max. process+thread count 5278 (vs. ulimit of 30000)
modules C: hbase-balancer hbase-server U: .
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6593/6/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@rmdmattingly
Contributor Author

Closing this because I'm splitting these changes off into several distinct PRs
