HBASE-28513 The StochasticLoadBalancer should support discrete evaluations #6593
Conversation
  ASSIGN_REGION,
  MOVE_REGION,
  SWAP_REGIONS,
  MOVE_BATCH,
Conditional candidate generators might need to make a series of moves, depending on how complex the path to success is. This new action type lets a generator string a series of moves together into a single action.
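As a rough illustration (the names below are hypothetical, not the PR's actual classes), a batch action is just an ordered list of single moves that gets applied as one unit:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a batch action: an ordered series of single moves.
final class Move {
  final int region;
  final int fromServer;
  final int toServer;

  Move(int region, int fromServer, int toServer) {
    this.region = region;
    this.fromServer = fromServer;
    this.toServer = toServer;
  }
}

final class MoveBatch {
  // Applying the batch means applying each move, in order.
  final List<Move> moves = new ArrayList<>();

  void add(int region, int from, int to) {
    moves.add(new Move(region, from, to));
  }
}
```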
  private Supplier<List<Integer>> shuffledServerIndicesSupplier =
    Suppliers.memoizeWithExpiration(() -> {
      Collection<Integer> serverIndices = serversToIndex.values();
      List<Integer> shuffledServerIndices = new ArrayList<>(serverIndices);
      Collections.shuffle(shuffledServerIndices);
      return shuffledServerIndices;
    }, 5, TimeUnit.SECONDS);
I often found it was nice to let opinionated candidate generators iterate through servers in a different order on subsequent runs, so this saves some cycles across the generators that want that (replica distribution and table isolation both rely on it at the moment).
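A usage sketch, assuming a generator can reach this supplier through some accessor (the accessor name here is hypothetical): within the five-second expiry window every caller sees the same shuffled order, and the shuffle itself only happens once.

```java
// Hypothetical call site inside a candidate generator.
List<Integer> serverOrder = cluster.getShuffledServerIndices(); // assumed accessor for the supplier above
for (int serverIndex : serverOrder) {
  // inspect servers in this run's randomized order and stop at the first viable move
}
```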
  }

  public void doAction(BalanceAction action) {
  public List<RegionPlan> convertActionToPlans(BalanceAction action) {
It's easier to reason about what a RegionPlan is, vs what a BalanceAction is. So the new conditionals are intentionally RegionPlanConditionals that expect to work with only RegionPlans as the unified representation for what any BalanceAction might want to do.
As a result, I needed a way to transform BalanceActions into RegionPlans without running the action against the BalancerClusterState.
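Roughly the kind of translation involved, using the existing RegionPlan class; the method shapes here are a sketch, not the PR's exact signatures:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.RegionInfo;
import org.apache.hadoop.hbase.master.RegionPlan;

final class ActionToPlanSketch {
  // A single move becomes a single RegionPlan: the region, its current server, and its target.
  static List<RegionPlan> forMove(RegionInfo region, ServerName from, ServerName to) {
    return Collections.singletonList(new RegionPlan(region, from, to));
  }

  // A batch of moves becomes one RegionPlan per move, in order.
  static List<RegionPlan> forBatch(List<RegionInfo> regions, List<ServerName> sources,
    List<ServerName> targets) {
    List<RegionPlan> plans = new ArrayList<>(regions.size());
    for (int i = 0; i < regions.size(); i++) {
      plans.add(new RegionPlan(regions.get(i), sources.get(i), targets.get(i)));
    }
    return plans;
  }
}
```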
  int[] removeRegions(int[] regions, Set<Integer> regionIndicesToRemove) {
    // Calculate the size of the new regions array
    int newSize = regions.length - regionIndicesToRemove.size();
    if (newSize < 0) {
      throw new IllegalStateException(
        "Region indices mismatch: more regions to remove than in the regions array");
    }

    int[] newRegions = new int[newSize];
    int newIndex = 0;

    // Copy only the regions not in the removal set
    for (int region : regions) {
      if (!regionIndicesToRemove.contains(region)) {
        newRegions[newIndex++] = region;
      }
    }

    // If the newIndex is smaller than newSize, some regions were missing from the input array
    if (newIndex != newSize) {
      throw new IllegalStateException("Region indices mismatch: some regions in the removal "
        + "set were not found in the regions array");
    }

    return newRegions;
  }

  int[] addRegions(int[] regions, Set<Integer> regionIndicesToAdd) {
    int[] newRegions = new int[regions.length + regionIndicesToAdd.size()];

    // Copy the existing regions to the new array
    System.arraycopy(regions, 0, newRegions, 0, regions.length);

    // Add the new regions at the end of the array
    int newIndex = regions.length;
    for (int regionIndex : regionIndicesToAdd) {
      newRegions[newIndex++] = regionIndex;
    }

    return newRegions;
  }
These methods make it easier to support the MoveBatch action, which might remove or add several regions on a given server.
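For example, a batch move from one server to another would pair the two helpers roughly like this (the array contents are illustrative, and it's assumed the helpers above are in scope):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical usage of the helpers above: move regions 3 and 7 from serverA to serverB.
int[] serverA = { 1, 3, 5, 7 };
int[] serverB = { 2, 4 };
Set<Integer> moving = new HashSet<>(Arrays.asList(3, 7));

serverA = removeRegions(serverA, moving); // now { 1, 5 }
serverB = addRegions(serverB, moving);    // now { 2, 4 } plus 3 and 7, in the set's iteration order
```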
 * from finding a solution.
 */
@InterfaceAudience.Private
final class BalancerConditionals implements Configurable {
This class is statically available via the package-private BalancerConditionals.INSTANCE, and it exists as a unified interface to all of the conditionals configured for the current balancer run.
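A sketch of that access pattern; aside from isConditionalBalancingEnabled (visible later in this diff) and the Configurable methods, the shape of the call site is an assumption:

```java
// Hypothetical call site: balancer components consult the shared instance.
BalancerConditionals conditionals = BalancerConditionals.INSTANCE;
conditionals.setConf(conf); // Configurable: load whichever conditionals are enabled in this run
if (conditionals.isConditionalBalancingEnabled()) {
  // ask each configured RegionPlanConditional whether a candidate RegionPlan is acceptable
}
```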
  Pair<CandidateGenerator, BalanceAction> nextAction(BalancerClusterState cluster) {
    CandidateGenerator generator = getRandomGenerator(cluster);
    return Pair.newPair(generator, generator.generate(cluster));
Returning a pair here allows for better logging of generator activity, which can be nice when debugging the balancer's decision making
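For example, the call site can now log something like this (a sketch, not the PR's exact log line):

```java
// Hypothetical logging at the call site, using the Pair returned above.
Pair<CandidateGenerator, BalanceAction> next = nextAction(cluster);
LOG.debug("Generator {} proposed {}", next.getFirst().getClass().getSimpleName(), next.getSecond());
```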
    // Prefer conditional generators if they have moves to make
    if (balancerConditionals.isConditionalBalancingEnabled()) {
      for (RegionPlanConditional conditional : balancerConditionals.getConditionals()) {
        List<RegionPlanConditionalCandidateGenerator> generators =
          conditional.getCandidateGenerators();
        for (RegionPlanConditionalCandidateGenerator generator : generators) {
          if (generator.getWeight(cluster) > 0) {
            return generator;
          }
        }
      }
    }
We will always prefer conditional generator moves if they exist
 * selecting a candidate generator is proportional to the share of cost of all cost functions among
 * all cost functions that benefit from it.
 */
protected CandidateGenerator getRandomGenerator() {
I had to rewrite this method, largely because of the refactor from lists to a Map of generators and their weights. The rewrite also fixes a bug: the old method modified the weightsOfGenerators array in place. That is:
- It first computes a running sum in weightsOfGenerators[i].
- It then divides those values by the total to get a cumulative distribution.
- It uses this (mutated) array to select a generator.
- But the original weights are overwritten and never restored.
As a result, this method can produce an unfair distribution of generator activity. Maybe in practice it was okay because we called updateCostsAndWeightsWithAction frequently enough, which implicitly resets the weights, but I think in reality we frequently hit NULL_ACTIONs that manifested as no-ops, and that produced skewed generator weights. Either way, this was fragile and buggy, and it should be fairer now.
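A minimal sketch of the fairer selection, assuming a Map of generators to weights (the exact types in the PR may differ); the key point is that the cumulative sums live in a local array, so the configured weights are never mutated:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

final class WeightedPickSketch {
  // Pick a key with probability proportional to its weight, without mutating the weights map.
  static <T> T pick(Map<T, Double> weights) {
    List<T> keys = new ArrayList<>(weights.keySet());
    double[] cumulative = new double[keys.size()];
    double total = 0;
    for (int i = 0; i < keys.size(); i++) {
      total += weights.get(keys.get(i));
      cumulative[i] = total;
    }
    double roll = ThreadLocalRandom.current().nextDouble() * total;
    for (int i = 0; i < cumulative.length; i++) {
      if (roll < cumulative[i]) {
        return keys.get(i);
      }
    }
    return keys.get(keys.size() - 1); // guard against floating point edge cases
  }
}
```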
 * number of replicas. For example, this is useful when isolating system tables.
 */
@InterfaceAudience.Private
public final class TableColocationCandidateGenerator
System table isolation is one thing, but we also want colocation of any tables that we're isolating. For example, if we have 10 system table regions that we've isolated, and they're on 10 different servers, then it feels wasteful to dedicate 10 servers to hosting only 1 region each of said system table. This generator ensures that we smartly find our way to a plan that colocates the given system table on one server (or n servers, if the table has more than 1 replica).
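Conceptually, the target selection looks something like this (a standalone sketch with hypothetical names, not the PR's implementation): pick the server that already hosts the most regions of the isolated table, and treat every region elsewhere as a candidate move to it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ColocationTargetSketch {
  // Given each region's current server index, pick the server hosting the most of
  // them as the colocation target; regions on other servers become candidate moves.
  static int pickTargetServer(List<Integer> serverPerRegion) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (int server : serverPerRegion) {
      counts.merge(server, 1, Integer::sum);
    }
    return counts.entrySet().stream()
      .max(Map.Entry.comparingByValue())
      .map(Map.Entry::getKey)
      .orElseThrow(IllegalStateException::new);
  }
}
```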
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class CandidateGeneratorTestUtil {
The rest of the changes here, probably at least half of this PR, are tests. I've set up minicluster tests to validate that clusters start up and balance as intended with conditionals enabled, and I've set up simpler tests that validate the StochasticLoadBalancer's ability to get out of very hairy, high-scale imbalances via conditionals and their generators. I've also set up tests that validate a wide variety of conditional usage, and that all conditionals implemented in this PR play nicely with one another.
Hey @Apache9, I see your name in a lot of the StochasticLoadBalancer git blame so I wanted to reach out in case you're interested here. I know this is a pretty big changeset, so I'd be happy to break it up if you'd like, but the individual components are pretty nicely divided between:
I think this will be a meaningful improvement for hbase's load balancer. We envision this being useful in:
   */
  @Test
  public void testBalanceCluster() throws Exception {
    conf.setLong("hbase.master.balancer.stochastic.maxRunningTime", 10_000);
I made a lot of changes like this, wherever I could, to speed up our balancer test suite. We made a lot of assumptions that the balancer should run for 30s here, 60s there, and while round numbers and minimal configuration are nice when writing tests, that runtime really adds up when running the whole suite locally.
    return areSomeRegionReplicasColocatedOnHost(c) || areSomeRegionReplicasColocatedOnRack(c);
  }

  private boolean areSomeRegionReplicasColocatedOnHost(BalancerClusterState c) {
    if (c.numHosts >= c.maxReplicas) {
      regionReplicaHostCostFunction.prepare(c);
      double hostCost = Math.abs(regionReplicaHostCostFunction.cost());
      boolean colocatedAtHost = hostCost > CostFunction.COST_EPSILON;
      if (colocatedAtHost) {
        return true;
      }
      LOG.trace("No host colocation detected with host cost={}", hostCost);
    }
    return false;
  }

  private boolean areSomeRegionReplicasColocatedOnRack(BalancerClusterState c) {
In HBASE-26327 we removed consideration for distribution of replicas across racks. This was, in my opinion, a mistake — if we want to be flexible for environments that have too few racks, then we should just do so by skipping this check when the rack count is < the max replica count
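The rack-side check then mirrors the host-side method shown above, just guarded by the rack count; a sketch of what that looks like (regionReplicaRackCostFunction is an assumption about the surrounding class, while numRacks and maxReplicas are existing BalancerClusterState fields):

```java
private boolean areSomeRegionReplicasColocatedOnRack(BalancerClusterState c) {
  // Only meaningful when there are enough racks to spread every replica out.
  if (c.numRacks >= c.maxReplicas) {
    regionReplicaRackCostFunction.prepare(c);
    double rackCost = Math.abs(regionReplicaRackCostFunction.cost());
    boolean colocatedAtRack = rackCost > CostFunction.COST_EPSILON;
    if (colocatedAtRack) {
      return true;
    }
    LOG.trace("No rack colocation detected with rack cost={}", rackCost);
  }
  return false;
}
```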
Hmm, some of these test failures seem legitimate, but I can't repro them on my machine yet. Looking into it.

Edit: figured it out, see below.
  public static final double COST_EPSILON = 0.0001;
  public static double getCostEpsilon(double cost) {
    return Math.ulp(cost);
  }
The test failures were difficult to repro because they were caused by the imprecise cost epsilon in our cost functions. The inaccuracy caused our areRegionReplicasColocated method to sometimes return a false negative, so if your balancer run ended at the wrong time, it might not start up again to eliminate your final bad replica placements. This was previously covered up by a bizarrely long test run; fixing the bug moves the expected runtime of TestStochasticLoadBalancerRegionReplicaHighReplication from 2min to 5sec.
I've updated the cost epsilon to be dynamically calculated using Math#ulp which should let us eliminate floating point calculation errors in a much more precise way.
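A small standalone illustration of why the ulp-based epsilon is tighter than a fixed 0.0001 (plain Java arithmetic, not the PR's exact call sites):

```java
public final class EpsilonDemo {
  public static void main(String[] args) {
    double cost = 0.00005;        // a real, nonzero colocation cost
    double fixedEpsilon = 0.0001; // the old COST_EPSILON

    // The fixed epsilon treats this genuine imbalance as "balanced"...
    System.out.println(cost > fixedEpsilon);   // false
    // ...while the ulp of the value only absorbs floating point noise.
    System.out.println(cost > Math.ulp(cost)); // true
    // An exactly-zero cost is still treated as zero.
    System.out.println(0.0 > Math.ulp(0.0));   // false
  }
}
```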
This PR definitely has a ton of distinct changes in it at this point, so I would be happy to stack these changes in a feature branch. Spitballing the PRs that I would have if I broke this up:
- Fix cost epsilon: HBASE-29070 Balancer cost function epsilon is imprecise (#6597)
- Get rid of the fragile candidate generator enum ordinal setup and fix generator picking fairness: HBASE-29071 StochasticLoadBalancer candidate generators should use a Map, rather than ordinal based indexing (#6598)
- Fix awareness of rack colocation in areReplicasColocated
- Fix null multiplier possibility
- Add region plan conditional framework, plus 1 conditional+generator+tests (probably replica distribution)
- Add meta table isolation conditional+generator+tests
- Add system table isolation conditional+generator+tests
I've created Jiras for each of these steps, and will split this PR up into more reviewable chunks
Closing this because I'm splitting these changes off into several distinct PRs.
See my design doc here
To sum it up, the current load balancer isn't great for what it's supposed to do now, and it won't support all of the things that we'd like it to do in a perfect world.
Right now: primary replica balancing squashes all other considerations. The default weight for one of the several cost functions that factor into primary replica balancing is 100,000, while the default read request cost is 5. The result is that, out of the box, the load balancer basically doesn't care about balancing actual load. To work around this, you can set the primary replica balancing costs to zero, which is fine if you don't use read replicas; if you do use read replicas, maybe you can produce a magic incantation of configurations that works just right, until your needs change.
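For reference, the zero-cost workaround looks roughly like this; to the best of my knowledge these are the multiplier keys the replica cost functions read, but treat the exact names and defaults as assumptions:

```java
// Neutralize primary replica balancing so that load-based cost functions
// (e.g. the read request cost, default 5) can actually influence the plan.
conf.setFloat("hbase.master.balancer.stochastic.regionReplicaHostCostKey", 0f); // default 100000
conf.setFloat("hbase.master.balancer.stochastic.regionReplicaRackCostKey", 0f); // default 10000
```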
In the future: we'd like a lot more out of the balancer. System table isolation, meta table isolation, colocation of regions based on start key prefix similarity (this is a very rough idea at the moment, and not touched in the scope of this PR). Supporting all of these features with either cost functions or RS groups would be a real burden, and I think what I'm proposing here will be a much, much easier path for HBase operators.
New features
This PR introduces some new features:
These can be controlled via:
Testing
I wrote a lot of unit tests to validate the functionality here, both lightweight tests and some minicluster tests. Even in the most extreme cases (like system table isolation plus meta table isolation enabled on a 3-node cluster, or the number of read replicas equal to the number of servers), the balancer does what we'd expect.
Replica Distribution Improvements
Perfect primary and secondary replica distribution
Not only does this PR offer an alternative means of distributing replicas, but it's actually a massive improvement on the existing approach.
See the Replica Distribution testing section of my design doc. Cost functions never successfully balance 3 replicas across 3 servers out of the box, but balancer conditionals do so expeditiously.
To summarize the testing: we have replicated_table, a table with 3 region replicas. The 3 regions of a given replica share a color, and there are also 3 RegionServers in the cluster. We expect the balancer to evenly distribute one replica per server across the 3 RegionServers.

Cost functions don't work:
(replica distribution snapshots omitted)
…omitting the meaningless snapshots between 4 and 27…
At this point, I just exited the test because it was clear that our existing balancer would never achieve true replica distribution.
But balancer conditionals do work:
(replica distribution snapshots omitted)
New Features: Table Isolation Working as Designed
See below where we ran a new unit test, TestLargerClusterBalancerConditionals, and tracked the locations of regions for 3 tables across 18 RegionServers:
All regions began on a single RegionServer, and within 4 balancer iterations we had a well-balanced cluster with key system tables isolated. It achieved this in about 2min on my local machine, and most of that time was spent bootstrapping the mini cluster.
Table isolation performance testing
Likewise, we created large tests for system table isolation, meta table isolation, multi-table isolation, and multi-table isolation plus replica distribution. These tests reliably find exactly what we're looking for, and do so expeditiously for 100 servers and 10k+ regions; all of them pass within a few minutes on my local machine.
cc @ndimiduk @charlesconnell @ksravista @aalhour