-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Shortcut simple patterns ending in *
#43904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Shortcut simple patterns ending in *
#43904
Conversation
When profiling a call to `AllocationService#reroute()` in a large cluster
containing allocation filters of the form `node-name-*` I observed a nontrivial
amount of time spent in `Regex#simpleMatch` due to these allocation filters.
Patterns ending in a wildcard are not uncommon, and this change treats them as
a special case in `Regex#simpleMatch` in order to shave a bit of time off this
calculation.
Microbenchmark results before this change:
Result "org.elasticsearch.common.regex.RegexStartsWithBenchmark.performSimpleMatch":
1113.839 ±(99.9%) 6.338 ns/op [Average]
(min, avg, max) = (1102.388, 1113.839, 1135.783), stdev = 9.486
CI (99.9%): [1107.502, 1120.177] (assumes normal distribution)
Microbenchmark results with this change applied:
Result "org.elasticsearch.common.regex.RegexStartsWithBenchmark.performSimpleMatch":
502.942 ±(99.9%) 0.590 ns/op [Average]
(min, avg, max) = (501.292, 502.942, 504.490), stdev = 0.883
CI (99.9%): [502.351, 503.532] (assumes normal distribution)
The microbenchmark in question was:
@fork(3)
@WarmUp(iterations = 10)
@measurement(iterations = 10)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@SuppressWarnings("unused") //invoked by benchmarking framework
public class RegexStartsWithBenchmark {
private static final String testString = "abcdefghijklmnopqrstuvwxyz";
private static final String[] patterns;
static {
patterns = new String[testString.length() + 1];
for (int i = 0; i <= testString.length(); i++) {
patterns[i] = testString.substring(0, i) + "*";
}
}
@benchmark
public void performSimpleMatch() {
for (int i = 0; i < patterns.length; i++) {
Regex.simpleMatch(patterns[i], testString);
}
}
}
|
Pinging @elastic/es-core-infra |
| } | ||
|
|
||
| public void testSimpleMatch() { | ||
| for (int i = 0; i < 100000; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover from benchmarking experiments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meh 100k iterations is still only 300ms :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reduced in 489e82b.
henningandersen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, but I wonder if we could get rid of more allocations (if that was the culprit?), see comments.
| return false; | ||
| } | ||
| if (firstIndex == pattern.length() - 1) { | ||
| return str.startsWith(pattern.substring(0, firstIndex)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not entirely clear to me where the performance issue comes from, but a guess is that the substring allocations in the recursion below is part of it. I wonder if we should try to avoid the last substring call by using regionMatches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL regionMatches is the name of the function I was trying to find. Yes, this shaves off a bit more time. I've updated the PR description with new benchmark results.
| return str.startsWith(pattern.substring(0, firstIndex)); | ||
| } | ||
| return (str.length() >= firstIndex && | ||
| pattern.substring(0, firstIndex).equals(str.substring(0, firstIndex)) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on outcome of using regionMatches above, I would suggest also using it here (or str.startsWith if that is deemed faster).
henningandersen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
original-brownbear
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM2
When profiling a call to `AllocationService#reroute()` in a large cluster
containing allocation filters of the form `node-name-*` I observed a nontrivial
amount of time spent in `Regex#simpleMatch` due to these allocation filters.
Patterns ending in a wildcard are not uncommon, and this change treats them as
a special case in `Regex#simpleMatch` in order to shave a bit of time off this
calculation. It also uses `String#regionMatches()` to avoid an allocation in
the case that the pattern's only wildcard is at the start.
Microbenchmark results before this change:
Result "org.elasticsearch.common.regex.RegexStartsWithBenchmark.performSimpleMatch":
1113.839 ±(99.9%) 6.338 ns/op [Average]
(min, avg, max) = (1102.388, 1113.839, 1135.783), stdev = 9.486
CI (99.9%): [1107.502, 1120.177] (assumes normal distribution)
Microbenchmark results with this change applied:
Result "org.elasticsearch.common.regex.RegexStartsWithBenchmark.performSimpleMatch":
433.190 ±(99.9%) 0.644 ns/op [Average]
(min, avg, max) = (431.518, 433.190, 435.456), stdev = 0.964
CI (99.9%): [432.546, 433.833] (assumes normal distribution)
The microbenchmark in question was:
@fork(3)
@WarmUp(iterations = 10)
@measurement(iterations = 10)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@SuppressWarnings("unused") //invoked by benchmarking framework
public class RegexStartsWithBenchmark {
private static final String testString = "abcdefghijklmnopqrstuvwxyz";
private static final String[] patterns;
static {
patterns = new String[testString.length() + 1];
for (int i = 0; i <= testString.length(); i++) {
patterns[i] = testString.substring(0, i) + "*";
}
}
@benchmark
public void performSimpleMatch() {
for (int i = 0; i < patterns.length; i++) {
Regex.simpleMatch(patterns[i], testString);
}
}
}
When profiling a call to
AllocationService#reroute()in a large clustercontaining allocation filters of the form
node-name-*I observed a nontrivialamount of time spent in
Regex#simpleMatchdue to these allocation filters.Patterns ending in a wildcard are not uncommon, and this change treats them as
a special case in
Regex#simpleMatchin order to shave a bit of time off thiscalculation. It also uses
String#regionMatches()to avoid an allocation inthe case that the pattern's only wildcard is at the start.
Microbenchmark results before this change:
Microbenchmark results with this change applied:
The microbenchmark in question was: