[SPARK-12075][SQL] Speed up HiveComparisionTest by avoiding / speeding up TestHive.reset() #10055
Conversation
Test build #46934 has finished for PR 10055 at commit

Test build #46947 has finished for PR 10055 at commit

Test build #46955 has finished for PR 10055 at commit

Test build #46956 has finished for PR 10055 at commit
I like this idea. How many queries do we have to retry? Should we cache that?

AFAIK it currently doesn't need to retry any queries, since the current set of heuristics for determining when / whether to reset() seems to be working well.

I've created a JIRA for this and have updated the PR title and description; PTAL.

Update: it looks like only one query needed to be retried: Since things seem to be running quickly, I'm inclined to skip the caching of the

Test build #46979 has finished for PR 10055 at commit

LGTM - merging this in master and branch-1.6.
…g up TestHive.reset()

When profiling HiveCompatibilitySuite, I noticed that most of the time seems to be spent in expensive `TestHive.reset()` calls. This patch speeds up suites based on HiveComparisionTest, such as HiveCompatibilitySuite, with the following changes:

- Avoid `TestHive.reset()` whenever possible:
  - Use a simple set of heuristics to guess whether we need to call `reset()` in between tests.
  - As a safety net, automatically re-run failed tests by calling `reset()` before the re-attempt.
- Speed up the expensive parts of `TestHive.reset()`: loading the `src` and `srcpart` tables took roughly 600ms per test, so we now avoid this by using a simple heuristic which only loads those tables for tests that reference them. This is based on simple string matching over the test queries, which errs on the side of loading in more situations than might be strictly necessary.

After these changes, HiveCompatibilitySuite seems to run in about 10 minutes.

This PR is a revival of #6663, an earlier experimental PR from June, where I played around with several possible speedups for this suite.

Author: Josh Rosen <[email protected]>

Closes #10055 from JoshRosen/speculative-testhive-reset.

(cherry picked from commit ef6790f)
Signed-off-by: Reynold Xin <[email protected]>
When profiling HiveCompatibilitySuite, I noticed that most of the time seems to be spent in expensive `TestHive.reset()` calls. This patch speeds up suites based on HiveComparisionTest, such as HiveCompatibilitySuite, with the following changes:

- Avoid `TestHive.reset()` whenever possible:
  - Use a simple set of heuristics to guess whether we need to call `reset()` in between tests.
  - As a safety net, automatically re-run failed tests by calling `reset()` before the re-attempt.
- Speed up the expensive parts of `TestHive.reset()`: loading the `src` and `srcpart` tables took roughly 600ms per test, so we now avoid this by using a simple heuristic which only loads those tables for tests that reference them. This is based on simple string matching over the test queries, which errs on the side of loading in more situations than might be strictly necessary.

After these changes, HiveCompatibilitySuite seems to run in about 10 minutes.
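To illustrate the table-loading heuristic described above, here is a minimal sketch in Python rather than the suite's own Scala. The names `HEAVY_TABLES` and `tables_to_load` are hypothetical and are not taken from the patch; the only thing carried over from the description is the idea of plain string matching over the test queries that prefers loading too much over loading too little.

```python
# Hypothetical sketch, not the patch's code: decide which expensive test
# tables to load by plain substring matching over a test's queries.
HEAVY_TABLES = ("src", "srcpart")


def tables_to_load(queries):
    """Return the expensive tables that the queries appear to reference.

    Matching is a case-insensitive substring check, so it deliberately
    over-approximates: any query that mentions "srcpart" (or any other
    identifier containing "src") also causes "src" to be loaded.
    """
    text = " ".join(queries).lower()
    return {table for table in HEAVY_TABLES if table in text}
```

With this sketch, a test whose queries never mention `src` skips both loads, while a test that only touches `srcpart` still loads `src` as well. That is the kind of harmless over-loading the description accepts in exchange for keeping the check simple.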
This PR is a revival of #6663, an earlier experimental PR from June, where I played around with several possible speedups for this suite.
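The "skip `reset()` when the heuristics allow it, then reset and re-run on failure" safety net from the description can be sketched in the same way. This is a Python illustration under stated assumptions, not the Scala code from the patch: `run_test`, `test`, `reset`, and `probably_needs_reset` are hypothetical names, and retrying exactly once is an assumption of the sketch.

```python
# Hypothetical sketch, not the patch's code: run a test with a speculative
# (possibly skipped) reset, falling back to a real reset plus one retry.
def run_test(test, reset, probably_needs_reset):
    """Run `test`, resetting first only if the heuristic says so.

    `test` and `reset` are zero-argument callables. If the first attempt
    raises, call `reset` and re-run the test once; a second failure
    propagates to the caller as a genuine test failure.
    """
    if probably_needs_reset:
        reset()
    try:
        return test()
    except Exception:
        # Safety net: the failure may be due to stale state left behind
        # by an earlier test, so reset and try once more.
        reset()
        return test()
```

The cost of a wrong guess is one wasted test run plus one reset, which is why the heuristics only need to be right most of the time.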