[WIP] [SQL] Experiment with speculatively running without reset() in HiveCompatibilitySuite #6663
This reverts commit 3c9c944.
Test build #34243 has finished for PR 6663 at commit
Cool, HiveCompatibilitySuite now runs in 12 minutes.
I think that this, and SerializableWritable more generally, may be a huge source of performance bottlenecks for short tasks. A common use of SerializableWritable is serializing Hadoop Configurations, but it seems kind of crazy to create and discard a new Configuration just to be able to deserialize the driver-provided conf. Maybe we can make a substitute for SerializableWritable which only deals with Configuration subclasses and just calls write() and readFields() directly. This would sidestep a lot of the performance penalties involved in creating Configuration objects and having them spend tons of time loading defaults.
Apparently other folks have noticed Configuration's expensive instantiation costs, too: https://issues.apache.org/jira/browse/MAPREDUCE-5399
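To make the idea above concrete, here is a minimal sketch of a serializable wrapper that writes the conf's fields directly and, on the receiving side, builds the object without loading defaults before calling `readFields()`. Everything named here is an assumption for illustration: `StubConf` is a hypothetical stand-in for Hadoop's `Configuration` (so the sketch runs without Hadoop on the classpath), and `SerializableConf` is a made-up wrapper name, not the class this PR adds.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for org.apache.hadoop.conf.Configuration.
class StubConf {
    static int defaultLoads = 0; // counts simulated expensive default loads
    final Map<String, String> props = new HashMap<>();

    StubConf(boolean loadDefaults) {
        if (loadDefaults) {
            defaultLoads++; // stands in for parsing the default resource files
            props.put("io.file.buffer.size", "4096");
        }
    }

    // Same shape as Writable.write(DataOutput).
    void write(DataOutput out) throws IOException {
        out.writeInt(props.size());
        for (Map.Entry<String, String> e : props.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    // Same shape as Writable.readFields(DataInput).
    void readFields(DataInput in) throws IOException {
        props.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            props.put(key, value);
        }
    }
}

// Hypothetical wrapper: Java serialization hooks that skip default loading.
class SerializableConf implements Serializable {
    private static final long serialVersionUID = 1L;
    transient StubConf value;

    SerializableConf(StubConf value) {
        this.value = value;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        value.write(out); // ObjectOutputStream is a DataOutput
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        value = new StubConf(false); // no defaults: every field comes from the stream
        value.readFields(in);
    }

    // Serializes and deserializes in memory, as a task launch would over the wire.
    static SerializableConf roundTrip(SerializableConf conf) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(conf);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SerializableConf) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With real Hadoop classes, the analogous move would be constructing the conf with `new Configuration(false)` before `readFields()`, so the deserialized copy never pays for default loading.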
Test build #34257 has finished for PR 6663 at commit
Test build #34258 has finished for PR 6663 at commit
137a0c7 to 55041d2
Test build #34261 has finished for PR 6663 at commit
Test build #34262 has finished for PR 6663 at commit
Test build #34263 has finished for PR 6663 at commit
Test build #34265 has finished for PR 6663 at commit
Test build #34312 has finished for PR 6663 at commit
Test build #34327 has finished for PR 6663 at commit
This reverts commit 9e116d1.
…iating Configuration
Test build #34336 has finished for PR 6663 at commit
Test build #34339 has finished for PR 6663 at commit
Jenkins, retest this please.
Test build #34341 has finished for PR 6663 at commit
Test build #34455 has finished for PR 6663 at commit
Jenkins, retest this please.
Test build #34523 has finished for PR 6663 at commit
Test build #903 has finished for PR 6663 at commit
Test build #34789 has finished for PR 6663 at commit
Test build #904 has finished for PR 6663 at commit
Test build #905 has finished for PR 6663 at commit
Test build #34793 has finished for PR 6663 at commit
Test build #906 has finished for PR 6663 at commit
Test build #34795 has finished for PR 6663 at commit
Test build #907 has finished for PR 6663 at commit
…g up TestHive.reset()

When profiling HiveCompatibilitySuite, I noticed that most of the time seems to be spent in expensive `TestHive.reset()` calls. This patch speeds up suites based on HiveComparisonTest, such as HiveCompatibilitySuite, with the following changes:

- Avoid `TestHive.reset()` whenever possible:
  - Use a simple set of heuristics to guess whether we need to call `reset()` in between tests.
  - As a safety-net, automatically re-run failed tests by calling `reset()` before the re-attempt.
- Speed up the expensive parts of `TestHive.reset()`: loading the `src` and `srcpart` tables took roughly 600ms per test, so we now avoid this by using a simple heuristic which only loads those tables for tests that reference them. This is based on simple string matching over the test queries, which errs on the side of loading in more situations than might be strictly necessary.

After these changes, HiveCompatibilitySuite seems to run in about 10 minutes.

This PR is a revival of apache#6663, an earlier experimental PR from June, where I played around with several possible speedups for this suite.

Author: Josh Rosen <[email protected]>

Closes apache#10055 from JoshRosen/speculative-testhive-reset.
…g up TestHive.reset()

When profiling HiveCompatibilitySuite, I noticed that most of the time seems to be spent in expensive `TestHive.reset()` calls. This patch speeds up suites based on HiveComparisonTest, such as HiveCompatibilitySuite, with the following changes:

- Avoid `TestHive.reset()` whenever possible:
  - Use a simple set of heuristics to guess whether we need to call `reset()` in between tests.
  - As a safety-net, automatically re-run failed tests by calling `reset()` before the re-attempt.
- Speed up the expensive parts of `TestHive.reset()`: loading the `src` and `srcpart` tables took roughly 600ms per test, so we now avoid this by using a simple heuristic which only loads those tables for tests that reference them. This is based on simple string matching over the test queries, which errs on the side of loading in more situations than might be strictly necessary.

After these changes, HiveCompatibilitySuite seems to run in about 10 minutes.

This PR is a revival of #6663, an earlier experimental PR from June, where I played around with several possible speedups for this suite.

Author: Josh Rosen <[email protected]>

Closes #10055 from JoshRosen/speculative-testhive-reset.

(cherry picked from commit ef6790f)
Signed-off-by: Reynold Xin <[email protected]>
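As a rough illustration of the table-loading heuristic described in the commit message, the sketch below scans the test queries for the table names and reports which tables to load. The table names `src` and `srcpart` come from the commit message; the class name, method name, and the exact matching rule (case-insensitive substring search) are assumptions and may differ from what the patch actually does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not the code from the patch.
class TestTableHeuristic {
    // Tables whose loading is expensive, per the commit message.
    static final List<String> CANDIDATE_TABLES = Arrays.asList("src", "srcpart");

    // Returns the candidate tables whose names appear anywhere in the queries.
    // Plain substring matching over-approximates on purpose: a query that only
    // mentions "srcpart" also contains "src", so both tables get loaded.
    static Set<String> tablesToLoad(List<String> queries) {
        Set<String> needed = new TreeSet<>();
        for (String query : queries) {
            String lower = query.toLowerCase(Locale.ROOT);
            for (String table : CANDIDATE_TABLES) {
                if (lower.contains(table)) {
                    needed.add(table);
                }
            }
        }
        return needed;
    }
}
```

The over-approximation is the design choice that keeps this safe: a false positive only costs an unnecessary table load, while a false negative would make a test fail for lack of data.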

This is an experiment to see if we can easily speed up HiveCompatibilitySuite by speculatively running without calling TestHive.reset(), then retrying if that fails.

Note: This PR is a bit of a mess since it's now serving as a quick playground for me to rapidly prototype perf. patches and test them with Jenkins.
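
The speculate-then-retry idea from the description can be sketched as below: run the test body against whatever state the previous test left behind, and only if it fails, reset and try once more. `SpeculativeRunner` and its members are hypothetical names chosen for this sketch, and the `reset` callback merely stands in for `TestHive.reset()`; the actual test harness is structured differently.

```java
// Hypothetical runner; a sketch of the approach, not the suite's real code.
class SpeculativeRunner {
    private final Runnable reset; // stands in for TestHive.reset()
    int resets = 0;               // how many times the expensive reset was needed

    SpeculativeRunner(Runnable reset) {
        this.reset = reset;
    }

    // Runs the test without resetting first. If it fails, resets the shared
    // state and retries once. Returns the number of attempts made. A failure
    // on the second attempt propagates to the caller as a genuine failure.
    int run(Runnable test) {
        try {
            test.run();
            return 1;
        } catch (RuntimeException | AssertionError firstFailure) {
            reset.run();
            resets++;
            test.run();
            return 2;
        }
    }
}
```

The cost model is what makes this worthwhile: a test that passes on leftover state skips the reset entirely, while a test that needed clean state pays for one wasted attempt plus the reset it would have needed anyway.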