Skip to content

Conversation

@JoshRosen
Copy link
Contributor

When profiling HiveCompatibilitySuite, I noticed that most of the time seems to be spent in expensive TestHive.reset() calls. This patch speeds up suites based on HiveComparisionTest, such as HiveCompatibilitySuite, with the following changes:

  • Avoid TestHive.reset() whenever possible:
    • Use a simple set of heuristics to guess whether we need to call reset() in between tests.
    • As a safety-net, automatically re-run failed tests by calling reset() before the re-attempt.
  • Speed up the expensive parts of TestHive.reset(): loading the src and srcpart tables took roughly 600ms per test, so we now avoid this by using a simple heuristic which only loads those tables by tests that reference them. This is based on simple string matching over the test queries which errs on the side of loading in more situations than might be strictly necessary.

After these changes, HiveCompatibilitySuite seems to run in about 10 minutes.

This PR is a revival of #6663, an earlier experimental PR from June, where I played around with several possible speedups for this suite.

@SparkQA
Copy link

SparkQA commented Dec 1, 2015

Test build #46934 has finished for PR 10055 at commit 8b2e4e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 1, 2015

Test build #46947 has finished for PR 10055 at commit 50694b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 1, 2015

Test build #46955 has finished for PR 10055 at commit eb5c185.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 1, 2015

Test build #46956 has finished for PR 10055 at commit 7ecd9f5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

marmbrus commented Dec 1, 2015

I like this idea. How many queries to we have to retry? Should we cache that?

@JoshRosen
Copy link
Contributor Author

AFAIK it currently doesn't need to retry any queries, since the current set of heuristics for determining when / whether to reset() seems to be working well.

@JoshRosen JoshRosen changed the title [WIP] Experiment with speculatively skipping TestHive.reset() in HiveComparisionTest [SPARK-12075][SQL] Speed up HiveComparisionTest by avoiding / speeding up TestHive.reset() Dec 1, 2015
@JoshRosen
Copy link
Contributor Author

I've created a JIRA for this and have updated the PR title and description; PTAL.

@JoshRosen
Copy link
Contributor Author

Update: it looks like only one query needed to be retried:

13:03:52.601 WARN org.apache.spark.sql.hive.execution.HiveCompatibilitySuite: Test failed without reset(); retrying with reset()
[info] - router_join_ppr (1 second, 554 milliseconds)

Since things seem to be running quickly, I'm inclined to skip the caching of the reset() list for now.

@SparkQA
Copy link

SparkQA commented Dec 1, 2015

Test build #46979 has finished for PR 10055 at commit 0d91c0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Dec 1, 2015

LGTM - merging this in master and branch-1.6.

asfgit pushed a commit that referenced this pull request Dec 1, 2015
…g up TestHive.reset()

When profiling HiveCompatibilitySuite, I noticed that most of the time seems to be spent in expensive `TestHive.reset()` calls. This patch speeds up suites based on HiveComparisionTest, such as HiveCompatibilitySuite, with the following changes:

- Avoid `TestHive.reset()` whenever possible:
  - Use a simple set of heuristics to guess whether we need to call `reset()` in between tests.
  - As a safety-net, automatically re-run failed tests by calling `reset()` before the re-attempt.
- Speed up the expensive parts of `TestHive.reset()`: loading the `src` and `srcpart` tables took roughly 600ms per test, so we now avoid this by using a simple heuristic which only loads those tables by tests that reference them. This is based on simple string matching over the test queries which errs on the side of loading in more situations than might be strictly necessary.

After these changes, HiveCompatibilitySuite seems to run in about 10 minutes.

This PR is a revival of #6663, an earlier experimental PR from June, where I played around with several possible speedups for this suite.

Author: Josh Rosen <[email protected]>

Closes #10055 from JoshRosen/speculative-testhive-reset.

(cherry picked from commit ef6790f)
Signed-off-by: Reynold Xin <[email protected]>
@asfgit asfgit closed this in ef6790f Dec 1, 2015
@JoshRosen JoshRosen deleted the speculative-testhive-reset branch August 29, 2016 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants