Skip to content

Conversation

@HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

This PR proposes to show the actual traceback when "handle large number of containers and tasks (SPARK-18750)" test fails in LocalityPlacementStrategySuite.

It does not fully resolve the JIRA SPARK-31746 yet. I tried to reproduce in my local by controlling the factors in the tests but I couldn't. I double checked the changes in SPARK-18750 are still valid.

Why are the changes needed?

This test is flaky for an unknown reason (see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122768/testReport/org.apache.spark.deploy.yarn/LocalityPlacementStrategySuite/handle_large_number_of_containers_and_tasks__SPARK_18750_/):

sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError did not equal null
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)

After this PR, it will help to investigate the root cause:

Before:

[info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (824 milliseconds)
[info]   java.lang.StackOverflowError did not equal null (LocalityPlacementStrategySuite.scala:49)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
[info]   at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:49)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
[info]   at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
[info]   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
[info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
[info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
...

After:

[info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (825 milliseconds)
[info]   StackOverflowError should not be thrown; however, got:
[info]
[info]    java.lang.StackOverflowError
[info]   	at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
...

Does this PR introduce any user-facing change?

No, dev-only.

How was this patch tested?

Manually tested by reverting 76db394 locally.

@HyukjinKwon HyukjinKwon changed the title Show the actual error message in LocalityPlacementStrategySuite [SPARK-31746][YARN][TESTS] Show the actual error message in LocalityPlacementStrategySuite May 18, 2020
@HyukjinKwon
Copy link
Member Author

cc @cloud-fan. The flakiness was found at #28463. Can you take a quick look?

@HyukjinKwon
Copy link
Member Author

I am going to merge this, see #28463 (comment).

@HyukjinKwon
Copy link
Member Author

Merged to master and branch-3.0.

HyukjinKwon added a commit that referenced this pull request May 18, 2020
…lacementStrategySuite

### What changes were proposed in this pull request?

This PR proposes to show the actual traceback when "handle large number of containers and tasks (SPARK-18750)" test fails in `LocalityPlacementStrategySuite`.

**It does not fully resolve the JIRA SPARK-31746 yet**. I tried to reproduce in my local by controlling the factors in the tests but I couldn't. I double checked the changes in SPARK-18750 are still valid.

### Why are the changes needed?

This test is flaky for an unknown reason (see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122768/testReport/org.apache.spark.deploy.yarn/LocalityPlacementStrategySuite/handle_large_number_of_containers_and_tasks__SPARK_18750_/):

```
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError did not equal null
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
```

After this PR, it will help to investigate the root cause:

**Before**:

```
[info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (824 milliseconds)
[info]   java.lang.StackOverflowError did not equal null (LocalityPlacementStrategySuite.scala:49)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
[info]   at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:49)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
[info]   at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
[info]   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
[info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
[info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
...
```

**After**:

```
[info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (825 milliseconds)
[info]   StackOverflowError should not be thrown; however, got:
[info]
[info]    java.lang.StackOverflowError
[info]   	at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
...

```

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested by reverting 76db394 locally.

Closes #28566 from HyukjinKwon/SPARK-31746.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 3bf7bf9)
Signed-off-by: HyukjinKwon <[email protected]>
@maropu
Copy link
Member

maropu commented May 18, 2020

@HyukjinKwon It seems branch-3.0 broken?

[info] Done packaging.
[error] /home/jenkins/workspace/SparkPullRequestBuilder@4/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/LocalityPlacementStrategySuite.scala:65: not found: value resource
[error]       yarnConf, resource, new MockResolver())
[error]                 ^
[info] Packaging /home/jenkins/workspace/SparkPullRequestBuilder@4/external/kafka-0-10-token-provider/target/scala-2.12/spark-token-provider-kafka-0-10_2.12-3.0.1-SNAPSHOT-tests.jar ...

// goal is to create enough requests for localized containers (so there should be many
// tasks on several hosts that have no allocated containers).

val resource = Resource.newInstance(8 * 1024, 4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, yeah. Seems it's used in the branch-3.0. I will fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be fixed now. thanks for pointing out quickly.

HyukjinKwon added a commit that referenced this pull request May 18, 2020
…lacementStrategySuite

This PR proposes to show the actual traceback when "handle large number of containers and tasks (SPARK-18750)" test fails in `LocalityPlacementStrategySuite`.

**It does not fully resolve the JIRA SPARK-31746 yet**. I tried to reproduce in my local by controlling the factors in the tests but I couldn't. I double checked the changes in SPARK-18750 are still valid.

This test is flaky for an unknown reason (see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/122768/testReport/org.apache.spark.deploy.yarn/LocalityPlacementStrategySuite/handle_large_number_of_containers_and_tasks__SPARK_18750_/):

```
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError did not equal null
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
	at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
```

After this PR, it will help to investigate the root cause:

**Before**:

```
[info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (824 milliseconds)
[info]   java.lang.StackOverflowError did not equal null (LocalityPlacementStrategySuite.scala:49)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
[info]   at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:49)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:157)
[info]   at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
[info]   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
[info]   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
[info]   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
...
```

**After**:

```
[info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (825 milliseconds)
[info]   StackOverflowError should not be thrown; however, got:
[info]
[info]    java.lang.StackOverflowError
[info]   	at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
[info]   	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
[info]   	at scala.collection.MapLike$MappedValues.$anonfun$foreach$3(MapLike.scala:256)
...

```

No, dev-only.

Manually tested by reverting 76db394 locally.

Closes #28566 from HyukjinKwon/SPARK-31746.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 3bf7bf9)
Signed-off-by: HyukjinKwon <[email protected]>
@SparkQA
Copy link

SparkQA commented May 18, 2020

Test build #122782 has finished for PR 28566 at commit c7bc490.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon deleted the SPARK-31746 branch July 27, 2020 07:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants