@squito
Contributor

@squito squito commented Aug 26, 2015

DAGSchedulerEventLoop normally only logs errors (so it can continue to process more events, from other jobs). However, this is not desirable in the tests -- the tests should be able to easily detect any exception, and also shouldn't silently succeed if there is an exception.

This was suggested by @mateiz on #7699. It may have already turned up an issue in "zero split job".

Contributor Author

This change is so I could trigger an exception (by passing in null); it seems like a good change in any case.

@SparkQA

SparkQA commented Aug 26, 2015

Test build #41634 has finished for PR 8466 at commit 2c45f78.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LogisticRegressionModel @Since("1.3.0") (
    • class SVMModel @Since("1.1.0") (
    • class GaussianMixtureModel @Since("1.3.0") (
    • class KMeansModel @Since("1.1.0") (@Since("1.0.0") val clusterCenters: Array[Vector])
    • class PowerIterationClusteringModel @Since("1.3.0") (
    • class StreamingKMeansModel @Since("1.2.0") (
    • class StreamingKMeans @Since("1.2.0") (
    • class BinaryClassificationMetrics @Since("1.3.0") (
    • class MulticlassMetrics @Since("1.1.0") (predictionAndLabels: RDD[(Double, Double)])
    • class MultilabelMetrics @Since("1.2.0") (predictionAndLabels: RDD[(Array[Double], Array[Double])])
    • class RegressionMetrics @Since("1.2.0") (
    • class ChiSqSelectorModel @Since("1.3.0") (
    • class ChiSqSelector @Since("1.3.0") (
    • class ElementwiseProduct @Since("1.4.0") (
    • class IDF @Since("1.2.0") (@Since("1.2.0") val minDocFreq: Int)
    • class Normalizer @Since("1.1.0") (p: Double) extends VectorTransformer
    • class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int)
    • class StandardScaler @Since("1.1.0") (withMean: Boolean, withStd: Boolean) extends Logging
    • class StandardScalerModel @Since("1.3.0") (
    • class FPGrowthModel[Item: ClassTag] @Since("1.3.0") (
    • class FreqItemset[Item] @Since("1.3.0") (
    • class FreqSequence[Item] @Since("1.5.0") (
    • class PrefixSpanModel[Item] @Since("1.5.0") (
    • class DenseMatrix @Since("1.3.0") (
    • class SparseMatrix @Since("1.3.0") (
    • class DenseVector @Since("1.0.0") (
    • class SparseVector @Since("1.0.0") (
    • class BlockMatrix @Since("1.3.0") (
    • class CoordinateMatrix @Since("1.0.0") (
    • class IndexedRowMatrix @Since("1.0.0") (
    • class RowMatrix @Since("1.0.0") (
    • class PoissonGenerator @Since("1.1.0") (
    • class ExponentialGenerator @Since("1.3.0") (
    • class GammaGenerator @Since("1.3.0") (
    • class LogNormalGenerator @Since("1.3.0") (
    • abstract class GeneralizedLinearModel @Since("1.0.0") (
    • class IsotonicRegressionModel @Since("1.3.0") (
    • case class LabeledPoint @Since("1.0.0") (
    • class LassoModel @Since("1.1.0") (
    • class LinearRegressionModel @Since("1.1.0") (
    • class RidgeRegressionModel @Since("1.1.0") (
    • class MultivariateGaussian @Since("1.3.0") (
    • case class BoostingStrategy @Since("1.4.0") (
    • class Strategy @Since("1.3.0") (
    • class DecisionTreeModel @Since("1.0.0") (
    • class Node @Since("1.2.0") (
    • class Predict @Since("1.2.0") (
    • class RandomForestModel @Since("1.2.0") (
    • class GradientBoostedTreesModel @Since("1.2.0") (
    • abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode
    • case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)

Contributor Author

This test used to just log an exception on cancel(jobId). I'm not sure what it was supposed to be testing before. I made the minimal change here by capturing the exception and checking it. But maybe cancel(jobId) should not be creating an exception at all? Is the idea that if you submit a job with no partitions, it stops immediately, so that if you then try to cancel it, you'd just hit this case with a harmless logDebug? That suggests we should change handleJobSubmitted to handle empty jobs, the same way submitMissingTasks handles stages with no partitions.
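
A rough sketch of the idea raised in this comment, under stated assumptions: `ToyScheduler`, `submitJob`, `cancel`, and `isActive` are hypothetical names in a toy model, not Spark's DAGScheduler or its actual handling of empty jobs. It only illustrates how finishing a zero-partition job at submission time would turn a later cancel into a harmless no-op.

```scala
import scala.collection.mutable

// Hypothetical toy scheduler for illustration only; not Spark's DAGScheduler.
class ToyScheduler {
  private val activeJobs = mutable.Set.empty[Int]
  val log = mutable.Buffer.empty[String]

  // A job with no partitions has no tasks to run, so it is finished
  // immediately instead of being registered as active.
  def submitJob(jobId: Int, numPartitions: Int): Unit =
    if (numPartitions == 0) log += s"job $jobId finished immediately (no partitions)"
    else activeJobs += jobId

  // Cancelling a job that is not active only records a debug-style message.
  def cancel(jobId: Int): Unit =
    if (activeJobs.remove(jobId)) log += s"job $jobId cancelled"
    else log += s"debug: job $jobId is not active, nothing to cancel"

  def isActive(jobId: Int): Boolean = activeJobs.contains(jobId)
}
```

Whether the real scheduler should behave this way is exactly the open question in the comment; the sketch does not settle it.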

Contributor Author

Not necessary, but it was helpful for some debugging, and I figured it couldn't hurt.

@SparkQA

SparkQA commented Aug 26, 2015

Test build #41635 has finished for PR 8466 at commit 32102f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mateiz
Contributor

mateiz commented Sep 2, 2015

Hey, so is the conclusion that the DAGScheduler actually did pass the exception to JobListeners, but we weren't listening for it in our test suite? I thought the initial problem was that some exception got swallowed before ever being passed up.

Conflicts:
	core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
@squito
Contributor Author

squito commented Oct 9, 2015

Hi @mateiz, sorry, I totally missed your comment a while back. I think I explained the issue poorly in the first place. To the best of my knowledge, exceptions are already making it to the JobListeners (as part of doCancelAllJobs).

The issue I'm trying to address is just confusing behavior in the tests. Before this change, if there is an exception inside the event processing loop, all that happens is that some internal state is set. Furthermore, the event processing loop stops the SparkContext, but it doesn't actually stop the DAGScheduler in use, because in these tests the DAGScheduler is custom, not the one created by the SparkContext.

The net effect is that when a test doesn't behave the way you expect, it's hard to unravel why. Unless a test includes assert(failure == null) on every other line, hitting an exception lets the test keep running quite a bit further, but with strange behavior that can be hard to make sense of.

The only goal here is to make those issues much more immediately obvious. For these tests, as soon as there is an exception in the event processing loop, throw the exception and fail the test.
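
To make the contrast concrete, here is a minimal sketch under stated assumptions: `SimpleEventLoop`, `LoggingLoop`, and `FailFastLoop` are hypothetical classes written for illustration, not Spark's event loop API and not the code in this patch. The first subclass records errors and keeps going; the second rethrows from its error hook, so the caller sees the failure right away.

```scala
import scala.collection.mutable
import scala.util.control.NonFatal

// Hypothetical, simplified stand-in for an event loop; not Spark's API.
abstract class SimpleEventLoop[E] {
  private val queue = mutable.Queue.empty[E]

  def post(event: E): Unit = queue.enqueue(event)

  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit

  // Process everything queued so far on the calling thread.
  def drain(): Unit =
    while (queue.nonEmpty) {
      val event = queue.dequeue()
      try onReceive(event)
      catch { case NonFatal(e) => onError(e) }
    }
}

// Production-style behavior: record the error and keep processing later events.
class LoggingLoop extends SimpleEventLoop[String] {
  val errors = mutable.Buffer.empty[Throwable]
  val handled = mutable.Buffer.empty[String]

  protected def onReceive(event: String): Unit = {
    if (event == null) throw new IllegalArgumentException("null event")
    handled += event
  }

  protected def onError(e: Throwable): Unit = errors += e
}

// Test-style behavior: rethrow so the failure surfaces immediately.
class FailFastLoop extends LoggingLoop {
  override protected def onError(e: Throwable): Unit = throw e
}
```

With `LoggingLoop`, a test has to remember to inspect `errors` after every step. With `FailFastLoop`, the exception escapes `drain()` and events queued after the bad one are never processed.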

@SparkQA

SparkQA commented Oct 9, 2015

Test build #43477 has finished for PR 8466 at commit f1f4814.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Oct 9, 2015

LGTM.

@zsxwing
Member

zsxwing commented Dec 11, 2015

retest this please

@zsxwing
Member

zsxwing commented Dec 11, 2015

LGTM pending tests

@SparkQA

SparkQA commented Dec 11, 2015

Test build #47556 has finished for PR 8466 at commit f1f4814.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47684 has finished for PR 8466 at commit f1f4814.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Is this a related change? If we add another TaskEndReason in the future, we might forget to add it here.

Member

If so, I think the compiler will report the non-exhaustive match as a warning and the build will fail. Right?

Contributor

I see, that's fine

Contributor Author

This was needed just for the added test case. I needed a way to intentionally throw an error inside the event loop, and with this change, if event.reason is null, you get an exception.

And yeah, if you add another type but don't include it in this match, you'll get a fatal warning in compilation, since it's a sealed trait.
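
A small sketch of the two behaviors mentioned here, under stated assumptions: `EndReason`, `Succeeded`, `Failed`, and `EndReasonDemo` are hypothetical stand-ins, not Spark's TaskEndReason hierarchy or the match in this patch. It shows a match over a sealed trait with no wildcard case, where a null value falls through every pattern and raises `scala.MatchError`.

```scala
// Hypothetical stand-ins for illustration; not Spark's TaskEndReason classes.
sealed trait EndReason
case object Succeeded extends EndReason
final case class Failed(message: String) extends EndReason

object EndReasonDemo {
  // No wildcard case: because EndReason is sealed, the compiler can check
  // that every subtype is covered and warn if a new one is left out.
  def describe(reason: EndReason): String = reason match {
    case Succeeded   => "succeeded"
    case Failed(msg) => s"failed: $msg"
  }

  // null matches neither pattern, so the match throws scala.MatchError.
  def describeNullThrows(): Boolean =
    try { describe(null); false }
    catch { case _: MatchError => true }
}
```

The missing-case warning only fails the build when warnings are configured as fatal, which is what the comment above describes for this project.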

@squito
Contributor Author

squito commented Dec 16, 2015

I don't think that test failure is related at all, but I also don't see why that test would be flaky :(

@squito
Contributor Author

squito commented Dec 16, 2015

Jenkins, retest this please

@SparkQA

SparkQA commented Dec 17, 2015

Test build #47861 has finished for PR 8466 at commit f1f4814.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

retest this please

@andrewor14
Contributor

Test failure is unrelated. Merging into master and 1.6.

asfgit pushed a commit that referenced this pull request Dec 17, 2015
`DAGSchedulerEventLoop` normally only logs errors (so it can continue to process more events, from other jobs).  However, this is not desirable in the tests -- the tests should be able to easily detect any exception, and also shouldn't silently succeed if there is an exception.

This was suggested by mateiz on #7699.  It may have already turned up an issue in "zero split job".

Author: Imran Rashid <[email protected]>

Closes #8466 from squito/SPARK-10248.

(cherry picked from commit 38d9795)
Signed-off-by: Andrew Or <[email protected]>
@asfgit asfgit closed this in 38d9795 Dec 17, 2015
@SparkQA

SparkQA commented Dec 17, 2015

Test build #47883 has finished for PR 8466 at commit f1f4814.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.
