Skip to content

Conversation

@erikerlandson
Copy link
Contributor

...rtByKey() until evaluation.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@JoshRosen
Copy link
Contributor

Jenkins, this is ok to test.

@SparkQA
Copy link

SparkQA commented Jul 31, 2014

QA tests have started for PR 1689. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17611/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 31, 2014

QA results for PR 1689:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17611/consoleFull

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we perhaps make this thread safe?

@SparkQA
Copy link

SparkQA commented Aug 7, 2014

QA tests have started for PR 1689. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18089/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 7, 2014

QA results for PR 1689:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18089/consoleFull

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not want to deserialize valRB if it is not null? Are you worried rangeBounds might be called while the deserialization is happening?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also was assuming readObject might be called in multiple threads. Can that happen?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's not possible

@erikerlandson
Copy link
Contributor Author

Latest push updates RangePartition sampling job to be async, and updates the async action functions so that they will properly enclose the sampling job induced by calling 'partitions'.

@SparkQA
Copy link

SparkQA commented Aug 15, 2014

QA tests have started for PR 1689 at commit 09f0637.

  • This patch merges cleanly.

@markhamstra
Copy link
Contributor

Excellent! I'll try to find some time to review this soon.

@SparkQA
Copy link

SparkQA commented Aug 15, 2014

QA tests have finished for PR 1689 at commit 09f0637.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 16, 2014

QA tests have started for PR 1689 at commit f3448e4.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 16, 2014

QA tests have finished for PR 1689 at commit f3448e4.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@rxin
Copy link
Contributor

rxin commented Sep 12, 2014

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Sep 12, 2014

QA tests have started for PR 1689 at commit 50b6da6.

  • This patch merges cleanly.

@rxin
Copy link
Contributor

rxin commented Sep 12, 2014

@erikerlandson thanks for looking at this.

A few questions:

  1. After this pull request, does anything still use SimpleFutureAction?
  2. If I understand this correctly, this could potentially block the single-threaded scheduler from doing anything else while waiting for the rangeBounds to be computed. Any comment on this?
  3. This is not always lazy still right? See a test case
c.parallelize(1 to 1000).map(x => (x, x)).sortByKey().join(sc.parallelize(1 to 10).map(x=>(x,x)))

@SparkQA
Copy link

SparkQA commented Sep 12, 2014

QA tests have finished for PR 1689 at commit 50b6da6.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@erikerlandson
Copy link
Contributor Author

Hi @rxin,

  1. SimpleFutureAction is still referred to in submitJob method, but that doesn't appear to be invoked anywhere. I was reluctant to get rid of it, as it's all experimental, and I could envision use cases for it.

  2. I see your point. I don't currently have any clever ideas to avoid that scenario when it happens.

  3. Very interesting -- so this scenario is triggered because defaultPartitioner starts examining input RDD partitioners, which sets off the job when it trips over the data driven partitioning computation from sortByKey.

My impression is that this whack-a-mole with non-laziness stems from a combination of (a) a data-dependent partitioner(s), with (b) methods that refer to input partitioners as part of the construction of new RDDs. It might be possible to thread some design changes around so that references to partitioning are consistently encapsulated in a Future. Functions such as defaultPartitioner would then also have to return a Future, etc. Or, even more generally, somehow encapsulate all RDD initialization in a Future, with the idea that these futures would finally unwind when some Action was invoked.

However it seems (imo) outside the scope of this particular Jira/PR. Maybe we could start another umbrella Jira to track possible solutions along these lines.

Another orthogonal thought -- you can short circuit all this by providing a partitioner instead of forcing it to be computed from data. That's not as sexy, or widely applicable, as some deeper fix to the problem, but users can do it now as a workaround when it's feasible.

@erikerlandson
Copy link
Contributor Author

Or, maybe just look into playing the same game with the cogrouped RDDs that I did with sortByKey. Don't get into invoking defaultPartitioner until somebody asks for the output RDD's partitioner, etc.

@rxin
Copy link
Contributor

rxin commented Sep 16, 2014

Yea I don't think we need to fully solve 3 here.

My main concern with these set of changes is 2, since a single badly behaved RDD can potentially block the (unfortunately single threaded) scheduler forever. Let me think about this a little bit and get back to you.

If you have an idea about how to fix that, feel free to suggest them.

@erikerlandson
Copy link
Contributor Author

So far the best idea I have for (2) is to set some kind of time-out on the evaluation. The bound computation uses subsampling that will (when all goes well) cap the computation at constant time(*). If the timeout triggers, some sub-optimal falback for partitioning might be used. Or just fail the entire evaluation.

(*) more accurately, constant number of samples. the time required could depend on various things.

@rxin
Copy link
Contributor

rxin commented Sep 27, 2014

Actually I looked at it again. I don't think it would block the scheduler because we compute partitions outside the scheduler thread. This approach looks good to me!

@rxin
Copy link
Contributor

rxin commented Sep 27, 2014

@erikerlandson i'm going to merge this first. Maybe we can do the cleanup later.

@asfgit asfgit closed this in 2d972fd Sep 27, 2014
@rxin
Copy link
Contributor

rxin commented Sep 27, 2014

BTW one thing that would be great to add is a test that makes sure we don't block the main dag scheduler thread. The reason I think we don't block is that we call rdd.partitions.length in submitJob:

  /**
   * Submit a job to the job scheduler and get a JobWaiter object back. The JobWaiter object
   * can be used to block until the the job finishes executing or can be used to cancel the job.
   */
  def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit,
      properties: Properties = null): JobWaiter[U] =
  {
    // Check to make sure we are not launching a task on a partition that does not exist.
    val maxPartitions = rdd.partitions.length

@markhamstra
Copy link
Contributor

Have either of you thought about how to coordinate this with Josh's work on SPARK-3626? #2482

@marmbrus
Copy link
Contributor

Since this PR was merged the correlationoptimizer14 test has been hanging. We might want to consider rolling back. You can reproduce the problem as follows: sbt -Dspark.hive.whitelist=correlationoptimizer14 hive/test

@rxin
Copy link
Contributor

rxin commented Sep 29, 2014

I reverted this commit. @erikerlandson mind taking a look at this problem?

@erikerlandson
Copy link
Contributor Author

@rxin @marmbrus I will check it out

@erikerlandson
Copy link
Contributor Author

@marmbrus, FWIW, the correlationoptimizer14 test appears to be working for me. I ran it using: env _RUN_SQL_TESTS=true _SQL_TESTS_ONLY=true ./dev/run-tests > ~/run-tests-1021.txt 2>&1

Not sure why, but runningsbt -Dspark.hive.whitelist=correlationoptimizer14 hive/test was not causing the test to run in my environment.

sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…apache#1689)

We’ve added checkAllStateStoreProviders for Dedisco project to allow checking all state stores which is a debugging feature.

One thing we recently discussed with Dedisco project is, the time spent on such check is not counted by reportTimeTaken which can be observed in stream progress later.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants