Conversation

@ajithme
Contributor

@ajithme ajithme commented Jan 17, 2020

What changes were proposed in this pull request?

In org.apache.spark.sql.execution.SubqueryExec#relationFuture, make a copy of org.apache.spark.SparkContext#localProperties and pass it to the sub-execution thread in org.apache.spark.sql.execution.SubqueryExec#executionContext.

Why are the changes needed?

Local properties set via sparkContext are not available as TaskContext properties when executing jobs if the thread pools have idle threads that are reused.

Explanation:
In SubqueryExec, relationFuture is evaluated on a separate thread. These threads inherit the localProperties from sparkContext because they are its child threads.
The threads are created by the executionContext (thread pools). Each thread pool keeps idle threads alive for a default keepAliveSeconds of 60 seconds.
When the pool has idle threads that are reused for a subsequent new query, the thread-local properties are not inherited from the Spark context (thread properties are inherited only at thread creation), so the reused thread ends up with old or missing properties. This causes task-set properties to be missing when the child thread passes them along via sparkContext.runJob/submitJob.
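The mechanism can be illustrated outside Spark with a small sketch (hypothetical demo code using a plain JDK thread pool): SparkContext.localProperties is backed by an InheritableThreadLocal, and inheritance happens only when a thread is created, so an idle thread that gets reused keeps whatever value it saw at creation time.

import java.util.concurrent.Executors

object LocalPropertyInheritanceDemo {
  // Stand-in for SparkContext.localProperties, which is an InheritableThreadLocal.
  val prop = new InheritableThreadLocal[String]

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(1)

    prop.set("query-1") // set on the "driver" thread
    pool.submit(new Runnable {
      override def run(): Unit = println(prop.get()) // "query-1": the pool thread is created here and inherits it
    }).get()

    prop.set("query-2") // the driver updates the property for the next query
    pool.submit(new Runnable {
      override def run(): Unit = println(prop.get()) // still "query-1": the idle thread is reused, nothing is re-inherited
    }).get()

    pool.shutdown()
  }
}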

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT

@ajithme
Contributor Author

ajithme commented Jan 17, 2020

@dongjoon-hyun
Member

ok to test

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-30556] Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext [SPARK-30556][SQL] Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext Jan 17, 2020
Member

@dongjoon-hyun dongjoon-hyun left a comment


For a WIP PR, please add [WIP] to the PR title.

How was this patch tested?

WIP

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-30556][SQL] Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext [WIP][SPARK-30556][SQL] Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext Jan 17, 2020
@srowen
Member

srowen commented Jan 17, 2020

Closely related to #27266?

@ajithme
Contributor Author

ajithme commented Jan 17, 2020

Closely related to #27266?

Currently, org.apache.spark.sql.execution.SubqueryExec#executionContext has a hardcoded size of 16 threads, which makes writing a UT difficult, so I plan to make it configurable and fix the subquery thread bug here. That is why I created a new pull request instead of folding this into the broadcast bug fix (which is independent of this fix) in #27266.

@SparkQA

SparkQA commented Jan 17, 2020

Test build #116960 has finished for PR 27267 at commit 5042156.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

IMO the thread pools should not be a big issue. The subquery is guaranteed to be executed on a different thread (you can even add an assert for this). You just set some unique property in the local properties (the value should also be unique), construct a query that contains a broadcast join, and use an accumulator that you modify using either a UDF (easy) or a Dataset operation.
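A rough sketch (hypothetical test code, not from this PR) of that recipe: a UDF bumps an accumulator only when the task actually sees the unique local property, so the accumulator value tells you whether the property made it through the subquery's thread pool.

import java.util.UUID
import org.apache.spark.TaskContext
import org.apache.spark.sql.functions.udf

// Unique key and value so a stale property from an earlier query can never match.
val propKey = UUID.randomUUID().toString
val propValue = UUID.randomUUID().toString
spark.sparkContext.setLocalProperty(propKey, propValue)

val seen = spark.sparkContext.longAccumulator("propSeen")
val probe = udf { (x: Long) =>
  // Runs inside tasks; the local property only appears here if the submitting thread carried it.
  if (TaskContext.get().getLocalProperty(propKey) == propValue) seen.add(1L)
  x
}
// Apply `probe` to a query whose plan contains a subquery / broadcast join,
// run it, then assert seen.value > 0.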

@ajithme
Contributor Author

ajithme commented Jan 20, 2020

IMO the thread pools should not be a big issue. The subquery is guaranteed to be executed on a different thread (you can even add an assert for this). You just set some unique property in the local properties (the value should also be unique), construct a query that contains a broadcast join, and use an accumulator that you modify using either a UDF (easy) or a Dataset operation.

Agree. But with a pool size of 16, I would have to ensure all 16 threads are used at least once and are still alive to reproduce this issue without making the test flaky. It is a lot easier if I can set the pool size to 1 in the test to reproduce it.
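For context, a rough sketch of the kind of StaticSQLConf entry this would add to make the pool size configurable (the key name, doc text, and default below are assumptions; only the checkValue constraint is visible in the diff later in this thread):

val SUBQUERY_MAX_THREAD_THRESHOLD =
  buildStaticConf("spark.sql.subquery.maxThreadThreshold")
    .internal()
    .doc("The maximum degree of parallelism to execute the subquery.")
    .intConf
    .checkValue(thres => thres > 0 && thres <= 128, "The threshold must be in (0,128].")
    .createWithDefault(16)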

@ajithme ajithme changed the title [WIP][SPARK-30556][SQL] Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext [SPARK-30556][SQL] Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext Jan 21, 2020
@hvanhovell
Contributor

@ajithme why do we need to make sure all 16 threads are used?

@cloud-fan
Contributor

I think this is a common problem when we run code in a thread pool. I just realized that we hit a similar issue before with SQLConf.get, and we added a hack to work around it.

Shall we add a util function to easily capture the necessary thread locals and propagate them to the thread pool? For example, we could add SQLExecution.withThreadLocalCaptured:

def withThreadLocalCaptured[T](f: => T): Future[T] = {
  val activeSession = ...
  val localProperties = ...
  // any other important thread locals?
  Future {
    // set active session
    // set local properties
    f
  }
}
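A slightly more concrete sketch of that helper, reusing the calls that show up in the merged snippet quoted later in this thread (the exact signature and the explicit ExecutionContext parameter are assumptions; getLocalProperties/setLocalProperties are Spark-internal, so this only compiles inside the org.apache.spark packages):

import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.Utils

// Capture the active session and a clone of the caller's localProperties on the
// submitting thread, then restore both on whichever pool thread runs the body.
def withThreadLocalCaptured[T](
    sparkSession: SparkSession,
    exec: ExecutionContext)(body: => T): Future[T] = {
  val activeSession = sparkSession
  val sc = sparkSession.sparkContext
  val localProps = Utils.cloneProperties(sc.getLocalProperties)
  Future {
    SparkSession.setActiveSession(activeSession)
    sc.setLocalProperties(localProps)
    body
  }(exec)
}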

@ajithme
Contributor Author

ajithme commented Jan 21, 2020

util function to easily capture the necessary thread locals and propagate them to the thread pool? For example, we could add SQLExecution.withThreadLocalCaptured

Sure @cloud-fan, I think a similar approach was proposed by @hvanhovell. I will update my PR to introduce a utility rather than a local fix.

@ajithme
Contributor Author

ajithme commented Jan 21, 2020

@ajithme why do we need to make sure all 16 threads are used?

If not, the thread executing the subquery may be a new one (created by the pool) and would thus inherit the localProperties from sparkContext. This JIRA issue is reproduced only when pool threads are reused (within the keepAlive time).

@ajithme
Contributor Author

ajithme commented Jan 21, 2020

cc @cloud-fan @hvanhovell I have updated the PR as per the suggestion. Please review.

@ajithme
Contributor Author

ajithme commented Jan 21, 2020

These seem to be the problems I am trying to fix in:

  1. BroadcastExchangeExec (Refer: [SPARK-22590][SQL] Copy sparkContext.localproperties to child thread in BroadcastExchangeExec.executionContext #27266)
  2. SubqueryBroadcastExec
  3. SubqueryExec (Refer: [SPARK-30556][SQL] Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext #27267)

So should I try to fix all of them in one PR, or should I have a separate PR for each? I previously raised them separately so that I could complete them with individual UTs.

Please suggest, @cloud-fan @srowen @dongjoon-hyun @hvanhovell; I am fairly neutral on separate PRs versus a single PR.

@cloud-fan
Contributor

I think we can fix one place in this PR, and send more PRs to fix more places later, with tests.

@SparkQA

SparkQA commented Jan 21, 2020

Test build #117171 has finished for PR 27267 at commit fea9160.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 21, 2020

Test build #117173 has finished for PR 27267 at commit 577904f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ajithme
Contributor Author

ajithme commented Jan 22, 2020

@dongjoon-hyun Updated with the fix.

@SparkQA

SparkQA commented Jan 22, 2020

Test build #117252 has finished for PR 27267 at commit f1cac4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Merged to master.

@dongjoon-hyun
Member

Could you make a backporting PR against branch-2.4, @ajithme ?

@ajithme
Contributor Author

ajithme commented Jan 23, 2020

Thanks @dongjoon-hyun @cloud-fan @hvanhovell

@dongjoon-hyun Sure, I will make a backport PR for this against branch-2.4.

.checkValue(thres => thres > 0 && thres <= 128, "The threshold must be in (0,128].")
.createWithDefault(128)

val SUBQUERY_MAX_THREAD_THRESHOLD =
Contributor


This is actually a different change, right?

Contributor


It should also be a static configuration since we can only change it at startup.

Contributor Author

@ajithme ajithme Jan 23, 2020


  1. This is just exposing a configuration to make the change testable. Do you want me to raise a separate PR just to keep this configuration change separate?

  2. This is part of StaticSQLConf, which is defined at startup; is there any other mechanism to define a static conf?

Contributor


If it's a static conf, then why isn't your unit test failing? Moreover, if it's static, then setting it in your test probably does not have any effect: because we use the same JVM/SparkContext to run most tests, the chances are pretty high that it has been set before.

Contributor Author


I see. As per the documentation in

* Static SQL configuration is a cross-session, immutable Spark configuration. External users can

it should not be modified. I followed the same way BroadcastExchangeExec creates its executionContext. My initial guess is that executionContext is not created until the first subquery, hence it works for the UT. I will investigate further and get back with the analysis.

}

// set local configuration and assert
val confValue1 = "e"
Contributor


IMO it is better to use something unique here to avoid a fluke. How about using UUID.randomUUID().toString()?

Contributor Author


I am neutral about both approaches; I just wanted a fixed input value to get a predictable output value from the test case. I can update this and raise a follow-up if you insist.

@hvanhovell
Contributor

@ajithme can you address my comments in a follow-up?

@gatorsmile
Member

@ajithme Can you submit a PR to address the comments?

@gatorsmile
Member

@xuanyuanking Could you submit a PR to address the comments?

@gatorsmile
Member

https://github.com/apache/spark/pull/27267/files#r370089158 is the major comment we need to address.

.checkValue(thres => thres > 0 && thres <= 128, "The threshold must be in (0,128].")
.createWithDefault(128)

val SUBQUERY_MAX_THREAD_THRESHOLD =
Member


I think we should submit a separate PR for this change.

Member


I think we could keep this in StaticSQLConf.

  • I tested locally that the static config takes effect; it can only be set at startup.
  • The UT can pass because the config is used in the lazy val SubqueryExec.relationFuture on the executor side, so withSQLConf in the UT can set the config before the executor starts.

cc @hvanhovell
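A hypothetical simplification (not the exact Spark source) of why that lazy initialization matters: the thread pool, and therefore the read of the static conf, only happens the first time a subquery is actually evaluated, which in the UT is after withSQLConf has set the value on the shared session.

import scala.concurrent.ExecutionContext
import org.apache.spark.util.ThreadUtils
import org.apache.spark.sql.internal.{SQLConf, StaticSQLConf}

object SubqueryExecSketch {
  // The pool is only created on first use, so the conf value is read lazily as well.
  lazy val executionContext = ExecutionContext.fromExecutorService(
    ThreadUtils.newDaemonCachedThreadPool("subquery",
      SQLConf.get.getConf(StaticSQLConf.SUBQUERY_MAX_THREAD_THRESHOLD)))
}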

@ajithme
Contributor Author

ajithme commented Feb 10, 2020

@ajithme Can you submit a PR to address the comments?

@gatorsmile @xuanyuanking Sure, I will submit a follow-up PR shortly.

@xuanyuanking
Member

@ajithme Sorry, I have just submitted a follow-up PR; could you help review it? Thanks.

hvanhovell pushed a commit that referenced this pull request Feb 10, 2020
….withThreadLocalCaptured

### What changes were proposed in this pull request?
Follow up for #27267, reset the status changed in SQLExecution.withThreadLocalCaptured.

### Why are the changes needed?
For code safety.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #27516 from xuanyuanking/SPARK-30556-follow.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: herman <[email protected]>
hvanhovell pushed a commit that referenced this pull request Feb 10, 2020
….withThreadLocalCaptured

### What changes were proposed in this pull request?
Follow up for #27267, reset the status changed in SQLExecution.withThreadLocalCaptured.

### Why are the changes needed?
For code safety.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #27516 from xuanyuanking/SPARK-30556-follow.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: herman <[email protected]>
(cherry picked from commit a6b91d2)
Signed-off-by: herman <[email protected]>
val localProps = Utils.cloneProperties(sc.getLocalProperties)
Future {
  SparkSession.setActiveSession(activeSession)
  sc.setLocalProperties(localProps)
Contributor


@hvanhovell two questions:

  • Shouldn't we clone localProps here, in the sense that what if a concurrent thread modifies them?
  • Does the order of setting localProps and activeSession matter?

Contributor


localProps is already a clone: https://github.com/apache/spark/pull/27267/files#diff-ab49028253e599e6e74cc4f4dcb2e3a8R178

And I think the order doesn't matter.

xuanyuanking added a commit to xuanyuanking/spark that referenced this pull request Feb 19, 2020
….withThreadLocalCaptured

Follow up for apache#27267, reset the status changed in SQLExecution.withThreadLocalCaptured.

For code safety.

No.

Existing UT.

Closes apache#27516 from xuanyuanking/SPARK-30556-follow.

(cherry picked from commit a6b91d2)
cloud-fan pushed a commit that referenced this pull request Feb 19, 2020
…tion withThreadLocalCaptured

### What changes were proposed in this pull request?
Follow up for #27267, reset the status changed in SQLExecution.withThreadLocalCaptured.

### Why are the changes needed?
For code safety.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

(cherry picked from commit a6b91d2)

Closes #27633 from xuanyuanking/SPARK-30556-backport.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
….withThreadLocalCaptured

### What changes were proposed in this pull request?
Follow up for apache#27267, reset the status changed in SQLExecution.withThreadLocalCaptured.

### Why are the changes needed?
For code safety.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes apache#27516 from xuanyuanking/SPARK-30556-follow.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: herman <[email protected]>