-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-22938][SQL][followup] Assert that SQLConf.get is accessed only on the driver #21190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I believe this is also the root cause of the branch 2.3 test failures like https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/lastCompletedBuild/testReport/org.apache.spark.sql.execution.datasources.parquet/ParquetQuerySuite/SPARK_15678__not_use_cache_on_append/ This PR might be too large to backport, we should look into how branch master avoids the test failures and backport it 2.3. cc @vanzin |
|
Test build #89963 has finished for PR 21190 at commit
|
|
Retest this please. |
|
Test build #89971 has finished for PR 21190 at commit
|
|
@cloud-fan . Thank you for investigating this. Could you fix jsonExpressions.scala Line 520, too? val forceNullableSchema = SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA) |
|
Retest this please. |
|
Test build #89985 has finished for PR 21190 at commit
|
|
Test build #90058 has finished for PR 21190 at commit
|
|
Test build #90089 has finished for PR 21190 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call DataType.equalsIgnoreNullability here for better show the difference between the call of DataType.equalsIgnoreCaseAndNullability below?
… executor side ## What changes were proposed in this pull request? This PR is extracted from #21190 , to make it easier to backport. `InMemoryTableScanExec#createAndDecompressColumn` is executed inside `rdd.map`, we can't access `conf.offHeapColumnVectorEnabled` there. ## How was this patch tested? it's tested in #21190 Author: Wenchen Fan <[email protected]> Closes #21223 from cloud-fan/minor1. (cherry picked from commit 991b526) Signed-off-by: Wenchen Fan <[email protected]>
… executor side ## What changes were proposed in this pull request? This PR is extracted from apache#21190 , to make it easier to backport. `InMemoryTableScanExec#createAndDecompressColumn` is executed inside `rdd.map`, we can't access `conf.offHeapColumnVectorEnabled` there. ## How was this patch tested? it's tested in apache#21190 Author: Wenchen Fan <[email protected]> Closes apache#21223 from cloud-fan/minor1.
…r side ## What changes were proposed in this pull request? This PR is extracted from #21190 , to make it easier to backport. `JsonToStructs` can be serialized to executors and evaluate, we should not call `SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA)` in the body. ## How was this patch tested? tested in #21190 Author: Wenchen Fan <[email protected]> Closes #21226 from cloud-fan/minor4. (cherry picked from commit 96a5001) Signed-off-by: Wenchen Fan <[email protected]>
…r side ## What changes were proposed in this pull request? This PR is extracted from #21190 , to make it easier to backport. `JsonToStructs` can be serialized to executors and evaluate, we should not call `SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA)` in the body. ## How was this patch tested? tested in #21190 Author: Wenchen Fan <[email protected]> Closes #21226 from cloud-fan/minor4.
## What changes were proposed in this pull request? This PR is extracted from #21190 , to make it easier to backport. `WindowExec#createBoundOrdering` is called on executor side, so we can't use `conf.sessionLocalTimezone` there. ## How was this patch tested? tested in #21190 Author: Wenchen Fan <[email protected]> Closes #21225 from cloud-fan/minor3. (cherry picked from commit e646ae6) Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request? This PR is extracted from apache#21190 , to make it easier to backport. `WindowExec#createBoundOrdering` is called on executor side, so we can't use `conf.sessionLocalTimezone` there. ## How was this patch tested? tested in apache#21190 Author: Wenchen Fan <[email protected]> Closes apache#21225 from cloud-fan/minor3.
…or side ## What changes were proposed in this pull request? This PR is extracted from #21190 , to make it easier to backport. `ParquetFilters` is used in the file scan function, which is executed in executor side, so we can't call `conf.parquetFilterPushDownDate` there. ## How was this patch tested? it's tested in #21190 Author: Wenchen Fan <[email protected]> Closes #21224 from cloud-fan/minor2.
|
retest this please |
|
Test build #90181 has finished for PR 21190 at commit
|
|
Test build #90182 has finished for PR 21190 at commit
|
|
Test build #90317 has finished for PR 21190 at commit
|
|
retest this please |
|
LGTM |
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm too
|
Test build #90464 has finished for PR 21190 at commit
|
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
|
Merged to master. |
## What changes were proposed in this pull request? Previously in apache#20136 we decided to forbid tasks to access `SQLConf`, because it doesn't work and always give you the default conf value. In apache#21190 we fixed the check and all the places that violate it. Currently the pattern of accessing configs at the executor side is: read the configs at the driver side, then access the variables holding the config values in the RDD closure, so that they will be serialized to the executor side. Something like ``` val someConf = conf.getXXX child.execute().mapPartitions { if (someConf == ...) ... ... } ``` However, this pattern is hard to apply if the config needs to be propagated via a long call stack. An example is `DataType.sameType`, and see how many changes were made in apache#21190 . When it comes to code generation, it's even worse. I tried it locally and we need to change a ton of files to propagate configs to code generators. This PR proposes to allow tasks to access `SQLConf`. The idea is, we can save all the SQL configs to job properties when an SQL execution is triggered. At executor side we rebuild the `SQLConf` from job properties. ## How was this patch tested? a new test suite Author: Wenchen Fan <[email protected]> Closes apache#21299 from cloud-fan/config.
re-submit #21299 which broke build. A few new commits are added to fix the SQLConf problem in `JsonSchemaInference.infer`, and prevent us to access `SQLConf` in DAGScheduler event loop thread. ## What changes were proposed in this pull request? Previously in #20136 we decided to forbid tasks to access `SQLConf`, because it doesn't work and always give you the default conf value. In #21190 we fixed the check and all the places that violate it. Currently the pattern of accessing configs at the executor side is: read the configs at the driver side, then access the variables holding the config values in the RDD closure, so that they will be serialized to the executor side. Something like ``` val someConf = conf.getXXX child.execute().mapPartitions { if (someConf == ...) ... ... } ``` However, this pattern is hard to apply if the config needs to be propagated via a long call stack. An example is `DataType.sameType`, and see how many changes were made in #21190 . When it comes to code generation, it's even worse. I tried it locally and we need to change a ton of files to propagate configs to code generators. This PR proposes to allow tasks to access `SQLConf`. The idea is, we can save all the SQL configs to job properties when an SQL execution is triggered. At executor side we rebuild the `SQLConf` from job properties. ## How was this patch tested? a new test suite Author: Wenchen Fan <[email protected]> Closes #21376 from cloud-fan/config.
… on the driver ## What changes were proposed in this pull request? This is a followup of apache#20136 . apache#20136 didn't really work because in the test, we are using local backend, which shares the driver side `SparkEnv`, so `SparkEnv.get.executorId == SparkContext.DRIVER_IDENTIFIER` doesn't work. This PR changes the check to `TaskContext.get != null`, and move the check to `SQLConf.get`, and fix all the places that violate this check: * `InMemoryTableScanExec#createAndDecompressColumn` is executed inside `rdd.map`, we can't access `conf.offHeapColumnVectorEnabled` there. apache#21223 merged * `DataType#sameType` may be executed in executor side, for things like json schema inference, so we can't call `conf.caseSensitiveAnalysis` there. This contributes to most of the code changes, as we need to add `caseSensitive` parameter to a lot of methods. * `ParquetFilters` is used in the file scan function, which is executed in executor side, so we can't can't call `conf.parquetFilterPushDownDate` there. apache#21224 merged * `WindowExec#createBoundOrdering` is called on executor side, so we can't use `conf.sessionLocalTimezone` there. apache#21225 merged * `JsonToStructs` can be serialized to executors and evaluate, we should not call `SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA)` in the body. apache#21226 merged ## How was this patch tested? existing test Author: Wenchen Fan <[email protected]> Closes apache#21190 from cloud-fan/minor.
What changes were proposed in this pull request?
This is a followup of #20136 . #20136 didn't really work because in the test, we are using local backend, which shares the driver side
SparkEnv, soSparkEnv.get.executorId == SparkContext.DRIVER_IDENTIFIERdoesn't work.This PR changes the check to
TaskContext.get != null, and move the check toSQLConf.get, and fix all the places that violate this check:InMemoryTableScanExec#createAndDecompressColumnis executed insiderdd.map, we can't accessconf.offHeapColumnVectorEnabledthere. [SPARK-24166][SQL] InMemoryTableScanExec should not access SQLConf at executor side #21223 mergedDataType#sameTypemay be executed in executor side, for things like json schema inference, so we can't callconf.caseSensitiveAnalysisthere. This contributes to most of the code changes, as we need to addcaseSensitiveparameter to a lot of methods.ParquetFiltersis used in the file scan function, which is executed in executor side, so we can't can't callconf.parquetFilterPushDownDatethere. [SPARK-24167][SQL] ParquetFilters should not access SQLConf at executor side #21224 mergedWindowExec#createBoundOrderingis called on executor side, so we can't useconf.sessionLocalTimezonethere. [SPARK-24168][SQL] WindowExec should not access SQLConf at executor side #21225 mergedJsonToStructscan be serialized to executors and evaluate, we should not callSQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA)in the body. [SPARK-24169][SQL] JsonToStructs should not access SQLConf at executor side #21226 mergedHow was this patch tested?
existing test