[SPARK-2460] Optimize SparkContext.hadoopFile api #1385
Conversation
…y in SparkContext.hadoopFile
Can one of the admins verify this patch?
Thanks for submitting the pull request. Did I read this correctly? The master branch deadlocks? If yes, we should file a JIRA for that also and make that more clear. If it is simply about optimizing an API to reduce code, it is a much lower priority issue. If it deadlocks, this needs to be a BLOCKER.
Is this related to the other conf-related concurrency issue that was fixed recently? #1273
@rxin and @aarondav, yeah, the master branch deadlocks. It seems the locks from #1273 and HADOOP-10456 together lead to the problem. When running a Hive self-join query such as hql("SELECT t1.a, t1.b, t1.c FROM table_A t1 JOIN table_A t2 ON (t1.a = t2.a)"), the program gets stuck. I think cleaning up the SparkContext.hadoopFile API is a better way to fix it; that way we no longer need the conf.synchronized lock in HadoopRDD.
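For reference, a minimal reproduction sketch of the kind of job that hits the hang, assuming a Hive table named table_A already exists and using the HiveContext.hql API of that era; the setup and names here are illustrative, not part of the patch.

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext("local[*]", "SPARK-2460-repro") // assumed local-mode setup
val hiveContext = new HiveContext(sc)
import hiveContext._
// Self-join over the same table: both sides scan the same underlying Hadoop conf.
hql("SELECT t1.a, t1.b, t1.c FROM table_A t1 JOIN table_A t2 ON (t1.a = t2.a)").collect()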
Do we still need this check? I think this may have been the only place we put the JobConf inside the cache.
Yeah, there is no need to cache the JobConf if it is broadcast.
Yes, I agree with you. broadcastedConf is already cached by the BlockManager as part of the Broadcast.
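A minimal sketch of the broadcast approach discussed here, using hypothetical helper names (not the actual HadoopRDD code): the driver broadcasts the fully-built JobConf, and each task reads it back from the broadcast (backed by the BlockManager) instead of rebuilding it under a lock.

import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{SerializableWritable, SparkContext}
import org.apache.spark.broadcast.Broadcast

object JobConfBroadcastSketch {
  // Driver side: build the JobConf once and broadcast it.
  def broadcastJobConf(sc: SparkContext, jobConf: JobConf): Broadcast[SerializableWritable[JobConf]] =
    sc.broadcast(new SerializableWritable(jobConf))

  // Executor side: each task reads the already-built JobConf from the broadcast,
  // so no new JobConf(conf) -- and therefore no conf.synchronized block -- is
  // needed per task.
  def getJobConf(broadcasted: Broadcast[SerializableWritable[JobConf]]): JobConf =
    broadcasted.value.value
}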
1. Use SparkContext.hadoopRDD() instead of instantiating HadoopRDD directly in SparkContext.hadoopFile. SparkContext.hadoopRDD() adds the necessary security credentials to the JobConf before broadcasting it (see the sketch after the reproduction steps below).
2. Broadcast the JobConf in HadoopRDD, not the Configuration. This resolves the deadlock: currently HadoopRDD broadcasts a Configuration, and each task rebuilds the JobConf from it in the compute method, so the following lock in HadoopRDD
conf.synchronized {
val newJobConf = new JobConf(conf)
initLocalJobConfFuncOpt.map(f => f(newJobConf))
HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
newJobConf
}
conflicts with the HADOOP-10456 fix in hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java:
synchronized(Configuration.class) {
REGISTRY.put(this, null);
}
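For illustration only (this is not the actual Spark or Hadoop code), a minimal sketch of the lock-ordering inversion that appears to cause the hang: one thread holds the conf monitor and then needs the class-level monitor, while another thread acquires them in the opposite order.

object DeadlockSketch {
  private val classLock = new Object // stands in for Configuration.class
  private val confLock = new Object  // stands in for a shared Configuration instance

  // Spark side: conf.synchronized { new JobConf(conf) }; the Configuration copy
  // constructor then needs the Configuration.class lock to update REGISTRY.
  def sparkSide(): Unit = confLock.synchronized {
    Thread.sleep(100)
    classLock.synchronized { /* REGISTRY.put(this, null) */ }
  }

  // Hadoop side: a static synchronized method holds Configuration.class and then
  // calls a synchronized instance method on the same conf.
  def hadoopSide(): Unit = classLock.synchronized {
    Thread.sleep(100)
    confLock.synchronized { /* conf.reloadConfiguration() */ }
  }

  def main(args: Array[String]): Unit = {
    new Thread(new Runnable { def run(): Unit = sparkSide() }).start()
    new Thread(new Runnable { def run(): Unit = hadoopSide() }).start()
  }
}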
The hang can be reproduced as follows:
Hadoop version: 2.4.1
Spark: master branch
hql("SELECT t1.a, t1.b, t1.c FROM table_A t1 JOIN table_A t2 ON (t1.a = t2.a)")