[SPARK-2460] Optimize SparkContext.hadoopFile api #1385
Conversation
…y in SparkContext.hadoopFile
Can one of the admins verify this patch?
Thanks for submitting the pull request. Did I read this correctly? The master branch deadlocks? If yes, we should file a JIRA for that also and make that more clear. If it is simply about optimizing an API to reduce code, it is a much lower priority issue. If it deadlocks, this needs to be a BLOCKER.
Is this related to the other conf-related concurrency issue that was fixed recently? #1273
@rxin and @aarondav, yeah, the master branch deadlocks. It seems the locks from #1273 and HADOOP-10456 together lead to the problem. When running a Hive self-join query such as hql("SELECT t1.a, t1.b, t1.c FROM table_A t1 JOIN table_A t2 ON (t1.a = t2.a)"), the program gets stuck. I think cleaning up the SparkContext.hadoopFile API is a better way to fix it; that way we no longer need the conf.synchronized lock in HadoopRDD.
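For reference, a minimal reproduction sketch of the kind of job that hits the hang, assuming a Hive table named table_A already exists and using the HiveContext.hql API of that era; the setup and names here are illustrative, not part of the patch.

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext("local[*]", "SPARK-2460-repro") // assumed local-mode setup
val hiveContext = new HiveContext(sc)
import hiveContext._
// Self-join over the same table: both sides scan the same underlying Hadoop conf.
hql("SELECT t1.a, t1.b, t1.c FROM table_A t1 JOIN table_A t2 ON (t1.a = t2.a)").collect()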
Do we still need this check? I think this may have been the only place we put the JobConf inside the cache.
Yeah, there is no need to cache the JobConf if it is broadcast.
Yes, I agree with you. broadcastedConf is already cached by the BlockManager as part of the Broadcast.
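A minimal sketch of the broadcast approach discussed here, using hypothetical helper names (not the actual HadoopRDD code): the driver broadcasts the fully-built JobConf, and each task reads it back from the broadcast (backed by the BlockManager) instead of rebuilding it under a lock.

import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{SerializableWritable, SparkContext}
import org.apache.spark.broadcast.Broadcast

object JobConfBroadcastSketch {
  // Driver side: build the JobConf once and broadcast it.
  def broadcastJobConf(sc: SparkContext, jobConf: JobConf): Broadcast[SerializableWritable[JobConf]] =
    sc.broadcast(new SerializableWritable(jobConf))

  // Executor side: each task reads the already-built JobConf from the broadcast,
  // so no new JobConf(conf) -- and therefore no conf.synchronized block -- is
  // needed per task.
  def getJobConf(broadcasted: Broadcast[SerializableWritable[JobConf]]): JobConf =
    broadcasted.value.value
}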
1. Use SparkContext.hadoopRDD() instead of instantiating HadoopRDD directly in SparkContext.hadoopFile. SparkContext.hadoopRDD() adds the necessary security credentials to the JobConf before broadcasting it (see the sketch after the reproduction steps below).
2. Broadcast the JobConf in HadoopRDD, not the Configuration. This resolves the deadlock: currently HadoopRDD broadcasts a Configuration, and each task rebuilds the JobConf from it in the compute method, so the following lock in HadoopRDD
conf.synchronized {
val newJobConf = new JobConf(conf)
initLocalJobConfFuncOpt.map(f => f(newJobConf))
HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
newJobConf
}
conflicts with the HADOOP-10456 fix in hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java:
synchronized(Configuration.class) {
REGISTRY.put(this, null);
}
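For illustration only (this is not the actual Spark or Hadoop code), a minimal sketch of the lock-ordering inversion that appears to cause the hang: one thread holds the conf monitor and then needs the class-level monitor, while another thread acquires them in the opposite order.

object DeadlockSketch {
  private val classLock = new Object // stands in for Configuration.class
  private val confLock = new Object  // stands in for a shared Configuration instance

  // Spark side: conf.synchronized { new JobConf(conf) }; the Configuration copy
  // constructor then needs the Configuration.class lock to update REGISTRY.
  def sparkSide(): Unit = confLock.synchronized {
    Thread.sleep(100)
    classLock.synchronized { /* REGISTRY.put(this, null) */ }
  }

  // Hadoop side: a static synchronized method holds Configuration.class and then
  // calls a synchronized instance method on the same conf.
  def hadoopSide(): Unit = classLock.synchronized {
    Thread.sleep(100)
    confLock.synchronized { /* conf.reloadConfiguration() */ }
  }

  def main(args: Array[String]): Unit = {
    new Thread(new Runnable { def run(): Unit = sparkSide() }).start()
    new Thread(new Runnable { def run(): Unit = hadoopSide() }).start()
  }
}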
The hang can be reproduced as follows:
Hadoop version: 2.4.1
Spark: master branch
hql("SELECT t1.a, t1.b, t1.c FROM table_A t1 JOIN table_A t2 ON (t1.a = t2.a)")