Conversation

@scwf (Contributor) commented Jul 12, 2014

1. Use SparkContext.hadoopRDD() instead of instantiating HadoopRDD directly in SparkContext.hadoopFile. SparkContext.hadoopRDD() adds the necessary security credentials to the JobConf before broadcasting it.

2. Broadcast the JobConf in HadoopRDD, not the Configuration. This resolves the deadlock issue: currently HadoopRDD broadcasts a Configuration, and each task (in the compute method) builds its JobConf from it, so the lock taken in

```scala
conf.synchronized {
  val newJobConf = new JobConf(conf)
  initLocalJobConfFuncOpt.map(f => f(newJobConf))
  HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
  newJobConf
}
```

conflicts with the HADOOP-10456 fix in Hadoop (hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java):

```java
synchronized (Configuration.class) {
  REGISTRY.put(this, null);
}
```
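A minimal sketch of the proposed direction (not the exact patch; the object and variable names are illustrative): build the JobConf once on the driver, wrap it in Spark's SerializableWritable, and broadcast it, so executors unwrap the broadcast copy instead of rebuilding a JobConf under conf.synchronized:

```scala
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{SerializableWritable, SparkConf, SparkContext}

// Sketch: broadcast a fully built JobConf (JobConf is a Writable, so it can
// ride inside SerializableWritable). Tasks then read the broadcast value
// directly, so compute() never needs the Configuration lock that collides
// with HADOOP-10456's synchronized(Configuration.class) block.
object BroadcastJobConfSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))
    val jobConf = new JobConf(sc.hadoopConfiguration)  // built once, on the driver
    val broadcastedJobConf = sc.broadcast(new SerializableWritable(jobConf))
    val sizes = sc.parallelize(1 to 2, 2)
      .map(_ => broadcastedJobConf.value.value.size())  // unwrap: no new JobConf(conf)
      .collect()
    println(sizes.mkString(","))
    sc.stop()
  }
}
```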

We can reproduce the bug like this (Hadoop 2.4.1, Spark master branch):

```scala
hql("SELECT t1.a, t1.b, t1.c FROM table_A t1 JOIN table_A t2 ON (t1.a = t2.a)")
```

@scwf scwf changed the title from "Optimize SparkContext.hadoopFile api" to "[SPARK-2460] Optimize SparkContext.hadoopFile api" on Jul 12, 2014
@AmplabJenkins commented

Can one of the admins verify this patch?

@rxin (Contributor) commented Jul 13, 2014

Thanks for submitting the pull request.

Did I read this correctly? The master branch deadlocks? If yes, we should file a JIRA for that also and make that more clear. If it is simply about optimizing an API to reduce code, it is a much lower priority issue. If it deadlocks, this needs to be a BLOCKER.

@aarondav (Contributor) commented
Is this related to the other conf-related concurrency issue that was fixed recently? #1273

@scwf (Contributor, Author) commented Jul 14, 2014

@rxin and @aarondav, yeah, the master branch deadlocks; it seems the locks from #1273 and HADOOP-10456 together cause the problem. When running a Hive self-join query, hql("SELECT t1.a, t1.b, t1.c FROM table_A t1 JOIN table_A t2 ON (t1.a = t2.a)"), the program gets stuck.

I think cleaning up the SparkContext.hadoopFile API is a better way to fix it; this way, we no longer need the lock introduced in #1273.
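For reference, a hedged usage sketch of the cleaned-up path (the input path and partition count are assumptions): reads go through sc.hadoopRDD with a JobConf assembled once on the driver, rather than SparkContext.hadoopFile wiring up a HadoopRDD around a bare Configuration:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: sc.hadoopRDD takes a fully built JobConf, so credentials can be
// attached on the driver and the whole JobConf broadcast once.
object HadoopRDDUsage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("usage").setMaster("local[2]"))
    val jobConf = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(jobConf, "/tmp/input")  // assumed input path
    val lines = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], 2)
    println(lines.count())
    sc.stop()
  }
}
```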

Contributor (inline review comment)

Do we still need this check? I think this may have been the only place we put the JobConf inside the cache.

Contributor Author (inline review reply)

Yeah, there is no need to cache the JobConf if it is broadcast.

Contributor (inline review reply)

Yes, I agree with you. broadcastedConf is cached by the BlockManager inside the Broadcast.

@scwf scwf closed this Aug 10, 2014
@scwf scwf deleted the kasi branch August 22, 2014 15:22
kazuyukitanimura added a commit to kazuyukitanimura/spark that referenced this pull request Aug 10, 2022
…dd/drop partition command' (apache#1385)

### What changes were proposed in this pull request?
https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979
It should be `partDir2` instead of `partDir1`. It looks like a copy-paste bug.

### Why are the changes needed?
Due to this test bug, the drop command was dropping the wrong underlying file (`partDir1`) in the test.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added an extra check on the underlying file location.

Closes apache#36075 from kazuyukitanimura/SPARK-38786.

Authored-by: Kazuyuki Tanimura <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
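A hedged sketch of the kind of check described (partDir1/partDir2 are the test's variable names; the directory layout and assertion shape here are assumptions, not the suite's actual code):

```scala
import java.io.File

// Sketch: after dropping the second partition, the assertion must inspect
// partDir2; checking partDir1 (the copy-paste bug) would verify the wrong
// partition's files.
object DropPartitionCheckSketch {
  def main(args: Array[String]): Unit = {
    val partDir1 = new File("/tmp/warehouse/t/ds=2008-04-08")  // assumed layout
    val partDir2 = new File("/tmp/warehouse/t/ds=2008-04-09")  // assumed layout
    assert(!partDir2.exists(), s"$partDir2 should be gone after DROP PARTITION")
    println(s"kept partition still present: ${partDir1.exists()}")
  }
}
```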