Conversation

@rxin
Contributor

@rxin rxin commented Jul 30, 2014

Previously we broadcast the JobConf for HadoopRDD separately because it is large. Now that we always broadcast RDDs and task closures, it should no longer be necessary to broadcast the JobConf on its own.
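For context, the original motivation for broadcasting can be illustrated with a rough, self-contained sketch. This uses plain Java serialization and a hypothetical LargeConf standing in for JobConf (this is not Spark's or Hadoop's actual code): an object captured in every task closure is serialized once per task, while a broadcast variable is serialized once.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for a large JobConf; NOT Spark's or Hadoop's actual class.
class LargeConf extends Serializable {
  val entries: Array[String] = Array.fill(10000)("some.hadoop.key=value")
}

object BroadcastVsClosure {
  // Size of an object under plain Java serialization.
  def serializedSize(obj: Serializable): Long = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(obj)
    out.close()
    buf.size().toLong
  }

  def main(args: Array[String]): Unit = {
    val conf = new LargeConf
    val numTasks = 200
    // Captured in every task closure: serialized and shipped once per task.
    val perClosureTotal = serializedSize(conf) * numTasks
    // Broadcast variable: serialized once, fetched by executors as needed.
    val broadcastTotal = serializedSize(conf)
    assert(perClosureTotal > broadcastTotal)
    println(s"closure shipping: $perClosureTotal bytes total, broadcast: $broadcastTotal bytes")
  }
}
```

Once the closure itself is broadcast, the large conf rides along with it, which is what makes the separate JobConf broadcast redundant.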

@rxin
Contributor Author

rxin commented Jul 30, 2014

Jenkins, what are you doing ...

@rxin
Contributor Author

rxin commented Jul 30, 2014

Jenkins, test this please.

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1648. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17425/consoleFull

@rxin
Contributor Author

rxin commented Jul 30, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1648. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17454/consoleFull

Contributor

Comment looks out of place now.

@vanzin
Contributor

vanzin commented Jul 30, 2014

LGTM, but I'm not entirely familiar with all this code yet.

@rxin
Contributor Author

rxin commented Jul 30, 2014

This is unfortunately not working because of something with HiveConf ...

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1648. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17490/consoleFull

@SparkQA

SparkQA commented Jul 30, 2014

QA tests have started for PR 1648. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17508/consoleFull

@rxin
Contributor Author

rxin commented Jul 31, 2014

Jenkins, retest this please.

Contributor

Is this guaranteed to return a new copy of the conf for every partition or something? Because otherwise I'm not sure I see why we can safely remove the lock.

Contributor Author

It is because RDD objects are not reused at all. Each task gets its own deserialized copy of the HadoopRDD and the conf.
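That point can be demonstrated with a small, self-contained sketch using plain Java serialization and a hypothetical FakeConf in place of the real JobConf (not Spark's actual code): every round trip through the serializer yields a distinct object, so tasks mutating their own copies cannot race, which is why the lock becomes unnecessary.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a mutable JobConf; NOT the real Hadoop class.
class FakeConf(var timeout: String) extends Serializable

object PerTaskCopies {
  // Serialize and immediately deserialize, as the executor does per task.
  def roundTrip[T <: Serializable](obj: T): T = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(obj)
    out.close()
    new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray)).readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val original = new FakeConf("600000")
    val task1 = roundTrip(original)   // task 1's private copy
    val task2 = roundTrip(original)   // task 2's private copy
    assert(!(task1 eq task2))         // distinct objects
    task1.timeout = "1"               // mutating one copy...
    assert(task2.timeout == "600000") // ...does not affect the other
    println("each task gets an independent copy")
  }
}
```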

Contributor

It might be worth adding a comment here, then, saying that the createJobConf() method really does create a new job conf because [xyz], even though it looks like it's just accessing the broadcast value.

@JoshRosen
Contributor

It looks like two unrelated commits from #1675 got pulled into this PR. Do you mind rebasing to exclude them?

@SparkQA

SparkQA commented Aug 5, 2014

QA tests have started for PR 1648. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17894/consoleFull

@SparkQA

SparkQA commented Aug 5, 2014

QA tests have started for PR 1648. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17898/consoleFull

@SparkQA

SparkQA commented Aug 5, 2014

QA results for PR 1648:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17898/consoleFull

@SparkQA

SparkQA commented Aug 5, 2014

QA tests have started for PR 1648. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17906/consoleFull

@JoshRosen
Contributor

It looks like deserializing the JobConf could be pretty expensive. Here's part of the deserialization stack trace:

Executor task launch worker-0 [RUNNABLE]
java.util.zip.ZipFile.getEntry(long, byte[], boolean)
java.util.zip.ZipFile.getEntry(String)
java.util.jar.JarFile.getEntry(String)
java.util.jar.JarFile.getJarEntry(String)
sun.misc.URLClassPath$JarLoader.getResource(String, boolean)
sun.misc.URLClassPath$JarLoader.findResource(String, boolean)
sun.misc.URLClassPath.findResource(String, boolean)
java.net.URLClassLoader$2.run()<2 recursive calls>
java.security.AccessController.doPrivileged(PrivilegedAction, AccessControlContext)
java.net.URLClassLoader.findResource(String)
java.lang.ClassLoader.getResource(String)<2 recursive calls>
java.net.URLClassLoader.getResourceAsStream(String)
javax.xml.parsers.SecuritySupport$4.run()
java.security.AccessController.doPrivileged(PrivilegedAction)
javax.xml.parsers.SecuritySupport.getResourceAsStream(ClassLoader, String)
javax.xml.parsers.FactoryFinder.findJarServiceProvider(String)
javax.xml.parsers.FactoryFinder.find(String, String)
javax.xml.parsers.DocumentBuilderFactory.newInstance()
org.apache.hadoop.conf.Configuration.loadResource(Properties, Object, boolean)
org.apache.hadoop.conf.Configuration.loadResources(Properties, ArrayList, boolean)
org.apache.hadoop.conf.Configuration.getProps()
org.apache.hadoop.conf.Configuration.get(String, String)
org.apache.hadoop.hive.conf.HiveConf.initialize(Class)
org.apache.hadoop.hive.conf.HiveConf.<init>()
sun.reflect.GeneratedConstructorAccessor142.newInstance(Object[])
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Object[])
java.lang.reflect.Constructor.newInstance(Object[])
org.apache.hadoop.util.ReflectionUtils.newInstance(Class, Configuration)
org.apache.hadoop.io.WritableFactories.newInstance(Class, Configuration)
org.apache.hadoop.io.ObjectWritable.readObject(DataInput, ObjectWritable, Configuration)
org.apache.hadoop.io.ObjectWritable.readFields(DataInput)
org.apache.spark.SerializableWritable.readObject(ObjectInputStream)
[...]
org.apache.spark.serializer.JavaDeserializationStream.readObject(ClassTag)
org.apache.spark.serializer.JavaSerializerInstance.deserialize(ByteBuffer, ClassLoader, ClassTag)
org.apache.spark.scheduler.ResultTask.runTask(TaskContext)
org.apache.spark.scheduler.Task.run(long)
org.apache.spark.executor.Executor$TaskRunner.run()
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
java.util.concurrent.ThreadPoolExecutor$Worker.run()
java.lang.Thread.run()

This seems to involve fairly expensive searches of the classpath.

@SparkQA

SparkQA commented Aug 5, 2014

QA results for PR 1648:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17906/consoleFull

Contributor

Just a note: we need to remove this setting before merging it.

Contributor Author

We should keep it at 2 to speed up tests ...

Contributor

This actually speeds up the tests quite a bit, although it might be masking some of the expensive serialization/deserialization issues.

@rxin
Contributor Author

rxin commented Aug 5, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 5, 2014

QA tests have started for PR 1648. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17938/consoleFull

@SparkQA

SparkQA commented Aug 5, 2014

QA results for PR 1648:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17938/consoleFull

@JoshRosen
Contributor

It looks like this test failure was due to a BindException causing an unrelated test to fail.

@pwendell
Contributor

pwendell commented Aug 5, 2014

@rxin @JoshRosen yeah the test failure was unrelated. We need to fix one of the streaming tests.

@JoshRosen
Contributor

I've spent a bit of time looking into some of the performance issues that we've seen in this patch.

After this patch, it looks like some of the mapPartitions stages in the correlationoptimizer.* tests are taking ~7 seconds instead of a few tens of milliseconds (these were called from the SparkSQL Exchange operator). @marmbrus, maybe we should chat about this, since you're more familiar with that code.

Operating under the theory that deserializing Hadoop Configuration / JobConfs was expensive, I tried a few alternative serialization approaches, including using WritableUtils to manually serialize the configuration and writing my own code to read that back into a configuration; this didn't seem to make a huge difference.

I'm going to put this fix on hold for now until I have more time to figure out why we're seeing this slowdown.

@ash211 Do you have a way to reliably reproduce the thread-safety issues that you reported in SPARK-2546? That would be helpful in order to know whether I've actually fixed the problem with clone().

@marmbrus
Contributor

marmbrus commented Aug 5, 2014

@rxin opened #1784 to try and combat the performance issues in tests.

@marmbrus
Contributor

We have merged the patch to reduce the number of shuffle partitions when testing. Time to revisit or close this PR?
