
Conversation

@kimoonkim (Member)

@hustcat @duyanghao @foxish

What changes were proposed in this pull request?

Fixes #405 by flag-guarding potentially heavy DNS lookup RPCs.

How was this patch tested?

Modified existing unit tests to cover both the flag-on and flag-off cases.

@erikerlandson (Member)

@kimoonkim is there an informative failure that can be detected, so that DNS lookup happens automatically if using the short name fails? Sort of like a fallback. Or is that failure scenario not possible to identify automatically?

@kimoonkim (Member Author)

@erikerlandson Good question. It's hard to automatically detect whether a DNS lookup is specifically needed.

But we can automatically detect whether an HDFS namenode is configured, use that as an indication that HDFS support is needed, and flag-guard the entire HDFS support. Curious what others think about this approach.

@erikerlandson (Member)

I think it's a good idea to be able to enable/disable HDFS in general, especially if it has non-trivial performance implications.

@kimoonkim (Member Author)

@erikerlandson It just occurred to me that we can take an adaptive approach. If DNS returns hostnames that are the same as the cluster node names for a few sampled hosts, then it is very likely that the cluster node names are already full hostnames, and we don't need to issue more DNS requests. Thoughts?
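
A minimal sketch of that adaptive probe (the helper name and sample size are illustrative, not code from this PR), assuming the node names come from the cluster's node list:

```scala
import java.net.InetAddress
import scala.util.Try

// Hypothetical helper: resolve a few node names and check whether the
// canonical hostnames match the names we already have. If they do, the
// cluster is reporting full hostnames and per-node DNS lookups can be skipped.
def nodeNamesLookLikeFullHostnames(nodeNames: Seq[String], sampleSize: Int = 3): Boolean =
  nodeNames.take(sampleSize).forall { name =>
    Try(InetAddress.getByName(name).getCanonicalHostName).toOption.contains(name)
  }
```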

@kimoonkim (Member Author)

Yes. Unlike Spark on YARN, HDFS is an optional component for Spark on K8s. I don't think we want to force everyone to pay the performance cost.

Should I incorporate the broader flag in this PR or can it be a separate PR?

@erikerlandson (Member)

it seems fine to me to incorporate the flag here; that ought to be a modest amount of code

@kimoonkim (Member Author) commented Aug 3, 2017

@erikerlandson @hustcat said in the issue that they use a remote HDFS. (comment) So it may not work to automatically disable HDFS locality support based on whether an HDFS namenode is configured or not. (I'll make this change anyway because it would benefit others who don't use HDFS.)

I think we still want to disable the DNS lookup by default. The bottom line is that the DNS lookup turned out to be too expensive to do inside this critical section of the code. We'll have to redesign this part while disabling it for now. Perhaps we can issue DNS lookups asynchronously and pick up the results later, once they are available.
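
A rough sketch of that asynchronous idea (all names here are illustrative, not code from this PR), assuming a small dedicated thread pool and a simple cache:

```scala
import java.net.InetAddress
import java.util.concurrent.{ConcurrentHashMap, Executors}
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Try

// Hypothetical cache: short node name -> resolved full hostname.
val resolved = new ConcurrentHashMap[String, String]()

// Keep lookups off the critical path on a dedicated pool.
implicit val dnsPool: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

// Fire-and-forget lookup; the result shows up in the cache when ready.
def resolveAsync(nodeName: String): Future[Unit] =
  Future {
    Try(InetAddress.getByName(nodeName).getCanonicalHostName)
      .foreach(full => resolved.put(nodeName, full))
  }

// The hot path never blocks: use the cached result if present,
// otherwise fall back to the short name for now.
def fullHostname(nodeName: String): String =
  Option(resolved.get(nodeName)).getOrElse(nodeName)
```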

@erikerlandson (Member)

@kimoonkim that sounds like a good way to proceed

@kimoonkim (Member Author)

@erikerlandson I was thinking about checking whether the HDFS namenode is configured in the SparkContext.hadoopConfiguration, i.e. checking whether the config key fs.defaultFS appears as hdfs://HOST:PORT. For non-HDFS users, the value would be one for the local file system, such as file:///, or one for S3, such as s3a:///, etc.
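
As a rough sketch (the helper name is made up for illustration), that check might have looked like:

```scala
import java.net.URI
import org.apache.spark.SparkContext

// Hypothetical helper: true when the default filesystem is HDFS.
def defaultFsIsHdfs(sc: SparkContext): Boolean = {
  val defaultFs = sc.hadoopConfiguration.get("fs.defaultFS", "file:///")
  URI.create(defaultFs).getScheme == "hdfs"
}
```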

But on further thought, it seems this approach is error-prone. As the name defaultFS indicates, it covers only the default filesystem. Job users can override it by putting a namenode address directly in any file path for a job's input/output data, e.g. hdfs://HOST:PORT/DIR1/FILE1. In that case, the auto-detection logic would disable the HDFS support code when it is actually needed.

Sorry about misleading the discussion. I think disabling the DNS lookup code is the best fix we can do for now, while redesigning the approach in the background.

@ash211 left a comment


This seems like the right move until we can find a way to infer the flag.

Thanks folks for debugging!

@foxish mentioned this pull request Aug 3, 2017
@kimoonkim (Member Author)

rerun integration tests please

" The driver can slow down and fail to respond to executor heartbeats in time." +
" If enabling this flag, make sure your DNS server has enough capacity" +
" for the workload.")
.internal()
Member

what is the purpose of the internal()?

@kimoonkim (Member Author)

This makes the flag non-public. From the code comment:

@param isPublic if this configuration is public to the user. If it's false, this
configuration is only used internally and we should not expose it to users.

I think we want to redesign the DNS lookup code path so that this flag is not needed in the future. Keeping the flag internal allows us to remove it later without worrying whether some users depend on it.
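
For context, here is a minimal sketch of how an internal config entry is declared with Spark's ConfigBuilder; the key name and doc text below are illustrative (reusing the fragment quoted above), not necessarily the exact ones in this PR:

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical key name; the flag defaults to off, per the discussion above.
private[spark] val CLUSTER_NODE_DNS_LOOKUP_ENABLED =
  ConfigBuilder("spark.kubernetes.hdfs.clusterNodeNameDNSLookup.enabled")
    .doc("Whether to look up DNS for the full hostnames of cluster nodes." +
      " The driver can slow down and fail to respond to executor heartbeats in time." +
      " If enabling this flag, make sure your DNS server has enough capacity" +
      " for the workload.")
    .internal() // non-public: can be removed later without breaking users
    .booleanConf
    .createWithDefault(false)
```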

@kimoonkim (Member Author)

Hmm, the Jenkins integration test is unhappy. I think I am seeing the same error that #389 reported about resource staging server timeout.

@foxish mentioned this pull request Aug 3, 2017
@kimoonkim (Member Author)

rerun integration tests please

@liyinan926 (Member)

rerun integration tests please

@erikerlandson (Member)

integration testing is probably going to have trouble passing until we fix #389, which should be soon

@liyinan926 (Member)

All tests passed. Can someone merge this to unblock #416? Thanks!

@erikerlandson merged commit e3cfaa4 into apache-spark-on-k8s:branch-2.2-kubernetes Aug 8, 2017
erikerlandson pushed a commit that referenced this pull request Aug 8, 2017
…DFS locality support (#412) (#421)

* Flag-guard expensive DNS lookup of cluster node full names, part of HDFS locality support

* Clean up a bit

* Improve unit tests
@kimoonkim deleted the flag-guard-hdfs-dns-lookup branch August 9, 2017 22:38
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
…DFS locality support (apache-spark-on-k8s#412)

* Flag-guard expensive DNS lookup of cluster node full names, part of HDFS locality support

* Clean up a bit

* Improve unit tests