This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

@mccheah commented Feb 15, 2017

Closes #112.

@mccheah (Author) commented Feb 15, 2017

@foxish this fixes the issue on my machine.

```diff
 var clientConfigBuilder = new ConfigBuilder()
   .withApiVersion("v1")
-  .withMasterUrl(kubernetesMaster)
+  .withMasterUrl(s"$urlScheme://$kubernetesHost:$kubernetesPort")
```
@foxish (Member) commented on this diff, Feb 15, 2017

This should always be https. Even if the user uses the insecure endpoint to access the apiserver from outside the cluster, KUBERNETES_SERVICE_PORT should point to the secure endpoint.
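
For illustration, a minimal sketch of what this suggests (variable names are hypothetical, not the PR's actual code): read the in-pod service env vars and hardcode the https scheme.

```scala
// Hypothetical sketch: build the apiserver URL from the pod's service env
// vars. Per the comment above, KUBERNETES_SERVICE_PORT points at the secure
// endpoint, so the scheme can always be https.
val kubernetesHost = sys.env("KUBERNETES_SERVICE_HOST")
val kubernetesPort = sys.env("KUBERNETES_SERVICE_PORT")
val masterUrl = s"https://$kubernetesHost:$kubernetesPort"
```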

@foxish (Member) commented Feb 15, 2017

I'm a bit wary of this change, because the most robust approach here is to use the DNS name and let it be resolved.
There are other issues around using the pod env vars, like kubernetes/kubernetes#40973

The comments in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L412-L416 read:

	// Note that there is a race between Kubelet seeing the pod and kubelet seeing the service.
	// To avoid this users can: (1) wait between starting a service and starting; or (2) detect
	// missing service env var and exit and be restarted; or (3) use DNS instead of env vars
	// and keep trying to resolve the DNS name of the service (recommended).

Have you tried resolving the DNS name in a loop within the driver pod? Does it always fail in your environment?
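
For reference, a minimal Scala sketch of option (3) from the quoted kubelet comment - retrying DNS resolution in a loop - with a hypothetical helper name and parameters:

```scala
import java.net.{InetAddress, UnknownHostException}

// Hypothetical sketch: keep retrying DNS resolution of the service name,
// sleeping between attempts, until it resolves or the attempts run out.
def resolveWithRetries(host: String, attempts: Int, delayMs: Long): InetAddress = {
  var lastError: Throwable = new UnknownHostException(host)
  for (_ <- 1 to attempts) {
    try {
      return InetAddress.getByName(host)
    } catch {
      case e: UnknownHostException =>
        lastError = e
        Thread.sleep(delayMs)
    }
  }
  throw lastError
}

// e.g. resolveWithRetries("kubernetes.default.svc", attempts = 5, delayMs = 5000)
```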

@mccheah (Author) commented Feb 15, 2017

The issue has reproduced every time I've run the integration tests. I haven't tried resolving in a loop yet.

@foxish (Member) commented Feb 15, 2017

If it's specific to Minikube 0.16.0, I think we should report it and switch to the newer version once it's fixed.

@mccheah (Author) commented Feb 15, 2017

I saw it in Minikube 0.15.0 as well. I also just tried resolving in a loop - basically retrying the failing kubernetesClient.pods().... call 5 times at 5-second intervals in KubernetesClusterSchedulerBackend. The resolution failed every time.

@foxish (Member) commented Feb 15, 2017

I wasn't testing across namespaces. The issue was that kubernetes.default.svc is no longer automatically looked up from a different namespace. I added a fix in #118. PTAL.

I should've caught this earlier. Apologies.
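
Editor's note for context (an assumption about standard cluster DNS behavior, not taken from the thread): an unqualified service name is expanded against the pod's own namespace, while a namespace-qualified name resolves from anywhere in the cluster, e.g.:

```scala
import java.net.InetAddress

// Hypothetical example, from a pod running in namespace "spark":
// InetAddress.getByName("kubernetes")             // expands to kubernetes.spark.svc... and fails
// InetAddress.getByName("kubernetes.default.svc") // expands to kubernetes.default.svc.cluster.local and succeeds
val apiserver = InetAddress.getByName("kubernetes.default.svc")
println(apiserver.getHostAddress)
```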

@mccheah (Author) commented Feb 15, 2017

Closing in favor of #118, which I've confirmed fixes the issue.

@mccheah mccheah closed this Feb 15, 2017
@mccheah mccheah deleted the use-service-ip branch February 15, 2017 21:25
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 25, 2019
ifilonenko pushed a commit to bloomberg/apache-spark-on-k8s that referenced this pull request Oct 21, 2019
### What changes were proposed in this pull request?

Updated kubernetes client.

### Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need fabric8io/kubernetes-client#1768, which was released in version 4.6 of the client. The root cause of the problem is explained in more detail in apache#25785

### Does this PR introduce any user-facing change?

Nope, it should be transparent to users.

### How was this patch tested?

This patch was tested manually using a simple pyspark job:

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both the Python and JVM processes exit automatically after the main runs. This is the case for spark versions <= 2.4. On version 2.4.3, the JVM process hangs because there's a non-daemon thread running:

```
"OkHttp WebSocket https://10.96.0.1/..." apache-spark-on-k8s#121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." apache-spark-on-k8s#117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]
```
This is caused by a bug in the `kubernetes-client` library, which is fixed in the version we are upgrading to.

When the mentioned job is run with this patch applied, the behaviour from spark <= 2.4 is restored and both processes terminate successfully.
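
As a side illustration (editor's sketch, not part of the patch), this is the JVM rule at play: any live non-daemon thread keeps the process alive after main returns, which is why the leaked OkHttp WebSocket threads above hang the driver.

```scala
// Hypothetical demo: a non-daemon thread keeps the JVM alive after main
// returns; marking it daemon would let the process exit immediately.
object NonDaemonHang {
  def main(args: Array[String]): Unit = {
    val t = new Thread(() => Thread.sleep(60000))
    // t.setDaemon(true) // uncomment this and the JVM exits right after main
    t.start()
    println("main is done, but the JVM stays up until t finishes")
  }
}
```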

Closes apache#26093 from igorcalabria/k8s-client-update.

Authored-by: igor.calabria <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>