This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

@varunkatta
Member

Fixes #120

What changes were proposed in this pull request?

The scheduler inside LoggingPodStatusWatcher does not appear to be shut down after job-finished events are received, so this change shuts the scheduler down when those events arrive.
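A minimal sketch of the idea, written in Python as a hypothetical analogue of the Scala watcher (the class and method names are illustrative, not the actual spark-on-k8s API): the watcher keeps polling until a terminal phase is observed, then stops its own scheduler so the client process can exit.

```python
import threading
import time

# Phases after which there is nothing left to watch.
TERMINAL_PHASES = {"Succeeded", "Failed"}

class LoggingPodStatusWatcher:
    """Illustrative analogue: poll the pod phase until a terminal phase arrives,
    then stop the polling loop (the "scheduler") instead of running forever."""

    def __init__(self, interval=1.0):
        self.interval = interval
        self.done = threading.Event()

    def _poll(self, get_phase):
        phase = get_phase()
        print(f"Application status (phase: {phase})")
        if phase in TERMINAL_PHASES:
            # Shut down the scheduler on a finished event so the client exits.
            self.done.set()

    def watch(self, get_phase):
        while not self.done.is_set():
            self._poll(get_phase)
            if not self.done.is_set():
                time.sleep(self.interval)
```

Without the `done.set()` call on a terminal phase, the loop (like the real scheduler) would keep the process alive indefinitely.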

How was this patch tested?

Tested manually that the job launcher client exits when the Spark job succeeds.

2017-02-15 13:18:19 INFO  LoggingPodStatusWatcher:54 - Application status for spark-pi-1487193446018 (phase: Running)
2017-02-15 13:18:20 INFO  LoggingPodStatusWatcher:54 - Application status for spark-pi-1487193446018 (phase: Running)
2017-02-15 13:18:21 INFO  LoggingPodStatusWatcher:54 - Application status for spark-pi-1487193446018 (phase: Succeeded)
2017-02-15 13:18:21 INFO  LoggingPodStatusWatcher:54 - Phase changed, new state:
	 pod name: spark-pi-1487193446018
	 namespace: default
	 labels: spark-app-id -> spark-pi-1487193446018, spark-app-name -> spark-pi, spark-driver -> spark-pi-1487193446018
	 pod uid: 28b36eeb-f3c4-11e6-80cf-02f2c310e88c
	 creation time: 2017-02-15T21:17:30Z
	 service account name: default
	 volumes: spark-submission-secret-volume, default-token-7eejh
	 node name: kube-n2.pepperdata.com
	 start time: 2017-02-15T21:17:30Z
	 container images: docker:5000/spark-driver:varun_2_14
	 phase: Succeeded
2017-02-15 13:18:21 INFO  Client:54 - Application spark-pi-1487193446018 finished.
2017-02-15 13:18:21 INFO  WatchConnectionManager:296 - WebSocket close received. code: 1000, reason:
2017-02-15 13:18:21 WARN  WatchConnectionManager:298 - Ignoring onClose for already closed/closing websocket

@ash211

ash211 commented Feb 16, 2017

Just tested this and confirmed it indeed fixes the regression. However, the regression in spark-on-k8s looks like a symptom of a regression in the upstream kubernetes-client library, where the onClose method of watches is no longer called.

I think we should still merge this though since it makes the LoggingPodStatusWatcher more resilient to issues like this one.

@ash211 ash211 merged commit 84f147b into apache-spark-on-k8s:k8s-support-alternate-incremental Feb 16, 2017
@ash211

ash211 commented Feb 16, 2017

Thanks @varunkatta for the contribution!

ash211 pushed a commit that referenced this pull request Mar 8, 2017
foxish pushed a commit that referenced this pull request Jul 24, 2017
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 25, 2019
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 25, 2019 (…ng-distributedsuite: Ignore hanging DistributedSuite)
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
ifilonenko pushed a commit to bloomberg/apache-spark-on-k8s that referenced this pull request Oct 21, 2019
### What changes were proposed in this pull request?

Updated kubernetes client.

### Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need the fix fabric8io/kubernetes-client#1768, which was released in version 4.6 of the client. The root cause of the problem is better explained in apache#25785

### Does this PR introduce any user-facing change?

Nope, it should be transparent to users

### How was this patch tested?

This patch was tested manually using a simple pyspark job

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both the Python and JVM processes exit automatically after the main runs. This is the case for Spark versions up to 2.4. On version 2.4.3, the JVM process hangs because a non-daemon thread is still running:

```
"OkHttp WebSocket https://10.96.0.1/..." apache-spark-on-k8s#121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." apache-spark-on-k8s#117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]
```
This is caused by a bug in the `kubernetes-client` library, which is fixed in the version we are upgrading to.
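The underlying mechanism can be reproduced in isolation (a Python sketch of the same process-lifetime rule; nothing here touches kubernetes-client): like a JVM, the interpreter exits only once all non-daemon threads have finished, so a leftover non-daemon thread keeps the process alive after main returns.

```python
import subprocess
import sys

# A tiny program whose main returns immediately while a background thread
# sleeps; whether the process exits depends on the thread's daemon flag.
SNIPPET = """
import threading, time
t = threading.Thread(target=time.sleep, args=(60,), daemon={daemon})
t.start()
# main returns here; the interpreter waits for all non-daemon threads
"""

def exits_within(daemon, timeout):
    """Run the snippet in a child interpreter; True if it exits in time."""
    proc = subprocess.Popen([sys.executable, "-c", SNIPPET.format(daemon=daemon)])
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()
        return False
```

With `daemon=True` the child exits at once; with `daemon=False` it lingers for the full sleep, which is exactly the hang the leftover OkHttp WebSocket threads cause in the JVM.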

When the mentioned job is run with this patch applied, the behaviour from Spark <= 2.4.3 is restored and both processes terminate successfully.

Closes apache#26093 from igorcalabria/k8s-client-update.

Authored-by: igor.calabria <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>