Skip to content
This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

@mccheah
Copy link

@mccheah mccheah commented Aug 23, 2017

More efforts to reduce the complexity of the KubernetesClusterSchedulerBackend. The scheduler backend should not be concerned about anything other than the coordination of the executor lifecycle.

Requires #452.

@mccheah mccheah force-pushed the separate-external-shuffle-management branch from 4b6dee4 to 6e060ed Compare August 23, 2017 00:36
* limitations under the License.
*/

package org.apache.spark.network.shuffle.kubernetes;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equivalent to what used to be KubernetesExternalShuffleClient, with just a rename and an interface extraction. I suppose git doesn't present that in the most intuitive way here.

@mccheah
Copy link
Author

mccheah commented Aug 23, 2017

@foxish for review. I hope that organizing the code this way will make the scheduler backend and the shuffle related logic easier to test.

@foxish foxish self-assigned this Aug 23, 2017
@mccheah
Copy link
Author

mccheah commented Aug 23, 2017

rerun integration tests please

@erikerlandson
Copy link
Member

This seems like a good refactoring

@mccheah
Copy link
Author

mccheah commented Aug 24, 2017

Think the failure is legitimate, investigating.

Copy link
Member

@foxish foxish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks.
Left a couple of comments.

Some(new File(Config.KUBERNETES_SERVICE_ACCOUNT_CA_CRT_PATH)))

val kubernetesShuffleManager = if (Utils.isDynamicAllocationEnabled(sparkConf)) {
val kubernetesExternalShuffleClient = new KubernetesExternalShuffleClientImpl(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this ensure that each executor gets a different instance of the shuffle client?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KubernetesClusterManager is on the driver.

SparkEnv.get.securityManager.getIOEncryptionKey())
context.reply(reply)
}
val shuffleSpecifiProperties = shuffleManager.get
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shuffleSpecifiProperties -> shuffleSpecificProperties

@mccheah
Copy link
Author

mccheah commented Aug 29, 2017

Test failure is because apparently the volume names are bad, now that we're appending index numbers to them. We probably just don't need the index numbers.

@mccheah mccheah force-pushed the separate-executor-pod-construction branch from dbb113d to dc6b186 Compare August 29, 2017 18:38
@mccheah mccheah force-pushed the separate-external-shuffle-management branch from 9f9b432 to 65496d2 Compare August 29, 2017 18:47
.getLogger(KubernetesExternalShuffleClientImpl.class);

/**
* Creates an Kubernetes external shuffle client that wraps the {@link ExternalShuffleClient}.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Creates an Kubernetes .. -> Creates a Kubernetes ..

@Override
public void close() {
super.close();
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we omit this method and let the close method inherited from ExternalShuffleClient suffice?

throw new SparkException(s"Unable to find shuffle pod on node $nodeName"))
}
// Inform the shuffle pod about this application so it can watch.
shuffleClient.registerDriverWithShuffleService(shufflePodIp, externalShufflePort)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this called by the driver every time an executor registers with the driver?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - this was the case before the change as well, I think. @foxish

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be telling each shuffle pod about each drivers that it needs to keep track of (in order for the cleanup to work even in cases where the application doesn't terminate gracefully). There may be duplicate calls if we're registering the same driver with the same shuffle service but it's not been common.

override def getExecutorShuffleDirVolumesWithMounts(): Seq[(Volume, VolumeMount)] = {
shuffleDirs.zipWithIndex.map {
case (shuffleDir, shuffleDirIndex) =>
val volumeName = s"$shuffleDirIndex-${FilenameUtils.getBaseName(shuffleDir)}"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this fixes a bug from before, where the old implementation would have attempted to mount multiple volumes with the same name?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

@ash211
Copy link

ash211 commented Sep 6, 2017

@mccheah ready for re-aim into branch-2.2-kubernetes

More efforts to reduce the complexity of the
KubernetesClusterSchedulerBackend. The scheduler backend should not be
concerned about anything other than the coordination of the executor lifecycle.
@mccheah mccheah force-pushed the separate-external-shuffle-management branch from a55b28e to e7a460e Compare September 7, 2017 00:43
@mccheah mccheah changed the base branch from separate-executor-pod-construction to branch-2.2-kubernetes September 7, 2017 00:43
@mccheah
Copy link
Author

mccheah commented Sep 7, 2017

Rebase is done.

@ash211
Copy link

ash211 commented Sep 7, 2017

@mccheah please resolve conflicts

@mccheah
Copy link
Author

mccheah commented Sep 7, 2017

Conflicts are fixed, is this good to merge?

@foxish
Copy link
Member

foxish commented Sep 7, 2017

The new structure is much nicer. Thanks! Ok to merge from my end.

@ash211 ash211 merged commit 6d7d798 into branch-2.2-kubernetes Sep 7, 2017
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 26, 2019
…spark-on-k8s#454)

* Extract more of the shuffle management to a different class.

More efforts to reduce the complexity of the
KubernetesClusterSchedulerBackend. The scheduler backend should not be
concerned about anything other than the coordination of the executor lifecycle.

* Fix scalastyle

* Add override annotation

* Fix Java style

* Remove unused imports.

* Move volume index to the beginning to satisfy index

* Address PR comments.
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 26, 2019
Need to figure out how to merge the checkstyles because otherwise we will end up with a lot of mess
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
…spark-on-k8s#454)

* Extract more of the shuffle management to a different class.

More efforts to reduce the complexity of the
KubernetesClusterSchedulerBackend. The scheduler backend should not be
concerned about anything other than the coordination of the executor lifecycle.

* Fix scalastyle

* Add override annotation

* Fix Java style

* Remove unused imports.

* Move volume index to the beginning to satisfy index

* Address PR comments.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants