
Commit 89883f2

Update mcad kuberay example
1 parent f05c6a8 commit 89883f2

File tree

2 files changed: +114 -99 lines changed

doc/usage/examples/kuberay/config/aw-raycluster.yaml

Lines changed: 77 additions & 92 deletions
@@ -1,92 +1,46 @@
 apiVersion: mcad.ibm.com/v1beta1
 kind: AppWrapper
 metadata:
-  name: raycluster-autoscaler
+  name: raycluster-complete
   namespace: default
 spec:
   resources:
-    Items: []
+    # Items: []
     GenericItems:
-    - replicas: 1
-      custompodresources:
-      - replicas: 2
-        requests:
-          cpu: 10
-          memory: 512Mi
-        limits:
-          cpu: 10
-          memory: 1G
-      generictemplate:
-        # This config demonstrates KubeRay's Ray autoscaler integration.
+    - generictemplate:
         # The resource requests and limits in this config are too small for production!
-        # For an example with more realistic resource configuration, see
+        # For examples with more realistic resource configuration, see
+        # ray-cluster.complete.large.yaml and
         # ray-cluster.autoscaler.large.yaml.
         apiVersion: ray.io/v1alpha1
        kind: RayCluster
         metadata:
           labels:
             controller-tools.k8s.io: "1.0"
           # A unique identifier for the head node and workers of this cluster.
-          name: raycluster-autoscaler
+          name: raycluster-complete
         spec:
-          # The version of Ray you are using. Make sure all Ray containers are running this version of Ray.
-          rayVersion: '2.0.0'
-          # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod.
-          # Ray autoscaler integration is supported only for Ray versions >= 1.11.0
-          # Ray autoscaler integration is Beta with KubeRay >= 0.3.0 and Ray >= 2.0.0.
-          enableInTreeAutoscaling: true
-          # autoscalerOptions is an OPTIONAL field specifying configuration overrides for the Ray autoscaler.
-          # The example configuration shown below below represents the DEFAULT values.
-          # (You may delete autoscalerOptions if the defaults are suitable.)
-          autoscalerOptions:
-            # upscalingMode is "Default" or "Aggressive."
-            # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
-            # Default: Upscaling is not rate-limited.
-            # Aggressive: An alias for Default; upscaling is not rate-limited.
-            upscalingMode: Default
-            # idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
-            idleTimeoutSeconds: 60
-            # image optionally overrides the autoscaler's container image.
-            # If instance.spec.rayVersion is at least "2.0.0", the autoscaler will default to the same image as
-            # the ray container. For older Ray versions, the autoscaler will default to using the Ray 2.0.0 image.
-            ## image: "my-repo/my-custom-autoscaler-image:tag"
-            # imagePullPolicy optionally overrides the autoscaler container's image pull policy.
-            imagePullPolicy: Always
-            # resources specifies optional resource request and limit overrides for the autoscaler container.
-            # For large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
-            resources:
-              limits:
-                cpu: "500m"
-                memory: "512Mi"
-              requests:
-                cpu: "500m"
-                memory: "512Mi"
-          ######################headGroupSpec#################################
-          # head group template and specs, (perhaps 'group' is not needed in the name)
+          rayVersion: '2.5.0'
+          # Ray head pod configuration
           headGroupSpec:
-            # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
+            # Kubernetes Service Type. This is an optional field, and the default value is ClusterIP.
+            # Refer to https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types.
             serviceType: ClusterIP
-            # logical group name, for this called head-group, also can be functional
-            # pod type head or worker
-            # rayNodeType: head # Not needed since it is under the headgroup
-            # the following params are used to complete the ray start: ray start --head --block ...
+            # The `rayStartParams` are used to configure the `ray start` command.
+            # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
+            # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
             rayStartParams:
-              # Flag "no-monitor" will be automatically set when autoscaling is enabled.
               dashboard-host: '0.0.0.0'
-              block: 'true'
-              # num-cpus: '1' # can be auto-completed from the limits
-              # Use `resources` to optionally specify custom resource annotations for the Ray node.
-              # The value of `resources` is a string-integer mapping.
-              # Currently, `resources` must be provided in the specific format demonstrated below:
-              # resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'
-            #pod template
+            # pod template
             template:
+              metadata:
+                # Custom labels. NOTE: To avoid conflicts with the KubeRay operator, do not define custom labels starting with `raycluster`.
+                # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
+                labels: {}
               spec:
                 containers:
-                # The Ray head pod
                 - name: ray-head
-                  image: rayproject/ray:2.0.0
-                  imagePullPolicy: Always
+                  image: rayproject/ray:2.5.0
                   ports:
                   - containerPort: 6379
                     name: gcs
@@ -98,59 +52,90 @@ spec:
                     preStop:
                       exec:
                         command: ["/bin/sh","-c","ray stop"]
+                  volumeMounts:
+                  - mountPath: /tmp/ray
+                    name: ray-logs
+                  # The resource requests and limits in this config are too small for production!
+                  # For an example with more realistic resource configuration, see
+                  # ray-cluster.autoscaler.large.yaml.
+                  # It is better to use a few large Ray pods than many small ones.
+                  # For production, it is ideal to size each Ray pod to take up the
+                  # entire Kubernetes node on which it is scheduled.
                   resources:
                     limits:
                       cpu: "1"
-                      memory: "1G"
+                      memory: "2G"
                     requests:
+                      # For production use-cases, we recommend specifying integer CPU requests and limits.
+                      # We also recommend setting requests equal to limits for both CPU and memory.
+                      # For this example, we use a 500m CPU request to accommodate resource-constrained local
+                      # Kubernetes testing environments such as KinD and minikube.
                       cpu: "500m"
-                      memory: "512Mi"
+                      memory: "2G"
+                volumes:
+                - name: ray-logs
+                  emptyDir: {}
           workerGroupSpecs:
           # the pod replicas in this group typed worker
           - replicas: 1
             minReplicas: 1
-            maxReplicas: 300
+            maxReplicas: 10
             # logical group name, for this called small-group, also can be functional
             groupName: small-group
-            # if worker pods need to be added, we can simply increment the replicas
-            # if worker pods need to be removed, we decrement the replicas, and populate the podsToDelete list
-            # the operator will remove pods from the list until the number of replicas is satisfied
-            # when a pod is confirmed to be deleted, its name will be removed from the list below
+            # If worker pods need to be added, we can increment the replicas.
+            # If worker pods need to be removed, we decrement the replicas, and populate the workersToDelete list.
+            # The operator will remove pods from the list until the desired number of replicas is satisfied.
+            # If the difference between the current replica count and the desired replicas is greater than the
+            # number of entries in workersToDelete, random worker pods will be deleted.
            #scaleStrategy:
             # workersToDelete:
             # - raycluster-complete-worker-small-group-bdtwh
             # - raycluster-complete-worker-small-group-hv457
             # - raycluster-complete-worker-small-group-k8tj7
-            # the following params are used to complete the ray start: ray start --block ...
-            rayStartParams:
-              block: 'true'
+            # The `rayStartParams` are used to configure the `ray start` command.
+            # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
+            # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
+            rayStartParams: {}
             #pod template
             template:
-              metadata:
-                labels:
-                  key: value
-                # annotations for pod
-                annotations:
-                  key: value
               spec:
-                initContainers:
-                # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
-                - name: init-myservice
-                  image: busybox:1.28
-                  command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
                 containers:
-                - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc'
-                  image: rayproject/ray:2.0.0
-                  # environment variables to set in the container.Optional.
-                  # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
+                - name: ray-worker
+                  image: rayproject/ray:2.5.0
                   lifecycle:
                     preStop:
                       exec:
                         command: ["/bin/sh","-c","ray stop"]
+                  # use volumeMounts. Optional.
+                  # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
+                  volumeMounts:
+                  - mountPath: /tmp/ray
+                    name: ray-logs
+                  # The resource requests and limits in this config are too small for production!
+                  # For an example with more realistic resource configuration, see
+                  # ray-cluster.autoscaler.large.yaml.
+                  # It is better to use a few large Ray pods than many small ones.
+                  # For production, it is ideal to size each Ray pod to take up the
+                  # entire Kubernetes node on which it is scheduled.
                   resources:
                     limits:
                       cpu: "1"
-                      memory: "512Mi"
+                      memory: "1G"
+                    # For production use-cases, we recommend specifying integer CPU requests and limits.
+                    # We also recommend setting requests equal to limits for both CPU and memory.
+                    # For this example, we use a 500m CPU request to accommodate resource-constrained local
+                    # Kubernetes testing environments such as KinD and minikube.
                     requests:
+                      # For production use-cases, we recommend specifying integer CPU requests and limits.
+                      # We also recommend setting requests equal to limits for both CPU and memory.
+                      # For this example, we use a 500m CPU request to accommodate resource-constrained local
+                      # Kubernetes testing environments such as KinD and minikube.
                       cpu: "500m"
-                      memory: "256Mi"
+                      # For production use-cases, we recommend allocating at least 8GB memory for each Ray container.
+                      memory: "1G"
+                # use volumes
+                # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
+                volumes:
+                - name: ray-logs
+                  emptyDir: {}
+
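
Once MCAD dispatches the AppWrapper above, the RayCluster it generates from `generictemplate` can be inspected directly. A minimal sketch, assuming the example is created unchanged in the `default` namespace:

```bash
# The AppWrapper and the RayCluster it wraps share the name defined in the example
kubectl get appwrapper raycluster-complete -n default -o yaml
kubectl get raycluster raycluster-complete -n default -o yaml
```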
doc/usage/examples/kuberay/kuberay-mcad.md

Lines changed: 37 additions & 7 deletions
@@ -4,13 +4,43 @@ This integration will help in queuing on [kuberay](https://github.com/ray-projec
 
 #### Prerequisites
 
-- kubernetes or Openshift cluster
-- Install MCAD using instructions present under `deployment` directory
-- Make sure MCAD has clusterrole to create ray resources, please patch using configuration file present in `config` directory with name `xqueuejob-controller.yaml`
+- Kubernetes (see [KinD](https://kind.sigs.k8s.io/)) or OpenShift cluster (see [OpenShift Local](https://developers.redhat.com/products/openshift-local/overview))
+- Kubernetes client tools such as [kubectl](https://kubernetes.io/docs/tasks/tools/) or the [OpenShift CLI](https://docs.openshift.com/container-platform/4.13/cli_reference/openshift_cli/getting-started-cli.html)
+- [Helm](https://helm.sh/docs/intro/install/)
+- Install a stable MCAD release on your Kubernetes cluster using Helm:
+  ```bash
+  git clone https://github.com/project-codeflare/multi-cluster-app-dispatcher
+  cd multi-cluster-app-dispatcher
+  # Install from local charts
+  helm install mcad --set image.repository=quay.io/project-codeflare/mcad-controller --set image.tag=stable deployment/mcad-controller
+  ```
+  See [deployment.md](doc/deploy/deployment.md) for more options.
+- Make sure MCAD has the cluster role to create Ray resources; patch it using [xqueuejob-controller.yaml](doc/usage/examples/kuberay/config/xqueuejob-controller.yaml). For example:
+  ```bash
+  kubectl apply -f doc/usage/examples/kuberay/config/xqueuejob-controller.yaml
+  ```
 
 #### Steps
 
-- Install kuberay operator from [link](https://docs.ray.io/en/latest/cluster/kubernetes/getting-started.html#deploying-the-kuberay-operator)
-- Submit ray cluster to MCAD as appwrapper using the config file `aw-raycluster.yaml` present in the `config` directory using command `kubectl create -f aw-raycluster.yaml`
-- Check the status of the appwrapper using command `kubectl describe appwrapper <your-appwrapper-name>`
-- Check running pods using command `kubectl get pods -n <your-name-space>`
+- Install the KubeRay operator using these [instructions](https://github.com/ray-project/kuberay#quick-start). For example, install KubeRay v0.6.0 from the remote Helm repo:
+  ```bash
+  helm repo add kuberay https://ray-project.github.io/kuberay-helm/
+  helm install kuberay-operator kuberay/kuberay-operator --version 0.6.0
+  ```
+- Submit the RayCluster custom resource to MCAD as an AppWrapper using the [aw-raycluster.yaml](doc/usage/examples/kuberay/config/aw-raycluster.yaml) example:
+  ```bash
+  kubectl create -f doc/usage/examples/kuberay/config/aw-raycluster.yaml
+  ```
+- Check the status of the AppWrapper custom resource using the command:
+  ```bash
+  kubectl describe appwrapper raycluster-complete -n default
+  ```
+- Check that the RayCluster status is ready using the command:
+  ```bash
+  kubectl get raycluster -n default
+  ```
+  Expect:
+  ```
+  NAME                  DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
+  raycluster-complete   1                 1                   ready    6m45s
+  ```
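
A pod-level check (the previous revision of this doc used a plain `kubectl get pods`) can complement the RayCluster status check above. A short sketch against the example defaults; the `ray.io/cluster` pod label is KubeRay's convention and an assumption here, not something this doc states:

```bash
# List the head and worker pods created for the example cluster
kubectl get pods -n default -l ray.io/cluster=raycluster-complete

# If the pods stay Pending, inspect the AppWrapper's dispatch status and events
kubectl describe appwrapper raycluster-complete -n default
```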
