
Commit 6b295a4

Update mcad kuberay example
1 parent f05c6a8 commit 6b295a4

File tree

2 files changed (+125 −99 lines)
doc/usage/examples/kuberay/config/aw-raycluster.yaml

Lines changed: 76 additions & 92 deletions
@@ -1,92 +1,45 @@
 apiVersion: mcad.ibm.com/v1beta1
 kind: AppWrapper
 metadata:
-  name: raycluster-autoscaler
+  name: raycluster-complete
   namespace: default
 spec:
   resources:
-    Items: []
     GenericItems:
-    - replicas: 1
-      custompodresources:
-      - replicas: 2
-        requests:
-          cpu: 10
-          memory: 512Mi
-        limits:
-          cpu: 10
-          memory: 1G
-      generictemplate:
-        # This config demonstrates KubeRay's Ray autoscaler integration.
+    - generictemplate:
         # The resource requests and limits in this config are too small for production!
-        # For an example with more realistic resource configuration, see
+        # For examples with more realistic resource configuration, see
+        # ray-cluster.complete.large.yaml and
         # ray-cluster.autoscaler.large.yaml.
         apiVersion: ray.io/v1alpha1
         kind: RayCluster
         metadata:
           labels:
             controller-tools.k8s.io: "1.0"
           # A unique identifier for the head node and workers of this cluster.
-          name: raycluster-autoscaler
+          name: raycluster-complete
         spec:
-          # The version of Ray you are using. Make sure all Ray containers are running this version of Ray.
-          rayVersion: '2.0.0'
-          # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod.
-          # Ray autoscaler integration is supported only for Ray versions >= 1.11.0
-          # Ray autoscaler integration is Beta with KubeRay >= 0.3.0 and Ray >= 2.0.0.
-          enableInTreeAutoscaling: true
-          # autoscalerOptions is an OPTIONAL field specifying configuration overrides for the Ray autoscaler.
-          # The example configuration shown below below represents the DEFAULT values.
-          # (You may delete autoscalerOptions if the defaults are suitable.)
-          autoscalerOptions:
-            # upscalingMode is "Default" or "Aggressive."
-            # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
-            # Default: Upscaling is not rate-limited.
-            # Aggressive: An alias for Default; upscaling is not rate-limited.
-            upscalingMode: Default
-            # idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
-            idleTimeoutSeconds: 60
-            # image optionally overrides the autoscaler's container image.
-            # If instance.spec.rayVersion is at least "2.0.0", the autoscaler will default to the same image as
-            # the ray container. For older Ray versions, the autoscaler will default to using the Ray 2.0.0 image.
-            ## image: "my-repo/my-custom-autoscaler-image:tag"
-            # imagePullPolicy optionally overrides the autoscaler container's image pull policy.
-            imagePullPolicy: Always
-            # resources specifies optional resource request and limit overrides for the autoscaler container.
-            # For large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required.
-            resources:
-              limits:
-                cpu: "500m"
-                memory: "512Mi"
-              requests:
-                cpu: "500m"
-                memory: "512Mi"
-          ######################headGroupSpec#################################
-          # head group template and specs, (perhaps 'group' is not needed in the name)
+          rayVersion: '2.5.0'
+          # Ray head pod configuration
           headGroupSpec:
-            # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
+            # Kubernetes Service Type. This is an optional field, and the default value is ClusterIP.
+            # Refer to https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types.
             serviceType: ClusterIP
-            # logical group name, for this called head-group, also can be functional
-            # pod type head or worker
-            # rayNodeType: head # Not needed since it is under the headgroup
-            # the following params are used to complete the ray start: ray start --head --block ...
+            # The `rayStartParams` are used to configure the `ray start` command.
+            # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
+            # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
             rayStartParams:
-              # Flag "no-monitor" will be automatically set when autoscaling is enabled.
               dashboard-host: '0.0.0.0'
-              block: 'true'
-              # num-cpus: '1' # can be auto-completed from the limits
-              # Use `resources` to optionally specify custom resource annotations for the Ray node.
-              # The value of `resources` is a string-integer mapping.
-              # Currently, `resources` must be provided in the specific format demonstrated below:
-              # resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'
-            #pod template
+            # pod template
             template:
+              metadata:
+                # Custom labels. NOTE: To avoid conflicts with the KubeRay operator, do not define custom labels that start with `raycluster`.
+                # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
+                labels: {}
               spec:
                 containers:
-                # The Ray head pod
                 - name: ray-head
-                  image: rayproject/ray:2.0.0
-                  imagePullPolicy: Always
+                  image: rayproject/ray:2.5.0
                   ports:
                   - containerPort: 6379
                     name: gcs
@@ -98,59 +51,90 @@ spec:
                     preStop:
                       exec:
                         command: ["/bin/sh","-c","ray stop"]
+                  volumeMounts:
+                  - mountPath: /tmp/ray
+                    name: ray-logs
+                  # The resource requests and limits in this config are too small for production!
+                  # For an example with more realistic resource configuration, see
+                  # ray-cluster.autoscaler.large.yaml.
+                  # It is better to use a few large Ray pods than many small ones.
+                  # For production, it is ideal to size each Ray pod to take up the
+                  # entire Kubernetes node on which it is scheduled.
                   resources:
                     limits:
                       cpu: "1"
-                      memory: "1G"
+                      memory: "2G"
                     requests:
+                      # For production use-cases, we recommend specifying integer CPU requests and limits.
+                      # We also recommend setting requests equal to limits for both CPU and memory.
+                      # For this example, we use a 500m CPU request to accommodate resource-constrained local
+                      # Kubernetes testing environments such as KinD and minikube.
                       cpu: "500m"
-                      memory: "512Mi"
+                      memory: "2G"
+                volumes:
+                - name: ray-logs
+                  emptyDir: {}
           workerGroupSpecs:
           # the pod replicas in this group typed worker
           - replicas: 1
            minReplicas: 1
-            maxReplicas: 300
+            maxReplicas: 10
            # logical group name, for this called small-group, also can be functional
            groupName: small-group
-            # if worker pods need to be added, we can simply increment the replicas
-            # if worker pods need to be removed, we decrement the replicas, and populate the podsToDelete list
-            # the operator will remove pods from the list until the number of replicas is satisfied
-            # when a pod is confirmed to be deleted, its name will be removed from the list below
+            # If worker pods need to be added, we can increment the replicas.
+            # If worker pods need to be removed, we decrement the replicas, and populate the workersToDelete list.
+            # The operator will remove pods from the list until the desired number of replicas is satisfied.
+            # If the difference between the current replica count and the desired replicas is greater than the
+            # number of entries in workersToDelete, random worker pods will be deleted.
            #scaleStrategy:
            #  workersToDelete:
            #  - raycluster-complete-worker-small-group-bdtwh
            #  - raycluster-complete-worker-small-group-hv457
            #  - raycluster-complete-worker-small-group-k8tj7
-            # the following params are used to complete the ray start: ray start --block ...
-            rayStartParams:
-              block: 'true'
+            # The `rayStartParams` are used to configure the `ray start` command.
+            # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
+            # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
+            rayStartParams: {}
            #pod template
            template:
-              metadata:
-                labels:
-                  key: value
-                # annotations for pod
-                annotations:
-                  key: value
              spec:
-                initContainers:
-                # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
-                - name: init-myservice
-                  image: busybox:1.28
-                  command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
                containers:
-                - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc'
-                  image: rayproject/ray:2.0.0
-                  # environment variables to set in the container.Optional.
-                  # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
+                - name: ray-worker
+                  image: rayproject/ray:2.5.0
                  lifecycle:
                    preStop:
                      exec:
                        command: ["/bin/sh","-c","ray stop"]
+                  # use volumeMounts. Optional.
+                  # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
+                  volumeMounts:
+                  - mountPath: /tmp/ray
+                    name: ray-logs
+                  # The resource requests and limits in this config are too small for production!
+                  # For an example with more realistic resource configuration, see
+                  # ray-cluster.autoscaler.large.yaml.
+                  # It is better to use a few large Ray pods than many small ones.
+                  # For production, it is ideal to size each Ray pod to take up the
+                  # entire Kubernetes node on which it is scheduled.
                  resources:
                    limits:
                      cpu: "1"
-                      memory: "512Mi"
+                      memory: "1G"
+                      # For production use-cases, we recommend specifying integer CPU requests and limits.
+                      # We also recommend setting requests equal to limits for both CPU and memory.
+                      # For this example, we use a 500m CPU request to accommodate resource-constrained local
+                      # Kubernetes testing environments such as KinD and minikube.
                    requests:
+                      # For production use-cases, we recommend specifying integer CPU requests and limits.
+                      # We also recommend setting requests equal to limits for both CPU and memory.
+                      # For this example, we use a 500m CPU request to accommodate resource-constrained local
+                      # Kubernetes testing environments such as KinD and minikube.
                      cpu: "500m"
-                      memory: "256Mi"
+                      # For production use-cases, we recommend allocating at least 8Gb memory for each Ray container.
+                      memory: "1G"
+                # use volumes
+                # Refer to https://kubernetes.io/docs/concepts/storage/volumes/
+                volumes:
+                - name: ray-logs
+                  emptyDir: {}
+

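The `rayStartParams` above are what the KubeRay operator folds into the `ray start` command of each Ray container. As a rough sketch of what the head pod in this example ends up running, plus a client-side check of the manifest before it is queued (the `--head`/`--block` flags and the manifest path are assumptions based on this example, not part of the commit):

```bash
# Approximate head-pod startup command assembled by KubeRay from rayStartParams
# (illustrative only; the operator composes the real invocation).
ray start --head --dashboard-host=0.0.0.0 --block

# Validate the AppWrapper manifest locally before submitting it to MCAD.
kubectl apply --dry-run=client -f doc/usage/examples/kuberay/config/aw-raycluster.yaml
```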
doc/usage/examples/kuberay/kuberay-mcad.md

Lines changed: 49 additions & 7 deletions
@@ -4,13 +4,55 @@ This integration will help in queuing on [kuberay](https://github.com/ray-projec
 
 #### Prerequisites
 
-- kubernetes or Openshift cluster
-- Install MCAD using instructions present under `deployment` directory
-- Make sure MCAD has clusterrole to create ray resources, please patch using configuration file present in `config` directory with name `xqueuejob-controller.yaml`
+- Kubernetes (see [KinD](https://kind.sigs.k8s.io/)) or OpenShift cluster (see [OpenShift Local](https://developers.redhat.com/products/openshift-local/overview))
+- Kubernetes client tools such as [kubectl](https://kubernetes.io/docs/tasks/tools/) or the [OpenShift CLI](https://docs.openshift.com/container-platform/4.13/cli_reference/openshift_cli/getting-started-cli.html)
+- [Helm](https://helm.sh/docs/intro/install/)
+- Install the MCAD and KubeRay operators:
+  - KinD cluster:
+
+    Install the stable release of the MCAD operator from the local charts:
+    ```bash
+    git clone https://github.com/project-codeflare/multi-cluster-app-dispatcher
+    cd multi-cluster-app-dispatcher
+    helm install mcad --set image.repository=quay.io/project-codeflare/mcad-controller --set image.tag=stable deployment/mcad-controller
+    ```
+
+    Make sure MCAD has the cluster role to create Ray resources; patch it using [xqueuejob-controller.yaml](doc/usage/examples/kuberay/config/xqueuejob-controller.yaml). For example:
+    ```
+    kubectl apply -f doc/usage/examples/kuberay/config/xqueuejob-controller.yaml
+    ```
+
+    See [deployment.md](../../../../doc/deploy/deployment.md) for more options.
+
+    Install the KubeRay operator using the [instructions](https://github.com/ray-project/kuberay#quick-start). For example, install KubeRay v0.6.0 from the remote Helm repo:
+    ```
+    helm repo add kuberay https://ray-project.github.io/kuberay-helm/
+    helm install kuberay-operator kuberay/kuberay-operator --version 0.6.0
+    ```
+
+  - OpenShift cluster:
+
+    The MCAD and KubeRay operators are part of the CodeFlare stack, which provides a simple, user-friendly abstraction for scaling,
+    queuing and resource management of distributed AI/ML and Python workloads. Please follow the `Distributed Workloads` [Quick-Start](https://github.com/opendatahub-io/distributed-workloads/blob/main/Quick-Start.md) for installation.
 
 #### Steps
 
-- Install kuberay operator from [link](https://docs.ray.io/en/latest/cluster/kubernetes/getting-started.html#deploying-the-kuberay-operator)
-- Submit ray cluster to MCAD as appwrapper using the config file `aw-raycluster.yaml` present in the `config` directory using command `kubectl create -f aw-raycluster.yaml`
-- Check the status of the appwrapper using command `kubectl describe appwrapper <your-appwrapper-name>`
-- Check running pods using command `kubectl get pods -n <your-name-space>`
+
+- Submit the RayCluster custom resource to MCAD as an AppWrapper using the [aw-raycluster.yaml](doc/usage/examples/kuberay/config/aw-raycluster.yaml) example:
+  ```bash
+  kubectl create -f doc/usage/examples/kuberay/config/aw-raycluster.yaml
+  ```
+- Check the status of the AppWrapper custom resource using the command
+  ```bash
+  kubectl describe appwrapper raycluster-complete -n default
+  ```
+- Check that the RayCluster status is ready using the command
+  ```bash
+  kubectl get raycluster -n default
+  ```
+  Expect:
+  ```
+  NAME                  DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
+  raycluster-complete   1                 1                   ready    6m45s
+  ```
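Once the RayCluster reports `ready`, the head and worker pods can also be listed directly. A minimal sketch, assuming the standard `ray.io/cluster` and `ray.io/node-type` labels that the KubeRay operator attaches to the pods it creates (these labels are not shown in this commit):

```bash
# Expect one head pod and one worker pod in Running state.
kubectl get pods -n default -l ray.io/cluster=raycluster-complete

# Show only the head pod.
kubectl get pods -n default -l ray.io/cluster=raycluster-complete,ray.io/node-type=head
```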

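To tear the example down, delete the AppWrapper; MCAD should then clean up the wrapped RayCluster and its pods. A short sketch using the names from this example:

```bash
# Remove the queued workload and the resources created from its generic template.
kubectl delete appwrapper raycluster-complete -n default

# Optionally confirm that nothing is left behind.
kubectl get raycluster,pods -n default
```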