108 commits
c63a8db
Opensource
comzyh Feb 22, 2017
55f1a02
update README.md
comzyh Feb 22, 2017
e15c42f
fix encoding issue
comzyh Feb 23, 2017
f37e9d1
move deployment yaml
comzyh Feb 27, 2017
a29526d
update document
comzyh Feb 27, 2017
8dae9ba
update document
comzyh Feb 27, 2017
964055e
support rbac
comzyh Feb 27, 2017
2b80985
fix rbac
Feb 27, 2017
3ea2dfe
fix pods/log permission
Feb 27, 2017
ad24e5e
requested feature
comzyh Feb 28, 2017
9287270
support multi version log
comzyh Feb 28, 2017
4bb4d5d
finish log
comzyh Feb 28, 2017
d5e90f5
support branch
comzyh Feb 28, 2017
708d64c
oauth2
comzyh Mar 2, 2017
bd8371e
support login
comzyh Mar 2, 2017
7f82705
support comments
comzyh Mar 2, 2017
0409f7f
deployment yaml
comzyh Mar 2, 2017
f64e8dd
support hide & favorite
comzyh Mar 3, 2017
cd00b4d
use update_many rather than update
comzyh Mar 3, 2017
0cce9aa
fix permission
comzyh Mar 3, 2017
87bdd83
log wrap-line & job-name validation
comzyh Mar 4, 2017
57650bc
update document
comzyh Mar 19, 2017
39cb25e
add status filter, to support Running etc.
comzyh Mar 19, 2017
7511af2
Collapse some line
comzyh Mar 20, 2017
6e9101a
support available_gpu
comzyh Mar 20, 2017
6c7baa7
better log version control
comzyh Mar 20, 2017
373ef4b
use query parameter to store page
comzyh Mar 20, 2017
8876202
add version to url parameters
comzyh Mar 20, 2017
b32c868
refactor
comzyh Mar 20, 2017
eecaa92
support restart
comzyh Mar 20, 2017
f738302
better log
comzyh Mar 20, 2017
583afc4
better ui
comzyh Mar 20, 2017
ba14aa3
better ui
comzyh Mar 21, 2017
5712d81
fix tensorboard
comzyh Mar 22, 2017
9a4f334
report Fetch Error
comzyh Mar 22, 2017
b2a2661
support download log
comzyh Mar 22, 2017
45f5304
filter user
comzyh Mar 23, 2017
a55db4f
fix bug
comzyh Mar 23, 2017
a2daa8d
fix bug
comzyh Mar 23, 2017
8d064f8
use inwall cdn.
comzyh Mar 23, 2017
541418e
support page size
comzyh Mar 23, 2017
96c2d6e
fix running & hide
comzyh Mar 27, 2017
e7fc36b
defense programing
comzyh Mar 27, 2017
5947b4c
fix bug
comzyh Mar 28, 2017
33f16a8
better regexp
comzyh Apr 10, 2017
22f178e
refactor
comzyh Apr 11, 2017
6a6ab73
migrate to webpack!
comzyh Apr 11, 2017
a75e9c6
fix
comzyh Apr 11, 2017
3a3bd60
better strcutre
comzyh Apr 12, 2017
88a6ae6
small change
comzyh Apr 12, 2017
38b512a
split dialog
comzyh Apr 12, 2017
a15c046
allow edit comments and node (@jiahua)
comzyh Apr 12, 2017
5b08b58
fix bug
comzyh Apr 13, 2017
2d7e2da
don’t remove node info after crash.
comzyh Apr 21, 2017
518987c
better devserver config
comzyh Apr 25, 2017
af7b6c3
fix bug
comzyh Apr 26, 2017
9c48ad9
better loading
comzyh Apr 26, 2017
6afc01d
eliminate redundant log
comzyh Apr 26, 2017
4515dd7
ktqueue require access to repo
comzyh May 3, 2017
de1b85b
logout
comzyh May 4, 2017
f222fff
new repo API
comzyh May 4, 2017
e0d0da9
use token to clone
comzyh May 4, 2017
9e039f3
log long polling
comzyh May 8, 2017
0260448
proper connection timeout
comzyh May 9, 2017
ffa2333
websocket API
comzyh May 9, 2017
d538403
auto refresh log
comzyh May 9, 2017
23a1bfb
fix user filter
comzyh May 10, 2017
389c5c8
fix Websocket close
comzyh May 10, 2017
1d27d7e
fix bugs
comzyh May 12, 2017
19eb683
support azure file as shared filesystem
comzyh May 16, 2017
52ab049
let user set KTQ_SHAREFS_HOSTPATH
comzyh May 19, 2017
ccfc798
fix bug
comzyh May 19, 2017
373e4d3
kubernetes 1.5 sucks!
comzyh May 19, 2017
a185d8a
change oauth URL
comzyh May 23, 2017
745e65d
Support Nginx auth_request directive
comzyh May 24, 2017
4528925
add ClusterRoleBinding
comzyh May 24, 2017
409aacc
fix scroll
comzyh May 31, 2017
990e96a
forbid hide running task, allow user filter node
comzyh Jun 17, 2017
40016f4
support dot in job name
comzyh Jun 17, 2017
07644cc
fix tensorboard bugs
comzyh Jun 17, 2017
c2efed4
now we can hide FetchError task & comment's wrap-line is shown
comzyh Jun 19, 2017
8d98067
support Websocket over TLS
comzyh Jun 21, 2017
4da8eb0
UE improvment
comzyh Jul 3, 2017
820ea55
fix filter
comzyh Jul 7, 2017
c026260
fix log encoding
comzyh Jul 10, 2017
c4defd8
fix watching bug
comzyh Jul 18, 2017
8a87135
better log following and update package
comzyh Jul 18, 2017
bbac2fb
update package
comzyh Jul 18, 2017
ad579de
fix dialog bug
comzyh Jul 19, 2017
52ce492
support NFS
comzyh Jul 22, 2017
9f766fd
settings: add mongodb-server field
comzyh Jul 22, 2017
d3c0f37
show ContainerCreating status
comzyh Jul 24, 2017
f32a8a3
update package
comzyh Jul 24, 2017
296a4ce
Frontend Login
comzyh Jul 24, 2017
84eb15d
add cpuLimt and memoryLimit, rename commit_id & gpu_num
comzyh Jul 26, 2017
64801ae
add hash to assets
comzyh Jul 27, 2017
feeaac7
Update README
comzyh Jul 28, 2017
c7161bf
move deploy docs and add image building docs
comzyh Jul 29, 2017
026121f
No. 100 commit, cheers
comzyh Jul 29, 2017
4d3a197
provide a option to control restart policy
KKRainbow Sep 25, 2017
0950fa7
remove inline-item; remove bool()
KKRainbow Sep 25, 2017
60c8b80
Merge pull request #2 from KKRainbow/autostart
xiaoyunwu Sep 25, 2017
06f4290
nvidia-docker2 and k8s1.9 support
KKRainbow Mar 14, 2018
6f56d75
fix
KKRainbow Mar 14, 2018
0513725
fix
KKRainbow Mar 14, 2018
5aa0cf1
try fix manualstop->terminated bug
KKRainbow Mar 20, 2018
4a74053
try fix
KKRainbow Mar 21, 2018
5853121
try fix bug
KKRainbow Mar 21, 2018
2 changes: 2 additions & 0 deletions .dockerignore
@@ -0,0 +1,2 @@
.git
frontend/node_modules
93 changes: 93 additions & 0 deletions .gitignore
@@ -0,0 +1,93 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# IPython Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# dotenv
.env

# virtualenv
venv/
ENV/

# Spyder project settings
.spyderproject

# Rope project settings
.ropeproject

.remote-sync.json
dev_environment.sh
frontend/node_modules
21 changes: 21 additions & 0 deletions Dockerfile
@@ -0,0 +1,21 @@
FROM ubuntu:16.04

RUN apt-get update && apt-get install -y apt-transport-https ca-certificates

RUN echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial main restricted universe multiverse" > /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security main restricted universe multiverse" >> /etc/apt/sources.list

RUN apt-get update
RUN apt-get install -y wget python3.5 python3-pip git

RUN python3.5 -m pip install kubernetes tornado aiohttp pymongo --ignore-installed -i https://pypi.tuna.tsinghua.edu.cn/simple

ADD . /ktqueue
WORKDIR /ktqueue

RUN python3.5 -m pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
CMD python3 /ktqueue/server.py

EXPOSE 8080
28 changes: 27 additions & 1 deletion README.md
@@ -1 +1,27 @@
# ktq
# KTQueue

Kubernetes task queue with GPU support

# Features

- support GPU tasks
- support assigning a task to a node manually
- realtime logs on the web page & log version management
- mount a host path into the Pod manually
- Tensorboard management & proxy
- git clone a repository with an ssh-key, username & password, or GitHub OAuth
- CPU & memory limits supported

# Screenshot

![screenshot](https://user-images.githubusercontent.com/1068203/28708229-10e6e19e-73ae-11e7-882f-f4fb6bff877a.png)

# How to deploy

The deployment guide is in the [deploy](./docs/deploy) directory.

# How to build images for KTQueue

You can use any framework you want as long as you have the correct docker image. Here are examples of building a docker image for KTQueue:

- [tensorflow](./docs/docker_image_example/tensorflow)
1 change: 1 addition & 0 deletions docs/deploy/.gitignore
@@ -0,0 +1 @@
dep*.yaml
42 changes: 42 additions & 0 deletions docs/deploy/NODE.md
@@ -0,0 +1,42 @@
# How to setup a KTQueue node

## Brief

1. join the kubernetes cluster, and ensure `kubelet`'s GPU support is enabled.
2. mount the cephfs on `/mnt/cephfs`
3. install `nvidia-docker` and run it once

## GPU support of kubelet

Kubernetes (> v1.6.0-beta.1) now supports multiple GPUs. Please follow the latest Kubernetes guide to enable GPU support.

## Nvidia-Docker

[nvidia-docker](https://github.com/NVIDIA/nvidia-docker) homepage

CUDA in a container needs NVIDIA driver libraries such as `libcuda.so.1` to work, and KTQueue assumes that the drivers are located at `/var/lib/nvidia-docker/volumes/nvidia_driver/`. So you need to install nvidia-docker and use it to run a CUDA image once, which ensures that the drivers are placed in the right location.

1. follow the installation instructions to install nvidia-docker
2. run the `nvidia-smi` test with nvidia-docker
3. make sure that a driver directory such as `367.57` exists under `/var/lib/nvidia-docker/volumes/nvidia_driver/`
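
The check in step 3 can also be scripted. The following is a minimal sketch in Python, assuming that each driver lives in a sub-directory named after its version (as with `367.57` above). `find_driver_versions` is a hypothetical helper for illustration and is not part of KTQueue.

```python
import os
import re

# Location KTQueue expects the drivers in, as stated above.
DRIVER_ROOT = "/var/lib/nvidia-docker/volumes/nvidia_driver/"


def find_driver_versions(root=DRIVER_ROOT):
    """Return the sorted names of sub-directories that look like driver versions.

    Assumption: a driver directory is named like a dotted version, e.g. "367.57".
    """
    if not os.path.isdir(root):
        return []
    version_like = re.compile(r"^\d+(\.\d+)+$")
    return sorted(
        name
        for name in os.listdir(root)
        if version_like.match(name) and os.path.isdir(os.path.join(root, name))
    )
```

An empty list from `find_driver_versions()` suggests that nvidia-docker has not populated the driver volume on this node yet.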

## Troubleshooting

- nvidia-smi prints the right output, but no drivers are found at `/var/lib/nvidia-docker/volumes/nvidia_driver/`

try updating nvidia-docker (>1.0.0), and check the docker volumes

> docker volume ls

you may get a line like

```
DRIVER VOLUME NAME
nvidia-docker nvidia_driver_367.57
```

try to remove that volume and run `nvidia-docker run --rm nvidia/cuda nvidia-smi` again.

- docker: Error response from daemon: create nvidia_driver_367.57: VolumeDriver.Create: internal error, check logs for details.

> sudo chown nvidia-docker:nvidia-docker /var/lib/nvidia-docker/volumes/nvidia_driver
115 changes: 115 additions & 0 deletions docs/deploy/README.md
@@ -0,0 +1,115 @@
# Deployment

# Brief

1. setup kubernetes and ceph cluster
2. add ceph-secret to kubernetes
3. deploy mongodb based on kubernetes and ceph
4. build ktqueue docker image
5. deploy ktqueue

To setup a KTQueue node, please read [How to setup a KTQueue node](./NODE.md)


# File list

- ktqueue.yaml
- mongodb-service.yaml
- mongodb-dev.yaml
- mongodb-production.yaml

# Dependency

ktqueue requires kubernetes and cephfs; make sure both are already set up.

# Prepare ceph

kubernetes and other components need permission (i.e. a secret) to access ceph.

to list the monitors (mons) in your ceph cluster, run:

> ceph mon stat

then, get the ceph key that will be stored in kubernetes as the ceph-secret

> sudo ceph auth get-key client.admin | base64

Note: although the output of `ceph auth get-key` is already base64-encoded, you should encode it again.
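
The double encoding mentioned in the note can be illustrated with a short Python sketch. It assumes that the `key:` field of the secret expects the base64 encoding of the key string exactly as ceph prints it. `encode_for_secret` is a hypothetical helper, and the key below is a made-up placeholder, not a real ceph key.

```python
import base64


def encode_for_secret(ceph_key: str) -> str:
    """Base64-encode the key string printed by `ceph auth get-key`.

    The key is already base64 text; it is encoded once more so that
    decoding the secret value yields the original key string.
    """
    return base64.b64encode(ceph_key.strip().encode("ascii")).decode("ascii")


# Placeholder that only looks like a ceph key (assumption for illustration).
example_key = base64.b64encode(b"not-a-real-ceph-key").decode("ascii")
secret_value = encode_for_secret(example_key)

# Decoding the secret value once gives back the key string that ceph printed.
assert base64.b64decode(secret_value).decode("ascii") == example_key
```

The value of `secret_value` is what would go after `key:` in the secret file.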

then, make a copy of `ceph-secret.yaml` and add your secret after `key:`

> cp ceph-secret.yaml dep-ceph-secret.yaml && vi dep-ceph-secret.yaml

and import `dep-ceph-secret.yaml`

> kubectl create -f dep-ceph-secret.yaml

# Deploy mongodb

## create rbd

mongodb needs a `persistent volume` (much like a disk) to store its data, so you should create one.

> rbd create rbd/ktqueue-mongodb -s 10240

> rbd info rbd/ktqueue-mongodb

try:

> rbd map ktqueue-mongodb

the current linux kernel doesn't support all of the rbd features. if you get an error, refer to [this](http://tonybai.com/2016/11/07/integrate-kubernetes-with-ceph-rbd/) and disable the unsupported features:

> rbd feature disable ktqueue-mongodb exclusive-lock, object-map, fast-diff, deep-flatten

## Create mongodb services

> cp mongodb-production.yaml dep-mongodb-production.yaml

create mongodb service

> kubectl create -f mongodb-service.yaml

create mongodb server

> kubectl create -f dep-mongodb-production.yaml

# Deploy ktqueue

## mount cephfs
the ktqueue daemon needs access to ceph to clone code, store logs, etc., and ktqueue jobs need access to ceph to store their output.

so you should ensure that cephfs is mounted at `/mnt/cephfs` on every node where you want to run ktqueue jobs or the ktqueue daemon.

you should modify `fstab` and add the cephfs mount.
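
To confirm on a node that the mount is actually in place, a small check like the following can help. This is a minimal sketch, assuming Python is available on the node. `missing_mounts` is a hypothetical helper and not part of ktqueue.

```python
import os


def missing_mounts(paths):
    """Return the entries of `paths` that are not currently mount points."""
    return [path for path in paths if not os.path.ismount(path)]


# The mount point required by ktqueue, as stated above.
REQUIRED_MOUNTS = ["/mnt/cephfs"]
```

Calling `missing_mounts(REQUIRED_MOUNTS)` returns an empty list when cephfs is mounted, and `["/mnt/cephfs"]` otherwise.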

## build image

build front-end

> cd frontend

> npm install

> npm run build

build docker image

> docker build -t ktqueue .

you can modify `dep-ktqueue.yaml` (see the next step) and change the image name if you want.


## deploy

> cp ktqueue.yaml dep-ktqueue.yaml

if you want to change the IP/port of ktqueue or assign ktqueue to a specific node, you should modify `dep-ktqueue.yaml`

to select the node that runs ktqueue, change `host_name_you_want` after `kubernetes.io/hostname` and uncomment that line.

to select the IP/port, change `ip_you_want_to_access_form_outside` under `externalIPs`

finish it:

> kubectl create -f dep-ktqueue.yaml

enjoy!
6 changes: 6 additions & 0 deletions docs/deploy/ceph-secret.yaml
@@ -0,0 +1,6 @@
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
data:
  key:
40 changes: 40 additions & 0 deletions docs/deploy/ktqueue-rbac.yaml
@@ -0,0 +1,40 @@
# Create the clusterrole:
# $ kubectl create -f ktqueue-rbac.yaml
# Bind the ktqueue serviceaccount to the ktqueue clusterrole:
# $ kubectl create clusterrolebinding ktqueue --clusterrole=ktqueue --serviceaccount=default:ktqueue
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: ktqueue
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - pods/log
    verbs: ["*"]
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - list
  - apiGroups:
      - "batch"
    resources:
      - jobs
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: ktqueue
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ktqueue
subjects:
  - kind: ServiceAccount
    name: ktqueue
    namespace: default