set up GPU environment #294
Conversation
seanpmorgan
left a comment
Far from a full review, but thought I'd start some discussion. Thank you very much for getting this going @facaiy !
@seanpmorgan Yeah, Sean. The PR is a draft, and most of the code is just copied from the GPU example in custom-op. I want to use it to check whether our GPU environment is ready. If all tests pass, I'd like to restructure the code and invite everyone to take a look :-)
@av8ramit Amit, can you compare the internal image with tensorflow/tensorflow:custom-op-gpu? Do they share the same CUDA settings?
Friendly bump as this is blocking one of our milestone issues. cc @karmel
I assume there are some minor differences between the setups, and we are also in the middle of moving everything to CentOS (cc: @av8ramit).
So our current setup is to use Travis and the custom-op images for build/release, and GCP for CI testing (Travis does not have GPU servers). Writing this out, though, we'll need to confirm we can compile with nvcc on a server without a GPU (I believe it's possible, but I haven't tried).
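As an aside, compiling with nvcc should not require a GPU to be present: only the CUDA toolkit is needed at compile time, while a driver and device are needed only at run time. A minimal probe to verify this on a GPU-less build server might look like the following sketch (the file name and flags are illustrative assumptions, not the project's actual build commands):

```shell
# Probe whether nvcc can compile device code on this machine.
# Compilation needs only the CUDA toolkit, not a GPU or driver.
cat > probe.cu <<'EOF'
__global__ void noop() {}
int main() { return 0; }
EOF
nvcc -c probe.cu -o probe.o && echo "nvcc compile OK (no GPU needed)"
```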
I'm currently in the process of moving everything to CentOS. As for the current Ubuntu images, I'm not sure of any significant differences; as I understand it, they are designed to be fairly similar to each other. Any commands or tests I can run on the GCP instances for you?
I think we'd better use the same image for both build/release and CI testing. tensorflow/tensorflow:custom-op-gpu seems to work well right now. As for the GCP image, is it possible that we can get it from somewhere? Or can we use tensorflow/tensorflow:custom-op-gpu for CI testing?
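For what it's worth, reusing the public image for CI could be as simple as mounting the checkout into the container. A sketch, where the mount path, the configure step, and the bazel target are assumptions, not the project's confirmed commands:

```shell
# Run the addons build/tests inside the public custom-op GPU image.
# --gpus all exposes the host GPUs to the container (requires the
# NVIDIA container runtime on the host).
docker run --gpus all -v "$PWD":/addons -w /addons \
    tensorflow/tensorflow:custom-op-gpu \
    bash -c "./configure.sh && bazel test //tensorflow_addons/..."
```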
I find it really difficult to debug if we cannot get access to the CI testing image.
Hey @facaiy, yeah, I agree. Unfortunately this is for security reasons, as well as a route to provide easy debugging internally. I'm not opposed to having your entire setup run on tensorflow/tensorflow:custom-op-gpu.
Unfortunately, I cannot give you our GCP images. As far as I can see, for addons we only have Ubuntu/Linux builds set up.
I have no objections to using the custom-op docker images for Travis and kokoro. Tbh, that'd be ideal so we have a full understanding of our build env.
I'm not sure what you mean by your question. Do you mean what to adjust to use the docker container? I guess the next step would be to make […]. We can also modify how it's being invoked internally. Right now it's being called in a virtualenv.
Yeah, I want to know what to do if we use the custom-op-gpu image for CI testing.
Agree, let's run
Do you mean the script is invoked in a virtual environment inside a docker container? That sounds good, because it's convenient to support both Python 2 and 3 when using the same custom-op image. cc @seanpmorgan, any thoughts, Sean?
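One way a single custom-op image can cover both interpreters is one virtualenv per Python version. A rough sketch, where the virtualenv paths are assumptions:

```shell
# Inside the container: separate virtualenvs so the same image can
# run the test suite under Python 2 and Python 3.
virtualenv -p python2 /venvs/py2
virtualenv -p python3 /venvs/py3
/venvs/py2/bin/pip install -r requirements.txt
/venvs/py3/bin/pip install -r requirements.txt
```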
I think what Amit wanted to say was, kokoro directly calls addons_gpu.sh.
I agree, and I think it's a good idea. But how do we run it in the corresponding docker image? I think we cannot do it without help from Amit. Or do you suggest we call the docker run command in addons_gpu.sh ourselves? It looks unsafe to me.
Yes, I do recommend running docker run from addons_gpu.sh. As long as it is properly reviewed, I think it should be fine.
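If I read the recommendation correctly, addons_gpu.sh would itself own the docker run call, roughly like this sketch (the inner test script name is a made-up placeholder, and the mount path is an assumption):

```shell
#!/usr/bin/env bash
# addons_gpu.sh (sketch): kokoro invokes this script directly, and
# the script delegates the actual build/test to the docker image.
set -e -x
docker run --gpus all -v "$PWD":/addons -w /addons \
    tensorflow/tensorflow:custom-op-gpu \
    bash /addons/tools/run_gpu_tests.sh   # hypothetical inner script
```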
@seanpmorgan @yifeif Sean, Yifei, can you take a look? I prefer to solve issue #118 step by step, so I tried to make this PR a minimal change (most of the code is from https://github.com/tensorflow/custom-op, thanks Yifei), and it just works. cc @Squadrick, who might be interested.
tensorflow_addons/custom_ops/image/cc/kernels/euclidean_distance_transform_op.cc
seanpmorgan
left a comment
Thanks Yan, great work! Everything compiles well, but when testing the runtime of //tensorflow_addons/image:transform_ops_test it never runs on the GPU (see #118 (comment)).
If you would like to expedite this PR, I'm okay with a requirements-gpu.txt and modifying configure.sh. When I ran with tf-nightly-gpu-2.0-preview, the test executed successfully.
d67a07e to 075e641 (compare)
So tests are failing with OOM because we launch parallel jobs on a single card. Once #344 is merged we'll see that everything is working, though 2 new tests fail when running on GPU: […] And all of the […] I'm okay with commenting these tests out and cutting new issues to address them.
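For context, the OOM happens because each TensorFlow test process tries to grab most of the single card's memory when several run at once. Serializing the GPU tests is one workaround; the exact flags the project uses are not confirmed here, so this is just a sketch:

```shell
# Run GPU tests one at a time so concurrent processes don't fight
# over a single card's memory.
bazel test --local_test_jobs=1 --test_tag_filters=gpu //tensorflow_addons/...
```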
4f28186 to 6d4fa20 (compare)
Thank you, Sean. I have disabled all those failing tests temporarily, and filed corresponding issues to track them.
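For readers following along, parking a failing test usually means a skip that points at its tracking issue rather than deleting the test. A small self-contained sketch (the test names and skip reason are made up, not the actual disabled tests):

```python
import unittest


class TransformOpsTest(unittest.TestCase):
    # Hypothetical GPU-only failure, parked with a skip that points at
    # its tracking issue instead of being deleted outright.
    @unittest.skip("fails on GPU; see the tracking issue filed for it")
    def test_transform_on_gpu(self):
        self.fail("re-enable once the GPU kernel is fixed")

    def test_transform_on_cpu(self):
        self.assertEqual(1 + 1, 2)


# Running the case shows one pass and one skip; nothing executes red.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TransformOpsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```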
seanpmorgan
left a comment
Thanks! Appreciate the work towards this milestone!
First step of #118.
Refer to the GPU op example in https://github.com/tensorflow/custom-op