Conversation

@facaiy (Member) commented Jun 17, 2019

First step of #118.
Refer to the GPU op example in https://github.com/tensorflow/custom-op

@facaiy facaiy requested review from a team and WindQAQ as code owners June 17, 2019 02:50
@facaiy facaiy removed request for a team and WindQAQ June 17, 2019 02:50
@facaiy facaiy mentioned this pull request Jun 17, 2019
@seanpmorgan (Member) left a comment


Far from a full review, but I thought I'd start some discussion. Thank you very much for getting this going, @facaiy!

@facaiy (Member Author) commented Jun 18, 2019

@seanpmorgan Yeah, Sean. The PR is a draft, and most of the code is just copied from the GPU example in custom-op. I want to use it to check whether our GPU environment is ready. If all tests pass, then I'd like to restructure all the code and invite everyone to take a look :-)

@facaiy (Member Author) commented Jun 18, 2019

@av8ramit Amit, can you compare the internal image with tensorflow/tensorflow:custom-op-gpu? Do they share the same CUDA settings?

Cuda Configuration Error: No library found under: /usr/lib/x86_64-linux-gnu/lib64/libcudnn.so.7, /usr/lib/x86_64-linux-gnu/lib64/stubs/libcudnn.so.7, /usr/lib/x86_64-linux-gnu/lib/powerpc64le-linux-gnu/libcudnn.so.7, /usr/lib/x86_64-linux-gnu/lib/x86_64-linux-gnu/libcudnn.so.7, /usr/lib/x86_64-linux-gnu/lib/x64/libcudnn.so.7, /usr/lib/x86_64-linux-gnu/lib/libcudnn.so.7, /usr/lib/x86_64-linux-gnu/libcudnn.so.7
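
For context, the error above means the CUDA configure step searched several prefixes under /usr/lib/x86_64-linux-gnu and found no libcudnn.so.7. Below is a minimal sketch of how the search could be redirected, assuming TensorFlow's usual CUDA configure environment variables; the install path shown is illustrative, so verify the real location on the image first:

```bash
# Locate cuDNN on the image first; the path below is only an example.
find / -name 'libcudnn.so.7*' 2>/dev/null

# Point TensorFlow's CUDA repository rules at the real locations
# before running ./configure or bazel.
export TF_NEED_CUDA=1
export TF_CUDNN_VERSION=7
export CUDNN_INSTALL_PATH=/usr/lib/x86_64-linux-gnu  # dir that actually contains libcudnn.so.7
export CUDA_TOOLKIT_PATH=/usr/local/cuda
```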

@facaiy (Member Author) commented Jun 18, 2019

cc @yifeif @gunan

@seanpmorgan (Member) commented

Friendly bump as this is blocking one of our milestone issues.
Is cuDNN installed on the GCP image? If so, we can set an ENV variable in our GPU CI script.

cc @karmel

@yifeif commented Jun 20, 2019

I assume there are some minor differences between the setups, and we are also in the middle of moving everything to CentOS (cc: @av8ramit).
Does addons use tensorflow/tensorflow:custom-op-gpu for release and the GCP image for build? Or the GCP image for both?

@seanpmorgan (Member) commented

> I assume there are some minor differences between the setups, and we are also in the middle of moving everything to CentOS (cc: @av8ramit).
> Does addons use tensorflow/tensorflow:custom-op-gpu for release and the GCP image for build? Or the GCP image for both?

So our current setup is to use Travis and the custom-op images for build/release, and GCP for CI testing (Travis does not have GPU servers).

Writing this out, though, we'll need to confirm that we can compile with nvcc on a server without a GPU (I believe it's possible, but I haven't tried).
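
For what it's worth, nvcc itself does not need a GPU, only the CUDA toolkit; a device is required only when the tests actually run. A rough sanity check for a GPU-less build server, with illustrative target patterns:

```bash
# The compiler works without a GPU; only the toolkit must be present.
which nvcc && nvcc --version

# Compiling the CUDA kernels succeeds on a GPU-less machine...
bazel build //tensorflow_addons/...

# ...but exercising them requires a visible device.
nvidia-smi || echo "no GPU here: run the GPU tests on a GPU machine"
```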

@av8ramit (Contributor) commented

I'm currently in the process of moving everything to CentOS. As for the current Ubuntu images, I'm not sure of any significant differences; as I understand it, they are designed to be fairly similar to each other. Are there any commands or tests I can run on the GCP instances for you?

@facaiy (Member Author) commented Jun 21, 2019

I think we'd better use the same image for both build/release and CI testing. It seems that tensorflow/tensorflow:custom-op-gpu works well right now. As for the GCP image, is it possible to get it from somewhere? Or can we use tensorflow/tensorflow:custom-op-gpu for CI testing?

@facaiy (Member Author) commented Jun 21, 2019

I find it really difficult to debug if we cannot get access to the CI testing image.

@av8ramit (Contributor) commented

Hey @facaiy, yeah, I agree. Unfortunately, this is for security reasons, as well as a route to provide easy debugging internally. I'm not opposed to having your entire setup run on tensorflow/tensorflow:custom-op-gpu.

@gunan commented Jun 21, 2019

Unfortunately, I cannot give you our GCP images.
However, the custom-op docker image has exactly the same toolchains set up.

As far as I can see, for addons we only have Ubuntu/Linux builds set up.
I am OK with changing the build setup for the addons repository so that every build runs inside the docker container.

@seanpmorgan (Member) commented Jun 22, 2019

I have no objections to using the custom-op docker images for Travis and Kokoro. To be honest, that'd be ideal so that we have a full understanding of our build environment.

@facaiy (Member Author) commented Jun 25, 2019

@av8ramit @gunan Thanks for the input, Amit, Gunhan. We've agreed to switch to custom-op-gpu, so what do we need to do?

@facaiy (Member Author) commented Jun 27, 2019

Gently ping @av8ramit :-)

cc @karmel for visibility.

@av8ramit (Contributor) commented

I'm not sure what you mean by your question. Do you mean what to adjust in order to use the docker container?

I guess the next step would be to make addons_gpu.sh invoke and run the tests in the corresponding docker container. That's what is being invoked on our side internally.

We can also modify how it's being invoked internally. Right now it's being called in a virtualenv.
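
A hypothetical wrapper along those lines is sketched below; the docker flags, mount point, and invocation are assumptions, not the actual Kokoro setup:

```bash
#!/usr/bin/env bash
# Hypothetical CI entry point: run the existing test script inside the
# custom-op GPU image instead of directly on the host.
set -e
docker run --runtime=nvidia --rm \
    -v "${PWD}":/addons -w /addons \
    tensorflow/tensorflow:custom-op-gpu \
    bash tools/ci_testing/addons_gpu.sh
```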

@facaiy (Member Author) commented Jun 27, 2019

Yeah, I want to know what we need to do if we use the custom-op-gpu image for CI testing.

> I guess the next step would be to make addons_gpu.sh invoke and run tests in the corresponding docker container.

Agreed, let's run addons_gpu.sh in the custom-op-gpu container. Can you set it up, please?

> Right now it's being called in a virtualenv.

Do you mean the script is invoked in a virtual environment inside a docker container? That sounds good, because it makes it convenient to support both Python 2 and 3 while using the same custom-op image. cc @seanpmorgan, any thoughts, Sean?
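
As a sketch of that idea, one container image could serve both interpreters; the venv paths and requirements file here are placeholders:

```bash
# Inside the custom-op container: one virtualenv per Python version.
virtualenv -p python2 /tmp/venv2
virtualenv -p python3 /tmp/venv3

# Install dependencies into each, then run the same test script
# against whichever interpreter the CI job selects.
/tmp/venv2/bin/pip install -r requirements.txt
/tmp/venv3/bin/pip install -r requirements.txt
```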

@gunan commented Jun 27, 2019

I think what Amit wanted to say was: Kokoro directly calls addons_gpu.sh.
You are free to make any changes to it, and the CI will just pick them up.
You can edit that script right here:
https://github.com/tensorflow/addons/blob/master/tools/ci_testing/addons_gpu.sh

@facaiy (Member Author) commented Jun 27, 2019

> Kokoro directly calls addons_gpu.sh.

I agree, and I think it's a good idea. But how do we run it in the corresponding docker image? I don't think we can do that without help from Amit. Or do you suggest we call the docker run command in addons_gpu.sh ourselves? That looks unsafe to me.

@gunan commented Jun 28, 2019 via email

@facaiy facaiy changed the title WIP: compile gpu kernel, and run gpu test cases compile gpu kernel, and run gpu test cases Jul 11, 2019
@facaiy facaiy requested review from a team, Squadrick and yifeif July 11, 2019 07:10
@facaiy (Member Author) commented Jul 11, 2019

@seanpmorgan @yifeif Sean, Yifei, can you take a look? I prefer to solve issue #118 step by step, so I've tried to make this PR a minimal change (most of the code is from https://github.com/tensorflow/custom-op, thanks Yifei), and it just works.

cc @Squadrick, who might be interested.

@facaiy facaiy changed the title compile gpu kernel, and run gpu test cases Set up gpu environment Jul 11, 2019
@facaiy facaiy changed the title Set up gpu environment set up GPU environment Jul 11, 2019
@seanpmorgan (Member) left a comment


Thanks, Yan, great work! Everything compiles well, but when testing the runtime of //tensorflow_addons/image:transform_ops_test, it never runs on the GPU (see #118 (comment)).

If you would like to expedite this PR, I'm okay with a requirements-gpu.txt and modifying config.sh. When I ran with tf-nightly-gpu-2.0-preview, the test executed successfully.
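
If we go that route, the GPU requirements file would presumably just swap in the GPU nightly mentioned above; a guessed sketch:

```bash
# Hypothetical requirements-gpu.txt: mirror requirements.txt but pin the
# GPU build of the nightly that worked in the successful run.
echo "tf-nightly-gpu-2.0-preview" > requirements-gpu.txt
pip install -r requirements-gpu.txt
```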

@facaiy facaiy force-pushed the BLD/add_cuda_py_test branch from d67a07e to 075e641 Compare July 12, 2019 06:04
@seanpmorgan (Member) commented

So tests are failing with OOM because we launch parallel jobs on a single card. Once #344 is merged, we'll see that everything is working, though 2 new tests fail when running on the GPU:

https://github.com/tensorflow/addons/blob/master/tensorflow_addons/seq2seq/beam_search_ops_test.py#L73

And all of the testSparseRepeatedIndices tests in:
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/weight_decay_optimizers_test.py

I'm okay with commenting these tests out and cutting new issues to address them.
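
Two common mitigations for this kind of OOM are sketched below; whether #344 uses exactly these flags is an assumption:

```bash
# Serialize test execution so parallel jobs don't fight over one card.
bazel test --local_test_jobs=1 //tensorflow_addons/...

# Or let TensorFlow grow GPU memory on demand instead of pre-allocating it all.
export TF_FORCE_GPU_ALLOW_GROWTH=true
```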

@facaiy facaiy requested a review from qlzh727 as a code owner July 12, 2019 23:18
@facaiy (Member Author) commented Jul 13, 2019

Thank you, Sean. I have temporarily disabled all those failing tests and filed corresponding issues to track them.

@seanpmorgan (Member) left a comment


Thanks! Appreciate the work towards this milestone!

@seanpmorgan seanpmorgan merged commit a7afaa7 into tensorflow:master Jul 13, 2019
@facaiy facaiy deleted the BLD/add_cuda_py_test branch July 13, 2019 00:41