Conversation

@facaiy (Member) commented Jun 17, 2019

First step of #118.
Refer to the GPU op example in https://github.com/tensorflow/custom-op

@facaiy facaiy requested review from a team and WindQAQ as code owners June 17, 2019 02:50
@facaiy facaiy removed request for a team and WindQAQ June 17, 2019 02:50
@facaiy facaiy mentioned this pull request Jun 17, 2019
@seanpmorgan (Member) left a comment


Far from a full review, but I thought I'd start some discussion. Thank you very much for getting this going, @facaiy!

@facaiy (Member Author) commented Jun 18, 2019

@seanpmorgan Yeah, Sean. The PR is a draft, and most of the code is just copied from the GPU example in custom-op. I want to use it to check whether our GPU environment is ready. If all tests pass, then I'd like to restructure all the code and invite everyone to take a look :-)

@facaiy (Member Author) commented Jun 18, 2019

@av8ramit Amit, can you compare the internal image with tensorflow/tensorflow:custom-op-gpu? Do they share the same CUDA settings?

Cuda Configuration Error: No library found under: /usr/lib/x86_64-linux-gnu/lib64/libcudnn.so.7, /usr/lib/x86_64-linux-gnu/lib64/stubs/libcudnn.so.7, /usr/lib/x86_64-linux-gnu/lib/powerpc64le-linux-gnu/libcudnn.so.7, /usr/lib/x86_64-linux-gnu/lib/x86_64-linux-gnu/libcudnn.so.7, /usr/lib/x86_64-linux-gnu/lib/x64/libcudnn.so.7, /usr/lib/x86_64-linux-gnu/lib/libcudnn.so.7, /usr/lib/x86_64-linux-gnu/libcudnn.so.7
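
For context, the error above means the CUDA configure step searched several prefixes under /usr/lib/x86_64-linux-gnu and found no libcudnn.so.7. Below is a minimal sketch of how the search could be redirected, assuming TensorFlow's usual CUDA configure environment variables; the install path shown is illustrative, so verify the real location on the image first:

```bash
# Locate cuDNN on the image first; the path below is only an example.
find / -name 'libcudnn.so.7*' 2>/dev/null

# Point TensorFlow's CUDA repository rules at the real locations
# before running ./configure or bazel.
export TF_NEED_CUDA=1
export TF_CUDNN_VERSION=7
export CUDNN_INSTALL_PATH=/usr/lib/x86_64-linux-gnu  # dir that actually contains libcudnn.so.7
export CUDA_TOOLKIT_PATH=/usr/local/cuda
```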

@facaiy (Member Author) commented Jun 18, 2019

cc @yifeif @gunan

@seanpmorgan (Member) commented

Friendly bump as this is blocking one of our milestone issues.
Is cuDNN installed on the GCP image? If so, we can set an ENV variable in our GPU CI script.

cc @karmel

@yifeif commented Jun 20, 2019

I assume there are some minor differences between the setups, and we are also in the middle of moving everything to CentOS (cc: @av8ramit).
Does addons use tensorflow/tensorflow:custom-op-gpu for release and the GCP image for build? Or the GCP image for both?

@seanpmorgan (Member) commented

> I assume there are some minor differences between the setups, and we are also in the middle of moving everything to CentOS (cc: @av8ramit).
> Does addons use tensorflow/tensorflow:custom-op-gpu for release and the GCP image for build? Or the GCP image for both?

So our current setup is to use Travis and the custom-op images for build/release, and GCP for CI testing (Travis does not have GPU servers).

Writing this out, though, we'll need to confirm that we can compile with nvcc on a server without a GPU (I believe it's possible, but I haven't tried).
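
For what it's worth, nvcc itself does not need a GPU, only the CUDA toolkit; a device is required only when the tests actually run. A rough sanity check for a GPU-less build server, with illustrative target patterns:

```bash
# The compiler works without a GPU; only the toolkit must be present.
which nvcc && nvcc --version

# Compiling the CUDA kernels succeeds on a GPU-less machine...
bazel build //tensorflow_addons/...

# ...but exercising them requires a visible device.
nvidia-smi || echo "no GPU here: run the GPU tests on a GPU machine"
```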

@av8ramit (Contributor) commented

I'm currently in the process of moving everything to CentOS. As for the current Ubuntu images, I'm not sure of any significant differences; as I understand it, they are designed to be fairly similar to each other. Are there any commands or tests I can run on the GCP instances for you?

@facaiy (Member Author) commented Jun 21, 2019

I think we'd better use the same image for both build/release and CI testing. It seems that tensorflow/tensorflow:custom-op-gpu works well right now. As for the GCP image, is it possible to get it from somewhere? Or can we use tensorflow/tensorflow:custom-op-gpu for CI testing?

@facaiy (Member Author) commented Jun 21, 2019

I find it really difficult to debug if we cannot get access to the CI testing image.

@av8ramit (Contributor) commented

Hey @facaiy, yeah, I agree. Unfortunately, this is for security reasons, as well as a route to provide easy debugging internally. I'm not opposed to having your entire setup run on tensorflow/tensorflow:custom-op-gpu.

@gunan commented Jun 21, 2019

Unfortunately, I cannot give you our GCP images.
However, the custom-op docker image has exactly the same toolchains set up.

As far as I can see, for addons we only have Ubuntu/Linux builds set up.
I am OK with changing the build setup for the addons repository so that every build runs inside the docker container.

@seanpmorgan (Member) commented Jun 22, 2019

I have no objections to using the custom-op docker images for Travis and Kokoro. To be honest, that'd be ideal so that we have a full understanding of our build environment.

@facaiy (Member Author) commented Jun 25, 2019

@av8ramit @gunan Thanks for the input, Amit, Gunhan. We've agreed to switch to custom-op-gpu, so what do we need to do?

@facaiy (Member Author) commented Jun 27, 2019

Gently ping @av8ramit :-)

cc @karmel for visibility.

@av8ramit (Contributor) commented

I'm not sure what you mean by your question. Do you mean what to adjust in order to use the docker container?

I guess the next step would be to make addons_gpu.sh invoke and run the tests in the corresponding docker container. That's what is being invoked on our side internally.

We can also modify how it's being invoked internally. Right now it's being called in a virtualenv.
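
A hypothetical wrapper along those lines is sketched below; the docker flags, mount point, and invocation are assumptions, not the actual Kokoro setup:

```bash
#!/usr/bin/env bash
# Hypothetical CI entry point: run the existing test script inside the
# custom-op GPU image instead of directly on the host.
set -e
docker run --runtime=nvidia --rm \
    -v "${PWD}":/addons -w /addons \
    tensorflow/tensorflow:custom-op-gpu \
    bash tools/ci_testing/addons_gpu.sh
```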

@facaiy (Member Author) commented Jun 27, 2019

Yeah, I want to know what we need to do if we use the custom-op-gpu image for CI testing.

> I guess the next step would be to make addons_gpu.sh invoke and run tests in the corresponding docker container.

Agreed, let's run addons_gpu.sh in the custom-op-gpu container. Can you set it up, please?

> Right now it's being called in a virtualenv.

Do you mean the script is invoked in a virtual environment inside a docker container? That sounds good, because it makes it convenient to support both Python 2 and 3 while using the same custom-op image. cc @seanpmorgan, any thoughts, Sean?
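
As a sketch of that idea, one container image could serve both interpreters; the venv paths and requirements file here are placeholders:

```bash
# Inside the custom-op container: one virtualenv per Python version.
virtualenv -p python2 /tmp/venv2
virtualenv -p python3 /tmp/venv3

# Install dependencies into each, then run the same test script
# against whichever interpreter the CI job selects.
/tmp/venv2/bin/pip install -r requirements.txt
/tmp/venv3/bin/pip install -r requirements.txt
```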

@gunan commented Jun 27, 2019

I think what Amit wanted to say was: Kokoro directly calls addons_gpu.sh.
You are free to make any changes to it, and the CI will just pick them up.
You can edit that script right here:
https://github.com/tensorflow/addons/blob/master/tools/ci_testing/addons_gpu.sh

@facaiy (Member Author) commented Jun 27, 2019

> Kokoro directly calls addons_gpu.sh.

I agree, and I think it's a good idea. But how do we run it in the corresponding docker image? I don't think we can do that without help from Amit. Or do you suggest we call the docker run command in addons_gpu.sh ourselves? That looks unsafe to me.

@gunan commented Jun 28, 2019 via email

@facaiy facaiy changed the title WIP: compile gpu kernel, and run gpu test cases compile gpu kernel, and run gpu test cases Jul 11, 2019
@facaiy facaiy requested review from a team, Squadrick and yifeif July 11, 2019 07:10
@facaiy (Member Author) commented Jul 11, 2019

@seanpmorgan @yifeif Sean, Yifei, can you take a look? I prefer to solve issue #118 step by step, so I've tried to make this PR a minimal change (most of the code is from https://github.com/tensorflow/custom-op, thanks Yifei), and it just works.

cc @Squadrick, who might be interested.

@facaiy facaiy changed the title compile gpu kernel, and run gpu test cases Set up gpu environment Jul 11, 2019
@facaiy facaiy changed the title Set up gpu environment set up GPU environment Jul 11, 2019
@seanpmorgan (Member) left a comment


Thanks, Yan, great work! Everything compiles well, but when testing the runtime of //tensorflow_addons/image:transform_ops_test, it never runs on the GPU (see #118 (comment)).

If you would like to expedite this PR, I'm okay with a requirements-gpu.txt and modifying config.sh. When I ran with tf-nightly-gpu-2.0-preview, the test executed successfully.
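
If we go that route, the GPU requirements file would presumably just swap in the GPU nightly mentioned above; a guessed sketch:

```bash
# Hypothetical requirements-gpu.txt: mirror requirements.txt but pin the
# GPU build of the nightly that worked in the successful run.
echo "tf-nightly-gpu-2.0-preview" > requirements-gpu.txt
pip install -r requirements-gpu.txt
```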

@facaiy facaiy force-pushed the BLD/add_cuda_py_test branch from d67a07e to 075e641 Compare July 12, 2019 06:04
@seanpmorgan (Member) commented

So tests are failing with OOM because we launch parallel jobs on a single card. Once #344 is merged, we'll see that everything is working, though 2 new tests fail when running on the GPU:

https://github.com/tensorflow/addons/blob/master/tensorflow_addons/seq2seq/beam_search_ops_test.py#L73

And all of the testSparseRepeatedIndices tests in:
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/weight_decay_optimizers_test.py

I'm okay with commenting these tests out and cutting new issues to address them.
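
Two common mitigations for this kind of OOM are sketched below; whether #344 uses exactly these flags is an assumption:

```bash
# Serialize test execution so parallel jobs don't fight over one card.
bazel test --local_test_jobs=1 //tensorflow_addons/...

# Or let TensorFlow grow GPU memory on demand instead of pre-allocating it all.
export TF_FORCE_GPU_ALLOW_GROWTH=true
```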

@facaiy facaiy requested a review from qlzh727 as a code owner July 12, 2019 23:18
@facaiy (Member Author) commented Jul 13, 2019

Thank you, Sean. I have temporarily disabled all those failing tests and filed corresponding issues to track them.

@seanpmorgan (Member) left a comment


Thanks! Appreciate the work towards this milestone!

@seanpmorgan seanpmorgan merged commit a7afaa7 into tensorflow:master Jul 13, 2019
@facaiy facaiy deleted the BLD/add_cuda_py_test branch July 13, 2019 00:41