Skip to content

Conversation

@gabrieldemarmiesse
Copy link
Member

@gabrieldemarmiesse gabrieldemarmiesse commented Feb 23, 2020

See #1114

What is left to do:

  • Add try-except. If there are problems with loading the .so
    fallback to the pure python implementation and print a warning
    explaining the TF_ADDONS_PY_OPS and addons versions compatible with
    the current tf.
    May also flip the TF_ADDONS_PY_OPS switch automatically?
  • Recommend that contributors making new ops add pure python equivalent in the CONTRIBUTING

I think macos and windows users will be happy to use the activations functions on gpu :)

I also believe that the description of TF addons discouraged Windows and MacOS users to use addons when they want to use the gpu too. That's not good, because the majority of the components of addons work on windows and mac with a gpu.

@gabrieldemarmiesse
Copy link
Member Author

gabrieldemarmiesse commented Feb 23, 2020

Bazel is really not easy for python devs.
When locally, I'm doing

TF_ADDONS_NO_BUILD=1 pip install -e ./

I can do:

from tensorflow_addons.activations import hardshrink

and it works.

Is there a flag to make bazel respect basic python rules? This is really fustrating to spend so much time on this PR just to add a single python file.

@Squadrick
Copy link
Member

@gabrieldemarmiesse yeah, that happens with bazel due to the way it resolves path imports. Haven't found a fix for it yet.

@googlebot

This comment has been minimized.

@seanpmorgan
Copy link
Member

@googlebot I consent.

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

"tanhshrink.py",
],
data = [
"//tensorflow_addons:options.py",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that is this painful and unituitive. Sorry for any wasted time. We could raise this with the bazel team, but I believe this is fundamental to how this resolving works.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's linked to how bazel works and that makes sense for monorepos like tensorflow. I'm not convinced the added value of making the wheels + running the tests with bazel outweight the costs in addons. It makes totally sense to use it to build the SO though. I know we had a similar conversation a while ago, maybe I'll open an issue so that we can discuss it more. Anyway thanks for the fix! It helps a lot!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this can be discussed further. Typical contributions to Addons will/should fall into a subpackage that should have the BUILD already setup. This PR is a change that affects the entirety of the project so it's a bit more complicated. I agree we should encourage contributors to make these types of changes but we can look at and cons again to see if its worth splitting the C++ test/builds from python tests/builds.

@seanpmorgan
Copy link
Member

  • Add try-except. If there are problems with loading the .so
    fallback to the pure python implementation and print a warning
    explaining the TF_ADDONS_PY_OPS and addons versions compatible with
    the current tf.

This should be doable since it throws a python error instead of a crash. tensorflow.python.framework.errors_impl.NotFoundError

May also flip the TF_ADDONS_PY_OPS switch automatically?

  • Recommend that contributors making new ops add pure python equivalent in the CONTRIBUTING

Good idea, but not sure this is always feasible so maybe dependant on the subpackage. For example, image kernels may be hard to reproduce without depending on OpenCV or something.

I think macos and windows users will be happy to use the activations functions on gpu :)

Agree, but I think macos users are out of luck unless they're using a ROCm supported TF. AFAIK there hasn't been an NVIDIA GPU on macos system for a long time. I could be wrong on the current state of this:
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream

@gabrieldemarmiesse
Copy link
Member Author

gabrieldemarmiesse commented Feb 25, 2020

I'll implement the fallback in this PR. We can change the CONTRIBUTING in another one

There seem to be a lot of activity around ROCm in the tensorflow/tensorflow repo. Do you guys know what's the status? It'd be cool if TF were to support AMD officially :)

@gabrieldemarmiesse
Copy link
Member Author

Here is the warning message:

2020-02-25 16:58:17.446948: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-02-25 16:58:17.447611: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-25 16:58:17.448055: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (CRE1-L11685): /proc/driver/nvidia/version does not exist
2020-02-25 16:58:17.458128: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-25 16:58:17.507655: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2712000000 Hz
2020-02-25 16:58:17.510852: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fffe1fe08d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-25 16:58:17.511249: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
/mnt/c/Users/gdemarmi/Desktop/projects/addons/tensorflow_addons/options.py:48: RuntimeWarning: Traceback (most recent call last):
  File "/mnt/c/Users/gdemarmi/Desktop/projects/addons/tensorflow_addons/activations/hardshrink.py", line 49, in hardshrink
    return _hardshrink_custom_op(x, lower, upper)
  File "/mnt/c/Users/gdemarmi/Desktop/projects/addons/tensorflow_addons/activations/hardshrink.py", line 58, in _hardshrink_custom_op
    return _activation_so.ops.addons_hardshrink(x, lower, upper)
  File "/mnt/c/Users/gdemarmi/Desktop/projects/addons/tensorflow_addons/utils/resource_loader.py", line 49, in ops
    self._ops = tf.load_op_library(get_path_to_datafile(self.relative_path))
  File "/home/gdemarmi/softwares/python/anaconda/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 58, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /mnt/c/Users/gdemarmi/Desktop/projects/addons/tensorflow_addons/custom_ops/activations/_activation_ops.so: cannot open shared object file: No such file or directory


The hardshrink C++/CUDA custom op could not be loaded. 
For this reason, Addons will fallback to an implementation written 
in Python with public TensorFlow ops. There worst you might experience with
this is a moderate slowdown on GPU. There can be multiple 
reason for this loading error, one of them may be an ABI incompatibility between 
the TensorFlow installed on your system and the TensorFlow used to compile
TensorFlow Addons' custom ops. The stacktrace generated when loading the 
shared object file was displayed above.

If you want this warning to disappear, either make sure the TensorFlow installed
is compatible with this version of Addons, or tell TensorFlow Addons to 
prefer using Python implementations and not custom C++/CUDA ones. You can do that 
by changing the TF_ADDONS_PY_OPS flag
either with the environment variable:

TF_ADDONS_PY_OPS=1 python my_script.py

or in your code, after your imports:

import tensorflow_addons as tfa
import ...
import ...

tfa.options.TF_ADDONS_PY_OPS = True


  warnings.warn(warning_msg, RuntimeWarning)

@boring-cyborg boring-cyborg bot added the build label Feb 25, 2020
Copy link
Member

@seanpmorgan seanpmorgan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very close to LGTM. Thanks for the effort on this. Few comments, I think we need to clarify that this is not entirely a GPU issue. The slowdown will occur for not loading CPU ops as well.

"tanhshrink.py",
],
data = [
"//tensorflow_addons:options.py",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think this can be discussed further. Typical contributions to Addons will/should fall into a subpackage that should have the BUILD already setup. This PR is a change that affects the entirety of the project so it's a bit more complicated. I agree we should encourage contributors to make these types of changes but we can look at and cons again to see if its worth splitting the C++ test/builds from python tests/builds.

Copy link
Member

@seanpmorgan seanpmorgan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost LGTM. Minor edits.

One last question I have is the implications of serializing and deserializing these ops. For example if someone trained a model with custom-ops and then someone else loaded using py_ops what guarantees do we have that they behave the same? I suppose as long as we have test cases ensuring they're the same?

gabrieldemarmiesse and others added 3 commits February 27, 2020 05:57
Co-Authored-By: Sean Morgan <[email protected]>
Co-Authored-By: Sean Morgan <[email protected]>
Co-Authored-By: Sean Morgan <[email protected]>
@gabrieldemarmiesse
Copy link
Member Author

I'm unsure about serialization. I think we need another opinion on this one.

@bot-of-gabrieldemarmiesse

@facaiy @seanpmorgan @tensorflow/sig-addons-maintainers

You are owners of some files modified in this pull request.
Would you kindly review the changes whenever you have the time to?
Thank you very much.

@gabrieldemarmiesse
Copy link
Member Author

Sorry about the spam. My bot is training.

@Squadrick
Copy link
Member

Almost LGTM. Minor edits.

One last question I have is the implications of serializing and deserializing these ops. For example, if someone trained a model with custom-ops and then someone else loaded using py_ops what guarantees do we have that they behave the same? I suppose as long as we have test cases ensuring they're the same?

That's the only way I can see it work. Also, in verify_funcs_are_equivalent, atol=1e-4 is a significant error bound that can't be blamed on fp-approx, the default value for atol is 1e-6, and I think we should aim for lower.

Copy link
Member

@seanpmorgan seanpmorgan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants