Skip to content

Conversation

@jzhoulon
Copy link
Contributor

Add PluggableDevice mechanism implementation, including plugin initialization, PluggableDevice create and compute, and also add a new flag(is_pluggable_device) in DeviceFactory to do some low-level device specialization.

@google-ml-butler google-ml-butler bot added the size:XL CL Change Size:Extra Large label Dec 17, 2020
@google-cla google-cla bot added the cla: yes label Dec 17, 2020
@jzhoulon
Copy link
Contributor Author

@penpornk @annarev I have submit the pluggable device implementation PR, please help to have a review. thanks very much!

@kkimdev kkimdev requested review from annarev and penpornk and removed request for kkimdev and rohan100jain December 17, 2020 08:20
@gbaned gbaned self-assigned this Dec 17, 2020
@Jianhui-Li
Copy link

@gbaned gbaned added the awaiting review Pull request awaiting review label Dec 21, 2020
@penpornk penpornk requested a review from sanjoy December 21, 2020 20:51
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Dec 31, 2020
@gbaned gbaned requested a review from annarev January 5, 2021 17:53
@gbaned gbaned added the awaiting review Pull request awaiting review label Jan 5, 2021
@gbaned gbaned requested review from sanjoy and removed request for sanjoy January 13, 2021 15:17
CHECK(process_state_);
const string& allocator_type = options.allocator_type();
se::Platform* platform = PluggableDeviceMachineManager(platform_name_);
mutex_lock lock(mu_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please correct me if I missed it, it seems like we are using the default BFC memory allocator for device allocations, and not going through the SE interface discussed in the Stream Executor RFC:

void (*create_allocator)(const SP_Platform* platform,
                           SE_CreateAllocatorParams* params, TF_Status* status);
  void (*destroy_allocator)(const SP_Platform* platform,
                            SP_Allocator* allocator,
                            SP_AllocatorFns* allocator_fns);

Also I think custom allocator support might need to be added. I can follow up with another PR to add support for that. Please let me know if I missed it elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are right, custom allocator support is needed in streamexecutor c api. we are appreciate that you can contribute custom allocator feature in PluggableDevice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @annarev has already started an internal PR for this, but was deciding between

  1. Having a separate registration for the custom allocator, e.g.,
void TF_InitAllocatorPlugin(TF_AllocatorRegistrationParams* params, TF_Status* status);
  1. Adding them as part of the StreamExecutor C API surface, e.g.,
// For BFC.
void (*create_allocator)(const SP_Platform* platform, const SP_Device* device,
                         SE_CreateAllocatorParams* params, TF_Status* status);
void (*destroy_allocator)(const SP_Platform* platform,
                          const SP_Device* device, SP_Allocator* allocator);
void (*create_allocator_fns)(const SP_Platform* platform,
                             SP_AllocatorFns* allocator_fns,
                             TF_Status* status);
void (*destroy_allocator_fns)(const SP_Platform* platform,
                              SP_AllocatorFns* allocator_fns);

// For custom allocator.
void (*create_custom_allocator)(const SP_Platform* platform,
                                const SP_Device* device,
                                SE_CreateCustomAllocatorParams* params,
                                TF_Status* status);
void (*destroy_custom_allocator)(const SP_Platform* platform,
                                 const SP_Device* device,
                                 SP_CustomAllocator* allocator);
void (*create_custom_allocator_fns)(const SP_Platform* platform,
                                    SP_CustomAllocatorFns* allocator_fns,
                                    TF_Status* status);
void (*destroy_custom_allocator_fns)(const SP_Platform* platform,
                                     SP_CustomAllocatorFns* allocator_fns);

(Only the top 4 or the bottom 4 can be set.)

It has been put on hold for a while. I think @annarev plans to revisit this in the next few weeks. (Please correct me if I'm wrong.)

Let us know if you have any preference/suggestion. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply @penpornk and @annarev. Either one would be okay, the StreamExecutor API is what we had discussed during the RFC, so locally I was experimenting with that. Option (1) does provide a cleaner interface. Has Anna already made the change we can try out with ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linking this here for wider coverage:
Anna has a PR for custom allocator up for comments: #47598

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Jan 18, 2021
@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Jan 19, 2021
Copy link
Member

@penpornk penpornk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm so sorry for the delay! Submitting some comments first.

CHECK(process_state_);
const string& allocator_type = options.allocator_type();
se::Platform* platform = PluggableDeviceMachineManager(platform_name_);
mutex_lock lock(mu_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @annarev has already started an internal PR for this, but was deciding between

  1. Having a separate registration for the custom allocator, e.g.,
void TF_InitAllocatorPlugin(TF_AllocatorRegistrationParams* params, TF_Status* status);
  1. Adding them as part of the StreamExecutor C API surface, e.g.,
// For BFC.
void (*create_allocator)(const SP_Platform* platform, const SP_Device* device,
                         SE_CreateAllocatorParams* params, TF_Status* status);
void (*destroy_allocator)(const SP_Platform* platform,
                          const SP_Device* device, SP_Allocator* allocator);
void (*create_allocator_fns)(const SP_Platform* platform,
                             SP_AllocatorFns* allocator_fns,
                             TF_Status* status);
void (*destroy_allocator_fns)(const SP_Platform* platform,
                              SP_AllocatorFns* allocator_fns);

// For custom allocator.
void (*create_custom_allocator)(const SP_Platform* platform,
                                const SP_Device* device,
                                SE_CreateCustomAllocatorParams* params,
                                TF_Status* status);
void (*destroy_custom_allocator)(const SP_Platform* platform,
                                 const SP_Device* device,
                                 SP_CustomAllocator* allocator);
void (*create_custom_allocator_fns)(const SP_Platform* platform,
                                    SP_CustomAllocatorFns* allocator_fns,
                                    TF_Status* status);
void (*destroy_custom_allocator_fns)(const SP_Platform* platform,
                                     SP_CustomAllocatorFns* allocator_fns);

(Only the top 4 or the bottom 4 can be set.)

It has been put on hold for a while. I think @annarev plans to revisit this in the next few weeks. (Please correct me if I'm wrong.)

Let us know if you have any preference/suggestion. :)

@google-ml-butler google-ml-butler bot added the ready to pull PR ready for merge process label Mar 11, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 11, 2021
@penpornk penpornk added the kokoro:force-run Tests on submitted change label Mar 11, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 11, 2021
@jzhoulon
Copy link
Contributor Author

@penpornk Seems still one checks failed(MacOs CPU python3), It seems a strange error(one is direct_session, another is grappler), I am not sure whether it is related with this PR since I have no MacOS machine to reproduce, can you help to check this also? I will also check the possible issue. Thanks very much!

@penpornk
Copy link
Member

@jzhoulon This check failed 4 hours ago too (same errors) when it passed elsewhere. So I reran it. It's still failing so it's likely related to this PR. I'll try to find how I can reproduce this from my side as well.

@kulinseth
Copy link
Contributor

kulinseth commented Mar 12, 2021

@jzhoulon This check failed 4 hours ago too (same errors) when it passed elsewhere. So I reran it. It's still failing so it's likely related to this PR. I'll try to find how I can reproduce this from my side as well.

@jzhoulon and @penpornk , I have pulled in the latest patches and will try to repro on my side. I will update if I find something locally.

@jzhoulon
Copy link
Contributor Author

jzhoulon commented Mar 12, 2021

@jzhoulon This check failed 4 hours ago too (same errors) when it passed elsewhere. So I reran it. It's still failing so it's likely related to this PR. I'll try to find how I can reproduce this from my side as well.

@jzhoulon and @penpornk , I have pulled in the latest patches and will try to repro on my side. I will update if I find something locally.

@kulinseth Thanks for the help! I have some findings , though I still didn't figure out why only MacOS has this issue
it was caused by framework:allocator be built into several modules, and make the static variable (cpu_allocator_collect_full_stats) duplicated and have inconsistent stats in several modules. I delete the dependence of framework:allocator in common_runtime:bfc_allocator and device:device_mem_allocator), the test can pass now.

Details:
GPU and PluggableDevice's bfc_allocator deps on common_runtime:bfc_allocator and device:device_mem_allocator, both of these modules(common_runtime:bfc_allocator && device::device_mem_allocator) depends on framework:allocator. In framework/allocator.cc, it has static variable(cpu_allocator_collect_full_stats), it is supposed to be global unique, however, I printed its address, it has two address(means it has duplicated variable) when the test case failure, as a result, cpu_alloc is always an old allocator(not new TrackingAllocator) in cpu_allocator_base. And I remove the dependence (framework::allocator) from common_runtime:bfc_allocator and device:device_mem_allocator, the build seems also pass(already depends on framework headers), and the test case pass.

Code:
56 // If true, cpu allocator collects full stats.
57 static bool cpu_allocator_collect_full_stats = false;
58
59 void EnableCPUAllocatorFullStats() { cpu_allocator_collect_full_stats = true; 
      printf("set cpu_allocator_collect_full_state_ptr = %p\n", &cpu_allocator_collect_full_stats);
}

68 Allocator* cpu_allocator_base() {
69   static Allocator* cpu_alloc =
70       AllocatorFactoryRegistry::singleton()->GetAllocator();
71   // TODO(tucker): This really seems wrong.  It's only going to be effective on
72   // the first call in a process (but the desired effect is associated with a
73   // session), and we probably ought to be tracking the highest level Allocator,
74   // not the lowest.  Revisit the advertised semantics of the triggering option.
75   printf("get cpu_allocator_collect_full_stats_ptr = %p\n",  &cpu_allocator_collect_full_stats);
76
77   if (cpu_allocator_collect_full_stats && !cpu_alloc->TracksAllocationSizes()) {
78     cpu_alloc = new TrackingAllocator(cpu_alloc, true);
79   }
80   return cpu_alloc;
81 }

debug log:

2021-03-12 14:09:32.971918: F tensorflow/core/grappler/clusters/single_machine_test.cc:55] Non-OK-status: cluster_->Provision() status: Invalid argument: Tracking allocation is not enabled.
set cpu_allocator_collect_full_state_ptr = 0x11629b330

get cpu_allocator_collect_full_stats_ptr = 0x11fc59b90
*** Received signal 6 ***
*** BEGIN MANGLED STACK TRACE ***

fix patch:

--- a/tensorflow/core/common_runtime/BUILD
+++ b/tensorflow/core/common_runtime/BUILD
@@ -1697,7 +1697,7 @@ cc_library(
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
         "//tensorflow/core:protos_all_cc",
-        "//tensorflow/core/framework:allocator",
+        #"//tensorflow/core/framework:allocator",
         "//tensorflow/core/profiler/lib:traceme",
         "@com_google_absl//absl/container:flat_hash_set",
        "@com_google_absl//absl/strings",
diff --git a/tensorflow/core/common_runtime/device/BUILD b/tensorflow/core/common_runtime/device/BUILD
index 3a49500053c..f0cd9030954 100644
--- a/tensorflow/core/common_runtime/device/BUILD
+++ b/tensorflow/core/common_runtime/device/BUILD
@@ -75,7 +75,7 @@ tf_cuda_library(
         ":device_id",
         "//tensorflow/core:lib",
         "//tensorflow/core:lib_internal",
-        "//tensorflow/core/framework:allocator",
+        #"//tensorflow/core/framework:allocator",
         "//tensorflow/core/platform:stream_executor",

(edited by penpornk@ to fix markdown formatting error which makes the patch hard to read)

@penpornk
Copy link
Member

@jzhoulon Thank you for the quick findings! I think the duplicate symbol might be because both PluggableDevice runtime and GPU runtime are linked statically. Could you please help test linking PluggableDevice runtime dynamically? Our tpu_runtime is also linked dynamically.

@kulinseth Thank you for your help as well!

I've pulled the PR in to test internally and am fixing other failures (which obscure this failure on Mac OS CI). I'll get to this once I'm done with other failures.

@kulinseth
Copy link
Contributor

@jzhoulon Thank you for the quick findings! I think the duplicate symbol might be because both PluggableDevice runtime and GPU runtime are linked statically. Could you please help test linking PluggableDevice runtime dynamically? Our tpu_runtime is also linked dynamically.

@kulinseth Thank you for your help as well!

I've pulled the PR in to test internally and am fixing other failures (which obscure this failure on Mac OS CI). I'll get to this once I'm done with other failures.

Thanks @jzhoulon for the findings. I was also curious why this was Mac only failure. I looked at the tests and they have the "no_gpu" tag to it (although Ubuntu CPU would have caught it too). Its also possible that the Xcode toolchain with compiler flags have different behavior with duplicate symbols.

@kulinseth
Copy link
Contributor

@jzhoulon I had a question regarding the Pluggable impl. Did you test the pluggable implementation with host_memory_allocate/deallocate APIs being exercised ? I am locally not seeing them getting used, was curious if we need to set some special flag.

copybara-service bot pushed a commit that referenced this pull request Mar 16, 2021
…ream_executor_test to a common file, so test_pluggable_device can use it too.

This change is a preparation for PR #45784. In the PR, the test `LibraryPluggableDeviceLoadFunctions` in //tensorflow/c:c_api_experimental_test eventually verifies that all structs in `test_pluggable_device.so` are populated properly.

PR #45784: #45784

PiperOrigin-RevId: 363065561
Change-Id: Ic3e019ea21d998922ff8fd84ebd6faabdaccfd5d
@jzhoulon
Copy link
Contributor Author

@jzhoulon I had a question regarding the Pluggable impl. Did you test the pluggable implementation with host_memory_allocate/deallocate APIs being exercised ? I am locally not seeing them getting used, was curious if we need to set some special flag.

@kulinseth Thanks for the review. the host_memory_allocate is used in PluggableDeviceProcessState::GetPluggableDeviceHostAllocator, it will call DeviceHostAllocator to create host allocator, DeviceHostAllocator will call streamexecutor's HostMemAllocator. no, you don't need to pass any flags in plugin, only to register the host_memory_allocate func ptr

    SubAllocator* sub_allocator = new DeviceHostAllocator(
        se, numa_node, pluggable_device_host_alloc_visitors_[numa_node],
        pluggable_device_host_free_visitors_[numa_node]);
    int64 pluggable_device_host_mem_limit_in_mb = -1;

@kulinseth
Copy link
Contributor

@kulinseth Thanks for the review. the host_memory_allocate is used in PluggableDeviceProcessState::GetPluggableDeviceHostAllocator, it will call DeviceHostAllocator to create host allocator, DeviceHostAllocator will call streamexecutor's HostMemAllocator. no, you don't need to pass any flags in plugin, only to register the host_memory_allocate func ptr

    SubAllocator* sub_allocator = new DeviceHostAllocator(
        se, numa_node, pluggable_device_host_alloc_visitors_[numa_node],
        pluggable_device_host_free_visitors_[numa_node]);
    int64 pluggable_device_host_mem_limit_in_mb = -1;

Thanks @jzhoulon. I do see the DeviceHostAllocator getting called (using the BFCAllocator) and registered, and we have the host_memory_allocate and deallocate functions registered but they are not getting invoked (rest of the alloc functions are working fine, like device mem allocate()/deallocate()) . Are we missing any set_on_host() or force_gpu_compatible flags to be set ? Is there any dependence on the numa configuration or memory stats.

I am curious in your backend impl did you see any issues with using host_memory allocation features.

@jzhoulon
Copy link
Contributor Author

jzhoulon commented Mar 16, 2021

@kulinseth Thanks for the review. the host_memory_allocate is used in PluggableDeviceProcessState::GetPluggableDeviceHostAllocator, it will call DeviceHostAllocator to create host allocator, DeviceHostAllocator will call streamexecutor's HostMemAllocator. no, you don't need to pass any flags in plugin, only to register the host_memory_allocate func ptr

    SubAllocator* sub_allocator = new DeviceHostAllocator(
        se, numa_node, pluggable_device_host_alloc_visitors_[numa_node],
        pluggable_device_host_free_visitors_[numa_node]);
    int64 pluggable_device_host_mem_limit_in_mb = -1;

Thanks @jzhoulon. I do see the DeviceHostAllocator getting called (using the BFCAllocator) and registered, and we have the host_memory_allocate and deallocate functions registered but they are not getting invoked (rest of the alloc functions are working fine, like device mem allocate()/deallocate()) . Are we missing any set_on_host() or force_gpu_compatible flags to be set ? Is there any dependence on the numa configuration or memory stats.

I am curious in your backend impl did you see any issues with using host_memory allocation features.

@kulinseth I just confirmed that host_memory_allocate can be invoked.(I add printf in hsot_memory_allocate and printed), can you try force_gpu_compatible to see whether it can be invoke? although I didn't find we use this option in our model, but it seems be invoked..

Allocator* PluggableDevice::GetAllocator(AllocatorAttributes attr) {
  DCHECK(cpu_allocator_) << "CPU allocator must be set";
  if (attr.on_host()) {
    if (attr.gpu_compatible() || force_gpu_compatible_) {
      PluggableDeviceProcessState* ps =
          PluggableDeviceProcessState::singleton(device_type(), platform_name_);
      return ps->GetPluggableDeviceHostAllocator(0);
    } else {
      return cpu_allocator_;
    }
  } else {
    return device_allocator_;
  }
}

copybara-service bot pushed a commit that referenced this pull request Mar 16, 2021
… from stream_executor_test to a common file, so test_pluggable_device can use it too.

This change is a preparation for PR #45784. In the PR, the test `LibraryPluggableDeviceLoadFunctions` in //tensorflow/c:c_api_experimental_test eventually verifies that all structs in `test_pluggable_device.so` are populated properly.

PR #45784: #45784

PiperOrigin-RevId: 363205807
Change-Id: I903882bf82391ebc3493ecfa3c111dfc0916b9cb
copybara-service bot pushed a commit that referenced this pull request Mar 16, 2021
…plicate symbols in PR #45784 (PluggableDevice implementation).

PiperOrigin-RevId: 363299146
Change-Id: I8f7a9dbbf491020eb0cec54616c6f7c181887924
copybara-service bot pushed a commit that referenced this pull request Mar 17, 2021
… symbols in PR #45784 (PluggableDevice implementation).

PiperOrigin-RevId: 363302623
Change-Id: Ibb31404a32151989e3f5f3bec2663bf275c0398f
copybara-service bot pushed a commit that referenced this pull request Mar 17, 2021
…ream_executor_test to a common file, so test_pluggable_device can use it too.

This change is a preparation for PR #45784. In the PR, the test `LibraryPluggableDeviceLoadFunctions` in //tensorflow/c:c_api_experimental_test eventually verifies that all structs in `test_pluggable_device.so` are populated properly.

PR #45784: #45784

PiperOrigin-RevId: 363315826
Change-Id: I36391180bbc0aca6d41575d5b6f0299a5e99af76
@google-ml-butler google-ml-butler bot removed the ready to pull PR ready for merge process label Mar 18, 2021
copybara-service bot pushed a commit that referenced this pull request Mar 18, 2021
…ementation

PiperOrigin-RevId: 363579316
Change-Id: I2ac70795e5ab14fce9c788ba00dd1923ff72a26d
@penpornk
Copy link
Member

This PR was merged in 3a3878f. I'm closing the PR now. Thank you very much everyone for the hard work!
(We'll see in a day or two whether it will stick.)

@kulinseth
Copy link
Contributor

@kulinseth I just confirmed that host_memory_allocate can be invoked.(I add printf in hsot_memory_allocate and printed), can you try force_gpu_compatible to see whether it can be invoke? although I didn't find we use this option in our model, but it seems be invoked..

Thanks so much for checking.

"//tensorflow/core/platform:stream_executor",
"//tensorflow/stream_executor:event",
"//tensorflow/stream_executor:kernel",
] + if_static([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jzhoulon and @penpornk
This MacOS test failure commit in pluggable impl seems to have caused regression on mac platforms to register pluggable device. When _pywrap_tensorflow .so loads up the plugin it goes through the device initialization fine then it hands over to libtensorflow_framework dylib to query the plugin handle using the Platform name in MultiPlatformManager . And during this part it fails with "Platform " not found. Seems like a linker issue where the plugin registry is not getting shared across the so and dylib.

Reverting this locally workarounds it. Will create a PR with proper fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes size:XL CL Change Size:Extra Large

Projects

None yet

Development

Successfully merging this pull request may close these issues.