DSA Transparent Offload (DTO) is a shared library for applications to transparently (i.e., without modification) use DSA (Data Streaming Accelerator). The library dynamically links to applications using "-ldto" linker options (requires recompilation). The library can also be linked using LD_PRELOAD environment variable (doesn't require recompilication). The library intercepts standard memcpy, memmove, memset, and memcmp standard API calls from the application and transparently uses DSA to perform those operations using DSA's memory move, fill, and compare operations. DTO is limited to synchronous offload model since these APIs have synchronous semantics.
DTO library works with DSA's Shared Work Queues (SWQs). DTO also works with multiple DSAs and uses them in round robin manner. During initialization, DTO library can either auto-discover all configured SWQs (potentially on multiple DSAs), or a list of specific SWQs that is specified using an environment variable DTO_WQ_LIST.
DTO library falls back to using standard APIs on CPU under following scenarios: a. DTO library is not able to find any DSA instances (e.g., not configured/provisioned) b. Memory operation size is smaller than a offload threshold (configurable using an environment variable DTO_MIN_BYTES) c. Not able to do work submission (e.g., reached ENQ retry threshold) due to WQ full d. DSA encounters a page-fault and completes partially (resulting in rest of the operation being completed using standard library on CPU)
To improve throughput for synchronous offload, DTO uses "pseudo asynchronous" execution using following steps.
- After intercepting the API call, DTO splits the API job into two parts; 1) CPU job and 2) DSA job. For example, a 64 KB memcpy may be split into 20 KB CPU job and 44 KB DSA job. The split fraction can be configured using an environment variable DTO_CPU_SIZE_FRACTION.
- DTO submits the DSA portion of the job to DSA. If DTO_IS_NUMA_AWARE=1 DTO uses work queues of DSA device located on the same numa node as buffer (memcpy/memmove - dest buffer, memcmp - ptr2) delivered to method - buffer-centric numa awareness. If DTO_IS_NUMA_AWARE=2 DTO uses work queues of DSA device located on the same numa node as calling thread cpu - cpu-centric numa awareness.
- In parallel, DTO performs the CPU portion of the job using std library on CPU.
- DTO waits for DSA to complete (if it hasn't completed already). The wait method can be configured using an environment variable DTO_WAIT_METHOD. The wait method can be one of the following: yield, busypoll, umwait, or tpause. The default is busypoll.
For some workloads, complete offloading to the DSA device can result in improved performance and reduce power consumption via the UMWAIT (or TPAUSE) instruction. In this case, DTO_CPU_SIZE_FRACTION can be set to 0.0, which means that the CPU job is 0 bytes and the entire job is offloaded to DSA and AUTO_ADJUST_KNOBS is set to 0. Then during the wait step, DTO will use the UMWAIT (or TPAUSE) instruction to wait for the DSA job to complete. UMWAIT / TPAUSE are instructions that allow the CPU to enter a low-power state while waiting for the job to complete. For UMWAIT, please refer to this enabling guide for more details.
DTO also implements a heuristic to auto tune dsa_min_bytes and cpu_size_fraction parameters based on current DSA load. For example, if DSA is heavily loaded, DTO tries to reduce the DSA load by increasing cpu_size_fraction and dsa_min_bytes. Conversely, if DSA is lightly loaded, DTO tries to increase the DSA load by decreasing cpu_size_fraction and dsa_min_bytes. The goal of the heuristic is to minimize the wait time in step 4 above while maximizing throughput. The auto-tuning can be enabled or disabled using an environment variable DTO_AUTO_ADJUST_KNOBS.
DTO can also be used to learn certain application characterics by building histogram of various API types and sizes. The histogram can be built using an environment variable DTO_COLLECT_STATS.
dto.c: DSA Transparent Offload shared library
dto-test.c: Sample multi-threaded test application
test.sh: Sample test script to showcase how to use DTO with dto-test app (using both "-ldto" and "LD_PRELOAD" methods)
dto-4-dsa.conf:  An example json config file for configuring DSAs
Following environment variables control the behavior of DTO library:
	DTO_USESTDC_CALLS=0/1, 1 (uses std c memory functions only), 0 (uses DSA along with std c lib call; in case of DSA page fault - reverts to std c lib call). Default is 0.
	DTO_COLLECT_STATS=0/1, 1 (enables stats collection - #of operations, avg latency for each API, etc.>, 0 (disables stats collection).
				Should be enabled for debugging/profiling only, not for perf evaluation (enabling it slows down the workload). Default is 0.
	DTO_WAIT_METHOD=<yield,busypoll,umwait> (specifies the method to use while waiting for DSA to complete operation, default is yield)
	DTO_MIN_BYTES=xxxx (specifies minimum size of API call needed for DSA operation execution, default is 16384 bytes)
	DTO_CPU_SIZE_FRACTION=0.xx (specifies fraction of job performed by CPU, in parallel to DSA). Default is 0.00
	DTO_AUTO_ADJUST_KNOBS=0/1 (disables/enables auto tuning of cpu_size_fraction and dsa_min_bytes parameters. 0 -- disable, 1 -- enable (default))
   DTO_IS_NUMA_AWARE=0/1/2 (disables/buffer-centric/cpu-centric numa awareness. 0 -- disable (default), 1 -- buffer-centric, 2 - cpu-centric)
	DTO_WQ_LIST="semi-colon(;) separated list of DSA WQs to use". The WQ names should match their names in /dev/dsa/ directory (see example below).
				If not specified, DTO will try to auto-discover and use all available WQs.
   DTO_DSA_MEMCPY=0/1, 1 (default) - DTO uses DSA to process memcpy, 0 - DTO uses system memcpy
   DTO_DSA_MEMMOVE=0/1, 1 (default) - DTO uses DSA to process memmove, 0 - DTO uses system memmove
   DTO_DSA_MEMSET=0/1, 1 (default) - DTO uses DSA to process memset, 0 - DTO use system memset
   DTO_DSA_MEMCMP=0/1, 1 (default) - DTO uses DSA to process memcmp, 0 - DTO use system memcmp
   DTO_DSA_CC=0/1, 1 (default) - DTO sets DSA Cache Control flag to 1 if DSA supports cache control, 0 - DTO sets DSA Cache Control flag to 0
   DTO_OVERLAPPING_MEMMOVE_ACTION=0/1 0 (default) DTO submits memmove operations with overlapping buffers entirely to CPU, 1 - entirely to DSA
   DTO_UMWAIT_DELAY=xxxx defines delay for umwait command (check max possible value at: /sys/devices/system/cpu/umwait_control/max_time), default is 100000
	DTO_LOG_FILE=<dto log file path> Redirect the DTO output to the specified file instead of std output (useful for debugging and statistics collection). file name is suffixed by process pid.
	DTO_LOG_LEVEL=0/1/2 controls the log level. higher value means more verbose logging (default 0).Although not the only usage models of DTO, the following are some common ones: Latency reduction - the goal is to minimize the latency of offloaded operations. Use the following settings: DTO_AUTO_ADJUST_KNOBS=1 (the CPU fraction setting is critical to this mode. The optimal value is dynamic so autotune algorithm needs to be enabled) DTO_WAIT_METHOD=busypoll
Power Reduction - the goal is to reduce power by offloading memory operations to DSA allowing the cpu core to go into a lower power state. This mode may reduce or increase the latency of operations depending on the load on DSA devices. DTO_AUTO_ADJUST_KNOBS=0 DTO_CPU_SIZE_FRACTION=0.0 (offload the entire operations to DSA) DTO_WAIT_METHOD=umwait
Cycle count Reduction - the goal is to reduce cpu cycles by offloading memory operations to DSA. This mode may reduce or increase the latency of operations depending on the load on DSA devices and on interaction with the OS scheduler and other threads. The idea is to offload operations to DSA and allow the OS to schedule other work while DSA perform the operation. DTO_AUTO_ADJUST_KNOBS=0 DTO_CPU_SIZE_FRACTION=0.0 (offload the entire operations to DSA) DTO_WAIT_METHOD=yield
Avoiding Cache polution - the goal is to avoid polluting the cache with data from the given process. DTO_DSA_CC=0 DTO_AUTO_ADJUST_KNOBS=0 DTO_CPU_SIZE_FRACTION=0.0 (offload the entire operations to DSA so none of the data is pulled into cache) DTO_WAIT_METHOD=yield or umwait (saves either cycles or power)
Pre-requisite packages:
Should use glibc version 2.36 or later for best DTO performance. You can use "ldd --version" command to check glibc version on your system. glibc versions less than 2.36 have a bug which reduces DTO performance.
On Fedora/CentOS/Rhel: kernel-headers, accel-config-devel, libuuid-devel, libnuma-devel
On Ubuntu/Debian: linux-libc-dev, libaccel-config-dev, uuid-dev, libnuma-dev
make libdto
make installBuilding the test application.
# When using -ldto method
make dto-test
# When using LD_PRELOAD method
make dto-test-wodto
You can initialize a DSA device using the following:
./accelConfig.sh <device-id> <enable (yes/no)> <wq-id>If you get invalid traffic class error, reload the idxd driver with tc_override=1:
modprobe -r idxd
modprobe idxd tc_override=1You can set the permissions of the DSA via chmod /dev/dsa/wqX.Y to allow users to submit to the DSA device.
1. Follow the steps in the Initializing DSA Devices section. Or you can make changes to test.sh or dto-4-dsa.conf to change DSA configuration if desired. test.sh configures DSA(s) using the config parameters in dto-4-dsa.conf.
2. Run the test.sh script to run the dto-test app (DTO linked using -ldto) or dto-test-wodto app (DTO linked using LD_PRELOAD).
3. Using with other applications (two ways to use it)
    3a. Using "-ldto" linker option (requires recompiling the application)
        i. Recompile the application with "-ldto" linker options
	ii. Setup DTO environment variables (examples below)
            export DTO_USESTDC_CALLS=0
            export DTO_COLLECT_STATS=1
            export DTO_WAIT_METHOD=busypoll
            export DTO_MIN_BYTES=8192
            export DTO_CPU_SIZE_FRACTION=0.33
            export DTO_AUTO_ADJUST_KNOBS=1
            export DTO_WQ_LIST="wq0.0;wq2.0;wq4.0;wq6.0"
            export DTO_IS_NUMA_AWARE=1
	iii. Run the application - (CacheBench example below)
    3b. Using LD_PRELOAD method (doesn not require recompiling the application)
	i. setup LD_PRELOAD environment variable to point to DTO library
            export  LD_PRELOAD=<libdto file path>:$LD_PRELOAD
	ii. Export all environment variables (similar to ii. in option 3a. above)
	iii. Run the application - (CacheBench example below)
	<CBENCH_DIR>/cachebench -json_test_config <json file> --progress_stats_file=dto.log --report_api_latency
4. Sample histogram given below (generated using DTO_COLLECT_STATS=1)
	i. Numbers under columns set, cpy, mov, and cmp show number of API calls or per-API completion latency for memset, memcpy, memmove, and memcmp respectively.
	ii. Numbers in bytes column show total bytes processed across all 4 API calls
------------------------------------------------------------------------------------------------------------------------------------------
DTO Run Time: 22826 ms
******** Number of Memory Operations ********
                     <*************** stdc calls    ***************> <*************** dsa (success) ***************> <*************** dsa (failed)  ***************> <***** failure reason *****>
Byte Range        -- set      cpy      mov      cmp      bytes        set      cpy      mov      cmp      bytes        set      cpy      mov      cmp      bytes        Retries PFs    Others
       0-4095     -- 24616626 12319310 7676731  244777209 4531255821  0        0        0        0        0            0        0        0        0        0            0      0      0
    4096-8191     -- 3        290      14       0        1842507      0        0        0        0        0            0        0        0        0        0            0      0      0
    8192-12287    -- 0        2        6        0        45579        64       176      2        0        2545968      0        2        6        0        25276        0      8      0
   12288-16383    -- 1        1        0        0        20756        1        137      0        0        1965343      1        1        0        0        11172        0      2      0
   16384-20479    -- 0        1        3        0        41974        0        126      2        0        2328434      0        1        3        0        24337        0      4      0
   32768-36863    -- 0        0        0        0        0            0        67       2        0        2378165      0        0        0        0        0            0      0      0
   53248-57343    -- 0        0        0        0        0            0        46       0        0        2545890      0        0        0        0        0            0      0      0
   98304-102399   -- 0        0        0        0        0            0        13       0        0        1309627      0        0        0        0        0            0      0      0
  106496-110591   -- 0        1        0        0        31637        0        8        0        0        869880       0        1        0        0        77586        0      1      0
  118784-122879   -- 0        0        0        0        0            0        9        0        0        1083519      0        0        0        0        0            0      0      0
  131072-135167   -- 0        0        2        0        175638       0        5        0        0        665475       0        0        2        0        86506        0      2      0
  143360-147455   -- 0        0        0        0        0            0        1        0        0        143399       0        0        0        0        0            0      0      0
  147456-151551   -- 0        0        0        0        0            0        1        0        0        148647       0        0        0        0        0            0      0      0
  159744-163839   -- 0        0        0        0        0            0        6        0        0        970186       0        0        0        0        0            0      0      0
  212992-217087   -- 0        0        11       0        1580248      0        4        2        0        1292252      0        0        11       0        779384       0      11     0
  278528-282623   -- 0        0        0        0        0            0        1        0        0        279719       0        0        0        0        0            0      0      0
  466944-471039   -- 0        0        0        0        0            0        1        0        0        467783       0        0        0        0        0            0      0      0
  528384-532479   -- 0        0        0        0        0            0        1        0        0        529927       0        0        0        0        0            0      0      0
  720896-724991   -- 0        0        0        0        0            1        0        0        0        720896       0        0        0        0        0            0      0      0
  999424-1003519  -- 0        0        0        0        0            0        1        0        0        1002439      0        0        0        0        0            0      0      0
   >=2093056      -- 0        1        0        0        1975911      0        0        0        0        0            0        1        0        0        973209       0      1      0
******** Average Memory Operation Latency (us)  ********
                     <******** stdc calls    ********> <******** dsa (success) ********> <******** dsa (failed)  ********> 
Byte Range        -- set      cpy      mov      cmp      set      cpy      mov      cmp      set      cpy      mov      cmp
       0-4095     -- 0.01     0.02     0.01     0.04     0        0        0        0        0        0        0        0
    4096-8191     -- 0.07     0.42     0.47     0        0        0        0        0        0        0        0        0
    8192-12287    -- 0        1.29     0.47     0        0.84     1.19     1.82     0        0        56.18    2.54     0
   12288-16383    -- 2.90     3.34     0        0        7.49     1.26     0        0        2.07     1.93     0        0
   16384-20479    -- 0        0.27     3.24     0        0        1.37     1.86     0        0        84.92    2.47     0
   32768-36863    -- 0        0        0        0        0        1.87     2.91     0        0        0        0        0
   53248-57343    -- 0        0        0        0        0        2.45     0        0        0        0        0        0
   98304-102399   -- 0        0        0        0        0        3.76     0        0        0        0        0        0
  106496-110591   -- 0        229.18   0        0        0        4.19     0        0        0        3.65     0        0
  118784-122879   -- 0        0        0        0        0        5.18     0        0        0        0        0        0
  131072-135167   -- 0        0        25.00    0        0        5.92     0        0        0        0        14.39    0
  143360-147455   -- 0        0        0        0        0        5.11     0        0        0        0        0        0
  147456-151551   -- 0        0        0        0        0        6.16     0        0        0        0        0        0
  159744-163839   -- 0        0        0        0        0        7.65     0        0        0        0        0        0
  212992-217087   -- 0        0        36.10    0        0        8.15     6.59     0        0        0        18.07    0
  278528-282623   -- 0        0        0        0        0        15.47    0        0        0        0        0        0
  466944-471039   -- 0        0        0        0        0        15.91    0        0        0        0        0        0
  528384-532479   -- 0        0        0        0        0        18.65    0        0        0        0        0        0
  720896-724991   -- 0        0        0        0        17.71    0        0        0        0        0        0        0
  999424-1003519  -- 0        0        0        0        0        28.42    0        0        0        0        0        0
   >=2093056      -- 0        422.15   0        0        0        0        0        0        0        272.87   0        0
## Usage Notes
When linking DTO using LD_PRELOAD environment variable special care is required when the application is called from within a shell script.
   - It is best to set the LD_PRELOAD variable as close to the invocation of the actual application. For environments that use nested shell scripts
     calls, it is ideal to set LD_PRELOAD within the script that invokes the application, and not in outer scripts.
   - Setting the LD_PRELOAD variable before invoking the shell script which eventually invokes the application can lead to the following problems:
      - Some shell commands rely on stdout output, for example the output of 'pwd', to set internal variables or environment variables. Setting 
        DTO_LOG_LEVEL > 0 can cause problems because text output by DTO can be interspersed with output from the shell script causing errors 
        in the script.
      - When the application is started by a script with #!<location of shell> which invokes another script with #!<location of shell>, for 
        unknown reasons DTO causes a segmentation fault during a memset operation on an 8K sized buffer. This can be avoided by setting the minimum 
        DTO size above 8K, or by avoiding this invocation sequence.