Skip to content

nerdsane/datafusion-gpu

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Datafusion GPU

This repo intends to showcase the capabilities of wiring up Apache Datafusion with GPU execution runtimes in order to speed up heavy computations.

Objective

The main objective is not to provide a wide set of execution nodes or math functions that can run on GPU, but trying out different technologies to run a single aggregation function, and see what are the benefits and drawbacks of each approach.

For that, two approaches where followed:

Compiling compute kernels at runtime with CubeCL

This approach uses https://github.com/tracel-ai/cubecl for writing kernels directly in Rust code, which get compiled down to different backends, like CUDA or WGPU.

Example here

Advantages

  • Write the kernel once, and use it for any datatype and several GPU technologies
  • Use Rust for writing the kernel, no need to learn hardware-specific languages

Disadvantages

  • Small ecosystem, lack of documentation, lack of examples, immature technology
  • Bad performance (this could be on me)
  • Bugs? (got my laptop bricked several times trying to run some kernels)
  • Certain abstractions very tailored to working with Tensors rather than 1d arrays

Writing CUDA kernels by hand and feeding them data with cudarc

This approach uses https://github.com/coreylowman/cudarc with some handwritten CUDA kernels. The library allows feeding buffers to the GPU and scheduling them to be executed using the compiled handwritten kernel.

Example here

Advantages

  • More control over the kernel code that gets executed on the GPU
  • Good performance
  • Wide CUDA ecosystem

Disadvantages

  • Works only on CUDA devices
  • Making a kernel work for different datatypes is not supported out of the box
  • Needs knowledge about writing CUDA code

Results

Given the following conditions:

  • Measured on a g4dn.xlarge AWS instance with 4vCPU and a T4 GPU
  • In-memory table called types with 1000000 entries with the following schema:
+--------+-------+-----+
| string | float | int |
+--------+-------+-----+
  • Datafusion runtime with two sum aggregation function variants:
    • sum_cubecl: using plane reduction kernel from cudarc
    • sum_cudarc: using a handwritten CUDA kernel based on a shared memory algorithm
  • Code run with the following command: cargo run --release --features cuda -- -l 1000000
Query Execution time
SELECT sum(float) FROM types ~7.5ms
SELECT sum_cudarc(float) FROM types ~2ms
SELECT sum_cubecl(float) FROM types ~440ms

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Rust 100.0%