-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
DataFusion makes many IO requests
Optimizing performance for DataFusion often involves reviewing the IO patterns, especially for remote storage like AWS S3, and optimizing first requires understanding the pattern.
We often use datafusion-cli to debug code that comes with DataFusion, such as the ListingTable even though many (most?) users don't use datafusion-cli directly
@BlakeOrth sums the usecase nicely here
I'm ultimately an API user, not a CLI user. I've been using a hacky-instrumented CLI here to help give a common tool and example(s) of potential improvements.
Exposing additional metrics around where DataFusion is spending its time at the API level (and in turn through the CLI) does seem very useful to me though.
I personally had to rely on a mix of production metrics for our object storage, doing off-cpu-time profiling, and the aforementioned hacked in timing instrumentation, to help me understand that listing files and collecting their object metadata was taking a non-trivial amount of time...
Describe the solution you'd like
I would like some way to easily show the object store operations made by datafusion-cli, when some of profiling is enabled.
Describe alternatives you've considered
UX Suggestion
I suggest we take inspiration from @kosiew and #17021 -- specifically something like
-- Enable profiling. Any subsequent query will run as normal and then print out a object_store requests trace
\object_store_profiling
...
-- Disable profiling (toggle)
\object_store_profilingThe output should have relevant information for each object store request. Relevant information (maybe there is more)
- Start time: timestamp e.g 2025-01-01T10:20:30
- operation: what object store operation (e.g.
GET,LIST) - request ranges: what was requested (e.g.
1000..2000) - response_size: what size object was returned, in bytes (e.g.
1000000for 1MB) - path: What path was requested
- duration: how long did the operation take (e.g.
0.500for 500ms)
> select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row(s) fetched.
Elapsed 3.579 seconds.
Object Store Requests
2025-01-01T10:20:30 operation=LIST duration=0.050 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-01-01T10:20:30 operation=GET duration=0.532 path=hits_compatible/athena_partitioned/hits_1.parquet ranges="123..456" response_size=432342
Bonus points (maybe a follow on PR) for a more machine readable format (like JSON)
Implementation Suggestion
Using the ObjectStore API, you can wrap an ObjectStore instance and intercept all IO requests and instrument / observe them however you want, for example the InstrumentedObjectStore in
/// A wrapper around an `ObjectStore` that instruments all public methods with tracing.
#[derive(Clone, Debug)]
struct InstrumentedObjectStore {
inner: Arc<dyn ObjectStore>,
name: String,
}https://github.com/datafusion-contrib/datafusion-tracing/blob/8fc214b192ad67114742f8a582292bc06e2b5247/instrumented-object-store/src/instrumented_object_store.rs#L138 from @geoffreyclaude and @gabotechs
I recommend a similar approach in datafusion-cli -- when profiling is desired, use a wrapped ObjectStore that saves the requests for subsequent displayBonu
Additional context
No response