Skip to content

[datafusion-cli] Add a way to see what object store requests are made #17207

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

DataFusion makes many IO requests

Optimizing performance for DataFusion often involves reviewing the IO patterns, especially for remote storage like AWS S3, and optimizing first requires understanding the pattern.

We often use datafusion-cli to debug code that comes with DataFusion, such as the ListingTable even though many (most?) users don't use datafusion-cli directly

@BlakeOrth sums the usecase nicely here

I'm ultimately an API user, not a CLI user. I've been using a hacky-instrumented CLI here to help give a common tool and example(s) of potential improvements.

Exposing additional metrics around where DataFusion is spending its time at the API level (and in turn through the CLI) does seem very useful to me though.

I personally had to rely on a mix of production metrics for our object storage, doing off-cpu-time profiling, and the aforementioned hacked in timing instrumentation, to help me understand that listing files and collecting their object metadata was taking a non-trivial amount of time...

Describe the solution you'd like

I would like some way to easily show the object store operations made by datafusion-cli, when some of profiling is enabled.

Describe alternatives you've considered

UX Suggestion

I suggest we take inspiration from @kosiew and #17021 -- specifically something like

-- Enable profiling. Any subsequent query will run as normal and then print out a object_store requests trace
\object_store_profiling
...

-- Disable profiling (toggle)
\object_store_profiling

The output should have relevant information for each object store request. Relevant information (maybe there is more)

  • Start time: timestamp e.g 2025-01-01T10:20:30
  • operation: what object store operation (e.g. GET, LIST)
  • request ranges: what was requested (e.g. 1000..2000)
  • response_size: what size object was returned, in bytes (e.g. 1000000 for 1MB)
  • path: What path was requested
  • duration: how long did the operation take (e.g. 0.500 for 500ms)
> select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
+----------+
| count(*) |
+----------+
| 1000000  |
+----------+
1 row(s) fetched.
Elapsed 3.579 seconds.

Object Store Requests
2025-01-01T10:20:30 operation=LIST duration=0.050 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-01-01T10:20:30 operation=GET duration=0.532 path=hits_compatible/athena_partitioned/hits_1.parquet ranges="123..456" response_size=432342

Bonus points (maybe a follow on PR) for a more machine readable format (like JSON)

Implementation Suggestion

Using the ObjectStore API, you can wrap an ObjectStore instance and intercept all IO requests and instrument / observe them however you want, for example the InstrumentedObjectStore in

/// A wrapper around an `ObjectStore` that instruments all public methods with tracing.
#[derive(Clone, Debug)]
struct InstrumentedObjectStore {
    inner: Arc<dyn ObjectStore>,
    name: String,
}

https://github.com/datafusion-contrib/datafusion-tracing/blob/8fc214b192ad67114742f8a582292bc06e2b5247/instrumented-object-store/src/instrumented_object_store.rs#L138 from @geoffreyclaude and @gabotechs

I recommend a similar approach in datafusion-cli -- when profiling is desired, use a wrapped ObjectStore that saves the requests for subsequent displayBonu

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requesthelp wantedExtra attention is needed

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions