-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
While testing #16300, I (re-noticed) that datafusion-cli is really slow when reading remote files
For example,
The initial table creation takes 7.5 seconds on my pretty crappy connection:
DataFusion CLI v48.0.0
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
0 row(s) fetched.
Elapsed 7.492 seconds.However then simple queries just to get the count take 8-10 seconds 😱
> select count(*) from nyc_taxi_rides;
+------------+
| count(*) |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 8.945 seconds.> select count(*) from nyc_taxi_rides;
+------------+
| count(*) |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 10.456 seconds.I am almost certain this delay is due to having to read the footers of the parquet files for each query.
Note the speed is much faster when collect_statistics is on
> set datafusion.execution.collect_statistics = true;
0 row(s) fetched.
Elapsed 0.012 seconds.
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
0 row(s) fetched.
Elapsed 7.770 seconds.The first query is still slow for some reason: (5 seconds)
> select count(*) from nyc_taxi_rides;
+------------+
| count(*) |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 5.114 seconds.But subsequent queries are quite fast:
> select count(*) from nyc_taxi_rides;
+------------+
| count(*) |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 0.297 seconds.Describe the solution you'd like
I would like:
datafusion-clito be faster for such queriesdatafusion-clito be an easy to follow model for how to cache metadata when working with ListingTabe that others who build with DataFusion could follow
Describe alternatives you've considered
I think the obvious thing that is needed is a cache for the ParquetMetadata
I think the actual cache should be in datafusion-cli but NOT in the datafusion core crate as I think what and how to cache will be different across systems.
What I envision is:
- APIs in the ListingTable / RuntimeEnv / etc for adding caching of ParquetMetadata
- An implementation of those APIs in
datafusion-cli - Bonus Points: documentation / examples that show how to use those APIs in other system
There is some vestigal code in the cache_manager crate that I think could provide a home for such caching APIs: * https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/index.html
Additional context
Related issues