Description
Is your feature request related to a problem or challenge?
When scanning partitioned files, there are scenarios where runtime-generated values (not persisted in the files) need to be attached to each RecordBatch.
Currently, when a partition contains multiple files, loadNextBatch has no context about which file it is returning rows from.
This makes it impossible to append per-file runtime data to the resulting RecordBatch.
We’d like a way to extend the file schema and stream with additional columns, similar to how table_partition_cols are added from the directory structure.
Example
Partition directory: /data1/
Files:
/data1/file1
/data1/file2
/data1/file3
File schema: { row_id: Int32, b: Int32 }
Runtime metadata:
file1 -> cumulative_total_rows = 5
file2 -> cumulative_total_rows = 7
file3 -> cumulative_total_rows = 17
Derived schema:
{ row_id: Int32, b: Int32, cumulative_total_rows }
Example expression:
row_id + cumulative_total_rows
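To make the example concrete, here is a minimal plain-Rust sketch of what evaluating that expression per file would look like once the constant is attached to each batch. It does not use DataFusion; `FileBatch`, `eval_all`, and `example_batches` are hypothetical names, and the `row_id` values are invented for illustration (only the per-file constants 5, 7, and 17 come from the example above).

```rust
/// Stand-in for a RecordBatch read from one file, with the per-file
/// runtime constant attached (hypothetical type, not a DataFusion API).
struct FileBatch {
    file_name: &'static str,
    row_ids: Vec<i32>,
    cumulative_total_rows: i32,
}

/// Evaluates `row_id + cumulative_total_rows` for every row of every batch,
/// keeping the results grouped by the file they came from.
fn eval_all(batches: &[FileBatch]) -> Vec<(&'static str, Vec<i32>)> {
    batches
        .iter()
        .map(|batch| {
            let values = batch
                .row_ids
                .iter()
                .map(|row_id| row_id + batch.cumulative_total_rows)
                .collect();
            (batch.file_name, values)
        })
        .collect()
}

/// Sample input: the row_id values are made up; the constants match the
/// runtime metadata listed in the example.
fn example_batches() -> Vec<FileBatch> {
    vec![
        FileBatch { file_name: "/data1/file1", row_ids: vec![0, 1], cumulative_total_rows: 5 },
        FileBatch { file_name: "/data1/file2", row_ids: vec![0, 1], cumulative_total_rows: 7 },
        FileBatch { file_name: "/data1/file3", row_ids: vec![0, 1], cumulative_total_rows: 17 },
    ]
}
```

The point of the sketch is that the constant has to be known at the moment a batch is produced for a given file, which is exactly the context the scan does not have today.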
Describe the solution you'd like
Extend ListingTable and ListingOptions to support user-provided extended columns (extended_cols), which are appended to each file’s stream and schema, analogous to table_partition_cols.
- Add extended_cols to ListingOptions, defined as:
extended_cols: HashMap<String, HashMap<String, ScalarValue>>
where:
- outer key = column name
- inner key = file name
- value = runtime constant for that file
- These values should be made available in the scan output (similar to partition columns), allowing expressions to reference them.
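As a rough illustration of the proposed shape, the sketch below models the nested map and the per-file lookup the scan would need when it opens a file. It is plain Rust with no DataFusion dependency: the `ScalarValue` enum here is a one-variant stand-in for DataFusion's type, and `ExtendedCols`, `extended_values_for_file`, and `example_extended_cols` are hypothetical names. Skipping columns that have no entry for a file is an assumption; emitting a null would be another reasonable choice.

```rust
use std::collections::HashMap;

/// Stand-in for DataFusion's `ScalarValue`, reduced to the one variant
/// this sketch needs.
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Int32(Option<i32>),
}

/// Shape proposed above: column name -> (file name -> runtime constant).
type ExtendedCols = HashMap<String, HashMap<String, ScalarValue>>;

/// Returns the (column name, value) pairs that apply to one file.
/// Results are sorted by column name so the appended column order is
/// deterministic; columns without an entry for the file are skipped.
fn extended_values_for_file(
    cols: &ExtendedCols,
    file_name: &str,
) -> Vec<(String, ScalarValue)> {
    let mut out: Vec<(String, ScalarValue)> = cols
        .iter()
        .filter_map(|(col, per_file)| {
            per_file.get(file_name).map(|v| (col.clone(), v.clone()))
        })
        .collect();
    out.sort_by(|a, b| a.0.cmp(&b.0));
    out
}

/// Builds the map for the example: one extended column, three files.
fn example_extended_cols() -> ExtendedCols {
    let mut per_file = HashMap::new();
    per_file.insert("file1".to_string(), ScalarValue::Int32(Some(5)));
    per_file.insert("file2".to_string(), ScalarValue::Int32(Some(7)));
    per_file.insert("file3".to_string(), ScalarValue::Int32(Some(17)));

    let mut cols = ExtendedCols::new();
    cols.insert("cumulative_total_rows".to_string(), per_file);
    cols
}
```

One open design question the sketch surfaces: keying the inner map by bare file name is ambiguous if two partitions contain files with the same name, so a full object path may be a safer key.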
Describe alternatives you've considered
- Expose ObjectMeta to PhysicalExprAdapter, allowing it to append file metadata (e.g., file name) to the stream.
- Then a MemTable with file_name → extended_col mappings could be joined to enrich data.
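For comparison, the join-based alternative can be sketched in plain Rust as a hash lookup keyed by file name. This assumes the scan has already tagged each row with its source file, which is the part that exposing ObjectMeta would enable. `ScannedRow` and `enrich` are hypothetical names, and the in-memory `HashMap` stands in for the MemTable side of the join.

```rust
use std::collections::HashMap;

/// A scanned row tagged with the file it came from (hypothetical type;
/// assumes the file name has been appended during the scan).
#[derive(Debug, Clone, PartialEq)]
struct ScannedRow {
    file_name: String,
    row_id: i32,
}

/// Inner join of scanned rows against a file_name -> value lookup.
/// Returns (row_id, extended value) pairs; rows whose file has no entry
/// in the lookup are dropped, as an inner join would do.
fn enrich(rows: &[ScannedRow], lookup: &HashMap<String, i32>) -> Vec<(i32, i32)> {
    rows.iter()
        .filter_map(|row| {
            lookup
                .get(&row.file_name)
                .map(|value| (row.row_id, *value))
        })
        .collect()
}
```

Compared with extended_cols, this keeps the scan itself unchanged apart from the file name column, at the cost of an extra join in every query that needs the per-file values.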
Any alternative mechanism that makes per-file runtime context accessible during scan would work.
@alamb @timsaucer Any thoughts?