Feature Request / Improvement
Let's investigate the level of abstraction on the write path.
Currently, we do schema-compatibility checks, schema coercion, bin-packing, schema transformation, etc. at different levels of the stack. It would be good to optimize this and see which of these functions can be pushed up the stack.
For example, here's what the overwrite path looks like:

`overwrite` → `_dataframe_to_data_files` → `write_file` → `write_parquet`
(copied over from #910 (review))
Another example: #786 (comment)
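
To make the idea concrete, here is one hypothetical shape of such a consolidation. Every name below is illustrative rather than a proposal for the actual API: the schema check and coercion happen once at the top of the stack, so the lower layers only bin-pack and serialize already-normalized batches.

```python
# Hypothetical layering (all names illustrative): schema work happens once,
# at the top; lower layers only bin-pack and serialize normalized batches.
def overwrite(table, df, target_file_size):
    df = check_and_coerce_schema(table.schema(), df)  # single schema check + coercion
    data_files = dataframe_to_data_files(table, df, target_file_size)
    ...  # commit the data files in a new snapshot

def dataframe_to_data_files(table, df, target_file_size):
    for batches in bin_pack(df, target_file_size):  # only bin-packing here
        yield write_parquet(table, batches)

def write_parquet(table, batches):
    ...  # only serialization here; no schema checks or name sanitization
```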
More info
`overwrite` checks schema compatibility (iceberg-python/pyiceberg/table/__init__.py, lines 541 to 550 at 3f44dfe):

```python
_check_schema_compatible(
    self._table.schema(), other_schema=df.schema, downcast_ns_timestamp_to_us=downcast_ns_timestamp_to_us
)
self.delete(delete_filter=overwrite_filter, snapshot_properties=snapshot_properties)
with self.update_snapshot(snapshot_properties=snapshot_properties).fast_append() as update_snapshot:
    # skip writing data files if the dataframe is empty
    if df.shape[0] > 0:
        data_files = _dataframe_to_data_files(
```
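
For reference, here is a minimal standalone sketch of what a check at this level amounts to: comparing incoming Arrow fields against the table schema by name and type. The function name and the exact rules are illustrative; pyiceberg's `_check_schema_compatible` also handles nested types, nullability, and timestamp downcasts.

```python
import pyarrow as pa

def check_schema_compatible(table_schema: pa.Schema, arrow_schema: pa.Schema) -> None:
    """Illustrative check: every incoming field must exist in the table schema
    with an equal type."""
    table_fields = {f.name: f.type for f in table_schema}
    problems = []
    for field in arrow_schema:
        expected = table_fields.get(field.name)
        if expected is None:
            problems.append(f"extra field: {field.name}")
        elif expected != field.type:
            problems.append(f"type mismatch for {field.name}: {field.type} != {expected}")
    if problems:
        raise ValueError("Schema is not compatible: " + "; ".join(problems))

# Passes: the incoming fields are a subset of the table schema with matching types.
check_schema_compatible(
    pa.schema([("id", pa.int64()), ("ts", pa.timestamp("us"))]),
    pa.schema([("id", pa.int64())]),
)
```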
`_dataframe_to_data_files` bin-packs the pyarrow Table into write tasks (iceberg-python/pyiceberg/io/pyarrow.py, lines 2222 to 2225 at 3f44dfe):

```python
tasks=iter([
    WriteTask(write_uuid=write_uuid, task_id=next(counter), record_batches=batches, schema=table_metadata.schema())
    for batches in bin_pack_arrow_table(df, target_file_size)
]),
```
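
As a rough illustration of what the bin-packing step does (pyiceberg's `bin_pack_arrow_table` uses a packing iterator; this greedy version is only a sketch), record batches are accumulated until the next one would push the bin past the target file size, and each bin becomes one `WriteTask`:

```python
from typing import Iterator, List

import pyarrow as pa

def bin_pack_batches(tbl: pa.Table, target_file_size: int) -> Iterator[List[pa.RecordBatch]]:
    """Greedy sketch: start a new bin whenever adding the next batch would
    exceed target_file_size."""
    bin_: List[pa.RecordBatch] = []
    bin_size = 0
    for batch in tbl.to_batches():
        if bin_ and bin_size + batch.nbytes > target_file_size:
            yield bin_
            bin_, bin_size = [], 0
        bin_.append(batch)
        bin_size += batch.nbytes
    if bin_:
        yield bin_

# Each yielded list of batches would back one write task / one data file.
tbl = pa.table({"id": list(range(1_000_000))})
for batches in bin_pack_batches(tbl, target_file_size=512 * 1024 * 1024):
    print(len(batches), sum(b.nbytes for b in batches))
```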
`write_parquet` transforms the table schema (iceberg-python/pyiceberg/io/pyarrow.py, lines 2001 to 2008 at 3f44dfe):

```python
table_schema = task.schema
# if schema needs to be transformed, use the transformed schema and adjust the arrow table accordingly
# otherwise use the original schema
if (sanitized_schema := sanitize_column_names(table_schema)) != table_schema:
    file_schema = sanitized_schema
else:
    file_schema = table_schema
```
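
`sanitize_column_names` only matters when field names are not compatible with the downstream format. A minimal sketch of that kind of transformation (the underscore replacement scheme here is illustrative; pyiceberg's real sanitizer uses a different escaping scheme and also covers nested fields):

```python
import re

import pyarrow as pa

def sanitize_column_names(schema: pa.Schema) -> pa.Schema:
    """Illustrative sketch: replace characters that are invalid in Avro-style
    names with underscores, and prefix names that start with a digit."""
    def sanitize(name: str) -> str:
        name = re.sub(r"[^A-Za-z0-9_]", "_", name)
        return name if re.match(r"[A-Za-z_]", name) else "_" + name

    return pa.schema([field.with_name(sanitize(field.name)) for field in schema])

schema = pa.schema([("my col", pa.int64()), ("9lives", pa.string())])
print(sanitize_column_names(schema).names)  # ['my_col', '_9lives']
```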
It then coerces the record batches to the chosen file schema (iceberg-python/pyiceberg/io/pyarrow.py, lines 2011 to 2021 at 3f44dfe):

```python
batches = [
    _to_requested_schema(
        requested_schema=file_schema,
        file_schema=table_schema,
        batch=batch,
        downcast_ns_timestamp_to_us=downcast_ns_timestamp_to_us,
        include_field_ids=True,
    )
    for batch in task.record_batches
]
arrow_table = pa.Table.from_batches(batches)
```
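
The per-batch transformation boils down to casting each record batch to the requested file schema. Here is a minimal standalone sketch of that coercion, including the ns-to-us timestamp downcast (pyiceberg's `_to_requested_schema` additionally resolves fields by ID and attaches Iceberg field-id metadata, which this sketch omits):

```python
import pyarrow as pa

def to_requested_schema(
    requested_schema: pa.Schema,
    batch: pa.RecordBatch,
    downcast_ns_timestamp_to_us: bool = False,
) -> pa.RecordBatch:
    """Illustrative sketch: cast every column to the requested type, optionally
    downcasting nanosecond timestamps to microseconds first."""
    columns = []
    for field in requested_schema:
        column = batch.column(batch.schema.get_field_index(field.name))
        if (
            downcast_ns_timestamp_to_us
            and pa.types.is_timestamp(column.type)
            and column.type.unit == "ns"
        ):
            column = column.cast(pa.timestamp("us", tz=column.type.tz))
        columns.append(column.cast(field.type))
    return pa.RecordBatch.from_arrays(columns, schema=requested_schema)

batch = pa.RecordBatch.from_arrays([pa.array([0], type=pa.timestamp("ns"))], names=["ts"])
requested = pa.schema([("ts", pa.timestamp("us"))])
print(to_requested_schema(requested, batch, downcast_ns_timestamp_to_us=True))
```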