@allisonwang-db
Contributor

What changes were proposed in this pull request?

This PR updates how to handle path values from the load() method.
It changes the DataSource class constructor and adds the path as a key-value pair in the options field.

Also, this PR blocks loading multiple paths.
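
For illustration only, the behavior described above can be sketched in plain Python. The helper name `options_with_path` and the error message are hypothetical and are not taken from the patch; the sketch only assumes that a single path ends up under the `path` key and that more than one path is rejected.

```python
from typing import Dict, Sequence


def options_with_path(options: Dict[str, str], paths: Sequence[str]) -> Dict[str, str]:
    """Hypothetical helper: fold the path given to load() into the options map."""
    if len(paths) > 1:
        # Assumed behavior from the description: fail loudly instead of dropping paths.
        raise ValueError(f"expected at most one path, got {len(paths)}")
    merged = dict(options)  # copy so the caller's options are not mutated
    if paths:
        merged["path"] = paths[0]
    return merged
```

With no path at all, the options are returned unchanged, which matters for data sources that are not file-based.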

Why are the changes needed?

To make the behavior consistent with the existing data source APIs.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@allisonwang-db
Contributor Author

cc @HyukjinKwon @cloud-fan

Contributor

Why do we also remove the user-specified schema?

Contributor Author

This field is actually not used. Both the reader and writer functions take in the schema parameter, and we can pass in the actual schema there.
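
A minimal sketch of that shape, using stand-in classes rather than the real PySpark API (the names `SketchDataSource` and `SketchReader`, and the use of a list of column names as the schema, are assumptions made for illustration):

```python
from typing import Dict, List


class SketchReader:
    """Stand-in reader that receives the schema when it is created."""

    def __init__(self, schema: List[str], options: Dict[str, str]) -> None:
        self.schema = schema
        self.options = options


class SketchDataSource:
    """Stand-in data source: the constructor keeps only the options."""

    def __init__(self, options: Dict[str, str]) -> None:
        self.options = options

    def reader(self, schema: List[str]) -> SketchReader:
        # The schema arrives here, so the data source does not need to store it.
        return SketchReader(schema, self.options)
```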

Contributor

This breaks all v1 sources, right? I think we should either follow v1 and ignore multiple paths for Python data sources, or only apply this check to Python data sources.

Contributor Author

Yes, we only apply this check to Python data sources (it's in `loadUserDefinedDataSource`). The behavior is indeed different from the v1 source, but I find it more user-friendly to raise an explicit error than to silently ignore multiple paths.

Contributor

Does it help to add a `paths` option that uses JSON to hold the String[]?

Contributor Author

Yeah, let's just follow the DSv2 approach (`options['paths']` = JSON-serialized list of strings) so that Python data sources behave the same as DSv2. I will update this.
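
As a rough sketch of what that convention could look like from the Python side: the helper names `encode_paths` and `decode_paths` are hypothetical, and the single-path case using the `path` key is an assumption based on the `"path" -> paths.head` snippet in this thread, not a description of the final patch.

```python
import json
from typing import Dict, List, Sequence


def encode_paths(options: Dict[str, str], paths: Sequence[str]) -> Dict[str, str]:
    """Hypothetical helper: store load() paths in the string-valued options map."""
    merged = dict(options)
    if len(paths) == 1:
        merged["path"] = paths[0]
    elif len(paths) > 1:
        # Multiple paths are kept as one JSON-serialized list of strings.
        merged["paths"] = json.dumps(list(paths))
    return merged


def decode_paths(options: Dict[str, str]) -> List[str]:
    """Hypothetical helper: recover the list of paths inside a data source."""
    if "paths" in options:
        return json.loads(options["paths"])
    if "path" in options:
        return [options["path"]]
    return []  # not every data source needs a path
```

Keeping every option value a string means the options map needs no special type for lists, at the cost of one `json.loads` call in sources that accept multiple paths.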

@allisonwang-db changed the title from "[SPARK-45927][PYTHON] Update path handling in Python data source" to "[SPARK-45927][PYTHON] Update path handling for Python data source" on Nov 15, 2023
Contributor

Suggested change
extraOptions + ("path" -> paths.head)
extraOptions + ("path" -> paths.head)

Member

@HyukjinKwon HyukjinKwon Nov 17, 2023

Regardless, we should remove `paths` from the interface. Not all Python data sources require paths.

@github-actions github-actions bot removed the DOCS label Nov 17, 2023
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@dongjoon-hyun
Member

Merged to master for Apache Spark 4.0.0. Thank you, @allisonwang-db and all.

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM2
