[SPARK-45927][PYTHON] Update path handling for Python data source #43809
Conversation
python/pyspark/sql/datasource.py (Outdated)
Why do we also remove the user-specified schema?
This field is actually not used. Both the reader and writer functions take in the schema parameter, and we can pass in the actual schema there.
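For context, a minimal sketch of what this means for the Python data source API (class names other than `DataSource`/`DataSourceReader` are hypothetical): the effective schema flows into `reader()` as a parameter, so a stored schema field is redundant.

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class MyDataSource(DataSource):
    def schema(self):
        # Default schema, used when the user does not supply one.
        return "id INT, value STRING"

    def reader(self, schema):
        # The actual schema (user-specified or inferred) is passed in here,
        # so the source never needs to keep its own schema field.
        return MyReader(schema)

class MyReader(DataSourceReader):
    def __init__(self, schema):
        self.schema = schema

    def read(self, partition):
        yield (0, "example")
```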
This breaks all v1 sources, right? I think we should either follow v1 and ignore multiple paths for Python data sources, or only apply this check to Python data sources.
Yes, we only apply this check to Python data sources (it's in `loadUserDefinedDataSource`). The behavior is indeed different from the v1 source, but I find it more user-friendly to raise an explicit error than to silently ignore multiple paths.
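A sketch of the resulting user-facing behavior (the format name `my_source` is illustrative, and the exact error class is an assumption):

```python
# Single path: handled as before.
df = spark.read.format("my_source").load("/data/part1")

# Multiple paths: instead of silently keeping only the first path
# (the v1 behavior), a Python data source fails fast.
spark.read.format("my_source").load(["/data/part1", "/data/part2"])
# -> raises an error for unsupported multiple paths
```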
Does it help to add a `paths` option using JSON to hold a `String[]`?
Yeah, let's just follow the DSv2 approach (`options['paths']` = JSON-serialized string list) to make the Python data source behave the same as DSv2. I will update this.
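A small sketch of the DSv2-style convention being proposed here (plain Python, runnable as-is):

```python
import json

paths = ["/data/part1", "/data/part2"]

# Producer side: serialize the path list into a single string-valued
# option, mirroring how DSv2 stores multiple paths.
options = {"paths": json.dumps(paths)}  # '["/data/part1", "/data/part2"]'

# Consumer side: a data source implementation can recover the list.
recovered = json.loads(options.get("paths", "[]"))
assert recovered == paths
```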
extraOptions + ("path" -> paths.head)
Regardless, we should remove this `paths` field from the interface. Not all Python data sources require paths.
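A sketch of an interface without a dedicated paths field, where only file-based sources opt in via options (class names are hypothetical; assumes the base constructor stores `options` on the instance):

```python
from pyspark.sql.datasource import DataSource

class ApiBackedSource(DataSource):
    """A source with no notion of paths, e.g. backed by a REST API."""

    def reader(self, schema):
        # No path needed; connection details come from ordinary options.
        endpoint = self.options.get("endpoint")
        ...

class FileBackedSource(DataSource):
    def reader(self, schema):
        # A file-based source looks up its (optional) path from options.
        path = self.options.get("path")
        ...
```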
Force-pushed from a027ce3 to 60f230a
dongjoon-hyun left a comment
+1, LGTM.
Merged to master for Apache Spark 4.0.0. Thank you, @allisonwang-db and all.
HyukjinKwon left a comment
LGTM2
What changes were proposed in this pull request?
This PR updates how `path` values from the `load()` method are handled. It changes the `DataSource` class constructor and adds `path` as a key-value pair in the options field. Also, this PR blocks loading multiple paths.
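Illustratively (the format name is hypothetical), a single `load()` path now flows through the options field:

```python
# After this change, a load() path is forwarded as an option:
df = spark.read.format("my_source").load("/tmp/data")
# ...which is roughly equivalent to:
df = spark.read.format("my_source").option("path", "/tmp/data").load()
```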
Why are the changes needed?
To make the behavior consistent with the existing data source APIs.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing unit tests.
Was this patch authored or co-authored using generative AI tooling?
No