[SPARK-45927][PYTHON] Update path handling for Python data source #43809
Conversation
python/pyspark/sql/datasource.py (Outdated)
Why do we also remove the user-specified schema?
This field is actually not used. Both the reader and writer functions take in the schema parameter, and we can pass in the actual schema there.
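For context, a minimal sketch of what this means for the Python data source API (class names other than `DataSource`/`DataSourceReader` are hypothetical): the effective schema flows into `reader()` as a parameter, so a stored schema field is redundant.

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class MyDataSource(DataSource):
    def schema(self):
        # Default schema, used when the user does not supply one.
        return "id INT, value STRING"

    def reader(self, schema):
        # The actual schema (user-specified or inferred) is passed in here,
        # so the source never needs to keep its own schema field.
        return MyReader(schema)

class MyReader(DataSourceReader):
    def __init__(self, schema):
        self.schema = schema

    def read(self, partition):
        yield (0, "example")
```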
This breaks all v1 sources, right? I think we should either follow v1 and ignore multiple paths for Python data sources, or only apply this check to Python data sources.
Yes, we only apply this check to Python data sources (it's in `loadUserDefinedDataSource`). The behavior is indeed different from the v1 source, but I find it more user-friendly to raise an explicit error than to silently ignore multiple paths.
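A sketch of the resulting user-facing behavior (the format name `my_source` is illustrative, and the exact error class is an assumption):

```python
# Single path: handled as before.
df = spark.read.format("my_source").load("/data/part1")

# Multiple paths: instead of silently keeping only the first path
# (the v1 behavior), a Python data source fails fast.
spark.read.format("my_source").load(["/data/part1", "/data/part2"])
# -> raises an error for unsupported multiple paths
```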
Does it help to add a `paths` option using JSON to hold a `String[]`?
Yeah, let's just follow the DSv2 approach (`options['paths']` = JSON-serialized string list) to make the Python data source behave the same as DSv2. I will update this.
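A small sketch of the DSv2-style convention being proposed here (plain Python, runnable as-is):

```python
import json

paths = ["/data/part1", "/data/part2"]

# Producer side: serialize the path list into a single string-valued
# option, mirroring how DSv2 stores multiple paths.
options = {"paths": json.dumps(paths)}  # '["/data/part1", "/data/part2"]'

# Consumer side: a data source implementation can recover the list.
recovered = json.loads(options.get("paths", "[]"))
assert recovered == paths
```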
extraOptions + ("path" -> paths.head)
Regardless, we should remove this `paths` field from the interface. Not all Python data sources require paths.
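A sketch of an interface without a dedicated paths field, where only file-based sources opt in via options (class names are hypothetical; assumes the base constructor stores `options` on the instance):

```python
from pyspark.sql.datasource import DataSource

class ApiBackedSource(DataSource):
    """A source with no notion of paths, e.g. backed by a REST API."""

    def reader(self, schema):
        # No path needed; connection details come from ordinary options.
        endpoint = self.options.get("endpoint")
        ...

class FileBackedSource(DataSource):
    def reader(self, schema):
        # A file-based source looks up its (optional) path from options.
        path = self.options.get("path")
        ...
```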
Force-pushed from a027ce3 to 60f230a
dongjoon-hyun left a comment
+1, LGTM.
Merged to master for Apache Spark 4.0.0. Thank you, @allisonwang-db and all.
HyukjinKwon left a comment
LGTM2
What changes were proposed in this pull request?
This PR updates how `path` values from the `load()` method are handled. It changes the `DataSource` class constructor and adds `path` as a key-value pair in the options field. Also, this PR blocks loading multiple paths.
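Illustratively (the format name is hypothetical), a single `load()` path now flows through the options field:

```python
# After this change, a load() path is forwarded as an option:
df = spark.read.format("my_source").load("/tmp/data")
# ...which is roughly equivalent to:
df = spark.read.format("my_source").option("path", "/tmp/data").load()
```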
Why are the changes needed?
To make the behavior consistent with the existing data source APIs.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing unit tests.
Was this patch authored or co-authored using generative AI tooling?
No