[SPARK-45639][SQL][PYTHON] Support loading Python data sources in DataFrameReader #43630
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala
sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
    DataSourceV2Utils.loadV2Source(sparkSession, provider, userSpecifiedSchema, extraOptions,
      source, paths: _*)
  }.getOrElse(loadV1Source(paths: _*))
  val isUserDefinedDataSource =
@cloud-fan @allisonwang-db we want to support this data source via the USING syntax, unlike DSv2, right?
If that's the case, the logic for loading the data source has to live in DataSource.lookupDataSource and/or DataSource.providingInstance. I don't think we should mix that logic with DSv2 here.
Let's at least move the logic into a separate function if possible.
Unfortunately DS v2 TableProvider does not support USING yet. That's why the code is a bit messy here as it's not shared with the SQL USING path. We should support it though...
Merged to master. I had some offline discussion, and I will follow up on my own.
Thanks @HyukjinKwon!
Made a POC (or draft?) PTAL: #43784
What changes were proposed in this pull request?
This PR supports `spark.read.format(...).load()` for Python data sources.
After this PR, users can use a Python data source directly like this:
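A minimal sketch of such usage, assuming the `pyspark.sql.datasource` module and the `spark.dataSource.register` call as they appear in released PySpark 4.0 (the API at the time of this PR may have differed). The names `MyReader`, `MyDataSource`, `my_source`, and `load_example` are hypothetical and chosen only for illustration, as is the import fallback.

```python
try:
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Fallback (assumption, for illustration only) so the class definitions
    # below can still be imported where PySpark 4.0+ is not installed.
    DataSource = DataSourceReader = object


class MyReader(DataSourceReader):
    """Hypothetical reader that emits a single hard-coded row."""

    def read(self, partition):
        # Each yielded tuple becomes one row matching the declared schema.
        yield (0, 1)


class MyDataSource(DataSource):
    """Hypothetical data source; name() is the short name format() looks up."""

    @classmethod
    def name(cls):
        return "my_source"

    def schema(self):
        # Schema given as a DDL string.
        return "id INT, value INT"

    def reader(self, schema):
        return MyReader()


def load_example(spark):
    """Register the source on an existing SparkSession, then load it."""
    spark.dataSource.register(MyDataSource)
    return spark.read.format("my_source").load()
```

With an active session, calling `load_example(spark).show()` would be expected to display the one row produced by `MyReader.read`.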
Why are the changes needed?
To support Python data sources.
Does this PR introduce any user-facing change?
Yes. After this PR, users can load a custom Python data source using `spark.read.format(...).load()`.
How was this patch tested?
New unit tests.
Was this patch authored or co-authored using generative AI tooling?
No