[SPARK-45597][PYTHON][SQL] Support creating table using a Python data source in SQL (codegen) #43784
Conversation
allisonwang-db left a comment:
Very interesting approach!
Force-pushed from 7bc44ac to 2b75b13.
Force-pushed from 2c4bdee to 457c04c.
Force-pushed from 7b5a5d8 to 69585a0.
@cloud-fan and @allisonwang-db, here I still use the V1Scan interface.
To fully leverage DSv2, we should actually refactor the whole PlanPythonDataSourceScan and UserDefinedPythonDataSource:
- First, we should remove the PlanPythonDataSourceScan rule so DataSourceV2Strategy can resolve the DSv2 source.
- Second, we should fix/port the partitioning/reading logic from UserDefinedPythonDataSource to this Scan and ScanBuilder implementation (a sketch of the Python-side logic follows below).

While I don't think this is a problem now, we should do it eventually, e.g., for the write path, I believe (?). I would like that to be done separately if you don't mind (I would like to focus on the static/runtime registration part).
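For context on the partitioning/reading logic mentioned above, here is a minimal, hypothetical sketch of the Python-side API that would need to be ported over, assuming the pyspark.sql.datasource API this line of work targets (RangePartitionReader and its contents are illustrative, not the actual code being refactored):

```python
from pyspark.sql.datasource import DataSourceReader, InputPartition


class RangePartitionReader(DataSourceReader):
    # Hypothetical reader showing the two pieces that would need to
    # move into the Scan/ScanBuilder implementation.

    def partitions(self):
        # Planning-side logic: split the scan into input partitions.
        return [InputPartition(i) for i in range(3)]

    def read(self, partition):
        # Executor-side logic: produce the rows for one partition.
        yield (partition.value,)
```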
Or maybe it's good enough for the read path, since we can mix in interfaces to implement write, etc. separately(?)
Force-pushed from 69585a0 to 611b52d.
What changes were proposed in this pull request?
This PR is a sort of followup of #43630. It proposes to support using a Python Data Source with SQL (in favour of #43949), SparkR, and all other existing combinations by wrapping the Python Data Source in the DSv2 interface (while still using the V1Table interface). The approach is as follows:
Self-contained working example:
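A minimal sketch of such an example, assuming the pyspark.sql.datasource API introduced by this line of work (MyDataSource, my_source, and tbl are illustrative names; spark is an active SparkSession):

```python
from pyspark.sql.datasource import DataSource, DataSourceReader


class MyDataSourceReader(DataSourceReader):
    def read(self, partition):
        # Emit rows matching the schema declared by the data source.
        yield (0, 1)


class MyDataSource(DataSource):
    @classmethod
    def name(cls):
        # Short name used in SQL's USING clause.
        return "my_source"

    def schema(self):
        return "a INT, b INT"

    def reader(self, schema):
        return MyDataSourceReader()


# Register the Python data source so SQL can resolve it by name.
spark.dataSource.register(MyDataSource)

spark.sql("CREATE TABLE tbl USING my_source")
spark.sql("SELECT * FROM tbl").show()
```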
results in:
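With the sketch above, the output would be along these lines:

```
+---+---+
|  a|  b|
+---+---+
|  0|  1|
+---+---+
```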
There are limitations and followups to address:
- We should change the dynamically generated classname from org.apache.spark.sql.execution.datasources.PythonTableScan to something that maps to the individual Python Data Source, so the classes are not confused.
- Multi-paths are not supported (inherited from DSv1). Relates to [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for array as an option for datasources and for multiple values in nullValue in CSV #16611; resolved by [SPARK-45927][PYTHON] Update path handling for Python data source #43809.
- Using a wrapper of DSv1 might be a blocker to implementing a commit protocol in Python Data Source. From my code reading, it would still be possible.

Why are the changes needed?
So that Python Data Sources can be used in all other places, including SparkR and Scala.
Does this PR introduce any user-facing change?
Yes. Users can register their Python Data Sources and use them in SQL, SparkR, etc.
How was this patch tested?
Unit tests were added, and the change was manually tested.
Was this patch authored or co-authored using generative AI tooling?
No.
Closes #44233