
Conversation

@HyukjinKwon (Member) commented Nov 13, 2023

What changes were proposed in this pull request?

This PR is a followup of #43630. It proposes to support using Python Data Sources with SQL (in favour of the approach in #43949), SparkR, and all other existing combinations, by wrapping the Python Data Source in the DSv2 interface (while still using the V1Table interface).

The approach is as follows:

  1. PySpark registers a Python Data Source under its short name.
  2. Later, when Data Sources are looked up, the JVM dynamically generates a class that inherits the DSv2 interface and carries the same short name as registered in Python.
  3. The generated class invokes the Python Data Source, and works wherever DSv2 works, including SparkR, Scala, and all SQL code paths.

Self-contained working example:

from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition

class TestDataSourceReader(DataSourceReader):
    def __init__(self, options):
        self.options = options
    def partitions(self):
        # Plan three input partitions, one per value 0..2.
        return [InputPartition(i) for i in range(3)]
    def read(self, partition):
        # Emit one (int, str) row per partition.
        yield partition.value, str(partition.value)

class TestDataSource(DataSource):
    @classmethod
    def name(cls):
        # The short name used to look the source up, e.g. in `USING test`.
        return "test"
    def schema(self):
        return "x INT, y STRING"
    def reader(self, schema) -> "DataSourceReader":
        return TestDataSourceReader(self.options)

spark.dataSource.register(TestDataSource)
spark.sql("CREATE TABLE tblA USING test")
spark.sql("SELECT * FROM tblA").show()

results in:

+---+---+
|  x|  y|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
+---+---+
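For completeness, the same registered source should also work through the DataFrame reader API; a minimal sketch, assuming the example above has already run in the same session:

# format("test") resolves the short name registered in step 1.
df = spark.read.format("test").load()
df.show()  # same three rows as the SQL query above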

There are limitations and followups to make:

  1. We should change the dynamically generated classname from org.apache.spark.sql.execution.datasources.PythonTableScan to something that maps to the individual Python Data Source, so the classes are not confused with each other.
  2. Whenever you load a Python Data Source, a new class is dynamically generated. We should probably cache it. (SPARK-45916)
  3. If you save this table and then restart your driver, the table cannot be loaded because the dynamically generated class no longer exists. We should figure out a way to reload them (SPARK-45916 and SPARK-45917).
  4. Multiple paths are not supported (inherited from DSv1); see the sketch after this list. Relates to [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for array as an option for datasources and for multiple values in nullValue in CSV #16611, resolved by [SPARK-45927][PYTHON] Update path handling for Python data source #43809.
  5. Using a DSv1 wrapper might be a blocker for implementing a commit protocol in Python Data Sources. From my code reading, it should still be possible.
  6. Statically loading Python Data Sources is still not supported (SPARK-45917).
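To illustrate limitation 4, a minimal sketch assuming the test source above (the paths below are placeholders): a single path is forwarded to the source as a "path" option, but multiple paths fail because the DSv1 wrapper carries only one path.

# A single path is passed down to the Python Data Source as the "path" option.
spark.read.format("test").load("/tmp/placeholder").show()

# Multiple paths are rejected under the DSv1 wrapper (addressed in #43809).
spark.read.format("test").load(["/tmp/path1", "/tmp/path2"])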

Why are the changes needed?

So that Python Data Sources can be used everywhere else, including SparkR and Scala.

Does this PR introduce any user-facing change?

Yes. Users can register their Python Data Sources and use them in SQL, SparkR, etc.

How was this patch tested?

Unit tests were added, and the change was manually tested.

Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44233

@allisonwang-db (Contributor) left a comment:
Very interesting approach!

@HyukjinKwon marked this pull request as ready for review November 14, 2023 04:22
@HyukjinKwon force-pushed the sql-register-pydatasource branch from 7bc44ac to 2b75b13 on November 14, 2023 04:27
@HyukjinKwon marked this pull request as draft November 15, 2023 07:50
@HyukjinKwon force-pushed the sql-register-pydatasource branch from 2c4bdee to 457c04c on December 6, 2023 08:01
@HyukjinKwon marked this pull request as ready for review December 6, 2023 08:01
@HyukjinKwon force-pushed the sql-register-pydatasource branch 2 times, most recently from 7b5a5d8 to 69585a0 on December 6, 2023 08:07
@HyukjinKwon (Member, Author) commented:
@cloud-fan and @allisonwang-db, here I still use the V1Scan interface.

In order to fully leverage DSv2, we should actually refactor the whole PlanPythonDataSourceScan and UserDefinedPythonDataSource.

  1. First, we should remove the PlanPythonDataSourceScan rule so that DataSourceV2Strategy can resolve the DSv2 source.
  2. Second, we should fix/port the partitioning/reading logic from UserDefinedPythonDataSource to this Scan and ScanBuilder implementation.

While I don't think this is a problem now, we should eventually do it for the write path, etc., I believe (?). I would like it to be done separately if you don't mind (and I would like to focus on the static/runtime registration part).

@HyukjinKwon (Member, Author) commented:
Or maybe it's good enough for reads, since we can mix in the interfaces to implement writes, etc. separately(?)
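For reference, a rough sketch of what the Python side of such a write path could look like, assuming a writer() hook analogous to reader() and writer-side classes along these lines (the names here are assumptions, not a final API):

from pyspark.sql.datasource import DataSource, DataSourceWriter, WriterCommitMessage

class TestDataSourceWriter(DataSourceWriter):
    def write(self, iterator):
        # Write the rows of one partition; return a commit message
        # that the driver could use to commit or abort the job.
        for row in iterator:
            pass  # write the row to the external system
        return WriterCommitMessage()

class TestWritableDataSource(DataSource):
    @classmethod
    def name(cls):
        return "test"
    def writer(self, schema, overwrite):
        # Assumed counterpart of reader() for the write path.
        return TestDataSourceWriter()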

@HyukjinKwon force-pushed the sql-register-pydatasource branch from 69585a0 to 611b52d on December 6, 2023 09:46
@HyukjinKwon changed the title from [SPARK-45597][PYTHON][SQL] Support creating table using a Python data source in SQL to [SPARK-45597][PYTHON][SQL] Support creating table using a Python data source in SQL (codegen) on Dec 7, 2023
@HyukjinKwon marked this pull request as draft December 9, 2023 06:54
@HyukjinKwon deleted the sql-register-pydatasource branch January 15, 2024 00:47