
Conversation

@HyukjinKwon (Member) commented Nov 13, 2023

What changes were proposed in this pull request?

This PR is a followup of #43630. It proposes to support using Python Data Sources with SQL (in favour of the approach in #43949), SparkR, and all other existing combinations, by wrapping the Python Data Source in the DSv2 interface (while still using the V1Table interface).

The approach is as follows:

  1. PySpark registers a Python Data Source under its short name.
  2. Later, when Data Sources are looked up, the JVM dynamically generates a class that inherits the DSv2 interface and carries the same short name as registered in Python.
  3. The generated class invokes the Python Data Source, and works wherever DSv2 works, including SparkR, Scala, and all SQL code paths.

Self-contained working example:

from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition

class TestDataSourceReader(DataSourceReader):
    def __init__(self, options):
        self.options = options
    def partitions(self):
        # Plan three input partitions, one per value 0..2.
        return [InputPartition(i) for i in range(3)]
    def read(self, partition):
        # Emit one (int, str) row per partition.
        yield partition.value, str(partition.value)

class TestDataSource(DataSource):
    @classmethod
    def name(cls):
        # The short name used to look the source up, e.g. in `USING test`.
        return "test"
    def schema(self):
        return "x INT, y STRING"
    def reader(self, schema) -> "DataSourceReader":
        return TestDataSourceReader(self.options)

spark.dataSource.register(TestDataSource)
spark.sql("CREATE TABLE tblA USING test")
spark.sql("SELECT * FROM tblA").show()

results in:

+---+---+
|  x|  y|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
+---+---+
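For completeness, the same registered source should also work through the DataFrame reader API; a minimal sketch, assuming the example above has already run in the same session:

# format("test") resolves the short name registered in step 1.
df = spark.read.format("test").load()
df.show()  # same three rows as the SQL query above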

There are limitations and followups to make:

  1. We should change the dynamically generated classname from org.apache.spark.sql.execution.datasources.PythonTableScan to something that maps to the individual Python Data Source, so the classes are not confused with each other.
  2. Whenever you load a Python Data Source, a new class is dynamically generated. We should probably cache it. (SPARK-45916)
  3. If you save this table and then restart your driver, the table cannot be loaded because the dynamically generated class no longer exists. We should figure out a way to reload them (SPARK-45916 and SPARK-45917).
  4. Multiple paths are not supported (inherited from DSv1); see the sketch after this list. Relates to [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for array as an option for datasources and for multiple values in nullValue in CSV #16611, resolved by [SPARK-45927][PYTHON] Update path handling for Python data source #43809.
  5. Using a DSv1 wrapper might be a blocker for implementing a commit protocol in Python Data Sources. From my code reading, it should still be possible.
  6. Statically loading Python Data Sources is still not supported (SPARK-45917).
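To illustrate limitation 4, a minimal sketch assuming the test source above (the paths below are placeholders): a single path is forwarded to the source as a "path" option, but multiple paths fail because the DSv1 wrapper carries only one path.

# A single path is passed down to the Python Data Source as the "path" option.
spark.read.format("test").load("/tmp/placeholder").show()

# Multiple paths are rejected under the DSv1 wrapper (addressed in #43809).
spark.read.format("test").load(["/tmp/path1", "/tmp/path2"])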

Why are the changes needed?

So that Python Data Sources can be used everywhere else, including SparkR and Scala.

Does this PR introduce any user-facing change?

Yes. Users can register their Python Data Sources and use them in SQL, SparkR, etc.

How was this patch tested?

Unit tests were added, and the change was manually tested.

Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44233

@allisonwang-db (Contributor) left a comment:
Very interesting approach!

@HyukjinKwon marked this pull request as ready for review November 14, 2023 04:22
@HyukjinKwon force-pushed the sql-register-pydatasource branch from 7bc44ac to 2b75b13 on November 14, 2023 04:27
@HyukjinKwon marked this pull request as draft November 15, 2023 07:50
@HyukjinKwon force-pushed the sql-register-pydatasource branch from 2c4bdee to 457c04c on December 6, 2023 08:01
@HyukjinKwon marked this pull request as ready for review December 6, 2023 08:01
@HyukjinKwon force-pushed the sql-register-pydatasource branch 2 times, most recently from 7b5a5d8 to 69585a0 on December 6, 2023 08:07
@HyukjinKwon (Member, Author) commented:
@cloud-fan and @allisonwang-db, here I still use the V1Scan interface.

In order to fully leverage DSv2, we should actually refactor the whole PlanPythonDataSourceScan and UserDefinedPythonDataSource.

  1. First, we should remove the PlanPythonDataSourceScan rule so that DataSourceV2Strategy can resolve the DSv2 source.
  2. Second, we should fix/port the partitioning/reading logic from UserDefinedPythonDataSource to this Scan and ScanBuilder implementation.

While I don't think this is a problem now, we should eventually do it for the write path, etc., I believe (?). I would like it to be done separately if you don't mind (and I would like to focus on the static/runtime registration part).

@HyukjinKwon (Member, Author) commented:
Or maybe it's good enough for reads, since we can mix in the interfaces to implement writes, etc. separately(?)
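For reference, a rough sketch of what the Python side of such a write path could look like, assuming a writer() hook analogous to reader() and writer-side classes along these lines (the names here are assumptions, not a final API):

from pyspark.sql.datasource import DataSource, DataSourceWriter, WriterCommitMessage

class TestDataSourceWriter(DataSourceWriter):
    def write(self, iterator):
        # Write the rows of one partition; return a commit message
        # that the driver could use to commit or abort the job.
        for row in iterator:
            pass  # write the row to the external system
        return WriterCommitMessage()

class TestWritableDataSource(DataSource):
    @classmethod
    def name(cls):
        return "test"
    def writer(self, schema, overwrite):
        # Assumed counterpart of reader() for the write path.
        return TestDataSourceWriter()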

@HyukjinKwon force-pushed the sql-register-pydatasource branch from 69585a0 to 611b52d on December 6, 2023 09:46
@HyukjinKwon changed the title from [SPARK-45597][PYTHON][SQL] Support creating table using a Python data source in SQL to [SPARK-45597][PYTHON][SQL] Support creating table using a Python data source in SQL (codegen) on Dec 7, 2023
@HyukjinKwon marked this pull request as draft December 9, 2023 06:54
@HyukjinKwon deleted the sql-register-pydatasource branch January 15, 2024 00:47