[SPARK-45639][SQL][PYTHON] Support loading Python data sources in DataFrameReader #43630

allisonwang-db · 2023-11-01T22:51:32Z

What changes were proposed in this pull request?

This PR adds support for spark.read.format(...).load() with Python data sources.

After this PR, users can use a Python data source directly like this:

from pyspark.sql.datasource import DataSource, DataSourceReader

class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)

class MyDataSource(DataSource):
    @classmethod
    def name(cls):
        return "my-source"

    def schema(self):
        return "id INT, value INT"
    
    def reader(self, schema):
        return MyReader()

spark.dataSource.register(MyDataSource)

df = spark.read.format("my-source").load()
df.show()
+---+-----+
| id|value|
+---+-----+
|  0|    1|
+---+-----+
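To make the name-based lookup behind format("my-source") concrete, here is a minimal pure-Python sketch. It is not Spark's implementation and does not import pyspark: SketchReader, SketchSource, and SketchRegistry are hypothetical stand-ins that only mirror the method names used in the example above (name, schema, reader, read).

```python
# Pure-Python stand-in, NOT Spark's implementation; every class here is hypothetical.
from typing import Dict, Iterator, List, Tuple, Type


class SketchReader:
    """Stand-in for a DataSourceReader: yields rows as tuples."""

    def read(self, partition) -> Iterator[Tuple[int, int]]:
        yield (0, 1)


class SketchSource:
    """Stand-in for a DataSource with the same three methods as the PR example."""

    @classmethod
    def name(cls) -> str:
        return "my-source"

    def schema(self) -> str:
        return "id INT, value INT"

    def reader(self, schema: str) -> SketchReader:
        return SketchReader()


class SketchRegistry:
    """Maps a short format name to a data source class (hypothetical registry)."""

    def __init__(self) -> None:
        self._sources: Dict[str, Type[SketchSource]] = {}

    def register(self, source_cls: Type[SketchSource]) -> None:
        # The key is whatever the class reports from name().
        self._sources[source_cls.name()] = source_cls

    def load(self, fmt: str) -> Tuple[str, List[Tuple[int, int]]]:
        if fmt not in self._sources:
            raise KeyError(f"no data source registered under {fmt!r}")
        source = self._sources[fmt]()
        schema = source.schema()
        # A single call with no partition info, as a simplification.
        rows = list(source.reader(schema).read(None))
        return schema, rows


registry = SketchRegistry()
registry.register(SketchSource)
schema, rows = registry.load("my-source")
print(schema, rows)
```

The sketch only shows the register-then-resolve-by-name flow; partitioning, schema parsing, and the JVM side are deliberately left out.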

Why are the changes needed?

To support Python data sources.

Does this PR introduce any user-facing change?

Yes. After this PR, users can load a custom Python data source using spark.read.format(...).load().

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala

sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala

allisonwang-db · 2023-11-02T04:37:48Z

cc @HyukjinKwon @cloud-fan

HyukjinKwon · 2023-11-03T00:25:11Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

-      DataSourceV2Utils.loadV2Source(sparkSession, provider, userSpecifiedSchema, extraOptions,
-        source, paths: _*)
-    }.getOrElse(loadV1Source(paths: _*))
+    val isUserDefinedDataSource =


@cloud-fan @allisonwang-db we want to support this data source via the USING syntax, unlike DSv2, right?

If that's the case, the logic for loading the data source has to live in DataSource.lookupDataSource and/or DataSource.providingInstance. I don't think we should mix that logic with DSv2 here.

Let's at least move the logic into a separate function if possible.

cloud-fan · 2023-11-07

Unfortunately, the DSv2 TableProvider does not support USING yet. That's why the code here is a bit messy: it's not shared with the SQL USING path. We should support it, though...
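As an illustration of the refactoring discussed in this thread (not code from the PR), the sketch below routes both the reader path and a SQL USING path through one lookup function. All names (BUILTIN_FORMATS, lookup_data_source, load_via_reader, create_table_using) are hypothetical, and the precedence of built-in names over registered Python sources is an assumption made only for the example.

```python
from typing import Set

# Placeholder set of built-in format names; illustrative only.
BUILTIN_FORMATS: Set[str] = {"parquet", "csv", "json"}


def lookup_data_source(name: str, registered_python_sources: Set[str]) -> str:
    """Resolve a format name in one place (hypothetical helper)."""
    # Assumption: built-in names are checked first; the thread does not specify this.
    if name in BUILTIN_FORMATS:
        return f"builtin:{name}"
    if name in registered_python_sources:
        return f"python:{name}"
    raise ValueError(f"unknown data source: {name}")


def load_via_reader(name: str, registered: Set[str]) -> str:
    """Stand-in for the spark.read.format(...).load() path."""
    return lookup_data_source(name, registered)


def create_table_using(name: str, registered: Set[str]) -> str:
    """Stand-in for a SQL `CREATE TABLE ... USING name` path."""
    return lookup_data_source(name, registered)


registered = {"my-source"}
print(load_via_reader("my-source", registered))
print(create_table_using("my-source", registered))
```

Because both entry points call the same helper, a name resolves identically whichever path is used, which is the property the review comments ask for.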

HyukjinKwon · 2023-11-08T17:22:43Z

Merged to master.

I had some offline discussion, and I will follow up on my own.

allisonwang-db · 2023-11-08T17:25:39Z

Thanks @HyukjinKwon!

HyukjinKwon · 2023-11-13T11:37:39Z

Made a POC (or draft?) PTAL: #43784

support lookup

34904b8

github-actions bot added SQL DOCS PYTHON labels Nov 1, 2023

HyukjinKwon reviewed Nov 2, 2023

sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala (resolved)

HyukjinKwon reviewed Nov 2, 2023

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala (outdated, resolved)

HyukjinKwon reviewed Nov 2, 2023

sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (outdated, resolved)

remove unstable

5392442

update

b94f4e2

github-actions bot added the BUILD label Nov 2, 2023

HyukjinKwon reviewed Nov 3, 2023

fix tests

1868a15

HyukjinKwon approved these changes Nov 8, 2023

HyukjinKwon closed this in 9d93b71 Nov 8, 2023

This was referenced Nov 10, 2023

[SPARK-45600][PYTHON] Make Python data source registration session level #43742

Closed

[SPARK-45597][PYTHON][SQL] Support creating table using a Python data source in SQL (codegen) #43784

Closed


cloud-fan Nov 7, 2023
