@allisonwang-db
Contributor

What changes were proposed in this pull request?

This PR adds support for spark.read.format(...).load() with Python data sources.

After this PR, users can use a Python data source directly like this:

from pyspark.sql.datasource import DataSource, DataSourceReader

class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)

class MyDataSource(DataSource):
    @classmethod
    def name(cls):
        return "my-source"

    def schema(self):
        return "id INT, value INT"
    
    def reader(self, schema):
        return MyReader()

spark.dataSource.register(MyDataSource)

df = spark.read.format("my-source").load()
df.show()
+---+-----+
| id|value|
+---+-----+
|  0|    1|
+---+-----+
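The register, look up by name, then read flow in the example above can be sketched in plain Python without Spark. This is only an illustration under stated assumptions: `SourceRegistry` and its `register`/`load` methods are hypothetical names, not part of PySpark, and a single partition is assumed. The data source and reader classes mirror the ones in the example but do not inherit from the real PySpark base classes.

```python
# Pure-Python stand-in for the register / lookup / read flow; no Spark needed.
# SourceRegistry is a hypothetical helper, not a PySpark API.


class MyReader:
    def read(self, partition):
        # Yield one row as a tuple, matching the schema "id INT, value INT".
        yield (0, 1)


class MyDataSource:
    @classmethod
    def name(cls):
        return "my-source"

    def schema(self):
        return "id INT, value INT"

    def reader(self, schema):
        return MyReader()


class SourceRegistry:
    """Hypothetical registry keyed by each data source's short name."""

    def __init__(self):
        self._sources = {}

    def register(self, source_cls):
        self._sources[source_cls.name()] = source_cls

    def load(self, fmt):
        if fmt not in self._sources:
            raise ValueError(f"Unknown data source: {fmt}")
        source = self._sources[fmt]()
        schema = source.schema()
        reader = source.reader(schema)
        # Assumption: a single partition, represented here by None.
        rows = list(reader.read(None))
        return schema, rows


registry = SourceRegistry()
registry.register(MyDataSource)
schema, rows = registry.load("my-source")
print(schema, rows)
```

The point of the sketch is only that the format string passed to `format(...)` is the key returned by `name()`, and that the schema and reader are obtained from an instance of the registered class.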

Why are the changes needed?

To support Python data sources.

Does this PR introduce any user-facing change?

Yes. After this PR, users can load a custom Python data source using spark.read.format(...).load().

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@allisonwang-db
Contributor Author

cc @HyukjinKwon @cloud-fan

@github-actions bot added the BUILD label on Nov 2, 2023
DataSourceV2Utils.loadV2Source(sparkSession, provider, userSpecifiedSchema, extraOptions,
source, paths: _*)
}.getOrElse(loadV1Source(paths: _*))
val isUserDefinedDataSource =
Member

@cloud-fan @allisonwang-db we want to support this data source via the USING syntax, unlike DSv2, right?

If that's the case, the logic for loading the data source has to live in DataSource.lookupDataSource and/or DataSource.providingInstance. I don't think we should mix that logic with DSv2 here.

Member

Let's at least move the logic into a separate function if possible.

Contributor

Unfortunately, DS v2 TableProvider does not support USING yet. That's why the code here is a bit messy: it's not shared with the SQL USING path. We should support it, though.

@HyukjinKwon
Member

Merged to master.

I had some offline discussion, and I will follow up on my own.

@allisonwang-db
Contributor Author

Thanks @HyukjinKwon!

@HyukjinKwon
Member

Made a POC (or draft?). PTAL: #43784
