
Conversation

@HyukjinKwon (Member) commented Dec 11, 2023

What changes were proposed in this pull request?

This PR is the same as #44233, but instead of using V1Table it uses the original DSv2 interface, reusing the UDTF execution code.

Why are the changes needed?

So that Python Data Sources can be used everywhere else as well, including SparkR and Scala.

Does this PR introduce any user-facing change?

Yes. Users can register their Python Data Sources and use them in SQL, SparkR, etc.

How was this patch tested?

Unit tests were added, and the change was manually tested.

Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44269
Closes #44233
Closes #43784

@HyukjinKwon HyukjinKwon force-pushed the SPARK-45597-3 branch 3 times, most recently from 274d08f to 2e9c06f Compare December 13, 2023 01:04
@HyukjinKwon HyukjinKwon marked this pull request as ready for review December 13, 2023 06:58
* there is no corresponding Data Source V2 implementation, or the provider is configured to
* fallback to Data Source V1 code path.
*/
def lookupDataSourceV2(provider: String, conf: SQLConf): Option[TableProvider] = {
Contributor:

Member Author (@HyukjinKwon):

yeah I like that idea. Can I do it in a followup though? I would like to extract some changes from your PR, and make another PR.

Contributor:

It's not a followup... I have a concern about changing lookupDataSource, which is only used for the DS v1 path. Let's avoid the risk of breaking anything. It's also a smaller code change if we only instantiate the PythonTableProvider here, so that the existing callers of lookupDataSource can still instantiate the objects directly instead of calling the new newDataSourceInstance function.
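
A rough sketch of that suggestion (illustrative only; the Python-source lookup helper and the PythonTableProvider constructor shape below are assumed, not this PR's exact code):

    def lookupDataSourceV2(provider: String, conf: SQLConf): Option[TableProvider] = {
      // Only the Python data source is instantiated here; existing DS v1
      // callers of lookupDataSource keep creating instances themselves.
      if (isUserRegisteredPythonSource(provider)) {    // assumed helper
        Some(new PythonTableProvider(provider))        // constructor shape assumed
      } else {
        DataSource.lookupDataSource(provider, conf)
          .getDeclaredConstructor().newInstance() match {
            case t: TableProvider => Some(t)
            case _ => None
          }
      }
    }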

Member Author (@HyukjinKwon):

Oh, okie dokie. I was actually thinking about porting more changes from your PR. I will fix that one alone here for now.

val schema = StructType.fromDDL("id INT, partition INT")
val dataSource = createUserDefinedPythonDataSource(
name = dataSourceName, pythonScript = dataSourceScript)
spark.dataSource.registerPython(dataSourceName, dataSource)
Contributor:

why do we need the extra registration?

Member Author (@HyukjinKwon):

Previously, UserDefinedPythonDataSource was able to create a DataFrame directly (from a LogicalRelation) via UserDefinedPythonDataSource.apply.

Now that is no longer possible because we're using DSv2. So here we register the data source and load it via DataFrameReader to create a DataFrame for testing.
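
For reference, the test flow now looks roughly like this (a sketch that assumes the surrounding test fixtures such as spark, schema, dataSourceName, and dataSource are in scope):

    // Register the Python data source, then load it through the DSv2 path
    // with DataFrameReader instead of building a LogicalRelation directly.
    spark.dataSource.registerPython(dataSourceName, dataSource)
    val df = spark.read
      .format(dataSourceName)   // resolved via the DSv2 lookup
      .schema(schema)           // user-specified schema from the test
      .load()
    assert(df.schema == schema)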

Contributor (@allisonwang-db) left a comment:

Looks great!

properties: java.util.Map[String, String]): Table = {
assert(partitioning.isEmpty)
val outputSchema = schema
new Table with SupportsRead {
Contributor:

We can create a new class PythonTable to make it more extensible in the future.

Member Author (@HyukjinKwon) commented Dec 14, 2023:

Actually, I intentionally put it together because we should cache dataSourceInPython, the result of running the Python worker (it contains both the schema and the pickled data source), which is used once for schema inference and once for getting partitions. Keeping it here makes the code more readable and localizes the scope of the cache. In addition, I don't think we are likely to extend this Python Table class/instance.
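
Roughly, the intent is the following shape (an illustrative sketch; the worker-call helper and field names are assumed, not the PR's exact code):

    new Table with SupportsRead {
      // Runs the Python worker at most once; the result carries both the
      // inferred schema and the pickled data source, and is reused for
      // schema inference and for planning the read.
      private lazy val dataSourceInPython = runPythonWorker(shortName, options)  // helper name assumed

      override def schema(): StructType = dataSourceInPython.schema

      // name(), capabilities(), newScanBuilder(...) reuse the same cached value
      // ...
    }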

    largeVarTypes: Boolean,
    pythonRunnerConf: Map[String, String],
-   pythonMetrics: Map[String, SQLMetric],
+   pythonMetrics: Option[Map[String, SQLMetric]],
Contributor:

Why do we need to change this to Optional?

Member Author (@HyukjinKwon) commented Dec 14, 2023:

In order to reuse MapInBatchEvaluatorFactory to read the data on the executor side. We should integrate this with Scan.supportedCustomMetrics, though.
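
In other words (a minimal sketch of the assumed handling, not the exact PR code):

    // The DSv2 read path reuses MapInBatchEvaluatorFactory before custom
    // metrics are wired up, so it passes None and falls back to no metrics.
    val metrics: Map[String, SQLMetric] = pythonMetrics.getOrElse(Map.empty)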

Member Author (@HyukjinKwon):

Here: #44375

*/
@Unstable
def executeCommand(runner: String, command: String, options: Map[String, String]): DataFrame = {
DataSource.lookupDataSource(runner, sessionState.conf) match {
Member Author (@HyukjinKwon):

Actually @cloud-fan, that would not work. E.g., if PythonDataSource implements ExternalCommandRunner, we should load it here.

Member Author (@HyukjinKwon):

Let me fix it separately. Reading the code path, I think it more or less won't affect anything.

Contributor:

Let's worry about it when we actually add this ability to the Python data source. We may never add it, for simplicity.

    // instead of `providingClass`.
-   cls.getDeclaredConstructor().newInstance() match {
+   DataSource.newDataSourceInstance(className, cls) match {
      case f: FileDataSourceV2 => f.fallbackFileFormat
Member Author (@HyukjinKwon):

and here too

Member Author (@HyukjinKwon):

and tables.scala as well:

    if (DDLUtils.isDatasourceTable(catalogTable)) {
      DataSource.newDataSourceInstance(
          catalogTable.provider.get,
          DataSource.lookupDataSource(catalogTable.provider.get, conf)) match {
        // For datasource table, this command can only support the following File format.
        // TextFileFormat only default to one column "value"
        // Hive type is already considered as hive serde table, so the logic will not
        // come in here.
        case _: CSVFileFormat | _: JsonFileFormat | _: ParquetFileFormat =>
        case _: JsonDataSourceV2 | _: CSVDataSourceV2 |
             _: OrcDataSourceV2 | _: ParquetDataSourceV2 =>
        case s if s.getClass.getCanonicalName.endsWith("OrcFileFormat") =>
        case s =>
          throw QueryCompilationErrors.alterAddColNotSupportDatasourceTableError(s, table)
      }
    }
    catalogTable

Member Author (@HyukjinKwon):

and DataStreamReader:

    val v1DataSource = DataSource(
      sparkSession,
      userSpecifiedSchema = userSpecifiedSchema,
      className = source,
      options = optionsWithPath.originalMap)
    val v1Relation = ds match {
      case _: StreamSourceProvider => Some(StreamingRelation(v1DataSource))
      case _ => None
    }
    ds match {
      // file source v2 does not support streaming yet.
      case provider: TableProvider if !provider.isInstanceOf[FileDataSourceV2] =>

Member Author (@HyukjinKwon):

and DataStreamWriter:

      val cls = DataSource.lookupDataSource(source, df.sparkSession.sessionState.conf)
      val disabledSources =
        Utils.stringToSeq(df.sparkSession.sessionState.conf.disabledV2StreamingWriters)
      val useV1Source = disabledSources.contains(cls.getCanonicalName) ||
        // file source v2 does not support streaming yet.
        classOf[FileDataSourceV2].isAssignableFrom(cls)

      val optionsWithPath = if (path.isEmpty) {
        extraOptions
      } else {
        extraOptions + ("path" -> path.get)
      }

      val sink = if (classOf[TableProvider].isAssignableFrom(cls) && !useV1Source) {

}
}

override def supportsExternalMetadata(): Boolean = true
Contributor:

I am actually thinking about whether we should expose this as an API in the Python data source.
If a data source cannot handle external metadata, then .schema(...) or CREATE TABLE table(...) should fail, instead of failing when executing the query.
But I am not sure if this will make the Python API too complicated. WDYT?

Contributor:

For simplicity, I think we can set it to false (the default value) for now. It can actually be difficult to implement a data source that supports a user-specified schema.
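
For context, a sketch of what the flag controls (not this PR's exact code):

    // TableProvider.supportsExternalMetadata tells Spark whether this source
    // accepts a user-specified schema.
    override def supportsExternalMetadata(): Boolean = true
    // With the default of false, spark.read.format(name).schema(s).load()
    // fails when the table is resolved, rather than while executing the
    // query, which is the behavior discussed above.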

      case source if classOf[ExternalCommandRunner].isAssignableFrom(source) =>
        Dataset.ofRows(self, ExternalCommandExecutor(
-         source.getDeclaredConstructor().newInstance()
+         DataSource.newDataSourceInstance(runner, source)
Contributor:

It may be arguable whether this is a breaking change. Now people need to worry about the Python data source in code that is meant to deal with DS v1 only.

Member Author (@HyukjinKwon):

ExternalCommandRunner is a DSv2 API.

Member Author (@HyukjinKwon):

Merged to master.

HyukjinKwon added a commit that referenced this pull request Dec 15, 2023
…taSource.lookupDataSourceV2

### What changes were proposed in this pull request?

This PR is a kind of a followup of #44305 that proposes to create Python Data Source instance at `DataSource.lookupDataSourceV2`

### Why are the changes needed?

Semantically, the instance has to be ready at the `DataSource.lookupDataSourceV2` level instead of after that. It's more consistent as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should cover this.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44374 from HyukjinKwon/SPARK-46423.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Dec 26, 2023
…ation session level

### What changes were proposed in this pull request?

This PR is a followup of #44305. It already works properly at the session level.

### Why are the changes needed?

To remove an unnecessary TODO JIRA.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing CI in this PR should verify this.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44487 from HyukjinKwon/SPARK-45600.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon HyukjinKwon deleted the SPARK-45597-3 branch January 15, 2024 00:47