Skip to content

Conversation

@HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

This PR is a kind of a followup of #44305 that proposes to create Python Data Source instance at DataSource.lookupDataSourceV2

Why are the changes needed?

Semantically the instance has to be ready at DataSource.lookupDataSourceV2 level instead of after that. It's more consistent as well.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests should cover.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon
Copy link
Member Author

cc @cloud-fan and @allisonwang-db

@HyukjinKwon
Copy link
Member Author

documentation build will be fixed at #44376.

Merged to master.

Copy link
Contributor

@allisonwang-db allisonwang-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This approach is cleaner.

* there is no corresponding Data Source V2 implementation, or the provider is configured to
* fallback to Data Source V1 code path.
*/
def lookupDataSourceV2(provider: String, conf: SQLConf): Option[TableProvider] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove the changes in the lookupDataSource method above? Since a Python data source can only be a V2 source.

else if (isUserDefinedDataSource) {
  classOf[PythonTableProvider]
} 

and

case head :: Nil =>
  // there is exactly one registered alias
  // TODO(SPARK-45600): should be session-based.
  val isUserDefinedDataSource = SparkSession.getActiveSession.exists(
    _.sessionState.dataSourceManager.dataSourceExists(provider))
  // The source can be successfully loaded as either a V1 or a V2 data source.
  // Check if it is also a user-defined data source.
  if (isUserDefinedDataSource) {
    throw QueryCompilationErrors.foundMultipleDataSources(provider)
  }
  head.getClass

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on whether we will call lookupDataSource to detect python data sources. Seems no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lookupDataSource detects both V1 and V2 classes. lookupDataSourceV2 just creates the instance from V2 classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but technically python data source is not v1 or v2. So it seems ok if lookupDataSource doesn't return python data sources.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But when we extend the Python Data Source features, we will use v2 code path, for example, ExternalCommandRunner. Same as DataStreamReader.loadInternal code path.

@HyukjinKwon HyukjinKwon deleted the SPARK-46423 branch January 15, 2024 00:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants