-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-46423][PYTHON][SQL] Make the Python Data Source instance at DataSource.lookupDataSourceV2 #44374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @cloud-fan and @allisonwang-db |
|
documentation build will be fixed at #44376. Merged to master. |
allisonwang-db
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This approach is cleaner.
| * there is no corresponding Data Source V2 implementation, or the provider is configured to | ||
| * fallback to Data Source V1 code path. | ||
| */ | ||
| def lookupDataSourceV2(provider: String, conf: SQLConf): Option[TableProvider] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove the changes in the lookupDataSource method above? Since a Python data source can only be a V2 source.
else if (isUserDefinedDataSource) {
classOf[PythonTableProvider]
} and
case head :: Nil =>
// there is exactly one registered alias
// TODO(SPARK-45600): should be session-based.
val isUserDefinedDataSource = SparkSession.getActiveSession.exists(
_.sessionState.dataSourceManager.dataSourceExists(provider))
// The source can be successfully loaded as either a V1 or a V2 data source.
// Check if it is also a user-defined data source.
if (isUserDefinedDataSource) {
throw QueryCompilationErrors.foundMultipleDataSources(provider)
}
head.getClassThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on whether we will call lookupDataSource to detect python data sources. Seems no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lookupDataSource detects both V1 and V2 classes. lookupDataSourceV2 just creates the instance from V2 classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but technically python data source is not v1 or v2. So it seems ok if lookupDataSource doesn't return python data sources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But when we extend the Python Data Source features, we will use v2 code path, for example, ExternalCommandRunner. Same as DataStreamReader.loadInternal code path.
What changes were proposed in this pull request?
This PR is a kind of a followup of #44305 that proposes to create Python Data Source instance at
DataSource.lookupDataSourceV2Why are the changes needed?
Semantically the instance has to be ready at
DataSource.lookupDataSourceV2level instead of after that. It's more consistent as well.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests should cover.
Was this patch authored or co-authored using generative AI tooling?
No.