[SPARK-27576][SQL] table capability to skip the output column resolution #24469
Conversation
Test build #104937 has finished for PR 24469 at commit

Retest this please.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NamedRelation.scala
...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
Test build #105020 has finished for PR 24469 at commit
Separate from the discussion about the method name, I don't think it makes sense for this to be in NamedRelation.
NamedRelation exists to help create better error messages from the generic rules that apply across any relation. This addition doesn't fit with that purpose. I think the reason why this was added to NamedRelation is because the v2 relation class is not available to catalyst, but this rule is in catalyst. If that's the case, then this depends on moving v2 into catalyst and I think it makes sense to do that first.
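The flag under discussion was eventually added as a method on `NamedRelation` with a default, so that only relations which opt in change analyzer behavior. Below is a minimal, self-contained sketch of that shape; these are stand-in classes, not the real Spark traits, and the method name `skipSchemaResolution` is taken from the merged change:

```scala
// Sketch only: a default-false flag on the relation trait lets the analyzer
// keep its current behavior unless a relation explicitly opts out.
trait NamedRelation {
  def name: String
  // Default: the analyzer still resolves and validates output columns.
  def skipSchemaResolution: Boolean = false
}

// Stand-in for the v2 relation: it would derive the flag from a table
// capability such as ACCEPT_ANY_SCHEMA.
case class FakeV2Relation(name: String, acceptsAnySchema: Boolean) extends NamedRelation {
  override def skipSchemaResolution: Boolean = acceptsAnySchema
}

object SketchDemo {
  def main(args: Array[String]): Unit = {
    val strict  = FakeV2Relation("t1", acceptsAnySchema = false)
    val lenient = FakeV2Relation("noop", acceptsAnySchema = true)
    assert(!strict.skipSchemaResolution)
    assert(lenient.skipSchemaResolution)
    println("ok")
  }
}
```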
Moving DS v2 to the catalyst module should be done soon; we can wait for it. BTW, do we still need NamedRelation after that? It looks to me like NamedRelation is mostly for testing. Currently only the v2 relation class extends it.
We probably don't need it any more.
@cloud-fan, we may want to keep NamedRelation to be able to use UnresolvedRelation in v2 plans. If we updated AppendData (for example) to use DataSourceV2Relation, then we would not be able to create it with an UnresolvedRelation and delegate resolving the table to the analyzer.
The rest of the test cases in this class test either by name or by position, and I would like for this to keep using that convention. When test cases fail, it is easier to see what happened because the case is specific to one path. It also avoids using names that are not specific, like plan1 and plan2.
Hi @rdblue, I think we need to make an exception now. #24233 removes the old hack; to get rid of it, this PR adds a new table capability to bypass the schema check. However, this PR has a hack of its own. To remove that hack, #24416 was created to move data source v2 to catalyst. However, #24416 is blocked: I think we need to merge either #24233 or this PR first, even though it has a hack. Do you have a better idea?
@cloud-fan, I think the cleanest way to fix it is to commit this PR with the new method in NamedRelation. Does that work for you?
    checkAnalysis(parsedPlan, parsedPlan)
  }

  withClue("byPosition") {
@rdblue, if two tests are very similar, it's recommended to use withClue and merge them into one. For example, when the byPosition case fails, we will see:
bypass output column resolution *** FAILED *** (36 milliseconds)
[info] byPosition (DataSourceV2AnalysisSuite.scala:473)
[info] org.scalatest.exceptions.TestFailedException:
...
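ScalaTest's `withClue` prepends a clue string to any assertion failure raised inside its block, which is why a merged test still tells you which branch failed. Here is a self-contained sketch of that mechanic (ScalaTest itself is not required for the sketch; the real `withClue` in ScalaTest behaves similarly for test failures):

```scala
// Sketch: wrap a block so an assertion failure is prefixed with a clue,
// mimicking ScalaTest's withClue for plain AssertionErrors.
object WithClueSketch {
  def withClue[T](clue: String)(block: => T): T =
    try block
    catch {
      case e: AssertionError =>
        throw new AssertionError(s"$clue: ${e.getMessage}")
    }

  def main(args: Array[String]): Unit = {
    // Passing branch: nothing is added to the output.
    withClue("byName") { assert(1 + 1 == 2) }

    // Failing branch: the clue names the branch in the error message.
    val prefixed =
      try { withClue("byPosition") { assert(false, "boom") }; false }
      catch { case e: AssertionError => e.getMessage.startsWith("byPosition") }
    assert(prefixed)
  }
}
```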
I'll send a followup PR to update the entire test suite to use withClue later.
Okay, sounds good.
Test build #105446 has finished for PR 24469 at commit

Test build #105447 has finished for PR 24469 at commit
+1 from me to unblock the circular dependency. @dongjoon-hyun, does this look okay to you?
Thank you for asking. Yes. This looks good to me, too. My comments are also addressed. :)
dongjoon-hyun left a comment:
+1, LGTM. Thank you, @cloud-fan and @rdblue .
Merged to master to unblock DSv2 dev.
Currently we have an analyzer rule, which resolves the output columns of data source v2 writing plans, to make sure the schema of the input query is compatible with the table. However, not all data sources need this check. For example, the `NoopDataSource` doesn't care about the schema of the input query at all.

This PR introduces a new table capability: ACCEPT_ANY_SCHEMA. If a table reports this capability, we skip resolving output columns for it during write.

Note that we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon.

new test cases

Closes apache#24469 from cloud-fan/schema-check.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?

Currently we have an analyzer rule, which resolves the output columns of data source v2 writing plans, to make sure the schema of the input query is compatible with the table.

However, not all data sources need this check. For example, the `NoopDataSource` doesn't care about the schema of the input query at all.

This PR introduces a new table capability: ACCEPT_ANY_SCHEMA. If a table reports this capability, we skip resolving output columns for it during write.

Note that we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon.

How was this patch tested?

new test cases
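For a data source author, opting out of the output column check means reporting the new capability from the table. The sketch below shows that shape; note it uses the `org.apache.spark.sql.connector.catalog` package from later Spark releases (the API lived under `sql.sources.v2` when this PR merged), and the class name and capability set are illustrative:

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
import org.apache.spark.sql.types.StructType

// Sketch: a writable table that accepts any input schema. Because it reports
// ACCEPT_ANY_SCHEMA, the analyzer skips output column resolution on write.
class AnySchemaTable extends Table {
  override def name(): String = "any_schema_table"

  // The declared schema is irrelevant for a table that accepts any schema.
  override def schema(): StructType = new StructType()

  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_WRITE, TableCapability.ACCEPT_ANY_SCHEMA)
}
```

A table like `NoopDataSource`'s would take this route once `SupportsSaveMode` is gone: the write path still runs, but no attempt is made to match the query's columns to the table's schema.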