[SPARK-24251][SQL] Add analysis tests for AppendData. #22043
Conversation
@cloud-fan, here are tests to validate the analysis of AppendData logical plans.

Test build #94449 has finished for PR 22043 at commit
    case (inAttr, outAttr) =>
      // names and types must match, nullability must be compatible
      inAttr.name == outAttr.name &&
        inAttr.resolved && outAttr.resolved &&
I think it's more clear to write `table.resolved && query.resolved && query.output.size == table.output.size && ...`
Agreed.
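For readers following along, here is a minimal sketch of the shape that suggestion implies (a hypothetical standalone helper, not the code that ended up in `AppendData`; the type comparison is deliberately simplified to exact equality):

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical helper mirroring the suggested check: both sides must be resolved,
// the column counts must match, and each query column must line up with the
// corresponding table column by name, type, and compatible nullability.
def appendResolved(table: LogicalPlan, query: LogicalPlan): Boolean = {
  table.resolved && query.resolved && query.output.size == table.output.size &&
    query.output.zip(table.output).forall { case (inAttr: Attribute, outAttr: Attribute) =>
      inAttr.name == outAttr.name &&
        inAttr.dataType == outAttr.dataType &&  // exact type match, as a simplification
        (outAttr.nullable || !inAttr.nullable)  // non-null data may be written to a nullable column
    }
}
```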
    class DataSourceV2AnalysisSuite extends AnalysisTest {
      val table = TestRelation(StructType(Seq(
        StructField("x", FloatType),
        StructField("y", FloatType))).toAttributes)
nit:
    import org.apache.spark.sql.catalyst.dsl.expressions._
    import org.apache.spark.sql.catalyst.dsl.plans._

    val table = TestRelation(Seq('x.float, 'y.float))
    val requiredTable = TestRelation(Seq('x.float.notNull, 'y.float.notNull))
    ...
Symbols are rarely used in Scala, so I think it is better to use the StructType and convert. It matches what users do more closely.
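For comparison, the two styles under discussion look roughly like this (a sketch; `TestRelation` is the helper relation defined in this suite):

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.types.{FloatType, StructField, StructType}

// Catalyst DSL form from the nit above: terse, but Symbol-based.
val tableFromDsl = TestRelation(Seq('x.float, 'y.float))

// StructType form kept in the PR: more verbose, but mirrors how users declare a
// schema before it is converted to attributes.
val tableFromStructType = TestRelation(StructType(Seq(
  StructField("x", FloatType),
  StructField("y", FloatType))).toAttributes)
```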
        assertResolved(parsedPlan)
      }

      test("Append.byName: does not match by position") {
this test is by name.
Yes, this is testing that the query that would work when matching by position fails when matching by name.
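To make that concrete, the test is roughly shaped like the sketch below (column names here are illustrative; `TestRelation`, `assertNotResolved`, and `assertAnalysisError` are helpers from this suite):

```scala
import org.apache.spark.sql.catalyst.plans.logical.AppendData
import org.apache.spark.sql.types.{FloatType, StructField, StructType}

// The query lines up with the table by position (two float columns), but the
// names differ, so by-name resolution should fail to find the table's columns.
val table = TestRelation(StructType(Seq(
  StructField("x", FloatType),
  StructField("y", FloatType))).toAttributes)
val query = TestRelation(StructType(Seq(
  StructField("a", FloatType),
  StructField("b", FloatType))).toAttributes)

val parsedPlan = AppendData.byName(table, query)

assertNotResolved(parsedPlan)
assertAnalysisError(parsedPlan, Seq(
  "Cannot find data for output column", "'x'"))
```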
        val parsedPlan = AppendData.byName(table, query)

        assertNotResolved(parsedPlan)
        assertAnalysisError(parsedPlan, Seq(
it's clearer to specify the `caseSensitive` parameter of `assertAnalysisError`
Updated.
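For reference, spelling out the flag at the call site looks roughly like this (assuming `assertAnalysisError` exposes a `caseSensitive` parameter, as `AnalysisTest` did at the time; `parsedPlan` is the plan from the excerpt above):

```scala
// Pin the analyzer's case sensitivity explicitly so the test's intent is clear.
assertAnalysisError(parsedPlan, Seq(
  "Cannot find data for output column", "'x'"),
  caseSensitive = true)
```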
| StructField("y", FloatType))).toAttributes) | ||
|
|
||
| val X = query.output.toIndexedSeq(0) | ||
| val y = query.output.toIndexedSeq(1) |
`query.output.last`
Fixed.
| "Cannot find data for output column", "'x'")) | ||
| } | ||
|
|
||
| test("Append.byName: missing required columns cause failure and are identified by name") { |
is there really a difference between missing required columns and missing optional columns?
Missing optional columns may be allowed in the future. We've already had a team request this feature (enabled by a flag) to support schema evolution. The use case is that you don't want to fail existing jobs when you add a column to the table. Iceberg supports this, so Spark should too.
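For context, the required/optional split in the suite comes down to nullability, something like the sketch below (using the same `TestRelation` helper and the suite's `x`/`y` naming):

```scala
import org.apache.spark.sql.types.{FloatType, StructField, StructType}

// Optional columns are nullable, so a future schema-evolution mode could fill
// them with nulls when the query omits them; required columns are non-nullable
// and can never be left out.
val table = TestRelation(StructType(Seq(
  StructField("x", FloatType, nullable = true),
  StructField("y", FloatType, nullable = true))).toAttributes)

val requiredTable = TestRelation(StructType(Seq(
  StructField("x", FloatType, nullable = false),
  StructField("y", FloatType, nullable = false))).toAttributes)
```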
| "Cannot find data for output column", "'x'")) | ||
| } | ||
|
|
||
| test("Append.byName: missing optional columns cause failure and are identified by name") { |
this test is identical to "Append.byName: missing columns are identified by name"
I probably intended to update it for byPosition. I'll fix it.
Removed. Looks like it was from when I split out the tests for required/optional.
        assertResolved(expectedPlan)
      }

      test("Append.byPosition: case does not fail column resolution") {
do we need this test? In "Append.byPosition: basic behavior" we already showed that append works even when the column names are different.
I can remove it. I was including most test cases for both byName and byPosition to validate the different behaviors.
Removed.
| StructField("X", FloatType), // doesn't match case! | ||
| StructField("y", FloatType))).toAttributes) | ||
|
|
||
| val X = query.output.toIndexedSeq(0) |
can't we just call `query.output.head`?
Test build #94516 has finished for PR 22043 at commit

thanks, merging to master!

Thanks for reviewing, @cloud-fan!
Closes apache#22043 from rdblue/SPARK-24251-add-append-data-analysis-tests.

Authored-by: Ryan Blue <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit bdd2796)
What changes were proposed in this pull request?
This is a follow-up to #21305 that adds a test suite for AppendData analysis.
This also fixes the following problems uncovered by these tests:
* Incorrect order of data types passed to `canWrite` is fixed
* The field check calls `canWrite` first to ensure all errors are found
* `AppendData#resolved` must check resolution of the query's attributes
* Column names are quoted to show empty names

How was this patch tested?

This PR adds a test suite for AppendData analysis.
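As an aside on the last bullet in the list above, quoting the column name is what makes an empty name visible in the error; a minimal sketch (the message text follows the expected strings in the tests, the helper itself is hypothetical):

```scala
// Hypothetical helper: quoting makes an empty column name show up as '' rather
// than vanishing into the surrounding message text.
def missingColumnError(name: String): String =
  s"Cannot find data for output column '$name'"

missingColumnError("x")  // Cannot find data for output column 'x'
missingColumnError("")   // Cannot find data for output column ''
```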