Conversation

@rdblue rdblue commented Aug 8, 2018

What changes were proposed in this pull request?

This is a follow-up to #21305 that adds a test suite for AppendData analysis.

This also fixes the following problems uncovered by these tests:

  • Incorrect order of data types passed to `canWrite` is fixed
  • The field check calls `canWrite` first to ensure all errors are found
  • `AppendData#resolved` must check resolution of the query's attributes
  • Column names are quoted to show empty names
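The `canWrite` argument-order fix matters because the check is asymmetric: a safe widening cast is allowed when writing but not when reading back. A minimal sketch of that asymmetry, using hypothetical type names rather than Spark's actual `DataType.canWrite`:

```scala
// Hypothetical sketch: safe up-casts are allowed when writing, so the
// check is asymmetric and swapping the arguments inverts its meaning.
def canWrite(writeType: String, readType: String): Boolean =
  (writeType, readType) match {
    case (w, r) if w == r => true
    case ("int", "long")  => true  // widening an int into a long column is safe
    case _                => false
  }
```

With the arguments reversed, the check would accept `long` data for an `int` column while rejecting the safe direction, which is exactly the kind of bug the new tests caught.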

How was this patch tested?

This PR adds a test suite for AppendData analysis.

@rdblue
Contributor Author

rdblue commented Aug 8, 2018

@cloud-fan, here are tests to validate the analysis of AppendData logical plans.

@SparkQA

SparkQA commented Aug 8, 2018

Test build #94449 has finished for PR 22043 at commit e58d4fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case (inAttr, outAttr) =>
// names and types must match, nullability must be compatible
inAttr.name == outAttr.name &&
inAttr.resolved && outAttr.resolved &&
Contributor

I think it's clearer to write `table.resolved && query.resolved && query.output.size == table.output.size && ...`

Contributor Author

Agreed.
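For reference, the suggested ordering of the condition can be sketched with plain Scala stand-ins for the plan classes (`Attr` and `Relation` here are hypothetical models, not Spark's actual types):

```scala
// Minimal model of the suggested `resolved` condition: both sides must be
// resolved, the column counts must line up, and each attribute pair must
// match by name and type.
case class Attr(name: String, dataType: String, resolved: Boolean = true)
case class Relation(output: Seq[Attr]) {
  def resolved: Boolean = output.forall(_.resolved)
}

def appendResolved(table: Relation, query: Relation): Boolean =
  table.resolved && query.resolved &&
    query.output.size == table.output.size &&
    query.output.zip(table.output).forall { case (inAttr, outAttr) =>
      inAttr.name == outAttr.name && inAttr.dataType == outAttr.dataType
    }
```

Checking `query.resolved` up front is the important part: without it, an unresolved attribute on the query side could slip past the pairwise comparison.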

class DataSourceV2AnalysisSuite extends AnalysisTest {
val table = TestRelation(StructType(Seq(
StructField("x", FloatType),
StructField("y", FloatType))).toAttributes)
Contributor

nit:

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._

val table = TestRelation(Seq('x.float, 'y.float))
val requiredTable = TestRelation(Seq('x.float.notNull, 'y.float.notNull))
...

Contributor Author

Symbol literals are rarely used in Scala, so I think it is better to use StructType and convert. That more closely matches what users actually do.

assertResolved(parsedPlan)
}

test("Append.byName: does not match by position") {
Contributor

this test is by name.

Contributor Author

Yes, this is testing that a query that would succeed when matching by position fails when matching by name.
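That distinction can be sketched with plain collections (`Col`, `matchByPosition`, and `matchByName` are illustrative stand-ins, not Spark code):

```scala
case class Col(name: String, value: Float)

// By position: the i-th query column feeds the i-th table column; names are ignored.
def matchByPosition(tableCols: Seq[String], queryCols: Seq[Col]): Option[Seq[(String, Float)]] =
  if (tableCols.size == queryCols.size) Some(tableCols.zip(queryCols.map(_.value)))
  else None

// By name: every table column must find a query column with the same name.
def matchByName(tableCols: Seq[String], queryCols: Seq[Col]): Option[Seq[(String, Float)]] = {
  val byName = queryCols.map(c => c.name -> c.value).toMap
  val resolved = tableCols.flatMap(n => byName.get(n).map(v => n -> v).toList)
  if (resolved.size == tableCols.size) Some(resolved) else None
}
```

A query whose columns have the right arity but the wrong names passes the positional check and fails the by-name check, which is exactly what the test asserts.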

val parsedPlan = AppendData.byName(table, query)

assertNotResolved(parsedPlan)
assertAnalysisError(parsedPlan, Seq(
Contributor

It's clearer to specify the `caseSensitive` parameter of `assertAnalysisError`.

Contributor Author

Updated.

StructField("y", FloatType))).toAttributes)

val X = query.output.toIndexedSeq(0)
val y = query.output.toIndexedSeq(1)
Contributor

query.output.last

Contributor Author

Fixed.

"Cannot find data for output column", "'x'"))
}

test("Append.byName: missing required columns cause failure and are identified by name") {
Contributor

is there really a difference between missing required columns and missing optional columns?

Contributor Author

Missing optional columns may be allowed in the future. We've already had a team request this feature (enabled by a flag) to support schema evolution. The use case is that you don't want to fail existing jobs when you add a column to the table. Iceberg supports this, so Spark should too.
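A rough sketch of that flag-gated behavior, using hypothetical names rather than Spark's implementation: with the flag on, a missing column is tolerated only if the table column is optional (nullable); a missing required column always fails.

```scala
case class TableCol(name: String, nullable: Boolean)

// Hypothetical check: report an error for each table column the query does
// not provide, unless the column is nullable and the evolution flag is set.
def missingColumnErrors(
    tableCols: Seq[TableCol],
    queryCols: Set[String],
    allowMissingOptional: Boolean): Seq[String] =
  tableCols.collect {
    case col if !queryCols.contains(col.name) && !(col.nullable && allowMissingOptional) =>
      s"Cannot find data for output column '${col.name}'"
  }
```

Under this model, adding a nullable column to a table would not break existing append jobs once the flag is enabled, which is the schema-evolution use case described above.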

"Cannot find data for output column", "'x'"))
}

test("Append.byName: missing optional columns cause failure and are identified by name") {
Contributor

this test is identical to "Append.byName: missing columns are identified by name"

Contributor Author

I probably intended to update it for byPosition. I'll fix it.

Contributor Author

Removed. Looks like it was from when I split out the tests for required/optional.

assertResolved(expectedPlan)
}

test("Append.byPosition: case does not fail column resolution") {
Contributor

Do we need this test? In "Append.byPosition: basic behavior" we proved that we can append even when the column names are different.

Contributor Author

I can remove it. I was including most test cases for both byName and byPosition to validate the different behaviors.

Contributor Author

Removed.

StructField("X", FloatType), // doesn't match case!
StructField("y", FloatType))).toAttributes)

val X = query.output.toIndexedSeq(0)
Contributor

can't we just call query.output.head?

@SparkQA

SparkQA commented Aug 9, 2018

Test build #94516 has finished for PR 22043 at commit 765c5b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in bdd2796 Aug 10, 2018
@rdblue
Contributor Author

rdblue commented Aug 13, 2018

Thanks for reviewing, @cloud-fan!

jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019
## What changes were proposed in this pull request?

This is a follow-up to apache#21305 that adds a test suite for AppendData analysis.

This also fixes the following problems uncovered by these tests:
* Incorrect order of data types passed to `canWrite` is fixed
* The field check calls `canWrite` first to ensure all errors are found
* `AppendData#resolved` must check resolution of the query's attributes
* Column names are quoted to show empty names

## How was this patch tested?

This PR adds a test suite for AppendData analysis.

Closes apache#22043 from rdblue/SPARK-24251-add-append-data-analysis-tests.

Authored-by: Ryan Blue <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

(cherry picked from commit bdd2796)
rdblue added a commit to rdblue/spark that referenced this pull request Apr 3, 2019
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019