Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

update DS v2 API to support add/alter column with column position

Why are the changes needed?

We have a parser rule for column position, but we fail the query if it's specified, because the builtin catalog can't support add/alter column with column position.

Since we have the catalog plugin API now, we should let the catalog implementation to decide if it supports column position or not.

Does this PR introduce any user-facing change?

not yet

How was this patch tested?

new tests

@cloud-fan
Copy link
Contributor Author

cc @brkyvz @rdblue @imback82

@SparkQA
Copy link

SparkQA commented Dec 9, 2019

Test build #115042 has finished for PR 26817 at commit 7618712.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final class First implements ColumnPosition
  • final class After implements ColumnPosition
  • final class UpdateColumnPosition implements ColumnChange
  • case class QualifiedColType(

@dongjoon-hyun
Copy link
Member

The failure is due to old PR which is untested by GitHub Action (lint-java).

@dongjoon-hyun
Copy link
Member

Since the master is fixed, I retriggered GitHub Action.


import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the lint-java error in this file.

Checkstyle checks failed at following occurrences:
105
[ERROR] src/main/java/org/apache/spark/sql/connector/catalog/IdentifierImpl.java:[23,8] (imports) UnusedImports: Unused import - java.util.ArrayList.
106
[ERROR] src/main/java/org/apache/spark/sql/connector/catalog/IdentifierImpl.java:[25,8] (imports) UnusedImports: Unused import - java.util.List.
107
[ERROR] src/main/java/org/apache/spark/sql/connector/catalog/IdentifierImpl.java:[27,8] (imports) UnusedImports: Unused import - 

@SparkQA
Copy link

SparkQA commented Dec 10, 2019

Test build #115070 has finished for PR 26817 at commit 22a13d3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final class First implements ColumnPosition
  • final class After implements ColumnPosition
  • final class UpdateColumnPosition implements ColumnChange
  • case class QualifiedColType(

@cloud-fan cloud-fan force-pushed the parser branch 2 times, most recently from 8e754e9 to d13cf86 Compare December 10, 2019 05:00
Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 (few minor comments)

}

interface ColumnPosition {
final class First implements ColumnPosition {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

class doc for First and After?


test("tableCreation: duplicate column names in the table definition") {
val errorMsg = "Found duplicate column(s) in the table definition of `t`"
val errorMsg = "Found duplicate column(s) in the table definition of t"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason t is not quoted any more? The quoted looks clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only IdentifierImpl always adds the outer quote symbol. I make it consistent: https://github.com/apache/spark/pull/26817/files#diff-4885a1f40c0c8766af2cc33104a4c8e8R56

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this change because I added quoteNameParts to remove code duplication and then found IdentifierImpl use a slightly different quoting implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for not using quotes unless it is necessary.

@SparkQA
Copy link

SparkQA commented Dec 10, 2019

Test build #115080 has finished for PR 26817 at commit d13cf86.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final class First implements ColumnPosition
  • final class After implements ColumnPosition
  • final class UpdateColumnPosition implements ColumnChange
  • case class QualifiedColType(

@SparkQA
Copy link

SparkQA commented Dec 10, 2019

Test build #115108 has finished for PR 26817 at commit f6f031f.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final class First implements ColumnPosition
  • final class After implements ColumnPosition
  • final class UpdateColumnPosition implements ColumnChange
  • case class QualifiedColType(

* <p>
* If the field already exists, the change will result in an {@link IllegalArgumentException}.
* If the new field is nested and its parent does not exist or is not a struct, the change will
* result in an {@link IllegalArgumentException}.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick nit. I noticed that the error was wrapped in a SparkException when running alterTable in AlterTableExec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc is for catalog implementations. They should throw IllegalArgumentException if something goes wrong.

interface ColumnPosition {
First FIRST = new First();

static ColumnPosition After(String[] column) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nameParts? It'd be nice to add documentation that it refers to a nested field if String[] is longer than 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also should AFTER take a multipart identifier? Should the syntax be:

ALTER TABLE ADD COLUMN m.n.y STRING AFTER x

where x is a column after the nested field m.n.x

Copy link
Member

@dongjoon-hyun dongjoon-hyun Dec 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a different function name instead of After, @cloud-fan ?

typedVisit[DataType](ctx.dataType),
Option(ctx.comment).map(string))
Option(ctx.comment).map(string),
Option(ctx.colPosition).map(typedVisit[ColumnPosition]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.map(visitColPosition)?

Copy link
Contributor Author

@cloud-fan cloud-fan Dec 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are the same. typedVisit would call visitColPosition under the hood (via looking at the context). We use typedVisit a lot in this file.

test("alter table: update column type, comment and position") {
comparePlans(
parsePlan("ALTER TABLE table_name CHANGE COLUMN a.b.c " +
"TYPE bigint COMMENT 'new comment' AFTER x.y"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does it mean to put a.b.c after x.y?

Copy link
Contributor

@brkyvz brkyvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking forward to this addition! My main confusion is around why we support a multipartIdentifier for AFTER. Shouldn't that be a single length identifier describing the field in the struct we're adding to?

@SparkQA
Copy link

SparkQA commented Dec 11, 2019

Test build #115151 has finished for PR 26817 at commit be2f1b8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 11, 2019

Test build #115155 has finished for PR 26817 at commit 49434e4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Dec 11, 2019

Test build #115162 has finished for PR 26817 at commit 49434e4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@brkyvz brkyvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. LGTM with two minor questions/comments

/**
* Column position AFTER means the specified column should be put after the given `column`.
* Note that, the specified column may be a nested field, and then the given `column` refers to
* a nested field in the same struct.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a field in the same struct?

override def visitColPosition(ctx: ColPositionContext): ColumnPosition = {
ctx.position.getType match {
case SqlBaseParser.FIRST => ColumnPosition.FIRST
case SqlBaseParser.AFTER => ColumnPosition.createAfter(ctx.afterCol.getText)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't that be a visitIdentifier or something? You'd know better probably, but what would the parsing be with backticks, etc when you simply use getText?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backticks are excluded by the lexer, we don't need to worry about it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @brkyvz. The identifier should be a multi-part identifier because it may be a nested column. This should visit the multi-part identifier to get a Seq[String].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I think this is right after all. The parser uses a single identifier, so the syntax is actually ADD COLUMN point.z AFTER y not ADD COLUMN point.z AFTER point.y.

I think I'd prefer updating it to parse a multi-part identifier and accept both point.y and y, but that can be done as a follow-up.

Seq("a", "b", "c"),
Some(LongType),
Some("new comment"),
Some(createAfter("d"))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of where createAfter seems strange because a.b.c already exists.

@rdblue
Copy link
Contributor

rdblue commented Dec 12, 2019

@cloud-fan, thanks for fixing this! Mostly looks good to me, but I found a few minor things to fix.

Also, I think this should update the in-memory tables to implement reorder and test the behavior. It seems strange to have tests for all the other ALTER TABLE features, but not ordering.

@SparkQA
Copy link

SparkQA commented Dec 12, 2019

Test build #115229 has finished for PR 26817 at commit 8d865ab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 12, 2019

Test build #115248 has finished for PR 26817 at commit ea58952.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import javax.annotation.Nullable;

import org.apache.spark.annotation.Experimental;
import org.apache.spark.sql.types.DataType;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this causing validation to fail? I think we generally want to avoid changes like this unless they are enforced by a linter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not enforced by the linter but we do require it in the style guide.

* be the first one within the struct.
*/
final class First implements ColumnPosition {
private static First singleton = new First();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Singleton instances should also be final, and static final variables in Java typically use ALL_CAPS.

case update: UpdateColumnPosition =>
def updateFieldPos(struct: StructType, name: String): StructType = {
val oldField = struct.fields.find(_.name == name).getOrElse {
throw new IllegalArgumentException("field not found: " + name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The first word in an exception message is usually capitalized.

StructType(field +: schema.fields)
} else {
val afterCol = position.asInstanceOf[After].column()
val (before, after) = schema.fields.span(_.name == afterCol)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the use of span is correct:

val afterCol = "b"
val cols = Seq("a", "b", "c")
cols.span(_ == afterCol)
=> List(), List("a", "b", "c")

@SparkQA
Copy link

SparkQA commented Dec 12, 2019

Test build #115246 has finished for PR 26817 at commit 4e739b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test("AlterTable: add column with position") {
val t = s"${catalogAndNamespace}table_name"
withTable(t) {
sql(s"CREATE TABLE $t (id struct<x: int>) USING $v2Format")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it's odd to use id for a struct. I think the test would be more readable using point if you're adding x and y.

@rdblue
Copy link
Contributor

rdblue commented Dec 12, 2019

Thanks, @cloud-fan! I found a bug with the use of span and had a few comments on the new test cases.

@cloud-fan
Copy link
Contributor Author

@rdblue thanks for catching the bug! comments addressed.

@SparkQA
Copy link

SparkQA commented Dec 13, 2019

Test build #115286 has finished for PR 26817 at commit c01f565.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Dec 13, 2019

Test build #115306 has finished for PR 26817 at commit c01f565.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Copy link
Contributor

rdblue commented Dec 14, 2019

+1. Thanks @cloud-fan!

@cloud-fan
Copy link
Contributor Author

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in fdcd0e7 Dec 16, 2019
* be the first one within the struct.
*/
final class First implements ColumnPosition {
private static final First SINGLETON = new First();
Copy link
Member

@dongjoon-hyun dongjoon-hyun Jan 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, All.
Sorry for late nit-picking, but shall we rename this to INSTANCE for consistency with the other singletons?
During review a new PR, #27380, I found that we are starting to lose the consistency due to this. I'll make a follow-up.

cc @brkyvz

@dongjoon-hyun
Copy link
Member

I created #27409 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants