[SPARK-30192][SQL] support column position in DS v2 #26817

cloud-fan · 2019-12-09T14:30:40Z

What changes were proposed in this pull request?

Update the DS v2 API to support adding and altering columns with a column position.

Why are the changes needed?

We have a parser rule for column position, but we fail the query if it's specified, because the builtin catalog can't support add/alter column with column position.

Since we have the catalog plugin API now, we should let the catalog implementation decide whether it supports column position or not.

Does this PR introduce any user-facing change?

not yet

How was this patch tested?

new tests

cloud-fan · 2019-12-09T14:31:21Z

cc @brkyvz @rdblue @imback82

SparkQA · 2019-12-09T14:45:57Z

Test build #115042 has finished for PR 26817 at commit 7618712.

This patch fails Java style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
final class First implements ColumnPosition
final class After implements ColumnPosition
final class UpdateColumnPosition implements ColumnChange
case class QualifiedColType(

dongjoon-hyun · 2019-12-09T16:52:48Z

The failure is due to an old PR that was not tested by the GitHub Action (lint-java).

dongjoon-hyun · 2019-12-09T17:00:04Z

Since the master is fixed, I retriggered GitHub Action.

dongjoon-hyun · 2019-12-09T17:08:54Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/IdentifierImpl.java


+import java.util.ArrayList;
 import java.util.Arrays;
+import java.util.List;


Please fix the lint-java error in this file.

Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/sql/connector/catalog/IdentifierImpl.java:[23,8] (imports) UnusedImports: Unused import - java.util.ArrayList.
[ERROR] src/main/java/org/apache/spark/sql/connector/catalog/IdentifierImpl.java:[25,8] (imports) UnusedImports: Unused import - java.util.List.
[ERROR] src/main/java/org/apache/spark/sql/connector/catalog/IdentifierImpl.java:[27,8] (imports) UnusedImports: Unused import -

SparkQA · 2019-12-10T04:32:14Z

Test build #115070 has finished for PR 26817 at commit 22a13d3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
final class First implements ColumnPosition
final class After implements ColumnPosition
final class UpdateColumnPosition implements ColumnChange
case class QualifiedColType(

imback82

+1 (few minor comments)

sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala

imback82 · 2019-12-10T06:47:02Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableChange.java

  }

+  interface ColumnPosition {
+    final class First implements ColumnPosition {


class doc for First and After?

imback82 · 2019-12-10T06:51:39Z

sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala


  test("tableCreation: duplicate column names in the table definition") {
-    val errorMsg = "Found duplicate column(s) in the table definition of `t`"
+    val errorMsg = "Found duplicate column(s) in the table definition of t"


Any reason t is not quoted any more? The quoted form looks clearer.

Only IdentifierImpl always adds the outer quote symbol. I made it consistent: https://github.com/apache/spark/pull/26817/files#diff-4885a1f40c0c8766af2cc33104a4c8e8R56

I made this change because I added quoteNameParts to remove code duplication and then found that IdentifierImpl uses a slightly different quoting implementation.

+1 for not using quotes unless it is necessary.
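
The rule discussed in this thread, quoting a name part only when it needs it, can be sketched as follows. This is an illustration in Java with hypothetical names (`QuotingSketch`, `quoteIfNeeded`, `quoted`). The test for when a part needs quoting (anything other than letters, digits, and underscores) is an assumption and may differ from what IdentifierImpl or quoteNameParts actually do.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not Spark's implementation.
final class QuotingSketch {
  private QuotingSketch() {}

  // Assumption: a part is safe to print unquoted if it only contains
  // letters, digits, and underscores. Otherwise wrap it in backticks and
  // escape embedded backticks by doubling them.
  static String quoteIfNeeded(String part) {
    if (part.matches("[a-zA-Z0-9_]+")) {
      return part;
    }
    return "`" + part.replace("`", "``") + "`";
  }

  // Joins the name parts with dots, quoting each part only when needed.
  static String quoted(String... parts) {
    return Arrays.stream(parts)
        .map(QuotingSketch::quoteIfNeeded)
        .collect(Collectors.joining("."));
  }
}
```

Under this sketch a plain table name such as t is printed without backticks, which matches the updated error message in the test above, while a part that contains a dot is still quoted so the output stays unambiguous.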

SparkQA · 2019-12-10T08:05:01Z

Test build #115080 has finished for PR 26817 at commit d13cf86.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
final class First implements ColumnPosition
final class After implements ColumnPosition
final class UpdateColumnPosition implements ColumnChange
case class QualifiedColType(

SparkQA · 2019-12-10T16:24:48Z

Test build #115108 has finished for PR 26817 at commit f6f031f.

This patch fails Java style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
final class First implements ColumnPosition
final class After implements ColumnPosition
final class UpdateColumnPosition implements ColumnChange
case class QualifiedColType(

brkyvz · 2019-12-10T19:45:14Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableChange.java

+   * <p>
+   * If the field already exists, the change will result in an {@link IllegalArgumentException}.
+   * If the new field is nested and its parent does not exist or is not a struct, the change will
+   * result in an {@link IllegalArgumentException}.


quick nit. I noticed that the error was wrapped in a SparkException when running alterTable in AlterTableExec

This doc is for catalog implementations. They should throw IllegalArgumentException if something goes wrong.

brkyvz · 2019-12-10T19:46:20Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableChange.java

+  interface ColumnPosition {
+    First FIRST = new First();
+
+    static ColumnPosition After(String[] column) {


nameParts? It'd be nice to add documentation that it refers to a nested field if String[] is longer than 1

Also should AFTER take a multipart identifier? Should the syntax be:

ALTER TABLE ADD COLUMN m.n.y STRING AFTER x

where x is a column after the nested field m.n.x

Can we have a different function name instead of After, @cloud-fan ?

brkyvz · 2019-12-10T21:32:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

      typedVisit[DataType](ctx.dataType),
-      Option(ctx.comment).map(string))
+      Option(ctx.comment).map(string),
+      Option(ctx.colPosition).map(typedVisit[ColumnPosition]))


.map(visitColPosition)?

They are the same. typedVisit would call visitColPosition under the hood (via looking at the context). We use typedVisit a lot in this file.

brkyvz · 2019-12-10T21:40:35Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala

+  test("alter table: update column type, comment and position") {
+    comparePlans(
+      parsePlan("ALTER TABLE table_name CHANGE COLUMN a.b.c " +
+        "TYPE bigint COMMENT 'new comment' AFTER x.y"),


what does it mean to put a.b.c after x.y?

brkyvz

I'm looking forward to this addition! My main confusion is around why we support a multipartIdentifier for AFTER. Shouldn't that be a single length identifier describing the field in the struct we're adding to?

SparkQA · 2019-12-11T08:05:01Z

Test build #115151 has finished for PR 26817 at commit be2f1b8.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-11T08:05:02Z

Test build #115155 has finished for PR 26817 at commit 49434e4.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-12-11T08:16:47Z

retest this please

SparkQA · 2019-12-11T11:32:02Z

Test build #115162 has finished for PR 26817 at commit 49434e4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

brkyvz

+1. LGTM with two minor questions/comments

brkyvz · 2019-12-11T17:18:51Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableChange.java

+  /**
+   * Column position AFTER means the specified column should be put after the given `column`.
+   * Note that, the specified column may be a nested field, and then the given `column` refers to
+   * a nested field in the same struct.


a field in the same struct?

brkyvz · 2019-12-11T17:25:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

+  override def visitColPosition(ctx: ColPositionContext): ColumnPosition = {
+    ctx.position.getType match {
+      case SqlBaseParser.FIRST => ColumnPosition.FIRST
+      case SqlBaseParser.AFTER => ColumnPosition.createAfter(ctx.afterCol.getText)


shouldn't that be a visitIdentifier or something? You'd know better probably, but what would the parsing be with backticks, etc when you simply use getText?

backticks are excluded by the lexer, we don't need to worry about it here.

I agree with @brkyvz. The identifier should be a multi-part identifier because it may be a nested column. This should visit the multi-part identifier to get a Seq[String].

Sorry, I think this is right after all. The parser uses a single identifier, so the syntax is actually ADD COLUMN point.z AFTER y not ADD COLUMN point.z AFTER point.y.

I think I'd prefer updating it to parse a multi-part identifier and accept both point.y and y, but that can be done as a follow-up.
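
A rough sketch of what accepting both forms might look like, written in Java for illustration. Everything here is hypothetical (`AfterResolver`, `siblingName`, and the error messages are not from this PR), and it assumes the AFTER target must live in the same struct as the column being positioned.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of this PR.
final class AfterResolver {
  private AfterResolver() {}

  // fieldParts: the column being added or moved, e.g. ["point", "z"].
  // afterParts: the AFTER target, either ["y"] or ["point", "y"].
  // Returns the sibling field name inside the parent struct.
  static String siblingName(String[] fieldParts, String[] afterParts) {
    if (fieldParts.length == 0 || afterParts.length == 0) {
      throw new IllegalArgumentException("Name parts must not be empty");
    }
    if (afterParts.length == 1) {
      // Short form: the name is resolved relative to the parent struct.
      return afterParts[0];
    }
    // Long form: the prefix must point at the same parent struct.
    String[] parent = Arrays.copyOf(fieldParts, fieldParts.length - 1);
    String[] afterParent = Arrays.copyOf(afterParts, afterParts.length - 1);
    if (!Arrays.equals(parent, afterParent)) {
      throw new IllegalArgumentException(
          "AFTER column must be in the same struct: " + String.join(".", afterParts));
    }
    return afterParts[afterParts.length - 1];
  }
}
```

With this sketch, both ADD COLUMN point.z AFTER y and ADD COLUMN point.z AFTER point.y would resolve to the sibling y, and a target in a different struct would be rejected.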

rdblue · 2019-12-12T00:16:41Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala

+        Seq("a", "b", "c"),
+        Some(LongType),
+        Some("new comment"),
+        Some(createAfter("d"))))


Here's an example of where createAfter seems strange because a.b.c already exists.

rdblue · 2019-12-12T00:21:51Z

@cloud-fan, thanks for fixing this! Mostly looks good to me, but I found a few minor things to fix.

Also, I think this should update the in-memory tables to implement reorder and test the behavior. It seems strange to have tests for all the other ALTER TABLE features, but not ordering.

SparkQA · 2019-12-12T13:17:46Z

Test build #115229 has finished for PR 26817 at commit 8d865ab.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-12T21:32:58Z

Test build #115248 has finished for PR 26817 at commit ea58952.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2019-12-12T21:38:33Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableChange.java

+import javax.annotation.Nullable;
+
+import org.apache.spark.annotation.Experimental;
+import org.apache.spark.sql.types.DataType;


Is this causing validation to fail? I think we generally want to avoid changes like this unless they are enforced by a linter.

It's not enforced by the linter but we do require it in the style guide.

rdblue · 2019-12-12T21:39:59Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableChange.java

+   * be the first one within the struct.
+   */
+  final class First implements ColumnPosition {
+    private static First singleton = new First();


Singleton instances should also be final, and static final variables in Java typically use ALL_CAPS.
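
For reference, a minimal sketch of the convention being asked for. The types below are simplified stand-ins, not the real TableChange API: `Position`, `FirstPosition`, and the `get()` accessor are hypothetical names chosen for the example.

```java
// Simplified stand-in for the ColumnPosition interface in the diff.
interface Position {}

// Hypothetical singleton following the usual Java convention:
// the shared instance is static final and named in ALL_CAPS.
final class FirstPosition implements Position {
  private static final FirstPosition SINGLETON = new FirstPosition();

  // Private constructor so callers cannot create extra instances.
  private FirstPosition() {}

  static FirstPosition get() {
    return SINGLETON;
  }

  @Override
  public String toString() {
    return "FIRST";
  }
}
```

Because the field is final and the constructor is private, every call to `FirstPosition.get()` returns the same object, so reference comparison is safe.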

rdblue · 2019-12-12T21:48:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala

+        case update: UpdateColumnPosition =>
+          def updateFieldPos(struct: StructType, name: String): StructType = {
+            val oldField = struct.fields.find(_.name == name).getOrElse {
+              throw new IllegalArgumentException("field not found: " + name)


Nit: The first word in an exception message is usually capitalized.

rdblue · 2019-12-12T21:55:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala

+      StructType(field +: schema.fields)
+    } else {
+      val afterCol = position.asInstanceOf[After].column()
+      val (before, after) = schema.fields.span(_.name == afterCol)


I don't think the use of span is correct:

val afterCol = "b"
val cols = Seq("a", "b", "c")
cols.span(_ == afterCol)
=> List(), List("a", "b", "c")
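
The reason is that span takes the longest prefix whose elements all satisfy the predicate, so with an equality predicate it stops at the first element that does not match. One way to get the intended behavior is to look up the index of the AFTER column and insert right behind it. The following is a sketch in Java with hypothetical names (`ColumnOrder`, `insertAfter`) that works on plain column names rather than StructField. It is not necessarily the fix that was applied in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Spark.
final class ColumnOrder {
  private ColumnOrder() {}

  // Returns a copy of `columns` with `newColumn` placed right after `afterCol`.
  static List<String> insertAfter(List<String> columns, String newColumn, String afterCol) {
    int index = columns.indexOf(afterCol);
    if (index < 0) {
      throw new IllegalArgumentException("Field not found: " + afterCol);
    }
    List<String> result = new ArrayList<>(columns);
    // Everything up to and including afterCol stays in front of the new column.
    result.add(index + 1, newColumn);
    return result;
  }
}
```

For the columns a, b, c, inserting x after b gives a, b, x, c, and an unknown AFTER column fails with an IllegalArgumentException instead of silently prepending.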

SparkQA · 2019-12-12T22:01:19Z

Test build #115246 has finished for PR 26817 at commit 4e739b8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2019-12-12T22:02:28Z

sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala

+  test("AlterTable: add column with position") {
+    val t = s"${catalogAndNamespace}table_name"
+    withTable(t) {
+      sql(s"CREATE TABLE $t (id struct<x: int>) USING $v2Format")


Nit: it's odd to use id for a struct. I think the test would be more readable using point if you're adding x and y.

rdblue · 2019-12-12T22:08:57Z

Thanks, @cloud-fan! I found a bug with the use of span and had a few comments on the new test cases.

cloud-fan · 2019-12-13T07:53:52Z

@rdblue thanks for catching the bug! comments addressed.

SparkQA · 2019-12-13T08:05:02Z

Test build #115286 has finished for PR 26817 at commit c01f565.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-12-13T14:38:11Z

retest this please

SparkQA · 2019-12-13T18:49:02Z

Test build #115306 has finished for PR 26817 at commit c01f565.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2019-12-14T01:17:17Z

+1. Thanks @cloud-fan!

cloud-fan · 2019-12-16T10:55:28Z

thanks for the review, merging to master!

dongjoon-hyun · 2020-01-31T01:00:17Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableChange.java

+   * be the first one within the struct.
+   */
+  final class First implements ColumnPosition {
+    private static final First SINGLETON = new First();


Hi, All.
Sorry for late nit-picking, but shall we rename this to INSTANCE for consistency with the other singletons?
While reviewing a new PR, #27380, I found that we are starting to lose consistency because of this. I'll make a follow-up.

cc @brkyvz

dongjoon-hyun · 2020-01-31T01:07:44Z

I created #27409 .

dongjoon-hyun reviewed Dec 9, 2019

dongjoon-hyun added the SQL label Dec 9, 2019

cloud-fan force-pushed the parser branch from 7618712 to 22a13d3 Compare December 10, 2019 03:37

cloud-fan force-pushed the parser branch 2 times, most recently from 8e754e9 to d13cf86 Compare December 10, 2019 05:00

imback82 reviewed Dec 10, 2019

cloud-fan force-pushed the parser branch from d13cf86 to f6f031f Compare December 10, 2019 15:55

brkyvz reviewed Dec 10, 2019

cloud-fan force-pushed the parser branch from be2f1b8 to 49434e4 Compare December 11, 2019 07:05

support column position in DS v2

5da54d2

cloud-fan force-pushed the parser branch from 49434e4 to 17ab2e3 Compare December 11, 2019 15:28

address comments

43398af

cloud-fan force-pushed the parser branch from 17ab2e3 to 43398af Compare December 11, 2019 15:31

brkyvz approved these changes Dec 11, 2019

address comment

af271ee

rdblue reviewed Dec 12, 2019

address comments

8d865ab

improve test

ea58952

cloud-fan force-pushed the parser branch from 4e739b8 to ea58952 Compare December 12, 2019 18:08

rdblue reviewed Dec 12, 2019

address comments

c01f565

cloud-fan closed this in fdcd0e7 Dec 16, 2019

dongjoon-hyun reviewed Jan 31, 2020
