[SPARK-30192][SQL] support column position in DS v2 #26817
Test build #115042 has finished for PR 26817 at commit
The failure is due to an older PR that was not checked by the GitHub Action (lint-java).
Since master is fixed, I retriggered the GitHub Action.
||
| import java.util.ArrayList; | ||
| import java.util.Arrays; | ||
| import java.util.List; |
Please fix the lint-java error in this file.
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/sql/connector/catalog/IdentifierImpl.java:[23,8] (imports) UnusedImports: Unused import - java.util.ArrayList.
[ERROR] src/main/java/org/apache/spark/sql/connector/catalog/IdentifierImpl.java:[25,8] (imports) UnusedImports: Unused import - java.util.List.
[ERROR] src/main/java/org/apache/spark/sql/connector/catalog/IdentifierImpl.java:[27,8] (imports) UnusedImports: Unused import -
Test build #115070 has finished for PR 26817 at commit
|
8e754e9 to d13cf86
imback82 left a comment
+1 (a few minor comments)
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala
Outdated
| } | ||
|
|
||
| interface ColumnPosition { | ||
| final class First implements ColumnPosition { |
class doc for First and After?
|
|
||
| test("tableCreation: duplicate column names in the table definition") { | ||
| val errorMsg = "Found duplicate column(s) in the table definition of `t`" | ||
| val errorMsg = "Found duplicate column(s) in the table definition of t" |
Any reason t is not quoted any more? The quoted form looks clearer.
Only IdentifierImpl always adds the outer quote symbol. I made it consistent: https://github.com/apache/spark/pull/26817/files#diff-4885a1f40c0c8766af2cc33104a4c8e8R56
I made this change because I added quoteNameParts to remove code duplication and then found that IdentifierImpl uses a slightly different quoting implementation.
+1 for not using quotes unless it is necessary.
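To make the "quote only when necessary" idea concrete, here is a minimal Java sketch. It is not the code from this PR: the class `NameQuoting`, the helper names, and the rule that a part is safe when it contains only letters, digits and underscores are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not Spark's actual implementation.
final class NameQuoting {
  // Assumption: a name part needs no quoting if it is only letters, digits and underscores.
  private static final Pattern SAFE = Pattern.compile("[a-zA-Z0-9_]+");

  private NameQuoting() {}

  // Returns the part unchanged when it is safe, otherwise wraps it in backticks.
  static String quoteIfNeeded(String part) {
    if (SAFE.matcher(part).matches()) {
      return part;
    }
    // Double any embedded backtick so the quoted form stays unambiguous.
    return "`" + part.replace("`", "``") + "`";
  }

  // Joins the parts of a multi-part name with dots, quoting each part only if needed.
  static String quoted(String... parts) {
    return Arrays.stream(parts)
        .map(NameQuoting::quoteIfNeeded)
        .collect(Collectors.joining("."));
  }
}
```

Under these assumptions a plain table name such as t is printed without backticks, which matches the updated error message in the test above, while a part containing a dot or a space is still quoted.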
|
Test build #115080 has finished for PR 26817 at commit
|
|
Test build #115108 has finished for PR 26817 at commit
|
| * <p> | ||
| * If the field already exists, the change will result in an {@link IllegalArgumentException}. | ||
| * If the new field is nested and its parent does not exist or is not a struct, the change will | ||
| * result in an {@link IllegalArgumentException}. |
Quick nit: I noticed that the error was wrapped in a SparkException when running alterTable in AlterTableExec.
This doc is for catalog implementations. They should throw IllegalArgumentException if something goes wrong.
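As a rough illustration of the contract in that Javadoc, a catalog implementation could validate an add-field change along these lines. This is only a sketch under simplifying assumptions: the schema is modeled as nested maps (a struct is a `Map`, a leaf is a type-name string), and `SchemaChanges.addField` is a hypothetical helper, not part of the DS v2 API.

```java
import java.util.Map;

// Hypothetical sketch: a struct is a Map<String, Object>, a leaf is a type-name String.
final class SchemaChanges {
  private SchemaChanges() {}

  @SuppressWarnings("unchecked")
  static void addField(Map<String, Object> root, String[] fieldNames, String type) {
    if (fieldNames.length == 0) {
      throw new IllegalArgumentException("Field name must not be empty");
    }
    Map<String, Object> struct = root;
    // Walk down to the parent struct of the new field.
    for (int i = 0; i < fieldNames.length - 1; i++) {
      Object child = struct.get(fieldNames[i]);
      if (!(child instanceof Map)) {
        throw new IllegalArgumentException(
            "Parent does not exist or is not a struct: " + fieldNames[i]);
      }
      struct = (Map<String, Object>) child;
    }
    String name = fieldNames[fieldNames.length - 1];
    if (struct.containsKey(name)) {
      throw new IllegalArgumentException("Field already exists: " + name);
    }
    struct.put(name, type);
  }
}
```

Both failure cases named in the Javadoc (existing field, missing or non-struct parent) surface as IllegalArgumentException from the catalog side; how the caller wraps that exception is a separate concern.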
| interface ColumnPosition { | ||
| First FIRST = new First(); | ||
|
|
||
| static ColumnPosition After(String[] column) { |
nameParts? It'd be nice to document that it refers to a nested field if the String[] has more than one element.
Also, should AFTER take a multipart identifier? Or should the syntax be:
ALTER TABLE ADD COLUMN m.n.y STRING AFTER x
where x refers to the nested field m.n.x?
Can we have a different function name instead of After, @cloud-fan?
| typedVisit[DataType](ctx.dataType), | ||
| Option(ctx.comment).map(string)) | ||
| Option(ctx.comment).map(string), | ||
| Option(ctx.colPosition).map(typedVisit[ColumnPosition])) |
.map(visitColPosition)?
They are the same. typedVisit calls visitColPosition under the hood (by looking at the context). We use typedVisit a lot in this file.
| test("alter table: update column type, comment and position") { | ||
| comparePlans( | ||
| parsePlan("ALTER TABLE table_name CHANGE COLUMN a.b.c " + | ||
| "TYPE bigint COMMENT 'new comment' AFTER x.y"), |
What does it mean to put a.b.c after x.y?
brkyvz left a comment
I'm looking forward to this addition! My main confusion is around why we support a multipartIdentifier for AFTER. Shouldn't that be a single-part identifier naming the field in the struct we're adding to?
|
Test build #115151 has finished for PR 26817 at commit
|
|
Test build #115155 has finished for PR 26817 at commit
|
|
retest this please |
|
Test build #115162 has finished for PR 26817 at commit
|
brkyvz left a comment
+1. LGTM with two minor questions/comments
| /** | ||
| * Column position AFTER means the specified column should be put after the given `column`. | ||
| * Note that, the specified column may be a nested field, and then the given `column` refers to | ||
| * a nested field in the same struct. |
a field in the same struct?
| override def visitColPosition(ctx: ColPositionContext): ColumnPosition = { | ||
| ctx.position.getType match { | ||
| case SqlBaseParser.FIRST => ColumnPosition.FIRST | ||
| case SqlBaseParser.AFTER => ColumnPosition.createAfter(ctx.afterCol.getText) |
Shouldn't that be a visitIdentifier or something? You'd probably know better, but how would backticks etc. be parsed when you simply use getText?
Backticks are excluded by the lexer, so we don't need to worry about them here.
I agree with @brkyvz. The identifier should be a multi-part identifier because it may be a nested column. This should visit the multi-part identifier to get a Seq[String].
Sorry, I think this is right after all. The parser uses a single identifier, so the syntax is actually ADD COLUMN point.z AFTER y, not ADD COLUMN point.z AFTER point.y.
I think I'd prefer updating it to parse a multi-part identifier and accept both point.y and y, but that can be done as a follow-up.
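For readers following the thread, the API shape under discussion can be sketched roughly as below. This is an illustrative reconstruction from the snippets quoted in this review (FIRST, createAfter, After.column()), not the merged code; the toString methods and the null check are assumptions added for the example.

```java
import java.util.Objects;

// Sketch of the position API discussed in this review; details are assumptions.
interface ColumnPosition {
  // FIRST carries no state, so a single shared instance is enough.
  ColumnPosition FIRST = new First();

  // AFTER takes a single (unqualified) column name: for a nested field such as
  // point.z, the name refers to a sibling field in the same struct (e.g. y).
  static ColumnPosition createAfter(String column) {
    return new After(column);
  }

  final class First implements ColumnPosition {
    private First() {}

    @Override
    public String toString() {
      return "FIRST";
    }
  }

  final class After implements ColumnPosition {
    private final String column;

    private After(String column) {
      this.column = Objects.requireNonNull(column, "column");
    }

    public String column() {
      return column;
    }

    @Override
    public String toString() {
      return "AFTER " + column;
    }
  }
}
```

With this shape, ADD COLUMN point.z AFTER y would carry the field path point.z in the change itself and only the sibling name y in the position, which is why a single identifier is sufficient for the syntax described above.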
| Seq("a", "b", "c"), | ||
| Some(LongType), | ||
| Some("new comment"), | ||
| Some(createAfter("d")))) |
Here's an example of where createAfter seems strange because a.b.c already exists.
@cloud-fan, thanks for fixing this! Mostly looks good to me, but I found a few minor things to fix. Also, I think this should update the in-memory tables to implement reorder and test the behavior. It seems strange to have tests for all the other ALTER TABLE features, but not ordering.
Test build #115229 has finished for PR 26817 at commit
|
|
Test build #115248 has finished for PR 26817 at commit
|
| import javax.annotation.Nullable; | ||
|
|
||
| import org.apache.spark.annotation.Experimental; | ||
| import org.apache.spark.sql.types.DataType; |
Is this causing validation to fail? I think we generally want to avoid changes like this unless they are enforced by a linter.
It's not enforced by the linter but we do require it in the style guide.
| * be the first one within the struct. | ||
| */ | ||
| final class First implements ColumnPosition { | ||
| private static First singleton = new First(); |
Singleton instances should also be final, and static final variables in Java typically use ALL_CAPS.
| case update: UpdateColumnPosition => | ||
| def updateFieldPos(struct: StructType, name: String): StructType = { | ||
| val oldField = struct.fields.find(_.name == name).getOrElse { | ||
| throw new IllegalArgumentException("field not found: " + name) |
Nit: The first word in an exception message is usually capitalized.
| StructType(field +: schema.fields) | ||
| } else { | ||
| val afterCol = position.asInstanceOf[After].column() | ||
| val (before, after) = schema.fields.span(_.name == afterCol) |
I don't think the use of span is correct, because span splits at the first element that fails the predicate:
val afterCol = "b"
val cols = Seq("a", "b", "c")
cols.span(_ == afterCol)
// => (List(), List("a", "b", "c"))
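The intended behavior is to split the fields just after the named column and place the new field there. Scala's span returns the longest prefix whose elements satisfy the predicate, so a predicate that matches only the target column yields an empty prefix unless the target happens to be first. A minimal Java sketch of the intended insertion follows; `FieldOrdering` and `insertAfter` are hypothetical names, and field names stand in for full struct fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating "insert after a named column".
final class FieldOrdering {
  private FieldOrdering() {}

  // Returns a new list with newField placed immediately after afterCol.
  static List<String> insertAfter(List<String> fields, String afterCol, String newField) {
    int idx = fields.indexOf(afterCol);
    if (idx < 0) {
      throw new IllegalArgumentException("Field not found: " + afterCol);
    }
    List<String> result = new ArrayList<>(fields);
    // idx + 1 is a valid insertion point even when afterCol is the last field.
    result.add(idx + 1, newField);
    return result;
  }
}
```

Using an explicit index avoids the prefix/suffix confusion entirely; an equivalent Scala fix would need a predicate that holds for every field before the target rather than only for the target itself.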
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
|
Test build #115246 has finished for PR 26817 at commit
|
| test("AlterTable: add column with position") { | ||
| val t = s"${catalogAndNamespace}table_name" | ||
| withTable(t) { | ||
| sql(s"CREATE TABLE $t (id struct<x: int>) USING $v2Format") |
Nit: it's odd to use id for a struct. I think the test would be more readable using point if you're adding x and y.
sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala
|
Thanks, @cloud-fan! I found a bug with the use of |
|
@rdblue thanks for catching the bug! Comments addressed.
|
Test build #115286 has finished for PR 26817 at commit
|
|
retest this please |
|
Test build #115306 has finished for PR 26817 at commit
|
|
+1. Thanks @cloud-fan!
Thanks for the review, merging to master!
| * be the first one within the struct. | ||
| */ | ||
| final class First implements ColumnPosition { | ||
| private static final First SINGLETON = new First(); |
I created #27409.
What changes were proposed in this pull request?
Update the DS v2 API to support adding/altering a column with a column position.
Why are the changes needed?
We have a parser rule for column position, but we fail the query if it's specified, because the built-in catalog can't support add/alter column with column position.
Since we have the catalog plugin API now, we should let the catalog implementation decide whether it supports column position.
Does this PR introduce any user-facing change?
Not yet.
How was this patch tested?
New tests.