[SPARK-32512][SQL] add alter table add/drop partition command for datasourcev2 #29339
Conversation
```scala
val partParams = new java.util.HashMap[String, String](table.properties())
location.foreach(locationUri =>
  partParams.put("location", locationUri))
partParams.put("ignoreIfExists", ignoreIfExists.toString)
```
Why is this added to the partition parameters? I think Spark should handle this by ignoring `PartitionAlreadyExistsException` like we do in other cases.
Fine, I'll change it.
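To make the suggested fix concrete, here is a minimal sketch (an illustration, not the PR's final code) of handling IF NOT EXISTS in the exec node instead of passing an `ignoreIfExists` entry through the partition parameters; `table`, `partIdent`, and `properties` are assumed to be the exec node's fields:

```scala
import scala.collection.JavaConverters._

// Keep the flag in Spark: an existing partition is only an error
// when the user did not ask for IF NOT EXISTS.
try {
  table.createPartition(partIdent, properties.asJava)
} catch {
  case _: PartitionAlreadyExistsException if ignoreIfExists =>
    // ADD PARTITION IF NOT EXISTS: silently skip partitions that already exist.
}
```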
```scala
if (conflictKeys.nonEmpty) {
  throw new AnalysisException(
    s"Partition key ${conflictKeys.mkString(",")} " +
    s"not exists in ${ident.namespace().quoted}.${ident.name()}")
```
Nit: indentation doesn't match between these lines.
```scala
def convertPartitionIndentifers(
    partSpec: TablePartitionSpec,
    partSchema: StructType): InternalRow = {
```
Why is this included with the implicits when it isn't an implicit class?
Hm, it looks a little ugly to me if it's defined with the implicits. I can change it if you think it's better with implicits.
I don't think it needs to be implicit. I just don't think it belongs in the implicits class if it isn't an implicit. I think there is a util class you could include this in.
Ah, I misunderstood. Thanks.
```scala
val partValues = partSchema.map { part =>
  part.dataType match {
    case _: ByteType =>
      partSpec.getOrElse(part.name, "0").toByte
```
Conversion to InternalRow should not modify the partition values by filling in defaults. Filling in a default like this is a correctness bug.
I think this should require that all partition names are present in the map, and pass null if a name is present but does not have a value. If the partition doesn't allow null partition values, then it should throw an exception.
Sure, sounds reasonable to me.
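A sketch of the stricter conversion agreed on here, assuming `TablePartitionSpec` is the usual `Map[String, String]` alias; the cast-based conversion and the error message are illustrative rather than the code that was merged:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.StructType

def convertPartitionIdentifiers(
    partSpec: Map[String, String],
    partSchema: StructType): InternalRow = {
  val partValues = partSchema.map { field =>
    partSpec.get(field.name) match {
      case Some(value) =>
        // Cast the raw string value to the partition column's data type.
        Cast(Literal(value), field.dataType).eval()
      case None =>
        // Require every partition column to be present; no silent defaults.
        throw new AnalysisException(
          s"Partition spec is missing a value for partition column '${field.name}'")
    }
  }
  InternalRow.fromSeq(partValues)
}
```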
```scala
case class AlterTableAddPartitionExec(
    catalog: TableCatalog,
    ident: Identifier,
```
Why not pass a `Table` instance like other plans that modify table data (e.g., `AppendDataExec`)?
We generally like to load tables early, so that we can do as much validation as possible in the analyzer and planner. By loading the table before passing it here, we would be able to use analyzer rules to validate the partition specs against the table's partition schema, and to make sure the table implements `SupportsPartitions`.
Sure, I will change it.
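Loading the table early means the analyzer can fail fast; a hypothetical analyzer-side check (the helper name and message are assumptions) might look like:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.connector.catalog.{SupportsPartitionManagement, Table}

// Run after the table has been resolved: reject tables that cannot
// manage partitions before we ever get to planning.
def checkPartitionSupport(table: Table): SupportsPartitionManagement =
  table match {
    case t: SupportsPartitionManagement => t
    case _ =>
      throw new AnalysisException(
        s"Table ${table.name} does not support partition management")
  }
```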
```scala
    specs: Seq[TablePartitionSpec],
    ignoreIfNotExists: Boolean,
    purge: Boolean,
    retainData: Boolean) extends V2CommandExec {
```
We should think about how to deal with `purge` and `retainData`. Shall we just put them into the partition properties, or into the `dropPartition` parameters?
These two configurations seem to be mainly used by Hive tables. Besides, `retainData` is always false, and `purge` only works in some versions.
Maybe put them into table properties? AFAICT, it is the table that defines these operations.
I have added a warning for `purge`. Does this look fine to you? @cloud-fan
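A sketch of that compromise: the exec node keeps the flag but only logs, since the v2 partition API has no purge notion (the wording is assumed, not quoted from the PR):

```scala
// Inside a hypothetical AlterTableDropPartitionExec.run(), where the node
// mixes in Spark's Logging trait: surface PURGE instead of silently dropping it.
if (purge) {
  logWarning("Option PURGE is ignored: the v2 partition API cannot purge data")
}
```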
cc @rdblue and @cloud-fan
```scala
override protected def run(): Seq[InternalRow] = {
  partitions.foreach { case (partIdent, properties) =>
    try {
      table.createPartition(partIdent, properties.asJava)
```
The SQL command can add multiple partitions at once, and ideally it should be atomic.
I have two thoughts:
- for v2, we don't allow the SQL command to add more than one partition at once.
- the v2 API should be `createPartitions`, which takes an array of partitions.

Which one do you prefer? The same problem applies to drop partitions as well. @rdblue @stczwd
FYI, the Hive catalog API is `createPartitions`, which can create a list of partitions at once. Personally I prefer option 2: other catalog implementations can fail on more than one partition if they can't do it in an atomic way.
Hm. Supporting `createPartitions` means that `Table` needs to support atomic partition operations: once there is a problem in the middle of the operation, such as a partition that already exists, it must roll back to the state before the operations.
Hive does support an atomic `createPartitions`; I don't know if others support this.
I think that a variant of `createPartition` that works for multiple partitions should be added as an optional trait, or using optional methods in the existing interface.
We want sources to have predictable, standard behavior. I think that means that when adding a single partition in a group fails, the ones that were already successful should be rolled back. That way multiple calls to `createPartition` have the same result as a single call to `createPartitions` and the SQL statement has well-defined and reliable behavior.
If we agree on that behavior, then I think it is clear that Spark should handle calling either `createPartition` multiple times or calling `createPartitions`, because that's the best way to get reliable behavior while keeping table implementations simple -- those that don't support an atomic operation just implement `createPartition`.
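A sketch of that rollback behavior, analogous to non-atomic CTAS cleanup; it is best-effort only, and it reuses the exec node's assumed fields (`table`, `partitions`):

```scala
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

// Non-atomic path: create partitions one by one and, on failure, try to
// drop the ones already created so the command appears all-or-nothing.
val created = ArrayBuffer.empty[InternalRow]
try {
  partitions.foreach { case (partIdent, properties) =>
    table.createPartition(partIdent, properties.asJava)
    created += partIdent
  }
} catch {
  case e: Throwable =>
    created.foreach(table.dropPartition) // best-effort rollback
    throw e
}
```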
> That way multiple calls to `createPartition` have the same result as a single call to `createPartitions` and the SQL statement has well-defined and reliable behavior.

It is hard to support an atomic operation with multiple calls to `createPartition`, as the program may stop in the middle of the operations.
How about this: we create a new optional trait `SupportsAtomicPartitions` to support atomic operations on multiple partitions, and for those that don't support it, we throw `UnsupportedOperationException` if the user tries to operate on multiple partitions at the same time.
Rolling back partition changes is like what we do with CTAS. If the write fails for non-atomic CTAS, we drop the table that was created. That won't always work, but at least the expectation is that the commands have the same behavior.
I'm okay with failing ADD PARTITION commands that have multiple partitions if atomic create/drop is not supported as well. That seems like another reasonable way to go. The important thing to me is that the commands have the same stated behavior across sources.
> The important thing to me is that the commands have the same stated behavior across sources.

Yes, agree with that.

> We create a new optional trait `SupportsAtomicPartitions` to support atomic operations on multiple partitions, and for those that don't support it, we throw `UnsupportedOperationException` if the user tries to operate on multiple partitions at the same time.

I still prefer this. @cloud-fan does this look good to you?
`SupportsAtomicPartitions` SGTM, but we need a better name. How about `SupportsPartitionManagement` and `SupportsAtomicPartitionManagement`?
It's OK with me. I'll change it, thanks.
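For reference, roughly how the agreed split looks. The real interfaces belong to the partition API of #28617 and are Java, so treat these Scala signatures as an approximation:

```scala
import java.util
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.Table

// Base capability: single-partition operations, no atomicity guarantee.
trait SupportsPartitionManagement extends Table {
  def createPartition(ident: InternalRow, properties: util.Map[String, String]): Unit
  def dropPartition(ident: InternalRow): Boolean
}

// Atomic extension: either every partition is created/dropped, or none is.
trait SupportsAtomicPartitionManagement extends SupportsPartitionManagement {
  def createPartitions(
      idents: Array[InternalRow],
      properties: Array[util.Map[String, String]]): Unit
  def dropPartitions(idents: Array[InternalRow]): Boolean
}
```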
```scala
case AlterTableDropPartitionStatement(tbl, specs, ifExists, purge, retainData) =>
  val v1TableName = parseV1Table(tbl, "ALTER TABLE DROP PARTITION")
case AlterTableDropPartition(
    r @ ResolvedTable(_, _, _: V1Table), specs, ifExists, purge, retainData)
```
ditto
done
LGTM except for some minor comments
```scala
failAnalysis(s"Table ${table.name()} can not alter partitions.")
```

```scala
// Skip atomic partition tables
case (_: SupportsAtomicPartitionManagement, _) =>
```
Not related to this PR: I'm wondering if we do need this separation. Do we have a concern that it's hard for implementations to add/drop multiple partitions atomically?
Hm, it depends on whether the third-party system or storage supports transactions. MySQL and Hive can support this very well.
As an example, `TableCatalog.alterTable` accepts a list of `TableChange` without adding a new atomic API. I don't know why the partition API needs to be different.
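For comparison, the existing shape being referenced — `TableCatalog.alterTable` batches changes through a single call (shown here as a Scala approximation of the Java interface):

```scala
import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableChange}

trait ExampleCatalog {
  // A batch of changes goes through one call; how atomically they are
  // applied is left entirely to the implementation.
  def alterTable(ident: Identifier, changes: TableChange*): Table
}
```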
GA passed, merging to master, thanks!
@cloud-fan @rdblue @MaxGekk

I removed duplicate tests from `DataSourceV2SQLSuite` in #30444:
…TABLE .. PARTITIONS from DataSourceV2SQLSuite

### What changes were proposed in this pull request?

Remove tests from `DataSourceV2SQLSuite` that were copied to `AlterTablePartitionV2SQLSuite` by #29339.

### Why are the changes needed?
- To reduce test execution time
- To improve test maintenance

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By running the modified tests:

```
$ build/sbt "test:testOnly *DataSourceV2SQLSuite"
$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
```

Closes #30444 from MaxGekk/dedup-tests-AlterTablePartitionV2SQLSuite.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?

This patch adds `AlterTableAddPartitionExec` and `AlterTableDropPartitionExec` with the new table partition API defined in #28617.

Does this PR introduce any user-facing change?

Yes. Users can run `ALTER TABLE ... ADD PARTITION` or `ALTER TABLE ... DROP PARTITION` to create or drop partitions in a v2 table.

How was this patch tested?

Ran the suites and fixed old tests.
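For illustration, the new commands as a user would issue them (the `testcat` catalog and table names are assumed; the catalog's tables must implement the partition-management API):

```scala
// Assumes a SparkSession with a v2 catalog registered as `testcat`.
spark.sql("ALTER TABLE testcat.ns.tbl ADD PARTITION (dt = '2020-08-04')")
spark.sql("ALTER TABLE testcat.ns.tbl DROP PARTITION (dt = '2020-08-04')")
```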