Conversation

@jackylee-ch (Contributor) commented Aug 3, 2020

What changes were proposed in this pull request?

This patch adds AlterTableAddPartitionExec and AlterTableDropPartitionExec based on the new table partition API defined in #28617.

Does this PR introduce any user-facing change?

Yes. Users can run ALTER TABLE ... ADD PARTITION or ALTER TABLE ... DROP PARTITION to create or drop partitions in a V2 table.

How was this patch tested?

Ran the existing suites and fixed old tests.

stczwd added 2 commits August 3, 2020 23:46
Change-Id: I002942962f8b41115edad0461c8980f67517947d
Change-Id: Id4c4ee16dec31def7fbbb8609ef5ed41804c5402
val partParams = new java.util.HashMap[String, String](table.properties())
location.foreach(locationUri =>
  partParams.put("location", locationUri))
partParams.put("ignoreIfExists", ignoreIfExists.toString)
Contributor:

Why is this added to the partition parameters? I think Spark should handle this by ignoring PartitionAlreadyExistsException like we do in other cases.
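A minimal sketch of that approach, assuming the SupportsPartitionManagement API from #28617 (the helper name is illustrative, and the exact exception type may differ):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.analysis.PartitionAlreadyExistsException
import org.apache.spark.sql.connector.catalog.SupportsPartitionManagement

// Sketch only: handle IF NOT EXISTS on the Spark side instead of smuggling
// "ignoreIfExists" through the partition properties map.
def createPartitionIfNotExists(
    table: SupportsPartitionManagement,
    partIdent: InternalRow,
    properties: java.util.Map[String, String],
    ignoreIfExists: Boolean): Unit = {
  try {
    table.createPartition(partIdent, properties)
  } catch {
    // Assumed exception type; swallowed only when IF NOT EXISTS was given.
    case _: PartitionAlreadyExistsException if ignoreIfExists => ()
  }
}
```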

Contributor Author:

Fine, I'll change it.

if (conflictKeys.nonEmpty) {
  throw new AnalysisException(
    s"Partition key ${conflictKeys.mkString(",")} " +
      s"not exists in ${ident.namespace().quoted}.${ident.name()}")
Contributor:

Nit: indentation doesn't match between these lines.


def convertPartitionIndentifers(
    partSpec: TablePartitionSpec,
    partSchema: StructType): InternalRow = {
Contributor:

Why is this included with the implicits when it isn't an implicit class?

Contributor Author:

Hm, it looks a little ugly to me if it's defined with the implicits. I can change it if you think it's better that way.

Contributor:

I don't think it needs to be implicit. I just don't think it belongs in the implicits class if it isn't an implicit. I think there is a util class you could include this in.

Contributor Author:

Ah, I misunderstood. Thanks.

val partValues = partSchema.map { part =>
  part.dataType match {
    case _: ByteType =>
      partSpec.getOrElse(part.name, "0").toByte
Contributor:

Conversion to InternalRow should not modify the partition values by filling in defaults. Filling in a default like this is a correctness bug.

I think this should require that all partition names are present in the map, and pass null if a name is present but does not have a value. If the partition doesn't allow null partition values, then it should throw an exception.
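A rough sketch of the conversion this comment describes (the helper name and error message are illustrative, not the PR's actual code):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.StructType

// Sketch: require every partition column to appear in the spec; a name that
// maps to no value becomes a null partition value, and a missing name fails.
// Sources that do not allow null partition values should reject the null.
def toPartitionIdent(
    partSpec: Map[String, String],
    partSchema: StructType): InternalRow = {
  val values = partSchema.fields.map { field =>
    partSpec.get(field.name) match {
      case Some(null) => null // present but without a value
      case Some(value) =>
        // Reuse Cast to convert the raw string to the column's data type
        // instead of filling in a type-specific default.
        Cast(Literal(value), field.dataType).eval()
      case None =>
        throw new AnalysisException(
          s"Partition spec is missing partition column ${field.name}")
    }
  }
  InternalRow.fromSeq(values)
}
```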

Contributor Author:

Sure, sounds reasonable to me.

*/
case class AlterTableAddPartitionExec(
    catalog: TableCatalog,
    ident: Identifier,
Contributor:

Why not pass a Table instance like other plans that modify table data (e.g., AppendDataExec)?

We generally like to load tables early, so that we can do as much validation as possible in the analyzer and planner. By loading the table before passing it here, we would be able to use analyzer rules to validate the partition specs against the table's partition schema, and to make sure the table implements SupportsPartitions.
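For illustration, a check of that shape (assuming the SupportsPartitionManagement interface from #28617; the analyzer-rule wiring is omitted):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.connector.catalog.{SupportsPartitionManagement, Table}

// Sketch: with the Table resolved during analysis, reject unsupported tables
// before planning rather than at execution time.
def asPartitionTable(table: Table): SupportsPartitionManagement = table match {
  case t: SupportsPartitionManagement => t
  case _ =>
    throw new AnalysisException(
      s"Table ${table.name} does not support partition management")
}
```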

Contributor Author:

Sure, I will change it.


SparkQA commented Aug 3, 2020

Test build #126993 has finished for PR 29339 at commit 6efca68.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AlterTableAddPartition(
  • case class AlterTableDropPartition(
  • case class AlterTableAddPartitionExec(
  • case class AlterTableDropPartitionExec(

@dongjoon-hyun marked this pull request as draft August 3, 2020 21:45
    specs: Seq[TablePartitionSpec],
    ignoreIfNotExists: Boolean,
    purge: Boolean,
    retainData: Boolean) extends V2CommandExec {
@cloud-fan (Contributor) commented Aug 4, 2020

We should think about how to deal with purge and retainData. Shall we just put them into partition properties, or into dropPartition parameters?

Contributor Author:

These two configurations seem to be mainly used with Hive tables. Besides, retainData is always false, and purge only works in some versions.
Maybe put them into table properties? AFAICT, it is the table that defines these operations.

@jackylee-ch (Contributor Author) commented Aug 5, 2020

I have added a warning for purge. Does this look fine to you? @cloud-fan
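Roughly, the warning amounts to something like this (message text and placement are illustrative, not the PR's exact code):

```scala
import org.apache.spark.internal.Logging

// Sketch: PURGE has no equivalent in the v2 partition API, so the exec node
// can only log that the flag is being ignored.
trait PurgeWarning extends Logging {
  def warnIfPurge(purge: Boolean): Unit = {
    if (purge) {
      logWarning("PURGE is not supported by the v2 partition API; ignoring it.")
    }
  }
}
```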

Change-Id: I10d4ff8d86fb70f195efa21156eed03dd0a74a32

SparkQA commented Aug 4, 2020

Test build #127060 has finished for PR 29339 at commit f0bc357.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • implicit class TablePartitionSpecHelper(partSpec: TablePartitionSpec)

Change-Id: I725a84ec99187b8da6807b4acbdd7b39a740c036
@jackylee-ch changed the title from "[Spark-32512][SQL][WIP] add alter table add/drop partition command for datasourcev2" to "[Spark-32512][SQL] add alter table add/drop partition command for datasourcev2" Aug 5, 2020
Change-Id: I11a9dc214102854f533330a739d32a41500d9278

SparkQA commented Aug 5, 2020

Test build #127072 has finished for PR 29339 at commit 61cae52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jackylee-ch marked this pull request as ready for review August 5, 2020 07:00
@jackylee-ch (Contributor Author):

cc @rdblue and @cloud-fan


SparkQA commented Aug 5, 2020

Test build #127075 has finished for PR 29339 at commit b1fc84b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jackylee-ch (Contributor Author):

retest this please

override protected def run(): Seq[InternalRow] = {
  partitions.foreach { case (partIdent, properties) =>
    try {
      table.createPartition(partIdent, properties.asJava)
Contributor:

The SQL command can add multiple partitions at once, and ideally it should be atomic.

I have 2 thoughts:

  1. for v2, we don't allow the SQL command to add more than one partition at once.
  2. the v2 API should be createPartitions that takes an array of partitions.

Which one do you prefer? The same problem applies to drop partitions as well. @rdblue @stczwd

Contributor:

FYI, the Hive catalog API is createPartitions, which can create a list of partitions at once. Personally I prefer 2; other catalog implementations can fail on more than one partition if they can't do it in an atomic way.

Contributor Author:

Hm. Supporting createPartitions means that the table needs to support atomic partition operations: once there is a problem mid-operation, such as a partition that already exists, it must roll back to the state before the operation.
Hive does support atomic createPartitions; I don't know if others support this.

Contributor:

I think that a variant of createPartition that works for multiple partitions should be added as an optional trait, or as optional methods in the existing interface.

We want sources to have predictable, standard behavior. I think that means that when adding a single partition in a group fails, the ones that were already successful should be rolled back. That way multiple calls to createPartition have the same result as a single call to createPartitions and the SQL statement has well-defined and reliable behavior.

If we agree on that behavior, then I think it is clear that Spark should handle calling either createPartition multiple times or calling createPartitions because that's the best way to get reliable behavior, while keeping table implementations simple -- those that don't support an atomic operation just implement createPartition.
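A sketch of that fallback (the helper name is illustrative; it assumes the single-partition createPartition/dropPartition methods from #28617):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.SupportsPartitionManagement

// Sketch: emulate an atomic createPartitions by rolling back the partitions
// already created when a later createPartition call fails.
def createPartitionsBestEffort(
    table: SupportsPartitionManagement,
    idents: Seq[InternalRow],
    properties: Seq[java.util.Map[String, String]]): Unit = {
  val created = scala.collection.mutable.ArrayBuffer.empty[InternalRow]
  try {
    idents.zip(properties).foreach { case (ident, props) =>
      table.createPartition(ident, props)
      created += ident
    }
  } catch {
    case e: Throwable =>
      created.foreach(ident => table.dropPartition(ident)) // best-effort rollback
      throw e
  }
}
```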

@jackylee-ch (Contributor Author) commented Aug 5, 2020

That way multiple calls to createPartition have the same result as a single call to createPartitions and the SQL statement has well-defined and reliable behavior.

It is hard to guarantee atomicity across multiple calls to createPartition, as the program may stop in the middle of the operations.

How about this? We create a new optional trait SupportsAtomicPartitions for atomic multi-partition operations, and implementations that don't support it should throw UnsupportedOperationException if the user tries to operate on multiple partitions at once.

Contributor:

Rolling back partition changes is like what we do with CTAS. If the write fails for non-atomic CTAS, we drop the table that was created. That won't always work, but at least the expectation is that the commands have the same behavior.

I'm okay with failing ADD PARTITION commands that have multiple partitions if atomic create/drop is not supported as well. That seems like another reasonable way to go. The important thing to me is that the commands have the same stated behavior across sources.

Contributor Author:

The important thing to me is that the commands have the same stated behavior across sources.

Yes, agree with that.

We create a new optional trait SupportsAtomicPartitions for atomic multi-partition operations, and implementations that don't support it should throw UnsupportedOperationException if the user tries to operate on multiple partitions at once.

I still prefer this. @cloud-fan does this look good to you?

Contributor:

SupportsAtomicPartitions SGTM, but we need to make the naming better.

How about SupportsPartitionManagement and SupportsAtomicPartitionManagement?
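For reference, a rough Scala rendering of the proposed split (the merged Spark interfaces are Java, and the exact method signatures may differ):

```scala
import java.util
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Base API: single-partition operations that any partitioned table can offer.
trait SupportsPartitionManagement {
  def partitionSchema(): StructType
  def createPartition(ident: InternalRow, properties: util.Map[String, String]): Unit
  def dropPartition(ident: InternalRow): Boolean
}

// Atomic extension: catalogs that can create or drop several partitions in
// one transactional operation implement this instead.
trait SupportsAtomicPartitionManagement extends SupportsPartitionManagement {
  def createPartitions(
      idents: Array[InternalRow],
      properties: Array[util.Map[String, String]]): Unit
  def dropPartitions(idents: Array[InternalRow]): Boolean
}
```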

Contributor Author:

That's OK with me. I'll change it, thanks.


SparkQA commented Aug 5, 2020

Test build #127086 has finished for PR 29339 at commit b1fc84b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

stczwd added 2 commits August 6, 2020 21:43
Change-Id: I7b782bfafe77e62b842fd6533347d6c705c62033
Change-Id: I377cb196b8c8189b96db5765e3435bd464a2e6b2

SparkQA commented Aug 6, 2020

Test build #127145 has finished for PR 29339 at commit f4a6ee3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Change-Id: Ia9f4c9d2724f6c05ecf1aa6a025d379140586853

SparkQA commented Aug 6, 2020

Test build #127146 has finished for PR 29339 at commit 9bd20ba.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

case AlterTableDropPartitionStatement(tbl, specs, ifExists, purge, retainData) =>
  val v1TableName = parseV1Table(tbl, "ALTER TABLE DROP PARTITION")
case AlterTableDropPartition(
    r @ ResolvedTable(_, _, _: V1Table), specs, ifExists, purge, retainData)
Contributor:

ditto

Contributor Author:

done

@cloud-fan (Contributor) left a comment

LGTM except for some minor comments

failAnalysis(s"Table ${table.name()} can not alter partitions.")

// Skip atomic partition tables
case (_: SupportsAtomicPartitionManagement, _) =>
Contributor:

Not related to this PR: I'm wondering if we do need this separation. Do we have a concern that it's hard for implementations to add/drop multiple partitions atomically?

Contributor Author:

Em, it depends on whether the third-party system or storage supports transactions. MySQL and Hive support this very well.

Contributor:

As an example, TableCatalog.alterTable accepts a list of TableChange without adding a new atomic API. I don't know why the partition API needs to be different.
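For comparison, alterTable takes varargs of TableChange, so one call can carry several changes and atomicity is left to the catalog (the property names below are illustrative):

```scala
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog, TableChange}

// One alterTable call applying two changes; whether the catalog applies them
// atomically is an implementation detail, with no separate atomic interface.
def addTwoProperties(catalog: TableCatalog, ident: Identifier): Unit = {
  catalog.alterTable(
    ident,
    TableChange.setProperty("owner", "spark"),
    TableChange.setProperty("retention", "30d"))
}
```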


SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35484/


SparkQA commented Nov 10, 2020

Test build #130878 has finished for PR 29339 at commit effd0ed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public static class NoOpMergedShuffleFileManager implements MergedShuffleFileManager
  • public class RemoteBlockPushResolver implements MergedShuffleFileManager
  • static class PushBlockStreamCallback implements StreamCallbackWithID
  • public static class AppShuffleId
  • public static class AppShufflePartitionInfo
  • trait OffsetWindowFunction extends WindowFunction
  • case class LoadData(
  • case class DropTableExec(
  • class HDFSBackedReadStateStore(val version: Long, map: MapType)
  • trait ReadStateStore
  • trait StateStore extends ReadStateStore
  • class WrappedReadStateStore(store: StateStore) extends ReadStateStore
  • abstract class BaseStateStoreRDD[T: ClassTag, U: ClassTag](
  • class ReadStateStoreRDD[T: ClassTag, U: ClassTag](
  • class StateStoreRDD[T: ClassTag, U: ClassTag](
  • abstract class JdbcDialect extends Serializable with Logging


SparkQA commented Nov 10, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35484/


SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35485/


SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35485/


SparkQA commented Nov 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35489/


SparkQA commented Nov 10, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35489/


SparkQA commented Nov 10, 2020

Test build #130879 has finished for PR 29339 at commit 7377469.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 10, 2020

Test build #130883 has finished for PR 29339 at commit dcd5060.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 11, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35509/


SparkQA commented Nov 11, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35509/


SparkQA commented Nov 11, 2020

Test build #130903 has finished for PR 29339 at commit d316e56.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

retest this please


SparkQA commented Nov 11, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35512/


SparkQA commented Nov 11, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35512/


SparkQA commented Nov 11, 2020

Test build #130906 has finished for PR 29339 at commit d316e56.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

GA passed, merging to master, thanks!

@cloud-fan closed this in 1eb236b Nov 11, 2020
@jackylee-ch deleted the SPARK-32512-new branch November 11, 2020 11:47
@jackylee-ch (Contributor Author):

@cloud-fan @rdblue @MaxGekk
Thanks for your help

@MaxGekk (Member) commented Nov 20, 2020

I removed the duplicate tests from DataSourceV2SQLSuite in #30444. Please review the PR.

cloud-fan pushed a commit that referenced this pull request Nov 20, 2020
…TABLE .. PARTITIONS from DataSourceV2SQLSuite

### What changes were proposed in this pull request?
Remove tests from `DataSourceV2SQLSuite` that were copied to `AlterTablePartitionV2SQLSuite` by #29339.

### Why are the changes needed?
- To reduce tests execution time
- To improve test maintenance

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified tests:
```
$ build/sbt "test:testOnly *DataSourceV2SQLSuite"
$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
```

Closes #30444 from MaxGekk/dedup-tests-AlterTablePartitionV2SQLSuite.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
