[SPARK-16905] SQL DDL: MSCK REPAIR TABLE #14500
Conversation
@yhuai Could you help generate the golden result for this suite?

Jenkins, retest this please.

We do not generate golden files anymore. Let's port those tests. Thanks.

@yhuai Just checked repair.q; it's mostly redundant, already covered by our unit tests, so we can just ignore it.
case class RepairTableCommand(tableName: TableIdentifier) extends RunnableCommand {
  override def run(spark: SparkSession): Seq[Row] = {
    val catalog = spark.sessionState.catalog
    val table = catalog.getTableMetadata(tableName)
This is dead code. The previous line already checks whether the table exists or not.
val table = catalog.getTableMetadata(tableName)
Test build #63242 has finished for PR 14500 at commit

Test build #63243 has finished for PR 14500 at commit

Test build #63244 has finished for PR 14500 at commit

Test build #63246 has finished for PR 14500 at commit
 * Create an [[AlterTableDiscoverPartitionsCommand]] command
 *
 * For example:
 * {{{
Nit: Update the syntax and the comments here.
Test build #63279 has finished for PR 14500 at commit

Test build #63283 has finished for PR 14500 at commit
val threshold = spark.conf.get("spark.rdd.parallelListingThreshold", "10").toInt
val statusPar: GenSeq[FileStatus] =
  if (partitionNames.length > 1 && statuses.length > threshold || partitionNames.length > 2) {
    val parArray = statuses.par
I didn't look carefully, but if you are using the default execution context, please create a new one; otherwise it'd block.
Cool. Can we make it explicit, e.g. statuses.par(evalTaskSupport)?
This is copied from UnionRDD.
I could not figure out how to make that work; at least statuses.par(evalTaskSupport) does not work.
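The dedicated-pool idea discussed above can be sketched as follows. This is an illustrative standalone snippet, not the PR's code: the collection and pool size are stand-ins, and `ForkJoinTaskSupport` location depends on the Scala version (parallel collections became a separate module in Scala 2.13).

```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

object ParallelListingSketch {
  // Stand-in for mapping over listed FileStatus entries in parallel.
  def doubledSum(n: Int): Int = {
    val parArray = (1 to n).toVector.par
    // Assign a dedicated pool instead of relying on the default task
    // support, so heavy listing work cannot starve unrelated tasks
    // that share the global pool (the reviewer's concern above).
    parArray.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
    parArray.map(_ * 2).sum
  }
}
```

Assigning to `tasksupport` after calling `.par` achieves the same isolation that a hypothetical `statuses.par(evalTaskSupport)` call was aiming for.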
if (st.isDirectory && name.contains("=")) {
  val ps = name.split("=", 2)
  val columnName = PartitioningUtils.unescapePathName(ps(0)).toLowerCase
  val value = PartitioningUtils.unescapePathName(ps(1))
Do we need to check if the value is valid? E.g., for a partition column "a" of IntegerType, "a=abc" is invalid.
We could have a TODO here.
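One possible shape for that TODO, as a hedged sketch: the type tags below are simplified strings, whereas the real code would match on Spark's `DataType` instances; the helper name is hypothetical.

```scala
import scala.util.Try

object PartitionValueCheck {
  // Hypothetical helper: returns true when the raw partition value can
  // be parsed as the declared column type.
  def isValidPartitionValue(value: String, dataType: String): Boolean =
    dataType match {
      case "int"    => Try(value.toInt).isSuccess
      case "bigint" => Try(value.toLong).isSuccess
      case "double" => Try(value.toDouble).isSuccess
      case _        => true // e.g. string columns accept any value
    }
}
```

With a check like this, a directory such as a=abc under an IntegerType partition column could be skipped or rejected instead of being registered as a partition.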
Test build #63375 has finished for PR 14500 at commit
val ps = name.split("=", 2)
val columnName = PartitioningUtils.unescapePathName(ps(0)).toLowerCase
// TODO: Validate the value
val value = PartitioningUtils.unescapePathName(ps(1))
Can this escaping cause problems in (say) S3 for objects of the form "foo%20bar"?
If the partitions were generated by Spark, they can be unescaped back correctly. For others, there could be compatibility issues. For example, Spark does not escape on Linux, so unescaping %20 could be wrong (we could show a warning?). I think these are out of the scope of this PR.
yes, that makes sense.
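To make the %20 hazard concrete, here is a minimal percent-decoding sketch in the spirit of PartitioningUtils.unescapePathName (a simplified stand-in; the real implementation handles more edge cases): a literal foo%20bar directory name written by an external tool decodes to foo bar, which is exactly the compatibility issue discussed above.

```scala
object PathUnescape {
  // Simplified stand-in for PartitioningUtils.unescapePathName:
  // decodes %XX sequences, leaving malformed escapes untouched.
  def unescapePathName(path: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < path.length) {
      val c = path.charAt(i)
      if (c == '%' && i + 2 < path.length) {
        try {
          sb.append(Integer.parseInt(path.substring(i + 1, i + 3), 16).toChar)
          i += 3
        } catch {
          case _: NumberFormatException =>
            sb.append(c); i += 1
        }
      } else {
        sb.append(c); i += 1
      }
    }
    sb.toString
  }
}
```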
LGTM

Merging into master and 2.0 branch, thanks!

@liancheng Can you do a post-hoc review?
MSCK REPAIR TABLE can be used to recover the partitions in the external catalog based on the partitions in the file system. Another syntax is: ALTER TABLE table RECOVER PARTITIONS. The implementation in this PR only lists partitions (not the files within a partition) on the driver (in parallel if needed). Added unit tests and a Hive compatibility test suite. Author: Davies Liu <[email protected]> Closes #14500 from davies/repair_table.
LGTM
  CatalogTablePartition(spec, table.storage.copy(locationUri = Some(location.toUri.toString)))
}
spark.sessionState.catalog.createPartitions(tableName,
  parts.toArray[CatalogTablePartition], ignoreIfExists = true)
What will happen if we get thousands, or tens of thousands, of new partitions?
Good question, see the implementation in HiveShim:
// Follows exactly the same logic of DDLTask.createPartitions in Hive 0.12
override def createPartitions(
    hive: Hive,
    database: String,
    tableName: String,
    parts: Seq[CatalogTablePartition],
    ignoreIfExists: Boolean): Unit = {
  val table = hive.getTable(database, tableName)
  parts.foreach { s =>
    val location = s.storage.locationUri.map(new Path(table.getPath, _)).orNull
    val spec = s.spec.asJava
    if (hive.getPartition(table, spec, false) != null && ignoreIfExists) {
      // Ignore this partition since it already exists and ignoreIfExists == true
    } else {
      if (location == null && table.isView()) {
        throw new HiveException("LOCATION clause illegal for view partition")
      }
      createPartitionMethod.invoke(
        hive,
        table,
        spec,
        location,
        null, // partParams
        null, // inputFormat
        null, // outputFormat
        -1: JInteger, // numBuckets
        null, // cols
        null, // serializationLib
        null, // serdeParams
        null, // bucketCols
        null) // sortCols
    }
  }
}
All these partitions will be inserted into Hive sequentially, so grouping them into batches will not help here.
No, this is only true for Hive <= 0.12; for Hive 0.13+, they are sent in a single RPC, so we should verify what the limit is for a single RPC.
It seems that the Hive metastore can't handle an RPC with millions of partitions; I will send a patch to do it in batches.
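The batching follow-up mentioned here could look roughly like this. Illustrative only: the batch size and the surrounding call site are placeholders, not the actual patch.

```scala
object PartitionBatching {
  // Hypothetical sketch: split a large partition list into bounded
  // batches so that no single metastore RPC carries millions of
  // partitions at once.
  def batches[A](parts: Seq[A], batchSize: Int): Seq[Seq[A]] =
    parts.grouped(batchSize).toSeq
}
```

Each batch would then be handed to the catalog's createPartitions call (with ignoreIfExists = true) in turn, bounding the size of every individual RPC.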