-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-10304][SQL] Partition discovery should throw an exception if the dir structure is invalid #8840
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-10304][SQL] Partition discovery should throw an exception if the dir structure is invalid #8840
Changes from all commits
2d13156
66895a2
cc363da
779fbd2
cdf6dc4
f9d1e64
3db4226
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -77,16 +77,24 @@ private[sql] object PartitioningUtils { | |
| defaultPartitionName: String, | ||
| typeInference: Boolean): PartitionSpec = { | ||
| // First, we need to parse every partition's path and see if we can find partition values. | ||
| val pathsWithPartitionValues = paths.flatMap { path => | ||
| parsePartition(path, defaultPartitionName, typeInference).map(path -> _) | ||
| } | ||
| val (partitionValues, optBasePaths) = paths.map { path => | ||
| parsePartition(path, defaultPartitionName, typeInference) | ||
| }.unzip | ||
|
|
||
| val pathsWithPartitionValues = paths.zip(partitionValues).flatMap(x => x._2.map(x._1 -> _)) | ||
|
|
||
| if (pathsWithPartitionValues.isEmpty) { | ||
| // This dataset is not partitioned. | ||
| PartitionSpec.emptySpec | ||
| } else { | ||
| // This dataset is partitioned. We need to check whether all partitions have the same | ||
| // partition columns and resolve potential type conflicts. | ||
| val basePaths = optBasePaths.flatMap(x => x) | ||
| assert( | ||
| basePaths.distinct.size == 1, | ||
| "Conflicting directory structures detected. Suspicious paths:\b" + | ||
| basePaths.mkString("\n\t", "\n\t", "\n\n")) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you give a case that we will hit this branch? What will |
||
|
|
||
| val resolvedPartitionValues = resolvePartitions(pathsWithPartitionValues) | ||
|
|
||
| // Creates the StructType which represents the partition columns. | ||
|
|
@@ -110,12 +118,12 @@ private[sql] object PartitioningUtils { | |
| } | ||
|
|
||
| /** | ||
| * Parses a single partition, returns column names and values of each partition column. For | ||
| * example, given: | ||
| * Parses a single partition, returns column names and values of each partition column, also | ||
| * the base path. For example, given: | ||
| * {{{ | ||
| * path = hdfs://<host>:<port>/path/to/partition/a=42/b=hello/c=3.14 | ||
| * }}} | ||
| * it returns: | ||
| * it returns the partition: | ||
| * {{{ | ||
| * PartitionValues( | ||
| * Seq("a", "b", "c"), | ||
|
|
@@ -124,34 +132,40 @@ private[sql] object PartitioningUtils { | |
| * Literal.create("hello", StringType), | ||
| * Literal.create(3.14, FloatType))) | ||
| * }}} | ||
| * and the base path: | ||
| * {{{ | ||
| * /path/to/partition | ||
| * }}} | ||
| */ | ||
| private[sql] def parsePartition( | ||
| path: Path, | ||
| defaultPartitionName: String, | ||
| typeInference: Boolean): Option[PartitionValues] = { | ||
| typeInference: Boolean): (Option[PartitionValues], Option[Path]) = { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am wondering if it's possible to make the return type as
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. BTW, you need to update this function description also for its return type.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Or change the Then the code probably much simple and readable.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But it is possible that we only have Path without corresponding PartitionValues, i.e., (None, Some(path)). So we can't just make it as
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A base path is not always associated with a That is why I don't make |
||
| val columns = ArrayBuffer.empty[(String, Literal)] | ||
| // Old Hadoop versions don't have `Path.isRoot` | ||
| var finished = path.getParent == null | ||
| var chopped = path | ||
| var basePath = path | ||
|
|
||
| while (!finished) { | ||
| // Sometimes (e.g., when speculative task is enabled), temporary directories may be left | ||
| // uncleaned. Here we simply ignore them. | ||
| if (chopped.getName.toLowerCase == "_temporary") { | ||
| return None | ||
| return (None, None) | ||
| } | ||
|
|
||
| val maybeColumn = parsePartitionColumn(chopped.getName, defaultPartitionName, typeInference) | ||
| maybeColumn.foreach(columns += _) | ||
| basePath = chopped | ||
| chopped = chopped.getParent | ||
| finished = maybeColumn.isEmpty || chopped.getParent == null | ||
| finished = (maybeColumn.isEmpty && !columns.isEmpty) || chopped.getParent == null | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we need
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. oh i see. It is for something like |
||
| } | ||
|
|
||
| if (columns.isEmpty) { | ||
| None | ||
| (None, Some(path)) | ||
| } else { | ||
| val (columnNames, values) = columns.reverse.unzip | ||
| Some(PartitionValues(columnNames, values)) | ||
| (Some(PartitionValues(columnNames, values)), Some(basePath)) | ||
| } | ||
| } | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -58,14 +58,46 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha | |
| check(defaultPartitionName, Literal.create(null, NullType)) | ||
| } | ||
|
|
||
| test("parse invalid partitioned directories") { | ||
| // Invalid | ||
| var paths = Seq( | ||
| "hdfs://host:9000/invalidPath", | ||
| "hdfs://host:9000/path/a=10/b=20", | ||
| "hdfs://host:9000/path/a=10.5/b=hello") | ||
|
|
||
| var exception = intercept[AssertionError] { | ||
| parsePartitions(paths.map(new Path(_)), defaultPartitionName, true) | ||
| } | ||
| assert(exception.getMessage().contains("Conflicting directory structures detected")) | ||
|
|
||
| // Valid | ||
| paths = Seq( | ||
| "hdfs://host:9000/path/_temporary", | ||
| "hdfs://host:9000/path/a=10/b=20", | ||
| "hdfs://host:9000/path/_temporary/path") | ||
|
|
||
| parsePartitions(paths.map(new Path(_)), defaultPartitionName, true) | ||
|
|
||
| // Invalid | ||
| paths = Seq( | ||
| "hdfs://host:9000/path/_temporary", | ||
| "hdfs://host:9000/path/a=10/b=20", | ||
| "hdfs://host:9000/path/path1") | ||
|
|
||
| exception = intercept[AssertionError] { | ||
| parsePartitions(paths.map(new Path(_)), defaultPartitionName, true) | ||
| } | ||
| assert(exception.getMessage().contains("Conflicting directory structures detected")) | ||
| } | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Which case is for the one I mentioned in the jira? |
||
|
|
||
| test("parse partition") { | ||
| def check(path: String, expected: Option[PartitionValues]): Unit = { | ||
| assert(expected === parsePartition(new Path(path), defaultPartitionName, true)) | ||
| assert(expected === parsePartition(new Path(path), defaultPartitionName, true)._1) | ||
| } | ||
|
|
||
| def checkThrows[T <: Throwable: Manifest](path: String, expected: String): Unit = { | ||
| val message = intercept[T] { | ||
| parsePartition(new Path(path), defaultPartitionName, true).get | ||
| parsePartition(new Path(path), defaultPartitionName, true) | ||
| }.getMessage | ||
|
|
||
| assert(message.contains(expected)) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it is not very obvious what we are doing at here. Maybe a comment can help.