[SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat #14339
Conversation
I will cc the people who I think might be related to this: @marmbrus @cloud-fan @rxin. It seems to be a regression, because this does not happen in Spark 1.6; that is, the code below

```scala
val path = "/tmp/test.json"
val json = """{"a.b":"data"}"""
sqlContext.sparkContext
  .parallelize(json :: Nil)
  .saveAsTextFile(path)
sqlContext.read.json(path).show()
```

works fine and reads the `a.b` column back correctly.
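For contrast, the same round trip on Spark 2.0 fails at analysis time (the code and error here are taken from the PR description below):

```scala
val path = "/tmp/path"
val json = """{"a.b":"data"}"""
spark.sparkContext
  .parallelize(json :: Nil)
  .saveAsTextFile(path)
spark.read.json(path).collect()
// org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b];
```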
```scala
    assert(!(getPhysicalFilters(df) contains resolve(df, "p1 = 1")))
  }

  test("field names containing dots for both fields and partitioned fields") {
```
only partitioned fields?
Oh, actually, I am testing `c.3` as well - it seems this actually affects both data fields and partitioned fields:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
Lines 80 to 89 in 37f3be5

```scala
val partitionColumns =
  l.resolve(
    fsRelation.partitionSchema, fsRelation.sparkSession.sessionState.analyzer.resolver)
val partitionSet = AttributeSet(partitionColumns)
val partitionKeyFilters =
  ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
logInfo(s"Pruning directories with: ${partitionKeyFilters.mkString(",")}")

val dataColumns =
  l.resolve(fsRelation.dataSchema, fsRelation.sparkSession.sessionState.analyzer.resolver)
```
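For context on why the resolution here fails: per the PR description, dots in a name are treated as nested-field separators during lookup, so a flat column literally named `a.b` is not found. A minimal, self-contained sketch of that distinction (hypothetical helper names, not the actual Catalyst code):

```scala
// Sketch only: illustrates the dotted-name ambiguity, not Catalyst's implementation.
object DottedNameResolution {
  // Splitting on dots interprets "a.b" as field "b" nested inside column "a".
  def resolveByParts(name: String, columns: Set[String]): Option[String] = {
    val head = name.split("\\.").head
    if (columns.contains(head)) Some(head) else None
  }

  // Matching the literal name finds a flat column that happens to contain dots.
  def resolveExact(name: String, columns: Set[String]): Option[String] =
    if (columns.contains(name)) Some(name) else None

  def main(args: Array[String]): Unit = {
    val columns = Set("a.b")
    println(resolveByParts("a.b", columns)) // None -> "Unable to resolve a.b given [a.b]"
    println(resolveExact("a.b", columns))   // Some(a.b)
  }
}
```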
retest this please
can we test it in `FileSourceStrategySuite`?

Sure!
```scala
      .format("parquet")
      .partitionBy("part.col1", "part.col2")
      .save(path.getCanonicalPath)
    val copyData = spark.read.format("parquet").load(path.getCanonicalPath)
```
`copyData` -> `readBack`?
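For reference, here is the kind of write/read round trip the test above exercises, as a self-contained sketch (the temp path and sample values are illustrative, not from the PR):

```scala
import org.apache.spark.sql.SparkSession

object DottedPartitionRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dotted-names").getOrCreate()
    import spark.implicits._

    // Illustrative temp location; save() expects the target not to exist yet.
    val path = java.nio.file.Files.createTempDirectory("dotted").toFile
    path.delete()

    // Both a data column and the partition columns contain dots.
    Seq((1, "x", "y"), (2, "x", "z"))
      .toDF("col.1", "part.col1", "part.col2")
      .write
      .format("parquet")
      .partitionBy("part.col1", "part.col2")
      .save(path.getCanonicalPath)

    // Dotted names must be backtick-quoted when referenced in selections.
    val readBack = spark.read.format("parquet").load(path.getCanonicalPath)
    readBack.select("`col.1`", "`part.col1`", "`part.col2`").show()

    spark.stop()
  }
}
```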
LGTM, cc @liancheng to take another look

Test build #62785 has finished for PR 14339 at commit

Test build #62787 has finished for PR 14339 at commit

Test build #62791 has finished for PR 14339 at commit

retest this please

Test build #62803 has finished for PR 14339 at commit

Test build #62805 has finished for PR 14339 at commit

LGTM, merging to master and branch-2.0. Thanks!
[SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat

## What changes were proposed in this pull request?

It seems this is a regression, judging from https://issues.apache.org/jira/browse/SPARK-16698. A field name containing dots throws an exception. For example, the code below:

```scala
val path = "/tmp/path"
val json = """{"a.b":"data"}"""
spark.sparkContext
  .parallelize(json :: Nil)
  .saveAsTextFile(path)
spark.read.json(path).collect()
```

throws an exception as below:

```
Unable to resolve a.b given [a.b];
org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b];
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
  at scala.Option.getOrElse(Option.scala:121)
```

This problem was introduced in 17eec0a#diff-27c76f96a7b2733ecfd6f46a1716e153R121.

When extracting the data columns, it does not account for the fact that field names can contain dots. Also, field names are not expected to be quoted when a schema is defined, so there is no need to check whether a name is wrapped in backticks: the actual schema (inferred or user-given) would not carry quotes in its field names. For example, this throws an exception (**loading JSON from an RDD is fine**):

```scala
val json = """{"a.b":"data"}"""
val rdd = spark.sparkContext.parallelize(json :: Nil)
spark.read.schema(StructType(Seq(StructField("`a.b`", StringType, true))))
  .json(rdd).select("`a.b`").printSchema()
```

as below:

```
cannot resolve '```a.b```' given input columns: [`a.b`];
org.apache.spark.sql.AnalysisException: cannot resolve '```a.b```' given input columns: [`a.b`];
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```

## How was this patch tested?

Unit tests in `FileSourceStrategySuite`.

Author: hyukjinkwon <[email protected]>

Closes #14339 from HyukjinKwon/SPARK-16698-regression.

(cherry picked from commit 79826f3)
Signed-off-by: Cheng Lian <[email protected]>
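One way to see the quoting point above: backticks are reference-time quoting syntax, not part of a field name, so a programmatic schema should use the bare dotted name. A small sketch (assuming a Spark 2.x session named `spark`):

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// The field name in the schema is the raw "a.b"; no backticks belong here.
val schema = StructType(Seq(StructField("a.b", StringType, nullable = true)))

// Backticks appear only when the dotted name is referenced in a selection,
// to keep it from being parsed as nested field access a.b:
// df.select("`a.b`")
```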