[SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat #14339
Conversation
I will cc the people who I think might be related to this: @marmbrus @cloud-fan @rxin. It seems to be a regression, because this does not happen in Spark 1.6; that is, the code below

```scala
val path = "/tmp/test.json"
val json = """{"a.b":"data"}"""
sqlContext.sparkContext
  .parallelize(json :: Nil)
  .saveAsTextFile(path)
sqlContext.read.json(path).show()
```

works fine and reads the `a.b` column back correctly.
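For contrast, the same round trip on Spark 2.0 fails at analysis time (the code and error here are taken from the PR description below):

```scala
val path = "/tmp/path"
val json = """{"a.b":"data"}"""
spark.sparkContext
  .parallelize(json :: Nil)
  .saveAsTextFile(path)
spark.read.json(path).collect()
// org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b];
```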
```scala
    assert(!(getPhysicalFilters(df) contains resolve(df, "p1 = 1")))
  }

  test("field names containing dots for both fields and partitioned fields") {
```
only partitioned fields?
Oh, actually, I am testing `c.3` as well - it seems this actually affects both data fields and partitioned fields:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
Lines 80 to 89 in 37f3be5

```scala
val partitionColumns =
  l.resolve(
    fsRelation.partitionSchema, fsRelation.sparkSession.sessionState.analyzer.resolver)
val partitionSet = AttributeSet(partitionColumns)
val partitionKeyFilters =
  ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
logInfo(s"Pruning directories with: ${partitionKeyFilters.mkString(",")}")

val dataColumns =
  l.resolve(fsRelation.dataSchema, fsRelation.sparkSession.sessionState.analyzer.resolver)
```
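For context on why the resolution here fails: per the PR description, dots in a name are treated as nested-field separators during lookup, so a flat column literally named `a.b` is not found. A minimal, self-contained sketch of that distinction (hypothetical helper names, not the actual Catalyst code):

```scala
// Sketch only: illustrates the dotted-name ambiguity, not Catalyst's implementation.
object DottedNameResolution {
  // Splitting on dots interprets "a.b" as field "b" nested inside column "a".
  def resolveByParts(name: String, columns: Set[String]): Option[String] = {
    val head = name.split("\\.").head
    if (columns.contains(head)) Some(head) else None
  }

  // Matching the literal name finds a flat column that happens to contain dots.
  def resolveExact(name: String, columns: Set[String]): Option[String] =
    if (columns.contains(name)) Some(name) else None

  def main(args: Array[String]): Unit = {
    val columns = Set("a.b")
    println(resolveByParts("a.b", columns)) // None -> "Unable to resolve a.b given [a.b]"
    println(resolveExact("a.b", columns))   // Some(a.b)
  }
}
```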
retest this please
can we test it in `FileSourceStrategySuite`?

Sure!
```scala
      .format("parquet")
      .partitionBy("part.col1", "part.col2")
      .save(path.getCanonicalPath)
    val copyData = spark.read.format("parquet").load(path.getCanonicalPath)
```
`copyData` -> `readBack`?
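For reference, here is the kind of write/read round trip the test above exercises, as a self-contained sketch (the temp path and sample values are illustrative, not from the PR):

```scala
import org.apache.spark.sql.SparkSession

object DottedPartitionRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dotted-names").getOrCreate()
    import spark.implicits._

    // Illustrative temp location; save() expects the target not to exist yet.
    val path = java.nio.file.Files.createTempDirectory("dotted").toFile
    path.delete()

    // Both a data column and the partition columns contain dots.
    Seq((1, "x", "y"), (2, "x", "z"))
      .toDF("col.1", "part.col1", "part.col2")
      .write
      .format("parquet")
      .partitionBy("part.col1", "part.col2")
      .save(path.getCanonicalPath)

    // Dotted names must be backtick-quoted when referenced in selections.
    val readBack = spark.read.format("parquet").load(path.getCanonicalPath)
    readBack.select("`col.1`", "`part.col1`", "`part.col2`").show()

    spark.stop()
  }
}
```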
LGTM, cc @liancheng to take another look

Test build #62785 has finished for PR 14339 at commit

Test build #62787 has finished for PR 14339 at commit

Test build #62791 has finished for PR 14339 at commit

retest this please

Test build #62803 has finished for PR 14339 at commit

Test build #62805 has finished for PR 14339 at commit

LGTM, merging to master and branch-2.0. Thanks!
[SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat

## What changes were proposed in this pull request?

It seems this is a regression, judging from https://issues.apache.org/jira/browse/SPARK-16698. A field name containing dots throws an exception. For example, the code below:

```scala
val path = "/tmp/path"
val json = """{"a.b":"data"}"""
spark.sparkContext
  .parallelize(json :: Nil)
  .saveAsTextFile(path)
spark.read.json(path).collect()
```

throws an exception as below:

```
Unable to resolve a.b given [a.b];
org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b];
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
  at scala.Option.getOrElse(Option.scala:121)
```

This problem was introduced in 17eec0a#diff-27c76f96a7b2733ecfd6f46a1716e153R121.

When extracting the data columns, it does not account for the fact that field names can contain dots. Also, field names are not expected to be quoted when a schema is defined, so there is no need to check whether a name is wrapped in backticks: the actual schema (inferred or user-given) would not carry quotes in its field names. For example, this throws an exception (**loading JSON from an RDD is fine**):

```scala
val json = """{"a.b":"data"}"""
val rdd = spark.sparkContext.parallelize(json :: Nil)
spark.read.schema(StructType(Seq(StructField("`a.b`", StringType, true))))
  .json(rdd).select("`a.b`").printSchema()
```

as below:

```
cannot resolve '```a.b```' given input columns: [`a.b`];
org.apache.spark.sql.AnalysisException: cannot resolve '```a.b```' given input columns: [`a.b`];
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```

## How was this patch tested?

Unit tests in `FileSourceStrategySuite`.

Author: hyukjinkwon <[email protected]>

Closes #14339 from HyukjinKwon/SPARK-16698-regression.

(cherry picked from commit 79826f3)
Signed-off-by: Cheng Lian <[email protected]>
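One way to see the quoting point above: backticks are reference-time quoting syntax, not part of a field name, so a programmatic schema should use the bare dotted name. A small sketch (assuming a Spark 2.x session named `spark`):

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// The field name in the schema is the raw "a.b"; no backticks belong here.
val schema = StructType(Seq(StructField("a.b", StringType, nullable = true)))

// Backticks appear only when the dotted name is referenced in a selection,
// to keep it from being parsed as nested field access a.b:
// df.select("`a.b`")
```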