
Conversation

@AngersZhuuuu
Contributor

@AngersZhuuuu AngersZhuuuu commented Jan 20, 2022

What changes were proposed in this pull request?

Remove the supportFieldName check in the DataSource ORC format.

  1. org.apache.spark.sql.hive.orc.OrcFileFormat does not have this check either.
  2. Tried a lot of weird column names; all of them can be read and written.

Why are the changes needed?

Removes an unnecessary check.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT

@AngersZhuuuu AngersZhuuuu marked this pull request as draft January 20, 2022 03:13
@github-actions github-actions bot added the SQL label Jan 20, 2022
@HyukjinKwon
Member

I checked the history. It seems we added this check mainly because Parquet restricts column names, a restriction that will be removed by #35229. So this change seems fine to me, but it would be great to double check w/ @dongjoon-hyun

@HyukjinKwon
Member

@AngersZhuuuu BTW, I think it would be great to explain in the PR description why we can remove this check, pointing out the relevant commits in the history.

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-37965][SQL]Remove check field name when reading/writing existing data in Orc [WIP][SPARK-37965][SQL] Remove check field name when reading/writing existing data in Orc Jan 20, 2022
@AngersZhuuuu
Contributor Author

@AngersZhuuuu BTW, I think it would be great to explain in the PR description why we can remove this check, pointing out the relevant commits in the history.

Yea, will do this later.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Yep, @HyukjinKwon 's comment is correct.
Let's review this after #35229 landed to the master first.

Thank you for keeping Apache Spark data sources consistent.

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Jan 21, 2022

This check was added in #19124 but changed in #29761 to wrap field names in backquotes.

PR #29761 also added a test, test("SPARK-32889: ORC table column name supports special characters"),

which shows that `a b` is supported. `,` is not supported because we cannot create a table having a column whose name contains commas in the Hive metastore. It's not related to the file format.
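For context, the backquote wrapping that #29761 relies on can be sketched as follows. This is a hedged illustration, not Spark's actual helper: quoteFieldName is a hypothetical stand-in, assuming embedded backticks are escaped by doubling them.

```scala
// Hypothetical sketch of backquote wrapping for ORC field names,
// assuming embedded backticks are escaped by doubling them.
def quoteFieldName(name: String): String =
  "`" + name.replace("`", "``") + "`"

// A name with a space survives quoting, so the ORC schema parser can
// accept it; the comma limitation comes from the Hive metastore instead.
println(quoteFieldName("a b"))  // `a b`
```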

@AngersZhuuuu AngersZhuuuu marked this pull request as ready for review January 21, 2022 08:26
@AngersZhuuuu AngersZhuuuu changed the title [WIP][SPARK-37965][SQL] Remove check field name when reading/writing existing data in Orc [SPARK-37965][SQL] Remove check field name when reading/writing existing data in Orc Jan 21, 2022
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 5b91381 Jan 24, 2022
@dongjoon-hyun
Member

+1, LGTM.

@cxzl25
Contributor

cxzl25 commented Nov 21, 2022

I think we should still check for the empty field name case, or add the supportFieldName method back.

Here are some tests.

native

native write

set spark.sql.orc.impl=native;
create table t_1 stored as orc  as select '' ;

Succeeds.

native read

set spark.sql.orc.impl=native;
select * from t_1;
java.lang.IllegalArgumentException: Empty quoted field name at '``^'
        at org.apache.orc.impl.ParserUtils.parseName(ParserUtils.java:114)
        at org.apache.orc.OrcUtils.convertTypeFromProtobuf(OrcUtils.java:352)
        at org.apache.orc.impl.OrcTail.<init>(OrcTail.java:72)
        at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:845)
        at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:566)
        at org.apache.orc.OrcFile.createReader(OrcFile.java:385)
        at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:147)

hive read

set spark.sql.orc.impl=hive;
select * from t_1;
java.lang.IllegalArgumentException: Empty quoted field name at '``^'
        at org.apache.orc.impl.ParserUtils.parseName(ParserUtils.java:114)
        at org.apache.orc.OrcUtils.convertTypeFromProtobuf(OrcUtils.java:352)
        at org.apache.orc.impl.OrcTail.<init>(OrcTail.java:72)
        at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:845)
        at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:566)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:63)
        at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:55)
        at org.apache.spark.sql.hive.orc.OrcFileOperator$.$anonfun$getFileReader$3(OrcFileOperator.scala:76)

hive

hive write

set spark.sql.orc.impl=hive;
create table t_1 stored as orc  as select '' ;
java.lang.IllegalArgumentException: Error: name expected at the position 7 of 'struct<:string>' but ':' is found.
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:378)
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:502)
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfoFromTypeString(TypeInfoUtils.java:831)
        at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:242)
        at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:279)
        at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:110)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)

use HiveFileFormat

set spark.sql.hive.convertMetastoreOrc=false;
create table t_1 stored as orc  as select '' ;
Error in query:  Column name "" contains invalid character(s). Please use alias to rename it.

org.apache.spark.sql.hive.execution.HiveFileFormat#supportFieldName

case "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat" =>
  try {
    TypeInfoUtils.getTypeInfoFromTypeString(s"struct<$name:int>")
    true
  } catch {
    case _: IllegalArgumentException => false
  }
@cloud-fan
Contributor

@AngersZhuuuu is there a way to check field names only on the write side?
