
Conversation

@AngersZhuuuu
Contributor

@AngersZhuuuu AngersZhuuuu commented Jan 20, 2022

What changes were proposed in this pull request?

Remove the supportFieldName check in the DataSource ORC format.

  1. org.apache.spark.sql.hive.orc.OrcFileFormat does not have this check either.
  2. Tried a lot of weird column names; all of them can be read and written.

Why are the changes needed?

Removes an unnecessary check.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT

@AngersZhuuuu AngersZhuuuu marked this pull request as draft January 20, 2022 03:13
@github-actions github-actions bot added the SQL label Jan 20, 2022
@HyukjinKwon
Member

I checked the history. It seems we added this check mainly because Parquet restricts column names, a restriction that will be removed by #35229. So this change seems fine to me, but it would be great to double check w/ @dongjoon-hyun

@HyukjinKwon
Member

@AngersZhuuuu BTW, I think it would be great to explain in the PR description why we can remove this check, pointing out the relevant commits in the history.

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-37965][SQL]Remove check field name when reading/writing existing data in Orc [WIP][SPARK-37965][SQL] Remove check field name when reading/writing existing data in Orc Jan 20, 2022
@AngersZhuuuu
Contributor Author

@AngersZhuuuu BTW, I think it would be great to explain in the PR description why we can remove this check, pointing out the relevant commits in the history.

Yea, will do this later.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Yep, @HyukjinKwon 's comment is correct.
Let's review this after #35229 landed to the master first.

Thank you for keeping Apache Spark data sources consistent.

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Jan 21, 2022

This check was added in #19124 but changed in #29761 to wrap field names in backquotes.

PR #29761 also added a test, test("SPARK-32889: ORC table column name supports special characters"),

which shows that `a b` is supported. `,` is not supported because we cannot create a table having a column whose name contains commas in the Hive metastore. It's not related to the file format.
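For context, the backquote wrapping that #29761 relies on can be sketched as follows. This is a hedged illustration, not Spark's actual helper: quoteFieldName is a hypothetical stand-in, assuming embedded backticks are escaped by doubling them.

```scala
// Hypothetical sketch of backquote wrapping for ORC field names,
// assuming embedded backticks are escaped by doubling them.
def quoteFieldName(name: String): String =
  "`" + name.replace("`", "``") + "`"

// A name with a space survives quoting, so the ORC schema parser can
// accept it; the comma limitation comes from the Hive metastore instead.
println(quoteFieldName("a b"))  // `a b`
```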

@AngersZhuuuu AngersZhuuuu marked this pull request as ready for review January 21, 2022 08:26
@AngersZhuuuu AngersZhuuuu changed the title [WIP][SPARK-37965][SQL] Remove check field name when reading/writing existing data in Orc [SPARK-37965][SQL] Remove check field name when reading/writing existing data in Orc Jan 21, 2022
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 5b91381 Jan 24, 2022
@dongjoon-hyun
Member

+1, LGTM.

@cxzl25
Contributor

cxzl25 commented Nov 21, 2022

I think we should still check for the empty field name case, or add the supportFieldName method back.

Here are some tests.

native

native write

set spark.sql.orc.impl=native;
create table t_1 stored as orc  as select '' ;

Succeeds.

native read

set spark.sql.orc.impl=native;
select * from t_1;
java.lang.IllegalArgumentException: Empty quoted field name at '``^'
        at org.apache.orc.impl.ParserUtils.parseName(ParserUtils.java:114)
        at org.apache.orc.OrcUtils.convertTypeFromProtobuf(OrcUtils.java:352)
        at org.apache.orc.impl.OrcTail.<init>(OrcTail.java:72)
        at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:845)
        at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:566)
        at org.apache.orc.OrcFile.createReader(OrcFile.java:385)
        at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:147)

hive read

set spark.sql.orc.impl=hive;
select * from t_1;
java.lang.IllegalArgumentException: Empty quoted field name at '``^'
        at org.apache.orc.impl.ParserUtils.parseName(ParserUtils.java:114)
        at org.apache.orc.OrcUtils.convertTypeFromProtobuf(OrcUtils.java:352)
        at org.apache.orc.impl.OrcTail.<init>(OrcTail.java:72)
        at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:845)
        at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:566)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:63)
        at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:55)
        at org.apache.spark.sql.hive.orc.OrcFileOperator$.$anonfun$getFileReader$3(OrcFileOperator.scala:76)

hive

hive write

set spark.sql.orc.impl=hive;
create table t_1 stored as orc  as select '' ;
java.lang.IllegalArgumentException: Error: name expected at the position 7 of 'struct<:string>' but ':' is found.
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:378)
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:502)
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfoFromTypeString(TypeInfoUtils.java:831)
        at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:242)
        at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:279)
        at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:110)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)

use HiveFileFormat

set spark.sql.hive.convertMetastoreOrc=false;
create table t_1 stored as orc  as select '' ;
Error in query:  Column name "" contains invalid character(s). Please use alias to rename it.

org.apache.spark.sql.hive.execution.HiveFileFormat#supportFieldName

case "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat" =>
  try {
    TypeInfoUtils.getTypeInfoFromTypeString(s"struct<$name:int>")
    true
  } catch {
    case _: IllegalArgumentException => false
  }
@cloud-fan
Contributor

@AngersZhuuuu is there a way to check field names only on the write side?
