
Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR adds tests for writing empty data and reading it back for the Parquet, JSON and Text data sources.

The tests were not added to HadoopFsRelationTest because each test differs slightly due to differences among those data sources; a sketch of the common round trip appears after the list below:

  • JSON does not write a schema when the data is empty.
  • TEXT needs a dataSchema option.
  • Parquet writes the schema even when the data is empty.
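
As a rough sketch of the round trip these tests exercise (hypothetical, not the exact test code in this PR; it assumes a SparkSession named `spark` and writable scratch paths):

```scala
// Hypothetical sketch of the round trip, not the exact test in this PR.
// Assumes a SparkSession named `spark` and writable scratch paths.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("a", IntegerType, nullable = true),
  StructField("b", StringType, nullable = true)))

// An empty DataFrame with an explicit schema.
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

// Parquet writes the schema even with zero rows, so the round trip preserves it.
emptyDF.write.mode("overwrite").parquet("/tmp/empty-parquet")
val parquetBack = spark.read.parquet("/tmp/empty-parquet")
assert(parquetBack.schema === emptyDF.schema && parquetBack.count() === 0)

// JSON output carries no schema, so reading the empty result back needs one supplied.
emptyDF.write.mode("overwrite").json("/tmp/empty-json")
val jsonBack = spark.read.schema(schema).json("/tmp/empty-json")
assert(jsonBack.count() === 0)
```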

How was this patch tested?

Unit tests in ParquetHadoopFsRelationSuite, JsonHadoopFsRelationSuite and SimpleTextHadoopFsRelationSuite.

@SparkQA

SparkQA commented May 22, 2016

Test build #59102 has finished for PR 13253 at commit d450094.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented May 22, 2016

Test build #59103 has finished for PR 13253 at commit d450094.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

HyukjinKwon commented May 24, 2016

Hi @rxin and @marmbrus,
As you already know, a "critical" issue was found here, SPARK-15393, so SPARK-10216 was reverted. It seems writing empty data and reading it back has not been tested across data sources.
This PR includes a test that resembles the one provided in the JIRA ticket.
Could you please take a look?

@rxin
Contributor

rxin commented May 24, 2016

Did we ever end up fixing https://issues.apache.org/jira/browse/SPARK-10216 after it was reverted?

@HyukjinKwon
Member Author

HyukjinKwon commented May 24, 2016

@rxin No, it has not been fixed. So, I wanted to add some tests first that check writing and reading empty data, to make sure this works.

The way to fix SPARK-10216 might vary if these are data-source-specific issues. For example, ORC does not write files for empty data and also does not allow reading empty files (SPARK-8501); see the sketch below.

So, I thought I could focus on fixing SPARK-10216 within, for example, Parquet (not WriterContainer) if SPARK-15393 is a problem specific to the Parquet data source.
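
To illustrate the ORC behavior mentioned above, a minimal sketch (an assumption-laden illustration, not code from this PR; it assumes a SparkSession named `spark` and a scratch path):

```scala
// Hedged sketch of the ORC behavior described above (SPARK-8501), not code from this PR.
// Assumes a SparkSession named `spark` and a writable scratch path.
import scala.util.Try

val emptyDF = spark.range(0).toDF("id")  // zero-row DataFrame with a known schema

// As described, ORC writes no data files for an empty DataFrame...
emptyDF.write.mode("overwrite").orc("/tmp/empty-orc")

// ...so reading the path back cannot infer a schema and is expected to fail.
assert(Try(spark.read.orc("/tmp/empty-orc")).isFailure)
```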

@HyukjinKwon
Member Author

HyukjinKwon commented May 24, 2016

I don't mind closing this; I will close it if you think so. I can do this together later with SPARK-10216.

I would just appreciate being sure whether writing and reading empty files should be supported for all data sources or only some of them, for example, maybe only Parquet, ORC and CSV, because they can write the schema separately even if the data is empty.

@SparkQA

SparkQA commented May 24, 2016

Test build #59196 has finished for PR 13253 at commit c51fbe3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 24, 2016

Yea let's just do it together. Thanks.

@HyukjinKwon
Member Author

HyukjinKwon commented May 24, 2016

Closing this. But could I please ask whether it is generally preferred to support writing empty data and reading it back for all data sources (or maybe only for Parquet, ORC and CSV)?
I just want to be sure about that.

@rxin
Contributor

rxin commented May 24, 2016

Yes, we definitely want to be able to read/write empty DataFrames.

@HyukjinKwon
Member Author

@rxin Thank you!

