Skip to content

Conversation

@gengliangwang
Copy link
Member

@gengliangwang gengliangwang commented Mar 7, 2019

What changes were proposed in this pull request?

Migrate CSV to File Data Source V2.

How was this patch tested?

Unit test

@gengliangwang
Copy link
Member Author

This PR is marked as WIP, since it contains 46e4603e6c6ac90aa49a127f6dece0c8a4fa4df0 to test the write path. The temporary commit will be reverted after all tests passed.

@SparkQA
Copy link

SparkQA commented Mar 7, 2019

Test build #103136 has finished for PR 24005 at commit 2e8f51f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 7, 2019

Test build #103148 has finished for PR 24005 at commit c063fa0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 7, 2019

Test build #103149 has finished for PR 24005 at commit 3436cee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

cc @HyukjinKwon @dongjoon-hyun Are you interested in the code review for this PR?

@dongjoon-hyun
Copy link
Member

Sure, @gatorsmile . I'll try to take a look more tonight.

@HyukjinKwon
Copy link
Member

Thanks for cc'ing me. To me, will take a look tomorrow.

@dongjoon-hyun
Copy link
Member

Hi, @gatorsmile , @gengliangwang . I finished my first-round review. I'll do the second round after this is rebased after merging @cloud-fan 's #24025 .

@HyukjinKwon
Copy link
Member

BTW, @gengliangwang, some CSV code path like schema inference is dependent on Text datasource. So, I always think Text datasource should better be fixed first before fixing CSV and JSON if there's something to be fixed across them (for instance, lineSep option as an example).

Since CSV work is already done here first, I am fine but I was thinking the next job should better be Text datasource migration.

@gengliangwang
Copy link
Member Author

@dongjoon-hyun @HyukjinKwon Thanks for the review!

@HyukjinKwon actually I have prepared JSON V2 and it is almost ready. Your suggestion makes sense. I will migrate the Text data source first.

@HyukjinKwon
Copy link
Member

If there's some work already done for JSON, I am okay too. I don't expect there'd be too much difficulties even if we do the Text one later. I'll leave it to you.

@SparkQA
Copy link

SparkQA commented Mar 11, 2019

Test build #103315 has finished for PR 24005 at commit 4dcd29f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CsvOutputWriter(

@SparkQA
Copy link

SparkQA commented Mar 11, 2019

Test build #103331 has finished for PR 24005 at commit 1b5354e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 11, 2019

Test build #103329 has finished for PR 24005 at commit 75abe93.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 11, 2019

Test build #103337 has finished for PR 24005 at commit a408d15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 15, 2019

Test build #103540 has finished for PR 24005 at commit 8db6873.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CsvOutputWriter(
  • class PartitionReaderFromIterator[InternalRow](
  • abstract class TextBasedFileScan(
  • class CSVDataSourceV2 extends FileDataSourceV2
  • case class CSVPartitionReaderFactory(
  • case class CSVScan(
  • case class CSVScanBuilder(
  • case class CSVTable(
  • class CSVWriteBuilder(options: CaseInsensitiveStringMap, paths: Seq[String])

@SparkQA
Copy link

SparkQA commented Mar 22, 2019

Test build #103798 has finished for PR 24005 at commit 2be5277.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

I finished the second round. I'll review later again after the PR is updated.

@SparkQA
Copy link

SparkQA commented Mar 22, 2019

Test build #103813 has finished for PR 24005 at commit 7eb54a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 22, 2019

Test build #103814 has finished for PR 24005 at commit 689f17e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CsvOutputWriter(

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @gengliangwang , @gatorsmile , @HyukjinKwon .

Merged to master.

cc @cloud-fan .

@gengliangwang
Copy link
Member Author

@dongjoon-hyun @HyukjinKwon Thanks for the review!

userSpecifiedSchema: Option[StructType])
extends FileTable(sparkSession, options, paths, userSpecifiedSchema) {
override def newScanBuilder(options: CaseInsensitiveStringMap): CSVScanBuilder =
CSVScanBuilder(sparkSession, fileIndex, schema, dataSchema, options)
Copy link
Contributor

@liujiayi771 liujiayi771 Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gengliangwang. Should we use this.options here instead of the passed-in options?

For the TableCatalog, the dsOptions can be set into the CSVTable.options returned by the TableCatalog.loadTable method. If the passed-in options are used here, the TableCatalog will not be able to pass dsOptions that contains CSV options to CSVScan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we combine them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we combine them?

@cloud-fan Yes, it would be better to combine them. Can I submit a PR to make changes here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes please!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants