[SPARK-27085][SQL] Migrate CSV to File Data Source V2 #24005
This PR is marked as WIP because it contains commit 46e4603e6c6ac90aa49a127f6dece0c8a4fa4df0 to test the write path. The temporary commit will be reverted after all tests pass.
Test build #103136 has finished for PR 24005.
Test build #103148 has finished for PR 24005.
Test build #103149 has finished for PR 24005.
cc @HyukjinKwon @dongjoon-hyun Are you interested in reviewing this PR?
Sure, @gatorsmile. I'll try to take a closer look tonight.
Thanks for cc'ing me. I will take a look tomorrow too.
...main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVPartitionReaderFactory.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVDataSourceV2.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVScanBuilder.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVScan.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVTable.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcScan.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
Hi, @gatorsmile, @gengliangwang. I finished my first-round review. I'll do the second round once this is rebased after @cloud-fan's #24025 is merged.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
BTW, @gengliangwang, some CSV code paths, such as schema inference, depend on the Text datasource. So I have always thought the Text datasource should be fixed first, before CSV and JSON, if there is something to be fixed across them (for instance, ...). Since the CSV work is already done here, I am fine with it, but I was thinking the next job should be the Text datasource migration.
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
@dongjoon-hyun @HyukjinKwon Thanks for the review! @HyukjinKwon Actually, I have prepared JSON V2 and it is almost ready. Your suggestion makes sense, so I will migrate the Text data source first.
If some work is already done for JSON, I am okay with that too. I don't expect much difficulty even if we do the Text one later. I'll leave it to you.
Test build #103315 has finished for PR 24005.
Test build #103331 has finished for PR 24005.
Test build #103329 has finished for PR 24005.
Test build #103337 has finished for PR 24005.
Branch force-pushed from a408d15 to 8db6873.
Test build #103540 has finished for PR 24005.
Test build #103798 has finished for PR 24005.
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVDataSourceV2.scala
...main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVPartitionReaderFactory.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVScan.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVWriteBuilder.scala
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
...c/main/scala/org/apache/spark/sql/execution/datasources/v2/PartitionReaderFromIterator.scala
I finished the second round. I'll review again after the PR is updated.
Test build #103813 has finished for PR 24005.
Test build #103814 has finished for PR 24005.
+1, LGTM. Thank you, @gengliangwang , @gatorsmile , @HyukjinKwon .
Merged to master.
cc @cloud-fan .
@dongjoon-hyun @HyukjinKwon Thanks for the review!
        userSpecifiedSchema: Option[StructType])
      extends FileTable(sparkSession, options, paths, userSpecifiedSchema) {
      override def newScanBuilder(options: CaseInsensitiveStringMap): CSVScanBuilder =
        CSVScanBuilder(sparkSession, fileIndex, schema, dataSchema, options)
Hi @gengliangwang. Should we use this.options here instead of the passed-in options?
With a TableCatalog, the dsOptions can be set on the CSVTable.options of the table returned by TableCatalog.loadTable. If only the passed-in options are used here, the TableCatalog cannot pass dsOptions that contain CSV options through to CSVScan.
shall we combine them?
> shall we combine them?
@cloud-fan Yes, it would be better to combine them. Can I submit a PR to make the change here?
yes please!
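For readers following the thread, here is a minimal sketch of what "combining" the two option sets could look like. It uses plain Scala maps rather than Spark's CaseInsensitiveStringMap, so it is not the code that was later submitted. The names OptionsMergeSketch, normalize, and mergeOptions are hypothetical, and letting the scan-time options override the table-level options on a key conflict is an assumption; the thread does not settle which side should take precedence.

```scala
// Hypothetical stand-in for the options handling discussed above; this is not Spark code.
object OptionsMergeSketch {

  /** Lower-cases keys so lookups behave case-insensitively (loosely mimicking CaseInsensitiveStringMap). */
  def normalize(options: Map[String, String]): Map[String, String] =
    options.map { case (k, v) => k.toLowerCase(java.util.Locale.ROOT) -> v }

  /**
   * Combines table-level options (e.g. those set by a catalog) with the
   * options passed to newScanBuilder.
   * Assumption: on a key conflict the scan-time value wins.
   */
  def mergeOptions(
      tableOptions: Map[String, String],
      scanOptions: Map[String, String]): Map[String, String] =
    normalize(tableOptions) ++ normalize(scanOptions)

  def main(args: Array[String]): Unit = {
    val tableOptions = Map("header" -> "true", "sep" -> ";")
    val scanOptions = Map("SEP" -> ",", "inferSchema" -> "true")
    // Table-level "header" survives; scan-time "SEP" overrides table-level "sep".
    println(mergeOptions(tableOptions, scanOptions))
  }
}
```

With a merge like this, options that exist only on the table (such as those a catalog attaches in loadTable) would still reach the scan, which is the concern raised above.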
What changes were proposed in this pull request?
Migrate CSV to File Data Source V2.
How was this patch tested?
Unit test