-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-24068] Propagating DataFrameReader's options to Text datasource on schema inferring #21182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #89929 has finished for PR 21182 at commit
|
|
jenkins, retest this, please |
|
Test build #89942 has finished for PR 21182 at commit
|
|
Test build #89953 has finished for PR 21182 at commit
|
|
Test build #90010 has finished for PR 21182 at commit
|
|
jenkins, retest this, please |
|
Test build #90047 has finished for PR 21182 at commit
|
| factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, allowUnquotedControlChars) | ||
| } | ||
|
|
||
| val textOptions = ListMap(parameters.toList: _*) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can use parameters instead of constructing a new map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I constructed new map is I got an exception like the textOptions value is not serializable on one test suite. ListMap extends Serializable and allows to eliminate such exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, CaseInsensitiveMap.. with Serializable so I think it was ok. I didn't try it and run tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I faced to the problem on FileStreamSourceSuite.read new files in partitioned table with globbing, should not read partition data. Let me check it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test fails on serialization of the textOptions value because the former one refers to transient parameters. I add the @transient annotation to textOptions (it should be safe because textOptions is using on the driver only and shouldn't be serialized).
|
Test build #90247 has finished for PR 21182 at commit
|
| factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, allowUnquotedControlChars) | ||
| } | ||
|
|
||
| @transient val textOptions = parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to access parameters like this, why not just remove private on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My motivation for creating of the separate variable for text parameters was having one place where the text parameters are forming. So, in the future if we need additional options to pass to the Text datasource, don't need to modify all places where they are forming. @viirya If you believe, it over-complicates the implementation, I will remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to me it's less possibly we have new options which are not passed by parameters. If you think it can be, I'm fine with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we make this less complicated and do what @MaxGekk's said in place next time?
| settings | ||
| } | ||
|
|
||
| val textOptions = ListMap(parameters.toList: _*) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow JSONOptions to make it @transient and use parameters instead of creating a new map?
| factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, allowUnquotedControlChars) | ||
| } | ||
|
|
||
| @transient val textOptions = parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to me it's less possibly we have new options which are not passed by parameters. If you think it can be, I'm fine with this.
|
Seems fine otherwise. |
| import java.nio.charset.StandardCharsets | ||
| import java.util.{Locale, TimeZone} | ||
|
|
||
| import scala.collection.immutable.ListMap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: seems not used.
|
Test build #90270 has finished for PR 21182 at commit
|
|
Test build #90269 has finished for PR 21182 at commit
|
|
jenkins, retest this, please |
|
Test build #90276 has finished for PR 21182 at commit
|
|
Test build #90284 has finished for PR 21182 at commit
|
|
jenkins, retest this, please |
|
Test build #90286 has finished for PR 21182 at commit
|
|
@HyukjinKwon @viirya @gengliangwang May I ask you to look at the PR again. |
gengliangwang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Merged to master. |
|
shall we backport it to 2.3? |
|
I tried but it had conflict when I merged this. @MaxGekk mind opening a backport pr? |
|
@HyukjinKwon Sure, I will prepare a PR |
|
Here is the backport to 2.3: #21292 |
What changes were proposed in this pull request?
While reading CSV or JSON files, DataFrameReader's options are converted to Hadoop's parameters, for example there:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L302
but the options are not propagated to Text datasource on schema inferring, for instance:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L184-L188
The PR proposes propagation of user's options to Text datasource on scheme inferring in similar way as user's options are converted to Hadoop parameters if schema is specified.
How was this patch tested?
The changes were tested manually by using https://github.com/twitter/hadoop-lzo:
Create 2 test files in JSON and CSV format and compress them:
Run
spark-shellwith hadoop-lzo:reading compressed CSV and JSON without schema: