[SPARK-40215][SQL] Add SQL configs to control CSV/JSON date and timestamp parsing behavior #37653
Conversation
@HyukjinKwon @MaxGekk Could you review this PR? Thank you.
LuciferYang left a comment:
+1, LGTM
MaxGekk left a comment:
I wonder what is the supposed lifetime of the SQL configs `spark.sql.*.enableDateTimeParsingFallback`? Should we place them in the `spark.sql.legacy` namespace, similar to `spark.sql.legacy.timeParserPolicy`?
I just wanted to keep the option name short (I have a custom build already with those configs), but I can move them under legacy, e.g.
```scala
def avroFilterPushDown: Boolean = getConf(AVRO_FILTER_PUSHDOWN_ENABLED)
```

```scala
def jsonEnableDateTimeParsingFallback: Option[Boolean] =
```
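For context, a minimal sketch of how such an optional per-format config could be declared and read back via `SQLConf`'s `buildConf` pattern; the constant name, doc string, and version below are illustrative assumptions, not necessarily what the PR merged:

```scala
// Hypothetical declaration; the constant name and doc text are assumptions.
val JSON_ENABLE_DATETIME_PARSING_FALLBACK =
  buildConf("spark.sql.legacy.json.enableDateTimeParsingFallback")
    .doc("Whether to fall back to the legacy (pre Spark 3.0) parser " +
      "when parsing dates and timestamps in JSON data sources.")
    .version("3.4.0") // assumed target version
    .booleanConf
    .createOptional // yields Option[Boolean], matching the accessor below

def jsonEnableDateTimeParsingFallback: Option[Boolean] =
  getConf(JSON_ENABLE_DATETIME_PARSING_FALLBACK)
```

Using `createOptional` keeps the config unset by default, so the absence of a value can fall through to the next level of the precedence chain rather than forcing a default.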
I decided not to add the "legacy" prefix in the method name, as it would make the method name very long 🙂.
@MaxGekk I addressed your comment. Would you be able to review again? Thanks.
+1, LGTM. Merging to master.
Thank you, @MaxGekk!
What changes were proposed in this pull request?
This is a follow-up for SPARK-39731 and PR #37147.
I found that it could be problematic to change `spark.sql.legacy.timeParserPolicy` to LEGACY when inferring dates and timestamps in CSV and JSON. Sometimes it is beneficial to keep the time parser policy as CORRECTED but still use a more lenient date and timestamp inference, for example when migrating to a newer Spark version. I added two separate configs that control this behavior:

- `spark.sql.legacy.csv.enableDateTimeParsingFallback`
- `spark.sql.legacy.json.enableDateTimeParsingFallback`

When the configs are set to `true`, the legacy (pre Spark 3.0) time parsing behaviour is enabled.

With this PR, the precedence order is as follows for CSV (similar for JSON):
1. the `enableDateTimeParsingFallback` data source option
2. `spark.sql.legacy.{csv,json}.enableDateTimeParsingFallback`
3. `spark.sql.legacy.timeParserPolicy` and whether or not a custom format is used
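As an illustration of the precedence above, a minimal sketch of how the new knobs might be used; the option and config names come from this PR, while the session setup and file path are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]") // hypothetical local session for the example
  .appName("csv-datetime-fallback")
  .getOrCreate()

// Enable the pre-Spark-3.0 fallback parser for CSV only; the global
// spark.sql.legacy.timeParserPolicy stays CORRECTED.
spark.conf.set("spark.sql.legacy.csv.enableDateTimeParsingFallback", "true")

// The per-read data source option has the highest precedence and can
// override the SQL config for a single read.
val df = spark.read
  .option("inferSchema", "true")
  .option("enableDateTimeParsingFallback", "false") // overrides the config above
  .csv("/path/to/data.csv") // hypothetical path
```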
Why are the changes needed?
The change makes it easier for users to migrate to a newer Spark version without changing the global config `spark.sql.legacy.timeParserPolicy`. It also allows enabling legacy parsing for CSV and JSON separately, without changing application code or the global time parser config.
Does this PR introduce any user-facing change?
No; it simply adds the ability to change the behaviour specifically for CSV or JSON.
How was this patch tested?
I added a unit test for CSV and JSON to verify the flag.
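For illustration, a sketch of the kind of check such a unit test could perform; the sample value and deliberately mismatched pattern are assumptions, and the actual tests in the PR may differ:

```scala
import spark.implicits._ // for .toDS() on a local Seq

spark.conf.set("spark.sql.legacy.csv.enableDateTimeParsingFallback", "true")

// The value deliberately does not match the supplied timestampFormat, so the
// strict CORRECTED parser alone would yield null; with the fallback enabled,
// the legacy parser should still recognize the ISO-like string.
val input = Seq("2020-01-23T17:34:59.999").toDS()
val parsed = spark.read
  .schema("ts TIMESTAMP")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
  .csv(input)

assert(parsed.collect().head.getTimestamp(0) != null)
```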