
Conversation

@sadikovi (Contributor) commented Aug 25, 2022:

What changes were proposed in this pull request?

This is a follow-up for SPARK-39731 and PR #37147.

I found that it can be problematic to have to change spark.sql.legacy.timeParserPolicy to LEGACY just to get lenient date and timestamp inference in CSV and JSON. Sometimes it is beneficial to keep the time parser policy as CORRECTED but still use more lenient date and timestamp inference, for example when migrating to a newer Spark version.

I added two separate configs that control this behavior:

  • spark.sql.legacy.csv.enableDateTimeParsingFallback
  • spark.sql.legacy.json.enableDateTimeParsingFallback

When a config is set to true, the legacy (pre-Spark 3.0) time parsing behavior is enabled for that source.
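
For illustration, a minimal usage sketch (hypothetical snippet, not from this PR; spark is an active SparkSession and path points to a CSV dataset):

  // Enable the legacy fallback for all CSV reads via the SQL config:
  spark.conf.set("spark.sql.legacy.csv.enableDateTimeParsingFallback", "true")

  // Or enable it per read via the data source option, which takes precedence:
  val df = spark.read
    .option("inferSchema", "true")
    .option("enableDateTimeParsingFallback", "true")
    .csv(path)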

With this PR, the precedence order for CSV (and similarly for JSON) is as follows, with a sketch of the resolution logic after the list:

  • data source option enableDateTimeParsingFallback
  • if that is not set, check spark.sql.legacy.{csv,json}.enableDateTimeParsingFallback
  • if that is not set, check spark.sql.legacy.timeParserPolicy and whether or not a custom format is used.
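
A simplified Scala sketch of this resolution (illustrative only, not the exact code in this PR; the option and config lookups are modeled as plain parameters, and the final condition, legacy policy or no custom format supplied, is one reading of the last step):

  // Hypothetical helper mirroring the precedence described above.
  def resolveParsingFallback(
      dsOption: Option[Boolean],      // data source option enableDateTimeParsingFallback
      legacyConf: Option[Boolean],    // spark.sql.legacy.{csv,json}.enableDateTimeParsingFallback
      legacyPolicy: Boolean,          // spark.sql.legacy.timeParserPolicy == LEGACY
      hasCustomFormat: Boolean        // user supplied a dateFormat/timestampFormat
  ): Boolean =
    dsOption                          // 1. the per-read option wins if present
      .orElse(legacyConf)             // 2. otherwise the source-specific SQL config
      .getOrElse(legacyPolicy || !hasCustomFormat) // 3. otherwise policy + format check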

Why are the changes needed?

The change makes it easier for users to migrate to a newer Spark version without changing the global config spark.sql.legacy.timeParserPolicy. It also allows enabling legacy parsing for CSV and JSON separately, without code changes and without touching the global time parser config.

Does this PR introduce any user-facing change?

No, it simply adds the ability to control this behavior specifically for CSV or JSON.

How was this patch tested?

I added unit tests for CSV and JSON to verify the flag.
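
A hypothetical shape of such a test (illustrative data and assertions, not the actual test added in this PR; assumes SQLTestUtils-style helpers such as withSQLConf and withTempDir):

  test("SPARK-40215: legacy date/time parsing fallback for CSV") {
    withSQLConf("spark.sql.legacy.csv.enableDateTimeParsingFallback" -> "true") {
      withTempDir { dir =>
        val file = new java.io.File(dir, "data.csv")
        // The fractional seconds do not match the strict pattern below;
        // with the fallback enabled the value should still be parsed.
        java.nio.file.Files.write(
          file.toPath, "0,2020-01-01 12:34:56.123".getBytes)
        val df = spark.read
          .schema("id INT, ts TIMESTAMP")
          .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
          .csv(file.getAbsolutePath)
        assert(df.where("ts IS NOT NULL").count() == 1L)
      }
    }
  }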

@github-actions github-actions bot added the SQL label Aug 25, 2022
@sadikovi sadikovi changed the title [SPARK-40215][SQL] Add SQL configs to control CSV/JSON date and timestamp parsing behaviour [SPARK-40215][SQL] Add SQL configs to control CSV/JSON date and timestamp parsing behavior Aug 25, 2022
@sadikovi (Contributor, Author) commented:

@HyukjinKwon @MaxGekk Could you review this PR? Thank you.

@LuciferYang (Contributor) left a comment:

+1, LGTM

@MaxGekk (Member) left a comment:

I wonder what the intended lifetime of the SQL configs spark.sql.*.enableDateTimeParsingFallback is. Should we place them in the spark.sql.legacy namespace, similar to spark.sql.legacy.timeParserPolicy?

@sadikovi (Contributor, Author) commented:

I just wanted to keep the option names short (I already have a custom build with those configs), but I can move them under the legacy namespace, e.g.

spark.sql.legacy.csv.enableDateTimeParsingFallback
spark.sql.legacy.json.enableDateTimeParsingFallback


def avroFilterPushDown: Boolean = getConf(AVRO_FILTER_PUSHDOWN_ENABLED)

def jsonEnableDateTimeParsingFallback: Option[Boolean] =
  getConf(LEGACY_JSON_ENABLE_DATE_TIME_PARSING_FALLBACK) // assumed constant name; this line was truncated in the excerpt
@sadikovi (Contributor, Author) commented Aug 26, 2022:

I decided not to add the "legacy" prefix to the method name, as it would make the method name very long 🙂.

@sadikovi (Contributor, Author) commented:

@MaxGekk I addressed your comment. Would you be able to review again? Thanks.

@MaxGekk (Member) commented Aug 26, 2022:

+1, LGTM. Merging to master.
Thank you, @sadikovi, and thanks to @LuciferYang and @HyukjinKwon for the review.

@MaxGekk MaxGekk closed this in 630aa0b Aug 26, 2022
@sadikovi (Contributor, Author) commented:

Thank you, @MaxGekk!
