This pull request adds support to spark-csv for writing null values to file and reading them back as null. Two changes enable this.
First, the `com.databricks.spark.csv` package previously hardcoded the null string to "`null`" when saving to a csv file. It now reads the null token from the value of "`nullToken`" in the passed-in parameters map, so null values can be written as empty strings by using this option. The default remains "`null`" to preserve the previous behavior of the library.
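A minimal sketch of the write side, assuming a Spark 1.x-era setup where `saveAsCsvFile` (from the spark-csv package object) accepts a parameters map; `df` and the output path are placeholders:

```scala
import com.databricks.spark.csv._  // brings saveAsCsvFile into scope on DataFrame

// df is an existing DataFrame that may contain null fields
df.saveAsCsvFile("/tmp/out.csv", Map(
  "header"    -> "true",
  "nullToken" -> ""  // write nulls as empty strings instead of the default "null"
))
```

Omitting `nullToken` keeps the old behavior of emitting the literal string `null` for null fields.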
Second, the `castTo` method in `com.databricks.spark.csv.util.TypeCast` had an unreachable case statement when the `castType` was an instance of `StringType`, so string values could never be read from file as null. This pull request adds a setting, `treatEmptyValuesAsNulls`, that allows empty string values in fields marked as nullable to be read as null, as expected. Again, the previous behavior is the default, so this pull request only changes behavior when `treatEmptyValuesAsNulls` is explicitly set to true. The corresponding changes to `CsvParser` and `CsvRelation` were made to include the new setting.
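The read side might then look like the following sketch; the builder method name `withTreatEmptyValuesAsNulls` is an assumption based on `CsvParser`'s existing `withX` naming convention, and `sqlContext` and the path are placeholders:

```scala
import com.databricks.spark.csv.CsvParser

val df = new CsvParser()
  .withUseHeader(true)
  .withTreatEmptyValuesAsNulls(true)  // empty strings in nullable fields are read back as null
  .csvFile(sqlContext, "/tmp/out.csv")
```

With the flag left at its default of false, empty fields continue to be read as empty strings, matching the library's prior behavior.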
Additionally, a unit test has been added to `CsvSuite` to verify that null values (both string and non-string) can be round-tripped: written out as nulls and read back in as nulls.
Author: Andres Perez <[email protected]>
Closes #147 from andy327/feat-set-null-tokens.