[SPARK-38067][PYTHON] Preserve None values when saved to JSON.
#35296
Conversation
### What changes were proposed in this pull request?
This is for SPARK-37981 "Deletes columns with all Null as default". See also #26098, which user HyukjinKwon reviewed on 21 Oct 2019: "Hey, you should document this in DataFrameWriter, DataStreamWriter, readwriter.py"

### Why are the changes needed?
Users need to know why their column(s) with all NaN or Null values are gone.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.
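For context, a minimal reproduction of the behavior discussed here might look like the following sketch (assuming a running pandas-on-Spark session; the path and column names are illustrative):

```python
import numpy as np
import pyspark.pandas as ps

# A column whose values are all missing.
df = ps.DataFrame({"id": [1, 2], "val": [np.nan, np.nan]})

# With the pre-PR default, null fields are omitted when writing JSON lines,
# so "val" leaves no trace in the written files.
df.to_json("/tmp/all_null_example.json")
print(ps.read_json("/tmp/all_null_example.json/*").columns)  # only "id" survives
```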
|
cc @HyukjinKwon @ueshin FYI |
|
Thank you for your proposal @bjornjorgensen. First of all, some formalities:
Regarding the change ‒ as-is, this doesn't really describe the actual behavior:
Finally, you have a typo ‒ "The column well be deleted" -> "The column will be deleted." |
|
Maybe something like this?
|
|
Can one of the admins verify this patch? |
|
I think maybe we should have something for the "Does this PR introduce any user-facing change?" and "How was this patch tested?" sections in the PR description. For example, for "Does this PR introduce any user-facing change?": Yes, the document for `to_json` is changed. For "How was this patch tested?": The linter and doc build tests should pass. |
|
Added a new comment on the JIRA ticket. |
The default for `ignoreNullFields` has been set to `False` to prevent columns with only NaN or Null values from being dropped when saving to JSON.
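For users who prefer the previous, more compact output, the option can presumably still be passed explicitly per call (a sketch; the path is illustrative):

```python
# Explicitly drop null fields again, overriding the new default.
df.to_json("/tmp/compact.json", ignoreNullFields=True)
```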
Could you also add tests with different parameter values?
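A test along those lines might look roughly like the following sketch (not the test that was eventually added; it assumes `pyspark.pandas` is importable and `tmp_path` is a writable directory, e.g. pytest's fixture):

```python
import numpy as np
import pyspark.pandas as ps

def check_ignore_null_fields(tmp_path):
    df = ps.DataFrame({"id": [1, 2], "val": [np.nan, np.nan]})

    # New default: null fields are written, so the all-null column survives a round trip.
    default_path = str(tmp_path / "default.json")
    df.to_json(default_path)
    assert set(ps.read_json(default_path + "/*").columns) == {"id", "val"}

    # Explicit ignoreNullFields=True: null fields are omitted on write.
    compact_path = str(tmp_path / "compact.json")
    df.to_json(compact_path, ignoreNullFields=True)
    assert set(ps.read_json(compact_path + "/*").columns) == {"id"}
```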
LGTM otherwise
|
Let's fix up the PR title/description too to make it ready for review. Then I think it's good to go. It's better to have a test case, but I am personally fine since it just bypasses an option. We can add a test in a follow-up too. |
Co-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
|
Yes, about testing: the only IO test I can find in pandas is for CSV. I was thinking about making a test that uses the shape function to check a file before and after writing. But there is another problem: pandas prints a tuple of (rows, columns), e.g. pandas_df_json2.shape gives (6, 4), while pandas on Spark gives pandas_api.shape as (1, 4). |
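One possible way around the shape mismatch would be to compare column sets and row counts instead of `shape` directly; a sketch with illustrative names:

```python
written_path = "/tmp/roundtrip.json"
df.to_json(written_path)
df_back = ps.read_json(written_path + "/*")

# Compare structure without relying on shape, which is reported
# differently above for pandas and pandas-on-Spark.
assert set(df_back.columns) == set(df.columns)
assert len(df_back) == len(df)
```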
Otherwise, LGTM.
|
@ueshin @HyukjinKwon We should create a new ticket for this, one that actually matches the problem being solved, or at least rewrite and reopen the one this PR points to. I'll leave the decision to you. |
One shouldn't really depend on schema inference and reader behavior to test the writer for formats which, like JSON lines, provide no schema and/or are less expressive than Spark SQL types. In the general case, irrespective of options, the following

```python
spark: SparkSession
df: DataFrame
path: str

df.write.json(path)
assert spark.read.json(path).schema == df.schema
```

is not, and cannot be, guaranteed. |
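For instance (my own illustration, not from the thread, assuming a `SparkSession` named `spark`), a narrow integer type does not survive such a round trip even though no value is missing, because JSON schema inference widens numbers:

```python
from pyspark.sql import functions as F

df = spark.range(2).select(F.col("id").cast("tinyint").alias("small"))
df.write.json("/tmp/tinyint_roundtrip.json")

df.printSchema()                                              # small: tinyint
spark.read.json("/tmp/tinyint_roundtrip.json").printSchema()  # small: bigint (inferred)
```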
|
Yeah, technically we should create a new JIRA or edit the existing JIRA. |
Changes LGTM (tested locally), but can we please use a PR (and JIRA) title that really reflects what is going on and which behavior is changed? It really has nothing to do with "all None values".
Impact on

```python
ps.DataFrame.from_dict({"id": [1, 2], "val": [None, 1]}).to_json(path)
```

and

```python
ps.DataFrame.from_dict({"id": [1, 2], "val": [np.nan, np.nan]}).to_json(path)
```

is exactly the same and, in the general case, the reader still won't be able to infer the original schema if there are no not-null values.
|
@zero323 This PR won't have any impact on your first example, because there is at least one value that is not None. In your second example, the column will be dropped without this PR:

```python
df = ps.DataFrame.from_dict({"id": [1, 2], "val": [np.nan, np.nan]})
df
   id  val
0   1  NaN
1   2  NaN

df.to_json("testdf.json", num_files=1)
df2 = ps.read_json("testdf.json/*")
df2
   id   val
0   1  None
1   2  None

df2.to_json("testdf2.json", ignoreNullFields=True, num_files=1)
df3 = ps.read_json("testdf2.json/*")
df3
   id
0   1
1   2
```

I will change whatever we agree on, but for now I think this has something to do with None values. |
It seems like there is still some confusion regarding the actual impact of the changes that are proposed here. So let's start by establishing a simple fact ‒ Spark uses a row-oriented JSON lines format when writing JSON (that includes `pyspark.pandas` writers such as `to_json` with a path).

The decision whether to write a field is made on a row-by-row basis, and is not affected by the presence of missing values for the same field in any other row. Your PR changes the default behavior of the `pyspark.pandas` JSON writer for every missing value in every row, not just for all-null columns.

Finally, counting columns in the re-read dataset is simply misleading ‒ your PR doesn't and cannot affect reader behavior (it just affects what the reader "sees") and, as mentioned before, explicitly writing missing values cannot resolve the problem of properly restoring the original schema. Consider the following:

```python
>>> import numpy as np
>>> df_all_null = ps.DataFrame.from_dict({"id": [1, 2], "val": [np.nan, np.nan]})
```

If you write it with `ignoreNullFields=True`

```python
>>> path_true_all_null = tempfile.mktemp()
>>> df_all_null.to_json(path_true_all_null, ignoreNullFields=True)
>>> for x in spark.read.text(path_true_all_null).collect():
...     print(x.value)
{"id":2}
{"id":1}
```

the rows can still be read back, but there is no trace that the other column was ever there:

```python
>>> ps.read_json(path_true_all_null)
   id
0   2
1   1
```

Setting `ignoreNullFields=False`

```python
>>> path_false_all_null = tempfile.mktemp()
>>> df_all_null.to_json(path_false_all_null, ignoreNullFields=False)
>>> for x in spark.read.text(path_false_all_null).collect():
...     print(x.value)
{"id":1,"val":null}
{"id":2,"val":null}
```

writes JSON nulls explicitly, which the reader maps back to missing values:

```python
>>> df_all_null_read_with_false = ps.read_json(path_false_all_null)
>>> df_all_null_read_with_false
   id   val
0   1  None
1   2  None
```

So, does it restore the original schema? It doesn't. The original contained double values:

```python
>>> df_all_null.to_spark().printSchema()
root
 |-- id: long (nullable = false)
 |-- val: double (nullable = true)
```

and the restored one cannot make any assumptions about the types, so it defaults to strings:

```python
>>> df_all_null_read_with_false.to_spark().printSchema()
root
 |-- id: long (nullable = true)
 |-- val: string (nullable = true)
```

If the user requires a specific schema for the frame, the schema should be provided on read, which will give the same result independently of specific fields being present in the output or not:

```python
>>> ps.read_json(path_false_all_null, schema="id long, val double").to_spark().printSchema()
root
 |-- id: long (nullable = true)
 |-- val: double (nullable = true)
```

This is the standard approach in production pipelines and, on top of consistency, provides significant performance improvements on realistic-size data.

So, to re-iterate: what this PR really achieves is making the output of

```python
>>> path_pr_all_null = tempfile.mktemp()
>>> df_all_null.to_json(path_pr_all_null)
>>> for x in spark.read.text(path_pr_all_null).collect():
...     print(x.value)
{"id":2,"val":null}
{"id":1,"val":null}
```

roughly equivalent to the output of

```python
>>> df_all_null.to_json()
'[{"id":1,"val":null},{"id":2,"val":null}]'
```

and pandas equivalents.

I wouldn't bother with pointing all of that out, but I have enough experience with users making incorrect assumptions about Spark behavior and the nature of fixes, based on misleading JIRA tickets. |
|
@zero323 Thank you |
|
Merged into master. Thanks everyone! |
### What changes were proposed in this pull request?

This PR preserves columns whose values are all NaN, Null or None when saving to JSON files. It changes the default behavior of the `pyspark.pandas` JSON writer from "missing values are omitted" to "missing values are preserved". This impacts the output files whenever there is a missing value in any field in any row, and can have a significant impact (proportional to N * M, for N rows and M columns) on large datasets (performance, storage cost). The `ignoreNullFields` option can still be passed to omit columns whose values are all NaN, Null or None when saving to JSON files.

### Why are the changes needed?

Pandas on Spark deletes columns with all None values by default, while pandas writes all columns to JSON even if the values are all None. This change gives pandas users the behavior they are used to.

### Does this PR introduce any user-facing change?

The documentation for the `to_json` function is changed. The `ignoreNullFields` option now defaults to `False`, to prevent columns with only Null values from being omitted when saving to JSON files.

### How was this patch tested?

Tested manually:

```python
data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
test = ps.DataFrame.from_dict(data)
test.to_json("test.json")
test2 = ps.read_json("test.json/*")
test2
   col_1 col_2
0      3  None
1      2  None
2      1  None
3      0  None

test2.to_json("test2.json", ignoreNullFields=True)
test3 = ps.read_json("test2.json/*")
test3
   col_1
0      3
1      2
2      1
3      0
```