Conversation

@bjornjorgensen (Contributor) commented Jan 23, 2022

What changes were proposed in this pull request?

This PR preserves columns whose values are all NaN, Null or None, instead of omitting them, when saving to JSON files.
It changes the default behavior of the pyspark.pandas JSON writer from omitting missing values to preserving them. This affects the output file whenever any field in any row contains a missing value, and the impact can be significant (proportional to N * M, for N rows and M columns) for sparse datasets (performance, storage cost).
The ignoreNullFields option can still be set to True to omit missing values when saving to JSON files.

Why are the changes needed?

pandas-on-Spark omits columns whose values are all None by default, while pandas writes all columns to JSON even if every value is None.
After this change, the behavior matches what pandas users are used to.

Does this PR introduce any user-facing change?

The documentation for the to_json function is changed.
The ignoreNullFields option now defaults to False, so columns containing only Null values are no longer omitted when saving to JSON files.

How was this patch tested?

Tested manually.

import pyspark.pandas as ps

data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
test = ps.DataFrame.from_dict(data)

test.to_json("test.json")

test2 = ps.read_json("test.json/*")
test2

   col_1 col_2
0      3  None
1      2  None
2      1  None
3      0  None

test2.to_json("test2.json", ignoreNullFields=True)

test3 = ps.read_json("test2.json/*")
test3

   col_1
0      3
1      2
2      1
3      0

### What changes were proposed in this pull request?

This is for SPARK-37981: Deletes columns with all Null as default.
Also see #26098.
HyukjinKwon reviewed it on 21 Oct 2019:
"Hey, you should document this in DataFrameWrtier, DataStreamWrtier, readwriter.py"

### Why are the changes needed?
Users need to know why their column(s) with all NaN or Null values are gone.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.
@zero323 (Member) commented Jan 23, 2022

cc @HyukjinKwon @ueshin FYI

@zero323 (Member) commented Jan 23, 2022

Thank you for your proposal @bjornjorgensen.

First of all, some formalities:

  • Could you enable GitHub Actions in your fork?
  • It would be better to have a JIRA ticket that specifically targets a documentation update, as the one you linked doesn't even use pyspark.pandas.

Regarding the change ‒ as-is, this doesn't really describe the actual behavior:

  • If path is not provided, to_json delegates the transformation to pandas, so fields with NULL are preserved.
  • If path is provided, the standard writer is used, so the usual writer rules apply (in particular, the ignoreNullFields keyword option or the spark.sql.jsonGenerator.ignoreNullFields config is respected).
  • In both cases, the behavior doesn't depend on the number of null values in a column, as the output is row-oriented (see the sketch below).
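
A minimal sketch of those two paths (the DataFrame contents and the output path below are illustrative assumptions):

import pyspark.pandas as ps

psdf = ps.DataFrame({"id": [1, 2], "val": [None, 1.0]})

# No path: to_json delegates to pandas and returns a JSON string,
# so null fields are preserved in the result.
json_string = psdf.to_json()

# Path given: the standard Spark JSON writer is used, so writer options such as
# ignoreNullFields (or the spark.sql.jsonGenerator.ignoreNullFields config) apply.
psdf.to_json("/tmp/example_json_dir", ignoreNullFields=False)  # hypothetical path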

Finally, you have a typo ‒ "The column well be deleted" -> "The column will be deleted."

@zero323 (Member) commented Jan 23, 2022

Maybe something like this?

If path is provided, the writer omits missing values by default. This behavior can be controlled using the ignoreNullFields option.

@bjornjorgensen bjornjorgensen marked this pull request as draft January 24, 2022 00:49
@AmplabJenkins

Can one of the admins verify this patch?

@itholic (Contributor) commented Jan 24, 2022

I think maybe we should have something for the "Does this PR introduce any user-facing change?" and "How was this patch tested?" sections in the PR description.

For example,

Does this PR introduce any user-facing change?

Yes, the document for to_json function is changed.

How was this patch tested?

The linter and doc build tests should pass.

@bjornjorgensen (Contributor, Author)

Added a new comment on the JIRA ticket.
GitHub Actions is now enabled.

The default for ignoreNullFields has been set to False to prevent columns with only NaN or Null values from being deleted when saving to JSON.

@ueshin (Member) left a comment

Could you also add tests with different parameter values?

@HyukjinKwon (Member) left a comment

LGTM otherwise

@HyukjinKwon (Member) commented Jan 25, 2022

Let's fix up the PR title/description too and make it ready for review. Then I think it's good to go. It would be better to have a test case, but I am personally fine since it just bypasses an option. We can add a test in a follow-up too.

@bjornjorgensen bjornjorgensen changed the title [SPARK-37981][PYTHON] Add note for deleting Null and NaN [SPARK-37981][PYTHON] Change the way pandas saves JSON files. Jan 25, 2022
@bjornjorgensen (Contributor, Author)

Yes, about testing: the only IO test I can find in pandas is for CSV. I was thinking about making a test that uses the shape function to check a file before and after writing. But there is another problem: pandas prints a tuple of (rows, columns), e.g. pandas_df_json2.shape gives (6, 4), while pandas-on-Spark prints (dataframe, columns): pandas_api.shape (1, 4).

@bjornjorgensen bjornjorgensen marked this pull request as ready for review January 25, 2022 12:22

@ueshin (Member) left a comment

Otherwise, LGTM.

@zero323 (Member) commented Jan 25, 2022

@ueshin @HyukjinKwon We should create a new ticket for this that actually matches the problem being solved, or at least rewrite and reopen the one this PR points to. I'll leave the decision to you.

@zero323 (Member) commented Jan 25, 2022

Yes, about testing the only test I find for IO in pandas is for testing csv. I was thinking about making a test where I use shape function to test a file before and after writing and write.

One shouldn't really depend on schema inference and reader behavior to test a writer for formats which, like JSON Lines, provide no schema and/or are less expressive than Spark SQL types. In the general case, irrespective of options, the following

spark: SparkSession
df: DataFrame
path: str

df.write.json(path)
assert spark.read.json(path).schema == df.schema

is not, and cannot be, guaranteed.
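
A concrete sketch of why (the session setup and path below are illustrative assumptions): a double column that holds only nulls cannot be recovered by schema inference, so the round-trip check above fails regardless of writer options.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", LongType()),
    StructField("val", DoubleType()),
])
df = spark.createDataFrame([(1, None), (2, None)], schema)

path = "/tmp/json_roundtrip_example"  # hypothetical path
df.write.mode("overwrite").json(path)

# With the default ignoreNullFields=true the "val" field is dropped from every row;
# even if nulls were written explicitly, inference could not recover DoubleType.
# Either way, the read schema differs from the original.
assert spark.read.json(path).schema != df.schema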

@HyukjinKwon (Member)

Yeah, technically we should create a new JIRA or edit the existing JIRA.

@bjornjorgensen bjornjorgensen marked this pull request as draft January 29, 2022 12:55
@bjornjorgensen bjornjorgensen changed the title [SPARK-37981][PYTHON] Change the way pandas saves JSON files. [SPARK-37981][PYTHON] Preserve columns with all None values when saved to JSON file. Jan 29, 2022
@bjornjorgensen bjornjorgensen changed the title [SPARK-37981][PYTHON] Preserve columns with all None values when saved to JSON file. [SPARK-38067][PYTHON] Preserve columns with all None values when saved to JSON file. Jan 29, 2022
@HyukjinKwon HyukjinKwon marked this pull request as ready for review January 30, 2022 00:30

@zero323 (Member) left a comment

Changes LGTM (tested locally), but can we please use a PR (and JIRA) title that really reflects what is going on and which behavior is changed? It really has nothing to do with "all None values".

Impact on

ps.DataFrame.from_dict({"id": [1, 2], "val": [None, 1]}).to_json(path)

and

ps.DataFrame.from_dict({"id": [1, 2], "val": [np.nan, np.nan]}).to_json(path)

is exactly the same and, in the general case, the reader still won't be able to infer the original schema if there are no non-null values.

@bjornjorgensen (Contributor, Author) commented Jan 30, 2022

@zero323 This PR won't have any impact on your first example because there is at least one value that is not None. In your second example it does make a difference; without this PR, the column would be gone:

import numpy as np
import pyspark.pandas as ps

df = ps.DataFrame.from_dict({"id": [1, 2], "val": [np.nan, np.nan]})
df
   id  val
0   1  NaN
1   2  NaN

df.to_json("testdf.json", num_files=1)

df2 = ps.read_json("testdf.json/*")
df2

   id   val
0   1  None
1   2  None

df2.to_json("testdf2.json", ignoreNullFields=True, num_files=1)


df3 = ps.read_json("testdf2.json/*")

df3

   id
0   1
1   2

I will change what we agree on, but for now I think this has something to do with None values.

@zero323 (Member) commented Jan 30, 2022

@zero323 This PR won't have any impact on your first example because there is at least one value that is not None. In your second example it will be without this PR.

It seems like there is still some confusion regarding actual impact of the changes that are proposed here.

So let's start by establishing a simple fact ‒ Spark uses the row-oriented JSON Lines format when JSON output (that includes to_json) is used. The ignoreNullFields option, set to true by default, controls the writer behavior when missing values are encountered:

  • If it is set to true, then the field is omitted from the output for the given row.

    >>> import tempfile
    >>> path_true_some_null = tempfile.mktemp()
    >>> df_some_null = ps.DataFrame.from_dict({"id": [1, 2], "val": [None, 1]})
    >>> df_some_null.to_json(path_true_some_null, ignoreNullFields=True)
    >>> for x in spark.read.text(path_true_some_null).collect():
    ...     print(x.value)
    {"id":2,"val":1.0}
    {"id":1}
  • If it is set to false, then the field is emitted with a JSON null (untyped) value.

    >>> path_false_some_null = tempfile.mktemp()
    >>> df_some_null.to_json(path_false_some_null, ignoreNullFields=False)
    >>> for x in spark.read.text(path_false_some_null).collect():
    ...     print(x.value)
    {"id":1,"val":null}
    {"id":2,"val":1.0}

The decision is made on a row-by-row basis and is not affected by the presence of missing values for the same field in any other row.

Your PR changes the default behavior of the pyspark.pandas JSON writer from the former (missing values are omitted) to the latter (missing values are preserved). This impacts the output as long as there is any missing value in any field in any row. Furthermore, it can have a significant impact (proportional to N * M, for N rows and M columns) in case of sparse datasets (IO cost, storage cost).

Finally, counting columns in the re-read dataset is simply misleading; your PR doesn't and cannot affect reader behavior (it just affects what the reader "sees") and, as mentioned before, explicitly writing missing values cannot resolve the problem of properly restoring the original schema. Consider the following:

>>> import numpy as np
>>> df_all_null = ps.DataFrame.from_dict({"id": [1, 2], "val": [np.nan, np.nan]})

If you write it with ignoreNullFields set to true, it omits the missing values, the same as in the case where some non-missing values are present.

>>> path_true_all_null = tempfile.mktemp()
>>> df_all_null.to_json(path_true_all_null, ignoreNullFields=True)
>>> for x in spark.read.text(path_true_all_null).collect():
...     print(x.value)
{"id":2}
{"id":1}

but there is no trace that the other column was ever there.

>>> ps.read_json(path_true_all_null)
   id
0   2

Setting ignoreNullFields to false

>>> path_false_all_null = tempfile.mktemp()
>>> df_all_null.to_json(path_false_all_null, ignoreNullFields=False)
>>> for x in spark.read.text(path_false_all_null).collect():
...     print(x.value)
{"id":1,"val":null}
{"id":2,"val":null}

writes JSON nulls explicitly (the same as in the case of df_some_null) and, by doing so, provides enough information for the reader to infer that the column should be present.

>>> df_all_null_read_with_false = ps.read_json(path_false_all_null)
>>> df_all_null_read_with_false
   id   val
0   1  None
1   2  None

So, does it restore the original schema?

It doesn't. The original contained double values:

>>> df_all_null.to_spark().printSchema()
root
 |-- id: long (nullable = false)
 |-- val: double (nullable = true)

and the restored one cannot make any assumptions about the types, so it defaults to strings:

>>> df_all_null_read_with_false.to_spark().printSchema()
root
 |-- id: long (nullable = true)
 |-- val: string (nullable = true)

If the user requires a specific schema for the frame, the schema should be provided on read, which will give the same result regardless of whether specific fields are present in the output or not:

>>> ps.read_json(path_false_all_null, schema="id long, val double").to_spark().printSchema()
root
 |-- id: long (nullable = true)
 |-- val: double (nullable = true)

This is the standard approach in production pipelines and, on top of consistency, provides significant performance improvements on realistically sized data.

So to re-iterate:

  • This PR affects the output as long as the dataset contains at least one missing value, and it can have a significant impact on performance and storage costs.
  • The observed impact on reader behavior is more incidental and doesn't guarantee the correct schema of the loaded data (not to mention recovering all column names in the case of nested structures).

What this PR really achieves is making the output of to_json(some_path)

>>> path_pr_all_null = tempfile.mktemp()
>>> df_all_null.to_json(path_pr_all_null)
>>> for x in spark.read.text(path_pr_all_null).collect():
...     print(x.value)
{"id":2,"val":null}
{"id":1,"val":null}

roughly equivalent to the output of to_json()

>>> df_all_null.to_json()
'[{"id":1,"val":null},{"id":2,"val":null}]'

and pandas equivalents.
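
For comparison, a minimal sketch of the plain pandas behavior being matched (the exact output string is illustrative):

import numpy as np
import pandas as pd

# pandas keeps missing values as JSON null in its output by default.
pd.DataFrame({"id": [1, 2], "val": [np.nan, np.nan]}).to_json(orient="records")
# '[{"id":1,"val":null},{"id":2,"val":null}]'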

I wouldn't bother pointing all of that out, but I have enough experience with users making incorrect assumptions about Spark behavior and the nature of fixes, based on misleading JIRA tickets.

@bjornjorgensen bjornjorgensen changed the title [SPARK-38067][PYTHON] Preserve columns with all None values when saved to JSON file. [SPARK-38067][PYTHON] Preserve columns with all None values when saved to JSON file. Jan 30, 2022
@bjornjorgensen bjornjorgensen changed the title [SPARK-38067][PYTHON] Preserve columns with all None values when saved to JSON file. [SPARK-38067][PYTHON] Preserve None values when saved to JSON. Jan 31, 2022
@bjornjorgensen (Contributor, Author)

@zero323 Thank you

@zero323 zero323 closed this in 66b9087 Feb 1, 2022
@zero323 (Member) commented Feb 1, 2022

Merged into master.

Thanks everyone!
