Conversation

@bjornjorgensen (Contributor) commented Jan 23, 2022

What changes were proposed in this pull request?

This PR preserves columns whose values are all NaN, Null or None, instead of omitting them, when saving to JSON files.
It changes the default behavior of the pyspark.pandas JSON writer from omitting missing values to preserving them. This affects the output file whenever any field in any row contains a missing value, and the impact can be significant (proportional to N * M, for N rows and M columns) for sparse datasets (performance, storage cost).
The ignoreNullFields option can still be set to True to omit missing values when saving to JSON files.

Why are the changes needed?

pandas-on-Spark omits columns whose values are all None by default, while pandas writes all columns to JSON even if every value is None.
After this change, the behavior matches what pandas users are used to.

Does this PR introduce any user-facing change?

The documentation for the to_json function is changed.
The ignoreNullFields option now defaults to False, so columns containing only Null values are no longer omitted when saving to JSON files.

How was this patch tested?

Tested manually.

import pyspark.pandas as ps

data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
test = ps.DataFrame.from_dict(data)

test.to_json("test.json")

test2 = ps.read_json("test.json/*")
test2

   col_1 col_2
0      3  None
1      2  None
2      1  None
3      0  None

test2.to_json("test2.json", ignoreNullFields=True)

test3 = ps.read_json("test2.json/*")
test3

   col_1
0      3
1      2
2      1
3      0

### What changes were proposed in this pull request?

This is for SPARK-37981: Deletes columns with all Null as default.
Also see #26098.
HyukjinKwon reviewed it on 21 Oct 2019:
"Hey, you should document this in DataFrameWrtier, DataStreamWrtier, readwriter.py"

### Why are the changes needed?
Users need to know why their column(s) with all NaN or Null values are gone.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No.
@zero323 (Member) commented Jan 23, 2022

cc @HyukjinKwon @ueshin FYI

@zero323 (Member) commented Jan 23, 2022

Thank you for your proposal @bjornjorgensen.

First of all, some formalities:

  • Could you enable GitHub Actions in your fork?
  • It would be better to have a JIRA ticket that specifically targets a documentation update, as the one you linked doesn't even use pyspark.pandas.

Regarding the change ‒ as-is, this doesn't really describe the actual behavior:

  • If path is not provided, to_json delegates the transformation to pandas, so fields with NULL are preserved.
  • If path is provided, the standard writer is used, so the usual writer rules apply (in particular, the ignoreNullFields keyword option or the spark.sql.jsonGenerator.ignoreNullFields config is respected).
  • In both cases, the behavior doesn't depend on the number of null values in a column, as the output is row-oriented (see the sketch below).
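
A minimal sketch of those two paths (the DataFrame contents and the output path below are illustrative assumptions):

import pyspark.pandas as ps

psdf = ps.DataFrame({"id": [1, 2], "val": [None, 1.0]})

# No path: to_json delegates to pandas and returns a JSON string,
# so null fields are preserved in the result.
json_string = psdf.to_json()

# Path given: the standard Spark JSON writer is used, so writer options such as
# ignoreNullFields (or the spark.sql.jsonGenerator.ignoreNullFields config) apply.
psdf.to_json("/tmp/example_json_dir", ignoreNullFields=False)  # hypothetical path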

Finally, you have a typo ‒ "The column well be deleted" -> "The column will be deleted."

@zero323 (Member) commented Jan 23, 2022

Maybe something like this?

If path is provided, the writer omits missing values by default. This behavior can be controlled using the ignoreNullFields option.

@bjornjorgensen bjornjorgensen marked this pull request as draft January 24, 2022 00:49
@AmplabJenkins

Can one of the admins verify this patch?

@itholic (Contributor) commented Jan 24, 2022

I think maybe we should have something for the "Does this PR introduce any user-facing change?" and "How was this patch tested?" sections in the PR description.

For example,

Does this PR introduce any user-facing change?

Yes, the document for to_json function is changed.

How was this patch tested?

The linter and doc build tests should pass.

@bjornjorgensen (Contributor, Author)

Added a new comment on the JIRA ticket.
GitHub Actions is now enabled.

The default for ignoreNullFields has been set to False to prevent columns with only NaN or Null values from being deleted when saving to JSON.

@ueshin (Member) left a comment

Could you also add tests with different parameter values?

@HyukjinKwon (Member) left a comment

LGTM otherwise

@HyukjinKwon (Member) commented Jan 25, 2022

Let's fix up the PR title/description too and make it ready for review. Then I think it's good to go. It would be better to have a test case, but I am personally fine since it just bypasses an option. We can add a test in a follow-up too.

@bjornjorgensen bjornjorgensen changed the title [SPARK-37981][PYTHON] Add note for deleting Null and NaN [SPARK-37981][PYTHON] Change the way pandas saves JSON files. Jan 25, 2022
@bjornjorgensen (Contributor, Author)

Yes, about testing: the only IO test I can find in pandas is for CSV. I was thinking about making a test that uses the shape function to check a file before and after writing. But there is another problem: pandas prints a tuple of (rows, columns), e.g. pandas_df_json2.shape gives (6, 4), while pandas-on-Spark prints (dataframe, columns): pandas_api.shape (1, 4).

@bjornjorgensen bjornjorgensen marked this pull request as ready for review January 25, 2022 12:22

@ueshin (Member) left a comment

Otherwise, LGTM.

@zero323 (Member) commented Jan 25, 2022

@ueshin @HyukjinKwon We should create a new ticket for this that actually matches the problem being solved, or at least rewrite and reopen the one this PR points to. I'll leave the decision to you.

@zero323 (Member) commented Jan 25, 2022

Yes, about testing the only test I find for IO in pandas is for testing csv. I was thinking about making a test where I use shape function to test a file before and after writing and write.

One shouldn't really depend on schema inference and reader behavior to test a writer for formats which, like JSON Lines, provide no schema and/or are less expressive than Spark SQL types. In the general case, irrespective of options, the following

spark: SparkSession
df: DataFrame
path: str

df.write.json(path)
assert spark.read.json(path).schema == df.schema

is not, and cannot be, guaranteed.
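
A concrete sketch of why (the session setup and path below are illustrative assumptions): a double column that holds only nulls cannot be recovered by schema inference, so the round-trip check above fails regardless of writer options.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", LongType()),
    StructField("val", DoubleType()),
])
df = spark.createDataFrame([(1, None), (2, None)], schema)

path = "/tmp/json_roundtrip_example"  # hypothetical path
df.write.mode("overwrite").json(path)

# With the default ignoreNullFields=true the "val" field is dropped from every row;
# even if nulls were written explicitly, inference could not recover DoubleType.
# Either way, the read schema differs from the original.
assert spark.read.json(path).schema != df.schema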

@HyukjinKwon (Member)

Yeah, technically we should create a new JIRA or edit the existing JIRA.

@bjornjorgensen bjornjorgensen marked this pull request as draft January 29, 2022 12:55
@bjornjorgensen bjornjorgensen changed the title [SPARK-37981][PYTHON] Change the way pandas saves JSON files. [SPARK-37981][PYTHON] Preserve columns with all None values when saved to JSON file. Jan 29, 2022
@bjornjorgensen bjornjorgensen changed the title [SPARK-37981][PYTHON] Preserve columns with all None values when saved to JSON file. [SPARK-38067][PYTHON] Preserve columns with all None values when saved to JSON file. Jan 29, 2022
@HyukjinKwon HyukjinKwon marked this pull request as ready for review January 30, 2022 00:30

@zero323 (Member) left a comment

Changes LGTM (tested locally), but can we please use a PR (and JIRA) title that really reflects what is going on and which behavior is changed? It really has nothing to do with "all None values".

Impact on

ps.DataFrame.from_dict({"id": [1, 2], "val": [None, 1]}).to_json(path)

and

ps.DataFrame.from_dict({"id": [1, 2], "val": [np.nan, np.nan]}).to_json(path)

is exactly the same and, in the general case, the reader still won't be able to infer the original schema if there are no non-null values.

@bjornjorgensen (Contributor, Author) commented Jan 30, 2022

@zero323 This PR won't have any impact on your first example because there is at least one value that is not None. In your second example it does make a difference; without this PR, the column would be gone:

import numpy as np
import pyspark.pandas as ps

df = ps.DataFrame.from_dict({"id": [1, 2], "val": [np.nan, np.nan]})
df
   id  val
0   1  NaN
1   2  NaN

df.to_json("testdf.json", num_files=1)

df2 = ps.read_json("testdf.json/*")
df2

   id   val
0   1  None
1   2  None

df2.to_json("testdf2.json", ignoreNullFields=True, num_files=1)


df3 = ps.read_json("testdf2.json/*")

df3

   id
0   1
1   2

I will change what we agree on, but for now I think this has something to do with None values.

@zero323 (Member) commented Jan 30, 2022

@zero323 This PR won't have any impact on your first example because there is at least one value that is not None. In your second example it will be without this PR.

It seems like there is still some confusion regarding actual impact of the changes that are proposed here.

So let's start by establishing a simple fact ‒ Spark uses the row-oriented JSON Lines format when JSON output (that includes to_json) is used. The ignoreNullFields option, set to true by default, controls the writer behavior when missing values are encountered:

  • If it is set to true, then the field is omitted from the output for the given row.

    >>> import tempfile
    >>> path_true_some_null = tempfile.mktemp()
    >>> df_some_null = ps.DataFrame.from_dict({"id": [1, 2], "val": [None, 1]})
    >>> df_some_null.to_json(path_true_some_null, ignoreNullFields=True)
    >>> for x in spark.read.text(path_true_some_null).collect():
    ...     print(x.value)
    {"id":2,"val":1.0}
    {"id":1}
  • If it is set to false, then the field is emitted with a JSON null (untyped) value.

    >>> path_false_some_null = tempfile.mktemp()
    >>> df_some_null.to_json(path_false_some_null, ignoreNullFields=False)
    >>> for x in spark.read.text(path_false_some_null).collect():
    ...     print(x.value)
    {"id":1,"val":null}
    {"id":2,"val":1.0}

The decision is made on a row-by-row basis and is not affected by the presence of missing values for the same field in any other row.

Your PR changes the default behavior of the pyspark.pandas JSON writer from the former (missing values are omitted) to the latter (missing values are preserved). This impacts the output as long as there is any missing value in any field in any row. Furthermore, it can have a significant impact (proportional to N * M, for N rows and M columns) in case of sparse datasets (IO cost, storage cost).

Finally, counting columns in the re-read dataset is simply misleading; your PR doesn't and cannot affect reader behavior (it just affects what the reader "sees") and, as mentioned before, explicitly writing missing values cannot resolve the problem of properly restoring the original schema. Consider the following:

>>> import numpy as np
>>> df_all_null = ps.DataFrame.from_dict({"id": [1, 2], "val": [np.nan, np.nan]})

If you write it with ignoreNullFields set to true, it omits the missing values, the same as in the case where some non-missing values are present.

>>> path_true_all_null = tempfile.mktemp()
>>> df_all_null.to_json(path_true_all_null, ignoreNullFields=True)
>>> for x in spark.read.text(path_true_all_null).collect():
...     print(x.value)
{"id":2}
{"id":1}

but there is no trace that the other column was ever there.

>>> ps.read_json(path_true_all_null)
   id
0   2

Setting ignoreNullFields to false

>>> path_false_all_null = tempfile.mktemp()
>>> df_all_null.to_json(path_false_all_null, ignoreNullFields=False)
>>> for x in spark.read.text(path_false_all_null).collect():
...     print(x.value)
{"id":1,"val":null}
{"id":2,"val":null}

writes JSON nulls explicitly (the same as in the case of df_some_null) and, by doing so, provides enough information for the reader to infer that the column should be present.

>>> df_all_null_read_with_false = ps.read_json(path_false_all_null)
>>> df_all_null_read_with_false
   id   val
0   1  None
1   2  None

So, does it restore the original schema?

It doesn't. The original contained double values:

>>> df_all_null.to_spark().printSchema()
root
 |-- id: long (nullable = false)
 |-- val: double (nullable = true)

and the restored one cannot make any assumptions about the types, so it defaults to strings:

>>> df_all_null_read_with_false.to_spark().printSchema()
root
 |-- id: long (nullable = true)
 |-- val: string (nullable = true)

If the user requires a specific schema for the frame, the schema should be provided on read, which will give the same result regardless of whether specific fields are present in the output or not:

>>> ps.read_json(path_false_all_null, schema="id long, val double").to_spark().printSchema()
root
 |-- id: long (nullable = true)
 |-- val: double (nullable = true)

This is the standard approach in production pipelines and, on top of consistency, provides significant performance improvements on realistically sized data.

So to re-iterate:

  • This PR affects the output as long as the dataset contains at least one missing value, and it can have a significant impact on performance and storage costs.
  • The observed impact on reader behavior is more incidental and doesn't guarantee the correct schema of the loaded data (not to mention recovering all column names in the case of nested structures).

What this PR really achieves is making the output of to_json(some_path)

>>> path_pr_all_null = tempfile.mktemp()
>>> df_all_null.to_json(path_pr_all_null)
>>> for x in spark.read.text(path_pr_all_null).collect():
...     print(x.value)
{"id":2,"val":null}
{"id":1,"val":null}

roughly equivalent to the output of to_json()

>>> df_all_null.to_json()
'[{"id":1,"val":null},{"id":2,"val":null}]'

and pandas equivalents.
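
For comparison, a minimal sketch of the plain pandas behavior being matched (the exact output string is illustrative):

import numpy as np
import pandas as pd

# pandas keeps missing values as JSON null in its output by default.
pd.DataFrame({"id": [1, 2], "val": [np.nan, np.nan]}).to_json(orient="records")
# '[{"id":1,"val":null},{"id":2,"val":null}]'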

I wouldn't bother pointing all of that out, but I have enough experience with users making incorrect assumptions about Spark behavior and the nature of fixes, based on misleading JIRA tickets.

@bjornjorgensen bjornjorgensen changed the title [SPARK-38067][PYTHON] Preserve columns with all None values when saved to JSON file. [SPARK-38067][PYTHON] Preserve columns with all None values when saved to JSON file. Jan 30, 2022
@bjornjorgensen bjornjorgensen changed the title [SPARK-38067][PYTHON] Preserve columns with all None values when saved to JSON file. [SPARK-38067][PYTHON] Preserve None values when saved to JSON. Jan 31, 2022
@bjornjorgensen (Contributor, Author)

@zero323 Thank you

@zero323 zero323 closed this in 66b9087 Feb 1, 2022
@zero323 (Member) commented Feb 1, 2022

Merged into master.

Thanks everyone!
