[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary #20499
Conversation
python/pyspark/sql/tests.py
Seems we should disallow the case. Please see #16793 (comment)
@HyukjinKwon We need a separate JIRA and target it to 2.3
cc @rxin, @gatorsmile, @holdenk, @zero323 and @viirya, this is an alternative to reverting its alias matching, and a fix to address #16793 (comment). Could you guys take a look and see if it makes sense?
Sure.
Thanks! Also cc @ueshin @cloud-fan
The linked JIRA targets 2.3.0 and it was an alternative to reverting #20496 (comment) .. Let me rebase it here anyway ..
Force-pushed 13bdc24 to 198bda4.
Test build #87037 has finished for PR 20499 at commit
Test build #87038 has finished for PR 20499 at commit
I think this is what was originally proposed in the JIRA:
```python
df = sc.parallelize([("Alice", 1, 3.0)]).toDF()
df.replace({"Alice": "Bob"}, 1)
```
python/pyspark/sql/dataframe.py
There are some old checks below, like:
```python
if not isinstance(value, valid_types) and value is not None \
        and not isinstance(to_replace, dict):
    raise ValueError("If to_replace is not a dict, value should be "
                     "a bool, float, int, long, string, list, tuple or None. "
                     "Got {0}".format(type(value)))
```
Should we clean it up too?
Btw, can't we just remove `value is not None` in the check above to disallow None when to_replace is not a dict?
Hm, I think that check is still valid. The newly added logic here focuses on checking missing arguments, whereas the logic below focuses on checking whether arguments are valid types.
Will try to add an explicit test for the #20499 (comment) case with a few comment changes.
For #20499 (comment), I just checked. Seems we should keep that None to support:
```
>>> df.na.replace('Alice', None).show()
+----+------+----+
| age|height|name|
+----+------+----+
| 10| 80|null|
...
```
If we remove that condition above, seems we will hit:
```
...
ValueError: If to_replace is not a dict, value should be a bool, float, int, long, string, list, tuple or None. Got <type 'NoneType'>
```
Test build #87042 has finished for PR 20499 at commit
Test build #87051 has finished for PR 20499 at commit
python/pyspark/sql/dataframe.py
Shall we just describe this in value's param doc?
Sure.
python/pyspark/sql/tests.py
Are the above two test changes necessary?
I don't think it's necessary, but let me keep them since at least they test different combinations of valid cases.
python/pyspark/sql/dataframe.py
```python
df.na.replace({'Alice': 'Bob'}, foo='bar').show()
```
Seems this case can't be detected?
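For illustration only, a minimal sketch of how such unexpected keyword arguments could be rejected if the method takes `**kwargs`; the standalone function and the `allowed` set below are assumptions for the example, not Spark's actual code:
```python
def replace(to_replace, *args, **kwargs):
    # Only 'value' and 'subset' are legitimate keyword arguments here
    # (an assumption for this sketch); anything else, such as the
    # 'foo' from the comment above, is rejected explicitly.
    allowed = {"value", "subset"}
    unexpected = set(kwargs) - allowed
    if unexpected:
        raise TypeError("replace() got unexpected keyword arguments: %s"
                        % ", ".join(sorted(unexpected)))
    # ... replacement logic would follow ...
```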
Test build #87068 has finished for PR 20499 at commit
I think that behavior shipped in 2.2, right? Then we may need to add a note in the migration guide.
Yup, sounds good.
python/pyspark/sql/tests.py
Will this conflict with Python's function argument conventions?
Usually, I think the arguments before a keyword argument are resolved by position. But now 'age' is resolved to subset, which is the third argument, behind value.
Since the function signature is changed, this may not be a big issue.
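As a minimal illustration of that binding rule (a toy stand-in, not Spark's API):
```python
# Arguments before keyword arguments are resolved by position, so a
# bare third argument binds to 'subset' under the new signature.
def replace(to_replace, value, subset=None):
    return to_replace, value, subset

print(replace(10, 20, 'age'))  # -> (10, 20, 'age'); 'age' binds to subset
```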
python/pyspark/sql/dataframe.py
I read this a few times and still feel that it is kind of verbose. But it seems there is no better way to check whether an optional parameter is set in Python.
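For context, a hypothetical sketch of the kind of explicit "was this argument passed?" check being described; the names and structure are illustrative, not the code under review:
```python
def replace(to_replace, *args, **kwargs):
    # None is a legitimate replacement value, so it cannot double as a
    # "not passed" marker; inspect args/kwargs directly instead.
    if args:
        value, value_given = args[0], True
    elif "value" in kwargs:
        value, value_given = kwargs["value"], True
    else:
        value, value_given = None, False
    if not value_given and not isinstance(to_replace, dict):
        raise TypeError("value is required when to_replace is not a dictionary.")
    subset = args[1] if len(args) > 1 else kwargs.get("subset")
    return to_replace, value, subset
```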
Seems RC3 is about to be cut; do we want to get this in 2.3?
Is it a bug fix or a new feature?
I think it's a bug fix. For the context,
Looks like an existing issue since Spark 2.2; I don't think this should block 2.3.
python/pyspark/sql/dataframe.py
what's the expectation? if to_replace is a dict, value should be ignored?
Yes, even if value is explicitly given, I thought we would ignore it, as we have done from the start.
I see the problem now. If to_replace is a dict, then value should be ignored and we should provide a default value. If to_replace is not a dict, then value is required and we should not provide a default value.
Can we use an invalid value as the default for value? Then we can throw an exception if the value is not set by the user.
Yea, I think that summarises the issue.

> Can we use an invalid value as the default value for value? Then we can throw exception if the value is not set by user.

Yea, we could define a class / instance to indicate no value, like NumPy does - https://github.com/numpy/numpy/blob/master/numpy/_globals.py#L76. I was considering that approach too, but it is kind of new to Spark and this is a single case so far.
To get to the point, yea, we could maybe use an invalid value and unset/ignore it if to_replace is a dictionary. For example, I could assign {}. But then the problem is the docstring generated by pydoc and the API documentation. It would show something like:
```
Help on method replace in module pyspark.sql.dataframe:

replace(self, to_replace, value={}, subset=None) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` replacing a value with another value.
    ...
```
This is pretty confusing. To my knowledge, we can't really override this signature in the doc - I tried a few times before and failed, if I remember correctly.
Maybe this is good enough, but I didn't want to start with that approach because, strictly speaking, the issue @rxin raised sounds like it comes from having a default value at all.
To be honest, it seems Pandas's replace also has None as the default value -
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html#pandas.DataFrame.replace.
So, to cut it short, yea, if a less pretty doc is fine, I can try. That would reduce the change a lot.
What's the docstring for `def replace(self, to_replace, *args, **kwargs)`?
It's just as is:
```python
def replace(self, to_replace, *args, **kwargs)
```
but this is better than `replace(self, to_replace, value={}, subset=None)`, IMHO.
I prefer `def replace(self, to_replace, value=_NoValue, subset=None)`.
`def replace(self, to_replace, *args, **kwargs)` loses the information about value and subset.
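For illustration, a minimal sketch of the `_NoValue` sentinel pattern being proposed, modeled on the NumPy `_globals.py` link earlier in this thread; the standalone `replace` below is a simplified stand-in, not Spark's actual implementation:
```python
class _NoValueType(object):
    """Singleton marking 'no value was passed' (an assumption for this
    sketch, modeled on NumPy's _NoValue)."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super(_NoValueType, cls).__new__(cls)
        return cls._instance

    def __repr__(self):
        return "<no value>"

_NoValue = _NoValueType()

def replace(to_replace, value=_NoValue, subset=None):
    if isinstance(to_replace, dict):
        # The dict already maps replacements, so value is ignored.
        value = None
    elif value is _NoValue:
        raise TypeError("value is required when to_replace is not a dictionary.")
    return to_replace, value, subset
```
This keeps a readable signature in pydoc while still distinguishing "not passed" from an explicit None.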
Yea, either way works for me. Let me look around this a bit more and give it a shot to show how it looks.
Sure, let me unset the target version.
I'd fix this in 2.3, and 2.2.1 as well. It's just bad API design for 2.2.
Will update this tonight.
Force-pushed 1849f59 to 9f49b05.
Force-pushed 9f49b05 to a349d07.
LGTM, waiting for more feedback.
Test build #87196 has finished for PR 20499 at commit
Test build #87198 has finished for PR 20499 at commit
Test build #87201 has finished for PR 20499 at commit
retest this please
Test build #87205 has finished for PR 20499 at commit
retest this please
Test build #87206 has finished for PR 20499 at commit
viirya left a comment:
LGTM
[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary
## What changes were proposed in this pull request?
This PR proposes to disallow default value None when 'to_replace' is not a dictionary.
It seems weird that we set the default value of `value` to `None`, which ended up allowing the case below:
```python
>>> df.show()
```
```
+----+------+-----+
| age|height| name|
+----+------+-----+
| 10| 80|Alice|
...
```
```python
>>> df.na.replace('Alice').show()
```
```
+----+------+----+
| age|height|name|
+----+------+----+
| 10| 80|null|
...
```
**After**
This PR aims to disallow the case above:
```python
>>> df.na.replace('Alice').show()
```
```
...
TypeError: value is required when to_replace is not a dictionary.
```
while we still allow when `to_replace` is a dictionary:
```python
>>> df.na.replace({'Alice': None}).show()
```
```
+----+------+----+
| age|height|name|
+----+------+----+
| 10| 80|null|
...
```
## How was this patch tested?
Manually tested; tests were added in `python/pyspark/sql/tests.py` and doctests were fixed.
Author: hyukjinkwon <[email protected]>
Closes #20499 from HyukjinKwon/SPARK-19454-followup.
(cherry picked from commit 4b4ee26)
Signed-off-by: Wenchen Fan <[email protected]>
Thanks, merging to master/2.3! Can you send a new PR for 2.2? It conflicts...
Yup, I should fix the guide for 2.2 anyway :-) Will open a backport tonight KST.