-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-21712] [PySpark] Clarify type error for Column.substr() #18926
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #80544 has finished for PR 18926 at commit
|
|
Pinging freshly minted committer @HyukjinKwon for a review on this tiny PR. |
|
Thank for cc'ing me. Yea looks fine. Could we add the small test in the description just in case? |
|
Oh, like a docstring test for the type error? |
|
I was thinking of adding it in |
| raise TypeError("Can not mix the type") | ||
| raise TypeError( | ||
| "startPos and length must be the same type. " | ||
| "Got {startPos_t} and {length_t}, respectively." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> startPos: {startPos_t}; length: {length_t}.
BTW, why we do the type checking here, instead of doing it in the actual Scala impl of substr?
In addition, we do not support the mixed cases? For example, startPos is int, length is long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, why we do the type checking here, instead of doing it in the actual Scala impl of substr?
Do you mean exposing Java types in the error message is better or suggesting method signature change in Scala impl of substr with the check logic?
In addition, we do not support the mixed cases? For example, startPos is int, length is long.
In Python, I guess it makes sense calling int in general. long and int are unified in Python 3 and this PR looks targeting only the exception message fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If PySpark always needs to check the types, are we doing the same things in all the other function calls?
In addition, why not directly checking
if isinstance(length, (int, long)):There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs to check the types in general and we need to hide the error message related with Java types. It is also true that we also need to make such logics in to Scala one to deduplicate this logic if they are duplicated. R has also a similar problem in some places. I don't think we should change this case anyway.
It looks we should ...
py4j.Py4JException: Method substr([class java.lang.Long ...
or we should introduce bridge methods in Scala side and implement this checking logic IIRC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the latter, It looks we should call either substr with column,column or with int,int. I would like to avoid changing these If either way does not reduce the code diff and is virtually same, if I understood correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @ueshin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry for the delay.
I guess we can support long by casting to int and also the "mixed" cases @gatorsmile metioned.
What do you think @HyukjinKwon ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I think we could support long. I think this PR basically targets exception message fix. Could we make this separate?
I guess supporting the case above requires a set of regression tests with min/max of int, fix for documentation and etc, which I think is rather loosely related with the JIRA.
python/pyspark/sql/column.py
Outdated
| startPos_t=type(startPos), | ||
| length_t=type(length), | ||
| )) | ||
| if isinstance(startPos, (int, long)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nchammas, supporting long with Python 2 is not documented in the docstring and looks we throw unexpected exception by long with Python 2 as below:
from pyspark.sql import Row
df = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
df.select(df.name.substr(long(1), long(3)).alias("col")).collect()Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../spark/python/pyspark/sql/column.py", line 411, in substr
jc = self._jc.substr(startPos, length)
File ".../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File ".../spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File ".../spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 324, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o47.substr. Trace:
py4j.Py4JException: Method substr([class java.lang.Long, class java.lang.Long]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Would you mind double checking this and taking long out with a simple test with Python 2 as well? I think this will also address @gatorsmile's concern above as well.
|
To summarize the feedback from @HyukjinKwon and @gatorsmile, I think what I need to do is:
|
| startPos_t=type(startPos), | ||
| length_t=type(length), | ||
| )) | ||
| if isinstance(startPos, int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since long is not supported, I just removed it from here.
|
Test build #80643 has finished for PR 18926 at commit
|
|
I think my latest commits address the concerns raised here. Let me know if I missed or misunderstood anything. |
python/pyspark/sql/tests.py
Outdated
| def test_string_functions(self): | ||
| from pyspark.sql.functions import col, lit | ||
| df = self.spark.createDataFrame([['nick']], schema=['name']) | ||
| self.assertRaises(TypeError, lambda: df.select(col('name').substr(0, lit(1)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about something like this below as this PR targets the exception message?
startPos = 0
length = lit(1)
self.assertRaisesRegexp(
TypeError,
"must be the same type.*%s.*%s.*" % (type(startPos), type(length)),
lambda: df.select(col('name').substr(startPos, length)))There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was considering doing that at first, but it felt like just duplicating logic. Looking through the other uses of assertRaisesRegexp(), it looks like most of the time we just search for a keyword, but there are also some instances where a large part of the exception message is checked. I can do that here as well.
|
LGTM except for the comment above. |
|
It sounds like the comment hides. Could you address the comment #18926 (comment)? |
|
For ^, I want to make this separate if possible. Do you guys strongly feel about supporting |
|
Even if we plan to drop Since this PR is pretty small, we should fix the issue instead of opening another one. |
Yes, but to be more correct, I think this makes sure if both are same types, Possible way I think is with keeping this checking (which I believe is not shorter): if isinstance(startPos, int) and isinstance(length, int):
jc = self._jc.substr(startPos, length)
elif isinstance(startPos, Column) and isinstance(length, Column)::
jc = self._jc.substr(startPos._jc, length._jc)
else:
if type(startPos) != type(length):
raise TypeError(...)
else:
raise TypeError("Unexpected type: %s" % type(startPos))or, with removing this excessive checking: if isinstance(startPos, int) and isinstance(length, int):
jc = self._jc.substr(startPos, length)
elif isinstance(startPos, Column) and isinstance(length, Column)::
jc = self._jc.substr(startPos._jc, length._jc)
else:
raise TypeError(...)I agree this is excessive (the former) but I wonder if we should remove already existing one. Please correct me if I am mistaken here.
The current state fixes the JIRA specified here. I will stop staying against if you guys here feel strongly with fixing together but if you are not, could we go without that issue? |
We are not dropping
Can you elaborate please? As @HyukjinKwon pointed out, I am not changing any semantics or behavior other than to throw a Python |
| self.assertRaisesRegexp( | ||
| TypeError, | ||
| "must be the same type", | ||
| lambda: df.select(col('name').substr(0, lit(1)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon - I opted to just search for a key phrase since that sufficiently captures the intent of the updated error message.
|
Test build #80683 has finished for PR 18926 at commit
|
if isinstance(startPos, int) and isinstance(length, int):
jc = self._jc.substr(startPos, length)
elif isinstance(startPos, Column) and isinstance(length, Column)::
jc = self._jc.substr(startPos._jc, length._jc)
else:
raise TypeError(...)This looks much cleaner to me. |
|
It's cleaner but less specific. Unless we branch on whether If we want to group all the type checking in one place, we should do it as in the first example from Hyukjin's comment. |
|
I don't think this suggestion / discussion blocks this PR. Let's go as is and make a followup or a separate PR as another improvement if anyone feels so. I will review that at my best. |
|
I am merging this as it looks there is no explicit objection for the current change itself and it looks the issue is fixed by this. To summarize the discussion here:
|
|
Merged to master. Please open JIRAs / PRs related with the discussion above if anyone is willing to proceed. |
|
To be honest, the current codes do not look good to me. Since this does not make the code worse, I will not revert it back. |
|
The current codes around what this PR changes look not quite clean to me too and we should clean around this. But I think this PR itself is quite well-formed with the fix that is valid, simple and targeted with tests. |
|
Agreed with @HyukjinKwon. This PR has a very narrow goal -- improving the error message for |
Proposed changes:
Column.substr()gives.Test plan: