[SPARK-21712] [PySpark] Clarify type error for Column.substr() #18926

nchammas · 2017-08-11T18:44:41Z

Proposed changes:

Clarify the type error that Column.substr() gives.

Test plan:

Tested this manually.

Test code:

from pyspark.sql.functions import col, lit
spark.createDataFrame([['nick']], schema=['name']).select(col('name').substr(0, lit(1)))

Before:
```
TypeError: Can not mix the type
```

After:

TypeError: startPos and length must be the same type. Got <class 'int'> and
<class 'pyspark.sql.column.Column'>, respectively.
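
For readers without a Spark session at hand, here is a minimal sketch of how a check like this could produce the "After" message. The `Column` class below is a hypothetical stand-in for `pyspark.sql.column.Column`, and `check_same_type` is an illustrative name, not a PySpark function.

```python
class Column(object):
    """Hypothetical stand-in for pyspark.sql.column.Column (assumption)."""


def check_same_type(startPos, length):
    # Illustrative helper: both arguments must share exactly the same type.
    if type(startPos) != type(length):
        raise TypeError(
            "startPos and length must be the same type. "
            "Got {startPos_t} and {length_t}, respectively."
            .format(startPos_t=type(startPos), length_t=type(length)))


try:
    check_same_type(0, Column())
except TypeError as e:
    # The message names both offending types, unlike "Can not mix the type".
    message = str(e)
    print(message)
```

The class path printed for the stand-in depends on where it is defined, so it will differ from the real `pyspark.sql.column.Column` path shown above.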

SparkQA · 2017-08-11T19:14:28Z

Test build #80544 has finished for PR 18926 at commit 753dbe1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nchammas · 2017-08-11T19:16:09Z

Pinging freshly minted committer @HyukjinKwon for a review on this tiny PR.

HyukjinKwon · 2017-08-12T02:20:27Z

Thanks for cc'ing me. Yea, looks fine. Could we add the small test from the description, just in case?

nchammas · 2017-08-12T04:01:15Z

Oh, like a docstring test for the type error?

HyukjinKwon · 2017-08-12T04:21:36Z

I was thinking of adding it in python/pyspark/sql/tests.py. Just in case.. maybe we could add it around 224e0e7.

gatorsmile · 2017-08-13T08:24:01Z

python/pyspark/sql/column.py

-            raise TypeError("Can not mix the type")
+            raise TypeError(
+                "startPos and length must be the same type. "
+                "Got {startPos_t} and {length_t}, respectively."


-> startPos: {startPos_t}; length: {length_t}.

BTW, why do we do the type checking here, instead of doing it in the actual Scala implementation of substr?

In addition, do we not support the mixed cases? For example, startPos is int and length is long.

BTW, why do we do the type checking here, instead of doing it in the actual Scala implementation of substr?

Do you mean that exposing Java types in the error message is better, or are you suggesting a method signature change in the Scala implementation of substr, with the check logic moved there?

In addition, do we not support the mixed cases? For example, startPos is int and length is long.

In Python, I guess it makes sense to just say int in general. long and int are unified in Python 3, and this PR looks like it targets only the exception message fix.

If PySpark always needs to check the types, are we doing the same things in all the other function calls?

In addition, why not check directly:

if isinstance(length, (int, long)):

PySpark needs to check the types in general, because we need to hide error messages that expose Java types. It is also true that we should move such logic into the Scala side to deduplicate it where it is duplicated. R also has a similar problem in some places. I don't think we should change this case anyway.

It looks we should ...

py4j.Py4JException: Method substr([class java.lang.Long ...

or we should introduce bridge methods in Scala side and implement this checking logic IIRC.

For the latter, it looks like we should call substr with either (Column, Column) or (int, int). I would like to avoid changing these if neither way reduces the code diff and the result is virtually the same, if I understood correctly.

I'm sorry for the delay.
I guess we can support long by casting to int, and also the "mixed" cases @gatorsmile mentioned.
What do you think @HyukjinKwon ?

Yea, I think we could support long. I think this PR basically targets exception message fix. Could we make this separate?

I guess supporting the case above requires a set of regression tests with the min/max of int, documentation fixes, etc., which I think is only loosely related to the JIRA.
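
To illustrate why the casting idea calls for min/max regression tests, here is a rough sketch of a range-checked conversion. `to_java_int` and the two constants are hypothetical names introduced only for illustration; nothing like this was added in this PR. The sketch is written for Python 3, where int is unbounded; under Python 2 the isinstance check would presumably need to accept long as well.

```python
JAVA_INT_MIN = -2 ** 31      # smallest value a 32-bit Java int can hold
JAVA_INT_MAX = 2 ** 31 - 1   # largest value a 32-bit Java int can hold


def to_java_int(value):
    """Hypothetical helper: accept a Python integer only if it fits a Java int."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("Expected an integer, got %s" % type(value))
    if not JAVA_INT_MIN <= value <= JAVA_INT_MAX:
        raise ValueError("%d does not fit in a 32-bit Java int" % value)
    return value


# The boundary values are exactly the cases a regression test would pin down.
print(to_java_int(JAVA_INT_MAX))
print(to_java_int(JAVA_INT_MIN))
```

Whether an out-of-range value should raise, saturate, or be wrapped in a Column is a design decision the thread leaves open; the sketch simply picks raising.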

HyukjinKwon · 2017-08-13T10:27:45Z

python/pyspark/sql/column.py

+                    startPos_t=type(startPos),
+                    length_t=type(length),
+                ))
        if isinstance(startPos, (int, long)):


@nchammas, support for long in Python 2 is not documented in the docstring, and it looks like we throw an unexpected exception for long in Python 2, as below:

from pyspark.sql import Row
df = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
df.select(df.name.substr(long(1), long(3)).alias("col")).collect()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 411, in substr
    jc = self._jc.substr(startPos, length)
  File ".../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File ".../spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 324, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o47.substr. Trace:
py4j.Py4JException: Method substr([class java.lang.Long, class java.lang.Long]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)

Would you mind double-checking this and taking long out, with a simple test for Python 2 as well? I think this will also address @gatorsmile's concern above.

nchammas · 2017-08-14T15:50:32Z

To summarize the feedback from @HyukjinKwon and @gatorsmile, I think what I need to do is:

Add a test for the mixed type case.
Explicitly check for long in Python 2 and throw a TypeError from PySpark.
Add a test for the long TypeError in Python 2.

nchammas · 2017-08-14T18:44:59Z

python/pyspark/sql/column.py

+                    startPos_t=type(startPos),
+                    length_t=type(length),
+                ))
+        if isinstance(startPos, int):


Since long is not supported, I just removed it from here.

SparkQA · 2017-08-14T19:16:18Z

Test build #80643 has finished for PR 18926 at commit fc1d84f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nchammas · 2017-08-14T19:24:47Z

I think my latest commits address the concerns raised here. Let me know if I missed or misunderstood anything.

HyukjinKwon · 2017-08-15T04:44:48Z

python/pyspark/sql/tests.py

+    def test_string_functions(self):
+        from pyspark.sql.functions import col, lit
+        df = self.spark.createDataFrame([['nick']], schema=['name'])
+        self.assertRaises(TypeError, lambda: df.select(col('name').substr(0, lit(1))))


How about something like this below as this PR targets the exception message?

startPos = 0
length = lit(1)
self.assertRaisesRegexp(
    TypeError,
    "must be the same type.*%s.*%s.*" % (type(startPos), type(length)),
    lambda: df.select(col('name').substr(startPos, length)))

I was considering doing that at first, but it felt like just duplicating logic. Looking through the other uses of assertRaisesRegexp(), it looks like most of the time we just search for a keyword, but there are also some instances where a large part of the exception message is checked. I can do that here as well.
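
The trade-off between the two testing styles can be sketched without Spark. In the sketch below, `substr_check` is a hypothetical stand-in for the type check in Column.substr(), and a plain str plays the role of the mismatched argument. It uses assertRaisesRegex, the Python 3 spelling of assertRaisesRegexp.

```python
import unittest


def substr_check(startPos, length):
    # Hypothetical stand-in for the type check inside Column.substr().
    if type(startPos) != type(length):
        raise TypeError(
            "startPos and length must be the same type. "
            "Got {0} and {1}, respectively.".format(type(startPos), type(length)))


class SubstrTypeErrorTest(unittest.TestCase):
    def test_key_phrase(self):
        # Keyword-style match: only the stable phrase is asserted.
        with self.assertRaisesRegex(TypeError, "must be the same type"):
            substr_check(0, "1")

    def test_fuller_message(self):
        # Fuller match: also asserts that both type names appear, in order.
        with self.assertRaisesRegex(TypeError, "must be the same type.*int.*str"):
            substr_check(0, "1")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SubstrTypeErrorTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The keyword-style test survives rewording of the rest of the message, while the fuller pattern also catches a regression in which the types are dropped from the message or reported in the wrong order.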

HyukjinKwon · 2017-08-15T05:04:29Z

LGTM except for the comment above.

gatorsmile · 2017-08-15T06:01:27Z

It looks like the comment got hidden. Could you address the comment #18926 (comment)?

HyukjinKwon · 2017-08-15T06:24:51Z

For ^, I want to make this separate if possible. Do you guys feel strongly about supporting long (and, namely, "mixed" types) here - @gatorsmile and @ueshin?

gatorsmile · 2017-08-15T06:34:24Z

Even if we plan to drop long in this PR, the checking looks weird to me. Basically, the change just wants to ensure the type of length is int.

Since this PR is pretty small, we should fix the issue instead of opening another one.

HyukjinKwon · 2017-08-15T07:11:56Z

Basically, the change just wants to ensure the type of length is int.

Yes, but to be more precise, I think this makes sure that both are the same type, Column or int. I think this throws an exception to give a better error message (rather than, for example, "Unexpected types: startPos :<type 'int'> length: <class 'pyspark.sql.column.Column'>"). I know this sounds like rather excessive checking, but it is still a valid check.

Possible way I think is with keeping this checking (which I believe is not shorter):

        if isinstance(startPos, int) and isinstance(length, int):
            jc = self._jc.substr(startPos, length)
        elif isinstance(startPos, Column) and isinstance(length, Column):
            jc = self._jc.substr(startPos._jc, length._jc)
        else:
            if type(startPos) != type(length):
                raise TypeError(...)
            else:
                raise TypeError("Unexpected type: %s" % type(startPos))

or, with removing this excessive checking:

        if isinstance(startPos, int) and isinstance(length, int):
            jc = self._jc.substr(startPos, length)
        elif isinstance(startPos, Column) and isinstance(length, Column):
            jc = self._jc.substr(startPos._jc, length._jc)
        else:
            raise TypeError(...)

I agree this is excessive (the former), but I wonder if we should remove the already existing check. Please correct me if I am mistaken here.

long already does not work, but it fails with an unexpected exception message, which this PR fixes.

The current state fixes the JIRA specified here.

I will stop objecting if you guys feel strongly about fixing it together here, but if you do not, could we go ahead without that issue?

nchammas · 2017-08-15T12:34:14Z

@gatorsmile

Even if we plan to drop long in this PR

We are not dropping long in this PR. It was never supported. Both the docstring and actual behavior of .substr() make it clear that long is not supported. Only int and Column are supported.

the checking looks weird to me. Basically, the change just wants to ensure the type of length is int.

Can you elaborate please? As @HyukjinKwon pointed out, .substr() accepts either int or Column, but both arguments must be of the same type. The goal of this PR is to make that clearer.

I am not changing any semantics or behavior other than to throw a Python TypeError on long, as opposed to letting the underlying Scala implementation throw a messy exception.

nchammas · 2017-08-15T13:18:17Z

python/pyspark/sql/tests.py

+        self.assertRaisesRegexp(
+            TypeError,
+            "must be the same type",
+            lambda: df.select(col('name').substr(0, lit(1))))


@HyukjinKwon - I opted to just search for a key phrase since that sufficiently captures the intent of the updated error message.

SparkQA · 2017-08-15T13:44:42Z

Test build #80683 has finished for PR 18926 at commit a7fea20.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-08-15T16:11:10Z

        if isinstance(startPos, int) and isinstance(length, int):
            jc = self._jc.substr(startPos, length)
        elif isinstance(startPos, Column) and isinstance(length, Column):
            jc = self._jc.substr(startPos._jc, length._jc)
        else:
            raise TypeError(...)

This looks much cleaner to me.

nchammas · 2017-08-15T16:23:52Z

It's cleaner but less specific. Unless we branch on whether startPos and length are the same type, we will give the same error message for mixed types and for unsupported types. That seems like a step back to me as these are two different problems which should get different error messages.

If we want to group all the type checking in one place, we should do it as in the first example from Hyukjin's comment.
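
As a rough illustration of the distinction being argued for, here is a sketch in which all the type checks sit in one place but mixed types and unsupported types still get different messages. `Column` is a hypothetical stand-in class and `classify_substr_args` is an illustrative name; the real method would call `self._jc.substr(...)` where the sketch returns a label.

```python
class Column(object):
    """Hypothetical stand-in for pyspark.sql.column.Column (assumption)."""


def classify_substr_args(startPos, length):
    """Return which substr overload would be used, or raise a specific TypeError."""
    if isinstance(startPos, int) and isinstance(length, int):
        return "int, int"
    elif isinstance(startPos, Column) and isinstance(length, Column):
        return "Column, Column"
    elif type(startPos) != type(length):
        # Problem 1: the two arguments disagree with each other.
        raise TypeError(
            "startPos and length must be the same type. "
            "Got {0} and {1}, respectively.".format(type(startPos), type(length)))
    else:
        # Problem 2: the arguments agree, but on a type substr() cannot handle.
        raise TypeError("Unexpected type: %s" % type(startPos))


for args in [(0, Column()), ("a", "b")]:
    try:
        classify_substr_args(*args)
    except TypeError as e:
        print(e)
```

With a single catch-all else branch, both calls in the loop would report the same message, which is the loss of specificity described above.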

HyukjinKwon · 2017-08-16T00:20:45Z

I don't think this suggestion / discussion blocks this PR. Let's go as is and make a followup or a separate PR as another improvement if anyone feels so. I will review that at my best.

HyukjinKwon · 2017-08-16T02:17:25Z

I am merging this, as it looks like there is no explicit objection to the current change itself and the issue appears to be fixed by it.

To summarize the discussion here:

Cleaning up type checking logics, if possible.
Supporting "mixed" types. For example, long in Python 2 by casting. Another idea might be just wrapping it with Column for different types.

HyukjinKwon · 2017-08-16T02:25:58Z

Merged to master.

Please open JIRAs / PRs related with the discussion above if anyone is willing to proceed.

gatorsmile · 2017-08-16T05:30:37Z

To be honest, the current codes do not look good to me. Since this does not make the code worse, I will not revert it back.

HyukjinKwon · 2017-08-16T05:52:44Z

The current codes around what this PR changes look not quite clean to me too and we should clean around this.

But I think this PR itself is quite well-formed with the fix that is valid, simple and targeted with tests.

nchammas · 2017-08-16T14:15:08Z

Agreed with @HyukjinKwon. This PR has a very narrow goal -- improving the error message for Column.substr() -- which I think it accomplished. I think @gatorsmile was expecting a more significant set of improvements, but that's not what this PR (or the associated JIRA) are about.

clarify type error for Column.substr()

753dbe1

gatorsmile reviewed Aug 13, 2017

HyukjinKwon reviewed Aug 13, 2017

nchammas added 2 commits August 14, 2017 14:40

long is not supported

ff9b07c

add type tests for substr

fc1d84f

nchammas commented Aug 14, 2017

HyukjinKwon reviewed Aug 15, 2017

check substr type error message

a7fea20

nchammas commented Aug 15, 2017

asfgit closed this in 9660831 Aug 16, 2017

nchammas deleted the SPARK-21712-substr-type-error branch August 16, 2017 14:15
