@nchammas
Contributor

Proposed changes:

  • Clarify the type error that Column.substr() gives.

Test plan:

  • Tested this manually.
  • Test code:
    from pyspark.sql.functions import col, lit
    spark.createDataFrame([['nick']], schema=['name']).select(col('name').substr(0, lit(1)))
  • Before:
    TypeError: Can not mix the type
    
  • After:
    TypeError: startPos and length must be the same type. Got <class 'int'> and
    <class 'pyspark.sql.column.Column'>, respectively.
    
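For illustration, a minimal sketch of the kind of check that produces the "After" message. It assumes a stand-in `Column` class and a hypothetical `check_same_type` helper so that it runs without Spark; it is not the code in `pyspark/sql/column.py`.

```python
class Column:
    """Stand-in for pyspark.sql.Column; only used to illustrate the check."""


def check_same_type(startPos, length):
    # Hypothetical helper: both arguments must share one type.
    if type(startPos) != type(length):
        raise TypeError(
            "startPos and length must be the same type. "
            "Got {startPos_t} and {length_t}, respectively."
            .format(startPos_t=type(startPos), length_t=type(length)))


try:
    check_same_type(0, Column())
except TypeError as e:
    message = str(e)
    print(message)
```

The message names both offending types, which is what the old "Can not mix the type" text left out.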

@SparkQA

SparkQA commented Aug 11, 2017

Test build #80544 has finished for PR 18926 at commit 753dbe1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas
Contributor Author

Pinging freshly minted committer @HyukjinKwon for a review on this tiny PR.

@HyukjinKwon
Member

Thanks for cc'ing me. Yeah, it looks fine. Could we add the small test from the description, just in case?

@nchammas
Contributor Author

Oh, like a docstring test for the type error?

@HyukjinKwon
Member

I was thinking of adding it in python/pyspark/sql/tests.py, just in case. Maybe we could add it around 224e0e7.

- raise TypeError("Can not mix the type")
+ raise TypeError(
+     "startPos and length must be the same type. "
+     "Got {startPos_t} and {length_t}, respectively."
Member

@gatorsmile gatorsmile Aug 13, 2017

-> startPos: {startPos_t}; length: {length_t}.

BTW, why do we do the type checking here instead of in the actual Scala implementation of substr?

In addition, do we not support the mixed cases? For example, startPos is int and length is long.

Member

@HyukjinKwon HyukjinKwon Aug 13, 2017

BTW, why do we do the type checking here instead of in the actual Scala implementation of substr?

Do you mean that exposing Java types in the error message is better, or are you suggesting a method signature change in the Scala implementation of substr that includes the check logic?

In addition, do we not support the mixed cases? For example, startPos is int and length is long.

In Python, I guess it makes sense to refer to int in general. long and int are unified in Python 3, and this PR looks like it targets only the exception message fix.

Member

If PySpark always needs to check the types, are we doing the same thing in all the other function calls?

In addition, why not check directly:

if isinstance(length, (int, long)):

Member

It needs to check the types in general, and we need to hide error messages that expose Java types. It is also true that we should move such logic into the Scala side to deduplicate it where it is duplicated. R has a similar problem in some places. I don't think we should change this case anyway.

It looks like we should ...

py4j.Py4JException: Method substr([class java.lang.Long ...

or we should introduce bridge methods on the Scala side and implement this checking logic there, IIRC.

Member

For the latter, it looks like we should call substr with either column, column or int, int. I would like to avoid changing this if neither way reduces the code diff and they are virtually the same, if I understood correctly.

Member

cc @ueshin

Member

I'm sorry for the delay.
I guess we can support long by casting to int, and also the "mixed" cases @gatorsmile mentioned.
What do you think, @HyukjinKwon?

Member

Yeah, I think we could support long. But this PR basically targets the exception message fix. Could we do that separately?

I guess supporting the case above requires a set of regression tests with the min/max of int, documentation fixes, etc., which I think is only loosely related to the JIRA.
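
To make the min/max concern concrete, here is a rough sketch of what a bounds-checked conversion might look like. The helper `to_java_int` and the constant names are hypothetical, not part of PySpark; the only assumption about the JVM side is that a Java int is a 32-bit signed integer.

```python
# Bounds of a 32-bit signed Java int.
JAVA_INT_MIN = -2**31
JAVA_INT_MAX = 2**31 - 1


def to_java_int(value):
    """Hypothetical helper: accept an integer only if it fits in a Java int."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("Expected an integer, got %s" % type(value))
    if not JAVA_INT_MIN <= value <= JAVA_INT_MAX:
        raise ValueError("%d does not fit in a 32-bit Java int" % value)
    return value
```

Regression tests for such a helper would need to cover both boundaries and the first value outside each, which is the extra scope referred to above.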

startPos_t=type(startPos),
length_t=type(length),
))
if isinstance(startPos, (int, long)):
Member

@nchammas, supporting long with Python 2 is not documented in the docstring, and it looks like we throw an unexpected exception for long with Python 2, as below:

from pyspark.sql import Row
df = spark.createDataFrame([Row(name=u'Tom', height=80), Row(name=u'Alice', height=None)])
df.select(df.name.substr(long(1), long(3)).alias("col")).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 411, in substr
    jc = self._jc.substr(startPos, length)
  File ".../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File ".../spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 324, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o47.substr. Trace:
py4j.Py4JException: Method substr([class java.lang.Long, class java.lang.Long]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)

Would you mind double-checking this and taking long out, with a simple test for Python 2 as well? I think this will also address @gatorsmile's concern above.

@nchammas
Contributor Author

To summarize the feedback from @HyukjinKwon and @gatorsmile, I think what I need to do is:

  • Add a test for the mixed type case.
  • Explicitly check for long in Python 2 and throw a TypeError from PySpark.
  • Add a test for the long TypeError in Python 2.

startPos_t=type(startPos),
length_t=type(length),
))
if isinstance(startPos, int):
Contributor Author

Since long is not supported, I just removed it from here.

@SparkQA

SparkQA commented Aug 14, 2017

Test build #80643 has finished for PR 18926 at commit fc1d84f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas
Contributor Author

I think my latest commits address the concerns raised here. Let me know if I missed or misunderstood anything.

    def test_string_functions(self):
        from pyspark.sql.functions import col, lit
        df = self.spark.createDataFrame([['nick']], schema=['name'])
        self.assertRaises(TypeError, lambda: df.select(col('name').substr(0, lit(1))))
Member

How about something like the following, since this PR targets the exception message?

startPos = 0
length = lit(1)
self.assertRaisesRegexp(
    TypeError,
    "must be the same type.*%s.*%s.*" % (type(startPos), type(length)),
    lambda: df.select(col('name').substr(startPos, length)))

Contributor Author

I was considering doing that at first, but it felt like just duplicating logic. Looking through the other uses of assertRaisesRegexp(), it looks like most of the time we just search for a keyword, but there are also some instances where a large part of the exception message is checked. I can do that here as well.
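
The two styles of assertion can be compared side by side in a small sketch. It assumes a stand-in `Column` class and a hypothetical module-level `substr` function so it runs without Spark, and it uses `assertRaisesRegex`, the Python 3 name for `assertRaisesRegexp`.

```python
import re
import unittest


class Column:
    """Stand-in for pyspark.sql.Column, so the sketch runs without Spark."""


def substr(startPos, length):
    # Hypothetical stand-in that raises the message discussed in this PR.
    if type(startPos) != type(length):
        raise TypeError(
            "startPos and length must be the same type. "
            "Got {} and {}, respectively.".format(type(startPos), type(length)))
    return (startPos, length)


class SubstrErrorTest(unittest.TestCase):
    def test_keyword_only(self):
        # Search for a key phrase; tolerant of changes to the rest of the text.
        with self.assertRaisesRegex(TypeError, "must be the same type"):
            substr(0, Column())

    def test_larger_part_of_message(self):
        # Also pin down the reported types. They are escaped because a type's
        # string form can contain regex metacharacters such as ".".
        startPos, length = 0, Column()
        pattern = "must be the same type.*%s.*%s" % (
            re.escape(str(type(startPos))), re.escape(str(type(length))))
        with self.assertRaisesRegex(TypeError, pattern):
            substr(startPos, length)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SubstrErrorTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The keyword form is shorter and less brittle; the fuller form also checks that the types are reported in the right order.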

@HyukjinKwon
Member

LGTM except for the comment above.

@gatorsmile
Member

It looks like the comment is hidden. Could you address the comment #18926 (comment)?

@HyukjinKwon
Member

For ^, I want to handle this separately if possible. Do you feel strongly about supporting long (namely, "mixed" types) here, @gatorsmile and @ueshin?

@gatorsmile
Member

Even if we plan to drop long in this PR, the checking looks weird to me. Basically, the change just wants to ensure the type of length is int.

Since this PR is pretty small, we should fix the issue instead of opening another one.

@HyukjinKwon
Member

Basically, the change just wants to ensure the type of length is int.

Yes, but to be more precise, I think this makes sure that both are the same type, Column or int. I think this throws an exception for the sake of a better error message (rather than, for example, "Unexpected types: startPos :<type 'int'> length: <class 'pyspark.sql.column.Column'>"). I know this sounds like rather excessive checking, but it is still a valid check.

One possible way, I think, is to keep this check (which I believe is not shorter):

        if isinstance(startPos, int) and isinstance(length, int):
            jc = self._jc.substr(startPos, length)
        elif isinstance(startPos, Column) and isinstance(length, Column):
            jc = self._jc.substr(startPos._jc, length._jc)
        else:
            if type(startPos) != type(length):
                raise TypeError(...)
            else:
                raise TypeError("Unexpected type: %s" % type(startPos))

or, removing this excessive check:

        if isinstance(startPos, int) and isinstance(length, int):
            jc = self._jc.substr(startPos, length)
        elif isinstance(startPos, Column) and isinstance(length, Column):
            jc = self._jc.substr(startPos._jc, length._jc)
        else:
            raise TypeError(...)

I agree the former is excessive, but I wonder whether we should remove a check that already exists. Please correct me if I am mistaken here.

long already does not work, but it fails with an unexpected exception message, which this PR fixes.

The current state fixes the JIRA specified here.

I will stop objecting if you feel strongly about fixing this together, but if not, could we go ahead without that issue?
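
As a rough, runnable sketch of the first option (keeping the check), the branches could be filled in as below. The `Column` class and the `substr_args` function are stand-ins introduced for illustration; the real method forwards the arguments to the JVM through `self._jc.substr`, which this sketch replaces by returning them.

```python
class Column:
    """Stand-in for pyspark.sql.Column; `_jc` mimics the wrapped Java column."""

    def __init__(self, jc):
        self._jc = jc


def substr_args(startPos, length):
    """Return the arguments that would be forwarded to the JVM substr call."""
    if isinstance(startPos, int) and isinstance(length, int):
        return (startPos, length)
    elif isinstance(startPos, Column) and isinstance(length, Column):
        return (startPos._jc, length._jc)
    elif type(startPos) != type(length):
        # Mixed types, for example an int together with a Column.
        raise TypeError(
            "startPos and length must be the same type. "
            "Got {} and {}, respectively.".format(type(startPos), type(length)))
    else:
        # Same type on both sides, but not one that is supported.
        raise TypeError("Unexpected type: %s" % type(startPos))
```

With this layout, `substr_args(0, Column("jc"))` reports a type mismatch, while `substr_args(1.5, 2.5)` reports an unexpected type.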

@nchammas
Contributor Author

@gatorsmile

Even if we plan to drop long in this PR

We are not dropping long in this PR. It was never supported. Both the docstring and actual behavior of .substr() make it clear that long is not supported. Only int and Column are supported.

the checking looks weird to me. Basically, the change just wants to ensure the type of length is int.

Can you elaborate please? As @HyukjinKwon pointed out, .substr() accepts either int or Column, but both arguments must be of the same type. The goal of this PR is to make that clearer.

I am not changing any semantics or behavior other than to throw a Python TypeError on long, as opposed to letting the underlying Scala implementation throw a messy exception.

        self.assertRaisesRegexp(
            TypeError,
            "must be the same type",
            lambda: df.select(col('name').substr(0, lit(1))))
Contributor Author

@HyukjinKwon - I opted to just search for a key phrase since that sufficiently captures the intent of the updated error message.

@SparkQA

SparkQA commented Aug 15, 2017

Test build #80683 has finished for PR 18926 at commit a7fea20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

        if isinstance(startPos, int) and isinstance(length, int):
            jc = self._jc.substr(startPos, length)
        elif isinstance(startPos, Column) and isinstance(length, Column):
            jc = self._jc.substr(startPos._jc, length._jc)
        else:
            raise TypeError(...)

This looks much cleaner to me.

@nchammas
Contributor Author

It's cleaner but less specific. Unless we branch on whether startPos and length are the same type, we will give the same error message for mixed types and for unsupported types. That seems like a step back to me as these are two different problems which should get different error messages.

If we want to group all the type checking in one place, we should do it as in the first example from Hyukjin's comment.
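
The trade-off can be seen in a small sketch of the single-branch variant. The `Column` class, the `check_args` and `error_message` helpers, and the wording of the message are all assumptions made for illustration; the snippet under discussion leaves the message as `TypeError(...)`.

```python
class Column:
    """Stand-in for pyspark.sql.Column, so the sketch runs without Spark."""


def check_args(startPos, length):
    # Single catch-all branch; the message wording here is hypothetical.
    both_int = isinstance(startPos, int) and isinstance(length, int)
    both_column = isinstance(startPos, Column) and isinstance(length, Column)
    if not (both_int or both_column):
        raise TypeError(
            "startPos and length must both be int or both be Column. "
            "Got {} and {}.".format(type(startPos), type(length)))


def error_message(startPos, length):
    """Return the TypeError text for the given arguments, or None if accepted."""
    try:
        check_args(startPos, length)
    except TypeError as e:
        return str(e)
    return None


mixed = error_message(0, Column())      # the same-type rule is violated
unsupported = error_message(1.5, 2.5)   # same type, but not a supported one
print(mixed)
print(unsupported)
```

Both calls produce the same sentence with different type names filled in, so the reader has to work out which of the two problems applies.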

@HyukjinKwon
Member

HyukjinKwon commented Aug 16, 2017

I don't think this suggestion/discussion blocks this PR. Let's go as is and make a follow-up or a separate PR as another improvement if anyone feels so inclined. I will do my best to review it.

@HyukjinKwon
Member

HyukjinKwon commented Aug 16, 2017

I am merging this, as it looks like there is no explicit objection to the current change itself and the issue appears to be fixed by it.

To summarize the discussion here:

  • Cleaning up the type checking logic, if possible.

  • Supporting "mixed" types, for example long in Python 2 by casting. Another idea might be to just wrap the arguments in a Column when the types differ.
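
The wrapping idea could look roughly like the sketch below. It uses stand-ins for `Column` and `lit` so that it runs without Spark, and `normalize_substr_args` is a hypothetical helper, not something this PR adds.

```python
class Column:
    """Stand-in for pyspark.sql.Column."""

    def __init__(self, expr):
        self.expr = expr


def lit(value):
    """Stand-in for pyspark.sql.functions.lit: wrap a literal in a Column."""
    return Column(("literal", value))


def normalize_substr_args(startPos, length):
    """Hypothetical: promote int arguments to Columns when the types are mixed."""
    if isinstance(startPos, int) and isinstance(length, int):
        # Both plain ints: nothing to promote.
        return startPos, length

    def as_column(arg):
        if isinstance(arg, Column):
            return arg
        if isinstance(arg, int):
            return lit(arg)
        raise TypeError("Unexpected type: %s" % type(arg))

    return as_column(startPos), as_column(length)
```

Under this approach a call such as `normalize_substr_args(0, lit(1))` would yield two Columns instead of raising, which changes behavior and would need its own tests and documentation.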

@asfgit asfgit closed this in 9660831 Aug 16, 2017
@HyukjinKwon
Member

Merged to master.

Please open JIRAs / PRs related to the discussion above if anyone is willing to proceed.

@gatorsmile
Member

To be honest, the current code does not look good to me. Since this does not make the code worse, I will not revert it.

@HyukjinKwon
Member

The current code around what this PR changes does not look quite clean to me either, and we should clean it up.

But I think this PR itself is well-formed, with a fix that is valid, simple, and targeted, and with tests.

@nchammas
Contributor Author

nchammas commented Aug 16, 2017

Agreed with @HyukjinKwon. This PR has a very narrow goal -- improving the error message for Column.substr() -- which I think it accomplished. I think @gatorsmile was expecting a more significant set of improvements, but that's not what this PR (or the associated JIRA) is about.

@nchammas nchammas deleted the SPARK-21712-substr-type-error branch August 16, 2017 14:15
5 participants