[SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs #22655
Conversation
cc @cloud-fan, @viirya and @BryanCutler, WDYT?

Test build #97046 has finished for PR 22655 at commit

Thanks for pinging me. I'll look into this tonight or tomorrow.
python/pyspark/sql/functions.py
Outdated
# +-----------------------------+--------------+----------+------+-------+------+----------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+--------------+----------+--------------+-------------+-------------+ # noqa
# |SQL Type \ Python Value(Type)|None(NoneType)|True(bool)|1(int)|1(long)|a(str)|a(unicode)| 1970-01-01(date)|1970-01-01 00:00:00(datetime)|1.0(float)|array('i', [1])(array)|[1](list)| (1,)(tuple)|ABC(bytearray)|1(Decimal)|{'a': 1}(dict)|Row(a=1)(Row)|Row(a=1)(Row)| # noqa
# +-----------------------------+--------------+----------+------+-------+------+----------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+--------------+----------+--------------+-------------+-------------+ # noqa
# | null| None| None| None| None| None| None| None| None| None| None| None| None| None| None| None| X| X| # noqa
It seems we have to document what X means in this table?
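For reference, a cell ends up as X when the generating script (quoted in the PR description at the end of this page) catches an exception for that value/type combination. A minimal sketch of one such combination, assuming a SparkSession is available as `spark` (as in the pyspark shell):

```python
# Sketch only: reproduces one "X" cell from the table.
from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = spark.range(1)
try:
    # Returning a Row from a UDF declared as IntegerType is one of the
    # combinations the table records as "X".
    df.select(udf(lambda: Row(a=1), IntegerType())()).first()
except Exception as e:
    print("X  (conversion raised %s)" % type(e).__name__)
```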
python/pyspark/sql/functions.py
Outdated
# | smallint| None| None| 1| 1| None| None| None| None| None| None| None| None| None| None| None| X| X| # noqa
# | int| None| None| 1| 1| None| None| None| None| None| None| None| None| None| None| None| X| X| # noqa
# | bigint| None| None| 1| 1| None| None| None| None| None| None| None| None| None| None| None| X| X| # noqa
# | string| None| true| 1| 1| a| a|java.util.Gregori...| java.util.Gregori...| 1.0| [I@7f1970e1| [1]|[Ljava.lang.Objec...| [B@284838a9| 1| {a=1}| X| X| # noqa
Does true here mean the string 'true'? Shall we add quotes for strings?
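For context, a rough reproduction of that cell (a sketch, assuming a SparkSession is available as `spark`): a Python bool returned from a UDF declared as StringType appears to come back as the text 'true'.

```python
# Sketch only: probes the string-type / True(bool) cell of the table.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = spark.range(1)
row = df.select(udf(lambda: True, StringType())()).first()
print(row[0])  # prints: true -- the bool came back rendered as a string
```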
Is [B@284838a9 meaningful in this table?
Hmmmmm .. I see, the type is not clear here. Let me think about this a bit more.

[B@284838a9 is quite buggy behaviour that we should fix. So I was thinking of documenting it internally, since we already spent much time figuring out how each case works individually (at #20163).
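For reference, a sketch of the cell being discussed (assuming a SparkSession is available as `spark`): a bytearray returned from a UDF declared as StringType comes back as the JVM byte array's toString() rather than the text 'ABC'.

```python
# Sketch only: reproduces the buggy string-type / bytearray cell.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = spark.range(1)
row = df.select(udf(lambda: bytearray(b"ABC"), StringType())()).first()
print(row[0])  # something like '[B@284838a9' -- a JVM object reference, not 'ABC'
```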
python/pyspark/sql/functions.py
Outdated
# Please refer to SPARK-25666's PR to see the code used to generate the table below.
#
# +-----------------------------+--------------+----------+------+-------+------+----------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+--------------+----------+--------------+-------------+-------------+ # noqa
# |SQL Type \ Python Value(Type)|None(NoneType)|True(bool)|1(int)|1(long)|a(str)|a(unicode)| 1970-01-01(date)|1970-01-01 00:00:00(datetime)|1.0(float)|array('i', [1])(array)|[1](list)| (1,)(tuple)|ABC(bytearray)|1(Decimal)|{'a': 1}(dict)|Row(a=1)(Row)|Row(a=1)(Row)| # noqa
Is there any difference between the last two Row(a=1)(Row) columns?
Right, one was Row(a=1) and the other one was the namedtuple approach, Row("a")(1). Let me try to update.
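A small illustration of why the last two column headers render identically: the keyword-argument form and the namedtuple-style form produce Rows with the same repr.

```python
from pyspark.sql import Row

kwargs_row = Row(a=1)         # Row built with keyword arguments
NamedRow = Row("a")           # namedtuple-style: first declare the field names
namedtuple_row = NamedRow(1)  # then instantiate with the values

print(kwargs_row)             # Row(a=1)
print(namedtuple_row)         # Row(a=1) -- same repr, hence the duplicated header
```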
It's useful to have this table, thanks! Shall we discuss the expected behavior here or in another JIRA ticket?
Let me make this table for Pandas UDFs too, and then open another JIRA (or a mailing list thread) to discuss this further. I need more investigation to propose the desired behaviour targeting 3.0.
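(Not from the PR, just a rough sketch of what a single probe for a scalar Pandas UDF might look like, assuming pyarrow is installed and a SparkSession is available as `spark`; the eventual follow-up may differ.)

```python
import pandas as pd

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import IntegerType

df = spark.range(1)
value = 1  # one candidate Python value; a full table would loop over many,
           # as the script in the PR description does for regular UDFs

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def probe(batch):
    # Return the candidate value once per row in the input batch.
    return pd.Series([value] * len(batch))

try:
    print(repr(df.select(probe("id")).first()[0]))
except Exception:
    print("X")
```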
Test build #97097 has finished for PR 22655 at commit

Test build #97098 has finished for PR 22655 at commit
I am getting this in. This is an ongoing effort, and for now it just documents the conversions internally.

Merged to master.
@viirya and @BryanCutler, do you have some time to take on the Pandas one? I don't think I will have time within the next couple of weeks. If you have some time, I would appreciate it if you could go ahead; otherwise, I will start on it after a couple of weeks.
@HyukjinKwon I can take some time to do the same for Pandas UDFs.

Thanks @viirya!

@HyukjinKwon Cool! Thanks!
[SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs

### What changes were proposed in this pull request?

We are facing some problems with type conversions between Python data and SQL types in UDFs (Pandas UDFs as well). It's even difficult to identify the problems (see apache#20163 and apache#22610).

This PR internally documents the type conversion table. Some of the conversions look buggy and we should fix them.

```python
import sys
import array
import datetime
from decimal import Decimal

from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import udf

if sys.version >= '3':
    long = int

data = [
    None,
    True,
    1,
    long(1),
    "a",
    u"a",
    datetime.date(1970, 1, 1),
    datetime.datetime(1970, 1, 1, 0, 0),
    1.0,
    array.array("i", [1]),
    [1],
    (1,),
    bytearray([65, 66, 67]),
    Decimal(1),
    {"a": 1},
    Row(a=1),
    Row("namedtuple")(1),
]

types = [
    BooleanType(),
    ByteType(),
    ShortType(),
    IntegerType(),
    LongType(),
    StringType(),
    DateType(),
    TimestampType(),
    FloatType(),
    DoubleType(),
    ArrayType(IntegerType()),
    BinaryType(),
    DecimalType(10, 0),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
]

df = spark.range(1)
results = []
count = 0
total = len(types) * len(data)
spark.sparkContext.setLogLevel("FATAL")
for t in types:
    result = []
    for v in data:
        try:
            row = df.select(udf(lambda: v, t)()).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n Python Value: [%s(%s)]\n Result Python Value: [%s]" % (
            t.simpleString(), str(v), type(v).__name__, ret_str)
        count += 1
        print("%s/%s:\n %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))

schema = ["SQL Type \\ Python Value(Type)"] + list(map(lambda v: "%s(%s)" % (str(v), type(v).__name__), data))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False)
print("\n".join(map(lambda line: " # %s # noqa" % line, strings.strip().split("\n"))))
```

This table was generated under Python 2, but the code above is Python 3 compatible as well.

### How was this patch tested?

Manually tested and ran lint checks.

Closes apache#22655 from HyukjinKwon/SPARK-25666.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
What changes were proposed in this pull request?
We are facing some problems with type conversions between Python data and SQL types in UDFs (Pandas UDFs as well).
It's even difficult to identify the problems (see #20163 and #22610).
This PR internally documents the type conversion table. Some of the conversions look buggy and we should fix them.
This table was generated under Python 2, but the code above is Python 3 compatible as well.
How was this patch tested?
Manually tested and ran lint checks.