[SPARK-28359][SQL][PYTHON][TESTS] Make integrated UDF tests robust by making them no-op #25132
This alternative was abandoned for the two reasons below:

1. Pyrolite does not seem to guarantee that floats are exactly preserved after the round trip between the PVM and the JVM.
2. There seems to be a bug in Arrow conversions for decimal precision and scale.
What changes were proposed in this pull request?
This is another alternative approach compared to #25130.
The UDFs currently available in `IntegratedUDFTestUtils` are not exactly no-op: they convert the input column to strings and output strings. This causes some issues when we convert and port the tests at SPARK-27921, because the integrated UDF test cases share one output file and should produce the same output. However:
- Special values are converted into strings differently, as illustrated in the sketch after this list:
  - `null` → `None`
  - `Infinity` → `inf`
  - `-Infinity` → `-inf`
  - `NaN` → `nan`
- Due to the float limitation in Python (see https://docs.python.org/3/tutorial/floatingpoint.html), if a float is passed into Python and sent back to the JVM, the value is potentially not exactly the same. See #25128 ([SPARK-28270][test-maven][FOLLOW-UP][SQL][PYTHON][TESTS] Avoid cast input of UDF as double in the failed test in udf-aggregate_part1.sql) and #25110 ([SPARK-28270][SQL][FOLLOW-UP] Explicitly cast into int/long/decimal in udf-aggregates_part1.sql to avoid Python float limitation).
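As a quick illustration (plain Python, not the test harness itself), `str()` renders these special values differently from the textual form used in the SQL test output:

```python
# Python's textual rendering of special values differs from the SQL output,
# so a string-converting UDF changes the expected results in the golden files.
print(str(None))           # None -> SQL output shows null
print(str(float("inf")))   # inf  -> SQL output shows Infinity
print(str(float("-inf")))  # -inf -> SQL output shows -Infinity
print(str(float("nan")))   # nan  -> SQL output shows NaN
```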
To work around this, this PR changes the UDFs so that they return the input as-is (input column to output column). In this way, we can also handle complex types such as maps, arrays, or structs; however, it is too hacky and a bit of an overkill.
Before:
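Roughly, the old UDFs behaved like the following minimal PySpark sketch (an illustration only; the actual definitions live in `IntegratedUDFTestUtils`):

```python
from pyspark.sql.functions import udf

# Old behavior: cast the input to a string and return a string,
# regardless of the input column's actual type.
test_udf = udf(lambda x: str(x), "string")
```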
After:
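With this change, they behave roughly like this sketch, where the return type is generated to match the input column (a hypothetical double column here):

```python
from pyspark.sql.functions import udf

# New behavior: a no-op UDF that returns the input column as-is.
# Its return type has to match the input column's type ("double" here),
# which is why a Python function must be generated per return type.
identity_udf = udf(lambda x: x, "double")
```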
However, this way requires launching an external Python process to generate a Python function for each return type every time it evaluates. Without caching, this roughly doubles the testing time, so I had to add a cache to keep the testing time about the same. However, it now looks pretty complicated and a bit of an overkill.
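The caching idea can be illustrated with the hedged Python sketch below; the PR's actual cache lives in the Scala test utilities, not in `functools.lru_cache`, so this only shows the key-by-return-type idea:

```python
from functools import lru_cache

from pyspark.sql.functions import udf


@lru_cache(maxsize=None)
def identity_udf_for(return_type):
    # Generate the no-op UDF for a given return type only once and reuse it,
    # so a new Python function is not generated on every evaluation.
    return udf(lambda x: x, return_type)


first = identity_udf_for("double")   # generated and cached
second = identity_udf_for("double")  # served from the cache
assert first is second
```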
In this way, the UDFs are almost completely no-op.
There is no diff in the output compared to #25130.
Diff compared to the PR 25180:
How was this patch tested?
Manually tested.