[SPARK-23936][SQL] Implement map_concat #21073
Conversation
Test build #89369 has finished for PR 21073 at commit
Test build #89378 has finished for PR 21073 at commit
What's the result of map_concat(NULL, NULL)?
@henryr empty map:
scala> df.select(map_concat('map1, 'map2).as('newMap)).show
+------+
|newMap|
+------+
| []|
| []|
+------+
Presto docs (from which the proposed spec comes) are quiet on the matter. Even after looking at the Presto code, I am still hard-pressed to say.
I did divine from the Presto code that there should be at least two inputs (and I don't currently verify that).
Hm, seems a bit unusual to me to have, in effect, NULL ++ NULL => Map(). I checked with Presto and it looks like it returns NULL:
presto> select map_concat(NULL, NULL)
-> ;
_col0
-------
NULL
(1 row)
@henryr Since Presto is the reference, map_concat should return NULL in this case. I will update.
@henryr Another quick test of Presto also shows that if any input is NULL, the result is NULL:
presto:default> SELECT map_concat(NULL, map(ARRAY[1,3], ARRAY[2,4]));
_col0
-------
NULL
(1 row)
Looks like I need to check if any input is NULL.
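For reference, here is a rough standalone sketch of the semantics I'm aiming for, using plain Scala collections rather than Spark's MapData; mapConcat here is a made-up helper, not the real expression:
def mapConcat[K, V](maps: Map[K, V]*): Map[K, V] =
  if (maps.contains(null)) null // any null input yields a null result, matching Presto
  else maps.foldLeft(Map.empty[K, V])(_ ++ _) // note: Scala's ++ overwrites duplicate keys; only the null handling matters here

mapConcat(Map(1 -> 2), null)        // null
mapConcat(Map(1 -> 2), Map(3 -> 4)) // Map(1 -> 2, 3 -> 4)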
Test build #89473 has finished for PR 21073 at commit
44137cc to d3d6ad6
Test build #89523 has finished for PR 21073 at commit
Test build #89579 has finished for PR 21073 at commit
Since this logic is big enough (and similar enough to the logic in eval), I wonder if the merge logic should be moved to a utility class and called from both eval as well as the generated code.
The FromUTCTimestamp expression does something sort of like that, where the eval method as well as the generated code both call utility functions in the DateTimeUtils Scala object. Also, the Concat expression's eval method and generated code both call utility functions on UTF8String (although in this case, UTF8String is a Java class).
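To illustrate the shape of that pattern (MapDataUtils and concatMaps are made-up names here, not existing Spark APIs, and plain Scala maps stand in for MapData):
object MapDataUtils {
  // Shared merge logic that both eval() and the generated Java code could call.
  def concatMaps[K, V](maps: Seq[Map[K, V]]): Map[K, V] =
    maps.foldLeft(Map.empty[K, V])(_ ++ _)
}
// eval() would call MapDataUtils.concatMaps(...) directly, and doGenCode() would emit
// Java source invoking the same method, similar to what FromUTCTimestamp does with DateTimeUtils.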
FWIW, I don't really feel strongly either way here. The codegen method isn't so large as to be hard to understand yet.
cc @ueshin
Test build #89590 has finished for PR 21073 at commit
henryr left a comment
This looks pretty good to me; it would be good to have one of the people most familiar with codegen take a look.
unused?
are the casts to Object necessary?
is there an extra space before if?
good idea to check Seq(mNull, m0) as well in case there's any asymmetry in the way the first argument is handled.
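Something along these lines (a sketch only; it assumes the suite mixes in ExpressionEvalHelper, and m0/mNull stand for the map literals already defined in the test):
// hypothetical literal definitions, shown only so the sketch is self-contained
val m0 = Literal.create(Map("a" -> "1"), MapType(StringType, StringType))
val mNull = Literal.create(null, MapType(StringType, StringType))

checkEvaluation(MapConcat(Seq(m0, mNull)), null)
checkEvaluation(MapConcat(Seq(mNull, m0)), null) // swapped order, to catch any asymmetry in how the first argument is handled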
Done!
can you put a blank line between tests? makes it a bit easier to see the separation.
python/pyspark/sql/functions.py
what's this for?
Excellent question. I don't know, except that it seems sometimes the first column is a list of columns. I used other functions as a template.
Test build #89759 has finished for PR 21073 at commit
retest this please
Test build #89785 has finished for PR 21073 at commit
@gatorsmile this looks ready for your review (asking because you filed the JIRA) if you have time, thanks!
What about override def nullable: Boolean = children.exists(_.nullable)?
Use cases with children.size < 2 don't make sense, but I think all functions with a variable number of children should behave the same way. Check the implementations of Concat and Concat_ws.
Do you need to inherit from CodegenFallback if you've overridden doGenCode?
I think you should add handling of nulls when values are of a primitive type.
since = "2.4.0"
Please add more test cases with null values.
Done.
Test build #89944 has finished for PR 21073 at commit
How about merging these two lines into one: import org.apache.spark.sql.catalyst.util._?
retest this please
retest this please
Test build #89993 has finished for PR 21073 at commit
A test failed with "./bin/spark-submit ... No such file or directory". It seems like there are a lot of spurious test failures right now, so I will hold off on re-running for a little while.
retest this please |
Update: The below appears to be by design (see SPARK-9415). That is, MapData objects explicitly should not support hashCode or equality. There is even a test for this. As a result, concatenating two Maps with keys that are also Maps can result in duplicate keys in the resulting map. Adding hashCode and equals fixed the issue, but violates the basis for SPARK-9415. Any opinion @rxin @viirya @gatorsmile? (pinging people on that Jira).
I found an issue. I was preparing to add some more tests when I noticed that using maps as keys doesn't work well in interpreted mode (seems to work fine in codegen mode, so far).
So, something like this doesn't work in interpreted mode (and in some cases gencode mode):
scala> dfmapmap.show(truncate=false)
+--------------------------------------------------+---------------------------------------------+
|mapmap1 |mapmap2 |
+--------------------------------------------------+---------------------------------------------+
|[[1 -> 2, 3 -> 4] -> 101, [5 -> 6, 7 -> 8] -> 102]|[[11 -> 12] -> 103, [1 -> 2, 3 -> 4] -> 1001]|
+--------------------------------------------------+---------------------------------------------+
scala> dfmapmap.select(map_concat('mapmap1, 'mapmap2).as('mapmap3)).show(truncate=false)
+-----------------------------------------------------------------------------------------------+
|mapmap3 |
+-----------------------------------------------------------------------------------------------+
|[[1 -> 2, 3 -> 4] -> 101, [5 -> 6, 7 -> 8] -> 102, [11 -> 12] -> 103, [1 -> 2, 3 -> 4] -> 1001]|
+-----------------------------------------------------------------------------------------------+
As you can see, the key [1 -> 2, 3 -> 4] shows up twice in the new map.
This is because:
val a1 = new ArrayBasedMapData(new GenericArrayData(Array(1, 3)), new GenericArrayData(Array(2, 4)))
val a2 = new ArrayBasedMapData(new GenericArrayData(Array(1, 3)), new GenericArrayData(Array(2, 4)))
a1 == a2 // will be false
a1.hashCode() == a2.hashCode() // will be false
Different instances of ArrayBasedMapData with the exact same data are not considered the same key. The same seems to be the case for UnsafeMapData as well (but usually works out in gencode mode only because of some magic under the hood that returns the same reference for identical keys).
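A plain-Scala analogue of the same pitfall (RefKey is a made-up stand-in for MapData; like MapData, it does not override equals/hashCode):
class RefKey(val data: Seq[Int]) // reference equality only, like MapData

val k1 = new RefKey(Seq(1, 3))
val k2 = new RefKey(Seq(1, 3))
val merged = Map(k1 -> 101) ++ Map(k2 -> 1001)
merged.size // 2 -- both keys survive even though they wrap identical data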
@bersprockets Hi, thanks for the investigation. We don't need to care about key duplication for now, just like CreateMap.
Test build #90020 has finished for PR 21073 at commit
nit: we can use ArrayBasedMapData.apply().
m.code?
m.value?
Use ctx.splitExpressionsWithCurrentInputs() or something to avoid exceeding the JVM method size limit.
Test build #90091 has finished for PR 21073 at commit
Just a question. What happens if union.entrySet().toArray() has more than 0x7FFF_FFFF elements?
I would imagine bad things would happen before you got this far (even Map's size method returns an Int).
Test build #92662 has finished for PR 21073 at commit
Test build #92660 has finished for PR 21073 at commit
Jenkins, retest this please.
LGTM pending Jenkins.
Test build #92672 has finished for PR 21073 at commit
retest this please.
Test build #92689 has finished for PR 21073 at commit
I'd retrigger the build again, just in case.
Jenkins, retest this please.
Test build #92728 has finished for PR 21073 at commit
Jenkins, retest this please.
Test build #92733 has finished for PR 21073 at commit
Jenkins, retest this please.
Test build #92740 has finished for PR 21073 at commit
Jenkins, retest this please.
Test build #92745 has finished for PR 21073 at commit
Thanks! Merging to master.
@ueshin Thanks for all your help!
What changes were proposed in this pull request?
Implement the map_concat high-order function.
This implementation does not pick a winner when the specified maps have overlapping keys; it preserves existing duplicate keys in the maps and can introduce new ones (after discussion with @ueshin, we settled on option 1 from here).
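For illustration, a rough example of the resulting behavior with an overlapping key (the column setup and exact output formatting here are illustrative, not copied from a real run):
scala> val df = Seq((Map(1 -> "a", 2 -> "b"), Map(2 -> "c", 3 -> "d"))).toDF("map1", "map2")
scala> df.select(map_concat('map1, 'map2).as('newMap)).show(truncate=false)
+--------------------------------+
|newMap                          |
+--------------------------------+
|[1 -> a, 2 -> b, 2 -> c, 3 -> d]|
+--------------------------------+
The key 2 appears twice because no winning value is picked.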
How was this patch tested?
New tests
Manual tests
Run all sbt SQL tests
Run all pyspark sql tests