Conversation

@bersprockets bersprockets commented Apr 15, 2018

What changes were proposed in this pull request?

Implement the map_concat higher-order function.

This implementation does not pick a winner when the specified maps have overlapping keys: it preserves existing duplicate keys in the maps and can introduce new duplicates (after discussion with @ueshin, we settled on option 1 from here).
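For illustration, the agreed semantics can be sketched in Python (a hypothetical model using lists of key/value pairs, since Spark's MapData permits duplicate keys, unlike a Python dict):

```python
# Hypothetical Python model of the agreed "option 1" semantics: the result
# simply concatenates the key/value pairs of all input maps, preserving any
# existing duplicate keys and possibly introducing new ones. Spark's actual
# implementation operates on MapData, not Python lists.
def map_concat_pairs(*maps):
    result = []
    for m in maps:
        result.extend(m)  # no winner is picked for overlapping keys
    return result

pairs = map_concat_pairs([(1, "a"), (2, "b")], [(2, "c"), (3, "d")])
# key 2 now appears twice in the result
```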

How was this patch tested?

  • New tests
  • Manual tests
  • Run all sbt SQL tests
  • Run all pyspark sql tests


SparkQA commented Apr 15, 2018

Test build #89369 has finished for PR 21073 at commit d04893b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 15, 2018

Test build #89378 has finished for PR 21073 at commit 97cffbe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

What's the result of map_concat(NULL, NULL)?

@bersprockets bersprockets (Contributor Author) Apr 17, 2018

@henryr An empty map:

scala> df.select(map_concat('map1, 'map2).as('newMap)).show
+------+
|newMap|
+------+
|    []|
|    []|
+------+

Presto docs (from which the proposed spec comes) are quiet on the matter. Even after looking at the Presto code, I am still hard-pressed to say.

I did divine from the Presto code that there should be at least two inputs (and I don't currently verify that).

Contributor

Hm, seems a bit unusual to me to have, in effect, NULL ++ NULL => Map(). I checked with Presto and it looks like it returns NULL:

presto> select map_concat(NULL, NULL)
     -> ;
 _col0
-------
 NULL
(1 row)

Contributor Author

@henryr Since Presto is the reference, map_concat should return NULL in this case. I will update.

Contributor Author

@henryr Another quick test of Presto also shows that if any input is NULL, the result is NULL:

presto:default> SELECT map_concat(NULL, map(ARRAY[1,3], ARRAY[2,4]));
 _col0 
-------
 NULL  
(1 row)

Looks like I need to check if any input is NULL.
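The Presto-compatible null behavior discussed above can be sketched in Python (a hypothetical model where maps are pair lists and NULL is None):

```python
# Hypothetical sketch of the null semantics settled on above: if any input
# map is NULL, the result is NULL; otherwise key/value pairs are concatenated.
def map_concat(*maps):
    if any(m is None for m in maps):
        return None  # NULL propagates from any argument
    result = []
    for m in maps:
        result.extend(m)
    return result
```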

@bersprockets bersprockets changed the title from [SPARK-23936][SQL][WIP] Implement map_concat to [SPARK-23936][SQL] Implement map_concat on Apr 17, 2018

SparkQA commented Apr 18, 2018

Test build #89473 has finished for PR 21073 at commit 44137cc.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 18, 2018

Test build #89523 has finished for PR 21073 at commit d3d6ad6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Apr 19, 2018

Test build #89579 has finished for PR 21073 at commit 62df629.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

Since this logic is big enough (and similar enough to the logic in eval), I wonder if the merge logic should be moved to a utility class and called from both eval as well as the generated code.

The FromUTCTimestamp expression does something sort of like that, where the eval method as well as the generated code both call utility functions in the DateTimeUtils scala object. Also, the Concat expression's eval method and generated code both call utility functions on UTF8String (although in this case, UTF8String is a Java class).

Contributor

FWIW, I don't really feel strongly either way here. The codegen method isn't so large as to be hard to understand yet.

@gatorsmile (Member)

cc @ueshin


SparkQA commented Apr 20, 2018

Test build #89590 has finished for PR 21073 at commit a904c17.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@henryr henryr (Contributor) left a comment

This looks pretty good to me; it would be good to have one of the people most familiar with codegen take a look.

Contributor

unused?

Contributor

are the casts to Object necessary?

Contributor

is there an extra space before if?

Contributor

good idea to check Seq(mNull, m0) as well in case there's any asymmetry in the way the first argument is handled.

Contributor Author

Done!

Contributor

can you put a blank line between tests? makes it a bit easier to see the separation.


Contributor

what's this for?

Contributor Author

what's this for?

Excellent question. I don't know, except that it seems sometimes the first column is a list of columns. I used other functions as a template.


SparkQA commented Apr 24, 2018

Test build #89759 has finished for PR 21073 at commit 13baf96.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bersprockets (Contributor Author)

retest this please


SparkQA commented Apr 24, 2018

Test build #89785 has finished for PR 21073 at commit 13baf96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


henryr commented Apr 24, 2018

@gatorsmile this looks ready for your review (asking because you filed the JIRA). If you have time, thanks!

Contributor

What about override def nullable: Boolean = children.exists(_.nullable)?

Contributor

Use cases with children.size < 2 don't make sense, but I think all functions with a variable number of children should behave the same way. Check the implementations of Concat and Concat_ws.

Contributor

Do you need to inherit from CodegenFallback if you've overridden doGenCode?

Contributor

I think you should add handling of nulls when values are of a primitive type.

Contributor

since = "2.4.0"

Contributor

Please add more test cases with null values.

Contributor Author

Done.


SparkQA commented Apr 28, 2018

Test build #89944 has finished for PR 21073 at commit 2e49b1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk kiszk (Member) Apr 28, 2018

How about merging these two lines into one line import org.apache.spark.sql.catalyst.util._?

@bersprockets (Contributor Author)

@mn-mikke @kiszk Thanks for the review. I addressed the comments. Please take a look when you have a chance.

@bersprockets (Contributor Author)

retest this please



SparkQA commented May 1, 2018

Test build #89993 has finished for PR 21073 at commit d9dccd3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bersprockets (Contributor Author)

A test failed with "./bin/spark-submit ... No such file or directory".

Seems like there are lots of spurious test failures right now. I will hold off on re-running for a little while.

@bersprockets (Contributor Author)

retest this please

@bersprockets bersprockets (Contributor Author) May 2, 2018

Update: The below appears to be by design (see SPARK-9415). That is, MapData objects explicitly should not support hashCode or equality. There is even a test for this. As a result, concatenating two Maps with keys that are also Maps can result in duplicate keys in the resulting map. Adding hashCode and equals fixed the issue, but violates the basis for SPARK-9415. Any opinion @rxin @viirya @gatorsmile? (pinging people on that Jira).

I found an issue. I was preparing to add some more tests when I noticed that using maps as keys doesn't work well in interpreted mode (seems to work fine in codegen mode, so far).

So, something like this doesn't work in interpreted mode (and in some cases codegen mode):

scala> dfmapmap.show(truncate=false)
+--------------------------------------------------+---------------------------------------------+
|mapmap1                                           |mapmap2                                      |
+--------------------------------------------------+---------------------------------------------+
|[[1 -> 2, 3 -> 4] -> 101, [5 -> 6, 7 -> 8] -> 102]|[[11 -> 12] -> 103, [1 -> 2, 3 -> 4] -> 1001]|
+--------------------------------------------------+---------------------------------------------+
scala> dfmapmap.select(map_concat('mapmap1, 'mapmap2).as('mapmap3)).show(truncate=false)
+-----------------------------------------------------------------------------------------------+
|mapmap3                                                                                        |
+-----------------------------------------------------------------------------------------------+
|[[1 -> 2, 3 -> 4] -> 101, [5 -> 6, 7 -> 8] -> 102, [11 -> 12] -> 103, [1 -> 2, 3 -> 4] -> 1001]|
+-----------------------------------------------------------------------------------------------+

As you can see, the key [1 -> 2, 3 -> 4] shows up twice in the new map.

This is because:

  val a1 = new ArrayBasedMapData(new GenericArrayData(Array(1, 3)), new GenericArrayData(Array(2, 4)))
  val a2 = new ArrayBasedMapData(new GenericArrayData(Array(1, 3)), new GenericArrayData(Array(2, 4)))
  a1 == a2 // will be false
  a1.hashCode() == a2.hashCode() // will be false

Different instances of ArrayBasedMapData with the exact same data are not considered the same key. The same seems to be the case for UnsafeMapData as well (but usually works out in gencode mode only because of some magic under the hood that returns the same reference for identical keys).
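The effect can be modeled in Python with a hypothetical key class that, like MapData, defines neither equality nor a stable hash, so two structurally identical keys are treated as distinct:

```python
# Hypothetical model of MapData's identity-based semantics (per SPARK-9415):
# without __eq__/__hash__, Python falls back to object identity, so two
# structurally equal keys end up as separate entries in a map.
class MapKey:
    def __init__(self, data):
        self.data = data  # no __eq__/__hash__: identity semantics

k1 = MapKey({1: 2, 3: 4})
k2 = MapKey({1: 2, 3: 4})
merged = {k1: 101, k2: 1001}  # both entries survive: duplicate logical keys
```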

Member

@bersprockets Hi, thanks for the investigation. We don't need to care about key duplication for now, just as with CreateMap.


SparkQA commented May 2, 2018

Test build #90020 has finished for PR 21073 at commit d9dccd3.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

nit: we can use ArrayBasedMapData.apply().

Member

m.code?

Member

m.value?

Member

Use ctx.splitExpressionsWithCurrentInputs() or something to avoid exceeding JVM limit.


SparkQA commented May 3, 2018

Test build #90091 has finished for PR 21073 at commit 77ae014.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk kiszk (Member) May 3, 2018

Just a question. What happens if union.entrySet().toArray() has more than 0x7FFF_FFFF elements?

Contributor Author

I would imagine bad things would happen before you got this far (even Map's size method returns an Int).


SparkQA commented Jul 6, 2018

Test build #92662 has finished for PR 21073 at commit 03328a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 6, 2018

Test build #92660 has finished for PR 21073 at commit 3c0da03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


ueshin commented Jul 6, 2018

Jenkins, retest this please.


ueshin commented Jul 6, 2018

LGTM pending Jenkins.


SparkQA commented Jul 6, 2018

Test build #92672 has finished for PR 21073 at commit 03328a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bersprockets (Contributor Author)

retest this please.


SparkQA commented Jul 6, 2018

Test build #92689 has finished for PR 21073 at commit 03328a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


ueshin commented Jul 9, 2018

I'd retrigger the build again, just in case.


ueshin commented Jul 9, 2018

Jenkins, retest this please.


SparkQA commented Jul 9, 2018

Test build #92728 has finished for PR 21073 at commit 03328a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


ueshin commented Jul 9, 2018

Jenkins, retest this please.


SparkQA commented Jul 9, 2018

Test build #92733 has finished for PR 21073 at commit 03328a4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


ueshin commented Jul 9, 2018

Jenkins, retest this please.


SparkQA commented Jul 9, 2018

Test build #92740 has finished for PR 21073 at commit 03328a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


ueshin commented Jul 9, 2018

Jenkins, retest this please.


SparkQA commented Jul 9, 2018

Test build #92745 has finished for PR 21073 at commit 03328a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


ueshin commented Jul 9, 2018

Thanks! merging to master.

@asfgit asfgit closed this in 034913b Jul 9, 2018
@bersprockets (Contributor Author)

@ueshin Thanks for all your help!

