[SPARK-18207][SQL] Fix a compilation error due to HashExpression.doGenCode #15745
Conversation
Test build #68042 has finished for PR 15745 at commit
Test build #68043 has finished for PR 15745 at commit
I think you also need to add this to HiveHash.
Good catch. Done.
Is there any chance that you can make this UT a bit more concise? It seems way too contrived.
I made it more concise.
Where is this hashing?
This union involves hashing. I confirmed that the original error occurred without this PR.
@kiszk The link above is broken.
If the hashing is not explicit to the reader, can you add a comment for that?
@viirya thank you for pointing that out. I fixed the link, and will add a comment there later.
I think this test is more like an end-to-end test. The hashing is not obvious, as we have seen.
Can we add a unit test, in HashExpressionsSuite I think? That way you can test HiveHash too.
I do agree with @viirya that the current test is way too indirect and can actually be broken silently if the planner chooses a sort-based aggregate over a hash-based aggregate. It would be very nice to have a direct test on the hash function.
I see. I created direct tests in HashExpressionsSuite.scala.
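For reference, a minimal sketch of what such a direct test can look like, reusing the names (`wideRow`, `murmur3HashExpr`, `murmur3HashPlan`) that appear in the diff excerpts below; this is an illustration, not the PR's exact code:

````scala
import org.apache.spark.sql.catalyst.expressions.{BoundReference, GenericInternalRow, Murmur3Hash}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// A row with 1000 string fields: wide enough that, without the fix, the
// single generated hash method exceeds the JVM's 64KB bytecode limit.
val N = 1000
val wideRow = new GenericInternalRow(
  Seq.tabulate(N)(i => UTF8String.fromString(i.toString)).toArray[Any])
val exprs = (0 until N).map(i => BoundReference(i, StringType, nullable = true))

val murmur3HashExpr = Murmur3Hash(exprs, 42)
// Compiling this projection is exactly what fails without the fix, with a
// "grows beyond 64 KB" compilation error from the codegen compiler.
val murmur3HashPlan = GenerateMutableProjection.generate(Seq(murmur3HashExpr))
murmur3HashPlan(wideRow) // must evaluate without a codegen compilation error
````

An analogous test with `HiveHash` would cover the other hash expression touched by this PR.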
Test build #68194 has finished for PR 15745 at commit
Test build #68207 has finished for PR 15745 at commit
Test build #68208 has finished for PR 15745 at commit
Test build #68223 has finished for PR 15745 at commit
Jenkins, retest this please
Test build #68226 has finished for PR 15745 at commit
Jenkins, retest this please
Test build #68234 has finished for PR 15745 at commit
Test build #68264 has finished for PR 15745 at commit
val widePlus2 = widePlus.withColumn("d_rank", lit(0))
widePlus2.createOrReplaceTempView("wide_plus2")

// union operation in this SQL involves computation of hash for a row
nit: Actually the hash computation happens at HashAggregate.
Thanks. Updated the comment.
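For readers following along, a hedged way to see where that happens (table name taken from the diff above; exact plan output varies by Spark version):

````scala
// UNION without ALL deduplicates its output, which Spark plans as an
// aggregate keyed on every column. The physical plan should therefore
// contain a HashAggregate, and that is where the row-hash code that this
// PR splits up gets generated.
spark.sql("SELECT * FROM wide_plus2 UNION SELECT * FROM wide_plus2").explain()
````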
Test build #68277 has finished for PR 15745 at commit
Test build #68293 has finished for PR 15745 at commit
@kiszk 2 little things left, but this looks good.
test("SPARK-18207: Compute hash for a lot of expressions") {
  val N = 1000
  val wideRow = new GenericInternalRow(
Where are you using the wideRow?
Test build #68294 has finished for PR 15745 at commit
LGTM - pending Jenkins.
Test build #68298 has finished for PR 15745 at commit
val df = spark.createDataFrame(spark.sparkContext.makeRDD(rows), schema)
assert(df.filter($"array1" === $"array2").count() == 1)
  }
}
I think you need to revert this change.
Sorry for my typo. It is time for me to sleep.
No problem. Happens to us all :)
Test build #68299 has finished for PR 15745 at commit
assert(df.filter($"array1" === $"array2").count() == 1)
  }
}
nit: there is an extra newline here.
}
val murmur3HashExpr = Murmur3Hash(exprs, 42)
val murmur3HashPlan = GenerateMutableProjection.generate(Seq(murmur3HashExpr))
assert(murmur3HashPlan(wideRow).getInt(0) == 58499324)
I would prefer to compare with the non-codegen result, i.e., murmur3HashExpr.eval(wideRow), instead of a hard-coded result.
Good idea. Done.
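Sketched with the names from the diff above, the revised assertion compares the codegen path against interpreted evaluation instead of a brittle constant:

````scala
// Interpreted (non-codegen) evaluation serves as the reference value, so
// the test no longer depends on a hard-coded hash result.
val murmur3HashEval = Murmur3Hash(exprs, 42).eval(wideRow)
assert(murmur3HashPlan(wideRow).getInt(0) == murmur3HashEval)
````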
@kiszk Seems the hard-coded hashing result is not correct and causes the test to fail.
LGTM
Test build #68310 has finished for PR 15745 at commit
Merging to master/2.1. Thanks!
[SPARK-18207][SQL] Fix a compilation error due to HashExpression.doGenCode

## What changes were proposed in this pull request?

This PR avoids a compilation error caused by generated Java bytecode exceeding the JVM's 64KB method limit. The error occurs because the Java code generated to compute a hash value for a row is too big. This PR fixes the compilation error by splitting the big code chunk into multiple methods, calling `CodegenContext.splitExpressions` in `HashExpression.doGenCode`.

The test case requires computing a hash code for a row that includes 1000 String fields. `HashExpression.doGenCode` generates all of the Java code for this computation in one function, so the size of the corresponding Java bytecode exceeds 64KB.
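As a minimal, self-contained sketch of the splitting technique (it mimics the idea rather than Spark's actual `CodegenContext` internals; the helper naming and group size here are illustrative), consider:

````scala
// Sketch only: fold per-field hash snippets into private helper methods so
// that no single generated method approaches the JVM's 64KB bytecode limit,
// then have apply() invoke the helpers in order. Spark's real implementation
// does this via CodegenContext.splitExpressions. Here, value1 is assumed to
// be a class-level field holding the running hash.
def splitIntoMethods(fieldSnippets: Seq[String], perMethod: Int = 4): String = {
  val helpers = fieldSnippets.grouped(perMethod).zipWithIndex.map {
    case (group, idx) =>
      s"private void apply_$idx(InternalRow i) {\n${group.mkString("\n")}\n}"
  }.toSeq
  val calls = helpers.indices.map(idx => s"  apply_$idx(i);").mkString("\n")
  helpers.mkString("\n") +
    s"\npublic int apply(InternalRow i) {\n  value1 = 42;\n$calls\n  return value1;\n}"
}
````

The generated code shown below ("with this PR") is what this shape looks like in practice.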
Generated code without this PR
````java
/* 027 */ public UnsafeRow apply(InternalRow i) {
/* 028 */ boolean isNull = false;
/* 029 */
/* 030 */ int value1 = 42;
/* 031 */
/* 032 */ boolean isNull2 = i.isNullAt(0);
/* 033 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0));
/* 034 */ if (!isNull2) {
/* 035 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value2.getBaseObject(), value2.getBaseOffset(), value2.numBytes(), value1);
/* 036 */ }
/* 037 */
/* 038 */
/* 039 */ boolean isNull3 = i.isNullAt(1);
/* 040 */ UTF8String value3 = isNull3 ? null : (i.getUTF8String(1));
/* 041 */ if (!isNull3) {
/* 042 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value3.getBaseObject(), value3.getBaseOffset(), value3.numBytes(), value1);
/* 043 */ }
/* 044 */
/* 045 */
...
/* 7024 */
/* 7025 */ boolean isNull1001 = i.isNullAt(999);
/* 7026 */ UTF8String value1001 = isNull1001 ? null : (i.getUTF8String(999));
/* 7027 */ if (!isNull1001) {
/* 7028 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value1001.getBaseObject(), value1001.getBaseOffset(), value1001.numBytes(), value1);
/* 7029 */ }
/* 7030 */
/* 7031 */
/* 7032 */ boolean isNull1002 = i.isNullAt(1000);
/* 7033 */ UTF8String value1002 = isNull1002 ? null : (i.getUTF8String(1000));
/* 7034 */ if (!isNull1002) {
/* 7035 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value1002.getBaseObject(), value1002.getBaseOffset(), value1002.numBytes(), value1);
/* 7036 */ }
````
Generated code with this PR
````java
/* 3807 */ private void apply_249(InternalRow i) {
/* 3808 */
/* 3809 */ boolean isNull998 = i.isNullAt(996);
/* 3810 */ UTF8String value998 = isNull998 ? null : (i.getUTF8String(996));
/* 3811 */ if (!isNull998) {
/* 3812 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value998.getBaseObject(), value998.getBaseOffset(), value998.numBytes(), value1);
/* 3813 */ }
/* 3814 */
/* 3815 */ boolean isNull999 = i.isNullAt(997);
/* 3816 */ UTF8String value999 = isNull999 ? null : (i.getUTF8String(997));
/* 3817 */ if (!isNull999) {
/* 3818 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value999.getBaseObject(), value999.getBaseOffset(), value999.numBytes(), value1);
/* 3819 */ }
/* 3820 */
/* 3821 */ boolean isNull1000 = i.isNullAt(998);
/* 3822 */ UTF8String value1000 = isNull1000 ? null : (i.getUTF8String(998));
/* 3823 */ if (!isNull1000) {
/* 3824 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value1000.getBaseObject(), value1000.getBaseOffset(), value1000.numBytes(), value1);
/* 3825 */ }
/* 3826 */
/* 3827 */ boolean isNull1001 = i.isNullAt(999);
/* 3828 */ UTF8String value1001 = isNull1001 ? null : (i.getUTF8String(999));
/* 3829 */ if (!isNull1001) {
/* 3830 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value1001.getBaseObject(), value1001.getBaseOffset(), value1001.numBytes(), value1);
/* 3831 */ }
/* 3832 */
/* 3833 */ }
/* 3834 */
...
/* 4532 */ private void apply_0(InternalRow i) {
/* 4533 */
/* 4534 */ boolean isNull2 = i.isNullAt(0);
/* 4535 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0));
/* 4536 */ if (!isNull2) {
/* 4537 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value2.getBaseObject(), value2.getBaseOffset(), value2.numBytes(), value1);
/* 4538 */ }
/* 4539 */
/* 4540 */ boolean isNull3 = i.isNullAt(1);
/* 4541 */ UTF8String value3 = isNull3 ? null : (i.getUTF8String(1));
/* 4542 */ if (!isNull3) {
/* 4543 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value3.getBaseObject(), value3.getBaseOffset(), value3.numBytes(), value1);
/* 4544 */ }
/* 4545 */
/* 4546 */ boolean isNull4 = i.isNullAt(2);
/* 4547 */ UTF8String value4 = isNull4 ? null : (i.getUTF8String(2));
/* 4548 */ if (!isNull4) {
/* 4549 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value4.getBaseObject(), value4.getBaseOffset(), value4.numBytes(), value1);
/* 4550 */ }
/* 4551 */
/* 4552 */ boolean isNull5 = i.isNullAt(3);
/* 4553 */ UTF8String value5 = isNull5 ? null : (i.getUTF8String(3));
/* 4554 */ if (!isNull5) {
/* 4555 */ value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value5.getBaseObject(), value5.getBaseOffset(), value5.numBytes(), value1);
/* 4556 */ }
/* 4557 */
/* 4558 */ }
...
/* 7344 */ public UnsafeRow apply(InternalRow i) {
/* 7345 */ boolean isNull = false;
/* 7346 */
/* 7347 */ value1 = 42;
/* 7348 */ apply_0(i);
/* 7349 */ apply_1(i);
...
/* 7596 */ apply_248(i);
/* 7597 */ apply_249(i);
/* 7598 */ apply_250(i);
/* 7599 */ apply_251(i);
...
````
## How was this patch tested?

Add a new test in `DataFrameSuite`.
Author: Kazuaki Ishizaki <[email protected]>
Closes #15745 from kiszk/SPARK-18207.
(cherry picked from commit 47731e1)
Signed-off-by: Herman van Hovell <[email protected]>