
Conversation

@yangw1234

What changes were proposed in this pull request?

Prior to this PR, the following code would cause an NPE:
case class point(a:String, b:String, c:String, d: Int)

val data = Seq( point("1","2","3", 1), point("4","5","6", 1), point("7","8","9", 1) )
sc.parallelize(data).toDF().registerTempTable("table")
spark.sql("select a, b, c, count(d) from table group by a, b, c GROUPING SETS ((a)) ").show()

The reason is that when the grouping_id() behavior was changed in #10677, some code that should have been updated was left out.

Take the above code as an example: prior to #10677, the bitmask for the set "(a)" was `001`, while after #10677 it became `011`. However, the computation of `nonNullBitmask` was not updated accordingly.

This PR fixes the problem.
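To make the bitmask change concrete, here is a minimal, self-contained sketch (illustration only, not the actual Analyzer code; `attrNames` and the printed loop are stand-ins) of how the post-#10677 mask for the query above is read:

```scala
// Illustration only -- not the Analyzer code. After #10677 the leftmost bit corresponds
// to the first GROUP BY expression (a) and a 0 bit means "this expression is part of
// the grouping set", so GROUPING SETS ((a)) over a, b, c yields the single mask 011.
val bitmasks   = Seq(Integer.parseInt("011", 2))   // one grouping set: (a)
val attrNames  = Seq("a", "b", "c")
val attrLength = attrNames.length

// A 1 bit in the OR of all masks means the column is absent from at least one grouping
// set, so the expanded rows may contain null for it.
val nullBitmask = bitmasks.reduce(_ | _)           // 011

attrNames.zipWithIndex.foreach { case (name, idx) =>
  val nullable = ((nullBitmask >> (attrLength - idx - 1)) & 1) == 1
  println(s"$name nullable = $nullable")           // a -> false, b -> true, c -> true
}
```

Before the fix, the mask was still interpreted with the pre-#10677 convention, so a grouped column that actually receives null in the expanded rows could stay marked non-nullable, which is what triggered the NPE.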

How was this patch tested?

Added integration tests.

@yangw1234
Author

cc @davies Would you help review this?

@yangw1234
Author

also cc @hvanhovell

@yangw1234 yangw1234 changed the title [SPARK-17849] Fix NPE problem when using grouping sets [SPARK-17849] [SQL] Fix NPE problem when using grouping sets Oct 10, 2016

test("SPARK-17849: grouping set throws NPE") {
Contributor

maybe we can move this into SQLQueryTestSuite, by creating a new grouping_set.q file??

Author

@rxin done

@SparkQA

SparkQA commented Oct 10, 2016

Test build #3309 has finished for PR 15416 at commit 42f7a63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Before: `val nonNullBitmask = x.bitmasks.reduce(_ & _)`
After: `val nonNullBitmask = ~ x.bitmasks.reduce(_ | _)`
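For context on why the reduction operator changes here, a small sketch (the two-set example and its masks are hypothetical, assuming the post-#10677 convention that a 0 bit means the expression is in the grouping set):

```scala
// Hypothetical masks for GROUP BY a, b, c GROUPING SETS ((a), (a, b)):
// (a)    -> 011  (b and c absent)
// (a, b) -> 001  (only c absent)
val bitmasks = Seq(Integer.parseInt("011", 2), Integer.parseInt("001", 2))

// Old reduction: under the new convention an AND keeps a 1 only where *every* mask has
// a 1, which no longer identifies the always-present (non-null) columns.
val oldMask = bitmasks.reduce(_ & _)        // 001

// New reduction: ~OR keeps a 1 only where every mask has a 0, i.e. exactly the columns
// that appear in every grouping set and therefore never become null.
val newMask = ~bitmasks.reduce(_ | _)       // low three bits: 100 -> only `a` is non-null

println((oldMask & 7).toBinaryString)       // 1
println((newMask & 7).toBinaryString)       // 100
```

Reading the AND result as "non-null columns" was only valid under the old bit convention; once 0 means "present", the complement of the OR is what identifies the always-present columns.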
Contributor

@hvanhovell hvanhovell Oct 10, 2016

Bit manipulation magic is hard to follow. This should be documented better. Could you add a line or two to explain how the bitmasks are structured?

Contributor

+1

Author

Ok, I'll do it.

Author

@hvanhovell @rxin comments are added

// The left most bit in the bitmasks corresponds to the last expression in groupByAliases
// with 0 indicating this expression is in the grouping set. The following line of code
// calculates the bit mask representing the expressions that exist in all the grouping sets.
val nonNullBitmask = ~ x.bitmasks.reduce(_ | _)
Contributor

Could you remove the '~' here, and use (nonNullBitmask & (1 << (attrLength - idx - 1))) == 1?

Author

Do you mean ((nonNullBitmask >> (attrLength - idx - 1)) & 1) == 1? We can only compare against 0 if we left-shift the 1, right? @davies
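A tiny standalone check of the point above (example values only):

```scala
// (mask & (1 << k)) evaluates to either 0 or (1 << k), so for k > 0 it can never
// equal 1; that form can only be compared against 0. Shifting the mask right and
// masking with 1 works for any bit position.
val mask = Integer.parseInt("011", 2)   // example bitmask from this PR
val attrLength = 3
val idx = 1                             // the bit for column b
val k = attrLength - idx - 1            // k = 1

println((mask & (1 << k)) == 1)         // false, even though the bit is set
println((mask & (1 << k)) != 0)         // true: compare against 0 instead
println(((mask >> k) & 1) == 1)         // true: shift first, then mask with 1
```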

@davies
Contributor

davies commented Oct 11, 2016

@yangw1234 Thanks for working on this, could you also double check that all the places that use bitmasks are correct?

@yangw1234
Author

@davies Other places all seem to be correct.

// The rightmost bit in the bitmasks corresponds to the last expression in groupByAliases with 0
// indicating this expression is in the grouping set. The following line of code calculates the
// bitmask representing the expressions that exist in all the grouping sets (also indicated by 0).
val nonNullBitmask = x.bitmasks.reduce(_ | _)
Contributor

Should we call this nullBitmask now? (1 means it's nullable)

Author

done @davies
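To round off the renaming discussion, a rough sketch (stand-in values only, not the merged code) of how the renamed form reads, where a set bit directly means the attribute is nullable:

```scala
// Stand-ins for the analyzer's values: GROUP BY a, b, c GROUPING SETS ((a), (a, b)).
val bitmasks       = Seq(Integer.parseInt("011", 2), Integer.parseInt("001", 2))
val groupByAliases = Seq("a", "b", "c")
val attrLength     = groupByAliases.length

// After the rename the OR reduction is read directly: a 1 bit means the expression is
// missing from at least one grouping set, so the corresponding attribute must be nullable.
val nullBitmask = bitmasks.reduce(_ | _)   // 011

val nullability = groupByAliases.zipWithIndex.map { case (name, idx) =>
  name -> (((nullBitmask >> (attrLength - idx - 1)) & 1) == 1)
}
println(nullability)                       // List((a,false), (b,true), (c,true))
```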

@SparkQA

SparkQA commented Oct 13, 2016

Test build #3337 has finished for PR 15416 at commit 69f6e4f.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Oct 13, 2016

@yangw1234 can you fix the scala styles?

@yangw1234
Author

Scala style fixed. I didn't notice the failure earlier; sorry for the delay. @rxin

@hvanhovell
Contributor

retest this please

@davies
Contributor

davies commented Oct 14, 2016

LGTM, pending tests

@SparkQA

SparkQA commented Oct 14, 2016

Test build #66970 has finished for PR 15416 at commit 0ad7aba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yangw1234
Author

@rxin @davies Will this patch be merged into 2.0.2? We kind of need it to upgrade our production environment. Thanks.

@hvanhovell
Contributor

retest this please

@hvanhovell
Contributor

I'll merge after a successful test run.

@SparkQA

SparkQA commented Nov 5, 2016

Test build #68203 has finished for PR 15416 at commit 0ad7aba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Nov 5, 2016
Author: wangyang <[email protected]>

Closes #15416 from yangw1234/groupingid.

(cherry picked from commit fb0d608)
Signed-off-by: Herman van Hovell <[email protected]>
asfgit pushed a commit that referenced this pull request Nov 5, 2016
@hvanhovell
Contributor

LGTM - Merging to master/2.1/2.0. Thanks!

@asfgit asfgit closed this in fb0d608 Nov 5, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017