Changes BitSubvector to use System.arraycopy #10
Conversation
… active node set indices.
This is currently failing unit tests for

Once this is working, maybe we should submit the Spark core changes as a PR to Spark.

@jkbradley This is ready for review/benchmarking. We ended up not being able to use
I tried modifying this test to use offset 63, and it failed. Maybe create a test which loops through important offsets: 0, 1, 63, 64, 65.
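A test along those lines might look roughly like the following sketch; the `BitSubvector(from, to)` constructor and the `set`/`get` accessors are assumptions for illustration, not taken from the actual suite:

```scala
// Hypothetical sketch: 0 and 1 exercise the start of a word, while
// 63, 64, and 65 straddle the 64-bit word boundary, where offset
// arithmetic in an arraycopy-based merge is easiest to get wrong.
for (offset <- Seq(0, 1, 63, 64, 65)) {
  val sub = new BitSubvector(offset, offset + 128) // assumed constructor: [from, to)
  sub.set(offset)                                  // assumed setter
  assert(sub.get(offset), s"offset $offset: bit not set")
  assert(!sub.get(offset + 1), s"offset $offset: neighbor bit set")
}
```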
Weird... I am getting tests to pass for all the offsets. I will push an update which loops through the offsets you gave; if you can still repro, can you post the failing code? Thanks!
Will be splitting up shared state here as well
@jkbradley Can you take another look when you have a chance? I couldn't repro the failure you reported (likely it was caused by my confusion over
Could check that elements `0 until offset` are set to 0.
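Concretely, that check could look something like this in the merge test (assuming the merged result exposes a `get` accessor; the names are illustrative):

```scala
// Hypothetical sketch: no bit below the starting offset should be
// set after a merge.
for (i <- 0 until offset) {
  assert(!merged.get(i), s"bit $i below offset $offset was unexpectedly set")
}
```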
Oops, didn't mean to comment yet. Hold off on updates...

Oh well, never mind, that's the only item I saw. Btw, did you try running this with AltDT, and was it any faster?

@feynmanliang Let me know if you'll be able to run quick timing tests for this on your laptop. It'd be nice to confirm the speedup. Thanks!
@jkbradley Did some local benchmarks and saw improvements. Add to the end of the "BitSubvector merge" test:

```scala
val start = System.nanoTime()
val N = 100000
for (i <- (1 to N)) {
  BitSubvector.merge(parts1, parts2)
}
println(s"$N runs took ${System.nanoTime() - start} ns")
```
Will update the

@jkbradley ready for review
Those are smaller improvements than I would have expected. Do you think it's JIT magic hiding something? I hate to say this, but I'm wondering if this is worth it, especially because it requires a change to Spark core. If it's OK, I might ask that we hold off on merging this until we can do larger scale tests comparing using an Array of BitSubvectors vs. using a single BitSet (from @fabuzaid21). If we stick with the sparse representation using BitSubvectors, then we could return to this PR. I'll leave it open for now.
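One way to reduce JIT noise in a micro-benchmark like the one above is a warm-up pass before timing, reporting per-iteration cost; a minimal sketch, reusing the `parts1`/`parts2` fixtures from the benchmark snippet earlier in this thread:

```scala
// Warm up the JIT before timing, then report time per merge.
val warmup = 10000
for (_ <- 1 to warmup) BitSubvector.merge(parts1, parts2)

val n = 100000
val start = System.nanoTime()
for (_ <- 1 to n) BitSubvector.merge(parts1, parts2)
val elapsed = System.nanoTime() - start
println(s"$n merges took $elapsed ns (${elapsed.toDouble / n} ns/merge)")
```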
f41608f to 5db9171
…nput of UDF as double in the failed test in udf-aggregate_part1.sql

## What changes were proposed in this pull request?

It can still be flaky on certain environments due to the float limitation described at apache#25110. See apache#25110 (comment)

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6584/testReport/org.apache.spark.sql/SQLQueryTestSuite/udf_pgSQL_udf_aggregates_part1_sql___Regular_Python_UDF/

```
Expected "700000000000[6] 1", but got "700000000000[5] 1"
Result did not match for query #33
SELECT CAST(avg(udf(CAST(x AS DOUBLE))) AS long), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3)) FROM (VALUES (7000000000005), (7000000000007)) v(x)
```

Here's what's going on: apache#25110 (comment)

```scala
scala> Seq("7000000000004.999", "7000000000006.999").toDF().selectExpr("CAST(avg(value) AS long)").show()
+--------------------------+
|CAST(avg(value) AS BIGINT)|
+--------------------------+
|             7000000000005|
+--------------------------+
```

Therefore, this PR just avoids the cast in the specific test. This is a temp fix; we need a more robust way to avoid such cases.

## How was this patch tested?

It passes with Maven in my local before/after this PR. I believe the problem similarly depends on the Python or OS installed on the machine. I should test this against the PR builder with `test-maven` to be sure.

Closes apache#25128 from HyukjinKwon/SPARK-28270-2.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
… Arrow on JDK9+

### What changes were proposed in this pull request?

This PR aims to add `io.netty.tryReflectionSetAccessible=true` to the testing configuration for JDK11 because this is an officially documented requirement of Apache Arrow.

The Apache Arrow community documented this requirement at `0.15.0` ([ARROW-6206](apache/arrow#5078)).

> #### For java 9 or later, should set "-Dio.netty.tryReflectionSetAccessible=true".
> This fixes `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available` thrown by netty.

### Why are the changes needed?

After ARROW-3191, the Arrow Java library requires the property `io.netty.tryReflectionSetAccessible` to be set to true for JDK >= 9. After apache#26133, the JDK11 Jenkins jobs seem to fail.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/676/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/677/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/678/

```scala
Previous exception in task: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
    io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
    io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
    io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
    io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
    org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with JDK11.

Closes apache#26552 from dongjoon-hyun/SPARK-ARROW-JDK11.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
- Adds `private[spark] orWithOffset` to `BitSet`, which improves from bit-level OR to word-level OR. IMO this should remain private and be merged with the entire feature, since it's a low-level specialized API I don't see others reusing.
- Adds `|=` to `BitSubvector`, which does an in-place OR operation, accounting for `from` and `to`.
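For a rough picture of what word-level OR with a bit offset looks like, here is a sketch only, not the PR's actual code: the `orWithOffset` signature and the `Array[Long]` backing layout are assumptions, and the word-aligned fast path (where whole words could be copied with `System.arraycopy`) is omitted.

```scala
// Hypothetical sketch of a word-level OR with a bit offset.
// `dest` and `src` are the Long-word arrays backing two bitsets.
def orWithOffset(dest: Array[Long], src: Array[Long], bitOffset: Int): Unit = {
  val wordOffset = bitOffset >> 6 // bitOffset / 64
  val shift = bitOffset & 0x3f    // bitOffset % 64
  var i = 0
  while (i < src.length) {
    // Low bits of src word i land in dest word (wordOffset + i)...
    dest(wordOffset + i) |= src(i) << shift
    // ...and, for unaligned offsets, the high bits spill into the next word.
    if (shift != 0 && wordOffset + i + 1 < dest.length) {
      dest(wordOffset + i + 1) |= src(i) >>> (64 - shift)
    }
    i += 1
  }
}
```

Processing one `Long` per iteration instead of one bit is where the word-level speedup comes from: 64 bits are OR'ed per operation rather than one.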