Conversation

@feynmanliang feynmanliang commented Sep 29, 2015

  • Adds a private[spark] orWithOffset to BitSet, which upgrades the OR from bit-level to word-level (see the sketch after this list). IMO this should remain private and be merged with the entire feature, since it's a low-level, specialized API that I don't see others reusing
  • Adds |= to BitSubvector, which performs an in-place OR while accounting for from and to
  • Misc docs/tests
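
A minimal sketch of what a word-level OR with a bit offset can look like, operating directly on an Array[Long] of 64-bit words. The names and signature here are illustrative, not the PR's actual orWithOffset or Spark's BitSet internals:

```scala
// Illustrative word-level OR: ORs `numBits` bits of `src` into `dst`,
// starting at bit position `offsetBits` of `dst`.
def orWithOffset(dst: Array[Long], src: Array[Long], offsetBits: Int, numBits: Int): Unit = {
  val wordShift = offsetBits >> 6     // whole 64-bit words to skip in dst
  val bitShift  = offsetBits & 0x3f   // remaining bit offset within a word
  val numWords  = (numBits + 63) >> 6
  var i = 0
  while (i < numWords) {
    val w = src(i)
    // Each source word lands in at most two destination words.
    dst(wordShift + i) |= (w << bitShift)
    if (bitShift != 0 && wordShift + i + 1 < dst.length) {
      dst(wordShift + i + 1) |= (w >>> (64 - bitShift))
    }
    i += 1
  }
}
```

One OR (or two, when the offset straddles a word boundary) per 64 bits replaces 64 single-bit get/set round trips, which is where the speedup comes from.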

@feynmanliang
Author

This is currently failing unit tests for BitSubvector.merge because the indices into bits do not correspond to the from/to indices, so the arraycopy is incorrect. It's not immediately obvious to me how to maintain the flexible from/to ranges in BitSubvector while also implementing an efficient System.arraycopy, but it might be possible if we play with the srcPos, dstPos, and length arguments. I will play with this some more during the week.
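
For reference, the three knobs mentioned above are the middle arguments of java.lang.System.arraycopy; a tiny standalone illustration:

```scala
// System.arraycopy(src, srcPos, dst, dstPos, length) copies `length` elements
// from src starting at srcPos into dst starting at dstPos.
val src = Array(1L, 2L, 3L, 4L)
val dst = new Array[Long](4)
System.arraycopy(src, 1, dst, 2, 2)
println(dst.mkString(","))  // 0,0,2,3
```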

@jkbradley
Owner

Once this is working, maybe we should submit the Spark core changes as a PR to Spark.

@feynmanliang feynmanliang changed the title from "[WIP] Changes BitSubvector to use System.arraycopy" to "Changes BitSubvector to use System.arraycopy" Oct 3, 2015
@feynmanliang
Author

@jkbradley This is ready for review/benchmarking. We ended up not being able to use System.arraycopy because BitSubvector.merge requires a bitwise OR rather than a copy (a copy would overwrite previously set destination bits that are not set in the source).
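
A tiny illustration of why a plain copy is wrong here (standalone example, not project code): arraycopy overwrites destination words, dropping bits that were already set, while merge must preserve them.

```scala
val dst = Array(0x1L)               // destination already has bit 0 set
val src = Array(0x2L)               // source has bit 1 set
System.arraycopy(src, 0, dst, 0, 1)
println(dst(0))                     // 2: bit 0 was lost by the copy
// What merge needs instead is an OR, which keeps both bits:
// dst(0) |= src(0)  ->  3
```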

Owner

I tried modifying this test to use offset 63, and it failed. Maybe create a test which loops through important offsets: 0, 1, 63, 64, 65.
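
A minimal sketch of such a loop over the boundary offsets, reusing the illustrative orWithOffset from the sketch above (hypothetical, not the suite's actual test):

```scala
for (offset <- Seq(0, 1, 63, 64, 65)) {
  val dst = new Array[Long](3)   // 192 bits: room for offset + 64 source bits
  val src = Array(0x5L)          // source bits 0 and 2 set
  orWithOffset(dst, src, offset, 64)
  // Source bit 0 must land at destination bit `offset`.
  assert(((dst(offset >> 6) >>> (offset & 0x3f)) & 1L) == 1L, s"offset $offset failed")
}
```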

Author

Weird... I am getting the tests to pass for all the offsets. I will push an update that loops through the offsets you gave; if you can still repro, can you post the failing code? Thanks!

Author

Will be splitting up the shared state here as well.

@feynmanliang
Author

@jkbradley Can you take another look when you have a chance? I couldn't repro the failure you reported (it was likely caused by my confusion over Long.MaxValue vs. -1L). The other changes were made according to your feedback.
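
For the record, that mix-up is easy to make: -1L has all 64 bits set, while Long.MaxValue leaves the sign bit clear, so using it as an "all ones" word silently drops bit 63. A quick standalone check:

```scala
println(java.lang.Long.toBinaryString(-1L).length)           // 64 (all ones)
println(java.lang.Long.toBinaryString(Long.MaxValue).length) // 63 (sign bit clear)
```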

Owner

Could check that elements 0 until offset are set to 0.
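
A hedged sketch of that assertion, assuming Spark's BitSet with its get(index: Int): Boolean accessor (the helper name is illustrative):

```scala
import org.apache.spark.util.collection.BitSet

def assertClearBelowOffset(bs: BitSet, offset: Int): Unit =
  (0 until offset).foreach(i => assert(!bs.get(i), s"bit $i below offset should be clear"))
```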

@jkbradley
Owner

Oops, didn't mean to comment yet. Hold off on updates...

@jkbradley
Owner

Oh well, nevermind, that's the only item I saw. Btw, did you try running this with AltDT, and was it any faster?

@jkbradley
Owner

@feynmanliang Let me know if you'll be able to run quick timing tests for this on your laptop. It'd be nice to confirm the speedup. Thanks!

@feynmanliang
Author

@jkbradley Did some local benchmarks and saw improvements.

Added the following to the end of the "BitSubvector merge" test in BitSubvectorSuite.scala:

    val start = System.nanoTime()
    val N = 100000
    for (i <- 1 to N) {
      BitSubvector.merge(parts1, parts2)
    }
    println(s"$N runs took ${System.nanoTime() - start} ns")
| N | Before (ns) | After (ns) |
| --- | --- | --- |
| 100000 | 621788767 | 620082344 |
| 1000000 | 1392177699 | 1170287557 |
| 10000000 | 8319413981 | 6838749612 |
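
These numbers come from a single timed loop; a warmup pass before timing would reduce JIT noise. A hedged variant of the snippet above (illustrative only, same parts1/parts2/N as before):

```scala
for (_ <- 1 to 10000) BitSubvector.merge(parts1, parts2)  // warm up the JIT
val start = System.nanoTime()
for (_ <- 1 to N) BitSubvector.merge(parts1, parts2)
println(s"$N runs took ${System.nanoTime() - start} ns")
```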

@feynmanliang
Author

Will update the BitSetSuite test to check that elements 0 until offset are clear.

@feynmanliang
Author

@jkbradley ready for review

@jkbradley
Owner

Those are smaller improvements than I would have expected. Do you think it's JIT magic hiding something? I hate to say this, but I'm wondering if this is worth it, especially because it requires a change to Spark core.

If it's OK, I might ask that we hold off on merging this until we can do larger scale tests comparing using an Array of BitSubvectors vs. using a single BitSet (from @fabuzaid21 ). If we stick with the sparse representation using BitSubvectors, then we could return to this PR.

I'll leave it open for now.

jkbradley pushed a commit that referenced this pull request Oct 24, 2019
…nput of UDF as double in the failed test in udf-aggregate_part1.sql

## What changes were proposed in this pull request?

It can still be flaky on certain environments due to the float limitation described at apache#25110. See apache#25110 (comment)

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6584/testReport/org.apache.spark.sql/SQLQueryTestSuite/udf_pgSQL_udf_aggregates_part1_sql___Regular_Python_UDF/

```
Expected "700000000000[6] 1", but got "700000000000[5] 1" Result did not match for query #33&#10;SELECT CAST(avg(udf(CAST(x AS DOUBLE))) AS long), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))&#10;FROM (VALUES (7000000000005), (7000000000007)) v(x)
```

Here's what's going on: apache#25110 (comment)

```
scala> Seq("7000000000004.999", "7000000000006.999").toDF().selectExpr("CAST(avg(value) AS long)").show()
+--------------------------+
|CAST(avg(value) AS BIGINT)|
+--------------------------+
|             7000000000005|
+--------------------------+
```

Therefore, this PR just avoids the cast in this specific test.

This is a temp fix. We need a more robust way to avoid such cases.

## How was this patch tested?

It passes with Maven in my local environment before and after this PR. I believe the problem is similarly related to the Python or OS installed on the machine. I should test this against the PR builder with `test-maven` to be sure.

Closes apache#25128 from HyukjinKwon/SPARK-28270-2.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
jkbradley pushed a commit that referenced this pull request Jan 2, 2020
… Arrow on JDK9+

### What changes were proposed in this pull request?

This PR aims to add `io.netty.tryReflectionSetAccessible=true` to the testing configuration for JDK11 because this is an officially documented requirement of Apache Arrow.

Apache Arrow community documented this requirement at `0.15.0` ([ARROW-6206](apache/arrow#5078)).
> #### For java 9 or later, should set "-Dio.netty.tryReflectionSetAccessible=true".
> This fixes `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available`. thrown by netty.
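
As a concrete illustration of wiring this up, a hedged sbt sketch for passing the property to forked test JVMs (the actual PR edited Spark's own Maven/sbt test configuration, not necessarily this exact line):

```scala
// In build.sbt: forward the Netty property to test JVMs on JDK 9+.
// (javaOptions only take effect when tests run in a forked JVM.)
Test / fork := true
Test / javaOptions += "-Dio.netty.tryReflectionSetAccessible=true"
```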

### Why are the changes needed?

After ARROW-3191, the Arrow Java library requires the property `io.netty.tryReflectionSetAccessible` to be set to true for JDK >= 9. After apache#26133, the JDK11 Jenkins jobs seem to fail.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/676/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/677/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/678/

```
Previous exception in task:
sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with JDK11.

Closes apache#26552 from dongjoon-hyun/SPARK-ARROW-JDK11.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

github-actions bot commented Jan 3, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 3, 2020
@github-actions github-actions bot closed this Jan 4, 2020