Conversation

@feynmanliang feynmanliang commented Sep 29, 2015

  • Adds a private[spark] orWithOffset to BitSet, which upgrades the OR from bit-level to word-level (see the sketch after this list). IMO this should remain private and be merged with the entire feature, since it's a low-level, specialized API that I don't see others reusing
  • Adds |= to BitSubvector, which performs an in-place OR while accounting for from and to
  • Misc docs/tests
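
A minimal sketch of what a word-level OR with a bit offset can look like, operating directly on an Array[Long] of 64-bit words. The names and signature here are illustrative, not the PR's actual orWithOffset or Spark's BitSet internals:

```scala
// Illustrative word-level OR: ORs `numBits` bits of `src` into `dst`,
// starting at bit position `offsetBits` of `dst`.
def orWithOffset(dst: Array[Long], src: Array[Long], offsetBits: Int, numBits: Int): Unit = {
  val wordShift = offsetBits >> 6     // whole 64-bit words to skip in dst
  val bitShift  = offsetBits & 0x3f   // remaining bit offset within a word
  val numWords  = (numBits + 63) >> 6
  var i = 0
  while (i < numWords) {
    val w = src(i)
    // Each source word lands in at most two destination words.
    dst(wordShift + i) |= (w << bitShift)
    if (bitShift != 0 && wordShift + i + 1 < dst.length) {
      dst(wordShift + i + 1) |= (w >>> (64 - bitShift))
    }
    i += 1
  }
}
```

One OR (or two, when the offset straddles a word boundary) per 64 bits replaces 64 single-bit get/set round trips, which is where the speedup comes from.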

@feynmanliang
Author

This is currently failing unit tests for BitSubvector.merge because the indices into bits do not correspond to the from/to indices, so the arraycopy is incorrect. It's not immediately obvious to me how to maintain the flexible from/to ranges in BitSubvector while also implementing an efficient System.arraycopy, but it might be possible if we play with the srcPos, dstPos, and length arguments. I will play with this some more during the week.
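
For reference, the three knobs mentioned above are the middle arguments of java.lang.System.arraycopy; a tiny standalone illustration:

```scala
// System.arraycopy(src, srcPos, dst, dstPos, length) copies `length` elements
// from src starting at srcPos into dst starting at dstPos.
val src = Array(1L, 2L, 3L, 4L)
val dst = new Array[Long](4)
System.arraycopy(src, 1, dst, 2, 2)
println(dst.mkString(","))  // 0,0,2,3
```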

@jkbradley
Owner

Once this is working, maybe we should submit the Spark core changes as a PR to Spark.

@feynmanliang feynmanliang changed the title from "[WIP] Changes BitSubvector to use System.arraycopy" to "Changes BitSubvector to use System.arraycopy" Oct 3, 2015
@feynmanliang
Author

@jkbradley This is ready for review/benchmarking. We ended up not being able to use System.arraycopy because BitSubvector.merge requires a bitwise OR rather than a copy (a copy would overwrite previously set destination bits that are not set in the source).
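
A tiny illustration of why a plain copy is wrong here (standalone example, not project code): arraycopy overwrites destination words, dropping bits that were already set, while merge must preserve them.

```scala
val dst = Array(0x1L)               // destination already has bit 0 set
val src = Array(0x2L)               // source has bit 1 set
System.arraycopy(src, 0, dst, 0, 1)
println(dst(0))                     // 2: bit 0 was lost by the copy
// What merge needs instead is an OR, which keeps both bits:
// dst(0) |= src(0)  ->  3
```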

Owner

I tried modifying this test to use offset 63, and it failed. Maybe create a test which loops through important offsets: 0, 1, 63, 64, 65.
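
A minimal sketch of such a loop over the boundary offsets, reusing the illustrative orWithOffset from the sketch above (hypothetical, not the suite's actual test):

```scala
for (offset <- Seq(0, 1, 63, 64, 65)) {
  val dst = new Array[Long](3)   // 192 bits: room for offset + 64 source bits
  val src = Array(0x5L)          // source bits 0 and 2 set
  orWithOffset(dst, src, offset, 64)
  // Source bit 0 must land at destination bit `offset`.
  assert(((dst(offset >> 6) >>> (offset & 0x3f)) & 1L) == 1L, s"offset $offset failed")
}
```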

Author

Weird... I am getting the tests to pass for all the offsets. I will push an update that loops through the offsets you gave; if you can still repro, can you post the failing code? Thanks!

Author

Will be splitting up the shared state here as well.

@feynmanliang
Author

@jkbradley Can you take another look when you have a chance? I couldn't repro the failure you reported (it was likely caused by my confusion over Long.MaxValue vs. -1L). The other changes were made according to your feedback.
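
For the record, that mix-up is easy to make: -1L has all 64 bits set, while Long.MaxValue leaves the sign bit clear, so using it as an "all ones" word silently drops bit 63. A quick standalone check:

```scala
println(java.lang.Long.toBinaryString(-1L).length)           // 64 (all ones)
println(java.lang.Long.toBinaryString(Long.MaxValue).length) // 63 (sign bit clear)
```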

Owner

Could check that elements 0 until offset are set to 0.
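
A hedged sketch of that assertion, assuming Spark's BitSet with its get(index: Int): Boolean accessor (the helper name is illustrative):

```scala
import org.apache.spark.util.collection.BitSet

def assertClearBelowOffset(bs: BitSet, offset: Int): Unit =
  (0 until offset).foreach(i => assert(!bs.get(i), s"bit $i below offset should be clear"))
```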

@jkbradley
Owner

Oops, didn't mean to comment yet. Hold off on updates...

@jkbradley
Owner

Oh well, nevermind, that's the only item I saw. Btw, did you try running this with AltDT, and was it any faster?

@jkbradley
Owner

@feynmanliang Let me know if you'll be able to run quick timing tests for this on your laptop. It'd be nice to confirm the speedup. Thanks!

@feynmanliang
Author

@jkbradley Did some local benchmarks and saw improvements.

Added the following to the end of the "BitSubvector merge" test in BitSubvectorSuite.scala:

    val start = System.nanoTime()
    val N = 100000
    for (i <- 1 to N) {
      BitSubvector.merge(parts1, parts2)
    }
    println(s"$N runs took ${System.nanoTime() - start} ns")
| N | Before (ns) | After (ns) |
| --- | --- | --- |
| 100000 | 621788767 | 620082344 |
| 1000000 | 1392177699 | 1170287557 |
| 10000000 | 8319413981 | 6838749612 |
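
These numbers come from a single timed loop; a warmup pass before timing would reduce JIT noise. A hedged variant of the snippet above (illustrative only, same parts1/parts2/N as before):

```scala
for (_ <- 1 to 10000) BitSubvector.merge(parts1, parts2)  // warm up the JIT
val start = System.nanoTime()
for (_ <- 1 to N) BitSubvector.merge(parts1, parts2)
println(s"$N runs took ${System.nanoTime() - start} ns")
```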

@feynmanliang
Author

Will update the BitSetSuite test to check that elements 0 until offset are clear.

@feynmanliang
Author

@jkbradley ready for review

@jkbradley
Owner

Those are smaller improvements than I would have expected. Do you think it's JIT magic hiding something? I hate to say this, but I'm wondering if this is worth it, especially because it requires a change to Spark core.

If it's OK, I might ask that we hold off on merging this until we can do larger scale tests comparing using an Array of BitSubvectors vs. using a single BitSet (from @fabuzaid21 ). If we stick with the sparse representation using BitSubvectors, then we could return to this PR.

I'll leave it open for now.

jkbradley pushed a commit that referenced this pull request Oct 24, 2019
…nput of UDF as double in the failed test in udf-aggregate_part1.sql

## What changes were proposed in this pull request?

It can still be flaky on certain environments due to the float limitation described at apache#25110. See apache#25110 (comment)

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6584/testReport/org.apache.spark.sql/SQLQueryTestSuite/udf_pgSQL_udf_aggregates_part1_sql___Regular_Python_UDF/

```
Expected "700000000000[6] 1", but got "700000000000[5] 1" Result did not match for query #33&#10;SELECT CAST(avg(udf(CAST(x AS DOUBLE))) AS long), CAST(udf(var_pop(CAST(x AS DOUBLE))) AS decimal(10,3))&#10;FROM (VALUES (7000000000005), (7000000000007)) v(x)
```

Here's what's going on: apache#25110 (comment)

```
scala> Seq("7000000000004.999", "7000000000006.999").toDF().selectExpr("CAST(avg(value) AS long)").show()
+--------------------------+
|CAST(avg(value) AS BIGINT)|
+--------------------------+
|             7000000000005|
+--------------------------+
```

Therefore, this PR just avoids the cast in this specific test.

This is a temp fix. We need a more robust way to avoid such cases.

## How was this patch tested?

It passes with Maven in my local environment before and after this PR. I believe the problem is similarly related to the Python or OS installed on the machine. I should test this against the PR builder with `test-maven` to be sure.

Closes apache#25128 from HyukjinKwon/SPARK-28270-2.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
jkbradley pushed a commit that referenced this pull request Jan 2, 2020
… Arrow on JDK9+

### What changes were proposed in this pull request?

This PR aims to add `io.netty.tryReflectionSetAccessible=true` to the testing configuration for JDK11 because this is an officially documented requirement of Apache Arrow.

Apache Arrow community documented this requirement at `0.15.0` ([ARROW-6206](apache/arrow#5078)).
> #### For java 9 or later, should set "-Dio.netty.tryReflectionSetAccessible=true".
> This fixes `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available`. thrown by netty.
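
As a concrete illustration of wiring this up, a hedged sbt sketch for passing the property to forked test JVMs (the actual PR edited Spark's own Maven/sbt test configuration, not necessarily this exact line):

```scala
// In build.sbt: forward the Netty property to test JVMs on JDK 9+.
// (javaOptions only take effect when tests run in a forked JVM.)
Test / fork := true
Test / javaOptions += "-Dio.netty.tryReflectionSetAccessible=true"
```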

### Why are the changes needed?

After ARROW-3191, the Arrow Java library requires the property `io.netty.tryReflectionSetAccessible` to be set to true for JDK >= 9. After apache#26133, the JDK11 Jenkins jobs seem to fail.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/676/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/677/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/678/

```
Previous exception in task:
sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with JDK11.

Closes apache#26552 from dongjoon-hyun/SPARK-ARROW-JDK11.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

github-actions bot commented Jan 3, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 3, 2020
@github-actions github-actions bot closed this Jan 4, 2020