kiszk (Member) commented Jun 29, 2017

What changes were proposed in this pull request?

This PR adds a new class, OnHeapCachedBatch, derived from the ColumnVector class, which can hold compressed data by using CompressibleColumnAccessor.

As a first step of this implementation, this JIRA supports primitive data types and strings. Array and other data types will be supported in another PR.

The current implementation stores compressed data by using the putByteArray() method and then reads it back through a getter (e.g. getInt()). Setters (e.g. putInt()) will be supported in another PR.

The current implementation routes each getter through an UnsafeRow, which is slow. Another PR will make it fast by eliminating the UnsafeRow in favor of specialized per-type getters in ColumnAccessor; the read path is sketched below.
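A minimal, self-contained sketch of that slow read path, with hypothetical names throughout (the real PR builds on Spark's ColumnVector and CompressibleColumnAccessor; the byte-per-row decoding here is only a placeholder):

```java
// Hypothetical sketch, not the PR's code: compressed bytes go in once via a
// byte-array setter, and each getter decodes forward through an intermediate
// one-field buffer, mirroring the UnsafeRow-based path described above.
final class CompressedIntColumnSketch {
    private byte[] compressed;     // compressed column data, stored once
    private int nextRowId = 0;     // sequential decode cursor
    private int scratch;           // stand-in for the intermediate UnsafeRow field

    // analogous to putByteArray(): store the already-compressed bytes
    void putByteArray(byte[] data) {
        this.compressed = data;
    }

    // analogous to getInt(): decode rows up to rowId, then read the buffer
    int getInt(int rowId) {
        while (nextRowId <= rowId) {
            // placeholder per-row decompression (real code uses a CompressionScheme)
            scratch = compressed[nextRowId];
            nextRowId++;
        }
        return scratch;
    }
}
```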

How was this patch tested?

Added test suites

SparkQA commented Jun 29, 2017

Test build #78925 has finished for PR 18468 at commit 00f70f5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class OnHeapCachedBatch extends ColumnVector implements java.io.Serializable

SparkQA commented Jun 30, 2017

Test build #78944 has finished for PR 18468 at commit 514400c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

kiszk (Member, Author) commented Jun 30, 2017

@cloud-fan Could you review this?
As we discussed at Spark Summit, I prepared a new ColumnVector for compressed columns using the current compression schemes. Any comments are appreciated.

kiszk (Member, Author) commented Jul 3, 2017

cc: @hvanhovell

kiszk (Member, Author) commented Jul 4, 2017

ping @cloud-fan

Contributor:

hmm, I don't think this can be a new memory mode...

kiszk (Author), Jul 4, 2017:

The current implementation relies on the memory mode to decide which kind of ColumnVector to allocate.
If we do not add a new memory mode, I think we have to introduce additional conditional branches in the getters/setters.

Would it be better to add a new argument to specify a type (e.g. Compressible)?

What do you think?

Contributor:

We can make ColumnVector.allocate accept a VectorType (which does not exist yet) instead of a MemoryMode.
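A rough sketch of this suggestion; VectorType did not exist in Spark at this point, and every name below is illustrative rather than a real Spark API:

```java
// Hypothetical sketch: allocation keyed by a vector kind rather than by
// MemoryMode, so a compressed variant does not need a new memory mode.
enum VectorType { ON_HEAP, OFF_HEAP, COMPRESSED }

interface VectorSketch {
    int getInt(int rowId);
}

final class VectorFactorySketch {
    static VectorSketch allocate(int capacity, VectorType type) {
        switch (type) {
            case ON_HEAP:
                int[] data = new int[capacity];               // plain on-heap storage
                return rowId -> data[rowId];
            case OFF_HEAP:
                java.nio.ByteBuffer buf =
                    java.nio.ByteBuffer.allocateDirect(capacity * 4); // direct memory
                return rowId -> buf.getInt(rowId * 4);
            case COMPRESSED:
                // a compressed wrapper would decode on access (discussed below)
                throw new UnsupportedOperationException("sketched later in this thread");
            default:
                throw new IllegalArgumentException("unknown vector type: " + type);
        }
    }
}
```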

kiszk (Author):

Could you elaborate on this idea?
Would VectorType take a value such as NonCompress or Compressible for now?

Contributor:

It looks weird that we put the value into a row and then read it back from the row. Can we return the value directly? E.g. columnAccessor.extractTo should be able to take a ColumnVector as input and set the value on it.
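A small sketch of the direct path being suggested; the types and the extractTo signature here are hypothetical (Spark's actual ColumnAccessor.extractTo at the time wrote into a row):

```java
// Hypothetical sketch: the accessor writes the decoded value straight into
// the destination vector instead of bouncing it through an UnsafeRow.
final class IntVectorSketch {
    private final int[] data;
    IntVectorSketch(int capacity) { data = new int[capacity]; }
    void putInt(int rowId, int value) { data[rowId] = value; }
    int getInt(int rowId) { return data[rowId]; }
}

final class IntAccessorSketch {
    private final int[] decoded;   // stand-in for decoded compressed input
    private int cursor = 0;
    IntAccessorSketch(int[] decoded) { this.decoded = decoded; }

    boolean hasNext() { return cursor < decoded.length; }

    // proposed shape: extract the next value directly into `dest` at `rowId`
    void extractTo(IntVectorSketch dest, int rowId) {
        dest.putInt(rowId, decoded[cursor++]);
    }
}
```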

kiszk (Author), Jul 4, 2017:

I agree with you. We can optimize these accesses by enhancing the existing APIs.
Should we address these extensions in this PR? My original plan was to address such an optimization in another PR.

What do you think?

Contributor:

This PR is building infrastructure that is not being used yet, so I think we don't need to rush.

kiszk (Author), Jul 4, 2017:

I tried to make this a set of pull requests for ease of review.
However, I will add the optimization to return the value directly without UnsafeRow.

SparkQA commented Jul 7, 2017

Test build #79342 has finished for PR 18468 at commit 3f8e024.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 8, 2017

Test build #79359 has finished for PR 18468 at commit f657fa8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 8, 2017

Test build #79366 has finished for PR 18468 at commit 101e4b7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 8, 2017

Test build #79367 has finished for PR 18468 at commit 2c9e63e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

kiszk (Member, Author) commented Jul 8, 2017

@cloud-fan Is this what you proposed for VectorType?

SparkQA commented Jul 11, 2017

Test build #79532 has finished for PR 18468 at commit be0a6a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

kiszk (Member, Author) commented Jul 12, 2017

@cloud-fan could you please review this?

kiszk (Member, Author) commented Jul 14, 2017

ping @cloud-fan

kiszk (Member, Author) commented Jul 18, 2017

ping @ueshin @cloud-fan

cloud-fan (Contributor):

I think this PR doesn't have a good abstraction of the problem. For the table cache, our goal is not to make the compressed data a ColumnVector, but to have an efficient way to convert the compressed data (a byte array) to a ColumnVector. I think the most efficient way is to do no conversion at all and instead have a wrapper, i.e. a class CachedBatchColumnVector(data: Array[Byte]) which implements the various getXXX methods by doing decompression. Then we don't need to introduce the VectorType concept or change ColumnVector.

@kiszk what do you think?
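A minimal sketch of the wrapper idea above; the decoding body is a placeholder, not Spark's actual CompressionScheme logic, and the class name carries a Sketch suffix to mark it as illustrative:

```java
// Hypothetical sketch of the proposed wrapper: hold the compressed bytes from
// the table cache and implement the getters by decompressing on demand, so no
// up-front conversion to an uncompressed ColumnVector is needed.
final class CachedBatchColumnVectorSketch {
    private final byte[] data;   // compressed column bytes from the cached batch
    private int nextRowId = 0;   // decode cursor: reads must move forward
    private int current;         // last decoded value

    CachedBatchColumnVectorSketch(byte[] data) { this.data = data; }

    int getInt(int rowId) {
        // decompress forward until rowId is reached (placeholder: one byte per row)
        while (nextRowId <= rowId) {
            current = data[nextRowId++];
        }
        return current;
    }
}
```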

kiszk (Member, Author) commented Jul 18, 2017

@cloud-fan Thank you for your comments. Based on this discussion, I introduced VectorType.
I have just seen @ueshin's ArrowColumnVector implementation.
I will update CachedBatchColumnVector based on your comments and @ueshin's implementation.

kiszk changed the title from "[SPARK-20873][SQL] Enhance ColumnVector to support compressed representation" to "[SPARK-20873][SQL] Creat CachedBatchColumnVector to abstract existing compressed column" on Jul 18, 2017
cloud-fan (Contributor):

ArrowColumnVector is also a wrapper (for an Arrow vector), and it doesn't introduce any vector type concept.

SparkQA commented Jul 18, 2017

Test build #79703 has finished for PR 18468 at commit 0aa1b78.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor:

Do we support reading values in random order, e.g. getBoolean(2), getBoolean(1), getBoolean(2)?

kiszk (Author):

We do not support reading values in random order, because the implementations of CompressionScheme (e.g. IntDelta) support only sequential access.

Contributor:

Then we should throw an exception in this case instead of returning a wrong result.

kiszk (Author):

I see. I will add code to track the access order in each getter.

SparkQA commented Jul 18, 2017

Test build #79712 has finished for PR 18468 at commit 23f7ea5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class CachedBatchColumnVector extends ColumnVector

Contributor:

We should explicitly say that this is a wrapper to read compressed data in the table cache as a ColumnVector.

kiszk (Author):

Sure, done.

Contributor:

Can we inline this method?

kiszk (Author):

Yes, we can. Done.

Contributor:

We should allow previousRowId == rowId, so that we can support getInt(1), getInt(1), getInt(2).

kiszk (Author), Jul 19, 2017:

Yes, we can do that now; see line 78. It means that extractTo() must be called only once for a given rowId.
Now we also support isNullAt(0) followed by getInt(0).

Contributor:

I think we don't need to be so strict. The rule is that users can't jump back and read values, but anything else is OK, e.g. getInt(0), getInt(0), getInt(10), getInt(11).

kiszk (Author):

Sure. I changed the limitation to "same or ascending".
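A sketch of the agreed "same or ascending" rule, with hypothetical names: re-reading the same rowId passes, moving forward passes, and a backward jump throws instead of returning a wrong result:

```java
// Hypothetical guard for the "same or ascending" access rule agreed above.
final class SequentialAccessGuardSketch {
    private int previousRowId = -1;

    // allow rowId == previousRowId (re-read) or rowId > previousRowId (forward)
    void check(int rowId) {
        if (rowId < previousRowId) {
            throw new UnsupportedOperationException(
                "row access must be same or ascending: got " + rowId
                    + " after " + previousRowId);
        }
        previousRowId = rowId;
    }
}
```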

SparkQA commented Jul 19, 2017

Test build #79739 has finished for PR 18468 at commit b83dedb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

kiszk changed the title from "[SPARK-20873][SQL] Creat CachedBatchColumnVector to abstract existing compressed column" to "[SPARK-20873][SQL] Create CachedBatchColumnVector to abstract existing compressed column" on Jul 19, 2017
SparkQA commented Aug 7, 2017

Test build #80314 has finished for PR 18468 at commit a26dc15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor):

retest this please

SparkQA commented Aug 14, 2017

Test build #80619 has finished for PR 18468 at commit a26dc15.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

kiszk (Member, Author) commented Aug 14, 2017

retest this please

SparkQA commented Aug 14, 2017

Test build #80629 has finished for PR 18468 at commit a26dc15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

kiszk (Member, Author) commented Oct 3, 2017

This is followed up by #18704.

kiszk closed this on Oct 3, 2017
asfgit pushed a commit that referenced this pull request Oct 4, 2017
…d column (batch method)

## What changes were proposed in this pull request?

This PR abstracts data compressed by `CompressibleColumnAccessor` using `ColumnVector` in a batch method. When `ColumnAccessor.decompress` is called, the `ColumnVector` will hold the uncompressed data. This batch decompression does not use `InternalRow`, to reduce the number of memory accesses.

As a first step of this implementation, this JIRA supports primitive data types. Array and other data types will be supported in another PR.

This implementation decompresses data in batch into an uncompressed column batch, as rxin suggested [here](#18468 (comment)). Another implementation uses the adapter approach, [as cloud-fan suggested](#18468).

## How was this patch tested?

Added test suites

Author: Kazuaki Ishizaki <[email protected]>

Closes #18704 from kiszk/SPARK-20783a.