[SPARK-20783][SQL] Create CachedBatchColumnVector to abstract existing compressed column #18468
Conversation
Test build #78925 has finished for PR 18468 at commit
Test build #78944 has finished for PR 18468 at commit
@cloud-fan Could you review this?
cc: @hvanhovell
ping @cloud-fan |
hmm, I don't think this can be a new memory mode...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation relies on the memory mode to decide which kind of ColumnVector to allocate.
If we do not add a new memory mode, I think we have to introduce additional conditional branches in the getters/setters.
Is it better to add a new argument to specify a type (e.g. Compressible)?
What do you think?
We can make ColumnVector.allocate accept a VectorType (which does not exist yet) instead of a MemoryMode.
Could you elaborate on this idea?
Would VectorType take a value of NonCompress or Compressible for now?
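A minimal sketch of the dispatch being proposed here. All names below (VectorType, the allocate signature, and the two vector classes) are hypothetical illustrations taken from this discussion, not Spark's actual API:

```java
// Hypothetical sketch: allocation dispatches on a VectorType instead of a
// MemoryMode, so no extra branches are needed in the getters/setters.
enum VectorType { NON_COMPRESS, COMPRESSIBLE }

abstract class ColumnVector {
    abstract int getInt(int rowId);

    // Factory chooses the concrete vector class from the vector type.
    static ColumnVector allocate(int capacity, VectorType type) {
        switch (type) {
            case COMPRESSIBLE:
                return new CompressibleColumnVector(capacity);
            default:
                return new PlainColumnVector(capacity);
        }
    }
}

class PlainColumnVector extends ColumnVector {
    private final int[] data;
    PlainColumnVector(int capacity) { data = new int[capacity]; }
    int getInt(int rowId) { return data[rowId]; }
}

class CompressibleColumnVector extends ColumnVector {
    CompressibleColumnVector(int capacity) { /* would hold compressed bytes */ }
    int getInt(int rowId) { return 0; /* would decompress sequentially */ }
}
```

With this shape, the getter implementations stay branch-free: each subclass knows its own storage.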
This looks weird: we put the value into a row and then read that value back from the row. Can we return the value directly? E.g. columnAccessor.extractTo should be able to take a ColumnVector as input and set the value on it.
I agree with you. We can optimize these accesses by enhancing the existing APIs.
Should we address these extensions in this PR? My original plan was to address such an optimization in another PR.
What do you think?
This PR is building infrastructure that is not being used yet, so I think we don't need to rush.
I tried to make a set of pull requests for ease of review.
However, I will add the optimization to return the value directly without UnsafeRow.
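The two access paths under discussion can be contrasted with a toy stand-in. The class and method names below are illustrative only (a plain int accessor rather than Spark's CompressibleColumnAccessor), though extractTo mirrors the method name used in this thread:

```java
// Toy contrast of the two paths: the current one extracts each value into an
// intermediate row buffer that the caller reads back; the proposed one writes
// the value straight into the column vector. Not Spark's actual classes.
class IntVector {
    private final int[] data;
    IntVector(int capacity) { data = new int[capacity]; }
    void putInt(int rowId, int value) { data[rowId] = value; }
    int getInt(int rowId) { return data[rowId]; }
}

class IntAccessor {
    private final int[] values; // stand-in for a compressed byte buffer
    private int pos = 0;
    IntAccessor(int[] values) { this.values = values; }

    // Current path: caller copies the returned value out of a row itself.
    int extractToRow() { return values[pos++]; }

    // Proposed path: set the next value directly on the vector, no row detour.
    void extractTo(IntVector vector, int rowId) {
        vector.putInt(rowId, values[pos++]);
    }
}
```

The direct path saves one write and one read per value, which is the memory-access reduction being discussed.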
Test build #79342 has finished for PR 18468 at commit
Test build #79359 has finished for PR 18468 at commit
Test build #79366 has finished for PR 18468 at commit
Test build #79367 has finished for PR 18468 at commit
@cloud-fan Is this what you proposed for
Test build #79532 has finished for PR 18468 at commit
@cloud-fan Could you please review this?
ping @cloud-fan
ping @ueshin @cloud-fan
I think this PR doesn't have a good abstraction of the problem. For table cache, our goal is not making the compressed data a @kiszk what do you think?
@cloud-fan Thank you for your comments. Based on this discussion, I introduced
Test build #79703 has finished for PR 18468 at commit
do we support reading values in a random order? e.g. getBoolean(2), getBoolean(1), getBoolean(2)?
We do not support reading values in a random order, because the implementations of CompressionScheme (e.g. IntDelta) support only sequential access.
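A delta scheme makes the sequential-only constraint concrete: each stored value is a difference from the previous one, so reading row n requires decoding rows 0..n in order. This is an illustrative re-implementation of the idea, not Spark's IntDelta:

```java
// Illustrative delta-encoded int column. Random access is impossible without
// re-scanning, because every value depends on the running sum of all earlier
// deltas; only a sequential getter is cheap.
class DeltaIntColumn {
    private final int[] deltas; // deltas[0] is the first value itself
    private int nextRow = 0;
    private int current = 0;

    DeltaIntColumn(int[] values) {
        deltas = new int[values.length];
        int prev = 0;
        for (int i = 0; i < values.length; i++) {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
    }

    // Sequential getter: decodes exactly one more delta per call.
    int nextInt() {
        current += deltas[nextRow++];
        return current;
    }
}
```

Jumping back to an earlier row would require resetting the running sum and re-decoding from row 0, which is why the getters track access order instead.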
Then we should throw an exception in this case instead of returning a wrong result.
I see. I will add code to track the access order for each getter.
Test build #79712 has finished for PR 18468 at commit
We should explicitly say that this is a wrapper to read compressed data as a ColumnVector in the table cache.
Sure, done
can we inline this method?
Yes, we can. done
we should allow previousRowId == rowId, as we are able to support getInt(1), getInt(1), getInt(2)
Yes, we can do that for now; see line 78. It means that extractTo() must be called only once for a given rowId.
Now we also support isNullAt(0) followed by getInt(0), too.
I think we don't need to be so strict. The rule is that users can't jump back and read values; anything else is OK, e.g. getInt(0), getInt(0), getInt(10), getInt(11).
Sure. I changed the limitation to "same or ascending".
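The "same or ascending" rule agreed above can be sketched as a small guard that a getter would consult before decoding. The class name and return convention are illustrative, not taken from the actual patch:

```java
// Sketch of the access-order rule: a getter may re-read the same rowId or move
// forward, but jumping back throws instead of returning a wrong result.
class SequentialGuard {
    private int previousRowId = -1;

    // Returns how many rows the decoder must advance (0 when re-reading).
    int check(int rowId) {
        if (rowId < previousRowId) {
            throw new UnsupportedOperationException(
                "Only same or ascending rowIds are supported: got " +
                rowId + " after " + previousRowId);
        }
        int advance = previousRowId < 0 ? rowId + 1 : rowId - previousRowId;
        previousRowId = rowId;
        return advance;
    }
}
```

Re-reads cost nothing because the decoded value can be cached; only forward moves trigger further sequential decompression.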
Test build #79739 has finished for PR 18468 at commit
remove unused imports
Test build #80314 has finished for PR 18468 at commit
retest this please
Test build #80619 has finished for PR 18468 at commit
retest this please
Test build #80629 has finished for PR 18468 at commit
This is followed by #18704
…d column (batch method)

## What changes were proposed in this pull request?

This PR abstracts data compressed by `CompressibleColumnAccessor` using `ColumnVector` in a batch method. When `ColumnAccessor.decompress` is called, `ColumnVector` will hold the uncompressed data. This batch decompression does not use `InternalRow`, to reduce the number of memory accesses. As a first step of this implementation, this JIRA supports primitive data types. Another PR will support array and other data types.

This implementation decompresses data in batch into an uncompressed column batch, as rxin suggested [here](#18468 (comment)). Another implementation uses an adapter approach, [as cloud-fan suggested](#18468).

## How was this patch tested?

Added test suites

Author: Kazuaki Ishizaki <[email protected]>

Closes #18704 from kiszk/SPARK-20783a.
What changes were proposed in this pull request?

This PR adds a new class, OnHeapCachedBatch, which can hold compressed data by using CompressibleColumnAccessor and is derived from the ColumnVector class. As a first step of this implementation, this JIRA supports primitive data and string types. Another PR will support array and other data types.

The current implementation adds compressed data by using the putByteArray() method, and then reads data through a getter (e.g. getInt()). Another PR will support setters (e.g. putInt()).

The current implementation uses UnsafeRow for a getter, which is slow. Another PR will make it fast by eliminating UnsafeRow in favor of specialized getters of ColumnAccessor for each data type.

How was this patch tested?

Added test suites
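The flow described above — compressed bytes stored via putByteArray() and values read back through a sequential getter — can be sketched with a toy stand-in. The class name and the run-length encoding used here are assumptions for illustration; Spark's actual compression schemes and buffer layout differ:

```java
// Toy sketch of the described flow: compressed bytes go in through
// putByteArray(), and getInt() decompresses on the way out. Uses a trivial
// run-length encoding of (value, count) byte pairs; rowIds are assumed to
// arrive in order 0, 1, 2, ... (the vector supports only sequential reads).
class CachedBatchSketch {
    private byte[] compressed;
    private int pairIndex = 0;     // offset of the current (value, count) pair
    private int remainingInRun = 0;
    private int currentValue = 0;

    void putByteArray(byte[] bytes) { compressed = bytes; }

    // Sequential getter: advances through the run-length pairs as rowIds grow.
    int getInt(int rowId) {
        while (remainingInRun == 0) {
            currentValue = compressed[pairIndex++];
            remainingInRun = compressed[pairIndex++];
        }
        remainingInRun--;
        return currentValue;
    }
}
```

Note that rowId is accepted only for API symmetry with ColumnVector; a real implementation would also validate it against the last row read, as discussed earlier in the thread.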