[SPARK-20822][SQL] Generate code to build table cache using ColumnarBatch and to get value from ColumnVector #18066
Conversation
Test build #77222 has finished for PR 18066 at commit
Test build #77235 has finished for PR 18066 at commit
@hvanhovell @sameeragarwal Would you please review this?
ping @hvanhovell
Test build #77702 has finished for PR 18066 at commit
        storageLevel == MEMORY_AND_DISK_SER || storageLevel == MEMORY_AND_DISK_SER_2)
    }

    private val typeToName = Map[AbstractDataType, String](
Hi, @kiszk. Is there any reason for having only two types, int and double? The PR looks more general to me.
As I described in the description, this is for ease of review. As the first step, I supported only integer and double data types with whole-stage codegen. Another PR will address an execution path without whole-stage codegen.
    private[columnar] val useColumnarBatches: Boolean = {
      // In the initial implementation, for ease of review
      // support only integer and double and # of fields is less than wholeStageMaxNumFields
Oh, I see. Here is the comment about the reason.
#18747 is another PR for this JIRA entry.
What changes were proposed in this pull request?
This PR generates Java code to build an in-memory table cache using `ColumnarBatch` with `ColumnVector` instead of using `CachedBatch` with `Array[Byte]`, and to get a value from a `ColumnVector`.

As the first step, for ease of review, it supports only integer and double data types with whole-stage codegen. Another PR will address an execution path without whole-stage codegen.
This PR implements the following:

- Builds an in-memory table cache using `ColumnarBatch` with `ColumnVector`. To support both the new and the conventional cache data structures, this PR declares `CachedBatch` as a trait, and declares `CachedColumnarBatch` and `CachedBatchBytes` as the actual implementations (see the sketch below).
- Generates code to get a value from a `ColumnVector`.
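A minimal sketch of the trait/implementation split described above (this is not the PR's actual code; the field layout is an assumption for illustration):

```scala
// Illustrative only: CachedBatch as a common trait with two concrete forms.
trait CachedBatch {
  def numRows: Int
}

// New form: cache data kept column-wise so that generated code can read it
// back through ColumnVectors without per-row decoding.
case class CachedColumnarBatch(numRows: Int, columnBuffers: Array[Array[Byte]])
  extends CachedBatch

// Conventional form: per-column compressed byte arrays that are decoded by an
// iterator into InternalRows.
case class CachedBatchBytes(numRows: Int, buffers: Array[Array[Byte]])
  extends CachedBatch
```

Keeping both forms behind one trait lets the conventional byte-array path remain available for the data types and plans that the new code generation does not yet handle.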
This PR improves runtime performance by avoiding the materialization of each `InternalRow` in a `SpecificColumnarIterator` iterator when reading cached data.

Options
A `ColumnVector` for all primitive data types in `ColumnarBatch` can be compressed. Currently, there are two ways to enable compression:

- set `spark.sql.inMemoryColumnarStorage.compressed` (default is `true`), or
- call `DataFrame.persist(st)`, where `st` is `MEMORY_ONLY_SER`, `MEMORY_ONLY_SER_2`, `MEMORY_AND_DISK_SER`, or `MEMORY_AND_DISK_SER_2`.

An example program
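The example program itself is not reproduced here; the following is a minimal sketch (the object name `CacheExample` and the toy two-column schema are made up for illustration) showing both ways of enabling compression:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import org.apache.spark.storage.StorageLevel

object CacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CacheExample")
      .master("local[*]")
      // Option 1: column compression via configuration (true by default).
      .config("spark.sql.inMemoryColumnarStorage.compressed", "true")
      .getOrCreate()

    // Only integer and double columns, matching the types supported so far.
    val df = spark.range(0, 1000000)
      .selectExpr("CAST(id AS INT) AS i", "CAST(id AS DOUBLE) AS d")

    // Option 2: a serialized storage level also enables compression.
    df.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // The first action builds the in-memory table cache; the aggregation then
    // reads the values back from the cached columnar data.
    df.agg(sum("d")).show()

    spark.stop()
  }
}
```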
Generated code for building an in-memory table cache
Generated code by whole-stage codegen (lines 75-78 are major changes)
How was this patch tested?
Added test suites for wider columns.
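The new test suites are not shown in this description; as an illustration only (the column count, schema, and expected values below are assumptions, not the PR's actual tests), a wide-column cache check could look like:

```scala
import org.apache.spark.sql.{Row, SparkSession}

object WideColumnCacheCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("wide-cache")
      .getOrCreate()

    // A schema wider than typical whole-stage codegen field limits,
    // so that both the new and the conventional code paths can be exercised.
    val numCols = 200
    val wideDf = spark.range(0, 10)
      .selectExpr((0 until numCols).map(i => s"CAST(id + $i AS INT) AS c$i"): _*)

    wideDf.cache()

    // The row for id = 0 should contain the values 0, 1, ..., numCols - 1.
    val first = wideDf.orderBy("c0").head()
    assert(first == Row.fromSeq(0 until numCols))

    spark.stop()
  }
}
```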