[SPARK-20822][SQL] Generate code to directly get value from ColumnVector for table cache #18747
Conversation
Test build #79990 has finished for PR 18747 at commit

Test build #79992 has finished for PR 18747 at commit

Jenkins, retest this please

Test build #79993 has finished for PR 18747 at commit

Test build #80312 has finished for PR 18747 at commit

Test build #82043 has finished for PR 18747 at commit

Test build #82121 has finished for PR 18747 at commit

Test build #82127 has finished for PR 18747 at commit

Test build #82182 has finished for PR 18747 at commit

Jenkins, retest this please

Test build #82187 has finished for PR 18747 at commit

Test build #82451 has finished for PR 18747 at commit

Test build #82453 has finished for PR 18747 at commit

@cloud-fan could you please review this one among my PRs first?

Test build #82691 has finished for PR 18747 at commit

@cloud-fan could you please review this since I resolved a conflict?
```scala
val colVars = output.indices.map(i => ctx.freshName("colInstance" + i))
val columnVectorClzs = vectorTypes.getOrElse(
  Seq.fill(colVars.size)(classOf[ColumnVector].getName))
val columnAccessorClz = "org.apache.spark.sql.execution.columnar.ColumnAccessor"
```
This table-cache-specific code should go to InMemoryTableScanExec instead of here.

Yeah, done
```scala
val buffers = relation.cachedColumnBuffers
// HACK ALERT: This is actually an RDD[CachedBatch].
// We're taking advantage of Scala's type erasure here to pass these batches along.
Seq(buffers.asInstanceOf[RDD[InternalRow]])
```
ColumnarBatchScan assumes the input RDD is RDD[ColumnarBatch]; you are breaking this assumption here.
I think we should convert CachedBatch to ColumnarBatch first. You can codegen a class to do it if necessary.

Yes, I break that assumption (RDD[CachedBatch]) since we have to create a ColumnarBatch when it is read.
Should we convert CachedBatch to ColumnarBatch here in inputRDDs()?

Yes, please.

@gatorsmile sure, done
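The conversion discussed in this thread could look roughly like the following minimal sketch. The types here (FakeCachedBatch, FakeColumnarBatch) are hypothetical stand-ins for Spark's CachedBatch and ColumnarBatch, illustrating only the shape of the eager conversion, not the actual decompression logic:

```scala
// Hypothetical, simplified stand-ins for Spark's CachedBatch and
// ColumnarBatch; only the shape of the conversion is illustrated.
object CachedBatchConversionSketch {
  final case class FakeCachedBatch(numRows: Int, columns: Array[Array[Int]])

  final class FakeColumnarBatch(val numRows: Int, val columns: Array[Array[Int]]) {
    def getInt(col: Int, row: Int): Int = columns(col)(row)
  }

  // Convert each cached batch into a columnar batch up front, so that
  // downstream code can keep assuming it always receives columnar batches.
  def toColumnarBatch(cached: FakeCachedBatch): FakeColumnarBatch =
    new FakeColumnarBatch(cached.numRows, cached.columns)

  def main(args: Array[String]): Unit = {
    val cached = FakeCachedBatch(2, Array(Array(10, 20), Array(30, 40)))
    val batch = toColumnarBatch(cached)
    assert(batch.numRows == 2)
    assert(batch.getInt(1, 0) == 30)
  }
}
```

Mapping the conversion over the cached RDD (rather than casting it) would preserve the RDD[ColumnarBatch] assumption that ColumnarBatchScan relies on.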
```scala
columnarBatch.setNumRows(rowCount)

for (i <- 0 until attributes.length) {
  val index = if (columnIndices.length == 0) i else columnIndices(i)
```
Is this if necessary?

I have seen cases where columnIndices = Array[Int](0).

Do you mean attributes is empty?

Umm, while I ran several test suites in my local environment, I could not find the case columnIndices = Array[Int](0). Let me commit without this if.

Now all tests succeed without this if. Thank you for your comment.
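A side note on the Array syntax in the exchange above, since it is easy to misread: Array[Int](0) is a one-element array containing 0, whereas new Array[Int](0) is an empty array, so only the latter satisfies columnIndices.length == 0.

```scala
// Array[Int](0) calls Array.apply(0): a one-element array holding 0.
// new Array[Int](0) calls the constructor with length 0: an empty array.
object ArraySyntaxNote {
  def main(args: Array[String]): Unit = {
    assert(Array[Int](0).length == 1)
    assert(new Array[Int](0).length == 0)
  }
}
```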
```scala
    columnarBatch.column(i).asInstanceOf[WritableColumnVector],
    columnarBatchSchema.fields(i).dataType, rowCount)
}
return columnarBatch
```
nit: remove return
Test build #82845 has finished for PR 18747 at commit

ping @cloud-fan
```scala
case _ => true
}).isEmpty &&
!WholeStageCodegenExec.isTooManyFields(conf, relation.schema) &&
children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isEmpty
```
This check is unnecessary; this is a LeafExecNode, so it has no children.

Sure
```scala
private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = {
  val rowCount = cachedColumnarBatch.numRows
  val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema)
```
Can we reuse the OnHeapColumnVector for the cached batches? It's a little inefficient to create one column vector for each cached batch.

I agree that we can improve efficiency if we can reuse the OnHeapColumnVector.
However, I think it is not easy to reuse the OnHeapColumnVector across different cached batches.
IIUC there is no point at which we know a cached batch will no longer be referenced, so we rely on GC to manage the lifetime by creating a new OnHeapColumnVector for each cached batch.
If we reuse the OnHeapColumnVector (i.e. keep a reference to it), GC will not dispose of it even when the generated code no longer uses it. That means the uncompressed (huge) data would stay alive for a long time. If we knew the point at which a cached batch is no longer referenced, we could set the data in the OnHeapColumnVector to null.
Thus, I currently create an OnHeapColumnVector for each cached batch. What do you think?

Can't we use TaskContext.addTaskCompletionListener?

Good idea. We could set a data field (e.g. intData) in OnHeapColumnVector to null by registering a clear() method with TaskContext.addTaskCompletionListener.
In that case, I realized that we would have to reallocate the large array backing a data field such as OnHeapColumnVector.intData each time. Do we still need to care about the efficiency of allocating the OnHeapColumnVector itself, whose size is relatively small compared to that large array?
If it still makes sense, I will try to implement a clear() method and the reallocation of the large array, which may introduce a new API into OnHeapColumnVector.
What do you think?

OK, maybe leave it as a follow-up for now.

I see. Let us make a follow-up PR in the future.
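The lifetime idea discussed above can be sketched as follows. This is a hedged illustration only: FakeTaskContext and FakeColumnVector are hypothetical stand-ins for Spark's TaskContext and OnHeapColumnVector, showing how a completion listener could release a large buffer at task end:

```scala
// Stand-ins for TaskContext.addTaskCompletionListener and a reusable
// column vector; names and shapes are simplified, not Spark's real API.
object TaskCompletionSketch {
  // Runs registered listeners when the task finishes.
  final class FakeTaskContext {
    private var listeners: List[() => Unit] = Nil
    def addTaskCompletionListener(f: () => Unit): Unit = listeners ::= f
    def markTaskCompleted(): Unit = listeners.foreach(_.apply())
  }

  // Holds a large on-heap buffer, like OnHeapColumnVector.intData.
  final class FakeColumnVector(capacity: Int) {
    var intData: Array[Int] = new Array[Int](capacity)
    // Drop the large buffer so GC can reclaim it once the task is done.
    def clear(): Unit = intData = null
  }

  def main(args: Array[String]): Unit = {
    val ctx = new FakeTaskContext
    val vector = new FakeColumnVector(1024)
    // Register clear() so the buffer is released at task completion,
    // allowing the vector itself to be reused across cached batches.
    ctx.addTaskCompletionListener(() => vector.clear())
    assert(vector.intData.length == 1024)
    ctx.markTaskCompleted()
    assert(vector.intData == null)
  }
}
```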
Jenkins, retest this please

retest this please

Test build #82912 has finished for PR 18747 at commit

ping @cloud-fan
```scala
override val supportCodegen: Boolean = {
  // In the initial implementation, for ease of review
  // support only primitive data types and # of fields is less than wholeStageMaxNumFields
  val schema = StructType.fromAttributes(relation.output)
```
Isn't this just relation.schema?

Good catch.
```scala
}

override def inputRDDs(): Seq[RDD[InternalRow]] = {
  if (supportCodegen) {
```
If supportCodegen is false, I think we never call inputRDDs.

Thanks. I will insert an assertion.
```scala
private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = {
  val rowCount = cachedColumnarBatch.numRows
  val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema)
```
Is there any reason we use OnHeapColumnVector instead of OffHeapColumnVector?

Both can be used. I follow the default configuration here since it is not easy to get tungstenMemoryMode at this point.
Test build #82993 has finished for PR 18747 at commit
```scala
  WholeStageCodegenExec.isTooManyFields(conf, plan.schema)
val hasTooManyInputFields =
  plan.children.map(p => numOfNestedFields(p.schema)).exists(_ > conf.wholeStageMaxNumFields)
plan.children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isDefined
```
find(...).isDefined -> exists?
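The suggestion above rests on a standard Scala collections equivalence: seq.find(p).isDefined always equals seq.exists(p), and exists is clearer and avoids constructing an Option. A small self-contained illustration (the values here are made up for the example):

```scala
// find(p).isDefined and exists(p) are equivalent; exists reads better
// and short-circuits without allocating an Option.
object FindVsExists {
  def main(args: Array[String]): Unit = {
    val fieldCounts = Seq(3, 7, 120)
    val limit = 100
    val viaFind = fieldCounts.find(_ > limit).isDefined
    val viaExists = fieldCounts.exists(_ > limit)
    assert(viaFind == viaExists)
    assert(viaExists) // 120 exceeds the limit
  }
}
```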
```scala
override val supportCodegen: Boolean = {
  // In the initial implementation, for ease of review
  // support only primitive data types and # of fields is less than wholeStageMaxNumFields
  relation.schema.fields.find(f => f.dataType match {
```
rewrite find(...) -> forall?
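The dual of the earlier exists suggestion: fields.find(f => !p(f)).isEmpty is equivalent to fields.forall(p), which states the intent ("every field is supported") directly. A small illustration with made-up type names:

```scala
// find(!p).isEmpty and forall(p) are equivalent; forall states the
// intent directly.
object FindVsForall {
  def main(args: Array[String]): Unit = {
    val fieldTypes = Seq("int", "long", "double")
    def isPrimitive(t: String): Boolean = Set("int", "long", "double", "float")(t)
    val viaFind = fieldTypes.find(t => !isPrimitive(t)).isEmpty
    val viaForall = fieldTypes.forall(isPrimitive)
    assert(viaFind == viaForall)
    assert(viaForall) // all three types are primitive
  }
}
```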
LGTM except 2 minor comments. Can you benchmark some complex queries instead of a full scan? I was expecting to see a larger speedup via the columnar reader.

Test build #83003 has finished for PR 18747 at commit

thanks, merging to master!

Thank you for merging this. I will run another benchmark.

@cloud-fan I updated the benchmark result (2.9x) in the description.
What changes were proposed in this pull request?
This PR generates Java code to directly get a value for a column in ColumnVector without using an iterator (e.g. at lines 54-69 in the generated code example) for table cache (e.g. dataframe.cache). This PR improves runtime performance by eliminating the data copy from column-oriented storage to InternalRow in a SpecificColumnarIterator for primitive types. Another PR will support primitive-type arrays.
Benchmark result: 2.9x
Benchmark program
Motivating example
Generated code
How was this patch tested?
Added test cases to DataFrameTungstenSuite and WholeStageCodegenSuite.