[SPARK-16060][SQL][follow-up] add a wrapper solution for vectorized orc reader #20205
Conversation
for (WritableColumnVector vector : columnVectors) {
  vector.reset();
}
columnarBatch.setNumRows(0);
@cloud-fan, can we keep this, like Parquet? At the final empty batch, we need to clear this up.
Yep. I meant keeping it here, since we return at lines 390 and 240. Parquet does the same.
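For context, a minimal sketch of the pattern being discussed follows. It is a method-level fragment, not the PR's exact code; the names (`nextBatch`, `columnVectors`, `columnarBatch`, `recordReader`, `batch`) are illustrative. The point is that the batch is cleared up front, so the final, empty read already leaves the columnar batch in a clean state, matching the vectorized Parquet reader.

```java
// Illustrative sketch only; names are assumed, not the PR's exact code.
private boolean nextBatch() throws IOException {
  // Clear out whatever the previous call left behind.
  for (WritableColumnVector vector : columnVectors) {
    vector.reset();
  }
  columnarBatch.setNumRows(0);

  // On the final, empty batch we return here; the reset above has already
  // cleaned everything up, which mirrors the vectorized Parquet reader.
  if (!recordReader.nextBatch(batch) || batch.size == 0) {
    return false;
  }

  // ... copy or wrap the ORC column vectors, then publish the row count ...
  columnarBatch.setNumRows(batch.size);
  return true;
}
```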
dongjoon-hyun left a comment
Thank you so much!
LGTM except one minor comment.
BTW, if you don't mind, could you update the following? It was @viirya's comment, so I made a follow-up patch, but we had better include it in your PR. Making another follow-up PR would be overkill. :)
Test build #85857 has finished for PR 20205 at commit
Test build #85901 has finished for PR 20205 at commit
thanks, merging to master/2.3!
[SPARK-16060][SQL][follow-up] add a wrapper solution for vectorized orc reader

## What changes were proposed in this pull request?

This is mostly from #13775

The wrapper solution is pretty good for string/binary types, as the ORC column vector doesn't keep bytes in a contiguous memory region, and there is significant overhead when copying the data to a Spark columnar batch. For other cases, the wrapper solution is almost the same as the current solution.

I think we can treat the wrapper solution as a baseline and keep improving the write-to-Spark (copy) solution.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <[email protected]>

Closes #20205 from cloud-fan/orc.

(cherry picked from commit eaac60a)
Signed-off-by: Wenchen Fan <[email protected]>
int colId = requestedColIds[i];
// Initialize the missing columns once.
if (colId == -1) {
  OnHeapColumnVector missingCol = new OnHeapColumnVector(capacity, dt);
nit: Shouldn't we respect the MEMORY_MODE parameter here?
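A minimal sketch of what honoring the memory mode could look like here; this assumes a `memoryMode` value (`MemoryMode.ON_HEAP` vs `MemoryMode.OFF_HEAP`) is available to the reader, which is not shown in the diff above, and it is not the PR's actual resolution.

```java
// Hypothetical sketch: allocate the constant null column according to the
// configured memory mode instead of always using the on-heap implementation.
WritableColumnVector missingCol;
if (memoryMode == MemoryMode.OFF_HEAP) {
  missingCol = new OffHeapColumnVector(capacity, dt);
} else {
  missingCol = new OnHeapColumnVector(capacity, dt);
}
missingCol.putNulls(0, capacity);
missingCol.setIsConstant();
```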
int partitionIdx = requiredFields.length;
for (int i = 0; i < partitionValues.numFields(); i++) {
  DataType dt = partitionSchema.fields()[i].dataType();
  OnHeapColumnVector partitionCol = new OnHeapColumnVector(capacity, dt);
ditto.
import org.apache.spark.unsafe.types.UTF8String;
/**
 * A column vector class wrapping Hive's ColumnVector. Because Spark ColumnarBatch only accepts
I think it is not Hive's ColumnVector, but ORC's ColumnVector.
                                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
Hive built-in ORC                              2940 / 2952          3.6         280.4       0.8X
Native ORC MR                                  2234 / 2255          4.7         213.1       1.0X
Native ORC Vectorized                           854 /  869         12.3          81.4       2.6X
Native ORC Vectorized with copy                1099 / 1128          9.5         104.8       2.0X
Looks like the wrapper approach is usually faster than the copy approach.
For the long term, maybe we can consider removing the copy approach to simplify the code.
I'm busy relocating, so sorry for not reviewing promptly. LGTM with a few minor comments.
What changes were proposed in this pull request?
This is mostly from #13775
The wrapper solution is pretty good for string/binary types, as the ORC column vector doesn't keep bytes in a contiguous memory region, and there is significant overhead when copying the data to a Spark columnar batch. For other cases, the wrapper solution is almost the same as the current solution.
I think we can treat the wrapper solution as a baseline and keep improving the write-to-Spark (copy) solution. A small illustrative sketch of the wrapper idea follows.
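To illustrate why wrapping helps specifically for string/binary columns, here is a small, hypothetical sketch. The real class added by this PR is `OrcColumnVector`; the class name below and the assumption of the "nohive" ORC package are mine. ORC's `BytesColumnVector` stores each value as a (byte[], start, length) triple, so a wrapper can expose the value as a `UTF8String` view instead of copying every row into a contiguous Spark column buffer.

```java
import org.apache.orc.storage.ql.exec.vector.BytesColumnVector;
import org.apache.spark.unsafe.types.UTF8String;

// Hypothetical sketch of the wrapper idea for string columns; not the PR's code.
class StringColumnWrapperSketch {
  private final BytesColumnVector orcVector;

  StringColumnWrapperSketch(BytesColumnVector orcVector) {
    this.orcVector = orcVector;
  }

  // Wrapper approach: hand ORC's (bytes, start, length) triple to Spark as a
  // UTF8String view, avoiding a per-row copy into a WritableColumnVector.
  UTF8String getUTF8String(int rowId) {
    int idx = orcVector.isRepeating ? 0 : rowId;
    return UTF8String.fromBytes(
        orcVector.vector[idx], orcVector.start[idx], orcVector.length[idx]);
  }
}
```

The copy approach, by contrast, has to write each row's bytes into a Spark `WritableColumnVector`, which is where the overhead mentioned above comes from.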
How was this patch tested?
existing tests.