[SPARK-15956] [SQL] When unwrapping ORC avoid pattern matching at runtime #13676
Can one of the admins verify this patch?
r => unwrap(si.getStructFieldData(data, r), r.getFieldObjectInspector)))
def unwrap(data: Any, oi: ObjectInspector): Any = {
  val unwrapper = unwrapperFor(oi)
  unwrapper(data)
how would this actually improve performance if you are doing it per row?
There is no improvement when calling unwrap directly. This change moves the unwrap logic to unwrapperFor. This improves performance for ORC files, because OrcFileFormat instead calls unwrapperFor to return a function for each field, then unwraps each row with this function (see https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala#L356).
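The difference between matching per row and building the function once can be sketched as follows. This is a minimal illustration only: `Inspector`, `WrappedInt`, `WrappedString`, and `unwrapRow` are hypothetical stand-ins, not the Hive ObjectInspector API or Spark's actual code.

```scala
object UnwrapSketch {
  // Hypothetical stand-ins for Hive ObjectInspectors (not the real API).
  sealed trait Inspector
  case object IntInspector extends Inspector
  case object StringInspector extends Inspector
  final case class ListInspector(element: Inspector) extends Inspector

  // Hypothetical wrapped values, standing in for Hive writables.
  final case class WrappedInt(value: Int)
  final case class WrappedString(value: String)

  // The match on the inspector runs once, when the unwrapper is built.
  // The returned function does no pattern matching on the inspector.
  def unwrapperFor(oi: Inspector): Any => Any = oi match {
    case IntInspector =>
      (data: Any) => if (data != null) data.asInstanceOf[WrappedInt].value else null
    case StringInspector =>
      (data: Any) => if (data != null) data.asInstanceOf[WrappedString].value else null
    case ListInspector(element) =>
      // The element unwrapper is also resolved once, outside the closure.
      val unwrapElement = unwrapperFor(element)
      (data: Any) => if (data != null) data.asInstanceOf[Seq[Any]].map(unwrapElement) else null
  }

  // Built once per schema: one unwrapper per field.
  val fieldInspectors: Seq[Inspector] = Seq(IntInspector, ListInspector(StringInspector))
  val unwrappers: Seq[Any => Any] = fieldInspectors.map(unwrapperFor)

  // Applied per row: only function calls, no inspector matching.
  def unwrapRow(row: Seq[Any]): Seq[Any] =
    row.zip(unwrappers).map { case (value, unwrapper) => unwrapper(value) }
}
```

Calling `unwrapperFor(oi)(data)` for every value would redo the match each time, which is why a plain `unwrap(data, oi)` wrapper gains nothing.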
I think this change skips the per-row case match for complex data types in unwrap, because the unwrap for a complex data type returns a function. BTW, it also improves HadoopTableReader.
can we work to remove the unwrap function here?
@rxin thanks for the review. I've added a commit that removes
I took a quick look and this looked reasonable. Would be great for somebody else to look at it more carefully too. cc @hvanhovell
LGTM, pending jenkins
}
case poi: WritableConstantIntObjectInspector =>
  data: Any =>
    poi.getWritableConstantValue.get()
Isn't it faster to call poi.getWritableConstantValue.get() outside of the function? And use the result in the function? Or am I missing something here? The same goes for all other constants.
You're right, as the contract for a ConstantObjectInspector is that its object "represent constant values and can return them without an evaluation" [1]. I will make this change.
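A minimal sketch of the suggested hoisting, assuming a hypothetical `ConstantIntInspector` that counts how often its constant is read (it stands in for a constant inspector and is not the Hive class):

```scala
object ConstantUnwrapSketch {
  // Hypothetical stand-in for a constant inspector; `reads` counts lookups.
  final class ConstantIntInspector(constant: Int) {
    var reads = 0
    def getConstantValue: Int = { reads += 1; constant }
  }

  // Reads the constant inside the closure, so once per unwrapped value.
  def perCall(poi: ConstantIntInspector): Any => Any =
    (_: Any) => poi.getConstantValue

  // Reads the constant once, when the unwrapper is built, and captures it.
  def hoisted(poi: ConstantIntInspector): Any => Any = {
    val constant = poi.getConstantValue
    (_: Any) => constant
  }
}
```

Both variants return the same value for every input; the hoisted one just moves the lookup out of the per-row path, which is safe only because the value is constant.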
This looks pretty good. What I am thinking is that generating an encoder could create another nice performance speedup here.
fields.map(_.getFieldObjectInspector).map(unwrapperFor))
data: Any => {
  if (data != null) {
    InternalRow.fromSeq(fieldsToUnwrap.map(
You can just use a pattern match here to get to the field and the unwrap function directly, i.e.:
InternalRow.fromSeq(fieldsToUnwrap.map { case (field, unwrap) =>
  unwrap(si.getStructFieldData(data, field))
})
That makes it a bit more readable.
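For reference, the same shape as a self-contained sketch. `Field`, the map-based `getStructFieldData`, and `structUnwrapper` are hypothetical stand-ins for the struct inspector and `InternalRow.fromSeq`, used here only so the snippet compiles without Hive or Spark on the classpath:

```scala
object StructUnwrapSketch {
  // Hypothetical stand-in for a struct field reference.
  final case class Field(name: String)

  // Hypothetical stand-in for si.getStructFieldData: rows are plain maps here.
  def getStructFieldData(data: Map[String, Any], field: Field): Any =
    data.getOrElse(field.name, null)

  // Pairs each field with its unwrapper once, then destructures the pair
  // with a pattern match instead of indexing into the tuple.
  def structUnwrapper(
      fields: Seq[Field],
      unwrappers: Seq[Any => Any]): Map[String, Any] => Seq[Any] = {
    val fieldsToUnwrap = fields.zip(unwrappers)
    (data: Map[String, Any]) =>
      if (data != null) {
        fieldsToUnwrap.map { case (field, unwrap) =>
          unwrap(getStructFieldData(data, field))
        }
      } else {
        null
      }
  }
}
```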
Thanks @hvanhovell I've pushed a commit addressing your comments.
LGTM - merging to master.
Thanks :)
## What changes were proposed in this pull request?
Extend the returning of unwrapper functions from primitive types to all types.

This PR is based on #13676. It only fixes a bug with scala-2.10 compilation. All credit should go to dafrista.

## How was this patch tested?
The patch should pass all unit tests. Reading ORC files with non-primitive types with this change reduced the read time by ~15%.

Author: Brian Cho
Author: Herman van Hovell

Closes #13854 from hvanhovell/SPARK-15956-scala210.
What changes were proposed in this pull request?
Extend the returning of unwrapper functions from primitive types to all types.
How was this patch tested?
The patch should pass all unit tests. Reading ORC files with non-primitive types with this change reduced the read time by ~15%.
The GitHub diff is very noisy. Attaching the screenshots below for improved readability: