Skip to content

Conversation

@bchocho
Copy link
Contributor

@bchocho bchocho commented Jun 15, 2016

What changes were proposed in this pull request?

Extend the returning of unwrapper functions from primitive types to all types.

How was this patch tested?

The patch should pass all unit tests. Reading ORC files with non-primitive types with this change reduced the read time by ~15%.

The github diff is very noisy. Attaching the screenshots below for improved readability:

screen shot 2016-06-14 at 5 33 16 pm

screen shot 2016-06-14 at 5 33 28 pm

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

r => unwrap(si.getStructFieldData(data, r), r.getFieldObjectInspector)))
def unwrap(data: Any, oi: ObjectInspector): Any = {
val unwrapper = unwrapperFor(oi)
unwrapper(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would this actually improve performance if you are doing it per row?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no improvement when calling unwrap directly. This change moves the unwrap logic to unwrapperFor. This improves performance for ORC files, because OrcFileFormat instead calls unwrapperFor to return a function for each field, then unwraps each row with this function (see https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala#L356).

Copy link
Contributor

@lianhuiwang lianhuiwang Jun 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change can skip case match per row for complex data type in unwrap. because unwrap for complex data type return a function. btw: it also improve for HadoopTableReader.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we work to remove the unwrap function here?

@bchocho
Copy link
Contributor Author

bchocho commented Jun 15, 2016

@rxin thanks for the review. I've added a commit that removes unwrap by replacing with the unwrapperFor pattern. (I've moved unwrap to tests, where it's much cleaner and no need for top performance.)

@rxin
Copy link
Contributor

rxin commented Jun 15, 2016

I took a quick look and this looked reasonable. Would be great for somebody else to look at it more carefully too.

cc @hvanhovell

@lianhuiwang
Copy link
Contributor

lianhuiwang commented Jun 16, 2016

LGTM, pending jenkins

}
case poi: WritableConstantIntObjectInspector =>
data: Any =>
poi.getWritableConstantValue.get()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it faster to call poi.getWritableConstantValue.get() outside of the function? And use the result in the function? Or am I missing something here? The same goes for all other constants.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, as the contract for a ConstantObjectInspector is that its object "represent constant values and can return them without an evaluation" [1]. I will make this change.

@hvanhovell
Copy link
Contributor

This looks pretty good. What I am thinking is that generating an encoder could create another nice performance speedup here.

fields.map(_.getFieldObjectInspector).map(unwrapperFor))
data: Any => {
if (data != null) {
InternalRow.fromSeq(fieldsToUnwrap.map(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just can use a pattern match here to get to the field and the unwrap function directly, i.e.:

InternalRow.fromSeq(fieldsToUnwrap.map { case (field, unwrap) =>
  unwrap(si.getStructFieldData(data, field))
}

That makes it a bit more readable.

@bchocho
Copy link
Contributor Author

bchocho commented Jun 20, 2016

Thanks @hvanhovell I've pushed a commit addressing your comments.

@hvanhovell
Copy link
Contributor

LGTM - merging to master.

@hvanhovell
Copy link
Contributor

Thanks :)

@asfgit asfgit closed this in 0a9c027 Jun 22, 2016
asfgit pushed a commit that referenced this pull request Jun 22, 2016
## What changes were proposed in this pull request?

Extend the returning of unwrapper functions from primitive types to all types.

This PR is based on #13676. It only fixes a bug with scala-2.10 compilation. All credit should go to dafrista.

## How was this patch tested?

The patch should pass all unit tests. Reading ORC files with non-primitive types with this change reduced the read time by ~15%.

Author: Brian Cho <[email protected]>
Author: Herman van Hovell <[email protected]>

Closes #13854 from hvanhovell/SPARK-15956-scala210.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants