Skip to content

Conversation

@cloud-fan
Copy link
Contributor

As we begin to use unsafe row writing framework(BufferHolder and UnsafeRowWriter) in more and more places(UnsafeProjection, UnsafeRowParquetRecordReader, GenerateColumnAccessor, etc.), we should add more doc to it and make it easier to use.

This PR abstract the technique used in UnsafeRowParquetRecordReader: avoid unnecessary operatition as more as possible. For example, do not always point the row to the buffer at the end, we only need to update the size of row. If all fields are of primitive type, we can even save the row size updating. Then we can apply this technique to more places easily.

a local benchmark shows UnsafeProjection is up to 1.7x faster after this PR:
old version

Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                             2616.04           102.61         1.00 X
single nullable long                    3032.54            88.52         0.86 X
primitive types                         9121.05            29.43         0.29 X
nullable primitive types               12410.60            21.63         0.21 X

new version

Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
unsafe projection:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------------------
single long                             1533.34           175.07         1.00 X
single nullable long                    2306.73           116.37         0.66 X
primitive types                         8403.93            31.94         0.18 X
nullable primitive types               12448.39            21.56         0.12 X

For single non-nullable long(the best case), we can have about 1.7x speed up. Even it's nullable, we can still have 1.3x speed up. For other cases, it's not such a boost as the saved operations only take a little proportion of the whole process. The benchmark code is included in this PR.

@cloud-fan
Copy link
Contributor Author

cc @davies @nongli

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I made a different decision compare to the unsafe parquet reader. We can clear out the null bits at beginning, and call UnsafeRowWriter.write instead of UnsafeRow.setXXX, which saves one null bits updating. If null values are rare, this one should be faster. I'll benchmark it later.
cc @nongli

@SparkQA
Copy link

SparkQA commented Jan 18, 2016

Test build #49602 has finished for PR 10809 at commit 3978711.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullBytes

@SparkQA
Copy link

SparkQA commented Jan 18, 2016

Test build #49606 has finished for PR 10809 at commit 9a63852.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I made a different decision compare to the unsafe parquet reader. We can clear out the null bits at beginning, and call UnsafeRowWriter.write instead of UnsafeRow.setXXX, which saves one null bits updating. If null values are rare, this one should be faster. I'll benchmark it later.
cc @nongli

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense for me.

@SparkQA
Copy link

SparkQA commented Jan 19, 2016

Test build #49647 has finished for PR 10809 at commit 5567ef1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Copy link
Contributor

nongli commented Jan 19, 2016

LGTM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new record mean nested struct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the record that this writer is responsible to write, it can be the whole row record, or a nested struct, or even a struct type element in array.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the new record is not clear to me, it should be nested struct.

@SparkQA
Copy link

SparkQA commented Jan 26, 2016

Test build #50026 has finished for PR 10809 at commit f79f63c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor

davies commented Jan 26, 2016

LGTM, merging this into master, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants