[SPARK-13582] [SQL] defer dictionary decoding in parquet reader #11437

davies · 2016-02-29T23:17:49Z

What changes were proposed in this pull request?

This PR defer the resolution from a id of dictionary to value until the column is actually accessed (inside getInt/getLong), this is very useful for those columns and rows that are filtered out. It's also useful for binary type, we will not need to copy all the byte arrays.

This PR also change the underlying type for small decimal that could be fit within a Int, in order to use getInt() to lookup the value from IntDictionary.

How was this patch tested?

Manually test TPCDS Q7 with scale factor 10, saw about 30% improvements (after PR #11274).

davies · 2016-02-29T23:17:56Z

cc @nongli

SparkQA · 2016-02-29T23:24:07Z

Test build #52202 has finished for PR 11437 at commit 6676e74.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-01T00:53:53Z

Test build #52205 has finished for PR 11437 at commit 081e6fe.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-01T01:13:46Z

Test build #52206 has finished for PR 11437 at commit 5faa786.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-01T02:45:00Z

Test build #52207 has finished for PR 11437 at commit 6fce801.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-01T08:01:09Z

Test build #2593 has finished for PR 11437 at commit 6fce801.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nongli · 2016-03-01T16:54:58Z

...in/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java

        int num = Math.min(total, leftInPage);
        if (useDictionary) {
-          // Data is dictionary encoded. We will vector decode the ids and then resolve the values.
-          if (dictionaryIds == null) {


Remove dictionaryIds from this class.

nongli · 2016-03-01T16:58:40Z

Can you run the ColumnarBatch/ParquetRead benchmark? Does this have perf problems if there is no dictionary or there is no filter?

davies · 2016-03-01T18:37:09Z

@nongli There is no visible difference on all existing benchmarks (ColumnarBatch and ParquetRead), they don't use dictionary encoding.

After changed the intStringScan to use dictionary encoding (small number unique values), here is the result:

Before this patch

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
Int and String Scan:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Reader                       1248 / 1281          8.4         119.0       1.0X
SQL Parquet MR                           1962 / 2093          5.3         187.1       0.6X
SQL Parquet Vectorized                    876 / 1018         12.0          83.5       1.4X
ParquetReader                             741 /  755         14.1          70.7       1.7X

After the patch

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
Int and String Scan:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Reader                       1247 / 1279          8.4         118.9       1.0X
SQL Parquet MR                           1809 / 1851          5.8         172.5       0.7X
SQL Parquet Vectorized                    805 /  909         13.0          76.8       1.5X
ParquetReader                             742 /  756         14.1          70.7       1.7X

We can see 10% improvement on SQL Parquet Vectorized, but no difference on ParquetReader, I don't know why. (I didn't included #11274 )

nongli · 2016-03-01T18:40:23Z

Cool. Lgtm

SparkQA · 2016-03-01T20:53:31Z

Test build #52251 has finished for PR 11437 at commit e539d8a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-03-01T21:06:39Z

Merging this into master.

## What changes were proposed in this pull request? This PR defer the resolution from a id of dictionary to value until the column is actually accessed (inside getInt/getLong), this is very useful for those columns and rows that are filtered out. It's also useful for binary type, we will not need to copy all the byte arrays. This PR also change the underlying type for small decimal that could be fit within a Int, in order to use getInt() to lookup the value from IntDictionary. ## How was this patch tested? Manually test TPCDS Q7 with scale factor 10, saw about 30% improvements (after PR apache#11274). Author: Davies Liu <[email protected]> Closes apache#11437 from davies/decode_dict.

defer dictionary decoding

6676e74

fix style

081e6fe

fix decimal

6fce801

davies force-pushed the decode_dict branch from 5faa786 to 6fce801 Compare March 1, 2016 01:18

nongli reviewed Mar 1, 2016
View reviewed changes

address comment

e539d8a

asfgit closed this in c27ba0d Mar 1, 2016

cloud-fan mentioned this pull request Jul 19, 2017

[SPARK-20783][SQL] Create CachedBatchColumnVector to abstract existing compressed column #18468

Closed

[SPARK-13582] [SQL] defer dictionary decoding in parquet reader #11437

[SPARK-13582] [SQL] defer dictionary decoding in parquet reader #11437

Uh oh!

Conversation

davies commented Feb 29, 2016

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

davies commented Feb 29, 2016

Uh oh!

SparkQA commented Feb 29, 2016

Uh oh!

SparkQA commented Mar 1, 2016

Uh oh!

SparkQA commented Mar 1, 2016

Uh oh!

SparkQA commented Mar 1, 2016

Uh oh!

SparkQA commented Mar 1, 2016

Uh oh!

nongli Mar 1, 2016

Choose a reason for hiding this comment

Uh oh!

nongli commented Mar 1, 2016

Uh oh!

davies commented Mar 1, 2016

Uh oh!

nongli commented Mar 1, 2016

Uh oh!

SparkQA commented Mar 1, 2016

Uh oh!

davies commented Mar 1, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants