Skip to content

Conversation

@davies
Copy link
Contributor

@davies davies commented Feb 29, 2016

What changes were proposed in this pull request?

This PR defer the resolution from a id of dictionary to value until the column is actually accessed (inside getInt/getLong), this is very useful for those columns and rows that are filtered out. It's also useful for binary type, we will not need to copy all the byte arrays.

This PR also change the underlying type for small decimal that could be fit within a Int, in order to use getInt() to lookup the value from IntDictionary.

How was this patch tested?

Manually test TPCDS Q7 with scale factor 10, saw about 30% improvements (after PR #11274).

@davies
Copy link
Contributor Author

davies commented Feb 29, 2016

cc @nongli

@SparkQA
Copy link

SparkQA commented Feb 29, 2016

Test build #52202 has finished for PR 11437 at commit 6676e74.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 1, 2016

Test build #52205 has finished for PR 11437 at commit 081e6fe.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 1, 2016

Test build #52206 has finished for PR 11437 at commit 5faa786.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 1, 2016

Test build #52207 has finished for PR 11437 at commit 6fce801.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 1, 2016

Test build #2593 has finished for PR 11437 at commit 6fce801.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

int num = Math.min(total, leftInPage);
if (useDictionary) {
// Data is dictionary encoded. We will vector decode the ids and then resolve the values.
if (dictionaryIds == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove dictionaryIds from this class.

@nongli
Copy link
Contributor

nongli commented Mar 1, 2016

Can you run the ColumnarBatch/ParquetRead benchmark? Does this have perf problems if there is no dictionary or there is no filter?

@davies
Copy link
Contributor Author

davies commented Mar 1, 2016

@nongli There is no visible difference on all existing benchmarks (ColumnarBatch and ParquetRead), they don't use dictionary encoding.

After changed the intStringScan to use dictionary encoding (small number unique values), here is the result:

Before this patch

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
Int and String Scan:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Reader                       1248 / 1281          8.4         119.0       1.0X
SQL Parquet MR                           1962 / 2093          5.3         187.1       0.6X
SQL Parquet Vectorized                    876 / 1018         12.0          83.5       1.4X
ParquetReader                             741 /  755         14.1          70.7       1.7X

After the patch

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
Int and String Scan:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Reader                       1247 / 1279          8.4         118.9       1.0X
SQL Parquet MR                           1809 / 1851          5.8         172.5       0.7X
SQL Parquet Vectorized                    805 /  909         13.0          76.8       1.5X
ParquetReader                             742 /  756         14.1          70.7       1.7X

We can see 10% improvement on SQL Parquet Vectorized, but no difference on ParquetReader, I don't know why. (I didn't included #11274 )

@nongli
Copy link
Contributor

nongli commented Mar 1, 2016

Cool. Lgtm

@SparkQA
Copy link

SparkQA commented Mar 1, 2016

Test build #52251 has finished for PR 11437 at commit e539d8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor Author

davies commented Mar 1, 2016

Merging this into master.

@asfgit asfgit closed this in c27ba0d Mar 1, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
## What changes were proposed in this pull request?

This PR defer the resolution from a id of dictionary to value until the column is actually accessed (inside getInt/getLong), this is very useful for those columns and rows that are filtered out. It's also useful for binary type, we will not need to copy all the byte arrays.

This PR also change the underlying type for small decimal that could be fit within a Int, in order to use getInt() to lookup the value from IntDictionary.

## How was this patch tested?

Manually test TPCDS Q7 with scale factor 10, saw about 30% improvements (after PR apache#11274).

Author: Davies Liu <[email protected]>

Closes apache#11437 from davies/decode_dict.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants