@sameeragarwal
Member

What changes were proposed in this pull request?

Many SQL operators do not need to read null values to produce correct results. Currently, this is handled by performing isNotNull checks (for all relevant columns) on a per-row basis. Pushing these null filters into the vectorized Parquet reader should bring considerable benefits (especially when the underlying data contains no nulls or contains only nulls).
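
To make the idea concrete, here is a minimal sketch of batch-level null filtering. It is not the code in this patch: `NullFilteringBatch`, its `String[][]` column layout, and `survivingRows` are hypothetical names chosen for illustration, and only `filterNullsInColumn` and `nullFilteredColumns` echo identifiers that appear in the review below. The sketch marks filtered rows once per batch instead of checking every relevant column again for each row that is read.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified stand-in for a columnar batch; not Spark's ColumnarBatch. */
class NullFilteringBatch {
    // columns[ordinal][rowId]; a null entry marks a missing value (an assumption of this sketch).
    private final String[][] columns;
    private final int numRows;
    // Ordinals of the columns whose null rows should be dropped.
    private final Set<Integer> nullFilteredColumns = new HashSet<>();

    NullFilteringBatch(String[][] columns, int numRows) {
        this.columns = columns;
        this.numRows = numRows;
    }

    /** Requests that any row with a null in the given column be filtered out. */
    void filterNullsInColumn(int ordinal) {
        nullFilteredColumns.add(ordinal);
    }

    /** Marks filtered rows in one pass per filtered column, then returns the surviving row ids. */
    List<Integer> survivingRows() {
        boolean[] filtered = new boolean[numRows];
        for (int ordinal : nullFilteredColumns) {
            String[] column = columns[ordinal];
            for (int row = 0; row < numRows; row++) {
                if (column[row] == null) {
                    filtered[row] = true;
                }
            }
        }
        List<Integer> surviving = new ArrayList<>();
        for (int row = 0; row < numRows; row++) {
            if (!filtered[row]) {
                surviving.add(row);
            }
        }
        return surviving;
    }
}
```

With no filtered columns, every row survives, so callers that do need nulls are unaffected in this sketch.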

How was this patch tested?

    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
    String with Nulls Scan (0%):        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    SQL Parquet Vectorized                   1229 / 1648          8.5         117.2       1.0X
    PR Vectorized                             833 /  846         12.6          79.4       1.5X
    PR Vectorized (Null Filtering)            732 /  782         14.3          69.8       1.7X

    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
    String with Nulls Scan (50%):       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    SQL Parquet Vectorized                    995 / 1053         10.5          94.9       1.0X
    PR Vectorized                             732 /  772         14.3          69.8       1.4X
    PR Vectorized (Null Filtering)            725 /  790         14.5          69.1       1.4X

    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
    String with Nulls Scan (95%):       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -------------------------------------------------------------------------------------------
    SQL Parquet Vectorized                    326 /  333         32.2          31.1       1.0X
    PR Vectorized                             190 /  200         55.1          18.2       1.7X
    PR Vectorized (Null Filtering)            168 /  172         62.2          16.1       1.9X
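
As a reading aid for the tables above: the Relative column is consistent with dividing the baseline's best time by each variant's best time and rounding to one decimal place. The sketch below checks that arithmetic against the reported numbers. `RelativeSpeedup` and `relative` are hypothetical names, and the formula is inferred from the figures rather than stated in this PR.

```java
import java.util.Locale;

class RelativeSpeedup {
    /** Baseline best time over variant best time, formatted like the Relative column. */
    static String relative(double baselineBestMs, double variantBestMs) {
        return String.format(Locale.ROOT, "%.1fX", baselineBestMs / variantBestMs);
    }

    public static void main(String[] args) {
        // Best times (ms) copied from the 0% and 95% null tables above.
        System.out.println(relative(1229, 833)); // PR Vectorized, 0% nulls
        System.out.println(relative(1229, 732)); // PR Vectorized (Null Filtering), 0% nulls
        System.out.println(relative(326, 168));  // PR Vectorized (Null Filtering), 95% nulls
    }
}
```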

@sameeragarwal
Member Author

cc @nongli

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53253 has finished for PR 11749 at commit af217fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Instead of commenting it this way, can you put the fraction in the benchmark name?

e.g. String with Nulls Scan (95%)

@nongli
Contributor

nongli commented Mar 16, 2016

Can you add some test cases to ColumnarBatchSuite that exercise this?

@sameeragarwal
Member Author

thanks, all comments addressed!

@sameeragarwal changed the title from "[SPARK-13922][SQL] Filter rows with null attributes in parquet vectorized reader" to "[SPARK-13922][SQL] Filter rows with null attributes in vectorized parquet reader" on Mar 16, 2016
@SparkQA

SparkQA commented Mar 16, 2016

Test build #53338 has finished for PR 11749 at commit 2d1066f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

       * attribute is filtered out.
       */
      public final void filterNullsInColumn(int ordinal) {
        assert(!nullFilteredColumns.contains(ordinal));
Contributor

I don't think this assert is necessary. I think this is perfectly valid and makes this api easier to use.

Member Author

yes, with the set it's no longer required. Fixed.
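
The point about the set can be sketched as follows. This assumes `nullFilteredColumns` is a `java.util.Set<Integer>`, which the thread implies but does not show. `NullFilterRegistry`, the `boolean` return value, and `filteredColumnCount` are hypothetical additions for illustration, since the method in the diff is declared `void`. Because `Set.add` ignores duplicates, registering the same ordinal twice is harmless and needs no guard.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical holder for the filtered column ordinals; not the class touched by this patch. */
class NullFilterRegistry {
    private final Set<Integer> nullFilteredColumns = new HashSet<>();

    /**
     * Registers a column for null filtering. Repeated calls with the same ordinal
     * are no-ops. Returns true only when the ordinal was newly added (a return
     * value added for this sketch).
     */
    boolean filterNullsInColumn(int ordinal) {
        return nullFilteredColumns.add(ordinal);
    }

    /** Number of distinct columns registered so far. */
    int filteredColumnCount() {
        return nullFilteredColumns.size();
    }
}
```

A caller can therefore register columns from several predicates without first deduplicating them.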

@nongli
Contributor

nongli commented Mar 16, 2016

LGTM

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53354 has finished for PR 11749 at commit 0688cf8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in b90c020 on Mar 16, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
…quet reader

Author: Sameer Agarwal <[email protected]>

Closes apache#11749 from sameeragarwal/perf-testing.