[SPARK-SQL] HiveTableScan operator Performance Improvement #456
Conversation
Can one of the admins verify this patch?

Jenkins, this is ok to test.

Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14261/

Mind fixing the tab characters so we can test this?

Sure, I am looking at it right now.

Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14295/

BTW, you can check style locally.

Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14300/

The problem was that I used vim for coding and it screwed up the tabbing for some reason. I ran sbt scalastyle and it now succeeds locally.

I am a little confused why it does not complain on my machine locally but produces errors here...

Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14304/
Typo in here
Build triggered.
Build started.
Build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14496/

Build triggered.
Build started.
Build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14497/
## Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)

Not filed upstream; this touches code for conda.

## What changes were proposed in this pull request?

rLibDir contains a sequence of possible paths for the SparkR package on the executor and is passed to the R daemon via the SPARKR_RLIBDIR environment variable. This PR filters rLibDir for paths that exist before setting SPARKR_RLIBDIR, retaining the existing behavior of preferring a YARN or local SparkR install over conda when both are present. See daemon.R: https://github.com/palantir/spark/blob/master/R/pkg/inst/worker/daemon.R#L23

Fixes apache#456

## How was this patch tested?

Manually tested, cherry-picked on an older version.
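As a minimal sketch of the filtering this describes (the paths and value names are illustrative assumptions, not the PR's exact code):

```
import java.io.File

// Hedged sketch, not the PR's exact code: keep only the candidate SparkR
// library directories that exist, preserving their priority order so a
// YARN or local SparkR install is still preferred over conda when both
// are present.
val rLibDir: Seq[String] =
  Seq("/opt/yarn/sparkr/lib", "/opt/conda/envs/spark/lib/R/library") // illustrative paths

val existing: Seq[String] = rLibDir.filter(dir => new File(dir).isDirectory)

// In Spark this value would be placed in the R daemon's environment;
// daemon.R then picks the first usable entry.
val sparkRLibDirValue: String = existing.mkString(",")
```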
It will add two periodic jobs of integration tests of helm with a native k8s cluster, using the v2.2.0 chart-testing tool:
1. helm 2.12.2 with kubernetes v1.12.7
2. helm 2.12.2 with kubernetes v1.13.4

Closes: theopenlab/openlab#212
Closes: theopenlab/openlab#213
…e functions into projections
### What changes were proposed in this pull request?
This PR filters out `ExtractValue`s that contain any aggregate function in the `NestedColumnAliasing` rule, to prevent aggregations from being pushed down into projections.
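As a rough sketch of the guard this implies (the helper name is an assumption, not the PR's exact code):

```
// Hypothetical helper, not the PR's exact code: an ExtractValue chain whose
// subtree contains an aggregate function must not be aliased into a child
// Project, because the aggregate can only be evaluated by an Aggregate node.
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression

def containsAggregateFunction(extractValueChain: Expression): Boolean =
  extractValueChain.find(_.isInstanceOf[AggregateExpression]).isDefined
```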
### Why are the changes needed?
To handle a corner/missed case in `NestedColumnAliasing` that can cause users to encounter a runtime exception.
Consider the following schema:
```
root
|-- a: struct (nullable = true)
| |-- c: struct (nullable = true)
| | |-- e: string (nullable = true)
| |-- d: integer (nullable = true)
|-- b: string (nullable = true)
```
and the query:
`SELECT MAX(a).c.e FROM (SELECT a, b FROM test_aggregates) GROUP BY b`
Executing the query before this PR will result in the error:
```
java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[0, struct<c:struct<e:string>,d:int>, true])
at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGenerateCodeForExpressionError(QueryExecutionErrors.scala:83)
at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:312)
at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:311)
at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:99)
...
```
The optimised plan before this PR is:
```
'Aggregate [b#1], [_extract_e#5 AS max(a).c.e#3]
+- 'Project [max(a#0).c.e AS _extract_e#5, b#1]
+- Relation default.test_aggregates[a#0,b#1] parquet
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
A new unit test in `NestedColumnAliasingSuite`. The test consists of the repro mentioned earlier.
The produced optimized plan is checked for equivalency with a plan of the form:
```
Aggregate [b#452], [max(a#451).c.e AS max('a)[c][e]#456]
+- LocalRelation <empty>, [a#451, b#452]
```
Closes #33921 from vicennial/spark-36677.
Authored-by: Venkata Sai Akhil Gudesa <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
(cherry picked from commit 2ed6e7b)
Signed-off-by: Liang-Chi Hsieh <[email protected]>
… scale < 0 (apache#456)" (apache#467)
This reverts commit 0017da5.
…AtomicCreateTableAsSelectExec (apache#456)
The goal is to improve the performance of the HiveTableScan operator.
As a quick benchmark, run the following code in the Scala interpreter:
```
scala> :paste
hql("CREATE TABLE IF NOT EXISTS sample (key1 INT, key2 INT, value STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/sample2.txt' INTO TABLE sample")
println("Result of SELECT * FROM sample:")
val start = System.nanoTime
val recs = hql("FROM sample SELECT key1, key2, value").collect()
val micros = (System.nanoTime - start) / 1000
println("%d microseconds".format(micros))
// press CTRL-D to run the pasted block
```
You can download the test file from here:
http://homes.cs.washington.edu/~soroush/sample2.txt
sample2.txt contains about 3.6 million rows. The improved code scans the entire table in about 9 seconds, while the original code scans it in about 22 seconds.
Regarding the last item in the task:
"Avoid Reading Unneeded Data - Some Hive Serializer/Deserializer (SerDe) interfaces support reading only the required columns from the underlying HDFS files. We should use ColumnProjectionUtils to configure these correctly."
The way to do it should be similar to the following code:
https://github.com/amplab/shark/blob/master/src/main/scala/shark/execution/TableScanOperator.scala
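As a rough, hedged illustration of what configuring `ColumnProjectionUtils` could look like (the column ids are illustrative, and the exact method name varies across Hive versions, e.g. `appendReadColumnIDs` in older releases):

```
// Hedged sketch, not working Spark code: tell a column-aware Hive SerDe,
// via the job Configuration, which column ids it actually needs to read.
import java.util.{ArrayList => JArrayList}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils

val hiveConf = new Configuration()
val neededColumnIds = new JArrayList[Integer]()
neededColumnIds.add(0) // key1
neededColumnIds.add(2) // value
ColumnProjectionUtils.appendReadColumns(hiveConf, neededColumnIds)
```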
I tried to take a similar approach, but I am not sure columnar reading is working in hiveOperators.scala right now. In any case, it will take me more time to make sure that last feature works. Please note that this was the first time I wrote code in Scala, so it took me some time to get comfortable with the language.