Optimizing CPU performance #15

fabuzaid21 · 2015-10-30T17:22:12Z

Several changes:

1) PartitionInfo.update has been rewritten so that there are fewer zipWithIndex operations and sortBys. We now sort by bit manually, but we still rely on standard library functions to sort by value and then by index; replacing those is a further improvement that still needs to be made.
2) Instead of using bitSubvectors to encode which instances split left or right, we now use a single bitVector to encode this information. Because this vector is broadcast to all workers, this reduces some of the communication overhead. The change was made to make it easier to look up the corresponding bit for each instance: we now use the original index of the instance to look up its bit in the bit vector.
3) We introduce a second bitVector, called nodeSplitBitVector, to encode whether each node was split or not. Previously we would determine this by examining the number of instances that split left for a given bitSubvector. Since we no longer use bitSubvector, we needed an alternate way of encoding this information. This increases the communication overhead between the master and the workers (we now have to broadcast two bitVectors instead of one), but it should still be less than the communication cost we had previously.
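
A minimal sketch of the two bit vectors described in items 2 and 3, assuming `java.util.BitSet` as a stand-in for the bitVector type. The object name, method names, and the `goesLeft` and `wasSplit` inputs are hypothetical and are not taken from the PR's code.

```scala
import java.util.BitSet

// Hypothetical sketch: names and layout are assumptions, not the PR's code.
object SplitBits {
  /** One bit per training instance, indexed by the instance's original index.
    * A set bit means the instance goes to the left child. */
  def encodeSplits(numInstances: Int, goesLeft: Int => Boolean): BitSet = {
    val bits = new BitSet(numInstances)
    var i = 0
    while (i < numInstances) {
      if (goesLeft(i)) bits.set(i)
      i += 1
    }
    bits
  }

  /** One bit per active node. A set bit means the node was split. */
  def encodeNodeSplits(wasSplit: Array[Boolean]): BitSet = {
    val bits = new BitSet(wasSplit.length)
    var n = 0
    while (n < wasSplit.length) {
      if (wasSplit(n)) bits.set(n)
      n += 1
    }
    bits
  }
}
```

With this layout a worker needs only `bits.get(originalIndex)` to find out which way an instance went, which is the lookup that item 2 is meant to simplify.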

Unit tests in AltDTSuite all pass. The same accuracy is achieved as with the byRow algorithm on a large dataset (1 million instances, 20 columns).

When I profile on my machine, byCol takes about 27 seconds to train, and byRow takes about 10 seconds. I think the last bottleneck is the unnecessary materialization of partitionInfos.


jkbradley · 2015-11-13T19:13:13Z

@fabuzaid21 Sorry for taking a long time to respond! I'll be able to prioritize this more now.

> PartitionInfo.update has been rewritten so that there are fewer zipWithIndex operations and sortBys. We manually sort by bit, but we still use standard library functions to sort by value, then index. This is a further improvement that needs to be made.

This sounds good to me for sure. However, I'd prefer we not introduce a hand-written sorting algorithm into Spark. Is there not a good way to do this using existing sorting algorithms from Java? I would assume the main overheads in current sorting are:

- using Scala sequence operations, rather than something like Java Array sort [http://docs.oracle.com/javase/7/docs/api/java/util/Arrays.html]
- unnecessary sorts when splitting a node (where we split the feature values for that node)
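
As a rough illustration of the first point, here is a sketch that orders instance indices by feature value, then by index, using `java.util.Arrays.sort` with a comparator instead of Scala's zipWithIndex and sortBy. The object and method names are hypothetical. Boxing the indices has its own cost, so this is one possible approach rather than a measured improvement.

```scala
import java.util.{Arrays, Comparator}

// Hypothetical sketch: not code from the PR or from Spark.
object IndexedSort {
  /** Returns instance indices ordered by feature value, ties broken by index. */
  def sortedIndices(values: Array[Double]): Array[Int] = {
    // Box the indices so that Arrays.sort can take a Comparator.
    val idx: Array[Integer] =
      Array.tabulate(values.length)(i => Integer.valueOf(i))
    Arrays.sort(idx, new Comparator[Integer] {
      override def compare(a: Integer, b: Integer): Int = {
        val c = java.lang.Double.compare(values(a.intValue), values(b.intValue))
        if (c != 0) c else Integer.compare(a.intValue, b.intValue)
      }
    })
    idx.map(_.intValue)
  }
}
```

For example, `IndexedSort.sortedIndices(Array(3.0, 1.0, 2.0, 1.0))` yields the indices 1, 3, 2, 0: the two instances with value 1.0 come first, in index order.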

> Instead of using bitSubvectors to encode which instances split left or right, we now use a single bitVector to encode this information. This reduces some of the overhead in the communication costs, since this is broadcast to all workers. This change was made to make it easier to look up the corresponding bit for each instance -- we now use the original index of the instance to look up its bit in the bit vector.

I thought about this too. I agree it's best when the tree is balanced and almost all current leaf nodes are active. However, many realistic trees become imbalanced once they get deep, so this could mean communicating more data.
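
A back-of-the-envelope sketch of that trade-off, assuming one bit per encoded instance and ignoring any per-subvector metadata such as offsets. The object and function names are hypothetical.

```scala
// Hypothetical cost model: one bit per encoded instance, metadata ignored.
object BroadcastCost {
  /** Bits broadcast when a single global bit vector covers every instance. */
  def globalBits(numInstances: Long): Long = numInstances

  /** Bits broadcast when only instances in currently active nodes are encoded. */
  def activeBits(activeNodeSizes: Seq[Long]): Long = activeNodeSizes.sum
}
```

With 1,000,000 instances and, say, two active nodes holding 30,000 and 20,000 instances (made-up numbers for a deep, imbalanced tree), the global vector carries 1,000,000 bits while the per-node encoding carries 50,000. When the active nodes cover every instance, the two counts are equal.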

> We introduce a second bitVector -- called nodeSplitBitVector -- to encode whether the node was split or not. Previously we would determine this by examining the number of instances that split left for a given bitSubvector. Since we no longer use bitSubvector, we needed an alternate way of encoding this information. This increases the overhead of communication between the master and the workers (now, we have to broadcast two bitVectors instead of one), but this should still be less than the communication cost we had previously.

I don't think this is necessary. Workers don't get 1 BitSubvector per node; a single BitSubvector could include results from multiple nodes (because of the merging step in aggregation). Workers can still identify the boundaries between nodes because of the nodeOffsets array.
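
To make the nodeOffsets idea concrete, here is a sketch that recovers a per-node count of left-going instances from one merged bit vector. It assumes `java.util.BitSet` as a stand-in for BitSubvector and assumes that node i owns the bit positions from `nodeOffsets(i)` up to, but not including, `nodeOffsets(i + 1)`. The object and method names are hypothetical.

```scala
import java.util.BitSet

// Hypothetical sketch: BitSet stands in for BitSubvector.
object NodeBoundaries {
  /** Counts the set bits (left-going instances) within each node's range. */
  def leftCounts(bits: BitSet, nodeOffsets: Array[Int]): Array[Int] = {
    val counts = new Array[Int](nodeOffsets.length - 1)
    var node = 0
    while (node < counts.length) {
      val end = nodeOffsets(node + 1)
      // nextSetBit returns -1 once no set bit remains.
      var i = bits.nextSetBit(nodeOffsets(node))
      while (i >= 0 && i < end) {
        counts(node) += 1
        i = bits.nextSetBit(i + 1)
      }
      node += 1
    }
    counts
  }
}
```

A caller could then apply a rule such as treating a node with zero left-going instances as unsplit. That rule is an assumption here, not something stated in this thread.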

…gle batch ## What changes were proposed in this pull request? This PR support multiple Python UDFs within single batch, also improve the performance. ```python >>> from pyspark.sql.types import IntegerType >>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType()) >>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType()) >>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True) == Parsed Logical Plan == 'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)] +- OneRowRelation$ == Analyzed Logical Plan == double(add(1, 2)): int, add(double(2), 1): int Project [double(add(1, 2))#14,add(double(2), 1)#15] +- Project [double(add(1, 2))#14,add(double(2), 1)#15] +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15] +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18] +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17] +- OneRowRelation$ == Optimized Logical Plan == Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15] +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18] +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17] +- OneRowRelation$ == Physical Plan == WholeStageCodegen : +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15] : +- INPUT +- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18] +- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17] +- Scan OneRowRelation[] ``` ## How was this patch tested? Added new tests. 
Using the following script to benchmark 1, 2, and 3 UDFs:

```
# Assumes a PySpark shell where sqlContext is already defined.
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

df = sqlContext.range(1, 1 << 23, 1, 4)
double = F.udf(lambda x: x * 2, LongType())
print(df.select(double(df.id)).count())
print(df.select(double(df.id), double(df.id + 1)).count())
print(df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count())
```

Here are the results:

N | Before | After | Speedup
---- | ---- | ---- | ----
1 | 22 s | 7 s | 3.1X
2 | 38 s | 13 s | 2.9X
3 | 58 s | 16 s | 3.6X

This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python processes before this patch and 4 processes after it. After this patch, multiple UDFs also use less memory than before (less buffering).

Author: Davies Liu <[email protected]>

Closes apache#12057 from davies/multi_udfs.

## What changes were proposed in this pull request?

This PR aims to improve the way physical plans are explained in Spark. Currently, the explain output for a physical plan can look very cluttered: each operator's string representation can be very wide and wrap around in the display, making it a little hard to follow. This happens especially when explaining a query that 1) operates on wide tables or 2) has complex expressions.

This PR attempts to split the output into two sections. In the header section, we display the basic operator tree with a number associated with each operator; in this section, we strictly control what we output for each operator. In the footer section, each operator is displayed verbosely. Based on the feedback from Maryann, the uncorrelated subqueries (SubqueryExecs) are not included in the main plan. They are printed separately after the main plan and can be correlated by the originating expression id from the parent plan.

To illustrate, here is a simple plan displayed in the old and new ways.
Example query1 : ``` EXPLAIN SELECT key, Max(val) FROM explain_temp1 WHERE key > 0 GROUP BY key HAVING max(val) > 0 ``` Old : ``` *(2) Project [key#2, max(val)#15] +- *(2) Filter (isnotnull(max(val#3)#18) AND (max(val#3)#18 > 0)) +- *(2) HashAggregate(keys=[key#2], functions=[max(val#3)], output=[key#2, max(val)#15, max(val#3)#18]) +- Exchange hashpartitioning(key#2, 200) +- *(1) HashAggregate(keys=[key#2], functions=[partial_max(val#3)], output=[key#2, max#21]) +- *(1) Project [key#2, val#3] +- *(1) Filter (isnotnull(key#2) AND (key#2 > 0)) +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), (key#2 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), GreaterThan(key,0)], ReadSchema: struct<key:int,val:int> ``` New : ``` Project (8) +- Filter (7) +- HashAggregate (6) +- Exchange (5) +- HashAggregate (4) +- Project (3) +- Filter (2) +- Scan parquet default.explain_temp1 (1) (1) Scan parquet default.explain_temp1 [codegen id : 1] Output: [key#2, val#3] (2) Filter [codegen id : 1] Input : [key#2, val#3] Condition : (isnotnull(key#2) AND (key#2 > 0)) (3) Project [codegen id : 1] Output : [key#2, val#3] Input : [key#2, val#3] (4) HashAggregate [codegen id : 1] Input: [key#2, val#3] (5) Exchange Input: [key#2, max#11] (6) HashAggregate [codegen id : 2] Input: [key#2, max#11] (7) Filter [codegen id : 2] Input : [key#2, max(val)#5, max(val#3)#8] Condition : (isnotnull(max(val#3)#8) AND (max(val#3)#8 > 0)) (8) Project [codegen id : 2] Output : [key#2, max(val)#5] Input : [key#2, max(val)#5, max(val#3)#8] ``` Example Query2 (subquery): ``` SELECT * FROM explain_temp1 WHERE KEY = (SELECT Max(KEY) FROM explain_temp2 WHERE KEY = (SELECT Max(KEY) FROM explain_temp3 WHERE val > 0) AND val = 2) AND val > 3 ``` Old: ``` *(1) Project [key#2, val#3] +- *(1) Filter (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery 
scalar-subquery#39)) AND (val#3 > 3)) : +- Subquery scalar-subquery#39 : +- *(2) HashAggregate(keys=[], functions=[max(KEY#26)], output=[max(KEY)apache#45]) : +- Exchange SinglePartition : +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#26)], output=[max#47]) : +- *(1) Project [key#26] : +- *(1) Filter (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#38)) AND (val#27 = 2)) : : +- Subquery scalar-subquery#38 : : +- *(2) HashAggregate(keys=[], functions=[max(KEY#28)], output=[max(KEY)apache#43]) : : +- Exchange SinglePartition : : +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#28)], output=[max#49]) : : +- *(1) Project [key#28] : : +- *(1) Filter (isnotnull(val#29) AND (val#29 > 0)) : : +- *(1) FileScan parquet default.explain_temp3[key#28,val#29] Batched: true, DataFilters: [isnotnull(val#29), (val#29 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp3], PartitionFilters: [], PushedFilters: [IsNotNull(val), GreaterThan(val,0)], ReadSchema: struct<key:int,val:int> : +- *(1) FileScan parquet default.explain_temp2[key#26,val#27] Batched: true, DataFilters: [isnotnull(key#26), isnotnull(val#27), (val#27 = 2)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp2], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), EqualTo(val,2)], ReadSchema: struct<key:int,val:int> +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), isnotnull(val#3), (val#3 > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), GreaterThan(val,3)], ReadSchema: struct<key:int,val:int> ``` New: ``` Project (3) +- Filter (2) +- Scan parquet default.explain_temp1 (1) (1) Scan parquet default.explain_temp1 [codegen id : 1] Output: [key#2, val#3] (2) Filter [codegen id : 1] Input : [key#2, val#3] 
Condition : (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#23)) AND (val#3 > 3)) (3) Project [codegen id : 1] Output : [key#2, val#3] Input : [key#2, val#3] ===== Subqueries ===== Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#23 HashAggregate (9) +- Exchange (8) +- HashAggregate (7) +- Project (6) +- Filter (5) +- Scan parquet default.explain_temp2 (4) (4) Scan parquet default.explain_temp2 [codegen id : 1] Output: [key#26, val#27] (5) Filter [codegen id : 1] Input : [key#26, val#27] Condition : (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#22)) AND (val#27 = 2)) (6) Project [codegen id : 1] Output : [key#26] Input : [key#26, val#27] (7) HashAggregate [codegen id : 1] Input: [key#26] (8) Exchange Input: [max#35] (9) HashAggregate [codegen id : 2] Input: [max#35] Subquery:2 Hosting operator id = 5 Hosting Expression = Subquery scalar-subquery#22 HashAggregate (15) +- Exchange (14) +- HashAggregate (13) +- Project (12) +- Filter (11) +- Scan parquet default.explain_temp3 (10) (10) Scan parquet default.explain_temp3 [codegen id : 1] Output: [key#28, val#29] (11) Filter [codegen id : 1] Input : [key#28, val#29] Condition : (isnotnull(val#29) AND (val#29 > 0)) (12) Project [codegen id : 1] Output : [key#28] Input : [key#28, val#29] (13) HashAggregate [codegen id : 1] Input: [key#28] (14) Exchange Input: [max#37] (15) HashAggregate [codegen id : 2] Input: [max#37] ``` Note: I opened this PR as a WIP to start getting feedback. I will be on vacation starting tomorrow would not be able to immediately incorporate the feedback. I will start to work on them as soon as i can. Also, currently this PR provides a basic infrastructure for explain enhancement. The details about individual operators will be implemented in follow-up prs ## How was this patch tested? Added a new test `explain.sql` that tests basic scenarios. Need to add more tests. 
Closes apache#24759 from dilipbiswal/explain_feature. Authored-by: Dilip Biswal <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

fabuzaid21 added 7 commits October 20, 2015 17:40

removed partitionInfosDebug

2f2ac8d

Additional optimizations, fairly minor. Removed foreach calls, replaced them with while loops. Next optimization should be to replace zip + sort with our own custom sort

60f28f6

removed unnecessary operations -- first(), zipWithIndex -- that introduced extra stages in the DAG

4a71a84

Forgot to include change to TreeUtilSuite

68d59ca

Sorting now handled using a custom class, DualPivotQuicksort. No longer using zip or zipWithIndex to sort features with indices.

d6e32cd

couple more minor improvements

6dd5e67

Now conforming to Spark style guidelines

c228f7f

fabuzaid21 force-pushed the dt-features-firas branch from 8a39441 to c228f7f on November 22, 2015 21:46

fabuzaid21 closed this Nov 30, 2015
