[SPARK-1915] [SQL] AverageFunction should not count if the evaluated value is null. #862

ueshin · 2014-05-23T11:27:27Z

Average values are difference between the calculation is done partially or not partially.
Because AverageFunction (in not-partially calculation) counts even if the evaluated value is null.

AmplabJenkins · 2014-05-23T11:27:58Z

Merged build triggered.

AmplabJenkins · 2014-05-23T11:28:07Z

Merged build started.

AmplabJenkins · 2014-05-23T12:42:44Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-05-23T12:42:44Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15165/

liancheng · 2014-05-26T14:03:40Z

Hi @ueshin, would you please give some pointer about the semantics of avg function you described? I investigated official documentation and code base of Hive, and also have run several experiments with Hive 0.12.0, but didn't find any clue that avg should count null values.

For a quick proof, here is a sample session I ran under Hive 0.12.0:

hive> create table src(key int, value string);             
hive> load data local inpath '/tmp/kv3.txt' into table src;
hive> select avg(key) from src;                  
...
OK
237.06666666666666
hive> select avg(key) from src where key is not null;
...
OK
237.06666666666666

The kv3.txt is copied from the Hive test data files, with 15 non-null keys and 10 null keys:

hive> select key, value from src;
hive> select * from src;                                   
...
OK
238     val_238
NULL
311     val_311
NULL    val_27
NULL    val_165
NULL    val_409
255     val_255
278     val_278
98      val_98
NULL    val_484
NULL    val_265
NULL    val_193
401     val_401
150     val_150
273     val_273
224
369
66      val_66
128
213     val_213
146     val_146
406     val_406
NULL
NULL
NULL

ueshin · 2014-05-26T16:11:16Z

@liancheng Thank you for your comment.
You mean the avg should not count null values, right?
That is what I meant.
Current AverageFunction counts null values, so I think my patch is needed.

rxin · 2014-05-27T21:55:08Z

Thanks. I've merged this into master & branch-1.0

…value is null. Average values are difference between the calculation is done partially or not partially. Because `AverageFunction` (in not-partially calculation) counts even if the evaluated value is null. Author: Takuya UESHIN <[email protected]> Closes #862 from ueshin/issues/SPARK-1915 and squashes the following commits: b1ff3c0 [Takuya UESHIN] Modify AverageFunction not to count if the evaluated value is null. (cherry picked from commit 3b0baba) Signed-off-by: Reynold Xin <[email protected]>

liancheng · 2014-05-27T23:13:21Z

@ueshin Sorry, my bad, misunderstood your PR description. And I think you are right.

On the other hand, it seems that AverageFunction is not used in query plan unless we use it in DSL queries explicitly (maybe I missed something related to aggregation functions here):

scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println)
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
   HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), which is now runnable
14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
   HiveTableScan [key#646], (MetastoreRelation default, src1, None), None)

It seems that currently AVG is always turned into partial sum divided by partial count, and leads us to the right answer (null values are ignored). But I think your fix still makes sense and should be merged.

liancheng · 2014-05-27T23:25:42Z

Ah, realized what's wrong, I need at least 1 non-partial aggregation:

scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM src1").collect().foreach(println)
...
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
 Exchange SinglePartition
  HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), which is now runnable
14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
 Exchange SinglePartition
  HiveTableScan [key#672], (MetastoreRelation default, src1, None), None)
...
[142.24,15]

And it does lead to the wrong answer.

…value is null. Average values are difference between the calculation is done partially or not partially. Because `AverageFunction` (in not-partially calculation) counts even if the evaluated value is null. Author: Takuya UESHIN <[email protected]> Closes apache#862 from ueshin/issues/SPARK-1915 and squashes the following commits: b1ff3c0 [Takuya UESHIN] Modify AverageFunction not to count if the evaluated value is null.

Modify AverageFunction not to count if the evaluated value is null.

b1ff3c0

asfgit closed this in 3b0baba May 27, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-1915] [SQL] AverageFunction should not count if the evaluated value is null. #862

[SPARK-1915] [SQL] AverageFunction should not count if the evaluated value is null. #862

Uh oh!

ueshin commented May 23, 2014

Uh oh!

AmplabJenkins commented May 23, 2014

Uh oh!

AmplabJenkins commented May 23, 2014

Uh oh!

AmplabJenkins commented May 23, 2014

Uh oh!

AmplabJenkins commented May 23, 2014

Uh oh!

liancheng commented May 26, 2014

Uh oh!

ueshin commented May 26, 2014

Uh oh!

rxin commented May 27, 2014

Uh oh!

liancheng commented May 27, 2014

Uh oh!

liancheng commented May 27, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[SPARK-1915] [SQL] AverageFunction should not count if the evaluated value is null. #862

[SPARK-1915] [SQL] AverageFunction should not count if the evaluated value is null. #862

Uh oh!

Conversation

ueshin commented May 23, 2014

Uh oh!

AmplabJenkins commented May 23, 2014

Uh oh!

AmplabJenkins commented May 23, 2014

Uh oh!

AmplabJenkins commented May 23, 2014

Uh oh!

AmplabJenkins commented May 23, 2014

Uh oh!

liancheng commented May 26, 2014

Uh oh!

ueshin commented May 26, 2014

Uh oh!

rxin commented May 27, 2014

Uh oh!

liancheng commented May 27, 2014

Uh oh!

liancheng commented May 27, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants