Skip to content

Conversation

@ueshin
Copy link
Member

@ueshin ueshin commented May 23, 2014

Average values are difference between the calculation is done partially or not partially.
Because AverageFunction (in not-partially calculation) counts even if the evaluated value is null.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15165/

@liancheng
Copy link
Contributor

Hi @ueshin, would you please give some pointer about the semantics of avg function you described? I investigated official documentation and code base of Hive, and also have run several experiments with Hive 0.12.0, but didn't find any clue that avg should count null values.

For a quick proof, here is a sample session I ran under Hive 0.12.0:

hive> create table src(key int, value string);             
hive> load data local inpath '/tmp/kv3.txt' into table src;
hive> select avg(key) from src;                  
...
OK
237.06666666666666
hive> select avg(key) from src where key is not null;
...
OK
237.06666666666666

The kv3.txt is copied from the Hive test data files, with 15 non-null keys and 10 null keys:

hive> select key, value from src;
hive> select * from src;                                   
...
OK
238     val_238
NULL
311     val_311
NULL    val_27
NULL    val_165
NULL    val_409
255     val_255
278     val_278
98      val_98
NULL    val_484
NULL    val_265
NULL    val_193
401     val_401
150     val_150
273     val_273
224
369
66      val_66
128
213     val_213
146     val_146
406     val_406
NULL
NULL
NULL

@ueshin
Copy link
Member Author

ueshin commented May 26, 2014

@liancheng Thank you for your comment.
You mean the avg should not count null values, right?
That is what I meant.
Current AverageFunction counts null values, so I think my patch is needed.

@rxin
Copy link
Contributor

rxin commented May 27, 2014

Thanks. I've merged this into master & branch-1.0

asfgit pushed a commit that referenced this pull request May 27, 2014
…value is null.

Average values are difference between the calculation is done partially or not partially.
Because `AverageFunction` (in not-partially calculation) counts even if the evaluated value is null.

Author: Takuya UESHIN <[email protected]>

Closes #862 from ueshin/issues/SPARK-1915 and squashes the following commits:

b1ff3c0 [Takuya UESHIN] Modify AverageFunction not to count if the evaluated value is null.

(cherry picked from commit 3b0baba)
Signed-off-by: Reynold Xin <[email protected]>
@asfgit asfgit closed this in 3b0baba May 27, 2014
@liancheng
Copy link
Contributor

@ueshin Sorry, my bad, misunderstood your PR description. And I think you are right.

On the other hand, it seems that AverageFunction is not used in query plan unless we use it in DSL queries explicitly (maybe I missed something related to aggregation functions here):

scala> sql("SELECT AVG(key) FROM src1").collect().foreach(println)
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
   HiveTableScan [key#646], (MetastoreRelation default, src1, None), None), which is now runnable
14/05/28 07:04:33 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 8 (SchemaRDD[45] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [(CAST(SUM(PartialSum#648), DoubleType) / CAST(SUM(PartialCount#649), DoubleType)) AS c0#644]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(key#646) AS PartialCount#649,SUM(key#646) AS PartialSum#648]
   HiveTableScan [key#646], (MetastoreRelation default, src1, None), None)

It seems that currently AVG is always turned into partial sum divided by partial count, and leads us to the right answer (null values are ignored). But I think your fix still makes sense and should be merged.

@liancheng
Copy link
Contributor

Ah, realized what's wrong, I need at least 1 non-partial aggregation:

scala> sql("SELECT AVG(key), COUNT(DISTINCT key) FROM src1").collect().foreach(println)
...
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
 Exchange SinglePartition
  HiveTableScan [key#672], (MetastoreRelation default, src1, None), None), which is now runnable
14/05/28 07:21:31 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 12 (SchemaRDD[67] at RDD at SchemaRDD.scala:98
== Query Plan ==
Aggregate false, [], [AVG(key#672) AS c0#668,COUNT(DISTINCT key#672}) AS c1#669]
 Exchange SinglePartition
  HiveTableScan [key#672], (MetastoreRelation default, src1, None), None)
...
[142.24,15]

And it does lead to the wrong answer.

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
…value is null.

Average values are difference between the calculation is done partially or not partially.
Because `AverageFunction` (in not-partially calculation) counts even if the evaluated value is null.

Author: Takuya UESHIN <[email protected]>

Closes apache#862 from ueshin/issues/SPARK-1915 and squashes the following commits:

b1ff3c0 [Takuya UESHIN] Modify AverageFunction not to count if the evaluated value is null.
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…value is null.

Average values are difference between the calculation is done partially or not partially.
Because `AverageFunction` (in not-partially calculation) counts even if the evaluated value is null.

Author: Takuya UESHIN <[email protected]>

Closes apache#862 from ueshin/issues/SPARK-1915 and squashes the following commits:

b1ff3c0 [Takuya UESHIN] Modify AverageFunction not to count if the evaluated value is null.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants