From 6a26282d7c2836fbc8b034a1f731762191adc065 Mon Sep 17 00:00:00 2001
From: Qianyang Yu <qyu@us.ibm.com>
Date: Fri, 3 Apr 2020 18:22:26 -0700
Subject: [PATCH 1/9] init doc

---
 docs/sql-ref-functions-builtin-aggregate.md | 152 +++++++++++++++++++-
 1 file changed, 151 insertions(+), 1 deletion(-)
diff --git a/docs/sql-ref-functions-builtin-aggregate.md b/docs/sql-ref-functions-builtin-aggregate.md
index d59543647e02..bc0f687c890f 100644
--- a/docs/sql-ref-functions-builtin-aggregate.md
+++ b/docs/sql-ref-functions-builtin-aggregate.md
@@ -19,4 +19,154 @@ license: |
   limitations under the License.
 ---
 
-Aggregate functions
\ No newline at end of file
+Aggregate functions
+* Table of contents
+{:toc}
+
+Spark SQL provides built-in aggregate functions, defined in the Dataset API and the SQL interface. Aggregate functions
+operate on a group of rows and return a single value.
+
+Spark SQL Aggregate functions are grouped as "agg_funcs" in spark SQL. Below is the list of functions.
+
+Note: Every below function has another signature which take String as a column name instead of Column.
+
+<table class="table">
+  <thead>
+    <tr><th>Function</th><th>Parameters(s)</th><<th>Description</th><th></tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>avg(e: Column)</td><td>Column name</td><td> Returns the average of values in the input column.</td> 
+    </tr>
+    <tr>
+      <td>mean(e: Column)</td><td>Column name</td><td> Returns the average of values in the input column.</td> 
+    </tr>        
+    <tr>
+      <td>bool_and(e: Column) every(e: Column)</td><td>Column name</td><td>Returns true if all values are true</td>
+    </tr>
+    <tr>
+      <td>any(e: Column)  some(e: Column) bool_or(e: Column)</td><td>Column name</td><td>Returns true if at least one value is true</td>
+    </tr>
+    <tr>
+      <td>approx_count_distinct(e: Column)</td><td>Column name</td><td>Returns the estimated cardinality by HyperLogLog++</td>td>
+    </tr>
+    <tr>
+      <td>corr(e1: Column, e2: Column)</td><td>Column name</td><td>Returns Pearson coefficient of correlation between a set of number pairs</td>
+    </tr>
+    <tr>
+      <td>count(*)</td><td>None</td><td>Returns the total number of retrieved rows, including rows containing null</td>
+    </tr>
+    <tr>
+      <td>count(e: Column[, e: Column])</td><td>Column name</td><td>Returns the number of rows for which the supplied column(s) are all not null</td>
+    </tr>
+    <tr>
+      <td>count(DISTINCT e: Column[, e: Column])</td><td>Column name</td><td>Returns the number of rows for which the supplied column(s) are unique and non-null</td>
+    </tr> 
+    <tr>
+      <td>count_if(e: Column)</td><td>Column name</td><td>Returns the number of `TRUE` values for the column</td>
+    </tr> 
+    <tr>
+      <td>covar_pop(e1: Column, e2: Column)</td><td>Column name</td><td>Returns the population covariance of a set of number pairs</td>
+    </tr> 
+    <tr>
+      <td>covar_samp(e1: Column, e2: Column)</td><td>Column name</td><td>Returns the sample covariance of a set of number pairs</td>
+    </tr>  
+    <tr>
+      <td>first(e: Column[, isIgnoreNull])</td><td>Column name[, True/False(default)]</td><td>Returns the first value of column for a group of rows.
+                                                           If `isIgnoreNull` is true, returns only non-null values, default is false.</td>
+    </tr>      
+    <tr>
+      <td>first_value(e: Column[, isIgnoreNull])</td><td>Column name[, True/False(default)]</td><td>Returns the first value of column for a group of rows.
+                                                               If `isIgnoreNull` is true, returns only non-null values, default is false.</td>
+    </tr>     
+    <tr>
+       <td>skewness(e: Column)</td><td>Column name</td><td>Returns the skewness value calculated from values of a group</td>
+    </tr>    
+    <tr>
+       <td>kurtosis(e: Column)</td><td>Column name</td><td>Returns the kurtosis value calculated from values of a group</td>
+    </tr>    
+    <tr>
+      <td>last(e: Column[, isIgnoreNull])</td><td>Column name[, True/False(default)]</td><td>Returns the last value of column for a group of rows.
+                                                               If `isIgnoreNull` is true, returns only non-null values, default is false.</td>
+    </tr>      
+    <tr>
+      <td>last_value(e: Column[, isIgnoreNull])</td><td>Column name[, True/False(default)]</td><td>Returns the last value of column for a group of rows.
+                                                                   If `isIgnoreNull` is true, returns only non-null values, default is false.</td>
+    </tr>     
+    <tr>
+      <td>max(e: Column)</td><td>Column name</td><td>Returns the maximum value of the column.</td>
+    </tr>          
+    <tr>
+      <td>max_by(e1: Column, e2: Column)</td><td>Column name</td><td>Returns the value of column e1 associated with the maximum value of column e2.</td>
+    </tr>   
+    <tr>
+      <td>min(e: Column)</td><td>Column name</td><td>Returns the minimum value of the column.</td>
+    </tr>          
+    <tr>
+      <td>min_by(e1: Column, e2: Column)</td><td>Column name</td><td>Returns the value of column e1 associated with the minimum value of column e2.</td>
+    </tr>      
+    <tr>
+      <td>percentile(e: Column, percentage [, frequency])</td><td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td><td>Returns the exact percentile value of numeric column
+                       `col` at the given percentage.</td>
+    </tr>         
+    <tr>
+      <td>percentile(e: Column, array(percentage1 [, percentage2]...) [, frequency])</td><td>Column name; percentage array is an array of number between 0 and 1; frequency is a positive integer</td><td>Returns the exact
+                                                                           percentile value array of numeric column `col` at the given percentage(s).</td>
+    </tr>        
+    <tr>
+      <td>percentile_approx(e: Column, percentage [, frequency])</td><td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td><td>Returns the approximate percentile value of numeric
+                                                                           column `col` at the given percentage.</td>
+    </tr>         
+    <tr>
+      <td>percentile_approx(e: Column, array(percentage1 [, percentage2]...) [, frequency])</td><td>Column name; percentage array is an array of number between 0 and 1; frequency is a positive integer</td><td>Returns the approximate
+                                                                           percentile value array of numeric column `col` at the given percentage(s).</td>
+    </tr>      
+    <tr>
+      <td>approx_percentile(e: Column, percentage [, frequency])</td><td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td><td>Returns the approximate percentile value of numeric
+                                                                           column `col` at the given percentage.</td>
+    </tr>         
+    <tr>
+      <td>approx_percentile(e: Column, array(percentage1 [, percentage2]...) [, frequency])</td><td>Column name; percentage array is an array of number between 0 and 1; frequency is a positive integer</td><td>Returns the approximate
+                                                                           percentile value array of numeric column `col` at the given percentage(s).</td>
+    </tr>        
+    <tr>
+      <td>stddev_samp(e: Column)</td><td>Column name</td><td>Returns the sample standard deviation calculated from values of a group</td>
+    </tr>  
+    <tr>
+      <td>stddev(e: Column)</td><td>Column name</td><td>Returns the sample standard deviation calculated from values of a group</td>
+    </tr>  
+    <tr>
+      <td>std(e: Column)</td><td>Column name</td><td>Returns the sample standard deviation calculated from values of a group</td>
+    </tr>  
+    <tr>
+      <td>stddev_pop(e: Column)</td><td>Column name</td><td>Returns the population standard deviation calculated from values of a group</td>
+    </tr>
+    <tr>
+      <td>stddev_samp(e: Column)</td><td>Column name</td><td>Returns the sum calculated from values of a group</td>
+    </tr>    
+    <tr>
+      <td>(variance | var_samp)(e: Column)</td><td>Column name</td><td>Returns the sample variance calculated from values of a group</td>
+    </tr>    
+    <tr>
+      <td>sum(e: Column)</td><td>Column name</td><td>Returns the sum calculated from values of a group.</td>
+    </tr>       
+    <tr>
+      <td>var_pop(e: Column)</td><td>Column name</td><td>Returns the population variance calculated from values of a group</td>
+    </tr>        
+    <tr>
+      <td>collect_list(e: Column)</td><td>Column name</td><td>Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends
+                                                          on the order of the rows which may be non-deterministic after a shuffle</td>
+    </tr>       
+    <tr>
+      <td>collect_set(e: Column)</td><td>Column name</td><td>Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends
+                                                         on the order of the rows which may be non-deterministic after a shuffle.</td>
+    </tr>
+    <tr>
+        <td>count_min_sketch(e: Column, eps: double, confidence: double, seed integer)</td><td>Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer</td><td>Returns a count-min sketch of a column with the given esp,
+                                                        confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for
+                                                        cardinality estimation using sub-linear space..</td>
+    </tr>
+                              
+  </tbody>
+</table>
+

From d3a508e703786b7c7bd6ee2856f62120d9918579 Mon Sep 17 00:00:00 2001
From: Qianyang Yu <qyu@us.ibm.com>
Date: Sat, 4 Apr 2020 12:39:49 -0700
Subject: [PATCH 2/9] first version

---
 docs/sql-ref-functions-builtin-aggregate.md | 628 +++++++++++++++++---
 1 file changed, 550 insertions(+), 78 deletions(-)

diff --git a/docs/sql-ref-functions-builtin-aggregate.md b/docs/sql-ref-functions-builtin-aggregate.md
index bc0f687c890f..be73a92ea640 100644
--- a/docs/sql-ref-functions-builtin-aggregate.md
+++ b/docs/sql-ref-functions-builtin-aggregate.md
@@ -19,154 +19,626 @@ license: |
   limitations under the License.
 ---
 
-Aggregate functions
-* Table of contents
-{:toc}
-
 Spark SQL provides built-in aggregate functions, defined in the Dataset API and the SQL interface. Aggregate functions
 operate on a group of rows and return a single value.
 
-Spark SQL Aggregate functions are grouped as "agg_funcs" in spark SQL. Below is the list of functions.
+Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL. The functions are listed below.
 
-Note: Every below function has another signature which take String as a column name instead of Column.
+**Note:** Every function below has another signature that takes a String column name instead of a Column.
 
+* Table of contents
+{:toc}
 <table class="table">
   <thead>
-    <tr><th>Function</th><th>Parameters(s)</th><<th>Description</th><th></tr>
+    <tr><th style="width:25%">Function</th><th>Parameters</th><th>Description</th></tr>
   </thead>
   <tbody>
     <tr>
-      <td>avg(e: Column)</td><td>Column name</td><td> Returns the average of values in the input column.</td> 
+      <td> <b>{avg | mean}</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td> Returns the average of values in the input column.</td> 
     </tr>
     <tr>
-      <td>mean(e: Column)</td><td>Column name</td><td> Returns the average of values in the input column.</td> 
-    </tr>        
-    <tr>
-      <td>bool_and(e: Column) every(e: Column)</td><td>Column name</td><td>Returns true if all values are true</td>
+      <td> <b>{bool_and | every}</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns true if all values are true</td>
     </tr>
     <tr>
-      <td>any(e: Column)  some(e: Column) bool_or(e: Column)</td><td>Column name</td><td>Returns true if at least one value is true</td>
+      <td> <b>{any | some | bool_or}</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns true if at least one value is true</td>
     </tr>
     <tr>
-      <td>approx_count_distinct(e: Column)</td><td>Column name</td><td>Returns the estimated cardinality by HyperLogLog++</td>td>
+      <td> <b>approx_count_distinct</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the estimated cardinality, computed with HyperLogLog++.</td>
     </tr>
     <tr>
-      <td>corr(e1: Column, e2: Column)</td><td>Column name</td><td>Returns Pearson coefficient of correlation between a set of number pairs</td>
+      <td> <b>corr</b>(<i>e1: Column, e2: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the Pearson correlation coefficient for a set of number pairs.</td>
     </tr>
     <tr>
-      <td>count(*)</td><td>None</td><td>Returns the total number of retrieved rows, including rows containing null</td>
+      <td> <b>count</b>(<i>*</i>)</td>
+      <td>None</td>
+      <td>Returns the total number of retrieved rows, including rows containing null</td>
     </tr>
     <tr>
-      <td>count(e: Column[, e: Column])</td><td>Column name</td><td>Returns the number of rows for which the supplied column(s) are all not null</td>
+      <td> <b>count</b>(<i>e: Column[, e: Column]</i>)</td>
+      <td>Column name</td>
+      <td>Returns the number of rows for which the supplied column(s) are all not null</td>
     </tr>
     <tr>
-      <td>count(DISTINCT e: Column[, e: Column])</td><td>Column name</td><td>Returns the number of rows for which the supplied column(s) are unique and non-null</td>
+      <td> <b>count</b>(<b>DISTINCT</b> <i>e: Column[, e: Column]</i>)</td>
+      <td>Column name</td>
+      <td>Returns the number of rows for which the supplied column(s) are unique and not null</td>
     </tr> 
     <tr>
-      <td>count_if(e: Column)</td><td>Column name</td><td>Returns the number of `TRUE` values for the column</td>
+      <td> <b>count_if</b>(<i>Predicate</i>)</td>
+      <td>Boolean expression evaluated for each row</td>
+      <td>Returns the number of rows for which the predicate evaluates to `TRUE`.</td>
     </tr> 
     <tr>
-      <td>covar_pop(e1: Column, e2: Column)</td><td>Column name</td><td>Returns the population covariance of a set of number pairs</td>
+      <td> <b>covar_pop</b>(<i>e1: Column, e2: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the population covariance of a set of number pairs</td>
     </tr> 
     <tr>
-      <td>covar_samp(e1: Column, e2: Column)</td><td>Column name</td><td>Returns the sample covariance of a set of number pairs</td>
+      <td> <b>covar_samp</b>(<i>e1: Column, e2: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the sample covariance of a set of number pairs</td>
     </tr>  
     <tr>
-      <td>first(e: Column[, isIgnoreNull])</td><td>Column name[, True/False(default)]</td><td>Returns the first value of column for a group of rows.
-                                                           If `isIgnoreNull` is true, returns only non-null values, default is false.</td>
+      <td> <b>{first | first_value}</b>(<i>e: Column[, isIgnoreNull]</i>)</td>
+      <td>Column name[, True/False(default)]</td>
+      <td>Returns the first value of the column for a group of rows. If `isIgnoreNull` is true, returns only non-null values; the default is false. This function is non-deterministic.</td>
     </tr>      
     <tr>
-      <td>first_value(e: Column[, isIgnoreNull])</td><td>Column name[, True/False(default)]</td><td>Returns the first value of column for a group of rows.
-                                                               If `isIgnoreNull` is true, returns only non-null values, default is false.</td>
-    </tr>     
-    <tr>
-       <td>skewness(e: Column)</td><td>Column name</td><td>Returns the skewness value calculated from values of a group</td>
+       <td> <b>skewness</b>(<i>e: Column</i>)</td>
+       <td>Column name</td>
+       <td>Returns the skewness value calculated from values of a group</td>
     </tr>    
     <tr>
-       <td>kurtosis(e: Column)</td><td>Column name</td><td>Returns the kurtosis value calculated from values of a group</td>
+       <td> <b>kurtosis</b>(<i>e: Column</i>)</td>
+       <td>Column name</td>
+       <td>Returns the kurtosis value calculated from values of a group</td>
     </tr>    
     <tr>
-      <td>last(e: Column[, isIgnoreNull])</td><td>Column name[, True/False(default)]</td><td>Returns the last value of column for a group of rows.
-                                                               If `isIgnoreNull` is true, returns only non-null values, default is false.</td>
+      <td> <b>{last | last_value}</b>(<i>e: Column[, isIgnoreNull]</i>)</td>
+      <td>Column name[, True/False(default)]</td>
+      <td>Returns the last value of the column for a group of rows. If `isIgnoreNull` is true, returns only non-null values; the default is false. This function is non-deterministic.</td>
     </tr>      
     <tr>
-      <td>last_value(e: Column[, isIgnoreNull])</td><td>Column name[, True/False(default)]</td><td>Returns the last value of column for a group of rows.
-                                                                   If `isIgnoreNull` is true, returns only non-null values, default is false.</td>
-    </tr>     
-    <tr>
-      <td>max(e: Column)</td><td>Column name</td><td>Returns the maximum value of the column.</td>
+      <td> <b>max</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the maximum value of the column.</td>
     </tr>          
     <tr>
-      <td>max_by(e1: Column, e2: Column)</td><td>Column name</td><td>Returns the value of column e1 associated with the maximum value of column e2.</td>
+      <td> <b>max_by</b>(<i>e1: Column, e2: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the value of column e1 associated with the maximum value of column e2.</td>
     </tr>   
     <tr>
-      <td>min(e: Column)</td><td>Column name</td><td>Returns the minimum value of the column.</td>
+      <td> <b>min</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the minimum value of the column.</td>
     </tr>          
     <tr>
-      <td>min_by(e1: Column, e2: Column)</td><td>Column name</td><td>Returns the value of column e1 associated with the minimum value of column e2.</td>
+      <td> <b>min_by</b>(<i>e1: Column, e2: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the value of column e1 associated with the minimum value of column e2.</td>
     </tr>      
     <tr>
-      <td>percentile(e: Column, percentage [, frequency])</td><td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td><td>Returns the exact percentile value of numeric column
-                       `col` at the given percentage.</td>
+      <td> <b>percentile</b>(<i>e: Column, percentage [, frequency]</i>)</td>
+      <td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td>
+      <td>Returns the exact percentile value of a numeric column at the given percentage.</td>
     </tr>         
     <tr>
-      <td>percentile(e: Column, array(percentage1 [, percentage2]...) [, frequency])</td><td>Column name; percentage array is an array of number between 0 and 1; frequency is a positive integer</td><td>Returns the exact
-                                                                           percentile value array of numeric column `col` at the given percentage(s).</td>
+      <td> <b>percentile</b>(<i>e: Column, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
+      <td>Column name; percentage array is an array of numbers between 0 and 1; frequency is a positive integer</td>
+      <td>Returns the exact percentile value array of a numeric column at the given percentage(s).</td>
     </tr>        
     <tr>
-      <td>percentile_approx(e: Column, percentage [, frequency])</td><td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td><td>Returns the approximate percentile value of numeric
-                                                                           column `col` at the given percentage.</td>
+      <td> <b>{percentile_approx | approx_percentile}</b>(<i>e: Column, percentage [, frequency]</i>)</td>
+      <td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td>
+      <td>Returns the approximate percentile value of a numeric column at the given percentage.</td>
     </tr>         
     <tr>
-      <td>percentile_approx(e: Column, array(percentage1 [, percentage2]...) [, frequency])</td><td>Column name; percentage array is an array of number between 0 and 1; frequency is a positive integer</td><td>Returns the approximate
-                                                                           percentile value array of numeric column `col` at the given percentage(s).</td>
-    </tr>      
+      <td> <b>{percentile_approx | approx_percentile}</b>(<i>e: Column, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
+      <td>Column name; percentage array is an array of numbers between 0 and 1; frequency is a positive integer</td>
+      <td>Returns the approximate percentile value array of a numeric column at the given percentage(s).</td>
+    </tr>             
     <tr>
-      <td>approx_percentile(e: Column, percentage [, frequency])</td><td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td><td>Returns the approximate percentile value of numeric
-                                                                           column `col` at the given percentage.</td>
-    </tr>         
-    <tr>
-      <td>approx_percentile(e: Column, array(percentage1 [, percentage2]...) [, frequency])</td><td>Column name; percentage array is an array of number between 0 and 1; frequency is a positive integer</td><td>Returns the approximate
-                                                                           percentile value array of numeric column `col` at the given percentage(s).</td>
-    </tr>        
-    <tr>
-      <td>stddev_samp(e: Column)</td><td>Column name</td><td>Returns the sample standard deviation calculated from values of a group</td>
-    </tr>  
-    <tr>
-      <td>stddev(e: Column)</td><td>Column name</td><td>Returns the sample standard deviation calculated from values of a group</td>
+      <td> <b>{stddev_samp | stddev | std}</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the sample standard deviation calculated from values of a group</td>
     </tr>  
     <tr>
-      <td>std(e: Column)</td><td>Column name</td><td>Returns the sample standard deviation calculated from values of a group</td>
-    </tr>  
-    <tr>
-      <td>stddev_pop(e: Column)</td><td>Column name</td><td>Returns the population standard deviation calculated from values of a group</td>
+      <td> <b>stddev_pop</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the population standard deviation calculated from values of a group</td>
     </tr>
     <tr>
-      <td>stddev_samp(e: Column)</td><td>Column name</td><td>Returns the sum calculated from values of a group</td>
+      <td> <b>{variance | var_samp}</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the sample variance calculated from values of a group</td>
     </tr>    
     <tr>
-      <td>(variance | var_samp)(e: Column)</td><td>Column name</td><td>Returns the sample variance calculated from values of a group</td>
-    </tr>    
-    <tr>
-      <td>sum(e: Column)</td><td>Column name</td><td>Returns the sum calculated from values of a group.</td>
+      <td> <b>sum</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the sum calculated from values of a group.</td>
     </tr>       
     <tr>
-      <td>var_pop(e: Column)</td><td>Column name</td><td>Returns the population variance calculated from values of a group</td>
+      <td> <b>var_pop</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the population variance calculated from values of a group</td>
     </tr>        
     <tr>
-      <td>collect_list(e: Column)</td><td>Column name</td><td>Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends
-                                                          on the order of the rows which may be non-deterministic after a shuffle</td>
+      <td> <b>collect_list</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle.</td>
     </tr>       
     <tr>
-      <td>collect_set(e: Column)</td><td>Column name</td><td>Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends
-                                                         on the order of the rows which may be non-deterministic after a shuffle.</td>
+      <td> <b>collect_set</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle.</td>
     </tr>
     <tr>
-        <td>count_min_sketch(e: Column, eps: double, confidence: double, seed integer)</td><td>Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer</td><td>Returns a count-min sketch of a column with the given esp,
-                                                        confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for
-                                                        cardinality estimation using sub-linear space..</td>
+        <td> <b>count_min_sketch</b>(<i>e: Column, eps: double, confidence: double, seed: integer</i>)</td>
+        <td>Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer</td>
+        <td>Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.</td>
     </tr>
-                              
   </tbody>
 </table>
 
+### Example
+{% highlight sql %}
+-- base table
+SELECT * FROM buildin_agg;
++----+----+----+-----+----+
+|  c1|  c2|  c3|   c4|  c5|
++----+----+----+-----+----+
+|   2|   3|agg4| true|true|
+|   1|   2|agg3|false|true|
+|   1|   1|agg1|false|true|
+|   4|   3|agg6|false|true|
+|   3|   3|agg5| true|true|
+|   1|   2|agg2|false|true|
+|   5|null|agg8|false|true|
+|null|   4|agg7|false|true|
++----+----+----+-----+----+
+
+-- AVG and MEAN functions
+SELECT AVG(c1) FROM buildin_agg;
++------------------+
+|           avg(c1)|
++------------------+
+|2.4285714285714284|
++------------------+
+
+SELECT MEAN(c1) FROM buildin_agg;
++------------------+
+|          mean(c1)|
++------------------+
+|2.4285714285714284|
++------------------+
+
+-- BOOL_AND and EVERY 
+SELECT BOOL_AND(c4) FROM buildin_agg;
++------------+
+|bool_and(c4)|
++------------+
+|       false|
++------------+
+
+SELECT EVERY(c5) FROM buildin_agg;
++------------+
+|bool_and(c5)|
++------------+
+|        true|
++------------+
+
+-- ANY, SOME and BOOL_OR
+SELECT ANY(c4) FROM buildin_agg;
++-------+
+|any(c4)|
++-------+
+|   true|
++-------+
+
+SELECT SOME(c4) FROM buildin_agg;
++-------+
+|any(c4)|
++-------+
+|   true|
++-------+
+
+SELECT BOOL_OR(c5) FROM buildin_agg;
++-----------+
+|bool_or(c5)|
++-----------+
+|       true|
++-----------+
+
+-- APPROX_COUNT_DISTINCT
+SELECT APPROX_COUNT_DISTINCT(c1) FROM buildin_agg;
++-------------------------+
+|approx_count_distinct(c1)|
++-------------------------+
+|                        5|
++-------------------------+
+
+-- CORR
+SELECT CORR(c1, c2) FROM buildin_agg;
++--------------------------------------------+
+|corr(CAST(c1 AS DOUBLE), CAST(c2 AS DOUBLE))|
++--------------------------------------------+
+|                          0.7745966692414833|
++--------------------------------------------+
+
+-- COUNT
+SELECT COUNT(c2) FROM buildin_agg;
++---------+
+|count(c2)|
++---------+
+|        7|
++---------+
+
+--COUNT DISTINCT
+SELECT COUNT(DISTINCT c1) FROM buildin_agg;
++------------------+
+|count(DISTINCT c1)|
++------------------+
+|                 5|
++------------------+
+
+SELECT COUNT(DISTINCT c1, c2) FROM buildin_agg;
++----------------------+
+|count(DISTINCT c1, c2)|
++----------------------+
+|                     5|
++----------------------+
+
+SELECT COUNT(*) FROM buildin_agg;
++--------+
+|count(1)|
++--------+
+|       8|
++--------+
+
+--COUNT_IF
+SELECT COUNT_IF(c1 IS NULL) FROM buildin_agg;
++----------------------+
+|count_if((c1 IS NULL))|
++----------------------+
+|                     1|
++----------------------+
+
+SELECT c1 FROM buildin_agg GROUP BY c1 HAVING COUNT_IF(c2 % 2 = 0) > 0;
++----+
+|  c1|
++----+
+|null|
+|   1|
++----+
+
+--COVAR_POP
+SELECT COVAR_POP(c1, c2) FROM buildin_agg;
++-------------------------------------------------+
+|covar_pop(CAST(c1 AS DOUBLE), CAST(c2 AS DOUBLE))|
++-------------------------------------------------+
+|                               0.6666666666666666|
++-------------------------------------------------+
+
+--COVAR_SAMP
+SELECT COVAR_SAMP(c1, c2) FROM buildin_agg;
++--------------------------------------------------+
+|covar_samp(CAST(c1 AS DOUBLE), CAST(c2 AS DOUBLE))|
++--------------------------------------------------+
+|                                               0.8|
++--------------------------------------------------+
+
+
+--FIRST and FIRST_VALUE
+SELECT FIRST(c1) FROM buildin_agg;
++----------------+
+|first(c1, false)|
++----------------+
+|               2|
++----------------+
+
+
+SELECT FIRST(col) FROM VALUES (NULL), (5), (20) AS TAB(col);
++-----------------+
+|first(col, false)|
++-----------------+
+|             null|
++-----------------+
+
+SELECT FIRST(col, true) FROM VALUES (NULL), (5), (20) AS TAB(col);
++----------------+
+|first(col, true)|
++----------------+
+|               5|
++----------------+
+
+SELECT FIRST_VALUE(col) FROM VALUES (NULL), (5), (20) AS TAB(col);
++-----------------------+
+|first_value(col, false)|
++-----------------------+
+|                   null|
++-----------------------+
+
+SELECT FIRST_VALUE(col, true) FROM VALUES (NULL), (5), (20) AS TAB(col);
++----------------------+
+|first_value(col, true)|
++----------------------+
+|                     5|
++----------------------+
+
+
+--SKEWNESS
+SELECT SKEWNESS(c1) FROM buildin_agg;
++----------------------------+
+|skewness(CAST(c1 AS DOUBLE))|
++----------------------------+
+|          0.5200705032248686|
++----------------------------+
+
+SELECT SKEWNESS(col) FROM VALUES (-1000), (-100), (10), (20) AS TAB(col);
++-----------------------------+
+|skewness(CAST(col AS DOUBLE))|
++-----------------------------+
+|          -1.1135657469022011|
++-----------------------------+
+
+--KURTOSIS
+SELECT KURTOSIS(c2) FROM buildin_agg;
++----------------------------+
+|kurtosis(CAST(c2 AS DOUBLE))|
++----------------------------+
+|         -0.7325000000000004|
++----------------------------+
+
+SELECT KURTOSIS(col) FROM VALUES (-1000), (-100), (10), (20) AS TAB(col);
++-----------------------------+
+|kurtosis(CAST(col AS DOUBLE))|
++-----------------------------+
+|          -0.7014368047529627|
++-----------------------------+
+
+--LAST and LAST_VALUE
+SELECT LAST(c1) FROM buildin_agg;
++---------------+
+|last(c1, false)|
++---------------+
+|           null|
++---------------+
+
+SELECT LAST(c1, true) FROM buildin_agg;
++--------------+
+|last(c1, true)|
++--------------+
+|             5|
++--------------+
+
+SELECT LAST_VALUE(c1) FROM buildin_agg;
++---------------------+
+|last_value(c1, false)|
++---------------------+
+|                 null|
++---------------------+
+
+SELECT LAST_VALUE(c1, true) FROM buildin_agg;
++--------------------+
+|last_value(c1, true)|
++--------------------+
+|                   5|
++--------------------+
+
+--MAX
+SELECT MAX(c2) FROM buildin_agg;
++-------+
+|max(c2)|
++-------+
+|      4|
++-------+
+
+--MAX_BY
+SELECT MAX_BY(c1, c3) FROM buildin_agg;
++-------------+
+|maxby(c1, c3)|
++-------------+
+|            5|
++-------------+
+
+--MIN
+SELECT MIN(c1) FROM buildin_agg;
++-------+
+|min(c1)|
++-------+
+|      1|
++-------+
+
+--MIN_BY
+SELECT MIN_BY(c2, c3) FROM buildin_agg;
++-------------+
+|minby(c2, c3)|
++-------------+
+|            1|
++-------------+
+
+--PERCENTILE
+SELECT PERCENTILE(c1, 0.3) FROM buildin_agg;
++--------------------------------------+
+|percentile(c1, CAST(0.3 AS DOUBLE), 1)|
++--------------------------------------+
+|                                   1.0|
++--------------------------------------+
+
+SELECT PERCENTILE(c1, 0.3, 2) FROM buildin_agg;
++--------------------------------------+
+|percentile(c1, CAST(0.3 AS DOUBLE), 2)|
++--------------------------------------+
+|                                   1.0|
++--------------------------------------+
+
+SELECT PERCENTILE(c1, ARRAY(0.25, 0.75)) FROM buildin_agg;
++------------------------------------+
+|percentile(c1, array(0.25, 0.75), 1)|
++------------------------------------+
+|                          [1.0, 3.5]|
++------------------------------------+
+
+SELECT PERCENTILE(c1, ARRAY(0.25, 0.75), 10) FROM buildin_agg;
++-------------------------------------+
+|percentile(c1, array(0.25, 0.75), 10)|
++-------------------------------------+
+|                           [1.0, 4.0]|
++-------------------------------------+
+
+
+--PERCENTILE_APPROX and APPROX_PERCENTILE
+SELECT PERCENTILE_APPROX(c1, 0.25, 100) FROM buildin_agg;
++------------------------------------------------+
+|percentile_approx(c1, CAST(0.25 AS DOUBLE), 100)|
++------------------------------------------------+
+|                                               1|
++------------------------------------------------+
+
+SELECT APPROX_PERCENTILE(c1, 0.25, 100) FROM buildin_agg;
++------------------------------------------------+
+|approx_percentile(c1, CAST(0.25 AS DOUBLE), 100)|
++------------------------------------------------+
+|                                               1|
++------------------------------------------------+
+
+
+SELECT PERCENTILE_APPROX(c1, ARRAY(0.25, 0.85), 100) FROM buildin_agg;
++---------------------------------------------+
+|percentile_approx(c1, array(0.25, 0.85), 100)|
++---------------------------------------------+
+|                                       [1, 4]|
++---------------------------------------------+
+
+
+SELECT APPROX_PERCENTILE(c1, array(0.25, 0.85), 100) FROM buildin_agg;
++---------------------------------------------+
+|approx_percentile(c1, array(0.25, 0.85), 100)|
++---------------------------------------------+
+|                                       [1, 4]|
++---------------------------------------------+
+
+--STDDEV_SAMP, STDDEV and STD
+SELECT STDDEV_SAMP(c1) FROM buildin_agg;
++-------------------------------+
+|stddev_samp(CAST(c1 AS DOUBLE))|
++-------------------------------+
+|              1.618347187425374|
++-------------------------------+
+
+SELECT STDDEV(c1) FROM buildin_agg;
++--------------------------+
+|stddev(CAST(c1 AS DOUBLE))|
++--------------------------+
+|         1.618347187425374|
++--------------------------+
+
+SELECT STD(c1) FROM buildin_agg;
++-----------------------+
+|std(CAST(c1 AS DOUBLE))|
++-----------------------+
+|      1.618347187425374|
++-----------------------+
+
+--STDDEV_POP
+SELECT STDDEV_POP(c1) FROM buildin_agg;
++------------------------------+
+|stddev_pop(CAST(c1 AS DOUBLE))|
++------------------------------+
+|             1.498298354528788|
++------------------------------+
+
+--VARIANCE and VAR_SAMP
+SELECT VARIANCE(c1) FROM buildin_agg;
++----------------------------+
+|variance(CAST(c1 AS DOUBLE))|
++----------------------------+
+|           2.619047619047619|
++----------------------------+
+
+SELECT VAR_SAMP(c1) FROM buildin_agg;
++----------------------------+
+|var_samp(CAST(c1 AS DOUBLE))|
++----------------------------+
+|           2.619047619047619|
++----------------------------+
+
+--SUM
+SELECT SUM(col) FROM VALUES (5), (10), (15) AS TAB(col);
++--------+
+|sum(col)|
++--------+
+|      30|
++--------+
+
+SELECT SUM(c1) FROM buildin_agg;
++-------+
+|sum(c1)|
++-------+
+|     17|
++-------+
+
+SELECT SUM(col) FROM VALUES (NULL), (NULL) AS TAB(col);
++--------+
+|sum(col)|
++--------+
+|    null|
++--------+
+
+--VAR_POP
+SELECT VAR_POP(c1) FROM buildin_agg;
++---------------------------+
+|var_pop(CAST(c1 AS DOUBLE))|
++---------------------------+
+|         2.2448979591836737|
++---------------------------+
+
+--COLLECT_LIST
+SELECT COLLECT_LIST(c2) FROM buildin_agg;
++---------------------+
+|collect_list(c2)     |
++---------------------+
+|[3, 2, 1, 3, 3, 2, 4]|
++---------------------+
+
+SELECT COLLECT_LIST(c4) FROM buildin_agg;
++------------------------------------------------------+
+|collect_list(c4)                                      |
++------------------------------------------------------+
+|[true, false, false, false, true, false, false, false]|
++------------------------------------------------------+
+
+--COLLECT_SET
+SELECT COLLECT_SET(c2) FROM buildin_agg;
++---------------+
+|collect_set(c2)|
++---------------+
+|[1, 2, 3, 4]   |
++---------------+
+
+SELECT COLLECT_SET(c3) FROM buildin_agg;
++------------------------------------------------+
+|collect_set(c3)                                 |
++------------------------------------------------+
+|[agg7, agg8, agg3, agg6, agg4, agg2, agg5, agg1]|
++------------------------------------------------+
+
+--COUNT_MIN_SKETCH
+SELECT COUNT_MIN_SKETCH(c1, 1D, 0.2D, 3) FROM buildin_agg;
++-------------------------------------------------------------------------------------------------------------------------------------------------------------+
+|count_min_sketch(c1, 0.9, 0.2, 3)                                                                                                                            |
++-------------------------------------------------------------------------------------------------------------------------------------------------------------+
+|[00 00 00 01 00 00 00 00 00 00 00 07 00 00 00 01 00 00 00 03 00 00 00 00 5D 93 49 A6 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 06]|
++-------------------------------------------------------------------------------------------------------------------------------------------------------------+
+{% endhighlight %}
\ No newline at end of file

From f4aadff126432517cf7b44ea15d149d8bf703e6a Mon Sep 17 00:00:00 2001
From: Qianyang Yu <qyu@us.ibm.com>
Date: Sun, 5 Apr 2020 15:42:20 -0700
Subject: [PATCH 3/9] address comments

---
 docs/sql-ref-functions-builtin-aggregate.md | 288 ++++++++++----------
 1 file changed, 145 insertions(+), 143 deletions(-)
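
Reviewer note on the COUNT_MIN_SKETCH example: the byte array shown in the output can be decoded to check that it agrees with the result header `count_min_sketch(c1, 0.9, 0.2, 3)`. The following is a minimal Python sketch, not part of the patch. It assumes the version-1 `CountMinSketch` serialization layout (big-endian int version, long total count, int depth, int width, one long hash seed per row, then depth x width long counters) and the sizing rules width = ceil(2 / eps) and depth = ceil(-ln(1 - confidence) / ln 2). Both are my reading of the Spark implementation and should be checked against the source. The names `HEX_DUMP` and `decode_sketch` are hypothetical.

```python
import math
import struct

# Hex dump copied from the COUNT_MIN_SKETCH example output in this patch.
HEX_DUMP = (
    "00 00 00 01 00 00 00 00 00 00 00 07 00 00 00 01 00 00 00 03 "
    "00 00 00 00 5D 93 49 A6 00 00 00 00 00 00 00 01 00 00 00 00 "
    "00 00 00 00 00 00 00 00 00 00 00 06"
)


def decode_sketch(hex_dump):
    """Decode a serialized sketch, ASSUMING the V1 layout described above."""
    data = bytes.fromhex(hex_dump)  # fromhex ignores the spaces
    header = ">iqii"  # version, total count, depth, width (big-endian)
    version, total_count, depth, width = struct.unpack_from(header, data, 0)
    offset = struct.calcsize(header)

    # One 64-bit hash seed per row.
    hash_a = struct.unpack_from(f">{depth}q", data, offset)
    offset += 8 * depth

    # depth x width 64-bit counters, stored row by row.
    flat = struct.unpack_from(f">{depth * width}q", data, offset)
    offset += 8 * depth * width
    if offset != len(data):
        raise ValueError("trailing bytes: layout assumption does not hold")

    table = [list(flat[r * width:(r + 1) * width]) for r in range(depth)]
    return {
        "version": version,
        "total_count": total_count,
        "depth": depth,
        "width": width,
        "hash_a": list(hash_a),
        "table": table,
    }


def expected_shape(eps, confidence):
    """Sizing rules ASSUMED for the sketch: returns (depth, width)."""
    width = math.ceil(2 / eps)
    depth = math.ceil(-math.log1p(-confidence) / math.log(2))
    return depth, width


if __name__ == "__main__":
    sketch = decode_sketch(HEX_DUMP)
    print(sketch)
    print(expected_shape(0.9, 0.2))
```

Under those assumptions the dump decodes to depth 1 and width 3, which is what eps = 0.9 and confidence = 0.2 would produce, and the three counters sum to the total count of 7.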

diff --git a/docs/sql-ref-functions-builtin-aggregate.md b/docs/sql-ref-functions-builtin-aggregate.md
index be73a92ea640..30e4528ff470 100644
--- a/docs/sql-ref-functions-builtin-aggregate.md
+++ b/docs/sql-ref-functions-builtin-aggregate.md
@@ -19,12 +19,12 @@ license: |
   limitations under the License.
 ---
 
-Spark SQL provides build-in Aggregate functions defines in dataset API and SQL interface. Aggregate functions
+Spark SQL provides build-in Aggregate functions defined in the dataset API and SQL interface. Aggregate functions
 operate on a group of rows and return a single value.
 
 Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL. Below is the list of functions.
 
-**Note:** Every below function has another signature which take String as a column name instead of Column.
+**Note:** Every below function has another signature which takes String as a column name instead of Column.
 
 * Table of contents
 {:toc}
@@ -33,6 +33,16 @@ Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL
     <tr><th style="width:25%">Function</th><th>Parameters</th><th>Description</th></tr>
   </thead>
   <tbody>
+    <tr>
+      <td> <b>{any | some | bool_or}</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns true if at least one value is true</td>
+    </tr>
+    <tr>
+      <td> <b>approx_count_distinct</b>(<i>e: Column[, relativeSD: Double]]</i>)</td>
+      <td>Column name; relativeSD: the maximum estimation error allowed.</td>
+      <td>Returns the estimated cardinality by HyperLogLog++</td>
+    </tr>   
     <tr>
       <td> <b>{avg | mean}</b>(<i>e: Column</i>)</td>
       <td>Column name</td>
@@ -44,14 +54,14 @@ Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL
       <td>Returns true if all values are true</td>
     </tr>
     <tr>
-      <td> <b>{any | some | bool_or}</b>(<i>e: Column</i>)</td>
+      <td> <b>collect_list</b>(<i>e: Column</i>)</td>
       <td>Column name</td>
-      <td>Returns true if at least one value is true</td>
-    </tr>
+      <td>Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle</td>
+    </tr>       
     <tr>
-      <td> <b>approx_count_distinct</b>(<i>e: Column</i>)</td>
+      <td> <b>collect_set</b>(<i>e: Column</i>)</td>
       <td>Column name</td>
-      <td>Returns the estimated cardinality by HyperLogLog++</td>
+      <td>Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.</td>
     </tr>
     <tr>
       <td> <b>corr</b>(<i>e1: Column, e2: Column</i>)</td>
@@ -76,8 +86,13 @@ Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL
     <tr>
       <td> <b>count_if</b>(<i>Predicate</i>)</td>
       <td>Expression that will be used for aggregation calculation</td>
-      <td>Returns the count number from the predicate evaluate to `TRUE` values</td>
+      <td>Returns the number of rows for which the predicate evaluates to <code>TRUE</code></td>
     </tr> 
+    <tr>
+        <td> <b>count_min_sketch</b>(<i>e: Column, eps: double, confidence: double, seed integer</i>)</td>
+        <td>Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer</td>
+        <td>Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a <code>CountMinSketch</code> before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.</td>
+    </tr>
     <tr>
       <td> <b>covar_pop</b>(<i>e1: Column, e2: Column</i>)</td>
       <td>Column name</td>
@@ -91,13 +106,8 @@ Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL
     <tr>
       <td> <b>{first | first_value}</b>(<i>e: Column[, isIgnoreNull]</i>)</td>
       <td>Column name[, True/False(default)]</td>
-      <td>Returns the first value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic</td>
+      <td>Returns the first value of the column for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values; the default is false. This function is non-deterministic.</td>
     </tr>      
-    <tr>
-       <td> <b>skewness</b>(<i>e: Column</i>)</td>
-       <td>Column name</td>
-       <td>Returns the skewness value calculated from values of a group</td>
-    </tr>    
     <tr>
        <td> <b>kurtosis</b>(<i>e: Column</i>)</td>
        <td>Column name</td>
@@ -106,7 +116,7 @@ Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL
     <tr>
       <td> <b>{last | last_value}</b>(<i>e: Column[, isIgnoreNull]</i>)</td>
       <td>Column name[, True/False(default)]</td>
-      <td>Returns the last value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic</td>
+      <td>Returns the last value of the column for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values; the default is false. This function is non-deterministic.</td>
     </tr>      
     <tr>
       <td> <b>max</b>(<i>e: Column</i>)</td>
@@ -148,6 +158,11 @@ Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL
       <td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td>
       <td>Returns the approximate percentile value of numeric column at the given percentage.</td>
     </tr>             
+    <tr>
+       <td> <b>skewness</b>(<i>e: Column</i>)</td>
+       <td>Column name</td>
+       <td>Returns the skewness value calculated from values of a group</td>
+    </tr>    
     <tr>
       <td> <b>{stddev_samp | stddev | std}</b>(<i>e: Column</i>)</td>
       <td>Column name</td>
@@ -158,40 +173,25 @@ Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL
       <td>Column name</td>
       <td>Returns the population standard deviation calculated from values of a group</td>
     </tr>
-    <tr>
-      <td> <b>{variance | var_samp}</b>(<i>e: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns the sample variance calculated from values of a group</td>
-    </tr>    
     <tr>
       <td> <b>sum</b>(<i>e: Column</i>)</td>
       <td>Column name</td>
       <td>Returns the sum calculated from values of a group.</td>
     </tr>       
+    <tr>
+      <td> <b>{variance | var_samp}</b>(<i>e: Column</i>)</td>
+      <td>Column name</td>
+      <td>Returns the sample variance calculated from values of a group</td>
+    </tr>    
     <tr>
       <td> <b>var_pop</b>(<i>e: Column</i>)</td>
       <td>Column name</td>
       <td>Returns the population variance calculated from values of a group</td>
     </tr>        
-    <tr>
-      <td> <b>collect_list</b>(<i>e: Column</i>)</td>
-      <td>Column name</td>
-      <td>Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle</td>
-    </tr>       
-    <tr>
-      <td> <b>collect_set</b>(<i>e: Column</i>)</td>
-      <td>Column name</td>
-      <td>Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.</td>
-    </tr>
-    <tr>
-        <td> <b>count_min_sketch</b>(<i>e: Column, eps: double, confidence: double, seed integer</i>)</td>
-        <td>Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer</td>
-        <td>Returns a count-min sketch of a column with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space..</td>
-    </tr>
   </tbody>
 </table>
 
-### Example
+### Examples
 {% highlight sql %}
 --base table 
 SELECT * FROM buildin_agg;
@@ -208,6 +208,43 @@ SELECT * FROM buildin_agg;
 |null|   4|agg7|false|true|
 +----+----+----+-----+----+
 
+-- ANY, SOME and BOOL_OR
+SELECT ANY(c4) FROM buildin_agg;
++-------+
+|any(c4)|
++-------+
+|   true|
++-------+
+
+SELECT SOME(c4) FROM buildin_agg;
++-------+
+|any(c4)|
++-------+
+|   true|
++-------+
+
+SELECT BOOL_OR(c5) FROM buildin_agg;
++-----------+
+|bool_or(c5)|
++-----------+
+|       true|
++-----------+
+
+-- APPROX_COUNT_DISTINCT
+SELECT APPROX_COUNT_DISTINCT(c1) FROM buildin_agg;
++-------------------------+
+|approx_count_distinct(c1)|
++-------------------------+
+|                        5|
++-------------------------+
+
+SELECT APPROX_COUNT_DISTINCT(c1,0.39d) FROM buildin_agg;
++-------------------------+
+|approx_count_distinct(c1)|
++-------------------------+
+|                        6|
++-------------------------+
+
 -- AVG and MEAN functions
 SELECT AVG(c1) FROM buildin_agg;
 +------------------+
@@ -238,35 +275,35 @@ SELECT EVERY(c5) FROM buildin_agg;
 |        true|
 +------------+
 
--- ANY, SOME and BOOL_OR
-SELECT ANY(c4) FROM buildin_agg;
-+-------+
-|any(c4)|
-+-------+
-|   true|
-+-------+
+--COLLECT_LIST
+SELECT COLLECT_LIST(c2) FROM buildin_agg;
++---------------------+
+|collect_list(c2)     |
++---------------------+
+|[3, 2, 1, 3, 3, 2, 4]|
++---------------------+
 
-SELECT SOME(c4) FROM buildin_agg;
-+-------+
-|any(c4)|
-+-------+
-|   true|
-+-------+
+SELECT COLLECT_LIST(c4) FROM buildin_agg;
++------------------------------------------------------+
+|collect_list(c4)                                      |
++------------------------------------------------------+
+|[true, false, false, false, true, false, false, false]|
++------------------------------------------------------+
 
-SELECT BOOL_OR(c5) FROM buildin_agg;
-+-----------+
-|bool_or(c5)|
-+-----------+
-|       true|
-+-----------+
+--COLLECT_SET
+SELECT COLLECT_SET(c2) FROM buildin_agg;
++---------------+
+|collect_set(c2)|
++---------------+
+|[1, 2, 3, 4]   |
++---------------+
 
--- APPROX_COUNT_DISTINCT
-SELECT APPROX_COUNT_DISTINCT(c1) FROM buildin_agg;
-+-------------------------+
-|approx_count_distinct(c1)|
-+-------------------------+
-|                        5|
-+-------------------------+
+SELECT COLLECT_SET(c3) FROM buildin_agg;
++------------------------------------------------+
+|collect_set(c3)                                 |
++------------------------------------------------+
+|[agg7, agg8, agg3, agg6, agg4, agg2, agg5, agg1]|
++------------------------------------------------+
 
 -- CORR
 SELECT CORR(c1, c2) FROM buildin_agg;
@@ -276,6 +313,14 @@ SELECT CORR(c1, c2) FROM buildin_agg;
 |                          0.7745966692414833|
 +--------------------------------------------+
 
+--COUNT(*)
+SELECT COUNT(*) FROM buildin_agg;
++--------+
+|count(1)|
++--------+
+|       8|
++--------+
+
 -- COUNT
 SELECT COUNT(c2) FROM buildin_agg;
 +---------+
@@ -299,13 +344,6 @@ SELECT COUNT(DISTINCT c1, c2) FROM buildin_agg;
 |                     5|
 +----------------------+
 
-SELECT COUNT(*) FROM buildin_agg;
-+--------+
-|count(1)|
-+--------+
-|       8|
-+--------+
-
 --COUNT_IF
 SELECT COUNT_IF(c1 IS NULL) from buildin_agg;
 +----------------------+
@@ -322,6 +360,14 @@ SELECT c1 FROM buildin_agg GROUP BY c1 HAVING COUNT_IF(c2 % 2 = 0);
 |   1|
 +----+
 
+--COUNT_MIN_SKETCH
+SELECT COUNT_MIN_SKETCH(c1, 0.9D, 0.2D, 3) FROM buildin_agg;
++-------------------------------------------------------------------------------------------------------------------------------------------------------------+
+|count_min_sketch(c1, 0.9, 0.2, 3)                                                                                                                            |
++-------------------------------------------------------------------------------------------------------------------------------------------------------------+
+|[00 00 00 01 00 00 00 00 00 00 00 07 00 00 00 01 00 00 00 03 00 00 00 00 5D 93 49 A6 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 06]|
++-------------------------------------------------------------------------------------------------------------------------------------------------------------+
+
 --COVAR_POP
 SELECT COVAR_POP(c1, c2) FROM buildin_agg;
 +-------------------------------------------------+
@@ -338,7 +384,6 @@ SELECT COVAR_SAMP(c1, c2) FROM buildin_agg;
 |                                               0.8|
 +--------------------------------------------------+
 
-
 --FIRST and FIRST_VALUE
 SELECT FIRST(c1) FROM buildin_agg;
 +----------------+
@@ -347,7 +392,6 @@ SELECT FIRST(c1) FROM buildin_agg;
 |               2|
 +----------------+
 
-
 SELECT FIRST(col) FROM VALUES (NULL), (5), (20) AS TAB(col);
 +-----------------+
 |first(col, false)|
@@ -376,22 +420,6 @@ SELECT FIRST_VALUE(col, true) FROM VALUES (NULL), (5), (20) AS TAB(col);
 |                     5|
 +----------------------+
 
-
---SKEWNESS
-SELECT SKEWNESS(c1) FROM buildin_agg;
-+----------------------------+
-|skewness(CAST(c1 AS DOUBLE))|
-+----------------------------+
-|          0.5200705032248686|
-+----------------------------+
-
-SELECT SKEWNESS(col) FROM VALUES (-1000), (-100), (10), (20) AS TAB(col);
-+-----------------------------+
-|skewness(CAST(col AS DOUBLE))|
-+-----------------------------+
-|          -1.1135657469022011|
-+-----------------------------+
-
 --KURTOSIS
 SELECT KURTOSIS(c2) FROM buildin_agg;
 +----------------------------+
@@ -497,7 +525,6 @@ SELECT PERCENTILE(c1, ARRAY(0.25, 0.75), 10) FROM buildin_agg;
 |                           [1.0, 4.0]|
 +-------------------------------------+
 
-
 --PERCENTILE_APPROX and APPROX_PERCENTILE
 SELECT PERCENTILE_APPROX(c1, 0.25, 100) FROM buildin_agg;
 +------------------------------------------------+
@@ -513,7 +540,6 @@ SELECT APPROX_PERCENTILE(c1, 0.25, 100) FROM buildin_agg;
 |                                               1|
 +------------------------------------------------+
 
-
 SELECT PERCENTILE_APPROX(c1, ARRAY(0.25, 0.85), 100) FROM buildin_agg;
 +---------------------------------------------+
 |percentile_approx(c1, array(0.25, 0.85), 100)|
@@ -521,7 +547,6 @@ SELECT PERCENTILE_APPROX(c1, ARRAY(0.25, 0.85), 100) FROM buildin_agg;
 |                                       [1, 4]|
 +---------------------------------------------+
 
-
 SELECT APPROX_PERCENTILE(c1, array(0.25, 0.85), 100) FROM buildin_agg;
 +---------------------------------------------+
 |approx_percentile(c1, array(0.25, 0.85), 100)|
@@ -529,6 +554,21 @@ SELECT APPROX_PERCENTILE(c1, array(0.25, 0.85), 100) FROM buildin_agg;
 |                                       [1, 4]|
 +---------------------------------------------+
 
+--SKEWNESS
+SELECT SKEWNESS(c1) FROM buildin_agg;
++----------------------------+
+|skewness(CAST(c1 AS DOUBLE))|
++----------------------------+
+|          0.5200705032248686|
++----------------------------+
+
+SELECT SKEWNESS(col) FROM VALUES (-1000), (-100), (10), (20) AS TAB(col);
++-----------------------------+
+|skewness(CAST(col AS DOUBLE))|
++-----------------------------+
+|          -1.1135657469022011|
++-----------------------------+
+
 --STDDEV_SAMP, STDDEV and STD
 SELECT STDDEV_SAMP(c1) FROM buildin_agg;
 +-------------------------------+
@@ -559,21 +599,6 @@ SELECT STDDEV_POP(c1) FROM buildin_agg;
 |             1.498298354528788|
 +------------------------------+
 
---VARIANCE and VAR_SAMP
-SELECT VARIANCE(c1) FROM buildin_agg;
-+----------------------------+
-|variance(CAST(c1 AS DOUBLE))|
-+----------------------------+
-|           2.619047619047619|
-+----------------------------+
-
-SELECT VAR_SAMP(c1) FROM buildin_agg;
-+----------------------------+
-|var_samp(CAST(c1 AS DOUBLE))|
-+----------------------------+
-|           2.619047619047619|
-+----------------------------+
-
 --SUM
 SELECT SUM(col) FROM VALUES (5), (10), (15) AS TAB(col);
 +--------+
@@ -596,6 +621,21 @@ SELECT SUM(col) FROM VALUES (NULL), (NULL) AS TAB(col);
 |    null|
 +--------+
 
+--VARIANCE and VAR_SAMP
+SELECT VARIANCE(c1) FROM buildin_agg;
++----------------------------+
+|variance(CAST(c1 AS DOUBLE))|
++----------------------------+
+|           2.619047619047619|
++----------------------------+
+
+SELECT VAR_SAMP(c1) FROM buildin_agg;
++----------------------------+
+|var_samp(CAST(c1 AS DOUBLE))|
++----------------------------+
+|           2.619047619047619|
++----------------------------+
+
 --VAR_POP
 SELECT VAR_POP(c1) FROM buildin_agg;
 +---------------------------+
@@ -603,42 +643,4 @@ SELECT VAR_POP(c1) FROM buildin_agg;
 +---------------------------+
 |         2.2448979591836737|
 +---------------------------+
-
---COLLECT_LIST
-SELECT COLLECT_LIST(c2) FROM buildin_agg;
-+---------------------+
-|collect_list(c2)     |
-+---------------------+
-|[3, 2, 1, 3, 3, 2, 4]|
-+---------------------+
-
-SELECT COLLECT_LIST(c4) FROM buildin_agg;
-+------------------------------------------------------+
-|collect_list(c4)                                      |
-+------------------------------------------------------+
-|[true, false, false, false, true, false, false, false]|
-+------------------------------------------------------+
-
---COLLECT_SET
-SELECT COLLECT_SET(c2) FROM buildin_agg;
-+---------------+
-|collect_set(c2)|
-+---------------+
-|[1, 2, 3, 4]   |
-+---------------+
-
-SELECT COLLECT_SET(c3) FROM buildin_agg;
-+------------------------------------------------+
-|collect_set(c3)                                 |
-+------------------------------------------------+
-|[agg7, agg8, agg3, agg6, agg4, agg2, agg5, agg1]|
-+------------------------------------------------+
-
---COUNT_MIN_SKETCH
-SELECT COUNT_MIN_SKETCH(c1, 1D, 0.2D, 3) FROM buildin_agg;
-+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
-|count_min_sketch(c1, 0.9, 0.2, 3)                                                                                                                            |
-+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
-|[00 00 00 01 00 00 00 00 00 00 00 07 00 00 00 01 00 00 00 03 00 00 00 00 5D 93 49 A6 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 06]|
-+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
 {% endhighlight %}
\ No newline at end of file

From 5cbecf4547271eb0e444b976e7fb17c21664387a Mon Sep 17 00:00:00 2001
From: Qianyang Yu <qyu@us.ibm.com>
Date: Sun, 5 Apr 2020 23:01:20 -0700
Subject: [PATCH 4/9] address comments

---
 docs/sql-ref-functions-builtin-aggregate.md | 78 ++++++++++-----------
 1 file changed, 39 insertions(+), 39 deletions(-)
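
Reviewer note on the SKEWNESS and KURTOSIS examples: the two results computed over inline `VALUES (-1000), (-100), (10), (20)` can be reproduced without the base table. The following is a minimal Python sketch, not part of the patch. It assumes the population-style formulas skewness = sqrt(n) * m3 / m2^1.5 and kurtosis = n * m4 / m2^2 - 3, where mk is the sum of the k-th powers of the deviations from the mean. That is my understanding of how these aggregates are defined and should be verified against the implementation. The helper names are hypothetical.

```python
import math


def deviation_power_sums(values):
    """Return (n, m2, m3, m4): sums of powers of deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    m2 = sum(d ** 2 for d in deviations)
    m3 = sum(d ** 3 for d in deviations)
    m4 = sum(d ** 4 for d in deviations)
    return n, m2, m3, m4


def skewness(values):
    # ASSUMED formula: sqrt(n) * m3 / m2^1.5
    n, m2, m3, _ = deviation_power_sums(values)
    return math.sqrt(n) * m3 / m2 ** 1.5


def kurtosis(values):
    # ASSUMED formula (excess kurtosis): n * m4 / m2^2 - 3
    n, m2, _, m4 = deviation_power_sums(values)
    return n * m4 / m2 ** 2 - 3.0


if __name__ == "__main__":
    col = [-1000, -100, 10, 20]
    print(skewness(col))  # compare with the SKEWNESS(col) example output
    print(kurtosis(col))  # compare with the KURTOSIS(col) example output
```

With these formulas the values agree with the documented outputs (about -1.11357 and -0.70144) to within floating-point rounding.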

diff --git a/docs/sql-ref-functions-builtin-aggregate.md b/docs/sql-ref-functions-builtin-aggregate.md
index 30e4528ff470..5e77ba66c898 100644
--- a/docs/sql-ref-functions-builtin-aggregate.md
+++ b/docs/sql-ref-functions-builtin-aggregate.md
@@ -19,12 +19,12 @@ license: |
   limitations under the License.
 ---
 
-Spark SQL provides build-in Aggregate functions defined in the dataset API and SQL interface. Aggregate functions
+Spark SQL provides built-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions
 operate on a group of rows and return a single value.
 
-Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL. Below is the list of functions.
+Aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL. The functions are listed below.
 
-**Note:** Every below function has another signature which takes String as a column name instead of Column.
+**Note:** Each function below also has a signature that takes the column name as a String instead of a Column.
 
 * Table of contents
 {:toc}
@@ -34,37 +34,37 @@ Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL
   </thead>
   <tbody>
     <tr>
-      <td> <b>{any | some | bool_or}</b>(<i>e: Column</i>)</td>
+      <td> <b>{any | some | bool_or}</b>(<i>c: Column</i>)</td>
       <td>Column name</td>
       <td>Returns true if at least one value is true</td>
     </tr>
     <tr>
-      <td> <b>approx_count_distinct</b>(<i>e: Column[, relativeSD: Double]]</i>)</td>
+      <td> <b>approx_count_distinct</b>(<i>c: Column[, relativeSD: Double]</i>)</td>
       <td>Column name; relativeSD: the maximum estimation error allowed.</td>
       <td>Returns the estimated cardinality by HyperLogLog++</td>
     </tr>   
     <tr>
-      <td> <b>{avg | mean}</b>(<i>e: Column</i>)</td>
+      <td> <b>{avg | mean}</b>(<i>c: Column</i>)</td>
       <td>Column name</td>
       <td> Returns the average of values in the input column.</td> 
     </tr>
     <tr>
-      <td> <b>{bool_and | every}</b>(<i>e: Column</i>)</td>
+      <td> <b>{bool_and | every}</b>(<i>c: Column</i>)</td>
       <td>Column name</td>
       <td>Returns true if all values are true</td>
     </tr>
     <tr>
-      <td> <b>collect_list</b>(<i>e: Column</i>)</td>
+      <td> <b>collect_list</b>(<i>c: Column</i>)</td>
       <td>Column name</td>
       <td>Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle</td>
     </tr>       
     <tr>
-      <td> <b>collect_set</b>(<i>e: Column</i>)</td>
+      <td> <b>collect_set</b>(<i>c: Column</i>)</td>
       <td>Column name</td>
       <td>Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.</td>
     </tr>
     <tr>
-      <td> <b>corr</b>(<i>e1: Column, e2: Column</i>)</td>
+      <td> <b>corr</b>(<i>c1: Column, c2: Column</i>)</td>
       <td>Column name</td>
       <td>Returns Pearson coefficient of correlation between a set of number pairs</td>
     </tr>
@@ -74,12 +74,12 @@ Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL
       <td>Returns the total number of retrieved rows, including rows containing null</td>
     </tr>
     <tr>
-      <td> <b>count</b>(<i>e: Column[, e: Column]</i>)</td>
+      <td> <b>count</b>(<i>c: Column[, c: Column]</i>)</td>
       <td>Column name</td>
       <td>Returns the number of rows for which the supplied column(s) are all not null</td>
     </tr>
     <tr>
-      <td> <b>count</b>(<b>DISTINCT</b> <i> e: Column[, e: Column</i>])</td>
+      <td> <b>count</b>(<b>DISTINCT</b> <i> c: Column[, c: Column</i>])</td>
       <td>Column name</td>
       <td>Returns the number of rows for which the supplied column(s) are unique and not null</td>
     </tr> 
@@ -89,102 +89,102 @@ Spark SQL Aggregate functions are grouped as <code>agg_funcs</code> in spark SQL
       <td>Returns the number of rows for which the predicate evaluates to <code>TRUE</code></td>
     </tr> 
     <tr>
-        <td> <b>count_min_sketch</b>(<i>e: Column, eps: double, confidence: double, seed integer</i>)</td>
+        <td> <b>count_min_sketch</b>(<i>c: Column, eps: double, confidence: double, seed: integer</i>)</td>
         <td>Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer</td>
         <td>Returns a count-min sketch of a column with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space..</td>
     </tr>
     <tr>
-      <td> <b>covar_pop</b>(<i>e1: Column, e2: Column</i>)</td>
+      <td> <b>covar_pop</b>(<i>c1: Column, c2: Column</i>)</td>
       <td>Column name</td>
       <td>Returns the population covariance of a set of number pairs</td>
     </tr> 
     <tr>
-      <td> <b>covar_samp</b>(<i>e1: Column, e2: Column</i>)</td>
+      <td> <b>covar_samp</b>(<i>c1: Column, c2: Column</i>)</td>
       <td>Column name</td>
       <td>Returns the sample covariance of a set of number pairs</td>
     </tr>  
     <tr>
-      <td> <b>{first | first_value}</b>(<i>e: Column[, isIgnoreNull]</i>)</td>
+      <td> <b>{first | first_value}</b>(<i>c: Column[, isIgnoreNull]</i>)</td>
       <td>Column name[, True/False(default)]</td>
       <td>Returns the first value of column for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic</td>
     </tr>      
     <tr>
-       <td> <b>kurtosis</b>(<i>e: Column</i>)</td>
+       <td> <b>kurtosis</b>(<i>c: Column</i>)</td>
        <td>Column name</td>
        <td>Returns the kurtosis value calculated from values of a group</td>
     </tr>    
     <tr>
-      <td> <b>{last | last_value}</b>(<i>e: Column[, isIgnoreNull]</i>)</td>
+      <td> <b>{last | last_value}</b>(<i>c: Column[, isIgnoreNull]</i>)</td>
       <td>Column name[, True/False(default)]</td>
       <td>Returns the last value of column for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic</td>
     </tr>      
     <tr>
-      <td> <b>max</b>(<i>e: Column</i>)</td>
+      <td> <b>max</b>(<i>c: Column</i>)</td>
       <td>Column name</td>
       <td>Returns the maximum value of the column.</td>
     </tr>          
     <tr>
-      <td> <b>max_by</b>(<i>e1: Column, e2: Column</i>)</td>
+      <td> <b>max_by</b>(<i>c1: Column, c2: Column</i>)</td>
       <td>Column name</td>
-      <td>Returns the value of column e1 associated with the maximum value of column e2.</td>
+      <td>Returns the value of column c1 associated with the maximum value of column c2.</td>
     </tr>   
     <tr>
-      <td> <b>min</b>(<i>e: Column</i>)</td>
+      <td> <b>min</b>(<i>c: Column</i>)</td>
       <td>Column name</td>
       <td>Returns the minimum value of the column.</td>
     </tr>          
     <tr>
-      <td> <b>min_by</b>(<i>e1: Column, e2: Column</i>)</td>
+      <td> <b>min_by</b>(<i>c1: Column, c2: Column</i>)</td>
       <td>Column name</td>
-      <td>Returns the value of column e1 associated with the minimum value of column e2.</td>
+      <td>Returns the value of column c1 associated with the minimum value of column c2.</td>
     </tr>      
     <tr>
-      <td> <b>percentile</b>(<i>e: Column, percentage [, frequency]</i>)</td>
+      <td> <b>percentile</b>(<i>c: Column, percentage [, frequency]</i>)</td>
       <td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td>
       <td>Returns the exact percentile value of numeric column at the given percentage.</td>
     </tr>         
     <tr>
-      <td> <b>percentile</b>(<i>e: Column, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
+      <td> <b>percentile</b>(<i>c: Column, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
       <td>Column name; percentage array is an array of number between 0 and 1; frequency is a positive integer</td>
       <td>Returns the exact percentile value array of numeric column at the given percentage(s).</td>
     </tr>        
     <tr>
-      <td> <b>{percentile_approx | percentile_approx}</b>(<i>e: Column, percentage [, frequency]</i>)</td>
+      <td> <b>{percentile_approx | percentile_approx}</b>(<i>c: Column, percentage [, frequency]</i>)</td>
       <td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td>
       <td>Returns the approximate percentile value of numeric column at the given percentage.</td>
     </tr>         
     <tr>
-      <td> <b>{percentile_approx | percentile_approx}</b>(<i>e: Column, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
+      <td> <b>{percentile_approx | percentile_approx}</b>(<i>c: Column, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
       <td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td>
       <td>Returns the approximate percentile value of numeric column at the given percentage.</td>
     </tr>             
     <tr>
-       <td> <b>skewness</b>(<i>e: Column</i>)</td>
+       <td> <b>skewness</b>(<i>c: Column</i>)</td>
        <td>Column name</td>
        <td>Returns the skewness value calculated from values of a group</td>
     </tr>    
     <tr>
-      <td> <b>{stddev_samp | stddev | std}</b>(<i>e: Column</i>)</td>
+      <td> <b>{stddev_samp | stddev | std}</b>(<i>c: Column</i>)</td>
       <td>Column name</td>
       <td>Returns the sample standard deviation calculated from values of a group</td>
     </tr>  
     <tr>
-      <td> <b>stddev_pop</b>(<i>e: Column</i>)</td>
+      <td> <b>stddev_pop</b>(<i>c: Column</i>)</td>
       <td>Column name</td>
       <td>Returns the population standard deviation calculated from values of a group</td>
     </tr>
     <tr>
-      <td> <b>sum</b>(<i>e: Column</i>)</td>
+      <td> <b>sum</b>(<i>c: Column</i>)</td>
       <td>Column name</td>
       <td>Returns the sum calculated from values of a group.</td>
     </tr>       
     <tr>
-      <td> <b>{variance | var_samp}</b>(<i>e: Column</i>)</td>
+      <td> <b>{variance | var_samp}</b>(<i>c: Column</i>)</td>
       <td>Column name</td>
       <td>Returns the sample variance calculated from values of a group</td>
     </tr>    
     <tr>
-      <td> <b>var_pop</b>(<i>e: Column</i>)</td>
+      <td> <b>var_pop</b>(<i>c: Column</i>)</td>
       <td>Column name</td>
       <td>Returns the population variance calculated from values of a group</td>
     </tr>        
@@ -362,11 +362,11 @@ SELECT c1 FROM buildin_agg GROUP BY c1 HAVING COUNT_IF(c2 % 2 = 0);
 
 --COUNT_MIN_SKETCH
 SELECT COUNT_MIN_SKETCH(c1, 1D, 0.2D, 3) FROM buildin_agg;
-+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
-|count_min_sketch(c1, 0.9, 0.2, 3)                                                                                                                            |
-+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
-|[00 00 00 01 00 00 00 00 00 00 00 07 00 00 00 01 00 00 00 03 00 00 00 00 5D 93 49 A6 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 06]|
-+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
++----------------------------------------------------------+
+|count_min_sketch(c1, 0.9, 0.2, 3)                         |
++----------------------------------------------------------+
+|[00 00 00 01 00 00 00 00 00 00 00 07 00 00 00 01 00 00...]|
++----------------------------------------------------------+
 
 --COVAR_POP
 SELECT COVAR_POP(c1, c2) FROM buildin_agg;

From 85f4181cb62c008e6a7b929f3b4a0e946a802736 Mon Sep 17 00:00:00 2001
From: Qianyang Yu <qyu@us.ibm.com>
Date: Mon, 6 Apr 2020 20:50:18 -0700
Subject: [PATCH 5/9] add concrete sql type

---
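Note (not part of the commit message): this patch merges the three count rows into one, so a single description now covers `count(*)`, `count(expression)` and `count(DISTINCT expression)`. The sketch below shows how the three forms treat nulls. It uses Python's built-in sqlite3 module as a stand-in for Spark SQL and a made-up table `t` rather than `buildin_agg`, on the assumption that both engines follow the standard single-column COUNT semantics. The multi-column forms are not covered.

```python
import sqlite3

# Hypothetical stand-in table; the rows are invented for illustration and
# are not the buildin_agg data used in the doc examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c1 INTEGER, c2 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 10), (1, 20), (2, None), (None, 30), (None, None)],
)


def scalar(sql):
    """Run a query that yields one row with one column and return that value."""
    return conn.execute(sql).fetchone()[0]


# count(*): every retrieved row, including rows that contain nulls.
count_star = scalar("SELECT COUNT(*) FROM t")
# count(c1): only rows where c1 is not null.
count_c1 = scalar("SELECT COUNT(c1) FROM t")
# count(DISTINCT c1): distinct non-null values of c1.
count_distinct_c1 = scalar("SELECT COUNT(DISTINCT c1) FROM t")

print(count_star, count_c1, count_distinct_c1)
```

On this table the three counts differ, which is the distinction the merged description has to carry on its own.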
 docs/sql-ref-functions-builtin-aggregate.md | 186 +++++++++-----------
 1 file changed, 87 insertions(+), 99 deletions(-)

diff --git a/docs/sql-ref-functions-builtin-aggregate.md b/docs/sql-ref-functions-builtin-aggregate.md
index 5e77ba66c898..8779e342d0cb 100644
--- a/docs/sql-ref-functions-builtin-aggregate.md
+++ b/docs/sql-ref-functions-builtin-aggregate.md
@@ -24,169 +24,157 @@ operate on a group of rows and return a single value.
 
 Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL. Below is the list of functions.
 
-**Note:** All functions below have another signature which takes String as a column name instead of Column.
+**Note:** All functions below have another signature which takes String as a expression.
 
-* Table of contents
-{:toc}
 <table class="table">
   <thead>
-    <tr><th style="width:25%">Function</th><th>Parameters</th><th>Description</th></tr>
+    <tr><th style="width:25%">Function</th><th>Parameter Type(s)</th><th>Description</th></tr>
   </thead>
   <tbody>
     <tr>
-      <td> <b>{any | some | bool_or}</b>(<i>c: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns true if at least one value is true</td>
+      <td><b>{any | some | bool_or}</b>(<i>expression</i>)</td>
+      <td>boolean</td>
+      <td>Returns true if at least one value is true.</td>
     </tr>
     <tr>
-      <td> <b>approx_count_distinct</b>(<i>c: Column[, relativeSD: Double]]</i>)</td>
-      <td>Column name; relativeSD: the maximum estimation error allowed.</td>
-      <td>Returns the estimated cardinality by HyperLogLog++</td>
+      <td><b>approx_count_distinct</b>(<i>expression[, relativeSD]</i>)</td>
+      <td>(long, double)</td>
+      <td>RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++.</td>
     </tr>   
     <tr>
-      <td> <b>{avg | mean}</b>(<i>c: Column</i>)</td>
-      <td>Column name</td>
-      <td> Returns the average of values in the input column.</td> 
+      <td><b>{avg | mean}</b>(<i>expression</i>)</td>
+      <td>numeric or string</td>
+      <td>Returns the average of values in the input expression.</td> 
     </tr>
     <tr>
-      <td> <b>{bool_and | every}</b>(<i>c: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns true if all values are true</td>
+      <td><b>{bool_and | every}</b>(<i>expression</i>)</td>
+      <td>boolean</td>
+      <td>Returns true if all values are true.</td>
     </tr>
     <tr>
-      <td> <b>collect_list</b>(<i>c: Column</i>)</td>
-      <td>Column name</td>
-      <td>Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle</td>
+      <td><b>collect_list</b>(<i>expression</i>)</td>
+      <td>any</td>
+      <td>Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.</td>
     </tr>       
     <tr>
-      <td> <b>collect_set</b>(<i>c: Column</i>)</td>
-      <td>Column name</td>
+      <td><b>collect_set</b>(<i>expression</i>)</td>
+      <td>any</td>
       <td>Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.</td>
     </tr>
     <tr>
-      <td> <b>corr</b>(<i>c1: Column, c2: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns Pearson coefficient of correlation between a set of number pairs</td>
+      <td><b>corr</b>(<i>expression1, expression2</i>)</td>
+      <td>double, double</td>
+      <td>Returns Pearson coefficient of correlation between a set of number pairs.</td>
     </tr>
     <tr>
-      <td> <b>count</b>(<i>*</i>)</td>
-      <td>None</td>
-      <td>Returns the total number of retrieved rows, including rows containing null</td>
+      <td><b>count</b>([<b>DISTINCT</b>] {<i><b>*</b></i> | <i>expression1[, expression2</i>]})</td>
+      <td>none; any</td>
+      <td>If specified <code>DISTINCT</code>, returns the number of rows for which the supplied expression(s) are unique and not null; If specified `*`, returns the total number of retrieved rows, including rows containing null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null.</td>
     </tr>
     <tr>
-      <td> <b>count</b>(<i>c: Column[, c: Column]</i>)</td>
-      <td>Column name</td>
-      <td>Returns the number of rows for which the supplied column(s) are all not null</td>
-    </tr>
-    <tr>
-      <td> <b>count</b>(<b>DISTINCT</b> <i> c: Column[, c: Column</i>])</td>
-      <td>Column name</td>
-      <td>Returns the number of rows for which the supplied column(s) are unique and not null</td>
-    </tr> 
-    <tr>
-      <td> <b>count_if</b>(<i>Predicate</i>)</td>
-      <td>Expression that will be used for aggregation calculation</td>
-      <td>Returns the count number from the predicate evaluate to <code>TRUE</code> values</td>
+      <td><b>count_if</b>(<i>predicate</i>)</td>
+      <td>expression that will be used for aggregation calculation</td>
+      <td>Returns the number of rows for which the predicate evaluates to <code>TRUE</code>.</td>
     </tr> 
     <tr>
-        <td> <b>count_min_sketch</b>(<i>c: Column, eps: double, confidence: double, seed integer</i>)</td>
-        <td>Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer</td>
-        <td>Returns a count-min sketch of a column with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space..</td>
+      <td><b>count_min_sketch</b>(<i>expression, eps, confidence, seed</i>)</td>
+      <td>integral or string or binary, double,  double, integer</td>
+      <td>Eps and confidence are the double values between 0.0 and 1.0, seed is a positive integer. Returns a count-min sketch of a expression with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.</td>
     </tr>
     <tr>
-      <td> <b>covar_pop</b>(<i>c1: Column, c2: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns the population covariance of a set of number pairs</td>
+      <td><b>covar_pop</b>(<i>expression1, expression2</i>)</td>
+      <td>double, double</td>
+      <td>Returns the population covariance of a set of number pairs.</td>
     </tr> 
     <tr>
-      <td> <b>covar_samp</b>(<i>c1: Column, c2: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns the sample covariance of a set of number pairs</td>
+      <td><b>covar_samp</b>(<i>expression1, expression2</i>)</td>
+      <td>double</td>
+      <td>Returns the sample covariance of a set of number pairs.</td>
     </tr>  
     <tr>
-      <td> <b>{first | first_value}</b>(<i>c: Column[, isIgnoreNull]</i>)</td>
-      <td>Column name[, True/False(default)]</td>
-      <td>Returns the first value of column for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic</td>
+      <td><b>{first | first_value}</b>(<i>expression[, isIgnoreNull]</i>)</td>
+      <td>any, boolean</td>
+      <td>Returns the first value of expression for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic.</td>
     </tr>      
     <tr>
-       <td> <b>kurtosis</b>(<i>c: Column</i>)</td>
-       <td>Column name</td>
-       <td>Returns the kurtosis value calculated from values of a group</td>
+      <td><b>kurtosis</b>(<i>expression</i>)</td>
+      <td>double</td>
+      <td>Returns the kurtosis value calculated from values of a group.</td>
     </tr>    
     <tr>
-      <td> <b>{last | last_value}</b>(<i>c: Column[, isIgnoreNull]</i>)</td>
-      <td>Column name[, True/False(default)]</td>
-      <td>Returns the last value of column for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic</td>
+      <td><b>{last | last_value}</b>(<i>expression[, isIgnoreNull]</i>)</td>
+      <td>any, boolean</td>
+      <td>Returns the last value of expression for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic.</td>
     </tr>      
     <tr>
-      <td> <b>max</b>(<i>c: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns the maximum value of the column.</td>
+      <td><b>max</b>(<i>expression</i>)</td>
+      <td>any numeric, string, date/time or arrays of these types</td>
+      <td>Returns the maximum value of the expression.</td>
     </tr>          
     <tr>
-      <td> <b>max_by</b>(<i>c1: Column, c2: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns the value of column c1 associated with the maximum value of column c2.</td>
+      <td><b>max_by</b>(<i>expression1, expression2</i>)</td>
+      <td>any numeric, string, date/time or arrays of these types</td>
+      <td>Returns the value of expression1 associated with the maximum value of expression2.</td>
     </tr>   
     <tr>
-      <td> <b>min</b>(<i>c: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns the minimum value of the column.</td>
+      <td><b>min</b>(<i>expression</i>)</td>
+      <td>any numeric, string, date/time or arrays of these types</td>
+      <td>Returns the minimum value of the expression.</td>
     </tr>          
     <tr>
-      <td> <b>min_by</b>(<i>c1: Column, c2: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns the value of column c1 associated with the minimum value of column c2.</td>
+      <td><b>min_by</b>(<i>expression1, expression2</i>)</td>
+      <td>any numeric, string, date/time or arrays of these types</td>
+      <td>Returns the value of expression1 associated with the minimum value of expression2.</td>
     </tr>      
     <tr>
-      <td> <b>percentile</b>(<i>c: Column, percentage [, frequency]</i>)</td>
-      <td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td>
-      <td>Returns the exact percentile value of numeric column at the given percentage.</td>
+      <td><b>percentile</b>(<i>expression, percentage [, frequency]</i>)</td>
+      <td>numeric Type, double, integral type</td>
+      <td>Percentage is a number between 0 and 1; Frequency is a positive integer. Returns the exact percentile value of numeric expression at the given percentage.</td>
     </tr>         
     <tr>
-      <td> <b>percentile</b>(<i>c: Column, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
-      <td>Column name; percentage array is an array of number between 0 and 1; frequency is a positive integer</td>
-      <td>Returns the exact percentile value array of numeric column at the given percentage(s).</td>
+      <td><b>percentile</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
+      <td>numeric type; double; integral type</td>
+      <td>Percentage array is an array of number between 0 and 1; frequency is a positive integer. Returns the exact percentile value array of numeric expression at the given percentage(s).</td>
     </tr>        
     <tr>
-      <td> <b>{percentile_approx | percentile_approx}</b>(<i>c: Column, percentage [, frequency]</i>)</td>
-      <td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td>
-      <td>Returns the approximate percentile value of numeric column at the given percentage.</td>
+      <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td>
+      <td>numeric, date, timestamp; double; integral</td>
+      <td>Percentage is a number between 0 and 1; Frequency is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
     </tr>         
     <tr>
-      <td> <b>{percentile_approx | percentile_approx}</b>(<i>c: Column, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
-      <td>Column name; percentage is a number between 0 and 1; frequency is a positive integer</td>
-      <td>Returns the approximate percentile value of numeric column at the given percentage.</td>
+      <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
+      <td>numeric, date, timestamp; double; integral</td>
+      <td>Percentage is a number between 0 and 1; Frequency is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
     </tr>             
     <tr>
-       <td> <b>skewness</b>(<i>c: Column</i>)</td>
-       <td>Column name</td>
-       <td>Returns the skewness value calculated from values of a group</td>
+      <td><b>skewness</b>(<i>expression</i>)</td>
+      <td>double</td>
+      <td>Returns the skewness value calculated from values of a group.</td>
     </tr>    
     <tr>
-      <td> <b>{stddev_samp | stddev | std}</b>(<i>c: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns the sample standard deviation calculated from values of a group</td>
+      <td><b>{stddev_samp | stddev | std}</b>(<i>expression</i>)</td>
+      <td>double</td>
+      <td>Returns the sample standard deviation calculated from values of a group.</td>
     </tr>  
     <tr>
-      <td> <b>stddev_pop</b>(<i>c: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns the population standard deviation calculated from values of a group</td>
+      <td><b>stddev_pop</b>(<i>expression</i>)</td>
+      <td>double</td>
+      <td>Returns the population standard deviation calculated from values of a group.</td>
     </tr>
     <tr>
-      <td> <b>sum</b>(<i>c: Column</i>)</td>
-      <td>Column name</td>
+      <td><b>sum</b>(<i>expression</i>)</td>
+      <td>numeric</td>
       <td>Returns the sum calculated from values of a group.</td>
     </tr>       
     <tr>
-      <td> <b>{variance | var_samp}</b>(<i>c: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns the sample variance calculated from values of a group</td>
+      <td><b>{variance | var_samp}</b>(<i>expression</i>)</td>
+      <td>double</td>
+      <td>Returns the sample variance calculated from values of a group.</td>
     </tr>    
     <tr>
-      <td> <b>var_pop</b>(<i>c: Column</i>)</td>
-      <td>Column name</td>
-      <td>Returns the population variance calculated from values of a group</td>
+      <td><b>var_pop</b>(<i>expression</i>)</td>
+      <td>double</td>
+      <td>Returns the population variance calculated from values of a group.</td>
     </tr>        
   </tbody>
 </table>

From 944afd50f10a9fae8ecec4794c867372dcd62bd2 Mon Sep 17 00:00:00 2001
From: Qianyang Yu <qyu@us.ibm.com>
Date: Mon, 6 Apr 2020 22:35:58 -0700
Subject: [PATCH 6/9] adjust style based on comments

---
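Note (not part of the commit message): the `first`/`last` rows touched here describe `isIgnoreNull` only in prose. The following is a minimal pure-Python sketch of that behaviour, assuming a group arrives as an ordered list and that SQL null maps to `None`. The helper names `first` and `last` and the `ignore_nulls` parameter are hypothetical and are not Spark APIs.

```python
def first(values, ignore_nulls=False):
    """Return the first value of a group; skip None entries if ignore_nulls is True."""
    for value in values:
        if value is not None or not ignore_nulls:
            return value
    # Empty group, or a group holding only nulls while ignore_nulls is True.
    return None


def last(values, ignore_nulls=False):
    """Return the last value of a group by scanning it in reverse order."""
    return first(list(reversed(values)), ignore_nulls)


# A made-up group whose first and last rows are both null.
group = [None, 3, 5, None]

first_default = first(group)                      # null rows are kept
first_non_null = first(group, ignore_nulls=True)  # null rows are skipped
last_default = last(group)
last_non_null = last(group, ignore_nulls=True)

print(first_default, first_non_null, last_default, last_non_null)
```

The result depends entirely on the order of `group`, which is why the table marks both functions as non-deterministic when row order is not fixed.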
 docs/sql-ref-functions-builtin-aggregate.md | 138 ++++++++++++--------
 1 file changed, 82 insertions(+), 56 deletions(-)

diff --git a/docs/sql-ref-functions-builtin-aggregate.md b/docs/sql-ref-functions-builtin-aggregate.md
index 8779e342d0cb..3cf5450ddf04 100644
--- a/docs/sql-ref-functions-builtin-aggregate.md
+++ b/docs/sql-ref-functions-builtin-aggregate.md
@@ -22,13 +22,9 @@ license: |
 Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions
 operate on a group of rows and return a single value.
 
-Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL. Below is the list of functions.
-
-**Note:** All functions below have another signature which takes String as a expression.
-
 <table class="table">
   <thead>
-    <tr><th style="width:25%">Function</th><th>Parameter Type(s)</th><th>Description</th></tr>
+    <tr><th style="width:25%">Function</th><th>Argument Type(s)</th><th>Description</th></tr>
   </thead>
   <tbody>
     <tr>
@@ -39,7 +35,7 @@ Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL
     <tr>
       <td><b>approx_count_distinct</b>(<i>expression[, relativeSD]</i>)</td>
       <td>(long, double)</td>
-      <td>RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++.</td>
+      <td>`relativeSD` is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++.</td>
     </tr>   
     <tr>
       <td><b>{avg | mean}</b>(<i>expression</i>)</td>
@@ -63,12 +59,12 @@ Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL
     </tr>
     <tr>
       <td><b>corr</b>(<i>expression1, expression2</i>)</td>
-      <td>double, double</td>
+      <td>(double, double)</td>
       <td>Returns Pearson coefficient of correlation between a set of number pairs.</td>
     </tr>
     <tr>
       <td><b>count</b>([<b>DISTINCT</b>] {<i><b>*</b></i> | <i>expression1[, expression2</i>]})</td>
-      <td>none; any</td>
+      <td>none, any, any</td>
       <td>If specified <code>DISTINCT</code>, returns the number of rows for which the supplied expression(s) are unique and not null; If specified `*`, returns the total number of retrieved rows, including rows containing null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null.</td>
     </tr>
     <tr>
@@ -79,21 +75,21 @@ Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL
     <tr>
       <td><b>count_min_sketch</b>(<i>expression, eps, confidence, seed</i>)</td>
       <td>integral or string or binary, double,  double, integer</td>
-      <td>Eps and confidence are the double values between 0.0 and 1.0, seed is a positive integer. Returns a count-min sketch of a expression with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.</td>
+      <td>`eps` and `confidence` are double values between 0.0 and 1.0; `seed` is a positive integer. Returns a count-min sketch of an expression with the given `eps`, `confidence` and `seed`. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.</td>
     </tr>
     <tr>
       <td><b>covar_pop</b>(<i>expression1, expression2</i>)</td>
-      <td>double, double</td>
+      <td>(double, double)</td>
       <td>Returns the population covariance of a set of number pairs.</td>
     </tr> 
     <tr>
       <td><b>covar_samp</b>(<i>expression1, expression2</i>)</td>
-      <td>double</td>
+      <td>(double, double)</td>
       <td>Returns the sample covariance of a set of number pairs.</td>
     </tr>  
     <tr>
       <td><b>{first | first_value}</b>(<i>expression[, isIgnoreNull]</i>)</td>
-      <td>any, boolean</td>
+      <td>(any, boolean)</td>
       <td>Returns the first value of expression for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic.</td>
     </tr>      
     <tr>
@@ -103,7 +99,7 @@ Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL
     </tr>    
     <tr>
       <td><b>{last | last_value}</b>(<i>expression[, isIgnoreNull]</i>)</td>
-      <td>any, boolean</td>
+      <td>(any, boolean)</td>
       <td>Returns the last value of expression for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic.</td>
     </tr>      
     <tr>
@@ -128,23 +124,23 @@ Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL
     </tr>      
     <tr>
       <td><b>percentile</b>(<i>expression, percentage [, frequency]</i>)</td>
-      <td>numeric Type, double, integral type</td>
-      <td>Percentage is a number between 0 and 1; Frequency is a positive integer. Returns the exact percentile value of numeric expression at the given percentage.</td>
+      <td>numeric type, double, integral type</td>
+      <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value of numeric expression at the given percentage.</td>
     </tr>         
     <tr>
       <td><b>percentile</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
       <td>numeric type; double; integral type</td>
-      <td>Percentage array is an array of number between 0 and 1; frequency is a positive integer. Returns the exact percentile value array of numeric expression at the given percentage(s).</td>
+      <td>Percentage array is an array of numbers between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value array of numeric expression at the given percentage(s).</td>
     </tr>        
     <tr>
       <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td>
       <td>numeric, date, timestamp; double; integral</td>
-      <td>Percentage is a number between 0 and 1; Frequency is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
+      <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
     </tr>         
     <tr>
       <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
-      <td>numeric, date, timestamp; double; integral</td>
-      <td>Percentage is a number between 0 and 1; Frequency is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
+      <td>numeric|date|timestamp, double, integral</td>
+      <td>Each `percentage` in the array is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value array of numeric expression at the given percentage(s).</td>
     </tr>             
     <tr>
       <td><b>skewness</b>(<i>expression</i>)</td>
@@ -182,6 +178,7 @@ Spark SQL aggregate functions are grouped as <code>agg_funcs</code> in Spark SQL
 ### Examples
 {% highlight sql %}
 --base table 
+
 SELECT * FROM buildin_agg;
 +----+----+----+-----+----+
 |  c1|  c2|  c3|   c4|  c5|
@@ -196,7 +193,8 @@ SELECT * FROM buildin_agg;
 |null|   4|agg7|false|true|
 +----+----+----+-----+----+
 
--- ANY, SOME and BOOL_OR
+-- any, some and bool_or
+
 SELECT ANY(c4) FROM buildin_agg;
 +-------+
 |any(c4)|
@@ -218,7 +216,8 @@ SELECT BOOL_OR(c5) FROM buildin_agg;
 |       true|
 +-----------+
 
--- APPROX_COUNT_DISTINCT
+-- approx_count_distinct
+
 SELECT APPROX_COUNT_DISTINCT(c1) FROM buildin_agg;
 +-------------------------+
 |approx_count_distinct(c1)|
@@ -226,14 +225,15 @@ SELECT APPROX_COUNT_DISTINCT(c1) FROM buildin_agg;
 |                        5|
 +-------------------------+
 
-SELECT APPROX_COUNT_DISTINCT(c1,0.39d) FROM buildin_agg;
+SELECT APPROX_COUNT_DISTINCT(c1,0.39) FROM buildin_agg;
 +-------------------------+
 |approx_count_distinct(c1)|
 +-------------------------+
 |                        6|
 +-------------------------+
 
--- AVG and MEAN functions
+-- avg and mean
+
 SELECT AVG(c1) FROM buildin_agg;
 +------------------+
 |           avg(c1)|
@@ -248,7 +248,8 @@ SELECT MEAN(c1) FROM buildin_agg;
 |2.4285714285714284|
 +------------------+
 
--- BOOL_AND and EVERY 
+-- bool_and and every
+ 
 SELECT BOOL_AND(c4) FROM buildin_agg;
 +------------+
 |bool_and(c4)|
@@ -263,7 +264,8 @@ SELECT EVERY(c5) FROM buildin_agg;
 |        true|
 +------------+
 
---COLLECT_LIST
+--collect_list
+
 SELECT COLLECT_LIST(c2) FROM buildin_agg;
 +---------------------+
 |collect_list(c2)     |
@@ -278,7 +280,8 @@ SELECT COLLECT_LIST(c4) FROM buildin_agg;
 |[true, false, false, false, true, false, false, false]|
 +------------------------------------------------------+
 
---COLLECT_SET
+--collect_set
+ 
 SELECT COLLECT_SET(c2) FROM buildin_agg;
 +---------------+
 |collect_set(c2)|
@@ -293,7 +296,8 @@ SELECT COLLECT_SET(c3) FROM buildin_agg;
 |[agg7, agg8, agg3, agg6, agg4, agg2, agg5, agg1]|
 +------------------------------------------------+
 
--- CORR
+--corr
+
 SELECT CORR(c1, c2) FROM buildin_agg;
 +--------------------------------------------+
 |corr(CAST(c1 AS DOUBLE), CAST(c2 AS DOUBLE))|
@@ -301,7 +305,8 @@ SELECT CORR(c1, c2) FROM buildin_agg;
 |                          0.7745966692414833|
 +--------------------------------------------+
 
---COUNT(*)
+--count(*)
+
 SELECT COUNT(*) FROM buildin_agg;
 +--------+
 |count(1)|
@@ -309,7 +314,8 @@ SELECT COUNT(*) FROM buildin_agg;
 |       8|
 +--------+
 
--- COUNT
+--count
+
 SELECT COUNT(c2) FROM buildin_agg;
 +---------+
 |count(c2)|
@@ -317,7 +323,8 @@ SELECT COUNT(c2) FROM buildin_agg;
 |        7|
 +---------+
 
---COUNT DISTINCT
+--count distinct
+
 SELECT COUNT(DISTINCT c1) FROM buildin_agg;
 +------------------+
 |count(DISTINCT c1)|
@@ -332,7 +339,8 @@ SELECT COUNT(DISTINCT c1, c2) FROM buildin_agg;
 |                     5|
 +----------------------+
 
---COUNT_IF
+--count_if
+
 SELECT COUNT_IF(c1 IS NULL) from buildin_agg;
 +----------------------+
 |count_if((c1 IS NULL))|
@@ -348,7 +356,8 @@ SELECT c1 FROM buildin_agg GROUP BY c1 HAVING COUNT_IF(c2 % 2 = 0);
 |   1|
 +----+
 
---COUNT_MIN_SKETCH
+--count_min_sketch
+
 SELECT COUNT_MIN_SKETCH(c1, 1D, 0.2D, 3) FROM buildin_agg;
 +----------------------------------------------------------+
 |count_min_sketch(c1, 0.9, 0.2, 3)                         |
@@ -356,7 +365,8 @@ SELECT COUNT_MIN_SKETCH(c1, 1D, 0.2D, 3) FROM buildin_agg;
 |[00 00 00 01 00 00 00 00 00 00 00 07 00 00 00 01 00 00...]|
 +----------------------------------------------------------+
 
---COVAR_POP
+--covar_pop
+
 SELECT COVAR_POP(c1, c2) FROM buildin_agg;
 +-------------------------------------------------+
 |covar_pop(CAST(c1 AS DOUBLE), CAST(c2 AS DOUBLE))|
@@ -364,7 +374,8 @@ SELECT COVAR_POP(c1, c2) FROM buildin_agg;
 |                               0.6666666666666666|
 +-------------------------------------------------+
 
---COVAR_SAMP
+--covar_samp
+
 SELECT COVAR_SAMP(c1, c2) FROM buildin_agg;
 +--------------------------------------------------+
 |covar_samp(CAST(c1 AS DOUBLE), CAST(c2 AS DOUBLE))|
@@ -372,7 +383,8 @@ SELECT COVAR_SAMP(c1, c2) FROM buildin_agg;
 |                                               0.8|
 +--------------------------------------------------+
 
---FIRST and FIRST_VALUE
+--first and first_value
+
 SELECT FIRST(c1) FROM buildin_agg;
 +----------------+
 |first(c1, false)|
@@ -380,35 +392,36 @@ SELECT FIRST(c1) FROM buildin_agg;
 |               2|
 +----------------+
 
-SELECT FIRST(col) FROM VALUES (NULL), (5), (20) AS TAB(col);
+SELECT FIRST(col) FROM VALUES (NULL), (5), (20) AS t(col);
 +-----------------+
 |first(col, false)|
 +-----------------+
 |             null|
 +-----------------+
 
-SELECT FIRST(col, true) FROM VALUES (NULL), (5), (20) AS TAB(col);
+SELECT FIRST(col, true) FROM VALUES (NULL), (5), (20) AS t(col);
 +----------------+
 |first(col, true)|
 +----------------+
 |               5|
 +----------------+
 
-SELECT FIRST_VALUE(col) FROM VALUES (NULL), (5), (20) AS TAB(col);
+SELECT FIRST_VALUE(col) FROM VALUES (NULL), (5), (20) AS t(col);
 +-----------------------+
 |first_value(col, false)|
 +-----------------------+
 |                   null|
 +-----------------------+
 
-SELECT FIRST_VALUE(col, true) FROM VALUES (NULL), (5), (20) AS TAB(col);
+SELECT FIRST_VALUE(col, true) FROM VALUES (NULL), (5), (20) AS t(col);
 +----------------------+
 |first_value(col, true)|
 +----------------------+
 |                     5|
 +----------------------+
 
---KURTOSIS
+--kurtosis
+
 SELECT KURTOSIS(c2) FROM buildin_agg;
 +----------------------------+
 |kurtosis(CAST(c2 AS DOUBLE))|
@@ -416,14 +429,15 @@ SELECT KURTOSIS(c2) FROM buildin_agg;
 |         -0.7325000000000004|
 +----------------------------+
 
-SELECT KURTOSIS(col) FROM VALUES (-1000), (-100), (10), (20) AS TAB(col);
+SELECT KURTOSIS(col) FROM VALUES (-1000), (-100), (10), (20) AS t(col);
 +-----------------------------+
 |kurtosis(CAST(col AS DOUBLE))|
 +-----------------------------+
 |          -0.7014368047529627|
 +-----------------------------+
 
---LAST and LAST_VALUE
+--last and last_value
+
 SELECT LAST(c1) FROM buildin_agg;
 +---------------+
 |last(c1, false)|
@@ -452,7 +466,8 @@ SELECT LAST_VALUE(c1, true) FROM buildin_agg;
 |                   5|
 +--------------------+
 
---MAX
+--max
+
 SELECT MAX(c2) FROM buildin_agg;
 +-------+
 |max(c2)|
@@ -460,7 +475,8 @@ SELECT MAX(c2) FROM buildin_agg;
 |      4|
 +-------+
 
---MAX_BY
+--max_by
+
 SELECT MAX_BY(c1, c3) FROM buildin_agg;
 +-------------+
 |maxby(c1, c3)|
@@ -468,7 +484,8 @@ SELECT MAX_BY(c1, c3) FROM buildin_agg;
 |            5|
 +-------------+
 
---MIN
+--min
+
 SELECT MIN(c1) FROM buildin_agg;
 +-------+
 |min(c1)|
@@ -476,7 +493,8 @@ SELECT MIN(c1) FROM buildin_agg;
 |      1|
 +-------+
 
---MIN_BY
+--min_by
+
 SELECT MIN_BY(c2, c3) FROM buildin_agg;
 +-------------+
 |minby(c2, c3)|
@@ -484,7 +502,8 @@ SELECT MIN_BY(c2, c3) FROM buildin_agg;
 |            1|
 +-------------+
 
---PERCENTILE
+--percentile
+
 SELECT PERCENTILE(c1, 0.3) FROM buildin_agg;
 +--------------------------------------+
 |percentile(c1, CAST(0.3 AS DOUBLE), 1)|
@@ -514,6 +533,7 @@ SELECT PERCENTILE(c1, ARRAY(0.25, 0.75), 10) FROM buildin_agg;
 +-------------------------------------+
 
 --PERCENTILE_APPROX and APPROX_PERCENTILE
+
 SELECT PERCENTILE_APPROX(c1, 0.25, 100) FROM buildin_agg;
 +------------------------------------------------+
 |percentile_approx(c1, CAST(0.25 AS DOUBLE), 100)|
@@ -542,7 +562,8 @@ SELECT APPROX_PERCENTILE(c1, array(0.25, 0.85), 100) FROM buildin_agg;
 |                                       [1, 4]|
 +---------------------------------------------+
 
---SKEWNESS
+--skewness
+
 SELECT SKEWNESS(c1) FROM buildin_agg;
 +----------------------------+
 |skewness(CAST(c1 AS DOUBLE))|
@@ -550,14 +571,15 @@ SELECT SKEWNESS(c1) FROM buildin_agg;
 |          0.5200705032248686|
 +----------------------------+
 
-SELECT SKEWNESS(col) FROM VALUES (-1000), (-100), (10), (20) AS TAB(col);
+SELECT SKEWNESS(col) FROM VALUES (-1000), (-100), (10), (20) AS t(col);
 +-----------------------------+
 |skewness(CAST(col AS DOUBLE))|
 +-----------------------------+
 |          -1.1135657469022011|
 +-----------------------------+
 
---STDDEV_SAMP, STDDEV and STD
+--stddev_samp, stddev and std
+
 SELECT STDDEV_SAMP(c1) FROM buildin_agg;
 +-------------------------------+
 |stddev_samp(CAST(c1 AS DOUBLE))|
@@ -579,7 +601,8 @@ SELECT STD(c1) FROM buildin_agg;
 |      1.618347187425374|
 +-----------------------+
 
---STDDEV_POP
+--stddev_pop
+
 SELECT STDDEV_POP(c1) FROM buildin_agg;
 +------------------------------+
 |stddev_pop(CAST(c1 AS DOUBLE))|
@@ -587,8 +610,9 @@ SELECT STDDEV_POP(c1) FROM buildin_agg;
 |             1.498298354528788|
 +------------------------------+
 
---SUM
-SELECT SUM(col) FROM VALUES (5), (10), (15) AS TAB(col);
+--sum
+
+SELECT SUM(col) FROM VALUES (5), (10), (15) AS t(col);
 +--------+
 |sum(col)|
 +--------+
@@ -602,14 +626,15 @@ SELECT SUM(c1) FROM buildin_agg;
 |     17|
 +-------+
 
-SELECT SUM(col) FROM VALUES (NULL), (NULL) AS TAB(col);
+SELECT SUM(col) FROM VALUES (NULL), (NULL) AS t(col);
 +--------+
 |sum(col)|
 +--------+
 |    null|
 +--------+
 
---VARIANCE and VAR_SAMP
+--variance and var_samp
+
 SELECT VARIANCE(c1) FROM buildin_agg;
 +----------------------------+
 |variance(CAST(c1 AS DOUBLE))|
@@ -624,7 +649,8 @@ SELECT VAR_SAMP(c1) FROM buildin_agg;
 |           2.619047619047619|
 +----------------------------+
 
---VAR_POP
+--var_pop
+
 SELECT VAR_POP(c1) FROM buildin_agg;
 +---------------------------+
 |var_pop(CAST(c1 AS DOUBLE))|

From 6ddeca49479b77c86de8df4e25d05d2e04bd76f5 Mon Sep 17 00:00:00 2001
From: Qianyang Yu <qyu@us.ibm.com>
Date: Mon, 6 Apr 2020 23:52:22 -0700
Subject: [PATCH 7/9] separate entries

---
 docs/sql-ref-functions-builtin-aggregate.md | 41 ++++++++++++++-------
 1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/docs/sql-ref-functions-builtin-aggregate.md b/docs/sql-ref-functions-builtin-aggregate.md
index 3cf5450ddf04..cef5ecdff535 100644
--- a/docs/sql-ref-functions-builtin-aggregate.md
+++ b/docs/sql-ref-functions-builtin-aggregate.md
@@ -63,9 +63,14 @@ operate on a group of rows and return a single value.
       <td>Returns Pearson coefficient of correlation between a set of number pairs.</td>
     </tr>
     <tr>
-      <td><b>count</b>([<b>DISTINCT</b>] {<i><b>*</b></i> | <i>expression1[, expression2</i>]})</td>
-      <td>none, any, any</td>
-      <td>If specified <code>DISTINCT</code>, returns the number of rows for which the supplied expression(s) are unique and not null; If specified `*`, returns the total number of retrieved rows, including rows containing null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null.</td>
+      <td><b>count</b>([<b>DISTINCT</b>] <i>*</i>)</td>
+      <td>none</td>
+      <td>If specified <code>DISTINCT</code>, returns the total number of retrieved rows are unique and not null; Otherwise, returns the total number of retrieved rows, including rows containing null.</td>
+    </tr>
+    <tr>
+      <td><b>count</b>([<b>DISTINCT</b>] <i>expression1[, expression2</i>])</td>
+      <td>(any, any)</td>
+      <td>If specified <code>DISTINCT</code>, returns the number of rows for which the supplied expression(s) are unique and not null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null.</td>
     </tr>
     <tr>
       <td><b>count_if</b>(<i>predicate</i>)</td>
@@ -74,7 +79,7 @@ operate on a group of rows and return a single value.
     </tr> 
     <tr>
       <td><b>count_min_sketch</b>(<i>expression, eps, confidence, seed</i>)</td>
-      <td>integral or string or binary, double,  double, integer</td>
+      <td>(integer or string or binary, double,  double, integer)</td>
       <td>`eps` and `confidence` are the double values between 0.0 and 1.0, `seed` is a positive integer. Returns a count-min sketch of a expression with the given `esp`, `confidence` and `seed`. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.</td>
     </tr>
     <tr>
@@ -104,42 +109,52 @@ operate on a group of rows and return a single value.
     </tr>      
     <tr>
       <td><b>max</b>(<i>expression</i>)</td>
-      <td>any numeric, string, date/time or arrays of these types</td>
+      <td>any numeric, string, datetime or arrays of these types</td>
       <td>Returns the maximum value of the expression.</td>
     </tr>          
     <tr>
       <td><b>max_by</b>(<i>expression1, expression2</i>)</td>
-      <td>any numeric, string, date/time or arrays of these types</td>
+      <td>any numeric, string, datetime or arrays of these types</td>
       <td>Returns the value of expression1 associated with the maximum value of expression2.</td>
     </tr>   
     <tr>
       <td><b>min</b>(<i>expression</i>)</td>
-      <td>any numeric, string, date/time or arrays of these types</td>
+      <td>any numeric, string, datetime or arrays of these types</td>
       <td>Returns the minimum value of the expression.</td>
     </tr>          
     <tr>
       <td><b>min_by</b>(<i>expression1, expression2</i>)</td>
-      <td>any numeric, string, date/time or arrays of these types</td>
+      <td>any numeric, string, datetime or arrays of these types</td>
       <td>Returns the value of expression1 associated with the minimum value of expression2.</td>
     </tr>      
     <tr>
       <td><b>percentile</b>(<i>expression, percentage [, frequency]</i>)</td>
-      <td>numeric type, double, integral type</td>
+      <td>numeric, double, integer</td>
       <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value of numeric expression at the given percentage.</td>
     </tr>         
     <tr>
       <td><b>percentile</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
-      <td>numeric type; double; integral type</td>
+      <td>numeric, double, integer</td>
       <td>Percentage array is an array of number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value array of numeric expression at the given percentage(s).</td>
     </tr>        
     <tr>
       <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td>
-      <td>numeric, date, timestamp; double; integral</td>
+      <td>numeric, double, integer</td>
       <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
-    </tr>         
+    </tr>    
+   <tr>
+      <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td>
+      <td>datetime, double, integer</td>
+      <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
+    </tr>                  
+    <tr>
+      <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
+      <td>numeric, double, integer</td>
+      <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
+    </tr>             
     <tr>
       <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
-      <td>numeric|date|timestamp, double, integral</td>
+      <td>datetime , double, integer</td>
       <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
     </tr>             
     <tr>

From 14d303ff4d296ab07c92e39ec6e9f411da232b7a Mon Sep 17 00:00:00 2001
From: Qianyang Yu <qyu@us.ibm.com>
Date: Wed, 8 Apr 2020 10:55:30 -0700
Subject: [PATCH 8/9] replace numeric type to concrete type

---
 docs/sql-ref-functions-builtin-aggregate.md | 28 ++++++++++-----------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/docs/sql-ref-functions-builtin-aggregate.md b/docs/sql-ref-functions-builtin-aggregate.md
index cef5ecdff535..5111cf48c869 100644
--- a/docs/sql-ref-functions-builtin-aggregate.md
+++ b/docs/sql-ref-functions-builtin-aggregate.md
@@ -20,7 +20,7 @@ license: |
 ---
 
 Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions
-operate on a group of rows and return a single value.
+operate on a group of rows and return a single aggregated value.
 
 <table class="table">
   <thead>
@@ -39,7 +39,7 @@ operate on a group of rows and return a single value.
     </tr>   
     <tr>
       <td><b>{avg | mean}</b>(<i>expression</i>)</td>
-      <td>numeric or string</td>
+      <td>short, float, byte, decimal, double, int, long or string</td>
       <td>Returns the average of values in the input expression.</td> 
     </tr>
     <tr>
@@ -79,7 +79,7 @@ operate on a group of rows and return a single value.
     </tr> 
     <tr>
       <td><b>count_min_sketch</b>(<i>expression, eps, confidence, seed</i>)</td>
-      <td>(integer or string or binary, double,  double, integer)</td>
+      <td>(byte, short, int, long, string or binary, double,  double, integer)</td>
       <td>`eps` and `confidence` are the double values between 0.0 and 1.0, `seed` is a positive integer. Returns a count-min sketch of a expression with the given `esp`, `confidence` and `seed`. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.</td>
     </tr>
     <tr>
@@ -109,52 +109,52 @@ operate on a group of rows and return a single value.
     </tr>      
     <tr>
       <td><b>max</b>(<i>expression</i>)</td>
-      <td>any numeric, string, datetime or arrays of these types</td>
+      <td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td>
       <td>Returns the maximum value of the expression.</td>
     </tr>          
     <tr>
       <td><b>max_by</b>(<i>expression1, expression2</i>)</td>
-      <td>any numeric, string, datetime or arrays of these types</td>
+      <td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td>
       <td>Returns the value of expression1 associated with the maximum value of expression2.</td>
     </tr>   
     <tr>
       <td><b>min</b>(<i>expression</i>)</td>
-      <td>any numeric, string, datetime or arrays of these types</td>
+      <td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td>
       <td>Returns the minimum value of the expression.</td>
     </tr>          
     <tr>
       <td><b>min_by</b>(<i>expression1, expression2</i>)</td>
-      <td>any numeric, string, datetime or arrays of these types</td>
+      <td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td>
       <td>Returns the value of expression1 associated with the minimum value of expression2.</td>
     </tr>      
     <tr>
       <td><b>percentile</b>(<i>expression, percentage [, frequency]</i>)</td>
-      <td>numeric, double, integer</td>
+      <td>short, float, byte, decimal, double, int, or long, double, int</td>
       <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value of numeric expression at the given percentage.</td>
     </tr>         
     <tr>
       <td><b>percentile</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
-      <td>numeric, double, integer</td>
+      <td>short, float, byte, decimal, double, int, or long, double, int</td>
       <td>Percentage array is an array of number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value array of numeric expression at the given percentage(s).</td>
     </tr>        
     <tr>
       <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td>
-      <td>numeric, double, integer</td>
+      <td>short, float, byte, decimal, double, int, or long, double, int</td>
       <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
     </tr>    
    <tr>
       <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td>
-      <td>datetime, double, integer</td>
+      <td>date or timestamp, double, int</td>
       <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
     </tr>                  
     <tr>
       <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
-      <td>numeric, double, integer</td>
+      <td>short, float, byte, decimal, double, int, or long, double, int</td>
       <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
     </tr>             
     <tr>
       <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
-      <td>datetime , double, integer</td>
+      <td>date or timestamp, double, int</td>
       <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
     </tr>             
     <tr>
@@ -174,7 +174,7 @@ operate on a group of rows and return a single value.
     </tr>
     <tr>
       <td><b>sum</b>(<i>expression</i>)</td>
-      <td>numeric</td>
+      <td>short, float, byte, decimal, double, int, or long</td>
       <td>Returns the sum calculated from values of a group.</td>
     </tr>       
     <tr>

From 9e283b4d7759009b596345b76c5eb0d86dd175d3 Mon Sep 17 00:00:00 2001
From: Qianyang Yu <qyu@us.ibm.com>
Date: Thu, 9 Apr 2020 13:49:14 -0700
Subject: [PATCH 9/9] adjust style

---
 docs/sql-ref-functions-builtin-aggregate.md | 688 ++++++++++----------
 1 file changed, 331 insertions(+), 357 deletions(-)

diff --git a/docs/sql-ref-functions-builtin-aggregate.md b/docs/sql-ref-functions-builtin-aggregate.md
index 5111cf48c869..a2be7577a617 100644
--- a/docs/sql-ref-functions-builtin-aggregate.md
+++ b/docs/sql-ref-functions-builtin-aggregate.md
@@ -19,6 +19,8 @@ license: |
   limitations under the License.
 ---
 
+### Description
+
 Spark SQL provides build-in aggregate functions defined in the dataset API and SQL interface. Aggregate functions
 operate on a group of rows and return a single aggregated value.
 
@@ -34,12 +36,12 @@ operate on a group of rows and return a single aggregated value.
     </tr>
     <tr>
       <td><b>approx_count_distinct</b>(<i>expression[, relativeSD]</i>)</td>
-      <td>(long, double)</td>
+      <td>(bigint[, double])</td>
       <td>`relativeSD` is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++.</td>
     </tr>   
     <tr>
       <td><b>{avg | mean}</b>(<i>expression</i>)</td>
-      <td>short, float, byte, decimal, double, int, long or string</td>
+      <td>tinyint|smallint|int|bigint|float|double|decimal|string</td>
       <td>Returns the average of values in the input expression.</td> 
     </tr>
     <tr>
@@ -65,21 +67,21 @@ operate on a group of rows and return a single aggregated value.
     <tr>
       <td><b>count</b>([<b>DISTINCT</b>] <i>*</i>)</td>
       <td>none</td>
-      <td>If specified <code>DISTINCT</code>, returns the total number of retrieved rows are unique and not null; Otherwise, returns the total number of retrieved rows, including rows containing null.</td>
+      <td>If <code>DISTINCT</code> is specified, returns the total number of retrieved rows that are unique and not null; otherwise, returns the total number of retrieved rows, including rows containing null.</td>
     </tr>
     <tr>
       <td><b>count</b>([<b>DISTINCT</b>] <i>expression1[, expression2</i>])</td>
-      <td>(any, any)</td>
-      <td>If specified <code>DISTINCT</code>, returns the number of rows for which the supplied expression(s) are unique and not null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null.</td>
+      <td>(any[, any])</td>
+      <td>If <code>DISTINCT</code> is specified, returns the number of rows for which the supplied expression(s) are unique and not null; otherwise, returns the number of rows for which the supplied expression(s) are all not null.</td>
     </tr>
     <tr>
       <td><b>count_if</b>(<i>predicate</i>)</td>
-      <td>expression that will be used for aggregation calculation</td>
+      <td>expression that returns a boolean value</td>
       <td>Returns the count number from the predicate evaluate to `TRUE` values.</td>
     </tr> 
     <tr>
       <td><b>count_min_sketch</b>(<i>expression, eps, confidence, seed</i>)</td>
-      <td>(byte, short, int, long, string or binary, double,  double, integer)</td>
+      <td>(tinyint|int|bigint|smallint|string|binary, double, double, int)</td>
       <td>`eps` and `confidence` are the double values between 0.0 and 1.0, `seed` is a positive integer. Returns a count-min sketch of a expression with the given `esp`, `confidence` and `seed`. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.</td>
     </tr>
     <tr>
@@ -93,9 +95,9 @@ operate on a group of rows and return a single aggregated value.
       <td>Returns the sample covariance of a set of number pairs.</td>
     </tr>  
     <tr>
-      <td><b>{first | first_value}</b>(<i>expression[, isIgnoreNull]</i>)</td>
-      <td>(any, boolean)</td>
-      <td>Returns the first value of expression for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic.</td>
+      <td><b>{first | first_value}</b>(<i>expression[, `isIgnoreNull`]</i>)</td>
+      <td>(any[, boolean])</td>
+      <td>Returns the first value of expression for a group of rows. If isIgnoreNull is true, returns only non-null values; the default is false. This function is non-deterministic.</td>
     </tr>      
     <tr>
       <td><b>kurtosis</b>(<i>expression</i>)</td>
@@ -103,58 +105,58 @@ operate on a group of rows and return a single aggregated value.
       <td>Returns the kurtosis value calculated from values of a group.</td>
     </tr>    
     <tr>
-      <td><b>{last | last_value}</b>(<i>expression[, isIgnoreNull]</i>)</td>
-      <td>(any, boolean)</td>
-      <td>Returns the last value of expression for a group of rows. If <code>isIgnoreNull</code> is true, returns only non-null values, default is false. This function is non-deterministic.</td>
+      <td><b>{last | last_value}</b>(<i>expression[, `isIgnoreNull`]</i>)</td>
+      <td>(any[, boolean])</td>
+      <td>Returns the last value of expression for a group of rows. If isIgnoreNull is true, returns only non-null values; the default is false. This function is non-deterministic.</td>
     </tr>      
     <tr>
       <td><b>max</b>(<i>expression</i>)</td>
-      <td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td>
+      <td>tinyint|smallint|int|bigint|float|double|date|timestamp|string, or arrays of these types</td>
       <td>Returns the maximum value of the expression.</td>
     </tr>          
     <tr>
       <td><b>max_by</b>(<i>expression1, expression2</i>)</td>
-      <td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td>
+      <td>tinyint|smallint|int|bigint|float|double|date|timestamp|string, or arrays of these types</td>
       <td>Returns the value of expression1 associated with the maximum value of expression2.</td>
     </tr>   
     <tr>
       <td><b>min</b>(<i>expression</i>)</td>
-      <td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td>
+      <td>tinyint|smallint|int|bigint|float|double|date|timestamp|string, or arrays of these types</td>
       <td>Returns the minimum value of the expression.</td>
     </tr>          
     <tr>
       <td><b>min_by</b>(<i>expression1, expression2</i>)</td>
-      <td>short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types</td>
+      <td>tinyint|smallint|int|bigint|float|double|date|timestamp|string, or arrays of these types</td>
       <td>Returns the value of expression1 associated with the minimum value of expression2.</td>
     </tr>      
     <tr>
       <td><b>percentile</b>(<i>expression, percentage [, frequency]</i>)</td>
-      <td>short, float, byte, decimal, double, int, or long, double, int</td>
+      <td>(tinyint|smallint|int|bigint|float|double|decimal, double[, int])</td>
       <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value of numeric expression at the given percentage.</td>
     </tr>         
     <tr>
       <td><b>percentile</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
-      <td>short, float, byte, decimal, double, int, or long, double, int</td>
+      <td>(tinyint|smallint|int|bigint|float|double|decimal, array of double[, int])</td>
       <td>Percentage array is an array of number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value array of numeric expression at the given percentage(s).</td>
     </tr>        
     <tr>
       <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td>
-      <td>short, float, byte, decimal, double, int, or long, double, int</td>
+      <td>(tinyint|smallint|int|bigint|float|double|decimal, double[, int])</td>
       <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
     </tr>    
    <tr>
       <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, percentage [, frequency]</i>)</td>
-      <td>date or timestamp, double, int</td>
+      <td>(date|timestamp, double[, int])</td>
       <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
     </tr>                  
     <tr>
       <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
-      <td>short, float, byte, decimal, double, int, or long, double, int</td>
+      <td>(tinyint|smallint|int|bigint|float|double|decimal, array of double[, int])</td>
       <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
     </tr>             
     <tr>
       <td><b>{percentile_approx | percentile_approx}</b>(<i>expression, <b>array</b>(percentage1 [, percentage2]...) [, frequency]</i>)</td>
-      <td>date or timestamp, double, int</td>
+      <td>(date|timestamp, array of double[, int])</td>
       <td>`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.</td>
     </tr>             
     <tr>
@@ -174,7 +176,7 @@ operate on a group of rows and return a single aggregated value.
     </tr>
     <tr>
       <td><b>sum</b>(<i>expression</i>)</td>
-      <td>short, float, byte, decimal, double, int, or long</td>
+      <td>tinyint|smallint|int|bigint|float|double|decimal</td>
       <td>Returns the sum calculated from values of a group.</td>
     </tr>       
     <tr>
@@ -191,485 +193,457 @@ operate on a group of rows and return a single aggregated value.
 </table>
 
 ### Examples
+
 {% highlight sql %}
---base table 
+--A test table used in the following examples
 
 SELECT * FROM buildin_agg;
-+----+----+----+-----+----+
-|  c1|  c2|  c3|   c4|  c5|
-+----+----+----+-----+----+
-|   2|   3|agg4| true|true|
-|   1|   2|agg3|false|true|
-|   1|   1|agg1|false|true|
-|   4|   3|agg6|false|true|
-|   3|   3|agg5| true|true|
-|   1|   2|agg2|false|true|
-|   5|null|agg8|false|true|
-|null|   4|agg7|false|true|
-+----+----+----+-----+----+
+  +----+----+----+-----+----+
+  |  c1|  c2|  c3|   c4|  c5|
+  +----+----+----+-----+----+
+  |   2|   3|agg4| true|true|
+  |   1|   2|agg3|false|true|
+  |   1|   1|agg1|false|true|
+  |   4|   3|agg6|false|true|
+  |   3|   3|agg5| true|true|
+  |   1|   2|agg2|false|true|
+  |   5|null|agg8|false|true|
+  |null|   4|agg7|false|true|
+  +----+----+----+-----+----+
 
 -- any, some and bool_or
-
 SELECT ANY(c4) FROM buildin_agg;
-+-------+
-|any(c4)|
-+-------+
-|   true|
-+-------+
+  +-------+
+  |any(c4)|
+  +-------+
+  |   true|
+  +-------+
 
 SELECT SOME(c4) FROM buildin_agg;
-+-------+
-|any(c4)|
-+-------+
-|   true|
-+-------+
+  +-------+
+  |any(c4)|
+  +-------+
+  |   true|
+  +-------+
 
 SELECT BOOL_OR(c5) FROM buildin_agg;
-+-----------+
-|bool_or(c5)|
-+-----------+
-|       true|
-+-----------+
+  +-----------+
+  |bool_or(c5)|
+  +-----------+
+  |       true|
+  +-----------+
 
 -- approx_count_distinct
-
 SELECT APPROX_COUNT_DISTINCT(c1) FROM buildin_agg;
-+-------------------------+
-|approx_count_distinct(c1)|
-+-------------------------+
-|                        5|
-+-------------------------+
+  +-------------------------+
+  |approx_count_distinct(c1)|
+  +-------------------------+
+  |                        5|
+  +-------------------------+
 
 SELECT APPROX_COUNT_DISTINCT(c1,0.39) FROM buildin_agg;
-+-------------------------+
-|approx_count_distinct(c1)|
-+-------------------------+
-|                        6|
-+-------------------------+
+  +-------------------------+
+  |approx_count_distinct(c1)|
+  +-------------------------+
+  |                        6|
+  +-------------------------+
 
 -- avg and mean
-
 SELECT AVG(c1) FROM buildin_agg;
-+------------------+
-|           avg(c1)|
-+------------------+
-|2.4285714285714284|
-+------------------+
+  +------------------+
+  |           avg(c1)|
+  +------------------+
+  |2.4285714285714284|
+  +------------------+
 
 SELECT MEAN(c1) FROM buildin_agg;
-+------------------+
-|          mean(c1)|
-+------------------+
-|2.4285714285714284|
-+------------------+
+  +------------------+
+  |          mean(c1)|
+  +------------------+
+  |2.4285714285714284|
+  +------------------+
 
 -- bool_and and every
- 
 SELECT BOOL_AND(c4) FROM buildin_agg;
-+------------+
-|bool_and(c4)|
-+------------+
-|       false|
-+------------+
+  +------------+
+  |bool_and(c4)|
+  +------------+
+  |       false|
+  +------------+
 
 SELECT EVERY(c5) FROM buildin_agg;
-+------------+
-|bool_and(c5)|
-+------------+
-|        true|
-+------------+
+  +------------+
+  |bool_and(c5)|
+  +------------+
+  |        true|
+  +------------+
 
 --collect_list
-
 SELECT COLLECT_LIST(c2) FROM buildin_agg;
-+---------------------+
-|collect_list(c2)     |
-+---------------------+
-|[3, 2, 1, 3, 3, 2, 4]|
-+---------------------+
+  +---------------------+
+  |     collect_list(c2)|
+  +---------------------+
+  |[3, 2, 1, 3, 3, 2, 4]|
+  +---------------------+
 
 SELECT COLLECT_LIST(c4) FROM buildin_agg;
-+------------------------------------------------------+
-|collect_list(c4)                                      |
-+------------------------------------------------------+
-|[true, false, false, false, true, false, false, false]|
-+------------------------------------------------------+
+  +------------------------------------------------------+
+  |                                      collect_list(c4)|
+  +------------------------------------------------------+
+  |[true, false, false, false, true, false, false, false]|
+  +------------------------------------------------------+
 
 --collect_set
- 
 SELECT COLLECT_SET(c2) FROM buildin_agg;
-+---------------+
-|collect_set(c2)|
-+---------------+
-|[1, 2, 3, 4]   |
-+---------------+
+  +---------------+
+  |collect_set(c2)|
+  +---------------+
+  |   [1, 2, 3, 4]|
+  +---------------+
 
 SELECT COLLECT_SET(c3) FROM buildin_agg;
-+------------------------------------------------+
-|collect_set(c3)                                 |
-+------------------------------------------------+
-|[agg7, agg8, agg3, agg6, agg4, agg2, agg5, agg1]|
-+------------------------------------------------+
+  +------------------------------------------------+
+  |                                 collect_set(c3)|
+  +------------------------------------------------+
+  |[agg7, agg8, agg3, agg6, agg4, agg2, agg5, agg1]|
+  +------------------------------------------------+
 
 --corr
-
 SELECT CORR(c1, c2) FROM buildin_agg;
-+--------------------------------------------+
-|corr(CAST(c1 AS DOUBLE), CAST(c2 AS DOUBLE))|
-+--------------------------------------------+
-|                          0.7745966692414833|
-+--------------------------------------------+
+  +--------------------------------------------+
+  |corr(CAST(c1 AS DOUBLE), CAST(c2 AS DOUBLE))|
+  +--------------------------------------------+
+  |                          0.7745966692414833|
+  +--------------------------------------------+
 
 --count(*)
-
 SELECT COUNT(*) FROM buildin_agg;
-+--------+
-|count(1)|
-+--------+
-|       8|
-+--------+
+  +--------+
+  |count(1)|
+  +--------+
+  |       8|
+  +--------+
 
 --count
-
 SELECT COUNT(c2) FROM buildin_agg;
-+---------+
-|count(c2)|
-+---------+
-|        7|
-+---------+
+  +---------+
+  |count(c2)|
+  +---------+
+  |        7|
+  +---------+
 
 --count distinct
-
 SELECT COUNT(DISTINCT c1) FROM buildin_agg;
-+------------------+
-|count(DISTINCT c1)|
-+------------------+
-|                 5|
-+------------------+
+  +------------------+
+  |count(DISTINCT c1)|
+  +------------------+
+  |                 5|
+  +------------------+
 
 SELECT COUNT(DISTINCT c1, c2) FROM buildin_agg;
-+----------------------+
-|count(DISTINCT c1, c2)|
-+----------------------+
-|                     5|
-+----------------------+
+  +----------------------+
+  |count(DISTINCT c1, c2)|
+  +----------------------+
+  |                     5|
+  +----------------------+
 
 --count_if
-
 SELECT COUNT_IF(c1 IS NULL) from buildin_agg;
-+----------------------+
-|count_if((c1 IS NULL))|
-+----------------------+
-|                     1|
-+----------------------+
+  +----------------------+
+  |count_if((c1 IS NULL))|
+  +----------------------+
+  |                     1|
+  +----------------------+
 
 SELECT c1 FROM buildin_agg GROUP BY c1 HAVING COUNT_IF(c2 % 2 = 0);
-+----+
-|  c1|
-+----+
-|null|
-|   1|
-+----+
+  +----+
+  |  c1|
+  +----+
+  |null|
+  |   1|
+  +----+
 
 --count_min_sketch
-
 SELECT COUNT_MIN_SKETCH(c1, 0.9D, 0.2D, 3) FROM buildin_agg;
-+----------------------------------------------------------+
-|count_min_sketch(c1, 0.9, 0.2, 3)                         |
-+----------------------------------------------------------+
-|[00 00 00 01 00 00 00 00 00 00 00 07 00 00 00 01 00 00...]|
-+----------------------------------------------------------+
+  +----------------------------------------------------------+
+  |                         count_min_sketch(c1, 0.9, 0.2, 3)|
+  +----------------------------------------------------------+
+  |[00 00 00 01 00 00 00 00 00 00 00 07 00 00 00 01 00 00...]|
+  +----------------------------------------------------------+
 
 --covar_pop
-
 SELECT COVAR_POP(c1, c2) FROM buildin_agg;
-+-------------------------------------------------+
-|covar_pop(CAST(c1 AS DOUBLE), CAST(c2 AS DOUBLE))|
-+-------------------------------------------------+
-|                               0.6666666666666666|
-+-------------------------------------------------+
+  +-------------------------------------------------+
+  |covar_pop(CAST(c1 AS DOUBLE), CAST(c2 AS DOUBLE))|
+  +-------------------------------------------------+
+  |                               0.6666666666666666|
+  +-------------------------------------------------+
 
 --covar_samp
-
 SELECT COVAR_SAMP(c1, c2) FROM buildin_agg;
-+--------------------------------------------------+
-|covar_samp(CAST(c1 AS DOUBLE), CAST(c2 AS DOUBLE))|
-+--------------------------------------------------+
-|                                               0.8|
-+--------------------------------------------------+
+  +--------------------------------------------------+
+  |covar_samp(CAST(c1 AS DOUBLE), CAST(c2 AS DOUBLE))|
+  +--------------------------------------------------+
+  |                                               0.8|
+  +--------------------------------------------------+
 
 --first and first_value
-
 SELECT FIRST(c1) FROM buildin_agg;
-+----------------+
-|first(c1, false)|
-+----------------+
-|               2|
-+----------------+
+  +----------------+
+  |first(c1, false)|
+  +----------------+
+  |               2|
+  +----------------+
 
 SELECT FIRST(col) FROM VALUES (NULL), (5), (20) AS t(col);
-+-----------------+
-|first(col, false)|
-+-----------------+
-|             null|
-+-----------------+
+  +-----------------+
+  |first(col, false)|
+  +-----------------+
+  |             null|
+  +-----------------+
 
 SELECT FIRST(col, true) FROM VALUES (NULL), (5), (20) AS t(col);
-+----------------+
-|first(col, true)|
-+----------------+
-|               5|
-+----------------+
+  +----------------+
+  |first(col, true)|
+  +----------------+
+  |               5|
+  +----------------+
 
 SELECT FIRST_VALUE(col) FROM VALUES (NULL), (5), (20) AS t(col);
-+-----------------------+
-|first_value(col, false)|
-+-----------------------+
-|                   null|
-+-----------------------+
+  +-----------------------+
+  |first_value(col, false)|
+  +-----------------------+
+  |                   null|
+  +-----------------------+
 
 SELECT FIRST_VALUE(col, true) FROM VALUES (NULL), (5), (20) AS t(col);
-+----------------------+
-|first_value(col, true)|
-+----------------------+
-|                     5|
-+----------------------+
+  +----------------------+
+  |first_value(col, true)|
+  +----------------------+
+  |                     5|
+  +----------------------+
 
 --kurtosis
-
 SELECT KURTOSIS(c2) FROM buildin_agg;
-+----------------------------+
-|kurtosis(CAST(c2 AS DOUBLE))|
-+----------------------------+
-|         -0.7325000000000004|
-+----------------------------+
+  +----------------------------+
+  |kurtosis(CAST(c2 AS DOUBLE))|
+  +----------------------------+
+  |         -0.7325000000000004|
+  +----------------------------+
 
 SELECT KURTOSIS(col) FROM VALUES (-1000), (-100), (10), (20) AS t(col);
-+-----------------------------+
-|kurtosis(CAST(col AS DOUBLE))|
-+-----------------------------+
-|          -0.7014368047529627|
-+-----------------------------+
+  +-----------------------------+
+  |kurtosis(CAST(col AS DOUBLE))|
+  +-----------------------------+
+  |          -0.7014368047529627|
+  +-----------------------------+
 
 --last and last_value
-
 SELECT LAST(c1) FROM buildin_agg;
-+---------------+
-|last(c1, false)|
-+---------------+
-|           null|
-+---------------+
+  +---------------+
+  |last(c1, false)|
+  +---------------+
+  |           null|
+  +---------------+
 
 SELECT LAST(c1, true) FROM buildin_agg;
-+--------------+
-|last(c1, true)|
-+--------------+
-|             5|
-+--------------+
+  +--------------+
+  |last(c1, true)|
+  +--------------+
+  |             5|
+  +--------------+
 
 SELECT LAST_VALUE(c1) FROM buildin_agg;
-+---------------------+
-|last_value(c1, false)|
-+---------------------+
-|                 null|
-+---------------------+
+  +---------------------+
+  |last_value(c1, false)|
+  +---------------------+
+  |                 null|
+  +---------------------+
 
 SELECT LAST_VALUE(c1, true) FROM buildin_agg;
-+--------------------+
-|last_value(c1, true)|
-+--------------------+
-|                   5|
-+--------------------+
+  +--------------------+
+  |last_value(c1, true)|
+  +--------------------+
+  |                   5|
+  +--------------------+
 
 --max
-
 SELECT MAX(c2) FROM buildin_agg;
-+-------+
-|max(c2)|
-+-------+
-|      4|
-+-------+
+  +-------+
+  |max(c2)|
+  +-------+
+  |      4|
+  +-------+
 
 --max_by
-
 SELECT MAX_BY(c1, c3) FROM buildin_agg;
-+-------------+
-|maxby(c1, c3)|
-+-------------+
-|            5|
-+-------------+
+  +-------------+
+  |maxby(c1, c3)|
+  +-------------+
+  |            5|
+  +-------------+
 
 --min
-
 SELECT MIN(c1) FROM buildin_agg;
-+-------+
-|min(c1)|
-+-------+
-|      1|
-+-------+
+  +-------+
+  |min(c1)|
+  +-------+
+  |      1|
+  +-------+
 
 --min_by
-
 SELECT MIN_BY(c2, c3) FROM buildin_agg;
-+-------------+
-|minby(c2, c3)|
-+-------------+
-|            1|
-+-------------+
+  +-------------+
+  |minby(c2, c3)|
+  +-------------+
+  |            1|
+  +-------------+
 
 --percentile
-
 SELECT PERCENTILE(c1, 0.3) FROM buildin_agg;
-+--------------------------------------+
-|percentile(c1, CAST(0.3 AS DOUBLE), 1)|
-+--------------------------------------+
-|                                   1.0|
-+--------------------------------------+
+  +--------------------------------------+
+  |percentile(c1, CAST(0.3 AS DOUBLE), 1)|
+  +--------------------------------------+
+  |                                   1.0|
+  +--------------------------------------+
 
 SELECT PERCENTILE(c1, 0.3, 2) FROM buildin_agg;
-+--------------------------------------+
-|percentile(c1, CAST(0.3 AS DOUBLE), 2)|
-+--------------------------------------+
-|                                   1.0|
-+--------------------------------------+
+  +--------------------------------------+
+  |percentile(c1, CAST(0.3 AS DOUBLE), 2)|
+  +--------------------------------------+
+  |                                   1.0|
+  +--------------------------------------+
 
 SELECT PERCENTILE(c1, ARRAY(0.25, 0.75)) FROM buildin_agg;
-+------------------------------------+
-|percentile(c1, array(0.25, 0.75), 1)|
-+------------------------------------+
-|                          [1.0, 3.5]|
-+------------------------------------+
+  +------------------------------------+
+  |percentile(c1, array(0.25, 0.75), 1)|
+  +------------------------------------+
+  |                          [1.0, 3.5]|
+  +------------------------------------+
 
 SELECT PERCENTILE(c1, ARRAY(0.25, 0.75), 10) FROM buildin_agg;
-+-------------------------------------+
-|percentile(c1, array(0.25, 0.75), 10)|
-+-------------------------------------+
-|                           [1.0, 4.0]|
-+-------------------------------------+
+  +-------------------------------------+
+  |percentile(c1, array(0.25, 0.75), 10)|
+  +-------------------------------------+
+  |                           [1.0, 4.0]|
+  +-------------------------------------+
 
 --PERCENTILE_APPROX and APPROX_PERCENTILE
-
 SELECT PERCENTILE_APPROX(c1, 0.25, 100) FROM buildin_agg;
-+------------------------------------------------+
-|percentile_approx(c1, CAST(0.25 AS DOUBLE), 100)|
-+------------------------------------------------+
-|                                               1|
-+------------------------------------------------+
+  +------------------------------------------------+
+  |percentile_approx(c1, CAST(0.25 AS DOUBLE), 100)|
+  +------------------------------------------------+
+  |                                               1|
+  +------------------------------------------------+
 
 SELECT APPROX_PERCENTILE(c1, 0.25, 100) FROM buildin_agg;
-+------------------------------------------------+
-|approx_percentile(c1, CAST(0.25 AS DOUBLE), 100)|
-+------------------------------------------------+
-|                                               1|
-+------------------------------------------------+
+  +------------------------------------------------+
+  |approx_percentile(c1, CAST(0.25 AS DOUBLE), 100)|
+  +------------------------------------------------+
+  |                                               1|
+  +------------------------------------------------+
 
 SELECT PERCENTILE_APPROX(c1, ARRAY(0.25, 0.85), 100) FROM buildin_agg;
-+---------------------------------------------+
-|percentile_approx(c1, array(0.25, 0.85), 100)|
-+---------------------------------------------+
-|                                       [1, 4]|
-+---------------------------------------------+
+  +---------------------------------------------+
+  |percentile_approx(c1, array(0.25, 0.85), 100)|
+  +---------------------------------------------+
+  |                                       [1, 4]|
+  +---------------------------------------------+
 
 SELECT APPROX_PERCENTILE(c1, array(0.25, 0.85), 100) FROM buildin_agg;
-+---------------------------------------------+
-|approx_percentile(c1, array(0.25, 0.85), 100)|
-+---------------------------------------------+
-|                                       [1, 4]|
-+---------------------------------------------+
+  +---------------------------------------------+
+  |approx_percentile(c1, array(0.25, 0.85), 100)|
+  +---------------------------------------------+
+  |                                       [1, 4]|
+  +---------------------------------------------+
 
 --skewness
-
 SELECT SKEWNESS(c1) FROM buildin_agg;
-+----------------------------+
-|skewness(CAST(c1 AS DOUBLE))|
-+----------------------------+
-|          0.5200705032248686|
-+----------------------------+
+  +----------------------------+
+  |skewness(CAST(c1 AS DOUBLE))|
+  +----------------------------+
+  |          0.5200705032248686|
+  +----------------------------+
 
 SELECT SKEWNESS(col) FROM VALUES (-1000), (-100), (10), (20) AS t(col);
-+-----------------------------+
-|skewness(CAST(col AS DOUBLE))|
-+-----------------------------+
-|          -1.1135657469022011|
-+-----------------------------+
+  +-----------------------------+
+  |skewness(CAST(col AS DOUBLE))|
+  +-----------------------------+
+  |          -1.1135657469022011|
+  +-----------------------------+
 
 --stddev_samp, stddev and std
-
 SELECT STDDEV_SAMP(c1) FROM buildin_agg;
-+-------------------------------+
-|stddev_samp(CAST(c1 AS DOUBLE))|
-+-------------------------------+
-|              1.618347187425374|
-+-------------------------------+
+  +-------------------------------+
+  |stddev_samp(CAST(c1 AS DOUBLE))|
+  +-------------------------------+
+  |              1.618347187425374|
+  +-------------------------------+
 
 SELECT STDDEV(c1) FROM buildin_agg;
-+--------------------------+
-|stddev(CAST(c1 AS DOUBLE))|
-+--------------------------+
-|         1.618347187425374|
-+--------------------------+
+  +--------------------------+
+  |stddev(CAST(c1 AS DOUBLE))|
+  +--------------------------+
+  |         1.618347187425374|
+  +--------------------------+
 
 SELECT STD(c1) FROM buildin_agg;
-+-----------------------+
-|std(CAST(c1 AS DOUBLE))|
-+-----------------------+
-|      1.618347187425374|
-+-----------------------+
+  +-----------------------+
+  |std(CAST(c1 AS DOUBLE))|
+  +-----------------------+
+  |      1.618347187425374|
+  +-----------------------+
 
 --stddev_pop
-
 SELECT STDDEV_POP(c1) FROM buildin_agg;
-+------------------------------+
-|stddev_pop(CAST(c1 AS DOUBLE))|
-+------------------------------+
-|             1.498298354528788|
-+------------------------------+
+  +------------------------------+
+  |stddev_pop(CAST(c1 AS DOUBLE))|
+  +------------------------------+
+  |             1.498298354528788|
+  +------------------------------+
 
 --sum
-
 SELECT SUM(col) FROM VALUES (5), (10), (15) AS t(col);
-+--------+
-|sum(col)|
-+--------+
-|      30|
-+--------+
+  +--------+
+  |sum(col)|
+  +--------+
+  |      30|
+  +--------+
 
 SELECT SUM(c1) FROM buildin_agg;
-+-------+
-|sum(c1)|
-+-------+
-|     17|
-+-------+
+  +-------+
+  |sum(c1)|
+  +-------+
+  |     17|
+  +-------+
 
 SELECT SUM(col) FROM VALUES (NULL), (NULL) AS t(col);
-+--------+
-|sum(col)|
-+--------+
-|    null|
-+--------+
+  +--------+
+  |sum(col)|
+  +--------+
+  |    null|
+  +--------+
 
 --variance and var_samp
-
 SELECT VARIANCE(c1) FROM buildin_agg;
-+----------------------------+
-|variance(CAST(c1 AS DOUBLE))|
-+----------------------------+
-|           2.619047619047619|
-+----------------------------+
+  +----------------------------+
+  |variance(CAST(c1 AS DOUBLE))|
+  +----------------------------+
+  |           2.619047619047619|
+  +----------------------------+
 
 SELECT VAR_SAMP(c1) FROM buildin_agg;
-+----------------------------+
-|var_samp(CAST(c1 AS DOUBLE))|
-+----------------------------+
-|           2.619047619047619|
-+----------------------------+
+  +----------------------------+
+  |var_samp(CAST(c1 AS DOUBLE))|
+  +----------------------------+
+  |           2.619047619047619|
+  +----------------------------+
 
 --var_pop
-
 SELECT VAR_POP(c1) FROM buildin_agg;
-+---------------------------+
-|var_pop(CAST(c1 AS DOUBLE))|
-+---------------------------+
-|         2.2448979591836737|
-+---------------------------+
+  +---------------------------+
+  |var_pop(CAST(c1 AS DOUBLE))|
+  +---------------------------+
+  |         2.2448979591836737|
+  +---------------------------+
 {% endhighlight %}
\ No newline at end of file
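
The sample and population statistics in the examples above differ only in the divisor (n - 1 versus n). As a minimal sketch in plain Python (not Spark's implementation), the following recomputes them for the seven non-null `c1` values of `buildin_agg`. It assumes, as the example outputs suggest, that the null row is skipped; the variable names are illustrative only.

```python
import statistics

# Non-null c1 values from the buildin_agg test table
# (assumption: the aggregate functions skip the null row).
c1 = [2, 1, 1, 4, 3, 1, 5]

avg = statistics.mean(c1)             # corresponds to AVG / MEAN
var_samp = statistics.variance(c1)    # sample variance, divides by n - 1
var_pop = statistics.pvariance(c1)    # population variance, divides by n
stddev_samp = statistics.stdev(c1)    # square root of the sample variance
stddev_pop = statistics.pstdev(c1)    # square root of the population variance

print(avg, var_samp, var_pop, stddev_samp, stddev_pop)
```

The printed values should agree, up to floating-point rounding, with the `AVG`, `VAR_SAMP`, `VAR_POP`, `STDDEV_SAMP` and `STDDEV_POP` outputs shown in the examples.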

Function	Parameter(s)	Description
avg(e: Column)	Column name	Returns the average of values in the input column.
mean(e: Column)	Column name	Returns the average of values in the input column.
bool_and(e: Column) every(e: Column)	Column name	Returns true if all values are true
any(e: Column) some(e: Column) bool_or(e: Column)	Column name	Returns true if at least one value is true
approx_count_distinct(e: Column)	Column name	Returns the estimated cardinality by HyperLogLog++
corr(e1: Column, e2: Column)	Column name	Returns Pearson coefficient of correlation between a set of number pairs
count(*)	None	Returns the total number of retrieved rows, including rows containing null
count(e: Column[, e: Column])	Column name	Returns the number of rows for which the supplied column(s) are all not null
count(DISTINCT e: Column[, e: Column])	Column name	Returns the number of rows for which the supplied column(s) are unique and non-null
count_if(e: Column)	Column name	Returns the number of `TRUE` values for the column
covar_pop(e1: Column, e2: Column)	Column name	Returns the population covariance of a set of number pairs
covar_samp(e1: Column, e2: Column)	Column name	Returns the sample covariance of a set of number pairs
first(e: Column[, isIgnoreNull])	Column name[, True/False(default)]	Returns the first value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false.
first_value(e: Column[, isIgnoreNull])	Column name[, True/False(default)]	Returns the first value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false.
skewness(e: Column)	Column name	Returns the skewness value calculated from values of a group
kurtosis(e: Column)	Column name	Returns the kurtosis value calculated from values of a group
last(e: Column[, isIgnoreNull])	Column name[, True/False(default)]	Returns the last value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false.
last_value(e: Column[, isIgnoreNull])	Column name[, True/False(default)]	Returns the last value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false.
max(e: Column)	Column name	Returns the maximum value of the column.
max_by(e1: Column, e2: Column)	Column name	Returns the value of column e1 associated with the maximum value of column e2.
min(e: Column)	Column name	Returns the minimum value of the column.
min_by(e1: Column, e2: Column)	Column name	Returns the value of column e1 associated with the minimum value of column e2.
percentile(e: Column, percentage [, frequency])	Column name; percentage is a number between 0 and 1; frequency is a positive integer	Returns the exact percentile value of numeric column `col` at the given percentage.
percentile(e: Column, array(percentage1 [, percentage2]...) [, frequency])	Column name; percentage array is an array of numbers between 0 and 1; frequency is a positive integer	Returns the exact percentile value array of numeric column `col` at the given percentage(s).
percentile_approx(e: Column, percentage [, frequency])	Column name; percentage is a number between 0 and 1; frequency is a positive integer	Returns the approximate percentile value of numeric column `col` at the given percentage.
percentile_approx(e: Column, array(percentage1 [, percentage2]...) [, frequency])	Column name; percentage array is an array of numbers between 0 and 1; frequency is a positive integer	Returns the approximate percentile value array of numeric column `col` at the given percentage(s).
approx_percentile(e: Column, percentage [, frequency])	Column name; percentage is a number between 0 and 1; frequency is a positive integer	Returns the approximate percentile value of numeric column `col` at the given percentage.
approx_percentile(e: Column, array(percentage1 [, percentage2]...) [, frequency])	Column name; percentage array is an array of numbers between 0 and 1; frequency is a positive integer	Returns the approximate percentile value array of numeric column `col` at the given percentage(s).
stddev_samp(e: Column)	Column name	Returns the sample standard deviation calculated from values of a group
stddev(e: Column)	Column name	Returns the sample standard deviation calculated from values of a group
std(e: Column)	Column name	Returns the sample standard deviation calculated from values of a group
stddev_pop(e: Column)	Column name	Returns the population standard deviation calculated from values of a group
(variance \| var_samp)(e: Column)	Column name	Returns the sample variance calculated from values of a group
sum(e: Column)	Column name	Returns the sum calculated from values of a group.
var_pop(e: Column)	Column name	Returns the population variance calculated from values of a group
collect_list(e: Column)	Column name	Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle.
collect_set(e: Column)	Column name	Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows, which may be non-deterministic after a shuffle.
count_min_sketch(e: Column, eps: double, confidence: double, seed: integer)	Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer	Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
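
The `percentile` rows above describe an exact percentile with an optional `frequency`. The following is a minimal sketch in plain Python, not Spark's implementation. It assumes linear interpolation between the two closest ranks of the sorted values, assumes nulls are skipped, and treats `frequency` as a constant repeat count per row. The function name `exact_percentile` is hypothetical. Under these assumptions it reproduces the `PERCENTILE` outputs shown in the examples for column `c1`.

```python
import math

def exact_percentile(values, percentage, frequency=1):
    """Exact percentile with linear interpolation (illustrative sketch only)."""
    # Skip nulls and repeat every remaining value `frequency` times.
    data = sorted(v for v in values if v is not None for _ in range(frequency))
    if not data:
        return None
    # Fractional rank of the requested percentile in the sorted data.
    position = percentage * (len(data) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return float(data[lower])
    # Weight the two neighbouring values by their distance to the rank.
    return (upper - position) * data[lower] + (position - lower) * data[upper]

# c1 column of the buildin_agg test table, including its null row.
c1 = [2, 1, 1, 4, 3, 1, 5, None]

print([exact_percentile(c1, p) for p in (0.25, 0.75)])
print([exact_percentile(c1, p, frequency=10) for p in (0.25, 0.75)])
```

With a frequency of 10 every value is counted ten times, which shifts the 0.75 rank onto the run of 4s; this is why the example output changes from 3.5 to 4.0.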
Function	Parameters(s)	Description
Function	Parameters	Description
avg(e: Column)	Column name	Returns the average of values in the input column.	{avg \| mean}(e: Column)	Column name	Returns the average of values in the input column.
mean(e: Column)	Column name	Returns the average of values in the input column.
bool_and(e: Column) every(e: Column)	Column name	Returns true if all values are true	{bool_and \| every}(e: Column)	Column name	Returns true if all values are true
any(e: Column) some(e: Column) bool_or(e: Column)	Column name	Returns true if at least one value is true	{any \| some \| bool_or}(e: Column)	Column name	Returns true if at least one value is true
approx_count_distinct(e: Column)	Column name	Returns the estimated cardinality by HyperLogLog++	approx_count_distinct(e: Column)	Column name	Returns the estimated cardinality by HyperLogLog++
corr(e1: Column, e2: Column)	Column name	Returns Pearson coefficient of correlation between a set of number pairs	corr(e1: Column, e2: Column)	Column name	Returns Pearson coefficient of correlation between a set of number pairs
count(*)	None	Returns the total number of retrieved rows, including rows containing null	count(*)	None	Returns the total number of retrieved rows, including rows containing null
count(e: Column[, e: Column])	Column name	Returns the number of rows for which the supplied column(s) are all not null	count(e: Column[, e: Column])	Column name	Returns the number of rows for which the supplied column(s) are all not null
count(DISTINCT e: Column[, e: Column])	Column name	Returns the number of rows for which the supplied column(s) are unique and non-null	count(DISTINCT e: Column[, e: Column])	Column name	Returns the number of rows for which the supplied column(s) are unique and not null
count_if(e: Column)	Column name	Returns the number of `TRUE` values for the column	count_if(Predicate)	Expression that will be used for aggregation calculation	Returns the count number from the predicate evaluate to `TRUE` values
covar_pop(e1: Column, e2: Column)	Column name	Returns the population covariance of a set of number pairs	covar_pop(e1: Column, e2: Column)	Column name	Returns the population covariance of a set of number pairs
covar_samp(e1: Column, e2: Column)	Column name	Returns the sample covariance of a set of number pairs	covar_samp(e1: Column, e2: Column)	Column name	Returns the sample covariance of a set of number pairs
first(e: Column[, isIgnoreNull])	Column name[, True/False(default)]	Returns the first value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false.	{first \| first_value}(e: Column[, isIgnoreNull])	Column name[, True/False(default)]	Returns the first value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic
first_value(e: Column[, isIgnoreNull])	Column name[, True/False(default)]	Returns the first value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false.
skewness(e: Column)	Column name	Returns the skewness value calculated from values of a group	skewness(e: Column)	Column name	Returns the skewness value calculated from values of a group
kurtosis(e: Column)	Column name	Returns the kurtosis value calculated from values of a group	kurtosis(e: Column)	Column name	Returns the kurtosis value calculated from values of a group
last(e: Column[, isIgnoreNull])	Column name[, True/False(default)]	Returns the last value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false.	{last \| last_value}(e: Column[, isIgnoreNull])	Column name[, True/False(default)]	Returns the last value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic
last_value(e: Column[, isIgnoreNull])	Column name[, True/False(default)]	Returns the last value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false.
max(e: Column)	Column name	Returns the maximum value of the column.	max(e: Column)	Column name	Returns the maximum value of the column.
max_by(e1: Column, e2: Column)	Column name	Returns the value of column e1 associated with the maximum value of column e2.	max_by(e1: Column, e2: Column)	Column name	Returns the value of column e1 associated with the maximum value of column e2.
min(e: Column)	Column name	Returns the minimum value of the column.	min(e: Column)	Column name	Returns the minimum value of the column.
min_by(e1: Column, e2: Column)	Column name	Returns the value of column e1 associated with the minimum value of column e2.	min_by(e1: Column, e2: Column)	Column name	Returns the value of column e1 associated with the minimum value of column e2.
percentile(e: Column, percentage [, frequency])	Column name; percentage is a number between 0 and 1; frequency is a positive integer	Returns the exact percentile value of numeric column `col` at the given percentage.	percentile(e: Column, percentage [, frequency])	Column name; percentage is a number between 0 and 1; frequency is a positive integer	Returns the exact percentile value of numeric column at the given percentage.
percentile(e: Column, array(percentage1 [, percentage2]...) [, frequency])	Column name; percentage array is an array of number between 0 and 1; frequency is a positive integer	Returns the exact percentile value array of numeric column `col` at the given percentage(s).	percentile(e: Column, array(percentage1 [, percentage2]...) [, frequency])	Column name; percentage array is an array of number between 0 and 1; frequency is a positive integer	Returns the exact percentile value array of numeric column at the given percentage(s).
percentile_approx(e: Column, percentage [, frequency])	Column name; percentage is a number between 0 and 1; frequency is a positive integer	Returns the approximate percentile value of numeric column `col` at the given percentage.	{percentile_approx \| approx_percentile}(e: Column, percentage [, frequency])	Column name; percentage is a number between 0 and 1; frequency is a positive integer	Returns the approximate percentile value of numeric column at the given percentage.
percentile_approx(e: Column, array(percentage1 [, percentage2]...) [, frequency])	Column name; percentage array is an array of number between 0 and 1; frequency is a positive integer	Returns the approximate percentile value array of numeric column `col` at the given percentage(s).
{percentile_approx \| approx_percentile}(e: Column, array(percentage1 [, percentage2]...) [, frequency])	Column name; percentage is a number between 0 and 1; frequency is a positive integer	Returns the approximate percentile value of numeric column at the given percentage.
approx_percentile(e: Column, percentage [, frequency])	Column name; percentage is a number between 0 and 1; frequency is a positive integer	Returns the approximate percentile value of numeric column `col` at the given percentage.
approx_percentile(e: Column, array(percentage1 [, percentage2]...) [, frequency])	Column name; percentage array is an array of number between 0 and 1; frequency is a positive integer	Returns the approximate percentile value array of numeric column `col` at the given percentage(s).
stddev_samp(e: Column)	Column name	Returns the sample standard deviation calculated from values of a group
stddev(e: Column)	Column name	Returns the sample standard deviation calculated from values of a group	{stddev_samp \| stddev \| std}(e: Column)	Column name	Returns the sample standard deviation calculated from values of a group
std(e: Column)	Column name	Returns the sample standard deviation calculated from values of a group
stddev_pop(e: Column)	Column name	Returns the population standard deviation calculated from values of a group	stddev_pop(e: Column)	Column name	Returns the population standard deviation calculated from values of a group
stddev_samp(e: Column)	Column name	Returns the sum calculated from values of a group	{variance \| var_samp}(e: Column)	Column name	Returns the sample variance calculated from values of a group
(variance \| var_samp)(e: Column)	Column name	Returns the sample variance calculated from values of a group
sum(e: Column)	Column name	Returns the sum calculated from values of a group.	sum(e: Column)	Column name	Returns the sum calculated from values of a group.
var_pop(e: Column)	Column name	Returns the population variance calculated from values of a group	var_pop(e: Column)	Column name	Returns the population variance calculated from values of a group
collect_list(e: Column)	Column name	Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle	collect_list(e: Column)	Column name	Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle
collect_set(e: Column)	Column name	Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.	collect_set(e: Column)	Column name	Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.
count_min_sketch(e: Column, eps: double, confidence: double, seed: integer)	Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer	Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.	count_min_sketch(e: Column, eps: double, confidence: double, seed: integer)	Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer	Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
Function	Parameters	Description
Function	Parameter Type(s)	Description
{any \| some \| bool_or}(c: Column)	Column name	Returns true if at least one value is true	{any \| some \| bool_or}(expression)	boolean	Returns true if at least one value is true.
approx_count_distinct(c: Column[, relativeSD: Double])	Column name; relativeSD: the maximum estimation error allowed.	Returns the estimated cardinality by HyperLogLog++	approx_count_distinct(expression[, relativeSD])	(long, double)	RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++.
{avg \| mean}(c: Column)	Column name	Returns the average of values in the input column.	{avg \| mean}(expression)	numeric or string	Returns the average of values in the input expression.
{bool_and \| every}(c: Column)	Column name	Returns true if all values are true	{bool_and \| every}(expression)	boolean	Returns true if all values are true.
collect_list(c: Column)	Column name	Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle	collect_list(expression)	any	Collects and returns a list of non-unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.
collect_set(c: Column)	Column name	collect_set(expression)	any	Collects and returns a set of unique elements. The function is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.
corr(c1: Column, c2: Column)	Column name	Returns Pearson coefficient of correlation between a set of number pairs	corr(expression1, expression2)	double, double	Returns Pearson coefficient of correlation between a set of number pairs.
count(*)	None	Returns the total number of retrieved rows, including rows containing null	count([DISTINCT] {* \| expression1[, expression2]})	none; any	If specified `DISTINCT`, returns the number of rows for which the supplied expression(s) are unique and not null; If specified `*`, returns the total number of retrieved rows, including rows containing null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null.
count(c: Column[, c: Column])	Column name	Returns the number of rows for which the supplied column(s) are all not null
count(DISTINCT c: Column[, c: Column])	Column name	Returns the number of rows for which the supplied column(s) are unique and not null
count_if(Predicate)	Expression that will be used for aggregation calculation	Returns the count number from the predicate evaluate to `TRUE` values	count_if(predicate)	expression that will be used for aggregation calculation	Returns the count number from the predicate evaluate to `TRUE` values.
count_min_sketch(c: Column, eps: double, confidence: double, seed integer)	Column name; eps is a value between 0.0 and 1.0; confidence is a value between 0.0 and 1.0; seed is a positive integer	Returns a count-min sketch of a column with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space..	count_min_sketch(expression, eps, confidence, seed)	integral or string or binary, double, double, integer	Eps and confidence are the double values between 0.0 and 1.0, seed is a positive integer. Returns a count-min sketch of a expression with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
covar_pop(c1: Column, c2: Column)	Column name	Returns the population covariance of a set of number pairs	covar_pop(expression1, expression2)	double, double	Returns the population covariance of a set of number pairs.
covar_samp(c1: Column, c2: Column)	Column name	Returns the sample covariance of a set of number pairs	covar_samp(expression1, expression2)	double	Returns the sample covariance of a set of number pairs.
{first \| first_value}(c: Column[, isIgnoreNull])	Column name[, True/False(default)]	Returns the first value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic	{first \| first_value}(expression[, isIgnoreNull])	any, boolean	Returns the first value of expression for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic.
kurtosis(c: Column)	Column name	Returns the kurtosis value calculated from values of a group	kurtosis(expression)	double	Returns the kurtosis value calculated from values of a group.
{last \| last_value}(c: Column[, isIgnoreNull])	Column name[, True/False(default)]	Returns the last value of column for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic	{last \| last_value}(expression[, isIgnoreNull])	any, boolean	Returns the last value of expression for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic.
max(c: Column)	Column name	Returns the maximum value of the column.	max(expression)	any numeric, string, date/time or arrays of these types	Returns the maximum value of the expression.
max_by(c1: Column, c2: Column)	Column name	Returns the value of column c1 associated with the maximum value of column c2.	max_by(expression1, expression2)	any numeric, string, date/time or arrays of these types	Returns the value of expression1 associated with the maximum value of expression2.
min(c: Column)	Column name	Returns the minimum value of the column.	min(expression)	any numeric, string, date/time or arrays of these types	Returns the minimum value of the expression.
min_by(c1: Column, c2: Column)	Column name	Returns the value of column c1 associated with the minimum value of column c2.	min_by(expression1, expression2)	any numeric, string, date/time or arrays of these types	Returns the value of expression1 associated with the minimum value of expression2.
percentile(c: Column, percentage [, frequency])	Column name; percentage is a number between 0 and 1; frequency is a positive integer	Returns the exact percentile value of numeric column at the given percentage.	percentile(expression, percentage [, frequency])	numeric Type, double, integral type	Percentage is a number between 0 and 1; Frequency is a positive integer. Returns the exact percentile value of numeric expression at the given percentage.
percentile(c: Column, array(percentage1 [, percentage2]...) [, frequency])	Column name; percentage array is an array of number between 0 and 1; frequency is a positive integer	Returns the exact percentile value array of numeric column at the given percentage(s).	percentile(expression, array(percentage1 [, percentage2]...) [, frequency])	numeric type; double; integral type	Percentage array is an array of number between 0 and 1; frequency is a positive integer. Returns the exact percentile value array of numeric expression at the given percentage(s).
{percentile_approx \| percentile_approx}(c: Column, percentage [, frequency])	Column name; percentage is a number between 0 and 1; frequency is a positive integer	Returns the approximate percentile value of numeric column at the given percentage.	{percentile_approx \| percentile_approx}(expression, percentage [, frequency])	numeric, date, timestamp; double; integral	Percentage is a number between 0 and 1; Frequency is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.
{percentile_approx \| percentile_approx}(c: Column, array(percentage1 [, percentage2]...) [, frequency])	Column name; percentage is a number between 0 and 1; frequency is a positive integer	Returns the approximate percentile value of numeric column at the given percentage.	{percentile_approx \| percentile_approx}(expression, array(percentage1 [, percentage2]...) [, frequency])	numeric, date, timestamp; double; integral	Percentage is a number between 0 and 1; Frequency is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.
skewness(c: Column)	Column name	Returns the skewness value calculated from values of a group	skewness(expression)	double	Returns the skewness value calculated from values of a group.
{stddev_samp \| stddev \| std}(c: Column)	Column name	Returns the sample standard deviation calculated from values of a group	{stddev_samp \| stddev \| std}(expression)	double	Returns the sample standard deviation calculated from values of a group.
stddev_pop(c: Column)	Column name	Returns the population standard deviation calculated from values of a group	stddev_pop(expression)	double	Returns the population standard deviation calculated from values of a group.
sum(c: Column)	Column name	sum(expression)	numeric	Returns the sum calculated from values of a group.
{variance \| var_samp}(c: Column)	Column name	Returns the sample variance calculated from values of a group	{variance \| var_samp}(expression)	double	Returns the sample variance calculated from values of a group.
var_pop(c: Column)	Column name	Returns the population variance calculated from values of a group	var_pop(expression)	double	Returns the population variance calculated from values of a group.
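The sample and population variants listed above (`stddev_samp` vs. `stddev_pop`, `variance`/`var_samp` vs. `var_pop`) differ only in the divisor: n - 1 for the sample versions, n for the population versions. The following is a minimal sketch in plain Python using the standard `statistics` module, not Spark itself; the input values are made up for illustration.

```python
import statistics

# Hypothetical values of one group (not taken from the doc).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Sample variants divide the squared deviations by n - 1.
var_samp = statistics.variance(values)
stddev_samp = statistics.stdev(values)

# Population variants divide by n.
var_pop = statistics.pvariance(values)    # 4.0 for these values
stddev_pop = statistics.pstdev(values)    # 2.0 for these values

print(var_samp, stddev_samp, var_pop, stddev_pop)
```

For small groups the two variants can differ noticeably; as the group grows, the results converge.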
Function	Parameter Type(s)	Description
Function	Argument Type(s)	Description
approx_count_distinct(expression[, relativeSD])	(long, double)	RelativeSD is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++.	`relativeSD` is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++.
{avg \| mean}(expression)
corr(expression1, expression2)	double, double	(double, double)	Returns Pearson coefficient of correlation between a set of number pairs.
count([DISTINCT] {* \| expression1[, expression2]})	none; any	none, any, any	If specified `DISTINCT`, returns the number of rows for which the supplied expression(s) are unique and not null; If specified `*`, returns the total number of retrieved rows, including rows containing null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null.
count_min_sketch(expression, eps, confidence, seed)	integral or string or binary, double, double, integer	Eps and confidence are the double values between 0.0 and 1.0, seed is a positive integer. Returns a count-min sketch of a expression with the given esp, confidence and seed. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.	`eps` and `confidence` are the double values between 0.0 and 1.0, `seed` is a positive integer. Returns a count-min sketch of a expression with the given `esp`, `confidence` and `seed`. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
covar_pop(expression1, expression2)	double, double	(double, double)	Returns the population covariance of a set of number pairs.
covar_samp(expression1, expression2)	double	(double, double)	Returns the sample covariance of a set of number pairs.
{first \| first_value}(expression[, isIgnoreNull])	any, boolean	(any, boolean)	Returns the first value of expression for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic.
{last \| last_value}(expression[, isIgnoreNull])	any, boolean	(any, boolean)	Returns the last value of expression for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic.
percentile(expression, percentage [, frequency])	numeric Type, double, integral type	Percentage is a number between 0 and 1; Frequency is a positive integer. Returns the exact percentile value of numeric expression at the given percentage.	numeric type, double, integral type	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value of numeric expression at the given percentage.
percentile(expression, array(percentage1 [, percentage2]...) [, frequency])	numeric type; double; integral type	Percentage array is an array of number between 0 and 1; frequency is a positive integer. Returns the exact percentile value array of numeric expression at the given percentage(s).	Percentage array is an array of number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value array of numeric expression at the given percentage(s).
{percentile_approx \| percentile_approx}(expression, percentage [, frequency])	numeric, date, timestamp; double; integral	Percentage is a number between 0 and 1; Frequency is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.
{percentile_approx \| percentile_approx}(expression, array(percentage1 [, percentage2]...) [, frequency])	numeric, date, timestamp; double; integral	Percentage is a number between 0 and 1; Frequency is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.	numeric\|date\|timestamp, double, integral	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.
skewness(expression)	Returns the skewness value calculated from values of a group.
count([DISTINCT] {* \| expression1[, expression2]})	none, any, any	If specified `DISTINCT`, returns the number of rows for which the supplied expression(s) are unique and not null; If specified `*`, returns the total number of retrieved rows, including rows containing null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null.	count([DISTINCT] *)	none	If specified `DISTINCT`, returns the total number of retrieved rows are unique and not null; Otherwise, returns the total number of retrieved rows, including rows containing null.
count([DISTINCT] expression1[, expression2])	(any, any)	If specified `DISTINCT`, returns the number of rows for which the supplied expression(s) are unique and not null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null.
count_if(predicate)
count_min_sketch(expression, eps, confidence, seed)	integral or string or binary, double, double, integer	(integer or string or binary, double, double, integer)	`eps` and `confidence` are double values between 0.0 and 1.0; `seed` is a positive integer. Returns a count-min sketch of an expression with the given `eps`, `confidence` and `seed`. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
max(expression)	any numeric, string, date/time or arrays of these types	any numeric, string, datetime or arrays of these types	Returns the maximum value of the expression.
max_by(expression1, expression2)	any numeric, string, date/time or arrays of these types	any numeric, string, datetime or arrays of these types	Returns the value of expression1 associated with the maximum value of expression2.
min(expression)	any numeric, string, date/time or arrays of these types	any numeric, string, datetime or arrays of these types	Returns the minimum value of the expression.
min_by(expression1, expression2)	any numeric, string, date/time or arrays of these types	any numeric, string, datetime or arrays of these types	Returns the value of expression1 associated with the minimum value of expression2.
percentile(expression, percentage [, frequency])	numeric type, double, integral type	numeric, double, integer	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value of numeric expression at the given percentage.
percentile(expression, array(percentage1 [, percentage2]...) [, frequency])	numeric type; double; integral type	numeric, double, integer	Percentage array is an array of number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value array of numeric expression at the given percentage(s).
{percentile_approx \| percentile_approx}(expression, percentage [, frequency])	numeric, date, timestamp; double; integral	numeric, double, integer	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.
{percentile_approx \| percentile_approx}(expression, percentage [, frequency])	datetime, double, integer	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.
{percentile_approx \| percentile_approx}(expression, array(percentage1 [, percentage2]...) [, frequency])	numeric, double, integer	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.
{percentile_approx \| percentile_approx}(expression, array(percentage1 [, percentage2]...) [, frequency])	numeric\|date\|timestamp, double, integral	datetime , double, integer	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.
{avg \| mean}(expression)	numeric or string	short, float, byte, decimal, double, int, long or string	Returns the average of values in the input expression.
count_min_sketch(expression, eps, confidence, seed)	(integer or string or binary, double, double, integer)	(byte, short, int, long, string or binary, double, double, integer)	`eps` and `confidence` are double values between 0.0 and 1.0; `seed` is a positive integer. Returns a count-min sketch of an expression with the given `eps`, `confidence` and `seed`. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
max(expression)	any numeric, string, datetime or arrays of these types	short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types	Returns the maximum value of the expression.
max_by(expression1, expression2)	any numeric, string, datetime or arrays of these types	short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types	Returns the value of expression1 associated with the maximum value of expression2.
min(expression)	any numeric, string, datetime or arrays of these types	short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types	Returns the minimum value of the expression.
min_by(expression1, expression2)	any numeric, string, datetime or arrays of these types	short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types	Returns the value of expression1 associated with the minimum value of expression2.
percentile(expression, percentage [, frequency])	numeric, double, integer	short, float, byte, decimal, double, int, or long, double, int	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value of numeric expression at the given percentage.
percentile(expression, array(percentage1 [, percentage2]...) [, frequency])	numeric, double, integer	short, float, byte, decimal, double, int, or long, double, int	Percentage array is an array of number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value array of numeric expression at the given percentage(s).
{percentile_approx \| approx_percentile}(expression, percentage [, frequency])	numeric, double, integer	short, float, byte, decimal, double, int, or long, double, int	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.
{percentile_approx \| approx_percentile}(expression, percentage [, frequency])	datetime, double, integer	date or timestamp, double, int	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of the date or timestamp expression at the given percentage.
{percentile_approx \| approx_percentile}(expression, array(percentage1 [, percentage2]...) [, frequency])	numeric, double, integer	short, float, byte, decimal, double, int, or long, double, int	Each percentage in the array is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value array of numeric expression at the given percentage(s).
{percentile_approx \| approx_percentile}(expression, array(percentage1 [, percentage2]...) [, frequency])	datetime, double, integer	date or timestamp, double, int	Each percentage in the array is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value array of the date or timestamp expression at the given percentage(s).
sum(expression)	numeric	short, float, byte, decimal, double, int, or long	Returns the sum calculated from values of a group.
approx_count_distinct(expression[, relativeSD])	(long, double)	(bigint[, double])	`relativeSD` is the maximum estimation error allowed. Returns the estimated cardinality by HyperLogLog++.
{avg \| mean}(expression)	short, float, byte, decimal, double, int, long or string	tinyint\|smallint\|int\|bigint\|float\|double\|decimal\|string	Returns the average of values in the input expression.
count([DISTINCT] *)	none	If `DISTINCT` is specified, returns the number of retrieved rows that are unique and not null; Otherwise, returns the total number of retrieved rows, including rows containing null.	If `DISTINCT` is specified, returns the number of retrieved rows that are unique and not null; otherwise, returns the total number of retrieved rows, including rows containing null.
count([DISTINCT] expression1[, expression2])	(any, any)	If specified `DISTINCT`, returns the number of rows for which the supplied expression(s) are unique and not null; Otherwise, returns the number of rows for which the supplied expression(s) are all not null.	(any[, any])	If specified `DISTINCT`, returns the number of rows for which the supplied expression(s) are unique and not null; otherwise, returns the number of rows for which the supplied expression(s) are all not null.
count_if(predicate)	expression that will be used for aggregation calculation	expression that returns a boolean value	Returns the number of rows for which the predicate evaluates to `TRUE`.
count_min_sketch(expression, eps, confidence, seed)	(byte, short, int, long, string or binary, double, double, integer)	(tinyint\|int\|bigint\|smallint\|string\|binary, double, double, int)	`eps` and `confidence` are double values between 0.0 and 1.0; `seed` is a positive integer. Returns a count-min sketch of an expression with the given `eps`, `confidence` and `seed`. The result is an array of bytes, which can be deserialized to a `CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space.
Returns the sample covariance of a set of number pairs.
{first \| first_value}(expression[, isIgnoreNull])	(any, boolean)	Returns the first value of expression for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic.	{first \| first_value}(expression[, `isIgnoreNull`])	(any[, boolean])	Returns the first value of expression for a group of rows. If isIgnoreNull is true, returns only non-null values, default is false. This function is non-deterministic.
kurtosis(expression)	Returns the kurtosis value calculated from values of a group.
{last \| last_value}(expression[, isIgnoreNull])	(any, boolean)	Returns the last value of expression for a group of rows. If `isIgnoreNull` is true, returns only non-null values, default is false. This function is non-deterministic.	{last \| last_value}(expression[, `isIgnoreNull`])	(any[, boolean])	Returns the last value of expression for a group of rows. If isIgnoreNull is true, returns only non-null values, default is false. This function is non-deterministic.
max(expression)	short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types	tinyint\|short\|int\|bigint\|float\|double\|date\|timestamp\|string, or arrays of these types	Returns the maximum value of the expression.
max_by(expression1, expression2)	short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types	tinyint\|short\|int\|bigint\|float\|double\|date\|timestamp\|string, or arrays of these types	Returns the value of expression1 associated with the maximum value of expression2.
min(expression)	short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types	tinyint\|short\|int\|bigint\|float\|double\|date\|timestamp\|string, or arrays of these types	Returns the minimum value of the expression.
min_by(expression1, expression2)	short, float, byte, decimal, double, int, long, string, date, timestamp or arrays of these types	tinyint\|short\|int\|bigint\|float\|double\|date\|timestamp\|string, or arrays of these types	Returns the value of expression1 associated with the minimum value of expression2.
percentile(expression, percentage [, frequency])	short, float, byte, decimal, double, int, or long, double, int	(short\|float\|byte\|decimal\|double\|int\|bigint, double[, int])	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value of numeric expression at the given percentage.
percentile(expression, array(percentage1 [, percentage2]...) [, frequency])	short, float, byte, decimal, double, int, or long, double, int	(short\|float\|byte\|decimal\|double\|int\|bigint, array of double[, int])	Percentage array is an array of number between 0 and 1; `frequency` is a positive integer. Returns the exact percentile value array of numeric expression at the given percentage(s).
{percentile_approx \| approx_percentile}(expression, percentage [, frequency])	short, float, byte, decimal, double, int, or long, double, int	(short\|float\|byte\|decimal\|double\|int\|bigint, double[, int])	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of numeric expression at the given percentage.
{percentile_approx \| approx_percentile}(expression, percentage [, frequency])	date or timestamp, double, int	(date\|timestamp, double[, int])	`percentage` is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value of the date or timestamp expression at the given percentage.
{percentile_approx \| approx_percentile}(expression, array(percentage1 [, percentage2]...) [, frequency])	short, float, byte, decimal, double, int, or long, double, int	(short\|float\|byte\|decimal\|double\|int\|bigint, array of double[, int])	Each percentage in the array is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value array of numeric expression at the given percentage(s).
{percentile_approx \| approx_percentile}(expression, array(percentage1 [, percentage2]...) [, frequency])	date or timestamp, double, int	(date\|timestamp, array of double[, int])	Each percentage in the array is a number between 0 and 1; `frequency` is a positive integer. Returns the approximate percentile value array of the date or timestamp expression at the given percentage(s).
sum(expression)	short, float, byte, decimal, double, int, or long	tinyint\|smallint\|int\|bigint\|float\|double\|decimal	Returns the sum calculated from values of a group.
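The `count` and `count_if` rows above differ in how they treat nulls and duplicates. The following is a minimal sketch of those semantics in plain Python, not Spark; the rows, the helper names, and the use of `None` for SQL `NULL` are all assumptions made for illustration.

```python
# Hypothetical rows of (name, age); None stands in for SQL NULL.
rows = [("Ann", 30), ("Bob", None), ("Ann", 30), (None, 25), ("Cid", 41)]

def count_star(rows):
    # count(*): every retrieved row, including rows containing null.
    return len(rows)

def count_exprs(rows):
    # count(expression1, expression2): rows where all expressions are not null.
    return sum(1 for r in rows if all(v is not None for v in r))

def count_distinct(rows):
    # count(DISTINCT expression1, expression2): unique rows among the non-null ones.
    return len({r for r in rows if all(v is not None for v in r)})

def count_if(rows, predicate):
    # count_if(predicate): rows for which the predicate evaluates to TRUE.
    return sum(1 for r in rows if predicate(r) is True)

def older_than_28(row):
    # A comparison against null is not TRUE, so such rows are not counted.
    return row[1] is not None and row[1] > 28

print(count_star(rows))               # 5
print(count_exprs(rows))              # 3
print(count_distinct(rows))           # 2
print(count_if(rows, older_than_28))  # 3
```

The duplicate `("Ann", 30)` row is counted twice by `count_exprs` but only once by `count_distinct`, and the two rows containing `None` are excluded from both.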