From 20f99c6697f1ac158f98815de9877afe5d9db297 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Thu, 1 Feb 2018 22:24:38 -0800
Subject: [PATCH 01/18] [SPARK-23313][DOC] Add a migration guide for ORC

---
 docs/sql-programming-guide.md | 71 +++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index a0e221b39cc34..dbf4102b5abe1 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1776,6 +1776,77 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.2 to 2.3

+  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files and Hive ORC tables. To do that, the following configurations are newly added or change their default values.
+
+    <table class="table">
+      <tr>
+        <th>
+          Property Name
+        </th>
+        <th>
+          Default
+        </th>
+        <th>
+          Meaning
+        </th>
+      </tr>
+      <tr>
+        <td>
+          spark.sql.orc.impl
+        </td>
+        <td>
+          native
+        </td>
+        <td>
+          The name of ORC implementation: 'native' means the native version of ORC support instead of the ORC library in Hive 1.2.1. It is 'hive' by default prior to Spark 2.3.
+        </td>
+      </tr>
+      <tr>
+        <td>
+          spark.sql.orc.enableVectorizedReader
+        </td>
+        <td>
+          true
+        </td>
+        <td>
+          Enables vectorized orc decoding in 'native' implementation. If 'false', a new non-vectorized ORC reader is used in 'native' implementation.
+        </td>
+      </tr>
+      <tr>
+        <td>
+          spark.sql.orc.columnarReaderBatchSize
+        </td>
+        <td>
+          4096
+        </td>
+        <td>
+          The number of rows to include in a orc vectorized reader batch. The number should be carefully chosen to minimize overhead and avoid OOMs in reading data.
+        </td>
+      </tr>
+      <tr>
+        <td>
+          spark.sql.orc.filterPushdown
+        </td>
+        <td>
+          true
+        </td>
+        <td>
+          Enable filter pushdown for ORC files. It is 'false' by default prior to Spark 2.3.
+        </td>
+      </tr>
+      <tr>
+        <td>
+          spark.sql.hive.convertMetastoreOrc
+        </td>
+        <td>
+          true
+        </td>
+        <td>
+          Enable the built-in ORC reader and writer to process Hive ORC tables, instead of Hive serde. It is 'false' by default prior to Spark 2.3.
+        </td>
+      </tr>
+    </table>
+
   - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
   - The `percentile_approx` function previously accepted numeric type input and output double type results. Now it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
   - Since Spark 2.3, the Join/Filter's deterministic predicates that are after the first non-deterministic predicates are also pushed down/through the child operators, if possible. In prior Spark versions, these filters are not eligible for predicate pushdown.

From 1bb23ef4f733ebcf43751c0bab11ef4405dad0b9 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Thu, 1 Feb 2018 22:33:03 -0800
Subject: [PATCH 02/18] fix.

---
 docs/sql-programming-guide.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index dbf4102b5abe1..59a2a06c20af8 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1831,7 +1831,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
           true
         </td>
         <td>
-          Enable filter pushdown for ORC files. It is 'false' by default prior to Spark 2.3.
+          Enables filter pushdown for ORC files. It is 'false' by default prior to Spark 2.3.
         </td>
@@ -1842,7 +1842,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
           true
         </td>
        <td>
-          Enable the built-in ORC reader and writer to process Hive ORC tables, instead of Hive serde. It is 'false' by default prior to Spark 2.3.
+          Enables the built-in ORC reader and writer to process Hive ORC tables, instead of Hive serde. It is 'false' by default prior to Spark 2.3.
         </td>

From df08899d74757a56429eab527b7589a9f15ed4fd Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Thu, 1 Feb 2018 22:37:18 -0800
Subject: [PATCH 03/18] Address comments

---
 docs/sql-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 59a2a06c20af8..401215f7c8068 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1798,7 +1798,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
           native
         </td>
         <td>
-          The name of ORC implementation: 'native' means the native version of ORC support instead of the ORC library in Hive 1.2.1. It is 'hive' by default prior to Spark 2.3.
+          The name of ORC implementation: 'native' means the native ORC support that is built on Apache ORC 1.4.1 instead of the ORC library in Hive 1.2.1. It is 'hive' by default prior to Spark 2.3.
         </td>
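The configurations introduced in PATCH 01 (and reworded in PATCH 02 and PATCH 03) are ordinary Spark SQL settings. Below is a minimal, illustrative Scala sketch of reading ORC data with them; the session options simply restate the Spark 2.3 defaults documented above, and the file path, schema, and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// The three settings restate the documented Spark 2.3 defaults, so they are
// shown here only to make the reader selection explicit.
val spark = SparkSession.builder()
  .appName("orc-native-reader-sketch")
  .master("local[*]")
  .config("spark.sql.orc.impl", "native")                  // Apache ORC 1.4.1 based implementation
  .config("spark.sql.orc.enableVectorizedReader", "true")  // vectorized decoding for the 'native' implementation
  .config("spark.sql.orc.filterPushdown", "true")          // push predicates down into the ORC reader
  .getOrCreate()

import spark.implicits._

// Hypothetical path and columns; the age predicate is the kind of filter that
// spark.sql.orc.filterPushdown can push into the file scan.
val people = spark.read.orc("/tmp/people.orc")
people.filter($"age" > 21).select("name", "age").show()
```

Because these are runtime SQL configurations, setting them with `spark.conf.set(...)` on an existing session should work as well.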
From 0aecd5d7511b0113a084336403b4caf21c4574a1 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Thu, 1 Feb 2018 22:44:38 -0800
Subject: [PATCH 04/18] Remove spark.sql.orc.columnarReaderBatchSize

---
 docs/sql-programming-guide.md | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 401215f7c8068..e9d9bd279bc0f 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1812,17 +1812,6 @@ working with timestamps in `pandas_udf`s to get the best performance, see
           Enables vectorized orc decoding in 'native' implementation. If 'false', a new non-vectorized ORC reader is used in 'native' implementation.
         </td>
       </tr>
-      <tr>
-        <td>
-          spark.sql.orc.columnarReaderBatchSize
-        </td>
-        <td>
-          4096
-        </td>
-        <td>
-          The number of rows to include in a orc vectorized reader batch. The number should be carefully chosen to minimize overhead and avoid OOMs in reading data.
-        </td>
-      </tr>
       <tr>
         <td>
           spark.sql.orc.filterPushdown

From 239714a6f79972ffb498970ef49c4207c7d26518 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Thu, 1 Feb 2018 22:58:14 -0800
Subject: [PATCH 05/18] Update

---
 docs/sql-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index e9d9bd279bc0f..a55431ead0a10 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1831,7 +1831,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
           true
         </td>
         <td>
-          Enables the built-in ORC reader and writer to process Hive ORC tables, instead of Hive serde. It is 'false' by default prior to Spark 2.3.
+          Enable Spark's ORC support instead of Hive SerDe when reading from and writing to Hive ORC tables. It is 'false' by default prior to Spark 2.3.
         </td>

From fc5b3953b1c738f684d306ea5a9c27f0b57481c4 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Thu, 1 Feb 2018 23:47:22 -0800
Subject: [PATCH 06/18] address comments.

---
 docs/sql-programming-guide.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index a55431ead0a10..8a98855b2e834 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1798,7 +1798,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
           native
         </td>
         <td>
-          The name of ORC implementation: 'native' means the native ORC support that is built on Apache ORC 1.4.1 instead of the ORC library in Hive 1.2.1. It is 'hive' by default prior to Spark 2.3.
+          The name of ORC implementation: `native` means the native ORC support that is built on Apache ORC 1.4.1 instead of the ORC library in Hive 1.2.1. It is `hive` by default prior to Spark 2.3.
         </td>
@@ -1809,7 +1809,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
           true
         </td>
         <td>
-          Enables vectorized orc decoding in 'native' implementation. If 'false', a new non-vectorized ORC reader is used in 'native' implementation.
+          Enables vectorized orc decoding in `native` implementation. If `false`, a new non-vectorized ORC reader is used in `native` implementation. For `hive` implementation, this is ignored.
         </td>
@@ -1820,7 +1820,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
           true
         </td>
         <td>
-          Enables filter pushdown for ORC files. It is 'false' by default prior to Spark 2.3.
+          Enables filter pushdown for ORC files. It is `false` by default prior to Spark 2.3.
         </td>
@@ -1831,7 +1831,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
           true
         </td>
         <td>
-          Enable Spark's ORC support instead of Hive SerDe when reading from and writing to Hive ORC tables. It is 'false' by default prior to Spark 2.3.
+          Enable Spark's ORC support instead of Hive SerDe when reading from and writing to Hive ORC tables. It is `false` by default prior to Spark 2.3.
         </td>

From 7b3b0a44f3a396a817b9e78c8a1c6549dbdf7d29 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Thu, 1 Feb 2018 23:55:58 -0800
Subject: [PATCH 07/18] Use

---
 docs/sql-programming-guide.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 8a98855b2e834..8c26ae22ca2a6 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1792,46 +1792,46 @@ working with timestamps in `pandas_udf`s to get the best performance, see
        <td>
-          spark.sql.orc.impl
+          <code>spark.sql.orc.impl</code>
         </td>
         <td>
-          native
+          <code>native</code>
         </td>
         <td>
-          The name of ORC implementation: `native` means the native ORC support that is built on Apache ORC 1.4.1 instead of the ORC library in Hive 1.2.1. It is `hive` by default prior to Spark 2.3.
+          The name of ORC implementation: <code>native</code> means the native ORC support that is built on Apache ORC 1.4.1 instead of the ORC library in Hive 1.2.1. It is <code>hive</code> by default prior to Spark 2.3.
         </td>
       </tr>
       <tr>
         <td>
-          spark.sql.orc.enableVectorizedReader
+          <code>spark.sql.orc.enableVectorizedReader</code>
         </td>
         <td>
-          true
+          <code>true</code>
         </td>
         <td>
-          Enables vectorized orc decoding in `native` implementation. If `false`, a new non-vectorized ORC reader is used in `native` implementation. For `hive` implementation, this is ignored.
+          Enables vectorized orc decoding in <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in <code>native</code> implementation. For <code>hive</code> implementation, this is ignored.
         </td>
       </tr>
       <tr>
         <td>
-          spark.sql.orc.filterPushdown
+          <code>spark.sql.orc.filterPushdown</code>
         </td>
         <td>
-          true
+          <code>true</code>
         </td>
         <td>
-          Enables filter pushdown for ORC files. It is `false` by default prior to Spark 2.3.
+          Enables filter pushdown for ORC files. It is <code>false</code> by default prior to Spark 2.3.
         </td>
       </tr>
       <tr>
         <td>
-          spark.sql.hive.convertMetastoreOrc
+          <code>spark.sql.hive.convertMetastoreOrc</code>
         </td>
         <td>
-          true
+          <code>true</code>
         </td>
         <td>
-          Enable Spark's ORC support instead of Hive SerDe when reading from and writing to Hive ORC tables. It is `false` by default prior to Spark 2.3.
+          Enable Spark's ORC support instead of Hive SerDe when reading from and writing to Hive ORC tables. It is <code>false</code> by default prior to Spark 2.3.
         </td>
       </tr>

From cb149f27aec6c1b31cdf00af33198bbbf4a93f31 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Fri, 2 Feb 2018 00:34:25 -0800
Subject: [PATCH 08/18] Split the table.

---
 docs/sql-programming-guide.md | 68 +++++++++++------------------
 1 file changed, 21 insertions(+), 47 deletions(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 8c26ae22ca2a6..d00213e56d2b4 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1778,61 +1778,35 @@ working with timestamps in `pandas_udf`s to get the best performance, see
   - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files and Hive ORC tables. To do that, the following configurations are newly added or change their default values.

+    - New configurations
+
+    <table class="table">
+      <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
-      <tr>
-        <th>
-          Property Name
-        </th>
-        <th>
-          Default
-        </th>
-        <th>
-          Meaning
-        </th>
-      </tr>
+      <tr><td><code>spark.sql.orc.impl</code></td><td><code>native</code></td><td>The name of ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4.1. `hive` means the ORC library in Hive 1.2.1 which is used prior to Spark 2.3.</td></tr>
-      <tr>
-        <td>
-          <code>spark.sql.orc.impl</code>
-        </td>
-        <td>
-          <code>native</code>
-        </td>
-        <td>
-          The name of ORC implementation: <code>native</code> means the native ORC support that is built on Apache ORC 1.4.1 instead of the ORC library in Hive 1.2.1. It is <code>hive</code> by default prior to Spark 2.3.
-        </td>
-      </tr>
-      <tr>
-        <td>
-          <code>spark.sql.orc.enableVectorizedReader</code>
-        </td>
-        <td>
-          <code>true</code>
-        </td>
-        <td>
-          Enables vectorized orc decoding in <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in <code>native</code> implementation. For <code>hive</code> implementation, this is ignored.
-        </td>
-      </tr>
+      <tr><td><code>spark.sql.orc.enableVectorizedReader</code></td><td><code>true</code></td><td>Enables vectorized orc decoding in <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in <code>native</code> implementation. For <code>hive</code> implementation, this is ignored.</td></tr>
+    </table>
+
+    - Changed configurations
+
+    <table class="table">
+      <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
-      <tr>
-        <td>
-          <code>spark.sql.orc.filterPushdown</code>
-        </td>
-        <td>
-          <code>true</code>
-        </td>
-        <td>
-          Enables filter pushdown for ORC files. It is <code>false</code> by default prior to Spark 2.3.
-        </td>
-      </tr>
+      <tr><td><code>spark.sql.orc.filterPushdown</code></td><td><code>true</code></td><td>Enables filter pushdown for ORC files. It is <code>false</code> by default prior to Spark 2.3.</td></tr>
-      <tr>
-        <td>
-          <code>spark.sql.hive.convertMetastoreOrc</code>
-        </td>
-        <td>
-          <code>true</code>
-        </td>
-        <td>
-          Enable Spark's ORC support instead of Hive SerDe when reading from and writing to Hive ORC tables. It is <code>false</code> by default prior to Spark 2.3.
-        </td>
-      </tr>
-    </table>
+      <tr><td><code>spark.sql.hive.convertMetastoreOrc</code></td><td><code>true</code></td><td>Enable the Spark's ORC support, which can be configured by <code>spark.sql.orc.impl</code>, instead of Hive SerDe when reading from and writing to Hive ORC tables. It is <code>false</code> by default prior to Spark 2.3.</td></tr>
+    </table>
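PATCH 08 splits the migration note into "new" and "changed" configurations. The changed setting with the widest impact is `spark.sql.hive.convertMetastoreOrc`, which now routes Hive ORC tables through the reader selected by `spark.sql.orc.impl`. A sketch under stated assumptions (a Hive-enabled Spark build, hypothetical table name) follows:

```scala
import org.apache.spark.sql.SparkSession

// Assumes a Spark distribution built with Hive support; the table name is hypothetical.
val spark = SparkSession.builder()
  .appName("hive-orc-conversion-sketch")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// Spark 2.3 defaults, set explicitly here: Hive ORC tables are processed by
// Spark's own ORC support instead of the Hive SerDe, with filter pushdown on.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

spark.sql("CREATE TABLE IF NOT EXISTS sales_orc (id BIGINT, amount DOUBLE) STORED AS ORC")
spark.sql("SELECT COUNT(*) FROM sales_orc WHERE amount > 100.0").show()
```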
From 436c0f437fb0085446323fb1f15a238d97d164ef Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Fri, 2 Feb 2018 14:50:52 -0800
Subject: [PATCH 09/18] Address comments

---
 docs/sql-programming-guide.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index d00213e56d2b4..ad72b0cb43439 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1810,6 +1810,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see

+  - Since Apache ORC 1.4.1 is a standalone library providing a subset of Hive ORC related configurations, you can use ORC configuration name and Hive configuration name. To see a full list of supported ORC configurations, see OrcConf.java.
+
   - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.

From 354a525144620fccd92e009607894b68a991ebf4 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Tue, 6 Feb 2018 01:31:55 -0800
Subject: [PATCH 10/18] Update link.

---
 docs/sql-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index ad72b0cb43439..1aaffdc92920b 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1810,7 +1810,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see
-  - Since Apache ORC 1.4.1 is a standalone library providing a subset of Hive ORC related configurations, you can use ORC configuration name and Hive configuration name. To see a full list of supported ORC configurations, see OrcConf.java.
+  - Since Apache ORC 1.4.1 is a standalone library providing a subset of Hive ORC related configurations, you can use ORC configuration name and Hive configuration name. To see a full list of supported ORC configurations, see Hive Configuration of Apache ORC project.

   - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
   - The `percentile_approx` function previously accepted numeric type input and output double type results. Now it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
   - Since Spark 2.3, the Join/Filter's deterministic predicates that are after the first non-deterministic predicates are also pushed down/through the child operators, if possible. In prior Spark versions, these filters are not eligible for predicate pushdown.

From d259d666b37066b0907d8ec365244dd0d9880d1f Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Tue, 6 Feb 2018 18:35:31 -0800
Subject: [PATCH 11/18] Add note for convertMetastoreXXX.

---
 docs/sql-programming-guide.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 1aaffdc92920b..49f788c57e41e 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1810,7 +1810,9 @@ working with timestamps in `pandas_udf`s to get the best performance, see
-  - Since Apache ORC 1.4.1 is a standalone library providing a subset of Hive ORC related configurations, you can use ORC configuration name and Hive configuration name. To see a full list of supported ORC configurations, see Hive Configuration of Apache ORC project.
+  - Since Apache ORC 1.4.1 is a standalone library providing a subset of Hive ORC related configurations, see Hive Configuration of Apache ORC project for a full list of supported ORC configurations.
+
+  - Note that `convertMetastoreOrc` works like `convertMetastoreParquet`. While converting Hive tables into Spark data source tables, Spark ignores table properties. For table-level storage properties, you can use `CREATE TABLE ... USING HIVE` syntax.

   - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.

From a693446ac8c5a13275128f7fbf23e305ccaa50f3 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Wed, 7 Feb 2018 12:42:37 -0800
Subject: [PATCH 12/18] Remove `spark.sql.hive.convertMetastoreOrc` and Hive ORC table stuff.

---
 docs/sql-programming-guide.md | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 49f788c57e41e..01d1095f41c95 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1776,7 +1776,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.2 to 2.3

-  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files and Hive ORC tables. To do that, the following configurations are newly added or change their default values.
+  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values.
     - New configurations

@@ -1803,15 +1803,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see
       <tr><td><code>spark.sql.orc.filterPushdown</code></td><td><code>true</code></td><td>Enables filter pushdown for ORC files. It is <code>false</code> by default prior to Spark 2.3.</td></tr>
-      <tr><td><code>spark.sql.hive.convertMetastoreOrc</code></td><td><code>true</code></td><td>Enable the Spark's ORC support, which can be configured by <code>spark.sql.orc.impl</code>, instead of Hive SerDe when reading from and writing to Hive ORC tables. It is <code>false</code> by default prior to Spark 2.3.</td></tr>
     </table>
-
-  - Since Apache ORC 1.4.1 is a standalone library providing a subset of Hive ORC related configurations, see Hive Configuration of Apache ORC project for a full list of supported ORC configurations.

   - Note that `convertMetastoreOrc` works like `convertMetastoreParquet`. While converting Hive tables into Spark data source tables, Spark ignores table properties. For table-level storage properties, you can use `CREATE TABLE ... USING HIVE` syntax.
   - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.

From 40c8e02a3b0ff5a9ebbdbba3188c7027fd2a9278 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Wed, 7 Feb 2018 12:44:49 -0800
Subject: [PATCH 13/18] remove more.

---
 docs/sql-programming-guide.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 01d1095f41c95..b8377525f90ff 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1805,8 +1805,6 @@ working with timestamps in `pandas_udf`s to get the best performance, see
-
-  - Note that `convertMetastoreOrc` works like `convertMetastoreParquet`. While converting Hive tables into Spark data source tables, Spark ignores table properties. For table-level storage properties, you can use `CREATE TABLE ... USING HIVE` syntax.

   - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
   - The `percentile_approx` function previously accepted numeric type input and output double type results. Now it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
   - Since Spark 2.3, the Join/Filter's deterministic predicates that are after the first non-deterministic predicates are also pushed down/through the child operators, if possible. In prior Spark versions, these filters are not eligible for predicate pushdown.

From 59e957a743f1882f32c9bdd359d4d4b40674ccfd Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Mon, 12 Feb 2018 11:34:20 -0800
Subject: [PATCH 14/18] Add `USING` syntax recommendation.

---
 docs/sql-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index b8377525f90ff..c8d0bf4d8db6d 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1776,7 +1776,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.2 to 2.3

-  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values.
+  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values. For creating ORC tables, `USING ORC` or `USING HIVE` syntaxes are recommended.

     - New configurations

From 6136d25114a95f2725fec2b551bf09eae573d665 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Mon, 12 Feb 2018 14:12:17 -0800
Subject: [PATCH 15/18] update

---
 docs/sql-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index c8d0bf4d8db6d..a48b84f65907f 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1776,7 +1776,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.2 to 2.3

-  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values. For creating ORC tables, `USING ORC` or `USING HIVE` syntaxes are recommended.
+  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values. For ORC tables, the vectorized reader will be used for the tables created by `USING ORC`.

     - New configurations

From f2bd2c8e9e8a101e4e527dda07a688c60964446b Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Mon, 12 Feb 2018 14:18:07 -0800
Subject: [PATCH 16/18] Add USING HIVE OPTIONS description, too.

---
 docs/sql-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index a48b84f65907f..cd7293ad6d676 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1776,7 +1776,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.2 to 2.3

-  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values. For ORC tables, the vectorized reader will be used for the tables created by `USING ORC`.
+  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values. For ORC tables, the vectorized reader will be used for the tables created by `USING ORC`. With `spark.sql.hive.convertMetastoreOrc`, it will for the tables created by `USING HIVE OPTIONS (fileFormat 'ORC')`, too.
     - New configurations

From 8ae87fc32f25ceabd5fc87f3d525ce34b887d27e Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Mon, 12 Feb 2018 14:20:37 -0800
Subject: [PATCH 17/18] fix

---
 docs/sql-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index cd7293ad6d676..e1d71668501d6 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1776,7 +1776,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.2 to 2.3

-  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values. For ORC tables, the vectorized reader will be used for the tables created by `USING ORC`. With `spark.sql.hive.convertMetastoreOrc`, it will for the tables created by `USING HIVE OPTIONS (fileFormat 'ORC')`, too.
+  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values. For ORC tables, the vectorized reader will be used for the tables created by `USING ORC`. With `spark.sql.hive.convertMetastoreOrc=true`, it will for the tables created by `USING HIVE OPTIONS (fileFormat 'ORC')`, too.

     - New configurations

From 6887d1935acff1b1eefde4ac2e291fb21b4731a1 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Mon, 12 Feb 2018 14:42:31 -0800
Subject: [PATCH 18/18] Update the description.

---
 docs/sql-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index e1d71668501d6..9e3992124c045 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1776,7 +1776,7 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 ## Upgrading From Spark SQL 2.2 to 2.3

-  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values. For ORC tables, the vectorized reader will be used for the tables created by `USING ORC`. With `spark.sql.hive.convertMetastoreOrc=true`, it will for the tables created by `USING HIVE OPTIONS (fileFormat 'ORC')`, too.
+  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added or change their default values. The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause `USING ORC`) when `spark.sql.orc.impl` is set to `native` and `spark.sql.orc.enableVectorizedReader` to `true`. For the Hive ORC serde table (e.g., the ones created using the clause `USING HIVE OPTIONS (fileFormat 'ORC')`), the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is set to true.

     - New configurations
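The final wording from PATCH 18 separates native ORC tables from Hive ORC serde tables. The sketch below restates that distinction as DDL; it assumes a Hive-enabled session, hypothetical table names, and configuration values that simply repeat the Spark 2.3 defaults described in the guide.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session and table names; Hive support is needed for the serde table.
val spark = SparkSession.builder()
  .appName("orc-table-flavors-sketch")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// Conditions for the vectorized reader on native ORC tables (Spark 2.3 defaults).
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

// Native ORC data source table.
spark.sql("CREATE TABLE IF NOT EXISTS orc_native (id BIGINT, name STRING) USING ORC")

// Hive ORC serde table; also read with the vectorized reader while
// spark.sql.hive.convertMetastoreOrc is true (the Spark 2.3 default).
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.sql("CREATE TABLE IF NOT EXISTS orc_hive (id BIGINT, name STRING) USING HIVE OPTIONS (fileFormat 'ORC')")

spark.sql("SELECT * FROM orc_native LIMIT 10").show()
```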