Conversation

@gengliangwang (Member)

What changes were proposed in this pull request?

After #25158 and #25458, SQL features of PostgreSQL were introduced into Spark. AFAIK, both features are implementation-defined behaviors that are not specified in ANSI SQL.
Given that, this PR proposes adding a configuration spark.sql.dialect for choosing a database dialect.
After this PR, Spark supports two database dialects, Spark and PostgreSQL. With PostgreSQL dialect, Spark will:

  1. perform integral division with the / operator if both sides are integral types;
  2. accept "true", "yes", "1", "false", "no", "0", and their unique prefixes as input for the boolean data type, trimming the input first.
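For illustration, the dialect-dependent division semantics in item 1 can be sketched as follows. This is a minimal sketch, not Spark's implementation; the `postgresDialect` flag is a hypothetical stand-in for checking spark.sql.dialect.

```scala
// Hedged sketch of the two division behaviors; not Spark code.
// `postgresDialect` is a hypothetical flag standing in for
// spark.sql.dialect == "PostgreSQL".
def divide(a: Int, b: Int, postgresDialect: Boolean): Double =
  if (postgresDialect) (a / b).toDouble // integral division: 7 / 2 yields 3.0
  else a.toDouble / b                   // default behavior:  7 / 2 yields 3.5
```

The point of the config is that the same query text flips between these two results depending on the chosen dialect.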

Why are the changes needed?

Unify the external database dialect behaviors under one configuration, instead of multiple small flags.

Does this PR introduce any user-facing change?

A new configuration spark.sql.dialect for choosing a database dialect.

How was this patch tested?

Existing tests.


@SparkQA commented Sep 5, 2019

Test build #110188 has finished for PR 25697 at commit 74057fc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 5, 2019

Test build #110191 has finished for PR 25697 at commit 3560573.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [SPARK-28997][SQL] Add spark.sql.dialect [WIP][SPARK-28997][SQL] Add spark.sql.dialect Sep 6, 2019
@younggyuchun commented Sep 6, 2019

@gengliangwang Would we support only PostgreSQL? Why don't we just use, for example, "ANSISQL" instead? Is there any reason we should use PostgreSQL?

@MaxGekk (Member) left a comment

@gengliangwang How difficult would it be to change the type of the config from string to some integral type? I'm just afraid that comparing strings for each value in my PR here https://github.com/apache/spark/pull/25716/files#diff-da60f07e1826788aaeb07f295fae4b8aR223 can have significant overhead.
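One way to address this concern, sketched here with hypothetical names, is to keep the user-facing config a string but resolve it to an enumeration value once, up front, so the per-row path never compares strings:

```scala
// Hedged sketch with hypothetical names; the actual PR uses
// SQLConf.Dialect for the same purpose.
object Dialect extends Enumeration {
  val SPARK, POSTGRESQL = Value
}

// The only string handling happens here, once.
def resolveDialect(confValue: String): Dialect.Value =
  Dialect.withName(confValue.toUpperCase)

val dialect = resolveDialect("PostgreSQL")      // cached before row processing
val usePostgres = dialect == Dialect.POSTGRESQL // cheap per-row check
```

Per-value work then branches on the cached enumeration value rather than re-reading and comparing the config string.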

@gengliangwang (Member Author)

@MaxGekk I think we have to make it a string-type configuration for the sake of user experience.

@gengliangwang (Member Author)

I will continue this one after #25693 is merged.

@dongjoon-hyun (Member)

Gentle ping, @gengliangwang .

@gengliangwang gengliangwang changed the title [WIP][SPARK-28997][SQL] Add spark.sql.dialect [SPARK-28997][SQL] Add spark.sql.dialect Sep 23, 2019
val dialect = SQLConf.get.getConf(SQLConf.DIALECT)
buildCast[UTF8String](_, s => {
-  if (StringUtils.isTrueString(s)) {
+  if (StringUtils.isTrueString(s, dialect)) {
@gengliangwang (Member Author)

If more dialects are added, we can pass the dialect to StringUtils as a parameter here, to avoid further changes in Cast.scala.

@gengliangwang (Member Author) commented Sep 23, 2019

Note that this brings performance overhead.
I will add a new expression instead.

@SparkQA commented Sep 23, 2019

Test build #111208 has finished for PR 25697 at commit 0e7f281.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [SPARK-28997][SQL] Add spark.sql.dialect [WIP][SPARK-28997][SQL] Add spark.sql.dialect Sep 23, 2019
@gengliangwang (Member Author) commented Sep 23, 2019

I will do the following for pgSQL dialect:

  1. add a new expression for casting string to boolean
  2. add a new wiki page.

Mark this WIP for now. It should be ready this week.

@gengliangwang gengliangwang changed the title [WIP][SPARK-28997][SQL] Add spark.sql.dialect [SPARK-28997][SQL] Add spark.sql.dialect Sep 24, 2019

val DIALECT =
  buildConf("spark.sql.dialect")
    .doc("The specific features of the SQL language to be adopted, which are available when " +
@gengliangwang (Member Author)

Let's have a follow-up PR to add a wiki page for the PostgreSQL dialect behaviors.

ResolveRandomSeed ::
TypeCoercion.typeCoercionRules(conf) ++
extendedResolutionRules : _*),
Batch("PostgreSQl dialect", Once, PostgreSQLDialect.postgreSQLDialectRules(conf): _*),
@gengliangwang (Member Author)

We can adjust the position and the number of loops of this Batch in future development.

@SparkQA commented Sep 24, 2019

Test build #111283 has finished for PR 25697 at commit 094d58e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class postgreCastStringToBoolean(conf: SQLConf) extends Rule[LogicalPlan] with Logging
  • case class PostgreCastStringToBoolean(child: Expression)

@SparkQA commented Sep 24, 2019

Test build #111288 has started for PR 25697 at commit 9f14680.

object PostgreSQLDialect {
  def postgreSQLDialectRules(conf: SQLConf): List[Rule[LogicalPlan]] =
    if (conf.usePostgreSQLDialect) {
      postgreCastStringToBoolean(conf) ::
@cloud-fan (Contributor) commented Sep 24, 2019

super nit: in my experience, it's easier to extend using this style:

Seq(
  rule1,
  rule2,
  ...
)

@gengliangwang (Member Author)

Let's keep the style consistent with the other rule batches.

Contributor

Optimizer uses :: to combine batches, not rules.
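For readers unfamiliar with the two styles being discussed, both build the same immutable list; a quick illustration:

```scala
// `::` cons-chains must end in Nil; Seq(...) takes a plain argument
// list, which is often easier to extend with another element.
val viaCons = "rule1" :: "rule2" :: Nil
val viaSeq  = Seq("rule1", "rule2")
// Both produce List("rule1", "rule2")
```

The disagreement here is purely stylistic consistency, not behavior.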

@SparkQA commented Sep 25, 2019

Test build #111355 has finished for PR 25697 at commit 72a1539.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PostgreCastStringToBoolean(child: Expression)

@SparkQA commented Sep 25, 2019

Test build #111361 has finished for PR 25697 at commit af97b99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author)

retest this please.

.set(SQLConf.DIALECT.key, SQLConf.Dialect.POSTGRESQL.toString)

test("cast string to boolean") {
  Seq("true", "tru", "tr", "t", " tRue ", " tRu ", "yes", "ye",
Member

nit: indent

Member

Need a single space before Seq(?


override def sparkConf: SparkConf =
  super.sparkConf
    .set(SQLConf.DIALECT.key, SQLConf.Dialect.POSTGRESQL.toString)
Member

nit: super.sparkConf.set(SQLConf.DIALECT.key, SQLConf.Dialect.POSTGRESQL.toString)?

@@ -0,0 +1,33 @@
/*
Member

nit: Since we already have the dir named pgSQL in sql/core/src/test/resources/sql-tests/inputs/pgSQL, postgreSQL -> pgSQL? Both names are OK, but I prefer a consistent name.

@gengliangwang (Member Author)

How about a follow-up to change pgSQL to postgreSQL? I prefer the official full name.

Member

Yea, ok to me.

@maropu (Member) commented Sep 26, 2019

LGTM except for minor comments.

@SparkQA commented Sep 26, 2019

Test build #111381 has finished for PR 25697 at commit af97b99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private[this] val falseStrings =
  Set("false", "fals", "fal", "fa", "f", "no", "n", "off", "of", "0").map(UTF8String.fromString)

def isTrueString(s: UTF8String): Boolean = trueStrings.contains(s.trim().toLowerCase())
Member

The isTrueString() and isFalseString() functions are always used together, and trim().toLowerCase() is performed twice. Would it be possible to execute that code only once?

Contributor

This is a good point. Maybe we should ask the caller side to do trim and lower-case.
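One way to follow that suggestion is a single lookup that normalizes once and answers both questions. This is a hedged sketch: falseStrings mirrors the set shown in the diff above, while trueStrings is an assumed mirror (unique prefixes of "true", "yes", "on", plus "1"); the helper name is hypothetical.

```scala
import java.util.Locale

// Hedged sketch: normalize once, then do two cheap set lookups.
object BooleanStrings {
  private val trueStrings =
    Set("true", "tru", "tr", "t", "yes", "ye", "y", "on", "1")
  private val falseStrings =
    Set("false", "fals", "fal", "fa", "f", "no", "n", "off", "of", "0")

  // Returns Some(true)/Some(false) for recognized input, None otherwise,
  // calling trim/toLowerCase exactly once per input.
  def parse(input: String): Option[Boolean] = {
    val s = input.trim.toLowerCase(Locale.ROOT)
    if (trueStrings.contains(s)) Some(true)
    else if (falseStrings.contains(s)) Some(false)
    else None
  }
}
```

Note that "o" appears in neither set, since it is an ambiguous prefix of both "on" and "off".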

@SparkQA commented Sep 26, 2019

Test build #111388 has finished for PR 25697 at commit b518114.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Sep 26, 2019

retest this please

@SparkQA commented Sep 26, 2019

Test build #111401 has finished for PR 25697 at commit b518114.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 26, 2019

Test build #111407 has finished for PR 25697 at commit 75aeda7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum pushed a commit that referenced this pull request Sep 26, 2019
### What changes were proposed in this pull request?

Rename the package pgSQL to postgreSQL

### Why are the changes needed?

To address the comment in #25697 (comment). The official full name seems more reasonable.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #25936 from gengliangwang/renamePGSQL.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan cloud-fan closed this in a1213d5 Sep 26, 2019
cloud-fan pushed a commit that referenced this pull request Dec 10, 2019
### What changes were proposed in this pull request?
Reprocess all PostgreSQL dialect related PRs, listing in order:

- #25158: PostgreSQL integral division support [revert]
- #25170: UT changes for the integral division support [revert]
- #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert]
- #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert]
- #26112: Date subtraction support [keep the ANSI-compliant part]
- #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert]
- #26463: Cast to boolean support for PostgreSQL dialect [revert]
- #26584: Make the behavior of Postgre dialect independent of ansi mode config [keep the ANSI-compliant part]

### Why are the changes needed?
As discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove the PostgreSQL dialect from the code base for several reasons:
1. The current approach makes the codebase complicated and hard to maintain.
2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.

### Does this PR introduce any user-facing change?
Yes, the config `spark.sql.dialect` will be removed.

### How was this patch tested?
Existing UT.

Closes #26763 from xuanyuanking/SPARK-30125.

Lead-authored-by: Yuanjian Li <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>