Conversation

@gengliangwang (Member)

What changes were proposed in this pull request?

After #25158 and #25458, SQL features of PostgreSQL were introduced into Spark. AFAIK, both features are implementation-defined behaviors that are not specified in ANSI SQL.
Given that, this PR proposes adding a configuration spark.sql.dialect for choosing a database dialect.
After this PR, Spark supports two database dialects, Spark and PostgreSQL. With PostgreSQL dialect, Spark will:

  1. perform integral division with the / operator if both sides are integral types;
  2. accept "true", "yes", "1", "false", "no", "0", and their unique prefixes as input for the boolean data type, trimming the input first.
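For illustration, the dialect-dependent division semantics in item 1 can be sketched as follows. This is a minimal sketch, not Spark's implementation; the `postgresDialect` flag is a hypothetical stand-in for checking spark.sql.dialect.

```scala
// Hedged sketch of the two division behaviors; not Spark code.
// `postgresDialect` is a hypothetical flag standing in for
// spark.sql.dialect == "PostgreSQL".
def divide(a: Int, b: Int, postgresDialect: Boolean): Double =
  if (postgresDialect) (a / b).toDouble // integral division: 7 / 2 yields 3.0
  else a.toDouble / b                   // default behavior:  7 / 2 yields 3.5
```

The point of the config is that the same query text flips between these two results depending on the chosen dialect.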

Why are the changes needed?

Unify the external database dialect behaviors under one configuration, instead of multiple small flags.

Does this PR introduce any user-facing change?

A new configuration spark.sql.dialect for choosing a database dialect.

How was this patch tested?

Existing tests.


@SparkQA commented Sep 5, 2019

Test build #110188 has finished for PR 25697 at commit 74057fc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 5, 2019

Test build #110191 has finished for PR 25697 at commit 3560573.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [SPARK-28997][SQL] Add spark.sql.dialect [WIP][SPARK-28997][SQL] Add spark.sql.dialect Sep 6, 2019
@younggyuchun commented Sep 6, 2019

@gengliangwang Would we support only PostgreSQL? Why don't we just use, for example, "ANSISQL" instead? Is there any reason we should use PostgreSQL?

@MaxGekk (Member) left a comment

@gengliangwang How difficult would it be to change the type of the config from string to some integral type? I'm just afraid that comparing strings for each value in my PR here https://github.com/apache/spark/pull/25716/files#diff-da60f07e1826788aaeb07f295fae4b8aR223 can have significant overhead.
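One way to address this concern, sketched here with hypothetical names, is to keep the user-facing config a string but resolve it to an enumeration value once, up front, so the per-row path never compares strings:

```scala
// Hedged sketch with hypothetical names; the actual PR uses
// SQLConf.Dialect for the same purpose.
object Dialect extends Enumeration {
  val SPARK, POSTGRESQL = Value
}

// The only string handling happens here, once.
def resolveDialect(confValue: String): Dialect.Value =
  Dialect.withName(confValue.toUpperCase)

val dialect = resolveDialect("PostgreSQL")      // cached before row processing
val usePostgres = dialect == Dialect.POSTGRESQL // cheap per-row check
```

Per-value work then branches on the cached enumeration value rather than re-reading and comparing the config string.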

@gengliangwang (Member Author)

@MaxGekk I think we have to make it a string-type configuration for the sake of user experience.

@gengliangwang (Member Author)

I will continue this one after #25693 is merged.

@dongjoon-hyun (Member)

Gentle ping, @gengliangwang .

@gengliangwang gengliangwang changed the title [WIP][SPARK-28997][SQL] Add spark.sql.dialect [SPARK-28997][SQL] Add spark.sql.dialect Sep 23, 2019
val dialect = SQLConf.get.getConf(SQLConf.DIALECT)
buildCast[UTF8String](_, s => {
-  if (StringUtils.isTrueString(s)) {
+  if (StringUtils.isTrueString(s, dialect)) {
@gengliangwang (Member Author)

If more dialects are added, we can pass the dialect to StringUtils as a parameter here, to avoid further changes in Cast.scala.

@gengliangwang (Member Author) commented Sep 23, 2019

Note that this brings performance overhead.
I will add a new expression instead.

@SparkQA commented Sep 23, 2019

Test build #111208 has finished for PR 25697 at commit 0e7f281.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [SPARK-28997][SQL] Add spark.sql.dialect [WIP][SPARK-28997][SQL] Add spark.sql.dialect Sep 23, 2019
@gengliangwang (Member Author) commented Sep 23, 2019

I will do the following for pgSQL dialect:

  1. add a new expression for casting string to boolean
  2. add a new wiki page.

Mark this WIP for now. It should be ready this week.

@gengliangwang gengliangwang changed the title [WIP][SPARK-28997][SQL] Add spark.sql.dialect [SPARK-28997][SQL] Add spark.sql.dialect Sep 24, 2019

val DIALECT =
  buildConf("spark.sql.dialect")
    .doc("The specific features of the SQL language to be adopted, which are available when " +
@gengliangwang (Member Author)

Let's have a follow-up PR to add a wiki page for the PostgreSQL dialect behaviors.

ResolveRandomSeed ::
TypeCoercion.typeCoercionRules(conf) ++
extendedResolutionRules : _*),
Batch("PostgreSQl dialect", Once, PostgreSQLDialect.postgreSQLDialectRules(conf): _*),
@gengliangwang (Member Author)

We can adjust the position and the number of loops of this Batch in future development.

@SparkQA commented Sep 24, 2019

Test build #111283 has finished for PR 25697 at commit 094d58e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class postgreCastStringToBoolean(conf: SQLConf) extends Rule[LogicalPlan] with Logging
  • case class PostgreCastStringToBoolean(child: Expression)

@SparkQA commented Sep 24, 2019

Test build #111288 has started for PR 25697 at commit 9f14680.

object PostgreSQLDialect {
  def postgreSQLDialectRules(conf: SQLConf): List[Rule[LogicalPlan]] =
    if (conf.usePostgreSQLDialect) {
      postgreCastStringToBoolean(conf) ::
@cloud-fan (Contributor) commented Sep 24, 2019

super nit: in my experience, it's easier to extend using this style:

Seq(
  rule1,
  rule2,
  ...
)

@gengliangwang (Member Author)

Let's keep the style consistent with the other rule batches.

Contributor

Optimizer uses :: to combine batches, not rules.
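For readers unfamiliar with the two styles being discussed, both build the same immutable list; a quick illustration:

```scala
// `::` cons-chains must end in Nil; Seq(...) takes a plain argument
// list, which is often easier to extend with another element.
val viaCons = "rule1" :: "rule2" :: Nil
val viaSeq  = Seq("rule1", "rule2")
// Both produce List("rule1", "rule2")
```

The disagreement here is purely stylistic consistency, not behavior.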

@SparkQA commented Sep 25, 2019

Test build #111355 has finished for PR 25697 at commit 72a1539.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PostgreCastStringToBoolean(child: Expression)

@SparkQA commented Sep 25, 2019

Test build #111361 has finished for PR 25697 at commit af97b99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member Author)

retest this please.

.set(SQLConf.DIALECT.key, SQLConf.Dialect.POSTGRESQL.toString)

test("cast string to boolean") {
  Seq("true", "tru", "tr", "t", " tRue ", " tRu ", "yes", "ye",
Member

nit: indent

Member

Need a single space before Seq(?


override def sparkConf: SparkConf =
  super.sparkConf
    .set(SQLConf.DIALECT.key, SQLConf.Dialect.POSTGRESQL.toString)
Member

nit: super.sparkConf.set(SQLConf.DIALECT.key, SQLConf.Dialect.POSTGRESQL.toString)?

@@ -0,0 +1,33 @@
/*
Member

nit: Since we already have the dir named pgSQL in sql/core/src/test/resources/sql-tests/inputs/pgSQL, postgreSQL -> pgSQL? Both names are OK, but I prefer a consistent name.

@gengliangwang (Member Author)

How about a follow-up to change pgSQL to postgreSQL? I prefer the official full name.

Member

Yea, ok to me.

@maropu (Member) commented Sep 26, 2019

LGTM except for minor comments.

@SparkQA commented Sep 26, 2019

Test build #111381 has finished for PR 25697 at commit af97b99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private[this] val falseStrings =
  Set("false", "fals", "fal", "fa", "f", "no", "n", "off", "of", "0").map(UTF8String.fromString)

def isTrueString(s: UTF8String): Boolean = trueStrings.contains(s.trim().toLowerCase())
Member

The isTrueString() and isFalseString() functions are always used together, and trim().toLowerCase() is performed twice. Would it be possible to execute that code only once?

Contributor

This is a good point. Maybe we should ask the caller side to do trim and lower-case.
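One way to follow that suggestion is a single lookup that normalizes once and answers both questions. This is a hedged sketch: falseStrings mirrors the set shown in the diff above, while trueStrings is an assumed mirror (unique prefixes of "true", "yes", "on", plus "1"); the helper name is hypothetical.

```scala
import java.util.Locale

// Hedged sketch: normalize once, then do two cheap set lookups.
object BooleanStrings {
  private val trueStrings =
    Set("true", "tru", "tr", "t", "yes", "ye", "y", "on", "1")
  private val falseStrings =
    Set("false", "fals", "fal", "fa", "f", "no", "n", "off", "of", "0")

  // Returns Some(true)/Some(false) for recognized input, None otherwise,
  // calling trim/toLowerCase exactly once per input.
  def parse(input: String): Option[Boolean] = {
    val s = input.trim.toLowerCase(Locale.ROOT)
    if (trueStrings.contains(s)) Some(true)
    else if (falseStrings.contains(s)) Some(false)
    else None
  }
}
```

Note that "o" appears in neither set, since it is an ambiguous prefix of both "on" and "off".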

@SparkQA commented Sep 26, 2019

Test build #111388 has finished for PR 25697 at commit b518114.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Sep 26, 2019

retest this please

@SparkQA commented Sep 26, 2019

Test build #111401 has finished for PR 25697 at commit b518114.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 26, 2019

Test build #111407 has finished for PR 25697 at commit 75aeda7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum pushed a commit that referenced this pull request Sep 26, 2019
### What changes were proposed in this pull request?

Rename the package pgSQL to postgreSQL

### Why are the changes needed?

To address the comment in #25697 (comment). The official full name seems more reasonable.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #25936 from gengliangwang/renamePGSQL.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan cloud-fan closed this in a1213d5 Sep 26, 2019
cloud-fan pushed a commit that referenced this pull request Dec 10, 2019
### What changes were proposed in this pull request?
Reprocess all PostgreSQL dialect related PRs, listing in order:

- #25158: PostgreSQL integral division support [revert]
- #25170: UT changes for the integral division support [revert]
- #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert]
- #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert]
- #26112: Date subtraction support [keep the ANSI-compliant part]
- #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert]
- #26463: Cast to boolean support for PostgreSQL dialect [revert]
- #26584: Make the behavior of Postgre dialect independent of ansi mode config [keep the ANSI-compliant part]

### Why are the changes needed?
As discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove the PostgreSQL dialect from the code base for several reasons:
1. The current approach makes the codebase complicated and hard to maintain.
2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.

### Does this PR introduce any user-facing change?
Yes, the config `spark.sql.dialect` will be removed.

### How was this patch tested?
Existing UT.

Closes #26763 from xuanyuanking/SPARK-30125.

Lead-authored-by: Yuanjian Li <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>