[SPARK-28997][SQL] Add spark.sql.dialect #25697
Conversation
Test build #110188 has finished for PR 25697 at commit …
Test build #110191 has finished for PR 25697 at commit …
spark.sql.dialect
@gengliangwang Would we support only PostgreSQL? Why don't we just use, for example, "ANSISQL" instead? Is there any reason we should use PostgreSQL?
@gengliangwang Would it be difficult to change the type of the config from string to some integral type? I'm just afraid that comparing strings for each value in my PR here https://github.com/apache/spark/pull/25716/files#diff-da60f07e1826788aaeb07f295fae4b8aR223 can have significant overhead.
@MaxGekk I think we have to make it a string-type configuration for user experience.
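One way to reconcile the two concerns above is to keep the user-facing config a string but resolve it to a Boolean once, at construction time, so the per-value path never compares strings. The sketch below is illustrative only (the class, factory, and string sets are hypothetical stand-ins, not the PR's actual API):

```scala
// Hypothetical sketch: the string config is inspected once per expression
// instance; eval() itself never touches dialect strings.
final class StringToBooleanCast(usePostgreSQLDialect: Boolean) {
  // PostgreSQL accepts unique prefixes of true/yes plus "on" and "1"
  // (per PR #25458); the default Spark set is smaller.
  private val trueStrings: Set[String] =
    if (usePostgreSQLDialect)
      Set("true", "tru", "tr", "t", "yes", "ye", "y", "on", "1")
    else
      Set("t", "true", "y", "yes", "1")

  def eval(s: String): Boolean = trueStrings.contains(s.trim.toLowerCase)
}

object StringToBooleanCast {
  // The dialect string is compared exactly once, here.
  def fromConf(dialect: String): StringToBooleanCast =
    new StringToBooleanCast(dialect.equalsIgnoreCase("PostgreSQL"))
}
```

With this shape, the hot loop pays only a `Set` lookup per value, regardless of how the dialect was spelled in the config.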
I will continue this one after #25693 is merged.
Gentle ping, @gengliangwang.
Force-pushed from 3560573 to 0e7f281.
spark.sql.dialect
```diff
+ val dialect = SQLConf.get.getConf(SQLConf.DIALECT)
  buildCast[UTF8String](_, s => {
-   if (StringUtils.isTrueString(s)) {
+   if (StringUtils.isTrueString(s, dialect)) {
```
If more dialects are added, we can pass the dialect to StringUtils as a parameter here, to avoid further changes in Cast.scala.
Note that this brings performance overhead.
I will add a new expression instead.
Test build #111208 has finished for PR 25697 at commit …
spark.sql.dialect
I will do the following for pgSQL dialect:
Mark this WIP for now. It should be ready this week.
spark.sql.dialect
```scala
val DIALECT =
  buildConf("spark.sql.dialect")
    .doc("The specific features of the SQL language to be adopted, which are available when " +
```
Let's have a follow-up PR to add a wiki page for the PostgreSQL dialect behaviors.
```scala
      ResolveRandomSeed ::
      TypeCoercion.typeCoercionRules(conf) ++
      extendedResolutionRules : _*),
    Batch("PostgreSQl dialect", Once, PostgreSQLDialect.postgreSQLDialectRules(conf): _*),
```
We can adjust the position and the number of loops of this Batch in future development.
Test build #111283 has finished for PR 25697 at commit …
Test build #111288 has started for PR 25697 at commit …
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/PostgreSQLDialect.scala
```scala
object PostgreSQLDialect {
  def postgreSQLDialectRules(conf: SQLConf): List[Rule[LogicalPlan]] =
    if (conf.usePostgreSQLDialect) {
      postgreCastStringToBoolean(conf) ::
```
super nit: in my experience, it's easier to extend using this style:
```scala
Seq(
  rule1,
  rule2,
  ...
)
```
Let's keep the style consistent with the other rule batches.
Optimizer uses :: to combine batches, not rules.
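For illustration, the two spellings under discussion build the same rule list, so the choice is purely stylistic (the types below are stand-ins, not Spark's `Rule[LogicalPlan]`):

```scala
// Stand-in rule type; only the list-building style is being compared.
sealed trait DialectRule
case object PostgreCastStringToBoolean extends DialectRule
case object SomeFutureRule extends DialectRule

// Cons-cell style, as the PR and the Optimizer's batch lists use:
val viaCons: List[DialectRule] =
  PostgreCastStringToBoolean :: SomeFutureRule :: Nil

// Seq(...) style, which the review suggests reads better when rules grow:
val viaSeq: Seq[DialectRule] =
  Seq(PostgreCastStringToBoolean, SomeFutureRule)
```

Adding a rule in the `Seq(...)` style is a one-line change with no trailing `:: Nil` to keep track of; keeping the `::` style matches the surrounding batches.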
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/postgreSQL/Cast.scala
Test build #111355 has finished for PR 25697 at commit …
Test build #111361 has finished for PR 25697 at commit …
retest this please.
```scala
    .set(SQLConf.DIALECT.key, SQLConf.Dialect.POSTGRESQL.toString)

  test("cast string to boolean") {
    Seq("true", "tru", "tr", "t", " tRue ", " tRu ", "yes", "ye",
```
nit: indent
Need a single space before Seq(?
```scala
  override def sparkConf: SparkConf =
    super.sparkConf
      .set(SQLConf.DIALECT.key, SQLConf.Dialect.POSTGRESQL.toString)
```
nit: `super.sparkConf.set(SQLConf.DIALECT.key, SQLConf.Dialect.POSTGRESQL.toString)`?
```diff
@@ -0,0 +1,33 @@
+/*
```
nit: Since we already have the dir named pgSQL in sql/core/src/test/resources/sql-tests/inputs/pgSQL, postgreSQL -> pgSQL? Both names are ok, but I like a consistent name.
How about a follow-up to change pgSQL to postgreSQL? I prefer the official full name.
Yea, ok to me.
LGTM except for minor comments.
Test build #111381 has finished for PR 25697 at commit …
```scala
private[this] val falseStrings =
  Set("false", "fals", "fal", "fa", "f", "no", "n", "off", "of", "0").map(UTF8String.fromString)

def isTrueString(s: UTF8String): Boolean = trueStrings.contains(s.trim().toLowerCase())
```
The isTrueString() and isFalseString() functions are always used together, and trim().toLowerCase() is performed twice. Would it be possible to execute that code only once?
This is a good point. Maybe we should ask the caller side to do trim and lower-case.
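A standalone sketch of that caller-side normalization: plain `String` stands in for `UTF8String`, the falseStrings set is quoted from the diff above, and the trueStrings counterpart is an assumption:

```scala
object BooleanStringSketch {
  // Assumed counterpart to the falseStrings set shown in the diff above.
  private val trueStrings =
    Set("true", "tru", "tr", "t", "yes", "ye", "y", "on", "1")
  private val falseStrings =
    Set("false", "fals", "fal", "fa", "f", "no", "n", "off", "of", "0")

  // None means the input is neither a true-string nor a false-string.
  def toBoolean(s: String): Option[Boolean] = {
    val v = s.trim.toLowerCase // normalized exactly once
    if (trueStrings.contains(v)) Some(true)
    else if (falseStrings.contains(v)) Some(false)
    else None
  }
}
```

Folding both membership tests into one entry point makes the single trim/lower-case pass the natural shape, instead of relying on every caller to remember it.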
Test build #111388 has finished for PR 25697 at commit …
retest this please
Test build #111401 has finished for PR 25697 at commit …
Test build #111407 has finished for PR 25697 at commit …
### What changes were proposed in this pull request?
Rename the package pgSQL to postgreSQL.
### Why are the changes needed?
To address the comment in #25697 (comment). The official full name seems more reasonable.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing unit tests.

Closes #25936 from gengliangwang/renamePGSQL.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
thanks, merging to master!
### What changes were proposed in this pull request?
Reprocess all PostgreSQL dialect related PRs, listing in order:
- #25158: PostgreSQL integral division support [revert]
- #25170: UT changes for the integral division support [revert]
- #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert]
- #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert]
- #26112: Date subtraction support [keep the ANSI-compliant part]
- #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert]
- #26463: Cast to boolean support for PostgreSQL dialect [revert]
- #26584: Make the behavior of Postgre dialect independent of ansi mode config [keep the ANSI-compliant part]
### Why are the changes needed?
As discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove the PostgreSQL dialect from the code base for several reasons:
1. The current approach makes the codebase complicated and hard to maintain.
2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.
### Does this PR introduce any user-facing change?
Yes, the config `spark.sql.dialect` will be removed.
### How was this patch tested?
Existing UT.

Closes #26763 from xuanyuanking/SPARK-30125.
Lead-authored-by: Yuanjian Li <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
After #25158 and #25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL.
In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect. After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With the `PostgreSQL` dialect, Spark will:
### Why are the changes needed?
Unify the external database dialect with one configuration, instead of small flags.
### Does this PR introduce any user-facing change?
A new configuration `spark.sql.dialect` for choosing a database dialect.
### How was this patch tested?
Existing tests.
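The configuration described above maps a user-supplied string onto an enumeration of supported dialects. A minimal standalone sketch of that validation (the enumeration mirrors `SQLConf.Dialect` from the PR, but `parseDialect` is a hypothetical helper, not Spark's actual `buildConf` machinery):

```scala
// Enumeration of supported dialects, mirroring SQLConf.Dialect in the PR.
object Dialect extends Enumeration {
  val SPARK, POSTGRESQL = Value
}

// Hypothetical helper: case-insensitive lookup with a descriptive error,
// roughly what a checkValues-style config validation would do.
def parseDialect(value: String): Dialect.Value =
  Dialect.values
    .find(_.toString.equalsIgnoreCase(value))
    .getOrElse(throw new IllegalArgumentException(
      s"Invalid value '$value' for spark.sql.dialect; " +
        s"must be one of ${Dialect.values.mkString(", ")}"))
```

Case-insensitive matching lets users write `PostgreSQL` or `postgresql` in the config while internal code compares enum values rather than strings.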