[SPARK-29532][SQL] Simplify interval string parsing #26190

cloud-fan · 2019-10-21T08:53:52Z

What changes were proposed in this pull request?

Only use antlr4 to parse the interval string, and remove the duplicated parsing logic from CalendarInterval.

Why are the changes needed?

Simplify the code and fix inconsistent behaviors.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass the Jenkins with the updated test cases.

SparkQA · 2019-10-21T09:01:30Z

Test build #112376 has finished for PR 26190 at commit c13d2fb.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-21T09:15:13Z

Test build #112377 has finished for PR 26190 at commit ed4a22c.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-21T09:22:26Z

Test build #112379 has finished for PR 26190 at commit 92112b5.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2019-10-21T09:56:55Z

Let's use the benchmark from #26189 to compare your implementation, the current one, and the one proposed in #26180.

MaxGekk · 2019-10-21T10:04:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

+      val intervals = ctx.intervalField.asScala.map(visitIntervalField)
+      validate(intervals.nonEmpty,
+        "at least one time unit should be given for interval literal", ctx)
+      intervals.reduce(_.add(_))


You create an instance of CalendarInterval for each interval unit, always have to sum up months and microseconds (and maybe days in the future), and create a new instance on each add()?

Let's benchmark this and see how fast it is.

This is how it was done in the parser. It's indeed different from CalendarInterval.fromCaseInsensitiveString. Let me see how to follow CalendarInterval and make the parser more efficient.
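
The allocation concern raised above can be sketched in plain Java. This is a hypothetical illustration, not Spark code: `IntervalSketch` is an assumed stand-in for CalendarInterval with only months and microseconds, and the unit names and fixed-length day are assumptions. `sumByReduce` mirrors `intervals.reduce(_.add(_))`, while `sumInOnePass` accumulates into two primitives and allocates once.

```java
// Hypothetical stand-in for CalendarInterval, reduced to months and microseconds.
final class IntervalSketch {
    // Assumed conversions: every unit below a month is a fixed number of microseconds.
    static final long MICROS_PER_SECOND = 1_000_000L;
    static final long MICROS_PER_MINUTE = 60L * MICROS_PER_SECOND;
    static final long MICROS_PER_HOUR = 60L * MICROS_PER_MINUTE;
    static final long MICROS_PER_DAY = 24L * MICROS_PER_HOUR;

    final int months;
    final long microseconds;

    IntervalSketch(int months, long microseconds) {
        this.months = months;
        this.microseconds = microseconds;
    }

    // Allocates a new object on every call, like the add() questioned above.
    IntervalSketch add(IntervalSketch other) {
        return new IntervalSketch(months + other.months, microseconds + other.microseconds);
    }

    // Months contributed by one unit of the given kind (0 for time-based units).
    static long monthsPerUnit(String unit) {
        switch (unit) {
            case "year": return 12L;
            case "month": return 1L;
            default: return 0L;
        }
    }

    // Microseconds contributed by one unit of the given kind; rejects unknown units.
    static long microsPerUnit(String unit) {
        switch (unit) {
            case "year":
            case "month": return 0L;
            case "day": return MICROS_PER_DAY;
            case "hour": return MICROS_PER_HOUR;
            case "minute": return MICROS_PER_MINUTE;
            case "second": return MICROS_PER_SECOND;
            default: throw new IllegalArgumentException("unknown unit: " + unit);
        }
    }

    private static void check(String[] units, long[] values) {
        if (units.length == 0 || units.length != values.length) {
            throw new IllegalArgumentException("at least one time unit should be given");
        }
    }

    // Approach 1: one temporary interval per unit plus one per add().
    static IntervalSketch sumByReduce(String[] units, long[] values) {
        check(units, values);
        IntervalSketch acc = null;
        for (int i = 0; i < units.length; i++) {
            long micros = values[i] * microsPerUnit(units[i]);
            int months = Math.toIntExact(values[i] * monthsPerUnit(units[i]));
            IntervalSketch part = new IntervalSketch(months, micros);
            acc = (acc == null) ? part : acc.add(part);
        }
        return acc;
    }

    // Approach 2: accumulate in primitives and allocate only the final result.
    static IntervalSketch sumInOnePass(String[] units, long[] values) {
        check(units, values);
        long months = 0L;
        long micros = 0L;
        for (int i = 0; i < units.length; i++) {
            micros += values[i] * microsPerUnit(units[i]);
            months += values[i] * monthsPerUnit(units[i]);
        }
        return new IntervalSketch(Math.toIntExact(months), micros);
    }
}
```

Both methods return the same totals; they differ only in how many intermediate objects they create, which is the cost the benchmark request is meant to measure.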

cloud-fan · 2019-10-21T10:26:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

    import ctx._
-    val s = value.getText
+    val s = if (value.STRING() != null) {
+      string(value.STRING())


This strips the ' quotes from the string, so that the regex from CalendarInterval does not need to deal with them.

cloud-fan · 2019-10-21T10:28:06Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

    checkAnswer(sql("SELECT a.`c.b`, `b.$q`[0].`a@!.q`, `q.w`.`w.i&`[0] FROM t"), Row(1, 1, 1))
  }

-  test("Convert hive interval term into Literal of CalendarIntervalType") {


The tests are moved to literal.sql.

SparkQA · 2019-10-21T10:29:38Z

Test build #112382 has finished for PR 26190 at commit 92ad086.

This patch fails Java style tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-10-21T11:23:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

-          CalendarInterval.fromSingleUnitString(u.substring(0, u.length - 1), s)
        case (u, None) =>
-          CalendarInterval.fromSingleUnitString(u, s)
+          CalendarInterval.fromUnitString(Array(normalizeInternalUnit(u)), Array(s))


We can improve this later, by separating the parser rules for 1 year 10 months and '1-10' year to month.
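
To show why the two forms call for separate handling, here is a minimal Java sketch of converting a '1-10' year-to-month string into a month count. It is an assumption-laden illustration, not the code in this PR: the class name `YearMonthSketch`, the regex, and the 0 to 11 bound on the month part are all hypothetical choices.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the "years-months" form, e.g. '1-10' year to month.
final class YearMonthSketch {
    // Optional sign, then years, a dash, then months.
    private static final Pattern YEAR_MONTH = Pattern.compile("^([+-])?(\\d+)-(\\d+)$");

    // Returns the total number of months encoded by the string.
    static int toMonths(String s) {
        Matcher m = YEAR_MONTH.matcher(s.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a year-month string: " + s);
        }
        int sign = "-".equals(m.group(1)) ? -1 : 1;
        int years = Integer.parseInt(m.group(2));
        int months = Integer.parseInt(m.group(3));
        // Assumed rule: the month part must stay below a full year.
        if (months > 11) {
            throw new IllegalArgumentException("month part must be in [0, 11]: " + s);
        }
        return sign * Math.addExact(Math.multiplyExact(years, 12), months);
    }
}
```

Under these assumptions, `YearMonthSketch.toMonths("1-10")` yields 22 months, the same total as 1 year 10 months, even though the two inputs have different shapes.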

cloud-fan · 2019-10-21T11:27:28Z

@MaxGekk I don't think performance is a strong reason to move parsing logic from antlr to hand-written Java code. It's better to centralize the parsing stuff in antlr.

SparkQA · 2019-10-21T11:29:16Z

Test build #112385 has finished for PR 26190 at commit 863670e.

This patch fails Java style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-21T12:51:42Z

Test build #112384 has finished for PR 26190 at commit 5b4dad1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-21T13:56:06Z

Test build #112387 has finished for PR 26190 at commit b3434ec.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2019-10-21T14:15:33Z

> @MaxGekk I don't think performance is a strong reason to move parsing logic from antlr to hand-written Java code. It's better to centralize the parsing stuff in antlr.

I would agree with you until Spark supports bulk loading of interval strings. Also, I think it doesn't matter to users how you implement parsing. If two implementations are functionally equivalent, users would prefer the faster one, I think.

cloud-fan · 2019-10-21T16:02:22Z

I agree with you that performance may matter if we need to support bulk loading of interval strings. We can propose a faster version of interval parsing at that time (with a benchmark). For now I don't think it's worth keeping two versions of interval parsing: one in antlr and one in hand-written Java.

MaxGekk · 2019-10-22T05:04:27Z

@cloud-fan Would you mind running the benchmark (6ffec5e) before and after your changes?

dongjoon-hyun · 2019-10-23T17:09:37Z

Oh, got it, @cloud-fan .
For the benchmark, I'll run and make a PR (with EC2 results) to you in a few hours.

dongjoon-hyun · 2019-10-23T17:14:21Z

BTW, @cloud-fan . I updated the PR description because the UTs are changed in this PR.

dongjoon-hyun · 2019-10-23T18:19:20Z

sql/core/benchmarks/IntervalBenchmark-results.txt

@@ -1,25 +1,25 @@
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15
-Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz


Oh, the original benchmark result is on macOS.

dongjoon-hyun · 2019-10-23T18:21:08Z

sql/core/benchmarks/IntervalBenchmark-results.txt

+8 units w/ interval                               11787          11992         339          0.1       11786.7       0.0X
+8 units w/o interval                              11666          11720          56          0.1       11665.8       0.0X
+9 units w/ interval                               12878          12908          42          0.1       12877.7       0.0X
+9 units w/o interval                              12696          12738          36          0.1       12696.1       0.0X


So, is this the 6x regression, 2212 -> 12738?

I believe the parser approach is the right direction for code maintainability. Just cc @gatorsmile since this seems inevitable.

My comment is hidden because the code changed: #26190 (comment)

Let me copy it again.

It's about 6 times slower, but perf doesn't matter too much here. It's better to keep a single parser, and make the behavior consistent whenever we parse an interval string. For example:

select interval +1 day works but select interval '+1 day' does not.

select interval 1 day 1 year works (fields can be any order) but select interval '1 day 1 year' does not.

In general, the handwritten parser is more efficient but is less powerful. I think it's fragile to maintain the handwritten parser and make sure it's consistent with the antlr one.
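
The two examples above come down to accepting a leading sign and accepting fields in any order. The following Java sketch shows a unit-list parser with both properties. It is purely illustrative and is neither the antlr grammar nor the CalendarInterval code: the class `UnitListSketch`, its unit names, and its `{months, microseconds}` return shape are assumptions made for this example.

```java
import java.util.Locale;

// Hypothetical parser for strings such as "+1 day" or "1 day 1 year".
final class UnitListSketch {
    static final long MICROS_PER_SECOND = 1_000_000L;
    static final long MICROS_PER_MINUTE = 60L * MICROS_PER_SECOND;
    static final long MICROS_PER_HOUR = 60L * MICROS_PER_MINUTE;
    static final long MICROS_PER_DAY = 24L * MICROS_PER_HOUR;

    // Returns {months, microseconds}; fields may appear in any order.
    static long[] parse(String s) {
        String[] tokens = s.trim().split("\\s+");
        if (tokens.length < 2 || tokens.length % 2 != 0) {
            throw new IllegalArgumentException("expected <value> <unit> pairs: " + s);
        }
        long months = 0L;
        long micros = 0L;
        for (int i = 0; i < tokens.length; i += 2) {
            // Long.parseLong accepts an optional leading '+' or '-'.
            long value = Long.parseLong(tokens[i]);
            String unit = tokens[i + 1].toLowerCase(Locale.ROOT);
            // Treat plural and singular unit names alike ("days" and "day").
            if (unit.endsWith("s")) {
                unit = unit.substring(0, unit.length() - 1);
            }
            switch (unit) {
                case "year": months += value * 12L; break;
                case "month": months += value; break;
                case "day": micros += value * MICROS_PER_DAY; break;
                case "hour": micros += value * MICROS_PER_HOUR; break;
                case "minute": micros += value * MICROS_PER_MINUTE; break;
                case "second": micros += value * MICROS_PER_SECOND; break;
                default: throw new IllegalArgumentException("unknown unit: " + unit);
            }
        }
        return new long[] {months, micros};
    }
}
```

Because each pair is added to a running total, "1 day 1 year" and "1 year 1 day" produce the same result. Keeping a second parser like this in step with the grammar is the maintenance burden the comment warns about.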

SparkQA · 2019-10-23T19:03:00Z

Test build #112552 has finished for PR 26190 at commit 33ceedc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-10-24T00:13:51Z

Hi, @cloud-fan . I made a PR to you. After merging, you can compare with the result on master (I also regenerated the master result.)

https://github.com/cloud-fan/spark/pull/14/files

However, the result looks really bad. It seems that we need to rethink this approach.

* Add JDK11 * Add JDK8

cloud-fan · 2019-10-24T05:52:28Z

The last argument is about performance: #26190 (comment)

My point is that the regex-based Java version is faster, but it's not as functional as the antlr parser, as I pointed out in the comment above. We are now in the early stage of exposing the interval type, and we should focus more on functionality than on performance. For the INTERVAL '...' literal syntax and for parsing watermark strings, the performance doesn't really matter. For cast, the performance matters, but it's better to have a UTF8String-based parser instead of a regex-based one.

SparkQA · 2019-10-24T07:05:01Z

Test build #112584 has finished for PR 26190 at commit f759fd5.

This patch fails due to an unknown error code, -9.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2019-10-24T07:05:01Z

Test build #112587 has finished for PR 26190 at commit 48b7ef4.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-24T07:05:02Z

Test build #112586 has finished for PR 26190 at commit 77f69a1.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class CreateNamedStruct(children: Seq[Expression]) extends Expression
case class CreateNamespaceStatement(
case class ShowPartitionsStatement(
case class RefreshTableStatement(tableName: Seq[String]) extends ParsedStatement
case class CreateNamespace(
case class RefreshTable(
case class CreateNamespaceExec(
case class RefreshTableExec(

cloud-fan · 2019-10-24T07:15:42Z

retest this please

SparkQA · 2019-10-24T09:57:14Z

Test build #112594 has finished for PR 26190 at commit 48b7ef4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-10-24T13:38:49Z

I'm generally sympathetic with keeping it simple and functional here, as I also don't know if the perf difference will matter much as used in practice.

dongjoon-hyun · 2019-10-24T16:15:26Z

+1, LGTM. Merged to master.

…valUtils

### What changes were proposed in this pull request?
In the PR, I propose to move all static methods from the `CalendarInterval` class to the `IntervalUtils` object. All those methods are rewritten from Java to Scala.

### Why are the changes needed?
- For consistency with other helper methods. Such methods were placed to the helper object `IntervalUtils`, see #26190
- Taking into account that `CalendarInterval` will be fully exposed to users in the future (see #25022), it would be nice to clean it up by moving service methods to an internal object.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By moved tests from `CalendarIntervalSuite` to `IntervalUtilsSuite`
- By existing test suites

Closes #26261 from MaxGekk/refactoring-calendar-interval.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

yaooqinn · 2019-10-30T09:04:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala

+  final val MICROS_PER_MINUTE: Long =
+    DateTimeUtils.MILLIS_PER_MINUTE * DateTimeUtils.MICROS_PER_MILLIS
+  final val DAYS_PER_MONTH: Byte = 30
+  final val MICROS_PER_MONTH: Long = DAYS_PER_MONTH * DateTimeUtils.SECONDS_PER_DAY


MICROS_PER_MONTH is wrong here; it should be DAYS_PER_MONTH * DateTimeUtils.MICROS_PER_DAY. I will fix this in #26314.
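
The size of the error is easy to check with the arithmetic written out. The Java sketch below redefines the constants locally, so the values of `SECONDS_PER_DAY` and `MICROS_PER_DAY` are assumptions based on their names rather than copies of DateTimeUtils, and `BUGGY_MICROS_PER_MONTH` is a hypothetical name for the expression in the diff above.

```java
// Local, assumed definitions that mirror the constant names in the diff.
final class IntervalConstantsSketch {
    static final long MICROS_PER_SECOND = 1_000_000L;
    static final long SECONDS_PER_DAY = 24L * 60L * 60L;                     // 86,400
    static final long MICROS_PER_DAY = SECONDS_PER_DAY * MICROS_PER_SECOND; // 86,400,000,000
    static final byte DAYS_PER_MONTH = 30;

    // Expression from the diff: this is the number of seconds in 30 days, not microseconds.
    static final long BUGGY_MICROS_PER_MONTH = DAYS_PER_MONTH * SECONDS_PER_DAY;

    // Proposed fix: 30 days expressed in microseconds.
    static final long MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY;
}
```

With these definitions the buggy value is 2,592,000 and the corrected value is 2,592,000,000,000, so the original constant is too small by a factor of one million.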

cloud-fan force-pushed the parser branch 2 times, most recently from ed4a22c to 92112b5 Compare October 21, 2019 09:07

MaxGekk reviewed Oct 21, 2019

cloud-fan force-pushed the parser branch from 92112b5 to 92ad086 Compare October 21, 2019 10:16

cloud-fan force-pushed the parser branch from 5b4dad1 to 863670e Compare October 21, 2019 11:13

cloud-fan mentioned this pull request Oct 21, 2019

[SPARK-29524][SQL] Support unordered interval units in casting from strings #26180

Closed

cloud-fan force-pushed the parser branch from 863670e to b3434ec Compare October 21, 2019 12:08

srowen reviewed Oct 21, 2019

MaxGekk mentioned this pull request Oct 21, 2019

[SPARK-29533][SQL][TEST] Benchmark casting strings to intervals #26189

Closed

cloud-fan added 3 commits October 22, 2019 21:20

simplify interval string parsing

c8d81e3

address comments

87417d9

fix test

a5f355f

cloud-fan force-pushed the parser branch from b3434ec to a5f355f Compare October 22, 2019 14:42

dongjoon-hyun reviewed Oct 23, 2019

dongjoon-hyun mentioned this pull request Oct 23, 2019

[SPARK-29533][SQL][TESTS][FOLLOWUP] Regenerate the result on EC2 #26233

Closed

dongjoon-hyun and others added 3 commits October 24, 2019 13:20

Regenerated result on EC2 (#14)

f759fd5

* Add JDK11 * Add JDK8

Merge remote-tracking branch 'origin/master' into parser

77f69a1

more tests

48b7ef4

dongjoon-hyun changed the title ~~[SPARK-29532][SQL] simplify interval string parsing~~ [SPARK-29532][SQL] Simplify interval string parsing Oct 24, 2019

dongjoon-hyun approved these changes Oct 24, 2019

dongjoon-hyun closed this in cdea520 Oct 24, 2019

MaxGekk mentioned this pull request Oct 25, 2019

[SPARK-29607][SQL] Move static methods from CalendarInterval to IntervalUtils #26261

Closed

yaooqinn reviewed Oct 30, 2019
