@cloud-fan
Contributor

@cloud-fan cloud-fan commented Oct 21, 2019

What changes were proposed in this pull request?

Only use antlr4 to parse the interval string, and remove the duplicated parsing logic from CalendarInterval.

Why are the changes needed?

Simplify the code and fix inconsistent behaviors.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass the Jenkins with the updated test cases.

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112376 has finished for PR 26190 at commit c13d2fb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan force-pushed the parser branch 2 times, most recently from ed4a22c to 92112b5 on October 21, 2019 09:07
@SparkQA

SparkQA commented Oct 21, 2019

Test build #112377 has finished for PR 26190 at commit ed4a22c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112379 has finished for PR 26190 at commit 92112b5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member

MaxGekk commented Oct 21, 2019

Let's use the benchmark from #26189 to compare your implementation, the current one, and the one proposed in #26180.

val intervals = ctx.intervalField.asScala.map(visitIntervalField)
validate(intervals.nonEmpty,
"at least one time unit should be given for interval literal", ctx)
intervals.reduce(_.add(_))
Member

You create an instance of CalendarInterval for each interval unit, always have to sum months and microseconds (and maybe days in the future), and create a new instance on each add()?

Let's benchmark this and see how fast it is.
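
To make the allocation concern concrete, here is a minimal sketch. `Interval` is a hypothetical stand-in for CalendarInterval (months plus microseconds only), not the Spark class, and `IntervalSum` and its methods are illustrative names. It contrasts summing one object per unit with accumulating into two counters and allocating once.

```scala
// Hypothetical stand-in for CalendarInterval: immutable, months + microseconds.
final case class Interval(months: Int, microseconds: Long) {
  // Each add allocates a new Interval, as an immutable value class would.
  def add(that: Interval): Interval =
    Interval(Math.addExact(months, that.months),
      Math.addExact(microseconds, that.microseconds))
}

object IntervalSum {
  // One object per unit plus pairwise add:
  // n unit objects and n - 1 intermediate results for n units.
  def viaReduce(units: Seq[Interval]): Interval = {
    require(units.nonEmpty, "at least one time unit should be given")
    units.reduce(_.add(_))
  }

  // Accumulate into two primitives and allocate a single result.
  def viaAccumulators(units: Seq[(Int, Long)]): Interval = {
    require(units.nonEmpty, "at least one time unit should be given")
    var months = 0
    var micros = 0L
    units.foreach { case (m, us) =>
      months = Math.addExact(months, m)
      micros = Math.addExact(micros, us)
    }
    Interval(months, micros)
  }
}
```

Both methods return the same value; whether the extra allocations matter in practice is what the requested benchmark would have to show.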

Contributor Author

This is how it was done in the parser. It's indeed different from CalendarInterval.fromCaseInsensitiveString. Let me see how to follow CalendarInterval and make the parser more efficient.

import ctx._
val s = value.getText
val s = if (value.STRING() != null) {
string(value.STRING())
Contributor Author

This is to strip the ' in the string, so that we don't need to deal with it in the regex from CalendarInterval.

checkAnswer(sql("SELECT a.`c.b`, `b.$q`[0].`a@!.q`, `q.w`.`w.i&`[0] FROM t"), Row(1, 1, 1))
}

test("Convert hive interval term into Literal of CalendarIntervalType") {
Contributor Author

Tests are moved to literal.sql.

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112382 has finished for PR 26190 at commit 92ad086.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

CalendarInterval.fromSingleUnitString(u.substring(0, u.length - 1), s)
case (u, None) =>
CalendarInterval.fromSingleUnitString(u, s)
CalendarInterval.fromUnitString(Array(normalizeInternalUnit(u)), Array(s))
Contributor Author

We can improve this later, by separating the parser rules for 1 year 10 months and '1-10' year to month.
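
As an illustration of why the two forms could get separate rules, a dedicated year-month routine can be very small. The sketch below is not Spark code: `YearMonthSketch` and `toMonths` are hypothetical names, and the 0 to 11 month range and the 8-digit year limit are assumptions made for this example.

```scala
object YearMonthSketch {
  // Optional sign, a year count, a dash, and a month count, e.g. "1-10" or "-2-3".
  // Years are capped at 8 digits so that years * 12 + months cannot overflow an Int.
  private val YearMonth = """([+-])?(\d{1,8})-(\d{1,2})""".r

  // Returns the total number of months, or None if the text does not have
  // the year-month shape or the month part is outside 0..11 (an assumption here).
  def toMonths(text: String): Option[Int] = text.trim match {
    case YearMonth(sign, y, m) if m.toInt <= 11 =>
      val total = y.toInt * 12 + m.toInt
      Some(if (sign == "-") -total else total)
    case _ => None
  }
}
```

A unit list such as 1 year 10 months does not match this shape at all, which is the kind of separation the comment above has in mind.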

@cloud-fan
Contributor Author

@MaxGekk I don't think performance is a strong reason to move parsing logic from antlr to hand-written Java code. It's better to centralize the parsing stuff in antlr.

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112385 has finished for PR 26190 at commit 863670e.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112384 has finished for PR 26190 at commit 5b4dad1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112387 has finished for PR 26190 at commit b3434ec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member

MaxGekk commented Oct 21, 2019

> @MaxGekk I don't think performance is a strong reason to move parsing logic from antlr to hand-written Java code. It's better to centralize the parsing stuff in antlr.

I would agree with you until Spark supports bulk loading of interval strings. Also, I think it doesn't matter to users how you implement parsing. If two implementations are functionally equivalent, users would prefer the faster one, I think.

@cloud-fan
Contributor Author

I agree with you that performance may matter if we need to support bulk loading of interval strings. We can propose a faster version of interval parsing at that time (and benchmark it). For now I don't think it's worth keeping 2 versions of interval parsing: one in antlr and one in hand-written Java.

@MaxGekk
Member

MaxGekk commented Oct 22, 2019

@cloud-fan Would you mind running the benchmark (6ffec5e) before and after your changes?

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 23, 2019

Oh, got it, @cloud-fan.
For the benchmark, I'll run and make a PR (with EC2 results) to you in a few hours.

@dongjoon-hyun
Member

BTW, @cloud-fan, I updated the PR description because the UTs are changed in this PR.

@@ -1,25 +1,25 @@
Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Member

Oh, the original benchmark result is on MacOS.

8 units w/ interval      11787   11992   339   0.1   11786.7   0.0X
8 units w/o interval     11666   11720    56   0.1   11665.8   0.0X
9 units w/ interval      12878   12908    42   0.1   12877.7   0.0X
9 units w/o interval     12696   12738    36   0.1   12696.1   0.0X
Member

So, is this the 6x regression, 2212 -> 12738?

Member

I believe the parser approach is the right direction for code maintainability. Just cc'ing @gatorsmile since this seems inevitable.

Contributor Author

My comment is hidden because the code changed: #26190 (comment)

Let me copy it again.

It's about 6 times slower, but perf doesn't matter too much here. It's better to keep a single parser, and make the behavior consistent whenever we parse an interval string. For example:

  1. select interval +1 day works but select interval '+1 day' does not.
  2. select interval 1 day 1 year works (fields can be any order) but select interval '1 day 1 year' does not.

In general, the handwritten parser is more efficient but is less powerful. I think it's fragile to maintain the handwritten parser and make sure it's consistent with the antlr one.
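
The consistency argument can be illustrated with a small sketch in which one routine handles both the literal form and the quoted-string form, so a leading sign and arbitrary unit order behave the same in either case. This is not the antlr grammar or Spark code: `UnitListSketch`, `parse`, and the unit table are hypothetical, only a subset of units is covered, and overflow is ignored.

```scala
import scala.util.Try

object UnitListSketch {
  private val MicrosPerSecond = 1000000L

  // Unit name -> (months per unit, microseconds per unit); plural "s" is accepted.
  private def unitSize(unit: String): Option[(Int, Long)] =
    unit.stripSuffix("s") match {
      case "year"   => Some((12, 0L))
      case "month"  => Some((1, 0L))
      case "day"    => Some((0, 86400L * MicrosPerSecond))
      case "hour"   => Some((0, 3600L * MicrosPerSecond))
      case "minute" => Some((0, 60L * MicrosPerSecond))
      case "second" => Some((0, MicrosPerSecond))
      case _        => None
    }

  // Parses "[interval] <signed int> <unit> ..." with units in any order.
  // Returns (months, microseconds), or None on any malformed input.
  def parse(text: String): Option[(Int, Long)] = {
    val tokens = text.trim.toLowerCase.split("\\s+").toList
    val body = if (tokens.headOption.contains("interval")) tokens.tail else tokens
    if (body.isEmpty || body.length % 2 != 0) None
    else body.grouped(2).foldLeft(Option((0, 0L))) {
      case (Some((months, micros)), List(value, unit)) =>
        for {
          v <- Try(value.toInt).toOption // accepts "+1" and "-1"
          size <- unitSize(unit)
        } yield (months + v * size._1, micros + v * size._2)
      case _ => None
    }
  }
}
```

With a single entry point like this, `parse("interval 1 day 1 year")` and `parse("1 year 1 day")` yield the same result, which is the behavior the comment argues for.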

Member

Got it.

@SparkQA

SparkQA commented Oct 23, 2019

Test build #112552 has finished for PR 26190 at commit 33ceedc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 24, 2019

Hi, @cloud-fan . I made a PR to you. After merging, you can compare with the result on master (I also regenerated the master result.)

However, the result looks really bad. It seems that we need to rethink this approach.

@cloud-fan
Contributor Author

The last argument is about performance: #26190 (comment)

My point is that the regex-based Java version is faster, but it's not as functional as the antlr parser, as I pointed out in the above comment. We are now in the early stage of exposing the interval type, and we should focus more on functionality than on performance. For the INTERVAL '...' literal syntax and for parsing watermark strings, performance doesn't really matter. For cast, performance matters, but it's better to have a UTF8String-based parser instead of a regex-based one.

@SparkQA

SparkQA commented Oct 24, 2019

Test build #112584 has finished for PR 26190 at commit f759fd5.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 24, 2019

Test build #112587 has finished for PR 26190 at commit 48b7ef4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 24, 2019

Test build #112586 has finished for PR 26190 at commit 77f69a1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateNamedStruct(children: Seq[Expression]) extends Expression
  • case class CreateNamespaceStatement(
  • case class ShowPartitionsStatement(
  • case class RefreshTableStatement(tableName: Seq[String]) extends ParsedStatement
  • case class CreateNamespace(
  • case class RefreshTable(
  • case class CreateNamespaceExec(
  • case class RefreshTableExec(

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 24, 2019

Test build #112594 has finished for PR 26190 at commit 48b7ef4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Oct 24, 2019

I'm generally sympathetic to keeping it simple and functional here, as I also don't know if the perf difference will matter much in practice.

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-29532][SQL] simplify interval string parsing" to "[SPARK-29532][SQL] Simplify interval string parsing" on Oct 24, 2019
@dongjoon-hyun
Member

+1, LGTM. Merged to master.

cloud-fan pushed a commit that referenced this pull request Oct 29, 2019
…valUtils

### What changes were proposed in this pull request?
In the PR, I propose to move all static methods from the `CalendarInterval` class to the `IntervalUtils` object. All those methods are rewritten from Java to Scala.

### Why are the changes needed?
- For consistency with other helper methods. Such methods were placed in the helper object `IntervalUtils`, see #26190
- Taking into account that `CalendarInterval` will be fully exposed to users in the future (see #25022), it would be nice to clean it up by moving service methods to an internal object.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By tests moved from `CalendarIntervalSuite` to `IntervalUtilsSuite`
- By existing test suites

Closes #26261 from MaxGekk/refactoring-calendar-interval.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
final val MICROS_PER_MINUTE: Long =
DateTimeUtils.MILLIS_PER_MINUTE * DateTimeUtils.MICROS_PER_MILLIS
final val DAYS_PER_MONTH: Byte = 30
final val MICROS_PER_MONTH: Long = DAYS_PER_MONTH * DateTimeUtils.SECONDS_PER_DAY
Member

MICROS_PER_MONTH is wrong here; it should be DAYS_PER_MONTH * DateTimeUtils.MICROS_PER_DAY. I will fix this in #26314.
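
The size of the error is easy to check by hand. The sketch below re-derives the constants locally instead of using `DateTimeUtils`; the names in `IntervalConstantsSketch` are illustrative, and the only assumption is the usual 86,400 seconds per day and 1,000,000 microseconds per second.

```scala
object IntervalConstantsSketch {
  val SecondsPerDay: Long = 24L * 60 * 60                  // 86,400
  val MicrosPerSecond: Long = 1000L * 1000                 // 1,000,000
  val MicrosPerDay: Long = SecondsPerDay * MicrosPerSecond // 86,400,000,000
  val DaysPerMonth: Byte = 30

  // What the expression in the diff computes: days times seconds per day.
  val asWritten: Long = DaysPerMonth * SecondsPerDay       // 2,592,000

  // What the review comment asks for: days times microseconds per day.
  val corrected: Long = DaysPerMonth * MicrosPerDay        // 2,592,000,000,000
}
```

The value as written is therefore too small by a factor of one million, the number of microseconds in a second.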
