-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-29532][SQL] Simplify interval string parsing #26190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Show resolved
Hide resolved
|
Test build #112376 has finished for PR 26190 at commit
|
ed4a22c to
92112b5
Compare
|
Test build #112377 has finished for PR 26190 at commit
|
|
Test build #112379 has finished for PR 26190 at commit
|
| val intervals = ctx.intervalField.asScala.map(visitIntervalField) | ||
| validate(intervals.nonEmpty, | ||
| "at least one time unit should be given for interval literal", ctx) | ||
| intervals.reduce(_.add(_)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You create an instance of CalendarInterval per each interval units, and always have to summarize months and microseconds (and maybe days in the future), and create new instance per each add()?
Let's benchmark this and see how it is fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how it was done in the parser. It's indeed different from CalendarInterval.fromCaseInsensitiveString. Let me see how to follow CalendarInterval and make the parser more efficient.
| import ctx._ | ||
| val s = value.getText | ||
| val s = if (value.STRING() != null) { | ||
| string(value.STRING()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to strip the ' in the string, so that we don't need to deal with in the regex from CalendarInterval.
| checkAnswer(sql("SELECT a.`c.b`, `b.$q`[0].`a@!.q`, `q.w`.`w.i&`[0] FROM t"), Row(1, 1, 1)) | ||
| } | ||
|
|
||
| test("Convert hive interval term into Literal of CalendarIntervalType") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests are moved to literal.sql
|
Test build #112382 has finished for PR 26190 at commit
|
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
Show resolved
Hide resolved
| CalendarInterval.fromSingleUnitString(u.substring(0, u.length - 1), s) | ||
| case (u, None) => | ||
| CalendarInterval.fromSingleUnitString(u, s) | ||
| CalendarInterval.fromUnitString(Array(normalizeInternalUnit(u)), Array(s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can improve this later, by separating the parser rules for 1 year 10 months and '1-10' year to month.
|
@MaxGekk I don't think performance is a strong reason to move parsing logic from antlr to hand-written jave code. It's better to centralize the parsing stuff in antlr. |
|
Test build #112385 has finished for PR 26190 at commit
|
|
Test build #112384 has finished for PR 26190 at commit
|
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java
Show resolved
Hide resolved
|
Test build #112387 has finished for PR 26190 at commit
|
I would agree with you till Spark support bulk loading of interval strings. Also I think it doesn't matter for users how do you implement parsing. If two implementation are functionally equivalent, users would prefer the fastest one, I think. |
|
I agree with you that performance may matter if we need to support bulk loading of interval strings. We can propose a faster version of interval parsing at that time (also benchmark). For now I don't think it's worth to keep 2 versions of interval parsing: one in antlr and one in hand-written java. |
|
@cloud-fan Would you mind to run the benchmark (6ffec5e) before and after your changes. |
|
Oh, got it, @cloud-fan . |
|
BTW, @cloud-fan . I updated the PR description because the UTs are changed in this PR. |
| @@ -1,25 +1,25 @@ | |||
| Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.15 | |||
| Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, the original benchmark result is on MacOS.
| 8 units w/ interval 11787 11992 339 0.1 11786.7 0.0X | ||
| 8 units w/o interval 11666 11720 56 0.1 11665.8 0.0X | ||
| 9 units w/ interval 12878 12908 42 0.1 12877.7 0.0X | ||
| 9 units w/o interval 12696 12738 36 0.1 12696.1 0.0X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, is this the 6x regression, 2212 -> 12738?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe Parser approach is a right direction for code maintainability. Just cc @gatorsmile since this seems inevitable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment is hidden because the code changes: #26190 (comment)
Let me copy it again.
It's about 6 times slower, but perf doesn't matter too much here. It's better to keep a single parser, and make the behavior consistent whenever we parse an interval string. For example
select interval +1 dayworks butselect interval '+1 day'does not.select interval 1 day 1 yearworks (fields can be any order) butselect interval '1 day 1 year'does not.
In general, the handwritten parser is more efficient but is less powerful. I think it's fragile to maintain the handwritten parser and make sure it's consistent with the antlr one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
|
Test build #112552 has finished for PR 26190 at commit
|
|
Hi, @cloud-fan . I made a PR to you. After merging, you can compare with the result on master (I also regenerated the master result.) However, the result looks really bad. It seems that we need to rethink this approach. |
|
The last argument is about performance: #26190 (comment) My point is that: The regex-based java version is faster but it's not as functional as the antlr parser as I pointed out in the above comment. Now we are in the early stage of exposing the interval type, and we should focus more on the functionality instead of performance. For the |
|
Test build #112584 has finished for PR 26190 at commit
|
|
Test build #112587 has finished for PR 26190 at commit
|
|
Test build #112586 has finished for PR 26190 at commit
|
|
retest this please |
|
Test build #112594 has finished for PR 26190 at commit
|
|
I'm generally sympathetic with keeping it simple and functional here, as I also don't know if the perf difference will matter much as used in practice. |
|
+1, LGTM. Merged to master. |
…valUtils ### What changes were proposed in this pull request? In the PR, I propose to move all static methods from the `CalendarInterval` class to the `IntervalUtils` object. All those methods are rewritten from Java to Scala. ### Why are the changes needed? - For consistency with other helper methods. Such methods were placed to the helper object `IntervalUtils`, see #26190 - Taking into account that `CalendarInterval` will be fully exposed to users in the future (see #25022), it would be nice to clean it up by moving service methods to an internal object. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - By moved tests from `CalendarIntervalSuite` to `IntervalUtilsSuite` - By existing test suites Closes #26261 from MaxGekk/refactoring-calendar-interval. Authored-by: Maxim Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
| final val MICROS_PER_MINUTE: Long = | ||
| DateTimeUtils.MILLIS_PER_MINUTE * DateTimeUtils.MICROS_PER_MILLIS | ||
| final val DAYS_PER_MONTH: Byte = 30 | ||
| final val MICROS_PER_MONTH: Long = DAYS_PER_MONTH * DateTimeUtils.SECONDS_PER_DAY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MICROS_PER_MONTH is wrong here, which should be DAYS_PER_MONTH * DateTimeUtils. MICROS_PER_DAY I will fix this in #26314
What changes were proposed in this pull request?
Only use antlr4 to parse the interval string, and remove the duplicated parsing logic from
CalendarInterval.Why are the changes needed?
Simplify the code and fix inconsistent behaviors.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Pass the Jenkins with the updated test cases.