Commit 3493162

xuanyuanking authored and cloud-fan committed
[SPARK-31030][SQL] Backward Compatibility for Parsing and formatting Datetime
### What changes were proposed in this pull request?

In Spark version 2.4 and earlier, datetime parsing, formatting and conversion are performed using the hybrid calendar (Julian + Gregorian). Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as well as the one chosen in the ANSI SQL standard, Spark 3.0 switches to it by using Java 8 API classes (the java.time packages, which are based on ISO chronology). The switch was completed in SPARK-26651. After the switch, however, some patterns are not compatible between Java 8 and Java 7, so Spark needs its own definition of the patterns rather than depending on the Java API. In this PR, we achieve this by documenting the patterns and shadowing the incompatible letters. See more details in [SPARK-31030](https://issues.apache.org/jira/browse/SPARK-31030).

### Why are the changes needed?

For backward compatibility.

### Does this PR introduce any user-facing change?

No. After we define our own datetime parsing and formatting patterns, the behavior is the same as in old Spark versions.

### How was this patch tested?

Existing and newly added UTs.

Local document test:

![image](https://user-images.githubusercontent.com/4833765/76064100-f6acc280-5fc3-11ea-9ef7-82e7dc074205.png)

Closes apache#27830 from xuanyuanking/SPARK-31030.

Authored-by: Yuanjian Li
Signed-off-by: Wenchen Fan
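
The difference between the two calendars mentioned above can be seen without Spark. The following is a minimal sketch, not part of this commit: it relies only on the fact that Python's `datetime.date` uses the proleptic Gregorian calendar. In the hybrid calendar, 1582-10-04 is followed directly by 1582-10-15, whereas the proleptic Gregorian calendar has no such gap. The variable names are illustrative.

```python
from datetime import date

# Python's datetime.date is proleptic Gregorian: Gregorian rules are
# extended backwards, so no days are skipped in October 1582.
last_julian_label = date(1582, 10, 4)
first_gregorian_label = date(1582, 10, 15)

# In the hybrid (Julian + Gregorian) calendar these two labels are
# consecutive days; in the proleptic Gregorian calendar they are 11 days apart.
gap_days = (first_gregorian_label - last_julian_label).days

# A label that does not exist in the hybrid calendar is valid here.
skipped_label = date(1582, 10, 10)

print(gap_days, skipped_label.isoformat())
```

Because old dates map to different labels under the two calendars, values written by Spark 2.4 and read by Spark 3.0 can differ for dates before the 1582 cutover.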
1 parent d5f5720 commit 3493162

19 files changed: 341 additions and 66 deletions

R/pkg/R/functions.R

Lines changed: 2 additions & 2 deletions
@@ -2817,7 +2817,7 @@ setMethod("format_string", signature(format = "character", x = "Column"),
 #' head(tmp)}
 #' @note from_unixtime since 1.5.0
 setMethod("from_unixtime", signature(x = "Column"),
-          function(x, format = "uuuu-MM-dd HH:mm:ss") {
+          function(x, format = "yyyy-MM-dd HH:mm:ss") {
             jc <- callJStatic("org.apache.spark.sql.functions",
                               "from_unixtime",
                               x@jc, format)
@@ -3103,7 +3103,7 @@ setMethod("unix_timestamp", signature(x = "Column", format = "missing"),
 #' @aliases unix_timestamp,Column,character-method
 #' @note unix_timestamp(Column, character) since 1.5.0
 setMethod("unix_timestamp", signature(x = "Column", format = "character"),
-          function(x, format = "uuuu-MM-dd HH:mm:ss") {
+          function(x, format = "yyyy-MM-dd HH:mm:ss") {
             jc <- callJStatic("org.apache.spark.sql.functions", "unix_timestamp", x@jc, format)
             column(jc)
           })

docs/_data/menu-sql.yaml

Lines changed: 2 additions & 0 deletions
@@ -223,3 +223,5 @@
   url: sql-ref-syntax-aux-resource-mgmt-list-file.html
 - text: LIST JAR
   url: sql-ref-syntax-aux-resource-mgmt-list-jar.html
+- text: Datetime Pattern
+  url: sql-ref-datetime-pattern.html

docs/sql-ref-datetime-pattern.md

Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@
+---
+layout: global
+title: Datetime patterns
+displayTitle: Datetime Patterns for Formatting and Parsing
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements. See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+There are several common scenarios for datetime usage in Spark:
+
+- CSV/JSON datasources use the pattern string for parsing and formatting time content.
+
+- Datetime functions related to convert string to/from `DateType` or `TimestampType`. For example, unix_timestamp, date_format, to_unix_timestamp, from_unixtime, to_date, to_timestamp, from_utc_timestamp, to_utc_timestamp, etc.
+
+Spark uses the below letters in date and timestamp parsing and formatting:
+<table class="table">
+<tr>
+  <th> <b>Symbol</b> </th>
+  <th> <b>Meaning</b> </th>
+  <th> <b>Presentation</b> </th>
+  <th> <b>Examples</b> </th>
+</tr>
+<tr>
+  <td> <b>G</b> </td>
+  <td> era </td>
+  <td> text </td>
+  <td> AD; Anno Domini; A </td>
+</tr>
+<tr>
+  <td> <b>y</b> </td>
+  <td> year </td>
+  <td> year </td>
+  <td> 2020; 20 </td>
+</tr>
+<tr>
+  <td> <b>D</b> </td>
+  <td> day-of-year </td>
+  <td> number </td>
+  <td> 189 </td>
+</tr>
+<tr>
+  <td> <b>M</b> </td>
+  <td> month-of-year </td>
+  <td> number/text </td>
+  <td> 7; 07; Jul; July; J </td>
+</tr>
+<tr>
+  <td> <b>d</b> </td>
+  <td> day-of-month </td>
+  <td> number </td>
+  <td> 28 </td>
+</tr>
+<tr>
+  <td> <b>Y</b> </td>
+  <td> week-based-year </td>
+  <td> year </td>
+  <td> 1996; 96 </td>
+</tr>
+<tr>
+  <td> <b>w</b> </td>
+  <td> week-of-week-based-year </td>
+  <td> number </td>
+  <td> 27 </td>
+</tr>
+<tr>
+  <td> <b>W</b> </td>
+  <td> week-of-month </td>
+  <td> number </td>
+  <td> 4 </td>
+</tr>
+<tr>
+  <td> <b>E</b> </td>
+  <td> day-of-week </td>
+  <td> text </td>
+  <td> Tue; Tuesday; T </td>
+</tr>
+<tr>
+  <td> <b>e</b> </td>
+  <td> localized day-of-week </td>
+  <td> number/text </td>
+  <td> 2; 02; Tue; Tuesday; T </td>
+</tr>
+<tr>
+  <td> <b>F</b> </td>
+  <td> week-of-month </td>
+  <td> number </td>
+  <td> 3 </td>
+</tr>
+<tr>
+  <td> <b>a</b> </td>
+  <td> am-pm-of-day </td>
+  <td> text </td>
+  <td> PM </td>
+</tr>
+<tr>
+  <td> <b>h</b> </td>
+  <td> clock-hour-of-am-pm (1-12) </td>
+  <td> number </td>
+  <td> 12 </td>
+</tr>
+<tr>
+  <td> <b>K</b> </td>
+  <td> hour-of-am-pm (0-11) </td>
+  <td> number </td>
+  <td> 0 </td>
+</tr>
+<tr>
+  <td> <b>k</b> </td>
+  <td> clock-hour-of-day (1-24) </td>
+  <td> number </td>
+  <td> 0 </td>
+</tr>
+<tr>
+  <td> <b>H</b> </td>
+  <td> hour-of-day (0-23) </td>
+  <td> number </td>
+  <td> 0 </td>
+</tr>
+<tr>
+  <td> <b>m</b> </td>
+  <td> minute-of-hour </td>
+  <td> number </td>
+  <td> 30 </td>
+</tr>
+<tr>
+  <td> <b>s</b> </td>
+  <td> second-of-minute </td>
+  <td> number </td>
+  <td> 55 </td>
+</tr>
+<tr>
+  <td> <b>S</b> </td>
+  <td> fraction-of-second </td>
+  <td> fraction </td>
+  <td> 978 </td>
+</tr>
+<tr>
+  <td> <b>z</b> </td>
+  <td> time-zone name </td>
+  <td> zone-name </td>
+  <td> Pacific Standard Time; PST </td>
+</tr>
+<tr>
+  <td> <b>O</b> </td>
+  <td> localized zone-offset </td>
+  <td> offset-O </td>
+  <td> GMT+8; GMT+08:00; UTC-08:00; </td>
+</tr>
+<tr>
+  <td> <b>X</b> </td>
+  <td> zone-offset 'Z' for zero </td>
+  <td> offset-X </td>
+  <td> Z; -08; -0830; -08:30; -083015; -08:30:15; </td>
+</tr>
+<tr>
+  <td> <b>x</b> </td>
+  <td> zone-offset </td>
+  <td> offset-x </td>
+  <td> +0000; -08; -0830; -08:30; -083015; -08:30:15; </td>
+</tr>
+<tr>
+  <td> <b>Z</b> </td>
+  <td> zone-offset </td>
+  <td> offset-Z </td>
+  <td> +0000; -0800; -08:00; </td>
+</tr>
+<tr>
+  <td> <b>'</b> </td>
+  <td> escape for text </td>
+  <td> delimiter </td>
+  <td></td>
+</tr>
+<tr>
+  <td> <b>''</b> </td>
+  <td> single quote </td>
+  <td> literal </td>
+  <td> ' </td>
+</tr>
+</table>
+
+The count of pattern letters determines the format.
+
+- Text: The text style is determined based on the number of pattern letters used. Less than 4 pattern letters will use the short form. Exactly 4 pattern letters will use the full form. Exactly 5 pattern letters will use the narrow form. Six or more letters will fail.
+
+- Number: If the count of letters is one, then the value is output using the minimum number of digits and without padding. Otherwise, the count of digits is used as the width of the output field, with the value zero-padded as necessary. The following pattern letters have constraints on the count of letters. Only one letter 'F' can be specified. Up to two letters of 'd', 'H', 'h', 'K', 'k', 'm', and 's' can be specified. Up to three letters of 'D' can be specified.
+
+- Number/Text: If the count of pattern letters is 3 or greater, use the Text rules above. Otherwise use the Number rules above.
+
+- Fraction: Outputs the micro-of-second field as a fraction-of-second. The micro-of-second value has six digits, thus the count of pattern letters is from 1 to 6. If it is less than 6, then the micro-of-second value is truncated, with only the most significant digits being output.
+
+- Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years. Otherwise, the sign is output if the pad width is exceeded when 'G' is not present.
+
+- Zone names: This outputs the display name of the time-zone ID. If the count of letters is one, two or three, then the short name is output. If the count of letters is four, then the full name is output. Five or more letters will fail.
+
+- Offset X and x: This formats the offset based on the number of pattern letters. One letter outputs just the hour, such as '+01', unless the minute is non-zero in which case the minute is also output, such as '+0130'. Two letters outputs the hour and minute, without a colon, such as '+0130'. Three letters outputs the hour and minute, with a colon, such as '+01:30'. Four letters outputs the hour and minute and optional second, without a colon, such as '+013015'. Five letters outputs the hour and minute and optional second, with a colon, such as '+01:30:15'. Six or more letters will fail. Pattern letter 'X' (upper case) will output 'Z' when the offset to be output would be zero, whereas pattern letter 'x' (lower case) will output '+00', '+0000', or '+00:00'.
+
+- Offset O: This formats the localized offset based on the number of pattern letters. One letter outputs the short form of the localized offset, which is localized offset text, such as 'GMT', with hour without leading zero, optional 2-digit minute and second if non-zero, and colon, for example 'GMT+8'. Four letters outputs the full form, which is localized offset text, such as 'GMT, with 2-digit hour and minute field, optional second field if non-zero, and colon, for example 'GMT+08:00'. Any other count of letters will fail.
+
+- Offset Z: This formats the offset based on the number of pattern letters. One, two or three letters outputs the hour and minute, without a colon, such as '+0130'. The output will be '+0000' when the offset is zero. Four letters outputs the full form of localized offset, equivalent to four letters of Offset-O. The output will be the corresponding localized offset text if the offset is zero. Five letters outputs the hour, minute, with optional second if non-zero, with colon. It outputs 'Z' if the offset is zero. Six or more letters will fail.
+
+More details for the text style:
+
+- Short Form: Short text, typically an abbreviation. For example, day-of-week Monday might output "Mon".
+
+- Full Form: Full text, typically the full description. For example, day-of-week Monday might output "Monday".
+
+- Narrow Form: Narrow text, typically a single letter. For example, day-of-week Monday might output "M".
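
The letter-count rules in the new document are easiest to check against a small model. The sketch below re-implements two of them in plain Python, "Offset X and x" and "Fraction", exactly as the text above states them. It is an illustration only, not Spark's or `java.time`'s implementation, and the function names `format_offset` and `format_fraction` are hypothetical.

```python
def format_offset(total_seconds: int, count: int, upper: bool = False) -> str:
    """Format a zone offset following the 'X' (upper) / 'x' (lower) rules."""
    if not 1 <= count <= 5:
        raise ValueError("six or more letters will fail")
    # 'X' prints 'Z' for a zero offset; 'x' prints '+00', '+0000' or '+00:00'.
    if upper and total_seconds == 0:
        return "Z"
    sign = "-" if total_seconds < 0 else "+"
    hours, remainder = divmod(abs(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    # Three and five letters use a colon separator; one, two and four do not.
    colon = ":" if count in (3, 5) else ""
    text = f"{sign}{hours:02d}"
    # One letter prints the minute only when it is non-zero.
    if count > 1 or minutes:
        text += f"{colon}{minutes:02d}"
    # Four and five letters add the second only when it is non-zero.
    if count >= 4 and seconds:
        text += f"{colon}{seconds:02d}"
    return text


def format_fraction(micros: int, count: int) -> str:
    """Truncate a six-digit micro-of-second value to `count` leading digits."""
    if not 1 <= count <= 6:
        raise ValueError("the count of 'S' letters must be from 1 to 6")
    return f"{micros:06d}"[:count]


print(format_offset(5400, 1), format_offset(5400, 3), format_fraction(978000, 3))
```

With an offset of one hour and thirty minutes, one letter gives `+0130` and three letters give `+01:30`, matching the examples in the rule. `format_fraction(978000, 3)` keeps only the three most significant digits, as in the `S` row of the table.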

python/pyspark/sql/functions.py

Lines changed: 3 additions & 3 deletions
@@ -1249,7 +1249,7 @@ def last_day(date):

 @ignore_unicode_prefix
 @since(1.5)
-def from_unixtime(timestamp, format="uuuu-MM-dd HH:mm:ss"):
+def from_unixtime(timestamp, format="yyyy-MM-dd HH:mm:ss"):
     """
     Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string
     representing the timestamp of that moment in the current system time zone in the given
@@ -1266,9 +1266,9 @@ def from_unixtime(timestamp, format="uuuu-MM-dd HH:mm:ss"):


 @since(1.5)
-def unix_timestamp(timestamp=None, format='uuuu-MM-dd HH:mm:ss'):
+def unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss'):
     """
-    Convert time string with given pattern ('uuuu-MM-dd HH:mm:ss', by default)
+    Convert time string with given pattern ('yyyy-MM-dd HH:mm:ss', by default)
     to Unix time stamp (in seconds), using the default timezone and the default
     locale, return null if fail.
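
The behavior described in these two docstrings can be modeled with the Python standard library. This is a rough sketch under stated assumptions, not PySpark code: the helper names are hypothetical, the time zone is fixed to UTC rather than the session or system zone, and `%Y-%m-%d %H:%M:%S` is used as a strftime stand-in for the Spark pattern `yyyy-MM-dd HH:mm:ss` (Spark patterns are not strftime directives).

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed strftime analogue of the Spark default pattern "yyyy-MM-dd HH:mm:ss".
DEFAULT_FORMAT = "%Y-%m-%d %H:%M:%S"


def from_unixtime_utc(seconds: int, fmt: str = DEFAULT_FORMAT) -> str:
    """Render seconds since the Unix epoch as a string, fixed to UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)


def unix_timestamp_utc(text: str, fmt: str = DEFAULT_FORMAT) -> Optional[int]:
    """Parse a time string to epoch seconds; None models Spark's null on failure."""
    try:
        parsed = datetime.strptime(text, fmt)
    except ValueError:
        return None
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


print(from_unixtime_utc(0))
print(unix_timestamp_utc("1970-01-02 00:00:00"))
print(unix_timestamp_utc("not a timestamp"))
```

The two helpers are inverses for well-formed input, and a string that does not match the pattern yields `None` instead of raising, mirroring the "return null if fail" wording in the docstring.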

python/pyspark/sql/readwriter.py

Lines changed: 8 additions & 8 deletions
@@ -223,12 +223,12 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
         :param dateFormat: sets the string that indicates a date format. Custom date formats
                            follow the formats at ``java.time.format.DateTimeFormatter``. This
                            applies to date type. If None is set, it uses the
-                           default value, ``uuuu-MM-dd``.
+                           default value, ``yyyy-MM-dd``.
         :param timestampFormat: sets the string that indicates a timestamp format.
                                 Custom date formats follow the formats at
                                 ``java.time.format.DateTimeFormatter``.
                                 This applies to timestamp type. If None is set, it uses the
-                                default value, ``uuuu-MM-dd'T'HH:mm:ss.SSSXXX``.
+                                default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
         :param multiLine: parse one record, which may span multiple lines, per file. If None is
                           set, it uses the default value, ``false``.
         :param allowUnquotedControlChars: allows JSON Strings to contain unquoted control
@@ -432,12 +432,12 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
         :param dateFormat: sets the string that indicates a date format. Custom date formats
                            follow the formats at ``java.time.format.DateTimeFormatter``. This
                            applies to date type. If None is set, it uses the
-                           default value, ``uuuu-MM-dd``.
+                           default value, ``yyyy-MM-dd``.
         :param timestampFormat: sets the string that indicates a timestamp format.
                                 Custom date formats follow the formats at
                                 ``java.time.format.DateTimeFormatter``.
                                 This applies to timestamp type. If None is set, it uses the
-                                default value, ``uuuu-MM-dd'T'HH:mm:ss.SSSXXX``.
+                                default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
         :param maxColumns: defines a hard limit of how many columns a record can have. If None is
                            set, it uses the default value, ``20480``.
         :param maxCharsPerColumn: defines the maximum number of characters allowed for any given
@@ -852,12 +852,12 @@ def json(self, path, mode=None, compression=None, dateFormat=None, timestampForm
         :param dateFormat: sets the string that indicates a date format. Custom date formats
                            follow the formats at ``java.time.format.DateTimeFormatter``. This
                            applies to date type. If None is set, it uses the
-                           default value, ``uuuu-MM-dd``.
+                           default value, ``yyyy-MM-dd``.
         :param timestampFormat: sets the string that indicates a timestamp format.
                                 Custom date formats follow the formats at
                                 ``java.time.format.DateTimeFormatter``.
                                 This applies to timestamp type. If None is set, it uses the
-                                default value, ``uuuu-MM-dd'T'HH:mm:ss.SSSXXX``.
+                                default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
         :param encoding: specifies encoding (charset) of saved json files. If None is set,
                          the default UTF-8 charset will be used.
         :param lineSep: defines the line separator that should be used for writing. If None is
@@ -957,12 +957,12 @@ def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=No
         :param dateFormat: sets the string that indicates a date format. Custom date formats
                            follow the formats at ``java.time.format.DateTimeFormatter``. This
                            applies to date type. If None is set, it uses the
-                           default value, ``uuuu-MM-dd``.
+                           default value, ``yyyy-MM-dd``.
         :param timestampFormat: sets the string that indicates a timestamp format.
                                 Custom date formats follow the formats at
                                 ``java.time.format.DateTimeFormatter``.
                                 This applies to timestamp type. If None is set, it uses the
-                                default value, ``uuuu-MM-dd'T'HH:mm:ss.SSSXXX``.
+                                default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
         :param ignoreLeadingWhiteSpace: a flag indicating whether or not leading whitespaces from
                                         values being written should be skipped. If None is set, it
                                         uses the default value, ``true``.

python/pyspark/sql/streaming.py

Lines changed: 4 additions & 4 deletions
@@ -461,12 +461,12 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
         :param dateFormat: sets the string that indicates a date format. Custom date formats
                            follow the formats at ``java.time.format.DateTimeFormatter``. This
                            applies to date type. If None is set, it uses the
-                           default value, ``uuuu-MM-dd``.
+                           default value, ``yyyy-MM-dd``.
         :param timestampFormat: sets the string that indicates a timestamp format.
                                 Custom date formats follow the formats at
                                 ``java.time.format.DateTimeFormatter``.
                                 This applies to timestamp type. If None is set, it uses the
-                                default value, ``uuuu-MM-dd'T'HH:mm:ss.SSSXXX``.
+                                default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
         :param multiLine: parse one record, which may span multiple lines, per file. If None is
                           set, it uses the default value, ``false``.
         :param allowUnquotedControlChars: allows JSON Strings to contain unquoted control
@@ -673,12 +673,12 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
         :param dateFormat: sets the string that indicates a date format. Custom date formats
                            follow the formats at ``java.time.format.DateTimeFormatter``. This
                            applies to date type. If None is set, it uses the
-                           default value, ``uuuu-MM-dd``.
+                           default value, ``yyyy-MM-dd``.
         :param timestampFormat: sets the string that indicates a timestamp format.
                                 Custom date formats follow the formats at
                                 ``java.time.format.DateTimeFormatter``.
                                 This applies to timestamp type. If None is set, it uses the
-                                default value, ``uuuu-MM-dd'T'HH:mm:ss.SSSXXX``.
+                                default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
         :param maxColumns: defines a hard limit of how many columns a record can have. If None is
                            set, it uses the default value, ``20480``.
         :param maxCharsPerColumn: defines the maximum number of characters allowed for any given

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

Lines changed: 1 addition & 1 deletion
@@ -522,7 +522,7 @@ object CatalogColumnStat extends Logging {
   val VERSION = 2

   private def getTimestampFormatter(): TimestampFormatter = {
-    TimestampFormatter(format = "uuuu-MM-dd HH:mm:ss.SSSSSS", zoneId = ZoneOffset.UTC)
+    TimestampFormatter(format = "yyyy-MM-dd HH:mm:ss.SSSSSS", zoneId = ZoneOffset.UTC)
   }

   /**

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala

Lines changed: 3 additions & 3 deletions
@@ -673,7 +673,7 @@ case class DateFormatClass(left: Expression, right: Expression, timeZoneId: Opti
     Arguments:
       * timeExp - A date/timestamp or string which is returned as a UNIX timestamp.
       * format - Date/time format pattern to follow. Ignored if `timeExp` is not a string.
-                 Default value is "uuuu-MM-dd HH:mm:ss". See `java.time.format.DateTimeFormatter`
+                 Default value is "yyyy-MM-dd HH:mm:ss". See `java.time.format.DateTimeFormatter`
                  for valid date and time format patterns.
   """,
   examples = """
@@ -707,7 +707,7 @@ case class ToUnixTimestamp(
  * Converts time string with given pattern to Unix time stamp (in seconds), returns null if fail.
  * See [https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html].
  * Note that hive Language Manual says it returns 0 if fail, but in fact it returns null.
- * If the second parameter is missing, use "uuuu-MM-dd HH:mm:ss".
+ * If the second parameter is missing, use "yyyy-MM-dd HH:mm:ss".
  * If no parameters provided, the first parameter will be current_timestamp.
  * If the first parameter is a Date or Timestamp instead of String, we will ignore the
  * second parameter.
@@ -718,7 +718,7 @@ case class ToUnixTimestamp(
     Arguments:
       * timeExp - A date/timestamp or string. If not provided, this defaults to current time.
       * format - Date/time format pattern to follow. Ignored if `timeExp` is not a string.
-                 Default value is "uuuu-MM-dd HH:mm:ss". See `java.time.format.DateTimeFormatter`
+                 Default value is "yyyy-MM-dd HH:mm:ss". See `java.time.format.DateTimeFormatter`
                  for valid date and time format patterns.
   """,
   examples = """
