[SPARK-24985][SQL][WIP] Fix OOM in Full Outer Join in case of data skew #22168
Conversation
Hi @tejasapatil, @viirya, @hvanhovell, & @kiszk, can you please review this pull request?
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
```scala
while (leftMatchesIterator.hasNext) {
  val leftCurRow = leftMatchesIterator.next()
  val rightMatchesIterator = rightMatches.generateIterator(rightIndex)
```
Can we keep the left and right iterators while scanning? If they are spilled, obtaining an iterator from the spilled data needs to loop over the spill writers and create readers, so we could avoid calling generateIterator every time we need the iterators. However, it might make the code a bit more complicated than it is now.
Hi @viirya,
Thank you for reviewing the code.
I agree the code would be a bit more complicated if we had to keep the iterators while scanning. In this particular case, I was following a pattern used throughout the rest of the class.
For example:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala
Lines 307 to 329 in 35f7f5c
```scala
override def advanceNext(): Boolean = {
  while (smjScanner.findNextOuterJoinRows()) {
    currentLeftRow = smjScanner.getStreamedRow
    val currentRightMatches = smjScanner.getBufferedMatches
    if (currentRightMatches == null || currentRightMatches.length == 0) {
      numOutputRows += 1
      return true
    }
    var found = false
    val rightMatchesIterator = currentRightMatches.generateIterator()
    while (!found && rightMatchesIterator.hasNext) {
      joinRow(currentLeftRow, rightMatchesIterator.next())
      if (boundCondition(joinRow)) {
        found = true
      }
    }
    if (!found) {
      numOutputRows += 1
      return true
    }
  }
  false
}
```
Having said that, I do feel the penalty for following a similar approach in the case of full outer joins could be higher. I will try and see what I can do.
Hi @viirya,
After some deliberation, I figure it would not be possible to avoid reinitialising the right iterator. Please share your thoughts on this.
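For reference, the pattern under discussion looks roughly like the sketch below. Variable names are illustrative, and only `generateIterator` is taken from the `ExternalAppendOnlyUnsafeRowArray` API already used elsewhere in this file, so this is not the exact PR code.
```scala
// Rough sketch of the buffered-match loop discussed above (illustrative names).
val leftMatchesIterator = leftMatches.generateIterator(leftIndex)
while (leftMatchesIterator.hasNext) {
  val leftCurRow = leftMatchesIterator.next()
  // The iterator over the buffered right matches is forward-only, so it is
  // re-generated for every buffered left row. When the matches have spilled to
  // disk, generateIterator has to walk the spill writers and open readers again,
  // which is the overhead discussed in this thread.
  val rightMatchesIterator = rightMatches.generateIterator(rightIndex)
  while (rightMatchesIterator.hasNext) {
    joinRow(leftCurRow, rightMatchesIterator.next())
    if (boundCondition(joinRow)) {
      // emit the joined row and mark both sides as matched
    }
  }
}
```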
## What changes were proposed in this pull request? In the PR, I propose to deprecate the `from_utc_timestamp()` and `to_utc_timestamp`, and disable them by default. The functions can be enabled back via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any calls of the functions throw an analysis exception. One of the reason for deprecation is functions violate semantic of `TimestampType` which is number of microseconds since epoch in UTC time zone. Shifting microseconds since epoch by time zone offset doesn't make sense because the result doesn't represent microseconds since epoch in UTC time zone any more, and cannot be considered as `TimestampType`. ## How was this patch tested? The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`. Closes apache#24195 from MaxGekk/conv-utc-timestamp-deprecate. Lead-authored-by: Maxim Gekk <[email protected]> Co-authored-by: Maxim Gekk <[email protected]> Co-authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…equent calls to withColumn ## What changes were proposed in this pull request? We have seen many cases when users make several subsequent calls to `withColumn` on a Dataset. This leads now to the generation of a lot of `Project` nodes on the top of the plan, with serious problem which can lead also to `StackOverflowException`s. The PR improves the doc of `withColumn`, in order to advise the user to avoid this pattern and do something different, ie. a single select with all the column he/she needs. ## How was this patch tested? NA Closes apache#23285 from mgaido91/SPARK-26224. Authored-by: Marco Gaido <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
… for Kinesis assembly ## What changes were proposed in this pull request? After [SPARK-26856](apache#23797), `Kinesis` Python UT fails with `Found multiple JARs` exception due to a wrong pattern. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/104171/console ``` Exception: Found multiple JARs: .../spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar, .../spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar; please remove all but one ``` It's because the pattern was changed in a wrong way. **Original** ```python kinesis_asl_assembly_dir, "target/scala-*/%s-*.jar" % name_prefix)) kinesis_asl_assembly_dir, "target/%s_*.jar" % name_prefix)) ``` **After SPARK-26856** ```python project_full_path, "target/scala-*/%s*.jar" % jar_name_prefix)) project_full_path, "target/%s*.jar" % jar_name_prefix)) ``` The actual kinesis assembly jar files look like the followings. **SBT Build** ``` -rw-r--r-- 1 dongjoon staff 87459461 Apr 1 19:01 spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar -rw-r--r-- 1 dongjoon staff 309 Apr 1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar -rw-r--r-- 1 dongjoon staff 309 Apr 1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar ``` **MAVEN Build** ``` -rw-r--r-- 1 dongjoon staff 8.6K Apr 1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-sources.jar -rw-r--r-- 1 dongjoon staff 8.6K Apr 1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-test-sources.jar -rw-r--r-- 1 dongjoon staff 8.7K Apr 1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar -rw-r--r-- 1 dongjoon staff 21M Apr 1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar ``` In addition, after SPARK-26856, the utility function `search_jar` is shared to find `avro` jar files which are identical for both `sbt` and `mvn`. To sum up, The current jar pattern parameter cannot handle both `kinesis` and `avro` jars. This PR splits the single pattern into two patterns. ## How was this patch tested? Manual. Please note that this will remove only `Found multiple JARs` exception. Kinesis tests need more configurations to run locally. ``` $ build/sbt -Pkinesis-asl test:package streaming-kinesis-asl-assembly/assembly $ export ENABLE_KINESIS_TESTS=1 $ python/run-tests.py --python-executables python2.7 --module pyspark-streaming ``` Closes apache#24268 from dongjoon-hyun/SPARK-26856. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…mestamp" This reverts commit c5e83ab.
…t in Scala 2.12 to simplify code ## What changes were proposed in this pull request? Use Single Abstract Method syntax where possible (and minor related cleanup). Comments below. No logic should change here. ## How was this patch tested? Existing tests. Closes apache#24241 from srowen/SPARK-27323. Authored-by: Sean Owen <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request? Different SSL passwords shown up as command line argument on executor side in standalone mode: * keyStorePassword * keyPassword * trustStorePassword In this PR I've removed SSL configurations from executors. ## How was this patch tested? Existing + additional unit tests. Additionally tested with standalone mode and checked the command line arguments: ``` [gaborsomogyi:~/spark] SPARK-26998(+4/-0,3)+ ± jps 94803 CoarseGrainedExecutorBackend 94818 Jps 90149 RemoteMavenServer 91925 Nailgun 94793 SparkSubmit 94680 Worker 94556 Master 398 [gaborsomogyi:~/spark] SPARK-26998(+4/-1,3)+ ± ps -ef | egrep "94556|94680|94793|94803" 502 94556 1 0 2:02PM ttys007 0:07.39 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host gsomogyi-MBP.local --port 7077 --webui-port 8080 --properties-file conf/spark-defaults.conf 502 94680 1 0 2:02PM ttys007 0:07.27 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 --properties-file conf/spark-defaults.conf spark://gsomogyi-MBP.local:7077 502 94793 94782 0 2:02PM ttys007 0:35.52 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://gsomogyi-MBP.local:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell 502 94803 94680 0 2:03PM ttys007 0:05.20 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1024M -Dspark.ssl.ui.port=0 -Dspark.driver.port=60902 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler172.30.65.186:60902 --executor-id 0 --hostname 172.30.65.186 --cores 8 --app-id app-20190326140311-0000 --worker-url spark://Worker172.30.65.186:60899 502 94910 57352 0 2:05PM ttys008 0:00.00 egrep 94556|94680|94793|94803 ``` Closes apache#24170 from gaborgsomogyi/SPARK-26998. Authored-by: Gabor Somogyi <[email protected]> Signed-off-by: Marcelo Vanzin <[email protected]>
…oop-3 ## What changes were proposed in this pull request? Fix testing issues with `yarn` module in Hadoop-3: 1. Upgrade jersey-1 to `1.19` to fix ```Cause: java.lang.NoClassDefFoundError: com/sun/jersey/spi/container/servlet/ServletContainer```. 2. Copy `ServerSocketUtil` from hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/ServerSocketUtil.java to fix ```java.lang.NoClassDefFoundError: org/apache/hadoop/net/ServerSocketUtil```. 3. Adapte `SessionHandler` from jetty-9.3.25.v20180904/jetty-server/src/main/java/org/eclipse/jetty/server/session/SessionHandler.java to fix ```java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.getSessionManager()Lorg/eclipse/jetty/server/SessionManager```. ## How was this patch tested? manual tests: ```shell build/sbt yarn/test -Pyarn build/sbt yarn/test -Phadoop-3.2 -Pyarn build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn -Phadoop-3.2 ``` Closes apache#24115 from wangyum/hadoop3-yarn. Authored-by: Yuming Wang <[email protected]> Signed-off-by: Sean Owen <[email protected]>
…field in ExpressionInfo ## What changes were proposed in this pull request? I haven't tested by myself on Windows and I am not 100% sure if this is going to cause an actual problem. However, this one line: https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionInfo.java#L82 made me to investigate a lot today. Given my speculation, if Spark is built in Linux and it's executed on Windows, it looks possible for multiline strings, like, https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L146-L150 to throw an exception because the newline in the binary is `\n` but `System.lineSeparator` returns `\r\n`. I think this is not yet found because this particular codes are not released yet (see SPARK-26426). Looks just better to loosen the condition and forget about this stuff. This should be backported into branch-2.4 as well. ## How was this patch tested? N/A Closes apache#24274 from HyukjinKwon/SPARK-27346. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…s in withColumn at SparkR and PySpark as well ## What changes were proposed in this pull request? This is a followup of apache#23285. This PR adds the notes into PySpark and SparkR documentation as well. While I am here, I revised the doc a bit to make it sound a bit more neutral ## How was this patch tested? Manually built the doc and verified. Closes apache#24272 from HyukjinKwon/SPARK-26224. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
… returns false ## What changes were proposed in this pull request? While I was reviewing apache#24235 I've found a minor addition possibility. Namely `FileSystem.delete` returns a boolean which is not yet checked. In this PR I've added a warning message when it returns false. I've added this as MINOR because no control flow change introduced. ## How was this patch tested? Existing unit tests. Closes apache#24263 from gaborgsomogyi/SPARK-27301-minor. Authored-by: Gabor Somogyi <[email protected]> Signed-off-by: Sean Owen <[email protected]>
…down LeftAnti joins below Aggregate or Window operators. ## What changes were proposed in this pull request? After [23750](apache#23750), we may pushdown left anti joins below aggregate and window operators with a partial join condition. This is not correct and was pointed out by hvanhovell and cloud-fan [here](apache#23750 (comment)). This pr addresses their comments. ## How was this patch tested? Added two new tests to verify the behaviour. Closes apache#24253 from dilipbiswal/SPARK-19712-followup. Authored-by: Dilip Biswal <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request? In the PR, I propose to deprecate the `from_utc_timestamp()` and `to_utc_timestamp`, and disable them by default. The functions can be enabled back via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any calls of the functions throw an analysis exception. One of the reason for deprecation is functions violate semantic of `TimestampType` which is number of microseconds since epoch in UTC time zone. Shifting microseconds since epoch by time zone offset doesn't make sense because the result doesn't represent microseconds since epoch in UTC time zone any more, and cannot be considered as `TimestampType`. ## How was this patch tested? The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`. Closes apache#24195 from MaxGekk/conv-utc-timestamp-deprecate. Lead-authored-by: Maxim Gekk <[email protected]> Co-authored-by: Maxim Gekk <[email protected]> Co-authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…y a plan operator.
## What changes were proposed in this pull request?
In the PR, we raise an AnalysisError when we detect the presence of aggregate expressions in a WHERE clause. Here is the problem description from the JIRA:
Aggregate functions should not be allowed in the WHERE clause, but Spark SQL throws an exception when generating code. It is supposed to throw an exception during parsing or analysis.
Here is an example:
```
val df = spark.sql("select * from t where sum(ta) > 0")
df.explain(true)
df.show()
```
Resulting exception:
```
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(cast(input[0, int, false] as bigint))
at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
at scala.Option.getOrElse(Option.scala:138)
```
Checked the behaviour of other databases, and all of them return an error:
**Postgres**
```
select * from foo where max(c1) > 0;
Error
ERROR: aggregate functions are not allowed in WHERE Position: 25
```
**DB2**
```
db2 => select * from foo where max(c1) > 0;
SQL0120N Invalid use of an aggregate function or OLAP function.
```
**Oracle**
```
select * from foo where max(c1) > 0;
ORA-00934: group function is not allowed here
```
**MySql**
```
select * from foo where max(c1) > 0;
Invalid use of group function
```
**Update**
This PR has been enhanced to report error when expressions such as Aggregate, Window, Generate are hosted by operators where they are invalid.
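A minimal sketch of the kind of analysis-time check described above; `Filter`, `AggregateExpression`, and `failAnalysis` are existing Catalyst/analysis names, but the exact rule added by the PR may be structured differently.
```scala
// Illustrative sketch only, assuming it runs inside CheckAnalysis over the resolved plan.
plan.foreach {
  case f: Filter =>
    f.condition.foreach {
      case agg: AggregateExpression =>
        // Reject aggregate expressions hosted by a Filter (i.e. a WHERE clause).
        failAnalysis(s"Aggregate functions are not allowed in WHERE clause: ${agg.sql}")
      case _ =>
    }
  case _ =>
}
```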
## How was this patch tested?
Added tests in AnalysisErrorSuite and group-by.sql
Closes apache#24209 from dilipbiswal/SPARK-27255.
Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
… Java Bean encoders ## What changes were proposed in this pull request? - Added new test for Java Bean encoder of the classes: `java.time.LocalDate` and `java.time.Instant`. - Updated comment for `Encoders.bean` - New Row getters: `getLocalDate` and `getInstant` - Extended `inferDataType` to infer types for `java.time.LocalDate` -> `DateType` and `java.time.Instant` -> `TimestampType`. ## How was this patch tested? By `JavaBeanDeserializationSuite` Closes apache#24273 from MaxGekk/bean-instant-localdate. Lead-authored-by: Maxim Gekk <[email protected]> Co-authored-by: Maxim Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
… and Writer ## What changes were proposed in this pull request? According to current status, `orc` is available even Hive support isn't enabled. This is a minor doc change to reflect it. ## How was this patch tested? Doc only change. Closes apache#24280 from viirya/fix-orc-doc. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request? abstract some common code into a method. ## How was this patch tested? existing tests Closes apache#24281 from cloud-fan/minor. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…dable and the key is not" This reverts commit 5888b15.
…e ser/dser issue ## What changes were proposed in this pull request? HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But RoaringBitmap couldn't be ser/deser with unsafe KryoSerializer. It's a bug of RoaringBitmap-0.5.11 and fixed in latest version. This is an update of apache#24157 ## How was this patch tested? Add a UT Closes apache#24264 from LantaoJin/SPARK-27216. Lead-authored-by: LantaoJin <[email protected]> Co-authored-by: Lantao Jin <[email protected]> Signed-off-by: Imran Rashid <[email protected]>
…rator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager ## What changes were proposed in this pull request? In `UnsafeExternalSorter.SpillableIterator#loadNext()` takes lock on the `UnsafeExternalSorter` and calls `freePage` once the `lastPage` is consumed which needs to take a lock on `TaskMemoryManager`. At the same time, there can be another MemoryConsumer using `UnsafeExternalSorter` as part of sorting can try to `allocatePage` needs to get lock on `TaskMemoryManager` which can cause spill to happen which requires lock on `UnsafeExternalSorter` again causing deadlock. This is a classic deadlock situation happening similar to the SPARK-26265. To fix this, we can move the `freePage` call in `loadNext` outside of `Synchronized` block similar to the fix in SPARK-26265 ## How was this patch tested? Manual tests were being done and will also try to add a test. Closes apache#24265 from venkata91/deadlock-sorter. Authored-by: Venkata krishnan Sowrirajan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request? It's a followup of apache#24012 , to fix 2 documentation: 1. `SupportsRead` and `SupportsWrite` are not internal anymore. They are public interfaces now. 2. `Scan` should link the `BATCH_READ` instead of hardcoding it. ## How was this patch tested? N/A Closes apache#24285 from cloud-fan/doc. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request? apache#24265 breaks the lint check, because it has trailing space. (not sure why it passed jenkins). This PR fixes it. ## How was this patch tested? N/A Closes apache#24289 from cloud-fan/fix. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request? `hive.stats.jdbc.timeout` and `hive.stats.retries.wait` were removed by [HIVE-12164](https://issues.apache.org/jira/browse/HIVE-12164). This pr to deal with this change. ## How was this patch tested? unit tests Closes apache#24277 from wangyum/SPARK-27349. Authored-by: Yuming Wang <[email protected]> Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request? while working on apache#24129, I realized that I missed some document fixes in apache#24285. This PR covers all of them. ## How was this patch tested? N/A Author: Wenchen Fan <[email protected]> Closes apache#24295 from cloud-fan/doc.
…talogVersionsSuite ## What changes were proposed in this pull request? Since Apache Spark 2.4.1 vote passed and is distributed into mirrors, we need to test 2.4.1. This should land on both `master` and `branch-2.4`. ## How was this patch tested? Pass the Jenkins. Closes apache#24292 from dongjoon-hyun/SPARK-27382. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…mbol is a token In the case where the offending symbol is a CommonToken, this PR increases the accuracy of the start and stop origin by leveraging the start and stop index information from CommonToken. Closes apache#21334 from rubenfiszel/patch-1. Lead-authored-by: Ruben Fiszel <[email protected]> Co-authored-by: rubenfiszel <[email protected]> Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
With this change, unnecessary file scans are avoided in case of Limit 0 queries.
I added a case (rule) to `PropagateEmptyRelation` to replace `GlobalLimit 0` and `LocalLimit 0` nodes with an empty `LocalRelation`. This prunes the subtree under the Limit 0 node and further allows other rules of `PropagateEmptyRelation` to optimize the Logical Plan - while remaining semantically consistent with the Limit 0 query.
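A minimal sketch of the kind of case this adds (illustrative; the real rule lives in `PropagateEmptyRelation` and may use its own helpers and guards):
```scala
// Illustrative sketch: replace Limit 0 nodes with an empty LocalRelation so that
// the other PropagateEmptyRelation cases can prune the surrounding plan.
plan.transformUp {
  case GlobalLimit(IntegerLiteral(0), child) => LocalRelation(child.output, data = Seq.empty)
  case LocalLimit(IntegerLiteral(0), child)  => LocalRelation(child.output, data = Seq.empty)
}
```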
For instance:
**Query:**
`SELECT * FROM table1 INNER JOIN (SELECT * FROM table2 LIMIT 0) AS table2 ON table1.id = table2.id`
**Optimized Plan without fix:**
```
Join Inner, (id#79 = id#87)
:- Filter isnotnull(id#79)
: +- Relation[id#79,num1#80] parquet
+- Filter isnotnull(id#87)
+- GlobalLimit 0
+- LocalLimit 0
+- Relation[id#87,num2#88] parquet
```
**Optimized Plan with fix:**
`LocalRelation <empty>, [id#75, num1#76, id#77, num2#78]`
## How was this patch tested?
Added unit tests to verify Limit 0 optimization for:
- Simple query containing Limit 0
- Inner Join, Left Outer Join, Right Outer Join, Full Outer Join queries containing Limit 0 as one of their children
- Nested Inner Joins between 3 tables with one of them having a Limit 0 clause.
- Intersect query wherein one of the subqueries was a Limit 0 query.
Closes apache#24271 from aayushmaanjain/optimize-limit0.
Authored-by: Aayushmaan Jain <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request? This PR goes to add `max_by()` and `min_by()` SQL aggregate functions. Quoting from the [Presto docs](https://prestodb.github.io/docs/current/functions/aggregate.html#max_by) > max_by(x, y) → [same as x] > Returns the value of x associated with the maximum value of y over all input values. `min_by()` works similarly. ## How was this patch tested? Added tests. Closes apache#24557 from viirya/SPARK-27653. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…sue(except the hive-thriftserver module) ## What changes were proposed in this pull request? This pr fix hadoop-3.2 test issues(except the `hive-thriftserver` module): 1. Add `hive.metastore.schema.verification` and `datanucleus.schema.autoCreateAll` to HiveConf. 2. hadoop-3.2 support access the Hive metastore from 0.12 to 2.2 After [SPARK-27176](https://issues.apache.org/jira/browse/SPARK-27176) and this PR, we upgraded the built-in Hive to 2.3 when enabling the Hadoop 3.2+ profile. This upgrade fixes the following issues: - [HIVE-6727](https://issues.apache.org/jira/browse/HIVE-6727): Table level stats for external tables are set incorrectly. - [HIVE-15653](https://issues.apache.org/jira/browse/HIVE-15653): Some ALTER TABLE commands drop table stats. - [SPARK-12014](https://issues.apache.org/jira/browse/SPARK-12014): Spark SQL query containing semicolon is broken in Beeline. - [SPARK-25193](https://issues.apache.org/jira/browse/SPARK-25193): insert overwrite doesn't throw exception when drop old data fails. - [SPARK-25919](https://issues.apache.org/jira/browse/SPARK-25919): Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned. - [SPARK-26332](https://issues.apache.org/jira/browse/SPARK-26332): Spark sql write orc table on viewFS throws exception. - [SPARK-26437](https://issues.apache.org/jira/browse/SPARK-26437): Decimal data becomes bigint to query, unable to query. ## How was this patch tested? This pr test Spark’s Hadoop 3.2 profile on jenkins and apache#24591 test Spark’s Hadoop 2.7 profile on jenkins This PR close apache#24591 Closes apache#24391 from wangyum/SPARK-27402. Authored-by: Yuming Wang <[email protected]> Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request?
Currently, when there is a null in a nested field of a struct, casting the struct throws an error:
```scala
scala> sql("select cast(struct(1, null) as struct<a:int,b:int>)").show
scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$)
at org.apache.spark.sql.catalyst.expressions.Cast.castToInt(Cast.scala:447)
at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:635)
at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castStruct$1(Cast.scala:603)
```
Similarly, an inline table, which casts nulls in nested fields under the hood, also throws an error.
```scala
scala> sql("select * FROM VALUES (('a', (10, null))), (('b', (10, 50))), (('c', null)) AS tab(x, y)").show
org.apache.spark.sql.AnalysisException: failed to evaluate expression named_struct('col1', 10, 'col2', NULL): NullType (of class org.apache.spark.sql.types.NullType$); line 1 pos 14
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$convert$6(ResolveInlineTables.scala:106)
```
This fixes the issue.
## How was this patch tested?
Added tests.
Closes apache#24576 from viirya/cast-null.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…s with dates/timestamps ## What changes were proposed in this pull request? The below example works with both Mysql and Hive, however not with spark. ``` mysql> select * from date_test where date_col >= '2000-1-1'; +------------+ | date_col | +------------+ | 2000-01-01 | +------------+ ``` The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420. Based on some tests, the behavior of Date and String comparison in Hive and Mysql: Hive: Cast to Date, partial date is not supported Mysql: Cast to Date, certain "partial date" is supported by defining certain date string parse rules. Check out str_to_datetime in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c As below date patterns have been supported, the PR is to cast string to date when comparing string and date: ``` `yyyy` `yyyy-[m]m` `yyyy-[m]m-[d]d` `yyyy-[m]m-[d]d ` `yyyy-[m]m-[d]d *` `yyyy-[m]m-[d]dT* ``` ## How was this patch tested? UT has been added Closes apache#24567 from pengbo/SPARK-27638. Authored-by: mingbo.pb <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…and other resources ## What changes were proposed in this pull request? Add in GPU and generic resource type allocation to the executors. Note this is part of a bigger feature for gpu-aware scheduling and is just how the executor find the resources. The general flow : - users ask for a certain set of resources, for instance number of gpus - each cluster manager has a specific way to do this. - cluster manager allocates a container or set of resources (standalone mode) - When spark launches the executor in that container, the executor either has to be told what resources it has or it has to auto discover them. - Executor has to register with Driver and tell the driver the set of resources it has so the scheduler can use that to schedule tasks that requires a certain amount of each of those resources In this pr I added configs and arguments to the executor to be able discover resources. The argument to the executor is intended to be used by standalone mode or other cluster managers that don't have isolation so that it can assign specific resources to specific executors in case there are multiple executors on a node. The argument is a file contains JSON Array of ResourceInformation objects. The discovery script is meant to be used in an isolated environment where the executor only sees the resources it should use. Note that there will be follow on PRs to add other parts like the scheduler part. See the epic high level jira: https://issues.apache.org/jira/browse/SPARK-24615 ## How was this patch tested? Added unit tests and manually tested. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes apache#24406 from tgravescs/gpu-sched-executor-clean. Authored-by: Thomas Graves <[email protected]> Signed-off-by: Thomas Graves <[email protected]>
… should search over formatted data rather than the raw data The Pull Request to add datatables to stage page SPARK-21809 got merged. The search functionality in those datatables being a great improvement for searching through a large number of tasks, also performs search over the raw data rather than the formatted data displayed in the tables. It would be great if the search can happen for the formatted data as well. ## What changes were proposed in this pull request? Added code to enable searching over displayed data in tables e.g. searching on "165.7 MiB" or "0.3 ms" will now return the search results. Also, earlier we were missing search for two columns in the task table "Shuffle Read Bytes" as well as "Shuffle Remote Reads", which I have added here. ## How was this patch tested? Manual Tests Closes apache#24419 from pgandhi999/SPARK-25719. Authored-by: pgandhi <[email protected]> Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request? This removes usage of `Traversable`, which is removed in Scala 2.13. This is mostly an internal change, except for the change in the `SparkConf.setAll` method. See additional comments below. ## How was this patch tested? Existing tests. Closes apache#24584 from srowen/SPARK-27680. Authored-by: Sean Owen <[email protected]> Signed-off-by: Sean Owen <[email protected]>
….reset` ## What changes were proposed in this pull request? We should remove materialized view first otherwise(note that Hive 3.1 could reproduce this issue): ```scala Cause: org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException: DELETE on table 'TBLS' caused a violation of foreign key constraint 'MV_TABLES_USED_FK2' for key (4). The statement has been rolled back. at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source) at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeBatchElement(Unknown Source) at org.apache.derby.impl.jdbc.EmbedStatement.executeLargeBatch(Unknown Source) ``` ## How was this patch tested? Existing test Closes apache#24592 from wangyum/SPARK-27690. Authored-by: Yuming Wang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request? This adds a v2 implementation for CTAS queries * Update the SQL parser to parse CREATE queries using multi-part identifiers * Update `CheckAnalysis` to validate partitioning references with the CTAS query schema * Add `CreateTableAsSelect` v2 logical plan and `CreateTableAsSelectExec` v2 physical plan * Update create conversion from `CreateTableAsSelectStatement` to support the new v2 logical plan * Update `DataSourceV2Strategy` to convert v2 CTAS logical plan to the new physical plan * Add `findNestedField` to `StructType` to support reference validation ## How was this patch tested? We have been running these changes in production for several months. Also: * Add a test suite `CreateTablePartitioningValidationSuite` for new analysis checks * Add a test suite for v2 SQL, `DataSourceV2SQLSuite` * Update catalyst `DDLParserSuite` to use multi-part identifiers (`Seq[String]`) * Add test cases to `PlanResolutionSuite` for v2 CTAS: known catalog and v2 source implementation Closes apache#24570 from rdblue/SPARK-24923-add-v2-ctas. Authored-by: Ryan Blue <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…ation`. ## What changes were proposed in this pull request? `StructuredSessionization` comment contains duplicate 'add', I think it should be changed. ## How was this patch tested? Exists UT. Closes apache#24589 from beliefer/remove-duplicate-add-in-comment. Lead-authored-by: gengjiaan <[email protected]> Co-authored-by: Jiaan Geng <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…to core ## What changes were proposed in this pull request? `RecordBinaryComparator`, `UnsafeExternalRowSorter` and `UnsafeKeyValueSorter` now locates in catalyst, which should be moved to core, as they're used only in physical plan. ## How was this patch tested? exist tests. Closes apache#24607 from xianyinxin/SPARK-27713. Authored-by: xy_xin <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…hods that will be removed in Scala 2.13 with work-alikes ## What changes were proposed in this pull request? This replaces use of collection classes like `MutableList` and `ArrayStack` with workalikes that are available in 2.12, as they will be removed in 2.13. It also removes use of `.to[Collection]` as its uses was superfluous anyway. Removing `collection.breakOut` will have to wait until 2.13 ## How was this patch tested? Existing tests Closes apache#24586 from srowen/SPARK-27682. Authored-by: Sean Owen <[email protected]> Signed-off-by: Sean Owen <[email protected]>
This feature allows proxy servers to identify the actual request user
using a request parameter, and performs access control checks against
that user instead of the authenticated user. Impersonation is only
allowed if the authenticated user is configured as an admin.
The request parameter used ("doAs") matches the one currently used by
Knox, but it should be easy to change / customize if different proxy
servers use a different way of identifying the original user.
Tested with updated unit tests and also with a live server behind Knox.
Closes apache#24582 from vanzin/SPARK-27678.
Authored-by: Marcelo Vanzin <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
…ent caching ## What changes were proposed in this pull request? Kafka related Spark parameters has to start with `spark.kafka.` and not with `spark.sql.`. Because of this I've renamed `spark.sql.kafkaConsumerCache.capacity`. Since Kafka consumer caching is not documented I've added this also. ## How was this patch tested? Existing + added unit test. ``` cd docs SKIP_API=1 jekyll build ``` and manual webpage check. Closes apache#24590 from gaborgsomogyi/SPARK-27687. Authored-by: Gabor Somogyi <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…n on BroadcastTimeout ## What changes were proposed in this pull request? In the existing code, a broadcast execution timeout for the Future only causes a query failure, but the job running with the broadcast and the computation in the Future are not canceled. This wastes resources and slows down the other jobs. This PR tries to cancel both the running job and the running hashed relation construction thread. ## How was this patch tested? Add new test suite `BroadcastExchangeExec` Closes apache#24595 from jiangxb1987/SPARK-20774. Authored-by: Xingbo Jiang <[email protected]> Signed-off-by: gatorsmile <[email protected]>
…module to sql/hive-thriftserver/v1.2.1 ## What changes were proposed in this pull request? When we upgraded the built-in Hive to 2.3.4, the current `hive-thriftserver` module is not compatible, such as these Hive changes: 1. [HIVE-12442](https://issues.apache.org/jira/browse/HIVE-12442) HiveServer2: Refactor/repackage HiveServer2's Thrift code so that it can be used in the tasks 2. [HIVE-12237](https://issues.apache.org/jira/browse/HIVE-12237) Use slf4j as logging facade 3. [HIVE-13169](https://issues.apache.org/jira/browse/HIVE-13169) HiveServer2: Support delegation token based connection when using http transport So this PR moves the incompatible code to `sql/hive-thriftserver/v1.2.1` and copies it to `sql/hive-thriftserver/v2.3.4` for the next code review. ## How was this patch tested? manual tests: ``` diff -urNa sql/hive-thriftserver/v1.2.1 sql/hive-thriftserver/v2.3.4 ``` Closes apache#24282 from wangyum/SPARK-27354. Authored-by: Yuming Wang <[email protected]> Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request? This is a followup of apache#20365 . apache#20365 fixed this problem when the hint node is a root node. This PR fixes this problem for all the cases. ## How was this patch tested? a new test Closes apache#24580 from cloud-fan/bug. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request? removed the unused "UnsafeKeyValueSorter.java" file ## How was this patch tested? Ran Compilation and UT locally. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes apache#24622 from shivusondur/jira27722. Authored-by: shivusondur <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…in SS ## What changes were proposed in this pull request? Some APIs in Structured Streaming requires the user to specify an interval. Right now these APIs don't accept upper-case strings. This PR adds a new method `fromCaseInsensitiveString` to `CalendarInterval` to support paring upper-case strings, and fixes all APIs that need to parse an interval string. ## How was this patch tested? The new unit test. Closes apache#24619 from zsxwing/SPARK-27735. Authored-by: Shixiong Zhu <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request? Currently we have an analyzer rule, which resolves the output columns of data source v2 writing plans, to make sure the schema of input query is compatible with the table. However, not all data sources need this check. For example, the `NoopDataSource` doesn't care about the schema of input query at all. This PR introduces a new table capability: ACCEPT_ANY_SCHEMA. If a table reports this capability, we skip resolving output columns for it during write. Note that, we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon. ## How was this patch tested? new test cases Closes apache#24469 from cloud-fan/schema-check. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request? This PR upgrades lz4-java from 1.5.1 to 1.6.0. Lz4-java is available at https://github.com/lz4/lz4-java. Changes from 1.5.1: - Upgraded LZ4 to 1.9.1. Updated the JNI bindings, except for the one for Linux/i386. Decompression speed is improved on amd64. - Deprecated use of LZ4FastDecompressor of a native instance because the corresponding C API function is deprecated. See the release note of LZ4 1.9.0 for details. Updated javadoc accordingly. - Changed the module name from org.lz4.lz4-java to org.lz4.java to avoid using - in the module name. (severn-everett, Oliver Eikemeier, Rei Odaira) - Enabled build with Java 11. Note that the distribution is still built with Java 7. (Rei Odaira) ## How was this patch tested? Existing tests. Closes apache#24629 from kiszk/SPARK-27752. Authored-by: Kazuaki Ishizaki <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…uet/ORC ## What changes were proposed in this pull request? Currently, in `ParquetFilters` and `OrcFilters`, if the child predicate of `Or` operator can't be entirely pushed down, the predicates will be thrown away. In fact, the conjunctive predicates under `Or` operators can be partially pushed down. For example, says `a` and `b` are convertible, while `c` can't be pushed down, the predicate `a or (b and c)` can be converted as `(a or b) and (a or c)` We can still push down `(a or b)`. We can't push down disjunctive predicates only when one of its children is not partially convertible. This PR also improve the filter pushing down logic in `DataSourceV2Strategy`. With partial filter push down in `Or` operator, the result of `pushedFilters()` might not exist in the mapping `translatedFilterToExpr`. To fix it, this PR changes the mapping `translatedFilterToExpr` as leaf filter expression to `sources.filter`, and later on rebuild the whole expression with the mapping. ## How was this patch tested? Unit test Closes apache#24598 from gengliangwang/pushdownDisjunctivePredicates. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request? This PR aims to update `zstd-jni` library to `1.4.0-1` which improves the `level 1 compression speed` performance by 6% in most scenarios. The following is the full release note. - https://github.com/facebook/zstd/releases/tag/v1.4.0 ## How was this patch tested? Pass the Jenkins. Closes apache#24632 from dongjoon-hyun/SPARK-27755. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…d on Windows OS
## What changes were proposed in this pull request?
If we set `hive.exec.stagingdir=.test-staging\tmp`, the staging directory is still `.hive-staging` on Windows OS.
Reasons for failure:
Test code:
```
val path = new Path("C:\\test\\hivetable")
println("path.toString: " + path.toString)
println("path.toUri.getPath: " + path.toUri.getPath)
```
Output:
```
path.toString: C:/test/hivetable
path.toUri.getPath: /C:/test/hivetable
```
We can see that `path.toUri.getPath` has one more leading separator than `path.toString`, and that the separator is '/', not '\'.
So `stagingPathName.stripPrefix(inputPathName).stripPrefix(File.separator).startsWith(".")` will return false.
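To make the failure mode concrete, a small self-contained illustration of the check with values based on the example above (the path values themselves are hypothetical):
```scala
import java.io.File

// One path string comes from path.toString, the other from path.toUri.getPath,
// which carries an extra leading '/'.
val inputPathName   = "C:/test/hivetable"
val stagingPathName = "/C:/test/hivetable/.test-staging/tmp"

val isStaging = stagingPathName
  .stripPrefix(inputPathName)   // no-op: "/C:/..." does not start with "C:/..."
  .stripPrefix(File.separator)  // strips "\" on Windows, leaving the "/C:/..." string intact
  .startsWith(".")              // false, so the directory is not recognised as a staging dir

println(isStaging) // false
```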
## How was this patch tested?
1. Existed tests
2. Manual testing on Windows OS
Closes apache#24446 from 10110346/stagingdir.
Authored-by: liuxian <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
Change SortMergeJoinExec to use ExternalAppendOnlyUnsafeRowArray for Full Outer Join. This spills data to disk if the buffered rows exceed a threshold, preventing OOM errors. Change the corresponding test case in JoinSuite.
[SPARK-24985][SQL] Clean up unused imports.
[SPARK-24985][SQL] Remove `copy()` on row instances before adding to `ExternalAppendOnlyUnsafeRowArray`; `ExternalAppendOnlyUnsafeRowArray` handles the copy internally.
[SPARK-24985][SQL] Fix OOM in Full Outer Join in presence of data skew.
What issue does this pull request address?
JIRA: https://issues.apache.org/jira/browse/SPARK-24985
In the case of Full Outer Joins of large tables, OOM exceptions occur in the presence of data skew around the join keys of either of the joined tables. While it is possible to increase the heap size as a workaround, Spark should be resilient to such issues, as skew can happen arbitrarily.
What changes were proposed in this pull request?
#16909 introduced `ExternalAppendOnlyUnsafeRowArray` and changed `SortMergeJoinExec` to use `ExternalAppendOnlyUnsafeRowArray` for every join except Full Outer Join. This PR makes Full Outer Joins use `ExternalAppendOnlyUnsafeRowArray` as well.
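For context, a rough sketch of the buffering approach this PR describes, assuming the two-argument `ExternalAppendOnlyUnsafeRowArray` constructor and the `add`/`clear`/`generateIterator` methods already used for the other join types in `SortMergeJoinExec`; variable names are illustrative, not the exact PR code.
```scala
// Rough sketch: buffer the matched right rows for the current join key in a
// spillable array instead of an in-memory ArrayBuffer.
val rightMatches = new ExternalAppendOnlyUnsafeRowArray(inMemoryThreshold, spillThreshold)
rightMatches.clear()
while (rightRowHasCurrentJoinKey) {
  // No explicit copy() is needed here: per the commit message, the array copies
  // rows internally before buffering, and spills to disk once spillThreshold
  // rows have been added.
  rightMatches.add(rightRow)
  advanceRight()
}
// Later, iterate over the buffered matches (possibly from spill files) via
// rightMatches.generateIterator(...)
```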
How was this patch tested?
Unit testing: `JoinSuite` and `OuterJoinSuite` were used to verify correctness.
Stress testing