
Conversation


@sujithjay sujithjay commented Aug 21, 2018

What issue does this pull request address?

JIRA: https://issues.apache.org/jira/browse/SPARK-24985
In the case of full outer joins of large tables, OOM exceptions occur when there is data skew around the join keys of either joined table. While it is possible to increase the heap size as a workaround, Spark should be resilient to such issues because skew can happen arbitrarily.

What changes were proposed in this pull request?

#16909 introduced ExternalAppendOnlyUnsafeRowArray and changed SortMergeJoinExec to use ExternalAppendOnlyUnsafeRowArray for every join type except 'Full Outer Join'. This PR changes 'Full Outer Join' to use ExternalAppendOnlyUnsafeRowArray as well.

How was this patch tested?

Unit testing

  • Changed a test-case in JoinSuite.
  • Existing tests in OuterJoinSuite were used to verify correctness.

Stress testing

  • This is still a work in progress. I plan to verify this patch using a production workload.

@sujithjay (Author)

Hi @tejasapatil, @viirya, @hvanhovell, & @kiszk, can you please review this pull request?


while (leftMatchesIterator.hasNext) {
  val leftCurRow = leftMatchesIterator.next()
  val rightMatchesIterator = rightMatches.generateIterator(rightIndex)
@viirya (Member) commented:

Can we keep scanning the left and right iterators instead? If they are spilled, obtaining an iterator from the spilled data needs to loop over the spill writers and create readers, so we could avoid calling generateIterator every time we need an iterator. However, it might make the code a bit more complicated than it is now.

@sujithjay (Author) replied:

Hi @viirya,
Thank you for reviewing the code.

I agree the code would be a bit more complicated if we had to keep scanning the iterators. In this particular case, I was following a pattern observed throughout the rest of the class.
For example:

override def advanceNext(): Boolean = {
  while (smjScanner.findNextOuterJoinRows()) {
    currentLeftRow = smjScanner.getStreamedRow
    val currentRightMatches = smjScanner.getBufferedMatches
    if (currentRightMatches == null || currentRightMatches.length == 0) {
      numOutputRows += 1
      return true
    }
    var found = false
    val rightMatchesIterator = currentRightMatches.generateIterator()
    while (!found && rightMatchesIterator.hasNext) {
      joinRow(currentLeftRow, rightMatchesIterator.next())
      if (boundCondition(joinRow)) {
        found = true
      }
    }
    if (!found) {
      numOutputRows += 1
      return true
    }
  }
  false
}

Having said that, I do feel the penalty for following a similar approach in the case of full outer joins could be higher. I will try and see what I can do.

@sujithjay (Author) commented:

Hi @viirya,
After some deliberation, I believe it is not possible to avoid reinitialising the right iterator. Please share your thoughts on this.
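
For reference, a minimal sketch of the nested-loop pattern under discussion (simplified, assumed types; not the actual SortMergeJoinExec code). Each pass over the buffered right matches needs a fresh iterator, and when the buffer has spilled, every generateIterator() call re-reads the spilled data:

```scala
// Stand-in for ExternalAppendOnlyUnsafeRowArray, reduced to the method used here.
final class SpillableRowBufferSketch(rows: Seq[String]) {
  def generateIterator(startIndex: Int): Iterator[String] =
    // in the real class this may loop over spill writers and create readers
    rows.iterator.drop(startIndex)
}

def fullOuterScanSketch(leftMatches: SpillableRowBufferSketch,
                        rightMatches: SpillableRowBufferSketch,
                        rightIndex: Int): Unit = {
  val leftMatchesIterator = leftMatches.generateIterator(0)
  while (leftMatchesIterator.hasNext) {
    val leftCurRow = leftMatchesIterator.next()
    // re-initialised for every left row, as discussed above
    val rightMatchesIterator = rightMatches.generateIterator(rightIndex)
    while (rightMatchesIterator.hasNext) {
      println(s"join $leftCurRow with ${rightMatchesIterator.next()}")
    }
  }
}
```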

MaxGekk and others added 26 commits April 2, 2019 10:20
## What changes were proposed in this pull request?

In the PR, I propose to deprecate `from_utc_timestamp()` and `to_utc_timestamp()`, and disable them by default. The functions can be enabled again via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any call to these functions throws an analysis exception.

One of the reasons for the deprecation is that the functions violate the semantics of `TimestampType`, which is the number of microseconds since the epoch in the UTC time zone. Shifting microseconds since the epoch by a time zone offset doesn't make sense, because the result no longer represents microseconds since the epoch in UTC and cannot be considered a `TimestampType`.
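
A hedged usage sketch (assumes a SparkSession named `spark`): re-enabling the deprecated functions via the legacy config named above; without it, the call throws an analysis exception.

```scala
// Opt back in to the deprecated behaviour via the legacy config.
spark.conf.set("spark.sql.legacy.utcTimestampFunc.enabled", "true")
spark.sql("SELECT from_utc_timestamp('2019-04-02 10:20:00', 'UTC')").show()
```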

## How was this patch tested?

The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`.

Closes apache#24195 from MaxGekk/conv-utc-timestamp-deprecate.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…equent calls to withColumn

## What changes were proposed in this pull request?

We have seen many cases where users make several subsequent calls to `withColumn` on a Dataset. This now leads to the generation of a lot of `Project` nodes on top of the plan, which causes serious problems and can also lead to `StackOverflowException`s.

The PR improves the doc of `withColumn` to advise users to avoid this pattern and instead use a single select with all the columns they need.
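
For illustration, a minimal sketch of the pattern the updated doc warns about and of the suggested alternative (assumes a DataFrame `df` with columns `a`, `b`, `c`; not code from this PR):

```scala
import org.apache.spark.sql.functions.col

// Each withColumn call stacks another Project node on top of the plan.
val chained = df
  .withColumn("a2", col("a") + 1)
  .withColumn("b2", col("b") + 1)
  .withColumn("c2", col("c") + 1)

// Suggested alternative: a single select carrying all the needed columns.
val single = df.select(
  col("*"),
  (col("a") + 1).as("a2"),
  (col("b") + 1).as("b2"),
  (col("c") + 1).as("c2"))
```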

## How was this patch tested?

NA

Closes apache#23285 from mgaido91/SPARK-26224.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
… for Kinesis assembly

## What changes were proposed in this pull request?

After [SPARK-26856](apache#23797), the `Kinesis` Python UT fails with a `Found multiple JARs` exception due to a wrong pattern.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/104171/console
```
Exception: Found multiple JARs:
.../spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar,
.../spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar;
please remove all but one
```

This is because the pattern was changed incorrectly.

**Original**
```python
kinesis_asl_assembly_dir, "target/scala-*/%s-*.jar" % name_prefix))
kinesis_asl_assembly_dir, "target/%s_*.jar" % name_prefix))
```
**After SPARK-26856**
```python
project_full_path, "target/scala-*/%s*.jar" % jar_name_prefix))
project_full_path, "target/%s*.jar" % jar_name_prefix))
```

The actual kinesis assembly jar files look like the following.

**SBT Build**
```
-rw-r--r--  1 dongjoon  staff  87459461 Apr  1 19:01 spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar
-rw-r--r--  1 dongjoon  staff       309 Apr  1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar
-rw-r--r--  1 dongjoon  staff       309 Apr  1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar
```

**MAVEN Build**
```
-rw-r--r--   1 dongjoon  staff   8.6K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-sources.jar
-rw-r--r--   1 dongjoon  staff   8.6K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-test-sources.jar
-rw-r--r--   1 dongjoon  staff   8.7K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar
-rw-r--r--   1 dongjoon  staff    21M Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar
```

In addition, after SPARK-26856, the utility function `search_jar` is shared to find `avro` jar files, which are identical for both `sbt` and `mvn`. To sum up, the current jar pattern parameter cannot handle both `kinesis` and `avro` jars. This PR splits the single pattern into two patterns.

## How was this patch tested?

Manual. Please note that this will remove only `Found multiple JARs` exception. Kinesis tests need more configurations to run locally.
```
$ build/sbt -Pkinesis-asl test:package streaming-kinesis-asl-assembly/assembly
$ export ENABLE_KINESIS_TESTS=1
$ python/run-tests.py --python-executables python2.7 --module pyspark-streaming
```

Closes apache#24268 from dongjoon-hyun/SPARK-26856.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…t in Scala 2.12 to simplify code

## What changes were proposed in this pull request?

Use Single Abstract Method syntax where possible (and minor related cleanup). Comments below. No logic should change here.

## How was this patch tested?

Existing tests.

Closes apache#24241 from srowen/SPARK-27323.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

Different SSL passwords show up as command line arguments on the executor side in standalone mode:
* keyStorePassword
* keyPassword
* trustStorePassword

In this PR I've removed SSL configurations from executors.

## How was this patch tested?

Existing + additional unit tests.
Additionally tested with standalone mode and checked the command line arguments:
```
[gaborsomogyi:~/spark] SPARK-26998(+4/-0,3)+ ± jps
94803 CoarseGrainedExecutorBackend
94818 Jps
90149 RemoteMavenServer
91925 Nailgun
94793 SparkSubmit
94680 Worker
94556 Master
398
[gaborsomogyi:~/spark] SPARK-26998(+4/-1,3)+ ± ps -ef | egrep "94556|94680|94793|94803"
  502 94556     1   0  2:02PM ttys007    0:07.39 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host gsomogyi-MBP.local --port 7077 --webui-port 8080 --properties-file conf/spark-defaults.conf
  502 94680     1   0  2:02PM ttys007    0:07.27 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 --properties-file conf/spark-defaults.conf spark://gsomogyi-MBP.local:7077
  502 94793 94782   0  2:02PM ttys007    0:35.52 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://gsomogyi-MBP.local:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell
  502 94803 94680   0  2:03PM ttys007    0:05.20 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1024M -Dspark.ssl.ui.port=0 -Dspark.driver.port=60902 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler172.30.65.186:60902 --executor-id 0 --hostname 172.30.65.186 --cores 8 --app-id app-20190326140311-0000 --worker-url spark://Worker172.30.65.186:60899
  502 94910 57352   0  2:05PM ttys008    0:00.00 egrep 94556|94680|94793|94803
```

Closes apache#24170 from gaborgsomogyi/SPARK-26998.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
…oop-3

## What changes were proposed in this pull request?

Fix testing issues with `yarn` module in Hadoop-3:

1. Upgrade jersey-1 to `1.19` to fix ```Cause: java.lang.NoClassDefFoundError: com/sun/jersey/spi/container/servlet/ServletContainer```.
2. Copy `ServerSocketUtil` from hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/ServerSocketUtil.java to fix ```java.lang.NoClassDefFoundError: org/apache/hadoop/net/ServerSocketUtil```.
3. Adapt `SessionHandler` from jetty-9.3.25.v20180904/jetty-server/src/main/java/org/eclipse/jetty/server/session/SessionHandler.java to fix ```java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.getSessionManager()Lorg/eclipse/jetty/server/SessionManager```.

## How was this patch tested?

manual tests:
```shell
build/sbt yarn/test -Pyarn
build/sbt yarn/test -Phadoop-3.2 -Pyarn

build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn -Phadoop-3.2
```

Closes apache#24115 from wangyum/hadoop3-yarn.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…field in ExpressionInfo

## What changes were proposed in this pull request?

I haven't tested by myself on Windows and I am not 100% sure if this is going to cause an actual problem.

However, this one line:

https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionInfo.java#L82

made me investigate a lot today.

Given my speculation, if Spark is built on Linux and executed on Windows, it looks possible for multiline strings, like,

https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L146-L150

to throw an exception because the newline in the binary is `\n` but `System.lineSeparator` returns `\r\n`.

I think this has not been noticed yet because this particular code is not released yet (see SPARK-26426).

Looks just better to loosen the condition and forget about this stuff.

This should be backported into branch-2.4 as well.
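
A minimal illustration of the cross-platform mismatch (my own sketch, not code from the PR):

```scala
// A multiline string baked into a binary built on Linux contains "\n",
// while System.lineSeparator() on Windows returns "\r\n", so a strict
// check based on the platform separator can fail there.
val usage = "First line.\nSecond line."
val parts = usage.split(System.lineSeparator())
// On Linux this prints 2; on Windows it prints 1 because "\r\n" never occurs in the string.
println(parts.length)
```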

## How was this patch tested?

N/A

Closes apache#24274 from HyukjinKwon/SPARK-27346.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…s in withColumn at SparkR and PySpark as well

## What changes were proposed in this pull request?

This is a followup of apache#23285. This PR adds the notes into PySpark and SparkR documentation as well.

While I am here, I revised the doc a bit to make it sound a bit more neutral.

## How was this patch tested?

Manually built the doc and verified.

Closes apache#24272 from HyukjinKwon/SPARK-26224.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
… returns false

## What changes were proposed in this pull request?

While I was reviewing apache#24235 I found a minor addition possibility: `FileSystem.delete` returns a boolean which is not yet checked. In this PR I've added a warning message when it returns false. I've added this as MINOR because no control flow change is introduced.

## How was this patch tested?

Existing unit tests.

Closes apache#24263 from gaborgsomogyi/SPARK-27301-minor.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…down LeftAnti joins below Aggregate or Window operators.

## What changes were proposed in this pull request?
After [23750](apache#23750), we may push down left anti joins below aggregate and window operators with a partial join condition. This is not correct and was pointed out by hvanhovell and cloud-fan [here](apache#23750 (comment)). This PR addresses their comments.
## How was this patch tested?
Added two new tests to verify the behaviour.

Closes apache#24253 from dilipbiswal/SPARK-19712-followup.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…y a plan operator.

## What changes were proposed in this pull request?
In the PR, we raise an AnalysisError when we detect the presence of aggregate expressions in a WHERE clause. Here is the problem description from the JIRA.

Aggregate functions should not be allowed in a WHERE clause, but Spark SQL currently throws an exception only when generating code. It is supposed to throw an exception during parsing or analysis.

Here is an example:
```
val df = spark.sql("select * from t where sum(ta) > 0")
df.explain(true)
df.show()
```
Resulting exception:
```
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(cast(input[0, int, false] as bigint))
	at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
	at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
	at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
	at scala.Option.getOrElse(Option.scala:138)
```
Checked the behaviour of other databases and all of them return an exception:
**Postgres**
```
select * from foo where max(c1) > 0;
Error
ERROR: aggregate functions are not allowed in WHERE Position: 25
```
**DB2**
```
db2 => select * from foo where max(c1) > 0;
SQL0120N  Invalid use of an aggregate function or OLAP function.
```
**Oracle**
```
select * from foo where max(c1) > 0;
ORA-00934: group function is not allowed here
```
**MySql**
```
select * from foo where max(c1) > 0;
Invalid use of group function
```

**Update**
This PR has been enhanced to report an error when expressions such as Aggregate, Window, or Generate are hosted by operators where they are invalid.
## How was this patch tested?
Added tests in AnalysisErrorSuite and group-by.sql

Closes apache#24209 from dilipbiswal/SPARK-27255.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
… Java Bean encoders

## What changes were proposed in this pull request?

- Added new test for Java Bean encoder of the classes: `java.time.LocalDate` and `java.time.Instant`.
- Updated comment for `Encoders.bean`
- New Row getters: `getLocalDate` and `getInstant`
- Extended `inferDataType` to infer types for `java.time.LocalDate` -> `DateType` and `java.time.Instant` -> `TimestampType`.

## How was this patch tested?

By `JavaBeanDeserializationSuite`

Closes apache#24273 from MaxGekk/bean-instant-localdate.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
… and Writer

## What changes were proposed in this pull request?

Currently, `orc` is available even when Hive support isn't enabled. This is a minor doc change to reflect that.

## How was this patch tested?

Doc only change.

Closes apache#24280 from viirya/fix-orc-doc.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

abstract some common code into a method.

## How was this patch tested?

existing tests

Closes apache#24281 from cloud-fan/minor.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…e ser/dser issue

## What changes were proposed in this pull request?

HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But RoaringBitmap couldn't be serialized/deserialized with the unsafe KryoSerializer.

It's a bug in RoaringBitmap 0.5.11 that is fixed in the latest version.

This is an update of apache#24157

## How was this patch tested?

Add a UT

Closes apache#24264 from LantaoJin/SPARK-27216.

Lead-authored-by: LantaoJin <[email protected]>
Co-authored-by: Lantao Jin <[email protected]>
Signed-off-by: Imran Rashid <[email protected]>
…rator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager

## What changes were proposed in this pull request?

`UnsafeExternalSorter.SpillableIterator#loadNext()` takes a lock on the `UnsafeExternalSorter` and calls `freePage` once the `lastPage` is consumed, which needs to take a lock on `TaskMemoryManager`. At the same time, another MemoryConsumer using the `UnsafeExternalSorter` as part of sorting can try to `allocatePage`, which needs the lock on `TaskMemoryManager`; that can cause a spill, which in turn requires the lock on `UnsafeExternalSorter` again, causing a deadlock. This is a classic deadlock situation, similar to SPARK-26265.

To fix this, we can move the `freePage` call in `loadNext` outside of the `synchronized` block, similar to the fix in SPARK-26265.
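
A minimal, self-contained sketch of that lock-ordering fix (names and structure are assumed for illustration; this is not the Spark code):

```scala
// Free the page only *after* leaving the block synchronized on the sorter, so the
// TaskMemoryManager lock is never acquired while the sorter lock is held.
final class Page(val id: Int)

final class SpillableIteratorSketch {
  private var lastPage: Page = new Page(0)

  // In Spark, freeing a page takes the TaskMemoryManager lock.
  private def freePage(p: Page): Unit = println(s"freed page ${p.id}")

  def loadNext(): Unit = {
    var pageToFree: Page = null
    this.synchronized {
      // ... advance to the next record under the sorter lock ...
      if (lastPage != null) {
        pageToFree = lastPage // defer the actual free
        lastPage = null
      }
    }
    if (pageToFree != null) freePage(pageToFree) // no sorter lock held here
  }
}
```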

## How was this patch tested?

Manual tests were done, and I will also try to add a test.

Closes apache#24265 from venkata91/deadlock-sorter.

Authored-by: Venkata krishnan Sowrirajan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

It's a followup of apache#24012, to fix 2 documentation issues:
1. `SupportsRead` and `SupportsWrite` are not internal anymore. They are public interfaces now.
2. `Scan` should link the `BATCH_READ` instead of hardcoding it.

## How was this patch tested?
N/A

Closes apache#24285 from cloud-fan/doc.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

apache#24265 breaks the lint check because it has a trailing space (not sure why it passed Jenkins). This PR fixes it.

## How was this patch tested?

N/A

Closes apache#24289 from cloud-fan/fix.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?
`hive.stats.jdbc.timeout` and `hive.stats.retries.wait` were removed by [HIVE-12164](https://issues.apache.org/jira/browse/HIVE-12164).
This PR deals with that change.

## How was this patch tested?

unit tests

Closes apache#24277 from wangyum/SPARK-27349.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request?

While working on apache#24129, I realized that I missed some document fixes in apache#24285. This PR covers all of them.

## How was this patch tested?

N/A

Author: Wenchen Fan <[email protected]>

Closes apache#24295 from cloud-fan/doc.
…talogVersionsSuite

## What changes were proposed in this pull request?

Since the Apache Spark 2.4.1 vote passed and the release is distributed to the mirrors, we need to test 2.4.1. This should land on both `master` and `branch-2.4`.

## How was this patch tested?

Pass the Jenkins.

Closes apache#24292 from dongjoon-hyun/SPARK-27382.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…mbol is a token

In the case where the offending symbol is a CommonToken, this PR increases the accuracy of the start and stop origin by leveraging the start and stop index information from CommonToken.

Closes apache#21334 from rubenfiszel/patch-1.

Lead-authored-by: Ruben Fiszel <[email protected]>
Co-authored-by: rubenfiszel <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
With this change, unnecessary file scans are avoided in the case of Limit 0 queries.

I added a case (rule) to `PropagateEmptyRelation` to replace `GlobalLimit 0` and `LocalLimit 0` nodes with an empty `LocalRelation`. This prunes the subtree under the Limit 0 node and further allows other rules of `PropagateEmptyRelation` to optimize the Logical Plan - while remaining semantically consistent with the Limit 0 query.

For instance:
**Query:**
`SELECT * FROM table1 INNER JOIN (SELECT * FROM table2 LIMIT 0) AS table2 ON table1.id = table2.id`

**Optimized Plan without fix:**
```
Join Inner, (id#79 = id#87)
:- Filter isnotnull(id#79)
:  +- Relation[id#79,num1#80] parquet
+- Filter isnotnull(id#87)
   +- GlobalLimit 0
      +- LocalLimit 0
         +- Relation[id#87,num2#88] parquet
```

**Optimized Plan with fix:**
`LocalRelation <empty>, [id#75, num1#76, id#77, num2#78]`

## How was this patch tested?
Added unit tests to verify Limit 0 optimization for:
- Simple query containing Limit 0
- Inner Join, Left Outer Join, Right Outer Join, Full Outer Join queries containing Limit 0 as one of their children
- Nested Inner Joins between 3 tables with one of them having a Limit 0 clause.
- Intersect query wherein one of the subqueries was a Limit 0 query.

Closes apache#24271 from aayushmaanjain/optimize-limit0.

Authored-by: Aayushmaan Jain <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
viirya and others added 28 commits May 13, 2019 22:37
## What changes were proposed in this pull request?

This PR adds the `max_by()` and `min_by()` SQL aggregate functions.

Quoting from the [Presto docs](https://prestodb.github.io/docs/current/functions/aggregate.html#max_by)

> max_by(x, y) → [same as x]
> Returns the value of x associated with the maximum value of y over all input values.

`min_by()` works similarly.
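
A hypothetical usage sketch (assumes a SparkSession `spark` and a registered view `sales(seller, amount)`):

```scala
// max_by returns the seller associated with the maximum amount; min_by works symmetrically.
spark.sql(
  "SELECT max_by(seller, amount) AS top_seller, min_by(seller, amount) AS bottom_seller FROM sales"
).show()
```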

## How was this patch tested?

Added tests.

Closes apache#24557 from viirya/SPARK-27653.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…sue(except the hive-thriftserver module)

## What changes were proposed in this pull request?

This PR fixes hadoop-3.2 test issues (except the `hive-thriftserver` module):
1. Add `hive.metastore.schema.verification` and `datanucleus.schema.autoCreateAll` to HiveConf.
2. hadoop-3.2 supports accessing the Hive metastore from 0.12 to 2.2.

After [SPARK-27176](https://issues.apache.org/jira/browse/SPARK-27176) and this PR, we upgraded the built-in Hive to 2.3 when enabling the Hadoop 3.2+ profile. This upgrade fixes the following issues:
- [HIVE-6727](https://issues.apache.org/jira/browse/HIVE-6727): Table level stats for external tables are set incorrectly.
- [HIVE-15653](https://issues.apache.org/jira/browse/HIVE-15653): Some ALTER TABLE commands drop table stats.
- [SPARK-12014](https://issues.apache.org/jira/browse/SPARK-12014): Spark SQL query containing semicolon is broken in Beeline.
- [SPARK-25193](https://issues.apache.org/jira/browse/SPARK-25193): insert overwrite doesn't throw exception when drop old data fails.
- [SPARK-25919](https://issues.apache.org/jira/browse/SPARK-25919): Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned.
- [SPARK-26332](https://issues.apache.org/jira/browse/SPARK-26332): Spark sql write orc table on viewFS throws exception.
- [SPARK-26437](https://issues.apache.org/jira/browse/SPARK-26437): Decimal data becomes bigint to query, unable to query.

## How was this patch tested?
This PR tests Spark's Hadoop 3.2 profile on Jenkins, and apache#24591 tests Spark's Hadoop 2.7 profile on Jenkins.

This PR closes apache#24591

Closes apache#24391 from wangyum/SPARK-27402.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request?

Currently, when there is a null in a nested field of a struct, casting from the struct throws an error.

```scala
scala> sql("select cast(struct(1, null) as struct<a:int,b:int>)").show
scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$)
  at org.apache.spark.sql.catalyst.expressions.Cast.castToInt(Cast.scala:447)
  at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:635)
  at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castStruct$1(Cast.scala:603)
```

Similarly, an inline table, which casts nulls in nested fields under the hood, also throws an error.

```scala
scala> sql("select * FROM VALUES (('a', (10, null))), (('b', (10, 50))), (('c', null)) AS tab(x, y)").show
org.apache.spark.sql.AnalysisException: failed to evaluate expression named_struct('col1', 10, 'col2', NULL): NullType (of class org.apache.spark.sql.t
ypes.NullType$); line 1 pos 14
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
  at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$convert$6(ResolveInlineTables.scala:106)
```

This fixes the issue.

## How was this patch tested?

Added tests.

Closes apache#24576 from viirya/cast-null.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…s with dates/timestamps

## What changes were proposed in this pull request?

The example below works with both MySQL and Hive, but not with Spark.

```
mysql> select * from date_test where date_col >= '2000-1-1';
+------------+
| date_col   |
+------------+
| 2000-01-01 |
+------------+
```
The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420.

Based on some tests, the behavior of date and string comparison in Hive and MySQL is:
Hive: cast to Date; partial dates are not supported.
MySQL: cast to Date; certain "partial dates" are supported by defining certain date string parse rules. Check out str_to_datetime in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c

As the date patterns below are supported, this PR casts the string to a date when comparing a string and a date:
```
`yyyy`
`yyyy-[m]m`
`yyyy-[m]m-[d]d`
`yyyy-[m]m-[d]d `
`yyyy-[m]m-[d]d *`
`yyyy-[m]m-[d]dT*`
```
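
A hedged sketch of the user-facing effect (assumes a SparkSession `spark` and a table `date_test(date_col DATE)` as in the MySQL example above):

```scala
// With this change the string literal is cast to DateType for the comparison,
// so partial dates such as '2000-1-1' behave as they do in MySQL and Hive.
spark.sql("SELECT * FROM date_test WHERE date_col >= '2000-1-1'").show()
```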

## How was this patch tested?
UT has been added

Closes apache#24567 from pengbo/SPARK-27638.

Authored-by: mingbo.pb <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…and other resources

## What changes were proposed in this pull request?

Add in GPU and generic resource type allocation to the executors.

Note this is part of a bigger feature for GPU-aware scheduling and is just how the executor finds the resources. The general flow:

- Users ask for a certain set of resources, for instance a number of GPUs; each cluster manager has a specific way to do this.
- The cluster manager allocates a container or set of resources (standalone mode).
- When Spark launches the executor in that container, the executor either has to be told what resources it has or it has to auto-discover them.
- The executor has to register with the Driver and tell it the set of resources it has, so the scheduler can use that to schedule tasks that require a certain amount of each of those resources.

In this PR I added configs and arguments to the executor to be able to discover resources. The argument to the executor is intended to be used by standalone mode or other cluster managers that don't have isolation, so that they can assign specific resources to specific executors in case there are multiple executors on a node. The argument is a file containing a JSON array of ResourceInformation objects.

The discovery script is meant to be used in an isolated environment where the executor only sees the resources it should use.

Note that there will be follow on PRs to add other parts like the scheduler part. See the epic high level jira: https://issues.apache.org/jira/browse/SPARK-24615

## How was this patch tested?

Added unit tests and manually tested.


Closes apache#24406 from tgravescs/gpu-sched-executor-clean.

Authored-by: Thomas Graves <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
… should search over formatted data rather than the raw data

The pull request to add datatables to the stage page (SPARK-21809) got merged. The search functionality in those datatables, while a great improvement for searching through a large number of tasks, performs the search over the raw data rather than the formatted data displayed in the tables. It would be great if the search could happen over the formatted data as well.

## What changes were proposed in this pull request?

Added code to enable searching over the displayed data in tables, e.g. searching on "165.7 MiB" or "0.3 ms" will now return results. Also, we were previously missing search for two columns in the task table, "Shuffle Read Bytes" and "Shuffle Remote Reads", which I have added here.

## How was this patch tested?

Manual Tests

Closes apache#24419 from pgandhi999/SPARK-25719.

Authored-by: pgandhi <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

This removes usage of `Traversable`, which is removed in Scala 2.13. This is mostly an internal change, except for the change in the `SparkConf.setAll` method. See additional comments below.

## How was this patch tested?

Existing tests.

Closes apache#24584 from srowen/SPARK-27680.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
….reset`

## What changes were proposed in this pull request?

We should remove the materialized view first; otherwise we get the following error (note that Hive 3.1 can reproduce this issue):
```scala
Cause: org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException: DELETE on table 'TBLS' caused a violation of foreign key constraint 'MV_TABLES_USED_FK2' for key (4).  The statement has been rolled back.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeBatchElement(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedStatement.executeLargeBatch(Unknown Source)
```

## How was this patch tested?

Existing test

Closes apache#24592 from wangyum/SPARK-27690.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

This adds a v2 implementation for CTAS queries

* Update the SQL parser to parse CREATE queries using multi-part identifiers
* Update `CheckAnalysis` to validate partitioning references with the CTAS query schema
* Add `CreateTableAsSelect` v2 logical plan and `CreateTableAsSelectExec` v2 physical plan
* Update create conversion from `CreateTableAsSelectStatement` to support the new v2 logical plan
* Update `DataSourceV2Strategy` to convert v2 CTAS logical plan to the new physical plan
* Add `findNestedField` to `StructType` to support reference validation

## How was this patch tested?

We have been running these changes in production for several months. Also:

* Add a test suite `CreateTablePartitioningValidationSuite` for new analysis checks
* Add a test suite for v2 SQL, `DataSourceV2SQLSuite`
* Update catalyst `DDLParserSuite` to use multi-part identifiers (`Seq[String]`)
* Add test cases to `PlanResolutionSuite` for v2 CTAS: known catalog and v2 source implementation

Closes apache#24570 from rdblue/SPARK-24923-add-v2-ctas.

Authored-by: Ryan Blue <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ation`.

## What changes were proposed in this pull request?

The `StructuredSessionization` comment contains a duplicate 'add'; I think it should be changed.

## How was this patch tested?

Existing UTs.

Closes apache#24589 from beliefer/remove-duplicate-add-in-comment.

Lead-authored-by: gengjiaan <[email protected]>
Co-authored-by: Jiaan Geng <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…to core

## What changes were proposed in this pull request?

`RecordBinaryComparator`, `UnsafeExternalRowSorter` and `UnsafeKeyValueSorter` currently live in catalyst and should be moved to core, as they're used only in the physical plan.

## How was this patch tested?

Existing tests.

Closes apache#24607 from xianyinxin/SPARK-27713.

Authored-by: xy_xin <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…hods that will be removed in Scala 2.13 with work-alikes

## What changes were proposed in this pull request?

This replaces use of collection classes like `MutableList` and `ArrayStack` with work-alikes that are available in 2.12, as they will be removed in 2.13. It also removes use of `.to[Collection]`, as its use was superfluous anyway. Removing `collection.breakOut` will have to wait until 2.13.

## How was this patch tested?

Existing tests

Closes apache#24586 from srowen/SPARK-27682.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
This feature allows proxy servers to identify the actual request user
using a request parameter, and performs access control checks against
that user instead of the authenticated user. Impersonation is only
allowed if the authenticated user is configured as an admin.

The request parameter used ("doAs") matches the one currently used by
Knox, but it should be easy to change / customize if different proxy
servers use a different way of identifying the original user.

Tested with updated unit tests and also with a live server behind Knox.

Closes apache#24582 from vanzin/SPARK-27678.

Authored-by: Marcelo Vanzin <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
…ent caching

## What changes were proposed in this pull request?

Kafka-related Spark parameters have to start with `spark.kafka.` and not with `spark.sql.`. Because of this I've renamed `spark.sql.kafkaConsumerCache.capacity`.

Since Kafka consumer caching is not documented, I've added documentation for it as well.

## How was this patch tested?

Existing + added unit test.

```
cd docs
SKIP_API=1 jekyll build
```
and manual webpage check.

Closes apache#24590 from gaborgsomogyi/SPARK-27687.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…n on BroadcastTimeout

## What changes were proposed in this pull request?

In the existing code, a broadcast execution timeout for the Future only causes a query failure, but the job running with the broadcast and the computation in the Future are not canceled. This wastes resources and slows down the other jobs. This PR tries to cancel both the running job and the running hashed relation construction thread.

## How was this patch tested?

Add new test suite `BroadcastExchangeExec`

Closes apache#24595 from jiangxb1987/SPARK-20774.

Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
…module to sql/hive-thriftserver/v1.2.1

## What changes were proposed in this pull request?

When we upgrade the built-in Hive to 2.3.4, the current `hive-thriftserver` module is not compatible because of Hive changes such as these:
1. [HIVE-12442](https://issues.apache.org/jira/browse/HIVE-12442) HiveServer2: Refactor/repackage HiveServer2's Thrift code so that it can be used in the tasks
2. [HIVE-12237](https://issues.apache.org/jira/browse/HIVE-12237) Use slf4j as logging facade
3. [HIVE-13169](https://issues.apache.org/jira/browse/HIVE-13169) HiveServer2: Support delegation token based connection when using http transport

So this PR moves the incompatible code to `sql/hive-thriftserver/v1.2.1` and copies it to `sql/hive-thriftserver/v2.3.4` for the next code review.

## How was this patch tested?

manual tests:
```
diff -urNa sql/hive-thriftserver/v1.2.1 sql/hive-thriftserver/v2.3.4
```

Closes apache#24282 from wangyum/SPARK-27354.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request?

This is a followup of apache#24012.

apache#20365 fixed this problem when the hint node is a root node. This PR fixes the problem for all cases.

## How was this patch tested?

a new test

Closes apache#24580 from cloud-fan/bug.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request?

Removed the unused `UnsafeKeyValueSorter.java` file.

## How was this patch tested?

Ran Compilation and UT locally.


Closes apache#24622 from shivusondur/jira27722.

Authored-by: shivusondur <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…in SS

## What changes were proposed in this pull request?

Some APIs in Structured Streaming require the user to specify an interval. Right now these APIs don't accept upper-case strings.

This PR adds a new method `fromCaseInsensitiveString` to `CalendarInterval` to support parsing upper-case strings, and fixes all APIs that need to parse an interval string.
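
A hypothetical example of the kind of API call this enables (assumes a streaming DataFrame `streamingDf`; illustrative only):

```scala
import org.apache.spark.sql.streaming.Trigger

// Upper-case interval strings such as "1 MINUTE" are now parsed case-insensitively.
val query = streamingDf.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("1 MINUTE"))
  .start()
```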

## How was this patch tested?

The new unit test.

Closes apache#24619 from zsxwing/SPARK-27735.

Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

Currently we have an analyzer rule, which resolves the output columns of data source v2 writing plans, to make sure the schema of the input query is compatible with the table.

However, not all data sources need this check. For example, the `NoopDataSource` doesn't care about the schema of input query at all.

This PR introduces a new table capability: ACCEPT_ANY_SCHEMA. If a table reports this capability, we skip resolving output columns for it during write.

Note that we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon.
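
A hypothetical sketch of a table opting out of the schema check (interface and package names follow later Spark releases and are assumptions here, not code from this PR):

```scala
import java.util
import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
import org.apache.spark.sql.types.StructType

// A sink that accepts whatever schema the input query has, so Spark can skip
// resolving the output columns for it during write.
class AnySchemaSinkTable extends Table {
  override def name(): String = "any_schema_sink"
  override def schema(): StructType = new StructType() // effectively ignored by the sink
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_WRITE, TableCapability.ACCEPT_ANY_SCHEMA)
}
```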

## How was this patch tested?

new test cases

Closes apache#24469 from cloud-fan/schema-check.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

This PR upgrades lz4-java from 1.5.1 to 1.6.0. Lz4-java is available at https://github.com/lz4/lz4-java.

Changes from 1.5.1:
- Upgraded LZ4 to 1.9.1. Updated the JNI bindings, except for the one for Linux/i386. Decompression speed is improved on amd64.
- Deprecated use of LZ4FastDecompressor of a native instance because the corresponding C API function is deprecated. See the release note of LZ4 1.9.0 for details. Updated javadoc accordingly.
- Changed the module name from org.lz4.lz4-java to org.lz4.java to avoid using - in the module name. (severn-everett, Oliver Eikemeier, Rei Odaira)
- Enabled build with Java 11. Note that the distribution is still built with Java 7. (Rei Odaira)

## How was this patch tested?

Existing tests.

Closes apache#24629 from kiszk/SPARK-27752.

Authored-by: Kazuaki Ishizaki <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…uet/ORC

## What changes were proposed in this pull request?

Currently, in `ParquetFilters` and `OrcFilters`, if the child predicates of an `Or` operator can't be entirely pushed down, the predicates are thrown away.
In fact, the conjunctive predicates under `Or` operators can be partially pushed down.
For example, say `a` and `b` are convertible while `c` can't be pushed down; the predicate
`a or (b and c)`
can be converted as
`(a or b) and (a or c)`
We can still push down `(a or b)`.
We fail to push down a disjunctive predicate only when one of its children is not even partially convertible.

This PR also improves the filter pushdown logic in `DataSourceV2Strategy`. With partial filter pushdown in the `Or` operator, the result of `pushedFilters()` might not exist in the mapping `translatedFilterToExpr`. To fix it, this PR changes the mapping `translatedFilterToExpr` to map leaf filter expressions to `sources.filter`, and later on rebuilds the whole expression with the mapping.
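
A minimal sketch of the rewrite described above (my own illustration, not the Spark implementation): `a OR (b AND c)` becomes `(a OR b) AND (a OR c)`, and only the fully convertible conjuncts are kept for pushdown.

```scala
// Toy predicate tree: a leaf is either convertible to a data source filter or not.
sealed trait Pred
case class Leaf(name: String, convertible: Boolean) extends Pred
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred

// Returns the largest pushable sub-predicate, if any.
def pushable(p: Pred): Option[Pred] = p match {
  case Leaf(_, ok) => if (ok) Some(p) else None
  case And(l, r) => (pushable(l), pushable(r)) match { // partial AND is still safe to push
    case (Some(a), Some(b)) => Some(And(a, b))
    case (Some(a), None)    => Some(a)
    case (None, Some(b))    => Some(b)
    case _                  => None
  }
  case Or(l, r) => // both sides must contribute something, otherwise drop the whole OR
    for (a <- pushable(l); b <- pushable(r)) yield Or(a, b)
}

// a OR (b AND c), where c is not convertible  ==>  push down (a OR b)
val example = Or(Leaf("a", convertible = true),
                 And(Leaf("b", convertible = true), Leaf("c", convertible = false)))
println(pushable(example)) // Some(Or(Leaf(a,true),Leaf(b,true)))
```
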
## How was this patch tested?

Unit test

Closes apache#24598 from gengliangwang/pushdownDisjunctivePredicates.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

This PR aims to update `zstd-jni` library to `1.4.0-1` which improves the `level 1 compression speed` performance by 6% in most scenarios. The following is the full release note.
- https://github.com/facebook/zstd/releases/tag/v1.4.0

## How was this patch tested?

Pass the Jenkins.

Closes apache#24632 from dongjoon-hyun/SPARK-27755.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…d on Windows OS

## What changes were proposed in this pull request?
If we set `hive.exec.stagingdir=.test-staging\tmp`, the staging directory is still `.hive-staging` on Windows OS.

Reasons for failure:
Test code:
```
 val path = new Path("C:\\test\\hivetable")
  println("path.toString: " + path.toString)
  println("path.toUri.getPath: " + path.toUri.getPath)
```

Output:
```
path.toString: C:/test/hivetable
path.toUri.getPath: /C:/test/hivetable
```
We can see that `path.toUri.getPath` has one more separator than `path.toString`, and the separator is '/', not '\'.
So `stagingPathName.stripPrefix(inputPathName).stripPrefix(File.separator).startsWith(".")` will return false.
## How was this patch tested?
1. Existing tests
2. Manual testing on Windows OS

Closes apache#24446 from 10110346/stagingdir.

Authored-by: liuxian <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
[SPARK-24985][SQL] Fix OOM in Full Outer Join in presence of data skew.

Change SortMergeJoinExec to use ExternalAppendOnlyUnsafeRowArray for Full Outer Join. This would spill data to disk if the buffered rows exceed a threshold, thus preventing OOM errors.

Change corresponding test case in JoinSuite.

[SPARK-24985][SQL] Clean up unused imports

[SPARK-24985][SQL] Remove `copy()` on row instance before adding to `ExternalAppendOnlyUnsafeRowArray`.

`ExternalAppendOnlyUnsafeRowArray` handles copy internally.