
Conversation


@ghost ghost commented Jan 8, 2017

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

JoshRosen and others added 30 commits November 30, 2016 14:47
…utors

## What changes were proposed in this pull request?

The method `TaskSchedulerImpl.runningTasksByExecutors()` accesses the mutable `executorIdToRunningTaskIds` map without proper synchronization. In addition, as markhamstra pointed out in #15986, the signature's use of parentheses is a little odd given that this is a pure getter method.

This patch fixes both issues.
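A hedged sketch of the shape of the fix (a simplified stand-in, not the exact `TaskSchedulerImpl` code): guard reads of the mutable map with a lock, and drop the parentheses since this is a pure getter.

```scala
import scala.collection.mutable

class SchedulerLike {
  private val executorIdToRunningTaskIds = mutable.HashMap[String, mutable.HashSet[Long]]()

  // Before: def runningTasksByExecutors(): Map[String, Int] read the map without a lock.
  // After: synchronized read, and no parentheses for a pure getter.
  def runningTasksByExecutors: Map[String, Int] = synchronized {
    executorIdToRunningTaskIds.toMap.map { case (exec, taskIds) => exec -> taskIds.size }
  }
}
```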

## How was this patch tested?

Covered by existing tests.

Author: Josh Rosen <[email protected]>

Closes #16073 from JoshRosen/runningTasksByExecutors-thread-safety.

(cherry picked from commit c51c772)
Signed-off-by: Andrew Or <[email protected]>
## What changes were proposed in this pull request?
API review for 2.1, except ```LSH``` related classes which are still under development.

## How was this patch tested?
Only doc changes, no new tests.

Author: Yanbo Liang <[email protected]>

Closes #16009 from yanboliang/spark-18318.

(cherry picked from commit 60022bf)
Signed-off-by: Joseph K. Bradley <[email protected]>
## What changes were proposed in this pull request?

For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow the entire row to be null; only its columns can be null. That's the reason we forbade users from using top-level null objects in #13469

However, if users wrap a non-flat type with `Option`, we may still try to encode a top-level null object to a row, which is not allowed.

This PR fixes this case and suggests that users wrap their type with `Tuple1` if they do want top-level null objects.
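For illustration, a hedged sketch of the workaround the error message now suggests; the `Person` case class and the local SparkSession setup are assumptions for the example, not part of the patch.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("tuple1-null-demo").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)

// Rejected: a top-level null of a non-flat type cannot be encoded into a row.
// Seq[Person](null).toDS()

// Suggested workaround: wrap the type in Tuple1 so the null lives in a nullable column.
val ds = Seq(Tuple1[Person](null)).toDS()
ds.printSchema()   // _1: struct<name: string, age: int>, nullable
```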

## How was this patch tested?

new test

Author: Wenchen Fan <[email protected]>

Closes #15979 from cloud-fan/option.

(cherry picked from commit f135b70)
Signed-off-by: Cheng Lian <[email protected]>
The problem exists because it's not possible to just concatenate encrypted
partition data from different spill files; currently each partition would
have its own initial vector to set up encryption, and the final merged file
should contain a single initial vector for each merged partition, otherwise
iterating over each record becomes really hard.

To fix that, UnsafeShuffleWriter now decrypts the partitions when merging,
so that the merged file contains a single initial vector at the start of
the partition data.

Because it's not possible to do that using the fast transferTo path, when
encryption is enabled UnsafeShuffleWriter will fall back to using file
streams when merging. It may be possible to use a hybrid approach when
using encryption, using an intermediate direct buffer when reading from
files and encrypting the data, but that's better left for a separate patch.

As part of the change I made DiskBlockObjectWriter take a SerializerManager
instead of a "wrap stream" closure, since that makes it easier to test the
code without having to mock SerializerManager functionality.

Tested with newly added unit tests (UnsafeShuffleWriterSuite for the write
side and ExternalAppendOnlyMapSuite for integration), and by running some
apps that failed without the fix.

Author: Marcelo Vanzin <[email protected]>

Closes #15982 from vanzin/SPARK-18546.

(cherry picked from commit 93e9d88)
Signed-off-by: Marcelo Vanzin <[email protected]>
…erver

## What changes were proposed in this pull request?

As `queryStatus` in StreamingQueryListener events was removed in #15954, parsing 2.0.2 Structured Streaming logs will throw the following error:

```
[info]   com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "queryStatus" (class org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent), not marked as ignorable (2 known properties: "id", "exception"])
[info]  at [Source: {"Event":"org.apache.spark.sql.streaming.StreamingQueryListener$QueryTerminatedEvent","queryStatus":{"name":"query-1","id":1,"timestamp":1480491532753,"inputRate":0.0,"processingRate":0.0,"latency":null,"sourceStatuses":[{"description":"FileStreamSource[file:/Users/zsx/stream]","offsetDesc":"#0","inputRate":0.0,"processingRate":0.0,"triggerDetails":{"latency.getOffset.source":"1","triggerId":"1"}}],"sinkStatus":{"description":"FileSink[/Users/zsx/stream2]","offsetDesc":"[#0]"},"triggerDetails":{}},"exception":null}; line: 1, column: 521] (through reference chain: org.apache.spark.sql.streaming.QueryTerminatedEvent["queryStatus"])
[info]   at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
[info]   at com.fasterxml.jackson.databind.DeserializationContext.reportUnknownProperty(DeserializationContext.java:839)
[info]   at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1045)
[info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1352)
[info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperties(BeanDeserializerBase.java:1306)
[info]   at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:453)
[info]   at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1099)
...
```

This PR just ignores such errors and adds a test to make sure we can read 2.0.2 logs.
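For context, here is a hedged sketch of the general Jackson mechanism at play; the event class below is a stand-in, not Spark's actual listener event, and the real patch handles this inside Spark's replay code rather than by reconfiguring a mapper.

```scala
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule

case class TerminatedEvent(id: Long, exception: String)

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
// Skip fields that no longer exist on the class, such as the removed "queryStatus".
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)

val json = """{"id": 1, "exception": null, "queryStatus": {"name": "query-1"}}"""
val event = mapper.readValue(json, classOf[TerminatedEvent])   // no UnrecognizedPropertyException
```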

## How was this patch tested?

`query-event-logs-version-2.0.2.txt` has all types of events generated by Structured Streaming in Spark 2.0.2. `testQuietly("ReplayListenerBus should ignore broken event jsons generated in 2.0.2")` verified we can load them without any error.

Author: Shixiong Zhu <[email protected]>

Closes #16085 from zsxwing/SPARK-18655.

(cherry picked from commit c4979f6)
Signed-off-by: Shixiong Zhu <[email protected]>
…e. Receiver data should be deserialized properly

## What changes were proposed in this pull request?

Fixed the potential SparkContext leak in `StreamingContextSuite.SPARK-18560 Receiver data should be deserialized properly` which was added in #16052. I also removed FakeByteArrayReceiver and used TestReceiver directly.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes #16091 from zsxwing/SPARK-18617-follow-up.

(cherry picked from commit 0a81121)
Signed-off-by: Reynold Xin <[email protected]>
…pport output original label.

## What changes were proposed in this pull request?

Similar to SPARK-18401, as a classification algorithm, logistic regression should support outputting the original label instead of the index label.

In this PR, original label output is supported, test cases are modified and added, and the documentation is updated.

## How was this patch tested?

Unit tests.

Author: [email protected] <[email protected]>

Closes #15910 from wangmiao1981/audit.

(cherry picked from commit 2eb6764)
Signed-off-by: Yanbo Liang <[email protected]>
…e cases

## What changes were proposed in this pull request?

Due to confusion between URI vs paths, in certain cases we escape partition values too many times, which causes some Hive client operations to fail or write data to the wrong location. This PR fixes at least some of these cases.

To my understanding, this is how values, filesystem paths, and URIs interact:
- Hive stores raw (unescaped) partition values that are returned to you directly when you call listPartitions.
- Internally, we convert these raw values to filesystem paths via `ExternalCatalogUtils.[un]escapePathName`.
- In some circumstances we store URIs instead of filesystem paths. When a path is converted to a URI via `path.toURI`, the escaped partition values are further URI-encoded. This means that to get a path back from a URI, you must call `new Path(new URI(uriTxt))` in order to decode the URI-encoded string.
- In `CatalogStorageFormat` we store URIs as strings. This makes it easy to forget to URI-decode the value before converting it into a path.
- Finally, the Hive client itself uses mostly Paths for representing locations, and only URIs occasionally.

In the future we should probably clean this up, perhaps by dropping use of URIs when unnecessary. We should also try fixing escaping for partition names as well as values, though names are unlikely to contain special characters.
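A hedged illustration of the extra encoding layer described above; the partition value and path are made up, and only `java.net.URI` behavior is shown.

```scala
import java.net.URI

val escaped = "2016-01-01 00%3A00"                        // value after path-escaping (':' -> %3A)
val uri = new URI("file", null, "/tbl/ds=" + escaped, null)

println(uri.toString)                  // file:/tbl/ds=2016-01-01%2000%253A00  ('%' re-encoded as %25)
println(new URI(uri.toString).getPath) // /tbl/ds=2016-01-01 00%3A00           (decode once to recover the path)
```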

cc mallman cloud-fan yhuai

## How was this patch tested?

Unit tests.

Author: Eric Liang <[email protected]>

Closes #16071 from ericl/spark-18635.

(cherry picked from commit 88f559f)
Signed-off-by: Wenchen Fan <[email protected]>
…rk.sql.unsafe.enabled

## What changes were proposed in this pull request?

`spark.sql.unsafe.enabled` has been deprecated since 1.6. There is still code in the UI that checks it. We should remove it and clean up the code.

## How was this patch tested?

Changes to related existing unit test.


Author: Liang-Chi Hsieh <[email protected]>

Closes #16095 from viirya/remove-deprecated-config-code.

(cherry picked from commit dbf842b)
Signed-off-by: Reynold Xin <[email protected]>
…ows Unrecognized option

## What changes were proposed in this pull request?

spark-daemon.sh loses the single quotes around arguments after #15338, as follows:
```
execute_command nice -n 0 bash /opt/cloudera/parcels/SPARK-2.1.0-cdh5.4.3.d20161129-21.04.38/lib/spark/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift JDBC/ODBC Server --conf spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp
```
With this fix, as follows:
```
execute_command nice -n 0 bash /opt/cloudera/parcels/SPARK-2.1.0-cdh5.4.3.d20161129-21.04.38/lib/spark/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name 'Thrift JDBC/ODBC Server' --conf 'spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp'
```

## How was this patch tested?

- Manual tests
- Build the package and start-thriftserver.sh with `--conf 'spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp'`

Author: Yuming Wang <[email protected]>

Closes #16079 from wangyum/SPARK-18645.

(cherry picked from commit 2ab8551)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

The current error message of USING join is quite confusing, for example:
```
scala> val df1 = List(1,2,3).toDS.withColumnRenamed("value", "c1")
df1: org.apache.spark.sql.DataFrame = [c1: int]

scala> val df2 = List(1,2,3).toDS.withColumnRenamed("value", "c2")
df2: org.apache.spark.sql.DataFrame = [c2: int]

scala> df1.join(df2, usingColumn = "c1")
org.apache.spark.sql.AnalysisException: using columns ['c1] can not be resolved given input columns: [c1, c2] ;;
'Join UsingJoin(Inner,List('c1))
:- Project [value#1 AS c1#3]
:  +- LocalRelation [value#1]
+- Project [value#7 AS c2#9]
   +- LocalRelation [value#7]
```

after this PR, it becomes:
```
scala> val df1 = List(1,2,3).toDS.withColumnRenamed("value", "c1")
df1: org.apache.spark.sql.DataFrame = [c1: int]

scala> val df2 = List(1,2,3).toDS.withColumnRenamed("value", "c2")
df2: org.apache.spark.sql.DataFrame = [c2: int]

scala> df1.join(df2, usingColumn = "c1")
org.apache.spark.sql.AnalysisException: USING column `c1` can not be resolved with the right join side, the right output is: [c2];
```

## How was this patch tested?

updated tests

Author: Wenchen Fan <[email protected]>

Closes #16100 from cloud-fan/natural.

(cherry picked from commit e653484)
Signed-off-by: Herman van Hovell <[email protected]>
## What changes were proposed in this pull request?
In `JavaWrapper`'s destructor, make the Java gateway dereference the wrapped object, using `SparkContext._active_spark_context._gateway.detach`.
Fix the parameter-copying bug by moving the `copy` method from `JavaModel` to `JavaParams`.

## How was this patch tested?
```python
import random, string
from pyspark.ml.feature import StringIndexer

l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
```
* Before: a strong reference to each StringIndexer was kept, causing GC issues, and the run halted midway.
* After: garbage collection works as the objects are dereferenced, and the computation completes.
* Memory footprint was verified using a profiler.
* Added a parameter-copy-related test which was failing before.

Author: Sandeep Singh <[email protected]>
Author: jkbradley <[email protected]>

Closes #15843 from techaddict/SPARK-18274.

(cherry picked from commit 78bb7f8)
Signed-off-by: Joseph K. Bradley <[email protected]>
…te. Receiver data should be deserialized properly

## What changes were proposed in this pull request?

Avoid to create multiple threads to stop StreamingContext. Otherwise, the latch added in #16091 can be passed too early.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes #16105 from zsxwing/SPARK-18617-2.

(cherry picked from commit 086b0c8)
Signed-off-by: Shixiong Zhu <[email protected]>
## What changes were proposed in this pull request?
We currently build 5 separate pip binary tarballs, doubling the release script runtime. It'd be better to build one, especially for use cases that are just using Spark locally. In the long run, it would make more sense to have Hadoop support be pluggable.

## How was this patch tested?
N/A - this is a release build script that doesn't have any automated test coverage. We will know if it goes wrong when we prepare releases.

Author: Reynold Xin <[email protected]>

Closes #16072 from rxin/SPARK-18639.

(cherry picked from commit 37e52f8)
Signed-off-by: Reynold Xin <[email protected]>
…of the JDBC RDD generated sql statement

## What changes were proposed in this pull request?

The SQL query generated for the JDBC data source does not quote columns in the predicate clause. When the source table has quoted column names, the Spark JDBC read incorrectly fails with a column-not-found error.

Error:
org.h2.jdbc.JdbcSQLException: Column "ID" not found;
Source SQL statement:
SELECT "Name","Id" FROM TEST."mixedCaseCols" WHERE (Id < 1)

This PR fixes this by quoting column names in the generated SQL for the predicate clause when filters are pushed down to the data source.

Source SQL statement after the fix:
SELECT "Name","Id" FROM TEST."mixedCaseCols" WHERE ("Id" < 1)

## How was this patch tested?

Added new test case to the JdbcSuite

Author: sureshthalamati <[email protected]>

Closes #15662 from sureshthalamati/filter_quoted_cols-SPARK-18141.

(cherry picked from commit 70c5549)
Signed-off-by: gatorsmile <[email protected]>
…DataFrameReader JDBC APIs

### What changes were proposed in this pull request?

#### This PR is to backport #15975 to Branch 2.1

---

The following two `DataFrameReader` JDBC APIs ignore the user-specified parameters of parallelism degree.

```Scala
  def jdbc(
      url: String,
      table: String,
      columnName: String,
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int,
      connectionProperties: Properties): DataFrame
```

```Scala
  def jdbc(
      url: String,
      table: String,
      predicates: Array[String],
      connectionProperties: Properties): DataFrame
```

This PR is to fix the issues. To verify the behavior correctness, we improve the plan output of `EXPLAIN` command by adding `numPartitions` in the `JDBCRelation` node.

Before the fix,
```
== Physical Plan ==
*Scan JDBCRelation(TEST.PEOPLE) [NAME#1896,THEID#1897] ReadSchema: struct<NAME:string,THEID:int>
```

After the fix,
```
== Physical Plan ==
*Scan JDBCRelation(TEST.PEOPLE) [numPartitions=3] [NAME#1896,THEID#1897] ReadSchema: struct<NAME:string,THEID:int>
```
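A usage sketch for the first API above, assuming an active `spark` session; the URL, table, and bounds are placeholders.

```scala
import java.util.Properties

val props = new Properties()
val df = spark.read.jdbc(
  url = "jdbc:h2:mem:testdb",
  table = "TEST.PEOPLE",
  columnName = "THEID",
  lowerBound = 1L,
  upperBound = 4L,
  numPartitions = 3,
  connectionProperties = props)

df.explain()                           // JDBCRelation(TEST.PEOPLE) [numPartitions=3] ...
assert(df.rdd.getNumPartitions == 3)   // the requested parallelism is honored after the fix
```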
### How was this patch tested?
Added the verification logics on all the test cases for JDBC concurrent fetching.

Author: gatorsmile <[email protected]>

Closes #16111 from gatorsmile/jdbcFix2.1.
## What changes were proposed in this pull request?

This PR makes `ExpressionEncoder.serializer.nullable` `false` for a flat encoder of a primitive type. It is currently `true`, which is too conservative:
while `ExpressionEncoder.schema` has the correct information (e.g. `<IntegerType, false>`), `serializer.head.nullable` of the `ExpressionEncoder` obtained from `encoderFor[T]` is always `true`.

This is accomplished by checking whether a type is one of the primitive types; if so, `nullable` is set to `false`.
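A hedged sketch of what the change means in practice; `ExpressionEncoder` is internal API, so the exact fields may differ slightly between versions.

```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

val intEncoder = ExpressionEncoder[Int]()
println(intEncoder.schema)                    // StructType(StructField(value, IntegerType, false))
println(intEncoder.serializer.head.nullable)  // false for primitive types after this patch
```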

## How was this patch tested?

Added new tests for encoder and dataframe

Author: Kazuaki Ishizaki <[email protected]>

Closes #15780 from kiszk/SPARK-18284.

(cherry picked from commit 38b9e69)
Signed-off-by: Wenchen Fan <[email protected]>
…erde table

## What changes were proposed in this pull request?

In Spark 2.1, we make Hive serde tables case-preserving by putting the table metadata in table properties, like what we did for data source tables. However, we should not put the table provider there, as it will break forward compatibility. E.g. if we create a Hive serde table with Spark 2.1, using `sql("create table test stored as parquet as select 1")`, we will fail to read it with Spark 2.0, as Spark 2.0 mistakenly treats it as a data source table because there is a `provider` entry in the table properties.

Logically, a Hive serde table's provider is always `hive`, so we don't need to store it in table properties; this PR removes it.

## How was this patch tested?

Manually tested the forward compatibility issue.

Author: Wenchen Fan <[email protected]>

Closes #16080 from cloud-fan/hive.

(cherry picked from commit a5f02b0)
Signed-off-by: Wenchen Fan <[email protected]>
…nary columns due to PARQUET-686

This PR targets to both master and branch-2.1.

## What changes were proposed in this pull request?

Due to PARQUET-686, Parquet doesn't do string comparison correctly while doing filter push-down for string columns. This PR disables filter push-down for both string and binary columns to work around this issue. Binary columns are also affected because some Parquet data models (like Hive) may store string columns as a plain Parquet `binary` instead of a `binary (UTF8)`.

## How was this patch tested?

New test case added in `ParquetFilterSuite`.

Author: Cheng Lian <[email protected]>

Closes #16106 from liancheng/spark-17213-bad-string-ppd.

(cherry picked from commit ca63916)
Signed-off-by: Reynold Xin <[email protected]>
…-catalog tables

## What changes were proposed in this pull request?

In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.

This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).

cc mallman  cloud-fan

## How was this patch tested?

Checked metrics in unit tests.

Author: Eric Liang <[email protected]>

Closes #16112 from ericl/spark-18679.

(cherry picked from commit 294163e)
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

Currently, `JDBCRelation.insert` removes Spark options too early by mistakenly using `asConnectionProperties`. Spark options like `numPartitions` should be passed into `DataFrameWriter.jdbc` correctly. This bug has been **hidden** because `JDBCOptions.asConnectionProperties` fails to filter out the mixed-case options. This PR aims to fix both.

**JDBCRelation.insert**
```scala
override def insert(data: DataFrame, overwrite: Boolean): Unit = {
  val url = jdbcOptions.url
  val table = jdbcOptions.table
- val properties = jdbcOptions.asConnectionProperties
+ val properties = jdbcOptions.asProperties
  data.write
    .mode(if (overwrite) SaveMode.Overwrite else SaveMode.Append)
    .jdbc(url, table, properties)
```

**JDBCOptions.asConnectionProperties**
```scala
scala> import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
scala> import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
scala> new JDBCOptions(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10")).asConnectionProperties
res0: java.util.Properties = {numpartitions=10}
scala> new JDBCOptions(new CaseInsensitiveMap(Map("url" -> "jdbc:mysql://localhost:3306/temp", "dbtable" -> "t1", "numPartitions" -> "10"))).asConnectionProperties
res1: java.util.Properties = {numpartitions=10}
```

## How was this patch tested?

Pass the Jenkins with a new testcase.

Author: Dongjoon Hyun <[email protected]>

Closes #15863 from dongjoon-hyun/SPARK-18419.

(cherry picked from commit 55d528f)
Signed-off-by: Wenchen Fan <[email protected]>
…rce tables

## What changes were proposed in this pull request?

Two bugs are addressed here
1. INSERT OVERWRITE TABLE sometimes crashed when catalog partition management was enabled. This was because, when dropping partitions after an overwrite operation, the Hive client would attempt to delete the partition files. If the entire partition directory had already been dropped, this would fail. The PR fixes this by adding a flag to control whether the Hive client should attempt to delete files.
2. The static partition spec for OVERWRITE TABLE was not correctly resolved to the case-sensitive original partition names. This resulted in the entire table being overwritten if you did not correctly capitalize your partition names.

cc yhuai cloud-fan

## How was this patch tested?

Unit tests. Surprisingly, the existing overwrite table tests did not catch these edge cases.

Author: Eric Liang <[email protected]>

Closes #16088 from ericl/spark-18659.

(cherry picked from commit 7935c84)
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Added a test case for using joins with nested fields.

### How was this patch tested?
N/A

Author: gatorsmile <[email protected]>

Closes #16110 from gatorsmile/followup-18674.

(cherry picked from commit 2f8776c)
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

This fixes the parser rule to match named expressions, which doesn't work for two reasons:
1. The name match is not coerced to a regular expression (missing .r)
2. The surrounding literals are incorrect and attempt to escape a single quote, which is unnecessary

## How was this patch tested?

This adds test cases for named expressions using the bracket syntax, including one with quoted spaces.

Author: Ryan Blue <[email protected]>

Closes #16107 from rdblue/SPARK-18677-fix-json-path.

(cherry picked from commit 4877897)
Signed-off-by: Herman van Hovell <[email protected]>
…m predict should output original label when family = binomial."

## What changes were proposed in this pull request?
It's better to fix this issue by providing an option `type` for users to change the `predict` output schema; then they could output probabilities, log-space predictions, or original labels. In order not to introduce a breaking API change in 2.1, revert this change first and add it back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) is resolved.

## How was this patch tested?
Existing unit tests.

This reverts commit daa975f.

Author: Yanbo Liang <[email protected]>

Closes #16118 from yanboliang/spark-18291-revert.

(cherry picked from commit a985dd8)
Signed-off-by: Joseph K. Bradley <[email protected]>
…ogressEvent when there is no data

## What changes were proposed in this pull request?

This PR adds a SQL conf `spark.sql.streaming.noDataReportInterval` to control how long to wait before outputting the next StreamProgressEvent when there is no data.
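A small usage sketch, assuming an active `spark` session; the interval value is arbitrary.

```scala
// Wait up to 10 seconds of data-less triggers before emitting the next progress event.
spark.conf.set("spark.sql.streaming.noDataReportInterval", "10s")
```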

## How was this patch tested?

The added unit test.

Author: Shixiong Zhu <[email protected]>

Closes #16108 from zsxwing/SPARK-18670.

(cherry picked from commit 56a503d)
Signed-off-by: Tathagata Das <[email protected]>
…2.1 release

## What changes were proposed in this pull request?
Update ML programming and migration guide for 2.1 release.

## How was this patch tested?
Doc change, no test.

Author: Yanbo Liang <[email protected]>

Closes #16076 from yanboliang/spark-18324.

(cherry picked from commit 2dc0d7e)
Signed-off-by: Joseph K. Bradley <[email protected]>
## What changes were proposed in this pull request?

Makes `Window.unboundedPreceding` and `Window.unboundedFollowing` backward compatible.

## How was this patch tested?

Pyspark SQL unittests.


Author: zero323 <[email protected]>

Closes #16123 from zero323/SPARK-17845-follow-up.

(cherry picked from commit a9cbfc4)
Signed-off-by: Reynold Xin <[email protected]>
…ests at ExecutorClassLoaderSuite

## What changes were proposed in this pull request?

This PR fixes two problems as below:

- Close `BufferedSource` after `Source.fromInputStream(...)` to release the resource and make the tests pass on Windows in `ExecutorClassLoaderSuite` (a sketch of this fix follows the error logs below)

  ```
  [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, 333 milliseconds)
  [info]   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2
  [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
  [info]   at org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76)
  [info]   at org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
  ...
  ```

- Fix the URI construction so that the related tests pass on Windows.

  ```
  [info] - child first *** FAILED *** (78 milliseconds)
  [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
  [info]   at java.net.URI$Parser.fail(URI.java:2848)
  [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
  ...
  [info] - parent first *** FAILED *** (15 milliseconds)
  [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
  [info]   at java.net.URI$Parser.fail(URI.java:2848)
  [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
  ...
  [info] - child first can fall back *** FAILED *** (0 milliseconds)
  [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
  [info]   at java.net.URI$Parser.fail(URI.java:2848)
  [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
  ...
  [info] - child first can fail *** FAILED *** (0 milliseconds)
  [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
  [info]   at java.net.URI$Parser.fail(URI.java:2848)
  [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
  ...
  [info] - resource from parent *** FAILED *** (0 milliseconds)
  [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
  [info]   at java.net.URI$Parser.fail(URI.java:2848)
  [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
  ...
  [info] - resources from parent *** FAILED *** (0 milliseconds)
  [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
  [info]   at java.net.URI$Parser.fail(URI.java:2848)
  [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
  ```
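A hedged sketch of the first fix above: keep a handle on the `BufferedSource` returned by `Source.fromInputStream(...)` and close it, so the underlying stream is released and the temporary directories can be deleted on Windows. The file path is a placeholder.

```scala
import java.io.FileInputStream
import scala.io.Source

val in = new FileInputStream("/tmp/ReplTestClass.class")   // placeholder path
val source = Source.fromInputStream(in)
try {
  val bytes = source.mkString
} finally {
  source.close()   // previously the source was left open, keeping the temp dir locked
}
```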

## How was this patch tested?

Manually tested via AppVeyor.

**Before**
https://ci.appveyor.com/project/spark-test/spark/build/102-rpel-ExecutorClassLoaderSuite

**After**
https://ci.appveyor.com/project/spark-test/spark/build/108-rpel-ExecutorClassLoaderSuite

Author: hyukjinkwon <[email protected]>

Closes #16116 from HyukjinKwon/close-after-open.

(cherry picked from commit d1312fb)
Signed-off-by: Sean Owen <[email protected]>
…ted subqueries

## What changes were proposed in this pull request?

This fix adds an explicit list of operators that Spark supports for correlated subqueries.

## How was this patch tested?

Run sql/test, catalyst/test and add a new test case on Generate.

Author: Nattavut Sutyanyong <[email protected]>

Closes #16046 from nsyca/spark18455.0.

(cherry picked from commit 4a3c096)
Signed-off-by: Herman van Hovell <[email protected]>
jerryshao and others added 7 commits January 23, 2017 13:37
…pection occurred

## What changes were proposed in this pull request?

In `DiskBlockObjectWriter`, when an error happens during writing, `revertPartialWritesAndClose` is called; if this method itself fails due to issues such as running out of disk space, it throws an exception without resetting the state of the writer, and the revert is skipped. This PR fixes the issue to offer the user a chance to recover from such failures.
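A hedged, self-contained sketch of the pattern the fix moves toward; this is not the actual `DiskBlockObjectWriter` code, and all names below are illustrative.

```scala
class PartialWriter {
  private var streamOpen = true
  private var reportedBytes = 0L

  private def truncateToLastCommitted(): Unit = {
    // may itself fail, e.g. when the disk is full
  }

  def revertPartialWritesAndClose(): Unit = {
    try {
      truncateToLastCommitted()
    } finally {
      // Always reset bookkeeping, even if the truncation above threw,
      // so the caller still has a chance to recover with a clean writer.
      streamOpen = false
      reportedBytes = 0L
    }
  }
}
```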

## How was this patch tested?

Existing test.

Author: jerryshao <[email protected]>

Closes #16657 from jerryshao/SPARK-19306.

(cherry picked from commit e497472)
Signed-off-by: Marcelo Vanzin <[email protected]>
…ressions that require equality comparison between ScalaUDF

## What changes were proposed in this pull request?

Currently, running the following code in Java

```java
spark.udf().register("inc", new UDF1<Long, Long>() {
  @Override
  public Long call(Long i) {
    return i + 1;
  }
}, DataTypes.LongType);

spark.range(10).toDF("x").createOrReplaceTempView("tmp");
Row result = spark.sql("SELECT inc(x) FROM tmp GROUP BY inc(x)").head();
Assert.assertEquals(7, result.getLong(0));
```

fails as below:

```
org.apache.spark.sql.AnalysisException: expression 'tmp.`x`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [UDF(x#19L)], [UDF(x#19L) AS UDF(x)#23L]
+- SubqueryAlias tmp, `tmp`
   +- Project [id#16L AS x#19L]
      +- Range (0, 10, step=1, splits=Some(8))

	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
```

The root cause is that we were creating the function every time it needed to be built, as below:

```scala
scala> def inc(i: Int) = i + 1
inc: (i: Int)Int

scala> (inc(_: Int)).hashCode
res15: Int = 1231799381

scala> (inc(_: Int)).hashCode
res16: Int = 2109839984

scala> (inc(_: Int)) == (inc(_: Int))
res17: Boolean = false
```

This seems to lead to comparison failures between `ScalaUDF`s created from the Java UDF API, for example in `Expression.semanticEquals`.

The Scala API already seems fine in this respect.

Both can be tested easily as below if any reviewer is more comfortable with Scala:

```scala
val df = Seq((1, 10), (2, 11), (3, 12)).toDF("x", "y")
val javaUDF = new UDF1[Int, Int]  {
  override def call(i: Int): Int = i + 1
}
// spark.udf.register("inc", javaUDF, IntegerType) // Uncomment this for Java API
// spark.udf.register("inc", (i: Int) => i + 1)    // Uncomment this for Scala API
df.createOrReplaceTempView("tmp")
spark.sql("SELECT inc(y) FROM tmp GROUP BY inc(y)").show()
```

## How was this patch tested?

Unit test in `JavaUDFSuite.java` and `./dev/lint-java`.

Author: hyukjinkwon <[email protected]>

Closes #16553 from HyukjinKwon/SPARK-9435.

(cherry picked from commit e576c1e)
Signed-off-by: gatorsmile <[email protected]>
…ries

## What changes were proposed in this pull request?

As adaptive query execution may change the number of partitions in different batches, it may break streaming queries. Hence, we should disallow this feature in Structured Streaming.
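A hedged sketch of the kind of guard this adds; `spark.sql.adaptive.enabled` is the standard adaptive-execution toggle, but the exception type, message, and location of the check are assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Refuse to start a streaming query when adaptive query execution is enabled,
// since changing partition counts between batches would break the query.
def assertNoAdaptiveExecution(spark: SparkSession): Unit = {
  val enabled = spark.conf.get("spark.sql.adaptive.enabled", "false").toBoolean
  if (enabled) {
    throw new UnsupportedOperationException(
      "Adaptive query execution is not supported in Structured Streaming.")
  }
}
```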

## How was this patch tested?

`test("SPARK-19268: Adaptive query execution should be disallowed")`.

Author: Shixiong Zhu <[email protected]>

Closes #16683 from zsxwing/SPARK-19268.

(cherry picked from commit 60bd91a)
Signed-off-by: Shixiong Zhu <[email protected]>
## What changes were proposed in this pull request?

Support for
```
df[[myname]] <- 1
df[[2]] <- df$eruptions
```

## How was this patch tested?

manual tests, unit tests

Author: Felix Cheung <[email protected]>

Closes #16663 from felixcheung/rcolset.

(cherry picked from commit f27e024)
Signed-off-by: Felix Cheung <[email protected]>
[SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case where no children exist in updateAssignments

## What changes were proposed in this pull request?

Fix a bug in which BisectingKMeans fails with error:
```
java.util.NoSuchElementException: key not found: 166
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:58)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
        at scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
        at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
        at scala.collection.immutable.List.foldLeft(List.scala:84)
        at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
        at scala.collection.immutable.List.reduceLeft(List.scala:84)
        at scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231)
        at scala.collection.AbstractTraversable.minBy(Traversable.scala:105)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
```

## How was this patch tested?

The dataset was run against the code change to verify that the code works.  I will try to add unit tests to the code.


Author: Ilya Matiach <[email protected]>

Closes #16355 from imatiach-msft/ilmat/fix-kmeans.
…rn incorrect results

## What changes were proposed in this pull request?

This PR fixes the code in the Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in the rule `RewritePredicateSubquery`.

Example: the query

```sql
select a1, b1
from   t1
where  (a1,b1) not in (select a2,b2
                       from   t2);
```

has the `(a1, b1) = (a2, b2)` null-aware predicate rewritten from (before this fix):

```
Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))
```

to (after this fix):

```
Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))
```

## How was this patch tested?

sql/test, catalyst/test and new test cases in SQLQueryTestSuite.

Author: Nattavut Sutyanyong <[email protected]>

Closes #16467 from nsyca/19017.

(cherry picked from commit cdb691e)
Signed-off-by: Herman van Hovell <[email protected]>
## What changes were proposed in this pull request?

### Before
![_streaming_before](https://cloud.githubusercontent.com/assets/15843379/22181462/1e45c20c-e0c8-11e6-831c-8bf69722a4ee.png)

### After
![_streaming_after](https://cloud.githubusercontent.com/assets/15843379/22181464/23f38a40-e0c8-11e6-9a87-e27b1ffb1935.png)

## How was this patch tested?

Manually

Author: Liwei Lin <[email protected]>

Closes #16673 from lw-lin/streaming.

(cherry picked from commit 40a4cfc)
Signed-off-by: Shixiong Zhu <[email protected]>
@uncleGen (Contributor) commented

mistakenly opened? close it please!

aokolnychyi and others added 19 commits January 24, 2017 22:13
## What changes were proposed in this pull request?

- A separate subsection for Aggregations under “Getting Started” in the Spark SQL programming guide. It mentions which aggregate functions are predefined and how users can create their own.
- Examples of using the `UserDefinedAggregateFunction` abstract class for untyped aggregations in Java and Scala.
- Examples of using the `Aggregator` abstract class for type-safe aggregations in Java and Scala.
- Python is not covered.
- The PR might not resolve the ticket since I do not know what exactly was planned by the author.

In total, there are four new standalone examples that can be executed via `spark-submit` or `run-example`. The updated Spark SQL programming guide references these examples and does not contain hard-coded snippets.
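To give a flavor of the type-safe `Aggregator` examples described above, here is a condensed Scala sketch; the class and column names are illustrative and not necessarily those used in the new examples.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  def zero: Average = Average(0L, 0L)
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  def bufferEncoder: Encoder[Average] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage on a Dataset[Employee]:
//   ds.select(MyAverage.toColumn.name("average_salary"))
```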

## How was this patch tested?

The patch was tested locally by building the docs. The examples were run as well.

![image](https://cloud.githubusercontent.com/assets/6235869/21292915/04d9d084-c515-11e6-811a-999d598dffba.png)

Author: aokolnychyi <[email protected]>

Closes #16329 from aokolnychyi/SPARK-16046.

(cherry picked from commit 3fdce81)
Signed-off-by: gatorsmile <[email protected]>
That method is prone to stack overflows when the input map is really
large; instead, use plain "map". Also includes a unit test that
reproduces the stack overflow without the fix.

Author: Marcelo Vanzin <[email protected]>

Closes #16667 from vanzin/SPARK-18750.

(cherry picked from commit 76db394)
Signed-off-by: Tom Graves <[email protected]>
…in a subquery does not yield an error

## What changes were proposed in this pull request?
This PR will report proper error messages when a subquery expression contains an invalid plan. This problem is fixed by calling CheckAnalysis for the plan inside a subquery.

## How was this patch tested?
Existing tests and two new test cases on two forms of subquery, namely scalar subqueries and in/exists subqueries.

````
-- TC 01.01
-- The column t2b in the SELECT of the subquery is invalid
-- because it is neither an aggregate function nor a GROUP BY column.
select t1a, t2b
from   t1, t2
where  t1b = t2c
and    t2b = (select max(avg)
              from   (select   t2b, avg(t2b) avg
                      from     t2
                      where    t2a = t1.t1b
                     )
             )
;

-- TC 01.02
-- Invalid due to the column t2b not part of the output from table t2.
select *
from   t1
where  t1a in (select   min(t2a)
               from     t2
               group by t2c
               having   t2c in (select   max(t3c)
                                from     t3
                                group by t3b
                                having   t3b > t2b ))
;
````

Author: Nattavut Sutyanyong <[email protected]>

Closes #16572 from nsyca/18863.

(cherry picked from commit f1ddca5)
Signed-off-by: Herman van Hovell <[email protected]>
…ext.

The code was failing to propagate the user conf in the case where the
JVM was already initialized, which happens when a user submits a
python script via spark-submit.

Tested with new unit test and by running a python script in a real cluster.

Author: Marcelo Vanzin <[email protected]>

Closes #16682 from vanzin/SPARK-19307.

(cherry picked from commit 92afaa9)
Signed-off-by: Marcelo Vanzin <[email protected]>
…branch.

Author: Marcelo Vanzin <[email protected]>

Closes #16704 from vanzin/SPARK-18750_2.1.
## What changes were proposed in this pull request?

Fix installation of the mllib and ml sub-components, and more eagerly clean up cache files during the test script & make-distribution.

## How was this patch tested?

Updated sanity test script to import mllib and ml sub-components.

Author: Holden Karau <[email protected]>

Closes #16465 from holdenk/SPARK-19064-fix-pip-install-sub-components.

(cherry picked from commit 965c82d)
Signed-off-by: Holden Karau <[email protected]>
## What changes were proposed in this pull request?

EdgeRDD/VertexRDD override checkpoint() and isCheckpointed() to forward them to the internal partitionsRDD. So when checkpoint() is called on them, it's the partitionsRDD that actually gets checkpointed. However, since isCheckpointed() is also overridden to call partitionsRDD.isCheckpointed, EdgeRDD/VertexRDD.isCheckpointed returns true even though the RDD itself is not actually checkpointed.

This would have been fine, except that the RDD's internal logic for computing the RDD depends on isCheckpointed(). So for VertexRDD/EdgeRDD, since isCheckpointed is true, Spark tries at compute time to read checkpoint data of VertexRDD/EdgeRDD even though they are not actually checkpointed. Through a convoluted sequence of call forwarding, it reads the checkpoint data of partitionsRDD and tries to cast it to the types in Vertex/EdgeRDD. This leads to a ClassCastException.

The minimal fix that does not change any public behavior is to modify the RDD internals to not use the public, overridable API for internal logic.
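A hedged sketch of the pattern described above; all names are illustrative stand-ins, not the actual RDD internals. Internal logic consults a non-overridable check instead of the public `isCheckpointed()`, so a subclass that forwards the public method elsewhere cannot confuse the base class.

```scala
abstract class BaseRDDLike {
  private var checkpointData: Option[String] = None   // stand-in for real checkpoint state

  // Public API; subclasses such as VertexRDD/EdgeRDD may override this to
  // forward to an internal partitionsRDD.
  def isCheckpointed: Boolean = checkpointData.isDefined

  // Internal logic uses the non-overridable check on this RDD's own state.
  private def isCheckpointedInternally: Boolean = checkpointData.isDefined

  final def computeOrReadCheckpoint(): Unit = {
    if (isCheckpointedInternally) {
      // read this RDD's own checkpoint data
    } else {
      // compute normally
    }
  }
}
```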
## How was this patch tested?

New unit tests.

Author: Tathagata Das <[email protected]>

Closes #15396 from tdas/SPARK-14804.

(cherry picked from commit 47d5d0d)
Signed-off-by: Tathagata Das <[email protected]>
## What changes were proposed in this pull request?
This PR adds a variable for a UDF name in `ScalaUDF`.
Then, if the variable is filled, `DataFrame#explain` prints the name.
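A usage sketch, assuming an active `spark` session; the plan text shown in the comments is illustrative.

```scala
spark.udf.register("plusOne", (i: Long) => i + 1)
val df = spark.range(3).selectExpr("plusOne(id)")
df.explain()
// Before this change the plan shows an anonymous UDF(id#0L);
// after it, the registered name appears, e.g. UDF:plusOne(id#0L).
```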

## How was this patch tested?
Added a test in `UDFSuite`.

Author: Takeshi YAMAMURO <[email protected]>

Closes #16707 from maropu/SPARK-19338.

(cherry picked from commit 9f523d3)
Signed-off-by: gatorsmile <[email protected]>
…h-2.1)

The redirect handler was installed only for the root of the server;
any other context ended up being served directly through the HTTP
port. Since every sub page (e.g. application UIs in the history
server) is a separate servlet context, this meant that everything
but the root was still accessible via HTTP.

The change adds separate names to each connector, and binds contexts
to specific connectors so that content is only served through the
HTTPS connector when it's enabled. In that case, the only thing that
binds to the HTTP connector is the redirect handler.

Tested with new unit tests and by checking a live history server.

(cherry picked from commit d3dcb63)

Author: Marcelo Vanzin <[email protected]>

Closes #16711 from vanzin/SPARK-19220_2.1.
## What changes were proposed in this pull request?

With doc to say this would convert DF into RDD

## How was this patch tested?

unit tests, manual tests

Author: Felix Cheung <[email protected]>

Closes #16668 from felixcheung/rgetnumpartitions.

(cherry picked from commit 90817a6)
Signed-off-by: Felix Cheung <[email protected]>
## What changes were proposed in this pull request?

add header

## How was this patch tested?

Manual run to check vignettes html is created properly

Author: Felix Cheung <[email protected]>

Closes #16709 from felixcheung/rfilelicense.

(cherry picked from commit 385d738)
Signed-off-by: Felix Cheung <[email protected]>
…parkR

## What changes were proposed in this pull request?

This mostly affects running a job from the driver in client mode when results are expected to come through stdout (which should be somewhat rare, but possible).

Before:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
NULL
```

After:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
count#11L
NULL
```

Now, `column.explain()` doesn't seem very useful (we can get more extensive output with `DataFrame.explain()`), but there are other, more complex examples with calls to `println` on the Scala/JVM side whose output is getting dropped.

## How was this patch tested?

manual

Author: Felix Cheung <[email protected]>

Closes #16670 from felixcheung/rjvmstdout.

(cherry picked from commit a7ab6f9)
Signed-off-by: Shivaram Venkataraman <[email protected]>
### What changes were proposed in this pull request?
JDBC options are case-insensitive after PR #15884 was merged into Spark 2.1.

### How was this patch tested?
N/A

Author: gatorsmile <[email protected]>

Closes #16734 from gatorsmile/fixDocCaseInsensitive.

(cherry picked from commit c0eda7e)
Signed-off-by: gatorsmile <[email protected]>
### What changes were proposed in this pull request?
Currently, the function `to_json` allows users to provide options for generating JSON. However, it does not pass them to `JacksonGenerator`, so the user-provided options are ignored. This PR fixes that. Below is an example.

```Scala
val df = Seq(Tuple1(Tuple1(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0")))).toDF("a")
val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm")
df.select(to_json($"a", options)).show(false)
```
The current output is like
```
+--------------------------------------+
|structtojson(a)                       |
+--------------------------------------+
|{"_1":"2015-08-26T18:00:00.000-07:00"}|
+--------------------------------------+
```

After the fix, the output is like
```
+-------------------------+
|structtojson(a)          |
+-------------------------+
|{"_1":"26/08/2015 18:00"}|
+-------------------------+
```
### How was this patch tested?
Added test cases for both `from_json` and `to_json`

Author: gatorsmile <[email protected]>

Closes #16745 from gatorsmile/toJson.

(cherry picked from commit f9156d2)
Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request?

backport #16721 to branch-2.1

## How was this patch tested?

manual

Author: Felix Cheung <[email protected]>

Closes #16749 from felixcheung/rsubsetdocbackport.
…trics even if there is no new data in trigger

In Structured Streaming, if a trigger is skipped because no new data arrived, we suddenly report nothing for the `stateOperator` metrics. We could, however, easily report the metrics from `lastExecution` to ensure continuity of metrics.

Regression test in `StreamingQueryStatusAndProgressSuite`

Author: Burak Yavuz <[email protected]>

Closes #16716 from brkyvz/state-agg.

(cherry picked from commit 081b7ad)
Signed-off-by: Tathagata Das <[email protected]>
## What changes were proposed in this pull request?
Fix broken links in ml-pipeline and ml-tuning
`<div data-lang="scala">`  ->   `<div data-lang="scala" markdown="1">`

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <[email protected]>

Closes #16754 from zhengruifeng/doc_api_fix.

(cherry picked from commit 04ee8cf)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

The killed status was not copied when building the newTaskInfo object, which drops unnecessary details to reduce memory usage. This patch copies the killed status into the newTaskInfo object, which corrects the status shown in the Web UI from the wrong status to KILLED.

## How was this patch tested?

Current behaviour of displaying tasks in stage UI page,

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|143	|10	|0	|SUCCESS	|NODE_LOCAL	|6 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		| |0.0 B / 0	|TaskKilled (killed intentionally)|
|156	|11	|0	|SUCCESS	|NODE_LOCAL	|5 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		| |0.0 B / 0	|TaskKilled (killed intentionally)|

Web UI display after applying the patch,

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|143	|10	|0	|KILLED	|NODE_LOCAL	|6 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		|  | 0.0 B / 0	| TaskKilled (killed intentionally)|
|156	|11	|0	|KILLED	|NODE_LOCAL	|5 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		|  |0.0 B / 0	| TaskKilled (killed intentionally)|

Author: Devaraj K <[email protected]>

Closes #16725 from devaraj-kavali/SPARK-19377.

(cherry picked from commit df4a27c)
Signed-off-by: Shixiong Zhu <[email protected]>
## What changes were proposed in this pull request?

When the connection times out, `ask` may fail with a confusing message:

```
17/02/01 23:15:19 INFO Worker: Connecting to master ...
java.lang.IllegalArgumentException: requirement failed: TransportClient has not yet been set.
        at scala.Predef$.require(Predef.scala:224)
        at org.apache.spark.rpc.netty.RpcOutboxMessage.onTimeout(Outbox.scala:70)
        at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:232)
        at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:231)
        at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
        at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
```

It's better to provide a meaningful message.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes #16773 from zsxwing/connect-timeout.

(cherry picked from commit 8303e20)
Signed-off-by: Shixiong Zhu <[email protected]>
srowen added a commit to srowen/spark that referenced this pull request Feb 2, 2017
@srowen srowen mentioned this pull request Feb 2, 2017
@asfgit asfgit closed this in 20b4ca1 Feb 3, 2017
