@avinashkolla

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

srowen and others added 30 commits August 7, 2016 12:20
## What changes were proposed in this pull request?

regexp_extract currently returns null when it shouldn't: the regex matches, but the requested optional group did not. This change makes it return an empty string instead, as apparently designed.
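
To illustrate, a minimal spark-shell sketch mirroring the new unit test (the pattern and column name are only examples):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.regexp_extract

val df = Seq("aaaac").toDF("s")
// Group 2, "(b)?", is optional and matches nothing here even though the regex as a whole matches.
// Before the fix this yielded null; with the fix it yields "".
df.select(regexp_extract($"s", "(a+)(b)?(c)", 2)).show()
```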

## How was this patch tested?

Additional unit test

Author: Sean Owen <[email protected]>

Closes #14504 from srowen/SPARK-16409.

(cherry picked from commit 8d87252)
Signed-off-by: Sean Owen <[email protected]>
…ringFunctionsSuite

## What changes were proposed in this pull request?

This PR aims to fix a build error on branch 1.6 at 8d87252, but I think we had better have this consistently in the master branch, too, because there is another ongoing PR (#14525) about this.

https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-compile-maven-with-yarn-2.3/286/console

```scala
[error] /home/jenkins/workspace/spark-branch-1.6-compile-maven-with-yarn-2.3/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala:82: value toDF is not a member of Seq[String]
[error]     val df = Seq("aaaac").toDF("s")
[error]                           ^
```

## How was this patch tested?

After passing Jenkins, run compilation test on branch 1.6.
```
build/mvn -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
```

Author: Dongjoon Hyun <[email protected]>

Closes #14526 from dongjoon-hyun/SPARK-16939.

(cherry picked from commit a16983c)
Signed-off-by: Sean Owen <[email protected]>
…lause

#### What changes were proposed in this pull request?
When doing a CTAS with a PARTITIONED BY clause, we get the wrong error message.

For example,
```SQL
CREATE TABLE gen__tmp
PARTITIONED BY (key string)
AS SELECT key, value FROM mytable1
```
The error message we get now is like
```
Operation not allowed: Schema may not be specified in a Create Table As Select (CTAS) statement(line 2, pos 0)
```

However, based on the code, the message we should get is like
```
Operation not allowed: A Create Table As Select (CTAS) statement is not allowed to create a partitioned table using Hive's file formats. Please use the syntax of "CREATE TABLE tableName USING dataSource OPTIONS (...) PARTITIONED BY ...\" to create a partitioned table through a CTAS statement.(line 2, pos 0)
```

Currently, the partitioning columns are part of the schema. This PR fixes the bug by changing the detection order.

#### How was this patch tested?
Added test cases.

Author: gatorsmile <[email protected]>

Closes #14113 from gatorsmile/ctas.

(cherry picked from commit ab12690)
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Currently, the `refreshTable` API is always case sensitive.

When users pass a view name that does not exactly match the case, the API silently ignores the call. Users might assume the command completed successfully, but when they run subsequent SQL commands they can still hit exceptions like
```
Job aborted due to stage failure:
Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 7, localhost):
java.io.FileNotFoundException:
File file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-00000-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet does not exist
```

This PR is to fix the issue.
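
For example (a hedged spark-shell sketch; the path and view name are made up):

```scala
// Register a temporary view backed by parquet files, then refresh it using a name
// that differs only in case. Before the fix, the mismatched name was silently ignored;
// with the fix, the view is resolved and actually refreshed.
spark.read.parquet("/tmp/some-table").createOrReplaceTempView("myView")
spark.catalog.refreshTable("myview")
```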

### How was this patch tested?
Added a test case.

Author: gatorsmile <[email protected]>

Closes #14523 from gatorsmile/refreshTempTable.

(cherry picked from commit 5959df2)
Signed-off-by: Wenchen Fan <[email protected]>
Some very rare JVM errors are printed to stdout, and that confuses
the code in spark-class. So add a check so that those cases are
detected and the proper error message is shown to the user.

Tested by running spark-submit after setting "ulimit -v 32000".

Closes #14231

Author: Marcelo Vanzin <[email protected]>

Closes #14508 from vanzin/SPARK-16586.

(cherry picked from commit 1739e75)
Signed-off-by: Marcelo Vanzin <[email protected]>
…onsistent with requestExecutors/killExecutors

## What changes were proposed in this pull request?

requestExecutors and killExecutors are public developer APIs for managing the number of executors allocated to the SparkContext. For consistency, requestTotalExecutors should also be a public developer API, as it provides similar functionality. In fact, using requestTotalExecutors is more convenient than requestExecutors, as the former is idempotent and the latter is not.
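
A rough usage sketch (a minimal illustration against the 2.x `SparkContext` API, not the PR's own test code):

```scala
// Target total of 8 executors; no locality-aware tasks, no host preferences.
sc.requestTotalExecutors(8, 0, Map.empty[String, Int])
// Repeating the same target is a no-op, which is what makes this call idempotent,
// unlike requestExecutors(), which always asks for additional executors.
sc.requestTotalExecutors(8, 0, Map.empty[String, Int])
```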

Author: Tathagata Das <[email protected]>

Closes #14541 from tdas/SPARK-16953.

(cherry picked from commit 8650239)
Signed-off-by: Tathagata Das <[email protected]>
## What changes were proposed in this pull request?

Update docs to include SASL support for RPC

Evidence: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala#L63

## How was this patch tested?

Docs change only

Author: Michael Gummelt <[email protected]>

Closes #14549 from mgummelt/sasl.

(cherry picked from commit 53d1c78)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?

Add a constant iterator that points to the head of the result. This head iterator is used to reset the result iterator when results are fetched from the first row repeatedly.
JIRA ticket: https://issues.apache.org/jira/browse/SPARK-16563
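
A minimal sketch of the idea (simplified names, not the actual Thrift-server code):

```scala
import org.apache.spark.sql.Row

// Keep the materialized result around so the row iterator can be re-created
// when a client issues a FETCH_FIRST request.
class ResettableRows(rows: Seq[Row]) {
  private var iter: Iterator[Row] = rows.iterator
  def resetToHead(): Unit = { iter = rows.iterator }        // analogous to FETCH_FIRST
  def fetchNext(maxRows: Int): Seq[Row] = iter.take(maxRows).toSeq
}
```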

## How was this patch tested?

This bug was found when using Cloudera HUE to connect to the Spark SQL Thrift server: currently a SQL statement's results can only be fetched once. The fix was tested manually with Cloudera HUE; with this fix, HUE can fetch Spark SQL results repeatedly through the Thrift server.

Author: Alice <[email protected]>
Author: Alice <[email protected]>

Closes #14218 from alicegugu/SparkSQLFetchResultsBug.

(cherry picked from commit e17a76e)
Signed-off-by: Reynold Xin <[email protected]>
…ption.

## What changes were proposed in this pull request?

For the ORC source, Spark SQL has a writer option `compression`, which is used to set the codec, and its value is also set to `orc.compress` (the ORC conf used for the codec). However, if a user only sets `orc.compress` in the writer options, we should not use the default value of `compression` (snappy) as the codec. Instead, we should respect the value of `orc.compress`.

This PR makes the ORC data source not ignore `orc.compress` when `compression` is unset.

So, the behaviour is:

 1. Check `compression` and use it if it is set.
 2. If `compression` is not set, check `orc.compress` and use it.
 3. If neither `compression` nor `orc.compress` is set, use the default, snappy.
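
Hedged illustration of that precedence, assuming an existing DataFrame `df` (paths are made up):

```scala
df.write.option("compression", "zlib").orc("/tmp/orc_a")   // 1. explicit `compression` wins
df.write.option("orc.compress", "ZLIB").orc("/tmp/orc_b")  // 2. `orc.compress` is now honoured
df.write.orc("/tmp/orc_c")                                 // 3. neither set: default snappy
```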

## How was this patch tested?

Unit test in `OrcQuerySuite`.

Author: hyukjinkwon <[email protected]>

Closes #14518 from HyukjinKwon/SPARK-16610.

(cherry picked from commit bb2b9d0)
Signed-off-by: Wenchen Fan <[email protected]>
…reateDirectStream for python3

## What changes were proposed in this pull request?

Enable KafkaUtils.createDirectStream with starting offsets in Python 3 by using java.lang.Number instead of Long during param mapping in the Scala helper. This allows py4j to pass either Integer or Long into the map and resolves ClassCastException problems.

## How was this patch tested?

unit tests

jerryshao  - could you please look at this PR?

Author: Mariusz Strzelecki <[email protected]>

Closes #14540 from szczeles/kafka_pyspark.

(cherry picked from commit 29081b5)
Signed-off-by: Davies Liu <[email protected]>
## What changes were proposed in this pull request?

This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration.

**Background:** This application-killing was added in 6b5980d (from September 2012) and I believe that it was designed to prevent a faulty application whose executors could never launch from DOS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (#1360), this feature was refined to prevent applications which have running executors from being killed by this code path.

**Motivation for making this configurable:** Previously, if a Spark Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` executor failures and was left with no executors running then the Spark master would kill that application, but this behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances then it's possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. In order to accommodate this use-case, this patch modifies the Master to never kill faulty applications if `spark.deploy.maxExecutorRetries` is negative.

I'd like to merge this patch into master, branch-2.0, and branch-1.6.
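
For example, an application that should never be killed by this code path could be configured roughly as follows (the master URL and app name are placeholders):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("spark://master:7077")                 // hypothetical standalone master
  .setAppName("resilient-app")
  .set("spark.deploy.maxExecutorRetries", "-1")     // negative value: never kill the app
```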

## How was this patch tested?

I tested this manually using `spark-shell` and `local-cluster` mode. This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now.

Author: Josh Rosen <[email protected]>

Closes #14544 from JoshRosen/add-setting-for-max-executor-failures.

(cherry picked from commit b89b3a5)
Signed-off-by: Josh Rosen <[email protected]>
MSCK REPAIR TABLE could be used to recover the partitions in external catalog based on partitions in file system.

Another syntax is: ALTER TABLE table RECOVER PARTITIONS

The implementation in this PR will only list partitions (not the files within a partition) on the driver (in parallel if needed).
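
Hedged usage sketch (the table name is made up; assumes an external partitioned table whose partition directories already exist on the file system):

```scala
// Recover partitions that exist on the file system but not yet in the catalog.
spark.sql("MSCK REPAIR TABLE logs")
// Equivalent alternative syntax mentioned above:
spark.sql("ALTER TABLE logs RECOVER PARTITIONS")
```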

Added unit tests for it and Hive compatibility test suite.

Author: Davies Liu <[email protected]>

Closes #14500 from davies/repair_table.
## What changes were proposed in this pull request?

Fixed small typo - "value ... ~~in~~ is null"

## How was this patch tested?

Still compiles!

Author: Michał Kiełbowicz <[email protected]>

Closes #14569 from jupblb/typo-fix.

(cherry picked from commit 9dc3e60)
Signed-off-by: Reynold Xin <[email protected]>
This is backport of #14175 to branch 2.0

Author: Sun Rui <[email protected]>

Closes #14575 from sun-rui/SPARK-16522-branch-2.0.
…ring when match fails

## What changes were proposed in this pull request?

Doc that regexp_extract returns empty string when regex or group does not match

## How was this patch tested?

Jenkins test, with a few new test cases

Author: Sean Owen <[email protected]>

Closes #14525 from srowen/SPARK-16324.

(cherry picked from commit 0578ff9)
Signed-off-by: Sean Owen <[email protected]>
…Path

## What changes were proposed in this pull request?

Fix the construction of the file path. The previous construction created an incorrect path on Windows.

## How was this patch tested?

Run SQL unit tests on Windows

Author: avulanov <[email protected]>

Closes #13868 from avulanov/SPARK-15899-file.

(cherry picked from commit 11a6844)
Signed-off-by: Sean Owen <[email protected]>

# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
#	sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala
… for parquet reader

The base class `SpecificParquetRecordReaderBase`, used by the vectorized Parquet reader, tries to get pushed-down filters from the given configuration. These pushed-down filters are used for row-group-level filtering. However, we don't put the filters to push down into the configuration, so the filters are not actually pushed down to do row-group-level filtering. This patch fixes this by setting up the filters in the configuration for the reader.

The benchmark that excludes the time of writing Parquet file:

    test("Benchmark for Parquet") {
      val N = 500 << 12
        withParquetTable((0 until N).map(i => (101, i)), "t") {
          val benchmark = new Benchmark("Parquet reader", N)
          benchmark.addCase("reading Parquet file", 10) { iter =>
            sql("SELECT _1 FROM t where t._1 < 100").collect()
          }
          benchmark.run()
      }
    }

`withParquetTable` by default runs tests with both the vectorized and non-vectorized readers. I only let it run the vectorized reader.

When we set the Parquet block size to 1024 so that there are multiple row groups, the benchmark is:

Before this patch:

The retrieved row groups: 8063

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
    Intel(R) Core(TM) i7-5557U CPU  3.10GHz
    Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    reading Parquet file                           825 / 1233          2.5         402.6       1.0X

After this patch:

The retrieved row groups: 0

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
    Intel(R) Core(TM) i7-5557U CPU  3.10GHz
    Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    reading Parquet file                           306 /  503          6.7         149.6       1.0X

Next, I ran the benchmark for the non-pushdown case using the same benchmark code but with the pushdown configuration disabled. This time the Parquet block size is the default value.

Before this patch:

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
    Intel(R) Core(TM) i7-5557U CPU  3.10GHz
    Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    reading Parquet file                           136 /  238         15.0          66.5       1.0X

After this patch:

    Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
    Intel(R) Core(TM) i7-5557U CPU  3.10GHz
    Parquet reader:                          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    reading Parquet file                           124 /  193         16.5          60.7       1.0X

For the non-pushdown case, the results suggest this patch doesn't affect the normal code path.

I've manually output `totalRowCount` in `SpecificParquetRecordReaderBase` to see if this patch actually filters the row groups. When running the above benchmark:

After this patch:
    `totalRowCount = 0`

Before this patch:
    `totalRowCount = 1024000`

Existing tests should pass.

Author: Liang-Chi Hsieh <[email protected]>

Closes #13701 from viirya/vectorized-reader-push-down-filter2.

(cherry picked from commit 19af298)
Signed-off-by: Davies Liu <[email protected]>
Add an install_spark function to the SparkR package. Users can run `install_spark()` to install Spark to a local directory from within R.

Updates:

Several changes have been made:

- `install.spark()`
    - check existence of tar file in the cache folder, and download only if not found
    - lookup priority of mirror_url: user-provided -> preferred mirror site from the Apache website -> hardcoded backup option
    - use 2.0.0

- `sparkR.session()`
    - can install Spark when it is not found in `SPARK_HOME`

Manual tests, running the check-cran.sh script added in #14173.

Author: Junyang Qian <[email protected]>

Closes #14258 from junyangq/SPARK-16579.

(cherry picked from commit 214ba66)
Signed-off-by: Shivaram Venkataraman <[email protected]>
…ment

## What changes were proposed in this pull request?

Change the remaining percentage to the right one.

## How was this patch tested?

Manual review

Author: Tao Wang <[email protected]>

Closes #14591 from WangTaoTheTonic/patch-1.

(cherry picked from commit 7a6a3c3)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?
This patch introduces SQLQueryTestSuite, a basic framework for end-to-end SQL test cases defined in spark/sql/core/src/test/resources/sql-tests. This is a more standard way to test SQL queries end-to-end in different open source database systems, because it is more manageable to work with files.

This is inspired by HiveCompatibilitySuite, but simplified for general Spark SQL tests. Once this is merged, I can work towards porting SQLQuerySuite over, and eventually also move the existing HiveCompatibilitySuite to use this framework.

Unlike HiveCompatibilitySuite, SQLQueryTestSuite compares both the output schema and the output data (in string form).

When there is a mismatch, the error message looks like the following:

```
[info] - blacklist.sql !!! IGNORED !!!
[info] - number-format.sql *** FAILED *** (2 seconds, 405 milliseconds)
[info]   Expected "...147483648	-214748364[8]", but got "...147483648	-214748364[9]" Result should match for query #1 (SQLQueryTestSuite.scala:171)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.assertResult(Assertions.scala:1171)
```

## How was this patch tested?
This is a test infrastructure change.

Author: petermaxlee <[email protected]>

Closes #14472 from petermaxlee/SPARK-16866.

(cherry picked from commit b9f8a11)
Signed-off-by: Wenchen Fan <[email protected]>
…ryTestSuite.

## What changes were proposed in this pull request?
This patch enhances SQLQueryTestSuite in two ways:

1. SPARK-17009: Use a new SparkSession for each test case to provide stronger isolation (e.g. config changes in one test case do not impact another). That said, we do not currently isolate catalog changes.
2. SPARK-17008: Normalize query output using sorting, inspired by HiveComparisonTest.

I also ported a few new test cases over from SQLQuerySuite.

## How was this patch tested?
This is a test harness update.

Author: petermaxlee <[email protected]>

Closes #14590 from petermaxlee/SPARK-17008.

(cherry picked from commit 425c7c2)
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?
This patch moves all the test data files in sql/core/src/test/resources to sql/core/src/test/resources/test-data, so we don't clutter the top level sql/core/src/test/resources. Also deleted sql/core/src/test/resources/old-repeated.parquet since it is no longer used.

The change will make it easier to spot the sql-tests directory.

## How was this patch tested?
This is a test-only change.

Author: petermaxlee <[email protected]>

Closes #14589 from petermaxlee/SPARK-17007.

(cherry picked from commit 665e175)
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?
This patch adds exception testing to SQLQueryTestSuite. When there is an exception in query execution, the query result contains the exception class along with the exception message.

As part of this, I moved some additional test cases for limit from SQLQuerySuite over to SQLQueryTestSuite.

## How was this patch tested?
This is a test harness change.

Author: petermaxlee <[email protected]>

Closes #14592 from petermaxlee/SPARK-17011.

(cherry picked from commit 0db373a)
Signed-off-by: Wenchen Fan <[email protected]>
Docs adjustment to:
- link to another relevant section of the docs
- correct a statement that claimed only one value is supported when other values actually are

Author: Andrew Ash <[email protected]>

Closes #14581 from ash211/patch-10.

(cherry picked from commit 8a6b703)
Signed-off-by: Sean Owen <[email protected]>
…Path for Spark 2.0

This PR contains the adaptation of #13868 for Spark 2.0

## What changes were proposed in this pull request?

Fix the construction of the file path in `SQLConf.scala` and the unit tests that rely on it: `SQLConfSuite` and `DDLSuite`. The previous construction created an incorrect path on Windows.

## How was this patch tested?

Run unit tests on Windows

Author: avulanov <[email protected]>

Closes #14600 from avulanov/SPARK-15899-file-2.0.
This patch adds three test files:
1. arithmetic.sql.out
2. order-by-ordinal.sql
3. group-by-ordinal.sql

This includes #14594.

This is a test case change.

Author: petermaxlee <[email protected]>

Closes #14595 from petermaxlee/SPARK-17015.

(cherry picked from commit a7b02db)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?
This patch adds literals.sql for testing literal parsing end-to-end in SQL.

## How was this patch tested?
The patch itself is only about adding test cases.

Author: petermaxlee <[email protected]>

Closes #14598 from petermaxlee/SPARK-17018-2.

(cherry picked from commit cf93678)
Signed-off-by: Reynold Xin <[email protected]>
…ages

## What changes were proposed in this pull request?

We directly send RequestExecutors to the AM instead of passing it through YarnSchedulerBackend first, to avoid a potential deadlock.

## How was this patch tested?

manual tests

Author: WangTaoTheTonic <[email protected]>

Closes #14605 from WangTaoTheTonic/lock.

(cherry picked from commit ea0bf91)
Signed-off-by: Marcelo Vanzin <[email protected]>
…ed correctly

Currently, Spark ignores path names starting with underscore `_` or `.`. This causes read failures for column-partitioned file data sources whose partition column names start with '_', e.g. `_col`.

**Before**
```scala
scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /tmp/parquet20. It must be specified manually;
```

**After**
```scala
scala> spark.range(10).withColumn("_locality_code", $"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
res2: org.apache.spark.sql.DataFrame = [id: bigint, _locality_code: int]
```

Pass the Jenkins with a new test case.

Author: Dongjoon Hyun <[email protected]>

Closes #14585 from dongjoon-hyun/SPARK-16975-PARQUET.

(cherry picked from commit abff92b)
Signed-off-by: Cheng Lian <[email protected]>
JoshRosen and others added 17 commits September 19, 2016 20:21
…sages

This patch addresses a corner-case escaping bug where field names which contain special characters were unsafely interpolated into error message string literals in generated Java code, leading to compilation errors.

This patch addresses these issues by using `addReferenceObj` to store the error messages as string fields rather than inline string constants.
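
A rough sketch of the pattern, assuming a codegen context `ctx` (the field name and message are hypothetical):

```scala
// Register the message as a referenced object instead of splicing it into the
// generated Java source, so special characters in the field name cannot break compilation.
val fieldName = """a "quoted" field"""                      // hypothetical problematic name
val errMsg = s"Field name '$fieldName' cannot be handled"   // would break if inlined verbatim
val errMsgRef = ctx.addReferenceObj("errMsg", errMsg)       // stored as an object reference
val generated = s"throw new RuntimeException($errMsgRef);"  // generated Java uses the reference
```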

Author: Josh Rosen <[email protected]>

Closes #15156 from JoshRosen/SPARK-17160.

(cherry picked from commit e719b1c)
Signed-off-by: Josh Rosen <[email protected]>
## What changes were proposed in this pull request?
This PR modifies StreamExecution such that it discards metadata for batches that have already been fully processed. I used the purge method that was added as part of SPARK-17235.

This is based on work by frreiss in #15067, but fixes the test case along with some typos.
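
A hedged sketch of the purge idea (names simplified; assumes a metadata log `offsetLog`, and the batch id and retention count shown are placeholders):

```scala
// Once a batch has been fully processed, metadata below a retention threshold can be
// discarded via the purge() method added in SPARK-17235.
val committedBatchId = 1234L                          // placeholder for the last completed batch
val retainedBatches = 100L                            // placeholder retention count
val minBatchToRetain = committedBatchId - retainedBatches + 1
if (minBatchToRetain > 0) {
  offsetLog.purge(minBatchToRetain)                   // drops log entries for older batches
}
```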

## How was this patch tested?
A new test case in StreamingQuerySuite. The test case would fail without the changes in this pull request.

Author: petermaxlee <[email protected]>
Author: frreiss <[email protected]>

Closes #15126 from petermaxlee/SPARK-17513.

(cherry picked from commit be9d57f)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?

Hive confs in hive-site.xml are loaded into `hadoopConf`, so we should use `hadoopConf` in `InsertIntoHiveTable` instead of `SessionState.conf`.

## How was this patch tested?

N/A

Author: Wenchen Fan <[email protected]>

Closes #14634 from cloud-fan/bug.

(cherry picked from commit eb004c6)
Signed-off-by: Yin Huai <[email protected]>
…ataLog in FileStreamSource (branch-2.0)

## What changes were proposed in this pull request?

Backport #13513 to branch 2.0.

## How was this patch tested?

Jenkins

Author: jerryshao <[email protected]>

Closes #15163 from zsxwing/SPARK-15698-spark-2.0.
## What changes were proposed in this pull request?
This PR modifies StreamExecution such that it discards metadata for batches that have already been fully processed. I used the purge method that was added as part of SPARK-17235.

This is a resubmission of #15126, which was based on work by frreiss in #15067, but fixes the test case along with some typos.

## How was this patch tested?
A new test case in StreamingQuerySuite. The test case would fail without the changes in this pull request.

Author: petermaxlee <[email protected]>

Closes #15166 from petermaxlee/SPARK-17513-2.

(cherry picked from commit 976f3b1)
Signed-off-by: Reynold Xin <[email protected]>
…ult on double value

## What changes were proposed in this pull request?

The Remainder (%) expression's `eval()` returns an incorrect result when the dividend is a big double. The reason is that Remainder converts the double dividend to decimal to do "%", and that loses precision.

This bug only affects the `eval()` path that is used by constant folding; the codegen path is not impacted.

### Before change
```
scala> -5083676433652386516D % 10
res2: Double = -6.0

scala> spark.sql("select -5083676433652386516D % 10 as a").show
+---+
|  a|
+---+
|0.0|
+---+
```

### After change
```
scala> spark.sql("select -5083676433652386516D % 10 as a").show
+----+
|   a|
+----+
|-6.0|
+----+
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <[email protected]>

Closes #15171 from clockfly/SPARK-17617.

(cherry picked from commit 3977223)
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?
This PR is to fix the code style errors.

## How was this patch tested?
Manual.

Before:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[153] (sizes) LineLength: Line is longer than 100 characters (found 107).
[ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[196] (sizes) LineLength: Line is longer than 100 characters (found 108).
[ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[239] (sizes) LineLength: Line is longer than 100 characters (found 115).
[ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[119] (sizes) LineLength: Line is longer than 100 characters (found 107).
[ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[129] (sizes) LineLength: Line is longer than 100 characters (found 104).
[ERROR] src/main/java/org/apache/spark/network/util/LevelDBProvider.java:[124,11] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions.
[ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[184] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[304] (regexp) RegexpSingleline: No trailing whitespace allowed.
 ```
After:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Weiqing Yang <[email protected]>

Closes #15175 from Sherry302/javastylefix.
…shed

This patch updates the `kinesis-asl-assembly` build to prevent that module from being published as part of Maven releases and snapshot builds.

The `kinesis-asl-assembly` includes classes from the Kinesis Client Library (KCL) and Kinesis Producer Library (KPL), both of which are licensed under the Amazon Software License and are therefore prohibited from being distributed in Apache releases.

Author: Josh Rosen <[email protected]>

Closes #15167 from JoshRosen/stop-publishing-kinesis-assembly.

(cherry picked from commit d7ee122)
Signed-off-by: Josh Rosen <[email protected]>
…s cluster mode

## What changes were proposed in this pull request?

YARN and Mesos cluster modes support remote Python paths (HDFS/S3 schemes) through their own mechanisms, so it is not necessary to check and format the Python path when running in these modes. This is a potential regression compared to 1.6, so this proposes to fix it.

## How was this patch tested?

Unit test to verify SparkSubmit arguments, plus local cluster verification. Because of the lack of `MiniDFSCluster` support in Spark unit tests, no integration test is added.

Author: jerryshao <[email protected]>

Closes #15137 from jerryshao/SPARK-17512.

(cherry picked from commit 8c3ee2b)
Signed-off-by: Andrew Or <[email protected]>
All of structured streaming is experimental in its first release.  We missed the annotation on two of the APIs.

Author: Michael Armbrust <[email protected]>

Closes #15188 from marmbrus/experimentalApi.

(cherry picked from commit 3497ebe)
Signed-off-by: Reynold Xin <[email protected]>
…t rounding mode

## What changes were proposed in this pull request?

Floor()/Ceil() of decimal is implemented using changePrecision() by passing a rounding mode, but the rounding mode is not respected when the decimal is in compact mode (could fit within a Long).

This updates changePrecision() to respect the rounding mode, which could be ROUND_FLOOR, ROUND_CEIL, ROUND_HALF_UP, or ROUND_HALF_EVEN.
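
A quick spark-shell illustration of the expected semantics after the fix:

```scala
// floor rounds toward negative infinity, ceil toward positive infinity,
// which only differs from truncation for values like -0.1.
spark.sql("SELECT floor(cast(-0.1 as decimal(10, 1))) AS f, ceil(cast(-0.1 as decimal(10, 1))) AS c").show()
// Expected: f = -1, c = 0
```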

## How was this patch tested?

Added regression tests.

Author: Davies Liu <[email protected]>

Closes #15154 from davies/decimal_round.

(cherry picked from commit 8bde03b)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?

Modified the documentation to clarify that `build/mvn` and `pom.xml` always add Java 7-specific parameters to `MAVEN_OPTS`, and that developers can safely ignore warnings about `-XX:MaxPermSize` that may result from compiling or running tests with Java 8.

## How was this patch tested?

Rebuilt HTML documentation, made sure that building-spark.html displays correctly in a browser.

Author: frreiss <[email protected]>

Closes #15005 from frreiss/fred-17421a.

(cherry picked from commit 646f383)
Signed-off-by: Sean Owen <[email protected]>
shivaram and others added 10 commits September 22, 2016 11:54
## What changes were proposed in this pull request?

When we build the docs separately, we don't have the JAR files from the Spark build in the same tree. As the SparkR vignettes need to launch a SparkContext to be built, we skip building them if the JAR files don't exist.

## How was this patch tested?

To test this we can run the following:
```
build/mvn -DskipTests -Psparkr clean
./R/create-docs.sh
```
You should see a line `Skipping R vignettes as Spark JARs not found` at the end

Author: Shivaram Venkataraman <[email protected]>

Closes #15200 from shivaram/sparkr-vignette-skip.

(cherry picked from commit 9f24a17)
Signed-off-by: Reynold Xin <[email protected]>
…Frames

Consider you have a bucket as `s3a://some-bucket`
and under it you have files:
```
s3a://some-bucket/file1.parquet
s3a://some-bucket/file2.parquet
```
Getting the parent path of `s3a://some-bucket/file1.parquet` yields
`s3a://some-bucket/` and the ListingFileCatalog uses this as the key in the hash map.

When catalog.allFiles is called, we use `s3a://some-bucket` (no slash at the end) to get the list of files, and we're left with an empty list!

This PR fixes this by adding a `/` at the end of the `URI` iff the given `Path` doesn't have a parent, i.e. is the root. This is a no-op if the path already has a `/` at the end, and is handled through Hadoop Path's path-merging semantics.
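
A hedged sketch of the qualification step (not the exact `ListingFileCatalog` code):

```scala
import org.apache.hadoop.fs.Path

// Append "/" to the URI only when the path has no parent, i.e. it is a bucket/filesystem root,
// so that "s3a://some-bucket" and "s3a://some-bucket/" map to the same key.
def rootQualifiedUri(path: Path): java.net.URI = {
  val uri = path.toUri
  if (path.getParent == null && !uri.toString.endsWith("/")) {
    java.net.URI.create(uri.toString + "/")
  } else {
    uri
  }
}
```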

Unit test in `FileCatalogSuite`.

Author: Burak Yavuz <[email protected]>

Closes #15169 from brkyvz/SPARK-17613.

(cherry picked from commit 85d609c)
Signed-off-by: Josh Rosen <[email protected]>
…ocess is dead

## What changes were proposed in this pull request?

When the Python process is dead, the JVM StreamingContext is still running. Hence we will see a lot of Py4jException before the JVM process exits. It's better to stop the JVM StreamingContext to avoid those annoying logs.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes #15201 from zsxwing/stop-jvm-ssc.

(cherry picked from commit 3cdae0f)
Signed-off-by: Shixiong Zhu <[email protected]>
…a non-partial aggregate

We currently cannot execute an aggregate that contains a single distinct aggregate function and one or more non-partially-plannable aggregate functions, for example:
```sql
select   grp,
         collect_list(col1),
         count(distinct col2)
from     tbl_a
group by 1
```
This is a regression from Spark 1.6. This is caused by the fact that the single distinct aggregation code path assumes that all aggregates can be planned in two phases (is partially aggregatable). This PR works around this issue by triggering the `RewriteDistinctAggregates` in such cases (this is similar to the approach taken in 1.6).
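
The same shape of query expressed with the DataFrame API, assuming a DataFrame `tblA` for `tbl_a`:

```scala
import org.apache.spark.sql.functions.{collect_list, countDistinct}

// collect_list cannot be planned partially, while count(distinct ...) takes the
// single-distinct code path; mixing the two is what triggered the regression.
val result = tblA.groupBy("grp").agg(collect_list("col1"), countDistinct("col2"))
```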

Created `RewriteDistinctAggregatesSuite` which checks if the aggregates with distinct aggregate functions get rewritten into two `Aggregates` and an `Expand`. Added a regression test to `DataFrameAggregateSuite`.

Author: Herman van Hovell <[email protected]>

Closes #15187 from hvanhovell/SPARK-17616.

(cherry picked from commit 0d63487)
Signed-off-by: Herman van Hovell <[email protected]>
## What changes were proposed in this pull request?

This Backports PR #15153 and PR #15122 to Spark 2.0 branch for Structured Streaming.
It is structured a bit differently because similar code paths already existed in the 2.0 branch. The unit test makes sure that both behaviors don't break.

Author: Burak Yavuz <[email protected]>

Closes #15202 from brkyvz/backports-to-streaming.
…tements on Temporary Views

### What changes were proposed in this pull request?
This PR is to backport #15054 and #15160 to Spark 2.0.

- When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for partition-related ALTER TABLE commands. However, it always reports a confusing error message. For example,
```
Partition spec is invalid. The spec (a, b) must match the partition spec () defined in table '`testview`';
```
- When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for `ALTER TABLE ... UNSET TBLPROPERTIES`. However, it reports a missing table property. For example,
```
Attempted to unset non-existent property 'p' in table '`testView`';
```
- When `ANALYZE TABLE` is called on a view or a temporary view, we should issue an error message. However, it reports a strange error:
```
ANALYZE TABLE is not supported for Project
```

- When inserting into a temporary view that is generated from `Range`, we will get the following error message:
```
assertion failed: No plan for 'InsertIntoTable Range (0, 10, step=1, splits=Some(1)), false, false
+- Project [1 AS 1#20]
   +- OneRowRelation$
```

This PR is to fix the above four issues.

There is no place in Spark SQL that needs `SessionCatalog.tableExists` to check temp views, so this PR makes `SessionCatalog.tableExists` only check permanent tables/views and removes some hacks.

### How was this patch tested?
Added multiple test cases

Author: gatorsmile <[email protected]>

Closes #15174 from gatorsmile/PR15054Backport.
… backport

## What changes were proposed in this pull request?

Allow Spark 2.x to load instances of LDA, LocalLDAModel, and DistributedLDAModel saved from Spark 1.6.
Backport of #15034 for branch-2.0

## How was this patch tested?

I tested this manually, saving the 3 types from 1.6 and loading them into master (2.x).  In the future, we can add generic tests for testing backwards compatibility across all ML models in SPARK-15573.

Author: Gayathri Murali <[email protected]>
Author: Joseph K. Bradley <[email protected]>

Closes #15205 from jkbradley/lda-backward-2.0.
…amSource.FileEntry

## What changes were proposed in this pull request?

Avoid using -1 as the default batchId for FileStreamSource.FileEntry so that we can make sure no FileEntry(..., batchId = -1) is written into the log. This also avoids people misusing it in the future (#15203 is an example).

## How was this patch tested?

Jenkins.

Author: Shixiong Zhu <[email protected]>

Closes #15206 from zsxwing/cleanup.

(cherry picked from commit 62ccf27)
Signed-off-by: Shixiong Zhu <[email protected]>
@asfgit asfgit closed this in 5c5396c Sep 23, 2016