
Conversation

@paulzwu

@paulzwu paulzwu commented Sep 21, 2016

What changes were proposed in this pull request?

We need to override getTableExistsQuery for a Cassandra dialect. The current standard implementation, "SELECT * FROM $table WHERE 1=0", does not work with Cassandra; it needs to be "SELECT * FROM $table LIMIT 1".
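As a rough sketch of the kind of dialect this needs (not part of this PR's diff; the object name and JDBC URL prefix are assumptions):

```
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect sketch: the object name and URL prefix are assumptions.
object CassandraDialect extends JdbcDialect {
  // Claim JDBC URLs that point at Cassandra.
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:cassandra")

  // CQL rejects "WHERE 1=0", so probe table existence with LIMIT 1 instead.
  override def getTableExistsQuery(table: String): String =
    s"SELECT * FROM $table LIMIT 1"
}

// Register the dialect before reading or writing through the JDBC data source.
JdbcDialects.registerDialect(CassandraDialect)
```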

How was this patch tested?

Integration tests with Cassandra

rxin and others added 30 commits May 4, 2016 11:00
## What changes were proposed in this pull request?
We currently have a single suite that is very large, making it difficult to maintain and play with specific primitives. This patch reorganizes the file by creating multiple benchmark suites in a single package.

Most of the changes are a straightforward move of code. On top of the code moving, I did:
1. Use SparkSession instead of SQLContext.
2. Turned most benchmark scenarios into their own test cases, rather than having multiple scenarios in a single test case, which takes forever to run.

## How was this patch tested?
This is a test only change.

Author: Reynold Xin <[email protected]>

Closes apache#12891 from rxin/SPARK-15115.

(cherry picked from commit 6274a52)
Signed-off-by: Reynold Xin <[email protected]>
…talog to infer partitioning

## What changes were proposed in this pull request?

File Stream Sink writes the list of written files in a metadata log. StreamFileCatalog reads the list of the files for processing. However StreamFileCatalog does not infer partitioning like HDFSFileCatalog.

This PR enables that by refactoring HDFSFileCatalog to create an abstract class PartitioningAwareFileCatalog, that has all the functionality to infer partitions from a list of leaf files.
- HDFSFileCatalog has been renamed to ListingFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning.
- StreamFileCatalog has been renamed to MetadataLogFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log.
- The above two classes have been moved into their own files, as they are not interfaces that should be in fileSourceInterfaces.scala.

## How was this patch tested?
- FileStreamSinkSuite was updated to check that partitioning gets inferred, and that on reading the partitions get pruned correctly based on the query.
- Other unit tests are unchanged and pass as expected.

Author: Tathagata Das <[email protected]>

Closes apache#12879 from tdas/SPARK-15103.

(cherry picked from commit 0fd3a47)
Signed-off-by: Tathagata Das <[email protected]>
## What changes were proposed in this pull request?
Currently we return RuntimeConfig itself to facilitate chaining. However, this makes the output in interactive environments (e.g. notebooks, the Scala REPL) weird, because the result of calling set is displayed as a RuntimeConfig itself.
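A minimal sketch of the resulting behavior, assuming `spark` is an existing SparkSession (the config key is just an example):

```
// After this change, set(...) returns Unit, so the REPL no longer echoes
// the whole RuntimeConfig object after every call.
spark.conf.set("spark.sql.shuffle.partitions", "8")

// Reads are unaffected.
val partitions = spark.conf.get("spark.sql.shuffle.partitions")
```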

## How was this patch tested?
Updated unit tests.

Author: Reynold Xin <[email protected]>

Closes apache#12902 from rxin/SPARK-15126.

(cherry picked from commit 6ae9fc0)
Signed-off-by: Andrew Or <[email protected]>
## What changes were proposed in this pull request?

Add more informative logging in the external shuffle service to aid in debugging who is connecting to the YARN Nodemanager when the external shuffle service runs under it.

## How was this patch tested?

Ran and saw logs coming out in log file.

Author: Thomas Graves <[email protected]>

Closes apache#12900 from tgravescs/SPARK-15121.

(cherry picked from commit 0c00391)
Signed-off-by: Andrew Or <[email protected]>
Remove history server functionality from standalone Master.  Previously, the Master process rebuilt a SparkUI once the application was completed which sometimes caused problems, such as OOM, when the application event log is large (see SPARK-6270).  Keeping this functionality out of the Master will help to simplify the process and increase stability.

Testing for this change included running core unit tests and manually running an application on a standalone cluster to verify that it completed successfully and that the Master UI functioned correctly.  Also added 2 unit tests to verify killing an application and driver from MasterWebUI makes the correct request to the Master.

Author: Bryan Cutler <[email protected]>

Closes apache#10991 from BryanCutler/remove-history-master-SPARK-12299.

(cherry picked from commit cf2e9da)
Signed-off-by: Andrew Or <[email protected]>
## What changes were proposed in this pull request?

This PR aims to update Scala/Python/Java examples by replacing `SQLContext` with newly added `SparkSession`.

- Use **SparkSession Builder Pattern** in 154 (Scala 55, Java 52, Python 47) files; a minimal builder sketch is shown below.
- Add `getConf` in Python SparkContext class: `python/pyspark/context.py`
- Replace **SQLContext Singleton Pattern** with **SparkSession Singleton Pattern**:
  - `SqlNetworkWordCount.scala`
  - `JavaSqlNetworkWordCount.java`
  - `sql_network_wordcount.py`

Now, `SQLContext`s are used only in the R examples and the following two Python examples. The Python examples are untouched in this PR since they already fail due to some unknown issue.
- `simple_params_example.py`
- `aft_survival_regression.py`
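A minimal sketch of the SparkSession builder pattern the examples above were migrated to (the app name and input path are illustrative placeholders):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("ExampleApp")
  .getOrCreate()

// SparkSession replaces SQLContext as the entry point for DataFrame and SQL work.
val df = spark.read.json("examples/src/main/resources/people.json")
df.show()

spark.stop()
```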

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <[email protected]>

Closes apache#12809 from dongjoon-hyun/SPARK-15031.

(cherry picked from commit cdce4e6)
Signed-off-by: Andrew Or <[email protected]>
Similar to apache#8639

This change rejects offers for 120s when `spark.cores.max` has been reached in coarse-grained mode, to mitigate offer starvation. This prevents Mesos from sending us offers again and again and starving other frameworks. This is especially problematic when running many small frameworks on the same Mesos cluster, e.g. many small Spark Streaming jobs, which causes the bigger Spark jobs to stop receiving offers. By rejecting the offers for a long period of time, they become available to those other frameworks.

Author: Sebastien Rainville <[email protected]>

Closes apache#10924 from sebastienrainville/master.

(cherry picked from commit eb019af)
Signed-off-by: Andrew Or <[email protected]>
…rkContext from it

## What changes were proposed in this pull request?

See apache#12873 (comment). The problem is that if we create a `SparkContext` first and then call `SparkSession.builder.enableHiveSupport().getOrCreate()`, we will reuse the existing `SparkContext` and the hive flag won't be set.
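A minimal sketch of the problematic sequence described above (the master and app name are illustrative):

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

// A SparkContext already exists (as in the REPL)...
val sc = new SparkContext(new SparkConf().setAppName("repl").setMaster("local[*]"))

// ...so getOrCreate() reuses it, and before this fix enableHiveSupport() had no
// effect because the hive catalog flag was never applied to the reused context.
val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
```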

## How was this patch tested?

verified it locally.

Author: Wenchen Fan <[email protected]>

Closes apache#12890 from cloud-fan/repl.
## What changes were proposed in this pull request?
Typo fix

## How was this patch tested?
No tests

My apologies for the tiny PR, but I stumbled across this today and wanted to get it corrected for 2.0.

Author: sethah <[email protected]>

Closes apache#12912 from sethah/csv_typo.

(cherry picked from commit b281377)
Signed-off-by: Andrew Or <[email protected]>
## What changes were proposed in this pull request?

See title.

## How was this patch tested?

PySpark tests.

Author: Andrew Or <[email protected]>

Closes apache#12917 from andrewor14/deprecate-hive-context-python.

(cherry picked from commit fa79d34)
Signed-off-by: Andrew Or <[email protected]>
…uery

## What changes were proposed in this pull request?

This PR supports the new SQL syntax CREATE TEMPORARY VIEW, for example:
```
CREATE TEMPORARY VIEW viewName AS SELECT * from xx
CREATE OR REPLACE TEMPORARY VIEW viewName AS SELECT * from xx
CREATE TEMPORARY VIEW viewName (c1 COMMENT 'blabla', c2 COMMENT 'blabla') AS SELECT * FROM xx
```

## How was this patch tested?

Unit tests.

Author: Sean Zhong <[email protected]>

Closes apache#12872 from clockfly/spark-6399.

(cherry picked from commit 8fb1463)
Signed-off-by: Yin Huai <[email protected]>
…s a Path to Parquet File

#### What changes were proposed in this pull request?
When we load a dataset, if we set the path to ```/path/a=1```, we will not take `a` as the partitioning column. However, if we set the path to ```/path/a=1/file.parquet```, we take `a` as the partitioning column and it shows up in the schema.

This PR is to fix the behavior inconsistency issue.

The base paths are a set of paths that are considered the base directories of the input datasets. The partitioning discovery logic will make sure it stops when it reaches any base path.

By default, the paths of the dataset provided by users will be base paths. Below are three typical cases,
**Case 1**```sqlContext.read.parquet("/path/something=true/")```: the base path will be
`/path/something=true/`, and the returned DataFrame will not contain a column of `something`.
**Case 2**```sqlContext.read.parquet("/path/something=true/a.parquet")```: the base path will be
still `/path/something=true/`, and the returned DataFrame will also not contain a column of
`something`.
**Case 3**```sqlContext.read.parquet("/path/")```: the base path will be `/path/`, and the returned
DataFrame will have the column of `something`.

Users also can override the basePath by setting `basePath` in the options to pass the new base
path to the data source. For example,
```sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")```,
and the returned DataFrame will have the column of `something`.

The related PRs:
- apache#9651
- apache#10211

#### How was this patch tested?
Added a couple of test cases

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes apache#12828 from gatorsmile/readPartitionedTable.

(cherry picked from commit ef55e46)
Signed-off-by: Yin Huai <[email protected]>
…ntext has been shutdown

## What changes were proposed in this pull request?

Make sure that whenever the StateStoreCoordinator cannot be contacted, we assume that the SparkContext and RpcEnv on the driver have been shut down, and therefore stop the StateStore management thread and unload all loaded stores.

## How was this patch tested?

Updated unit tests.

Author: Tathagata Das <[email protected]>

Closes apache#12905 from tdas/SPARK-15131.

(cherry picked from commit bde27b8)
Signed-off-by: Shixiong Zhu <[email protected]>
…ted with proper indentation

## What changes were proposed in this pull request?

Similar to apache#11990, GenerateOrdering and GenerateColumnAccessor should print debug log for generated code with proper indentation.

## How was this patch tested?

Manually checked.

Author: Kousuke Saruta <[email protected]>

Closes apache#12908 from sarutak/SPARK-15132.

(cherry picked from commit 1a9b341)
Signed-off-by: Reynold Xin <[email protected]>
…AllocatedMemory for pageTable

## What changes were proposed in this pull request?

Removed the DeadCode as suggested.

Author: Abhinav Gupta <[email protected]>

Closes apache#12829 from abhi951990/master.

(cherry picked from commit 1a5c6fc)
Signed-off-by: Davies Liu <[email protected]>
## What changes were proposed in this pull request?

We hit this issue when using Snowplow in our Spark applications. Snowplow requires json4s version 3.2.11, while Spark still uses the years-old version 3.2.10. The change is to upgrade the json4s jar to 3.2.11.

## How was this patch tested?

We built Spark jar and successfully ran our applications in local and cluster modes.

Author: Lining Sun <[email protected]>

Closes apache#12901 from liningalex/master.

(cherry picked from commit 592fc45)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

Replace com.sun.jersey with org.glassfish.jersey. Changes to the Spark Web UI code were required to compile. The changes were relatively standard Jersey migration things.

## How was this patch tested?

I did a manual test for the standalone web APIs. Although I didn't test the functionality of the security filter itself, the code that changed non-trivially is how we actually register the filter. I attached a debugger to the Spark master and verified that the SecurityFilter code is indeed invoked upon hitting /api/v1/applications.

Author: mcheah <[email protected]>

Closes apache#12715 from mccheah/feature/upgrade-jersey.

(cherry picked from commit b7fdc23)
Signed-off-by: Sean Owen <[email protected]>
… remove "BETA"

## What changes were proposed in this pull request?

Copy the package documentation from Scala/Java to Python for the ML package and remove the beta tags. Not entirely sure if we want to keep the BETA tag, but since we are making ML the default it seems like the time to remove it (happy to put it back if we want to keep it BETA).

## How was this patch tested?

Python documentation built locally as HTML and text and verified output.

Author: Holden Karau <[email protected]>

Closes apache#12883 from holdenk/SPARK-15106-add-pyspark-package-doc-for-ml.

(cherry picked from commit 4c0d827)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

Enhance the DB2 JDBC dialect docker tests, as they seem to have had some issues on a previous merge that caused some tests to fail.

## How was this patch tested?

By running the integration tests locally.

Author: Luciano Resende <[email protected]>

Closes apache#12348 from lresende/SPARK-14589.

(cherry picked from commit 1044302)
Signed-off-by: Sean Owen <[email protected]>
…ady succeeded

## What changes were proposed in this pull request?

Don't re-queue a task if another attempt has already succeeded.  This currently happens when a speculative task is denied from committing the result due to another copy of the task already having succeeded.

## How was this patch tested?

I'm running a job which has a fair bit of skew in the processing time across the tasks for speculation to trigger in the last quarter (default settings), causing many commit denied exceptions to be thrown.  Previously, these tasks were then being retried over and over again until the stage possibly completes (despite using compute resources on these superfluous tasks).  With this change (applied to the 1.6 branch), they no longer retry and the stage completes successfully without these extra task attempts.

Author: Jason Moore <[email protected]>

Closes apache#12751 from jasonmoore2k/SPARK-14915.

(cherry picked from commit 77361a4)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

The problem is: in `RowEncoder`, we use `Invoke` to get the field of an external row, which loses the nullability information. This PR creates a `GetExternalRowField` expression so that we can preserve the nullability info.

TODO: simplify the null handling logic in `RowEncoder`, to remove so many if branches, in follow-up PR.

## How was this patch tested?

new tests in `RowEncoderSuite`

Note that this PR takes over apache#11980 with a little simplification, so all credit should go to koertkuipers.

Author: Wenchen Fan <[email protected]>
Author: Koert Kuipers <[email protected]>

Closes apache#12364 from cloud-fan/nullable.

(cherry picked from commit 55cc1c9)
Signed-off-by: Cheng Lian <[email protected]>
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-15148

Mainly, it improves performance by roughly 30%-40% according to the [release note](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.1.0). The details of the purpose are described in the JIRA.

This PR upgrades Univocity library from 2.0.2 to 2.1.0.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <[email protected]>

Closes apache#12923 from HyukjinKwon/SPARK-15148.

(cherry picked from commit ac12b35)
Signed-off-by: Reynold Xin <[email protected]>
…rames

## What changes were proposed in this pull request?

Implement repartitionByColumn on DataFrame.
This will allow us to run R functions on each partition identified by column groups with the dapply() method.

## How was this patch tested?

Unit tests

Author: NarineK <[email protected]>

Closes apache#12887 from NarineK/repartitionByColumns.

(cherry picked from commit 22226fc)
Signed-off-by: Davies Liu <[email protected]>
## What changes were proposed in this pull request?

Adds spark-warehouse/ to `.gitignore`.

## How was this patch tested?

N/A

Author: Cheng Lian <[email protected]>

Closes apache#12929 from liancheng/gitignore-spark-warehouse.

(cherry picked from commit 63db2bd)
Signed-off-by: Andrew Or <[email protected]>
#### What changes were proposed in this pull request?

First, a few test cases failed on Mac OS X because the property value of `java.io.tmpdir` does not include a trailing slash on some platforms. Hive always removes the last trailing slash. For example, what I found on the web:
```
Win NT  --> C:\TEMP\
Win XP  --> C:\TEMP
Solaris --> /var/tmp/
Linux   --> /var/tmp
```
Second, a couple of test cases are added to verify if the commands work properly.

#### How was this patch tested?
Added a test case for it and corrected the previous test cases.

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes apache#12081 from gatorsmile/mkdir.

(cherry picked from commit 8cba57a)
Signed-off-by: Andrew Or <[email protected]>
## What changes were proposed in this pull request?
Remove the `withHiveSupport` method of `SparkSession`; use `enableHiveSupport` instead.
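A quick sketch of the replacement call (the app name is illustrative):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("HiveExample")
  .enableHiveSupport()
  .getOrCreate()
```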

## How was this patch tested?
ran tests locally

Author: Sandeep Singh <[email protected]>

Closes apache#12851 from techaddict/SPARK-15072.

(cherry picked from commit ed6f3f8)
Signed-off-by: Andrew Or <[email protected]>
## What changes were proposed in this pull request?

Went through SparkSession and its members and fixed non-thread-safe classes used by SparkSession

## How was this patch tested?

Existing unit tests

Author: Shixiong Zhu <[email protected]>

Closes apache#12915 from zsxwing/spark-session-thread-safe.

(cherry picked from commit bb9991d)
Signed-off-by: Andrew Or <[email protected]>
…e binary_classification_metrics_example.py

## What changes were proposed in this pull request?

This issue addresses the comments in SPARK-15031 and also fixes Java linter errors.
- Use multiline format in SparkSession builder patterns.
- Update `binary_classification_metrics_example.py` to use `SparkSession`.
- Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far)

## How was this patch tested?

Passed the Jenkins tests and ran `dev/lint-java` manually.

Author: Dongjoon Hyun <[email protected]>

Closes apache#12911 from dongjoon-hyun/SPARK-15134.

(cherry picked from commit 2c170dd)
Signed-off-by: Andrew Or <[email protected]>
## What changes were proposed in this pull request?
Set the log level to debug when checking shouldRollover.

## How was this patch tested?
It's tested manually.

Author: depend <[email protected]>

Closes apache#12931 from depend/master.

(cherry picked from commit 5c47db0)
Signed-off-by: Andrew Or <[email protected]>
zsxwing and others added 25 commits May 19, 2016 18:31
…emoved in Spark 2.0

## What changes were proposed in this pull request?

If we find a `NoClassDefFoundError` or `ClassNotFoundException`, check whether the class name was removed in Spark 2.0. If so, the user must be using an incompatible library and we can provide a better message.

## How was this patch tested?

1. Run `bin/pyspark --packages com.databricks:spark-avro_2.10:2.0.1`
2. type `sqlContext.read.format("com.databricks.spark.avro").load("src/test/resources/episodes.avro")`.

It will show `java.lang.ClassNotFoundException: org.apache.spark.sql.sources.HadoopFsRelationProvider is removed in Spark 2.0. Please check if your library is compatible with Spark 2.0`

Author: Shixiong Zhu <[email protected]>

Closes apache#13201 from zsxwing/better-message.

(cherry picked from commit 16ba71a)
Signed-off-by: Michael Armbrust <[email protected]>
## What changes were proposed in this pull request?
Refactor all Java tests that use SparkSession to extend SharedSparkSession.

## How was this patch tested?
Existing Tests

Author: Sandeep Singh <[email protected]>

Closes apache#13101 from techaddict/SPARK-15296.

(cherry picked from commit 01cf649)
Signed-off-by: Xiangrui Meng <[email protected]>
…d correctly

## What changes were proposed in this pull request?

Fix `MapObjects.itemAccessorMethod` to handle `TimestampType`. Without this fix, `Array[Timestamp]` cannot be properly encoded or decoded. To reproduce this, in `ExpressionEncoderSuite`, if you add the following test case:

```
encodeDecodeTest(Array(Timestamp.valueOf("2016-01-29 10:00:00")), "array of timestamp")
```
... you will see that (without this fix) it fails with the following output:

```
- encode/decode for array of timestamp: [Ljava.sql.Timestamp;fd9ebde *** FAILED ***
  Exception thrown while decoding
  Converted: [0,1000000010,800000001,52a7ccdc36800]
  Schema: value#61615
  root
  -- value: array (nullable = true)
      |-- element: timestamp (containsNull = true)
  Encoder:
  class[value[0]: array<timestamp>] (ExpressionEncoderSuite.scala:312)
```

## How was this patch tested?

Existing tests

Author: Sumedh Mungee <[email protected]>

Closes apache#13108 from smungee/fix-itemAccessorMethod.

(cherry picked from commit d5c47f8)
Signed-off-by: Wenchen Fan <[email protected]>
…nference for POJOs and Java collections

This PR adds support for java.math.BigInteger in the Java bean code path. Internally, Spark converts BigInteger to BigDecimal in ColumnType.scala and CatalystRowConverter.scala; this change follows the same approach and converts BigInteger to BigDecimal.

Author: Kevin Yu <[email protected]>

Closes apache#10125 from kevinyu98/working_on_spark-11827.

(cherry picked from commit 17591d9)
Signed-off-by: Wenchen Fan <[email protected]>
…pagate config options to existing sessions if specified

## What changes were proposed in this pull request?
Currently SparkSession.Builder uses SQLContext.getOrCreate. It should probably be the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that.

This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.
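A small sketch of the propagation behavior (the config key and value are just examples):

```
import org.apache.spark.sql.SparkSession

// With this patch, config(...) values are applied even when getOrCreate()
// returns an existing SparkSession, not only when a new one is built.
val spark = SparkSession.builder
  .config("spark.sql.shuffle.partitions", "4")
  .getOrCreate()

assert(spark.conf.get("spark.sql.shuffle.partitions") == "4")
```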

## How was this patch tested?
Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.

Author: Reynold Xin <[email protected]>

Closes apache#13200 from rxin/SPARK-15075.

(cherry picked from commit f2ee0ed)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?
This patch fixes a bug in TypeUtils.checkForSameTypeInputExpr. Previously the code was testing on strict equality, which did not take nullability into account.

This is based on apache#12768. This patch fixed a bug there (with empty expression) and added a test case.

## How was this patch tested?
Added a new test suite and test case.

Closes apache#12768.

Author: Reynold Xin <[email protected]>
Author: Oleg Danilov <[email protected]>

Closes apache#13208 from rxin/SPARK-14990.

(cherry picked from commit 3ba34d4)
Signed-off-by: Reynold Xin <[email protected]>
Fixed memory leak (HiveConf in the CommandProcessorFactory)

Author: Oleg Danilov <[email protected]>

Closes apache#12932 from dosoft/SPARK-14261.

(cherry picked from commit e384c7f)
Signed-off-by: Reynold Xin <[email protected]>
…GraphGenerators

This PR removes a stale TODO comment in `GraphGenerators.scala`

Just comment removed.

Author: Dongjoon Hyun <[email protected]>

Closes apache#12839 from dongjoon-hyun/SPARK-15057.

(cherry picked from commit 46965cd)
Signed-off-by: Reynold Xin <[email protected]>
I reviewed Scala and Python APIs for ml.feature and corrected discrepancies.

Built docs locally, ran style checks

Author: Bryan Cutler <[email protected]>

Closes apache#13159 from BryanCutler/ml.feature-api-sync.

(cherry picked from commit b1bc5eb)
Signed-off-by: Reynold Xin <[email protected]>
1. Rename matrix args in BreezeUtil to upper case to match the doc
2. Fix several typos in ML and SQL

manual tests

Author: Zheng RuiFeng <[email protected]>

Closes apache#13078 from zhengruifeng/fix_ann.

(cherry picked from commit c7efc56)
Signed-off-by: Reynold Xin <[email protected]>
…ntegrationSuite

Remove the redundant setMaster call in OutputCommitCoordinatorIntegrationSuite, as we already set it in the SparkContext below on line 43.

existing tests

Author: Sandeep Singh <[email protected]>

Closes apache#13168 from techaddict/minor-1.

(cherry picked from commit 3facca5)
Signed-off-by: Reynold Xin <[email protected]>
…gical plan"

This reverts commit fc29b89.

(cherry picked from commit 84b2345)
Signed-off-by: Reynold Xin <[email protected]>
…f output of surrounded SerializeFromObject.

## What changes were proposed in this pull request?

The following code:

```
val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()
ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_))
```

throws an Exception:

```
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _1#420
 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)

...
 Cause: java.lang.RuntimeException: Couldn't find _1#420 in [_1#416,_2#417]
 at scala.sys.package$.error(package.scala:27)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
 at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
...
```

This is because `EmbedSerializerInFilter` rule drops the `exprId`s of output of surrounded `SerializeFromObject`.

The analyzed and optimized plans of the above example are as follows:

```
== Analyzed Logical Plan ==
_1: string
Project [_1#420]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
   +- Filter <function1>.apply
      +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
         +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]

== Optimized Logical Plan ==
!Project [_1#420]
+- Filter <function1>.apply
   +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
```

This PR fixes `EmbedSerializerInFilter` rule to keep `exprId`s of output of surrounded `SerializeFromObject`.

The plans after this patch are as follows:

```
== Analyzed Logical Plan ==
_1: string
Project [_1#420]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
   +- Filter <function1>.apply
      +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
         +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]

== Optimized Logical Plan ==
Project [_1#416]
+- Filter <function1>.apply
   +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
```

## How was this patch tested?

Existing tests and I added a test to check if `filter and then select` works.

Author: Takuya UESHIN <[email protected]>

Closes apache#13096 from ueshin/issues/SPARK-15313.

(cherry picked from commit d5e1c5a)
Signed-off-by: Reynold Xin <[email protected]>
This was accidentally reverted in f8d0177.
## What changes were proposed in this pull request?

TRUNCATE TABLE is a command supported by Hive; see the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-446
This PR implements such a command for truncating tables, excluding column truncation (HIVE-4005).
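For illustration, a couple of statements of the kind this command accepts, issued through the SQL API (the table name and partition spec are made up, and the PARTITION clause is shown on the assumption that it is supported here):

```
// Illustrative only: table name and partition spec are made up.
spark.sql("TRUNCATE TABLE my_table")
spark.sql("TRUNCATE TABLE my_table PARTITION (ds='2016-05-20')")
```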

## How was this patch tested?
Added a test case.

Author: Lianhui Wang <[email protected]>

Closes apache#13170 from lianhuiwang/truncate.

(cherry picked from commit 09a0051)
Signed-off-by: Reynold Xin <[email protected]>
…d when size mismatch happened in LogisticRegression

## What changes were proposed in this pull request?
Explicitly tell the user that the initial coefficients are ignored if their size doesn't match the expected size in LogisticRegression.

## How was this patch tested?
local build

Author: dding3 <[email protected]>

Closes apache#12948 from dding3/master.

(cherry picked from commit a78fbfa)
Signed-off-by: Xiangrui Meng <[email protected]>
…._, asML/fromML

## What changes were proposed in this pull request?

In this DataFrame example, we use VectorImplicits._, which is a private API.

Since the Vectors object has a public API, we use Vectors.fromML instead of the implicits.
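A tiny sketch of the public conversion used instead of the private implicits (the vector values are arbitrary):

```
import org.apache.spark.ml.linalg.{Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}

// Convert a new ml vector to an old mllib vector via the public API.
val mlVec = MLVectors.dense(1.0, 2.0, 3.0)
val mllibVec = MLlibVectors.fromML(mlVec)
```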

## How was this patch tested?


Manually run the example.

Author: [email protected] <[email protected]>

Closes apache#13213 from wangmiao1981/ml.

(cherry picked from commit 4c7a6b3)
Signed-off-by: Xiangrui Meng <[email protected]>
## What changes were proposed in this pull request?
MLlib is not recommended for use, and some methods are even deprecated.
Update the warning message to recommend ML usage.
```
  def showWarning() {
    System.err.println(
      """WARN: This is a naive implementation of Logistic Regression and is given as an example!
        |Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
        |org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
        |for more conventional use.
      """.stripMargin)
  }
```
To
```
  def showWarning() {
    System.err.println(
      """WARN: This is a naive implementation of Logistic Regression and is given as an example!
        |Please use org.apache.spark.ml.classification.LogisticRegression
        |for more conventional use.
      """.stripMargin)
  }
```

## How was this patch tested?
local build

Author: Zheng RuiFeng <[email protected]>

Closes apache#13190 from zhengruifeng/update_recd.

(cherry picked from commit 47a2940)
Signed-off-by: Xiangrui Meng <[email protected]>
## What changes were proposed in this pull request?

Correct some typos and incorrectly worded sentences.

## How was this patch tested?

Doc changes only.

Note that many of these changes were identified by whomfire01

Author: sethah <[email protected]>

Closes apache#13180 from sethah/ml_guide_audit.

(cherry picked from commit 5e20350)
Signed-off-by: Xiangrui Meng <[email protected]>
## What changes were proposed in this pull request?
* ```GeneralizedLinearRegression``` API docs enhancement.
* The default value of ```GeneralizedLinearRegression``` ```linkPredictionCol``` is not set, rather than being the empty string. This is consistent with other similar params such as ```weightCol```.
* Make some methods more private.
* Fix a minor bug of LinearRegression.
* Fix some other issues.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <[email protected]>

Closes apache#13129 from yanboliang/spark-15339.

(cherry picked from commit c94b34e)
Signed-off-by: Xiangrui Meng <[email protected]>
#### What changes were proposed in this pull request?
`refreshTable` was a method in `HiveContext`. It was deleted accidentally while we were migrating the APIs. This PR adds it back to `HiveContext`.

In addition, in `SparkSession`, we put it under the catalog namespace (`SparkSession.catalog.refreshTable`).
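A one-line sketch of the catalog-namespace call (the table name is illustrative):

```
// Refresh cached metadata for a table through the SparkSession catalog namespace.
spark.catalog.refreshTable("my_table")
```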

#### How was this patch tested?
Changed the existing test cases to use the function `refreshTable`. Also added a test case for refreshTable in `hivecontext-compatibility`

Author: gatorsmile <[email protected]>

Closes apache#13156 from gatorsmile/refreshTable.

(cherry picked from commit 39fd469)
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

When we parse DDLs involving table or database properties, we need to validate the values.
E.g. if we alter a database's property without providing a value:
```
ALTER DATABASE my_db SET DBPROPERTIES('some_key')
```
Then we'll ignore it with Hive, but override the property with the in-memory catalog. Inconsistencies like these arise because we don't validate the property values.

In such cases, we should throw exceptions instead.

## How was this patch tested?

`DDLCommandSuite`

Author: Andrew Or <[email protected]>

Closes apache#13205 from andrewor14/ddl-prop-values.

(cherry picked from commit 2573750)
Signed-off-by: Andrew Or <[email protected]>
## What changes were proposed in this pull request?

There is no way to use the Hive catalog in `pyspark-shell`. This is because we used to create a `SparkContext` before calling `SparkSession.builder.enableHiveSupport().getOrCreate()`, which just gets the existing `SparkContext` instead of creating a new one. As a result, `spark.sql.catalogImplementation` was never propagated.

## How was this patch tested?

Manual.

Author: Andrew Or <[email protected]>

Closes apache#13203 from andrewor14/fix-pyspark-shell.

(cherry picked from commit c32b1b1)
Signed-off-by: Andrew Or <[email protected]>
@AmplabJenkins

Can one of the admins verify this patch?

@srowen

srowen commented Sep 21, 2016

@paulzwu close this PR

