[SPARK-10931][PYSPARK][ML] PySpark ML Models should contain Param values #10270
Conversation
Just a heads up that this has merge conflicts that need to be resolved so Jenkins can run the tests on it.
Hey all, I've resolved the merge conflicts. Thanks,
This approach only copies shared params. I like the latter approach, but the lack of doc generation may be a deal breaker. cc @holdenk
@jkbradley is probably the best person to check with; I'm not sure how important the pydoc generation would be for the models. The approach often taken in the Scala code is to make a shared Params class that both the model and the estimator inherit, which seems like it might offer some nice tradeoffs (we don't have to duplicate the list of params everywhere, custom params can be copied, and doc gen works), but it does add some extra code per model, so if we don't care about the pydoc that much it might not be a good solution.
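As a rough illustration of that shared-Params pattern in PySpark terms, here is a minimal sketch (all class and Param names are hypothetical, and the real change may differ): one mixin defines the Param once, both the estimator and the model inherit it, and the estimator copies its Param values onto the model it returns.

```python
# Hypothetical sketch of the shared-Params pattern discussed above.
from pyspark.ml import Estimator, Model
from pyspark.ml.param import Param, Params

class _HasThreshold(Params):
    """Mixin holding a hypothetical `threshold` Param, so the estimator and the
    model share a single definition (and pydoc generation sees it in both)."""
    threshold = Param(Params._dummy(), "threshold", "decision threshold (illustrative)")

    def getThreshold(self):
        return self.getOrDefault(self.threshold)

class MyEstimator(Estimator, _HasThreshold):
    def _fit(self, dataset):
        model = MyModel()
        # copy this estimator's Param values (including threshold) onto the model
        return self._copyValues(model)

class MyModel(Model, _HasThreshold):
    def _transform(self, dataset):
        return dataset  # no-op transform, just for the sketch
```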
Sorry for the slow response on this. I definitely would want the doc available. This will depend on the decision made on https://issues.apache.org/jira/browse/SPARK-14033, so I'll hold off on a detailed review. But in the meantime, I'll play around with possible solutions.
We decided not to do SPARK-14033, so this can proceed. I'll try a few possibilities, but let me know if you have updates!
add to whitelist
Jenkins test this please
Test build #55147 has finished for PR 10270 at commit
It will be great to get this into 2.0. Given the short time frame for this, I'd say we should have models inherit from the HasMyParam classes as you're doing. Could you please fix the merge conflicts? Thanks!
@evanyc15 Let me know if you'd like help getting this PR merged. I know you waited a long time. Thanks!
Hey Joseph, I'll work on getting the merge conflicts resolved. Thank you
OK, thanks! Btw, the code freeze is in a day or two I believe, so we'll need to hurry on this.
Force-pushed from b4890cb to e941d5b.
Hey Joseph, I just pushed the new changes. Thank you
Jenkins, test this please.
CC @MLnick. Can you please review my code?
@evanyc15 this won't make 2.0, so I'll take a look after the 2.0 release. By the way, you need to rebase onto master in order to avoid pulling all of the extra commits into the PR.
## What changes were proposed in this pull request?
This PR exposes `sql` in the PySpark shell, like the Scala/R shells, for consistency.
**Background**
* Scala
```scala
scala> sql("select 1 a")
res0: org.apache.spark.sql.DataFrame = [a: int]
```
* R
```r
> sql("select 1")
SparkDataFrame[1:int]
```
**Before**
* Python
```python
>>> sql("select 1 a")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'sql' is not defined
```
**After**
* Python
```python
>>> sql("select 1 a")
DataFrame[a: int]
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <[email protected]>
Closes apache#14190 from dongjoon-hyun/SPARK-16536.
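A minimal sketch of how a shell startup script might expose such a helper, assuming `spark` (a `SparkSession`) has already been created by the shell; the binding shown here is illustrative and not necessarily the exact change in the PR:

```python
# Illustrative only: bind a module-level `sql` helper in the interactive shell,
# assuming `spark` is the SparkSession that shell startup already provides.
sql = spark.sql

# sql("select 1 a") would then return DataFrame[a: int], matching the Scala/R shells.
```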
## What changes were proposed in this pull request?
This patch enables SparkSession to provide the Spark version.
## How was this patch tested?
Manual test:
```scala
scala> sc.version
res0: String = 2.1.0-SNAPSHOT
scala> spark.version
res1: String = 2.1.0-SNAPSHOT
```
```python
>>> sc.version
u'2.1.0-SNAPSHOT'
>>> spark.version
u'2.1.0-SNAPSHOT'
```
Author: Liwei Lin <[email protected]>
Closes apache#14165 from lw-lin/add-version.
…adata ## What changes were proposed in this pull request? `Alias` with metadata is not a no-op and we should not strip it in `RemoveAliasOnlyProject` rule. This PR also did some improvement for this rule: 1. extend the semantic of `alias-only`. Now we allow the project list to be partially aliased. 2. add unit test for this rule. ## How was this patch tested? new `RemoveAliasOnlyProjectSuite` Author: Wenchen Fan <[email protected]> Closes apache#14106 from cloud-fan/bug.
… all used place in MLLib
## What changes were proposed in this pull request?
Add a warning for the following cases when LBFGS training does not actually converge:
1) LogisticRegression
2) AFTSurvivalRegression
3) the LBFGS algorithm wrapper in the mllib package
## How was this patch tested?
N/A
Author: WeichenXu <[email protected]>
Closes apache#14157 from WeichenXu123/add_lbfgs_convergence_warning_for_all_used_place.
…style, minor fixes
## What changes were proposed in this pull request?
Cleanup of examples, mostly from PySpark-ML, to fix minor issues: unused imports, style consistency, a duplicated pipeline_example, use of the future print function, and a spelling error.
* The "Pipeline Example" is duplicated by "Simple Text Classification Pipeline" in Scala, Python, and Java.
* "Estimator Transformer Param Example" is duplicated by "Simple Params Example" in Scala, Python, and Java.
* Synced random_forest_classifier_example.py with Scala by adding an IndexToString label converter.
* Synced train_validation_split.py (in Scala, ModelSelectionViaTrainValidationExample) by adjusting the data split and adding a grid for the intercept.
* RegexTokenizer was doing nothing in tokenizer_example.py and JavaTokenizerExample.java; synced with the Scala version.
## How was this patch tested?
Local tests and running the modified examples.
Author: Bryan Cutler <[email protected]>
Closes apache#14081 from BryanCutler/examples-cleanup-SPARK-16403.
…le name ## What changes were proposed in this pull request? Due to the changes of [SPARK-14963](https://issues.apache.org/jira/browse/SPARK-14963), external shuffle recovery file name is changed mistakenly, so here change it back to the previous file name. This only affects the master branch, branch-2.0 is correct [here](https://github.com/apache/spark/blob/branch-2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L195). ## How was this patch tested? N/A Author: jerryshao <[email protected]> Closes apache#14197 from jerryshao/fix-typo-file-name.
… startup. This prevents the NM from starting when something is wrong, which would lead to later errors which are confusing and harder to debug. Added a unit test to verify startup fails if something is wrong. Author: Marcelo Vanzin <[email protected]> Closes apache#14162 from vanzin/SPARK-16505.
…E COLUMN #### What changes were proposed in this pull request? Based on the [Hive SQL syntax](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment), the command to change column name/type/position/comment is `ALTER TABLE CHANGE COLUMN`. However, in our .g4 file, it is `ALTER TABLE CHANGE COLUMNS`. Because it is the last optional keyword, it does not take any effect. Thus, I put the issue as a Trivial level. cc hvanhovell #### How was this patch tested? Existing test cases Author: gatorsmile <[email protected]> Closes apache#14186 from gatorsmile/changeColumns.
…mands
## What changes were proposed in this pull request?
This PR changes the names of the columns returned by the `SHOW PARTITIONS` and `SHOW COLUMNS` commands. Currently, both commands use `result` as the column name.
**Comparison: Column Name**
Command|Spark(Before)|Spark(After)|Hive
----------|--------------|------------|-----
SHOW PARTITIONS|result|partition|partition
SHOW COLUMNS|result|col_name|field
Note that Spark/Hive uses `col_name` in `DESC TABLES`. So, this PR chooses `col_name` for consistency among Spark commands.
**Before**
```scala
scala> sql("show partitions p").show()
+------+
|result|
+------+
| b=2|
+------+
scala> sql("show columns in p").show()
+------+
|result|
+------+
| a|
| b|
+------+
```
**After**
```scala
scala> sql("show partitions p").show
+---------+
|partition|
+---------+
| b=2|
+---------+
scala> sql("show columns in p").show
+--------+
|col_name|
+--------+
| a|
| b|
+--------+
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <[email protected]>
Closes apache#14199 from dongjoon-hyun/SPARK-16543.
… windowPartitionBy and windowOrderBy. ## What changes were proposed in this pull request? Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass CRAN package check. ## How was this patch tested? SparkR unit tests. Author: Sun Rui <[email protected]> Closes apache#14192 from sun-rui/SPARK-16509.
…ion functions ## What changes were proposed in this pull request? Fix function routing to work with and without namespace operator `SparkR::createDataFrame` ## How was this patch tested? manual, unit tests shivaram Author: Felix Cheung <[email protected]> Closes apache#14195 from felixcheung/rroutedefault.
…base before dropping
## What changes were proposed in this pull request?
`SQLTestUtils.withTempDatabase` is a frequently used test harness to set up a temporary database and clean it up afterwards. This issue improves it as follows for usability.
```scala
- try f(dbName) finally spark.sql(s"DROP DATABASE $dbName CASCADE")
+ try f(dbName) finally {
+ if (spark.catalog.currentDatabase == dbName) {
+ spark.sql(s"USE ${DEFAULT_DATABASE}")
+ }
+ spark.sql(s"DROP DATABASE $dbName CASCADE")
+ }
```
In case a test forgets to reset the current database, `withTempDatabase` will no longer raise an exception.
## How was this patch tested?
This improves the test harness.
Author: Dongjoon Hyun <[email protected]>
Closes apache#14184 from dongjoon-hyun/SPARK-16529.
…ringFunctionsSuite
## What changes were proposed in this pull request?
This PR aims to fix a build error on branch 1.6 at apache@8d87252, but I think we had better have this consistently in the master branch too, because there exists another ongoing PR (apache#14525) about this.
https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-compile-maven-with-yarn-2.3/286/console
```scala
[error] /home/jenkins/workspace/spark-branch-1.6-compile-maven-with-yarn-2.3/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala:82: value toDF is not a member of Seq[String]
[error]     val df = Seq("aaaac").toDF("s")
[error]                           ^
```
## How was this patch tested?
After passing Jenkins, run a compilation test on branch 1.6.
```
build/mvn -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
```
Author: Dongjoon Hyun <[email protected]>
Closes apache#14526 from dongjoon-hyun/SPARK-16939.
## What changes were proposed in this pull request? Currently the update interval for the console progress bar is hardcoded. This PR makes it configurable for users. ## How was this patch tested? Ran a long running job and with a high value of update interval, the updates were shown less frequently. Author: Tejas Patil <[email protected]> Closes apache#14507 from tejasapatil/SPARK-16919.
## What changes were proposed in this pull request? Similar to `LogisticAggregator`, `LeastSquaresAggregator` used for linear regression ends up serializing the coefficients and the features standard deviations, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization. In apache#13729 the approach was to pass these values directly to the add method. The approach used here, initially, is to mark these fields as transient instead which gives the benefit of keeping the signature of the add method simple and interpretable. The downside is that it requires the use of `transient lazy val`s which are difficult to reason about if one is not quite familiar with serialization in Scala/Spark. ## How was this patch tested? **MLlib**  **ML without patch**  **ML with patch**  Author: sethah <[email protected]> Closes apache#14109 from sethah/LIR_serialize.
## What changes were proposed in this pull request? This PR is to fix the minor Java linter errors as following: [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[42,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/VariableLengthRowBasedKeyValueBatch.java:[97,10] (modifier) RedundantModifier: Redundant 'final' modifier. ## How was this patch tested? Manual test. dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks passed. Author: Weiqing Yang <[email protected]> Closes apache#14532 from Sherry302/master.
… operations return incorrect results ## What changes were proposed in this pull request? This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase by returning an error message when the LIMIT is found in the path from the parent table to the correlated predicate in the subquery. ## How was this patch tested? ./dev/run-tests a new unit test on the problematic pattern. Author: Nattavut Sutyanyong <[email protected]> Closes apache#14411 from nsyca/master.
…ema in TypedAggregateExpression ## What changes were proposed in this pull request? This PR adds auxiliary info like input class and input schema in TypedAggregateExpression ## How was this patch tested? Manual test. Author: Sean Zhong <[email protected]> Closes apache#14501 from clockfly/typed_aggregation.
…lause
#### What changes were proposed in this pull request?
When doing a CTAS with a Partition By clause, we got a wrong error message. For example,
```SQL
CREATE TABLE gen__tmp
PARTITIONED BY (key string)
AS SELECT key, value FROM mytable1
```
The error message we get now is like
```
Operation not allowed: Schema may not be specified in a Create Table As Select (CTAS) statement(line 2, pos 0)
```
However, based on the code, the message we should get is like
```
Operation not allowed: A Create Table As Select (CTAS) statement is not allowed to create a partitioned table using Hive's file formats. Please use the syntax of "CREATE TABLE tableName USING dataSource OPTIONS (...) PARTITIONED BY ..." to create a partitioned table through a CTAS statement.(line 2, pos 0)
```
Currently, partitioning columns are part of the schema. This PR fixes the bug by changing the detection order.
#### How was this patch tested?
Added test cases.
Author: gatorsmile <[email protected]>
Closes apache#14113 from gatorsmile/ctas.
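For reference, a hedged sketch of the data-source CTAS syntax that the corrected message recommends; the table and column names are hypothetical and an active `spark` session is assumed:

```python
# Illustrative only: a partitioned CTAS through a data source, as suggested by
# the corrected error message above; table/column names are placeholders.
spark.sql("""
    CREATE TABLE gen_tmp
    USING parquet
    PARTITIONED BY (key)
    AS SELECT key, value FROM mytable1
""")
```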
### What changes were proposed in this pull request?
Currently, the `refreshTable` API is always case sensitive. When users use the view name without the exact case match, the API silently ignores the call. Users might expect the command has been successfully completed. However, when users run the subsequent SQL commands, they might still get the exception, like
```
Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 7, localhost): java.io.FileNotFoundException: File file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-bd4b9ea6-9aec-49c5-8f05-01cff426211e/part-r-00000-0c84b915-c032-4f2e-abf5-1d48fdbddf38.snappy.parquet does not exist
```
This PR is to fix the issue.
### How was this patch tested?
Added a test case.
Author: gatorsmile <[email protected]>
Closes apache#14523 from gatorsmile/refreshTempTable.
Some very rare JVM errors are printed to stdout, and that confuses the code in spark-class. So add a check so that those cases are detected and the proper error message is shown to the user. Tested by running spark-submit after setting "ulimit -v 32000". Closes apache#14231 Author: Marcelo Vanzin <[email protected]> Closes apache#14508 from vanzin/SPARK-16586.
…onsistent with requestExecutors/killExecutors ## What changes were proposed in this pull request? RequestExecutors and killExecutor are public developer APIs for managing the number of executors allocated to the SparkContext. For consistency, requestTotalExecutors should also be a public Developer API, as it provides similar functionality. In fact, using requestTotalExecutors is more convenient that requestExecutors as the former is idempotent and the latter is not. Author: Tathagata Das <[email protected]> Closes apache#14541 from tdas/SPARK-16953.
…t add much and remove whitelisting ## What changes were proposed in this pull request? Avoid using postfix operation for command execution in SQLQuerySuite where it wasn't whitelisted and audit existing whitelistings removing postfix operators from most places. Some notable places where postfix operation remains is in the XML parsing & time units (seconds, millis, etc.) where it arguably can improve readability. ## How was this patch tested? Existing tests. Author: Holden Karau <[email protected]> Closes apache#14407 from holdenk/SPARK-16779.
## What changes were proposed in this pull request? Update docs to include SASL support for RPC Evidence: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala#L63 ## How was this patch tested? Docs change only Author: Michael Gummelt <[email protected]> Closes apache#14549 from mgummelt/sasl.
## What changes were proposed in this pull request? The logic for LEAD/LAG processing is more complex that it needs to be. This PR fixes that. ## How was this patch tested? Existing tests. Author: Herman van Hovell <[email protected]> Closes apache#14376 from hvanhovell/SPARK-16749.
…lan like MapElements, TypedFilter, and AppendColumn ## What changes were proposed in this pull request? This PR adds argument type information for typed logical plan like MapElements, TypedFilter, and AppendColumn, so that we can use these info in customized optimizer rule. ## How was this patch tested? Existing test. Author: Sean Zhong <[email protected]> Closes apache#14494 from clockfly/add_more_info_for_typed_operator.
## What changes were proposed in this pull request? Add a constant iterator which point to head of result. The header will be used to reset iterator when fetch result from first row repeatedly. JIRA ticket https://issues.apache.org/jira/browse/SPARK-16563 ## How was this patch tested? This bug was found when using Cloudera HUE connecting to spark sql thrift server, currently SQL statement result can be only fetched for once. The fix was tested manually with Cloudera HUE, With this fix, HUE can fetch spark SQL results repeatedly through thrift server. Author: Alice <[email protected]> Author: Alice <[email protected]> Closes apache#14218 from alicegugu/SparkSQLFetchResultsBug.
…ption.
## What changes were proposed in this pull request?
For the ORC source, Spark SQL has a writer option `compression`, which is used to set the codec, and its value is also set to `orc.compress` (the ORC conf used for the codec). However, if a user only sets `orc.compress` in the writer options, we should not use the default value of `compression` (snappy) as the codec. Instead, we should respect the value of `orc.compress`. This PR makes the ORC data source not ignore `orc.compress` when `compression` is unset. So, here is the behaviour:
1. Check `compression` and use this if it is set.
2. If `compression` is not set, check `orc.compress` and use it.
3. If `compression` and `orc.compress` are not set, then use the default snappy.
## How was this patch tested?
Unit test in `OrcQuerySuite`.
Author: hyukjinkwon <[email protected]>
Closes apache#14518 from HyukjinKwon/SPARK-16610.
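A hedged usage sketch of the writer option described above; `df` is assumed to be an existing DataFrame and the output path is a placeholder:

```python
# Illustrative only: with the change above, setting just `orc.compress` should be
# respected rather than being overridden by the default `compression` (snappy).
df.write.option("orc.compress", "ZLIB").orc("/tmp/orc_compress_example")
```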
…WARN SparkContext: Use an existing SparkContext, some configuration may not take effect." ## What changes were proposed in this pull request? SparkContext.getOrCreate shouldn't warn about ignored config if - it wasn't ignored because a new context is created with it or - no config was actually provided ## How was this patch tested? Jenkins + existing tests. Author: Sean Owen <[email protected]> Closes apache#14533 from srowen/SPARK-16606.
## What changes were proposed in this pull request? Spark applications running on Mesos throw exception upon exit. For details, refer to https://issues.apache.org/jira/browse/SPARK-16522. I am not sure if there is any better fix, so wait for review comments. ## How was this patch tested? Manual test. Observed that the exception is gone upon application exit. Author: Sun Rui <[email protected]> Closes apache#14175 from sun-rui/SPARK-16522.
…or wrong results
## What changes were proposed in this pull request?
This PR fixes the following to make `checkAnswer` raise `TestFailedException` again instead of `java.util.NoSuchElementException: key not found: TZ` in the environments without `TZ` variable. Also, this PR adds `QueryTestSuite` class for testing `QueryTest` itself.
```scala
- |Timezone Env: ${sys.env("TZ")}
+ |Timezone Env: ${sys.env.getOrElse("TZ", "")}
```
## How was this patch tested?
Pass the Jenkins tests with a new test suite.
Author: Dongjoon Hyun <[email protected]>
Closes apache#14528 from dongjoon-hyun/SPARK-16940.
## What changes were proposed in this pull request? Links the Spark Mesos Dispatcher UI to the history server UI - adds spark.mesos.dispatcher.historyServer.url - explicitly generates frameworkIDs for the launched drivers, so the dispatcher knows how to correlate drivers and frameworkIDs ## How was this patch tested? manual testing Author: Michael Gummelt <[email protected]> Author: Sergiusz Urbaniak <[email protected]> Closes apache#14414 from mgummelt/history-server.
…ecution package ## What changes were proposed in this pull request? This package is meant to be internal, and as a result it does not make sense to mark things as private[sql] or private[spark]. It simply makes debugging harder when Spark developers need to inspect the plans at runtime. This patch removes all private[sql] and private[spark] visibility modifiers in org.apache.spark.sql.execution. ## How was this patch tested? N/A - just visibility changes. Author: Reynold Xin <[email protected]> Closes apache#14554 from rxin/remote-private.
…es unnecessary data. ## What changes were proposed in this pull request? Similar to ```LeastSquaresAggregator``` in apache#14109, ```AFTAggregator``` used for ```AFTSurvivalRegression``` ends up serializing the ```parameters``` and ```featuresStd```, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization. This PR is highly inspired by apache#14109. ## How was this patch tested? I tested this locally and verified the serialization reduction. Before patch  After patch  Author: Yanbo Liang <[email protected]> Closes apache#14519 from yanboliang/spark-16933.
…reateDirectStream for python3
## What changes were proposed in this pull request?
Ability to use KafkaUtils.createDirectStream with starting offsets in Python 3 by using java.lang.Number instead of Long during param mapping in the Scala helper. This allows py4j to pass Integer or Long to the map and resolves ClassCastException problems.
## How was this patch tested?
Unit tests.
jerryshao - could you please look at this PR?
Author: Mariusz Strzelecki <[email protected]>
Closes apache#14540 from szczeles/kafka_pyspark.
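A hedged usage sketch of the affected call path using the old kafka-0-8 Python API; the topic, partition, and broker names are illustrative, and `sc` is assumed to be an existing SparkContext:

```python
# Illustrative only: on Python 3, plain ints as starting offsets should now map
# cleanly through py4j after the Number-based param mapping described above.
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

ssc = StreamingContext(sc, 2)  # `sc` is an existing SparkContext; 2-second batches
from_offsets = {TopicAndPartition("events", 0): 0}  # hypothetical topic, partition 0, offset 0
stream = KafkaUtils.createDirectStream(
    ssc,
    ["events"],
    {"metadata.broker.list": "localhost:9092"},  # hypothetical broker
    fromOffsets=from_offsets)
```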
## What changes were proposed in this pull request?
MSCK REPAIR TABLE could be used to recover the partitions in the external catalog based on partitions in the file system. Another syntax is: ALTER TABLE table RECOVER PARTITIONS
The implementation in this PR will only list partitions (not the files within a partition) in the driver (in parallel if needed).
## How was this patch tested?
Added unit tests for it and the Hive compatibility test suite.
Author: Davies Liu <[email protected]>
Closes apache#14500 from davies/repair_table.
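A hedged sketch of both syntaxes mentioned above, for a hypothetical partitioned external table named `logs` (assumes an active `spark` session):

```python
# Illustrative only: both statements recover partitions found on the file system
# into the catalog for a hypothetical external table `logs`.
spark.sql("MSCK REPAIR TABLE logs")
spark.sql("ALTER TABLE logs RECOVER PARTITIONS")
```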
## What changes were proposed in this pull request? This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration. **Background:** This application-killing was added in 6b5980d (from September 2012) and I believe that it was designed to prevent a faulty application whose executors could never launch from DOS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (apache#1360), this feature was refined to prevent applications which have running executors from being killed by this code path. **Motivation for making this configurable:** Previously, if a Spark Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` executor failures and was left with no executors running then the Spark master would kill that application, but this behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances then it's possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. In order to accommodate this use-case, this patch modifies the Master to never kill faulty applications if `spark.deploy.maxExecutorRetries` is negative. I'd like to merge this patch into master, branch-2.0, and branch-1.6. ## How was this patch tested? I tested this manually using `spark-shell` and `local-cluster` mode. This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now. Author: Josh Rosen <[email protected]> Closes apache#14544 from JoshRosen/add-setting-for-max-executor-failures.
Estimator UID is being copied correctly to the Transformer model objects and params now, working on Doctests
Changed the way parameters are copied from the Estimator to Transformer
Checkpoint, switching back to inheritance method
Working on DocTests
Implemented Doctests for Recommendation, Clustering, Classification (except RandomForestClassifier), Evaluation, Tuning, Regression (except RandomRegression)
Ready for Code Review
Code Review changeset apache#1
… into SPARK-10931-pyspark-mllib
New Pull Request #14653 created for this.
PySpark spark.ml Models are generally wrappers around Java objects and do not even contain Param values. This JIRA is for copying the Param values from the Estimator to the model.
This can likely be solved by modifying Estimator.fit to copy Param values, but should also include proper unit tests.
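To make the desired behaviour concrete, here is a hedged sketch (assuming an active `SparkSession` bound to `spark`) of what a user would expect once Param values are copied from the estimator onto the fitted model:

```python
# Sketch of the expected behaviour described above; `spark` is an existing SparkSession.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 1.1)), (0.0, Vectors.dense(2.0, 1.0))],
    ["label", "features"])

lr = LogisticRegression(maxIter=5, regParam=0.01)
model = lr.fit(df)

# With Param values copied from the estimator, the model reports them too.
print(model.getOrDefault(model.maxIter))  # expected: 5
```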