Branch 2.0 #14505 (Closed)
Conversation
## What changes were proposed in this pull request?
This patch enables SparkSession to provide spark version.
## How was this patch tested?
Manual test:
```
scala> sc.version
res0: String = 2.1.0-SNAPSHOT
scala> spark.version
res1: String = 2.1.0-SNAPSHOT
```
```
>>> sc.version
u'2.1.0-SNAPSHOT'
>>> spark.version
u'2.1.0-SNAPSHOT'
```
Author: Liwei Lin <[email protected]>
Closes #14165 from lw-lin/add-version.
(cherry picked from commit 39c836e)
Signed-off-by: Reynold Xin <[email protected]>
… all used place in MLLib
## What changes were proposed in this pull request?
Add a warning for the following cases when LBFGS training does not actually converge:
1) LogisticRegression
2) AFTSurvivalRegression
3) LBFGS algorithm wrapper in the mllib package
## How was this patch tested?
N/A
Author: WeichenXu <[email protected]>
Closes #14157 from WeichenXu123/add_lbfgs_convergence_warning_for_all_used_place.
(cherry picked from commit 252d4f2)
Signed-off-by: Sean Owen <[email protected]>
… windowPartitionBy and windowOrderBy.
## What changes were proposed in this pull request?
Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass CRAN package check.
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <[email protected]>
Closes #14192 from sun-rui/SPARK-16509.
(cherry picked from commit 093ebbc)
Signed-off-by: Shivaram Venkataraman <[email protected]>
…ion functions
## What changes were proposed in this pull request?
Fix function routing to work with and without the namespace operator `SparkR::createDataFrame`
## How was this patch tested?
manual, unit tests
shivaram
Author: Felix Cheung <[email protected]>
Closes #14195 from felixcheung/rroutedefault.
(cherry picked from commit 12005c8)
Signed-off-by: Shivaram Venkataraman <[email protected]>
…base before dropping
## What changes were proposed in this pull request?
`SQLTestUtils.withTempDatabase` is a frequently used test harness that sets up a temporary database and cleans it up afterwards. This issue improves it as follows for usability.
```scala
- try f(dbName) finally spark.sql(s"DROP DATABASE $dbName CASCADE")
+ try f(dbName) finally {
+ if (spark.catalog.currentDatabase == dbName) {
+ spark.sql(s"USE ${DEFAULT_DATABASE}")
+ }
+ spark.sql(s"DROP DATABASE $dbName CASCADE")
+ }
```
If a test forgets to reset the current database, `withTempDatabase` will no longer raise an exception.
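As an illustration of why this matters, here is a hedged sketch of a test that switches into the temp database (the test body is hypothetical, not taken from the patch):
```scala
// Hypothetical usage of the improved harness: even though the body leaves the
// current database pointing at the temp one, cleanup now succeeds because
// withTempDatabase switches back to the default database before dropping it.
withTempDatabase { db =>
  spark.sql(s"USE $db")
  spark.sql("CREATE TABLE t(id INT)")
}
```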
## How was this patch tested?
This improves the test harness.
Author: Dongjoon Hyun <[email protected]>
Closes #14184 from dongjoon-hyun/SPARK-16529.
(cherry picked from commit c576f9f)
Signed-off-by: Cheng Lian <[email protected]>
## What changes were proposed in this pull request?
There are some calls to methods or fields (getParameters, properties) which are then passed to Java/Scala collection converters. Unfortunately those fields can be null in some cases, and then the conversion throws an NPE. We fix it by wrapping calls to those fields and methods in an Option and then doing the conversion.
## How was this patch tested?
Manually tested with a custom Hive metastore.
Author: Jacek Lewandowski <[email protected]>
Closes #14200 from jacek-lewandowski/SPARK-16528.
(cherry picked from commit 31ca741)
Signed-off-by: Reynold Xin <[email protected]>
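A minimal sketch of the null-safe conversion pattern described above (the field access is illustrative; the actual fix wraps the real metastore calls):
```scala
import scala.collection.JavaConverters._

// A possibly-null java.util.Map from the metastore client is wrapped in Option
// before converting, so a null no longer turns into a NullPointerException.
def safeParameters(raw: java.util.Map[String, String]): Map[String, String] =
  Option(raw).map(_.asScala.toMap).getOrElse(Map.empty)
```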
…n yarn
## What changes were proposed in this pull request?
Currently when running Spark on YARN, jars specified with --jars or --packages will be added twice: once to Spark's own file server and once to YARN's distributed cache. This can be seen from the log, for example:
```
./bin/spark-shell --master yarn-client --jars examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar
```
If the jar to be added is the scopt jar, it will be added twice:
```
...
16/07/14 15:06:48 INFO Server: Started 5603ms
16/07/14 15:06:48 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/07/14 15:06:48 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040
16/07/14 15:06:48 INFO SparkContext: Added JAR file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar at spark://192.168.0.102:63996/jars/scopt_2.11-3.3.0.jar with timestamp 1468480008637
16/07/14 15:06:49 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/07/14 15:06:49 INFO Client: Requesting a new application from cluster with 1 NodeManagers
16/07/14 15:06:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/07/14 15:06:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/07/14 15:06:49 INFO Client: Setting up container launch context for our AM
16/07/14 15:06:49 INFO Client: Setting up the launch environment for our AM container
16/07/14 15:06:49 INFO Client: Preparing resources for our AM container
16/07/14 15:06:49 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/07/14 15:06:50 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g40000gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_libs__6486179704064718817.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_libs__6486179704064718817.zip
16/07/14 15:06:51 INFO Client: Uploading resource file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/scopt_2.11-3.3.0.jar
16/07/14 15:06:51 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g40000gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_conf__326416236462420861.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_conf__.zip
...
```
So here we try to avoid adding jars to Spark's file server unnecessarily.
## How was this patch tested?
Manually verified both in yarn client and cluster mode, also in standalone mode.
Author: jerryshao <[email protected]>
Closes #14196 from jerryshao/SPARK-16540.
(cherry picked from commit 91575ca)
Signed-off-by: Marcelo Vanzin <[email protected]>
## What changes were proposed in this pull request?
Fixes a typo in the SQL programming guide.
## How was this patch tested?
Building docs locally.
Author: Shivaram Venkataraman <[email protected]>
Closes #14208 from shivaram/spark-sql-doc-fix.
(cherry picked from commit 01c4c1f)
Signed-off-by: Reynold Xin <[email protected]>
…nt failures
If a custom Jekyll template tag throws Ruby's equivalent of a "file not found" exception, then Jekyll will stop the doc building process but will exit with a successful status, causing our doc publishing jobs to silently fail. This is caused by jekyll/jekyll#5104, a case of bad error-handling logic in Jekyll. This patch works around this by updating our `include_example.rb` plugin to catch the exception and exit rather than allowing it to bubble up and be ignored by Jekyll.
I tested this manually with
```
rm ./examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala
cd docs
SKIP_API=1 jekyll build
echo $?
```
Author: Josh Rosen <[email protected]>
Closes #14209 from JoshRosen/fix-doc-building.
(cherry picked from commit 972673a)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?
Most of the documentation in https://github.com/apache/spark/blob/master/sql/README.md is stale. It would be useful to keep the list of projects to explain what's going on, and everything else should be removed.
## How was this patch tested?
N/A
Author: Reynold Xin <[email protected]>
Closes #14211 from rxin/SPARK-16557.
(cherry picked from commit 2e4075e)
Signed-off-by: Reynold Xin <[email protected]>
…ib guide
## What changes were proposed in this pull request?
Made DataFrame-based API primary
* Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
* mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
* ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
* **Reviewers: please check this carefully**
* (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix
* Moved migration guide to ml-guide from mllib-guide
* Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
* **Reviewers**: I did not change any of the content of the migration guides.
Reorganized DataFrame-based guide:
* ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
* Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
* **Reviewers**: I did not change the content of these guides, except some intro text.
* Sidebar remains the same, but with pipeline and tuning sections added
Other:
* ml-classification-regression.html: Moved text about linear methods to new section in page
## How was this patch tested?
Generated docs locally
Author: Joseph K. Bradley <[email protected]>
Closes #14213 from jkbradley/ml-guide-2.0.
(cherry picked from commit 5ffd5d3)
Signed-off-by: Joseph K. Bradley <[email protected]>
…on functions
## What changes were proposed in this pull request?
More tests
I don't think this is critical for Spark 2.0.0 RC, maybe Spark 2.0.1 or 2.1.0.
## How was this patch tested?
unit tests
shivaram dongjoon-hyun
Author: Felix Cheung <[email protected]>
Closes #14206 from felixcheung/rroutetests.
(cherry picked from commit 611a8ca)
Signed-off-by: Shivaram Venkataraman <[email protected]>
…e is an exception while creating an Executor
## What changes were proposed in this pull request?
With the fix from SPARK-13112, I see that `LaunchTask` is always processed after `RegisteredExecutor` is done, and so it gets a chance to do all retries to start up an executor. There is still a problem: if `Executor` creation itself fails with some exception, it goes unnoticed and the executor is killed when it tries to process the `LaunchTask`, as `executor` is null: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L88
So if one looks at the logs, they do not tell that there was a problem during `Executor` creation and that's why it was killed.
This PR explicitly catches exceptions in `Executor` creation, logs a proper message and then exits the JVM. Also, I have changed the `exitExecutor` method to accept a `reason` so that backends can use that reason and do things like logging to a DB to get an aggregate of such exits at a cluster level.
## How was this patch tested?
I am relying on existing tests
Author: Tejas Patil <[email protected]>
Closes #14202 from tejasapatil/exit_executor_failure.
(cherry picked from commit b2f24f9)
Signed-off-by: Shixiong Zhu <[email protected]>
…expressions
## What changes were proposed in this pull request?
This patch is just a slightly safer way to fix the issue we encountered in #14168, should this pattern re-occur at other places in the code.
## How was this patch tested?
Existing tests. Also, I manually tested that it fixes the problem in SPARK-16514 without having the proposed change in #14168.
Author: Sameer Agarwal <[email protected]>
Closes #14227 from sameeragarwal/codegen.
(cherry picked from commit a1ffbad)
Signed-off-by: Reynold Xin <[email protected]>
… help unidoc/genjavadoc compatibility
## What changes were proposed in this pull request?
These are yet more changes that resolve problems with unidoc/genjavadoc and Java 8. It does not fully resolve the problem, but gets rid of as many errors as we can from this end.
## How was this patch tested?
Jenkins build of docs
Author: Sean Owen <[email protected]>
Closes #14221 from srowen/SPARK-3359.3.
(cherry picked from commit 5ec0d69)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?
Updates the programming guide for spark.gapply/spark.gapplyCollect. Similar to other examples, I used the `faithful` dataset to demonstrate gapply's functionality. Please let me know if you prefer another example.
## How was this patch tested?
Existing test cases in R
Author: Narine Kokhlikyan <[email protected]>
Closes #14090 from NarineK/gapplyProgGuide.
(cherry picked from commit 4167304)
Signed-off-by: Shivaram Venkataraman <[email protected]>
## What changes were proposed in this pull request?
Add a check-cran.sh script that runs `R CMD check` as CRAN. Also fixes a number of issues pointed out by the check. These include
- Updating `DESCRIPTION` to be appropriate
- Adding a .Rbuildignore to ignore lintr, src-native, html that are non-standard files / dirs
- Adding aliases to all S4 methods in DataFrame, Column, GroupedData etc. This is required as stated in https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Documenting-S4-classes-and-methods
- Other minor fixes
## How was this patch tested?
SparkR unit tests, running the above mentioned script
Author: Shivaram Venkataraman <[email protected]>
Closes #14173 from shivaram/sparkr-cran-changes.
(cherry picked from commit c33e4b0)
Signed-off-by: Shivaram Venkataraman <[email protected]>
## What changes were proposed in this pull request?
This patch moves regexp related unit tests from StringExpressionsSuite to RegexpExpressionsSuite to match the file name for regexp expressions.
## How was this patch tested?
This is a test only change.
Author: Reynold Xin <[email protected]>
Closes #14230 from rxin/SPARK-16584.
(cherry picked from commit 7b84758)
Signed-off-by: Reynold Xin <[email protected]>
This patch deprecates monotonicallyIncreasingId in Scala/Java, as done in Python. This patch was originally written by HyukjinKwon. Closes #14236. (cherry picked from commit 480c870) Signed-off-by: Reynold Xin <[email protected]>
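For callers, the change amounts to preferring the snake-case function that already exists in `org.apache.spark.sql.functions`; a small hedged example:
```scala
import org.apache.spark.sql.functions.monotonically_increasing_id

// monotonicallyIncreasingId is now deprecated in Scala/Java as well;
// monotonically_increasing_id is the supported spelling.
val withIds = spark.range(5).withColumn("uid", monotonically_increasing_id())
```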
## What changes were proposed in this pull request?
fininsh => finish
## How was this patch tested?
N/A
Author: WeichenXu <[email protected]>
Closes #14238 from WeichenXu123/fix_fininsh_typo.
(cherry picked from commit a529fc9)
Signed-off-by: Sean Owen <[email protected]>
…ark-submit
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-16055
When the sparkPackages argument is passed and we detect that we are in R script mode, we should print a warning that the --packages flag should be used with spark-submit.
## How was this patch tested?
Tested on my system locally
Author: krishnakalyan3 <[email protected]>
Closes #14179 from krishnakalyan3/spark-pkg.
(cherry picked from commit 8ea3f4e)
Signed-off-by: Shivaram Venkataraman <[email protected]>
…ansformation
## What changes were proposed in this pull request?
In ScriptInputOutputSchema, we read the default RecordReader and RecordWriter from conf. Since Spark 2.0 has deleted those config keys from hive conf, we have to set the default reader/writer class names ourselves. Otherwise we will get None for LazySimpleSerde, and the data written would not be readable by the script. The test case added worked fine with the previous version of Spark, but would fail now.
## How was this patch tested?
added a test case in SQLQuerySuite.
Closes #14169
Author: Daoyuan Wang <[email protected]>
Author: Yin Huai <[email protected]>
Closes #14249 from yhuai/scriptTransformation.
(cherry picked from commit 96e9afa)
Signed-off-by: Yin Huai <[email protected]>
## What changes were proposed in this pull request?
Fix code style from ad hoc review of RC4 doc
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <[email protected]>
Closes #14250 from felixcheung/rdocs2rc4.
(cherry picked from commit 75f0efe)
Signed-off-by: Shivaram Venkataraman <[email protected]>
…QL directly
## What changes were proposed in this pull request?
This PR improves `LogicalPlanToSQLSuite` to check the generated SQL directly by **structure**. So far, `LogicalPlanToSQLSuite` relies on `checkHiveQl` to ensure the **successful SQL generation** and **answer equality**. However, it does not guarantee the generated SQL is the same or will not be changed unnoticeably.
## How was this patch tested?
Pass the Jenkins. This is only a testsuite change.
Author: Dongjoon Hyun <[email protected]>
Closes #14235 from dongjoon-hyun/SPARK-16590.
(cherry picked from commit ea78edb)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?
Given that filters based on non-deterministic constraints shouldn't be pushed down in the query plan, unnecessarily inferring them is confusing and a source of potential bugs. This patch simplifies the inferring logic by simply ignoring them.
## How was this patch tested?
Added a new test in `ConstraintPropagationSuite`.
Author: Sameer Agarwal <[email protected]>
Closes #14795 from sameeragarwal/deterministic-constraints.
(cherry picked from commit ac27557)
Signed-off-by: Reynold Xin <[email protected]>
…Info == null
## What changes were proposed in this pull request?
Handle null from Hadoop getLocationInfo directly instead of catching (and logging) exception
## How was this patch tested?
Jenkins tests
Author: Sean Owen <[email protected]>
Closes #14760 from srowen/SPARK-17193.
(cherry picked from commit 2bcd5d5)
Signed-off-by: Sean Owen <[email protected]>
…afe-backed data
Currently `MapObjects` does not make copies of unsafe-backed data, leading to problems like [SPARK-17061](https://issues.apache.org/jira/browse/SPARK-17061) [SPARK-17093](https://issues.apache.org/jira/browse/SPARK-17093).
This patch makes `MapObjects` make copies of unsafe-backed data.
Generated code - prior to this patch:
```java
...
/* 295 */ if (isNull12) {
/* 296 */   convertedArray1[loopIndex1] = null;
/* 297 */ } else {
/* 298 */   convertedArray1[loopIndex1] = value12;
/* 299 */ }
...
```
Generated code - after this patch:
```java
...
/* 295 */ if (isNull12) {
/* 296 */   convertedArray1[loopIndex1] = null;
/* 297 */ } else {
/* 298 */   convertedArray1[loopIndex1] = value12 instanceof UnsafeRow? value12.copy() : value12;
/* 299 */ }
...
```
Add a new test case which would fail without this patch.
Author: Liwei Lin <[email protected]>
Closes #14698 from lw-lin/mapobjects-copy.
(cherry picked from commit e0b20f9)
Signed-off-by: Herman van Hovell <[email protected]>
…s of unsafe-backed data"
This reverts commit fb1c697.
…es of unsafe-backed data
## What changes were proposed in this pull request?
This PR backports #14698 to branch-2.0. See that PR for more details. All credit should go to lw-lin.
Author: Herman van Hovell <[email protected]>
Author: Liwei Lin <[email protected]>
Closes #14806 from hvanhovell/SPARK-17061.
…on when Filter's isNotNull Constraints Unable to Filter Out All Null-supplying Rows
### What changes were proposed in this pull request?
This PR fixes an incorrect outer join elimination when the filter's `isNotNull` constraint is unable to filter out all null-supplying rows, for example `isnotnull(coalesce(b#227, c#238))`.
Users can hit this error when they try to use `using/natural outer join`, which is converted to a normal outer join with a `coalesce` expression on the `using columns`. For example,
```Scala
val a = Seq((1, 2), (2, 3)).toDF("a", "b")
val b = Seq((2, 5), (3, 4)).toDF("a", "c")
val c = Seq((3, 1)).toDF("a", "d")
val ab = a.join(b, Seq("a"), "fullouter")
ab.join(c, "a").explain(true)
```
The dataframe `ab` performs a `using` full-outer join, which is converted to a normal outer join with a `coalesce` expression. Constraint inference generates a `Filter` with the constraint `isnotnull(coalesce(b#227, c#238))`. This then triggers a wrong outer join elimination and produces a wrong result.
```
Project [a#251, b#227, c#237, d#247]
+- Join Inner, (a#251 = a#246)
:- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
: +- Join FullOuter, (a#226 = a#236)
: :- Project [_1#223 AS a#226, _2#224 AS b#227]
: : +- LocalRelation [_1#223, _2#224]
: +- Project [_1#233 AS a#236, _2#234 AS c#237]
: +- LocalRelation [_1#233, _2#234]
+- Project [_1#243 AS a#246, _2#244 AS d#247]
+- LocalRelation [_1#243, _2#244]
== Optimized Logical Plan ==
Project [a#251, b#227, c#237, d#247]
+- Join Inner, (a#251 = a#246)
:- Project [coalesce(a#226, a#236) AS a#251, b#227, c#237]
: +- Filter isnotnull(coalesce(a#226, a#236))
: +- Join FullOuter, (a#226 = a#236)
: :- LocalRelation [a#226, b#227]
: +- LocalRelation [a#236, c#237]
+- LocalRelation [a#246, d#247]
```
**A note to the `Committer`**, please also give the credit to dongjoon-hyun who submitted another PR for fixing this issue. #14580
### How was this patch tested?
Added test cases
Author: gatorsmile <[email protected]>
Closes #14661 from gatorsmile/fixOuterJoinElimination.
(cherry picked from commit d2ae639)
Signed-off-by: Herman van Hovell <[email protected]>
…ory Cataloged Tables
### What changes were proposed in this pull request?
Currently, `Analyze Table` is only used for Hive-serde tables. We should issue exceptions in all the other cases. When the tables are data source tables, we issued an exception. However, when tables are In-Memory Cataloged tables, we do not issue any exception.
This PR is to issue an exception when the tables are in-memory cataloged. For example,
```SQL
CREATE TABLE tbl(a INT, b INT) USING parquet
```
`tbl` is a `SimpleCatalogRelation` when the hive support is not enabled.
### How was this patch tested?
Added two test cases. One of them is just to improve the test coverage when the analyzed table is data source tables.
Author: gatorsmile <[email protected]>
Closes #14781 from gatorsmile/analyzeInMemoryTable2.
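A hedged sketch of the behavior this introduces, assuming the error surfaces as an `AnalysisException` inside a test suite (exact message omitted; `intercept` comes from ScalaTest):
```scala
import org.apache.spark.sql.AnalysisException

spark.sql("CREATE TABLE tbl(a INT, b INT) USING parquet")

// With this change, analyzing an in-memory cataloged table raises an error
// instead of silently succeeding.
intercept[AnalysisException] {
  spark.sql("ANALYZE TABLE tbl COMPUTE STATISTICS")
}
```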
In 2.0, we verify the data type against the schema for every row for safety, but at a performance cost; this PR makes it optional.
When we verify the data type for StructType, it does not support all the types we support in schema inference (for example, dict); this PR fixes that to make them consistent.
For a Row object which is created using named arguments, the fields are sorted by name, so their order may differ from the order in the provided schema; this PR fixes that by ignoring the order of fields in this case.
Created regression tests for them.
Author: Davies Liu <[email protected]>
Closes #14469 from davies/py_dict.
## What changes were proposed in this pull request?
This is a back port of #14673 addressing merge conflicts in package.scala that prevented a cherry-pick to `branch-2.0` when it was merged to `master`.
Since the History Server currently loads all application's data it can OOM if too many applications have a significant task count. This trims tasks by `spark.ui.retainedTasks` (default: 100000).
## How was this patch tested?
Manual testing and dev/run-tests
Author: Alex Bozarth <[email protected]>
Closes #14794 from ajbozarth/spark15083-branch-2.0.
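A hedged example of lowering the limit further for a memory-constrained History Server; the property would normally be set in spark-defaults.conf, shown here via SparkConf purely for illustration:
```scala
import org.apache.spark.SparkConf

// Retain at most 50,000 tasks per stage in the UI/History Server data
// (the value is illustrative; the default after this patch is 100000).
val conf = new SparkConf().set("spark.ui.retainedTasks", "50000")
```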
## What changes were proposed in this pull request?
R adds a CRAN check which generates the cran-check.out file. This file should be ignored in git.
## How was this patch tested?
Manual test. Ran a clean test and git status to make sure the file is not included in git.
Author: [email protected] <[email protected]>
Closes #14774 from wangmiao1981/ignore.
(cherry picked from commit 9958ac0)
Signed-off-by: Shivaram Venkataraman <[email protected]>
This patch updates `Literal.sql` to properly generate SQL for `NaN` and `Infinity` float and double literals: these special values need to be handled differently from regular values, since simply appending a suffix to the value's `toString()` representation will not work for these values. Author: Josh Rosen <[email protected]> Closes #14777 from JoshRosen/SPARK-17205. (cherry picked from commit 3e4c7db) Signed-off-by: Herman van Hovell <[email protected]>
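One plausible shape of the handling, hedged since the patch itself is not reproduced here: special floating-point values have to round-trip through a cast from a string literal rather than a bare numeric token.
```scala
// Illustrative only -- the real Literal.sql covers more cases and types.
def doubleLiteralToSql(v: Double): String =
  if (v.isNaN) "CAST('NaN' AS DOUBLE)"
  else if (v.isPosInfinity) "CAST('Infinity' AS DOUBLE)"
  else if (v.isNegInfinity) "CAST('-Infinity' AS DOUBLE)"
  else s"${v}D"
```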
… the respective log level is enabled
This is simply a backport of #14798 to `branch-2.0`. This backport omits the change to `ExternalShuffleBlockHandler.java`. In `branch-2.0`, that file does not contain the log message that was patched in `master`.
Author: Michael Allman <[email protected]>
Closes #14811 from mallman/spark-17231-logging_perf_improvements-2.0_backport.
## What changes were proposed in this pull request?
Updated links of external dstream projects.
## How was this patch tested?
Just document changes.
Author: Shixiong Zhu <[email protected]>
Closes #14814 from zsxwing/dstream-link.
(cherry picked from commit 341e0e7)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?
The original example doesn't work because the features are not categorical. This PR fixes this by changing to another dataset.
## How was this patch tested?
Manual test.
Author: Junyang Qian <[email protected]>
Closes #14820 from junyangq/SPARK-FixNaiveBayes.
(cherry picked from commit 1883216)
Signed-off-by: Felix Cheung <[email protected]>
… files indefinitely
## What changes were proposed in this pull request?
Before this change, FileStreamSource uses an in-memory hash set to track the list of files processed by the engine. The list can grow indefinitely, leading to OOM or overflow of the hash set.
This patch introduces a new user-defined option called "maxFileAge", default to 24 hours. If a file is older than this age, FileStreamSource will purge it from the in-memory map that was used to track the list of files that have been processed.
## How was this patch tested?
Added unit tests for the underlying utility, and also added an end-to-end test to validate the purge in FileStreamSourceSuite. Also verified the new test cases would fail when the timeout was set to a very large number.
Author: petermaxlee <[email protected]>
Closes #14728 from petermaxlee/SPARK-17165.
(cherry picked from commit 9812f7d)
Signed-off-by: Shixiong Zhu <[email protected]>
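A hedged sketch of opting into the new option on a file-based stream (format and path are placeholders; the option name is taken from the description above):
```scala
// Files older than 24 hours are dropped from the source's in-memory tracking map.
val lines = spark.readStream
  .format("text")
  .option("maxFileAge", "24h")
  .load("/data/incoming")
```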
## What changes were proposed in this pull request?
This PR adds parser support for `BigDecimal` literals. If you append the suffix `BD` to a valid number then it will be interpreted as a `BigDecimal`; for example, `12.0E10BD` will be interpreted as a BigDecimal with scale -9 and precision 3. This is useful in situations where you need exact values.
## How was this patch tested?
Added tests to `ExpressionParserSuite`, `ExpressionSQLBuilderSuite` and `SQLQueryTestSuite`.
Author: Herman van Hovell <[email protected]>
Closes #14819 from hvanhovell/SPARK-17246.
(cherry picked from commit a11d10f)
Signed-off-by: Reynold Xin <[email protected]>
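A small hedged example of the new literal suffix, run through `spark.sql` (the exact decimal precision and scale in the printed schema are elided here):
```scala
// The BD suffix makes 12.0E10 an exact BigDecimal literal instead of a double.
spark.sql("SELECT 12.0E10BD AS x").printSchema()
// root
//  |-- x: decimal(...) (nullable = false)
```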
## What changes were proposed in this pull request?
This patch adds a purge interface to MetadataLog, and an implementation in HDFSMetadataLog. The purge function is currently unused, but I will use it to purge old execution and file source logs in follow-up patches. These changes are required in a production structured streaming job that runs for a long period of time.
## How was this patch tested?
Added a unit test case in HDFSMetadataLogSuite.
Author: petermaxlee <[email protected]>
Closes #14802 from petermaxlee/SPARK-17235.
(cherry picked from commit f64a1dd)
Signed-off-by: Reynold Xin <[email protected]>
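A hedged guess at the shape of such an interface, using a hypothetical trait name; the actual signature lives in MetadataLog/HDFSMetadataLog in the patch:
```scala
// Hypothetical sketch: log entries for batches at or before the threshold can be discarded.
trait PurgeableMetadataLog {
  def purge(thresholdBatchId: Long): Unit
}
```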
…ditions
## What changes were proposed in this pull request?
Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling that.
## How was this patch tested?
A new test in `FilterPushdownSuite` that checks catalyst behavior for both deterministic and non-deterministic join conditions.
Author: Sameer Agarwal <[email protected]>
Closes #14815 from sameeragarwal/constraint-inputfile.
(cherry picked from commit 540e912)
Signed-off-by: Yin Huai <[email protected]>
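A hedged illustration of the kind of query the new test presumably covers: a stateful predicate such as `rand()` in a join condition must not be evaluated in a different position than written.
```scala
import org.apache.spark.sql.functions.rand

val left  = spark.range(10).toDF("id")
val right = spark.range(10).toDF("id2")

// The rand() > 0.5 predicate is non-deterministic, so the optimizer must keep it
// in the join condition rather than pushing it down to either side.
val joined = left.join(right, left("id") === right("id2") && rand() > 0.5)
```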
…wn file
As part of breaking Optimizer.scala apart, this patch moves various finish analysis optimization stage rules into a single file. I'm submitting separate pull requests so we can more easily merge this in branch-2.0 to simplify optimizer backports.
This should be covered by existing tests.
Author: Reynold Xin <[email protected]>
Closes #14838 from rxin/SPARK-17269.
(cherry picked from commit dcefac4)
Signed-off-by: Reynold Xin <[email protected]>
…branch-2.0)
## What changes were proposed in this pull request?
As part of breaking Optimizer.scala apart, this patch moves various Dataset object optimization rules into a single file. I'm submitting separate pull requests so we can more easily merge this in branch-2.0 to simplify optimizer backports. This is #14839 but for branch-2.0.
## How was this patch tested?
This should be covered by existing tests.
Author: Reynold Xin <[email protected]>
Closes #14843 from rxin/SPARK-17270-branch-2.0.
## What changes were proposed in this pull request?
As part of breaking Optimizer.scala apart, this patch moves various join rules into a single file.
## How was this patch tested?
This should be covered by existing tests.
Author: Reynold Xin <[email protected]>
Closes #14846 from rxin/SPARK-17274.
(cherry picked from commit 718b6ba)
Signed-off-by: Reynold Xin <[email protected]>
## What changes were proposed in this pull request?
This PR fixes the bug shown below in sampling with replacement:
```
val df = Seq((1, 0), (2, 0), (3, 0)).toDF("a", "b")
df.sample(true, 2.0).withColumn("c", monotonically_increasing_id).select($"c").show
+---+
| c|
+---+
| 0|
| 1|
| 1|
| 1|
| 2|
+---+
```
## How was this patch tested?
Added a test in `DataFrameSuite`.
Author: Takeshi YAMAMURO <[email protected]>
Closes #14800 from maropu/FixSampleBug.
(cherry picked from commit cd0ed31)
Signed-off-by: Sean Owen <[email protected]>
…Matrix.
## What changes were proposed in this pull request?
The require condition and message don't match, and the condition should also be optimized. Small change. Please kindly let me know if a JIRA is required.
## How was this patch tested?
No additional test required.
Author: Peng, Meng <[email protected]>
Closes #14824 from mpjlu/smallChangeForMatrixRequire.
(cherry picked from commit 40168db)
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
Make the event timeline bar expand to the full length of the bar (which is total time).
This issue occurs only on Chrome; Firefox looks fine. Haven't tested other browsers.
## How was this patch tested?
Inspection in browsers
Before
After
Author: Robert Kruszewski <[email protected]>
Closes #14791 from robert3005/robertk/event-timeline.
(cherry picked from commit 9fbced5)
Signed-off-by: Sean Owen <[email protected]>
srowen added a commit to srowen/spark that referenced this pull request on Aug 27, 2016:
Closes apache#10995 Closes apache#13658 Closes apache#14505 Closes apache#14536 Closes apache#12753 Closes apache#14449 Closes apache#12694 Closes apache#12695 Closes apache#14810
…type tests for JSON and verify all unsupported types in CSV
## What changes were proposed in this pull request?
This backports #14829
## How was this patch tested?
Tests in `JsonHadoopFsRelation` and `CSVSuite`.
Author: hyukjinkwon <[email protected]>
Closes #14840 from HyukjinKwon/SPARK-16216-followup-backport.