When a column uses an alias, the ORDER BY result is wrong #16890
Closed
Conversation
## What changes were proposed in this pull request?
In #16296, we reached a consensus that we should hide the external/managed table concept from users and only expose custom table paths. This PR renames `Catalog.createExternalTable` to `createTable` (still keeping the old versions for backward compatibility), and only sets the table type to EXTERNAL if `path` is specified in options.
## How was this patch tested?
New tests in `CatalogSuite`.
Author: Wenchen Fan <[email protected]>
Closes #16528 from cloud-fan/create-table.
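A minimal sketch of the resulting behavior, assuming the `Catalog.createTable` overloads that landed around this era (table names and paths are made up; this is not the PR's own test code):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructType}

val spark = SparkSession.builder().master("local").appName("createTable-sketch").getOrCreate()
val schema = new StructType().add("id", LongType)

// With a "path" option, the table type is set to EXTERNAL.
spark.catalog.createTable("ext_t", "parquet", schema, Map("path" -> "/tmp/ext_t"))

// Without a path, the table is managed and lives under the warehouse directory.
spark.catalog.createTable("managed_t", "parquet", schema, Map.empty[String, String])
```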
…n directory.
## What changes were proposed in this pull request?
#16092 moved YARN resource manager related code to the resource-managers/yarn directory. The test case `YarnSchedulerBackendSuite` was added after that but in the wrong place. This PR moves it to the correct directory.
## How was this patch tested?
Existing test.
Author: Yanbo Liang <[email protected]>
Closes #16595 from yanboliang/yarn.
## What changes were proposed in this pull request?
SET LOCATION also works on a managed table (or a table created without a custom path). The behavior is a little weird, but since we already support it, we should add a test to show it explicitly.
## How was this patch tested?
N/A
Author: Wenchen Fan <[email protected]>
Closes #16597 from cloud-fan/set-location.
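A hedged illustration of the behavior being pinned down (table and path names invented; assumes an active `SparkSession` named `spark`):
```scala
spark.sql("CREATE TABLE managed_t (id INT) USING parquet")      // managed, no custom path
spark.sql("ALTER TABLE managed_t SET LOCATION '/tmp/new_loc'")  // accepted even for a managed table
spark.sql("INSERT INTO managed_t VALUES (1)")                   // new data lands under /tmp/new_loc
```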
## What changes were proposed in this pull request?
Changing the default parquet logging levels to reflect the changes made in PR [#15538](#15538), in order to prevent the flood of log messages by default.
## How was this patch tested?
Default log output when reading from parquet 1.6 files was compared with and without this change. The change eliminates the extraneous logging and makes the output readable.
Author: Nick Lavers <[email protected]>
Closes #16580 from nicklavers/spark-19219-set_default_parquet_log_level.
…n `sbt/sbt unidoc`
## What changes were proposed in this pull request?
This PR proposes to fix ambiguous link warnings by simply making them as code blocks for both javadoc and scaladoc.
```
[warn] .../spark/core/src/main/scala/org/apache/spark/Accumulator.scala:20: The link target "SparkContext#accumulator" is ambiguous. Several members fit the target:
[warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala:281: The link target "runMiniBatchSGD" is ambiguous. Several members fit the target:
[warn] .../spark/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala:83: The link target "run" is ambiguous. Several members fit the target:
...
```
This PR also fixes javadoc8 break as below:
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
[error] * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
[error] * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/LowPrioritySQLImplicits.java:7: error: reference not found
[error] * newProductEncoder - to disambiguate for {link List}s which are both {link Seq} and {link Product}
[error] ^
[info] 3 errors
```
## How was this patch tested?
Manually via `sbt unidoc > output.txt`, then checked via `cat output.txt | grep ambiguous`
and `sbt unidoc | grep error`.
Author: hyukjinkwon <[email protected]>
Closes #16604 from HyukjinKwon/SPARK-3249.
…ate docs
## What changes were proposed in this pull request?
The name of the `spark.yarn.access.namenodes` configuration does not actually reflect its usage: inside the code, it lists the Hadoop filesystems we get tokens for, not the NameNodes. This proposes to update the name of this configuration and change the related code and docs.
## How was this patch tested?
Local verification.
Author: jerryshao <[email protected]>
Closes #16560 from jerryshao/SPARK-19179.
… cloudpickle changes for PySpark to work with Python 3.6.0
## What changes were proposed in this pull request?
Currently, PySpark does not work with Python 3.6.0.
Running `./bin/pyspark` simply throws the error as below and PySpark does not work at all:
```
Traceback (most recent call last):
File ".../spark/python/pyspark/shell.py", line 30, in <module>
import pyspark
File ".../spark/python/pyspark/__init__.py", line 46, in <module>
from pyspark.context import SparkContext
File ".../spark/python/pyspark/context.py", line 36, in <module>
from pyspark.java_gateway import launch_gateway
File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
from py4j.java_gateway import java_import, JavaGateway, GatewayClient
File "<frozen importlib._bootstrap>", line 961, in _find_and_load
File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
import pkgutil
File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
cls = _old_namedtuple(*args, **kwargs)
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
```
The root cause seems to be that some arguments of `namedtuple` became completely keyword-only in Python 3.6.0 (see https://bugs.python.org/issue25628).
We currently copy this function via `types.FunctionType`, which does not set the default values of keyword-only arguments (meaning `namedtuple.__kwdefaults__`), and this seems to leave those arguments unbound inside the function.
This PR proposes to work around this by manually setting them via `kwargs`, as `types.FunctionType` does not seem to support setting them.
Also, this PR ports the changes in cloudpickle for compatibility with Python 3.6.0.
## How was this patch tested?
Manually tested with Python 2.7.6 and Python 3.6.0.
```
./bin/pyspark
```
manual creation of `namedtuple` both locally and in an RDD with Python 3.6.0,
and Jenkins tests for other Python versions.
Also,
```
./run-tests --python-executables=python3.6
```
```
Will test against the following Python executables: ['python3.6']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Finished test(python3.6): pyspark.sql.tests (192s)
Finished test(python3.6): pyspark.accumulators (3s)
Finished test(python3.6): pyspark.mllib.tests (198s)
Finished test(python3.6): pyspark.broadcast (3s)
Finished test(python3.6): pyspark.conf (2s)
Finished test(python3.6): pyspark.context (14s)
Finished test(python3.6): pyspark.ml.classification (21s)
Finished test(python3.6): pyspark.ml.evaluation (11s)
Finished test(python3.6): pyspark.ml.clustering (20s)
Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
Finished test(python3.6): pyspark.streaming.tests (240s)
Finished test(python3.6): pyspark.tests (240s)
Finished test(python3.6): pyspark.ml.recommendation (19s)
Finished test(python3.6): pyspark.ml.feature (36s)
Finished test(python3.6): pyspark.ml.regression (37s)
Finished test(python3.6): pyspark.ml.tuning (28s)
Finished test(python3.6): pyspark.mllib.classification (26s)
Finished test(python3.6): pyspark.mllib.evaluation (18s)
Finished test(python3.6): pyspark.mllib.clustering (44s)
Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
Finished test(python3.6): pyspark.mllib.feature (26s)
Finished test(python3.6): pyspark.mllib.fpm (23s)
Finished test(python3.6): pyspark.mllib.random (8s)
Finished test(python3.6): pyspark.ml.tests (92s)
Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
Finished test(python3.6): pyspark.mllib.linalg.distributed (25s)
Finished test(python3.6): pyspark.mllib.stat._statistics (15s)
Finished test(python3.6): pyspark.mllib.recommendation (24s)
Finished test(python3.6): pyspark.mllib.regression (26s)
Finished test(python3.6): pyspark.profiler (9s)
Finished test(python3.6): pyspark.mllib.tree (16s)
Finished test(python3.6): pyspark.shuffle (1s)
Finished test(python3.6): pyspark.mllib.util (18s)
Finished test(python3.6): pyspark.serializers (11s)
Finished test(python3.6): pyspark.rdd (20s)
Finished test(python3.6): pyspark.sql.conf (8s)
Finished test(python3.6): pyspark.sql.catalog (17s)
Finished test(python3.6): pyspark.sql.column (18s)
Finished test(python3.6): pyspark.sql.context (18s)
Finished test(python3.6): pyspark.sql.group (27s)
Finished test(python3.6): pyspark.sql.dataframe (33s)
Finished test(python3.6): pyspark.sql.functions (35s)
Finished test(python3.6): pyspark.sql.types (6s)
Finished test(python3.6): pyspark.sql.streaming (13s)
Finished test(python3.6): pyspark.streaming.util (0s)
Finished test(python3.6): pyspark.sql.session (16s)
Finished test(python3.6): pyspark.sql.window (4s)
Finished test(python3.6): pyspark.sql.readwriter (35s)
Tests passed in 433 seconds
```
Author: hyukjinkwon <[email protected]>
Closes #16429 from HyukjinKwon/SPARK-19019.
## What changes were proposed in this pull request?
`dropDuplicates` will create an Alias using the same exprId, so `StreamExecution` should also replace Alias if necessary.
## How was this patch tested?
test("SPARK-19065: dropDuplicates should not create expressions using the same id")
Author: Shixiong Zhu <[email protected]>
Closes #16564 from zsxwing/SPARK-19065.
…partition spec
### What changes were proposed in this pull request?
Empty partition column values are not valid in a partition specification. Before this PR, we accepted them, and the Hive metastore did not detect or disallow them either. As a result, users hit the following strange error.
```Scala
val df = spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("partCol1", "name")
df.write.mode("overwrite").partitionBy("partCol1").saveAsTable("partitionedTable")
spark.sql("alter table partitionedTable drop partition(partCol1='')")
spark.table("partitionedTable").show()
```
In the above example, the WHOLE table is DROPPED when users specify a partition spec containing only one partition column with an empty value.
When there is more than one partition column, the Hive metastore APIs simply ignore the columns with empty values and treat the spec as a partial spec. This is also unexpected and does not follow actual Hive behavior. This PR disallows users from specifying such an invalid partition spec in the `SessionCatalog` APIs.
### How was this patch tested?
Added test cases
Author: gatorsmile <[email protected]>
Closes #16583 from gatorsmile/disallowEmptyPartColValue.
…ify the column in jdbc API
## What changes were proposed in this pull request?
The `jdbc` API does not check `lowerBound` and `upperBound` when we specify the `column`, and just throws the following exception:
>```int() argument must be a string or a number, not 'NoneType'```
If we check the parameters, we can give a friendlier suggestion.
## How was this patch tested?
Tested using the pyspark shell, without the lowerBound and upperBound parameters.
Author: DjvuLee <[email protected]>
Closes #16599 from djvulee/pysparkFix.
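For reference, a sketch of the partitioned-read call in question on the Scala side (the PySpark fix adds the analogous parameter check; the URL and table are hypothetical). If `columnName` is given, the bounds and partition count must be supplied together:
```scala
import java.util.Properties

// Partitioned JDBC read: columnName, lowerBound, upperBound and numPartitions
// go together; this is the combination the PR adds a friendlier error for.
val df = spark.read.jdbc(
  url = "jdbc:postgresql://host/db",  // hypothetical connection string
  table = "people",
  columnName = "id",
  lowerBound = 0L,
  upperBound = 10000L,
  numPartitions = 8,
  connectionProperties = new Properties())
```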
…in.withOrigin()
## What changes were proposed in this pull request?
Remove a duplicate call of the reset() function in CurrentOrigin.withOrigin().
## How was this patch tested?
Existing test cases.
Author: jiangxingbo <[email protected]>
Closes #16615 from jiangxb1987/dummy-code.
## What changes were proposed in this pull request?
In append mode, we check whether the schema of the write is compatible with the schema of the existing data. Finding the existing schema for files can be a significant performance issue in cloud environments, so this patch removes the check. Note that for catalog tables, we always do the check, as discussed in #16339 (comment).
## How was this patch tested?
N/A
Closes #16339.
Author: Reynold Xin <[email protected]>
Closes #16622 from rxin/SPARK-18917.
## What changes were proposed in this pull request?
Added outer_explode, outer_posexplode, and outer_inline functions and expressions. Also fixed a bug in GenerateExec.scala for CollectionGenerator: previously it did not correctly handle the case of outer with empty collections, only with nulls.
## How was this patch tested?
New tests added to GeneratorFunctionSuite.
Author: Bogdan Raducanu <[email protected]>
Closes #16608 from bogdanrdc/SPARK-13721.
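A sketch of the new outer behavior (the released function is spelled `explode_outer`, while the PR text says outer_explode; the data is made up):
```scala
import org.apache.spark.sql.functions.explode_outer
import spark.implicits._  // assumes an active SparkSession named spark

val df = Seq((1, Seq("a", "b")), (2, Seq.empty[String])).toDF("id", "xs")

// A plain explode drops the id = 2 row entirely; explode_outer keeps it
// with a null element instead.
df.select(df("id"), explode_outer(df("xs"))).show()
```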
…,LiR
## What changes were proposed in this pull request?
Add instrumentation for MLP, NB, LDA, AFT, GLM, Isotonic, and LiR.
## How was this patch tested?
Local test in spark-shell.
Author: Zheng RuiFeng <[email protected]>
Author: Ruifeng Zheng <[email protected]>
Closes #15671 from zhengruifeng/lir_instr.
## What changes were proposed in this pull request?
Inserting data into Hive tables has its own implementation, distinct from data sources: `InsertIntoHiveTable`, `SparkHiveWriterContainer` and `SparkHiveDynamicPartitionWriterContainer`.
One other major difference is that data source tables write directly to the final destination without a staging directory, and then Spark itself adds the partitions/tables to the catalog. Hive tables write to a staging directory and then call the Hive metastore's loadPartition/loadTable functions to load that data in, so we still need to keep `InsertIntoHiveTable` for this special logic. In the future, we should consider writing to the Hive table location directly, so that we don't need to call `loadTable`/`loadPartition` at the end and can remove `InsertIntoHiveTable`.
This PR removes `SparkHiveWriterContainer` and `SparkHiveDynamicPartitionWriterContainer`, and creates a `HiveFileFormat` to implement the write logic. In the future, we should also implement the read logic in `HiveFileFormat`.
## How was this patch tested?
Existing tests.
Author: Wenchen Fan <[email protected]>
Closes #16517 from cloud-fan/insert-hive.
## What changes were proposed in this pull request?
Remove unused imports and outdated comments, and fix some minor code style issues.
## How was this patch tested?
Existing UTs.
Author: uncleGen <[email protected]>
Closes #16591 from uncleGen/SPARK-19227.
**What changes were proposed in this pull request?**
Use Hadoop 2.6.5 for the Hadoop 2.6 profile; I see a bunch of fixes, including security ones, in the release notes that we should pick up.
**How was this patch tested?**
Running the unit tests now with IBM's SDK for Java, and let's see what happens with OpenJDK in the community builder - expecting no trouble as it is only a minor release.
Author: Adam Roberts <[email protected]>
Closes #16616 from a-roberts/Hadoop265Bumper.
## What changes were proposed in this pull request?
On CREATE/ALTER of a view, it's no longer necessary to generate a SQL text string from the LogicalPlan; instead we store the SQL query text, the output column names of the query plan, and the current database in the CatalogTable. Permanent views created by this approach can be resolved by the current view resolution approach. The main advantages are:
1. If you update an underlying view, the current view also gets updated;
2. It gives us a chance to get rid of SQL generation for operators.
Major changes of this PR:
1. Generate the view-specific properties (e.g. view default database, view query output column names) during permanent view creation and store them as properties in the CatalogTable;
2. Update the commands `CreateViewCommand` and `AlterViewAsCommand` to get rid of SQL generation.
## How was this patch tested?
Existing tests.
Author: jiangxingbo <[email protected]>
Closes #16613 from jiangxb1987/view-write-path.
…which are based on HadoopRDD or NewHadoopRDD
## What changes were proposed in this pull request?
For some datasources based on HadoopRDD or NewHadoopRDD, such as spark-xml, InputFileBlockHolder doesn't work with Python UDFs.
It can be reproduced by running the following code with `bin/pyspark --packages com.databricks:spark-xml_2.11:0.4.1`:
```python
from pyspark.sql.functions import udf, input_file_name
from pyspark.sql.types import StringType
from pyspark.sql import SparkSession

def filename(path):
    return path

session = SparkSession.builder.appName('APP').getOrCreate()
session.udf.register('sameText', filename)
sameText = udf(filename, StringType())

df = session.read.format('xml').load('a.xml', rowTag='root').select('*', input_file_name().alias('file'))
df.select('file').show()                # works
df.select(sameText(df['file'])).show()  # returns empty content
```
The issue is that in `HadoopRDD` and `NewHadoopRDD` we set the file block's info in `InputFileBlockHolder` before the returned iterator begins to be consumed. `InputFileBlockHolder` records this info in a thread-local variable. When running a Python UDF in a batch, we set up another thread to consume the iterator from the child plan's output RDD, so we can't read the info back from that other thread.
To fix this, we have to set the info in `InputFileBlockHolder` after the iterator begins to be consumed, so the info can be read in the correct thread.
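A hedged, simplified sketch of the idea (the real change lives inside `HadoopRDD`/`NewHadoopRDD`, and `FileBlockInfo` here is a hypothetical stand-in for Spark's internal thread-local holder):
```scala
// Hypothetical stand-in for Spark's internal InputFileBlockHolder (a thread-local).
object FileBlockInfo {
  private val holder = new ThreadLocal[String]
  def set(path: String): Unit = holder.set(path)
}

// Defer recording the split info until first consumption, so it happens in the
// consuming thread (e.g. the Python UDF runner) rather than the thread that
// built the iterator.
def iteratorForSplit(path: String, rows: Iterator[String]): Iterator[String] = {
  var initialized = false
  rows.map { row =>
    if (!initialized) {
      FileBlockInfo.set(path) // runs in whichever thread consumes the iterator
      initialized = true
    }
    row
  }
}
```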
## How was this patch tested?
Manual test with the above example code for the spark-xml package on pyspark: `bin/pyspark --packages com.databricks:spark-xml_2.11:0.4.1`.
Added pyspark test.
Author: Liang-Chi Hsieh <[email protected]>
Closes #16585 from viirya/fix-inputfileblock-hadooprdd.
…park release
## What changes were proposed in this pull request?
When R is starting as a package and needs to download the Spark release distribution, we need to handle errors for the download and untar steps and clean up afterwards; otherwise it will get stuck.
## How was this patch tested?
Manually.
Author: Felix Cheung <[email protected]>
Closes #16589 from felixcheung/rtarreturncode.
…eceiver idempotent.
## What changes were proposed in this pull request?
The canCommit method sends AskPermissionToCommitOutput using askWithRetry. On timeout, it sends again, so AskPermissionToCommitOutput can be received multiple times. canCommit should return the same value when called by the same attempt multiple times. In the implementation before this fix, handleAskPermissionToCommit just checked whether a committer was already registered, which is not enough: when a worker retries AskPermissionToCommitOutput it gets CommitDeniedException, and the task then fails with reason TaskCommitDenied, which is not regarded as a task failure (SPARK-11178), so TaskScheduler schedules the task indefinitely. In this fix, `ask` replaces `askWithRetry` in `canCommit` and the receiver is made idempotent.
## How was this patch tested?
Added a new unit test to OutputCommitCoordinatorSuite.
Author: jinxing <[email protected]>
Closes #16503 from jinxing64/SPARK-18113.
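A minimal sketch of the idempotence invariant described here (not Spark's actual `OutputCommitCoordinator` code): remember which attempt was authorized and return the same answer on a retried ask.
```scala
// Hypothetical, simplified coordinator illustrating the invariant.
class CommitCoordinatorSketch {
  // (stage, partition) -> attempt that was granted permission to commit
  private val authorized = scala.collection.mutable.Map.empty[(Int, Int), Int]

  def canCommit(stage: Int, partition: Int, attempt: Int): Boolean = synchronized {
    authorized.get((stage, partition)) match {
      case Some(winner) => winner == attempt // a retried ask from the winner gets the same "yes"
      case None =>
        authorized((stage, partition)) = attempt
        true
    }
  }
}
```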
…waitInitialization to avoid breaking tests
## What changes were proposed in this pull request?
#16492 missed one race condition: `StreamExecution.awaitInitialization` may throw fatal errors and fail the test. This PR just ignores `StreamingQueryException` thrown from `awaitInitialization` so that we can verify the exception in the `ExpectFailure` action later. It's fine since `StopStream` or `ExpectFailure` will catch `StreamingQueryException` as well.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #16567 from zsxwing/SPARK-19113-2.
… error
## What changes were proposed in this pull request?
We should call `StateStore.abort()` when any error occurs before the store is committed.
## How was this patch tested?
Manually.
Author: Liwei Lin <[email protected]>
Closes #16547 from lw-lin/append-filter.
…ner to not block UI when generating Streaming jobs
## What changes were proposed in this pull request?
When DStreamGraph is generating a job, it holds a lock and blocks other APIs. Because StreamingJobProgressListener (numInactiveReceivers, streamName(streamId: Int), streamIds) needs to call DStreamGraph's methods to access some information, the UI may hang if generating a job is very slow (e.g., talking to a slow Kafka cluster to fetch metadata). It's better to optimize the locks in DStreamGraph and StreamingJobProgressListener so that the UI is not blocked by job generation.
## How was this patch tested?
Existing UTs.
cc zsxwing
Author: uncleGen <[email protected]>
Closes #16601 from uncleGen/SPARK-19182.
…ning instance and fixed interfaces
## What changes were proposed in this pull request?
For all of the classifiers in MLlib we can predict probabilities, except for GBTClassifier. Also, all classifiers inherit from ProbabilisticClassifier, but GBTClassifier strangely inherits from Predictor, which is a bug. This change corrects the interface and adds the ability for the classifier to give a probabilities vector.
## How was this patch tested?
The basic ML tests were run after making the changes. I've marked this as WIP as I need to add more tests.
Author: Ilya Matiach <[email protected]>
Closes #16441 from imatiach-msft/ilmat/fix-GBT.
## What changes were proposed in this pull request?
Update known_translations per #16423 (comment).
Author: Yin Huai <[email protected]>
Closes #16628 from yhuai/known_translations.
…end on hive
## What changes were proposed in this pull request?
We have a table relation plan cache in `HiveMetastoreCatalog`, which caches a lot of things: file status, resolved data source, inferred schema, etc. However, it doesn't make sense to gate this cache on hive support; we should move it to the SQL core module so that users can use this cache without hive support. It also reduces the size of `HiveMetastoreCatalog`, making it easier to remove eventually.
Main changes:
1. Move the table relation cache to `SessionCatalog`.
2. `SessionCatalog.lookupRelation` will return `SimpleCatalogRelation` and the analyzer will convert it to `LogicalRelation` or `MetastoreRelation` later; then `HiveSessionCatalog` doesn't need to override `lookupRelation` anymore.
3. `FindDataSourceTable` will read/write the table relation cache.
## How was this patch tested?
Existing tests.
Author: Wenchen Fan <[email protected]>
Closes #16621 from cloud-fan/plan-cache.
## What changes were proposed in this pull request?
Add loglikelihood to GMM.summary.
## How was this patch tested?
Added tests.
Author: Zheng RuiFeng <[email protected]>
Author: Ruifeng Zheng <[email protected]>
Closes #12064 from zhengruifeng/gmm_metric.
…ame startswith underscore
## What changes were proposed in this pull request?
The initial shouldFilterOut() invocation filtered on the root path name (the table name in the initial call) and removed it if it contained an underscore. I moved the check one level down, so it first lists the files/directories under the given root path and then applies the filter.
## How was this patch tested?
Added a new test case for this scenario.
Author: jayadevanmurali <[email protected]>
Author: jayadevan <[email protected]>
Closes #16635 from jayadevanmurali/branch-0.1-SPARK-19059.
Builds on top of work in SPARK-8425 to update Application Level Blacklisting in the scheduler.
## What changes were proposed in this pull request?
Adds a UI to these patches by:
- defining new listener events for blacklisting and unblacklisting, nodes and executors;
- sending said events at the relevant points in BlacklistTracker;
- adding JSON (de)serialization code for these events;
- augmenting the Executors UI page to show which, and how many, executors are blacklisted;
- adding a unit test to make sure events are being fired;
- adding HistoryServerSuite coverage to verify that the SHS reads these events correctly;
- updating the Executor UI to show Blacklisted/Active/Dead as a tri-state in Executors Status.
Updates .rat-excludes to pass tests.
cc squito
## How was this patch tested?
./dev/run-tests
testOnly org.apache.spark.util.JsonProtocolSuite
testOnly org.apache.spark.scheduler.BlacklistTrackerSuite
testOnly org.apache.spark.deploy.history.HistoryServerSuite
https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh
Author: José Hiram Soltren <[email protected]>
Closes #16346 from jsoltren/SPARK-16654-submit.
## What changes were proposed in this pull request?
Implement a mapping between executionId and the corresponding QueryExecution in SQLExecution.
## How was this patch tested?
Adds a unit test.
Author: Ala Luszczak <[email protected]>
Closes #16940 from ala/execution-id.
## What changes were proposed in this pull request?
A small update to #16925:
1. Rename SubstituteHints -> ResolveHints to be more consistent with the rest of the rules.
2. Added more documentation in the rule and made it more defensive / future proof to skip views as well as CTEs.
## How was this patch tested?
This pull request contains no real logic change and all behavior should be covered by existing tests.
Author: Reynold Xin <[email protected]>
Closes #16939 from rxin/SPARK-16475.
…artition columns
## What changes were proposed in this pull request?
We throw an exception if bucket columns are part of the partition columns; this should also apply to sort columns. This PR also moves the checking logic from `DataFrameWriter` to `PreprocessTableCreation`, which is the central place for checking and normalization.
## How was this patch tested?
Updated test.
Author: Wenchen Fan <[email protected]>
Closes #16931 from cloud-fan/bucket.
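A hedged sketch of the write-side case now rejected (`df` and the column names are invented):
```scala
// After this change, a sort column that is also a partition column is rejected
// during table-creation preprocessing, just like an overlapping bucket column.
df.write
  .partitionBy("date")
  .bucketBy(8, "id")
  .sortBy("date")   // overlaps with partitionBy -> analysis error
  .saveAsTable("events")
```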
…umn statistics
## What changes were proposed in this pull request?
Support cardinality estimation and stats propagation for all join types.
Limitations:
- For inner/outer joins without any equality condition, we estimate them like a cartesian product.
- For left semi/anti joins, since we can't apply the inner-join heuristics to them, for now we just propagate the statistics from the left side. We should support them when other advanced stats (e.g. histograms) are available in Spark.
## How was this patch tested?
Add a new test suite.
Author: Zhenhua Wang <[email protected]>
Author: wangzhenhua <[email protected]>
Closes #16228 from wzhfy/joinEstimate.
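For reference, the textbook equi-join estimate that this style of estimation typically follows (assuming uniform value distributions and key containment; not quoted from the PR itself):
```latex
|A \bowtie_{A.k = B.k} B| \;\approx\; \frac{|A| \cdot |B|}{\max\big(\mathrm{ndv}(A.k),\ \mathrm{ndv}(B.k)\big)}
```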
…atch
## What changes were proposed in this pull request?
This is the 2nd batch of test cases for IN/NOT IN subqueries. This PR adds these test cases: `in-limit.sql`, `in-order-by.sql`, `not-in-group-by.sql`.
These are the queries and results from running on DB2:
[in-limit DB2 version](https://github.com/apache/spark/files/743267/in-limit.sql.db2.out.txt)
[in-order-by DB2 version](https://github.com/apache/spark/files/743269/in-order-by.sql.db2.txt)
[not-in-group-by DB2 version](https://github.com/apache/spark/files/743271/not-in-group-by.sql.db2.txt)
[output of in-limit.sql DB2](https://github.com/apache/spark/files/743276/in-limit.sql.db2.out.txt)
[output of in-order-by.sql DB2](https://github.com/apache/spark/files/743278/in-order-by.sql.db2.out.txt)
[output of not-in-group-by.sql DB2](https://github.com/apache/spark/files/743279/not-in-group-by.sql.db2.out.txt)
## How was this patch tested?
This PR adds new test cases.
Author: Kevin Yu <[email protected]>
Closes #16759 from kevinyu98/spark-18871-2.
…f 2) - scalar subquery in predicate context
## What changes were proposed in this pull request?
This PR adds new test cases for scalar subqueries in predicate context.
## How was this patch tested?
The test results are compared with the results from another SQL engine (in this case IBM DB2). If the results are equivalent, we assume they are correct.
Author: Nattavut Sutyanyong <[email protected]>
Closes #16798 from nsyca/18873-2.
… CTE)
## What changes were proposed in this pull request?
This PR adds the third and final set of tests for EXISTS subqueries.

File name | Brief description
----------|------------------
exists-cte.sql | Tests EXISTS subqueries referencing a CTE
exists-joins-and-set-ops.sql | Tests EXISTS subqueries used in joins (both when the join occurs in the outer block and in the subquery block)

DB2 results are attached here as reference:
[exists-cte-db2.txt](https://github.com/apache/spark/files/752091/exists-cte-db2.txt)
[exists-joins-and-set-ops-db2.txt](https://github.com/apache/spark/files/753283/exists-joins-and-set-ops-db2.txt) (updated)
## How was this patch tested?
The test results are compared with the results from another SQL engine (in this case IBM DB2). If the results are equivalent, we assume they are correct.
Author: Dilip Biswal <[email protected]>
Closes #16802 from dilipbiswal/exists-pr3.
… the delimiter
## What changes were proposed in this pull request?
A follow-up to disallow space as the delimiter in broadcast hints.
## How was this patch tested?
Jenkins test.
Author: Liang-Chi Hsieh <[email protected]>
Closes #16941 from viirya/disallow-space-delimiter.
…r in pyspark
## What changes were proposed in this pull request?
This PR documents the changes to QuantileDiscretizer in pyspark from PR #15428.
## How was this patch tested?
No test needed.
Signed-off-by: VinceShieh <vincent.xieintel.com>
Author: VinceShieh <[email protected]>
Closes #16922 from VinceShieh/spark-19590.
## What changes were proposed in this pull request?
This PR adds `udf` decorator syntax as proposed in [SPARK-19160](https://issues.apache.org/jira/browse/SPARK-19160). This allows users to define a UDF using simplified syntax:
```python
from pyspark.sql.decorators import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1
```
without needing to define a separate function and udf call.
## How was this patch tested?
Existing unit tests to ensure backward compatibility, and additional unit tests covering the new functionality.
Author: zero323 <[email protected]>
Closes #16533 from zero323/SPARK-19160.
## What changes were proposed in this pull request?
Add coalesce on DataFrame for down-partitioning without shuffle, and coalesce on Column.
## How was this patch tested?
Manual, unit tests.
Author: Felix Cheung <[email protected]>
Closes #16739 from felixcheung/rcoalesce.
Move `SQLViewSuite` from `sql/hive` to `sql/core`, so we can test view support without the hive metastore. Also moved the hive-specific test cases to `HiveSQLViewSuite`.
Improve the test coverage of SQLViewSuite to cover the following cases:
1. view resolution (possibly a referenced table/view has changed after the view creation);
2. handling a view with user-specified column names;
3. improved test cases for a nested view.
Also added a test case for cyclic view references, which is a known issue that is not fixed yet.
N/A
Author: jiangxingbo <[email protected]>
Closes #16674 from jiangxb1987/view-test.
…ecutionId
## What changes were proposed in this pull request?
#16940 added a test case which does not stop the spark job. It causes many failures of other test cases.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/2403/consoleFull
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/2600/consoleFull
```
[info] org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
```
## How was this patch tested?
Pass the Jenkins test.
Author: Dongjoon Hyun <[email protected]>
Closes #16943 from dongjoon-hyun/SPARK-19607-2.
… a non pre-existing location should succeed
## What changes were proposed in this pull request?
When we insert data into a datasource table via `sqlText`, and the table's location does not exist,
an exception is thrown.
example:
```
spark.sql("create table t(a string, b int) using parquet")
spark.sql("alter table t set location '/xx'")
spark.sql("insert into table t select 'c', 1")
```
Exception:
```
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: /xx;
at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
```
As discussed in the comments that follow, we should unify the behavior when reading from or writing to a datasource table with a non-pre-existing location:
1. reading from such a table returns 0 rows;
2. writing to such a table writes the data successfully.
## How was this patch tested?
unit test added
Author: windpiger <[email protected]>
Closes #16672 from windpiger/insertNotExistLocation.
## What changes were proposed in this pull request?
This is a follow-up PR of #16308. It enables timezone support in CSV/JSON parsing.
We introduce a `timeZone` option for the CSV/JSON datasources (the default value of the option is the session local timezone). The datasources use the `timeZone` option to format/parse timestamp values when writing/reading. Notice that while reading, if the timestampFormat has timezone info, the timezone option will not be used, because we should respect the timezone in the values.
For example, if you have timestamp `"2016-01-01 00:00:00"` in `GMT`, the values written with the default timezone option, which is `"GMT"` because the session local timezone is `"GMT"` here, are:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "GMT")

scala> val df = Seq(new java.sql.Timestamp(1451606400000L)).toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> df.show()
+-------------------+
|ts                 |
+-------------------+
|2016-01-01 00:00:00|
+-------------------+

scala> df.write.json("/path/to/gmtjson")
```
```sh
$ cat /path/to/gmtjson/part-*
{"ts":"2016-01-01T00:00:00.000Z"}
```
whereas with the option set to `"PST"`, they are:
```scala
scala> df.write.option("timeZone", "PST").json("/path/to/pstjson")
```
```sh
$ cat /path/to/pstjson/part-*
{"ts":"2015-12-31T16:00:00.000-08:00"}
```
We can properly read these files even if the timezone option is wrong, because the timestamp values have timezone info:
```scala
scala> val schema = new StructType().add("ts", TimestampType)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(ts,TimestampType,true))

scala> spark.read.schema(schema).json("/path/to/gmtjson").show()
+-------------------+
|ts                 |
+-------------------+
|2016-01-01 00:00:00|
+-------------------+

scala> spark.read.schema(schema).option("timeZone", "PST").json("/path/to/gmtjson").show()
+-------------------+
|ts                 |
+-------------------+
|2016-01-01 00:00:00|
+-------------------+
```
And even if the `timestampFormat` doesn't contain timezone info, we can properly read the values by setting the correct timezone option:
```scala
scala> df.write.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson")
```
```sh
$ cat /path/to/jstjson/part-*
{"ts":"2016-01-01T09:00:00"}
```
```scala
// wrong result
scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("/path/to/jstjson").show()
+-------------------+
|ts                 |
+-------------------+
|2016-01-01 09:00:00|
+-------------------+

// correct result
scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson").show()
+-------------------+
|ts                 |
+-------------------+
|2016-01-01 00:00:00|
+-------------------+
```
This PR also makes `JsonToStruct` and `StructToJson` `TimeZoneAwareExpression`s so they can evaluate values with the timezone option.
## How was this patch tested?
Existing tests and some added tests.
Author: Takuya UESHIN <[email protected]>
Closes #16750 from ueshin/issues/SPARK-18937.
## What changes were proposed in this pull request?
Right now, we only have an info-level log after we finish the tests of a Python test file. We should also log the start of a test, so that if a test is hanging we can tell which test file is running.
## How was this patch tested?
This is a change for python tests.
Author: Yin Huai <[email protected]>
Closes #16935 from yhuai/SPARK-19604.
## What changes were proposed in this pull request?
SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup in HDFSMetadataLog. This PR includes the following changes:
- ~~Remove the workaround codes for HADOOP-10622.~~ Unfortunately, there is another issue, [HADOOP-14084](https://issues.apache.org/jira/browse/HADOOP-14084), that prevents us from removing the workaround codes.
- Remove the unnecessary `writer: (T, OutputStream) => Unit` and just call `serialize` directly.
- Remove catching FileNotFoundException.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #16932 from zsxwing/metadata-cleanup.
…e Hashing
## What changes were proposed in this pull request?
This pull request includes the Python API and examples for LSH. The API changes are based on yanboliang's PR #15768, with conflicts resolved and API changes applied from the Scala API. The examples are consistent with the Scala examples for MinHashLSH and BucketedRandomProjectionLSH.
## How was this patch tested?
API and examples are tested using spark-submit:
`bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
`bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`
User guide changes are generated and manually inspected:
`SKIP_API=1 jekyll build`
Author: Yun Ni <[email protected]>
Author: Yanbo Liang <[email protected]>
Author: Yunni <[email protected]>
Closes #16715 from Yunni/spark-18080.
## What changes were proposed in this pull request?
`StreamingQuery.explain` doesn't show the correct streaming physical plan right now because `ExplainCommand` receives a runtime batch plan and its `logicalPlan.isStreaming` is always false.
This PR adds `streaming` parameter to `ExplainCommand` to allow `StreamExecution` to specify that it's a streaming plan.
Examples of the explain outputs:
- streaming DataFrame.explain()
```
== Physical Plan ==
*HashAggregate(keys=[value#518], functions=[count(1)])
+- StateStoreSave [value#518], OperatorStateId(<unknown>,0,0), Append, 0
+- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
+- StateStoreRestore [value#518], OperatorStateId(<unknown>,0,0)
+- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
+- Exchange hashpartitioning(value#518, 5)
+- *HashAggregate(keys=[value#518], functions=[partial_count(1)])
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
+- *MapElements <function1>, obj#517: java.lang.String
+- *DeserializeToObject value#513.toString, obj#516: java.lang.String
+- StreamingRelation MemoryStream[value#513], [value#513]
```
- StreamingQuery.explain(extended = false)
```
== Physical Plan ==
*HashAggregate(keys=[value#518], functions=[count(1)])
+- StateStoreSave [value#518], OperatorStateId(...,0,0), Complete, 0
+- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
+- StateStoreRestore [value#518], OperatorStateId(...,0,0)
+- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
+- Exchange hashpartitioning(value#518, 5)
+- *HashAggregate(keys=[value#518], functions=[partial_count(1)])
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
+- *MapElements <function1>, obj#517: java.lang.String
+- *DeserializeToObject value#543.toString, obj#516: java.lang.String
+- LocalTableScan [value#543]
```
- StreamingQuery.explain(extended = true)
```
== Parsed Logical Plan ==
Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
+- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
+- DeserializeToObject cast(value#543 as string).toString, obj#516: java.lang.String
+- LocalRelation [value#543]
== Analyzed Logical Plan ==
value: string, count(1): bigint
Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
+- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
+- DeserializeToObject cast(value#543 as string).toString, obj#516: java.lang.String
+- LocalRelation [value#543]
== Optimized Logical Plan ==
Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
+- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
+- DeserializeToObject value#543.toString, obj#516: java.lang.String
+- LocalRelation [value#543]
== Physical Plan ==
*HashAggregate(keys=[value#518], functions=[count(1)], output=[value#518, count(1)#524L])
+- StateStoreSave [value#518], OperatorStateId(...,0,0), Complete, 0
+- *HashAggregate(keys=[value#518], functions=[merge_count(1)], output=[value#518, count#530L])
+- StateStoreRestore [value#518], OperatorStateId(...,0,0)
+- *HashAggregate(keys=[value#518], functions=[merge_count(1)], output=[value#518, count#530L])
+- Exchange hashpartitioning(value#518, 5)
+- *HashAggregate(keys=[value#518], functions=[partial_count(1)], output=[value#518, count#530L])
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
+- *MapElements <function1>, obj#517: java.lang.String
+- *DeserializeToObject value#543.toString, obj#516: java.lang.String
+- LocalTableScan [value#543]
```
## How was this patch tested?
The updated unit test.
Author: Shixiong Zhu <[email protected]>
Closes #16934 from zsxwing/SPARK-19603.
…atch
## What changes were proposed in this pull request?
This is the 4th batch of test cases for IN/NOT IN subqueries. This PR adds these test files: `in-set-operations.sql`, `in-with-cte.sql`, `not-in-joins.sql`.
Here are the queries and results from running on DB2:
[in-set-operations DB2 version](https://github.com/apache/spark/files/772846/in-set-operations.sql.db2.txt)
[Output of in-set-operations](https://github.com/apache/spark/files/772848/in-set-operations.sql.db2.out.txt)
[in-with-cte DB2 version](https://github.com/apache/spark/files/772849/in-with-cte.sql.db2.txt)
[Output of in-with-cte](https://github.com/apache/spark/files/772856/in-with-cte.sql.db2.out.txt)
[not-in-joins DB2 version](https://github.com/apache/spark/files/772851/not-in-joins.sql.db2.txt)
[Output of not-in-joins](https://github.com/apache/spark/files/772852/not-in-joins.sql.db2.out.txt)
## How was this patch tested?
This PR adds new test cases. We compare the results from Spark with the results from another RDBMS (we used DB2 LUW). If the results are the same, we assume they are correct.
Author: Kevin Yu <[email protected]>
Closes #16915 from kevinyu98/spark-18871-44.
…ame API vs SQL
## What changes were proposed in this pull request?
Jira: https://issues.apache.org/jira/browse/SPARK-19618
Moved the check validating the number of buckets from `DataFrameWriter` to `BucketSpec` creation.
## How was this patch tested?
- Added more unit tests
Author: Tejas Patil <[email protected]>
Closes #16948 from tejasapatil/SPARK-19618_max_buckets.
…atch
## What changes were proposed in this pull request?
This is the 3rd batch of test cases for IN/NOT IN subqueries. This PR adds these test files: `in-having.sql`, `in-joins.sql`, `in-multiple-columns.sql`.
These are the queries and results from running on DB2:
[in-having DB2 version](https://github.com/apache/spark/files/772668/in-having.sql.db2.txt)
[output of in-having](https://github.com/apache/spark/files/772670/in-having.sql.db2.out.txt)
[in-joins DB2 version](https://github.com/apache/spark/files/772672/in-joins.sql.db2.txt)
[output of in-joins](https://github.com/apache/spark/files/772673/in-joins.sql.db2.out.txt)
[in-multiple-columns DB2 version](https://github.com/apache/spark/files/772678/in-multiple-columns.sql.db2.txt)
[output of in-multiple-columns](https://github.com/apache/spark/files/772680/in-multiple-columns.sql.db2.out.txt)
## How was this patch tested?
This PR adds new test cases. We compare the results from Spark with the results from another RDBMS (we used DB2 LUW). If the results are the same, we assume they are correct.
Author: Kevin Yu <[email protected]>
Closes #16841 from kevinyu98/spark-18871-33.
- Move external/java8-tests tests into core, streaming, sql and remove
- Remove MaxPermGen and related options
- Fix some reflection / TODOs around Java 8+ methods
- Update doc references to 1.7/1.8 differences
- Remove Java 7/8 related build profiles
- Update some plugins for better Java 8 compatibility
- Fix a few Java-related warnings
For the future:
- Update Java 8 examples to fully use Java 8
- Update Java tests to use lambdas for simplicity
- Update Java internal implementations to use lambdas
## How was this patch tested?
Existing tests
Author: Sean Owen <[email protected]>
Closes #16871 from srowen/SPARK-19493.
## What changes were proposed in this pull request?
The errors below seem to be caused by unidoc not understanding double-commented blocks.
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected
[error] * MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected
[error] * MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected
[error] * new MapGroupsWithStateFunction<String, Integer, Integer, String>() {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected
[error] * new MapGroupsWithStateFunction<String, Integer, Integer, String>() {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: illegal character: '#'
[error] * @Override
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: class, interface, or enum expected
[error] * @Override
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error] * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error] * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error] * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error] * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error] * public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:76: error: class, interface, or enum expected
[error] * boolean shouldRemove = ...; // Decide whether to remove the state
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:77: error: class, interface, or enum expected
[error] * if (shouldRemove) {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:79: error: class, interface, or enum expected
[error] * } else {
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:81: error: class, interface, or enum expected
[error] * state.update(newState); // Set the new state
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:82: error: class, interface, or enum expected
[error] * }
[error] ^
[error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:85: error: class, interface, or enum expected
[error] * state.update(initialState);
[error] ^
[error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:86: error: class, interface, or enum expected
[error] * }
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:90: error: class, interface, or enum expected
[error] * </code></pre>
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:92: error: class, interface, or enum expected
[error] * tparam S User-defined type of the state to be stored for each key. Must be encodable into
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:93: error: class, interface, or enum expected
[error] * Spark SQL types (see {link Encoder} for more details).
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:94: error: class, interface, or enum expected
[error] * since 2.1.1
[error] ^
```
And another link seems unrecognisable.
```
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:16: error: reference not found
[error] * That is, in every batch of the {link streaming.StreamingQuery StreamingQuery},
[error]
```
Note that this PR does not fix the two breaks as below:
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
[error] * see {link DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile} for
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:52: error: bad use of '>'
[error] * param relativeError The relative target precision to achieve (>= 0).
[error] ^
[error]
```
because these will probably be fixed soon in #16776 and I wanted to avoid potential conflicts.
## How was this patch tested?
Manually via `jekyll build`
Author: hyukjinkwon <[email protected]>
Closes #16926 from HyukjinKwon/javadoc-break.
## What changes were proposed in this pull request?
1. Check the behavior with illegal `quantiles` and `relativeError`.
2. Add tests for `relativeError` > 1.
3. Update tests for `null` data.
4. Update some docs for javadoc8.
## How was this patch tested?
Local test in spark-shell.
Author: Zheng RuiFeng <[email protected]>
Author: Ruifeng Zheng <[email protected]>
Closes #16776 from zhengruifeng/fix_approxQuantile.
…et in dev/mima
## What changes were proposed in this pull request?
Use JAVA_HOME/bin/java if JAVA_HOME is set in the dev/mima script to run MiMa. This follows on #16871; it's a slightly separate issue, but it is currently causing a build failure.
## How was this patch tested?
Manually tested.
Author: Sean Owen <[email protected]>
Closes #16957 from srowen/SPARK-19550.2.
## What changes were proposed in this pull request?
If a new option `wholeFile` is set to `true`, the JSON reader will parse each file (instead of a single line) as a value. This is done with Jackson streaming and should be capable of parsing very large documents, assuming each row fits in memory.
Because the file is not buffered in memory, the corrupt record handling is also slightly different when `wholeFile` is enabled: the corrupt column will contain the filename instead of the literal JSON if there is a parsing failure. It would be easy to extend this to add the parser location (line, column and byte offsets) to the output if desired.
These changes have allowed types other than `String` to be parsed. Support for `UTF8String` and `Text` has been added (alongside `String` and `InputFormat`), and they no longer require a conversion to `String` just for parsing.
I've also included a few other changes that generate slightly better bytecode and (imo) make it more obvious when and where boxing is occurring in the parser. These are included as separate commits; let me know if they should be flattened into this PR or moved to a new one.
## How was this patch tested?
New and existing unit tests. No performance or load tests have been run.
Author: Nathan Howell <[email protected]>
Closes #16386 from NathanHowell/SPARK-18352.
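A sketch of the reader-side usage, not taken from the PR itself (the option name follows this PR; later Spark releases expose the same behavior as `multiLine`, and the path is hypothetical):
```scala
// Parse each input file as one JSON value instead of one JSON object per line.
val df = spark.read
  .option("wholeFile", true)
  .json("/path/to/json/docs")  // hypothetical path

// On a parse failure, the corrupt-record column carries the filename rather
// than the literal JSON text, since the whole file is the record.
```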
…methods for DataFrameWriter methods
## What changes were proposed in this pull request?
We only notify `QueryExecutionListener` for several `Dataset` operations, e.g. collect, take, etc. We should also do the notification for `DataFrameWriter` operations.
## How was this patch tested?
New regression test.
close #16664
Author: Wenchen Fan <[email protected]>
Closes #16962 from cloud-fan/insert.
## What changes were proposed in this pull request?
This PR proposes to close stale PRs. What I mean by "stale" here is that there are review comments from reviewers, but the author has looked inactive, without any answer to them, for more than a month. I left some comments roughly a week ago to ping, and the authors still look inactive in the PRs below.
These include some PRs suggested to be closed and a PR against another branch, which seems obviously inappropriate. Given the comments in the last three PRs below, they are probably worth being taken over by anyone who is interested.
Closes #7963
Closes #8374
Closes #11192
Closes #11374
Closes #11692
Closes #12243
Closes #12583
Closes #12620
Closes #12675
Closes #12697
Closes #12800
Closes #13715
Closes #14266
Closes #15053
Closes #15159
Closes #15209
Closes #15264
Closes #15267
Closes #15871
Closes #15861
Closes #16319
Closes #16324
Closes #16890
Closes #12398
Closes #12933
Closes #14517
## How was this patch tested?
N/A
Author: hyukjinkwon <[email protected]>
Closes #16937 from HyukjinKwon/stale-prs-close.
I wrote two SQL queries. The first ORDER BY result is wrong, but the second ORDER BY result is right. This may be a bug?
```
-- sql 1
select amtlong as yasname, usernick, count() as cnt, sum(amtdouble) as amt from ydb_import_txt group by usernick, amtlong
order by amt desc, cnt, nick, amtlong limit 230

220@ 9189 奚鸿煊 1 99.97
221@ 7105 奚鸿煊 1 99.97

-- sql 2
select amtlong as yasname, usernick, count() as cnt, sum(amtdouble) as amt from ydb_import_txt group by usernick, amtlong
order by amt desc, cnt, nick, yasname limit 230

220@ 7105 奚鸿煊 1 99.97
221@ 9189 奚鸿煊 1 99.97
```
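A minimal self-contained sketch of the reported pattern (hypothetical data; `count(*)` is substituted for the report's `count()`, and `usernick` for `nick`, since no `nick` column appears in the select list):
```scala
import spark.implicits._  // assumes an active SparkSession named spark

val df = Seq((9189L, "u1", 99.97), (7105L, "u1", 99.97))
  .toDF("amtlong", "usernick", "amtdouble")
df.createOrReplaceTempView("ydb_import_txt")

// Order by the source column name...
spark.sql("""select amtlong as yasname, usernick, count(*) as cnt, sum(amtdouble) as amt
             from ydb_import_txt group by usernick, amtlong
             order by amt desc, cnt, usernick, amtlong limit 230""").show()

// ...and by its alias; per the report, the two orderings disagree on the
// tie-breaking column even though they name the same underlying values.
spark.sql("""select amtlong as yasname, usernick, count(*) as cnt, sum(amtdouble) as amt
             from ydb_import_txt group by usernick, amtlong
             order by amt desc, cnt, usernick, yasname limit 230""").show()
```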