Conversation

@HyukjinKwon HyukjinKwon commented Oct 10, 2018

What changes were proposed in this pull request?

This PR proposes to avoid creating a read support instance and reading the schema on the write path when writing with other save modes.

5fef6e3 happened to create a read support instance in the write path, which ended up reading the schema from that read support during writes.

This breaks the `spark.range(1).write.format("source").save("non-existent-path")` case, since there is no way to read a schema from "non-existent-path".

See also #22009 (comment)

How was this patch tested?

Unit test and manual tests.

xuejianbest and others added 30 commits September 6, 2018 07:17
…alignment problem

This is not a perfect solution; it is designed to solve the problem while keeping the added complexity to a minimum.

It is effective for English, Chinese characters, Japanese, Korean and so on.

```
before:
+---+---------------------------+-------------+
|id |中国                         |s2           |
+---+---------------------------+-------------+
|1  |ab                         |[a]          |
|2  |null                       |[中国, abc]    |
|3  |ab1                        |[hello world]|
|4  |か行 きゃ(kya) きゅ(kyu) きょ(kyo) |[“中国]        |
|5  |中国(你好)a                    |[“中(国), 312] |
|6  |中国山(东)服务区                  |[“中(国)]      |
|7  |中国山东服务区                    |[中(国)]       |
|8  |                           |[中国]         |
+---+---------------------------+-------------+

after:
+---+-----------------------------------+----------------+
|id |中国                               |s2              |
+---+-----------------------------------+----------------+
|1  |ab                                 |[a]             |
|2  |null                               |[中国, abc]     |
|3  |ab1                                |[hello world]   |
|4  |か行 きゃ(kya) きゅ(kyu) きょ(kyo) |[“中国]         |
|5  |中国(你好)a                      |[“中(国), 312]|
|6  |中国山(东)服务区                   |[“中(国)]      |
|7  |中国山东服务区                     |[中(国)]        |
|8  |                                   |[中国]          |
+---+-----------------------------------+----------------+
```

## What changes were proposed in this pull request?

When the data contains wide characters such as Chinese, Japanese, or Korean characters, the show method has an alignment problem.
This patch tries to fix that problem.

## How was this patch tested?

Manually tested; see the screenshot below.

![image](https://user-images.githubusercontent.com/13044869/44250564-69f6b400-a227-11e8-88b2-6cf6960377ff.png)


Closes apache#22048 from xuejianbest/master.

Authored-by: xuejianbest <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…ouping key in group aggregate pandas UDF

## What changes were proposed in this pull request?

This PR proposes to add another example for multiple grouping keys in a group aggregate pandas UDF, since this feature can still confuse users.

## How was this patch tested?

Manually tested and documentation built.

Closes apache#22329 from HyukjinKwon/SPARK-25328.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
…lization Exception

## What changes were proposed in this pull request?
mapValues in Scala returns a lazy view that is not serializable. To avoid the serialization issue while running PageRank, we need to use map instead of mapValues.
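
For reference, a minimal sketch of the underlying Scala pitfall (plain Scala, not the actual GraphX PageRank code; `weights` is a made-up map):

```scala
// Illustration only: `weights` stands in for any Map transformed inside a closure.
val weights: Map[String, Double] = Map("a" -> 1.0, "b" -> 2.0)

// mapValues returns a lazy view in Scala 2.11/2.12 that is NOT serializable,
// so capturing it in a closure that is shipped to executors fails at runtime.
val nonSerializable = weights.mapValues(_ * 2)

// map materializes a plain, serializable Map and is safe to capture in closures.
val serializable = weights.map { case (k, v) => k -> v * 2 }
```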


Closes apache#22271 from shahidki31/master_latest.

Authored-by: Shahid <[email protected]>
Signed-off-by: Joseph K. Bradley <[email protected]>
## What changes were proposed in this pull request?

Add value length check in `_create_row`, forbid extra value for custom Row in PySpark.

## How was this patch tested?

New UT in pyspark-sql

Closes apache#22140 from xuanyuanking/SPARK-25072.

Lead-authored-by: liyuanjian <[email protected]>
Co-authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
## What changes were proposed in this pull request?

Currently, when running Spark on Kubernetes, a logger run by the client watches the K8S API for events related to the driver pod and logs them. However, for the container status aspect of the logging, it simply dumps the raw object, which is not human readable, e.g.

![screen shot 2018-08-24 at 10 37 46](https://user-images.githubusercontent.com/2104864/44577799-e0486880-a789-11e8-9ae9-fdeddacbbea8.png)
![screen shot 2018-08-24 at 10 38 14](https://user-images.githubusercontent.com/2104864/44577800-e0e0ff00-a789-11e8-81f5-3bb315dbbdb1.png)

This is despite the fact that the logging class in question actually has methods to pretty print this information but only invokes these at the end of a job.

This PR improves the logging to always use the pretty printing methods, additionally modifying them to include further useful information provided by the K8S API.

A similar issue also exists when tasks are lost; that will be addressed by further commits to this PR.

- [x] Improved `LoggingPodStatusWatcher`
- [x] Improved container status on task failure

## How was this patch tested?

Built and launched jobs with the updated Spark client and observed the new human readable output:

![screen shot 2018-08-24 at 11 09 32](https://user-images.githubusercontent.com/2104864/44579429-5353de00-a78e-11e8-9228-c750af8e6311.png)
![screen shot 2018-08-24 at 11 09 42](https://user-images.githubusercontent.com/2104864/44579430-5353de00-a78e-11e8-8fce-d5bb2a3ae65f.png)
![screen shot 2018-08-24 at 11 10 13](https://user-images.githubusercontent.com/2104864/44579431-53ec7480-a78e-11e8-9fa2-aeabc5b28ec4.png)
![screen shot 2018-08-24 at 17 47 44](https://user-images.githubusercontent.com/2104864/44596922-db090f00-a7c5-11e8-910c-bc2339f5a196.png)

Suggested reviewers: liyinan926 mccheah

Author: Rob Vesse <[email protected]>

Closes apache#22215 from rvesse/SPARK-25222.
## What changes were proposed in this pull request?

The default behaviour of Spark on K8S currently is to create `emptyDir` volumes to back `SPARK_LOCAL_DIRS`.  In some environments e.g. diskless compute nodes this may actually hurt performance because these are backed by the Kubelet's node storage which on a diskless node will typically be some remote network storage.

Even if this is enterprise-grade storage connected via a high-speed interconnect, the way Spark uses these directories as scratch space (lots of relatively small, short-lived files) has been observed to cause serious performance degradation. Therefore we would like to provide the option to use K8S's ability to instead back these `emptyDir` volumes with `tmpfs`. This PR adds a configuration option that enables `SPARK_LOCAL_DIRS` to be backed by memory-backed `emptyDir` volumes rather than the default.

Documentation is added to describe both the default behaviour and this new option and its implications, one of which is that scratch space then counts towards your pod's memory limits, so users will need to adjust their memory requests accordingly.

*NB* - This is an alternative version of PR apache#22256 reduced to just the `tmpfs` piece
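
A sketch of how the option might be set; the property names below are assumptions, so verify them against the documentation added by this PR:

```scala
import org.apache.spark.SparkConf

// Assumed configuration keys -- check the K8S docs added in this PR.
val conf = new SparkConf()
  .set("spark.kubernetes.local.dirs.tmpfs", "true")     // back SPARK_LOCAL_DIRS with tmpfs
  // tmpfs scratch space counts towards the pod memory limit, so budget for it:
  .set("spark.kubernetes.memoryOverheadFactor", "0.3")  // hypothetical value
```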

## How was this patch tested?

Ran with this option in our diskless compute environments to verify functionality

Author: Rob Vesse <[email protected]>

Closes apache#22323 from rvesse/SPARK-25262-tmpfs.
## What changes were proposed in this pull request?

This is a follow-up pr of apache#22200.

When casting to decimal type, if `Cast.canNullSafeCastToDecimal()`, overflow won't happen, so we don't need to check the result of `Decimal.changePrecision()`.

## How was this patch tested?

Existing tests.

Closes apache#22352 from ueshin/issues/SPARK-25208/reduce_code_size.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?
How to reproduce the permission issue:
```sh
# build spark
./dev/make-distribution.sh --name SPARK-25330 --tgz  -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn

tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tar && cd spark-2.4.0-SNAPSHOT-bin-SPARK-25330
export HADOOP_PROXY_USER=user_a
bin/spark-sql

export HADOOP_PROXY_USER=user_b
bin/spark-sql
```
```java
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=user_b, access=EXECUTE, inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx------
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
```

The issue was introduced by this commit: apache/hadoop@feb886f. This PR reverts the Hadoop 2.7 dependency to 2.7.3 to avoid the issue.

## How was this patch tested?
Unit tests and manual tests.

Closes apache#22327 from wangyum/SPARK-25330.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…itions parameter

## What changes were proposed in this pull request?

This adds a test following apache#21638

## How was this patch tested?

Existing tests and new test.

Closes apache#22356 from srowen/SPARK-22357.2.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
This PR removes the `updateBytesReadWithFileSize` method in `FileScanRDD` because it computes input metrics from file size, which is only needed for Hadoop 2.5 and earlier. Current Spark no longer supports those versions, so the method produces wrong input metric numbers.

This is rework from apache#22232.

Closes apache#22232

## How was this patch tested?
Added tests in `FileBasedDataSourceSuite`.

Closes apache#22324 from maropu/pr22232-2.

Lead-authored-by: dujunling <[email protected]>
Co-authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…ases of sql/core and sql/hive

## What changes were proposed in this pull request?
In SharedSparkSession and TestHive, we need to disable the rule ConvertToLocalRelation for better test case coverage.
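
For reference, a minimal sketch of what disabling the rule looks like for a standalone session, using the optimizer exclusion configuration (the actual change wires this into SharedSparkSession and TestHive):

```scala
import org.apache.spark.sql.SparkSession

// Exclude ConvertToLocalRelation so queries over local relations still exercise
// the full physical planning and execution paths in tests.
val spark = SparkSession.builder()
  .master("local[1]")
  .config("spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")
  .getOrCreate()
```
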
## How was this patch tested?
Identified the failures after excluding the `ConvertToLocalRelation` rule.

Closes apache#22270 from dilipbiswal/SPARK-25267-final.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request?
Add test cases for fromString

## How was this patch tested?
N/A

Closes apache#22345 from gatorsmile/addTest.

Authored-by: Xiao Li <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request?
Before Apache Spark 2.3, table properties were ignored when writing data to a Hive table (created with the STORED AS PARQUET/ORC syntax), because the compression configurations were not passed to the FileFormatWriter in hadoopConf. That was fixed in apache#20087. But for CTAS with the USING PARQUET/ORC syntax, table properties were still ignored when convertMetastore was enabled, so the CTAS test case was not supported.

Now that this has been fixed in apache#20522, the test case should be enabled too.
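
A hypothetical sketch of the CTAS shape the re-enabled tests cover (the exact table and property names used by the tests may differ):

```scala
// CTAS with USING PARQUET and a compression table property; before apache#20522 the
// property was ignored when the relation was converted by the metastore.
spark.sql(
  """CREATE TABLE compressed_ctas
    |USING PARQUET
    |TBLPROPERTIES ('parquet.compression' = 'GZIP')
    |AS SELECT 1 AS id""".stripMargin)
```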

## How was this patch tested?
This only re-enables the test cases of previous PR.

Closes apache#22302 from fjh100456/compressionCodec.

Authored-by: fjh100456 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ined names

## What changes were proposed in this pull request?

Add [flake8](http://flake8.pycqa.org) tests to find Python syntax errors and undefined names.

__E901,E999,F821,F822,F823__ are the "_showstopper_" flake8 issues that can halt the runtime with a SyntaxError, NameError, etc. Most other flake8 issues are merely "style violations" -- useful for readability but they do not affect runtime safety.
* F821: undefined name `name`
* F822: undefined name `name` in `__all__`
* F823: local variable name referenced before assignment
* E901: SyntaxError or IndentationError
* E999: SyntaxError -- failed to compile a file into an Abstract Syntax Tree

## How was this patch tested?

```sh
flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
```


Closes apache#22266 from cclauss/patch-3.

Authored-by: cclauss <[email protected]>
Signed-off-by: Holden Karau <[email protected]>
## What changes were proposed in this pull request?

Fix unused imports & outdated comments on `kafka-0-10-sql` module. (Found while I was working on [SPARK-23539](apache#22282))

## How was this patch tested?

Existing unit tests.

Closes apache#22342 from dongjinleekr/feature/fix-kafka-sql-trivials.

Authored-by: Lee Dongjin <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…se in executors REST API

Add new executor level memory metrics (JVM used memory, on/off heap execution memory, on/off heap storage memory, on/off heap unified memory, direct memory, and mapped memory), and expose via the executors REST API. This information will help provide insight into how executor and driver JVM memory is used, and for the different memory regions. It can be used to help determine good values for spark.executor.memory, spark.driver.memory, spark.memory.fraction, and spark.memory.storageFraction.

## What changes were proposed in this pull request?

An ExecutorMetrics class is added, with jvmUsedHeapMemory, jvmUsedNonHeapMemory, onHeapExecutionMemory, offHeapExecutionMemory, onHeapStorageMemory, and offHeapStorageMemory, onHeapUnifiedMemory, offHeapUnifiedMemory, directMemory and mappedMemory. The new ExecutorMetrics is sent by executors to the driver as part of the Heartbeat. A heartbeat is added for the driver as well, to collect these metrics for the driver.

The EventLoggingListener stores information about the peak values for each metric, per active stage and executor. When a StageCompleted event is seen, a StageExecutorsMetrics event will be logged for each executor, with peak values for the stage.

The AppStatusListener records the peak values for each memory metric.

The new memory metrics are added to the executors REST API.
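
A quick way to inspect the new fields once an application is running (a sketch; the host, port, and `<app-id>` placeholder are assumptions to replace with real values):

```scala
// Fetch the executors endpoint of the REST API; the response now includes the
// peak memory metrics alongside the existing executor fields.
val url = "http://localhost:4040/api/v1/applications/<app-id>/executors"
val json = scala.io.Source.fromURL(url).mkString
println(json)
```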

## How was this patch tested?

New unit tests have been added. This was also tested on our cluster.

Author: Edwina Lu <[email protected]>
Author: Imran Rashid <[email protected]>
Author: edwinalu <[email protected]>

Closes apache#21221 from edwinalu/SPARK-23429.2.
## What changes were proposed in this pull request?

This took me a while to debug and figure out. It looks like we should at least leave a debug log showing which SQL text will be used for a view.

Here's how I got there:

**Hive:**

```
CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address;
CREATE DATABASE d100;
CREATE FUNCTION d100.udf100 AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper';
CREATE VIEW testview AS SELECT d100.udf100(name) FROM default.emp;
```

**Spark:**

```
sql("SELECT * FROM testview").show()
```

```
scala> sql("SELECT * FROM testview").show()
org.apache.spark.sql.AnalysisException: Undefined function: 'd100.udf100'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
```

Under the hood, it actually makes sense since the view is defined as `SELECT d100.udf100(name) FROM default.emp;` and Hive API:

```
org.apache.hadoop.hive.ql.metadata.Table.getViewExpandedText()
```

This returns a wrongly qualified SQL string for the view as below:

```
SELECT `d100.udf100`(`emp`.`name`) FROM `default`.`emp`
```

which works fine in Hive but not in Spark.

## How was this patch tested?

Manually:

```
18/09/06 19:32:48 DEBUG HiveSessionCatalog: 'SELECT `d100.udf100`(`emp`.`name`) FROM `default`.`emp`' will be used for the view(testview).
```

Closes apache#22351 from HyukjinKwon/minor-debug.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
## What changes were proposed in this pull request?

Deprecate public APIs from ImageSchema.

## How was this patch tested?

N/A

Closes apache#22349 from WeichenXu123/image_api_deprecate.

Authored-by: WeichenXu <[email protected]>
Signed-off-by: Xiangrui Meng <[email protected]>
…UDFSuite

## What changes were proposed in this pull request?

At Spark 2.0.0, SPARK-14335 added some [commented-out test coverage](https://github.com/apache/spark/pull/12117/files#diff-dd4b39a56fac28b1ced6184453a47358R177). This PR enables those tests because the feature has been supported since 2.0.0.

## How was this patch tested?

Pass the Jenkins with re-enabled test coverage.

Closes apache#22363 from dongjoon-hyun/SPARK-25375.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
…expressions.

## What changes were proposed in this pull request?

Add new optimization rule to eliminate unnecessary shuffling by flipping adjacent Window expressions.
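
An illustrative query shape this rule targets (the column names and data below are made up):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = spark.range(100).selectExpr("id % 2 AS a", "id % 5 AS b", "id AS c")

// w1 partitions by a superset of w2's keys; evaluating them in the wrong order
// forces an extra exchange, which the new rule avoids by flipping the adjacent
// Window operators when it is safe to do so.
val w1 = Window.partitionBy("a", "b").orderBy("c")
val w2 = Window.partitionBy("a").orderBy("c")

val result = df
  .withColumn("r1", row_number().over(w1))
  .withColumn("r2", row_number().over(w2))
result.explain()
```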

## How was this patch tested?

Tested with unit tests, integration tests, and manual tests.

Closes apache#17899 from ptkool/adjacent_window_optimization.

Authored-by: ptkool <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request?

Add spark.executor.pyspark.memory limit for K8S

## How was this patch tested?

Unit and Integration tests

Closes apache#22298 from ifilonenko/SPARK-25021.

Authored-by: Ilan Filonenko <[email protected]>
Signed-off-by: Holden Karau <[email protected]>
## What changes were proposed in this pull request?

When running TPC-DS benchmarks on the 2.4 release, npoggi and winglungngai saw more than 10% performance regression on the following queries: q67, q24a and q24b. After applying PR apache#22338, the performance regression still exists. When the changes in apache#19222 are reverted, npoggi and winglungngai found the performance regression was resolved. Thus, this PR reverts the related changes to unblock the 2.4 release.

In a future release, we can still continue the investigation and find out the root cause of the regression.

## How was this patch tested?

The existing test cases

Closes apache#22361 from gatorsmile/revertMemoryBlock.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

Remove the `BisectingKMeansModel.setDistanceMeasure` method.
Setting this param on `BisectingKMeansModel` is meaningless.

## How was this patch tested?

N/A

Closes apache#22360 from WeichenXu123/bkmeans_update.

Authored-by: WeichenXu <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
How to reproduce:
```scala
val df1 = spark.createDataFrame(Seq(
   (1, 1)
)).toDF("a", "b").withColumn("c", lit(null).cast("int"))
val df2 = df1.union(df1).withColumn("d", spark_partition_id).filter($"c".isNotNull)
df2.show

+---+---+----+---+
|  a|  b|   c|  d|
+---+---+----+---+
|  1|  1|null|  0|
|  1|  1|null|  1|
+---+---+----+---+
```
`filter($"c".isNotNull)` was transformed to `(null <=> c#10)` before apache#19201, but it is transformed to `(c#10 = null)` since apache#20155. This pr revert it to `(null <=> c#10)` to fix this issue.

## How was this patch tested?

unit tests

Closes apache#22368 from wangyum/SPARK-25368.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
… for ORC native data source table persisted in metastore

## What changes were proposed in this pull request?
Apache Spark doesn't create Hive tables with duplicated fields in either case-sensitive or case-insensitive mode. However, if Spark first creates ORC files in case-sensitive mode and a Hive table is then created on top of that location, field resolution should fail in case-insensitive mode, because otherwise we don't know which columns will be returned or filtered. Previously, SPARK-25132 fixed the same issue for Parquet.

Here is a simple example:

```
val data = spark.range(5).selectExpr("id as a", "id * 2 as A")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

sql("CREATE TABLE orc_data_source (A LONG) USING orc LOCATION '/user/hive/warehouse/orc_data'")
spark.conf.set("spark.sql.caseSensitive", false)
sql("select A from orc_data_source").show
+---+
|  A|
+---+
|  3|
|  2|
|  4|
|  1|
|  0|
+---+
```

See apache#22148 for more details about parquet data source reader.

## How was this patch tested?
Unit tests added.

Closes apache#22262 from seancxmao/SPARK-25175.

Authored-by: seancxmao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ema in Parquet issue

## What changes were proposed in this pull request?

How to reproduce:
```scala
spark.sql("CREATE TABLE tbl(id long)")
spark.sql("INSERT OVERWRITE TABLE tbl VALUES 4")
spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql(s"INSERT OVERWRITE LOCAL DIRECTORY '/tmp/spark/parquet' " +
  "STORED AS PARQUET SELECT ID FROM view1")
spark.read.parquet("/tmp/spark/parquet").schema
scala> spark.read.parquet("/tmp/spark/parquet").schema
res10: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,true))
```
The schema should be `StructType(StructField(ID,LongType,true))` as we `SELECT ID FROM view1`.

This PR fixes the issue.

## How was this patch tested?

unit tests

Closes apache#22359 from wangyum/SPARK-25313-FOLLOW-UP.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

This PR improves the code generated by the fast hash map so that it does not request a new block of memory for every new entry, because the UnsafeRow's memory can be reused.

## How was this patch tested?

The existing test cases.

Closes apache#21968 from heary-cao/updateNewMemory.

Authored-by: caoxuewen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ctType to a DDL string

## What changes were proposed in this pull request?
Add the version number for the new APIs.

## How was this patch tested?
N/A

Closes apache#22377 from gatorsmile/followup24849.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…plan appears in the query

## What changes were proposed in this pull request?

In the Planner, we collect the placeholders which need to be substituted in the query execution plan and, once we plan them, we substitute each placeholder with the effective plan.

In this second phase, we rely on the `==` comparison, i.e. the `equals` method. This means that if two placeholder plans - which are different instances - have the same attributes (so that they are equal according to `equals`), they are both substituted with the same new physical plan. So, in such a situation, the first time we substitute both of them with the first of the two newly generated plans, and the second time we substitute nothing.

This is usually of no harm for the execution of the query itself, as the 2 plans are identical. But since they are the same instance, now, the local variables are shared (which is unexpected). This causes issues for the metrics collected, as the same node is executed 2 times, so the metrics are accumulated 2 times, wrongly.

The PR proposes to use the `eq` method when checking which placeholder needs to be substituted; thus, in the previous situation, both of the two different physical nodes which are created (one for each time the logical plan appears in the query plan) are actually used, and the metrics are collected properly for each of them.
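
For clarity, the distinction the fix relies on, shown in plain Scala with a simplified stand-in for the planner's placeholder plans:

```scala
// Case classes get structural equality, like Catalyst plan nodes do.
case class Placeholder(output: Seq[String])

val p1 = Placeholder(Seq("a"))
val p2 = Placeholder(Seq("a"))

p1 == p2  // true:  structurally equal, so both placeholders matched before the fix
p1 eq p2  // false: reference identity distinguishes the two instances, as intended
```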

## How was this patch tested?

added UT

Closes apache#22284 from mgaido91/SPARK-25278.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…lar with arrow udfs

## What changes were proposed in this pull request?

Clarify docstring for Scalar functions

## How was this patch tested?

Adds a unit test showing usage similar to wordcount; there is an existing unit test for arrays of floats as well.

Closes apache#20908 from holdenk/SPARK-23672-document-support-for-nested-return-types-in-scalar-with-arrow-udfs.

Authored-by: Holden Karau <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
HyukjinKwon and others added 12 commits October 8, 2018 15:47
…hon data and SQL types in normal UDFs

### What changes were proposed in this pull request?

We are facing some problems about type conversions between Python data and SQL types in UDFs (Pandas UDFs as well).
It's even difficult to identify the problems (see apache#20163 and apache#22610).

This PR aims to internally document the type conversion table. Some of the conversions look buggy and we should fix them.

```python
import sys
import array
import datetime
from decimal import Decimal

from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import udf

if sys.version >= '3':
    long = int

data = [
    None,
    True,
    1,
    long(1),
    "a",
    u"a",
    datetime.date(1970, 1, 1),
    datetime.datetime(1970, 1, 1, 0, 0),
    1.0,
    array.array("i", [1]),
    [1],
    (1,),
    bytearray([65, 66, 67]),
    Decimal(1),
    {"a": 1},
    Row(kwargs=1),
    Row("namedtuple")(1),
]

types =  [
    BooleanType(),
    ByteType(),
    ShortType(),
    IntegerType(),
    LongType(),
    StringType(),
    DateType(),
    TimestampType(),
    FloatType(),
    DoubleType(),
    ArrayType(IntegerType()),
    BinaryType(),
    DecimalType(10, 0),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
]

df = spark.range(1)
results = []
count = 0
total = len(types) * len(data)
spark.sparkContext.setLogLevel("FATAL")
for t in types:
    result = []
    for v in data:
        try:
            row = df.select(udf(lambda: v, t)()).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n  Python Value: [%s(%s)]\n  Result Python Value: [%s]" % (
            t.simpleString(), str(v), type(v).__name__, ret_str)
        count += 1
        print("%s/%s:\n  %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))

schema = ["SQL Type \\ Python Value(Type)"] + list(map(lambda v: "%s(%s)" % (str(v), type(v).__name__), data))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False)
print("\n".join(map(lambda line: "    # %s  # noqa" % line, strings.strip().split("\n"))))
```

This table was generated under Python 2 but the code above is Python 3 compatible as well.

## How was this patch tested?

Manually tested and lint check.

Closes apache#22655 from HyukjinKwon/SPARK-25666.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
While working on another PR, I noticed that there is quite some legacy Java in there that can be beautified. For example the use of features from Java8, such as:
- Collection libraries
- Try-with-resource blocks

No logic has been changed. I think it is important to have a solid codebase with examples that will inspire future PRs to follow the best practices.

What are your thoughts on this?

This makes the code easier to read, and using try-with-resources makes it less likely that we forget to close something.

## What changes were proposed in this pull request?

No changes in the logic of Spark, but more in the aesthetics of the code.

## How was this patch tested?

Using the existing unit tests. Since no logic is changed, the existing unit tests should pass.


Closes apache#22637 from Fokko/SPARK-25408.

Authored-by: Fokko Driesprong <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…sPercent default to 100

## What changes were proposed in this pull request?

We want to change the default percentage for spark.shuffle.server.chunkFetchHandlerThreadsPercent to 100. Currently it is set to 0, which means that if server.ioThreads > 0, the default number of threads would be 2 * #cores instead of server.ioThreads. We want the default to be server.ioThreads when this is not set at all; a default of 0 would also mean 2 * #cores.

## How was this patch tested?
Manual


Closes apache#22628 from redsanket/SPARK-25641.

Lead-authored-by: Sanket Chintapalli <[email protected]>
Co-authored-by: Sanket Chintapalli <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
## What changes were proposed in this pull request?
There was 5 suites extends `HadoopFsRelationTest`,  for testing "orc"/"parquet"/"text"/"json" data sources.
This PR refactor the base trait `HadoopFsRelationTest`:
1. Rename unnecessary loop for setting parquet conf
2. The test case `SPARK-8406: Avoids name collision while writing files` takes about 14 to 20 seconds. Since all the file format data sources now share common code for creating result files, we can test only one data source (Parquet) to reduce test time.

To run related 5 suites:
```
./build/sbt "hive/testOnly *HadoopFsRelationSuite"
```
The total test run time is reduced from 5 minutes 40 seconds to 3 minutes 50 seconds.

## How was this patch tested?

Unit test

Closes apache#22643 from gengliangwang/refactorHadoopFsRelationTest.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
…ultiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch (see the Scala sketch after this list)
- Multiple watermark policy
- The semantics of what changes are allowed to a streaming query between restarts.
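
A minimal Scala `foreachBatch` sketch along the lines of the new docs (`streamingDF` is assumed to be an existing streaming Dataset, and the paths are placeholders):

```scala
import org.apache.spark.sql.DataFrame

streamingDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Reuse an existing batch sink inside a streaming query.
    batchDF.write.mode("append").parquet(s"/tmp/output/batch=$batchId")
  }
  .option("checkpointLocation", "/tmp/checkpoints/foreach-batch-demo")
  .start()
```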

## How was this patch tested?
No tests

Closes apache#22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <[email protected]>
Signed-off-by: Tathagata Das <[email protected]>
…ateDataFrame in PySpark

## What changes were proposed in this pull request?

This PR proposes to specify type inference and simple e2e tests. Looks we are not cleanly testing those logics.

For instance, see https://github.com/apache/spark/blob/08c76b5d39127ae207d9d1fff99c2551e6ce2581/python/pyspark/sql/types.py#L894-L905

It looks like we intended to support datetime.time and None for type inference too, but it does not work:

```
>>> spark.createDataFrame([[datetime.time()]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 751, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/.../spark/python/pyspark/sql/session.py", line 432, in _createFromLocal
    data = [schema.toInternal(row) for row in data]
  File "/.../spark/python/pyspark/sql/types.py", line 604, in toInternal
    for f, v, c in zip(self.fields, obj, self._needConversion))
  File "/.../spark/python/pyspark/sql/types.py", line 604, in <genexpr>
    for f, v, c in zip(self.fields, obj, self._needConversion))
  File "/.../spark/python/pyspark/sql/types.py", line 442, in toInternal
    return self.dataType.toInternal(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 193, in toInternal
    else time.mktime(dt.timetuple()))
AttributeError: 'datetime.time' object has no attribute 'timetuple'
>>> spark.createDataFrame([[None]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 751, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/.../spark/python/pyspark/sql/session.py", line 419, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/.../python/pyspark/sql/session.py", line 353, in _inferSchemaFromList
    raise ValueError("Some of types cannot be determined after inferring")
ValueError: Some of types cannot be determined after inferring
```
## How was this patch tested?

Manual tests and unit tests were added.

Closes apache#22653 from HyukjinKwon/SPARK-25659.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
…isticRegressionSuite

...with intercept with L1 regularization

## What changes were proposed in this pull request?

In the test, "multinomial logistic regression with intercept with L1 regularization" in the "LogisticRegressionSuite", taking more than a minute due to training of 2 logistic regression model.
However after analysing the training cost over iteration, we can reduce the computation time by 50%.
Training cost vs iteration for model 1
![image](https://user-images.githubusercontent.com/23054875/46573805-ddab7680-c9b7-11e8-9ee9-63a99d498475.png)

So, model1 is converging after iteration 150.

Training cost vs iteration for model 2

![image](https://user-images.githubusercontent.com/23054875/46573790-b3f24f80-c9b7-11e8-89c0-81045ad647cb.png)

After around 100 iterations, model2 converges.
So, if we set the maximum iterations for model1 and model2 to 175 and 125 respectively, we can reduce the computation time by half.
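
A sketch of the change in spirit; the estimator settings below are illustrative rather than the exact test parameters:

```scala
import org.apache.spark.ml.classification.LogisticRegression

// Cap the iterations near the observed convergence points instead of a much
// larger maximum, cutting training time roughly in half.
val trainer1 = new LogisticRegression()
  .setFamily("multinomial")
  .setFitIntercept(true)
  .setElasticNetParam(1.0)   // L1 regularization
  .setRegParam(0.05)         // hypothetical strength
  .setMaxIter(175)           // model1 converges around iteration 150

val trainer2 = new LogisticRegression()
  .setFamily("multinomial")
  .setFitIntercept(true)
  .setElasticNetParam(1.0)
  .setRegParam(0.05)
  .setMaxIter(125)           // model2 converges around iteration 100
```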

## How was this patch tested?
Computation time in a local setup:
- Before change: ~53 sec
- After change: ~26 sec


Closes apache#22659 from shahidki31/SPARK-25623.

Authored-by: Shahid <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?

Currently the first row of a dataset of CSV strings is compared to the field names of the user-specified or inferred schema regardless of whether a CSV header is present. This causes false-positive error messages. For example, parsing `"1,2"` outputs the error:

```java
java.lang.IllegalArgumentException: CSV header does not conform to the schema.
 Header: 1, 2
 Schema: _c0, _c1
Expected: _c0 but found: 1
```

In the PR, I propose:
- Checking the CSV header only when it exists
- Filtering the header from the input dataset only when it exists
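
A reproduction sketch of the false positive described above (headerless data with an explicit schema; exact options may vary):

```scala
import org.apache.spark.sql.types._
import spark.implicits._

val csvLines = Seq("1,2").toDS()          // no header row in the data
val schema = new StructType()
  .add("_c0", IntegerType)
  .add("_c1", IntegerType)

// Before the fix, the first data row ("1,2") was checked against the schema's
// field names as if it were a header, producing the error shown above.
val df = spark.read
  .schema(schema)
  .option("enforceSchema", false)         // ask Spark to validate the header strictly
  .csv(csvLines)
df.show()
```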

## How was this patch tested?

Added a test to `CSVSuite` which reproduces the issue.

Closes apache#22656 from MaxGekk/inferred-header-check.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
…not consume all the inputs

## What changes were proposed in this pull request?

This PR is inspired by apache#22524, but proposes a safer fix.

The current limit whole stage codegen has 2 problems:
1. It's only applied to `InputAdapter`, many leaf nodes can't stop earlier w.r.t. limit.
2. It needs to override a method, which will break if we have more than one limit in the whole-stage.

The first problem is easy to fix, just figure out which nodes can stop earlier w.r.t. limit, and update them. This PR updates `RangeExec`, `ColumnarBatchScan`, `SortExec`, `HashAggregateExec`.

The second problem is hard to fix. This PR proposes to propagate the limit counter variable name upstream, so that the upstream leaf/blocking nodes can check the limit counter and quit the loop earlier.

For better performance, the implementation here follows `CodegenSupport.needStopCheck`, so that we codegen the check only if there is a limit in the query. For a columnar node like range, we check the limit counter per batch instead of per row, to keep the inner loop tight and fast.

Why this is safer?
1. the leaf/blocking nodes don't have to check the limit counter and stop earlier. It's only for performance. (this is same as before)
2. The blocking operators can stop propagating the limit counter name, because the counter of limit after blocking operators will never increase, before blocking operators consume all the data from upstream operators. So the upstream operators don't care about limit after blocking operators. This is also for performance only, it's OK if we forget to do it for some new blocking operators.
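
A user-level example of the kind of query that benefits (illustrative only):

```scala
import spark.implicits._

// With this change, codegen'd leaf nodes such as Range also consult the limit
// counter and can stop producing rows early.
val firstFive = spark.range(1000000000L)
  .select(($"id" * 2).as("doubled"))
  .limit(5)
  .collect()
```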

## How was this patch tested?

a new test

Closes apache#22630 from cloud-fan/limit.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Kazuaki Ishizaki <[email protected]>
It would be nice to have a field in the Stage page of the UI which shows the mapping from the current stage id to the job ids to which that stage belongs.

## What changes were proposed in this pull request?

Added a field in Stage UI to display the corresponding job id for that particular stage.

## How was this patch tested?

<img width="448" alt="screen shot 2018-07-25 at 1 33 07 pm" src="https://user-images.githubusercontent.com/22228190/43220447-a8e94f80-900f-11e8-8a20-a235bbd5a369.png">

Closes apache#21809 from pgandhi999/SPARK-24851.

Authored-by: pgandhi <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
The commons-crypto library does some questionable error handling internally,
which can lead to JVM crashes if some call into native code fails and cleans
up state it should not.

While the library is not fixed, this change adds some workarounds in Spark code
so that when an error is detected in the commons-crypto side, Spark avoids
calling into the library further.

Tested with existing and added unit tests.

Closes apache#22557 from vanzin/SPARK-25535.

Authored-by: Marcelo Vanzin <[email protected]>
Signed-off-by: Imran Rashid <[email protected]>
…ts in Parquet

## What changes were proposed in this pull request?
This is a follow up of apache#22574. Renamed the parameter and added comments.

## How was this patch tested?
N/A

Closes apache#22679 from gatorsmile/followupSPARK-25559.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: DB Tsai <[email protected]>

SparkQA commented Oct 10, 2018

Test build #97191 has finished for PR 22686 at commit 54e59d5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-25700][SQL] Partially revert append mode support in Data Source V2 [WIP][SPARK-25700][SQL] Partially revert append mode support in Data Source V2 Oct 10, 2018

SparkQA commented Oct 10, 2018

Test build #97192 has finished for PR 22686 at commit d8de827.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 10, 2018

Test build #97195 has finished for PR 22686 at commit 227920a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-25700][SQL] Partially revert append mode support in Data Source V2 [SPARK-25700][SQL] Partially revert append mode support in Data Source V2 Oct 10, 2018

HyukjinKwon commented Oct 10, 2018

Closing this per #22697

@HyukjinKwon HyukjinKwon reopened this Oct 11, 2018
@HyukjinKwon HyukjinKwon changed the base branch from master to branch-2.4 October 11, 2018 12:28
@HyukjinKwon HyukjinKwon changed the base branch from branch-2.4 to master October 11, 2018 12:33

SparkQA commented Oct 11, 2018

Test build #97257 has finished for PR 22686 at commit c350119.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SchemaReadAttemptException(m: String) extends RuntimeException(m)

@HyukjinKwon HyukjinKwon deleted the append-revert branch October 16, 2018 12:41