[SPARK-13309][SQL] Fix type inference issue with CSV data #2

anandab · 2016-02-29T08:37:13Z

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Fix type inference issue for sparse CSV data - https://issues.apache.org/jira/browse/SPARK-13309

Author: Rahul Tanwani [email protected]

Closes apache#11194 from tanwanirahul/master.

Fix type inference issue for sparse CSV data - https://issues.apache.org/jira/browse/SPARK-13309 Author: Rahul Tanwani <[email protected]> Closes #11194 from tanwanirahul/master.

…s default parameters consistent in Scala and Python ## What changes were proposed in this pull request? * The default value of ```regParam``` of PySpark MLlib ```LogisticRegressionWithLBFGS``` should be consistent with Scala which is ```0.0```. (This is also consistent with ML ```LogisticRegression```.) * BTW, if we use a known updater(L1 or L2) for binary classification, ```LogisticRegressionWithLBFGS``` will call the ML implementation. We should update the API doc to clarifying ```numCorrections``` will have no effect if we fall into that route. * Make a pass for all parameters of ```LogisticRegressionWithLBFGS```, others are set properly. cc mengxr dbtsai ## How was this patch tested? No new tests, it should pass all current tests. Author: Yanbo Liang <[email protected]> Closes #11424 from yanboliang/spark-13545.

…anager in local mode Author: Jeff Zhang <[email protected]> Closes #10914 from zjffdu/SPARK-12994.

…sistent format Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the regression module. Also, updated 2 params in classification to read as `Supported values:` to be consistent. closes #10600 Author: vijaykiran <[email protected]> Author: Bryan Cutler <[email protected]> Closes #11404 from BryanCutler/param-desc-consistent-regression-SPARK-12633.

## What changes were proposed in this pull request? Now by default, it shows as ascending order of appId. We might prefer to display as descending order by default, which will show the latest application at the top. ## How was this patch tested? Manual tested. See screenshot below: ![desc-sort](https://cloud.githubusercontent.com/assets/11683054/13307473/102f4cf8-db31-11e5-8dd5-391edbf32f0d.png) Author: zhuol <[email protected]> Closes #11357 from zhuoliu/13481.

…ociationRulesSuite JIRA: https://issues.apache.org/jira/browse/SPARK-13506 ## What changes were proposed in this pull request? just chang R Snippet Comment in AssociationRulesSuite ## How was this patch tested? unit test passsed Author: Zheng RuiFeng <[email protected]> Closes #11387 from zhengruifeng/ars.

… as Dataset element type ## What changes were proposed in this pull request? Nested classes defined within Scala objects are translated into Java static nested classes. Unlike inner classes, they don't need outer scopes. But the analyzer still thinks that an outer scope is required. This PR fixes this issue simply by checking whether a nested class is static before looking up its outer scope. ## How was this patch tested? A test case is added to `DatasetSuite`. It checks contents of a Dataset whose element type is a nested class declared in a Scala object. Author: Cheng Lian <[email protected]> Closes #11421 from liancheng/spark-13540-object-as-outer-scope.

… function call https://issues.apache.org/jira/browse/SPARK-13507 https://issues.apache.org/jira/browse/SPARK-13509 ## What changes were proposed in this pull request? This PR adds the support to write CSV data directly by a single call to the given path. Several unitests were added for each functionality. ## How was this patch tested? This was tested with unittests and with `dev/run_tests` for coding style Author: hyukjinkwon <[email protected]> Author: Hyukjin Kwon <[email protected]> Closes #11389 from HyukjinKwon/SPARK-13507-13509.

…gate #### What changes were proposed in this pull request? After analysis by Analyzer, two operators could have alias. They are `Project` and `Aggregate`. So far, we only rewrite and propagate constraints if `Alias` is defined in `Project`. This PR is to resolve this issue in `Aggregate`. #### How was this patch tested? Added a test case for `Aggregate` in `ConstraintPropagationSuite`. marmbrus sameeragarwal Author: gatorsmile <[email protected]> Closes #11422 from gatorsmile/validConstraintsInUnaryNodes.

…eartbeat to driver more than N times ## What changes were proposed in this pull request? Sometimes, network disconnection event won't be triggered for other potential race conditions that we may not have thought of, then the executor will keep sending heartbeats to driver and won't exit. This PR adds a new configuration `spark.executor.heartbeat.maxFailures` to kill Executor when it's unable to heartbeat to the driver more than `spark.executor.heartbeat.maxFailures` times. ## How was this patch tested? unit tests Author: Shixiong Zhu <[email protected]> Closes #11401 from zsxwing/SPARK-13522.

## What changes were proposed in this pull request? Just fixed the log place introduced by #11401 ## How was this patch tested? unit tests. Author: Shixiong Zhu <[email protected]> Closes #11432 from zsxwing/SPARK-13522-follow-up.

## What changes were proposed in this pull request? This PR adds support for implementing whole state codegen for sort. Builds heaving on nongli 's PR: #11008 (which actually implements the feature), and adds the following changes on top: - [x] Generated code updates peak execution memory metrics - [x] Unit tests in `WholeStageCodegenSuite` and `SQLMetricsSuite` ## How was this patch tested? New unit tests in `WholeStageCodegenSuite` and `SQLMetricsSuite`. Further, all existing sort tests should pass. Author: Sameer Agarwal <[email protected]> Author: Nong Li <[email protected]> Closes #11359 from sameeragarwal/sort-codegen.

The Hive client library is not smart enough to notice that the current user is a proxy user; so when using a proxy user, it fails to fetch delegation tokens from the metastore because of a missing kerberos TGT for the current user. To fix it, just run the code that fetches the delegation token as the real logged in user. Tested on a kerberos cluster both submitting normally and with a proxy user; Hive and HBase tokens are retrieved correctly in both cases. Author: Marcelo Vanzin <[email protected]> Closes #11358 from vanzin/SPARK-13478.

…llib.JavaBisectingKMeansExample JIRA: https://issues.apache.org/jira/browse/SPARK-13551 ## What changes were proposed in this pull request? Fix wrong comment and remove meanless lines in mllib.JavaBisectingKMeansExample ## How was this patch tested? manual test Author: Zheng RuiFeng <[email protected]> Closes #11429 from zhengruifeng/mllib_bkm_je.

JIRA: https://issues.apache.org/jira/browse/SPARK-13550 ## What changes were proposed in this pull request? Just add a java example for ml.clustering.BisectingKMeans ## How was this patch tested? manual tests were done. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Zheng RuiFeng <[email protected]> Closes #11428 from zhengruifeng/ml_bkm_je.

## What changes were proposed in this pull request? This patch fixes the problem that pyspark fails on Windows because pyspark can't find ```spark-submit2.cmd```. ## How was this patch tested? manual tests: I ran ```bin\pyspark.cmd``` and checked if pyspark is launched correctly after this patch is applyed. Author: Masayoshi TSUZUKI <[email protected]> Closes #11442 from tsudukim/feature/SPARK-13592.

JIRA: https://issues.apache.org/jira/browse/SPARK-13511 ## What changes were proposed in this pull request? Current limit operator doesn't support wholestage codegen. This is open to add support for it. In the `doConsume` of `GlobalLimit` and `LocalLimit`, we use a count term to count the processed rows. Once the row numbers catches the limit number, we set the variable `stopEarly` of `BufferedRowIterator` newly added in this pr to `true` that indicates we want to stop processing remaining rows. Then when the wholestage codegen framework checks `shouldStop()`, it will stop the processing of the row iterator. Before this, the executed plan for a query `sqlContext.range(N).limit(100).groupBy().sum()` is: TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Final,isDistinct=false)], output=[sum(id)#6L]) +- TungstenAggregate(key=[], functions=[(sum(id#5L),mode=Partial,isDistinct=false)], output=[sum#9L]) +- GlobalLimit 100 +- Exchange SinglePartition, None +- LocalLimit 100 +- Range 0, 1, 1, 524288000, [id#5L] After add wholestage codegen support: WholeStageCodegen : +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Final,isDistinct=false)], output=[sum(id)#41L]) : +- TungstenAggregate(key=[], functions=[(sum(id#40L),mode=Partial,isDistinct=false)], output=[sum#44L]) : +- GlobalLimit 100 : +- INPUT +- Exchange SinglePartition, None +- WholeStageCodegen : +- LocalLimit 100 : +- Range 0, 1, 1, 524288000, [id#40L] ## How was this patch tested? A test is added into BenchmarkWholeStageCodegen. Author: Liang-Chi Hsieh <[email protected]> Closes #11391 from viirya/wholestage-limit.

Estimator for Generalized Linear Models(GLMs) which will be solved by IRLS. cc mengxr Author: Yanbo Liang <[email protected]> Closes #11136 from yanboliang/spark-12811.

Closes #10940 Closes #11302 Closes #11430 Closes #10912

## What changes were proposed in this pull request? This PR defer the resolution from a id of dictionary to value until the column is actually accessed (inside getInt/getLong), this is very useful for those columns and rows that are filtered out. It's also useful for binary type, we will not need to copy all the byte arrays. This PR also change the underlying type for small decimal that could be fit within a Int, in order to use getInt() to lookup the value from IntDictionary. ## How was this patch tested? Manually test TPCDS Q7 with scale factor 10, saw about 30% improvements (after PR #11274). Author: Davies Liu <[email protected]> Closes #11437 from davies/decode_dict.

## What changes were proposed in this pull request? This patch moves tags and unsafe modules into common directory to remove 2 top level non-user-facing directories. ## How was this patch tested? Jenkins should suffice. Author: Reynold Xin <[email protected]> Closes #11426 from rxin/SPARK-13548.

## What changes were proposed in this pull request? Broadcast left semi join without joining keys is already supported in BroadcastNestedLoopJoin, it has the same implementation as LeftSemiJoinBNL, we should remove that. ## How was this patch tested? Updated unit tests. Author: Davies Liu <[email protected]> Closes #11448 from davies/remove_bnl.

… when reading from JDBC datasources. Rows with null values in partition column are not included in the results because none of the partition where clause specify is null predicate on the partition column. This fix adds is null predicate on the partition column to the first JDBC partition where clause. Example: JDBCPartition(THEID < 1 or THEID is null, 0),JDBCPartition(THEID >= 1 AND THEID < 2,1), JDBCPartition(THEID >= 2, 2) Author: sureshthalamati <[email protected]> Closes #11063 from sureshthalamati/nullable_jdbc_part_col_spark-13167.

This is to fix a long-time annoyance: Whenever we add a new algorithm to pyspark.ml, we have to add it to the ```__all__``` list at the top. Since we keep it alphabetized, it often creates a lot more changes than needed. It is also easy to add the Estimator and forget the Model. I'm going to switch it to have one algorithm per line. This also alphabetizes a few out-of-place classes in pyspark.ml.feature. No changes have been made to the moved classes. CC: thunterdb Author: Joseph K. Bradley <[email protected]> Closes #10927 from jkbradley/ml-python-all-list.

## What changes were proposed in this pull request? ``` error] Expected ID character [error] Not a valid command: common (similar: completions) [error] Expected project ID [error] Expected configuration [error] Expected ':' (if selecting a configuration) [error] Expected key [error] Not a valid key: common (similar: commands) [error] common/network-yarn/test ``` `common/network-yarn` is not a valid sbt project, we should change to `network-yarn`. ## How was this patch tested? Locally run the the unit-test. CC rxin , we should either change here, or change the sbt project name. Author: jerryshao <[email protected]> Closes #11456 from jerryshao/build-fix.

## What changes were proposed in this pull request? Twitter Algebird deprecated `apply` in HyperLogLog.scala. ``` deprecated("Use toHLL", since = "0.10.0 / 2015-05") def apply[T <% Array[Byte]](t: T) = create(t) ``` This PR replace the deprecated usage `apply` with new `create` according to the upstream change. ## How was this patch tested? manual. ``` /bin/spark-submit --class org.apache.spark.examples.streaming.TwitterAlgebirdHLL examples/target/scala-2.11/spark-examples-2.0.0-SNAPSHOT-hadoop2.2.0.jar ``` Author: Dongjoon Hyun <[email protected]> Closes #11451 from dongjoon-hyun/replace_deprecated_hll_apply.

## What changes were proposed in this pull request? The PR fixes typos in an error message in dev/run-tests.py. Author: Wojciech Jurczyk <[email protected]> Closes #11467 from wjur/wjur/typos_run_tests.

## What changes were proposed in this pull request? Change in class FormatNumber to make it work irrespective of locale. ## How was this patch tested? Unit tests. Author: lgieron <[email protected]> Closes #11396 from lgieron/SPARK-13515_Fix_Format_Number.

Rahul Tanwani and others added 28 commits February 28, 2016 23:16

[SPARK-13309][SQL] Fix type inference issue with CSV data

dd3b545

Fix type inference issue for sparse CSV data - https://issues.apache.org/jira/browse/SPARK-13309 Author: Rahul Tanwani <[email protected]> Closes #11194 from tanwanirahul/master.

[SPARK-12994][CORE] It is not necessary to create ExecutorAllocationM…

99fe899

…anager in local mode Author: Jeff Zhang <[email protected]> Closes #10914 from zjffdu/SPARK-12994.

[SPARK-12811][ML] Estimator for Generalized Linear Models(GLMs)

5ed48dd

Estimator for Generalized Linear Models(GLMs) which will be solved by IRLS. cc mengxr Author: Yanbo Liang <[email protected]> Closes #11136 from yanboliang/spark-12811.

Closes #11320

c37bbb3

Closes #10940 Closes #11302 Closes #11430 Closes #10912

Fix run-tests.py typos

75e618d

## What changes were proposed in this pull request? The PR fixes typos in an error message in dev/run-tests.py. Author: Wojciech Jurczyk <[email protected]> Closes #11467 from wjur/wjur/typos_run_tests.

anandab merged commit d8afd45 into anandab:master Mar 2, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-13309][SQL] Fix type inference issue with CSV data #2

[SPARK-13309][SQL] Fix type inference issue with CSV data #2

Uh oh!

anandab commented Feb 29, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

19 participants

[SPARK-13309][SQL] Fix type inference issue with CSV data #2

[SPARK-13309][SQL] Fix type inference issue with CSV data #2

Uh oh!

Conversation

anandab commented Feb 29, 2016

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

19 participants