Conversation

@felixcheung
Member

Add support for colnames, colnames<-, coltypes<-
Also added tests for names and names<-, which had no tests previously.

I will merge with PR 8984 (coltypes) once it is in, possibly looking into mapping R type names.

@shivaram @sun-rui

@SparkQA

SparkQA commented Oct 22, 2015

Test build #44137 has finished for PR 9218 at commit 071f29f.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui
Contributor

sun-rui commented Oct 22, 2015

could we support both names() and colnames()?

@shivaram
Contributor

cc @olarayej

@felixcheung
Member Author

@sun-rui names and names<- are already there; this is to add colnames.
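
For reference, a rough sketch of the intended SparkR usage (illustrative only; it assumes a Spark 1.6-era `sqlContext` and the `coltypes()`/`coltypes<-` semantics from PR 8984, and the exact return values shown in the comments are assumptions):

```r
# Illustrative sketch, not the actual tests added in this PR.
df <- createDataFrame(sqlContext, faithful)

colnames(df)                                  # same result as names(df)
colnames(df) <- c("eruption_time", "waiting_time")

coltypes(df)                                  # e.g. c("numeric", "numeric")
coltypes(df) <- c("character", "numeric")     # cast columns to the given R types
```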

Contributor


Although it is simply wrapping another method, having an example and some description helps with the in-line documentation.

Member Author


In the R docs, it will be included under the columns page (which has the description and examples) via the @rdname notation.
https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html
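
For example, a minimal sketch of what the wrapper plus tag could look like (a hypothetical snippet, not the exact code in this PR):

```r
#' @rdname columns
#' @name colnames
setMethod("colnames",
          signature(x = "DataFrame"),
          function(x) {
            # Thin wrapper; the description and examples live on the shared "columns" Rd page.
            columns(x)
          })
```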

@sun-rui
Contributor

sun-rui commented Oct 24, 2015

I am not clear, for both coltypes() and coltypes<-(), how to represent complex types as R types. Do you have an idea?

@felixcheung
Member Author

@sun-rui That's a great point. coltypes(), as its signature is defined, would only return a list of simple types.
But how would one create a DataFrame with a complex type from R? I tried a bit and couldn't get it to work; I get either `Unsupported type for DataFrame: factor` or `unexpected type: environment`.

@sun-rui
Contributor

sun-rui commented Oct 27, 2015

@felixcheung, type inference works for complex types in createDataFrame(). You can refer to the test case for "create DataFrame with complex types" in test_sparkSQL.R.
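
Something along these lines (an illustrative sketch, not the actual test case; the inferred schema noted in the comment is an assumption): nested R lists are inferred as array columns and environments as map columns.

```r
# Illustrative sketch of complex-type inference in createDataFrame().
e <- new.env()
assign("one", 1L, envir = e)

rows <- list(list(1L, as.list(c("a", "b")), e))
df <- createDataFrame(sqlContext, rows, c("id", "tags", "props"))
printSchema(df)  # expected roughly: id int, tags array<string>, props map<string,int>
```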

viirya and others added 20 commits October 28, 2015 21:45
JIRA: https://issues.apache.org/jira/browse/SPARK-11322

As reported by JoshRosen in databricks/spark-redshift#89 (comment), the exception-masking behavior sometimes makes debugging harder. To deal with this issue, we should keep the full stack trace in the captured exception.

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#9283 from viirya/py-exception-stacktrace.
… type correctly

For inner primitive type(e.g. inside `Product`), we use `schemaFor` to get the catalyst type for it, https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L403.

However, for top level primitive type, we use `dataTypeFor`, which is wrong.

Author: Wenchen Fan <[email protected]>

Closes apache#9337 from cloud-fan/encoder.
… for it

Before this PR, the user has to consume the iterator of one group before processing the next group, or we will get into infinite loops.

Author: Wenchen Fan <[email protected]>

Closes apache#9330 from cloud-fan/group.
Java 8 javadoc does not like self-closing tags: ```<p/>```, ```<br/>```, ...

This PR fixes those.

Author: Herman van Hovell <[email protected]>

Closes apache#9339 from hvanhovell/SPARK-11388.
The root cause is that when spark.sql.hive.convertMetastoreParquet=true by default, the cached InMemoryRelation of the ParquetRelation can not be looked up from the cachedData of CacheManager because the key comparison fails even though it is the same LogicalPlan representing the Subquery that wraps the ParquetRelation.
The solution in this PR is overriding the LogicalPlan.sameResult function in Subquery case class to eliminate subquery node first before directly comparing the child (ParquetRelation), which will find the key  to the cached InMemoryRelation.

Author: xin Wu <[email protected]>

Closes apache#9326 from xwu0226/spark-11246-commit.
…eptions

Only print the error message to the console for Analysis Exceptions in sql-shell.

Author: Dilip Biswal <[email protected]>

Closes apache#9194 from dilipbiswal/spark-11188.
Implementing skewness and kurtosis support based on following algorithm:
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics

Author: sethah <[email protected]>

Closes apache#9003 from sethah/SPARK-10641.
"profiles" give us the way that you can specify the set of credentials you want to use when you initialize a connection to AWS.

You can keep multiple sets of credentials in the same credentials files using different profile names.
For example, you can use --profile option to do that when you use "aws cli tool".

http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html

Author: teramonagi <[email protected]>

Closes apache#8696 from teramonagi/SPARK-10532.
Upgrades the tachyon-client version to the latest release.

No new dependencies are added and no Spark-facing APIs are changed. The removal of the `tachyon-underfs-s3` exclusion will enable users to use S3 out of the box, and there are no longer any additional external dependencies added by the module.

Author: Calvin Jia <[email protected]>

Closes apache#9204 from calvinjia/spark-11236.
Quick one-line doc fix: the link is not clickable.
![image](https://cloud.githubusercontent.com/assets/8969467/10833041/4e91dd7c-7e4c-11e5-8905-713b986dbbde.png)

shivaram

Author: felixcheung <[email protected]>

Closes apache#9363 from felixcheung/rpersistdoc.
…tive memory management

This PR introduces a mechanism to call spill() on those SQL operators that support spilling (for example, BytesToBytesMap, UnsafeExternalSorter and ShuffleExternalSorter) if there is not enough memory for execution. The preserved first page is not needed anymore, so it was removed.

Other Spillable objects in Spark core (ExternalSorter and AppendOnlyMap) are not included in this PR, but those could benefit from this (triggering others' spilling).

The PrepareRDD may not be needed anymore and could be removed in a follow-up PR.

The following script fails with OOM before this PR; with this change, it finishes in 150 seconds with a 2G heap (it also works in the 1.5 branch, with a similar duration).

```python
# Repro: force everything through a single shuffle partition so the
# self-join + aggregation needs more execution memory than the heap can hold.
sqlContext.setConf("spark.sql.shuffle.partitions", "1")
df = sqlContext.range(1<<25).selectExpr("id", "repeat(id, 2) as s")
df2 = df.select(df.id.alias('id2'), df.s.alias('s2'))
j = df.join(df2, df.id==df2.id2).groupBy(df.id).max("id", "id2")
j.explain()
print j.count()
```

For thread-safety, here is what I've got:

1) Without calling spill(), the operators should only be used by a single thread, so there are no safety problems.

2) spill() can be triggered in two cases: by the operator itself, or by other operators. We can check trigger == this in spill(); if so, it's still in the same thread, so there are no safety problems.

3) If it's triggered by other operators (right now cache will not trigger spill()), we only spill the data to disk when it's in the scanning stage (building is finished), so the in-memory sorter or memory pages are read-only; we only need to synchronize the iterator and change it.

4) During scanning, the iterator will only use one record in one page; we can't free this page, because the downstream is currently using it (via UnsafeRow or other objects). In BytesToBytesMap, we just skip the current page and dump all the others into disk. In UnsafeExternalSorter, we keep the page that is used by the current record (having the same baseObject) and free it when loading the next record. In ShuffleExternalSorter, spill() will not trigger during scanning.

5) In order to avoid deadlock, we didn't call acquireMemory during spill (so we reused the pointer array in InMemorySorter).

Author: Davies Liu <[email protected]>

Closes apache#9241 from davies/force_spill.
Older versions of Janino (>2.7) do not support Override; we should not use that in codegen.

Author: Davies Liu <[email protected]>

Closes apache#9372 from davies/no_override.
…sion as followup. This is the follow up work of SPARK-10668.

* Fix minor style issues.
* Add a test case to check whether the solver is selected properly.

Author: Lewuathe <[email protected]>
Author: lewuathe <[email protected]>

Closes apache#9180 from Lewuathe/SPARK-11207.
…lumn fail

When enabling mergedSchema and predicate filter, this fails since Parquet does not accept filters pushed down when the columns of the filters do not exist in the schema.
This is related with Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389).

For now, this PR simply disables predicate push-down when using a merged schema.

Author: hyukjinkwon <[email protected]>

Closes apache#9327 from HyukjinKwon/SPARK-11103.
…oupedIterator.hasNext is not idempotent

When we cogroup two `GroupedIterator`s in `CoGroupedIterator`, if the right side is smaller, we will consume the right data and keep the left data unchanged. Then we call `hasNext`, which will call `left.hasNext`. This will make `GroupedIterator` generate an extra group, as the previous one has not been consumed yet.

Author: Wenchen Fan <[email protected]>

Closes apache#9346 from cloud-fan/cogroup and squashes the following commits:

9be67c8 [Wenchen Fan] SPARK-11393
…or backend.

See [SPARK-10986](https://issues.apache.org/jira/browse/SPARK-10986) for details.

This fixes the `ClassNotFoundException` for Spark classes in the serializer.

I am not sure this is the right way to handle the class loader, but I couldn't find any documentation on how the context class loader is used and who relies on it. It seems at least the serializer uses it to instantiate classes during deserialization.

I am open to suggestions (I tried this fix on a real Mesos cluster and it *does* fix the issue).

tnachen andrewor14

Author: Iulian Dragos <[email protected]>

Closes apache#9282 from dragos/issue/mesos-classloader.
…nd' in RRDD in the PR for SPARK-10971.

Author: Sun Rui <[email protected]>

Closes apache#9368 from sun-rui/SPARK-11414.
tdas and others added 24 commits November 10, 2015 23:16
Current updateStateByKey provides stateful processing in Spark Streaming. It allows the user to maintain per-key state and manage that state using an updateFunction. The updateFunction is called for each key, and it uses new data and existing state of the key, to generate an updated state. However, based on community feedback, we have learnt the following lessons.
* Need for more optimized state management that does not scan every key
* Need to make it easier to implement common use cases - (a) timeout of idle data, (b) returning items other than state

The high-level idea of this PR:
* Introduce a new API, trackStateByKey, that allows the user to update per-key state and emit arbitrary records. The new API is necessary as it will have significantly different semantics from the existing updateStateByKey API. This API will have direct support for timeouts.
* Internally, the system will keep the state data as a map/list within the partitions of the state RDDs. The new data RDDs will be partitioned appropriately, and for all the key-value data, it will look up the map/list in the state RDD partition and create a new list/map of updated state data. The new state RDD partition will be created based on the updated data and, if necessary, the old data.
Here is the detailed design doc. Please take a look and provide feedback as comments.
https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em

This is still WIP. Major things left to be done.
- [x] Implement basic functionality of state tracking, with initial RDD and timeouts
- [x] Unit tests for state tracking
- [x] Unit tests for initial RDD and timeout
- [ ] Unit tests for TrackStateRDD
       - [x] state creating, updating, removing
       - [ ] emitting
       - [ ] checkpointing
- [x] Misc unit tests for State, TrackStateSpec, etc.
- [x] Update docs and experimental tags

Author: Tathagata Das <[email protected]>

Closes apache#9256 from tdas/trackStateByKey.
…ing schemas.

https://issues.apache.org/jira/browse/SPARK-11500

As filed in SPARK-11500, if merging schemas is enabled, the order of the files to touch matters and might affect the ordering of the output columns.

This was mostly because of the use of `Set` and `Map`, so I replaced them with `LinkedHashSet` and `LinkedHashMap` to keep the insertion order.

Also, I changed `reduceOption` to `reduceLeftOption`, and replaced the order of `filesToTouch` from `metadataStatuses ++ commonMetadataStatuses ++ needMerged` to  `needMerged ++ metadataStatuses ++ commonMetadataStatuses` in order to touch the part-files first which always have the schema in footers whereas the others might not exist.

One nit: if merging schemas is not enabled but multiple files are given, there is no guarantee of the output order, since there might not be a summary file for the first file, which ends up putting the columns of the other files ahead.

However, I thought this should be okay since disabling merging schemas means (assumes) all the files have the same schemas.

In addition, in the test code for this, I only checked the names of fields.

Author: hyukjinkwon <[email protected]>

Closes apache#9517 from HyukjinKwon/SPARK-11500.
Author: Wenchen Fan <[email protected]>

Closes apache#9627 from cloud-fan/follow.
org.apache.spark.ml.feature.Word2Vec.transform() is very slow; we should not read the broadcast variable for every sentence.

Author: Yuming Wang <[email protected]>
Author: yuming.wang <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes apache#9592 from 979969786/master.
If it returns Text, we can reuse this in Spark SQL to provide a WholeTextFile data source and directly convert the Text into UTF8String without extra string decoding and encoding.

Author: Reynold Xin <[email protected]>

Closes apache#9622 from rxin/SPARK-11646.
insert `aEncoder` like we do in `agg`

Author: Wenchen Fan <[email protected]>

Closes apache#9630 from cloud-fan/select.
We need to support custom classes like Java beans and combine them into tuples, and it's very hard to do that with the TypeTag-based approach.
We should keep only the compose-based way to create tuple encoders.

This PR also moves `Encoder` to `org.apache.spark.sql`.

Author: Wenchen Fan <[email protected]>

Closes apache#9567 from cloud-fan/java.
It may shadow the one from implicits in some cases.

Author: Wenchen Fan <[email protected]>

Closes apache#9629 from cloud-fan/minor.
…mpiled classes

This patch modifies Spark's closure cleaner (and a few other places) to use ASM 5, which is necessary in order to support cleaning of closures that were compiled by Java 8.

In order to avoid ASM dependency conflicts, Spark excludes ASM from all of its dependencies and uses a shaded version of ASM 4 that comes from `reflectasm` (see [SPARK-782](https://issues.apache.org/jira/browse/SPARK-782) and apache#232). This patch updates Spark to use a shaded version of ASM 5.0.4 that was published by the Apache XBean project; the POM used to create the shaded artifact can be found at https://github.com/apache/geronimo-xbean/blob/xbean-4.4/xbean-asm5-shaded/pom.xml.

http://movingfulcrum.tumblr.com/post/80826553604/asm-framework-50-the-missing-migration-guide was a useful resource while upgrading the code to use the new ASM5 opcodes.

I also added new regression tests in the `java8-tests` subproject; the existing tests were insufficient to catch this bug, which only affected Scala 2.11 user code compiled targeting Java 8.

Author: Josh Rosen <[email protected]>

Closes apache#9512 from JoshRosen/SPARK-6152.
…for testing the BatchedWriteAheadLog

Several elements could be drained if the main thread is not fast enough. zsxwing warned me about a similar problem, but it was missed here :( Submitting the fix using a waiter.

cc tdas

Author: Burak Yavuz <[email protected]>

Closes apache#9605 from brkyvz/fix-flaky-test.
…t the offset ranges for a KafkaRDD

tdas koeninger

This updates the Spark Streaming + Kafka Integration Guide doc with a working method to access the offsets of a `KafkaRDD` through Python.

Author: Nick Evans <[email protected]>

Closes apache#9289 from manygrams/update_kafka_direct_python_docs.
…d SparkSubmit tests

This patch aims to reduce the test time and flakiness of HiveSparkSubmitSuite, SparkSubmitSuite, and CliSuite.

Key changes:

- Disable IO synchronization calls for Derby writes, since durability doesn't matter for tests. This was done for HiveCompatibilitySuite in apache#6651 and resulted in huge test speedups.
- Add a few missing `--conf`s to disable various Spark UIs. The CliSuite, in particular, never disabled these UIs, leaving it prone to port-contention-related flakiness.
- Fix two instances where tests defined `beforeAll()` methods which were never called because the appropriate traits were not mixed in. I updated these tests suites to extend `BeforeAndAfterEach` so that they play nicely with our `ResetSystemProperties` trait.

Author: Josh Rosen <[email protected]>

Closes apache#9623 from JoshRosen/SPARK-11647.
This is a followup for apache#9317 to replace volatile fields with AtomicBoolean and AtomicReference.

Author: Reynold Xin <[email protected]>

Closes apache#9611 from rxin/SPARK-10827.
Saw several failures on Jenkins, e.g., https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/. This is the first failure in master build:

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3982/

I cannot reproduce it locally, so I am temporarily disabling the tests and will look into the issue under the same JIRA. I'm going to merge the PR after Jenkins passes compile.

Author: Xiangrui Meng <[email protected]>

Closes apache#9641 from mengxr/SPARK-11672.
This adds a pivot method to the dataframe api.

Following the lead of cube and rollup this adds a Pivot operator that is translated into an Aggregate by the analyzer.

Currently the syntax is like:
~~courseSales.pivot(Seq($"year"), $"course", Seq("dotNET", "Java"), sum($"earnings"))~~

~~Would we be interested in the following syntax also/alternatively? and~~

    courseSales.groupBy($"year").pivot($"course", "dotNET", "Java").agg(sum($"earnings"))
    //or
    courseSales.groupBy($"year").pivot($"course").agg(sum($"earnings"))

Later we can add it to `SQLParser`, but as Hive doesn't support it we can't add it there, right?

~~Also what would be the suggested Java friendly method signature for this?~~

Author: Andrew Ray <[email protected]>

Closes apache#7841 from aray/sql-pivot.
… into colnamescoltypes

Conflicts:
	R/pkg/NAMESPACE
	R/pkg/R/DataFrame.R
	R/pkg/R/generics.R
	R/pkg/inst/tests/test_sparkSQL.R
@felixcheung deleted the colnamescoltypes branch November 12, 2015 07:00
@SparkQA

SparkQA commented Nov 12, 2015

Test build #45718 has finished for PR 9218 at commit 8730a37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Nov 29, 2015
Add support for colnames, colnames<-, coltypes<-
Also added tests for names and names<-, which had no tests previously.

I merged with PR 8984 (coltypes). Clicked the wrong thing and screwed up the PR; recreated it here. Was #9218

shivaram sun-rui

Author: felixcheung <[email protected]>

Closes #9654 from felixcheung/colnamescoltypes.

(cherry picked from commit c793d2d)
Signed-off-by: Shivaram Venkataraman <[email protected]>
asfgit pushed a commit that referenced this pull request Nov 29, 2015
Add support for colnames, colnames<-, coltypes<-
Also added tests for names and names<-, which had no tests previously.

I merged with PR 8984 (coltypes). Clicked the wrong thing and screwed up the PR; recreated it here. Was #9218

shivaram sun-rui

Author: felixcheung <[email protected]>

Closes #9654 from felixcheung/colnamescoltypes.