Conversation

@felixcheung
Member

Add support for colnames, colnames<-, coltypes<-
Also added tests for names and names<-, which had no tests previously.

I will merge with PR 8984 (coltypes) once it is in, possibly looking into mapping R type names.

@shivaram @sun-rui

@SparkQA

SparkQA commented Oct 22, 2015

Test build #44137 has finished for PR 9218 at commit 071f29f.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui
Contributor

sun-rui commented Oct 22, 2015

could we support both names() and colnames()?

@shivaram
Contributor

cc @olarayej

@felixcheung
Member Author

@sun-rui names and names<- are already there; this is to add colnames.
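
For reference, a rough sketch of the intended SparkR usage (illustrative only; it assumes a Spark 1.6-era `sqlContext` and the `coltypes()`/`coltypes<-` semantics from PR 8984, and the exact return values shown in the comments are assumptions):

```r
# Illustrative sketch, not the actual tests added in this PR.
df <- createDataFrame(sqlContext, faithful)

colnames(df)                                  # same result as names(df)
colnames(df) <- c("eruption_time", "waiting_time")

coltypes(df)                                  # e.g. c("numeric", "numeric")
coltypes(df) <- c("character", "numeric")     # cast columns to the given R types
```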

Contributor


Although it is simply wrapping another method, having an example and some description helps with the in-line documentation.

Member Author


In the R docs, it will be included under the columns page (which has the description and examples) via the @rdname notation.
https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html
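
For example, a minimal sketch of what the wrapper plus tag could look like (a hypothetical snippet, not the exact code in this PR):

```r
#' @rdname columns
#' @name colnames
setMethod("colnames",
          signature(x = "DataFrame"),
          function(x) {
            # Thin wrapper; the description and examples live on the shared "columns" Rd page.
            columns(x)
          })
```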

@sun-rui
Contributor

sun-rui commented Oct 24, 2015

I am not clear, for both coltypes() and coltypes<-(), how to represent complex types as R types. Do you have an idea?

@felixcheung
Member Author

@sun-rui That's a great point. coltypes(), as its signature is defined, would only return a list of simple types.
But how would one create a DataFrame with a complex type from R? I tried a bit and couldn't get it to work; I get either `Unsupported type for DataFrame: factor` or `unexpected type: environment`.

@sun-rui
Contributor

sun-rui commented Oct 27, 2015

@felixcheung, type inference works for complex types in createDataFrame(). You can refer to the test case for "create DataFrame with complex types" in test_sparkSQL.R.
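
Something along these lines (an illustrative sketch, not the actual test case; the inferred schema noted in the comment is an assumption): nested R lists are inferred as array columns and environments as map columns.

```r
# Illustrative sketch of complex-type inference in createDataFrame().
e <- new.env()
assign("one", 1L, envir = e)

rows <- list(list(1L, as.list(c("a", "b")), e))
df <- createDataFrame(sqlContext, rows, c("id", "tags", "props"))
printSchema(df)  # expected roughly: id int, tags array<string>, props map<string,int>
```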

viirya and others added 20 commits October 28, 2015 21:45
JIRA: https://issues.apache.org/jira/browse/SPARK-11322

As reported by JoshRosen in databricks/spark-redshift#89 (comment), the exception-masking behavior sometimes makes debugging harder. To deal with this issue, we should keep the full stack trace in the captured exception.

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#9283 from viirya/py-exception-stacktrace.
… type correctly

For inner primitive type(e.g. inside `Product`), we use `schemaFor` to get the catalyst type for it, https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L403.

However, for top level primitive type, we use `dataTypeFor`, which is wrong.

Author: Wenchen Fan <[email protected]>

Closes apache#9337 from cloud-fan/encoder.
… for it

Before this PR, the user has to consume the iterator of one group before processing the next group, or we will get into infinite loops.

Author: Wenchen Fan <[email protected]>

Closes apache#9330 from cloud-fan/group.
Java 8 javadoc does not like self-closing tags: ```<p/>```, ```<br/>```, ...

This PR fixes those.

Author: Herman van Hovell <[email protected]>

Closes apache#9339 from hvanhovell/SPARK-11388.
The root cause is that when spark.sql.hive.convertMetastoreParquet=true by default, the cached InMemoryRelation of the ParquetRelation can not be looked up from the cachedData of CacheManager because the key comparison fails even though it is the same LogicalPlan representing the Subquery that wraps the ParquetRelation.
The solution in this PR is overriding the LogicalPlan.sameResult function in Subquery case class to eliminate subquery node first before directly comparing the child (ParquetRelation), which will find the key  to the cached InMemoryRelation.

Author: xin Wu <[email protected]>

Closes apache#9326 from xwu0226/spark-11246-commit.
…eptions

Only print the error message to the console for Analysis Exceptions in sql-shell.

Author: Dilip Biswal <[email protected]>

Closes apache#9194 from dilipbiswal/spark-11188.
Implementing skewness and kurtosis support based on following algorithm:
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Higher-order_statistics

Author: sethah <[email protected]>

Closes apache#9003 from sethah/SPARK-10641.
"profiles" give us the way that you can specify the set of credentials you want to use when you initialize a connection to AWS.

You can keep multiple sets of credentials in the same credentials files using different profile names.
For example, you can use --profile option to do that when you use "aws cli tool".

http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html

Author: teramonagi <[email protected]>

Closes apache#8696 from teramonagi/SPARK-10532.
Upgrades the tachyon-client version to the latest release.

No new dependencies are added and no Spark-facing APIs are changed. The removal of the `tachyon-underfs-s3` exclusion will enable users to use S3 out of the box, and there are no longer any additional external dependencies added by the module.

Author: Calvin Jia <[email protected]>

Closes apache#9204 from calvinjia/spark-11236.
Quick one-line doc fix: the link is not clickable.
![image](https://cloud.githubusercontent.com/assets/8969467/10833041/4e91dd7c-7e4c-11e5-8905-713b986dbbde.png)

shivaram

Author: felixcheung <[email protected]>

Closes apache#9363 from felixcheung/rpersistdoc.
…tive memory management

This PR introduces a mechanism to call spill() on those SQL operators that support spilling (for example, BytesToBytesMap, UnsafeExternalSorter and ShuffleExternalSorter) if there is not enough memory for execution. The preserved first page is not needed anymore, so it was removed.

Other Spillable objects in Spark core (ExternalSorter and AppendOnlyMap) are not included in this PR, but those could benefit from this (triggering others' spilling).

The PrepareRDD may not be needed anymore and could be removed in a follow-up PR.

The following script fails with OOM before this PR; with this change, it finishes in 150 seconds with a 2G heap (it also works in the 1.5 branch, with a similar duration).

```python
# Repro: force everything through a single shuffle partition so the
# self-join + aggregation needs more execution memory than the heap can hold.
sqlContext.setConf("spark.sql.shuffle.partitions", "1")
df = sqlContext.range(1<<25).selectExpr("id", "repeat(id, 2) as s")
df2 = df.select(df.id.alias('id2'), df.s.alias('s2'))
j = df.join(df2, df.id==df2.id2).groupBy(df.id).max("id", "id2")
j.explain()
print j.count()
```

For thread-safety, here is what I've got:

1) Without calling spill(), the operators should only be used by a single thread, so there are no safety problems.

2) spill() can be triggered in two cases: by the operator itself, or by other operators. We can check trigger == this in spill(); if so, it's still in the same thread, so there are no safety problems.

3) If it's triggered by other operators (right now cache will not trigger spill()), we only spill the data to disk when it's in the scanning stage (building is finished), so the in-memory sorter or memory pages are read-only; we only need to synchronize the iterator and change it.

4) During scanning, the iterator will only use one record in one page; we can't free this page, because the downstream is currently using it (via UnsafeRow or other objects). In BytesToBytesMap, we just skip the current page and dump all the others into disk. In UnsafeExternalSorter, we keep the page that is used by the current record (having the same baseObject) and free it when loading the next record. In ShuffleExternalSorter, spill() will not trigger during scanning.

5) In order to avoid deadlock, we didn't call acquireMemory during spill (so we reused the pointer array in InMemorySorter).

Author: Davies Liu <[email protected]>

Closes apache#9241 from davies/force_spill.
Older versions of Janino (>2.7) do not support Override; we should not use that in codegen.

Author: Davies Liu <[email protected]>

Closes apache#9372 from davies/no_override.
…sion as followup. This is the follow up work of SPARK-10668.

* Fix minor style issues.
* Add a test case to check whether the solver is selected properly.

Author: Lewuathe <[email protected]>
Author: lewuathe <[email protected]>

Closes apache#9180 from Lewuathe/SPARK-11207.
…lumn fail

When enabling mergedSchema and predicate filter, this fails since Parquet does not accept filters pushed down when the columns of the filters do not exist in the schema.
This is related with Parquet issue (https://issues.apache.org/jira/browse/PARQUET-389).

For now, this PR simply disables predicate push-down when using a merged schema.

Author: hyukjinkwon <[email protected]>

Closes apache#9327 from HyukjinKwon/SPARK-11103.
…oupedIterator.hasNext is not idempotent

When we cogroup two `GroupedIterator`s in `CoGroupedIterator`, if the right side is smaller, we will consume the right data and keep the left data unchanged. Then we call `hasNext`, which will call `left.hasNext`. This will make `GroupedIterator` generate an extra group, as the previous one has not been consumed yet.

Author: Wenchen Fan <[email protected]>

Closes apache#9346 from cloud-fan/cogroup and squashes the following commits:

9be67c8 [Wenchen Fan] SPARK-11393
…or backend.

See [SPARK-10986](https://issues.apache.org/jira/browse/SPARK-10986) for details.

This fixes the `ClassNotFoundException` for Spark classes in the serializer.

I am not sure this is the right way to handle the class loader, but I couldn't find any documentation on how the context class loader is used and who relies on it. It seems at least the serializer uses it to instantiate classes during deserialization.

I am open to suggestions (I tried this fix on a real Mesos cluster and it *does* fix the issue).

tnachen andrewor14

Author: Iulian Dragos <[email protected]>

Closes apache#9282 from dragos/issue/mesos-classloader.
…nd' in RRDD in the PR for SPARK-10971.

Author: Sun Rui <[email protected]>

Closes apache#9368 from sun-rui/SPARK-11414.
tdas and others added 24 commits November 10, 2015 23:16
Current updateStateByKey provides stateful processing in Spark Streaming. It allows the user to maintain per-key state and manage that state using an updateFunction. The updateFunction is called for each key, and it uses new data and existing state of the key, to generate an updated state. However, based on community feedback, we have learnt the following lessons.
* Need for more optimized state management that does not scan every key
* Need to make it easier to implement common use cases - (a) timeout of idle data, (b) returning items other than state

The high-level idea of this PR:
* Introduce a new API, trackStateByKey, that allows the user to update per-key state and emit arbitrary records. The new API is necessary as it will have significantly different semantics from the existing updateStateByKey API. This API will have direct support for timeouts.
* Internally, the system will keep the state data as a map/list within the partitions of the state RDDs. The new data RDDs will be partitioned appropriately, and for all the key-value data, it will look up the map/list in the state RDD partition and create a new list/map of updated state data. The new state RDD partition will be created based on the updated data and, if necessary, the old data.
Here is the detailed design doc. Please take a look and provide feedback as comments.
https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em

This is still WIP. Major things left to be done.
- [x] Implement basic functionality of state tracking, with initial RDD and timeouts
- [x] Unit tests for state tracking
- [x] Unit tests for initial RDD and timeout
- [ ] Unit tests for TrackStateRDD
       - [x] state creating, updating, removing
       - [ ] emitting
       - [ ] checkpointing
- [x] Misc unit tests for State, TrackStateSpec, etc.
- [x] Update docs and experimental tags

Author: Tathagata Das <[email protected]>

Closes apache#9256 from tdas/trackStateByKey.
…ing schemas.

https://issues.apache.org/jira/browse/SPARK-11500

As filed in SPARK-11500, if merging schemas is enabled, the order of the files to touch matters and might affect the ordering of the output columns.

This was mostly because of the use of `Set` and `Map`, so I replaced them with `LinkedHashSet` and `LinkedHashMap` to keep the insertion order.

Also, I changed `reduceOption` to `reduceLeftOption`, and replaced the order of `filesToTouch` from `metadataStatuses ++ commonMetadataStatuses ++ needMerged` to  `needMerged ++ metadataStatuses ++ commonMetadataStatuses` in order to touch the part-files first which always have the schema in footers whereas the others might not exist.

One nit: if merging schemas is not enabled but multiple files are given, there is no guarantee of the output order, since there might not be a summary file for the first file, which ends up putting the columns of the other files ahead.

However, I thought this should be okay since disabling merging schemas means (assumes) all the files have the same schemas.

In addition, in the test code for this, I only checked the names of fields.

Author: hyukjinkwon <[email protected]>

Closes apache#9517 from HyukjinKwon/SPARK-11500.
Author: Wenchen Fan <[email protected]>

Closes apache#9627 from cloud-fan/follow.
org.apache.spark.ml.feature.Word2Vec.transform() is very slow; we should not read the broadcast variable for every sentence.

Author: Yuming Wang <[email protected]>
Author: yuming.wang <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes apache#9592 from 979969786/master.
If it returns Text, we can reuse this in Spark SQL to provide a WholeTextFile data source and directly convert the Text into UTF8String without extra string decoding and encoding.

Author: Reynold Xin <[email protected]>

Closes apache#9622 from rxin/SPARK-11646.
insert `aEncoder` like we do in `agg`

Author: Wenchen Fan <[email protected]>

Closes apache#9630 from cloud-fan/select.
We need to support custom classes like Java beans and combine them into tuples, and it's very hard to do that with the TypeTag-based approach.
We should keep only the compose-based way to create tuple encoders.

This PR also moves `Encoder` to `org.apache.spark.sql`.

Author: Wenchen Fan <[email protected]>

Closes apache#9567 from cloud-fan/java.
It may shadow the one from implicits in some cases.

Author: Wenchen Fan <[email protected]>

Closes apache#9629 from cloud-fan/minor.
…mpiled classes

This patch modifies Spark's closure cleaner (and a few other places) to use ASM 5, which is necessary in order to support cleaning of closures that were compiled by Java 8.

In order to avoid ASM dependency conflicts, Spark excludes ASM from all of its dependencies and uses a shaded version of ASM 4 that comes from `reflectasm` (see [SPARK-782](https://issues.apache.org/jira/browse/SPARK-782) and apache#232). This patch updates Spark to use a shaded version of ASM 5.0.4 that was published by the Apache XBean project; the POM used to create the shaded artifact can be found at https://github.com/apache/geronimo-xbean/blob/xbean-4.4/xbean-asm5-shaded/pom.xml.

http://movingfulcrum.tumblr.com/post/80826553604/asm-framework-50-the-missing-migration-guide was a useful resource while upgrading the code to use the new ASM5 opcodes.

I also added new regression tests in the `java8-tests` subproject; the existing tests were insufficient to catch this bug, which only affected Scala 2.11 user code compiled targeting Java 8.

Author: Josh Rosen <[email protected]>

Closes apache#9512 from JoshRosen/SPARK-6152.
…for testing the BatchedWriteAheadLog

Several elements could be drained if the main thread is not fast enough. zsxwing warned me about a similar problem, but it was missed here :( Submitting the fix using a waiter.

cc tdas

Author: Burak Yavuz <[email protected]>

Closes apache#9605 from brkyvz/fix-flaky-test.
…t the offset ranges for a KafkaRDD

tdas koeninger

This updates the Spark Streaming + Kafka Integration Guide doc with a working method to access the offsets of a `KafkaRDD` through Python.

Author: Nick Evans <[email protected]>

Closes apache#9289 from manygrams/update_kafka_direct_python_docs.
…d SparkSubmit tests

This patch aims to reduce the test time and flakiness of HiveSparkSubmitSuite, SparkSubmitSuite, and CliSuite.

Key changes:

- Disable IO synchronization calls for Derby writes, since durability doesn't matter for tests. This was done for HiveCompatibilitySuite in apache#6651 and resulted in huge test speedups.
- Add a few missing `--conf`s to disable various Spark UIs. The CliSuite, in particular, never disabled these UIs, leaving it prone to port-contention-related flakiness.
- Fix two instances where tests defined `beforeAll()` methods which were never called because the appropriate traits were not mixed in. I updated these tests suites to extend `BeforeAndAfterEach` so that they play nicely with our `ResetSystemProperties` trait.

Author: Josh Rosen <[email protected]>

Closes apache#9623 from JoshRosen/SPARK-11647.
This is a followup for apache#9317 to replace volatile fields with AtomicBoolean and AtomicReference.

Author: Reynold Xin <[email protected]>

Closes apache#9611 from rxin/SPARK-10827.
Saw several failures on Jenkins, e.g., https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/. This is the first failure in master build:

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3982/

I cannot reproduce it locally, so I am temporarily disabling the tests and will look into the issue under the same JIRA. I'm going to merge the PR after Jenkins passes compile.

Author: Xiangrui Meng <[email protected]>

Closes apache#9641 from mengxr/SPARK-11672.
This adds a pivot method to the dataframe api.

Following the lead of cube and rollup this adds a Pivot operator that is translated into an Aggregate by the analyzer.

Currently the syntax is like:
~~courseSales.pivot(Seq($"year"), $"course", Seq("dotNET", "Java"), sum($"earnings"))~~

~~Would we be interested in the following syntax also/alternatively? and~~

    courseSales.groupBy($"year").pivot($"course", "dotNET", "Java").agg(sum($"earnings"))
    //or
    courseSales.groupBy($"year").pivot($"course").agg(sum($"earnings"))

Later we can add it to `SQLParser`, but as Hive doesn't support it we can't add it there, right?

~~Also what would be the suggested Java friendly method signature for this?~~

Author: Andrew Ray <[email protected]>

Closes apache#7841 from aray/sql-pivot.
… into colnamescoltypes

Conflicts:
	R/pkg/NAMESPACE
	R/pkg/R/DataFrame.R
	R/pkg/R/generics.R
	R/pkg/inst/tests/test_sparkSQL.R
@felixcheung deleted the colnamescoltypes branch November 12, 2015 07:00
@SparkQA

SparkQA commented Nov 12, 2015

Test build #45718 has finished for PR 9218 at commit 8730a37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Nov 29, 2015
Add support for colnames, colnames<-, coltypes<-
Also added tests for names and names<-, which had no tests previously.

I merged with PR 8984 (coltypes). Clicked the wrong thing and screwed up the PR; recreated it here. Was #9218

shivaram sun-rui

Author: felixcheung <[email protected]>

Closes #9654 from felixcheung/colnamescoltypes.

(cherry picked from commit c793d2d)
Signed-off-by: Shivaram Venkataraman <[email protected]>
asfgit pushed a commit that referenced this pull request Nov 29, 2015
Add support for colnames, colnames<-, coltypes<-
Also added tests for names and names<-, which had no tests previously.

I merged with PR 8984 (coltypes). Clicked the wrong thing and screwed up the PR; recreated it here. Was #9218

shivaram sun-rui

Author: felixcheung <[email protected]>

Closes #9654 from felixcheung/colnamescoltypes.