Skip to content

Conversation

@kmader
Copy link

@kmader kmader commented Jul 30, 2014

No description provided.

ueshin and others added 30 commits July 2, 2014 10:10
…e classes.

Author: Takuya UESHIN <[email protected]>

Closes #1226 from ueshin/issues/SPARK-2287 and squashes the following commits:

32ef7c3 [Takuya UESHIN] Add execution of `SHOW TABLES` before `TestHive.reset()`.
541dc8d [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2287
fac5fae [Takuya UESHIN] Remove unnecessary method receiver.
d306e60 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2287
7de5706 [Takuya UESHIN] Make ScalaReflection be able to handle Generic case classes.
…ir is a list of multiple paths and one of them has error

The spark.local.dir is configured as a list of multiple paths as follows /data1/sparkenv/local,/data2/sparkenv/local. If the disk data2 of the driver node has error, the application will exit since DiskBlockManager exits directly at createLocalDirs. If the disk data2 of the worker node has error, the executor will exit either.
DiskBlockManager should not exit directly at createLocalDirs if one of spark.local.dir has error. Since spark.local.dir has multiple paths, a problem should not affect the overall situation.
I think DiskBlockManager could ignore the bad directory at createLocalDirs.

Author: yantangzhai <[email protected]>

Closes #1274 from YanTangZhai/SPARK-2324 and squashes the following commits:

609bf48 [yantangzhai] [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
df08673 [yantangzhai] [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
It did not handle null keys very gracefully before.

Author: Andrew Or <[email protected]>

Closes #1288 from andrewor14/fix-external and squashes the following commits:

312b8d8 [Andrew Or] Abstract key hash code
ed5adf9 [Andrew Or] Fix NPE for ExternalAppendOnlyMap
…ntered data

Just closing out this small JIRA, resolving with a comment change.

Author: Sean Owen <[email protected]>

Closes #1171 from srowen/SPARK-1675 and squashes the following commits:

45ee9b7 [Sean Owen] Add simple note that data need not be centered for computePrincipalComponents
… ty...

The function cast doesn't conform to the intention of "Those expressions are supposed to be in the same data type, and also the return type." comment

Author: Yijie Shen <[email protected]>

Closes #1283 from yijieshen/master and squashes the following commits:

c7aaa4b [Yijie Shen] [SPARK-2342] Evaluation helper's output type doesn't conform to input type
Trivial fix.

Author: Prashant Sharma <[email protected]>

Closes #1050 from ScrapCodes/SPARK-2109/pyspark-script-bug and squashes the following commits:

77072b9 [Prashant Sharma] Changed echos to redirect to STDERR.
13f48a0 [Prashant Sharma] [SPARK-2109] Setting SPARK_MEM for bin/pyspark does not work.
Let's see if this fixes the ongoing series of test failures in a master build machine (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT-pre-YARN/SPARK_HADOOP_VERSION=1.0.4,label=centos/81/).

pwendell marmbrus

Author: Zongheng Yang <[email protected]>

Closes #1277 from concretevitamin/test-fix and squashes the following commits:

28c88bd [Zongheng Yang] Synchronize on SQLContext.settings in tests.
Fix a bad Java code sample and a broken link in the streaming programming guide.

Author: Clément MATHIEU <[email protected]>

Closes #1286 from cykl/streaming-programming-guide-typos and squashes the following commits:

b0908cb [Clément MATHIEU] Fix broken URL
9d3c535 [Clément MATHIEU] Spark streaming requires at least two working threads (scala version was OK)
Workaround Hadoop conf ConcurrentModification issue

Author: Raymond Liu <[email protected]>

Closes #1273 from colorant/hadoopRDD and squashes the following commits:

994e98b [Raymond Liu] Address comments
e2cda3d [Raymond Liu] Workaround Hadoop conf ConcurrentModification issue
Prior to this change, we could throw a NPE if we launch a driver while another one is waiting, because removing from an iterator while iterating over it is not safe.

Author: Aaron Davidson <[email protected]>

Closes #1289 from aarondav/master-fail and squashes the following commits:

1cf1cf4 [Aaron Davidson] SPARK-2350: Don't NPE while launching drivers
**Problem.** The existing code in `ExecutorPage.scala` requires a linear scan through all the blocks to filter out the uncached ones. Every refresh could be expensive if there are many blocks and many executors.

**Solution.** The proper semantics should be the following: `StorageStatusListener` should contain only block statuses that are cached. This means as soon as a block is unpersisted by any mean, its status should be removed. This is reflected in the changes made in `StorageStatusListener.scala`.

Further, the `StorageTab` must stop relying on the `StorageStatusListener` changing a dropped block's status to `StorageLevel.NONE` (which no longer happens). This is reflected in the changes made in `StorageTab.scala` and `StorageUtils.scala`.

----------

If you have been following this chain of PRs like pwendell, you will quickly notice that this reverts the changes in #1249, which reverts the changes in #1080. In other words, we are adding back the changes from #1080, and fixing SPARK-2307 on top of those changes. Please ask questions if you are confused.

Author: Andrew Or <[email protected]>

Closes #1255 from andrewor14/storage-ui-fix-reprise and squashes the following commits:

45416fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into storage-ui-fix-reprise
a82ea25 [Andrew Or] Add tests for StorageStatusListener
8773b01 [Andrew Or] Update comment / minor changes
3afde3f [Andrew Or] Correctly report the number of blocks on SparkUI
JIRA: https://issues.apache.org/jira/browse/SPARK-2282

This issue is caused by a buildup of sockets in the TIME_WAIT stage of TCP, which is a stage that lasts for some period of time after the communication closes.

This solution simply allows us to reuse sockets that are in TIME_WAIT, to avoid issues with the buildup of the rapid creation of these sockets.

Author: Aaron Davidson <[email protected]>

Closes #1220 from aarondav/SPARK-2282 and squashes the following commits:

2e5cab3 [Aaron Davidson] SPARK-2282: Reuse PySpark Accumulator sockets to avoid crashing Spark
…Command`

This is a fix for the problem revealed by PR #1265.

Currently `HiveComparisonSuite` ignores output of `ExplainCommand` since Catalyst query plan is quite different from Hive query plan. But exceptions throw from `CheckResolution` still breaks test cases. This PR catches any `TreeNodeException` and reports it as part of the query explanation.

After merging this PR, PR #1265 can also be merged safely.

For a normal query:

```
scala> hql("explain select key from src").foreach(println)
...
[Physical execution plan:]
[HiveTableScan [key#9], (MetastoreRelation default, src, None), None]
```

For a wrong query with unresolved attribute(s):

```
scala> hql("explain select kay from src").foreach(println)
...
[Error occurred during query planning: ]
[Unresolved attributes: 'kay, tree:]
[Project ['kay]]
[ LowerCaseSchema ]
[  MetastoreRelation default, src, None]
```

Author: Cheng Lian <[email protected]>

Closes #1294 from liancheng/safe-explain and squashes the following commits:

4318911 [Cheng Lian] Don't throw TreeNodeException in `execution.ExplainCommand`
… defined classes.

This is an alternate solution to #1176.

Author: Prashant Sharma <[email protected]>

Closes #1179 from ScrapCodes/SPARK-1199/repl-fix-second-approach and squashes the following commits:

820b34b [Prashant Sharma] Here we generate two kinds of import wrappers based on whether it is a class or not.
use concurrent.ConcurrentHashMap instead of util.Collections.synchronizedMap

Author: baishuo(白硕) <[email protected]>

Closes #1272 from baishuo/master and squashes the following commits:

51ec55d [baishuo(白硕)] Update SQLConf.scala
63da043 [baishuo(白硕)] Update SQLConf.scala
36b6dbd [baishuo(白硕)] Update SQLConf.scala
864faa0 [baishuo(白硕)] Update SQLConf.scala
593096b [baishuo(白硕)] Update SQLConf.scala
7304d9b [baishuo(白硕)] Update SQLConf.scala
843581c [baishuo(白硕)] Update SQLConf.scala
1d3e4a2 [baishuo(白硕)] Update SQLConf.scala
0740f28 [baishuo(白硕)] Update SQLConf.scala
This replaces #1263 with a test case.

Author: Reynold Xin <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #1265 from rxin/sql-analysis-error and squashes the following commits:

a639e01 [Reynold Xin] Added a test case for unresolved attribute analysis.
7371e1b [Reynold Xin] Merge pull request #1263 from marmbrus/analysisChecks
448c088 [Michael Armbrust] Add analysis checks
Hi all,
I want to submit a Except operator in basicOperators.scala
In SQL case.SQL support two table do except operator.
select * from table1
except
select * from table2
This operator support the substract function .Return an table with the elements from `this` that are not in `other`.This operator should limit the input SparkPlan Seq only has two member.The check will later support
JIRA:https://issues.apache.org/jira/browse/SPARK-2234

Author: Yanjie Gao <[email protected]>
Author: YanjieGao <[email protected]>
Author: root <root@node4.(none)>
Author: gaoyanjie <[email protected]>

Closes #1151 from YanjieGao/patch-6 and squashes the following commits:

f19f899 [YanjieGao] add a new blank line in basicoperators.scala
2ff7d73 [YanjieGao] resolve the identation in SqlParser and SparkStrategies
fdb5227 [YanjieGao] Merge remote branch 'upstream/master' into patch-6
9940d19 [YanjieGao] make comment less than 100c
09c7413 [YanjieGao] pr 1151 SqlParser add cache ,basic Operator rename Except and modify comment
b4b5867 [root] Merge remote branch 'upstream/master' into patch-6
b4c3869 [Yanjie Gao] change SparkStrategies Sparkcontext to SqlContext
7e0ec29 [Yanjie Gao] delete multi test
7e7c83f [Yanjie Gao] delete conflict except
b01beb8 [YanjieGao] resolve conflict sparkstrategies and basicOperators
4dc8166 [YanjieGao] resolve conflict
fa68a98 [Yanjie Gao] Update joins.scala
8e6bb00 [Yanjie Gao] delete conflict except
dd9ba5e [Yanjie Gao] Update joins.scala
a0d4e73 [Yanjie Gao] delete skew join
60f5ddd [Yanjie Gao] update less than 100c
0e72233 [Yanjie Gao] update SQLQuerySuite on master branch
7f916b5 [Yanjie Gao] update execution/basicOperators on master branch
a28dece [Yanjie Gao] Update logical/basicOperators on master branch
a639935 [Yanjie Gao] Update SparkStrategies.scala
3bf7def [Yanjie Gao] update SqlParser on master branch
26f833f [Yanjie Gao] update SparkStrategies.scala on master branch
8dd063f [Yanjie Gao] Update logical/basicOperators on master branch
9847dcf [Yanjie Gao] update SqlParser on masterbranch
d6a4604 [Yanjie Gao] Update joins.scala
424c507 [Yanjie Gao] Update joins.scala
7680742 [Yanjie Gao] Update SqlParser.scala
a7193d8 [gaoyanjie] [SPARK-2234][SQL]Spark SQL basicOperators add Except operator #1151
5c8a224 [Yanjie Gao] update the line less than 100c
ee066b3 [Yanjie Gao] Update basicOperators.scala
32a80ab [Yanjie Gao] remove except in HiveQl
cf232eb [Yanjie Gao] update 1comment 2space3 left.out
f1ea3f3 [Yanjie Gao] remove comment
7ea9b91 [Yanjie Gao] remove annotation
7f3d613 [Yanjie Gao] update .map(_.copy())
670a1bb [Yanjie Gao] Update HiveQl.scala
3fe7746 [Yanjie Gao] Update SQLQuerySuite.scala
a36eb0a [Yanjie Gao] Update basicOperators.scala
7859e56 [Yanjie Gao] Update SparkStrategies.scala
052346d [Yanjie Gao] Subtract is conflict with Subtract(e1,e2)
aab3785 [Yanjie Gao] Update SQLQuerySuite.scala
4bf80b1 [Yanjie Gao] update subtract to except
4bdd520 [Yanjie Gao] Update SqlParser.scala
2d4bfbd [Yanjie Gao] Update SQLQuerySuite.scala
0808921 [Yanjie Gao] SQLQuerySuite
a8a1948 [Yanjie Gao] SparkStrategies
1fe96c0 [Yanjie Gao] HiveQl.scala update
3305e40 [Yanjie Gao] SqlParser
7a98c37 [Yanjie Gao] Update basicOperators.scala
cf5b9d0 [Yanjie Gao] Update basicOperators.scala
8945835 [Yanjie Gao] object SkewJoin extends Strategy
2b98962 [Yanjie Gao] Update SqlParser.scala
dd32980 [Yanjie Gao] update1
68815b2 [Yanjie Gao] Reformat the code style
4eb43ec [Yanjie Gao] Update basicOperators.scala
aa06072 [Yanjie Gao] Reformat the code sytle
If the docs are built after a Maven build has finished the intermediate
state somehow causes a compiler bug during sbt compilation. This just
does a clean before attempting to build the docs.
This was omitted in #1260. @aarondav

Author: Reynold Xin <[email protected]>

Closes #1300 from rxin/historyServer and squashes the following commits:

af720a3 [Reynold Xin] Added SignalLogger to HistoryServer.
…eries.

Author: Michael Armbrust <[email protected]>

Closes #1305 from marmbrus/usePrunerPartitions and squashes the following commits:

744aa20 [Michael Armbrust] Use getAllPartitionsForPruner instead of getPartitions, which avoids retrieving auth data
…Kry...

Due to the non registration of BoundedPriorityQueue  with kryoserializer, operations which are dependend on BoundedPriorityQueue are giving exceptions.One such instance is using top along with kryo serialization.
Fixed the issue by registering BoundedPriorityQueue with kryoserializer.

Author: ankit.bhardwaj <[email protected]>

Closes #1299 from AnkitBhardwaj12/BoundedPriorityQueueWithKryoIssue and squashes the following commits:

a4ae8ed [ankit.bhardwaj] [SPARK-2306]:BoundedPriorityQueue is private and not registered with Kryo
…join.

The right side of `LeftSemi` join needs columns only used in join condition.

Author: Takuya UESHIN <[email protected]>

Closes #1301 from ueshin/issues/SPARK-2366 and squashes the following commits:

7677a39 [Takuya UESHIN] Update comments.
786d3a0 [Takuya UESHIN] Rename method name.
e0957b1 [Takuya UESHIN] Add column pruning for the right side of LeftSemi join.
Fix nullabilities of `Join`/`Generate`/`Aggregate` because:
- Output attributes of opposite side of `OuterJoin` should be nullable.
- Output attributes of generater side of `Generate` should be nullable if `join` is `true` and `outer` is `true`.
- `AttributeReference` of `computedAggregates` of `Aggregate` should be the same as `aggregateExpression`'s.

Author: Takuya UESHIN <[email protected]>

Closes #1266 from ueshin/issues/SPARK-2327 and squashes the following commits:

3ace83a [Takuya UESHIN] Add withNullability to Attribute and use it to change nullabilities.
df1ae53 [Takuya UESHIN] Modify nullabilize to leave attribute if not resolved.
799ce56 [Takuya UESHIN] Add nullabilization to Generate of SparkPlan.
a0fc9bc [Takuya UESHIN] Fix scalastyle errors.
0e31e37 [Takuya UESHIN] Fix Aggregate resultAttribute nullabilities.
09532ec [Takuya UESHIN] Fix Generate output nullabilities.
f20f196 [Takuya UESHIN] Fix Join output nullabilities.
Author: Neville Li <[email protected]>

Closes #1319 from nevillelyh/gh/SPARK-1977 and squashes the following commits:

1f0a355 [Neville Li] [SPARK-1977][MLLIB] register mutable BitSet in MovieLenseALS
…le alias is converted to lower case when we create Subquery

Reported by http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
After we get the table from the catalog, because the table has an alias, we will temporarily insert a Subquery. Then, we convert the table alias to lower case no matter if the parser is case sensitive or not.
To see the issue ...
```
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Person(name: String, age: Int)

val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

sqlContext.sql("select PEOPLE.name from people PEOPLE")
```
The plan is ...
```
== Query Plan ==
Project ['PEOPLE.name]
 ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:176
```
You can find that `PEOPLE.name` is not resolved.

This PR introduces three changes.
1.  If a table has an alias, the catalog will not lowercase the alias. If a lowercase alias is needed, the analyzer will do the work.
2.  A catalog has a new val caseSensitive that indicates if this catalog is case sensitive or not. For example, a SimpleCatalog is case sensitive, but
3.  Corresponding unit tests.
With this PR, case sensitivity of database names and table names is handled by the catalog. Case sensitivity of other identifiers are handled by the analyzer.

JIRA: https://issues.apache.org/jira/browse/SPARK-2339

Author: Yin Huai <[email protected]>

Closes #1317 from yhuai/SPARK-2339 and squashes the following commits:

12d8006 [Yin Huai] Handling case sensitivity correctly. This patch introduces three changes. 1. If a table has an alias, the catalog will not lowercase the alias. If a lowercase alias is needed, the analyzer will do the work. 2. A catalog has a new val caseSensitive that indicates if this catalog is case sensitive or not. For example, a SimpleCatalog is case sensitive, but 3. Corresponding unit tests. With this patch, case sensitivity of database names and table names is handled by the catalog. Case sensitivity of other identifiers is handled by the analyzer.
When execute `saveAsParquetFile` with non-primitive type, `RowWriteSupport` uses wrong type `Int` for `ByteType` and `ShortType`.

Author: Takuya UESHIN <[email protected]>

Closes #1315 from ueshin/issues/SPARK-2386 and squashes the following commits:

20d89ec [Takuya UESHIN] Use None instead of null.
bd88741 [Takuya UESHIN] Add a test.
323d1d2 [Takuya UESHIN] Modify RowWriteSupport to use the exact types to cast.
…s correctly for a field inside an array of structs

For example, for
```
{"array": [{"field":214748364700}, {"field":1}]}
```
the type of field is resolved as IntType. While, for
```
{"array": [{"field":1}, {"field":214748364700}]}
```
the type of field is resolved as LongType.

JIRA: https://issues.apache.org/jira/browse/SPARK-2375

Author: Yin Huai <[email protected]>

Closes #1308 from yhuai/SPARK-2375 and squashes the following commits:

3e2e312 [Yin Huai] Update unit test.
1b2ff9f [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2375
10794eb [Yin Huai] Correctly resolve the type of a field inside an array of structs.
…ises java.lang.IllegalArgumentException

JIRA: https://issues.apache.org/jira/browse/SPARK-2376

Author: Yin Huai <[email protected]>

Closes #1320 from yhuai/SPARK-2376 and squashes the following commits:

0107417 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2376
480803d [Yin Huai] Correctly handling JSON arrays in PySpark.
Hi all,
I want to submit a basic operator Intersect
For example , in sql case
select * from table1
intersect
select * from table2
So ,i want use this operator support this function in Spark SQL
This operator will return the  the intersection of SparkPlan child table RDD .
JIRA:https://issues.apache.org/jira/browse/SPARK-2235

Author: Yanjie Gao <[email protected]>
Author: YanjieGao <[email protected]>

Closes #1150 from YanjieGao/patch-5 and squashes the following commits:

4629afe [YanjieGao] reformat the code
bdc2ac0 [YanjieGao] reformat the code as Michael's suggestion
3b29ad6 [YanjieGao] Merge remote branch 'upstream/master' into patch-5
1cfbfe6 [YanjieGao] refomat some files
ea78f33 [YanjieGao] resolve conflict and add annotation on basicOperator and remove HiveQl
0c7cca5 [YanjieGao] modify format problem
a802ca8 [YanjieGao] Merge remote branch 'upstream/master' into patch-5
5e374c7 [YanjieGao] resolve conflict in SparkStrategies and basicOperator
f7961f6 [Yanjie Gao] update the line less than
bdc4a05 [Yanjie Gao] Update basicOperators.scala
0b49837 [Yanjie Gao] delete the annotation
f1288b4 [Yanjie Gao] delete annotation
e2b64be [Yanjie Gao] Update basicOperators.scala
4dd453e [Yanjie Gao] Update SQLQuerySuite.scala
790765d [Yanjie Gao] Update SparkStrategies.scala
ac73e60 [Yanjie Gao] Update basicOperators.scala
d4ac5e5 [Yanjie Gao] Update HiveQl.scala
61e88e7 [Yanjie Gao] Update SqlParser.scala
469f099 [Yanjie Gao] Update basicOperators.scala
e5bff61 [Yanjie Gao] Spark SQL basicOperator add Intersect operator
Made sure that readers know the random number generator seed argument, within the 'takeSample' method, is optional.

Author: Rishi Verma <[email protected]>

Closes #1324 from riverma/patch-1 and squashes the following commits:

4699676 [Rishi Verma] Updated programming-guide.md
watermen and others added 22 commits July 28, 2014 21:39
Author: Yadong Qi <[email protected]>

Closes #1629 from watermen/bug-fix2 and squashes the following commits:

59b7237 [Yadong Qi] Update HiveQl.scala
During rdd.take(n), JVM will close the socket if it had got enough data, the Python worker should keep silent in this case.

In the same time, the worker should not print the trackback into stderr if it send the traceback to JVM successfully.

Author: Davies Liu <[email protected]>

Closes #1625 from davies/error and squashes the following commits:

4fbcc6d [Davies Liu] disable log4j during testing when exception is expected.
cc14202 [Davies Liu] keep silent in worker if JVM close the socket
fix the problem with pickle operator.itemgetter with multiple index.

Author: Davies Liu <[email protected]>

Closes #1627 from davies/itemgetter and squashes the following commits:

aabd7fa [Davies Liu] fix pickle itemgetter with cloudpickle
The pull request includes two changes:

1. Removes SortOrder introduced by SPARK-2125. The key ordering already includes the SortOrder information since an Ordering can be reverse. This is similar to Java's Comparator interface. Rarely does an API accept both a Comparator as well as a SortOrder.

2. Replaces the sortWith call in HashShuffleReader with an in-place quick sort.

Author: Reynold Xin <[email protected]>

Closes #1631 from rxin/sortOrder and squashes the following commits:

c9d37e1 [Reynold Xin] [SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort.
In `reduce` and `aggregate`, the driver node spends linear time on the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big.

SPARK-1485 (#506) tracks the progress of implementing AllReduce on Spark. I did several implementations including butterfly, reduce + broadcast, and treeReduce + broadcast. treeReduce + BT broadcast seems to be right way to go for Spark. Using binary tree may introduce some overhead in communication, because the driver still need to coordinate on data shuffling. In my experiments, n -> sqrt(n) -> 1 gives the best performance in general, which is why I set "depth = 2" in MLlib algorithms. But it certainly needs more testing.

I left `treeReduce` and `treeAggregate` public for easy testing. Some numbers from a test on 32-node m3.2xlarge cluster.

code:

~~~
import breeze.linalg._
import org.apache.log4j._

Logger.getRootLogger.setLevel(Level.OFF)

for (n <- Seq(1, 10, 100, 1000, 10000, 100000, 1000000)) {
  val vv = sc.parallelize(0 until 1024, 1024).map(i => DenseVector.zeros[Double](n))
  var start = System.nanoTime(); vv.treeReduce(_ + _, 2); println((System.nanoTime() - start) / 1e9)
  start = System.nanoTime(); vv.reduce(_ + _); println((System.nanoTime() - start) / 1e9)
}
~~~

out:

| n | treeReduce(,2) | reduce |
|---|---------------------|-----------|
| 10 | 0.215538731 | 0.204206899 |
| 100 | 0.278405907 | 0.205732582 |
| 1000 | 0.208972182 | 0.214298272 |
| 10000 | 0.194792071 | 0.349353687 |
| 100000 | 0.347683285 | 6.086671892 |
| 1000000 | 2.589350682 | 66.572906702 |

CC: @pwendell

This is clearly more scalable than the default implementation. My question is whether we should use this implementation in `reduce` and `aggregate` or put them as separate methods. The concern is that users may use `reduce` and `aggregate` as collect, where having multiple stages doesn't reduce the data size. However, in this case, `collect` is more appropriate.

Author: Xiangrui Meng <[email protected]>

Closes #1110 from mengxr/tree and squashes the following commits:

c6cd267 [Xiangrui Meng] make depth default to 2
b04b96a [Xiangrui Meng] address comments
9bcc5d3 [Xiangrui Meng] add depth for readability
7495681 [Xiangrui Meng] fix compile error
142a857 [Xiangrui Meng] merge master
d58a087 [Xiangrui Meng] move treeReduce and treeAggregate to mllib
8a2a59c [Xiangrui Meng] Merge branch 'master' into tree
be6a88a [Xiangrui Meng] use treeAggregate in mllib
0f94490 [Xiangrui Meng] add docs
eb71c33 [Xiangrui Meng] add treeReduce
fe42a5e [Xiangrui Meng] add treeAggregate
Author: Aaron Staple <[email protected]>

Closes #1630 from staple/minor and squashes the following commits:

6f295a2 [Aaron Staple] Fix typos in comment about ExprId.
8566467 [Aaron Staple] Fix off by one column indentation in SqlParser.
… the current pu...

...sh model

Currently Spark uses Flume's internal Avro Protocol to ingest data from Flume. If the executor running the
receiver fails, it currently has to be restarted on the same node to be able to receive data.

This commit adds a new Sink which can be deployed to a Flume agent. This sink can be polled by a new
DStream that is also included in this commit. This model ensures that data can be pulled into Spark from
Flume even if the receiver is restarted on a new node. This also allows the receiver to receive data on
multiple threads for better performance.

Author: Hari Shreedharan <[email protected]>
Author: Hari Shreedharan <[email protected]>
Author: Tathagata Das <[email protected]>
Author: harishreedharan <[email protected]>

Closes #807 from harishreedharan/master and squashes the following commits:

e7f70a3 [Hari Shreedharan] Merge remote-tracking branch 'asf-git/master'
96cfb6f [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
e48d785 [Hari Shreedharan] Documenting flume-sink being ignored for Mima checks.
5f212ce [Hari Shreedharan] Ignore Spark Sink from mima.
981bf62 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
7a1bc6e [Hari Shreedharan] Fix SparkBuild.scala
a082eb3 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
1f47364 [Hari Shreedharan] Minor fixes.
73d6f6d [Hari Shreedharan] Cleaned up tests a bit. Added some docs in multiple places.
65b76b4 [Hari Shreedharan] Fixing the unit test.
e59cc20 [Hari Shreedharan] Use SparkFlumeEvent instead of the new type. Also, Flume Polling Receiver now uses the store(ArrayBuffer) method.
f3c99d1 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
3572180 [Hari Shreedharan] Adding a license header, making Jenkins happy.
799509f [Hari Shreedharan] Fix a compile issue.
3c5194c [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
d248d22 [harishreedharan] Merge pull request #1 from tdas/flume-polling
10b6214 [Tathagata Das] Changed public API, changed sink package, and added java unit test to make sure Java API is callable from Java.
1edc806 [Hari Shreedharan] SPARK-1729. Update logging in Spark Sink.
8c00289 [Hari Shreedharan] More debug messages
393bd94 [Hari Shreedharan] SPARK-1729. Use LinkedBlockingQueue instead of ArrayBuffer to keep track of connections.
120e2a1 [Hari Shreedharan] SPARK-1729. Some test changes and changes to utils classes.
9fd0da7 [Hari Shreedharan] SPARK-1729. Use foreach instead of map for all Options.
8136aa6 [Hari Shreedharan] Adding TransactionProcessor to map on returning batch of data
86aa274 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
205034d [Hari Shreedharan] Merging master in
4b0c7fc [Hari Shreedharan] FLUME-1729. New Flume-Spark integration.
bda01fc [Hari Shreedharan] FLUME-1729. Flume-Spark integration.
0d69604 [Hari Shreedharan] FLUME-1729. Better Flume-Spark integration.
3c23c18 [Hari Shreedharan] SPARK-1729. New Spark-Flume integration.
70bcc2a [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
d6fa3aa [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
e7da512 [Hari Shreedharan] SPARK-1729. Fixing import order
9741683 [Hari Shreedharan] SPARK-1729. Fixes based on review.
c604a3c [Hari Shreedharan] SPARK-1729. Optimize imports.
0f10788 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
87775aa [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
8df37e4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
03d6c1c [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
08176ad [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
d24d9d4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
6d6776a [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
1. there's no `hook_context.q` but a `hook_context_cs.q` in query folder
2. there's no `compute_stats_table.q` in query folder
3. there's no `having1.q` in query folder
4. `udf_E` and `udf_PI` appear twice in white list

Author: Daoyuan <[email protected]>

Closes #1634 from adrian-wang/testcases and squashes the following commits:

d7482ce [Daoyuan] change some test lists
…es key twice

JIRA: https://issues.apache.org/jira/browse/SPARK-2730

Author: Yin Huai <[email protected]>

Closes #1637 from yhuai/SPARK-2730 and squashes the following commits:

1a9f24e [Yin Huai] Remove unnecessary key evaluation.
Datetime and time in Python will be converted into java.util.Calendar after serialization, it will be converted into java.sql.Timestamp during inferSchema().

In javaToPython(), Timestamp will be converted into Calendar, then be converted into datetime in Python after pickling.

Author: Davies Liu <[email protected]>

Closes #1601 from davies/date and squashes the following commits:

f0599b0 [Davies Liu] remove tests for sets and tuple in sql, fix list of list
c9d607a [Davies Liu] convert datetype for runtime
709d40d [Davies Liu] remove brackets
96db384 [Davies Liu] support datetime type for SchemaRDD
…exact sample size

Implemented stratified sampling that guarantees exact sample size using ScaRSR with two passes over the RDD for sampling without replacement and three passes for sampling with replacement.

Author: Doris Xin <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes #1025 from dorx/stratified and squashes the following commits:

245439e [Doris Xin] moved minSamplingRate to getUpperBound
eaf5771 [Doris Xin] bug fixes.
17a381b [Doris Xin] fixed a merge issue and a failed unit
ea7d27f [Doris Xin] merge master
b223529 [Xiangrui Meng] use approx bounds for poisson fix poisson mean for waitlisting add unit tests for Java
b3013a4 [Xiangrui Meng] move math3 back to test scope
eecee5f [Doris Xin] Merge branch 'master' into stratified
f4c21f3 [Doris Xin] Reviewer comments
a10e68d [Doris Xin] style fix
a2bf756 [Doris Xin] Merge branch 'master' into stratified
680b677 [Doris Xin] use mapPartitionWithIndex instead
9884a9f [Doris Xin] style fix
bbfb8c9 [Doris Xin] Merge branch 'master' into stratified
ee9d260 [Doris Xin] addressed reviewer comments
6b5b10b [Doris Xin] Merge branch 'master' into stratified
254e03c [Doris Xin] minor fixes and Java API.
4ad516b [Doris Xin] remove unused imports from PairRDDFunctions
bd9dc6e [Doris Xin] unit bug and style violation fixed
1fe1cff [Doris Xin] Changed fractionByKey to a map to enable arg check
944a10c [Doris Xin] [SPARK-2145] Add lower bound on sampling rate
0214a76 [Doris Xin] cleanUp
90d94c0 [Doris Xin] merge master
9e74ab5 [Doris Xin] Separated out most of the logic in sampleByKey
7327611 [Doris Xin] merge master
50581fc [Doris Xin] added a TODO for logging in python
46f6c8c [Doris Xin] fixed the NPE caused by closures being cleaned before being passed into the aggregate function
7e1a481 [Doris Xin] changed the permission on SamplingUtil
1d413ce [Doris Xin] fixed checkstyle issues
9ee94ee [Doris Xin] [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
… logical plans & sample usage.

The idea is that every Catalyst logical plan gets hold of a Statistics class, the usage of which provides useful estimations on various statistics. See the implementations of `MetastoreRelation`.

This patch also includes several usages of the estimation interface in the planner. For instance, we now use physical table sizes from the estimate interface to convert an equi-join to a broadcast join (when doing so is beneficial, as determined by a size threshold).

Finally, there are a couple minor accompanying changes including:
- Remove the not-in-use `BaseRelation`.
- Make SparkLogicalPlan take a `SQLContext` in the second param list.

Author: Zongheng Yang <[email protected]>

Closes #1238 from concretevitamin/estimates and squashes the following commits:

329071d [Zongheng Yang] Address review comments; turn config name from string to field in SQLConf.
8663e84 [Zongheng Yang] Use BigInt for stat; for logical leaves, by default throw an exception.
2f2fb89 [Zongheng Yang] Fix statistics for SparkLogicalPlan.
9951305 [Zongheng Yang] Remove childrenStats.
16fc60a [Zongheng Yang] Avoid calling statistics on plans if auto join conversion is disabled.
8bd2816 [Zongheng Yang] Add a note on performance of statistics.
6e594b8 [Zongheng Yang] Get size info from metastore for MetastoreRelation.
01b7a3e [Zongheng Yang] Update scaladoc for a field and move it to @param section.
549061c [Zongheng Yang] Remove numTuples in Statistics for now.
729a8e2 [Zongheng Yang] Update docs to be more explicit.
573e644 [Zongheng Yang] Remove singleton SQLConf and move back `settings` to the trait.
2d99eb5 [Zongheng Yang] {Cleanup, use synchronized in, enrich} StatisticsSuite.
ca5b825 [Zongheng Yang] Inject SQLContext into SparkLogicalPlan, removing SQLConf mixin from it.
43d38a6 [Zongheng Yang] Revert optimization for BroadcastNestedLoopJoin (this fixes tests).
0ef9e5b [Zongheng Yang] Use multiplication instead of sum for default estimates.
4ef0d26 [Zongheng Yang] Make Statistics a case class.
3ba8f3e [Zongheng Yang] Add comment.
e5bcf5b [Zongheng Yang] Fix optimization conditions & update scala docs to explain.
7d9216a [Zongheng Yang] Apply estimation to planning ShuffleHashJoin & BroadcastNestedLoopJoin.
73cde01 [Zongheng Yang] Move SQLConf back. Assign default sizeInBytes to SparkLogicalPlan.
73412be [Zongheng Yang] Move SQLConf to Catalyst & add default val for sizeInBytes.
7a60ab7 [Zongheng Yang] s/Estimates/Statistics, s/cardinality/numTuples.
de3ae13 [Zongheng Yang] Add parquetAfter() properly in test.
dcff9bd [Zongheng Yang] Cleanups.
84301a4 [Zongheng Yang] Refactors.
5bf5586 [Zongheng Yang] Typo.
56a8e6e [Zongheng Yang] Prototype impl of estimations for Catalyst logical plans.
This commit exists to close the following pull requests on Github:

Closes #740 (close requested by 'rxin')
Closes #647 (close requested by 'rxin')
Closes #1383 (close requested by 'rxin')
Closes #1485 (close requested by 'pwendell')
Closes #693 (close requested by 'rxin')
Closes #478 (close requested by 'JoshRosen')
For queries like `... HAVING COUNT(*) > 9` the expression is always resolved since it contains no attributes.  This was causing us to avoid doing the Having clause aggregation rewrite.

Author: Michael Armbrust <[email protected]>

Closes #1640 from marmbrus/havingNoRef and squashes the following commits:

92d3901 [Michael Armbrust] Don't check resolved for having filters.
Author: Michael Armbrust <[email protected]>

Closes #1638 from marmbrus/cachedConfig and squashes the following commits:

2362082 [Michael Armbrust] Use SQLConf to configure in-memory columnar caching
Author: Josh Rosen <[email protected]>

Closes #1626 from JoshRosen/SPARK-2305 and squashes the following commits:

03fb283 [Josh Rosen] Update Py4J to version 0.8.2.1.
Adds a new method for evaluating expressions using code that is generated though Scala reflection.  This functionality is configured by the SQLConf option `spark.sql.codegen` and is currently turned off by default.

Evaluation can be done in several specialized ways:
 - *Projection* - Given an input row, produce a new row from a set of expressions that define each column in terms of the input row.  This can either produce a new Row object or perform the projection in-place on an existing Row (MutableProjection).
 - *Ordering* - Compares two rows based on a list of `SortOrder` expressions
 - *Condition* - Returns `true` or `false` given an input row.

For each of the above operations there is both a Generated and Interpreted version.  When generation for a given expression type is undefined, the code generator falls back on calling the `eval` function of the expression class.  Even without custom code, there is still a potential speed up, as loops are unrolled and code can still be inlined by JIT.

This PR also contains a new type of Aggregation operator, `GeneratedAggregate`, that performs aggregation by using generated `Projection` code.  Currently the required expression rewriting only works for simple aggregations like `SUM` and `COUNT`.  This functionality will be extended in a future PR.

This PR also performs several clean ups that simplified the implementation:
 - The notion of `Binding` all expressions in a tree automatically before query execution has been removed.  Instead it is the responsibly of an operator to provide the input schema when creating one of the specialized evaluators defined above.  In cases when the standard eval method is going to be called, binding can still be done manually using `BindReferences`.  There are a few reasons for this change:  First, there were many operators where it just didn't work before.  For example, operators with more than one child, and operators like aggregation that do significant rewriting of the expression. Second, the semantics of equality with `BoundReferences` are broken.  Specifically, we have had a few bugs where partitioning breaks because of the binding.
 - A copy of the current `SQLContext` is automatically propagated to all `SparkPlan` nodes by the query planner.  Before this was done ad-hoc for the nodes that needed this.  However, this required a lot of boilerplate as one had to always remember to make it `transient` and also had to modify the `otherCopyArgs`.

Author: Michael Armbrust <[email protected]>

Closes #993 from marmbrus/newCodeGen and squashes the following commits:

96ef82c [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
f34122d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into newCodeGen
67b1c48 [Michael Armbrust] Use conf variable in SQLConf object
4bdc42c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
41a40c9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
de22aac [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
fed3634 [Michael Armbrust] Inspectors are not serializable.
ef8d42b [Michael Armbrust] comments
533fdfd [Michael Armbrust] More logging of expression rewriting for GeneratedAggregate.
3cd773e [Michael Armbrust] Allow codegen for Generate.
64b2ee1 [Michael Armbrust] Implement copy
3587460 [Michael Armbrust] Drop unused string builder function.
9cce346 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
1a61293 [Michael Armbrust] Address review comments.
0672e8a [Michael Armbrust] Address comments.
1ec2d6e [Michael Armbrust] Address comments
033abc6 [Michael Armbrust] off by default
4771fab [Michael Armbrust] Docs, more test coverage.
d30fee2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
d2ad5c5 [Michael Armbrust] Refactor putting SQLContext into SparkPlan. Fix ordering, other test cases.
be2cd6b [Michael Armbrust] WIP: Remove old method for reference binding, more work on configuration.
bc88ecd [Michael Armbrust] Style
6cc97ca [Michael Armbrust] Merge remote-tracking branch 'origin/master' into newCodeGen
4220f1e [Michael Armbrust] Better config, docs, etc.
ca6cc6b [Michael Armbrust] WIP
9d67d85 [Michael Armbrust] Fix hive planner
fc522d5 [Michael Armbrust] Hook generated aggregation in to the planner.
e742640 [Michael Armbrust] Remove unneeded changes and code.
675e679 [Michael Armbrust] Upgrade paradise.
0093376 [Michael Armbrust] Comment / indenting cleanup.
d81f998 [Michael Armbrust] include schema for binding.
0e889e8 [Michael Armbrust] Use typeOf instead tq
f623ffd [Michael Armbrust] Quiet logging from test suite.
efad14f [Michael Armbrust] Remove some half finished functions.
92e74a4 [Michael Armbrust] add overrides
a2b5408 [Michael Armbrust] WIP: Code generation with scala reflection.
…nced

As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort).

`RangePartitioner` should go through data only once, collecting samples from input partitions as well as counting. If the data is balanced, this should give us a good sketch. If we see big partitions, we re-sample from them in order to collect enough items.

The downside is that we need to collect more from each partition in the first pass. An alternative solution is caching the intermediate result and decide whether to fetch the data after.

Author: Xiangrui Meng <[email protected]>
Author: Reynold Xin <[email protected]>

Closes #1562 from mengxr/range-partitioner and squashes the following commits:

6cc2551 [Xiangrui Meng] change foreach to for
eb39b08 [Xiangrui Meng] Merge branch 'master' into range-partitioner
eb95dd8 [Xiangrui Meng] separate sketching and determining bounds impl
c436d30 [Xiangrui Meng] fix binary metrics unit tests
db58a55 [Xiangrui Meng] add unit tests
a6e35d6 [Xiangrui Meng] minor update
60be09e [Xiangrui Meng] remove importance sampler
9ee9992 [Xiangrui Meng] update range partitioner to run only one job on roughly balanced data
cc12f47 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part
06ac2ec [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into range-part
17bcbf3 [Reynold Xin] Added seed.
badf20d [Reynold Xin] Renamed the method.
6940010 [Reynold Xin] Reservoir sampling implementation.
Author: Michael Armbrust <[email protected]>

Closes #1646 from marmbrus/nullDebug and squashes the following commits:

49050a8 [Michael Armbrust] Handle null values in debug()
The main thing was that spark configs were not propagated to the driver, and so applications that do not specify `master` or `appName` automatically failed. This PR fixes that and a couple of miscellaneous things that are related.

One thing that may or may not be an issue is that the jars must be available on the driver node. In `standalone-cluster` mode, this effectively means these jars must be available on all the worker machines, since the driver is launched on one of them. The semantics here are not the same as `yarn-cluster` mode,  where all the relevant jars are uploaded to a distributed cache automatically and shipped to the containers. This is probably not a concern, but still worth a mention.

Author: Andrew Or <[email protected]>

Closes #1538 from andrewor14/standalone-cluster and squashes the following commits:

8c11a0d [Andrew Or] Clean up imports / comments (minor)
2678d13 [Andrew Or] Handle extraJavaOpts properly
7660547 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-cluster
6f64a9b [Andrew Or] Revert changes in YARN
2f2908b [Andrew Or] Fix tests
ed01491 [Andrew Or] Don't go overboard with escaping
8e105e1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into standalone-cluster
b890949 [Andrew Or] Abstract usages of converting spark opts to java opts
79f63a3 [Andrew Or] Move sparkProps into javaOpts
78752f8 [Andrew Or] Fix tests
5a9c6c7 [Andrew Or] Fix line too long
c141a00 [Andrew Or] Don't display "unknown app" on driver log pages
d7e2728 [Andrew Or] Avoid deprecation warning in standalone Client
6ceb14f [Andrew Or] Allow relevant configs to propagate to standalone Driver
7f854bc [Andrew Or] Fix test
855256e [Andrew Or] Fix standalone-cluster mode
fd9da51 [Andrew Or] Formatting changes (minor)
The current PR contains the following changes:
* Expose `DataType`s in the sql package (internal details are private to sql).
* Users can create Rows.
* Introduce `applySchema` to create a `SchemaRDD` by applying a `schema: StructType` to an `RDD[Row]`.
* Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
* `ScalaReflection.typeOfObject` provides a way to infer the Catalyst data type based on an object. Also, we can compose `typeOfObject` with some custom logics to form a new function to infer the data type (for different use cases).
* `JsonRDD` has been refactored to use changes introduced by this PR.
* Add a field `containsNull` to `ArrayType`. So, we can explicitly mark if an `ArrayType` can contain null values. The default value of `containsNull` is `false`.

New APIs are introduced in the sql package object and SQLContext. You can find the scaladoc at
[sql package object](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.package) and [SQLContext](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext).

An example of using `applySchema` is shown below.
```scala
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val schema =
  StructType(
    StructField("name", StringType, false) ::
    StructField("age", IntegerType, true) :: Nil)

val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
val peopleSchemaRDD = sqlContext. applySchema(people, schema)
peopleSchemaRDD.printSchema
// root
// |-- name: string (nullable = false)
// |-- age: integer (nullable = true)

peopleSchemaRDD.registerAsTable("people")
sqlContext.sql("select name from people").collect.foreach(println)
```

I will add new contents to the SQL programming guide later.

JIRA: https://issues.apache.org/jira/browse/SPARK-2179

Author: Yin Huai <[email protected]>

Closes #1346 from yhuai/dataTypeAndSchema and squashes the following commits:

1d45977 [Yin Huai] Clean up.
a6e08b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
c712fbf [Yin Huai] Converts types of values based on defined schema.
4ceeb66 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
e5f8df5 [Yin Huai] Scaladoc.
122d1e7 [Yin Huai] Address comments.
03bfd95 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
2476ed0 [Yin Huai] Minor updates.
ab71f21 [Yin Huai] Format.
fc2bed1 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
bd40a33 [Yin Huai] Address comments.
991f860 [Yin Huai] Move "asJavaDataType" and "asScalaDataType" to DataTypeConversions.scala.
1cb35fe [Yin Huai] Add "valueContainsNull" to MapType.
3edb3ae [Yin Huai] Python doc.
692c0b9 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
1d93395 [Yin Huai] Python APIs.
246da96 [Yin Huai] Add java data type APIs to javadoc index.
1db9531 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
d48fc7b [Yin Huai] Minor updates.
33c4fec [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
b9f3071 [Yin Huai] Java API for applySchema.
1c9f33c [Yin Huai] Java APIs for DataTypes and Row.
624765c [Yin Huai] Tests for applySchema.
aa92e84 [Yin Huai] Update data type tests.
8da1a17 [Yin Huai] Add Row.fromSeq.
9c99bc0 [Yin Huai] Several minor updates.
1d9c13a [Yin Huai] Update applySchema API.
85e9b51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
e495e4e [Yin Huai] More comments.
42d47a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
c3f4a02 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
2e58dbd [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
b8b7db4 [Yin Huai] 1. Move sql package object and package-info to sql-core. 2. Minor updates on APIs. 3. Update scala doc.
68525a2 [Yin Huai] Update JSON unit test.
3209108 [Yin Huai] Add unit tests.
dcaf22f [Yin Huai] Add a field containsNull to ArrayType to indicate if an array can contain null values or not. If an ArrayType is constructed by "ArrayType(elementType)" (the existing constructor), the value of containsNull is false.
9168b83 [Yin Huai] Update comments.
fc649d7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
eca7d04 [Yin Huai] Add two apply methods which will be used to extract StructField(s) from a StructType.
949d6bb [Yin Huai] When creating a SchemaRDD for a JSON dataset, users can apply an existing schema.
7a6a7e5 [Yin Huai] Fix bug introduced by the change made on SQLContext.inferSchema.
43a45e1 [Yin Huai] Remove sql.util.package introduced in a previous commit.
0266761 [Yin Huai] Format
03eec4c [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
90460ac [Yin Huai] Infer the Catalyst data type from an object and cast a data value to the expected type.
3fa0df5 [Yin Huai] Provide easier ways to construct a StructType.
16be3e5 [Yin Huai] This commit contains three changes: * Expose `DataType`s in the sql package (internal details are private to sql). * Introduce `createSchemaRDD` to create a `SchemaRDD` from an `RDD` with a provided schema (represented by a `StructType`) and a provided function to construct `Row`, * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
Author: Koert Kuipers <[email protected]>

Closes #735 from koertkuipers/feat-kryo-max-buffersize and squashes the following commits:

15f6d81 [Koert Kuipers] change default for spark.kryoserializer.buffer.max.mb to 64mb and add some documentation
1bcc22c [Koert Kuipers] Merge branch 'master' into feat-kryo-max-buffersize
0c9f8eb [Koert Kuipers] make default for kryo max buffer size 16MB
143ec4d [Koert Kuipers] test resizable buffer in kryo Output
0732445 [Koert Kuipers] support setting maxCapacity to something different than capacity in kryo Output
kmader added a commit that referenced this pull request Jul 30, 2014
Updating to the lastest spark repository
@kmader kmader merged commit 81c5f12 into 4Quant:master Jul 30, 2014
kmader pushed a commit that referenced this pull request Jul 30, 2014
Currently, expression does not support the "constant null" well in constant folding.
e.g. Sum(a, 0) actually always produces Literal(0, NumericType) in runtime.

For example:
```
explain select isnull(key+null)  from src;
== Logical Plan ==
Project [HiveGenericUdf#isnull((key#30 + CAST(null, IntegerType))) AS c_0#28]
 MetastoreRelation default, src, None

== Optimized Logical Plan ==
Project [true AS c_0#28]
 MetastoreRelation default, src, None

== Physical Plan ==
Project [true AS c_0#28]
 HiveTableScan [], (MetastoreRelation default, src, None), None
```

I've create a new Optimization rule called NullPropagation for such kind of constant folding.

Author: Cheng Hao <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes apache#482 from chenghao-intel/optimize_constant_folding and squashes the following commits:

2f14b50 [Cheng Hao] Fix code style issues
68b9fad [Cheng Hao] Remove the Literal pattern matching for NullPropagation
29c8166 [Cheng Hao] Update the code for feedback of code review
50444cc [Cheng Hao] Remove the unnecessary null checking
80f9f18 [Cheng Hao] Update the UnitTest for aggregation constant folding
27ea3d7 [Cheng Hao] Fix Constant Folding Bugs & Add More Unittests
b28e03a [Cheng Hao] Merge pull request #1 from marmbrus/pr/482
9ccefdb [Michael Armbrust] Add tests for optimized expression evaluation.
543ef9d [Cheng Hao] fix code style issues
9cf0396 [Cheng Hao] update code according to the code review comment
536c005 [Cheng Hao] Add Exceptional case for constant folding
3c045c7 [Cheng Hao] Optimize the Constant Folding by adding more rules
2645d4f [Cheng Hao] Constant Folding(null propagation)
kmader pushed a commit that referenced this pull request Jul 30, 2014
…tFile()

https://issues.apache.org/jira/browse/SPARK-1757

The first test succeeds, but the second test fails with exception:

```
[info] - save and load case class RDD with Nones as parquet *** FAILED *** (14 milliseconds)
[info]   java.lang.RuntimeException: Unsupported datatype StructType(List())
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetRelation.scala:201)
[info]   at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$1.apply(ParquetRelation.scala:235)
[info]   at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$1.apply(ParquetRelation.scala:235)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
[info]   at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetRelation.scala:234)
[info]   at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetRelation.scala:267)
[info]   at org.apache.spark.sql.parquet.ParquetRelation$.createEmpty(ParquetRelation.scala:143)
[info]   at org.apache.spark.sql.parquet.ParquetRelation$.create(ParquetRelation.scala:122)
[info]   at org.apache.spark.sql.execution.SparkStrategies$ParquetOperations$.apply(SparkStrategies.scala:139)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
[info]   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
[info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
[info]   at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:264)
[info]   at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:264)
[info]   at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:265)
[info]   at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:265)
[info]   at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:268)
[info]   at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:268)
[info]   at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:66)
[info]   at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:98)
```

Author: Andrew Ash <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes apache#690 from ash211/rdd-parquet-save and squashes the following commits:

747a0b9 [Andrew Ash] Merge pull request #1 from marmbrus/pr/690
54bd00e [Michael Armbrust] Need to put Option first since Option <: Seq.
8f3f281 [Andrew Ash] SPARK-1757 Add failing test for saving SparkSQL Schemas with Option[?] fields as parquet
kmader pushed a commit that referenced this pull request Jul 30, 2014
A straightforward implementation of LPA algorithm for detecting graph communities using the Pregel framework.  Amongst the growing literature on community detection algorithms in networks, LPA is perhaps the most elementary, and despite its flaws it remains a nice and simple approach.

Author: Ankur Dave <[email protected]>
Author: haroldsultan <[email protected]>
Author: Harold Sultan <[email protected]>

Closes apache#905 from haroldsultan/master and squashes the following commits:

327aee0 [haroldsultan] Merge pull request apache#2 from ankurdave/label-propagation
227a4d0 [Ankur Dave] Untabify
0ac574c [haroldsultan] Merge pull request #1 from ankurdave/label-propagation
0e24303 [Ankur Dave] Add LabelPropagationSuite
84aa061 [Ankur Dave] LabelPropagation: Fix compile errors and style; rename from LPA
9830342 [Harold Sultan] initial version of LPA
kmader pushed a commit that referenced this pull request Jul 30, 2014
Just submit another solution for apache#395

Author: Daoyuan <[email protected]>
Author: Michael Armbrust <[email protected]>
Author: Daoyuan Wang <[email protected]>

Closes apache#837 from adrian-wang/left-semi-join-support and squashes the following commits:

d39cd12 [Daoyuan Wang] Merge pull request #1 from marmbrus/pr/837
6713c09 [Michael Armbrust] Better debugging for failed query tests.
035b73e [Michael Armbrust] Add test for left semi that can't be done with a hash join.
5ec6fa4 [Michael Armbrust] Add left semi to SQL Parser.
4c726e5 [Daoyuan] improvement according to Michael
8d4a121 [Daoyuan] add golden files for leftsemijoin
83a3c8a [Daoyuan] scala style fix
14cff80 [Daoyuan] add support for left semi join
kmader pushed a commit that referenced this pull request Jul 30, 2014
This update resolves [SPARK-2065](https://issues.apache.org/jira/browse/SPARK-2065). It gives launched EC2 instances descriptive names by using instance tags. Launched instances now show up in the EC2 console with these names.

I used `format()` with named parameters, which I believe is the recommended practice for string formatting in Python, but which doesn’t seem to be used elsewhere in the script.

Author: Nicholas Chammas <[email protected]>
Author: nchammas <[email protected]>

Closes apache#1043 from nchammas/master and squashes the following commits:

69f6e22 [Nicholas Chammas] PEP8 fixes
2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
69da6cf [nchammas] Merge pull request #1 from apache/master
kmader pushed a commit that referenced this pull request Jul 30, 2014
Thanks goes to @marmbrus for his implementation.

Author: Michael Armbrust <[email protected]>
Author: Zongheng Yang <[email protected]>

Closes apache#1074 from concretevitamin/option-treenode and squashes the following commits:

ef27b85 [Zongheng Yang] Merge pull request #1 from marmbrus/pr/1074
73133c2 [Michael Armbrust] TreeNodes can't be inner classes.
ab78420 [Zongheng Yang] Add a test.
2ccb721 [Michael Armbrust] Add support for transformation of optional children.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.