
Conversation

@liumingning

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

tdas and others added 30 commits June 12, 2015 15:23
- Kinesis API updated
- Kafka version updated, and Python API for Direct Kafka added
- Added SQLContext.getOrCreate()
- Added information on how to get partitionId in foreachRDD

Author: Tathagata Das <[email protected]>

Closes #6781 from tdas/SPARK-7284 and squashes the following commits:

aac7be0 [Tathagata Das] Added information on how to get partition id
a66ec22 [Tathagata Das] Complete the line incomplete line,
a92ca39 [Tathagata Das] Updated streaming documentation

(cherry picked from commit e9471d3)
Signed-off-by: Tathagata Das <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #6786 from marmbrus/optionsParser and squashes the following commits:

e7d18ef [Michael Armbrust] add dots
99a3452 [Michael Armbrust] [SPARK-8329][SQL] Allow _ in DataSource options

(cherry picked from commit 4aed66f)
Signed-off-by: Reynold Xin <[email protected]>
This improves the Spark Streaming Guides by fixing broken links, rewording confusing sections, fixing typos, adding missing words, etc.

Author: Mike Dusenberry <[email protected]>

Closes #6801 from dusenberrymw/SPARK-8343_Improve_Spark_Streaming_Guides_MERGED and squashes the following commits:

6688090 [Mike Dusenberry] Improvements to the Spark Streaming Custom Receiver Guide, including slight rewording of confusing sections, and fixing typos & missing words.
436fbd8 [Mike Dusenberry] Bunch of improvements to the Spark Streaming Guide, including fixing broken links, slight rewording of confusing sections, fixing typos & missing words, etc.

(cherry picked from commit 35d1267)
Signed-off-by: Reynold Xin <[email protected]>
…ch space in UnsafeFixedWidthAggregationMap

UnsafeFixedWidthAggregationMap contains an off-by-factor-of-8 error when allocating row conversion scratch space: we take a size requirement, measured in bytes, then allocate a long array with that many elements. This means that we end up allocating 8x too much conversion space.

This patch fixes this by allocating a `byte[]` array instead.  This doesn't impose any new limitations on the maximum sizes of UnsafeRows, since UnsafeRowConverter already used integers when calculating the size requirements for rows.
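
For illustration only (names are made up, not the patched code), the difference between sizing the scratch space as longs vs. bytes:

```scala
// Hypothetical sketch: sizing a long[] by a byte count reserves 8x the space.
val requiredSizeInBytes = 1024

val asLongs = new Array[Long](requiredSizeInBytes) // 1024 longs = 8192 bytes of scratch
val asBytes = new Array[Byte](requiredSizeInBytes) // 1024 bytes, as intended
```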

Author: Josh Rosen <[email protected]>

Closes #6809 from JoshRosen/sql-bytes-vs-words-fix and squashes the following commits:

6520339 [Josh Rosen] Updates to reflect fact that UnsafeRow max size is constrained by max byte[] size

(cherry picked from commit ea7fd2f)
Signed-off-by: Josh Rosen <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #6811 from marmbrus/aliasExplodeStar and squashes the following commits:

fbd2065 [Michael Armbrust] more style
806a373 [Michael Armbrust] fix style
7cbb530 [Michael Armbrust] [SPARK-8358][SQL] Wait for child resolution when resolving generatorsa

(cherry picked from commit 9073a42)
Signed-off-by: Michael Armbrust <[email protected]>
Author: Peter Hoffmann <[email protected]>

Closes #6815 from hoffmann/patch-1 and squashes the following commits:

2abb6da [Peter Hoffmann] fix read/write mixup

(cherry picked from commit f3f2a43)
Signed-off-by: Reynold Xin <[email protected]>
This PR fixes the problem reported by Justin Yip in the thread 'NullPointerException with functions.rand()'

Tested using spark-shell and verified that the following works:
sqlContext.createDataFrame(Seq((1,2), (3, 100))).withColumn("index", rand(30)).show()

Author: tedyu <[email protected]>

Closes #6793 from tedyu/master and squashes the following commits:

62fd97b [tedyu] Create RandomSuite
750f92c [tedyu] Add test for Rand() with seed
a1d66c5 [tedyu] Fix NullPointerException with functions.rand()

(cherry picked from commit 1a62d61)
Signed-off-by: Reynold Xin <[email protected]>
…al` since a data loss bug.

The bug was reported in JIRA [SPARK-8367](https://issues.apache.org/jira/browse/SPARK-8367).
The resolution is to limit the configuration `spark.streaming.blockInterval` to a positive number.
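
As a rough sketch of the check this describes (the commit log mentions using `require`; the helper name here is made up):

```scala
// Illustrative only: reject a non-positive block interval up front instead of
// letting it silently break block generation.
def checkedBlockIntervalMs(configuredMs: Long): Long = {
  require(configuredMs > 0,
    s"'spark.streaming.blockInterval' should be a positive value, but was $configuredMs ms")
  configuredMs
}
```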

Author: huangzhaowei <[email protected]>
Author: huangzhaowei <[email protected]>

Closes #6818 from SaintBacchus/SPARK-8367 and squashes the following commits:

c9d1927 [huangzhaowei] Update BlockGenerator.scala
bd3f71a [huangzhaowei] Use requre instead of if
3d17796 [huangzhaowei] [SPARK_8367][Streaming]Add a limit for 'spark.streaming.blockInterval' since a data loss bug.

(cherry picked from commit ccf010f)
Signed-off-by: Sean Owen <[email protected]>
Typo in thriftserver section

Author: Moussa Taifi <[email protected]>

Closes #6847 from moutai/patch-1 and squashes the following commits:

1bd29df [Moussa Taifi] Update sql-programming-guide.md

(cherry picked from commit dc455b8)
Signed-off-by: Sean Owen <[email protected]>
[SQL][DOC] I found it a bit confusing when I came across it for the first time in the docs

Author: Radek Ostrowski <[email protected]>
Author: radek <[email protected]>

Closes #6332 from radek1st/master and squashes the following commits:

dae3347 [Radek Ostrowski] fixed typo
c76bb3a [radek] improved a comment

(cherry picked from commit 4bd10fd)
Signed-off-by: Sean Owen <[email protected]>
If you ran "clean" at the top-level sbt project, the temp dir would
go away, so running "test" without restarting sbt would fail. This
fixes that by making sure the temp dir exists before running tests.

Author: Marcelo Vanzin <[email protected]>

Closes #6805 from vanzin/SPARK-8126-fix and squashes the following commits:

12d7768 [Marcelo Vanzin] [SPARK-8126] [build] Make sure temp dir exists when running tests.

(cherry picked from commit cebf241)
Signed-off-by: Sean Owen <[email protected]>
… and regression

Check and update the MLlib Python classification and regression docs to be as complete as the Scala docs.

Author: Yanbo Liang <[email protected]>

Closes #6460 from yanboliang/spark-7916 and squashes the following commits:

f8deda4 [Yanbo Liang] trigger jenkins
6dc4d99 [Yanbo Liang] address comments
ce2a43e [Yanbo Liang] truncate too long line and remove extra sparse
3eaf6ad [Yanbo Liang] MLlib Python doc parity check for classification and regression

(cherry picked from commit ca99875)
Signed-off-by: Joseph K. Bradley <[email protected]>
rxin this is the fix you requested for the break introduced by backporting #6793

Author: Punya Biswal <[email protected]>

Closes #6850 from punya/feature/fix-backport-break and squashes the following commits:

fdc3693 [Punya Biswal] Fix break introduced by backport
The problem occurs because the position mask `0xEFFFFFF` is incorrect: its 25th bit is zero, so when the capacity grows beyond 2^24, `OpenHashMap` calculates an incorrect index into the `_values` array.

I've also added a size check in `rehash()`, so that it fails instead of reporting invalid item indices.
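
A quick illustration of why a mask with a zero bit corrupts positions at or above 2^24 (the mask value is from the description above; the rest is just a demo):

```scala
val buggyMask = 0xEFFFFFF      // binary 0000 1110 1111 1111 1111 1111 1111 1111
val pos = 1 << 24              // 16777216, needs the 25th bit

println(pos & buggyMask)       // 0 -- the 25th bit is dropped, so the index collapses
println(pos & ((1 << 25) - 1)) // 16777216 -- preserved by a contiguous all-ones mask
```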

Author: Vyacheslav Baranov <[email protected]>

Closes #6763 from SlavikBaranov/SPARK-8309 and squashes the following commits:

8557445 [Vyacheslav Baranov] Resolved review comments
4d5b954 [Vyacheslav Baranov] Resolved review comments
eaf1e68 [Vyacheslav Baranov] Fixed failing test
f9284fd [Vyacheslav Baranov] Resolved review comments
3920656 [Vyacheslav Baranov] SPARK-8309: Support for more than 12M items in OpenHashMap

(cherry picked from commit c13da20)
Signed-off-by: Sean Owen <[email protected]>
start-slave.sh no longer takes a worker # param in 1.4+

Author: Sean Owen <[email protected]>

Closes #6855 from srowen/SPARK-8395 and squashes the following commits:

300278e [Sean Owen] start-slave.sh no longer takes a worker # param in 1.4+

(cherry picked from commit f005be0)
Signed-off-by: Andrew Or <[email protected]>
…ter mode

Now that PySpark on YARN with cluster mode is supported, let's update the doc.

Author: Kousuke Saruta <sarutakoss.nttdata.co.jp>

Closes #6040 from sarutak/update-doc-for-pyspark-on-yarn and squashes the following commits:

ad9f88c [Kousuke Saruta] Brushed up sentences
469fd2e [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into update-doc-for-pyspark-on-yarn
fcfdb92 [Kousuke Saruta] Updated doc for PySpark on YARN with cluster mode

Author: Punya Biswal <[email protected]>
Author: Kousuke Saruta <[email protected]>

Closes #6842 from punya/feature/SPARK-7515 and squashes the following commits:

0b83648 [Punya Biswal] Merge remote-tracking branch 'origin/branch-1.4' into feature/SPARK-7515
de025cd [Kousuke Saruta] [SPARK-7515] [DOC] Update documentation for PySpark on YARN with cluster mode
…rnalBlockStore is initialized

externalBlockStoreInitialized is never set to true, so the blocks stored in ExternalBlockStore cannot be removed.

Author: Mingfei <[email protected]>

Closes #6702 from shimingfei/SetTrue and squashes the following commits:

add61d8 [Mingfei] Set externalBlockStoreInitialized to be true, after ExternalBlockStore is initialized

(cherry picked from commit 7ad8c5d)
Signed-off-by: Andrew Or <[email protected]>
…on not started

The history server may show an incorrect App ID for an incomplete application like <App ID>.inprogress. This app info will never disappear even after the app is completed.
![incorrectappinfo](https://cloud.githubusercontent.com/assets/9278199/8156147/2a10fdbe-137d-11e5-9620-c5b61d93e3c1.png)

The cause of the issue is that the log path name is used as the app ID when the app ID cannot be obtained during replay.

Author: Carson Wang <[email protected]>

Closes #6827 from carsonwang/SPARK-8372 and squashes the following commits:

cdbb089 [Carson Wang] Fix code style
3e46b35 [Carson Wang] Update code style
90f5dde [Carson Wang] Add a unit test
d8c9cd0 [Carson Wang] Replaying events only return information when app is started

(cherry picked from commit 2837e06)
Signed-off-by: Andrew Or <[email protected]>
… calling sum on an empty RDD

This PR fixes the sum issue and also adds `emptyRDD` so that it's easy to create a test case.

Author: zsxwing <[email protected]>

Closes #6826 from zsxwing/python-emptyRDD and squashes the following commits:

b36993f [zsxwing] Update the return type to JavaRDD[T]
71df047 [zsxwing] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD

(cherry picked from commit 0fc4b96)
Signed-off-by: Andrew Or <[email protected]>
…the tests more reliable

KafkaStreamSuite, DirectKafkaStreamSuite, JavaKafkaStreamSuite and JavaDirectKafkaStreamSuite use non-thread-safe collections to collect data in one thread and check it in another thread, which may cause the tests to fail.

This PR changes them to thread-safe collections.

Note: I cannot reproduce the test failures in my environment. But at least, this PR should make the tests more reliable.

Author: zsxwing <[email protected]>

Closes #6852 from zsxwing/fix-KafkaStreamSuite and squashes the following commits:

d464211 [zsxwing] Use thread-safe collections to make the tests more reliable

(cherry picked from commit a06d9c8)
Signed-off-by: Tathagata Das <[email protected]>
…o the HiveConf inside executionHive.state.

https://issues.apache.org/jira/browse/SPARK-8306

I will try to add a test later.

marmbrus aarondav

Author: Yin Huai <[email protected]>

Closes #6758 from yhuai/SPARK-8306 and squashes the following commits:

1292346 [Yin Huai] [SPARK-8306] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state.

(cherry picked from commit 302556f)
Signed-off-by: Michael Armbrust <[email protected]>

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
```scala
def getAllNodes: Seq[RDDOperationNode] =
  { _childNodes ++ _childClusters.flatMap(_.childNodes) }
```

When `_childClusters` has many nodes, the process hangs. I think we can improve the efficiency here.

Author: xutingjun <[email protected]>

Closes #6839 from XuTingjun/DAGImprove and squashes the following commits:

53b03ea [xutingjun] change code to more concise and easier to read
f98728b [xutingjun] fix words: node -> nodes
f87c663 [xutingjun] put the filter inside
81f9fd2 [xutingjun] put the filter inside

(cherry picked from commit e2cdb05)
Signed-off-by: Andrew Or <[email protected]>
Dependencies of artifacts in the local ivy cache were not being resolved properly. The dependencies were not being picked up. Now they should be.

cc andrewor14

Author: Burak Yavuz <[email protected]>

Closes #6788 from brkyvz/local-ivy-fix and squashes the following commits:

2875bf4 [Burak Yavuz] fix temp dir bug
48cc648 [Burak Yavuz] improve deletion
a69e3e6 [Burak Yavuz] delete cache before test as well
0037197 [Burak Yavuz] fix merge conflicts
f60772c [Burak Yavuz] use different folder for m2 cache during testing
b6ef038 [Burak Yavuz] [SPARK-8095] Resolve dependencies of Spark Packages in local ivy cache

Conflicts:
	core/src/test/scala/org/apache/spark/deploy/SparkSubmitUtilsSuite.scala
The batch size during external sort grows up to a maximum of 10000, then shrinks down to zero, causing an infinite loop.
Given the assumption that items usually have similar sizes, we don't need to adjust the batch size after the first spill.

cc JoshRosen rxin angelini

Author: Davies Liu <[email protected]>

Closes #6714 from davies/batch_size and squashes the following commits:

b170dfb [Davies Liu] update test
b9be832 [Davies Liu] Merge branch 'batch_size' of github.com:davies/spark into batch_size
6ade745 [Davies Liu] update test
5c21777 [Davies Liu] Update shuffle.py
e746aec [Davies Liu] fix batch size during sort
…n headers

This patch uses [AnchorJS](https://bryanbraun.github.io/anchorjs/) to show deep anchor links when hovering over headers in the Spark documentation. For example:

![image](https://cloud.githubusercontent.com/assets/50748/8240800/1502f85c-15ba-11e5-819a-97b231370a39.png)

This makes it easier for users to link to specific sections of the documentation.

I also removed some dead JavaScript which isn't used in our current docs (it was introduced for the old AMPCamp training, but isn't used anymore).

Author: Josh Rosen <[email protected]>

Closes #6808 from JoshRosen/SPARK-8353 and squashes the following commits:

e59d8a7 [Josh Rosen] Suppress underline on hover
f518b6a [Josh Rosen] Turn on for all headers, since we use H1s in a bunch of places
a9fec01 [Josh Rosen] Add anchor links when hovering over headers; remove some dead JS code

(cherry picked from commit 44c931f)
Signed-off-by: Josh Rosen <[email protected]>
Commons Lang 3 has been added as one of the dependencies of Spark Flume Sink since #5703. This PR updates the doc for it.

Author: zsxwing <[email protected]>

Closes #6829 from zsxwing/flume-sink-dep and squashes the following commits:

f8617f0 [zsxwing] Add common lang3 to the Spark Flume Sink doc

(cherry picked from commit 24e5379)
Signed-off-by: Tathagata Das <[email protected]>
…l operators

This patch introduces `SparkPlanTest`, a base class for unit tests of SparkPlan physical operators.  This is analogous to Spark SQL's existing `QueryTest`, which does something similar for end-to-end tests with actual queries.

These helper methods provide nicer error output when tests fail and help developers to avoid writing lots of boilerplate in order to execute manually constructed physical plans.

Author: Josh Rosen <[email protected]>
Author: Josh Rosen <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #6885 from JoshRosen/spark-plan-test and squashes the following commits:

f8ce275 [Josh Rosen] Fix some IntelliJ inspections and delete some dead code
84214be [Josh Rosen] Add an extra column which isn't part of the sort
ae1896b [Josh Rosen] Provide implicits automatically
a80f9b0 [Josh Rosen] Merge pull request #4 from marmbrus/pr/6885
d9ab1e4 [Michael Armbrust] Add simple resolver
c60a44d [Josh Rosen] Manually bind references
996332a [Josh Rosen] Add types so that tests compile
a46144a [Josh Rosen] WIP

(cherry picked from commit 207a98c)
Signed-off-by: Michael Armbrust <[email protected]>
This fixes various minor documentation issues on the Spark SQL page

Author: Lars Francke <[email protected]>

Closes #6890 from lfrancke/SPARK-8462 and squashes the following commits:

dd7e302 [Lars Francke] Merge branch 'master' into SPARK-8462
34eff2c [Lars Francke] Minor documentation fixes

(cherry picked from commit 4ce3bab)
Signed-off-by: Josh Rosen <[email protected]>
…orrect count at Spark UI

tdas  zsxwing this is the new PR for Spark-8080

I have merged #6659

Also to mention, for MEMORY_ONLY settings, when a block cannot unrollSafely to memory because there is not enough space, BlockManager won't try to put the block and ReceivedBlockHandler will throw SparkException as it cannot find the block ID in PutResult. Thus the number of records in the block won't be counted if the block fails to unroll in memory, which is fine.

For MEMORY_AND_DISK settings, if BlockManager cannot unroll the block to memory, the block will still be written to disk. The same applies to the WAL-based store. So for those cases (storage level = memory + disk) the number of records will be counted even though the block could not be unrolled to memory.

Thus I added isFullyConsumed to the CountingIterator but have not used it, since it can never happen that a block is not fully consumed yet ReceivedBlockHandler still gets the block ID.

I have also added a few test cases to cover those block unrolling scenarios.

Author: Dibyendu Bhattacharya <[email protected]>
Author: U-PEROOT\UBHATD1 <[email protected]>

Closes #6707 from dibbhatt/master and squashes the following commits:

f6cb6b5 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
f37cfd8 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
5a8344a [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Count ByteBufferBlock as 1 count
fceac72 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
0153e7e [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Fixed comments given by @zsxwing
4c5931d [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
01e6dc8 [U-PEROOT\UBHATD1] A
…g ORC files

`Path.toUri.getPath` strips the scheme part of the output path (from `file:///foo` to `/foo`), which causes the ORC data source to write only to the file system configured in the Hadoop configuration. We should use `Path.toString` instead.
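
A small demo of the difference (assuming `hadoop-common` on the classpath; the exact `toString` rendering may vary slightly):

```scala
import org.apache.hadoop.fs.Path

val out = new Path("file:///tmp/orc-output")

println(out.toUri.getPath) // /tmp/orc-output       -- scheme stripped
println(out.toString)      // file:/tmp/orc-output  -- scheme preserved
```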

Author: Cheng Lian <[email protected]>

Closes #6892 from liancheng/spark-8458 and squashes the following commits:

87f8199 [Cheng Lian] Don't strip scheme of output path when writing ORC files

(cherry picked from commit a71cbbd)
Signed-off-by: Cheng Lian <[email protected]>
srowen and others added 22 commits October 27, 2015 23:08
…atrix returns incorrect answer in some cases

Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test.
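
For reference, a hedged sketch of the math involved, using Breeze; this is not the patched code, and the names and tolerance handling are assumptions:

```scala
import breeze.linalg.{DenseMatrix, DenseVector, diag, eigSym}

// Given the eigendecomposition sigma = U * diag(d) * U^T, the "root sigma inverse"
// is D^(-1/2) * U^T, treating near-zero eigenvalues as zero (pseudo-inverse).
def rootSigmaInv(sigma: DenseMatrix[Double], tol: Double): DenseMatrix[Double] = {
  val es = eigSym(sigma)
  val d: DenseVector[Double] = es.eigenvalues
  val u: DenseMatrix[Double] = es.eigenvectors
  val invSqrtD = d.map(v => if (v > tol) 1.0 / math.sqrt(v) else 0.0)
  diag(invSqrtD) * u.t
}
```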

Supersedes #9293

Author: Sean Owen <[email protected]>

Closes #9309 from srowen/SPARK-11302.2.

(cherry picked from commit 826e1e3)
Signed-off-by: Xiangrui Meng <[email protected]>
….4 backport)

This is a branch-1.4 backport of #9382, a fix for SPARK-11424.

Author: Josh Rosen <[email protected]>

Closes #9388 from JoshRosen/hadoop-decompressor-pooling-fix-branch-1.4.
…umeric types

This backports #9495 to branch-1.4 (in case anyone needs this).

Author: Yin Huai <[email protected]>

Closes #9498 from yhuai/OracleDialect-1.4.
…ceptions

Only print the error message to the console for Analysis Exceptions in sql-shell

Author: Dilip Biswal <[email protected]>

Closes #9375 from dilipbiswal/spark-11188-v142.
…omputes MSE

Author: Bharat Lal <[email protected]>

Closes #9560 from bharatl/SPARK-11581.

(cherry picked from commit 860ea0d)
Signed-off-by: Joseph K. Bradley <[email protected]>
The code of convertToCanonicalEdges ensures that srcIds are smaller than dstIds, but the Scaladoc suggested otherwise. Have fixed the same.

Author: Gaurav Kumar <[email protected]>

Closes #9666 from gauravkumar37/patch-1.

(cherry picked from commit df0e318)
Signed-off-by: Reynold Xin <[email protected]>
Update to Commons Collections 3.2.2 to avoid any potential remote code execution vulnerability

Author: Sean Owen <[email protected]>

Closes #9731 from srowen/SPARK-11652.

(cherry picked from commit 9631ca3)
Signed-off-by: Sean Owen <[email protected]>
It was multiplying with U instead of dividing by U

Author: Viveka Kulharia <[email protected]>

Closes #9771 from vivkul/patch-1.

(cherry picked from commit 1429e0a)
Signed-off-by: Sean Owen <[email protected]>
jira: https://issues.apache.org/jira/browse/SPARK-11813

I found the problem while training a large corpus. Avoiding serialization of vocab in Word2Vec has 2 benefits.
1. Performance improvement from less serialization.
2. A large increase in the capacity of Word2Vec.
Currently in the fit of Word2Vec, the closure mainly includes the serialization of Word2Vec and 2 global tables.
The main part of Word2Vec is the vocab of size: vocab * 40 * 2 * 4 = 320 vocab.
2 global tables: vocab * vectorSize * 8. If vectorSize = 20, that's 160 vocab.

Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab helps decrease the size of the closure serialization, especially when vectorSize is small, thus allowing a larger vocabulary.

Actually there's another possible fix: make local copies of the fields to avoid including Word2Vec in the closure. Let me know if that's preferred.
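
A generic sketch of that "local copy" alternative (hypothetical class and names, just to show the closure-capture pattern):

```scala
import org.apache.spark.rdd.RDD

// Copy the fields a task needs into local vals so the closure captures only
// those values, not `this` (and therefore not the large vocab).
class BigModel(val vectorSize: Int, val vocab: Array[String]) extends Serializable {
  def transform(data: RDD[Int]): RDD[Int] = {
    val localVectorSize = vectorSize // captured: a single Int
    // Referring to `vectorSize` directly inside the lambda would capture `this`,
    // pulling `vocab` into the serialized closure as well.
    data.map(x => x * localVectorSize)
  }
}
```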

Author: Yuhao Yang <[email protected]>

Closes #9803 from hhbyyh/w2vVocab.

(cherry picked from commit e391abd)
Signed-off-by: Xiangrui Meng <[email protected]>
…ceByKeyAndWindow

invFunc is optional and can be None. Instead of invFunc (the parameter), invReduceFunc (a local function) was checked for truthiness (that is, not None, in this context). A local function is never None, so the case of invFunc=None (a common one when inverse reduction is not defined) was treated incorrectly, resulting in loss of data.

In addition, the docstring used wrong parameter names; that is also fixed.

Author: David Tolpin <[email protected]>

Closes #9775 from dtolpin/master.

(cherry picked from commit 599a8c6)
Signed-off-by: Tathagata Das <[email protected]>
…Function and TransformFunctionSerializer

TransformFunction and TransformFunctionSerializer don't rethrow the exception, so when any exception happens, it just return None. This will cause some weird NPE and confuse people.

Author: Shixiong Zhu <[email protected]>

Closes #9847 from zsxwing/pyspark-streaming-exception.

(cherry picked from commit be7a2cf)
Signed-off-by: Tathagata Das <[email protected]>
…VM exits

deleting the temp dir like that

```

scala> import scala.collection.mutable
import scala.collection.mutable

scala> val a = mutable.Set(1,2,3,4,7,0,8,98,9)
a: scala.collection.mutable.Set[Int] = Set(0, 9, 1, 2, 3, 7, 4, 8, 98)

scala> a.foreach(x => {a.remove(x) })

scala> a.foreach(println(_))
98
```

You may not modify a collection while traversing or iterating over it. This cannot delete all elements of the collection.

Author: Zhongshuai Pei <[email protected]>

Closes #9951 from DoingDone9/Bug_RemainDir.

(cherry picked from commit 6b78157)
Signed-off-by: Reynold Xin <[email protected]>
…eadPool doesn't cache any task

In the previous codes, `newDaemonCachedThreadPool` uses `SynchronousQueue`, which is wrong. `SynchronousQueue` is a zero-capacity queue that cannot cache any task. This patch uses `LinkedBlockingQueue` to fix it, along with other fixes to make sure `newDaemonCachedThreadPool` uses at most `maxThreadNumber` threads, and after that, caches tasks in a `LinkedBlockingQueue`.
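
A sketch of the described behavior (the real helper also installs a daemon `ThreadFactory`, omitted here):

```scala
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// With core == max threads, an idle timeout, and a LinkedBlockingQueue, extra
// tasks wait in the queue. A SynchronousQueue holds nothing, so submissions
// beyond maxThreadNumber busy threads would be rejected instead of cached.
def boundedCachedThreadPool(maxThreadNumber: Int, keepAliveSeconds: Long = 60L): ThreadPoolExecutor = {
  val pool = new ThreadPoolExecutor(
    maxThreadNumber, maxThreadNumber,
    keepAliveSeconds, TimeUnit.SECONDS,
    new LinkedBlockingQueue[Runnable]())
  pool.allowCoreThreadTimeOut(true) // let idle core threads die after the keep-alive
  pool
}
```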

Author: Shixiong Zhu <[email protected]>

Closes #9978 from zsxwing/cached-threadpool.

(cherry picked from commit d3ef693)
Signed-off-by: Shixiong Zhu <[email protected]>
…HadoopFiles

The JobConf object created in `DStream.saveAsHadoopFiles` is used concurrently in multiple places:
* The JobConf is updated by `RDD.saveAsHadoopFile()` before the job is launched
* The JobConf is serialized as part of the DStream checkpoints.
These concurrent accesses (updating in one thread while another thread is serializing it) can lead to ConcurrentModificationException in the underlying Java HashMap used in the internal Hadoop Configuration object.

The solution is to create a new JobConf in every batch that is updated by `RDD.saveAsHadoopFile()`, while the checkpointing serializes the original JobConf.

Tests to be added in #9988 will fail reliably without this patch. Keeping this patch really small to make sure that it can be added to previous branches.
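
A rough sketch of the per-batch pattern described above (not the actual patch; paths and classes are placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

// Hand each batch a fresh JobConf copy, so the object mutated by
// saveAsHadoopFile is never the one that checkpointing serializes.
def freshConfForBatch(base: Configuration): JobConf = new JobConf(base)

// Usage, roughly:
// dstream.foreachRDD { (rdd, time) =>
//   rdd.saveAsHadoopFile(pathFor(time), keyClass, valueClass, outputFormatClass,
//     freshConfForBatch(baseHadoopConf))
// }
```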

Author: Tathagata Das <[email protected]>

Closes #10088 from tdas/SPARK-12087.

(cherry picked from commit 8a75a30)
Signed-off-by: Shixiong Zhu <[email protected]>
Fix commons-collection group ID to commons-collections for version 3.x

Patches earlier PR at #9731

Author: Sean Owen <[email protected]>

Closes #10198 from srowen/SPARK-11652.2.

(cherry picked from commit e3735ce)
Signed-off-by: Sean Owen <[email protected]>
… doc

With the merge of [SPARK-8337](https://issues.apache.org/jira/browse/SPARK-8337), the Python API now has the same functionality as Scala/Java, so this changes the description to make it more precise.

zsxwing tdas , please review, thanks a lot.

Author: jerryshao <[email protected]>

Closes #10246 from jerryshao/direct-kafka-doc-update.

(cherry picked from commit 24d3357)
Signed-off-by: Shixiong Zhu <[email protected]>
…split

String.split accepts a regular expression, so we should escape "." and "|".
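
For example:

```scala
// "." and "|" are regex metacharacters, so splitting on them literally needs escaping.
"1.2.3".split(".")    // Array() -- every character matches the regex "."
"1.2.3".split("\\.")  // Array(1, 2, 3)
"a|b|c".split("\\|")  // Array(a, b, c)
```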

Author: Shixiong Zhu <[email protected]>

Closes #10361 from zsxwing/reg-bug.

(cherry picked from commit 540b5ae)
Signed-off-by: Shixiong Zhu <[email protected]>
If the initial model passed to GMM is not empty, it causes `net.razorvine.pickle.PickleException`. It can be fixed by converting `initialModel.weights` to `list`.

Author: zero323 <[email protected]>

Closes #9986 from zero323/SPARK-12006.

(cherry picked from commit fcd013c)
Signed-off-by: Joseph K. Bradley <[email protected]>
…not None"

This reverts commit fcd013c.

Author: Yin Huai <[email protected]>

Closes #10632 from yhuai/pythonStyle.

(cherry picked from commit e5cde7a)
Signed-off-by: Yin Huai <[email protected]>
If the initial model passed to GMM is not empty, it causes net.razorvine.pickle.PickleException. It can be fixed by converting initialModel.weights to list.

Author: zero323 <[email protected]>

Closes #10644 from zero323/SPARK-12006.

(cherry picked from commit 592f649)
Signed-off-by: Joseph K. Bradley <[email protected]>
…verflow

jira: https://issues.apache.org/jira/browse/SPARK-12685

master PR: #10627

the log of word2vec reports
trainWordsCount = -785727483
during computation over a large dataset.

Update the priority as it will affect the computation process.
alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))
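
An illustration of the overflow (not the patch itself; presumably the fix widens the counter, e.g. to `Long`):

```scala
// An Int word counter overflows past Int.MaxValue (~2.1e9) and goes negative,
// which flips the sign of the alpha adjustment above.
val perPartitionCounts = Seq(1000000000, 1000000000, 1000000000) // 3e9 words in total

val asInt: Int   = perPartitionCounts.sum               // overflows to -1294967296
val asLong: Long = perPartitionCounts.map(_.toLong).sum // 3000000000
```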

Author: Yuhao Yang <[email protected]>

Closes #10721 from hhbyyh/branch-1.4.
Apparently Chrome removed `SVGElement.prototype.getTransformToElement`, which is used by our JS library dagre-d3 when creating edges. The real diff can be found here: andrewor14/dagre-d3@7d6c000, which is taken from the fix in the main repo: cpettitt/dagre-d3@1ef067f

Upstream issue: https://github.com/cpettitt/dagre-d3/issues/202

Author: Andrew Or <[email protected]>

Closes #10986 from andrewor14/fix-dag-viz.

(cherry picked from commit 70e69fc)
Signed-off-by: Andrew Or <[email protected]>
@hvanhovell
Contributor

@liumingning could you please close this

@AmplabJenkins

Can one of the admins verify this patch?

viirya and others added 3 commits February 22, 2016 23:43
…onClustering failed test

JIRA: https://issues.apache.org/jira/browse/SPARK-12363
## What changes were proposed in this pull request?

Backport JIRA-SPARK-12363 to branch-1.4.

## How was this patch tested?

Unit test.

cc mengxr

Author: Liang-Chi Hsieh <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes #11264 from viirya/backport-12363.
`GraphImpl.fromExistingRDDs` expects a preprocessed vertex RDD as input. We call it in LDA without validating this requirement, so it might introduce errors. Replacing it with `Graph.apply` would be safer and more proper because it is a public API. The tests still pass, so maybe it is safe to use `fromExistingRDDs` here (though it doesn't seem so based on the implementation), or the test cases are special. jkbradley ankurdave

Author: Xiangrui Meng <[email protected]>

Closes #11226 from mengxr/SPARK-13355.

(cherry picked from commit 764ca18)
Signed-off-by: Xiangrui Meng <[email protected]>
…tialization

Changes the personalized pagerank initialization to be non-uniform.

Author: Yves Raimond <[email protected]>

Closes #9386 from moustaki/personalized-pagerank-init.

(cherry picked from commit efaa472)
Signed-off-by: Xiangrui Meng <[email protected]>
@andrewor14
Contributor

@liumingning Please close this PR.

… map string datatypes to Oracle VARCHAR datatype

Adding a getJDBCType() method to JdbcDialects.scala which creates a VARCHAR type for Oracle create table operations. Currently the TEXT type throws exceptions because it is incompatible with Oracle.

Author: thomastechs <[email protected]>

Closes #10912 from thomastechs/thomastechs-12941-branch.
@asfgit asfgit closed this in c37bbb3 Mar 1, 2016