
Conversation

@markhamstra

No description provided.

Davies Liu and others added 21 commits February 27, 2015 20:04
The __eq__ of DataType is not correct, and the class cache is not used correctly (a created class cannot be found by its dataType), so lots of classes are created (saved in _cached_cls) and never released.

Also, all DataType instances currently share the same hash code, so a dict keyed by them piles many objects into a single hash bucket; this is effectively a hash-collision pile-up, and accessing that dict becomes very slow (depending on the CPython implementation).

This PR also improves the performance of inferSchema (avoiding the unnecessary conversion of objects).
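
The fix itself is in `python/pyspark/sql.py`, but the failure mode is not Python-specific. Below is a minimal Scala analogue with hypothetical names, showing why a cache keyed by structurally equal schemas needs real equals/hashCode (otherwise every lookup misses and entries pile up under one hash bucket):

```
import scala.collection.mutable

// Hypothetical stand-ins for cached schema types: case classes derive structural
// equals/hashCode, so equal schemas land in the same bucket and actually hit the cache.
final case class FieldSketch(name: String, typeName: String)
final case class SchemaSketch(fields: Seq[FieldSketch])

object SchemaCacheSketch {
  private val cached = mutable.HashMap.empty[SchemaSketch, String]

  // Reuse the cached entry for a structurally equal schema instead of creating a new one each time.
  def classNameFor(schema: SchemaSketch): String =
    cached.getOrElseUpdate(schema, s"Row_${cached.size}")
}
```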

Author: Davies Liu <[email protected]>

Closes apache#4809 from davies/leak2 and squashes the following commits:

65c222f [Davies Liu] Update sql.py
9b4dadc [Davies Liu] fix __eq__ of singleton
b576107 [Davies Liu] fix tests
6c2909a [Davies Liu] fix incorrect DataType.__eq__
Backport of apache#4414 to 1.2: Add version option to spark-ec2

nchammas JoshRosen: does this look like the right backport of this change?

Author: Sean Owen <[email protected]>

Closes apache#4833 from srowen/SPARK-5628.2 and squashes the following commits:

e723911 [Sean Owen] Backport of apache#4414 to 1.2: Add version option to spark-ec2
…ize to ensure deleting the temp file"

This reverts commit 61bde00.
Before we would get the following (benign) error if we called `sc.stop()` twice. This is because the listener bus would try to post the end event again even after it has already stopped. This happens occasionally when flaky tests fail, usually as a result of other sources of error. Either way we shouldn't be logging this error when it is not the cause of the failure.
```
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerApplicationEnd(1425348445682)
```
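
As a minimal sketch of the idempotent-stop pattern this implies (illustrative names, not the actual SparkContext/LiveListenerBus code), the second call to stop() becomes a no-op:

```
import java.util.concurrent.atomic.AtomicBoolean

class StoppableSketch {
  private val stopped = new AtomicBoolean(false)

  def stop(): Unit = {
    // Only the first caller runs the shutdown sequence; later calls return quietly,
    // so nothing is posted to an already-stopped listener bus.
    if (!stopped.compareAndSet(false, true)) return
    // ... post SparkListenerApplicationEnd and stop the bus exactly once ...
  }
}
```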

Author: Andrew Or <[email protected]>

Closes apache#4871 from andrewor14/sc-stop and squashes the following commits:

a14afc5 [Andrew Or] Move code after code
915db16 [Andrew Or] Move code into code

(cherry picked from commit 6c20f35)
Signed-off-by: Andrew Or <[email protected]>
… built with Java 6

Add warning about building with Java 7+ and running the JAR on early Java 6.

Author: Sean Owen <[email protected]>

Closes apache#4888 from srowen/SPARK-1911.2 and squashes the following commits:

3f4ed5f [Sean Owen] Add warning about building with Java 7+ and running the JAR on early Java 6.
Author: Makoto Fukuhara <[email protected]>

Closes apache#4724 from fukuo33/fix-typo and squashes the following commits:

8c806b9 [Makoto Fukuhara] fix typo.
Because of a circular reference between JavaObject and JavaMember, a Java object cannot be released until Python GC kicks in, which causes a memory leak in collect() and may consume lots of memory in the JVM.

This PR changes the way collected data is sent back into Python, from a local file to a socket, which avoids any disk IO during collect and also avoids keeping any Python referrers to the Java object.
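
A rough Scala sketch of the server side of that idea (names assumed; the actual change is in PythonRDD.scala and pyspark): serve the serialized result over a loopback socket instead of writing it to a temp file.

```
import java.io.BufferedOutputStream
import java.net.{InetAddress, ServerSocket}

// Open an ephemeral loopback port, return it so the Python side can connect,
// and stream the collected bytes through the socket so nothing touches disk.
def serveCollected(chunks: Iterator[Array[Byte]]): Int = {
  val server = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
  new Thread("serve-collected-sketch") {
    setDaemon(true)
    override def run(): Unit = {
      val sock = server.accept()
      val out = new BufferedOutputStream(sock.getOutputStream)
      try chunks.foreach(chunk => out.write(chunk))
      finally { out.close(); sock.close(); server.close() }
    }
  }.start()
  server.getLocalPort  // the Python side connects to localhost:<port> and reads until EOF
}
```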

cc JoshRosen

Author: Davies Liu <[email protected]>

Closes apache#4923 from davies/fix_collect and squashes the following commits:

d730286 [Davies Liu] address comments
24c92a4 [Davies Liu] fix style
ba54614 [Davies Liu] use socket to transfer data from JVM
9517c8f [Davies Liu] fix memory leak in collect()

(cherry picked from commit 8767565)
Signed-off-by: Josh Rosen <[email protected]>

Conflicts:
	core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
	python/pyspark/rdd.py
	python/pyspark/sql/dataframe.py
Backport hhbyyh's hashCode implementation to branch-1.2. The old implementation causes performance issues with PySpark, which calls hashCode (https://issues.apache.org/jira/browse/SPARK-6288).

Author: Yuhao Yang <[email protected]>

Closes apache#4985 from mengxr/SPARK-5186-1.2 and squashes the following commits:

155e559 [Yuhao Yang] backport SPARK-5186
Thread.interrupt() cannot terminate the thread in some cases, so we should not wait for the writerThread of PythonRDD.

This PR also ignores some exceptions during cleanup.
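
A small sketch of that shutdown behaviour (illustrative only, not the actual PythonRDD code): request the interrupt but don't join a thread that may not honour it, and treat cleanup failures as best-effort.

```
def shutdownSketch(writerThread: Thread, closeables: Seq[AutoCloseable]): Unit = {
  // Ask the writer to stop, but do not join(): interrupt() may never actually terminate it.
  writerThread.interrupt()
  // Best-effort cleanup; a failure here should not mask the original outcome.
  closeables.foreach { c =>
    try c.close() catch { case _: Exception => () }
  }
}
```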

cc mengxr

Author: Davies Liu <[email protected]>

Closes apache#5003 from davies/fix_take2 and squashes the following commits:

2f2f893 [Davies Liu] fix take of PythonRDD in JVM
…to ...

...support NFS mounts.

This is a workaround for now, with the goal of finding a more permanent solution.
https://issues.apache.org/jira/browse/SPARK-6313
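
For reference, a hedged usage sketch of the resulting knob (the config key is taken from the commit message below; exact semantics assumed): disable the fetch-file cache on clusters whose local directories live on NFS.

```
import org.apache.spark.SparkConf

// Turn off the shared fetch-file cache so executors whose local dirs sit on NFS
// do not contend on file locks.
val conf = new SparkConf()
  .setAppName("nfs-friendly-app")
  .set("spark.files.useFetchCache", "false")
```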

Author: nemccarthy <[email protected]>

Closes apache#5036 from nemccarthy/master and squashes the following commits:

2eaaf42 [nemccarthy] [SPARK-6313] Update config wording doc for spark.files.useFetchCache
5de7eb4 [nemccarthy] [SPARK-6313] Add config option to disable file locks/fetchFile cache to support NFS mounts

(cherry picked from commit 4cca391)
Signed-off-by: Josh Rosen <[email protected]>
The problem is that `ContextCleaner` may clean variables that belong to a different `SparkContext`. This can happen if the `SparkContext` to which the cleaner belongs stops, and a new one is started immediately afterwards in the same JVM. In this case, if the cleaner is in the middle of cleaning a broadcast, for instance, it will do so through `SparkEnv.get.blockManager`, which could be one that belongs to a different `SparkContext`.

JoshRosen and I suspect that this is the cause of many flaky tests, most notably the `JavaAPISuite`. We were able to reproduce the failure locally (though it is not deterministic and very hard to reproduce).
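
A minimal sketch of the guard this points at (hypothetical names, not the real ContextCleaner): cleaning and stop() share a lock and a flag, so a cleaner whose context has stopped never reaches through SparkEnv.get into a newer context's block manager.

```
class CleanerSketch(removeBroadcast: Long => Unit) {
  private var stopped = false

  def stop(): Unit = synchronized { stopped = true }

  def clean(broadcastId: Long): Unit = synchronized {
    if (stopped) return  // our SparkContext is gone; do not reach through SparkEnv.get
    removeBroadcast(broadcastId)  // uses the block manager captured from *our* context
  }
}
```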

Author: Andrew Or <[email protected]>

Closes apache#4869 from andrewor14/cleaner-masquerade and squashes the following commits:

29168c0 [Andrew Or] Synchronize ContextCleaner stop
If the cleaner is stopped, we shouldn't print a huge stack trace when the cleaner thread is interrupted because we purposefully did this.
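
Sketched in code (illustrative, not the actual cleaner), the change amounts to: an interrupt that arrives because we stopped the cleaner on purpose is swallowed, while anything else is still logged.

```
def cleaningLoopSketch(stopped: () => Boolean, doOnePass: () => Unit): Unit = {
  try {
    while (!stopped()) doOnePass()
  } catch {
    case _: InterruptedException if stopped() => // expected: we interrupted it ourselves, no stack trace
    case e: Exception => System.err.println(s"Error in cleaning thread: $e")
  }
}
```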

Author: Andrew Or <[email protected]>

Closes apache#4882 from andrewor14/cleaner-interrupt and squashes the following commits:

8652120 [Andrew Or] Just a hot fix
…when nested data and partitioned table

The bug solved here was due to a change in PartitionTableScan when reading a partitioned table.

- When the partition column is requested out of a parquet table, the table scan needs to add the column back to the output Rows.
- To update the Row object created by PartitionTableScan, the Row was first cast to SpecificMutableRow before being updated.
- This cast was unsafe, since there is no guarantee that the NewHadoopRDD used internally will instantiate the output Rows as MutableRow.

In particular, when reading a table with complex (e.g. struct or array) types, the NewHadoopRDD uses a parquet.io.api.RecordMaterializer produced by org.apache.spark.sql.parquet.RowReadSupport. When complex types are involved, this consumer is created (in the org.apache.spark.sql.parquet.CatalystConverter.createRootConverter factory) as an org.apache.spark.sql.parquet.CatalystGroupConverter (a) rather than an org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter (b).

Consumer (a) outputs GenericRow, while consumer (b) produces SpecificMutableRow.

Therefore any query selecting a partition column plus a complex-type column returns GenericRows and falls into the unsafe cast (see https://issues.apache.org/jira/browse/SPARK-5775 for an example).

The fix proposed here originally replaced the unsafe class cast with a case match on the Row type, updating the Row if it is of a mutable type and recreating a Row otherwise.

This PR now implements the solution updated by liancheng on aa39460:

The fix checks whether every requested column is a primitive type, symmetrically to the check in org.apache.spark.sql.parquet.CatalystConverter.createRootConverter.
 - If all columns are primitive types, the Row can safely be cast to a MutableRow.
 - Otherwise a new GenericRow is created, and the partition column is written into this new row structure.

This fix is unit-tested in  sql/hive/src/test/scala/org/apache/spark/sql/parquet/parquetSuites.scala
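
A hedged sketch of the cast-or-rebuild decision described above, using stand-in row types rather than the real Catalyst classes:

```
// Stand-ins for the Catalyst row types named above (illustrative only).
trait RowSketch { def values: Array[Any] }
final class MutableRowSketch(val values: Array[Any]) extends RowSketch {
  def update(i: Int, v: Any): Unit = values(i) = v
}
final class GenericRowSketch(val values: Array[Any]) extends RowSketch

// If every requested column is a primitive type, the scan produced mutable rows whose last slot is
// reserved for the partition column, so update in place; otherwise rebuild the row instead of casting.
def addPartitionColumn(row: RowSketch, allPrimitive: Boolean, partitionValue: Any): RowSketch =
  if (allPrimitive) {
    val mutable = row.asInstanceOf[MutableRowSketch]
    mutable.update(mutable.values.length - 1, partitionValue)
    mutable
  } else {
    new GenericRowSketch(row.values.updated(row.values.length - 1, partitionValue))
  }
```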

Author: Anselme Vignon <[email protected]>
Author: Cheng Lian <[email protected]>

Closes apache#4697 from anselmevignon/local_dev and squashes the following commits:

6a4c53d [Anselme Vignon] style corrections
52f73fc [Cheng Lian] cherry-pick & merge from aa39460
8fc6a8c [Anselme Vignon] correcting tests on temporary tables
24928ea [Anselme Vignon] corrected mirror bug (see SPARK-5775) for newParquet
7c829cb [Anselme Vignon] bugfix, hopefully correct this time
005a7f8 [Anselme Vignon] added test cleanup
22cec52 [Anselme Vignon] lint compatible changes
ae48f7c [Anselme Vignon] unittesting SPARK-5775
f876dea [Anselme Vignon] starting to write tests
dbceaa3 [Anselme Vignon] cutting lines
4eb04e9 [Anselme Vignon] bugfix SPARK-5775
…en running FlumeStreamSuite

When we run FlumeStreamSuite on Jenkins, we sometimes get an error like the following.

    sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 52 times over 10.094849836 seconds. Last failure message: Error connecting to localhost/127.0.0.1:23456.
	    at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
	    at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
	    at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
	    at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
	   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
	   at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:116)
           at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74)
	   at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply$mcV$sp(FlumeStreamSuite.scala:66)
	    at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply(FlumeStreamSuite.scala:66)
	    at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply(FlumeStreamSuite.scala:66)
	    at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
	    at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
	    at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	    at org.scalatest.Transformer.apply(Transformer.scala:22)
	    at org.scalatest.Transformer.apply(Transformer.scala:20)
    	    at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
	    at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
	    at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
	    at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
	   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	    at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
	    at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
	    at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)

This error is caused by the check-then-act logic used to find a free port:

      /** Find a free port */
      private def findFreePort(): Int = {
        Utils.startServiceOnPort(23456, (trialPort: Int) => {
          val socket = new ServerSocket(trialPort)
          socket.close()
          (null, trialPort)
        }, conf)._2
      }

Removing the check-then-act is not easy, but we can reduce the chance of hitting the error by choosing a random value for the initial port instead of the fixed 23456.
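
A hedged sketch of what that change looks like, based on the snippet above (the exact Utils.startServiceOnPort signature and the port range are assumptions):

```
/** Find a free port, starting the search at a random port instead of the fixed 23456. */
private def findFreePort(): Int = {
  // `Utils` and `conf` come from the surrounding suite, as in the snippet above;
  // the starting range is an assumption, any lightly-used range would do.
  val candidatePort = scala.util.Random.nextInt(1024) + 10000
  Utils.startServiceOnPort(candidatePort, (trialPort: Int) => {
    val socket = new java.net.ServerSocket(trialPort)
    socket.close()
    (null, trialPort)
  }, conf)._2
}
```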

Author: Kousuke Saruta <[email protected]>

Closes apache#4337 from sarutak/SPARK-5559 and squashes the following commits:

16f109f [Kousuke Saruta] Added `require` to Utils#startServiceOnPort
c39d8b6 [Kousuke Saruta] Merge branch 'SPARK-5559' of github.com:sarutak/spark into SPARK-5559
1610ba2 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5559
33357e3 [Kousuke Saruta] Changed "findFreePort" method in MQTTStreamSuite and FlumeStreamSuite so that it can choose valid random port
a9029fe [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5559
9489ef9 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5559
8212e42 [Kousuke Saruta] Modified default port used in FlumeStreamSuite from 23456 to random value

(cherry picked from commit 85cf063)
Signed-off-by: Sean Owen <[email protected]>
…sues in Java APIs

This PR addresses a Scala compiler bug ([SI-8905](https://issues.scala-lang.org/browse/SI-8905)) that was breaking some of the Spark Java APIs.  In a nutshell, it seems that methods whose implementations are inherited from generic traits sometimes have their type parameters erased to Object.  This was causing methods like `DoubleRDD.min()` to throw confusing NoSuchMethodErrors at runtime.

The fix implemented here is to introduce an intermediate layer of abstract classes and inherit from those instead of directly extending the `Java*Like` traits.  This should not break binary compatibility.
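
A toy sketch of the workaround pattern (heavily simplified, not the actual Spark hierarchy): the Java-facing class extends an intermediate abstract class rather than mixing in the generic trait directly.

```
// Generic trait whose inherited methods can hit the SI-8905 erasure problem when a
// concrete Java-facing class mixes it in directly.
trait RDDLikeSketch[T] {
  def iterator(): Iterator[T]
  def first(): T = iterator().next()
}

// Intermediate abstract layer: concrete classes extend this instead of the trait,
// so the compiler emits properly-typed bridge methods here.
abstract class AbstractRDDLikeSketch[T] extends RDDLikeSketch[T]

class JavaDoubleRDDSketch(values: Seq[java.lang.Double]) extends AbstractRDDLikeSketch[java.lang.Double] {
  override def iterator(): Iterator[java.lang.Double] = values.iterator
}
```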

I also improved the test coverage of the Java API, adding several new tests for methods that failed at runtime due to this bug.

Author: Josh Rosen <[email protected]>

Closes apache#5050 from JoshRosen/javardd-si-8905-fix and squashes the following commits:

2feb068 [Josh Rosen] Use intermediate abstract classes to work around SPARK-3266
d5f3e5d [Josh Rosen] Add failing regression tests for SPARK-3266

(cherry picked from commit 0f673c2)
Signed-off-by: Josh Rosen <[email protected]>

Conflicts:
	core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala
	core/src/test/java/org/apache/spark/JavaAPISuite.java
… edge cases

Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix the existing one accordingly.
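
For context, a hedged sketch of the kind of bucketing logic involved (not the actual fastBucketFunction): with evenly spaced buckets, a value equal to the upper bound, or an index that overshoots due to floating-point division, must be clamped into the last bucket, and NaN or out-of-range values dropped.

```
// Bucket index for evenly spaced buckets over [min, max]; None for NaN or out-of-range values.
def bucketIndex(value: Double, min: Double, max: Double, bucketCount: Int): Option[Int] = {
  if (value.isNaN || value < min || value > max) None
  else {
    val width = (max - min) / bucketCount
    val raw = ((value - min) / width).toInt
    Some(math.min(raw, bucketCount - 1))  // clamp value == max (and rounding overshoot) into the last bucket
  }
}
```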

Author: Sean Owen <[email protected]>

Closes apache#5148 from srowen/SPARK-6480 and squashes the following commits:

974a0a0 [Sean Owen] Additional test of huge ranges, and a few more comments (and comment fixes)
23ec01e [Sean Owen] Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

(cherry picked from commit fe15ea9)
Signed-off-by: Sean Owen <[email protected]>
The reuse-address option on the server side caused the server to fail to acknowledge connected connections, so remove it.

This PR retries once after a timeout, and it also adds a timeout on the client side.
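
Sketching the client-side behaviour in Scala terms (the real client is the Python side, and the timeout value here is assumed): connect with a timeout and retry once before giving up.

```
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

def connectWithOneRetry(host: String, port: Int, timeoutMs: Int = 3000): Socket = {
  def attempt(): Socket = {
    val sock = new Socket()
    sock.connect(new InetSocketAddress(host, port), timeoutMs)  // fail fast instead of hanging forever
    sock
  }
  try attempt() catch { case _: IOException => attempt() }  // one retry after a timeout or refused connection
}
```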

Author: Davies Liu <[email protected]>

Closes apache#5324 from davies/collect_hang and squashes the following commits:

e5a51a2 [Davies Liu] remove setReuseAddress
7977c2f [Davies Liu] do retry on client side
b838f35 [Davies Liu] retry after timeout

(cherry picked from commit 0cce545)
Signed-off-by: Josh Rosen <[email protected]>
Use Option for ActiveJob.properties to avoid NPE bug
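
The shape of the fix, as a minimal sketch (all names are illustrative; the job-group property key is assumed to be `spark.jobGroup.id`):

```
import java.util.Properties

// Hypothetical stand-in for the scheduler's job bookkeeping: properties may legitimately be absent.
final case class ActiveJobSketch(jobId: Int, properties: Option[Properties])

// Matching a job group no longer dereferences a possibly-null Properties object.
def inJobGroup(job: ActiveJobSketch, groupId: String): Boolean =
  job.properties.exists(p => groupId == p.getProperty("spark.jobGroup.id"))
```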

Author: Hung Lin <[email protected]>

Closes apache#5124 from hunglin/SPARK-6414 and squashes the following commits:

2290b6b [Hung Lin] [SPARK-6414][core] Fix NPE in SparkContext.cancelJobGroup()

(cherry picked from commit e3202aa)
Signed-off-by: Josh Rosen <[email protected]>

Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

Conflicts:
	core/src/test/scala/org/apache/spark/SparkContextSuite.scala