This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

@foxish
Member

@foxish foxish commented Mar 8, 2017

No description provided.

@foxish foxish mentioned this pull request Mar 8, 2017

@ash211 ash211 left a comment


+1 to merging when this passes

@mccheah take a look?

@foxish
Member Author

foxish commented Mar 8, 2017

@cvpatel, any way to cancel the old integration test build and run this one?

@cvpatel
Member

cvpatel commented Mar 8, 2017

@foxish Unfortunately, there is no public way to cancel the build after it starts. If there are multiple commits before the build starts, it should run only one build, with the latest commit. For now, I manually cancelled it.

@ash211

ash211 commented Mar 8, 2017

Looks like the latest build ran out of memory in the DistributedSuite?

https://travis-ci.org/apache-spark-on-k8s/spark/jobs/209099308

DistributedSuite:
- task throws not serializable exception
- local-cluster format
- simple groupByKey
- groupByKey where map output sizes exceed maxMbInFlight
- accumulators
- broadcast variables
- repeatedly failing task
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x000000075e500000, 426246144, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 426246144 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/travis/build/apache-spark-on-k8s/spark/core/hs_err_pid4903.log

@foxish
Member Author

foxish commented Mar 8, 2017

This is odd. Maybe a Travis issue? I also see:

$ dev/lint-java
Using `mvn` from path: /home/ramanathana/Install/apache-maven-3.3.9/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java:[255,10] (modifier) RedundantModifier: Redundant 'final' modifier.

I don't think we actually touch that file.

@foxish
Member Author

foxish commented Mar 8, 2017

I also see that same linter error when running on branch-2.1-kubernetes.


Shouldn't these versions also reflect the kubernetes branch?

Member Author


I am not sure. The underlying version of spark is still 2.1.0, which is why I thought that was appropriate.

Member Author


Would the same version string as the image make sense?
2.1.0-k8s-support-0.1.0-alpha.1?


I think so. This is particularly because we likely want to publish these libraries to a Maven repository as well, something we still need to discuss the specifics of. One example use case is developing custom implementations of DriverServiceManager, which would require projects to take a dependency on spark-kubernetes. But if we publish these, we need to choose a version string in the pom files that differs from what Spark is already publishing to Maven Central.
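If these modules do get published, the distinguishing version would live in each module's pom. A hypothetical sketch (the coordinates are illustrative, using the alpha string floated above, not a final decision):

```xml
<!-- Sketch only: a version distinct from upstream Spark's 2.1.0, so
     published k8s artifacts don't collide with Maven Central releases -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-kubernetes_2.11</artifactId>
<version>2.1.0-k8s-support-0.1.0-alpha.1</version>
```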

Member Author


That's a good point. Do we switch over all the POMs to publish our version? Including, say, sql, ml, and other parts we did not touch?


I think we want everything to be synchronized, yes.

@mccheah

mccheah commented Mar 8, 2017

+1 when the build succeeds.

@foxish
Member Author

foxish commented Mar 8, 2017

@mccheah The linter error still seems to exist in a file that we didn't touch. I think that will cause the Travis build to fail.

@mccheah

mccheah commented Mar 8, 2017

Let's get the build to that point and make sure it's not failing because of something directly related to this change. A build that fails with a complaint about a bad version string would be worth catching, for example.

@kimoonkim
Member

I see the new Travis unit test build failing for a similar reason:

ExternalShuffleServiceSuite:
- groupByKey without compression
- shuffle non-zero block size
- shuffle serializer
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000742e00000, 17301504, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 17301504 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/travis/build/apache-spark-on-k8s/spark/core/hs_err_pid4921.log

I don't have much of a clue why.

@cvpatel
Member

cvpatel commented Mar 8, 2017

The integration test is failing due to filenames > 100 chars:

http://spark-k8s-jenkins.pepperdata.org:8080/job/PR-spark-k8s-integration-test/56/consoleFull#124214759204b09c7b-0d94-4ce5-8a08-8f343248b3d8

java.lang.RuntimeException: file name 'spark-kubernetes-integration-tests-spark-jobs-helpers_2.11-2.1.0-k8s-support-0.1.0-alpha.1-SNAPSHOT.jar' is too long ( > 100 bytes)
  at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.handleLongName(TarArchiveOutputStream.java:674)
  at org.apache.commons.compress.archivers.tar.TarArchiveOutputStream.putArchiveEntry(TarArchiveOutputStream.java:275)
  at org.apache.spark.deploy.rest.kubernetes.CompressionUtils$$anonfun$2$$anonfun$apply$2$$anonfun$apply$4$$anonfun$apply$5.apply(CompressionUtils.scala:77)
  at org.apache.spark.deploy.rest.kubernetes.CompressionUtils$$anonfun$2$$anonfun$apply$2$$anonfun$apply$4$$anonfun$apply$5.apply(CompressionUtils.scala:58)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.deploy.rest.kubernetes.CompressionUtils$$anonfun$2$$anonfun$apply$2$$anonfun$apply$4.apply(CompressionUtils.scala:58)
  at org.apache.spark.deploy.rest.kubernetes.CompressionUtils$$anonfun$2$$anonfun$apply$2$$anonfun$apply$4.apply(CompressionUtils.scala:56)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2494)
  at org.apache.spark.deploy.rest.kubernetes.CompressionUtils$$anonfun$2$$anonfun$apply$2.apply(CompressionUtils.scala:56)
  ...
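For context, this failure comes from commons-compress enforcing the classic ustar tar header, which reserves only 100 bytes for the entry name (its default long-file mode rejects anything longer). A minimal sketch of the same limit using Python's standard-library tarfile, purely for illustration; the failing code here is Spark's Scala CompressionUtils:

```python
import io
import tarfile

# The classic ustar header stores the entry name in a 100-byte field;
# a longer basename with no '/' to split on cannot be represented.
long_name = "x" * 101

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    try:
        tar.addfile(tarfile.TarInfo(name=long_name))
        rejected = False
    except ValueError:  # tarfile raises "name is too long" for ustar
        rejected = True
```

GNU and PAX tar variants sidestep the limit with extension records, which is why the limit only bites when the writer is configured for the strict classic format.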

@mccheah

mccheah commented Mar 8, 2017

Looks like we'll need a different version string for the poms. Not sure what to use here - maybe "2.1.0-k8s-0.1.0-SNAPSHOT"?

@ash211

ash211 commented Mar 9, 2017

Does that meet the 100 byte name limit?
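A quick sanity check (the jar name pattern is taken from the integration-test failure above; the shortened suffix is the proposed string):

```python
# Prefix from the failing integration-test jar
base = "spark-kubernetes-integration-tests-spark-jobs-helpers_2.11-"

old = base + "2.1.0-k8s-support-0.1.0-alpha.1-SNAPSHOT.jar"  # rejected by tar
new = base + "2.1.0-k8s-0.1.0-SNAPSHOT.jar"                  # proposed shorter string

print(len(old), len(new))  # the old name exceeds 100 bytes; the new one fits
```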

@foxish
Member Author

foxish commented Mar 13, 2017

Changed the POM string. Integration tests are fine now, but Travis is still seeing memory errors. Any ideas, @cvpatel @ssuchter?

@mccheah

mccheah commented Mar 13, 2017

Just retriggered the build; hopefully it's healthy now.

@cvpatel
Member

cvpatel commented Mar 13, 2017

I had seen the resource-related failures earlier, especially around changing the included profiles and goals...

Let's see if the retrigger fixes the issue; if it continues to have this type of issue, we could investigate having these run via Jenkins.

@mccheah

mccheah commented Mar 13, 2017

@cvpatel can we increase the memory available to Travis?

@kimoonkim
Member

The latest Travis build did not see an OOM (which is a good thing). It just saw ExternalShuffleServiceSuite hanging. Maybe it's flaky and we can just blacklist it:

ExternalShuffleServiceSuite:
- groupByKey without compression
- shuffle non-zero block size
- shuffle serializer
- zero sized blocks
- zero sized blocks without kryo
- shuffle on mutable pairs
- sorting on mutable pairs
- cogroup using mutable pairs
- subtract mutable pairs
- sort with Java non serializable class - Kryo
- sort with Java non serializable class - Java
- shuffle with different compression settings (SPARK-3426)
- [SPARK-4085] rerun map stage if reduce stage cannot find its local shuffle file
- metrics for shuffle without aggregation
- metrics for shuffle with aggregation
- multiple simultaneous attempts for one task (SPARK-8029)
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

@cvpatel
Member

cvpatel commented Mar 13, 2017

@mccheah Unfortunately no, we are already running the largest possible container (7.5 GB). more details

@foxish The second test seems to pass the build but fails the linter for both Java and Scala.

@foxish
Member Author

foxish commented Mar 13, 2017

@kimoonkim That's a good idea.
I think this branch, which is forked off the 2.1 release, has lost some of the flake fixes that came in after 2.1, which is why we might need to blacklist some more tests.

As for the linter, it seems to be failing on branch-2.1-kubernetes which is up-to-date with the upstream 2.1 release. Any ideas why this might be happening?

@foxish
Member Author

foxish commented Mar 14, 2017

Since the integration test passes, I think we should merge this first and then fix the subsequent Travis issues, such as #185.
Thoughts, @mccheah @cvpatel @kimoonkim?

@mccheah

mccheah commented Mar 14, 2017

I'm OK with this.

@foxish foxish merged commit 3636939 into prep-for-alpha-release Mar 14, 2017
@foxish foxish deleted the fix-alpha branch March 14, 2017 00:46
@kimoonkim
Member

SGTM.

@cvpatel
Member

cvpatel commented Mar 14, 2017

Ditto. But it seems like #185 is failing as well because of memory issues... going to start porting the unit tests to Jenkins and see how they behave there.

foxish added a commit that referenced this pull request Jul 24, 2017
* Fix pom versioning

* fix k8s versions in pom

* Change pom string to 2.1.0-k8s-0.1.0-SNAPSHOT
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 25, 2019
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
* Fix pom versioning

* fix k8s versions in pom

* Change pom string to 2.1.0-k8s-0.1.0-SNAPSHOT
