Skip to content

Conversation

@BrianLondon
Copy link
Contributor

The current Spark Streaming kinesis connector references a quite old version 1.9.40 of the AWS Java SDK (1.10.40 is current). Numerous AWS features including Kinesis Firehose are unavailable in 1.9. Those two versions of the AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 respectively) such that one cannot include the current AWS SDK in a project that also uses the Spark Streaming Kinesis ASL.

@holdenk
Copy link
Contributor

holdenk commented Dec 10, 2015

Have you tested the current kinesis and works with the new API? Looking online it seems like we might also need to bump our kinesis client library if we upgrade the general aws sdk version.

@holdenk
Copy link
Contributor

holdenk commented Dec 10, 2015

Note we also depend depend on com.fasterxml.jackson.core 2.4.4 in Spark

@BrianLondon
Copy link
Contributor Author

I ran the demo with the following two commands against a stream (test-stream1) on my AWS account. It performed as expected.

bin/run-example streaming.KinesisWordProducerASL test-stream1 https://kinesis.us-east-1.amazonaws.com 100 10
bin/run-example streaming.KinesisWordCountASL testapp test-stream1 https://kinesis.us-east-1.amazonaws.com

I happy to bump the KCL library as well. With the bumped com.fasterxml.jackson.core version, the project build and passed all tests except docker-integration, which I can't run. AWS Java SDK 1.10.40 calls a function that isn't provided by Jackson 2.4.x. If raising that to 2.5.x isn't an option, then we can find whatever the last version of AWS Java SDK to not use it is. I know it works with 1.10.24, so somewhere between 1.10.24 and 1.10.40 could be a temporary solution.

@holdenk
Copy link
Contributor

holdenk commented Dec 10, 2015

Cool - if we don't need to upgrade the KCL library then thats probably good. Maybe good to see if @tdas or someone similar can take a look?

@JoshRosen
Copy link
Contributor

/cc @brkyvz RE: the KCL library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it dangerous to upgrade jackson? It's also used by SQL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this makes me nervous. I'll run the tests to see what happens.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen did you manage to run the tests? We'd need to be sure nothing breaks with this version bump!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MLnick the tests seem to have run OK in https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2203/consoleFull (although jenkins seems to be having some issues)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@holdenk Unfortunately the tests that actually mattered didn't run, they were ignored :(

[info] WithAggregationKinesisStreamSuite:
[info] - KinesisUtils API (202 milliseconds)
[info] - RDD generation (24 milliseconds)
[info] - basic operation [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - custom message handling [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - failure recovery [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Prepare KinesisTestUtils [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] WithoutAggregationKinesisStreamSuite:
[info] - KinesisUtils API (5 milliseconds)
[info] - RDD generation (6 milliseconds)
[info] - basic operation [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - custom message handling [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - failure recovery [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Prepare KinesisTestUtils [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] WithAggregationKinesisBackedBlockRDDSuite:
[info] - Basic reading from Kinesis [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Read data available in both block manager and Kinesis [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Read data available only in block manager, not in Kinesis [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Read data available only in Kinesis, not in block manager [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Read data available partially in block manager, rest in Kinesis [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Test isBlockValid skips block fetching from block manager [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Test whether RDD is valid after removing blocks from block anager [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] WithoutAggregationKinesisBackedBlockRDDSuite:
[info] - Basic reading from Kinesis [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Read data available in both block manager and Kinesis [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Read data available only in block manager, not in Kinesis [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Read data available only in Kinesis, not in block manager [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Read data available partially in block manager, rest in Kinesis [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Test isBlockValid skips block fetching from block manager [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!
[info] - Test whether RDD is valid after removing blocks from block anager [enable by setting env var ENABLE_KINESIS_TESTS=1] !!! IGNORED !!!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those tests are i think enabled only in the default jenkins PR builder.

@brkyvz
Copy link
Contributor

brkyvz commented Dec 11, 2015

@BrianLondon What happens if you change the version of KCL to 1.6.1? Could you also please add a new line to anywhere in extras/kinesis-asl? Otherwise Kinesis tests don't run and we have no idea whether this is a safe change.

Thanks!

@SparkQA
Copy link

SparkQA commented Dec 11, 2015

Test build #2203 has finished for PR 10256 at commit 47d0a59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BrianLondon
Copy link
Contributor Author

@brkyvz Pushed up the KCL version and added a newline to KinesisReciever. Tests passed for me locally on a hadoop 2.6 build and the kinesis demos ran against a live AWS stream.

@BrianLondon
Copy link
Contributor Author

@holdenk bump

@holdenk
Copy link
Contributor

holdenk commented Dec 14, 2015

Does anyone working on SQL (e.g. @marmbrus ) have any concerns about the Jackson upgrade?

@MLnick
Copy link
Contributor

MLnick commented Dec 14, 2015

@brkyvz @holdenk @tdas any particular reason we depend on aws-sdk-java explicitly? The KCL brings in those dependencies (in fact from 1.5.0 KCL only depends on the necessary AWS SDK components, i.e. core, kinesis, dynamodb and cloudwatch). Depending only on KCL would make things cleaner and avoid potential issues such as not bumping versions at the same time (e.g. #8957 and related mailing list discussion)

@holdenk
Copy link
Contributor

holdenk commented Dec 14, 2015

I'm not really sure why - skimming the git logs it seems the aws-sdk-java sdk dependency was added at the same time as the kcl dependency.

@BrianLondon
Copy link
Contributor Author

It appears kcl pulls in 1.10.20. I'm all for dropping the explicit dependency. That probably wouldn't have prevented the regression in Spark 1.5.2 though. At any rate, the jackson bump is necessary if someone wants to use the latest aws sdk (e.g. for Firehose).

@BrianLondon
Copy link
Contributor Author

I removed the explicit dependence on the AWS Java SDK. There's a newline that was added to KinesisReceiver.scala since the last test run. I think if tests are re-run the kinesis part should be included, although someone more knowledgable than me about Travis can weigh in.

Tests and the kinesis example worked locally for me for what it's worth.

@JoshRosen
Copy link
Contributor

FYI, a bunch of spark-redshift users have run into the Jackson dependency conflict issue that's discussed here; see databricks/spark-redshift#133 and databricks/spark-redshift#121. Not commenting either way w.r.t. the dep. concerns themselves, but just wanted to point out that the pain of the Jackson conflict has impacts for a bunch of other applications too.

@BrianLondon
Copy link
Contributor Author

Yeah, that looks like the same underlying issue of using AWS Java SDK 1.10.0 or later with Spark. I believe this version bump will fix it. You can sometimes get around it by adding the following to your sbt build:

dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core" % "jackson-core" % "2.4.4",
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4",
  "com.fasterxml.jackson.core" % "jackson-annotations" % "2.4.4"
)

That doesn't work however if the calls to the AWS library use any Jackson 2.5.x functionality and some of them do.

@MLnick
Copy link
Contributor

MLnick commented Dec 18, 2015

I've also run the Kinesis tests locally (against a live AWS account) against this PR and all tests pass. I'm +1 on this subject to agreement for the jackson dependency bump, for which I defer to SQL folks e.g. @marmbrus

@marmbrus
Copy link
Contributor

If the SQL/Hive tests are passing I'm good. @srowen usually has good insight on dependency changes.

@srowen
Copy link
Member

srowen commented Dec 21, 2015

Can this wait until 2.0? I think it's going to anyway now. Then the dependency issue is fine.

@SparkQA
Copy link

SparkQA commented Dec 30, 2015

Test build #2271 has finished for PR 10256 at commit 64858df.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BrianLondon
Copy link
Contributor Author

Rebased against current master, ran tests (except for docker) locally, and tested kinesis example against aws hosted stream. All worked.

My preference would be to get the version bump in as soon as practicable for two reasons. First, and mostly selfishly, the inability to link any library that requires Jackson 2.5.x features is causing us problems in our production environment. Second, we currently can just bump the version number because the Spark codebase appears compatible with both Jackson versions at present, but the longer we wait the more likely someone introduces an incompatible change.

That said, I fully understand the reluctance to replace a heavily used library mid release cycle.

@SparkQA
Copy link

SparkQA commented Jan 8, 2016

Test build #2354 has finished for PR 10256 at commit 6f3bbfd.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jan 8, 2016

Ah right, you'll need to acknowledge the dep changes in the build file:

To update the manifest file, run './dev/test-dependencies.sh --replace-manifest'.
diff --git a/dev/deps/spark-deps-hadoop-2.2 b/dev/pr-deps/spark-deps-hadoop-2.2
index e4373f7..cd3ff29 100644
--- a/dev/deps/spark-deps-hadoop-2.2
+++ b/dev/pr-deps/spark-deps-hadoop-2.2
@@ -84,13 +84,13 @@ hadoop-yarn-server-web-proxy-2.2.0.jar
 httpclient-4.3.2.jar
 httpcore-4.3.2.jar
 ivy-2.4.0.jar
-jackson-annotations-2.4.4.jar
-jackson-core-2.4.4.jar
+jackson-annotations-2.5.3.jar
+jackson-core-2.5.3.jar
 jackson-core-asl-1.9.13.jar
-jackson-databind-2.4.4.jar
+jackson-databind-2.5.3.jar
 jackson-jaxrs-1.9.13.jar
 jackson-mapper-asl-1.9.13.jar
-jackson-module-scala_2.10-2.4.4.jar
+jackson-module-scala_2.10-2.5.3.jar
 jackson-xc-1.9.13.jar
 janino-2.7.8.jar
 jansi-1.4.jar

This LGTM but I do think we'll have to keep it for 2.x only.

@BrianLondon
Copy link
Contributor Author

Manifest updated. Is there documented somewhere what all the tests Travis runs are? It seems to differ from what's described at http://spark.apache.org/docs/latest/building-spark.html

@srowen
Copy link
Member

srowen commented Jan 10, 2016

@BrianLondon it should be running dev/run-tests, so that's something you can inspect or run locally if you need to. The doc there describes how to run one test.

@SparkQA
Copy link

SparkQA commented Jan 10, 2016

Test build #2358 has finished for PR 10256 at commit a1f75c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jan 10, 2016

I'm going to go for this one since I think we may want to, in general, consider updating Spark dependencies in 2.x anyway. 2.5 is even old for Jackson, so good to bump up here as a side effect I think.

@srowen
Copy link
Member

srowen commented Jan 11, 2016

Merged to master

@asfgit asfgit closed this in 8fe928b Jan 11, 2016
@cowtowncoder
Copy link

I know this issue is closed, but one suggestion I have would be to specify the latest Jackson 2.5 version, 2.5.5 instead of 2.5.3. While I understand caution in general, Jackson patch versions try to limit inclusion of fixes to safe ones and specifically there are no API changes. This also means that while it is good to keep even patch versions in sync across components there is very low risk of incompatibilities even if patch versions do differ.

In this specific case, jackson-databind 2.5.4 has 10 bug fixes -- see https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.5.4 -- this is the main reason I would recommend use of latest version.

@srowen
Copy link
Member

srowen commented Jan 28, 2016

I think that's OK. I was intending to propose a series of dependency updates for 2.x like this sort of thing.

@cowtowncoder
Copy link

@srowen Ah. Yes, just wanted to mention it; I don't have enough context so there may be specific reasons for picking up particular versions. Glad to see version upgrades coming along. My interests are selfish, being both the main author of Jackson, and a (relative new) Spark user at work. :)

With that, if there are problems from Jackson side I could help, do not hesitate to cc me and/or file issues at https://github.com/FasterXML/jackson-databind (or other specific components). I know version compatibility is an annoying issue, and would like to help if possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.