@cmccabe cmccabe commented May 28, 2014

In Hadoop trunk (currently Hadoop 3.0.0), the deprecated
FSDataOutputStream#sync() method has been removed. Instead, we should
call FSDataOutputStream#hflush, which does the same thing as the
deprecated method used to do.

@rxin
Contributor

rxin commented May 28, 2014

Colin, it appears this method does not exist in older versions of Hadoop. I wonder if we need to put this into a shim...

@pwendell
Contributor

@cmccabe - ah I thought you said this was added in 0.21... Our default build compiles against Hadoop 1.0.4... isn't 1.0.4 newer?

@cmccabe
Author

cmccabe commented May 28, 2014

hflush was in hadoop 0.21. You can download http://archive.apache.org/dist/hadoop/core/hadoop-0.21.0/hadoop-0.21.0.tar.gz and check for yourself in common/src/java/org/apache/hadoop/fs/FSDataOutputStream.java.

I also verified that hadoop 1.0.4 does not have hflush (although, amusingly enough, it does have references to hflush in the code and documentation... from patches that were cherry-picked from other branches, presumably.) Instead, it has an implementation of hflush (I think?) inside the sync function.

Looking at the "Hadoop genealogy" reveals how this could have happened: http://2.bp.blogspot.com/-GO6HF0OAFHw/UOfNEH-4sEI/AAAAAAAAAD0/dEWFFYTRgYw/s1600/output-file.png

It looks like what happened was that the hadoop 0.20 line kind of diverged from the hadoop 0.21 line. The 1.0.4 release somehow came out of the 0.20 line, while the 0.21 line mutated into hadoop 2.x at some point. This was all before my time... even CDH3 had hflush, which is the oldest version of Hadoop I ever worked on.

Sounds like we're back to reflection tricks, then.

@pwendell
Contributor

Yeah, so I'm guessing @andrewor14 didn't use hflush because it wasn't there (which is consistent with the docs). If you are feeling adventurous, I think we could write a Scala macro to do this reflection at compile time. Regular reflection should work as well. I think you'd just want to check whether hflush is present and, if not, call sync.

@pwendell
Contributor

By the way, your chart has me thinking, we need to document the Spark version genealogy:

0.2 -> 0.3 -> 0.4 -> 0.5 -> 0.6 -> 0.7 -> 0.8 -> 0.9 -> 1.0

:P

@ash211
Contributor

ash211 commented May 28, 2014

Ha! On an actually-useful note, it'd be nice to have somewhere that lists
Spark versions and the dates they were released. Such information doesn't
exist on spark.apache.org does it?

@rxin
Contributor

rxin commented May 28, 2014

They do exist on GitHub: https://github.com/apache/spark/releases

@rxin
Contributor

rxin commented May 28, 2014

But definitely a good idea to make them more visible.

@ash211
Contributor

ash211 commented May 28, 2014

Wait nevermind, they're listed here: https://spark.apache.org/downloads.html

@cmccabe
Author

cmccabe commented May 28, 2014

The chart was made by Konstantin Boudnik, I just linked to it. I like the Spark version genealogy more-- it's a little easier to understand. :)

Here's a version that uses regular reflection.

Contributor

It appears that [getMethod()](http://docs.oracle.com/javase/7/docs/api/java/lang/Class.html#getMethod%28java.lang.String,%20java.lang.Class...%29) throws NoSuchMethodException rather than returning null.
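
To illustrate the point, here is a minimal sketch of a lookup that handles the exception instead of checking for null. `OldStream`, `GetMethodDemo`, and `lookup` are hypothetical names invented for this example; `OldStream` only stands in for a stream class that has `sync()` but no `hflush()` and is not a Hadoop class.

```scala
import java.lang.reflect.Method

// Hypothetical stand-in for an older stream class that only offers sync().
// It is not a Hadoop class; it just keeps the example self-contained.
class OldStream {
  def sync(): Unit = ()
}

object GetMethodDemo {
  // Class#getMethod reports a missing method by throwing
  // NoSuchMethodException, so a null check on its result never fires.
  // Catch the exception and convert the outcome into an Option instead.
  def lookup(cls: Class[_], name: String): Option[Method] =
    try Some(cls.getMethod(name))
    catch { case _: NoSuchMethodException => None }
}
```

With these definitions, `GetMethodDemo.lookup(classOf[OldStream], "hflush")` is `None`, while looking up `"sync"` returns the method wrapped in `Some`.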

Contributor

By the way, the "Scala" way to do this may just be

Try(cls.getMethod("hflush")).getOrElse(cls.getMethod("sync"))
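
For anyone reading along, a rough sketch of how that one-liner could be wired up end to end. `LegacyStream`, `ModernStream`, and `FlushCompat` are hypothetical stand-ins made up for this example (they are not Hadoop or Spark classes), and this is not necessarily what the patch itself does.

```scala
import java.lang.reflect.Method
import scala.util.Try

// Hypothetical stand-ins for the two API generations. Each records which
// flush method was called so the fallback can be observed.
class LegacyStream {
  var flushed: String = ""
  def sync(): Unit = { flushed = "sync" }
}

class ModernStream {
  var flushed: String = ""
  def hflush(): Unit = { flushed = "hflush" }
}

object FlushCompat {
  // Prefer hflush; fall back to sync when hflush is missing.
  // Try.getOrElse takes its default by name, so the sync lookup only
  // runs if the hflush lookup failed. If neither method exists, the
  // second getMethod call throws NoSuchMethodException.
  def flushMethod(cls: Class[_]): Method =
    Try(cls.getMethod("hflush")).getOrElse(cls.getMethod("sync"))
}
```

Usage would look like `FlushCompat.flushMethod(classOf[LegacyStream]).invoke(stream)`. Resolving the method once and reusing it for every flush keeps the reflective lookup off the write path.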

Author

thanks

@cmccabe cmccabe changed the title from "FileLogger: Fix compile against Hadoop trunk" to "SPARK-1518: FileLogger: Fix compile against Hadoop trunk" May 28, 2014

Contributor

Mind adding "See SPARK-1518" here? This might be a little hard to grok for someone not familiar with the nuances of Hadoop APIs.

Author

added


In Hadoop trunk (currently Hadoop 3.0.0), the deprecated
FSDataOutputStream#sync() method has been removed.  Instead, the
FSDataOutputStream#hflush method fills the same role.  We should call
hflush if it is available.  This patch uses reflection to maintain
support for old versions of Hadoop that do not have hflush, but which do
have the deprecated sync method.
@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15268/

@pwendell
Contributor

pwendell commented Jun 4, 2014

LGTM - thanks for this, Colin!

asfgit pushed a commit that referenced this pull request Jun 4, 2014
In Hadoop trunk (currently Hadoop 3.0.0), the deprecated
FSDataOutputStream#sync() method has been removed.  Instead, we should
call FSDataOutputStream#hflush, which does the same thing as the
deprecated method used to do.

Author: Colin McCabe <[email protected]>

Closes #898 from cmccabe/SPARK-1518 and squashes the following commits:

752b9d7 [Colin McCabe] FileLogger: Fix compile against Hadoop trunk
(cherry picked from commit 1765c8d)

Signed-off-by: Patrick Wendell <[email protected]>
@asfgit asfgit closed this in 1765c8d Jun 4, 2014
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014