Skip to content

Conversation

@datafarmer
Copy link
Contributor

There is a bug in the calculation of maxSplitSize. The totalLen should be divided by minPartitions and not by files.size.

@srowen
Copy link
Member

srowen commented Jan 1, 2016

Agree, compare to the impl in WholeTextInputFormat. Really it can be tidier, and fix minPartitions = 0, with:

  def setMinPartitions(context: JobContext, minPartitions: Int) {
    val totalLen = listStatus(context).asScala.filterNot(_.isDir).map(_.getLen).sum
    val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
    super.setMaxSplitSize(maxSplitSize)
  }

But @datafarmer please see https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for how we suggest changes first.

@kmader WDYT?

@datafarmer
Copy link
Contributor Author

@srowen I guess that I should have created a JIRA ticket first. I just created one: SPARK-12598

@srowen
Copy link
Member

srowen commented Jan 2, 2016

@datafarmer go ahead and update the title here and consider updating the PR itself per above.

@datafarmer datafarmer changed the title Fixed bug in setMinPartitions [SPARK-12598] bug in setMinPartitions Jan 2, 2016
@datafarmer
Copy link
Contributor Author

@srowen I'll update the PR per your changes. BTW, the FileStatus method isDir is deprecated. Should I change it to isDirectory, or is that something for another PR?

@srowen
Copy link
Member

srowen commented Jan 2, 2016

@datafarmer I've just seconds ago merged a change that replaces these deprecated calls, since we can assume Hadoop 2.2+ now. Yes, isDirectory is correct now.

@datafarmer datafarmer changed the title [SPARK-12598] bug in setMinPartitions [SPARK-12598][Core] bug in setMinPartitions Jan 2, 2016
@srowen
Copy link
Member

srowen commented Jan 7, 2016

@datafarmer are you able to update this?

@datafarmer
Copy link
Contributor Author

@srowen It should already be updated per your request. Let me know if there is something else that needs to be done.

@srowen
Copy link
Member

srowen commented Jan 7, 2016

@datafarmer this still shows a merge conflict though. That's what needs to be resolved with a rebase.

@datafarmer datafarmer force-pushed the setminpartitionsbug branch from 2e21d1b to 73b0e0b Compare January 7, 2016 14:24
@datafarmer
Copy link
Contributor Author

@srowen Should be OK now.

@SparkQA
Copy link

SparkQA commented Jan 7, 2016

Test build #2347 has finished for PR 10546 at commit 73b0e0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 8346518 Jan 7, 2016
asfgit pushed a commit that referenced this pull request Jan 7, 2016
There is a bug in the calculation of ```maxSplitSize```.  The ```totalLen``` should be divided by ```minPartitions``` and not by ```files.size```.

Author: Darek Blasiak <[email protected]>

Closes #10546 from datafarmer/setminpartitionsbug.

(cherry picked from commit 8346518)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Copy link
Member

srowen commented Jan 7, 2016

Merged to master/1.6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants