[SPARK-12598][Core] bug in setMinPartitions #10546

datafarmer · 2016-01-01T18:37:04Z

There is a bug in the calculation of maxSplitSize. The totalLen should be divided by minPartitions and not by files.size.
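To make the difference concrete, here is a minimal stand-alone sketch of the two calculations. It is not the Spark source: `SplitSizeSketch` and both method names are hypothetical, plain `Long` values stand in for the Hadoop `FileStatus` lengths, and the pre-fix variant is assumed to divide by the file count as described above. The `math.max(minPartitions, 1.0)` guard follows the version suggested in review.

```scala
// Hypothetical sketch; the names below are illustrative and not part of Spark.
object SplitSizeSketch {

  // Assumed pre-fix behaviour: total length divided by the number of files,
  // so the requested minPartitions has no effect on the split size.
  def maxSplitSizeByFileCount(fileLens: Seq[Long]): Long =
    math.ceil(fileLens.sum * 1.0 / fileLens.size).toLong

  // Fixed behaviour: total length divided by the requested minimum number of
  // partitions, treating minPartitions = 0 as 1 to avoid dividing by zero.
  def maxSplitSizeByMinPartitions(fileLens: Seq[Long], minPartitions: Int): Long =
    math.ceil(fileLens.sum / math.max(minPartitions, 1.0)).toLong
}
```

With four files of 100 bytes each, the file-count variant always yields a maximum split size of 100, whatever `minPartitions` is. The fixed variant yields 200 for `minPartitions = 2`, 50 for `minPartitions = 8`, and 400 for `minPartitions = 0`.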

srowen · 2016-01-01T19:11:00Z

Agreed; compare to the implementation in WholeTextFileInputFormat. It can also be made tidier, and handle minPartitions = 0, with:

  def setMinPartitions(context: JobContext, minPartitions: Int) {
    val totalLen = listStatus(context).asScala.filterNot(_.isDir).map(_.getLen).sum
    val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 1.0)).toLong
    super.setMaxSplitSize(maxSplitSize)
  }

But @datafarmer please see https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for how we suggest changes first.

@kmader WDYT?

datafarmer · 2016-01-01T19:45:35Z

@srowen I guess that I should have created a JIRA ticket first. I just created one: SPARK-12598

srowen · 2016-01-02T11:20:41Z

@datafarmer go ahead and update the title here and consider updating the PR itself per above.

datafarmer · 2016-01-02T13:03:01Z

@srowen I'll update the PR per your changes. BTW, the FileStatus method isDir is deprecated. Should I change it to isDirectory, or is that something for another PR?

srowen · 2016-01-02T13:17:13Z

@datafarmer I've just seconds ago merged a change that replaces these deprecated calls, since we can assume Hadoop 2.2+ now. Yes, isDirectory is correct now.

srowen · 2016-01-07T11:06:04Z

@datafarmer are you able to update this?

datafarmer · 2016-01-07T13:41:44Z

@srowen It should already be updated per your request. Let me know if there is something else that needs to be done.

srowen · 2016-01-07T13:54:45Z

@datafarmer this still shows a merge conflict though. That's what needs to be resolved with a rebase.

datafarmer · 2016-01-07T14:27:17Z

@srowen Should be OK now.

SparkQA · 2016-01-07T16:22:58Z

Test build #2347 has finished for PR 10546 at commit 73b0e0b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

There is a bug in the calculation of `maxSplitSize`. The `totalLen` should be divided by `minPartitions` and not by `files.size`.

Author: Darek Blasiak

Closes #10546 from datafarmer/setminpartitionsbug.

(cherry picked from commit 8346518)
Signed-off-by: Sean Owen

srowen · 2016-01-07T21:18:14Z

Merged to master/1.6

datafarmer changed the title ~~Fixed bug in setMinPartitions~~ [SPARK-12598] bug in setMinPartitions Jan 2, 2016

datafarmer changed the title ~~[SPARK-12598] bug in setMinPartitions~~ [SPARK-12598][Core] bug in setMinPartitions Jan 2, 2016

Fixed bug in setMinPartitions

73b0e0b

datafarmer force-pushed the setminpartitionsbug branch from 2e21d1b to 73b0e0b Compare January 7, 2016 14:24

asfgit closed this in 8346518 Jan 7, 2016
