Skip to content

Conversation

@cloud-fan
Copy link
Contributor

When we cogroup 2 GroupedIterators in CoGroupedIterator, if the right side is smaller, we will consume right data and keep the left data unchanged. Then we call hasNext which will call left.hasNext. This will make GroupedIterator generate an extra group as the previous one has not been comsumed yet.

@yhuai
Copy link
Contributor

yhuai commented Oct 29, 2015

cc @marmbrus

@SparkQA
Copy link

SparkQA commented Oct 29, 2015

Test build #44559 has finished for PR 9346 at commit 9be67c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor

I'm a little confused. If GroupedIterator.hasNext isn't idempotent then we should fix that instead of working around it. However, as of master I wasn't able to reproduce a problem with it (but maybe my test case is insufficient?

    val schema = new StructType().add("i", IntegerType).add("s", StringType)
    val encoder = RowEncoder(schema)
    val input = Seq(Row(1, "a"), Row(1, "b"), Row(2, "c"))
    val grouped = GroupedIterator(input.iterator.map(encoder.toRow),
      Seq('i.int.at(0)), schema.toAttributes)

    val result = grouped.map {
      case (key, data) =>
        assert(key.numFields == 1)
        key.getInt(0) -> data.map(encoder.fromRow).toSeq
    }

    assert(result.hasNext)
    assert(result.hasNext)
    assert(result.hasNext)

    assert(result.next === 1 -> Seq(input(0), input(1)))

    assert(result.hasNext)
    assert(result.hasNext)
    assert(result.hasNext)

    assert(result.next === 2 -> Seq(input(2)))

    assert(!result.hasNext)
    assert(!result.hasNext)

@cloud-fan
Copy link
Contributor Author

Maybe "not idempotent" is not a proper word to describe this problem, GroupedIterator has a special constraint which is diffrent from normal iterator, and CoGroupedIterator breaks this constraint at the condition described in PR description.

@cloud-fan
Copy link
Contributor Author

btw as #9330 has been merge, the problem is not generating an extra empty group but making the last group empty.

@yhuai
Copy link
Contributor

yhuai commented Oct 29, 2015

If we use grouped directly and call hasNext right after we call next, the issue will be exposed.

@marmbrus
Copy link
Contributor

Okay, thanks. Merging to master.

@asfgit asfgit closed this in 14d08b9 Oct 30, 2015
@cloud-fan cloud-fan deleted the cogroup branch October 30, 2015 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants