@cloud-fan
Contributor

What changes were proposed in this pull request?

In `FileStreamSource.getBatch`, we create a `DataSource` with the specified schema so that the schema is not inferred again for every batch. However, we don't pass the partition columns, so the partitioning is still inferred again for every batch.

This PR fixes that by keeping the partition columns in `FileStreamSource`, alongside the schema.
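
To illustrate the idea, here is a minimal, self-contained sketch. It is not the actual patch and does not use Spark's classes: `PartitionInference`, `BatchPlan`, `InferringSource`, and `CachingSource` are hypothetical names invented for this example, and the schema is modelled as a plain `Seq[String]` rather than a `StructType`. The call counter exists only to make the repeated work visible.

```scala
// Illustrative model only: none of these types are Spark's real classes.

// Stand-in for partition discovery; counts calls so repeated work is visible.
object PartitionInference {
  var calls: Int = 0

  // Derives partition column names from `key=value` directory segments.
  def infer(path: String): Seq[String] = {
    calls += 1
    path.split("/").toSeq.filter(_.contains("=")).map(_.takeWhile(_ != '='))
  }
}

// Hypothetical description of what one batch would read.
final case class BatchPlan(
    files: Seq[String],
    schema: Seq[String],
    partitionColumns: Seq[String])

// Before: the schema is fixed up front, but partition columns are
// inferred again on every call to getBatch.
final class InferringSource(schema: Seq[String]) {
  def getBatch(files: Seq[String]): BatchPlan = {
    val inferred = files.headOption.map(PartitionInference.infer).getOrElse(Nil)
    BatchPlan(files, schema, inferred)
  }
}

// After: partition columns are kept next to the schema and reused by every batch.
final class CachingSource(schema: Seq[String], partitionColumns: Seq[String]) {
  def getBatch(files: Seq[String]): BatchPlan =
    BatchPlan(files, schema, partitionColumns)
}
```

In this model, `InferringSource` runs inference once per batch, while `CachingSource` only needs it once, when the source is constructed.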

How was this patch tested?

N/A

@cloud-fan
Contributor Author

CC @zsxwing @brkyvz @yhuai

@SparkQA

SparkQA commented Oct 21, 2016

Test build #67331 has finished for PR 15581 at commit ad7ef81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SourceInfo(name: String, schema: StructType, partitionColumns: Seq[String])

@zsxwing
Member

zsxwing commented Oct 21, 2016

LGTM. Merging to master and 2.0. Thanks!

@zsxwing
Member

zsxwing commented Oct 21, 2016

@cloud-fan there are conflicts with 2.0. Could you submit another PR for that?

@asfgit asfgit closed this in 1405702 Oct 21, 2016
@cloud-fan
Contributor Author

yea of course, I'll do it soon

@cloud-fan
Contributor Author

@zsxwing shall we backport this first? Seems in 2.0 we don't support partitioned file source.

@zsxwing
Member

zsxwing commented Oct 24, 2016

> @zsxwing shall we backport this first? Seems in 2.0 we don't support partitioned file source.

Done. I also merged this one into branch 2.0.

asfgit pushed a commit that referenced this pull request Oct 24, 2016
… in every batch

## What changes were proposed in this pull request?

In `FileStreamSource.getBatch`, we will create a `DataSource` with specified schema, to avoid inferring the schema again and again. However, we don't pass the partition columns, and will infer the partition again and again.

This PR fixes it by keeping the partition columns in `FileStreamSource`, like schema.

## How was this patch tested?

N/A

Author: Wenchen Fan <[email protected]>

Closes #15581 from cloud-fan/stream.
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017