Skip to content

Conversation

@liancheng
Copy link
Contributor

This PR migrates Parquet data source to the newly introduced FSBasedRelation. FSBasedParquetRelation is created to replace ParquetRelation2. Major differences are:

  1. Partition discovery code has been factored out to FSBasedRelation

  2. AppendingParquetOutputFormat is not used now. Instead, an anonymous subclass of ParquetOutputFormat is used to handle appending and writing dynamic partitions

  3. When scanning partitioned tables, FSBasedParquetRelation.buildScan only builds an RDD[Row] for a single selected partition

  4. FSBasedParquetRelation doesn't rely on Catalyst expressions for filter push down, thus it doesn't extend CatalystScan anymore

    After migrating JSONRelation (which extends CatalystScan), we can remove CatalystScan.

Review on Reviewable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code doesn't handle file names like parquet-r-00001.gz.parquet.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 12, 2015

Test build #32523 has started for PR 6090 at commit f4482ca.

@SparkQA
Copy link

SparkQA commented May 12, 2015

Test build #32523 has finished for PR 6090 at commit f4482ca.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32523/
Test FAILed.

@SparkQA
Copy link

SparkQA commented May 12, 2015

Test build #802 has started for PR 6090 at commit f4482ca.

@SparkQA
Copy link

SparkQA commented May 12, 2015

Test build #802 has finished for PR 6090 at commit f4482ca.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 12, 2015

Test build #32549 has started for PR 6090 at commit e40bb7b.

@SparkQA
Copy link

SparkQA commented May 13, 2015

Test build #32549 has finished for PR 6090 at commit e40bb7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32549/
Test PASSed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type does not need to be a FileOutputCommitter, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used FileOutputCommitter here because we need to retrieve the actual path of file being written, which is returned by FileOutputCommitter.getWorkPath. This implies customized output committers must be subclasses of FileOutputCommitter, which was true for DirectParquetOutputCommitter. But this restriction seems too strict. Resorting to OutputCommitter rather than FileOutputCommitter in another PR.

@liancheng liancheng force-pushed the parquet-migration branch from e40bb7b to a0a3ee9 Compare May 13, 2015 14:44
@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@liancheng
Copy link
Contributor Author

Rebased to #6118.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 13, 2015

Test build #32620 has started for PR 6090 at commit 3d770f4.

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32618/
Test FAILed.

@SparkQA
Copy link

SparkQA commented May 13, 2015

Test build #803 has started for PR 6090 at commit 3d770f4.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

classOf[OutputCommitter]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this comment is outdated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. When FileOutputCommitter is used, we still use FileOutputCommitter.getWorkPath() internally inside InsertIntoFSBasedRelation.

@liancheng liancheng force-pushed the parquet-migration branch from ff22b46 to 6063f87 Compare May 13, 2015 16:39
@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 13, 2015

Test build #32626 has started for PR 6090 at commit 6063f87.

@SparkQA
Copy link

SparkQA commented May 13, 2015

Test build #32620 has finished for PR 6090 at commit 3d770f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32620/
Test PASSed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this constructor is not defined in OutputCommitter.

Why do we need this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, in InsertIntoFSBasedRelation.run, we already set the the output path through FileOutputFormat.setOutputPath(job, qualifiedOutputPath). So, the output path should be set in context. Seems we only need to check if mapred.output.committer.class is set or not. If it is set, we create the output committer based on the specified class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

val committerClass = context.getConfiguration.getClass(
      "mapred.output.committer.class", null, classOf[OutputCommitter])

Option(committerClass).map { clazz =>
  val ctor = clazz.getDeclaredConstructor()
  ctor.newInstance()
}.getOrElse {
  outputFormatClass.newInstance().getOutputCommitter(context)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, if committerClass is based on mapred interface, setupJob will not work because mapred's output committer use mapred JobContext (a subclass of mapreduce's JobContext) and we are using Job in mapreduce package (another subclass of mapreduce's JobContext).

@marmbrus
Copy link
Contributor

I'm going to go ahead and merge this so we can start testing. Yin's concerns about using other output committers can be addressed in a followup (we should consider adding tests that use DirectOutputCommitter)

asfgit pushed a commit that referenced this pull request May 13, 2015
This PR migrates Parquet data source to the newly introduced `FSBasedRelation`. `FSBasedParquetRelation` is created to replace `ParquetRelation2`. Major differences are:

1. Partition discovery code has been factored out to `FSBasedRelation`
1. `AppendingParquetOutputFormat` is not used now. Instead, an anonymous subclass of `ParquetOutputFormat` is used to handle appending and writing dynamic partitions
1. When scanning partitioned tables, `FSBasedParquetRelation.buildScan` only builds an `RDD[Row]` for a single selected partition
1. `FSBasedParquetRelation` doesn't rely on Catalyst expressions for filter push down, thus it doesn't extend `CatalystScan` anymore

   After migrating `JSONRelation` (which extends `CatalystScan`), we can remove `CatalystScan`.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/6090)
<!-- Reviewable:end -->

Author: Cheng Lian <[email protected]>

Closes #6090 from liancheng/parquet-migration and squashes the following commits:

6063f87 [Cheng Lian] Casts to OutputCommitter rather than FileOutputCommtter
bfd1cf0 [Cheng Lian] Fixes compilation error introduced while rebasing
f9ea56e [Cheng Lian] Adds ParquetRelation2 related classes to MiMa check whitelist
261d8c1 [Cheng Lian] Minor bug fix and more tests
db65660 [Cheng Lian] Migrates Parquet data source to FSBasedRelation

(cherry picked from commit 7ff16e8)
Signed-off-by: Michael Armbrust <[email protected]>
@asfgit asfgit closed this in 7ff16e8 May 13, 2015
@SparkQA
Copy link

SparkQA commented May 13, 2015

Test build #803 has finished for PR 6090 at commit 3d770f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 13, 2015

Test build #32626 has finished for PR 6090 at commit 6063f87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32626/
Test PASSed.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
This PR migrates Parquet data source to the newly introduced `FSBasedRelation`. `FSBasedParquetRelation` is created to replace `ParquetRelation2`. Major differences are:

1. Partition discovery code has been factored out to `FSBasedRelation`
1. `AppendingParquetOutputFormat` is not used now. Instead, an anonymous subclass of `ParquetOutputFormat` is used to handle appending and writing dynamic partitions
1. When scanning partitioned tables, `FSBasedParquetRelation.buildScan` only builds an `RDD[Row]` for a single selected partition
1. `FSBasedParquetRelation` doesn't rely on Catalyst expressions for filter push down, thus it doesn't extend `CatalystScan` anymore

   After migrating `JSONRelation` (which extends `CatalystScan`), we can remove `CatalystScan`.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/6090)
<!-- Reviewable:end -->

Author: Cheng Lian <[email protected]>

Closes apache#6090 from liancheng/parquet-migration and squashes the following commits:

6063f87 [Cheng Lian] Casts to OutputCommitter rather than FileOutputCommtter
bfd1cf0 [Cheng Lian] Fixes compilation error introduced while rebasing
f9ea56e [Cheng Lian] Adds ParquetRelation2 related classes to MiMa check whitelist
261d8c1 [Cheng Lian] Minor bug fix and more tests
db65660 [Cheng Lian] Migrates Parquet data source to FSBasedRelation
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
This PR migrates Parquet data source to the newly introduced `FSBasedRelation`. `FSBasedParquetRelation` is created to replace `ParquetRelation2`. Major differences are:

1. Partition discovery code has been factored out to `FSBasedRelation`
1. `AppendingParquetOutputFormat` is not used now. Instead, an anonymous subclass of `ParquetOutputFormat` is used to handle appending and writing dynamic partitions
1. When scanning partitioned tables, `FSBasedParquetRelation.buildScan` only builds an `RDD[Row]` for a single selected partition
1. `FSBasedParquetRelation` doesn't rely on Catalyst expressions for filter push down, thus it doesn't extend `CatalystScan` anymore

   After migrating `JSONRelation` (which extends `CatalystScan`), we can remove `CatalystScan`.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/6090)
<!-- Reviewable:end -->

Author: Cheng Lian <[email protected]>

Closes apache#6090 from liancheng/parquet-migration and squashes the following commits:

6063f87 [Cheng Lian] Casts to OutputCommitter rather than FileOutputCommtter
bfd1cf0 [Cheng Lian] Fixes compilation error introduced while rebasing
f9ea56e [Cheng Lian] Adds ParquetRelation2 related classes to MiMa check whitelist
261d8c1 [Cheng Lian] Minor bug fix and more tests
db65660 [Cheng Lian] Migrates Parquet data source to FSBasedRelation
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
This PR migrates Parquet data source to the newly introduced `FSBasedRelation`. `FSBasedParquetRelation` is created to replace `ParquetRelation2`. Major differences are:

1. Partition discovery code has been factored out to `FSBasedRelation`
1. `AppendingParquetOutputFormat` is not used now. Instead, an anonymous subclass of `ParquetOutputFormat` is used to handle appending and writing dynamic partitions
1. When scanning partitioned tables, `FSBasedParquetRelation.buildScan` only builds an `RDD[Row]` for a single selected partition
1. `FSBasedParquetRelation` doesn't rely on Catalyst expressions for filter push down, thus it doesn't extend `CatalystScan` anymore

   After migrating `JSONRelation` (which extends `CatalystScan`), we can remove `CatalystScan`.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/6090)
<!-- Reviewable:end -->

Author: Cheng Lian <[email protected]>

Closes apache#6090 from liancheng/parquet-migration and squashes the following commits:

6063f87 [Cheng Lian] Casts to OutputCommitter rather than FileOutputCommtter
bfd1cf0 [Cheng Lian] Fixes compilation error introduced while rebasing
f9ea56e [Cheng Lian] Adds ParquetRelation2 related classes to MiMa check whitelist
261d8c1 [Cheng Lian] Minor bug fix and more tests
db65660 [Cheng Lian] Migrates Parquet data source to FSBasedRelation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants