
@liancheng
Contributor

DataFrame.collect() calls SparkPlan.executeCollect(), which consists of a single line:

execute().map(ScalaReflection.convertRowToScala(_, schema)).collect()

The problem is that QueryPlan.schema is a def, so the schema object is rebuilt every time it is called, and since 1.3.0 convertRowToScala returns a GenericRowWithSchema. As a result, every GenericRowWithSchema instance holds a separate copy of the schema object. A YJP profiling run of the following simple micro-benchmark (executed in the Spark shell) shows that constructing all these schema objects takes up to ~35% of CPU time.

// Write ten million (key, value) rows to a Parquet file
sc.parallelize(1 to 10000000).
  map(i => (i, s"val_$i")).
  toDF("key", "value").
  saveAsParquetFile("file:///tmp/src.parquet")

// Profiling started from this line
sqlContext.parquetFile("file:///tmp/src.parquet").collect()
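
For reference, a minimal sketch of the kind of fix this analysis points at (an assumption about the approach, not necessarily the actual patch in this PR): hoist the schema lookup out of the per-row closure so the StructType is built once per partition instead of once per row.

// Hypothetical sketch only; it reuses ScalaReflection.convertRowToScala and
// schema from the snippet above. Hoisting `schema` into a local val means the
// def is evaluated once per partition rather than for every row.
def executeCollect(): Array[Row] = {
  execute().mapPartitions { iter =>
    val localSchema = schema  // built once here, not inside iter.map
    iter.map(ScalaReflection.convertRowToScala(_, localSchema))
  }.collect()
}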

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29804 has started for PR 5398 at commit 3159469.

@rxin
Contributor

rxin commented Apr 7, 2015

LGTM

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29804 has finished for PR 5398 at commit 3159469.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29804/

@asfgit closed this in 77bcceb Apr 7, 2015
@liancheng deleted the spark-6748 branch April 7, 2015 23:54
@liancheng
Contributor Author

Merged to master.
