Skip to content

Conversation

@davies
Copy link
Contributor

@davies davies commented Feb 27, 2015

The eq of DataType is not correct, class cache is not use correctly (created class can not be find by dataType), then it will create lots of classes (saved in _cached_cls), never released.

Also, all same DataType have same hash code, there will be many object in a dict with the same hash code, end with hash attach, it's very slow to access this dict (depends on the implementation of CPython).

This PR also improve the performance of inferSchema (avoid the unnecessary converter of object).

cc @pwendell @JoshRosen

@davies
Copy link
Contributor Author

davies commented Feb 27, 2015

This PR works for 1.3+, will create another PR for 1.2 and 1.1

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28053 has started for PR 4808 at commit d9ae973.

  • This patch merges cleanly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are duplicated, also in types.py.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, good catch.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28053 has finished for PR 4808 at commit d9ae973.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28053/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28072 has started for PR 4808 at commit 46999dc.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28072 has finished for PR 4808 at commit 46999dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28072/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28079 has started for PR 4808 at commit 534ac90.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28079 has finished for PR 4808 at commit 534ac90.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28079/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28084 has started for PR 4808 at commit 3da44fc.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28084 has finished for PR 4808 at commit 3da44fc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28084/
Test FAILed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a breaking API change? Or were the old doctests showing incorrect usage of the API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old tests are incorrect.

@JoshRosen
Copy link
Contributor

It looks like _restore_object still tries to use DataType instance ids as _cached_cls dictionary keys during unpickling; is this still necessary if the DataTypes aren't singletons after unpickling?

@davies
Copy link
Contributor Author

davies commented Feb 27, 2015

@JoshRosen Because we serialized the objects in batch, and pickle memorize the multiple occurrences of same object in the batch, finally we will get single DataType object (even for StructType), we can benefits from this optimization, no __hash__ and __eq__ for later row in the batch.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28094 has started for PR 4808 at commit 6a322a4.

  • This patch merges cleanly.

@JoshRosen
Copy link
Contributor

Makes sense; LGTM. I'll take a look at the backport patches, too.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28094 has finished for PR 4808 at commit 6a322a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28094/
Test PASSed.

asfgit pushed a commit that referenced this pull request Feb 28, 2015
The _eq_ of DataType is not correct, class cache is not use correctly (created class can not be find by dataType), then it will create lots of classes (saved in _cached_cls), never released.

Also, all same DataType have same hash code, there will be many object in a dict with the same hash code, end with hash attach, it's very slow to access this dict (depends on the implementation of CPython).

This PR also improve the performance of inferSchema (avoid the unnecessary converter of object).

cc pwendell  JoshRosen

Author: Davies Liu <[email protected]>

Closes #4808 from davies/leak and squashes the following commits:

6a322a4 [Davies Liu] tests refactor
3da44fc [Davies Liu] fix __eq__ of Singleton
534ac90 [Davies Liu] add more checks
46999dc [Davies Liu] fix tests
d9ae973 [Davies Liu] fix memory leak in sql

(cherry picked from commit e0e64ba)
Signed-off-by: Josh Rosen <[email protected]>
@asfgit asfgit closed this in e0e64ba Feb 28, 2015
@JoshRosen
Copy link
Contributor

LGTM, so I've merged this into branch-1.3 (1.3.0) and master (1.4.0). Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants