
Conversation

@mengxr
Contributor

mengxr commented Apr 9, 2014

This PR implements a generic version of `AreaUnderCurve` using the `RDD.sliding` implementation from #136. It also contains a refactoring of #160 for binary classification evaluation.
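For context, `AreaUnderCurve` computes the area under a curve given as an RDD of (x, y) points by summing trapezoids over consecutive point pairs, which `RDD.sliding(2)` makes easy to express. Below is a minimal sketch of that idea, not the PR's actual code; it assumes the `sliding` method from #136 is in scope via MLlib's `RDDFunctions`, returning `RDD[Seq[T]]` as this PR specifies (later Spark versions return `RDD[Array[T]]`), and that the points are already sorted by x:

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd.RDD

// Trapezoidal rule: consecutive points (x1, y1) and (x2, y2)
// contribute (y1 + y2) / 2 * (x2 - x1) to the total area.
def areaUnderCurve(curve: RDD[(Double, Double)]): Double = {
  curve.sliding(2).aggregate(0.0)(
    (acc, window) => window match {
      case Seq((x1, y1), (x2, y2)) => acc + (y1 + y2) / 2.0 * (x2 - x1)
      case _                       => acc // fewer than two points: no area
    },
    _ + _
  )
}
```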

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13921/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

mengxr changed the title from "[SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and BinaryClassificationEvaluator" to "[SPARK-1225, 1241] [MLLIB] [WIP] Add AreaUnderCurve and BinaryClassificationEvaluator" on Apr 9, 2014
@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13924/

mengxr changed the title from "[SPARK-1225, 1241] [MLLIB] [WIP] Add AreaUnderCurve and BinaryClassificationEvaluator" to "[SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and BinaryClassificationEvaluator" on Apr 9, 2014
@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13994/

@pwendell
Contributor

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13997/

@mateiz
Contributor

mateiz commented Apr 10, 2014

Jenkins, test this please

Contributor

Just a minor question: do you want to call these numTruePositives or just truePositives? Anyway, I'm happy to merge it as is; I just felt truePositives would be shorter.

Contributor Author

It is shorter but does not convey the exact meaning, since these are counts. Similarly, I prefer numCols over cols for matrices.
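To make the trade-off concrete, here is a small illustrative sketch (the class and field names are hypothetical, not this PR's actual API): the `num` prefix marks each field unambiguously as a count.

```scala
// Illustrative only: numTruePositives clearly denotes a count, whereas a
// bare truePositives could be misread as the set of positive examples.
case class ConfusionCounts(
    numTruePositives: Long,
    numFalsePositives: Long,
    numTrueNegatives: Long,
    numFalseNegatives: Long) {
  def precision: Double =
    numTruePositives.toDouble / (numTruePositives + numFalsePositives)
  def recall: Double =
    numTruePositives.toDouble / (numTruePositives + numFalseNegatives)
}
```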

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14020/

@mengxr
Contributor Author

mengxr commented Apr 11, 2014

The test failure was due to nondeterministic behavior in RDDSuite, which is fixed in #387.

@mengxr
Contributor Author

mengxr commented Apr 11, 2014

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14047/

@mateiz
Contributor

mateiz commented Apr 11, 2014

Thanks Xiangrui! Merged into both master and branch-1.0.

asfgit closed this in f5ace8d on Apr 12, 2014
asfgit pushed a commit that referenced this pull request Apr 12, 2014
…nMetrics

This PR implements a generic version of `AreaUnderCurve` using the `RDD.sliding` implementation from #136 . It also contains refactoring of #160 for binary classification evaluation.

Author: Xiangrui Meng <[email protected]>

Closes #364 from mengxr/auc and squashes the following commits:

a05941d [Xiangrui Meng] replace TP/FP/TN/FN by their full names
3f42e98 [Xiangrui Meng] add (0, 0), (1, 1) to roc, and (0, 1) to pr
fb4b6d2 [Xiangrui Meng] rename Evaluator to Metrics and add more metrics
b1b7dab [Xiangrui Meng] fix code styles
9dc3518 [Xiangrui Meng] add tests for BinaryClassificationEvaluator
ca31da5 [Xiangrui Meng] remove PredictionAndResponse
3d71525 [Xiangrui Meng] move binary evalution classes to evaluation.binary
8f78958 [Xiangrui Meng] add PredictionAndResponse
dda82d5 [Xiangrui Meng] add confusion matrix
aa7e278 [Xiangrui Meng] add initial version of binary classification evaluator
221ebce [Xiangrui Meng] add a new test to sliding
a920865 [Xiangrui Meng] Merge branch 'sliding' into auc
a9b250a [Xiangrui Meng] move sliding to mllib
cab9a52 [Xiangrui Meng] use last for the last element
db6cb30 [Xiangrui Meng] remove unnecessary toSeq
9916202 [Xiangrui Meng] change RDD.sliding return type to RDD[Seq[T]]
284d991 [Xiangrui Meng] change SlidedRDD to SlidingRDD
c1c6c22 [Xiangrui Meng] add AreaUnderCurve
65461b2 [Xiangrui Meng] Merge branch 'sliding' into auc
5ee6001 [Xiangrui Meng] add TODO
d2a600d [Xiangrui Meng] add sliding to rdd

(cherry picked from commit f5ace8d)
Signed-off-by: Matei Zaharia <[email protected]>
mengxr deleted the auc branch on May 7, 2014 00:09
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
tangzhankun pushed a commit to tangzhankun/spark that referenced this pull request Jul 25, 2017
* Adding PySpark Submit functionality. Launching Python from JVM

* Addressing scala idioms related to PR351

* Removing extends Logging which was necessary for LogInfo

* Refactored code to leverage the ContainerLocalizedFileResolver

* Modified Unit tests so that they would pass

* Modified Unit Test input to pass Unit Tests

* Setup working environment for integration tests for PySpark

* Comment out Python thread logic until Jenkins has python in Python

* Modifying PythonExec to pass on Jenkins

* Modifying python exec

* Added unit tests to ClientV2 and refactored to include pyspark submission resources

* Modified unit test check

* Scalastyle

* PR 348 file conflicts

* Refactored unit tests and styles

* further scala styling and logic

* Modified unit tests to be more specific towards Class in question

* Removed space delimiting for methods

* Submission client redesign to use a step-based builder pattern.

This change overhauls the underlying architecture of the submission
client, but it is intended to entirely preserve existing behavior of
Spark applications. Therefore users will find this to be an invisible
change.

The philosophy behind this design is to reconsider the breakdown of the
submission process. It operates off the abstraction of "submission
steps", which are transformation functions that take the previous state
of the driver and return the new state of the driver. The driver's state
includes its Spark configurations and the Kubernetes resources that will
be used to deploy it.

Such a refactor moves away from a features-first API design, which
considers different containers to serve a set of features. The previous
design, for example, had a container files resolver API object that
returned different resolutions of the dependencies added by the user.
However, it was up to the main Client to know how to intelligently
invoke all of those APIs. As a result, the API surface area of the file
resolver became untenably large, and it was not intuitive how it was
to be used or extended.

This design changes the encapsulation layout; every module is now
responsible for changing the driver specification directly. An
orchestrator builds the correct chain of steps and hands it to the
client, which then calls it verbatim. The main client then makes any
final modifications that put the different pieces of the driver
together, particularly to attach the driver container itself to the pod
and to apply the Spark configuration as command-line arguments. (A
sketch of this step pattern follows the commit log below.)

* Don't add the init-container step if all URIs are local.

* Python arguments patch + tests + docs

* Revert "Python arguments patch + tests + docs"

This reverts commit 4533df2.

* Revert "Don't add the init-container step if all URIs are local."

This reverts commit e103225.

* Revert "Submission client redesign to use a step-based builder pattern."

This reverts commit 5499f6d.

* style changes

* space for styling
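
The step-based design described in the squashed commit above reduces to a fold over transformation functions. A minimal sketch of that pattern, with hypothetical names standing in for the fork's actual types:

```scala
// DriverSpec and DriverConfigurationStep are hypothetical stand-ins for the
// "driver state" described above: Spark conf plus Kubernetes resources.
case class DriverSpec(sparkConf: Map[String, String], k8sResources: List[String])

trait DriverConfigurationStep {
  // Each step is a transformation: previous driver state in, new state out.
  def configure(spec: DriverSpec): DriverSpec
}

// Example step (hypothetical): flag the driver for PySpark support.
class PySparkStep extends DriverConfigurationStep {
  def configure(spec: DriverSpec): DriverSpec =
    spec.copy(sparkConf = spec.sparkConf + ("example.pyspark.enabled" -> "true"))
}

object Orchestrator {
  // The orchestrator picks the chain of steps; the client folds over it
  // verbatim, so each module edits the driver specification directly.
  def runSteps(steps: Seq[DriverConfigurationStep], initial: DriverSpec): DriverSpec =
    steps.foldLeft(initial)((spec, step) => step.configure(spec))
}
```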
erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request Jul 28, 2017
mccheah pushed a commit to mccheah/spark that referenced this pull request Oct 3, 2018
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…ase for timestamp ntz (apache#364)

backport apache#44428

### What changes were proposed in this pull request?

This fixes a correctness bug. TIMESTAMP_NTZ is a new data type in Spark, so no legacy files exist that need calendar rebasing. However, the vectorized Parquet reader treats it the same as LTZ and may rebase values if the Parquet file was written with the legacy rebase mode. This PR fixes the reader to never rebase NTZ values.
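
A minimal sketch of that guard, with hypothetical names standing in for the reader's internal types (the real fix lives in Spark's vectorized Parquet reader):

```scala
// Hypothetical simplification of the per-column decision in the reader.
sealed trait ParquetTimestampType
case object TimestampLTZ extends ParquetTimestampType // instant semantics
case object TimestampNTZ extends ParquetTimestampType // wall-clock semantics

object RebasePolicy {
  def needsRebase(tpe: ParquetTimestampType, writtenWithLegacyMode: Boolean): Boolean =
    tpe match {
      // NTZ is new, so no legacy Julian-calendar files exist: never rebase.
      case TimestampNTZ => false
      // LTZ values may come from old files written in legacy rebase mode.
      case TimestampLTZ => writtenWithLegacyMode
    }
}
```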

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

Yes, now we can correctly write and read back NTZ values even if the date is before 1582.

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#44446 from cloud-fan/ntz2.

Authored-by: Wenchen Fan <[email protected]>

Signed-off-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>