
Conversation

@mengxr
Contributor

mengxr commented Apr 9, 2014

This PR implements a generic version of `AreaUnderCurve` using the `RDD.sliding` implementation from #136. It also contains a refactoring of #160 for binary classification evaluation.
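For context, `AreaUnderCurve` computes the area under a curve given as an RDD of (x, y) points by summing trapezoids over consecutive point pairs, which `RDD.sliding(2)` makes easy to express. Below is a minimal sketch of that idea, not the PR's actual code; it assumes the `sliding` method from #136 is in scope via MLlib's `RDDFunctions`, returning `RDD[Seq[T]]` as this PR specifies (later Spark versions return `RDD[Array[T]]`), and that the points are already sorted by x:

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd.RDD

// Trapezoidal rule: consecutive points (x1, y1) and (x2, y2)
// contribute (y1 + y2) / 2 * (x2 - x1) to the total area.
def areaUnderCurve(curve: RDD[(Double, Double)]): Double = {
  curve.sliding(2).aggregate(0.0)(
    (acc, window) => window match {
      case Seq((x1, y1), (x2, y2)) => acc + (y1 + y2) / 2.0 * (x2 - x1)
      case _                       => acc // fewer than two points: no area
    },
    _ + _
  )
}
```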

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13921/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

mengxr changed the title from "[SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and BinaryClassificationEvaluator" to "[SPARK-1225, 1241] [MLLIB] [WIP] Add AreaUnderCurve and BinaryClassificationEvaluator" on Apr 9, 2014
@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13924/

mengxr changed the title from "[SPARK-1225, 1241] [MLLIB] [WIP] Add AreaUnderCurve and BinaryClassificationEvaluator" to "[SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and BinaryClassificationEvaluator" on Apr 9, 2014
@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13994/

@pwendell
Contributor

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13997/

@mateiz
Contributor

mateiz commented Apr 10, 2014

Jenkins, test this please

Contributor

Just a minor question: do you want to call these numTruePositives or just truePositives? Anyway, I'm happy to merge it as is; I just felt truePositives would be shorter.

Contributor Author

It is shorter but does not convey the exact meaning, since these are counts. Similarly, I prefer numCols over cols for matrices.
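To make the trade-off concrete, here is a small illustrative sketch (the class and field names are hypothetical, not this PR's actual API): the `num` prefix marks each field unambiguously as a count.

```scala
// Illustrative only: numTruePositives clearly denotes a count, whereas a
// bare truePositives could be misread as the set of positive examples.
case class ConfusionCounts(
    numTruePositives: Long,
    numFalsePositives: Long,
    numTrueNegatives: Long,
    numFalseNegatives: Long) {
  def precision: Double =
    numTruePositives.toDouble / (numTruePositives + numFalsePositives)
  def recall: Double =
    numTruePositives.toDouble / (numTruePositives + numFalseNegatives)
}
```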

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14020/

@mengxr
Contributor Author

mengxr commented Apr 11, 2014

The test failure was due to nondeterministic behavior in RDDSuite, which is fixed in #387.

@mengxr
Contributor Author

mengxr commented Apr 11, 2014

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14047/

@mateiz
Contributor

mateiz commented Apr 11, 2014

Thanks Xiangrui! Merged into both master and branch-1.0.

asfgit closed this in f5ace8d on Apr 12, 2014
asfgit pushed a commit that referenced this pull request Apr 12, 2014
…nMetrics

This PR implements a generic version of `AreaUnderCurve` using the `RDD.sliding` implementation from #136 . It also contains refactoring of #160 for binary classification evaluation.

Author: Xiangrui Meng <[email protected]>

Closes #364 from mengxr/auc and squashes the following commits:

a05941d [Xiangrui Meng] replace TP/FP/TN/FN by their full names
3f42e98 [Xiangrui Meng] add (0, 0), (1, 1) to roc, and (0, 1) to pr
fb4b6d2 [Xiangrui Meng] rename Evaluator to Metrics and add more metrics
b1b7dab [Xiangrui Meng] fix code styles
9dc3518 [Xiangrui Meng] add tests for BinaryClassificationEvaluator
ca31da5 [Xiangrui Meng] remove PredictionAndResponse
3d71525 [Xiangrui Meng] move binary evalution classes to evaluation.binary
8f78958 [Xiangrui Meng] add PredictionAndResponse
dda82d5 [Xiangrui Meng] add confusion matrix
aa7e278 [Xiangrui Meng] add initial version of binary classification evaluator
221ebce [Xiangrui Meng] add a new test to sliding
a920865 [Xiangrui Meng] Merge branch 'sliding' into auc
a9b250a [Xiangrui Meng] move sliding to mllib
cab9a52 [Xiangrui Meng] use last for the last element
db6cb30 [Xiangrui Meng] remove unnecessary toSeq
9916202 [Xiangrui Meng] change RDD.sliding return type to RDD[Seq[T]]
284d991 [Xiangrui Meng] change SlidedRDD to SlidingRDD
c1c6c22 [Xiangrui Meng] add AreaUnderCurve
65461b2 [Xiangrui Meng] Merge branch 'sliding' into auc
5ee6001 [Xiangrui Meng] add TODO
d2a600d [Xiangrui Meng] add sliding to rdd

(cherry picked from commit f5ace8d)
Signed-off-by: Matei Zaharia <[email protected]>
mengxr deleted the auc branch on May 7, 2014 00:09
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
tangzhankun pushed a commit to tangzhankun/spark that referenced this pull request Jul 25, 2017
* Adding PySpark Submit functionality. Launching Python from JVM

* Addressing scala idioms related to PR351

* Removing extends Logging which was necessary for LogInfo

* Refactored code to leverage the ContainerLocalizedFileResolver

* Modified Unit tests so that they would pass

* Modified Unit Test input to pass Unit Tests

* Setup working environment for integration tests for PySpark

* Comment out Python thread logic until Jenkins has python in Python

* Modifying PythonExec to pass on Jenkins

* Modifying python exec

* Added unit tests to ClientV2 and refactored to include pyspark submission resources

* Modified unit test check

* Scalastyle

* PR 348 file conflicts

* Refactored unit tests and styles

* further scala styling and logic

* Modified unit tests to be more specific towards Class in question

* Removed space delimiting for methods

* Submission client redesign to use a step-based builder pattern.

This change overhauls the underlying architecture of the submission
client, but it is intended to entirely preserve existing behavior of
Spark applications. Therefore users will find this to be an invisible
change.

The philosophy behind this design is to reconsider the breakdown of the
submission process. It operates off the abstraction of "submission
steps", which are transformation functions that take the previous state
of the driver and return the new state of the driver. The driver's state
includes its Spark configurations and the Kubernetes resources that will
be used to deploy it.

Such a refactor moves away from a features-first API design, which
considers different containers to serve a set of features. The previous
design, for example, had a container files resolver API object that
returned different resolutions of the dependencies added by the user.
However, it was up to the main Client to know how to intelligently
invoke all of those APIs. As a result, the API surface area of the file
resolver became untenably large, and it was not intuitive how it was
to be used or extended.

This design changes the encapsulation layout; every module is now
responsible for changing the driver specification directly. An
orchestrator builds the correct chain of steps and hands it to the
client, which then calls it verbatim. The main client then makes any
final modifications that put the different pieces of the driver
together, particularly to attach the driver container itself to the pod
and to apply the Spark configuration as command-line arguments. (A
sketch of this step pattern follows the commit log below.)

* Don't add the init-container step if all URIs are local.

* Python arguments patch + tests + docs

* Revert "Python arguments patch + tests + docs"

This reverts commit 4533df2.

* Revert "Don't add the init-container step if all URIs are local."

This reverts commit e103225.

* Revert "Submission client redesign to use a step-based builder pattern."

This reverts commit 5499f6d.

* style changes

* space for styling
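
The step-based design described in the squashed commit above reduces to a fold over transformation functions. A minimal sketch of that pattern, with hypothetical names standing in for the fork's actual types:

```scala
// DriverSpec and DriverConfigurationStep are hypothetical stand-ins for the
// "driver state" described above: Spark conf plus Kubernetes resources.
case class DriverSpec(sparkConf: Map[String, String], k8sResources: List[String])

trait DriverConfigurationStep {
  // Each step is a transformation: previous driver state in, new state out.
  def configure(spec: DriverSpec): DriverSpec
}

// Example step (hypothetical): flag the driver for PySpark support.
class PySparkStep extends DriverConfigurationStep {
  def configure(spec: DriverSpec): DriverSpec =
    spec.copy(sparkConf = spec.sparkConf + ("example.pyspark.enabled" -> "true"))
}

object Orchestrator {
  // The orchestrator picks the chain of steps; the client folds over it
  // verbatim, so each module edits the driver specification directly.
  def runSteps(steps: Seq[DriverConfigurationStep], initial: DriverSpec): DriverSpec =
    steps.foldLeft(initial)((spec, step) => step.configure(spec))
}
```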
erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request Jul 28, 2017
mccheah pushed a commit to mccheah/spark that referenced this pull request Oct 3, 2018
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…ase for timestamp ntz (apache#364)

backport apache#44428

### What changes were proposed in this pull request?

This fixes a correctness bug. TIMESTAMP_NTZ is a new data type in Spark, so no legacy files exist that need calendar rebasing. However, the vectorized Parquet reader treats it the same as LTZ and may rebase values if the Parquet file was written with the legacy rebase mode. This PR fixes the reader to never rebase NTZ values.
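
A minimal sketch of that guard, with hypothetical names standing in for the reader's internal types (the real fix lives in Spark's vectorized Parquet reader):

```scala
// Hypothetical simplification of the per-column decision in the reader.
sealed trait ParquetTimestampType
case object TimestampLTZ extends ParquetTimestampType // instant semantics
case object TimestampNTZ extends ParquetTimestampType // wall-clock semantics

object RebasePolicy {
  def needsRebase(tpe: ParquetTimestampType, writtenWithLegacyMode: Boolean): Boolean =
    tpe match {
      // NTZ is new, so no legacy Julian-calendar files exist: never rebase.
      case TimestampNTZ => false
      // LTZ values may come from old files written in legacy rebase mode.
      case TimestampLTZ => writtenWithLegacyMode
    }
}
```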

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

Yes, now we can correctly write and read back NTZ values even if the date is before 1582.

### How was this patch tested?

new test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#44446 from cloud-fan/ntz2.

Authored-by: Wenchen Fan <[email protected]>

Signed-off-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>