Skip to content

Conversation

@MrBago
Copy link
Contributor

@MrBago MrBago commented Jul 5, 2018

What changes were proposed in this pull request?

This PR updates the Instrumentation class to make it more flexible and a little bit easier to use. When these APIs are merged, I'll followup with a PR to update the training code to use these new APIs so we can remove the old APIs. These changes are all to private APIs so this PR doesn't make any user facing changes.

How was this patch tested?

Existing tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

MrBago added 2 commits July 5, 2018 11:21
Instrumentation class. Updated LogisticRegression to use this API as an
example.
@SparkQA
Copy link

SparkQA commented Jul 5, 2018

Test build #92653 has finished for PR 21719 at commit 3a6537d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@mengxr mengxr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made one pass

protected[spark] def train(
dataset: Dataset[_],
handlePersistence: Boolean): LogisticRegressionModel = {
handlePersistence: Boolean): LogisticRegressionModel = Instrumentation.instrumented { instr =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid line too wide, we might want to import instrumented and save "Instrumentation" from this line.

if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)

val instr = Instrumentation.create(this, dataset)
instr.logContext(this, dataset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't log anything. I think we should auto-generate prefix and keep it as a constant. So logs would appear as:

[PREFIX]: instrumentation started
[PREFIX]: using estimator logReg-abc128
[PREFIX]: using dataset some hashcode
[PREFIX]: param maxIter=10
[PREFIX]: ...
[PREFIX]: run succeeded/failed
[PREFIX]: instrumentation ended

We can generate 8 random chars as the PREFIX. This is sufficient for correlate metrics from the same run. The issue with making it mutable is that we do not have a way to guarantee logContext is always called.

So I would suggest replacing logContext with the following:

  • logEstimator or logPipelineStage
  • logDataset

Btw, we can by default log call site. It provides more info for instrumentation, not necessary in this PR.

* @param estimator the estimator that is being fit
* @param dataset the training dataset
*/
def logContext(estimator: Estimator[_], dataset: RDD[_]): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment above

}

def logSuccess(): Unit = {
log("training finished")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't have this log alias. I was wondering which log level it uses. Just use logInfo and remove log(.

*/
def logFailure(e: Throwable): Unit = {
val msg = e.getStackTrace.mkString("\n")
super.logInfo(msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failures should go to ERROR level.


def instrumented[T](body: (Instrumentation => T)): T = {
val instr = new Instrumentation()
Try(body(new Instrumentation())) match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use already constructed instr

case Failure(NonFatal(e)) =>
instr.logFailure(e)
throw e
case Success(model) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model -> result, it doesn't need to be a model

@MrBago MrBago changed the title [SPARK-24747] Make Instrumentation class more flexible [SPARK-24747][ML] Make Instrumentation class more flexible Jul 6, 2018
@MrBago MrBago force-pushed the new-instrumentation-apis branch 2 times, most recently from 40c8b41 to 3a6537d Compare July 11, 2018 22:54
@SparkQA
Copy link

SparkQA commented Jul 12, 2018

Test build #92906 has finished for PR 21719 at commit b98d772.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

jenkins test this please

@SparkQA
Copy link

SparkQA commented Jul 12, 2018

Test build #92948 has finished for PR 21719 at commit b98d772.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than a couple of nits

train(dataset, handlePersistence)
}

import Instrumentation.instrumented
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put import at top of file with the other imports (just to make imports easier to track).

private val prefix = s"[$shortId] "

// TODO: update spark.ml to use new Instrumentation APIs and remove this constructor
var stage: Params = _
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend we either plan to remove "stage" or change "logPipelineStage" so it only allows setting "stage" once. If we go with the former, how about leaving a note to remove "stage" once spark.ml code is migrated to use the new logParams() method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the plan is to remove stage once we port switch over to the new APIs

@SparkQA
Copy link

SparkQA commented Jul 17, 2018

Test build #93189 has finished for PR 21719 at commit 1676a6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM
Merging with master
Thanks @MrBago !

@asfgit asfgit closed this in 912634b Jul 17, 2018
asfgit pushed a commit that referenced this pull request Jul 20, 2018
## What changes were proposed in this pull request?

Followup for #21719.
Update spark.ml training code to fully wrap instrumented methods and remove old instrumentation APIs.

## How was this patch tested?

existing tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Bago Amirbekian <[email protected]>

Closes #21799 from MrBago/new-instrumentation-apis2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants