Conversation

@steveloughran
Contributor

This patch pulls all the application cache logic out of the HistoryServer and into its own class with callbacks for operations (get &c). This makes unit tests straightforward with a small bit of mocking of SparkUI.
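In outline, the extraction looks something like this (a sketch only: ApplicationCache, SparkUI and the get/attach/detach callbacks reflect the patch's intent, but the exact signatures are illustrative):

    trait CacheOperations {
      def getAppUI(appId: String): Option[SparkUI]   // load a UI on a cache miss
      def attachSparkUI(ui: SparkUI): Unit           // bind it to the web server
      def detachSparkUI(ui: SparkUI): Unit           // unbind it on eviction
    }

    class ApplicationCache(operations: CacheOperations, retainedApps: Int) {
      private val cache = scala.collection.mutable.LinkedHashMap[String, SparkUI]()

      def get(appId: String): Option[SparkUI] = synchronized {
        cache.get(appId).orElse {
          val loaded = operations.getAppUI(appId)
          loaded.foreach { ui =>
            operations.attachSparkUI(ui)
            cache.put(appId, ui)
            if (cache.size > retainedApps) {
              val (oldestId, oldestUi) = cache.head   // insertion order: eldest first
              operations.detachSparkUI(oldestUi)
              cache.remove(oldestId)
            }
          }
          loaded
        }
      }
    }

A unit test can then drive get() against a mock CacheOperations and assert on the attach/detach calls, with no HistoryServer involved.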

@SparkQA

SparkQA commented Jun 22, 2015

Test build #35453 has finished for PR 6935 at commit 9dad9d1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheEntry(ui: SparkUI, completed: Boolean, timestamp: Long);
    • class PCAModel(JavaVectorTransformer):
    • class PCA(object):

@SparkQA

SparkQA commented Jun 22, 2015

Test build #35454 has finished for PR 6935 at commit d96f29a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheEntry(ui: SparkUI, completed: Boolean, timestamp: Long);
    • class Module(object):
    • class PCAModel(JavaVectorTransformer):
    • class PCA(object):

@steveloughran
Contributor Author

CacheEntry is an unintentionally public class; will fix. I have no idea where the others came from.

@steveloughran
Contributor Author

Style check failures. Not enough spaces, by the look of things:

[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/src/main/scala/org/apache/spark/deploy/history/ApplicationCache.scala:2: Header does not match expected text
[error] (core/compile:scalastyle) errors exist
[error] Total time: 9 s, completed Jun 22, 2015 7:49:06 AM
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/src/test/scala/org/apache/spark/deploy/history/ApplicationCacheSuite.scala:2: Header does not match expected text
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/src/test/scala/org/apache/spark/deploy/history/ApplicationCacheSuite.scala:172:4: Insert a space after the start of the comment
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/src/test/scala/org/apache/spark/deploy/history/ApplicationCacheSuite.scala:187:4: Insert a space after the start of the comment
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/src/test/scala/org/apache/spark/deploy/history/ApplicationCacheSuite.scala:199:4: Insert a space after the start of the comment
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/src/test/scala/org/apache/spark/deploy/history/ApplicationCacheSuite.scala:60:71: No space before token =
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/src/test/scala/org/apache/spark/deploy/history/ApplicationCacheSuite.scala:68:59: No space before token =
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/src/test/scala/org/apache/spark/deploy/history/ApplicationCacheSuite.scala:105:21: No space after token =
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/src/test/scala/org/apache/spark/deploy/history/ApplicationCacheSuite.scala:115:56: No space after token =
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/core/src/test/scala/org/apache/spark/deploy/history/ApplicationCacheSuite.scala:216:40: No space after token ,
[error] (core/test:scalastyle) errors exist
[error] Total time: 6 s, completed Jun 22, 2015 7:49:24 AM
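For anyone hitting the same checks: the "Header does not match expected text" errors just mean the standard Apache license header must appear verbatim at the top of each file, and the spacing fixes are mechanical. Illustrative before/after (the actual offending lines are in ApplicationCacheSuite.scala):

    //no space after the comment start   <- flagged
    // space after the comment start     <- passes

    val x=1        // spacing around = flagged
    val x = 1      // conforming form

    f(a,b)         // spacing after , flagged
    f(a, b)        // conforming form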

@XuTingjun
Contributor

@steveloughran, I think you've put too many spaces in some places.

@steveloughran
Contributor Author

I should add that I did consider actually using the last-updated information to decide whether to refresh, but decided that simply having a timeout was a good enough step #1.
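Roughly, reusing the patch's CacheEntry (the staleness rule itself is just a sketch):

    case class CacheEntry(ui: SparkUI, completed: Boolean, timestamp: Long)

    def stale(entry: CacheEntry, now: Long, refreshIntervalMs: Long): Boolean =
      !entry.completed && (now - entry.timestamp) > refreshIntervalMs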

@XuTingjun
Contributor

@steveloughran Have you tested it? I don't think it works.
Yes, you've implemented refreshing of incomplete apps. But I think the second time a "/history/appid" request is sent, it doesn't go to the /history handler; it goes to the history/appid handler, so the refresh code is never called and nothing refreshes.
/cc @squito

@steveloughran
Contributor Author

I probably tested different bits of it ... I verified that updated apps were coming in, but no, not a full functional test of the Web UI, as that would take a lot of preamble to set up. I do have all the code to set up such tests in SPARK-1537, but was trying to keep things separate... if we get that in, then this would be easy to add as a new test case.

@squito
Contributor

squito commented Jun 26, 2015

@steveloughran @XuTingjun Sorry, I finally took a closer look here. I tried running a manual test on this patch and on #6545, and it seems to me that neither one solves the problem. (Or perhaps I am misunderstanding the issue.)

Here are the steps I took:

  1. For this patch, modify the refresh time to 1 second, just for testing.
  2. Run the history server within an sbt session:
     build/sbt
     project core
     re-start
     <choose option for HistoryServer>
  3. In another window, fire up a spark-shell with event logging enabled, and run a few jobs (and leave the shell open):
     bin/spark-shell --conf "spark.eventLog.enabled=true"
     val d = sc.parallelize(1 to 100)
     d.count()
     d.count()
  4. View the app UI in the history server at localhost:18080, and also view the app's own UI at localhost:4040.
  5. Go back to the shell and run another d.count() job.
  6. Take another look at the UI, both for the app itself (which updates) and via the history server (which does not update, regardless of how long I wait).

@XuTingjun it sounded like you had done some manual testing on #6545 and believed that it should work. Am I doing something wrong in my test?

Unless I'm testing the wrong thing, I think this demonstrates the need to write a unit test which does basically the same thing, as it's hard to verify complete correctness here. I'm writing that test case now.

@steveloughran
Contributor Author

If you want to run tests, grab the code.

The tests themselves aren't that complex, but the underlying runner bits to bring up web pages, run probes once they are up, etc., are a lot of foundational pain.

@squito
Contributor

squito commented Jun 26, 2015

Sorry, I didn't see that comment about the tests ... I wrote something up which mimics the existing UISeleniumSuite.

From 2dddcd036a4cacf31d1ff5f15417fcdbcfdbc22e Mon Sep 17 00:00:00 2001
From: Imran Rashid <irashid@cloudera.com>
Date: Fri, 26 Jun 2015 14:16:16 -0500
Subject: [PATCH] failing test case

---
 .../spark/deploy/history/HistoryServerSuite.scala  | 74 +++++++++++++++++++++-
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
index e5b5e1b..a288490 100644
--- a/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
+++ b/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala
@@ -21,14 +21,24 @@ import java.net.{HttpURLConnection, URL}
 import java.util.zip.ZipInputStream
 import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

+import scala.concurrent.duration._
+import scala.language.postfixOps
+
 import com.google.common.base.Charsets
 import com.google.common.io.{ByteStreams, Files}
 import org.apache.commons.io.{FileUtils, IOUtils}
+import org.json4s.JsonAST.JArray
+import org.json4s.jackson.JsonMethods._
 import org.mockito.Mockito.when
+import org.openqa.selenium.WebDriver
+import org.openqa.selenium.htmlunit.HtmlUnitDriver
+import org.scalatest.concurrent.Eventually
+import org.scalatest.selenium.WebBrowser
 import org.scalatest.{BeforeAndAfter, Matchers}
 import org.scalatest.mock.MockitoSugar

-import org.apache.spark.{JsonTestUtils, SecurityManager, SparkConf, SparkFunSuite}
+import org.apache.spark.util.Utils
+import org.apache.spark._
 import org.apache.spark.ui.SparkUI

 /**
@@ -43,7 +53,7 @@ import org.apache.spark.ui.SparkUI
  * are considered part of Spark's public api.
  */
 class HistoryServerSuite extends SparkFunSuite with BeforeAndAfter with Matchers with MockitoSugar
-  with JsonTestUtils {
+  with JsonTestUtils with Eventually with WebBrowser {

   private val logDir = new File("src/test/resources/spark-events")
   private val expRoot = new File("src/test/resources/HistoryServerExpectations/")
@@ -264,6 +274,66 @@ class HistoryServerSuite extends SparkFunSuite with BeforeAndAfter with Matchers
     justHrefs should contain(link)
   }

+  test("incomplete apps get refreshed") {
+
+    implicit val webDriver: WebDriver = new HtmlUnitDriver
+    implicit val formats = org.json4s.DefaultFormats
+
+    val testDir = Utils.createTempDir()
+    testDir.deleteOnExit()
+
+    val myConf = new SparkConf()
+      .set("spark.history.fs.logDirectory", testDir.getAbsolutePath)
+      .set("spark.eventLog.dir", testDir.getAbsolutePath)
+      .set("spark.history.fs.update.interval", "1")
+      .set("spark.eventLog.enabled", "true")
+      .set("spark.history.cache.interval", "1")
+      .remove("spark.testing")
+    val provider = new FsHistoryProvider(myConf)
+    val securityManager = new SecurityManager(myConf)
+
+    val server = new HistoryServer(myConf, provider, securityManager, 18080)
+    val sc = new SparkContext("local", "test", myConf)
+    server.initialize()
+    server.bind()
+    val port = server.boundPort
+    try {
+      val d = sc.parallelize(1 to 10)
+      d.count()
+      val appId = eventually(timeout(20 seconds), interval(100 milliseconds)) {
+        val json = getContentAndCode("applications", port)._2.get
+        val apps = parse(json).asInstanceOf[JArray].arr
+        apps.size should be (1)
+        (apps(0) \ "id").extract[String]
+      }
+
+      def getNumJobs(suffix: String): Int = {
+        go to (s"http://localhost:$port/history/$appId$suffix")
+        findAll(cssSelector("tbody tr")).toIndexedSeq.size
+      }
+
+      getNumJobs("") should be (1)
+      d.count()
+      eventually(timeout(10 seconds), interval(100 milliseconds)) {
+        // this fails with patch https://github.com/apache/spark/pull/6935 as well, I'm not sure why
+        getNumJobs("") should be (2)
+      }
+
+      getNumJobs("/jobs") should be (2)
+      // Here is where we still have an error.  /jobs is handled by another servlet,
+      // not controlled by the app cache.  We need some way for each of those servlets
+      // to know that they can get expired, so they can be detached and then the reloaded
+      // one can get re-attached
+      d.count()
+      getNumJobs("/jobs") should be (3)
+
+    } finally {
+      sc.stop()
+      server.stop()
+    }
+
+  }
+
   def getContentAndCode(path: String, port: Int = port): (Int, Option[String], Option[String]) = {
     HistoryServerSuite.getContentAndCode(new URL(s"http://localhost:$port/api/v1/$path"))
   }
-- 
2.2.2

Patch #6545 is closer to working (for some reason the cache does not seem to be getting refreshed in this PR; maybe my test setup is wrong). But I do think the cache with a refresh interval is probably a better way to go, rather than reloading the data on every single request.

I have a rough understanding of what is going wrong when we make requests to http://localhost:18080/history/<appId>/jobs -- after the first time, that is handled by a completely separate servlet which, as @XuTingjun has pointed out, is never detached. We need a way to wrap each of the UI's servlets so they know when they need to refresh.
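One possible shape for that, purely a sketch (none of these names exist in the patch; StaleUiFilter and refreshIfStale are hypothetical):

    import javax.servlet.{Filter, FilterChain, FilterConfig, ServletRequest, ServletResponse}

    // Wrap every servlet a cached SparkUI attaches: before serving, ask the
    // cache whether this app's UI is out of date; the cache would detach the
    // stale handlers and attach freshly loaded ones.
    class StaleUiFilter(appId: String, cache: ApplicationCache) extends Filter {
      override def init(config: FilterConfig): Unit = {}
      override def destroy(): Unit = {}
      override def doFilter(req: ServletRequest, res: ServletResponse,
          chain: FilterChain): Unit = {
        cache.refreshIfStale(appId)   // hypothetical: probe, detach and reload
        chain.doFilter(req, res)
      }
    }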

Also, I just realized the same problem applies to the REST API at http://localhost:18080/api/v1/applications/, but that should be much simpler to handle; it's just one servlet. We should add a test case and fix that too, though.

@XuTingjun
Contributor

@squito, thank you very much; I agree with what you said.
About my patch #6545: it refreshes only on a request to http://localhost:18080/history/<appId>; a request to http://localhost:18080/history/<appId>/jobs is not affected. I think that's a bug in my patch. I don't have a good idea of how to detach the SparkUI handlers added under the /history/* handler, so I changed the response of the /history/* handler instead, but that only affects requests to http://localhost:18080/history/<appId>.

@squito
Contributor

squito commented Jun 29, 2015

@steveloughran do you think you can pick this up along with the test case? I think we now have a somewhat better understanding of the problem. I do think your approach is in the right direction.

@XuTingjun
Contributor

Yeah, I think this patch also can't refresh; it doesn't detach the handlers.
I think we need to think about when and where to detach them.

@steveloughran
Contributor Author

I like the Selenium test; it could be combined with the provider I wrote, which lets us programmatically create our own history, and add changes we can look for.

@steveloughran
Contributor Author

(BTW, I'm not going to be looking at this for a couple of weeks; I'm focusing on a (big) patch.)

@steveloughran steveloughran changed the title SPARK-7889 Jobs progress of apps on complete page of HistoryServer shows uncompleted [SPARK-7889] [CORE] WiP Jobs progress of apps on complete page of HistoryServer shows uncompleted Nov 20, 2015
@steveloughran steveloughran force-pushed the stevel/patches/SPARK-7889-history-cache branch from d96f29a to b8deeee on November 23, 2015 14:27
@SparkQA

SparkQA commented Nov 23, 2015

Test build #46534 has finished for PR 6935 at commit b8deeee.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheEntry(ui: SparkUI, completed: Boolean, timestamp: Long);

@SparkQA

SparkQA commented Nov 23, 2015

Test build #46535 has finished for PR 6935 at commit 6e48da6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CacheEntry(ui: SparkUI, completed: Boolean, timestamp: Long);

@SparkQA

SparkQA commented Nov 23, 2015

Test build #46537 has finished for PR 6935 at commit fe9504f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

As discussed on #6545, I think we could actually switch to live UI playback, provided the number of cached UIs is kept down.

What must be done here is to keep a limit on the number of threads doing playback; we'd need a pool sized to the number of loaded UIs, or, if the provider starts them, the number of cached UIs must be limited, and when an entry is removed from the cache, its thread has to be returned to the pool.

This could not only handle incomplete applications; it should also offer a more responsive load of completed ones: the UI could be registered while event playback was still in progress.

Proposal

  1. Supplement getAppUI(appId: String, attemptId: Option[String]): Option[SparkUI] with a call to start SparkUI playback, returning some case class holding the UI, playback info (in progress or not), and a closure to call when releasing the cached entry.
  2. When loading a complete attempt, the async playback would run until the playback finished, then the close() operation would be called.
  3. When loading an in-progress app attempt, the async playback would continue until one of: the app completes, the app is released from the cache, or the history server is halted (for in-VM tests).

Some providers (maybe YARN ATS at some point in the future) can do incremental playback. Even the filesystem one could, if it cached the file offset on the previous load and re-opened on updates (HDFS makes no guarantees about when/whether changes to in-progress writes become visible to existing input streams, only that new streams will see the data). These could be explicitly asked to refresh/update their entries, with state kept in the cache. History providers which don't support incremental updates to in-progress applications would have to be updated by a complete reload.
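In code form, item 1 of the proposal might look roughly like this (a sketch: only the existing getAppUI(appId, attemptId) signature is real; the rest is the proposal):

    case class LoadedAppUI(
        ui: SparkUI,
        playbackInProgress: () => Boolean,   // is the event replay still running?
        release: () => Unit)                 // called when the cache entry is dropped

    trait PlaybackCapableProvider {
      def startAppUIPlayback(
          appId: String,
          attemptId: Option[String]): Option[LoadedAppUI]
    }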

@SparkQA

SparkQA commented Nov 24, 2015

Test build #46602 has finished for PR 6935 at commit d2008d5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran steveloughran changed the title [SPARK-7889] [CORE] WiP Jobs progress of apps on complete page of HistoryServer shows uncompleted [SPARK-7889] [CORE] HistoryServer cache of incomplete applications to refresh via HistoryProvider Nov 24, 2015
@steveloughran steveloughran changed the title [SPARK-7889] [CORE] HistoryServer cache of incomplete applications to refresh via HistoryProvider [SPARK-7889] [CORE] WiP HistoryServer cache of incomplete applications to refresh via HistoryProvider Nov 24, 2015
@steveloughran
Contributor Author

This is a converged patch which

  1. Splits out all caching into ApplicationCache, for isolation and testing.
  2. Now asks the history server (which then asks the HistoryProvider) to see if an app has changed. If so, it triggers a refresh.
  3. Has a refresh window and tracks the time of every refresh probe, so it is not continually reloading applications which are streaming out data (the FsHistoryProvider already takes the replay hit on its listing scans, but other implementations don't; the window keeps cluster load down)
  4. Includes metrics counting requests and timing probe/getAppUI calls, so when a descendant of [SPARK-11373] [CORE] Add metrics to the History Server and FsHistoryProvider #9571 goes in, the cache is instrumented. Those counters are used in the tests to check internal cache state, BTW, so they have immediate use. And the toString() operator dumps them all out for test-failure diagnostics.
  5. The testing is at the cache level, with a stub implementation of the operations the history server has to perform. It's incomplete (we need a test to verify that on cache eviction the UIs are detached).

There's nothing yet for Spark UI tests (there's the template above), or for the REST API.
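For points 2 and 3 above, the guarded probe looks roughly like this (names are illustrative, not the patch's actuals):

    class Entry(val ui: SparkUI, val completed: Boolean, var probeTime: Long)

    def lookup(key: String, now: Long): SparkUI = {
      val entry = cache.get(key)                  // assume a hit, for brevity
      if (!entry.completed && now - entry.probeTime > refreshWindowMs) {
        probeCount.inc()                          // metric: probes attempted
        if (provider.isUpdated(key)) {            // has the app changed?
          refreshCount.inc()                      // metric: actual reloads
          cache.invalidate(key)                   // next get() reloads the UI
        }
        entry.probeTime = now                     // open a new refresh window
      }
      cache.get(key).ui
    }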

@SparkQA

SparkQA commented Nov 24, 2015

Test build #46635 has finished for PR 6935 at commit 8ba44a2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran steveloughran changed the title [SPARK-7889] [CORE] WiP HistoryServer cache of incomplete applications to refresh via HistoryProvider [SPARK-7889] [CORE] WiP HistoryServer to refresh cache of incomplete applications Nov 25, 2015
@steveloughran
Contributor Author

...thanks for the feedback. I'm fixing the merge, which is now triggering a regression (maybe a race condition in test startup).

Apps should go from incomplete -> complete; that's one of the recurrent issues we have today. When the attempt is replayed, its completed flag is changed and the FS history provider's applications field is updated. When the history server asks for the list of apps, it gets the updated histories, and so picks up the change.

@steveloughran
Contributor Author

...I should add that it depends on the head attempt in the list being complete; the filter in HistoryServer is very sensitive to ordering. If there's an incomplete history older than a complete one, the app is considered incomplete.
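That is, the completeness test effectively reduces to (illustrative):

    // The app counts as complete only if its first listed attempt is complete,
    // so an incomplete attempt sorted ahead of a complete one hides it.
    def isCompleted(app: ApplicationHistoryInfo): Boolean =
      app.attempts.headOption.exists(_.completed)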

@steveloughran steveloughran force-pushed the stevel/patches/SPARK-7889-history-cache branch from e0ad26d to 728b12c on February 5, 2016 14:10
@steveloughran
Contributor Author

Rebased against master; switched from a scan of the HTML view to the REST API to enumerate listings of complete/incomplete apps; added @squito's ? arg redirection and a test.

@squito
Contributor

squito commented Feb 8, 2016

Jenkins, test this please

   * This includes registering metrics with [[metricRegistry]]
   */
  private def init(): Unit = {
    allMetrics.foreach(e =>
Contributor

nit: I think this is a bit clearer with pattern matching: .foreach { case (name, metric) => ...
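That is, something like this (a sketch, assuming allMetrics is a sequence of (name, metric) pairs):

    allMetrics.foreach { case (name, metric) =>
      metricRegistry.register(name, metric)   // illustrative body
    }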

@SparkQA

SparkQA commented Feb 8, 2016

Test build #2522 has finished for PR 6935 at commit 728b12c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

logDebug(s"Attempt ${prevInfo.name}/${prevInfo.appId} size => $size")
Some(new FsApplicationAttemptInfo(prevInfo.logPath, prevInfo.name, prevInfo.appId,
prevInfo.attemptId, prevInfo.startTime, prevInfo.endTime, prevInfo.lastUpdated,
prevInfo.sparkUser, prevInfo.completed, size))
Contributor

This is the part which got me thinking about the alternate version in #11118. I thought it seemed wrong that you were using prevInfo.completed -- if the app had transitioned to completed, wouldn't you need to put the new status in? Well, it turns out you don't, because checkForLogs / mergeApplicationListings will already re-scan the existing in-progress attempts and update them if necessary (though there is a filesize vs. timestamp issue). And that is ultimately what transitions the app to complete status, since this code can't do it.

And that led me to think that maybe we should just leverage that scan behavior rather than doing something more complicated.
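Illustratively, the scan-driven update amounts to something like this (fileSize is an assumed accessor for the stored size):

    // An in-progress attempt is considered changed when its log has grown
    // since the previous scan; that rescan is what eventually flips it to
    // completed, not the cache.
    def attemptChanged(prev: FsApplicationAttemptInfo, scannedSize: Long): Boolean =
      !prev.completed && scannedSize > prev.fileSize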

@squito
Contributor

squito commented Feb 12, 2016

PS: can you close this one now?

@steveloughran steveloughran deleted the stevel/patches/SPARK-7889-history-cache branch February 12, 2016 11:24