Skip to content

Conversation

@gerashegalov
Copy link
Contributor

@gerashegalov gerashegalov commented Feb 1, 2019

What changes were proposed in this pull request?

Provide ability to recover attributes for log url rewrite in #23260 from event logs produced by older spark versions using the log urls of live containers

The current solution in #23260 currently works only with the apps run on the to-be-released version of Spark. However, deployments with a wealth of historic event logs are left out. More importantly SHS is less critical and can be upgraded much faster than production apps.

Another improvement in this PR is going beyond a fixed set of supported variables for wider set of deployments. This PR proposes to use configurable named capturing groups for a regex to extract building blocks for a permalink. Below is an example for YARN.

How was this patch tested?

Just locally so far. With configuration like

spark.history.executor.log.url.regex            //(.+):\\d+/.*/(container_\\d+_\\d+_\\d+_[0-9]\\d+)/
spark.history.executor.log.url.regex.groups     YARN_NM_HOST,YARN_CID
spark.history.custom.executor.log.url           http://jhs.host.xyz:19888/jobhistory/logs/{{YARN_NM_HOST}}:8041/{{YARN_CID}}/{{YARN_CID}}/hadoop

I was able to use this PR with old event logs from EMR stored on S3

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Feb 1, 2019

Not sure we would want to have feature for just enabling old apps to also enjoy the feature. Might be ideal to craft a tool that migrates event logs? We know the format of YARN default log url, so it is possible to parse log url via regex or some rules, and instrument event log to add attribute.

@gerashegalov
Copy link
Contributor Author

With our order of magnitude this tool will be more like a large an ETL job, also require a replay for the logs already replayed to the kvstore.

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Feb 1, 2019

Hmm... got it. I don't have strong preference so let's see others' voice. Btw I think we should have preference to attributes and regex group (which is less stable than attributes), or maybe more ideally merge two attributes. Your patch looks to use attributes as a fail-back which could effectively disable the feature for newer apps.

@gerashegalov gerashegalov changed the title WIP for discussion: create attributes from log urls [WIP]: create attributes from live log urls to form a permalink past YARN log aggregation Feb 1, 2019
Copy link
Contributor

@vanzin vanzin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me, but as usual, needs a bug, tests, and removing the WIP from the title.

.get("stdout")
.flatMap(regEx.findFirstMatchIn)
.map { urlMatch =>
STDOUT_URL_GROUPS.foldLeft(exec.attributes) { (attribs, groupName) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this block should be executed at all when exec.attributes is not empty; IMO this should only apply to legacy event logs that don't record that information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Let's just use attributes if it exists. I can see the case this helps even if attributes are presented: the number of attributes are updated which new attributes can be extracted from URL, but at least with current YARN URL we have extracted almost everything so it will unlikely happen.

@vanzin
Copy link
Contributor

vanzin commented Apr 26, 2019

I'll close this for now since it hasn't been updated.

@vanzin vanzin closed this Apr 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants