Conversation

@cboggs cboggs commented Feb 14, 2018

EDIT: Please see #703 for description

@jml - I'm not entirely sure that adding an override to Gopkg.toml was the proper fix, so if I got it wrong, let me know what the proper method is and I'll fix it up. :-)

Thanks!

jml commented Feb 15, 2018 via email

cboggs commented Feb 15, 2018

Gotcha. I didn't run dep ensure -update (or whatever it was) because it sounded like it would update ALL dependencies, and that made me nervous. I'll give it a shot in a bit and see what the damage is. :-)

@bboreham

Just run dep ensure -update <path> to update one package.
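
(For example, with the dependency path shown purely as an illustration: dep ensure -update github.com/weaveworks/common pulls in just that one dependency and leaves the rest of Gopkg.lock alone.)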

@bboreham

Also you must check in Gopkg.lock when you change the vendor'd libs.

@bboreham

Since there are a couple more updates brought in from weaveworks/common, I think it is worth calling them out:

Log gRPC request on error weaveworks/common#85
Don't log request body and headers for 502 errors weaveworks/common#84

And now I'm a bit worried by weaveworks/common#85 - will it log the entire set of samples on a failed Push?

cboggs commented Feb 19, 2018

@bboreham, thanks for the pointers! I've removed the override, but the vendor update should still be in place.

I'm not sure how to go about answering your question re: weaveworks/common#85, unfortunately. Any guidance there?

awh commented Feb 20, 2018

> will it log the entire set of samples on a failed push?

@bboreham yes it will. It was added to help diagnose errors in other gRPC calls with smaller argument lists, not realising that it was being used here...

jml commented Feb 21, 2018

@bboreham Do you object to us merging this as is? What do we need to do to get it ready to merge?

@bboreham

I would suggest testing the error path and then deciding if that level of logging is acceptable.

@bboreham

I notice #705 also contains the update to the latest version of dep, so one of them is going to have to rebase after the other is merged.

cboggs commented Feb 23, 2018

I'd be glad to wait for #705, @bboreham and @tomwilkie. The dep speedup will be very nice to have, as it's been painful keeping the vendor bits in this branch up to date. :-)

@bboreham

Has the error logging been tested yet?

@cboggs cboggs force-pushed the configurable-trace-sampling branch from 3dcffef to 303fba6 on February 27, 2018 17:06
cboggs commented Feb 27, 2018

Finally just blew away my local branch and forced things to a sane state from latest master.

@bboreham I'm not sure how to go about testing the error logging, short of letting it run for a while in our staging cluster and seeing what happens in the logs.

Do you know of a quick / easy-ish way to cause this failure pattern so that I can give some quicker feedback?

Thanks!

cboggs commented Feb 27, 2018

@bboreham I think I may have inadvertently found an instance of what you expect to see in the logs with the recent change to common:

time="2018-02-27T23:03:19Z" level=warning msg=gRPC duration=8.160544586s error="rpc error: code = Code(400) desc = sample timestamp out of order for series <series bits snipped>; last timestamp: 1519772598.344, incoming timestamp: 1519772588.345" method=/cortex.Ingester/Push request="&WriteRequest{Timeseries:[{[{[95 95 110 97 109 101 95 95] [102 116 95 104 119 95 102 111 114 101 99 97 115 116 58 102 116 95 97 103 103 114 101 103 97 116 105 111 110 58 102... <many many tens of thousands more bytes>

Sure looks like a log of all samples in the failed push. Let me know if that's not what you were expecting and I'll dig a bit more.

@bboreham bboreham closed this Feb 27, 2018
@bboreham bboreham reopened this Feb 27, 2018
cboggs commented Feb 28, 2018

@bboreham, do you think it would be reasonable to explicitly truncate failed gRPC request logs past some reasonable-ish length?

I'm thinking something like 512 bytes or some such, but I'm not firm on that number.

I'm also open to other tricks that could retain the improved logging but avoid the pain of many-KB messages on a failed push.
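
(For concreteness, the kind of truncation being floated might look roughly like the sketch below. The names, the 512-byte limit, and the assumption that the middleware stringifies the request before logging are all illustrative, not the actual weaveworks/common code.)

```go
package main

import (
	"fmt"
	"strings"
)

// maxLoggedRequestBytes is purely illustrative; 512 is just the number floated above.
const maxLoggedRequestBytes = 512

// truncateForLog caps the stringified request that gets attached to an error
// log line, so a failed Push cannot dump tens of kilobytes of samples.
// (Byte-slicing assumes ASCII content; a real version might cut on rune boundaries.)
func truncateForLog(req fmt.Stringer) string {
	s := req.String()
	if len(s) <= maxLoggedRequestBytes {
		return s
	}
	return s[:maxLoggedRequestBytes] + "... (truncated)"
}

// fakeRequest stands in for a large WriteRequest in this sketch.
type fakeRequest struct{ body string }

func (f fakeRequest) String() string { return f.body }

func main() {
	req := fakeRequest{body: strings.Repeat("sample ", 1000)}
	fmt.Println(truncateForLog(req))
}
```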

bboreham commented Mar 2, 2018

Truncating may be a workable compromise; however, that dump you gave as an example doesn't look readable at any length. 95 95 110 97 109 101 95 95 is __name__ - does the dumping code look for a Stringer we could add?
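
(For reference, this is what the Stringer suggestion amounts to. The type and field names below are made up to mimic a protobuf-generated label pair with []byte fields; whether the actual dumping path would call such a method is exactly the open question here.)

```go
package main

import "fmt"

// labelPair mimics a generated type whose fields are []byte, which is why
// %v prints labels as decimal byte values like [95 95 110 97 109 101 95 95].
type labelPair struct {
	Name  []byte
	Value []byte
}

// String implements fmt.Stringer, so the fmt package prints readable text
// instead of the raw byte values.
func (l labelPair) String() string {
	return fmt.Sprintf("%s=%q", l.Name, l.Value)
}

func main() {
	l := labelPair{Name: []byte("__name__"), Value: []byte("up")}
	// Without the String method this prints {[95 95 110 97 109 101 95 95] [117 112]};
	// with it, it prints __name__="up".
	fmt.Println(l)
}
```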

cboggs commented Mar 2, 2018

Definitely not readable at any length. I'll try to peruse the dumping code a bit next week once we've tackled a few other things, and see if we can at least make it readable for some preamble length.

cboggs commented Mar 6, 2018

@bboreham, @jml, do you think the above-referenced PR will address this sufficiently?

If so, once it's merged I'll vendor it into this branch and add the config option to the ingester, and we should be golden.

jml commented Mar 7, 2018

I abstain. Sorry!

bboreham commented Mar 9, 2018

The PR in common looks plausible; it's not clear what the answer to my Stringer question is. I would much rather see "__name__" than 95 95 110 97 109 101 95 95.

cboggs commented Mar 9, 2018

Ah, oops, I forgot to address that part.

The trouble I ran into there is that grpc_logging is creating legitimate HTTP responses out of the RPC errors that it receives, and trying to convert the body of the resulting HTTP response to a valid string turned into a typecasting struggle for me. There's a strong chance I missed an easy solution, though.

@csmarchbanks, do you have some time to crawl this code with me and see if there's a reasonable place to implement an interface or something for this?

cboggs commented Mar 12, 2018

@bboreham, I had a quick clarifying question for you on the logging output issue.

Is there a particular reason you'd like to see the string-ified version of the byte array that's being logged? I ask because the full series is already logged before the byte array, and the only additional information in the request body is the full set of samples and timestamps. String-ifying would shrink the output a bit (by the difference between each character and its multi-digit decimal code, plus the whitespace between codes), but the rest of the output would still be a big nested array of floats.
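
(For a rough sense of scale, assuming plain ASCII: __name__ is 8 bytes as text but 26 bytes rendered as the decimal values 95 95 110 97 109 101 95 95, so string-ifying shrinks the label portion by roughly two thirds, while the nested arrays of samples and timestamps still dominate the total.)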

If this is definitely needed, I can spend the time to figure out a way to do so (it's still not proving to be as trivial as I'd hoped), but I'd like to better understand that need first. :-)

(I'm hoping we can get this particular PR unblocked sooner than later, as tracing is entirely non-viable in anything beyond a playground environment with zero load without this change. 😢)

Thanks!

bboreham commented Mar 12, 2018

I don't especially need to see the string; I just think it's dumb to print something we know is a string as its ASCII codes.

Happy to have that filed as a clean-up issue to do afterwards.

cboggs commented Mar 12, 2018

Agreed that it's dumb to print a string as raw ASCII codes. :-) I just think it's easier, and in this specific case pretty reasonable, to omit that massive output entirely.

I'll write up an issue to do the cleanup!

@bboreham

I went back to weaveworks/common#89; I can't see where it is truncating at N bytes - it seems to be just omitting the request details completely.

cboggs commented Mar 12, 2018

Yes, that's true. Maybe "truncate" is the wrong word on my part.

I started down the path of a byte limit, but since the data at the start of an ingester Push request duplicates data logged earlier in the message, it seemed reasonable to add a configurable omission of the request body instead.

cboggs commented Mar 12, 2018

Hmm... Maybe that is indeed too sledgehammer-y. I forget that ingesters take more than just Push requests.

I'll try the byte limiter again, and retain the configuration option for such. Stand by!
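
(To make the shape of this concrete, the sketch below shows a unary interceptor that logs the request only when the handler errors, and can omit or cap the dump via configuration. The names, including MaxRequestLogBytes, are illustrative; the real middleware and its ExcludeRequestInLog option live in weaveworks/common, and this is not that code.)

```go
package main

import (
	"context"
	"fmt"
	"log"

	"google.golang.org/grpc"
)

// serverConfig sketches the options being discussed: drop or cap the request
// dump when a handler such as Ingester.Push fails.
type serverConfig struct {
	ExcludeRequestInLog bool
	MaxRequestLogBytes  int
}

// errorLoggingInterceptor logs failed calls, optionally including a (capped)
// dump of the request that caused the failure.
func errorLoggingInterceptor(cfg serverConfig) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		resp, err := handler(ctx, req)
		if err == nil {
			return resp, nil
		}
		if cfg.ExcludeRequestInLog {
			log.Printf("gRPC %s failed: %v", info.FullMethod, err)
			return resp, err
		}
		dump := fmt.Sprintf("%v", req)
		if cfg.MaxRequestLogBytes > 0 && len(dump) > cfg.MaxRequestLogBytes {
			dump = dump[:cfg.MaxRequestLogBytes] + "... (truncated)"
		}
		log.Printf("gRPC %s failed: %v request=%s", info.FullMethod, err, dump)
		return resp, err
	}
}

func main() {
	// Construct the interceptor; in a real server it would be passed to
	// grpc.NewServer(grpc.UnaryInterceptor(...)).
	_ = errorLoggingInterceptor(serverConfig{ExcludeRequestInLog: true})
	fmt.Println("interceptor constructed")
}
```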

cboggs commented Mar 21, 2018

@bboreham, does this PR seem OK now with the update to weaveworks/common?

Diff hunk under review:

jaegerAgentHost := os.Getenv("JAEGER_AGENT_HOST")
trace := tracing.New(jaegerAgentHost, "querier")
trace := tracing.NewFromEnv("querier")
defer trace.Close()

Contributor

don't need two defers

Contributor Author

Awww, weak copypasta skillz on my part. Thanks!

Another diff hunk under review:

GRPCMiddleware: []grpc.UnaryServerInterceptor{
	middleware.ServerUserHeaderInterceptor,
},
ExcludeRequestInLog: false,

Contributor

I think you wanted this true

@bboreham bboreham merged commit 20bf2cf into cortexproject:master Mar 21, 2018
cboggs commented Mar 21, 2018

Whoo! Thanks a million @bboreham for your help (and patience)!
