Conversation

@cboggs cboggs commented Feb 14, 2018

EDIT: Please see #703 for description

@jml - I'm not entirely sure that adding an override to Gopkg.toml was the proper fix, so if I got it wrong, let me know what the proper method is and I'll fix it up. :-)

Thanks!

jml commented Feb 15, 2018 via email

cboggs commented Feb 15, 2018

Gotcha. I didn't run dep ensure -update (or whatever it was) because it sounded like it would update ALL dependencies, and that made me nervous. I'll give it a shot in a bit and see what the damage is. :-)

@bboreham

Just run dep ensure -update <path> to update one package.
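
(For example, with the dependency path shown purely as an illustration: dep ensure -update github.com/weaveworks/common pulls in just that one dependency and leaves the rest of Gopkg.lock alone.)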

@bboreham

Also you must check in Gopkg.lock when you change the vendor'd libs.

@bboreham

Since there are a couple more updates brought in from weaveworks/common, I think it is worth calling them out:

Log gRPC request on error weaveworks/common#85
Don't log request body and headers for 502 errors weaveworks/common#84

And now I'm a bit worried by weaveworks/common#85 - will it log the entire set of samples on a failed Push?

cboggs commented Feb 19, 2018

@bboreham, thanks for the pointers! I've removed the override, but the vendor update should still be in place.

I'm not sure how to go about answering your question re: weaveworks/common#85, unfortunately. Any guidance there?

awh commented Feb 20, 2018

> will it log the entire set of samples on a failed push?

@bboreham yes it will. It was added to help diagnose errors in other gRPC calls with smaller argument lists, not realising that it was being used here...

jml commented Feb 21, 2018

@bboreham Do you object to us merging this as is? What do we need to do to get it ready to merge?

@bboreham

I would suggest testing the error path and then deciding if that level of logging is acceptable.

@bboreham

I notice #705 also contains the update to the latest version of dep, so one of them is going to have to rebase after the other is merged.

cboggs commented Feb 23, 2018

I'd be glad to wait for #705, @bboreham and @tomwilkie. The dep speedup will be very nice to have, as it's been painful keeping the vendor bits in this branch up to date. :-)

@bboreham

Has the error logging been tested yet?

@cboggs cboggs force-pushed the configurable-trace-sampling branch from 3dcffef to 303fba6 on February 27, 2018 17:06
cboggs commented Feb 27, 2018

Finally just blew away my local branch and forced things to a sane state from latest master.

@bboreham I'm not sure how to go about testing the error logging, short of letting it run for a while in our staging cluster and seeing what happens in the logs.

Do you know of a quick / easy-ish way to cause this failure pattern so that I can give some quicker feedback?

Thanks!

cboggs commented Feb 27, 2018

@bboreham I think I may have inadvertently found an instance of what you expect to see in the logs with the recent change to common:

time="2018-02-27T23:03:19Z" level=warning msg=gRPC duration=8.160544586s error="rpc error: code = Code(400) desc = sample timestamp out of order for series <series bits snipped>; last timestamp: 1519772598.344, incoming timestamp: 1519772588.345" method=/cortex.Ingester/Push request="&WriteRequest{Timeseries:[{[{[95 95 110 97 109 101 95 95] [102 116 95 104 119 95 102 111 114 101 99 97 115 116 58 102 116 95 97 103 103 114 101 103 97 116 105 111 110 58 102... <many many tens of thousands more bytes>

Sure looks like a log of all samples in the failed push. Let me know if that's not what you were expecting and I'll dig a bit more.

@bboreham bboreham closed this Feb 27, 2018
@bboreham bboreham reopened this Feb 27, 2018
cboggs commented Feb 28, 2018

@bboreham, do you think it would be reasonable to explicitly truncate failed gRPC request logs past some reasonable-ish length?

I'm thinking something like 512 bytes or some such, but I'm not firm on that number.

I'm also open to other tricks that could retain the improved logging but avoid the pain of many-KB messages on a failed push.
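
(For concreteness, the kind of truncation being floated might look roughly like the sketch below. The names, the 512-byte limit, and the assumption that the middleware stringifies the request before logging are all illustrative, not the actual weaveworks/common code.)

```go
package main

import (
	"fmt"
	"strings"
)

// maxLoggedRequestBytes is purely illustrative; 512 is just the number floated above.
const maxLoggedRequestBytes = 512

// truncateForLog caps the stringified request that gets attached to an error
// log line, so a failed Push cannot dump tens of kilobytes of samples.
// (Byte-slicing assumes ASCII content; a real version might cut on rune boundaries.)
func truncateForLog(req fmt.Stringer) string {
	s := req.String()
	if len(s) <= maxLoggedRequestBytes {
		return s
	}
	return s[:maxLoggedRequestBytes] + "... (truncated)"
}

// fakeRequest stands in for a large WriteRequest in this sketch.
type fakeRequest struct{ body string }

func (f fakeRequest) String() string { return f.body }

func main() {
	req := fakeRequest{body: strings.Repeat("sample ", 1000)}
	fmt.Println(truncateForLog(req))
}
```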

bboreham commented Mar 2, 2018

Truncating may be a workable compromise; however, that dump you gave as an example doesn't look readable at any length. 95 95 110 97 109 101 95 95 is __name__ - does the dumping code look for a Stringer we could add?
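
(For reference, this is what the Stringer suggestion amounts to. The type and field names below are made up to mimic a protobuf-generated label pair with []byte fields; whether the actual dumping path would call such a method is exactly the open question here.)

```go
package main

import "fmt"

// labelPair mimics a generated type whose fields are []byte, which is why
// %v prints labels as decimal byte values like [95 95 110 97 109 101 95 95].
type labelPair struct {
	Name  []byte
	Value []byte
}

// String implements fmt.Stringer, so the fmt package prints readable text
// instead of the raw byte values.
func (l labelPair) String() string {
	return fmt.Sprintf("%s=%q", l.Name, l.Value)
}

func main() {
	l := labelPair{Name: []byte("__name__"), Value: []byte("up")}
	// Without the String method this prints {[95 95 110 97 109 101 95 95] [117 112]};
	// with it, it prints __name__="up".
	fmt.Println(l)
}
```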

cboggs commented Mar 2, 2018

Definitely not readable at any length. I'll try to peruse the dumping code a bit next week once we've tackled a few other things, and see if we can at least make it readable for some preamble length.

cboggs commented Mar 6, 2018

@bboreham, @jml, do you think the above-referenced PR will address this sufficiently?

If so, once it's merged I'll vendor it into this branch and add the config option to the ingester, and we should be golden.

jml commented Mar 7, 2018

I abstain. Sorry!

bboreham commented Mar 9, 2018

The PR in common looks plausible; it's not clear what the answer to my Stringer question is. I would much rather see "__name__" than 95 95 110 97 109 101 95 95.

cboggs commented Mar 9, 2018

Ah, oops, I forgot to address that part.

The trouble I ran into there is that grpc_logging is creating legitimate HTTP responses out of the RPC errors that it receives, and trying to convert the body of the resulting HTTP response to a valid string turned into a typecasting struggle for me. There's a strong chance I missed an easy solution, though.

@csmarchbanks, do you have some time to crawl this code with me and see if there's a reasonable place to implement an interface or something for this?

cboggs commented Mar 12, 2018

@bboreham, I had a quick clarifying question for you on the logging output issue.

Is there a particular reason you'd like to see the string-ified version of the byte array that's being logged? I ask because the full series is already logged before the byte array, and the only additional information in the request body is the full set of samples and timestamps. String-ifying would shrink the output a bit (by the difference between each character and its multi-digit decimal code, plus the whitespace between codes), but the rest of the output would still be a big nested array of floats.
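
(For a rough sense of scale, assuming plain ASCII: __name__ is 8 bytes as text but 26 bytes rendered as the decimal values 95 95 110 97 109 101 95 95, so string-ifying shrinks the label portion by roughly two thirds, while the nested arrays of samples and timestamps still dominate the total.)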

If this is definitely needed, I can spend the time to figure out a way to do so (it's still not proving to be as trivial as I'd hoped), but I'd like to better understand that need first. :-)

(I'm hoping we can get this particular PR unblocked sooner than later, as tracing is entirely non-viable in anything beyond a playground environment with zero load without this change. 😢)

Thanks!

bboreham commented Mar 12, 2018

I don't especially need to see the string; I just think it's dumb to print something we know is a string as its ASCII codes.

Happy to have that filed as a clean-up issue to do afterwards.

cboggs commented Mar 12, 2018

Agreed that it's dumb to print a string as raw ASCII codes. :-) I just think it's easier, and in this specific case pretty reasonable, to omit that massive output entirely.

I'll write up an issue to do the cleanup!

@bboreham

I went back to weaveworks/common#89; I can't see where it is truncating at N bytes - it seems to be just omitting the request details completely.

cboggs commented Mar 12, 2018

Yes, that's true. Maybe "truncate" is the wrong word on my part.

I started down the path of a byte limit, but since the data at the start of an ingester Push request duplicates data logged earlier in the message, it seemed reasonable to add a configurable omission of the request body instead.

cboggs commented Mar 12, 2018

Hmm... Maybe that is indeed too sledgehammer-y. I forget that ingesters take more than just Push requests.

I'll try the byte limiter again, and retain the configuration option for such. Stand by!
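
(To make the shape of this concrete, the sketch below shows a unary interceptor that logs the request only when the handler errors, and can omit or cap the dump via configuration. The names, including MaxRequestLogBytes, are illustrative; the real middleware and its ExcludeRequestInLog option live in weaveworks/common, and this is not that code.)

```go
package main

import (
	"context"
	"fmt"
	"log"

	"google.golang.org/grpc"
)

// serverConfig sketches the options being discussed: drop or cap the request
// dump when a handler such as Ingester.Push fails.
type serverConfig struct {
	ExcludeRequestInLog bool
	MaxRequestLogBytes  int
}

// errorLoggingInterceptor logs failed calls, optionally including a (capped)
// dump of the request that caused the failure.
func errorLoggingInterceptor(cfg serverConfig) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		resp, err := handler(ctx, req)
		if err == nil {
			return resp, nil
		}
		if cfg.ExcludeRequestInLog {
			log.Printf("gRPC %s failed: %v", info.FullMethod, err)
			return resp, err
		}
		dump := fmt.Sprintf("%v", req)
		if cfg.MaxRequestLogBytes > 0 && len(dump) > cfg.MaxRequestLogBytes {
			dump = dump[:cfg.MaxRequestLogBytes] + "... (truncated)"
		}
		log.Printf("gRPC %s failed: %v request=%s", info.FullMethod, err, dump)
		return resp, err
	}
}

func main() {
	// Construct the interceptor; in a real server it would be passed to
	// grpc.NewServer(grpc.UnaryInterceptor(...)).
	_ = errorLoggingInterceptor(serverConfig{ExcludeRequestInLog: true})
	fmt.Println("interceptor constructed")
}
```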

cboggs commented Mar 21, 2018

@bboreham, does this PR seem OK now with the update to weaveworks/common?

Diff hunk under review:

jaegerAgentHost := os.Getenv("JAEGER_AGENT_HOST")
trace := tracing.New(jaegerAgentHost, "querier")
trace := tracing.NewFromEnv("querier")
defer trace.Close()

Contributor

don't need two defers

Contributor Author

Awww, weak copypasta skillz on my part. Thanks!

Another diff hunk under review:

GRPCMiddleware: []grpc.UnaryServerInterceptor{
	middleware.ServerUserHeaderInterceptor,
},
ExcludeRequestInLog: false,

Contributor

I think you wanted this true

@bboreham bboreham merged commit 20bf2cf into cortexproject:master Mar 21, 2018
cboggs commented Mar 21, 2018

Whoo! Thanks a million @bboreham for your help (and patience)!
