-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Fix diff parsing when file contains loose carriage return #1268
Conversation
Loose carriage returns appear as new lines inside Visual Studio. Account for this when calculating `DiffLine.NewLineNumber`. The `DiffLine.NewLineNumber` hasn't been updated yet.
meaghanlewis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes look good to me, thanks Jamie!
| public static IEnumerable<DiffChunk> ParseFragment(string diff) | ||
| { | ||
| // Turn Windows line endings into Unix line endings. | ||
| diff = NormalizeLineEndings(diff); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should create a ContainsLooseCarriageReturn regex/method before replacing all line endings. This will be done for every file in PR. It's only required when there are actually loose carriage returns.
| { | ||
| if (hasCarriageReturn) | ||
| { | ||
| line = UnescapeCarriageReturns(line); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if file already contained escaped carriage returns? Oops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When processing a diff, if there's a '\r', this code is going to go through the entire diff char by char 5 times - once for NormalizeLineEndings, then for HasCarriageReturn, then for EscapeCarriageReturn, then UnescapeCarriageReturn, then CountCarriageReturns - which is pretty inefficient (and if you throw in a entire file chunk, probably really slow too). I'm very rusty on Big-O notation but this is what, O(n^4) or something crazy like that?
It looks like all of this is mostly so you can call ReadLine while ignoring any line endings that don't contain \n. Maybe this could be better handled with a straight character loop where you just stop when you hit a \n, consume any subsequent \r character, and process the line as is with no extra escaping and unescaping, which will bring the time complexity of this down to O(1)
Loose carriage returns are the exception, so we don't always want to cater for them!
|
Files that contain a loose There is however an issue with the
I'll try something along these lines. |
|
|
||
| static string UnescapeCarriageReturns(string text) | ||
| { | ||
| return text.Replace("\\r", "\r"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will happen here if the diff already has the string "\r" in it? It will get replaced by a control character, but will that affect the correctness of the parsing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, indeed. Poorly thought through escaping. This code is redundant now.
I don't think there's a need to implement a whole string reader tbh, you just want to go through the whole thing and stop at markers. You could even just do a |
This is faster/cleaner than doing string manipulation inside the diff parser.
This is the only time we explicitly hunt of the `r` char.
4b490be to
845a9c0
Compare
|
@shana, I've updated this according to your feedback. It's much cleaner now with no redundant string manipulation. It only chops out the sub-strings that it actually needs. 👍 I did end up writing a simple line reader. There were a surprising number of edge cases in a small amount of code (that I wanted to test). It only hunts explicitly for |
|
|
||
| bool StartOfText => index == 0; | ||
|
|
||
| bool EndOfText => index == -1 || index == text.Length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If null is passed in to the LineReader construction, this will throw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added null check to constructor.
| public class TheLineReaderClass | ||
| { | ||
| [Theory] | ||
| [InlineData("", "", null, null)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not pass new string[] { "", null } directly into the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I original wrote it using using NUnit, which allows test methods to accept params string[]. When I converted it to xUnit (which doesn't support this 😭), I didn't realize you could define an array inside an attribute.
| [InlineData("\r\r", "\r\r", null, null)] | ||
| [InlineData("\r\r\n", "\r", null, null)] | ||
| [InlineData("\r_\n", "\r_", null, null)] | ||
| public void ReadLines(string text, string line1, string line2, string line3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should also be a test checking what happens when null is passed into the constructor.
| Assert.Throws<ArgumentNullException>(() => DiffUtilities.LineReader.CountCarriageReturns(null)); | ||
| } | ||
|
|
||
| static void Benchmark_ReadLineAndCountCarriageReturns_100000() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kill this method? It's not doing anything here.
| return null; | ||
| } | ||
|
|
||
| public class LineReader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we ended up having a bunch of benchmarks for this, how about adding a comment here with a link to https://gist.github.com/shana/200e4719d4f571caab9dbf5921fa5276?
shana
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉 🚀
| static void Benchmark_ReadLineAndCountCarriageReturns_100000() | ||
| { | ||
|
|
||
| var file = new System.Diagnostics.StackFrame(true).GetFileName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an ad-hoc test method I ran using TestDriven.Net. Easy for quick benchmarking. Zapped it now.
This is a fix for an issue where a diff contains one or more loose carriage returns.
Previous versions show the following when any file in the PR contained a loose loose carriage return. You can use the following PR as a repro: grokys/PullRequestSandbox#39.
The above exception has been fixed as well as diff line matching. For example, take the following string:
"foo\r\r\nbar\n\nfo\ro\r\nbar\n".When rendered in the GitHub diff view it looks like this:
However, when viewed using the GitHub file view it looks like this:
You can see that although there are only 5 numbered lines, the file actually takes up 7 lines (with two extra line breaks).
This is also how it appears when rendered on the Visual Studio diff (and normal) view. The only difference is that Visual Studio shows numbers for the rendered lines (as opposed to the line numbers when carriage return are ignored).
We need to account for this when the
OldLineNumberandNewLineNumberis calculated for aDiffLine(otherwise the lines that can be commented on won't match the lines displayed in Visual Studio).This is a potential issue when lines are added to or deleted from a diff.
You can see on the two screen grabs above that the lines that can be commented on (the light colored ones) match the lines from the first screen grabs from GitHub's diff view. The line numbers are different, but the start of each line is consistent. This is how it's supposed to work in the fixed version.
NOTE: I wonder what other control chars we might be missing?
Fixes #1241