Fix diff parsing when file contains loose carriage return #1268

jcansdale · 2017-10-12T14:16:47Z

This is a fix for an issue where a diff contains one or more loose carriage returns.

Previous versions show the following when any file in the PR contained a loose loose carriage return. You can use the following PR as a repro: grokys/PullRequestSandbox#39.

The above exception has been fixed as well as diff line matching. For example, take the following string: "foo\r\r\nbar\n\nfo\ro\r\nbar\n".

When rendered in the GitHub diff view it looks like this:

However, when viewed using the GitHub file view it looks like this:

You can see that although there are only 5 numbered lines, the file actually takes up 7 lines (with two extra line breaks).

This is also how it appears when rendered on the Visual Studio diff (and normal) view. The only difference is that Visual Studio shows numbers for the rendered lines (as opposed to the line numbers when carriage return are ignored).

We need to account for this when the OldLineNumber and NewLineNumber is calculated for a DiffLine (otherwise the lines that can be commented on won't match the lines displayed in Visual Studio).

This is a potential issue when lines are added to or deleted from a diff.

You can see on the two screen grabs above that the lines that can be commented on (the light colored ones) match the lines from the first screen grabs from GitHub's diff view. The line numbers are different, but the start of each line is consistent. This is how it's supposed to work in the fixed version.

NOTE: I wonder what other control chars we might be missing?

Fixes #1241

Loose carriage returns appear as new lines inside Visual Studio. Account for this when calculating `DiffLine.NewLineNumber`. The `DiffLine.NewLineNumber` hasn't been updated yet.

meaghanlewis

These changes look good to me, thanks Jamie!

jcansdale · 2017-10-13T10:33:58Z

src/GitHub.Exports/Models/DiffUtilities.cs

        public static IEnumerable<DiffChunk> ParseFragment(string diff)
        {
+            // Turn Windows line endings into Unix line endings.
+            diff = NormalizeLineEndings(diff);


I should create a ContainsLooseCarriageReturn regex/method before replacing all line endings. This will be done for every file in PR. It's only required when there are actually loose carriage returns.

jcansdale · 2017-10-13T11:06:57Z

src/GitHub.Exports/Models/DiffUtilities.cs

                {
+                    if (hasCarriageReturn)
+                    {
+                        line = UnescapeCarriageReturns(line);


What if file already contained escaped carriage returns? Oops.

shana

When processing a diff, if there's a '\r', this code is going to go through the entire diff char by char 5 times - once for NormalizeLineEndings, then for HasCarriageReturn, then for EscapeCarriageReturn, then UnescapeCarriageReturn, then CountCarriageReturns - which is pretty inefficient (and if you throw in a entire file chunk, probably really slow too). I'm very rusty on Big-O notation but this is what, O(n^4) or something crazy like that?

It looks like all of this is mostly so you can call ReadLine while ignoring any line endings that don't contain \n. Maybe this could be better handled with a straight character loop where you just stop when you hit a \n, consume any subsequent \r character, and process the line as is with no extra escaping and unescaping, which will bring the time complexity of this down to O(1)

Loose carriage returns are the exception, so we don't always want to cater for them!

jcansdale · 2017-10-13T11:24:42Z

Files that contain a loose \r are very much the exception (there weren't any in our repos or repos I've looked at recently). The latest commit checks of any loose \r before making any accommodation for them. I was reluctant to implement a string reader for this one special and relatively unlikely case (the built in string reader is pretty optimized IIRC).

There is however an issue with the \r escaping and a custom string reader might be the cleanest fix. I'll take a look.

Maybe this could be better handled with a straight character loop where you just stop when you hit a \n, consume any subsequent \r character, and process the line as is with no extra escaping and unescaping, which will bring the time complexity of this down to O(1)

I'll try something along these lines.

grokys · 2017-10-13T11:45:41Z

src/GitHub.Exports/Models/DiffUtilities.cs

+
+        static string UnescapeCarriageReturns(string text)
+        {
+            return text.Replace("\\r", "\r");


What will happen here if the diff already has the string "\r" in it? It will get replaced by a control character, but will that affect the correctness of the parsing?

Yes, indeed. Poorly thought through escaping. This code is redundant now.

shana · 2017-10-13T11:53:39Z

I was reluctant to implement a string reader for this one special and relatively unlikely case (the built in string reader is pretty optimized IIRC).

I don't think there's a need to implement a whole string reader tbh, you just want to go through the whole thing and stop at markers. You could even just do a Split('\n') and get all the lines like that because you know '\n' is always the last character of a line end regardless of whether it's windows or linux.

This is faster/cleaner than doing string manipulation inside the diff parser.

This is the only time we explicitly hunt of the `r` char.

jcansdale · 2017-10-13T15:26:24Z

@shana, I've updated this according to your feedback. It's much cleaner now with no redundant string manipulation. It only chops out the sub-strings that it actually needs. 👍

I did end up writing a simple line reader. There were a surprising number of edge cases in a small amount of code (that I wanted to test).

It only hunts explicitly for \r chars once and does so in an optimized way (using IndexOf).

shana · 2017-10-16T12:01:16Z

src/GitHub.Exports/Models/DiffUtilities.cs

+
+            bool StartOfText => index == 0;
+
+            bool EndOfText => index == -1 || index == text.Length;


If null is passed in to the LineReader construction, this will throw.

Added null check to constructor.

shana · 2017-10-16T12:03:58Z

test/GitHub.InlineReviews.UnitTests/Models/DiffUtilitiesTests.cs

+        public class TheLineReaderClass
+        {
+            [Theory]
+            [InlineData("", "", null, null)]


Why not pass new string[] { "", null } directly into the test?

I original wrote it using using NUnit, which allows test methods to accept params string[]. When I converted it to xUnit (which doesn't support this 😭), I didn't realize you could define an array inside an attribute.

shana · 2017-10-16T12:04:39Z

test/GitHub.InlineReviews.UnitTests/Models/DiffUtilitiesTests.cs

+            [InlineData("\r\r", "\r\r", null, null)]
+            [InlineData("\r\r\n", "\r", null, null)]
+            [InlineData("\r_\n", "\r_", null, null)]
+            public void ReadLines(string text, string line1, string line2, string line3)


There should also be a test checking what happens when null is passed into the constructor.

shana · 2017-10-19T12:50:20Z

test/GitHub.InlineReviews.UnitTests/Models/DiffUtilitiesTests.cs

+                Assert.Throws<ArgumentNullException>(() => DiffUtilities.LineReader.CountCarriageReturns(null));
+            }
+
+            static void Benchmark_ReadLineAndCountCarriageReturns_100000()


Kill this method? It's not doing anything here.

shana · 2017-10-19T12:52:00Z

src/GitHub.Exports/Models/DiffUtilities.cs

            return null;
        }

+        public class LineReader


Since we ended up having a bunch of benchmarks for this, how about adding a comment here with a link to https://gist.github.com/shana/200e4719d4f571caab9dbf5921fa5276?

shana

🎉 🚀

jcansdale · 2017-10-19T13:58:19Z

test/GitHub.InlineReviews.UnitTests/Models/DiffUtilitiesTests.cs

-            static void Benchmark_ReadLineAndCountCarriageReturns_100000()
-            {
-
-                var file = new System.Diagnostics.StackFrame(true).GetFileName();


This was an ad-hoc test method I ran using TestDriven.Net. Easy for quick benchmarking. Zapped it now.

jcansdale added 2 commits October 12, 2017 14:31

Fix parsing when diff contains carriage returns

29503c6

Loose carriage returns appear as new lines inside Visual Studio. Account for this when calculating `DiffLine.NewLineNumber`. The `DiffLine.NewLineNumber` hasn't been updated yet.

Account for carriage returns when calculating OldLineNumber

53780dd

jcansdale added bug inline-comments labels Oct 12, 2017

jcansdale added 2 commits October 12, 2017 15:41

Optimize for common case with no loose carriage returns

ca54747

Add some comments in parsing code

e4c94ad

jcansdale requested review from grokys, meaghanlewis and shana October 12, 2017 14:55

meaghanlewis previously approved these changes Oct 12, 2017

View reviewed changes

jcansdale commented Oct 13, 2017

View reviewed changes

shana suggested changes Oct 13, 2017

View reviewed changes

Only touch diff if it contains loose carriage return

88ff0e6

Loose carriage returns are the exception, so we don't always want to cater for them!

grokys reviewed Oct 13, 2017

View reviewed changes

jcansdale added 2 commits October 13, 2017 16:02

Change to use custom diff line reader

741546e

This is faster/cleaner than doing string manipulation inside the diff parser.

Optimize counting of carriage returns

845a9c0

This is the only time we explicitly hunt of the `r` char.

jcansdale force-pushed the fixes/1241-carriage-return-in-diff branch from 4b490be to 845a9c0 Compare October 13, 2017 15:14

shana reviewed Oct 16, 2017

View reviewed changes

jcansdale and others added 4 commits October 16, 2017 15:02

Add null check for LineReader(text)

c0d7f0e

Add simple benchmark

62bcddd

Add tests and null check for CountCarriageReturns

095d62c

Merge branch 'master' into fixes/1241-carriage-return-in-diff

8709555

shana reviewed Oct 19, 2017

View reviewed changes

jcansdale added 2 commits October 19, 2017 14:38

Remove redundant benchmark method

2541455

Add link to LineReader implementations

66c84b1

shana approved these changes Oct 19, 2017

View reviewed changes

jcansdale commented Oct 19, 2017

View reviewed changes

shana merged commit 50b9e91 into master Oct 19, 2017

jcansdale mentioned this pull request Oct 20, 2017

Only scan the diff once when processing carriage returns #1270

Closed

jcansdale deleted the fixes/1241-carriage-return-in-diff branch November 1, 2017 18:20


		bool StartOfText => index == 0;

		bool EndOfText => index == -1 \|\| index == text.Length;

Fix diff parsing when file contains loose carriage return #1268

Fix diff parsing when file contains loose carriage return #1268

Uh oh!

Conversation

jcansdale commented Oct 12, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

meaghanlewis left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shana left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jcansdale commented Oct 13, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shana commented Oct 13, 2017

Uh oh!

jcansdale commented Oct 13, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jcansdale Oct 16, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shana left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

jcansdale commented Oct 12, 2017 •

edited

Loading

shana left a comment •

edited

Loading

jcansdale commented Oct 13, 2017 •

edited

Loading

jcansdale Oct 16, 2017 •

edited

Loading