Skip to content
This repository was archived by the owner on Jun 21, 2023. It is now read-only.

Conversation

@jcansdale
Copy link
Collaborator

@jcansdale jcansdale commented Oct 12, 2017

This is a fix for an issue where a diff contains one or more loose carriage returns.

Previous versions show the following when any file in the PR contained a loose loose carriage return. You can use the following PR as a repro: grokys/PullRequestSandbox#39.

image

The above exception has been fixed as well as diff line matching. For example, take the following string: "foo\r\r\nbar\n\nfo\ro\r\nbar\n".

When rendered in the GitHub diff view it looks like this:

image

However, when viewed using the GitHub file view it looks like this:

image

You can see that although there are only 5 numbered lines, the file actually takes up 7 lines (with two extra line breaks).

This is also how it appears when rendered on the Visual Studio diff (and normal) view. The only difference is that Visual Studio shows numbers for the rendered lines (as opposed to the line numbers when carriage return are ignored).

image

We need to account for this when the OldLineNumber and NewLineNumber is calculated for a DiffLine (otherwise the lines that can be commented on won't match the lines displayed in Visual Studio).

This is a potential issue when lines are added to or deleted from a diff.

image

You can see on the two screen grabs above that the lines that can be commented on (the light colored ones) match the lines from the first screen grabs from GitHub's diff view. The line numbers are different, but the start of each line is consistent. This is how it's supposed to work in the fixed version.

NOTE: I wonder what other control chars we might be missing?

Fixes #1241

Loose carriage returns appear as new lines inside Visual Studio.
Account for this when calculating `DiffLine.NewLineNumber`.
The `DiffLine.NewLineNumber` hasn't been updated yet.
meaghanlewis
meaghanlewis previously approved these changes Oct 12, 2017
Copy link
Contributor

@meaghanlewis meaghanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good to me, thanks Jamie!

public static IEnumerable<DiffChunk> ParseFragment(string diff)
{
// Turn Windows line endings into Unix line endings.
diff = NormalizeLineEndings(diff);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should create a ContainsLooseCarriageReturn regex/method before replacing all line endings. This will be done for every file in PR. It's only required when there are actually loose carriage returns.

{
if (hasCarriageReturn)
{
line = UnescapeCarriageReturns(line);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if file already contained escaped carriage returns? Oops.

Copy link
Contributor

@shana shana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When processing a diff, if there's a '\r', this code is going to go through the entire diff char by char 5 times - once for NormalizeLineEndings, then for HasCarriageReturn, then for EscapeCarriageReturn, then UnescapeCarriageReturn, then CountCarriageReturns - which is pretty inefficient (and if you throw in a entire file chunk, probably really slow too). I'm very rusty on Big-O notation but this is what, O(n^4) or something crazy like that?

It looks like all of this is mostly so you can call ReadLine while ignoring any line endings that don't contain \n. Maybe this could be better handled with a straight character loop where you just stop when you hit a \n, consume any subsequent \r character, and process the line as is with no extra escaping and unescaping, which will bring the time complexity of this down to O(1)

Loose carriage returns are the exception, so we don't always want to cater for them!
@jcansdale
Copy link
Collaborator Author

jcansdale commented Oct 13, 2017

Files that contain a loose \r are very much the exception (there weren't any in our repos or repos I've looked at recently). The latest commit checks of any loose \r before making any accommodation for them. I was reluctant to implement a string reader for this one special and relatively unlikely case (the built in string reader is pretty optimized IIRC).

There is however an issue with the \r escaping and a custom string reader might be the cleanest fix. I'll take a look.

Maybe this could be better handled with a straight character loop where you just stop when you hit a \n, consume any subsequent \r character, and process the line as is with no extra escaping and unescaping, which will bring the time complexity of this down to O(1)

I'll try something along these lines.


static string UnescapeCarriageReturns(string text)
{
return text.Replace("\\r", "\r");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen here if the diff already has the string "\r" in it? It will get replaced by a control character, but will that affect the correctness of the parsing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, indeed. Poorly thought through escaping. This code is redundant now.

@shana
Copy link
Contributor

shana commented Oct 13, 2017

I was reluctant to implement a string reader for this one special and relatively unlikely case (the built in string reader is pretty optimized IIRC).

I don't think there's a need to implement a whole string reader tbh, you just want to go through the whole thing and stop at markers. You could even just do a Split('\n') and get all the lines like that because you know '\n' is always the last character of a line end regardless of whether it's windows or linux.

This is faster/cleaner than doing string manipulation inside the diff parser.
This is the only time we explicitly hunt of the `r` char.
@jcansdale jcansdale force-pushed the fixes/1241-carriage-return-in-diff branch from 4b490be to 845a9c0 Compare October 13, 2017 15:14
@jcansdale
Copy link
Collaborator Author

@shana, I've updated this according to your feedback. It's much cleaner now with no redundant string manipulation. It only chops out the sub-strings that it actually needs. 👍

I did end up writing a simple line reader. There were a surprising number of edge cases in a small amount of code (that I wanted to test).

It only hunts explicitly for \r chars once and does so in an optimized way (using IndexOf).


bool StartOfText => index == 0;

bool EndOfText => index == -1 || index == text.Length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If null is passed in to the LineReader construction, this will throw.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added null check to constructor.

public class TheLineReaderClass
{
[Theory]
[InlineData("", "", null, null)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not pass new string[] { "", null } directly into the test?

Copy link
Collaborator Author

@jcansdale jcansdale Oct 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I original wrote it using using NUnit, which allows test methods to accept params string[]. When I converted it to xUnit (which doesn't support this 😭), I didn't realize you could define an array inside an attribute.

[InlineData("\r\r", "\r\r", null, null)]
[InlineData("\r\r\n", "\r", null, null)]
[InlineData("\r_\n", "\r_", null, null)]
public void ReadLines(string text, string line1, string line2, string line3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should also be a test checking what happens when null is passed into the constructor.

Assert.Throws<ArgumentNullException>(() => DiffUtilities.LineReader.CountCarriageReturns(null));
}

static void Benchmark_ReadLineAndCountCarriageReturns_100000()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kill this method? It's not doing anything here.

return null;
}

public class LineReader
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we ended up having a bunch of benchmarks for this, how about adding a comment here with a link to https://gist.github.com/shana/200e4719d4f571caab9dbf5921fa5276?

Copy link
Contributor

@shana shana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 🚀

static void Benchmark_ReadLineAndCountCarriageReturns_100000()
{

var file = new System.Diagnostics.StackFrame(true).GetFileName();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an ad-hoc test method I ran using TestDriven.Net. Easy for quick benchmarking. Zapped it now.

@shana shana merged commit 50b9e91 into master Oct 19, 2017
@jcansdale jcansdale deleted the fixes/1241-carriage-return-in-diff branch November 1, 2017 18:20
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Index was outside the bounds of the array

5 participants