Only scan the diff once when processing carriage returns #1270

shana · 2017-10-16T12:56:48Z

The previous solution still scans the string 3 times, one for each IndexOf and Substring call. Given that we have to go through the string at least once to find the boundaries, we should grab all the information we need along the way immediately. Also given that there is a custom LineReader class, we can take advantage of that and return all the data we need in one go.

Also add null checks for invalid data being passed into the constructor. The current caller of LineReader probably never calls it with null, but since LineReader is a public class, other code in the future might decide to call it (or it could be expanded), and we should make sure the assumptions we make are documented.
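
As a rough illustration of the idea described above (not the code in this PR), a single-pass reader might look like the sketch below. The type and member names `LineReader`, `LineInformation`, `ReadLine`, `Line` and `CarriageReturns` are taken from this thread. Everything else is an assumption made for the example: that `Line` excludes the terminating '\n' but keeps any '\r', and that end of text is signalled by a null `Line`.

```csharp
using System;

// Hypothetical result type: one line plus the number of '\r' found in it.
public struct LineInformation
{
    public readonly string Line;          // null signals end of text (assumption)
    public readonly int CarriageReturns;  // '\r' characters seen in this line

    public LineInformation(string line, int carriageReturns)
    {
        Line = line;
        CarriageReturns = carriageReturns;
    }
}

public class LineReader
{
    readonly string text;
    int index;

    public LineReader(string text)
    {
        // Document the assumption that callers never pass null.
        if (text == null) throw new ArgumentNullException(nameof(text));
        this.text = text;
    }

    public LineInformation ReadLine()
    {
        if (index >= text.Length)
            return new LineInformation(null, 0);

        int start = index;
        int carriageReturns = 0;

        // One scan: count '\r' while looking for the terminating '\n'.
        while (index < text.Length && text[index] != '\n')
        {
            if (text[index] == '\r') carriageReturns++;
            index++;
        }

        string line = text.Substring(start, index - start);
        if (index < text.Length) index++; // step over the '\n'
        return new LineInformation(line, carriageReturns);
    }
}
```

Note that `Substring` still copies the characters of each line, so this avoids repeated searching rather than all extra work.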

jcansdale · 2017-10-16T14:28:29Z

> The previous solution still scans the string 3 times, one for each IndexOf and Substring call.

I did consider scanning through it one char at a time, but took a punt that since IndexOf and Substring are native methods, doing it this way would be faster. Scanning for a specific char in native code can be remarkably fast. Does that make sense?

Would you be interested in seeing them profiled in BenchmarkDotNet?

jcansdale · 2017-10-16T14:58:07Z

I've just done a couple of ad-hoc benchmarks.

One for this implementation:

            static void Benchmark_ReadLineAndCountCarriageReturns_100000()
            {
                // use current file as test data
                var file = new System.Diagnostics.StackFrame(true).GetFileName();
                var text = File.ReadAllText(file);

                for (int count = 0; count < 100000; count++)
                {
                    var lineReader = new DiffUtilities.LineReader(text);
                    DiffUtilities.LineReader.LineInformation line;
                    while ((line = lineReader.ReadLine()).Line != null)
                    {
                        int crs = line.CarriageReturns;
                    }
                }
            }

A similar one for the original implementation:

            static void Benchmark_ReadLineAndCountCarriageReturns_100000()
            {
                var file = new System.Diagnostics.StackFrame(true).GetFileName();
                var text = File.ReadAllText(file);

                for (int count = 0; count < 100000; count++)
                {
                    var lineReader = new DiffUtilities.LineReader(text);
                    string line;
                    while ((line = lineReader.ReadLine()) != null)
                    {
                        DiffUtilities.CountCarriageReturns(line);
                    }
                }
            }

The first one completes in ~26.91 seconds, the second one in ~6.27 seconds. I think IndexOf and Substring being native methods is what makes the difference (unless I've got the benchmarking wrong).
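
The benchmark methods above do not show how the elapsed time was taken. One plausible way, sketched here as an assumption rather than the harness actually used, is a small `Stopwatch` wrapper. `AdHocTimer` and `Measure` are hypothetical names introduced for this example.

```csharp
using System;
using System.Diagnostics;

public static class AdHocTimer
{
    // Runs the action once untimed (so JIT compilation is not measured),
    // then times `iterations` further runs and returns the total elapsed time.
    public static TimeSpan Measure(Action action, int iterations)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        action(); // warm-up run, not timed

        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

A call such as `AdHocTimer.Measure(() => DiffUtilities.CountCarriageReturns(line), 100000)` would time one variant. Ad-hoc numbers like these are sensitive to build configuration and input data, which is why a tool such as BenchmarkDotNet was suggested earlier in the thread.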

jcansdale

I've done some very basic benchmarking. Could you take a look and maybe feed it some more representative data?

I guess we could use this sample data for a large PR:
https://patch-diff.githubusercontent.com/raw/github/VisualStudio/pull/1004.diff

jcansdale · 2017-10-17T09:04:25Z

src/GitHub.Exports/Models/DiffUtilities.cs

-                if (index != -1)
+                var carriageReturns = 0;
+                StringBuilder sb = new StringBuilder();
+                for (; index < length; index++)


We could maybe adapt this implementation to use text.IndexOfAny(new[] {'\r', '\n'}, index), with new[] {'\r', '\n'} stored in a static? 😉
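
A minimal sketch of that suggestion, assuming a free-standing helper rather than the actual `LineReader` in this PR. `NewlineScanner`, `LineBreakChars` and this `ReadLine` signature are hypothetical names for illustration only. `String.IndexOfAny(char[], int)` is the real framework call.

```csharp
using System;

public static class NewlineScanner
{
    // Stored in a static so the array is allocated once, not on every call.
    static readonly char[] LineBreakChars = { '\r', '\n' };

    // Reads one line starting at `index`, advancing `index` past the '\n'.
    // Returns null at end of text. The returned line excludes the '\n'
    // but keeps any '\r' (an assumption made for this sketch).
    public static string ReadLine(string text, ref int index, out int carriageReturns)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        carriageReturns = 0;
        if (index >= text.Length) return null;

        int start = index;
        int pos = index;

        // Jump from one '\r' or '\n' to the next instead of visiting every char.
        while ((pos = text.IndexOfAny(LineBreakChars, pos)) != -1)
        {
            if (text[pos] == '\n')
            {
                index = pos + 1;
                return text.Substring(start, pos - start);
            }
            carriageReturns++; // the hit was a '\r'
            pos++;
        }

        // No '\n' left: the rest of the text is the last line.
        index = text.Length;
        return text.Substring(start);
    }
}
```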

jcansdale · 2017-10-20T10:25:49Z

Here are some alternative implementations we tried:
https://gist.github.com/shana/200e4719d4f571caab9dbf5921fa5276
Scanning with text.IndexOf('\n', index) appears to be the best compromise for average .diff files.
It's likely that text.IndexOfAny(new[] {'\r', '\n'}, index) would be faster if lines were much longer.
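
For illustration, the `IndexOf('\n', index)` approach could be sketched as follows. This is not one of the implementations from the gist. `DiffLineScanner` and `CountCarriageReturnsPerLine` are hypothetical names, and returning one count per line is an assumption chosen to keep the example small.

```csharp
using System;
using System.Collections.Generic;

public static class DiffLineScanner
{
    // Returns the number of '\r' characters in each '\n'-terminated line.
    // Line boundaries are located with IndexOf('\n', index); only the
    // characters inside each line are then checked for '\r'.
    public static List<int> CountCarriageReturnsPerLine(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        var counts = new List<int>();
        int index = 0;
        while (index < text.Length)
        {
            int end = text.IndexOf('\n', index);
            if (end == -1) end = text.Length; // last line has no '\n'

            int carriageReturns = 0;
            for (int i = index; i < end; i++)
            {
                if (text[i] == '\r') carriageReturns++;
            }
            counts.Add(carriageReturns);

            index = end + 1; // continue after the '\n'
        }
        return counts;
    }
}
```

With short lines the inner loop touches few characters, which is consistent with the observation above that this variant suits typical .diff files, while `IndexOfAny` may pay off when lines are much longer.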

Merged #1268

Only scan the diff once when processing carriage returns

413c643

shana requested a review from jcansdale October 16, 2017 12:56

jcansdale reviewed Oct 17, 2017

jcansdale closed this Oct 20, 2017
