Skip to content
This repository was archived by the owner on Jun 21, 2023. It is now read-only.

Conversation

@shana
Copy link
Contributor

@shana shana commented Oct 16, 2017

The previous solution still scans the string 3 times, one for each IndexOf and Substring call. Given that we have to go through the string at least once to find the boundaries, we should grab all the information we need along the way immediately. Also given that there is a custom LineReader class, we can take advantage of that and return all the data we need in one go.

Also add null checks for invalid data being passed into the constructor. The current caller of LineReader probably never calls it with null, but since LineReader is a public class, other code in the future might decide to call it (or it could be expanded), and we should make sure the assumptions we make are documented.

@shana shana requested a review from jcansdale October 16, 2017 12:56
@jcansdale
Copy link
Collaborator

The previous solution still scans the string 3 times, one for each IndexOf and Substring call.

I did consider scanning through it one char at a time, but took a punt that since IndexOf and Substring are native methods, doing it this way would be faster. Scanning for a specific char in native code can remarkably fast. Does that make sense?

Would you be interested in seeing them profiled in BenchmarkDotNet?

@jcansdale
Copy link
Collaborator

jcansdale commented Oct 16, 2017

I've just done a couple of ad-hoc benchmarks.

One for this implementation:

            static void Benchmark_ReadLineAndCountCarriageReturns_100000()
            {
                // use current file as test data
                var file = new System.Diagnostics.StackFrame(true).GetFileName();
                var text = File.ReadAllText(file);

                for (int count = 0; count < 100000; count++)
                {
                    var lineReader = new DiffUtilities.LineReader(text);
                    DiffUtilities.LineReader.LineInformation line;
                    while ((line = lineReader.ReadLine()).Line != null)
                    {
                        int crs = line.CarriageReturns;
                    }
                }
            }

A similar one for the original implementation:

            static void Benchmark_ReadLineAndCountCarriageReturns_100000()
            {
                var file = new System.Diagnostics.StackFrame(true).GetFileName();
                var text = File.ReadAllText(file);

                for (int count = 0; count < 100000; count++)
                {
                    var lineReader = new DiffUtilities.LineReader(text);
                    string line;
                    while ((line = lineReader.ReadLine()) != null)
                    {
                        DiffUtilities.CountCarriageReturns(line);
                    }
                }
            }

The first one completes in ~26.91 seconds, the second one in ~6.27 seconds. I think IndexOf and Substring being native methods is what makes the difference (unless I've got the benchmarking wrong).

Copy link
Collaborator

@jcansdale jcansdale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done some very basic benchmarking. Could you take a look and maybe feed it some more representative data.

I guess we could use this sample data for a large PR:
https://patch-diff.githubusercontent.com/raw/github/VisualStudio/pull/1004.diff

if (index != -1)
var carriageReturns = 0;
StringBuilder sb = new StringBuilder();
for (; index < length; index++)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could maybe adapt this implementation to use, text.IndexOfAny(new[] {'\r', '\n'}, index)? With new[] {'\r', '\n'}, stored in a static. 😉

@jcansdale
Copy link
Collaborator

Here are some alternative implementations we tried:
https://gist.github.com/shana/200e4719d4f571caab9dbf5921fa5276
Scanning with text.IndexOf('\n', index) appears to the the best compromise for average .diff files.
It's likely that text.IndexOfAny(new [] {'\r', '\n'}, index) would be faster if lines were much longer.

Merged #1268

@jcansdale jcansdale closed this Oct 20, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants