@SergeiPavlov
Contributor

@SergeiPavlov SergeiPavlov commented May 26, 2022

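This is a roughly 30% performance optimization of Convert.ToBase64String (about 27% faster without line breaks, about 33% faster with them):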

|                                      Method |     Mean | Error |      Min |      Max |   Gen 0 | Allocated |
|-------------------------------------------- |---------:|------:|---------:|---------:|--------:|----------:|
|                  Old_Convert_ToBase64String | 540.4 us |    NA | 540.4 us | 540.4 us |  9.0000 |      1 MB |
| Old_Convert_ToBase64String_InsertLineBreaks | 634.3 us |    NA | 634.3 us | 634.3 us | 10.0000 |      1 MB |
|                  New_Convert_ToBase64String | 396.4 us |    NA | 396.4 us | 396.4 us |  9.0000 |      1 MB |
| New_Convert_ToBase64String_InsertLineBreaks | 425.9 us |    NA | 425.9 us | 425.9 us |  9.0000 |      1 MB |

Benchmarked on 1000 random byte arrays with lengths from 0 to 999.

Tricks used:

  • Use a precomputed table of Base64 char pairs. It takes 16 KiB of RAM (4096 entries of two 2-byte chars each) plus some warm-up time to initialize once and to load into the CPU cache. It halves the number of memory operations, because a single int store writes two char values at once.
  • Avoid reading each byte of the inData array twice.
  • Faster pointer arithmetic: *ptr++ instead of ptrBase[index]; ++index;. This saves an add instruction.
  • Remove the offset parameter from ConvertToBase64Array(); inData now points at array base + offset.
  • Optimize the most critical loop so that it checks only one condition per iteration. insertLineBreaks only affects how many iterations run before the next, more complex condition check.
  • Take BitConverter.IsLittleEndian into account so that the code works correctly on big-endian platforms.
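
To make the pair-table idea concrete, here is a minimal sketch in C rather than C#. It is not the PR's code: the names `pair_table`, `init_pair_table`, `encode_full_groups`, and `is_little_endian` are hypothetical, the output is modeled as 16-bit code units to mirror .NET chars, and only complete 3-byte groups are handled (no padding, no line breaks). Each 12-bit group indexes a 4096-entry table whose 32-bit entries hold two output characters, so one store writes two characters.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* One entry per 12-bit group; each entry packs two 16-bit characters.
   4096 entries * 4 bytes = 16 KiB. */
static uint32_t pair_table[4096];

/* Runtime byte-order probe (stands in for BitConverter.IsLittleEndian). */
static int is_little_endian(void) {
    const uint16_t probe = 1;
    uint8_t first_byte;
    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

/* Fill the table once. The first character of each pair must end up at the
   lower memory address, so the packing depends on byte order. */
static void init_pair_table(void) {
    const int le = is_little_endian();
    for (uint32_t i = 0; i < 4096; ++i) {
        const uint32_t first = (uint8_t)kAlphabet[i >> 6];
        const uint32_t second = (uint8_t)kAlphabet[i & 63];
        pair_table[i] = le ? (first | (second << 16))
                           : ((first << 16) | second);
    }
}

/* Encode complete 3-byte groups only; `out` needs room for 4 * (len / 3)
   code units. Returns the number of code units written. */
static size_t encode_full_groups(const uint8_t *in, size_t len, uint16_t *out) {
    const size_t groups = len / 3;
    for (size_t g = 0; g < groups; ++g) {
        /* Each input byte is read exactly once. */
        const uint32_t a = *in++, b = *in++, c = *in++;
        const uint32_t p0 = pair_table[(a << 4) | (b >> 4)];
        const uint32_t p1 = pair_table[((b & 0x0F) << 8) | c];
        memcpy(out, &p0, sizeof p0);     /* characters 0 and 1 */
        memcpy(out + 2, &p1, sizeof p1); /* characters 2 and 3 */
        out += 4;
    }
    return groups * 4;
}
```

Call `init_pair_table()` once before encoding. The sketch trades 16 KiB of table for two lookups and two stores per 3 input bytes, where a per-character table would need four of each.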

@ghost

ghost commented May 26, 2022

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label May 26, 2022
@ghost

ghost commented May 29, 2022

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: SergeiPavlov
Assignees: -
Labels:

area-System.Text.Encoding, community-contribution

Milestone: -

j += 4;
a = *inData++;
b = *inData++;
*outPairs++ = base64Pairs[(a << 4) | (b >> 4)];
Contributor

@tannergooding
Why is the codegen so different from LLVM's? https://godbolt.org/z/T8eqE9chv

LLVM

        shr     sil, 4
        shl     dil, 4
        or      dil, sil
        movzx   eax, dil
        ret

.NET JIT

       movzx    rax, dil
       shl      eax, 4
       movzx    rdi, sil
       sar      edi, 4
       or       eax, edi
       ret      
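
One detail visible in the two listings is operand width: the LLVM output works on 8-bit registers (`sil`, `dil`) and zero-extends at the end, while the JIT output widens both bytes to 32 bits before shifting. The source behind the godbolt link is not reproduced in this thread, so the following C sketch only illustrates, under that assumption, how the two widths differ for the index expression; `pair_index` and `pair_index_truncated` are hypothetical names.

```c
#include <stdint.h>

/* Index as the table lookup needs it: computed at 32-bit width, so all
   12 significant bits survive. */
static uint32_t pair_index(uint8_t a, uint8_t b) {
    return ((uint32_t)a << 4) | ((uint32_t)b >> 4);
}

/* The same expression narrowed back to a byte: the top 4 bits of `a`
   are lost, which permits pure 8-bit arithmetic. */
static uint8_t pair_index_truncated(uint8_t a, uint8_t b) {
    return (uint8_t)((a << 4) | (b >> 4));
}
```

For a = 0xAB and b = 0xCD the first function returns 0xABC and the second returns 0xBC, so a byte-returning function and an int-returning function are not interchangeable when comparing codegen for this lookup.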

Member

@EgorBo EgorBo May 30, 2022

@deeprobin feel free to file an issue. Related one is #13816 (you can use PR that closed it as a foundation for your PR if you want to contribute 🙂)

@EgorBo
Member

EgorBo commented May 30, 2022

There was a PR to use SSE for this API - dotnet/coreclr#21833

@stephentoub
Member

> There was a PR to use SSE for this API - dotnet/coreclr#21833

Why didn't we finish it?

@EgorBo
Member

EgorBo commented May 30, 2022

> > There was a PR to use SSE for this API - dotnet/coreclr#21833
>
> Why didn't we finish it?

It got lost during the coreclr->runtime migration 🙂 I can port it to cross-platform intrinsics after (if) this PR lands, to avoid conflicts.

@stephentoub
Member

> It got lost during the coreclr->runtime migration 🙂 I can port it to cross-platform intrinsics after (if) this PR lands, to avoid conflicts.

A quick skim of that PR suggests it doesn't require a 16K lookup table? If that's the case, with appreciation to Sergei, I'd prefer we just start with that PR as something that's faster and smaller.

@SergeiPavlov
Contributor Author

> A quick skim of that PR suggests it doesn't require a 16K lookup table? If that's the case, with appreciation to Sergei, I'd prefer we just start with that PR as something that's faster and smaller.

I agree, vector intrinsics are always preferable. This implementation, with its 16 KiB table cost, could serve as a fallback for platforms without SSE/AVX-like instructions.

@stephentoub
Member

@EgorBo, are you still going to look at bringing that PR back to life?

@EgorBo
Member

EgorBo commented Jun 29, 2022

> @EgorBo, are you still going to look at bringing that PR back to life?

Sure, will take a look in an hour

@EgorBo
Member

EgorBo commented Jun 29, 2022

> > @EgorBo, are you still going to look at bringing that PR back to life?
>
> Sure, will take a look in an hour

So I took a look. There was a small bug in the implementation, but the real problem is that it won't be trivial to extend it to support Arm: it uses too many non-shared intrinsics. I'll try to port it in the coming days.

@stephentoub
Member

@EgorBo, @SergeiPavlov, can this be closed now that @EgorBo's change went in?

@SergeiPavlov
Contributor Author

Yes.

The function has been optimized by commit a959c3e.

@stephentoub
Member

Thanks, @SergeiPavlov.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 22, 2022