Skip to content

Conversation

@davidglassborow
Copy link
Contributor

@davidglassborow davidglassborow commented Apr 29, 2017

Add --deterministic+ option to fsc.exe to make compilation deterministic given exactly the same inputs, using similar logic to that of Roslyn.

See discussions on #1042 for background for what they did for C# and VB.net

Progress

  • Add deterministic option to WriteILBinary options
  • Add as +/- to CompileOps.fsi
  • Throw error if AssemblyVersionAttribute set to *, add error resource
  • Change F# to determinstic MVID in PE and PDB
  • Change F# to determinstic timestamp in PE
  • Add tests

Background reference:

C# Behaviour

Wildcard vs. Deterministic

C# error message is generic version format, see example below when compiling a cs file

C:\Work>csc simple.cs
Microsoft (R) Visual C# Compiler version 1.1.0.51109
Copyright (C) Microsoft Corporation. All rights reserved.

C:\Work>csc /deterministic simple.cs
Microsoft (R) Visual C# Compiler version 1.1.0.51109
Copyright (C) Microsoft Corporation. All rights reserved.

simple.cs(3,28): error CS7034: The specified version string does not conform to the required format - major[.minor[.build[.revision]]]

MVID and Timestamp

To create the MVID and time stamp with repeatable unique values the compiler uses cryptographic hashes. It takes the content of the PE with the above entries set to 0 and runs it through a SHA1 3 hash. The resulting 20 bytes are then carved up into a GUID (16 bytes) and a time stamp entry (4 bytes, high bit always set). A similar operation is performed for the PDB ID. This means the above values will be repeatable and unique for a given set of inputs.
The combination of the explicit ordering guarantee and the predictable values for MVID, PDB ID and timestamp allow us to produce fully deterministic PE outputs from the compiler. They will be identical byte for byte.

@msftclas
Copy link

@davidglassborow,
Thanks for having already signed the Contribution License Agreement. Your agreement was validated by Microsoft. We will now review your pull request.
Thanks,
Microsoft Pull Request Bot

@davidglassborow
Copy link
Contributor Author

@dsyme I've yet to add tests for this (I'm having problems with the tests on master at the moment), but wanted to get your feedback on if this looks like the correct approach.

A couple of questions:

  • Are there other ways to set a version apart from 'AssemblyVersionAttribute' ? The undocumented --version fsc.exe option seems to be deprecated (warning FS0075: The command-line option '--version' has been deprecated), and doesn't support wildcards (error FS0204: Invalid version string '2.1.*.*')
  • I've coped the command line description from C#, if this gets merged, I didn't know if there was anything I should do to update the fsc docs with new command line for the online docs

@dsyme
Copy link
Contributor

dsyme commented May 2, 2017

... but wanted to get your feedback on if this looks like the correct approach.

@davidglassborow The code looks right to me!

Are there other ways to set a version apart from 'AssemblyVersionAttribute' ? The undocumented --version fsc.exe option seems to be deprecated (warning FS0075: The command-line option '--version' has been deprecated), and doesn't support wildcards (error FS0204: Invalid version string '2.1..')

I think that's the only way right now

I've coped the command line description from C#, if this gets merged, I didn't know if there was anything I should do to update the fsc docs with new command line for the online docs

That seems fine. I suppose the docs would get updated separately.

@davidglassborow
Copy link
Contributor Author

@dsyme I've made good progress for the exe and for portable and embedded pdbs. The problem comes with debug:full or debug:pdbonly.

For mono this is fine because it uses the mdb file which is deterministic.

For windows it uses the native symbol writer, and this seems to generate guid and timestamps internally. Looks like the roslyn team made a version of the native symbol writer that supports deterministic pdb generation (https://www.nuget.org/packages/Microsoft.DiaSymReader.Native/).

Given the push towards portable pdbs, I'm wondering about not supporting deterministic builds in the debug:pdbonly and debug:full scenarios, but am unsure how much of an issue that would be.

Otherwise we could add a new package dependancy for the native reader, and use that (see rosyln's version.

Any thoughts ? cc/ @jaredpar @KevinRansom @cartermp

@dsyme
Copy link
Contributor

dsyme commented May 5, 2017

@davidglassborow Great work!

Given the push towards portable pdbs, I'm wondering about not supporting deterministic builds in the debug:pdbonly and debug:full scenarios, but am unsure how much of an issue that would be. Otherwise we could add a new package dependancy for the native reader, and use that (see rosyln's version.

I've not strong opinions - certainly the work is valuable as is - I'm personally fine with the extra package dependency and it helps engineering to align with what Roslyn do.

@jaredpar
Copy link
Member

jaredpar commented May 5, 2017

Looks like the roslyn team made a version of the native symbol writer that supports deterministic pdb generation

Note that it's not fully deterministic. True it allows us to control the PDB ID and other inherently non deterministic values. But the resulting PDB is still not yet byte for byte deterministic. It's being looked at but don't have a firm ETA for it.

@davidglassborow
Copy link
Contributor Author

Thanks @jaredpar, is there a roslyn issue # I can use to track progress of this ? Also is the source to that nuget package open source ?

@jaredpar
Copy link
Member

jaredpar commented May 5, 2017

Thanks @jaredpar, is there a roslyn issue # I can use to track progress of this ?

Not sure if we have an issue tracking full determinism in the native PDB writer. @tmat will know if we do.

Also is the source to that nuget package open source ?

Doubtful. It's not a package we own.

@tmat
Copy link
Member

tmat commented May 5, 2017

We do not have Roslyn issue - since it's not code we own. The code is not open source.

BTW, I'm working on extracting some of the native PDB writer code into a common library as public API (Microsoft.DiaSymReader.dll), so that other tools and compilers can use it directly without copying Roslyn internal implementation. You might want to wait for that: dotnet/roslyn#19297.

@jaredpar
Copy link
Member

jaredpar commented May 5, 2017

There is one item in the check list that I wanted to comment on based on our Roslyn experience: tests

Unit tests were excellent in helping us establish determinism in sub-systems of the compiler. As well as helping us lock down on bugs in those areas. But unit tests, no matter how many we added, were not enough to validate determinism.

Determinism is a feature than is ultimately about the integration of all the various pieces of the compiler against all the possible inputs it can deal with. Hence it's much better suited to have full integration tests.

The best integration test we found is the roslyn source code itself. On every PR we have a job dedicated to:

  • Build the current compiler.
  • Build the entire roslyn source tree with the fresh compiler. Record SHA1 hash of every DLL output
  • Build the entire roslyn source tree with the fresh copmiler again. Compare SHA1 hash with previous build.

Until we had this job we were a bit falsely confident in the deterministic capabilities of the compiler. Adding this job was a humbling experience and revealed a host of corner case bugs we'd missed. Now though we have it in place and feel very confident about our feature.

I highly suggest F# consider a similar approach to testing.

@dsyme
Copy link
Contributor

dsyme commented May 9, 2017

@davidglassborow I think we could get some smoke tests in now (just a couple of simple ones) and take this PR - it's great work

@jaredpar FOr some reason I'm not so pessimistic - AFAIK this PR really does address the sources of non-determinism. But we'll see :)

@jaredpar
Copy link
Member

@dsyme

FOr some reason I'm not so pessimistic

I'm always inherently terrified I will break anything that I don't have a ton of testing for 😄

@davidglassborow
Copy link
Contributor Author

@dsyme okay I'll look at some simple smoke tests. Are there any examples in the current test suite of the sort of thing your thinking ?

Also I want to specifically test the the guid inside the exe matches the one in the pdb files. Can you think of any existing tools or approach I could use for this ? (I'm wondering if there are standard CLR IL or debugging tools that would do the job)

@dsyme
Copy link
Contributor

dsyme commented May 12, 2017

@dsyme okay I'll look at some simple smoke tests. Are there any examples in the current test suite of the sort of thing your thinking ?

Perhaps compile something like tests\fsharp\core\libtest\test.fsx or one of the large similar files

Also I want to specifically test the the guid inside the exe matches the one in the pdb files. Can you think of any existing tools or approach I could use for this ? (I'm wondering if there are standard CLR IL or debugging tools that would do the job)

I don't know of tools to crack the PDB, sorry. There must be some

@davidglassborow
Copy link
Contributor Author

@dsyme I've finally had time to have a look at this again. I've added tests\fsharp\core\libtest\test.fsx as test case.

Given the native dependancy on sorting out debug:full and debug:pdbonly (dotnet/roslyn#19297), and that that the embedded pdb uses compression which makes it very hard to sort out mvid without large scale restructuring of the code, I suggest we make --deterministic support only builds with no debug info. If your happy with this I'll add an appropriate error.

@0x53A
Copy link
Contributor

0x53A commented May 22, 2017

I think Roslyn currently is more or less fully deterministic when creating portable pdbs, but when creating old-school pdbs, two compilations may produce equal assemblies but different pdbs.

Disallowing PDB generation overall would make this feature, in my honest opinion, close to useless.

Would it be possible to instead only allow --deterministic in combination with portable pdbs in the first step? Support for native pdbs may or may not be added later then, npdbs are in the process of being phased out and replaced with portable anyway.

@jaredpar
Copy link
Member

I think Roslyn currently is more or less fully deterministic when creating portable pdbs, but when creating old-school pdbs, two compilations may produce equal assemblies but different pdbs.

This is mostly correct. Let me state the actual rules.

When given deterministic input and the /deterministic switch:

  1. DLL / EXE will be produced fully deterministic. No other qualification is needed here (including the type of PDB being produced).
  2. Portable PDB will be produced fully deterministic.
    1, Native PDB will be produced in a non-deterministic fashion. The actual data bytes within the PDB are deterministic but the non-data bytes (gaps between fields) are non-deterministic.

There are rare occasions you will find the native PDB produced is fully deterministically. That should be considered "by luck" than "by design".

Disallowing PDB generation overall would make this feature, in my honest opinion, close to useless.

Agree. In particular because I'm very hopeful the fix for native PDBs is coming. If F# does the work now to be fully deterministic for portable and as detererministic as possible for native. Then once this fix is available it should be a simple NuGet package upgrade for F#.

@tmat
Copy link
Member

tmat commented May 22, 2017

Re Windows PDBs: As far as I can tell F# is currently using the SymWriter that's globally registered by .NET Framework:

https://github.com/Microsoft/visualfsharp/blob/de67934a8c64f57ee18356f4d94a1edc2df7fdca/src/absil/ilsupp.fs#L1016

That version is not being updated, other than security fixes. To get the version that supports determinism (modulo gaps between fields) F# will need to start using the SymWriter from Microsoft.DiaSymReader.Native.nupkg, like Roslyn does. Once the native implementation is fixed to be fully deterministic this package will be updated and F# can get full determinism for free. But work needs to be done to move to this package first.

Coincidentally, I'm currently working on extracting some of the code that Roslyn uses for Windows PDB writing to Microsoft.DiaSymReader.dll making it easier for other PDB producers to write correct PDBs. See https://github.com/dotnet/symreader/tree/master/src/Microsoft.DiaSymReader/Writer.

Let me know if you have any questions.

@davidglassborow
Copy link
Contributor Author

Okay I'll plan to get deterministic portable pdbs in this PR, and leave the native pub support to later on when the native sym writer is fully deterministic (and use the Microsoft.DiaSymReader nuget package to do so). @tmat @jaredpar did either of you know of any command line tools I can use as part of the tests to confirm the MVID identifiers between the exe and pdbs match ?

@tmat
Copy link
Member

tmat commented May 22, 2017

For Portable and Embedded PDBs: mdv.exe
https://dotnet.myget.org/feed/metadata-tools/package/nuget/Microsoft.Metadata.Visualizer

mdv Lib.dll

Prints out debug info and metadata for a PE file.

mdv Lib.pdb

Prints out info for Portable PDB.

mdv Lib.dll /embeddedpdb

Prints out info for PDB embedded into Lib.dll.

@davidglassborow
Copy link
Contributor Author

Thanks @tmat. I'm probably being thick, but I can't see the mdv.exe in that nuget package, just the Microsoft.Metadata.Visualizer.dll

@tmat
Copy link
Member

tmat commented May 23, 2017

@davidglassborow Ah... my fault. It's not there! Forgot to publish it. Let me fix that...

@tmat
Copy link
Member

tmat commented May 23, 2017

Thread a new option through the options to the code writing the IL.
For the moment just use a constant for deterministic timestamp, and use 0 as the timestamp in the MVID
@davidglassborow
Copy link
Contributor Author

Ok I've made determinism work only for portable pdbs (--debug:portable and --debug:embedded).
The portable pdb doesn't contain the mvid, so I'm happy this code is good to merge.

@davidglassborow davidglassborow changed the title [WIP] Determinism of the F# Compiler #1042 Determinism of the F# Compiler #1042 Jun 4, 2017
@dsyme
Copy link
Contributor

dsyme commented Jun 5, 2017

This is great, I approve! :)

@0x53A
Copy link
Contributor

0x53A commented Jun 5, 2017

As I understand the current implementation, the combination deterministic + native pdb will error out.

Would it be possible to instead allow this combination, with the caveat that the assembly will be deterministic, but the pdb not? That's basically the behaviour of Roslyn (ref #2954 (comment)).

We would like to move to portable pdb as fast as possible (and hopefully will, in the near future), but can't at this exact moment now, because a few build tools rely on native pdbs.

@davidglassborow
Copy link
Contributor Author

@0x53A unfortunately that's a lot more complicated to do.

The native PDB contains the MVID in it, and currently we don't calculate the deterministic MVID until after the PDB has been generated. This means recording in the memory stream where it is written and updating it (like the PR does with the metadata). However the native PDB is written using Windows native DLL currently.

It probably could be done, but to be inline with Roslyn we would have to change the code to use the Microsoft.DiaSymReader.Native dll.

I'm not going to have any time soon to do it, but happy to try and help if you want to have a crack.

@0x53A
Copy link
Contributor

0x53A commented Jun 5, 2017

@davidglassborow Ok then that makes sense, thank you.

That deterministic will work at all will be great, and as you said, better to ship it now and improve it later. =)

Copy link
Contributor

@KevinRansom KevinRansom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

@KevinRansom KevinRansom merged commit 2dcdb46 into dotnet:master Jun 7, 2017
@davidglassborow
Copy link
Contributor Author

@cartermp I'll raise a PR on the docs to document this new option, but newbie question, should I be looking at https://github.com/Microsoft/visualfsharpdocs/ or https://github.com/dotnet/docs for the fsc reference docs ?

@cartermp
Copy link
Contributor

cartermp commented Jun 7, 2017

The latter! That's where the active docs are. This is very cool.

@smoothdeveloper
Copy link
Contributor

smoothdeveloper commented Apr 25, 2019

is System.Environment.NewLine something we care about?

There is one usage: b4901a5#diff-5b9ab9dd9d7133aaf23add1048742031R10637 and I'm about to introduce some more, should I make a shim for it and hardcode it to \n?

edit: scratch that, this is in compile error path, deterministic build of non compilable code doesn't matter.

@jaredpar
Copy link
Member

jaredpar commented Apr 25, 2019

is System.Environment.NewLine something we care about?

No. Caring about this value implies that you believe builds will be identical across OS that differ in line ending. It's a good ideal but in practice it's not a reality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants