Hi @NiklasGustafsson, I would like to contribute to TorchSharp, as we are looking at using it to replace our CNTK usage in our full end-to-end machine learning pipelines written in C#. I have been looking at this example repo in that regard, which is a great starting place. I've mainly worked with image models, so I've been looking at the CIFAR10 example. I understand that these examples were created quickly and are bare-bones, and I would like to improve them. :)
For example, I have a few issues with the readers, such as CIFAR10Reader, and how they both randomize data by pre-defining randomized batches. That is not how you would normally do this; you would create unique random batches for each epoch. Similarly, an epoch would usually be defined by iterating over the samples of the dataset once (nothing is standardized here and you really can do anything you'd like, so this is just IMHO), not by adding transforms afterwards and hence multiplying the number of samples, like:
    public IEnumerable<(Tensor, Tensor)> Data()
    {
        for (var i = 0; i < data.Count; i++) {
            yield return (data[i], labels[i]);
            foreach (var tfrm in _transforms) {
                yield return (tfrm.forward(data[i]), labels[i]);
            }
        }
    }

Also, you wouldn't "transform" or augment the data if it is test data, although of course you can then just not set the transforms.
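To make the epoch definition above concrete, here is a rough sketch of a reader that yields every sample exactly once per pass and applies the transforms in place, only when training. It is generic so that it does not depend on TorchSharp types; `AugmentingReader`, the `isTraining` flag and the `Func<TData, TData>` transform shape are names I made up for illustration, not anything in the repo.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reader, generic so the sketch does not depend on TorchSharp types.
public sealed class AugmentingReader<TData, TLabel>
{
    private readonly IReadOnlyList<TData> _data;
    private readonly IReadOnlyList<TLabel> _labels;
    private readonly IReadOnlyList<Func<TData, TData>> _transforms;
    private readonly bool _isTraining;

    public AugmentingReader(
        IReadOnlyList<TData> data,
        IReadOnlyList<TLabel> labels,
        IReadOnlyList<Func<TData, TData>> transforms,
        bool isTraining)
    {
        if (data.Count != labels.Count)
            throw new ArgumentException("data and labels must have the same length.");
        _data = data;
        _labels = labels;
        _transforms = transforms;
        _isTraining = isTraining;
    }

    // One pass over Data() is one epoch: Count samples, no more.
    public int Count => _data.Count;

    public IEnumerable<(TData, TLabel)> Data()
    {
        for (var i = 0; i < _data.Count; i++)
        {
            var sample = _data[i];
            if (_isTraining)
            {
                // Chain the transforms on the sample instead of emitting extra samples.
                foreach (var transform in _transforms)
                    sample = transform(sample);
            }
            yield return (sample, _labels[i]);
        }
    }
}
```

Chaining the transforms is only one possible choice; with random augmentations, each epoch would still see a different variant of every sample without the epoch length changing.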
Anyway, I was thinking my first contribution could be to refactor the readers and implement concepts similar to the PyTorch DataSet and DataLoader. I have worked with this API but am not an expert, nor am I necessarily a fan of the Python APIs, but it seems you'd like TorchSharp to be similar to PyTorch, so basing it on that makes sense. Would that be of any interest?
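To give an idea of the shape I have in mind, here is a minimal sketch, not a finished design. `IDataset<T>`, `ListDataset<T>` and `DataLoader<T>` are hypothetical names that only loosely mirror the PyTorch concepts, and the loader returns plain lists where a real one would presumably collate samples into tensors.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical map-style dataset: indexable, with a known length.
public interface IDataset<out T>
{
    int Count { get; }
    T this[int index] { get; }
}

// Trivial in-memory dataset, used here only to keep the sketch self-contained.
public sealed class ListDataset<T> : IDataset<T>
{
    private readonly IReadOnlyList<T> _items;
    public ListDataset(IReadOnlyList<T> items) => _items = items;
    public int Count => _items.Count;
    public T this[int index] => _items[index];
}

// Hypothetical loader: batches a dataset and can reshuffle it for every epoch.
public sealed class DataLoader<T>
{
    private readonly IDataset<T> _dataset;
    private readonly int _batchSize;
    private readonly bool _shuffle;
    private readonly Random _random;

    public DataLoader(IDataset<T> dataset, int batchSize, bool shuffle, int seed)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        _dataset = dataset;
        _batchSize = batchSize;
        _shuffle = shuffle;
        _random = new Random(seed); // seeded so that runs can be reproduced
    }

    // Each enumeration is one epoch; with shuffle enabled it draws a new order.
    public IEnumerable<IReadOnlyList<T>> Epoch()
    {
        var indices = new int[_dataset.Count];
        for (var i = 0; i < indices.Length; i++) indices[i] = i;

        if (_shuffle)
        {
            // In-place swap shuffle of the index array.
            for (var i = indices.Length - 1; i > 0; i--)
            {
                var j = _random.Next(i + 1);
                (indices[i], indices[j]) = (indices[j], indices[i]);
            }
        }

        for (var start = 0; start < indices.Length; start += _batchSize)
        {
            // The last batch may be smaller than the batch size.
            var size = Math.Min(_batchSize, indices.Length - start);
            var batch = new T[size];
            for (var k = 0; k < size; k++) batch[k] = _dataset[indices[start + k]];
            yield return batch;
        }
    }
}
```

A training loop would then call `loader.Epoch()` once per epoch, while a test loader would simply be constructed with `shuffle: false`.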
Before doing this, I would very much like to migrate this example repo to .NET 6 and C# 10, follow standard C# code guidelines, and use modern language features, to really make the examples shine with regard to C#. Since performance is my passion, I'd also like the examples to at least minimally try to be efficient about what happens, even in cases where it does not matter so much.
Just one example: I would replace the following with a proper Fisher-Yates shuffle, which is easy to implement.
Enumerable.Range(0, count).OrderBy(c => rnd.Next()).ToArray();
Reproducibility is important too, so all random stuff should be seeded.
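As a sketch of what that replacement could look like, assuming a helper of my own naming (`Shuffler.Permutation` is hypothetical, not something in the repo), here is a seeded Fisher-Yates shuffle over the index range:

```csharp
using System;

public static class Shuffler
{
    // Returns a random permutation of 0..count-1 using an in-place Fisher-Yates shuffle.
    public static int[] Permutation(int count, Random random)
    {
        var indices = new int[count];
        for (var i = 0; i < count; i++) indices[i] = i;

        // Walk backwards, swapping each position with a random position at or before it.
        for (var i = count - 1; i > 0; i--)
        {
            var j = random.Next(i + 1); // 0 <= j <= i
            (indices[i], indices[j]) = (indices[j], indices[i]);
        }
        return indices;
    }
}
```

Usage would be `Shuffler.Permutation(count, new Random(seed))`. This runs in linear time and needs no sort, and passing an explicitly seeded `Random` should give the same order for the same seed on a given runtime.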
Sorry, I am sure you know all this, but I wanted to ask first whether such changes are of interest and whether you agree with them.
To recap I propose:
- Migrate to .NET 6 and C# 10
- Refactor readers based on a DataSet and DataLoader concept (rough draft)
- Address various minor issues as part of this
And we can take it from there.