-
Couldn't load subscription status.
- Fork 1.9k
Description
I am formatting this issue as a feature suggestion, however, I am also looking for a workaround - a way to do it without implementing this feature suggestion, as I haven't found a solid one yet (referencing to my SO question).
Is your feature request related to a problem? Please describe.
I have a transformed IDataView with prediction (Score) column, input columns and one-hot encoded columns (working with regression task). I want to save only prediction and input values to a database table (i.e., I want to be able to choose which columns to save).
My main obstacle seems to be not having a class (POCO) representing the structure of the db model, but the main issue (trying to avoid the XY problem) seems to be not having a convenient way to save it to a table (like there is for a text file for example).
Describe the solution you'd like
An ideal experience would be something like
IDataView.SaveToDatabase(/*db connection/connection string*/, /*list of columns*/)or
mlContext.Data.SaveToDatabase(IDataView, /*db connection/connection string*/, /*list of columns*/)and possibly without loading the data into memory to be able to handle large data sets.
A less ideal solution would be create an extension to DataFrame for the same functionality - something similar to pandas.DataFrame.to_sql. So the user experience would be:
IDataView predictions = trainedModel.Transform(...);
var df = predictions.ToDataFrame(...);
df.ToSql(/*db connection/connection string*/, /*list of columns*/, /*possibly other parameters*/)Additional convenience (although this could be considered a completely separate feature):
My IDataView is transformed using one-hot encoding, but I want to write the original input values to a db table. I couldn't leave the same column names for the output after transformation (i.e., I had to do new InputOutputColumnPair(col.Name + "_onehot", col.Name)) because the input columns would then be hidden, and I couldn't easily get their value. It would be nice to have a way to do this without renaming the transformed columns i.e., something like
IDataView predictions = ...
var columns = predictions.InputColumns + prediction.ScoreColumn
/* returns a list of columns, where
InputColumns are the original columns without transformation,
ScoreColumn are the predicted values,
The plus operator means appending to the list
*/
predictions.SaveToDb(... , columns)I am aware of the SchemaAnnotationsExtensions.GetSlotNames but that doesn't seem to cover my use-case conveniently.
Describe alternatives you've considered
1. Create an `IEnumerable` from `IDataView` and write that to the DB
I used
IEnumerable<SomeType> predictionsEnumerable =
mlContext.Data.CreateEnumerable<SomeType>(predictions, reuseRowObject: true);in order to try and write it to a DB table.
First problem is that I don't have the SomeType. I tried using dynamic or object, but neither of them work - possibly related to
- Support dynamic types when working with IDataView #3829
- Easyer way to create dynamic DataViews #5895 (although this one is for creating the IDataView, which wasn't a problem for me)
(probably unrelevant) I also tried reflection but I didn't manage to make anything work .
Second problem is that even when I have the SomeType, I only managed to write it with Dapper converting the IEnumerable to a list. Without that I usually get the error
The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Parameter 4 (""): Data type 0x40 is unknown.
2. Convert `IDataView` to `DataFrame` and write that to DB
I used
var df = predictions.ToDataFrame(long.MaxValue, columns.ToArray());to create a DataFrame and tried to write that to the DB. I didn't manage to find a nice solution.
3. Manual type mapping
Tried to manually check the column type and get its value with the appropriate ValueGetter<>, to later convert it to a row and save it to a db table. I later saw that .ToDataFrame(...) does something similar, although much better in every aspect. Still, I failed to use the result in combination with Dapper to insert it into a table (in my case, I failed to make a use of the rows list). This also takes too much memory (didn't go too much into details, but around 2.5GB (half of it is "just my code") was on the heap, and 5.1GB of RAM taken). Compared to converting it to a DataFrame which takes just 1.1GB of RAM, (didn't check the heap). I assume my solution could be optimized, but that's not the point here.
columns = // list of columns
IDataView predictions = ...
IEnumerable<DataViewSchema.Column> dvColumns = columns.Select(col => predictions.Schema[col]);
IList<object> rows = new List<object>();
using (var cursor = predictions.GetRowCursor(dvColumns))
{
Dictionary<string, ValueGetter<ReadOnlyMemory<char>>> colGettersString = new();
Dictionary<string, ValueGetter<float>> colGettersSingle = new();
foreach (var col in dvColumns)
{
var colType = col.Type.RawType;
if (col.Type.RawType == typeof(ReadOnlyMemory<char>))
{
colGettersString.Add(col.Name, cursor.GetGetter<ReadOnlyMemory<char>>(col));
}
if (col.Type.RawType == typeof(float))
{
colGettersSingle.Add(col.Name, cursor.GetGetter<float>(col));
}
}
while (cursor.MoveNext())
{
ExpandoObject colValues = new(); // could use a regular dictionary here
foreach (var col in dvColumns)
{
ReadOnlyMemory<char> colValueString = default;
float colValueFloat = default;
if (colGettersString.ContainsKey(col.Name))
colGettersString[col.Name](ref colValueString);
else if (colGettersSingle.ContainsKey(col.Name))
colGettersSingle[col.Name](ref colValueFloat);
string colValueStringString = colValueString.ToString();
if (string.IsNullOrEmpty(colValueStringString))
colValues.TryAdd(col.Name, colValueFloat);
else
colValues.TryAdd(col.Name, colValueStringString);
}
rows.Add(colValues);
}
}4. Some other ideas
Using 3. and converting to JSON
This seems to be like one of the workarounds, which I didn't complete. The idea is to take rows list from 3. and convert it to JSON. Then, manipulate the JSON through SQL, which I gave up on as it seems too complicated. This also uses more memory as the JSON can be a large string with a lot of data.
Using reflection Emit
Create a class dynamically and use that to create a db model class, to be used in CreateEnumerable. I'm not sure how this works or if it is possible.
Additional context
- I am trying to insert data with Dapper into a SQL Server db
- Using .NET 6
- I am working with ~5 million rows of data with 5 columns