OLS regression outputs wrong TStats and PValue #5696

@zyzhu

System information

  • OS version/distro: Windows 10
  • .NET Version (e.g., dotnet --info):
    .NET SDK (reflecting any global.json):
      Version: 5.0.200
      Commit: 70b3e65d53

    Runtime Environment:
      OS Name: Windows
      OS Version: 10.0.19042
      OS Platform: Windows
      RID: win10-x64
      Base Path: C:\Program Files\dotnet\sdk\5.0.200\

Issue

  • What did you do?
    I tried to use ML.NET to run a Stats 101 example to get familiar with the library.
    The data points are generated so that y = x * 2 + random(). I used the OLS trainer to estimate the slope and output its t-statistics and p-values.
  • What happened?
    The pValue turns out to be 1 and the tStat turns out to be 0 for every coefficient.
  • What did you expect?
    The pValue should be close to zero and the tStat should be very large.

Here is the equivalent R code:

df <- data.frame(x = 1:100, y = 1:100*2 + runif(100))
model <- lm(y ~ x, df)
summary(model)

Output of R:

Residuals:
     Min       1Q   Median       3Q      Max 
-0.48638 -0.20409 -0.04365  0.22835  0.52931 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 0.5067878  0.0562763    9.005 1.74e-14 ***
x           1.9994857  0.0009675 2066.691  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2793 on 98 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1 
F-statistic: 4.271e+06 on 1 and 98 DF,  p-value: < 2.2e-16
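For reference, the t value in the R summary is the coefficient estimate divided by its standard error. The worked arithmetic below uses only the rounded figures printed above, so it is a rough sanity check rather than an exact reproduction; the symbols \hat{\beta}_j and \mathrm{SE} are notation introduced here.

```latex
t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}
\qquad
t_{x} \approx \frac{1.9994857}{0.0009675} \approx 2.07 \times 10^{3},
\qquad
t_{\text{(Intercept)}} \approx \frac{0.5067878}{0.0562763} \approx 9.005
```

Both values agree with the reported t values (2066.691 and 9.005) up to the rounding of the displayed standard errors, which is the order of magnitude expected from the ML.NET trainer as well.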

Source code / logs

The following is the F# script file. It can also be run in a Jupyter notebook via the .NET Interactive kernel.

#r "nuget: Microsoft.ML"
#r "nuget: Microsoft.ML.Mkl.Components"
open System
open Microsoft.ML
open Microsoft.ML.Data

[<CLIMutable>]
type Factor = {
    [<ColumnName("Label")>]
    y : float32
    intercept: float32
    x : float32
}

// Generate data: y = x * 2 + rnd
let rnd = Random()
let rows =
    [1.0 .. 100.0]
    |> Seq.map(fun v ->
        {
            y = float32 (v * 2.0 + rnd.NextDouble())
            intercept = float32 1.
            x = float32 v
        }
    )

let context = new MLContext()
let dataView = context.Data.LoadFromEnumerable(rows)
let pipeline =
    EstimatorChain()
        .Append(context.Transforms.Concatenate("Features", "intercept", "x"))
        .Append(context.Regression.Trainers.Ols())

let model = dataView |> pipeline.Fit
let modelParams = model.LastTransformer.Model
Seq.zip3 modelParams.Weights modelParams.TValues modelParams.PValues
|> Array.ofSeq
|> Array.iteri(fun i (w, t, p) ->
    printfn $"Beta {i}, w: {w:f3}, tStats: {t:f3}, pValue: {p:f3}")

Output

Beta 0, w: 0.005, tStats: 0.000, pValue: 1.000
Beta 1, w: 2.000, tStats: 0.000, pValue: 1.000
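A note on how the two wrong numbers relate, as a sketch under the assumption that the p-values are two-sided and derived from a Student t distribution (the usual convention, not confirmed for ML.NET here). With F_\nu denoting the t distribution's CDF for \nu degrees of freedom (notation introduced for illustration):

```latex
p = 2\left(1 - F_{\nu}\!\left(|t|\right)\right),
\qquad
F_{\nu}(0) = \tfrac{1}{2}
\;\Longrightarrow\;
p = 1 \quad \text{when } t = 0
```

Under that assumption, a pValue of 1 follows directly from a tStat of 0, so the two symptoms are probably one problem: the t-statistics (or the standard errors behind them) are not being computed, while the weights themselves look reasonable.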

One more piece of general feedback: ML.NET requires a lot of ceremony compared to the simplicity of the R sample above. I do not expect users from the R/Python community to embrace this complexity. The library seems to be designed with only software engineers in mind. Maybe there is a balance to be found between R/Python and .NET.

Labels: P2 (priority of the issue for triage purposes: needs to be fixed at some point)