Machine Learning Regression Analysis C# example

In an earlier post, I shared how to write a basic machine learning application using ML.NET in C#. In this tutorial, we learn how to write a regression analysis in C# with ML.NET.

What is Regression analysis?

Regression analysis is a set of statistical processes for estimating the relationship between a dependent variable and one or more independent (predictor) variables.

Linear regression is a model that assumes a linear relationship between the dependent variable (plotted on the vertical Y axis) and the predictor variables (X axis). With a single predictor, the fitted model is a straight line.
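
To make the straight-line idea concrete, here is a minimal sketch of an ordinary least squares fit for a single predictor, written in plain C# without ML.NET. The class name LinearFitSketch and the distance/fare numbers are made up for illustration; this is not how ML.NET trains its models internally.

```csharp
using System;
using System.Linq;

public static class LinearFitSketch
{
    // Ordinary least squares for one predictor: y ≈ slope * x + intercept.
    public static (double Slope, double Intercept) Fit(double[] x, double[] y)
    {
        if (x.Length != y.Length || x.Length < 2)
            throw new ArgumentException("x and y need the same length (at least 2 points).");

        double meanX = x.Average();
        double meanY = y.Average();

        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < x.Length; i++)
        {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0.0)
            throw new ArgumentException("All x values are identical; the slope is undefined.");

        double slope = sxy / sxx;
        return (slope, meanY - slope * meanX);
    }

    public static void Main()
    {
        // Made-up trips (assumption): distance in miles vs. fare in dollars.
        double[] distance = { 1.0, 2.0, 3.0, 4.0 };
        double[] fare = { 5.0, 7.5, 10.0, 12.5 };

        var (slope, intercept) = Fit(distance, fare);
        Console.WriteLine($"fare = {slope} * distance + {intercept}");
    }
}
```

The made-up points lie exactly on one line, so the fit recovers that line. Real fare data is noisy, and the fitted line is then the one with the smallest total squared error.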

Use Case (Price prediction)

In our example, we will predict taxi fares based on the previous year's data.
You can download the sample taxi-fare-train.csv and taxi-fare-test.csv datasets.

What will you learn in this machine learning tutorial?

  1. Setting up environment

    First, we need to set up our C# console development environment by installing the ML.NET libraries.

  2. Create Dataset

    You need to create your dataset. The source can be a SQL database or any other RDBMS, an Excel file, or a CSV file. If you are a SQL developer, you will probably find it easiest to create the dataset as a SQL view; you can then connect directly to the SQL database or copy the data into an Excel file.

  3. Loading Dataset

    At this stage, you load the dataset through the MLContext object so you can work with the data. How you load it depends on the data source; in my example, I load data from a CSV file.

  4. Analyzing Dataset

    You may need to understand the data by reordering it, removing columns, adding columns, grouping rows, and so on, to get it ready for training and testing algorithms.

  5. Visualizing Dataset

    Now you may want to see what the data looks like visually, by plotting or charting it. You can also save the visual representation in PDF format for future reference or reporting.

  6. Evaluating different algorithms

    Try different algorithms to see which one produces the most accurate result.

  7. Making predictions

    Finally, make predictions with real data.

Here is a complete example of regression analysis using C#: predicting taxi fares based on previous data.

Start Visual Studio 2019

Open Visual Studio 2019, select the C# Console App template, and click Next.

Right-click your project => Manage NuGet Packages => search for the Microsoft.ML package and install it.

You need to add the following namespaces.

using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;

Create an MLContext object; it is the entry point to the machine learning API.

// 1. Create a new instance of the MLContext object.
MLContext mlContext = new MLContext();

Load the data from a CSV file. We can also load data from an RDBMS such as SQL Server; we will see that example later.
(Note: In this example, we have downloaded the CSV data from the taxi-fare links above.)

Data structure

Create a class that matches the structure of the CSV file or other data source. Notice how each property is mapped to a column using the LoadColumn(column-index) attribute.

class Taxifare
{
    [LoadColumn(0)]
    public float vendor_id { get; set; }
    [LoadColumn(1)]
    public float rate_code { get; set; }
    [LoadColumn(2)]
    public float passenger_count { get; set; }
    [LoadColumn(3)]
    public float trip_time_in_secs { get; set; }
    [LoadColumn(4)]
    public float trip_distance { get; set; }
    [LoadColumn(5)]
    public string payment_type { get; set; }
    [LoadColumn(6)]
    public float fare_amount { get; set; }
        
}
public class TaxiTripFarePrediction
{
    [ColumnName("Score")]
    public float FareAmount;
}

We need to load two separate IDataView objects, one for the training data and one for the test data. If you are trying this with different data, make sure your class structure matches the data source.

string CSV_TestData = @"G:\RND\MLApp1\testdata\taxi-fare-test.csv";
string CSV_TrainData = @"G:\RND\MLApp1\testdata\taxi-fare-train.csv";
// 2. Load data from CSV
IDataView trainingDataVIew = mlContext.Data.LoadFromTextFile<Taxifare>(CSV_TrainData, separatorChar: ',',hasHeader: true);
IDataView testDataVIew = mlContext.Data.LoadFromTextFile<Taxifare>(CSV_TestData, separatorChar: ',', hasHeader: true);

At this point, you probably want to check that the data from the CSV files was loaded into the IDataView objects. There is a built-in method called GetRowCount(), and in the code below I tried to use it to check how many rows are in each dataset. It printed nothing: GetRowCount() returns a nullable long, and it returns null when the row count is not known without reading the data. An IDataView loaded from a text file is evaluated lazily, so the count is not known up front.

Console.WriteLine($"Training dataView (trainingDataVIew) {trainingDataVIew.GetRowCount()}");
Console.WriteLine($"Test dataView (testDataVIew) {testDataVIew.GetRowCount()} ");

Note: There is still a way! If you want to check that the data was loaded properly into the IDataView object, you can read the rows with a cursor, as shown here. You can skip this part.

DataViewSchema columns = trainingDataVIew.Schema;
// Create DataViewCursor
using (DataViewRowCursor cursor = trainingDataVIew.GetRowCursor(columns))
{
    // variables to hold extracted values 
    float _vendorId = default;
    float _ratecode = default;
    float _passengerCount = default;
    // Define delegates for extracting values from columns
    ValueGetter<float> vendorIdDelegate = cursor.GetGetter<float>(columns[0]);
    ValueGetter<float> ratecodeDelegate = cursor.GetGetter<float>(columns[1]);
    ValueGetter<float> passengerCountDelegate = cursor.GetGetter<float>(columns[2]);
    // Iterate over each row
    while (cursor.MoveNext())
    {
        //Get values from respective columns
        vendorIdDelegate.Invoke(ref _vendorId);
        ratecodeDelegate.Invoke(ref _ratecode);
        passengerCountDelegate.Invoke(ref _passengerCount);
    }
}

Machine learning algorithms can't directly use the raw data in our CSV file, so you need data transformations to pre-process the raw data into a format the algorithm can accept.

There are different types of transforms for converting data. As you can see in the code below, each column is appended to the pipeline using a different transform: Categorical (OneHotEncoding) | NormalizeMeanVariance | Concatenate

In Python, this part is much more straightforward; I hope a future ML.NET version gives us an easier way to transform raw data into a machine-compatible format.

// 3. Add data transformations
var dataProcessPipeline = mlContext.Transforms.CopyColumns(outputColumnName: "FareAmount", inputColumnName: nameof(Taxifare.fare_amount))
        .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "VendorIdEncoded", inputColumnName: nameof(Taxifare.vendor_id)))
        .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "RateCodeEncoded", inputColumnName: nameof(Taxifare.rate_code)))
        .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "PaymentTypeEncoded", inputColumnName: nameof(Taxifare.payment_type)))
        .Append(mlContext.Transforms.NormalizeMeanVariance(outputColumnName: nameof(Taxifare.passenger_count)))
        .Append(mlContext.Transforms.NormalizeMeanVariance(outputColumnName: nameof(Taxifare.trip_time_in_secs)))
        .Append(mlContext.Transforms.NormalizeMeanVariance(outputColumnName: nameof(Taxifare.trip_distance)))
        .Append(mlContext.Transforms.Concatenate("Features", "VendorIdEncoded", "RateCodeEncoded", "PaymentTypeEncoded", nameof(Taxifare.passenger_count),
        nameof(Taxifare.trip_time_in_secs), nameof(Taxifare.trip_distance)));
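
To show what two of these transforms do conceptually, here is a small sketch in plain C# without ML.NET. The class name TransformSketch and the sample values are made up for illustration. Standardize uses the textbook formula (x - mean) / standard deviation; ML.NET's NormalizeMeanVariance has a fixZero option that defaults to true and keeps zero mapped to zero, so its output can differ from this textbook version.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TransformSketch
{
    // One-hot encoding: each distinct category gets its own 0/1 slot.
    public static float[][] OneHot(string[] values)
    {
        // Slots are assigned in order of first appearance (a choice made for this sketch).
        var categories = new List<string>();
        foreach (string v in values)
            if (!categories.Contains(v)) categories.Add(v);

        return values
            .Select(v =>
            {
                var row = new float[categories.Count];
                row[categories.IndexOf(v)] = 1f;
                return row;
            })
            .ToArray();
    }

    // Textbook mean-variance scaling: (x - mean) / standard deviation.
    public static double[] Standardize(double[] values)
    {
        double mean = values.Average();
        double variance = values.Select(v => (v - mean) * (v - mean)).Average();
        double std = Math.Sqrt(variance);

        // A constant column carries no information; map it to zeros.
        if (std == 0.0) return values.Select(_ => 0.0).ToArray();
        return values.Select(v => (v - mean) / std).ToArray();
    }

    public static void Main()
    {
        // Hypothetical payment types and trip distances (assumptions).
        foreach (float[] row in OneHot(new[] { "CRD", "CSH", "CRD" }))
            Console.WriteLine(string.Join(" ", row));

        Console.WriteLine(string.Join(" ", Standardize(new[] { 1.0, 2.0, 3.0 })));
    }
}
```

Concatenate then joins the encoded and normalized columns into the single Features vector that the trainer reads.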

Now we choose the algorithm for training the model. Here you can try different algorithms to check which one produces the most accurate result. In the example below I tested Sdca (Stochastic Dual Coordinate Ascent); I also tried the LbfgsPoissonRegression trainer to see how the results differ.

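// 4. Choose the regression trainer (SDCA here) and append it to the data pipeline.
var trainer = mlContext.Regression.Trainers.Sdca(labelColumnName: "FareAmount", featureColumnName: "Features");
var trainingPipeline = dataProcessPipeline.Append(trainer);
// Optional: hold out 20% of the training file as a separate set.
// The steps below train on the full training file and evaluate on the test CSV,
// so trainData and testData are not used further in this example.
DataOperationsCatalog.TrainTestData dataSplit = mlContext.Data.TrainTestSplit(trainingDataVIew, testFraction: 0.2);
IDataView trainData = dataSplit.TrainSet;
IDataView testData = dataSplit.TestSet;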
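
The idea behind a test fraction of 0.2 can be sketched in plain C# as a shuffled hold-out split. The class name SplitSketch, the seed, and the sample rows are made up for illustration; ML.NET's TrainTestSplit may assign rows differently, so treat this only as a picture of the concept.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitSketch
{
    // Shuffles the row indices, then holds out testFraction of the rows for testing.
    public static (List<T> Train, List<T> Test) Split<T>(IReadOnlyList<T> rows, double testFraction, int seed)
    {
        if (testFraction <= 0.0 || testFraction >= 1.0)
            throw new ArgumentOutOfRangeException(nameof(testFraction));

        var rng = new Random(seed);

        // Fisher-Yates shuffle of the row indices.
        int[] idx = Enumerable.Range(0, rows.Count).ToArray();
        for (int i = idx.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (idx[i], idx[j]) = (idx[j], idx[i]);
        }

        int testCount = (int)Math.Round(rows.Count * testFraction);
        List<T> test = idx.Take(testCount).Select(i => rows[i]).ToList();
        List<T> train = idx.Skip(testCount).Select(i => rows[i]).ToList();
        return (train, test);
    }

    public static void Main()
    {
        // Ten made-up row ids stand in for the rows of a dataset.
        List<int> rows = Enumerable.Range(1, 10).ToList();
        var (train, test) = Split(rows, testFraction: 0.2, seed: 1);
        Console.WriteLine($"train: {train.Count}, test: {test.Count}");
    }
}
```

Every row lands in exactly one of the two sets, which is what keeps the evaluation honest: the model is never scored on rows it was trained on.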

Now train the model. The Fit method trains the model using the previous data.

var trainedModel = trainingPipeline.Fit(trainingDataVIew);

Here we evaluate the output of the regression model on the test data before we start making predictions with actual data.

IDataView transformTestDataVIew = trainedModel.Transform(testDataVIew);
// type: Microsoft.ML.Data.RegressionMetrics
var metrics = mlContext.Regression.Evaluate(transformTestDataVIew, labelColumnName: "FareAmount", scoreColumnName: "Score");
PrintRegressionMetrics(trainer.ToString(), metrics);

Just print the output to the console.

public static void PrintRegressionMetrics(string name, RegressionMetrics metrics)
{
    Console.WriteLine($"*********************************");
    Console.WriteLine($"*Metrics for {name} regression model      ");
    Console.WriteLine($"*-----------------------------------");
    Console.WriteLine($"* LossFn:        {metrics.LossFunction:0.##}");
    Console.WriteLine($"* R2 Score:      {metrics.RSquared:0.##}");
    Console.WriteLine($"* Absolute loss: {metrics.MeanAbsoluteError:#.##}");
    Console.WriteLine($"* Squared loss:  {metrics.MeanSquaredError:#.##}");
    Console.WriteLine($"* RMS loss:      {metrics.RootMeanSquaredError:#.##}");
    Console.WriteLine($"**************************************");
}
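
For readers who want to see what these numbers mean, here is a small sketch that computes the same kinds of metrics by hand in plain C#, using the standard textbook definitions. The class name MetricsSketch and the fare values are made up for illustration; ML.NET computes its RegressionMetrics internally, and this is not its code.

```csharp
using System;
using System.Linq;

public static class MetricsSketch
{
    public static (double Mae, double Mse, double Rmse, double R2) Compute(double[] actual, double[] predicted)
    {
        if (actual.Length != predicted.Length || actual.Length == 0)
            throw new ArgumentException("Both arrays need the same non-zero length.");

        int n = actual.Length;
        double sumAbs = 0.0; // sum of |error|
        double sumSq = 0.0;  // sum of error^2 (residual sum of squares)
        for (int i = 0; i < n; i++)
        {
            double err = actual[i] - predicted[i];
            sumAbs += Math.Abs(err);
            sumSq += err * err;
        }

        double mean = actual.Average();
        double totalSq = actual.Sum(a => (a - mean) * (a - mean)); // total sum of squares

        double mse = sumSq / n;
        // R2 is undefined when all actual values are identical.
        double r2 = totalSq == 0.0 ? double.NaN : 1.0 - sumSq / totalSq;
        return (sumAbs / n, mse, Math.Sqrt(mse), r2);
    }

    public static void Main()
    {
        // Made-up actual and predicted fares (assumptions).
        double[] actual = { 10.0, 12.0, 14.0, 16.0 };
        double[] predicted = { 11.0, 12.0, 13.0, 17.0 };

        var m = Compute(actual, predicted);
        Console.WriteLine($"MAE: {m.Mae:0.##}  MSE: {m.Mse:0.##}  RMSE: {m.Rmse:0.##}  R2: {m.R2:0.##}");
    }
}
```

For the made-up values in Main, the errors are -1, 0, 1 and -1, so the absolute loss and the squared loss both come to 0.75 and R2 to about 0.85. An R2 close to 1 means the model explains most of the variation in the fares; lower loss values are better.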

Here are the two outputs from two different algorithms trained on the same training data.

Method SdcaRegressionTrainer

*  Metrics for Microsoft.ML.Trainers.SdcaRegressionTrainer regression model
*--------------------------
*  LossFn:        35.36
*  R2 Score:      0.7
*  Absolute loss: .76
*  Squared loss:  35.36
*  RMS loss:      5.95
**************************

Method LbfgsPoissonRegressionTrainer

* Metrics for Microsoft.ML.Trainers.LbfgsPoissonRegressionTrainer regression model
*--------------------------
*  LossFn:        83.85
*  R2 Score:      0.28
*  Absolute loss: 2.96
*  Squared loss:  83.85
*  RMS loss:      9.16
**************************

in progress
