Quality Metrics for Linear Regression

Given a data set \(X = (x_i)\) that contains vectors of input variables \(x_i = (x_{i1}, \ldots, x_{ip})\), respective responses \(z_i = (z_{i1}, \ldots, z_{ik})\) computed at the prediction stage of the linear regression model defined by its coefficients \(\beta_{ht}\), \(h = 1, \ldots, k\), \(t = 1, \ldots, p\), and expected responses \(y_i = (y_{i1}, \ldots, y_{ik})\), \(i = 1, \ldots, n\), the problem is to evaluate the linear regression model by computing the root mean square error, the variance-covariance matrices of the beta coefficients, and other quality statistics. See Linear Regression for additional details and notations.

For linear regression, the library computes the statistics listed in the tables below for testing the insignificance of beta coefficients. The set of statistics to compute is selected by the value of QualityMetricsId.

For more details, see [Hastie2009].

Details

The statistics are computed given the following assumptions about the data distribution:

  • Responses \(y_{ij}\), \(i = 1, \ldots, n\), are independent and have a constant variance \(\sigma_j^2\), \(j = 1, \ldots, k\)

  • Conditional expectation of responses \(y_{.j}\), \(j = 1, \ldots, k\), is linear in input variables \(x_{.} = (x_{.1}, \ldots , x_{.p})\)

  • Deviations of \(y_{ij}\), \(i = 1, \ldots, n\), around the mean of expected responses \(\text{ERM}_j\), \(j = 1, \ldots, k\), are additive and Gaussian.
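Taken together, these assumptions state that each response component follows a Gaussian linear model. Writing \(\beta_{j0}\) for the intercept (present when the model is trained with interceptFlag set to true), this reads

\[ y_{ij} = \beta_{j0} + \sum_{t=1}^{p} \beta_{jt} x_{it} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, \sigma_j^2), \]

with independent errors \(\varepsilon_{ij}\).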

Testing Insignificance of a Single Beta

The library uses the following quality metrics:

Quality Metrics for Testing Insignificance of a Single Beta

  • Root Mean Square (RMS) Error: \(\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{ij} - z_{ij})^2}\), \(j = 1, \ldots, k\)

  • Vector of variances \(\sigma^2 = (\sigma_1^2, \ldots, \sigma_k^2)\): \(\sigma_j^2 = \frac{1}{n - p - 1} \sum_{i=1}^{n} (y_{ij} - z_{ij})^2\), \(j = 1, \ldots, k\)

  • Set of variance-covariance matrices \(C = (C_1, \ldots, C_k)\) for the vectors of betas \(\beta_{jt}\), \(j = 1, \ldots, k\): \(C_j = {(X^T X)}^{-1} \sigma_j^2\), \(j = 1, \ldots, k\)

  • Z-score statistics used in testing the insignificance of a single coefficient \(\beta_{jt}\): \(\text{zscore}_{jt} = \frac{\beta_{jt}}{\sigma_j \sqrt{v_t}}\), \(j = 1, \ldots, k\), where \(\sigma_j\) is the square root of the \(j\)-th element of the vector of variances \(\sigma^2\) and \(v_t\) is the \(t\)-th diagonal element of the matrix \({(X^T X)}^{-1}\)

  • Confidence interval for \(\beta_{jt}\): \((\beta_{jt} - \text{pc}_{1-\alpha} \sigma_j \sqrt{v_t},\; \beta_{jt} + \text{pc}_{1-\alpha} \sigma_j \sqrt{v_t})\), \(j = 1, \ldots, k\), where \(\text{pc}_{1-\alpha}\) is the \((1-\alpha)\) percentile of the Gaussian distribution, \(\sigma_j\) is the square root of the \(j\)-th element of the vector of variances \(\sigma^2\), and \(v_t\) is the \(t\)-th diagonal element of the matrix \({(X^T X)}^{-1}\)
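As an illustration of these definitions (not the library implementation), the following sketch computes the single-beta metrics for one response \(j\) from plain arrays; the function and variable names (singleBetaMetrics, xtxInvDiag, and so on) are hypothetical, and the diagonal of \({(X^T X)}^{-1}\) and the fitted betas are assumed to be already available.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative (non-library) computation of the single-beta metrics for one
// response j, following the definitions in the table above.
struct SingleBetaMetrics {
    double rms;                                           // RMS error
    double variance;                                      // sigma_j^2
    std::vector<double> zScore;                           // zscore_jt per beta
    std::vector<std::pair<double, double>> confInterval;  // (left, right) per beta
};

// y          - expected responses y_ij, i = 0..n-1
// z          - predicted responses z_ij, i = 0..n-1
// beta       - fitted coefficients beta_jt, t = 0..m-1
// xtxInvDiag - diagonal elements v_t of (X^T X)^{-1}
// p          - number of input variables
// pc         - (1 - alpha) percentile of the Gaussian distribution
SingleBetaMetrics singleBetaMetrics(const std::vector<double>& y,
                                    const std::vector<double>& z,
                                    const std::vector<double>& beta,
                                    const std::vector<double>& xtxInvDiag,
                                    std::size_t p, double pc) {
    const std::size_t n = y.size();
    double ss = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double r = y[i] - z[i];
        ss += r * r;                                      // sum of squared residuals
    }
    SingleBetaMetrics m;
    m.rms      = std::sqrt(ss / n);                       // RMS error
    m.variance = ss / (n - p - 1);                        // sigma_j^2
    const double sigma = std::sqrt(m.variance);
    for (std::size_t t = 0; t < beta.size(); ++t) {
        const double se = sigma * std::sqrt(xtxInvDiag[t]);
        m.zScore.push_back(beta[t] / se);                 // zscore_jt
        m.confInterval.push_back({ beta[t] - pc * se,     // confidence interval
                                   beta[t] + pc * se });
    }
    return m;
}
```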

Testing Insignificance of a Group of Betas

The library uses the following quality metrics:

Quality Metrics for Testing Insignificance of a Group of Betas

  • Mean of expected responses \(\text{ERM} = (\text{ERM}_1, \ldots, \text{ERM}_k)\): \(\text{ERM}_j = \frac{1}{n} \sum_{i=1}^{n} y_{ij}\), \(j = 1, \ldots, k\)

  • Variance of expected responses \(\text{ERV} = (\text{ERV}_1, \ldots, \text{ERV}_k)\): \(\text{ERV}_j = \frac{1}{n - 1} \sum_{i=1}^{n} (y_{ij} - \text{ERM}_j)^2\), \(j = 1, \ldots, k\)

  • Regression Sum of Squares \(\text{RegSS} = (\text{RegSS}_1, \ldots, \text{RegSS}_k)\): \(\text{RegSS}_j = \sum_{i=1}^{n} (z_{ij} - \text{ERM}_j)^2\), \(j = 1, \ldots, k\)

  • Sum of Squares of Residuals \(\text{ResSS} = (\text{ResSS}_1, \ldots, \text{ResSS}_k)\): \(\text{ResSS}_j = \sum_{i=1}^{n} (y_{ij} - z_{ij})^2\), \(j = 1, \ldots, k\)

  • Total Sum of Squares \(\text{TSS} = (\text{TSS}_1, \ldots, \text{TSS}_k)\): \(\text{TSS}_j = \sum_{i=1}^{n} (y_{ij} - \text{ERM}_j)^2\), \(j = 1, \ldots, k\)

  • Determination Coefficient \(R^2 = (R_1^2, \ldots, R_k^2)\): \(R^2_j = \frac{\text{RegSS}_j}{\text{TSS}_j}\), \(j = 1, \ldots, k\)

  • F-statistics used in testing the insignificance of a group of betas \(F = (F_1, \ldots, F_k)\): \(F_j = \frac{(\text{ResSS}_{0j} - \text{ResSS}_j)/(p - p_0)}{\text{ResSS}_j/(n - p - 1)}\), \(j = 1, \ldots, k\), where \(\text{ResSS}_j\) is computed for the full model with \(p + 1\) betas and \(\text{ResSS}_{0j}\) is computed for the reduced model with \(p_0 + 1\) betas (\(p - p_0\) betas are set to zero)
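Again for illustration only (not the library implementation), the sketch below computes these quantities for one response \(j\) from expected responses, full-model predictions, and reduced-model predictions; the function and variable names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Illustrative (non-library) computation of the group-of-betas metrics for
// one response j, following the definitions in the table above.
struct GroupOfBetasMetrics {
    double erm, erv, regSS, resSS, tss, r2, f;
};

GroupOfBetasMetrics groupOfBetasMetrics(const std::vector<double>& y,   // expected responses
                                        const std::vector<double>& z,   // full-model predictions (p + 1 betas)
                                        const std::vector<double>& z0,  // reduced-model predictions (p0 + 1 betas)
                                        std::size_t p, std::size_t p0) {
    const std::size_t n = y.size();
    GroupOfBetasMetrics m{};
    for (std::size_t i = 0; i < n; ++i) m.erm += y[i];
    m.erm /= n;                                            // ERM_j
    double resSS0 = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        m.erv   += (y[i] - m.erm) * (y[i] - m.erm);        // accumulates sum of (y_ij - ERM_j)^2
        m.regSS += (z[i] - m.erm) * (z[i] - m.erm);        // RegSS_j
        m.resSS += (y[i] - z[i]) * (y[i] - z[i]);          // ResSS_j
        resSS0  += (y[i] - z0[i]) * (y[i] - z0[i]);        // ResSS_0j (reduced model)
    }
    m.tss = m.erv;                                         // TSS_j
    m.erv /= (n - 1);                                      // ERV_j
    m.r2  = m.regSS / m.tss;                               // R^2_j
    m.f   = ((resSS0 - m.resSS) / (p - p0)) /
            (m.resSS / (n - p - 1));                       // F_j
    return m;
}
```

Under the null hypothesis that the \(p - p_0\) excluded betas are zero, \(F_j\) follows an F distribution with \(p - p_0\) and \(n - p - 1\) degrees of freedom, so large values of \(F_j\) indicate that the group of betas is significant.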

Batch Processing

Testing Insignificance of a Single Beta

Algorithm Input

The quality metric algorithm for linear regression accepts the input described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Algorithm Input for Testing Insignificance of a Single Beta in Linear Regression (Batch Processing)

  • expectedResponses: Pointer to the \(n \times k\) numeric table with responses (\(k\) dependent variables) used for training the linear regression model. This table can be an object of any class derived from NumericTable.

  • model: Pointer to the model computed at the training stage of the linear regression algorithm. The model can only be an object of the linear_regression::Model class.

  • predictedResponses: Pointer to the \(n \times k\) numeric table with responses (\(k\) dependent variables) computed at the prediction stage of the linear regression algorithm. This table can be an object of any class derived from NumericTable.

Algorithm Parameters

The quality metric algorithm for linear regression has the following parameters:

Algorithm Parameters for Testing Insignificance of a Single Beta in Linear Regression (Batch Processing)

  • algorithmFPType (default: float): The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

  • method (default: defaultDense): Performance-oriented computation method, the only method supported by the algorithm.

  • alpha (default: \(0.05\)): Significance level used in the computation of confidence intervals for the coefficients of the linear regression model.

  • accuracyThreshold (default: \(0.001\)): Values below this threshold are considered equal to it.

Algorithm Output

The quality metric algorithm for linear regression calculates the result described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Algorithm Output for Testing Insignificance of a Single Beta in Linear Regression (Batch Processing)

  • rms: Pointer to the \(1 \times k\) numeric table that contains the root mean square errors computed for each response (dependent variable).

  • variance: Pointer to the \(1 \times k\) numeric table that contains the variances \(\sigma^2_j\), \(j = 1, \ldots, k\), computed for each response (dependent variable).

  • betaCovariances: Pointer to the DataCollection object that contains \(k\) numeric tables, each with the \(m \times m\) variance-covariance matrix for the betas of the \(j\)-th response (dependent variable), where \(m\) is the number of betas in the model (\(m\) is equal to \(p\) when interceptFlag is set to false at the training stage of the linear regression algorithm; otherwise, \(m\) is equal to \(p + 1\)). The collection can contain objects of any class derived from NumericTable.

  • zScore: Pointer to the \(k \times m\) numeric table that contains the Z-score statistics used in the testing of insignificance of individual linear regression coefficients, where \(m\) is the number of betas in the model (\(m\) is equal to \(p\) when interceptFlag is set to false at the training stage of the linear regression algorithm; otherwise, \(m\) is equal to \(p + 1\)).

  • confidenceIntervals: Pointer to the \(k \times 2m\) numeric table that contains the limits of the confidence intervals for the linear regression coefficients:

      • \(\text{confidenceIntervals}[t][2*j]\) is the left limit of the confidence interval computed for the \(j\)-th beta of the \(t\)-th response (dependent variable)

      • \(\text{confidenceIntervals}[t][2*j+1]\) is the right limit of the confidence interval computed for the \(j\)-th beta of the \(t\)-th response (dependent variable),

    where \(m\) is the number of betas in the model (\(m\) is equal to \(p\) when interceptFlag is set to false at the training stage of the linear regression algorithm; otherwise, \(m\) is equal to \(p + 1\)).

  • inverseOfXtX: Pointer to the \(m \times m\) numeric table that contains the \({(X^TX)}^{-1}\) matrix, where \(m\) is the number of betas in the model (\(m\) is equal to \(p\) when interceptFlag is set to false at the training stage of the linear regression algorithm; otherwise, \(m\) is equal to \(p + 1\)).

Note

By default, the rms, variance, zScore, and confidenceIntervals results are objects of the HomogenNumericTable class, but you can define them as objects of any class derived from NumericTable, except for PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable.
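As a usage illustration of the layout described for confidenceIntervals (a hypothetical helper, not part of the library API), assume the table has been read into a row-major array of \(k \times 2m\) values; the limits for the \(j\)-th beta of the \(t\)-th response can then be extracted as follows:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// ci is assumed to hold the confidenceIntervals table row by row:
// k rows (one per response), 2*m columns (left/right limit per beta).
// Returns the (left, right) limits for beta j of response t.
std::pair<double, double> betaConfidenceInterval(const std::vector<double>& ci,
                                                 std::size_t m,   // number of betas
                                                 std::size_t t,   // response index
                                                 std::size_t j) { // beta index
    const std::size_t rowStart = t * 2 * m;
    return { ci[rowStart + 2 * j], ci[rowStart + 2 * j + 1] };
}
```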

Testing Insignificance of a Group of Betas

Algorithm Input

The quality metric algorithm for linear regression accepts the input described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Algorithm Input for Testing Insignificance of a Group of Betas in Linear Regression (Batch Processing)

  • expectedResponses: Pointer to the \(n \times k\) numeric table with responses (\(k\) dependent variables) used for training the linear regression model. This table can be an object of any class derived from NumericTable.

  • predictedResponses: Pointer to the \(n \times k\) numeric table with responses (\(k\) dependent variables) computed at the prediction stage of the linear regression algorithm. This table can be an object of any class derived from NumericTable.

  • predictedReducedModelResponses: Pointer to the \(n \times k\) numeric table with responses (\(k\) dependent variables) computed at the prediction stage of the linear regression algorithm using the reduced linear regression model, where \(p - p_0\) out of \(p\) beta coefficients are set to zero. This table can be an object of any class derived from NumericTable.

Algorithm Parameters

The quality metric algorithm for linear regression has the following parameters:

Algorithm Parameters for Testing Insignificance of a Group of Betas in Linear Regression (Batch Processing)

  • algorithmFPType (default: float): The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

  • method (default: defaultDense): Performance-oriented computation method, the only method supported by the algorithm.

  • numBeta (default: \(0\)): Number of beta coefficients used for prediction.

  • numBetaReducedModel (default: \(0\)): Number of beta coefficients (\(p_0\)) used for prediction with the reduced linear regression model, where \(p - p_0\) out of \(p\) beta coefficients are set to zero.

Algorithm Output

The quality metric algorithm for linear regression calculates the result described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Algorithm Output for Testing Insignificance of a Group of Betas in Linear Regression (Batch Processing)

  • expectedMeans: Pointer to the \(1 \times k\) numeric table that contains the mean of expected responses computed for each dependent variable.

  • expectedVariance: Pointer to the \(1 \times k\) numeric table that contains the variance of expected responses computed for each dependent variable.

  • regSS: Pointer to the \(1 \times k\) numeric table that contains the regression sum of squares computed for each dependent variable.

  • resSS: Pointer to the \(1 \times k\) numeric table that contains the sum of squares of residuals computed for each dependent variable.

  • tSS: Pointer to the \(1 \times k\) numeric table that contains the total sum of squares computed for each dependent variable.

  • determinationCoeff: Pointer to the \(1 \times k\) numeric table that contains the determination coefficient computed for each dependent variable.

  • fStatistics: Pointer to the \(1 \times k\) numeric table that contains the F-statistics computed for each dependent variable.

Note

By default, these results are objects of the HomogenNumericTable class, but you can define them as objects of any class derived from NumericTable, except for PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable.

Examples