Today we’ll revisit our Outlier or Caitlin Clark? data science project by examining the differences between Ridge Regression and our previously-trained Ordinary Least Squares (OLS) linear regression model. This is the ninth part of a series that walks through the entire process of a data science project - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, creating visualizations, and machine learning. This article will have some overlap and assume knowledge from the previous articles, so it’s recommended to check out selecting a machine learning model, training an OLS model, and evaluating an OLS model as well.

Getting Started

First, let’s take a look at an overview of this data science project. If you’re already familiar with it, feel free to skip to the next section.

Project Overview

As a reminder, the dataset we’ll be using in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season. Here’s a brief description of each major step of this project:

the steps for this data science project

  1. Data Acquisition - This initial step involves obtaining data from two sources: (1) exporting the NCAA’s online individual player statistics report and (2) making API requests to the Yahoo Sports endpoint.
  2. Data Cleaning - This step focuses on identifying and correcting any errors within the dataset. This includes removing duplicates, correcting inaccuracies, and handling missing data.
  3. Data Preprocessing - This step ensures the data is suitable for analysis by converting datatypes, standardizing units, and replacing abbreviations.
  4. Feature Engineering - This step involves selecting and expanding upon the dataset’s features (or columns). This includes calculating additional metrics from existing columns.
  5. Data Exploration - This step focuses on analyzing and visualizing the dataset to uncover patterns, relationships, and general trends and is a helpful preliminary step before deeper analysis.
  6. Creating Visualizations - This step involves identifying the relationships between various parameters (such as height and blocked shots) and generating meaningful visualizations (such as bar charts, scatterplots, and candlestick charts).
  7. Machine Learning - This step focuses on selecting, training, and evaluating a machine learning model. For this project, the model will identify the combination of individual player statistics that correlates with optimal performance.

We’ll use Python along with popular libraries like pandas, numpy, and scikit-learn to accomplish these tasks efficiently. By the end of this series, you’ll be equipped with the skills needed to gather raw data from online sources, structure it into a usable format, eliminate any inconsistencies and errors, identify relationships between variables, create meaningful visualizations, and train a basic machine learning model. Due to the size of this project, today we’ll cover part of the seventh step: machine learning. A refresher on the basics of machine learning is also available in a previous article.

Dependencies

Since this is the ninth installment in the series, you likely already have your environment set up and can skip to the next section. If you’re not already set up and you want to follow along on your own machine, it’s recommended to read the first article of the series or at least review the Getting Started section of that post before continuing.

Import Packages

You’ll want to have the latest version of Python installed, along with the following packages: pandas, NumPy, scikit-learn, Seaborn, and Matplotlib.

For today’s machine learning segment specifically, we’ll want to import a few of these libraries:

import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, root_mean_squared_error
import seaborn as sns
import matplotlib.pyplot as plt

Import Data

In a previous part of this series, we created our training and testing splits from the player_data dataframe. If you want to follow along with the code samples in this article, it’s recommended to import the testing splits before proceeding.

Note: To reduce confusion, the variable names in this article are slightly different than in a previous article. Models and datasets using the full set of features will have variable names appended with _full. Since the alternate models are trained using fewer features, those variable names will be appended with _few. For example, X_test of the full dataset is X_test_full and X_test with fewer features is X_test_few.

X_train_full = pd.read_csv('X_train_full.csv')
X_train_few = pd.read_csv('X_train_few.csv')
X_test_full = pd.read_csv('X_test_full.csv')
X_test_full.head()
Height MINUTES_PLAYED FIELD_GOALS_MADE THREE_POINTS_MADE TWO_POINTS_MADE FREE_THROWS_MADE TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS
0 71 821 184 43 141 80 155 36 90 51 5 46 491
1 71 840 61 7 54 45 221 40 41 34 9 63 174
2 68 961 100 36 64 75 120 107 84 27 1 79 311
3 73 1060 231 77 154 105 167 59 105 26 45 56 644
4 74 814 112 2 110 42 208 33 60 33 24 37 268
X_test_few = pd.read_csv('X_test_few.csv')
X_test_few.head()
Height MINUTES_PLAYED THREE_POINTS_MADE FREE_THROWS_MADE TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS
0 71 821 43 80 155 36 90 51 5 46
1 71 840 7 45 221 40 41 34 9 63
2 68 961 36 75 120 107 84 27 1 79
3 73 1060 77 105 167 59 105 26 45 56
4 74 814 2 42 208 33 60 33 24 37
y_train_full = np.genfromtxt('y_train_full.csv', delimiter=',', skip_header=True)
y_train_few = np.genfromtxt('y_train_few.csv', delimiter=',', skip_header=True)

As a reminder, we previously created testing splits for the target FANTASY_POINTS variable. However, those two splits contained identical data, so we’ll use a single testing split called y_actual for the target variable.

y_actual = np.genfromtxt('y_actual.csv', delimiter=',', skip_header=True)
y_actual
array([ 753. ,  544.2,  587.5,  969.9,  621.1,  594.4,  554.5,  808. ,
        884.2,  838.2,  901.6,  797.2, 1314.6,  474.1,  552.4,  687. ,
        514.9,  499. ,  886.9,  189.9,  697.1,  325.8,  465.8,  569.9,
        793.6,  691.4,  590.6,  661.2,  920. ,  643.1,  557.4,  634.1,
        562. ,  542.6,  848.8,  283. , 1218.9,  698.5,  476.1,  694. ,
        675.5,  638.8,  634.4,  646.9,  696.2,  611.3,  777.1,  335.3,
        430.7,  664.6,  604.9,  534.5,  860.9,  655.1,  478.8,  584. ,
        636.9,  787.2,  375.1,  622.7,  465.6,  545.4,  712.7,  398. ,
        538.5,  742.9,  559. ,  476.5,  395. ,  463.3,  568.3,  890.3,
        619. ,  582.4,  705.7,  690.6, 1027.6,  602.5,  540.3,  560.9,
        423.4,  653.3, 1171.8,  868.5,  526.8,  730. ,  834. ,  547.4,
        719.2,  765.3,  676.5,  826.8,  845. ,  361. ,  723.3,  372.7,
        876.9,  570.1,  708.8,  720.2,  780.5,  901.9,  489.8,  583.7,
        702. ,  769.6,  557.1,  595.5,  417.6,  799.9,  727.5,  960.4,
        430.6,  659.7,  499.6,  327.8,  870.2,  806.4,  550.4,  396.3,
        521.2,  447.3,  809.9,  561.6,  680.2,  446.6,  332.9,  495.2,
        823. ,  820.7,  706.4,  811.6, 1119. ,  329. ,  783.7,  787.9,
        737.3,  494.5,  508.3,  478. , 1182.3,  672.5,  733.2,  733.1,
        615.6,  559.6,  807.1,  728.8,  751.1,  864.1,  543.3,  737.3,
        986.7,  494.9,  639.8,  597.6,  612.5,  572.7,  709.4,  487.6,
        523.5,  484.3,  686.7,  815.9,  699.4,  614. ,  651.1,  576. ,
        832.7,  802. ,  974.1,  365.3,  656.1,  578.1,  444.2,  813.7,
        670.3,  746. ,  714.4,  473.9,  635.3,  435.9,  635.1,  773.5,
        412.3,  723.1,  464. ,  760.4,  532. ,  723.9,  514.2,  790.7,
        392.3,  649.4,  814.3,  951.3,  336.1,  714.6,  602.2,  429.6,
        652.1,  698.3,  577.1,  708.4,  966.5,  770.1,  638.1,  641.9,
        671.8, 1267.4,  757.2,  908.6,  646.3,  797.9,  758.8,  624. ,
        639.1,  769. ,  451.1,  643.5,  734.2,  545.7,  603.6,  858.6])

Set Graph Preferences

This is an entirely optional step to configure the aesthetics of the graphs and charts. You can import a custom color scheme or set colors individually. In this case, we’ll define a list of custom colors (graph_colors) and configure both Matplotlib and Seaborn to use these colors for subsequent plots.

graph_colors = ['#615EEA', '#339E7A', '#ff8c6c']
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=graph_colors)
sns.set_palette(graph_colors)

We can also set the overall style for the matplotlib plots using a style sheet. You can print the list of available style sheets and view examples of these style sheets on matplotlib’s website.
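If you’d like to see those options locally, Matplotlib also exposes the list directly (a standard attribute, shown here as a quick sketch):

print(plt.style.available)  # print the names of all available style sheets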

plt.style.use('seaborn-v0_8-white')

Select the Model

The first part of training any machine learning model is selecting the model to use. This might sound obvious, but selecting the “best” model for your problem depends on a variety of factors. We’ll likely explore this step in more detail in future articles (let me know if you would be interested in that), but for today let’s revisit scikit-learn’s model flowchart to find an appropriate model.

scikit-learn's algorithm flowchart

Just like in a previous article where we ended up selecting an Ordinary Least Squares linear regression model, we can go through this flowchart until we get to the “few features should be important” split of the regression section. This time, let’s explore the “No” branch of that split to see how a Ridge Regression model works.

Model Characteristics

Ridge Regression, also referred to as L2 regularization, is an extension of the Ordinary Least Squares (OLS) method. Both OLS and Ridge Regression aim to find the best-fit line through the data points to make predictions. The main difference is that Ridge regression adds regularization to how that line is fit.

While OLS minimizes the residual sum of squares without any constraints, Ridge regression adds a penalty term (the L2 regularization) to the loss function. This penalty term is proportional to the squared magnitude of the coefficients and is designed to shrink large coefficient values towards zero. One important characteristic of Ridge regression is that the coefficients shrink towards, but never actually reach, zero. This suppression of coefficients helps reduce overfitting caused by multicollinearity, makes model parameters more interpretable, and improves generalization performance.

In simplified terms, Ridge Regression strikes a balance between bias and variance by constraining the magnitude of coefficients in a linear regression model. This constraint prevents coefficients from growing too large and offers a stable, predictive model suitable for complex datasets with multiple predictor variables that may not be independent of each other.
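To make this concrete, here’s a small, hypothetical sketch (toy data, not our project dataset) that fits both OLS and Ridge on two nearly identical features. The exact numbers will vary, but the OLS coefficients can come out large and offsetting, while the Ridge coefficients are shrunk toward smaller, more stable values:

from sklearn.linear_model import LinearRegression, Ridge

# Toy example: x2 is nearly a copy of x1, so the two features are highly collinear
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
y_toy = 3 * x1 + rng.normal(scale=0.1, size=100)
X_toy = np.column_stack([x1, x2])

print(LinearRegression().fit(X_toy, y_toy).coef_)  # OLS: coefficients can be large and offsetting
print(Ridge(alpha=1.0).fit(X_toy, y_toy).coef_)    # Ridge: coefficients shrunk and more evenly split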

Model Equation

We can represent a Ridge regression model with equations similar to the ones for OLS. The prediction equation (the regression line) is unchanged, but the loss that the model minimizes during training gains the L2 penalty term:

\[\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n\]

\[\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda(\beta_1^2 + \beta_2^2 + ... + \beta_n^2)\]

where:

  • \(\hat{y}\) is the predicted value of the target variable
  • \(y_i\) are the actual values of the target variable
  • \(x_1, x_2, …, x_n\) are the input features
  • \(\beta_0\) is the y-intercept (the value of \(\hat{y}\) when all \(x\) are zero)
  • \(\beta_1, \beta_2, …, \beta_n\) are the coefficients that represent the change in \(\hat{y}\) for a one-unit change in the corresponding \(x\), holding all other \(x\) constant
  • \(\lambda\) is the regularization hyperparameter (the strength of the L2 penalty on the coefficient values)

From these equations, we know the goal of Ridge regression is still to find the values of \(\beta_0\) and \(\beta_1, \beta_2, …, \beta_n\) that minimize the difference between the predicted \(\text{FantasyPoints}\) (y_pred) values and the actual \(\text{FantasyPoints}\) (y_actual) values in our dataset. However, Ridge regression adds an additional goal: keeping the values of \(\beta_1, \beta_2, …, \beta_n\) themselves small in order to reduce the penalty term.

Note: This is a slightly simplified explanation of the underlying mathematics. I highly recommend reading the Wikipedia page, or other suitable sources, for a more nuanced understanding of the process. I’d also recommend this YouTube video for a visual example of how the regularization parameter works.

Regularization Hyperparameter

You might notice that if we set \(\lambda\) (the regularization hyperparameter) to zero, the penalty term vanishes and the Ridge loss simplifies down to the loss of the OLS model that we previously covered:

\[\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

So there’s not much point in setting \(\lambda\) to 0, but how do we know what to set \(\lambda\) to? The best value for \(\lambda\) can be determined in a few different ways that could be the topic of its own article, so please let me know if you would be interested in that! For today, we’ll use the default value set by the scikit-learn Ridge() class, which is 1.
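As a preview of one common approach, scikit-learn’s RidgeCV class can evaluate a grid of candidate values with cross-validation. Here’s a minimal sketch (the candidate values are just an example, and it assumes the training splits imported earlier):

from sklearn.linear_model import RidgeCV

# Cross-validate over a small, example grid of candidate lambda (alpha) values
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
ridge_cv = RidgeCV(alphas=candidate_alphas)
ridge_cv.fit(X_train_full, y_train_full)
ridge_cv.alpha_  # the candidate value that performed best during cross-validation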

Now that we conceptually understand a bit more about how this model works, let’s try training a Ridge regression model on our dataset.

Train the Models

Now that we have our data split into training and test sets, we’re ready to train our model. We’ll use scikit-learn’s Ridge class for this purpose. We can start by initializing the model with the default parameters:

model_full = Ridge()

# this is equivalent to running:
# model_full = Ridge(alpha=1)

Note: The default parameters of Ridge() include setting \(\lambda\) to 1 using the alpha parameter. We could (and probably will) explore how to identify the best value of \(\lambda\) in a future article, but for today we’ll use the default value. If you want to try out a different value, feel free to run model_full = Ridge(alpha=YOUR_LAMBDA_VALUE) instead.

We can then use the fit() method to actually train our model on our data. This method takes two arguments: X_train (the training split of our features) and y_train (the training split of our target variable). For a model trained on the full set of features, this will be X_train_full and y_train_full respectively.

model_full.fit(X_train_full, y_train_full)
Ridge()

We can follow this same process to train a model on the dataset with fewer features:

model_few = Ridge()
model_few.fit(X_train_few, y_train_few)
Ridge()

Just like with training the OLS model, we’re using the training set to learn the optimal parameters for each feature that minimize the difference between the predicted values of the target variable and the actual values of the target variable. However, Ridge regression adds logic to keep those learned parameters small as well. Once the training is complete, the model will contain the learned parameters (coefficients and intercept) that can be used to make predictions on new data. In the next step, we can use X_test to predict what the model thinks y is, and then compare that output (y_pred) to the actual y values (y_actual) to evaluate the model performance.

Generate Predictions

To evaluate the model’s performance, we’ll compare the values that the model predicts to the actual values (sometimes referred to as the “ground truth” values). Models that predict values close to the actual values perform better, and models that predict values far from the actual values perform worse. There are various evaluation metrics we can calculate using these predictions to quantify how well a model performs, but the first step is generating the predictions for each model.

To generate predictions, we’ll apply each trained model to the testing data split. We’ll use the testing data split instead of the training data split to ensure that the model is evaluated on data that it hasn’t seen during training. (For a refresher on why we split the dataset into training and testing subsets, see a previous article in this series.)

In a previous article, we generated predictions in three ways:

  1. Manually
  2. Using np.dot()
  3. Using the model’s .predict() method

You could follow the same steps to generate predictions manually and using np.dot() for this model as well, but for today we’ll jump straight into predictions using sklearn’s .predict() method.
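For reference, the np.dot() approach looks essentially the same as it did for the OLS model; here’s a quick sketch using the coefficients and intercept learned by model_full:

# Multiply each feature by its learned coefficient and add the intercept
y_pred_manual = np.dot(X_test_full, model_full.coef_) + model_full.intercept_
y_pred_manual[:5]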

Calculate Predictions with .predict()

Just like with the OLS LinearRegression model, sklearn’s Ridge model also has a .predict() function to calculate predictions.

y_pred_full = model_full.predict(X_test_full)
y_pred_full[:5]
array([752.99104584, 544.20411972, 587.49782753, 969.89164701,
       621.09740207])
y_pred_few = model_few.predict(X_test_few)
y_pred_few[:5]
array([689.05084667, 629.10060781, 640.65984357, 934.31552165,
       617.45344465])

We now have our predictions for both models and can see that the predicted values noticeably differ between the two models.

Create Baseline Model

Before evaluating how well these models perform, let’s create one more “model” as a baseline. This baseline model will simply predict the mean of the target variable and will serve as a simple reference point. (If you read the previous article in this series, this is exactly the same baseline model.) By comparing the performance of the ridge regression model to this naive approach, we can determine if the model is capturing meaningful patterns in the data and offering improvements beyond just predicting the average.

To create predictions for this baseline model, we can create an array of the same size and type as our predictions or actual values using NumPy’s .full_like() function:

y_pred_base = np.full_like(y_actual, np.mean(y_actual))
y_pred_base[:5]
array([658.009375, 658.009375, 658.009375, 658.009375, 658.009375])

Now we have all three of our predictions and can compare those to the actual values to evaluate how well each of the three models performs.

Evaluate the Model

After training our ridge regression models, the next crucial step is to evaluate performance. This evaluation process helps us understand how well our model is doing, identify any issues, and determine if it’s ready for real-world application or if it needs further refinement. In this section, we’ll explore various metrics and techniques to assess our model’s accuracy and reliability.

Evaluation Metric Definitions

Let’s start with a quick overview of each evaluation metric we’ll be exploring today.

  • R-squared (R²) - This measures the proportion of variance explained by the model. It gives a good first glance at how much of the variability in the target (Fantasy Points in this case) is explained by the model. It’s also referred to as the coefficient of determination.
  • Adjusted R-squared - This is similar to R-squared, but is adjusted for the number of predictors to account for overfitting. This is often useful when comparing multiple models with different numbers of features (as is the case between model_full and model_few).
  • Mean Squared Error (MSE) - This is the average of the squared differences between predicted and actual values. It indicates the model’s prediction accuracy, penalizing larger errors more heavily.
  • Root Mean Squared Error (RMSE) - This is the square root of MSE. It provides a more interpretable measure of prediction accuracy than MSE since it is in the same units and scale as the target variable. It helps understand the magnitude of the average prediction error.
  • Mean Absolute Error (MAE) - This is the average absolute difference between predicted and actual values. It is also a measure of the model’s prediction accuracy, but it penalizes errors equally (instead of penalizing larger errors like MSE) and is less sensitive to outliers as a result.
  • Residuals - These are the differences between the actual values and the predicted values. They help assess the accuracy of the model by potentially revealing patterns or biases in its predictions. These are usually plotted and analyzed visually.

Note that there are additional metrics that can be used to evaluate and compare regression models (such as Variance Inflation Factor and F-test), but the metrics covered today are commonly used and will serve as a good starting point.

We could calculate most of these evaluation metrics using multiple methods (like in a previous article), but for today we’ll use the convenient scikit-learn functions wherever available.

Define Variables

Before jumping into the evaluation metrics, let’s define a few helpful variables. In the equations in this section, the following variables will be used:

  • \(y_i\) are the actual values (y_actual)
  • \(\hat{y}\) are the predicted values (y_pred_full, y_pred_few, y_pred_base)
  • \(\bar{y}\) is the mean of the actual values (mean(y_actual))
  • \(n\) is the number of data points (len(y_actual))

The variables for \(y_i\) and \(\hat{y}\) are already defined, so let’s calculate \(n\) and \(\bar{y}\) next.

Calculate \(n\)

This step is pretty straightforward, since the number of data points in this context can be calculated by looking at the length of the testing data. You could use any of the variables from the testing dataset (X_test_full, y_pred_few, etc.), but we’ll simply use y_actual:

n = len(y_actual)
n
224

Calculate the Mean

The mean will be used to calculate a few of the evaluation metrics, so let’s take this opportunity to calculate it. Let’s start by looking at the equation for the arithmetic mean:

\[\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i\]

where:

  • \(y_i\) are the actual values (y_actual)
  • \(\bar{y}\) is the mean of the actual values (mean(y_actual))
  • \(n\) is the number of data points (len(y_actual))

Note: If you’d like a refresher on \(\Sigma{}\) notation, the summation Wikipedia page is a good resource.

We can use the NumPy .mean() function to calculate the mean directly:

y_mean = np.mean(y_actual)
y_mean
658.009375

Calculate Residuals

The last variable we’ll calculate before getting into the evaluation metrics is the residuals. In this context, the residuals refer to the difference between \(y_i\) and \(\hat{y}\).

Equation

\(\text{residuals} = y_i - \hat{y}\)

where:

  • \(y_i\) are the actual values (y_actual)
  • \(\hat{y}\) are the predicted values (y_pred_full, y_pred_few, y_pred_base)

Calculate

We can calculate the residuals for each model:

residuals_full = y_actual - y_pred_full
residuals_full[:5]
array([ 0.00895416, -0.00411972,  0.00217247,  0.00835299,  0.00259793])
residuals_few = y_actual - y_pred_few
residuals_few[:5]
array([ 63.94915333, -84.90060781, -53.15984357,  35.58447835,
         3.64655535])
residuals_base = y_actual - y_pred_base
residuals_base[:5]
array([  94.990625, -113.809375,  -70.509375,  311.890625,  -36.909375])

Evaluation

We’ll be plotting and analyzing these residuals in a later step, so for now we’ll use the residuals and other variables to calculate a few of the evaluation metrics.

\(R^2\)

\(R^2\), also known as the coefficient of determination, is a useful metric for evaluating regression models. It represents the proportion of variance in the dependent variable (Fantasy Points) that is predictable from the independent variable(s) (Height, Points, Steals, etc.).

As a proportion, \(R^2\) values usually range from 0 to 1, with higher values indicating better fit. For example:

  • 0 indicates that the model explains none of the variability of the target variable.
    • This means the model’s predictions are no better than simply predicting the mean of the target variable.
    • This suggests a poor fit or that there is no useful relationship between the target and the independent variables.
  • 1 indicates that the model explains all the variability of the target variable.
    • This means the model’s predictions perfectly match the target variable.
    • This suggests either an ideal fit or that the model is overfit.
  • 0.5 indicates that the model explains some of the variability of the target variable.
    • This means the model’s predictions somewhat match the target variable, and the model performs better than predicting the mean.
    • Values between 0 and 1 are common in real-world scenarios. \(R^2\) values closer to 1 indicate a better fit than values closer to 0.

In other words, R-squared provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.

If this is your first time learning about \(R^2\), Newcastle University has an excellent step-by-step walkthrough of how to calculate \(R^2\) by hand with visuals.

Equation

Let’s start by looking at the equation for \(R^2\):

\[R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}\]

where:

\(\text{RSS}\) can be calculated by:

\[\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

where:

  • \(y_i\) are the actual values (y_actual)
  • \(\hat{y}\) are the predicted values (y_pred_full, y_pred_few, y_pred_base)
  • \(n\) is the number of data points (len(y_actual))

Referring to the equation for \(\text{residuals}\) from earlier in this article, the equation for \(\text{RSS}\) can also be written as:

\[\text{RSS} = \sum_{i=1}^{n} (\text{residuals}_\text{model})^2\]

\(\text{residuals}_\text{model}\) in the equation above represents the residuals for each model, so there will be one \(\text{RSS}\) for residuals_full, another for residuals_few, and a third for residuals_base.

\(\text{TSS}\) can be calculated by:

\[\text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2\]

where:

  • \(y_i\) are the actual values (y_actual)
  • \(\bar{y}\) is the mean of the actual values (mean(y_actual))
  • \(n\) is the number of data points (len(y_actual))

Using the \(\text{residuals}\) equation from earlier, we can make a substitution for \(\text{TSS}\) as well:

\[\text{TSS} = \sum_{i=1}^{n} (\text{residuals}_\text{base})^2\]
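To tie these equations back to the code, here’s a quick manual sketch of \(R^2\) for model_full using the residuals and mean we calculated earlier; it should match the r2_score() result calculated below:

# Manual R^2 for model_full: 1 - (RSS / TSS)
rss_full = np.sum(residuals_full ** 2)
tss = np.sum((y_actual - y_mean) ** 2)
1 - (rss_full / tss)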

Calculate

We can calculate \(R^2\) using scikit-learn’s r2_score() function for each model:

r2_full = r2_score(y_actual, y_pred_full)
r2_full
0.9999999995121926
r2_few = r2_score(y_actual, y_pred_few)
r2_few
0.9207256674767885
r2_base = r2_score(y_actual, y_pred_base)
r2_base
0.0

Evaluation

Now that we have our results for each model, let’s take a look at how each model performs. As mentioned earlier, a higher \(R^2\) generally indicates a better fit for the model, with 1 being a perfect fit and 0 being a poor fit that performs no better than predicting the mean. In this case, model_full has an \(R^2\) value of almost, but not quite, 1.0, so its predictions are quite close to the ideal. On the other end, it makes sense that model_base has an \(R^2\) of 0, since this model is predicting the mean for each observation. model_few has an \(R^2\) of 0.92..., which is relatively close to 1, so this model also predicted values close to the actual values, but did not perform quite as well as model_full.

Adjusted \(R^2\)

Adjusted \(R^2\) is a modification to the standard \(R^2\) that we just calculated that adjusts for the number of predictors (Height, Points, Steals, etc.) in the model. Standard \(R^2\) will always increase as you add more predictors (even if they aren’t improving the model), which can make the results a bit misleading for models with many predictors. Adjusted \(R^2\) penalizes the addition of unnecessary predictors, so it provides a more accurate measure of the model’s performance when there are multiple predictors. This also makes it quite useful for comparing models with different numbers of predictors.

Adjusted \(R^2\) is similar to standard \(R^2\) in that values closer to 1 indicate a good fit, and values closer to (or below) 0 indicate a poor fit. Adjusted \(R^2\) can also drop below zero for poorly fitted models or when \(p\) is large relative to \(n\).

Equation

Let’s start by looking at the equation for Adjusted \(R^2\):

\[\text{Adjusted } R^2 = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - p - 1} \right)\]

where

  • \(R^2\) is the regular coefficient of determination
  • \(n\) is the number of data points
  • \(p\) is the number of predictors (independent variables)
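Scikit-learn doesn’t provide a built-in function for this metric (more on that in a moment), so if you plan to reuse the formula, one option is to wrap it in a small helper function. Here’s a minimal sketch; in this article we’ll simply compute it inline instead:

def adjusted_r2(r2, n, p):
    # Adjusted R^2 from the standard R^2, the number of data points n,
    # and the number of predictors p
    return 1 - ((1 - r2) * (n - 1)) / (n - p - 1)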

Calculate

We already have variables for \(R^2\) and \(n\), so let’s begin the manual calculation by defining \(p\). Scikit-learn’s Ridge regression models have an attribute called n_features_in_ that returns the number of features seen when fitting the model, so we can use that:

p_full = model_full.n_features_in_
p_full
13
p_few = model_few.n_features_in_
p_few
10

The baseline model always predicts the same value (the mean of the target variable), so it doesn’t use any features to make predictions. This means we can set \(p\) to 0 for this model:

p_base = 0
p_base
0

Now that we know both \(p\) and \(n\), we can calculate the Adjusted \(R^2\) for each model. At time of writing, there isn’t a built-in function in scikit-learn to calculate Adjusted \(R^2\), so we’ll calculate it manually using the standard \(R^2\) calculated in the previous step.

adj_r2_full = 1 - ((1 - r2_full) * (n - 1)) / (n - p_full - 1)
adj_r2_full
0.999999999481995
adj_r2_few = 1 - ((1 - r2_few) * (n - 1)) / (n - p_few - 1)
adj_r2_few
0.9170038678278114
adj_r2_base = 1 - ((1 - r2_base) * (n - 1)) / (n - p_base - 1)
adj_r2_base
0.0

Evaluation

Evaluating Adjusted \(R^2\) follows the same logic as the standard \(R^2\). We can see that the Adjusted \(R^2\) for model_base is still 0. The Adjusted \(R^2\) is slightly lower than the standard \(R^2\) for both model_full and model_few, as expected, and both values remain close to 1. This indicates that both models perform quite well, with model_full doing slightly better than model_few.

Mean Squared Error (MSE)

Mean Squared Error (MSE) is another common metric for evaluating regression models. It calculates the average of the squared differences between predicted values (\(\hat{y}\)) and actual values (\(y\)). Since MSE squares the errors, it can be more sensitive to outliers and less interpretable than other metrics, so it’s particularly useful when you want to heavily penalize large prediction errors.

  • 0 indicates the model makes perfect predictions
  • Values close to 0 indicate a better fit
  • Larger values indicate a worse fit, but there is no upper bound

Equation

Let’s start by looking at the equation for MSE:

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

where:

  • \(y_i\) are the actual values (y_actual)
  • \(\hat{y}\) are the predicted values (y_pred_full, y_pred_few, y_pred_base)
  • \(n\) is the number of data points (len(y_actual))

You might notice that part of this equation is the same as the calculation for \(RSS\) that we used to compute \(R^2\) earlier: \(\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)

We can substitute that in to rewrite this equation as:

\[\text{MSE} = \frac{\text{RSS}}{n}\]
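As a quick sanity check of that relationship, you can compute the MSE for model_few directly from its residuals and n; this should match the scikit-learn result calculated below:

# MSE for model_few calculated manually as RSS / n
np.sum(residuals_few ** 2) / n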

Calculate

We can use scikit-learn’s mean_squared_error() function to calculate MSE directly.

mse_full = mean_squared_error(y_actual, y_pred_full)
mse_full
1.5933696074432595e-05
mse_few = mean_squared_error(y_actual, y_pred_few)
mse_few
2589.40980805919
mse_base = mean_squared_error(y_actual, y_pred_base)
mse_base
32663.91183175223

Evaluation

An important note about MSE is that the units are \(\text{FantasyPoints}^2\). As mentioned earlier, an MSE closer to 0 is better, so it makes sense that model_full performs the best. mse_few is in between the values of mse_full and mse_base, with mse_base being over 10x larger than mse_few. The results are somewhat similar to those of \(R^2\), but a bit less interpretable, so let’s move on to the next metric.

Root Mean Squared Error (RMSE)

Root Mean Squared Error (RMSE) is the square root of the MSE and is helpful for determining the average magnitude of the error. It’s similar to MSE in many ways, including being sensitive to outliers. However, RMSE is often preferred over MSE because it’s in the same units as the target variable, making it easier to interpret.

Equation

The full equation for RMSE is:

\[\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\]

where:

  • \(y_i\) are the actual values (y_actual)
  • \(\hat{y}\) are the predicted values (y_pred_full, y_pred_few, y_pred_base)
  • \(n\) is the number of data points (len(y_actual))

Since RMSE is the square root of MSE, this equation can also be written as:

\[\text{RMSE} = \sqrt{\text{MSE}}\]
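Equivalently, you could take the square root of the MSE values from the previous section; for example:

np.sqrt(mse_few)  # should match the root_mean_squared_error() result below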

Calculate

Scikit-learn provides the root_mean_squared_error function that we can apply to the predictions from each model to calculate RMSE.

rmse_full = root_mean_squared_error(y_actual, y_pred_full)
rmse_full
0.0039917034051182455
rmse_few = root_mean_squared_error(y_actual, y_pred_few)
rmse_few
50.88624379986393
rmse_base = root_mean_squared_error(y_actual, y_pred_base)
rmse_base
180.7316016410861

Evaluation

For RMSE, values closer to zero are better, so it’s no surprise that the RMSE for model_full is almost zero. rmse_few is closer to zero than rmse_base, but we can also evaluate these quantities within the context of the target values.

print(f'y_actual values range from {np.amin(y_actual)} to {np.amax(y_actual)}')
y_actual values range from 189.9 to 1314.6
np.mean(y_actual)
658.009375

In this case, the target variable and its mean are on the order of hundreds, so an RMSE of about 51 for model_few seems fairly good, while the RMSE of about 181 for model_base is quite poor.

Mean Absolute Error (MAE)

Mean Absolute Error (MAE) measures the average magnitude of errors in a set of predictions, without considering their direction. It treats errors equally, making it less sensitive to outliers than MSE or RMSE. Similar to RMSE, it uses the same units as the target variable, making it easier to interpret than MSE.

  • 0 indicates the model makes perfect predictions
  • Values close to 0 indicate a better fit
  • Larger values indicate a worse fit, but there is no upper bound

Equation

Let’s take a look at the equation for MAE:

\[\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|\]

where:

  • \(y_i\) are the actual values (y_actual)
  • \(\hat{y}\) are the predicted values (y_pred_full, y_pred_few, y_pred_base)
  • \(n\) is the number of data points (len(y_actual))

Since \(y_i - \hat{y}_i\) is the same as the \(\text{residuals}\) calculated earlier in this article, the equation for \(\text{MAE}\) can also be written as:

\[\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\text{residuals}_\text{model}|\]
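As with the other metrics, a quick manual sketch using the residuals from earlier should match the scikit-learn function used below:

# MAE for model_few calculated manually from its residuals
np.mean(np.abs(residuals_few))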

Calculate

We can use scikit-learn’s mean_absolute_error() function with each model’s predictions.

mae_full = mean_absolute_error(y_actual, y_pred_full)
mae_full
0.0031245476904033153
mae_few = mean_absolute_error(y_actual, y_pred_few)
mae_few
39.83424851367819
mae_base = mean_absolute_error(y_actual, y_pred_base)
mae_base
139.73130580357142

Evaluation

Similar to MSE and RMSE, the lower the MAE, the better the model’s fit. mae_full is still quite close to zero, and mae_few is much better than mae_base, so both of those models perform better than the baseline model. We can use the same context used for RMSE (where y_actual ranges from ~190 to ~1,300 with a mean of ~658) to further confirm that the baseline model performs quite poorly, while mae_few performs reasonably well.

Evaluation Metrics Results

For convenience, we can also summarize the results of all of these evaluation metrics in a single table:

| Method | Model | \(R^2\) | Adjusted \(R^2\) | MSE | RMSE | MAE |
| --- | --- | --- | --- | --- | --- | --- |
| Mean | model_base | 0 | 0 | 32,663.91 | 180.73 | 139.73 |
| OLS | model_full | 1.0 | 1.0 | \(3.91 \times 10^{-26}\) | \(1.97 \times 10^{-13}\) | \(1.66 \times 10^{-13}\) |
| OLS | model_few | 0.921 | 0.917 | 2,589.41 | 50.88 | 39.83 |
| Ridge | model_full | 0.999 | 0.999 | \(1.59 \times 10^{-5}\) | 0.004 | 0.003 |
| Ridge | model_few | 0.921 | 0.917 | 2,589.41 | 50.88 | 39.83 |
|  |  | Closer to 1.0 is better | Closer to 1.0 is better | Lower is better | Lower is better | Lower is better |

In this table, we can see that the evaluation metrics for the Ridge regression model are essentially the same as the OLS model. However, this dataset we’re using to train the models today reflects an ideal situation that is often not present in real-world scenarios. Ridge regression remains an excellent choice for situations where reducing overfitting (and improving generalization) is important.

Residuals Plots

As a reminder, a residual is the difference between an observed value and its corresponding predicted value (\(y_i - \hat{y_i}\)). We calculated the residuals in a previous step, so let’s wrap up by plotting and reviewing the residuals. Plotting residuals is a useful visual way to evaluate the assumptions and identify potential issues with the fit of a regression model that metrics alone might miss.

When reviewing these residual plots, we’ll primarily be checking for whether or not there’s an issue with the model. For models with a good fit, these plots should have an even, random distribution of residuals around the horizontal line (zero on the y-axis) without any outliers. If there are any clear patterns (curves or clusters) in the residuals plot, that can suggest that the model is not capturing some aspect of the data, such as omitted variables, non-linearity issues, or heteroskedasticity.
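One small, optional tweak for the plots below is a dashed reference line at zero, which makes it easier to judge how the residuals scatter around the horizontal line:

# Optional: add after each plt.scatter(...) call to draw a dashed reference line at y = 0
plt.axhline(y=0, color='gray', linestyle='--')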

Evaluating Scatterplot of Residuals

Let’s start by creating a scatterplot of the residuals versus the predicted values for each model.

plt.scatter(y_pred_full, residuals_full)
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residual Plot for model_full')
plt.show()


At first glance, the residuals for model_full show a slight upward trend. However, the scale of the y-axis is \(\times 10^{-2}\), so all of these residuals are quite small compared to our y values.

plt.scatter(y_pred_few, residuals_few)
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residual Plot for model_few')
plt.show()


The residuals plot for model_few shows a more even distribution, so there aren’t any major issues. Note that the scale of the y-axis is \(\times 10^2\), so these residuals are considerably larger than for model_full.

plt.scatter(y_pred_base, residuals_base)
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.title('Residual Plot for model_base')
plt.show()


The residuals plot for model_base looks clearly different from the other two and is not evenly distributed around zero. This plot indicates an issue with the underlying model. Since this model just predicts the mean of the target variable (and uses zero predictors), the pattern in this residuals plot reflects exactly that problem (omitted variables).

For excellent examples of residuals plots that show good and poor model fits, I highly recommend Penn State’s page on Identifying Specific Problems using Residuals Plots.

Evaluating Histogram of Residuals

A histogram is another useful chart for evaluating residuals. This helps check the normality of residuals and will ideally show a bell-shape (a normal distribution). We can plot this using seaborn’s .histplot() function.

sns.histplot(residuals_full, bins=15, kde=True)
plt.title('Histogram of Residuals for model_full')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()


The histogram of residuals for model_full is roughly bell-shaped with a very slight shift to the right (above zero).

sns.histplot(residuals_few, bins=15, kde=True)
plt.title('Histogram of Residuals for model_few')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()


The histogram of residuals for model_few matches the normal distribution quite well and is evenly centered around zero.

sns.histplot(residuals_base, bins=15, kde=True)
plt.title('Histogram of Residuals for model_base')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()


The histogram of residuals for model_base is quite normal, minus the small “tail” on the right (above zero). Overall, the residuals are normally distributed. However, the residuals are quite large (non-negligible) compared to the scale of the y values, so model_base still does not perform well even though the histogram of residuals is otherwise normal.

For more examples of these plots, Penn State has an excellent page on the Normal Probability Plot of Residuals, including examples of residuals that are not normally distributed.

Feature Importance

As the last piece of model evaluation for today, we’ll take a look at the feature importance of each model. To do this, we’ll examine differences between the coefficients of the two models (model_full and model_few). As a reminder, model_base simply predicts the mean of the target variable without using any features. This means that all of the feature coefficients are set to zero for model_base. Since there is no meaningful comparison to make with model_base, we’ll focus on comparing the relative feature importance of model_full and model_few.

To compare the feature importance between the two models, let’s start by identifying the coefficients for each model. For both models, the coefficients can be accessed with the .coef_ attribute and the corresponding feature names can be accessed with the .feature_names_in_ attribute.

coef_full = pd.Series(data=model_full.coef_, index=model_full.feature_names_in_)
coef_full
Height               0.000004
MINUTES_PLAYED       0.000005
FIELD_GOALS_MADE     1.461186
THREE_POINTS_MADE    1.168972
TWO_POINTS_MADE      0.292214
FREE_THROWS_MADE     0.876717
TOTAL_REBOUNDS       1.200012
ASSISTS              1.500007
TURNOVERS           -1.000151
STEALS               2.000006
BLOCKS               2.000033
FOULS                0.000070
POINTS               0.123281
dtype: float64

Now we can assemble the coefficients for model_few.

coef_few = pd.Series(data=model_few.coef_, index=model_few.feature_names_in_)
coef_few
Height               2.452227
MINUTES_PLAYED       0.103872
THREE_POINTS_MADE    2.203715
FREE_THROWS_MADE     2.391725
TOTAL_REBOUNDS       1.521931
ASSISTS              1.323122
TURNOVERS           -0.570635
STEALS               2.239279
BLOCKS               2.481845
FOULS               -0.261200
dtype: float64

model_few has fewer coefficients than model_full, so let’s put the coefficients for each model and the difference between the two into a DataFrame for easy comparison.

df = pd.DataFrame({'model_full': coef_full, 'model_few': coef_few})
df['difference'] = df.model_few - df.model_full
df
model_full model_few difference
ASSISTS 1.500007 1.323122 -0.176886
BLOCKS 2.000033 2.481845 0.481813
FIELD_GOALS_MADE 1.461186 NaN NaN
FOULS 0.000070 -0.261200 -0.261270
FREE_THROWS_MADE 0.876717 2.391725 1.515008
Height 0.000004 2.452227 2.452223
MINUTES_PLAYED 0.000005 0.103872 0.103867
POINTS 0.123281 NaN NaN
STEALS 2.000006 2.239279 0.239273
THREE_POINTS_MADE 1.168972 2.203715 1.034743
TOTAL_REBOUNDS 1.200012 1.521931 0.321919
TURNOVERS -1.000151 -0.570635 0.429516
TWO_POINTS_MADE 0.292214 NaN NaN

As a reminder, model_few has fewer parameters than model_full, because FIELD_GOALS_MADE, TWO_POINTS_MADE, and POINTS were removed from the feature set to create model_few. Certain coefficients (MINUTES_PLAYED, Height, FOULS) of model_full are fairly close to zero but are not actually zero; this matches the expectations for Ridge regression models that we covered in the Model Characteristics section. For the features they share, model_few generally assigns larger weights (most values in the difference column are positive), which likely compensates for the scoring-related features that were removed.
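If you’d prefer to compare the coefficients visually, the DataFrame we just assembled also plots nicely as a horizontal bar chart; here’s a quick sketch:

# Compare the learned coefficients of both models side by side
df[['model_full', 'model_few']].plot(kind='barh', figsize=(8, 6))
plt.title('Ridge Coefficients by Feature')
plt.xlabel('Coefficient Value')
plt.show()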

Wrap Up

In today’s guide, we covered the primary differences between Ridge and OLS regression models. The previous article was meant to be the last installment in the series, but we’ve already returned for a quick look at Ridge regression models. So, please let me know if you’d like to use this dataset (or another dataset) to explore another type of machine learning model in the future!

As a reminder, all of the code snippets in today’s guide are available in a Jupyter Notebook in the ncaa-basketball-stats repository on GitHub.

Articles in this Series

  1. Acquiring and Combining the Datasets
  2. Cleaning and Preprocessing the Data
  3. Engineering New Features
  4. Exploratory Data Analysis
  5. Visualizations, Charts, and Graphs
  6. Selecting a Machine Learning Model
  7. Training the Machine Learning Model
  8. Evaluating the Machine Learning Model
  9. Ridge vs OLS Linear Regression Models (Today’s Guide)