Basketball Data Science Project

This project is an end-to-end data science walkthrough using 2023–24 NCAA women’s basketball player statistics. The goal is to explore whether Caitlin Clark’s season was a statistical outlier and to build a machine learning workflow that predicts fantasy points from player performance metrics.

The project starts with data acquisition and preprocessing, then moves through feature engineering, exploratory data analysis, visualization, model selection, training, and evaluation. A bonus article extends the modeling section by comparing Ridge Regression with the ordinary least squares linear regression models trained earlier in the series. Each article focuses on one stage of the workflow and includes Python code, intermediate outputs, and explanations of the decisions made along the way.

Project question

The central question for this project is:

Was Caitlin Clark’s 2023–24 season an outlier compared with other NCAA women’s basketball players?

To explore that question, I built a Python analysis pipeline using player statistics from the 2023–24 season. The project uses common data science tools and techniques, including data cleaning, pandas transformations, exploratory analysis, data visualization, linear regression, and model evaluation.

Project workflow

This project walks through the full data science process:

Data acquisition — collect data from the NCAA website and Yahoo Sports API.
Data cleaning — identify and correct missing, inconsistent, or invalid values.
Data preprocessing — convert datatypes, standardize values, and prepare the dataset for analysis.
Feature engineering — create derived metrics such as per-game statistics, conference labels, assist-to-turnover ratio, and fantasy points.
Data exploration — use summary statistics, correlations, and exploratory plots to understand relationships between variables.
Data visualization — create charts to compare players and highlight outliers.
Machine learning — select, train, and evaluate ordinary least squares linear regression models.

Tools used

This project uses Python and common data science libraries:

Python — programming language used throughout the project
JupyterLab — notebook environment used for analysis and visualization
pandas — dataframe manipulation, cleaning, joins, and exports
NumPy — numerical operations and array-based calculations
requests — API requests for basketball statistics data
json — working with JSON API responses
os — local file and path handling
openpyxl — reading and writing Excel files
matplotlib — static charts and plot customization
seaborn — statistical visualizations, including heatmaps and pairplots
plotly — interactive visualizations
scipy — scientific computing utilities
scikit-learn — train/test splitting, linear regression, predictions, and model metrics
statsmodels — statistical modeling and regression analysis
joblib — saving and loading trained models
Postman — inspecting and testing API requests

Data sources

The project uses two primary data sources:

NCAA basketball statistics — player information such as team, class, height, and position
Yahoo Sports NCAA basketball statistics — individual player statistics such as points, rebounds, assists, blocks, steals, and field goal metrics

The raw datasets are cleaned, combined, transformed, and used throughout the rest of the project.

Project articles

1. Project setup and data acquisition

Read Part 1: Project Setup and Data Acquisition

The project begins by collecting NCAA basketball player information and statistics. This article covers the data sources, project setup, and process for combining player information and player statistics into one dataset.
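As a rough illustration of the combining step, the sketch below joins a small player-information table to a player-statistics table with pandas. The column names and rows are made up for the example and are not taken from the actual NCAA or Yahoo Sports data; the join key used in the series may also differ.

```python
import pandas as pd

# Hypothetical player information (stands in for the NCAA data).
player_info = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C"],
    "team": ["Team X", "Team Y", "Team Z"],
    "position": ["G", "F", "C"],
})

# Hypothetical player statistics (stands in for the Yahoo Sports data).
player_stats = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player D"],
    "points": [512, 430, 388],
    "assists": [120, 45, 60],
})

# A left join keeps every row of player_info. indicator=True adds a
# "_merge" column that shows which rows found a match in player_stats.
combined = player_info.merge(
    player_stats, on="player_name", how="left", indicator=True
)

# Rows marked "left_only" have player information but no statistics.
unmatched = combined[combined["_merge"] == "left_only"]

print(combined)
print(f"Players without statistics: {len(unmatched)}")
```

Checking the unmatched rows right after a join is a cheap way to catch name mismatches between sources before they turn into missing values later on.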

2. Data cleaning and preprocessing

Read Part 2: Data Cleaning and Preprocessing

This article cleans the raw basketball dataset so it is ready for analysis. It covers missing values, incorrect entries, datatype conversions, unit standardization, and other preprocessing steps needed before feature engineering.
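The kinds of fixes described here can be sketched on a tiny made-up table. The example assumes heights arrive as feet-inches strings such as "6-0" and that some numeric columns are stored as text with a "-" placeholder; both formats are assumptions for illustration, not a description of the real raw data.

```python
import pandas as pd

# Hypothetical raw rows with the kinds of problems cleaning has to handle.
raw = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C"],
    "height": ["6-0", "5-11", None],   # assumed feet-inches strings
    "points": ["512", "430", "-"],     # numbers stored as text
})


def height_to_inches(value):
    """Convert a 'feet-inches' string such as '6-0' to total inches."""
    if pd.isna(value):
        return float("nan")
    feet, inches = str(value).split("-")
    return int(feet) * 12 + int(inches)


cleaned = raw.copy()

# Unit standardization: one numeric height column in inches.
cleaned["height_in"] = cleaned["height"].apply(height_to_inches)

# Datatype conversion: errors="coerce" turns entries that cannot be
# parsed (here, "-") into NaN instead of raising an exception.
cleaned["points"] = pd.to_numeric(cleaned["points"], errors="coerce")

# Count missing values per column after conversion.
print(cleaned.isna().sum())
```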

3. Feature engineering

Read Part 3: Feature Engineering

This article creates new features from the cleaned dataset, including two-point metrics, conference labels, per-game statistics, assist-to-turnover ratio, and fantasy points.
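To make a few of these derived metrics concrete, here is a minimal sketch on invented season totals. The column names are placeholders, and the fantasy point weights in particular are hypothetical: the scoring formula used in the series is not stated on this page, so the values below only show the shape of the calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical season totals for three players.
stats = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C"],
    "games": [30, 28, 25],
    "points": [600, 420, 250],
    "rebounds": [150, 280, 100],
    "assists": [180, 56, 50],
    "turnovers": [90, 28, 0],
})

# Per-game statistics: season total divided by games played.
for col in ["points", "rebounds", "assists"]:
    stats[f"{col}_per_game"] = stats[col] / stats["games"]

# Assist-to-turnover ratio. Zero turnovers are replaced with NaN first
# so the ratio is missing rather than infinite for those players.
stats["ast_to_tov"] = stats["assists"] / stats["turnovers"].replace(0, np.nan)

# HYPOTHETICAL scoring weights, chosen only for this illustration.
FANTASY_WEIGHTS = {
    "points": 1.0,
    "rebounds": 1.2,
    "assists": 1.5,
    "turnovers": -1.0,
}

# Fantasy points as a weighted sum of the counting statistics.
stats["fantasy_points"] = sum(
    stats[col] * weight for col, weight in FANTASY_WEIGHTS.items()
)

print(stats[["player_name", "points_per_game", "ast_to_tov", "fantasy_points"]])
```

Because fantasy points are a weighted sum of other columns, a linear model trained on those same columns can reproduce the target almost exactly, which is worth keeping in mind when reading the modeling articles.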

4. Data exploration

Read Part 4: Data Exploration

This article uses summary statistics, feature selection, correlation matrices, and exploratory plots to better understand the dataset before building visualizations and models.
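A first pass at this kind of exploration usually needs only two pandas calls. The sketch below runs them on synthetic data generated for the example (not the project dataset), with points and rebounds built to depend on minutes so the correlation matrix has something to show.

```python
import numpy as np
import pandas as pd

# Synthetic data: points and rebounds both grow with minutes, plus noise.
rng = np.random.default_rng(0)
n = 100
minutes = rng.uniform(10, 35, n)
df = pd.DataFrame({
    "minutes": minutes,
    "points": minutes * 0.5 + rng.normal(0, 2, n),
    "rebounds": minutes * 0.2 + rng.normal(0, 1.5, n),
})

# Summary statistics: count, mean, std, min, quartiles, max per column.
print(df.describe().round(2))

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()
print(corr.round(2))
```

The resulting `corr` dataframe is the kind of input a seaborn heatmap takes, which is one common way to read a correlation matrix at a glance.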

5. Data visualizations

Read Part 5: Data Visualizations

This article creates visualizations to compare player performance, identify outliers, and better understand how Caitlin Clark’s season compares with the rest of the dataset.
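Charts make outliers visible, and a simple numeric rule can back them up. The sketch below flags values more than three standard deviations from the mean (a z-score rule) on synthetic data with one extreme value planted on purpose. This is one common convention, not necessarily the method used in the series, and the numbers are invented rather than real player statistics.

```python
import numpy as np
import pandas as pd

# Synthetic fantasy point totals for 200 hypothetical players.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "player_name": [f"Player {i}" for i in range(200)],
    "fantasy_points": rng.normal(500, 100, 200),
})

# Plant one extreme value so there is a clear outlier to find.
df.loc[0, "fantasy_points"] = 1200.0

# z-score: how many standard deviations a value sits from the mean.
mean = df["fantasy_points"].mean()
std = df["fantasy_points"].std()
df["z_score"] = (df["fantasy_points"] - mean) / std

# Flag rows whose absolute z-score exceeds 3 (a common rule of thumb).
outliers = df[df["z_score"].abs() > 3]
print(outliers)
```

An interquartile-range rule, the one box plots use for their whiskers, is a reasonable alternative when the distribution is skewed.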

6. Model selection

Read Part 6: Selecting a Machine Learning Model

This article defines the prediction problem, chooses fantasy points as the target variable, selects input features, and explains why ordinary least squares linear regression is a reasonable starting model.

7. Training the model

Read Part 7: Training a Linear Regression Model

This article trains ordinary least squares linear regression models using the engineered basketball dataset. It covers train/test splitting, model fitting, reproducibility, coefficient interpretation, and alternate feature sets.
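The core of the training step can be sketched in a few lines of scikit-learn. The data here is synthetic, with a target built from a known linear combination of three made-up features plus noise, so the fitted coefficients can be checked against the values used to generate it. The feature names, the 80/20 split, and the seed are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic features for 200 hypothetical players.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "points": rng.uniform(0, 700, n),
    "rebounds": rng.uniform(0, 300, n),
    "assists": rng.uniform(0, 250, n),
})

# Synthetic target: a known linear combination plus a little noise.
y = (
    1.0 * X["points"]
    + 1.2 * X["rebounds"]
    + 1.5 * X["assists"]
    + rng.normal(0, 5, n)
)

# Hold out 20% of the rows for testing. Fixing random_state makes the
# split, and therefore the fitted model, reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit ordinary least squares linear regression on the training rows.
model = LinearRegression()
model.fit(X_train, y_train)

# Each coefficient is the expected change in the target for a one-unit
# increase in that feature, holding the other features fixed.
coefficients = pd.Series(model.coef_, index=X.columns)
print(coefficients.round(3))
print(f"Intercept: {model.intercept_:.3f}")
```

Training on an alternate feature set amounts to passing a different subset of columns, for example `X_train[["points", "assists"]]`, to a new `LinearRegression` instance.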

8. Evaluating the model

Read Part 8: Evaluating a Linear Regression Model

This article evaluates the trained models using predictions, error metrics, residual analysis, and model diagnostics. It compares the full model with a reduced-feature model and discusses what the results suggest about model performance.
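A minimal version of this comparison is sketched below on synthetic data: one model sees all three features, a reduced model sees only the first two, and both are scored on the same held-out rows. The metrics shown (MAE, RMSE, R², and mean residual) are common choices and an assumption about what such an evaluation might include, not a list taken from the article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on all three features.
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(200, 3))
y = X @ np.array([1.0, 1.2, 1.5]) + rng.normal(0, 5, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Full model uses every feature; reduced model drops the third one.
full_model = LinearRegression().fit(X_train, y_train)
reduced_model = LinearRegression().fit(X_train[:, :2], y_train)


def evaluate(model, X_eval, y_eval):
    """Return common regression metrics for a fitted model."""
    predictions = model.predict(X_eval)
    residuals = y_eval - predictions
    return {
        "mae": mean_absolute_error(y_eval, predictions),
        "rmse": np.sqrt(mean_squared_error(y_eval, predictions)),
        "r2": r2_score(y_eval, predictions),
        "mean_residual": residuals.mean(),
    }


full_metrics = evaluate(full_model, X_test, y_test)
reduced_metrics = evaluate(reduced_model, X_test[:, :2], y_test)

print("Full model:   ", {k: round(v, 3) for k, v in full_metrics.items()})
print("Reduced model:", {k: round(v, 3) for k, v in reduced_metrics.items()})
```

Scoring both models on the same test rows keeps the comparison fair; plotting the residuals against the predictions is a natural next step for spotting patterns that the summary metrics hide.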

Bonus articles

Ridge vs. OLS linear regression models

Read the bonus article: Ridge vs. OLS Linear Regression Models

This article extends the machine learning portion of the project by comparing Ridge Regression with the ordinary least squares linear regression models trained earlier in the series. It revisits the same basketball dataset, trains Ridge models with scikit-learn, evaluates model performance, and discusses how regularization changes the model compared with OLS.
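The effect of regularization is easiest to see when features are strongly correlated. The sketch below fits OLS and Ridge to synthetic data with two nearly identical features and compares the size of the coefficient vectors. The data and the `alpha=10.0` penalty strength are arbitrary choices for illustration, not values from the article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with two almost collinear features.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.05, n)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(0, 1, n)

# OLS minimizes squared error only. Ridge adds an L2 penalty on the
# coefficients, scaled by alpha, which shrinks them toward zero.
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Compare the overall size (L2 norm) of each coefficient vector.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print("OLS coefficients:  ", ols.coef_.round(3))
print("Ridge coefficients:", ridge.coef_.round(3))
print(f"OLS norm: {ols_norm:.3f}, Ridge norm: {ridge_norm:.3f}")
```

With collinear inputs, OLS coefficients can be large and unstable, while Ridge tends to spread the weight more evenly across the correlated features at the cost of a small amount of bias.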

Repository

The code for this project is available on GitHub:

View the NCAA Basketball Stats repository

Why this project matters

This project is designed to show the full workflow behind a data science analysis, not just the final result. Each step builds on the previous one, which makes the series useful for readers who want to see how a messy sports dataset becomes a structured analysis and machine learning project.

The project also serves as a practical portfolio example of:

acquiring data from online sources and APIs
cleaning and transforming real-world data
building reproducible analysis workflows
creating interpretable visualizations
applying regression modeling to sports data
evaluating model performance
communicating technical results clearly

Start here

Start with Part 1: Project Setup and Data Acquisition to follow the project from the beginning.

For readers most interested in modeling, start with Part 6: Selecting a Machine Learning Model, then continue to the model training and evaluation articles.