This project is an end-to-end data science walkthrough using 2023–24 NCAA women’s basketball player statistics. The goal is to explore whether Caitlin Clark’s season was a statistical outlier and to build a machine learning workflow that predicts fantasy points from player performance metrics.

The project starts with data acquisition and preprocessing, then moves through feature engineering, exploratory data analysis, visualization, model selection, training, and evaluation. A bonus article extends the modeling section by comparing Ridge Regression with the ordinary least squares linear regression models trained earlier in the series. Each article focuses on one stage of the workflow and includes Python code, intermediate outputs, and explanations of the decisions made along the way.

The steps for this data science project

Project question

The central question for this project is:

Was Caitlin Clark’s 2023–24 season an outlier compared with other NCAA women’s basketball players?

To explore that question, I built a Python analysis pipeline using player statistics from the 2023–24 season. The project uses common data science tools and techniques, including data cleaning, pandas transformations, exploratory analysis, data visualization, linear regression, and model evaluation.

Project workflow

This project walks through the full data science process:

  1. Data acquisition — collect data from the NCAA website and Yahoo Sports API.
  2. Data cleaning — identify and correct missing, inconsistent, or invalid values.
  3. Data preprocessing — convert datatypes, standardize values, and prepare the dataset for analysis.
  4. Feature engineering — create derived metrics such as per-game statistics, conference labels, assist-to-turnover ratio, and fantasy points.
  5. Data exploration — use summary statistics, correlations, and exploratory plots to understand relationships between variables.
  6. Data visualization — create charts to compare players and highlight outliers.
  7. Machine learning — select, train, and evaluate ordinary least squares linear regression models.

Tools used

This project uses Python and common data science libraries:

  • Python — programming language used throughout the project
  • JupyterLab — notebook environment used for analysis and visualization
  • pandas — dataframe manipulation, cleaning, joins, and exports
  • NumPy — numerical operations and array-based calculations
  • requests — API requests for basketball statistics data
  • json — working with JSON API responses
  • os — local file and path handling
  • openpyxl — reading and writing Excel files
  • matplotlib — static charts and plot customization
  • seaborn — statistical visualizations, including heatmaps and pairplots
  • plotly — interactive visualizations
  • scipy — scientific computing utilities
  • scikit-learn — train/test splitting, linear regression, predictions, and model metrics
  • statsmodels — statistical modeling and regression analysis
  • joblib — saving and loading trained models
  • Postman — inspecting and testing API requests

Data sources

The project uses two primary data sources: the NCAA website and the Yahoo Sports API.

The raw datasets are cleaned, combined, transformed, and used throughout the rest of the project.

Project articles

1. Project setup and data acquisition

Read Part 1: Project Setup and Data Acquisition

The project begins by collecting NCAA basketball player information and statistics. This article covers the data sources, project setup, and process for combining player information and player statistics into one dataset.
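
As a rough illustration of the combining step, the sketch below joins a player information table with a player statistics table using pandas. The column names (player_id, player_name, games_played, and so on) and all values are made up for this example; they are assumptions, not the fields returned by the NCAA website or the Yahoo Sports API.

```python
import pandas as pd

# Hypothetical player information table (illustrative values only).
player_info = pd.DataFrame({
    "player_id": [1, 2, 3],
    "player_name": ["Player A", "Player B", "Player C"],
    "position": ["G", "F", "C"],
})

# Hypothetical season statistics table; player 3 has no stats row.
player_stats = pd.DataFrame({
    "player_id": [1, 2],
    "games_played": [30, 28],
    "points": [600, 420],
})

# A left join keeps every player from the info table. The indicator
# column (_merge) records whether each row found a match in the stats.
combined = player_info.merge(
    player_stats, on="player_id", how="left", indicator=True
)

# Rows marked left_only are players with no matching statistics.
unmatched = combined[combined["_merge"] == "left_only"]
print(combined)
print(f"Players without statistics: {len(unmatched)}")
```

Checking the unmatched rows right after a join is one simple way to catch players whose identifiers differ between the two sources.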

2. Data cleaning and preprocessing

Read Part 2: Data Cleaning and Preprocessing

This article cleans the raw basketball dataset so it is ready for analysis. It covers missing values, incorrect entries, datatype conversions, unit standardization, and other preprocessing steps needed before feature engineering.
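
A minimal sketch of three of these steps (unit standardization, datatype conversion, and counting missing values) might look like the following. The columns, the feet-inches height format, and the "n/a" placeholder are assumptions chosen for illustration and may not match the raw dataset.

```python
import pandas as pd

# Hypothetical raw rows; column names and values are illustrative only.
raw = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player C"],
    "height": ["6-0", "5-10", None],   # feet-inches stored as text
    "points": ["600", "420", "n/a"],   # numbers stored as text
})


def height_to_inches(value):
    """Convert a 'feet-inches' string such as '6-0' to total inches."""
    if pd.isna(value):
        return float("nan")
    feet, inches = value.split("-")
    return float(int(feet) * 12 + int(inches))


cleaned = raw.copy()

# Standardize units: one numeric height column measured in inches.
cleaned["height_inches"] = cleaned["height"].map(height_to_inches)

# Convert datatypes: entries that are not numbers (here 'n/a') become NaN.
cleaned["points"] = pd.to_numeric(cleaned["points"], errors="coerce")

# Count missing values per column before deciding how to handle them.
print(cleaned.isna().sum())
```

Coercing invalid entries to NaN keeps the bad rows visible in the missing-value counts instead of silently dropping them.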

3. Feature engineering

Read Part 3: Feature Engineering

This article creates new features from the cleaned dataset, including two-point metrics, conference labels, per-game statistics, assist-to-turnover ratio, and fantasy points.
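
The derived metrics named above can be sketched with a few pandas column operations. Everything specific here is an assumption: the column names, the sample totals, and especially the fantasy scoring weights, which are placeholders and not the formula used in the series.

```python
import numpy as np
import pandas as pd

# Hypothetical season totals; names and values are illustrative only.
stats = pd.DataFrame({
    "player_name": ["Player A", "Player B"],
    "games_played": [30, 28],
    "points": [600, 420],
    "assists": [150, 56],
    "turnovers": [75, 0],
    "field_goals_made": [210, 160],
    "three_pointers_made": [60, 20],
})

# Two-point field goals are the made field goals that were not threes.
stats["two_pointers_made"] = (
    stats["field_goals_made"] - stats["three_pointers_made"]
)

# Per-game statistics divide season totals by games played.
stats["points_per_game"] = stats["points"] / stats["games_played"]

# Assist-to-turnover ratio. Zero turnovers would divide by zero, so those
# rows become NaN here (one possible convention among several).
stats["assist_to_turnover"] = (
    stats["assists"] / stats["turnovers"].replace(0, np.nan)
)

# ASSUMED fantasy scoring weights, not the formula used in the series.
ASSUMED_WEIGHTS = {"points": 1.0, "assists": 1.5, "turnovers": -1.0}
stats["fantasy_points"] = sum(
    stats[column] * weight for column, weight in ASSUMED_WEIGHTS.items()
)

print(stats[["player_name", "points_per_game", "assist_to_turnover",
             "fantasy_points"]])
```

Keeping the weights in one dictionary makes it easy to swap in a different scoring system without touching the rest of the pipeline.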

4. Data exploration

Read Part 4: Data Exploration

This article uses summary statistics, feature selection, correlation matrices, and exploratory plots to better understand the dataset before building visualizations and models.
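
For readers who want a concrete picture of this stage, here is a small sketch of summary statistics and a correlation matrix in pandas. The five rows of data and the column names are invented for illustration.

```python
import pandas as pd

# Hypothetical per-player values; illustrative only, not real data.
features = pd.DataFrame({
    "points": [600, 420, 310, 150, 90],
    "assists": [150, 56, 80, 20, 12],
    "turnovers": [75, 40, 45, 18, 10],
    "fantasy_points": [750, 470, 390, 160, 100],
})

# Summary statistics give a first look at ranges and spread.
print(features.describe())

# Pearson correlation between every pair of numeric columns.
correlations = features.corr()

# Rank the input features by their correlation with the target column.
target_correlations = (
    correlations["fantasy_points"]
    .drop("fantasy_points")
    .sort_values(ascending=False)
)
print(target_correlations)
```

A ranking like this is only a starting point for feature selection, since strongly correlated inputs can also be strongly correlated with each other.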

5. Data visualizations

Read Part 5: Data Visualizations

This article creates visualizations to compare player performance, identify outliers, and better understand how Caitlin Clark’s season compares with the rest of the dataset.
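
Before highlighting a player on a chart, it helps to decide how an outlier is flagged. The sketch below uses a z-score cutoff, which is one common choice and not necessarily the method used in the article. The player labels, values, and threshold are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical points-per-game values; illustrative only, not real data.
ppg = pd.Series(
    [12.0, 14.0, 15.0, 13.0, 16.0, 14.0, 15.0, 32.0],
    index=[f"Player {letter}" for letter in "ABCDEFGH"],
    name="points_per_game",
)

# Z-score: how many sample standard deviations a value sits from the mean.
z_scores = (ppg - ppg.mean()) / ppg.std()

# Assumed cutoff. Values of 2 or 3 are common, and a small sample limits
# how large a z-score can get, so the cutoff is kept low here.
Z_THRESHOLD = 2.0
outliers = ppg[z_scores.abs() > Z_THRESHOLD]
print(outliers)
```

The resulting index of flagged players can then be passed to a plotting library to color or annotate those points.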

6. Model selection

Read Part 6: Selecting a Machine Learning Model

This article defines the prediction problem, chooses fantasy points as the target variable, selects input features, and explains why ordinary least squares linear regression is a reasonable starting model.

7. Training the model

Read Part 7: Training a Linear Regression Model

This article trains ordinary least squares linear regression models using the engineered basketball dataset. It covers train/test splitting, model fitting, reproducibility, coefficient interpretation, and alternate feature sets.
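
A minimal sketch of that training workflow with scikit-learn is shown below. It runs on synthetic data, so the feature names, the generated target, and the 80/20 split are assumptions, not the dataset or settings from the article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the engineered dataset (assumption).
rng = np.random.default_rng(seed=42)
n_players = 200
X = pd.DataFrame({
    "points": rng.integers(50, 700, n_players),
    "assists": rng.integers(5, 200, n_players),
    "turnovers": rng.integers(5, 120, n_players),
})

# Assumed target: a known linear combination of the inputs plus noise.
y = (
    X["points"] + 1.5 * X["assists"] - X["turnovers"]
    + rng.normal(0, 5, n_players)
)

# A fixed random_state makes the split reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Ordinary least squares fit on the training rows only.
model = LinearRegression()
model.fit(X_train, y_train)

# Each coefficient is the estimated change in the target for a one-unit
# change in that feature, holding the other features fixed.
coefficients = pd.Series(model.coef_, index=X.columns)
print(coefficients)
print(f"Intercept: {model.intercept_:.3f}")
```

Because the target was generated from known weights, the fitted coefficients should land close to 1.0, 1.5, and -1.0, which is a quick sanity check on the workflow.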

8. Evaluating the model

Read Part 8: Evaluating a Linear Regression Model

This article evaluates the trained models using predictions, error metrics, residual analysis, and model diagnostics. It compares the full model with a reduced-feature model and discusses what the results suggest about model performance.
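
The comparison between a full model and a reduced-feature model can be sketched as follows. The data is synthetic, and the choice of metrics (MAE, RMSE, and R²) and of which feature to drop are assumptions for illustration, not the article's actual results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and target (assumption, not the project dataset).
rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 100, size=(300, 3))
y = X @ np.array([1.0, 1.5, -1.0]) + rng.normal(0, 5, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def evaluate(columns):
    """Fit OLS on the given feature columns and score it on the test set."""
    model = LinearRegression().fit(X_train[:, columns], y_train)
    predictions = model.predict(X_test[:, columns])
    residuals = y_test - predictions
    return {
        "mae": mean_absolute_error(y_test, predictions),
        "rmse": float(np.sqrt(mean_squared_error(y_test, predictions))),
        "r2": r2_score(y_test, predictions),
        "mean_residual": float(residuals.mean()),
    }


full = evaluate([0, 1, 2])   # all three features
reduced = evaluate([0, 1])   # third feature left out

for name, metrics in (("full", full), ("reduced", reduced)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

In this synthetic setup the dropped feature genuinely affects the target, so the reduced model scores worse. On real data the gap may be small, which is when a simpler model becomes attractive.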

Bonus articles

Ridge vs. OLS linear regression models

Read the bonus article: Ridge vs. OLS Linear Regression Models

This article extends the machine learning portion of the project by comparing Ridge Regression with the ordinary least squares linear regression models trained earlier in the series. It revisits the same basketball dataset, trains Ridge models with scikit-learn, evaluates model performance, and discusses how regularization changes the model compared with OLS.
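
To show what regularization changes, the sketch below fits OLS and Ridge on the same synthetic data with two nearly identical features. The data, the alpha value, and the variable names are assumptions chosen to make the effect visible; they are not taken from the bonus article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with two highly correlated features (assumption).
rng = np.random.default_rng(seed=1)
n_samples = 100
x1 = rng.normal(0, 1, n_samples)
x2 = x1 + rng.normal(0, 0.05, n_samples)   # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(0, 1, n_samples)

# OLS minimizes squared error only.
ols = LinearRegression().fit(X, y)

# Ridge adds an L2 penalty on the coefficients; alpha sets its strength.
# The value 10.0 is an arbitrary choice for this illustration.
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
print("OLS coefficient norm:  ", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))
```

With correlated inputs, OLS can split the effect between the two features in an unstable way, while the Ridge penalty pulls the coefficients toward smaller values. In practice alpha is usually tuned, for example with cross-validation.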

Repository

The code for this project is available on GitHub:

View the NCAA Basketball Stats repository

Why this project matters

This project is designed to show the full workflow behind a data science analysis, not just the final result. Each step builds on the previous one, which makes the series useful for readers who want to see how a messy sports dataset becomes a structured analysis and machine learning project.

The project also serves as a practical portfolio example of:

  • acquiring data from online sources and APIs
  • cleaning and transforming real-world data
  • building reproducible analysis workflows
  • creating interpretable visualizations
  • applying regression modeling to sports data
  • evaluating model performance
  • communicating technical results clearly

Start here

Start with Part 1: Project Setup and Data Acquisition to follow the project from the beginning.

For readers most interested in modeling, start with Part 6: Selecting a Machine Learning Model, then continue to the model training and evaluation articles.