Background

March Madness

March Madness refers to the NCAA Division I men’s college basketball tournament. It is a single-elimination tournament: teams are matched up, the losing team is immediately eliminated, and the winning team advances to the next round to play another winner. Rounds continue until a final matchup of two teams, whose winner becomes the tournament champion. The tournament starts with 68 teams and is played over 7 rounds. The term “March Madness” was first used to refer to basketball in 1939 by Henry V. Porter, an Illinois high school official, but it was not applied to the NCAA tournament until CBS broadcaster Brent Musburger used it during coverage of the 1982 tournament.

First Year: 1939

Original Number of Teams: 8

Purpose of Analysis

The goal of this project is to use historical NCAA men's basketball data to train a machine learning model. The main questions of interest are:

Can machine learning be used to accurately predict outcomes of various matchups in the 2022 March Madness tournament?

Which variables lead to the most accurate predictions?

Tools

Python
Pandas
NumPy
PostgreSQL
SQLAlchemy
scikit-learn

Data

Overview

The data used for this project was obtained from Kaggle.

The data included identifying information such as team names, a unique ID number for each team, team locations and conferences, and coaches. It also contained seeding information going back to the 1984-85 tournament, as well as detailed results from regular seasons, conference tournaments, and previous NCAA tournaments.

Exploratory Analysis

The first question we explored was whether tournament winners differ from other teams in their average game statistics. To answer it, the champion of the 2021 NCAA tournament was identified, and its statistics were averaged across all of its 2021 regular season games. The 2021 regular season statistics for all other teams were then pooled and averaged. The champion had higher average values for 9 of the statistics.
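The comparison described above can be sketched with pandas. This is a minimal illustration, not the project's actual code: the column names (`TeamID`, `FGM`, `Ast`, `TO`), the toy numbers, and the `champion_id` placeholder are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical per-game regular season rows: one row per team per game.
games = pd.DataFrame({
    "TeamID": [1, 1, 2, 2, 3, 3],
    "FGM":    [30, 28, 24, 26, 25, 23],   # field goals made
    "Ast":    [18, 16, 12, 14, 13, 11],   # assists
    "TO":     [10, 12, 11, 13, 12, 14],   # turnovers
})
champion_id = 1  # placeholder standing in for the champion's team ID

stat_cols = ["FGM", "Ast", "TO"]
is_champion = games["TeamID"] == champion_id

# Average the champion's games, then average every other team's games pooled together.
comparison = pd.DataFrame({
    "champion": games.loc[is_champion, stat_cols].mean(),
    "all_others": games.loc[~is_champion, stat_cols].mean(),
})
comparison["champion_higher"] = comparison["champion"] > comparison["all_others"]

# Count how many statistics are higher for the champion.
n_higher = int(comparison["champion_higher"].sum())
print(comparison)
print("Statistics where the champion is higher:", n_higher)
```

Note that a higher average is not always better (turnovers, for example), so a count like this is only a starting point for the comparison.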

In addition to comparing means, a heatmap was created to examine relationships between the different regular season statistics. The aim was to reveal any unexpected relationships beyond those the project team already assumed from general basketball knowledge.
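A heatmap of this kind is typically drawn from a correlation matrix. The sketch below assumes Pearson correlation and uses synthetic data with made-up column names (`FGA`, `FGM`, `Stl`); the source does not say which statistics or plotting library were actually used, so the `plot_heatmap` helper is purely illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for per-game statistics (assumed column names).
rng = np.random.default_rng(0)
n = 200
fga = rng.normal(58, 6, n)                 # field goal attempts
fgm = 0.45 * fga + rng.normal(0, 2, n)     # makes roughly track attempts
stl = rng.normal(7, 2, n)                  # steals, generated independently
stats = pd.DataFrame({"FGA": fga, "FGM": fgm, "Stl": stl})

# Pairwise Pearson correlations between the statistics.
corr = stats.corr()
print(corr.round(2))

def plot_heatmap(corr_matrix):
    """Draw the correlation matrix as a heatmap (requires matplotlib)."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    image = ax.imshow(corr_matrix.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(corr_matrix.columns)))
    ax.set_xticklabels(corr_matrix.columns)
    ax.set_yticks(range(len(corr_matrix.index)))
    ax.set_yticklabels(corr_matrix.index)
    fig.colorbar(image, ax=ax)
    return fig
```

Calling `plot_heatmap(corr)` returns a figure that can be shown or saved; strongly colored off-diagonal cells mark the pairs of statistics worth a closer look.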

Model

Overview

The model was trained on regular season game statistics. Because the way sports are played is constantly evolving, only data from 2012 onward was used. This kept a substantial amount of data available while ensuring that it remained relevant. The goal of the model was to determine whether "Team A" or "Team B" would win a given matchup based on the teams' previous statistics.
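One way to set up the season filter and the Team A / Team B labeling is sketched below. The column names (`Season`, `WTeamID`, `LTeamID`) and the random slot assignment are assumptions for illustration; the source does not describe how teams were assigned to the two slots. Randomizing the assignment keeps the winner from always landing in the same slot, which would make the label trivial.

```python
import numpy as np
import pandas as pd

# Hypothetical game results: one row per game, winner and loser IDs.
results = pd.DataFrame({
    "Season":  [2010, 2012, 2015, 2021],
    "WTeamID": [101, 102, 103, 104],
    "LTeamID": [201, 202, 203, 204],
})

# Keep only seasons from 2012 onward.
recent = results[results["Season"] >= 2012].reset_index(drop=True)

# Randomly decide, per game, whether the loser goes into the Team A slot.
rng = np.random.default_rng(42)
swap = rng.random(len(recent)) < 0.5

matchups = pd.DataFrame({
    "Season": recent["Season"],
    "TeamA_ID": np.where(swap, recent["LTeamID"], recent["WTeamID"]),
    "TeamB_ID": np.where(swap, recent["WTeamID"], recent["LTeamID"]),
    "TeamA_Won": np.where(swap, 0, 1),   # target label: 1 if Team A won
})
print(matchups)
```

Each team's previous statistics would then be joined onto `TeamA_ID` and `TeamB_ID` to form the feature columns.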

A random forest classifier was used for the model. The random forest algorithm trains many decision trees, each on a random bootstrap sample of the data and with a random subset of features considered at each split, and then combines their predictions by majority vote. Because the errors of individual trees tend to average out across the ensemble, a random forest is usually more accurate than a single decision tree. For the same reason, random forests are less prone to over-fitting than individual trees, although they do not eliminate it entirely.
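A minimal sketch of fitting such a classifier with scikit-learn follows. The data is synthetic and the hyperparameters (`n_estimators=100`, a 25% test split) are assumptions; the source does not state the settings used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for matchup features; the label depends on two of them plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 2] + rng.normal(scale=1.0, size=400) > 0).astype(int)

# Hold out a test set so accuracy is measured on unseen games.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each of the 100 trees is fit on a bootstrap sample of the training rows.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

test_accuracy = forest.score(X_test, y_test)
print(f"Trees in the ensemble: {len(forest.estimators_)}")
print(f"Held-out accuracy: {test_accuracy:.2f}")
```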

Performance

Based on the classification report (seen below), the model achieved an accuracy of 0.62. After the accuracy was evaluated, the 10 most important features were identified using mean decrease in impurity. The feature with the highest importance was the number of field goals made by Team B. The feature with the second-highest importance was the number of steals by Team A.
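The evaluation steps described here map onto standard scikit-learn calls: `classification_report` for the per-class summary and the fitted forest's `feature_importances_` attribute, which is impurity-based. The sketch below uses synthetic data with hypothetical feature names, so its numbers bear no relation to the project's reported results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Hypothetical feature names; only the two FGM columns carry signal here.
feature_names = ["TeamA_FGM", "TeamA_Stl", "TeamB_FGM", "TeamB_Stl", "noise"]
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(600, 5)), columns=feature_names)
y = (X["TeamA_FGM"] - X["TeamB_FGM"] + rng.normal(scale=0.5, size=600) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_train, y_train)

# Accuracy and the per-class precision/recall/F1 summary.
y_pred = forest.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(
    y_test, y_pred, target_names=["Team B wins", "Team A wins"]
)
print(report)

# Mean decrease in impurity, averaged over all trees; values sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
top_features = importances.sort_values(ascending=False).head(10)
print(top_features)
```

Impurity-based importances can overstate features with many distinct values, so permutation importance on held-out data is a common cross-check.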

Results

The table below displays matchups between teams in the 2022 March Madness tournament. Each row represents a game between two teams. For each game, the table lists the model's estimated probability that Team 1 wins, the team the model predicted to win, and the actual winner of that matchup in the tournament. The table can be filtered to show all games played by a selected team.
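The win probability and predicted winner columns could be produced from a fitted classifier roughly as follows. The team names, feature values, and 0.5 decision threshold are placeholders chosen for this sketch, not values from the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy training data: label 1 means the first team in the pair won.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(300, 2))
y_train = (X_train[:, 0] > X_train[:, 1]).astype(int)
forest = RandomForestClassifier(n_estimators=100, random_state=2).fit(X_train, y_train)

# Placeholder matchups with made-up team names and feature values.
games = pd.DataFrame({
    "Team 1": ["Team W", "Team Y"],
    "Team 2": ["Team X", "Team Z"],
    "Actual Winner": ["Team W", "Team Z"],
})
X_games = np.array([[1.5, -1.0], [-0.8, 1.2]])

# predict_proba returns one column per class; pick the column for class 1.
class_index = list(forest.classes_).index(1)
games["P(T1 Win)"] = forest.predict_proba(X_games)[:, class_index]

# Assumed rule: predict Team 1 when its win probability is at least 0.5.
games["Predicted Winner"] = np.where(
    games["P(T1 Win)"] >= 0.5, games["Team 1"], games["Team 2"]
)
print(games[["Team 1", "Team 2", "P(T1 Win)", "Predicted Winner", "Actual Winner"]])
```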

To use the table: Start typing in the box below "Enter Team". The table should automatically filter to show the games in which that team played.

Filter Search


Team 1 | Team 2 | P(T1 Win) | Predicted Winner | Actual Winner

    Connect


    Rick Cruz

    rickacruz6@gmail.com

    Javi Garcia

    l.javier.garcia86@gmail.com

    Ali Herington

    aherington01@gmail.com

    Chris Llewellyn

    cllewellyn1507@gmail.com