2022 NCAA Men’s March Madness

Banner image by Ben Hershey on Unsplash

I finished 13th out of 930 teams in Kaggle’s 2022 March Machine Learning Mania competition. Here is how I did it using Excel’s Solver add-in.

2022 was my second time participating in Kaggle’s March Machine Learning Mania competition, which challenges entrants to predict the outcome of the NCAA Men’s March Madness college basketball tournament. In 2021, I used Microsoft’s Machine Learning Studio and finished 493rd out of 707. Unfortunately, Microsoft is shutting down that service and no longer allows new datasets to be uploaded, so I had to pivot my plan for 2022. I learned of this issue with only a few days left before March Madness began, so in a rush I created a basic model using Solver in Excel. I also tried to create a quick neural network in R, but that submission did horribly.

Before I explain how I prepared my data and set up my Solver model, I want to give a shoutout to my MIS 546 professor Kevin Scheibe; to John W. Foreman’s book Data Smart, which gave me the idea of using Excel’s Solver add-in for this type of problem; and lastly to Ken Pomeroy for his basketball statistics at www.kenpom.com.

Data Preparation:
The data I used for this competition came directly from the Kaggle competition and KenPom.com. I used prior season and tournament results from 2015–2021 as my training dataset to predict the 2022 tournament results. In 2021, when I used Microsoft’s ML Studio, I ran a Pearson correlation feature selection process to pick the best features for my analysis. For this year’s model, I reused the features that process selected in 2021. Those features were:

  • Average Points For*

  • Average Points Against*

  • Number of Wins*

  • Number of Losses*

  • Tournament Seed*

  • AdjEM^

  • Luck^

  • Strength of Schedule (AdjEM)^

Data Sources:

* Kaggle Regular Season Data

^ KenPom Regular Season Data

For each of the 8 features, I calculated the delta between the two teams competing in each matchup (lower-ID team minus higher-ID team) to get a net value. This follows the Kaggle convention of predicting whether the team with the lower ID number wins. I did this for all of the 2015–2022 data at once in one file, so that when I then normalized the net values between 0 and 1 in Excel, the 2022 data was normalized on the same scale as the training data.
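
The delta-and-normalize step can be sketched in plain Python (a minimal recreation of the spreadsheet logic; the feature values below are made up for illustration):

```python
# Hypothetical per-matchup values of one feature (Average Points For)
# for the lower-ID team and the higher-ID team in each game.
points_for_low  = [78.2, 70.1, 74.0]
points_for_high = [72.5, 75.3, 69.8]

# Net value: lower-ID team minus higher-ID team, following the Kaggle
# convention of predicting whether the lower-ID team wins.
net = [lo - hi for lo, hi in zip(points_for_low, points_for_high)]

# Min-max normalize to 0-1 across ALL seasons at once, so the 2022 rows
# share the same scale as the 2015-2021 training rows.
net_min, net_max = min(net), max(net)
net_normalized = [(v - net_min) / (net_max - net_min) for v in net]
```

In the actual workbook the same two steps were done once per feature across every 2015–2022 row, which is what keeps the prediction-time inputs on the training-time scale.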

Model Setup:
Once I had my data prepared, I set up an Excel sheet with a row at the top for my model coefficients, along with a SUMPRODUCT formula to calculate my prediction value in column K. I then added a column that scored my prediction against the actual outcome of each historic matchup. I played around with different thresholds, but a 50% threshold did the best: I scored a matchup correct with a 1 if my prediction was greater than 0.5 and the team won, or if the prediction was less than 0.5 and the team lost. Otherwise I scored the matchup wrong with a 0.
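
That scoring rule translates directly to a few lines of Python (a sketch of the spreadsheet column, not the actual workbook; the sample games are hypothetical):

```python
def score_game(prediction, team_won):
    """Return 1 if the 0.5-threshold call matched the outcome, else 0."""
    if prediction > 0.5 and team_won:
        return 1
    if prediction < 0.5 and not team_won:
        return 1
    return 0

# Accuracy is just the average of the 1s and 0s, as in the sheet.
games = [(0.8, True), (0.3, False), (0.6, False), (0.2, False)]
accuracy = sum(score_game(p, won) for p, won in games) / len(games)
```

Note that a prediction of exactly 0.5 scores 0 either way, matching the "otherwise wrong" rule above.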

To set up my Solver model, I set the objective to maximize the % correct (the average of the 1s and 0s) by changing my coefficients, subject to each coefficient being constrained between -1 and 1.
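
The same constrained search can be sketched outside Excel. The snippet below uses a crude random search as a stand-in for Solver’s Evolutionary engine, with toy training rows (all values hypothetical, not the real dataset):

```python
import random

random.seed(0)

# Toy training data: 8 normalized net features per game, plus whether
# the lower-ID team won (1) or lost (0).
rows = [
    ([0.9, 0.2, 0.8, 0.1, 0.3, 0.7, 0.5, 0.6], 1),
    ([0.1, 0.8, 0.2, 0.9, 0.7, 0.2, 0.4, 0.3], 0),
    ([0.7, 0.3, 0.6, 0.2, 0.4, 0.8, 0.6, 0.5], 1),
    ([0.2, 0.9, 0.3, 0.8, 0.6, 0.1, 0.3, 0.4], 0),
]

def accuracy(coefs):
    correct = 0
    for feats, won in rows:
        # SUMPRODUCT of coefficients and features, as in the spreadsheet.
        pred = sum(c * f for c, f in zip(coefs, feats))
        if (pred > 0.5 and won) or (pred < 0.5 and not won):
            correct += 1
    return correct / len(rows)

# Stand-in for Solver's Evolutionary method: sample coefficient vectors
# constrained to [-1, 1] and keep the most accurate one seen.
best_coefs, best_acc = None, -1.0
for _ in range(5000):
    candidate = [random.uniform(-1, 1) for _ in range(8)]
    acc = accuracy(candidate)
    if acc > best_acc:
        best_coefs, best_acc = candidate, acc
```

Solver’s Evolutionary engine is considerably smarter than pure random sampling (it mutates and recombines promising candidates), but the objective, decision variables, and bounds are the same as described above.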

After running this model, I had my feature coefficients and a maximum of 75.3% correct over the 384 games that occurred between 2015 and 2021. I was then able to use these coefficients to calculate predictions for every possible matchup in the 2022 tournament. That is 2,278 possible matchups, although only 63 actually occur in the tournament each year.
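
The 2,278 figure is simply the number of unordered pairs among the 68 teams in the field (including the First Four):

```python
from math import comb

# Every possible matchup is an unordered pair of the 68 tournament teams.
possible_matchups = comb(68, 2)   # 68 * 67 / 2 = 2278
```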

The Kaggle competition is scored with an average log loss formula, which means you are punished severely for being confidently wrong. I took my chances and rounded any prediction greater than 0.95 up to 1, and any prediction less than 0.05 down to 0, in hopes of reducing my average error on a correct answer and getting a leg up on someone who was less certain a team would win.
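
The asymmetry of that bet is easy to see in a quick sketch (a minimal recreation of log loss, not Kaggle’s scoring code; the 0.95/0.05 thresholds are the ones described above):

```python
from math import log

def log_loss(prediction, won, eps=1e-15):
    # Clamp predictions away from 0 and 1 so log() stays finite,
    # mirroring how log-loss scorers typically handle hard 0/1 answers.
    p = min(max(prediction, eps), 1 - eps)
    return -(log(p) if won else log(1 - p))

def round_confident(p, lo=0.05, hi=0.95):
    # Push very confident predictions to the extremes.
    if p > hi:
        return 1.0
    if p < lo:
        return 0.0
    return p

# A correct 0.95 costs about 0.051; the same prediction when wrong
# costs about 3.0, and a wrong hard 1.0 costs roughly 34.5.
```

Rounding a 0.96 to 1.0 shaves a correct game’s loss from ~0.04 to ~0, but turns a wrong game from a ~3.2 penalty into a ~34.5 one, which is why this only pays off when you are right.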

Results:

Overall, I was really excited throughout the tournament with how well I was doing. I climbed all the way to 7th place at one point, and I was pleased to finish the competition in 13th and earn my first Kaggle silver medal. Here is how my score changed throughout the competition. I had a rough round 2 of the tournament and did not predict many of the upsets that occurred.

Round 1:
After 16 Games: 0.53472
After 32 Games: 0.48149

Round 2:
After 40 Games: 0.53296
After 48 Games: 0.55615

Sweet 16:
After 52 Games: 0.56626
After 56 Games: 0.59017

Elite 8:
After 58 Games: 0.58637
After 60 Games: 0.58205

Final 4:
After 62 Games: 0.58663

Championship (Final Score):
After 63 Games: 0.58579