MLB Baseball Regression Analysis
Note: This analysis was completed as part of a course project at Iowa State University.
Introduction: The purpose of this analysis is to examine what it takes to make a baseball team successful. Success for this analysis is measured by the number of wins. This analysis was completed using historical player and team statistics from Lahman’s Baseball Database to predict wins using regression modeling techniques. Key recommendations are provided based on model results for team management.
Data Wrangling: A large portion of the time for this analysis was spent on data wrangling, data preparation, and topic research. I began by downloading the 2020 release of Lahman’s Baseball Database. I started with the comma-delimited version, which provides 28 csv files and a readme data dictionary covering seasons through 2020. Early on, I spent much of my time opening each csv file, comparing it to the data dictionary, and researching what some of the fields mean in the game of baseball. Once I had a basic understanding of the data available, I decided that most of my data would come from the Teams table, but I also found a few player-based statistics that I wanted to summarize at the team level and add to the Teams table.
Feature Engineering/Creation: I loaded the database version provided by Lahman into SQL to summarize some player-level statistics by year and team so I could add them to the Teams table. I used SQL to summarize the following player features by year/team: Average Player Age, Average Player Salary for Team, Total Team Salary, Average MLB Salary for each year, and Total MLB Salary for each year. Player age is a frequent topic of conversation in professional sports, so I wanted to see whether any of my models assigned the age-related features high importance. For the salary features, I wanted to test whether the “New York Bankees” stigma also showed up in feature selection by asking, “Do the teams and players that have more money win more?”
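The summaries above were done in SQL; as a rough illustration, the same year/team roll-up can be sketched in base R. The `players` data frame, its column names, and all of its values are made-up assumptions for this example and are not taken from Lahman’s tables.

```r
# Toy player-level data; names and values are illustrative assumptions only.
players <- data.frame(
  yearID = c(2019, 2019, 2019, 2019, 2020, 2020),
  teamID = c("NYA", "NYA", "BOS", "BOS", "NYA", "BOS"),
  age    = c(28, 32, 25, 31, 29, 27),
  salary = c(10, 20, 5, 15, 12, 8)  # made-up figures, in millions
)

# Team-level features: one row per year/team.
team_avg <- aggregate(cbind(age, salary) ~ yearID + teamID, data = players, FUN = mean)
names(team_avg)[3:4] <- c("AvgPlayerAge", "AvgPlayerSalary")
team_tot <- aggregate(salary ~ yearID + teamID, data = players, FUN = sum)
names(team_tot)[3] <- "TotalTeamSalary"

# League-level features: one row per year.
league_avg <- aggregate(salary ~ yearID, data = players, FUN = mean)
names(league_avg)[2] <- "AvgMLBSalary"
league_tot <- aggregate(salary ~ yearID, data = players, FUN = sum)
names(league_tot)[2] <- "TotalMLBSalary"

# Combine everything into one year/team feature table.
features <- merge(team_avg, team_tot, by = c("yearID", "teamID"))
features <- merge(features, league_avg, by = "yearID")
features <- merge(features, league_tot, by = "yearID")
features
```

A table like `features` could then be joined to a team-level table with `merge()` on the year and team keys, assuming both tables share those columns.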
Models: For the analysis portion, I developed several regression models in R, first manually and then with a more automated method. My first model was a simple one built from the 10 features I found to be most correlated with wins. Here is a table showing those correlation values:
Feature | Correlation to # of Wins |
---|---|
HitsbyBatters | 0.74 |
RunsBattedIn | 0.72 |
OutsPitched | 0.71 |
AtBats | 0.71 |
RunsScored | 0.70 |
Rank | -0.70 |
GamesPlayed | 0.70 |
Walksbybatters | 0.67 |
Doubles | 0.61 |
FieldingPercentage | 0.61 |
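A ranking like the one in the table can be sketched as follows. The table orders features by the absolute value of the correlation (Rank is negative but still listed), so the sketch does the same. The `toy` data frame is simulated and only borrows a few column names for readability; it does not reproduce the values above.

```r
set.seed(1)
n <- 200
# Simulated stand-in for the team table; not Lahman's data.
toy <- data.frame(RunsScored = rnorm(n), Doubles = rnorm(n), Noise = rnorm(n))
toy$Wins <- 2 * toy$RunsScored + 0.5 * toy$Doubles + rnorm(n)

# Correlation of every numeric column with Wins.
num_cols <- sapply(toy, is.numeric)
cors <- cor(toy[, num_cols], use = "pairwise.complete.obs")[, "Wins"]
cors <- cors[names(cors) != "Wins"]

# Rank by absolute correlation and keep the strongest features.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
top_k <- head(ranked, 2)  # the write-up keeps the top 10
round(top_k, 2)
```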
The first model performed well for a first attempt, with an adjusted R2 of 0.904, and all but one of the features were significant. When I checked the model for multicollinearity, though, I found several high VIF (variance inflation factor) values. A VIF above 10 can suggest multicollinearity in a model. I re-ran the model multiple times, each time removing the feature with the highest VIF, until all VIF values were below 10. This gave me a model with 5 features and an adjusted R2 of 0.877, in which all features were significant. The 5 features were Doubles, Walks by Batters, Games Played, Rank (negative), and Runs Scored.
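The remove-the-highest-VIF loop described above was run by hand; a minimal automated sketch is shown here. To stay free of package dependencies it computes VIF in base R as the diagonal of the inverse correlation matrix of the predictors, which for purely numeric predictors should agree with `vif()` from car. The helper names `vif_values` and `drop_high_vif` and the simulated data are assumptions made for illustration.

```r
# VIF_j = 1 / (1 - R^2_j), obtained here from the inverse correlation matrix.
vif_values <- function(data, predictors) {
  v <- diag(solve(cor(data[, predictors, drop = FALSE])))
  names(v) <- predictors
  v
}

# Repeatedly drop the predictor with the largest VIF until all are below the threshold.
drop_high_vif <- function(data, predictors, threshold = 10) {
  repeat {
    if (length(predictors) < 2) break
    v <- vif_values(data, predictors)
    if (max(v) < threshold) break
    predictors <- setdiff(predictors, names(which.max(v)))
  }
  predictors
}

set.seed(42)
n <- 300
toy <- data.frame(AtBats = rnorm(n))
toy$OutsPitched <- toy$AtBats + rnorm(n, sd = 0.05)  # nearly collinear with AtBats
toy$Doubles <- rnorm(n)
toy$Wins <- toy$AtBats + toy$Doubles + rnorm(n)

kept <- drop_high_vif(toy, c("AtBats", "OutsPitched", "Doubles"))
final <- lm(reformulate(kept, response = "Wins"), data = toy)
vif_values(toy, kept)
```

Dropping one feature at a time matters because removing a single collinear variable usually lowers the VIF of its partners, so the values need to be recomputed after each removal.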
For my second analytical method, I took a more automated approach using BIC (Bayesian information criterion) for model selection. The regsubsets function from the leaps package in R searches over models with different subsets of variables, and I used BIC to compare the candidates, where a lower BIC is better. This process settled on a model with 9 features and an adjusted R2 of 0.957, in which all the features were significant. Once again, I checked the VIF values for multicollinearity and found some high values, so I removed the feature with the highest VIF one at a time until all VIF values were below 10. The result was a model with 6 features and an adjusted R2 of 0.956, in which all features were significant. The 6 features were Outs Pitched, Saves, Shutouts, Opponents Runs Scored (negative), Rank (negative), and Runs Scored.
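To make the idea of BIC-based subset selection concrete, here is a small base-R sketch that fits every subset of a handful of predictors and keeps the one with the lowest BIC. It is a stand-in for illustration, not the regsubsets call used in this analysis, and it only scales to a few candidate variables. The simulated data and column names are assumptions.

```r
set.seed(7)
n <- 250
# Simulated data; Noise has no real relationship with Wins.
toy <- data.frame(RunsScored = rnorm(n), OppRuns = rnorm(n),
                  Saves = rnorm(n), Noise = rnorm(n))
toy$Wins <- 1.5 * toy$RunsScored - 1.5 * toy$OppRuns +
  0.5 * toy$Saves + rnorm(n, sd = 0.5)

candidates <- setdiff(names(toy), "Wins")

# Enumerate every non-empty subset of the candidate predictors.
subsets <- unlist(lapply(seq_along(candidates), function(k) {
  combn(candidates, k, simplify = FALSE)
}), recursive = FALSE)

# Fit each subset and record its BIC (lower is better).
bic_values <- sapply(subsets, function(vars) {
  BIC(lm(reformulate(vars, response = "Wins"), data = toy))
})

best_vars <- subsets[[which.min(bic_values)]]
best_fit <- lm(reformulate(best_vars, response = "Wins"), data = toy)
best_vars
summary(best_fit)$adj.r.squared
```

With leaps, the per-size BIC values are available from `summary()` of the regsubsets object in its `bic` component, and the largest subset size considered is controlled by the `nvmax` argument.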
Here is a data table of the model results:
Model Statistics | Top 10 Correlated Features | Top 10 Reduced to 5 (VIF Reduction) | BIC Model | BIC Model Reduced (VIF Reduction) |
---|---|---|---|---|
Number of Features | 10 | 5 | 9 | 6 |
Error df | 0.304 | 0.349 | 0.205 | 0.211 |
Adjusted R2 | 0.904 | 0.877 | 0.957 | 0.956 |
RMSE | 0.303 | 0.349 | 0.204 | 0.211 |
Largest VIF | 303.84 (Outs Pitched) | 3.14 (Games Played) | 151.93 (At Bats) | 5.37 (Opp. Runs Scored) |
P-Value for F-Test | < 2.2e-16 | < 2.2e-16 | < 2.2e-16 | < 2.2e-16 |
Largest P-Value (t-test) | 0.596 (Intercept) | 0.0192 (Walks by Batters) | 0.66 (Intercept) | 1 (Intercept) |
Here is a data table showing the selected features by model:
Model Features | Top 10 Correlated Features | Top 10 Reduced to 5 (VIF Reduction) | BIC Model | BIC Model Reduced (VIF Reduction) |
---|---|---|---|---|
Hits by Batters | X | | X | |
Runs Batted In | X | | X | |
Outs Pitched | X | | X | X |
At Bats | X | | X | |
Runs Scored | X | X | X | X |
Rank | X | X | X | X |
Games Played | X | X | | |
Walks by Batters | X | X | | |
Doubles | X | X | | |
Fielding Percentage | X | | | |
Opponents Runs Scored | | | X | X |
Shutouts | | | X | X |
Saves | | | X | X |
Recommendations: Ultimately, the model that performed best for this analysis was the BIC model after VIF reduction to limit multicollinearity. Although its R2 and RMSE were slightly worse than those of the full BIC model, I believe the reduction in features and improvement in VIF are worth the slight decrease in R2 and slight increase in RMSE. I also believe this model provides more actionable features for a manager to focus on. Four of the six features are not obvious, while the other two (rank and runs scored) are. I would recommend that a manager focus more on defensive activities such as pitching more outs, pitching shutouts, earning saves, and reducing the number of runs scored by opponents. Practically, I interpret this as follows: if you can keep the other team from scoring with a good defense and a good pitcher, you only need to score a little to win the game. Put another way, the team that makes the fewest mistakes will win. Based on these features, I think a team needs a good set of pitchers and fielders for the various game situations to win in baseball.
Future Analysis: After looking into the data, I would consider two things for future analysis:
Win Percentage: I believe win percentage is a better target than wins. The assignment asked for wins, so I focused on that, but the dataset goes back to 1871, and in that year the best team had 21 wins. Today, if a team had 21 wins, the manager would be fired halfway through the season. This difference showed up again in the 2020 data, when the season was shortened due to COVID-19. In 2019, the team with the most wins had 107 wins, while in 2020, the team with the most wins had 43.
Reduce Historic Data: A dataset with this much historical information made me think about how baseball has changed since 1871. If I were to continue analyzing this data, I would explore some models that look only at a smaller time window, such as the last 20-30 years. It would also be interesting to see how baseball changed before, during, and after the steroid era.
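Both ideas are simple to express in R, as the sketch below suggests. The win totals come from the text above; the games-played counts (162 for 2019 and 60 for 2020) are the scheduled season lengths and are an assumption added for illustration, as are the column names.

```r
# Wins are from the write-up; GamesPlayed values are assumed season lengths.
seasons <- data.frame(
  yearID      = c(2019, 2020),
  Wins        = c(107, 43),
  GamesPlayed = c(162, 60)
)

# Win percentage puts seasons of different lengths on the same scale.
seasons$WinPct <- seasons$Wins / seasons$GamesPlayed
round(seasons$WinPct, 3)

# Restrict to a recent window, here the last 30 seasons present in the data.
recent <- subset(seasons, yearID >= max(yearID) - 29)
nrow(recent)
```

Under these assumptions the 43-win team from 2020 has a higher win percentage than the 107-win team from 2019, which is the kind of comparison that raw win counts hide.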
R Code:
library(car) # Functions used: vif
library(leaps) # Functions used: regsubsets
library(MASS)
baseball <- read.csv("F:/Baseball/MasterDataset.csv")
summary(baseball)
colnames(baseball)
#Top 10 Correlated Features
top10 <- lm(Wins ~ HitsbyBatters+RunsBattedIn+OutsPitched+AtBats+RunsScored+Rank+GamesPlayed+Walksbybatters+Doubles+FieldingPercentage, data=baseball)
summary(top10)
sqrt(mean(top10$residuals^2))
vif(top10)
#Top 10 Model - VIF Reduction Model
top5 <- lm(Wins ~ RunsScored+Rank+GamesPlayed+Walksbybatters+Doubles, data=baseball)
summary(top5)
sqrt(mean(top5$residuals^2))
vif(top5)
#BIC Model
bic_model <- regsubsets(Wins ~ ., data = baseball, nbest = 1)
plot(bic_model, scale = "bic", main = "Model Selection using BIC Criterion")
plot(bic_model, scale = "r2", main = "Model Selection using R-squared")
summary(bic_model)
bic_results <- lm(Wins ~ HitsbyBatters+OutsPitched+AtBats+RunsScored+Rank+Opponentsrunsscored+Shutouts+Saves+RunsBattedIn, data=baseball)
summary(bic_results)
sqrt(mean(bic_results$residuals^2))
vif(bic_results)
#BIC Model - VIF Reduction
bic_final <- lm(Wins ~ OutsPitched+RunsScored+Rank+Opponentsrunsscored+Shutouts+Saves, data=baseball)
summary(bic_final) # summarize the reduced model, not the full BIC model
sqrt(mean(bic_final$residuals^2))
vif(bic_final)