*Picture of Austin Love courtesy of @DiamondHeels Twitter*


Often seen as a daunting topic, machine learning has broken ground in the baseball world with numerous applications. One way to approach it is to think of your computer as a brain: you give it a thought process (an algorithm) and experiences (training data) that it uses to make future predictions. Here, we use three different "thought processes" to better predict pitch results. Working from a 41,000-pitch sample from the 2021 season, we split the data into separate brackets to build the models: 80% of the data went into training, while the other 20% was held out for testing. The goal was to take three machine learning methods (random forest, Bayes, and support vector machine) and 13 variables as key predictors of the pitch result. Our attributes included:

- Pitch Type
- Release Speed
- Spin Rate
- Induced Vertical Break
- Horizontal Break
- Vertical Approach Angle
- Release Height
- Release Side
- Vertical Release Angle
- Extension
- Plate Location Side
- Plate Location Height
- Count
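The 80/20 split described above can be sketched in a few lines. The original work was done in R; this is an illustrative Python version, with a dummy list standing in for the pitch-level Trackman rows:

```python
import random

def train_test_split(rows, train_frac=0.8, seed=42):
    """Shuffle the pitch-level rows and split them 80/20 into
    training and testing sets, as described above."""
    rng = random.Random(seed)
    shuffled = rows[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Dummy stand-in for the ~41,000-pitch sample
pitches = list(range(41000))
train, test = train_test_split(pitches)
```

Shuffling before the cut matters: pitch data arrives in game order, and a straight 80/20 slice would put entire late-season games in the test set.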

To establish a few project goals and questions we hope to answer: What affects the outcome of each pitch, and how can we eventually quantify those effects? Does human error (tagging mistakes, missed calls by umpires, etc.) play a role in the model, and can we adjust accordingly? Do certain pitchers or hitters show a significant difference between their predicted and actual stat lines, and to what extent? And finally, can we predict future outcomes of pitches, or pull out the outliers that represent some of the greatest improbabilities we see in baseball?

__Variable Selection__

To start this project, we had to determine which variables belonged in the model. We first built a sample logistic regression model using all pitching/game variables from Trackman to predict whether a pitch would result in a swing. We then used backwards stepwise regression to remove variables until every remaining variable was statistically significant, which left the 13 variables listed above.
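The elimination loop itself is simple to sketch. The version below is illustrative Python (the original selection was done in R): `fit_pvalues` is a hypothetical stand-in for refitting the logistic regression and reading off p-values, and the toy dictionary of fixed p-values is invented for the example — in a real run the p-values change at every refit.

```python
def backward_stepwise(variables, fit_pvalues, alpha=0.05):
    """Refit, drop the least-significant variable, and repeat until every
    remaining variable clears the significance threshold."""
    kept = list(variables)
    while kept:
        pvals = fit_pvalues(kept)                    # {variable: p-value}
        worst = max(kept, key=lambda v: pvals[v])
        if pvals[worst] <= alpha:
            break          # everything left is statistically significant
        kept.remove(worst)
    return kept

# Toy illustration: fixed (made-up) p-values stand in for refitting the model
fake_p = {"SpinRate": 0.30, "Count": 0.01, "Extension": 0.08, "PlateLocSide": 0.002}
selected = backward_stepwise(fake_p, lambda kept: {v: fake_p[v] for v in kept})
```

In this toy run, the two variables above the 0.05 threshold get dropped one at a time, worst first.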

After this, we used the Boruta function in R to measure each variable's importance to the model and to see which ones have the most influence on the prediction process.

Looking at the graph above, we can see the five most important variables are:

- Plate Location (side)
- Plate Location (height)
- Vertical Release Angle
- Count
- Vertical Approach Angle

While pitch location plays an important part in the prediction, it's interesting to see that the release point and the pitch's angle of entry into the zone also matter a great deal. It is also notable that pitch movement was not among the most important variables in the model. Given this, the importance of pitch location looks worth a closer look in a later blog post.

With that being said, let’s dive into the intricacies of each model…

__Random Forest Model__

A random forest model builds multiple decision trees to create a prediction from the variables we're testing. It handles both regression and classification problems and is one of the most commonly used machine learning algorithms. To build the model, the algorithm fits many decision trees on random resamples of the data it's given (the reason it's called a random forest: multiple "trees" make a forest). After the model is built, it aggregates the trees' votes to make predictions on unseen data.
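As a minimal illustration of that bootstrap-and-vote idea — not the actual model, which was fit in R on all 13 variables — here is a tiny "forest" of depth-1 trees on an invented toy dataset:

```python
import random
from collections import Counter

def fit_stump(rows):
    """Depth-1 'tree': pick the (feature, threshold) split that classifies
    the most rows in this sample correctly."""
    best = None
    for f in range(len(rows[0][0])):
        for x0, _ in rows:
            t = x0[f]
            left  = [lab for x, lab in rows if x[f] <= t]
            right = [lab for x, lab in rows if x[f] >  t]
            score = max(Counter(left).values())
            if right:
                score += max(Counter(right).values())
            if best is None or score > best[0]:
                l_lab = Counter(left).most_common(1)[0][0]
                r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
                best = (score, f, t, l_lab, r_lab)
    _, f, t, l_lab, r_lab = best
    return lambda x: l_lab if x[f] <= t else r_lab

def fit_forest(rows, n_trees=25, seed=0):
    """Each tree is trained on its own bootstrap resample of the data."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def predict(forest, x):
    """The forest's prediction is a majority vote across its trees."""
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]

# Toy data: one feature, swings happen on the higher values
data = [((0.1,), "Ball"), ((0.2,), "Ball"), ((0.3,), "Ball"),
        ((0.7,), "Swing"), ((0.8,), "Swing"), ((0.9,), "Swing")]
forest = fit_forest(data)
```

Real random forests grow deep trees and also randomize which features each split considers; the bootstrap resampling and majority vote shown here are the core of the idea.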

| Prediction | BallCalled | Swing | StrikeCalled |
| --- | --- | --- | --- |
| BallCalled | 2561 | 635 | 273 |
| Swing | 451 | 2828 | 694 |
| StrikeCalled | 99 | 270 | 496 |

*Rows are the model's predictions; columns are the actual outcomes in the test set.*

__Overall Statistics__

| Statistic | Value |
| --- | --- |
| Accuracy | 0.7084 |
| 95% CI | (0.6985, 0.7182) |
| No Information Rate | 0.4494 |
| P-Value | < 2.2e-16 |
| Kappa | 0.5223 |
| McNemar's Test P-Value | < 2.2e-16 |
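The headline numbers can be recomputed directly from the confusion matrix: accuracy is the diagonal share, and Cohen's kappa compares that observed agreement with the agreement expected by chance. A quick Python check (the table itself was produced in R):

```python
# Random forest confusion matrix from above
# (rows = predicted class, columns = actual class;
#  order: BallCalled, Swing, StrikeCalled)
cm = [[2561,  635, 273],
      [ 451, 2828, 694],
      [  99,  270, 496]]

n = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / n

# Chance agreement comes from the row/column marginals
row_tot = [sum(row) for row in cm]
col_tot = [sum(cm[i][j] for i in range(3)) for j in range(3)]
p_chance = sum(row_tot[k] * col_tot[k] for k in range(3)) / n ** 2
kappa = (accuracy - p_chance) / (1 - p_chance)

print(round(accuracy, 4), round(kappa, 4))   # 0.7084 0.5223
```

The same arithmetic reproduces the accuracy and kappa reported for the other two models below.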

__Bayes__

Based on Bayes' Theorem, the Bayes model describes the probability of an outcome given the values of the variables we are testing for. It assumes that each variable is completely independent of every other variable (making this a naive Bayes classifier). Comparing it with the random forest, it correctly identifies a larger share of the actual swings in the test set (80.7% vs. 75.8%).

| Prediction | BallCalled | Swing | StrikeCalled |
| --- | --- | --- | --- |
| BallCalled | 2079 | 570 | 186 |
| Swing | 963 | 3013 | 988 |
| StrikeCalled | 69 | 150 | 289 |

__Overall Statistics__

| Statistic | Value |
| --- | --- |
| Accuracy | 0.6478 |
| 95% CI | (0.6374, 0.6580) |
| No Information Rate | 0.4494 |
| P-Value | < 2.2e-16 |
| Kappa | 0.4059 |
| McNemar's Test P-Value | < 2.2e-16 |
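The swing comparison quoted above can be checked from the two matrices: each model's correctly predicted swings divided by the 3,733 actual swings in the test set.

```python
# Correct swing predictions over actual swings (the Swing column of
# each confusion matrix above; both test sets total 3,733 swings)
rf_swing_rate    = 2828 / (635 + 2828 + 270)
bayes_swing_rate = 3013 / (570 + 3013 + 150)

print(round(rf_swing_rate, 3), round(bayes_swing_rate, 3))   # 0.758 0.807
```

Bayes buys that extra swing coverage at a cost, though: its Swing row shows it also predicted far more swings that weren't (963 balls and 988 called strikes), which is why its overall accuracy is the lowest of the three.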

__Support Vector Machine (SVM)__

The support vector machine (SVM) model uses hyperplanes to divide the training data into groups for classification across all variables. Here, the algorithm uses the three best hyperplanes on the 11 numeric variables to classify the new, unseen data. Looking at this model, we can see that it almost never predicted a called strike (93 of 8,307 pitches), instead predicting swings on 1,181 pitches that were actually called strikes. This could be down to the human element of baseball: whether a swing happens depends on the batter's decision as well as pitch location (even a hittable strike is only a swing if the batter chooses to offer at it).

| Prediction | BallCalled | Swing | StrikeCalled |
| --- | --- | --- | --- |
| BallCalled | 2488 | 581 | 232 |
| Swing | 613 | 3119 | 1181 |
| StrikeCalled | 10 | 33 | 50 |

__Overall Statistics__

| Statistic | Value |
| --- | --- |
| Accuracy | 0.6810 |
| 95% CI | (0.6708, 0.6910) |
| No Information Rate | 0.4494 |
| P-Value | < 2.2e-16 |
| Kappa | 0.4532 |
| McNemar's Test P-Value | < 2.2e-16 |

The benefit of using three different models with varying strengths and weaknesses is that we can employ a voting method between them to further enhance the final prediction of the pitch outcome. After testing all three, the random forest proved the most accurate, so it serves as the default vote, overridden only in instances where the Bayes and support vector machine models agree with one another. A called strike is harder to predict than any other outcome because of the added variables of a hitter swinging at the pitch, an umpire's mistake, or a mistake by the tagger. That is all the more reason a voting method between our three models will further increase accuracy and best fit the goals we're looking to accomplish.
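A minimal sketch of that voting rule: Bayes and SVM together can outvote the random forest two-to-one; otherwise the random forest's prediction stands.

```python
def vote(rf_pred, bayes_pred, svm_pred):
    """Ensemble rule described above: the random forest is the default,
    overruled only when the Bayes and SVM predictions agree."""
    if bayes_pred == svm_pred:
        return bayes_pred
    return rf_pred

# Bayes + SVM agreement overrules the forest...
print(vote("StrikeCalled", "Swing", "Swing"))       # Swing
# ...otherwise the forest's pick stands
print(vote("BallCalled", "Swing", "StrikeCalled"))  # BallCalled
```

Note that when all three models agree, the rule trivially returns the shared prediction, so the random forest is only ever overruled on genuine two-against-one splits.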

The accuracy percentages read as such:

Random Forest Model: __70.84%__

Bayes Model: __64.78%__

SVM Model: __68.10%__

__Application__

As we lean toward the topic of the next installment of this series, we can look at one player specifically to establish firmer ground for where the model is going to have an abundance of use. North Carolina redshirt sophomore RHP Austin Love had a successful '21 campaign, posting a 3.71 ERA and an 11.4 K/9 over 16 starts. After running our model on all pitches thrown by Love in 2021, we saw an overall accuracy of 82%. The prediction accuracy reads as such:

Predicted Ball Called Accuracy: __85.4%__

Predicted Strike Called Accuracy: __70.5%__

Predicted Swing Accuracy: __82.6%__

| Expected Metric | Actual Metric | Difference |
| --- | --- | --- |
| xSwing%: 52.9% | Swing%: 50.5% | 2.4% |
| xO-Swing%: 38.5% | O-Swing%: 39.2% | (0.7%) |
| xZ-Swing%: 75.9% | Z-Swing%: 68.3% | 7.6% |
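The difference column follows directly from the expected and actual rates (a negative difference, shown in parentheses, means Love outperformed the expectation):

```python
# Expected vs. actual swing rates for Austin Love, 2021 (percent)
metrics = {
    "Swing%":   (52.9, 50.5),
    "O-Swing%": (38.5, 39.2),
    "Z-Swing%": (75.9, 68.3),
}
diffs = {name: round(exp - act, 1) for name, (exp, act) in metrics.items()}
```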

Based on this information, the model expected Love to generate more swings overall, with nearly all of that gap coming in the zone: it predicted a swing rate 2.4 ticks higher than what he actually got. A mixture of predictive sequencing, bad luck, and flat-out deceiving hitters could all be factors that led to a lower Z-Swing% than expected. His stuff has the ability to punch a lot of tickets, as we can see from his fastball metrics: 17.7 inches of induced vertical break, an average extension of nearly six and a half feet, and a vertical approach angle averaging -5.4 degrees (a number that will flatten out and project even better when we filter out pitches not catching the upper third of the zone). The model shows the location of all expected swings in the zone (*see chart below*), with a majority of fastballs (in red) catching that upper portion.

All in all, our model sets the foundation for a plethora of projects that we can undertake with machine learning modeling. In future work, we plan to look at which pitchers are due for regression or massive improvements based on the comparison between expected and actual results!