Written By: Luke Smailes

Prior to the 2022 season, we released our first iteration of a win probability model for our NCAA
baseball and softball clients within the 6-4-3 Charts interface. A win probability model estimates the
likelihood that each team wins the game given the current game situation, with adjustments to account
for the quality of the two teams playing. After addressing the limitations of that first version, we’re
excited to be releasing an updated V2 within our interface and on both D1Baseball.com and D1Softball.com.

Example Win Probability Graphic in 643 Interface

The basic framework of the win probability model remains unchanged. That framework leverages the
9.6 million baseball plays and 5.8 million softball plays in the 643 database and, for any given play, looks
at the outcomes of all previous games in which teams were in the exact same situation. Each situation
is defined by the current base/out state, inning, and score differential. For example, when the home
team is trailing by one run with two outs in the bottom of the third inning and a runner on second base,
we can look at all other situations since the start of 2017 and see how often the home and road teams
each won those games.
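
As a rough illustration of that lookup (a minimal sketch, not our production code), the naïve model can be thought of as a tally keyed by situation. The tuple layout, the function names, and the toy data below are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical situation key: (inning, half, outs, bases, home_score_diff).
# "bases" is a string such as "_2_" for a runner on second only.

def build_naive_table(plays):
    """Tally home-team wins for every situation seen in historical plays."""
    counts = defaultdict(lambda: [0, 0])  # situation -> [home_wins, total]
    for situation, home_won in plays:
        counts[situation][0] += int(home_won)
        counts[situation][1] += 1
    return counts

def naive_home_win_prob(table, situation):
    """Share of matching historical situations in which the home team won."""
    home_wins, total = table.get(situation, (0, 0))
    if total == 0:
        return None  # no history for this exact situation
    return home_wins / total

# Toy data: home team trails by one, two outs, bottom of the third, runner on second.
situation = (3, "bottom", 2, "_2_", -1)
plays = [(situation, True), (situation, False), (situation, False),
         (situation, True), (situation, False)]
table = build_naive_table(plays)
print(naive_home_win_prob(table, situation))  # 0.4
```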

These factors create what we call the naïve model, and it tells us, for example, that home teams in
college baseball inherently have approximately a 9.6% advantage over visiting teams since the start of
the 2017 season, while that value is about 7% in college softball.

V1, only a slight improvement over the naïve model, calculated RPI at the conference level and applied
that adjustment over the course of the game, decreasing at a constant rate. This logic is founded on the
idea that the farther into a game two teams get, the less the outcome of that game depends on what
each team has done in the past and the more it depends on the context of that given game. For example, an
SEC team playing at home against a SWAC school would likely be heavily favored from the outset.
However, if that SWAC school leads the SEC favorite by five runs with three outs to go, any win
probability adjustment should be diluted because the favorite is quickly running out of outs with which
to mount a comeback. Its margin to flex any talent advantage has become extremely slim in this
scenario.

V1 worked at a basic level (it was featured on SportsCenter and was part of ESPN’s postseason softball
broadcasts), but it lacked some contextual adjustments that the average fan could pick up on. The most
basic issue was that, because the RPI adjustment was applied at the conference level, it canceled out
entirely in games between two teams from the same conference, even though in virtually all
conferences the difference between the best and the worst teams is significant. Conference-level RPI
was ultimately too general for all matchups, especially between conference opponents, so switching to a
dynamic team-level method has produced much more practical outputs.

V2 applies this dynamic team-level RPI adjustment at the same rate as V1: decreasing throughout
the game at a constant rate.
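
One way to picture an adjustment that decreases at a constant rate is a linear weight on the pregame adjustment. The sketch below assumes the weight is the share of regulation outs still to be played (54 for a nine-inning baseball game) and that the adjustment is simply added to the naïve probability; both choices, and the function names, are assumptions for illustration.

```python
def decay_weight(outs_recorded, total_outs=54):
    """Linear weight: 1.0 at first pitch, 0.0 once all regulation outs are recorded."""
    remaining = max(total_outs - outs_recorded, 0)  # extra innings clamp to 0
    return remaining / total_outs

def apply_adjustment(naive_prob, pregame_adjustment, outs_recorded, total_outs=54):
    """Add the decayed adjustment to the naive probability, clamped to [0, 1]."""
    adjusted = naive_prob + pregame_adjustment * decay_weight(outs_recorded, total_outs)
    return min(max(adjusted, 0.0), 1.0)

# Hypothetical +0.20 pregame edge for the home team, checked at three points in a game.
for outs in (0, 27, 51):
    print(outs, round(apply_adjustment(0.50, 0.20, outs), 3))
```

For a seven-inning softball game the same sketch would use total_outs=42.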

The RPI adjustment values are calculated based on the date that a game occurs. For every new game,
RPI is updated and recalculated according to the new data available. As the season progresses and RPI
becomes more robust, the RPI adjustment is weighted more heavily in the win probability model than it
is when teams are just a few games or even a few weeks into the season.

Looking back at past games, the RPI used in the model is consistent with the RPI at the time that game
occurred, making RPI dynamic across the season.
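
The point-in-time idea can be sketched with a simple date filter. The example below only computes a team's record as of a date, which is the raw input an RPI-style rating would need. It assumes that only games played before the date in question count, and the game tuple layout and team names are hypothetical.

```python
from datetime import date

def record_as_of(games, team, as_of):
    """Wins and losses for `team` using only games played before `as_of`."""
    wins = losses = 0
    for game_date, winner, loser in games:
        if game_date >= as_of:
            continue  # ignore games on or after the target date
        if winner == team:
            wins += 1
        elif loser == team:
            losses += 1
    return wins, losses

# Hypothetical results: (date, winner, loser)
games = [
    (date(2022, 2, 18), "Team A", "Team B"),
    (date(2022, 2, 19), "Team B", "Team A"),
    (date(2022, 3, 5), "Team A", "Team C"),
]
print(record_as_of(games, "Team A", date(2022, 3, 1)))  # (1, 1)
print(record_as_of(games, "Team A", date(2022, 4, 1)))  # (2, 1)
```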

The final adjustment was included to control for the true corner-case teams across both sports. Teams
like Oklahoma Softball and Tennessee Baseball in 2022 dominated their opponents to the point where
RPI alone wasn’t doing them justice in terms of modeling their win probability in each game. The basic
RPI calculation is composed of:

● 25% of your own win percentage
● 50% of the average win percentage of your opponents
● 25% of the average win percentage of your opponents’ opponents

Thus, it’s designed to pick up on quality wins and bad losses, but there’s nothing to control for the
magnitude by which a team has won or lost. It’s one thing to win 59 of 62 games like Oklahoma softball
did in 2022, but doing so while scoring 514 more runs than they gave up is a good example of why we
added a run differential adjustment to the V2 model, specifically by utilizing the Pythagorean Theorem
of Baseball.
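
The basic RPI weighting listed above can be sketched in a few lines. This is only the 25/50/25 blend; it ignores any refinements a production RPI might apply, and the function names and input values are hypothetical.

```python
def win_pct(wins, losses):
    """Winning percentage, or 0.0 for a team that has not played yet."""
    games = wins + losses
    return wins / games if games else 0.0

def basic_rpi(wp, owp, oowp):
    """25% own win pct, 50% opponents' win pct, 25% opponents' opponents' win pct."""
    return 0.25 * wp + 0.50 * owp + 0.25 * oowp

# Hypothetical inputs: an 80% team with a 60% / 50% schedule.
print(round(basic_rpi(0.80, 0.60, 0.50), 3))  # 0.625
```

Note that nothing in this calculation looks at runs scored or allowed, which is the gap the run differential adjustment is meant to fill.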

As defined by Baseball Reference, the Pythagorean Theorem of Baseball is “a creation of Bill James that
relates the number of runs a team has scored and surrendered to its actual winning percentage based
on the idea that runs scored compared to runs allowed is a better indicator of a team’s (future)
performance than a team’s actual winning percentage.”
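
For reference, the classic form of the formula squares runs scored and runs allowed. The sketch below assumes that exponent of 2; other exponents are also in common use, and the run totals shown are hypothetical.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and runs allowed."""
    if runs_scored == 0 and runs_allowed == 0:
        return 0.5  # no information yet, so treat the team as average
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)

# Hypothetical team that has scored twice as many runs as it has allowed.
print(round(pythagorean_win_pct(600, 300), 3))  # 0.8
```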

The nice thing about this calculation is that the output of the Pythagorean Theorem of Baseball is an
expected winning percentage, which is exactly what we are attempting to model. That makes the
application straightforward in this context. We use the difference between the teams’ Pythagorean
Win Percentages at the time of the game as the run differential adjustment. We scale this value by
one-half (a somewhat arbitrary value, chosen because the full difference proved much too impactful)
and apply it alongside the naïve win probabilities and the team-specific RPI adjustment. The run
differential adjustment decays at the same constant rate as the RPI adjustment, based on the same
logic.
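
Putting the pieces together, a minimal sketch of how these adjustments might combine is shown below. It assumes the adjustments are added to the naïve probability, that the shared decay is linear in regulation outs remaining, and that the RPI adjustment arrives already expressed in win probability terms. None of those details are spelled out above, so treat the function and its parameters as hypothetical.

```python
def v2_home_win_prob(naive_prob, rpi_adjustment, home_pyth, away_pyth,
                     outs_recorded, total_outs=54, run_diff_scale=0.5):
    """Naive probability plus decayed RPI and run differential adjustments."""
    # Shared linear decay: full weight at first pitch, none after regulation.
    weight = max(total_outs - outs_recorded, 0) / total_outs
    # Half of the gap between the two teams' Pythagorean win percentages.
    run_diff_adjustment = run_diff_scale * (home_pyth - away_pyth)
    adjusted = naive_prob + weight * (rpi_adjustment + run_diff_adjustment)
    return min(max(adjusted, 0.0), 1.0)  # keep the result a valid probability

# Hypothetical matchup: strong home team (0.80 Pythagorean) vs. average visitor (0.50).
print(round(v2_home_win_prob(0.55, 0.10, 0.80, 0.50, outs_recorded=0), 3))   # 0.8
print(round(v2_home_win_prob(0.55, 0.10, 0.80, 0.50, outs_recorded=27), 3))  # 0.675
```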

The result within the 643 interface (and soon for fans at D1Baseball.com and D1Softball.com) is the
same front-end visual (shown below) that tells the story of the game before showing the game’s
Excitement Index – a 0-100 metric that summarizes the total amount of win probability change
over the course of the game – or in other words, a good proxy for “excitement”.
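
To make "total amount of win probability change" concrete, one simple reading is the sum of the absolute play-to-play changes in win probability. The sketch below uses that reading. The scaling to 0-100 is not described here, so the max_movement normalizer is purely a hypothetical placeholder.

```python
def total_wp_movement(home_win_probs):
    """Sum of absolute play-to-play changes in home win probability."""
    return sum(abs(b - a) for a, b in zip(home_win_probs, home_win_probs[1:]))

def excitement_index(home_win_probs, max_movement=3.0):
    """Scale total movement to 0-100; max_movement is an assumed cap, not a known value."""
    raw = total_wp_movement(home_win_probs)
    return min(100.0, 100.0 * raw / max_movement)

# Hypothetical game: the home team falls behind, then rallies late to win.
probs = [0.5, 0.6, 0.4, 0.9, 1.0]
print(round(total_wp_movement(probs), 3))  # 0.9
print(round(excitement_index(probs), 1))   # 30.0
```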

Users can also view Synergy video right in this tab just by clicking on a play, whether to re-live the
biggest moments of a team’s season or to rewatch a pitch-by-pitch recap of each play. This is available
for all games back through 2017 for baseball and 2018 for softball, from D1 to D3.


The sequence above is a good example of an extremely exciting game with a huge win probability swing
in the bottom of the ninth that ultimately flipped the game’s expected winner. The game culminated in
Josh McAllister’s walk-off 2-RBI double that won the game for Georgia.

This model is not spit out by a black box algorithm. In fact, it’s not AI-based at all, as it’s only pulling
from past baseball and softball events (respectively) before applying adjustments based on objective
past game results and run differentials. While we’re now much more confident in the model (and
especially the starting win probabilities being much more trustworthy in analysis), certain machine
learning techniques would likely perform very well as predictive mechanisms for these games, and it
may be something we pursue in the near future. However, for now, this model is highly interpretable
and can be easily explained as being based on exactly what has happened in these same situations in the
past and how well these teams have performed over the course of the given season.