Stolen Bases Model:

A generalized linear regression model I built predicts the probability that a stolen base attempt succeeds. The model uses the following variables to account for the specific players involved in the interaction:

Baserunner

Smoothed stolen base percentages since 2024 are used to account for the baserunner. A smoothed percentage is pulled closer to the league average the fewer attempts a player has, so this method takes past success into account without overreacting to small sample sizes.
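One common way to get this kind of smoothing is to add a fixed number of "phantom" league-average attempts to each player's record. The sketch below illustrates that idea only; the function name, the league average of 0.78, and the prior weight of 20 attempts are placeholder assumptions, not the values or the exact formula used by the model.

```python
def smoothed_sb_pct(successes: int, attempts: int,
                    league_avg: float = 0.78,      # assumed placeholder, not the real league rate
                    prior_weight: float = 20.0) -> float:  # assumed number of phantom attempts
    """Shrink a raw stolen base success rate toward the league average.

    With few attempts the result stays near league_avg; as attempts grow,
    the result approaches the player's raw success rate.
    """
    if attempts < 0 or successes < 0 or successes > attempts:
        raise ValueError("need 0 <= successes <= attempts")
    return (successes + prior_weight * league_avg) / (attempts + prior_weight)


# A runner who is 2-for-2 is barely moved off league average,
# while a runner who is 45-for-50 is credited with most of that success.
small_sample = smoothed_sb_pct(2, 2)
large_sample = smoothed_sb_pct(45, 50)
```

A larger prior weight makes the estimate more conservative; a smaller one trusts the raw percentage sooner.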

Catcher

For the opposing catcher, the model uses pop time to second base (the time it takes the catcher to throw the ball to second base). Pop time was more predictive for catchers than smoothed steal percentages.

Pitcher

Smoothed past steal percentages are also used for pitchers (the success rate of baserunners against that pitcher).

Model Summary

The model uses these three variables to predict the success probability of stolen base attempts. Some key attributes of the model are:

  • The baserunner and catcher have more of an impact than the pitcher.
  • The model weighted runner data approximately 1.5 times more heavily than pitcher data.
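To make the three-variable setup concrete, here is a minimal sketch that assumes a logistic link, which is a common choice for a generalized linear model with a success/failure outcome. Every number below is a placeholder: the league averages, the coefficients, and the centering are assumptions chosen only so that the runner coefficient is 1.5 times the pitcher coefficient, as described above. They are not the fitted values.

```python
import math

# Assumed placeholder league averages (not real fitted or measured values).
LEAGUE_SB_PCT = 0.78     # league stolen base success rate
LEAGUE_POP_TIME = 1.95   # catcher pop time to second base, in seconds

# Assumed placeholder coefficients; only the 1.5 runner-to-pitcher ratio comes from the text.
COEF_RUNNER = 3.0
COEF_PITCHER = 2.0
COEF_POP_TIME = 4.0      # slower (larger) pop time should help the runner


def predict_steal_probability(runner_pct: float, pitcher_pct: float,
                              pop_time: float) -> float:
    """Logistic-link sketch: probability that a steal attempt succeeds."""
    # Intercept chosen so a league-average matchup predicts the league average.
    intercept = math.log(LEAGUE_SB_PCT / (1.0 - LEAGUE_SB_PCT))
    z = (intercept
         + COEF_RUNNER * (runner_pct - LEAGUE_SB_PCT)
         + COEF_PITCHER * (pitcher_pct - LEAGUE_SB_PCT)
         + COEF_POP_TIME * (pop_time - LEAGUE_POP_TIME))
    return 1.0 / (1.0 + math.exp(-z))


# Strong runner, pitcher who is easy to run on, slow-throwing catcher.
favorable = predict_steal_probability(0.88, 0.82, 2.05)
# Weak runner, pitcher who holds runners well, quick catcher.
unfavorable = predict_steal_probability(0.70, 0.72, 1.85)
```

Centering each input on its league average keeps the intercept interpretable: with all three players at league average, the prediction is simply the league success rate.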

Live Deployment:

The live engine is built in Python. My script uses the MLB API to collect the result of each steal attempt and the names of the players involved as attempts happen live during games. Player names are then used to look up smoothed percentages and pop time data, which are fed to the model to produce a prediction.
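The lookup-then-predict step can be sketched roughly as follows. This is an illustration under several assumptions: the event dictionary and its keys are hypothetical and do not reflect the actual MLB API response format, the fetching step is omitted entirely, and falling back to league-average defaults for unknown players is a guess at one reasonable way to handle missing data.

```python
from typing import Callable, Dict

Model = Callable[[float, float, float], float]


def score_attempt(event: Dict[str, str],
                  runner_pct: Dict[str, float],
                  pitcher_pct: Dict[str, float],
                  catcher_pop: Dict[str, float],
                  model: Model,
                  default_pct: float = 0.78,    # assumed fallback, not a real league value
                  default_pop: float = 1.95) -> Dict[str, object]:  # assumed fallback
    """Look up each player's inputs by name and run them through the model."""
    # Hypothetical event keys: "runner", "pitcher", "catcher", "result".
    r = runner_pct.get(event["runner"], default_pct)
    p = pitcher_pct.get(event["pitcher"], default_pct)
    c = catcher_pop.get(event["catcher"], default_pop)
    return {"result": event["result"], "probability": model(r, p, c)}


# Usage with made-up players and a stand-in model (any 3-argument callable works).
event = {"runner": "Runner A", "pitcher": "Pitcher B",
         "catcher": "Catcher C", "result": "stolen base"}
scored = score_attempt(
    event,
    runner_pct={"Runner A": 0.86},
    pitcher_pct={"Pitcher B": 0.74},
    catcher_pop={},                      # unknown catcher falls back to the default
    model=lambda r, p, c: (r + p) / 2,   # placeholder, not the real model
)
```

Passing the model in as a callable keeps the lookup logic separate from the prediction logic, so either can change without touching the other.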

Tweet Output:

Live tweets produced by this engine contain the following data:

  • Result (successful steal or caught stealing)
  • Model grade and predicted steal probability
  • Player steal grades for all three players

Model grades simply sort each matchup into good, okay, or bad based on its predicted probability. Player steal grades show where each player ranks within their position, based on smoothed percentage (baserunners and pitchers) or pop time (catchers).
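As a rough illustration of both kinds of grade, the sketch below buckets a predicted probability and ranks a player within a position group. The cutoffs of 0.85 and 0.70 are assumed placeholders, as is the choice to rank from 1 (best); the real thresholds and ranking scheme are not stated here.

```python
from typing import Dict


def model_grade(probability: float,
                good_cutoff: float = 0.85,   # assumed placeholder threshold
                okay_cutoff: float = 0.70) -> str:  # assumed placeholder threshold
    """Bucket a predicted steal probability into good, okay, or bad."""
    if probability >= good_cutoff:
        return "good"
    if probability >= okay_cutoff:
        return "okay"
    return "bad"


def position_rank(player: str, values: Dict[str, float],
                  higher_is_better: bool) -> int:
    """Return the player's 1-based rank among everyone in `values`."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    return ordered.index(player) + 1


# Made-up catchers: a lower pop time is better for the catcher, so rank ascending.
pop_times = {"Catcher A": 1.88, "Catcher B": 1.97, "Catcher C": 2.04}
rank = position_rank("Catcher B", pop_times, higher_is_better=False)
grade = model_grade(0.74)
```

The `higher_is_better` flag covers the difference between positions: a high smoothed percentage is good for a baserunner, while a low pop time is good for a catcher.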

 
 

Want to see the model in action live during MLB games? Follow @StealSignal on X.