
by Elliott Scott, School of Informatics graduate student
This report analyzes football game states using college sports data to predict win probabilities for various game situations. The dataset comprises 1.6 million rows of play-by-play data from Division I FBS football games, focusing on play types like punts, field goals, passes, and rushes. Each play type is independently modeled using a generalized linear model to estimate win percentages, considering variables such as yards to the end zone, down number, yards to go, period number, and score difference. Results are presented through an interactive Shiny app, enabling users to adjust game states and calculate win percentages. This tool offers strategic insights to enhance decision-making in football. The study recommends incorporating time data and leveraging unused spatial data columns to improve model accuracy and predictive capabilities.
METHODOLOGY

The first step is acquiring college sports data. First, a list of seasons and their corresponding identifiers is obtained. Next, the divisions within the 2021-2024 seasons are retrieved, and their identifiers are used to compile a list of all Division I FBS contests. Once the contests are collected, the play-by-play data is acquired, resulting in 1.6 million rows of football data.
The next phase is to process the data into a format suitable for modelling football game states. This begins by merging the contest data with the play-by-play data to identify which teams participated in each play, which allows the winner of each contest to be associated with every play in it. Four kinds of plays (punts, field goals, passes, and rushes) are selected, and this category is referred to as the ‘play type’.
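The merge step above can be sketched roughly as follows. This is a minimal illustration in Python rather than the code used for the report, and the field names (`contest_id`, `winner_team_id`, `offense_team_id`) are hypothetical stand-ins for whatever the real data calls them.

```python
def label_plays_with_winner(contests, plays):
    """Attach an 'offense_won' flag to each play by joining on contest id.

    All field names here are assumed for illustration only.
    """
    # Look-up table: contest id -> id of the team that won that contest.
    winners = {c["contest_id"]: c["winner_team_id"] for c in contests}

    labeled = []
    for play in plays:
        winner = winners.get(play["contest_id"])
        if winner is None:
            continue  # skip plays whose contest has no recorded winner
        # Copy the play and record whether the team on offense won the game.
        labeled.append({**play, "offense_won": play["offense_team_id"] == winner})
    return labeled
```

Joining on the contest identifier means every play inherits the final outcome of its game, which is what later allows a win percentage to be computed per game state.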
Table 1: Count of Play Type
Play Type | Number of Plays |
RUSH | 122,612 |
PASS | 125,726 |
FIELD GOAL | 5,473 |
PUNT | 16,301 |
Total: | 270,112 |
Two key data points are created to give further context to the football state: “Score Difference,” representing the score difference from the offensive team’s perspective, and “Yards to End Zone,” indicating the number of yards the offensive team needs to advance before scoring a touchdown.
Figure 1

Figure 2
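As a rough illustration of the two derived data points, the Python sketch below computes them from raw values. It assumes, hypothetically, that field position is stored as a yard-line marker from 1 to 50 plus a flag saying whether the ball is on the offense's own half; the actual dataset may record position differently.

```python
def score_difference(offense_score, defense_score):
    """Score difference from the offensive team's perspective.

    Positive means the offense is ahead, negative means it is behind.
    """
    return offense_score - defense_score


def yards_to_end_zone(yard_line, on_own_half):
    """Yards the offense must advance to score a touchdown.

    yard_line is the 1-50 field marker; on_own_half is True when the ball
    is on the offense's side of midfield. Both inputs are assumed formats.
    """
    if not 1 <= yard_line <= 50:
        raise ValueError("yard_line must be between 1 and 50")
    return 100 - yard_line if on_own_half else yard_line
```

For example, a team trailing 10-17 with the ball on its own 25-yard line has a score difference of -7 and 75 yards to the end zone.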

Further processing is done by calculating win percentages for each play type and football state. Each play is categorized by play type and football state. Each category is counted to determine its frequency of occurrence. Additionally, the number of wins for each category is counted and divided by the total number of occurrences to calculate a win percentage.
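The counting described above amounts to a group-by over play type and football state. A minimal Python sketch, assuming each play is a dictionary with hypothetical field names and an `offense_won` flag (taking a "win" to mean a win by the team on offense, which is an assumption):

```python
from collections import defaultdict

# Hypothetical field names that together define a play type and football state.
STATE_FIELDS = (
    "play_type", "down", "yards_to_go", "period",
    "score_diff", "yards_to_end_zone",
)


def win_percentages(plays):
    """Count occurrences and wins per category, then divide wins by occurrences."""
    tallies = defaultdict(lambda: [0, 0])  # category -> [occurrences, wins]
    for play in plays:
        key = tuple(play[field] for field in STATE_FIELDS)
        tallies[key][0] += 1
        if play["offense_won"]:
            tallies[key][1] += 1
    return {
        key: {"plays": n, "wins": w, "win_pct": w / n}
        for key, (n, w) in tallies.items()
    }
```

Keeping the occurrence count alongside the win percentage is useful because categories with very few plays give noisy percentages.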
Each play type is modelled independently. These models estimate the win percentage for each football state using a generalized linear model with a logit link function. Polynomial and interaction terms are included as variables. Each model is refined by computing the Variance Inflation Factor (VIF) of its variables and removing those with a high VIF, repeating this process until every remaining variable has a VIF below 5. P-values are then calculated for the remaining variables, and the variables with the highest p-values are removed until every remaining variable has a p-value below 0.05.
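The VIF screening step can be sketched as below. This is not the code used for the report: it is a dependency-free Python illustration that relies on the standard identity that each variable's VIF equals the matching diagonal entry of the inverse correlation matrix. Dropping one variable per pass (the one with the highest VIF) is an assumption, and the later p-value step is not covered.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    centered = [[x - sum(c) / n for x in c] for c in columns]
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]


def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif_scores(data):
    """data maps variable name -> list of values; returns name -> VIF."""
    names = list(data)
    inverse = invert(correlation_matrix([data[name] for name in names]))
    return {name: inverse[i][i] for i, name in enumerate(names)}


def prune_by_vif(data, threshold=5.0):
    """Drop the highest-VIF variable until every remaining VIF is below threshold."""
    kept = dict(data)
    while len(kept) > 1:
        scores = vif_scores(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] < threshold:
            break
        del kept[worst]
    return kept
```

Perfectly collinear variables make the correlation matrix singular, so a production version would need to guard against that case.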
The final step is to create an app to provide visualizations of the win percentages in different football states. For this analysis an R Shiny app is created. The app allows users to adjust the game state and calculate win percentages for each play type, offering insights into strategies that could lead to the best outcomes. The app also provides a count of each play type, helping users understand the frequency of plays.
RESULTS
Rushing Model Results
Table 2: VIF of Rush Model Variables
Variable | VIF Score |
(Yards To Endzone)^2 | 1.071646 |
(Down Number)^2 | 1.051719 |
(Period Number)^2 | 1.031240 |
(Score Difference)^2 | 1.147594 |
(Yards To Endzone * Score Difference) | 3.85926 |
(Down Number * Yards To Go) | 1.067633 |
(Down Number * Score Difference) | 2.656068 |
(Yards To Go * Score Difference) | 3.021544 |
Passing Model Performance Results
Table 3: VIF of Passing Model Variables
Variable | VIF Score |
(Yards To Endzone)^2 | 1.029759 |
(Down Number)^2 | 2.193834 |
(Score Difference)^2 | 1.068510 |
(Yards To Endzone * Score Difference) | 4.145647 |
(Down Number * Period Number) | 2.301315 |
(Down Number * Score Difference) | 3.247937 |
(Yards To Go * Period Number) | 1.509096 |
(Yards To Go * Score Difference) | 3.638389 |
Field Goal Model Performance Results
Table 4: VIF of Field Goal Model Variables
Variable | VIF Score |
Yards To Endzone | 1.052465 |
(Score Difference)^2 | 1.247317 |
(Yards To Endzone * Score Difference) | 3.102416 |
(Period Number * Score Difference) | 3.314189 |
Punt Model Performance Results
Table 5: VIF of Punt Model Variables
Variable | VIF Score |
(Yards To Endzone)^2 | 1.000367 |
(Score Difference)^2 | 1.055432 |
(Yards To Go * Score Difference) | 2.533890 |
(Period Number * Score Difference) | 2.621724 |
APP CREATION
To visualize the data I analyzed, I created an app that allows the football state variables to be customized. Each color on the graph represents one of the four play types. Each point represents a real-world example from the data being modelled. The lines are the predicted win percentages output by the models.
Figure 7
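To show how a model with a logit link turns a football state into one of the predicted lines, here is a small Python sketch built on the eight rushing-model terms from Table 2. The fitted coefficients are not published in this report, so the coefficients and the intercept are inputs the caller must supply, and including an intercept at all is an assumption. Names such as `rush_terms` are illustrative only.

```python
import math


def rush_terms(yards_to_end_zone, down, yards_to_go, period, score_diff):
    """Build the eight rushing-model terms listed in Table 2."""
    return {
        "yards_to_end_zone_sq": yards_to_end_zone ** 2,
        "down_sq": down ** 2,
        "period_sq": period ** 2,
        "score_diff_sq": score_diff ** 2,
        "yards_to_end_zone_x_score_diff": yards_to_end_zone * score_diff,
        "down_x_yards_to_go": down * yards_to_go,
        "down_x_score_diff": down * score_diff,
        "yards_to_go_x_score_diff": yards_to_go * score_diff,
    }


def win_probability(terms, coefficients, intercept=0.0):
    """Apply the inverse logit to the linear predictor.

    coefficients maps term name -> fitted coefficient (not given in the report).
    """
    eta = intercept + sum(coefficients[name] * value for name, value in terms.items())
    # Numerically stable logistic function for large positive or negative eta.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    exp_eta = math.exp(eta)
    return exp_eta / (1.0 + exp_eta)
```

Evaluating `win_probability` across a range of yards to the end zone, with the other state variables held fixed, would trace out a single line for one play type.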

The football state can be modified in the panel on the left. Any combination of down number, yards to go, period number, and score difference can be selected. Yards to go and score difference are each specified as a range of values with a slider. Selecting the “Include All” boxes includes all points in the range. Down number and period number can be checked or unchecked in any combination. Once a state is chosen, clicking “Run Analysis” updates the graph with the predicted win percentage for that state.
Figure 8

Often the number of points on the graph can overwhelm the lines and make it difficult to read the win percentages. The points can be hidden by unchecking the “Show Points” checkbox.
Figure 9

INSIGHTS
The app can be used to derive the best play type to run for any football state. Many insights can be gained by looking at the variables that were found to be the best for predicting win percentage.
The rushing model uses 8 variables. None of them enters the model on its own; each appears only as a squared term or as part of an interaction. Notably, “yards to go” appears only in interaction terms, with the “down number” and with the “score difference”. This suggests that win percentage is not affected by “yards to go” by itself, but that its effect becomes significant when it is multiplied by the “down number”.
This makes sense when we compare scenarios like 4th and 1 to 4th and 10. 4th and 1 will likely have a much higher win percentage, since it is much easier to convert than 4th and 10. A similar comparison can be made between 2nd and 6 and 4th and 6. The 2nd and 6 scenario will have a higher win percentage because the earlier down leaves room for more rushing plays and a better chance of getting a first down, which in turn creates more chances to win.
Figure 10: 4th and 6 Scenario
Figure 11: 2nd and 6 Scenario

The passing model also uses 8 variables. Unlike the rushing model, it does not include (Period Number)^2, suggesting that time has less of an effect on the win percentage of a passing play. One plausible reason is that the clock stops on incomplete passes, which allows more passing plays to be run when little time remains. However, the passing model does include the period number when multiplied by the down number and when multiplied by the yards to go. The effect this has on win percentage is harder to explain, but pass plays are typically called with a high ‘yards to go’ on 4th down late in the 4th quarter. This is supported by the significance of score difference within the model when multiplied by the down number and by the yards to go, similar to the period number.
Figure 12: 4th down, High “Yards to Go”, 4th Quarter, Close Game

The field goal model uses 4 variables, fewer than the previous two models, suggesting that deciding whether or not to kick a field goal is a simpler decision. The significant variables were built from the yards to endzone, score difference, and period number, mostly as polynomial and interaction terms. The yards to endzone multiplied by the score difference was found to be significant, suggesting that the distance of a kick and the difference in score together significantly affect the win percentage. The period number multiplied by the score difference was also found to be significant, suggesting that field goals at certain periods of the game and at certain score differences affect the win percentage. This makes sense if we consider a scenario where the offense trails by more than the 3 points a field goal would provide and there is little time left in the game. Going for a field goal in this scenario will have a low win percentage.
Figure 13: Low field goal win percentage, 4th and long, 4th quarter, losing by more than 3

The punt model also uses 4 variables, many of which are the same as in the field goal model. The difference is that, instead of “yards to endzone” multiplied by the “score difference”, the punt model’s win percentage is significantly affected by “yards to go” multiplied by the “score difference”. This suggests that where the play occurs on the field is less significant than the number of “yards to go” to get a first down.
Figure 14: 4th and 10 Punt Win Percentages

Figure 15: 4th and 5 Punt Win Percentages

RECOMMENDATIONS & CONCLUSION
Based on the analysis, several key areas for improvement have been identified that could enhance the app’s functionality and predictive capabilities. Here are the recommendations:
- Inclusion of Time Data:
  - Insight: Time data is critical for accurately predicting win probabilities. However, only 76,000 of the 1.6 million plays (about 5%) had associated time data, so time could not be used in this analysis.
  - Recommendation: Incorporate time data into all play-by-play entries. This will enable more precise win probability predictions.
- Utilization of Unused Columns:
  - Insight: Several columns, such as “x-coordinate” and “y-coordinate”, exist in the dataset but are unused. If populated, these columns would offer valuable insights into player positioning and movement, which would greatly increase the value of the data and improve the quality of analysis.
  - Recommendation: Leverage these unused columns to enrich the dataset. By incorporating spatial data, player positions and movements can be analyzed more effectively, leading to deeper insights and more detailed play categorization.
By implementing these recommendations, the value of the dataset can be significantly enhanced, unlocking new possibilities for future analysis.
This study has explored how to predict win probabilities for college football plays. By modeling play types and calculating win percentages based on key game variables, it offers insights that can help coaches and analysts make better decisions.
The app developed for this project is a practical tool that allows users to adjust game states and visualize win percentages. This makes it easier to apply the findings in real-world scenarios, offering coaches a way to make data-driven decisions during games.
To enhance the app further, it is recommended to incorporate time data into all play-by-play entries, as timing is crucial in football. Additionally, using the unused spatial data columns could provide deeper insights into player positioning and movements, which would enrich the analysis.
In summary, this project lays a foundation for future football analytics. The combination of game state analysis and interactive visualization tools makes this project a resource for anyone looking to delve deeper into football analytics.