by Will Emhardt, School of Informatics graduate student
The dataset I used for this project was the previous season of Division 1 soccer, which ended in December 2023. From the college soccer data, I pulled only the list of contests for the 2023 season and then the play-by-play for every one of those contests. Initially I set out to create a heatmap displaying all goals and shots, with filters such as time in the game, location on the field, and many more. As I dug into this objective, I grew curious to explore other aspects of the game.
First, I created the heatmap on a soccer pitch to show high- and low-density areas for shots and goals, then added an option to show the same data as a scatterplot. Second, I created a set piece metric to show how capable teams are when attacking off of set pieces and when defending them. Third, I created a corner kick metric to show a team’s propensity to score immediately after a corner kick or, on the other side, its capability of defending a corner kick. Lastly, I included a more general scatterplot with the ability to track individual athletes’ statistics. The general idea of the individual performance scatterplot was to display the particular strengths of players throughout Division 1.
All of these visuals and metrics are hosted on a Streamlit app where the user can filter the data however they please.
METHODOLOGY
First I had to pull the information necessary to meet my goals and objectives for the Streamlit app, so let me walk through that process. I began by looking up the sport ID on the Swagger UI website to find the unique identifier used for Men’s Soccer. I then plugged this identifier into a function that gets all sport seasons, which returned a list of the soccer seasons in descending order, each with a unique season ID. I plugged the season ID for 2023 into a function that gets all season divisions to find the unique season division ID for Division 1 soccer in the 2023 season. With this season division ID, I called a function that returns the list of contests for Division 1.
Finally, I had a list of the contests in dictionary form, but I needed it in a dataframe. After parsing the contests into a dataframe, I found the column of contest ID numbers and created a list out of it. I needed the contest IDs in list format so I could use my next function, which pulls a contest’s play-by-play. Instead of entering one contest ID at a time into the play-by-play function, I stored the list of IDs in a variable and ran the function over all of them. After this function ran, which took an hour or two, I had a list of dictionaries holding the play-by-play for every contest in the 2023 Division 1 season. Converting this list of dictionaries into a dataframe gave me the dataframe I would base all my analysis on, which completed the data pull process.
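The pull described above can be sketched roughly as follows. This is a minimal sketch, not the code used for the project: `fetch_play_by_play` stands in for whatever API function returns one contest’s play-by-play, and the `contest_id` key and the `fake_fetch` sample data are assumptions made purely for illustration.

```python
import pandas as pd


def contest_ids_from(contests):
    """Turn the list of contest dictionaries into a plain list of contest IDs."""
    contests_df = pd.DataFrame(contests)
    # "contest_id" is an assumed column name, not one confirmed by the API.
    return contests_df["contest_id"].tolist()


def collect_play_by_play(contest_ids, fetch_play_by_play):
    """Call the play-by-play function once per contest and stack the results."""
    rows = []
    for contest_id in contest_ids:
        for play in fetch_play_by_play(contest_id):
            # Tag every play with its contest so games stay separable later.
            rows.append({"contest_id": contest_id, **play})
    return pd.DataFrame(rows)


def fake_fetch(contest_id):
    """Hypothetical stand-in for the real API call, returning made-up plays."""
    return [
        {"team": "Team A", "action": "shot"},
        {"team": "Team B", "action": "goal"},
    ]


contests = [{"contest_id": 101}, {"contest_id": 102}]
ids = contest_ids_from(contests)
plays_df = collect_play_by_play(ids, fake_fetch)
print(plays_df)
```

Passing the fetch function in as an argument keeps the loop easy to try out on sample data before committing to an hour-long run against the real API.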
After pulling my large dataframe of more than 600,000 actions for the season, I needed to make some minor adjustments by adding and manipulating columns. I created an opponent column through an if statement inside a for loop that checked whether the team name had changed. Once a new team popped up in the play-by-play data, the loop would interpret this new team as the opponent. The loop would then fill in the new column so that, for every action, one of the two teams sat in the team column and the other in the opponent column, depending on which team performed the action.
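One way to build such an opponent column is sketched below. It swaps the row-by-row loop for a per-contest lookup of the two team names, so treat it as an alternative under assumptions rather than the loop used in the project. The column names `contest_id` and `team` and the sample rows are hypothetical.

```python
import pandas as pd


def add_opponent(plays):
    """Add an 'opponent' column holding the other team in the same contest."""
    plays = plays.copy()
    opponents = {}
    for contest_id, group in plays.groupby("contest_id"):
        teams = group["team"].dropna().unique().tolist()
        # Only pair teams when the contest has exactly two distinct names.
        if len(teams) == 2:
            first, second = teams
            opponents[(contest_id, first)] = second
            opponents[(contest_id, second)] = first
    # Rows without a match (for example, a missing team name) get None.
    plays["opponent"] = [
        opponents.get((contest_id, team))
        for contest_id, team in zip(plays["contest_id"], plays["team"])
    ]
    return plays


sample = pd.DataFrame(
    {
        "contest_id": [1, 1, 1, 2, 2],
        "team": ["Team A", "Team B", "Team A", "Team C", "Team D"],
    }
)
with_opponent = add_opponent(sample)
print(with_opponent)
```

Because the lookup is keyed by contest, it does not depend on plays arriving in order within a game.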
Sometimes plays were out of sequence, but the data was pulled one game at a time, so play-by-play data for one game never landed in the middle of another game. Beyond the opponent column, I had to multiply the x and y coordinates to fit the soccer pitch dimensions; replace the time format of play_time with a decimal, where 31.59 represents 31 minutes 59 seconds; capitalize the first letter of the values in the play qualifier column; and remove the semicolons at the end of descriptions. The last two changes were cosmetic, for the individual athlete’s scatterplot.
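The coordinate and time adjustments can be sketched as small helper functions. This is a sketch under assumptions: the raw coordinates are taken to run from 0 to 100, the pitch is taken to be 105 by 68 (a common size in meters), and play_time is taken to arrive as a "MM:SS" string. None of these details are stated above, and the function names are hypothetical.

```python
PITCH_LENGTH = 105.0  # assumed pitch length
PITCH_WIDTH = 68.0    # assumed pitch width
RAW_MAX = 100.0       # assumed upper bound of the raw x and y values


def scale_to_pitch(x, y):
    """Multiply raw coordinates so they span the pitch dimensions."""
    return x * PITCH_LENGTH / RAW_MAX, y * PITCH_WIDTH / RAW_MAX


def play_time_to_decimal(play_time):
    """Convert an assumed "MM:SS" string to minutes.seconds, e.g. 31.59."""
    minutes, seconds = play_time.split(":")
    return float(f"{int(minutes)}.{int(seconds):02d}")


def decimal_to_seconds(value):
    """Convert a minutes.seconds value such as 31.59 back to elapsed seconds."""
    minutes = int(value)
    seconds = round((value - minutes) * 100)
    return minutes * 60 + seconds


print(scale_to_pitch(50, 50))
print(play_time_to_decimal("31:59"))
print(decimal_to_seconds(31.59))
```

Because 31.59 is a label rather than a true fraction of a minute, converting back to seconds first makes it safer to measure the time between two plays.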
RESULTS (2023 Season)
The image above is the shot and goal chart against IUPUI. It’s interesting to see that on both sides the team likes to shoot from the right when shooting from long distance. Next I’ll show the scatterplot version of these heatmaps.
Above is the shot and goal chart of all teams against Clemson last year. All the stars are goals, all squares are shots on goal, and all dots are shots. Orange means the shot was high left, high right, or high center. Red means the shot was out right or low right. Blue means the shot was out left or low left. Purple means the shot was low center, or on goal. With Clemson, you can see a variety of shots taken evenly over the field, although on the right side of the field you do see a few more goals placed to the left than to the right. Next I will show you a shot chart with a distribution that skews one way.
This is the same chart, except against Boston College. It is very interesting that a majority of the goals are scored to the left. Coaches could key in on this statistic when prepping their offense for the Boston College defense.
The CAP metric is calculated by filtering for all of the corner kicks taken and goals within the selected parameters. In this dataframe, South Carolina, Presbyterian and Notre Dame each had a corner kick against Clemson, who was on defense. The metric is the total number of goals in the filtered dataframe scored within a minute of when a corner is taken, divided by the total number of corner kicks taken in that same dataframe. Clemson’s defense allowed a goal on a corner kick 3.5% of the time. On the other end, Clemson’s offense scores a goal on 5.4% of their corner kicks. National runner-up Notre Dame’s offense scores a goal on 6.7% of their corner kicks, and allows a goal on 2.6% of their corner kicks defended. IUPUI scores on a whopping 7% of their corner kicks, and allows a goal on only 2% of their corner kicks defended.
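The calculation can be sketched as below. This is a hedged illustration rather than the app’s code: the column names (`contest_id`, `team`, `action`, `seconds`), the action labels `corner_taken` and `goal`, and the sample rows are all made up, and time is assumed to be stored as elapsed seconds. The sketch counts corners followed by a goal from the same team inside the window, which matches goals divided by corners as long as no corner leads to two goals within the window.

```python
import pandas as pd


def conversion_rate(plays, set_piece_action, window_seconds):
    """Share of set pieces followed by a same-team goal within the window."""
    taken = plays[plays["action"] == set_piece_action]
    goals = plays[plays["action"] == "goal"]
    if taken.empty:
        return None  # no set pieces in the filtered data, so no rate
    converted = 0
    for _, kick in taken.iterrows():
        same_team_goals = goals[
            (goals["contest_id"] == kick["contest_id"])
            & (goals["team"] == kick["team"])
        ]
        elapsed = same_team_goals["seconds"] - kick["seconds"]
        # Count the kick once if any goal falls inside the window after it.
        if ((elapsed >= 0) & (elapsed <= window_seconds)).any():
            converted += 1
    return converted / len(taken)


# Made-up plays for illustration only.
sample_plays = pd.DataFrame(
    [
        {"contest_id": 1, "team": "Team A", "action": "corner_taken", "seconds": 600},
        {"contest_id": 1, "team": "Team A", "action": "goal", "seconds": 640},
        {"contest_id": 1, "team": "Team A", "action": "corner_taken", "seconds": 1200},
        {"contest_id": 1, "team": "Team A", "action": "corner_taken", "seconds": 2000},
        {"contest_id": 1, "team": "Team A", "action": "goal", "seconds": 2100},
        {"contest_id": 1, "team": "Team B", "action": "corner_taken", "seconds": 3000},
    ]
)
cap_one_minute = conversion_rate(sample_plays, "corner_taken", 60)
cap_two_minutes = conversion_rate(sample_plays, "corner_taken", 120)
print(cap_one_minute, cap_two_minutes)
```

Keeping the window as a parameter lets the same function cover a one-minute corner kick window or a longer window for other set pieces.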
The SPA metric is calculated similarly to the CAP metric, but it measures free-kick-to-goal conversions within 2 minutes of the free kick. I chose the free kicks taken and corners taken data points because they make it easier to track how long after the action a goal occurs. The corner kick conceded and free kick awarded entries in the play-by-play have coordinates, yet the time between when a free kick or corner is awarded and when it is taken varies. I wanted a uniform calculation of the time from the action to the goal. The Clemson defense allows a goal after 2.57% of their set pieces defended, and their offense scores a goal after 5.25% of their set pieces. The Notre Dame offense scores a goal after 7.29% of their set pieces, and after 5.49% of their set pieces in the 1st half. The IU defense is sturdy, allowing a goal on only 1.43% of their set pieces defended and only 0.74% of set pieces defended in the 1st half.
This is the player performance portion of the work I did with the 2023 Division 1 season. You can filter by teams, players, and opponents to see which players performed the best at what time, where, and against whom. There are many metrics you can track on the x-axis: shots, fouls, foulswon, substitutions, assists. Among shots, you can track every type of shot: blocked, lowleft, highright, from head, left foot, right foot, and many more. Above is a great example of how a sports information director (SID) or any dedicated sports nerd can get the most out of this visual. The performance chart above includes all the shots taken by the two national finalists and by the teams of the top 5 individual goal scorers in Division 1. The five players way out to the right are the top 5 goal scorers, showing clearly the impact these stars had on their teams.
INSIGHTS & RECOMMENDATIONS
Overall, I was very proud to create the metrics and website, and I felt I accomplished what I set out to do at the beginning of the summer. However, there were certainly speed bumps along the way, and I found many potential areas for improvement.
First, I have to note it would be incredible if the tracking system were used to its full ability, so I’ll go into some areas where more accurate tracking could add another layer to the game. Coordinates for assists would make it possible to measure whether longer or shorter passes are more opportune for shots. Opponents could key in on certain players at certain hot spots near the goal to halt striking advances. And I would not stop at assists: if every pass could be tracked, teams’ ‘plays’, the few there are in soccer, could start popping up in the numbers. Say that after Clemson gains possession on the left of their half, they like to advance up the left, cross it over to the right wing once they pass midfield, and play a one-touch pass to a player cutting down the right center of the field.
To limit the sheer size of a dataset with this kind of tracking, it could be restricted to advancing passes or passes on the offensive side of the field. Foul coordinates could be introduced, yet with caution, to avoid any potential hack-a-Shaq scenario that puts players in danger. To avoid the potential dangers, teams could focus in on the areas of the field where they’re committing the most fouls, cleaning up tackles and enriching the product the fan observes from the stands. Lastly, all free kicks right now are recorded as ‘Team’ under the player name column. Adding players to the free kick play entries could be an easy way to enrich the metrics I created above, specifically the SPA metric.
CONCLUSION
My vision for the first half of my project was to create a scouting tool for goalies by displaying shot tendencies for different players and teams. I believe I achieved my goal there: a goalie can now look at where Clemson’s shots are coming from in the 2nd half, which side of the goal Clemson is targeting when they score, and where their top player Ousmane Sylla is shooting from when he shoots top right. On the flip side, the first half can be used by teams to identify where their goals are coming from on the pitch. They can start to dive into alternative strategies, like shooting across the goal on the left side, after the data clearly shows they have built a pattern of shooting to the left from beyond the box.
The SPA and CAP metrics can add another dimension to coaches’ scouting reports. If coaches can figure out IUPUI is scoring off % of their set pieces, Indiana may play a bend-but-don’t-break defense to limit any potential fouls or set pieces. Then if a team like Notre Dame, who scores on 6.7% of their corner kicks, comes into Bloomington, Indiana may focus on corner kick defense the day before the match.
Lastly, to recap the points made on the player metric plots, this feature is the goodie bag you got after a birthday party when you were young. The season’s over, so how did everyone perform? Clemson won, yes, but the top goal scorers in Division 1 went to Hofstra and Western Michigan. Coaches could then study what those teams are doing to create a juggernaut. They may be playing lesser competition, but there may be some style these teams are scheming up to get premier strikers to convert. Western Michigan and Hofstra will surely love this feature of the app! This is just the start for soccer; with proper touches here and there, we could discover another game within the game.
REFLECTION
I wanted to focus my presentation on the practicality for players, teams and sports information directors. I set out at the beginning of my research committed to creating a view or report useful for each different persona. With this mentality, I always explained the ‘why’ before I introduced a new idea. At the conclusion of my presentation I felt many of my points and conclusions were well taken. There were a few questions on why certain parameters were selected for metrics. I was glad I was able to succinctly describe my work for the summer.
This presentation will prepare me for the full cycle of a project as a 21st-century data analyst. My prior experiences did not include robust datasets with such a wide range of possibilities. I learned the importance of the beginning stages of any data science project: decisions made then cause me to feel more connected with my work when it is time to present.
I believe the connection I established between the work I have done these last two semesters and an enriching, practical, sports dataset was the most important and satisfying part of this summer.
#SportsInnovation #SportsAnalytics #Indy4Sports