This article is based on a presentation at IBA’s Fall 2023 Analytics Conference on Generative AI & Cybersecurity by Ben Lazarine, a doctoral student in Kelley’s Data Science and Artificial Intelligence Lab (DSAIL).
In the past few years, access to artificial intelligence models has been democratized as the AI community has embraced open-source platforms. While this shift has accelerated AI development and adoption by making models easy to access and share, it has also introduced new risks.
One of the most popular open-source platforms used by developers is Hugging Face, which allows developers to post and share pre-trained AI models for any number of tasks. Ben Lazarine, a PhD student who works in Kelley’s Data Science and Artificial Intelligence Lab (DSAIL), explains how Hugging Face works: if a user trains a model to do a natural language processing (NLP) task, they could post it on Hugging Face to easily facilitate others’ use of the model in their own web applications.
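In practice, reusing a shared model takes only a few lines of code. The sketch below uses the `transformers` library’s pipeline API; the model name is one publicly hosted example, not part of the study.

```python
# A minimal sketch of reusing a shared Hugging Face model for an NLP task.
# The model identifier is one publicly hosted example, not a recommendation.
from transformers import pipeline

# Downloads the pre-trained model and tokenizer from the Hugging Face Hub
# and wraps them in a ready-to-use inference pipeline.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("Open-source AI makes sharing models easy."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```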
On Hugging Face, users can leverage features like the model card, which provides information about a particular model, shows how popular it is, and reports its usage rate. The “Files and versions” section also lets users look “under the hood” of a model.
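These model-card details are also exposed programmatically. A minimal sketch using the `huggingface_hub` client, again with an illustrative model ID:

```python
# Inspect a hosted model's metadata, mirroring what the model card and
# "Files and versions" tab expose. The model ID is illustrative.
from huggingface_hub import HfApi

api = HfApi()
info = api.model_info("distilbert-base-uncased-finetuned-sst-2-english")

print(info.downloads)                        # rough popularity/usage signal
print(info.likes)                            # community endorsements
print([f.rfilename for f in info.siblings])  # files shipped with the model
```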
“While these features are useful, it’s important to note that Hugging Face doesn’t actually provide the source code on their own platform that was used to develop and train the models,” Lazarine says. “This makes it difficult to find the vulnerabilities of these models by looking at Hugging Face alone.”
Tracing the Connection between Hugging Face and GitHub
To study the relationship between models posted on Hugging Face and the GitHub repositories where their source code may be hosted, Lazarine and his team reviewed the existing literature on linkages between the two platforms and then studied the vulnerabilities that may exist in the associated source code.
“Previous literature tried to identify linkages through the README files of AI source code on GitHub, but we wanted to take a deeper look at the model cards on Hugging Face,” Lazarine says. “We examined even further by looking to the application program interface (API) calls within the source code on GitHub so we could see if people are importing Hugging Face to develop their models on GitHub.”
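One way to detect such linkages is to look for Hugging Face imports in a repository’s source files. A rough sketch of that heuristic, with a simplified package list:

```python
# Rough heuristic: does any Python file in a repository import a
# Hugging Face library? The package list here is a simplifying assumption.
import re
from pathlib import Path

HF_IMPORT = re.compile(
    r"^\s*(?:from|import)\s+(transformers|datasets|huggingface_hub)\b",
    re.MULTILINE,
)

def uses_hugging_face(repo_dir: str) -> bool:
    """Return True if any Python file in the repo imports a Hugging Face package."""
    for path in Path(repo_dir).rglob("*.py"):
        try:
            if HF_IMPORT.search(path.read_text(errors="ignore")):
                return True
        except OSError:
            continue  # skip unreadable files
    return False
```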
The team also wanted to leverage readily available open-source tools to help the AI community identify vulnerabilities within AI. The vulnerability assessment tools the team identified fall into two categories: static and dynamic.
Static vulnerability tests use rule-based methods to identify potential issues directly in source code. Users can point a static scanner at Python or C files, and it walks through the source without compiling or executing it. The vulnerabilities it reports fall into a number of categories, including potential secret leakage and the import and use of insecure libraries.
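As an illustration, the snippet below contains patterns that a rule-based scanner such as Bandit reports from the source text alone, without ever running it:

```python
# Illustrative file with issues a static scanner flags without execution.
PASSWORD = "hunter2"      # hardcoded credential: Bandit check B105

import pickle             # importing a known-insecure library: Bandit B403

def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)   # unsafe deserialization: Bandit B301
```

Running `bandit -r <repo>` over a repository reports each of these findings with a severity and confidence level.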
Dynamic vulnerability tests actually compile and run the AI model to analyze how it performs. These tests often provide a richer scan because they observe the model’s behavior at runtime. They also produce fewer false positives, because they demonstrate how a vulnerability would actually manifest once the model is incorporated into an application.
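A dynamic check can be as simple as running a repository’s inference code in an isolated process and watching for crashes, errors, or hangs. A minimal sketch, assuming a hypothetical `inference.py` entry point:

```python
# A minimal dynamic smoke test: run the model code and observe its behavior.
# The inference.py entry point and one-minute timeout are assumptions.
import os
import subprocess
import tempfile

def dynamic_smoke_test(repo_dir: str) -> bool:
    """Run the repo's inference script in isolation; True means a clean run."""
    with tempfile.TemporaryDirectory() as sandbox:
        try:
            result = subprocess.run(
                ["python", "inference.py"],           # assumed entry point
                cwd=repo_dir,
                env={**os.environ, "HOME": sandbox},  # keep writes out of $HOME
                capture_output=True,
                timeout=60,
            )
        except subprocess.TimeoutExpired:
            return False  # hanging at runtime is itself a red flag
    # A problem observed while the code runs is less likely to be a
    # false positive than a pattern matched in static text.
    return result.returncode == 0 and not result.stderr
```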
Research Objectives, Process, and Results
To identify vulnerabilities in the linkages between Hugging Face models and their underlying GitHub repositories, the researchers collected about 110,000 models from the Hugging Face platform, then performed an automated vulnerability assessment, Lazarine says.
“During the data collection process, our team saw that the number of models posted to Hugging Face exploded in the past year,” says Lazarine. “When we started this effort, there were around 70,000 models on Hugging Face, and by the time we finished, there were 110,000. If you look now, there are more than 170,000 models. People are really beginning to leverage this platform to post tens of thousands of models each month.”
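The collection step itself can be driven by the Hub’s public API. A minimal sketch with an illustrative limit; the study’s exact pipeline is not shown here.

```python
# Enumerate models on the Hugging Face Hub via the public API.
# The limit is illustrative; the full listing runs to six figures.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(limit=1000):
    print(model.id, model.downloads)
```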
The next step in their research was a linkage analysis to find the connections between Hugging Face and GitHub. Of the approximately 110,000 Hugging Face models they collected, around 9% could be linked to repositories on GitHub. In the other direction, 18% of their GitHub repository collection had a linkage to Hugging Face.
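One linkage heuristic is extracting GitHub repository URLs from model-card text. A rough sketch (the study also checked the reverse direction via API calls in GitHub source):

```python
# Rough linkage heuristic: pull 'owner/repo' pairs out of a model card.
import re

GITHUB_URL = re.compile(r"https?://github\.com/([\w.-]+)/([\w.-]+)")

def linked_github_repos(model_card_text: str) -> set[str]:
    """Return owner/repo identifiers mentioned in a Hugging Face model card."""
    return {f"{owner}/{repo.rstrip('.')}"
            for owner, repo in GITHUB_URL.findall(model_card_text)}
```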
Finally, the team performed a vulnerability assessment to identify vulnerabilities in GitHub that can be linked to Hugging Face. They used three open-source vulnerability scanners: Bandit, Flawfinder, and Semgrep. These scanners identify 14 categories of vulnerabilities, which include things like secrets, where an API key or password is included in the GitHub repository.
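Automating such a scan across thousands of repositories is straightforward to sketch. The CLI flags below are real scanner options, but the orchestration is a simplifying assumption, not the team’s pipeline:

```python
# Run two of the scanners on one repository and collect JSON findings.
# (Flawfinder covers C/C++ sources in a similar way via its own CLI.)
import json
import subprocess

def scan_repo(repo_dir: str) -> dict:
    """Collect Bandit and Semgrep findings for a single repository."""
    bandit = subprocess.run(
        ["bandit", "-r", repo_dir, "-f", "json"],
        capture_output=True, text=True,
    )
    semgrep = subprocess.run(
        ["semgrep", "scan", "--config", "auto", "--json", repo_dir],
        capture_output=True, text=True,
    )
    return {
        "bandit": json.loads(bandit.stdout or "{}"),
        "semgrep": json.loads(semgrep.stdout or "{}"),
    }
```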
The scan consisted of around 30,000 repositories identified across three categories:
- 111 root repositories, which are recommended repositories created and posted by Hugging Face’s own GitHub account.
- 28,000 fork repositories, created when someone copies a repository into their own account so they can modify it without affecting the original.
- 990 searched repositories identified using “Hugging Face” as a keyword in a GitHub search.
The team identified almost 6 million vulnerabilities within the GitHub repositories. The root repositories had the fewest vulnerabilities, and those were largely classified as low severity. The research also indicated that after users forked a repository, they often introduced new vulnerabilities or preserved vulnerabilities that had since been mitigated in the root repository.
“In our research, the most interesting trend that we discovered is how vulnerabilities are being introduced and preserved by a significant portion of the community leveraging GitHub and Hugging Face,” Lazarine says.
In the future, the team hopes to incrementally collect data so they can better understand the landscape of community-driven AI risk. They’re also looking forward to future vulnerability assessments using the MITRE Robust Intelligence Risk Database, which is working to provide a vulnerability summary for every model posted on Hugging Face.
“With tens of thousands of models being added to Hugging Face every month, we want to be able to know who’s developing these models, how frequently they’re being added, and how quickly they get linked to GitHub,” Lazarine says.