This article features the work of Sagar Samtani, Assistant Professor of Operations and Decision Technologies; Weimer Faculty Fellow; Director, Kelley’s Data Science and Artificial Intelligence Lab (DSAIL) at the Kelley School of Business.
As machine learning has advanced, Deep Learning (DL) models, machine learning models whose design is inspired by the structure of the human brain, have had significant success in various natural language processing (NLP) tasks. These models have improved the performance of text classification and text regression tasks, but they have also proved to be extremely vulnerable to adversarial attacks.
The authors propose a new explanation-based method for adversarial text attacks using additive feature attribution explainable methods, or tools that can help explain the outputs of machine learning models. These tools measure the sensitivity of inputs when creating black-box adversarial attacks on DL models that perform text classification or regression.
Statement of Problem
Adversarial attacks aim to trick DL models into producing specific outcomes by turning legitimate data examples into adversarial examples through human-imperceptible changes, such as swapping out or inserting characters. These changes have been shown to significantly alter a DL model’s behavior.
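For illustration only (the character mappings and the `perturb_word` helper below are hypothetical, not the authors’ implementation), a minimal sketch of such a visually similar character swap might look like this:

```python
# Hypothetical look-alike character map illustrating a human-imperceptible
# perturbation; not the authors' implementation.
VISUAL_SWAPS = {"o": "0", "l": "1", "e": "3", "a": "@"}

def perturb_word(word: str, max_swaps: int = 1) -> str:
    """Replace up to `max_swaps` characters with visually similar characters."""
    out, swaps = [], 0
    for ch in word:
        if swaps < max_swaps and ch.lower() in VISUAL_SWAPS:
            out.append(VISUAL_SWAPS[ch.lower()])
            swaps += 1
        else:
            out.append(ch)
    return "".join(out)

print(perturb_word("only"))   # -> "0nly"
print(perturb_word("great"))  # -> "gr3at"
```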
There are two categories of attacks, based on how much information an attacker has about the target model: white-box attacks, which assume that attackers know all of the model’s details; and black-box attacks, which more realistically assume that attackers have no access to model details. Additionally, current adversarial text attacks operate within a two-phase framework: first, sensitivity estimation, which measures how sensitive the model’s prediction is to each input token (such as a word or character); and second, perturbation execution, which crafts perturbed adversarial examples based on those token sensitivities.
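A rough sketch of this two-phase pattern (hypothetical helper names, not the authors’ code) is shown below. Here `predict` stands in for black-box queries to the target model and `perturb` for any perturbation strategy, such as the character swap sketched above; the sensitivity probe is a simple leave-one-token-out heuristic rather than the explanation-based scoring the authors propose.

```python
from typing import Callable, List

def two_phase_attack(
    tokens: List[str],
    predict: Callable[[str], float],   # black-box model: text -> prediction score
    perturb: Callable[[str], str],     # e.g. a visually-similar-character swap
    budget: int = 3,
) -> List[str]:
    original = predict(" ".join(tokens))

    # Phase 1: sensitivity estimation -- a simple leave-one-token-out probe that
    # measures how much the black-box prediction moves when a token is removed.
    sensitivities = []
    for i in range(len(tokens)):
        masked = tokens[:i] + tokens[i + 1:]
        sensitivities.append(abs(original - predict(" ".join(masked))))

    # Phase 2: perturbation execution -- perturb the most sensitive tokens first,
    # up to the attack budget.
    order = sorted(range(len(tokens)), key=lambda i: sensitivities[i], reverse=True)
    adversarial = list(tokens)
    for i in order[:budget]:
        adversarial[i] = perturb(adversarial[i])
    return adversarial
```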
However, the current methods used to estimate the sensitivity of a DL model often struggle to capture token directionality and overlapping token sensitivities. The authors’ primary research objective is to determine if a new explanation-based method for adversarial text attacks can measure the sensitivity of inputs when creating black-box attacks on DL models.
Data Sources Used
The authors studied the performance of the new explanation-based method on a variety of datasets, including the IMDB Movie Review, Yelp Reviews-Polarity, Amazon Reviews-Polarity, My Personality, Drug Review, and CommonLit Readability datasets.
Analytic Techniques
The authors’ new explanation-based method for adversarial text attacks leverages two additive feature attribution explainable methods, Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP), yielding two variants of the attack: XATA-LIME and XATA-SHAP. These methods are trained to approximate the target DL model’s behavior and are then used to measure the sensitivity of each input based on the target model’s predictions. The authors then change the inputs according to these sensitivity scores, adopting a commonly used visually-similar-character replacement perturbation strategy that makes small changes such as replacing the “o” in “only” with a zero to produce “0nly.”
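As a simplified sketch of this idea (not the authors’ XATA implementation), the example below uses the off-the-shelf LIME text explainer to rank word sensitivity for a black-box classifier and then applies a look-alike character swap to the most sensitive word; `black_box_predict` is a hypothetical stand-in for querying the target model.

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

def black_box_predict(texts):
    """Hypothetical stand-in: in practice, query the target DL model here and
    return class probabilities with shape (len(texts), 2)."""
    return np.array([[0.2, 0.8] for _ in texts])

explainer = LimeTextExplainer(class_names=["negative", "positive"])
text = "The acting was good, but the plot was only average."
explanation = explainer.explain_instance(text, black_box_predict, num_features=5)

# Rank words by the absolute value of their additive attribution weights.
ranked = sorted(explanation.as_list(), key=lambda kv: abs(kv[1]), reverse=True)

# Perturb the most sensitive word with a visually similar character ("o" -> "0").
word, _ = ranked[0]
adversarial_text = text.replace(word, word.replace("o", "0"), 1)
print(adversarial_text)
```

In the authors’ framework, a SHAP-style explainer would supply the attribution weights for the XATA-SHAP variant, but the rank-then-perturb flow follows the same two-phase pattern described above.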
The performance of XATA-LIME and XATA-SHAP is then tested on text classification and text regression tasks using the previously mentioned datasets. The authors hypothesize that approaches providing more substantial explanatory power (in this case, XATA-SHAP over XATA-LIME) lead to more effective adversarial attacks against DL models.
Results
The authors find that XATA-LIME and XATA-SHAP create more effective adversarial examples for black-box attacks on text classification and text regression models than baseline techniques across multiple datasets. XATA-SHAP often produces better explanations than XATA-LIME, which indicates that methods with more substantial or advanced explanatory power can estimate sensitivity more accurately and mount more impactful adversarial attacks on DL models.
After identifying the advantage of using additive feature attribution explainable methods for adversarial text attacks, the authors empirically demonstrate the trade-off between explainability and adversarial robustness in DL models. When researchers continually advance the explainability of DL models, they also provide attackers with tools to launch targeted and effective adversarial attacks.
Business Implications
These results indicate that the growing body of research focused on improving the explainability of DL models with additive feature attribution explainable methods can also provide attackers with weapons to launch targeted adversarial attacks. Because better explanations enable attackers to craft more threatening adversarial examples, improving a model’s explainability can also reduce its adversarial robustness and increase the risk that the model will be attacked.
The authors note additional directions for future research, including analyzing explainable methods beyond additive feature attribution for crafting adversarial attacks. Different word- or character-level perturbation strategies could also be introduced, and the methods could be applied to domains such as computer vision or speech recognition to test their generalizability.
Yidong Chai, Ruicheng Liang, Sagar Samtani, Hongyi Zhu, Meng Wang, Yezheng Liu, and Yuanchun Jiang. “Additive Feature Attribution Explainable Methods to Craft Adversarial Attacks for Text Classification and Text Regression.” IEEE Transactions on Knowledge and Data Engineering (2023).