Modern machine learning models, such as neural networks, are often called “black boxes” because they are so complex that even the researchers who design them cannot fully understand how they make predictions.
To gain some insight, researchers use explanation methods that describe the decisions of individual models. For example, such a method might highlight the words in a movie review that influenced the model’s decision to classify the review as positive.
But these explanation methods do no good if humans can’t easily understand them, or if they misunderstand them. So MIT researchers created a mathematical framework to formally quantify and evaluate the understandability of explanations for machine learning models. This can help pinpoint information about model behavior that would be missed if a researcher evaluates only a handful of individual explanations in an attempt to understand the whole model.
“With this framework, we can have a very clear picture of not only what we know about the model from these local explanations, but more importantly what we don’t know about it,” says Yilun Zhou, an electrical engineering and computer science graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of a paper presenting this framework.
Zhou’s co-authors include Marco Tulio Ribeiro, principal researcher at Microsoft Research, and senior author Julie Shah, professor of aeronautics and astronautics and director of the Interactive Robotics Group at CSAIL. The research will be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.
Understanding local explanations
One way to understand a machine learning model is to find another model that mimics its predictions but uses transparent reasoning. However, modern neural network models are so complex that this technique usually fails. Instead, researchers turn to local explanations that focus on individual inputs. Often, these explanations highlight words in the text to convey their importance to one prediction made by the model.
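One common way to produce such a local explanation is occlusion-style salience: score each word by how much the model’s output drops when that word is removed. The article does not specify which attribution method the researchers used, so the following is only an illustrative sketch, with a toy stand-in for a real sentiment classifier:

```python
def occlusion_salience(words, score_fn):
    """Salience of each word = drop in the model's positive-class
    score when that word is removed from the input."""
    base = score_fn(words)
    saliences = {}
    for i, word in enumerate(words):
        ablated = words[:i] + words[i + 1:]
        saliences[word] = base - score_fn(ablated)
    return saliences

# Toy stand-in for a real classifier: it just counts sentiment words.
POSITIVE = {"memorable", "flawless", "lovely"}
NEGATIVE = {"dull"}

def toy_score(words):
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

review = "a memorable and flawless film".split()
print(occlusion_salience(review, toy_score))
# → {'a': 0, 'memorable': 1, 'and': 0, 'flawless': 1, 'film': 0}
```

With a real model, `score_fn` would be the classifier’s probability of the positive label; the highlighted words are those with the largest salience.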
Implicitly, people then generalize these local explanations to the model’s global behavior. Someone might see that a local explanation method highlighted positive words (like “memorable”, “flawless”, or “lovely”) as the most influential when the model decided a movie review had positive sentiment. They are then likely to assume that all positive words make positive contributions to the model’s predictions, but that may not always be the case, Zhou says.
The researchers developed a framework, known as ExSum (short for Explanation Summary), that formalizes these types of claims into rules that can be tested using quantifiable measures. ExSum evaluates a rule on an entire dataset, rather than the single instance it’s built for.
Using a graphical user interface, a person writes rules that can then be tweaked, tuned, and evaluated. For example, when studying a model that learns to classify movie reviews as positive or negative, one might write a rule that says “negation words have negative salience”, meaning that words like “not”, “no”, and “nothing” contribute negatively to the sentiment of movie reviews.
Using ExSum, the user can check whether this rule holds using three specific metrics: coverage, validity, and sharpness. Coverage measures how broadly the rule applies across the entire dataset. Validity is the percentage of individual examples that agree with the rule. Sharpness describes how precise the rule is; a highly valid rule may be so generic that it is not useful for understanding the model.
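Schematically (this is a simplified reading, not the paper’s exact formulation), a rule pairs an applicability condition with a behavioral claim: coverage is the fraction of instances the condition applies to, and validity is the fraction of applicable instances where the claim holds. A minimal sketch over a made-up dataset of (word, salience) pairs:

```python
# Hypothetical (word, salience) pairs from local explanations of a
# sentiment model; negative salience = pushes toward the "negative" label.
dataset = [
    ("not", -0.8), ("no", -0.6), ("nothing", -0.4),
    ("lovely", 0.9), ("film", 0.1), ("never", 0.2),
]

NEGATION_WORDS = {"not", "no", "nothing", "never"}

def applies(word, salience):
    # Applicability condition: the rule is about negation words.
    return word in NEGATION_WORDS

def claim_holds(word, salience):
    # Behavioral claim: "negation words have negative salience."
    return salience < 0

covered = [x for x in dataset if applies(*x)]
coverage = len(covered) / len(dataset)                # how widely the rule applies
validity = sum(claim_holds(*x) for x in covered) / len(covered)  # how often it is right

print(f"coverage={coverage:.2f}  validity={validity:.2f}")
# → coverage=0.67  validity=0.75
```

Sharpness, which rewards claims that pin down behavior narrowly rather than trivially, is omitted from this sketch; here the rule covers 4 of 6 words but “never” violates it, so validity is 0.75.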
If a researcher is looking to better understand the behavior of their model, they can use ExSum to test specific hypotheses, Zhou says.
If she suspects her model is gender-biased, she could create rules saying that masculine pronouns make a positive contribution and feminine pronouns make a negative contribution. If these rules have high validity, it means they hold true overall and the model is probably biased.
ExSum can also reveal unexpected information about a model’s behavior. For example, when evaluating the movie-review classifier, the researchers were surprised to find that negative words tended to have sharper contributions to the model’s decisions than positive words. This could be because reviewers try to be polite and less blunt when criticizing a film, Zhou says.
“To really confirm your understanding, you need to evaluate these claims much more rigorously on many instances. This kind of understanding at this fine-grained level, to our knowledge, has never been uncovered in previous work,” he says.
“Moving from local explanations to global understanding was a big gap in the literature. ExSum is a good first step toward filling that gap,” adds Ribeiro.
In future work, Zhou hopes to build on this by extending the notion of understandability to other criteria and forms of explanation, such as counterfactual explanations (which indicate how to modify an input to change the model’s prediction). For now, the work has focused on feature-attribution methods, which describe the individual features a model used to make a decision (like the words in a movie review).
He also wants to further improve the framework and user interface so that people can create rules faster. Writing rules can require hours of human involvement – and some level of human involvement is crucial, because humans must ultimately be able to grasp the explanations – but AI assistance could streamline the process.
As he reflects on the future of ExSum, Zhou hopes their work sheds light on the need to change the way researchers think about machine learning model explanations.
“Before this work, if you have a correct local explanation, you are done. You have reached the holy grail of explaining your model. We offer this additional dimension of ensuring that these explanations are understandable. Comprehensibility should be another measure to evaluate our explanations,” Zhou says.
This research is supported, in part, by the National Science Foundation.