Limitations of Integrated Gradients for Feature Attribution


The popular method in interpretable AI has important drawbacks

Integrated gradients is a feature attribution method with several attractive properties that is well suited to neural networks. It can, however, behave in non-intuitive ways that are not widely known. Using concrete examples, I demonstrate here that integrated gradients does not have all the characteristics we would like an ideal feature attribution method to possess. Understanding the strengths and weaknesses of this tool will help users interpret their results, and perhaps design better tools in the future.


Feature attribution has its origin in game theory. Suppose we have a group of players cooperating to achieve some reward. Given that each player will, in general, contribute differently to the group, how should the reward be divided among the players? In the context of machine learning, the “reward” is analogous to the score output by the model, and the “players” are analogous to input features. That is, we want to know how much of the final score is due to each feature.

Shapley provided a set of axioms that any such system should satisfy, and showed that there was only one possible solution, now known as Shapley values, that satisfied each of his axioms. His solution relies on the observation that what each player can contribute depends on the other players present, and therefore considers the outcome of the game when it’s played by different subsets of the possible players. Switching back to the problem of feature attribution in machine learning, we would then have to define what it means for the model to make a prediction on subsets of features, rather than the full input. For models such as neural networks, it’s not obvious how to do this. Due to this conceptual limitation, as well as the high computational costs of Shapley values, other methods are needed in practice, and integrated gradients is a compelling alternative solution.
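To make Shapley's construction concrete, here is a minimal sketch (my own toy example, not from Shapley's work) that computes exact Shapley values for a small game by enumerating every subset of the remaining players:

```python
from itertools import combinations
from math import factorial

def shapley_values(value, players):
    # Exact Shapley values: each player's marginal contribution
    # value(S + i) - value(S), averaged over all subsets S of the other
    # players with weight |S|! * (n - |S| - 1)! / n!
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += weight * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

# Hypothetical two-player game: any coalition containing "a" earns a bonus
# of 10, and every member contributes 5
v = lambda S: (10 if "a" in S else 0) + 5 * len(S)
phi = shapley_values(v, ["a", "b"])
print(phi)  # "a" receives the bonus plus its own 5; "b" receives only 5
```

Note that the attributions sum to the value of the full coalition, and that the number of subsets grows exponentially with the number of players, which is the computational cost mentioned above.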

In the next section I explain integrated gradients, and discuss how the concept of a “baseline” input that it introduces ties back to cooperative games. In doing so, we will be able to assess how integrated gradients conforms with Shapley’s axioms and reveal some strange behavior.

Integrated gradients

To apply Shapley values to, say, an image classification problem, we would have to define what it means for a neural network to make a prediction on an image with missing pixels (analogous to missing players if we’re thinking back to cooperative games). The authors of integrated gradients imagine that we can think of these “missing” features as ones whose values have been replaced with some kind of uninformative baseline. In the context of an image, this could mean, for example, replacing true pixel values with a gray baseline.

At this point, integrated gradients departs drastically from the Shapley approach. The solution it provides has entirely different strengths and weaknesses, and capturing the full picture is beyond the scope of this article. For our purposes, I merely want to briefly describe what integrated gradients is, and whether or not it behaves in the ways that we want and expect.

Let f(x) represent our machine learning model for input vector x. Then for a baseline, uninformative input x’ (think all-grey image), integrated gradients proposes the following feature attribution:

IG_i(x) = (x_i − x′_i) ∫₀¹ ∂f/∂x_i (x′ + α(x − x′)) dα

where the left hand side represents the feature attribution given to the ith feature of the vector x. Note that in the above equation, it is understood that the gradient of f is taken before it’s evaluated along the path. Using the fundamental theorem of calculus, it follows that the sum of the attributions over all features is equal to the difference between f(x) and f(x’). This is the key intuition behind the method, but it has several other selling points as well. I encourage readers to check out the original paper for further details.
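To make the definition concrete, here is a minimal numerical sketch (the toy model, hand-coded gradient, and step count are my own illustrative choices, not from the original paper) that approximates the integral with a midpoint Riemann sum and checks the completeness property just described:

```python
def integrated_gradients(grad_f, x, baseline, steps=1000):
    # Midpoint Riemann sum for
    # IG_i(x) = (x_i - x'_i) * integral_0^1 df/dx_i(x' + a*(x - x')) da,
    # where grad_f(point) returns the gradient of f at `point`.
    n = len(x)
    attr = [0.0] * n
    for k in range(steps):
        a = (k + 0.5) / steps  # midpoint of the k-th subinterval
        point = [baseline[i] + a * (x[i] - baseline[i]) for i in range(n)]
        g = grad_f(point)
        for i in range(n):
            attr[i] += (x[i] - baseline[i]) * g[i] / steps
    return attr

# Toy model f(x1, x2) = x1 * x2^2 with its hand-coded gradient
f = lambda p: p[0] * p[1] ** 2
grad_f = lambda p: [p[1] ** 2, 2 * p[0] * p[1]]

x, x0 = [2.0, 6 / 5], [1.0, 0.0]
attr = integrated_gradients(grad_f, x, x0)
# Completeness: the attributions sum to f(x) - f(baseline)
print(attr, sum(attr), f(x) - f(x0))
```

The printed sum of attributions matches f(x) − f(x′) up to discretization error, which is the fundamental-theorem-of-calculus property mentioned above.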

Shapley axioms and integrated gradients

To assess which of Shapley’s axioms integrated gradients satisfies, we interpret dropping features from an input as being equivalent to replacing them with their baseline values. For brevity, I describe here only the relevant axioms that are not satisfied by integrated gradients; there are other axioms that it does satisfy.


Consistency

f(S ∪ i) − f(S) ≥ g(S ∪ i) − g(S) for all subsets S  ⟹  φ_i(f) ≥ φ_i(g)

In the above equation, S represents a subset of features, S ∪ i represents the same subset but now with feature i included, and φ represents any feature attribution method. What this axiom says is that if, for every subset S, feature i has a bigger impact on the model f than on the model g, then i should get a higher attribution under f than under g. Essentially, bigger rewards should imply bigger attributions. I personally do not find the importance of this axiom very intuitive, but it implies two other axioms, given below, that appear essential for reasonable feature attribution.


Symmetry

f(S ∪ i) = f(S ∪ j) for all subsets S containing neither i nor j  ⟹  φ_i(f) = φ_j(f)

This axiom imposes the very reasonable condition that if two features or players make the same contribution to every possible subset of features/players, they must get the same attribution. It would be unfair to reward one player more if their contributions were identical to another player’s. As we will see later, integrated gradients in fact does not satisfy this axiom!

Null effects

f(S ∪ i) = f(S) for all subsets S  ⟹  φ_i(f) = 0

This axiom says that if the presence of feature/player i has no effect on f for all subsets S of features/players, then the attribution for that player should be zero. A player that doesn’t contribute anything shouldn’t get rewarded.

In the next section, I provide counterexamples that show that integrated gradients fails each of these axioms.



Consistency

Let f(x1, x2) = x1·x2² and g(x1, x2) = x1·x2, let the baseline be x′ = (1, 0), and let the input be x = (2, 6/5), so that the straight-line path is x(α) = (1 + α, 6α/5). Then the feature attribution associated with x1 = 2 for the model f is:

IG_f(x1) = (2 − 1) ∫₀¹ (6α/5)² dα = 36/75

and the feature attribution associated with x1 = 2 for the model g is:

IG_g(x1) = (2 − 1) ∫₀¹ (6α/5) dα = 6/10
If we take all combinations of input features and baseline features, the values we need to consider are

f(1, 0) = 0

g(1, 0) = 0

f(2, 0) = 0

g(2, 0) = 0

f(1, 6/5) = 36/25

g(1, 6/5) = 6/5

f(2, 6/5) = 72/25

g(2, 6/5) = 12/5

For each combination of (x1, x2) we have that f(x1, x2) ≥ g(x1, x2). Yet according to integrated gradients, the feature attribution of x1=2 for the input (2, 6/5) under f is 36/75, while under g it is 6/10. Therefore x1=2 gets a smaller feature attribution under f than g, in violation of the consistency axiom.
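These two attributions can be checked numerically. The following sketch (the helper and step count are my own choices) approximates the path integral for x1 under both models with a midpoint Riemann sum:

```python
def ig_x1(dfdx1, x, x0, steps=10000):
    # (x1 - x1') * integral_0^1 df/dx1(x' + a*(x - x')) da via a midpoint sum
    total = 0.0
    for k in range(steps):
        a = (k + 0.5) / steps
        p = [x0[j] + a * (x[j] - x0[j]) for j in range(2)]
        total += dfdx1(p) / steps
    return (x[0] - x0[0]) * total

x, x0 = [2.0, 6 / 5], [1.0, 0.0]
ig_f = ig_x1(lambda p: p[1] ** 2, x, x0)  # df/dx1 = x2^2
ig_g = ig_x1(lambda p: p[1], x, x0)       # dg/dx1 = x2
print(ig_f, ig_g)  # ~0.48 under f vs ~0.6 under g
```

Even though f dominates g at every corner, the attribution for x1 comes out smaller under f.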


Symmetry

Let f(x1, x2) = sin(π·x1)·sin(2π·x2), the baseline be x′ = (0, 0), and the input be x = (1, 1), so that the path is x(α) = (α, α). Then the feature attribution for x1 = 1 is

IG(x1) = ∫₀¹ π cos(πα) sin(2πα) dα = 4/3

and the feature attribution for x2 = 1 is

IG(x2) = ∫₀¹ 2π sin(πα) cos(2πα) dα = −4/3

The inputs to evaluate are

f(0, 0) = 0

f(1, 0) = 0

f(0, 1) = 0

f(1, 1) = 0

Since the value of f is always the same regardless of what combination of inputs we supply it, according to the symmetry axiom x1=1 and x2=1 must get the same attribution for the baseline x’= (0, 0). Yet IG(x1) = 4/3 and IG(x2) = -4/3, in violation of the axiom.
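The two integrals above can be verified with the same kind of midpoint-sum approximation (the helper and step count are again my own choices):

```python
import math

def ig(i, grad_i, x, x0, steps=10000):
    # i-th IG attribution via a midpoint Riemann sum along the straight path
    total = 0.0
    for k in range(steps):
        a = (k + 0.5) / steps
        p = [x0[j] + a * (x[j] - x0[j]) for j in range(2)]
        total += grad_i(p) / steps
    return (x[i] - x0[i]) * total

x, x0 = [1.0, 1.0], [0.0, 0.0]
# Partial derivatives of f(x1, x2) = sin(pi*x1) * sin(2*pi*x2)
d1 = lambda p: math.pi * math.cos(math.pi * p[0]) * math.sin(2 * math.pi * p[1])
d2 = lambda p: 2 * math.pi * math.sin(math.pi * p[0]) * math.cos(2 * math.pi * p[1])
ig1, ig2 = ig(0, d1, x, x0), ig(1, d2, x, x0)
print(ig1, ig2)  # ~4/3 and ~-4/3, despite f vanishing at every corner
```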

Null effects

Let f(x1, x2) = cos(2π·x1)·x2, and once again let the baseline be x′ = (0, 0) and the input be x = (1, 1), with path x(α) = (α, α). The feature attribution for x1 = 1 is then

IG(x1) = ∫₀¹ −2πα sin(2πα) dα = 1
Note that this is in fact equal to f evaluated at the input. This implies (and can be confirmed by calculation) that all the attribution is given to x1 and none of it to x2. Yet switching x1 from the baseline x1′ = 0 to the input x1 = 1 has no effect on the function! This reveals that integrated gradients does not satisfy the null effects axiom, which is particularly troubling.
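The same midpoint-sum sketch confirms that all of the attribution lands on the irrelevant feature (helper and step count are my own choices):

```python
import math

def ig(i, grad_i, x, x0, steps=10000):
    # i-th IG attribution via a midpoint Riemann sum along the straight path
    total = 0.0
    for k in range(steps):
        a = (k + 0.5) / steps
        p = [x0[j] + a * (x[j] - x0[j]) for j in range(2)]
        total += grad_i(p) / steps
    return (x[i] - x0[i]) * total

x, x0 = [1.0, 1.0], [0.0, 0.0]
# Partial derivatives of f(x1, x2) = cos(2*pi*x1) * x2
d1 = lambda p: -2 * math.pi * math.sin(2 * math.pi * p[0]) * p[1]
d2 = lambda p: math.cos(2 * math.pi * p[0])
ig1, ig2 = ig(0, d1, x, x0), ig(1, d2, x, x0)
print(ig1, ig2)  # ~1 for the irrelevant feature x1, ~0 for x2
```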


Conclusion

We have investigated how integrated gradients performs under Shapley’s axioms by interpreting dropped/missing features as replacement with their baseline values. In doing so, we revealed that integrated gradients violates some of Shapley’s axioms, and because of this, it sometimes gives answers that are unbecoming of a reasonable feature attribution method. In particular, integrated gradients can assign different attributions to two features that always have the exact same effect on the model (a violation of symmetry), and can assign nonzero attributions to features that have no effect on the model (a violation of null effects). It is important that users understand these caveats before trying to interpret the decisions their models make.

