How to read the mind of a CNN

Instance-based visualization techniques for CNNs: An overview for non-AI-experts to comprehend classification results

Luise Wiesalla
Nerd For Tech

--

TL;DR For Convolutional Neural Networks (CNNs) to be adopted in practice, it is crucial that users accept the model and trust its classifications. However, since the networks are not intuitively comprehensible, non-transparent misclassifications lead to a loss of trust in the model. To increase comprehensibility, visualization methods can be used that show non-AI experts which image regions were relevant for a decision. In the following post, some of these methods are presented and evaluated against selected criteria. The main groups of instance-based visualization methods are covered: perturbation-based, activation-based, backpropagation-based and concept-based.

The post was written as part of the Deep Learning seminar in the winter semester 2020/2021 at the University of Applied Sciences Darmstadt.

Photo by Edi Libedinsky on Unsplash

What’s the problem?

Convolutional Neural Nets (CNNs) have changed the world of image processing. Convolutional layers have given computer vision specialized networks that also exploit the spatial structure of image data. A wide variety of CNN architectures is being used to solve more and more tasks. Other machine learning models such as Random Forests simply can't keep up. A well-known example of this outstanding performance is image classification on the Fashion-MNIST dataset. Have a look at the benchmark results of CNNs on that dataset!

In addition to good performance, a Convolutional Neural Network is also very practical: no rules have to be specified as to how exactly the image classification should work. However, this is exactly where the problem appears. With the gained flexibility, interpretability suffers badly. The exact processes within the network and the concrete composition of the classification result are not easily understandable. Without further measures, this lack of comprehensibility prevents important insights (e.g. which image areas were relevant) from being drawn from the network. In the case of a misclassification, the trust of end users in the model is also reduced, which speaks against its use in critical applications.

Several examples of incomprehensible object classifications were collected in a paper on the ImageNet dataset (Natural Adversarial Examples, 2019). There, 7,500 real-world examples were identified that deceive common classifiers. What is remarkable in the selected examples is that the incorrect class (red) is chosen with very high confidence.

Real-world counterexamples of classification on the ImageNet dataset. In Natural Adversarial Examples (2019) https://arxiv.org/abs/1907.07174

Without information about the background, such a classification leads to skepticism towards the model. For automated decisions made with CNNs, such as in medical diagnosis or autonomous driving, the verifiability of the classification result is so critical that it determines whether or not the procedure becomes established at all.

Therefore, let’s understand a CNN. We will read its mind using instance-based visualization techniques.

How can instance-based visualization help?

Instance-based visualization techniques are methods that highlight the reasons for the classification decision for a specific image. Relevant image areas are either exposed separately or displayed with an overlaid heatmap. In the case of a misclassification, it is easier to determine whether, for example, the background rather than the object was decisive for the classification.

But how do you choose a suitable visualization technique?

The requirements are determined by the practical use of the methods. One specific use case is validating a CNN for detecting diabetic retinopathy lesions (Weakly-supervised localization of diabetic retinopathy lesions in retinal fundus images, 2017). In the long term, such a CNN is meant to detect the retinopathy in developing countries that lack qualified personnel and thus to avert the threat of blindness in time. The classification decision of the CNN here is between healthy and affected.
The Convolutional Neural Network achieves much better performance than previous methods (told you so!). However, medical experts do not trust the model due to the lack of traceability. Why should a trained network replace many years of medical education?! Therefore, it should be evaluated whether the network recognizes typical signs such as red lesions without being explicitly trained for them.
Thus, an instance-based visualization method is used to localize the learned features. The location of the relevant areas is then compared with the pre-marked signs. The level of detail is secondary in this example, since only the general learning success regarding the signs is in the foreground.

Localization of relevant image areas in retinal images. In each image pair, the relevance map is on the left and the marked signs on the right. In Weakly-supervised localization of diabetic retinopathy lesions in retinal fundus images, 2017 https://arxiv.org/abs/1706.09634

After considering other use cases, the following criteria for method selection have emerged:

  1. Degree of detail of the result
  2. Class discrimination of the method
  3. Implementation effort
  4. Intuitive interpretability
  5. Model agnosticism

In addition to the points listed, there are of course other criteria such as the robustness of the method, real-time capability and the ability to detect color-based problems (e.g. something yellow is automatically classified as a banana), which are not considered here.

But before the evaluation takes place, let's first get to know the principles behind the most important visualization techniques.

Some instance-based visualization methods

The number of visualization techniques is gigantic. Besides many basic concepts, there are numerous further developments to compensate for particular weaknesses. To get a first overview, we look at selected perturbation-based, activation-based, backpropagation-based and concept-based visualization techniques.

Perturbation-based

The simplest of the perturbation-based methods is based on covering image areas. If relevant image areas are covered, the class probability should (in theory) decrease. Therefore, image points and their immediate surroundings (e.g. a 7 x 7 field around a point) are systematically covered with a gray or black patch, and the change in the classification probability is observed. The method is called Occlusion (Learning Deep Features for Discriminative Localization, 2015). In the figure below, a heat map is formed from the classification probability of the correct class as a function of the coverage position. In the example, the upper image is to be classified as a Pomeranian. If the head area (center of the image) is covered, the class probability decreases (blue area in the top right image). This area is therefore important for classifying the image as a Pomeranian.

Systematically move the occlusion patch over the image to check the relevance of image parts. In Learning Deep Features for Discriminative Localization https://arxiv.org/abs/1512.04150

With regard to execution time, occlusion-based methods must re-determine the class probability for every coverage step. The run time of the visualization can therefore be influenced via the size and the step size of the patch, but it is generally higher than for other methods.
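To make the idea concrete, here is a minimal occlusion sketch in PyTorch. It is not code from the cited papers: `model` is assumed to be any trained classifier taking a (1, C, H, W) tensor, `image` an already preprocessed input, and patch size, stride and fill value are arbitrary example choices.

```python
import torch

def occlusion_map(model, image, target_class, patch=16, stride=8, fill=0.5):
    """Heat map of probability drops caused by sliding a gray patch over the image."""
    model.eval()
    _, _, H, W = image.shape
    with torch.no_grad():
        base = torch.softmax(model(image), dim=1)[0, target_class].item()
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heat = torch.zeros(rows, cols)
    for i, y in enumerate(range(0, H - patch + 1, stride)):
        for j, x in enumerate(range(0, W - patch + 1, stride)):
            occluded = image.clone()
            occluded[:, :, y:y + patch, x:x + patch] = fill  # cover with a gray patch
            with torch.no_grad():
                prob = torch.softmax(model(occluded), dim=1)[0, target_class].item()
            heat[i, j] = base - prob  # a large drop marks a relevant region
    return heat
```

The nested loop also makes the run-time behavior visible: one forward pass per patch position, so smaller strides quickly become expensive.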

An alternative to using a uniform, continuous patch is the Area Occlusion method. Here, context-related areas derived from the application domain are covered. For example, in a paper on Alzheimer's disease detection (Visualizing Convolutional Networks for MRI-based Diagnosis of Alzheimer's Disease, 2018), predefined brain regions in 3D MRI scans were successively covered by a 3D patch. The brain regions were segmented beforehand by an algorithm for automatic anatomical parcellation.
In practice, the prior definition of the image segments is a challenge that should not be underestimated, since the visualization techniques are instance-based. Thus, an explicit segmentation must be performed for each individual image as preparation.

For Area Occlusion, too, the effect on the classification result is determined as a function of the covered area. The following figure compares the results of both occlusion-based methods. Relevant areas are marked with increasing intensity of the red color.

Comparison of the Occlusion and Area Occlusion visualization methods. Areas marked in red are considered more relevant with increasing color intensity. In Visualizing Convolutional Networks for MRI-based Diagnosis of Alzheimer’s Disease https://arxiv.org/abs/1808.02874

It can be seen that the results of the two methods differ. With Area Occlusion, the brain regions are still easily recognizable after the overlay, which gives the evaluator easier access for interpretation. However, because only one brain region is covered per step, overlapping effects at the borders between neighboring regions are not detectable. Therefore, the relevance picture differs from that of other methods.

In addition, coverage-based methods are considered to be of limited use for MRI scans, since the gray color of the patch itself carries meaning about the nature of the brain matter in this context. A continuous gray value may indicate tissue atrophy, which increases the classification probability for Alzheimer's disease in some brain regions. The visualization method thus introduces a bias!

Backpropagation-based

The next group of visualization methods derives the relevance of the image areas using backpropagation.

The simplest method (Visualizing and Understanding Convolutional Networks, 2013) utilizes a quantity already known from the training process of a CNN: the influence of each pixel is estimated by the gradient, which is obtained via the partial derivative of the class score with respect to the input.
For grayscale images, the absolute magnitude of the gradient is used for the relevance map; a higher gradient indicates a greater relevance of the pixel. For color images, either the maximum absolute gradient across the three color channels is used, or the absolute gradients of all color channels are overlaid on the image. Since the gradient is determined via backpropagation, the method is referred to as vanilla backpropagation or VisCNN. The calculation is fast and requires no further preparation of the image.
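As an illustration, the vanilla saliency map fits in a few lines of PyTorch. The pretrained ResNet-18 and the random input tensor below are stand-ins for demonstration, not part of the cited paper.

```python
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()  # torchvision >= 0.13
image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a preprocessed photo

score = model(image)[0].max()   # class score of the most likely class
score.backward()                # partial derivative of the score w.r.t. every input pixel

saliency = image.grad.abs().max(dim=1)[0]  # maximum absolute gradient over the color channels
saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min())  # scale to [0, 1]
```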

In the next figure, the relevance map for the classification of a bird is shown. The relevance is normalized to a range from 0 (irrelevant) to 255 (maximally relevant) and can therefore be represented as a gray value. Accordingly, bright areas are crucial for the classification. This is especially the case in the eye area of the bird and in the bald area in front of the collar feathers.

Original image in color (left) and corresponding saliency map (right) created via backpropagation. In Visualizing and Understanding Convolutional Networks, 2013 https://arxiv.org/abs/1311.2901

According to the authors, the resulting saliency map is primarily suitable for weakly supervised object localization. The relevance map is used to separate the foreground from the background, which are defined by statistically determined thresholds on the gray values of the saliency map.

The visualization still looks a little messy for an explanation though, doesn’t it? It can be further developed with a simple idea that improves the output qualitatively.

The basic idea of the Guided Backpropagation (Striving for Simplicity: The All Convolutional Net, 2015) method is that neurons act like detectors of certain features. What is important for the visualization is only what a neuron actually detects, not what it fails to detect.
If a concept belonging to another class (e.g. the ear of a dog) is detected in the deeper layers, this non-class concept probably has a negative influence on the class probability of the correct class (e.g. cat). This negative influence can be traced back to individual pixels. If such a pixel is also involved in a class-related (!) concept (e.g. a cat eye), the negative influence partially suppresses the existing positive influence. For an explanation, however, only positive influence factors for the classification decision should be highlighted in the visualization.
Thus, when the gradient is passed back through the network, all negative gradients at ReLU layers are set to zero. This prevents the propagation of negative gradients and thereby filters out the negative influences.
The following figure compares Backpropagation and Guided Backpropagation for the classification of a color image. It can be seen that Guided Backpropagation (right) produces a much clearer image with respect to the relevant areas. The eyes of the kitten and its outline, which are crucial for the classification, are highlighted better.

Comparison of Backpropagation and Guided Backpropagation for the classification of a color image. Striving for Simplicity: The All Convolutional Net, 2015 https://arxiv.org/abs/1412.6806
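One common way to approximate Guided Backpropagation in PyTorch is to clamp negative gradients at every ReLU with a backward hook. The sketch below is an illustration under that assumption, not the authors' reference implementation; the choice of VGG-16 is arbitrary.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()

def guided_relu_hook(module, grad_input, grad_output):
    # The ReLU backward already zeroes gradients where the forward activation was
    # negative; additionally clamp negative gradients to zero (the "guided" part).
    return (torch.clamp(grad_input[0], min=0.0),)

for m in model.modules():
    if isinstance(m, nn.ReLU):
        m.inplace = False  # in-place ReLUs can interfere with backward hooks
        m.register_full_backward_hook(guided_relu_hook)

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in input
model(image)[0].max().backward()
guided_grads = image.grad  # noticeably cleaner than the vanilla gradient map
```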

Still, a weakness of the presented gradient-based methods is that the visualizations are not clearly class-discriminative.

Another method uses backpropagation, but not the gradient, to build the relevance map. Layer-Wise Relevance Propagation (LRP) (On Pixel-wise Explanations for Non-Linear Classifier Decisions by Layer-wise Relevance Propagation, 2015) uses the internal weights and activations of a trained network to build a relevance map. A relevance score is passed through the network from the output layer to the input layer, and each neuron receives its share according to its contribution.
In the next figure, a relevance heatmap obtained with LRP is shown. Negative influences are marked in cold colors. The relevance scores of the most probable classes are shown in the bar chart below the figure. The screenshot was obtained with an interactive online tool for trying out the visualization technique.

Heatmap obtained by LRP for the classification of a bullfrog.
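To get a feeling for how the relevance score is redistributed, here is a toy sketch of the LRP epsilon-rule for a tiny fully connected network. This is a simplification: real LRP implementations offer several propagation rules and also handle convolutional and pooling layers.

```python
import torch
import torch.nn as nn

# Toy network; the same idea is applied layer by layer in a CNN.
layers = [nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2)]
x = torch.rand(1, 4)

activations = [x]  # keep the input of every layer from the forward pass
with torch.no_grad():
    for layer in layers:
        activations.append(layer(activations[-1]))

relevance = torch.zeros_like(activations[-1])
top = activations[-1].argmax()
relevance[0, top] = activations[-1][0, top]  # start with the winning class score

eps = 1e-6
with torch.no_grad():
    for layer, a in zip(reversed(layers), reversed(activations[:-1])):
        if isinstance(layer, nn.Linear):
            z = layer(a) + eps                  # forward contributions plus stabilizer
            s = relevance / z                   # share of relevance per output neuron
            relevance = a * (s @ layer.weight)  # redistribute to the layer's inputs
        # element-wise layers such as ReLU pass the relevance through unchanged

# `relevance` now assigns each input feature its share of the class score
```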

In the Softmax Gradient Layer-Wise Relevance Propagation extension (Explaining Convolutional Neural Networks using Softmax Gradient Layer-wise Relevance Propagation, 2019), the proportion of non-target classes in the visualization is reduced, resulting in improved class discrimination.

Activation-based

For another group of visualization methods, the structure of convolutional networks itself is exploited: they use the activations of the last convolutional layers.
But let's start at the front. The input of a CNN consists of spatially arranged data, which is passed through the front part of the CNN while the spatial structure is preserved. The convolutional layers encode higher and higher levels of visual representation as the network depth increases. While general structures such as edges and corners are recognized in the first layers, these are composed into patterns and, in deeper networks, into whole objects. For classification, the spatial structure is resolved via a flattening layer, so the information about the spatial context is lost at this transition. In the last part of the network, the data is passed through fully connected layers for the final classification of the image.
The basic idea is therefore that most of the information sits in the last convolutional layer of the network, since the extracted image information and its spatial arrangement are present there together.

The method that makes direct use of this information is called Class Activation Mapping (CAM) (Learning Deep Features for Discriminative Localization, 2016). It was introduced in the context of special networks that abandon fully connected layers in order to minimize the number of parameters while preserving high performance and avoiding overfitting. For this purpose, global average pooling is introduced at the end. The first network with this strategy was the Network in Network (NIN), where the idea is introduced in more detail.

Here's the short version: Global Average Pooling (GAP) reduces the three-dimensional tensor (h x w x d) of the last convolutional layer to dimension (1 x 1 x d). Each feature map (h x w) is thus reduced to a single number: its global average.
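In code, this reduction is a single mean over the spatial dimensions (the shapes below are only an example):

```python
import torch

feature_maps = torch.rand(1, 512, 7, 7)  # output of a last convolutional layer (example shape)
pooled = feature_maps.mean(dim=(2, 3))   # global average pooling -> shape (1, 512)
```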

The visualization of relevance with CAM is formed with the help of global average pooling for a selected image category. The process can be visually traced in the following figure. The network shown consists mainly of convolutional layers. Global Average Pooling is used before the actual output layer. The class activation map is formed by projecting the weights (w_i) of the output layer back onto the feature maps. The class activation map is thus a weighted sum of the feature maps.

Schematic procedure for forming the Class Activation Map. In Learning Deep Features for Discriminative Localization, 2016 https://arxiv.org/abs/1512.04150

In a CNN, the dimensions of the feature maps decrease across the convolutional layers. Accordingly, a class activation map determined in this way cannot simply be transferred to the original image, but must first be scaled up to the original size using bilinear interpolation. This is the reason for the smooth transitions in the heat map. In addition, this results in a limitation of the level of detail.
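Putting the pieces together, a minimal CAM sketch could look like this. `feature_maps` and `weights` are hypothetical inputs that would have to be read out of a GAP-based network; this is not the original authors' code.

```python
import torch
import torch.nn.functional as F

def class_activation_map(feature_maps, weights, target_class, size=(224, 224)):
    # feature_maps: (1, d, h, w) output of the last convolutional layer
    # weights:      (num_classes, d) weights of the output layer after GAP
    w = weights[target_class]                             # (d,)
    cam = torch.einsum("d,dhw->hw", w, feature_maps[0])   # weighted sum of the feature maps
    cam = F.interpolate(cam[None, None], size=size,
                        mode="bilinear", align_corners=False)[0, 0]  # upscale to input size
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize for display
```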

In the following figure the method is used to explain a misclassification of a dome image. Depending on whether a church or a palace is to be recognized, the focus is more on the tower or the entire lower half of the building. The largest class score is obtained for palace and the second largest for the correct class dome. The misclassification can be easily understood with the support of the instance-based visualization for this case.

Class Activation Maps of the 5 most likely classes of a dome image with associated class score. In Learning Deep Features for Discriminative Localization, 2016 https://arxiv.org/abs/1512.04150

The visualization does not require any further preparation in the described network architecture. Only the feature maps and the weights of the output layer are required, which are created in the course of classification. This makes this method usable in real time.
Existing networks that do not use global average pooling are not suitable for the method without adaptation. Thus, the method is not model agnostic.

The further development Gradient-weighted Class Activation Mapping (Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization, 2016) was motivated by the need to generalize the CAM method and make it available for other network architectures, for example those with several fully connected layers. The basic idea of using the spatial information remains the same; the difference lies in how the weights of the feature maps are formed. The new weights are calculated from the gradient: for each feature map, the gradient of the class score with respect to that map is averaged over the width and height dimensions. Subsequently, a ReLU function is applied for filtering. As long as the layers behind the feature maps are differentiable, the gradient can be calculated for any network architecture.
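A compact Grad-CAM sketch with forward and backward hooks might look as follows. The choice of ResNet-18 and of `layer4` as the target layer are assumptions for illustration; any differentiable architecture and any convolutional layer can be used.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
store = {}

model.layer4.register_forward_hook(lambda m, i, o: store.update(feats=o.detach()))
model.layer4.register_full_backward_hook(lambda m, gi, go: store.update(grads=go[0].detach()))

image = torch.rand(1, 3, 224, 224)       # stand-in preprocessed input
logits = model(image)
logits[0, logits.argmax()].backward()    # gradient of the top class score

weights = store["grads"].mean(dim=(2, 3), keepdim=True)                 # global average of the gradients
cam = torch.relu((weights * store["feats"]).sum(dim=1, keepdim=True))   # weighted sum plus ReLU
cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                    align_corners=False)[0, 0]                          # upscale to the input resolution
```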

The remaining problem of the limited detail resolution of class activation maps is addressed by the further development Guided Grad-CAM (Grad-CAM: Why did you say that?, 2017). This visualization combines Grad-CAM for class discrimination with Guided Backpropagation for the details. For this purpose, a point-wise multiplication of the resulting maps takes place.
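The combination itself is a single point-wise product, assuming the two maps from the sketches above are available at the input resolution (hypothetical variable names):

```python
import torch

def guided_grad_cam(guided_grads, cam):
    # guided_grads: (1, 3, H, W) result of the Guided Backpropagation sketch
    # cam:          (H, W) Grad-CAM map already upscaled to the input size
    fine = guided_grads.abs().max(dim=1)[0][0]  # collapse the color channels
    return fine * cam                           # sharp details, restricted to the class region
```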
In the following figure, the procedure is applied specifically to understand misclassifications. The image column on the left deals with an image of a volcano, which was incorrectly classified as a car mirror. It can be seen that the sky and the oval section contribute to the class car mirror being chosen.

Highlighting the reasons for failure through a visualization with Grad-CAM. In Grad-CAM: Why did you say that? https://arxiv.org/abs/1611.07450

Concept-based

Finally, the novel concept-based framework Model Agnostic Concept Extractor (MACE) (MACE: Model Agnostic Concept Extractor for Explaining Image Classification Network, 2020) is presented as a visualization method. This method is inspired by human perception: individual concepts (like ears, legs, textures) are found in the image that lead to the corresponding classification.

Functionality of the Framework
The comprehensible explanations are extracted through the interaction of several components. No specific model architecture is necessary. The MACE framework is attached modularly to one of the convolutional layers of the pre-trained CNN and configured in the course of a retraining. In addition to the spatial information from the upstream convolutional layers, the discriminative features from the classifier part of the network are also used.
The instance-based visualization is triggered with an image and the corresponding class label. It returns low-level concepts and their relevance with respect to the class output.

Components of the Framework
Concepts are learned and extracted using four modules. The arrangement of the components Map Generator, Embedding Generator, Relevance Estimator and Output Generator is shown in the following figure.

Architecture of the MACE Framework. In MACE Model Agnostic Concept Extractor for Explaining Image Classification Network, 2020 https://arxiv.org/abs/2011.01472

The Map Generator estimates concept maps per image from the output of the last convolutional layer (the feature maps). From the feature maps, the activity pattern of a concept is determined based on its saliency in the image. The MACE framework combines the channels of the feature maps by a weighted average to find a single activation pattern per concept. The associated salient region is later used for visualization to make the concept understandable to humans.

The activation patterns are passed to the Embedding Generator. It transforms the appearance of a concept into a latent representation, which is invariant to spatial information. For this purpose, the concept maps of a class are considered collectively and abstracted to specific concepts. This is done by a multilayer dense neural network.

The Relevance Estimator determines the importance of a class-specific concept for the output of the model. Finally, the Output Generator passes the extracted concepts to the first dense layer of the pre-trained model to improve the explanatory power.

The following figure shows how the framework can be used to find explanations for the class scores of the wrong classes. In the first row, the ears of the fox are shown as an explanation for the class German Shepherd. The similarity is recognizable and helps the user to understand the classification decision.

Explanation of an incorrect classification based on the found concepts of the wrong class. In MACE Model Agnostic Concept Extractor for Explaining Image Classification Network, 2020 https://arxiv.org/abs/2011.01472

What’s the best method?

As always, the solution to a complex problem is not simple and straightforward. As you have noticed, the methods differ greatly (and not only in their basic principles). Fortunately, we already thought about the evaluation criteria in advance. The results of the evaluation are summarized in the table below.

Overview of the evaluation results of the presented visualization methods. The rating scale refers to the user’s point of view. Very positive (++) and positive (+) are advantageous (e.g. little implementation effort or a high level of detail), neutral (0) and negative (-) accordingly.

The occlusion-based methods are class-discriminative. Since they are executed by repeatedly classifying modified images, they are model-agnostic, but some effort is required to implement them. Area Occlusion in particular needs an image segmentation performed in advance, which means a larger effort. In return, this method brings a gain in interpretability. The level of detail is basically adjustable via the size of the coverage patch, but smaller patches lead to longer run times.
The Area Occlusion method is a good choice if users are looking for a visualization technique that is easy to understand and requires no complex Deep Learning knowledge.

The backpropagation-based methods are fast to compute and easy to implement, and they provide a detailed relevance map. However, this map is difficult to interpret for the vanilla backpropagation method, since the visualization has no clear focus. This problem can be solved with Guided Backpropagation.
These methods are applicable to all network architectures due to the simple use of backpropagation. LRP is even applicable to other model types such as support vector machines (bonus point for being model agnostic!).
A suitable use case for Guided Backpropagation is understanding a classification decision in which only one object occurs in the image. The features that contribute to the classification can then be detected with the additional information of the color channels. Especially in medical classification tasks, the fine structures are crucial, and these are highlighted well by Guided Backpropagation.

The activation-based method CAM does not show detailed information, since it upscales the compressed information of the last convolutional layer, and it depends on a special network structure with a GAP layer. However, the implementation is very easy to realize and the class dependency of the visualization is given. The further development Guided Grad-CAM combines Guided Backpropagation with the Grad-CAM process. This makes the method model agnostic and increases the level of detail.
The Guided Grad-CAM method is especially suitable when several objects appear in an image and both the specific object and its fine features should be highlighted.

The presented concept-based method MACE is model agnostic by design. The concept regions are displayed by blacking out the parts of the image that do not belong to them. Details such as edges and corners that contribute to the classification are therefore not highlighted, only whole concepts. However, this approach strengthens the intuitive interpretability enormously, since the representation is adapted to human perception. Since class-specific concepts are extracted, the class discrimination is particularly well developed. Due to the many components and the retraining of the concepts, the implementation effort is increased.
The instance-based visualization method MACE is particularly suitable for examining misclassifications and explaining the class scores of the wrong classes. The method is best suited for non-AI experts.

Conclusion

Now we know many different instance-based visualization techniques.
If you are now wondering how you can use one of the presented techniques to make your next Deep Learning project a success, here is, last but not least, a practical example.

If your training data carries hidden biases, they are not easy to find. In the example below, a CNN was used to classify age and gender. With Layer-Wise Relevance Propagation, positive influences (warm colors) and negative factors (cold colors) could be decoded, which helps to detect the bias. Have a look for yourself!

A CNN’s biases about the fun behavior of the elderly are revealed with instance-based visualizations. Blue colors speak against the appropriate classification. In Understanding and Comparing Deep Neural Networks for Age and Gender Classification, 2017 https://arxiv.org/abs/1708.07689
