Explaining the reasoning going on in models trained with machine learning algorithms has become an ever more important topic, due to both regulatory standards that demand insight into these models, and the increasing complexity of state-of-the-art models. We have already written about explainable AI (XAI) on our blog.
This time, we investigate one AI application where explaining the model output is particularly interesting and challenging: detecting objects in images. Object detection is a task under intense, active development in the field of computer vision, with a continuously improving state of the art. Practical applications range from counting animal populations to agricultural weed and pest control to providing better thumbnails for social media. See for example here what you can do with object detection models - and why it is important to understand your model beyond summary evaluation metrics. While explainable AI is not a new topic in computer vision, there are no tools available yet to analyze an object detection model down to a specific input. Therefore, we will now show one approach to a detailed, instance-based explanation.
We make use of the SHAP library to calculate Shapley values for a model decision for which we want to obtain humanly interpretable reasons. In one sentence, Shapley values determine the marginal contribution of a feature towards a model result, taking into account a background distribution of the other features. A detailed explanation can be found here. In this article we build a model where we can selectively remove parts of the input, i.e. hide patches of pixels in the image. These patches serve as surrogate features for which we calculate and visualize Shapley values.
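For reference, in our notation (not the article's): with feature set $N$ and value function $v$, the Shapley value of feature $i$ is

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr),$$

i.e. the marginal contribution of $i$, averaged over all orders in which the features can be added.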
XAI libraries like SHAP and LIME support a range of standard models and inputs, including (deep) neural networks and image input. Object detection differs from the models supported out of the box: the nearest analogue would be image classification, since the input is the same and the employed models are typically deep neural networks, but the output of object detection goes beyond just one value for a class or a range of class probabilities. In object detection, one image can show multiple objects of varying sizes. State-of-the-art models for object detection also include a so-called non-maximum suppression (NMS) step, which filters the output of the last layer of the neural network down to the final predictions.
NMS iteratively removes lower-scoring boxes which have an intersection over union greater than a given threshold with another (higher-scoring) box. The threshold choice is up to the user and depends, for example, on the expected scenes: are objects frequently grouped close together or kept further apart?
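As a sketch of the idea (our simplified single-class version, not the YOLOv5 implementation), NMS can be written in a few lines:

```python
import torch
from torchvision.ops import box_iou  # IoU matrix for two sets of (x1, y1, x2, y2) boxes

def nms_sketch(boxes, scores, iou_threshold=0.45):
    # keep the highest-scoring box, drop all boxes overlapping it too much, repeat
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        # IoU of the best box with all remaining, lower-scoring boxes
        ious = box_iou(boxes[best].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= iou_threshold]
    return keep
```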
This step does not need to be trained and is non-differentiable, so there is no direct connection from the input to the final output via the gradient of the neural network. This, as well as the problem of multiple outputs from the model, prohibits using tools like DeepLift or the DeepExplainer in the SHAP library. However, SHAP provides a tool we can use for generic black-box models, the KernelExplainer. The challenge lies in fitting the object detection task into the scheme of the KernelExplainer and in connecting model and XAI framework on the technical level.
The code accompanying this article can be found here in Colab - we will explain the major steps and discuss them in the following.
For this article, we use the object-detection algorithm YOLOv5. It runs on PyTorch, which is supported in Colab, competes for top performance, and its models allow easy fine-tuning for new object types.
Of the available pretrained models, we will use YOLOv5s, the smallest model, pretrained on the COCO dataset and its 80 classes.
PyTorch is already available in the Colab environment, but we need to install the SHAP library and YOLOv5, and download the pretrained model weights. From the YOLOv5 code, we need helper functions for NMS and for checking the overlap of bounding boxes (the intersection over union, or IoU).
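In Colab, the setup could look roughly like this (the helper-module paths are assumptions and differ between YOLOv5 versions):

```python
!pip install shap
!git clone https://github.com/ultralytics/yolov5
%cd yolov5

from utils.general import non_max_suppression  # NMS helper from the YOLOv5 code base
from utils.metrics import box_iou              # bounding-box IoU helper
```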
Using the OpenCV library, we read one image, pad it into a square and resize it to 160x160 pixels. Smaller or larger input sizes are also allowed, but the square's width/height must be a multiple of 32 to fit the input of the pretrained model. Since both object detection with deep neural networks and XAI methods are rather resource- and time-consuming, we downscale input images to speed up the object detection. However, if you are interested in getting detections and explanations at higher resolution in further computations, feel free to try different image sizes here.
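A minimal version of this preprocessing (the file name is a placeholder):

```python
import cv2
import numpy as np

image = cv2.imread("input.jpg")             # placeholder file name
h, w = image.shape[:2]
side = max(h, w)
padded = np.zeros((side, side, 3), dtype=image.dtype)
padded[:h, :w] = image                      # pad the shorter side with black
image_160 = cv2.resize(padded, (160, 160))  # 160 is a multiple of 32
```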
Now, to match the expected model input (RGB colour format, channels first), we must reorder the colour channels (OpenCV reads images in BGR format), cast the image to a PyTorch tensor and pass it on to the model. We simply load the model weights and are ready to do inference on images.
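A sketch of these steps; the attempt_load helper and its module location are assumptions that vary with the YOLOv5 version:

```python
import torch
from models.experimental import attempt_load  # YOLOv5 helper; location may vary by version

x = image_160[:, :, ::-1].copy()                          # BGR -> RGB
x = torch.from_numpy(x).permute(2, 0, 1).float() / 255.0  # HWC -> CHW, scale to [0, 1]
x = x.unsqueeze(0)                                        # add a batch dimension

model = attempt_load("yolov5s.pt")  # pretrained yolov5s weights; recent versions fetch them on demand
model.eval()
```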
The immediate output of this model is, for each candidate box, the coordinates – x and y of the center of the potential object plus width and height – the probability that there is an object at these coordinates, and probabilities for each of the 80 classes. This vector lists detections for all possible anchor points, most of which will have very low scores:
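For a 160x160 input, YOLOv5s predicts on three scales (strides 8, 16 and 32) with three anchors each, i.e. (20² + 10² + 5²) · 3 = 1575 candidate boxes:

```python
with torch.no_grad():
    raw_pred = model(x)
if isinstance(raw_pred, (list, tuple)):  # some YOLOv5 versions return (predictions, ...)
    raw_pred = raw_pred[0]
print(raw_pred.shape)
# torch.Size([1, 1575, 85]): 1575 candidates x (4 box values + 1 object score + 80 classes)
```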
When we collapse and filter the predictions using NMS, two detected objects with high confidence remain. They are, as we expected, the two people in the image (COCO class index 0 corresponds to the object category person).
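Applying the YOLOv5 helper (the threshold values below are common defaults, not necessarily the notebook's choices):

```python
detections = non_max_suppression(raw_pred, conf_thres=0.25, iou_thres=0.45)[0]
print(detections)  # rows of [x1, y1, x2, y2, score, class]; here two rows with class 0 (person)
```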
For each detection we get the coordinates, now in the format x1, y1, x2, y2, the product of the object score and the highest class probability, as well as the index of the most likely class. To see how well we did in finding the person on the right, we look at the combined score of the second detection: the model achieved a score of 68.6%.
Potential approaches to increase the confidence score are to use a bigger pretrained model or to increase the image resolution.
Explaining the whole model output with respect to the input image is hard, simply because there is not one well-delineated outcome. If we focus on a single detection instead, the question from the XAI perspective is much narrower and better defined: what part of the image contributes to this particular detection, and how much?
To use this model with the KernelExplainer of SHAP, we need to fit the steps above into one PyTorch model:
1. Casting the image to a PyTorch tensor
2. Applying the core model
3. Applying NMS
4. Calculating the score of the detection we are interested in.
We implement these steps as individual layers in PyTorch. Step 1 is done with the class Numpy2TorchCaster. Step 2 is the model we used above. Steps 3 and 4 are done with the class OD2Score.
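The full implementations are in the notebook; a minimal sketch of the casting layer could look like this:

```python
import torch.nn as nn

class Numpy2TorchCaster(nn.Module):
    # sketch: turn a numpy batch of HWC images into the float CHW tensor the model expects
    def forward(self, x):
        x = torch.from_numpy(np.ascontiguousarray(x)).float() / 255.0
        return x.permute(0, 3, 1, 2)  # NHWC -> NCHW
```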
When we extract the score for our target detection, the detected box may be shifted relative to the target we set; in that case, we multiply the score by the overlap (IoU) between the target box and the detected box. The final score therefore depends both on how confident the model is in predicting the person and on how well the box is positioned.
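A sketch of the scoring layer, assuming the target is given as an (x1, y1, x2, y2) box plus a class index, and reusing the YOLOv5 helpers imported above:

```python
class OD2Score(nn.Module):
    # sketch: apply NMS, then score the target detection as confidence * IoU with the target box
    def __init__(self, target_box, target_class):
        super().__init__()
        self.target_box = torch.tensor(target_box, dtype=torch.float32).unsqueeze(0)
        self.target_class = target_class

    def forward(self, raw_pred):
        if isinstance(raw_pred, (list, tuple)):  # some YOLOv5 versions return a tuple
            raw_pred = raw_pred[0]
        scores = []
        for det in non_max_suppression(raw_pred):  # one tensor of detections per image
            best = torch.zeros(1)
            for *box, conf, cls in det:
                if int(cls) == self.target_class:
                    iou = box_iou(torch.stack(box).unsqueeze(0), self.target_box)
                    best = torch.max(best, conf * iou.squeeze())
            scores.append(best)
        return torch.stack(scores).squeeze(1)
```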
Note also that we just want to find out how the model comes to this particular result. It does not matter here if the target is a correct detection as judged by a human or compared to some gold standard.
We can chain these layers in sequence with PyTorch, as shown here:
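A sketch using the classes from above (the notebook's exact wiring may differ; target_box is a hypothetical variable holding the box of the detection we want to explain):

```python
scoring_model = nn.Sequential(
    Numpy2TorchCaster(),                   # numpy image batch -> torch tensor
    model,                                 # pretrained YOLOv5s core model
    OD2Score(target_box, target_class=0),  # NMS + target scoring; class 0 = person
)
```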
The scoring_model takes an image as input and returns as its score how confident it is in detecting a person in the right area and how well the box is positioned. However, we will make one more adjustment to the input.
Now we could start to change the input image and see how the output changes. Unfortunately, the image has 160x160 = 25,600 pixel values – 76,800 if you count the three colour channels separately. Calculating the influence of each individual pixel not only requires excessive computing resources, it is also likely that each single pixel contributes only very little towards the detection. So instead, we aggregate the pixels into superpixels, i.e. connected patches of pixels. The segmentation of an image into superpixels can be done in different ways; we'll keep it simple and use a rectangular grid with fixed width and height.
To implement this segmentation approach, we create a new layer, the SuperPixler, which accepts a list indicating the superpixels that should be available/not available for an image and builds the matching image as input for the next layer. The patch of an inactive superpixel is replaced with the mean colour of the image.
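A sketch of such a layer, assuming a square image whose side length is a multiple of the superpixel size, chained into the full super_pixel_model:

```python
class SuperPixler(nn.Module):
    # sketch: a binary vector over grid cells selects the active superpixels;
    # inactive cells are filled with the mean colour of the image
    def __init__(self, image, super_pixel_size=8):
        super().__init__()
        self.image = image.astype(np.float32)  # HWC, square
        self.size = super_pixel_size
        self.per_side = image.shape[0] // super_pixel_size
        self.mean_color = self.image.mean(axis=(0, 1))

    def forward(self, masks):
        batch = []
        for mask in masks:                     # one binary vector per sample
            img = self.image.copy()
            for idx, active in enumerate(mask):
                if not active:
                    r, c = divmod(idx, self.per_side)
                    s = self.size
                    img[r * s:(r + 1) * s, c * s:(c + 1) * s] = self.mean_color
            batch.append(img)
        return np.stack(batch)

super_pixel_model = nn.Sequential(
    SuperPixler(image_160),                # superpixel mask -> image
    Numpy2TorchCaster(),
    model,
    OD2Score(target_box, target_class=0),
)
```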
The super_pixel_model takes as input a vector indicating which superpixels are active and which are greyed out, internally converts this to the image containing our target, and again returns the score for how well we detected the target. For a superpixel size of 8x8 and an image of 160x160 pixels, the input space is now just a vector of 400 binary values mapping to the 400 superpixels covering the whole image.
The super_pixel_model can finally be interpreted by the KernelExplainer and has a manageable input space. Now we feed the model to the explainer and let it determine the contribution of each superpixel to the output of the detector.
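Hooking everything up could look like this; the all-grey background and the sampling budget are our choices, not prescribed values:

```python
import shap

def score_fn(masks):
    # KernelExplainer works with plain numpy in and out; for large sampling
    # budgets you may want to process the masks in batches
    with torch.no_grad():
        return super_pixel_model(masks).numpy()

background = np.zeros((1, 400))  # baseline: every superpixel greyed out
explainer = shap.KernelExplainer(score_fn, background)
shap_values = explainer.shap_values(np.ones(400), nsamples=2000)  # budget is a trade-off
```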
We map the superpixels back to the constituent pixels (easy, since they conform to a grid) and scale the values from 0 to 1. In the colormap we use, red means a positive contribution of the pixel, blue a negative contribution.
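A sketch of the overlay, assuming the 8x8 grid from above:

```python
import matplotlib.pyplot as plt

grid = np.array(shap_values).reshape(20, 20)  # one SHAP value per superpixel
heatmap = np.kron(grid, np.ones((8, 8)))      # blow the 20x20 grid back up to 160x160
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())

plt.imshow(image_160[:, :, ::-1])             # back to RGB for matplotlib
plt.imshow(heatmap, cmap='bwr', alpha=0.5)    # blue = negative, red = positive
plt.axis('off')
plt.show()
```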
We see that the patches with high contribution are indeed located within the bounding box of our target. Interestingly, the highest contribution seems to come from head and shoulders, while the inner area and the lower body play a lesser role. It seems that, for the model, a person is most strongly associated with a face and upper-body contours. This makes sense and is probably also correlated with the way people are portrayed in the original COCO training data: we tend to focus on the head and upper body when taking pictures.
We see few patches contributing against the detection. This also makes sense since nothing in the image obscures the target and makes it less likely to see a person. The one blue patch here might correspond to an extended arm and the absence of it could weaken the overall impression of a person for the model.
There are all kinds of refinements we could implement here.
For example, the superpixels could be chosen in a more sophisticated way. A better approach could be a clustering of image elements, where the replacement value is the average colour of the neighbouring superpixels.
We can also define more than one replacement value for each superpixel, giving a better characterization what the “absence” of a pixel would look like.
Different scales for the image and superpixel also play a role and you can experiment with these values within the limits of your available processing power.
We only looked at a case where the model gets the detection right, both with regard to the position and the type of the object. Cases where the model makes one or both errors can potentially tell us more about its “assumptions” and “preconceptions”. The example image here shows a misclassification as giraffe. The correct prediction is, in fact, out of scope for the model, but the patches contributing to the detection correspond to the elongated head and neck of the kangaroo.
Another limitation here is the explanation of false negatives: we could modify the scoring procedure to look for a target that is only very weakly predicted. But to determine what input the model would need to see for a hit, we'd have to add information to the image. Say a person is not detected because the head is obscured: to check if the occlusion is the problem, we'd have to place a head into the image at the expected position. Not only is this more difficult to implement, the space of possible imputations is also much larger for a given missed detection.
Since the task of object detection is currently not natively supported by XAI tools, we had to adapt the input and output of a detection model to fit a generic explainer. After successfully connecting all the components, we were able to inspect a single target and determine which parts of the image contribute to the detection. The model seems to correctly focus on areas of the image which relate to the target, although we see a bias in the 'person' class towards the upper body and head, meaning that detecting people from some angles, or partially visible persons, might be more difficult for the model. On a case-by-case basis, we have gained some valuable insight into the model's decision process and perhaps also some more trust in its detections.