By being able to visualize a feature map, we can identify what changes in the
input image that map’s activation is invariant to. This enables us to understand
what sort of features are learnt at each layer.
What we’re looking at here corresponds to two of many feature maps in the first
layer. The left image shows the 9 image patches which produced the highest
activations for these feature maps. The right images show projections of each
map’s strongest activation into pixel space. It can be seen that these maps fire
strongly when they detect an edge or a patch of uniform colour.
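As a rough illustration of how the strongest activations for one feature map might be collected, here is a minimal sketch. The function name `top_k_activations` and the array layout (one activation grid per image) are assumptions made for this example, and the toy data stands in for real network outputs.

```python
import numpy as np

def top_k_activations(acts, k=9):
    """Find the k images that excite one feature map the most.

    acts: array of shape (n_images, H, W), the map's activations per image
          (assumed layout, for illustration only).
    Returns a list of (image_index, row, col, value), strongest first.
    """
    flat = acts.reshape(len(acts), -1)
    best_pos = flat.argmax(axis=1)          # location of each image's peak
    best_val = flat.max(axis=1)             # value of each image's peak
    order = np.argsort(best_val)[::-1][:k]  # images sorted by peak, descending
    rows, cols = np.unravel_index(best_pos[order], acts.shape[1:])
    return [(int(n), int(r), int(c), float(v))
            for n, r, c, v in zip(order, rows, cols, best_val[order])]

# Toy data: 20 random "activation grids" of size 6x6.
rng = np.random.default_rng(0)
toy_acts = rng.random((20, 6, 6))
for image_idx, row, col, value in top_k_activations(toy_acts):
    print(image_idx, row, col, round(value, 3))
```

Each returned location could then be mapped back to the image patch that produced it, which is what the patches on the slide show.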
Now moving on to layer 2. We notice that the feature maps detect a more
complex set of patterns, building on features discovered in the first layer.
Combinations of these features allow detection of parallel lines, circles and
colour blobs. Their projections cover a larger area of the image because of the
pooling that occurs after layer one.
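To make the point about pooling concrete, here is a minimal sketch of how the region of the input seen by a single unit grows from layer to layer. The function name `receptive_field` and the kernel and stride values below are illustrative assumptions, not figures taken from this talk.

```python
def receptive_field(layers):
    """Side length, in input pixels, of the region seen by one output unit.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    size, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
    return size

# Assumed example stack: a convolution, a pooling step, then a second convolution.
conv1 = (7, 2)
pool1 = (3, 2)
conv2 = (5, 2)

print("layer 1 unit sees:", receptive_field([conv1]), "pixels across")
print("layer 2 unit sees:", receptive_field([conv1, pool1, conv2]), "pixels across")
```

Every strided step multiplies the spacing between neighbouring units, so each later kernel spans proportionally more of the original image.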
At layer 3 we are already beginning to see the network learn object components
which are important for building representations of the whole object. And at layer
5 the features are becoming specific enough for object classification.
With this new visualisation technique we can actually see how the features
evolve during training. In these images each row represents a feature map, and
the columns represent projections of these maps at some stage during training.
Evidently, the lower layers converge more quickly. From these examples we
see the second layer converges by the 10th pass over the training set, while the
fifth layer converges after the 40th epoch. So it takes longer to learn more
complex features and I’m sure we can all relate to this.
They also confirmed that feature maps are actually identifying the object rather
than using the surrounding context. For example, when a fish was identified, a
question that arose was whether correct classification was merely based on the
water surrounding it. To test this, they covered different areas of the image, and
looked at how activations in higher level feature maps changed and how the
predicted class changed. From the images here you can see that the activation of
the strongest feature map decreases when the faces are covered, yet regardless of
whether these faces are covered the network still labels the image as a dog.
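The covering experiment can be sketched in a few lines. This is only an illustration under stated assumptions: `occlusion_map` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for whatever a real network would return (a class probability or a feature map activation), not the model from the talk.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=8, stride=8, fill=0.5):
    """Slide a grey square over the image and record the score at each position.

    image:    2D float array (a greyscale image, assumed for simplicity).
    score_fn: callable taking an image and returning a single number.
    Returns a grid of scores; low values mark regions the score depends on.
    """
    h, w = image.shape[:2]
    rows = range(0, h - patch + 1, stride)
    cols = range(0, w - patch + 1, stride)
    heat = np.zeros((len(rows), len(cols)))
    for i, y in enumerate(rows):
        for j, x in enumerate(cols):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill  # cover this region
            heat[i, j] = score_fn(occluded)
    return heat

# Toy setup: a bright square plays the role of the object.
image = np.zeros((32, 32))
image[8:16, 8:16] = 1.0

def toy_score(img):
    # Stand-in for a network output: mean brightness where the object sits.
    return img[8:16, 8:16].mean()

heat = occlusion_map(image, toy_score)
print(np.round(heat, 2))
```

If the score only dropped when the background was covered, that would suggest the model was relying on context rather than on the object itself.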
Other object detection techniques are designed to extract features which
indicate spatial relationships between object parts. For example, if such a
technique is trying to detect a bicycle, some part of its feature extraction
algorithm would try to identify whether there are two wheels. For convolutional
networks, the ability to detect this sort of correspondence is implicitly included
through the interactions between the convolutional, rectification and pooling
layers. So when you cover the right eye in images of faces, the feature vectors
extracted from each of these images change in a consistent manner.
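One way to put a number on "changed in a consistent manner" is to compare the direction of change across images. The sketch below assumes a particular measure, the Hamming distance between the signs of the change vectors, and the name `consistency` and the random feature vectors are illustrative only.

```python
import numpy as np
from itertools import combinations

def consistency(original, occluded):
    """Mean pairwise Hamming distance between sign patterns of feature changes.

    original, occluded: arrays of shape (n_images, n_features) holding the
    feature vectors before and after covering the same part in each image.
    Lower values mean the images changed in a more similar way.
    """
    signs = np.sign(original - occluded)
    pairs = combinations(range(len(signs)), 2)
    return float(np.mean([np.mean(signs[i] != signs[j]) for i, j in pairs]))

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 64))      # toy feature vectors for 5 face images

shared_shift = rng.normal(size=64)       # same change applied to every image
random_shift = rng.normal(size=(5, 64))  # unrelated change per image

print("same part covered:  ", consistency(features, features - shared_shift))
print("random parts covered:", consistency(features, features - random_shift))
```

A much lower score when the same part is covered than when random parts are covered would indicate that the features encode which part went missing.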