Skip to content

CS 541: Deep Learning

The problem of FCNNs when applied to images is that the same pixels, while representing the same object could shift, warp, deform a bit through the lens of a camera. From the neural network’s perspective, totally different pixels are fed into the neurons -> A tiny shift in the input image can result in arbitarily different activations in the hidden layers.

Flaw of FCNN (before)

Flaw of FCNN (after)

Also the activation of each neuron in a layer is independent of the activations of every other neuron in the same layer. The neurons don’t communicate like in attention. In other words, the hidden layers have no spatial structure and the ordering of the neurons in a layer is meaningless. This preservation of spatial structure is crucial for image processing.

Without spatial structures, it is classifying random pixels

Images contain the same visual features (lines and corners of different colors) at many locations and scales -> no need that parameters to describe them.

Intuition: CNNs is an extreme form of weight sharing, forcing the neural network to have fewer degrees of freedom. This is a very strong form of regularization.

Inspired by the old CV strategy called template matching: we create one “template” matching the object we are looking for and slide the template across every location in the image and see where the image and template best “match”.

Template matching

To classify the image without caring its location in the image, we can just compute the maximum of the output. This is called maximum pooling (max-pool).

Max pool

These templates are also called filters, or kernels. The output image is more commonly known as a filter response, or feature map.

To preserve the input image, we can pad the input image. We can also apply the template with a stride of k pixels (take k steps at a time when convoluting). To expand the spatial extent of the filter without increasing the number of parameters, we can dilate it by adding spaces between filter elements -> dilated convolution (à trous).

Dilated convolution

3D convolution is the same, you sum up the output of each channel of the filter to get the filter’s output.

3D convolution

Hence 1D convolution is just a weighted sum of the input feature maps.

1D convolution

Instead of pooling over the whole image, we can pool over small sub-regions of size w. We can also use a stride of size k. (We could also use padding.)

Max pooling

CNNs are networks that consist of 1+ convolutional layers and usually some non-convolutional layers at the end. The filter parameters instead of chosen by hand in template matching, are trained with back-propagation.

Composition of CNNs:

  • Multiple convolutional layers
  • Pooling layers in between some of the convolutional layers
  • FCNN layers at the end for classification
  • Non-linearies between each layer
  • Number of channels in earlier layers is small, whereas the spatial resolution is high. The filters detect low-level features, e.g., edges & corners of different orientations & sizes.
  • Number of channels in later layers is large, whereas the spatial resolution is low. The filters detect high-level features such as object parts.

CNN composition

Receptive field of a neuron is the region in the input image that can affect the neuron.

Receptive field

Intuition: The deeper the network, the larger the receptive field of the neurons in the later layers. The later layers’ neurons essentially pay attention to the entire input image at some point.

CNNs are FCNNs

Convolution is a special case of the general feed-forward neural network.

Notice how the weights only has 3 free parameters, either equal to the filter’s elements or 0. This is the strong regularization effect refered to above.

Now we prove FCNNs with certain constraints, are in fact CNNs.

FCNNs are CNNs 1 FCNNs are CNNs 2 FCNNs are CNNs 3 FCNNs are CNNs 4 FCNNs are CNNs 5 FCNNs are CNNs 6 FCNNs are CNNs 7

Essentially, we are restraining each neuron’s attention to a patch, local to its position.

  1. The position of neuron determines which patch it attends to.
  2. All neurons share the same filter parameters, forcing them to “agree on” the same features to detect.

Starting from the classics, we represent each graph with an adjacency matrix (list of list of neighbors) and a degree matrix, a diagonal matrix that represents how many neighbors a node has (sum of the rows of adjacency matrix basically). We also annotate each node in the graph with a feature vector x_i.

For example:

Representing graphs

Then D[0] is just 2 since we also assume every node has a self-edge.

As a consequence, multiplying A and X gives us the sum of the features of a node neighbors, for every node in the graph, like so:

Graph matmul is neighbor sum

When dividing this by number of neighbors each node has (i.e., multiplying by the inverse of the degree matrix), we get the average of the features of a node’s neighbors (including itself):

Graph matmul is average of neighbor features

  • Social networks (i.e., homophily, “birds of a feather flock together”, “people are the average of their friends”)
  • Drug discovery: molecules are graphs of atoms, and we want to predict the binding affinity of a molecule to a particular receptor.

In a neural network that processes tensors, the input is always of fixed size and topology (a vector of features of the same size, images of fixed size, etc.).

In contrast, a neural network that processes graphs should be able to process graphs of different sizes and topologies.

Graphs of different sizes

Not to say that they are all useless, we can borrow 2 things from CNNs to build graph convolution networks:

  1. An operation that summarizes the input no matter the input size: Pooling
  2. An operation that is applied the same at each position regardless of the input topology: Convolution

In a GraphConv layer of a GCN, the feature vector at each node is transformed based on the feature vectors of neighboring nodes.

The size & topology of the output graph is the same as the input graph.

In contrast to image convolution:

  • The convolution filter in a GCN has a variable size depending on the input location.
  • There is no spatial ordering to the elements in the filter.

A step of GraphConv

Application of a GraphConv layer

Application of multiple GraphConv layers

GCNs often consist of multiple GraphConv layers followed by a pooling layer (to perform final classification), e.g.:

Whole GCN

By adding a weight matrix W to the convolution step above before multiplying by the adjacency matrix, we map from m to k dimensions, where m is the size of the input feature vector and k is the size of the output feature vector. This is just a FC layer.

GraphConv FC 1

GraphConv FC 2

So a GraphConv layer maps each node’s feature vector to a new space (either up or down in terms of dimension) then assign each node the average of itself and its neighbors’ new features. Lastly, a nonlineary is applied.

A GraphConv layer computes an average of the local features around each node.

Intuition: notice how a GraphConv shares the same parameters across all nodes, just like how a CNN shares the same parameters across all pixels. This is a strong form of regularization like CNNs. Since the filter size is variable and unstructured, so the model has to learn to detect features that are useful across all possible filter sizes and topologies.

On images, this function still performs 3-d convolution. However, it’s “restricted” in the sense that all filter elements are constant within each channel.

CNNs are GraphConvs

After reparameterizing:

CNNs are GraphConvs 1

CNNs are GraphConvs 2