Assumptions about the loss function

The goal of backpropagation is to compute the partial derivatives ∂C/∂w and ∂C/∂b of the cost function C with respect to any weight w or bias b in the network. For backpropagation to work, two main assumptions about the form of the cost function are required. Before stating those assumptions, it is useful to establish an example cost function as a reference.

Reference Model: Mean Square Error (MSE) Loss

The quadratic loss (or cost) function is defined as:

C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2
Where:

  • n is the total number of training examples;
  • the sum is over individual training examples x;
  • y(x) is the corresponding desired output;
  • L denotes the number of layers in the network;
  • a^L(x) is the vector of activations output from the network when x is input.
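
To make the formula concrete, here is a minimal NumPy sketch of the quadratic cost (the function name quadratic_cost and the toy arrays are illustrative choices, not something defined above):

    import numpy as np

    def quadratic_cost(outputs, targets):
        # C = 1/(2n) * sum over examples x of ||y(x) - a^L(x)||^2
        # outputs: shape (n, d), one row per network output a^L(x)
        # targets: shape (n, d), one row per desired output y(x)
        n = outputs.shape[0]
        return np.sum((targets - outputs) ** 2) / (2 * n)

    # Toy data: n = 2 training examples with 3-dimensional outputs.
    a_L = np.array([[0.2, 0.7, 0.1],
                    [0.9, 0.05, 0.05]])
    y = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0]])
    print(quadratic_cost(a_L, y))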

1st Assumption

The first assumption to make about the cost function C, in order that backpropagation can be applied, is that it can be written as an average

C = \frac{1}{n} \sum_x C_x

over cost functions C_x for individual training examples, x. This is the case for the quadratic cost function, where the cost for a single training example is C_x = \frac{1}{2} \| y(x) - a^L(x) \|^2. This assumption also holds true for all other cost functions encountered in this context.

Why this is needed

This assumption is necessary because backpropagation specifically enables the computation of the partial derivatives ∂C_x/∂w and ∂C_x/∂b for a single training example.

The global derivatives ∂C/∂w and ∂C/∂b are then recovered by averaging over all training examples. For notational simplicity, the training example x is assumed to be fixed and the x subscript is dropped, writing the cost C_x as C. The x will eventually be reinstated, but for now it is left implicit to keep the derivation uncluttered.
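
As a rough sketch of that averaging step, the snippet below uses a deliberately tiny one-parameter "network" a = w * x with the quadratic cost; the model and the names per_example_grad and full_grad are hypothetical, chosen only to show the full gradient being recovered as the mean of the per-example gradients:

    import numpy as np

    def per_example_grad(w, x, y):
        # dC_x/dw for the toy model a = w * x with C_x = 0.5 * (y - a)^2;
        # by the chain rule this is (a - y) * x.
        a = w * x
        return (a - y) * x

    def full_grad(w, xs, ys):
        # dC/dw is the average of the per-example gradients dC_x/dw.
        return np.mean([per_example_grad(w, x, y) for x, y in zip(xs, ys)])

    xs = np.array([1.0, 2.0, 3.0])
    ys = np.array([2.0, 4.1, 5.9])
    print(full_grad(0.5, xs, ys))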


2nd Assumption

The second assumption made about the cost is that it can be written as a function of the outputs a^L from the neural network:

C = C(a^L)

For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example may be written as

C = \frac{1}{2} \| y - a^L \|^2 = \frac{1}{2} \sum_j (y_j - a^L_j)^2

and thus C is clearly a function of the output activations.

Why is the cost not regarded as a function of y?

It may be questioned why the cost is not also regarded as a function of the desired output y. However, since the input training example x is fixed, the output y is also a fixed parameter.

Specifically, y cannot be modified by changing the weights and biases; it is not something which the neural network learns. Therefore, it is logical to regard C as a function of the output activations a^L alone, with y acting as a parameter that helps define that function.
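
To illustrate this point of view, here is a small sketch (again with assumed, illustrative names) in which the quadratic cost for one fixed example is written purely as a function of the output activations a^L, with y baked in as a parameter; its gradient with respect to a^L works out to a^L - y, the kind of output-layer quantity that backpropagation builds on:

    import numpy as np

    # Desired output y for the fixed training example: a parameter of the cost,
    # not something the network can change by adjusting weights or biases.
    y = np.array([0.0, 1.0, 0.0])

    def cost(a_L):
        # C(a^L) = 0.5 * ||y - a^L||^2, viewed as a function of a^L alone.
        return 0.5 * np.sum((y - a_L) ** 2)

    def cost_grad(a_L):
        # dC/da^L_j = a^L_j - y_j, the gradient with respect to the outputs.
        return a_L - y

    a_L = np.array([0.2, 0.7, 0.1])
    print(cost(a_L))       # scalar cost for this output
    print(cost_grad(a_L))  # [ 0.2, -0.3,  0.1]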