Assumptions about the loss function

The goal of backpropagation is to compute the partial derivatives $\partial C / \partial w$ and $\partial C / \partial b$ of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network. For backpropagation to work, two main assumptions about the form of the cost function are required. Before stating those assumptions, it is useful to establish an example cost function as a reference.

Reference Model: Mean Square Error (MSE) Loss

The quadratic loss (or cost) function is defined as:

$$C = \frac{1}{2n} \sum_x \bigl\| y(x) - a^L(x) \bigr\|^2$$

Where:

  • $n$ is the total number of training examples;
  • the sum is over individual training examples $x$;
  • $y = y(x)$ is the corresponding desired output;
  • $L$ denotes the number of layers in the network;
  • $a^L = a^L(x)$ is the vector of activations output from the network when $x$ is input.
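As a minimal sketch of how this cost might be computed in practice (the function name and the example arrays are illustrative, assuming NumPy):

```python
import numpy as np

def quadratic_cost(outputs, targets):
    """Quadratic cost C = (1/2n) * sum_x ||y(x) - a^L(x)||^2.

    outputs: array of shape (n, d) -- network outputs a^L(x) for n examples
    targets: array of shape (n, d) -- desired outputs y(x)
    """
    n = outputs.shape[0]
    return np.sum(np.linalg.norm(targets - outputs, axis=1) ** 2) / (2 * n)

# Tiny illustration: two training examples with 2-dimensional outputs.
a_L = np.array([[0.8, 0.2],
                [0.1, 0.9]])
y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
cost = quadratic_cost(a_L, y)
```

Note the factor $1/2n$: each example contributes half its squared error, averaged over all $n$ examples.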

1st Assumption

The first assumption to make about the cost function $C$, in order that backpropagation can be applied, is that it can be written as an average

$$C = \frac{1}{n} \sum_x C_x$$

over cost functions $C_x$ for individual training examples $x$. This is the case for the quadratic cost function, where the cost for a single training example is $C_x = \frac{1}{2} \bigl\| y - a^L \bigr\|^2$. This assumption also holds for all the other cost functions encountered in this context.
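This decomposition can be checked numerically with a small sketch (illustrative arrays, assuming NumPy): computing the per-example costs $C_x$ and averaging them gives the same number as evaluating the quadratic cost directly.

```python
import numpy as np

# Two training examples with 2-dimensional outputs (illustrative values).
a_L = np.array([[0.8, 0.2],
                [0.1, 0.9]])
y = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# Per-example cost C_x = (1/2)||y - a^L||^2 ...
per_example = 0.5 * np.sum((y - a_L) ** 2, axis=1)

# ... and the total cost as their average, C = (1/n) sum_x C_x.
C_avg = per_example.mean()

# Evaluating the quadratic cost formula directly gives the same number.
C_direct = np.sum((y - a_L) ** 2) / (2 * a_L.shape[0])
```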

Why this is needed

This assumption is necessary because backpropagation specifically enables the computation of the partial derivatives $\partial C_x / \partial w$ and $\partial C_x / \partial b$ for a single training example $x$.

The global derivatives $\partial C / \partial w$ and $\partial C / \partial b$ are then recovered by averaging over all training examples. For notational simplicity, the training example $x$ is assumed to be fixed, and the subscript is dropped, writing the cost $C_x$ as $C$. This notation is left implicit for the duration of the derivation.
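The averaging step can be illustrated with a deliberately tiny "network" (a single weight $w$ and identity activation, so $a^L(x) = wx$; this toy model is an assumption for illustration, not from the text). Each example contributes a gradient $\partial C_x / \partial w = -(y - wx)\,x$, and averaging these recovers $\partial C / \partial w$, as a numerical derivative confirms:

```python
import numpy as np

# Toy one-parameter model: a^L(x) = w * x, quadratic cost per example.
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 3.9, 6.1])
w = 1.5

# Per-example gradient dC_x/dw = -(y - w*x) * x.
grads = -(ys - w * xs) * xs

# Global gradient dC/dw = (1/n) sum_x dC_x/dw.
grad_total = grads.mean()

# Check against a central-difference derivative of the averaged cost.
def C(w):
    return np.mean(0.5 * (ys - w * xs) ** 2)

eps = 1e-6
numeric = (C(w + eps) - C(w - eps)) / (2 * eps)
```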


2nd Assumption

The second assumption made about the cost is that it can be written as a function of the outputs $a^L$ from the neural network:

$$C = C(a^L)$$

For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example may be written as:

$$C = \frac{1}{2} \bigl\| y - a^L \bigr\|^2 = \frac{1}{2} \sum_j \bigl( y_j - a_j^L \bigr)^2$$

and thus is clearly a function of the output activations.

Why is the cost not regarded as a function of $y$?

It may be questioned why the cost is not also regarded as a function of the desired output $y$. However, since the input training example $x$ is fixed, the desired output $y$ is also fixed.

Specifically, $y$ cannot be modified by changing the weights and biases; it is not something which the neural network learns. Therefore, it is natural to regard $C$ as a function of the output activations $a^L$ alone, with $y$ acting as a parameter that helps define that function.
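This distinction maps naturally onto a closure (a minimal sketch; the helper name `make_cost` is illustrative): $y$ is baked in once, and the resulting function depends only on the activations $a^L$.

```python
import numpy as np

def make_cost(y):
    """Return C(a^L) = (1/2)||y - a^L||^2 with y fixed as a parameter."""
    def C(a_L):
        return 0.5 * np.sum((y - a_L) ** 2)
    return C

# y is captured once and cannot be changed by training; only a^L varies.
C = make_cost(np.array([1.0, 0.0]))
```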