Assumptions about the loss function

The goal of backpropagation is to compute the partial derivatives ∂C/∂w and ∂C/∂b of the cost function C with respect to any weight w or bias b in the network. For backpropagation to work, two main assumptions about the form of the cost function are required. Before stating those assumptions, it is useful to establish an example cost function as a reference.

Reference Model: Mean Square Error (MSE) Loss

The quadratic loss (or cost) function is defined as:

C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2
Where:

  • n is the total number of training examples;
  • the sum is over individual training examples x;
  • y(x) is the corresponding desired output;
  • L denotes the number of layers in the network;
  • a^L(x) is the vector of activations output from the network when x is input.
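
To make the formula concrete, here is a minimal NumPy sketch of the quadratic cost (the function name quadratic_cost and the toy arrays are illustrative choices, not something defined above):

    import numpy as np

    def quadratic_cost(outputs, targets):
        # C = 1/(2n) * sum over examples x of ||y(x) - a^L(x)||^2
        # outputs: shape (n, d), one row per network output a^L(x)
        # targets: shape (n, d), one row per desired output y(x)
        n = outputs.shape[0]
        return np.sum((targets - outputs) ** 2) / (2 * n)

    # Toy data: n = 2 training examples with 3-dimensional outputs.
    a_L = np.array([[0.2, 0.7, 0.1],
                    [0.9, 0.05, 0.05]])
    y = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0]])
    print(quadratic_cost(a_L, y))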

1st Assumption

The first assumption to make about the cost function C, in order that backpropagation can be applied, is that it can be written as an average

C = \frac{1}{n} \sum_x C_x

over cost functions C_x for individual training examples, x. This is the case for the quadratic cost function, where the cost for a single training example is C_x = \frac{1}{2} \| y(x) - a^L(x) \|^2. This assumption also holds true for all other cost functions encountered in this context.

Why this is needed

This assumption is necessary because backpropagation specifically enables the computation of the partial derivatives ∂C_x/∂w and ∂C_x/∂b for a single training example.

The global derivatives ∂C/∂w and ∂C/∂b are then recovered by averaging over all training examples. For notational simplicity, the training example x is assumed to be fixed and the x subscript is dropped, writing the cost C_x as C. The x will eventually be reinstated, but for now it is left implicit to keep the derivation uncluttered.
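
As a rough sketch of that averaging step, the snippet below uses a deliberately tiny one-parameter "network" a = w * x with the quadratic cost; the model and the names per_example_grad and full_grad are hypothetical, chosen only to show the full gradient being recovered as the mean of the per-example gradients:

    import numpy as np

    def per_example_grad(w, x, y):
        # dC_x/dw for the toy model a = w * x with C_x = 0.5 * (y - a)^2;
        # by the chain rule this is (a - y) * x.
        a = w * x
        return (a - y) * x

    def full_grad(w, xs, ys):
        # dC/dw is the average of the per-example gradients dC_x/dw.
        return np.mean([per_example_grad(w, x, y) for x, y in zip(xs, ys)])

    xs = np.array([1.0, 2.0, 3.0])
    ys = np.array([2.0, 4.1, 5.9])
    print(full_grad(0.5, xs, ys))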


2nd Assumption

The second assumption made about the cost is that it can be written as a function of the outputs a^L from the neural network:

C = C(a^L)

For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example may be written as

C = \frac{1}{2} \| y - a^L \|^2 = \frac{1}{2} \sum_j (y_j - a^L_j)^2

and thus C is clearly a function of the output activations.

Why is the cost not regarded as a function of y?

It may be questioned why the cost is not also regarded as a function of the desired output y. However, since the input training example x is fixed, the output y is also a fixed parameter.

Specifically, y cannot be modified by changing the weights and biases; it is not something which the neural network learns. Therefore, it is logical to regard C as a function of the output activations a^L alone, with y acting as a parameter that helps define that function.
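
To illustrate this point of view, here is a small sketch (again with assumed, illustrative names) in which the quadratic cost for one fixed example is written purely as a function of the output activations a^L, with y baked in as a parameter; its gradient with respect to a^L works out to a^L - y, the kind of output-layer quantity that backpropagation builds on:

    import numpy as np

    # Desired output y for the fixed training example: a parameter of the cost,
    # not something the network can change by adjusting weights or biases.
    y = np.array([0.0, 1.0, 0.0])

    def cost(a_L):
        # C(a^L) = 0.5 * ||y - a^L||^2, viewed as a function of a^L alone.
        return 0.5 * np.sum((y - a_L) ** 2)

    def cost_grad(a_L):
        # dC/da^L_j = a^L_j - y_j, the gradient with respect to the outputs.
        return a_L - y

    a_L = np.array([0.2, 0.7, 0.1])
    print(cost(a_L))       # scalar cost for this output
    print(cost_grad(a_L))  # [ 0.2, -0.3,  0.1]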