XOR: A Simple Neural Network in Detail
We start our exploration of the XOR neural network with an introduction to truth tables. Truth tables are just a way of combining true and false values to yield other true and false values. For example, the OR truth table is given below.
| A | B | A OR B |
|---|---|---|
| false | false | false |
| false | true | true |
| true | false | true |
| true | true | true |
To get an intuition for this truth table, if we denote true by "something", and false by "nothing", then nothing OR nothing is nothing; nothing OR something is something; something OR nothing is something; and something OR something is something. The difference between the OR truth table and the XOR truth table is that XOR is "eXclusively OR". That means the result is true exactly when one of A and B is true and the other is false. Hence both false XOR false and true XOR true are false. We can also denote the false value by 0 and the true value by 1, and as we will soon see, this is extremely important in the context of neural networks. The XOR truth table is given below.
| A | B | A XOR B |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
The relevance of the numerical values of A and B as 0s and 1s is that we can use A and B as the input values to a neural network and A XOR B as the output value. The XOR truth table above, then, is the pattern that our XOR neural network has to learn, with the 4 rows of the truth table serving as the 4 examples. In all the myriad applications of neural networks, whether image recognition or speech-to-text or anything else, the input and output values get encoded as numbers. In the following sections, we will see how the XOR neural network works in detail.
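To make the encoding concrete, here is a minimal sketch in Python of the XOR truth table as four numerical training examples. The variable names are just our own illustrative choices.

```python
# The four rows of the XOR truth table, encoded as numbers.
# Each example is a pair of inputs (A, B) and the desired output A XOR B.
xor_examples = [
    ([0, 0], 0),
    ([0, 1], 1),
    ([1, 0], 1),
    ([1, 1], 0),
]

for (a, b), desired in xor_examples:
    print(f"{a} XOR {b} = {desired}")
```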
Our goal in machine learning is to set up a process that brings the actual outputs as close as possible to the desired outputs, for any given set of inputs. That is, we want to reduce the absolute distance between the actual and desired outputs as much as possible, for all of the inputs. To enable this, we introduce the concept of an “Error Function”. The error function is simply the difference between the actual and desired outputs, squared, and summed across all input-output pairs or examples. (If you are still unfamiliar with the concept of functions, read the section on Functions.) Since our simple XOR neural network has only one output, we don’t need to sum across all the output neurons; otherwise we would also sum across them. We express the error function mathematically as follows:
E(w) = ½ × Σ_ex [D_ex − A_ex(w)]²
In the expression above, Σ stands for summation over the examples. D_ex is the desired output and A_ex(w) is the actual output for each example. The subscript ex after the D and A simply means that the desired output D and the actual output A are the values for each example, and we are summing across the examples. We have put a w in the brackets because as you change the weight values, the actual output value changes (remember, the values from each previous layer are multiplied by the weights and transformed by the neurons at each next layer). The transformation by the neurons is fixed, but the weights are variable values. That is what makes the actual output a function of the weights. Since the error E is a function of (is dependent on) the actual output A_ex for each example, and the actual output A_ex is a function of (is dependent on) the weights w, it follows that the error E is a function of (is dependent on) the weights. The desired output D_ex is a fixed value for each example. We square the difference between D_ex and A_ex(w) to define the error because, after all, our goal is to minimize the absolute difference between D_ex and A_ex(w). If D_ex − A_ex(w) is a large negative or a large positive number, that is a bad thing, whereas if D_ex − A_ex(w) is a small negative or a small positive number, that is a good thing – and squaring D_ex − A_ex(w) to define the error ensures this. We multiply the whole summation by half only to simplify the detailed nuts and bolts of the calculation – it has no other purpose.
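Here is the error function written out as a small Python sketch, assuming (as in our XOR network) a single output per example. The function name and the sample numbers are only illustrative.

```python
def error(desired_outputs, actual_outputs):
    """Sum of squared differences between desired and actual outputs,
    across all examples, multiplied by one half."""
    total = 0.0
    for d, a in zip(desired_outputs, actual_outputs):
        total += (d - a) ** 2
    return 0.5 * total

# For XOR, the desired outputs for the four examples are 0, 1, 1, 0.
# Suppose the current weights produce an actual output of 0.5 everywhere:
print(error([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5]))  # 0.5
```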
A couple of disclaimers. First, error functions other than the one defined above are possible and are used in neural networks, but the one above is by far the simplest and most intuitive, so that is what we will use, at least for our simple XOR neural network. Second, we can have an additional value flowing into each neuron of a layer, called the bias, and adjust that bias as part of the training. But the bias is not essential to the construction of a neural network and we will ignore it for now.
In the following Interaction, you can play with the weights to see if you can manually adjust them so as to minimize the error and bring the actual XOR output as close as possible to the desired output for each of the input examples. To change a weight, just hover over it with your mouse, and a bubble will pop up that allows you to change that weight's value (when the bubble pops up, edit the weight directly in the bubble without moving the mouse, and then press the Enter key). Remember, the goal of training a neural network is to adjust the weights gradually so as to bring the actual output as close as needed to the desired output, that is, to minimize the error function, and this exercise will give you some appreciation of how difficult that is to do in an ad hoc or random way.
You have seen in the above Interaction how difficult it can be to minimize the error in an ad hoc or random way. It turns out that there is a systematic way to minimize the error function. This systematic way of minimizing the error function is what lies at the heart of machine learning, and it is what is called "training" a neural network. You can play with training a neural network in the following Interaction. The default error value of a randomized neural network should typically be more than 0.5. After you have trained the neural network (by pressing the "Run Training" button), the error value should typically fall below 0.08. If the error value does not fall to a satisfactory number, and the actual outputs are still too far away from the desired outputs, try re-randomizing the neural network and training again. Note that the actual output will most likely never be exactly 0 or 1, but "close": below 0.2 or above 0.8, say, for desired outputs of 0 or 1 respectively.
We will now go over the training algorithm which is used to train the above neural network, in detail. An algorithm is just a sequence of steps for carrying out a given task (and a computer is a machine that can carry out any algorithm that you can specify in writing).
The key to understanding the neural network training algorithm is to understand something called the “delta” value. There is a separate delta value associated with every neuron in the neural network, except for the input layer neurons, which do not have any delta value associated with them. The delta value for the output layer neuron or neurons is computed using a different calculation than the delta value for the hidden layer neurons. The delta value is first computed for the output layer, and then propagated backwards through the hidden layers of the neural network, to the first hidden layer. This is why the training algorithm is called “back-propagation”. As you have seen above, it takes thousands of iterations to train a neural network. The delta values are computed afresh for each iteration.
Recall that the purpose of training a neural network is to update the weights over many iterations, until the actual outputs of the neural network are as close as needed to the desired outputs. Once you have calculated the delta values, it is a simple additional step to use them to calculate the weight updates, for each iteration.
Using the shorthand notation introduced above, where A denotes the actual output from an output neuron (in the output layer) of the neural network and D denotes the desired output, the delta value for that output neuron is calculated as:
delta = A × (1 – A) × (D – A)
Note that this is for a given training example – we have avoided using the subscripts ex with the A and D variables to make the equation less imposing.
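As a small sketch, here is that calculation in Python; the function name and the example values are illustrative only.

```python
def output_delta(actual, desired):
    """Delta for an output neuron: A × (1 − A) × (D − A).
    (This form assumes the sigmoid transformation function,
    as explained further below.)"""
    return actual * (1 - actual) * (desired - actual)

# Example: the actual output is 0.8 where the desired output is 1.
print(output_delta(0.8, 1))  # 0.8 × 0.2 × 0.2 ≈ 0.032
```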
The next equation tells you how to calculate the delta for a neuron of a previous layer of the neural network, given all the deltas for the corresponding next layer. We also need the current output value of that hidden-layer neuron, which we represent by the letter O. Specifically, we will use the variables delta_j and O_j to mean that this is the jth neuron of the previous layer, counting from the top. The equation is this:
delta_j = O_j × (1 − O_j) × Σ_i (w_ij × delta_i)
We’ll work through the above equation. Σ is the familiar summation operator introduced in the context of the error function above. The delta_i values are the delta values for each neuron of the next layer. And each w_ij value represents the weight leading from the jth neuron of a previous layer to the ith neuron of its next layer. For a given neuron of a previous layer, we sum the products of the weights leading to the neurons of its next layer and the delta values of those neurons. We multiply this summation by O_j × (1 − O_j), where O_j is the output value of the neuron for which we are calculating the delta.
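Here is a sketch of the hidden-layer delta calculation in Python; the names and the example numbers are again just illustrative.

```python
def hidden_delta(o_j, next_weights, next_deltas):
    """Delta for hidden neuron j: O_j × (1 − O_j) × Σ_i (w_ij × delta_i).

    next_weights[i] is the weight w_ij leading from this neuron to the
    ith neuron of the next layer, and next_deltas[i] is that neuron's delta.
    """
    weighted_sum = sum(w * d for w, d in zip(next_weights, next_deltas))
    return o_j * (1 - o_j) * weighted_sum

# Example: a hidden neuron with output 0.6 that feeds two next-layer
# neurons through weights 0.4 and -0.3, whose deltas are 0.05 and 0.02.
print(hidden_delta(0.6, [0.4, -0.3], [0.05, 0.02]))  # ≈ 0.00336
```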
Now that we know how to calculate the delta for each neuron, it is a simple matter to update the weights of the neural network. Given a weight w_ij connecting the jth neuron of its previous layer to the ith neuron of its next layer, the weight update, which we represent as update_ij, is:
update_ij = r × O_j × delta_i
That is, the update to the weight is a quantity r, times the output of the neuron the weight leads out of, times the delta of the neuron the weight leads into. The quantity r is called the Learning Rate. If you increase r, the weight updates jump more with each iteration; if you decrease it, they jump less. We use a default value of 0.15 for the sample neural network on this page.
The new weight is the old weight plus the update_ij value of the given weight. That is,
w_ij(new) = w_ij(old) + update_ij
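As a sketch, the weight update looks like this in Python; the names are illustrative, and 0.15 is the default learning rate mentioned above.

```python
LEARNING_RATE = 0.15  # the default value of r on this page

def updated_weight(w_old, o_j, delta_i, r=LEARNING_RATE):
    """New weight: w_ij(old) + r × O_j × delta_i."""
    return w_old + r * o_j * delta_i

# Example: a weight of 0.4 from a neuron with output 0.6
# leading to a neuron whose delta is 0.032.
print(updated_weight(0.4, 0.6, 0.032))  # ≈ 0.40288
```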
That’s the neural network training algorithm. Note that this algorithm is for the specific case where the neuron’s transformation function is the sigmoid function, 1/(1 + e^(−x)), and the error function is the one given in the Error Function section above. If either or both of these are different, the calculations given above will change. But the general structure of the training algorithm will remain the same.
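Putting the pieces together, here is a compact, self-contained Python sketch of the whole training algorithm for a no-bias network with two hidden layers, like the one in the Interaction below. The layer sizes, the initialization range, the iteration count, and all the names are our own illustrative choices; and since the weights start out random, a given run may settle at a higher error, in which case rerunning (re-randomizing) usually helps, just as described above.

```python
import math
import random

def sigmoid(x):
    """The neuron transformation function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# The four XOR examples: inputs (A, B) and the desired output.
EXAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def make_layer(n_in, n_out):
    """Random weights w[i][j] leading from neuron j of the previous
    layer to neuron i of the next layer (no biases, as on this page)."""
    return [[random.uniform(-1, 1) for _ in range(n_in)]
            for _ in range(n_out)]

def forward(layers, inputs):
    """Returns the output values of every layer, input layer included."""
    outputs = [inputs]
    for weights in layers:
        prev = outputs[-1]
        outputs.append([sigmoid(sum(w * o for w, o in zip(row, prev)))
                        for row in weights])
    return outputs

def train_step(layers, inputs, desired, r):
    """One back-propagation iteration for a single example."""
    outputs = forward(layers, inputs)
    # deltas[L][i] is the delta of neuron i in the layer that the
    # weight matrix layers[L] leads into (that is, outputs[L + 1]).
    deltas = [None] * len(layers)
    # Output-layer delta: A × (1 − A) × (D − A).
    a = outputs[-1][0]
    deltas[-1] = [a * (1 - a) * (desired - a)]
    # Hidden-layer deltas: O_j × (1 − O_j) × Σ_i (w_ij × delta_i),
    # computed backwards from the last hidden layer to the first.
    for L in range(len(layers) - 2, -1, -1):
        deltas[L] = [o_j * (1 - o_j) *
                     sum(layers[L + 1][i][j] * deltas[L + 1][i]
                         for i in range(len(deltas[L + 1])))
                     for j, o_j in enumerate(outputs[L + 1])]
    # Weight updates: w_ij(new) = w_ij(old) + r × O_j × delta_i.
    for L, weights in enumerate(layers):
        for i, row in enumerate(weights):
            for j in range(len(row)):
                row[j] += r * outputs[L][j] * deltas[L][i]

def total_error(layers):
    """The error function: ½ × Σ_ex [D_ex − A_ex(w)]²."""
    return 0.5 * sum((d - forward(layers, x)[-1][0]) ** 2
                     for x, d in EXAMPLES)

# A 2-2-2-1 network: two inputs, two hidden layers, one output.
layers = [make_layer(2, 2), make_layer(2, 2), make_layer(2, 1)]
print("error before training:", total_error(layers))
for _ in range(50000):
    x, d = random.choice(EXAMPLES)  # a random example per iteration
    train_step(layers, x, d, r=0.15)
print("error after training:", total_error(layers))
```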
You can play with all this in the below Interaction. If you hover over a neuron, it will show the delta calculation for that neuron. If you hover over a weight, it will show the weight update calculation for that weight. We have used two hidden layers for this neural network, which is not necessary for the XOR example, but it will give you a clearer idea of how the deltas are propagated backwards through the neural network. You can select the learning rate and the number of iterations per training run from the corresponding drop-downs. If you want to step through the iterations one at a time, select "1" from the iterations drop-down and then keep pressing the "Run Training" button. Note that each iteration selects a random example from the available inputs, because that is how the algorithm is set up. Every time the learning rate is changed, we run the neural network through a couple of extra iterations behind the scenes to keep the back-propagated deltas and weight update displays consistent with the new learning rate.
So that's it for the XOR neural network. We mentioned earlier on this website that neural networks are structures that learn how to generalize from examples, and yet you do not see any generalization in the XOR neural network presented on this page. That is because the XOR neural network is a toy: it serves no useful purpose other than to teach you how a neural network gets trained to recognize its examples.
Copyright © 2022 by Sandeep Jain. All rights reserved.