Backpropagation Algorithm: Math, Examples, Code

In multilayer neural networks, the optimal output values of the neurons in every layer except the last are usually unknown. A perceptron with three or more layers therefore cannot be trained using only the error values at the network outputs.

One way to solve this problem is to develop, for each layer of the network, a set of output signals corresponding to the input signals; this is very laborious and not always feasible. A second option is to adjust the synaptic weights dynamically: the weakest connections are selected and changed by a small amount in one direction or the other, and only those changes that reduce the error at the output of the whole network are kept. Despite its apparent simplicity, this method requires cumbersome routine calculations.

Finally, the third and more acceptable option is to propagate the error signals from the outputs of the network to its inputs, in the direction opposite to the forward propagation of signals in normal operation. This learning algorithm is called the error backpropagation procedure, and it is the one considered below.

The backpropagation algorithm is an iterative gradient learning algorithm used to minimize the mean squared deviation of the actual outputs from the desired outputs of multilayer neural networks with serial (feedforward) connections.

According to the least squares method, the objective function of the network error to be minimized is

$$E(w) = \frac{1}{2} \sum_{j,p} \left( y_{j,p}^{(N)} - d_{j,p} \right)^2$$

where $y_{j,p}^{(N)}$ is the actual output state of neuron $j$ of the output layer $N$ when image $p$ is fed to the network inputs, and $d_{j,p}$ is the required output state of this neuron.

The summation runs over all neurons of the output layer and over all images processed by the network. Minimization by gradient descent adjusts the weights as follows:

$$\Delta w_{ij}^{(n)} = -\eta \, \frac{\partial E}{\partial w_{ij}} \qquad (1.10)$$

where $w_{ij}$ is the weight of the synaptic connection between neuron $i$ of layer $n-1$ and neuron $j$ of layer $n$, and $\eta$ is the learning rate coefficient.

By the chain rule for differentiating a composite function,

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j} \cdot \frac{d y_j}{d s_j} \cdot \frac{\partial s_j}{\partial w_{ij}} \qquad (1.11)$$

where $y_j$ is the output of neuron $j$ and $s_j$ is the weighted sum of its input signals, that is, the argument of the activation function. Since the derivative of the activation function must be defined on the entire abscissa axis, the unit step function and other activation functions with discontinuities are not suitable for the networks considered here. Smooth functions such as the hyperbolic tangent or the classical sigmoid with an exponential are used instead (see Table 1.1). For example, for the hyperbolic tangent,

$$\frac{dy}{ds} = 1 - y^2$$

The third factor, $\partial s_j / \partial w_{ij}$, equals the output $y_i^{(n-1)}$ of the neuron of the previous layer.

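The first factor in (1.11) is easily expanded as follows:

$$\frac{\partial E}{\partial y_j} = \sum_k \frac{\partial E}{\partial y_k} \cdot \frac{d y_k}{d s_k} \cdot \frac{\partial s_k}{\partial y_j} = \sum_k \frac{\partial E}{\partial y_k} \cdot \frac{d y_k}{d s_k} \cdot w_{jk}^{(n+1)}$$

Here the summation over $k$ runs over the neurons of layer $n+1$. Introducing a new variable

$$\delta_j^{(n)} = \frac{\partial E}{\partial y_j} \cdot \frac{d y_j}{d s_j}$$

we obtain a recursive formula for calculating the values $\delta_j^{(n)}$ of layer $n$ from the values $\delta_k^{(n+1)}$ of the later layer $n+1$:

$$\delta_j^{(n)} = \left[ \sum_k \delta_k^{(n+1)} \, w_{jk}^{(n+1)} \right] \cdot \frac{d y_j}{d s_j} \qquad (1.15)$$

For the output layer:

$$\delta_l^{(N)} = \left( y_l^{(N)} - d_l \right) \cdot \frac{d y_l}{d s_l} \qquad (1.16)$$

Now we can write (1.10) in expanded form:

$$\Delta w_{ij}^{(n)} = -\eta \, \delta_j^{(n)} \, y_i^{(n-1)} \qquad (1.17)$$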
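The recursion for the error terms can be checked numerically. The following is a minimal sketch, assuming a network with one hidden layer, tanh activations, and no bias terms (a simplification made here for brevity); the function names and the weight layout `W[j][i]` (row per receiving neuron) are illustrative choices, not taken from the source. It computes the gradient through the delta recursion and compares it with a finite-difference estimate.

```python
import math

def forward(x, W1, W2):
    """Forward pass: hidden outputs h and network outputs y (tanh units, no biases)."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [math.tanh(sum(w * hj for w, hj in zip(row, h))) for row in W2]
    return h, y

def loss(x, d, W1, W2):
    """Least-squares objective E = 1/2 * sum (y - d)^2 for a single image."""
    _, y = forward(x, W1, W2)
    return 0.5 * sum((yk - dk) ** 2 for yk, dk in zip(y, d))

def backprop_gradients(x, d, W1, W2):
    """Gradient of E with respect to every weight, via the delta recursion."""
    h, y = forward(x, W1, W2)
    # Output layer: delta = (y - d) * dy/ds, where dy/ds = 1 - y^2 for tanh.
    delta_out = [(yk - dk) * (1 - yk ** 2) for yk, dk in zip(y, d)]
    # Hidden layer: delta_j = (sum_k delta_k * w_jk) * dy_j/ds_j.
    delta_hid = [
        sum(delta_out[k] * W2[k][j] for k in range(len(W2))) * (1 - h[j] ** 2)
        for j in range(len(h))
    ]
    # dE/dw = delta of the receiving neuron * output of the sending neuron.
    g2 = [[dk * hj for hj in h] for dk in delta_out]
    g1 = [[dj * xi for xi in x] for dj in delta_hid]
    return g1, g2

def numeric_gradient(x, d, W1, W2, W, j, i, eps=1e-6):
    """Central finite difference of E with respect to W[j][i] (W is W1 or W2)."""
    old = W[j][i]
    W[j][i] = old + eps
    up = loss(x, d, W1, W2)
    W[j][i] = old - eps
    down = loss(x, d, W1, W2)
    W[j][i] = old
    return (up - down) / (2 * eps)

x, d = [0.5, -0.3], [0.2]
W1 = [[0.1, -0.2], [0.4, 0.3]]   # hidden layer: 2 neurons, 2 inputs each
W2 = [[0.7, -0.5]]               # output layer: 1 neuron, 2 inputs
g1, g2 = backprop_gradients(x, d, W1, W2)
```

Agreement between `g1`, `g2` and `numeric_gradient` for each weight is a common sanity check that a backpropagation implementation follows the derivation.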

Sometimes, to give the weight correction some inertia that smooths out sharp jumps when moving over the surface of the objective function, (1.17) is supplemented with the weight change from the previous iteration:

$$\Delta w_{ij}^{(n)}(t) = \mu \, \Delta w_{ij}^{(n)}(t-1) - \eta \, (1 - \mu) \, \delta_j^{(n)} \, y_i^{(n-1)}$$

where $\mu$ is the inertia coefficient and $t$ is the number of the current iteration.
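A small sketch of the inertia (momentum) idea follows. It assumes one common form of the update, in which the new change is a mix of the previous change (weight `mu`) and the current gradient step (weight `1 - mu`); the function name, the test function $E(w) = w^2$, and the parameter values are illustrative assumptions.

```python
def momentum_step(w, grad, prev_dw, eta=0.1, mu=0.9):
    """Return (new_w, dw): the previous change is blended with the new gradient step."""
    dw = mu * prev_dw - eta * (1 - mu) * grad
    return w + dw, dw

# Minimize E(w) = w^2 (gradient 2w), starting from w = 1 with no previous change.
w, dw = 1.0, 0.0
for _ in range(300):
    w, dw = momentum_step(w, 2 * w, dw)
```

With these settings the weight oscillates slightly around the minimum at zero while its amplitude decays, which is the smoothing behaviour the inertia term is meant to provide.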

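Thus, the complete training algorithm using the backpropagation procedure is constructed as follows.

STEP 1. Feed one of the possible images to the network inputs and, in the normal operating mode in which signals propagate from inputs to outputs, calculate the output values. Recall that

$$s_j^{(n)} = \sum_{i=1}^{M} y_i^{(n-1)} \, w_{ij}^{(n)}$$

where $M$ is the number of neurons in layer $n-1$, counting the neuron with a constant output state that sets the offset, and $y_i^{(n-1)}$ is the $i$-th input of neuron $j$ of layer $n$;

$$y_j^{(n)} = f\left(s_j^{(n)}\right)$$

where $f$ is the sigmoid;

$$y_q^{(0)} = I_q$$

where $I_q$ is the $q$-th component of the input image vector.

STEP 2. Calculate the error terms $\delta^{(N)}$ of the output layer from the difference between the actual and required outputs (1.16), and from them the weight changes $\Delta w^{(N)}$ of layer $N$ (1.17).

STEP 3. Calculate $\delta^{(n)}$ and $\Delta w^{(n)}$ for all other layers, $n = N-1, \ldots, 1$, using the recursive formula (1.15) and (1.17).

STEP 4. Adjust all weights in the neural network:

$$w_{ij}^{(n)}(t) = w_{ij}^{(n)}(t-1) + \Delta w_{ij}^{(n)}(t)$$

STEP 5. If the network error is significant, go to step 1. Otherwise, stop.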
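The five steps can be put together in a short sketch. It assumes sigmoid neurons, online (per-image) weight updates, a bias implemented as an extra input fixed at 1, and a fixed number of epochs in place of the "error is significant" test; the 2-2-1 layer sizes, the logical AND data, and all names are illustrative assumptions rather than anything prescribed by the text.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(layers, x):
    """STEP 1: return the outputs of every layer, starting with the input itself."""
    outputs = [list(x)]
    for W in layers:
        prev = outputs[-1] + [1.0]   # constant-output neuron that sets the offset
        outputs.append([sigmoid(sum(w * p for w, p in zip(row, prev))) for row in W])
    return outputs

def train_step(layers, x, d, eta):
    """Steps 1-4 for one image; returns the squared error measured before the update."""
    outputs = forward(layers, x)
    y = outputs[-1]
    # STEP 2: error terms of the output layer; dy/ds = y(1 - y) for the sigmoid.
    delta = [(yj - dj) * yj * (1 - yj) for yj, dj in zip(y, d)]
    for n in range(len(layers) - 1, -1, -1):
        prev = outputs[n] + [1.0]
        new_delta = None
        if n > 0:
            # STEP 3: error terms of the earlier layer, using the not-yet-updated weights.
            new_delta = [
                sum(delta[j] * layers[n][j][i] for j in range(len(delta)))
                * outputs[n][i] * (1 - outputs[n][i])
                for i in range(len(outputs[n]))
            ]
        # STEP 4: adjust the weights of layer n.
        for j, row in enumerate(layers[n]):
            for i in range(len(row)):
                row[i] -= eta * delta[j] * prev[i]
        delta = new_delta
    return 0.5 * sum((yj - dj) ** 2 for yj, dj in zip(y, d))

def run_epoch(layers, data, eta=0.5):
    random.shuffle(data)             # present the images in random order
    return sum(train_step(layers, x, d, eta) for x, d in data)

random.seed(0)
sizes = [2, 2, 1]
layers = [[[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
          for n_in, n_out in zip(sizes, sizes[1:])]
data = [([0, 0], [0]), ([0, 1], [0]), ([1, 0], [0]), ([1, 1], [1])]   # logical AND

first_error = run_epoch(layers, data)
for _ in range(2000):                # STEP 5, with a fixed epoch budget
    last_error = run_epoch(layers, data)
```

Comparing `first_error` with `last_error` shows whether the summed error over the training images has decreased; a real implementation would normally stop on an error threshold instead of a fixed epoch count.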

The network at step 1 is alternately presented with all training images in a random order so that the network, figuratively speaking, does not forget one as it memorizes others.

From expression (1.17) it follows that when the output value $y_i^{(n-1)}$ tends to zero, learning efficiency decreases markedly, because the weight change is proportional to that output. With binary input vectors, on average half of the weight coefficients will not be corrected, so it is desirable to shift the range of possible neuron outputs to limits symmetric about zero, for example $[-0.5, +0.5]$, which is achieved by simple modifications of the logistic functions. For example, a sigmoid with an exponential is converted to the form

$$f(x) = -0.5 + \frac{1}{1 + e^{-\alpha x}} \qquad (1.23)$$
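As a brief illustration, the sketch below assumes that the shifted range is $[-0.5, +0.5]$ and that the logistic function is simply lowered by 0.5; the function name and the slope parameter `alpha` are illustrative.

```python
import math

def shifted_sigmoid(x, alpha=1.0):
    """Logistic function lowered by 0.5, so that outputs lie in (-0.5, 0.5)."""
    return -0.5 + 1.0 / (1.0 + math.exp(-alpha * x))

low, mid, high = shifted_sigmoid(-3.0), shifted_sigmoid(0.0), shifted_sigmoid(3.0)
```

With this form a neuron no longer rests at an output of exactly zero for strongly negative inputs, so the weights of its outgoing connections keep receiving corrections; the output is zero only at the midpoint of the curve.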

Consider the capacity of a neural network, that is, the number of images presented to its inputs that it is able to learn to recognize. For networks with more than two layers, this question remains open. For networks with two layers, the deterministic capacity $C_d$ of the network is estimated as follows:

$$\frac{N_w}{N_y} < C_d < \frac{N_w}{N_y} \log\left(\frac{N_w}{N_y}\right)$$

where $N_w$ is the number of adjustable weights and $N_y$ is the number of neurons in the output layer.

This expression holds under certain restrictions. First, the number of inputs $N_x$ and the number of neurons in the hidden layer $N_h$ must satisfy the inequality $N_x + N_h > N_y$. Second, $N_w / N_y > 1000$. However, the above estimate was made for networks with threshold activation functions, and the capacity of networks with smooth activation functions, for example (1.23), is usually larger. In addition, the term "deterministic" means that the capacity estimate holds for all input patterns that can be represented by the inputs. In reality, the distribution of input images usually has some regularity, which allows the network to generalize and thus increases the real capacity. Since the distribution of images is generally not known in advance, the real capacity can only be discussed hypothetically, but it is usually about twice the deterministic capacity.

The question of network capacity is closely related to the required size of the output layer, which performs the final classification of images. For example, one output neuron is sufficient to divide a set of input images into two classes; each logical level then denotes a separate class. Two output neurons with a threshold activation function can already encode four classes. To increase classification reliability, it is desirable to introduce redundancy by allocating one output neuron to each class or, even better, several, each trained to determine with its own degree of confidence (for example high, medium, and low) whether an image belongs to the class. Such networks make it possible to classify input images grouped into fuzzy (blurred or intersecting) sets. This property brings such networks closer to the real operating conditions of biological neural networks.

The network considered here has several bottlenecks. First, during training, large positive or negative weight values can shift the operating point on the neurons' sigmoids into the saturation region. Small values of the derivative of the logistic function then bring learning to a halt, in accordance with (1.15) and (1.16), which paralyzes the network. Second, gradient descent does not guarantee finding the global minimum of the objective function. This is closely related to the choice of the learning rate. Strictly, the weight increments, and therefore the learning rate, must be infinitesimal to find the extremum, but learning would then be unacceptably slow. On the other hand, weight corrections that are too large can lead to permanent instability in the learning process. Therefore, a number less than 1 (for example, 0.1) is usually chosen as the learning rate coefficient $\eta$, and it is gradually decreased during training. In addition, to keep the network from getting trapped in local minima, $\eta$ is sometimes briefly increased significantly after the weights have stabilized, in order to restart gradient descent from a new point. If repeating this procedure several times brings the network to the same state, it can be assumed that a global minimum has been found.

There is another method to eliminate local minima and network paralysis, which is to use stochastic neural networks.

Let us give a geometric interpretation of the above.

The backpropagation algorithm calculates the gradient vector of the error surface. This vector indicates the direction of steepest descent on the surface from the current point; moving along it decreases the error. A sequence of such steps, decreasing in size, will lead to a minimum of one kind or another. The difficulty lies in choosing the length of the steps.

With a large step size, convergence will be faster, but there is a danger of jumping over the solution or, in the case of a complex error surface, going in the wrong direction, for example, moving along a narrow ravine with steep slopes, jumping from one side to the other. On the contrary, with a small step and the right direction, a lot of iterations will be required. In practice, the step size is taken proportional to the steepness of the slope, so that the algorithm slows down near the minimum. The correct choice of learning rate depends on the specific task and is usually done empirically. This constant may also depend on time, decreasing as the algorithm progresses.
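The effect of the step size can be seen on the simplest possible error surface. This sketch assumes a one-dimensional error $E(w) = w^2$ and a constant step; both the function and the three step values are chosen here purely for illustration.

```python
def descend(step, w=1.0, iterations=20):
    """Gradient descent on E(w) = w**2, whose gradient is 2*w."""
    for _ in range(iterations):
        w -= step * 2 * w
    return w

for step in (0.01, 0.4, 1.1):
    print(step, descend(step))
```

With the small step the weight is still far from the minimum after 20 iterations, with the moderate step it is essentially at the minimum, and with the large step each update overshoots the minimum by more than the previous distance, so the iterates jump from side to side and grow.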

Typically, this algorithm is modified to include a momentum (or inertia) term. This promotes movement in a fixed direction, so if several steps were taken in the same direction, the algorithm increases the speed, which sometimes avoids a local minimum, and also passes flat areas faster.

At each step (epoch) of the algorithm, all training examples are fed in turn to the network input, the actual output values of the network are compared with the required values, and the error is calculated. The error value, together with the gradient of the error surface, is used to adjust the weights, after which all actions are repeated. Learning stops when a set number of epochs has passed, when the error reaches a specified small level, or when the error stops decreasing.
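The three stopping conditions can be expressed as a small helper. This is a sketch under assumed conventions: `errors` holds one error value per completed epoch, and "stopped decreasing" is taken to mean no improvement over the last `patience` epochs; the thresholds and names are illustrative.

```python
def should_stop(errors, max_epochs=1000, target=1e-3, patience=5, min_delta=1e-6):
    """Return the reason to stop training, or None to continue.

    errors: list of per-epoch error values recorded so far.
    """
    if len(errors) >= max_epochs:
        return "epoch limit"
    if errors and errors[-1] <= target:
        return "target error reached"
    if (len(errors) > patience
            and min(errors[-patience:]) > min(errors[:-patience]) - min_delta):
        return "error stopped decreasing"
    return None
```

A training loop would append the epoch error to the list after every pass over the training set and call `should_stop` before starting the next pass.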

Let us consider the problems of generalization and overfitting of a neural network in more detail. Generalization is the ability of a neural network to make accurate predictions on data that do not belong to the original training set. Overfitting, by contrast, is an excessively close fit to the training data; it occurs when the learning algorithm runs too long and the network is too complex for the task or for the amount of data available.

Let us demonstrate these problems by approximating a certain dependence with polynomials rather than with a neural network; the essence of the phenomenon is exactly the same.

Polynomial graphs can have different shapes, and the higher the degree and the number of terms, the more complex the shape can be. A polynomial curve (model) can be fitted to the initial data to obtain an explanation of the existing dependence. The data may be noisy, so the best model cannot be assumed to pass exactly through all available points. A low-order polynomial may capture the general relationship but not be flexible enough to fit the data, while a high-order polynomial may be overly flexible: it will follow the data exactly while taking on a convoluted shape that has nothing to do with the true relationship.

Neural networks face the same difficulties. Networks with more weights model more complex features and are therefore prone to overfitting. Networks with a small number of weights may not be flexible enough to model the existing dependencies. For example, a network with no hidden layers models only an ordinary linear function.

How to choose the right degree of network complexity? Almost always, a more complex network gives a smaller error, but this may not indicate a good quality of the model, but rather an overfitting of the network.

The way out is to use a control (validation) set. To do this, part of the training sample is reserved and used not to train the network with the backpropagation algorithm but to monitor the result independently while the algorithm runs. At the beginning of training, the network error on the training and control sets will be approximately the same. As the network is trained, the training error decreases, and so does the error on the control set. If the control error stops decreasing or even begins to increase, the network has begun to approximate the data too closely (it is overfitting) and training should be stopped. If this happens, the number of hidden elements and/or layers should be reduced, because the network is too powerful for the task. If, on the other hand, neither error (training or control) reaches a sufficiently small level, then overfitting has not occurred; on the contrary, the network is not powerful enough to model the existing dependence.
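Stopping on the control error can be sketched as follows. The sketch assumes that the caller supplies two hypothetical callbacks, `train_epoch(model)` (one pass of training, modifying the model in place) and `control_error(model)` (the error on the reserved set), and that training stops after `patience` epochs without improvement; none of these names come from the text.

```python
import copy

def train_with_early_stopping(model, train_epoch, control_error,
                              max_epochs=100, patience=3):
    """Train until the control error stops improving; return the best model seen."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    bad_epochs = 0
    for _ in range(max_epochs):
        train_epoch(model)
        err = control_error(model)
        if err < best_error:
            best_error, best_model, bad_epochs = err, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_model, best_error

# Toy stand-in for a network: "training" moves w upward, and the control error
# is smallest at w = 3, after which further training only makes it worse.
model = {"w": 0.0}

def toy_epoch(m):
    m["w"] += 1.0

def toy_control_error(m):
    return (m["w"] - 3.0) ** 2

best_model, best_error = train_with_early_stopping(model, toy_epoch, toy_control_error)
```

Keeping a copy of the best model means that the weights returned are those from the epoch with the lowest control error, not those from the last epoch trained.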

Because of these problems, practical work with neural networks requires experimenting with a large number of different networks, sometimes training each of them several times and comparing the results. The main indicator of the quality of the result is the control error. At the same time, in accordance with the general principle of preferring simpler models, it makes sense to choose the simpler of two networks with approximately equal control errors.

The need for multiple experiments leads to the fact that the control set begins to play a key role in choosing a model and becomes part of the learning process. This weakens its role as an independent criterion of model quality. With a large number of experiments, there is a high probability of choosing a successful network that gives a good result on the control set. However, in order to give the final model due reliability, often (when the volume of training examples allows it) they proceed as follows: a test set of examples is reserved. The final model is tested on the data from this set to make sure that the results achieved on the training and control sets of examples are real, and not artifacts of the training process. Of course, in order to play its part well, the test set must be used only once: if it is used repeatedly to correct the learning process, then it will actually turn into a control set.
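A three-way split of the available examples can be sketched as below. The fractions, the fixed shuffle seed, and the function name are assumptions made for the illustration.

```python
import random

def split_dataset(examples, control_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the examples and split them into training, control, and test sets."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    n_control = int(len(items) * control_fraction)
    test = items[:n_test]
    control = items[n_test:n_test + n_control]
    train = items[n_test + n_control:]
    return train, control, test

train, control, test = split_dataset(range(10))
```

The test part is set aside immediately and touched only once, after the model has been chosen using the training and control parts.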

In order to speed up the learning process of the network, numerous modifications of the error backpropagation algorithm are proposed, associated with the use of various error functions, procedures for determining the direction and step sizes.

1) Error functions:
   - integral error functions over the entire set of training examples;
   - error functions of integer and fractional powers.

2) Procedures for determining the step size at each iteration:
   - dichotomy;
   - inertial relations (see above).

3) Procedures for determining the step direction:
   - using a matrix of second-order derivatives (Newton's method);
   - using directions over several steps (the partan method).

Prudnikov Ivan Alekseevich
MIREA(MTU)

The topic of neural networks has already been covered in many journals, but today I would like to introduce readers to the algorithm for training a multilayer neural network using the backpropagation method and provide an implementation of this method.

I want to make a reservation right away that I am not an expert in the field of neural networks, so I expect constructive criticism, comments and additions from readers.

Theoretical part

This material assumes some acquaintance with the basics of neural networks, but I think it is possible to introduce the reader to the topic without lengthy digressions into neural network theory. So, for those who hear the phrase "neural network" for the first time, I suggest thinking of a neural network as a weighted directed graph whose nodes (neurons) are arranged in layers. A node of one layer has links to all nodes of the previous layer. In our case, such a graph has input and output layers, whose nodes act as inputs and outputs, respectively. Each node (neuron) has an activation function, which is responsible for calculating the signal at the output of the node. There is also the concept of a bias, which is a node whose output is always 1. In this article we consider supervised learning, that is, a learning process "with a teacher", in which the network is given a sequence of training examples together with the correct responses.

As with most neural networks, our goal is to train the network so as to balance its ability to give the correct response to inputs used during training (memorization) with its ability to produce correct results for inputs that are similar but not identical to those used in training (generalization). Training a network by error backpropagation includes three stages: feeding data to the input and propagating it toward the outputs, calculating and backpropagating the corresponding error, and adjusting the weights. After training, it is only necessary to feed data to the network input and propagate it toward the outputs. Although training can be a rather lengthy process, computing results with a trained network is very fast. In addition, there are numerous variations of the backpropagation method designed to speed up learning.

It is also worth noting that a single-layer neural network is significantly limited in the input patterns it can learn, while a multilayer network (with one or more hidden layers) does not have this disadvantage. A description of a standard backpropagation neural network follows.

Architecture

Figure 1 shows a multilayer neural network with one layer of hidden neurons (Z elements).

The neurons that represent the network outputs (denoted $Y$) and the hidden neurons can have biases (as shown in the figure). The bias of output $Y_k$ is denoted $w_{0k}$, and that of hidden element $Z_j$ is denoted $v_{0j}$. These biases act as weights on connections from neurons whose output is always 1 (they are shown in Figure 1, but usually they are implied rather than drawn explicitly). In addition, the arrows in Figure 1 show the movement of information during the phase in which data propagate from inputs to outputs. During learning, the error signals propagate in the opposite direction.

Description of the algorithm

The algorithm presented below applies to a neural network with one hidden layer, which is adequate for most applications. As mentioned earlier, network training includes three stages: feeding training data to the network inputs, backpropagating the error, and adjusting the weights.

During the first stage, each input neuron $X_i$ receives a signal and broadcasts it to each of the hidden neurons $Z_1, Z_2, \ldots, Z_p$. Each hidden neuron then computes its activation function and broadcasts its signal $z_j$ to all output neurons. Each output neuron $Y_k$ in turn computes its activation function, whose result $y_k$ is the output signal of this neuron for the given input data.

During learning, each output neuron compares the computed value $y_k$ with the teacher-provided target value $t_k$ and determines the error for the given input pattern. Based on this error, $\sigma_k$ ($k = 1, 2, \ldots, m$) is calculated. $\sigma_k$ is used to propagate the error from $Y_k$ to all elements of the previous layer (the hidden neurons connected to $Y_k$), and later to change the weights of the connections between the hidden and output neurons. Similarly, $\sigma_j$ ($j = 1, 2, \ldots, p$) is calculated for each hidden neuron $Z_j$. Although there is no need to propagate the error to the input layer, $\sigma_j$ is used to change the weights of the connections between the input neurons and the hidden neurons. After all $\sigma$ values have been determined, the weights of all links are adjusted simultaneously.

Designations:

The following notation is used in the network learning algorithm:

$X$: input training data vector, $X = (X_1, X_2, \ldots, X_i, \ldots, X_n)$.
$t$: vector of teacher-provided target outputs, $t = (t_1, t_2, \ldots, t_k, \ldots, t_m)$.
$\sigma_k$: component of the adjustment of the connection weights $w_{jk}$ corresponding to the error of the output neuron $Y_k$; also the information about the error of neuron $Y_k$ that is distributed to the hidden-layer neurons connected to $Y_k$.
$\sigma_j$: component of the adjustment of the connection weights $v_{ij}$ corresponding to the error information propagated from the output layer to the hidden neuron $Z_j$.
$a$: learning rate.
$X_i$: input neuron with index $i$. For input neurons, the input and output signals are the same, $x_i$.
$v_{0j}$: bias of hidden neuron $j$.
$Z_j$: hidden neuron $j$. The total value supplied to the input of the hidden element $Z_j$ is denoted $z\_in_j$: $z\_in_j = v_{0j} + \sum_i x_i v_{ij}$. The signal at the output of $Z_j$ (the result of applying the activation function to $z\_in_j$) is denoted $z_j$: $z_j = f(z\_in_j)$.
$w_{0k}$: bias of output neuron $k$.
$Y_k$: output neuron with index $k$. The total value supplied to the input of the output element $Y_k$ is denoted $y\_in_k$: $y\_in_k = w_{0k} + \sum_j z_j w_{jk}$. The output signal of $Y_k$ (the result of applying the activation function to $y\_in_k$) is denoted $y_k$: $y_k = f(y\_in_k)$.

Activation function

The activation function in the backpropagation algorithm must have several important properties: it must be continuous, differentiable, and monotonically non-decreasing. Moreover, for computational efficiency, it is desirable that its derivative be easy to compute. Often the activation function is also a saturating function. One of the most commonly used activation functions is the binary sigmoid, with range (0, 1), defined as

$$f(x) = \frac{1}{1 + e^{-x}}, \qquad f'(x) = f(x)\,(1 - f(x))$$

Another widely used activation function is the bipolar sigmoid, with range (-1, 1), defined as

$$f(x) = \frac{2}{1 + e^{-x}} - 1, \qquad f'(x) = \frac{1}{2}\,(1 + f(x))\,(1 - f(x))$$
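Both activation functions and their derivatives are short enough to write out directly. The sketch below uses the standard definitions of the binary and bipolar sigmoid; the function names are illustrative, and the derivative is expressed through the function value itself, which is what makes these functions cheap to use in backpropagation.

```python
import math

def binary_sigmoid(x):
    """Logistic function with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_sigmoid_prime(x):
    f = binary_sigmoid(x)
    return f * (1.0 - f)

def bipolar_sigmoid(x):
    """Rescaled logistic function with range (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def bipolar_sigmoid_prime(x):
    f = bipolar_sigmoid(x)
    return 0.5 * (1.0 + f) * (1.0 - f)

def central_difference(func, x, eps=1e-6):
    """Numerical derivative, used only to check the closed-form derivatives."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)
```

Comparing each `*_prime` function with `central_difference` at a few points is a quick way to confirm that the closed-form derivative matches the function it belongs to.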

Learning algorithm

The learning algorithm looks like this:

Step 0. Initialize the weights (the weights of all links are initialized with small random values).

Step 1. As long as the termination condition of the algorithm is false, perform steps 2 - 9.

Step 2. For each pair (data, target value), perform steps 3 - 8.

Propagation of data from inputs to outputs:

Step 3. Each input neuron ($X_i$, $i = 1, 2, \ldots, n$) sends the received signal $x_i$ to all neurons in the next (hidden) layer.

Step 4. Each hidden neuron ($Z_j$, $j = 1, 2, \ldots, p$) sums the weighted incoming signals, $z\_in_j = v_{0j} + \sum_i x_i v_{ij}$, and applies the activation function, $z_j = f(z\_in_j)$. It then sends the result to all elements of the next (output) layer.

Step 5. Each output neuron ($Y_k$, $k = 1, 2, \ldots, m$) sums the weighted input signals, $y\_in_k = w_{0k} + \sum_j z_j w_{jk}$, and applies the activation function to calculate the output signal, $y_k = f(y\_in_k)$.

Backpropagation:

Step 6. Each output neuron ($Y_k$, $k = 1, 2, \ldots, m$) receives the target value, that is, the output value that is correct for the given input signal, and calculates the error term $\sigma_k = (t_k - y_k) \, f'(y\_in_k)$. It also calculates the amount by which the connection weight $w_{jk}$ will change, $\Delta w_{jk} = a \, \sigma_k \, z_j$, and the bias adjustment, $\Delta w_{0k} = a \, \sigma_k$, and sends $\sigma_k$ to the neurons in the previous layer.

Step 7. Each hidden neuron ($Z_j$, $j = 1, 2, \ldots, p$) sums the incoming errors from the neurons in the next layer, $\sigma\_in_j = \sum_k \sigma_k w_{jk}$, and calculates its error term by multiplying this value by the derivative of the activation function, $\sigma_j = \sigma\_in_j \, f'(z\_in_j)$. It also calculates the amount by which the connection weight $v_{ij}$ will change, $\Delta v_{ij} = a \, \sigma_j \, x_i$, and the bias adjustment, $\Delta v_{0j} = a \, \sigma_j$.

Step 8. Change the weights.
Each output neuron ($Y_k$, $k = 1, 2, \ldots, m$) changes the weights of its connections with the bias element and the hidden neurons ($j = 0, 1, \ldots, p$): $w_{jk}(\text{new}) = w_{jk}(\text{old}) + \Delta w_{jk}$.
Each hidden neuron ($Z_j$, $j = 1, 2, \ldots, p$) changes the weights of its connections with the bias element and the input neurons ($i = 0, 1, \ldots, n$): $v_{ij}(\text{new}) = v_{ij}(\text{old}) + \Delta v_{ij}$.

Step 9. Check the termination condition of the algorithm.
The algorithm can terminate either when the total squared error at the network output falls to a preset minimum during learning, or when a set number of iterations has been performed.

The algorithm is based on the method of gradient descent. The gradient of a function (here the function value is the error and the parameters are the connection weights of the network) gives the direction in which the function increases most rapidly; taken with the opposite sign, it gives the direction in which the function decreases most rapidly.
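Steps 3 to 8 for a single training pair can be sketched as follows. The sketch assumes the binary sigmoid as the activation function and stores the weights as nested lists, with `v[i][j]` linking input `i` to hidden neuron `j` and `w[j][k]` linking hidden neuron `j` to output `k`; the function name, the numeric values, and this storage layout are illustrative assumptions.

```python
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

def f_prime(x):
    return f(x) * (1.0 - f(x))

def train_pattern(x, t, v, v0, w, w0, a=0.5):
    """Steps 3-8 for one (data, target) pair; updates v, v0, w, w0 in place.

    Returns the network output computed before the weights were changed.
    """
    n, p, m = len(x), len(v0), len(w0)
    # Steps 3-5: propagation of data from inputs to outputs.
    z_in = [v0[j] + sum(x[i] * v[i][j] for i in range(n)) for j in range(p)]
    z = [f(s) for s in z_in]
    y_in = [w0[k] + sum(z[j] * w[j][k] for j in range(p)) for k in range(m)]
    y = [f(s) for s in y_in]
    # Step 6: error terms of the output neurons.
    sigma_k = [(t[k] - y[k]) * f_prime(y_in[k]) for k in range(m)]
    # Step 7: error terms of the hidden neurons (computed with the old weights w).
    sigma_j = [sum(sigma_k[k] * w[j][k] for k in range(m)) * f_prime(z_in[j])
               for j in range(p)]
    # Step 8: change all weights and biases.
    for k in range(m):
        w0[k] += a * sigma_k[k]
        for j in range(p):
            w[j][k] += a * sigma_k[k] * z[j]
    for j in range(p):
        v0[j] += a * sigma_j[j]
        for i in range(n):
            v[i][j] += a * sigma_j[j] * x[i]
    return y

v, v0 = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0]
w, w0 = [[0.5], [-0.5]], [0.0]
x, t = [1.0, 0.0], [1.0]
y_before = train_pattern(x, t, v, v0, w, w0)[0]
y_after = train_pattern(x, t, v, v0, w, w0)[0]
```

Presenting the same pair twice shows the effect of one update: the second output is closer to the target than the first. Note that all error terms are computed before any weight is changed, as the description of the algorithm requires.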

The delta rule is used when training a perceptron and relies on the error value of the output layer. If the network has two or more layers, there is no explicit error value for the intermediate layers, and the delta rule cannot be applied directly.

The main idea behind backpropagation is how to obtain an error estimate for the neurons of the hidden layers. Note that the known errors made by the neurons of the output layer arise from the unknown errors of the neurons in the hidden layers. The greater the weight of the synaptic connection between a hidden neuron and an output neuron, the more strongly the error of the former affects the error of the latter. Therefore, the error estimate for the elements of the hidden layers can be obtained as a weighted sum of the errors of the subsequent layers.

The backpropagation algorithm (BPA), which is a generalization of the delta rule, makes it possible to train a feedforward ANN with any number of layers. In effect, the BPA uses a form of gradient descent, adjusting the weights in the direction of minimum error.

When using the BPA, it is assumed that a sigmoid function is used as the activation function. This function saves computation because it has a simple derivative:

$$f'(x) = f(x)\,(1 - f(x))$$

The sigmoid function limits strong signals to 1 and amplifies weak signals.

The idea of the error backpropagation algorithm is that, during training, the network is first presented with an image for which the output error is calculated. This error then propagates through the network in the opposite direction, changing the weights of the connections between neurons.

The algorithm includes the same sequence of actions as perceptron training. First, the weights of the connections between neurons are given random values, and then the following steps are performed:

1) A training pair (X, Z*) is selected, and X is fed to the input;

2) The output of the network Z = F(Y) is calculated;

3) The output error E is calculated;

4) The network weights are adjusted to minimize the error.

Steps 1 and 2 are forward propagation through the network, while steps 3 and 4 are backward propagation.

Before training, the available input-output pairs must be divided into two parts: a training set and a test set.

Test pairs are used to check the quality of training: the NN is well trained if, for the input of a test pair, it produces an output close to the test output.

During training, the NN may show good results on the training data but poor results on the test data. There can be two reasons for this:

1. The test data differ greatly from the training data, i.e., the training pairs did not cover all regions of the input space.

2. Overfitting has occurred: the behavior of the NN turns out to be more complex than the problem being solved.

The latter case is illustrated for the problem of approximating a function from points in Fig. 3.3, where white circles represent test data and dark circles represent training data.

Strictly speaking, backpropagation is a fast gradient calculation method based on the features of the network recalculation function, which can reduce the computational complexity of the gradient calculation. The method uses the error at the network output to calculate partial derivatives by the weights of the last layer of trained connections, then the error at the output of the penultimate layer is determined from the weights of the last layer and the network error, and the process is repeated.

Description of the algorithm

Backpropagation is applied to multilayer networks whose neurons have nonlinearities with a continuous derivative, such as the sigmoid:

$$f(x) = \frac{1}{1 + e^{-x}}$$

A nonlinearity of this type is convenient because its derivative is simple to calculate:

$$f'(x) = f(x)\,(1 - f(x))$$

To train the network, $P$ pairs of signal vectors are used: the input vector $I$ and the vector $D$ that should be obtained at the network output. In the simple case the network consists of $N$ layers, and each neuron of a layer is connected to all neurons of the previous layer by connections with weights $w[n]$.

During forward propagation, the total signal $S[n]$ at the input of each layer and the signal $x[n]$ at the output of each neuron are calculated (and stored). The signal at the input of the $i$-th neuron of the $n$-th layer is

$$S_i[n] = \sum_j w_{ij}[n] \, x_j[n-1]$$

Here $w_{ij}[n]$ are the weights of the connections of the $n$-th layer. The signal at the output of the neuron is calculated by applying the nonlinearity of the neuron to the total signal:

$$x_i[n] = f(S_i[n])$$

The output layer signal $x[N]$ is taken as the output signal $O$ of the network.

Based on the network output signal $O$ and the signal $D$ that should be obtained at the network output for a given input, the network error is calculated. Usually the mean squared deviation over all vectors of the training set is used:

$$E = \frac{1}{P} \sum_{p=1}^{P} E_p, \qquad E_p = \frac{1}{2} \sum_i \left( O_i - D_i \right)^2$$

where $O$ and $D$ in $E_p$ are taken for the $p$-th training pair.

To train the network, the gradient of the error function with respect to the network weights is used. The backpropagation algorithm calculates this gradient by propagating the error signal backward. The partial derivative of the error with respect to the connection weights, for one training pair, is calculated by the formula

$$\frac{\partial E_p}{\partial w_{ij}[n]} = q_i[n] \, x_j[n-1]$$

Here $q$ is the residual of the network, which for the output layer is calculated from the error function:

$$q_i[N] = \frac{\partial E_p}{\partial O_i} \, f'(S_i[N])$$

and for the hidden layers from the residual of the layer processed before it in the backward pass, that is, layer $n+1$:

$$q_i[n] = f'(S_i[n]) \sum_j w_{ji}[n+1] \, q_j[n+1]$$

For the case of a sigmoid nonlinearity and the mean squared deviation as the error function:

$$q_i[N] = (O_i - D_i) \, O_i \, (1 - O_i), \qquad q_i[n] = x_i[n] \, (1 - x_i[n]) \sum_j w_{ji}[n+1] \, q_j[n+1]$$

Network training itself consists of finding weight values that minimize the error at the network outputs. Many methods that are based on or use the gradient can solve this problem. In the simplest case, the network is trained by small increments of the connection weights in the direction opposite to the gradient vector, with a small step size $\varepsilon$:

$$w_{ij}[n] \leftarrow w_{ij}[n] - \varepsilon \, \frac{\partial E}{\partial w_{ij}[n]}$$

This learning method is called gradient descent optimization and, in the case of neural networks, is often considered part of the backpropagation method.

Implementation of the Error Backpropagation Algorithm Using the Example of Function Approximation

Task: Let there be a table of argument values $x_i$ and corresponding function values $f(x_i)$. Such a table could arise from evaluating some analytically defined function, or from an experiment to identify the dependence of current on resistance in an electrical network, the relationship between solar activity and the number of calls to a cardiology center, the relationship between the amount of subsidies to farmers and the volume of agricultural production, and so on.

In the Matlab environment, build and train a neural network to approximate the tabulated function, $i = 1, \ldots, 20$. Develop a program that implements the neural network approximation algorithm and displays the approximation results as graphs.

The approximation lies in the fact that, using the available information on f (x), we can consider an approximating function z (x) close in some sense to f (x), which allows us to perform the appropriate operations on it and obtain an estimate of the error of such a replacement.

Approximation usually means describing some dependence, sometimes not explicitly specified, or the data set representing it, by means of another dependence that is usually simpler or more uniform. Often the data take the form of individual nodal points whose coordinates are given by a data table. The result of approximation need not pass through the nodal points. In contrast, the task of interpolation is to find values between the nodal points. For this, functions are used whose values at the nodal points coincide with the coordinates of these points.
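The difference can be seen on a small made-up table. The sketch below is only an illustration: the data values are invented, a least-squares straight line stands in for "approximation", and a Lagrange polynomial stands in for "interpolation".

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + b (an approximation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx


def lagrange(xs, ys, x):
    """Interpolating polynomial through all nodal points, evaluated at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]  # made-up table values

a, b = fit_line(xs, ys)
approx_residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
interp_residuals = [y - lagrange(xs, ys, x) for x, y in zip(xs, ys)]
print(approx_residuals)  # non-zero: the line misses the nodal points
print(interp_residuals)  # zero: the polynomial passes through them
```

The line is simpler but does not reproduce the table exactly, while the interpolating polynomial reproduces every nodal point.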

Task. In the Matlab environment, it is necessary to build and train a neural network to approximate a given table function (see Figure 5).

Figure 5. Table of function values

In the Matlab mathematical environment, we write the program code for creating and training a neural network in the command window.

To solve the problem, we use the newff(.) function, which creates a "classical" multilayer neural network trained by the backpropagation method: the weights of the synapses are changed according to the error function, and the difference between the actual and correct responses of the network, determined at the output layer, propagates in the direction opposite to the flow of signals. The network will have two layers: a hidden layer with 5 neurons and an output layer with 1 neuron. The activation function of the first layer is "tansig" (hyperbolic tangent sigmoid, which returns values in the range from -1 to 1), and that of the second is "purelin" (linear activation function, which returns its input unchanged). There will be 100 training epochs. The training function is "trainlm" (Levenberg-Marquardt), which is used by default because it provides the fastest training, although it requires a lot of memory.

Program code:

P = zeros(1, 20);
for i = 1:20                                          % array creation
    P(i) = i*0.1;                                     % input data (argument)
end
T = [ ];                                              % target data: the 20 function values from the table in Figure 5 go here
net = newff([-1 2.09], [5 1], {'tansig' 'purelin'});  % creating a neural network
net.trainParam.epochs = 100;                          % set the number of training epochs
net = train(net, P, T);                               % network training
y = sim(net, P);                                      % poll the trained network
figure(1);
plot(P, T, P, y, 'o'), grid;                          % plot the initial data and the function formed by the neural network
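For readers without the toolbox, the same 1-5-1 architecture can be sketched in plain Python. This is an illustration under stated assumptions, not the code behind the figures: sin(x) stands in for the table of function values, training uses plain batch gradient descent instead of the Levenberg-Marquardt method of trainlm (and therefore more epochs), and the bias terms, the learning rate `eta` and the initial weight range are choices made here.

```python
import math
import random

random.seed(0)  # reproducible initial weights

# Arguments as in the MATLAB listing: x_i = 0.1 * i, i = 1..20.
P = [0.1 * i for i in range(1, 21)]
# Assumption: sin(x) stands in for the table of function values.
T = [math.sin(x) for x in P]

H = 5  # hidden tanh neurons; the single output neuron is linear
w1 = [random.uniform(-0.5, 0.5) for _ in range(H)]  # input -> hidden weights
b1 = [random.uniform(-0.5, 0.5) for _ in range(H)]  # hidden biases
w2 = [random.uniform(-0.5, 0.5) for _ in range(H)]  # hidden -> output weights
b2 = 0.0                                            # output bias


def forward(x):
    """Forward pass: return hidden activations and the network output."""
    h = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
    return h, sum(w2[j] * h[j] for j in range(H)) + b2


def mse():
    """Mean squared error over the whole table."""
    return sum((forward(x)[1] - t) ** 2 for x, t in zip(P, T)) / len(P)


eta = 0.05  # learning rate (assumed)
mse_before = mse()
for epoch in range(2000):
    gw1, gb1, gw2, gb2 = [0.0] * H, [0.0] * H, [0.0] * H, 0.0
    for x, t in zip(P, T):
        h, y = forward(x)
        d_out = y - t                                  # linear output: f' = 1
        for j in range(H):
            d_hid = d_out * w2[j] * (1.0 - h[j] ** 2)  # tanh' = 1 - tanh^2
            gw2[j] += d_out * h[j]
            gw1[j] += d_hid * x
            gb1[j] += d_hid
        gb2 += d_out
    for j in range(H):                                 # step against the mean gradient
        w1[j] -= eta * gw1[j] / len(P)
        b1[j] -= eta * gb1[j] / len(P)
        w2[j] -= eta * gw2[j] / len(P)
    b2 -= eta * gb2 / len(P)
mse_after = mse()
print(mse_before, mse_after)
```

The printed pair shows the error before and after training; how small the final error gets depends on the assumed learning rate and number of epochs.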

The result of the neural network.

Training result (see Fig. 2): the graph shows how the training error changes over the training epochs. In this example, the neural network went through all 100 epochs, gradually reducing the error, and reached approximately 10^-2.35 (0.00455531).

Figure 2. The result of neural network training

Graph of the initial data and the function generated by the neural network (see Fig. 3): with the plot command given in the program code, the initial data are drawn as a line and the values generated by the neural network as circles. Further, using the obtained points, you can build a regression and obtain an approximation equation (see Figure 8). We used cubic regression, since its graph passes through the obtained points most accurately. The resulting equation is:

y = 0.049x^3 + 0.88x^2 - 0.006x + 2.1.

Thus, a neural network makes it possible to quickly find an approximating function knowing only the coordinates of the data points.

Figure 3. Graph of the initial data and the function formed by the neural network

Figure 4. Graph of the approximation function

The backpropagation algorithm is one of the methods for training multilayer feedforward neural networks, also called multilayer perceptrons. Multilayer perceptrons are successfully used to solve many complex problems.

Training by the error backpropagation algorithm involves two passes through all layers of the network: forward and backward. In the forward pass, the input vector is fed to the input layer of the neural network, after which it propagates through the network from layer to layer. As a result, a set of output signals is generated, which is the actual response of the network to the given input pattern. During the forward pass, all synaptic weights of the network are fixed. During the backward pass, all synaptic weights are adjusted in accordance with the error correction rule, namely: the actual network output is subtracted from the desired one, resulting in an error signal. This signal subsequently propagates through the network in the direction opposite to that of the synaptic connections, hence the name "backpropagation algorithm". Synaptic weights are tuned to make the network output as close as possible to the desired one.

Let's consider the operation of the algorithm in more detail. Let's say we need to train the following neural network using the backpropagation algorithm:

In the figure below, the following conventions are used:

As a rule, a sigmoidal activation function is used in multilayer perceptrons, in particular the logistic function:

The formula contains a slope parameter of the sigmoidal function. By changing this parameter, it is possible to construct functions with different steepness. Note that all subsequent reasoning uses exactly this logistic activation function, given by the formula above.

The sigmoid compresses the range of its argument so that the output value lies between zero and one. Multilayer neural networks have greater representational power than single-layer ones only in the presence of non-linearity, and the squashing function provides the required non-linearity. In fact, many other functions could be used: the backpropagation algorithm only requires that the function be differentiable everywhere, and the sigmoid satisfies this requirement. Its additional advantage is automatic gain control. For weak signals (i.e. when the input is close to zero) the input-output curve has a steep slope, giving a large gain. As the signal becomes larger, the gain drops. Thus, large signals are accepted by the network without saturation, and weak signals pass through the network without excessive attenuation.
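The behavior described above can be checked numerically. The sketch assumes the common form 1 / (1 + exp(-a*s)) for the logistic function; the names `logistic`, `gain`, `s` and the slope parameter `a` are chosen here for illustration.

```python
import math


def logistic(s, a=1.0):
    """Logistic activation; `a` is the slope parameter (assumed symbol)."""
    return 1.0 / (1.0 + math.exp(-a * s))


def gain(s, a=1.0):
    """Slope of the input-output curve: derivative of logistic w.r.t. s."""
    y = logistic(s, a)
    return a * y * (1.0 - y)


for s in (0.0, 1.0, 5.0):
    # The output stays between 0 and 1; the gain is largest at s = 0.
    print(s, logistic(s), gain(s))
```

At s = 0 the gain is 0.25 for a = 1 and falls below 0.01 at s = 5; doubling the slope parameter doubles the gain at zero.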

The goal of training a network with the error backpropagation algorithm is to adjust its weights so that applying a certain set of inputs produces the required set of outputs. For brevity, these sets of inputs and outputs will be called vectors. During training, it is assumed that for each input vector there is a paired target vector that specifies the required output. Together they are called a training pair. The network is trained on many pairs.

1. Initialize synaptic weights with small random values.
2. Choose the next training pair from the training set; feed the input vector to the input of the network.
3. Calculate the output of the network.
4. Calculate the difference between the network output and the desired output (training pair target vector).
5. Adjust network weights to minimize error (as shown below).
6. Repeat steps 2 to 5 for each vector of the training set until the error on the entire set reaches an acceptable level.
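The steps above can be sketched as a short training loop. Everything specific in it is an assumption made for illustration: a 2-2-1 network with logistic neurons (slope 1), bias weights, the logical OR function as the training set, the learning rate `eta`, and 0.01 as the "acceptable" error level.

```python
import math
import random

random.seed(1)


def f(s):
    return 1.0 / (1.0 + math.exp(-s))  # logistic activation, slope 1


# Hypothetical training set: logical OR of two inputs.
pairs = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
         ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

# Step 1: small random initial weights (the last entry of each row is a bias).
w_hid = [[random.uniform(-0.3, 0.3) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-0.3, 0.3) for _ in range(3)]
eta = 0.5


def forward(x):
    h = [f(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hid]
    y = f(w_out[0] * h[0] + w_out[1] * h[1] + w_out[2])
    return h, y


def total_error():
    return 0.5 * sum((t - forward(x)[1]) ** 2 for x, t in pairs)


error_before = total_error()
for epoch in range(5000):              # Step 6: repeat over the training set
    for x, t in pairs:                 # Step 2: take the next training pair
        h, y = forward(x)              # Step 3: compute the network output
        err = t - y                    # Step 4: desired minus actual output
        # Step 5: backward pass, deltas first, then weight corrections.
        d_out = err * y * (1.0 - y)
        d_hid = [d_out * w_out[j] * h[j] * (1.0 - h[j]) for j in range(2)]
        for j in range(2):
            w_out[j] += eta * d_out * h[j]
            w_hid[j][0] += eta * d_hid[j] * x[0]
            w_hid[j][1] += eta * d_hid[j] * x[1]
            w_hid[j][2] += eta * d_hid[j]
        w_out[2] += eta * d_out
    if total_error() < 0.01:           # acceptable error level (assumed)
        break
error_after = total_error()
print(error_before, error_after)
```

Note that the hidden-layer deltas are computed before the output weights are changed, so both layers are corrected using the weights that produced the current output.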

The operations performed in steps 2 and 3 are similar to those performed during the operation of an already trained network, i.e. an input vector is fed in and the resulting output is computed. The calculations are performed layer by layer. In Fig. 1, the outputs of the neurons of the first computing layer are calculated first (the input layer performs no calculations), then they are used as inputs to the next layer, whose neuron outputs form the output vector of the network. Steps 2 and 3 form what is known as the forward pass, as the signal travels through the network from input to output.

Steps 4 and 5 constitute the "back pass" where the computed error signal propagates back through the network and is used to adjust the weights.

Let's take a closer look at step 5 - adjusting the network weights. There are two cases to be highlighted here.

Case 1. Adjustment of output layer synaptic weights

For example, for the neural network model in Fig. 1, these are the weights of the connections entering the output layer. By convention, one index of a weight denotes the neuron from which the synaptic connection originates, and the other the neuron it enters:

Let us introduce a value equal to the difference between the required and actual outputs, multiplied by the derivative of the logistic activation function (see above for the formula of the logistic activation function):

Then, the weights of the output layer after correction will be equal to:

Here is an example of calculations for synaptic weight:
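With made-up numbers (none of them come from the source), the correction of one output-layer weight can be traced as follows. The sketch assumes a logistic neuron with slope 1, so its derivative is y * (1 - y), and the usual rule "new weight = old weight + learning rate * delta * output of the sending neuron"; all variable names are hypothetical.

```python
eta = 0.5    # learning rate (assumed)
out_p = 0.6  # output of the hidden neuron that sends the signal
y_q = 0.7    # actual output of the receiving output neuron
d_q = 1.0    # required (target) output of that neuron
w_pq = 0.4   # current value of the weight between them

# Difference between required and actual outputs times the logistic derivative.
delta_q = (d_q - y_q) * y_q * (1.0 - y_q)

# Corrected weight of the output layer.
w_pq_new = w_pq + eta * delta_q * out_p

print(delta_q)   # approximately 0.063
print(w_pq_new)  # approximately 0.4189
```

Because the actual output is below the required one, the delta is positive and the weight grows, which pushes the output of this neuron upward on the next forward pass.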

Case 2. Adjusting the Synaptic Weights of the Hidden Layer

For the neural network model in fig. 1, these will be the weights corresponding to the layers and . Let's decide that the index will denote the neuron from which the synaptic weight comes out, and the neuron into which it enters (pay attention to the appearance of a new variable ).