Riding the Error Waves: Using ODEs to Minimize Error in Artificial Neural Networks

In addition to CS 322, I’m enrolled in another course entitled “Modeling Cognition and Perception” (CogSt 416, Prof. Michael Spivey) in which we create artificial neural networks in MATLAB and train them in a variety of ways and on a variety of data sets. We predominantly study feed-forward neural networks, in which data progresses from an input layer through one or more “hidden layers” (within the network) to an output layer. Weight matrices represent the strengths of the connections between the nodes of adjacent layers, and so determine the pattern of activation that a given input produces across the network.
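
To make the layer-by-layer picture concrete, here is a minimal Python sketch of a forward pass (our class uses MATLAB, so this is only an illustration, not our lab code). The sigmoid activation, the bias terms, the function names, and the hand-picked 2-2-1 weights are all my own assumptions; the weights happen to compute XOR, the problem that comes up later in this post.

```python
import math

def sigmoid(z):
    """Logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(weights, biases, activations):
    """Propagate activations through one layer.

    Each row of `weights` holds the incoming connection weights of one node.
    """
    return [
        sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(w_hidden, b_hidden, x)
    return layer(w_out, b_out, hidden)

# Hand-picked (hypothetical) weights for a 2-2-1 network.
W_HIDDEN = [[20.0, 20.0], [-20.0, -20.0]]  # hidden node 1 acts like OR, node 2 like NAND
B_HIDDEN = [-10.0, 30.0]
W_OUT = [[20.0, 20.0]]                     # output node acts like AND of the two
B_OUT = [-30.0]

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    y = forward(x, W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)[0]
    print(x, round(y, 3))
```

Training is the business of finding weights like these automatically instead of picking them by hand.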

There are multiple ways of training a neural network. One of the most common techniques is supervised learning, in which the network compares its own output with a “target” output supplied by an external source on each step, and makes appropriate changes to its weight matrices based on its error with respect to that target. Another common tactic is unsupervised learning, in which the network learns what input-output patterns to produce simply by “listening” to acceptable sequences of inputs. Because the network in an unsupervised learning context has to derive the rules of the input-output pattern purely through exposure to correct sequences, it takes much longer to learn those rules exhaustively than it does in supervised learning (where it can compare its guess to the “right answer” every time).
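
As a rough illustration of the supervised case, the sketch below applies the classic delta rule to a single linear unit: compute the output, compare it with the target, and nudge each weight in proportion to the error. The function names, the learning rate, and the three training samples (generated from the made-up rule target = 2*x0 - x1) are assumptions for illustration only.

```python
def delta_rule_step(weights, inputs, target, rate=0.2):
    """One supervised update for a single linear unit."""
    output = sum(w * a for w, a in zip(weights, inputs))
    error = target - output  # how far the guess is from the "right answer"
    new_weights = [w + rate * error * a for w, a in zip(weights, inputs)]
    return new_weights, error

# Hypothetical training set: targets follow target = 2*x0 - x1.
SAMPLES = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]

def train(epochs=200):
    """Sweep over the samples repeatedly, updating after every example."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for inputs, target in SAMPLES:
            weights, _err = delta_rule_step(weights, inputs, target)
    return weights

print(train())  # the weights approach [2.0, -1.0]
```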

The bottom line of all this is that artificial neural networks learn by calculating their own error at each step of the learning process. In essence, we can think of learning as a global optimization problem: the error surface has both local and global minima, and we want to drive the error term of the network’s output layer to a low point as quickly as possible. This is where some of our CS 322 math comes into play: we can treat the weights as a point moving over the error surface and use ordinary differential equations (ODEs) to model its trajectory. Minimizing the error (i.e., maximizing the speed and accuracy of the neural network’s learning process) then amounts to solving the differential equation ẋ = -∇E(x), where x is the vector of network weights and ∇E is the gradient of the network’s error function [1].
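
The simplest way to follow the trajectory of ẋ = -∇E(x) numerically is a forward Euler step, x_{n+1} = x_n - h∇E(x_n), which is just ordinary gradient descent with step size h. The sketch below tries this on a made-up two-weight quadratic error function; the error function, step size, and names are assumptions chosen for illustration, not anything taken from [1].

```python
def error(x):
    """Toy error surface: a bowl that is steeper along the second weight."""
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad_error(x):
    """Gradient of the toy error function above."""
    return [x[0], 10.0 * x[1]]

def euler_step(x, h):
    """One forward Euler step along x' = -grad E(x)."""
    g = grad_error(x)
    return [xi - h * gi for xi, gi in zip(x, g)]

def descend(x0, h=0.05, steps=100):
    """Follow the trajectory for a fixed number of Euler steps."""
    x = list(x0)
    for _ in range(steps):
        x = euler_step(x, h)
    return x

start = [1.0, 1.0]
end = descend(start)
print(error(start), error(end))  # the error shrinks toward the minimum at the origin
```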

A common way of solving such an equation numerically is, of course, one of the Runge-Kutta integration methods, most frequently the fourth-order one. The scheme considered here is a simpler two-stage one, in which we have

x_{n+1} = x_n + α_1 k_1 + α_2 k_2,
k_1 = -h ∇E(x_n),
k_2 = -h ∇E(x_n + β_2 k_1),

where h is the step size. In their paper, “Trajectory Methods for Neural Network Training,” Petalas et al. state that RK3 (in which α_1 = 1/4, α_2 = 3/4, and β_2 = 2/3) was optimal since it minimizes the truncation error among all the second-order RK methods of this form. (Second-order accuracy requires α_1 + α_2 = 1 and α_2 β_2 = 1/2, so β_2 = 2/3 goes with α_2 = 3/4.)
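
Here is a minimal sketch of one step of this two-stage scheme, with the minus sign from ẋ = -∇E(x) folded into k_1 and k_2. It uses the coefficients α_1 = 1/4, α_2 = 3/4, β_2 = 2/3, which satisfy the second-order conditions α_1 + α_2 = 1 and α_2 β_2 = 1/2. The toy quadratic error function and all names are my own assumptions; a real network would supply its back-propagated gradient in place of grad_error.

```python
def grad_error(x):
    """Gradient of the toy error E(x) = 0.5 * (x0^2 + 10 * x1^2)."""
    return [x[0], 10.0 * x[1]]

def rk2_step(x, h, a1=0.25, a2=0.75, b2=2.0 / 3.0):
    """One two-stage Runge-Kutta step along x' = -grad E(x)."""
    # Stage 1: slope at the current weights.
    k1 = [-h * g for g in grad_error(x)]
    # Stage 2: slope at a trial point part of the way along k1.
    x_trial = [xi + b2 * k for xi, k in zip(x, k1)]
    k2 = [-h * g for g in grad_error(x_trial)]
    # Blend the two slopes into the actual weight change.
    return [xi + a1 * p + a2 * q for xi, p, q in zip(x, k1, k2)]

x = [1.0, 1.0]
for _ in range(50):
    x = rk2_step(x, 0.1)
print(x)  # both weights shrink toward the minimum at the origin
```

On this quadratic, a single step multiplies a weight with curvature c by 1 - hc + (hc)^2/2, which matches the exact decay exp(-hc) through second order in h.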

The team then tested the math on a few classical neural network learning experiments, one of which is known as the “classical XOR problem” (characterized by many local minima, and a case which I too considered in my CogSt 416 lab). After performing 100 simulations, Petalas et al. compared the ODE-solving method of error minimization to other common learning algorithms, including the popular back-propagation technique, a backprop scheme with a momentum term (which helps the trajectory escape from shallow local minima so that it doesn’t “get stuck”), and an “adaptive back-propagation” method.

Their results showed that the Runge-Kutta approach did indeed require fewer training epochs than the back-propagation methods. For example, on the XOR problem, traditional backprop required an average of over 1500 training epochs before reaching the target mean squared error value, while RK3 took only 163 on average! That is roughly a ninefold reduction in epochs, a remarkable improvement over back-propagation. Plain back-propagation changes the weights in proportion to the error gradient at the current point, which amounts to taking a single first-order (Euler) step along the trajectory; the Runge-Kutta scheme also evaluates the gradient at a second point on each step, so it accounts better for the nonlinearity of the error function and can minimize the error term in far fewer steps.

Despite this striking evidence in favor of the ODE approach, Professor Spivey still teaches us all how to make weight changes with the back-propagation technique. However, if he were to change the class to use ODEs to represent error functions in his neural networks, then CS 322 might just become a prerequisite….

————

SOURCES:

[1] Petalas et al. Trajectory Methods for Neural Network Training. http://www.math.upatras.gr/~dtas/papers/PetalasTV2004.pdf.

[2] Lagaris et al. Artificial Neural Networks for Solving Ordinary and Partial Differential Equations. http://www.math.upatras.gr/~dtas/papers/PetalasTV2004.pdf.

[3] Wang, Yi-Jen and Lin, Chin-Teng. Runge Kutta Neural Network for Identification of Continuous Systems. http://ieeexplore.ieee.org/iel4/5875/15672/00726509.pdf?isnumber=15672&prod=CNF&arnumber=726509&arSt=3277&ared=3282+vol.4&arAuthor=Yi-Jen+Wang%3B+Chin-Teng+Lin.

[4] Shang, Yi and Wah, Benjamin. A Global Optimization Method for Neural Network Training. http://citeseer.ist.psu.edu/cache/papers/cs/7466/http:zSzzSzpc91089.cse.cuhk.edu.hkzSzWahzSzpaperszSzDirszSzC104zSzC104.pdf/shang96global.pdf.
